Cassandra for Time Series Data - Joris Gillis, June 28, 2017 - PowerPoint PPT Presentation

SLIDE 1
  • Cassandra for Time Series Data

Joris Gillis, June 28, 2017

SLIDE 2

I am a software engineer at TrendMiner and focus on the enterprise scalability of our industrial analytics platform. I studied at the University of Hasselt and the University of Antwerp in the field of Database theory and Data Mining. My interests are:

  • Big Data technology
  • Functional Programming
  • Athletics

Joris Gillis

SLIDE 3

Agenda

  • 1. Introduction
  • 2. Why Cassandra?
  • 3. How to model time series in Cassandra?
  • 4. How to configure Cassandra for Time Series Data?
  • 5. Q&A

SLIDE 4

About TrendMiner

TrendMiner is the leading modelling-free industrial analytics platform to analyze, monitor and predict asset and process performance. It has a proven track record in the (petro-)chemical and oil & gas industries of increasing overall profitability by improving production yield, lowering costs, avoiding unplanned process downtime, increasing overall equipment efficiency and reducing safety risks.

SLIDE 5

About TrendMiner

About our company

  • Started in 2008 as a spin-off of K.U.Leuven
  • More than 70 man-years of research behind TrendMiner
  • Spin-out idea from Bayer MaterialScience (Covestro)
  • Several core technologies patented in the US and EU
  • EMEA headquarters: Hasselt, Belgium
  • US headquarters: Houston, TX
  • 60+ employees and growing
  • Global OSIsoft PI ISV & OEM partner
  • Platform-agnostic vendor
  • Front runner in both Process & Asset Analytics

SLIDE 6

Industry 4.0

CONNECTIVITY, ADVANCED MANUFACTURING, BIG DATA & ANALYTICS

  • Internet of Things
  • Wearables
  • Augmented Reality
  • Optimization & Prediction
  • Machine Learning
  • Cyber Security
  • Additive Manufacturing
  • Advanced Materials
  • Autonomous Robotics

Technologies that enable new ways of working and of doing business

SLIDE 7

About TrendMiner

About our software

SLIDE 8

About TrendMiner

Analyze

SLIDE 9

About TrendMiner

Monitor

SLIDE 10

About TrendMiner

Predict

SLIDE 11

Problem statement

Time series

  • From thousands to millions
  • Resolution between 5 minutes and 1 second
  • E.g., 10 year history

Complex analyses

Plants across the globe

SLIDE 12

What is a Time Series?

  • A time series is a series of timestamped data points
  • Sometimes data points are spaced equidistantly
  • List<Tuple<Long, Float>>?

SLIDE 13

Agenda

  • 1. Introduction
  • 2. Why Cassandra?
  • 3. How to model time series in Cassandra?
  • 4. How to configure Cassandra for Time Series Data?
  • 5. Q&A

SLIDE 14

Why Cassandra

  • In-store analytics too limited for our needs
  • Only HTTP interface
  • Big Data => Big Index
  • Horizontal scaling
  • New technology

SLIDE 15

Why Cassandra

  • ADVANTAGES
  • Proven technology
  • Connectivity
  • JDBC connector (also to Spark)
  • Edge locations
  • Geographic distribution of data
  • Support for storing and querying time series data
  • E.g., KairosDB uses Cassandra as underlying store
  • DISADVANTAGES
  • Overhead vs custom optimised format
  • No time series specific optimisations
  • E.g., Gorilla/Beringei
  • Delta-delta encoding for semi-equidistant points
  • Delta encoding for stable values
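
The delta-delta idea mentioned for Gorilla/Beringei can be illustrated with a small sketch. It is not Gorilla's actual bit-level format, and the function names are made up for illustration. It only shows why semi-equidistant timestamps compress well: most second-order differences are zero or very small.

```python
def delta_of_delta_encode(timestamps):
    """Return [first, first_delta, dod_1, dod_2, ...] for a list of ints."""
    if len(timestamps) < 2:
        return list(timestamps)
    encoded = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = encoded[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        encoded.append(delta - prev_delta)  # difference between consecutive deltas
        prev_delta = delta
    return encoded


def delta_of_delta_decode(encoded):
    """Invert delta_of_delta_encode."""
    if len(encoded) < 2:
        return list(encoded)
    timestamps = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps


# Roughly one point per 60 seconds, with a little jitter.
ts = [1000, 1060, 1120, 1181, 1240]
enc = delta_of_delta_encode(ts)  # [1000, 60, 0, 1, -2]
```

A real encoder would then store the small values in a variable number of bits, which is where the space saving comes from.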
SLIDE 16

Agenda

  • 1. Introduction
  • 2. Why Cassandra?
  • 3. How to model time series in Cassandra?
  • 4. How to configure Cassandra for Time Series Data?
  • 5. Q&A

SLIDE 17

How to model time series in Cassandra?

Keys

  • Primary key
  • One or more columns identifying a row
  • PRIMARY KEY (A)
  • PRIMARY KEY (A, B)
  • Compound primary key
  • Partition key
  • First column(s) of primary key
  • E.g., PRIMARY KEY ((A, B), C)
  • A & B form the composite partition key

SLIDE 18

How to model time series in Cassandra?

Partitioning & Clustering

  • A partition is mapped to a Cassandra node
  • All rows with same partition key on same node(s)
  • Clustering columns
  • Part of compound primary key
  • Define sorting inside partition

Map<byte[], SortedMap<Clustering, Row>>

  • byte[]: Partition Key
  • Clustering: Clustering columns
  • Row: Other columns
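
The nested-map view of a table can be sketched in a few lines of Python. This is a toy model for intuition only, not how Cassandra stores data; the class and method names are hypothetical.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        # partition key -> sorted list of (clustering, row)
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering, row):
        rows = self._partitions[partition_key]
        keys = [c for c, _ in rows]
        i = bisect.bisect_left(keys, clustering)
        if i < len(rows) and rows[i][0] == clustering:
            rows[i] = (clustering, row)  # same primary key: overwrite
        else:
            rows.insert(i, (clustering, row))

    def slice(self, partition_key, start, end):
        """Rows with start < clustering < end, in clustering order."""
        rows = self._partitions.get(partition_key, [])
        keys = [c for c, _ in rows]
        lo = bisect.bisect_right(keys, start)
        hi = bisect.bisect_left(keys, end)
        return rows[lo:hi]


table = ToyTable()
table.insert("station-1", 30, {"temperature": 21.5})
table.insert("station-1", 10, {"temperature": 20.0})
table.insert("station-1", 20, {"temperature": 20.7})
table.insert("station-2", 10, {"temperature": 5.0})
middle = table.slice("station-1", 10, 30)  # [(20, {'temperature': 20.7})]
```

Because rows inside a partition are sorted by the clustering key, a range query on that key is a contiguous slice of a single partition.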

SLIDE 19

How to model time series in Cassandra?

Modelling: Simple

CREATE TABLE temperature (
    weatherstation_id uuid,
    event_time timestamp,
    temperature float,
    PRIMARY KEY (weatherstation_id, event_time)
);

https://academy.datastax.com/resources/getting-started-time-series-data-modeling

SLIDE 20

How to model time series in Cassandra?

Modelling: Simple

  • Advantage
  • Easy to understand
  • Simple to query
  • Disadvantage
  • All data for one time series in one partition
  • Max 2 billion cells (rows x columns) per partition

SELECT temperature
FROM temperature
WHERE weatherstation_id = 123e4567-e89b-12d3-a456-426614174000
  AND event_time > '2013-04-03 07:01:00'
  AND event_time < '2013-04-03 07:04:00';

SLIDE 21

How to model time series in Cassandra?

Modelling: Partitioned

CREATE TABLE temperature_by_day (
    weatherstation_id uuid,
    day date,
    event_time timestamp,
    temperature float,
    PRIMARY KEY ((weatherstation_id, day), event_time)
);

https://academy.datastax.com/resources/getting-started-time-series-data-modeling

SLIDE 22

How to model time series in Cassandra?

Modelling: Partitioned

  • Advantage
  • Virtually no storage limitation
  • Disadvantage
  • Crossing bucket boundary => multiple queries
  • Need to specify id and day; otherwise unpredictable performance
  • If data comes in bursts => uneven partition sizes
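
The "multiple queries" point can be made concrete with a small helper that splits a requested time range into one sub-range per day bucket. This is a sketch under assumptions: the function name is hypothetical, ranges are treated as half-open, and timestamps are naive datetimes in the same time zone that defines the day buckets.

```python
from datetime import datetime, timedelta


def day_bucket_ranges(start, end):
    """Split [start, end) into (day, range_start, range_end), one per day bucket."""
    ranges = []
    cursor = start
    while cursor < end:
        next_midnight = datetime.combine(
            cursor.date() + timedelta(days=1), datetime.min.time()
        )
        upper = min(next_midnight, end)
        ranges.append((cursor.date(), cursor, upper))
        cursor = upper
    return ranges


# A range that touches three day buckets needs three queries.
buckets = day_bucket_ranges(
    datetime(2013, 4, 3, 22, 0), datetime(2013, 4, 5, 1, 0)
)
```

Each tuple would correspond to one query against temperature_by_day with the partition key fixed to (weatherstation_id, day) and a range restriction on event_time; the client then concatenates the results.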

SLIDE 23

Agenda

  • 1. Introduction
  • 2. Why Cassandra?
  • 3. How to model time series in Cassandra?
  • 4. How to configure Cassandra for Time Series Data?
  • 5. Q&A

SLIDE 24

How to configure Cassandra for Time Series Data?

  • Cassandra v3
  • Storage engine refactored compared to v2
  • Options to influence read and write performance
  • Compression
  • Compaction

SLIDE 25

How to configure Cassandra for Time Series Data?

How Cassandra Writes Data

[Figure: Cassandra write path. Written rows (row1, row2, row3) are appended to the commit log on disk and stored in the memtable in memory; the memtable is flushed to SSTables on disk (rows plus index), which are later merged by compaction. Source: http://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html]

  • Commit log
  • Durability
  • Memtable
  • Cache writes in memory
  • Regularly flushed to disk
  • Sorted Strings Table (SSTable)
  • Compaction
  • Re-organise
  • Cleanup

SLIDE 26

How to configure Cassandra for Time Series Data?

How Cassandra Maintains Data

  • SSTable = immutable
  • Updates => Timestamped version of row
  • Deletes => Tombstones
  • Clean up old versions and tombstones
  • Compaction
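
How compaction cleans up old versions and tombstones can be sketched as a last-write-wins merge. This is a simplified model with hypothetical names: each SSTable is a plain dict, and tombstones are dropped immediately, whereas Cassandra keeps them until the table's gc_grace_seconds has passed.

```python
TOMBSTONE = object()  # marker for a deleted row


def compact(sstables):
    """Merge SSTables: keep the newest version per key and drop tombstones.

    Each SSTable is a dict: key -> (write_timestamp, value_or_TOMBSTONE).
    The inputs are never modified, mirroring SSTable immutability.
    """
    newest = {}
    for sstable in sstables:
        for key, (ts, value) in sstable.items():
            if key not in newest or ts > newest[key][0]:
                newest[key] = (ts, value)
    return {k: v for k, v in newest.items() if v[1] is not TOMBSTONE}


older = {"r1": (1, "a"), "r2": (1, "b")}
newer = {"r1": (2, "a2"), "r2": (3, TOMBSTONE), "r3": (2, "c")}
merged = compact([older, newer])  # r1 updated, r2 deleted, r3 added
```

Reads resolve versions the same way, which is why many overlapping SSTables make reads more expensive until compaction has run.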

SLIDE 27

How to configure Cassandra for Time Series Data?

How Cassandra Reads Data (Simplified)

  • How to get relevant data
  • SSTables
  • Partition key => node(s)
  • Bloom filter
  • Returns list of SSTables that might contain rows for query
  • Partition index on SSTable
  • Keeps offset for each partition key
  • Memtable
  • Extract qualifying rows
  • Resolve
  • Timestamped rows
  • Tombstones

[Figure: the partition key is checked against the Bloom filter to select the SSTables (rows plus index) that may contain it.]
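
The Bloom filter step can be illustrated with a minimal filter. This is a sketch, not Cassandra's implementation: the hash function (SHA-256 with a seed prefix), the sizes and all names are assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def candidate_sstables(sstables, partition_key):
    """sstables: list of (name, BloomFilter). Returns the names worth reading."""
    return [name for name, bf in sstables if bf.might_contain(partition_key)]


with_key = BloomFilter()
with_key.add("station-1")
empty = BloomFilter()
to_read = candidate_sstables([("sstable-a", with_key), ("sstable-b", empty)], "station-1")
```

A filter never yields a false negative, so skipping an SSTable is always safe; a false positive only costs an unnecessary lookup in that SSTable's partition index.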

SLIDE 28

How to configure Cassandra for Time Series Data?

Compaction options

  • Size tiered compaction (default)
  • Levelled compaction
  • Time window compaction

SLIDE 29

How to configure Cassandra for Time Series Data?

Size tiered Compaction

  • Default strategy
  • Optimised for write heavy workloads
  • Compaction
  • Triggered when enough similarly sized SSTables exist (min_threshold, default: 4)
  • Merges them into one new file
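
The "similarly sized" rule can be sketched as a bucketing step. The parameter names mirror the bucket_low, bucket_high and min_threshold compaction options, but the function itself is a simplified, hypothetical model (it ignores min_sstable_size and max_threshold, for example).

```python
def size_tiered_buckets(sizes_mb, bucket_low=0.5, bucket_high=1.5, min_threshold=4):
    """Group SSTable sizes into buckets of similar size.

    A size joins a bucket when it lies within [low, high] times the bucket's
    current average. Only buckets with at least min_threshold members are
    returned, since those are the ones ready to be compacted.
    """
    buckets = []
    for size in sorted(sizes_mb):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if avg * bucket_low <= size <= avg * bucket_high:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return [b for b in buckets if len(b) >= min_threshold]


ready = size_tiered_buckets([155, 165, 155, 150])  # one bucket of four
not_ready = size_tiered_buckets([600, 155, 165])   # no bucket reaches four
```

After the four small tables are merged into one large table, that table sits alone in its size tier and is not touched again until enough tables of comparable size have accumulated.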

SLIDE 30

How to configure Cassandra for Time Series Data?

Size tiered Compaction Example

T1 155MB, T2 165MB, T3 155MB, T4 150MB (total size: 625MB)

Compaction =>

T5 600MB

SLIDE 31

How to configure Cassandra for Time Series Data?

Size tiered Compaction Example

T5 600MB T6 155MB T7 165MB

SLIDE 32

How to configure Cassandra for Time Series Data?

Size tiered Compaction

  • Advantage
  • Write optimised
  • Disadvantage
  • Rows of a partition are spread across multiple SSTables
  • Holds on to stale data for a long time
  • A lot of memory needed as SSTables grow in size

SLIDE 33

How to configure Cassandra for Time Series Data?

Levelled Compaction

  • L(0)
  • Flushes from memtable
  • L(N > 0)
  • Fixed size SSTables (default: 160MB)
  • Each SSTable has range of partitions => NO OVERLAP!
  • L(1) holds at most 10 SSTables
  • L(N+1) can hold 10x more SSTables than L(N)

SLIDE 34

How to configure Cassandra for Time Series Data?

Levelled Compaction Example

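Levels: L(0), L(1), L(2). 2 rows per SSTable, 2 files in L(1), 20 files in L(2).

[Figure: the L(0) SSTables (r1, r2) and (r3, r4) are compacted into the L(1) SSTables (r3, r1) and (r4, r2), which are ordered by partition and do not overlap.]

Row => partition: r1 #0, r2 #2, r3 #-1, r4 #1, r5 #-2, r6 #6, r7 #3, r8 #6, r9 #-1, r10 #-2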

SLIDE 35

How to configure Cassandra for Time Series Data?

Example ctd.

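Levels: L(0), L(1), L(2). 2 rows per SSTable, 2 files in L(1), 20 files in L(2).

[Figure: the new L(0) SSTables (r5, r6) and (r7, r8) are compacted together with the L(1) SSTables (r3, r1) and (r4, r2), yielding four L(1) SSTables: (r5, r3), (r1, r4), (r2, r7) and (r6, r8).]

Row => partition: r1 #0, r2 #2, r3 #-1, r4 #1, r5 #-2, r6 #6, r7 #3, r8 #6, r9 #-1, r10 #-2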

SLIDE 36

How to configure Cassandra for Time Series Data?

Example ctd.

Levels: L(0), L(1), L(2). 2 rows per SSTable, 2 files in L(1), 20 files in L(2).

[Figure: L(1) now holds more than 2 files, so compaction moves (r2, r7) and (r6, r8) to L(2); (r5, r3) and (r1, r4) stay in L(1).]

Row => partition: r1 #0, r2 #2, r3 #-1, r4 #1, r5 #-2, r6 #6, r7 #3, r8 #6, r9 #-1, r10 #-2

SLIDE 37

How to configure Cassandra for Time Series Data?

Example ctd.

Levels: L(0), L(1), L(2). 2 rows per SSTable, 2 files in L(1), 20 files in L(2).

[Figure: a new L(0) SSTable (r9, r10) is flushed while L(1) holds (r5, r3) and (r1, r4) and L(2) holds (r2, r7) and (r6, r8); compaction reorders its rows by partition as (r10, r9).]

Row => partition: r1 #0, r2 #2, r3 #-1, r4 #1, r5 #-2, r6 #6, r7 #3, r8 #6, r9 #-1, r10 #-2

SLIDE 38

How to configure Cassandra for Time Series Data?

Levelled Compaction Configuration Trade-off

  • sstable_size_in_mb (default: 160MB)
  • From L1 on, all SSTables are of this size
  • One partition must fit in one SSTable!
  • Bigger SSTable size:
  • Fewer levels => Faster reads
  • Smaller SSTable size:
  • Faster compaction
  • Write amplification

Level   Max data size
L(1)    1.6GB
L(2)    16GB
L(3)    160GB
L(4)    1.6TB
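
The level sizes follow from the SSTable size and the factor 10 between levels. The following sketch reproduces the arithmetic; the function names are hypothetical and 1GB is taken as 1000MB.

```python
def level_capacity_mb(level, sstable_size_mb=160, fanout=10):
    """Maximum data size of L(level) for level >= 1: fanout**level SSTables."""
    if level < 1:
        raise ValueError("L(0) has no fixed capacity in this sketch")
    return sstable_size_mb * fanout ** level


def levels_needed(total_data_mb, sstable_size_mb=160, fanout=10):
    """Smallest N such that L(1)..L(N) together can hold total_data_mb."""
    level, capacity = 0, 0
    while capacity < total_data_mb:
        level += 1
        capacity += level_capacity_mb(level, sstable_size_mb, fanout)
    return level


l1 = level_capacity_mb(1)  # 1600 MB = 1.6GB
l4 = level_capacity_mb(4)  # 1600000 MB = 1.6TB
default_levels = levels_needed(300_000)                    # 300GB at 160MB per SSTable
bigger_levels = levels_needed(300_000, sstable_size_mb=320)  # same data, 320MB per SSTable
```

With these numbers, 300GB of data needs four levels at 160MB per SSTable but only three at 320MB, which is the "bigger SSTable size => fewer levels" side of the trade-off.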

SLIDE 39

How to configure Cassandra for Time Series Data?

Levelled Compaction Conclusion

  • Advantages
  • No overlap => at most 1 SSTable per level for a partition => consistent read performance

  • Quicker removal of tombstones
  • Disadvantages
  • Max SSTable size must be bigger than biggest partition
  • More CPU and IO used during compaction (write amplification)
  • When to use
  • Limited/stable number of partitions
  • Compaction speed > write speed

SLIDE 40

How to configure Cassandra for Time Series Data?

Time Window Compaction

  • Fixed time window
  • Compaction at end of time window
  • Older windows never compacted again!
  • Ideal for immutable, in order, fixed rate time series
  • No compaction needed
  • ± same sized SSTables
  • Downside
  • Disk usage only grows
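
The windowing idea can be sketched by assigning each SSTable to a fixed time window. The function name and input format are hypothetical, and the sketch assumes an SSTable is assigned by its newest write timestamp.

```python
from collections import defaultdict


def group_by_time_window(sstables, window_seconds=86_400):
    """Group SSTables by the time window of their newest write timestamp.

    sstables: list of (name, max_timestamp_seconds).
    Returns {window_start_seconds: [names]}.
    """
    windows = defaultdict(list)
    for name, max_ts in sstables:
        window_start = max_ts - (max_ts % window_seconds)
        windows[window_start].append(name)
    return dict(windows)


groups = group_by_time_window(
    [("a", 10), ("b", 86_399), ("c", 86_400), ("d", 200_000)]
)
```

SSTables in the same window are compacted together; once a window is closed, its SSTable is left alone, which works well only when old data is not updated or written out of order.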

SLIDE 41

How to configure Cassandra for Time Series Data?

Compaction Comparison

  • Size tiered compaction
  • Heavy write load
  • Levelled compaction
  • Read load
  • Time window compaction
  • Immutable, equally spaced time series
  • Configurable per table!
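
Since the strategy is set per table, switching it is a single CQL statement per table. The helper below only builds such a statement as a string; the helper name is hypothetical, the table name is borrowed from the modelling slides, and quoting every option value as a string is an assumption of this sketch.

```python
def alter_compaction_cql(table, strategy, **options):
    """Build an ALTER TABLE statement that sets the compaction strategy."""
    settings = {"class": strategy, **options}
    body = ", ".join(f"'{key}': '{value}'" for key, value in settings.items())
    return f"ALTER TABLE {table} WITH compaction = {{{body}}};"


twcs = alter_compaction_cql(
    "temperature_by_day",
    "TimeWindowCompactionStrategy",
    compaction_window_unit="DAYS",
    compaction_window_size=1,
)
lcs = alter_compaction_cql(
    "temperature_by_day", "LeveledCompactionStrategy", sstable_size_in_mb=160
)
```

The resulting strings could be executed through cqlsh or a driver session; which options are sensible depends on the workload characteristics listed above.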

SLIDE 42

How to configure Cassandra for Time Series Data?

Conclusion

  • Cassandra is a mature and powerful technology
  • Devil is in the details
  • Key: understanding of storage engine
  • Modelling
  • Start from queries
  • Configuration
  • Compaction? Compression? Caching? Replication? Heap size?

SLIDE 43

How to configure Cassandra for Time Series Data?

References

  • DataStax
  • Docs: http://docs.datastax.com/en/landing_page/doc/index.html
  • Academy: https://academy.datastax.com/
  • Outworkers three part blog post on Cassandra:
  • http://outworkers.com/blog/post/a-series-on-cassandra-part-1-getting-rid-of-the-sql-mentality
  • http://outworkers.com/blog/post/a-series-on-cassandra-part-2-indexes-and-keys
  • http://outworkers.com/blog/post/a-series-on-cassandra-part-3-advanced-features
  • TimeScaleDB: https://blog.timescale.com/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2
  • Beringei: https://github.com/facebookincubator/beringei
  • Gorilla In-memory Time Series store: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
  • Time Series overview: https://docs.google.com/spreadsheets/d/1sMQe9oOKhMhIVw9WmuCEWdPtAoccJ4a-IuZv4fXDHxM/edit#gid=0
  • LevelledCompactionStrategy: https://www.slideshare.net/DataStax/the-missing-manual-for-leveled-compaction-strategy-wei-deng-datastax-cassandra-summit-2016

SLIDE 44

Questions & Answers

SLIDE 45

Thank you for attending! For more information, contact joris.gillis@trendminer.com or visit our website: www.trendminer.com