Cassandra for Time Series Data - Joris Gillis, June 28, 2017

Joris Gillis
I am a software engineer at TrendMiner and focus on the enterprise scalability of our industrial analytics platform. I studied database theory and data mining at the University of Hasselt and the University of Antwerp. My interests are:
• Big Data technology
• Functional Programming
• Athletics

Agenda
1. Introduction
2. Why Cassandra?
3. How to model time series in Cassandra?
4. How to configure Cassandra for Time Series Data?
5. Q&A

About TrendMiner
TrendMiner is the leading modelling-free industrial analytics platform to analyze, monitor and predict asset and process performance. It has a proven track record in the (petro)chemical and oil & gas industries of increasing overall profitability by improving production yield, lowering costs, avoiding unplanned process downtime, increasing overall equipment efficiency and reducing safety risks.

About TrendMiner: About our company
• Started in 2008 as a spin-off of K.U.Leuven
• More than 70 man-years of research behind TrendMiner
• Spin-out idea from Bayer MaterialScience (Covestro)
• Patented several core technologies for US/EU
• Headquarters EMEA: Hasselt, Belgium
• Headquarters US: Houston, TX
• 60+ employees and growing
• Global OSIsoft PI ISV & OEM partner
• Platform-agnostic vendor
• Front runner in both Process & Asset Analytics

Industry 4.0
Technologies that enable new ways of working and of doing business
• CONNECTIVITY: Internet of Things, Augmented Reality, Wearables
• BIG DATA & ANALYTICS: Cyber Security, Machine Learning, Optimization & Prediction
• ADVANCED MANUFACTURING: Additive Manufacturing, Advanced Materials, Autonomous Robotics

About TrendMiner: About our software
• Analyze
• Monitor
• Predict

Problem statement
• Time series
  • From thousands to millions
  • Resolution between 5 minutes and 1 second
  • E.g., 10 year history
• Complex analyses
• Plants across the globe

What is a Time Series?
• A time series is a series of timestamped data points
• Sometimes data points are spaced equidistantly
• List<Tuple<Long, Float>>?
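The definition above can be sketched in a few lines of Python. This is only an illustration: the `TimeSeries` alias and the `is_equidistant` helper are hypothetical names, and timestamps are assumed to be epoch milliseconds.

```python
from typing import List, Tuple

# A time series as a list of (timestamp in epoch milliseconds, value) pairs,
# mirroring List<Tuple<Long, Float>> from the slide.
TimeSeries = List[Tuple[int, float]]


def is_equidistant(series: TimeSeries) -> bool:
    """Return True when all consecutive timestamps are the same distance apart."""
    gaps = {later[0] - earlier[0] for earlier, later in zip(series, series[1:])}
    return len(gaps) <= 1


regular = [(0, 20.5), (1000, 20.7), (2000, 20.6)]
irregular = [(0, 20.5), (1000, 20.7), (2500, 20.6)]
```

Here `regular` has a constant 1000 ms gap, while `irregular` does not.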

Why Cassandra
• New technology
• Horizontal scaling
• Big Data => Big Index
• In-store analytics too limited for our needs
• Only HTTP interface

Why Cassandra
• DISADVANTAGES
  • Overhead vs custom optimised format
  • No time series specific optimisations
    • E.g., Gorilla/Beringei
    • Delta-delta encoding for semi-equidistant points
    • Delta encoding for stable values
• ADVANTAGES
  • Proven technology
  • Connectivity
    • JDBC connector (also to Spark)
  • Edge locations
    • Geographic distribution of data
  • Support for storing and querying time series data
    • E.g., KairosDB uses Cassandra as underlying store
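To make the two encodings named above concrete, here is a minimal sketch of the arithmetic only. It is not the Gorilla/Beringei implementation (which packs the results into variable-length bit fields); the function names are hypothetical and timestamps are assumed to be integers.

```python
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """First value, then the difference between each value and its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_of_delta_encode(timestamps: List[int]) -> List[int]:
    """First timestamp, first delta, then the change in delta at every step."""
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    changes = [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0], deltas[0]] + changes


def delta_of_delta_decode(encoded: List[int]) -> List[int]:
    """Invert delta_of_delta_encode."""
    if len(encoded) < 2:
        return list(encoded)
    out = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for change in encoded[2:]:
        delta += change
        out.append(out[-1] + delta)
    return out


# Points arriving roughly every 60 seconds, with a little jitter.
timestamps = [1000, 1060, 1120, 1181, 1240]
encoded = delta_of_delta_encode(timestamps)
```

For semi-equidistant points most encoded entries are zero or close to it, and for stable values `delta_encode` likewise yields mostly zeros. Small numbers like these are what a compact bit-level format can exploit, and what a general-purpose store does not do out of the box.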

How to model time series in Cassandra? Keys
• Primary key
  • One or more columns identifying a row
  • PRIMARY KEY (A)
  • PRIMARY KEY (A, B)
    • Compound primary key
• Partition key
  • First column(s) of primary key
  • E.g., PRIMARY KEY ((A, B), C)
    • A & B are composite partition key

How to model time series in Cassandra? Partitioning & Clustering
• A partition is mapped to a Cassandra node
  • All rows with same partition key on same node(s)
• Clustering columns
  • Part of compound primary key
  • Define sorting inside partition
• Map<byte[], SortedMap<Clustering, Row>>
  • byte[] = partition key, Clustering = clustering columns, Row = other columns
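The map-of-sorted-maps view above can be mimicked in memory. The sketch below is a toy model, not how Cassandra stores data on disk; the `Table` class and its methods are hypothetical names chosen for illustration.

```python
import bisect
from collections import defaultdict


class Table:
    """Toy model of Map<partition key, SortedMap<clustering, row>>."""

    def __init__(self):
        # partition key -> list of (clustering value, row), sorted by clustering value
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering, row):
        rows = self._partitions[partition_key]
        keys = [c for c, _ in rows]
        i = bisect.bisect_left(keys, clustering)
        if i < len(rows) and rows[i][0] == clustering:
            rows[i] = (clustering, row)  # same primary key: overwrite the row
        else:
            rows.insert(i, (clustering, row))

    def range_query(self, partition_key, start, end):
        """Rows with start < clustering < end, returned in clustering order."""
        rows = self._partitions.get(partition_key, [])
        keys = [c for c, _ in rows]
        lo = bisect.bisect_right(keys, start)
        hi = bisect.bisect_left(keys, end)
        return [row for _, row in rows[lo:hi]]


table = Table()
table.insert("station-1", 30, {"temperature": 21.0})
table.insert("station-1", 10, {"temperature": 20.0})
table.insert("station-1", 20, {"temperature": 20.5})
table.insert("station-2", 20, {"temperature": 15.0})
```

Because rows inside a partition stay sorted by the clustering value, a time range query is a contiguous slice of a single partition.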

How to model time series in Cassandra? Modelling: Simple
  CREATE TABLE temperature (
    weatherstation_id uuid,
    event_time timestamp,
    temperature float,
    PRIMARY KEY (weatherstation_id, event_time)
  );
https://academy.datastax.com/resources/getting-started-time-series-data-modeling

How to model time series in Cassandra? Modelling: Simple
  SELECT temperature FROM temperature
  WHERE weatherstation_id='1234ABCD'
    AND event_time > '2013-04-03 07:01:00'
    AND event_time < '2013-04-03 07:04:00';
• Advantage
  • Easy to understand
  • Simple to query
• Disadvantage
  • All data for one time series in one partition
  • Max 2 billion cells per partition

How to model time series in Cassandra? Modelling: Partitioned
  CREATE TABLE temperature_by_day (
    weatherstation_id uuid,
    day date,
    event_time timestamp,
    temperature float,
    PRIMARY KEY ((weatherstation_id, day), event_time)
  );
https://academy.datastax.com/resources/getting-started-time-series-data-modeling

How to model time series in Cassandra? Modelling: Partitioned
• Advantage
  • Virtually no storage limitation
• Disadvantage
  • Crossing bucket boundary => multiple queries
  • Need to specify id and day; otherwise unpredictable performance
  • If data comes in bursts => uneven partition sizes
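The "multiple queries" cost can be illustrated by listing the day buckets a time range touches. This is a sketch under the assumption of one partition per weather station per calendar day, as in the temperature_by_day table; `day_buckets` is a hypothetical helper, not part of any Cassandra driver.

```python
from datetime import date, datetime, timedelta
from typing import List


def day_buckets(start: datetime, end: datetime) -> List[date]:
    """Every calendar day touched by the range [start, end]."""
    if end < start:
        raise ValueError("end must not be before start")
    days = []
    current = start.date()
    while current <= end.date():
        days.append(current)
        current += timedelta(days=1)
    return days


# A range of only 26 hours still touches three day partitions.
buckets = day_buckets(datetime(2013, 4, 3, 23, 0), datetime(2013, 4, 5, 1, 0))
```

A client would issue one query per returned day, each restricted to that day's partition, and concatenate the results in order.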

How to configure Cassandra for Time Series Data?
• Cassandra v3
  • Storage engine refactored compared to v2
• Options to influence read and write performance
  • Compression
  • Compaction

How to configure Cassandra for Time Series Data? How Cassandra Writes Data
• Commit log
  • Durability
• Memtable
  • Cache writes in memory
  • Regularly flushed to disk
• Sorted Strings Table (SSTable)
• Compaction
  • Re-organise
  • Cleanup
Diagram: a write goes to the commit log (on disk) and to the memtable (in memory); memtable flushes produce SSTables with their indexes on disk, which compaction later merges.
http://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html
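The write path above can be sketched as a toy in-memory model. Everything here is an assumption made for illustration: the `Node` class, the row-count flush trigger and the way the commit log is cleared are simplifications, not Cassandra's actual behaviour or API.

```python
class Node:
    """Toy write path: commit log for durability, memtable, flush to SSTables."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record of writes not yet flushed
        self.memtable = {}     # recent writes, cached in memory
        self.sstables = []     # immutable, sorted snapshots of flushed memtables
        self.memtable_limit = memtable_limit  # hypothetical flush trigger

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        # An SSTable is sorted by key and never modified afterwards.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed writes no longer need the log


node = Node(memtable_limit=2)
node.write("row2", 2.0)
node.write("row1", 1.0)   # reaches the limit and triggers a flush
node.write("row3", 3.0)   # stays in the memtable for now
```

After these writes the node holds one sorted SSTable with row1 and row2, while row3 exists only in the memtable and the commit log.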

How to configure Cassandra for Time Series Data? How Cassandra Maintains Data
• SSTable = immutable
  • Updates => Timestamped version of row
  • Deletes => Tombstones
• Clean up old versions and tombstones
  • Compaction

How to configure Cassandra for Time Series Data? How Cassandra Reads Data (Simplified)
• How to get relevant data
  • Partition key => node(s)
  • SSTables
    • Bloom filter
      • Returns list of SSTables that might contain rows for query
    • Partition index on SSTable
      • Keeps offset for each partition key
  • Memtable
• Extract qualifying rows
• Resolve
  • Timestamped rows
  • Tombstones
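The final "Resolve" step can be sketched as a last-write-wins merge. This is a simplified illustration with hypothetical names (`Cell`, `resolve`); it skips bloom filters and indexes, assumes each source maps a row key to a single value, and ignores how ties between equal write times are broken.

```python
from typing import Dict, List, NamedTuple, Optional


class Cell(NamedTuple):
    write_time: int         # write timestamp; larger means newer
    value: Optional[float]  # None marks a tombstone (a delete)


def resolve(sources: List[Dict[str, Cell]]) -> Dict[str, float]:
    """Merge the memtable and SSTables: the newest version wins, tombstones hide rows."""
    newest: Dict[str, Cell] = {}
    for source in sources:
        for key, cell in source.items():
            if key not in newest or cell.write_time > newest[key].write_time:
                newest[key] = cell
    return {key: cell.value for key, cell in newest.items() if cell.value is not None}


old_sstable = {"07:01": Cell(1, 72.0), "07:02": Cell(1, 73.0)}
new_sstable = {"07:01": Cell(2, 71.5), "07:02": Cell(3, None)}  # update, then delete
memtable = {"07:03": Cell(4, 74.0)}
result = resolve([old_sstable, new_sstable, memtable])
```

The row at 07:01 resolves to its newer version, the row at 07:02 is hidden by its tombstone, and the row at 07:03 comes straight from the memtable.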

How to configure Cassandra for Time Series Data? Compaction options
• Size tiered compaction (default)
• Levelled compaction
• Time window compaction

How to configure Cassandra for Time Series Data? Size tiered Compaction
• Default strategy
• Optimised for write heavy workloads
• Compaction
  • When enough similarly sized SSTables exist (min_threshold, default 4)
  • Merge into one new file

How to configure Cassandra for Time Series Data? Size tiered Compaction Example
• Before: four similarly sized SSTables T1 to T4 (150MB, 155MB, 155MB, 165MB; total size 625MB)
• Compaction merges them into one new SSTable T5 (600MB)

How to configure Cassandra for Time Series Data? Size tiered Compaction Example
• After: T5 (600MB) sits next to the new SSTables T6 (155MB) and T7 (165MB)

How to configure Cassandra for Time Series Data? Size tiered Compaction
• Advantage
  • Write optimised
• Disadvantage
  • Rows of a partition are spread across multiple SSTables
  • Holds on to stale data for a long time
  • A lot of memory needed as SSTables grow in size
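How "similarly sized" SSTables get grouped can be sketched as follows. This is a simplified model, not Cassandra's code: the function names are hypothetical, and the parameters borrow the names and default values of the size tiered options bucket_low (0.5), bucket_high (1.5) and min_threshold (4).

```python
from typing import List


def size_tiered_buckets(sizes_mb: List[float],
                        bucket_low: float = 0.5,
                        bucket_high: float = 1.5) -> List[List[float]]:
    """Group SSTable sizes into buckets of similar size."""
    buckets: List[List[float]] = []
    for size in sorted(sizes_mb):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            if bucket_low * average <= size <= bucket_high * average:
                bucket.append(size)
                break
        else:
            buckets.append([size])  # nothing similar yet: start a new bucket
    return buckets


def compaction_candidates(sizes_mb: List[float],
                          min_threshold: int = 4) -> List[List[float]]:
    """Buckets holding enough SSTables to be merged into one new file."""
    return [b for b in size_tiered_buckets(sizes_mb) if len(b) >= min_threshold]


before = compaction_candidates([150, 155, 155, 165])
after = compaction_candidates([600, 155, 165])
```

With the sizes from the example slides, the four SSTables of 150 to 165MB form one bucket and qualify for compaction. Afterwards the 600MB result is too large to share a bucket with the 155MB and 165MB SSTables, so nothing is merged until more similarly sized files accumulate. This is also why stale data can linger in large SSTables.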

How to configure Cassandra for Time Series Data? Levelled Compaction
• L(0)
  • Flushes from memtable
• L(N > 0)
  • Fixed size SSTables (default: 160MB)
  • Each SSTable has range of partitions => NO OVERLAP!
  • L(1) holds at most 10 SSTables
  • L(N+1) can hold 10x more SSTables than L(N)
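The level sizes follow directly from the two rules above. The short sketch below works out the arithmetic, assuming the slide's figures of 10 SSTables in L(1), a factor of 10 per level and 160MB per SSTable; the function names are hypothetical.

```python
def max_sstables(level: int, fanout: int = 10) -> int:
    """Maximum number of SSTables in L(level), for level >= 1."""
    if level < 1:
        raise ValueError("L(0) holds memtable flushes and has no fixed capacity")
    return fanout ** level


def max_level_size_mb(level: int, sstable_size_mb: int = 160, fanout: int = 10) -> int:
    """Maximum amount of data in L(level), in MB."""
    return max_sstables(level, fanout) * sstable_size_mb


capacities = {level: max_level_size_mb(level) for level in (1, 2, 3)}
```

Under these assumptions L(1) holds up to 1,600MB, L(2) up to 16,000MB and L(3) up to 160,000MB, so each extra level multiplies the capacity by ten.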

How to configure Cassandra for Time Series Data? Levelled Compaction Example
• Setup: 2 rows per SSTable, 2 files in L(1), 20 files in L(2)
• Rows and their partitions: r1 #0, r2 #2, r3 #-1, r4 #1, r5 #-2, r6 #6, r7 #3, r8 #6, r9 #-1, r10 #-2
• Step 1
  • L(0): flushed SSTables holding r1 to r4
  • Compaction into L(1), sorted by partition: [r3 #-1, r1 #0] [r4 #1, r2 #2]
• Step 2
  • L(0): flushed SSTables holding r5 to r8
  • Compaction merges them with L(1): [r5 #-2, r3 #-1] [r1 #0, r4 #1] [r2 #2, r7 #3] [r6 #6, r8 #6]
• Step 3
  • L(1) now exceeds its limit of 2 files
  • [r5 #-2, r3 #-1] and [r1 #0, r4 #1] move to L(2)
  • L(1) keeps [r2 #2, r7 #3] and [r6 #6, r8 #6]
