  1. Cassandra Offline Analytics Dongqian Liu, Yi Liu 2017/05/02

  2. Agenda
  • Introduction
  • Use Case
  • Problem & Solution
  • Suitable User Scenario
  • Cassandra Internals
  • Implementation Details
  • Performance
  • Similar Projects
  • Quick Start
  • Q&A

  3. Cassandra
  • Cassandra is a low-latency, high-throughput big table store. It is suitable for real-time queries.
  • Good read/write performance
  • Poor performance for operations like scan, count, groupBy, etc.

  4. Our Use Case
  • Our team supports the Feeds system for eBay Paid Internet Marketing
  • We use Cassandra to build our live item snapshots, among other use cases
  • We need analytics on the data to better boost our business
  • Cluster
    – 30 nodes
    – r/w throughput: avg. 40k/s, peak 100k/s
    – Data size: 8~10 TB

  5. Problem
  • Operations like full scans take a long time, e.g. a simple groupBy and count on a single table takes 10+ hrs
  • They overload the cluster
  • They pollute the in-memory cache, making read request performance much worse

  6. Solution
  • Cassandra's internal structure is, at a high level, similar to HBase's.
  • We can do something similar to what HBase MapReduce does. There are two gaps:
    – Cassandra SSTables live on the local disks of cluster nodes, while HBase HFiles live on HDFS.
    – There is no good way to read raw SSTables efficiently in a MapReduce or Spark job.
  • To fill these gaps, we upload the SSTables to HDFS in parallel and write code that lets MR and Spark jobs read the raw SSTables efficiently, finally transforming them into a Hadoop table. The pipeline is outlined below.
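  In outline (the concrete commands appear in the Quick Start slides): 1) snapshot each Cassandra node and upload its SSTables to HDFS in parallel; 2) run a MapReduce job that builds a split index over the uploaded SSTables; 3) run a second job that reads the raw SSTables via those splits and reduces them into a deduplicated Hadoop table.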

  7. Suitable User Scenario
  • You require offline analytics on a big table, in addition to low-latency, high-throughput read/write performance.
  • You find that HBase cannot satisfy your random read/write performance requirements.

  8. High-level Overview
  [Diagram: SSTables on the Cassandra cluster are uploaded as SSTables on Hadoop; Table A in Cassandra becomes a Table A snapshot on Hadoop, which feeds downstream offline analytics.]

  9. Table Transformation
  • Compaction
  • Deduplication
  • Consistency

  10. JIRA in Community
  • https://issues.apache.org/jira/browse/CASSANDRA-2527

  11. Cassandra Internals

  12. Cassandra Write
  [Diagram: the write path — a write is appended to the commit log and applied to the in-memory memtable; memtables are later flushed to immutable SSTables on disk.]

  13. Storing Data on Disk
  • Data (Data.db): SSTable data (see the sample listing below)
  • Primary Index (Index.db): index of the row keys with pointers to their positions in the data file
  • Bloom filter (Filter.db)
  • Compression Information (CompressionInfo.db): a file holding the uncompressed data length, chunk offsets and other compression information
  • Statistics (Statistics.db)
  • Digest (Digest.crc32 …)
  • CRC (CRC.db)
  • SSTable Index Summary (Summary.db)
  • SSTable Table of Contents (TOC.txt)
  • Secondary Index (SI_.*.db)
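  For illustration, one SSTable generation might look like the following on disk (Cassandra 3.x "ma"/"big" naming shown; exact file names vary by version):

      ma-1-big-Data.db
      ma-1-big-Index.db
      ma-1-big-Filter.db
      ma-1-big-CompressionInfo.db
      ma-1-big-Statistics.db
      ma-1-big-Digest.crc32
      ma-1-big-CRC.db
      ma-1-big-Summary.db
      ma-1-big-TOC.txt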

  14. Cassandra Read

  15. Cassandra Read
  [Diagram: the read path — the memtable and candidate SSTables are consulted, with the bloom filter, key cache, partition summary and partition index narrowing the search; cell versions are merged by timestamp.]

  16. Implementation Details

  17. Build Split Index for SSTables
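  A minimal sketch of what building a split index could look like — this is our reading of the slides, not the deck's actual code. It assumes entries from an SSTable's Index.db have already been decoded into (partition key, Data.db offset) pairs; all class names here are hypothetical:

      import java.util.ArrayList;
      import java.util.List;

      final class IndexEntry {
          final byte[] partitionKey;  // serialized partition key
          final long dataFileOffset;  // position of this partition's row in Data.db
          IndexEntry(byte[] partitionKey, long dataFileOffset) {
              this.partitionKey = partitionKey;
              this.dataFileOffset = dataFileOffset;
          }
      }

      final class Split {
          final long start;  // inclusive byte offset into Data.db
          final long end;    // exclusive byte offset into Data.db
          Split(long start, long end) { this.start = start; this.end = end; }
      }

      final class SplitIndexBuilder {
          /** Cut sorted index entries into row-aligned byte ranges of roughly targetSplitSize. */
          static List<Split> build(List<IndexEntry> entries, long dataFileLength, long targetSplitSize) {
              List<Split> splits = new ArrayList<>();
              long splitStart = 0;
              for (IndexEntry e : entries) {
                  if (e.dataFileOffset - splitStart >= targetSplitSize) {
                      splits.add(new Split(splitStart, e.dataFileOffset));
                      splitStart = e.dataFileOffset;  // next split begins on a row boundary
                  }
              }
              if (splitStart < dataFileLength) {
                  splits.add(new Split(splitStart, dataFileLength));
              }
              return splits;
          }
      }

  Each mapper can then read one Split of Data.db independently, which is presumably what lets the later SSTable2Hadoop job parallelize over large SSTables.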

  18. Compaction, Deduplication and Consistency
  • Compaction works on a collection of SSTables. From these SSTables, compaction collects all versions of each unique row and assembles one complete row, using the most up-to-date version (by timestamp) of each of the row's columns.
  • We use a reducer to handle compaction, deduplication and consistency, with the same logic as C* itself; a sketch of that merge rule follows.
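  A minimal sketch of the reconciliation rule — a simplified model written for illustration, assuming rows have already been grouped by key in the shuffle; real Cassandra reconciliation also handles tombstones, TTLs and counters:

      import java.util.HashMap;
      import java.util.Map;

      /** One column cell: a value plus its write timestamp. */
      final class Cell {
          final byte[] value;
          final long timestamp;
          Cell(byte[] value, long timestamp) { this.value = value; this.timestamp = timestamp; }
      }

      final class RowMerger {
          /** Merge all versions of one row: keep the newest cell per column, as C* compaction does. */
          static Map<String, Cell> merge(Iterable<Map<String, Cell>> versions) {
              Map<String, Cell> merged = new HashMap<>();
              for (Map<String, Cell> version : versions) {
                  for (Map.Entry<String, Cell> e : version.entrySet()) {
                      Cell current = merged.get(e.getKey());
                      if (current == null || e.getValue().timestamp > current.timestamp) {
                          merged.put(e.getKey(), e.getValue());  // newer timestamp wins
                      }
                  }
              }
              return merged;
          }
      }

  In the actual job this logic would sit inside SSTableReducer's reduce() call, one invocation per row key.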

  19. Core Classes
  • SSTableIndexInputFormat
  • SSTableIndexRecordReader
  • SSTableSplitIndexOutputFormat
  • SSTableSplitIndexRecordWriter
  • SSTableInputFormat
  • SSTableRecordReader
  • SSTableReducer
  • SSTableRowWritable
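  Judging by the names (the deck does not spell this out): the SSTableIndex*/SSTableSplitIndex* classes implement the BuildIndex step, SSTableInputFormat/SSTableRecordReader read Data.db using the generated splits, SSTableRowWritable carries a row through the shuffle, and SSTableReducer performs the merge from the previous slide.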

  20. SSTableIndexInputFormat

      // Each SSTable index file is consumed whole by a single mapper.
      protected boolean isSplitable(JobContext context, Path filename) {
          return false;
      }

      // Keep only the non-empty SSTable index files as job inputs
      // (IS_SSTABLE_INDEX is a path predicate defined elsewhere in the class).
      protected List<FileStatus> listStatus(JobContext job) throws IOException {
          List<FileStatus> files = super.listStatus(job);
          List<FileStatus> indexFiles = new ArrayList<FileStatus>();
          for (FileStatus file : files) {
              if (file.getLen() > 0 && IS_SSTABLE_INDEX.apply(file.getPath())) {
                  indexFiles.add(file);
              }
          }
          return indexFiles;
      }

  21. Performance
  • An SSTable of size 100 GB takes ~10 mins to build into a snapshot table on HDFS

  22. Similar Projects in Industry
  • https://github.com/fullcontact/hadoop-sstable
  • https://github.com/Knewton/KassandraMRHelper
  • https://github.com/Netflix/aegisthus
  • https://github.com/Netflix/Priam

  23. Quick Start

  24. Configuration
  • hadoop.sstable.cql="CREATE TABLE ..."
  • hadoop.sstable.keyspace="<KEYSPACE>"
  • mapreduce.job.reduces=<REDUCE_NUM>
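  A hypothetical filled-in example (keyspace, schema and reducer count invented for illustration):

      hadoop.sstable.cql="CREATE TABLE feeds.items (item_id bigint PRIMARY KEY, title text, price decimal)"
      hadoop.sstable.keyspace="feeds"
      mapreduce.job.reduces=64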

  25. Upload SSTables to Hadoop
  • sstable-backup/bin/backup.sh
  • Example (pssh runs the backup script on every host in conf/hosts, 7 at a time, with no timeout):

      pssh -i -h conf/hosts -p 7 -t 0 /data/applications/sstable-backup/bin/backup.sh -k <keyspace> -cf <column_family> -s <snapshot> -o <output_dir>

  26. Build Index

      $HADOOP_HOME/bin/hadoop jar hadoop-cassandra-1.0-SNAPSHOT-allInOne.jar com.ebay.traffic.cassandra.sstable.index.hadoop.mapreduce.BuildIndex <input> <output>

  27. SSTable to Hadoop File Format

      $HADOOP_HOME/bin/hadoop jar hadoop-cassandra-1.0-SNAPSHOT-allInOne.jar com.ebay.traffic.cassandra.sstable.hadoop.mapreduce.SSTable2Hadoop -D mapreduce.job.reduces=<REDUCE_NUM> <input> <output>

  28. Q & A
  Thanks
