
Cassandra on RocksDB

Dikang Gu, Software Engineer @ Facebook


Agenda

  1. Motivation
  2. Approaches
  3. Design
  4. Performance metrics

Stories Direct Live Explore


Apache Cassandra

  • Highly scalable partitioned data store
  • High performance
  • High availability
  • Tunable consistency

Cassandra at Instagram

  • Thousands of Cassandra servers
  • 5 DCs
  • 100+ product use cases
  • millions of requests per seconds

Top Line Metrics

  • Reliability
      • 5-9s (99.999%): request failure rate < 0.001%
  • Performance
      • Write throughput
      • Read latency

Read Latency

[Chart: read latencies of 60ms, 25ms, and 5ms]


GC Stalls

[Chart: GC stall percentages of 2.5% and 1%]


Where do we play?


Approach 1: GC Tuning


Approach 1: GC Tuning

Pros:

  • No code changes

Cons:

  • Hard to tune for both latency and throughput
  • Highly dependent on the workload
  • At most ~20% P99 latency reduction

Approach 2: Off-heap Data Structures

  • Memtable
  • Caches
  • Indexes
  • Read/write path
  • Compaction

Approach 2: Off-heap Data Structures

Pros:

  • Incremental improvements
  • Easier for the community to accept

Cons:

  • Requires playing with Java Unsafe
  • Highly dependent on the workload
  • At most ~20% P99 latency reduction

Approach 3: C++ Storage Engines

  • Most memory is consumed by the storage engine
  • Memtable, compaction, read/write path, etc.
  • Switch the existing Java storage engine to a C++ implementation
  • Pluggable storage engine

Approach 3: C++ Storage Engines

Pros:

  • Greatly reduces JVM overhead
  • CPU efficiency
  • Long-term benefit from a pluggable storage engine

Cons:

  • Non-trivial effort to make the storage engine pluggable
  • JNI efficiency

C++ Storage Engine


RocksDB

  • Embedded C++ key-value database
  • Optimized for Flash with extremely low latencies
  • Popular storage engine for MySQL, MongoDB, etc.
  • Open source, Apache 2.0 license

Prototype

  • Supports the single key-value case
  • Bypasses C*'s own storage engine
  • No streaming support
  • Shadowed one production use case

Prototype Latency

[Chart: prototype latencies of 35ms, 15ms, and 2ms–5ms]


Prototype GC Stalls

[Chart: prototype GC stall percentages of 1%, 0.5%, and 0.1%]


RocksDB + Cassandra = Rocksandra


Challenges

  1. Cassandra data model
  2. Streaming

Design: Data Model


Key Encoding
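The encoding diagram is lost in this transcript, but the idea is that Rocksandra flattens Cassandra's (partition key, clustering columns) pair into a single RocksDB key whose plain bytewise order matches Cassandra's clustering order. A minimal sketch of that property; the exact layout below (length-prefixed partition key, sign-flipped big-endian integers) is an illustrative assumption, not Rocksandra's actual format:

```python
import struct

def encode_key(partition_key: bytes, clustering: list[int]) -> bytes:
    """Flatten (partition key, clustering columns) into one RocksDB key.

    Hypothetical simplified layout: a length-prefixed partition key,
    followed by each clustering column as a big-endian 64-bit integer
    with the sign bit flipped, so that bytewise comparison of encoded
    keys matches the signed ordering of the clustering columns.
    """
    out = struct.pack(">H", len(partition_key)) + partition_key
    for col in clustering:
        # Map signed range [-2^63, 2^63) onto unsigned [0, 2^64) monotonically.
        out += struct.pack(">Q", (col + (1 << 63)) % (1 << 64))
    return out
```

With this property, a Cassandra range query over clustering columns becomes a plain RocksDB iterator seek over a contiguous key range.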


Value Encoding
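The value-encoding diagram is also missing. The idea: each cell value is stored in RocksDB together with the Cassandra metadata (write timestamp, TTL) that the merge operator and compaction filter later need, so the storage engine can resolve conflicts without calling back into Cassandra. The byte layout below is an assumption for illustration:

```python
import struct

# Hypothetical cell layout: 8-byte signed write timestamp (microseconds),
# 4-byte unsigned TTL in seconds (0 = no TTL), then the raw value bytes.
HEADER = struct.Struct(">qI")

def encode_cell(timestamp_us: int, ttl_s: int, value: bytes) -> bytes:
    return HEADER.pack(timestamp_us, ttl_s) + value

def decode_cell(blob: bytes) -> tuple[int, int, bytes]:
    timestamp_us, ttl_s = HEADER.unpack_from(blob)
    return timestamp_us, ttl_s, blob[HEADER.size:]
```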


Merge Operator / Compaction Filter
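Cassandra's last-write-wins and TTL semantics map naturally onto these two RocksDB hooks: a merge operator resolves concurrent writes to the same cell by timestamp, and a compaction filter drops expired cells as SSTables are rewritten. An illustrative sketch of the semantics, operating on decoded (timestamp, ttl, value) tuples rather than RocksDB's byte-string C++ API:

```python
import time

def merge_cells(existing, operand):
    """Last-write-wins merge: keep the cell with the newer write timestamp.

    Cells are (timestamp_us, ttl_s, value) tuples; a real RocksDB merge
    operator applies the same rule to encoded byte strings.
    """
    if existing is None:
        return operand
    return operand if operand[0] >= existing[0] else existing

def compaction_filter(cell, now_s=None):
    """Return True if the cell should be dropped during compaction
    because its TTL has expired."""
    timestamp_us, ttl_s, _value = cell
    if ttl_s == 0:
        return False  # no TTL: always keep
    now_s = time.time() if now_s is None else now_s
    return now_s >= timestamp_us / 1_000_000 + ttl_s
```

Handling mutations as merge operands means a write never has to read the old value first, which is the same write-path shape Cassandra itself relies on for throughput.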


Streaming


Feature Milestone

Current Features:

  • Most non-nested data types
  • Table data model
  • Point query
  • Range query
  • Mutations
  • Timestamp
  • TTL
  • Deletions/Cell tombstones

Future Features:

  • Multi-partition query
  • Nested data types
  • Counters
  • Range tombstone
  • Materialized views
  • Secondary indexes
  • Repair

Performance metrics


Cluster A

  • Similar P99 read/write latency
  • Footprint reduced to 1/3

Cluster B

  • P99 read latency reduced 3x (from 60ms to 20ms)
  • Footprint reduced to 60%

Cluster C

  • High write rate and large fan-out reads
  • P99 read latency reduced from 1s to 10ms
  • Same footprint

Benchmark on AWS

  • C* cluster in a single AZ (us-west-2a), replication factor 1
  • 3 i3.8xlarge EC2 instances: 256GB memory, 32-core CPU, RAID 0 across 4 NVMe flash disks
  • NDBench cluster (https://github.com/Netflix/ndbench), run from the same AZ

Benchmark Metrics


Recap

Switching to Rocksandra helped us:

  • Cut down tail latency
  • Improve throughput

Try it!

Don't just take our word for it: download from github.com/instagram

  • Rocksandra code
  • Benchmark cloud formation template and scripts

Future work

  • Support more Cassandra features
  • Cassandra pluggable storage engine

Acknowledgement

Thanks for all the support from the Cassandra and RocksDB communities.


Thank You!