Cassandra Jonathan Ellis Motivation Scaling reads to a relational - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Cassandra

Jonathan Ellis

slide-2
SLIDE 2

Motivation

  • Scaling reads to a relational database is hard
  • Scaling writes to a relational database is virtually impossible
  • … and when you do, it usually isn't relational anymore

slide-3
SLIDE 3

The new face of data

  • Scale out, not up
  • Online load balancing, cluster growth
  • Flexible schema
  • Key-oriented queries
  • CAP-aware
slide-4
SLIDE 4

CAP theorem

  • Pick two of Consistency, Availability, Partition tolerance

slide-5
SLIDE 5

Two famous papers

  • Bigtable: A distributed storage system for structured data, 2006
  • Dynamo: Amazon's highly available key-value store, 2007

slide-6
SLIDE 6

Two approaches

  • Bigtable: “How can we build a distributed db on top of GFS?”
  • Dynamo: “How can we build a distributed hash table appropriate for the data center?”

slide-7
SLIDE 7

10,000 ft summary

  • Dynamo partitioning and replication
  • Log-structured ColumnFamily data model similar to Bigtable's

slide-8
SLIDE 8

Cassandra highlights

  • High availability
  • Incremental scalability
  • Eventually consistent
  • Tunable tradeoffs between consistency and latency

  • Minimal administration
  • No single point of failure (SPOF)
slide-9
SLIDE 9
slide-10
SLIDE 10
slide-11
SLIDE 11
slide-12
SLIDE 12
slide-13
SLIDE 13

Dynamo architecture & Lookup

slide-14
SLIDE 14

Architecture details

  • O(1) node lookup
  • Explicit replication
  • Eventually consistent
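
The lookup and replication bullets above can be illustrated with a toy consistent-hashing ring. This is a minimal sketch rather than Cassandra's implementation: the `Ring` class, the MD5-based `token` function, and the node names are all assumptions made for illustration. The point is that every node holds the full token list, so a request can be routed to the owning replicas without intermediate network hops.

```python
import bisect
import hashlib


def token(name: str) -> int:
    """Map a key or node name to a ring position (MD5 is an assumption of this sketch)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy ring: each node knows every token, so routing needs no extra hops."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.tokens = sorted((token(n), n) for n in nodes)

    def replicas_for(self, key):
        """Return the node that owns the key plus the next nodes clockwise."""
        positions = [t for t, _ in self.tokens]
        start = bisect.bisect_left(positions, token(key)) % len(self.tokens)
        wanted = min(self.replicas, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(wanted)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=3)
owners = ring.replicas_for("user:42")  # three distinct nodes, chosen deterministically
```

Replication is explicit here in the sense that the replica set is computed directly from the ring position instead of being delegated to an underlying file system.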
slide-15
SLIDE 15

Architecture layers

Messaging service, Gossip, Failure detection, Cluster state, Partitioner, Replication, Commit log, Memtable, SSTable, Indexes, Compaction, Tombstones, Hinted handoff, Read repair, Bootstrap, Monitoring, Admin tools

slide-16
SLIDE 16

Writes

  • Any node
  • Partitioner
  • Commitlog, memtable
  • SSTable

  • Compaction
  • Wait for W responses
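
The write steps above can be sketched as a toy in-memory model. Everything here is hypothetical (the `Node` class, `coordinator_write`, and the `memtable_limit` flush threshold are not Cassandra APIs), and replica calls are made synchronously for simplicity. It shows the order of operations: append to the commit log, update the memtable, flush a sorted SSTable when the memtable fills, and report success once W replicas have acknowledged.

```python
class Node:
    """Toy replica: commit log append, memtable update, flush to an immutable SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only list standing in for the on-disk log
        self.memtable = {}     # row key -> {column name: (value, timestamp)}
        self.sstables = []     # flushed snapshots sorted by row key, newest last
        self.memtable_limit = memtable_limit

    def write(self, key, column, value, timestamp):
        self.commit_log.append((key, column, value, timestamp))  # sequential append
        self.memtable.setdefault(key, {})[column] = (value, timestamp)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        return True  # acknowledgement to the coordinator

    def flush(self):
        """Write the memtable out as a sorted, immutable table and start a new one."""
        self.sstables.append(sorted(self.memtable.items(), key=lambda kv: kv[0]))
        self.memtable = {}


def coordinator_write(replicas, w, key, column, value, timestamp):
    """Send the write to every replica and succeed if at least W acknowledge."""
    acks = sum(1 for node in replicas if node.write(key, column, value, timestamp))
    return acks >= w


replicas = [Node() for _ in range(3)]
ok = coordinator_write(replicas, 2, "keyA", "column1", b"v1", 1)
```

Compaction is left out of this sketch; it would later merge the flushed tables.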
slide-17
SLIDE 17

Memtable / SSTable

Commit log

Disk

slide-18
SLIDE 18

SSTable format

  • Key / data
slide-19
SLIDE 19

SSTable Indexes

  • Bloom filter
  • Key
  • Column

(Similar to Hadoop MapFile / TFile)
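
As an illustration of the first index above, here is a minimal Bloom filter sketch. The sizes, the SHA-1 based hashing, and the `BloomFilter` class are assumptions for illustration and not the on-disk format. The filter can answer "definitely absent" without touching disk, which lets a read skip SSTables that cannot contain the key; a positive answer may be a false positive.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        # Derive several bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


keys_in_sstable = BloomFilter()
keys_in_sstable.add(b"keyA")
```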

slide-20
SLIDE 20

Compaction

  • Merge keys
  • Combine columns
  • Discard tombstones
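
The three compaction steps above can be sketched as a merge over toy SSTables. The data layout (a dict of row key to `{column: (value, timestamp)}`, with `None` as the tombstone value) and the `compact` function are assumptions of this sketch. It also discards tombstones immediately, ignoring any delay a real system would apply first.

```python
def compact(sstables, drop_tombstones=True):
    """Merge toy SSTables into one, keeping the newest version of each column."""
    merged = {}
    for table in sstables:
        for key, columns in table.items():           # merge keys
            row = merged.setdefault(key, {})
            for name, (value, ts) in columns.items():  # combine columns
                if name not in row or ts > row[name][1]:
                    row[name] = (value, ts)
    result = {}
    for key in sorted(merged):
        live = {
            name: col
            for name, col in merged[key].items()
            if not (drop_tombstones and col[0] is None)  # discard tombstones
        }
        if live:
            result[key] = live
    return result


older = {"keyA": {"column1": ("v1", 1), "column2": ("v2", 1)}}
newer = {"keyA": {"column1": ("v1b", 5), "column2": (None, 6)},
         "keyC": {"column7": ("x", 2)}}
compacted = compact([older, newer])
```

In this example `column2` disappears entirely because its newest version is a tombstone, and `column1` keeps only the value written at timestamp 5.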
slide-21
SLIDE 21

Remove

  • Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction
  • Read repair complicates things a little
  • Eventual consistency complicates things more
  • Solution: configurable delay before tombstone GC, after which tombstones are not repaired
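
The configurable delay can be sketched as a simple age check. The names `gc_grace`, `collectable`, and `gc_row` are hypothetical, timestamps are plain seconds, and `None` stands for a tombstone value. A tombstone is kept until it is older than the delay, so replicas that missed the delete still have time to learn about it.

```python
def collectable(column, now, gc_grace):
    """A tombstone may be garbage-collected only after the grace period has passed."""
    value, timestamp = column
    return value is None and now - timestamp >= gc_grace


def gc_row(row, now, gc_grace):
    """Return the row without tombstones that are past the grace period."""
    return {name: col for name, col in row.items()
            if not collectable(col, now, gc_grace)}


row = {
    "a": ("live value", 100),  # ordinary column, always kept
    "b": (None, 100),          # old tombstone, past the grace period
    "c": (None, 900),          # recent tombstone, still needed
}
after_gc = gc_row(row, now=1000, gc_grace=500)
```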

slide-22
SLIDE 22

Cassandra write properties

  • No reads
  • No seeks
  • Fast
  • Atomic within ColumnFamily
  • Always writable
slide-23
SLIDE 23

Read path

  • Any node
  • Partitioner
  • Wait for R responses
  • Wait for N – R responses in the background and perform read repair
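
A toy version of this read path, under stated assumptions: replicas are plain dicts mapping a key to `(value, timestamp)`, the newest timestamp wins, and the background step runs synchronously. The `read` and `newest` helpers are hypothetical names. The caller gets the newest value among the first R responses, and stale or missing replicas are then repaired with the newest version seen across all N.

```python
def newest(versions):
    """Pick the version with the highest timestamp, ignoring missing responses."""
    present = [v for v in versions if v is not None]
    return max(present, key=lambda v: v[1]) if present else None


def read(replicas, r, key):
    # Answer the client from the first R responses.
    result = newest([rep.get(key) for rep in replicas[:r]])
    # "Background" step, modeled synchronously: collect the remaining
    # responses and push the newest version to any replica that lacks it.
    winner = newest([rep.get(key) for rep in replicas])
    if winner is not None:
        for rep in replicas:
            if rep.get(key) != winner:
                rep[key] = winner  # read repair
    return result


replicas = [{"k": ("old", 1)}, {"k": ("new", 2)}, {}]
value = read(replicas, 2, "k")
```

With R=1 the same call would return the stale `("old", 1)` to the client while still repairing the replicas, which is the latency versus consistency tradeoff in miniature.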

slide-24
SLIDE 24

Cassandra read properties

  • Read multiple SSTables

  • Slower than writes (but still fast)
  • Seeks can be mitigated with more RAM
  • Scales to billions of rows
slide-25
SLIDE 25

Consistency in a BASE world

  • If W + R > N, you will have consistency
  • W=1, R=N
  • W=N, R=1
  • W=Q, R=Q where Q = ⌊N / 2⌋ + 1 (a majority quorum)
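
The rule W + R > N works because any set of W replicas that took the write must then share at least one node with any set of R replicas that serve the read. The sketch below checks this by brute force for small N; the helper names are made up for illustration.

```python
from itertools import combinations


def quorum(n):
    """Majority quorum: integer division, so N=3 gives 2 and N=4 gives 3."""
    return n // 2 + 1


def always_overlap(w, r, n):
    """True if every possible write set of size W intersects every read set of size R."""
    nodes = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(nodes, w)
        for rs in combinations(nodes, r)
    )


n = 3
settings = {"W=1, R=N": (1, n), "W=N, R=1": (n, 1), "W=Q, R=Q": (quorum(n), quorum(n))}
consistent = {name: always_overlap(w, r, n) for name, (w, r) in settings.items()}
```

All three settings from the slide overlap for N=3, while W=1, R=1 does not, since the read may land on a replica the write never reached.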
slide-26
SLIDE 26

vs MySQL with 50GB of data

  • MySQL
      • ~300ms write
      • ~350ms read
  • Cassandra
      • ~0.12ms write
      • ~15ms read
  • Achtung!
slide-27
SLIDE 27

Data model

  • Rows, ColumnFamilies, Columns
slide-28
SLIDE 28

ColumnFamilies

keyA: column1, column2, column3
keyC: column1, column7, column11

Column: Byte[] name, Byte[] value, I64 timestamp
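
One way to picture this model is as a map of row keys to maps of columns. The sketch below is an in-memory illustration only. The `Column` dataclass mirrors the three fields on the slide, while the `insert` helper and its rule that the newest timestamp wins are assumptions of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int


def insert(column_family, key, column):
    """Store a column under a row key, keeping the version with the newest timestamp."""
    row = column_family.setdefault(key, {})
    current = row.get(column.name)
    if current is None or column.timestamp > current.timestamp:
        row[column.name] = column


standard = {}  # one ColumnFamily: row key -> {column name -> Column}
insert(standard, b"keyA", Column(b"column1", b"v1", 1))
insert(standard, b"keyA", Column(b"column1", b"v2", 2))
insert(standard, b"keyA", Column(b"column1", b"stale", 1))  # older, ignored
insert(standard, b"keyC", Column(b"column7", b"x", 1))
```

Rows need not share column names, which is what "flexible schema" amounts to in this model.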

slide-29
SLIDE 29

Super ColumnFamilies

keyF: Super1 (column, column, column), Super2 (column, column, column)
keyJ: Super1 (column, column, column), Super5 (column, column, column)

slide-30
SLIDE 30

Types of queries

  • Single column
  • Slice
      • Set of names / range of names
      • Simple slice -> columns
      • Super slice -> supercolumns
  • Key range
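
The query types above can be sketched against a plain dict-of-dicts ColumnFamily. The function names and parameters below are hypothetical and do not match the Thrift API; column names and keys are compared as ordinary strings.

```python
def get_column(cf, key, name):
    """Single column lookup."""
    return cf.get(key, {}).get(name)


def slice_by_names(cf, key, names):
    """Slice by an explicit set of column names."""
    row = cf.get(key, {})
    return [(n, row[n]) for n in names if n in row]


def slice_by_range(cf, key, start, finish, count=100):
    """Slice by a range of column names, in sorted order."""
    row = cf.get(key, {})
    return [(n, row[n]) for n in sorted(row) if start <= n <= finish][:count]


def key_range(cf, start, stop, max_results=100):
    """Row keys between start and stop, in sorted order."""
    return [k for k in sorted(cf) if start <= k <= stop][:max_results]


cf = {
    "keyA": {"column1": "a", "column2": "b", "column3": "c"},
    "keyC": {"column1": "d", "column7": "e"},
}
middle = slice_by_range(cf, "keyA", "column2", "column3")
```

A super slice would work the same way one level up, returning whole supercolumns instead of single columns.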
slide-31
SLIDE 31

Range queries

  • Add “master” server
  • Implement on top of K/V
  • Order-preserving partitioning
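
To illustrate the third option, here is a sketch of order-preserving partitioning, assuming each node is assigned a string token and owns the keys up to and including that token. The class and node names are made up. Because keys keep their sort order on the ring, a key range maps to a run of adjacent nodes, whereas a hashing partitioner would scatter the same range across the whole cluster.

```python
import bisect


class OrderPreservingRing:
    """Toy ring where a key's position is the key itself, not its hash."""

    def __init__(self, node_tokens):
        # node_tokens: {node name: token}; a node owns keys <= its token
        # and greater than the previous node's token.
        self.ring = sorted((t, n) for n, t in node_tokens.items())
        self.tokens = [t for t, _ in self.ring]

    def node_for(self, key):
        i = bisect.bisect_left(self.tokens, key) % len(self.ring)
        return self.ring[i][1]

    def nodes_for_range(self, start, stop):
        """Nodes holding keys in [start, stop], assuming start <= stop."""
        first = bisect.bisect_left(self.tokens, start)
        last = bisect.bisect_left(self.tokens, stop)
        return [self.ring[i % len(self.ring)][1] for i in range(first, last + 1)]


ring = OrderPreservingRing({"n1": "g", "n2": "p", "n3": "z"})
touched = ring.nodes_for_range("apple", "kiwi")  # spans n1 and n2 only
```

The cost of this design is that skewed key distributions produce skewed load, which is one reason load balancing appears on the roadmap.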
slide-32
SLIDE 32

Modification

  • Insert / update
  • Remove
  • Single column or batch
  • Specify W, number of nodes to wait for
slide-33
SLIDE 33

Thrift

struct Column {
  1: binary name,
  2: binary value,
  3: i64 timestamp,
}

struct SuperColumn {
  1: binary name,
  2: list<Column> columns,
}

Column get_column(table, key, column_path, block_for=1)

list<string> get_key_range(table, column_family, start_with="", stop_at="", max_results=100)

void insert(table, key, column_path, value, timestamp, block_for=0)

void remove(tablename, key, column_path_or_parent, timestamp)

slide-34
SLIDE 34

Honestly, Thrift kinda sucks

slide-35
SLIDE 35

Example: a multiuser blog

Two queries

  • the most recent posts belonging to a given blog, in reverse chronological order
  • a single post and its comments, in chronological order

slide-36
SLIDE 36

First try

JBE blog
  Cassandra is teh awesome: post, comment, comment
  BASE FTW: post, comment, comment
Evan blog
  I like kittens: post, comment, comment
  And Ruby: post, comment, comment

<ColumnFamily Type="Super" CompareWith="TimeString" CompareSubcolumnsWith="UUID" Name="Blog"/>

slide-37
SLIDE 37

Second try

<ColumnFamily CompareWith="UUIDType" Name="Blog"/>

JBE blog: Cassandra is teh awesome, BASE FTW
Evan blog: I like kittens, And Ruby

Cassandra is teh awesome: comment, comment
BASE FTW: comment, comment
I like kittens: comment, comment
And Ruby: comment, comment

<ColumnFamily CompareWith="UUIDType" Name="Comment"/>
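
A rough in-memory sketch of how the two ColumnFamilies could serve the two queries. Several things here are assumptions: a simple counter stands in for time-ordered UUIDs, the Comment rows are keyed by post id, and all function names are hypothetical.

```python
from itertools import count

_next_id = count(1)  # stand-in for time-ordered UUIDs (an assumption of this sketch)

blog_cf = {}     # "Blog": blog name -> {post id -> post body}
comment_cf = {}  # "Comment": post id -> {comment id -> comment body}


def add_post(blog, body):
    post_id = next(_next_id)
    blog_cf.setdefault(blog, {})[post_id] = body
    return post_id


def add_comment(post_id, body):
    comment_cf.setdefault(post_id, {})[next(_next_id)] = body


def recent_posts(blog, limit=10):
    """Query 1: newest posts first, a reversed slice over one Blog row."""
    posts = blog_cf.get(blog, {})
    return [posts[i] for i in sorted(posts, reverse=True)[:limit]]


def post_with_comments(blog, post_id):
    """Query 2: one post plus its comments in chronological order."""
    comments = comment_cf.get(post_id, {})
    return blog_cf[blog][post_id], [comments[i] for i in sorted(comments)]


first = add_post("JBE blog", "Cassandra is teh awesome")
second = add_post("JBE blog", "BASE FTW")
add_comment(first, "comment 1")
add_comment(first, "comment 2")
```

Each query touches a single row in a single ColumnFamily, which matches the key-oriented access pattern the data model favors.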

slide-38
SLIDE 38

Roadmap

slide-39
SLIDE 39

Cassandra 0.3

  • Remove support
  • OPP / Range queries
  • Test suite

  • Workarounds for JDK bugs
  • Rudimentary multi-datacenter support
slide-40
SLIDE 40

Cassandra 0.4

  • Branched May 18
  • Data file format change to support billions of rows per node instead of millions
  • API changes (no more colon delimiters)
  • Multi-table (keyspace) support
  • LRU key cache
  • fsync support
  • Bootstrap
  • Web interface
slide-41
SLIDE 41

Cassandra 0.5

  • Bootstrap
  • Load balancing
  • Closely related to “bootstrap done right”
  • Merkle tree repair
  • Millions of columns per row
  • This will require another data format change
  • Multiget
  • Callout support
slide-42
SLIDE 42

Users

Production: Facebook, RocketFuel
Production RSN: Digg, Rackspace
No date yet: IBM Research, Twitter
Evaluating: 50+ in #cassandra on freenode

slide-43
SLIDE 43

More

  • Eventual consistency: http://www.allthingsdistributed.com/2008/1
  • Introduction to distributed databases by Todd Lipcon at NoSQL 09: http://www.vimeo.com/5145059
  • Other articles/videos about Cassandra: http://wiki.apache.org/cassandra/ArticlesAn

  • #cassandra on irc.freenode.net
slide-44
SLIDE 44

Cassandra