Cassandra: a presentation by Jonathan Ellis

  1. Cassandra Jonathan Ellis

  2. Motivation ● Scaling reads to a relational database is hard ● Scaling writes to a relational database is virtually impossible ● … and when you do, it usually isn't relational anymore

  3. The new face of data ● Scale out, not up ● Online load balancing, cluster growth ● Flexible schema ● Key-oriented queries ● CAP-aware

  4. CAP theorem ● Pick two of Consistency, Availability, Partition tolerance

  5. Two famous papers ● Bigtable: A distributed storage system for structured data, 2006 ● Dynamo: Amazon's highly available key-value store, 2007

  6. Two approaches ● Bigtable: “How can we build a distributed db on top of GFS?” ● Dynamo: “How can we build a distributed hash table appropriate for the data center?”

  7. 10,000 ft summary ● Dynamo partitioning and replication ● Log-structured ColumnFamily data model similar to Bigtable's

  8. Cassandra highlights ● High availability ● Incremental scalability ● Eventually consistent ● Tunable tradeoffs between consistency and latency ● Minimal administration ● No single point of failure (SPOF)

  9. Dynamo architecture & Lookup

  10. Architecture details ● O(1) node lookup ● Explicit replication ● Eventually consistent
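
The lookup and replication bullets above can be illustrated with a small consistent-hashing ring. This is a minimal sketch, not Cassandra's actual code: the `Ring` class, the MD5-based `token` function, and the node names are all assumptions made for illustration. "O(1)" is read here as "no extra network hops", because every node is assumed to hold the full token list. The local search itself is a binary search.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string onto the ring using MD5 (an illustrative choice of hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical ring: each node knows every token, so routing needs no extra hops."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.tokens = sorted((token(n), n) for n in nodes)

    def replicas_for(self, key):
        # First node whose token is >= the key's token, wrapping around the ring,
        # followed by the next nodes clockwise until N replicas are chosen.
        start = bisect.bisect_left(self.tokens, (token(key), ""))
        count = min(self.replicas, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=3)
owners = ring.replicas_for("user:42")
print(owners)
```

Any node can run `replicas_for` locally and forward the request straight to the owners, which is what makes "any node" a valid entry point for reads and writes.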

  11. Architecture layers ● Messaging service, Gossip, Failure detection, Cluster state, Partitioner, Replication ● Commit log, Memtable, SSTable, Indexes, Compaction ● Tombstones, Hinted handoff, Read repair, Bootstrap, Monitoring, Admin tools

  12. Writes ● Any node ● Partitioner ● Commitlog, memtable ● SSTable ● Compaction ● Wait for W responses
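
The write path above can be sketched roughly as follows. Everything here is a simplified assumption: the `Node` class, the `memtable_limit` flush threshold, and the synchronous `replicated_write` helper are hypothetical, and real replicas acknowledge asynchronously over the network.

```python
class Node:
    """Hypothetical single-node store: commit log, memtable, and immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only list standing in for the on-disk log
        self.memtable = {}     # key -> (value, timestamp), held in memory
        self.sstables = []     # each flush produces one sorted, immutable list
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        # Sequential append for durability; no read or seek is needed.
        self.commit_log.append((key, value, timestamp))
        current = self.memtable.get(key)
        if current is None or timestamp >= current[1]:
            self.memtable[key] = (value, timestamp)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        return True  # acknowledgement sent back to the coordinator

    def flush(self):
        # Write the memtable out as a sorted table and start a fresh one.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}


def replicated_write(replicas, key, value, timestamp, w):
    """Send the write to every replica; succeed once at least W have acknowledged."""
    acks = sum(1 for node in replicas if node.write(key, value, timestamp))
    return acks >= w


nodes = [Node() for _ in range(3)]
ok = replicated_write(nodes, "k1", "v1", timestamp=1, w=2)
replicated_write(nodes, "k2", "v2", timestamp=2, w=2)
print(ok, nodes[0].sstables)
```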

  13. Memtable / SSTable [diagram labels: Disk, Commit log]

  14. SSTable format ● Key / data

  15. SSTable indexes ● Bloom filter ● Key ● Column (similar to Hadoop MapFile / TFile)
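
A Bloom filter lets a read skip an SSTable that definitely does not contain a key. The sketch below shows the general technique only. The sizes, the SHA-256-based hashing, and the `BloomFilter` class are assumptions for illustration and are not taken from Cassandra.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


bloom = BloomFilter()
bloom.add("keyA")
bloom.add("keyC")
print(bloom.might_contain("keyA"), bloom.might_contain("keyC"))
```

Only when `might_contain` returns true does the read go on to consult the key index and then the data file.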

  16. Compaction ● Merge keys ● Combine columns ● Discard tombstones

  17. Remove ● Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction ● Read repair complicates things a little ● Eventually consistent complicates things more ● Solution: configurable delay before tombstone GC, after which tombstones are not repaired
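
Compaction and delayed tombstone GC fit together roughly as in the sketch below. It is a simplified model under stated assumptions: SSTables are plain dicts of key to `(value, timestamp)`, and `TOMBSTONE`, `compact`, and the `gc_grace` parameter are hypothetical names chosen for this example.

```python
TOMBSTONE = object()  # sentinel marking a deleted value


def compact(sstables, now, gc_grace):
    """Merge SSTables so that the newest timestamp wins for each key.

    Tombstones older than gc_grace are discarded. Younger ones are kept so
    that replicas that missed the delete can still learn about it.
    """
    merged = {}
    for table in sstables:
        for key, (value, ts) in table.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return {
        key: (value, ts)
        for key, (value, ts) in sorted(merged.items())
        if not (value is TOMBSTONE and now - ts >= gc_grace)
    }


older = {"a": ("1", 10), "b": ("2", 10), "c": ("3", 10)}
newer = {"a": (TOMBSTONE, 20), "b": ("2b", 30), "c": (TOMBSTONE, 95)}
result = compact([older, newer], now=100, gc_grace=50)
print(sorted(result))
```

In this run `a` disappears entirely (its tombstone is past the grace period), `b` keeps its newer value, and `c` survives as a tombstone that still suppresses the older data.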

  18. Cassandra write properties ● No reads ● No seeks ● Fast ● Atomic within ColumnFamily ● Always writable

  19. Read path ● Any node ● Partitioner ● Wait for R responses ● Wait for N – R responses in the background and perform read repair
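
Read repair can be sketched as below. The replicas are modelled as plain dicts and the "background" comparison runs inline, so this is only an approximation of the behaviour described on the slide. `read_with_repair` is a hypothetical helper, and taking the first R replicas stands in for waiting on the R fastest responses.

```python
def read_with_repair(replicas, key, r):
    """replicas: list of dicts mapping key -> (value, timestamp)."""
    # Foreground: answer from R replicas, preferring the newest timestamp.
    answers = [rep[key] for rep in replicas[:r] if key in rep]
    result = max(answers, key=lambda v: v[1]) if answers else None

    # Background (inline here): compare all N replicas and repair stale ones.
    versions = [rep[key] for rep in replicas if key in rep]
    if versions:
        newest = max(versions, key=lambda v: v[1])
        for rep in replicas:
            if rep.get(key) != newest:
                rep[key] = newest
    return result


replicas = [{"k": ("old", 1)}, {"k": ("new", 2)}, {}]
value = read_with_repair(replicas, "k", r=2)
print(value, replicas)
```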

  20. Cassandra read properties ● Read multiple SSTables ● Slower than writes (but still fast) ● Seeks can be mitigated with more RAM ● Scales to billions of rows

  21. Consistency in a BASE world ● If W + R > N, you will have consistency ● W=1, R=N ● W=N, R=1 ● W=Q, R=Q where Q = N / 2 + 1
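
The W + R > N rule holds because any set of W replicas must then share at least one node with any set of R replicas, so a read always touches a replica that saw the latest write. The sketch below checks this by brute force for small N. The function names are illustrative, and `N / 2 + 1` is taken to mean integer division.

```python
from itertools import combinations


def quorum(n: int) -> int:
    """Q = N / 2 + 1, using integer division."""
    return n // 2 + 1


def always_overlap(n: int, w: int, r: int) -> bool:
    """True if every possible write set of size W intersects every read set of size R."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


for n in (3, 5):
    q = quorum(n)
    print(n, q, always_overlap(n, q, q), always_overlap(n, 1, 1))
```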

  22. vs MySQL with 50GB of data ● MySQL: ~300ms write, ~350ms read ● Cassandra: ~0.12ms write, ~15ms read ● Achtung!

  23. Data model ● Rows, ColumnFamilies, Columns

  24. ColumnFamilies ● keyA: column1, column2, column3 ● keyC: column1, column7, column11 ● Column: byte[] name, byte[] value, i64 timestamp

  25. Super ColumnFamilies ● keyF: Super1, Super2, each holding columns ● keyJ: Super1, Super5, each holding columns

  26. Types of queries ● Single column ● Slice ● Set of names / range of names ● Simple slice -> columns ● Super slice -> supercolumns ● Key range
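
The single-column and slice queries above can be modelled on a toy ColumnFamily. This is an in-memory sketch, not the real API: the helper names are hypothetical, and columns are kept sorted by name as plain strings, which is why `column11` sorts before `column7`.

```python
import bisect

# One ColumnFamily: row key -> list of (name, value, timestamp), sorted by name.
column_family = {
    "keyA": [("column1", b"a", 1), ("column2", b"b", 1), ("column3", b"c", 2)],
    "keyC": [("column1", b"x", 1), ("column11", b"y", 3), ("column7", b"z", 2)],
}


def single_column(cf, key, name):
    """Return one column of a row, or None if it does not exist."""
    for col in cf.get(key, []):
        if col[0] == name:
            return col
    return None


def slice_by_names(cf, key, names):
    """Slice by an explicit set of column names."""
    wanted = set(names)
    return [col for col in cf.get(key, []) if col[0] in wanted]


def slice_by_range(cf, key, start, finish):
    """Slice by a name range (start <= name <= finish), using the sorted order."""
    cols = cf.get(key, [])
    names = [c[0] for c in cols]
    lo = bisect.bisect_left(names, start)
    hi = bisect.bisect_right(names, finish)
    return cols[lo:hi]


print(slice_by_range(column_family, "keyC", "column1", "column2"))
```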

  27. Range queries ● Add “master” server ● Implement on top of K/V ● Order-preserving partitioning

  28. Modification ● Insert / update ● Remove ● Single column or batch ● Specify W, number of nodes to wait for

  29. Thrift
      struct Column {
        1: binary name,
        2: binary value,
        3: i64 timestamp,
      }
      struct SuperColumn {
        1: binary name,
        2: list<Column> columns,
      }
      Column get_column(table, key, column_path, block_for=1)
      list<string> get_key_range(table, column_family, start_with="", stop_at="", max_results=100)
      void insert(table, key, column_path, value, timestamp, block_for=0)
      void remove(tablename, key, column_path_or_parent, timestamp)

  30. Honestly, Thrift kinda sucks

  31. Example: a multiuser blog ● Two queries ● The most recent posts belonging to a given blog, in reverse chronological order ● A single post and its comments, in chronological order

  32. First try [diagram: rows JBE (posts “Cassandra is teh awesome”, “BASE FTW”) and Evan (posts “I like kittens”, “And Ruby”); each blog holds posts, each post holds comments] <ColumnFamily Type="Super" CompareWith="TimeString" CompareSubcolumnsWith="UUID" Name="Blog"/>

  33. Second try [diagram: a Blog ColumnFamily with rows “JBE blog” and “Evan blog” whose columns are the posts, and a Comment ColumnFamily with one row per post whose columns are its comments] <ColumnFamily CompareWith="UUIDType" Name="Blog"/> <ColumnFamily CompareWith="UUIDType" Name="Comment"/>
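
The two-ColumnFamily layout can be mimicked with nested dicts to show how it serves both queries from the blog example. This is a sketch under several assumptions: integer timestamps stand in for the time-ordered UUID column names, the Comment rows are assumed to be keyed by post title, the comment texts are placeholders, and both helper functions are hypothetical.

```python
# Row key -> {column name: value}; integers stand in for time-based UUIDs.
blog = {
    "JBE blog": {1: "Cassandra is teh awesome", 2: "BASE FTW"},
    "Evan blog": {1: "I like kittens", 3: "And Ruby"},
}
comment = {
    # Placeholder comment texts, keyed by an assumed post-title row key.
    "Cassandra is teh awesome": {5: "first comment", 6: "second comment"},
}


def recent_posts(blog_cf, blog_key, limit=10):
    """Query 1: the most recent posts of a blog, newest first."""
    row = blog_cf.get(blog_key, {})
    return [row[ts] for ts in sorted(row, reverse=True)[:limit]]


def post_with_comments(comment_cf, post_key):
    """Query 2: a single post and its comments, oldest first."""
    row = comment_cf.get(post_key, {})
    return post_key, [row[ts] for ts in sorted(row)]


print(recent_posts(blog, "JBE blog"))
print(post_with_comments(comment, "Cassandra is teh awesome"))
```

Each query reads a single row in column order, which is the point of splitting posts and comments into separate ColumnFamilies.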

  34. Roadmap

  35. Cassandra 0.3 ● Remove support ● OPP / Range queries ● Test suite ● Workarounds for JDK bugs ● Rudimentary multi-datacenter support

  36. Cassandra 0.4 ● Branched May 18 ● Data file format change to support billions of rows per node instead of millions ● API changes (no more colon delimiters) ● Multi-table (keyspace) support ● LRU key cache ● fsync support ● Bootstrap ● Web interface

  37. Cassandra 0.5 ● Bootstrap ● Load balancing ● Closely related to “bootstrap done right” ● Merkle tree repair ● Millions of columns per row ● This will require another data format change ● Multiget ● Callout support

  38. Users ● Production: Facebook, RocketFuel ● Production RSN: Digg, Rackspace ● No date yet: IBM Research, Twitter ● Evaluating: 50+ in #cassandra on freenode

  39. More ● Eventual consistency: http://www.allthingsdistributed.com/2008/1 ● Introduction to distributed databases by Todd Lipcon at NoSQL 09: http://www.vimeo.com/5145059 ● Other articles/videos about Cassandra: http://wiki.apache.org/cassandra/ArticlesAn ● #cassandra on irc.freenode.net

  40. Cassandra
