Supercharging Cassandra... Tom Wilkie Founder & VP - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Supercharging Cassandra...

Tom Wilkie Founder & VP Engineering @tom_wilkie

slide-2
SLIDE 2

Before the Flood

Old hardware

1990

BTree File systems RAID Small databases BTree indexes

slide-3
SLIDE 3

Two Revolutions

2010

Distributed, shared-nothing databases

Write-optimised indexes

BTree file systems

RAID

New hardware

slide-4
SLIDE 4

Bridging the Gap

2011

Distributed, shared-nothing databases

Castle

New hardware

slide-5
SLIDE 5

Big Data Applications

Memcached

...

Acunu Storage Core

Open API Management Deployment Monitoring

Cross-Cluster Management UI

slide-6
SLIDE 6
  • 1. Predictability
slide-7
SLIDE 7
slide-8
SLIDE 8

Small random inserts

Inserting 3 billion rows

[Plot legend: Acunu powered Cassandra vs. ‘standard’ Cassandra]

slide-9
SLIDE 9

Insert latency

While inserting 3 billion rows

[Plot legend: Acunu powered Cassandra (x) vs. ‘standard’ Cassandra (+)]

slide-10
SLIDE 10

Small random range queries

Performed immediately after inserts

[Plot legend: Acunu powered Cassandra vs. ‘standard’ Cassandra]

slide-11
SLIDE 11

                              Standard    Acunu     Benefits
inserts        rate           ~32k/s      ~45k/s    >1.4x
               95% latency    ~32s        ~0.3s     >100x
gets           rate           ~100/s      ~350/s    >3.5x
               95% latency    ~2s         ~0.5s     >4x
range queries  rate           ~0.4/s      ~40/s     >100x
               95% latency    ~15s        ~2s       >7.5x

Performance summary
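
The "Benefits" column can be re-derived from the other two columns. The snippet below is only a sanity check on the rounded figures from the slide, assuming rates improve as Acunu/Standard and latencies improve as Standard/Acunu; the dictionary layout and names are illustrative, not from the talk.

```python
# Rounded figures copied from the performance summary slide.
# Each entry: (standard, acunu, higher_is_better)
SUMMARY = {
    "inserts rate (ops/s)":          (32_000, 45_000, True),
    "inserts 95% latency (s)":       (32.0, 0.3, False),
    "gets rate (ops/s)":             (100, 350, True),
    "gets 95% latency (s)":          (2.0, 0.5, False),
    "range queries rate (ops/s)":    (0.4, 40, True),
    "range queries 95% latency (s)": (15.0, 2.0, False),
}


def benefit(standard, acunu, higher_is_better):
    """Improvement factor: a rate ratio for throughput, an inverse ratio for latency."""
    return acunu / standard if higher_is_better else standard / acunu


if __name__ == "__main__":
    for name, row in SUMMARY.items():
        print(f"{name:32s} {benefit(*row):7.1f}x")
```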

slide-12
SLIDE 12

Doubling Array

2 9 2 9

Inserts

Buffer arrays in memory until we have > B of them

slide-13
SLIDE 13

Doubling Array

11 8 8 11 2 9 2 8 9 11

Inserts etc...

Similar to log-structured merge trees (LSM), cache-oblivious lookahead array (COLA), ...
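
The insert sequence on the two slides above (2 and 9 merge into [2, 9]; 11 and 8 merge into [8, 11]; the two runs merge into [2, 8, 9, 11]) can be mimicked with a toy in-memory structure. This is a minimal sketch, assuming one sorted run of 2^k keys per level and a merge whenever two runs of equal size meet. The class and method names are made up for illustration, and it ignores the in-memory buffering, on-disk layout and Bloom filters of the real Castle implementation.

```python
from heapq import merge


class DoublingArray:
    """Toy doubling array: level k is either empty (None) or one sorted run of 2**k keys."""

    def __init__(self):
        self.levels = []  # levels[k] is None or a sorted list of length 2**k

    def insert(self, key):
        run = [key]  # a new key starts as a run of size 1
        k = 0
        while True:
            if k == len(self.levels):
                self.levels.append(run)  # grow a new, larger level
                return
            if self.levels[k] is None:
                self.levels[k] = run  # free slot: park the run here
                return
            # Slot occupied: merge the two equal-sized sorted runs sequentially
            # and carry the doubled run up to the next level.
            run = list(merge(self.levels[k], run))
            self.levels[k] = None
            k += 1

    def range_query(self, lo, hi):
        """Return all keys in [lo, hi] by merging every non-empty level."""
        runs = [r for r in self.levels if r]
        return [x for x in merge(*runs) if lo <= x <= hi]


if __name__ == "__main__":
    da = DoublingArray()
    for key in (2, 9, 11, 8):
        da.insert(key)
    print(da.levels)
    print(da.range_query(8, 10))
```

Every merge reads and writes whole sorted runs front to back, which is why the structure trades random IO for sequential IO. Duplicate keys are simply kept side by side in this sketch.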
slide-14
SLIDE 14

https://acunu-videos.s3.amazonaws.com/dajs.html

Demo

slide-15
SLIDE 15

B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries

                  Update                         Range Query (Size Z)
B-Tree            O(log_B N) random IOs          O(Z/B) random IOs
Doubling Array    O((log N)/B) sequential IOs    O(Z/B) sequential IOs

Worked example (N = 2^30 entries):

B-Tree:           ~ log (2^30)/log 100 = 5 IOs/update
                  8KB @ 100MB/s, w/ 8ms seek = 100 IOs/s
                  100 / 5 = 20 updates/s

Doubling Array:   ~ log (2^30)/100 = 0.2 IOs/update
                  8KB @ 100MB/s = 13k IOs/s
                  13k / 0.2 = 65k updates/s
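
The back-of-envelope numbers on this slide can be reproduced in a few lines. The sketch below assumes natural logarithms (which is what makes log(2^30)/100 come out near 0.2), decimal units (8KB = 8,000 bytes, 100MB/s = 10^8 bytes/s) and a flat 8ms seek per random IO; the variable names are illustrative. The unrounded results differ a little from the slide, which rounds each intermediate value.

```python
import math

N = 2**30              # number of entries
B = 100                # entries per block (8KB block / ~100 bytes per entry)
BLOCK_BYTES = 8_000    # 8KB, decimal units (assumption)
BANDWIDTH = 100e6      # 100MB/s sequential throughput
SEEK_SECONDS = 0.008   # 8ms per random IO

# IOs needed per update
btree_ios = math.log(N) / math.log(B)   # O(log_B N) random IOs
da_ios = math.log(N) / B                # O((log N)/B) sequential IOs

# IOs the disk can deliver per second
transfer_seconds = BLOCK_BYTES / BANDWIDTH
seq_iops = 1 / transfer_seconds                     # sequential: transfer time only
rand_iops = 1 / (SEEK_SECONDS + transfer_seconds)   # random: seek plus transfer

btree_updates_per_s = rand_iops / btree_ios
da_updates_per_s = seq_iops / da_ios

if __name__ == "__main__":
    print(f"B-Tree:         {btree_ios:.1f} IOs/update, {rand_iops:.0f} IOs/s, "
          f"{btree_updates_per_s:.0f} updates/s")
    print(f"Doubling Array: {da_ios:.2f} IOs/update, {seq_iops:.0f} IOs/s, "
          f"{da_updates_per_s:.0f} updates/s")
```

Whatever rounding is used, the gap between the two structures stays at roughly three orders of magnitude, which is the point of the slide.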

slide-16
SLIDE 16

Castle

[Architecture diagram: Acunu Kernel, spanning Userspace and the Linux Kernel. Layers labelled in the diagram: userspace interface, kernelspace interface, doubling array mapping layer, modlist btree mapping layer, block mapping & caching layer, Linux's block & MM layers. Other labels: shared memory interface (async, shared memory ring; shared buffers; keys; values), streaming interface, in-kernel workloads, doubling arrays (insert queues, Bloom filters, arrays management, merges, value arrays), btree, version tree, cache (extent block cache, page cache, prefetcher, flusher), "extent" layer (extent allocator & mapper, freespace manager), memory manager, block layer. Operations shown: key insert, key get, buffered value insert, buffered value get, range queries.]

  • Opensource (GPLv2, MIT for user libraries)
  • http://bitbucket.org/acunu
  • Loadable Kernel Module, targeting CentOS’s 2.6.18
  • http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/

More: http://goo.gl/gzihe http://goo.gl/wXNDQ

slide-17
SLIDE 17
  • 2. Monitoring
slide-18
SLIDE 18

jQuery VisualVM

slide-19
SLIDE 19

Munin, Nagios etc mx4j: Rest-JMX adapter

slide-20
SLIDE 20
slide-21
SLIDE 21
slide-22
SLIDE 22
  • 3. Operations
slide-23
SLIDE 23
slide-24
SLIDE 24
bash-3.2$ nodetool
...
Available commands:
  ring - Print informations on the token ring
  join - Join the ring
  info - Print node informations (uptime, load, ...)
  cfstats - Print statistics on column families
  version - Print cassandra version
  tpstats - Print usage statistics of thread pools
  drain - Drain the node (stop accepting writes and flush all column families)
  decommission - Decommission the node
  compactionstats - Print statistics on compactions
  disablegossip - Disable gossip (effectively marking the node dead)
  enablegossip - Reenable gossip
  disablethrift - Disable thrift server
  enablethrift - Reenable thrift server
  netstats [host] - Print network information on provided host (connecting node by default)
  move <new token> - Move node on the token ring to a new token
  removetoken status|force|<token> - Show status of current token removal, force completion of pending removal or remove providen token
  setcompactionthroughput <value_in_mb> - Set the MB/s throughput cap for compaction in the system, or 0 to disable throttling.
  snapshot [keyspaces...] -t [snapshotName] - Take a snapshot of the specified keyspaces using optional name snapshotName
  clearsnapshot [keyspaces...] -t [snapshotName] - Remove snapshots for the specified keyspaces. Either remove all snapshots or remove the snapshots with the given name.
  flush [keyspace] [cfnames] - Flush one or more column family
  repair [keyspace] [cfnames] - Repair one or more column family
  cleanup [keyspace] [cfnames] - Run cleanup on one or more column family
  compact [keyspace] [cfnames] - Force a (major) compaction on one or more column family
  scrub [keyspace] [cfnames] - Scrub (rebuild sstables for) one or more column family
  invalidatekeycache [keyspace] [cfnames] - Invalidate the key cache of one or more column family
  invalidaterowcache [keyspace] [cfnames] - Invalidate the key cache of one or more column family
  getcompactionthreshold <keyspace> <cfname> - Print min and max compaction thresholds for a given column family
  cfhistograms <keyspace> <cfname> - Print statistic histograms for a given column family
  setcachecapacity <keyspace> <cfname> <keycachecapacity> <rowcachecapacity> - Set the key and row cache capacities of a given column family
  setcompactionthreshold <keyspace> <cfname> <minthreshold> <maxthreshold> - Set the min and max compaction thresholds for a given column family
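
As an illustration of how the usage lines above translate into invocations, here is a small helper that only builds the argument list for `nodetool snapshot [keyspaces...] -t [snapshotName]`. The function name, the keyspace names and the snapshot name are hypothetical; the sketch does not run nodetool, and options such as the target host are left out on purpose.

```python
def snapshot_args(keyspaces=(), snapshot_name=None, nodetool="nodetool"):
    """Build an argv list following the usage line
    'snapshot [keyspaces...] -t [snapshotName]' from the help text above.

    Nothing is executed here; the list could be handed to subprocess.run()
    on a machine where nodetool is installed.
    """
    args = [nodetool, "snapshot", *keyspaces]
    if snapshot_name:
        args += ["-t", snapshot_name]
    return args


if __name__ == "__main__":
    # "events" and "users" are made-up keyspace names.
    print(" ".join(snapshot_args(["events", "users"], "before-upgrade")))
```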

slide-25
SLIDE 25

S N A P S H O T S *

* And clones!

slide-26
SLIDE 26

v1 v2 v6 v5 v0 v1 v3 v4 v3

slide-27
SLIDE 27

Rebuild

slide-28
SLIDE 28

13 8 9 5 14 2 1 2 3 4 6 7 8 1 3 4 5 6 7 10 11 12 13 15 16 9 10 11 14 5 2 8 9 14 13 12 15 16

Disk Layout: RDA

random duplicate allocation
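
The slide gives only the name of the scheme, so the following is a guess at the idea rather than Castle's actual allocator: assume each block is written twice, to two distinct disks chosen at random. Under that assumption, the surviving copies of a failed disk's blocks are scattered over all the other disks, so a rebuild can read from many disks in parallel. All function names here are illustrative.

```python
import random
from collections import Counter


def rda_layout(num_blocks, num_disks, rng):
    """Assumed scheme: place two copies of each block on two distinct random disks."""
    return {block: tuple(rng.sample(range(num_disks), 2))
            for block in range(num_blocks)}


def rebuild_sources(layout, failed_disk):
    """Count, per surviving disk, how many blocks it must supply to rebuild failed_disk."""
    sources = Counter()
    for first, second in layout.values():
        if first == failed_disk:
            sources[second] += 1
        elif second == failed_disk:
            sources[first] += 1
    return sources


if __name__ == "__main__":
    layout = rda_layout(num_blocks=10_000, num_disks=8, rng=random.Random(0))
    for disk, count in sorted(rebuild_sources(layout, failed_disk=0).items()):
        print(f"disk {disk}: supplies {count} blocks")
```

With a mirrored pair, by contrast, a single surviving disk would have to supply every block of the failed one.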

slide-29
SLIDE 29

Future

slide-30
SLIDE 30

Memcache + Cassandra

Castle H/W Castle H/W

...

Cassandra memcache Cassandra memcache Cass client memcached

get/insert get/put 100k random inserts/sec!

slide-31
SLIDE 31

v16 v24 v13 v1 v15 v12 v13 v16 v24 v13 v1 v15 v12 v13 v16 v24 v13 v1 v15 v12 v13 v16 v24 v13 v1 v15 v12 v13

slide-32
SLIDE 32

Beware the “write cliff”...

~device capacity

slide-33
SLIDE 33
  • Castle: Predictable Performance for Big Data
  • Monitoring: distributed, multi-master tools give you an aggregated and summarised view of your cluster
  • Snapshots & Clones: addressing real problems with new workloads
  • RDA: lightning-fast rebuilds for massive disks

slide-34
SLIDE 34

Questions?

Tom Wilkie
@tom_wilkie
tom@acunu.com

http://bitbucket.org/acunu
http://github.com/acunu
http://www.acunu.com/download
http://www.acunu.com/insights

slide-35
SLIDE 35