Castle: Reinventing Storage for Big Data

slide-1
SLIDE 1

Tom Wilkie Founder & VP Engineering

Castle: Reinventing Storage for Big Data

slide-2
SLIDE 2

Before the Flood

1990

  • Small databases
  • BTree indexes
  • BTree file systems
  • RAID
  • Old hardware

slide-3
SLIDE 3

Two Revolutions

2010

  • Distributed, shared-nothing databases
  • Write-optimised indexes
  • BTree file systems
  • RAID
  • New hardware

slide-4
SLIDE 4

Bridging the Gap

2011

  • Distributed, shared-nothing databases
  • Castle
  • New hardware

slide-5
SLIDE 5

With Big Data, how do I...

SNAPSHOTS

slide-6
SLIDE 6

What’s in the Castle?

slide-7
SLIDE 7

[Architecture diagram: the Acunu kernel between userspace and the Linux kernel. Layers: userspace interface, kernelspace interface, doubling array mapping layer, modlist btree mapping layer, block mapping & caching layer, Linux's block & MM layers. Components shown: shared memory interface (async shared memory ring, shared buffers; keys, values), streaming interface (key insert, key get, buffered value get, buffered value insert, range queries), in-kernel workloads, doubling arrays (arrays, insert queues, Bloom filters, merges, arrays management), btree and version tree (key insert, key get, range queries), cache (flusher, extent block cache, page cache, prefetcher), "extent" layer (extent allocator & mapper, freespace manager), block layer, memory manager.]

slide-8
SLIDE 8

[Architecture diagram: the Acunu kernel between userspace and the Linux kernel, with the userspace and kernelspace interfaces, the doubling array mapping layer, the modlist btree mapping layer, the block mapping & caching layer, and Linux's block & MM layers.]

Castle

  • Open source (GPLv2; MIT for user libraries)
  • http://bitbucket.org/acunu
  • Loadable kernel module, targeting CentOS’s 2.6.18
  • http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/

slide-9
SLIDE 9

The Interface

castle_{back,objects}.c

[Architecture diagram with the interface highlighted: userspace interface, kernelspace interface, shared memory interface (async shared memory ring, shared buffers; keys, values), streaming interface (key insert, key get, buffered value get, buffered value insert, range queries), in-kernel workloads.]

slide-10
SLIDE 10

Doubling Array

castle_{da,bloom}.c

[Architecture diagram with the doubling array mapping layer highlighted: arrays, insert queues, Bloom filters, merges, arrays management; key insert, key get, range queries.]

slide-11
SLIDE 11

            Update                     Range Query (Size Z)
B-Tree      O(log_B N) random IOs      O(Z/B) random IOs

B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries

slide-12
SLIDE 12

Doubling Array

Inserts

  • Buffer arrays in memory until we have > B of them

[Figure: keys 2 and 9 inserted, forming the array (2, 9)]

slide-13
SLIDE 13

Doubling Array

Inserts

[Figure: new keys 11 and 8 form the sorted array (8, 11), which is merged with (2, 9) to give (2, 8, 9, 11); etc...]

Similar to log-structured merge trees (LSM), cache-oblivious lookahead array (COLA), ...
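
The insert-and-merge behaviour in the figure can be sketched roughly as follows. This is a minimal in-memory illustration, not Castle's implementation: the class name `DoublingArray` and the `levels` list are hypothetical, the in-memory buffering of B arrays is left out, and keys are assumed distinct.

```python
from heapq import merge


class DoublingArray:
    """Toy doubling array: levels[i] is None or a sorted list of 2**i keys."""

    def __init__(self):
        self.levels = []

    def insert(self, key):
        carry = [key]
        for i, level in enumerate(self.levels):
            if level is None:
                # Free slot: the carried array settles here.
                self.levels[i] = carry
                return
            # Two arrays of equal size: merge them sequentially and carry on.
            carry = list(merge(level, carry))
            self.levels[i] = None
        self.levels.append(carry)


da = DoublingArray()
for k in (2, 9, 11, 8):
    da.insert(k)
print(da.levels)
```

Each merge reads two sorted arrays and writes one, which is why the IO pattern is sequential rather than random.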
slide-14
SLIDE 14

https://acunu-videos.s3.amazonaws.com/dajs.html

Demo

slide-15
SLIDE 15

                  Update                          Range Query (Size Z)
B-Tree            O(log_B N) random IOs           O(Z/B) random IOs
Doubling Array    O((log N)/B) sequential IOs

B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries

slide-16
SLIDE 16

Doubling Array

Queries

  • Add an index to each array to do lookups
  • query(k) searches each array independently
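
A point query over several arrays might look like the sketch below. It assumes each array is a sorted list of (key, value) pairs and that arrays are ordered newest first; binary search stands in for the per-array index, and the names are illustrative only.

```python
from bisect import bisect_left


def query(arrays, k):
    """Search each sorted array in turn; the first (newest) hit wins."""
    for arr in arrays:
        # (k,) sorts just before any (k, value) pair.
        i = bisect_left(arr, (k,))
        if i < len(arr) and arr[i][0] == k:
            return arr[i][1]
    return None


arrays = [
    [(8, "new")],                         # newest array
    [(2, "a"), (8, "old"), (9, "c")],     # older array
]
print(query(arrays, 8), query(arrays, 9), query(arrays, 5))
```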

slide-17
SLIDE 17

Doubling Array

Queries

  • Bloom filters can help exclude arrays from search
  • ... but don’t help with range queries
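
As a rough illustration of how a Bloom filter lets a point query skip arrays, here is a small sketch. The filter size, hash count and the use of SHA-256 are arbitrary choices for the example, and a plain dict stands in for each on-disk array; none of this is taken from castle_bloom.c.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def query_with_filters(arrays, k):
    """arrays: (filter, dict) pairs, newest first. Returns (value, arrays_searched)."""
    searched = 0
    for bloom, data in arrays:
        if not bloom.might_contain(k):
            continue  # array excluded without touching its data
        searched += 1
        if k in data:
            return data[k], searched
    return None, searched


def build(data):
    bloom = BloomFilter()
    for key in data:
        bloom.add(key)
    return bloom, data


arrays = [build({8: "h", 11: "k"}), build({2: "b", 9: "i"})]
value, searched = query_with_filters(arrays, 9)
```

A range query cannot use the filters because it has no single key to test, so every array must still be scanned for the range.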

slide-18
SLIDE 18

B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries

                  Update                          Range Query (Size Z)
B-Tree            O(log_B N) random IOs           O(Z/B) random IOs
Doubling Array    O((log N)/B) sequential IOs     O(Z/B) sequential IOs

Worked example:

                  IOs per update                     IOs per second                                Updates per second
B-Tree            ~ log(2^30)/log 100 = 5            8KB @ 100MB/s, w/ 8ms seek = 100 IOs/s        100 / 5 = 20 updates/s
Doubling Array    ~ log(2^30)/100 = 0.2              8KB @ 100MB/s = 13k IOs/s                     13k / 0.2 = 65k updates/s
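
The back-of-envelope numbers above can be reproduced with a few lines. This sketch assumes natural logarithms and 100MB = 100 * 2^20 bytes, which is what makes the figures come out close to the slide's rounded values; the variable names are illustrative.

```python
import math

N = 2 ** 30                   # number of entries
B = 100                       # entries per 8KB block
block_bytes = 8 * 2 ** 10
bandwidth = 100 * 2 ** 20     # bytes per second
seek_s = 0.008                # 8ms seek per random IO

btree_ios_per_update = math.log(N) / math.log(B)   # roughly 4.5, rounded to 5 on the slide
da_ios_per_update = math.log(N) / B                # roughly 0.2

seq_iops = bandwidth / block_bytes                          # sequential 8KB IOs per second
rand_iops = 1 / (seek_s + block_bytes / bandwidth)          # seek-bound 8KB IOs per second

btree_updates_per_s = rand_iops / btree_ios_per_update
da_updates_per_s = seq_iops / da_ios_per_update

print(round(btree_updates_per_s), round(da_updates_per_s))
```

The unrounded results differ slightly from the slide's 20 and 65k, but the gap of more than three orders of magnitude is the same.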

slide-19
SLIDE 19

Doubling Array

castle_{da,bloom}.c

[Architecture diagram with the doubling array mapping layer highlighted: arrays, insert queues, Bloom filters, merges, arrays management; key insert, key get, range queries.]

slide-20
SLIDE 20

“Mod-list” B-Tree

castle_{btree,versions}.c

[Architecture diagram with the modlist btree mapping layer highlighted: btree, version tree; key insert, key get, range queries.]

slide-21
SLIDE 21

Copy-on-Write BTree

Idea:

  • Apply path-copying [DSST] to the B-tree

Problems:

  • Space blowup: each update may rewrite an entire path
  • Slow updates: as above

A log file system makes updates sequential, but relies on random access and garbage collection (Achilles heel!)
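
To make the space blowup concrete, here is a small sketch of path copying. For brevity it uses a plain binary search tree rather than a B-tree, so it illustrates the technique from [DSST] and not Castle's code; `Node` and `insert` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key):
    """Return (new_root, nodes_allocated). The old root remains a valid version."""
    if root is None:
        return Node(key), 1
    if key < root.key:
        child, n = insert(root.left, key)
        return Node(root.key, child, root.right), n + 1
    if key > root.key:
        child, n = insert(root.right, key)
        return Node(root.key, root.left, child), n + 1
    return root, 0


v1 = None
for k in (5, 2, 8):
    v1, _ = insert(v1, k)

# Adding one key to a new version rewrites the whole root-to-leaf path.
v2, allocated = insert(v1, 1)
print(allocated)
```

Untouched subtrees are shared between versions, but every update still allocates one new node per level of the tree.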

slide-22
SLIDE 22

Nv = #keys live (accessible) at version v

              Update                     Range Query          Space
CoW B-Tree    O(log_B Nv) random IOs     O(Z/B) random IOs    O(N B log_B Nv)

slide-23
SLIDE 23

“BigTable” snapshots

  • Inserts produce arrays

[Figure: version v1 referencing arrays a and b, each with ref count 1]

slide-24
SLIDE 24

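“BigTable” snapshots

  • Inserts produce arrays
  • Snapshots increment ref counts on arrays
  • Merges produce more arrays, decrement ref count on old arrays

[Figure: versions v1 and v2; arrays a and b go from ref count 1 to ref count 2 after the snapshot, and a new array c has ref count 1]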

slide-25
SLIDE 25

“BigTable” snapshots

  • Inserts produce arrays
  • Snapshots increment ref counts on arrays
  • Merges produce more arrays, decrement ref count on old arrays

[Figure: versions v1 and v2 with ref-counted arrays a and b and the merged array a b c]

slide-26
SLIDE 26

“BigTable” snapshots

  • Inserts produce arrays
  • Snapshots increment ref counts on arrays
  • Merges produce more arrays, decrement ref count on old arrays
  • Space blowup

[Figure: versions v1 and v2 with ref-counted arrays a and b and the merged array a b c]
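
The ref-counting scheme in the bullets, and the space blowup it leads to, can be sketched as below. This is a toy model under stated assumptions: `Array` and `Store` are hypothetical names, arrays are dicts, and an array is only freed when no version references it.

```python
class Array:
    def __init__(self, entries):
        self.entries = dict(entries)
        self.refs = 0


class Store:
    def __init__(self):
        self.versions = {}   # version name -> list of Arrays, oldest first
        self.live = []       # arrays still referenced by some version

    def _ref(self, arr):
        arr.refs += 1
        if arr not in self.live:
            self.live.append(arr)

    def _unref(self, arr):
        arr.refs -= 1
        if arr.refs == 0:
            self.live.remove(arr)

    def insert(self, version, entries):
        arr = Array(entries)
        self._ref(arr)
        self.versions.setdefault(version, []).append(arr)

    def snapshot(self, src, dst):
        # A snapshot shares every array and bumps its ref count.
        self.versions[dst] = list(self.versions[src])
        for arr in self.versions[dst]:
            self._ref(arr)

    def merge(self, version):
        old = self.versions[version]
        merged = {}
        for arr in old:
            merged.update(arr.entries)   # later arrays win
        new = Array(merged)
        self._ref(new)
        self.versions[version] = [new]
        for arr in old:
            self._unref(arr)

    def stored_entries(self):
        return sum(len(a.entries) for a in self.live)


store = Store()
store.insert("v1", {"a": 1})
store.insert("v1", {"b": 2})
store.snapshot("v1", "v2")
store.insert("v2", {"c": 3})
store.merge("v2")
print(store.stored_entries())
```

After the merge, v1 still pins arrays a and b while v2 holds a full merged copy, so five entries are stored for three distinct keys.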

slide-27
SLIDE 27

Nv = #keys live (accessible) at version v

                       Update                          Range Query              Space
CoW B-Tree             O(log_B Nv) random IOs          O(Z/B) random IOs        O(N B log_B Nv)
“BigTable” style DA    O((log N)/B) sequential IOs     O(Z/B) sequential IOs    O(VN)

slide-28
SLIDE 28

“Mod-list” BTree

Idea:

  • Apply fat-nodes [DSST] to the B-tree
  • i.e. insert (key, version, value) tuples, with special operations

Problems:

  • Similar performance to a BTree

If you limit the #versions, it can be constructed sequentially and embedded into a DA
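
One way to picture the (key, version, value) tuples is the sketch below: a lookup at a given version walks up the version tree until it finds an entry for the key. The slide does not spell out the "special operations", so the `parent` map, the sample entries and the `get` function are assumptions made for illustration.

```python
from bisect import bisect_left

# Hypothetical version tree: v0 <- v1 <- v2 (child -> parent).
parent = {0: None, 1: 0, 2: 1}

# One sorted sequence of (key, version, value) tuples.
entries = sorted([
    ("k1", 0, "a"),
    ("k1", 2, "c"),
    ("k2", 1, "b"),
])


def get(key, version):
    """Return the value visible at `version`: its own entry or the nearest ancestor's."""
    v = version
    while v is not None:
        i = bisect_left(entries, (key, v))
        if i < len(entries) and entries[i][:2] == (key, v):
            return entries[i][2]
        v = parent[v]
    return None


print(get("k1", 1), get("k1", 2), get("k2", 0))
```

Because all versions of a key sit next to each other in one sorted sequence, an unchanged key is stored once and shared by every descendant version.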

slide-29
SLIDE 29

Nv = #keys live (accessible) at version v

                       Update                          Range Query              Space
CoW B-Tree             O(log_B Nv) random IOs          O(Z/B) random IOs        O(N B log_B Nv)
“BigTable” style DA    O((log N)/B) sequential IOs     O(Z/B) sequential IOs    O(VN)
“Mod-list” in a DA     O((log N)/B) sequential IOs     O(Z/B) sequential IOs    O(N)

CASTLE

LevelDB

slide-30
SLIDE 30

Stratified BTree

Problem: Embedded “Mod-list” #versions limit

Solution: Version-split arrays during merges

[Figure: version tree v0, v1, v2. A newer and an older array, holding entries for keys k1 to k5 at versions v0, v1 and v2, are merged (duplicates removed). The merged array is then v-split into one array for {v2} and one for {v1,v0}. Annotation on the split output: "v0 entries here are duplicates".]

slide-31
SLIDE 31

“Mod-list” B-Tree

castle_{btree,versions}.c

[Architecture diagram with the modlist btree mapping layer highlighted: btree, version tree; key insert, key get, range queries.]

slide-32
SLIDE 32

Disk Layout: RDA

castle_{cache,extent,freespace,rebuild}.c

[Architecture diagram with the block mapping & caching layer highlighted: cache (flusher, extent block cache, page cache, prefetcher) and "extent" layer (extent allocator & mapper, freespace manager), above Linux's block & MM layers (block layer, memory manager).]

slide-33
SLIDE 33

Disk Layout: RDA

random duplicate allocation

[Figure: chunks numbered 1 to 16 spread across disks, with chunks appearing on more than one disk]
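
The general idea of random duplicate allocation, as described in [RDA], is to place each chunk on two different disks chosen at random and to serve each read from the less loaded copy. The sketch below illustrates that idea only; the function names, the four-disk setup and the load counter are assumptions, not Castle's extent allocator.

```python
import random


def allocate(num_chunks, num_disks, copies=2, seed=0):
    """Map each chunk to `copies` distinct, randomly chosen disks."""
    rng = random.Random(seed)
    return {c: rng.sample(range(num_disks), copies) for c in range(num_chunks)}


def pick_replica(placement, chunk, load):
    """Serve a read from whichever copy sits on the less loaded disk."""
    disk = min(placement[chunk], key=lambda d: load[d])
    load[disk] += 1
    return disk


placement = allocate(num_chunks=16, num_disks=4)
load = [0] * 4
for chunk in placement:
    pick_replica(placement, chunk, load)
print(load)
```

With two copies per chunk, the loss of one disk leaves every chunk readable, and the choice between copies helps even out load across disks.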

slide-34
SLIDE 34

Performance Comparison

slide-35
SLIDE 35

Small random inserts Inserting 3 billion rows

Acunu powered Cassandra - ‘standard’ Cassandra -

slide-36
SLIDE 36

Insert latency

While inserting 3 billion rows

Acunu powered Cassandra x ‘standard’ Cassandra +

slide-37
SLIDE 37

Small random range queries

Performed immediately after inserts

Acunu powered Cassandra - ‘standard’ Cassandra -

slide-38
SLIDE 38

Performance summary

                               Standard    Acunu      Benefits
inserts          rate          ~32k/s      ~45k/s     >1.4x
                 95% latency   ~32s        ~0.3s      >100x
gets             rate          ~100/s      ~350/s     >3.5x
                 95% latency   ~2s         ~0.5s      >4x
range queries    rate          ~0.4/s      ~40/s      >100x
                 95% latency   ~15s        ~2s        >7.5x

slide-39
SLIDE 39

  • Castle: like BDB, but for Big Data
  • DA: transforms random IO into sequential IO
  • Snapshots & Clones: addressing real problems with new workloads
  • 2 orders of magnitude better performance and predictability

slide-40
SLIDE 40

Questions?

Tom Wilkie
@tom_wilkie
tom@acunu.com
http://bitbucket.org/acunu
http://www.acunu.com/download
http://www.acunu.com/insights

slide-41
SLIDE 41

References

[LSM] Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil. The Log-Structured Merge-Tree (LSM-Tree). http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf

[COLA] Michael A. Bender et al. Cache-Oblivious Streaming B-trees. http://www.cs.sunysb.edu/~bender/newpub/BenderFaFi07.pdf

[DSST] J. R. Driscoll, N. Sarnak, D. D. Sleator, R. E. Tarjan. Making Data Structures Persistent. Journal of Computer and System Sciences, Vol. 38, No. 1, 1989. http://www.cs.cmu.edu/~sleator/papers/making-data-structures-persistent.pdf

Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes, Tom Wilkie. Stratified B-trees and versioned dictionaries. HotStorage’11. http://www.usenix.org/event/hotstorage11/tech/final_files/Twigg.pdf

[RDA] Joep Aerts, Jan Korst, Sebastian Egner. Random duplicate storage strategies for load balancing in multimedia servers. 2000. http://www.win.tue.nl/~joep/IPL.ps

Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.