Castle: Reinventing Storage for Big Data
Tom Wilkie, Founder & VP Engineering
Before the Flood
1990
- Small databases, BTree indexes
- BTree file systems
- RAID
- Old hardware
Two Revolutions
2010
- Distributed, shared-nothing databases
- Write-optimised indexes
- BTree file systems
- RAID
- New hardware
...
Bridging the Gap
2011
- Distributed, shared-nothing databases
- Castle
- New hardware
...
With Big Data, how do I...
- Snapshots
What’s in the Castle?
[Figure: Castle architecture diagram. Userspace talks to the Acunu Kernel through a shared memory interface (shared buffers; async, shared memory ring). Inside the Acunu Kernel: userspace and kernelspace interfaces (streaming interface, in-kernel workloads; key insert, key get, buffered value insert, buffered value get, range queries); doubling array mapping layer (arrays, insert queues, Bloom filters, arrays management, merges); modlist btree mapping layer (btree, version tree); block mapping & caching layer (cache: flusher, extent block cache, page cache, prefetcher; "Extent" layer: extent allocator & mapper, freespace manager). In the Linux Kernel: Linux's block & MM layers (block layer, memory manager).]
- Open source (GPLv2; MIT for user libraries)
- http://bitbucket.org/acunu
- Loadable kernel module, targeting CentOS’s 2.6.18
- http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/
The Interface
castle_{back,objects}.c
[Figure: Castle architecture diagram with the interface layers highlighted: userspace interface (shared memory interface: shared buffers; async, shared memory ring) and kernelspace interface (streaming interface, in-kernel workloads), offering key insert, key get, buffered value insert, buffered value get and range queries]
Doubling Array
castle_{da,bloom}.c
[Figure: Castle architecture diagram with the doubling array mapping layer highlighted: arrays, insert queues, Bloom filters, arrays management, merges]
                 Update                   Range Query (Size Z)
B-Tree           O(log_B N) random IOs    O(Z/B) random IOs

B = “block size”, say 8KB; at 100 bytes/entry ~= 100 entries
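Doubling Array
Inserts
- Buffer arrays in memory until we have > B of them
[Figure: 2 and 9 are inserted and merged into the sorted array 2 9]
Doubling Array
Inserts
[Figure: 11 and 8 are inserted and merged into 8 11, which then merges with 2 9 into 2 8 9 11]
etc...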
Similar to log-structured merge trees (LSM), cache-oblivious lookahead array (COLA), ...
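The insert slides above can be sketched in a few lines. This is a minimal in-memory toy, assuming level i is either empty or a sorted array of exactly 2^i keys; the class name DoublingArray and its layout are hypothetical, and it ignores the in-memory buffering of the first B entries, values, versions and disk IO that the real castle_da.c has to deal with.

```python
from heapq import merge


class DoublingArray:
    """Toy doubling array: levels[i] is None or a sorted list of 2**i keys."""

    def __init__(self):
        self.levels = []

    def insert(self, key):
        carry = [key]  # every insert starts as a sorted array of size 1
        i = 0
        # Merge equal-sized arrays upwards, like carrying in binary addition.
        while i < len(self.levels) and self.levels[i] is not None:
            carry = list(merge(self.levels[i], carry))
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(None)
        self.levels[i] = carry


da = DoublingArray()
for k in [2, 9, 11, 8]:
    da.insert(k)
print(da.levels)
```

Every merge reads and writes whole arrays front to back, which is why the update cost is counted in sequential rather than random IOs.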
Demo
https://acunu-videos.s3.amazonaws.com/dajs.html
                 Update                        Range Query (Size Z)
B-Tree           O(log_B N) random IOs         O(Z/B) random IOs
Doubling Array   O((log N)/B) sequential IOs

B = “block size”, say 8KB; at 100 bytes/entry ~= 100 entries
Doubling Array
Queries
- Add an index to each array to do lookups
- query(k) searches each array independently
[Figure: query(k) probing each array]
Doubling Array
Queries
- Bloom filters can help exclude arrays from search
- ... but don’t help with range queries
[Figure: query(k) with per-array Bloom filters]
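A rough sketch of the point query described on the two Queries slides: each array carries its own Bloom filter, and query(k) only searches arrays whose filter says the key might be present. The Bloom class, its sizes, and the build and query helpers are all illustrative assumptions, not the castle_bloom.c implementation.

```python
import bisect
import hashlib


class Bloom:
    """Tiny Bloom filter; the sizes are illustrative, not tuned."""

    def __init__(self, nbits=1024, nhashes=3):
        self.nbits, self.nhashes, self.bits = nbits, nhashes, 0

    def _positions(self, key):
        for i in range(self.nhashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


def build(pairs):
    """Pair a sorted array of (key, value) tuples with a Bloom filter over its keys."""
    bloom = Bloom()
    for k, _ in pairs:
        bloom.add(k)
    return sorted(pairs), bloom


def query(arrays, key):
    """Check arrays newest-first; return (value, number of arrays actually searched)."""
    searched = 0
    for array, bloom in arrays:
        if not bloom.might_contain(key):
            continue  # definitely absent, so this array is never touched
        searched += 1
        i = bisect.bisect_left(array, (key,))
        if i < len(array) and array[i][0] == key:
            return array[i][1], searched
    return None, searched


arrays = [build([(8, "h"), (11, "k")]), build([(2, "b"), (9, "i")])]
value, searched = query(arrays, 9)
print(value, searched)
```

A range query cannot use the filters, because a Bloom filter answers only "might this exact key be here?"; every array still has to be scanned for the range.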
                 Update                        Range Query (Size Z)
B-Tree           O(log_B N) random IOs         O(Z/B) random IOs
Doubling Array   O((log N)/B) sequential IOs   O(Z/B) sequential IOs

B = “block size”, say 8KB; at 100 bytes/entry ~= 100 entries

Worked example, N = 2^30 entries:
B-Tree:         ~ log(2^30)/log 100 = 5 IOs/update; 8KB @ 100MB/s, w/ 8ms seek = 100 IOs/s; 100 / 5 = 20 updates/s
Doubling Array: ~ log(2^30)/100 = 0.2 IOs/update; 8KB @ 100MB/s = 13k IOs/s; 13k / 0.2 = 65k updates/s
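The back-of-envelope numbers above can be reproduced directly. This sketch assumes natural logarithms, N = 2^30 entries, and a seek of 8ms added to the transfer time of one 8KB block; the variable names are made up for illustration. The unrounded results land close to, but not exactly on, the rounded figures on the slide.

```python
import math

N = 2**30                # entries, as in the slide's log(2^30)
B = 100                  # entries per 8KB block
BLOCK_BYTES = 8 * 1024
BANDWIDTH = 100e6        # 100MB/s
SEEK = 0.008             # 8ms

btree_ios_per_update = math.log(N) / math.log(B)  # log_B N random IOs
da_ios_per_update = math.log(N) / B               # (log N)/B sequential IOs

seq_ios_per_sec = BANDWIDTH / BLOCK_BYTES
rand_ios_per_sec = 1 / (SEEK + BLOCK_BYTES / BANDWIDTH)

btree_updates_per_sec = rand_ios_per_sec / btree_ios_per_update
da_updates_per_sec = seq_ios_per_sec / da_ios_per_update

print(f"B-Tree: {btree_ios_per_update:.1f} IOs/update, {btree_updates_per_sec:.0f} updates/s")
print(f"DA:     {da_ios_per_update:.2f} IOs/update, {da_updates_per_sec:.0f} updates/s")
```

Whatever the exact rounding, the gap between the two update rates is three orders of magnitude.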
“Mod-list” B-Tree
castle_{btree,versions}.c
[Figure: Castle architecture diagram with the modlist btree mapping layer highlighted: btree, version tree; range queries, key get, key insert]
Copy-on-Write BTree
Idea:
- Apply path-copying [DSST] to the B-tree
Problems:
- Space blowup: each update may rewrite an entire path
- Slow updates: as above
A log file system makes updates sequential, but relies on random access and garbage collection (Achilles heel!)

Nv = #keys live (accessible) at version v

               Update                    Range Query          Space
CoW B-Tree     O(log_B Nv) random IOs    O(Z/B) random IOs    O(N B log_B Nv)
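Path-copying is easiest to see on a small tree. The sketch below uses a plain binary search tree as a stand-in for the B-tree (a simplification, not what Castle does): an update copies every node on the root-to-key path and shares everything else with the previous version. Node, insert and get are hypothetical names.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Return the root of a new version; only the root-to-key path is copied."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return root._replace(left=insert(root.left, key, value))
    if key > root.key:
        return root._replace(right=insert(root.right, key, value))
    return root._replace(value=value)


def get(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


v1 = insert(insert(insert(None, 5, "a"), 2, "b"), 8, "c")
v2 = insert(v1, 8, "d")  # new version: copies the path 5 -> 8, shares node 2
print(get(v1, 8), get(v2, 8), v2.left is v1.left)
```

In a B-tree each copied node is a whole block, which is where the space blowup and the slow updates listed above come from.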
“BigTable” snapshots
- Inserts produce arrays
- Snapshots increment ref counts on arrays
- Merges produce more arrays, decrement ref count on old arrays
- Space blowup
[Figure: arrays a, b and c annotated with ref counts and shared between versions v1 and v2; after a merge, the new array a b c exists alongside the old arrays a and b]
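The bullets above can be played through in a toy model. This sketch assumes immutable arrays shared between versions by reference count; the Store class and its method names are hypothetical and only illustrate why space grows, not how any real system lays out its files.

```python
class Store:
    """Toy 'BigTable'-style store: immutable arrays shared by ref count."""

    def __init__(self):
        self.refs = {}                # array id -> reference count
        self.data = {}                # array id -> sorted tuple of keys
        self.versions = {"v1": []}    # version name -> list of array ids
        self._next = 0

    def _new_array(self, keys):
        aid = self._next
        self._next += 1
        self.data[aid], self.refs[aid] = tuple(sorted(keys)), 1
        return aid

    def _unref(self, aid):
        self.refs[aid] -= 1
        if self.refs[aid] == 0:       # no version can see it: reclaim the space
            del self.refs[aid], self.data[aid]

    def insert(self, version, keys):
        self.versions[version].append(self._new_array(keys))

    def snapshot(self, src, dst):
        self.versions[dst] = list(self.versions[src])
        for aid in self.versions[dst]:
            self.refs[aid] += 1

    def merge(self, version):
        old = self.versions[version]
        merged = self._new_array(k for aid in old for k in self.data[aid])
        self.versions[version] = [merged]
        for aid in old:
            self._unref(aid)

    def space(self):
        return sum(len(keys) for keys in self.data.values())


s = Store()
s.insert("v1", ["a"])
s.insert("v1", ["b"])
s.snapshot("v1", "v2")
s.insert("v2", ["c"])
s.merge("v2")
print(s.space())  # a and b are now stored twice: once for v1, once inside v2's merged array
```

Only array c is reclaimed by the merge; a and b are still referenced by v1, so three distinct keys occupy five slots.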
Nv = #keys live (accessible) at version v

                      Update                        Range Query             Space
CoW B-Tree            O(log_B Nv) random IOs        O(Z/B) random IOs       O(N B log_B Nv)
“BigTable” style DA   O((log N)/B) sequential IOs   O(Z/B) sequential IOs   O(VN)
“Mod-list” BTree
Idea:
- Apply fat-nodes [DSST] to the B-tree
- i.e. insert (key, version, value) tuples, with special operations
Problems:
- Similar performance to a BTree
If you limit the #versions, it can be constructed sequentially and embedded into a DA
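A minimal sketch of the (key, version, value) idea, assuming a small hypothetical version tree in which v1 descends from v0 and v2 from v1. A lookup at a version returns the value written at that version or, failing that, at its nearest ancestor. PARENT, ENTRIES and get are illustrative names, not Castle's API.

```python
import bisect

# Hypothetical version tree: child -> parent (v0 is the root).
PARENT = {"v0": None, "v1": "v0", "v2": "v1"}

# Fat-node style entries, kept sorted by key: (key, version, value).
ENTRIES = sorted([
    ("k1", "v0", "a"),
    ("k1", "v2", "b"),
    ("k2", "v1", "c"),
])


def get(key, version):
    """Return the value for `key` as seen from `version`, or None."""
    lo = bisect.bisect_left(ENTRIES, (key,))
    written = {}
    for k, v, value in ENTRIES[lo:]:
        if k != key:
            break
        written[v] = value          # all versions of one key sit together
    while version is not None:      # walk up the version tree
        if version in written:
            return written[version]
        version = PARENT[version]
    return None


print(get("k1", "v2"), get("k1", "v1"), get("k2", "v0"))
```

Because all versions of a key are adjacent in sort order, such entries can be written out sequentially, which is what makes it possible to embed them in a doubling array.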
Nv = #keys live (accessible) at version v

                      Update                        Range Query             Space
CoW B-Tree            O(log_B Nv) random IOs        O(Z/B) random IOs       O(N B log_B Nv)
“BigTable” style DA   O((log N)/B) sequential IOs   O(Z/B) sequential IOs   O(VN)
“Mod-list” in a DA    O((log N)/B) sequential IOs   O(Z/B) sequential IOs   O(N)

[Slide annotations: CASTLE, LevelDB]
Stratified BTree
Problem: Embedded “Mod-list” #versions limit
Solution: Version-split arrays during merges
[Figure: version tree v0, v1, v2. The figure labels a newer and an older array of (key, version) entries over keys k1 to k5, a merge step (duplicates removed), and a v-split step producing a {v2} array and a {v1,v0} array, with the annotation "v0 entries here are duplicates".]
Disk Layout: RDA
castle_{cache,extent,freespace,rebuild}.c
[Figure: Castle architecture diagram with the block mapping & caching layer highlighted: "Extent" layer (extent allocator & mapper, freespace manager) and cache (flusher, extent block cache, page cache, prefetcher), above Linux's block & MM layers]
Disk Layout: RDA
random duplicate allocation
[Figure: numbered blocks 1 to 16 spread across disks, with blocks appearing more than once]
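As a rough illustration of random duplicate allocation, assuming the scheme in [RDA] where every block is stored on two distinct, randomly chosen disks and a read goes to the less loaded copy. The function names, the seed and the queue-length model are assumptions made for this sketch, not Castle's extent allocator.

```python
import random


def place_blocks(num_blocks, num_disks, copies=2, seed=0):
    """Map each block number to `copies` distinct, randomly chosen disks."""
    rng = random.Random(seed)
    return {b: rng.sample(range(num_disks), copies)
            for b in range(1, num_blocks + 1)}


def pick_disk(layout, block, queue_len):
    """Read from whichever replica's disk currently has the shortest queue."""
    return min(layout[block], key=lambda d: queue_len[d])


layout = place_blocks(num_blocks=16, num_disks=4)
queue_len = {0: 5, 1: 0, 2: 3, 3: 1}   # hypothetical outstanding IOs per disk
chosen = pick_disk(layout, 1, queue_len)
print(layout[1], chosen)
```

Having two random homes per block gives the reader a choice on every IO, which evens out load across disks and leaves a second copy to rebuild from if a disk is lost.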
Performance Comparison
Small random inserts
Inserting 3 billion rows
[Figure: plot comparing Acunu powered Cassandra with ‘standard’ Cassandra]
Insert latency
While inserting 3 billion rows
[Figure: plot comparing Acunu powered Cassandra (x) with ‘standard’ Cassandra (+)]
Small random range queries
Performed immediately after inserts
[Figure: plot comparing Acunu powered Cassandra with ‘standard’ Cassandra]
Performance summary

                              Standard   Acunu     Benefits
inserts         rate          ~32k/s     ~45k/s    >1.4x
                95% latency   ~32s       ~0.3s     >100x
gets            rate          ~100/s     ~350/s    >3.5x
                95% latency   ~2s        ~0.5s     >4x
range queries   rate          ~0.4/s     ~40/s     >100x
                95% latency   ~15s       ~2s       >7.5x
- Castle: like BDB, but for Big Data
- DA: transforms random IO into sequential IO
- Snapshots & Clones: addressing real problems with new workloads
- 2 orders of magnitude better performance and predictability
Questions?
Tom Wilkie
@tom_wilkie
tom@acunu.com
http://bitbucket.org/acunu
http://www.acunu.com/download
http://www.acunu.com/insights
References
[LSM] Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil, The Log-Structured Merge-Tree (LSM-Tree). http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf
[COLA] Michael A. Bender et al, Cache-Oblivious Streaming B-trees. http://www.cs.sunysb.edu/~bender/newpub/BenderFaFi07.pdf
[DSST] J. R. Driscoll, N. Sarnak, D. D. Sleator, R. E. Tarjan, Making Data Structures Persistent, Journal of Computer and System Sciences, Vol. 38, No. 1, 1989. http://www.cs.cmu.edu/~sleator/papers/making-data-structures-persistent.pdf
Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes, Tom Wilkie, Stratified B-trees and versioned dictionaries, HotStorage’11. http://www.usenix.org/event/hotstorage11/tech/final_files/Twigg.pdf
[RDA] Joep Aerts, Jan Korst, Sebastian Egner, Random duplicate storage strategies for load balancing in multimedia servers, 2000. http://www.win.tue.nl/~joep/IPL.ps

Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.