Monkey: Optimal Navigable Key-Value Store
Niv Dayan, Manos Athanassoulis, Stratos Idreos (PowerPoint PPT Presentation)

Motivation: storage keeps getting cheaper (price per GB falls over time), while workloads grow ever more insert- and update-heavy. This created the need for write-optimized database structures: the LSM-tree, invented in 1996, now underlies most modern key-value stores.

LSM-tree Key-Value Stores
What are they really?
LSM-tree structure: updates accumulate in an in-memory buffer; when the buffer fills, it is sorted and flushed to storage as a run at level 1. Runs are sort-merged downward through levels 1, 2, 3, … of exponentially increasing capacities, giving O(log(N)) levels.

A lookup for key X probes the runs from the smallest level to the largest. In-memory fence pointers bound the cost to one I/O per run, and an in-memory Bloom filter per run can skip runs entirely: a true negative skips the run for free, a false positive wastes one I/O, and a true positive spends the I/O that finds the key.
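The structure just described can be sketched in a few lines. This is a toy model, not the paper's implementation; all names (`ToyLSMTree`, `size_ratio`, and so on) are illustrative, and leveling (one run per level) is assumed:

```python
class ToyLSMTree:
    """Toy LSM-tree: an in-memory write buffer plus storage levels of
    exponentially increasing capacity (leveling: one sorted run per level)."""

    def __init__(self, buffer_capacity=4, size_ratio=2):
        self.buffer = {}                 # in-memory write buffer
        self.buffer_capacity = buffer_capacity
        self.T = size_ratio              # capacity ratio between adjacent levels
        self.levels = []                 # levels[i] holds one sorted run

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_capacity:
            self._flush()

    def _flush(self):
        # sort the buffer and push it into level 0 (the smallest level)
        run = sorted(self.buffer.items())
        self.buffer.clear()
        self._merge_into(0, run)

    def _merge_into(self, i, run):
        if i == len(self.levels):
            self.levels.append([])
        merged = dict(self.levels[i])    # older, resident entries
        merged.update(dict(run))         # newer entries win on key collisions
        self.levels[i] = sorted(merged.items())
        capacity = self.buffer_capacity * self.T ** (i + 1)
        if len(self.levels[i]) > capacity:
            # level is full: sort-merge it into the next, bigger level
            self._merge_into(i + 1, self.levels[i])
            self.levels[i] = []

    def get(self, key):
        if key in self.buffer:           # newest data first
            return self.buffer[key]
        for run in self.levels:          # then smallest to largest level
            for k, v in run:             # a real store uses fence pointers
                if k == key:             # and Bloom filters here
                    return v
        return None
```

With `size_ratio = 2` and a 4-entry buffer, level i holds up to 4·2^(i+1) entries, so the number of levels grows as O(log(N)).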
Performance & cost tradeoffs:
- bigger filters, fewer false positives (memory vs. lookups)
- more merging, fewer runs (lookups vs. updates)
Existing systems occupy fixed points in the three-way trade-off between lookup cost, update cost, and main memory: merging more or less moves a system along the lookup/update curve, and allocating more or less memory moves it along the lookup/memory curve. For fixed memory, their curves lie off the Pareto frontier.

Problem 1: suboptimal allocation of memory among the Bloom filters.
Problem 2: hard to tune for max throughput, since Bloom filter size trades lookups vs. memory while merge policy greed trades lookups vs. updates.
Monkey: Optimal Navigable Key-Value Store

Observations: existing LSM-trees fix the same false positive rate for every level's filter, even though lookup cost = ∑ pi; this filter allocation is suboptimal.
Insights: optimizing the allocation gives an asymptotically better memory vs. lookups trade-off, and the merge policy spans a design continuum from log to sorted array.
Steps: model the trade-offs, optimize the filter allocation, and navigate the updates vs. lookups trade-off to answer what-if design questions about memory, lookup, and update performance.

For fixed memory, existing systems sit at scattered points of the update cost vs. lookup cost space: WiredTiger; Cassandra and HBase; RocksDB and LevelDB. Monkey reaches the Pareto frontier and picks the point of max throughput on it.
Memory holds the buffer, the fence pointers, and the Bloom filters; storage holds the data runs. Each filter is allocated X bits per entry. For a filter with M bits over N entries, the false positive rate is

    p = e^(-(M/N) · ln(2)²)
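The formula can be checked numerically. A small sketch (the function name is illustrative):

```python
import math

def false_positive_rate(m_bits, n_entries):
    """Standard Bloom filter approximation with optimal hash count:
    p = e^(-(M/N) * ln(2)^2)."""
    return math.exp(-(m_bits / n_entries) * math.log(2) ** 2)

# 10 bits per entry gives roughly a 0.8% false positive rate
p = false_positive_rate(m_bits=10_000_000, n_entries=1_000_000)
```

Doubling the bits per entry squares the rate: at 20 bits per entry, p drops to roughly 0.007%.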
Worst-case lookup I/O overhead: O(∑ pi) = O(∑ e^(-M/N)). With O(log(N)) levels, each contributing the same false positive rate, this is

    O(log(N) · e^(-M/N))

Can we do better?
A lookup for key X may suffer a false positive, and thus a wasted I/O, at every run it probes. The filters at the largest level consume the most memory, yet they can save at most 1 I/O per lookup. Reallocating some of that memory to the filters of smaller levels keeps the same memory footprint while cutting lookup I/Os.
Monkey relaxes the false positive rates, 0 < p0 < 1, 0 < p1 < 1, 0 < p2 < 1, …, and models both lookup cost and memory footprint in terms of them:

    lookup cost = ∑ pi

    bits(p, n) = -n · ln(p) / ln(2)²

    memory footprint = bits(p0, N) + bits(p1, N/T) + bits(p2, N/T²) + …
                     = c · N · ∑ -ln(pi) / Tⁱ

where N is the number of entries at the largest level, T is the constant size ratio between levels, and c is a constant. Monkey then optimizes the pi subject to this model.
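The memory footprint above can be evaluated for any assignment of rates. A sketch (names illustrative; level 0 is taken as the largest level, holding N entries):

```python
import math

def filter_memory_bits(fprs, n_entries, size_ratio):
    """Total Bloom filter memory for per-level false positive rates
    fprs = [p0, p1, ...], where level i holds n_entries / T^i entries:
        memory = sum_i  (n_entries / T^i) * -ln(p_i) / ln(2)^2
    """
    c = 1 / math.log(2) ** 2
    return sum(c * (n_entries / size_ratio ** i) * -math.log(p)
               for i, p in enumerate(fprs))

# Uniform 1% filters over 3 levels with size ratio 10:
m = filter_memory_bits([0.01, 0.01, 0.01], n_entries=1_000_000, size_ratio=10)
```

Each entry at a level with false positive rate p costs -ln(p)/ln(2)² bits, about 9.6 bits when p = 1%.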
The optimum assigns exponentially decreasing false positive rates across levels: p0 at the largest level, then p0/T, p0/T², … at smaller levels. State-of-the-art Bloom filters instead assign the same rate p to every level.

    Monkey:            lookup cost = ∑ pi = O(e^(-M/N))
    State of the art:  lookup cost = ∑ p  = O(log(N) · e^(-M/N))

(N is the number of entries, M the overall memory for Bloom filters.)

Asymptotic win: Monkey's lookup cost increases at a slower rate as the data grows.
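Under a fixed filter-memory budget the two allocations can be compared directly. This sketch (all helper names illustrative) derives the budget from a uniform assignment, binary-searches the p0 of an exponentially decreasing assignment p0, p0/T, p0/T², … that spends the same budget, and checks that ∑ pi comes out smaller:

```python
import math

def memory_bits(fprs, n, T):
    # memory = sum_i (n / T^i) * -ln(p_i) / ln(2)^2, level 0 largest
    c = 1 / math.log(2) ** 2
    return sum(c * (n / T ** i) * -math.log(p) for i, p in enumerate(fprs))

n, T, levels = 1_000_000, 4, 6
uniform = [0.01] * levels            # state of the art: same rate everywhere
budget = memory_bits(uniform, n, T)

# Binary-search p0 so the decreasing scheme spends exactly the same memory.
lo, hi = 1e-9, 1.0 - 1e-9
for _ in range(200):
    mid = (lo + hi) / 2
    if memory_bits([mid / T ** i for i in range(levels)], n, T) > budget:
        lo = mid                     # rates too low -> too much memory spent
    else:
        hi = mid
monkey = [lo / T ** i for i in range(levels)]

# Same memory, strictly fewer expected wasted I/Os:
assert sum(monkey) < sum(uniform)
```

The lookup cost (sum of false positive rates) drops by roughly the number-of-levels factor, mirroring the asymptotic comparison above.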
Because the rates p0, p0/T, p0/T², … form a convergent geometric series, the memory equation

    memory = c · entries · ∑ -ln(pi) / Tⁱ

collapses to the closed form

    memory = c · entries · (-ln(lookup cost))

which directly models the lookups vs. memory trade-off.
This solves Problem 1, the suboptimal filter allocation. Problem 2 remains: for fixed memory, the merge policy's greed still trades lookups vs. updates, and tuning it for max throughput is hard.
Steps:
1. Identify the design knobs: size ratio and merge policy.
2. Map them to lookup and update costs; the design space spans from a log to a sorted array, with the LSM-tree in between.
3. Navigate: given the workload and the hardware, pick the optimal configuration for maximum throughput.
Merge Policies: Leveling vs. Tiering

Leveling (read-optimized) keeps 1 run per level: every flush is merged into the level's resident run, and a full level is merged and flushed into the next level, which is T times bigger. Tiering (write-optimized) lets up to T runs accumulate per level and merges them only when the level fills.

Lookup cost (with Monkey's filters, dominated by the false positive rates):
    Leveling: O(e^(-M/N))        (1 run per level)
    Tiering:  O(T · e^(-M/N))    (T runs per level)

Update cost (amortized merges per entry):
    Leveling: O(T · logT(N))     (merged up to T times per level)
    Tiering:  O(logT(N))         (merged once per level)

The size ratio T tunes both policies. At T = 2 they converge: both keep 1 run per level, with update cost O(log(N)) = O(log(N)) and lookup cost O(e^(-M/N)) = O(e^(-M/N)). As T grows, leveling degenerates toward a sorted array (lookup cost O(e^(-M/N)), update cost O(N)) and tiering toward a log (update cost O(1), lookup cost growing with the number of runs).
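The asymptotic expressions above can be wrapped in a tiny comparative cost model. This is a sketch with constants dropped, so only relative comparisons between the two policies are meaningful; the function names are illustrative:

```python
import math

def lookup_cost(policy, T, bits_per_entry):
    """Expected wasted lookup I/Os with Monkey's filters:
    leveling probes 1 run per level, tiering up to T runs."""
    p = math.exp(-bits_per_entry * math.log(2) ** 2)
    return p if policy == "leveling" else T * p

def update_cost(policy, T, n):
    """Amortized merge work per update: each entry is merged once per
    level under tiering, and up to T times per level under leveling."""
    levels = math.log(n, T)
    return T * levels if policy == "leveling" else levels
```

Raising T makes leveling more write-expensive and tiering more read-expensive, matching the limits stated above.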
[Figure: lookup cost vs. update cost. Tiering's curve extends toward the log, leveling's toward the sorted array; the two meet at T = 2, where T is the size ratio. Given a workload and hardware, Monkey navigates this space to the optimal point of maximum throughput.]
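The navigation step can be sketched end to end: score each (merge policy, size ratio) candidate under a toy cost model and pick the one with maximum throughput. Everything here is illustrative, including the block-size constant used to amortize merge work:

```python
import math

def throughput(policy, T, read_frac, n=1e6, bits_per_entry=10, block=128):
    """Toy throughput model: 1 / expected I/O cost per operation for a
    workload with the given fraction of lookups. Constants are guesses."""
    p = math.exp(-bits_per_entry * math.log(2) ** 2)
    lookup = p if policy == "leveling" else T * p     # wasted lookup I/Os
    levels = math.log(n, T)
    update = (T * levels if policy == "leveling" else levels) / block
    return 1.0 / (read_frac * lookup + (1 - read_frac) * update)

def navigate(read_frac):
    """Return the (policy, size ratio) pair with maximum modeled throughput."""
    candidates = [(pol, T) for pol in ("leveling", "tiering")
                  for T in range(2, 33)]
    return max(candidates, key=lambda c: throughput(c[0], c[1], read_frac))
```

Read-heavy workloads steer the search toward leveling; write-heavy ones toward tiering with a large size ratio.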
Evaluation:
- Better asymptotic scalability: plotting lookup latency (ms) against the number of entries (log scale), Monkey's latency grows at a slower rate than LevelDB's.
- Workload adaptability: plotting lookup latency (ms) against the % of lookups in the workload, across configurations (T4, T2, L4, L6, L8, L16), navigable Monkey outperforms both LevelDB and a fixed Monkey configuration.
Monkey self-designs: it navigates the LSM-tree design space and answers what-if design questions about memory, lookup, and update performance. http://daslab.seas.harvard.edu/crimsondb/