Scaling Log-Structured KV-Stores featuring Monkey and Dostoevsky - - PowerPoint PPT Presentation
SIGMOD17 / SIGMOD18, Niv Dayan
Log-Structured KV-Stores
Why Log-Structured KV-Stores?
fast writes
memory: byte-addressable
storage: block-addressable
In-Place Writes (B-trees): write data in place
Log-Structured Writes: buffer writes
Log-Structured KV-Stores: massive data, fast writes, fast reads
Background: The Log-Structured Merge-Tree (LSM-tree)
buffer: writes go to an in-memory buffer of key-value pairs
key: Sherlock, value: a fictional detective
key: Waldo, value: an inconspicuous traveler
buffer gets full: sort & flush to level 1 as a sorted run
sorted runs are sort-merged into level 2, level 3, …
levels 1, 2, 3 have exponentially increasing capacities
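The buffer, flush, and sort-merge steps above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: the class name `TinyLSM`, its parameters, and the choice to keep one sorted run per level (merging on every arrival) are hypothetical and are not taken from the slides.

```python
from bisect import bisect_left


class TinyLSM:
    """Toy LSM-tree: one sorted run per level, capacities grow by `ratio`."""

    def __init__(self, buffer_capacity=2, ratio=2):
        self.buffer_capacity = buffer_capacity  # entries held before a flush
        self.ratio = ratio                      # size ratio between adjacent levels
        self.buffer = {}                        # in-memory buffer: key -> value
        self.levels = []                        # levels[i]: sorted list of (key, value)

    def _capacity(self, i):
        # Level i (0-based) holds buffer_capacity * ratio^(i+1) entries.
        return self.buffer_capacity * self.ratio ** (i + 1)

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_capacity:
            self._flush()

    def _flush(self):
        run = sorted(self.buffer.items())       # sort the buffer into a run
        self.buffer = {}
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append([])
            merged = dict(self.levels[i])       # older entries first...
            merged.update(run)                  # ...newer entries overwrite them
            run = sorted(merged.items())        # sort-merge into a single run
            if len(run) <= self._capacity(i):
                self.levels[i] = run
                return
            self.levels[i] = []                 # level overflows: push the run down
            i += 1

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for run in self.levels:                 # smaller levels hold newer data
            j = bisect_left(run, (key,))        # binary search within the run
            if j < len(run) and run[j][0] == key:
                return run[j][1]
        return None
```

For example, after `db = TinyLSM()` and `db.put("Waldo", "an inconspicuous traveler")`, a call to `db.get("Waldo")` is served from the buffer until the next flush moves the entry into level 1.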
lookup: where’s Waldo?
binary searching each run
pointers: one I/O per run
Bloom filters: skip runs that cannot contain the key
true negative: the run is skipped
false positive: one wasted I/O
true positive: the key is found
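The three lookup outcomes above can be illustrated with a small Bloom filter sketch. The hashing scheme (salted SHA-256), the one-byte-per-bit layout, and the `lookup` helper are assumptions made for readability; they are not how any particular KV-store implements its filters.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "maybe present".
        return all(self.bits[pos] for pos in self._positions(key))


def lookup(key, runs):
    """runs: list of (bloom_filter, dict) pairs, newest first.

    Returns (value, ios), where ios counts the runs that had to be read.
    """
    ios = 0
    for bloom, data in runs:
        if not bloom.might_contain(key):
            continue                  # negative answer: skip the run, no I/O
        ios += 1                      # "maybe": pay one I/O to read the run
        if key in data:
            return data[key], ios     # true positive
        # otherwise a false positive: the I/O was wasted
    return None, ios
```

A filter never returns a false negative, so skipping a run on a negative answer is always safe; only the false positives cost extra I/O.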
Bloom filters, pointers, merging frequency
merging: trades off writes against reads
Leveling vs. Tiering
Leveling: read-optimized, 1 run per level; merge each arriving run, flush when the level fills
Tiering: write-optimized, R runs per level; gather runs at a level, then merge & flush
size ratio R, $\log_R(N)$ levels
As the size ratio R grows, Tiering approaches a log ($O(|N|)$ runs per level) and Leveling approaches a sorted array (1 run per level).
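As a rough illustration of how the size ratio R sets the number of levels and the number of runs a lookup may face, here is a minimal sketch. The function names and the assumption that the largest level alone must hold all N entries are simplifications introduced here, not part of the slides.

```python
def num_levels(num_entries, buffer_entries, size_ratio):
    """Levels needed when level i holds buffer_entries * size_ratio^i entries.

    Assumes the largest level alone must be able to hold all entries.
    """
    levels, capacity = 1, buffer_entries * size_ratio
    while capacity < num_entries:
        capacity *= size_ratio
        levels += 1
    return levels


def max_runs(policy, levels, size_ratio):
    """Worst-case number of sorted runs a lookup may have to consider."""
    runs_per_level = {"leveling": 1, "tiering": size_ratio}[policy]
    return runs_per_level * levels


levels = num_levels(num_entries=1_000_000, buffer_entries=1_000, size_ratio=10)
print(levels, max_runs("leveling", levels, 10), max_runs("tiering", levels, 10))
```

With these inputs the tree has 3 levels, so leveling exposes at most 3 runs and tiering at most 30.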
Dostoevsky
Monkey
Monkey: Optimal Navigable Key-Value Store, SIGMOD17
Niv Dayan, Manos Athanassoulis, Stratos Idreos
Bloom filters: x bits/entry at every level
false positive rate: $O(e^{-x})$ at every level
I/O = $O(e^{-x} \cdot \log_R(N))$
most memory goes to the Bloom filter of the largest level, which saves at most 1 I/O!
reallocate: same memory, fewer false positives
relax: each level i gets its own false positive rate $p_i$, with $0 < p_i < 1$
model: read cost and memory footprint in terms of the false positive rates
read cost = $\sum_{i=1}^{L} p_i$
memory footprint = $-\sum_{i=1}^{L} \frac{N}{T^{L-i}} \cdot \frac{\ln(p_i)}{\ln(2)^2}$
(N: number of entries, L: number of levels, T: size ratio, written R elsewhere in these slides)
optimize
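To make the model concrete, the sketch below evaluates the two formulas above and compares a uniform allocation with a geometric one that uses the same memory. The closed form in `geometric_fprs` is an assumption derived here by setting false positive rates proportional to level sizes (the stationary point of the Lagrangian for this model); it ignores the case where a rate would reach 1, and the function names are hypothetical.

```python
import math

LN2_SQ = math.log(2) ** 2


def level_sizes(num_entries, size_ratio, levels):
    # Entries at level i (1..L): N / T^(L-i); the largest level holds about N.
    return [num_entries / size_ratio ** (levels - i) for i in range(1, levels + 1)]


def read_cost(fprs):
    # Expected wasted I/Os for a missing key: the sum of false positive rates.
    return sum(fprs)


def memory_bits(fprs, sizes):
    # Filter memory: -sum_i n_i * ln(p_i) / ln(2)^2
    return -sum(n * math.log(p) for n, p in zip(sizes, fprs)) / LN2_SQ


def geometric_fprs(uniform_fpr, size_ratio, levels):
    """Rates proportional to level size, using the same memory as `uniform_fpr`."""
    sizes = level_sizes(1.0, size_ratio, levels)
    # Size-weighted mean of (L - i); it scales the largest level's rate so that
    # the total memory matches the uniform allocation.
    shift = sum(n * (levels - i) for i, n in enumerate(sizes, start=1)) / sum(sizes)
    p_largest = uniform_fpr * size_ratio ** shift
    return [p_largest / size_ratio ** (levels - i) for i in range(1, levels + 1)]


T, L, N = 10, 5, 1e9
sizes = level_sizes(N, T, L)
uniform = [0.01] * L
geometric = geometric_fprs(0.01, T, L)
print(read_cost(uniform), read_cost(geometric))
```

Under these assumptions both allocations use the same number of filter bits, while the geometric one has a noticeably lower sum of false positive rates.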
optimized false positive rates: $O(e^{-x}/R^0)$, $O(e^{-x}/R^1)$, $O(e^{-x}/R^2)$ (geometric progression)
I/O = $O(e^{-x})$
Existing: $O(e^{-x} \cdot \log_R(N))$ > Monkey: $O(e^{-x})$
[Figure: read latency (ms) vs. number of entries (log scale), RocksDB vs. Monkey]
Existing, Monkey, Dostoevsky
tiering, leveling
I/O overheads with leveling: point, long range, short range, writes
point: false positive rates decrease exponentially, $O(e^{-x})$, $O(e^{-x}/R)$, $O(e^{-x}/R^2)$; the largest level dominates: $O(e^{-x})$
long range: the target key range costs $O(s)$, $O(s/R)$, $O(s/R^2)$; the largest level dominates: $O(s)$
short range: the target range costs 1, 1, 1 (one I/O per level); all levels contribute: $O(\log_R(N))$
writes: merges at larger levels do exponentially more work but happen exponentially less frequently, so all levels contribute equally
writes: each entry is merged up to R times per level, so write-amplification is $O(R)$ per level: $O(R \cdot \log_R(N))$
merging at smaller levels is superfluous for point lookups and long range lookups
worse as data grows!
poor performance, lower device lifetime (on SSD)
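The claim that the largest level dominates point and long range lookups, while every level contributes to short range lookups and writes, can be checked numerically. The sketch below drops all constants, mirroring the O(·) terms above; the function name and its parameters are illustrative assumptions.

```python
import math


def leveling_costs(x, size_ratio, levels, range_size):
    """Per-operation costs under leveling, with constants omitted."""
    R = size_ratio
    # Level 0 here stands for the largest level; each smaller level is R times cheaper.
    point = sum(math.exp(-x) / R ** i for i in range(levels))
    long_range = sum(range_size / R ** i for i in range(levels))
    short_range = levels        # one I/O per level
    writes = R * levels         # O(R) write-amplification per level
    return point, long_range, short_range, writes


point, long_range, short_range, writes = leveling_costs(5.0, 10, 6, 1000.0)
# The geometric sums stay within a factor R / (R - 1) of the largest level's term.
print(point / math.exp(-5.0), long_range / 1000.0, short_range, writes)
```

Adding levels barely moves the first two numbers, while the last two grow linearly with the number of levels.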
Dostoevsky: Space-Time Optimized Evolvable Scalable Key-Value Store, SIGMOD18
very write-optimized
Leveling: read-optimized
Tiering: write-optimized
Lazy Leveling: mixed-optimized
Lazy Leveling: at smaller levels, merge when the level fills; at the largest level, merge to have at most 1 run
Lazy Leveling costs: point, long range, short range, writes
point: false positive rates decrease exponentially, $O(e^{-x})$, $O(e^{-x}/R^2)$, $O(e^{-x}/R^3)$; the largest level dominates: $O(e^{-x})$
point with uniform FPRs: $O(e^{-x})$ per run, $O(\log_R(N) \cdot R \cdot e^{-x})$ in total
long range: the target key range costs $O(s)$, $O(s/R)$, $O(s/R^2)$; the largest level dominates: $O(s)$
short range: 1 run at the largest level, $O(R)$ runs at each smaller level: $O(1 + R \cdot (\log_R(N) - 1))$
writes: write-amplification $O(1)$, $O(1)$, $O(R)$ (largest level): $O(R + \log_R(N))$
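Leveling: point $O(e^{-x})$, long range $O(s)$, short range $O(\log_R(N))$, writes $O(R \cdot \log_R(N))$
Lazy Leveling: point $O(e^{-x})$, long range $O(s)$, short range $O(1 + R \cdot (\log_R(N) - 1))$, writes $O(R + \log_R(N))$
Tiering: point $O(R \cdot e^{-x})$, long range $O(R \cdot s)$, short range $O(R \cdot \log_R(N))$, writes $O(\log_R(N))$
Lazy Leveling vs. Leveling: equal on point and long range lookups, costlier short range lookups, cheaper writes
Tiering vs. Lazy Leveling: costlier point, long range, and short range lookups, cheaper writes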
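The comparison above can be evaluated for concrete parameters. This sketch simply plugs numbers into the asymptotic expressions with constants dropped, so the outputs are only meaningful relative to one another; the function name `costs` and its arguments are hypothetical.

```python
import math


def costs(design, size_ratio, levels, x, range_size):
    """Asymptotic cost expressions from the comparison above, constants omitted.

    `levels` stands in for log_R(N).
    """
    R, L, s = size_ratio, levels, range_size
    fp = math.exp(-x)
    table = {
        "leveling": dict(point=fp, long_range=s, short_range=L, writes=R * L),
        "lazy leveling": dict(point=fp, long_range=s,
                              short_range=1 + R * (L - 1), writes=R + L),
        "tiering": dict(point=R * fp, long_range=R * s,
                        short_range=R * L, writes=L),
    }
    return table[design]


for design in ("leveling", "lazy leveling", "tiering"):
    print(design, costs(design, size_ratio=10, levels=5, x=5.0, range_size=1000))
```

For R = 10 and 5 levels, lazy leveling keeps the point and long range costs of leveling while its write cost sits much closer to that of tiering.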
[Figures: trade-off curves of point, short range, and long range lookup cost against write cost for Leveling, Lazy Leveling, and Tiering]
Fluid LSM-Tree
size ratio R, K runs at smaller levels, Z runs at largest level
optimize R, K, and Z for the workload: point, long range, short range, writes
Leveling: 1 run at smaller levels, 1 run at largest level
Lazy Leveling: R runs at smaller levels, 1 run at largest level
Tiering: R runs at smaller levels, R runs at largest level
in between: 2 runs at smaller levels with 1 run at largest level, or R runs at smaller levels with 2 runs at largest level
Fluid LSM-Tree covers Tiering, Lazy Leveling, and Leveling
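The K and Z knobs can be captured in a small parameter object. The `design` mapping follows the run counts listed above; the general cost expressions in `costs` are an assumption chosen here so that they reproduce the short range and write costs of Leveling, Lazy Leveling, and Tiering given earlier, with constants omitted. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FluidLSM:
    size_ratio: int  # R
    k: int           # K: maximum runs at each smaller level
    z: int           # Z: maximum runs at the largest level

    def design(self):
        R = self.size_ratio
        if (self.k, self.z) == (1, 1):
            return "leveling"
        if (self.k, self.z) == (R, R):
            return "tiering"
        if (self.k, self.z) == (R, 1):
            return "lazy leveling"
        return "hybrid"

    def costs(self, levels):
        """Assumed generalization; `levels` stands in for log_R(N)."""
        R, K, Z, L = self.size_ratio, self.k, self.z, levels
        return {
            "short_range": Z + K * (L - 1),         # runs touched by a short scan
            "writes": (R / K) * (L - 1) + R / Z,    # merge work per entry
        }


for config in (FluidLSM(10, 1, 1), FluidLSM(10, 10, 1), FluidLSM(10, 10, 10)):
    print(config.design(), config.costs(levels=5))
```

Intermediate settings such as `FluidLSM(10, 2, 1)` fall between the named designs, which is the space an optimizer over R, K, and Z would search.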
[Figure: normalized throughput (ops/s) against the ratio of point lookups to updates (1/100, 1/10, …), comparing leveling, tiering, lazy leveling, Dostoevsky, Monkey, Tuned RocksDB, and Untuned RocksDB]
Conclusion
Bloom filters: Monkey optimizes memory allocation
LSM-tree: Dostoevsky removes superfluous merging