Scaling Log-Structured KV-Stores featuring Monkey and Dostoevsky



slide-1
SLIDE 1

Niv Dayan

Scaling Log-Structured KV-Stores

featuring

Monkey and Dostoevsky

SIGMOD17 / SIGMOD18

slide-2
SLIDE 2

Log-Structured KV-Stores

slide-3
SLIDE 3

Log-Structured KV-Stores

slide-4
SLIDE 4

Why Log-Structured KV-Stores?

slide-5
SLIDE 5

Why Log-Structured KV-Stores?

fast writes

slide-6
SLIDE 6

Why Log-Structured KV-Stores?

memory storage

slide-7
SLIDE 7

Why Log-Structured KV-Stores?

slide-8
SLIDE 8

Why Log-Structured KV-Stores?

slide-9
SLIDE 9

block-addressable byte-addressable

Why Log-Structured KV-Stores?

slide-10
SLIDE 10

write data

slide-11
SLIDE 11

write data

slide-12
SLIDE 12

write data

slide-13
SLIDE 13

In-Place Writes

write data

slide-14
SLIDE 14

In-Place Writes

B-trees write data

slide-15
SLIDE 15

In-Place Writes

write data B-trees

slide-16
SLIDE 16

Log-Structured Writes

slide-17
SLIDE 17

Log-Structured Writes

buffer writes

slide-18
SLIDE 18

Log-Structured Writes

buffer writes

slide-19
SLIDE 19

Log-Structured Writes

buffer writes

slide-20
SLIDE 20

Log-Structured Writes

buffer writes

slide-21
SLIDE 21

Log-Structured Writes

buffer writes

slide-22
SLIDE 22

Log-Structured KV-Stores

fast writes

buffer writes

slide-23
SLIDE 23

massive data fast writes fast reads

Log-Structured KV-Stores

slide-24
SLIDE 24
slide-25
SLIDE 25
slide-26
SLIDE 26
slide-27
SLIDE 27

Background

slide-28
SLIDE 28

Background The Log-Structured Merge-Tree

buffer

slide-29
SLIDE 29

Background LSM-tree

buffer

slide-30
SLIDE 30

buffer

slide-31
SLIDE 31

buffer writes

slide-32
SLIDE 32

buffer key value pairs

slide-33
SLIDE 33

buffer

key: value
Sherlock: a fictional detective
Waldo: an inconspicuous traveler

slide-34
SLIDE 34

buffer gets full

slide-35
SLIDE 35

buffer 1 level sort & flush

slide-36
SLIDE 36

buffer 1 level sort & flush sorted runs …

slide-37
SLIDE 37

buffer 1 2 sort-merge

slide-38
SLIDE 38

buffer 1 2 3 level exponentially increasing capacities level 1 level 2 level 3

one I/O per run
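
The buffer, sort-and-flush, and sort-merge steps from the last few slides can be sketched in a few lines. This is a toy model under stated assumptions, not the design of any real store: the class name `TinyLSM`, the capacity rule (level i holds `buffer_capacity * ratio**i` entries), and the single-run-per-level merge policy are all hypothetical choices made for illustration.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple

Run = List[Tuple[str, str]]  # one sorted run of (key, value) pairs


class TinyLSM:
    """Toy LSM-tree: one sorted run per level, capacities grow by `ratio`."""

    def __init__(self, buffer_capacity: int = 2, ratio: int = 2) -> None:
        self.buffer: Dict[str, str] = {}
        self.buffer_capacity = buffer_capacity
        self.ratio = ratio
        self.levels: List[Run] = []  # levels[0] is level 1

    def _capacity(self, index: int) -> int:
        # Assumed rule: level (index + 1) holds buffer_capacity * ratio**(index + 1).
        return self.buffer_capacity * self.ratio ** (index + 1)

    def put(self, key: str, value: str) -> None:
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_capacity:
            self._flush()

    def _flush(self) -> None:
        run: Run = sorted(self.buffer.items())  # sort & flush
        self.buffer = {}
        index = 0
        while True:
            if index == len(self.levels):
                self.levels.append([])
            run = self._merge(self.levels[index], run)  # sort-merge
            if len(run) <= self._capacity(index):
                self.levels[index] = run
                return
            self.levels[index] = []  # level overflowed: push the merged run down
            index += 1

    @staticmethod
    def _merge(older: Run, newer: Run) -> Run:
        merged = dict(older)
        merged.update(newer)  # newer entries overwrite older ones
        return sorted(merged.items())

    def get(self, key: str) -> Optional[str]:
        if key in self.buffer:
            return self.buffer[key]
        for run in self.levels:  # smaller levels hold newer data
            keys = [k for k, _ in run]
            i = bisect_left(keys, key)  # binary search within the sorted run
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None
```

A lookup here probes every run from the smallest level down, which is the cost that the pointers and Bloom filters on the following slides are meant to reduce.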

slide-39
SLIDE 39

buffer 1 2 3 level where’s Waldo binary searching

slide-40
SLIDE 40

buffer 1 2 3 level

one I/O per run pointers where’s Waldo

slide-41
SLIDE 41

buffer 1 2 3 level Bloom filters pointers where’s Waldo

slide-42
SLIDE 42

buffer 1 2 3 level

true negative

Bloom filters pointers where’s Waldo

slide-43
SLIDE 43

buffer 1 2 3 level

false positive true negative

Bloom filters pointers where’s Waldo

slide-44
SLIDE 44

buffer 1 2 3 level

false positive true positive true negative

Bloom filters pointers where’s Waldo
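
The three outcomes named on this slide (true negative, false positive, true positive) can be shown with a small Bloom filter placed in front of each run. This is a minimal sketch under assumptions: the `BloomFilter` class, the SHA-256 based hashing, and the `lookup` helper are hypothetical illustrations, and each run is modeled as a plain dictionary where one probe counts as one I/O.

```python
import hashlib
from typing import Dict, Iterable, List, Optional, Tuple


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: str) -> Iterable[int]:
        # Derive num_hashes positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


def lookup(
    key: str, runs: List[Tuple[BloomFilter, Dict[str, str]]]
) -> Tuple[Optional[str], int]:
    """Probe runs from smallest to largest level; return (value, I/O count)."""
    ios = 0
    for bloom, data in runs:
        if not bloom.might_contain(key):
            continue  # true negative: the run is skipped without I/O
        ios += 1  # the filter said "maybe", so the run is read
        if key in data:
            return data[key], ios  # true positive
        # otherwise this was a false positive: one wasted I/O
    return None, ios


def build_run(entries: Dict[str, str], bits_per_entry: int = 10):
    bloom = BloomFilter(num_bits=bits_per_entry * max(1, len(entries)), num_hashes=7)
    for k in entries:
        bloom.add(k)
    return bloom, entries
```

A Bloom filter never reports a stored key as absent, so the only wasted work comes from false positives.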

slide-45
SLIDE 45

buffer 1 2 3 Bloom filters merging frequency pointers

slide-46
SLIDE 46

merging writes reads

slide-47
SLIDE 47

merging writes reads

slide-48
SLIDE 48

Leveling Tiering

read-optimized write-optimized

merging

slide-49
SLIDE 49

Leveling Tiering

read-optimized write-optimized

slide-50
SLIDE 50

Leveling Tiering gather

read-optimized write-optimized

slide-51
SLIDE 51

Leveling Tiering merge & flush gather

read-optimized write-optimized

slide-52
SLIDE 52

Leveling Tiering

read-optimized write-optimized

gather

slide-53
SLIDE 53

Leveling Tiering merge

read-optimized write-optimized

gather

slide-54
SLIDE 54

Leveling Tiering flush merge

read-optimized write-optimized

gather

slide-55
SLIDE 55

Leveling Tiering merge

read-optimized write-optimized

gather

slide-56
SLIDE 56

Leveling Tiering

read-optimized write-optimized

log_R(N)

slide-57
SLIDE 57

Leveling Tiering 1 run per level R runs per level

read-optimized write-optimized

size ratio log_R(N)

slide-58
SLIDE 58

Leveling Tiering

size ratio log_R(N)

1 run per level R runs per level

read-optimized write-optimized
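
As a rough worked example of the log_R(N) levels and the runs-per-level difference, the sketch below counts levels and the maximum number of runs a lookup may have to probe. The function names and the assumption that level i holds `buffer_entries * R**i` entries are illustrative, not taken from the slides.

```python
def num_levels(num_entries: int, buffer_entries: int, size_ratio: int) -> int:
    """Smallest number of levels whose largest level can hold all entries.

    Assumes level i holds buffer_entries * size_ratio**i entries.
    """
    levels = 1
    capacity = buffer_entries * size_ratio
    while capacity < num_entries:
        capacity *= size_ratio
        levels += 1
    return levels


def max_runs(levels: int, size_ratio: int, policy: str) -> int:
    """Upper bound on runs in the tree: 1 per level (leveling) or R per level (tiering)."""
    if policy == "leveling":
        return levels
    if policy == "tiering":
        return levels * size_ratio
    raise ValueError(f"unknown policy: {policy}")


levels = num_levels(num_entries=1_000_000, buffer_entries=1_000, size_ratio=10)
print(levels, max_runs(levels, 10, "leveling"), max_runs(levels, 10, "tiering"))
```

With these example numbers the tree has 3 levels, so leveling keeps at most 3 runs while tiering may keep up to 30.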

slide-59
SLIDE 59

Leveling Tiering 1 run per level R runs per level

size ratio R

read-optimized write-optimized

slide-60
SLIDE 60

1 run per level Leveling Tiering 1 run per level

size ratio R

read-optimized write-optimized

slide-61
SLIDE 61

Leveling Tiering 1 run per level R runs per level

size ratio R

read-optimized write-optimized

slide-62
SLIDE 62

Leveling Tiering sorted array log 1 run per level O(|N|) runs per level

size ratio R

read-optimized write-optimized

slide-63
SLIDE 63

Tiering Leveling log sorted array

slide-64
SLIDE 64

Tiering Leveling log sorted array size ratio R

slide-65
SLIDE 65

Tiering Leveling size ratio R log sorted array

slide-66
SLIDE 66

Tiering Leveling log sorted array R R size ratio R

slide-67
SLIDE 67
slide-68
SLIDE 68
slide-69
SLIDE 69
slide-70
SLIDE 70
slide-71
SLIDE 71

Dostoevsky

Monkey

slide-72
SLIDE 72

Monkey: Optimal Navigable Key-Value Store SIGMOD17

slide-73
SLIDE 73

Monkey: Optimal Navigable Key-Value Store SIGMOD17

Niv Dayan, Manos Athanassoulis, Stratos Idreos

slide-74
SLIDE 74

Bloom filters data

Monkey: Optimal Navigable Key-Value Store SIGMOD17

slide-75
SLIDE 75

x x x bits/entry Bloom filters data

slide-76
SLIDE 76

x x x bits/entry Bloom filters data

slide-77
SLIDE 77

Bloom filters false positive rate O(e^{-x}) O(e^{-x}) O(e^{-x}) data

slide-78
SLIDE 78

Bloom filters false positive rate O(e^{-x}) O(e^{-x}) O(e^{-x}) I/O = O(e^{-x} · log_R(N))

slide-79
SLIDE 79

Bloom filters false positive rate O(e^{-x}) O(e^{-x}) O(e^{-x}) I/O = O(e^{-x} · log_R(N))
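
The O(e^{-x} · log_R(N)) figure follows from adding up one false positive rate per level when every filter gets the same x bits per entry. A minimal sketch, assuming the standard Bloom filter estimate e^{-x · ln(2)^2} with an optimal number of hash functions and one run per level (the function names are hypothetical):

```python
import math


def bloom_fpr(bits_per_entry: float) -> float:
    """Standard false positive rate estimate with the optimal number of hashes."""
    return math.exp(-bits_per_entry * math.log(2) ** 2)


def uniform_lookup_cost(bits_per_entry: float, levels: int) -> float:
    """Expected wasted I/Os of a zero-result lookup: one FPR per level."""
    return levels * bloom_fpr(bits_per_entry)


for levels in (3, 6, 9):
    print(levels, round(uniform_lookup_cost(10, levels), 4))
```

The cost grows linearly with the number of levels, and the number of levels grows as log_R(N), so lookups slow down as the data grows.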

slide-80
SLIDE 80

false positive rate O(e^{-x}) O(e^{-x}) O(e^{-x}) Bloom filters most memory

slide-81
SLIDE 81

false positive rate O(e^{-x}) O(e^{-x}) O(e^{-x}) Bloom filters most memory saves at most 1 I/O!

slide-82
SLIDE 82

reallocate

slide-83
SLIDE 83

reallocate

slide-84
SLIDE 84

same memory - fewer false positives

reallocate

slide-85
SLIDE 85

false positive rates relax 0 < p_0 < 1 0 < p_1 < 1 0 < p_2 < 1

slide-86
SLIDE 86

false positive rates relax 0 < p_0 < 1 0 < p_1 < 1 0 < p_2 < 1

model

read cost = f(p_0, p_1, …)

memory footprint = f(p_0, p_1, …)

slide-87
SLIDE 87

false positive rates relax 0 < p_0 < 1 0 < p_1 < 1 0 < p_2 < 1

model

read cost = \sum_{i=1}^{L} p_i

memory footprint = -\sum_{i=1}^{L} \frac{N}{T^{L-i}} \cdot \frac{\ln(p_i)}{\ln(2)^2}

slide-88
SLIDE 88

false positive rates relax 0 < p_0 < 1 0 < p_1 < 1 0 < p_2 < 1

model

read cost = \sum_{i=1}^{L} p_i

memory footprint = -\sum_{i=1}^{L} \frac{N}{T^{L-i}} \cdot \frac{\ln(p_i)}{\ln(2)^2}

optimize in terms of p_0, p_1 …
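
The two model expressions on this slide translate directly into code, which makes the "same memory, fewer false positives" claim easy to check numerically. The sketch below is illustrative only: the function names are hypothetical, T is the size ratio as written in the slide's formula, and the skewed assignment in the usage example is a hand-picked geometric sequence, not the optimum that Monkey derives.

```python
import math
from typing import List


def read_cost(fprs: List[float]) -> float:
    """Expected wasted I/Os of a zero-result lookup: the sum of the FPRs."""
    return sum(fprs)


def memory_footprint(fprs: List[float], num_entries: float, size_ratio: float) -> float:
    """Filter memory in bits; fprs[0] is the smallest level, fprs[-1] the largest."""
    num_levels = len(fprs)
    total = 0.0
    for i, p in enumerate(fprs, start=1):
        entries_at_level = num_entries / size_ratio ** (num_levels - i)
        total += entries_at_level * math.log(p)
    return -total / math.log(2) ** 2


def uniform_fprs_with_same_memory(fprs: List[float], size_ratio: float) -> List[float]:
    """The single FPR that, applied to every level, uses the same filter memory."""
    num_levels = len(fprs)
    weights = [size_ratio ** -(num_levels - i) for i in range(1, num_levels + 1)]
    mean_log = sum(w * math.log(p) for w, p in zip(weights, fprs)) / sum(weights)
    return [math.exp(mean_log)] * num_levels


skewed = [0.0001, 0.001, 0.01]  # smaller levels get lower FPRs
uniform = uniform_fprs_with_same_memory(skewed, size_ratio=10)
print(read_cost(skewed), read_cost(uniform))
```

Both assignments use the same number of filter bits, yet the skewed one has the lower read cost, because bits spent on small levels buy a large FPR reduction for few entries.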

slide-89
SLIDE 89

false positive rate O(e^{-x}/R^0) O(e^{-x}/R^1) O(e^{-x}/R^2) p_0 ≈ p_1 ≈ p_2 ≈

slide-90
SLIDE 90

false positive rate O(e^{-x}/R^0) O(e^{-x}/R^1) O(e^{-x}/R^2) geometric progression I/O = O(e^{-x})
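
The step from a geometric progression of false positive rates to a total of O(e^{-x}) can be written out as follows, assuming L levels and a size ratio R of at least 2 (L is introduced here only to index the levels):

```latex
\sum_{j=0}^{L-1} \frac{e^{-x}}{R^{j}}
  = e^{-x} \cdot \frac{1 - R^{-L}}{1 - R^{-1}}
  < e^{-x} \cdot \frac{R}{R - 1}
  = O(e^{-x}) \qquad \text{for } R \ge 2
```

The bound does not depend on L, so the lookup cost no longer grows with the number of levels.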

slide-91
SLIDE 91

O(e^{-x}) I/O > O(e^{-x} · log_R(N))

slide-92
SLIDE 92

O(e^{-x} · log_R(N)) O(e^{-x}) I/O

slide-93
SLIDE 93

[Figure: read latency (ms) against number of entries (log scale), comparing RocksDB at O(e^{-x} · log_R(N)) I/O with Monkey at O(e^{-x}) I/O]

slide-94
SLIDE 94

Existing Monkey

slide-95
SLIDE 95

Existing Monkey Dostoevsky

slide-96
SLIDE 96

tiering

Monkey

leveling

slide-97
SLIDE 97

point writes I/O overheads with leveling long range short range

slide-98
SLIDE 98

exponentially decreasing O(e^{-x}) O(e^{-x}/R) O(e^{-x}/R^2) point false positive rates

slide-99
SLIDE 99

false positive rates point largest level O(e^{-x}) O(e^{-x}/R) O(e^{-x}/R^2)

slide-100
SLIDE 100

largest level O(e^{-x})

writes long range short range point

slide-101
SLIDE 101

target key range target range O(s) O(s/R) O(s/R^2) long range

slide-102
SLIDE 102

target key range target range largest level long range O(s) O(s/R) O(s/R^2)

slide-103
SLIDE 103

largest level largest level O(e^{-x}) O(s)

point writes long range short range

slide-104
SLIDE 104

short range target range 1 1 1

slide-105
SLIDE 105

all levels target range 1 1 1 O(log_R(N)) short range

slide-106
SLIDE 106

largest level largest level

point

O(e^{-x}) O(s) all levels O(log_R(N))

writes long range short range

slide-107
SLIDE 107

exponentially more work exponentially less frequent writes

slide-108
SLIDE 108

exponentially more work exponentially less frequent writes

slide-109
SLIDE 109

= = all levels more work less frequent writes

slide-110
SLIDE 110

= = all levels writes

slide-111
SLIDE 111

merge 1 writes

slide-112
SLIDE 112

merge 2 writes

slide-113
SLIDE 113

R merge writes

slide-114
SLIDE 114

O(R) O(R) O(R) write-amplification writes

slide-115
SLIDE 115

O(R · log_R(N)) O(R) O(R) O(R) writes
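
The O(R) write-amplification per level can be checked with a small counting simulation. This is a sketch under simplifying assumptions, not a model of any particular store: sizes are tracked in units of one buffer, all keys are assumed unique, level i is assumed to hold R**i buffers, and a run that fills its level is carried into the next level within the same merge. The function name is hypothetical.

```python
def leveling_write_amp(num_flushes: int, size_ratio: int) -> float:
    """Buffers written to storage per buffer flushed, under leveling."""
    levels = []  # levels[i] = current size of level i + 1, in buffers
    written = 0
    for _ in range(num_flushes):
        incoming, i = 1, 0
        while True:
            if i == len(levels):
                levels.append(0)
            merged = levels[i] + incoming  # sort-merge with the resident run
            if merged < size_ratio ** (i + 1):
                levels[i] = merged
                written += merged  # the whole merged run is rewritten
                break
            levels[i] = 0  # level is full: carry the run to the next level
            incoming = merged
            i += 1
    return written / num_flushes


for flushes in (4, 16, 64, 256):
    print(flushes, leveling_write_amp(flushes, size_ratio=4))
```

Each level contributes a roughly constant amount proportional to R, so the total grows with the number of levels, in line with the O(R · log_R(N)) bound on the slide.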

slide-116
SLIDE 116

point: O(e^{-x}/R^2) + O(e^{-x}/R) + O(e^{-x}) = O(e^{-x}) (largest level)
long range: O(s/R^2) + O(s/R) + O(s) = O(s) (largest level)
short range: 1 + 1 + 1 = O(log_R(N)) (all levels)
writes: O(R) + O(R) + O(R) = O(R · log_R(N)) (all levels)

slide-117
SLIDE 117

O(s) O(e^{-x})

largest level largest level all levels writes long range point

O(R) O(R) O(R)

slide-118
SLIDE 118

O(s) O(e^{-x}) largest level largest level all levels

long range point superfluous

O(R) O(R) O(R)

writes

slide-119
SLIDE 119

merging at smaller levels is superfluous for point lookups and long range lookups

slide-120
SLIDE 120

worse as data grows!

slide-121
SLIDE 121

poor performance

slide-122
SLIDE 122

poor performance lower device lifetime (on SSD)

slide-123
SLIDE 123

Dostoevsky

SIGMOD18

slide-124
SLIDE 124

Dostoevsky: Space-Time Optimized Evolvable Scalable Key-Value Store

slide-125
SLIDE 125

Dostoevsky: Space-Time Optimized Evolvable Scalable Key-Value Store very write-optimized

slide-126
SLIDE 126

Leveling Tiering

read-optimized write-optimized

slide-127
SLIDE 127

Leveling Tiering

read-optimized write-optimized

Lazy Leveling

mixed-optimized

slide-128
SLIDE 128

Lazy Leveling Leveling Tiering

slide-129
SLIDE 129

Lazy Leveling merge to have at most 1 run merge when level fills

slide-130
SLIDE 130

long range short range point writes

slide-131
SLIDE 131

O(e^{-x}) O(e^{-x}/R^2) O(e^{-x}/R^3) false positive rates point

slide-132
SLIDE 132

false positive rates exponentially decreasing point O(e^{-x}) O(e^{-x}/R^2) O(e^{-x}/R^3)

slide-133
SLIDE 133

false positive rates largest level point O(e^{-x}) O(e^{-x}/R^2) O(e^{-x}/R^3)

slide-134
SLIDE 134

O(e^{-x}) point

slide-135
SLIDE 135

O(e^{-x}) point O(e^{-x}) O(e^{-x}) O(e^{-x}) O(log_R(N) · R · e^{-x}) with uniform FPRs

slide-136
SLIDE 136

O(e^{-x})

point long range short range writes

slide-137
SLIDE 137

target key range target range O(s) O(s/R) O(s/R^2) long range

slide-138
SLIDE 138

target key range target range largest level long range O(s) O(s/R) O(s/R^2)

slide-139
SLIDE 139

O(e^{-x}) O(s)

point writes long range short range

slide-140
SLIDE 140

1 O(R) O(1 + R · (log_R(N) - 1)) target key range O(R) short range

slide-141
SLIDE 141

O(1 + R · (log_R(N) - 1)) O(e^{-x}) O(s)

long range point writes short range

slide-142
SLIDE 142

write-amplification O(1) O(1) O(R) writes

slide-143
SLIDE 143

write-amplification O(R + log_R(N)) writes O(1) O(1) O(R)

slide-144
SLIDE 144

writes: O(R + log_R(N))
point: O(e^{-x})
short range: O(1 + R · (log_R(N) - 1))
long range: O(s)

slide-145
SLIDE 145

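Lazy Leveling: writes O(R + log_R(N)), point O(e^{-x}), short range O(1 + R · (log_R(N) - 1)), long range O(s)

Leveling: writes O(R · log_R(N)), point O(e^{-x}), short range O(log_R(N)), long range O(s)

Point and long range costs are equal; short range and write costs differ.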

slide-146
SLIDE 146

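Lazy Leveling: writes O(R + log_R(N)), point O(e^{-x}), short range O(1 + R · (log_R(N) - 1)), long range O(s)

Leveling: writes O(R · log_R(N)), point O(e^{-x}), short range O(log_R(N)), long range O(s)

Tiering: writes O(log_R(N)), point O(R · e^{-x}), short range O(R · log_R(N)), long range O(R · s)

Tiering differs from Leveling on all four costs.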
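
To see how the three designs rank against each other, the asymptotic expressions on this slide can be evaluated for concrete values. The sketch below drops all constant factors, so the numbers are only meaningful relative to each other; the function name and the example values of R, N, x, and s are assumptions chosen for illustration.

```python
import math
from typing import Dict


def costs(design: str, R: float, N: float, x: float, s: float) -> Dict[str, float]:
    """Evaluate the slide's cost expressions with all constants dropped."""
    L = math.log(N, R)  # number of levels, log_R(N)
    fpr = math.exp(-x)
    table = {
        "leveling": {
            "writes": R * L, "point": fpr, "short range": L, "long range": s,
        },
        "tiering": {
            "writes": L, "point": R * fpr, "short range": R * L, "long range": R * s,
        },
        "lazy leveling": {
            "writes": R + L, "point": fpr,
            "short range": 1 + R * (L - 1), "long range": s,
        },
    }
    return table[design]


for name in ("leveling", "lazy leveling", "tiering"):
    print(name, costs(name, R=10, N=10**6, x=5, s=100))
```

With these example values, lazy leveling matches leveling on point and long range lookups, sits between the two on writes, and pays for that on short range lookups.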

slide-147
SLIDE 147

Leveling writes Lazy Leveling Tiering point

slide-148
SLIDE 148

Leveling short range Lazy Leveling Tiering writes

slide-149
SLIDE 149

Leveling long range Lazy Leveling Tiering writes

slide-150
SLIDE 150

Leveling Tiering Lazy Leveling

slide-151
SLIDE 151

Leveling Tiering Lazy Leveling

writes

slide-152
SLIDE 152

Leveling Tiering Lazy Leveling

short range writes

slide-153
SLIDE 153

Leveling Tiering

short range writes & point

Lazy Leveling

writes

slide-154
SLIDE 154

Tiering Lazy Leveling

Fluid

Leveling

slide-155
SLIDE 155

K runs Z runs

LSM-Tree Fluid

slide-156
SLIDE 156

R runs 1 run Lazy Leveling

LSM-Tree Fluid

slide-157
SLIDE 157

R runs 1 run long range short range point writes Lazy Leveling

slide-158
SLIDE 158

R runs 1 run long range short range point writes Lazy Leveling

optimize

slide-159
SLIDE 159

2 runs 1 run long range short range point writes Lazy Leveling

optimize

slide-160
SLIDE 160

Leveling short range writes 1 run 1 run

optimize

long range point

slide-161
SLIDE 161

Leveling short range writes 1 run 1 run long range point

optimize

slide-162
SLIDE 162

R runs 1 run long range short range point writes Lazy Leveling

optimize
slide-163
SLIDE 163

R runs 2 runs long range short range point Lazy Leveling

optimize

writes

slide-164
SLIDE 164

R runs R runs long range short range point Tiering

optimize

writes

slide-165
SLIDE 165

R runs R runs long range short range point Tiering writes

optimize

slide-166
SLIDE 166

R runs 1 run long range short range point writes Lazy Leveling

optimize

slide-167
SLIDE 167

R runs 1 run long range short range point writes Lazy Leveling

optimize

R size ratio

slide-168
SLIDE 168

R runs 1 run long range short range point writes Lazy Leveling

optimize

R size ratio

slide-169
SLIDE 169

Fluid LSM-Tree

R size ratio K runs at smaller levels Z runs at largest level
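
The three knobs on this slide are enough to express the earlier designs as special cases. A minimal sketch, following the run counts as the slides state them (R runs for tiering, 1 run for leveling); the class name, method names, and the "hybrid" label for every other setting are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FluidLSM:
    R: int  # size ratio between adjacent levels
    K: int  # maximum runs at each of the smaller levels
    Z: int  # maximum runs at the largest level

    def design(self) -> str:
        """Name the classic design this setting corresponds to, if any."""
        if self.K == 1 and self.Z == 1:
            return "leveling"
        if self.K == self.R and self.Z == self.R:
            return "tiering"
        if self.K == self.R and self.Z == 1:
            return "lazy leveling"
        return "hybrid"

    def max_runs(self, levels: int) -> int:
        """Upper bound on the number of runs across the whole tree."""
        return self.K * (levels - 1) + self.Z


for config in (FluidLSM(10, 1, 1), FluidLSM(10, 10, 1), FluidLSM(10, 10, 10)):
    print(config.design(), config.max_runs(levels=4))
```

Settings in between, such as 2 runs at the smaller levels and 1 at the largest, are the intermediate points that the optimize step moves through.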

slide-170
SLIDE 170

Fluid LSM-Tree Tiering Lazy Leveling Leveling

slide-171
SLIDE 171
slide-172
SLIDE 172

[Figure: normalized throughput (ops/s) against the point lookups / updates ratio (10^-2 to 10^-1); series: leveling]

slide-173
SLIDE 173

[Figure: normalized throughput (ops/s) against the point lookups / updates ratio (10^-2 to 10^-1); series: leveling]

slide-174
SLIDE 174

[Figure: normalized throughput (ops/s) against the point lookups / updates ratio (10^-2 to 10^-1); series: leveling, tiering]

slide-175
SLIDE 175

[Figure: normalized throughput (ops/s) against the point lookups / updates ratio (10^-2 to 10^-1); series: leveling, tiering, lazy leveling]

slide-176
SLIDE 176

[Figure: normalized throughput (ops/s) against the point lookups / updates ratio (10^-2 to 10^-1); series: leveling, tiering, lazy leveling, Dostoevsky]

slide-177
SLIDE 177

[Figure: normalized throughput (ops/s) against the point lookups / updates ratio (10^-2 to 10^-1); series: Dostoevsky]

slide-178
SLIDE 178

[Figure: normalized throughput (ops/s) against the point lookups / updates ratio (10^-2 to 10^-1); series: Dostoevsky, Tuned RocksDB, Monkey]

slide-179
SLIDE 179

[Figure: normalized throughput (ops/s) against the point lookups / updates ratio (10^-2 to 10^-1); series: Dostoevsky, Tuned RocksDB, Monkey, Untuned RocksDB]

slide-180
SLIDE 180

Conclusion

slide-181
SLIDE 181

Conclusion

Bloom filters LSM-tree

slide-182
SLIDE 182

Conclusion

Bloom filters LSM-tree

optimizes

memory allocation

slide-183
SLIDE 183

Conclusion

Bloom filters LSM-tree removes superfluous merging

optimizes

memory allocation

slide-184
SLIDE 184

Conclusion

slide-185
SLIDE 185

Conclusion

slide-186
SLIDE 186

Conclusion

slide-187
SLIDE 187

Conclusion

tiering (1997) leveling (1996)

slide-188
SLIDE 188

Conclusion

little memory tiering (1997) leveling (1996)

slide-189
SLIDE 189

Conclusion

ample memory tiering (1997) leveling (1996)

slide-190
SLIDE 190

Conclusion

ample memory tiering (1997) leveling (1996)

slide-191
SLIDE 191

Conclusion

ample memory tiering (1997) leveling (1996)

slide-192
SLIDE 192

Conclusion

ample memory tiering (1997) leveling (1996)

slide-193
SLIDE 193

Conclusion

slide-194
SLIDE 194

Conclusion

slide-195
SLIDE 195

Conclusion

slide-196
SLIDE 196

Conclusion

slide-197
SLIDE 197
slide-198
SLIDE 198

Thanks!