Monkey: Optimal Navigable Key-Value Store



slide-1
SLIDE 1

Monkey: Optimal Navigable Key-Value Store

Niv Dayan, Manos Athanassoulis, Stratos Idreos

slide-2
SLIDE 2

price per GB time storage is cheaper inserts & updates workload

slide-3
SLIDE 3

price per GB time storage is cheaper inserts & updates workload

need for write-optimized database structures

slide-4
SLIDE 4

time 1996 now LSM-tree invented need for write-optimized database structures

slide-5
SLIDE 5

time 1996 now LSM-tree invented Key-Value Stores need for write-optimized database structures

slide-6
SLIDE 6

LSM-tree Key-Value Stores

What are they really?

slide-7
SLIDE 7

memory updates buffer storage level

slide-8
SLIDE 8

memory updates buffer storage 1 level sort & flush runs

slide-9
SLIDE 9

memory updates buffer storage 1 2 sort-merge sort & flush runs

slide-10
SLIDE 10

memory buffer storage 1 2 3 level exponentially increasing capacities O(log(N)) levels
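A minimal sketch of why exponentially increasing capacities give O(log(N)) levels. The function name, the parameters, and the assumption that level i holds `buffer_entries * size_ratio**i` entries are illustrative and not taken from the slides.

```python
def num_levels(num_entries: int, buffer_entries: int, size_ratio: int) -> int:
    """Count the storage levels needed to hold num_entries.

    Assumption: level i (starting at 1) can hold
    buffer_entries * size_ratio**i entries.
    """
    if size_ratio < 2:
        raise ValueError("size_ratio must be at least 2")
    levels = 0
    total_capacity = 0
    level_capacity = buffer_entries
    while total_capacity < num_entries:
        levels += 1
        level_capacity *= size_ratio  # each level is size_ratio times bigger
        total_capacity += level_capacity
    return levels
```

Under these assumptions, multiplying the data size by the size ratio adds only one level, which is the logarithmic growth the slide refers to.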

slide-11
SLIDE 11

memory buffer storage 1 2 3 level fence pointers lookup key X X

one I/O per run

slide-12
SLIDE 12

memory buffer storage 1 2 3 level fence pointers lookup key X X

one I/O per run

slide-13
SLIDE 13

memory buffer storage 1 2 3 level fence pointers Bloom filters lookup key X X

slide-14
SLIDE 14

memory buffer storage 1 2 3 level fence pointers lookup key X X

true negative

Bloom filters

slide-15
SLIDE 15

memory buffer storage 1 2 3 level fence pointers lookup key X X

false positive true negative

Bloom filters

slide-16
SLIDE 16

memory buffer storage 1 2 3 level fence pointers lookup key X X

false positive true positive true negative

Bloom filters
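The three outcomes named on this slide can be reproduced with a small Bloom filter. This is a sketch only: the class, its method names, and the use of SHA-256 with a seed prefix as the hash family are assumptions for illustration, not how any particular key-value store implements its filters.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive num_hashes positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False: the key is definitely absent (true negative, run is skipped).
        # True: the key may be present (true positive or false positive).
        return all(self.bits[pos] for pos in self._positions(key))
```

A lookup consults the filter of a run before reading it: a negative answer skips the run, while a positive answer costs one I/O, which is wasted when the answer was a false positive.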

slide-17
SLIDE 17

memory buffer storage 1 2 3 level fence pointers lookup key X X

false positive true positive true negative

Performance & Cost Tradeoffs Bloom filters

slide-18
SLIDE 18

memory buffer storage 1 2 3 level fence pointers Bloom filters lookup key X X

false positive true positive true negative

Performance & Cost Tradeoffs bigger filters fewer false positives

slide-19
SLIDE 19

memory buffer storage 1 2 3 level fence pointers Bloom filters lookup key X X

false positive true positive true negative

Performance & Cost Tradeoffs bigger filters fewer false positives memory vs. lookups

slide-20
SLIDE 20

memory buffer storage 1 2 3 level fence pointers Bloom filters lookup key X X

false positive true positive true negative

Performance & Cost Tradeoffs memory vs. lookups bigger filters fewer false positives more merging fewer runs

slide-21
SLIDE 21

memory buffer storage 1 2 3 level fence pointers Bloom filters lookup key X X

false positive true positive true negative

Performance & Cost Tradeoffs lookups vs. updates memory vs. lookups bigger filters fewer false positives more merging fewer runs

slide-22
SLIDE 22

lookup cost main memory update cost

slide-23
SLIDE 23

lookup cost main memory update cost update cost lookup cost

existing systems

fixed memory

slide-24
SLIDE 24

lookup cost main memory update cost

merge more merge less

fixed memory

existing systems

update cost lookup cost

slide-25
SLIDE 25

lookup cost main memory update cost less memory more memory update cost lookup cost

slide-26
SLIDE 26

Problem 1:

existing systems

Problem 2: update cost lookup cost

slide-27
SLIDE 27

Problem 1:

existing systems

Problem 2: suboptimal filters allocation update cost lookup cost

slide-28
SLIDE 28

suboptimal filters allocation fixed memory Pareto frontier

x x

existing systems

Problem 1: Problem 2: update cost lookup cost

slide-29
SLIDE 29

x x

Problem 1: suboptimal filters allocation Problem 2: hard to tune update cost lookup cost

slide-30
SLIDE 30

x x

Bloom filters size

lookups vs. memory

Problem 1: suboptimal filters allocation Problem 2: hard to tune update cost lookup cost

slide-31
SLIDE 31

max throughput

x x

merge policy greed

lookups vs. updates

Problem 1: suboptimal filters allocation Problem 2: hard to tune update cost lookup cost

slide-32
SLIDE 32

Monkey: Optimal Navigable Key-Value Store

slide-33
SLIDE 33

Monkey: Optimal Navigable Key-Value Store

insights: steps: observations:
slide-34
SLIDE 34

Monkey: Optimal Navigable Key-Value Store

fixed false positive rates lookup cost = ∑ pi suboptimal filters

optimize allocation

asymptotically better memory vs. lookups

insights: steps: observations:
slide-35
SLIDE 35

Monkey: Optimal Navigable Key-Value Store

merge policy log sorted array performance

?

fixed false positive rates lookup cost = ∑ pi suboptimal filters LSM-tree

optimize allocation

asymptotically better memory vs. lookups updates vs. lookups navigate

insights: steps: observations:
slide-36
SLIDE 36

Monkey: Optimal Navigable Key-Value Store

merge policy log sorted array memory lookups updates ad-hoc trade-offs update cost Monkey answer what-if design questions performance

?

lookup cost existing fixed false positive rates lookup cost = ∑ pi suboptimal filters LSM-tree updates vs. lookups navigate

optimize allocation

asymptotically better memory vs. lookups

insights: steps: observations:
slide-37
SLIDE 37

update cost lookup cost Pareto frontier

WiredTiger Cassandra, HBase

Monkey

RocksDB, LevelDB

for fixed memory

slide-38
SLIDE 38

update cost lookup cost Pareto frontier

WiredTiger Cassandra, HBase RocksDB, LevelDB

max throughput Monkey for fixed memory

slide-39
SLIDE 39

Monkey: Optimal Navigable Key-Value Store

merge policy log sorted array memory lookups updates ad-hoc trade-offs update cost Monkey answer what-if design questions performance

?

lookup cost existing fixed false positive rates lookup cost = ∑ pi suboptimal filters LSM-tree updates vs. lookups navigate

optimize allocation

asymptotically better memory vs. lookups

insights: steps: observations:
slide-40
SLIDE 40

Monkey: Optimal Navigable Key-Value Store

merge policy log sorted array memory lookups updates answer what-if design questions performance

?

update cost Monkey lookup cost existing fixed false positive rates lookup cost = ∑ pi suboptimal filters LSM-tree ad-hoc trade-offs updates vs. lookups navigate

optimize allocation

asymptotically better memory vs. lookups

insights: steps: observations:
slide-41
SLIDE 41

f fence pointers Bloom filters buffer memory storage data

slide-42
SLIDE 42

f fence pointers Bloom filters buffer memory storage data

slide-43
SLIDE 43

f fence pointers Bloom filters buffer memory storage data

slide-44
SLIDE 44

f fence pointers Bloom filters buffer memory storage data > <

slide-45
SLIDE 45

f Bloom filters memory storage data

slide-46
SLIDE 46

f Bloom filters storage memory

X bits per entry

data

slide-47
SLIDE 47

f Bloom filters storage memory data

X bits per entry

slide-48
SLIDE 48

Bloom filters memory

X bits per entry

false positive rate $p = e^{-(M/N) \cdot \ln(2)^2}$ (bits $M$, entries $N$)
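The relation on this slide, p = e^(-(M/N) · ln(2)^2), is the standard false positive rate of a Bloom filter that uses the optimal number of hash functions. A small sketch to evaluate it (the function name is illustrative):

```python
import math


def false_positive_rate(bits: float, entries: float) -> float:
    """False positive rate p for a filter with `bits` bits over `entries` keys.

    Assumes the optimal number of hash functions is used.
    """
    return math.exp(-(bits / entries) * math.log(2) ** 2)
```

For instance, 10 bits per entry gives a rate a little under 1%, and every extra bit per entry lowers the rate by a constant factor.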

slide-49
SLIDE 49

Bloom filters memory

X bits per entry

p p p

false positive rate $p = e^{-(M/N) \cdot \ln(2)^2}$ (bits $M$, entries $N$)

slide-50
SLIDE 50

Bloom filters memory p p p

false positive rate $p = e^{-(M/N) \cdot \ln(2)^2}$ (bits $M$, entries $N$)

X bits per entry

worst-case I/O overhead:

slide-51
SLIDE 51

Bloom filters memory

p p p

false positive rate $p = e^{-(M/N) \cdot \ln(2)^2}$ (bits $M$, entries $N$)

X bits per entry

worst-case I/O overhead: $O(\sum p)$

slide-52
SLIDE 52

Bloom filters memory

p p p

false positive rate $p = e^{-(M/N) \cdot \ln(2)^2}$ (bits $M$, entries $N$)

X bits per entry

worst-case I/O overhead: $O(\sum p)$
slide-53
SLIDE 53

Bloom filters memory p p p

false positive rate $p = e^{-(M/N) \cdot \ln(2)^2}$ (bits $M$, entries $N$)

X bits per entry

worst-case I/O overhead: $O(\sum e^{-M/N})$

slide-54
SLIDE 54

Bloom filters memory p p p O(log(N))

X bits per entry

worst-case I/O overhead: $O(\sum e^{-M/N})$

slide-55
SLIDE 55

Bloom filters memory p p p O(log(N))

X bits per entry

worst-case I/O overhead: $O(\log(N) \cdot e^{-M/N})$
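A short sketch of where the log(N) factor comes from: when every level's filter gets the same bits per entry, every level has the same false positive rate p, and a lookup for a missing key wastes on average p I/Os at each of the O(log(N)) levels. The function and parameter names are illustrative.

```python
import math


def uniform_lookup_overhead(bits_per_entry: float, num_levels: int) -> float:
    """Expected wasted I/Os per lookup of a missing key when all levels
    share the same false positive rate (one run per level assumed)."""
    p = math.exp(-bits_per_entry * math.log(2) ** 2)
    return num_levels * p
```

With the bits per entry held fixed, the overhead grows linearly with the number of levels, hence logarithmically with the number of entries.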

slide-56
SLIDE 56

Bloom filters memory p p p

X bits per entry

Can we do better?

worst-case I/O overhead: $O(\log(N) \cdot e^{-M/N})$

slide-57
SLIDE 57

fence pointers Bloom filters lookup key X data runs … …

slide-58
SLIDE 58

fence pointers Bloom filters

false positive

lookup key X

false positive false positive false positive false positive

data runs

I/O I/O I/O I/O I/O

… …

slide-59
SLIDE 59

fence pointers Bloom filters

false positive

lookup key X

false positive false positive false positive false positive

data runs

I/O I/O I/O I/O I/O

most memory … …

slide-60
SLIDE 60

fence pointers Bloom filters

false positive

lookup key X

false positive false positive false positive false positive

data runs

I/O I/O I/O I/O I/O

most memory saves at most 1 I/O

most memory … …
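A sketch that quantifies "most memory": with the same bits per entry everywhere, filter memory is proportional to level size, so the largest level's filter takes nearly all of it while still saving at most one I/O per lookup. The function name and the assumption that each level is exactly `size_ratio` times larger than the previous one are illustrative.

```python
def memory_share_by_level(size_ratio: int, num_levels: int) -> list[float]:
    """Fraction of total filter memory used by each level, smallest level
    first, assuming uniform bits per entry and exponentially growing levels."""
    sizes = [size_ratio ** i for i in range(num_levels)]
    total = sum(sizes)
    return [size / total for size in sizes]
```

With a size ratio of 10 and three levels, the largest level alone accounts for 100/111 of the filter memory.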

slide-61
SLIDE 61

Bloom filters false positive rates reallocate some most memory

slide-62
SLIDE 62

Bloom filters false positive rates

same memory, fewer lookup I/Os

reallocate some most memory

slide-63
SLIDE 63

0 < p2 < 1 0 < p1 < 1 0 < p0 < 1

xx xx

false positive rates relax

slide-64
SLIDE 64

0 < p2 < 1 0 < p1 < 1 0 < p0 < 1 false positive rates relax

slide-65
SLIDE 65

false positive rates: relax $0 < p_0 < 1$, $0 < p_1 < 1$, $0 < p_2 < 1$; model lookup cost $= f(p_0, p_1, \ldots)$ and memory footprint $= f(p_0, p_1, \ldots)$

slide-66
SLIDE 66

false positive rates: relax $0 < p_0 < 1$, $0 < p_1 < 1$, $0 < p_2 < 1$; model lookup cost $= f(p_0, p_1, \ldots)$ and memory footprint $= f(p_0, p_1, \ldots)$ in terms of $p_0$, $p_1$, …; optimize
slide-67
SLIDE 67

lookup cost = ∑ pi p2 p1 p0 false positive rates Bloom filters …

slide-68
SLIDE 68

false positive rate $= e^{-(\text{bits}/\text{entries}) \cdot \ln(2)^2}$

memory footprint p2 p1 p0 false positive rates Bloom filters … lookup cost $= \sum_i p_i$

slide-69
SLIDE 69

bits $= -\dfrac{\text{entries} \cdot \ln(\text{false positive rate})}{\ln(2)^2}$

memory footprint p2 p1 p0 false positive rates Bloom filters … lookup cost $= \sum_i p_i$
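Solving the false positive rate formula for the number of bits gives the per-filter memory cost used on this slide. A minimal sketch (the function name is illustrative):

```python
import math


def filter_bits(false_positive_rate: float, entries: float) -> float:
    """Bits needed for a Bloom filter over `entries` keys to reach the given
    false positive rate, assuming the optimal number of hash functions."""
    if not 0 < false_positive_rate <= 1:
        raise ValueError("false_positive_rate must be in (0, 1]")
    return -entries * math.log(false_positive_rate) / math.log(2) ** 2
```

A target rate of 1% costs roughly 9.6 bits per entry, and a rate of 1 (no filter at all) costs nothing.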

slide-70
SLIDE 70

$\text{bits}(p_0, N)$, $\text{bits}(p_1, N/T)$, $\text{bits}(p_2, N/T^2)$

memory footprint … p2 p1 p0 false positive rates Bloom filters … lookup cost = ∑ pi

slide-71
SLIDE 71

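$\text{bits}(p_0, N)$, $\text{bits}(p_1, N/T)$, $\text{bits}(p_2, N/T^2)$

memory footprint … p2 p1 p0 false positive rates Bloom filters … lookup cost $= \sum_i p_i$

memory $= c \cdot N \cdot \sum_i \dfrac{-\ln(p_i)}{T^i}$, where $c$ is a constant, $N$ the entries, $T$ the size ratio, and $p_i$ the false positive rates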

slide-72
SLIDE 72
optimize

p2 p1 p0 false positive rates Bloom filters … lookup cost $= \sum_i p_i$

memory $= c \cdot N \cdot \sum_i \dfrac{-\ln(p_i)}{T^i}$
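A numerical sketch of the optimization step: for one fixed memory budget it compares equal false positive rates on every level with rates that shrink by a factor of T per smaller level. Minimizing ∑ p_i under a fixed ∑ -ln(p_i)/T^i with a Lagrange multiplier suggests p_i proportional to the level size, which is the geometric allocation used here. All names are illustrative, N is taken to be the entries in the largest level, c = 1/ln(2)^2 is assumed, and the budget is assumed large enough that every rate stays below 1.

```python
import math

LN2_SQ = math.log(2) ** 2


def filter_memory(fprs, largest_level_entries, size_ratio):
    """Total filter bits; fprs[i] belongs to the level with N / T**i entries."""
    return sum(
        -(largest_level_entries / size_ratio ** i) * math.log(p) / LN2_SQ
        for i, p in enumerate(fprs)
    )


def uniform_fprs(memory_bits, largest_level_entries, size_ratio, num_levels):
    """Same false positive rate on every level, using exactly memory_bits."""
    s0 = sum(1 / size_ratio ** i for i in range(num_levels))
    p = math.exp(-memory_bits * LN2_SQ / (largest_level_entries * s0))
    return [p] * num_levels


def geometric_fprs(memory_bits, largest_level_entries, size_ratio, num_levels):
    """Rates p0, p0/T, p0/T**2, ... using exactly memory_bits."""
    s0 = sum(1 / size_ratio ** i for i in range(num_levels))
    s1 = sum(i / size_ratio ** i for i in range(num_levels))
    ln_p0 = (math.log(size_ratio) * s1
             - memory_bits * LN2_SQ / largest_level_entries) / s0
    p0 = math.exp(ln_p0)
    if p0 >= 1:
        raise ValueError("memory budget too small for this sketch")
    return [p0 / size_ratio ** i for i in range(num_levels)]
```

Summing each list gives the modeled lookup cost ∑ p_i; under these assumptions the geometric allocation yields the smaller sum for the same memory.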

slide-73
SLIDE 73

Monkey Bloom filters … false positive rates $p_0/T^2$, $p_0/T$, $p_0$ (exponential decrease)

slide-74
SLIDE 74

State-of-the-Art Bloom filters … false positive rates $p$, $p$, $p$ (same)

Monkey Bloom filters … false positive rates $p_0/T^2$, $p_0/T$, $p_0$ (exponential decrease)

slide-75
SLIDE 75

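State-of-the-Art Bloom filters … false positive rates $p$, $p$, $p$; lookup cost $= \sum p$

Monkey Bloom filters … false positive rates $p_0/T^2$, $p_0/T$, $p_0$; lookup cost $= \sum_i p_i$

$p_0/T^2 < p$, $p_0/T < p$, $p_0 > p$, and overall $\sum_i p_i < \sum p$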

slide-76
SLIDE 76

State-of-the-Art Bloom filters … false positive rates $p$, $p$, $p$; lookup cost $= \sum p = O(\log(N) \cdot e^{-M/N})$

Monkey Bloom filters … false positive rates $p_0/T^2$, $p_0/T$, $p_0$; lookup cost $= \sum_i p_i = O(e^{-M/N})$

N | number of entries M | overall memory for Bloom filters

slide-77
SLIDE 77

State-of-the-Art Bloom filters … false positive rates $p$, $p$, $p$; lookup cost $= \sum p = O(\log(N) \cdot e^{-M/N})$

Monkey Bloom filters … false positive rates $p_0/T^2$, $p_0/T$, $p_0$; lookup cost $= \sum_i p_i = O(e^{-M/N})$

N | number of entries M | overall memory for Bloom filters

asymptotic win

lookup cost increases at slower rate as data grows

slide-78
SLIDE 78

Monkey Bloom filters … false positive rates $p_0/T^2$, $p_0/T$, $p_0$: convergent geometric series
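The "convergent geometric series" remark can be spelled out as follows. This is a sketch that assumes $L$ levels, a size ratio $T > 1$, and the rates $p_0, p_0/T, p_0/T^2, \ldots$ listed on the slide; $L$ is a symbol introduced here for illustration.

```latex
\sum_{i=0}^{L-1} \frac{p_0}{T^i}
  \;<\; \sum_{i=0}^{\infty} \frac{p_0}{T^i}
  \;=\; p_0 \cdot \frac{1}{1 - 1/T}
  \;=\; p_0 \cdot \frac{T}{T-1}
```

The bound does not depend on the number of levels, so under this allocation the sum of false positive rates stays within a constant factor of $p_0$ however many levels the tree has.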

slide-79
SLIDE 79

Monkey Bloom filters … false positive rates $p_0/T^2$, $p_0/T$, $p_0$

memory $= c \cdot \text{entries} \cdot \sum_i \dfrac{-\ln(p_i)}{T^i}$

slide-80
SLIDE 80

Monkey Bloom filters … false positive rates $p_0/T^2$, $p_0/T$, $p_0$

memory $= c \cdot \text{entries} \cdot (-\ln(\text{lookup cost}))$

slide-81
SLIDE 81

Monkey Bloom filters … false positive rates $p_0/T^2$, $p_0/T$, $p_0$

model lookups vs. memory trade-off

memory $= c \cdot \text{entries} \cdot (-\ln(\text{lookup cost}))$

slide-82
SLIDE 82

fixed memory

existing systems

Problem 1: suboptimal filters allocation Problem 2: hard to tune update cost lookup cost

slide-83
SLIDE 83

fixed memory Pareto frontier

x x

existing systems

Problem 1: suboptimal filters allocation Problem 2: hard to tune update cost lookup cost

slide-84
SLIDE 84

x x

Bloom filters size

lookups vs. memory

Problem 1: suboptimal filters allocation Problem 2: hard to tune update cost lookup cost

slide-85
SLIDE 85

max throughput

x x

merge policy greed

lookups vs. updates

Problem 1: suboptimal filters allocation Problem 2: hard to tune update cost lookup cost

slide-86
SLIDE 86

Monkey: Optimal Navigable Key-Value Store

merge policy log sorted array memory lookups updates update cost Monkey answer what-if design questions performance

?

lookup cost existing fixed false positive rates lookup cost = ∑ pi suboptimal filters LSM-tree ad-hoc trade-offs updates vs. lookups navigate

optimize allocation

asymptotically better memory vs. lookups

insights: steps: observations:
slide-87
SLIDE 87

Monkey: Optimal Navigable Key-Value Store

merge policy log sorted array memory lookups updates update cost Monkey answer what-if design questions performance

?

lookup cost existing fixed false positive rates lookup cost = ∑ pi suboptimal filters LSM-tree ad-hoc trade-offs updates vs. lookups navigate

optimize allocation

asymptotically better memory vs. lookups

insights: steps: observations:
slide-88
SLIDE 88

Identify size ratio merge policy

slide-89
SLIDE 89

Identify Map size ratio lookups updates merge policy

slide-90
SLIDE 90

Identify Map size ratio merge policy lookups updates

sorted array log LSM-tree

slide-91
SLIDE 91

Identify Map size ratio merge policy sorted array log lookups updates

slide-92
SLIDE 92

Identify Map size ratio lookups updates merge policy Navigate workload hardware

optimal

maximum throughput log sorted array

slide-93
SLIDE 93

Leveling Tiering

Merge Policies

read-optimized write-optimized

slide-94
SLIDE 94

Leveling

read-optimized

Tiering

write-optimized

slide-95
SLIDE 95

read-optimized write-optimized

Leveling Tiering T runs per level

slide-96
SLIDE 96

read-optimized write-optimized

Leveling Tiering T runs per level merge & flush

slide-97
SLIDE 97

read-optimized write-optimized

Leveling Tiering T runs per level

slide-98
SLIDE 98

read-optimized write-optimized

Leveling Tiering T runs per level merge

slide-99
SLIDE 99

read-optimized write-optimized

Leveling Tiering T runs per level flush T times bigger

slide-100
SLIDE 100

read-optimized write-optimized

Leveling Tiering T runs per level T times bigger

slide-101
SLIDE 101

Tiering: write-optimized, T runs per level

Leveling: read-optimized, 1 run per level

slide-102
SLIDE 102

lookup cost:

Tiering (write-optimized, T runs per level): $O(T \cdot \log_T(N) \cdot e^{-M/N})$ = runs per level × levels × false positive rate

Leveling (read-optimized, 1 run per level): $O(\log_T(N) \cdot e^{-M/N})$ = levels × false positive rate

slide-103
SLIDE 103

lookup cost:

Tiering (write-optimized, T runs per level): $O(T \cdot \log_T(N) \cdot e^{-M/N})$ = runs per level × levels × false positive rate

Leveling (read-optimized, 1 run per level): $O(\log_T(N) \cdot e^{-M/N})$ = levels × false positive rate

slide-104
SLIDE 104

lookup cost:

Tiering (write-optimized, T runs per level): $O(T \cdot e^{-M/N})$ = runs per level × false positive rate

Leveling (read-optimized, 1 run per level): $O(e^{-M/N})$ = false positive rate

slide-105
SLIDE 105

Tiering (write-optimized, T runs per level): lookup cost $O(T \cdot e^{-M/N})$; update cost $O(\log_T(N))$ = levels

Leveling (read-optimized, 1 run per level): lookup cost $O(e^{-M/N})$; update cost $O(T \cdot \log_T(N))$ = merges per level × levels

slide-106
SLIDE 106

size ratio T

Tiering (write-optimized, T runs per level): lookup cost $O(T \cdot e^{-M/N})$; update cost $O(\log_T(N))$

Leveling (read-optimized, 1 run per level): lookup cost $O(e^{-M/N})$; update cost $O(T \cdot \log_T(N))$
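A sketch of the cost model on this slide as code, useful for comparing the two merge policies at a given size ratio. The returned values are the big-O terms with constants dropped, so only their relative sizes are meaningful. The function name and parameters are illustrative, `bits_per_entry` stands for M/N, and the factor `size_ratio - 1` in place of T is an assumption made here so that the two policies give identical costs at a size ratio of 2.

```python
import math


def policy_costs(policy: str, size_ratio: int, num_entries: int,
                 bits_per_entry: float) -> dict:
    """Modeled lookup and update cost terms for 'tiering' or 'leveling'."""
    if size_ratio < 2:
        raise ValueError("size_ratio must be at least 2")
    levels = math.log(num_entries, size_ratio)   # log_T(N)
    fpr_term = math.exp(-bits_per_entry)         # e^(-M/N)
    runs_factor = size_ratio - 1                 # assumption, see lead-in
    if policy == "tiering":
        # Many runs per level to probe, but each entry is merged once per level.
        return {"lookup": runs_factor * fpr_term, "update": levels}
    if policy == "leveling":
        # One run per level to probe, but entries are re-merged within a level.
        return {"lookup": fpr_term, "update": runs_factor * levels}
    raise ValueError("policy must be 'tiering' or 'leveling'")
```

Raising the size ratio moves tiering toward cheaper updates and costlier lookups, and leveling in the opposite direction.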

slide-107
SLIDE 107

size ratio T

Tiering (write-optimized): 1 run per level

Leveling (read-optimized): 1 run per level

lookup cost: $O(e^{-M/N}) = O(e^{-M/N})$

update cost: $O(\log(N)) = O(\log(N))$

slide-108
SLIDE 108

size ratio T

Tiering (write-optimized, T runs per level): lookup cost $O(T \cdot e^{-M/N})$; update cost $O(\log_T(N))$

Leveling (read-optimized, 1 run per level): lookup cost $O(e^{-M/N})$; update cost $O(T \cdot \log_T(N))$

slide-109
SLIDE 109

size ratio T

Tiering (write-optimized, $O(N)$ runs per level): lookup cost $O(N \cdot e^{-M/N})$; update cost $O(1)$

Leveling (read-optimized, 1 run per level): lookup cost $O(e^{-M/N})$; update cost $O(N)$

slide-110
SLIDE 110

size ratio T

Tiering (write-optimized, $O(N)$ runs per level) becomes a log: lookup cost $O(N \cdot e^{-M/N})$; update cost $O(1)$

Leveling (read-optimized, 1 run per level) becomes a sorted array: lookup cost $O(e^{-M/N})$; update cost $O(N)$

slide-111
SLIDE 111

lookup cost update cost Tiering log Leveling sorted array

slide-112
SLIDE 112

lookup cost update cost Tiering log Leveling sorted array T=2 T | size ratio

slide-113
SLIDE 113

lookup cost update cost Tiering log Leveling sorted array T | size ratio T=2

sorted array log LSM-tree

slide-114
SLIDE 114

lookup cost update cost Tiering log Leveling sorted array T | size ratio workload hardware

optimal

maximum throughput T=2
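A sketch of what "navigating" the design space could look like: sweep the merge policy and size ratio, weight the modeled lookup and update costs by the workload mix, and keep the cheapest design. This is an illustration under strong assumptions, not the tuning procedure from the talk: the cost terms are the unitless big-O expressions from the preceding slides (with `T - 1` used in place of T so both policies coincide at T = 2), hardware is ignored, and all names are hypothetical.

```python
import math


def design_costs(policy, size_ratio, num_entries, bits_per_entry):
    """Return (lookup, update) cost terms for one design point."""
    levels = math.log(num_entries, size_ratio)
    fpr_term = math.exp(-bits_per_entry)
    if policy == "leveling":
        return fpr_term, (size_ratio - 1) * levels
    return (size_ratio - 1) * fpr_term, levels  # tiering


def navigate(lookup_fraction, num_entries, bits_per_entry, ratios=range(2, 65)):
    """Pick the (policy, size ratio) with the lowest workload-weighted cost."""
    best = None
    for policy in ("leveling", "tiering"):
        for size_ratio in ratios:
            lookup, update = design_costs(
                policy, size_ratio, num_entries, bits_per_entry)
            cost = lookup_fraction * lookup + (1 - lookup_fraction) * update
            if best is None or cost < best[0]:
                best = (cost, policy, size_ratio)
    return best
```

In this toy model an update-heavy mix is pushed toward tiering with a large size ratio, while a lookup-heavy mix settles at the small size ratio where the two policies meet.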

slide-115
SLIDE 115

update cost lookup cost max throughput

x x

merge policy greed

lookups vs. updates

Problem 1: suboptimal filters allocation Problem 2: hard to tune

slide-116
SLIDE 116
slide-117
SLIDE 117

[Figure: lookup latency (ms) vs. number of entries (log scale), LevelDB vs. Monkey: better asymptotic scalability]

slide-118
SLIDE 118

[Figure: lookup latency (ms) vs. number of entries (log scale), LevelDB vs. Monkey: better asymptotic scalability]

[Figure: lookup latency (ms) vs. % lookups in workload, LevelDB vs. fixed Monkey vs. navigable Monkey: workload adaptability; point labels T4 T2L L4 L4 L6 L6 L8 L8 L16]

slide-119
SLIDE 119

http://daslab.seas.harvard.edu/crimsondb/ self-designs navigates what-if?

slide-120
SLIDE 120

Monkey: Optimal Navigable Key-Value Store

merge policy log sorted array memory lookups updates update cost Monkey answer what-if design questions performance

?

lookup cost existing fixed false positive rates lookup cost = ∑ pi suboptimal filters LSM-tree ad-hoc trade-offs updates vs. lookups navigate

optimize allocation

asymptotically better memory vs. lookups

insights: steps: observations:
slide-121
SLIDE 121

Monkey: Optimal Navigable Key-Value Store

0 < memory < ∞ more in paper: buffer filters cache skewed & range lookups

slide-122
SLIDE 122

Monkey: Optimal Navigable Key-Value Store

skewed & range lookups 0 < memory < ∞ more in paper: buffer filters cache http://daslab.seas.harvard.edu/monkey/

slide-123
SLIDE 123

Monkey: Optimal Navigable Key-Value Store

skewed & range lookups Thanks! 0 < memory < ∞ more in paper: buffer filters cache http://daslab.seas.harvard.edu/monkey/