SLIDE 1

WiscKey: Separating Keys from Values in SSD-Conscious Storage

Lanyue Lu, Thanumalayan Pillai, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

University of Wisconsin-Madison

SLIDE 2

Key-Value Stores

SLIDES 3-4

Key-Value Stores

Key-value stores are important
➡ web indexing, e-commerce, social networks
➡ various key-value stores
  ➡ hash table, B-tree
  ➡ log-structured merge-trees (LSM-trees)

LSM-tree based key-value stores are popular
➡ optimized for write-intensive workloads
➡ widely deployed
  ➡ BigTable and LevelDB at Google
  ➡ HBase, Cassandra, and RocksDB at Facebook

SLIDES 5-8

Why LSM-trees?

Good for hard drives
➡ batch and write sequentially
➡ high sequential throughput
➡ sequential access up to 1000x faster than random

Not optimal for SSDs
➡ large write/read amplification
➡ wastes device resources
➡ unique characteristics of SSDs
  ➡ fast random reads
  ➡ internal parallelism

SLIDES 9-16

Our Solution: WiscKey

Separate keys from values
➡ decouple sorting and garbage collection
➡ harness the SSD's internal parallelism for range queries
➡ online and lightweight garbage collection
➡ minimize I/O amplification and stay crash consistent

[Figure: keys are stored in the LSM-tree; values are stored in a separate Value Log]

Performance of WiscKey
➡ 2.5x to 111x faster for loading, 1.6x to 14x for lookups

SLIDE 17

Background
Key-Value Separation
Challenges and Optimizations
Evaluation
Conclusion

SLIDES 18-27

LSM-trees: Insertion

[Figure, LevelDB: an incoming key-value pair is appended to the on-disk log and inserted into the in-memory memtable; the full memtable becomes immutable and is flushed to L0; compaction then merges data down through the levels: L0 (8MB), L1 (10MB), L2 (100MB), ..., L6 (1TB)]

• 1. Write sequentially
• 2. Sort data for quick lookups
• 3. Sorting and garbage collection are coupled (sketched below)
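
To make the insertion path concrete, here is a minimal, hypothetical sketch in Python (the class, constants, and sizes are illustrative, not LevelDB's actual code): a put appends to the log, buffers the pair in the memtable, flushes full memtables as sorted runs to L0, and compaction merge-sorts runs downward.

```python
# Toy LSM-tree insertion path; the comments mirror the slide's points.
class ToyLSM:
    MEMTABLE_LIMIT = 4  # tiny, for illustration; LevelDB buffers ~4 MB

    def __init__(self):
        self.log = []                          # on-disk write-ahead log
        self.memtable = {}                     # in-memory buffer
        self.levels = [[] for _ in range(7)]   # L0..L6: lists of sorted runs

    def put(self, key, value):
        self.log.append((key, value))          # 1. write sequentially
        self.memtable[key] = value
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        run = sorted(self.memtable.items())    # 2. sort for quick lookups
        self.levels[0].append(run)             # flush a sorted run to L0
        self.memtable, self.log = {}, []
        self._compact(0)

    def _compact(self, i):
        # 3. sorting and garbage collection are coupled: merging levels
        # rereads and rewrites live data just to drop overwritten keys,
        # which is the source of LSM-tree write amplification.
        if len(self.levels[i]) > 2 and i + 1 < len(self.levels):
            merged = {}
            for run in self.levels[i + 1] + self.levels[i]:  # old, then new
                merged.update(run)
            self.levels[i + 1] = [sorted(merged.items())]
            self.levels[i] = []
            self._compact(i + 1)
```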

SLIDES 28-35

LSM-trees: Lookup

[Figure, LevelDB: a lookup for key K checks the in-memory memtable first, then searches the on-disk levels from L0 down through L1 to L6: L0 (8MB), L1 (10MB), L2 (100MB), ..., L6 (1TB)]

• 1. Random reads
• 2. Traverse many levels for a large LSM-tree (sketched below)
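
Continuing the ToyLSM sketch above, a lookup checks the memtable, then every overlapping run in L0, then at most one run per deeper level; each probe that misses costs another (potentially random) read.

```python
def get(db, key):
    if key in db.memtable:                    # 1. in-memory memtable
        return db.memtable[key]
    for run in reversed(db.levels[0]):        # 2. L0 runs can overlap,
        hit = dict(run).get(key)              #    so each one is probed
        if hit is not None:
            return hit
    for level in db.levels[1:]:               # 3. L1 to L6: one sorted,
        for run in level:                     #    non-overlapping run each
            hit = dict(run).get(key)
            if hit is not None:
                return hit
    return None                               # traversed every level
```
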
SLIDES 36-39

I/O Amplification in LSM-trees

[Figure: amplification ratio, log scale from 1 to 1000, measured on a 100 GB LevelDB database: write amplification 14, read amplification 327]

Random load: a 100 GB database
Random lookup: 100,000 lookups

Problems (see the note below):
➡ large write amplification
➡ large read amplification
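
Where these ratios come from, following the paper's back-of-the-envelope analysis: compacting a file from one level into the next can rewrite up to ten files of the lower level for every new file, since level sizes grow by 10x, so write amplification accumulates level by level. A worst-case lookup may probe 14 files (up to eight in L0 plus one per deeper level), and each probe reads a 16 KB index block, a 4 KB bloom-filter block, and a 4 KB data block just to return a 1 KB pair: 14 × 24 KB / 1 KB = 336x, in line with the measured 327x.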

SLIDE 40

Background
Key-Value Separation
Challenges and Optimizations
Evaluation
Conclusion

SLIDES 41-47

Key-Value Separation

Main idea: only keys are required to be sorted
Decouple sorting and garbage collection (sketched below)

[Figure: the key and the value's address (k, addr) go into the LSM-tree; the value itself is appended to the Value Log; both live on the SSD device]
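
A minimal, hypothetical sketch of the separation (the names and record framing are illustrative, not WiscKey's actual code): values are appended to a log file, while the key structure, a plain dict standing in for the key LSM-tree, maps each key to its record's address and size. A get then costs one random read into the value log, which is cheap on flash.

```python
import os
import struct

class ToyWiscKey:
    def __init__(self, path="vlog"):
        self.vlog = open(path, "ab+")   # append-only value log
        self.tail = 0                   # oldest live offset (used by GC later)
        self.lsm = {}                   # stands in for the key LSM-tree

    def put(self, key: bytes, value: bytes):
        rec = struct.pack("<II", len(key), len(value)) + key + value
        addr = self.vlog.seek(0, os.SEEK_END)       # append at the head
        self.vlog.write(rec)
        self.vlog.flush()
        self.lsm[key] = (addr, len(rec))            # only (k, addr) is sorted

    def get(self, key: bytes) -> bytes:
        addr, size = self.lsm[key]
        rec = os.pread(self.vlog.fileno(), size, addr)  # one random SSD read
        ksize, vsize = struct.unpack_from("<II", rec)
        return rec[8 + ksize : 8 + ksize + vsize]
```

Only the small (key, address) entries flow into the LSM-tree, so sorting never has to move the values again; the value log is cleaned independently by the garbage collector described later.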

SLIDES 48-53

Random Load

[Figure: load throughput (MB/s, 50 to 500) for LevelDB vs. WiscKey; key: 16B, value: 64B to 256KB]

load a 100 GB database
➡ LevelDB: only 2 MB/s to 4.1 MB/s, due to its large write amplification (12 to 16)
➡ WiscKey: small write amplification thanks to key-value separation (up to 111x higher throughput)

SLIDES 54-57

[Table: per-level file-count limits (L1: 5, L2: 50, L3: 500, L4: 5000, L5: 50000, L6: 500000) and actual file counts for a 100 GB database: LevelDB holds tens of thousands of files across its levels (9, 30, 365, 2184, 15752, 23733), while WiscKey's key-only LSM-tree holds only a few hundred (7, 11, 127, 460)]

Large LSM-tree (LevelDB): intensive compaction
➡ repeated reads/writes
➡ stalls foreground I/Os
Many levels
➡ traverse several levels for each lookup

Small LSM-tree (WiscKey): less compaction, fewer levels to search, and better caching
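
The gap follows from rough arithmetic: with 16 B keys and 1 KB values, a 100 GB database holds on the order of 10^8 pairs, and a tree storing only a key plus a small fixed-size value address needs a few GB at most, tens of times less than storing the pairs themselves; hence hundreds of files instead of tens of thousands, and far fewer levels to compact and search.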

SLIDES 58-61

Random Lookup

[Figure: lookup throughput (MB/s, 50 to 300) for LevelDB vs. WiscKey; key: 16B, value: 64B to 256KB]

100,000 lookups on a randomly loaded 100 GB database
➡ LevelDB suffers large read amplification
➡ WiscKey's smaller LSM-tree leads to better lookup performance (1.6x to 14x)

SLIDE 62

Background
Key-Value Separation
Challenges and Optimizations
➡ Parallel range query
➡ Garbage collection
➡ LSM-tree log
Evaluation
Conclusion

SLIDES 63-65

Parallel Range Query

SSD read performance
➡ sequential, random, parallel

[Figure: read throughput (MB/s, 100 to 600) for request sizes 1KB to 256KB, comparing Sequential, Rand-1thread, and Rand-32threads; reads on a 100 GB file on ext4; SSD: Samsung 840 EVO 500GB]

SLIDES 66-69

Parallel Range Query

Challenge
➡ sequential reads in LevelDB
➡ keys and values are read separately in WiscKey

Parallel range query (sketched below)
➡ leverage parallel random reads of SSDs
➡ prefetch key-value pairs in advance
➡ range query interface: seek(), next(), prev()
➡ detect a sequential access pattern
➡ prefetch concurrently in the background
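
A minimal, hypothetical sketch of the prefetching idea, reusing the ToyWiscKey class above (detection of the sequential pattern via seek()/next() is elided, and the worker count is an illustrative choice): once a scan is detected, the keys to fetch are handed to a thread pool so the random reads into the value log proceed in parallel and exploit the SSD's internal parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

def range_query(db, sorted_keys, workers=32):
    # db.get() issues one positioned read (os.pread) per value, so worker
    # threads can hit the value log concurrently and in any order;
    # pool.map still yields the values in key order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(db.get, sorted_keys)
```

With small values such a scan is bounded by the device's parallel random-read throughput; the range-query results on the following slides show exactly this bound.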

SLIDES 70-74

Range Query

[Figure: range query throughput (MB/s, 100 to 600) for LevelDB-Rand vs. WiscKey-Rand; key: 16B, value: 64B to 256KB]

read 4 GB from a randomly loaded 100 GB database
➡ for large key-value pairs, WiscKey performs better
➡ for small pairs on an unsorted database, WiscKey is worse: it is limited by the SSD's parallel random-read performance

SLIDES 75-78

Range Query

[Figure: range query throughput (MB/s, 100 to 600) for LevelDB-Rand, WiscKey-Rand, LevelDB-Seq, and WiscKey-Seq; key: 16B, value: 64B to 256KB]

read 4 GB from a sequentially loaded 100 GB database
➡ both WiscKey and LevelDB read sequentially
➡ sorted databases help WiscKey's range query

SLIDES 79-87

Optimizations

[Figure: the LSM-tree holds (k, addr) entries; the Value Log on the SSD device holds (ksize, vsize, key, value) records between a tail and a head pointer, with new records appended at the head]

Online and lightweight garbage collection
➡ append (ksize, vsize, key, value) records to the value log

Remove the LSM-tree log in WiscKey
➡ store the head in the LSM-tree periodically
➡ scan the value log from the head to recover (both sketched below)
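
A minimal, hypothetical sketch of both mechanisms, continuing the ToyWiscKey sketch above (the helper names and chunk size are illustrative, not WiscKey's actual code): garbage collection reads records at the tail, consults the key structure to see whether each record is still the live copy, re-appends live ones at the head, and advances the tail; recovery rebuilds addresses by scanning forward from the last head offset persisted in the LSM-tree.

```python
import os
import struct

def _read_record(db, pos):
    hdr = os.pread(db.vlog.fileno(), 8, pos)
    if len(hdr) < 8:
        return None
    ksize, vsize = struct.unpack("<II", hdr)
    size = 8 + ksize + vsize
    rec = os.pread(db.vlog.fileno(), size, pos)
    return rec[8:8 + ksize], rec[8 + ksize:], size   # key, value, record size

def gc_step(db, chunk=4096):
    """Clean one chunk of the value log, from the tail toward the head."""
    pos, limit = db.tail, db.tail + chunk
    while pos < limit:
        r = _read_record(db, pos)
        if r is None:
            break
        key, value, size = r
        if db.lsm.get(key, (None,))[0] == pos:   # still the live copy?
            db.put(key, value)                   # re-append at the head
        pos += size
    db.tail = pos   # real WiscKey then punches a hole to free the old tail

def recover(db, persisted_head):
    """Rebuild key -> address entries lost in a crash."""
    pos, end = persisted_head, os.fstat(db.vlog.fileno()).st_size
    while pos < end:
        key, value, size = _read_record(db, pos)
        db.lsm[key] = (pos, size)
        pos += size
```

In the paper's design the new tail is persisted in the LSM-tree before the old tail space is freed, so a crash during garbage collection loses no live values; ordinary writes need no separate LSM-tree log because the value log already records (ksize, vsize, key, value) for every insertion.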

SLIDES 88-91

WiscKey Implementation

Based on LevelDB
➡ a separate vLog file for values
➡ modified I/O paths to separate keys and values
➡ leverages most of the high-quality LevelDB source code

Range query
➡ a thread pool launches queries in parallel
➡ detects the sequential pattern through the Iterator interface

File-system support (sketched below)
➡ fadvise() to predeclare access patterns
➡ hole punching to free space
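
A small, hypothetical sketch of these two facilities (Linux-specific; the file name and sizes are illustrative): posix_fadvise declares the expected access pattern on the value log, and fallocate(2) with the hole-punching flags frees garbage-collected tail space without rewriting the file.

```python
import ctypes
import ctypes.util
import os

fd = os.open("vlog", os.O_RDWR | os.O_CREAT)

# Predeclare the access pattern: sequential for range scans,
# random for point lookups.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)

# Punch a hole over the garbage-collected tail [0, 4096) so the file
# system releases the blocks while later file offsets stay valid.
FALLOC_FL_KEEP_SIZE = 0x01      # kernel constants from linux/falloc.h
FALLOC_FL_PUNCH_HOLE = 0x02
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
               ctypes.c_longlong(0), ctypes.c_longlong(4096))
os.close(fd)
```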

SLIDE 92

Background
Key-Value Separation
Challenges and Optimizations
Evaluation
Conclusion

SLIDES 93-97

YCSB Benchmarks

[Figure: performance normalized to LevelDB, log scale (0.1 to 1000), for LevelDB, RocksDB, WiscKey-GC, and WiscKey; key size: 16B, value size: 1KB. Workloads: LOAD; A (50% reads, 50% updates); B (95% reads, 5% updates); C (100% reads); D (95% reads, 5% inserts); E (95% scans, 5% inserts; many small range queries); F (50% reads, 50% read-modify-writes). WiscKey's gains over LevelDB, in workload order: 48x-116x, 6x-16x, 2x-20x, 2.6x-25x, 1.5x-4x, 1x-7x, 6x-8x]

SLIDES 98-100

Conclusion

WiscKey: an LSM-tree based key-value store
➡ decouples sorting and garbage collection by separating keys from values
➡ SSD-conscious design
➡ significant performance gains

Transition to new storage hardware
➡ understand and leverage existing software
➡ explore new designs to utilize the new hardware
➡ get the best of both worlds