Scheduling Problems in Write-Optimized Key-Value Stores



slide-1
SLIDE 1

Scheduling Problems in Write-Optimized Key-Value Stores

Prashant Pandey (1), Michael A. Bender (1), Rob Johnson (1,2)

(1) Stony Brook University, NY
(2) VMware Research

slide-2
SLIDE 2
Key-Value Stores are Ubiquitous

  • Key-value (KV) stores can store and retrieve <key, value> pairs.
  • KV stores are building blocks of databases, file systems, etc.
  • Examples: B-trees, hash tables, etc.

[Figure: a table mapping keys K1 to K6 to the values Rob, Michael, Don, Bill, Jun, Yang.]

slide-3
SLIDE 3
Write-Optimized Key-Value Stores

  • State-of-the-art key-value stores are write-optimized.
  • That is, they move data around in batches.
  • Batching amortizes the I/O cost of moving data.
  • Write-optimized trees are designed for external memory.
  • Examples: Bε-trees and log-structured merge trees.
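
A rough way to see why batching amortizes I/O cost, assuming every batch move costs exactly one I/O no matter how many items it carries. The function name and the numbers are illustrative only and do not come from the talk.

```python
import math

def ios_per_item(num_items: int, batch_size: int) -> float:
    """Amortized I/Os per item when items are moved batch_size at a time.

    Assumption: each batch costs one I/O, however full it is.
    """
    num_batches = math.ceil(num_items / batch_size)
    return num_batches / num_items

# Moving 1000 items one at a time costs one I/O per item ...
unbatched = ios_per_item(1000, 1)
# ... while moving them 100 at a time spreads each I/O over 100 items.
batched = ios_per_item(1000, 100)
```

Under this simplified model the per-item cost falls in proportion to the batch size, which is the effect the bullets above describe.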

slide-4
SLIDE 4

Main idea of this talk: how should we schedule these batch data moves?


slide-5
SLIDE 5

Outline

  • Bε-tree and operations
  • Operations analysis
  • Tradeoff between latency and I/O efficiency
  • Scheduling problem in batch data moves

slide-6
SLIDE 6

Insert Operation in a Bε-tree

[Figure: a Bε-tree. Each node has size B, of which about Bε is used for pivots and the remaining B - Bε for the message buffer; every child has the same layout.]

slide-15
SLIDE 15

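Insert Operation in a Bε-tree

[Figure: the same Bε-tree, with messages being flushed from a full buffer to a child.]

When a full buffer is flushed, the number of messages going to at least one child must be at least $(B - B^{\varepsilon}) / B^{\varepsilon} \approx B^{1-\varepsilon}$, because the $B - B^{\varepsilon}$ buffered messages are divided among about $B^{\varepsilon}$ children.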
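
The bound follows from a pigeonhole argument, which the sketch below evaluates numerically. It assumes B counts messages per node and that a node has exactly B^ε children; the function name and the values of B and ε are hypothetical.

```python
def min_messages_to_fullest_child(B: int, eps: float) -> float:
    """Lower bound on the messages destined for the fullest child of a full node."""
    fanout = B ** eps             # about B^eps children (one per pivot range)
    buffer_size = B - fanout      # the rest of the node holds buffered messages
    return buffer_size / fanout   # pigeonhole: some child gets at least the average

B, eps = 1024, 0.5                # hypothetical node size and epsilon
bound = min_messages_to_fullest_child(B, eps)   # (B - B^eps) / B^eps
approx = B ** (1 - eps)                         # the B^(1 - eps) approximation
```

For these values the exact bound and the approximation differ by one message, so the approximation is tight for large B.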

slide-20
SLIDE 20

Query Operation in a Bε-tree

[Figure: the same Bε-tree, with a query following pivots from the root towards a leaf and returning a result.]

slide-21
SLIDE 21

Bε-tree

[Figure: a Bε-tree whose nodes each hold pivots and a message buffer.]

  • Height: $O(\log_{B^{\varepsilon}} N)$
  • $0 < \varepsilon < 1$
  • Each node has ≈ $B^{\varepsilon}$ children.
  • The tree has ≈ $N / B$ leaves.
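
The height bound can be rewritten in terms of the base-B logarithm used for B-trees. The change of base below is a standard identity rather than something stated on the slide:

```latex
\[
\log_{B^{\varepsilon}} N
  = \frac{\ln N}{\ln B^{\varepsilon}}
  = \frac{\ln N}{\varepsilon \ln B}
  = \frac{\log_B N}{\varepsilon}
\]
```

So for a constant ε, a Bε-tree is taller than a B-tree on the same data by only a factor of 1/ε.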

slide-22
SLIDE 22
Performance Model

  • How computation works
      ○ Data is transferred in blocks between RAM and disk.
      ○ The number of block transfers dominates the running time.
  • Goal: minimize the number of block transfers
      ○ Performance bounds are parameterized by block size B, memory size M, and data size N.

[Figure: RAM of size M and a disk, exchanging blocks of size B.]
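
As a small illustration of counting cost in block transfers, the sketch below compares reading N contiguous items with fetching N items that each sit in a different block. It assumes B is measured in items per block; both function names are made up for this example.

```python
def sequential_scan_transfers(N: int, B: int) -> int:
    """Block transfers needed to read N contiguous items, B items per block."""
    return -(-N // B)  # ceiling division without floating point

def scattered_read_transfers(N: int) -> int:
    """Worst case for N items that all live in distinct blocks: one transfer each."""
    return N

scan_cost = sequential_scan_transfers(1000, 64)
scattered_cost = scattered_read_transfers(1000)
```

The gap between the two counts is roughly a factor of B, which is why the model cares about how data is laid out in blocks rather than about CPU work.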

slide-23
SLIDE 23

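Operations

Asymptotic I/O cost per operation (k is the number of items returned by a range query):

                     Insert                                         Query                      Range query
B-tree               $\log_B N$                                     $\log_B N$                 $\log_B N + k/B$
Bε-tree              $\log_B N / (\varepsilon B^{1-\varepsilon})$   $\log_B N / \varepsilon$   $\log_B N / \varepsilon + k/B$
Bε-tree (ε = 1/2)    $\log_B N / \sqrt{B}$                          $\log_B N$                 $\log_B N + k/B$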
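
To get a feel for the insert and query columns, the sketch below plugs hypothetical sizes into the bounds with all constants dropped. The function names and the chosen values of N, B, and ε are assumptions made for illustration.

```python
import math

def btree_ios(N: int, B: int) -> float:
    """B-tree insert or point query: about log_B N I/Os."""
    return math.log(N, B)

def betree_insert_ios(N: int, B: int, eps: float) -> float:
    """Bε-tree amortized insert: log_B N / (eps * B^(1 - eps)) I/Os."""
    return math.log(N, B) / (eps * B ** (1 - eps))

def betree_query_ios(N: int, B: int, eps: float) -> float:
    """Bε-tree point query: log_B N / eps I/Os."""
    return math.log(N, B) / eps

# Hypothetical sizes: 2^30 items, 2^10 items per block, eps = 1/2.
N, B, eps = 2**30, 2**10, 0.5
insert_speedup = btree_ios(N, B) / betree_insert_ios(N, B, eps)  # eps * B^(1 - eps)
query_slowdown = betree_query_ios(N, B, eps) / btree_ios(N, B)   # 1 / eps
```

With these numbers, inserts improve by a factor that grows with B while queries slow down by only the constant factor 1/ε, which is the asymmetry the table expresses.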

slide-27
SLIDE 27

Moving More than $B^{1-\varepsilon}$ Messages in a Flush

[Figure: the same Bε-tree, with a flush that moves more than $B^{1-\varepsilon}$ messages to one child.]

Flushing more than $B^{1-\varepsilon}$ messages to a child in a single flush reduces the I/O cost per insert.
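
One simplified way to see this, assuming each flush costs a single I/O and every message must be flushed once per level on its way to a leaf: the amortized cost per insert is the tree height divided by the number of messages sharing each flush. The function name and parameter values below are hypothetical.

```python
import math

def insert_ios_per_message(N: int, B: int, eps: float, messages_per_flush: float) -> float:
    """Amortized I/Os per insert under a one-I/O-per-flush assumption."""
    height = math.log(N, B) / eps         # levels each message must descend
    return height / messages_per_flush    # each flush's I/O is shared by the batch

N, B, eps = 2**30, 2**10, 0.5             # hypothetical sizes
minimum_batch = B ** (1 - eps)            # the guaranteed B^(1 - eps) messages
at_minimum = insert_ios_per_message(N, B, eps, minimum_batch)
at_four_times = insert_ios_per_message(N, B, eps, 4 * minimum_batch)
```

At the minimum batch size this reproduces the insert bound $\log_B N / (\varepsilon B^{1-\varepsilon})$; a flush four times larger cuts the per-insert cost to a quarter under the same assumptions.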

slide-33
SLIDE 33

Avalanche

[Figure: the same Bε-tree, in which a flush into a child triggers further flushes down the tree.]

An avalanche can increase the latency of an operation.
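
A toy model of an avalanche along a single root-to-leaf path is sketched below. It assumes a node that exceeds its capacity flushes its entire buffer to the next node on the path, which is a simplification chosen for illustration and not the flushing rule of any particular implementation.

```python
def insert_and_count_flushes(buffers: list, capacity: int, batch: int) -> int:
    """Add `batch` messages at the root and return how many flushes cascade.

    buffers[i] is the number of buffered messages at depth i on one
    root-to-leaf path; the last entry is a leaf that simply absorbs messages.
    """
    buffers[0] += batch
    flushes = 0
    for depth in range(len(buffers) - 1):
        if buffers[depth] <= capacity:
            break                              # this node still fits, cascade stops
        buffers[depth + 1] += buffers[depth]   # flush everything one level down
        buffers[depth] = 0
        flushes += 1
    return flushes

nearly_full = [90, 90, 90, 0]
cascade = insert_and_count_flushes(nearly_full, capacity=100, batch=20)

mostly_empty = [10, 10, 10, 0]
no_cascade = insert_and_count_flushes(mostly_empty, capacity=100, batch=20)
```

When every node on the path is nearly full, one insert batch pays for a flush at every level, so its latency is much higher than that of an insert into a mostly empty path.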
slide-34
SLIDE 34
Flushing Tradeoff

  • Flushing too few messages to a child can result in suboptimal I/O performance.
  • Flushing a lot of messages to a child can cause an avalanche.

slide-35
SLIDE 35
Scheduling Problem

  • We now have a scheduling problem.
  • Flushes are scheduled every $\varepsilon B^{1-\varepsilon} / \log_B N$ inserts.
  • We can allow nodes to grow larger temporarily.

slide-36
SLIDE 36

Is there a schedule that, at each scheduled flush, picks a node and a child to flush to, such that the maximum size of a node stays bounded?

slide-37
SLIDE 37
Possible Strategies to Pick the Child to Flush To

  • Pick the child to which you can flush the most messages.
  • Pick the largest child and find one of its children to which you can flush messages, shrinking the child without causing an avalanche.
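
The first strategy can be sketched as follows. The representation is an assumption made for this example: the buffer is a list of message keys, `pivots` is a sorted list of pivot keys, and child i receives keys that fall between pivots i-1 and i. The function name is hypothetical, and the second strategy is not covered.

```python
import bisect
from collections import Counter

def pick_child_with_most_messages(buffer_keys, pivots):
    """Return (child_index, message_count) for the child with the most pending messages.

    Ties go to the lower child index. Returns None for an empty buffer.
    """
    if not buffer_keys:
        return None
    # Route each buffered key to the child whose key range contains it.
    counts = Counter(bisect.bisect_right(pivots, key) for key in buffer_keys)
    return max(counts.items(), key=lambda item: (item[1], -item[0]))

choice = pick_child_with_most_messages([1, 12, 15, 17, 25], pivots=[10, 20])
```

This greedy choice maximizes the messages moved per flush, but it does not check whether the chosen child can absorb them, so on its own it does nothing to prevent an avalanche.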

slide-38
SLIDE 38
References

  • http://supertech.csail.mit.edu/papers/BenderFaJa15.pdf
  • https://www.usenix.org/system/files/conference/fast15/fast15-paper-jannen_william.pdf
  • https://www.usenix.org/system/files/conference/fast16/fast16-papers-yuan.pdf

slide-39
SLIDE 39

Thank You!

slide-41
SLIDE 41

Abstract

Write-optimized key-value stores, such as Bε-trees, are the state of the art in key-value stores. Bε-trees move data around in batches, thereby amortizing the I/O cost of moving data. During batch data moves in practice, we see an inherent tension between operation latency and I/O bandwidth utilization in Bε-trees. This talk presents an open problem on how to schedule batch data moves in a Bε-tree.