
SLIDE 1

MatrixKV: Reducing Write Stalls and Write Amplification in LSM-tree Based KV Stores with a Matrix Container in NVM

Ting Yao¹, Yiwen Zhang¹, Jiguang Wan¹, Qiu Cui², Liu Tang², Hong Jiang³, Changsheng Xie¹, and Xubin He⁴

¹Huazhong University of Science and Technology, China; ²PingCAP, China; ³University of Texas at Arlington, USA; ⁴Temple University, USA

SLIDE 2

Outline

  • Background and Motivations
  • MatrixKV
  • Evaluation
  • Conclusion

SLIDE 3

LSM-tree based Key-value stores

  • Log-structured merge tree (LSM-tree)
  • Designed for write-intensive scenarios
  • Applications
  • Properties:
    • Batched sequential writes: high write throughput
    • Fast reads
    • Fast range queries


SLIDE 4

LSM-tree and RocksDB

  • Systems with DRAM-SSD storage
  • Exponentially increasing level sizes (AF: the amplification factor between adjacent levels)
  • Operations (a toy code sketch follows the figure below):
    1. Insert
    2. Flush
    3. Compaction between Li and Li+1
  • Compaction proceeds level by level: L0-L1 compaction, L1-L2 compaction, ……

[Figure: SSD-based RocksDB. Insert goes to the MemTable in DRAM; a full MemTable becomes an immutable MemTable and is flushed to L0 on the SSD; compaction merges data down through levels L0, L1, …, Ln]
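The write path above can be condensed into a few lines. Below is a minimal, illustrative toy model (assumed entry-count budgets and simplified merging; not RocksDB's actual API):

```cpp
#include <map>
#include <string>
#include <vector>

// Toy LSM-tree write path: (1) Insert into the MemTable, (2) Flush a full
// MemTable to L0, (3) merge a full level Li into Li+1 (compaction).
// Sizes are counted in entries for simplicity.
class ToyLSM {
  static constexpr size_t kMemTableLimit = 1000;  // assumed flush threshold
  static constexpr size_t kAF = 10;               // size ratio of adjacent levels

  std::map<std::string, std::string> memtable_;   // sorted in-memory buffer
  std::vector<std::map<std::string, std::string>> levels_;  // L0..Ln on "disk"

 public:
  void Insert(const std::string& k, const std::string& v) {
    memtable_[k] = v;                             // 1. Insert
    if (memtable_.size() >= kMemTableLimit) Flush();
  }

 private:
  void Flush() {                                  // 2. Flush: MemTable -> L0
    if (levels_.empty()) levels_.emplace_back();
    for (const auto& kv : memtable_) levels_[0][kv.first] = kv.second;
    memtable_.clear();
    Compact(0);
  }

  void Compact(size_t i) {                        // 3. Compaction: Li -> Li+1
    size_t budget = kMemTableLimit;
    for (size_t l = 0; l <= i; ++l) budget *= kAF;  // levels grow by AF
    if (levels_[i].size() < budget) return;
    if (levels_.size() == i + 1) levels_.emplace_back();
    for (const auto& kv : levels_[i]) levels_[i + 1][kv.first] = kv.second;
    levels_[i].clear();                           // merged and sorted into Li+1
    Compact(i + 1);                               // compaction may cascade
  }
};
```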

SLIDE 5

Challenge 1: Write stall

Write stall: application throughput periodically drops to nearly zero.

  • Unpredictable performance.
  • Long tail latency.

Randomly writing an 80 GB dataset (20 million KV items, 16 B keys and 4 KB values) to SSD-based RocksDB: each stall coincides with an L0-L1 compaction moving about 3.1 GB of data.

SLIDE 6

Root cause of write stall: L0-L1 compaction

[Figure: L0-L1 compaction reads the SSTables of both L0 and L1 from disk, merge-sorts them in memory, and writes the merged result back to L1]

L0-L1 compaction is an all-to-all, coarse-grained compaction: it merges the entire L0 with the entire L1, consuming SSD bandwidth and CPU cycles for long stretches.

SLIDE 7

Challenge 2: Write amplification

Randomly writing an 80 GB dataset (20 million KV items, 16 B keys and 4 KB values) to SSD-based RocksDB: write amplification makes the average throughput decrease gradually.

  • Decreased performance over time.

Increased LSM-tree depth means more compactions and higher WA.

SLIDE 8

Root cause of increased write amplification

[Figure: level-by-level compactions move each flushed SSTable from memory down through L0, L1, L2, …, Ln on disk]

  • Level-by-level compactions: write amplification increases with the depth of the LSM-tree.
  • WA = AF × N, where AF is the amplification factor between two adjacent levels (AF = 10 here) and N is the number of levels.
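Illustrative arithmetic using the slide's numbers: with AF = 10, an LSM-tree that has grown to N = 5 levels gives WA = AF × N = 10 × 5 = 50, i.e., in the worst case each user byte is rewritten about 50 times on its way to the bottom level.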

SLIDE 9

State-of-the-art solution with NVM

NVM is byte-addressable, persistent, and fast! NoveLSM adopts NVM to store a large mutable MemTable: 1.7× higher random-write performance, but more severe write stalls!

[Figure: NoveLSM adds a large mutable MemTable and immutable MemTable in NVM alongside the DRAM MemTables; levels L0…Ln stay on the SSD]

* Sudarsun Kannan, Nitish Bhat, Ada Gavrilovska, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Redesigning LSMs for Nonvolatile Memory with NoveLSM. In 2018 USENIX Annual Technical Conference (USENIX ATC '18), 2018.

SLIDE 10

Motivation

The all-to-all L0-L1 compaction causes write stalls and unstable performance; the increased LSM-tree depth causes higher write amplification and decreased performance.

MatrixKV: reducing write stalls and write amplification in LSM-tree based KV stores by exploiting NVM.

SLIDE 11

Outline

  • Background and Motivations
  • MatrixKV
  • Evaluation
  • Conclusion

SLIDE 12

Overall Architecture

  • 1. Matrix container in NVM: manages L0's data in NVM
  • 2. Column compaction: a fine-granularity L0-L1 compaction that reduces write stalls
  • 3. Reducing levels on SSD: reduces the number of LSM-tree levels to decrease WA (on SSD)
  • 4. Cross-row hint search: a hint-search algorithm in the matrix container that improves read performance

[Figure: overall architecture. Put flows into the mem/imm MemTables in DRAM; Flush moves data into the matrix container (L0 of the LSM-tree) in NVM via PMDK, where the receiver holds cross-row hints and the compactor drives column compaction; L1, L2, … form an LSM-tree with reduced levels on the SSD, accessed via POSIX]
SLIDE 13

Matrix Container

The matrix container includes a receiver and a compactor (sketched in code after the figure below).

  • The receiver stores flushed data row by row; each flush is organized as a RowTable.
  • A: once filled with RowTables, a receiver turns into a compactor.
  • The compactor compacts data from L0 to L1 on the SSD, column by column.
  • B: after a column compaction, the NVM pages of that column are freed and become available for the receiver to accept new data.

[Figure: the matrix container in NVM. RowTables fill the receiver row by row (flushed from DRAM); the compactor's columns are compacted into the L1 SSTables covering key ranges a-c, c-e, e-n, n-o, u-z; A marks a receiver turning into a compactor, B marks a freed column]
SLIDE 14

RowTable

[Figure (a): RowTable structure. A data region of sorted KV items (k0 v0 … kn vn), and a metadata region: a sorted array whose entry i holds key ki, page number Pi, and the offset within that page]

  • Consists of a data region and a metadata region.
  • Data region: serialized KV items from the immutable MemTable.
  • Metadata region: a sorted array; each entry holds:
    • the key
    • the page number
    • the offset in the page
    • a forward pointer (i.e., $p_n$)
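In code, the layout might look like this sketch (field and type names are assumed, following the figure):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One metadata entry per KV item, kept sorted by key. The forward pointer
// p_n indexes the first entry with key >= this key in the adjacent RowTable,
// which is what cross-row hint search (slide 19) exploits.
struct RowTableEntry {
  std::string key;      // the sorted array is searched by this key
  uint32_t    page;     // NVM page number of the serialized item
  uint32_t    offset;   // offset of the item within that page
  uint32_t    forward;  // forward pointer (p_n) into the adjacent RowTable
};

struct RowTable {
  std::vector<char>          data;      // serialized, sorted KV items
  std::vector<RowTableEntry> metadata;  // sorted metadata array over `data`
};
```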
SLIDE 15

Fine-grained column compaction

  • The non-overlapped L1 is a key space with multiple contiguous key ranges.
  • Example (a code sketch of this selection loop follows slide 17):
    1. Start with key range [0-3].
    2. Compare the amount of compaction data against the compaction threshold.
    3. Below the threshold: add the next subrange (3-5], giving range [0-5].
    4. Still below: add the next subrange (5-8], giving range [0-8].
    5. The threshold is reached: start column compaction on [0-8].

[Figure: keys in the compactor (NVM) above the SSTables of L1 (SSD); the selected column covers the keys overlapping the current compaction range]

SLIDE 16

Fine-grained column compaction (cont.)

[Figure: the same example one step further; the next subrange has been merged, extending the compaction range]

SLIDE 17

Fine-grained column compaction (cont.)

[Figure: the compaction range has grown to [0-8]; L1's key space is partitioned into Range [0-8], Range (8-30], Range …; column compaction starts on the completed column]
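The selection loop behind this example might look as follows; this is a minimal sketch, and `Subrange`, `PickColumn`, and the threshold parameter are illustrative names rather than MatrixKV's real interfaces:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One contiguous key range of the non-overlapped L1, together with the
// amount of L0 (compactor) data that falls into it.
struct Subrange { int lo, hi; size_t bytes; };

// Greedily grow the compaction column: start from L1's first subrange and
// keep appending the next adjacent subrange until the accumulated L0 data
// reaches the compaction threshold, then compact exactly that key range.
// Assumes l1 is non-empty and ordered by key range.
std::pair<int, int> PickColumn(const std::vector<Subrange>& l1,
                               size_t threshold) {
  size_t acc = 0;
  int lo = l1.front().lo, hi = lo;
  for (const Subrange& s : l1) {
    acc += s.bytes;               // compaction data vs. threshold (step 2)
    hi = s.hi;                    // range grows: [0-3] -> [0-5] -> [0-8]
    if (acc >= threshold) break;  // step 5: start column compaction
  }
  return {lo, hi};                // key range of the selected column
}
```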

SLIDE 18

Reducing LSM-tree depth

  • WA = AF × N
  • Flattening the LSM-tree with wider levels keeps AF unchanged while reducing N (illustrative arithmetic follows the figure below).
  • Cost: a larger, unsorted L0 -> addressed by column compaction.
  • Cost: lower search efficiency in L0 -> addressed by cross-row hint search.

[Figure: conventional LSM-tree on SSD: L0 256 MB, L1 256 MB, L2 2.56 GB, L3 25.6 GB, L4 256 GB, L5 2.56 TB. Flattened LSM-tree in MatrixKV: L0 8 GB in NVM, then L1 8 GB, L2 80 GB, L3 800 GB, L4 8 TB on SSD]
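Illustrative arithmetic from the figure: to hold the 80 GB experimental dataset, the conventional tree needs SSD levels L0 through L4 (N = 5), so WA ≈ AF × N = 10 × 5 = 50, whereas the flattened tree fits the same data in L1 and L2 (N = 2), so WA ≈ 10 × 2 = 20. AF is unchanged at 10; the saving comes entirely from the smaller N.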

SLIDE 19

Cross-row hint search

[Figure: RowTable0 through RowTable3 with forward pointers linking each key to the next RowTable; the search for key 12 follows the hints across rows]

  • Constructing hints with forward pointers:
    • for key x in RowTable i, the forward pointer references key y in RowTable i-1,
    • where y is the first key with y ≥ x.
  • The search process follows the forward pointers.
    • E.g., fetch key = 12.
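A minimal sketch of one way the forward pointers can bound the search, following the slide's description; `Entry` mirrors the RowTable sketch from slide 14, and the exact window arithmetic is an assumption, not the paper's code:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Entry { std::string key; size_t forward; };  // forward: hint index

// Walk RowTables from newest (rows[0]) to oldest. In each row, binary-search
// only inside [lo, hi); the forward pointers of the entries bracketing the
// target then bound its possible position in the next-older row.
const Entry* HintSearch(const std::vector<std::vector<Entry>>& rows,
                        const std::string& target) {
  if (rows.empty()) return nullptr;
  size_t lo = 0, hi = rows[0].size();      // full window in the newest row
  for (size_t r = 0; r < rows.size(); ++r) {
    const std::vector<Entry>& row = rows[r];
    size_t l = lo, h = hi;                 // find first key >= target
    while (l < h) {
      size_t m = l + (h - l) / 2;
      if (row[m].key < target) l = m + 1; else h = m;
    }
    if (l < row.size() && row[l].key == target) return &row[l];
    if (r + 1 == rows.size()) break;
    size_t next_size = rows[r + 1].size();
    lo = (l > 0) ? row[l - 1].forward : 0;
    hi = (l < row.size()) ? std::min(row[l].forward + 1, next_size)
                          : next_size;
    if (lo > hi) { lo = 0; hi = next_size; }  // fall back if hints are stale
  }
  return nullptr;  // not in L0; the search continues on the SSD levels
}
```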
SLIDE 20

Evaluation Setup

Comparisons

  • RocksDB-SSD: SSD-based RocksDB
  • RocksDB-L0-NVM: RocksDB with L0 placed in NVM; a system with DRAM, NVM, and SSD (8 GB NVM)
  • NoveLSM: a heterogeneous system with DRAM, NVM, and SSD (8 GB NVM)
  • MatrixKV: a heterogeneous system with DRAM, NVM, and SSD (8 GB NVM)

Test environment

  • OS: 64-bit Linux 4.13.9
  • CPU: 2 × Intel 2.20 GHz processors
  • Memory: 32 GB
  • NVM: 2 × 128 GB Intel Optane DC PMM; FIO 4 KB (MB/s): random 2346 (R) / 1363 (W), sequential 2567 (R) / 1444 (W)
  • SSD: 800 GB Intel SSDSC2BB800G7; FIO 4 KB (MB/s): random 250 (R) / 68 (W), sequential 445 (R) / 354 (W)

SLIDE 21

Random Write Throughput

  • MatrixKV obtains the best performance across all tested value sizes.
  • E.g., at a 4 KB value size, MatrixKV outperforms RocksDB-L0-NVM and NoveLSM by 3.6× and 2.6×, respectively.

SLIDE 22

Write stalls

  • 1. Better random-write throughput.
  • 2. MatrixKV has more stable throughput: write stalls are reduced!

SLIDE 23

Tail Latency

Latency (us)        avg.    90%     99%   99.9%
RocksDB-SSD          974    566   11055   17983
NoveLSM              450    317    2080    2169
RocksDB-L0-NVM       477    528     786    1112
MatrixKV             263    247     405     663

  • MatrixKV obtains the lowest latency in all cases.
  • E.g., the 99% latency of MatrixKV is 27×, 5×, and 1.9× lower than that of RocksDB-SSD, NoveLSM, and RocksDB-L0-NVM, respectively.

SLIDE 24

Fine-granularity column compaction

Why does MatrixKV reduce write stalls?

  • 467 column compactions in total
  • 0.33 GB each on average
SLIDE 25

Write amplification

The WA of randomly writing an 80 GB dataset, where WA = the amount of data written to SSDs / the amount of data written by users.

  • MatrixKV's WA is 3.43.
  • MatrixKV reduces the number of compactions with its flattened LSM-tree.
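For example, at WA = 3.43, randomly writing the 80 GB dataset causes roughly 3.43 × 80 GB ≈ 274 GB of actual SSD writes.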

SLIDE 26

Summary

Conventional SSD-based KV stores suffer from:

  • unpredictable performance due to write stalls
  • degraded performance due to WA

MatrixKV: an LSM-tree based KV store for systems with DRAM, NVM, and SSD storage

  • Matrix container in NVM
  • Column compaction
  • Cross-row hint search
  • Reducing levels on SSD

MatrixKV reduces write stalls and improves write performance.

SLIDE 27

Thanks!

Open-source code: https://github.com/PDS-Lab/MatrixKV
Email: tingyao@hust.edu.cn
