WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems


SLIDE 1

WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems

Se Kwon Lee

  • K. Hyun Lim¹, Hyunsub Song, Beomseok Nam, Sam H. Noh

UNIST

¹ Hongik University

SLIDE 2

Persistent Memory (PM)

§ Persistent memory is expected to replace both DRAM & NAND

[Figure: memory technology landscape — PM candidates (STT-MRAM, PCM) combine the non-volatility of NAND with the byte-addressability and performance of DRAM]

                        NAND        STT-MRAM    PCM        DRAM
    Read (ns)           2.5 × 10⁴   5–30        20–70      10
    Write (ns)          2 × 10⁵     10–100      150–220    10
    Density (Gbit/cm²)  185.8       0.36        13.5       9.1

  • K. Suzuki and S. Swanson. “A Survey of Trends in Non-Volatile Memory Technologies: 2000–2014”, IMW 2015
SLIDE 3

Indexing Structure for PM Storage Systems

[Figure: example B+tree indexing keys 1–70]

SLIDE 4

Consistency Issue of B+tree in PM

§ B+tree is a block-based index
  • Key sorting → block-granularity writes
  • Rebalancing → multi-block-granularity writes
§ Persistent memory
  • Byte-addressable → byte-granularity writes
  • Write reordering

→ Can result in consistency problems

SLIDE 5

Consistency Issue of B+tree in PM

§ Traditional case

[Figure: volatile CPU caches and DRAM above block-based storage; updates reach storage only at block granularity, so writes reordered in the volatile layer are not yet persistent]

SLIDE 6

Consistency Issue of B+tree in PM

§ PM case

[Figure: byte-granularity updates from the CPU caches reach persistent memory directly; with write reordering, a crash can leave garbage data persistently stored]

SLIDE 7

Primitives for Data Consistency in PM

§ Durability
  • CLFLUSH (flush cache line)
    − Can be reordered
§ Ordering
  • MFENCE (load and store fence)
    − Orders CPU cache-line flush instructions

[Figure: volatile CPU caches above non-volatile persistent memory]

SLIDE 8

Primitives for Data Consistency in PM

§ Durability and ordering primitives (as on the previous slide)

Serialization of CLFLUSH and MFENCE is known to cause large overhead.
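
A minimal C sketch of the flush-then-fence persist barrier these slides describe; the helper name persist() and the 64-byte cache-line size are our assumptions for illustration, not part of the talk:

    #include <stddef.h>
    #include <stdint.h>
    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence */

    /* Flush every cache line covering [addr, addr+len), then fence so
     * that later stores cannot be reordered ahead of the flushes. */
    static void persist(const void *addr, size_t len)
    {
        const size_t line = 64;   /* assumed cache-line size */
        for (uintptr_t p = (uintptr_t)addr & ~(line - 1);
             p < (uintptr_t)addr + len; p += line)
            _mm_clflush((const void *)p);
        _mm_mfence();             /* order the flushes (MFENCE) */
    }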

SLIDE 9

Primitives for Data Consistency in PM

§ Atomicity
  • 8-byte failure atomicity
    − Needs only CLFLUSH
  • Logging or CoW-based atomicity (more than 8 bytes)
    − Requires duplicate copies

[Figure: non-volatile data area and log area; an update larger than 8 bytes is first copied to the log area]

SLIDE 10

Primitives for Data Consistency in PM

§ Atomicity (as on the previous slide)

Logging increases cache-line flush overhead.
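
By contrast, here is a minimal sketch of the log-free 8-byte failure-atomic update the slides rely on; atomic_update_u64() is our name, and it reuses the persist() helper sketched above:

    #include <stdint.h>

    /* An aligned 8-byte store is failure-atomic on x86, so a single
     * pointer or metadata word can be switched without any log entry. */
    static void atomic_update_u64(volatile uint64_t *slot, uint64_t newval)
    {
        *slot = newval;                  /* one aligned 8-byte store    */
        persist((const void *)slot, 8);  /* flush it; no duplicate copy */
    }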

SLIDE 11

B+tree Variants for Persistent Memory

wB+Tree (VLDB '15), NVTree (FAST '15), FPTree (SIGMOD '16)

§ How can we ensure consistency using failure-atomic writes without logging?
  • Unsorted keys → append-only writes plus per-node metadata
  • Failure-atomic update of the metadata

[Figure: node layouts — wB+Tree uses a slot array and bitmap over key/pointer pairs; NVTree uses an entry count and a +/- flag per entry; FPTree uses a bitmap, next pointer, and fingerprints]

  • Unsorted keys → decreased search performance

SLIDE 12

B+tree Variants for Persistent Memory

[Figure: inserting new key 38 into a full node (30, 32, 35) overflows the node and forces a split]

§ Logging still necessary
  • Multi-block-granularity updates due to node splits and merges
    − Cannot be updated atomically
  • Logging-based solutions: wB+Tree, FPTree
  • Tree-reconstruction-based solution: NVTree → large overhead

SLIDE 13

B+tree Variants for Persistent Memory

[Figure: key sorting shifts existing entries within a node; rebalancing splits nodes on overflow]

Fundamental characteristics of the B+tree cause these problems.

SLIDE 14

B+tree Variants for Persistent Memory

(same key-sorting and rebalancing issues as on the previous slide)

Why use B+trees in the first place? Perhaps there is a better tree data structure, one more suited for PM?

slide-15
SLIDE 15

§ Show Radix Tree is a suitable data structure for PM § Propose optimal radix tree variants WORT and WOART

  • WORT: Write Optimal Radix Tree
  • WOART: Write Optimal redesigned Adaptive Radix Tree (ART)

− Optimal: maintain consistency only with single failure-atomic write without any duplicate copies

15

Our Contributions

SLIDE 16

Radix Tree

§ Deterministic structure

[Figure: radix tree storing ACA, ACC, ACZ, and CAC; each level consumes one character of the key]

SLIDE 17

Radix Tree

§ Deterministic structure
  • No key comparison

(same figure as above)

SLIDE 18

Radix Tree

§ Deterministic structure
  • No key comparison
    − Only 8-byte pointer entries
    − Implicitly stored keys

[Figure: each node entry is an 8-byte pointer; the key is implicit in the path from the root]

SLIDE 19

Radix Tree

§ Deterministic structure
  • No key comparison
    − Only 8-byte pointer entries
    − Implicitly stored keys
    − No problem caused by key sorting

(same figure as above)

SLIDE 20

Radix Tree

§ Deterministic structure
  • No key comparison
    − Only 8-byte pointer entries
    − Implicitly stored keys
    − No problem caused by key sorting
  • No modification of other keys
    − Single 8-byte pointer write per node
    − Easy to use failure-atomic writes (see the sketch below)
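
A minimal sketch of this property in C, with hypothetical names (node, radix_insert); it assumes the persist() and atomic_update_u64() helpers sketched earlier and a byte-per-level tree, and is not the paper's actual implementation:

    #include <stdint.h>
    #include <stdlib.h>

    #define FANOUT 256                    /* one key byte per tree level */

    typedef struct node {
        struct node *child[FANOUT];       /* 8-byte pointer entries only */
    } node;

    /* Insert a value for an n-byte key: interior nodes are created and
     * persisted first, so the only visible change at each step is one
     * failure-atomic 8-byte pointer write in an already-persistent node. */
    static void radix_insert(node *root, const uint8_t *key, size_t n,
                             void *value)
    {
        node *cur = root;
        for (size_t i = 0; i + 1 < n; i++) {
            if (cur->child[key[i]] == NULL) {
                node *c = calloc(1, sizeof(node));
                persist(c, sizeof(node));              /* persist node body  */
                atomic_update_u64(
                    (volatile uint64_t *)&cur->child[key[i]],
                    (uint64_t)(uintptr_t)c);           /* link: one 8B write */
            }
            cur = cur->child[key[i]];
        }
        atomic_update_u64(                             /* final 8B write */
            (volatile uint64_t *)&cur->child[key[n - 1]],
            (uint64_t)(uintptr_t)value);
    }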

SLIDE 21

Problem of Deterministic Structure

§ For sparse key distributions
  • Excessive memory space is wasted → optimized through path compression

[Figure: a sparse key set leaves chains of nearly-empty nodes (low utilization); compressing those chains restores high utilization]

SLIDE 22

Path Compression in Radix Tree

§ Path compression
  • Search paths that do not need to be distinguished can be removed

[Figure: in the tree for ACA, ACC, ACZ, and CAC, the single-child path below 'A' is an unnecessary search path]

SLIDE 23

Path Compression in Radix Tree

§ Path compression
  • The common search path is compressed into a header
  • Improves memory utilization & indexing performance (a lookup sketch follows)

[Figure: a compression header holding "AC" replaces the shared prefix path for ACA, ACC, and ACZ]
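
A minimal sketch of how a lookup consumes such a header; the field names (prefix_len, prefix) are assumptions for illustration, not the paper's layout:

    #include <string.h>

    struct cheader {
        unsigned char prefix_len;    /* bytes of compressed common path */
        unsigned char prefix[7];     /* the compressed prefix itself    */
    };

    /* Before descending into the node, the next prefix_len key bytes
     * must equal the stored prefix; this comparison replaces walking
     * the removed chain of single-child nodes. */
    static int prefix_matches(const struct cheader *h,
                              const unsigned char *key, unsigned depth)
    {
        return memcmp(h->prefix, key + depth, h->prefix_len) == 0;
    }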

SLIDE 24

Node Split with Path Compression

§ Path compression split

[Figure: AZA is to be inserted into the node for ACA, ACC, ACZ; its prefix "AZ" does not equal the stored prefix "AC", so the node must split]

SLIDE 25

Node Split with Path Compression

§ Path compression split
  ① New parent allocation

[Figure: a new parent node is allocated for the shared prefix "A", with children for 'C' (old node) and 'Z' (AZA)]

SLIDE 26

Node Split with Path Compression

§ Path compression split
  ② Decompression of the old common prefix

[Figure: the old common prefix "AC" is decompressed — the leading "A" moves into the new parent and 'C' becomes an edge]

SLIDE 27

Node Split with Path Compression

§ Path compression split
  ③ Old common prefix update

[Figure: the old node's compression header is overwritten with the trimmed prefix]

However, this split process causes a consistency problem in PM.

SLIDE 28

Path Compression: Problem in PM

SLIDE 29

Consistency Issue of Path Compression

§ Path compression split
  • Causes updates of multiple nodes
  • Would have to employ expensive logging methods

[Figure: a crash between the new-parent write and the old-header update leaves the tree partly in the new state — an inconsistent state]

SLIDE 30

Path Compression: Solution

SLIDE 31

WORT (Write-Optimal Radix Tree) for PM

§ Failure-atomic path compression
  • Add a node depth field to the compression header

Compression header (8 bytes):

    struct Header {
        unsigned char depth;         /* depth of this node in the tree */
        unsigned char PrefixArr[7];  /* compressed common prefix       */
    };

[Figure: the node for ACA, ACC, ACZ keeps its prefix "AC" and its depth in one 8-byte header]

SLIDE 32

WORT (Write-Optimal Radix Tree) for PM

§ Failure-atomic path compression (continued)

[Figure: AZA is to be inserted; the 8-byte compression header of the node for ACA, ACC, ACZ holds depth and prefix "AC"]

SLIDE 33

WORT (Write-Optimal Radix Tree) for PM

§ Failure-atomic path compression
  • Add a node depth field to the compression header

[Figure: during decompression of the old common prefix, the new depth (2) and the trimmed prefix are written together as one 8-byte header update; if a crash strikes mid-split, the tree is left with a header whose depth no longer matches its position — an inconsistent but detectable state]

SLIDE 34

WORT (Write-Optimal Radix Tree) for PM

§ Failure-atomic path compression
  • Failure detection in WORT
    − Depth stored in a header ≠ depth counted during traversal → crashed header (see the check sketched below)

[Figure: after a crash, the depth recorded in the 8-byte compression header is not equal to the expected tree depth (2), exposing the inconsistent state]
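
A minimal sketch of the detection rule, assuming the Header struct above and a caller that counts depth while descending from the root:

    /* 'counted' is how many key bytes (levels plus compressed prefix
     * bytes) have been consumed so far during traversal. If it does
     * not match the depth recorded in the header, a crash interrupted
     * a path-compression split at this node. */
    static int header_crashed(const struct Header *h, unsigned counted)
    {
        return h->depth != (unsigned char)counted;
    }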

SLIDE 35

WORT (Write-Optimal Radix Tree) for PM

§ Failure-atomic path compression
  • Failure recovery in WORT
    − The compression header can be reconstructed from the keys below the node → atomically overwrite it (a recovery sketch follows)

[Figure: the correct depth (2) and prefix are rebuilt from the children and written back as one 8-byte header, returning the tree to a consistent state]
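
A minimal sketch of that recovery step, reusing the Header struct and the atomic_update_u64() helper from the earlier sketches; the caller is assumed to have recomputed the prefix (plen ≤ 7) by traversal, and an 8-byte-aligned header is assumed:

    #include <stdint.h>
    #include <string.h>

    /* Overwrite a crashed header in one failure-atomic 8-byte store:
     * rebuild the correct depth and prefix, pack them into a fresh
     * 8-byte Header, and publish it with a single atomic write. */
    static void recover_header(volatile struct Header *h,
                               unsigned counted_depth,
                               const unsigned char *prefix, size_t plen)
    {
        struct Header fixed;
        memset(&fixed, 0, sizeof(fixed));
        fixed.depth = (unsigned char)counted_depth;
        memcpy(fixed.PrefixArr, prefix, plen);   /* rebuilt prefix */
        uint64_t word;
        memcpy(&word, &fixed, sizeof(word));     /* pack the 8 bytes */
        atomic_update_u64((volatile uint64_t *)h, word);
    }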

SLIDE 36

Write Optimal Data Structure for PM

§ WORT (Write Optimal Radix Tree): 1. failure-atomic path compression
§ WOART (Write Optimal Adaptive Radix Tree): 1. failure-atomic path compression + 2. redesigned adaptive nodes
§ Our proposed radix tree variants are optimal for PM
  • Consistency is always guaranteed with a single 8-byte failure-atomic write, without any additional copies for logging or CoW

SLIDE 37

Evaluation

§ Experimental environment

    System configuration   Description
    CPU                    Intel Xeon E5-2620 v3 × 2
    OS                     Linux CentOS 6.6 (64-bit), kernel v4.7.0
    PM                     Emulated with 256 GB DRAM; write latency emulated by injecting additional stall cycles

SLIDE 38

Evaluation

§ Experimental environment
  • Comparison group
    − Radix tree variants: WORT (entirely in PM)
    − B+tree variants: wB+Tree (VLDB '15, entirely in PM); NVTree (FAST '15) and FPTree (SIGMOD '16), which place internal nodes in DRAM and leaf nodes in PM

SLIDE 39

Evaluation

§ Synthetic workload characteristics
    − Dense: keys in [1 … N]
    − Sparse: keys in [1 … 2⁶⁴)
    − Clustered: keys in [1 … 2⁶⁴)

slide-40
SLIDE 40

§ Insertion performance

  • WORT outperform the B+tree variants in general

40

Evaluation

SLIDE 41

Evaluation

§ Insertion performance
  • WORT outperforms the B+tree variants in general
    − DRAM-based internal nodes → more favorable performance for FPTree

SLIDE 42

Evaluation

§ Insertion performance
  • WORT vs. wB+Tree
    − The performance difference grows in proportion to the write latency

SLIDE 43

Evaluation

§ CLFLUSH count per operation
  • The B+tree variants incur more cache-line flush instructions

SLIDE 44

Evaluation

§ Search performance
  • WORT always performs better than the B+tree variants

SLIDE 45

Evaluation

§ Range query performance
  • The range-query gap between WORT and the PM-adapted B+tree variants is smaller than the gap between WORT and the original B+tree
    − The B+tree variants do not keep keys sorted → rearrangement overhead during range scans

SLIDE 46

Evaluation

§ MC-benchmark performance on Memcached
  • WORT outperforms the B+tree variants in both SET and GET
    − Additional indirection & flush overhead in the B+tree variants

SLIDE 47

Conclusion

§ Showed the suitability of the radix tree as a PM indexing structure
§ Proposed write-optimal radix tree variants WORT and WOART
  • Optimal: consistency is maintained with only a single failure-atomic write, without any duplicate copies

SLIDE 48

Thank You

§ Any questions?