WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems


SLIDE 1

WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems

Se Kwon Lee

  • K. Hyun Lim¹, Hyunsub Song, Beomseok Nam, Sam H. Noh

UNIST

¹ Hongik University

SLIDE 2

Persistent Memory (PM)

§ Persistent memory is expected to replace both DRAM & NAND

[Figure: memory technology landscape — PM candidates (STT-MRAM, PCM) combine the non-volatility of NAND with the byte-addressability and performance of DRAM]

                        NAND        STT-MRAM    PCM        DRAM
    Read (ns)           2.5 × 10⁴   5–30        20–70      10
    Write (ns)          2 × 10⁵     10–100      150–220    10
    Density (Gbit/cm²)  185.8       0.36        13.5       9.1

  • K. Suzuki and S. Swanson. “A Survey of Trends in Non-Volatile Memory Technologies: 2000–2014”, IMW 2015
SLIDE 3

Indexing Structure for PM Storage Systems

[Figure: example B+tree indexing keys 1–70]

SLIDE 4

Consistency Issue of B+tree in PM

§ B+tree is a block-based index
  • Key sorting → block-granularity writes
  • Rebalancing → multi-block-granularity writes
§ Persistent memory
  • Byte-addressable → byte-granularity writes
  • Write reordering

→ Can result in consistency problems

SLIDE 5

Consistency Issue of B+tree in PM

§ Traditional case

[Figure: volatile CPU caches and DRAM above block-based storage; updates reach storage only at block granularity, so writes reordered in the volatile layer are not yet persistent]

SLIDE 6

Consistency Issue of B+tree in PM

§ PM case

[Figure: byte-granularity updates from the CPU caches reach persistent memory directly; with write reordering, a crash can leave garbage data persistently stored]

SLIDE 7

Primitives for Data Consistency in PM

§ Durability
  • CLFLUSH (flush cache line)
    − Can be reordered
§ Ordering
  • MFENCE (load and store fence)
    − Orders CPU cache-line flush instructions

[Figure: volatile CPU caches above non-volatile persistent memory]

SLIDE 8

Primitives for Data Consistency in PM

§ Durability and ordering primitives (as on the previous slide)

Serialization of CLFLUSH and MFENCE is known to cause large overhead.
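
A minimal C sketch of the flush-then-fence persist barrier these slides describe; the helper name persist() and the 64-byte cache-line size are our assumptions for illustration, not part of the talk:

    #include <stddef.h>
    #include <stdint.h>
    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence */

    /* Flush every cache line covering [addr, addr+len), then fence so
     * that later stores cannot be reordered ahead of the flushes. */
    static void persist(const void *addr, size_t len)
    {
        const size_t line = 64;   /* assumed cache-line size */
        for (uintptr_t p = (uintptr_t)addr & ~(line - 1);
             p < (uintptr_t)addr + len; p += line)
            _mm_clflush((const void *)p);
        _mm_mfence();             /* order the flushes (MFENCE) */
    }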

SLIDE 9

Primitives for Data Consistency in PM

§ Atomicity
  • 8-byte failure atomicity
    − Needs only CLFLUSH
  • Logging or CoW-based atomicity (more than 8 bytes)
    − Requires duplicate copies

[Figure: non-volatile data area and log area; an update larger than 8 bytes is first copied to the log area]

SLIDE 10

Primitives for Data Consistency in PM

§ Atomicity (as on the previous slide)

Logging increases cache-line flush overhead.
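
By contrast, here is a minimal sketch of the log-free 8-byte failure-atomic update the slides rely on; atomic_update_u64() is our name, and it reuses the persist() helper sketched above:

    #include <stdint.h>

    /* An aligned 8-byte store is failure-atomic on x86, so a single
     * pointer or metadata word can be switched without any log entry. */
    static void atomic_update_u64(volatile uint64_t *slot, uint64_t newval)
    {
        *slot = newval;                  /* one aligned 8-byte store    */
        persist((const void *)slot, 8);  /* flush it; no duplicate copy */
    }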

SLIDE 11

B+tree Variants for Persistent Memory

wB+Tree (VLDB '15), NVTree (FAST '15), FPTree (SIGMOD '16)

§ How can we ensure consistency using failure-atomic writes without logging?
  • Unsorted keys → append-only writes plus per-node metadata
  • Failure-atomic update of the metadata

[Figure: node layouts — wB+Tree uses a slot array and bitmap over key/pointer pairs; NVTree uses an entry count and a +/- flag per entry; FPTree uses a bitmap, next pointer, and fingerprints]

  • Unsorted keys → decreased search performance

SLIDE 12

B+tree Variants for Persistent Memory

[Figure: inserting new key 38 into a full node (30, 32, 35) overflows the node and forces a split]

§ Logging still necessary
  • Multi-block-granularity updates due to node splits and merges
    − Cannot be updated atomically
  • Logging-based solutions: wB+Tree, FPTree
  • Tree-reconstruction-based solution: NVTree → large overhead

SLIDE 13

B+tree Variants for Persistent Memory

[Figure: key sorting shifts existing entries within a node; rebalancing splits nodes on overflow]

Fundamental characteristics of the B+tree cause these problems.

SLIDE 14

B+tree Variants for Persistent Memory

(same key-sorting and rebalancing issues as on the previous slide)

Why use B+trees in the first place? Perhaps there is a better tree data structure, one more suited for PM?

slide-15
SLIDE 15

§ Show Radix Tree is a suitable data structure for PM § Propose optimal radix tree variants WORT and WOART

  • WORT: Write Optimal Radix Tree
  • WOART: Write Optimal redesigned Adaptive Radix Tree (ART)

− Optimal: maintain consistency only with single failure-atomic write without any duplicate copies

15

Our Contributions

SLIDE 16

Radix Tree

§ Deterministic structure

[Figure: radix tree storing ACA, ACC, ACZ, and CAC; each level consumes one character of the key]

SLIDE 17

Radix Tree

§ Deterministic structure
  • No key comparison

(same figure as above)

SLIDE 18

Radix Tree

§ Deterministic structure
  • No key comparison
    − Only 8-byte pointer entries
    − Implicitly stored keys

[Figure: each node entry is an 8-byte pointer; the key is implicit in the path from the root]

SLIDE 19

Radix Tree

§ Deterministic structure
  • No key comparison
    − Only 8-byte pointer entries
    − Implicitly stored keys
    − No problem caused by key sorting

(same figure as above)

SLIDE 20

Radix Tree

§ Deterministic structure
  • No key comparison
    − Only 8-byte pointer entries
    − Implicitly stored keys
    − No problem caused by key sorting
  • No modification of other keys
    − Single 8-byte pointer write per node
    − Easy to use failure-atomic writes (see the sketch below)
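
A minimal sketch of this property in C, with hypothetical names (node, radix_insert); it assumes the persist() and atomic_update_u64() helpers sketched earlier and a byte-per-level tree, and is not the paper's actual implementation:

    #include <stdint.h>
    #include <stdlib.h>

    #define FANOUT 256                    /* one key byte per tree level */

    typedef struct node {
        struct node *child[FANOUT];       /* 8-byte pointer entries only */
    } node;

    /* Insert a value for an n-byte key: interior nodes are created and
     * persisted first, so the only visible change at each step is one
     * failure-atomic 8-byte pointer write in an already-persistent node. */
    static void radix_insert(node *root, const uint8_t *key, size_t n,
                             void *value)
    {
        node *cur = root;
        for (size_t i = 0; i + 1 < n; i++) {
            if (cur->child[key[i]] == NULL) {
                node *c = calloc(1, sizeof(node));
                persist(c, sizeof(node));              /* persist node body  */
                atomic_update_u64(
                    (volatile uint64_t *)&cur->child[key[i]],
                    (uint64_t)(uintptr_t)c);           /* link: one 8B write */
            }
            cur = cur->child[key[i]];
        }
        atomic_update_u64(                             /* final 8B write */
            (volatile uint64_t *)&cur->child[key[n - 1]],
            (uint64_t)(uintptr_t)value);
    }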

SLIDE 21

Problem of Deterministic Structure

§ For sparse key distributions
  • Excessive memory space is wasted → optimized through path compression

[Figure: a sparse key set leaves chains of nearly-empty nodes (low utilization); compressing those chains restores high utilization]

SLIDE 22

Path Compression in Radix Tree

§ Path compression
  • Search paths that do not need to be distinguished can be removed

[Figure: in the tree for ACA, ACC, ACZ, and CAC, the single-child path below 'A' is an unnecessary search path]

SLIDE 23

Path Compression in Radix Tree

§ Path compression
  • The common search path is compressed into a header
  • Improves memory utilization & indexing performance (a lookup sketch follows)

[Figure: a compression header holding "AC" replaces the shared prefix path for ACA, ACC, and ACZ]
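
A minimal sketch of how a lookup consumes such a header; the field names (prefix_len, prefix) are assumptions for illustration, not the paper's layout:

    #include <string.h>

    struct cheader {
        unsigned char prefix_len;    /* bytes of compressed common path */
        unsigned char prefix[7];     /* the compressed prefix itself    */
    };

    /* Before descending into the node, the next prefix_len key bytes
     * must equal the stored prefix; this comparison replaces walking
     * the removed chain of single-child nodes. */
    static int prefix_matches(const struct cheader *h,
                              const unsigned char *key, unsigned depth)
    {
        return memcmp(h->prefix, key + depth, h->prefix_len) == 0;
    }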

SLIDE 24

Node Split with Path Compression

§ Path compression split

[Figure: AZA is to be inserted into the node for ACA, ACC, ACZ; its prefix "AZ" does not equal the stored prefix "AC", so the node must split]

SLIDE 25

Node Split with Path Compression

§ Path compression split
  ① New parent allocation

[Figure: a new parent node is allocated for the shared prefix "A", with children for 'C' (old node) and 'Z' (AZA)]

SLIDE 26

Node Split with Path Compression

§ Path compression split
  ② Decompression of the old common prefix

[Figure: the old common prefix "AC" is decompressed — the leading "A" moves into the new parent and 'C' becomes an edge]

SLIDE 27

Node Split with Path Compression

§ Path compression split
  ③ Old common prefix update

[Figure: the old node's compression header is overwritten with the trimmed prefix]

However, this split process causes a consistency problem in PM.

SLIDE 28

Path Compression: Problem in PM

SLIDE 29

Consistency Issue of Path Compression

§ Path compression split
  • Causes updates of multiple nodes
  • Would have to employ expensive logging methods

[Figure: a crash between the new-parent write and the old-header update leaves the tree partly in the new state — an inconsistent state]

SLIDE 30

Path Compression: Solution

SLIDE 31

WORT (Write-Optimal Radix Tree) for PM

§ Failure-atomic path compression
  • Add a node depth field to the compression header

Compression header (8 bytes):

    struct Header {
        unsigned char depth;         /* depth of this node in the tree */
        unsigned char PrefixArr[7];  /* compressed common prefix       */
    };

[Figure: the node for ACA, ACC, ACZ keeps its prefix "AC" and its depth in one 8-byte header]

SLIDE 32

WORT (Write-Optimal Radix Tree) for PM

§ Failure-atomic path compression (continued)

[Figure: AZA is to be inserted; the 8-byte compression header of the node for ACA, ACC, ACZ holds depth and prefix "AC"]

SLIDE 33

WORT (Write-Optimal Radix Tree) for PM

§ Failure-atomic path compression
  • Add a node depth field to the compression header

[Figure: during decompression of the old common prefix, the new depth (2) and the trimmed prefix are written together as one 8-byte header update; if a crash strikes mid-split, the tree is left with a header whose depth no longer matches its position — an inconsistent but detectable state]

SLIDE 34

WORT (Write-Optimal Radix Tree) for PM

§ Failure-atomic path compression
  • Failure detection in WORT
    − Depth stored in a header ≠ depth counted during traversal → crashed header (see the check sketched below)

[Figure: after a crash, the depth recorded in the 8-byte compression header is not equal to the expected tree depth (2), exposing the inconsistent state]
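
A minimal sketch of the detection rule, assuming the Header struct above and a caller that counts depth while descending from the root:

    /* 'counted' is how many key bytes (levels plus compressed prefix
     * bytes) have been consumed so far during traversal. If it does
     * not match the depth recorded in the header, a crash interrupted
     * a path-compression split at this node. */
    static int header_crashed(const struct Header *h, unsigned counted)
    {
        return h->depth != (unsigned char)counted;
    }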

SLIDE 35

WORT (Write-Optimal Radix Tree) for PM

§ Failure-atomic path compression
  • Failure recovery in WORT
    − The compression header can be reconstructed from the keys below the node → atomically overwrite it (a recovery sketch follows)

[Figure: the correct depth (2) and prefix are rebuilt from the children and written back as one 8-byte header, returning the tree to a consistent state]
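
A minimal sketch of that recovery step, reusing the Header struct and the atomic_update_u64() helper from the earlier sketches; the caller is assumed to have recomputed the prefix (plen ≤ 7) by traversal, and an 8-byte-aligned header is assumed:

    #include <stdint.h>
    #include <string.h>

    /* Overwrite a crashed header in one failure-atomic 8-byte store:
     * rebuild the correct depth and prefix, pack them into a fresh
     * 8-byte Header, and publish it with a single atomic write. */
    static void recover_header(volatile struct Header *h,
                               unsigned counted_depth,
                               const unsigned char *prefix, size_t plen)
    {
        struct Header fixed;
        memset(&fixed, 0, sizeof(fixed));
        fixed.depth = (unsigned char)counted_depth;
        memcpy(fixed.PrefixArr, prefix, plen);   /* rebuilt prefix */
        uint64_t word;
        memcpy(&word, &fixed, sizeof(word));     /* pack the 8 bytes */
        atomic_update_u64((volatile uint64_t *)h, word);
    }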

SLIDE 36

Write Optimal Data Structure for PM

§ WORT (Write Optimal Radix Tree): 1. failure-atomic path compression
§ WOART (Write Optimal Adaptive Radix Tree): 1. failure-atomic path compression + 2. redesigned adaptive nodes
§ Our proposed radix tree variants are optimal for PM
  • Consistency is always guaranteed with a single 8-byte failure-atomic write, without any additional copies for logging or CoW

SLIDE 37

Evaluation

§ Experimental environment

    System configuration   Description
    CPU                    Intel Xeon E5-2620 v3 × 2
    OS                     Linux CentOS 6.6 (64-bit), kernel v4.7.0
    PM                     Emulated with 256 GB DRAM; write latency emulated by injecting additional stall cycles

SLIDE 38

Evaluation

§ Experimental environment
  • Comparison group
    − Radix tree variants: WORT (entirely in PM)
    − B+tree variants: wB+Tree (VLDB '15, entirely in PM); NVTree (FAST '15) and FPTree (SIGMOD '16), which place internal nodes in DRAM and leaf nodes in PM

SLIDE 39

Evaluation

§ Synthetic workload characteristics
    − Dense: keys in [1 … N]
    − Sparse: keys in [1 … 2⁶⁴)
    − Clustered: keys in [1 … 2⁶⁴)

slide-40
SLIDE 40

§ Insertion performance

  • WORT outperform the B+tree variants in general

40

Evaluation

SLIDE 41

Evaluation

§ Insertion performance
  • WORT outperforms the B+tree variants in general
    − DRAM-based internal nodes → more favorable performance for FPTree

SLIDE 42

Evaluation

§ Insertion performance
  • WORT vs. wB+Tree
    − The performance difference grows in proportion to the write latency

SLIDE 43

Evaluation

§ CLFLUSH count per operation
  • The B+tree variants incur more cache-line flush instructions

SLIDE 44

Evaluation

§ Search performance
  • WORT always performs better than the B+tree variants

SLIDE 45

Evaluation

§ Range query performance
  • The range-query gap between WORT and the PM-adapted B+tree variants is smaller than the gap between WORT and the original B+tree
    − The B+tree variants do not keep keys sorted → rearrangement overhead during range scans

SLIDE 46

Evaluation

§ MC-benchmark performance on Memcached
  • WORT outperforms the B+tree variants in both SET and GET
    − Additional indirection & flush overhead in the B+tree variants

SLIDE 47

Conclusion

§ Showed the suitability of the radix tree as a PM indexing structure
§ Proposed write-optimal radix tree variants WORT and WOART
  • Optimal: consistency is maintained with only a single failure-atomic write, without any duplicate copies

SLIDE 48

Thank You

§ Any questions?