
SLIDE 1

Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory

Pengfei Zuo, Yu Hua, Jie Wu Huazhong University of Science and Technology, China

13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018

SLIDE 2

Persistent Memory (PM)

➢ Non-volatile memory as PM is expected to replace or complement DRAM as main memory

– Non-volatility, low power, large capacity

                      PCM      ReRAM    DRAM
Read (ns)             20-70    20-50    10
Write (ns)            150-220  70-140   10
Non-volatility        √        √        ×
Standby power         ~0       ~0       High
Density (Gb/cm²)      13.5     24.5     9.1

  • K. Suzuki and S. Swanson. “A Survey of Trends in Non-Volatile Memory Technologies: 2000-2014”, IMW 2015.
  • C. Xu et al. “Overcoming the Challenges of Crossbar Resistive Memory Architectures”, HPCA, 2015.

SLIDE 3

Index Structures in DRAM vs PM

➢ Index structures are critical for memory & storage systems
➢ Traditional indexing techniques originally designed for DRAM become inefficient in PM

– Hardware limitations of NVM

  • Limited cell endurance
  • Asymmetric read/write latency and energy
  • Write optimization matters

– The requirement of data consistency

  • Data are persistently stored in PM
  • Crash consistency on system failures


SLIDE 4

Tree-based vs Hashing Index Structures

➢ Tree-based index structures

– Pros: good for range queries
– Cons: O(log n) time complexity for point queries
– Variants for PM have been widely studied

  • CDDS B-tree [FAST’11]
  • NV-Tree [FAST’15]
  • wB+-Tree [VLDB’15]
  • FP-Tree [SIGMOD’16]
  • WORT [FAST’17]
  • FAST&FAIR [FAST’18]

SLIDE 5

Tree-based vs Hashing Index Structures

➢ Tree-based index structures

– Pros: good for range queries
– Cons: O(log n) time complexity for point queries
– Variants for PM have been widely studied

  • CDDS B-tree [FAST’11]
  • NV-Tree [FAST’15]
  • wB+-Tree [VLDB’15]
  • FP-Tree [SIGMOD’16]
  • WORT [FAST’17]
  • FAST&FAIR [FAST’18]

➢ Hashing index structures

– Pros: constant time complexity for point queries
– Cons: no support for range queries
– Widely used in main memory

  • Main memory databases
  • In-memory key-value stores, e.g.,

Memcached and Redis

– When maintained in PM, multiple non-trivial challenges exist

  • Rarely touched by existing work

SLIDE 6

Challenges of Hashing Indexes for PM

① High overhead for consistency guarantee

– Ordering memory writes

  • Cache line flush and memory fence instructions

– Avoiding partial updates for non-atomic writes

  • Logging or copy-on-write (CoW) mechanisms

[Figure: CPU with volatile caches, connected via the memory bus to non-volatile memory; the atomic write unit is 8 bytes]

SLIDE 7

Challenges of Hashing Indexes for PM

① High overhead for consistency guarantee ② Performance degradation for reducing writes

– Hashing schemes for DRAM usually cause many extra writes when dealing with hash collisions [INFLOW’15, MSST’17]
– Write-friendly hashing schemes reduce writes, but at the cost of lower access performance

  • PCM-friendly hash table (PFHT) [INFLOW’15]
  • Path hashing [MSST’17]

SLIDE 8

Challenges of Hashing Indexes for PM

① High overhead for consistency guarantee ② Performance degradation for reducing writes ③ Cost inefficiency for resizing hash table

− Double the table size and iteratively rehash all items
− Takes O(N) time to complete
− N insertions, each with cache line flushes & memory fences
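This conventional resizing cost can be seen in a short sketch (illustrative Python, not from the paper; `traditional_resize` and the list-of-lists table layout are assumptions for exposition):

```python
def traditional_resize(table):
    """Double a chained hash table and rehash every item.

    Every one of the N items is re-inserted, so the loop below is O(N);
    on PM each such re-insertion would additionally need cache line
    flushes and memory fences for crash consistency.
    """
    new_table = [[] for _ in range(2 * len(table))]
    for bucket in table:
        for key, value in bucket:
            new_table[hash(key) % len(new_table)].append((key, value))
    return new_table
```

The point of the sketch is the inner loop: resizing touches every stored item, which is exactly the cost level hashing's in-place scheme later avoids.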

[Figure: resizing rehashes all items from the old hash table into the new hash table]

SLIDE 9

Existing Hashing Index Schemes for PM

                     BCH    PFHT [1]   Path Hashing [2]
Memory efficiency     √       √          √
Search                √       --         --
Deletion              √       --         --
Insertion             ×       --         --
NVM writes            ×       √          √
Resizing              ×       ×          ×
Consistency           ×       ×          ×

[1] B. Debnath et al. “Revisiting hash table design for phase change memory”, INFLOW, 2015.
[2] P. Zuo and Y. Hua. “A write-friendly hashing scheme for non-volatile memory systems”, MSST, 2017.

(“×”: bad, “√”: good, “--”: moderate)

SLIDE 10

Existing Hashing Index Schemes for PM

                     BCH    PFHT [1]   Path Hashing [2]   Level Hashing
Memory efficiency     √       √          √                  √
Search                √       --         --                 √
Deletion              √       --         --                 √
Insertion             ×       --         --                 √
NVM writes            ×       √          √                  √
Resizing              ×       ×          ×                  √
Consistency           ×       ×          ×                  √

[1] B. Debnath et al. “Revisiting hash table design for phase change memory”, INFLOW, 2015.
[2] P. Zuo and Y. Hua. “A write-friendly hashing scheme for non-volatile memory systems”, MSST, 2017.

(“×”: bad, “√”: good, “--”: moderate)

SLIDE 11

Level Hashing

[Figure: a two-level hash table, with a top level (TL) of N buckets and a bottom level (BL); inserting an item x incurs at most one movement]

➢ Write-optimized & high-performance hash table structure
➢ Cost-efficient in-place resizing scheme (resizing support)
➢ Low-overhead consistency guarantee scheme (consistency support)

SLIDE 12

Write-optimized Hash Table Structure

① Multiple slots per bucket
② Two hash locations for each key
③ Sharing-based two-level structure
④ At most one movement for each successful insertion

SLIDE 13

Write-optimized Hash Table Structure

① Multiple slots per bucket
② Two hash locations for each key
③ Sharing-based two-level structure
④ At most one movement for each successful insertion

[Figure: a single-level table (TL) with multi-slot buckets]

[Chart: maximum load factor by design combination, D1 alone: 2.2%]

SLIDE 14

Write-optimized Hash Table Structure

① Multiple slots per bucket
② Two hash locations for each key
③ Sharing-based two-level structure
④ At most one movement for each successful insertion

[Figure: TL with two candidate hash locations for the inserted key x]

[Chart: maximum load factor, D1: 2.2%, D1+D2: 47.6%]

SLIDE 15

Write-optimized Hash Table Structure

① Multiple slots per bucket
② Two hash locations for each key
③ Sharing-based two-level structure
④ At most one movement for each successful insertion

[Figure: sharing-based two-level structure with TL and BL]

[Chart: maximum load factor, D1: 2.2%, D1+D2: 47.6%, D1+D2+D3: 82.5%]

SLIDE 16

Write-optimized Hash Table Structure

① Multiple slots per bucket
② Two hash locations for each key
③ Sharing-based two-level structure
④ At most one movement for each successful insertion

[Figure: inserting x displaces at most one existing item in TL or BL, i.e., one movement]

[Chart: maximum load factor, D1: 2.2%, D1+D2: 47.6%, D1+D2+D3: 82.5%, all four: 91.1%]

SLIDE 17

Write-optimized Hash Table Structure

[Figure: the level hash table, TL and BL, with at most one movement per insertion]

➢ Write-optimized: only 1.2% of insertions incur one movement
➢ High-performance: constant-scale time complexity for all operations
➢ Memory-efficient: achieves a high load factor by evenly distributing items
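The four design decisions can be sketched as a small Python toy model (not the authors' C implementation; the class name, bucket sizes, and pair-based hash functions are illustrative assumptions):

```python
class LevelHashTable:
    """Toy model of the level hash table structure."""
    SLOTS = 4  # D1: multiple slots per bucket

    def __init__(self, n_top=8):
        self.top = [[] for _ in range(n_top)]          # top level: N buckets
        self.bottom = [[] for _ in range(n_top // 2)]  # D3: shared bottom level, N/2 buckets

    def _locations(self, key, level):
        # D2: two independent hash locations per key (toy hash functions)
        n = len(level)
        return hash(("h1", key)) % n, hash(("h2", key)) % n

    def insert(self, key, value):
        # First try all candidate buckets (two per level) without any movement.
        for level in (self.top, self.bottom):
            for b in self._locations(key, level):
                if len(level[b]) < self.SLOTS:
                    level[b].append((key, value))
                    return True
        # D4: otherwise try to free a slot by moving ONE existing item
        # to its alternative bucket in the same level.
        for level in (self.top, self.bottom):
            for b in self._locations(key, level):
                for i, (k, v) in enumerate(level[b]):
                    for a in self._locations(k, level):
                        if a != b and len(level[a]) < self.SLOTS:
                            level[a].append((k, v))   # the single movement
                            del level[b][i]
                            level[b].append((key, value))
                            return True
        return False  # insertion failed: caller must resize

    def search(self, key):
        # Constant-scale: at most 4 buckets x SLOTS slots are probed.
        for level in (self.top, self.bottom):
            for b in self._locations(key, level):
                for k, v in level[b]:
                    if k == key:
                        return v
        return None
```

Because every key has four candidate buckets and a failed insertion attempts at most one displacement, both insertion and search probe a constant number of slots.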

SLIDE 18

Cost-efficient In-place Resizing

➢ Put a new level on top of the old hash table and only rehash items in the old bottom level

[Figure: the old table, TL with N buckets and BL with N/2 buckets]

SLIDE 19

Cost-efficient In-place Resizing

➢ Put a new level on top of the old hash table and only rehash items in the old bottom level

[Figure: a new top level of 2N buckets is added above the old TL and BL]

SLIDE 20

Cost-efficient In-place Resizing

➢ Put a new level on top of the old hash table and only rehash items in the old bottom level

[Figure: during resizing, new TL with 2N buckets, BL (the old top level), and IL (the interim level, i.e., the old bottom level)]

SLIDE 21

Cost-efficient In-place Resizing

➢ Put a new level on top of the old hash table and only rehash items in the old bottom level

[Figure: rehashing, items in the interim level (IL) are re-inserted into the new top level]

SLIDE 22

Cost-efficient In-place Resizing

➢ Put a new level on top of the old hash table and only rehash items in the old bottom level

[Figure: after resizing, TL with 2N buckets and BL with N buckets]

SLIDE 23

Cost-efficient In-place Resizing

➢ Put a new level on top of the old hash table and only rehash items in the old bottom level

– The new hash table is exactly double the size of the old one
– Only 1/3 of the buckets (i.e., the old bottom level) are rehashed

[Figure: the resized table, TL with 2N buckets and BL with N buckets]

SLIDE 24

Low-overhead Consistency Guarantee

➢ A token associated with each slot in the open-addressing hash tables

– Indicates whether the slot is empty
– A token is 1 bit, e.g., “1” for non-empty, “0” for empty

[Figure: a bucket, a token area (bits “1 1”) followed by slots holding KV0 and KV1]

SLIDE 25

Low-overhead Consistency Guarantee

➢ A token associated with each slot in the open-addressing hash tables

– Indicates whether the slot is empty
– A token is 1 bit, e.g., “1” for non-empty, “0” for empty

➢ Modifying the token area only needs an atomic write

– Leveraging the token to perform log-free operations

[Figure: a bucket, token bits “1 1” and slots KV0, KV1]

SLIDE 26

Log-free Deletion

➢ Delete an existing item

[Figure: a bucket holding KV0 and KV1 with tokens “1 1”; KV1 is to be deleted]

SLIDE 27

Log-free Deletion

➢ Delete an existing item

[Figure: deleting KV1, its token is changed from 1 to 0 in a single atomic write]

SLIDE 28

Log-free Deletion

➢ Delete an existing item

[Figure: deleting KV1, its token is changed from 1 to 0 in a single atomic write]

➢ Log-free insertion and log-free resizing

– Please find them in our paper
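Token-based log-free deletion (and the matching insertion ordering) can be modeled in a few lines of Python. This is a toy model with assumed names; on real PM the token area fits in 8 bytes, so the token store below corresponds to one atomic write:

```python
class Bucket:
    """Toy bucket: a token bitmap plus fixed slots."""
    SLOTS = 4

    def __init__(self):
        self.tokens = 0                    # 1 bit per slot: 1 = occupied
        self.slots = [None] * self.SLOTS   # each slot holds a (key, value)

    def insert(self, key, value):
        for i in range(self.SLOTS):
            if not (self.tokens >> i) & 1:
                self.slots[i] = (key, value)   # write the item first...
                self.tokens |= 1 << i          # ...then publish it with one atomic token write
                return True
        return False

    def delete(self, key):
        for i in range(self.SLOTS):
            if (self.tokens >> i) & 1 and self.slots[i][0] == key:
                # Log-free: a single atomic write clears the token;
                # the stale slot contents are simply ignored afterwards.
                self.tokens &= ~(1 << i)
                return True
        return False

    def search(self, key):
        for i in range(self.SLOTS):
            if (self.tokens >> i) & 1 and self.slots[i][0] == key:
                return self.slots[i][1]
        return None
```

The key invariant is that the item data is written before its token is set, so a crash can never expose a valid token pointing at a half-written slot.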

SLIDE 29

Consistency Guarantee for Update

➢ If an existing key-value item is directly updated in place

– Inconsistency on system failures

[Figure: updating KV1 in place overwrites the old item directly]

SLIDE 30

Consistency Guarantee for Update

➢ If an existing key-value item is directly updated in place

– Inconsistency on system failures

➢ A straightforward solution is to use logging

[Figure: updating KV1 with logging, which is expensive!]

SLIDE 31

Opportunistic Log-free Update

➢ Our scheme: check whether there is an empty slot in the bucket storing the old item

– Yes: log-free update
– No: use logging

[Figure: log-free update of KV1, ① write KV1’ into an empty slot; ② modify the two tokens in an atomic write]

SLIDE 32

Opportunistic Log-free Update

➢ Our scheme: check whether there is an empty slot in the bucket storing the old item

– Yes: log-free update
– No: use logging

[Figure: log-free update of KV1, ① write KV1’ into an empty slot; ② modify the two tokens in an atomic write]

[Chart: probability of a log-free update vs. load factor, for 4, 8, and 16 slots per bucket]
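The two-step update can be added to the toy bucket model from the deletion example (again an illustrative sketch, not the paper's C code; in real level hashing both token bits live in the same 8-byte token area, so step ② is one atomic write):

```python
class Bucket:
    """Toy bucket with opportunistic log-free update."""
    SLOTS = 4

    def __init__(self):
        self.tokens = 0                    # 1 bit per slot: 1 = valid
        self.slots = [None] * self.SLOTS

    def insert(self, key, value):
        for i in range(self.SLOTS):
            if not (self.tokens >> i) & 1:
                self.slots[i] = (key, value)
                self.tokens |= 1 << i      # publish with an atomic token write
                return True
        return False

    def search(self, key):
        for i in range(self.SLOTS):
            if (self.tokens >> i) & 1 and self.slots[i][0] == key:
                return self.slots[i][1]
        return None

    def update(self, key, value):
        old = next((i for i in range(self.SLOTS)
                    if (self.tokens >> i) & 1 and self.slots[i][0] == key), None)
        if old is None:
            return False
        free = next((i for i in range(self.SLOTS)
                     if not (self.tokens >> i) & 1), None)
        if free is None:
            raise NotImplementedError("bucket full: fall back to logging")
        # ① Write the new version KV' into the empty slot.
        self.slots[free] = (key, value)
        # ② Flip both token bits in ONE atomic write: the new version
        # becomes valid and the old one invalid at the same instant,
        # so a crash leaves exactly one valid copy.
        self.tokens = (self.tokens | 1 << free) & ~(1 << old)
        return True
```

A crash before step ② simply leaves the new copy unpublished (the old value stays valid); a crash after it leaves only the new value valid, so no log is needed.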

SLIDE 33

Performance Evaluation

➢ Evaluated on both DRAM and simulated PM platforms

– Quartz (Hewlett Packard)

  • A DRAM-based performance emulator for PM

➢ Comparisons

– Bucketized cuckoo hashing (BCH) [NSDI’13]
– PCM-friendly hash table (PFHT) [INFLOW’15]
– Path hashing [MSST’17]
– In PM, their persistent versions are implemented using our proposed log-free consistency guarantee schemes

SLIDE 34

Insertion Latency

➢ Level hashing has the best insertion performance in both DRAM and NVM

[Charts: insertion latency (ns) vs. load factor for BCH, PFHT, Path, and Level, on DRAM and on emulated NVM (read/write latency 200/600 ns)]

SLIDE 35

Update Latency

➢ The opportunistic log-free update scheme reduces update latency by 15%∼52%, i.e., speeds up updates by 1.2×–2.1×

[Chart: update latency (ns) vs. load factor for BCH, PFHT, Path, Level, and Level without the opportunistic scheme]

SLIDE 36

Search Latency

➢ The search latency of level hashing is close to that of BCH, and much lower than those of PFHT and path hashing

[Charts: positive and negative search latency (ns) for BCH, PFHT, Path, and Level at load factors 0.6 and 0.8]

SLIDE 37

Resizing Time

➢ Level hashing reduces the resizing time by about 76%, i.e., speeds up resizing by 4.3×

[Chart: resizing time (s) for BCH, PFHT, Path, Level-Trad, and Level, on DRAM and on NVM (200 ns/600 ns)]

SLIDE 38

Concurrent Throughput

➢ Concurrent level hashing: supports multiple-reader multiple-writer concurrency by simply using fine-grained locking
➢ Concurrent level hashing achieves 1.6×–2.1× higher throughput than libcuckoo [1], due to locking fewer slots for insertions
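The fine-grained locking idea can be sketched with per-bucket locks (a deliberate simplification with assumed names; the real concurrent level hashing locks only the slots an operation touches, which is even finer-grained):

```python
import threading

class ConcurrentBuckets:
    """Toy multiple-reader multiple-writer table with one lock per bucket."""

    def __init__(self, n):
        self.buckets = [[] for _ in range(n)]
        self.locks = [threading.Lock() for _ in range(n)]

    def insert(self, key, value):
        b = hash(key) % len(self.buckets)
        with self.locks[b]:            # lock only this bucket, not the table:
            self.buckets[b].append((key, value))  # writers to other buckets proceed in parallel

    def search(self, key):
        b = hash(key) % len(self.buckets)
        with self.locks[b]:
            for k, v in self.buckets[b]:
                if k == key:
                    return v
        return None
```

Because each operation holds only one bucket lock, contention is limited to threads that collide on the same bucket, which is the source of the throughput advantage over coarser locking.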

[Chart: throughput (M reqs/s) vs. search/insertion ratio (90/10 to 10/90) for libcuckoo and level hashing with 2, 4, 8, and 16 threads]

[1] X. Li et al. “Algorithmic improvements for fast concurrent cuckoo hashing”, EuroSys, 2014.

SLIDE 39

Conclusion

➢ Traditional indexing techniques originally designed for DRAM become inefficient in PM
➢ We propose level hashing, a write-optimized and high-performance hashing index scheme for PM

– Write-optimized hash table structure
– Cost-efficient in-place resizing
– Log-free consistency guarantee

➢ 1.4×–3.0× speedup for insertion, 1.2×–2.1× speedup for update, and over 4.3× speedup for resizing

SLIDE 40

Thanks! Q&A

Open-source code: https://github.com/Pfzuo/Level-Hashing