Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory
Pengfei Zuo, Yu Hua, Jie Wu Huazhong University of Science and Technology, China
13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018
Persistent Memory (PM)
➢ Non-volatile memory (NVM) used as persistent memory is expected to replace or complement DRAM as main memory
– Non-volatility, low power, large capacity

                      PCM       ReRAM     DRAM
  Read (ns)           20-70     20-50     10
  Write (ns)          150-220   70-140    10
  Non-volatility      √         √         ×
  Standby power       ~0        ~0        High
  Density (Gb/cm²)    13.5      24.5      9.1
Index Structures in DRAM vs PM
➢ Index structures are critical for memory & storage systems
➢ Traditional indexing techniques originally designed for DRAM become inefficient in PM
– Hardware limitations of NVM
– The requirement of data consistency
(Figure: a CPU with volatile caches persisting data to non-volatile memory)
Tree-based vs Hashing Index Structures
➢ Tree-based index structures
– Pros: good for range queries
– Cons: O(log(n)) time complexity for point queries
– Tree-based indexes for PM have been widely studied
➢ Hashing index structures
– Pros: constant time complexity for point queries
– Cons: do not support range queries
– Widely used in main-memory systems, e.g., Memcached and Redis
– When maintained in PM, multiple non-trivial challenges exist
Challenges of Hashing Indexes for PM
① High overhead for consistency guarantees
– Ordering memory writes
– Avoiding partial updates for non-atomic writes
(Figure: a CPU with volatile caches connected to non-volatile memory over the memory bus; the atomic write unit is 8 bytes wide)
② Performance degradation for reducing writes
– Hashing schemes for DRAM usually cause many extra writes when dealing with hash collisions [INFLOW'15, MSST'17]
– Write-friendly hashing schemes reduce writes but at the cost of decreased access performance
③ Cost inefficiency for resizing the hash table
– Double the table size and iteratively rehash all items
– Takes O(N) time to complete: N insertions with cache-line flushes & memory fences
(Figure: rehashing all items from the old hash table into the new hash table)
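For contrast, challenge ③ can be made concrete with a minimal Python sketch (an illustration, not code from the paper) of traditional out-of-place resizing: the table is doubled and every one of the N items is rehashed, so on PM each of those N re-insertions would additionally pay a cache-line flush and a memory fence.

```python
def traditional_resize(old_buckets):
    """Double the table and iteratively rehash ALL items: O(N) insertions."""
    new_size = 2 * len(old_buckets)
    new_buckets = [[] for _ in range(new_size)]
    moved = 0
    for bucket in old_buckets:
        for key, value in bucket:
            # On PM, each rehashed item would also need a cache-line
            # flush and a memory fence to be made durable.
            new_buckets[hash(key) % new_size].append((key, value))
            moved += 1
    return new_buckets, moved
```

A table holding N items performs N re-insertions here; the in-place resizing scheme presented later in the talk rehashes only a third of the buckets.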
Existing Hashing Index Schemes for PM

                      BCH    PFHT [1]   Path hashing [2]
  Memory efficiency    √        √           √
  Search               √        --          --
  Deletion             √        --          --
  Insertion            ×        --          --
  NVM writes           ×        √           √
  Resizing             ×        ×           ×
  Consistency          ×        ×           ×

("×": bad, "√": good, "--": moderate)
[1] B. Debnath et al. "Revisiting hash table design for phase change memory", INFLOW, 2015.
[2] P. Zuo and Y. Hua. "A write-friendly hashing scheme for non-volatile memory systems", MSST, 2017.
Existing Hashing Index Schemes for PM

                      BCH    PFHT [1]   Path hashing [2]   Level hashing
  Memory efficiency    √        √           √                  √
  Search               √        --          --                 √
  Deletion             √        --          --                 √
  Insertion            ×        --          --                 √
  NVM writes           ×        √           √                  √
  Resizing             ×        ×           ×                  √
  Consistency          ×        ×           ×                  √

("×": bad, "√": good, "--": moderate)
Level Hashing
(Figure: the two-level structure, with a top level (TL) of buckets and a shared bottom level (BL); an item x has two hash locations, and a successful insertion moves at most one existing item)
➢ Write-optimized & high-performance hash table structure
➢ Cost-efficient in-place resizing scheme (resizing support)
➢ Low-overhead consistency guarantee scheme (consistency support)
Write-optimized Hash Table Structure
① Multiple slots per bucket
② Two hash locations for each key
③ Sharing-based two-level structure
④ At most one movement for each successful insertion
(Figure: top level (TL) and bottom level (BL); an item x can reside in its two candidate TL buckets or the BL buckets they share, and an insertion into a full set of candidates moves at most one item to its alternative location)
➢ Maximum load factor as the four designs are added incrementally:
– D1 (multiple slots per bucket): 2.2%
– D1+D2 (two hash locations): 47.6%
– D1+D2+D3 (sharing-based two-level structure): 82.5%
– All (at most one movement): 91.1%
➢ Write-optimized: only 1.2% of insertions incur one movement
➢ High-performance: constant-scale time complexity for all operations
➢ Memory-efficient: achieves a high load factor by evenly distributing items
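The four designs above can be sketched in a few lines of Python (a simplified stand-in for illustration; the actual implementation is in C and manages slots, tokens, and persistence explicitly). `LevelHashSketch`, `_h1`, and `_h2` are hypothetical names, and the hash functions assume integer keys; each key has two candidate top-level buckets plus the bottom-level buckets they share, and a failed insertion tries at most one movement.

```python
SLOTS = 4  # slots per bucket

class LevelHashSketch:
    """Two-level hash table: a top level (TL) of n buckets and a bottom
    level (BL) of n/2 buckets, each BL bucket shared by two TL buckets."""

    def __init__(self, n):
        self.n = n                               # n top-level buckets
        self.tl = [[] for _ in range(n)]         # top level
        self.bl = [[] for _ in range(n // 2)]    # shared bottom level

    # two independent (hypothetical, deterministic) hashes for integer keys
    def _h1(self, key):
        return (key * 2654435761) % self.n

    def _h2(self, key):
        return (key * 40503 + 2057) % self.n

    def _candidates(self, key):
        i1, i2 = self._h1(key), self._h2(key)
        # two TL buckets, plus the BL buckets they share (index // 2)
        return [(self.tl, i1), (self.tl, i2),
                (self.bl, i1 // 2), (self.bl, i2 // 2)]

    def search(self, key):
        for level, i in self._candidates(key):
            for k, v in level[i]:
                if k == key:
                    return v
        return None

    def insert(self, key, value):
        cands = self._candidates(key)
        for level, i in cands:
            if len(level[i]) < SLOTS:
                level[i].append((key, value))
                return True
        # all four candidate buckets are full: try at most ONE movement,
        # evicting an item from a candidate TL bucket to its alternative
        for level, i in cands[:2]:
            for j, (k, v) in enumerate(level[i]):
                alt = self._h2(k) if self._h1(k) == i else self._h1(k)
                if len(self.tl[alt]) < SLOTS:
                    self.tl[alt].append((k, v))   # the one movement
                    level[i][j] = (key, value)
                    return True
        return False  # insertion failure triggers a resizing
```

Because every operation only probes the four candidate buckets (plus at most one movement), search, insertion, and deletion all run in constant-scale time.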
Cost-efficient In-place Resizing
➢ Put a new level on top of the old hash table and rehash only the items in the old bottom level
– The new top level is twice the size of the old one; the old top level becomes the new bottom level
– The old bottom level becomes the interim level (IL), whose items are rehashed into the new table; the IL is then reclaimed
– The new hash table is exactly double the size of the old one
– Only 1/3 of the buckets (i.e., the old bottom level) are rehashed
(Figure: the resizing steps, showing the new top level stacked above the old TL and BL, with the old BL serving as the interim level during rehashing)
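In the same spirit, a standalone Python sketch of the in-place resizing (hypothetical names, and simplified so that rehashed items go only into the new top level rather than through a full insertion): the old top level is reused in place as the new bottom level, and only the old bottom level, now the interim level, is rehashed.

```python
def in_place_resize(tl, bl, h):
    """tl: old top level (N buckets); bl: old bottom level (N/2 buckets);
    h: the hash function for the new table. A new top level of 2N buckets
    is stacked on top, the old top level is reused in place as the new
    bottom level, and only the old bottom level (the interim level, IL),
    i.e. 1/3 of all buckets, is rehashed."""
    new_n = 2 * len(tl)
    new_tl = [[] for _ in range(new_n)]   # the new, doubled top level
    new_bl = tl                           # old TL reused in place as new BL
    rehashed = 0
    for bucket in bl:                     # the IL: the only buckets rehashed
        for key, value in bucket:
            new_tl[h(key) % new_n].append((key, value))
            rehashed += 1
    return new_tl, new_bl, rehashed
```

The old table has N + N/2 buckets and only the N/2 IL buckets are touched, which is where the roughly 4.3× resizing speedup reported later comes from.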
Low-overhead Consistency Guarantee
➢ A token is associated with each slot in the open-addressing hash table
– The token indicates whether the slot is empty
– A token is 1 bit, e.g., "1" for non-empty, "0" for empty
➢ Modifying the token area needs only an atomic write
– The tokens are leveraged to perform log-free operations
(Figure: a bucket, with one token bit per key-value slot)
Log-free Deletion
➢ To delete an existing item, modify its token from "1" to "0" in a single atomic write; the slot is then logically empty and reusable
(Figure: deleting KV1 by atomically clearing its token)
➢ Log-free insertion and log-free resizing
– Please find them in our paper
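The token mechanism can be sketched as follows (a Python stand-in with illustrative names; in the paper's C implementation a bucket's tokens fit in a single word, so rewriting them is one 8-byte atomic PM write). Writing the item before flipping its token means a crash never exposes a half-written slot as valid.

```python
SLOTS = 4  # slots per bucket

class Bucket:
    def __init__(self):
        self.slots = [None] * SLOTS   # key-value items
        self.tokens = 0               # bit i == 1 means slot i is valid

    def insert(self, key, value):
        for i in range(SLOTS):
            if not (self.tokens >> i) & 1:
                self.slots[i] = (key, value)   # write the item first...
                self.tokens |= (1 << i)        # ...then flip its token
                                               # (one atomic write on PM)
                return True
        return False                           # bucket full

    def delete(self, key):
        for i in range(SLOTS):
            if (self.tokens >> i) & 1 and self.slots[i][0] == key:
                self.tokens &= ~(1 << i)       # single atomic write: the
                return True                    # slot is logically empty
        return False

    def search(self, key):
        for i in range(SLOTS):
            if (self.tokens >> i) & 1 and self.slots[i][0] == key:
                return self.slots[i][1]
        return None
```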
Consistency Guarantee for Update
➢ Directly updating an existing key-value item in place can leave it inconsistent on system failures
– The item is larger than the 8-byte atomic-write unit, so a crash may leave a partial update
➢ A straightforward solution is to use logging, but logging is expensive
Opportunistic Log-free Update
➢ Our scheme: check whether there is an empty slot in the bucket storing the old item
– Yes: log-free update
① Write the new item KV1' into an empty slot
② Modify the two tokens (old slot to "0", new slot to "1") in one atomic write
– No: use logging
(Figure: log-free probability vs. load factor with 4, 8, and 16 slots per bucket; most updates find an empty slot and thus avoid logging)
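The two-step update above can be sketched as a standalone Python function (illustrative names; the token word models the bucket's token area, which on PM is rewritten with a single atomic word-sized write):

```python
SLOTS = 4  # slots per bucket

def opportunistic_update(slots, tokens, key, new_value):
    """slots: list of SLOTS items (or None); tokens: int bitmap where bit i
    set means slot i holds a valid item. Returns the new token word, or
    None if the bucket has no empty slot (caller falls back to logging)."""
    # locate the slot holding the old version of the item
    old = next(i for i in range(SLOTS)
               if (tokens >> i) & 1 and slots[i][0] == key)
    for new in range(SLOTS):
        if not (tokens >> new) & 1:          # found an empty slot
            slots[new] = (key, new_value)    # 1. write (and persist) KV'
            # 2. flip BOTH tokens (old -> 0, new -> 1) in ONE atomic
            #    word write; a crash leaves the old or the new version
            #    visible, never a partial item
            return (tokens & ~(1 << old)) | (1 << new)
    return None  # no empty slot: the opportunistic path is unavailable
```

Because both token flips land in the same atomic write, the update is all-or-nothing without a log whenever an empty slot exists.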
Performance Evaluation
➢ Both DRAM and simulated PM platforms
– Quartz (Hewlett Packard)
➢ Comparisons
– Bucketized cuckoo hashing (BCH) [NSDI'13]
– PCM-friendly hash table (PFHT) [INFLOW'15]
– Path hashing [MSST'17]
– In PM, their persistent versions are implemented using our proposed log-free consistency guarantee schemes
Insertion Latency
➢ Level hashing has the best insertion performance in both DRAM and NVM
(Figure: insertion latency vs. load factor for BCH, PFHT, Path, and Level, in DRAM and in simulated NVM with 200 ns read / 600 ns write latency)
Update Latency
➢ The opportunistic log-free update scheme reduces the update latency by 15% to 52%, i.e., speeds up updates by 1.2× to 2.1×
(Figure: update latency vs. load factor for BCH, PFHT, Path, Level, and Level without the opportunistic scheme)
Search Latency
➢ The search latency of level hashing is close to that of BCH, and both are much lower than those of PFHT and path hashing
(Figure: positive and negative search latency for BCH, PFHT, Path, and Level at load factors 0.6 and 0.8)
Resizing Time
➢ Level hashing reduces the resizing time by about 76%, i.e., speeds up resizing by about 4.3×
(Figure: total resizing time for BCH, PFHT, Path, Level-Trad (level hashing with traditional resizing), and Level, in DRAM and in simulated NVM with 200 ns/600 ns latency)
Concurrent Throughput
➢ Concurrent level hashing supports multiple-reader multiple-writer concurrency simply by using fine-grained locking
➢ Concurrent level hashing achieves 1.6× to 2.1× higher throughput than libcuckoo [1], due to locking fewer slots for insertions
(Figure: throughput of libcuckoo and level hashing with 2, 4, 8, and 16 threads, under search/insertion ratios from 90/10 to 10/90)
[1] X. Li et al. "Algorithmic improvements for fast concurrent cuckoo hashing", EuroSys, 2014.
Conclusion
➢ Traditional indexing techniques originally designed for DRAM become inefficient in PM
➢ We propose level hashing, a write-optimized and high-performance hashing index scheme for PM
– Write-optimized hash table structure
– Cost-efficient in-place resizing
– Log-free consistency guarantee
➢ 1.4× to 3.0× speedup for insertion, 1.2× to 2.1× speedup for update, and over 4.3× speedup for resizing
Open-source code: https://github.com/Pfzuo/Level-Hashing