
SmartCuckoo: A Fast and Cost-Efficient Hashing Index Scheme for Cloud Storage Systems



  1. SmartCuckoo: A Fast and Cost-Efficient Hashing Index Scheme for Cloud Storage Systems
     Yuanyuan Sun, Yu Hua, Song Jiang*, Qiuyu Li, Shunde Cao, Pengfei Zuo
     Huazhong University of Science and Technology; *University of Texas, Arlington
     Presented at USENIX ATC 2017

  2. Indexing services in cloud storage
     - Large amounts of data
       - From small hand-held devices to large-scale data centers
       - 44 ZB in total, 5.2 TB for each user in 2020 (IDC, 2014)
     - Fast query services are important to both users and systems
       - Returning accurate results in real time
       - Improving system performance and storage efficiency

  3. The importance of hash tables
     - Hash tables are widely used in data stores and caches
       - Key-value stores, e.g., Memcached, Redis
       - Relational databases, e.g., MonetDB, HyPer
       - In-cache index (ICS 2014, MICRO 2015)
     - Strengths:
       - Constant-scale addressing complexity, ~O(1)
       - Fast query response
     - Weakness:
       - Risk of high latency when handling hash collisions
     - Cuckoo hashing

  4. Cuckoo hashing
     - Kick-out operations: like cuckoo birds
     - Open addressing
     - Supporting fast lookups: O(1) time complexity
     - However, insertion latency can be very high and unpredictable,
       - especially when an endless loop occurs!
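
     The kick-out behaviour described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `CuckooTable`, the bound `max_kicks`, the hash functions derived from Python's `hash()`, and the overflow list `stash` are all assumptions made for the example.

```python
class CuckooTable:
    """Two-table cuckoo hash with a bounded number of kick-outs (illustrative)."""

    def __init__(self, size=8, max_kicks=16):
        self.size = size
        self.max_kicks = max_kicks      # assumed bound, not taken from the slides
        self.tables = [[None] * size, [None] * size]
        self.stash = []                 # assumed overflow area for keys left without a bucket

    def _pos(self, which, key):
        # Two hypothetical hash functions, one per table.
        return hash((which, key)) % self.size

    def lookup(self, key):
        # At most one probe per table, plus a scan of the (normally tiny) stash.
        in_tables = any(self.tables[t][self._pos(t, key)] == key for t in (0, 1))
        return in_tables or key in self.stash

    def insert(self, key):
        if self.lookup(key):
            return True
        which = 0
        for _ in range(self.max_kicks):
            pos = self._pos(which, key)
            # Place the key; the previous occupant (if any) becomes the key to place.
            key, self.tables[which][pos] = self.tables[which][pos], key
            if key is None:
                return True
            which = 1 - which           # the evicted key moves to its bucket in the other table
        # Kick-out budget exhausted, possibly because of an endless loop.
        self.stash.append(key)
        return False


table = CuckooTable(size=8)
for k in range(12):
    table.insert(k)
print(all(table.lookup(k) for k in range(12)))
```

     Lookups stay cheap because each key has only two candidate buckets. The cost shows up in `insert`, where a loop among the candidate buckets burns the whole kick-out budget before the failure is noticed.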

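  5. How is an endless loop formed?
     [Figure: a hash table with buckets 0-7; item a is hashed into the table with H1.]

  6. How is an endless loop formed?
     [Figure: the same table with items a and c; H1 is applied to the incoming item.]

  7. How is an endless loop formed?
     [Figure: the same table with items a, b, and c; both H1 and H2 are applied.]

  8. How is an endless loop formed?
     [Figure: items a, b, and c shown in two tables, T1 and T2, with buckets 0-7.]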

  9. How is an endless loop formed?
     [Figure: T1 and T2 holding a, b, c, d, and e; a new item x is hashed with H1 and H2.]

  10. How is an endless loop formed?
      - Kick-out for empty buckets
      [Figure: T1 and T2 with x waiting to be inserted; an occupant's alternative location is marked.]

  11. How is an endless loop formed?
      - Kick-out for empty buckets
      [Figure: item c has been kicked out to its alternative location.]

  12. How is an endless loop formed?
      - Kick-out for empty buckets
      [Figure: item x now occupies the vacated bucket.]

  13. How is an endless loop formed?
      - An endless loop is formed.
      - Endless kick-outs for any insertion within the loop.
      [Figure: T1 and T2 after the kick-outs, with a loop drawn among items including a, b, x, and d.]

  14. Observations
      - Endless loops are widespread in cuckoo hashing structures.
        - More than 25% (cuckoo hashing with a stash)
      - Loop ratio: the percentage of insertion failures due to loops
      [Figure: loop ratio (%) versus load factor (0.10-0.90) for the RandomInteger, MacOS, and DocWords traces.]

  15. Existing works
      - ChunkStash @ USENIX ATC'10
        - Collisions: recursive strategy to relocate one of the keys in the candidate positions
        - Loops: an auxiliary linked list (or hash table)
      - MemC3 @ NSDI'13
        - Collisions: random and repeated relocation (500 times)
        - Loops: an expansion process
        - Stand-alone implementation: libcuckoo @ EuroSys'14
      - Horton tables @ USENIX ATC'16
        - Recursively evicting keys within a certain search tree height

  16. Motivations
      - Due to endless loops:
        - Substantial resource consumption
          - A large number of step-by-step kick-out operations
        - Unbounded performance
          - Fruitless effort
      - Design goal:
        - Predetermining and avoiding the occurrence of endless loops

  17. Our approach: SmartCuckoo
      - Tracking item placements in the hash table
        - Representing the hashing relationship as a directed pseudoforest
        - Classifying item insertions into three cases
        - Predetermining and avoiding loops during insertion without any kick-out attempts

  18. How to identify loop(s)?
      - Pseudoforest:
        - A graph in which each vertex has an outdegree of at most one
        - Each connected component (subgraph) has at most one cycle (loop)
        - In a subgraph:
          - Loop: #Vertices = #Edges
          - No loop: #Vertices = #Edges + 1
      [Figure: two example subgraphs, one maximal (containing a loop) and one non-maximal (containing a vacancy).]
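
      The vertex/edge counting rule on this slide can be checked with a short sketch. It assumes that buckets are the vertices and that every stored item contributes one edge between its two candidate buckets; the function name `classify_components` and the bucket labels are hypothetical.

```python
from collections import defaultdict


def classify_components(edges):
    """Return (sorted vertices, edge count, status) for each connected subgraph."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        # Depth-first search to collect one connected subgraph.
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp

        # Both endpoints of an edge lie in the same subgraph, so checking u suffices.
        n_edges = sum(1 for u, _ in edges if u in comp)
        if len(comp) == n_edges:
            status = "maximal"          # #Vertices = #Edges: contains a loop
        elif len(comp) == n_edges + 1:
            status = "non-maximal"      # #Vertices = #Edges + 1: one vacancy left
        else:
            status = "not a pseudoforest"
        result.append((sorted(comp), n_edges, status))
    return result


chain = [("T1:0", "T2:1"), ("T2:1", "T1:2"), ("T1:2", "T2:3")]
cycle = [("T1:4", "T2:4"), ("T2:4", "T1:5"), ("T1:5", "T2:5"), ("T2:5", "T1:4")]
for vertices, n_edges, status in classify_components(chain + cycle):
    print(len(vertices), n_edges, status)
```

      The chain has four buckets and three items, so one bucket is still vacant; the cycle has four buckets and four items, so no insertion into it can succeed.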

  19. Classification and predetermination
      - Three cases depending on the number of vertices added to the graph: v+0, v+1, and v+2
      - v+0: 5 possible scenarios based on the status of the corresponding subgraph(s)

      | Case | Two insert positions of a key | Subgraph status               | Scenario |
      |------|-------------------------------|-------------------------------|----------|
      | v+0  | Same subgraph                 | Non-maximal                   | (a)      |
      | v+0  | Same subgraph                 | Maximal                       | (e)      |
      | v+0  | Different subgraphs           | Both non-maximal              | (b)      |
      | v+0  | Different subgraphs           | A maximal and a non-maximal   | (c)      |
      | v+0  | Different subgraphs           | Both maximal                  | (d)      |
      | v+1  | A new one                     | -                             | -        |
      | v+2  | Two new ones                  | -                             | -        |
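
      One way to realise this classification is a union-find structure with a per-subgraph "maximal" flag. The sketch below is an assumption about how the bookkeeping could look, not the authors' code: `PseudoforestIndex`, `classify`, and `add` are hypothetical names, the structure tracks only the graph (not where items physically sit), and deletions are not handled.

```python
class PseudoforestIndex:
    """Tracks subgraphs of buckets and predetermines insertion outcomes."""

    def __init__(self):
        self.parent = {}    # bucket -> parent bucket (union-find forest)
        self.maximal = {}   # root bucket -> True if its subgraph already has a loop

    def _find(self, b):
        while self.parent[b] != b:
            self.parent[b] = self.parent[self.parent[b]]   # path halving
            b = self.parent[b]
        return b

    def classify(self, b1, b2):
        """Return (case, insertable) for an item with candidate buckets b1 and b2."""
        new = [b for b in (b1, b2) if b not in self.parent]
        if len(new) == 2:
            return "v+2", True
        if len(new) == 1:
            return "v+1", True
        r1, r2 = self._find(b1), self._find(b2)
        if r1 == r2:
            return ("v+0 (e)", False) if self.maximal[r1] else ("v+0 (a)", True)
        m1, m2 = self.maximal[r1], self.maximal[r2]
        if m1 and m2:
            return "v+0 (d)", False     # two loops: fail without any kick-out
        if m1 or m2:
            return "v+0 (c)", True      # enter from the non-maximal side
        return "v+0 (b)", True

    def add(self, b1, b2):
        """Classify the insertion and, if it can succeed, record the new edge."""
        case, ok = self.classify(b1, b2)
        if not ok:
            return case, False
        for b in (b1, b2):
            if b not in self.parent:
                self.parent[b] = b
                self.maximal[b] = False
        r1, r2 = self._find(b1), self._find(b2)
        if r1 == r2:
            self.maximal[r1] = True     # the new edge closes a loop in this subgraph
        else:
            self.parent[r2] = r1        # merge; a loop on either side carries over
            self.maximal[r1] = self.maximal[r1] or self.maximal[r2]
        return case, True


index = PseudoforestIndex()
for pair in [("T1:0", "T2:0"), ("T1:1", "T2:0"), ("T1:1", "T2:1"),
             ("T1:0", "T2:1"), ("T1:0", "T2:0")]:
    print(pair, index.add(*pair))
```

      The last insertion in the usage example targets a subgraph that the fourth insertion has already made maximal, so it is rejected immediately as scenario (e) instead of after a series of kick-outs.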

  20. v+0: (a) One non-maximal subgraph
      - One empty bucket
      - Success!
      [Figure: tables T1 and T2 and the corresponding pseudoforest, before and after inserting x1 with candidate buckets H1(x1) and H2(x1).]

  21. v+0: (b) Two non-maximal subgraphs
      - Two empty buckets
      - Success!
      [Figure: tables T1 and T2 and the corresponding pseudoforest, before and after inserting x2 with candidate buckets H1(x2) and H2(x2).]

  22. v+0: (c) One maximal and one non-maximal
      - One loop and one empty bucket
      - Conventional cuckoo hashing: taking a random walk
        - T1: executing extra useless kick-out operations
        - T2: succeeding
      - SmartCuckoo: directly choosing to enter from T2
      - Success!
      [Figure: tables T1 and T2 and the corresponding pseudoforest, before and after inserting x3 with candidate buckets H1(x3) and H2(x3).]

  23. v+0: (d) Two maximal subgraphs
      - Two loops!
      - Execution:
        - Conventional cuckoo hashing: making a sufficient number of attempts, then reporting a failure
        - SmartCuckoo: reporting a failure without any kick-out operations
      - Failure!
      [Figure: tables T1 and T2 and the corresponding pseudoforest for inserting x4 with candidate buckets H1(x4) and H2(x4).]

  24. v+0: (e) One maximal subgraph
      - One loop!
      - Failure!
      [Figure: tables T1 and T2 and the corresponding pseudoforest for inserting x5 with candidate buckets H1(x5) and H2(x5).]

  25. Case: v+1
      - A new vertex after the item's insertion
      - Success!
      [Figure: tables T1 and T2 and the corresponding pseudoforest, before and after inserting x6 with candidate buckets H1(x6) and H2(x6).]

  26. Case: v+2
      - Two new vertices after the insertion
      - Success!
      [Figure: tables T1 and T2 and the corresponding pseudoforest, before and after inserting x7 with candidate buckets H1(x7) and H2(x7).]

  27. Evaluation methodology
      - Comparisons:
        - Baseline (cuckoo hashing with a stash @ SIAM Journal on Computing '09)
        - libcuckoo @ EuroSys'14
        - BCHT (bucketized cuckoo hash table)
      - Traces:
        - RandomInteger: random integer generator @ TOMACS'98
        - MacOS: http://tracer.filesystems.org
        - DocWords: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
        - YCSB: https://github.com/brianfrankcooper/YCSB @ SOCC'11
      - Metrics (in millions of operations per second):
        - Insertion throughput
        - Lookup throughput: positive/negative
        - Throughput of workload with mixed queries (YCSB)

  28. Insertion throughput
      [Figure: insertion throughput (millions of insertions per second) versus load factor (0.1-0.9) for Baseline, libcuckoo, BCHT, and SmartCuckoo; annotations mark 0.5× and 5×.]
      - SmartCuckoo significantly increases insertion throughput.
      - 0.5× to 5× speedups compared to Baseline.

  29. Lookup throughput
      [Figure: lookup throughput (millions of lookups per second) for Baseline, libcuckoo, BCHT, and SmartCuckoo, with 100% and 0% of the requested keys existing in the table.]
      - 0%: all candidate positions for a key have to be accessed.
      - Almost the same lookup throughput as Baseline.
      - Significantly higher than libcuckoo and BCHT.

  30. Throughput of workload with mixed queries

      | Workload | Insert (%) | Lookup (%) | Update (%) |
      |----------|------------|------------|------------|
      | YCSB-1   | 100        | 0          | 0          |
      | YCSB-2   | 75         | 25         | 0          |
      | YCSB-3   | 50         | 50         | 0          |
      | YCSB-4   | 25         | 75         | 0          |
      | YCSB-5   | 0          | 95         | 5          |

      [Figure: throughput (millions of operations per second) of Baseline, libcuckoo, BCHT, and SmartCuckoo on workloads YCSB-1 to YCSB-5.]
      - As the percentage of insertions decreases, the throughput of all schemes increases.
      - In each workload, SmartCuckoo produces higher throughput than the other three schemes.

  31. Conclusion and future work
      - Cuckoo hashing is cost-efficient and offers O(1) query performance.
      - We address the problem of potential endless loops in item insertion.
      - SmartCuckoo helps improve performance predictability in storage systems.
      - To-do list:
        - SmartCuckoo in hash tables with more than two hash functions
        - The use of multiple slots in each bucket

  32. Thanks and questions?
      Open-source code: https://github.com/syy804123097/SmartCuckoo
