 
DATA-INTENSIVE COMPUTING SYSTEMS LABORATORY
PinK: High-speed In-storage Key-value Store with Bounded Tails
Junsu Im, Jinwook Bae, Chanwoo Chung*, Arvind*, and Sungjin Lee
Daegu Gyeongbuk Institute of Science & Technology (DGIST)
*Massachusetts Institute of Technology (MIT)
2020 USENIX Annual Technical Conference (ATC’20, July 15-17)
Key-Value Store is Everywhere!
• Key-value store (KVS) has become a necessary infrastructure (web indexing, caching, storage systems)
• Algorithm: SILK (ATC’19), Dostoevsky (SIGMOD’18), Monkey (SIGMOD’17) …
• System: FlashStore (VLDB’10), WiscKey (FAST’16), LOCS (EuroSys’14) …
• Architecture: BlueCache (VLDB’16) …
Key-Value (KV) Storage Device
• Applications: web indexing, caching, storage systems
• Block-SSD stack: host KVS engine → block device driver → Block-SSD
• KV-SSD stack: key-value interface → KV-SSD device driver → KV-SSD (with capacitor)
• Offloading KVS functionality to the device: fewer host resources, low latency, high throughput
Key-Value (KV) Storage Device
• Academia: LightStore (ASPLOS’19), KV-SSD (SYSTOR’19), iLSM-SSD (MASCOTS’19), KAML (HPCA’17), NVMKV (ATC’15), BlueCache (VLDB’16) …
• Industry: Samsung’s KV-SSD
Key Challenges of Designing KV-SSD
• 1. Limited DRAM resource
  • SSDs usually have DRAM equal to only about 0.1% of NAND capacity for indexing!
  • A KV pair (1KB on average) is smaller than a logical block (4KB), so the same capacity holds more objects to index
  • DRAM scales more slowly than NAND: 1.13x/year (DRAM) vs. 1.43x/year (NAND)
Source: Technology and Cost Trends at Advanced Nodes, 2020, https://semiwiki.com/wp-content/uploads/2020/03/Lithovision-2020.pdf
Key Challenges of Designing KV-SSD (Cont.)
• 2. Limited CPU performance
  • SSDs have low-power CPUs (ARM-based)
  [Figure: ARM CPU vs. x86 CPU]
Which algorithm is better for a KV-SSD with these limitations: hash or log-structured merge-tree (LSM-tree)?
Experiments using Hash-based KV-SSD
• Samsung KV-SSD prototype: hash-based KV-SSD*
• Benchmark
  • KV-SSD: KVBench**, 32B key and 1KB value read requests
  • Block-SSD: FIO, 1KB read requests
• Observations (figures): long tail latency and a performance drop
What is the reason?
*KV-PM983, **Samsung KV-SSD benchmark tool
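Problem of Hash-based KV-SSD
• Setup: SSD 4TB, DRAM 4GB; key 32B, value 1KB
• KAML (HPCA’17): each hash bucket entry holds the full key (32B) and a pointer to the value (4B); index size 144GB >> 4GB
• FlashStore (VLDB’10): each hash bucket entry holds a signature (2B) and a pointer to the KV pair (4B), with the full key and value in flash; index size 24GB > 4GB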
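The two index sizes on this slide can be reproduced with a short calculation. This is a minimal sketch that assumes the number of KV pairs is the SSD capacity divided by the 1KB value size (key storage is ignored) and that sizes are binary units; the function name is illustrative only.

```python
KIB, GIB, TIB = 2**10, 2**30, 2**40

def index_size_gib(ssd_bytes, value_bytes, entry_bytes):
    """Size of an in-DRAM hash index with one entry per KV pair, in GiB."""
    num_pairs = ssd_bytes // value_bytes  # assumption: one pair per 1KB of capacity
    return num_pairs * entry_bytes / GIB

# Full key (32B) + pointer to value (4B) per entry
full_key_index = index_size_gib(4 * TIB, 1 * KIB, 32 + 4)
# Signature (2B) + pointer to KV pair (4B) per entry
signature_index = index_size_gib(4 * TIB, 1 * KIB, 2 + 4)

print(full_key_index, signature_index)
```

Under these assumptions both layouts exceed the 4GB of DRAM, which is why only part of the hash index can be cached.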
Problem of Hash-based KV-SSD
• Example: Get(key 7) → hash function → bucket 10, signature 1000
• Only some hash buckets are cached in the DRAM LRU cache; the rest are in-flash hash buckets
  • Cache miss → flash access to fetch the bucket → performance drop
• Signature collision: several entries share signature 1000, so probing reads other KV pairs (KEY: 16, KEY: 10; "key is not 7") before reaching KEY: 7 → long tail latency
[Figure: cached hash buckets in DRAM and in-flash hash buckets, each holding signature/pointer entries]
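The probing behaviour in this example can be sketched as follows. The bucket layout, the toy data, and the `signature_of` helper are hypothetical and only mirror the slide's Get(key 7) example; they are not the KV-SSD firmware's actual structures.

```python
def hash_get(key, bucket, flash, signature_of):
    """Probe one hash bucket; return (value, number_of_flash_reads)."""
    sig = signature_of(key)
    flash_reads = 0
    for entry_sig, ptr in bucket:
        if entry_sig != sig:
            continue                      # signature mismatch: no flash read needed
        flash_reads += 1                  # must read the KV pair to compare the full key
        stored_key, value = flash[ptr]
        if stored_key == key:
            return value, flash_reads
    return None, flash_reads

# Toy data: keys 16, 10 and 7 all collide on signature 1000.
flash = {0: (16, "v16"), 1: (10, "v10"), 2: (7, "v7")}
bucket = [(1000, 0), (1000, 1), (1000, 2), (1001, 3)]
signature_of = lambda key: 1000 if key in (7, 10, 16) else 1001

value, reads = hash_get(7, bucket, flash, signature_of)
print(value, reads)
```

Each colliding signature costs one extra flash read, so the latency of a single GET depends on how many collisions precede the wanted key.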
LSM-tree?
• Another option: "LSM-tree"
  • Low DRAM requirement
  • No collision
  • Easy to serve range queries
Is the LSM-tree really good enough?
Problem of LSM-tree-based KV-SSD
• 1. Long tail latency!
  • In the worst case, h-1 flash accesses for 1 KV (h = height of LSM-tree)
  • Example: Get(key 7) passes the Bloom filters of the upper levels as false positives (flash read, no key 7) before key 7 is finally found at a lower level
[Figure: Level 0 (memtable) plus per-level Bloom filters and indices in DRAM; indices and values in flash]
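The worst case above can be illustrated with a toy lookup. This is a sketch under simplifying assumptions: each Bloom filter is modelled as a plain set that may contain extra keys (standing in for false positives), and every level that passes its filter costs exactly one flash read.

```python
def lsm_get(key, memtable, levels):
    """Return (value, flash_reads). levels: list of (filter_set, on_flash_dict)."""
    if key in memtable:
        return memtable[key], 0           # Level 0 is served from DRAM
    flash_reads = 0
    for may_contain, on_flash in levels:
        if key not in may_contain:
            continue                      # filter says "definitely absent": skip level
        flash_reads += 1                  # filter passed: read this level from flash
        if key in on_flash:
            return on_flash[key], flash_reads
    return None, flash_reads

# h = 4: memtable plus three on-flash levels; key 7 is a false positive
# in the first two filters and really stored only in the last level.
memtable = {}
levels = [
    ({4, 7}, {4: "a"}),
    ({7, 15}, {15: "b"}),
    ({7, 20}, {7: "c", 20: "d"}),
]
value, reads = lsm_get(7, memtable, levels)
print(value, reads)
```

With h = 4 the lookup of key 7 touches flash three times, matching the h-1 bound stated on the slide.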
Problem of LSM-tree-based KV-SSD
• 2. CPU overhead!
  • Merge sort in compaction
  • Building Bloom filters
• 3. I/O overhead!
  • Compaction I/O added by LSM-tree
[Figure: the ARM CPU merge-sorts Level N and Level N+1 into a new Level N+1 and builds its Bloom filter]
Experiments using LSM-tree-based KV-SSD
• LightStore*: LSM-tree-based KV-SSD
  • Key-value separation (WiscKey**) and Bloom filter (Monkey***)
• Benchmark
  • LightStore: YCSB-LOAD and YCSB-C (read only), 32B key and 1KB value
• Results (figures, panel label: YCSB-C): long tail latency; compaction time breakdown
*ASPLOS’19, **FAST’16, ***SIGMOD’17
PinK: New LSM-tree-based KV-SSD
• Long tail latency?
  • Using "level-pinning"
• CPU overhead?
  • "No Bloom filter"
  • "HW accelerator" for compaction
• I/O overhead?
  • Reducing compaction I/O by level-pinning
  • Optimizing GC by reinserting valid data to LSM-tree
[Figures: L0 to L3 split between DRAM and flash, before and after level-pinning; Level N and Level N+1 merged into a new Level N+1]
• Introduction
• PinK
  • Overview of LSM-tree in PinK
  • Bounding tail latency
  • Memory requirement
  • Reducing search overhead
  • Reducing compaction I/O
  • Reducing sorting time
• Experiments
• Conclusion
Overview of LSM-tree in PinK
• PinK is based on key-value separated LSM-tree
• DRAM: Level 0 is a skiplist of KV pairs; Level 1 to Level h-1 each have a level list (sorted array) whose entries hold a start key and an address pointer to a meta segment
• Flash: the meta segment area holds meta segments (keys with pointers to KVs); the data segment area holds data segments (KV pairs)
Bounding Tail Latency
• Example LSM-tree: # of levels = 5
• LSM-tree with Bloom filter: a GET checks the Bloom filters and binary-searches L1 to L4, all indexed in flash; in the worst case, 4 flash accesses!
• PinK: a GET binary-searches the level lists, with the upper levels held in DRAM; in the worst case, 1 flash access!
Memory usage?
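The two worst-case figures on this slide follow from a simple count. The sketch below assumes that Level 0 always lives in DRAM and that each unpinned level below it costs at most one flash read for its index; pinning three levels is an assumption chosen to reproduce the slide's single-access case.

```python
def worst_case_index_flash_reads(num_levels: int, pinned_levels: int) -> int:
    """Worst-case flash reads spent on index lookups for one GET."""
    on_flash = (num_levels - 1) - pinned_levels  # levels below L0 that are not pinned
    return max(on_flash, 0)

baseline = worst_case_index_flash_reads(num_levels=5, pinned_levels=0)
with_pinning = worst_case_index_flash_reads(num_levels=5, pinned_levels=3)
print(baseline, with_pinning)
```

The bound no longer grows with the tree height once all but the last level are pinned, which is what makes the tail latency predictable.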
Memory Requirement
• 4TB SSD, 4GB DRAM (32B key, 1KB value); total # of levels: 5
• DRAM usage
  • Skip list (L0): 8MB
  • Level list: 432MB
  • Pinned meta segments, by number of pinned levels:
    • 1 level: 1.47MB
    • 2 levels: 68MB
    • 3 levels: 3.1GB
    • 4 levels: 144GB
• Pinning L1 to L3 takes 3.5GB < 4GB in total; the L4 meta segments stay in flash
• Only one flash access for indexing
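The choice of three pinned levels can be checked against the DRAM budget with the figures from the slide. This sketch assumes 1GB = 1024MB and that the 3.5GB total is the sum of the skip list, the level list, and the pinned meta segments; the variable names are illustrative.

```python
MB, GB = 1.0, 1024.0                      # all sizes expressed in MB

DRAM_BUDGET = 4 * GB
SKIP_LIST = 8 * MB
LEVEL_LIST = 432 * MB
# Meta-segment size when pinning the top n levels (values from the slide)
PINNED_META = {1: 1.47 * MB, 2: 68 * MB, 3: 3.1 * GB, 4: 144 * GB}

def max_pinned_levels(budget):
    """Largest number of pinned levels whose total DRAM use fits the budget."""
    fixed = SKIP_LIST + LEVEL_LIST
    fitting = [n for n, size in PINNED_META.items() if fixed + size <= budget]
    return max(fitting, default=0)

pinned = max_pinned_levels(DRAM_BUDGET)
total_gb = (SKIP_LIST + LEVEL_LIST + PINNED_META.get(pinned, 0.0)) / GB
print(pinned, round(total_gb, 1))
```

Pinning a fourth level would need 144GB, so the last level is the only one whose meta segments must be fetched from flash.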
Reducing Search Overhead
• Fractional cascading (h = number of levels, T = size ratio between adjacent levels)
  • Without it: a full binary search at every level, each level being T times larger than the one above; search complexity O(h² log(T)) is burdensome!
  • With it: range pointers restrict the binary search at the next level to the overlapped range; search complexity O(h log(T))
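The range-pointer idea can be sketched with sorted key arrays. This is a simplified illustration, not PinK's level-list format: each level is a plain sorted list, and the range pointer of an entry is assumed to be the index of the first key in the next level that is not smaller than it.

```python
from bisect import bisect_left, bisect_right

def build_range_pointers(levels):
    """ptrs[i][j]: index in levels[i+1] of the first key >= levels[i][j]."""
    return [[bisect_left(nxt, k) for k in cur]
            for cur, nxt in zip(levels, levels[1:])]

def search(levels, ptrs, key):
    """Return the index of the first level holding key, or None."""
    lo, hi = 0, len(levels[0])
    for i, level in enumerate(levels):
        # Binary search only inside [lo, hi), the range overlapped by the level above.
        p = bisect_right(level, key, lo, hi) - 1   # last entry <= key
        if p >= 0 and level[p] == key:
            return i
        if i + 1 < len(levels):
            lo = ptrs[i][p] if p >= 0 else 0
            hi = ptrs[i][p + 1] if p + 1 < len(level) else len(levels[i + 1])
    return None

levels = [[10, 50], [5, 20, 40, 60], [1, 7, 15, 25, 33, 45, 55, 70]]
ptrs = build_range_pointers(levels)
print(search(levels, ptrs, 33), search(levels, ptrs, 8))
```

Because neighbouring range pointers bracket only about T entries of the next level, each level below the first costs a search over a small window rather than the whole level.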
Reducing Search Overhead
• Prefix
  • Less compare overhead
  • Cache-efficient search
  • Binary search first on the prefixes (4B), then on the full keys (32B) among entries with the same prefix
• "Prefix" and "range pointer" memory usage: about 10% of level list
[Figure: level-list entries with prefix (4B), key (32B), and ptr (4B)]
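The two-step prefix search can be sketched as below. It assumes keys are byte strings kept in sorted order and that the prefix is simply the first 4 bytes; Python lists do not show the cache benefit, so the example only demonstrates the control flow.

```python
from bisect import bisect_left, bisect_right

PREFIX_LEN = 4  # bytes, as on the slide

def build_prefixes(sorted_keys):
    """Compact array of 4-byte prefixes, in the same order as the keys."""
    return [k[:PREFIX_LEN] for k in sorted_keys]

def find(sorted_keys, prefixes, key):
    """Return the index of key in sorted_keys, or None."""
    p = key[:PREFIX_LEN]
    lo = bisect_left(prefixes, p)          # step 1: binary search on short prefixes
    hi = bisect_right(prefixes, p)
    i = bisect_left(sorted_keys, key, lo, hi)  # step 2: full keys with the same prefix
    return i if i < hi and sorted_keys[i] == key else None

keys = sorted([b"user0001", b"user0002", b"item0007", b"usez0001"])
prefixes = build_prefixes(keys)
print(find(keys, prefixes, b"user0002"), find(keys, prefixes, b"user0003"))
```

Most comparisons then touch 4-byte values packed closely together, and full 32-byte keys are compared only within the small run that shares the prefix.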
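Reducing Compaction I/O
• PinK without level-pinning: when a level is full, compaction reads and rewrites meta segments in flash before updating the level list (6 reads & 6 writes in the example). Burdensome!
• PinK with level-pinning: the meta segments of pinned levels stay in DRAM (with a capacitor), so compaction only updates the level list: no flash read & write
[Figure: meta segments (1 3) and (2 5 6 9) merged into 1 2 3 5 6 9, in flash vs. in DRAM]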
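Reducing Sorting Time
• A hardware key comparator (==, >, <) performs the merge sort of compaction instead of the ARM CPU
  • Input: meta segment addresses of L n and L n+1; the segments are read from DRAM or flash
  • Output: the new L n+1, written to DRAM or flash at new meta segment addresses
  • PinK then builds the level list of L n+1 from the new addresses
[Figure: L n keys 15 14 11 9 2 and L n+1 keys 16 14 12 10 2 streamed through the key comparator]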
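The comparator's three outcomes map directly onto a two-way merge. The sketch below is a software stand-in for the hardware unit; it assumes entries are (key, pointer) tuples sorted by key and that, on equal keys, the entry from the upper (newer) level replaces the one in the lower level, which is the usual LSM-tree convention rather than something stated on the slide.

```python
def merge_levels(upper, lower):
    """Merge two key-sorted runs of (key, ptr); the upper run wins on equal keys."""
    merged, i, j = [], 0, 0
    while i < len(upper) and j < len(lower):
        ku, kl = upper[i][0], lower[j][0]
        if ku == kl:                       # "==": keep the newer entry, drop the old one
            merged.append(upper[i]); i += 1; j += 1
        elif ku < kl:                      # "<": upper key comes first
            merged.append(upper[i]); i += 1
        else:                              # ">": lower key comes first
            merged.append(lower[j]); j += 1
    merged.extend(upper[i:])               # at most one of these tails is non-empty
    merged.extend(lower[j:])
    return merged

# Keys taken from the slide's figure; "new"/"old" pointers are placeholders.
upper = [(k, "new") for k in (2, 9, 11, 14, 15)]
lower = [(k, "old") for k in (2, 10, 12, 14, 16)]
new_level = merge_levels(upper, lower)
print([k for k, _ in new_level])
```

The loop body is only compare-and-advance, which is why it is a natural candidate for offloading to a small hardware unit.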
PinK Summary
• Long tail latency?
  • Using level-pinning
• CPU overhead?
  • Removing Bloom filter
  • Optimizing binary search
  • Adopting HW accelerator
• I/O overhead?
  • Reducing compaction I/O
  • Optimizing GC by reinserting valid data to LSM-tree
Please refer to the paper!
• Introduction
• PinK
• Experiments
• Conclusion
Custom KV-SSD Prototype and Setup
• All algorithms for KV-SSD were implemented on a ZCU102 board
• For fast experiments: 64GB SSD, 64MB DRAM (0.1% of NAND capacity)
• Client / server: Xeon E5-2640 (20 cores @ 2.4GHz), 32GB DRAM, 10GbE NIC
• KV-SSD platform: Xilinx ZCU102 with Zynq Ultrascale+ SoC (quad-core ARM Cortex-A53 with FPGA), 4GB DRAM, 10GbE
  • Expansion card with custom connectors
  • Flash card: Artix7 FPGA, raw NAND flash chips (256GB)
Benchmark Setup
• YCSB: 32B key, 1KB value

Workload              Load     A        B        C        D        E           F
R:W ratio             0:100    50:50    95:5     100:0    95:5     95:5        50:50 (RMW)
Query type            Point    Point    Point    Point    Point    Range read  Point
Request distribution  Uniform  Zipfian  Zipfian  Zipfian  Latest   Zipfian     Zipfian
(Latest: highest locality)

• Two phases
  • Load: issue 44M unique KV pairs (44GB, 70% of total SSD)
  • Run: issue 44M KV requests following the workload description