

SLIDE 1

PinK: High-speed In-storage Key-value Store with Bounded Tails

Junsu Im, Jinwook Bae, Chanwoo Chung*, Arvind*, and Sungjin Lee

Daegu Gyeongbuk Institute of Science & Technology (DGIST) *Massachusetts Institute of Technology (MIT)

2020 USENIX Annual Technical Conference (ATC ’20, July 15-17)

DATA-INTENSIVE COMPUTING SYSTEMS LABORATORY

SLIDE 2

Key-Value Store is Everywhere!

The key-value store (KVS) has become a necessary piece of infrastructure: web indexing, caching, storage systems.

- Algorithm: SILK (ATC’19), Dostoevsky (SIGMOD’18), Monkey (SIGMOD’17), …
- System: FlashStore (VLDB’10), WiscKey (FAST’16), LOCS (EuroSys’14), …
- Architecture: BlueCache (VLDB’16), …

SLIDE 3

Key-Value (KV) Storage Device

[Figure: host software stacks for web indexing, caching, and storage systems. Block-SSD: the KVS engine and a block device driver run on the host and talk to the SSD over a block interface. KV-SSD: only a KV-SSD device driver runs on the host and talks to the device over a key-value interface.]

Offloading KVS functionality to the device promises fewer host resources, low latency, and high throughput.

SLIDE 4

Key-Value (KV) Storage Device

[Figure: same host stack comparison as the previous slide (Block-SSD vs. KV-SSD).]

Offloading KVS functionality to the device promises fewer host resources, low latency, and high throughput.

- Academia: LightStore (ASPLOS’19), KV-SSD (SYSTOR’19), iLSM-SSD (MASCOTS’19), KAML (HPCA’17), NVMKV (ATC’15), BlueCache (VLDB’16), …
- Industry: Samsung’s KV-SSD

SLIDE 5

Key Challenges of Designing KV-SSD

1. Limited DRAM resource
- SSDs usually carry DRAM equal to only about 0.1% of the NAND capacity for indexing!
- A logical block is 4KB, but a KV pair is only about 1KB on average, so a KV-SSD needs far more index entries (and thus DRAM) per unit capacity than a block device.
- DRAM scalability is slower than NAND: about 1.13x per year vs. 1.43x per year.

Technology and Cost Trends at Advanced Nodes, 2020, https://semiwiki.com/wp-content/uploads/2020/03/Lithovision-2020.pdf

SLIDE 6

Key Challenges of Designing KV-SSD (Cont.)

2. Limited CPU performance
- SSDs have low-power, ARM-based CPUs that are much slower than host x86 CPUs.

Which algorithm is better for a KV-SSD under these limitations: Hash or Log-structured Merge-tree (LSM-tree)?

SLIDE 7

Experiments using Hash-based KV-SSD

- Samsung KV-SSD prototype: a hash-based KV-SSD*
- Benchmark
  - KV-SSD: KVBench**, 32B-key / 1KB-value read requests
  - Block-SSD: FIO, 1KB read requests

[Figure: compared with the Block-SSD, the hash-based KV-SSD shows long tail latency and a performance drop.]

What is the reason?

*KV-PM983, **Samsung KV-SSD benchmark tool

SLIDE 8

Problem of Hash-based KV-SSD

The hash index does not fit in device DRAM (SSD: 4TB, DRAM: 4GB; key: 32B, value: 1KB):
- KAML (HPCA’17) keeps the full key (32B) and a pointer to the value (4B) per hash-bucket entry, with only the value in flash: about 144GB of index, far larger than 4GB.
- FlashStore (VLDB’10) keeps only a signature (2B) and a pointer to the KV pair (4B), with the full key and value in flash: still about 24GB, larger than 4GB.
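To make the DRAM gap concrete, here is a rough back-of-the-envelope calculation (a sketch only: the per-entry sizes are taken from this slide, the rest is plain arithmetic). A 4TB SSD filled with ~1KB KV pairs holds about 4 billion entries, so even the slimmer signature-based index is far larger than the 4GB of on-device DRAM:

# Back-of-the-envelope DRAM requirement for in-device hash indexing.
# Figures follow the slide: 4TB SSD, 4GB DRAM, 32B keys, ~1KB values.
SSD_CAPACITY  = 4 * 2**40          # 4 TB
DRAM_CAPACITY = 4 * 2**30          # 4 GB
AVG_KV_SIZE   = 1 * 2**10          # ~1 KB per KV pair

num_kv_pairs = SSD_CAPACITY // AVG_KV_SIZE          # ~4 billion entries

kaml_index       = num_kv_pairs * (32 + 4)          # full key (32B) + value pointer (4B)
flashstore_index = num_kv_pairs * (2 + 4)           # signature (2B) + KV pointer (4B)

for name, size in [("KAML-style", kaml_index), ("FlashStore-style", flashstore_index)]:
    print(f"{name}: {size / 2**30:.0f} GiB index vs. {DRAM_CAPACITY / 2**30:.0f} GiB DRAM")
# -> 144 GiB and 24 GiB, respectively: neither fits in 4 GiB of DRAM.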

SLIDE 9

Problem of Hash-based KV-SSD

[Figure: walk-through of Get(key 7). The hash function maps the key to a bucket; each bucket slot holds a 2B signature and a pointer to the KV pair in flash, and a subset of the hash buckets is kept in an in-DRAM LRU cache. A signature collision (keys 7, 10, and 16 all map to signature 1000 in bucket 10) forces the device to read other KV pairs from flash only to discover "key is not 7", and a cache miss additionally requires probing the in-flash hash buckets. The result is a performance drop and long tail latency.]
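The lookup path in the figure can be written down in a few lines. Below is a deliberately simplified model (plain Python, in memory only; flash and bucket are hypothetical stand-ins, not the device firmware's actual structures) showing why small signatures cause extra flash reads: on a signature collision, the device must fetch a candidate KV pair from flash just to find out the full key does not match.

# Simplified model of a signature-based hash KV lookup (illustrative, not firmware code).
# Each bucket slot stores a 2-byte signature plus a pointer into flash; the full key
# lives only next to the value in flash, so every collision costs an extra flash read.

def signature(key: bytes) -> int:
    return hash(key) & 0xFFFF                  # 2-byte signature of the key

flash = {}                                     # ptr -> (full_key, value); stand-in for NAND
bucket = []                                    # list of (signature, ptr) slots in DRAM

def put(key: bytes, value: bytes) -> None:
    ptr = len(flash)
    flash[ptr] = (key, value)
    bucket.append((signature(key), ptr))

def get(key: bytes):
    flash_reads = 0
    sig = signature(key)
    for slot_sig, ptr in bucket:
        if slot_sig != sig:
            continue                           # cheap in-DRAM comparison, no flash access
        full_key, value = flash[ptr]           # one flash access per matching signature
        flash_reads += 1
        if full_key == key:
            return value, flash_reads
    return None, flash_reads                   # collisions inflate flash_reads

put(b"key-7", b"value-7")
put(b"key-10", b"value-10")
print(get(b"key-7"))                           # -> (b'value-7', <number of flash reads>)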

SLIDE 10

LSM-tree?

Another option is the LSM-tree:
- Low DRAM requirement
- No collisions
- Easy to serve range queries

Is the LSM-tree really good enough?

SLIDE 11

Problem of LSM-tree-based KV-SSD

[Figure: Get(key 7) on an LSM-tree. Level 0 (the memtable) sits in DRAM; Levels 1 through h each have indices and a bloom filter, with the values stored in flash. The bloom filters of the first two levels checked report false positives ("no key 7"), each wasting a flash access, before key 7 is finally found at a lower level.]

1. Long tail latency!
- In the worst case, h-1 flash accesses for a single KV pair (h = height of the LSM-tree).
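As a rough illustration of that h-1 worst case, here is a small sketch (plain Python, not device firmware; the per-level runs and bloom filters are idealized) of a point lookup that walks the levels from newest to oldest, consults a per-level filter first, and reads flash only when the filter says "maybe": every false positive adds a wasted flash read, and an unlucky key can touch every on-flash level.

# Idealized LSM-tree point lookup: one filter plus one sorted run per level (illustrative).
# A false positive at a level costs a flash read that finds nothing.

class Level:
    def __init__(self, entries):
        self.run = dict(entries)               # stand-in for the level's on-flash sorted run
        self.filter = set(self.run)            # idealized bloom filter (a real one can also
                                               # answer "maybe" for keys it does not hold)

    def maybe_contains(self, key) -> bool:
        return key in self.filter

def get(levels, key):
    flash_reads = 0
    for level in levels:                       # newest level first
        if not level.maybe_contains(key):
            continue                           # "definitely not here": skip the flash read
        flash_reads += 1                       # read this level's run from flash
        if key in level.run:
            return level.run[key], flash_reads
        # filter false positive: the read was wasted
    return None, flash_reads                   # worst case: h-1 reads before the hit (or miss)

levels = [Level({1: "v1", 2: "v2"}), Level({3: "v3", 11: "v11"}), Level({4: "v4", 7: "v7"})]
print(get(levels, 7))                          # -> ('v7', 1): only the last level was read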

SLIDE 12

Problem of LSM-tree-based KV-SSD

2. CPU overhead!
- Merge sort during compaction
- Building bloom filters
Both fall on the SSD's low-power ARM CPU.

3. I/O overhead!
- Extra compaction I/O added by the LSM-tree

[Figure: compaction merge-sorts Level N and Level N+1 into a new Level N+1 and rebuilds its bloom filter on the ARM CPU.]

SLIDE 13

Experiments using LSM-tree-based KV-SSD

- LightStore*: an LSM-tree-based KV-SSD with key-value separation (WiscKey**) and bloom filters (Monkey***)
- Benchmark: YCSB-Load and YCSB-C (read only), 32B key and 1KB value

[Figures: YCSB-C results showing long tail latency, and a compaction time breakdown.]

*ASPLOS’19, **FAST’16, ***SIGMOD’17

SLIDE 14

PinK: New LSM-tree-based KV-SSD

- Long tail latency? -> Use "level-pinning": pin the top levels of the LSM-tree in DRAM.
- CPU overhead? -> No bloom filters; a HW accelerator for compaction.
- I/O overhead? -> Reduce compaction I/O by level-pinning; optimize GC by reinserting valid data into the LSM-tree.

[Figure: a conventional LSM-tree keeps bloom filters and pushes levels L1-L3 to flash, whereas PinK drops the bloom filters and pins the upper levels in DRAM.]

SLIDE 15

Introduction
PinK
- Overview of LSM-tree in PinK
- Bounding tail latency
- Memory requirement
- Reducing search overhead
- Reducing compaction I/O
- Reducing sorting time
Experiments
Conclusion

SLIDE 16

Overview of LSM-tree in PinK

- PinK is based on a key-value-separated LSM-tree.

[Figure: Level 0 is a skiplist of KV pairs in DRAM. Each lower level (Level 1 to Level h-1) has a level list, a sorted array of <start key, address pointer> entries in DRAM, where each entry points to a meta segment in flash. Flash is split into a meta segment area (sorted keys, e.g. 2, 4, 11, 19, each with a pointer to its KV pair) and a data segment area holding the actual KV pairs.]
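That layout can be summarized with a small sketch (hypothetical Python types, not PinK's actual implementation): the in-DRAM level list maps a start key to the address of a meta segment; the meta segment, read from flash, maps keys to pointers into the data segment area where the KV pairs live.

# Sketch of PinK's key-value-separated layout and GET path (illustrative only).
from bisect import bisect_right
from dataclasses import dataclass
from typing import List

@dataclass
class MetaSegment:                 # stored in the flash meta segment area
    keys: List[int]                # sorted keys
    kv_ptrs: List[int]             # pointers into the data segment area

@dataclass
class LevelListEntry:              # stored in DRAM, one per meta segment
    start_key: int
    meta_addr: int

meta_segments = {}                 # meta_addr -> MetaSegment (stand-in for flash)
data_segments = {}                 # kv_ptr -> (key, value)   (stand-in for flash)

def search_level(level_list: List[LevelListEntry], key: int):
    # Binary search the in-DRAM level list for the candidate meta segment,
    # then a single flash read fetches that segment and resolves the KV pointer.
    idx = bisect_right([e.start_key for e in level_list], key) - 1
    if idx < 0:
        return None
    seg = meta_segments[level_list[idx].meta_addr]          # 1 flash access
    if key in seg.keys:
        return data_segments[seg.kv_ptrs[seg.keys.index(key)]]
    return None

meta_segments[0] = MetaSegment(keys=[2, 4, 11, 19], kv_ptrs=[100, 101, 102, 103])
data_segments[102] = (11, "value-11")
level1 = [LevelListEntry(start_key=2, meta_addr=0)]
print(search_level(level1, 11))    # -> (11, 'value-11')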

SLIDE 17

Bounding Tail Latency

[Figure: GET path comparison on a 5-level tree. In a conventional LSM-tree with bloom filters, the lower levels' indices are in flash, so a GET needs up to 4 flash accesses in the worst case. In PinK, the level lists and the meta segments of the upper levels are in DRAM and searched with binary search, so a GET needs at most 1 flash access in the worst case. But what about the memory usage?]

SLIDE 18

Memory Requirement

- 4TB SSD, 4GB DRAM, 32B key, 1KB value; total number of levels: 5
- Cumulative meta-segment size by number of pinned levels: 1 level: 1.47MB, 2 levels: 68MB, 3 levels: 3.1GB, 4 levels: 144GB
- DRAM usage with the top 3 levels (L1-L3) pinned: skip list (L0) 8MB + level lists 432MB + pinned meta segments 3.1GB = about 3.5GB, which fits in 4GB (see the quick sum below)
- Only L4 stays in flash, so indexing needs at most one flash access.
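Summing the components listed on this slide (the numbers are copied from the slide; only the unit conversion and the sum are added) shows why pinning L1-L3 is feasible while pinning L4 is not:

# DRAM budget check for level-pinning (figures from this slide: 4TB SSD, 4GB DRAM).
DRAM_BUDGET_GB = 4.0

components_gb = {
    "skip list (L0)": 8 / 1024,            # 8MB
    "level lists": 432 / 1024,             # 432MB
    "pinned meta segments (L1-L3)": 3.1,   # 3 pinned levels: 3.1GB
}
total_gb = sum(components_gb.values())
print(f"pinned DRAM: {total_gb:.2f} GB of {DRAM_BUDGET_GB} GB budget")
# ~3.5GB < 4GB, so only L4 stays in flash: at most one flash read for indexing.
# Pinning L4 as well would need ~144GB of DRAM, which is clearly infeasible.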

SLIDE 19

Reducing Search Overhead

- Fractional cascading
- Running a full binary search over every level list is burdensome. Instead, each level-list entry keeps a range pointer to the slice of the next level's list that overlaps its key range, so only the first level needs a full binary search and every lower level is binary-searched only within that overlapped range, cutting the overall search cost.

[Figure: per-level full binary searches vs. range-pointer-guided searches over levels whose sizes grow by a factor of T per level; the slide contrasts the two search complexities.]
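A minimal sketch of the range-pointer idea (hypothetical structures, not PinK's code): the top level gets a full binary search, and each entry carries the index window of the next level's list that overlaps its key range, so lower levels search only a narrow window.

# Fractional-cascading-style search over per-level sorted key lists (illustrative).
from bisect import bisect_right

def search(levels, key):
    # levels[i] = {"keys": sorted keys,
    #              "range_ptr": for each key, the (lo, hi) window into level i+1's keys}
    lo, hi = 0, len(levels[0]["keys"])
    for depth, level in enumerate(levels):
        keys = level["keys"]
        # Binary search only within the window handed down from the level above.
        idx = bisect_right(keys, key, lo, hi) - 1
        if idx >= lo and keys[idx] == key:
            return depth, idx                    # exact hit at this level
        if depth + 1 == len(levels):
            break
        if idx < lo:                             # key precedes this window: fall back to a full search
            lo, hi = 0, len(levels[depth + 1]["keys"])
        else:
            lo, hi = level["range_ptr"][idx]     # narrowed window for the next level
    return None

levels = [
    {"keys": [10, 40], "range_ptr": [(0, 2), (2, 4)]},
    {"keys": [12, 25, 43, 77], "range_ptr": None},
]
print(search(levels, 25))                        # -> (1, 1): found in level 1 via window (0, 2)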

SLIDE 20

Reducing Search Overhead

- Prefix: each level-list entry stores a 4B key prefix alongside the 32B key and the 4B pointer.
- The binary search first compares the compact prefixes (cache-efficient, less compare overhead) and compares full keys only among entries that share the same prefix.
- "Prefix" and "range pointer" memory usage: about 10% of the level list.
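A small sketch of the prefix trick (illustrative; the key layout is hypothetical): the search narrows candidates using only a compact prefix array, and touches the long keys only to break ties among equal prefixes.

# Prefix-assisted lookup over a sorted level list (illustrative sketch).
from bisect import bisect_left

full_keys = sorted([b"apple-0001", b"apple-0002", b"banana-0001"])   # long keys (sorted)
prefixes = [k[:4] for k in full_keys]                                # compact 4-byte prefixes

def lookup(key: bytes):
    p = key[:4]
    # Narrow the range using only the small prefix array (cache friendly) ...
    i = bisect_left(prefixes, p)
    # ... then compare full keys only inside the run of equal prefixes.
    while i < len(prefixes) and prefixes[i] == p:
        if full_keys[i] == key:
            return i
        i += 1
    return None

print(lookup(b"apple-0002"))   # -> 1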

SLIDE 21

Reducing Compaction I/O

[Figure: compacting a level holding keys 2, 5, 6, 9 with a level holding keys 1, 3 to produce 1, 2, 3, 5, 6, 9. Without level-pinning, the meta segments of both levels are read from flash, merged, and written back (6 reads and 6 writes in this example), which is burdensome. With level-pinning, the pinned entries are already in DRAM, so the merge just updates the level list: no flash reads or writes.]
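A tiny sketch contrasting the two cases (Python lists stand in for the meta segments; flash I/O is only counted, not performed, and the counting rule is a simplification of the figure):

# Counting flash I/O for compaction with and without level-pinning (illustrative).
import heapq

def compact(upper, lower, upper_levels_pinned: bool):
    reads = writes = 0
    if not upper_levels_pinned:
        reads += len(upper) + len(lower)      # fetch both levels' meta segments from flash
        writes += len(upper) + len(lower)     # write the merged meta segments back to flash
    merged = list(heapq.merge(upper, lower))  # the merge itself happens in DRAM either way
    return merged, reads, writes

upper = [2, 5, 6, 9]                          # keys in level N
lower = [1, 3]                                # keys in level N+1

print(compact(upper, lower, upper_levels_pinned=False))  # ([1, 2, 3, 5, 6, 9], 6, 6)
print(compact(upper, lower, upper_levels_pinned=True))   # ([1, 2, 3, 5, 6, 9], 0, 0)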

SLIDE 22

Reducing Sorting Time

[Figure: PinK offloads the merge-sort of compaction from the ARM CPU to a HW key comparator (==, >, <). The accelerator reads the meta segment addresses of Ln and Ln+1 from DRAM or flash, compares the keys, and writes back the new addresses for the merged meta segments, producing the new level list of Ln+1.]

SLIDE 23

PinK Summary

- Long tail latency? -> Level-pinning
- CPU overhead? -> Removing bloom filters, optimizing binary search, adopting a HW accelerator
- I/O overhead? -> Reducing compaction I/O; optimizing GC by reinserting valid data into the LSM-tree

Please refer to the paper!

[Figure: a conventional LSM-tree with bloom filters and levels L1-L3 in flash vs. PinK with the upper levels pinned in DRAM and a HW accelerator beside the ARM CPU.]

SLIDE 24

Introduction
PinK
Experiments
Conclusion

SLIDE 25

Custom KV-SSD Prototype and Setup

- KV-SSD platform: Xilinx ZCU102 board with a Zynq UltraScale+ SoC (quad-core ARM Cortex-A53 with FPGA) and 4GB DRAM, plus a custom flash card (Artix-7 FPGA, 256GB of raw NAND flash chips) attached via the expansion card connectors.
- All KV-SSD algorithms were implemented on the ZCU102 board.
- For fast experiments: 64GB SSD capacity and 64MB DRAM (0.1% of the NAND capacity).
- Client server: Xeon E5-2640 (20 cores @ 2.4GHz), 32GB DRAM, connected to the KV-SSD platform through a 10GbE NIC.

SLIDE 26

Benchmark Setup

- YCSB: 32B key, 1KB value
- Two phases
  - Load: insert 44M unique KV pairs (44GB, 70% of the total SSD capacity)
  - Run: issue 44M KV requests following the workload description

Workload   R:W ratio     Query type    Request distribution
Load       0:100         Point         Uniform
A          50:50         Point         Zipfian
B          95:5          Point         Zipfian
C          100:0         Point         Zipfian
D          95:5          Point         Latest (highest locality)
E          95:5          Range read    Zipfian
F          50:50 (RMW)   Point         Zipfian
SLIDE 27

Testing Algorithms

- Hash
  - 8-bit signatures; 320MB of hash buckets in total
- LSM-tree
  - A conventional LSM-tree implementation based on LightStore*
  - 5 levels in total (levels 1-4 in flash)
- PinK
  - 5 levels in total (top 3 levels pinned, one level in flash)
- PinK+HW
  - PinK using the HW accelerator for compaction

DRAM usage (64MB for every scheme):
- Hash: LRU bucket cache (64MB)
- LSM-tree: level list (9MB) + bloom filters (55MB)
- PinK / PinK+HW: level list with prefixes and range pointers (10MB) + level-pinning (54MB)

*ASPLOS’19

SLIDE 28

Experiment: Throughput

[Figure: throughput across the YCSB workloads. Annotated gains include 156% and 21%; on the read-only workload, PinK is 37% higher than Hash and 44% higher than LSM-tree.]

SLIDE 29

Experiment: Latency


SLIDE 30

Experiment: Impact of Level-pinning


SLIDE 31

Experiment: Search Optimization

- Settings
  - PinK (NO-OPT): PinK without prefixes and range pointers
  - Benchmark: YCSB-Load and YCSB-C

SLIDE 32

Experiment: Level-pinning on Higher LSM-tree

- Benchmark: YCSB-C
- LSM-tree vs. PinK with 4 to 8 total levels
- PinK: 4 or 5 levels -> 1 unpinned level; 6 or 7 levels -> 2 unpinned levels; 8 levels -> 3 unpinned levels (same memory budget in all cases)

[Figure: read and write performance as the number of levels grows; the annotations contrast bad vs. good write performance and bad read performance across the configurations.]

SLIDE 33

Introduction
PinK
Experiments
Conclusion

SLIDE 34

Conclusion

- Because conventional KV-SSD algorithms did not take the embedded system's limitations into account, they suffer from long tail latency and throughput degradation.
- PinK
  - Pins the KV indices of the top levels of the LSM-tree in DRAM to reduce latency
  - Uses a HW accelerator for compaction sorting
- Benefits
  - 99th-percentile tail latency: reduced by 73%
  - Average latency: reduced by 42%
  - Throughput: improved by 37%

SLIDE 35

Thank You!

Junsu Im (junsu_im@dgist.ac.kr)
