BCStore: Bandwidth-Efficient In-memory KV-Store with Batch Coding
Shenglong Li, Quanlu Zhang, Zhi Yang and Yafei Dai
Peking University
Outline
Introduction and Motivation
Our Design
System and Implementation
Evaluation
In-memory KV-Store
A crucial building block for many systems
– Data cache (e.g., Memcached and Redis at Facebook and Twitter)
– In-memory database
Availability is important for in-memory KV-Stores
– Facebook reports that it takes 2.5-3 hours to recover the 120 GB data of an in-memory database from disk into memory
Data redundancy in distributed memory is essential for fast failover
Two redundancy schemes
Replication is a classical way to provide data availability
– E.g., Repcached, Redis
[Figure: a client's write request goes to the data node, which forwards the update to two backup nodes.]
High memory cost, high bandwidth cost
Two redundancy schemes
Erasure coding is a space-efficient redundancy scheme
The increase in CPU speed enables fast data recovery
– Encoding/decoding rates can reach 40 Gb/s on a single core [1]
[1] Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication, FAST ’16
[Figure: a client's write request goes to a data node, which sends updates to two parity nodes.]
Low memory cost, but high bandwidth cost
In-place Update
A traditional mechanism for encoding small objects
[Figure: objects obj1-obj9 striped across data nodes 1-3 with parity blocks P on parity nodes 1-2; Update(obj4->obj4'), Update(obj3->obj3') and Update(obj8->obj8') each send Delta(obj, obj') to both parity nodes.]
Bandwidth cost is the same as 3-replication: every update ships the new block plus a delta to each of the two parity nodes, three transfers in total (sketched below)
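To make the 1 + m transfer count concrete, here is a minimal sketch (our illustration, not the paper's code; plain XOR stands in for the Galois-field arithmetic a real Reed-Solomon code applies to the delta):

```python
# In-place update under RS(k, m): push the new block to its data node, then
# a delta to each of the m parity nodes. XOR is a stand-in for RS math.
def in_place_update(data_blocks, parity_blocks, idx, new_block):
    old_block = data_blocks[idx]
    delta = bytes(a ^ b for a, b in zip(old_block, new_block))
    data_blocks[idx] = new_block                  # 1 transfer to the data node
    for i, parity in enumerate(parity_blocks):    # m transfers, one per parity
        parity_blocks[i] = bytes(a ^ b for a, b in zip(parity, delta))
    return 1 + len(parity_blocks)                 # total transfers: 1 + m

data, parity = [b"aaaa", b"bbbb", b"cccc"], [b"pppp", b"qqqq"]
assert in_place_update(data, parity, 1, b"BBBB") == 3   # RS(3,2): 3 transfers
```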
Our goal: both memory efficiency and bandwidth efficiency
Outline
Introduction and Motivation
Our Design
System and Implementation
Evaluation
Our Design
Batch coding: aggregate write requests and encode the objects in a new coding stripe, which is appended to storage
[Figure: original stripes obj1-obj9 across data nodes 1-3 and parity nodes 1-2; a batch node collects obj4', obj8', obj3' into a new appended stripe with fresh parity blocks P, and the old copies of obj3, obj4, obj8 are marked invalid.]
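A minimal sketch of this batch coding path (again our illustration; XOR stands in for Reed-Solomon, which would produce m distinct parity blocks rather than m copies):

```python
# Buffer writes; when k blocks accumulate, encode one stripe and append it,
# so parity is never patched in place.
from functools import reduce

K, M = 3, 2        # RS(3, 2)
pending = []       # buffered write requests
stripes = []       # appended coding stripes

def xor_parity(blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def put(block: bytes):
    pending.append(block)
    if len(pending) == K:                      # stripe full: encode once
        stripes.append((pending[:], [xor_parity(pending)] * M))
        pending.clear()                        # old versions of updated
                                               # objects are now invalid
```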
Latency Analysis
Batch coding induces extra request waiting time
Formalize the waiting time as W = f(T, k), where T is the request throughput and k is the number of data nodes
Enforce a latency bound ε on the waiting time
[Figure: waiting time W versus request throughput T, with k = 3.]
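A toy model of this trade-off (our own assumption, not the paper's exact f): if requests arrive at a steady rate T and a stripe seals after k blocks, the i-th request in a batch waits roughly (k - i)/T, so the mean wait is about (k - 1)/(2T); sealing a stripe early once the oldest request has waited ε keeps W under the bound.

```python
# Toy latency model (assumption): requests arrive at a steady rate T; a
# stripe seals when k blocks accumulate or when the oldest pending request
# has waited the latency bound eps.
def mean_wait(T: float, k: int) -> float:
    """If stripes always fill naturally, the i-th arrival waits ~(k - i)/T,
    so the mean waiting time is W = (k - 1) / (2 * T)."""
    return (k - 1) / (2 * T)

def should_seal(pending: int, oldest_wait: float, k: int, eps: float) -> bool:
    """Seal early (padding the stripe if needed) to respect the bound eps."""
    return pending >= k or oldest_wait >= eps
```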
Garbage Collection
Recycle updated or deleted blocks and release the extra parity blocks
Move-based garbage collection
[Figure: GC moves valid blocks from original stripes into batched stripes so the old stripes can be released.]
Much bandwidth cost for updating parity blocks on each move
Garbage Collection
How to reduce the GC bandwidth cost?
– Intuition: GC the stripes with the most invalid blocks
Greedy block moving
[Figure: greedy GC selects the stripes with the most invalid blocks; two block moves release two coding stripes.]
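A hypothetical sketch of the greedy policy (our own illustration of the intuition above):

```python
# Greedy block moving: GC the stripe with the most invalid blocks, re-batch
# its surviving valid blocks, and release the whole stripe.
def greedy_gc(stripes, batch_buffer):
    """stripes: list of block lists; None marks an invalidated block."""
    i = max(range(len(stripes)), key=lambda j: sum(b is None for b in stripes[j]))
    valid = [b for b in stripes[i] if b is not None]
    batch_buffer.extend(valid)   # valid blocks re-enter the batch coder
    del stripes[i]               # stripe (data + parity blocks) is released
    return len(valid)            # block moves = GC bandwidth for this stripe
```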
Garbage Collection
How to further reduce block move?
– Intuition: make updates concentrate on a few stripes
Popularity-based data arrangement
[Figure: hot and cold objects are grouped into separate stripes; only one block move releases two coding stripes.]
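A sketch of the arrangement step (an assumed illustration; how popularity is estimated is not shown here):

```python
# Popularity-based arrangement: stripe the hottest keys together so future
# updates invalidate blocks in a few hot stripes instead of scattering.
def arrange_by_popularity(keys, popularity, k):
    """popularity: key -> estimated update frequency (hypothetical input)."""
    ordered = sorted(keys, key=lambda key: popularity[key], reverse=True)
    return [ordered[i:i + k] for i in range(0, len(ordered), k)]  # k keys/stripe
```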
Bandwidth Analysis
Theorem
GC bandwidth + coding bandwidth ≤ in-place update bandwidth
The detailed proof can be found in our paper
Outline
Introduction and Motivation
Our Design
System and Implementation
Evaluation
System Architecture
[Figure: clients send requests to the batch process, whose modules are preprocessing, batch coding, garbage collection, and metadata management; below it, a storage group consists of data processes and parity processes.]
Handle Write Requests
[Figure: clients issue set(k1, v1), set(k2, v2), set(k3, v3); the batch process records the keys in its hash table, encodes v1, v2, v3 into stripe b1 with parity blocks P1, P2, updates the stripe index, and writes the blocks to data processes 1-3 and parity processes 1-2.]
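A hypothetical sketch (names and layout are our assumptions) of the two metadata structures the write path touches: the hash table maps each key to its stripe, and the stripe index records where a sealed stripe's blocks were placed.

```python
# Metadata on the write path: hash table maps key -> stripe id; the stripe
# index records where each sealed stripe's blocks were placed.
hash_table = {}     # key -> stripe id
stripe_index = {}   # stripe id -> {key: (process, offset)}

def on_stripe_sealed(stripe_id, placements):
    """placements: {key: (process, offset)} for the k values just encoded."""
    stripe_index[stripe_id] = placements
    for key in placements:
        hash_table[key] = stripe_id

on_stripe_sealed("b1", {"k1": ("data1", 0), "k2": ("data2", 0),
                        "k3": ("data3", 0)})
```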
Handle Read Requests
[Figure: get(k1) reaches the batch process, which resolves k1 to stripe b1 via the hash table and stripe index, then fetches v1's block from data process 1.]
Hash table: key -> stripe id (k1 -> b1, k2 -> b1, k3 -> b1)
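The read path then needs only two metadata lookups; a self-contained sketch mirroring the table above (the placements are made up for illustration):

```python
# Read path: two lookups route get(k1) to the process holding its block.
hash_table = {"k1": "b1", "k2": "b1", "k3": "b1"}            # key -> stripe id
stripe_index = {"b1": {"k1": ("data1", 0), "k2": ("data2", 0),
                       "k3": ("data3", 0)}}                   # placements

def handle_get(key):
    stripe_id = hash_table[key]                    # e.g. k1 -> b1
    process, offset = stripe_index[stripe_id][key]
    return process, offset   # the batch process then fetches the block there
```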
Recovery
[Figure: serving get(k1) while data process 1 is down.]
1. Get the stripe's values from any k storage processes (e.g., v2, P1, P2) according to the stripe id
2. Decode to recover the lost blocks, recovering the requested data first
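A stand-in decoder (our assumption: XOR parity rather than Reed-Solomon; RS(k, m) generalizes this so any k of the k + m blocks recover up to m losses):

```python
# With a single XOR parity, any one lost block is the XOR of the k
# surviving blocks of its stripe.
from functools import reduce

def recover_block(surviving):
    """surviving: the k remaining blocks (data and/or parity) of the stripe."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*surviving))

# Toy stripe: d1 = 0x01, d2 = 0x02, parity p = 0x03; lost d1 = d2 XOR p.
assert recover_block([b"\x02", b"\x03"]) == b"\x01"
```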
Outline
Introduction and Motivation
Our Design
System and Implementation
Evaluation
Evaluation
Cluster configuration
– 10 machines running SUSE Linux 11, each containing 12 AMD Opteron 4180 CPUs
– 1 Gb/s Ethernet
Targets of comparison
– In-place update EC (Cocytus [1])
– Replication (Rep)
Workload
– YCSB with different key distributions
– 50%:50% read/write ratio
[1] Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication, FAST ’16
Bandwidth Cost
[Figure: bandwidth cost for different coding schemes.]
BCStore saves up to 51% of bandwidth cost
Throughput
[Figure: throughput for different coding schemes.]
Up to 2.4x throughput improvement
Memory
[Figure: memory consumption for different redundancy schemes.]
BCStore saves up to 41% of memory cost
Latency
[Figures: read latency and write latency.]
Conclusion
Efficiency and availability are two crucial features for in-memory KV-Stores
We build BCStore, an in-memory KV-Store that applies erasure coding for data availability
We design a batch coding mechanism to achieve high bandwidth efficiency for write workloads
We propose a heuristic garbage collection algorithm to improve memory efficiency
Thanks!
Q&A
Severity of Bandwidth Cost
Write requests are prevalent in large-scale web services
– Peak load can easily exhaust network bandwidth and degrade service performance
The monetary cost of bandwidth becomes several times higher
– Especially under the commonly used peak-load pricing model
– Bandwidth amplification becomes more serious as m (the number of parity servers) grows
The bandwidth budget is usually limited in a workload-sharing cluster
Our goal: High memory efficiency and bandwidth efficiency
Challenges
Recycle the memory space of data blocks that are deleted or updated
– Data blocks and parity blocks are appended to storage
– Updated blocks cannot be deleted directly
Encode variable-sized data efficiently
– Variable-sized data cannot be appended directly to previously allocated storage space
Garbage Collection
Popularity-based data arrangement
[Figure: batched objects are sorted from hot to cold before being striped across data nodes 1-3 and parity nodes 1-2.]
Encoding Variable-size Data
Virtual coding stripes (vcs)
– Each virtual coding stripe has a large fixed-length space and is aligned in the virtual address space
[Figure: virtual coding stripes vcs1-vcs3 occupy fixed-length regions of virtual space on data nodes 1-3 and parity nodes 1-2; physical space stores only the bytes actually written.]
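One plausible realization (our assumption; the paper's mechanism may differ in detail) relies on demand paging: reserving a large, aligned virtual region per stripe costs no physical memory until bytes are written.

```python
# Virtual coding stripe: a fixed-length anonymous mapping reserves virtual
# address space; physical pages are only faulted in for bytes written, so
# variable-sized objects get fixed, aligned encoding offsets cheaply.
import mmap

VCS_SIZE = 16 * 1024 * 1024                # fixed virtual size (assumption)

def new_virtual_stripe():
    return mmap.mmap(-1, VCS_SIZE)         # reserves virtual space only

def append_value(vcs, offset, value: bytes):
    vcs[offset:offset + len(value)] = value    # physical pages allocated here
    return offset + len(value)                 # next free offset in the vcs
```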
Bandwidth Cost
[Figure: bandwidth cost for a moderately skewed Zipfian workload, RS(3,2).]
Throughput
[Figure: throughput for a moderately skewed Zipfian workload.]
Throughput
[Figure: throughput during recovery.]
Bandwidth Analysis
Theorem
[Figure: worst case of GC bandwidth — all valid blocks of the original stripes are moved into batched stripes.]
GC bandwidth + coding bandwidth ≤ in-place update bandwidth
Bandwidth Cost
[Figure: bandwidth cost under different throughput levels, RS(5,4).]
Recovery
[Figure: a standby batch process takes over after the primary batch process fails; metadata M is replicated between them.]
1. Get the latest batch id
2. Update the latest stable batch id and reconstruct the metadata (metadata is replicated)
3. Serve requests