BCStore: Bandwidth-Efficient In-memory KV-Store with Batch Coding
Shenglong Li, Quanlu Zhang, Zhi Yang and Yafei Dai
Peking University
Outline
Introduction and Motivation
Our Design
System and Implementation
Evaluation
In-memory KV-Store
A crucial building block for many systems
– Data cache (e.g., Memcached and Redis at Facebook and Twitter)
– In-memory database
Availability is important for in-memory KV-Stores
– Facebook reports that it takes 2.5-3 hours to recover the 120 GB data of an in-memory database from disk into memory
Data redundancy in distributed memory is essential for fast failover
Two redundancy schemes
Replication is a classical way to provide data availability
– E.g., Repcached, Redis
[Figure: a client's write request goes to the data node, which forwards the update to two backup nodes.]
High memory cost, high bandwidth cost
Two redundancy schemes
Erasure coding is a space-efficient redundancy scheme
The increase in CPU speed enables fast data recovery
– Encoding/decoding rates can reach 40 Gb/s on a single core [1]
[1] Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication, FAST ’16
[Figure: a client's write request goes to a data node, which sends updates to two parity nodes.]
Low memory cost, but high bandwidth cost
In-place Update
A traditional mechanism for encoding small objects
[Figure: objects obj1-obj9 striped across data nodes 1-3 with parity blocks P on parity nodes 1-2; Update(obj4->obj4'), Update(obj3->obj3') and Update(obj8->obj8') each send Delta(obj, obj') to both parity nodes.]
Bandwidth cost is the same as 3-replication: every update ships the new block plus a delta to each of the two parity nodes, three transfers in total (sketched below)
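To make the 1 + m transfer count concrete, here is a minimal sketch (our illustration, not the paper's code; plain XOR stands in for the Galois-field arithmetic a real Reed-Solomon code applies to the delta):

```python
# In-place update under RS(k, m): push the new block to its data node, then
# a delta to each of the m parity nodes. XOR is a stand-in for RS math.
def in_place_update(data_blocks, parity_blocks, idx, new_block):
    old_block = data_blocks[idx]
    delta = bytes(a ^ b for a, b in zip(old_block, new_block))
    data_blocks[idx] = new_block                  # 1 transfer to the data node
    for i, parity in enumerate(parity_blocks):    # m transfers, one per parity
        parity_blocks[i] = bytes(a ^ b for a, b in zip(parity, delta))
    return 1 + len(parity_blocks)                 # total transfers: 1 + m

data, parity = [b"aaaa", b"bbbb", b"cccc"], [b"pppp", b"qqqq"]
assert in_place_update(data, parity, 1, b"BBBB") == 3   # RS(3,2): 3 transfers
```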
Our goal: both memory efficiency and bandwidth efficiency
Outline
Introduction and Motivation
Our Design
System and Implementation
Evaluation
Our Design
Batch coding: aggregate write requests and encode the objects in a new coding stripe, which is appended to storage
[Figure: original stripes obj1-obj9 across data nodes 1-3 and parity nodes 1-2; a batch node collects obj4', obj8', obj3' into a new appended stripe with fresh parity blocks P, and the old copies of obj3, obj4, obj8 are marked invalid.]
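A minimal sketch of this batch coding path (again our illustration; XOR stands in for Reed-Solomon, which would produce m distinct parity blocks rather than m copies):

```python
# Buffer writes; when k blocks accumulate, encode one stripe and append it,
# so parity is never patched in place.
from functools import reduce

K, M = 3, 2        # RS(3, 2)
pending = []       # buffered write requests
stripes = []       # appended coding stripes

def xor_parity(blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def put(block: bytes):
    pending.append(block)
    if len(pending) == K:                      # stripe full: encode once
        stripes.append((pending[:], [xor_parity(pending)] * M))
        pending.clear()                        # old versions of updated
                                               # objects are now invalid
```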
Latency Analysis
Batch coding induces extra request waiting time
Formalize the waiting time as W = f(T, k), where T is the request throughput and k is the number of data nodes
Enforce a latency bound ε on the waiting time
[Figure: waiting time W versus request throughput T, with k = 3.]
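A toy model of this trade-off (our own assumption, not the paper's exact f): if requests arrive at a steady rate T and a stripe seals after k blocks, the i-th request in a batch waits roughly (k - i)/T, so the mean wait is about (k - 1)/(2T); sealing a stripe early once the oldest request has waited ε keeps W under the bound.

```python
# Toy latency model (assumption): requests arrive at a steady rate T; a
# stripe seals when k blocks accumulate or when the oldest pending request
# has waited the latency bound eps.
def mean_wait(T: float, k: int) -> float:
    """If stripes always fill naturally, the i-th arrival waits ~(k - i)/T,
    so the mean waiting time is W = (k - 1) / (2 * T)."""
    return (k - 1) / (2 * T)

def should_seal(pending: int, oldest_wait: float, k: int, eps: float) -> bool:
    """Seal early (padding the stripe if needed) to respect the bound eps."""
    return pending >= k or oldest_wait >= eps
```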
Garbage Collection
Recycle updated or deleted blocks and release the extra parity blocks
Move-based garbage collection
[Figure: GC moves valid blocks from original stripes into batched stripes so the old stripes can be released.]
Much bandwidth cost for updating parity blocks on each move
Garbage Collection
How to reduce the GC bandwidth cost?
– Intuition: GC the stripes with the most invalid blocks
Greedy block moving
[Figure: greedy GC selects the stripes with the most invalid blocks; two block moves release two coding stripes.]
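A hypothetical sketch of the greedy policy (our own illustration of the intuition above):

```python
# Greedy block moving: GC the stripe with the most invalid blocks, re-batch
# its surviving valid blocks, and release the whole stripe.
def greedy_gc(stripes, batch_buffer):
    """stripes: list of block lists; None marks an invalidated block."""
    i = max(range(len(stripes)), key=lambda j: sum(b is None for b in stripes[j]))
    valid = [b for b in stripes[i] if b is not None]
    batch_buffer.extend(valid)   # valid blocks re-enter the batch coder
    del stripes[i]               # stripe (data + parity blocks) is released
    return len(valid)            # block moves = GC bandwidth for this stripe
```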
Garbage Collection
How to further reduce block move?
– Intuition: make updates concentrate on a few stripes
Popularity-based data arrangement
[Figure: hot and cold objects are grouped into separate stripes; only one block move releases two coding stripes.]
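A sketch of the arrangement step (an assumed illustration; how popularity is estimated is not shown here):

```python
# Popularity-based arrangement: stripe the hottest keys together so future
# updates invalidate blocks in a few hot stripes instead of scattering.
def arrange_by_popularity(keys, popularity, k):
    """popularity: key -> estimated update frequency (hypothetical input)."""
    ordered = sorted(keys, key=lambda key: popularity[key], reverse=True)
    return [ordered[i:i + k] for i in range(0, len(ordered), k)]  # k keys/stripe
```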
Bandwidth Analysis
Theorem
GC bandwidth + coding bandwidth ≤ in-place update bandwidth
The detailed proof can be found in our paper
Outline
Introduction and Motivation
Our Design
System and Implementation
Evaluation
System Architecture
[Figure: clients send requests to the batch process, whose modules are preprocessing, batch coding, garbage collection, and metadata management; below it, a storage group consists of data processes and parity processes.]
Handle Write Requests
[Figure: clients issue set(k1, v1), set(k2, v2), set(k3, v3); the batch process records the keys in its hash table, encodes v1, v2, v3 into stripe b1 with parity blocks P1, P2, updates the stripe index, and writes the blocks to data processes 1-3 and parity processes 1-2.]
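A hypothetical sketch (names and layout are our assumptions) of the two metadata structures the write path touches: the hash table maps each key to its stripe, and the stripe index records where a sealed stripe's blocks were placed.

```python
# Metadata on the write path: hash table maps key -> stripe id; the stripe
# index records where each sealed stripe's blocks were placed.
hash_table = {}     # key -> stripe id
stripe_index = {}   # stripe id -> {key: (process, offset)}

def on_stripe_sealed(stripe_id, placements):
    """placements: {key: (process, offset)} for the k values just encoded."""
    stripe_index[stripe_id] = placements
    for key in placements:
        hash_table[key] = stripe_id

on_stripe_sealed("b1", {"k1": ("data1", 0), "k2": ("data2", 0),
                        "k3": ("data3", 0)})
```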
Handle Read Requests
[Figure: get(k1) reaches the batch process, which resolves k1 to stripe b1 via the hash table and stripe index, then fetches v1's block from data process 1.]
Hash table: key -> stripe id (k1 -> b1, k2 -> b1, k3 -> b1)
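The read path then needs only two metadata lookups; a self-contained sketch mirroring the table above (the placements are made up for illustration):

```python
# Read path: two lookups route get(k1) to the process holding its block.
hash_table = {"k1": "b1", "k2": "b1", "k3": "b1"}            # key -> stripe id
stripe_index = {"b1": {"k1": ("data1", 0), "k2": ("data2", 0),
                       "k3": ("data3", 0)}}                   # placements

def handle_get(key):
    stripe_id = hash_table[key]                    # e.g. k1 -> b1
    process, offset = stripe_index[stripe_id][key]
    return process, offset   # the batch process then fetches the block there
```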
Recovery
[Figure: serving get(k1) while data process 1 is down.]
1. Get the stripe's values from any k storage processes (e.g., v2, P1, P2) according to the stripe id
2. Decode to recover the lost blocks, recovering the requested data first
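A stand-in decoder (our assumption: XOR parity rather than Reed-Solomon; RS(k, m) generalizes this so any k of the k + m blocks recover up to m losses):

```python
# With a single XOR parity, any one lost block is the XOR of the k
# surviving blocks of its stripe.
from functools import reduce

def recover_block(surviving):
    """surviving: the k remaining blocks (data and/or parity) of the stripe."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*surviving))

# Toy stripe: d1 = 0x01, d2 = 0x02, parity p = 0x03; lost d1 = d2 XOR p.
assert recover_block([b"\x02", b"\x03"]) == b"\x01"
```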
Outline
Introduction and Motivation
Our Design
System and Implementation
Evaluation
Evaluation
Cluster configuration
– 10 machines running SUSE Linux 11, each containing 12 AMD Opteron 4180 CPUs
– 1 Gb/s Ethernet
Targets of comparison
– In-place update EC (Cocytus [1])
– Replication (Rep)
Workload
– YCSB with different key distributions
– 50%:50% read/write ratio
[1] Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication, FAST ’16
Bandwidth Cost
[Figure: bandwidth cost for different coding schemes.]
BCStore saves up to 51% of bandwidth cost
Throughput
[Figure: throughput for different coding schemes.]
Up to 2.4x throughput improvement
Memory
[Figure: memory consumption for different redundancy schemes.]
BCStore saves up to 41% of memory cost
Latency
[Figures: read latency and write latency.]
Conclusion
Efficiency and availability are two crucial features for in-memory KV-Stores
We build BCStore, an in-memory KV-Store that applies erasure coding for data availability
We design a batch coding mechanism to achieve high bandwidth efficiency for write workloads
We propose a heuristic garbage collection algorithm to improve memory efficiency
Thanks!
Q&A
Severity of Bandwidth Cost
Write requests are prevalent in large-scale web services
– Peak load can easily exhaust network bandwidth and degrade service performance
The monetary cost of bandwidth becomes several times higher
– Especially under the commonly used peak-load pricing model
– Bandwidth amplification becomes more serious as m (the number of parity servers) grows
The bandwidth budget is usually limited in a workload-sharing cluster
Our goal: High memory efficiency and bandwidth efficiency
Challenges
Recycle the memory space of data blocks that are deleted or updated
– Data blocks and parity blocks are appended to storage
– Updated blocks cannot be deleted directly
Encode variable-sized data efficiently
– Variable-sized data cannot be appended directly to previously allocated storage space
Garbage Collection
Popularity-based data arrangement
[Figure: batched objects are sorted from hot to cold before being striped across data nodes 1-3 and parity nodes 1-2.]
Encoding Variable-size Data
Virtual coding stripes (vcs)
– Each virtual coding stripe has a large fixed-length space and is aligned in the virtual address space
[Figure: virtual coding stripes vcs1-vcs3 occupy fixed-length regions of virtual space on data nodes 1-3 and parity nodes 1-2; physical space stores only the bytes actually written.]
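One plausible realization (our assumption; the paper's mechanism may differ in detail) relies on demand paging: reserving a large, aligned virtual region per stripe costs no physical memory until bytes are written.

```python
# Virtual coding stripe: a fixed-length anonymous mapping reserves virtual
# address space; physical pages are only faulted in for bytes written, so
# variable-sized objects get fixed, aligned encoding offsets cheaply.
import mmap

VCS_SIZE = 16 * 1024 * 1024                # fixed virtual size (assumption)

def new_virtual_stripe():
    return mmap.mmap(-1, VCS_SIZE)         # reserves virtual space only

def append_value(vcs, offset, value: bytes):
    vcs[offset:offset + len(value)] = value    # physical pages allocated here
    return offset + len(value)                 # next free offset in the vcs
```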
Bandwidth Cost
[Figure: bandwidth cost for a moderately skewed Zipfian workload, RS(3,2).]
Throughput
[Figure: throughput for a moderately skewed Zipfian workload.]
Throughput
[Figure: throughput during recovery.]
Bandwidth Analysis
Theorem
[Figure: worst case of GC bandwidth — all valid blocks of the original stripes are moved into batched stripes.]
GC bandwidth + coding bandwidth ≤ in-place update bandwidth
Bandwidth Cost
[Figure: bandwidth cost under different throughput levels, RS(5,4).]
Recovery
[Figure: a standby batch process takes over after the primary batch process fails; metadata M is replicated between them.]
1. Get the latest batch id
2. Update the latest stable batch id and reconstruct the metadata (metadata is replicated)
3. Serve requests