SLIDE 1

BCStore: Bandwidth-Efficient In-memory KV-Store with Batch Coding

Shenglong Li, Quanlu Zhang, Zhi Yang, and Yafei Dai (Peking University)

SLIDE 2

Outline

- Introduction and Motivation
- Our Design
- System and Implementation
- Evaluation

SLIDE 3

Outline

- Introduction and Motivation
- Our Design
- System and Implementation
- Evaluation

SLIDE 4

In-memory KV-Store

- A crucial building block for many systems
  - Data cache (e.g., Memcached and Redis at Facebook and Twitter)
  - In-memory database
- Availability is important for in-memory KV-stores
  - Facebook reports that it takes 2.5 to 3 hours to recover 120 GB of in-memory database data from disk to memory

Data redundancy in distributed memory is essential for fast failover.

SLIDE 5

Two redundancy schemes

- Replication is a classical way to provide data availability
  - E.g., Repcached, Redis

[Diagram: a client's write request goes to the data node, which propagates the update to two backup nodes.]

Drawbacks: high memory cost and high bandwidth cost.

SLIDE 6

Two redundancy schemes

- Erasure coding is a space-efficient redundancy scheme
- The increase in CPU speed enables fast data recovery
  - Encoding/decoding rates can reach 40 Gb/s on a single core [1]

[1] Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication, FAST '16

[Diagram: a client's write request goes to a data node, which must also send updates to the two parity nodes of the stripe.]

Low memory cost, but still high bandwidth cost.

SLIDE 7

In-place Update

- A traditional mechanism for encoding small objects (see the sketch below)

[Diagram: objects obj1-obj9 striped across data nodes 1-3, with parity blocks on parity nodes 1-2. Update(obj4 -> obj4'), Update(obj3 -> obj3'), and Update(obj8 -> obj8') each ship Delta(obj, obj') to both parity nodes.]

With two parity nodes, every update transfers a delta to each parity node, so the bandwidth cost is the same as 3-replication.

Our goal: both memory efficiency and bandwidth efficiency
SLIDE 8

Outline

- Introduction and Motivation
- Our Design
- System and Implementation
- Evaluation

SLIDE 9

Our Design

- Batch coding: aggregate write requests and encode the objects in a new coding stripe, appended to storage (sketched below)

[Diagram: the updated objects obj4', obj8', obj3' are gathered on a batch node, encoded into a fresh stripe with its own parity blocks, and appended. The old copies of obj4, obj8, obj3 in the original stripes (obj1-obj9 across data nodes 1-3, parities on parity nodes 1-2) become invalid.]
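A minimal sketch of the batch-coding idea, assuming fixed-size values and one XOR parity in place of RS(k, m); the class and its interface are our invention for illustration.

```python
from functools import reduce

def xor_parity(blocks: list[bytes]) -> bytes:
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

class BatchCoder:
    """Aggregate writes, then encode each group of k new values as a
    fresh stripe that is appended; old copies are only marked invalid
    (reclaimed later by GC), so no parity delta is ever shipped."""
    def __init__(self, k: int):
        self.k = k
        self.pending: list[tuple[str, bytes]] = []
        self.stripes: list[list[bytes]] = []   # appended coding stripes
        self.invalid: set[str] = set()         # keys whose old block is garbage

    def put(self, key: str, value: bytes) -> None:
        self.invalid.add(key)                  # supersede any old copy
        self.pending.append((key, value))
        if len(self.pending) == self.k:        # stripe full: encode + append
            blocks = [v for _, v in self.pending]
            self.stripes.append(blocks + [xor_parity(blocks)])
            self.pending.clear()

coder = BatchCoder(k=3)
for key, val in [("obj4", b"4455"), ("obj8", b"8899"), ("obj3", b"3322")]:
    coder.put(key, val)   # three updates cost one parity block in total,
                          # not one delta per update per parity node
```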

SLIDE 10

Latency Analysis

- Batch coding induces extra request waiting time
- We formalize the waiting time W as a function of the request throughput T and the number of data nodes k, W = f(T, k), subject to a latency bound ε (see the sketch below)

[Figure: waiting time W versus request throughput T, plotted for k = 3]
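One plausible way a latency bound ε can cap W is to flush a batch either when it reaches k requests or when its oldest request has waited ε. The paper derives W = f(T, k) analytically; this Python sketch only illustrates the mechanism, and its interface is an assumption.

```python
import time

def batches(incoming, k: int, eps: float):
    """Yield batches of at most k requests, flushing early once the
    oldest buffered request has waited eps seconds. At throughput T a
    batch fills in roughly (k - 1) / T seconds, so at high load the
    wait is small and eps only binds when traffic is light. (A real
    server would arm a timer; this generator checks on each arrival.)"""
    batch, t0 = [], 0.0
    for req in incoming:
        if not batch:
            t0 = time.monotonic()
        batch.append(req)
        if len(batch) == k or time.monotonic() - t0 >= eps:
            yield batch
            batch = []
    if batch:
        yield batch   # drain the tail on shutdown
```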

SLIDE 11

Garbage Collection

- Recycle updated or deleted blocks and release the extra parity blocks
- Move-based garbage collection

[Diagram: GC moves valid blocks out of original and batched stripes on the data nodes; the corresponding parity blocks on the parity nodes must then be updated, which costs much bandwidth.]

SLIDE 12

Garbage Collection

- How to reduce the GC bandwidth cost?
  - Intuition: GC the stripes with the most invalid blocks
- Greedy block moving (see the sketch below)

[Diagram: by collecting the stripes with the most invalid blocks first, only two block moves are needed to release two coding stripes.]
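A sketch of the greedy rule under assumed data structures (a stripe is a list of (key, valid) pairs): victims are the stripes with the most invalid blocks, since they free whole stripes (data plus parity) for the fewest block moves.

```python
def pick_gc_victims(stripes: dict[str, list[tuple[str, bool]]], need: int):
    """Choose `need` stripes to collect, most-invalid first, and count
    the valid blocks that must be moved (each move costs bandwidth and
    a parity update in the stripe it lands in)."""
    ranked = sorted(stripes,
                    key=lambda sid: sum(not ok for _, ok in stripes[sid]),
                    reverse=True)
    victims = ranked[:need]
    moves = sum(ok for sid in victims for _, ok in stripes[sid])
    return victims, moves

stripes = {
    "s1": [("obj1", True), ("obj4", False), ("obj7", False)],
    "s2": [("obj2", False), ("obj5", True), ("obj8", False)],
    "s3": [("obj3", True), ("obj6", True), ("obj9", True)],
}
victims, moves = pick_gc_victims(stripes, need=2)
# -> two moves (obj1 and obj5) release two coding stripes, as on the slide
```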

SLIDE 13

Garbage Collection

- How to further reduce block moves?
  - Intuition: make updates concentrate on a few stripes
- Popularity-based data arrangement (sketched below)

[Diagram: hot and cold objects are segregated into separate stripes, so updates invalidate blocks clustered in the hot stripes; only one block move releases two coding stripes.]
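A sketch of the arrangement step, with hotness modeled as an assumed per-key update frequency (the slide's hot/cold labels):

```python
def arrange_by_popularity(objs: list[tuple[str, bytes, float]], k: int):
    """Sort batched objects by hotness before striping, so keys likely
    to be updated soon share stripes. Later updates then invalidate
    blocks clustered in a few hot stripes, which greedy GC can release
    with very few block moves, while cold stripes stay untouched."""
    ranked = sorted(objs, key=lambda o: o[2], reverse=True)
    return [ranked[i:i + k] for i in range(0, len(ranked), k)]

batched = [("obj3", b"...", 0.9), ("obj6", b"...", 0.1),
           ("obj4", b"...", 0.8), ("obj8", b"...", 0.7)]
print(arrange_by_popularity(batched, k=2))
# -> hot stripe: obj3 + obj4; cold stripe: obj8 + obj6
```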

SLIDE 14

Bandwidth Analysis

- Theorem: GC bandwidth + coding bandwidth <= in-place update bandwidth
  - The detailed proof can be found in our paper
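In symbols (our shorthand, not notation from the paper), writing B_gc for the bandwidth spent on GC block moves, B_code for the bandwidth of encoding and appending batched stripes, and B_inplace for what in-place update would spend on the same write workload, the theorem reads:

```latex
B_{\mathrm{gc}} + B_{\mathrm{code}} \le B_{\mathrm{inplace}}
```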

SLIDE 15

Outline

- Introduction and Motivation
- Our Design
- System and Implementation
- Evaluation

SLIDE 16

System Architecture

[Architecture diagram: clients send requests to a batch process, which handles preprocessing, batch coding, garbage collection, and metadata management, and fronts a storage group of data processes and parity processes.]

SLIDE 17

Handle Write Requests

[Diagram: clients issue set(k1, v1), set(k2, v2), set(k3, v3). The batch process groups v1, v2, v3 into stripe b1, encodes parities P1 and P2, stores the three values on data processes 1-3 and the parities on parity processes 1-2, then updates its hash table and stripe index. A sketch follows.]
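A sketch of this write path with in-process stubs for the data and parity processes; the class names, the XOR stand-in for RS(k, m), and the dict-backed "RPCs" are all assumptions for illustration.

```python
from functools import reduce

class Process:                       # stand-in for a data/parity process
    def __init__(self): self.blocks: dict[str, bytes] = {}
    def store(self, sid: str, blk: bytes): self.blocks[sid] = blk

def encode(blocks: list[bytes], m: int) -> list[bytes]:
    # Stand-in for RS(k, m) encoding: m copies of the XOR parity.
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)
    return [parity] * m

class BatchProcess:
    def __init__(self, k: int, m: int):
        self.k, self.m = k, m
        self.data = [Process() for _ in range(k)]
        self.parity = [Process() for _ in range(m)]
        self.buffer: list[tuple[str, bytes]] = []
        self.hash_table: dict[str, str] = {}          # key -> stripe id
        self.stripe_index: dict[str, list[str]] = {}  # stripe id -> keys
        self.next_id = 1

    def set(self, key: str, value: bytes) -> None:
        self.buffer.append((key, value))
        if len(self.buffer) == self.k:   # stripe is full: encode it
            self._flush()

    def _flush(self) -> None:
        sid = f"b{self.next_id}"; self.next_id += 1
        keys = [k for k, _ in self.buffer]
        blocks = [v for _, v in self.buffer]
        for proc, blk in zip(self.data, blocks):
            proc.store(sid, blk)                      # "send" data blocks
        for proc, par in zip(self.parity, encode(blocks, self.m)):
            proc.store(sid, par)                      # "send" parity blocks
        for key in keys:                              # metadata updates
            self.hash_table[key] = sid
        self.stripe_index[sid] = keys
        self.buffer.clear()

bp = BatchProcess(k=3, m=2)
for key, val in [("k1", b"v1"), ("k2", b"v2"), ("k3", b"v3")]:
    bp.set(key, val)                 # the third set() fills stripe b1
```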

SLIDE 18

Handle Read Requests

[Diagram: a client issues get(k1). The batch process looks up k1 in its hash table to find stripe b1, consults the stripe index, and fetches the value from the data process holding that block (get(b1)).]

Hash table:

Key | Stripe id
----|----------
k1  | b1
k2  | b1
k3  | b1
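In the failure-free case a read needs no decoding: resolve the key through the hash table and fetch a single block. This continues the hypothetical BatchProcess from the write-path sketch above.

```python
def get(bp: "BatchProcess", key: str) -> bytes:
    sid = bp.hash_table[key]                # hash table: key -> stripe id (b1)
    idx = bp.stripe_index[sid].index(key)   # block position within the stripe
    return bp.data[idx].blocks[sid]         # one fetch from one data process

assert get(bp, "k1") == b"v1"
```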

SLIDE 19

Recovery

[Diagram: a client's get(k1) arrives after a failure; the surviving processes still hold v2, v3, P1, and P2.]

1. Get the blocks of the stripe, identified by stripe id, from any k storage processes (e.g., v2, P1, P2) and decode; recover the requested data first.
2. Recover the remaining lost blocks.
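A standalone sketch of the degraded read: with the XOR stand-in (one parity's worth of information), a single lost block is the XOR of the parity and the surviving data blocks; a real RS(k, m) decoder rebuilds up to m lost blocks from any k of the k + m blocks.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def decode_missing(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild one lost data block from the surviving data blocks plus
    one parity block (XOR stand-in for the RS decoder)."""
    out = parity
    for s in survivors:
        out = xor(out, s)
    return out

# Stripe b1 held v1, v2, v3 with parity P = v1 ^ v2 ^ v3.
v1, v2, v3 = b"v1", b"v2", b"v3"
parity = xor(xor(v1, v2), v3)

# Data process 1 fails: a get(k1) decodes v1 first; the remaining lost
# blocks of other stripes are recovered the same way afterwards.
assert decode_missing([v2, v3], parity) == v1
```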

SLIDE 20

Outline

- Introduction and Motivation
- Our Design
- System and Implementation
- Evaluation

SLIDE 21

Evaluation

- Cluster configuration
  - 10 machines running SUSE Linux 11, each containing 12 AMD Opteron Processor 4180 CPUs
  - 1 Gb/s Ethernet
- Targets of comparison
  - In-place-update erasure coding (Cocytus [1])
  - Replication (Rep)
- Workload
  - YCSB with different key distributions
  - 50%:50% read/write ratio

[1] Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication, FAST '16

SLIDE 22

Bandwidth Cost

[Figure: bandwidth cost for different coding schemes. BCStore saves up to 51% of the bandwidth cost.]

SLIDE 23

Throughput

[Figure: throughput for different coding schemes. BCStore improves throughput by up to 2.4x.]

SLIDE 24

Memory

[Figure: memory consumption for different redundancy schemes. BCStore saves up to 41% of the memory cost.]

SLIDE 25

Latency

[Figures: read latency and write latency for the compared schemes.]

SLIDE 26

Conclusion

- Efficiency and availability are two crucial features for in-memory KV-stores
- We build BCStore, an in-memory KV-store that applies erasure coding for data availability
- We design a batch coding mechanism to achieve high bandwidth efficiency for write workloads
- We propose a heuristic garbage collection algorithm to improve memory efficiency

SLIDE 27

Thanks!

Q&A

SLIDE 28

Severity of Bandwidth Cost

- Write requests are prevalent in large-scale web services
  - Peak load can easily exhaust network bandwidth and degrade service performance
- The monetary cost of bandwidth can be several times higher
  - Especially under the commonly used peak-load pricing model
  - Bandwidth amplification becomes more serious as m (the number of parity servers) increases
- The bandwidth budget is usually limited in a workload-sharing cluster

Our goal: high memory efficiency and bandwidth efficiency

SLIDE 29

Our Design

- Batch write requests and append a new coding stripe

[Diagram: the same batch-coding illustration as Slide 9: obj4', obj8', obj3' are encoded into a new appended stripe, while their old copies among obj1-obj9 on data nodes 1-3 (parities on parity nodes 1-2) become invalid.]

SLIDE 30

Challenges

- Recycle the memory space of data blocks that are deleted or updated
  - Data blocks and parity blocks are appended to the storage
  - Updated blocks cannot be deleted directly
- Encode variable-sized data efficiently
  - Variable-sized data cannot be appended to previously allocated storage space directly

SLIDE 31

Garbage Collection

- Popularity-based data arrangement

[Diagram: batched objects are sorted into hot and cold groups before being striped across data nodes 1-3 and parity nodes 1-2.]

SLIDE 32

Encoding Variable-size Data

- Virtual coding stripes (vcs)
  - Each virtual coding stripe has a large fixed-length space and is aligned in the virtual address space (see the sketch below)

[Diagram: virtual coding stripes vcs1-vcs3 occupy aligned, fixed-size regions in the virtual space of data nodes 1-3 and parity nodes 1-2, while the physical space on each node holds only the bytes actually written.]
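A sketch of a virtual coding stripe under assumed parameters (VCS_SIZE and all names are ours): each stripe reserves the same fixed-length region on every node, so blocks coded together stay aligned even though individual values vary in size, while physical memory would back only the bytes actually written (e.g., via demand paging).

```python
VCS_SIZE = 4096      # fixed virtual length per node per stripe (assumed)

class VirtualStripe:
    def __init__(self, k: int):
        # One region per data node; a bytearray stands in for a virtual
        # mapping that consumes physical memory only when written.
        self.regions = [bytearray() for _ in range(k)]

    def append(self, node: int, value: bytes) -> int:
        """Append a variable-sized value into this node's region and
        return its offset; offsets are comparable across nodes because
        every region is aligned to the same fixed virtual length."""
        off = len(self.regions[node])
        if off + len(value) > VCS_SIZE:
            raise MemoryError("vcs full: open the next virtual stripe")
        self.regions[node] += value
        return off

    def coding_block(self, node: int) -> bytes:
        # Zero-pad to the fixed length so all k blocks encode together.
        return bytes(self.regions[node]).ljust(VCS_SIZE, b"\0")

vcs1 = VirtualStripe(k=3)
vcs1.append(0, b"short value")
vcs1.append(1, b"a much longer value than the first one")
```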

SLIDE 33

Bandwidth Cost

[Figure: bandwidth cost for a moderately skewed Zipfian workload, RS(3, 2).]

SLIDE 34

Throughput

[Figure: throughput for a moderately skewed Zipfian workload.]

SLIDE 35

Throughput

[Figure: throughput during recovery.]

SLIDE 36

In-place Update

- A traditional mechanism for coding small objects

[Diagram: objects obj1-obj9 striped across data nodes 1-3, with parity blocks on parity nodes 1-2.]

SLIDE 37

Garbage Collection

- How to further reduce block moves?
  - Intuition: make updates concentrate on a few stripes
- Popularity-based data arrangement

[Diagram: hot and cold objects separated into different stripes before GC, as on Slide 13.]

SLIDE 38

Bandwidth Analysis

- Theorem: GC bandwidth + coding bandwidth <= in-place update bandwidth

[Diagram: the worst case of GC bandwidth, with GC moving blocks between original and batched stripes across the data and parity nodes.]

SLIDE 39

Bandwidth Cost

[Figure: bandwidth cost at different throughput levels, RS(5, 4).]

SLIDE 40

Recovery

[Diagram: a standby batch process takes over after the primary batch process fails; metadata (M) is replicated between the two batch processes.]

1. Get the latest batch id.
2. Update the latest stable batch id and reconstruct the metadata from the replicated copy.
3. Serve requests.