SelectiveEC: Selective Reconstruction in Erasure-coded Storage Systems



SLIDE 1

SelectiveEC: Selective Reconstruction in Erasure-coded Storage Systems

Liangliang Xu, Min Lyu, Qiliang Li, Lingjiang Xie, and Yinlong Xu
University of Science and Technology of China
HotStorage 2020

SLIDE 2

Distributed Storage Systems (DSSes)

  • Data is important
    • Large scale
    • Exponential growth
  • DSSes are the core infrastructure
    • Thousands of nodes
    • “Fat node”
    • Up to 72 TB of storage (about 1.5M chunks) per node in Pangu [1]
  • Frequent failures
    • Disk faults, network failures, human errors, cluster crashes

[1] Dayu: Fast and Low-interference Data Recovery in Very-large Storage Systems. USENIX ATC 2019.

SLIDE 3

Erasure Coding (EC)

  • EC is widely adopted in DSSes
    • Provides high reliability with low storage cost
  • (k, m)-Reed-Solomon (RS) codes
    • k data chunks
    • m parity chunks
    • Tolerate any m node failures

[Figure: Writing a (3,2)-RS stripe. The client writes chunks D0, D1, D2, P0, P1 to Node0 through Node4, one chunk per node.]
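
The "low storage cost" claim can be checked with a little arithmetic. The sketch below compares the raw storage used per byte of user data by a (k, m)-RS stripe with that of plain replication tolerating the same number of failures. The function names are illustrative and do not come from the slides.

```python
from fractions import Fraction

def rs_storage_overhead(k: int, m: int) -> Fraction:
    """Raw bytes stored per byte of user data for a (k, m)-RS stripe."""
    return Fraction(k + m, k)

def replication_overhead(failures_tolerated: int) -> Fraction:
    """Replication keeps one extra full copy per tolerated failure."""
    return Fraction(failures_tolerated + 1, 1)

rs = rs_storage_overhead(3, 2)   # the (3,2)-RS code used throughout the talk
rep = replication_overhead(2)    # 3-way replication also survives 2 failures
print(f"(3,2)-RS: {float(rs):.2f}x, 3-way replication: {float(rep):.2f}x")
```

Both schemes survive any two node failures, but the RS stripe stores 5/3 of the data size while replication stores 3 times the data size.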

SLIDE 4

Reconstruction

[Figure: Reconstructing a chunk of a (3,2)-RS stripe. Chunks D0, D1, D2, P0, P1 are stored on Node0 through Node4.]

SLIDE 9

Reconstruction

Reconstructing a chunk of a (3,2)-RS stripe

[Figure: chunks D0, D1, D2, P0, P1 on Node0 through Node4; the lost chunk D0 is rebuilt on Node5 in four numbered steps]

  ① Reading chunks from source nodes
  ② Transferring data in network
  ③ Decoding
  ④ Writing decoded data
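
The four steps can be traced in a toy example. To stay short, the sketch below swaps in a single XOR parity chunk for Reed-Solomon decoding, so it tolerates only one failure. The node names and chunk contents are made up for illustration.

```python
def xor_chunks(chunks):
    """Byte-wise XOR of equally sized chunks."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

# Hypothetical cluster: node name -> stored chunk.
# Node3 holds the XOR parity of D0..D2 (a stand-in for RS parity).
data = [b"AAAA", b"BBBB", b"CCCC"]
nodes = {f"Node{i}": chunk for i, chunk in enumerate(data)}
nodes["Node3"] = xor_chunks(data)

lost = nodes.pop("Node0")            # Node0 fails, taking D0 with it

# Step 1: read chunks from the source nodes
sources = ["Node1", "Node2", "Node3"]
read = [nodes[name] for name in sources]
# Step 2: transfer them to the replacement node (modelled as a plain copy)
received = list(read)
# Step 3: decode; XOR of the surviving chunks yields the lost chunk
rebuilt = xor_chunks(received)
# Step 4: write the decoded chunk on the replacement node
nodes["Node5"] = rebuilt

print(rebuilt == lost)
```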

SLIDE 10

Breakdown of EC Reconstruction Time

  • Network transfer contributes most to the reconstruction time
  • Settings
    • 28 nodes: 1 NN + 27 DNs
    • Quad-core 3.4 GHz Intel Core i5-7500 CPU
    • 8 GB RAM
    • 1 TB HDD
    • 1 Gbps switch (30 MB/s, 90 MB/s, or 150 MB/s in Pangu [1])
    • 128 MB chunk size

Reconstructing a (3,2)-RS chunk in a 1 Gbps network

Stage                               Time ratio
Reading chunks from source nodes    0.68%
Transferring data in network        85.23%
Decoding                            7.82%
Writing decoded data                6.27%

[1] Dayu: Fast and Low-interference Data Recovery in Very-large Storage Systems. USENIX ATC 2019.
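
A back-of-the-envelope estimate suggests why the transfer stage dominates. The sketch below assumes that the replacement node's incoming link is the bottleneck and that the k = 3 chunks arrive at one fixed aggregate rate. The slides state neither assumption, and the rates are simply the three values quoted in the settings.

```python
CHUNK_MB = 128   # chunk size from the settings
K = 3            # chunks fetched to rebuild one (3,2)-RS chunk

def transfer_seconds(rate_mb_per_s: float, chunks: int = K,
                     chunk_mb: int = CHUNK_MB) -> float:
    """Time to pull `chunks` chunks into one replacement node at a fixed rate."""
    return chunks * chunk_mb / rate_mb_per_s

for rate in (30, 90, 150):   # MB/s values quoted in the settings
    print(f"{rate} MB/s: {transfer_seconds(rate):.2f} s to move {K * CHUNK_MB} MB")
```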

SLIDE 11

Random Data Layout

  • Random distribution
    • Load is balanced across a large number of stripes
  • Reconstruction proceeds batch by batch
    • Limited network, disk I/O, CPU, and memory resources
  • Optimal batch size
    • Number of live nodes
    • Detailed analysis in the paper
SLIDE 12

Random Data Layout

  • Nonuniform data layout in a batch
  • Unbalanced upstream bandwidth occupation

[Figure: Random data layout of (3,2)-RS stripes across Node0 through Node7]
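
The imbalance inside a single batch is easy to reproduce in a small simulation. The sketch below is an illustrative model and not the authors' simulator: each stripe is placed on random distinct nodes, node 0 is assumed to have failed, and each task reads from k randomly chosen survivors. The batch size equals the number of live nodes.

```python
import random
from collections import Counter

def random_batch_source_load(n_nodes=8, k=3, m=2, batch=7, failed=0, seed=1):
    """Count how many source reads each live node serves in one batch."""
    rng = random.Random(seed)
    live = [n for n in range(n_nodes) if n != failed]
    load = Counter({n: 0 for n in live})
    for _ in range(batch):
        # live nodes holding the k+m-1 surviving chunks of this stripe
        survivors = rng.sample(live, k + m - 1)
        # random choice of k source nodes among the survivors
        for node in rng.sample(survivors, k):
            load[node] += 1
    return load

load = random_batch_source_load()
ideal = sum(load.values()) / len(load)
print(dict(load), "max:", max(load.values()), "ideal per node:", ideal)
```

With 7 tasks and 7 live nodes, a perfectly balanced batch would have every node serve exactly 3 reads. Random selection usually leaves some nodes above that and others below.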

SLIDE 13

Random Data Layout

  • Nonuniform choices of replacement nodes
  • Unbalanced downstream bandwidth occupation

[Figure: Random data layout of (3,2)-RS stripes across Node0 through Node7]

SLIDE 14

Goals

  • Balanced distribution of source nodes

[Figure: Random data layout of (3,2)-RS stripes across Node0 through Node7]

SLIDE 15

Goals

  • Balanced distribution of source nodes
  • Balanced distribution of replacement nodes

[Figure: Random data layout of (3,2)-RS stripes across Node0 through Node7]

SLIDE 16

SelectiveEC

  • Schedule reconstruction tasks out of order
  • Select source nodes dynamically
  • Select replacement nodes dynamically

SLIDE 17

Graph Model

  • Bipartite graph Gs = (T ∪ N, E) for the selection of source nodes
    • T: reconstruction tasks, each with k+m-1 candidate source nodes
    • N: source nodes, i.e., all live nodes
    • (Ti, Nj) ∈ E iff source node Nj holds a chunk of stripe Ti
  • Connections between tasks and live nodes
    • Nonuniform distribution of chunks

[Figure: Gs = (T ∪ N, E) for (3, 2)-RS, with tasks on one side and source nodes on the other]
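
The definition of Gs translates directly into an adjacency map. The sketch below uses a hypothetical stripe placement; the stripe ids and node numbers are made up for illustration.

```python
def build_gs(stripes, failed_node):
    """Return the edges of Gs as {task: set of source nodes}.

    `stripes` maps a stripe id to the nodes holding its k+m chunks. Every
    stripe that lost a chunk on `failed_node` becomes a task, connected to
    the k+m-1 live nodes that still hold one of its chunks.
    """
    return {
        stripe: set(nodes) - {failed_node}
        for stripe, nodes in stripes.items()
        if failed_node in nodes
    }

# Hypothetical (3,2)-RS placement on 8 nodes; node 0 fails.
stripes = {
    "T1": [0, 1, 2, 3, 4],
    "T2": [0, 2, 3, 5, 6],
    "T3": [1, 2, 4, 6, 7],   # holds no chunk on node 0, so it is not a task
}
gs = build_gs(stripes, failed_node=0)
print(gs)
```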

SLIDE 18

Select k Source Nodes Dynamically

  • Goal: balance upstream bandwidth occupation
  • Use maximum flow to select k source nodes
    • Construct a flow graph FGs
    • Find a maximum flow
    • Maximum flow value = 17
  • No conflict in the chosen source connections
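
One way to realise this selection is sketched below with a plain Edmonds-Karp maximum flow. The capacities are assumptions, since the slides do not spell out FGs: the source feeds each task with capacity k, each task-to-node edge of Gs has capacity 1, and each node drains to the sink with a configurable `node_cap`. All function and parameter names are illustrative.

```python
from collections import deque, defaultdict

def max_flow(cap, s, t):
    """Edmonds-Karp on dict-of-dict capacities; returns (value, flow)."""
    res = defaultdict(lambda: defaultdict(int))   # residual capacities
    for u in cap:
        for v, c in cap[u].items():
            res[u][v] += c
    value = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:          # BFS for an augmenting path
            u = queue.popleft()
            for v, c in list(res[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:              # walk back from sink to source
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)    # bottleneck capacity
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push
        value += push
    flow = {u: {v: c - res[u][v] for v, c in cap[u].items() if c > res[u][v]}
            for u in cap}
    return value, flow

def select_sources(gs, k, node_cap):
    """Pick up to k source nodes per task; node_cap bounds reads per node."""
    cap = {"s": {}}
    for task, nodes in gs.items():
        cap["s"][("T", task)] = k
        cap[("T", task)] = {("N", n): 1 for n in nodes}
        for n in nodes:
            cap[("N", n)] = {"t": node_cap}
    value, flow = max_flow(cap, "s", "t")
    chosen = {task: sorted(n for _, n in flow[("T", task)]) for task in gs}
    return value, chosen

# Hypothetical Gs for two (3,2)-RS tasks; each node may serve one read.
gs = {"T1": {1, 2, 3, 4}, "T2": {2, 3, 5, 6}}
value, chosen = select_sources(gs, k=3, node_cap=1)
print(value, chosen)
```

A flow value equal to k times the number of tasks means every task obtained all k sources. A smaller value signals unsaturated tasks, which is what the out-of-order scheduling step addresses.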

SLIDE 19

Schedule Reconstruction Tasks Out of Order

  • Preparation work
  • Find the most unsaturated task
  • Compute an unsaturated list of source nodes
  • Task to be replaced: T7
  • Unsaturated list: N5, N6, N7
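
The preparation step can be sketched as follows. The definitions are assumptions made for illustration: a task counts as unsaturated when the flow gave it fewer than k sources, and a node counts as unsaturated when it serves fewer than `node_cap` reads. The assignment is hypothetical and was chosen to mirror the T7 and N5, N6, N7 example.

```python
def unsaturated(chosen, live_nodes, node_cap):
    """Return (most unsaturated task, sorted list of unsaturated nodes).

    `chosen` maps each task to the source nodes the maximum flow selected.
    """
    load = {n: 0 for n in live_nodes}
    for nodes in chosen.values():
        for n in nodes:
            load[n] += 1
    # the task with the fewest selected sources is the most unsaturated
    task = min(chosen, key=lambda t: len(chosen[t]))
    free_nodes = sorted(n for n, used in load.items() if used < node_cap)
    return task, free_nodes

# Hypothetical flow result with k = 3: T7 obtained only two sources.
chosen = {"T1": [1, 2, 3], "T2": [2, 3, 4], "T7": [1, 4]}
task, free_nodes = unsaturated(chosen, live_nodes=range(1, 8), node_cap=2)
print(task, free_nodes)
```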
SLIDE 20

Schedule Reconstruction Tasks Out of Order

  • Schedule reconstruction tasks
  • Scan the reconstruction queue
  • Find a new task
  • More connections with unsaturated list
  • Update FGs
  • Find a maximum flow

Maximum flow value = 19
Replace T7 with a new task
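
The queue scan can be sketched as a simple scoring pass. The queue contents and the tie-breaking rule (first in queue wins) are assumptions. The follow-up steps, updating FGs and recomputing the maximum flow, are left out.

```python
def pick_replacement(queue, unsaturated_nodes):
    """Return the queued task that touches the most unsaturated nodes.

    `queue` is a list of (task id, set of candidate source nodes) in arrival
    order. On ties, the task that appears first in the queue is returned.
    """
    free = set(unsaturated_nodes)
    best = max(queue, key=lambda item: len(item[1] & free))
    return best[0]

# Hypothetical waiting tasks, each with k+m-1 = 4 candidate source nodes.
queue = [
    ("T8", {1, 2, 3, 5}),
    ("T9", {2, 5, 6, 7}),
    ("T10", {1, 3, 4, 6}),
]
new_task = pick_replacement(queue, unsaturated_nodes=[5, 6, 7])
print(new_task)
```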

SLIDE 21

Schedule Reconstruction Tasks Out of Order

  • Schedule reconstruction tasks
  • Scan the reconstruction queue
  • Find a new task
  • More connections with unsaturated list
  • Update FGs
  • Find a maximum flow
  • Achieve more balanced upstream bandwidth occupation

SLIDE 22

Select Replacement Nodes Dynamically

  • Construct bipartite graph Gr for the selection of replacement nodes

  • Complement of Gs
  • Find a perfect matching
  • Easy to find in large-scale DSSes
  • Achieve load balance of replacement nodes
  • Balanced downstream bandwidth occupation
  • Balanced disk I/O, CPU and memory usage
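
A matching in the complement graph can be found with augmenting paths. The sketch below uses Kuhn's algorithm as one possible choice; the slides do not say which matching algorithm is used. It treats Gr as the bipartite complement of Gs, so a task may be rebuilt on any live node that holds no chunk of its stripe. The example graph is hypothetical.

```python
def match_replacements(gs, live_nodes):
    """Match each task to a distinct node that holds no chunk of its stripe."""
    # Gr: complement of Gs restricted to the live nodes
    gr = {t: [n for n in live_nodes if n not in nodes] for t, nodes in gs.items()}
    owner = {}                      # node -> task currently matched to it

    def try_assign(task, seen):
        for node in gr[task]:
            if node in seen:
                continue
            seen.add(node)
            # take a free node, or evict its owner if the owner can move elsewhere
            if node not in owner or try_assign(owner[node], seen):
                owner[node] = task
                return True
        return False

    for task in gr:
        try_assign(task, set())
    return {task: node for node, task in owner.items()}

# Hypothetical Gs for three (3,2)-RS tasks on live nodes 1..7.
gs = {"T1": {1, 2, 3, 4}, "T2": {2, 3, 5, 6}, "T3": {1, 4, 5, 7}}
replacement = match_replacements(gs, live_nodes=range(1, 8))
print(replacement)
```

Because each task is adjacent in Gs to only k+m-1 nodes, its complement neighbourhood grows with the cluster size. This is consistent with the remark that a perfect matching is easy to find in large-scale DSSes.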
SLIDE 23

Evaluation

  • Implement a simulation prototype of SelectiveEC
  • The simulations run on a server with
    • Two 12-core Intel Xeon E5-2650 processors
    • 64 GB DDR4 memory
    • Linux 3.10.0
  • (3,2)-RS stripes
  • Number of chunks in a “fat node”
    • 100 times the number of live nodes
  • DRP: the degree of recovery parallelism
SLIDE 24

The First Batch

  • At small scale, the DRP of SelectiveEC is always above 0.975
  • At large scale, SelectiveEC improves the DRP by up to 97.6%

[Figure panels: small scale, large scale]

SLIDE 25

Full Batches

  • DRP around 0.97 for SelectiveEC
  • DRP around 0.50 for random reconstruction
SLIDE 26

Summary

  • SelectiveEC, a balanced scheduling module
    • Schedule reconstruction tasks out of order
    • Select source nodes dynamically
    • Select replacement nodes dynamically
    • Effectively improve the load balance for single-failure recovery
  • Simulation results
    • Significantly improve the degree of recovery parallelism
  • Future work
    • Deploy in practical systems
    • Optimize the algorithms to support multiple failures
SLIDE 27

Thanks for your attention!

Q&A

Liangliang Xu @ USTC
llxu@mail.ustc.edu.cn