

SLIDE 1

A Tale of Two Erasure Codes in HDFS


Mingyuan Xia*, Mohit Saxena+ , Mario Blaum+, and David A. Pease+

*McGill University, +IBM Research Almaden

FAST’15

Presented by 何军权, 2015-04-30

SLIDE 2

Outline

• Introduction & Motivation
• Design
• Evaluation
• Conclusions
• Related Work

SLIDE 3

Introduction & Motivation

SLIDE 4

Big Data Storage

• Reliability and Availability
  • Replication: 3-way replication
  • Erasure Code: Reed-Solomon (RS), LRC

System    Code                 Overhead   Year
GFS       3-way replication    3x         2003
FB HDFS   RS                   1.4x       2011
GFS v2    RS                   1.5x       2012
Azure     LRC                  1.33x      2012
FB HDFS   LRC                  1.66x      2013

SLIDE 5

Popular Erasure Code Families

• Product Code (PC)
• Local Reconstruction Code (LRC)
• Other

[Figure: example block layouts for Reed-Solomon (RS), Product Code (PC), and Local Reconstruction Code (LRC)]

SLIDE 6

Erasure Code

• Facebook HDFS RS(10,4)
  • Computes 4 parity blocks per 10 data blocks
  • All blocks are stored on different storage nodes
  • Storage overhead: 1.4x

[Figure: RS(10,4) stripe with data blocks D1-D10 and parity blocks P1-P4]
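As a rough sketch (not code from the paper), the overhead and recovery-cost figures above follow directly from the code parameters; `rs_storage_overhead` and `rs_recovery_cost` are hypothetical helper names:

```python
# Hypothetical helpers: storage overhead and recovery cost for a
# Reed-Solomon code RS(k, m) with k data blocks and m parity blocks.

def rs_storage_overhead(k: int, m: int) -> float:
    # Every k data blocks are stored alongside m parity blocks.
    return (k + m) / k

def rs_recovery_cost(k: int, m: int) -> int:
    # Rebuilding any single lost block requires reading k surviving blocks.
    return k

print(rs_storage_overhead(10, 4))  # Facebook HDFS RS(10,4) -> 1.4
print(rs_recovery_cost(10, 4))     # 10 blocks read per reconstruction
```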

SLIDE 7

Erasure Code

• High degraded read latency
  • A read to an unavailable block requires multiple disk reads, network transfers, and compute cycles to decode

[Figure: a client read to HDFS hits an exception and triggers a degraded read]

SLIDE 8

Erasure Code

• Long reconstruction time
  • Facebook's cluster:
    • 100K blocks lost per day
    • 50 machine-unavailability events per day
    • Reconstruction traffic: 180TB per day

[Figure: HDFS launches a background reconstruction job after failures]

SLIDE 9

Erasure Code

Recovery cost: the total number of blocks required to reconstruct a data block after a failure. It drives both the degraded read latency and the reconstruction time.

SLIDE 10

Recovery Cost vs. Storage Overhead

• Conclusion
  • With a single erasure code, storage overhead and recovery cost are a tradeoff: lowering one raises the other.

[Chart: recovery cost vs. storage overhead for GFS 3-way replication, FB HDFS RS, GFS v2 RS, Azure LRC, and FB HDFS LRC]

SLIDE 11

How to balance storage overhead and recovery cost?

SLIDE 12

Data Access Skew

• Conclusions
  • Only a small fraction of the data is "hot": P(freq > 10) ≈ 1%
  • Most data is "cold": P(freq <= 10) ≈ 99%

SLIDE 13

Data Access Skew

• Hot data
  • High access frequency
  • A small fraction of the data
• Cold data
  • Low access frequency
  • The major fraction of the data

A small read improvement on hot data yields a large overall read-performance gain; storing cold data slightly more compactly saves a large amount of space.

Hot data: decrease the recovery cost. Cold data: maximize storage efficiency.
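A back-of-the-envelope sketch of why this split pays off. The 1% hot fraction is the skew figure from the previous slide; the 1.8x/1.4x overheads and 2/5 recovery costs are the PC(2x5)/PC(6x5) numbers from the later slides. The weighting itself is an illustration, not a result from the paper:

```python
# Expected storage overhead when hot data uses the fast code and
# cold data uses the compact code (assumed fractions and PC figures).
hot, cold = 0.01, 0.99
fast = {"overhead": 1.8, "recovery": 2}
compact = {"overhead": 1.4, "recovery": 5}

# Storage is dominated by cold data, so the blended overhead stays
# close to the compact code's 1.4x ...
overhead = hot * fast["overhead"] + cold * compact["overhead"]
print(round(overhead, 3))  # -> 1.404

# ... while reads concentrate on hot data, which enjoys the fast
# code's low recovery cost of 2.
```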

SLIDE 14

HACFS

• System State
  • Tracks file states: file size, last mTime, read count, and coding state
• Adaptive Coding
  • Tracks system states
  • Chooses the coding scheme based on read count and mTime
• Erasure Coding
  • Provides four coding interfaces: Encode/Decode, Upcode/Downcode
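A minimal sketch of what the four coding interfaces might look like; only the four operation names come from the slide, the class name and signatures are placeholders:

```python
# Hypothetical interface for an HACFS-style erasure-coding module.
# Concrete fast/compact codes (PC, LRC) would subclass this.
class ErasureCodingScheme:
    def encode(self, data_blocks):
        """Compute parity blocks for a group of data blocks."""
        raise NotImplementedError

    def decode(self, surviving_blocks):
        """Reconstruct a lost block from surviving blocks."""
        raise NotImplementedError

    def upcode(self, fast_groups):
        """Convert fast-code groups into one compact-code group."""
        raise NotImplementedError

    def downcode(self, compact_group):
        """Split a compact-code group back into fast-code groups."""
        raise NotImplementedError
```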

SLIDE 15

Erasure Coding Algorithms

• Two different erasure codes
  • Fast code:
    • Encodes the frequently accessed blocks to reduce read latency and reconstruction time
    • Provides overall low recovery cost
  • Compact code:
    • Encodes the less frequently accessed blocks to achieve low storage overhead
    • Maintains a low and bounded storage overhead

SLIDE 16

State Transition

States: 3-way replication, Fast Code, Compact Code

  • Recently created files are 3-way replicated.
  • When a file becomes write-cold, it is erasure-coded with the fast code.
  • COND (read hot and storage bound satisfied): Compact Code -> Fast Code (downcode).
  • COND' (read cold or storage bound exceeded): Fast Code -> Compact Code (upcode).
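The transitions above can be sketched as a small state function. The state names and the COND/COND' predicates mirror the slide; the function itself is illustrative, not HACFS code:

```python
# Toy model of the HACFS file-state machine. read_hot and bounded are
# abstract predicates (read count above threshold; storage overhead
# within the system bound) -- no concrete thresholds are assumed here.
def next_state(state, read_hot, bounded, write_cold=False):
    if state == "replicated" and write_cold:
        return "fast"       # write-cold file gets erasure-coded (fast code)
    if state == "fast" and (not read_hot or not bounded):
        return "compact"    # COND': read cold or bound exceeded -> upcode
    if state == "compact" and read_hot and bounded:
        return "fast"       # COND: read hot and bound satisfied -> downcode
    return state            # otherwise stay put

print(next_state("fast", read_hot=False, bounded=True))   # -> compact
print(next_state("compact", read_hot=True, bounded=True)) # -> fast
```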

SLIDE 17

Fast and Compact Product Codes(1)

[Figure: Fast Code PC(2x5) layout]
Fast Code (Product Code 2x5): storage overhead 1.8x, recovery cost 2

[Figure: Compact Code PC(6x5) layout]
Compact Code (Product Code 6x5): storage overhead 1.4x

  • ha1 = RS(a0,a1,a2,a3,a4)
  • Pa0 = XOR(a0,a5)
SLIDE 18

Fast and Compact Product Codes(2)

[Figure: Fast Code PC(2x5) layout]
Fast Code (Product Code 2x5): storage overhead 1.8x, recovery cost 2

[Figure: Compact Code PC(6x5) layout]
Compact Code (Product Code 6x5): storage overhead 1.4x, recovery cost 5

  • P0 = XOR(a0,a5,b0,b5,c0,c5)
  • ha1 = RS(a0,a1,a2,a3,a4)
  • Pa0 = XOR(a0,a5)
SLIDE 19

Fast and Compact LRC(1)

[Figure: Fast Code LRC(12,6,2) layout with global parities G1, G2 and local parities L0-L5]
Fast Code (LRC(12,6,2)): storage overhead 20/12 ≈ 1.67x, recovery cost 2
  • {G1,G2} = RS(a0,a1,...,a11)
  • Li = XOR(ai, ai+6)

[Figure: Compact Code LRC(12,2,2) layout with global parities G1, G2 and local parities L0, L1]
Compact Code (LRC(12,2,2)): storage overhead 16/12 ≈ 1.33x, recovery cost 6
  • {G1,G2} = RS(a0,a1,...,a11)
  • Li = RS'(a0, a1, a2, a6, a7, a8)
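The fast code's recovery cost of 2 can be checked with plain XOR arithmetic. This is an illustrative sketch with toy 4-byte blocks, not HDFS code:

```python
# Local repair in the fast LRC(12,6,2): each local parity is
# L_i = a_i XOR a_{i+6}, so a lost a0 is rebuilt from just L0 and a6
# (recovery cost 2), with no global parity involved.

def xor(*blocks):
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

a = [bytes([i] * 4) for i in range(12)]   # 12 toy data blocks a0..a11
L0 = xor(a[0], a[6])                      # local parity over {a0, a6}

recovered_a0 = xor(L0, a[6])              # read only 2 blocks to repair a0
print(recovered_a0 == a[0])               # -> True
```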

SLIDE 20

Upcoding for Product Codes

[Figure: three Fast Code PC(2x5) groups (a, b, c) merged into one Compact Code PC(6x5) group]

  • Row parities h require no recomputation
  • Column parities P require no data-block transfer (they are computed from the old Pa/Pb/Pc parities)
  • All parity updates can be done in parallel
SLIDE 21

Downcoding for Product Codes

[Figure: one Compact Code PC(6x5) group split back into three Fast Code PC(2x5) groups (a, b, c)]

  • Pa0=XOR(a0,a5)
  • Pc0=XOR(P0,Pa0,Pb0)
SLIDE 22

Evaluation

• Platform
  • CPU: Intel Xeon E5645, 24 cores, 2.4GHz
  • Disk: 6 x 2TB, 7.2K RPM
  • Memory: 96GB
  • Network: 1Gbps NIC
  • Cluster size: 11 nodes
• Workload
  • CC: Cloudera Customer; FB: Facebook

SLIDE 23

Evaluation Metrics

• Degraded read latency
  • Foreground read request latency
• Reconstruction time
  • Background recovery from failures
• Storage overhead

SLIDE 24

Degraded Read Latency

  • Production systems: 16-21 seconds
  • HACFS: 10-14 seconds

The storage overhead of HACFS-LRC and HACFS-PC was bounded to 1.4x and 1.5x, respectively.

SLIDE 25

Reconstruction Time

  • A disk with 100GB of data failed
  • HACFS-PC takes about 10-35 minutes less than the production systems
  • HACFS-LRC is worse than RS(6,3) in GFS v2
    • To reconstruct a global parity, HACFS-LRC needs to read 12 blocks, while GFS v2 reads only 6

SLIDE 26

System Comparison

  • Colossus FS: RS(6,3), 1.5x
  • HDFS-RAID: RS(10,4), 1.4x
  • Azure: LRC(12,2,2), 1.33x
  • HACFS-PC: fast PC(2x5), 1.8x; compact PC(6x5), 1.4x
  • HACFS-LRC: fast LRC(12,6,2), 1.67x; compact LRC(12,2,2), 1.33x

SLIDE 27

System Comparison (recovery cost by lost block type)

lost block type   HACFS-PC            HACFS-LRC            Colossus FS   HDFS-RAID   Azure
data block        fast: 2, comp: 5    fast: 2, comp: 6     6             10          6
global parity     fast: 5, comp: 6    fast: 12, comp: 12   6             10          12


SLIDE 29

Conclusions

  • Erasure coding saves a large amount of storage space.
  • Production systems that use a single erasure code cannot balance the tradeoff between recovery cost and storage overhead well.
  • By adapting its coding dynamically, HACFS provides both low recovery cost and low storage overhead.

SLIDE 30

Related Work

  • f4 (OSDI'14)
    • Divides data into cold and hot by age
  • XOR-based Erasure Code (FAST'12)
    • Combines RS with XOR
  • Minimum-Storage-Regeneration (MSR)
    • Minimizes network transfers during reconstruction
  • Product-Matrix-Reconstruct-By-Transfer (PM-RBT, FAST'15)
    • Optimal in terms of I/O, storage, and network bandwidth

SLIDE 31

Thank You!

SLIDE 32

Acknowledgment

  • Prof. Xiong
  • Zigang Zhang
  • Biao Ma

CAS - ICT - Storage System Group