Research on Efficient Erasure-Coding-Based Cluster Storage Systems



SLIDE 1


Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Patrick P. C. Lee

The Chinese University of Hong Kong

NCIS’14

Joint work with Runhui Li, Jian Lin, Yuchong Hu

SLIDE 2

Motivation

  • Clustered storage systems are widely deployed to provide scalable storage by striping data across multiple nodes

  • e.g., GFS, HDFS, Azure, Ceph, Panasas, Lustre, etc.
  • Failures are common


[Figure: storage nodes connected over a LAN, with data striped across them]

SLIDE 3

Failure Types

  • Temporary failures
  • Nodes are temporarily inaccessible (no data loss)
  • 90% of failures in practice are transient [Ford, OSDI’10]
  • e.g., power loss, network connectivity loss, CPU overloading, reboots, maintenance, upgrades
  • Permanent failures
  • Data is permanently lost
  • e.g., disk crashes, latent sector errors, silent data corruptions, malicious attacks


SLIDE 4

Replication vs. Erasure Coding

  • Solution: Add redundancy:
  • Replication
  • Erasure coding
  • Enterprises (e.g., Google, Azure, Facebook) move to erasure coding to reduce the storage footprint under explosive data growth

  • e.g., 3-way replication has 200% overhead; erasure coding can reduce the overhead to 33% [Huang, ATC’12] (see the sketch below)
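For concreteness, a minimal sketch of the arithmetic behind these numbers; the (16, 12) layout below is only an illustrative choice, not necessarily the code used in [Huang, ATC’12]:

# Redundancy overhead = extra storage / original data.
# 3-way replication keeps 3 copies of every block: overhead = 2x = 200%.
# An (n, k) erasure code stores n chunks for every k data chunks:
# overhead = (n - k) / k, e.g., 33% for the illustrative (16, 12) layout.
def replication_overhead(copies: int) -> float:
    return copies - 1

def erasure_overhead(n: int, k: int) -> float:
    return (n - k) / k

print(f"3-way replication: {replication_overhead(3):.0%}")    # 200%
print(f"(n, k) = (16, 12): {erasure_overhead(16, 12):.0%}")   # 33%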


SLIDE 5

Background: Erasure Coding

  • Divide a file into k data chunks (each with multiple blocks)
  • Encode the data chunks into n-k additional parity chunks
  • Distribute the data/parity chunks to n nodes
  • Fault tolerance: any k out of the n nodes can recover the file data


[Figure: a file is divided into data chunks A, B, C, D and encoded into parity chunks A+C, B+D, A+D, B+C+D; with (n, k) = (4, 2), the chunks are distributed across 4 nodes: Node 1 = {A, B}, Node 2 = {C, D}, Node 3 = {A+C, B+D}, Node 4 = {A+D, B+C+D}]
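A minimal runnable sketch of this (n, k) = (4, 2) layout, using XOR ("+" in the figure) over toy byte chunks; the node-to-chunk assignment follows the figure:

# Sketch of the (n, k) = (4, 2) layout from the figure, using XOR over bytes.
# The file is split into data chunks A, B, C, D; Nodes 1-2 hold data chunks
# and Nodes 3-4 hold parity chunks. Any 2 of the 4 nodes suffice to rebuild
# the whole file.
def xor(*chunks: bytes) -> bytes:
    out = bytearray(len(chunks[0]))
    for c in chunks:
        for i, byte in enumerate(c):
            out[i] ^= byte
    return bytes(out)

A, B, C, D = b"AAAA", b"BBBB", b"CCCC", b"DDDD"   # toy 4-byte chunks

nodes = {
    1: {"A": A, "B": B},
    2: {"C": C, "D": D},
    3: {"A+C": xor(A, C), "B+D": xor(B, D)},
    4: {"A+D": xor(A, D), "B+C+D": xor(B, C, D)},
}

# Example: recover the whole file from Nodes 2 and 3 only.
n2, n3 = nodes[2], nodes[3]
rec_A = xor(n3["A+C"], n2["C"])          # (A+C) + C = A
rec_B = xor(n3["B+D"], n2["D"])          # (B+D) + D = B
assert (rec_A, rec_B, n2["C"], n2["D"]) == (A, B, C, D)
print("recovered file:", rec_A + rec_B + n2["C"] + n2["D"])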

SLIDE 6

Erasure Coding

  • Pros:
  • Reduce storage space with high fault tolerance
  • Cons:
  • Data chunk updates imply parity chunk updates → expensive updates
  • In general, k chunks are needed to recover a single lost chunk → expensive recovery

  • Our talk: Can we improve recovery of erasure-coding-based clustered storage systems, while preserving storage efficiency?


SLIDE 7

Our Work

  • CORE [MSST’13, TC]
  • Augments existing regenerating codes to support both optimal single and concurrent failure recovery

  • Degraded-read scheduling [DSN’14]
  • Improves MapReduce performance in failure mode
  • Designed, implemented, and experimented on the Hadoop Distributed File System


SLIDE 8

Recover a Failure

  • Q: Can we minimize recovery traffic?


  • Conventional recovery: download data from any k nodes
  • Recovery traffic = M (the size of the whole file)

[Figure: repairing Node 1 under (n, k) = (4, 2): the repaired node downloads C, D from Node 2 and A+C, B+D from Node 3, then rebuilds A and B; the total download equals the file size M]

slide-9
SLIDE 9

Regenerating Codes

[Dimakis et al., ToIT’10]

[Figure: repairing Node 1 under a regenerating code with (n, k) = (4, 2): Node 2 sends C, Node 3 sends A+C, and Node 4 sends the encoded chunk A+B+C = (A+D) + (B+C+D); the repaired node rebuilds A and B from these three chunks. Recovery traffic = 0.75M for a file of size M]

  • Repair in regenerating codes:
  • Each surviving node encodes its own chunks (network coding)
  • The repaired node downloads one encoded chunk from each surviving node (see the sketch below)
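A small sketch of the repair-traffic arithmetic. The regenerating-code figure is assumed here to follow the classic minimum-storage repair bandwidth with n-1 helpers, M(n-1)/(k(n-k)); that assumption is not stated on the slide, but it reproduces the 0.75M in the figure:

# Single-failure repair traffic as a fraction of the file size M.
# Conventional erasure-code recovery downloads k nodes' worth of data,
# i.e., the whole file (1.0 M). For regenerating codes that contact all
# n-1 surviving nodes, the classic minimum-storage repair bandwidth is
# M * (n-1) / (k * (n-k)) [Dimakis et al.]; assumed here, and it matches
# the 0.75M shown in the figure for (n, k) = (4, 2).
def conventional_repair(n: int, k: int) -> float:
    return 1.0

def regenerating_repair(n: int, k: int) -> float:
    return (n - 1) / (k * (n - k))

for n, k in [(4, 2), (12, 6), (16, 8), (20, 10)]:
    print(f"(n, k) = ({n}, {k}): conventional = 1.00M, "
          f"regenerating = {regenerating_repair(n, k):.2f}M")
# (4, 2)   -> 0.75M (the figure above)
# (20, 10) -> 0.19M, roughly the ~80% single-failure saving cited later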
SLIDE 10

Concurrent Node Failures

  • Regenerating codes are designed only for single-failure recovery
  • Optimal regenerating codes collect data from n-1 surviving nodes for single-failure recovery

  • Correlated and co-occurring failures are possible
  • In clustered storage [Schroeder, FAST’07; Ford, OSDI’10]
  • In dispersed storage [Chun, NSDI’06; Shah, NSDI’06]
  • CORE augments regenerating codes for optimal concurrent failure recovery

  • Retains regenerating code construction


SLIDE 11

CORE’s Idea

  • Consider a system with n nodes
  • Two functions for regenerating codes in single-failure recovery:

  • Enc: storage node encodes data
  • Rec: reconstruct lost data using encoded data from n-1 surviving nodes

  • t-failure recovery (t > 1):
  • Reconstruct each failed node as if the other n-1 nodes were surviving nodes


SLIDE 12

Example

[Figure: CORE layout with n = 6 nodes; Node i (i = 0, ..., 5) stores three chunks Si,0, Si,1, Si,2]

  • Setting: n=6, k=3
  • Suppose now Nodes 0 and 1 fail
  • Recall that optimal regenerating codes collect data from n-1 surviving nodes for single-failure recovery

  • How does CORE work?
SLIDE 13

Example

s0,0, s0,1, s0,2 = Rec0(e1,0, e2,0, e3,0, e4,0, e5,0)
e0,1 = Enc0,1(s0,0, s0,1, s0,2) = Enc0,1(Rec0(e1,0, e2,0, e3,0, e4,0, e5,0))

s1,0, s1,1, s1,2 = Rec1(e0,1, e2,1, e3,1, e4,1, e5,1)
e1,0 = Enc1,0(s1,0, s1,1, s1,2) = Enc1,0(Rec1(e0,1, e2,1, e3,1, e4,1, e5,1))

[Figure: each surviving node i (i = 2, ..., 5) encodes its stored chunks into ei,0 (for Node 0's repair) and ei,1 (for Node 1's repair); the chunks e0,1 and e1,0 that the failed nodes would have contributed to each other's repair are the unknowns]

SLIDE 14

Example

  • We have two equations

e0,1 = Enc0,1(Rec0(e1,0, e2,0, e3,0, e4,0, e5,0))
e1,0 = Enc1,0(Rec1(e0,1, e2,1, e3,1, e4,1, e5,1))

  • Trick: they form a linear system of equations
  • If the equations are linearly independent, we can solve for e0,1 and e1,0 (see the sketch below)

  • Then we obtain the lost data by:

s0,0, s0,1, s0,2 = Rec0(e1,0, e2,0, e3,0, e4,0, e5,0)
s1,0, s1,1, s1,2 = Rec1(e0,1, e2,1, e3,1, e4,1, e5,1)
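A schematic sketch of this step. Real regenerating codes work over a finite field, but the structure of the argument is the same: because Enc and Rec are linear, each unknown cross-chunk can be written as a known constant plus a known coefficient times the other unknown, giving a small linear system. The coefficients below are made up purely for illustration:

# Schematic sketch (not CORE's actual field arithmetic). Each virtual chunk
# that a failed node "would have sent" is linear in the other unknown:
#   e01 = a0 + b0 * e10   (from e0,1 = Enc0,1(Rec0(e1,0, e2,0, ..., e5,0)))
#   e10 = a1 + b1 * e01   (from e1,0 = Enc1,0(Rec1(e0,1, e2,1, ..., e5,1)))
# a0, a1 fold in the chunks actually downloaded from surviving Nodes 2-5;
# b0, b1 are coefficients fixed by the code. Values here are illustrative.
import numpy as np

a0, b0 = 3.0, 0.5
a1, b1 = 1.0, 0.25

# Rearranged as a 2x2 linear system:
#    e01 - b0*e10 = a0
#   -b1*e01 + e10 = a1
A = np.array([[1.0, -b0],
              [-b1, 1.0]])
rhs = np.array([a0, a1])

if abs(np.linalg.det(A)) > 1e-9:      # linearly independent: "good" pattern
    e01, e10 = np.linalg.solve(A, rhs)
    print(e01, e10)                    # feed into Rec0 / Rec1 to rebuild data
else:                                  # singular system: "bad" failure pattern
    print("no unique solution: enlarge the failure pattern by one more node")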


SLIDE 15

Bad Failure Pattern

  • A system of equations may not have a unique solution; we call this a bad failure pattern
  • Bad failure patterns account for less than ~1% of all failure patterns
  • Our idea: reconstruct data by adding one more node to bypass the bad failure pattern

  • Suppose Nodes 0 and 1 form a bad failure pattern and Nodes 0, 1, 2 form a good failure pattern; then reconstruct the lost data for Nodes 0, 1, 2

  • This still achieves bandwidth savings over conventional recovery


SLIDE 16

Bandwidth Saving

  • Bandwidth ratio: ratio of CORE's recovery bandwidth to conventional recovery bandwidth

[Figure: bandwidth ratio (CORE to conventional) versus number of failed nodes t, for good failure patterns (t = 1, ..., 10) and bad failure patterns (t = 2, ..., 9), under (n, k) = (12, 6), (16, 8), and (20, 10)]

  • The bandwidth saving of CORE is significant (see the sketch below)
  • e.g., (n, k) = (20, 10)
  • Single failure: ~80% saving
  • 2-4 concurrent failures: 36-64% saving
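The savings quoted above can be reproduced with a small sketch, assuming the good-pattern recovery traffic for t failures is M·t(n-t)/(k(n-k)), i.e., the natural t-failure generalization of the single-failure regenerating-code bandwidth. That formula is an assumption here, chosen because it matches the slide's numbers:

# Bandwidth ratio (CORE / conventional) for a good failure pattern with t
# failed nodes. Assumption: CORE's traffic is M * t * (n - t) / (k * (n - k));
# conventional recovery reads k chunks' worth (one file, M) regardless of t.
def bandwidth_ratio(n: int, k: int, t: int) -> float:
    core = t * (n - t) / (k * (n - k))
    conventional = 1.0
    return core / conventional

n, k = 20, 10
for t in range(1, 5):
    ratio = bandwidth_ratio(n, k, t)
    print(f"t={t}: ratio={ratio:.2f}, saving={1 - ratio:.0%}")
# t=1 -> ~80% saving; t=2, 3, 4 -> 64%, 49%, 36% (the 36-64% range above)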


SLIDE 17

Theorem

  • Theorem: CORE, which builds on regenerating codes for single failure recovery, achieves the lower bound of recovery bandwidth if we recover a good failure pattern with t ≥ 1 failed nodes

  • Over 99% of failure patterns are good


SLIDE 18

CORE Implementation

[Figure: CORE implementation on HDFS: a RaidNode works alongside the Namenode and Datanodes, whose blocks are processed by the CORE Encoder/Decoder]

  • 1. Identify corrupted blocks
  • 2. Send recovered blocks


SLIDE 19

CORE Implementation

  • Parallelization
  • Erasure coding implemented in C++
  • Executed through JNI


SLIDE 20

Experiments

  • Testbed:
  • 1 namenode and up to 20 datanodes
  • Quad-core 3.1GHz CPU, 8GB RAM, 7200RPM SATA hard disk, 1Gbps Ethernet

  • Coding schemes:
  • Reed-Solomon codes vs. CORE (interference alignment codes)



SLIDE 21

Decoding Throughput

  • Evaluate computational performance:

  • Assume a single failure (t=1)
  • Surviving data for recovery is first loaded into memory

  • Decoding throughput: ratio of the size of recovered data to the decoding time

  • CORE (regenerating codes) achieves ≥ 500MB/s at a packet size of 8KB


SLIDE 22

Recovery Throughput

  • CORE shows significantly higher throughput
  • e.g., in (20, 10), the gain is 3.45x for a single failure, 2.33x for two failures, and 1.75x for three failures

[Figure: recovery throughput (MB/s) of CORE vs. Reed-Solomon (RS) codes for t = 1, 2, 3 failures under (12, 6), (16, 8), and (20, 10)]


SLIDE 23

MapReduce

  • Q: How does erasure-coded storage affect data analytics?

  • Traditional MapReduce is designed with replication storage in mind

  • To date, no explicit analysis of MapReduce on erasure-coded storage

  • Failures trigger degraded reads in erasure coding


SLIDE 24

MapReduce

  • MapReduce idea:
  • Map tasks process blocks and generate intermediate results
  • Reduce tasks collect intermediate results and produce final output
  • Constraint: cross-rack network resource is scarce


WordCount example:

[Figure: map tasks on Slaves 0-2 process local blocks containing words A, B, C and emit intermediate pairs such as <A,1>; the shuffle delivers the pairs to reduce tasks, which output <A,2>, <B,2>, <C,2>]
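A minimal in-memory sketch of the WordCount flow in the figure; the exact block-to-slave assignment is an illustrative guess consistent with the figure's output:

# Map tasks emit <word, 1> pairs from their local blocks, the shuffle groups
# pairs by key, and reduce tasks sum the counts.
from collections import defaultdict

blocks = {"Slave 0": ["A", "B"], "Slave 1": ["C", "A"], "Slave 2": ["C", "B"]}

# Map phase: one map task per block, emitting intermediate <word, 1> pairs.
intermediate = [(word, 1) for block in blocks.values() for word in block]

# Shuffle: group the intermediate pairs by key across the cluster.
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
final = {word: sum(counts) for word, counts in grouped.items()}
print(final)   # {'A': 2, 'B': 2, 'C': 2}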

SLIDE 25

MapReduce on Erasure Coding

  • Show that default scheduling hurts MapReduce performance on erasure-coded storage

  • Propose Degraded-First Scheduling for MapReduce task-level scheduling

  • Improves MapReduce performance on erasure-coded storage in failure mode


SLIDE 26

Default Scheduling in MapReduce

  • Locality-first scheduling: the master gives first priority to assigning a local task to a slave


while a heartbeat from slave s arrives do
    for job in jobQueue do
        if job has a local task on s then
            assign the local task
        else if job has a remote task then
            assign the remote task
        else if job has a degraded task then
            assign the degraded task
        endif
    endfor
endwhile

Remote task: processing a block stored in another rack. Degraded task: processing an unavailable block in the system.

SLIDE 27

Locality-First in Failure Mode

[Figure: timeline of locality-first scheduling in failure mode. Data blocks B and parity blocks P are distributed over slaves S0-S4 across two racks (ToR switches under a core switch), and slave S0 is unavailable. Slaves S1-S4 first process their local blocks; the degraded tasks (downloads of P0,0, P1,0, P2,0, P3,0 followed by processing of the unavailable blocks) are all launched near the end and compete for cross-rack bandwidth, so the map phase finishes at about 40s]

SLIDE 28

Problems & Intuitions

  • Problems:
  • Degraded tasks are launched simultaneously
  • They start degraded reads together and compete for network resources
  • While local tasks are processed, the network resource is underutilized

  • Intuitions: ensure degraded tasks aren’t running together to compete for network resources

  • Finish running degraded tasks before all local tasks
  • Keep degraded tasks separated


SLIDE 29

Degraded-First in Failure Mode

[Figure: the same workload under degraded-first scheduling: degraded tasks (downloads of P0,0, P1,0, P2,0, P3,0 and processing of the unavailable blocks) are interleaved with local tasks throughout the map phase, so cross-rack downloads overlap with local processing; the map phase finishes at about 30s instead of 40s, a 25% saving]

SLIDE 30

Degraded-First Scheduling


while a heartbeat from slave s arrives do
    if n/N ≥ n_e/N_e and the job has a degraded task then
        assign the degraded task
    endif
    assign other map slots as in locality-first scheduling
endwhile

  • Idea: launch a degraded task only if the fraction of degraded tasks already launched (n_e / N_e) is no more than the fraction of all map tasks already launched (n / N)

  • Controls the pace of launching degraded tasks
  • Keeps degraded tasks separated throughout the whole map phase (see the sketch below)
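A minimal sketch of this launch rule, assuming the scheduler keeps four counters (N and N_e for the job's total and degraded map tasks, n and n_e for those launched so far); the class and method names are hypothetical, only the condition follows the pseudocode above:

# Degraded-first launch rule: launch a degraded task only if the fraction of
# degraded tasks already launched (n_e/N_e) has not caught up with the
# fraction of all map tasks already launched (n/N), i.e., n*N_e >= n_e*N.
class DegradedFirstScheduler:
    def __init__(self, total_tasks: int, degraded_tasks: int):
        self.N, self.Ne = total_tasks, degraded_tasks
        self.n, self.ne = 0, 0                    # launched so far

    def should_launch_degraded(self) -> bool:
        return self.Ne > 0 and self.n * self.Ne >= self.ne * self.N

    def on_heartbeat(self, has_degraded_task: bool, assign_other) -> str:
        if has_degraded_task and self.should_launch_degraded():
            self.n += 1
            self.ne += 1
            return "degraded"
        self.n += 1
        return assign_other()                     # fall back to locality-first

# Example: 12 map tasks, 3 of them degraded; degraded launches land at slots
# 0, 4, and 8, i.e., paced evenly instead of bunched at the end.
sched = DegradedFirstScheduler(total_tasks=12, degraded_tasks=3)
for slot in range(12):
    still_degraded = sched.ne < sched.Ne          # any degraded tasks left?
    print(slot, sched.on_heartbeat(still_degraded, lambda: "local/remote"))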
SLIDE 31

Example

[Figure: step-by-step schedule under degraded-first scheduling on slaves S1-S3: degraded tasks (Download P0,0 then Process B0,0, Download P1,0 then Process B1,0, Download P2,0 then Process B2,0) are spread across the map phase and interleaved with local tasks, steps (1)-(12)]

Degraded tasks are well separated

SLIDE 32

Properties

  • Gives higher priority to degraded tasks if conditions are appropriate

  • That’s why we call it “degraded-first” scheduling
  • Preserves MapReduce performance as in locality-first scheduling in normal mode

  • Enhanced degraded-first scheduling (EDF)
  • Takes into account network topology when assigning degraded tasks


SLIDE 33

Experiments


  • Prototype on HDFS-RAID
  • Hadoop cluster:
  • Single master and 12 slaves
  • Slaves grouped into three racks, four slaves each
  • Both ToR and core switches have 1Gbps bandwidth
  • 64MB per block
  • Workloads:
  • Jobs: Grep, WordCount, LineCount
  • Each job processes a 240-block file
SLIDE 34

Experiment Results

  • Single job: gain of EDF over LF
  • 27.0% for WordCount
  • 26.1% for Grep
  • 24.8% for LineCount
  • Multiple jobs: gain of EDF over LF
  • 16.6% for WordCount
  • 28.4% for Grep
  • 22.6% for LineCount


SLIDE 35

Other Projects on Erasure Coding

  • Mixed failures
  • STAIR codes: a general, space-efficient erasure code for tolerating both device failures and latent sector errors [FAST’14]

  • I/O-efficient integrity checking against silent data corruptions [MSST’14]
  • Efficient updates
  • CodFS: enhanced parity logging to reduce network and disk I/Os in erasure-coded storage [FAST’14]

  • Efficient recovery
  • NCCloud: reduce bandwidth for archival storage [FAST’12, INFOCOM’13, TC’14]
  • I/O-efficient recovery schemes for erasure codes [MSST’12, DSN’12, TC’14, TPDS’14]
  • Modeling of SSD RAID
  • Stochastic model to capture reliability changes as SSDs age [SRDS’13, TC]
  • Secure outsourced storage
  • FMSR-DIP: remote data checking for regenerating codes [SRDS’12, TPDS’14]
  • Convergent dispersal: unifying security, deduplication, and erasure coding [HotStorage’14]

SLIDE 36

Conclusions

  • Provide insights into the use of erasure coding on clustered storage systems

  • Two systems
  • CORE: improve concurrent recovery performance
  • Degraded-first scheduling: improve MapReduce performance in erasure-coded storage in failure mode

  • Approach:
  • Build prototypes, backed by extensive experiments and theoretical analysis

  • Open-source software

http://www.cse.cuhk.edu.hk/~pclee
