 
Research on Efficient Erasure-Coding-Based Cluster Storage Systems
Patrick P. C. Lee
The Chinese University of Hong Kong
NCIS’14
Joint work with Runhui Li, Jian Lin, Yuchong Hu

Motivation
• Clustered storage systems are widely deployed to provide scalable storage by striping data across multiple nodes
  • e.g., GFS, HDFS, Azure, Ceph, Panasas, Lustre
• Failures are common
[Figure: storage nodes connected over a LAN]

Failure Types
• Temporary failures
  • Nodes are temporarily inaccessible (no data loss)
  • 90% of failures in practice are transient [Ford, OSDI’10]
  • e.g., power loss, network connectivity loss, CPU overloading, reboots, maintenance, upgrades
• Permanent failures
  • Data is permanently lost
  • e.g., disk crashes, latent sector errors, silent data corruptions, malicious attacks

Replication vs. Erasure Coding
• Solution: add redundancy
  • Replication
  • Erasure coding
• Enterprises (e.g., Google, Azure, Facebook) are moving to erasure coding to reduce their storage footprint as data grows explosively
  • e.g., 3-way replication has 200% overhead; erasure coding can reduce overhead to 33% [Huang, ATC’12]

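
The overhead figures above follow from simple arithmetic: with k units of data stored as n units in total, the redundancy is (n - k) / k. The sketch below is illustrative only; the function name `storage_overhead` and the choice of (n, k) = (16, 12) as a configuration that yields 33% are assumptions, not taken from the slides.

```python
from fractions import Fraction


def storage_overhead(n: int, k: int) -> Fraction:
    """Redundant storage as a fraction of the original data: (n - k) / k."""
    if not 0 < k <= n:
        raise ValueError("require 0 < k <= n")
    return Fraction(n - k, k)


# 3-way replication: every unit of data is stored 3 times, i.e. (n, k) = (3, 1).
replication = storage_overhead(3, 1)

# Hypothetical erasure-coded layout with 12 data and 4 redundant chunks.
erasure_coded = storage_overhead(16, 12)

print(f"3-way replication overhead: {float(replication):.0%}")
print(f"(16, 12) erasure coding overhead: {float(erasure_coded):.0%}")
```

Any (n, k) with n - k = k / 3 gives the same 33% figure; the point is only that redundancy scales with (n - k) / k rather than with the number of full copies.
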
Background: Erasure Coding
• Divide a file into k data chunks (each with multiple blocks)
• Encode the data chunks into n-k additional parity chunks
• Distribute the data/parity chunks across n nodes
• Fault tolerance: any k out of n nodes can recover the file data
[Figure: (n, k) = (4, 2) example. A file is divided into blocks A, B, C, D and encoded; the four nodes store (A, B), (C, D), (A+C, B+D), and (A+D, B+C+D).]

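
The (4, 2) example can be checked mechanically. The sketch below assumes that "+" in the figure denotes bytewise XOR and that the four nodes hold (A, B), (C, D), (A+C, B+D), and (A+D, B+C+D); the names `LAYOUT`, `encode`, and `decode` are hypothetical. It decodes by Gaussian elimination over GF(2) and confirms that every pair of nodes recovers the file.

```python
from itertools import combinations

# Each stored block is the XOR of the named source blocks.
LAYOUT = {
    1: [frozenset("A"), frozenset("B")],
    2: [frozenset("C"), frozenset("D")],
    3: [frozenset("AC"), frozenset("BD")],
    4: [frozenset("AD"), frozenset("BCD")],
}


def xor(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))


def encode(blocks: dict) -> dict:
    """Return node id -> list of stored blocks for source blocks A..D."""
    size = len(blocks["A"])
    stored = {}
    for node, combos in LAYOUT.items():
        stored[node] = []
        for combo in combos:
            acc = bytes(size)  # all-zero block
            for name in sorted(combo):
                acc = xor(acc, blocks[name])
            stored[node].append(acc)
    return stored


def decode(survivors: dict) -> dict:
    """Recover A..D from the surviving nodes via Gauss-Jordan over GF(2)."""
    rows = [(LAYOUT[n][i], blk)
            for n, blks in survivors.items() for i, blk in enumerate(blks)]
    pivots = {}
    for name in "ABCD":
        idx = next((i for i, (vs, _) in enumerate(rows) if name in vs), None)
        if idx is None:
            raise ValueError(f"cannot recover block {name}")
        pvs, pval = rows.pop(idx)

        def eliminate(row):
            vs, val = row
            return (vs ^ pvs, xor(val, pval)) if name in vs else row

        rows = [eliminate(r) for r in rows]
        pivots = {k: eliminate(r) for k, r in pivots.items()}
        pivots[name] = (pvs, pval)
    return {name: val for name, (_, val) in pivots.items()}


file_blocks = {"A": b"aaaa", "B": b"bbbb", "C": b"cccc", "D": b"dddd"}
nodes = encode(file_blocks)
for pair in combinations(nodes, 2):
    assert decode({n: nodes[n] for n in pair}) == file_blocks
```

Production systems typically use Reed-Solomon or similar codes over larger finite fields; XOR is used here only because it matches the small example in the figure.
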
Erasure Coding
• Pros:
  • Reduces storage space while keeping high fault tolerance
• Cons:
  • Updating a data chunk requires updating parity chunks → expensive updates
  • In general, k chunks are needed to recover a single lost chunk → expensive recovery
• Our talk: Can we improve recovery of erasure-coding-based clustered storage systems while preserving storage efficiency?

Our Work
• CORE [MSST’13, TC]
  • Augments existing regenerating codes to support both optimal single and concurrent failure recovery
• Degraded-read scheduling [DSN’14]
  • Improves MapReduce performance in failure mode
• Designed, implemented, and evaluated on the Hadoop Distributed File System

Recover a Failure
• Conventional recovery: download data from any k nodes
[Figure: (4, 2) example with a file of size M. Node 1 (A, B) is repaired by downloading C, D from Node 2 and A+C, B+D from Node 3; Node 4 holds A+D, B+C+D. Recovery traffic = M.]
• Q: Can we minimize recovery traffic?

Regenerating Codes [Dimakis et al.; ToIT’10]
• Repair in regenerating codes:
  • Surviving nodes encode chunks (network coding)
  • Download one encoded chunk from each node
[Figure: the same (4, 2) example with a file of size M. To repair Node 1, each surviving node sends one encoded block; C, A+C, and A+B+C appear as the transferred blocks. Recovery traffic = 0.75M.]

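
The two traffic figures (M for conventional recovery, 0.75M for regenerating codes) can be reproduced on the (4, 2) example. This sketch assumes "+" is bytewise XOR and that the blocks sent during regenerating-code repair are C from Node 2, A+C from Node 3, and A+B+C = (A+D) + (B+C+D) from Node 4; that reading of the figure is an assumption.

```python
from functools import reduce


def xor(*blocks: bytes) -> bytes:
    """Bytewise XOR of one or more equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


A, B, C, D = b"AAAA", b"BBBB", b"CCCC", b"DDDD"
M = len(A) + len(B) + len(C) + len(D)  # file size

# Contents of the surviving nodes after Node 1 (A, B) fails.
node2 = (C, D)
node3 = (xor(A, C), xor(B, D))
node4 = (xor(A, D), xor(B, C, D))

# Conventional recovery: download everything from k = 2 nodes.
conventional = [*node2, *node3]
conv_A = xor(conventional[0], conventional[2])  # C + (A+C)
conv_B = xor(conventional[1], conventional[3])  # D + (B+D)
conv_traffic = sum(len(b) for b in conventional)

# Regenerating-code repair: each survivor sends ONE block, possibly
# computed locally from the two blocks it stores.
sent = [node2[0], node3[0], xor(*node4)]  # C, A+C, A+B+C
regen_A = xor(sent[0], sent[1])           # C + (A+C)
regen_B = xor(sent[1], sent[2])           # (A+C) + (A+B+C)
regen_traffic = sum(len(b) for b in sent)

print(conv_traffic / M, regen_traffic / M)
```

The saving comes from Node 4 combining its two blocks before sending, so the repaired node receives three blocks instead of four.
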
Concurrent Node Failures
• Regenerating codes are designed only for single failure recovery
  • Optimal regenerating codes collect data from n-1 surviving nodes for single failure recovery
• Correlated and co-occurring failures are possible
  • In clustered storage [Schroeder, FAST’07; Ford, OSDI’10]
  • In dispersed storage [Chun, NSDI’06; Shah, NSDI’06]
• CORE augments regenerating codes for optimal concurrent failure recovery
  • Retains the regenerating code construction

CORE’s Idea
• Consider a system with n nodes
• Two functions for regenerating codes in single failure recovery:
  • Enc: a storage node encodes its data
  • Rec: reconstructs lost data using encoded data from the n-1 surviving nodes
• t-failure recovery (t > 1):
  • Reconstruct each failed node as if the other n-1 nodes were surviving nodes

Example
[Figure: six nodes, Node 0 to Node 5, where Node i stores symbols $s_{i,0}, s_{i,1}, s_{i,2}$; CORE performs the recovery.]
• Setting: n = 6, k = 3
• Suppose now that Nodes 0 and 1 fail
• Recall that optimal regenerating codes collect data from n-1 surviving nodes for single failure recovery
• How does CORE work?

Example
[Figure: the same six-node layout, annotated with encoded symbols $e_{i,0}$ and $e_{i,1}$ from the surviving Nodes i = 2, ..., 5, plus $e_{0,1}$ and $e_{1,0}$ associated with the failed Nodes 0 and 1.]
$(s_{0,0}, s_{0,1}, s_{0,2}) = \mathrm{Rec}_0(e_{1,0}, e_{2,0}, e_{3,0}, e_{4,0}, e_{5,0})$
$e_{0,1} = \mathrm{Enc}_{0,1}(s_{0,0}, s_{0,1}, s_{0,2}) = \mathrm{Enc}_{0,1}(\mathrm{Rec}_0(e_{1,0}, e_{2,0}, e_{3,0}, e_{4,0}, e_{5,0}))$
$(s_{1,0}, s_{1,1}, s_{1,2}) = \mathrm{Rec}_1(e_{0,1}, e_{2,1}, e_{3,1}, e_{4,1}, e_{5,1})$
$e_{1,0} = \mathrm{Enc}_{1,0}(s_{1,0}, s_{1,1}, s_{1,2}) = \mathrm{Enc}_{1,0}(\mathrm{Rec}_1(e_{0,1}, e_{2,1}, e_{3,1}, e_{4,1}, e_{5,1}))$

Example
• We have two equations:
  $e_{0,1} = \mathrm{Enc}_{0,1}(\mathrm{Rec}_0(e_{1,0}, e_{2,0}, e_{3,0}, e_{4,0}, e_{5,0}))$
  $e_{1,0} = \mathrm{Enc}_{1,0}(\mathrm{Rec}_1(e_{0,1}, e_{2,1}, e_{3,1}, e_{4,1}, e_{5,1}))$
• Trick: they form a linear system of equations
• If the equations are linearly independent, we can calculate $e_{0,1}$ and $e_{1,0}$
• Then we obtain the lost data by
  $(s_{0,0}, s_{0,1}, s_{0,2}) = \mathrm{Rec}_0(e_{1,0}, e_{2,0}, e_{3,0}, e_{4,0}, e_{5,0})$
  $(s_{1,0}, s_{1,1}, s_{1,2}) = \mathrm{Rec}_1(e_{0,1}, e_{2,1}, e_{3,1}, e_{4,1}, e_{5,1})$

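
To see why the two equations form a linear system, here is a simplified sketch. It assumes each encoded symbol is a single finite-field element and relies only on Enc and Rec being linear maps; the coefficient names $\alpha$, $\beta$, $c_0$, $c_1$ are hypothetical and do not appear in the slides.

```latex
\begin{align*}
e_{0,1} &= \alpha\, e_{1,0} + c_0,
  & c_0 &\text{ depends only on the known } e_{2,0},\dots,e_{5,0},\\
e_{1,0} &= \beta\, e_{0,1} + c_1,
  & c_1 &\text{ depends only on the known } e_{2,1},\dots,e_{5,1},
\end{align*}
\[
\begin{pmatrix} 1 & -\alpha \\ -\beta & 1 \end{pmatrix}
\begin{pmatrix} e_{0,1} \\ e_{1,0} \end{pmatrix}
=
\begin{pmatrix} c_0 \\ c_1 \end{pmatrix},
\qquad
e_{0,1} = \frac{c_0 + \alpha c_1}{1 - \alpha\beta},
\quad
e_{1,0} = \frac{c_1 + \beta c_0}{1 - \alpha\beta}
\quad\text{when } \alpha\beta \neq 1 .
\]
```

In this simplified form the two unknowns are determined uniquely exactly when $\alpha\beta \neq 1$, which is the linear-independence condition stated above; all arithmetic is in the code's finite field.
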
Bad Failure Pattern
• The system of equations may not have a unique solution; we call this a bad failure pattern
• Bad failure patterns account for less than ~1% of failure patterns
• Our idea: reconstruct data by adding one more node to bypass the bad failure pattern
  • Suppose nodes 0,1 form a bad failure pattern and nodes 0,1,2 form a good failure pattern. Reconstruct lost data for nodes 0,1,2
  • Still achieves bandwidth saving over conventional recovery

Bandwidth Saving
• Bandwidth ratio: ratio of the recovery bandwidth of CORE to that of conventional recovery
[Figure: two plots of bandwidth ratio (y-axis, 0 to 1) versus number of failed nodes t (x-axis) for (n, k) = (12,6), (16,8), (20,10). Left: "Good Failure Pattern", t = 1 to 10. Right: "Bad Failure Pattern", t = 2 to 9.]
• Bandwidth saving of CORE is significant
  • e.g., (n, k) = (20,10)
  • Single failure: ~80%
  • 2-4 concurrent failures: 36-64%

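
The quoted savings can be reproduced with a short calculation. The formula below, ratio = t(n - t) / (k(n - k)) for a good failure pattern with t ≤ n - k, is not stated on the slide; it is an assumption based on minimum-storage regenerating codes in which each of the t failed nodes needs one symbol of size M / (k(n - k)) from each of the n - t truly surviving nodes, while conventional recovery downloads M. The function name is hypothetical.

```python
from fractions import Fraction


def core_bandwidth_ratio(n: int, k: int, t: int) -> Fraction:
    """Assumed ratio of CORE to conventional recovery bandwidth
    for a good failure pattern with t failed nodes."""
    if not (0 < k < n and 1 <= t <= n - k):
        raise ValueError("require 0 < k < n and 1 <= t <= n - k")
    return Fraction(t * (n - t), k * (n - k))


for t in range(1, 5):
    ratio = core_bandwidth_ratio(20, 10, t)
    print(f"(20,10), t={t}: ratio={float(ratio):.2f}, saving={float(1 - ratio):.0%}")
```

Under this assumption, (20, 10) gives savings of 81% for t = 1, 64% for t = 2, and 36% for t = 4, matching the slide's "~80%" and "36-64%", and (4, 2) with t = 1 gives the ratio 0.75 from the earlier regenerating-code example.
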
Theorem
• Theorem: CORE, which builds on regenerating codes for single failure recovery, achieves the lower bound of recovery bandwidth when recovering a good failure pattern with t ≥ 1 failed nodes
  • Over ~99% of failure patterns are good

CORE Implementation
[Figure: architecture diagram with a Namenode, a RaidNode containing the CORE Encoder/Decoder, and Datanodes that each hold blocks and run an Encoder. Labeled steps: 1. Identify corrupted blocks; 2. Send recovered blocks.]

CORE Implementation
• Parallelization
• Erasure coding implemented in C++
  • Executed through JNI

Experiments
• Testbed:
  • 1 namenode and up to 20 datanodes
  • Quad-core 3.1GHz CPU, 8GB RAM, 7200RPM SATA hard disk, 1Gbps Ethernet
[Figure: testbed with one namenode and multiple datanodes]
• Coding schemes:
  • Reed-Solomon codes vs. CORE (interference alignment codes)

Decoding Throughput
• Evaluate computational performance:
  • Assume a single failure (t=1)
  • Surviving data for recovery is first loaded into memory
  • Decoding throughput: ratio of the size of recovered data to decoding time
• CORE (regenerating codes) achieves ≥500MB/s at a packet size of 8KB

Recovery Throughput
[Figure: bar chart of recovery throughput (MB/s, y-axis 0 to 70) for CORE and RS with t = 1, 2, 3 at (n, k) = (12, 6), (16, 8), (20, 10).]
• CORE shows significantly higher throughput
  • e.g., in (20, 10), the gain is 3.45x for a single failure, 2.33x for two failures, and 1.75x for three failures

MapReduce
• Q: How does erasure-coded storage affect data analytics?
• Traditional MapReduce is designed with replication-based storage in mind
• To date, there has been no explicit analysis of MapReduce on erasure-coded storage
  • Failures trigger degraded reads in erasure coding

MapReduce
[Figure: WordCount example. Slave 0 holds blocks B and C, Slave 1 holds A and C, Slave 2 holds A and B. Map tasks emit intermediate pairs such as <A,1>, which are shuffled to reduce tasks that output <A,2>, <B,2>, <C,2>.]
• MapReduce idea:
  • Map tasks process blocks and generate intermediate results
  • Reduce tasks collect intermediate results and produce final output
• Constraint: cross-rack network resource is scarce

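
The map, shuffle, and reduce steps of the WordCount example can be sketched in a few lines. This is a single-process illustration, not Hadoop code; the function names and the idea of treating each block label (A, B, C) as the one word it contains are assumptions made for the sketch.

```python
from collections import defaultdict

# Block placement from the WordCount figure: slave -> blocks it stores.
PLACEMENT = {
    "slave0": ["B", "C"],
    "slave1": ["A", "C"],
    "slave2": ["A", "B"],
}


def map_task(block: str) -> list:
    """Emit one intermediate <word, 1> pair per block processed."""
    return [(block, 1)]


def shuffle(pairs: list) -> dict:
    """Group intermediate values by key, as the shuffle phase does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_task(key: str, values: list) -> tuple:
    """Sum the counts collected for one key."""
    return key, sum(values)


intermediate = [pair
                for blocks in PLACEMENT.values()
                for block in blocks
                for pair in map_task(block)]
result = dict(reduce_task(k, v) for k, v in shuffle(intermediate).items())
print(result)
```

In a real cluster the shuffle moves intermediate data between machines, which is where the scarce cross-rack bandwidth comes into play.
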
MapReduce on Erasure Coding
• Show that default scheduling hurts MapReduce performance on erasure-coded storage
• Propose degraded-first scheduling for MapReduce task-level scheduling
  • Improves MapReduce performance on erasure-coded storage in failure mode

Default Scheduling in MapReduce
    while a heartbeat from slave s arrives do
        for job in jobQueue do
            if job has a local task on s then
                assign the local task
            else if job has a remote task then
                assign the remote task      (processing a block stored in another rack)
            else if job has a degraded task then
                assign the degraded task    (processing an unavailable block in the system)
            endif
        endfor
    endwhile
• Locality-first scheduling: the master gives first priority to assigning a local task to a slave

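
The locality-first rule can be sketched as runnable code. This is a simplified model, not Hadoop's scheduler: the `Task` and `Job` classes and the `on_heartbeat` function are hypothetical, each heartbeat is assumed to fill a single free map slot, and any available block not stored on the requesting slave is treated as remote.

```python
from dataclasses import dataclass, field
from typing import Optional

PRIORITY = ("local", "remote", "degraded")  # locality-first order


@dataclass
class Task:
    block: str
    location: Optional[str]  # slave storing the block; None if unavailable


@dataclass
class Job:
    pending: list = field(default_factory=list)


def classify(task: Task, slave: str) -> str:
    if task.location is None:
        return "degraded"  # block must be reconstructed via a degraded read
    return "local" if task.location == slave else "remote"


def on_heartbeat(slave: str, job_queue: list):
    """Return (task, kind) for the first job that has work, else None."""
    for job in job_queue:
        for kind in PRIORITY:
            for task in job.pending:
                if classify(task, slave) == kind:
                    job.pending.remove(task)
                    return task, kind
    return None


job = Job(pending=[Task("B0,0", None), Task("B1,1", "S2"), Task("B0,1", "S1")])
order = []
while (assigned := on_heartbeat("S1", [job])) is not None:
    order.append((assigned[0].block, assigned[1]))
print(order)
```

In this run the degraded task for the unavailable block is assigned last, even though it was first in the pending list, which is the behavior that locality-first scheduling exhibits in failure mode.
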
Locality-First in Failure Mode
[Figure: a core switch connects two ToR switches; slaves S0 to S4 store data blocks B i,j and parity blocks P i,j for stripes i = 0 to 5. A map-task timeline (time in seconds, ticks at 10, 30, 40) lists the tasks run by slaves S1 to S4. Alongside ordinary "Process B i,j" tasks, it contains four degraded tasks that first download a parity block: "Download P 0,0, Process B 0,0", "Download P 1,0, Process B 1,0", "Download P 2,0, Process B 2,0", and "Download P 3,0, Process B 3,0". The timeline ends at "Map finishes".]
