
Research on Efficient Erasure-Coding-Based Cluster Storage Systems



  1. Research on Efficient Erasure-Coding-Based Cluster Storage Systems
  Patrick P. C. Lee, The Chinese University of Hong Kong
  NCIS'14
  Joint work with Runhui Li, Jian Lin, Yuchong Hu

  2. Motivation
  - Clustered storage systems are widely deployed to provide scalable storage by striping data across multiple nodes
    • e.g., GFS, HDFS, Azure, Ceph, Panasas, Lustre, etc.
  - Failures are common
  [Figure: storage nodes connected over a LAN]

  3. Failure Types
  - Temporary failures
    • Nodes are temporarily inaccessible (no data loss)
    • 90% of failures in practice are transient [Ford, OSDI'10]
    • e.g., power loss, network connectivity loss, CPU overloading, reboots, maintenance, upgrades
  - Permanent failures
    • Data is permanently lost
    • e.g., disk crashes, latent sector errors, silent data corruptions, malicious attacks

  4. Replication vs. Erasure Coding
  - Solution: add redundancy
    • Replication
    • Erasure coding
  - Enterprises (e.g., Google, Azure, Facebook) are moving to erasure coding to reduce storage footprints amid explosive data growth
    • e.g., 3-way replication has 200% overhead; erasure coding can reduce the overhead to 33% [Huang, ATC'12]

  5. Background: Erasure Coding
  - Divide a file into k data chunks (each with multiple blocks)
  - Encode the data chunks into n-k additional parity chunks
  - Distribute the data/parity chunks to n nodes
  - Fault tolerance: any k out of n nodes can recover the file data (a small worked sketch follows below)
  [Figure: (n, k) = (4, 2) example. A file is divided into blocks A, B, C, D; Node 1 stores {A, B}, Node 2 stores {C, D}, Node 3 stores {A+C, B+D}, Node 4 stores {A+D, B+C+D}]
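To make the (4, 2) example concrete, the following self-contained Java sketch XOR-encodes the four data blocks into the parity blocks shown above and recovers Node 1's blocks from two surviving nodes; the block contents are made-up illustrative values.

```java
import java.util.Arrays;

public class XorCodeExample {
    // XOR two equal-length blocks (addition over GF(2)).
    static byte[] xor(byte[] x, byte[] y) {
        byte[] out = new byte[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (byte) (x[i] ^ y[i]);
        return out;
    }

    public static void main(String[] args) {
        byte[] A = {1, 2, 3, 4}, B = {5, 6, 7, 8},
               C = {9, 10, 11, 12}, D = {13, 14, 15, 16};

        // Node 1: {A, B}, Node 2: {C, D}
        // Node 3: {A+C, B+D}, Node 4: {A+D, B+C+D}
        byte[] p1 = xor(A, C);
        byte[] p2 = xor(B, D);
        byte[] p3 = xor(A, D);          // Node 4, first block
        byte[] p4 = xor(xor(B, C), D);  // Node 4, second block

        // Recover Node 1's blocks from Nodes 2 and 3 only (any k = 2 nodes suffice):
        byte[] recoveredA = xor(p1, C);                    // (A+C) + C = A
        byte[] recoveredB = xor(p2, D);                    // (B+D) + D = B
        System.out.println(Arrays.equals(A, recoveredA));  // true
        System.out.println(Arrays.equals(B, recoveredB));  // true
    }
}
```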

  6. Erasure Coding
  - Pros:
    • Reduces storage space while providing high fault tolerance
  - Cons:
    • Data chunk updates imply parity chunk updates → expensive updates
    • In general, k chunks are needed to recover a single lost chunk → expensive recovery
  - Our talk: can we improve recovery in erasure-coding-based clustered storage systems while preserving storage efficiency?

  7. Our Work
  - CORE [MSST'13, TC]
    • Augments existing regenerating codes to support optimal recovery of both single and concurrent failures
  - Degraded-read scheduling [DSN'14]
    • Improves MapReduce performance in failure mode
  - Designed, implemented, and evaluated on the Hadoop Distributed File System (HDFS)

  8. Recover a Failure
  - Conventional recovery: download data from any k surviving nodes (the traffic is worked out below)
  [Figure: (4, 2) example with a file of size M; Node 1 is repaired by downloading all blocks from k = 2 surviving nodes, giving recovery traffic = M]
  - Q: Can we minimize the recovery traffic?
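A quick check of the M figure: with (n, k) = (4, 2) each node stores M/k of the file, so reading everything from any k surviving nodes moves

```latex
% Conventional recovery traffic, (n, k) = (4, 2), file size M:
k \cdot \frac{M}{k} \;=\; 2 \cdot \frac{M}{2} \;=\; M
```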

  9. Regenerating Codes [Dimakis et al., ToIT'10]
  - Repair in regenerating codes:
    • Surviving nodes encode their stored chunks (network coding)
    • The repaired node downloads one encoded chunk from each surviving node
  [Figure: repairing Node 1 in the (4, 2) example with a file of size M; each of the three surviving nodes sends one encoded block (e.g., Node 4 sends A+B+C), giving recovery traffic = 0.75M, derived below]
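The 0.75M figure follows from the standard repair-bandwidth expression for minimum-storage regenerating codes, quoted here as background (the slide itself does not spell it out): each node stores M/k, and the repaired node downloads a 1/(n-k) fraction of each of the n-1 surviving nodes' data.

```latex
\text{repair traffic} \;=\; (n-1)\cdot\frac{M}{k\,(n-k)}
\qquad\Longrightarrow\qquad
(n,k)=(4,2):\;\; 3\cdot\frac{M}{2\cdot 2} \;=\; 0.75\,M
```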

  10. Concurrent Node Failures
  - Regenerating codes are designed only for single failure recovery
    • Optimal regenerating codes collect data from all n-1 surviving nodes to recover a single failure
  - Correlated and co-occurring failures are possible
    • In clustered storage [Schroeder, FAST'07; Ford, OSDI'10]
    • In dispersed storage [Chun, NSDI'06; Shah, NSDI'06]
  - CORE augments regenerating codes for optimal concurrent failure recovery
    • Retains the regenerating code construction

  11. CORE's Idea
  - Consider a system with n nodes
  - Regenerating codes use two functions for single failure recovery:
    • Enc: a storage node encodes its data
    • Rec: reconstruct the lost data from the encoded data of the n-1 surviving nodes
  - t-failure recovery (t > 1):
    • Reconstruct each failed node as if the other n-1 nodes were all surviving nodes

  12. Example
  - Setting: n = 6, k = 3; each node i stores three chunks s_{i,0}, s_{i,1}, s_{i,2}
  - Suppose now Nodes 0 and 1 fail
  - Recall that optimal regenerating codes collect data from the n-1 surviving nodes for single failure recovery
  - How does CORE work?

  13. Example
  - To repair Node 0 as if it were the only failure, each other node j sends an encoded chunk e_{j,0}; to repair Node 1, each other node j sends e_{j,1}
  - Applying the single-failure functions Rec and Enc:
    s_{0,0}, s_{0,1}, s_{0,2} = Rec_0(e_{1,0}, e_{2,0}, e_{3,0}, e_{4,0}, e_{5,0})
    e_{0,1} = Enc_{0,1}(s_{0,0}, s_{0,1}, s_{0,2}) = Enc_{0,1}(Rec_0(e_{1,0}, e_{2,0}, e_{3,0}, e_{4,0}, e_{5,0}))
    s_{1,0}, s_{1,1}, s_{1,2} = Rec_1(e_{0,1}, e_{2,1}, e_{3,1}, e_{4,1}, e_{5,1})
    e_{1,0} = Enc_{1,0}(s_{1,0}, s_{1,1}, s_{1,2}) = Enc_{1,0}(Rec_1(e_{0,1}, e_{2,1}, e_{3,1}, e_{4,1}, e_{5,1}))
  - Note that e_{0,1} and e_{1,0} would come from the failed nodes themselves, so they are unknown

  14. Example
  - We have two equations:
    e_{0,1} = Enc_{0,1}(Rec_0(e_{1,0}, e_{2,0}, e_{3,0}, e_{4,0}, e_{5,0}))
    e_{1,0} = Enc_{1,0}(Rec_1(e_{0,1}, e_{2,1}, e_{3,1}, e_{4,1}, e_{5,1}))
  - Trick: since Enc and Rec are linear, these form a linear system of equations in the unknowns e_{0,1} and e_{1,0} (its structure is sketched below)
  - If the equations are linearly independent, we can solve for e_{0,1} and e_{1,0}
  - Then we obtain the lost data by:
    s_{0,0}, s_{0,1}, s_{0,2} = Rec_0(e_{1,0}, e_{2,0}, e_{3,0}, e_{4,0}, e_{5,0})
    s_{1,0}, s_{1,1}, s_{1,2} = Rec_1(e_{0,1}, e_{2,1}, e_{3,1}, e_{4,1}, e_{5,1})
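Written out with the surviving nodes' contributions folded into known constants, the two equations have the structure below. This is only a sketch that treats each encoded chunk as a single symbol; the coefficients alpha, beta and constants c_0, c_1 are placeholders for whatever the actual code construction produces.

```latex
% Because Enc and Rec are linear, each unknown is an affine function of
% the other unknown plus data computable from the surviving nodes:
e_{0,1} = \alpha\, e_{1,0} + c_0, \qquad e_{1,0} = \beta\, e_{0,1} + c_1
% or, in matrix form,
\begin{pmatrix} 1 & -\alpha \\ -\beta & 1 \end{pmatrix}
\begin{pmatrix} e_{0,1} \\ e_{1,0} \end{pmatrix}
=
\begin{pmatrix} c_0 \\ c_1 \end{pmatrix},
% which has a unique solution exactly when 1 - \alpha\beta \neq 0,
% i.e., when the equations are linearly independent (a good failure pattern).
```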

  15. Bad Failure Pattern
  - The system of equations may not have a unique solution; we call this a bad failure pattern
  - Bad failure patterns account for less than ~1% of failure patterns
  - Our idea: reconstruct the data by adding one more node to bypass the bad failure pattern
    • Suppose Nodes 0 and 1 form a bad failure pattern but Nodes 0, 1, and 2 form a good one; then reconstruct the lost data for Nodes 0, 1, and 2
    • This still achieves bandwidth savings over conventional recovery

  16. Bandwidth Saving
  - Bandwidth ratio: ratio of CORE's recovery bandwidth to that of conventional recovery
  [Figure: bandwidth ratio vs. number of failures t for (12, 6), (16, 8), (20, 10), shown separately for good and bad failure patterns]
  - The bandwidth saving of CORE is significant
    • e.g., for (n, k) = (20, 10): ~80% for a single failure (checked below), 36-64% for 2-4 concurrent failures
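Using the single-failure repair-traffic expression sketched after slide 9 (an assumption about the underlying regenerating code), the ~80% number for (20, 10) checks out:

```latex
\frac{\text{CORE traffic}}{\text{conventional traffic}}
= \frac{(n-1)\,M/(k(n-k))}{M}
= \frac{19}{10\cdot 10} = 0.19
\;\;\Longrightarrow\;\; \text{saving} \approx 81\%
```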

  17. Theorem
  - Theorem: CORE, which builds on regenerating codes for single failure recovery, achieves the lower bound of recovery bandwidth when recovering a good failure pattern with t ≥ 1 failed nodes
    • Over ~99% of failure patterns are good

  18. CORE Implementation
  [Figure: implementation on HDFS. The RaidNode is extended with a CORE Encoder/Decoder; it (1) identifies corrupted blocks through the NameNode and (2) sends the recovered blocks to the DataNodes, which store the data and parity blocks]

  19. CORE Implementation
  - Parallelization
  - Erasure coding implemented in C++
    • Executed through JNI (a minimal JNI sketch follows)
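Since the slide only states that the C++ coding routines are executed through JNI, here is a minimal, hypothetical sketch of what the Java side of such a bridge could look like; the class, method, and library names are assumptions, not the actual CORE code.

```java
// Hypothetical JNI bridge: the decode() body lives in a native C++ library.
public class NativeCoder {
    static {
        // Loads the native library (e.g., libcorecoder.so); the name is assumed.
        System.loadLibrary("corecoder");
    }

    // Declared in Java, implemented in C++: decodes the lost chunks from
    // the encoded chunks collected from surviving nodes.
    public native byte[][] decode(byte[][] surviving, int[] failedNodes, int n, int k);
}
```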

  20. Experiments
  - Testbed:
    • 1 namenode and up to 20 datanodes
    • Quad-core 3.1GHz CPU, 8GB RAM, 7200RPM SATA hard disk, 1Gbps Ethernet
  [Figure: testbed with one namenode and multiple datanodes]
  - Coding schemes:
    • Reed-Solomon codes vs. CORE (interference alignment codes)

  21. Decoding Throughput
  - Evaluate computational performance:
    • Assume a single failure (t = 1)
    • The surviving data used for recovery is first loaded into memory
    • Decoding throughput: ratio of the size of the recovered data to the decoding time (see the measurement sketch below)
  - CORE (regenerating codes) achieves ≥ 500MB/s at a packet size of 8KB
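As an illustration of how the decoding-throughput metric is computed (recovered bytes divided by decoding time), here is a small self-contained sketch; decodeLostChunks() is a placeholder for a real decoder, and the chunk and packet sizes are assumptions.

```java
public class ThroughputBench {
    // Placeholder: a real benchmark would run the erasure decoder here.
    static byte[][] decodeLostChunks(byte[][] surviving) {
        return new byte[][] { new byte[surviving[0].length] };
    }

    public static void main(String[] args) {
        byte[][] surviving = new byte[10][8 * 1024];   // 10 chunks, 8KB packets (assumed sizes)

        long start = System.nanoTime();
        byte[][] recovered = decodeLostChunks(surviving);
        double seconds = (System.nanoTime() - start) / 1e9;

        long bytes = 0;
        for (byte[] chunk : recovered) bytes += chunk.length;

        // Decoding throughput = size of recovered data / decoding time.
        System.out.printf("Decoding throughput: %.1f MB/s%n",
                          bytes / (1024.0 * 1024.0) / seconds);
    }
}
```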

  22. Recovery Throughput
  [Figure: recovery throughput (MB/s) of CORE vs. Reed-Solomon (RS) for t = 1, 2, 3 failures under (12, 6), (16, 8), (20, 10)]
  - CORE shows significantly higher throughput
    • e.g., for (20, 10): the gain is 3.45x for a single failure, 2.33x for two failures, and 1.75x for three failures

  23. MapReduce
  - Q: How does erasure-coded storage affect data analytics?
  - Traditional MapReduce is designed with replication storage in mind
  - To date, no explicit analysis of MapReduce on erasure-coded storage
    • Failures trigger degraded reads in erasure coding

  24. MapReduce
  - MapReduce idea:
    • Map tasks process blocks and generate intermediate results
    • Reduce tasks collect intermediate results and produce the final output
  - Constraint: cross-rack network resources are scarce
  [Figure: WordCount example. Slave 0 stores blocks containing words B and C, Slave 1 stores A and C, Slave 2 stores A and B; map tasks emit intermediate word counts that are shuffled to the reduce tasks, e.g., <A,2>, <B,2>, <C,2> (a toy sketch follows)]
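To make the WordCount flow concrete, here is a toy, single-process Java sketch of the map, shuffle, and reduce steps using the block contents from the slide; it is illustrative only and does not use the Hadoop API.

```java
import java.util.*;

public class WordCountSketch {
    public static void main(String[] args) {
        // Blocks stored on Slaves 0-2, following the slide's example.
        String[][] slaveBlocks = { {"B", "C"}, {"A", "C"}, {"A", "B"} };

        // Map phase: one map task per block, emitting (word, 1) pairs.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String[] blocks : slaveBlocks)
            for (String block : blocks)
                for (String word : block.split("\\s+"))
                    intermediate.add(Map.entry(word, 1));

        // Shuffle + reduce phase: group pairs by word and sum the counts.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : intermediate)
            counts.merge(kv.getKey(), kv.getValue(), Integer::sum);

        System.out.println(counts);   // prints {A=2, B=2, C=2}
    }
}
```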

  25. MapReduce on Erasure Coding
  - We show that the default scheduling hurts MapReduce performance on erasure-coded storage
  - We propose Degraded-First Scheduling for MapReduce task-level scheduling
    • Improves MapReduce performance on erasure-coded storage in failure mode

  26. Default Scheduling in MapReduce
    while a heartbeat from slave s arrives do
      for job in jobQueue do
        if job has a local task on s then
          assign the local task
        else if job has a remote task then        // processing a block stored in another rack
          assign the remote task
        else if job has a degraded task then      // processing an unavailable block in the system
          assign the degraded task
        endif
      endfor
    endwhile
  - Locality-first scheduling: the master gives first priority to assigning a local task to a slave (a Java rendering of this loop follows)
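For readers who prefer code to pseudocode, here is a minimal Java rendering of the locality-first loop above; the Slave/Job/Task types are hypothetical stand-ins for the master's bookkeeping, not Hadoop classes.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical minimal types standing in for the MapReduce master's state.
interface Task {}
interface Slave {}
interface Job {
    boolean hasLocalTask(Slave s);   Task takeLocalTask(Slave s);
    boolean hasRemoteTask();         Task takeRemoteTask();
    boolean hasDegradedTask();       Task takeDegradedTask();
}

class LocalityFirstScheduler {
    // On a heartbeat from slave s, follow the slide's priority order:
    // local task first, then remote task, then degraded task.
    Optional<Task> assign(Slave s, List<Job> jobQueue) {
        for (Job job : jobQueue) {
            if (job.hasLocalTask(s))   return Optional.of(job.takeLocalTask(s));   // block stored on s
            if (job.hasRemoteTask())   return Optional.of(job.takeRemoteTask());   // block stored in another rack
            if (job.hasDegradedTask()) return Optional.of(job.takeDegradedTask()); // unavailable block, needs a degraded read
        }
        return Optional.empty();                                                   // nothing to assign
    }
}
```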

  27. Locality-First in Failure Mode
  [Figure: a cluster with a core switch and two ToR switches connecting slaves S0-S4, each storing data blocks B_{i,j} and parity blocks P_{i,j}; S0 has failed. The timeline shows each surviving slave (S1-S4) processing its local blocks and downloading parity blocks (e.g., P_{0,0}, P_{1,0}, P_{2,0}, P_{3,0}) for degraded tasks, and marks when the map phase finishes (time axis in seconds, roughly 10-40)]
