RAIDP: ReplicAtion with Intra-Disk Parity


  1. RAIDP: ReplicAtion with Intra-Disk Parity. Eitan Rosenfeld, Aviad Zuck, Nadav Amit, Michael Factor, Dan Tsafrir

  2. Today’s Datacenters. Image source: http://www.google.com/about/datacenters/gallery/#/tech/14

  3. Problem: Disks fail
      • So storage systems use redundancy when storing data
      • Two forms of redundancy: replication, or erasure codes

  4.–11. Replication vs. Erasure Coding (animated figure). Replication stores a full copy of each value (a=2, b=3) on more than one disk, so when a disk fails a lost value is simply re-read from a surviving replica. Erasure coding stores each value once plus parities (a+b=5, a+2b=8) on separate disks; after failures, the lost values are reconstructed by solving the parity equations.
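The arithmetic behind the slide's erasure-coding example can be checked in a few lines. This is a toy sketch that uses the slide's integer parities a+b and a+2b rather than a real Reed-Solomon code; the variable names are illustrative only.

```python
# Toy check of the slide example: data a=2, b=3 plus parities a+b=5 and
# a+2b=8 are spread over four disks. If the two disks holding a and b both
# fail, the data is reconstructed by solving the two parity equations.
a, b = 2, 3
p1 = a + b          # 5, stored on a third disk
p2 = a + 2 * b      # 8, stored on a fourth disk

# Recovery after losing both data values:
b_rec = p2 - p1     # (a + 2b) - (a + b) = b
a_rec = p1 - b_rec  # (a + b) - b = a
assert (a_rec, b_rec) == (2, 3)
```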

  12. Many modern systems replicate warm data: Amazon’s storage services, Google File System (GFS), Facebook’s Haystack, Windows Azure Storage (WAS), Microsoft’s Flat Datacenter Storage (FDS), HDFS (the open-source file system for Hadoop), Cassandra, ...

  13. Why is replication advantageous for warm data?
      Better for reads: 1. Load balancing, 2. Parallelism, 3. Avoids degraded reads.
      Better for writes: 4. Lower sync latency.
      Better for reads and writes: 5. Increased sequentiality, 6. Avoids the CPU processing used for encoding, 7. Lower repair traffic.

  14.–16. Recovery in replication-based systems is efficient (animated figure: blocks 1–6 are stored in duplicate across Disks 1–4; when a disk fails, each of its blocks is re-copied from the surviving replica on another disk).

  17.–20. Erasure coding, on the other hand… (animated figure: stripes A–D each spread three data blocks and a parity block across Disks 1–4, so rebuilding a failed disk requires reading the surviving blocks of every stripe it touched). Facebook “estimate[s] that if 50% of the cluster was Reed-Solomon encoded, the repair network traffic would completely saturate the cluster network links”.

  21. Modern replicating systems triple-replicate warm data: Amazon’s DynamoDB, Facebook’s Haystack, Google File System (GFS), Windows Azure Storage (WAS), Microsoft’s Flat Datacenter Storage (FDS), HDFS (the open-source file system for Hadoop), Cassandra, ...

  22. Bottom Line
      • Replication is used for warm data only
      • It’s expensive! (wastes storage, energy, and network)
      • Erasure coding is used for the rest (cold data)
      Our goal: quickly recover from two simultaneous disk failures without resorting to a third replica for warm data.

  23. RAIDP - ReplicAtion with Intra-Disk Parity
      • A hybrid storage system for warm data with only two* copies of each data object
      • Recovers quickly from a simultaneous failure of any two disks
      • Largely enjoys the aforementioned 7 advantages of replication

  24.–27. System Architecture (figure: superchunks 1–10 laid out across Disks 1–5)
      • Each of the N disks is divided into N-1 superchunks (e.g., 4GB each)
      • 1-Mirroring: every superchunk must be 2-replicated
      • 1-Sharing: any two disks share at most one superchunk
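A minimal sketch of a placement that satisfies both properties, assuming one superchunk is created per pair of disks (function and variable names are illustrative, not taken from the paper's code):

```python
from itertools import combinations

def raidp_layout(num_disks):
    """Place one superchunk on every pair of disks.

    Each superchunk gets exactly two replicas (1-mirroring) and any two
    disks have exactly one superchunk in common (1-sharing), so each disk
    ends up holding num_disks - 1 superchunks.
    """
    layout = {disk: [] for disk in range(1, num_disks + 1)}
    pairs = combinations(range(1, num_disks + 1), 2)
    for chunk_id, (d1, d2) in enumerate(pairs, start=1):
        layout[d1].append(chunk_id)
        layout[d2].append(chunk_id)
    return layout

# With 5 disks this yields 10 superchunks, 4 per disk, as in the figure.
print(raidp_layout(5))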

  28. Introducing “disk add-ons” (figure: the add-on sits on the SATA/SAS path in front of the disk drive, has its own power, and holds 1⊕2⊕3⊕4, the XOR of superchunks 1–4 stored on that drive)
      • Associated with a specific disk
      • Interposes all I/O to the disk
      • Stores an erasure code of the local disk’s superchunks
      • Fails separately from the associated disk

  29. RAIDP Recovery (figure: each add-on stores the XOR of its disk’s superchunks: Add-on 1 = 1⊕2⊕6⊕8, Add-on 2 = 2⊕3⊕7⊕9, Add-on 3 = 3⊕4⊕8⊕10, Add-on 4 = 4⊕5⊕9⊕6, Add-on 5 = 5⊕1⊕10⊕7; Disks 1 and 2 fail). The superchunk shared by the two failed disks is rebuilt by XORing Add-on 1 with the surviving superchunks of Disk 1: (1⊕2⊕6⊕8) ⊕ 1 ⊕ 6 ⊕ 8 = 2.
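A small sketch of this recovery path, using integers as stand-ins for superchunk contents (the real system XORs multi-gigabyte superchunks; the names below are illustrative):

```python
from functools import reduce

def xor_all(values):
    """XOR a collection of superchunk stand-ins together."""
    return reduce(lambda x, y: x ^ y, values)

# Disk 1 holds superchunks 1, 2, 6 and 8; its add-on stores their XOR.
disk1 = {1: 0x11, 2: 0x22, 6: 0x66, 8: 0x88}
addon1 = xor_all(disk1.values())

# Disks 1 and 2 fail together. Superchunk 2 was the one they shared, so
# both of its replicas are gone; replicas of 1, 6 and 8 survive elsewhere.
surviving = [disk1[1], disk1[6], disk1[8]]
recovered_2 = xor_all([addon1] + surviving)
assert recovered_2 == disk1[2]
```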

  30. (Chart: repair traffic vs. storage capacity.) Triple replication of warm data incurs the least repair traffic but the highest storage cost; erasure coding of cold data costs the least storage but the most repair traffic. RAIDP sits between them: its repair traffic is close to replication's after a single failure and rises only after a double failure.

  31. Lstor Feasibility
      Goal: replace a third replica disk with 2 Lstors. Lstors need to be cheap, fast, and fail separately from the disk.
      – Storage: enough to maintain parity (~$9) [1]
      – Processing: microcontroller for local machine independence (~$5) [2]
      – Power: several hundred amps for 2–3 min from a small supercapacitor, to read data from the Lstor
      A commodity 2.5” 4TB disk for storing an additional replica costs $100: 66% more than a conservative estimate of the cost of two Lstors.

  32. Implementation in HDFS
      • RAIDP implemented in Hadoop 1.0.4, in two variants: append-only and updates-in-place
      • 3K LOC extension to HDFS
        – Pre-allocated block files to simulate superchunks
        – Lstors simulated in memory
        – Added crash consistency and several optimizations

  33. Evaluation
      • RAIDP vs. HDFS with 2 and 3 replicas
      • Tested on a 16-node cluster: Intel Xeon E3-1220 V2 @ 3.10GHz, 16GB RAM, 7200 RPM disks
      • 10Gbps Ethernet
      • 6GB superchunks, ~800GB cluster capacity

  34. Hadoop write throughput (chart: runtime of writing 100GB for HDFS-2, HDFS-3, and the RAIDP superchunks-only and updates-in-place variants). RAIDP completes the workload 22% faster! For updates in place, RAIDP performs 4 I/Os per write, because both replicas are read before they are overwritten.
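A sketch of why an in-place update costs four disk I/Os, assuming the add-on parity is patched with the standard XOR identity new_parity = old_parity ⊕ old_data ⊕ new_data (an illustration, not the paper's code):

```python
def update_replica(disk, addon_parity, chunk_id, new_data):
    """Overwrite one replica of a superchunk and patch its add-on parity."""
    old_data = disk[chunk_id]          # I/O 1: read the old contents
    disk[chunk_id] = new_data          # I/O 2: overwrite in place
    return addon_parity ^ old_data ^ new_data

# The same read + write happens on the second replica's disk, so one
# logical update costs 2 reads + 2 writes = 4 disk I/Os in total.
replica_a, replica_b = {7: 0b1010}, {7: 0b1010}   # two replicas of superchunk 7
parity_a, parity_b = 0b0110, 0b1001               # made-up add-on parities
parity_a = update_replica(replica_a, parity_a, 7, 0b1111)
parity_b = update_replica(replica_b, parity_b, 7, 0b1111)
```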

  35. Hadoop read throughput (chart: runtime of reading 100GB for the same HDFS and RAIDP configurations).

  36. Write runtime vs. network usage (charts: runtime of writing 100GB, and network usage in GB when writing 100GB, for HDFS-3 and RAIDP).

  37. TeraSort runtime vs. network usage (charts: runtime of sorting 100GB, and network usage in GB when sorting 100GB, for HDFS-3 and RAIDP).

  38. Recovery time in RAIDP (16-node cluster, 6GB superchunks)
      System  | 1Gbps network | 10Gbps network
      RAIDP   | 827 s         | 125 s
      RAID-6  | 12,300 s      | 1,823 s
      RAIDP recovers 14x faster! For erasure coding, such a recovery is required after every disk failure; for RAIDP, it is only required after the 2nd failure.

  39. Vision and Future Work
      • Survives two simultaneous failures with only two replicas
      • Can be augmented to withstand more than two simultaneous failures (“stacked” Lstors)
      • Building Lstors instead of simulating them
      • Equipping Lstors with network interfaces so that they can withstand rack failures
      • Experimenting with SSDs

  40. Summary
      • RAIDP achieves failure tolerance similar to that of 3-way replicated systems
        – Better performance when writing new data
        – Small performance hit during updates
      • Yet it requires 33% less storage, uses considerably less network bandwidth for writes, and recovers much more efficiently than erasure coding
      • Opens the way for storage vendors and cloud providers to use 2 (instead of 3 or more) replicas
        – Potential savings in size, energy, and capacity
