SLIDE 1

RAIDP: ReplicAtion with Intra-Disk Parity

Eitan Rosenfeld, Aviad Zuck, Nadav Amit, Michael Factor, Dan Tsafrir

SLIDE 2

Today’s Datacenters

Image Source: http://www.google.com/about/datacenters/gallery/#/tech/14

SLIDE 3

Problem: Disks fail

  • So storage systems use redundancy when storing data
  • Two forms of redundancy:
    – Replication, or
    – Erasure codes

SLIDE 4

Replication vs. Erasure Coding

[Figure: two data blocks, a=2 and b=3]

SLIDE 5

Replication vs. Erasure Coding

[Figure: replication; a=2 and b=3 stored on each of three disks]

SLIDE 6

Replication vs. Erasure Coding

[Figure: replication; three copies of a=2 and b=3, one disk fails (X)]

SLIDE 7

Replication vs. Erasure Coding

[Figure: replication; three copies of a=2 and b=3, one disk fails (X)]

SLIDE 8

Replication vs. Erasure Coding

[Figure: replication (three copies of a=2, b=3) vs. erasure coding (a=2, b=3, and parity a+b=5); one disk fails (X)]

SLIDE 9

Replication vs. Erasure Coding

[Figure: replication (three copies of a=2, b=3) vs. erasure coding (a=2, b=3, and parity a+b=5); two disks fail (X X)]

SLIDE 10

Replication vs. Erasure Coding

[Figure: replication (three copies of a=2, b=3) vs. erasure coding (a=2, b=3, and parities a+b=5, a+2b=8); two disks fail (X X)]

SLIDE 11

Replication vs. Erasure Coding

[Figure: replication (three copies of a=2, b=3) vs. erasure coding (a=2, b=3, and parities a+b=5, a+2b=8); three disks fail (X X X)]
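To make the erasure-coding side concrete, here is a small worked example using the values on the slide (the specific two-parity code is illustrative): storing a and b together with the parity blocks a+b and a+2b tolerates the loss of any two of the four blocks. If both data blocks are lost, the parities alone recover them:

```latex
\begin{align*}
  p_1 &= a + b  = 5, & p_2 &= a + 2b = 8,\\
  b   &= p_2 - p_1 = 8 - 5 = 3, & a &= p_1 - b = 5 - 3 = 2.
\end{align*}
```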

SLIDE 12

Many modern systems replicate warm data

  • Amazon’s storage services
  • Google File System (GFS)
  • Facebook’s Haystack
  • Windows Azure Storage (WAS)
  • Microsoft’s Flat Datacenter Storage (FDS)
  • HDFS (open-source file-system for Hadoop)
  • Cassandra
  • ...

SLIDE 13

Why is replication advantageous for warm data?

Better for reads:
  1. Load balancing
  2. Parallelism
  3. Avoids degraded reads

Better for writes:
  4. Lower sync latency

Better for reads and writes:
  5. Increased sequentiality
  6. Avoids the CPU processing used for encoding
  7. Lower repair traffic

SLIDE 14

Recovery in replication-based systems is efficient

[Figure: Disks 1–4 holding replicated blocks 1–6]

SLIDE 15

Recovery in replication-based systems is efficient

[Figure: Disks 1–4 holding replicated blocks 1–6; one disk fails (X)]

SLIDE 16

Recovery in replication-based systems is efficient

[Figure: Disks 1–4 holding replicated blocks 1–6; one disk fails (X), and each of its blocks still has a surviving copy on another disk]

SLIDE 17

Erasure coding, on the other hand…

[Figure: RAID-style layout across Disks 1–4: stripes A, B, C, D, each holding three data blocks (A1, A2, A3, etc.) plus a parity block (APARITY, BPARITY, CPARITY, DPARITY)]

SLIDE 18

Erasure coding, on the other hand…

[Figure: RAID-style layout across Disks 1–4 with stripes A, B, C, D and their parity blocks; one disk fails (X)]

SLIDE 19

Erasure coding, on the other hand…

[Figure: RAID-style layout across Disks 1–4 with stripes A, B, C, D and their parity blocks; one disk fails (X)]

SLIDE 20

Erasure coding, on the other hand…

[Figure: RAID-style layout across Disks 1–4 with stripes A, B, C, D and their parity blocks; one disk fails (X)]

Facebook “estimate[s] that if 50% of the cluster was Reed-Solomon encoded, the repair network traffic would completely saturate the cluster network links”
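For intuition about why erasure-coded repair is so network-hungry, assume for illustration that the parity above is a plain XOR over each stripe. Rebuilding a single lost block then means reading every surviving block of its stripe from the other disks:

A1 = A2 ⊕ A3 ⊕ APARITY

so repairing one failed disk touches essentially all of the surviving disks, whereas under replication a lost block is simply re-copied from its single surviving replica.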

SLIDE 21

Modern replicating systems triple-replicate warm data

  • Amazon’s DynamoDB
  • Facebook’s Haystack
  • Google File System (GFS)
  • Windows Azure Storage (WAS)
  • Microsoft’s Flat Datacenter Storage (FDS)
  • HDFS (open-source file-system for Hadoop)
  • Cassandra
  • ...

SLIDE 22

Bottom Line

  • Replication is used for warm data only
  • It’s expensive! (Wastes storage, energy, network)
  • Erasure coding is used for the rest (cold data)

Our goal: Quickly recover from two simultaneous disk failures without resorting to a third replica for warm data

SLIDE 23

RAIDP - ReplicAtion with Intra-Disk Parity

  • Hybrid storage system for warm data with only two* copies of each data object
  • Recovers quickly from a simultaneous failure of any two disks
  • Largely enjoys the aforementioned 7 advantages of replication

SLIDE 24

System Architecture

[Figure: Disks 1–5]

SLIDE 25

System Architecture

  • Each of the N disks is divided into N-1 superchunks
    – e.g. 4GB each

[Figure: Disks 1–5]

SLIDE 26

System Architecture

  • Each of the N disks is divided into N-1 superchunks
    – e.g. 4GB each
  • 1-Mirroring: Superchunks must be 2-replicated

[Figure: Disks 1–5 holding superchunks 1–10, each superchunk stored on two disks]

SLIDE 27

System Architecture

  • Each of the N disks is divided into N-1 superchunks
    – e.g. 4GB each
  • 1-Mirroring: Superchunks must be 2-replicated
  • 1-Sharing: Any two disks share at most one superchunk

[Figure: Disks 1–5 holding superchunks 1–10, each superchunk stored on two disks and any two disks sharing at most one superchunk]
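A minimal sketch of one placement that satisfies both rules (not necessarily the paper's placement algorithm; the function name and output format are made up for illustration): dedicate one superchunk to every pair of disks, so each superchunk is 2-replicated and any two disks share exactly one superchunk.

```python
from itertools import combinations

def raidp_layout(num_disks):
    """Assign one superchunk to every pair of disks.

    Each superchunk is 2-replicated (1-mirroring) and any two disks share
    exactly one superchunk (1-sharing), so each of the N disks ends up
    holding N-1 superchunks.
    """
    disks = {d: [] for d in range(1, num_disks + 1)}
    for chunk_id, (d1, d2) in enumerate(combinations(disks, 2), start=1):
        disks[d1].append(chunk_id)
        disks[d2].append(chunk_id)
    return disks

if __name__ == "__main__":
    for disk, chunks in raidp_layout(5).items():
        print(f"Disk {disk}: superchunks {chunks}")
```

With 5 disks this reproduces the numbers on the slide: 10 superchunks, 4 per disk.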

SLIDE 28

Introducing “disk add-ons”

  • Associated with a specific disk
    – Interposes all I/O to the disk
    – Stores an erasure code of the local disk's superchunks
    – Fails separately from the associated disk

[Figure: disk drive with an attached add-on (SATA/SAS and power pass-through); the disk holds superchunks 1–4 and the add-on stores their parity, 1⊕2⊕3⊕4]
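Because the add-on holds a running XOR of its disk's superchunks, every write must also fold the change into that parity. A minimal sketch of the read-modify-write this implies (the AddOn class and its method names are hypothetical, not the paper's interface; it assumes the old contents are available on the write path):

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))

class AddOn:
    """Toy model of a disk add-on keeping the XOR of the disk's superchunks."""

    def __init__(self, superchunk_size: int):
        self.parity = bytearray(superchunk_size)  # XOR of all local superchunks

    def on_write(self, offset: int, old_data: bytes, new_data: bytes) -> None:
        # new_parity = old_parity XOR old_data XOR new_data, applied at the
        # written offset, regardless of which local superchunk was written.
        for i, delta in enumerate(xor_bytes(old_data, new_data)):
            self.parity[offset + i] ^= delta

# Example: a tiny 4-byte "superchunk" write on a freshly zeroed add-on
addon = AddOn(superchunk_size=4)
addon.on_write(0, old_data=b"\x00\x00\x00\x00", new_data=b"\xde\xad\xbe\xef")
assert bytes(addon.parity) == b"\xde\xad\xbe\xef"
```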

SLIDE 29

RAIDP Recovery

[Figure: five disks with their add-ons; each add-on stores the XOR of its disk's superchunks (1⊕2⊕6⊕8, 2⊕3⊕7⊕9, 3⊕4⊕8⊕10, 4⊕5⊕9⊕6, 5⊕1⊕10⊕7); two disks fail (X X)]

XOR Add-on 1 with the surviving superchunks from Disk 1.
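A sketch of the recovery step the slide describes, for a double failure in which the superchunk shared by the two failed disks loses both replicas (the helper name and the tiny one-byte "superchunks" are made up for illustration):

```python
def recover_shared_superchunk(addon_parity: bytes, surviving_chunks) -> bytes:
    """XOR a failed disk's add-on parity with its surviving superchunks.

    The surviving superchunks can still be read from their replicas on
    healthy disks; what falls out is the one superchunk whose two replicas
    were on the two failed disks.
    """
    lost = bytearray(addon_parity)
    for chunk in surviving_chunks:
        for i, b in enumerate(chunk):
            lost[i] ^= b
    return bytes(lost)

# Disk 1 held superchunks {1, 2, 6, 8}; its add-on stores 1 ⊕ 2 ⊕ 6 ⊕ 8.
chunks = {1: b"\x11", 2: b"\x22", 6: b"\x66", 8: b"\x88"}
parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*chunks.values()))
survivors = [chunks[2], chunks[6], chunks[8]]   # replicas live on healthy disks
assert recover_shared_superchunk(parity, survivors) == chunks[1]
```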

SLIDE 30

[Figure: storage capacity vs. repair traffic for erasure coding (cold data), triple replication (warm data), RAIDP after a single failure, and RAIDP after a double failure]

SLIDE 31

Lstor Feasibility

Goal: Replace a third replica disk with 2 Lstors. Lstors need to be cheap, fast, and fail separately from the disk.

  • Storage: Enough to maintain parity (~$9) [1]
  • Processing: Microcontroller for local machine independence (~$5) [2]
  • Power: Several hundred Amps for 2–3 min from a small supercapacitor to read data from the Lstor

A commodity 2.5” 4TB disk for storing an additional replica costs $100: 66% more than a conservative estimate of the cost of two Lstors.
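A rough back-of-the-envelope reading of that 66% figure: if a $100 disk is 66% more expensive than two Lstors, the two-Lstor estimate works out to roughly $100 / 1.66 ≈ $60, i.e. about $30 per Lstor once the supercapacitor and remaining components are added to the ~$14 of storage and microcontroller listed above.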

SLIDE 32

Implementation in HDFS

  • RAIDP implemented in Hadoop 1.0.4
    – Two variants:
      • Append-only
      • Updates-in-place
  • 3K LOC extension to HDFS
    – Pre-allocated block files to simulate superchunks
    – Lstors simulated in memory
    – Added crash consistency and several optimizations

SLIDE 33

Evaluation

  • RAIDP vs. HDFS with 2 and 3 replicas
  • Tested on a 16-node cluster
    – Intel Xeon CPU E3-1220 V2 @ 3.10GHz
    – 16GB RAM
    – 7200 RPM disks
  • 10Gbps Ethernet
  • 6GB superchunks, ~800GB cluster capacity

SLIDE 34

Hadoop write throughput (Runtime of writing 100GB)

[Bar chart: runtime of writing 100GB; RAIDP variants (updates-in-place, Lstors, superchunks-only) vs. HDFS-3 and HDFS-2]

For updates-in-place, RAIDP performs 4 I/Os for each write → both replicas are read before they are overwritten.

RAIDP completes the workload 22% faster!

SLIDE 35

Hadoop read throughput (Runtime of reading 100GB)

[Bar chart: runtime of reading 100GB; RAIDP variants (updates-in-place, Lstors, superchunks-only) vs. HDFS-3 and HDFS-2]

SLIDE 36

Write Runtime vs. Network Usage

[Bar charts: runtime of writing 100GB, and network usage in GB when writing 100GB; RAIDP vs. HDFS-3]

SLIDE 37

TeraSort Runtime vs. Network Usage

[Bar charts: runtime of sorting 100GB, and network usage in GB when sorting 100GB; RAIDP vs. HDFS-3]

SLIDE 38

Recovery time in RAIDP

System    1Gbps Network    10Gbps Network
RAIDP     827 s            125 s
RAID-6    12,300 s         1,823 s

For erasure coding, such a recovery is required for every disk failure. For RAIDP, such a recovery is only required after the 2nd failure.

RAIDP recovers 14x faster!

(16-node cluster with 6GB superchunks)
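As a quick check of the headline number: 12,300 s / 827 s ≈ 14.9 and 1,823 s / 125 s ≈ 14.6, so the post-double-failure RAIDP rebuild is roughly 14x faster than the RAID-6 rebuild at either network speed.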

SLIDE 39

Vision and Future work

  • Survives two simultaneous failures with only two replicas
  • Can be augmented to withstand more than two simultaneous failures
    – “Stacked” Lstors
  • Building Lstors instead of simulating them
  • Equipping Lstors with network interfaces so that they can withstand rack failures
  • Experiment with SSDs

SLIDE 40

Summary

  • RAIDP achieves failure tolerance similar to 3-way replicated systems
    – Better performance when writing new data
    – Small performance hit during updates
  • Yet:
    – Requires 33% less storage
    – Uses considerably less network bandwidth for writes
    – Recovery is much more efficient than EC
  • Opens the way for storage vendors and cloud providers to use 2 (instead of 3, or more) replicas
    – Potential savings in size, energy, and capacity
