SLIDE 1

RAIDP: ReplicAtion with Intra-Disk Parity

Eitan Rosenfeld, Aviad Zuck, Nadav Amit, Michael Factor, Dan Tsafrir

SLIDE 2

Today’s Datacenters

Image Source: http://www.google.com/about/datacenters/gallery/#/tech/14

SLIDE 3

Problem: Disks fail

  • So storage systems use redundancy when storing data
  • Two forms of redundancy:
    – Replication, or
    – Erasure codes

SLIDE 4

Replication vs. Erasure Coding

[Figure: two data blocks, a=2 and b=3]

SLIDE 5

Replication vs. Erasure Coding

[Figure: replication; a=2 and b=3 stored on each of three disks]

SLIDE 6

Replication vs. Erasure Coding

[Figure: replication; three copies of a=2 and b=3, one disk fails (X)]

SLIDE 7

Replication vs. Erasure Coding

[Figure: replication; three copies of a=2 and b=3, one disk fails (X)]

SLIDE 8

Replication vs. Erasure Coding

[Figure: replication (three copies of a=2, b=3) vs. erasure coding (a=2, b=3, and parity a+b=5); one disk fails (X)]

SLIDE 9

Replication vs. Erasure Coding

[Figure: replication (three copies of a=2, b=3) vs. erasure coding (a=2, b=3, and parity a+b=5); two disks fail (X X)]

SLIDE 10

Replication vs. Erasure Coding

[Figure: replication (three copies of a=2, b=3) vs. erasure coding (a=2, b=3, and parities a+b=5, a+2b=8); two disks fail (X X)]

SLIDE 11

Replication vs. Erasure Coding

[Figure: replication (three copies of a=2, b=3) vs. erasure coding (a=2, b=3, and parities a+b=5, a+2b=8); three disks fail (X X X)]
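To make the erasure-coding side concrete, here is a small worked example using the values on the slide (the specific two-parity code is illustrative): storing a and b together with the parity blocks a+b and a+2b tolerates the loss of any two of the four blocks. If both data blocks are lost, the parities alone recover them:

```latex
\begin{align*}
  p_1 &= a + b  = 5, & p_2 &= a + 2b = 8,\\
  b   &= p_2 - p_1 = 8 - 5 = 3, & a &= p_1 - b = 5 - 3 = 2.
\end{align*}
```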

SLIDE 12

Many modern systems replicate warm data

  • Amazon’s storage services
  • Google File System (GFS)
  • Facebook’s Haystack
  • Windows Azure Storage (WAS)
  • Microsoft’s Flat Datacenter Storage (FDS)
  • HDFS (open-source file-system for Hadoop)
  • Cassandra
  • ...

SLIDE 13

Why is replication advantageous for warm data?

Better for reads:
  1. Load balancing
  2. Parallelism
  3. Avoids degraded reads

Better for writes:
  4. Lower sync latency

Better for reads and writes:
  5. Increased sequentiality
  6. Avoids the CPU processing used for encoding
  7. Lower repair traffic

SLIDE 14

Recovery in replication-based systems is efficient

[Figure: Disks 1–4 holding replicated blocks 1–6]

SLIDE 15

Recovery in replication-based systems is efficient

[Figure: Disks 1–4 holding replicated blocks 1–6; one disk fails (X)]

SLIDE 16

Recovery in replication-based systems is efficient

[Figure: Disks 1–4 holding replicated blocks 1–6; one disk fails (X), and each of its blocks still has a surviving copy on another disk]

SLIDE 17

Erasure coding, on the other hand…

[Figure: RAID-style layout across Disks 1–4: stripes A, B, C, D, each holding three data blocks (A1, A2, A3, etc.) plus a parity block (APARITY, BPARITY, CPARITY, DPARITY)]

SLIDE 18

Erasure coding, on the other hand…

[Figure: RAID-style layout across Disks 1–4 with stripes A, B, C, D and their parity blocks; one disk fails (X)]

SLIDE 19

Erasure coding, on the other hand…

[Figure: RAID-style layout across Disks 1–4 with stripes A, B, C, D and their parity blocks; one disk fails (X)]

SLIDE 20

Erasure coding, on the other hand…

[Figure: RAID-style layout across Disks 1–4 with stripes A, B, C, D and their parity blocks; one disk fails (X)]

Facebook “estimate[s] that if 50% of the cluster was Reed-Solomon encoded, the repair network traffic would completely saturate the cluster network links”
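For intuition about why erasure-coded repair is so network-hungry, assume for illustration that the parity above is a plain XOR over each stripe. Rebuilding a single lost block then means reading every surviving block of its stripe from the other disks:

A1 = A2 ⊕ A3 ⊕ APARITY

so repairing one failed disk touches essentially all of the surviving disks, whereas under replication a lost block is simply re-copied from its single surviving replica.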

SLIDE 21

Modern replicating systems triple-replicate warm data

  • Amazon’s DynamoDB
  • Facebook’s Haystack
  • Google File System (GFS)
  • Windows Azure Storage (WAS)
  • Microsoft’s Flat Datacenter Storage (FDS)
  • HDFS (open-source file-system for Hadoop)
  • Cassandra
  • ...

SLIDE 22

Bottom Line

  • Replication is used for warm data only
  • It’s expensive! (Wastes storage, energy, network)
  • Erasure coding is used for the rest (cold data)

Our goal: Quickly recover from two simultaneous disk failures without resorting to a third replica for warm data

SLIDE 23

RAIDP - ReplicAtion with Intra-Disk Parity

  • Hybrid storage system for warm data with only two* copies of each data object
  • Recovers quickly from a simultaneous failure of any two disks
  • Largely enjoys the aforementioned 7 advantages of replication

SLIDE 24

System Architecture

[Figure: Disks 1–5]

SLIDE 25

System Architecture

  • Each of the N disks is divided into N-1 superchunks
    – e.g. 4GB each

[Figure: Disks 1–5]

SLIDE 26

System Architecture

  • Each of the N disks is divided into N-1 superchunks
    – e.g. 4GB each
  • 1-Mirroring: Superchunks must be 2-replicated

[Figure: Disks 1–5 holding superchunks 1–10, each superchunk stored on two disks]

SLIDE 27

System Architecture

  • Each of the N disks is divided into N-1 superchunks
    – e.g. 4GB each
  • 1-Mirroring: Superchunks must be 2-replicated
  • 1-Sharing: Any two disks share at most one superchunk

[Figure: Disks 1–5 holding superchunks 1–10, each superchunk stored on two disks and any two disks sharing at most one superchunk]
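A minimal sketch of one placement that satisfies both rules (not necessarily the paper's placement algorithm; the function name and output format are made up for illustration): dedicate one superchunk to every pair of disks, so each superchunk is 2-replicated and any two disks share exactly one superchunk.

```python
from itertools import combinations

def raidp_layout(num_disks):
    """Assign one superchunk to every pair of disks.

    Each superchunk is 2-replicated (1-mirroring) and any two disks share
    exactly one superchunk (1-sharing), so each of the N disks ends up
    holding N-1 superchunks.
    """
    disks = {d: [] for d in range(1, num_disks + 1)}
    for chunk_id, (d1, d2) in enumerate(combinations(disks, 2), start=1):
        disks[d1].append(chunk_id)
        disks[d2].append(chunk_id)
    return disks

if __name__ == "__main__":
    for disk, chunks in raidp_layout(5).items():
        print(f"Disk {disk}: superchunks {chunks}")
```

With 5 disks this reproduces the numbers on the slide: 10 superchunks, 4 per disk.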

SLIDE 28

Introducing “disk add-ons”

  • Associated with a specific disk
    – Interposes all I/O to the disk
    – Stores an erasure code of the local disk's superchunks
    – Fails separately from the associated disk

[Figure: disk drive with an attached add-on (SATA/SAS and power pass-through); the disk holds superchunks 1–4 and the add-on stores their parity, 1⊕2⊕3⊕4]
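Because the add-on holds a running XOR of its disk's superchunks, every write must also fold the change into that parity. A minimal sketch of the read-modify-write this implies (the AddOn class and its method names are hypothetical, not the paper's interface; it assumes the old contents are available on the write path):

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))

class AddOn:
    """Toy model of a disk add-on keeping the XOR of the disk's superchunks."""

    def __init__(self, superchunk_size: int):
        self.parity = bytearray(superchunk_size)  # XOR of all local superchunks

    def on_write(self, offset: int, old_data: bytes, new_data: bytes) -> None:
        # new_parity = old_parity XOR old_data XOR new_data, applied at the
        # written offset, regardless of which local superchunk was written.
        for i, delta in enumerate(xor_bytes(old_data, new_data)):
            self.parity[offset + i] ^= delta

# Example: a tiny 4-byte "superchunk" write on a freshly zeroed add-on
addon = AddOn(superchunk_size=4)
addon.on_write(0, old_data=b"\x00\x00\x00\x00", new_data=b"\xde\xad\xbe\xef")
assert bytes(addon.parity) == b"\xde\xad\xbe\xef"
```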

SLIDE 29

RAIDP Recovery

[Figure: five disks with their add-ons; each add-on stores the XOR of its disk's superchunks (1⊕2⊕6⊕8, 2⊕3⊕7⊕9, 3⊕4⊕8⊕10, 4⊕5⊕9⊕6, 5⊕1⊕10⊕7); two disks fail (X X)]

XOR Add-on 1 with the surviving superchunks from Disk 1.
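A sketch of the recovery step the slide describes, for a double failure in which the superchunk shared by the two failed disks loses both replicas (the helper name and the tiny one-byte "superchunks" are made up for illustration):

```python
def recover_shared_superchunk(addon_parity: bytes, surviving_chunks) -> bytes:
    """XOR a failed disk's add-on parity with its surviving superchunks.

    The surviving superchunks can still be read from their replicas on
    healthy disks; what falls out is the one superchunk whose two replicas
    were on the two failed disks.
    """
    lost = bytearray(addon_parity)
    for chunk in surviving_chunks:
        for i, b in enumerate(chunk):
            lost[i] ^= b
    return bytes(lost)

# Disk 1 held superchunks {1, 2, 6, 8}; its add-on stores 1 ⊕ 2 ⊕ 6 ⊕ 8.
chunks = {1: b"\x11", 2: b"\x22", 6: b"\x66", 8: b"\x88"}
parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*chunks.values()))
survivors = [chunks[2], chunks[6], chunks[8]]   # replicas live on healthy disks
assert recover_shared_superchunk(parity, survivors) == chunks[1]
```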

SLIDE 30

[Figure: storage capacity vs. repair traffic for erasure coding (cold data), triple replication (warm data), RAIDP after a single failure, and RAIDP after a double failure]

SLIDE 31

Lstor Feasibility

Goal: Replace a third replica disk with 2 Lstors. Lstors need to be cheap, fast, and fail separately from the disk.

  • Storage: Enough to maintain parity (~$9) [1]
  • Processing: Microcontroller for local machine independence (~$5) [2]
  • Power: Several hundred Amps for 2–3 min from a small supercapacitor to read data from the Lstor

A commodity 2.5” 4TB disk for storing an additional replica costs $100: 66% more than a conservative estimate of the cost of two Lstors.
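A rough back-of-the-envelope reading of that 66% figure: if a $100 disk is 66% more expensive than two Lstors, the two-Lstor estimate works out to roughly $100 / 1.66 ≈ $60, i.e. about $30 per Lstor once the supercapacitor and remaining components are added to the ~$14 of storage and microcontroller listed above.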

SLIDE 32

Implementation in HDFS

  • RAIDP implemented in Hadoop 1.0.4
    – Two variants:
      • Append-only
      • Updates-in-place
  • 3K LOC extension to HDFS
    – Pre-allocated block files to simulate superchunks
    – Lstors simulated in memory
    – Added crash consistency and several optimizations

SLIDE 33

Evaluation

  • RAIDP vs. HDFS with 2 and 3 replicas
  • Tested on a 16-node cluster
    – Intel Xeon CPU E3-1220 V2 @ 3.10GHz
    – 16GB RAM
    – 7200 RPM disks
  • 10Gbps Ethernet
  • 6GB superchunks, ~800GB cluster capacity

SLIDE 34

Hadoop write throughput (Runtime of writing 100GB)

[Bar chart: runtime of writing 100GB; RAIDP variants (updates-in-place, Lstors, superchunks-only) vs. HDFS-3 and HDFS-2]

For updates-in-place, RAIDP performs 4 I/Os for each write → both replicas are read before they are overwritten.

RAIDP completes the workload 22% faster!

SLIDE 35

Hadoop read throughput (Runtime of reading 100GB)

[Bar chart: runtime of reading 100GB; RAIDP variants (updates-in-place, Lstors, superchunks-only) vs. HDFS-3 and HDFS-2]

SLIDE 36

Write Runtime vs. Network Usage

[Bar charts: runtime of writing 100GB, and network usage in GB when writing 100GB; RAIDP vs. HDFS-3]

SLIDE 37

TeraSort Runtime vs. Network Usage

[Bar charts: runtime of sorting 100GB, and network usage in GB when sorting 100GB; RAIDP vs. HDFS-3]

SLIDE 38

Recovery time in RAIDP

System    1Gbps Network    10Gbps Network
RAIDP     827 s            125 s
RAID-6    12,300 s         1,823 s

For erasure coding, such a recovery is required for every disk failure. For RAIDP, such a recovery is only required after the 2nd failure.

RAIDP recovers 14x faster!

(16-node cluster with 6GB superchunks)
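As a quick check of the headline number: 12,300 s / 827 s ≈ 14.9 and 1,823 s / 125 s ≈ 14.6, so the post-double-failure RAIDP rebuild is roughly 14x faster than the RAID-6 rebuild at either network speed.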

SLIDE 39

Vision and Future work

  • Survives two simultaneous failures with only two replicas
  • Can be augmented to withstand more than two simultaneous failures
    – “Stacked” Lstors
  • Building Lstors instead of simulating them
  • Equipping Lstors with network interfaces so that they can withstand rack failures
  • Experiment with SSDs

SLIDE 40

Summary

  • RAIDP achieves failure tolerance similar to 3-way replicated systems
    – Better performance when writing new data
    – Small performance hit during updates
  • Yet:
    – Requires 33% less storage
    – Uses considerably less network bandwidth for writes
    – Recovery is much more efficient than EC
  • Opens the way for storage vendors and cloud providers to use 2 (instead of 3, or more) replicas
    – Potential savings in size, energy, and capacity
