A “Hitchhiker’s” Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers
K. V. Rashmi, Nihar Shah, D. Gu, H. Kuang, D. Borthakur, K. Ramchandran


SLIDE 1

A “Hitchhiker’s” Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers

  • K. V. Rashmi, Nihar Shah, D. Gu, H. Kuang, D. Borthakur, K. Ramchandran
SLIDE 2

Need for Redundant Storage in Data Centers

  • Frequent unavailability events in data centers
    – unreliable components
    – software glitches, maintenance shutdowns, power failures, etc.

  • Redundancy necessary for reliability and availability
SLIDE 3

Popular Approach for Redundant Storage: Replication

  • Distributed file systems used in data centers store multiple copies of data on different machines

  • Machines typically chosen on different racks
    – to tolerate rack failures

E.g., Hadoop Distributed File System (HDFS) stores 3 replicas by default
SLIDE 4

HDFS

[Figure: a FILE is divided into blocks (a–j); redundancy is introduced (3 replicas); the copies are stored distributed across the network, on machines under different top-of-rack (TOR) switches below an AS/Router.]
SLIDE 5

Massive Data Sizes: Need Alternative to Replication

  • Small to moderately sized data: disk storage is inexpensive
    – replication viable

  • No longer true for massive scales of operation
    – e.g., Facebook data warehouse cluster stores multiple tens of Petabytes (PBs)

“Erasure codes” are an alternative
SLIDE 6

Erasure Codes in Data Centers

  • Facebook data warehouse cluster
    – uses Reed-Solomon (RS) codes instead of 3-replication on a portion of the data
    – savings of multiple Petabytes of storage space
SLIDE 7

Replication: block 1 = a, block 2 = b, block 3 = a, block 4 = b
  • Overhead: 2x
  • Fault tolerance: tolerates any one failure

Reed-Solomon (RS) code: data blocks a, b; parity blocks a+b, a+2b
  • Overhead: 2x
  • Fault tolerance: tolerates any two failures

In general, erasure codes provide orders of magnitude higher reliability at much smaller storage overheads
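The toy code on this slide can be checked directly. This is a sketch using plain integer arithmetic for readability (a production Reed-Solomon code works over a finite field such as GF(2^8) so values stay byte-sized); the function names are ours:

```python
import itertools

def encode(a, b):
    """Return the four stored blocks: two data blocks and two parities."""
    return {1: a, 2: b, 3: a + b, 4: a + 2 * b}

def decode(blocks):
    """Recover (a, b) from any two surviving blocks."""
    if 1 in blocks and 2 in blocks:
        return blocks[1], blocks[2]
    if 1 in blocks and 3 in blocks:
        return blocks[1], blocks[3] - blocks[1]
    if 1 in blocks and 4 in blocks:
        return blocks[1], (blocks[4] - blocks[1]) // 2
    if 2 in blocks and 3 in blocks:
        return blocks[3] - blocks[2], blocks[2]
    if 2 in blocks and 4 in blocks:
        return blocks[4] - 2 * blocks[2], blocks[2]
    if 3 in blocks and 4 in blocks:
        b = blocks[4] - blocks[3]          # (a+2b) - (a+b)
        return blocks[3] - b, b
    raise ValueError("need at least two blocks")

stored = encode(7, 5)
# Drop any two blocks; the remaining two still determine (a, b),
# i.e. the code tolerates any two failures.
for keep in itertools.combinations(stored, 2):
    subset = {i: stored[i] for i in keep}
    assert decode(subset) == (7, 5)
```

Replication with the same 2x overhead (a, b, a, b) loses data as soon as both copies of one block fail, which is why the erasure code is strictly more reliable at equal storage.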

SLIDE 8

Outline

  • Erasure Codes in Data Centers
    – HDFS

  • Impact on the data center network
    – Problem description

  • Our system: “Hitchhiker”

  • Implementation and evaluation
    – Facebook data warehouse cluster

  • Literature
SLIDE 9

Outline

  • Erasure Codes in Data Centers
    – HDFS

  • Impact on the data center network
    – Problem description

  • Our solution: “Hitchhiker”

  • Implementation and evaluation
    – Facebook data warehouse cluster

  • Literature
SLIDE 10

Erasure codes in Data Centers: HDFS-RAID

Borthakur, “HDFS and Erasure Codes (HDFS-RAID)”
Fan, Tantisiriroj, Xiao and Gibson, “DiskReduce: RAID for Data-Intensive Scalable Computing”, PDSW 2009

[Figure: 3-replication of blocks a–j (overhead 3x) vs. a single copy of blocks a–j plus parities P1–P4 under a (10, 4) Reed-Solomon code (overhead 1.4x).]
SLIDE 11

Erasure codes in Data Centers: HDFS-RAID

3-replication of blocks a–j: overhead 3x; cannot tolerate many 3-failures

(10, 4) Reed-Solomon code: blocks a–j plus parities P1–P4; overhead 1.4x
  • Any 10 blocks sufficient
  • Can tolerate any 4-failures

Borthakur, “HDFS and Erasure Codes (HDFS-RAID)”
Fan, Tantisiriroj, Xiao and Gibson, “DiskReduce: RAID for Data-Intensive Scalable Computing”, PDSW 2009
SLIDE 12

Outline

  • Erasure Codes in Data Centers
    – HDFS

  • Impact on the data center network
    – Problem description

  • Our system: “Hitchhiker”

  • Implementation and evaluation
    – Facebook data warehouse cluster

  • Literature
SLIDE 13

Impact on Data Center Network

  • Degraded Reads
    – requesting currently unavailable data
    – on-the-fly reconstruction

  • Recovery
    – periodically replace unavailable blocks
    – to ensure desired level of reliability

[Figure: reconstruction operations span the storage layer and the network layer.]
SLIDE 14

Impact on Data Center Network

RS codes significantly increase network usage during reconstruction
SLIDE 15

Impact on Data Center Network

Replication (blocks: a, b, a, b): reconstructing a requires transferring one other copy of a
  Network transfer & disk IO = 1x

Reed-Solomon code (blocks: a, b, a+b, a+2b): reconstructing a requires transferring b and a+b
  Network transfer & disk IO = 2x

In general, for RS codes:
  Network transfer & disk IO = (#data-blocks) x (size of data to be reconstructed)

In (10, 4) RS, it is 10x
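The formula above can be checked numerically. A minimal sketch (the function name is ours; block size as used later in the evaluation):

```python
# Network transfer / disk IO to reconstruct one block under an RS code:
# (#data-blocks) x (size of the data to be reconstructed).
def rs_reconstruction_cost(num_data_blocks, block_size_mb):
    return num_data_blocks * block_size_mb

# Toy (2, 2) code from the slide: 2x the block size.
assert rs_reconstruction_cost(2, 1) == 2

# (10, 4) RS with 256 MB blocks: 10x, i.e. 2560 MB = 2.56 GB per block.
assert rs_reconstruction_cost(10, 256) == 2560

# Replication only needs one other copy of the block.
assert rs_reconstruction_cost(1, 256) == 256
```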

SLIDE 16

Impact on Data Center Network

Burdens the already oversubscribed Top-of-Rack and higher level switches

[Figure: machines 1–4 on different racks store a, b, a+b, a+2b; reconstructing a on machine 1 pulls b and a+b through the TOR switches and the router.]
SLIDE 17

Impact on Data Center Network: Facebook Data Warehouse Cluster

  • Multiple PB of Reed-Solomon encoded data
  • Median of 180 TB transferred across racks per day for RS reconstruction ≈ 5 times that under 3-replication

Rashmi et al., “A Solution to the Network Challenges of Data Recovery in Erasure-coded Storage: A Study on the Facebook Warehouse Cluster”, USENIX HotStorage Workshop 2013
SLIDE 18

RS codes: The Good and The Bad

  • Maximum possible fault-tolerance for given storage overhead
    – storage-capacity optimal
    – (“maximum distance separable” in coding theory parlance)

  • Flexibility in choice of parameters
    – Supports any number of data and parity blocks

  • Not designed to handle reconstruction operations efficiently
    – negative impact on the network
SLIDE 19

RS codes: The Good and The Bad

Maintain:
  • Maximum possible fault-tolerance for given storage overhead
    – storage-capacity optimal
    – (“maximum distance separable” in coding theory parlance)
  • Flexibility in choice of parameters
    – Supports any number of data and parity blocks

Improve:
  • Not designed to handle reconstruction operations efficiently
    – negative impact on the network

Goal: maintain the good, improve the bad
SLIDE 20

Goal

To build a system with:

Maintain:
  • Same (optimal) storage requirement and fault tolerance
  • Same (complete) flexibility in choice of design parameters

Improve:
  • Reduced data transfer across network and reduced IO from disk during reconstruction
SLIDE 21

Hitchhiker

Is a system with:

Maintain:
  ✓ Same (optimal) storage requirement and fault tolerance
  ✓ Same (complete) flexibility in choice of design parameters

Improve:
  ✓ 25 to 45% less network transfers and disk IO during reconstruction
SLIDE 22

Outline

  • Erasure Codes in Data Centers
    – HDFS

  • Impact on the data center network
    – Problem description

  • Our system: “Hitchhiker”

  • Implementation and evaluation
    – Facebook data warehouse cluster

  • Literature
SLIDE 23

HITCHHIKER: At an Abstract Level

[Figure: Hitchhiker combines (i) Hitchhiker’s erasure code, built on top of a Reed-Solomon code, and (ii) a hop-and-couple disk layout.]
SLIDE 24

Hitchhiker’s Erasure Code: Toy Example

Start with the RS code (two groups of 1 byte each):

           block 1   block 2   block 3   block 4
byte 1:    a1        b1        a1+b1     a1+2b1
byte 2:    a2        b2        a2+b2     a2+2b2
SLIDE 25

Intermediate Code

           block 1   block 2   block 3   block 4
byte 1:    a1        b1        a1+b1     a1+2b1
byte 2:    a2        b2        a2+b2     a2+2b2+a1

Add information from the first group on to the parities of the second group. No extra storage.
SLIDE 26

Storage-optimality of Intermediate Code

           block 1   block 2   block 3   block 4
byte 1:    a1        b1        a1+b1     a1+2b1
byte 2:    a2        b2        a2+b2     a2+2b2+a1

Retains failure tolerance of RS codes: can tolerate failure of any 2 nodes
(decode byte 1 exactly as in RS, then subtract the now-known a1 from a2+2b2+a1; byte 2 then decodes exactly as in RS)
SLIDE 27

Final Code

           block 1   block 2   block 3   block 4
byte 1:    a1        b1        a1+b1     a1+2b1
byte 2:    a2        b2        a2+b2     a2+2b2+a1

Subtract byte 2 of block 4 from byte 1 of block 4: an invertible operation within a block
SLIDE 28

Final Code

           block 1   block 2   block 3   block 4
byte 1:    a1        b1        a1+b1     2b1-a2-2b2
byte 2:    a2        b2        a2+b2     a2+2b2+a1

Invertible operations within blocks do not change storage or fault tolerance
SLIDE 29

Efficient Reconstruction

           block 1   block 2   block 3   block 4
byte 1:    a1        b1        a1+b1     2b1-a2-2b2
byte 2:    a2        b2        a2+b2     a2+2b2+a1

To reconstruct block 1, transfer b2, a2+b2 and a2+2b2+a1:
  a2 = (a2+b2) - b2;  a1 = (a2+2b2+a1) - a2 - 2b2

Data transferred: only 3 bytes (instead of 4 bytes as in RS)
SLIDE 30

Efficient Reconstruction

           block 1   block 2   block 3   block 4
byte 1:    a1        b1        a1+b1     2b1-a2-2b2
byte 2:    a2        b2        a2+b2     a2+2b2+a1

Data transferred: only 3 bytes (instead of 4 bytes as in RS)
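The whole toy example can be run end to end. This is a sketch over plain integers for readability (the real code works over a finite field); the function names are ours:

```python
# Toy piggybacked code from the slides: block 4 carries a1 "hitchhiking"
# on its second byte, and its first byte has a2+2b2 subtracted out
# (an invertible operation within the block).
def encode(a1, a2, b1, b2):
    return {
        1: (a1, a2),                            # data block 1
        2: (b1, b2),                            # data block 2
        3: (a1 + b1, a2 + b2),                  # parity block 3 (plain RS)
        4: (2*b1 - a2 - 2*b2, a2 + 2*b2 + a1),  # piggybacked parity block 4
    }

def reconstruct_block1(blocks):
    """Recover block 1 reading only 3 bytes: b2, a2+b2, a2+2b2+a1."""
    b2 = blocks[2][1]
    a2 = blocks[3][1] - b2          # (a2+b2) - b2
    a1 = blocks[4][1] - a2 - 2*b2   # (a2+2b2+a1) - a2 - 2b2
    return (a1, a2)

stored = encode(3, 1, 4, 5)
assert reconstruct_block1(stored) == (3, 1)
```

Plain RS would need 4 bytes (both bytes of any two surviving blocks) for the same reconstruction; here 3 suffice.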

SLIDE 31
Hitchhiker’s Erasure Code

  • Builds on top of RS codes
  • Uses our theoretical framework of “Piggybacking”*
  • Three versions
    – XOR
    – XOR+
    – non-XOR

* K. V. Rashmi, Nihar Shah, K. Ramchandran, “A Piggybacking Design Framework for Read- and Download-efficient Distributed Storage Codes”, IEEE International Symposium on Information Theory, 2013.
SLIDE 32
Hop-and-couple (disk layout)

  • Way of choosing which bytes to mix
    – couples bytes farther apart in block
    – to minimize fragmentation of reads during reconstruction

  • Translates savings in network-transfer to savings in disk-IO as well
    – by making reads contiguous
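The coupling choice above can be illustrated with a small sketch (ours, not the exact HDFS-RAID layout): coupling byte i with byte i + hop, for hop equal to half the block, means the bytes touched during reconstruction form one contiguous run, whereas coupling adjacent bytes produces a strided read pattern.

```python
def coupled_pairs(block_size, hop):
    """Byte indices (i, j) within a block that are encoded together."""
    return [(i, i + hop) for i in range(hop)]

block_size = 8
hop = block_size // 2               # hop length = half the block

pairs = coupled_pairs(block_size, hop)
# First members of all pairs are bytes 0..3: one contiguous disk read.
assert [i for i, _ in pairs] == list(range(hop))

# Adjacent coupling instead pairs (0,1), (2,3), ...; the bytes needed
# are 0, 2, 4, 6 -- fragmented reads, hence the hop.
adjacent = [(2 * i, 2 * i + 1) for i in range(block_size // 2)]
assert [i for i, _ in adjacent] == [0, 2, 4, 6]
```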

SLIDE 33

RS vs Hitchhiker from the Network’s Perspective…
SLIDE 34

Data Transfer during Reconstruction in RS-based System

Transfer: 10 full blocks. Connect to 10 machines.

[Figure: a stripe of 14 blocks of 256 MB each (data blocks 1–10, parity blocks 11–14); reconstructing one block downloads 10 full blocks.]
SLIDE 35

Data Transfer during Reconstruction in Hitchhiker

Reconstruction of data blocks 1–9:
Transfer: 2 full blocks + 9 half blocks (= 6.5 blocks total). Connect to 11 machines.

[Figure: the same 14-block stripe of 256 MB blocks; most blocks contribute only half a block.]
SLIDE 36

Data Transfer during Reconstruction in Hitchhiker

Reconstruction of block 10:
Transfer: 13 half blocks (= 6.5 blocks total). Connect to 13 machines.

[Figure: the same 14-block stripe of 256 MB blocks; every other block contributes half a block.]
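The transfer totals on the last three slides are easy to verify (a quick check; variable names are ours):

```python
# (10, 4) code with 256 MB blocks, block counts as stated on the slides.
block_mb = 256

rs = 10 * block_mb                          # RS: 10 full blocks
hh_data = 2 * block_mb + 9 * block_mb / 2   # HH, blocks 1-9: 2 full + 9 half
hh_b10 = 13 * block_mb / 2                  # HH, block 10: 13 half blocks

assert rs == 2560                  # 2.56 GB, as in the evaluation table
assert hh_data == hh_b10 == 1664   # ~1.67 GB; 6.5 blocks in both cases
assert round(1 - hh_data / rs, 2) == 0.35   # the 35% reduction
```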

SLIDE 37

Outline

  • Erasure Codes in Data Centers
    – HDFS

  • Impact on the data center network
    – Problem description

  • Our system: “Hitchhiker”

  • Implementation and evaluation
    – Facebook data warehouse cluster

  • Literature
SLIDE 38

Implementation & Evaluation Setup (1)

  • Implemented on top of HDFS-RAID
    – erasure coding module in HDFS based on RS
    – used in the Facebook data warehouse cluster

  • Deployed and tested on a 60 machine test cluster at Facebook
    – verified 35% reduction in the network transfers during reconstruction
SLIDE 39
Implementation & Evaluation Setup (2)

  • Evaluation of timing metrics on the Facebook data warehouse cluster in production
    – under real-time production traffic and workloads
    – using Map-Reduce to run encoding and reconstruction jobs, just as HDFS-RAID
SLIDE 40

Decoding Time

36% reduction (median)

  • RS decoding on only half portion of the blocks
  • Faster computation for degraded reads and recovery
  • XOR versions: 25% less than non-XOR
SLIDE 41

Read & Transfer Time

  • Read & transfer time 30% lower in Hitchhiker (HH), at both the median and the 95th percentile
  • Similar reduction for other block sizes as well

System          Data transfer   Connectivity (#machines)
RS              2.56 GB         10
HH blocks 1-9   1.67 GB         11
HH block 10     1.67 GB         13
SLIDE 42

Encoding Time

72% higher

Benefits outweigh the higher encoding cost in many systems (e.g., HDFS):
  • encoding is a one time operation
  • often run as a background job
  • does not fall along any critical path
SLIDE 43

Outline

  • Impact on the data center network
    – Problem description

  • Our system: “Hitchhiker”

  • Implementation and evaluation
    – Facebook data warehouse cluster

  • Literature
SLIDE 44

Existing Systems

  • Need additional storage
    – Huang et al. (Windows Azure) 2012, Sathiamoorthy et al. (Xorbas) 2013, Esmaili et al. (CORE) 2013

  • Add additional parities to reduce download
    – Hu et al. (NCFS) 2011

  • Highly restricted parameters
    – Khan et al. (Rotated-RS) 2012: #parity ≤ 3
    – Xiang et al. 2010, Wang et al. 2010, Hu et al. (NCCloud) 2012: #parity ≤ 2
    – Hitchhiker performs as good or better for these restricted settings as well
SLIDE 45

Hitchhiker: Summary

Code metrics:
  Storage requirement                         Same (optimal)
  Supported parameters                        All
  Fault tolerance                             Same (optimal)

Reconstruction:
  Network transfers                           35% less
  Disk IO                                     35% less
  Data read and transfer time (median)        31.8% less
  Data read and transfer time (95th %ile)     30.2% less
  Computation time (median)                   36.1% less

Encoding:
  Encoding time (median)                      72.1% more

Thanks!
SLIDE 46

Backup"Slides"

SLIDE 47

Hop-and-Couple

  • Technique to pair bytes under Hitchhiker’s erasure code
  • Makes disk reads during reconstruction contiguous

[Figure: (a) coupling adjacent bytes to form stripes vs. (b) hop-and-couple, where each 1-byte position is coupled (encoded together) with the byte a hop length away; shown across units 1–14 (10 data units, 4 parity units).]