

SLIDE 1

A mathematical theory of distributed storage

Research

Michael Luby Dagstuhl workshop (16321) Coding in the time of Big Data August 8, 2016

SLIDE 2

  • Storage clusters contain thousands of storage nodes, with e.g. 500 TB capacity per node
  • Clusters are built on commodity HW; failures are very frequent
  • Durability of data achieved via replication (3 copies → 3x storage) is too costly

Cloud storage state of affairs

Rashmi et al., “A Solution to the Network Challenges of Data Recovery in Erasure-coded Storage: A Study on the Facebook Warehouse Cluster”

Daily failed nodes in a 3000-node FB production cluster (1 month)

SLIDE 3

Cloud storage

  • Triplication

− High storage overhead (67%) and cost
− Limited durability (2 failures)
− Reactive repair → high repair bandwidth

  • Erasure Codes (RS)

− RS (9,6,3) → 33% storage overhead (MS)
− RS (14,10,4) → 29% storage overhead (FB)
− Better overhead and durability than triplication, but

− High repair bandwidth
− Degraded access

[Diagram: trade-off space spanning Triplication, Small erasure codes, and Ideal Cloud Storage]
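The overhead figures above follow directly from the code parameters. A small sketch (my own, not from the slides) computing storage overhead r/n and per-fragment repair reads for an (n, k) MDS code:

```python
# Sketch: storage overhead and repair reads for triplication vs. the RS
# configurations named on this slide. For an (n, k) MDS code the overhead
# is (n - k)/n and repairing one lost fragment reads k surviving fragments.

def overhead(n: int, k: int) -> float:
    """Storage overhead r/n for an (n, k) MDS code."""
    return (n - k) / n

def repair_reads(k: int, fragment_size: float = 1.0) -> float:
    """Data read to repair one lost fragment."""
    return k * fragment_size

schemes = {
    "triplication": (3, 1),       # 3 copies of 1 data unit
    "RS(9,6,3) (MS)": (9, 6),
    "RS(14,10,4) (FB)": (14, 10),
}

for name, (n, k) in schemes.items():
    print(f"{name}: overhead {overhead(n, k):.0%}, "
          f"repair reads {repair_reads(k):g} fragments")
```

Triplication repairs by reading a single surviving copy, while RS must read k fragments; that gap is the "high repair bandwidth" cost noted above.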

SLIDE 4

Cloud storage

[Diagram: trade-off space spanning Triplication, Small erasure codes, and Ideal Cloud Storage, with Liquid cloud storage near the ideal]

Liquid cloud storage

SLIDE 5

[Plot: peak repair BW per node [Mbps] (log scale) versus storage overhead β = r/n, comparing RS, Liquid (k=382, r=20; k=335, r=67; k=268, r=134), and Triplication]

Quantitative comparison

Liquid advantages

− Lower storage overhead
− Lower repair costs
− Better durability
− Superior trade-offs
− Customize to infrastructure

MTTDL (durability)

10^7 years : Liquid
10^6 years : Reed-Solomon
10^5 years : Triplication


SLIDE 6

Mathematical model of distributed storage

− Based on an understanding of deployed systems
− Models distributed storage system
− Models maintaining recoverability of message when nodes can fail
− Enables analysis of storage overhead & repair bandwidth trade-offs

Information-theoretic lower bounds

− Fundamental lower bounds on trade-offs

Algorithmic upper bounds

− Matching algorithmic upper bounds on trade-offs − Using standard erasure codes

Overview

SLIDE 7

[Diagram: Source → Transmitter → Receiver → Destination; message x is encoded as a signal, noise corrupts the channel, and the received signal y is decoded into recovered message z]

A mathematical model of communication – Shannon

SLIDE 8

[Diagram: Source → Storer → storage nodes → Accessor → Destination; the Storer writes data generated from message x, nodes suffer failures, and the Accessor reads data y and produces recovered message z. Time T passes between storage and access.]

A mathematical model of distributed storage

SLIDE 9

[Diagram: Repairer with local memory reads data from and writes data to storage nodes Node1, Node2, Node3, …, Noden]

Storage nodes model

Nodes can fail

SLIDE 10

When a node fails

− All data stored at a node is immediately lost
− Failed node is immediately replaced by a new node
  − Bits at the new node are initialized to zeroes

Failure process

− Determines when nodes fail
− Determines what nodes fail

Node failure model

SLIDE 11

Storer

− Writes data, generated from the message x received from a source, to the nodes

Repairer

− Aware of when and what nodes fail
− Repair process (continual loop):
  − Read data from nodes
  − Generate new data from the read data
  − Write new data to nodes

Accessor

− Reads data y from the nodes − Generates z from y and provides z to a destination

Goal – recovered message z = original message x

System overview
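The storer / repairer / accessor roles above can be sketched as a toy simulation. Everything here is my own minimal model (replication rather than erasure coding, names are hypothetical), just to show the loop and the goal z = x:

```python
# Hypothetical minimal model of the system overview: a storer writes the
# message, nodes fail, a repairer regenerates lost data, an accessor reads z.
import random

class Cluster:
    def __init__(self, n: int):
        self.nodes = [None] * n           # data stored at each node

    def fail(self, i: int):
        self.nodes[i] = None              # all data at the node is lost

class ReplicationStorer:
    """Stores message x by replicating it to every node (for illustration)."""
    def write(self, cluster: Cluster, message: str):
        cluster.nodes = [message] * len(cluster.nodes)

class Repairer:
    """Continual loop: read surviving data, regenerate, write to new nodes."""
    def repair(self, cluster: Cluster):
        survivors = [d for d in cluster.nodes if d is not None]
        if not survivors:
            return                        # message unrecoverable
        for i, d in enumerate(cluster.nodes):
            if d is None:
                cluster.nodes[i] = survivors[0]

class Accessor:
    def read(self, cluster: Cluster):
        for d in cluster.nodes:
            if d is not None:
                return d                  # recovered message z
        return None

cluster = Cluster(5)
ReplicationStorer().write(cluster, "x")
cluster.fail(random.randrange(5))         # a node fails
Repairer().repair(cluster)                # repairer restores it
assert Accessor().read(cluster) == "x"    # goal: z = x
```

A real system would store erasure-coded fragments instead of full copies; the point here is only the division of roles.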

SLIDE 12

Lower bounds

− There is a failure process, so that for any repairer:

− The average repair traffic is above a threshold function of the storage overhead

− Information theoretic

Upper bounds

− There is a repairer, so that for any failure process:

− The peak repair traffic is below a threshold function of the storage overhead

− Algorithmic, based on Liquid cloud storage algorithms

− Large erasure codes
− Lazy repair strategy
− Flow data organization

Bounds

SLIDE 13

Failure timing – determines when nodes fail

− TFixed = fixed timing, i.e., Δ-duration between failures
− TRandom = random timing, i.e., Poisson with Δ-duration average between failures

Failure pattern – determines what nodes fail

− PRandom = random pattern, i.e., a random node fails
− PAdversarial = adversarial pattern, i.e., the failed node is chosen based on all available information

Failure process

− (TFixed, PRandom)-failures
− (TFixed, PAdversarial)-failures
− (TRandom, PRandom)-failures
− (TRandom, PAdversarial)-failures
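The two timing models can be simulated directly: TFixed spaces failures exactly Δ apart, while TRandom draws exponential inter-failure times with mean Δ (a Poisson process). A sketch (parameter values are my own assumptions):

```python
# Sketch of the two failure-timing models: TFixed (exact Δ spacing) and
# TRandom (Poisson process with Δ average spacing).
import random

def tfixed_times(delta: float, count: int):
    """Failure times with fixed Δ spacing."""
    return [delta * (i + 1) for i in range(count)]

def trandom_times(delta: float, count: int, rng: random.Random):
    """Poisson process: exponential inter-failure times with mean Δ."""
    t, times = 0.0, []
    for _ in range(count):
        t += rng.expovariate(1.0 / delta)  # mean of expovariate(λ) is 1/λ
        times.append(t)
    return times

rng = random.Random(0)
fixed = tfixed_times(delta=1.0, count=10_000)
rand = trandom_times(delta=1.0, count=10_000, rng=rng)
# Both processes have (close to) the same average spacing Δ = 1
print(fixed[-1] / len(fixed), rand[-1] / len(rand))
```

The failure pattern (PRandom vs. PAdversarial) is orthogonal: it picks *which* node fails at each of these times.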

SLIDE 14

Deterministic

− Previous failure process actions determine next repairer action

Randomized

− Repairer can use a source of random bits to determine actions
− Random bits are private to the repairer (not available to the failure process)

Repairers


SLIDE 15

(TFixed, PAdversarial)-failures bounds

[Diagram: lower and upper bounds on repair traffic versus storage overhead, for a deterministic repairer under (TFixed, PAdversarial)-failures]

Bounds are equal (asymptotically as the storage overhead goes to zero)

Bounds on storage overhead versus repairer traffic

SLIDE 16

Main bounds

− Main lower bound: randomized repairer interacting with (TFixed, PRandom)-failures
− Main upper bound: deterministic repairer interacting with (TRandom, PAdversarial)-failures
− Bounds are equal (asymptotically as the storage overhead goes to zero)
− Both main bounds apply to (TRandom, PRandom)-failures

SLIDE 17

Storage overhead

− β = 1 − m/c = storage overhead
− m = size of message x
− c = n·s = total storage capacity
  − n = number of storage nodes
  − s = storage capacity per node

Repairer read rate

− RAVG = lower bound on average repair read rate
− RPEAK = upper bound on peak repair read rate

Durability

MTTDL = mean time until at least some of the message is unrecoverable

Definitions for bounds
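The definitions above can be turned into a small calculator. The example numbers below are my own illustrations (reusing the 3000-node, 500 TB figures from earlier slides), not values from this slide:

```python
# Sketch: storage overhead β = 1 − m/c with c = n·s, per the definitions above.
def storage_overhead(m: float, n: int, s: float) -> float:
    """β = 1 − m/c, where c = n·s is the total storage capacity."""
    c = n * s
    return 1.0 - m / c

# e.g. 3000 nodes of 500 TB each storing a 1,125,000 TB message (illustrative):
beta = storage_overhead(m=1_125_000, n=3000, s=500)
print(f"β = {beta:.2f}")   # β = 0.25
```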

SLIDE 18

Lower bound when β = 0.25

− RAVG ≥ 0.815 · s/(2β·Δ) is necessary to guarantee message recovery

Upper bound when β = 0.25

− RPEAK ≤ 1.31 · s/(2β·Δ) is sufficient to guarantee message recovery

Asymptotic as β → 0

− RAVG → s/(2β·Δ) ← RPEAK

(TFixed, PAdversarial)-failures bounds
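Both bounds are multiples of the asymptotic threshold s/(2β·Δ). A quick numeric sketch (my own) evaluating them at β = 0.25 with s = Δ = 1, the normalization used in the later plots:

```python
# Sketch: the (TFixed, PAdversarial) bounds as multiples of the asymptotic
# threshold s/(2β·Δ), using the constants 0.815 and 1.31 from this slide.
def asymptotic_rate(s: float, beta: float, delta: float) -> float:
    """Threshold s/(2β·Δ) that both bounds approach as β → 0."""
    return s / (2 * beta * delta)

s, delta, beta = 1.0, 1.0, 0.25
threshold = asymptotic_rate(s, beta, delta)
lower = 0.815 * threshold   # average read rate at least this: necessary
upper = 1.31 * threshold    # peak read rate at most this: sufficient
print(threshold, lower, upper)   # 2.0 1.63 2.62
```

The gap between 0.815 and 1.31 at β = 0.25 closes as β → 0, where both constants approach 1.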

SLIDE 19

Lower bound

− (TFixed, PRandom)-failures interacting with a randomized repairer
− RAVG ≥ s/(2β·Δ) is necessary to achieve a large MTTDL

Upper bound

− (TRandom, PAdversarial)-failures interacting with a deterministic repairer
− RPEAK ≤ s/(2β·Δ) is sufficient to achieve a large MTTDL

Main results as storage overhead β → 0

SLIDE 20

[Plot: repairer read rate versus storage overhead β, with s = 1 and Δ = 1: the Upper Bound (RPEAK), the 1/(2β) bound, the (TFixed, PAdversarial) lower bound (RAVG), and the (TFixed, PRandom) lower bound (RAVG)]

Visualization of bounds tradeoffs

SLIDE 21

[Plot: repairer read rate normalized by the asymptotic bound, versus storage overhead β: Upper·2β, (TFixed, PAdversarial)-Lower·2β, and (TFixed, PRandom)-Lower·2β]

Visualization of ratio: upper and lower bounds to asymptotic

SLIDE 22

Repairer-failure process game

− Repairer trying to ensure the message is recoverable
− Failure process trying to make the message unrecoverable

Transcript

− Record of interactions between repairer and failure process:
  − When nodes fail
  − What nodes fail
  − What bits are read by the repairer, etc.

The (lower bound) game

SLIDE 23

Snapshot at time t

− The c stored bits at the nodes at time t

At time t′ > t

− r(t, t′) = # snapshot bits read between t and t′
− l(t, t′) = # snapshot bits lost (erased before being read) between t and t′
− u(t, t′) = # snapshot bits unmodified between t and t′
− c = r(t, t′) + l(t, t′) + u(t, t′) is an invariant

Initially

− r(t, t) = l(t, t) = 0, u(t, t) = c

Claim

− If l(t, t′) > c − m = β·c then the message is unrecoverable at time t′
− Equivalently, if r(t, t′) + u(t, t′) < m then the message is unrecoverable at time t′

Snapshot interval (t, t′)
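The snapshot bookkeeping can be checked with a toy simulation. This is my own sketch (the event mix and sizes are arbitrary assumptions): it interleaves repairer reads with node failures and asserts the invariant c = r + l + u after every step:

```python
# Sketch: track snapshot bits read (r), lost (l), and unmodified (u) per node,
# and verify the invariant c = r + l + u as reads and failures interleave.
import random

rng = random.Random(1)
n, s = 10, 100
c = n * s                 # total snapshot size in bits
unread = [s] * n          # snapshot bits at each node neither read nor lost yet
r = l = 0                 # snapshot bits read / lost so far

for _ in range(25):
    i = rng.randrange(n)
    if rng.random() < 0.5:                    # repairer reads some snapshot bits
        take = min(unread[i], rng.randrange(1, 20))
        unread[i] -= take
        r += take
    else:                                     # node i fails: its unread snapshot bits are lost
        l += unread[i]
        unread[i] = 0
    assert r + l + sum(unread) == c           # invariant c = r + l + u

m = int(0.75 * c)                             # message size with β = 0.25
print("lost so far:", l, "| unrecoverable once lost >", c - m)
```

Bits that were read before their node failed are counted in r, not l, which is exactly why lazy reading can still lose the race if l exceeds c − m.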

SLIDE 24

Intuition

[Diagram: 9 storage nodes. Message size m is the capacity of 6 nodes; storage overhead c − m is the capacity of 3 nodes (β = 1/3). If more than 3 nodes' worth of snapshot bits are lost, the message is not recoverable; otherwise it may be recoverable.]

SLIDE 25

Suppose erased and read bits are disjoint

− Implies all erased bits are lost
− Necessary condition for message recoverability:
  − Repairer needs to read m bits before the failure process erases c − m bits
− Failure process erases bits at a rate s/Δ
− Repairer needs to read bits at a rate of at least m·s / ((c − m)·Δ) = (1 − β)·s / (β·Δ)

Generally erased and read bits are not disjoint

− Repairer can read bits from a node before the node fails
− The bits that have been read are not lost if the node fails
− Number of bits lost when a node fails can be less than s

Intuition analysis
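The disjoint-case rate follows from a one-line derivation: erasing c − m bits at rate s/Δ takes (c − m)·Δ/s time, and m bits must be read within it. A sketch evaluating it (my own; the β = 1/3 value matches the 9-node intuition figure):

```python
# Sketch: minimum repairer read rate when erased and read bits are disjoint:
# m·s/((c − m)·Δ) = (1 − β)·s/(β·Δ), since m = (1 − β)·c and c − m = β·c.
def required_read_rate(s: float, beta: float, delta: float) -> float:
    """Read m bits before c − m bits are erased at rate s/Δ."""
    return (1 - beta) * s / (beta * delta)

# With β = 1/3 (the 9-node example), s = 1, Δ = 1:
print(required_read_rate(s=1.0, beta=1/3, delta=1.0))   # 2.0
```

Note the rate blows up as β → 0, which is why reading bits *before* their node fails (the non-disjoint case) is essential to the tighter s/(2β·Δ) bounds.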

SLIDE 26

Snapshot interval evolution

[Diagram: 9 storage nodes. Message size m is the capacity of 6 nodes; storage overhead c − m is the capacity of 3 nodes (β = 1/3).]

SLIDE 27

Snapshot interval (high or low)

− High: average read traffic per node failure since the snapshot exceeds b before 2βn failures
− Low: average read traffic per node failure since the snapshot is ≤ b for all 2βn failures

b = bound on average read traffic per node failure (one failure per Δ)

b = fac(2β) · s/(2β)

fac(x) = x² / (2 · (x − (1 − x) · ln(1/(1 − x)))),  with 1/2 = fac(1) ≤ fac(x) ≤ fac(0) = 1

SLIDE 28

(TFixed, PAdversarial)-failure process choices

[Diagram: decision tree of failure choices. At the start of the interval the failure process can fail node 1, 2, 3, or 4; at the second failure, node 1, 3, or 4; at the third failure, node 1 or 4; each path ends in a high or low interval.]

− If there is a path that leads to a high interval, the failure process chooses this path to define the interval
− The repairer's average read rate per node failure is at least b over the high interval

SLIDE 29

(TFixed, PAdversarial)-failure process choices

[Diagram: decision tree across the first, second, third, and fourth node failures, where all paths lead to low intervals]

− When all paths lead to low intervals, we show this implies the message is unrecoverable for at least one path

SLIDE 30

Worst case (when all paths define low intervals)

− Low interval: 2βn failures since the snapshot, with b = bound on average read traffic per node failure
− Average read traffic per node failure since the snapshot = b for all 2βn failures
− This holds for each of the n · (n − 1) · ⋯ · (n − 2βn) possible paths

SLIDE 31

Expected behavior when all paths define low intervals

[Diagram: node failures 1, 2, 3, …, 2βn shown in the order of their occurrence, starting at the snapshot]

SLIDE 32

Expected behavior when all paths define low intervals

− Number of node failures = 2βn
− Total capacity erased = 2βn · s
− Expected number of bits lost > βn · s = β·c = c − m
− Implies there is a path where the number of bits lost > c − m
− Implies that the message is unrecoverable for the interval defined by this path
− Failure process chooses this path to define the interval

[Diagram: node failures 1, 2, 3, …, 2βn shown in the order of their occurrence, starting at the snapshot]
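The counting identity behind the unrecoverability threshold is βn·s = β·c = c − m (since c = n·s and β = 1 − m/c). A sketch checking it with the 9-node, β = 1/3 example (my own numbers):

```python
# Sketch: verify that the loss threshold β·n·s equals c − m, the storage
# overhead in bits, using the 9-node β = 1/3 example from earlier slides.
def loss_threshold(n: int, s: float, beta: float) -> float:
    """β·n·s = β·c = c − m: losing more snapshot bits than this is fatal."""
    return beta * n * s

n, s, beta = 9, 100.0, 1/3
c = n * s
m = (1 - beta) * c
failures = int(2 * beta * n)        # 2βn failures since the snapshot
erased = failures * s               # total capacity erased across them
print(failures, erased, loss_threshold(n, s, beta), c - m)
```

Since the total capacity erased over the interval is 2βn·s, twice the threshold, the failure process only needs the *lost* (unread-before-erase) fraction to exceed half of what it erases.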

SLIDE 33

Not all node failures are snap failures

− Random nodes fail independently of snapshots
− Most failures are snap failures when the storage overhead is small

Node failure timing variance

− Repairer may decide its strategy based on the timing between node failures
− Show that on average even short intervals have average failure timing

Random failures instead of adversarial pattern

− Failure process cannot choose which transcripts to continue and which to discard
− Previous argument shows the message is unrecoverable in expectation
− Sub-martingale argument transforms the expectation into a high-probability result

(TFixed, PAdversarial) → (TFixed, PRandom)

SLIDE 34

[Diagram: Repairer with local memory reads data through functions F1, F2, …, Fn computed at Node1, Node2, …, Noden, and writes data to the nodes]

Functional storage nodes model

SLIDE 35

Functional model

− Previous model is a special case

Allows more powerful repairers

− Can perform arbitrary local computation on data at the node before it is provided to the repairer

− Local traffic between node storage and the CPU at the node doesn't count as read traffic

− Allows repairer to use things like network coding

− Regenerating codes (Dimakis et al.) looked at doing things like this

− Allows arbitrary techniques at node

− Beyond standard erasure codes or network coding, etc.

Functional lower bounds = lower bounds stated previously

− Upper bounds use standard erasure codes and read data directly from nodes
− Upper bounds = lower bounds
− Network coding is unnecessary to achieve optimality

Functional storage nodes model

SLIDE 36

Transient failures

− Delay between node failure and repair of the node is needed to avoid unnecessary repair

Silent data loss

− Need to read data to know whether or not it is lost

Rebuild and replacement of nodes issues

− Update infrastructure over time (replace old infra with newer infra)
− Nodes often introduced in batches – a robotic or manual process

Distributed repair, measure repair bandwidth per node

− Avoid network hotspots

Different failure models

− Varying failure rate
− Correlated failures

More advanced models for distributed storage

SLIDE 37

References to “Qualcomm” may mean Qualcomm Incorporated, or subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm’s engineering, research and development functions, and substantially all of its product and services businesses, including its semiconductor business and QCT.

For more information on Qualcomm, visit us at: www.qualcomm.com & www.qualcomm.com/blog

Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may be trademarks or registered trademarks of their respective owners.

Thank you
