A mathematical theory of distributed storage
Dagstuhl workshop (16321): Coding in the time of Big Data
Michael Luby, Qualcomm Research
August 8, 2016
- Storage clusters contain thousands of storage nodes, with e.g. 500 TB capacity per node
- Clusters are built on commodity HW; failures are very frequent
- Durability of data achieved via replication (3 copies → 3x storage), which is too costly
Cloud storage state of affairs
Rashmi et al., “A Solution to the Network Challenges of Data Recovery in Erasure-coded Storage: A Study on the Facebook Warehouse Cluster”
Daily failed nodes in a 3000-node FB production cluster (1 month)
Cloud storage
- Triplication
− High storage overhead (67%) and cost
− Limited durability (2 failures)
− Reactive repair → high repair bandwidth
- Erasure Codes (RS)
− RS (9,6,3) → 33% storage overhead (MS)
− RS (14,10,4) → 29% storage overhead (FB)
− Better overhead and durability than triplication, but:
− High repair bandwidth (see the sketch after this slide)
− Degraded access
Cloud storage
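The repair-bandwidth cost behind these bullets can be made concrete with a little arithmetic: for an (n, k) MDS code the storage overhead is (n − k)/n, and repairing one lost fragment conventionally requires reading k surviving fragments. A minimal sketch (the scheme parameters are the ones quoted above):

```python
# Storage overhead and per-repair read cost for replication vs. small RS codes.
schemes = {
    "Triplication":     {"n": 3,  "k": 1},
    "RS(9,6,3) (MS)":   {"n": 9,  "k": 6},
    "RS(14,10,4) (FB)": {"n": 14, "k": 10},
}

for name, p in schemes.items():
    n, k = p["n"], p["k"]
    overhead = (n - k) / n                 # fraction of stored bits that are redundancy
    repair_reads = 1 if k == 1 else k      # fragments read to rebuild one lost fragment
    print(f"{name:17s} overhead = {overhead:5.1%}, fragments read per repair = {repair_reads}")
```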
[Figure: qualitative positioning of triplication, small erasure codes, and Liquid cloud storage relative to ideal cloud storage]
Liquid cloud storage
[Figure: peak repair bandwidth per node (Mbps, log scale) versus storage overhead β = r/n, comparing triplication, RS, and Liquid codes with (k=382, r=20), (k=335, r=67), (k=268, r=134)]
Quantitative comparison
Liquid advantages
− Lower storage overhead
− Lower repair costs
− Better durability
− Superior trade-offs
− Customize to infrastructure
MTTDL (durability)
− 10^7 years: Liquid
− 10^6 years: Reed-Solomon
− 10^5 years: Triplication
Mathematical model of distributed storage
− Based on an understanding of deployed systems
− Models distributed storage system
− Models maintaining recoverability of message when nodes can fail
− Enables analysis of storage overhead & repair bandwidth trade-offs
Information-theoretic lower bounds
− Fundamental lower bounds on trade-offs
Algorithmic upper bounds
− Matching algorithmic upper bounds on trade-offs
− Using standard erasure codes
Overview
[Figure: Shannon's communication model: a Source produces message x, the Transmitter sends a signal over a channel with noise, the Receiver obtains received signal y and delivers message z to the Destination]
A mathematical model of communication – Shannon
[Figure: distributed storage model: a Source gives message x to the Storer, which writes data to the nodes; node failures occur and time T passes between storage and access; the Accessor reads data y from the nodes and delivers recovered message z to the Destination]
A mathematical model of distributed storage
[Diagram: the Repairer, with local memory, reads data from and writes data to storage nodes Node1, Node2, Node3, …, Noden]
Storage nodes model
Nodes can fail
When a node fails
− All data stored at a node is immediately lost
− Failed node is immediately replaced by a new node
− Bits at new node are initialized to zeroes
Failure process
− Determines when nodes fail
− Determines what nodes fail
Node failure model
Storer
− Writes data, generated from message x received from a source, to the nodes
Repairer
− Aware of when and what nodes fail
− Repair process (continual loop):
  − Read data from nodes
  − Generate new data from the read data
  − Write new data to nodes
Accessor
− Reads data y from the nodes
− Generates z from y and provides z to a destination
Goal – recovered message z = original message x
System overview
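The roles above map directly onto a small simulation skeleton. The sketch below is illustrative only: a trivial stand-in "code" (full replication) replaces the actual erasure code, and every name is an assumption; the point is the Storer/Repairer/Accessor structure and the goal z = x.

```python
import random

class Cluster:
    """n storage nodes; a failed node comes back empty (bits zeroed)."""
    def __init__(self, n):
        self.nodes = [None] * n

    def fail(self, i):
        self.nodes[i] = None        # all data at the node is immediately lost

def store(cluster, message):
    """Storer: write data generated from message x to the nodes."""
    for i in range(len(cluster.nodes)):
        cluster.nodes[i] = message  # trivial stand-in code: every node holds a full copy

def repair_step(cluster):
    """Repairer loop body: read data, generate new data, write it back."""
    survivor = next(d for d in cluster.nodes if d is not None)
    for i, d in enumerate(cluster.nodes):
        if d is None:
            cluster.nodes[i] = survivor

def access(cluster):
    """Accessor: read data y from the nodes and produce z."""
    return next((d for d in cluster.nodes if d is not None), None)

cluster = Cluster(n=5)
store(cluster, "x")
cluster.fail(random.randrange(5))   # one node failure
repair_step(cluster)                # one iteration of the continual repair loop
assert access(cluster) == "x"       # goal: recovered message z equals original message x
```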
Lower bounds
− There is a failure process, so that for any repairer:
− The average repair traffic is above a threshold function of the storage overhead
− Information theoretic
Upper bounds
− There is a repairer, so that for any failure process:
− The peak repair traffic is below a threshold function of the storage overhead
− Algorithmic, based on Liquid cloud storage algorithms
− Large erasure codes
− Lazy repair strategy
− Flow data organization
Bounds
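One way to read the "large erasure codes + lazy repair" bullets above is as a thresholded repair policy: with a large (n, k) code the redundancy budget is big, so repair can be deferred and batched instead of triggered per failure. The sketch below is only an illustration of that idea (the threshold policy and all names are assumptions, not the actual Liquid algorithms).

```python
# Sketch of the lazy-repair idea behind the upper bound (illustrative only).
def lazy_repair(missing, n, k, threshold):
    """Return how many fragments to regenerate now (0 = keep waiting)."""
    if missing >= n - k:
        raise RuntimeError("too many losses: object no longer recoverable")
    if missing >= threshold:
        return missing            # regenerate everything that is missing, in bulk
    return 0                      # lazy: defer repair and save bandwidth for now

# Example: n = 1000, k = 750 (25% overhead); start repairing once roughly half
# of the redundancy budget has been consumed.
print(lazy_repair(missing=90,  n=1000, k=750, threshold=125))   # -> 0 (wait)
print(lazy_repair(missing=130, n=1000, k=750, threshold=125))   # -> 130 (batch repair)
```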
Failure timing – determines when nodes fail
− TFixed = fixed timing, i.e., Δ-duration between failures
− TRandom = random timing, i.e., Poisson with Δ-duration average between failures
Failure pattern – determines what nodes fail
− PRandom = random pattern, i.e., a random node fails
− PAdversarial = adversarial pattern, i.e., the failed node is chosen based on all available information
Failure process
Four combinations: (TFixed, PRandom)-failures, (TFixed, PAdversarial)-failures, (TRandom, PRandom)-failures, (TRandom, PAdversarial)-failures
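For concreteness, the two timing models and the random pattern can be sampled as below (a sketch under the definitions above; function names are assumptions, and the adversarial pattern cannot be sampled generically because it may use all available information about a particular repairer).

```python
import random

def t_fixed(delta):
    """TFixed: exactly delta time units between consecutive failures."""
    while True:
        yield delta

def t_random(delta):
    """TRandom: Poisson process, i.e., exponential gaps with mean delta."""
    while True:
        yield random.expovariate(1.0 / delta)

def p_random(n):
    """PRandom: the node that fails is chosen uniformly at random."""
    return random.randrange(n)

# Sample the first five failures of a (TRandom, PRandom) process on 3000 nodes.
gaps, t, events = t_random(delta=1.0), 0.0, []
for _ in range(5):
    t += next(gaps)
    events.append((round(t, 2), p_random(3000)))
print(events)          # [(time_of_failure, failed_node), ...]
```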
Deterministic
− Previous failure process actions determine next repairer action
Randomized
− Repairer can use a source of random bits to determine actions
− Random bits are private to the repairer (not available to the failure process)
Repairers
[Diagram: the two repairer types, deterministic and randomized]
Bounds on storage overhead versus repairer traffic, for a deterministic repairer against (TFixed, PAdversarial)-failures:
− Lower bound and upper bound
− Bounds are equal (asymptotically as storage overhead goes to zero)
(TFixed, PAdversarial)-failures bounds
Main lower bound: randomized repairer against (TFixed, PRandom)-failures
Main upper bound: deterministic repairer against (TRandom, PAdversarial)-failures
− Both carry over to (TFixed, PAdversarial)-failures, where the bounds are equal (asymptotically as storage overhead goes to zero)
− Both main bounds apply to random failures, i.e., (TRandom, PRandom)-failures
Main bounds
Storage overhead
− β = 1 − m/c = storage overhead
− m = size of message x
− c = n·s = total storage capacity
− n = number of storage nodes
− s = storage capacity per node
Repairer read rate
− R_AVG = lower bound on average repair read rate
− R_PEAK = upper bound on peak repair read rate
Durability
MTTDL = mean time till at least some of the message is unrecoverable
Definitions for bounds
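A quick instantiation of the definitions above (the node count and per-node capacity echo figures quoted earlier in the deck; the 75% fill level is an assumption chosen to give β = 0.25):

```python
# Definitions for bounds, instantiated with illustrative numbers.
n = 3000                      # number of storage nodes
s = 500 * 10**12 * 8          # storage capacity per node: 500 TB, in bits
c = n * s                     # total storage capacity
m = int(0.75 * c)             # size of message x (assume it fills 75% of capacity)
beta = 1 - m / c              # storage overhead
print(f"c = {c:.3e} bits, beta = {beta:.2f}")    # beta = 0.25 for this choice
```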
Lower bound when β = 0.25
− R_AVG ≥ 0.815 ∙ s/(2β∙Δ) is necessary to guarantee message recovery
Upper bound when β = 0.25
− R_PEAK ≤ 1.31 ∙ s/(2β∙Δ) is sufficient to guarantee message recovery
Asymptotic as β → 0
− R_AVG → s/(2β∙Δ) ← R_PEAK
(TFixed, PAdversarial)-failures bounds
Lower bound
− (TFixed, PRandom)-failures interacting with randomized repairer
− R_AVG ≥ s/(2β∙Δ) is necessary to achieve a large MTTDL
Upper bound
− (TRandom, PAdversarial)-failures interacting with deterministic repairer
− R_PEAK ≤ s/(2β∙Δ) is sufficient to achieve a large MTTDL
Main results as storage overhead β → 0
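To get a feel for the scale of these bounds, the sketch below evaluates s/(2·β·Δ) for a few overheads. The per-node capacity echoes the 500 TB figure quoted earlier; the failure spacing Δ and the even per-node split of the repair traffic are assumptions made only for this example.

```python
# Evaluate the asymptotic repair read-rate bound R ≈ s / (2·β·Δ).
n       = 3000                 # storage nodes
s_bits  = 500 * 10**12 * 8     # 500 TB per node, in bits
delta_s = 30 * 60              # assume one node failure every 30 minutes

for beta in (0.33, 0.15, 0.05):
    rate = s_bits / (2 * beta * delta_s)   # cluster-wide repair read rate (bits/s)
    print(f"beta = {beta:.2f}: cluster ≈ {rate / 1e9:9.1f} Gbps, "
          f"per node ≈ {rate / n / 1e6:7.1f} Mbps")
```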
[Figure: repairer read rate versus storage overhead β, with s = 1 and Δ = 1, showing the upper bound (R_PEAK), the 1/(2β) asymptote, and the (TFixed, PAdversarial)- and (TFixed, PRandom)-lower bounds (R_AVG)]
Visualization of bounds tradeoffs
[Figure: the same bounds normalized by the asymptote, i.e., upper bound∙2β and the (TFixed, PAdversarial)- and (TFixed, PRandom)-lower bounds∙2β, versus storage overhead β]
Visualization of ratio: upper and lower to asymptotic
Repairer-failure process game
− Repairer trying to ensure message is recoverable
− Failure process trying to make message unrecoverable
Transcript
− Record of interactions between repairer and failure process:
  − When nodes fail
  − What nodes fail
  − What bits are read by the repairer, etc.
The (lower bound) game
Snapshot at time t
− The c stored bits at the nodes at time t
At time t' > t
− r(t, t') = # snapshot bits read between t and t'
− l(t, t') = # snapshot bits lost (erased before being read) between t and t'
− u(t, t') = # snapshot bits unmodified between t and t'
− c = r(t, t') + l(t, t') + u(t, t') is an invariant
Initially
− r(t, t) = l(t, t) = 0, u(t, t) = c
Claim
− If l(t, t') > c − m = β∙c then the message is unrecoverable at time t'
− Equivalently, if r(t, t') + u(t, t') < m then the message is unrecoverable at time t'
Snapshot interval (t, t')
[Figure: nine storage nodes, Node1–Node9; message size m is the capacity of 6 nodes, storage overhead c − m is the capacity of 3 nodes (β = 1/3); if more than 3 nodes' worth of snapshot bits are lost the message is not recoverable, otherwise it may be recoverable]
Intuition
Suppose erased and read bits are disjoint
− Implies all erased bits are lost
− Necessary condition for message recoverability:
  − Repairer needs to read m bits before failure process erases c − m bits
  − Failure process erases bits at a rate s/Δ
  − Repairer needs to read bits at a rate of at least (m∙s)/((c − m)∙Δ) = ((1 − β)∙s)/(β∙Δ)
Generally erased and read bits not disjoint
− Repairer can read bits from node before node fails
− The bits that have been read are not lost if the node fails
− Number of bits lost when node fails can be less than s
Intuition analysis
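A minimal derivation of the disjoint-case rate above, using only the deck's definitions m = (1 − β)·c, c − m = β·c, and c = n·s:

```latex
% The repairer must read m bits before c - m bits are erased.  One node of s
% bits is erased every \Delta, so erasing c - m bits takes (c - m)\Delta/s time.
\[
  R \;\ge\; \frac{m}{(c-m)\,\Delta/s}
    \;=\; \frac{m\,s}{(c-m)\,\Delta}
    \;=\; \frac{(1-\beta)\,c\,s}{\beta\,c\,\Delta}
    \;=\; \frac{(1-\beta)\,s}{\beta\,\Delta}.
\]
```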
[Figure: the same nine-node example (message = 6 nodes, overhead = 3 nodes, β = 1/3), showing how the snapshot bits at the nodes evolve over an interval]
Snapshot interval evolution
Snapshot interval (high or low)
− High: average read traffic per node failure since snapshot > b before 2βn failures since snapshot
− Low: average read traffic per node failure since snapshot ≤ b for all 2βn failures since snapshot
− b = bound on average read traffic per node failure
− b = FAC(2β) ∙ s/(2β)
− FAC(β) = β² / (2∙(β − (1 − β)∙ln(1/(1 − β)))), with 1/2 = FAC(1) ≤ FAC(β) ≤ FAC(0) = 1
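A small numerical check of the FAC expression above (the exact formula is reconstructed here from the slide and should be treated as such): it reproduces the endpoints FAC(0) = 1 and FAC(1) = 1/2, and at β = 0.25 the factor FAC(2β) ≈ 0.815 matches the constant quoted in the earlier β = 0.25 lower bound.

```python
import math

def fac(beta):
    """FAC(beta) = beta^2 / (2*(beta - (1 - beta)*ln(1/(1 - beta)))), with its limits at 0 and 1."""
    if beta == 0.0:
        return 1.0
    if beta == 1.0:
        return 0.5
    return beta**2 / (2 * (beta - (1 - beta) * math.log(1 / (1 - beta))))

s = 1.0                                   # node capacity, normalized as in the plots
for beta in (0.05, 0.15, 0.25):
    b = fac(2 * beta) * s / (2 * beta)    # bound on average read traffic per node failure
    print(f"beta = {beta:.2f}: FAC(2*beta) = {fac(2 * beta):.3f}, b = {b:.3f}")
# beta = 0.25 prints FAC(0.50) = 0.815, the factor appearing in the earlier lower bound.
```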
(TFixed, PAdversarial)-failure process choices
[Diagram: decision tree over the interval: at the start of the interval and at the first, second, third, … node failure, the failure process may fail any surviving node (Fail node 1, Fail node 2, Fail node 3, Fail node 4, …); each branch ends in a High or Low interval]
− There is a path that leads to a high interval
− Failure process chooses this path to define the interval
− Repairer average read rate per node failure is at least b over the high interval
(TFixed, PAdversarial)-failure process choices
[Diagram: the same decision tree in the case where every path through the first, second, third, fourth, … node failures leads to a Low interval]
− All paths lead to low intervals
− We show this implies the message is unrecoverable for at least one path
Worst case (when all paths define low intervals)
[Diagram: timeline of the 2βn node failures since the snapshot, each with read traffic b, where b = bound on average read traffic per node failure]
− Average read traffic per node failure since snapshot = b for all 2βn failures
− This holds for each of the n ∙ (n − 1) ∙ ⋯ ∙ (n − 2βn) possible paths
Expected behavior when all paths define low intervals
[Diagram: node failures shown in the order of their occurrence starting at the snapshot: 1, 2, 3, …, 2βn]
− Number of node failures = 2βn
− Total capacity erased = 2βn ∙ s
− Expected number of bits lost > βn ∙ s = β∙c = c − m
− Implies there is a path where the number of bits lost > c − m
− Implies that the message is unrecoverable for the interval defined by this path
− Failure process chooses this path to define the interval
Not all node failures are snap failures
− Random nodes fail independent of snapshots
− Most failures are snap failures when storage overhead is small
Node failure timing variance
− Repairer may decide strategy based on timing between node failures
− Show that on average even short intervals have average failure timing
Random failures instead of adversarial pattern
− Failure process cannot choose which transcripts to continue and discard
− Previous argument shows message unrecoverable in expectation
− Sub-martingale: transforms expectation to high probability result
(TFixed, PAdversarial) → (TFixed, PRandom)
[Diagram: the Repairer, with local memory, reads data from and writes data to storage nodes Node1, Node2, …, Noden; each node applies a local function F1, F2, …, Fn to its stored data before providing it to the Repairer]
Functional storage nodes model
Functional model
− Previous model is a special case
Allows more powerful repairers
− Can perform arbitrary local node computation on data before provided to repairer
− Local network traffic between node storage and CPU at node doesn’t count in read traffic
− Allows repairer to use things like network coding
− Regenerating codes (Dimakis et al.) looked at doing things like this
− Allows arbitrary techniques at node
− Beyond standard erasure codes or network coding, etc.
Functional lower bounds = lower bounds stated previously
− Upper bounds use standard erasure codes & read data directly from nodes
− Upper bounds = lower bounds
− Network coding unnecessary to achieve optimality
Functional storage nodes model
Transient failures
− Delay between node failure and repair of node needed to avoid unnecessary repair
Silent data loss
− Need to read data to know whether or not it is lost
Rebuild and replacement of nodes issues
− Update infrastructure over time (replace old infra with newer infra)
− Nodes often introduced in batches – a robotic or manual process
Distributed repair, measure repair bandwidth per node
− Avoid network hotspots
Different failure models
− Varying failure rate
− Correlated failures
More advanced models for distributed storage