A mathematical theory of distributed storage
Dagstuhl workshop (16321): Coding in the time of Big Data
Michael Luby, Qualcomm Research
August 8, 2016
- Storage clusters contain thousands of storage nodes, with e.g. 500 TB capacity per node
- Clusters are built on commodity HW; failures are very frequent
- Durability of data achieved via replication (3 copies → 3x storage), which is too costly
Cloud storage state of affairs
Rashmi et al., “A Solution to the Network Challenges of Data Recovery in Erasure-coded Storage: A Study on the Facebook Warehouse Cluster”
Daily failed nodes in a 3000-node FB production cluster (1 month)
Cloud storage
- Triplication
− High storage overhead (67%) and cost
− Limited durability (2 failures)
− Reactive repair → high repair bandwidth
- Erasure Codes (RS)
− RS (9,6,3) → 33% storage overhead (MS)
− RS (14,10,4) → 29% storage overhead (FB)
− Better overhead and durability than triplication, but:
− High repair bandwidth (see the sketch after this slide)
− Degraded access
Cloud storage
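The repair-bandwidth cost behind these bullets can be made concrete with a little arithmetic: for an (n, k) MDS code the storage overhead is (n − k)/n, and repairing one lost fragment conventionally requires reading k surviving fragments. A minimal sketch (the scheme parameters are the ones quoted above):

```python
# Storage overhead and per-repair read cost for replication vs. small RS codes.
schemes = {
    "Triplication":     {"n": 3,  "k": 1},
    "RS(9,6,3) (MS)":   {"n": 9,  "k": 6},
    "RS(14,10,4) (FB)": {"n": 14, "k": 10},
}

for name, p in schemes.items():
    n, k = p["n"], p["k"]
    overhead = (n - k) / n                 # fraction of stored bits that are redundancy
    repair_reads = 1 if k == 1 else k      # fragments read to rebuild one lost fragment
    print(f"{name:17s} overhead = {overhead:5.1%}, fragments read per repair = {repair_reads}")
```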
[Figure: qualitative positioning of triplication, small erasure codes, and Liquid cloud storage relative to ideal cloud storage]
Liquid cloud storage
[Figure: peak repair bandwidth per node (Mbps, log scale) versus storage overhead β = r/n, comparing triplication, RS, and Liquid codes with (k=382, r=20), (k=335, r=67), (k=268, r=134)]
Quantitative comparison
Liquid advantages
− Lower storage overhead
− Lower repair costs
− Better durability
− Superior trade-offs
− Customize to infrastructure
MTTDL (durability)
− 10^7 years: Liquid
− 10^6 years: Reed-Solomon
− 10^5 years: Triplication
Mathematical model of distributed storage
− Based on an understanding of deployed systems
− Models distributed storage system
− Models maintaining recoverability of message when nodes can fail
− Enables analysis of storage overhead & repair bandwidth trade-offs
Information-theoretic lower bounds
− Fundamental lower bounds on trade-offs
Algorithmic upper bounds
− Matching algorithmic upper bounds on trade-offs
− Using standard erasure codes
Overview
[Figure: Shannon's communication model: a Source produces message x, the Transmitter sends a signal over a channel with noise, the Receiver obtains received signal y and delivers message z to the Destination]
A mathematical model of communication – Shannon
[Figure: distributed storage model: a Source gives message x to the Storer, which writes data to the nodes; node failures occur and time T passes between storage and access; the Accessor reads data y from the nodes and delivers recovered message z to the Destination]
A mathematical model of distributed storage
[Diagram: the Repairer, with local memory, reads data from and writes data to storage nodes Node1, Node2, Node3, …, Noden]
Storage nodes model
Nodes can fail
When a node fails
− All data stored at a node is immediately lost
− Failed node is immediately replaced by a new node
− Bits at new node are initialized to zeroes
Failure process
− Determines when nodes fail
− Determines what nodes fail
Node failure model
Storer
− Writes data, generated from message x received from a source, to the nodes
Repairer
− Aware of when and what nodes fail
− Repair process (continual loop):
  − Read data from nodes
  − Generate new data from the read data
  − Write new data to nodes
Accessor
− Reads data y from the nodes
− Generates z from y and provides z to a destination
Goal – recovered message z = original message x
System overview
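The roles above map directly onto a small simulation skeleton. The sketch below is illustrative only: a trivial stand-in "code" (full replication) replaces the actual erasure code, and every name is an assumption; the point is the Storer/Repairer/Accessor structure and the goal z = x.

```python
import random

class Cluster:
    """n storage nodes; a failed node comes back empty (bits zeroed)."""
    def __init__(self, n):
        self.nodes = [None] * n

    def fail(self, i):
        self.nodes[i] = None        # all data at the node is immediately lost

def store(cluster, message):
    """Storer: write data generated from message x to the nodes."""
    for i in range(len(cluster.nodes)):
        cluster.nodes[i] = message  # trivial stand-in code: every node holds a full copy

def repair_step(cluster):
    """Repairer loop body: read data, generate new data, write it back."""
    survivor = next(d for d in cluster.nodes if d is not None)
    for i, d in enumerate(cluster.nodes):
        if d is None:
            cluster.nodes[i] = survivor

def access(cluster):
    """Accessor: read data y from the nodes and produce z."""
    return next((d for d in cluster.nodes if d is not None), None)

cluster = Cluster(n=5)
store(cluster, "x")
cluster.fail(random.randrange(5))   # one node failure
repair_step(cluster)                # one iteration of the continual repair loop
assert access(cluster) == "x"       # goal: recovered message z equals original message x
```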
Lower bounds
− There is a failure process, so that for any repairer:
− The average repair traffic is above a threshold function of the storage overhead
− Information theoretic
Upper bounds
− There is a repairer, so that for any failure process:
− The peak repair traffic is below a threshold function of the storage overhead
− Algorithmic, based on Liquid cloud storage algorithms
− Large erasure codes
− Lazy repair strategy
− Flow data organization
Bounds
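One way to read the "large erasure codes + lazy repair" bullets above is as a thresholded repair policy: with a large (n, k) code the redundancy budget is big, so repair can be deferred and batched instead of triggered per failure. The sketch below is only an illustration of that idea (the threshold policy and all names are assumptions, not the actual Liquid algorithms).

```python
# Sketch of the lazy-repair idea behind the upper bound (illustrative only).
def lazy_repair(missing, n, k, threshold):
    """Return how many fragments to regenerate now (0 = keep waiting)."""
    if missing >= n - k:
        raise RuntimeError("too many losses: object no longer recoverable")
    if missing >= threshold:
        return missing            # regenerate everything that is missing, in bulk
    return 0                      # lazy: defer repair and save bandwidth for now

# Example: n = 1000, k = 750 (25% overhead); start repairing once roughly half
# of the redundancy budget has been consumed.
print(lazy_repair(missing=90,  n=1000, k=750, threshold=125))   # -> 0 (wait)
print(lazy_repair(missing=130, n=1000, k=750, threshold=125))   # -> 130 (batch repair)
```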
Failure timing – determines when nodes fail
− TFixed = fixed timing, i.e., Δ-duration between failures
− TRandom = random timing, i.e., Poisson with Δ-duration average between failures
Failure pattern – determines what nodes fail
− PRandom = random pattern, i.e., a random node fails
− PAdversarial = adversarial pattern, i.e., the failed node is chosen based on all available information
Failure process
Four combinations: (TFixed, PRandom)-failures, (TFixed, PAdversarial)-failures, (TRandom, PRandom)-failures, (TRandom, PAdversarial)-failures
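For concreteness, the two timing models and the random pattern can be sampled as below (a sketch under the definitions above; function names are assumptions, and the adversarial pattern cannot be sampled generically because it may use all available information about a particular repairer).

```python
import random

def t_fixed(delta):
    """TFixed: exactly delta time units between consecutive failures."""
    while True:
        yield delta

def t_random(delta):
    """TRandom: Poisson process, i.e., exponential gaps with mean delta."""
    while True:
        yield random.expovariate(1.0 / delta)

def p_random(n):
    """PRandom: the node that fails is chosen uniformly at random."""
    return random.randrange(n)

# Sample the first five failures of a (TRandom, PRandom) process on 3000 nodes.
gaps, t, events = t_random(delta=1.0), 0.0, []
for _ in range(5):
    t += next(gaps)
    events.append((round(t, 2), p_random(3000)))
print(events)          # [(time_of_failure, failed_node), ...]
```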
Deterministic
− Previous failure process actions determine next repairer action
Randomized
− Repairer can use a source of random bits to determine actions
− Random bits are private to the repairer (not available to the failure process)
Repairers
[Diagram: the two repairer types, deterministic and randomized]
Bounds on storage overhead versus repairer traffic, for a deterministic repairer against (TFixed, PAdversarial)-failures:
− Lower bound and upper bound
− Bounds are equal (asymptotically as storage overhead goes to zero)
(TFixed, PAdversarial)-failures bounds
Main lower bound: randomized repairer against (TFixed, PRandom)-failures
Main upper bound: deterministic repairer against (TRandom, PAdversarial)-failures
− Both carry over to (TFixed, PAdversarial)-failures, where the bounds are equal (asymptotically as storage overhead goes to zero)
− Both main bounds apply to random failures, i.e., (TRandom, PRandom)-failures
Main bounds
Storage overhead
− β = 1 − m/c = storage overhead
− m = size of message x
− c = n·s = total storage capacity
− n = number of storage nodes
− s = storage capacity per node
Repairer read rate
− R_AVG = lower bound on average repair read rate
− R_PEAK = upper bound on peak repair read rate
Durability
MTTDL = mean time till at least some of the message is unrecoverable
Definitions for bounds
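A quick instantiation of the definitions above (the node count and per-node capacity echo figures quoted earlier in the deck; the 75% fill level is an assumption chosen to give β = 0.25):

```python
# Definitions for bounds, instantiated with illustrative numbers.
n = 3000                      # number of storage nodes
s = 500 * 10**12 * 8          # storage capacity per node: 500 TB, in bits
c = n * s                     # total storage capacity
m = int(0.75 * c)             # size of message x (assume it fills 75% of capacity)
beta = 1 - m / c              # storage overhead
print(f"c = {c:.3e} bits, beta = {beta:.2f}")    # beta = 0.25 for this choice
```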
Lower bound when β = 0.25
− R_AVG ≥ 0.815 ∙ s/(2β∙Δ) is necessary to guarantee message recovery
Upper bound when β = 0.25
− R_PEAK ≤ 1.31 ∙ s/(2β∙Δ) is sufficient to guarantee message recovery
Asymptotic as β → 0
− R_AVG → s/(2β∙Δ) ← R_PEAK
(TFixed, PAdversarial)-failures bounds
Lower bound
− (TFixed, PRandom)-failures interacting with randomized repairer
− R_AVG ≥ s/(2β∙Δ) is necessary to achieve a large MTTDL
Upper bound
− (TRandom, PAdversarial)-failures interacting with deterministic repairer
− R_PEAK ≤ s/(2β∙Δ) is sufficient to achieve a large MTTDL
Main results as storage overhead β → 0
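To get a feel for the scale of these bounds, the sketch below evaluates s/(2·β·Δ) for a few overheads. The per-node capacity echoes the 500 TB figure quoted earlier; the failure spacing Δ and the even per-node split of the repair traffic are assumptions made only for this example.

```python
# Evaluate the asymptotic repair read-rate bound R ≈ s / (2·β·Δ).
n       = 3000                 # storage nodes
s_bits  = 500 * 10**12 * 8     # 500 TB per node, in bits
delta_s = 30 * 60              # assume one node failure every 30 minutes

for beta in (0.33, 0.15, 0.05):
    rate = s_bits / (2 * beta * delta_s)   # cluster-wide repair read rate (bits/s)
    print(f"beta = {beta:.2f}: cluster ≈ {rate / 1e9:9.1f} Gbps, "
          f"per node ≈ {rate / n / 1e6:7.1f} Mbps")
```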
[Figure: repairer read rate versus storage overhead β, with s = 1 and Δ = 1, showing the upper bound (R_PEAK), the 1/(2β) asymptote, and the (TFixed, PAdversarial)- and (TFixed, PRandom)-lower bounds (R_AVG)]
Visualization of bounds tradeoffs
[Figure: the same bounds normalized by the asymptote, i.e., upper bound∙2β and the (TFixed, PAdversarial)- and (TFixed, PRandom)-lower bounds∙2β, versus storage overhead β]
Visualization of ratio: upper and lower to asymptotic
Repairer-failure process game
− Repairer trying to ensure message is recoverable
− Failure process trying to make message unrecoverable
Transcript
− Record of interactions between repairer and failure process:
  − When nodes fail
  − What nodes fail
  − What bits are read by the repairer, etc.
The (lower bound) game
Snapshot at time t
− The c stored bits at the nodes at time t
At time t' > t
− r(t, t') = # snapshot bits read between t and t'
− l(t, t') = # snapshot bits lost (erased before being read) between t and t'
− u(t, t') = # snapshot bits unmodified between t and t'
− c = r(t, t') + l(t, t') + u(t, t') is an invariant
Initially
− r(t, t) = l(t, t) = 0, u(t, t) = c
Claim
− If l(t, t') > c − m = β∙c then the message is unrecoverable at time t'
− Equivalently, if r(t, t') + u(t, t') < m then the message is unrecoverable at time t'
Snapshot interval (t, t')
[Figure: nine storage nodes, Node1–Node9; message size m is the capacity of 6 nodes, storage overhead c − m is the capacity of 3 nodes (β = 1/3); if more than 3 nodes' worth of snapshot bits are lost the message is not recoverable, otherwise it may be recoverable]
Intuition
Suppose erased and read bits are disjoint
− Implies all erased bits are lost
− Necessary condition for message recoverability:
  − Repairer needs to read m bits before failure process erases c − m bits
  − Failure process erases bits at a rate s/Δ
  − Repairer needs to read bits at a rate of at least (m∙s)/((c − m)∙Δ) = ((1 − β)∙s)/(β∙Δ)
Generally erased and read bits not disjoint
− Repairer can read bits from node before node fails
− The bits that have been read are not lost if the node fails
− Number of bits lost when node fails can be less than s
Intuition analysis
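A minimal derivation of the disjoint-case rate above, using only the deck's definitions m = (1 − β)·c, c − m = β·c, and c = n·s:

```latex
% The repairer must read m bits before c - m bits are erased.  One node of s
% bits is erased every \Delta, so erasing c - m bits takes (c - m)\Delta/s time.
\[
  R \;\ge\; \frac{m}{(c-m)\,\Delta/s}
    \;=\; \frac{m\,s}{(c-m)\,\Delta}
    \;=\; \frac{(1-\beta)\,c\,s}{\beta\,c\,\Delta}
    \;=\; \frac{(1-\beta)\,s}{\beta\,\Delta}.
\]
```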
[Figure: the same nine-node example (message = 6 nodes, overhead = 3 nodes, β = 1/3), showing how the snapshot bits at the nodes evolve over an interval]
Snapshot interval evolution
Snapshot interval (high or low)
− High: average read traffic per node failure since snapshot > b before 2βn failures since snapshot
− Low: average read traffic per node failure since snapshot ≤ b for all 2βn failures since snapshot
− b = bound on average read traffic per node failure
− b = FAC(2β) ∙ s/(2β)
− FAC(β) = β² / (2∙(β − (1 − β)∙ln(1/(1 − β)))), with 1/2 = FAC(1) ≤ FAC(β) ≤ FAC(0) = 1
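A small numerical check of the FAC expression above (the exact formula is reconstructed here from the slide and should be treated as such): it reproduces the endpoints FAC(0) = 1 and FAC(1) = 1/2, and at β = 0.25 the factor FAC(2β) ≈ 0.815 matches the constant quoted in the earlier β = 0.25 lower bound.

```python
import math

def fac(beta):
    """FAC(beta) = beta^2 / (2*(beta - (1 - beta)*ln(1/(1 - beta)))), with its limits at 0 and 1."""
    if beta == 0.0:
        return 1.0
    if beta == 1.0:
        return 0.5
    return beta**2 / (2 * (beta - (1 - beta) * math.log(1 / (1 - beta))))

s = 1.0                                   # node capacity, normalized as in the plots
for beta in (0.05, 0.15, 0.25):
    b = fac(2 * beta) * s / (2 * beta)    # bound on average read traffic per node failure
    print(f"beta = {beta:.2f}: FAC(2*beta) = {fac(2 * beta):.3f}, b = {b:.3f}")
# beta = 0.25 prints FAC(0.50) = 0.815, the factor appearing in the earlier lower bound.
```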
(TFixed, PAdversarial)-failure process choices
[Diagram: decision tree over the interval: at the start of the interval and at the first, second, third, … node failure, the failure process may fail any surviving node (Fail node 1, Fail node 2, Fail node 3, Fail node 4, …); each branch ends in a High or Low interval]
− There is a path that leads to a high interval
− Failure process chooses this path to define the interval
− Repairer average read rate per node failure is at least b over the high interval
(TFixed, PAdversarial)-failure process choices
[Diagram: the same decision tree in the case where every path through the first, second, third, fourth, … node failures leads to a Low interval]
− All paths lead to low intervals
− We show this implies the message is unrecoverable for at least one path
Worst case (when all paths define low intervals)
[Diagram: timeline of the 2βn node failures since the snapshot, each with read traffic b, where b = bound on average read traffic per node failure]
− Average read traffic per node failure since snapshot = b for all 2βn failures
− This holds for each of the n ∙ (n − 1) ∙ ⋯ ∙ (n − 2βn) possible paths
Expected behavior when all paths define low intervals
[Diagram: node failures shown in the order of their occurrence starting at the snapshot: 1, 2, 3, …, 2βn]
− Number of node failures = 2βn
− Total capacity erased = 2βn ∙ s
− Expected number of bits lost > βn ∙ s = β∙c = c − m
− Implies there is a path where the number of bits lost > c − m
− Implies that the message is unrecoverable for the interval defined by this path
− Failure process chooses this path to define the interval
Not all node failures are snap failures
− Random nodes fail independent of snapshots
− Most failures are snap failures when storage overhead is small
Node failure timing variance
− Repairer may decide strategy based on timing between node failures
− Show that on average even short intervals have average failure timing
Random failures instead of adversarial pattern
− Failure process cannot choose which transcripts to continue and discard
− Previous argument shows message unrecoverable in expectation
− Sub-martingale: transforms expectation to high probability result
(TFixed, PAdversarial) → (TFixed, PRandom)
[Diagram: the Repairer, with local memory, reads data from and writes data to storage nodes Node1, Node2, …, Noden; each node applies a local function F1, F2, …, Fn to its stored data before providing it to the Repairer]
Functional storage nodes model
Functional model
− Previous model is a special case
Allows more powerful repairers
− Can perform arbitrary local node computation on data before provided to repairer
− Local network traffic between node storage and CPU at node doesn’t count in read traffic
− Allows repairer to use things like network coding
− Regenerating codes (Dimakis et al.) looked at doing things like this
− Allows arbitrary techniques at node
− Beyond standard erasure codes or network coding, etc.
Functional lower bounds = lower bounds stated previously
− Upper bounds use standard erasure codes & read data directly from nodes
− Upper bounds = lower bounds
− Network coding unnecessary to achieve optimality
Functional storage nodes model
Transient failures
− Delay between node failure and repair of node needed to avoid unnecessary repair
Silent data loss
− Need to read data to know whether or not it is lost
Rebuild and replacement of nodes issues
− Update infrastructure over time (replace old infra with newer infra)
− Nodes often introduced in batches – a robotic or manual process
Distributed repair, measure repair bandwidth per node
− Avoid network hotspots
Different failure models
− Varying failure rate
− Correlated failures
More advanced models for distributed storage