
RETHINKING END-TO-END RELIABILITY IN CLOUD STORAGE SYSTEMS (presentation by Amy Tai et al.)



  1. RETHINKING END-TO-END RELIABILITY IN CLOUD STORAGE SYSTEMS
     Amy Tai, Andrew Kryczka, Shobhit Kanaujia, Kyle Jamieson, Michael J. Freedman, Asaf Cidon
     To appear in USENIX ATC 2019

  2. Denser flash → shorter lifetime
     [Figure: raw error rate vs. number of writes for SLC, MLC, and TLC flash. Denser flash crosses the acceptable error rate after fewer writes, so SLC lifetime > MLC lifetime > TLC lifetime.]
     Source: Novotný, R., J. Kadlec, and R. Kuchta. "NAND Flash Memory Organization and Operations." Journal of Information Technology & Software Engineering 5.1 (2015): 1.

  3. Shorter flash lifetimes are a problem
     • Datacenter operators must closely monitor flash writes
     • Memory : flash cost ratio is increasing → workloads are moving from DRAM to flash → more pressure on flash
     • Datacenters are struggling to adopt future generations of flash (e.g., QLC)
     How can we increase flash lifetimes?

  4. Increasing the acceptable error rate → increased lifetimes
     [Figure: the same error rate vs. number of writes plot; raising the acceptable error rate line extends the TLC lifetime.]
     Source: Novotný, R., J. Kadlec, and R. Kuchta. "NAND Flash Memory Organization and Operations." Journal of Information Technology & Software Engineering 5.1 (2015): 1.

  5. But… hardware is expected to have low error rates
     • Software is designed assuming bit errors are rare
     • Bit errors cause failed operations and reduced availability
     • The error-handling path is not performant

  6. Distributed error Isolation and RECovery Techniques (DIRECT)
     1. Use distributed redundancy to fix local bit errors
        • Distributed systems already need redundant copies for availability
     2. Optimize error-recovery performance
     → flash devices can expose high error rates
     → flash devices have longer lifetimes
     → cheaper flash devices (QLC and beyond)
     (A rough sketch of this recovery flow follows below.)
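A rough sketch of the idea (not code from the paper): when a read fails its checksum, the node first tries any local redundancy and then fetches only the affected block from a remote replica, instead of failing the operation or re-replicating the whole node. All types and helper names below (Block, read_local_copy, fetch_from_replica) are hypothetical stand-ins.

```cpp
// Hypothetical sketch of DIRECT-style error masking (not the paper's code).
// On a checksum mismatch, try a locally replicated copy first, then fetch
// only the affected block from a remote replica -- never fail the read or
// re-copy the whole node.
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>

struct Block {
    std::string data;
    uint32_t checksum = 0;  // stored alongside the block on flash
};

// Toy checksum (FNV-1a); real systems use CRC32C or similar.
uint32_t checksum_of(const std::string& s) {
    uint32_t h = 2166136261u;
    for (unsigned char c : s) { h ^= c; h *= 16777619u; }
    return h;
}

// Stubs standing in for local redundancy and the replication layer.
std::optional<Block> read_local_copy(uint64_t /*block_id*/) { return std::nullopt; }
std::optional<Block> fetch_from_replica(uint64_t /*block_id*/) {
    Block b{"intact copy of block 7"};
    b.checksum = checksum_of(b.data);
    return b;
}

// Read path with DIRECT-style recovery instead of a failed operation.
std::optional<std::string> read_block(const Block& on_flash, uint64_t block_id) {
    if (checksum_of(on_flash.data) == on_flash.checksum)
        return on_flash.data;                        // common case: no bit error
    if (auto local = read_local_copy(block_id);      // e.g., locally replicated metadata
        local && checksum_of(local->data) == local->checksum)
        return local->data;
    if (auto remote = fetch_from_replica(block_id);  // fetch only this block, not the node
        remote && checksum_of(remote->data) == remote->checksum)
        return remote->data;
    return std::nullopt;                             // unrecoverable: fall back to failure path
}

int main() {
    Block corrupted{"garbled bytes of block 7", 0};  // wrong checksum simulates a bit error
    std::cout << read_block(corrupted, 7).value_or("<unrecoverable>") << "\n";
}
```

The point of the sketch is that the error path stays cheap: in the common case only a checksum is verified, and in the error case only one block crosses the network.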

  7. Bit errors in the storage stack…
     [Diagram: a distributed coordination / replication layer runs over many nodes; each node runs a local data store on a hardened file system (e.g., ZFS) on unreliable flash.]

  8. … can manifest in the file system
     [Same stack diagram, with errors highlighted in the hardened file system layer.]
     Errors in the file system:
     • File system metadata (inodes, etc.)
     • File system data (data blocks)

  9. …or in the local data store
     [Same stack diagram, with errors highlighted in the local data store layer (e.g., RocksDB).]
     Errors:
     • File system metadata (inodes, etc.)
     • File system data (data blocks)
     • Application metadata or data

  10. …and need to be dealt with in the coordination layer
      [Same stack diagram; the coordination layer (e.g., Paxos / ZooKeeper) must handle the errors.]
      Errors:
      • File system metadata (inodes, etc.)
      • File system data (data blocks)
      • Application metadata or data
      • Correct recovery

  11. DIRECT corrects bit errors in the local data store
      [Same stack diagram; DIRECT sits between the coordination / replication layer and the local data stores.]

  12. Local data store errors: metadata
      [Diagram: each local data store holds local metadata (version number, server ID, index, etc.) and data objects; a bit error (X) hits the local metadata on one node.]

  13. DIRECT
      1. Protect and fix errors in local metadata
         → with local replication of metadata (see the sketch below)
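One way such local metadata replication could look, as a hypothetical sketch (not the RocksDB/ZippyDB on-disk format): keep two checksummed copies of each small metadata record and repair one copy from the other on a mismatch. Since metadata such as version numbers, server IDs, and indexes is tiny, duplicating it costs almost nothing in space or write bandwidth.

```cpp
// Hypothetical sketch of locally replicated metadata (not the actual
// RocksDB/ZippyDB format): two checksummed copies of each record, with a
// bit error in one copy repaired from the other.
#include <array>
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>

struct MetadataCopy {
    std::string payload;    // e.g., "version=42;server=7;index=..."
    uint32_t checksum = 0;
};

uint32_t checksum_of(const std::string& s) {
    uint32_t h = 2166136261u;
    for (unsigned char c : s) { h ^= c; h *= 16777619u; }
    return h;
}

struct ReplicatedMetadata {
    std::array<MetadataCopy, 2> copies;  // metadata is tiny, so 2x space is cheap

    void write(const std::string& payload) {
        for (auto& c : copies) c = {payload, checksum_of(payload)};
    }

    // Return any copy whose checksum matches, scrubbing the bad copy in place.
    std::optional<std::string> read() {
        for (auto& c : copies) {
            if (checksum_of(c.payload) != c.checksum) continue;
            for (auto& other : copies)
                if (checksum_of(other.payload) != other.checksum) other = c;
            return c.payload;
        }
        return std::nullopt;  // both copies corrupted: escalate to distributed recovery
    }
};

int main() {
    ReplicatedMetadata m;
    m.write("version=42;server=7");
    m.copies[0].payload[0] ^= 0x01;                    // simulate a bit flip in one copy
    std::cout << m.read().value_or("<lost>") << "\n";  // repaired from the second copy
}
```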

  14. Local data store errors: data
      [Same diagram; this time the bit error (X) hits a data object on one node.]

  15. DIRECT
      1. Protect and fix errors in local metadata
         → with local replication of metadata
      2. Fix errors in data objects with replicas

  16. Optimizing error recovery: the strawman treats bit errors as unavailability events
      [Diagram: on a bit error, the strawman re-replicates by copying the entire node, which is prohibitively slow.]

  17. Optimizing error recovery: the strawman treats bit errors as unavailability events
      [Same diagram.] Key question: how do we isolate the data necessary for recovery, rather than copying the entire node (prohibitively slow)?

  18. DIRECT
      1. Protect and fix errors in local metadata
         → with local replication of metadata
      2. Fix errors in data objects with replicas
         → minimize the amount of data required from other replicas
         → challenging in logically-replicated systems

  19. DIRECT
      1. Protect and fix errors in local metadata
         → with local replication of metadata
      2. Fix errors in data objects with replicas
         → minimize the amount of data required from other replicas
         → challenging in logically-replicated systems
      3. Safe recovery

  20. Naïve recovery protocol
      [Diagram: three replicas each hold object A; a bit error (X) corrupts A on one node.]

  21. Naïve recovery protocol
      [Diagram: the corrupted node sends a recovery request for A to another replica; concurrently, a write operation begins updating A to A′.]

  22. Naïve recovery protocol
      [Diagram: the write operation is applied at the healthy replicas, which now hold A′, while the corrupted node is still recovering the old value A.]

  23. Naïve recovery protocol: inconsistency
      [Diagram: the recovering node installs the stale value A while the other replicas hold A′, so the replicas are now inconsistent.]

  24. DIRECT
      1. Protect and fix errors in local metadata
         → with local replication of metadata
      2. Fix errors in data objects with replicas
         → minimize the amount of data required from other replicas
         → challenging in logically-replicated systems
      3. Safe recovery
         → with respect to the system's consistency guarantees (an illustrative version-check sketch follows below)
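The deck does not spell out how safe recovery is achieved; a common way to keep recovery consistent is to version objects and refuse to install a recovered copy that is older than anything the node may already have acknowledged, which rules out the slide-23 race. The sketch below illustrates only that guard and is not the paper's protocol; Replica, max_acked_version, and install_recovered are invented names.

```cpp
// Hypothetical illustration of version-guarded recovery (not the paper's
// protocol): a copy fetched from another replica is installed only if it is
// at least as new as anything this node already acknowledged, so recovery
// can never roll back a concurrent write (the slide-23 inconsistency).
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>

struct VersionedObject {
    std::string value;
    uint64_t version = 0;  // assigned by the coordination layer
};

struct Replica {
    std::optional<VersionedObject> object;  // reset after a detected corruption
    uint64_t max_acked_version = 0;         // highest version this node acknowledged

    void apply_write(const VersionedObject& w) {
        object = w;
        max_acked_version = std::max(max_acked_version, w.version);
    }

    // Reject recovered copies that would undo a newer, already-acknowledged write.
    bool install_recovered(const VersionedObject& fetched) {
        if (fetched.version < max_acked_version) return false;  // stale: fetch again
        object = fetched;
        return true;
    }
};

int main() {
    Replica r;
    r.apply_write({"A", 1});
    r.apply_write({"A'", 2});   // the concurrent write from slide 22 is acknowledged
    r.object.reset();           // a bit error is then detected in the stored object
    std::cout << r.install_recovered({"A", 1}) << "\n";   // 0: stale copy rejected
    std::cout << r.install_recovered({"A'", 2}) << "\n";  // 1: up-to-date copy installed
}
```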

  25. Implementations of DIRECT
      • ZippyDB/RocksDB
        • RocksDB: KV store backed by a log-structured merge tree
        • ZippyDB: distributed KV store backed by RocksDB
      • HDFS: block-level distributed file system

  26. ZippyDB overview
      [Diagram: write requests arrive at a ZippyDB shard consisting of a primary and two secondaries, each backed by a RocksDB instance.]

  27. ZippyDB overview
      [Same diagram: ZippyDB is the coordination layer; each RocksDB instance is the local data store.]

  28. How ZippyDB handles corruptions
      • User reads: retry from another server
      • Background reads (compaction): crash the server
      (An illustrative sketch of this policy follows below.)
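An illustrative sketch of that policy, using made-up types rather than ZippyDB internals: a corrupt user read is retried against another server, while a corruption discovered during compaction, which has no client to answer to, crashes the process so the node can be fully re-replicated.

```cpp
// Hypothetical sketch of the policy above (made-up types, not ZippyDB code):
// a corrupt user read is retried on another server; a corruption found by a
// background compaction crashes the server so it can be re-replicated.
#include <cstdlib>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

struct ReadResult {
    bool corruption = false;
    std::string value;
};

// Stand-ins for a local RocksDB read and a read forwarded to another replica.
ReadResult local_get(const std::string& key) {
    return {key == "bad-key", "value-of-" + key};  // pretend "bad-key" hit a bit error
}
std::optional<std::string> get_from_other_server(const std::string& key) {
    return "value-of-" + key;                      // a healthy copy from a peer
}

// User read: mask the corruption by retrying against another server.
std::optional<std::string> handle_user_read(const std::string& key) {
    ReadResult r = local_get(key);
    if (!r.corruption) return r.value;
    return get_from_other_server(key);             // availability dip, but the read succeeds
}

// Background read (compaction): no client to retry for, so crash the server.
void run_compaction(const std::vector<std::string>& keys) {
    for (const auto& k : keys)
        if (local_get(k).corruption)
            std::abort();                          // the "crash server" path from the slide
}

int main() {
    std::cout << *handle_user_read("good-key") << "\n";  // served locally
    std::cout << *handle_user_read("bad-key") << "\n";   // served by a peer after the retry
    run_compaction({"good-key"});                        // a corrupt key here would abort
}
```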

  29. ZippyDB-DIRECT
      1. Protect and fix errors in local metadata
         → with local replication of metadata
      2. Fix errors in data objects with replicas
         → minimize the amount of data required from other replicas
         → challenging in logically-replicated systems
      3. Safe recovery
         → with respect to the system's consistency guarantees

  30. RocksDB SST file layout
      [Diagram: an SST file consists of data blocks 1..N, followed by metadata blocks, an index block, and a footer. A minimal model of this layout follows below.]
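A minimal model of that layout, assuming the usual block-based-table structure (this is an illustration, not RocksDB's actual format code): data blocks hold sorted key-value pairs, the index block maps each data block's last key to its file offset, and the footer at the end of the file locates the index.

```cpp
// Illustrative model of the layout on the slide (not RocksDB's format code):
// data blocks, metadata blocks, an index block mapping each data block's last
// key to its location, and a footer that locates the index block.
#include <cstdint>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct BlockHandle { uint64_t offset = 0, size = 0; };  // where a block lives in the file

struct DataBlock {                                      // sorted key-value pairs + checksum
    std::vector<std::pair<std::string, std::string>> entries;
    uint32_t checksum = 0;
};

struct IndexEntry {                                     // "last key in block" -> block location
    std::string last_key;
    BlockHandle handle;
};

struct SstFile {
    std::vector<DataBlock> data_blocks;    // Data block 1 .. Data block N
    std::vector<std::string> meta_blocks;  // Metadata blocks (filters, properties, ...)
    std::vector<IndexEntry> index;         // Index block
    BlockHandle index_handle;              // Footer: points at the index block
};

// Point lookup: footer -> index block -> the single data block that may hold the key.
const DataBlock* find_block(const SstFile& sst, const std::string& key) {
    for (size_t i = 0; i < sst.index.size(); ++i)
        if (key <= sst.index[i].last_key)
            return &sst.data_blocks[i];    // only this block is read (and checksummed)
    return nullptr;                        // key is past the last block
}

int main() {
    SstFile sst;
    sst.data_blocks.push_back({{{"a", "1"}, {"c", "2"}}, 0});
    sst.data_blocks.push_back({{{"g", "3"}, {"k", "4"}}, 0});
    sst.index = {{"c", {0, 100}}, {"k", {100, 80}}};
    sst.index_handle = {180, 40};
    const DataBlock* b = find_block(sst, "h");  // lands in the second block's key range
    std::cout << (b ? b->entries.front().first : std::string("<miss>")) << "\n";  // "g"
}
```

A point lookup therefore touches the footer, the index block, and exactly one data block, which is also the granularity at which a checksum failure is detected.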

  31. ZippyDB-DIRECT
      1. Protect and fix errors in local metadata
         → with local replication of metadata
      2. Fix errors in data objects with replicas
         → minimize the amount of data required from other replicas
         → challenging in logically-replicated systems
      3. Safe recovery
         → with respect to the system's consistency guarantees

  32. Identifying corrupt data
      [Diagram: a bit error (X) hits one of the data blocks in an SST file.]
      Because the checksum covers a whole block, there is no way of knowing the exact corrupted key-value pair, only the corrupted block (see the sketch below).
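A hypothetical sketch of why detection happens at block granularity: the per-block checksum only says which block is bad, so the most a local store can report upward is the key range that block covers, which is what a recovery layer could then re-fetch from other replicas.

```cpp
// Hypothetical sketch of block-granularity detection (not RocksDB's
// verification code): a checksum mismatch identifies the corrupt block and
// hence a key range, never an individual key-value pair.
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

struct DataBlock {
    std::string first_key, last_key;  // key range covered by this block
    std::string raw;                  // encoded key-value pairs
    uint32_t checksum = 0;
};

uint32_t checksum_of(const std::string& s) {
    uint32_t h = 2166136261u;
    for (unsigned char c : s) { h ^= c; h *= 16777619u; }
    return h;
}

struct CorruptRange { std::string first_key, last_key; };

// Scan the SST file's blocks; on a mismatch, report the affected key range so
// a recovery layer can re-fetch exactly that range from other replicas.
std::optional<CorruptRange> find_corruption(const std::vector<DataBlock>& blocks) {
    for (const auto& b : blocks)
        if (checksum_of(b.raw) != b.checksum)
            return CorruptRange{b.first_key, b.last_key};
    return std::nullopt;
}

int main() {
    std::vector<DataBlock> blocks = {{"a", "f", "a=1,b=2,c=3"}, {"g", "m", "g=4,k=5,m=6"}};
    for (auto& b : blocks) b.checksum = checksum_of(b.raw);
    blocks[1].raw[2] ^= 0x01;  // a single bit flip somewhere inside the second block
    if (auto r = find_corruption(blocks))
        std::cout << "re-fetch keys [" << r->first_key << ", " << r->last_key
                  << "] from other replicas\n";
}
```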
