Ch. 14 Reliable Storage & Transactions Mark Redekopp Michael - PowerPoint PPT Presentation

1 CSCI 350 Ch. 14 – Reliable Storage & Transactions Mark Redekopp Michael Shindler & Ramesh Govindan

2 Introduction
• Seeking reliability and consistency of the file system
  – Consistency: If we are adding multiple blocks to a file and need to update the indirect pointers, a poorly timed crash could leave the file in an inconsistent state
  – Reliability: Data can get corrupted or lost due to mechanical/electrical issues
• Solutions
  – Transactions (we will focus on these)
  – Redundancy / Error-correction
    • RAID, ECC/Parity codes, checksums, etc.
    • See earlier units
[Figure: inode holding file metadata, direct pointers, an indirect pointer, and a double indirect pointer, which lead to blocks labeled DP and IP]

3 Transactions
• A transaction is a set of updates to the state of one or more objects
• Terminology
  – Committed: If a transaction commits (succeeds), then the new state of the objects will be seen going forward [i.e. all updates occur]
  – Rollback: If a transaction rolls back (fails), then the objects will remain in their original state (as if no updates to any part of the state were made) [i.e. no updates occur]
• We have seen this before briefly in the context of multi-object synchronization. Now we'll focus on its application to file systems.
• Pseudocode from the slide:

    void threadTask(void* arg) {
      /* Do local computation */
      /* checkpoints/saves state */
      begin_transaction(val1,val2) {
        /* Do some computation/updates */
        val1 -= amount; val2 += amount;
      } // end_transaction
      abort {
        // restore/re-read val1, val2
        // restart
      }
    }
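The commit/rollback behavior described above can be sketched in plain C. This is only an illustration under stated assumptions: the `Accounts` struct, the `transfer` function, and the rule that a negative balance triggers an abort are all hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pair of values updated together (names are illustrative). */
typedef struct {
    int val1;
    int val2;
} Accounts;

/* Moves `amount` from val1 to val2 as one all-or-nothing update.
 * Returns true on commit. On abort the saved state is restored, so the
 * caller sees no partial update.
 * Assumption: a negative balance is the failure that triggers the abort. */
bool transfer(Accounts *a, int amount) {
    assert(a != NULL);

    Accounts saved = *a;     /* checkpoint the original state */

    a->val1 -= amount;       /* tentative updates */
    a->val2 += amount;

    if (a->val1 < 0 || a->val2 < 0) {
        *a = saved;          /* rollback: as if no update was made */
        return false;
    }
    return true;             /* commit: the new state is kept */
}
```

Starting from val1 = 50 and val2 = 100, `transfer(&a, 10)` commits and leaves 40 and 110, while `transfer(&a, 500)` aborts and leaves both values unchanged.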

4 ACID Properties
• Transactions help achieve the ACID properties
  – Atomicity: An update appears indivisible (all or nothing); no partial updates are visible
  – Consistency: Both the old state and the new, updated state meet certain necessary invariants
    • E.g. no orphaned blocks, etc.
  – Isolation: Idea of serializability (transaction T appears to execute entirely before T', or vice versa)
  – Durability: Committed transactions are persistent

5 Logging
• Logging is a common way to achieve transactions
  – Maintains a log of "records" in persistent storage
• Steps:
  – Write intent (i.e. updates) to log
  – Write 'commit' to log (if no errors)
    • No going back now
  – Perform update
    • Actually carry out the updates described in the intent
  – Garbage collect (log entries, etc.)
    • Once the intentions are carried out successfully, we can delete the log entry and any other temporary data
• Example
  Original: val1 = 50; val2 = 100; amount=10;
  Log:
    Start XACT1 (val1, val2)
    XACT1: val1 = 40; val2 = 110;
    XACT1: COMMIT
  Updated: val1 = 40; val2 = 110; amount=10;
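The four steps above can be sketched as follows. This is a minimal in-memory illustration, not a real on-disk log: the `Log` and `LogRecord` types, the fixed capacity, and the `logged_transfer` function are assumptions made for the example, and a real implementation would also have to force each log write to persistent storage before moving on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { LOG_CAP = 8 };   /* hypothetical fixed log capacity */

typedef enum { REC_START, REC_WRITE, REC_COMMIT } RecType;

/* One log record; `slot` indexes the data array being protected. */
typedef struct {
    RecType type;
    int xact;
    int slot;
    int new_value;
} LogRecord;

typedef struct {
    LogRecord rec[LOG_CAP];
    int len;
} Log;

static bool log_append(Log *log, LogRecord r) {
    if (log->len == LOG_CAP) return false;   /* log full: treat as an error */
    log->rec[log->len++] = r;
    return true;
}

/* Transfers `amount` from data[0] to data[1] using the four logging steps. */
bool logged_transfer(Log *log, int data[2], int xact, int amount) {
    assert(log != NULL && data != NULL);
    int start = log->len;

    /* Step 1: write intent as absolute new values. */
    bool ok = log_append(log, (LogRecord){REC_START, xact, 0, 0})
           && log_append(log, (LogRecord){REC_WRITE, xact, 0, data[0] - amount})
           && log_append(log, (LogRecord){REC_WRITE, xact, 1, data[1] + amount});

    /* Step 2: write the commit record (only if step 1 had no errors). */
    ok = ok && log_append(log, (LogRecord){REC_COMMIT, xact, 0, 0});
    if (!ok) {
        log->len = start;   /* never committed: discard the partial entry */
        return false;
    }

    /* Step 3: perform the updates described in the intent. */
    for (int i = start; i < log->len; i++)
        if (log->rec[i].type == REC_WRITE)
            data[log->rec[i].slot] = log->rec[i].new_value;

    /* Step 4: garbage collect the completed log entry. */
    log->len = start;
    return true;
}
```

With data = {50, 100} and amount = 10, the log briefly holds the same START, intent, and COMMIT records as the slide's example before the entry is reclaimed.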

6 Recovery
• Steps:
  1. Write intent (i.e. updates) to log
  2. Write 'commit' to log
  3. Perform update
  4. Garbage collect (log entries, etc.)
• If a crash occurs before COMMIT is written, the transaction is effectively rolled back (the original state is still present) and the log entry will be reclaimed on restart
• If a crash occurs after step 2 completes, then the intentions/commit in the log will be replayed upon restart until all the intentions are carried out
• Example
  Original: val1 = 50; val2 = 100; amount=10;
  Log:
    Start XACT1 (val1, val2)
    XACT1: val1 = 40; val2 = 110;
    XACT1: COMMIT
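The two recovery cases can be illustrated with a small replay routine. This is a sketch under assumptions: the record layout and the `recover` function are hypothetical, the log is an in-memory array standing in for persistent storage, and reclaiming the log entries after replay is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { REC_START, REC_WRITE, REC_COMMIT } RecType;

/* One log record; `slot` indexes the data array being recovered. */
typedef struct {
    RecType type;
    int xact;
    int slot;
    int new_value;
} LogRecord;

/* True if the log contains a COMMIT record for transaction `xact`. */
static bool is_committed(const LogRecord *log, int len, int xact) {
    for (int i = 0; i < len; i++)
        if (log[i].type == REC_COMMIT && log[i].xact == xact)
            return true;
    return false;
}

/* Called on restart. Replays, in log order, every write that belongs to a
 * committed transaction and skips writes of uncommitted ones, which leaves
 * their original state in place. Returns the number of writes applied. */
int recover(const LogRecord *log, int len, int *data) {
    assert(log != NULL && data != NULL);
    int applied = 0;
    for (int i = 0; i < len; i++) {
        if (log[i].type == REC_WRITE && is_committed(log, len, log[i].xact)) {
            data[log[i].slot] = log[i].new_value;
            applied++;
        }
    }
    return applied;
}
```

Because each record stores the value to write rather than a change to apply, calling `recover` again after a second crash produces the same final state.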

7 Handling Concurrency
• Suppose two transactions attempt to execute concurrently
• Only 1 can successfully commit
• The other will need to roll back
• Example
  Transaction 1: val1 = 50; val2 = 100; amount=10;
  Transaction 2: val1 = 50; val2 = 100; amount=-30;
  Log:
    Start XACT1 (val1, val2)
    XACT1: val1 = 40; val2 = 110;
    Start XACT2 (val1, val2)
    XACT2: val1 = 80; val2 = 70;
    XACT1: COMMIT
    XACT2: FAIL

8 Handling Concurrency
• After rollback, the second transaction will need to restart and thus use the updated values
• It could potentially fail again because of some new transaction that commits before it, in which case it would restart again
  – Some priority can be used to help "older" transactions commit before "newer" ones
• Example
  Transaction 1: val1 = 50; val2 = 100; amount=10;
  Transaction 2: val1 = 50; val2 = 100; amount=-30;
  Log:
    Start XACT1 (val1, val2)
    XACT1: val1 = 40; val2 = 110;
    Start XACT2 (val1, val2)
    XACT2: val1 = 80; val2 = 70;
    XACT1: COMMIT
    XACT2: FAIL
  Transaction 2 (restarted): val1 = 40; val2 = 110; amount=-30;
  Log (continued):
    Start XACT2 (val1, val2)
    XACT2: val1 = 70; val2 = 80;
    XACT2: COMMIT
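One way to decide which of two overlapping transactions must roll back is to validate at commit time. The slides do not say how the conflict is detected, so the version counter below is an assumed mechanism, and `Store`, `Xact`, `xact_begin`, and `xact_commit` are hypothetical names. The sketch is single-threaded and only models the ordering of events.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int val1;
    int val2;
    unsigned version;   /* bumped on every commit (assumed mechanism) */
} Store;

typedef struct {
    unsigned start_version;   /* store version seen when the transaction began */
    int new_val1;             /* intended new values */
    int new_val2;
} Xact;

/* Reads the current values and computes the intended new values. */
Xact xact_begin(const Store *s, int amount) {
    assert(s != NULL);
    Xact t = { s->version, s->val1 - amount, s->val2 + amount };
    return t;
}

/* Commits only if no other transaction has committed since xact_begin.
 * On failure the store is untouched and the caller must restart. */
bool xact_commit(Store *s, const Xact *t) {
    assert(s != NULL && t != NULL);
    if (s->version != t->start_version)
        return false;         /* conflict: roll back */
    s->val1 = t->new_val1;
    s->val2 = t->new_val2;
    s->version++;
    return true;
}
```

Replaying the slide's scenario: both transactions begin at val1 = 50, val2 = 100; transaction 1 commits (40, 110); transaction 2 fails validation, begins again from the updated values with amount = -30, and then commits (70, 80).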

9 Redo Logging
• The process outlined in the past several slides is known as "redo logging"
  – On a crash, the committed transactions will be "redone"
  – If another crash occurs before the transaction can be "redone", it will simply try again on the next restart and continue retrying until successful
• Alternative: "Undo Logging"
  – Make updates in place but write old values to the log
  – On rollback, replace the new values with the old ones in the log
• Which to use? Each has its advantages. What do we expect more of: successful or failed transactions?
[Figure: the two-transaction log example from the Handling Concurrency slides]
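For contrast with redo logging, undo logging can be sketched like this. It is an in-memory illustration only: the `UndoLog` type, its fixed capacity, and the three functions are hypothetical, and a real system would need the old value to reach persistent storage before the in-place update does.

```c
#include <assert.h>
#include <stddef.h>

enum { UNDO_CAP = 8 };   /* hypothetical fixed log capacity */

typedef struct {
    int slot;
    int old_value;
} UndoRecord;

typedef struct {
    UndoRecord rec[UNDO_CAP];
    int len;
} UndoLog;

/* Saves the old value to the log, then updates in place.
 * Returns 0 on success, -1 if the log is full (nothing is changed). */
int undo_write(UndoLog *log, int *data, int slot, int value) {
    assert(log != NULL && data != NULL && slot >= 0);
    if (log->len == UNDO_CAP) return -1;
    log->rec[log->len].slot = slot;
    log->rec[log->len].old_value = data[slot];
    log->len++;
    data[slot] = value;
    return 0;
}

/* Rollback: restore the old values, newest record first. */
void undo_rollback(UndoLog *log, int *data) {
    assert(log != NULL && data != NULL);
    while (log->len > 0) {
        log->len--;
        data[log->rec[log->len].slot] = log->rec[log->len].old_value;
    }
}

/* Commit: the old values are no longer needed. */
void undo_commit(UndoLog *log) {
    assert(log != NULL);
    log->len = 0;
}
```

The trade-off raised by the slide's question shows up here: commit is cheap (just discard the log) while rollback does the copying, which is the reverse of redo logging.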

10 Idempotent Operations
• Updates must be idempotent (i.e. redoing an update many times leaves the same result as doing it once)
• Notice that the log stores the values we wanted to write to the variables
  – Writes are idempotent (e.g. writing 40 to val1 once and then repeating it will still leave val1 with 40)
• If our log stored val1 -= 10, then each replay would deduct another 10 from val1
[Figure: the two-transaction log example from the Handling Concurrency slides]
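The difference between the two kinds of log record can be shown in a few lines. The function names `apply_set` and `apply_delta` are hypothetical and stand for "replay a record that stores a value" and "replay a record that stores a change".

```c
#include <assert.h>
#include <stddef.h>

/* Idempotent: the record stores the absolute value to write, so replaying
 * it any number of times leaves the same result. */
void apply_set(int *val, int new_value) {
    assert(val != NULL);
    *val = new_value;
}

/* Not idempotent: the record stores a change, so every replay applies the
 * change again. */
void apply_delta(int *val, int delta) {
    assert(val != NULL);
    *val += delta;
}
```

Starting from val1 = 50, replaying `apply_set(&val1, 40)` twice leaves 40, whereas replaying `apply_delta(&val1, -10)` twice leaves 30.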

11 Performance of Redo Logging
• Transactions may seem like a lot of overhead but…
  – Writes to the log are sequential
    • We've learned how sequential writes are faster than random writes
  – Actual updates (step 3) can be asynchronous
    • Updates can be batched together and performed at an "opportune" time
    • The caller can return and proceed as soon as the commit is written
    • Don't wait too long, though: recovery gets slower because many updates must be replayed, and the log itself takes more space, since a transaction in the log can't be reclaimed until it is completed
    • Writes can be scheduled as a batch (rather than FIFO)

12 Logging and File Systems
• Need to ensure all metadata is updated according to ACID principles

13 Use of Logging In File Systems
• Two variants
  – Journaling:
    • Use of logging for updates to metadata (i.e. inodes, free-space map, etc.)
    • But actual data is updated in place (so file data itself can be inconsistent)
    • Used by NTFS, Apple's HFS+, and Linux's XFS
      – Linux's ext3 and ext4 FS can be configured for journaling
  – Logging:
    • Use of a log for both metadata and file data
      – Linux's ext3 and ext4 can also be configured to do logging
• COW file systems are inherently transactional
  – Only when the root node (uberblock) is updated does new data become visible (i.e. the transaction commits)
