Reliability In case of a crash, recover to a consistent (or correct - PowerPoint PPT Presentation

Reliability

In case of a crash, recover to a consistent (or correct) state and continue processing.

Types of Failures
1. Node failure
2. Communication line failure
3. Loss of a message (or transaction)
4. Network partition
5. Any combination of the above

Distributed DBMS Reliability and Partition. 1

Approaches to Reliability
1. Audit trails (or logs)
2. Two-phase commit protocol
3. Retry based on timing mechanism
4. Reconfigure
5. Allow enough concurrency which permits definite recovery (avoid certain types of conflicting parallelism)
6. Crash resistance design

Recovery Controller

Types of failures:
• transaction failure
• site failure (local or remote)
• communication system failure

Transaction failure
• UNDO/REDO logs (Gray)
• transparent transaction (effects of execution in private workspace)
→ Failure does not affect the rest of the system

Site failure
• volatile storage lost
• stable storage lost
• processing capability lost (no new transactions accepted)

System Restart

Types of transactions:
1. In commitment phase
2. Committed actions reflected in real/stable
3. Have not yet begun
4. In prelude (have done only undoable actions)

We need:
• stable undo log
• stable redo log (at commit)
• perform redo log (after commit)

Problem: entry into undo log; performing the action
Solution: < T, A, E >; undo actions must be restartable (or idempotent)
DO – UNDO ≡ DO – UNDO – UNDO – ... – UNDO
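The restartable-undo requirement can be illustrated with a minimal sketch. The slide does not expand the < T, A, E > log entry, so the record fields below (transaction id, item name, before-image) are hypothetical, as are the names `UndoRecord`, `do_update` and `undo`. The sketch writes the undo record before performing the action and restores before-images, so that repeating the undo after a crash during recovery leaves the same state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UndoRecord:
    txn: str        # transaction identifier
    item: str       # hypothetical field: data item touched by the action
    old_value: int  # hypothetical field: before-image to restore on undo


def do_update(db, undo_log, txn, item, new_value):
    # Append the undo record first, then perform the action (DO).
    undo_log.append(UndoRecord(txn, item, db[item]))
    db[item] = new_value


def undo(db, undo_log, txn):
    # Restore before-images in reverse log order. Writing an absolute
    # old value (not a delta) is what makes repeated undo harmless.
    for rec in reversed(undo_log):
        if rec.txn == txn:
            db[rec.item] = rec.old_value


db = {"x": 10, "y": 20}
log = []
do_update(db, log, "T1", "x", 11)
do_update(db, log, "T1", "x", 12)
do_update(db, log, "T1", "y", 25)

undo(db, log, "T1")
after_one_undo = dict(db)
undo(db, log, "T1")  # simulate a restart that repeats the undo
after_two_undos = dict(db)
```

Had the log stored deltas ("subtract 1") instead of before-images, the second undo would have changed the state again, which is the situation the slide's idempotency condition rules out.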

Local site failure
• Transaction committed → do nothing
• Transaction semi-committed → abort
• Transaction computing/validating → abort
AVOIDS BLOCKING

Remote site failure
• Assume failed site will accept transaction
• Send abort/commit messages to failed site via spoolers

Initialization of failed site
• Update for globally committed transaction before validating other transactions
• If spooler crashed, request other sites to send list of committed transactions

Communication system failure
• Network partition
• Lost message
• Message order messed up

Network partition
• Semi-commit in all partitions and commit on reconnection (updates available to user with warning)
• Commit transactions if primary copy taken for all entities within the partition
• Consider commutative actions
• Compensating transactions

Compensating transactions

Commit transactions in all partitions
- Break cycle by removing semi-committed transactions
- Otherwise abort transactions that are invisible to the environment (no incident edges)
- Pay the price of committing such transactions and issue compensating transactions

Recomputing cost
- Size of readset/writeset
- Computation complexity

Figure 5.3: Linear Commit Protocol
[Timeline diagram. Columns: site of origin (coordinator), site B, site C. States shown per site: UNKNOWN, active, READY, COMMITTING, inactive. Messages shown: initiate commit, prepare, commit, ack.]

TABLE 1: Local Site Failure

Local Site Failure | System’s Decision at Local Site
After committing/aborting a local transaction | Do nothing (assume: message has been sent to remote sites)
After semi-committing a local transaction | Abort transaction when local site recovers; send abort messages to other sites
During computing/validating a local transaction | Abort transaction when local site recovers; send abort message to other sites
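The decisions in Table 1 can be written as a small lookup. This is only a sketch: the names `Phase` and `decide_on_recovery` are hypothetical and the return value (local action, whether to send abort messages to other sites) is an illustrative encoding of the table rows.

```python
from enum import Enum


class Phase(Enum):
    # Phase the local transaction was in when the local site failed.
    COMPUTING_OR_VALIDATING = "computing/validating"
    SEMI_COMMITTED = "semi-committed"
    COMMITTED_OR_ABORTED = "committed/aborted"


def decide_on_recovery(phase):
    """Return (local_action, send_abort_to_other_sites) per Table 1."""
    if phase is Phase.COMMITTED_OR_ABORTED:
        # The outcome is already decided and assumed to have been sent.
        return ("do nothing", False)
    # Semi-committed or still computing/validating: abort on recovery
    # and tell the other sites, so nobody waits on the failed site.
    return ("abort", True)


decisions = {phase.value: decide_on_recovery(phase) for phase in Phase}
```

Because every undecided transaction is aborted on recovery, no remote site has to wait for the failed site, which is the "avoids blocking" property noted earlier.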

Ripple Edges: Ti reads a value produced by Tj in the same partition

Precedence Edges: Ti reads a value that has since been changed by Tj in the same partition

Interference Edges: Ti reads a data-item in one partition and Tj writes in another partition; then Ti → Tj

Finding the minimal number of nodes whose removal breaks all cycles in a precedence graph consisting only of two-cycles of ripple edges can be done in polynomial time.
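Detecting whether such a graph still contains a cycle is the basic check behind "breaking all cycles". The sketch below is a generic depth-first search over directed edges (Ti, Tj). It does not distinguish ripple, precedence and interference edges and it does not compute the minimal node set. The function name `has_cycle` and the sample transaction names are hypothetical.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current DFS path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: nxt is on the current path
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


two_cycle = [("T1", "T2"), ("T2", "T1"), ("T2", "T3")]
cycle_before = has_cycle(two_cycle)

# Removing T2 (for example by aborting it) drops every edge touching it.
remaining = [(a, b) for a, b in two_cycle if "T2" not in (a, b)]
cycle_after = has_cycle(remaining)
```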

Communications
• Design
  – Sockets, ports, calls (sendto, recvfrom)
  – Oracle
  – Server cache
  – Addressing in RAID
  – LUDP
• High level calls
  – Setup
  – RegisterSelf
  – ServActive
  – ServAddr
  – SendPacket
  – RecvMsg

Software guide (where is the code and how is it compiled?)
• Testing RAID
• RAID installation
• RAIDTool
• Example test session
• Recommended reading
• How to incorporate a new server (RC)
• How to run an experiment (John-Comm)

Storage of backup copies of database
• Reduce storage
• Maintain number of versions
• Access time

Move servers at Kernel level
• Buffer pool, scheduler, lightweight processes
• Shared memory

New protocols and algorithms

Replicated copy control
• Survivability
• Availability
• Reconfigurability
• Consistency and dependability
• Performance

Figure: States in site recovery and availability of data-items for transaction processing
[State diagram. States and labels shown:
• Site is down: none of the data items are available
• Control transaction 1 running
• Partial recovery: unmarked data-objects are available
• Continued recovery: copies on failed site marked and fail-locks are released
• Site is up (all fail locks for this site released): all data items are available]
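The availability rule suggested by the figure labels can be sketched as a small predicate. This is an interpretation of the labels, not the authors' implementation: the enum `SiteState`, the function `is_available` and the representation of marked (fail-locked) copies as a set of item names are all assumptions.

```python
from enum import Enum, auto


class SiteState(Enum):
    DOWN = auto()              # site is down
    PARTIAL_RECOVERY = auto()  # recovering; some copies still marked
    UP = auto()                # all fail-locks for this site released


def is_available(state, item, marked_items):
    """Can a transaction use `item` at a site in the given state?"""
    if state is SiteState.DOWN:
        return False  # none of the data items are available
    if state is SiteState.UP:
        return True   # all data items are available
    # Partial recovery: only unmarked data objects are available.
    return item not in marked_items


marked = {"x"}  # hypothetical: copy of x missed updates while the site was down
availability = {
    state.name: [is_available(state, item, marked) for item in ("x", "y")]
    for state in SiteState
}
```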

[Figure: partition graph with nodes ABCDEFGH, ABCDE, FGH, DE, F, GH, ABC, AB, C, D, E, B, A]

Data Structures
• Connection vector at each site: vector of boolean values
• Partition graph
[Figure: partition graph with nodes ABCDE, ABC, DE, AC, B, A, C, ADE]
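A connection vector of booleans can be sketched as follows. The site order "ABCDE" is taken from the partition graph labels. The interpretation that entry i is true when site i is reachable from the local site, and the helper name `partition_of`, are assumptions.

```python
SITES = "ABCDE"  # fixed site order shared by all connection vectors


def partition_of(connection_vector):
    """Name the partition implied by one site's boolean connection vector."""
    return "".join(
        site for site, connected in zip(SITES, connection_vector) if connected
    )


# Hypothetical view at site A: it can currently reach itself and C only.
conn_at_a = [True, False, True, False, False]
label = partition_of(conn_at_a)

# With every site reachable, the partition is the whole network.
full = partition_of([True] * len(SITES))
```

Under this reading, each node label in the partition graph (for example AC or ADE) is the set of sites whose connection vectors agree at that point in time.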

Site name vector of file f (n is the number of copies): S = < s_1, s_2, …, s_n >

Linear order vector of file f: L = < l_1, l_2, …, l_n >

Version number X of a copy of file f: number of times the network partitioned while the copy is in the majority

Version vector of a copy at site S_i: V = < v_1, v_2, …, v_n >

Marked vector of a copy of file f: M = < m_1, m_2, …, m_n >, where m_i = T if marked and F if unmarked
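The per-copy vectors can be held together in one record. The sketch below only mirrors the shapes S, V and M defined on the slides. The class name `FileCopy`, its methods and the sample values are hypothetical, and the slides do not say how the version vector entries are updated, so no update rule is shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileCopy:
    sites: List[str]    # S = <s_1, ..., s_n>: site names holding a copy
    version: List[int]  # V = <v_1, ..., v_n>: version vector of this copy
    marked: List[bool]  # M = <m_1, ..., m_n>: True (T) if marked

    def mark(self, site):
        # Set m_i = T for the copy at the given site.
        self.marked[self.sites.index(site)] = True

    def unmarked_sites(self):
        # Sites whose copies are not marked.
        return [s for s, m in zip(self.sites, self.marked) if not m]


copy_f = FileCopy(
    sites=["A", "B", "C"],
    version=[0, 0, 0],
    marked=[False, False, False],
)
copy_f.mark("B")
still_unmarked = copy_f.unmarked_sites()
```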

[Figure: partition graph with nodes ABCDE, ABC, DE, AC, B, A, C, ADE]

Examples of Partition Trees
[Figure with two trees. (a) P_tree S1: nodes {1,2,3,4,5,6,7}, {1,2,5,6}, {1,2}, {1}; sibling branches undefined. (b) P_tree S3: nodes {1,2,3,4,5,6,7}, {3,4,7}, {3}; sibling branches undefined.]
Figure 9. Partition trees maintained at S1 and S3 before any merge of partition occurs

Partition Tree after Merge
[Figure: P_tree S1,3 with nodes {1,2,3,4,5,6,7}, {3,4,7}, {1,2,5,6}, {3}, {1,2}, {1}; remaining branches undefined.]
Figure 10. Partition tree maintained at S1 and/or S3 after S3 merge
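One way to picture the merge in Figures 9 and 10 is as a union of what each site knows. The sketch below is an assumption about the mechanism, not the authors' algorithm. Each tree is a hypothetical mapping from a partition to its known sub-partitions, "undefined" branches are simply absent, and `merge_trees` unions the children per node.

```python
def merge_trees(tree_a, tree_b):
    """Union two partition trees given as {partition: set of known children}."""
    merged = {}
    for tree in (tree_a, tree_b):
        for parent, children in tree.items():
            # Build fresh sets so the input trees are left unchanged.
            merged.setdefault(parent, set()).update(children)
    return merged


fs = frozenset
ROOT = fs({1, 2, 3, 4, 5, 6, 7})

# What S1 knows: only the branch of partitions that contained site 1.
tree_s1 = {
    ROOT: {fs({1, 2, 5, 6})},
    fs({1, 2, 5, 6}): {fs({1, 2})},
    fs({1, 2}): {fs({1})},
}
# What S3 knows: only the branch of partitions that contained site 3.
tree_s3 = {
    ROOT: {fs({3, 4, 7})},
    fs({3, 4, 7}): {fs({3})},
}

merged = merge_trees(tree_s1, tree_s3)
```

After the merge the root has both {1,2,5,6} and {3,4,7} as known children, which matches the node set listed for Figure 10. Branches that neither site observed stay undefined.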
