Check Pointing and Rollback Recovery (Course: Distributed Computing)


Feb 26, 2023


SLIDE 1

Check Pointing and Rollback Recovery

Course: Distributed Computing Faculty: Dr. Rajendra Prasath

Spring 2019

SLIDE 2

About this topic

This course covers various concepts in Check Pointing and Rollback Recovery. We will also focus on the essential aspects of check pointing and rollback recovery in distributed contexts.

Rajendra, IIIT Sri City

SLIDE 3

What did you learn so far?

→ Challenges in Message Passing systems
→ Distributed Sorting
→ Space-Time Diagram
→ Partial Ordering / Causal Ordering
→ Concurrent Events
→ Local Clocks and Vector Clocks
→ Distributed Snapshots
→ Termination Detection
→ Topology Abstraction and Overlays
→ Leader Election Problem in Rings
→ Message Ordering / Group Communications
→ Distributed Mutual Exclusion Algorithms

RECAP

SLIDE 4

Topics to focus on …

→ Distributed Mutual Exclusion
→ Deadlock Detection
→ Check Pointing and Rollback Recovery
→ Self-Stabilization
→ Distributed Consensus
→ Reasoning with Knowledge
→ Peer-to-peer computing and Overlays
→ Authentication in Distributed Systems

For End Semester

SLIDE 5

Distributed Mutual Exclusion (Recap)

→ No deadlocks: no process should be permanently blocked, waiting for messages (resources) from other sites
→ No starvation: no site should have to wait indefinitely to enter its critical section (CS) while other sites execute the CS more than once
→ Fairness: requests are honored in the order they are made. This means processes have to be able to agree on the order of events. (Fairness prevents starvation.)
→ Fault tolerance: the algorithm is able to survive a failure at one or more sites

SLIDE 6

Deadlock – Illustrated (Recap)

→ Vehicular traffic: a real-time scenario

SLIDE 7

Dining Philosophers (Recap)

→ Each philosopher must alternately think and eat
→ A philosopher can only eat when they have both left and right forks
→ Problem: How to design a discipline of behavior (a concurrent algorithm) such that no philosopher will starve?
→ Suggest a simple solution?
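One possible answer, offered only as a sketch and not as the solution intended in the lecture, is resource ordering: every philosopher picks up the lower-numbered fork first, which rules out a circular wait. The function name `dine` and its parameters are hypothetical, and the sketch assumes at least two philosophers. It prevents deadlock; whether any philosopher can still starve depends on how fairly the underlying locks are granted.

```python
import threading


def dine(n=5, meals=3):
    """Run n philosopher threads (n >= 2); return how many meals each one ate."""
    forks = [threading.Lock() for _ in range(n)]
    eaten = [0] * n

    def philosopher(i):
        # Always pick up the lower-numbered fork first (a global lock order),
        # so a cycle of philosophers each holding one fork cannot form.
        first, second = sorted((i, (i + 1) % n))
        for _ in range(meals):
            with forks[first]:
                with forks[second]:
                    eaten[i] += 1  # "eat"; only thread i writes eaten[i]

    threads = [threading.Thread(target=philosopher, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return eaten


print(dine())  # every philosopher finishes all of its meals
```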

SLIDE 8

Check Pointing and Rollback Recovery

Let us explore Check Pointing and Roll Back Recovery algorithms in distributed systems


SLIDE 9

Handling Failures / Recovery?

→ Failure of a site/node in a distributed system causes inconsistencies in the state of the system.
→ Recovery: bringing the failed node back in step with the other nodes in the system.
→ Failures:
  → Process failure: deadlocks, protection violation, erroneous user input, etc.
  → System failure: failure of the processor/system. A system failure can cause full or partial amnesia. It can be a pause failure (the system restarts in the same state it was in before the crash) or a complete halt.
  → Secondary storage failure: data inaccessible.
  → Communication failure: network inaccessible.

SLIDE 10

Recovery in Concurrent Systems

→ State involves message exchanges in a distributed system (DS)
→ In distributed systems, rolling back one process can cause the rollback of other processes
→ Orphan messages & domino effect: assume Y fails after sending m
  → X has a record of m at x3 but Y has no record of sending it, so m is an orphan message.
  → Y rolls back to y2, so X should roll back to x2
  → If Z rolls back, X and Y have to go to x1 and y1. This is the domino effect: the rollback of one process causes one or more other processes to roll back

[Figure: space-time diagram of processes X, Y, Z with checkpoints x1, x2, x3, y1, y2, z1, z2 and message m]
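The rollback propagation on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not an algorithm from the lecture: each checkpoint is modelled as a pair (ids of messages recorded as sent, ids of messages recorded as received), and the names `recovery_line`, `checkpoints`, and `start` are hypothetical. Only message m from the slide is modelled; the other messages in the figure are not recoverable from the transcript.

```python
def recovery_line(checkpoints, start):
    """Roll processes back until no checkpoint in the line holds an orphan.

    checkpoints: dict process -> list of (sent_ids, received_ids), oldest first.
    start:       dict process -> index of the checkpoint it restarts from.
    """
    line = dict(start)
    changed = True
    while changed:
        changed = False
        # Everything that some process in the current line remembers sending.
        sent = set()
        for p, i in line.items():
            sent |= checkpoints[p][i][0]
        for p, i in list(line.items()):
            # A receive with no matching recorded send is an orphan message.
            orphans = checkpoints[p][i][1] - sent
            if orphans and i > 0:
                line[p] = i - 1  # step back one checkpoint and re-check
                changed = True
    return line


# Y sent m after y2; X received m before x3 (the orphan case on the slide).
checkpoints = {
    "X": [(set(), set()), (set(), set()), (set(), {"m"})],  # x1, x2, x3
    "Y": [(set(), set()), (set(), set())],                  # y1, y2
}
print(recovery_line(checkpoints, {"X": 2, "Y": 1}))  # {'X': 1, 'Y': 1}
```

With Y restarting from y2 (index 1), X is forced from x3 back to x2 (index 1), matching the second sub-bullet above.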

SLIDE 11

Messages Lost

→ If Y fails after receiving m, it will roll back to y1
→ X will roll back to x1
→ m will be a lost message, as X has recorded it as sent and Y has no record of receiving it

[Figure: processes X and Y with checkpoints x1 and y1, message m, and a failure on Y]

SLIDE 12

Livelocks

→ Y crashes before receiving n1. Y rolls back to y1, so X rolls back to x1
→ Y recovers, receives n1 and sends m2
→ X recovers, sends n2 but has no record of sending n1
→ Hence, Y is forced to roll back a second time. X also rolls back, as it has received m2 but Y has no record of m2
→ The above sequence can repeat indefinitely, causing a livelock

[Figure: two space-time diagrams of X and Y with checkpoints x1, y1 and messages m1, n1, m2, n2, showing the failure and the second rollback]

SLIDE 13

Consistent Checkpoints

→ Overcoming the domino effect and livelocks: checkpoints should not have messages in transit.
→ Strongly consistent set of checkpoints: no message exchange between any pair of processes in the set, or between a process in the set and one outside the set, during the interval spanned by the checkpoints.
→ {x1, y1, z1} is a strongly consistent set of checkpoints

[Figure: space-time diagram of processes X, Y, Z with checkpoints x1, x2, x3, y1, y2, z1, z2 and message m]
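As a rough illustration of the difference between an inconsistent, a consistent, and a strongly consistent set of checkpoints, the sketch below classifies a set by comparing recorded sends and receives. It assumes each checkpoint is summarised as a pair (ids sent, ids received) and that all communication stays inside the set; `classify_cut` is a hypothetical helper name.

```python
def classify_cut(cut):
    """cut: dict process -> (sent_ids, received_ids) recorded at its checkpoint."""
    sent = set().union(*(s for s, _ in cut.values()))
    received = set().union(*(r for _, r in cut.values()))
    if received - sent:
        return "inconsistent"         # a receive without a send: orphan message
    if sent - received:
        return "consistent"           # no orphans, but messages still in transit
    return "strongly consistent"      # every recorded send has a recorded receive


print(classify_cut({"X": (set(), {"m"}), "Y": (set(), set())}))  # inconsistent
print(classify_cut({"X": (set(), set()), "Y": ({"m"}, set())}))  # consistent
print(classify_cut({"X": (set(), set()), "Y": (set(), set())}))  # strongly consistent
```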

SLIDE 14

Types of CRR (Checkpointing and Rollback Recovery) Algorithms

→ Synchronous algorithm
  → Two-phase algorithm proposed by Koo and Toueg
→ Asynchronous algorithm
  → A simple algorithm proposed by Juang & Venkatesan

SLIDE 15

Consistent Set of Checkpoints

Assumptions:

→ Checkpoint and send/recv operations are atomic
→ Take a checkpoint after sending every message
→ The set of the most recent checkpoints is always consistent
  → Why? Is it strongly consistent?
→ What is the main problem with this approach?
→ Take a checkpoint after every K messages sent?
→ Is it still consistent?

SLIDE 16

Synchronous Checkpointing Algo

→ Proposed by Koo and Toueg [1] (1987)
→ Assumptions:
  → Processes communicate by exchanging messages through channels
  → Channels are FIFO; end-to-end protocols cope with message loss due to rollback recovery
  → Communication failures do not partition the network
→ Uses two kinds of checkpoints
  → Tentative
  → Permanent

[1] R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems," in IEEE Transactions on Software Engineering, vol. SE-13, no. 1, pp. 23-31, Jan. 1987. doi: 10.1109/TSE.1987.232562

SLIDE 17

Phase - 1

→ Initiator: takes a tentative checkpoint
→ Initiator requests all other processes to take tentative checkpoints
→ All other processes:
  → can respond `yes' or `no'
→ Initiator: decides to make the checkpoints permanent if everyone has responded `yes'
→ A process can fail to take a checkpoint due to the nature of the application (e.g., lack of log space, unrecoverable transactions)

SLIDE 18

Phase - 2

→ If all processes took tentative checkpoints, the initiator Pi decides to make the checkpoints permanent.
→ Otherwise, the checkpoints are discarded.
→ Pi conveys this decision to all the processes: whether the checkpoints are to be made permanent or discarded
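The two phases can be sketched as a small in-memory simulation. This is a simplification under stated assumptions, not the Koo and Toueg protocol itself: there is no real message passing, no failure during the protocol, and no holding back of application messages. The class `Process`, its `can_checkpoint` flag, and the function `two_phase_checkpoint` are hypothetical names.

```python
class Process:
    def __init__(self, name, state, can_checkpoint=True):
        self.name = name
        self.state = state
        self.can_checkpoint = can_checkpoint  # e.g. False if out of log space
        self.tentative = None
        self.permanent = None

    def take_tentative(self):
        """Phase 1: try to save the current state; the return value is the vote."""
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        return True

    def finish(self, commit):
        """Phase 2: make the tentative checkpoint permanent, or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None


def two_phase_checkpoint(initiator, others):
    everyone = [initiator, *others]
    # Phase 1: collect `yes'/`no' votes (the initiator checkpoints first).
    votes = [p.take_tentative() for p in everyone]
    decision = all(votes)
    # Phase 2: the initiator's decision is applied by every process.
    for p in everyone:
        p.finish(decision)
    return decision


procs = [Process("P1", "s1"), Process("P2", "s2"), Process("P3", "s3")]
print(two_phase_checkpoint(procs[0], procs[1:]))  # True
print([p.permanent for p in procs])               # ['s1', 's2', 's3']
```

If any process votes `no', every tentative checkpoint is dropped and the previous permanent checkpoints stay in place, which gives the all-or-none behaviour.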


SLIDE 19

Potential Issues

→ Between taking a tentative checkpoint and the commit/abort of that checkpoint, a process must hold back messages.
→ Does this guarantee we have a strongly consistent state?
→ Can you construct an example that shows we can still have lost messages?

SLIDE 20

Synchronous Checkpointing: Properties

→ All or none of the processes take permanent checkpoints
→ There is no record of a message being received but not sent
→ Checkpoints may be taken unnecessarily (Give an example!!)
→ Can these unnecessary checkpoints be avoided?

SLIDE 21

Optimizing Checkpoints

Main IDEA:

→ Record all messages sent and received after the last checkpoint (last_recv(x, y), first_sent(x, y))
→ When X requests Y to take a tentative checkpoint:
  → X sends (the label of) the last message it received from Y with the request
  → Y takes a tentative checkpoint only if the last message received by X from Y was sent after Y sent its first message to X after the last checkpoint (Happened before !!)
      last_recv(x, y) ≥ first_sent(y, x)
→ When a process takes a checkpoint, it will ask all other processes (that sent messages to the process) to take checkpoints.
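The test that Y applies can be written as a one-line predicate. The sketch assumes messages carry monotonically increasing integer labels and uses a hypothetical sentinel `BOTTOM` (smaller than every real label) to mean "Y has sent nothing to X since its last checkpoint"; neither detail is spelled out on the slide.

```python
BOTTOM = 0  # assumed sentinel: "no message sent", below every real label


def must_checkpoint(last_recv_x_y, first_sent_y_x):
    """Should Y take a tentative checkpoint on X's request?

    last_recv_x_y:  label of the last message X received from Y
    first_sent_y_x: label of the first message Y sent to X after Y's
                    last checkpoint, or BOTTOM if there is none
    """
    return first_sent_y_x > BOTTOM and last_recv_x_y >= first_sent_y_x


print(must_checkpoint(5, 3))       # True: X depends on a send Y has not checkpointed
print(must_checkpoint(2, 3))       # False: X has not seen Y's post-checkpoint sends
print(must_checkpoint(5, BOTTOM))  # False: Y sent nothing since its checkpoint
```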


SLIDE 22

Rollback Recovery: Properties

→ There are two phases: Phase 1 and Phase 2
→ Assume that between the request to roll back and the decision, no one sends other messages
→ All or none of the processes restart from checkpoints
→ After rollback, all processes resume in a consistent state
→ Unnecessary rollbacks can occur: a technique similar to the one used when taking checkpoints can eliminate them

SLIDE 23

Rollback Recovery

→ Phase 1
  → Initiator: checks whether all processes are willing to restart from their last checkpoints
  → Others: may reply `yes' or `no'
→ Phase 2
  → Initiator: propagates the go/no-go decision to all processes
  → Others: carry out the decision of the initiator

SLIDE 24

Unnecessary Rollbacks

→ Avoid rollback in unnecessary situations?
→ An example
  → (z2 does not need to roll back. Why?)

SLIDE 25

Disadvantages

→ The checkpointing algorithm generates message traffic
→ Synchronization delays are introduced
→ These costs may seem high if failures between checkpoints are unlikely

SLIDE 26

Asynchronous Approach

→ Take multiple local checkpoints independently
→ After a failure, try to find a consistent set of recent checkpoints
→ All incoming messages between local checkpoints are logged
  → Pessimistic approach: log each message before processing
  → Optimistic approach: buffer messages & log in batches
→ Why is the second approach called optimistic?
→ What are the advantages and disadvantages of each approach?
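To make the two logging styles concrete, here is a small sketch in which a Python list stands in for stable storage. The class names, the `batch_size` parameter, and the `crash` method are hypothetical; a real system would write to disk and pay the corresponding latency on each write.

```python
class PessimisticLogger:
    """Write every message to stable storage before it is processed."""

    def __init__(self):
        self.stable = []  # stands in for stable storage

    def on_message(self, msg, process):
        self.stable.append(msg)  # log first ...
        process(msg)             # ... then process


class OptimisticLogger:
    """Process immediately; buffer messages and log them in batches."""

    def __init__(self, batch_size=3):
        self.stable = []
        self.buffer = []  # volatile: lost if the process crashes
        self.batch_size = batch_size

    def on_message(self, msg, process):
        process(msg)
        self.buffer.append(msg)
        if len(self.buffer) >= self.batch_size:
            self.stable.extend(self.buffer)  # one batched write
            self.buffer.clear()

    def crash(self):
        """Return the processed messages that never reached stable storage."""
        lost = list(self.buffer)
        self.buffer.clear()
        return lost


seen = []
opt = OptimisticLogger(batch_size=3)
for m in ["m1", "m2", "m3", "m4"]:
    opt.on_message(m, seen.append)
print(opt.stable)   # ['m1', 'm2', 'm3']
print(opt.crash())  # ['m4'] was processed but is not recoverable from the log
```

The batched variant does fewer stable-storage writes, betting that a crash will not happen while messages sit in the buffer; the pessimistic variant pays for a write on every message and never loses one.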


SLIDE 27

An Event Driven Computation

→ A process waits until it receives a message, processes the received message, changes its state, sends zero or more messages to its neighbors, and then waits to receive the next message
→ The current state and the contents of the messages sent depend on the previous state and the content of the received message
→ Events are identified by unique, increasing numbers

SLIDE 28

Asynchronous Checkpointing Algo

→ Proposed by Juang & Venkatesan [2]

Assumptions:

→ Communication channels are reliable
→ Communication channels are FIFO
→ Communication channels have no buffer size limits
→ Message transmission delay is bounded
→ Underlying system is event-driven, with locally timestamped events (monotonically increasing numbers): each event waits for a message, processes the message, changes the process state, and sends a number of messages

[2] https://www.utdallas.edu/~venky/pubs/crash-rec-icdcs91.pdf

SLIDE 29

Basic Idea

→ At each event, a triplet {s, m, msgs_sent} is put in the log: s is the state, m is the message causing the event, and msgs_sent is the set of messages sent.
→ Two data structures are used:
  → RCVD(i, j, checkpoint): the number of messages received by processor i from processor j at checkpoint
  → SENT(i, j, checkpoint): the number of messages sent from i to j at checkpoint
→ Use the message send/recv counts to determine the point to roll back to.
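A minimal sketch of such a log follows. It assumes the application state is just an event counter and stores the SENT and RCVD counts alongside each triplet; the class `EventDrivenProcess` and its `handle` method are hypothetical names chosen for illustration.

```python
class EventDrivenProcess:
    def __init__(self, name):
        self.name = name
        self.state = 0   # stand-in for real application state
        self.sent = {}   # neighbour -> number of messages sent so far
        self.rcvd = {}   # neighbour -> number of messages received so far
        self.log = []    # one entry per event, oldest first

    def handle(self, sender, message, replies):
        """Process one received message; replies is a list of (dest, payload)."""
        self.state += 1
        self.rcvd[sender] = self.rcvd.get(sender, 0) + 1
        for dest, _payload in replies:
            self.sent[dest] = self.sent.get(dest, 0) + 1
        self.log.append({
            "s": self.state,             # state after the event
            "m": (sender, message),      # message that caused the event
            "msgs_sent": list(replies),  # messages sent by the event
            "SENT": dict(self.sent),     # snapshot of the send counts
            "RCVD": dict(self.rcvd),     # snapshot of the receive counts
        })


p = EventDrivenProcess("X")
p.handle("Y", "hello", [("Y", "ack")])
p.handle("Z", "hi", [])
print(p.log[1]["RCVD"])  # {'Y': 1, 'Z': 1}
print(p.log[1]["SENT"])  # {'Y': 1}
```

In this representation, RCVD(i, j, e) is `log[e]["RCVD"].get(j, 0)` and SENT(i, j, e) is `log[e]["SENT"].get(j, 0)`.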


SLIDE 30

Algorithm

At process i:

→ If i is a process that is recovering from a failure, checkpoint = the latest event logged in stable storage.
→ else checkpoint = latest event that took place.
→ for k = 1 to N do
  → send ROLLBACK(i, SENT(i, j, checkpoint)) to each neighbor j
  → wait for ROLLBACK messages from all neighbors
  → for every ROLLBACK(j, c) received
    → if (RCVD(i, j, checkpoint) > c) then
      → find the latest event e such that RCVD(i, j, e) = c
      → checkpoint = e
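The loop can be simulated in a single Python function. This is a sketch under stated assumptions rather than a distributed implementation: all processes are neighbours of each other, rounds run in lock step, and each process's log is a list of (SENT, RCVD) count dictionaries with index 0 as the initial state. The function name `juang_venkatesan` and the two-process example are hypothetical.

```python
def juang_venkatesan(logs, start):
    """Return the event index each process finally rolls back to.

    logs[i]:  list of (sent, rcvd) dicts of cumulative per-neighbour counts
              after each event; receive counts grow by at most one per event.
    start[i]: latest event in stable storage for a failed process,
              latest event that took place otherwise.
    """
    procs = list(logs)
    ckpt = dict(start)
    for _ in range(len(procs)):  # for k = 1 to N
        # Every i sends ROLLBACK(i, SENT(i, j, ckpt_i)) to each neighbour j.
        inbox = {j: [] for j in procs}
        for i in procs:
            sent = logs[i][ckpt[i]][0]
            for j in procs:
                if j != i:
                    inbox[j].append((i, sent.get(j, 0)))
        # Every i handles the ROLLBACK(j, c) messages it received.
        for i in procs:
            for j, c in inbox[i]:
                if logs[i][ckpt[i]][1].get(j, 0) > c:
                    # i received more from j than j remembers sending:
                    # go to the latest event e with RCVD(i, j, e) == c.
                    ckpt[i] = max(
                        e for e in range(ckpt[i] + 1)
                        if logs[i][e][1].get(j, 0) == c
                    )
    return ckpt


logs = {
    "X": [({}, {}), ({"Y": 1}, {"Y": 1}), ({"Y": 1}, {"Y": 2})],
    "Y": [({"X": 1}, {}), ({"X": 2}, {"X": 1})],
}
# Y crashed with only its initial event in stable storage; X is at event 2.
print(juang_venkatesan(logs, {"X": 2, "Y": 0}))  # {'X': 1, 'Y': 0}
```

Here X has received two messages from Y, but Y's surviving log records only one send, so X rolls back to the latest event at which it had received exactly one message from Y.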


SLIDE 31

Is the algorithm consistent?

→ In each iteration: at least one processor will roll back to its final recovery point unless the current recovery point is consistent
→ Answer: YES / NO
→ Complexity of this algorithm?
  → Will it be greater than O(n), where n is the total number of message exchanges?
  → Explore the details … !!

SLIDE 32

Summary

→ Recovery in Distributed / Concurrent Systems
→ Checkpointing
  → Consistent set of checkpoints
  → Rollback recovery
→ Synchronous Algorithm (Koo and Toueg)
→ Asynchronous Algorithm (Juang & Venkatesan)
→ Stay tuned ... More to come … !!

SLIDE 33

How to reach me?

→ Please leave me an email:
  rajendra [DOT] prasath [AT] iiits [DOT] in
→ Visit my homepage @
  → http://www.iiits.ac.in/FacPages/index-rajendra.html OR
  → http://rajendra.2power3.com

SLIDE 34

Help among Yourselves?

Perspective Students (having CGPA of 8.5 and above)
Promising Students (having CGPA above 6.5 and less than 8.5)
Needy Students (having CGPA less than 6.5)

Can the above group help these students? (Your work will also be rewarded)

You may grow a culture of collaborative learning by helping the needy students

SLIDE 35

… Questions ???

Thanks …
