CSC2/458 Parallel and Distributed Systems Checkpointing and Recovery

CSC2/458 Parallel and Distributed Systems Checkpointing and Recovery Sreepathi Pai April 17, 2018 URCS

Outline • Checkpointing and Recovery • Independent Checkpointing • Coordinated Checkpointing • Message Logging

Errors happen • Errors happen • How do we recover from them (say, for message loss)? • (before information theory): ? • (after information theory): ?

Checkpointing and Recovery • To checkpoint is to save the state of a computation so that you can roll back to it • Examples: • Save games • Virtual machine snapshots • Recovery is then “simply” restoring the checkpoint
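The save-and-restore idea above can be sketched for a single process. This is a minimal illustration, not code from the lecture: the `Counter` class and its method names are hypothetical, and the whole state is assumed to fit in one dictionary.

```python
import copy


class Counter:
    """A toy computation whose whole state is one dict (hypothetical)."""

    def __init__(self):
        self.state = {"step": 0, "total": 0}

    def advance(self, value):
        self.state["step"] += 1
        self.state["total"] += value

    def checkpoint(self):
        # Deep copy so that later updates cannot alter the saved snapshot.
        return copy.deepcopy(self.state)

    def restore(self, snapshot):
        # Recovery is "simply" putting the saved state back.
        self.state = copy.deepcopy(snapshot)


c = Counter()
c.advance(5)
saved = c.checkpoint()
c.advance(100)    # work done after the checkpoint is lost on rollback
c.restore(saved)
```

After `restore`, the computation is back at step 1 with total 5. In a distributed setting the hard part is not this local step but choosing snapshots that are mutually consistent across processes.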

Distributed Checkpointing: The Challenge • Processes only know: • which messages they have received • which messages they have sent • what their local state is • Checkpointing ideally should not require everybody to “pause” • Must run concurrently with computation

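The Recovery Line [Figure: timelines of processes P1 and P2, each starting from an initial state and taking checkpoints. A message is sent from P2 to P1 and a failure occurs. The figure marks the recovery line and an inconsistent collection of checkpoints.]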

Independent Checkpointing: Algorithm • A process records its local state independently • messages sent/received included • Recovery for a process entails going back to its most recent checkpoint • Unfortunately, this can’t be done independently

Rollbacks [Figure: timelines of P1 and P2 with an initial state, checkpoints, messages m and m*, and a failure.] Assume P2 fails. How far do we need to roll back to achieve a consistent worldview?

Detecting dependencies • For a process P_i, let INT_i(m) be the interval between checkpoints m − 1 and m • All messages sent in INT_i(m) contain (i, m) • When process P_j receives this message, it may be in INT_j(n) • records dependency INT_i(m) → INT_j(n) • saves dependency with checkpoint
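A rough sketch of this bookkeeping follows. The `Process` class, its field names, and the dictionary message format are assumptions made for illustration. The initial state is treated as checkpoint 0, so a process starts in interval 1.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    interval: int = 1                               # currently executing in INT_pid(interval)
    deps: set = field(default_factory=set)          # dependencies seen in this interval
    checkpoints: list = field(default_factory=list)

    def send(self, payload):
        # Piggyback the pair (i, m) on every outgoing message.
        return {"payload": payload, "tag": (self.pid, self.interval)}

    def receive(self, msg):
        # Record the dependency INT_i(m) -> INT_j(n).
        self.deps.add((msg["tag"], (self.pid, self.interval)))

    def take_checkpoint(self):
        # Checkpoint number `interval` closes INT_pid(interval);
        # the recorded dependencies are saved together with it.
        self.checkpoints.append({"number": self.interval, "deps": set(self.deps)})
        self.deps.clear()
        self.interval += 1


p1, p2 = Process(1), Process(2)
p1.take_checkpoint()          # P_1 is now in INT_1(2)
msg = p1.send("hello")        # tagged (1, 2)
p2.receive(msg)               # P_2 is still in INT_2(1)
p2.take_checkpoint()          # saves INT_1(2) -> INT_2(1) with checkpoint 1
```

Here checkpoint 1 of P_2 carries the single dependency `((1, 2), (2, 1))`, which is what a later recovery would need in order to tell whether that checkpoint is still valid.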

Rolling back: Consistency • If P_i rolls back to checkpoint m − 1, it is as if no messages from INT_i(m) were ever sent • All checkpoints dependent on INT_i(m) are invalid • Rollbacks need to continue until consistency is reached
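One way to picture "continue until consistency is reached" is a fixed-point loop over the recorded dependencies. The function below is a sketch under stated assumptions, not the lecture's algorithm: `recovery_line` and its argument shapes are hypothetical, each dependency is a pair `((i, m), (j, n))` meaning INT_i(m) → INT_j(n), and `current[p]` is the interval process `p` is executing in when the failure happens.

```python
def recovery_line(current, deps, failed):
    """Return {pid: checkpoint number to restore} for every process that must roll back."""
    # survives[p] is the last interval of p whose effects are kept.
    survives = dict(current)
    survives[failed] = current[failed] - 1     # the failed process loses its current interval
    changed = True
    while changed:
        changed = False
        for (i, m), (j, n) in deps:
            # INT_i(m) was undone but INT_j(n) still relies on a message from it.
            if m > survives[i] and n <= survives[j]:
                survives[j] = n - 1            # roll P_j back to checkpoint n - 1
                changed = True
    return {p: k for p, k in survives.items() if k < current[p]}


# P_2 fails in interval 3. P_1 received a message from INT_2(3) during INT_1(3),
# and P_2 received a message from INT_1(3) during INT_2(2).
line = recovery_line(
    current={1: 3, 2: 3},
    deps={((2, 3), (1, 3)), ((1, 3), (2, 2))},
    failed=2,
)
```

In this example the rollback cascades: P_1 must return to checkpoint 2, which in turn forces P_2 back to checkpoint 1. The loop terminates because each `survives` value only decreases and cannot go below 0.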

Coordinated Checkpointing: Algorithm • Coordinator broadcasts CHECKPOINT-REQUEST message to all processes • When this request is received, • Process checkpoints local state • Acknowledges to coordinator that it has taken checkpoint and waits • When coordinator receives acknowledgements from all processes, it sends CHECKPOINT-DONE • Processes resume computation • What about messages?

Message handling • All incoming messages received after CHECKPOINT-REQUEST are not considered part of the checkpoint • All outgoing messages are held back until CHECKPOINT-DONE is received • This results in a “globally consistent state” • How?
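The request/acknowledge/done exchange and the hold-back rule can be simulated in a single thread. This is a simplified sketch: the `Worker` class, the shared `network` list, and the `"ACK"` string are hypothetical stand-ins, and real message delivery, failures, and timeouts are ignored.

```python
class Worker:
    """Hypothetical process taking part in a blocking coordinated checkpoint."""

    def __init__(self, name, network):
        self.name = name
        self.network = network      # shared list standing in for the network
        self.state = 0
        self.saved = None
        self.in_checkpoint = False
        self.held = []

    def receive(self, value):
        self.state += value         # changes live state only, never self.saved

    def send(self, payload):
        if self.in_checkpoint:
            self.held.append(payload)                  # hold back until CHECKPOINT-DONE
        else:
            self.network.append((self.name, payload))

    def on_checkpoint_request(self):
        self.saved = self.state     # checkpoint local state
        self.in_checkpoint = True
        return "ACK"

    def on_checkpoint_done(self):
        self.in_checkpoint = False
        for payload in self.held:   # release the held-back messages
            self.network.append((self.name, payload))
        self.held.clear()


network = []
a, b = Worker("A", network), Worker("B", network)
acks = [a.on_checkpoint_request()]
a.receive(7)                        # arrives after the request: not in A's checkpoint
a.send("x")                         # held back, B has not checkpointed yet
sent_before_done = list(network)
acks.append(b.on_checkpoint_request())
if all(ack == "ACK" for ack in acks):
    for w in (a, b):
        w.on_checkpoint_done()
```

Because nothing leaves a process between its checkpoint and CHECKPOINT-DONE, no saved state can record the receipt of a message that the sender's saved state has not yet sent. That is one way to read the "How?" above.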

Message Logging: Basic idea • Computations are deterministic and rely only on messages transmitted • Save messages from a checkpoint and replay them during recovery

Piecewise deterministic execution • A piecewise deterministic computation interval: • starts with a non-deterministic event (e.g. receipt of a message) • continues in a completely deterministic fashion • ends just before another non-deterministic event This implies that only non-deterministic events need to be logged.
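A small sketch of logging only the non-deterministic events (message receipts) and replaying them after a crash. The `LoggedProcess` class and its arithmetic handler are hypothetical. The log is an in-memory list here, whereas a real system would need it on storage that survives the crash.

```python
class LoggedProcess:
    def __init__(self):
        self.state = 0
        self.checkpoint = 0
        self.log = []               # messages received since the last checkpoint

    def deliver(self, msg):
        self.log.append(msg)        # log the non-deterministic event first
        self._handle(msg)

    def _handle(self, msg):
        # Deterministic: the result depends only on the state and the message.
        self.state = self.state * 2 + msg

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()            # older messages are no longer needed

    def recover(self):
        self.state = self.checkpoint
        for msg in self.log:        # replay in the original order
            self._handle(msg)


p = LoggedProcess()
p.deliver(3)
p.take_checkpoint()
p.deliver(1)
p.deliver(4)
before_crash = p.state
p.state = None                      # simulate losing the volatile state
p.recover()
```

Replaying the two logged messages on top of the checkpoint reproduces the pre-crash state exactly, which only works because `_handle` is deterministic between message receipts.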

Who should save the messages? [Figure: timelines of processes P, Q and R exchanging messages m1, m2 and m3, some logged and some unlogged. Q crashes and recovers. m2 is never replayed, so neither will m3.]

Orphan processes • Let DEP(m) represent processes that depend on message m • Let COPY(m) represent processes that contain a copy of m • but may not have logged it • Note, m contains all details necessary to retransmit it • A process Q is orphaned if and only if: • Q depends on m (i.e. Q ∈ DEP(m)) • All processes in COPY(m) have failed • So m cannot be played back
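The orphan condition translates directly into a set test. The function name `is_orphaned` and the example sets are hypothetical. DEP(m) and COPY(m) are passed in as plain Python sets, and how a real system would track them is left open.

```python
def is_orphaned(q, dep, copy, failed):
    """True if q depends on m and every process holding a copy of m has failed."""
    # copy <= failed is the subset test: all of COPY(m) is in the failed set.
    return q in dep and copy <= failed


# Hypothetical scenario: Q and R depend on m; only Q held a copy, and Q crashed.
only_q_had_copy = is_orphaned("R", dep={"Q", "R"}, copy={"Q"}, failed={"Q"})

# If P also holds a copy and is still alive, m can be played back.
p_still_has_copy = is_orphaned("R", dep={"Q", "R"}, copy={"P", "Q"}, failed={"Q"})
```

The first call reports R as orphaned, the second does not. This suggests why keeping at least one surviving (or stably logged) copy of m before other processes come to depend on it is enough to avoid orphans.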

Pessimistically avoiding orphan processes • Orphan processes can be avoided by ensuring that • A non-deterministic message is sent only to one process • That process cannot send another message without logging m

Further reading Chandy and Lamport, “Distributed Snapshots: Determining Global States of Distributed Systems”, ACM TOCS 1985

Acknowledgments All figures from Van Steen and Tanenbaum, Distributed Systems, 3rd Edition, Chapter 8.
