CSC2/458 Parallel and Distributed Systems Checkpointing and Recovery - - PowerPoint PPT Presentation

csc2 458 parallel and distributed systems checkpointing
SMART_READER_LITE
LIVE PREVIEW

CSC2/458 Parallel and Distributed Systems Checkpointing and Recovery - - PowerPoint PPT Presentation

CSC2/458 Parallel and Distributed Systems Checkpointing and Recovery Sreepathi Pai April 17, 2018 URCS Outline Checkpointing and Recovery Independent Checkpointing Coordinated Checkpointing Message Logging Outline Checkpointing and


slide-1
SLIDE 1

CSC2/458 Parallel and Distributed Systems Checkpointing and Recovery

Sreepathi Pai April 17, 2018

URCS

slide-2
SLIDE 2

Outline

Checkpointing and Recovery Independent Checkpointing Coordinated Checkpointing Message Logging

slide-3
SLIDE 3

Outline

Checkpointing and Recovery Independent Checkpointing Coordinated Checkpointing Message Logging

slide-4
SLIDE 4

Errors happen

  • Errors happen
  • How do we recover from them (say, for message loss)?
  • (before information theory): ?
  • (after information theory): ?
slide-5
SLIDE 5

Checkpointing and Recovery

To checkpoint is to save the state of a computation so that you can “rollback” to it

  • Examples:
  • Save games
  • Virtual machine snapshots

Recovery is then “simply” restoring the checkpoint

slide-6
SLIDE 6

Distributed Checkpointing: The Challenge

  • Processes only know:
  • which messages they have received
  • which messages they have sent
  • what their local state is
  • Checkpointing ideally should not require everybody to

“pause”

  • Must run concurrently with computation
slide-7
SLIDE 7

The Recovery Line

P1 P2 Initial state Failure Checkpoint Time Recovery line Inconsistent collection

  • f checkpoints

Message sent from P2 to P1

slide-8
SLIDE 8

Outline

Checkpointing and Recovery Independent Checkpointing Coordinated Checkpointing Message Logging

slide-9
SLIDE 9

Algorithm

  • A process records its local state independently
  • messages sent/received included
  • A recovery for a process entails going back to its most recent

checkpoint

  • Unfortunately, this can’t be done independently
slide-10
SLIDE 10

Rollbacks

m m* P1 P2 Initial state Failure Checkpoint Time

Assume P2 fails. How far we do need to rollback to achieve a consistent worldview?

slide-11
SLIDE 11

Detecting dependencies

  • For a process Pi, let INTi(m) be the interval between the

m − 1 and m checkpoints.

  • All messages sent in INTi(m) contain (i, m)
  • When process Pj receives this message, it may be in INTj(n)
  • records dependency INTi(m) → INTj(n)
  • saves dependency with checkpoint
slide-12
SLIDE 12

Rolling back: Consistency

  • If Pi rolls back to checkpoint m − 1, no messages from

INTi(m) were ever sent

  • All checkpoints dependent on INTi(m) are invalid
  • Rollbacks need to continue until consistency is reached
slide-13
SLIDE 13

Outline

Checkpointing and Recovery Independent Checkpointing Coordinated Checkpointing Message Logging

slide-14
SLIDE 14

Algorithm

  • Coordinator broadcasts CHECKPOINT-REQUEST message to

all processes

  • When this request is received,
  • Process checkpoints local state
  • Acknowledges to coordinator that it has taken checkpoint and

waits

  • When coordinator receives acknowledgements from all

processes, it sends CHECKPOINT-DONE

  • Processes resume computation
  • What about messages?
slide-15
SLIDE 15

Message handling

  • All incoming messages received after

CHECKPOINT-REQUEST are not considered part of the checkpoint

  • All outgoing messages are held back until

CHECKPOINT-DONE is received

  • This results in a “globally consistent state”
  • How?
slide-16
SLIDE 16

Outline

Checkpointing and Recovery Independent Checkpointing Coordinated Checkpointing Message Logging

slide-17
SLIDE 17

Basic idea

  • Computations are deterministic and rely only on messages

transmitted

  • Save messages from a checkpoint and replay them during

recovery

slide-18
SLIDE 18

Piecewise deterministic execution

  • A piecewise deterministic computation interval:
  • starts with a non-deterministic event (e.g. receipt of a

message)

  • continues in a completely deterministic fashion
  • ends just before another non-deterministic event

This implies that only non-deterministic events need to be logged.

slide-19
SLIDE 19

Who should save the messages?

P Q R Q crashes and recovers Unlogged message Logged message m1 m2 m2 m3 m3 m1 m2 is never replayed, so neither will m3 Time

slide-20
SLIDE 20

Orphan processes

  • Let DEP(m) represent processes that depend on message m
  • Let COPY (m) represent processes that contain a copy of m
  • but may not have logged it
  • Note, m contains all details necessary to retransmit it

A process Q is orphaned if and only if:

  • Q depends on m (i.e. Q ∈ DEP(m))
  • All processes in COPY (m) have failed
  • So m cannot be played back
slide-21
SLIDE 21

Pessimistically avoiding orphan processes

  • Orphan processes can be avoided by ensuring that
  • A non-deterministic message is sent only to one process
  • That process cannot send another message without logging m
slide-22
SLIDE 22

Further reading

Chandy and Lamport, “Distributed Snapshots: Determining Global States of Distributed Systems”, ACM TOCS 1985

slide-23
SLIDE 23

Acknowledgments

All figures from Van Steen and Tanenbaum, Distributed Systems, 3rd Edition, Chapter 8.