SLIDE 1

Context Fault-tolerance DFG CCK Simulations Perspectives

Optimized Coordinated Checkpoint/Rollback Protocol using a Dataflow Graph Model

Xavier Besseron and Thierry Gautier

{xavier.besseron | thierry.gautier}@imag.fr

Laboratoire d’Informatique de Grenoble MOAIS Project

APRETAF Workshop, January 2009

Xavier Besseron and Thierry Gautier Fault-Tolerance using Dataflow Graph 1/ 33

SLIDE 2

Outline

1. Context
2. Fault-tolerance
3. Data Flow Graph model in Kaapi
4. Coordinated Checkpointing in Kaapi
5. Simulations
6. Perspectives

SLIDE 4

Grid computing

What are grids?
- Clusters are computers connected by a LAN
- Grids are clusters connected by a WAN
- Heterogeneous (processors, networks, ...)
- Dynamic (failures, reservations, ...)

Aladdin – Grid’5000
- French experimental grid platform
- More than 4800 cores
- 9 sites in France
- 1 site in Brazil
- 1 site in Luxembourg

SLIDE 5

Fault-tolerance

[Figure: failure probability (0 to 1) versus number of processors (up to 5000), with curves for 1-day, 5-day, and 10-day execution times]

Why fault-tolerance?
- Fault probability is high on a grid
- Split a large computation into shorter, separate computations
- Dynamic reconfiguration
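
The slide does not state the model behind the failure-probability curves. As a rough sketch only, assuming independent and exponentially distributed failures with the same mean time between failures on every machine (the `mtbf_days` parameter is hypothetical, not a value from the talk), the trend can be reproduced as:

```cpp
#include <cmath>

// Probability that at least one of `processors` machines fails during a run
// of `duration_days`. ASSUMPTION: failures are independent and exponentially
// distributed with the same per-machine MTBF (hypothetical parameter).
double failure_probability(int processors, double duration_days, double mtbf_days) {
    // A machine survives the run with probability exp(-t / MTBF);
    // the run is failure-free only if every machine survives.
    return 1.0 - std::exp(-static_cast<double>(processors) * duration_days / mtbf_days);
}
```

Under this assumption the probability grows with both the number of processors and the execution time, which matches the shape described for the three curves.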

SLIDE 7

Fault-tolerance survey [Elnozahy02]

Duplication-based protocols [Avizienis76][Wiesmann99]
- Application execution is duplicated, spatially or temporally.

Log-based protocols [Alvisi98]
- Assume that the state of the system evolves according to non-deterministic events
- Non-deterministic events are logged in order to roll back from a previously saved checkpoint

Checkpoint/rollback protocols
- Periodically save the local process state of the applications.
- Uncoordinated checkpointing [Randell75]
- Coordinated checkpointing [Chandy85]
- Communication-induced checkpointing [Baldoni97]

SLIDE 8

Checkpoint/rollback protocols

Why a checkpoint/rollback protocol?
- Duplication protocols require too many resources [Wiesmann99], and a computation interruption can be tolerated
- Logging protocols require too many resources (memory and bandwidth) for communication-intensive applications [Elnozahy04]

Why coordinated checkpointing?
Coordinated checkpointing advantages:
- No domino effect [Elnozahy02]
- Low overhead on application communications [Bouteiller03][Zheng04]
- Coordination overhead can be amortized using a suitable checkpoint period [Elnozahy04]

SLIDE 9

Application state

Global state
The global state of an application is composed of:
- the local state of all its processes;
- the state of all its communication channels.

Coherent global state
A coherent global state is a state that can happen during a correct execution of the application.

[Figure: two space-time diagrams of processes P0, P1, and P2 exchanging messages m0, m1, and m2; one cut is a coherent global state, the other an incoherent global state]
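
The usual way to test coherence is to check that no message is recorded as received without also being recorded as sent. The sketch below illustrates that check; the `Message` type and the event-index encoding are illustrative assumptions, not part of Kaapi.

```cpp
#include <cstddef>
#include <vector>

// One message: index of the send event on `sender` and of the receive event
// on `receiver` (hypothetical encoding chosen for this illustration).
struct Message {
    std::size_t sender;
    int send_event;
    std::size_t receiver;
    int recv_event;
};

// cut[p] is the number of events of process p included in the recorded state.
// The state is coherent when every message recorded as received is also
// recorded as sent. A message sent but not yet received is simply in transit.
bool is_coherent(const std::vector<int>& cut, const std::vector<Message>& messages) {
    for (const Message& m : messages) {
        const bool received = m.recv_event < cut[m.receiver];
        const bool sent = m.send_event < cut[m.sender];
        if (received && !sent) return false;  // orphan message
    }
    return true;
}
```

For a single message from P0 to P1, a cut that includes the receive but not the send is rejected, while a cut that includes only the send is accepted because the message is then part of the channel state.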

SLIDE 10

Classical coordinated checkpoint/rollback protocol

Two steps:

Checkpoint step, during failure-free execution
Coordinate all processes to checkpoint a coherent global state:
- Coordinate all the processes
- Flush communication channels between all processes
- Save the processes' state

Rollback step, to recover after a failure
Global restart:
- Replace failed processes by new ones
- All processes restart from their last checkpoint
- Restart time is, in the worst case, the checkpoint period
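
The two steps can be sketched on a toy model. This is only an illustration under simplifying assumptions: each process state is a single counter, delivering a message adds its payload to the counter, and coordination is implicit because one function call handles all processes. None of these names come from Kaapi.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

struct Process {
    int state = 0;          // application state (a single counter in this toy model)
    int saved = 0;          // last checkpointed state, assumed to be on stable storage
    std::deque<int> inbox;  // messages still in the incoming channel
};

// Checkpoint step: flush every channel, then save every process state.
void coordinated_checkpoint(std::vector<Process>& procs) {
    for (Process& p : procs) {
        while (!p.inbox.empty()) {  // flush: deliver all in-flight messages
            p.state += p.inbox.front();
            p.inbox.pop_front();
        }
    }
    for (Process& p : procs) p.saved = p.state;
}

// Rollback step: replace the failed process by a fresh one, then restart
// every process from its last checkpoint.
void global_restart(std::vector<Process>& procs, std::size_t failed) {
    const int checkpoint_of_failed = procs[failed].saved;
    procs[failed] = Process{};
    procs[failed].saved = checkpoint_of_failed;
    for (Process& p : procs) {
        p.state = p.saved;
        p.inbox.clear();
    }
}
```

Note that `global_restart` discards the progress of every process, including the non-failed ones, which is why the restart cost can reach a full checkpoint period.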

SLIDE 11

Challenging problems

How to improve the performance of coordinated checkpoint/rollback protocols?
- Reduce the synchronization cost [Koo87]
- Speed up restart [Bouteiller03][Zheng04]
- Reduce lost computation time in case of a fault

SLIDE 13

Applications: simulation of physical phenomena

Characteristics
- Iterative domain decomposition applications
- Large amount of data

Parallelization: static scheduling
- Iterative applications ⇒ only schedule the loop “kernel”
- Large data ⇒ preserve locality

[Figure: domain decomposed into subdomains assigned to processes P0 to P7]

SLIDE 14

Applications: simulation of physical phenomena

[Figure: the domain distributed over processes P0 to P4, repeated along the iterations axis]

SLIDE 15

Data Flow Graph

How does it work?
- Partition the one-iteration graph
- Generate communication tasks
- Distribute each sub-graph on all the processes
- Repeat the sub-graphs to iterate

[Figure: data flow graph partitioned over processes P0, P1, and P2. Legend: computation task, data, dependency, send task, receive task, communication]

SLIDE 16

Key point: abstract representation

The Data Flow Graph

Properties
- A task is the computational unit
- A process is composed of a (dynamic) sequence of tasks
- At any time, Kaapi can discover the tasks not yet executed and their dependencies

This abstract representation shows the future of the execution.
The data flow graph representation is causally connected to the application execution.

Usage: analyze and transform the application state and behavior
- Schedule tasks (at any time)
- Checkpoint application state
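
A minimal sketch of such a representation is shown here. The `Task` structure and both functions are hypothetical names chosen for illustration and do not reflect Kaapi's internal data structures. The point is that the set of not-yet-executed tasks, together with their dependencies, can be queried at any time.

```cpp
#include <cstddef>
#include <vector>

struct Task {
    int process = 0;                  // process the task is mapped on
    std::vector<std::size_t> inputs;  // indices of the tasks producing its input data
    bool executed = false;
};

// Tasks not yet executed: the "future of the execution".
std::vector<std::size_t> pending_tasks(const std::vector<Task>& graph) {
    std::vector<std::size_t> pending;
    for (std::size_t i = 0; i < graph.size(); ++i)
        if (!graph[i].executed) pending.push_back(i);
    return pending;
}

// Pending tasks whose producers have all run: candidates for scheduling.
std::vector<std::size_t> ready_tasks(const std::vector<Task>& graph) {
    std::vector<std::size_t> ready;
    for (std::size_t i : pending_tasks(graph)) {
        bool inputs_available = true;
        for (std::size_t producer : graph[i].inputs)
            inputs_available = inputs_available && graph[producer].executed;
        if (inputs_available) ready.push_back(i);
    }
    return ready;
}
```

In this sketch, a checkpoint of the application state would amount to saving the pending tasks and the data they read, and a scheduler would pick from `ready_tasks`.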

SLIDE 18

Checkpoint step

Classical protocol checkpoint
Coordinate all processes to checkpoint a coherent global state:
- Coordinate all the processes
- Flush communication channels between all processes
- Save the processes' state

CCK: differences with the classical protocol
Optimize the checkpoint step using the abstract representation of the execution (data flow graph):
- Partial flush: only between processes that communicate
- Incremental checkpoint: save only modified data
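
Both optimizations can be illustrated with a short sketch. It assumes that the communications are available as a list of (sender, receiver) pairs taken from the graph and that each datum carries a modified flag; these types are hypothetical and are not Kaapi's API.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Communication { int from; int to; };  // one send/receive pair in the graph

// Partial flush: only the channels that appear in the data flow graph are flushed.
std::set<std::pair<int, int>> channels_to_flush(const std::vector<Communication>& comms) {
    std::set<std::pair<int, int>> channels;
    for (const Communication& c : comms) channels.emplace(c.from, c.to);
    return channels;
}

struct Datum { std::string value; bool modified; };

// Incremental checkpoint: write only the data modified since the last
// checkpoint, then clear the flags. Returns the number of items saved.
std::size_t incremental_checkpoint(std::map<std::string, Datum>& data,
                                   std::map<std::string, std::string>& storage) {
    std::size_t saved = 0;
    for (auto& entry : data) {
        if (!entry.second.modified) continue;
        storage[entry.first] = entry.second.value;
        entry.second.modified = false;
        ++saved;
    }
    return saved;
}
```

With N processes, a full flush touches on the order of N² channels, whereas the partial flush touches only the pairs that actually exchange data in the graph.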

SLIDE 19

Recovery: classical protocol vs CCK

Classical protocol restart
Global restart:
- Replace failed processes by new ones
- All processes restart from their last checkpoint
- Restart time is, in the worst case, the checkpoint period

CCK protocol restart
Partial restart:
- Detect lost communications for the failed processes
- Find the strictly required computation set to make the global state coherent
- Statically schedule this task set
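
One possible way to compute the required task set is a backward traversal of the graph, sketched here. This is a simplified illustration and not the authors' algorithm: it assumes that a non-failed process no longer holds the data it sent, so the producer of any lost input has to be re-executed as well, and that data produced before the last checkpoint is available from that checkpoint. The `Task` structure is hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <vector>

struct Task {
    int process = 0;
    std::vector<std::size_t> inputs;         // producers of the task's input data
    bool executed_since_checkpoint = false;
};

// Start from the tasks lost on the failed process, then walk the dependencies
// backwards through every producer that ran after the last checkpoint.
std::set<std::size_t> tasks_to_reexecute(const std::vector<Task>& graph, int failed) {
    std::set<std::size_t> result;
    std::vector<std::size_t> stack;
    for (std::size_t i = 0; i < graph.size(); ++i)
        if (graph[i].process == failed && graph[i].executed_since_checkpoint)
            stack.push_back(i);
    while (!stack.empty()) {
        const std::size_t t = stack.back();
        stack.pop_back();
        if (!result.insert(t).second) continue;  // already visited
        for (std::size_t producer : graph[t].inputs)
            if (graph[producer].executed_since_checkpoint)
                stack.push_back(producer);
    }
    return result;
}
```

Tasks of non-failed processes that the failed process never depended on stay out of the set, which is what makes the restart partial.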

SLIDE 20

After a checkpoint

[Figure: data flow graph with elements numbered 1 to 6, distributed over three non-failed processes. Legend: non-executed task, data, dependency, communication, send task, receive task]

SLIDE 21

A process failed

[Figure: the same data flow graph with two non-failed processes and one failed process. Legend: non-executed task, executed task, data, dependency, communication, send task, receive task]

SLIDE 22

Incoherent application state

[Figure: the same data flow graph after the failure, with the tasks to re-execute highlighted. Legend: non-executed task, executed task, task to re-execute, data, dependency, communication, send task, receive task]

SLIDE 23

Lost communications

[Figure: the same data flow graph with the lost communications marked C_lost. Legend: non-executed task, executed task, task to re-execute, data, dependency, communication, send task, receive task]

SLIDE 24

Communications to replay

[Figure: the same data flow graph with the communications to replay marked (label "Call" in the extracted text). Legend: non-executed task, executed task, task to re-execute, data, dependency, communication, send task, receive task]

SLIDE 25

Tasks to re-execute

[Figure: the same data flow graph with the set of tasks to re-execute marked T_to_re-execute. Legend: non-executed task, executed task, task to re-execute, data, dependency, communication, send task, receive task]

SLIDE 26

Recovery: classical protocol vs CCK

Classical protocol restart
[Figure: timeline of the processes showing the past of the execution, the last checkpoint, the failure, and the next checkpoint]

CCK protocol restart
[Figure: the same timeline for the CCK protocol]

SLIDE 27

Recovery: classical protocol vs CCK

Classical protocol restart
[Figure: timeline of the processes showing the last checkpoint, the failure, and the next checkpoint, with the recovery work $W^{std}_{recovery}$ marked]

CCK protocol restart
[Figure: the same timeline for the CCK protocol, with the recovery work $W^{cck}_{recovery}$ marked]

SLIDE 28

Recovery: classical protocol vs CCK

Classical protocol restart
[Figure: timeline of the processes showing the last checkpoint, the recovery work $W^{std}_{recovery}$, and the end of recovery]

CCK protocol restart
[Figure: the same timeline for the CCK protocol, showing the recovery work $W^{cck}_{recovery}$ and the end of recovery]

SLIDE 30

Recovery: Cost analysis

Classical protocol restart
- Required work to recover: $W^{std}_{recovery} = O(N \cdot \tau)$
- Restart time on N processes: $T^{std}_{restart} = O(\tau)$

CCK protocol restart
- Required work to recover: $W^{cck}_{recovery} = O(N_{failed} \cdot \tau + \varepsilon_{application,\tau})$
- Restart time on N processes: $T^{cck}_{restart} = O\left(\frac{N_{failed} \cdot \tau + \varepsilon_{application,\tau}}{N}\right)$

We have to add the CCK-recovery overhead: $O(N \cdot K)$ messages + $O(|G|)$ in time + data distribution cost

K is an application-dependent constant that represents the number of neighbors.
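
As a rough reading of these bounds, assuming the hidden constants are comparable and neglecting the CCK-recovery overhead, the ratio of restart times reduces to the ratio of recovery works:

```latex
\[
\frac{T^{cck}_{restart}}{T^{std}_{restart}}
  \approx \frac{\left(N_{failed} \cdot \tau + \varepsilon_{application,\tau}\right) / N}{\tau}
  = \frac{N_{failed} \cdot \tau + \varepsilon_{application,\tau}}{N \cdot \tau}
  \approx \frac{W^{cck}_{recovery}}{W^{std}_{recovery}}
\]
```

Under these assumptions, the fraction of tasks to re-execute is also an estimate of the fraction of the classical restart time that CCK needs when the recovery work is fully parallelized.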

SLIDE 32

Simulations: case study

Application
- Jacobi method on a 3D domain
- $2{,}048^3$ domain (64 GB)
- Split into $64^3$ subdomains (32 KB each)
- Subdomain update computed in 10 ms

Scenario
- One process failed
- Simulation of the restart in the worst case
  ⇒ % of tasks to re-execute ($W^{cck}_{recovery} / W^{std}_{recovery}$)
  ⇒ Involved processes

SLIDE 33

CCK restart: checkpoint period influence

- 1,024 processors, i.e. 256 subdomains (64 MB) per process
- One iteration lasts about 2.5 seconds

[Figure: percentage with respect to the classical protocol (0 to 100) versus checkpoint period in seconds (up to 600), with curves for tasks to re-execute and involved processes]

For a 60-second period, the estimated restart time is:
- 60 seconds with the classical protocol
- 3.6 seconds with CCK (if totally parallelized)
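
The figures quoted in the case study can be cross-checked with a few lines of arithmetic. The sketch assumes that a process updates its local subdomains one after another during an iteration; that assumption is not stated on the slide, but it is consistent with the quoted iteration time.

```cpp
// Figures quoted in the case study.
constexpr long subdomains_per_axis = 64;     // 64^3 subdomains in total
constexpr long processes = 1024;
constexpr double update_seconds = 0.010;     // one subdomain update
constexpr double period_seconds = 60.0;      // checkpoint period
constexpr double cck_restart_seconds = 3.6;  // estimated CCK restart time

constexpr long total_subdomains =
    subdomains_per_axis * subdomains_per_axis * subdomains_per_axis;
constexpr long subdomains_per_process = total_subdomains / processes;

// ASSUMPTION: one iteration updates every local subdomain once, sequentially.
constexpr double iteration_seconds = subdomains_per_process * update_seconds;

// Restart time of CCK relative to the classical protocol for this period.
constexpr double restart_ratio = cck_restart_seconds / period_seconds;

static_assert(subdomains_per_process == 256, "256 subdomains per process");
```

This gives 262,144 subdomains in total, 256 per process, and 2.56 seconds per iteration, in line with the "about 2.5 seconds" quoted. The estimated CCK restart time is 6% of the classical one for a 60-second period.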

SLIDE 34

CCK restart: process number influence

[Figure: number of involved processes versus process number (1000 to 9000), for checkpoint periods of 5 s, 10 s, 25 s, 50 s, and 100 s, compared with the classical protocol]

[Figure: percentage with respect to the classical protocol (0 to 100) versus process number (1000 to 9000), for the same checkpoint periods and the classical protocol]

SLIDE 35

Local re-ordering

Default execution order

SLIDE 39

Local re-ordering

With local re-ordering

SLIDE 46

CCK restart: local re-ordering influence

[Figure: number of tasks to re-execute versus fault date in seconds (up to 50 s), with the checkpoints marked along the time axis, comparing CCK without local re-ordering, CCK with local re-ordering, and the classical protocol]

SLIDE 48

Perspectives

Performance guarantees for failure-free executions
The goal is to optimize the protocol parameters:
- Interval between checkpoint events
- Number and mapping of checkpoint servers

Dynamic reconfiguration
Adding or removing nodes requires a new static schedule:
- Checkpoint to get a coherent global state
- Statically schedule for the new number of nodes
- Resume the execution

SLIDE 49

Thanks for your attention

Questions?

SLIDE 50

Kaapi parallel programming model

The application is described as a data flow graph.

API
- Global address space
- Independent of the number of processors
- Data (Shared<...>): declares an object in the global memory
- Tasks (Fork<...>): creates a new task that may be executed concurrently with other tasks
- Access mode, given by the task: Read, Write, Exclusive, Concurrent write

    Shared<Matrix> A;
    Shared<double> B;
    Fork<Task>()(A, B);

SLIDE 51

Optimized CCK restart

[Figure: the same data flow graph with the reduced set of tasks to re-execute marked T_to_re-execute (optimised). Legend: non-executed task, executed task, task to re-execute, data, data in memory, dependency, communication, send task, receive task]

SLIDE 52

First experiments: 3D-domain decomposition

Preliminary results, Kaapi vs MPICH:

[Figure: mean time for an iteration in seconds (0.1 to 0.9) versus number of nodes (16, 32, and 64 on 1 cluster; 64+32 and 64+64 on 2 clusters), comparing Kaapi and MPICH]