SLIDE 1

Introduction Model DP Algo Experiments Conclusion

Combining Checkpointing and Replication for Reliable Execution of Linear Workflows

Anne Benoit (1,2), Aurélien Cavelan (3), Florina M. Ciorba (3), Valentin Le Fèvre (1), Yves Robert (1,4)

  • 1. LIP, École Normale Supérieure de Lyon, France
  • 2. Georgia Institute of Technology, Atlanta, GA, USA
  • 3. University of Basel, Switzerland
  • 4. University of Tennessee, Knoxville, TN, USA

http://graal.ens-lyon.fr/~abenoit/

APDCM workshop @ IPDPS’18, Vancouver, May 21, 2018

APDCM’18 Anne.Benoit@ens-lyon.fr Combining checkpointing and replication 1/ 26

SLIDE 2

Linear workflows

  • High-performance computing (HPC) application: chain of tasks T1 → T2 → · · · → Tn
  • Parallel tasks executed on the whole platform
  • For instance: tightly-coupled computational kernels, image processing applications, ...
  • Goal: efficient execution, i.e., minimize total execution time

SLIDE 4

Reliable execution

Hierarchical

  • 10^5 or 10^6 nodes
  • Each node equipped with 10^4 or 10^3 cores

Failure-prone

  MTBF, one node                   1 year    10 years   120 years
  MTBF, platform of 10^6 nodes     30 sec    5 min      1 h

  • More nodes ⇒ Shorter MTBF (Mean Time Between Failures)
  • Need to ensure that the execution will be reliable, i.e., without failures
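The platform MTBF values quoted on this slide are consistent with dividing the per-node MTBF by the number of nodes. A minimal Python sketch of that arithmetic follows, assuming independent nodes with identical failure behaviour; the function name `platform_mtbf` and the 365-day year are illustrative choices, not taken from the talk.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: a 365-day year


def platform_mtbf(node_mtbf_seconds: float, num_nodes: int) -> float:
    """MTBF of a platform made of independent, identical nodes."""
    return node_mtbf_seconds / num_nodes


# Reproduce the three columns of the slide for a platform of 10^6 nodes.
for years in (1, 10, 120):
    mtbf = platform_mtbf(years * SECONDS_PER_YEAR, 10**6)
    print(f"node MTBF {years:>3} years -> platform MTBF {mtbf:8.1f} s")
```

The three results (about 32 seconds, about 5 minutes, and about 1 hour) match the rounded figures on the slide.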

SLIDE 5

Coping with fail-stop errors with checkpoints

Checkpoint, rollback, and recovery:

[Figure: three execution timelines]

  • No error: T1 C1 T2 T3 C3 T4 C4
  • A fail-stop error strikes during the execution: T1 C1 T2 T3 C3 T4 C4
  • Execution with the error: T1 C1 T2 T3 R2 T2 T3 C3 · · ·

  • Coordinated checkpointing (the platform is a giant macro-processor)
  • Assume instantaneous interruption and detection
  • Rollback to last checkpoint and re-execute

SLIDE 8

Coping with fail-stop errors with replication

[Figure: execution timelines with replication, shown without error and with fail-stop errors. Main row: T1(p/2) C1 T2(p) T3(p) C3 T4(p/2) T5(p) C5. Second row: T1(p/2) C1 T4(p/2).]

  • The whole platform is used at all times; some tasks are replicated
  • If a failure hits a replicated task, there is no need to roll back
  • Otherwise, roll back to the last checkpoint and re-execute

SLIDE 11

Contributions

  • Both checkpointing and replication have been extensively studied
  • Combination of both techniques not yet investigated
  • Detailed model
  • Optimal dynamic programming algorithm
  • Experiments to evaluate impact of using both replication and checkpointing during execution
  • Guidelines about when to checkpoint only, replicate only, or combine both techniques

SLIDE 13

Outline

  1. Model and objective
  2. Optimal dynamic programming algorithm
  3. Experiments
  4. Conclusion

SLIDE 14

Application and platform model

Application:

  • Chain T1 → T2 → · · · → Tn
  • Parallel tasks: (failure-free) execution time of Ti using qi processors is $w_i \left( \alpha_i + \frac{1-\alpha_i}{q_i} \right)$ (Amdahl's law)

Platform:

  • Homogeneous platform with p processors Pi, 1 ≤ i ≤ p
  • Fail-stop errors, Exponential distribution, error rate $\lambda_{\mathrm{ind}}$
  • $P(X \le T) = 1 - e^{-q \lambda_{\mathrm{ind}} T}$ on q processors
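The two model formulas on this slide (Amdahl's law for the task length and the Exponential failure law on q processors) can be evaluated directly. The sketch below is only an illustration: the function names and the sample values are assumptions, not taken from the talk.

```python
import math


def exec_time(w: float, alpha: float, q: int) -> float:
    """Failure-free time of a task with work w and sequential fraction alpha on q processors."""
    return w * (alpha + (1.0 - alpha) / q)


def failure_probability(lam_ind: float, q: int, t: float) -> float:
    """P(X <= t) = 1 - exp(-q * lam_ind * t): some processor among q fails before time t."""
    return -math.expm1(-q * lam_ind * t)


# Illustrative values only: 10,000 s of work, 10% sequential, 100 processors.
t = exec_time(10_000.0, 0.1, 100)
print(f"failure-free time: {t:.1f} s")
print(f"P(failure before completion): {failure_probability(1e-7, 100, t):.6f}")
```

Using `math.expm1` avoids the loss of precision that `1 - math.exp(-x)` suffers when the exponent is very small, which is the usual regime for per-processor error rates.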

SLIDE 15

Checkpointing

  • Checkpointing time: $C_i(q_i) = a_i + \frac{b_i}{q_i} + c_i q_i$
      • $a_i + \frac{b_i}{q_i}$: communication time with latency $a_i$
      • $c_i q_i$: message passing overhead
  • Downtime D
  • Recovery cost $R_{j+1}$ (where Tj is the last checkpointed task)
  • $R_{i+1}(q_i) = C_i(q_i)$ for 1 ≤ i ≤ n − 1: recovering for $T_{i+1}$ ≈ reading $C_i$
  • T0 with $w_0 = 0$ checkpointed (input time $R_1(q_1)$)
  • Tn always checkpointed (output time $C_n(q_n)$)
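The checkpoint cost model combines a term that shrinks with the number of processors and a term that grows with it. The following sketch evaluates it for a few processor counts; the function name and the constants are purely illustrative assumptions.

```python
def checkpoint_time(a: float, b: float, c: float, q: int) -> float:
    """C(q) = a + b/q + c*q: latency, data transfer shared by q processors, message overhead."""
    return a + b / q + c * q


# Hypothetical constants: the b/q term dominates for small q, the c*q term for large q.
a, b, c = 1.0, 100.0, 0.01
for q in (10, 50, 100, 200, 1000):
    print(f"q = {q:>4}: C(q) = {checkpoint_time(a, b, c, q):.2f}")
```

Setting the derivative of a + b/q + cq to zero shows that, over real-valued q, this cost is smallest at q = sqrt(b/c), which is 100 for the constants above.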

SLIDE 16

No replication

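  • Ti not replicated: costs $C_i^{\mathrm{norep}}$ and $R_i^{\mathrm{norep}}$
  • Failure-free execution time: $T_i^{\mathrm{norep}} = w_i \left( \alpha_i + \frac{1-\alpha_i}{p} \right)$
  • Expected execution time $E^{\mathrm{norep}}(i)$:

$$E^{\mathrm{norep}}(i) = P(X_p \le T_i^{\mathrm{norep}}) \left( T_{\mathrm{lost}}^{\mathrm{norep}}(T_i^{\mathrm{norep}}) + D + R_i^{\mathrm{norep}} + E^{\mathrm{norep}}(i) \right) + \left( 1 - P(X_p \le T_i^{\mathrm{norep}}) \right) T_i^{\mathrm{norep}}$$

  • $P(X_p \le t) = 1 - e^{-\lambda_{\mathrm{ind}} p t}$: probability of failure on one of the p processors before time t

$$T_{\mathrm{lost}}^{\mathrm{norep}}(T_i^{\mathrm{norep}}) = \frac{1}{\lambda_{\mathrm{ind}} p} - \frac{T_i^{\mathrm{norep}}}{e^{\lambda_{\mathrm{ind}} p T_i^{\mathrm{norep}}} - 1}$$

$$E^{\mathrm{norep}}(i) = \left( e^{\lambda_{\mathrm{ind}} p T_i^{\mathrm{norep}}} - 1 \right) \left( \frac{1}{\lambda_{\mathrm{ind}} p} + D + R_i^{\mathrm{norep}} \right)$$

  • If Ti is checkpointed, add $C_i^{\mathrm{norep}}$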
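The closed form for the expected execution time without replication can be checked numerically against the recursive definition it is derived from. The sketch below does this in Python; `lam` stands for the platform rate (the per-processor rate times p), and the function names and sample values are assumptions made for illustration.

```python
import math


def expected_time_closed_form(t: float, lam: float, downtime: float, recovery: float) -> float:
    """E = (e^{lam*t} - 1) * (1/lam + D + R), with lam the platform error rate."""
    return math.expm1(lam * t) * (1.0 / lam + downtime + recovery)


def expected_time_from_recursion(t: float, lam: float, downtime: float, recovery: float) -> float:
    """Solve E = P*(T_lost + D + R + E) + (1 - P)*t for E."""
    p_fail = -math.expm1(-lam * t)                   # P(X_p <= t)
    t_lost = 1.0 / lam - t / math.expm1(lam * t)     # expected time lost before the failure
    return (p_fail * (t_lost + downtime + recovery) + (1.0 - p_fail) * t) / (1.0 - p_fail)


# Illustrative values: a 1000 s task, platform rate 1e-4 per second, D = 60 s, R = 120 s.
args = (1000.0, 1e-4, 60.0, 120.0)
print(expected_time_closed_form(*args))
print(expected_time_from_recursion(*args))
```

Both functions return the same value up to rounding, and that value exceeds the failure-free time of 1000 s, as expected once re-executions are accounted for.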

SLIDE 18

Replication

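  • Ti replicated: if a copy fails, downtime + recovery
  • Each copy uses p/2 processors; costs $C_i^{\mathrm{rep}}$ and $R_i^{\mathrm{rep}}$
  • Failure-free execution time: $T_i^{\mathrm{rep}} = w_i \left( \alpha_i + \frac{1-\alpha_i}{p/2} \right)$
  • Expected execution time $E^{\mathrm{rep}}(i)$ if $T_{i-1}$ is checkpointed:

$$E^{\mathrm{rep}}(i) = P(Y_p \le T_i^{\mathrm{rep}}) \left( T_{\mathrm{lost}}^{\mathrm{rep}}(T_i^{\mathrm{rep}}) + D + R_i^{\mathrm{rep}} + E^{\mathrm{rep}}(i) \right) + \left( 1 - P(Y_p \le T_i^{\mathrm{rep}}) \right) T_i^{\mathrm{rep}}$$

  • $P(Y_p \le t) = \left( 1 - e^{-\frac{\lambda_{\mathrm{ind}} p}{2} t} \right)^2$: probability of failure on both replicas of p/2 processors before time t
  • $T_{\mathrm{lost}}^{\mathrm{rep}}(T_i^{\mathrm{rep}})$ computed as before . . . Formula for $E^{\mathrm{rep}}(i)$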
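To see what replication buys, one can compare the probability that a task is interrupted with and without replication, using the two failure laws from the slides. The sketch below assumes a fully parallel task (sequential fraction 0), so the replicated copy on p/2 processors runs twice as long; the names and numbers are illustrative only.

```python
import math


def p_fail_norep(lam_ind: float, p: int, t: float) -> float:
    """P(X_p <= t): some processor among p fails before time t."""
    return -math.expm1(-lam_ind * p * t)


def p_fail_rep(lam_ind: float, p: int, t: float) -> float:
    """P(Y_p <= t): both replicas, each on p/2 processors, fail before time t."""
    one_replica = -math.expm1(-lam_ind * (p / 2) * t)
    return one_replica ** 2


# Illustrative values: work w on p processors, fully parallel task.
lam_ind, p, w = 1e-6, 1000, 1e5
t_norep = w / p            # task length on p processors
t_rep = w / (p / 2)        # task length on p/2 processors (twice as long)
print(p_fail_norep(lam_ind, p, t_norep))
print(p_fail_rep(lam_ind, p, t_rep))
```

In this fully parallel case the replicated probability is exactly the square of the non-replicated one, so replication trades a doubled failure-free time for a much smaller chance of rolling back.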

SLIDE 20

Optimization problem

ChainsRepCkpt optimization problem

  • Minimize the expected makespan of the workflow
  • Four possibilities for each task: checkpoint or not, and replicate or not

[Figure: example schedule. Main row: T1(p/2) C1 T2(p) T3(p) C3 T4(p/2) T5(p) C5. Second row: T1(p/2) C1 T4(p/2).]

SLIDE 22

Optimization problem

Theorem. The optimal solution to the ChainsRepCkpt problem can be obtained using a dynamic programming algorithm in $O(n^2)$ time, where n is the number of tasks in the chain.

  • Recursively computes expectation of optimal time required to execute tasks T1 to Ti and then checkpoint Ti
  • Distinguish whether Ti is replicated or not:
      • $T_{\mathrm{opt}}^{\mathrm{rep}}(i)$: knowing that Ti is replicated
      • $T_{\mathrm{opt}}^{\mathrm{norep}}(i)$: knowing that Ti is not replicated
  • Solution: $\min \left\{ T_{\mathrm{opt}}^{\mathrm{rep}}(n) + C_n^{\mathrm{rep}},\; T_{\mathrm{opt}}^{\mathrm{norep}}(n) + C_n^{\mathrm{norep}} \right\}$

SLIDE 23

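Computing $T_{\mathrm{opt}}^{\mathrm{rep}}(j)$: Tj is replicated

$$T_{\mathrm{opt}}^{\mathrm{rep}}(j) = \min_{1 \le i < j} \left\{ \begin{array}{l}
T_{\mathrm{opt}}^{\mathrm{rep}}(i) + C_i^{\mathrm{rep}} + T_{NC}^{\mathrm{rep},\mathrm{rep}}(i+1, j), \\
T_{\mathrm{opt}}^{\mathrm{rep}}(i) + C_i^{\mathrm{rep}} + T_{NC}^{\mathrm{norep},\mathrm{rep}}(i+1, j), \\
T_{\mathrm{opt}}^{\mathrm{norep}}(i) + C_i^{\mathrm{norep}} + T_{NC}^{\mathrm{rep},\mathrm{rep}}(i+1, j), \\
T_{\mathrm{opt}}^{\mathrm{norep}}(i) + C_i^{\mathrm{norep}} + T_{NC}^{\mathrm{norep},\mathrm{rep}}(i+1, j), \\
R_1^{\mathrm{rep}} + T_{NC}^{\mathrm{rep},\mathrm{rep}}(1, j), \\
R_1^{\mathrm{norep}} + T_{NC}^{\mathrm{norep},\mathrm{rep}}(1, j)
\end{array} \right\}$$

  • Ti: last checkpointed task before Tj
  • Ti can be replicated or not
  • Ti+1 can be replicated or not
  • $T_{NC}^{A,B}$: no intermediate checkpoint, first/last task replicated or not, previous task checkpointed
  • Similar equation for $T_{\mathrm{opt}}^{\mathrm{norep}}(j)$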
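The recurrence on this slide can be sketched as a double loop over the last task Tj and the last checkpointed task Ti. The Python below is a minimal illustration, not the authors' implementation: the no-checkpoint segment cost is passed in as a hypothetical callable `t_nc(A, B, i, j)`, the names `ckpt` and `rec1` are invented for the example, and the rule that a single-task segment must have identical first and last modes is an assumption about the base case.

```python
import math
from typing import Callable, Dict, List

MODES = ("rep", "norep")


def t_opt_table(
    n: int,
    ckpt: Dict[str, List[float]],               # ckpt[X][i] = C_i^X (index 0 unused)
    rec1: Dict[str, float],                     # rec1[A] = R_1^A, cost of reading the input
    t_nc: Callable[[str, str, int, int], float],
) -> Dict[str, List[float]]:
    """Return opt[B][j] = T_opt^B(j) for every task j and mode B of T_j."""
    opt = {mode: [math.inf] * (n + 1) for mode in MODES}
    for j in range(1, n + 1):
        for last in MODES:                      # B: mode of T_j
            best = math.inf
            for first in MODES:                 # A: mode of the first task of the segment
                # Segment T_1..T_j with no checkpoint before it: pay R_1^A.
                if j > 1 or first == last:
                    best = min(best, rec1[first] + t_nc(first, last, 1, j))
                # Segment T_{i+1}..T_j, where T_i is the last checkpointed task.
                for i in range(1, j):
                    if i + 1 == j and first != last:
                        continue                # single task: A and B must coincide
                    for prev in MODES:          # mode of the checkpointed task T_i
                        cand = opt[prev][i] + ckpt[prev][i] + t_nc(first, last, i + 1, j)
                        best = min(best, cand)
            opt[last][j] = best
    return opt


# Toy failure-free instance (illustrative numbers): each task takes 1 time unit,
# every checkpoint costs 5, and reading the input is free.
n = 4
ckpt = {m: [0.0] + [5.0] * n for m in MODES}
rec1 = {m: 0.0 for m in MODES}
table = t_opt_table(n, ckpt, rec1, lambda a, b, i, j: float(j - i + 1))
print(min(table[m][n] + ckpt[m][n] for m in MODES))
```

In this toy instance there are no failures, so intermediate checkpoints only add cost and the sketch selects none of them; with failure-aware segment costs the same loop trades checkpoint cost against re-execution time.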

SLIDE 25

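Computing $T_{NC}^{A,B}(i, j)$

$$T_{NC}^{A,B}(i, j) = \min \left\{ T_{NC}^{A,\mathrm{rep}}(i, j-1),\; T_{NC}^{A,\mathrm{norep}}(i, j-1) \right\} + T^{A,B}(j \mid i)$$

$T^{A,B}(j \mid i)$: time needed to execute task Tj, knowing that a failure during Tj implies recovering from Ti:

$$T^{A,\mathrm{norep}}(j \mid i) = \left( 1 - e^{-\lambda T_j^{\mathrm{norep}}} \right) \left( T_{\mathrm{lost}}^{\mathrm{norep}}(T_j^{\mathrm{norep}}) + D + R_i^A + \min \left\{ T_{NC}^{A,\mathrm{rep}}(i, j-1),\; T_{NC}^{A,\mathrm{norep}}(i, j-1) \right\} + T^{A,\mathrm{norep}}(j \mid i) \right) + e^{-\lambda T_j^{\mathrm{norep}}} \, T_j^{\mathrm{norep}}$$

$$T^{A,\mathrm{rep}}(j \mid i) = \left( 1 - e^{-\frac{\lambda T_j^{\mathrm{rep}}}{2}} \right)^2 \left( T_{\mathrm{lost}}^{\mathrm{rep}}(T_j^{\mathrm{rep}}) + D + R_i^A + \min \left\{ T_{NC}^{A,\mathrm{rep}}(i, j-1),\; T_{NC}^{A,\mathrm{norep}}(i, j-1) \right\} + T^{A,\mathrm{rep}}(j \mid i) \right) + \left( 1 - \left( 1 - e^{-\frac{\lambda T_j^{\mathrm{rep}}}{2}} \right)^2 \right) T_j^{\mathrm{rep}}$$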
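Each of these equations defines its left-hand side in terms of itself, so it has to be solved as a fixed point. The sketch below does this for the non-replicated case only, where the expression for the expected lost time is given on an earlier slide. It reads λ as the platform rate (the per-processor rate times p), which is an assumption based on the earlier failure law; `prefix` is a hypothetical name for the minimum over the two T_NC values up to task j − 1.

```python
import math


def t_lost_norep(lam: float, t: float) -> float:
    """Expected time lost when a failure strikes a non-replicated task of length t."""
    return 1.0 / lam - t / math.expm1(lam * t)


def step_time_norep(t_j: float, lam: float, downtime: float,
                    rec_i: float, prefix: float) -> float:
    """Solve x = P*(T_lost + D + R_i + prefix + x) + (1 - P)*t_j for x = T^{A,norep}(j | i)."""
    p_fail = -math.expm1(-lam * t_j)
    lost = t_lost_norep(lam, t_j)
    return (p_fail * (lost + downtime + rec_i + prefix) + (1.0 - p_fail) * t_j) / (1.0 - p_fail)


# Illustrative values: the same task costs more when more un-checkpointed work precedes it.
for prefix in (0.0, 500.0, 2000.0):
    x = step_time_norep(1000.0, 1e-4, 60.0, 120.0, prefix)
    print(f"prefix = {prefix:6.0f} s -> expected time {x:.1f} s")
```

With `prefix = 0` this reduces to the expected time of a single non-replicated task, and the result grows linearly with `prefix`, which reflects the cost of re-executing every task since the last checkpoint.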

SLIDE 26

Complexity

  • Compute $O(n^2)$ intermediate values $T^{A,B}(j \mid i)$ and $T_{NC}^{A,B}(i, j)$ for 1 ≤ i, j ≤ n and A, B ∈ {rep, norep}
  • Each of these takes constant time
  • $O(n)$ values $T_{\mathrm{opt}}^{A}(i)$, for 1 ≤ i ≤ n and A ∈ {rep, norep}
  • Minimum over at most 6n elements: $O(n)$
  • Overall complexity: $O(n^2)$

SLIDE 28

Experimental setup

  • Total work: W = 10,000 seconds
  • Fully parallel tasks: $\alpha_i = 0$ (worst case for replication)
  • Five work distributions:
      • Uniform: identical tasks, $\frac{W}{n}$
      • Increasing: length increases: $i \cdot \frac{2W}{n(n+1)}$
      • Decreasing: length decreases: $(n - i + 1) \cdot \frac{2W}{n(n+1)}$
      • HighLow: $\lceil \frac{n}{10} \rceil$ big tasks (60% of work) followed by small tasks
      • Random: random lengths between $\frac{W}{2n}$ and $\frac{3W}{2n}$, reduced if it exceeds W
  • $C_i^{\mathrm{rep}} = \alpha C_i^{\mathrm{norep}}$ and $R_i^{\mathrm{rep}} = \alpha R_i^{\mathrm{norep}}$, where 1 ≤ α ≤ 2
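The deterministic work distributions can be generated in a few lines. The sketch below is an illustration under stated assumptions: in HighLow, the big tasks are assumed to share 60% of the work equally and the small tasks to share the rest equally, which the slide does not spell out. The Random distribution is left out because the slide does not fully specify how lengths are reduced.

```python
import math
from typing import List


def uniform(n: int, w: float) -> List[float]:
    return [w / n] * n


def increasing(n: int, w: float) -> List[float]:
    return [i * 2 * w / (n * (n + 1)) for i in range(1, n + 1)]


def decreasing(n: int, w: float) -> List[float]:
    return [(n - i + 1) * 2 * w / (n * (n + 1)) for i in range(1, n + 1)]


def high_low(n: int, w: float, big_share: float = 0.6) -> List[float]:
    """ceil(n/10) big tasks holding big_share of the work, then the small tasks.

    Assumption: equal split within each group; requires n >= 2.
    """
    if n < 2:
        raise ValueError("high_low needs at least one big and one small task")
    n_big = math.ceil(n / 10)
    n_small = n - n_big
    return [big_share * w / n_big] * n_big + [(1 - big_share) * w / n_small] * n_small


W, n = 10_000.0, 20
for make in (uniform, increasing, decreasing, high_low):
    lengths = make(n, W)
    print(f"{make.__name__:>10}: first = {lengths[0]:.1f}, last = {lengths[-1]:.1f}, "
          f"total = {sum(lengths):.1f}")
```

Each list sums to W, so the distributions differ only in how the same total work is spread over the chain.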

SLIDE 29

Comparison to checkpoint only

  • Uniform distribution
  • Reports the occurrences of checkpoints and replicas in the optimal solution
  • Checkpointing cost ≤ task length ⇒ no replication

[Figure: strategy found in the optimal solution (None, Checkpointing Only, Replication Only, Checkpointing+Replication) as a function of the Checkpoint/Recovery cost over task length ratio (1.0e−03 to 1.0e+03) and the Error Rate (1.00e−08 to 1.05e−02)]

SLIDE 30

Optimal solutions with both strategies

  • Scenario of the red square on the previous slide
  • Fewer checkpoints when replication is used
  • Optimal solution combines both techniques
  • Rule of thumb: replication preferred for small tasks

[Figure: strategy chosen for each task of a 20-task chain (axis labelled T2 to T20) under ChainsCkpt (Checkpointing Only) and ChainsRepCkpt (Checkpointing and Replication), in five panels: (a) Uniform, (b) Increasing, (c) Decreasing, (d) HighLow, (e) Random. Legend: None, Checkpointing Only, Replication Only, Checkpointing+Replication]

SLIDE 31

Comparison, different numbers of tasks

  • Performance of ChainsRepCkpt compared to ChainsCkpt
  • Expensive checkpoints (limited to ≈ 17) ⇒ makespan of ChainsCkpt remains constant
  • ChainsRepCkpt can replicate increasing number of small tasks

[Figure: three plots against the Number of Tasks (20 to 100): Normalized Makespan, Number of Checkpoints, and Number of Replicas. Series: ChainsCkpt and ChainsRepCkpt with the Uniform, Increasing, Decreasing, HighLow, and Random distributions; the replica plot shows ChainsRepCkpt only]

SLIDE 32

Impact of error rate and checkpoint cost

  • Larger error rate ⇒ using replication helps
  • Replication not needed for small checkpointing costs
  • Replication more efficient when no increase in checkpoint cost

[Figure: three plots of the Normalized Makespan: against the Error rate $\lambda_{\mathrm{ind}} p$ (10^−5 to 10^−2) for ChainsRepCkpt and ChainsCkpt with C = R = 1000; against the Checkpoint/Recovery cost over task length ratio (1 to 5) for ChainsRepCkpt and ChainsCkpt; against the Number of Tasks (20 to 100) for ChainsRepCkpt with Crep = Cnorep, Crep = 1.5 Cnorep, and Crep = 2 Cnorep]

SLIDE 33

Further experiments

  • With increasing number of processors and variable checkpointing costs: improvement up to 80.5% with p = 10,000 processors
  • Impact of number of checkpoints and replicas: the optimal solution always matches the minimum value obtained in simulations
  • When both checkpointing cost and error rate are high, a small deviation from the optimal solution leads to a large overhead

SLIDE 35

Conclusion

  • Combination of checkpointing and replication
  • Goal: Minimize execution time of linear workflows
  • Decide which task to checkpoint and/or replicate
  • Sophisticated dynamic programming algorithm: optimal solution
  • Experiments: Gain over checkpoint-only approach quite significant, when checkpoint is costly and error rate is high
  • Extend to more complicated workflows
  • Experiments on real application workflows
  • Cope with silent errors as well as fail-stop errors
