SLIDE 1

Introduction Checkpointing Replication Task scheduling Re-execution speed Conclusion

Resilient and energy-aware scheduling algorithms

Anne Benoit, LIP, École Normale Supérieure de Lyon, France

Anne.Benoit@ens-lyon.fr http://graal.ens-lyon.fr/~abenoit/

4th GDR RSD and ASF Winter School on Distributed Systems and Networks, February 2019

Winter School, Feb. 5, 2019 Anne.Benoit@ens-lyon.fr Resilient and energy-aware scheduling algorithms 1/ 84

SLIDE 2

Motivation

  • Scheduling: Allocate resources to applications to optimize some performance metrics
  • Resources: Large-scale distributed systems with millions of components
  • Applications: Parallel applications, expressed as a set of tasks, or divisible applications with some work to complete
  • Performance metrics: Of course we are concerned with the performance of the applications, but also with resilience and energy consumption

SLIDE 6

Classical scheduling problems

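[Figure: Gantt chart of tasks scheduled on machines P1 and P2 along the time axis t, with task completion times C1 to C5]

Objectives:

  • Minimizing total execution time ($C_{\max}$)
  • Minimizing weighted sum of execution times, $\sum_i w_i C_i$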

Results: NP-completeness, algorithms, approximation algorithms, (in-)approximation bounds

SLIDE 8

Dealing with failures

Consider one processor (e.g., in your laptop):

  • Mean Time Between Failures (MTBF) = 100 years
  • (Almost) no failures in practice

Why bother about failures?

Theorem: The platform MTBF is inversely proportional to the number of processors: with $p$ processors of individual MTBF $\mu$, the platform MTBF is $\mu_p = \mu/p$.

With 36500 processors:

  • MTBF = 100 years / 36500 = 1 day
  • A failure every day on average!

A large simulation can run for weeks, hence it will face failures
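The arithmetic behind the 36500-processor example can be checked with a minimal sketch. It assumes independent processors with identical individual MTBF and 365-day years; the function name `platform_mtbf_days` is a hypothetical helper, not part of the talk.

```python
def platform_mtbf_days(individual_mtbf_years: float, num_processors: int) -> float:
    """Platform MTBF in days, assuming independent, identical processors.

    Uses mu_p = mu / p with a 365-day year (an assumption of this sketch).
    """
    days_per_year = 365
    return individual_mtbf_years * days_per_year / num_processors


# One processor with a 100-year MTBF, replicated 36500 times:
print(platform_mtbf_days(100, 36500))  # 1.0
```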

SLIDE 10

Intuition

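[Figure: fault timelines of three processors p1, p2, p3 over a time t]

If three processors each have around 20 faults during a time t ($\mu = t/20$)...

[Figure: fault timeline of the whole platform p over the same time t]

...during the same time, the platform has around 60 faults ($\mu_p = t/60$)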

SLIDE 11

So, how to deal with failures?

Failures are usually handled by adding redundancy:

  • Replicate the work (for instance, use only half of the processors, and the other half is used to redo the same computation)
  • Checkpoint the application: periodically save the state of the application on stable storage, so that we can restart in case of failure without losing everything

SLIDE 12

Another crucial issue: Energy consumption

“The internet begins with coal”

  • Nowadays: more than 90 billion kilowatt-hours of electricity a year; requires 34 giant (500 megawatt) coal-powered plants, and produces huge CO2 emissions
  • Explosion of artificial intelligence; AI is hungry for processing power!
  • Need to double data centers in the next four years → how to get enough power?
  • Failures: Redundant work consumes even more energy
  • Energy and power awareness are crucial for both environmental and economic reasons

SLIDE 13

Outline

1. Checkpointing for resilience
  • How to cope with errors?
  • Optimization objective and optimal period
  • Optimal period when accounting for energy consumption
2. Combining checkpoint with replication
  • Replication analysis
  • Simulations
3. Back to task scheduling
4. A different re-execution speed can help
  • Model, optimization problem, optimal solution
  • Simulations
  • Extensions: both fail-stop and silent errors
5. Summary and need for trade-offs

SLIDE 14

Introduction to resilience

Fail-stop errors:

  • Component failures (node, network, power, ...)
  • Application fails and data is lost

Silent data corruptions:

  • Bit flip (Disk, RAM, Cache, Bus, ...)
  • Detection is not immediate, and we may get wrong results

How often should we checkpoint to minimize the waste, i.e., the time lost because of resilience techniques and failures?

SLIDE 16

Coping with fail-stop errors

Periodic checkpoint, rollback, and recovery:

[Figure: three timelines of the periodic pattern. Without error: C T C T C. With a fail-stop error striking during the execution. After the error: C R T C T C, where the recovery R precedes re-execution]

  • Coordinated checkpointing (the platform is a giant macro-processor)
  • Assume instantaneous interruption and detection
  • Rollback to last checkpoint and re-execute

SLIDE 25

Coping with silent errors

  • Silent error = detection latency
  • Error is detected only when corrupted data is activated

Same approach?

[Figure: timeline C T C T C in which a silent error is detected late; checkpoints taken in between are marked "corrupted!" and "corrupted?"]

  • Keep multiple checkpoints?
  • Which checkpoint to recover from?

Need an active method to detect silent errors!

SLIDE 26

Methods for detecting silent errors

General-purpose approaches

  • Replication [Fiala et al. 2012] or triple modular redundancy and voting [Lyons and Vanderkulk 1962]

Application-specific approaches

  • Algorithm-based fault tolerance (ABFT): checksums in dense matrices. Limited to one error detection and/or correction in practice [Huang and Abraham 1984]
  • Partial differential equations (PDE): use lower-order scheme as verification mechanism [Benson, Schmit and Schreiber 2014]
  • Generalized minimal residual method (GMRES): inner-outer iterations [Hoemmen and Heroux 2011]
  • Preconditioned conjugate gradients (PCG): orthogonalization check every k iterations, re-orthogonalization if problem detected [Sao and Vuduc 2013, Chen 2013]

Data-analytics approaches

  • Dynamic monitoring of HPC datasets based on physical laws (e.g., temperature limit, speed limit) and space or temporal proximity [Bautista-Gomez and Cappello 2014]
  • Time-series prediction, spatial multivariate interpolation [Di et al. 2014]

SLIDE 27

Coping with fail-stop and silent errors

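[Figure: three timelines with a verification V before each checkpoint C]

  • No error: V C T V C T V C
  • Fail-stop error: V C R T V C T V C (recovery R after the error)
  • Silent error: V C T V R T V C T V C (the error is detected by the verification V, then recovery R and re-execution)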

What is the optimal checkpointing period?

SLIDE 29

Optimization objective (1/2)

[Figure: timeline C T C T C]

  • T is the pattern length (time without failures)
  • C is the checkpoint cost
  • E(T) is the expected execution time of the pattern

The overhead of the pattern is defined as:

$$H(T) = \frac{E(T)}{T} - 1$$

The overhead measures the fraction of extra time due to:

  • Checkpoints
  • Recoveries and re-executions (failures)

The goal is to minimize the quantity H(T)

SLIDE 30

Optimization objective (2/2)

Goal: Find the optimal pattern length $T^*$, so that the overhead is minimized

Overhead: $H(T) = \frac{E(T)}{T} - 1$

  • 1. Compute expected execution time E(T) (exact formula)
  • 2. Compute overhead H(T) (first-order approximation)
  • 3. Derive optimal $T^*$: fail-stop errors
  • 4. Derive optimal $T^*$: silent errors
  • 5. Derive optimal $T^*$: both

SLIDE 32

1. Expected execution time E(T)

  • T: Pattern length
  • C: Checkpoint time
  • R: Recovery time
  • $\lambda_f = 1/\mu_f$: Fail-stop error rate

[Figure: timelines C T C T C (no error) and C R T C T C (recovery after a fail-stop error), with the lost time $E_{\text{lost}}$ marked]

$$E(T) = P_{\text{no-error}}\,(T + C) + P_{\text{error}}\,\bigl(E_{\text{lost}} + R + E(T)\bigr)$$
SLIDE 33

1. Expected execution time E(T)

Assume that failures follow an exponential distribution $\mathrm{Exp}(\lambda_f)$

  • Independent errors (memoryless property)
  • There is at least one error before time t with probability (cdf):

$$P(X \le t) = 1 - e^{-\lambda_f t}$$

Probability of failure / no-failure:

$$P_{\text{error}} = 1 - e^{-\lambda_f T}, \qquad P_{\text{no-error}} = e^{-\lambda_f T}$$

SLIDE 34

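1. Expected execution time E(T)

[Figure: timelines C T C T C (no error) and C R T C T C (recovery after a fail-stop error), with the lost time $E_{\text{lost}}$ marked]

$$E(T) = e^{-\lambda_f T}(T + C) + \bigl(1 - e^{-\lambda_f T}\bigr)\bigl(E_{\text{lost}} + R + E(T)\bigr) = T + C + \bigl(e^{\lambda_f T} - 1\bigr)\bigl(E_{\text{lost}} + R\bigr)$$

$E_{\text{lost}}$ is the time lost when the failure strikes:

$$E_{\text{lost}} = \int_0^{\infty} t\,P(X = t \mid X < T)\,dt = \frac{1}{\lambda_f} - \frac{T}{e^{\lambda_f T} - 1} = \frac{T}{2} + o(\lambda_f T)$$

We lose half the pattern upon failure (in expectation)!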
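As a sanity check on the closed form for the lost time, the sketch below compares it with a direct numerical integration of the conditional density and with the T/2 approximation. The function names and the parameter values (an error rate of 1e-5 per second, a one-hour pattern) are assumptions chosen for illustration only.

```python
import math


def expected_lost_time(lam: float, T: float) -> float:
    """Closed form: E_lost = 1/lam - T / (exp(lam*T) - 1)."""
    return 1.0 / lam - T / math.expm1(lam * T)


def expected_lost_time_numeric(lam: float, T: float, steps: int = 100_000) -> float:
    """Midpoint-rule integral of t * f(t | X < T) over [0, T]."""
    h = T / steps
    norm = -math.expm1(-lam * T)  # P(X < T) = 1 - exp(-lam*T)
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += t * lam * math.exp(-lam * t) / norm
    return total * h


def expected_pattern_time(lam: float, T: float, C: float, R: float) -> float:
    """Exact E(T) = T + C + (exp(lam*T) - 1) * (E_lost + R)."""
    return T + C + math.expm1(lam * T) * (expected_lost_time(lam, T) + R)


# Hypothetical values: one error per 1e5 seconds, pattern of one hour.
lam, T = 1e-5, 3600.0
print(expected_lost_time(lam, T), expected_lost_time_numeric(lam, T), T / 2)
```

With a small product of error rate and pattern length, the three printed values should be close, which is what the first-order statement "we lose half the pattern" expresses.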

SLIDE 36

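2. Compute overhead H(T)

[Figure: timelines C T C T C (no error) and C R T C T C (recovery after a fail-stop error), with the lost time $E_{\text{lost}}$ marked]

We use Taylor series to approximate $e^{-\lambda_f T}$ up to first-order terms:

$$e^{-\lambda_f T} = 1 - \lambda_f T + o(\lambda_f T)$$

Works well provided that $T, C, R \ll \mu_f = 1/\lambda_f$

$$E(T) = T + C + \lambda_f T\left(\frac{T}{2} + R\right) + o(\lambda_f T)$$

Finally, we get the overhead of the pattern:

$$H(T) = \frac{C}{T} + \frac{\lambda_f T}{2} + o(\lambda_f T)$$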

SLIDE 38

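3. Derive optimal $T^*$: Fail-stop errors

[Figure: timelines C T C T C (no error) and C R T C T C (recovery after a fail-stop error), with the lost time $E_{\text{lost}}$ marked]

$$H(T) = \frac{C}{T} + \frac{\lambda_f T}{2} + o(\lambda_f T)$$

We solve:

$$\frac{\partial H(T)}{\partial T} = -\frac{C}{T^2} + \frac{\lambda_f}{2} = 0$$

Finally, we retrieve:

$$T^* = \sqrt{\frac{2C}{\lambda_f}} = \sqrt{2\mu_f C}$$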
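To see how close the first-order period is to the minimizer of the exact overhead, here is a minimal sketch. It evaluates the exact overhead from the expected-time formula and minimizes it with a ternary search (a choice of this sketch, valid because the exact overhead is convex in T). The function names and the numbers (10-minute checkpoint and recovery, one-day MTBF) are hypothetical.

```python
import math


def young_daly_period(C: float, mtbf: float) -> float:
    """First-order optimal period T* = sqrt(2 * mu_f * C)."""
    return math.sqrt(2.0 * mtbf * C)


def exact_overhead(T: float, C: float, R: float, lam: float) -> float:
    """H(T) = E(T)/T - 1 with the exact expected pattern time."""
    e_lost = 1.0 / lam - T / math.expm1(lam * T)
    expected = T + C + math.expm1(lam * T) * (e_lost + R)
    return expected / T - 1.0


def numeric_best_period(C, R, lam, lo, hi, iters=200):
    """Ternary search for the minimizer of the exact overhead on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if exact_overhead(m1, C, R, lam) < exact_overhead(m2, C, R, lam):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0


# Hypothetical platform: C = R = 600 s, MTBF of one day (86400 s).
C, R, mtbf = 600.0, 600.0, 86400.0
t_first_order = young_daly_period(C, mtbf)
t_exact = numeric_best_period(C, R, 1.0 / mtbf, lo=60.0, hi=mtbf)
print(t_first_order, t_exact)
```

For these values the two periods differ by a few percent, which is the kind of gap one would expect from dropping the higher-order terms.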

SLIDE 40

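4. Derive optimal $T^*$: Silent errors

[Figure: timeline V C T V R T V C T V C; the silent error is detected by the next verification V, followed by recovery R]

Similar to fail-stop except:

  • $\lambda_f \to \lambda_s$
  • $E_{\text{lost}} = T$
  • V: verification time

Using the same approach:

$$H(T) = \frac{C + V}{T} + \underbrace{\lambda_s T}_{\text{silent}} + o(\lambda_s T)$$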

SLIDE 41

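5. Derive optimal $T^*$: Both errors

$$H(T) = \frac{C + V}{T} + \underbrace{\frac{\lambda_f T}{2}}_{\text{fail-stop}} + \underbrace{\lambda_s T}_{\text{silent}} + o(\lambda T)$$

First-order approximations [Young 1974, Daly 2006, AB et al. 2016]

| | Fail-stop errors | Silent errors | Both errors |
|---|---|---|---|
| Pattern | $T + C$ | $T + V + C$ | $T + V + C$ |
| Optimal $T^*$ | $\sqrt{\frac{C}{\lambda_f/2}}$ | $\sqrt{\frac{V + C}{\lambda_s}}$ | $\sqrt{\frac{V + C}{\lambda_s + \lambda_f/2}}$ |
| Overhead $H^*$ | $2\sqrt{\frac{\lambda_f}{2}\,C}$ | $2\sqrt{\lambda_s (V + C)}$ | $2\sqrt{\left(\lambda_s + \frac{\lambda_f}{2}\right)(V + C)}$ |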
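The three columns of the first-order results share one shape: an overhead of the form a/T + bT, minimized at T* = sqrt(a/b) with H* = 2 sqrt(ab). The sketch below encodes that observation; `optimal_pattern` is a hypothetical helper and the numbers are illustrative, not taken from the talk.

```python
import math


def optimal_pattern(C: float, V: float = 0.0,
                    lam_f: float = 0.0, lam_s: float = 0.0):
    """First-order optimal period and overhead.

    Minimizes H(T) = (C + V)/T + (lam_s + lam_f/2) * T, ignoring
    lower-order terms. Set V = 0 and lam_s = 0 for fail-stop only,
    lam_f = 0 for silent errors only.
    """
    fixed_cost = C + V
    rate = lam_s + lam_f / 2.0
    if rate <= 0.0:
        raise ValueError("at least one error rate must be positive")
    period = math.sqrt(fixed_cost / rate)
    overhead = 2.0 * math.sqrt(rate * fixed_cost)
    return period, overhead


# Hypothetical values: C = 600 s, V = 60 s, one fail-stop error per day,
# one silent error every two days.
print(optimal_pattern(C=600.0, lam_f=1 / 86400))
print(optimal_pattern(C=600.0, V=60.0, lam_s=1 / 172800))
print(optimal_pattern(C=600.0, V=60.0, lam_f=1 / 86400, lam_s=1 / 172800))
```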

Is this optimal for energy consumption?

SLIDE 44

Energy model (1/2)

  • Modern processors are equipped with dynamic voltage and frequency scaling (DVFS) capability
  • Power consumption of a processing unit is $P_{\text{idle}} + \kappa\sigma^3$, where $\kappa > 0$ and $\sigma$ is the processing speed
  • Error rate: May also depend on processing speed

$\lambda(\sigma)$ follows a U-shaped curve:

  • increases exponentially with decreased processing speed $\sigma$
  • increases also with increased speed because of high temperature

SLIDE 45

Energy model (2/2)

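Total power consumption depends on:

  • $P_{\text{idle}}$: static power dissipated when platform is on (even idle)
  • $P_{\text{cpu}}(\sigma)$: dynamic power spent by operating CPU at speed $\sigma$
  • $P_{\text{io}}$: dynamic power spent by I/O transfers (checkpoints and recoveries)

Computation and verification: power depends upon $\sigma$ (total time $T_{\text{cpu}}(\sigma)$)

Checkpointing and recovering: I/O transfers (total time $T_{\text{io}}$)

Total energy consumption:

$$\text{Energy}(\sigma) = T_{\text{cpu}}(\sigma)\bigl(P_{\text{idle}} + P_{\text{cpu}}(\sigma)\bigr) + T_{\text{io}}\bigl(P_{\text{idle}} + P_{\text{io}}\bigr)$$

  • Checkpoint: $E^C = C\,(P_{\text{idle}} + P_{\text{io}})$
  • Recover: $E^R = R\,(P_{\text{idle}} + P_{\text{io}})$
  • Verify at speed $\sigma$: $E^V(\sigma) = V(\sigma)\bigl(P_{\text{idle}} + P_{\text{cpu}}(\sigma)\bigr)$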
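The energy model can be written down as a small sketch. It assumes the cubic dynamic power $P_{\text{cpu}}(\sigma) = \kappa\sigma^3$ stated on the previous slide; the class name `PowerModel`, its method names, and the numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerModel:
    p_idle: float  # static power, paid whenever the platform is on
    p_io: float    # dynamic power of I/O transfers
    kappa: float   # coefficient of the cubic CPU power law (assumed form)

    def p_cpu(self, sigma: float) -> float:
        """Dynamic CPU power at speed sigma: kappa * sigma^3."""
        return self.kappa * sigma ** 3

    def checkpoint_energy(self, C: float) -> float:
        return C * (self.p_idle + self.p_io)

    def recovery_energy(self, R: float) -> float:
        return R * (self.p_idle + self.p_io)

    def verification_energy(self, V_sigma: float, sigma: float) -> float:
        """V_sigma is the verification time at speed sigma."""
        return V_sigma * (self.p_idle + self.p_cpu(sigma))

    def total_energy(self, t_cpu: float, t_io: float, sigma: float) -> float:
        """Energy = T_cpu * (P_idle + P_cpu) + T_io * (P_idle + P_io)."""
        return (t_cpu * (self.p_idle + self.p_cpu(sigma))
                + t_io * (self.p_idle + self.p_io))


# Illustrative units only: 100 time units of computation, 10 of I/O.
model = PowerModel(p_idle=10.0, p_io=5.0, kappa=2.0)
print(model.total_energy(t_cpu=100.0, t_io=10.0, sigma=2.0))  # 2750.0
```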

SLIDE 46

Bi-criteria problem

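Linear combination of execution time and energy consumption: $a \cdot \text{Time} + b \cdot \text{Energy}$

Theorem

  • Application subject to both fail-stop and silent errors
  • Minimize $a \cdot \text{Time} + b \cdot \text{Energy}$
  • The optimal checkpointing period is

$$T^*(\sigma) = \sqrt{\frac{2\bigl(V(\sigma) + C_e(\sigma)\bigr)}{\lambda_f(\sigma) + 2\lambda_s(\sigma)}}, \quad \text{where } C_e(\sigma) = \frac{a + b\,(P_{\text{idle}} + P_{\text{io}})}{a + b\,(P_{\text{idle}} + P_{\text{cpu}}(\sigma))}\,C$$

Similar optimal period as without energy, but account for new parameters!

$$T^* = \sqrt{\frac{2(V + C)}{\lambda_f + 2\lambda_s}}$$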
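A minimal sketch of the theorem's formula follows. The caller is assumed to pass the verification time, error rates, and CPU power already evaluated at the chosen speed; the function and parameter names are hypothetical.

```python
import math


def effective_checkpoint_cost(C, a, b, p_idle, p_io, p_cpu):
    """C_e = (a + b*(P_idle + P_io)) / (a + b*(P_idle + P_cpu)) * C."""
    return (a + b * (p_idle + p_io)) / (a + b * (p_idle + p_cpu)) * C


def bicriteria_period(V, C, lam_f, lam_s, a, b, p_idle, p_io, p_cpu):
    """T* = sqrt(2 * (V + C_e) / (lam_f + 2 * lam_s)) at a fixed speed."""
    c_e = effective_checkpoint_cost(C, a, b, p_idle, p_io, p_cpu)
    return math.sqrt(2.0 * (V + c_e) / (lam_f + 2.0 * lam_s))


# With b = 0 (time only), C_e = C and the time-optimal period is recovered.
time_only = bicriteria_period(V=60.0, C=600.0, lam_f=1e-5, lam_s=5e-6,
                              a=1.0, b=0.0, p_idle=10.0, p_io=5.0, p_cpu=16.0)
# Weighting energy with cheap I/O (P_io < P_cpu) shrinks C_e, hence the period.
with_energy = bicriteria_period(V=60.0, C=600.0, lam_f=1e-5, lam_s=5e-6,
                                a=1.0, b=1.0, p_idle=10.0, p_io=5.0, p_cpu=16.0)
print(time_only, with_energy)
```

When I/O draws less power than computation, a checkpoint is relatively cheaper in the combined objective, so checkpointing more often pays off; the reverse holds when I/O is the more power-hungry activity.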

SLIDE 49

When Amdahl meets Young/Daly

Error-free speedup with P processors and α sequential fraction:

Amdahl’s Law:

$$S(P) = \frac{1}{\alpha + \frac{1 - \alpha}{P}}$$

  • Bounded above by 1/α
  • Strictly increasing function of P

Allocating more processors on an error-prone platform?

  • Higher error-free speedup
  • More errors/faults
  • More frequent checkpointing
  • More resilience overhead

We can compute optimal processor allocation and checkpointing interval!
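The trade-off can be illustrated with a rough sketch. This is not the analysis from the talk: it assumes fail-stop errors only, a platform error rate equal to P times the individual rate, a checkpoint cost C that does not depend on P, and the first-order optimal overhead $H^* = \sqrt{2\lambda C}$. Under these assumptions the speedup with failures is approximated as $S(P)/(1 + H^*)$. All names and values are hypothetical.

```python
import math


def amdahl_speedup(P: int, alpha: float) -> float:
    """Error-free speedup S(P) = 1 / (alpha + (1 - alpha)/P)."""
    return 1.0 / (alpha + (1.0 - alpha) / P)


def speedup_with_failures(P: int, alpha: float, lam_ind: float, C: float) -> float:
    """Approximate speedup when checkpointing at the first-order optimum."""
    overhead = math.sqrt(2.0 * lam_ind * P * C)  # H* with platform rate P*lam_ind
    return amdahl_speedup(P, alpha) / (1.0 + overhead)


def best_allocation(max_P: int, alpha: float, lam_ind: float, C: float) -> int:
    """Exhaustive scan over processor counts 1..max_P."""
    return max(range(1, max_P + 1),
               key=lambda P: speedup_with_failures(P, alpha, lam_ind, C))


# Hypothetical setting: 1% sequential fraction, 100-year individual MTBF,
# 10-minute checkpoints, up to 20000 processors.
lam_ind = 1.0 / (100 * 365 * 24 * 3600)
best = best_allocation(20000, alpha=0.01, lam_ind=lam_ind, C=600.0)
print(best, speedup_with_failures(best, 0.01, lam_ind, 600.0))
```

In this setting the best allocation uses fewer processors than are available, because beyond some point the extra resilience overhead outweighs the small gain in Amdahl speedup.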

SLIDE 50

How is replication used?

On a Q-processor platform, application is replicated n times:

  • Duplication: each replica has P = Q/2 processors
  • Triplication: each replica has P = Q/3 processors
  • General case: each replica has P = Q/n processors

Having more replicas on an error-prone platform?

  • Lower error-free speedup
  • More resilient
  • Lower checkpointing frequency
  • Less resilience overhead

Optimal replication level, processor allocation per replica, and checkpointing interval?

SLIDE 53

Why is replication useful?

  • Error detection (duplication): [Figure]
  • Error correction (triplication): [Figure]
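The two roles of replication can be sketched in a few lines: comparing two replicas reveals a mismatch but not which one is wrong, while three replicas allow a majority vote. This is an illustration under the assumption that replica results are directly comparable values; the function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def detect_error(replica_results: Sequence[Hashable]) -> bool:
    """True if the replicas disagree (enough for detection with 2 replicas)."""
    return len(set(replica_results)) > 1


def majority_vote(replica_results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value held by a strict majority of replicas, or None."""
    value, count = Counter(replica_results).most_common(1)[0]
    return value if count > len(replica_results) / 2 else None


print(detect_error([42, 41]))       # True: duplication detects the mismatch
print(majority_vote([42, 41]))      # None: but cannot tell which one is right
print(majority_vote([42, 41, 42]))  # 42: triplication corrects one error
```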


slide-58
SLIDE 58

Outline

1. Checkpointing for resilience
  • How to cope with errors?
  • Optimization objective and optimal period
  • Optimal period when accounting for energy consumption
2. Combining checkpoint with replication
  • Replication analysis
  • Simulations
3. Back to task scheduling
4. A different re-execution speed can help
  • Model, optimization problem, optimal solution
  • Simulations
  • Extensions: both fail-stop and silent errors
5. Summary and need for trade-offs


slide-59
SLIDE 59

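Two replication modes

  • Process replication: each process of the application is replicated individually. [Figure not recovered]
  • Group replication: the whole application (a group of P processes) is replicated as a blackbox. [Figure not recovered]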


slide-61
SLIDE 61

Probability of failure

Independent process error distribution: Exponential Exp(λ), λ = 1/µ (memoryless)

Error probability of one process during T time of computation:
$$P(T) = 1 - e^{-\lambda T}$$

Process triplication:

Failure probability of any triplicated process:
$$P^{prc}_3(T,1) = \binom{3}{2}\,(1 - P(T))\,P(T)^2 + P(T)^3 = 3e^{-\lambda T}\left(1 - e^{-\lambda T}\right)^2 + \left(1 - e^{-\lambda T}\right)^3 = 1 - 3e^{-2\lambda T} + 2e^{-3\lambda T}$$

Failure probability of P-process application:
$$P^{prc}_3(T,P) = 1 - \mathbb{P}(\text{``No process fails''}) = 1 - \left(1 - P^{prc}_3(T,1)\right)^P = 1 - \left(3e^{-2\lambda T} - 2e^{-3\lambda T}\right)^P$$
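The formulas above can be checked numerically. The following is a minimal sketch, not the author's code; the function names and the numeric values used in the check are illustrative assumptions.

```python
import math


def p_single(lam: float, t: float) -> float:
    """Error probability of one process during t units of computation."""
    return 1.0 - math.exp(-lam * t)


def p_prc3(lam: float, t: float, procs: int = 1) -> float:
    """Failure probability of a process-triplicated application (closed form)."""
    x = math.exp(-lam * t)
    one_process_ok = 3.0 * x**2 - 2.0 * x**3  # at most one of three replicas fails
    return 1.0 - one_process_ok**procs


def p_prc3_binomial(lam: float, t: float) -> float:
    """Same quantity for one process, from the binomial form (2 or 3 replicas fail)."""
    p = p_single(lam, t)
    return math.comb(3, 2) * (1.0 - p) * p**2 + p**3
```

Comparing `p_prc3(lam, t)` with `p_prc3_binomial(lam, t)` for a few values of `lam * t` is a quick way to confirm the algebraic simplification.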


slide-64
SLIDE 64

Probability of failure

Group triplication:

Failure probability of any P-process group:
$$P^{grp}_1(T,P) = 1 - \mathbb{P}(\text{``No process in group fails''}) = 1 - (1 - P(T))^P = 1 - e^{-\lambda P T}$$

Failure probability of three-group application:
$$P^{grp}_3(T,P) = \binom{3}{2}\,\left(1 - P^{grp}_1(T,P)\right) P^{grp}_1(T,P)^2 + P^{grp}_1(T,P)^3 = 3e^{-\lambda P T}\left(1 - e^{-\lambda P T}\right)^2 + \left(1 - e^{-\lambda P T}\right)^3$$
$$= 1 - 3e^{-2\lambda P T} + 2e^{-3\lambda P T} > 1 - \left(3e^{-2\lambda T} - 2e^{-3\lambda T}\right)^P = P^{prc}_3(T,P)$$

What about duplication? (any error kills both cases)
$$P^{prc}_2(T,P) = P^{grp}_2(T,P) = 1 - e^{-2\lambda P T}$$
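To get a feel for the inequality between group and process triplication, the three closed forms can be evaluated side by side. This is an illustrative sketch; the function names and the sample parameters are assumptions, not values from the talk.

```python
import math


def p_grp3(lam: float, t: float, procs: int) -> float:
    """Group triplication: at least two of the three P-process groups fail."""
    y = math.exp(-lam * procs * t)  # probability that one whole group is error-free
    return 1.0 - 3.0 * y**2 + 2.0 * y**3


def p_prc3(lam: float, t: float, procs: int) -> float:
    """Process triplication: at least one triplicated process loses its majority."""
    x = math.exp(-lam * t)
    return 1.0 - (3.0 * x**2 - 2.0 * x**3) ** procs


def p_dup(lam: float, t: float, procs: int) -> float:
    """Duplication (process or group): any error in any of the 2P processes."""
    return 1.0 - math.exp(-2.0 * lam * procs * t)
```

For P = 1 the two triplication modes coincide; the gap opens as P grows, because a single error already invalidates a whole group.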


slide-68
SLIDE 68

Two observations

Observation 1 (Implementation)
  • Process replication is more resilient than group replication (assuming same overhead)
  • Group replication is easier to implement by treating an application as a blackbox

Observation 2 (Analysis)
The following two scenarios are equivalent w.r.t. failure probability:
  • Group replication with n replicas, where each replica has P processes and each process has error rate λ
  • Process replication with one process, which has error rate λP and which is replicated n times

Benefit of analysis: Group(n, P, λ) → Process(n, 1, λP)
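Observation 2 can be illustrated with a small sketch. It assumes a k-out-of-n consensus rule (the replica set fails when fewer than k replicas are error-free), which gives back the duplication (n = k = 2) and triplication (n = 3, k = 2) formulas; the function names are hypothetical.

```python
import math


def replica_set_failure(n: int, k: int, p_fail: float) -> float:
    """Probability that fewer than k of n independent replicas are error-free."""
    p_ok = 1.0 - p_fail
    return sum(math.comb(n, i) * p_ok**i * p_fail ** (n - i) for i in range(k))


def process_failure(n: int, k: int, procs: int, lam: float, t: float) -> float:
    """Process replication: every one of the `procs` processes needs a valid consensus."""
    one = replica_set_failure(n, k, 1.0 - math.exp(-lam * t))
    return 1.0 - (1.0 - one) ** procs


def group_failure(n: int, k: int, procs: int, lam: float, t: float) -> float:
    """Group replication: a group is error-free only if all its processes are."""
    return replica_set_failure(n, k, 1.0 - math.exp(-lam * procs * t))
```

Calling `group_failure(n, k, P, lam, t)` and `process_failure(n, k, 1, lam * P, t)` returns the same value, which is the reduction Group(n, P, λ) → Process(n, 1, λP).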


slide-70
SLIDE 70

Analysis steps

Maximize error-aware speedup
$$S_n(T,P) = \frac{S(P)}{E_n(T,P)/T}$$

  1. Derive failure probability $P^{prc}_n(T,P)$ or $P^{grp}_n(T,P)$ (exact)
  2. Compute expected execution time $E_n(T,P)$ (exact)
  3. Compute first-order approx. of error-aware speedup $S_n(T,P)$
  4. Derive optimal $T_{opt}$, $P_{opt}$ and get $S_n(T_{opt}, P_{opt})$
  5. Choose right replication level n


slide-71
SLIDE 71

Analytical results

Duplication: On a platform with Q processors and checkpointing cost C, the optimal resilience parameters for process/group duplication are:
$$P_{opt} = \min\left\{ \frac{Q}{2},\ \left( \frac{1}{2} \left( \frac{1-\alpha}{\alpha} \right)^{2} \frac{1}{C\lambda} \right)^{1/3} \right\}$$
$$T_{opt} = \left( \frac{C}{2\lambda P_{opt}} \right)^{1/2}$$
$$S_{opt} = \frac{S(P_{opt})}{1 + 2\left( 2\lambda C P_{opt} \right)^{1/2}}$$

  • Triplication & (n, k)-replication (k-out-of-n replica consensus): similar results but different for process and group, less practical for n > 3
  • For α > 0, not necessarily use up all available Q processors
  • Checkpointing interval $T_{opt}$ nicely extends Young/Daly's result
  • Error-aware speedup $S_{opt}$ minimally affected for small λ
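The duplication formulas can be evaluated as follows. This sketch assumes that S(P) follows Amdahl's law with sequential fraction α, which is not stated on this slide; the function names and the sample values are illustrative only.

```python
import math


def amdahl_speedup(procs: float, alpha: float) -> float:
    # Assumption: S(P) = 1 / (alpha + (1 - alpha) / P), with sequential fraction alpha.
    return 1.0 / (alpha + (1.0 - alpha) / procs)


def duplication_optimum(q: float, c: float, lam: float, alpha: float):
    """Return (P_opt, T_opt, S_opt) from the first-order duplication formulas."""
    p_opt = q / 2.0
    if alpha > 0.0:  # for alpha = 0 the second term of the min is unbounded
        bound = (0.5 * ((1.0 - alpha) / alpha) ** 2 / (c * lam)) ** (1.0 / 3.0)
        p_opt = min(p_opt, bound)
    t_opt = math.sqrt(c / (2.0 * lam * p_opt))
    s_opt = amdahl_speedup(p_opt, alpha) / (1.0 + 2.0 * math.sqrt(2.0 * lam * c * p_opt))
    return p_opt, t_opt, s_opt
```

With `q=1e6`, `c=60.0` and `lam=1e-9`, a fully parallel job (`alpha=0.0`) uses all Q/2 processors per replica, while `alpha=1e-5` already yields a smaller `p_opt`.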


slide-76
SLIDE 76

Results comparison

For fully parallel jobs, i.e., α = 0 (similar for α > 0)

Duplication vs. process triplication:
  • Processors (↓): $P_{opt} = \frac{Q}{2}$ vs. $P_{opt} = \frac{Q}{3}$
  • Chkpt interval (↑): $T_{opt} = \sqrt{\frac{C}{\lambda Q}}$ vs. $T_{opt} = \sqrt[3]{\frac{C}{2\lambda^2 Q}}$
  • Exp. speedup (??): $S_{opt} = \frac{Q/2}{1 + 2\sqrt{\lambda C Q}}$ vs. $S_{opt} = \frac{Q/3}{1 + 3\sqrt[3]{\left(\frac{\lambda C}{2}\right)^2 Q}}$

Process triplication vs. group triplication:
  • Processors (=): $P_{opt} = \frac{Q}{3}$ vs. $P_{opt} = \frac{Q}{3}$
  • Chkpt interval (↓): $T_{opt} = \sqrt[3]{\frac{C}{2\lambda^2 Q}}$ vs. $T_{opt} = \sqrt[3]{\frac{3C}{2(\lambda Q)^2}}$
  • Exp. speedup (↓): $S_{opt} = \frac{Q/3}{1 + 3\sqrt[3]{\left(\frac{\lambda C}{2}\right)^2 Q}}$ vs. $S_{opt} = \frac{Q/3}{1 + 3\sqrt[3]{\frac{1}{3}\left(\frac{\lambda C Q}{2}\right)^2}}$

Choosing right mode & level of replication: based on analytical results, app. output structure and system/language support
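The "??" entry for the expected speedup depends on the parameters. The sketch below evaluates the α = 0 formulas for the three schemes; the function names and the sample values of Q, C and λ are assumptions chosen for illustration.

```python
import math


def duplication(q: float, c: float, lam: float):
    """(T_opt, S_opt) for duplication with alpha = 0."""
    t = math.sqrt(c / (lam * q))
    s = (q / 2.0) / (1.0 + 2.0 * math.sqrt(lam * c * q))
    return t, s


def process_triplication(q: float, c: float, lam: float):
    """(T_opt, S_opt) for process triplication with alpha = 0."""
    t = (c / (2.0 * lam**2 * q)) ** (1.0 / 3.0)
    s = (q / 3.0) / (1.0 + 3.0 * ((lam * c / 2.0) ** 2 * q) ** (1.0 / 3.0))
    return t, s


def group_triplication(q: float, c: float, lam: float):
    """(T_opt, S_opt) for group triplication with alpha = 0."""
    t = (3.0 * c / (2.0 * (lam * q) ** 2)) ** (1.0 / 3.0)
    s = (q / 3.0) / (1.0 + 3.0 * ((lam * c * q / 2.0) ** 2 / 3.0) ** (1.0 / 3.0))
    return t, s
```

With `q=1e6` and `c=1800.0`, process triplication gives the higher speedup at `lam=1e-9`, whereas duplication wins at `lam=1e-12`, where errors are rare enough that the extra processors matter more.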


slide-78
SLIDE 78

Simulations

Consider a platform with $Q = 10^6$, and study Efficiency $= S_{opt}/Q$:
  • Impact of MTBE and checkpointing cost C
  • Impact of sequential fraction α
  • Impact of number of processes P


slide-79
SLIDE 79

Impact of MTBE and checkpointing cost

[Figure: efficiency (0.0 to 0.5) versus system MTBE (from 10^6 down to 10^2) for α = 10^-6, with simulated (Sim.) and theoretical (Th.) curves for duplication, process triplication and group triplication. Panels: (a) C = 1800s, (b) C = 60s.]

  • First-order accurate except for duplication (where P is larger) and with small MTBE
  • Duplication can be sufficient for large MTBE, especially for small checkpointing cost


slide-80
SLIDE 80

Impact of sequential fraction

[Figure: efficiency (0.0 to 0.5) versus system MTBE (from 10^6 down to 10^2) for C = 1800s, with simulated (Sim.) and theoretical (Th.) curves for duplication, process triplication and group triplication. Panels: (c) α = 10^-7, (d) α = 10^-6, (e) α = 10^-5.]

  • Increased α reduces efficiency
  • Increased α increases minimum MTBE for which duplication is sufficient


slide-81
SLIDE 81

Impact of number of processes

[Figure: α = 10^-5, C = 1800s. Panels: (f) MTBE = 10^4, (g) MTBE = 10^3.]

  • Efficiency/speedup not strictly increasing with P
  • First-order $P_{opt}$ close to actual optimum


slide-82
SLIDE 82

What to remember

  • "Replication + checkpointing" as a general-purpose fault-tolerance protocol for detecting/correcting silent errors in HPC
  • Process replication is more resilient than group replication, but group replication is easier to implement
  • Analytical solution for $P_{opt}$, $T_{opt}$, and $S_{opt}$ and for choosing right replication mode and level


slide-84
SLIDE 84

Chains of tasks

  • High-performance computing (HPC) application: chain of tasks T1 → T2 → · · · → Tn
  • Parallel tasks executed on the whole platform
  • For instance: tightly-coupled computational kernels, image processing applications, ...
  • Goal: efficient execution, i.e., minimize total execution time
  • Checkpoints can only be done after a task has completed


slide-86
SLIDE 86

Dynamic programming algorithm without replication

Possibility to add verification (V), memory checkpoint (M) and disk checkpoint (D) at the end of a task

[Figure: chain T0, T1, ..., Td1, Td1+1, ..., Td2, ... with V, M, D after T0, Td1 and Td2. Edisk(d1) covers the chain up to Td1, E(d1, d2) covers the segment from Td1+1 to Td2, and Edisk(d2) covers the chain up to Td2.]

$$E_{disk}(d_2) = \min_{0 \le d_1 < d_2} \left\{ E_{disk}(d_1) + E(d_1, d_2) + C_D \right\}$$

  • Initialization: $E_{disk}(0) = 0$
  • Objective: compute $E_{disk}(n)$
  • Compute $E_{disk}(0), E_{disk}(1), E_{disk}(2), \ldots, E_{disk}(n)$ in that order
  • Complexity: $O(n^2)$
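The recurrence can be sketched as a short table-filling routine. Here `segment_cost(d1, d2)` is a hypothetical stand-in for E(d1, d2), whose actual expression is not given on this slide; the O(n^2) bound assumes each call takes constant time.

```python
from typing import Callable, List, Tuple


def disk_checkpoint_dp(
    n: int,
    segment_cost: Callable[[int, int], float],  # hypothetical E(d1, d2)
    c_disk: float,                              # cost C_D of a disk checkpoint
) -> Tuple[float, List[int]]:
    """Return E_disk(n) and the tasks after which a disk checkpoint is taken."""
    e_disk = [0.0] + [float("inf")] * n  # E_disk(0) = 0
    prev = [0] * (n + 1)
    for d2 in range(1, n + 1):           # fill the table in increasing order
        for d1 in range(d2):             # position of the previous disk checkpoint
            cand = e_disk[d1] + segment_cost(d1, d2) + c_disk
            if cand < e_disk[d2]:
                e_disk[d2], prev[d2] = cand, d1
    # Walk back through the stored choices to recover the checkpoint positions.
    positions: List[int] = []
    d = n
    while d > 0:
        positions.append(d)
        d = prev[d]
    return e_disk[n], positions[::-1]
```

As a toy usage, a segment cost that grows quadratically with the segment length (a crude proxy for re-execution risk) makes the routine spread checkpoints evenly along the chain.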


slide-87
SLIDE 87

Coping with fail-stop errors with replication

[Figure: example schedules of the chain T1(p/2) C1 T2(p) T3(p) C3 T4(p/2) T5(p) C5. T1 and T4 are replicated, with two copies running on p/2 processors each; T2, T3 and T5 use all p processors. Two of the schedules show a fail-stop error.]

  • The whole platform is used at all times; some tasks are replicated
  • If a failure hits a replicated task, no need to roll back
  • Otherwise, roll back to the last checkpoint and re-execute


slide-90
SLIDE 90

Dynamic programming algorithm with replication

  • Recursively computes expectation of optimal time required to execute tasks T1 to Ti and then checkpoint Ti
  • Distinguish whether Ti is replicated or not:
      $T^{rep}_{opt}(i)$: knowing that Ti is replicated
      $T^{norep}_{opt}(i)$: knowing that Ti is not replicated
  • Solution:
$$\min\left\{ T^{rep}_{opt}(n) + C^{rep}_n,\ T^{norep}_{opt}(n) + C^{norep}_n \right\}$$

slide-91
SLIDE 91

Computing $T^{rep}_{opt}(j)$: j is replicated

$$T^{rep}_{opt}(j) = \min_{1 \le i < j} \left\{ \begin{array}{l}
T^{rep}_{opt}(i) + C^{rep}_i + T^{rep,rep}_{NC}(i+1, j), \\
T^{rep}_{opt}(i) + C^{rep}_i + T^{norep,rep}_{NC}(i+1, j), \\
T^{norep}_{opt}(i) + C^{norep}_i + T^{rep,rep}_{NC}(i+1, j), \\
T^{norep}_{opt}(i) + C^{norep}_i + T^{norep,rep}_{NC}(i+1, j), \\
R^{rep}_1 + T^{rep,rep}_{NC}(1, j), \\
R^{norep}_1 + T^{norep,rep}_{NC}(1, j)
\end{array} \right\}$$

  • Ti: last checkpointed task before Tj
  • Ti can be replicated or not, Ti+1 can be replicated or not
  • $T^{A,B}_{NC}$: no intermediate checkpoint, first/last task replicated or not, previous task checkpointed: complicated formula but done in constant time
  • Similar equation for $T^{norep}_{opt}(j)$
  • Overall complexity: $O(n^2)$
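The structure of this recurrence can be sketched as below. This is only a skeleton under stated assumptions: `t_nc(a, b, i, j)` is a hypothetical callable standing in for $T^{A,B}_{NC}(i, j)$, whose formula is not given on the slide; `ckpt[a][i]` and `r1[a]` stand for $C^{a}_i$ and $R^{a}_1$; and the "norep" equation is assumed to mirror the "rep" one with only the state of the last task changed.

```python
import math

STATES = ("rep", "norep")


def chain_dp(n, ckpt, r1, t_nc):
    """Expected optimal time for tasks T1..Tn, including the final checkpoint.

    ckpt[a][i]: checkpoint cost after T_i in state a (index 0 unused).
    r1[a]: initial cost when T_1 is in state a.
    t_nc(a, b, i, j): hypothetical cost of T_i..T_j without intermediate
    checkpoint, with T_i in state a and T_j in state b.
    """
    t_opt = {b: [math.inf] * (n + 1) for b in STATES}
    for j in range(1, n + 1):
        for b in STATES:  # state of the last task T_j
            # No checkpoint before T_j: the segment starts at T_1.
            best = min(r1[a] + t_nc(a, b, 1, j) for a in STATES)
            for i in range(1, j):         # T_i: last checkpointed task before T_j
                for s in STATES:          # state of T_i
                    for a in STATES:      # state of T_{i+1}
                        cand = t_opt[s][i] + ckpt[s][i] + t_nc(a, b, i + 1, j)
                        best = min(best, cand)
            t_opt[b][j] = best
    return min(t_opt[b][n] + ckpt[b][n] for b in STATES)
```

The two nested loops over i and j give the O(n^2) behaviour, provided `t_nc` answers in constant time as the slide states.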


slide-92
SLIDE 92

Comparison to checkpoint only

  • With identical tasks
  • Reports occurrences of checkpoints and replicas in optimal solution
  • Checkpointing cost ≤ task length ⇒ no replication

[Figure: map over the checkpoint/recovery cost over task length ratio (1.0e-03 to 1.0e+03) and the error rate (1.00e-08 to 1.05e-02), with regions: None, Checkpointing Only, Replication Only, Checkpointing+Replication.]


slide-93
SLIDE 93

Summary

  • Goal: minimize execution time of linear workflows
  • Decide which task to checkpoint and/or replicate
  • Sophisticated dynamic programming algorithms: optimal solutions
  • Even when accounting for energy: decide at which speed to execute each task
  • Even with k different levels of checkpoints and partial verifications: algorithm in $O(n^{k+5})$
  • Simulations: with replication, gain over checkpoint-only approach is quite significant when checkpoint is costly and error rate is high


slide-95
SLIDE 95

Silent vs fail-stop errors

C: time to checkpoint; V: time to verify; R: time to recover; λ: error rate (platform MTBF µ = 1/λ)

Optimal checkpointing period W for fail-stop errors (Young/Daly): $W = \sqrt{2C\mu}$ (V = 0)

[Figure: timeline of periods of work W, each followed by V and C; a fail-stop error strikes during a period and is followed by a recovery R.]

Silent errors: $W = \sqrt{(V + C)\mu}$ (C → V + C; missing factor 2)

[Figure: timeline of periods of work W, each followed by V and C; a silent error strikes during a period, is detected by the verification V at the end of that period, and is followed by a recovery R.]
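The two periods are easy to compare numerically. This is a minimal sketch; the function names and the sample values of C, V and µ are assumptions for illustration.

```python
import math


def failstop_period(c: float, mu: float) -> float:
    """Young/Daly period for fail-stop errors (no verification)."""
    return math.sqrt(2.0 * c * mu)


def silent_period(c: float, v: float, mu: float) -> float:
    """First-order period for silent errors with verified checkpoints."""
    return math.sqrt((v + c) * mu)
```

With the same C and V = 0, the silent-error period is shorter by a factor sqrt(2): a silent error is only detected at the end of the period, so the whole period is lost rather than half of it on average.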


slide-96
SLIDE 96

Back to energy consumption

  • Need to reduce energy consumption of future platforms
  • Popular technique: dynamic voltage and frequency scaling (DVFS)
  • Lower speed → energy savings: when computing at speed σ, power proportional to $\sigma^3$ and execution time proportional to $1/\sigma$ → (dynamic) energy proportional to $\sigma^2$
  • Also account for static energy: trade-offs to be found
  • Realistic approach: minimize energy consumption while guaranteeing a performance bound
⇒ At which speed should we execute the workload?
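As a worked illustration of the trade-off, take the power model used later in the talk, with static power $P_{idle}$ and dynamic power $\kappa\sigma^3$. The optimal speed below ignores any performance bound and is only meant to show why the slowest speed is not necessarily the most energy-efficient.

```latex
% Dynamic energy only: halving the speed doubles the time, divides the
% power by 8, and hence divides the dynamic energy by 4.
E_{dyn}(\sigma) \propto \frac{1}{\sigma} \cdot \sigma^{3} = \sigma^{2},
\qquad
\frac{E_{dyn}(\sigma/2)}{E_{dyn}(\sigma)} = \frac{1}{4}.

% With static power, the energy for W units of work is
E(\sigma) = \frac{W}{\sigma}\left(P_{idle} + \kappa\sigma^{3}\right),
\qquad
\frac{dE}{d\sigma} = W\left(-\frac{P_{idle}}{\sigma^{2}} + 2\kappa\sigma\right) = 0
\;\Longrightarrow\;
\sigma^{*} = \left(\frac{P_{idle}}{2\kappa}\right)^{1/3}.
```

Below $\sigma^{*}$, the static energy paid over the longer execution outweighs the dynamic savings.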


slide-99
SLIDE 99

Framework

  • Divisible-load applications
  • Subject to silent data corruption
  • Checkpoint/restart strategy: periodic patterns that repeat over time
  • Verified checkpoints
  • Is it better to use two different speeds rather than only one?
  • What are the optimal checkpointing period and optimal execution speeds?


slide-100
SLIDE 100

Model

  • Set of speeds $S = \{s_1, \ldots, s_K\}$: $\sigma_1 \in S$ speed for first execution, $\sigma_2 \in S$ speed for re-executions
  • Silent errors: exponential distribution of rate λ
  • Verification: V units of work; Checkpointing: time C; Recovery: time R
  • $P_{idle}$ and $P_{io}$ constant; and $P_{cpu}(\sigma) = \kappa\sigma^3$
  • Energy for W units of work at speed σ: $\frac{W}{\sigma}(P_{idle} + \kappa\sigma^3)$
  • Energy of a verification at speed σ: $\frac{V}{\sigma}(P_{idle} + \kappa\sigma^3)$
  • Energy of a checkpoint: $C(P_{idle} + P_{io})$
  • Energy of a recovery: $R(P_{idle} + P_{io})$

[Figure: execution timeline with a silent error: the error strikes during a period of work, is detected by the verification V, and is followed by a recovery R and a re-execution; error-free periods end with V and C.]
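The energy terms of the model translate directly into code. This is a minimal sketch; the function names are hypothetical, and `error_free_pattern_energy` simply adds up the terms of one pattern (work, verification, checkpoint) when no error strikes.

```python
def compute_energy(work: float, sigma: float, p_idle: float, kappa: float) -> float:
    """Energy for `work` units of work (or verification) at speed sigma."""
    return (work / sigma) * (p_idle + kappa * sigma**3)


def io_energy(duration: float, p_idle: float, p_io: float) -> float:
    """Energy of a checkpoint or a recovery lasting `duration`."""
    return duration * (p_idle + p_io)


def error_free_pattern_energy(w, v, c, sigma, p_idle, p_io, kappa) -> float:
    """Energy of one pattern without error: W and V at speed sigma, then C."""
    return compute_energy(w + v, sigma, p_idle, kappa) + io_energy(c, p_idle, p_io)
```

For instance, 100 units of work at speed 2 with `p_idle=10` and `kappa=1` take 50 time units at power 18.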


slide-105
SLIDE 105

Problem

Optimization problem BiCrit:

$$\text{Minimize } \frac{E(W, \sigma_1, \sigma_2)}{W} \quad \text{s.t.} \quad \frac{T(W, \sigma_1, \sigma_2)}{W} \le \rho$$

  • $E(W, \sigma_1, \sigma_2)$ is the expected energy consumed to execute $W$ units of work at speed $\sigma_1$, with possible re-executions at speed $\sigma_2$
  • $T(W, \sigma_1, \sigma_2)$ is the expected time to execute $W$ units of work at speed $\sigma_1$, with possible re-executions at speed $\sigma_2$
  • $\rho$ is a performance bound, or admissible degradation factor

slide-106
SLIDE 106

Computing expected execution time

Proposition (1) For the BiCrit problem with a single speed,

$$T(W, \sigma, \sigma) = C + e^{\lambda W/\sigma}\left(\frac{W+V}{\sigma}\right) + \left(e^{\lambda W/\sigma} - 1\right) R$$

Proposition (2) For the BiCrit problem,

$$T(W, \sigma_1, \sigma_2) = C + \frac{W+V}{\sigma_1} + \left(1 - e^{-\lambda W/\sigma_1}\right) e^{\lambda W/\sigma_2} \left(R + \frac{W+V}{\sigma_2}\right)$$
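
The two propositions can be evaluated numerically. The sketch below is only an illustration: the function names `expected_time` and `expected_time_single` are hypothetical, and the arguments simply mirror the symbols $W$, $\sigma_1$, $\sigma_2$, $\lambda$, $C$, $R$ and $V$ of the model.

```python
import math


def expected_time(W, s1, s2, lam, C, R, V):
    """Expected time of a pattern, following Proposition 2.

    W: work, s1/s2: first-execution and re-execution speeds,
    lam: silent error rate, C/R: checkpoint and recovery times,
    V: verification work.
    """
    # Probability that a silent error strikes the first execution.
    p1 = 1.0 - math.exp(-lam * W / s1)
    # Expected number of attempts at speed s2 until one succeeds.
    attempts = math.exp(lam * W / s2)
    return C + (W + V) / s1 + p1 * attempts * (R + (W + V) / s2)


def expected_time_single(W, s, lam, C, R, V):
    """Expected time with a single speed, following Proposition 1."""
    g = math.exp(lam * W / s)
    return C + g * (W + V) / s + (g - 1.0) * R
```

With `s1 == s2` the two functions should agree, which is a quick way to check that Proposition 2 generalizes Proposition 1.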

slide-107
SLIDE 107

Proof of Proposition 1

Proof. The recursive equation to compute $T(W, \sigma, \sigma)$ is:

$$T(W, \sigma, \sigma) = \frac{W+V}{\sigma} + p(W/\sigma)\,\bigl(R + T(W, \sigma, \sigma)\bigr) + \bigl(1 - p(W/\sigma)\bigr)\,C,$$

where $p(W/\sigma) = 1 - e^{-\lambda W/\sigma}$. The reasoning is as follows:

  • We always execute $W$ units of work followed by the verification, in time $\frac{W+V}{\sigma}$;
  • With probability $p(W/\sigma)$, a silent error occurred and is detected, in which case we recover and start anew;
  • Otherwise, with probability $1 - p(W/\sigma)$, we simply checkpoint after a successful execution.

Solving this equation leads to the expected execution time.
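
The solving step can be spelled out as follows, writing $T$ for $T(W, \sigma, \sigma)$ and $p$ for $p(W/\sigma)$, and using $\frac{1}{1-p} = e^{\lambda W/\sigma}$ and $\frac{p}{1-p} = e^{\lambda W/\sigma} - 1$:

```latex
\begin{align*}
T &= \frac{W+V}{\sigma} + p\,(R + T) + (1-p)\,C \\
(1-p)\,T &= \frac{W+V}{\sigma} + p\,R + (1-p)\,C \\
T &= C + e^{\lambda W/\sigma}\,\frac{W+V}{\sigma}
     + \left(e^{\lambda W/\sigma} - 1\right) R
\end{align*}
```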

slide-108
SLIDE 108

Proof of Proposition 2

Proof. The recursive equation to compute $T(W, \sigma_1, \sigma_2)$ is:

$$T(W, \sigma_1, \sigma_2) = \frac{W+V}{\sigma_1} + p(W/\sigma_1)\,\bigl(R + T(W, \sigma_2, \sigma_2)\bigr) + \bigl(1 - p(W/\sigma_1)\bigr)\,C,$$

where $p(W/\sigma_1) = 1 - e^{-\lambda W/\sigma_1}$. The reasoning is as follows:

  • We always execute $W$ units of work followed by the verification, in time $\frac{W+V}{\sigma_1}$;
  • With probability $p(W/\sigma_1)$, a silent error occurred and is detected, in which case we recover and start anew at speed $\sigma_2$;
  • Otherwise, with probability $1 - p(W/\sigma_1)$, we simply checkpoint after a successful execution.

Solving this equation leads to the expected execution time.
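
Here the equation is not recursive in $T(W, \sigma_1, \sigma_2)$ itself: it is enough to substitute the single-speed result $T(W, \sigma_2, \sigma_2) = C + e^{\lambda W/\sigma_2}\frac{W+V}{\sigma_2} + (e^{\lambda W/\sigma_2} - 1)R$. Writing $p_1$ for $p(W/\sigma_1)$, the substitution reads:

```latex
\begin{align*}
T(W,\sigma_1,\sigma_2)
  &= \frac{W+V}{\sigma_1}
     + p_1\left(C + e^{\lambda W/\sigma_2}\,\frac{W+V}{\sigma_2}
                + e^{\lambda W/\sigma_2} R\right)
     + (1-p_1)\,C \\
  &= C + \frac{W+V}{\sigma_1}
     + \left(1 - e^{-\lambda W/\sigma_1}\right) e^{\lambda W/\sigma_2}
       \left(R + \frac{W+V}{\sigma_2}\right)
\end{align*}
```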

slide-109
SLIDE 109

Computing expected energy consumption

Proposition For the BiCrit problem,

$$E(W, \sigma_1, \sigma_2) = \left(C + \left(1 - e^{-\lambda W/\sigma_1}\right) e^{\lambda W/\sigma_2} R\right)(P_{io} + P_{idle}) + \frac{W+V}{\sigma_1}\,(\kappa\sigma_1^3 + P_{idle}) + \frac{W+V}{\sigma_2}\left(1 - e^{-\lambda W/\sigma_1}\right) e^{\lambda W/\sigma_2}\,(\kappa\sigma_2^3 + P_{idle})$$

  • Power spent during checkpoint or recovery: $P_{io} + P_{idle}$
  • Power spent during computation and verification at speed $\sigma$: $P_{cpu}(\sigma) + P_{idle} = \kappa\sigma^3 + P_{idle}$

From Proposition 2, we get the expression of $E(W, \sigma_1, \sigma_2)$.
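
As an illustration, the energy expression can be coded by weighting each time component of Proposition 2 with the power drawn during it. This is a minimal sketch; the name `expected_energy` and the argument names `kappa`, `p_idle` and `p_io` are assumptions that mirror $\kappa$, $P_{idle}$ and $P_{io}$.

```python
import math


def expected_energy(W, s1, s2, lam, C, R, V, kappa, p_idle, p_io):
    """Expected energy of a pattern of W units of work (two-speed model)."""
    # Expected number of re-executions at speed s2.
    retries = (1.0 - math.exp(-lam * W / s1)) * math.exp(lam * W / s2)
    # Checkpoint and recoveries draw I/O power plus idle power.
    io_energy = (C + retries * R) * (p_io + p_idle)
    # First execution and verification at speed s1.
    first = (W + V) / s1 * (kappa * s1**3 + p_idle)
    # Re-executions and their verifications at speed s2.
    redo = (W + V) / s2 * retries * (kappa * s2**3 + p_idle)
    return io_energy + first + redo
```

Setting `lam = 0` removes all re-execution terms and leaves only the cost of one execution, one verification and one checkpoint.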

slide-110
SLIDE 110

Finding optimal pattern length (1)

To get a closed-form expression for the optimal value of $W$, we use first-order approximations, based on the Taylor expansion $e^{\lambda W} = 1 + \lambda W + O(\lambda^2 W^2)$:

$$\frac{T(W, \sigma_1, \sigma_2)}{W} = \frac{1}{\sigma_1} + \frac{\lambda W}{\sigma_1\sigma_2} + \frac{\lambda R}{\sigma_1} + \frac{\lambda V}{\sigma_1\sigma_2} + \frac{C + V/\sigma_1}{W} + O(\lambda^2 W) \tag{1}$$

$$\frac{E(W, \sigma_1, \sigma_2)}{W} = \frac{\kappa\sigma_1^3 + P_{idle}}{\sigma_1} + \frac{\lambda W}{\sigma_1\sigma_2}(\kappa\sigma_2^3 + P_{idle}) + \frac{\lambda R}{\sigma_1}(P_{io} + P_{idle}) + \frac{\lambda V}{\sigma_1\sigma_2}(\kappa\sigma_2^3 + P_{idle}) + \frac{C(P_{io} + P_{idle}) + V(\kappa\sigma_1^3 + P_{idle})/\sigma_1}{W} + O(\lambda^2 W) \tag{2}$$
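
A quick numerical check gives a feel for the quality of the first-order approximation of the time overhead. The sketch below compares the exact expression of Proposition 2 with Equation (1); the function names are hypothetical, and the sample values used in the comment (λ = 3.38e-6, C = R = 300, V = 15.4, speed 0.4) are taken from the Hera platform of the simulation setup.

```python
import math


def time_overhead_exact(W, s1, s2, lam, C, R, V):
    """T(W, s1, s2) / W computed from Proposition 2."""
    p1 = 1.0 - math.exp(-lam * W / s1)
    T = C + (W + V) / s1 + p1 * math.exp(lam * W / s2) * (R + (W + V) / s2)
    return T / W


def time_overhead_first_order(W, s1, s2, lam, C, R, V):
    """T(W, s1, s2) / W from Equation (1), dropping the O(lam^2 W) term."""
    return (1.0 / s1
            + lam * W / (s1 * s2)
            + lam * R / s1
            + lam * V / (s1 * s2)
            + (C + V / s1) / W)


def relative_gap(W, s1, s2, lam, C, R, V):
    """Relative difference between the exact and first-order overheads."""
    exact = time_overhead_exact(W, s1, s2, lam, C, R, V)
    approx = time_overhead_first_order(W, s1, s2, lam, C, R, V)
    return abs(exact - approx) / exact


# Example call with Hera-like values:
# relative_gap(2764.0, 0.4, 0.4, 3.38e-6, 300.0, 300.0, 15.4)
```

The gap grows with $\lambda W$, so the approximation is only meaningful while $\lambda W / \sigma_1$ and $\lambda W / \sigma_2$ stay small.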

slide-111
SLIDE 111

Finding optimal pattern length (2)

Theorem Given $\sigma_1$, $\sigma_2$ and $\rho$, consider the equation $aW^2 + bW + c = 0$, where

$$a = \frac{\lambda}{\sigma_1\sigma_2}, \qquad b = \frac{1}{\sigma_1} + \lambda\left(\frac{R}{\sigma_1} + \frac{V}{\sigma_1\sigma_2}\right) - \rho, \qquad c = C + \frac{V}{\sigma_1}.$$

  • If there is no positive solution to the equation, i.e., $b > -2\sqrt{ac}$, then BiCrit has no solution.
  • Otherwise, let $W_1$ and $W_2$ be the two solutions of the equation with $W_1 \le W_2$ (at least $W_2$ is positive and possibly $W_1 = W_2$). Then, the optimal pattern size is

$$W_{opt} = \min(\max(W_1, W_e), W_2), \tag{3}$$

where

$$W_e = \sqrt{\frac{C(P_{io} + P_{idle}) + \frac{V}{\sigma_1}(\kappa\sigma_1^3 + P_{idle})}{\frac{\lambda}{\sigma_1\sigma_2}(\kappa\sigma_2^3 + P_{idle})}}. \tag{4}$$
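
The theorem translates directly into a short routine. The sketch below is illustrative only: `optimal_pattern` is a hypothetical name, `None` is an arbitrary way to signal that no solution exists, and the sample constants use the Hera/XScale values of the simulation setup, with $P_{io}$ taken as the CPU power at the lowest speed ($\kappa \cdot 0.15^3$), which is one reading of the default stated there.

```python
import math

# Sample parameters (Hera platform, Intel XScale processor).
HERA_XSCALE = dict(lam=3.38e-6, C=300.0, R=300.0, V=15.4,
                   kappa=1550.0, p_idle=60.0, p_io=1550.0 * 0.15**3)


def optimal_pattern(s1, s2, rho, lam, C, R, V, kappa, p_idle, p_io):
    """Return W_opt from Equations (3) and (4), or None if infeasible."""
    a = lam / (s1 * s2)
    b = 1.0 / s1 + lam * (R / s1 + V / (s1 * s2)) - rho
    c = C + V / s1
    # No positive root: the performance bound rho cannot be met.
    if b > -2.0 * math.sqrt(a * c):
        return None
    disc = math.sqrt(max(b * b - 4.0 * a * c, 0.0))
    w1 = (-b - disc) / (2.0 * a)
    w2 = (-b + disc) / (2.0 * a)
    # Unconstrained minimizer of the first-order energy overhead.
    num = C * (p_io + p_idle) + V / s1 * (kappa * s1**3 + p_idle)
    den = lam / (s1 * s2) * (kappa * s2**3 + p_idle)
    w_e = math.sqrt(num / den)
    # Clamp W_e into the feasible interval [W1, W2].
    return min(max(w1, w_e), w2)
```

For example, `optimal_pattern(0.4, 0.4, 3.0, **HERA_XSCALE)` falls inside the feasible interval, so the result is $W_e$ itself, while a tight bound such as `rho=1.775` with speeds `(0.6, 0.8)` makes the constraint binding and returns $W_1$.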

slide-112
SLIDE 112

Finding optimal pattern length (3)

Proof. Neglecting lower-order terms, Equation (2) is minimized when $W = W_e$ given by Equation (4). Two cases:

  • $\rho$ is too small ⇒ no solution
  • $W_2 > 0$, with three sub-cases:
      • $W_e < W_1$: the optimum is $W_1$
      • $W_1 \le W_e \le W_2$: the optimum is $W_e$
      • $W_e > W_2$: the optimum is $W_2$

Using that the energy overhead is a convex function, we get the result ($W_{opt}$ is in the interval $[W_1, W_2]$).
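
The expression of $W_e$ and the convexity argument follow from a one-line calculation. Grouping the terms of the first-order energy overhead by their dependence on $W$, with $A = \frac{\lambda}{\sigma_1\sigma_2}(\kappa\sigma_2^3 + P_{idle})$ and $B = C(P_{io} + P_{idle}) + \frac{V}{\sigma_1}(\kappa\sigma_1^3 + P_{idle})$ (shorthand introduced here, not on the slides):

```latex
\begin{align*}
\frac{E(W,\sigma_1,\sigma_2)}{W} &\approx A\,W + \frac{B}{W} + \text{const} \\
\frac{d}{dW}\left(A\,W + \frac{B}{W}\right) &= A - \frac{B}{W^2} = 0
  \quad\Longrightarrow\quad W_e = \sqrt{\frac{B}{A}} \\
\frac{d^2}{dW^2}\left(A\,W + \frac{B}{W}\right) &= \frac{2B}{W^3} > 0
  \quad \text{for } W > 0
\end{align*}
```

Since the overhead is convex for $W > 0$, it decreases before $W_e$ and increases after it, so the best feasible value is $W_e$ clamped to $[W_1, W_2]$.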

slide-113
SLIDE 113

Finding optimal speed pair

Speed pair $(s_i, s_j)$, with $1 \le i, j \le K$: $\rho_{i,j}$ is the minimum performance bound for which the BiCrit problem with $\sigma_1 = s_i$ and $\sigma_2 = s_j$ admits a solution

  • For each speed pair, compute $W_1$, $W_2$, the roots of $aW^2 + bW + c$; discard pairs with $\rho < \rho_{i,j}$
  • For each remaining speed pair $(\sigma_1, \sigma_2)$, compute $W_{opt}$ and the associated energy overhead
  • Select the speed pair $(\sigma_1^*, \sigma_2^*)$ that minimizes the energy overhead

Time $O(K^2)$, where $K$ is the number of available speeds, usually a small constant
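
The enumeration over speed pairs can be sketched as follows. Everything here is illustrative: the function names are hypothetical, the energy overhead is evaluated with the first-order expression (re-execution terms, including the verification, charged at speed $\sigma_2$), and the sample constants assume the Hera/XScale values with $P_{io}$ equal to the CPU power at the lowest speed.

```python
import math
from itertools import product

XSCALE_SPEEDS = [0.15, 0.4, 0.6, 0.8, 1.0]
HERA_XSCALE = dict(lam=3.38e-6, C=300.0, R=300.0, V=15.4,
                   kappa=1550.0, p_idle=60.0, p_io=1550.0 * 0.15**3)


def optimal_pattern(s1, s2, rho, lam, C, R, V, kappa, p_idle, p_io):
    """W_opt = min(max(W1, W_e), W2), or None when rho is too small."""
    a = lam / (s1 * s2)
    b = 1.0 / s1 + lam * (R / s1 + V / (s1 * s2)) - rho
    c = C + V / s1
    if b > -2.0 * math.sqrt(a * c):
        return None
    disc = math.sqrt(max(b * b - 4.0 * a * c, 0.0))
    w1, w2 = (-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)
    num = C * (p_io + p_idle) + V / s1 * (kappa * s1**3 + p_idle)
    den = a * (kappa * s2**3 + p_idle)
    return min(max(w1, math.sqrt(num / den)), w2)


def energy_overhead(W, s1, s2, lam, C, R, V, kappa, p_idle, p_io):
    """First-order approximation of E(W, s1, s2) / W."""
    pw1 = kappa * s1**3 + p_idle
    pw2 = kappa * s2**3 + p_idle
    return (pw1 / s1
            + lam * W / (s1 * s2) * pw2
            + lam * R / s1 * (p_io + p_idle)
            + lam * V / (s1 * s2) * pw2
            + (C * (p_io + p_idle) + V * pw1 / s1) / W)


def best_speed_pair(speeds, rho, **params):
    """Try all K^2 pairs; return (s1, s2, W_opt, overhead) or None."""
    best = None
    for s1, s2 in product(speeds, repeat=2):
        w = optimal_pattern(s1, s2, rho, **params)
        if w is None:
            continue  # rho is below the minimum bound for this pair
        e = energy_overhead(w, s1, s2, **params)
        if best is None or e < best[3]:
            best = (s1, s2, w, e)
    return best
```

A call such as `best_speed_pair(XSCALE_SPEEDS, 3.0, **HERA_XSCALE)` scans the 25 candidate pairs; each pair costs constant time, which gives the $O(K^2)$ bound.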

slide-114
SLIDE 114

Outline

1 Checkpointing for resilience
  • How to cope with errors?
  • Optimization objective and optimal period
  • Optimal period when accounting for energy consumption

2 Combining checkpoint with replication
  • Replication analysis
  • Simulations

3 Back to task scheduling

4 A different re-execution speed can help
  • Model, optimization problem, optimal solution
  • Simulations
  • Extensions: both fail-stop and silent errors

5 Summary and need for trade-offs

slide-115
SLIDE 115

Simulation setup

Platform parameters, based on real platforms:

Platform       λ          C = R    V
Hera           3.38e-6    300s     15.4
Atlas          7.78e-6    439s     9.1
Coastal        2.01e-6    1051s    4.5
Coastal SSD    2.01e-6    2500s    180.0

Power parameters, determined by the processor used:

Processor           Normalized speeds         P(σ) (mW)
Intel XScale        0.15, 0.4, 0.6, 0.8, 1    1550σ^3 + 60
Transmeta Crusoe    0.45, 0.6, 0.8, 0.9, 1    5756σ^3 + 4.4

Default values: $P_{io}$ equivalent to the power used when running at the lowest speed; $\rho = 3$

slide-116
SLIDE 116

Simulation results, using Hera/XScale configuration

A different re-execution speed does help! And all speed pairs can be optimal solutions (depending on ρ)!

In the tables below, "-" marks a speed σ1 for which no solution satisfies the bound ρ.

ρ = 8
σ1      Best σ2    Wopt    E(Wopt,σ1,σ2)/Wopt
0.15    0.4        1711    466
0.4     0.4        2764    416
0.6     0.4        3639    674
0.8     0.4        4627    1082
1       0.4        5742    1625

ρ = 3
σ1      Best σ2    Wopt    E(Wopt,σ1,σ2)/Wopt
0.15    -          -       -
0.4     0.4        2764    416
0.6     0.4        3639    674
0.8     0.4        4627    1082
1       0.4        5742    1625

ρ = 1.775
σ1      Best σ2    Wopt    E(Wopt,σ1,σ2)/Wopt
0.15    -          -       -
0.4     -          -       -
0.6     0.8        4251    690
0.8     0.4        4627    1082
1       0.4        5742    1625

ρ = 1.4
σ1      Best σ2    Wopt    E(Wopt,σ1,σ2)/Wopt
0.15    -          -       -
0.4     -          -       -
0.6     -          -       -
0.8     0.4        4627    1082
1       0.4        5742    1625

slide-117
SLIDE 117

Simulations - Impact of the parameters (1)

[Figure: three panels (speed, optimal W, energy overhead) plotted against C, comparing the two-speed solution (σ1, σ2) with the single-speed solution (σ).]
Optimal solution (speed pair, pattern size, and energy overhead) as a function of the checkpointing time C in Atlas/Crusoe configuration.

[Figure: three panels (speed, optimal W, energy overhead) plotted against V, comparing the two-speed solution (σ1, σ2) with the single-speed solution (σ).]
Optimal solution (speed pair, pattern size, and energy overhead) as a function of the verification time V in Atlas/Crusoe configuration.

Dotted line: one single speed; up to 35% improvement with two speeds

slide-118
SLIDE 118

Simulations - Impact of the parameters (2)

[Figure: three panels (speed, optimal W, energy overhead) plotted against λ on a logarithmic axis from 10^-6 to 10^-2, comparing the two-speed solution (σ1, σ2) with the single-speed solution (σ).]
Optimal solution (speed pair, pattern size, and energy overhead) as a function of the error rate λ in Atlas/Crusoe configuration.

[Figure: three panels (speed, optimal W, energy overhead) plotted against ρ, comparing the two-speed solution (σ1, σ2) with the single-speed solution (σ).]
Optimal solution (speed pair, pattern size, and energy overhead) as a function of the performance bound ρ in Atlas/Crusoe configuration.

Two speeds: checkpoint less frequently and provide energy savings

slide-119
SLIDE 119

Simulations - Impact of the parameters (3)

[Figure: three panels (speed, optimal W, energy overhead) plotted against Pidle, comparing the two-speed solution (σ1, σ2) with the single-speed solution (σ).]
Optimal solution (speed pair, pattern size, and energy overhead) as a function of the idle power Pidle in Atlas/Crusoe configuration.

[Figure: three panels (speed, optimal W, energy overhead) plotted against Pio, comparing the two-speed solution (σ1, σ2) with the single-speed solution (σ).]
Optimal solution (speed pair, pattern size, and energy overhead) as a function of the I/O power Pio in Atlas/Crusoe configuration.

Increase of W and E with Pidle and Pio; Pio has no impact on speeds

slide-121
SLIDE 121

Extensions: With fail-stop errors

  • $f$: proportion of fail-stop errors
  • $s$: proportion of silent errors

Proposition (3) With fail-stop and silent errors,

$$\frac{T(W, \sigma_1, \sigma_2)}{W} = \cdots + \left(\frac{f + s}{\sigma_1\sigma_2} - \frac{f}{2\sigma_1^2}\right)\lambda W + O(\lambda^2 W) \tag{5}$$

$$\frac{E(W, \sigma_1, \sigma_2)}{W} = \cdots + \left(\frac{(f + s)(\kappa\sigma_2^3 + P_{idle})}{\sigma_1\sigma_2} - \frac{f(\kappa\sigma_1^3 + P_{idle})}{2\sigma_1^2}\right)\lambda W + O(\lambda^2 W) \tag{6}$$

slide-122
SLIDE 122

Limit of the first-order approximation

For BiCrit, the first-order approximation leads to a solution iff

$$\left(2\left(1 + \frac{s}{f}\right)\right)^{-1/2} < \frac{\sigma_2}{\sigma_1} < 2\left(1 + \frac{s}{f}\right)$$

Use second-order approximation? Open problem in the general case!
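
The condition is easy to test for a given speed pair. The helper below is a hypothetical sketch (the name `first_order_is_valid` is not from the talk) and assumes $f > 0$, since the bound divides by $f$.

```python
def first_order_is_valid(s1, s2, f, s):
    """Check the validity interval of the first-order approximation.

    s1, s2: first-execution and re-execution speeds.
    f, s: proportions of fail-stop and silent errors (f > 0 assumed).
    """
    ratio = s2 / s1
    bound = 2.0 * (1.0 + s / f)
    # Strict inequalities on both sides, as in the stated condition.
    return bound ** -0.5 < ratio < bound
```

With fail-stop errors only (`f=1.0, s=0.0`), the interval is $(1/\sqrt{2}, 2)$, so re-executing exactly twice as fast falls on the excluded upper boundary.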

slide-123
SLIDE 123

Interesting case

Theorem When considering only fail-stop errors with rate $\lambda$, the optimal pattern size $W$ to minimize the time overhead $\frac{T(W, \sigma, 2\sigma)}{W}$ is

$$W_{opt} = \sqrt[3]{\frac{12C}{\lambda^2}}\;\sigma$$

  • Young/Daly's formula: $W_{opt} = \sqrt{2C/\lambda}\;\sigma = O(\lambda^{-1/2})$
  • Here: $W_{opt} = O(\lambda^{-2/3})$
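
The difference in scaling can be seen numerically. The sketch below simply evaluates the two closed forms stated on this slide; the function names are hypothetical, and the sample values in the comment (C = 300, λ = 1e-6, σ = 1) are arbitrary.

```python
import math


def young_daly_pattern(C, lam, sigma):
    """Young/Daly pattern size: sqrt(2C / lam) * sigma."""
    return math.sqrt(2.0 * C / lam) * sigma


def double_speed_pattern(C, lam, sigma):
    """Pattern size when re-executing at 2 * sigma: cbrt(12C / lam^2) * sigma."""
    return (12.0 * C / lam**2) ** (1.0 / 3.0) * sigma


# Example: with C = 300, lam = 1e-6 and sigma = 1, the double-speed pattern
# is several times longer than the Young/Daly one, i.e. fewer checkpoints.
# Multiplying lam by 8 divides the first by 8**(2/3) = 4 and the second by
# 8**(1/2), which reflects the O(lam^(-2/3)) versus O(lam^(-1/2)) behaviour.
```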

slide-124
SLIDE 124

Conclusion

  • A different re-execution speed indeed helps save energy while satisfying a performance constraint
  • Silent errors: extension of the Young/Daly formula → general closed-form solution to get the optimal speed pair and the optimal checkpointing period (first-order)
  • Extensive simulations: up to 35% energy savings, any speed pair can be optimal
  • BiCrit still open for the general case with both silent and fail-stop errors
  • Interesting case with fail-stop errors and double re-execution speed: $O(\lambda^{-2/3})$ vs $O(\lambda^{-1/2})$
  • New methods needed to capture the general case

slide-126
SLIDE 126

Summary and need for trade-offs

Two major challenges for Exascale systems:
  • Resilience: need to handle failures
  • Energy: need to reduce energy consumption

The main objective is often performance, such as execution time, but other criteria must be accounted for

Many models for which we have the answer:
  • Optimal checkpointing period, with fail-stop / silent errors
  • Use of replication to detect and correct silent errors
  • When to checkpoint, replicate and verify for a chain of tasks?
  • Use a different re-execution speed after a failure

Still a lot of challenges to address, and techniques to be developed for many kinds of high-performance applications, making trade-offs between performance, reliability, and energy consumption

slide-129
SLIDE 129

Thanks...

... to my co-authors

Valentin Le Fèvre, Aurélien Cavelan, Hongyang Sun
Yves Robert
Franck Cappello, Padma Raghavan, Florina M. Ciorba

... and to the Winter School organizers for their kind invitation!

A few references:

  • A. Benoit, A. Cavelan, Y. Robert, H. Sun. Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors. TOPC, 2016.
  • A. Benoit, A. Cavelan, F. Cappello, P. Raghavan, Y. Robert, H. Sun. Identifying the right replication level to detect and correct silent errors at scale. FTXS/HPDC, 2017.
  • A. Benoit, A. Cavelan, Y. Robert, H. Sun. Multi-level checkpointing and silent error detection for linear workflows. JoCS, 2017.
  • A. Benoit, A. Cavelan, F. Ciorba, V. Le Fèvre, Y. Robert. Combining checkpointing and replication for reliable execution of linear workflows with fail-stop and silent errors. IJNC, 2019.
  • A. Benoit, A. Cavelan, V. Le Fèvre, Y. Robert, H. Sun. A different re-execution speed can help. PASA/ICPP, 2016.
