SLIDE 1

Practical foundations for resilient applications

George Bosilca

Algorithms and Scheduling Techniques to Manage Resilience and Power – Dagstuhl 2015

SLIDE 2

Failures are bad for business …

  • In HPC: “Today, 20% or more of the computing capacity in a large high-performance computing system is wasted due to failures and recoveries” (Dr. M. Elnozahy et al., System Resilience at Extreme Scale, DARPA)
  • Outside HPC: dynamic execution environments (clouds) are not suitable for parallel application execution due to volatility.
  • Tomorrow: the U.S. Department of Energy identified 10 research challenges to Exascale. One of them is resilience and correctness: ensuring correct scientific computation in the face of faults, reproducibility, and algorithm verification challenges.
SLIDE 3

Fault Tolerance: many solutions

  • Rollback Recovery
  • Legacy approach
  • Checkpoint/Restart based
  • Active research on introducing more asynchrony (uncoordinated checkpoint, message logging, correlated sets), increasing the MTBF (hardware) and decreasing the overheads (buddy checkpointing, NVRAM)

  • Forward Recovery
  • Replication (the only system-level Forward Recovery)
  • Master-Worker with simple resubmission
  • Iterative methods, naturally fault-tolerant algorithms
  • Algorithm Based Fault Tolerance

[Figures: Master-Worker timeline (Master, Worker0–Worker2) with task resubmission; ABFT factorization with protection blocks (factorized in previous iterations; trailing matrix & protection updated by applying the same operations); coordinated checkpoint (with blocking, constant checkpoints)]
SLIDE 4

Research Status Anatomy

[Chart: fault-tolerance techniques placed along two axes, Overhead (small → large) and Application Specificity (none → significant), spanning Rollback Recovery (Checkpointing & Restart, C/R) to Forward Recovery (Algorithm Based Fault Tolerance, ABFT)]
SLIDE 5

Rollback recovery modeling

[Diagrams: PUREPERIODICCKPT, a single optimal checkpoint interval covering application and library phases; BIPERIODICCKPT, a LIBRARY checkpoint interval nested inside a GENERAL checkpoint interval]

Optimal checkpoint periods (Young/Daly), where C is the checkpoint cost (C_L for library-only checkpoints), µ the platform MTBF, D the downtime, and R the recovery time:

  P_opt(PC)    = √( 2C(µ − D − R) )
  P_opt(BPC,G) = √( 2C(µ − D − R) )
  P_opt(BPC,L) = √( 2C_L(µ − D − R) )
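To make the formula concrete, here is a minimal sketch that evaluates the Young/Daly period; the numeric constants are illustrative assumptions, not measurements from the talk:

    /* Young/Daly optimal checkpoint period, following the model
     * above: C checkpoint cost, mu MTBF, D downtime, R recovery. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double C  = 60.0;       /* checkpoint cost: 1 minute (assumed) */
        double mu = 86400.0;    /* platform MTBF: 1 day (assumed)      */
        double D  = 60.0;       /* downtime after a failure (assumed)  */
        double R  = 60.0;       /* recovery/restart cost (assumed)     */
        double P  = sqrt(2.0 * C * (mu - D - R));
        printf("optimal period: %.0f s (~%.2f h)\n", P, P / 3600.0);
        return 0;
    }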

SLIDE 6

Rollback recovery modeling

[Chart: Overhead / Application Specificity map, as in Slide 4]

[Plot: number of faults and waste vs. platform size (1k–1M nodes) for PeriodicCkpt and Bi-PeriodicCkpt]

Assessing the impact of ABFT & Checkpoint composite strategies, George Bosilca, Aurelien Bouteiller, Thomas Herault, Yves Robert and Jack Dongarra, International Journal of Networking and Computing, ISSN 2185-2847

Model assumptions (evolutionary platform design):
  • Memory per component remains constant; problem size increases in O(√n)
  • MTBF of 10k nodes assumed at 1 day, scaled in O(1/n)
  • Checkpoint (and restart) cost at 10k nodes: 1 minute, scaled in O(n)
  • 80% of each iteration is spent in the ABFT algorithm, modifying 80% of the data
SLIDE 7

Rollback recovery modeling

[Same chart, plot, citation, and model assumptions as Slide 6]

Too many checkpoints!!!
SLIDE 8

Rollback recovery modeling

[Chart: Overhead / Application Specificity map, as in Slide 4]

Model assumptions (revolutionary platform design):
  • Memory per component remains constant; problem size increases in O(√n)
  • MTBF of 10k nodes assumed at 1 day, scaled in O(1/n)
  • Checkpoint (and restart) cost at 10k nodes: 1 minute, scaled in O(1)
  • An O(n^3) share of each iteration (vs. O(n^2)) is spent in the ABFT algorithm, modifying 80% of the data

[Plot: number of faults and waste vs. platform size (1k, α = 0.55; 10k, α = 0.8; 100k, α = 0.92; 1M, α = 0.975) for PeriodicCkpt and Bi-PeriodicCkpt]

Assessing the impact of ABFT & Checkpoint composite strategies, George Bosilca, Aurelien Bouteiller, Thomas Herault, Yves Robert and Jack Dongarra, International Journal of Networking and Computing, ISSN 2185-2847
SLIDE 9

Rollback recovery modeling

[Same chart, plot, citation, and model assumptions as Slide 8]

Still too many checkpoints!!!
SLIDE 10

Research Status Anatomy

[Chart: Overhead / Application Specificity map, as in Slide 4]

This situation can be improved by moving investments from hardware (more I/O bandwidth, future technologies such as NVRAM, increased component MTBF) into software and developers.
SLIDE 11

Forward Recovery

  • Any technique that permits the application to continue without rollback
  • Replication (the only system-level Forward Recovery)
  • Master-Worker with simple resubmission
  • Iterative methods, naturally fault-tolerant algorithms
  • Algorithm Based Fault Tolerance

  • No checkpoint I/O overhead
  • No rollback, no loss of completed work
  • May require (sometimes expensive, e.g. replicates) protection/recovery operations, but still generally more scalable than checkpoint
  • “Why is not everybody doing this already, then?”
  • Often requires an in-depth algorithm rewrite (in contrast to automatic system-based C/R)
  • Supposes that MPI continues to operate across failures
SLIDE 12

Forward Recovery

[Same as Slide 11, except: minimal or no rollback, no loss of completed work]
SLIDE 13

Forward Recovery

[Same as Slide 11]

Standardization of programming paradigms’ behavior after failures is a key missing infrastructure.
SLIDE 14

USER LEVEL FAILURE MITIGATION (ULFM)

Extend the MPI communication infrastructure to integrate faults as a first-class citizen of the message passing concepts.
SLIDE 15

Failure Model

  • Process failures (hard errors)
  • Fail-stop failures: a process crash (dead, never returns back to life)
  • Transient (network) failures are “upgraded” to fail-stop (may be revisited later)

  • Silent/soft failures (memory errors)
  • Message corruptions can be addressed without standard modifications
  • Memory corruptions are better addressed at the application level
  • Mechanisms should be provided to allow libraries and applications to deal with them even in a distributed setting

  • Byzantine failures are outside of the scope
SLIDE 16

What is MPI?

  • Defines complex collective behavior based on individual participants
  • The MPI standard includes two-sided (point-to-point and collective) communications, one-sided (Remote Memory Access) communications, dynamic process management, topologies, I/O, and tools support
  • Designed and maintained with the goal of portability and efficiency

History: 1994: v1.0; 1995: v1.1; 1997: v1.2, v2.0; 2008: v1.3, v2.1; 2009: v2.2; 2012: v3.0; 2015: v3.1; 2016–: v4.0
SLIDE 17

Minimal Feature Set for FT

  • Failure Notification
  • Error Propagation
  • Error Recovery

Not all recovery strategies require all of these features; that is why the interface splits notification, propagation, and recovery.
SLIDE 18

ULFM: Key Philosophy

  • Flexibility
  • No particular recovery model imposed or favored
  • The application directs the recovery: it pays only for the level of protection it needs
  • Recovery can be restricted to subgroups for scalability

  • Performance
  • Protective actions are outside of critical MPI routines
  • MPI implementors can uphold unmodified algorithms (collective, one-sided, I/O)
  • Encourages programs to be reactive to failures

  • Productivity
  • Backward compatible with legacy, fragile applications
  • Simple and familiar concepts to repair MPI
  • Provides key MPI concepts to enable FT support from libraries, runtimes, and language extensions

User Level Failure Mitigation: a set of MPI extensions to enable MPI programs to restore the MPI communication capabilities disabled by failures.
SLIDE 19

Failure Notification

  • Notification of failures is local only
  • A new error, MPI_ERR_PROC_FAILED, is raised when a communication with a targeted process fails
  • In an operation (collective), some processes may succeed while others raise an error
  • Bcast might succeed for the top of the tree, but fail for some subtree rooted on a failed process
  • Exceptions indicate an operation failed
  • To know what process failed, apps call MPI_COMM_FAILURE_ACK(), MPI_COMM_FAILURE_GET_ACKED() (see the sketch below)
  • Technicality: ANY_SOURCE must raise an exception
  • the dead could be the expected sender
  • Raise the error MPI_ERR_PROC_FAILED_PENDING, preserve matching order
  • The application can complete the recv later (after MPI_COMM_FAILURE_ACK())
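A minimal sketch of failure notification; it assumes the ULFM reference implementation, which exposes the interface with an MPIX_ prefix rather than the slide’s proposal-style names:

    /* Detect a failed peer after a collective, then query the
     * locally-known set of failed ranks. */
    #include <mpi.h>
    #include <mpi-ext.h>   /* ULFM extensions (MPIX_*) */
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rc, eclass, buf = 0;
        MPI_Init(&argc, &argv);
        /* Errors must be returned (not fatal) for FT to be possible. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        rc = MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Error_class(rc, &eclass);
        if (eclass == MPIX_ERR_PROC_FAILED) {
            MPI_Group failed;
            int nfailed;
            MPIX_Comm_failure_ack(MPI_COMM_WORLD);   /* acknowledge */
            MPIX_Comm_failure_get_acked(MPI_COMM_WORLD, &failed);
            MPI_Group_size(failed, &nfailed);
            printf("%d process(es) failed so far\n", nfailed);
            MPI_Group_free(&failed);
        }
        MPI_Finalize();
        return 0;
    }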

SLIDE 20

App using notification only

[Diagram: the Master sends T1 to W1; W1 dies; the Master’s Recv(ANY) detects W1’s failure and the Master resubmits T1 to W2]

  • Error notifications do not break MPI
  • The app can continue to communicate on the communicator
  • More errors may be raised if an op cannot complete (typically, most collective ops are expected to fail), but p2p between non-failed processes works
  • In this Master-Worker example, we can continue without recovery!
  • The Master sees a worker failed
  • Resubmits the lost work unit onto another worker
  • Quietly continues (see the sketch below)
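A sketch of such a master loop, under the same MPIX_ naming assumption; TAG_RESULT and resubmit_tasks_of() are hypothetical application glue, not ULFM calls:

    /* Master loop: a wildcard receive stays pending when it raises
     * MPIX_ERR_PROC_FAILED_PENDING, so after acknowledging the dead
     * workers and resubmitting their tasks we can wait on it again. */
    #include <mpi.h>
    #include <mpi-ext.h>

    #define TAG_RESULT 42                                 /* hypothetical */
    void resubmit_tasks_of(MPI_Group failed, MPI_Comm comm); /* app glue */

    void master_loop(MPI_Comm comm, int tasks_remaining)
    {
        int result, rc, eclass;
        MPI_Request req;
        MPI_Status st;

        while (tasks_remaining > 0) {
            MPI_Irecv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_RESULT,
                      comm, &req);
            do {
                rc = MPI_Wait(&req, &st);
                MPI_Error_class(rc, &eclass);
                if (eclass == MPIX_ERR_PROC_FAILED_PENDING) {
                    MPI_Group failed;
                    MPIX_Comm_failure_ack(comm);  /* ack known failures */
                    MPIX_Comm_failure_get_acked(comm, &failed);
                    resubmit_tasks_of(failed, comm);
                    MPI_Group_free(&failed);
                    /* req is still pending: wait on it again */
                }
            } while (eclass == MPIX_ERR_PROC_FAILED_PENDING);
            tasks_remaining--;                    /* normal completion */
        }
    }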
SLIDE 21

Error Propagation

  • Errors are local: processes may have different views of the failures
  • We need a tool to resolve potentially inconsistent behavior
  • When necessary, the app can manually propagate an error
  • MPI_COMM_REVOKE(comm)
  • Interrupts all non-local MPI calls at all ranks on comm
  • Once revoked, any non-local operation on comm raises MPI_ERR_REVOKED (except the recovery functions, of course)

[Diagram: the process that detects the failure calls Revoke; the pending Recv/Send at the other ranks are interrupted as Revoked]

SLIDE 22

App using propagation only

  • The application does only p2p communications
  • P1 fails; P2 raises an error and wants to change the communication pattern to do application recovery
  • but P3..Pn are stuck in their posted recvs
  • P2 unlocks them with Revoke
  • P3..Pn join P2 in the new recovery p2p communication pattern (see the sketch below)

[Diagram: P2’s Recv(P1) reports the failure and P2 calls Revoke; the Recv(P1) at P3..Pn return as revoked, and all survivors enter recovery]
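A sketch of this detect-and-revoke pattern (MPIX_ names from the ULFM reference implementation; enter_recovery() is hypothetical application code):

    #include <mpi.h>
    #include <mpi-ext.h>

    void enter_recovery(MPI_Comm comm);   /* hypothetical app recovery */

    void exchange_step(MPI_Comm comm, double *buf, int count,
                       int left_neighbor, int tag)
    {
        MPI_Status st;
        int rc, eclass;

        rc = MPI_Recv(buf, count, MPI_DOUBLE, left_neighbor, tag,
                      comm, &st);
        MPI_Error_class(rc, &eclass);
        if (eclass == MPIX_ERR_PROC_FAILED) {
            MPIX_Comm_revoke(comm);  /* interrupt everyone's pending ops */
            enter_recovery(comm);
        } else if (eclass == MPIX_ERR_REVOKED) {
            enter_recovery(comm);    /* someone else revoked: join in */
        }
    }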

SLIDE 23

Error Agreement

  • When in need to decide, collectively, whether there was a failure and whether the condition is recoverable
  • MPI_COMM_AGREE(comm, &value)
  • Fault-tolerant bitwise agreement over an integer
  • Unexpected failures (not acknowledged before the call) raise MPI_ERR_PROC_FAILED
  • The value can be used to compute a user condition, even when there are failures in comm
  • Can be used as a global failure detector (see the sketch below)
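A usage sketch (MPIX_ names; local_success, commit_iteration() and rollback_iteration() are hypothetical application code):

    #include <mpi.h>
    #include <mpi-ext.h>

    int  local_success;              /* set by the app's own checks */
    void commit_iteration(void);     /* hypothetical */
    void rollback_iteration(void);   /* hypothetical */

    void validate_iteration(MPI_Comm comm)
    {
        int ok = local_success ? 1 : 0;
        /* Bitwise AND across all survivors; completes despite failures. */
        int rc = MPIX_Comm_agree(comm, &ok);
        /* rc is MPIX_ERR_PROC_FAILED if new, unacknowledged failures
         * were discovered during the agreement. */
        (void)rc;
        if (ok) commit_iteration();
        else    rollback_iteration();
    }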

SLIDE 24

Error Recovery

  • Restores full communication capability (all collective ops, etc.)
  • MPI_COMM_SHRINK(comm, newcomm)
  • Creates a new communicator excluding the failed processes
  • New failures are absorbed during the operation

[Diagram: a Bcast fails once P1 dies; the survivors Shrink the communicator, and the Bcast succeeds on the new communicator (sketch below)]
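A typical repair sequence, as a sketch (MPIX_ names from the ULFM reference implementation):

    #include <mpi.h>
    #include <mpi-ext.h>

    /* Replace a damaged communicator with a working one that
     * excludes the failed processes. */
    void repair(MPI_Comm *comm)
    {
        MPI_Comm newcomm;
        MPIX_Comm_revoke(*comm);            /* release stragglers      */
        MPIX_Comm_shrink(*comm, &newcomm);  /* drop the dead processes */
        MPI_Comm_free(comm);                /* discard the broken comm */
        *comm = newcomm;                    /* collectives work again  */
    }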

SLIDE 25

Also supported

  • Remote Memory Access window objects
  • The window becomes unusable after a failure, but
  • the state of the memory window is defined after an error (except for write regions)
  • The window can be recreated (by repairing the parent communicator)

  • Files
  • The file pointer is scrambled after a failure, but
  • it can be reset by the application, and I/O resumed
  • The file can be recreated (by repairing the parent communicator)
SLIDE 26

MPIX_REVOKE

  • Technically, revoke is an asynchronous reliable broadcast
  • Scalability challenges

Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery, A. Bouteiller, G. Bosilca and J. Dongarra, EuroMPI 2015
SLIDE 27

MPIX_REVOKE

  • Technically, revoke is an asynchronous reliable broadcast
  • Scalability challenges
  • Potentially introduces a significant amount of noise
  • The noise can be persistent

Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery, A. Bouteiller, G. Bosilca and J. Dongarra, EuroMPI 2015
SLIDE 28

MPIX_REVOKE

  • Technically, revoke is an asynchronous reliable broadcast
  • Scalability challenges
  • Potentially introduces a significant amount of noise
  • The noise can be persistent
  • Perturbations due to ongoing communications

Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery, A. Bouteiller, G. Bosilca and J. Dongarra, EuroMPI 2015
SLIDE 29

MPIX_COMM_AGREE

  • Particularity: there is a double outcome; the consensus is done on the user-provided value but also on the set of dead processes
  • One of the major differences with Paxos
  • Design an algorithm with emphasis on the fault-free cases
  • Performance similar to MPI_Allreduce in this case
  • Early-Returning: return once the consensus is reached locally

Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems, T. Herault, A. Bouteiller, G. Bosilca, M. Gamell, M. Parashar, K. Teranishi and J. Dongarra, SC 2015
SLIDE 30

MPIX_COMM_AGREE

  • Particularity: there is a double outcome; the consensus is done on the user-provided value but also on the set of dead processes
  • One of the major differences with Paxos
  • Design an algorithm with emphasis on the fault-free cases
  • Performance similar to MPI_Allreduce in this case
  • Early-Returning: return once the consensus is reached locally
  • Maintains performance across multiple failures (the underlying topology remains efficient)

Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems, T. Herault, A. Bouteiller, G. Bosilca, M. Gamell, M. Parashar, K. Teranishi and J. Dongarra, SC 2015
SLIDE 31

FTLA: ABFT for dense LA

  • No checkpoints: no I/O; the cost of ABFT comes only from extra flops (to update checksums) and extra storage
  • The cost decreases with machine scale (divided by P when using a P×Q process grid)
  • Goes to 0% protection overhead at scale (as observed on Kraken)

[Plot: performance (TFlop/s) and relative overhead (%) for ScaLAPACK PDGETRF vs. FT-PDGETRF (no errors), weak scaling from a 6×6 grid with N = 20k to a 192×192 grid with N = 640k]

[Figure: checksum blocks C appended to the matrix; the factorization A’ = L·U applies the same mathematical alterations to the checksums]
SLIDE 32

FTLA: ABFT for dense LA

  • No checkpoints: no I/O; the cost of ABFT comes only from extra flops (to update checksums) and extra storage
  • The cost decreases with machine scale (divided by P when using a P×Q process grid)
  • Goes to 0% protection overhead at scale (as observed on Kraken)
  • Cost per recovery: ~1% of runtime

[Plots: LU as in Slide 31; FTLA-QR performance with one failure recovery: ScaLAPACK PDGEQRF vs. FT-PDGEQRF (no errors and one error), from a 6×6 grid with N = 20k to a 48×48 grid with N = 160k]
SLIDE 33

Kraken (24×24) using LUSTRE

[Plot: performance (TFlop/s) vs. matrix size (20k–100k) for ScaLAPACK, ABFT QR without failure, and ABFT QR with one CoF recovery]

ABFT QR without failures has identical performance to the CoF-enabled version: in the absence of faults, no checkpoint is taken.

Checkpoint-on-Failure (CoF)
  • Checkpoint all remaining processes when a fault is detected
  • Minimal fault-free overhead (identical to ABFT)
  • The checkpoint can happen locally, as long as the next allocation covers the same resources, since all alive nodes will be part of the next run (see the sketch below)
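A hedged sketch of the CoF pattern; write_local_checkpoint() and the restart path are hypothetical application code, and note that CoF only needs MPI_ERRORS_RETURN, not the full ULFM interface:

    #include <mpi.h>
    #include <stdlib.h>

    void write_local_checkpoint(void);   /* hypothetical: dump state */

    void cof_step(MPI_Comm comm, double *data, int n)
    {
        /* Any returned error triggers the checkpoint-then-exit path;
         * the job is resubmitted on the same nodes, reloads the local
         * checkpoints, and applies the ABFT recovery. */
        int rc = MPI_Allreduce(MPI_IN_PLACE, data, n, MPI_DOUBLE,
                               MPI_SUM, comm);
        if (rc != MPI_SUCCESS) {
            write_local_checkpoint();
            exit(EXIT_FAILURE);
        }
    }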

SLIDE 34

Integrated resilience techniques

  • It might be difficult to make an entire application ABFT-ready
  • But some applications exhibit iterative behaviors where a significant part of the execution is spent in ABFT-aware sections
  • Mix it with traditional checkpoint/restart and see how it behaves

[Diagram: periodic checkpoints during the general (library) phases, with forced checkpoints splitting the ABFT sections]

SLIDE 35

Integrated resilience techniques

[Plot: number of faults and waste vs. platform size (1k–1M nodes) for PeriodicCkpt, Bi-PeriodicCkpt, and ABFT PeriodicCkpt]

Model assumptions as in Slide 6 (checkpoint cost scaled in O(n); 80% of each iteration spent in the ABFT algorithm, modifying 80% of the data).

Assessing the impact of ABFT & Checkpoint composite strategies, George Bosilca, Aurelien Bouteiller, Thomas Herault, Yves Robert and Jack Dongarra, International Journal of Networking and Computing, ISSN 2185-2847
SLIDE 36

Integrated resilience techniques

[Chart: Overhead / Application Specificity map, as in Slide 4]

Model assumptions as in Slide 8 (checkpoint cost scaled in O(1)).

[Plot: number of faults and waste vs. platform size (1k, α = 0.55; 10k, α = 0.8; 100k, α = 0.92; 1M, α = 0.975) for PeriodicCkpt, Bi-PeriodicCkpt, and ABFT PeriodicCkpt]

Assessing the impact of ABFT & Checkpoint composite strategies, George Bosilca, Aurelien Bouteiller, Thomas Herault, Yves Robert and Jack Dongarra, International Journal of Networking and Computing, ISSN 2185-2847
SLIDE 37

User activities

  • ORNL: Molecular Dynamics simulation
  • Employs coordinated user-level C/R, in-place restart with Shrink
  • UAB: transactional FT programming model
  • Tsukuba: Phalanx Master-Worker framework
  • Georgia University: Wang-Landau polymer freezing and collapse
  • Employs a two-level communication scheme with group checkpoints
  • Upon failure, the tightly coupled group restarts from checkpoint; the other, distant groups continue undisturbed
  • Sandia: PDE sparse solver
  • Inria, UTK: composite strategies
  • Cray: CREST miniapps, PDE solver (Schwarz), PPStee (mesh, automotive), HemeLB (lattice Boltzmann)
  • UTK: FTLA (dense linear algebra)
  • Employs ABFT
  • FTQR returns an error to the app; the app calls new BLACS repair constructs (spawning new processes with MPI_COMM_SPAWN) and re-enters FTQR to resume (ABFT recovery embedded)
  • ETH Zurich: Monte-Carlo
  • Upon failure, shrink the global communicator (which contains spares) to recreate the same domain decomposition; restart MC with the same rank mapping as before

[Figure 5: results of the FT-MLMC implementation for three failure scenarios: (a) failure-free, (b) few failures, (c) many failures. Credits: ETH Zurich]
SLIDE 38

Fenix: Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales

Marc Gamell (1), Daniel S. Katz (2), Hemanth Kolla (3), Jacqueline Chen (3), Scott Klasky (4), Manish Parashar (1)

(1) Rutgers Discovery Informatics Institute (RDI2), Rutgers University; (2) University of Chicago & Argonne National Laboratory; (3) Sandia National Laboratories; (4) Oak Ridge National Laboratory

Key contributions

[Diagram: software stack with the application + libraries on top of Fenix, on top of the MPI runtime, on top of the O.S.]

Approach
  • On-line, local, semi-transparent recovery from process, node, blade and cabinet failures
  • Targets MPI-based parallel applications
  • Uses application-specific, double in-memory, implicitly coordinated, local checkpoints

Fenix
  • Design and implementation of the approach
  • Deployed on Titan (Cray XK7) at ORNL

Experimental evaluation
  • S3D combustion numerical simulation
  • Sustained performance with MTBF = 47 seconds
  • Experiments inject real failures

Implementation details
  • Built on top of MPI-ULFM
  • Tested up to 8192 cores with failures, and 250k cores without failures
  • Provides C, C++ and Fortran interfaces

Fenix – Recovery Stages

  • 1. Failure detection, based on
  • ULFM return codes
  • the MPI profiling interface
  • Comm revoke

  • 2. Environment recovery
  • Re-spawn or spare process pool
  • Repair the world communicator
  • Delay recreation of user communicators (see the sketch below)
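A hedged sketch of one common ULFM respawn-and-repair pattern (this is not Fenix’s actual API): the survivors shrink the damaged communicator, re-spawn replacements for the dead processes, and merge them back into a full-size world.

    #include <mpi.h>
    #include <mpi-ext.h>

    /* Rebuild a world communicator of the original size after failures.
     * The spawned processes must perform the matching merge on the
     * intercommunicator from MPI_Comm_get_parent (with high = 1). */
    MPI_Comm respawn_world(MPI_Comm broken, int orig_size, char *cmd)
    {
        MPI_Comm shrunk, inter, world;
        int alive;

        MPIX_Comm_revoke(broken);
        MPIX_Comm_shrink(broken, &shrunk);   /* survivors only */
        MPI_Comm_size(shrunk, &alive);

        MPI_Comm_spawn(cmd, MPI_ARGV_NULL, orig_size - alive,
                       MPI_INFO_NULL, 0, shrunk, &inter,
                       MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(inter, 0 /* survivors get low ranks */,
                            &world);
        return world;  /* rank reassignment/state reload is app logic */
    }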
SLIDE 39

4. Recovering from high-frequency failures

Conclusions:
  • Online recovery allows the use of in-memory checkpointing, O(0.1 s)
  • Efficient recovery from high-frequency node failures, as exascale compels
  • With failures injected every 189, 94 and 47 seconds, the total job run-time penalty is 10%, 15% and 31%, respectively
  • Note that current production runs’ fault-tolerance cost is 31%!
  • This can improve dramatically by optimizing the ULFM shrink
SLIDE 40

4. Recovering from high-frequency failures (continued)

[Conclusions as in Slide 39, adding:]

  Fault interval (s)   Before   After
  47                   31%      17.9%
  94                   15%      8.4%
  189                  10%      6.2%

  • 256 simultaneous failures
  • Current C/R production code exhibits a 31% overhead with 6 failures a day

[Plot: recovery with 256 simultaneous failures across core counts, comparing Log2Phase and ERA]
SLIDE 41

MiniFE

  • Part of the Mantevo mini-app suite
  • Exhibits the computation and communication patterns of parallel finite element analysis
  • Integrated with LFLR (Local Failure Local Recovery)
  • Uses ULFM to allow on-line application recovery via an abstraction of data recovery, multiple options for persistent storage, and an active pool of spare nodes (removing the RM from the critical path)

[Plot: execution time (in seconds) vs. number of processes (512, 1024, 2048), comparing the Log2Phase and ERA agreement algorithms]
SLIDE 42

Transactional model

  • Amin Hassani’s PhD (UAB) focuses on designing a transactional model for MPI (PhD advisor: Anthony Skjellum)
  • The core concepts of FA-MPI are implemented on top of ULFM MPI
  • Provides a higher-level abstraction than ULFM (but more targeted to a particular programming style)

Algorithm 1: a basic application using FA-MPI

    communication initialization;
    if restarted then load data from last checkpoint (optional); end
    repeat
        while more work to do do
            MPI_TryBlock_start();
            computation, communication and/or I/O;
            wait for operations to finish;
            inject local errors;
            MPI_TryBlock_finish();
            if failure happened then
                isolate and mitigate the failure;
                if recovery needed then break; end
            end
            periodically checkpoint;
        end
        if recovery needed then do recovery procedure; end
    until no more work to do or restart needed;

SLIDE 43

Resilience in task-based paradigms

  • Focus on data dependencies, data flows, and tasks
  • Don’t develop for an architecture, but for a portability layer
  • Let the runtime deal with the hardware characteristics
  • But provide as much user control as possible
  • StarSS, StarPU, Swift, ParalleX, QUARK, Kaapi, DuctTeip, …, and PaRSEC

[Diagram: the application sits on a runtime providing data distribution, scheduling, communication, a memory manager and a heterogeneity manager]
SLIDE 44

Concepts

  • Clear separation of concerns: the compiler optimizes each task, the developer describes the dependencies between tasks, and the runtime orchestrates the dynamic execution
  • Interface with application developers through specialized domain specific languages (PTG, insert_task, fork/join, …)
  • Separate algorithms from data distribution
  • Make control-flow executions a relic

Runtime
  • Permeable portability layer for heterogeneous architectures
  • Scheduling policies adapt every execution to the hardware & ongoing system status
  • Data movements between consumers are inferred from the dependencies; communication/computation overlap unfolds naturally
  • Coherency protocols minimize data movements
  • Memory hierarchies (including NVRAM and disk) are an integral part of the scheduling decisions

PaRSEC: a generic framework for architecture-aware scheduling of micro-tasks on distributed many-core heterogeneous architectures.
SLIDE 45

Motivation: Goal

  • Failure model
  • Soft error (Silent Data Corruption): bit-flips in disk, memory or processor registers
  • Here we focus on soft errors happening during computation
  • Resilience / fault tolerance in a dynamic task-based runtime
  • Implemented in PaRSEC, the runtime system for PLASMA
  • Two levels of granularity and three mechanisms, looking into the DAG and the task
  • Case study on the Cholesky factorization
SLIDE 46

Introduction to PaRSEC

  • Application representation: user’s view vs. runtime’s view

    FOR k = 0 .. SIZE - 1
        A[k][k], T[k][k] <- GEQRT( A[k][k] )
        FOR m = k+1 .. SIZE - 1
            A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )
        FOR n = k+1 .. SIZE - 1
            A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )
            FOR m = k+1 .. SIZE - 1
                A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )

[Figure: task DAG of a tiled factorization (POTRF, TRSM, SYRK, GEMM) over tiles A00..A33, with a silent error striking one task and corrupting the final result]
SLIDE 47

Detection & Local Correction through ABFT Strategy

  • Implementation: (1) attach 2 checksum vectors to the original data; (2) provide a recovery scheme inside the task; (3) continue with the DAG execution

[Figure: input tile with attached checksum vectors]
SLIDE 48

Detection & Local Correction through ABFT Strategy

  • Checksum invariance
  • C1 = A·g1 and C2 = A·g2, with g1 = (1,1,...,1)^T and g2 = (1,2,...,n)^T
  • The checksums are carried through the matrix operations (BLAS, LU, QR, etc.) and updated along with the data
  • (*Because of round-off errors, a small tolerance is allowed)

  • Single-bit detection & correction
  • After an update, suppose a single bit error of magnitude γ happened at A(i,j); then:
  • Σ_{k=1..n} A(k,j) − C1(j) = γ   ⇒ the error is in column j
  • Σ_{k=1..n} k·A(k,j) − C2(j) = i·γ   ⇒ the error is in row i
  • Adding the difference recovers the entry: A'(i,j) = A(i,j) − γ
SLIDE 49

Analysis of Overhead

  • 2. A single bit flip in a task
  • 1) Apply the ABFT method (avoid re-execution): attach 2 checksum vectors to every tile; the tile size is NB × NB

[Figure: input tile with attached checksum vectors]

Overhead (time):
  • Maintaining checksums: (1 + 2/NB)^3 − 1
  • Detecting & correcting an error: 1/NB
  • Total: (1 + 2/NB)^3 − 1 + 1/NB

Overhead (storage): 2/NB
SLIDE 50

Experiment Platform

  • Machine: Titan at ORNL
  • CPU: AMD Opteron™ 6274 (Interlagos), 16 cores, 8 FPUs
  • We use 8 cores per CPU, ensuring 1 FPU per core (no GPUs)
  • Weak scaling experiments: a 6k × 6k matrix distributed on 1 node, run up to 256 nodes
SLIDES 51–53

  • Experiment 1: Single Bit Flip. ABFT Correction
SLIDE 54

Experiment Assumed a Single Bit Flip, but…

  • In a matrix computation an error may propagate.
  • In this case we need to save inputs and restart the task.

[Figure: original errors and errors due to propagation across the tiles]

SLIDES 55–57

  • Experiment 2: Save Task’s Inputs Locally and Restart Task
SLIDE 58

Sub-DAG & Periodic Checkpoint Strategy

  • Checkpointing intermediate data limits the number of re-executions
  • With a checkpoint interval β, a process saves a copy of each data every β updates
  • The input of a failed task is either:
  • the same tile, checkpointed at most β updates ago, or
  • the final output of another task (validated)
  • The maximum number of re-executions is β for factorizations

[Figure: β = 2 example]

SLIDES 59–61

  • Experiment 3: Checkpoint every 10 updates
SLIDE 62

Application Recovery Patterns

SLIDE 63

Conclusion

  • Checkpoint is the rich man’s solution ($ goes into hardware with a short life span)
  • It’s a life-changing solution: once taken, it will be the only path forward and will require more and more hardware support
  • Alternative solutions are possible, and support from the major programming paradigms is production-quality ready