Practical foundations for resilient applications
George Bosilca
Algorithms and Scheduling Techniques to Manage Resilience and Power – Dagstuhl 2015
Failures are bad for business
In HPC: today, 20% or more of the computing capacity
Rollback Recovery
(uncoordinated checkpoint, message logging, correlated sets), increasing the MTBF (hardware) and decreasing the overheads (buddy checkpointing, NVRAM)
Forward Recovery
algorithms
[Figure: master-worker timeline — Master, Worker0, Worker1, Worker2 over time]
[Figure: ABFT factorization — protection blocks; panels factorized in previous iterations; trailing matrix and protection updated by applying the same operations; current panel factorization]
ABFT
Coordinated checkpoint (with blocking, constant checkpoints)
PurePeriodicCkpt
[Figure: PurePeriodicCkpt — three processes (Application over Library), one optimal checkpoint interval]
BiPeriodicCkpt
[Figure: BiPeriodicCkpt — three processes (Application over Library), with a LIBRARY checkpoint interval nested inside a GENERAL checkpoint interval]
Young/Daly optimal checkpoint periods, with μ the platform MTBF, C the full checkpoint cost, C_L the library-phase checkpoint cost, D the downtime and R the recovery cost:

P_PC^opt = sqrt( 2 C (μ − D − R) )
P_BPC,G^opt = sqrt( 2 C (μ − D − R) )
P_BPC,L^opt = sqrt( 2 C_L (μ − D − R) )
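These periods are easy to evaluate numerically. A minimal sketch of the Young/Daly period and the resulting first-order waste (the waste model used here is the standard C/P + (D + R + P/2)/μ approximation, not taken from this slide; the parameter values are illustrative):

```python
import math

def p_opt(C, mu, D=0.0, R=0.0):
    """Young/Daly-style optimal checkpoint period: sqrt(2*C*(mu - D - R))."""
    return math.sqrt(2 * C * (mu - D - R))

def waste(P, C, mu, D=0.0, R=0.0):
    """First-order waste: checkpoint overhead plus expected rework per failure."""
    return C / P + (D + R + P / 2) / mu

mu = 86400.0   # platform MTBF: 1 day (illustrative)
C = 60.0       # full checkpoint cost: 1 minute (illustrative)
P = p_opt(C, mu)
print(round(P), round(waste(P, C, mu), 4))   # period in seconds, wasted fraction
```

At the optimum the two waste terms (checkpointing and re-execution) are equal, which is why halving C improves the period only by a factor of √2.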
[Plot: number of faults and waste vs. number of nodes (1k–1M) for PeriodicCkpt and Bi-PeriodicCkpt]
Assessing the impact of ABFT & Checkpoint composite strategies, George Bosilca, Aurelien Bouteiller, Thomas Herault, Yves Robert and Jack Dongarra, International Journal of Networking and Computing, ISSN 2185-2847
Evolutionary platforms design
Memory per component remains constant; problem size increases in O(√n)
Assuming an MTBF of 10k nodes at 1 day, scaled in O(1/n)
Checkpoint (and restart) cost at 10k nodes: 1 minute, scaled in O(n)
80% of each iteration is spent in the ABFT algorithm, modifying 80% of the data
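Under these scaling assumptions, the platform parameters at any machine size follow directly from the 10k-node baseline quoted above. A small sketch (the function name is mine):

```python
def platform_params(n_nodes, mtbf_10k_s=86400.0, ckpt_10k_s=60.0):
    """Scale MTBF in O(1/n) and checkpoint cost in O(n) from a 10k-node baseline."""
    mu = mtbf_10k_s * 10_000 / n_nodes   # more nodes => proportionally more failures
    C = ckpt_10k_s * n_nodes / 10_000    # more nodes => more memory to dump
    return mu, C

for n in (1_000, 10_000, 100_000, 1_000_000):
    mu, C = platform_params(n)
    print(f"{n:>9} nodes: MTBF {mu:8.0f} s, checkpoint {C:7.0f} s")
```

At 1M nodes the MTBF falls below the checkpoint cost, which is what drives the waste curves toward 100%.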
Memory per component remains constant; problem size increases in O(√n)
Assuming an MTBF of 10k nodes at 1 day, scaled in O(1/n)
Checkpoint (and restart) cost at 10k nodes: 1 minute, scaled in O(1)
O(n³) vs. O(n²) of each iteration is spent in the ABFT algorithm, modifying 80% of the data
[Plot: number of faults and waste vs. number of nodes (1k α=0.55, 10k α=0.8, 100k α=0.92, 1M α=0.975) for PeriodicCkpt and Bi-PeriodicCkpt]
Revolutionary platforms design
This situation can be improved by moving investment away from hardware (more I/O bandwidth, future technologies such as NVRAM, increased component MTBF) and into software and developers.
Replication (the only system-level Forward Recovery)
… with simple resubmission
Iterative methods, naturally fault-tolerant algorithms
Algorithm Based Fault Tolerance
Requires that MPI continues to operate across failures
Extend the MPI communication infrastructure to integrate faults as a first-class citizen of the message-passing concepts
with them even in a distributed setting
MPI specification timeline — 1994: v1.0 · 1995: v1.1 · 1997: v1.2, v2.0 · 2008: v1.3, v2.1 · 2009: v2.2 · 2012: v3.0 · 2015: v3.1 · 2016–: v4.0
No recovery model imposed or favored
The application drives the recovery: it pays only for the level of protection it needs
Recovery can be restricted to subgroups for scalability
User Level Failure Mitigation: a set of MPI extensions to enable MPI programs to restore MPI communication capabilities disabled by failures
targeted process fails
process
MPI_COMM_FAILURE_GET_ACKED()
[Figure: master-worker resubmission — Master sends T1 to W1; on Recv(ANY) the failure of W1 is detected and T1 is resubmitted to W2]
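The resubmission pattern in the figure can be sketched without MPI. In the simulation below the names are mine and failures are injected rather than detected through MPI; the point is only the control flow of the master:

```python
def run_tasks(tasks, workers, failed):
    """Assign each task to the first live worker; resubmit when a worker is dead."""
    results, log = {}, []
    pool = list(workers)
    for t in tasks:
        while pool:
            w = pool[0]
            if w in failed:                      # master detects the failure on Recv
                log.append(f"detected {w}, resubmit {t}")
                pool.pop(0)                      # drop the dead worker
                continue
            results[t] = w                       # task completed by worker w
            break
    return results, log

results, log = run_tasks(["T1"], ["W1", "W2"], failed={"W1"})
print(results, log)
```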
recovery functions, duh)
[Figure: MPI_COMM_REVOKE propagation — a failed Recv triggers Revoke; pending Recv/Send on the other ranks complete with the Revoked error]
[Figure: P1 fails; P2's Recv(P1) raises a failure and P2 calls Revoke; on P3…Pn the pending Recv(P1) return Revoked and all ranks enter recovery]
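A toy model of the revoke/shrink semantics (pure Python, not the ULFM C API; the class and method names are illustrative):

```python
class Comm:
    """Toy communicator with ULFM-like revoke/shrink semantics (not the MPI API)."""
    def __init__(self, ranks):
        self.ranks, self.revoked = set(ranks), False

    def revoke(self):
        self.revoked = True                    # all pending and future ops now fail

    def recv(self, src):
        if self.revoked:
            raise RuntimeError("revoked")      # pulls every rank out of blocking calls
        return f"msg from {src}"

    def shrink(self, failed):
        return Comm(self.ranks - set(failed))  # fresh communicator without the dead

comm = Comm({0, 1, 2, 3})
comm.revoke()                                  # e.g. P2 noticed P1's failure
try:
    comm.recv(1)
except RuntimeError as e:
    print("recv:", e)
new_comm = comm.shrink({1})
print(sorted(new_comm.ranks), new_comm.recv(0))
```

The key property mimicked here: once revoked, the old communicator is unusable everywhere, so no rank can deadlock waiting on a dead peer; work resumes on the shrunken communicator.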
[Figure: a failed Bcast is followed by Shrink; Bcast then continues on the shrunken communicator]
Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery, A. Bouteiller, G. Bosilca and J. Dongarra, EuroMPI 2015
this case
consensus is reached locally
failures (underlying topology remains efficient)
Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems
update checksums) and extra storage
machine scale (divided by P when using PxQ processes)
for protection, at scale (as observed on Kraken)
[Plot: performance (TFlop/s) and relative overhead (%) vs. P×Q grid and matrix size N (6x6; 20k up to 192x192; 640k) — ScaLAPACK PDGETRF vs. FT-PDGETRF (no errors)]
[Figure: matrix A with checksum C, factorized into L and U]
Mathematical alterations
[Plot: performance (TFlop/s) and relative overhead (%) vs. P×Q grid and matrix size N (6x6; 20k up to 48x48; 160k) — ScaLAPACK PDGEQRF vs. FT-PDGEQRF with no errors and with one error]
FTLA-QR: performance with one failure recovery
Kraken (24x24) using LUSTRE
[Plot: performance (TFlop/s) vs. matrix size N (20k–100k) — ScaLAPACK, ABFT QR (w/o failure), ABFT QR (w/1 CoF recovery)]
ABFT QR without failures has performance identical to the CoF-enabled version, since in the absence of faults no checkpoint is taken.
when a fault is detected
(identical to ABFT)
long as the next allocation covers the same resources.
the next run
[Figure: ABFT & PeriodicCkpt — three processes (Application over Library); the periodic checkpoint interval is split by forced checkpoints around the library phases]
[Plot: number of faults and waste vs. number of nodes (1k–1M) — PeriodicCkpt, Bi-PeriodicCkpt, ABFT PeriodicCkpt]
[Plot: number of faults and waste vs. number of nodes (1k α=0.55, 10k α=0.8, 100k α=0.92, 1M α=0.975) — PeriodicCkpt, Bi-PeriodicCkpt, ABFT PeriodicCkpt]
restart with Shrink
framework
Freezing and Collapse
with group checkpoints
restarts from checkpoint, the other distant groups continue undisturbed
PPStee (Mesh, automotive), HemeLB (Lattice Boltzmann)
BLACS repair constructs (spawn new processes with MPI_COMM_SPAWN), and re-enters FTQR to resume (ABFT recovery embedded)
(that contains spares) to recreate the same domain decomposition, restart MC with same rank mapping as before
Figure 5. Results of the FT-MLMC implementation for three different failure scenarios: (a) failure-free, (b) few failures, (c) many failures.
Credits: ETH Zurich
Marc Gamell¹, Daniel S. Katz², Hemanth Kolla³, Jacqueline Chen³, Scott Klasky⁴, Manish Parashar¹
¹ Rutgers Discovery Informatics Institute (RDI2), Rutgers University; ² University of Chicago & Argonne National Laboratory; ³ Sandia National Laboratories; ⁴ Oak Ridge National Laboratory
Fenix: Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales
Key contributions
[Figure: software stack — O.S., MPI runtime, Fenix, App + libraries]
Approach
from process, node, blade and cabinet failure
implicitly coordinated, local checkpoints
Fenix
Experimental Evaluation
Implementation details
– 8192 cores w/ failures – 250k cores w/o failures
interfaces
Fenix – Recovery Stages
[Figure: Fenix recovery stages — recovery of the process pool, communicator and user comms, based on the ULFM interface]
Conclusions:
checkpointing, O(0.1s)
failures, as exascale compels
seconds, the total job run-time penalty is 10%, 15% and 31%, respectively
tolerance cost is 31%!
ULFM shrink
Fault Interval (s)   Before   After
47                   31%      17.9%
94                   15%      8.4%
189                  10%      6.2%
[Plot: agreement time vs. number of cores (5–25) — Log2Phase vs. ERA]
application recovery via abstraction
persistent storage, active pool of spare nodes (removes the RM from the critical path)
[Plot: execution time (s) vs. number of processes (512, 1024, 2048) — Log2Phase vs. ERA]
designing a transactional model for MPI (PhD advisor Anthony Skjellum)
implemented on top of ULFM MPI
than ULFM (but more targeted to a particular programming style)
communication initialization;
if restarted then load data from last checkpoint (optional); end
repeat
    while more work to do do
        MPI_TryBlock_start();
        computation, communication and/or I/O;
        wait for operations to finish;
        inject local errors;
        MPI_TryBlock_finish();
        if failure happened then
            isolate and mitigate the failure;
            if recovery needed then break; end
        end
        periodically checkpoint;
    end
    if recovery needed then do recovery procedure; end
until more work to do or restart needed;

Algorithm 1: A basic application using FA-MPI
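Algorithm 1's transactional loop can be mimicked in plain Python, with an exception standing in for a failed try-block. The fault injection and checkpointing below are simulated, not FA-MPI calls:

```python
import random
random.seed(4)  # deterministic fault injection for the sketch

def run(work, ckpt_every=2):
    """Transactional main loop: each iteration is a try-block; roll back on failure."""
    done, last_ckpt = 0, 0
    while done < work:
        try:
            if random.random() < 0.3:        # inject local errors
                raise RuntimeError("fault inside try-block")
            done += 1                        # computation, communication and/or I/O
            if done % ckpt_every == 0:
                last_ckpt = done             # periodic checkpoint
        except RuntimeError:
            done = last_ckpt                 # recovery: restart from last checkpoint
    return done

print(run(6))
```

The structure matches the algorithm: failures surface only at the end of a try-block, and recovery rolls the state back to the last checkpoint instead of aborting the job.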
[Figure: PaRSEC runtime architecture — App over Runtime, with Data Distribution, Scheduling, Communication, Memory Manager and Heterogeneity Manager]
each task; developers describe dependencies between tasks, and the runtime orchestrates the dynamic execution
specialized domain specific languages (PTG, insert_task, fork/join, …)
heterogeneous architectures
execution to the hardware &
consumers are inferred from
computations overlap naturally unfold
data movements
NVRAM and disk) integral part of the scheduling decisions
PaRSEC: generic framework for architecture-aware scheduling of micro-tasks
heterogeneous architectures
error (Silent Data Corruption): bit-flips in disk, memory or processor registers
task.
FOR k = 0 .. SIZE - 1
    A[k][k], T[k][k] <- GEQRT( A[k][k] )
    FOR m = k+1 .. SIZE - 1
        A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )
    FOR n = k+1 .. SIZE - 1
        A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )
        FOR m = k+1 .. SIZE - 1
            A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
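The loop nest implicitly defines the task graph. Enumerating it for a small tile count shows what the runtime discovers; this is a sketch, and `qr_tasks` is my name, not a PaRSEC API:

```python
def qr_tasks(SIZE):
    """Enumerate the tasks generated by the tiled-QR loop nest above."""
    tasks = []
    for k in range(SIZE):
        tasks.append(("GEQRT", k, k))            # panel factorization
        for m in range(k + 1, SIZE):
            tasks.append(("TSQRT", m, k))        # eliminate tiles below the diagonal
        for n in range(k + 1, SIZE):
            tasks.append(("UNMQR", k, n))        # apply reflectors to the panel row
            for m in range(k + 1, SIZE):
                tasks.append(("TSMQR", m, n))    # trailing-matrix update
    return tasks

print(len(qr_tasks(2)), qr_tasks(2))
```

The runtime's view is exactly this set of tasks plus the data dependencies between them, from which producers and consumers are inferred.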
User’s view Runtime’s view
[Figure: Cholesky task DAG — POTRF, TRSM, SYRK, GEMM over tiles A00…A33, leading to the final result]
Silent error
(1) Attaching 2 checksum vectors to the original data; (2) Provide recovery scheme inside task; (3) Continue with the DAG execution
[Figure: input file with checksum vector]
C1 = A g1, C2 = A g2, with g1 = (1,1,...,1)^T and g2 = (1,2,...,n)^T (*because of round-off errors, a small tolerance is allowed)
A mismatch against C1 => error is in column j; against C2 => error is in row i; adding the difference recovers the value
[Figure: a matrix operation (BLAS, LU, QR, etc.) updates the data and, in the same sweep, updates Checksum1 C1 and Checksum2 C2]
Σ_{k=1..n} A(k,j) − C1(j) = γ   => error is in column j
Σ_{k=1..n} k·A(k,j) − C2(j) = i·γ   => error is in row i
A'(i,j) = A(i,j) − γ   => adding the difference γ recovers A(i,j)
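The detection and recovery equations above can be exercised on a tiny matrix. A pure-Python sketch, assuming a single corrupted entry (the function names are mine):

```python
def checksums(A):
    """Column checksums with weights g1 = (1,...,1) and g2 = (1,...,n), rows 1-based."""
    n = len(A)
    c1 = [sum(A[k][j] for k in range(n)) for j in range(n)]
    c2 = [sum((k + 1) * A[k][j] for k in range(n)) for j in range(n)]
    return c1, c2

def locate_and_fix(A, c1, c2, tol=1e-9):
    """Locate a single corrupted entry via the two checksums and undo the difference."""
    n = len(A)
    for j in range(n):
        gamma = sum(A[k][j] for k in range(n)) - c1[j]
        if abs(gamma) > tol:                           # error is in column j
            s2 = sum((k + 1) * A[k][j] for k in range(n)) - c2[j]
            i = round(s2 / gamma) - 1                  # error is in row i
            A[i][j] -= gamma                           # add back the difference
            return i, j
    return None

A = [[1.0, 2.0], [3.0, 4.0]]
c1, c2 = checksums(A)
A[1][0] += 5.0                                         # inject a silent error
print(locate_and_fix(A, c1, c2), A)
```

The ratio of the two weighted mismatches isolates the row index, which is why two independent checksum vectors suffice to both detect and correct one error.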
1) Applying ABFT method (avoid
Overhead (time) => (1 + 2/NB)³ − 1 + 1/NB
Overhead (storage) => 2/NB
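Plugging in a block size shows how quickly these overheads shrink with NB. This is a direct evaluation of the two formulas above; NB = 200 is an arbitrary example value:

```python
def abft_overhead(NB):
    """Time and storage overheads for 2 checksum columns per NB-column tile block."""
    storage = 2 / NB                         # extra columns relative to the data
    time = (1 + 2 / NB) ** 3 - 1 + 1 / NB    # extra O(n^3) flops + checksum generation
    return time, storage

t, s = abft_overhead(200)
print(round(t, 4), round(s, 4))
```

Both overheads vanish as O(1/NB), which is why ABFT protection becomes cheap for large tile sizes.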
[Figure: error propagation — original errors vs. errors due to propagation]
will require more and more hardware support