Resilience for Multigrid Software at the Extreme Scale
Markus Huber
joint work with: Björn Gmeiner, Lorenz John, Ulrich Rüde, Barbara Wohlmuth
huber@ma.tum.de, Technische Universität München, Germany
January 25-27, 2016, SPPEXA
0 Outline
1 Terraneo: An Exa-scale Mantle Convection Framework
Stabilized equal-order discretization [Hughes 1986], [Brezzi, Douglas 1988], with element-wise stabilization terms of the form
\[
\sum_T \delta_T \,(\nabla p_h, \nabla q_h)_T
\qquad \text{and} \qquad
\sum_T \delta_T \,(f, \nabla q_h)_T .
\]
Apply an inexact Uzawa smoother
\[
u^{k+1} = u^k + \hat{A}^{-1}\bigl(f - A u^k - B^\top p^k\bigr),
\qquad
p^{k+1} = p^k + \hat{S}^{-1}\bigl(B u^{k+1} - C p^k - g\bigr).
\]
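The two update steps above can be sketched for a generic saddle point system. The following is a minimal NumPy sketch, not the HHG implementation: the callables `A_hat_inv` and `S_hat_inv` stand in for the approximate inverses of A and of the Schur complement, and the small test problem at the bottom is hypothetical.

```python
import numpy as np

def inexact_uzawa(A, B, C, f, g, A_hat_inv, S_hat_inv, iters=500):
    """Inexact Uzawa iteration for the saddle point system
        A u + B^T p = f,   B u - C p = g.
    A_hat_inv / S_hat_inv: cheap approximations of A^-1 and of the
    inverse Schur complement (B A^-1 B^T + C)^-1."""
    u = np.zeros(A.shape[0])
    p = np.zeros(C.shape[0])
    for _ in range(iters):
        u = u + A_hat_inv(f - A @ u - B.T @ p)   # velocity update
        p = p + S_hat_inv(B @ u - C @ p - g)     # pressure update
    return u, p

# Hypothetical small test problem: SPD A, stabilization block C = 0.1*I.
rng = np.random.default_rng(0)
n, m = 8, 3
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
B = rng.standard_normal((m, n))
C = 0.1 * np.eye(m)
f = rng.standard_normal(n)
g = rng.standard_normal(m)
A_inv = np.linalg.inv(A)                   # exact A-solve, only for the sketch
S = B @ A_inv @ B.T + C
omega = 1.0 / np.linalg.norm(S, 2)         # damped Richardson on the Schur complement
u, p = inexact_uzawa(A, B, C, f, g, lambda r: A_inv @ r, lambda r: omega * r)
```

With the exact A-solve the pressure update reduces to a damped Richardson iteration on the Schur complement, so both residuals contract; in the multigrid setting the approximate inverses are a few smoothing steps instead.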
 Nodes    Threads    DoFs          iter   time
     5         80    2.7 · 10^9     10    617.28
    40        640    2.1 · 10^10    10    703.69
   320      5 120    1.2 · 10^11    10    741.86
 2 560     40 960    1.7 · 10^12     9    720.24
20 480    327 680    1.1 · 10^13     9    776.09
2 Building a Fault Tolerant Multigrid Solver
"The problem of building reliable systems out of unreliable components did preoccupy the first generation of computing system designers (see, e.g., Von Neumann, 1956), as first generation computers were very failure prone." [Cappello et al. 2009]
[Dongarra et al. 2015]
[Bergen, Rüde et al. 2002, Gmeiner 2014]
Figure: residual history (α = log(Residual)).
Coupled system after a fault (healthy subdomain Ω_I, faulty subdomain Ω_F, interface Γ):
\[
\begin{pmatrix}
A_{II} & A_{I\Gamma_I} & & & \\
& \mathrm{Id} & -\mathrm{Id} & & \\
A_{\Gamma I} & & A_{\Gamma\Gamma} & & A_{\Gamma F} \\
& & -\mathrm{Id} & \mathrm{Id} & \\
& & & A_{F\Gamma_F} & A_{FF}
\end{pmatrix}
\begin{pmatrix} u_I \\ u_{\Gamma_I} \\ u_{\Gamma} \\ u_{\Gamma_F} \\ u_F \end{pmatrix}
\]
Formulation with a Lagrange multiplier \(\lambda_{\Gamma_I}\) on the interface copy:
\[
\begin{pmatrix}
A_{II} & A_{I\Gamma_I} & & & & \\
A_{\Gamma_I I} & A_{\Gamma_I\Gamma_I} & \mathrm{Id} & & & \\
& \mathrm{Id} & & -\mathrm{Id} & & \\
A_{\Gamma I} & & & A_{\Gamma\Gamma} & & A_{\Gamma F} \\
& & & -\mathrm{Id} & \mathrm{Id} & \\
& & & & A_{F\Gamma_F} & A_{FF}
\end{pmatrix}
\begin{pmatrix} u_I \\ u_{\Gamma_I} \\ \lambda_{\Gamma_I} \\ u_{\Gamma} \\ u_{\Gamma_F} \\ u_F \end{pmatrix}
\]
1: Solve Au = f by multigrid cycles.
2: if a fault has occurred then
3:    STOP solving.
4:    Recover the boundary data u_{Γ_F} from line 4 of (1).
5:    Initialize u_F with zero.
6:    In parallel do:
7:       a) Use n_F MG cycles, accelerated by η_s, to approximate line 5 of (1):
8:          A_{FF} u_F = f_F − A_{FΓ_F} u_{Γ_F}
9:       b) Use n_I MG cycles to approximate line 1 of (1):
10:         A_{II} u_I = f_I − A_{IΓ_I} u_{Γ_I}
11:   RETURN to line 1 with the new values u_I in Ω_I and u_F in Ω_F.
12: end if

with the blocked system
\[
\begin{pmatrix}
A_{II} & A_{I\Gamma_I} & & & \\
& \mathrm{Id} & -\mathrm{Id} & & \\
A_{\Gamma I} & & A_{\Gamma\Gamma} & & A_{\Gamma F} \\
& & -\mathrm{Id} & \mathrm{Id} & \\
& & & A_{F\Gamma_F} & A_{FF}
\end{pmatrix}
\begin{pmatrix} u_I \\ u_{\Gamma_I} \\ u_{\Gamma} \\ u_{\Gamma_F} \\ u_F \end{pmatrix}
\tag{1}
\]
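The recovery steps can be illustrated on a single sparse system: after the fault, the healthy values act as Dirichlet data for a local subdomain solve. A minimal NumPy sketch, where the direct subdomain solve merely stands in for the η_s-accelerated MG cycles and the 1D Poisson problem is an illustrative assumption:

```python
import numpy as np

def local_recovery(A, f, u, faulty):
    """Recover unknowns lost in the faulty subdomain Omega_F by solving
        A_FF u_F = f_F - A_{F,Gamma_F} u_{Gamma_F}
    with the healthy values (including the interface data) frozen.
    `faulty` is a boolean mask marking the lost unknowns."""
    F = np.flatnonzero(faulty)            # lost unknowns
    H = np.flatnonzero(~faulty)           # healthy unknowns (Dirichlet data)
    u = u.copy()
    u[F] = 0.0                            # initialize u_F with zero
    rhs = f[F] - A[np.ix_(F, H)] @ u[H]   # move known boundary data to the RHS
    u[F] = np.linalg.solve(A[np.ix_(F, F)], rhs)  # stand-in for n_F MG cycles
    return u

# 1D Poisson model problem: tridiag(-1, 2, -1).
n = 32
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f = np.ones(n)
u_star = np.linalg.solve(A, f)            # converged solution before the fault
faulty = np.zeros(n, dtype=bool)
faulty[10:21] = True                      # fault wipes these unknowns
u_rec = local_recovery(A, f, u_star, faulty)
```

Because the surrounding values were already converged, the Dirichlet subdomain solve reproduces the lost part of the solution exactly; in the iterative setting the same structure is approximated by a few accelerated MG cycles while the healthy subdomain keeps iterating in parallel.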
Define the recovery quality
\[
\kappa := \frac{k_R - k}{k_F} \in [0, 1],
\]
where k and k_R are the required numbers of iterations without and with a fault, n_I is the number of MG cycles on the healthy subdomain, and η_s · n_I the number of MG cycles on the faulty subdomain.

Fault at k_F = 5 and speedup η_s = 2:

       17% loss     2% loss      0.6% loss
n_I    DD    DN     DD    DN     DD    DN
       0.80  0.80   0.80  0.80   0.80  0.80
1      0.20  0.00   0.20  0.20   0.20  0.00
2      0.20  0.00   0.20  0.00   0.00  0.00
3      0.40  0.40   0.40  0.20   0.20  0.00
4      0.60  0.60   0.60  0.40   0.40  0.20

Fault at k_F = 11 and speedup η_s = 5:

n_I    DD    DN     DD    DN     DD    DN
       0.82  0.82   0.82  0.82   0.91  0.91
1      0.36  0.36   0.27  0.27   0.27  0.27
2      0.09  0.00   0.09  0.00   0.00  0.00
3      0.18  0.09   0.27  0.09   0.09  0.09
4      0.27  0.18   0.36  0.18   0.18  0.18
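The quantity κ is a plain ratio of iteration counts; a minimal helper, assuming the definition above (the example numbers are illustrative, matching a table entry):

```python
def recovery_quality(k, k_R, k_F):
    """kappa = (k_R - k) / k_F: extra iterations caused by the fault,
    normalized by the iteration k_F at which the fault struck."""
    return (k_R - k) / k_F

# Example: a fault-free run needs k = 10 iterations; with a fault at
# k_F = 5 and recovery, k_R = 11 iterations are needed -> kappa = 0.2.
kappa = recovery_quality(10, 11, 5)
```

Smaller κ is better: κ = 0 means the recovered run needs no extra iterations at all.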
Size         No Rec       | η_s=1   η_s=2   η_s=4   η_s=8 | η_s=1   η_s=2   η_s=4   η_s=8
4.5 · 10^8   13.73 (21)   |  9.14    0.00     –       –   | 11.47    2.30    0.02    0.04
2.1 · 10^9   11.69 (20)   |  9.31    0.04    0.08    0.11 |  9.35    2.41    0.11    0.14
1.2 · 10^10  12.49 (20)   |  7.42     –       –       –   |  9.96    2.54    0.06    0.06
8.2 · 10^10  11.16 (19)   |  5.54    0.08    0.07    0.04 |  8.36    0.11    0.15    0.17
6.0 · 10^11  13.59 (19)   |  3.47    0.13    0.19    0.13 |  0.13    0.24    0.29    0.26

Size         No Rec       (1,2)   (1,3)   (2,2)   (2,3)
4.5 · 10^8   18.35 (23)   0.02    0.03    0.03    0.04
2.1 · 10^9   16.33 (22)   0.05    0.06    0.06    0.06
1.2 · 10^10  17.43 (22)   0.07    0.08    0.09    0.08
8.2 · 10^10  16.69 (21)   0.16    0.17    0.16    0.17
6.0 · 10^11  20.64 (21)   0.30    0.33    0.36    0.36
3 Towards Geophysical Applications
[Zhong, Gurnis et al. 2008]
[Williams, Müller, Landgrebe, Whittaker 2012]
4 Conclusion
5 References
[1] E. Agullo, L. Giraud, A. Guermouche, J. Roman, and M. Zounon. Towards resilient parallel linear Krylov solvers: recover-restart strategies. Research Report RR-8324, July 2013.
[2] B. Bergen, T. Gradl, U. Rüde, and F. Hülsemann.
[3] F. Cappello, A. Geist, S. Kale, B. Kramer, and M. Snir. Toward exascale resilience: 2014 update.
[4] J. Dongarra, T. Herault, and Y. Robert. Fault-Tolerance Techniques for High-Performance Computing.
[5] B. Gmeiner, M. Huber, L. John, U. Rüde, and B. Wohlmuth. A quantitative performance analysis for Stokes solvers at the extreme scale. Submitted, arXiv:1511.02134.
[6] D. Göddeke et al. Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing. Parallel Comput., 49:117–135, 2015.
[7] M. Huber, B. Gmeiner, U. Rüde, and B. Wohlmuth. Resilience for Exascale Enabled Multigrid Methods.
[8] J. Langou, Z. Chen, G. Bosilca, and J. Dongarra. Recovery patterns for iterative methods in a parallel unstable environment. SIAM J. Sci. Comput., 30(1):102–116, 2008.
[9] J. Schöberl and W. Zulehner. On Schwarz-type smoothers for saddle point problems. Numer. Math., 95(2):377–399, 2003.