
Resilience for Multigrid Software at the Extreme Scale Markus Huber - PowerPoint PPT Presentation

Resilience for Multigrid Software at the Extreme Scale. Markus Huber, joint work with: Björn Gmeiner, Lorenz John, Ulrich Rüde, Barbara Wohlmuth. huber@ma.tum.de, Technische Universität München, Germany. January 25-27, 2016, SPPEXA


  1. Resilience for Multigrid Software at the Extreme Scale. Markus Huber (huber@ma.tum.de), Technische Universität München, Germany. Joint work with: Björn Gmeiner, Lorenz John, Ulrich Rüde, Barbara Wohlmuth. January 25-27, 2016, SPPEXA Symposium 2016

  2. Outline: Overview
     • Terraneo: An Exa-scale Mantle Convection Framework
       • Model problem
       • Ultra scalability
     • Building a Fault Tolerant Multigrid Solver
       • Challenges in exa-scale systems
       • Problem setting
       • Recovery strategies
       • Single fault scenarios
       • Multiple fault scenarios
     • Towards Geophysical Applications

  3. Section 1. Terraneo: An Exa-scale Mantle Convection Framework

  4. Stokes equations and equal order discretization
     Let $\Omega \subset \mathbb{R}^3$ with $\Gamma = \partial\Omega$:
     $$-\nu \Delta u + \nabla p = f \ \text{in } \Omega, \qquad \operatorname{div} u = 0 \ \text{in } \Omega, \qquad u = 0 \ \text{on } \Gamma.$$
     Equal order discretization ($P_1$–$P_1$) [Hughes 1986] [Brezzi, Douglas 1988]: find $(u_h, p_h) \in V_h \times Q_h$ such that
     $$a(u_h, v_h) + b(v_h, p_h) = f(v_h) \quad \forall v_h \in V_h,$$
     $$b(u_h, q_h) - c_h(q_h, p_h) = g_h(q_h) \quad \forall q_h \in Q_h,$$
     with the level-dependent stabilization terms
     $$c_h(q_h, p_h) = \sum_{T \in \mathcal{T}_h} \delta_T h_T^2 \langle \nabla p_h, \nabla q_h \rangle_T \quad\text{and}\quad g_h(q_h) = -\sum_{T \in \mathcal{T}_h} \delta_T h_T^2 \langle f, \nabla q_h \rangle_T.$$
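
As a reading aid, the discrete problem can be written in matrix form. This is a sketch that assumes the usual identification of the forms $a$, $b$, $c_h$ with matrices $A$, $B$, $C$ and of the right-hand sides with vectors $f$, $g$; the slides do not spell this step out.

```latex
\begin{pmatrix} A & B^\top \\ B & -C \end{pmatrix}
\begin{pmatrix} u \\ p \end{pmatrix}
=
\begin{pmatrix} f \\ g \end{pmatrix}
```

With this sign convention the two block residuals are $f - Au - B^\top p$ and $Bu - Cp - g$, which are the quantities that appear in the Uzawa smoother.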

  5. Numerical simulation to the extreme
     • Uzawa-type multigrid method [Bank, Welfert, Yserentant 90], [Schöberl, Zulehner 03]
       Apply an inexact Uzawa smoother:
       $$u_{k+1} = u_k + \hat{A}^{-1} (f - A u_k - B^\top p_k),$$
       $$p_{k+1} = p_k + \hat{S}^{-1} (B u_{k+1} - C p_k - g).$$
       Remark: For convergence we need $\hat{A} \ge A$ and $\hat{S} \ge C + B \hat{A}^{-1} B^\top$.
     • Scalability on a current peta-scale system (JUQUEEN)

       Nodes     Threads    DoFs           iter   time
            5         80    2.7 · 10^9      10    617.28
           40        640    2.1 · 10^10     10    703.69
          320      5 120    1.2 · 10^11     10    741.86
        2 560     40 960    1.7 · 10^12      9    720.24
       20 480    327 680    1.1 · 10^13      9    776.09
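
The inexact Uzawa update can be tried on a very small system. The following is a minimal sketch, not the solver from the talk: the 2+1 unknown system, the scalar choices `a_hat` and `s_hat`, and all helper names are hypothetical, and the update is run as a stand-alone iteration rather than as a smoother inside multigrid.

```python
def matvec(M, x):
    """Dense matrix-vector product for small lists of lists."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

# Hypothetical stabilized saddle-point system:
#   A u + B^T p = f,   B u - C p = g
A = [[2.0, -1.0], [-1.0, 2.0]]
B = [1.0, 1.0]            # one pressure unknown, so B is a single row
C = 1.0
u_exact, p_exact = [1.0, 2.0], 3.0
f = [au + b * p_exact for au, b in zip(matvec(A, u_exact), B)]
g = sum(b * ui for b, ui in zip(B, u_exact)) - C * p_exact

# Scalar approximations chosen to satisfy the conditions in the remark:
#   A_hat = 4*I >= A              (the eigenvalues of A are 1 and 3)
#   S_hat = 2  >= C + B A_hat^{-1} B^T = 1 + 2/4 = 1.5
a_hat, s_hat = 4.0, 2.0

def uzawa_step(u, p):
    """One inexact Uzawa update: velocity first, then pressure."""
    Au = matvec(A, u)
    u_new = [ui + (fi - aui - bi * p) / a_hat
             for ui, fi, aui, bi in zip(u, f, Au, B)]
    Bu = sum(bi * ui for bi, ui in zip(B, u_new))
    p_new = p + (Bu - C * p - g) / s_hat
    return u_new, p_new

u, p = [0.0, 0.0], 0.0
for _ in range(100):
    u, p = uzawa_step(u, p)
print(u, p)
```

Note that the pressure update already uses the new velocity $u_{k+1}$, as in the formula on the slide.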

  6. Mountain Climbing and Faults

  7. Section 2. Building a Fault Tolerant Multigrid Solver
     Resilience
     • Past: Reliability of systems was a big concern for computing pioneers.
       "The problem of building reliable systems out of unreliable components did preoccupy the first generation of computing system designers - see, e.g., Von Neumann, 1956, as first generation computers were very failure prone." [Cappello et al. 2009]
     • Present: Built-in system-level resilience. Hardware failure is of minor relevance for numerical simulation.
     • Future: Huge number of components in exa-scale systems. Algorithmic resilience will be of increasing importance for computational sciences [Dongarra et al. 2015].
     Storage of a vector of size $O(10^{13})$: 73 TBytes.
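
The storage figure can be checked with a short calculation. This sketch assumes one double-precision value (8 bytes) per unknown and reads "TBytes" as binary terabytes ($2^{40}$ bytes); neither assumption is stated on the slide.

```python
# Assumption: 8 bytes per unknown (double precision), 1 TByte = 2**40 bytes.
n_unknowns = 10**13
bytes_per_value = 8

total_bytes = n_unknowns * bytes_per_value
tib = total_bytes / 2**40
print(f"{tib:.1f} TBytes for one vector")
```

Under these assumptions a single solution vector already occupies roughly 73 TBytes, which is why writing full checkpoints becomes expensive at this scale.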

  8. Problem setting and fault model
     Model problem: $-\Delta u = f$ in $\Omega$, + BC
     • Discretized by the linear finite element method
     • Solved by multigrid V-cycles with standard components in the HPC framework Hierarchical Hybrid Grids [Bergen, Rüde et al. 2002, Gmeiner 2014]
     Node crash in the multigrid solver (MG):
     • Faulty domain: $u_F$ in $\Omega_F$
     • Interface: $u_\Gamma$ on $\Gamma$
     • Intact domain: $u_I$ in $\Omega_I$

  9. No fault recovery strategy within MG: from almost at the top back to the checkpoint level

  10. Comparison of local recovery strategies
      [Figure: residual plots at the 6th and 7th iteration; labels: fault, no recovery, local recovery, one F-cycle; plotted quantity $\alpha = \log(\|\text{Residual}\|)$]

  11. Local recovery strategy
      In case of a fault:
      • Fix the interface values $u_\Gamma$ on $\Gamma$
      • Recover the faulty values $u_F$ by solving $-\Delta u_F = f$ in $\Omega_F$ with $u_\Gamma$ as Dirichlet BC
      Possibilities for local recovery: smoother, CG iterations, multigrid cycles, direct solver, ...
      • Faulty domain: $u_F$ in $\Omega_F$
      • Interface: $u_\Gamma$ on $\Gamma$
      • Intact domain: $u_I$ in $\Omega_I$
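
The local recovery idea can be illustrated on a 1D finite-difference Poisson problem. This is a minimal sketch under simplifying assumptions, not the HHG implementation: the grid size, the faulty index range, and the helper `solve_dirichlet` are hypothetical, the fault is applied to an already converged solution, and the local solve uses a direct (Thomas) solver, one of the options listed on the slide.

```python
def solve_dirichlet(f, h, left, right):
    """Thomas algorithm for (-u[i-1] + 2*u[i] - u[i+1]) / h**2 = f[i],
    with Dirichlet values `left` and `right` just outside the index range."""
    n = len(f)
    rhs = [h * h * fi for fi in f]
    rhs[0] += left
    rhs[-1] += right
    diag = [2.0] * n
    for i in range(1, n):              # forward elimination
        w = -1.0 / diag[i - 1]
        diag[i] -= w * (-1.0)
        rhs[i] -= w * rhs[i - 1]
    u = [0.0] * n
    u[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):     # back substitution
        u[i] = (rhs[i] + u[i + 1]) / diag[i]
    return u

n = 63
h = 1.0 / (n + 1)
f = [1.0] * n
u = solve_dirichlet(f, h, 0.0, 0.0)    # stands in for the converged iterate
u_before = u[:]

lo, hi = 20, 30                        # hypothetical faulty index range [lo, hi)
for i in range(lo, hi):
    u[i] = 0.0                         # data lost in the fault

# Local recovery: the neighbours u[lo-1] and u[hi] play the role of the
# fixed interface values and enter as Dirichlet data.
u[lo:hi] = solve_dirichlet(f[lo:hi], h, u[lo - 1], u[hi])

err = max(abs(a - b) for a, b in zip(u, u_before))
print(err)
```

In this idealized setting the lost values are reproduced up to rounding. When the fault occurs in the middle of the iteration, the interface values are only approximations, so the recovered values can only be as accurate as that interface data.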

  12. Numerical results: fault and local recovery
      [Figures: fault and local recovery after the 5th iteration with a perfect superman; fault and local recovery after the 11th iteration with a perfect superman]
      Among the local recovery options, only MG cycles are efficient.

  13. Fault for the Stokes system
      Algorithmic strategy:
      • Fault in a multigrid algorithm with Uzawa-type smoother
      • Freeze velocity and pressure data at the interface
      • Locally recalculate the lost values with superman power
      [Figures: fault after the 5th (left) and 11th (right) iteration step]

  14. Optimal fault recovery strategy within MG: from almost at the top to the top without delay

  15. Data structure for the recovery
      [Figure: ghost layer primitives; stencil and sub-stencil structure with entries tp_mr, tp_br, mp_tr, mp_mr, mp_br, bp_tr, bp_mr]

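  16. Global recovery strategies based on tearing concepts
      Basic idea: coupling via halos on lower primitives
      • Dirichlet (faulty) – Dirichlet (healthy) strategy (DD)
        $$\begin{pmatrix} A_{II} & A_{I\Gamma_I} & 0 & 0 & 0 \\ 0 & \mathrm{Id} & -\mathrm{Id} & 0 & 0 \\ A_{\Gamma I} & 0 & A_{\Gamma\Gamma} & 0 & A_{\Gamma F} \\ 0 & 0 & -\mathrm{Id} & \mathrm{Id} & 0 \\ 0 & 0 & 0 & A_{F\Gamma_F} & A_{FF} \end{pmatrix} \begin{pmatrix} u_I \\ u_{\Gamma_I} \\ u_\Gamma \\ u_{\Gamma_F} \\ u_F \end{pmatrix}$$
      • Dirichlet (faulty) – Neumann (healthy) strategy (DN)
        $$\begin{pmatrix} A_{II} & A_{I\Gamma_I} & 0 & 0 & 0 & 0 \\ A_{\Gamma_I I} & A_{\Gamma_I\Gamma_I} & \mathrm{Id} & 0 & 0 & 0 \\ 0 & \mathrm{Id} & 0 & -\mathrm{Id} & 0 & 0 \\ A_{\Gamma I} & 0 & 0 & A_{\Gamma\Gamma} & 0 & A_{\Gamma F} \\ 0 & 0 & 0 & -\mathrm{Id} & \mathrm{Id} & 0 \\ 0 & 0 & 0 & 0 & A_{F\Gamma_F} & A_{FF} \end{pmatrix} \begin{pmatrix} u_I \\ u_{\Gamma_I} \\ \lambda_{\Gamma_I} \\ u_\Gamma \\ u_{\Gamma_F} \\ u_F \end{pmatrix}$$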
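
For comparison, the untorn block form of $Au = f$ is sketched below. It assumes the unknowns are ordered as intact, interface, faulty, and that $\Omega_I$ and $\Omega_F$ couple only through $\Gamma$; the right-hand side block $f_\Gamma$ is notation introduced here for illustration.

```latex
\begin{pmatrix}
A_{II} & A_{I\Gamma} & 0 \\
A_{\Gamma I} & A_{\Gamma\Gamma} & A_{\Gamma F} \\
0 & A_{F\Gamma} & A_{FF}
\end{pmatrix}
\begin{pmatrix} u_I \\ u_\Gamma \\ u_F \end{pmatrix}
=
\begin{pmatrix} f_I \\ f_\Gamma \\ f_F \end{pmatrix}
```

Read this way, tearing duplicates the interface values into the copies $u_{\Gamma_I}$ and $u_{\Gamma_F}$, and the $\pm\mathrm{Id}$ rows tie each copy back to $u_\Gamma$, so that the intact and faulty subdomains can be treated separately.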

  17. Dirichlet-Dirichlet Recovery Strategy
      Dirichlet boundary condition on the healthy domain, Dirichlet boundary condition on the faulty domain.
      Alg. 1 Dirichlet-Dirichlet recovery
       1: Solve $Au = f$ by multigrid cycles.
       2: if a fault has occurred then STOP solving.
       3:   Recover boundary data $u_{\Gamma_F}$ from line 4 of (1)
       4:   Initialize $u_F$ with zero
       5:   In parallel do:
       6:     a) Use $n_F$ MG cycles accelerated by $\eta_s$ to approximate line 5 of (1):
       7:        $A_{FF} u_F = f_F - A_{F\Gamma_F} u_{\Gamma_F}$
       8:     b) Use $n_I$ MG cycles to approximate line 1 of (1):
       9:        $A_{II} u_I = f_I - A_{I\Gamma_I} u_{\Gamma_I}$
      10:   RETURN to line 1 with new values $u_I$ in $\Omega_I$ and $u_F$ in $\Omega_F$.
      11: end if
      System (1), block matrix and unknowns:
      $$\begin{pmatrix} A_{II} & A_{I\Gamma_I} & 0 & 0 & 0 \\ 0 & \mathrm{Id} & -\mathrm{Id} & 0 & 0 \\ A_{\Gamma I} & 0 & A_{\Gamma\Gamma} & 0 & A_{\Gamma F} \\ 0 & 0 & -\mathrm{Id} & \mathrm{Id} & 0 \\ 0 & 0 & 0 & A_{F\Gamma_F} & A_{FF} \end{pmatrix} \begin{pmatrix} u_I \\ u_{\Gamma_I} \\ u_\Gamma \\ u_{\Gamma_F} \\ u_F \end{pmatrix} \qquad (1)$$
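
The control flow of Alg. 1 can be sketched on a 1D Poisson problem. This is an illustration only: Gauss-Seidel sweeps are swapped in for multigrid cycles, the two subdomain solves run one after the other instead of in parallel, and the grid size, index ranges, and the values of `k_F`, `n_I`, `eta_s` are hypothetical.

```python
def gs_sweep(u, f, h, indices):
    """One Gauss-Seidel sweep over `indices`; all other entries stay fixed
    and therefore act as Dirichlet data for the swept subdomain."""
    n = len(u)
    for i in indices:
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        u[i] = 0.5 * (h * h * f[i] + left + right)

def residual_norm(u, f, h):
    """Max-norm of f - A u for the 1D finite-difference Laplacian."""
    n = len(u)
    worst = 0.0
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        worst = max(worst, abs(f[i] - (2.0 * u[i] - left - right) / (h * h)))
    return worst

# Hypothetical setup: fault after k_F sweeps, n_I intact sweeps, speedup eta_s.
n, k_F, n_I, eta_s = 31, 50, 3, 4
h = 1.0 / (n + 1)
f = [1.0] * n
u = [0.0] * n
faulty = list(range(12, 20))
interface = [11, 20]
intact = [i for i in range(n) if i not in faulty and i not in interface]

for _ in range(k_F):                   # line 1: global solve until the fault
    gs_sweep(u, f, h, range(n))

for i in faulty:                       # fault, then line 4: u_F starts from zero
    u[i] = 0.0
u_gamma = [u[i] for i in interface]    # line 3: interface data kept as boundary values

for _ in range(eta_s * n_I):           # lines 6-7: accelerated solve on the faulty part
    gs_sweep(u, f, h, faulty)
for _ in range(n_I):                   # lines 8-9: solve on the intact part
    gs_sweep(u, f, h, intact)
interface_unchanged = [u[i] for i in interface] == u_gamma

sweeps = 0                             # line 10: return to the global iteration
while residual_norm(u, f, h) > 1e-8 and sweeps < 20000:
    gs_sweep(u, f, h, range(n))
    sweeps += 1
print(sweeps, residual_norm(u, f, h))
```

Because the interface values are frozen during recovery, the faulty and intact solves do not exchange data, so running them sequentially here gives the same result as running them in parallel.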

  18. Cycle advantage factor κ
      Define $\kappa := \frac{k_R - k}{k_F} \in [0, 1]$, where
      • $k$, $k_R$: required number of iterations
      • $n_I$: number of MG cycles on the healthy subdomain
      • $\eta_s n_I$: number of MG cycles on the faulty subdomain

                17% loss       2% loss        0.6% loss
      n_I       DD     DN      DD     DN      DD     DN
      Fault at $k_F = 5$ and speedup $\eta_s = 2$
      0         0.80   0.80    0.80   0.80    0.80   0.80
      1         0.20   0.00    0.20   0.20    0.20   0.00
      2         0.20   0.00    0.20   0.00    0.00   0.00
      3         0.40   0.40    0.40   0.20    0.20   0.00
      4         0.60   0.60    0.60   0.40    0.40   0.20
      Fault at $k_F = 11$ and speedup $\eta_s = 5$
      0         0.82   0.82    0.82   0.82    0.91   0.91
      1         0.36   0.36    0.27   0.27    0.27   0.27
      2         0.09   0.00    0.09   0.00    0.00   0.00
      3         0.18   0.09    0.27   0.09    0.09   0.09
      4         0.27   0.18    0.36   0.18    0.18   0.18
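
The definition of κ is a one-line computation. The sketch below assumes that $k$ is the iteration count of the fault-free run and $k_R$ the count of the run with fault and recovery; the function name and the iteration counts are hypothetical and are not taken from the tables.

```python
def cycle_advantage(k_recovery, k_fault_free, k_fault):
    """kappa = (k_R - k) / k_F: extra iterations caused by the fault,
    relative to the iteration k_F at which the fault occurred."""
    return (k_recovery - k_fault_free) / k_fault

# Hypothetical counts: 15 iterations without a fault, fault at iteration 5.
print(round(cycle_advantage(19, 15, 5), 2))   # 4 extra iterations
print(round(cycle_advantage(15, 15, 5), 2))   # no extra iterations
```

Under this reading, κ = 0 means the fault costs no additional iterations, and values close to 1 mean that nearly all of the work done before the fault has to be repeated.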

  19. Parallel setup: 0.6% to 0.00047% information loss
      Adaptive steering: $n_F = \eta_s n_I + \Delta n_F$
      • DD and DN strategies: one failure at $k_F = 7$ with $n_I = 3$

        Size          No Rec        η_s = 1    2       4       8        η_s = 1    2       4       8
        4.5 · 10^8    13.73 (21)    9.14      -0.01    0.00   -0.00     11.47      2.30    0.02    0.04
        2.1 · 10^9    11.69 (20)    9.31       0.04    0.08    0.11      9.35      2.41    0.11    0.14
        1.2 · 10^10   12.49 (20)    7.42      -0.01   -0.02   -0.00      9.96      2.54    0.06    0.06
        8.2 · 10^10   11.16 (19)    5.54       0.08    0.07    0.04      8.36      0.11    0.15    0.17
        6.0 · 10^11   13.59 (19)    3.47       0.13    0.19    0.13      0.13      0.24    0.29    0.26

      • DN strategy: two consecutive failures at $k_F = 5$ and $k_F = 9$ with $\eta_s = 4$

        Size          No Rec        (1,2)   (1,3)   (2,2)   (2,3)
        4.5 · 10^8    18.35 (23)    0.02    0.03    0.03    0.04
        2.1 · 10^9    16.33 (22)    0.05    0.06    0.06    0.06
        1.2 · 10^10   17.43 (22)    0.07    0.08    0.09    0.08
        8.2 · 10^10   16.69 (21)    0.16    0.17    0.16    0.17
        6.0 · 10^11   20.64 (21)    0.30    0.33    0.36    0.36

      Global recovery can fully compensate for a fault with respect to time-to-solution.

  20. Towards Geophysics
