spare node substitution for failure nodes

Spare Node Substitution for Failure Nodes, Kazumi Yoshinaga, RIKEN



  1. Spare Node Substitution for Failure Nodes Kazumi Yoshinaga RIKEN AICS

  2. Background
     • In the exaflops era, faults may occur more frequently than ever
       → System MTBF becomes shorter
     • Important issue: recovery from faults
     • Conventional method: system-level checkpoint-restart
       – Requires massive I/O
     • Many mechanisms to survive failures have been proposed and investigated
       – Smaller I/O size
       – One of these mechanisms is ULFM (User-Level Failure Mitigation)
         • The user program handles failures
         • The program can survive failures and continue its execution
     • However, there has been no discussion of how a job should survive node failures

  3. Purpose of this Research
     • What is the best way to survive node failures?
       – Assume that a job can survive a node failure by using existing fault mitigation software
       – The goal is not to propose a new fault mitigation mechanism
       – The goal is to propose a recovery strategy

  4. Survival from Node Failure
     • Applications with dynamic load balancing
       – e.g. distributed master-worker model
       – Method: avoid the failed nodes
       – The application continues its execution using only the healthy nodes after a failure
     • What about applications without dynamic load balancing?
       – e.g. stencil computation

  5. Avoiding Failure Node(s) for Stencil Computation
     • Stencil computation characteristics
       – Communication pattern is fixed
       – Load can be balanced
     • When a recovery happens, the above stencil computation characteristics must be preserved
     • However,
       – Hard to balance loads
       – Impossible to preserve the communication pattern
       – Every time a new failure happens, the communication pattern can differ
     • Hard to program!
     • Using spare nodes to solve these problems
     [Figure labels: Failure; x1.5 computation; New comm. pattern]

  6. Using Spare Nodes
     • An application runs with spare nodes
     • If a node fails, the task running on the failed node is migrated to a spare node
       – Loads stay balanced (execution continues with the same number of processes)
       – The logical communication pattern is preserved
       – No change in the kernel part of the application
       – Some penalties

  7. Spare Node Penalty-1 -System Utilization Degradation-
     • Spare node allocation
     • System utilization is decreased
     • Notation nD(α,β): n = dimensions of the network, α = number of dimensions of spare nodes, β = spare node width
     [Figure: % spare nodes (0 to 14) versus # nodes (1,000 to 1,000,000) for 3D(3,1), 3D(2,1), 3D(1,1), 2D(2,1), and 2D(1,1)]
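The nD(α,β) notation can be turned into a spare-node count with a short calculation. The sketch below is an illustration, not the author's code: it assumes the spare region is a slab of width β along each of the first α dimensions, so that compute nodes fill the remaining box. That reading agrees with the sizes quoted later in the deck (12x12x12 with an 11x11x12 compute region, 16x8x8 with 15x7x8). The function names are hypothetical.

```python
from math import prod


def spare_node_count(dims, alpha, beta):
    """Number of spare nodes for an nD(alpha, beta) allocation.

    Assumption: spares form a slab of width `beta` along each of the first
    `alpha` dimensions; compute nodes occupy the remaining box.
    """
    if not 0 <= alpha <= len(dims):
        raise ValueError("alpha must be between 0 and the number of dimensions")
    compute_dims = [d - beta if i < alpha else d for i, d in enumerate(dims)]
    if any(d <= 0 for d in compute_dims):
        raise ValueError("beta is too large for this network")
    return prod(dims) - prod(compute_dims)


def spare_node_percent(dims, alpha, beta):
    """Share of the machine reserved for spares, in percent."""
    return 100.0 * spare_node_count(dims, alpha, beta) / prod(dims)


for dims in [(12, 12, 12), (16, 8, 8)]:
    count = spare_node_count(dims, 2, 1)
    print(dims, count, round(spare_node_percent(dims, 2, 1), 1))
```

For 3D(2,1) this gives 276 spare nodes on 12x12x12 and 184 on 16x8x8, which match the largest failure counts on the x-axes of the collective-operation plots.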

  8. Spare Node Penalty-2 -Communication Performance Degradation-
     • The logical communication pattern can be preserved by creating a new MPI communicator that excludes the failed node and includes a spare node
     • However, the physical communication pattern is not the same, and communication performance (CP) can be degraded
       – Larger hop counts (latency)
       – Possible message collisions

  9. Ex. CP Degradation of Spare Node Substitution
     • Setting: 2D Cartesian network topology (XY routing), 5-point stencil computation
     • Nodes on the topmost row work as spare nodes
     • Up to 5 possible collisions after 1 node failure
       – Independent of the number of nodes
     • How should faulty nodes be replaced by spare nodes?
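To make the collision effect concrete, the following sketch counts how many 5-point stencil messages share a directed link under XY routing on a small mesh. It is an illustrative model only and does not reproduce the collision metric or the figure of 5 from the slide. It assumes a 6x6 mesh with the topmost row reserved for spares, a hypothetical failure at coordinate (2, 3), and that the failed node's router still forwards traffic.

```python
from collections import Counter


def xy_route(src, dst):
    """Directed links used by dimension-order routing (X first, then Y) on a mesh."""
    (x, y), (tx, ty) = src, dst
    links = []
    while x != tx:
        nx = x + (1 if tx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != ty:
        ny = y + (1 if ty > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def max_link_load(width, compute_rows, placement):
    """Largest number of stencil messages that share one directed link.

    `placement` maps a logical coordinate to the physical node running it.
    """
    load = Counter()
    for x in range(width):
        for y in range(compute_rows):
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < compute_rows:
                    route = xy_route(placement[(x, y)], placement[(nx, ny)])
                    for link in route:
                        load[link] += 1
    return max(load.values())


WIDTH, HEIGHT = 6, 6
COMPUTE_ROWS = HEIGHT - 1  # topmost row is reserved for spare nodes

# Before any failure, every logical coordinate runs on the same physical node.
identity = {(x, y): (x, y) for x in range(WIDTH) for y in range(COMPUTE_ROWS)}

# Hypothetical failure at (2, 3): its rank moves to the spare in the same column.
after_failure = dict(identity)
after_failure[(2, 3)] = (2, HEIGHT - 1)

print("max link load without failure:", max_link_load(WIDTH, COMPUTE_ROWS, identity))
print("max link load after substitution:", max_link_load(WIDTH, COMPUTE_ROWS, after_failure))
```

Without a failure each directed link carries a single message. After the substitution, the messages addressed to the relocated rank converge on the links leading to the spare node, which is the contention the slide refers to.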

  10. Sliding Substitution(1)
     • We proposed “Sliding Substitution” methods
       – 0D Sliding (simple replace)
         • The failed rank is continued on an alternative node
       – 1D Sliding
         • Processes between the failure node and the spare node are shifted
       – 2D Sliding
         • All processes between the failure node's row (column) and the spare node's row (column) are shifted
       – 3D Sliding, 4D, 5D, …
     [Figure: three 6x6 grids of nodes numbered 0 to 35, one per method: 0D Sliding, 1D Sliding, 2D Sliding]
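The difference between 0D and 1D sliding can be sketched as a remapping of logical ranks to physical nodes. This is a minimal illustration under stated assumptions, not the author's implementation: a 6x6 mesh whose top row holds the spares, a single hypothetical failure at coordinate (2, 3), a free spare in the same column, and hypothetical function names. The hop count is the Manhattan distance between the physical nodes of logically adjacent ranks.

```python
def slide_0d(placement, failed, spare_row):
    """0D sliding: only the rank on the failed node moves, to the spare in its column."""
    new = dict(placement)
    for rank, node in placement.items():
        if node == failed:
            new[rank] = (failed[0], spare_row)
    return new


def slide_1d(placement, failed):
    """1D sliding: every rank from the failed node up to the spare shifts by one node."""
    fx, fy = failed
    new = dict(placement)
    for rank, (x, y) in placement.items():
        if x == fx and y >= fy:
            new[rank] = (x, y + 1)
    return new


def max_neighbor_hops(placement):
    """Largest Manhattan distance between physical nodes of logically adjacent ranks."""
    worst = 0
    for (x, y), (px, py) in placement.items():
        for dx, dy in ((1, 0), (0, 1)):
            other = placement.get((x + dx, y + dy))
            if other is not None:
                worst = max(worst, abs(px - other[0]) + abs(py - other[1]))
    return worst


WIDTH, HEIGHT = 6, 6
SPARE_ROW = HEIGHT - 1  # assumption: the top row holds the spare nodes
FAILED = (2, 3)         # hypothetical failed node

identity = {(x, y): (x, y) for x in range(WIDTH) for y in range(SPARE_ROW)}

after_0d = slide_0d(identity, FAILED, SPARE_ROW)
after_1d = slide_1d(identity, FAILED)

print("0D max hops between logical neighbors:", max_neighbor_hops(after_0d))
print("1D max hops between logical neighbors:", max_neighbor_hops(after_1d))
```

In this small case 0D sliding stretches one neighbor pair to three hops while 1D sliding keeps every pair within two, at the cost of migrating one additional healthy rank.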

  11. Preliminary Evaluation -5D stencil on 2D network-
     • Spare allocation: 2D(2,1) > 2D(1,1)
     • Max. failures
       – 0D: up to the number of spare nodes
       – 1D: 3 (or more)
       – 2D: up to 2 (2D Cart. topo.)
     • Comm. perf.: 2D > 1D > 0D
     [Figure: max. collisions versus # failed nodes on mesh and torus networks; panels 0D : 2D(1,1), 0D : 2D(2,1), 1D : 2D(2,1), 2D : 2D(2,1)]

  12. Sliding Substitution(2)
     • The higher the dimension
       – The better the performance
       – The smaller the number of failure nodes it can handle
     • 2D or higher dimension Sliding
       – Migrates tasks running on healthy nodes
       – Freed nodes work as new spare nodes
     • Hybrid Sliding
       – 3D → 2D → 1D → 0D (on 3D network)
     [Figure: 3D Sliding; freed nodes work as new spare nodes]

  13. Evaluation : 7P-Stencil on the K and BG/Q (Hybrid, 3D(2,1), 4MiB)
     • K computer: up to 8 times slower
     • BG/Q: up to 12 times slower
     [Figure: relative latency (smaller is better) versus # failed nodes; series Sim. Avg., Sim. Worst, Sim. Best, Exp. Worst; panels: the K computer, 12x12x12 nodes (calc. 11x11x12), and BG/Q, 16x8x8 nodes (calc. 15x7x8)]

  14. Evaluation: Collectives on the K and BG/Q (Hybrid, 3D(2,1))
     • On the K and BG/Q, collective operations are optimized for their network
     • Having spare nodes makes the optimization very difficult
     • BG/Q’s optimization works only with MPI_COMM_WORLD
     [Figure: worst-case relative latency (smaller is better) versus # failed nodes; panels Allreduce(K), Barrier(K), Barrier(BG/Q), Allreduce(BG/Q); BG/Q panels based on 16x8x8]

  15. Summary
     • We proposed and compared “Sliding Substitution” methods
     • Communication performance degradation is observed
       – 7P-Stencil:
         • Simulation results: up to 40 collisions
         • Experimental results: up to 12 times larger latency
       – Collective communications:
         • Up to 12 times larger latency (BG/Q, Barrier)

  16. Future Work
     • Evaluations with real applications
     • Node-rank re-mapping algorithms, or better substitution methods
     • Discussion of other network topologies
       – Experiments using Tsubame 2.5 (fat-tree) are scheduled
