Failure Detection and Propagation in HPC Systems




1. Failure Detection and Propagation in HPC Systems
George Bosilca 1, Aurélien Bouteiller 1, Amina Guermouche 1, Thomas Hérault 1, Yves Robert 1,2, Pierre Sens 3 and Jack Dongarra 1,4
1. University of Tennessee, Knoxville  2. ENS Lyon, France  3. LIP6, Paris, France  4. University of Manchester, UK
SC'16 – November 15, 2016

2. Failure detection: why?
• Nodes do crash at scale (you've heard the story before)
• Current solution:
  1. Detection: TCP time-out (≈ 20 min)
  2. Knowledge propagation: admin network
• Work on fail-stop errors assumes instantaneous failure detection
• Seems we put the cart before the horse

3–7. Resilient applications (progressive build across slides 3–7)
• Continue execution after the crash of one node, and then of several nodes
• Need rapid and global knowledge of group membership:
  1. Rapid: failure detection
  2. Global: failure knowledge propagation
• Resilience mechanism should come for free, or at least have minimal impact

8. Contribution
• Failure-free overhead constant per node (memory, communications)
• Failure detection with minimal overhead
• Knowledge propagation based on a fault-tolerant broadcast overlay
• Tolerates an arbitrary number of failures (but a bounded number within the threshold interval)
• Logarithmic worst-case repair time

9. Outline
1. Model
2. Failure detector
3. Worst-case analysis
4. Implementation & experiments

10. Outline (section 1: Model)

11. Framework
• Large-scale platform with a (dense) interconnection graph (physical links)
• One-port message-passing model
• Reliable links (messages are not lost, duplicated, or modified)
• Communication time on each link: randomly distributed, but bounded by τ
• Permanent node crashes

12. Failure detector
Definition (failure detector): a distributed service able to return the state of any node, alive or dead. It is perfect if:
  1. any failure is eventually detected by all living nodes, and
  2. no living node suspects another living node.
Definition (stable configuration): all failed nodes are known to all processes (nodes may not be aware that they are in a stable configuration).
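The two clauses of the perfect-detector definition are the classical completeness and accuracy properties. As a formalization in my own notation (not spelled out on the slide), writing suspected_q(t) for the set of nodes that q suspects at time t:

\begin{align*}
\text{(Completeness)} \quad & \forall p \in \text{Failed},\ \forall q \in \text{Live}:\ \exists t_0,\ \forall t \ge t_0:\ p \in \text{suspected}_q(t) \\
\text{(Accuracy)} \quad & \forall p, q \in \text{Live},\ \forall t:\ p \notin \text{suspected}_q(t)
\end{align*}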

13. Vocabulary
• Node = physical resource
• Process = program running on a node
• Thread = part of a process that can run on a single core
• The failure detector will detect both process and node failures
• A failure detector is mandatory to detect some node failures

14. Outline (section 2: Failure detector)

15. Timeout techniques: p observes q
• Pull technique: observer p requests a live message from q ("Are you alive?" / "I am alive")
  − More messages
  − Long timeout
• Push technique [1]: observed q periodically sends "I am alive" heartbeats to p
  + Fewer messages
  + Faster detection (shorter timeout)
[1] W. Chen, S. Toueg, and M. K. Aguilera. On the quality of service of failure detectors. IEEE Trans. Computers, 2002.
(A small sketch of the push technique follows.)
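To make the push technique concrete, here is a minimal, hypothetical sketch (my illustration, not the authors' code; the interval and timeout values are arbitrary): the observed side emits a heartbeat every η seconds, and the observer suspects it once no heartbeat has arrived for δ > η.

    import threading, time

    ETA = 0.5     # heartbeat interval (eta), illustrative value
    DELTA = 2.0   # suspicion timeout (delta), must exceed eta plus message delay

    class Observer:
        def __init__(self):
            self.last_heartbeat = time.monotonic()

        def on_heartbeat(self):
            # Each received heartbeat re-arms the suspicion timeout
            self.last_heartbeat = time.monotonic()

        def suspected(self):
            # Suspect the observed process once delta elapses with no heartbeat
            return time.monotonic() - self.last_heartbeat > DELTA

    def observed_loop(observer, stop):
        # Push technique: the observed side sends heartbeats unprompted
        while not stop.is_set():
            observer.on_heartbeat()   # stands in for a network send
            time.sleep(ETA)

    stop = threading.Event()
    obs = Observer()
    threading.Thread(target=observed_loop, args=(obs, stop), daemon=True).start()
    time.sleep(1.0); print("alive?", not obs.suspected())               # alive? True
    stop.set(); time.sleep(DELTA + 0.5); print("alive?", not obs.suspected())  # alive? False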

16. Timeout techniques: platform-wide
• All-to-all observation:
  + Immediate knowledge propagation
  − Dramatic overhead
• Random nodes and gossip:
  + Quick knowledge propagation
  − Redundant/partial failure information (more later)
  − Difficult to define the timeout
  − Difficult to bound detection latency

17. Algorithm for failure detection
(Figure: nine processes 0–8 arranged on a ring.)
• Processes arranged as a ring
• Periodic heartbeats from a node to its successor
• Maintain the ring of live nodes:
  → Reconnect the ring after a failure
  → Inform all processes
(A sketch of the ring bookkeeping follows.)
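A minimal sketch of the ring bookkeeping (my illustration, not the authors' code): each process is observed by its successor, and when nodes die, the next live successor or predecessor is found by walking the ring past the locally known-dead set. This is the same idea as FindEmitter in the pseudocode on slide 25.

    def next_live(start, step, dead, n):
        """Walk the ring of n slots from `start` in direction `step`
        (+1 for successor, -1 for predecessor), skipping known-dead nodes."""
        k = (start + step) % n
        while k in dead:
            k = (k + step) % n
        return k

    # Example: 9 processes, nodes 2 and 3 known dead.
    dead = {2, 3}
    print(next_live(1, +1, dead, 9))  # new observer of 1 -> 4
    print(next_live(4, -1, dead, 9))  # new emitter of 4 -> 1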

18–24. Reconnecting the ring (animated timeline across slides 18–24)
• η: heartbeat interval; δ: suspicion timeout, with δ ≫ τ
• Each node sends a heartbeat to its observer every η
• When a node crashes, its observer stops receiving heartbeats and suspects it once δ expires
• The observer then extends its timeout to 2δ, picks the next live predecessor as its new emitter, and sends it a reconnection message
• If that predecessor has also crashed, the step repeats (another 2δ) until a live emitter is found and the ring is reconnected
• Once reconnected, a broadcast message propagates the failure to all processes
(A back-of-the-envelope latency bound follows; the full pseudocode is on slide 25.)
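Reading the slide's η/δ/2δ timeline literally (my derivation, not a bound quoted from the talk): the observer suspects its emitter at most δ after the last heartbeat it received, and each additional dead predecessor it then discovers costs at most one extended timeout of 2δ. So with f consecutive dead predecessors the ring is reconnected within roughly

\[
T_{\text{reconnect}} \;\le\; \delta + (f - 1)\cdot 2\delta .
\]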

25. Algorithm

task Initialization
    emitter_i ← (i − 1) mod N
    observer_i ← (i + 1) mod N
    D_i ← ∅
    HB-Timeout ← η
    Susp-Timeout ← δ
end task

task T1: when HB-Timeout expires
    HB-Timeout ← η
    Send heartbeat(i) to observer_i
end task

task T2: upon reception of heartbeat(emitter_i)
    Susp-Timeout ← δ
end task

task T3: when Susp-Timeout expires
    Susp-Timeout ← 2δ
    D_i ← D_i ∪ {emitter_i}
    dead ← emitter_i
    emitter_i ← FindEmitter(D_i)
    Send NewObserver(i) to emitter_i
    Send BcastMsg(dead, i, D_i) to Neighbors(i, D_i)
end task

task T4: upon reception of NewObserver(j)
    observer_i ← j
    HB-Timeout ← 0
end task

task T5: upon reception of BcastMsg(dead, s, D)
    D_i ← D_i ∪ {dead}
    Send BcastMsg(dead, s, D) to Neighbors(s, D)
end task

function FindEmitter(D_i)
    k ← emitter_i
    while k ∈ D_i do
        k ← (k − 1) mod N
    return k
end function

(A toy simulation of these tasks follows.)
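A compact, hypothetical simulation of tasks T1–T5 (my sketch, under simplifying assumptions: discrete time steps instead of real timers, instantaneous delivery so τ ≈ 0, and the broadcast of T5 omitted), just to exercise the detection and reconnection logic:

    # Toy discrete-time simulation of the ring failure detector (tasks T1-T4).
    N, ETA, DELTA = 6, 1, 3

    class Proc:
        def __init__(self, i):
            self.i = i
            self.alive = True
            self.emitter = (i - 1) % N
            self.observer = (i + 1) % N
            self.dead = set()          # D_i: locally known failures
            self.hb_t = ETA            # HB-Timeout
            self.susp_t = DELTA        # Susp-Timeout

    procs = [Proc(i) for i in range(N)]

    def find_emitter(p):               # FindEmitter(D_i)
        k = p.emitter
        while k in p.dead:
            k = (k - 1) % N
        return k

    def step():
        inbox = []                                    # (dst, kind, src)
        for p in procs:
            if not p.alive: continue
            p.hb_t -= 1; p.susp_t -= 1
            if p.hb_t <= 0:                           # T1: send heartbeat
                p.hb_t = ETA
                inbox.append((p.observer, "hb", p.i))
            if p.susp_t <= 0:                         # T3: suspect emitter
                p.susp_t = 2 * DELTA
                p.dead.add(p.emitter)
                p.emitter = find_emitter(p)
                inbox.append((p.emitter, "newobs", p.i))
                # (broadcast of the failure omitted; see slide 26)
        for dst, kind, src in inbox:                  # instantaneous delivery
            q = procs[dst]
            if not q.alive: continue
            if kind == "hb" and src == q.emitter:     # T2: re-arm timeout
                q.susp_t = DELTA
            elif kind == "newobs":                    # T4: adopt new observer
                q.observer = src
                q.hb_t = 0                            # heartbeat right away

    procs[2].alive = False                            # crash node 2
    for _ in range(4 * DELTA):
        step()
    print(procs[3].emitter, procs[3].dead)            # expected: 1 {2}

Node 3 stops hearing from node 2, suspects it after δ rounds, walks the ring back to node 1, and the NewObserver message rewires the heartbeat chain, matching the slide 18–24 animation.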

26. Broadcast algorithm
• Hypercube broadcast algorithm [1]
• Disjoint paths deliver multiple copies of each broadcast message
• Recursive-doubling broadcast run by each node
• Completes if f ≤ ⌊log(n)⌋ − 1 (f: number of failures, n: number of live processes)
Disjoint paths from source 0 in an 8-node hypercube (each column lists the nodes traversed before the destination):

Dest | via node 1 | via node 2 | via node 4
  1  | 0          | 0-2-3      | 0-4-5
  2  | 0-1-3      | 0          | 0-4-6
  3  | 0-1        | 0-2        | 0-4-5-7
  4  | 0-1-5      | 0-2-6      | 0
  5  | 0-1        | 0-2-6-7    | 0-4
  6  | 0-1-3-7    | 0-2        | 0-4
  7  | 0-1-3      | 0-2-6      | 0-4-5

[1] P. Ramanathan and K. G. Shin. Reliable broadcast in hypercube multicomputers. IEEE Trans. Computers, 1988.
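For intuition, here is a minimal sketch of the failure-free recursive-doubling pattern on a hypercube (my illustration; the fault-tolerant algorithm of [1] goes further, replicating the message along the disjoint paths in the table above):

    def hypercube_broadcast(n_dims, source=0):
        """Return the (sender, receiver) pairs of a recursive-doubling
        broadcast on a 2**n_dims-node hypercube rooted at `source`."""
        have = {source}
        rounds = []
        for d in range(n_dims):          # one hypercube dimension per round
            sends = []
            for p in sorted(have):
                partner = p ^ (1 << d)   # flip bit d: neighbor across dim d
                if partner not in have:
                    sends.append((p, partner))
            have.update(q for _, q in sends)
            rounds.append(sends)
        return rounds

    # 8 nodes (3-cube): log2(8) = 3 rounds, everyone reached.
    for r, sends in enumerate(hypercube_broadcast(3)):
        print("round", r, sends)
    # round 0 [(0, 1)]
    # round 1 [(0, 2), (1, 3)]
    # round 2 [(0, 4), (1, 5), (2, 6), (3, 7)]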

