

SLIDE 1

Failure Detection and Propagation in HPC systems

George Bosilca1, Aurélien Bouteiller1, Amina Guermouche1, Thomas Hérault1, Yves Robert1,2, Pierre Sens3 and Jack Dongarra1,4

  • 1. University of Tennessee Knoxville
  • 2. ENS Lyon, France
  • 3. LIP6 Paris, France
  • 4. University of Manchester, UK

SC’16 – November 15, 2016

SLIDE 2

Failure detection: why?

  • Nodes do crash at scale (you’ve heard the story before)
  • Current solution:

1 Detection: TCP time-out (≈ 20 min)
2 Knowledge propagation: Admin network

  • Work on fail-stop errors assumes instantaneous failure detection
  • Seems we put the cart before the horse

2 / 35

SLIDE 3

Resilient applications

  • Continue execution after crash of one node

3 / 35

SLIDE 4

Resilient applications

  • Continue execution after crash of several nodes

3 / 35

SLIDE 5

Resilient applications

  • Continue execution after crash of several nodes
  • Need rapid and global knowledge of group members

1 Rapid: failure detection
2 Global: failure knowledge propagation

3 / 35

SLIDE 6

Resilient applications

  • Continue execution after crash of several nodes
  • Need rapid and global knowledge of group members

1 Rapid: failure detection
2 Global: failure knowledge propagation

  • Resilience mechanism should come for free

3 / 35

SLIDE 7

Resilient applications

  • Continue execution after crash of several nodes
  • Need rapid and global knowledge of group members

1 Rapid: failure detection
2 Global: failure knowledge propagation

  • Resilience mechanism should have minimal impact

3 / 35

SLIDE 8

Contribution

  • Failure-free overhead constant per node (memory, communications)
  • Failure detection with minimal overhead
  • Knowledge propagation based on fault-tolerant broadcast overlay
  • Tolerate an arbitrary number of failures (but bounded number within threshold interval)

  • Logarithmic worst-case repair time

4 / 35

SLIDE 9

Outline

1 Model
2 Failure detector
3 Worst-case analysis
4 Implementation & experiments

5 / 35

SLIDE 10

Outline

1 Model
2 Failure detector
3 Worst-case analysis
4 Implementation & experiments

6 / 35

SLIDE 11

Framework

  • Large-scale platform with (dense) interconnection graph (physical links)
  • One-port message passing model
  • Reliable links (messages not lost/duplicated/modified)
  • Communication time on each link: randomly distributed but bounded by τ
  • Permanent node crashes

7 / 35

SLIDE 12

Failure detector

Failure detector: a distributed service able to return the state of any node, alive or dead. It is perfect if:

1 any failure is eventually detected by all living nodes, and
2 no living node suspects another living node.

Definition (stable configuration): all failed nodes are known to all processes (nodes may not be aware that they are in a stable configuration).
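In the classical failure-detector vocabulary (Chandra and Toueg), these two conditions are strong completeness and strong accuracy; a compact restatement, writing suspected_p(t) for the set of nodes that p suspects at time t:

1 Completeness: if q crashes at time t, there is a time t' ≥ t after which every living node p has q ∈ suspected_p(t').
2 Accuracy: if p and q are both alive at time t, then q ∉ suspected_p(t).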

8 / 35

SLIDE 13

Vocabulary

  • Node = physical resource
  • Process = program running on node
  • Thread = part of a process that can run on a single core
  • Failure detector will detect both process and node failures
  • Failure detector mandatory to detect some node failures

9 / 35

SLIDE 14

Outline

1 Model
2 Failure detector
3 Worst-case analysis
4 Implementation & experiments

10 / 35

SLIDE 15

Timeout techniques: p observes q

  • Pull technique
  • Observer p requests a "live" message from q
    (drawbacks: more messages, long timeout)
    (figure: p asks q "Are you alive?", q answers "I am alive")
  • Push technique [1]
  • Observed q periodically sends heartbeats to p
    (advantages: fewer messages, faster detection with a shorter timeout)
    (figure: q repeatedly sends "I am alive" to p)

[1]: W. Chen, S. Toueg, and M. K. Aguilera. On the quality of service of failure detectors. IEEE Trans. Computers, 2002
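A toy illustration of the push technique (a self-contained sketch, not the implementation described later in the deck; the η and δ values are arbitrary): two threads stand in for q and p, and a queue stands in for the reliable link.

import queue
import threading
import time

ETA, DELTA = 0.1, 1.0          # heartbeat period (η) and suspicion timeout (δ), with δ much larger than η
link = queue.Queue()           # stands in for the reliable link from q to p

def emitter(lifetime):         # observed node q
    end = time.monotonic() + lifetime
    while time.monotonic() < end:
        link.put("heartbeat")  # push one small message to the observer
        time.sleep(ETA)
    # after `lifetime` seconds q "crashes" and simply stops emitting

def observer():                # observer p
    while True:
        try:
            link.get(timeout=DELTA)   # the timeout is re-armed by every heartbeat
        except queue.Empty:
            print("p suspects q: no heartbeat within δ")
            return

threading.Thread(target=emitter, args=(0.5,), daemon=True).start()
observer()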

11 / 35

SLIDE 16

Timeout techniques: platform-wide

  • All-to-all:
    (advantage: immediate knowledge propagation; drawback: dramatic overhead)
  • Random nodes and gossip:
    (advantage: quick knowledge propagation; drawbacks: redundant/partial failure information (more later), difficult to define a timeout, difficult to bound the detection latency)

12 / 35

SLIDE 17

Algorithm for failure detection

  • Processes arranged as a ring
  • Periodic heartbeats from a node to its successor
  • Maintain ring of live nodes
    → Reconnect ring after a failure
    → Inform all processes

(figure: ring of nodes 1..8)

13 / 35

SLIDE 18

Reconnecting the ring

η: heartbeat interval

(figure: ring of nodes 1..8; each node sends a heartbeat to its successor every η)

14 / 35

SLIDE 20

Reconnecting the ring

η: heartbeat interval
δ: timeout, δ ≫ τ

(figure: ring of nodes 1..8; heartbeats every η; a node is suspected once no heartbeat has arrived for δ)

14 / 35

SLIDE 21

Reconnecting the ring

η: heartbeat interval
δ: timeout, δ ≫ τ

(figure: ring of nodes 1..8; after the timeout δ, the observer sends a reconnection message to bypass the suspected node)

14 / 35

SLIDE 22

Reconnecting the ring

η: heartbeat interval
δ: timeout, δ ≫ τ

(figure: ring of nodes 1..8; timeline showing the first suspicion after δ and a second timeout of 2δ; reconnection messages)

14 / 35

SLIDE 23

Reconnecting the ring

η: heartbeat interval
δ: timeout, δ ≫ τ

(figure: ring of nodes 1..8; after the δ and 2δ timeouts the ring is reconnected; reconnection messages)

14 / 35

SLIDE 24

Reconnecting the ring

η: heartbeat interval
δ: timeout, δ ≫ τ

(figure: ring of nodes 1..8; the ring is reconnected and broadcast messages propagate the failure information)

14 / 35

SLIDE 25

Algorithm

task Initialization:
    emitter_i ← (i − 1) mod N;   observer_i ← (i + 1) mod N
    HB-Timeout ← η;   Susp-Timeout ← δ;   D_i ← ∅
task T1 (when HB-Timeout expires):
    HB-Timeout ← η;   send heartbeat(i) to observer_i
task T2 (upon reception of heartbeat(emitter_i)):
    Susp-Timeout ← δ
task T3 (when Susp-Timeout expires):
    Susp-Timeout ← 2δ;   D_i ← D_i ∪ {emitter_i};   dead ← emitter_i
    emitter_i ← FindEmitter(D_i)
    send NewObserver(i) to emitter_i
    send BcastMsg(dead, i, D_i) to Neighbors(i, D_i)
task T4 (upon reception of NewObserver(j)):
    observer_i ← j;   HB-Timeout ← 0
task T5 (upon reception of BcastMsg(dead, s, D)):
    D_i ← D_i ∪ {dead};   send BcastMsg(dead, s, D) to Neighbors(s, D)
function FindEmitter(D_i):
    k ← emitter_i
    while k ∈ D_i do k ← (k − 1) mod N
    return k
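A small executable sketch of the ring-repair part of this algorithm (only FindEmitter and its mirror image; the timers, heartbeats and broadcasts of tasks T1 to T5 are left out, and the ring size N = 8 is an arbitrary example):

N = 8  # example ring size (arbitrary)

def find_emitter(i, dead):
    """New predecessor of i on the ring of live processes (FindEmitter in task T3)."""
    k = (i - 1) % N
    while k in dead:
        k = (k - 1) % N
    return k

def find_observer(i, dead):
    """New successor of i, i.e. the node i will now send heartbeats to
    (the effect of receiving a NewObserver message, task T4)."""
    k = (i + 1) % N
    while k in dead:
        k = (k + 1) % N
    return k

# Example: processes 1, 2 and 3 fail.  Process 4 now expects heartbeats from 0
# (its new emitter) and 0 now sends its heartbeats to 4 (its new observer),
# so the ring is reconnected around the three dead processes.
dead = {1, 2, 3}
assert find_emitter(4, dead) == 0
assert find_observer(0, dead) == 4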

15 / 35

SLIDE 26

Broadcast algorithm

  • Hypercube Broadcast Algorithm [1]
  • Disjoint paths to deliver multiple broadcast message copies
  • Recursive doubling broadcast algorithm by each node
  • Completes if f ≤ ⌊log(n)⌋ − 1 (f: number of failures, n: number of live processes)

(figure: 3-dimensional hypercube on nodes 0..7)

Node | via node 1 | via node 2 | via node 4
  1  |     -      |   0-2-3    |   0-4-5
  2  |   0-1-3    |     -      |   0-4-6
  3  |   0-1      |   0-2      |   0-4-5-7
  4  |   0-1-5    |   0-2-6    |     -
  5  |   0-1      |   0-2-6-7  |   0-4
  6  |   0-1-3-7  |   0-2      |   0-4
  7  |   0-1-3    |   0-2-6    |   0-4-5

[1] P. Ramanathan and Kang G. Shin, "Reliable Broadcast Algorithm", IEEE Trans. Computers, 1998
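For intuition on the recursive doubling part, a minimal failure-free sketch (this is only the schedule the fault-tolerant Ramanathan-Shin variant builds on, not the full algorithm): in round r, every node that already holds the message forwards it to the node whose identifier differs in bit r.

def hypercube_broadcast_schedule(dim, root=0):
    """(sender, receiver) pairs, round by round, of a recursive-doubling
    broadcast from `root` in a dim-dimensional hypercube of n = 2**dim nodes."""
    have = {root}                      # nodes that already hold the message
    rounds = []
    for r in range(dim):
        sends = [(p, p ^ (1 << r)) for p in sorted(have)]
        rounds.append(sends)
        have |= {dst for _, dst in sends}
    return rounds

# Example: the 3-cube of the table above, rooted at node 0.
for r, sends in enumerate(hypercube_broadcast_schedule(3)):
    print("round", r, ":", sends)      # log(n) = 3 rounds reach all 8 nodes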

16 / 35

SLIDE 27

Failure propagation

  • Hypercube Broadcast Algorithm
  • Completes if f ≤ ⌊log(n)⌋ − 1 (f: number of failures, n: number of living processes)
  • Completes after 2τ log(n)
  • Application to the failure detector:
  • If n = 2^ℓ
  • k = ⌊log(n)⌋
  • 2^k ≤ n ≤ 2^(k+1)
  • Initiate two successive broadcast operations
  • Source s of broadcast sends its current list D of dead processes
  • No update of D during a broadcast initiated by s (do NOT change the broadcast topology on the fly)
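As a concrete instance: with n = 6,000 live processes (the largest runs in the experiments later in the deck), ⌊log₂ n⌋ − 1 = 11, so the propagation overlay tolerates up to 11 overlapping failures.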

17 / 35

SLIDE 28

Quick digression

  • Need a fault-tolerant overlay with small fault-tolerant diameter and easy routing
  • Known only for specific values of n:
  • Hypercubes: n = 2^k
  • Binomial graphs: n = 2^k
  • Circulant networks: n = c·d^k
  • . . .

18 / 35

SLIDE 29

Outline

1 Model
2 Failure detector
3 Worst-case analysis
4 Implementation & experiments

19 / 35

SLIDE 30

Worst-case analysis

(figure: timeline; starting from a stable configuration, after a failure the system reaches the next stable configuration within at most T(f) if f faults occur)

Theorem. With n ≤ N alive nodes, and for any f ≤ ⌊log n⌋ − 1, we have
T(f) ≤ f(f+1)δ + fτ + (f(f+1)/2)·B(n),   where B(n) = 8τ log n.

  • 2 sequential broadcasts: 4τ log(n)
  • One-port model: broadcast messages and heartbeats are interleaved, hence B(n) = 8τ log(n)
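Reading the bound term by term (the split is made explicit on the next slide): f(f+1)δ + fτ accounts for the ring reconstruction and (f(f+1)/2)·B(n) for the broadcasts, so T(f) = O(f²(δ + τ log n)). For a single failure (f = 1) this gives T(1) ≤ 2δ + τ + 8τ log n = O(δ + τ log n).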

20 / 35

SLIDE 31

Worst-case scenario

T(f) ≤ f(f+1)δ + fτ  (ring reconstruction)  +  (f(f+1)/2)·B(n)  (broadcasts)

  • T(f) ≤ ring reconstruction + broadcasts (for the proof)
  • Process p discovers the death of q at most once
    ⇒ the i-th failed process is discovered dead by at most f − i + 1 processes
    ⇒ at most f(f+1)/2 broadcasts
  • R(f): ring reconstruction time
    For 1 ≤ f ≤ ⌊log n⌋ − 1, R(f) ≤ R(f−1) + 2fδ + τ
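Unrolling this recurrence from R(0) = 0 (no repair needed when nothing has failed) gives R(f) ≤ Σ_{i=1..f} (2iδ + τ) = f(f+1)δ + fτ, which is exactly the ring-reconstruction part of the bound on T(f).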

21 / 35

SLIDE 32

Ring reconnection

R(f) ≤ R(f−1) + 2fδ + τ

  • R(1) ≤ 2τ + δ ≤ 2δ + τ
  • R(f) ≤ R(f−1) + R(1) if the next failure is not adjacent to the previous ones
  • Worst case when the failing nodes are consecutive in the ring
  • Build the ring by "jumping" over the platform to avoid correlated failures

(figure: example timeline for nodes 1..4 where nodes 3, 2 and 1 fail in sequence and are all detected by node 4: node 4 detects the failure of 3 after τ + δ ≤ 2δ, then the failure of 2 after another 2δ, then the failure of 1 after another 2δ; the ring is then reconnected and the broadcasts of the three failures follow, each taking B(n); the total is T(3, C). Legend: HB = heartbeat, NO = NewObserver, Bcast = broadcast operation)

22 / 35

SLIDE 33

Worst-case scenario

T(f) ≤ f(f+1)δ + fτ + (f(f+1)/2)·B(n)

23 / 35

SLIDE 34

Worst-case scenario

T(f) ≤ f(f+1)δ + fτ + (f(f+1)/2)·B(n)

Too pessimistic!?

23 / 35

SLIDE 35

Worst-case scenario

1 If the time between two consecutive faults is larger than T(1), then the average stabilization time is T(1) = O(log n)

2 If f quickly overlapping faults hit non-consecutive nodes, T(f) = O(log² n)

3 If f quickly overlapping faults hit f consecutive nodes in the ring, T(f) = O(log³ n)

Large platforms: two successive faults strike consecutive nodes with probability 2/n

23 / 35

SLIDE 36

Risk assessment with τ = 1µs

  • P(≥ ⌊log₂(n)⌋ failures in T(⌊log₂(n)⌋ − 1)) < 10⁻⁹
  • With µ_ind = 45 years, δ ≤ 60 s ⇒ timely convergence
  • Detector generates negligible noise to applications (e.g., η = δ/10)
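One way to reproduce this kind of estimate (a sketch only: it assumes independent, exponentially distributed node failures, so that the number of failures on the whole platform within a window is roughly Poisson with mean n·W/µ_ind; the node count and window below are illustrative, not taken from the slides):

import math

def prob_at_least(k, n, window_s, mtbf_s):
    """P[at least k failures among n nodes within window_s seconds],
    under a Poisson approximation with platform-wide rate n / mtbf_s."""
    lam = n * window_s / mtbf_s        # expected number of failures in the window
    # upper tail of the Poisson distribution; terms beyond k + 60 are negligible
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k, k + 60))

mtbf = 45 * 365.25 * 24 * 3600         # µ_ind = 45 years, in seconds
# e.g. n = 100,000 nodes, k = floor(log2(n)) = 16, a 10-minute window:
print(prob_at_least(k=16, n=100_000, window_s=600, mtbf_s=mtbf))

With these illustrative parameters the result is many orders of magnitude below the 10⁻⁹ threshold quoted above.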

24 / 35

SLIDE 37

Simulations

Average stabilization time ⇒ see paper! Results confirm that:

  • overlapping failures are rare
  • overlapping failures strike independently
  • the average stabilization time remains close to δ

25 / 35

SLIDE 38

Outline

1 Model
2 Failure detector
3 Worst-case analysis
4 Implementation & experiments

26 / 35

SLIDE 39

Implementation

  • Observation ring and propagation topology implemented in the Byte Transport Layer (BTL)
  • No missing heartbeat period:
  • Implemented in an MPI internal thread, independently from application communications
  • RDMA put channel to directly raise a flag in the receiver's memory → no allocated memory, no message wait queue
  • Implementation in ULFM / Open MPI

(figure: heartbeats travel through the BTL, below the application's poll operations for application messages)

27 / 35

SLIDE 40

Case study: ULFM

  • Extension to the MPI library allowing the user to provide their own fault tolerance technique
  • Failure notification in MPI calls that involve a failed process
  • ULFM requires an agreement → all alive processes need to participate
  • Examples: MPI_COMM_AGREE and MPI_COMM_SHRINK

(figure: group of processes 1..7)

28 / 35

SLIDE 41

Experimental setup

  • Titan ORNL supercomputer
  • 16-core AMD Opteron processors
  • Cray Gemini interconnect
  • ULFM
  • Open MPI 2.x
  • Compiled with MPI_THREAD_MULTIPLE
  • One MPI rank per core
  • Up to 6,000 cores
  • Results averaged over 30 runs

29 / 35

SLIDE 42

Noise

30 / 35

SLIDE 43

Detection and propagation delay

31 / 35

SLIDE 44

Consensus in ULFM without fault detector

  • Provided by the system:

1 Timeout: large, to avoid false positives
2 Failures detected by ORTE, which informs mpirun, which then broadcasts

  • Non-resilient binary tree structure
  • Delays at the mpirun level to start the propagation

50X improvement with failure detector

32 / 35

SLIDE 45

Related work

  • Some have a logical ring (Chord, Gulfstream, . . . )
  • Some separate detection and propagation (SWIM, consensus algorithms, . . . )

  • Many have non-deterministic strategies
  • at best: expectation of detection/propagation time for single failure
  • no quantitative assessment for several consecutive failures
  • Our work is 100% deterministic
  • detection with single observer and easy-to-define time-out
  • minimal impact on failure-free execution of the application
  • logarithmic worst-case propagation
  • logarithmic worst-case repair time with consecutive failures

33 / 35

SLIDE 46

Did you say random?

Failure detection

  • Periodic rounds of observation
  • Need several rounds to detect with high probability
  • Observation round with 100,000 nodes selecting a random target:
    ⇒ expect 36,788 nodes ignored
    ⇒ contention with likely 5 ≤ #msgs-per-node ≤ 15
    ⇒ need 21 rounds for the probability to miss one node to be ≤ 10⁻⁹
    ⇒ 100X increase in stabilization time for one failure

  • No need to maintain the ring

Information propagation

  • Flooding algorithm with randomized targets
  • Hard to find criteria to stop propagation
  • No need to maintain any broadcast structure

34 / 35

SLIDE 47

Did you say random?

Failure detection

  • Periodic rounds of observation
  • Need several rounds to detect with high probability
  • Observation round with 100,000 nodes selecting a random target:
    ⇒ expect 36,788 nodes ignored
    ⇒ contention with likely 5 ≤ #msgs-per-node ≤ 15
    ⇒ need 21 rounds for the probability to miss one node to be ≤ 10⁻⁹
    ⇒ 100X increase in stabilization time for one failure

  • No need to maintain the ring

Information propagation

  • Flooding algorithm with randomized targets
  • Hard to find criteria to stop propagation
  • No need to maintain any broadcast structure

Our take: good for dynamic environments (with new nodes joining, intermittent failures, unreliable routing); unfit for HPC platforms
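The numbers above follow from a simple balls-in-bins estimate (a back-of-the-envelope check, not taken verbatim from the paper): if each of the n = 100,000 nodes picks one random target per round, a given node is picked by nobody with probability (1 − 1/n)^n ≈ 1/e, so about n/e ≈ 36,788 nodes are ignored in each round; after k independent rounds a node remains unobserved with probability ≈ e^(−k), and since e^(−20) ≈ 2.1 × 10⁻⁹ while e^(−21) ≈ 7.6 × 10⁻¹⁰, 21 rounds are needed to push the miss probability below 10⁻⁹.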

34 / 35

SLIDE 48

Conclusion and future work

Conclusion

  • Failure detector based on timeout and heartbeats
  • Tolerate arbitrary number of failures (but not too frequent)
  • Complicated trade-off between noise, detection and risks (of not detecting failures)
  • 100% deterministic
    ⇒ First worst-case analysis of repair time with cascading failures
    ⇒ 100X faster detection time over random rounds
  • Unique implementation in ULFM
    ⇒ Negligible noise, quick failure information dissemination
    ⇒ 50X improvement for consensus

Future work

  • Failure detector service provided by the MPI process manager (PMIx) instead of the MPI library

  • Investigate link/switch failures

35 / 35