Scalable Fault Tolerance with Charm++
Esteban Meneses, Gengbin Zheng, Celso L. Mendes, Laxmikant V. Kalé
Wednesday, April 28, 2010

Contents
• Fault Tolerance Techniques in Charm++
• Recent Developments
• Future Work

A problem hard to ignore

Installed  System         Processors    SMTBF
2000       ASCI White          8,192   40.0 h
2001       PSC Lemieux         3,016    9.7 h
2002       NERSC Seaborg       6,656  351.0 h
2002       ASCI Q              8,192    6.5 h
2003       Google             15,000    1.2 h
2006       Blue Gene/L       131,072  147.8 h

SMTBF: system mean time between failures. Extract taken from High-End Computing Resilience [1].

We will live with failures
2484 separate node crashes on Jaguar over a 537-day period (Aug-22-2008 to Feb-10-2010).
4.62 failures per day.
What about Sequoia, with 1.6 million cores, or an exascale machine with 100 million cores?
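The Jaguar figures can be checked with a few lines of arithmetic. This is only a sketch that uses the two numbers quoted on the slide; the variable names are illustrative. It also converts the daily rate into a mean time between node crashes, which comes out at roughly five hours.

```python
# Figures quoted on the slide for Jaguar.
node_crashes = 2484
observation_days = 537

# Average number of node crashes per day over the observation period.
failures_per_day = node_crashes / observation_days

# Equivalent mean time between node crashes, in hours.
mtbf_hours = 24 / failures_per_day

print(f"{failures_per_day:.4f} failures per day")
print(f"one node crash roughly every {mtbf_hours:.2f} hours")
```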

Overview of Charm++ Fault Tolerant Techniques

Proactive Fault Tolerance
• Use knowledge about impending faults.
• Evacuate objects from processors that may fail soon.
[Diagram: Charm++ objects on Processors A, B, and C]
Sayantan Chakravorty, Celso L. Mendes, Laxmikant V. Kale, Proactive Fault Tolerance in MPI Applications via Task Migration, in Proceedings of HiPC 2006, LNCS volume 4297, page 485.
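The evacuation step can be pictured with a small sketch. It is not the Charm++ runtime: the `placement` dictionary, the `evacuate` function, and the "least-loaded survivor" policy are assumptions made for illustration, since the slide does not say where evacuated objects go.

```python
def evacuate(placement, failing):
    """Move every object off `failing` onto the least-loaded surviving processor.

    `placement` maps a processor name to the list of objects it hosts.
    The least-loaded policy is an assumption, not taken from the slides.
    """
    survivors = [p for p in placement if p != failing]
    if not survivors:
        raise ValueError("no surviving processor to receive objects")
    for obj in placement[failing]:
        target = min(survivors, key=lambda p: len(placement[p]))
        placement[target].append(obj)
    placement[failing] = []
    return placement


# Processor B is predicted to fail soon, so its objects are moved away.
placement = {"A": ["a1", "a2"], "B": ["b1", "b2", "b3"], "C": ["c1"]}
evacuate(placement, "B")
print(placement)
```

After the call, B hosts no objects and the application can keep running on A and C if B does fail.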

Checkpoint/Restart
• Double in-memory checkpoint.
• Synchronized checkpoint.
[Diagram: Charm++ objects and checkpoint memory overhead on Processors A, B, C, and D (buddy of B)]
Gengbin Zheng, Lixia Shi, Laxmikant V. Kale, FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI, Cluster 2004.
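A minimal sketch of the double in-memory idea follows. It is a toy model, not the FTC-Charm++ implementation: the `Processor` class, the mutual buddy pairing (A with C, B with D), and the way the lost copy is re-created after recovery are all assumptions for illustration.

```python
import copy


class Processor:
    def __init__(self, name):
        self.name = name
        self.state = {}         # live application state
        self.own_copy = None    # checkpoint of this processor's own state
        self.buddy_copy = None  # checkpoint held on behalf of the buddy
        self.buddy = None       # processor that stores our second copy


def checkpoint(processors):
    """Synchronized checkpoint: every processor saves two copies at the same step."""
    for p in processors:
        p.own_copy = copy.deepcopy(p.state)
        p.buddy.buddy_copy = copy.deepcopy(p.state)


def recover(failed, processors):
    """The failed processor restarts from its buddy; the others roll back locally."""
    for p in processors:
        if p is failed:
            p.state = copy.deepcopy(p.buddy.buddy_copy)
            p.own_copy = copy.deepcopy(p.buddy.buddy_copy)
            # Re-create the copy the failed processor was holding for its buddy.
            p.buddy_copy = copy.deepcopy(p.buddy.own_copy)
        else:
            p.state = copy.deepcopy(p.own_copy)


procs = {n: Processor(n) for n in "ABCD"}
for left, right in (("A", "C"), ("B", "D")):
    procs[left].buddy, procs[right].buddy = procs[right], procs[left]
for i, p in enumerate(procs.values()):
    p.state = {"step": 10, "data": [i]}

checkpoint(list(procs.values()))
for p in procs.values():
    p.state["step"] = 17  # work done after the checkpoint

b = procs["B"]
b.state, b.own_copy, b.buddy_copy = {}, None, None  # B crashes and loses its memory
recover(b, list(procs.values()))
print({n: p.state for n, p in procs.items()})
```

Every processor holds two checkpoints (its own and its buddy's), which is the memory overhead shown in the diagram, and all processors return to the last synchronized step after a failure.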

Message Logging
• Every message is stored in the sender log.
• Pessimistic: messages and determinants have to be stored before delivery.
[Diagram: messages m and m2 exchanged among Charm++ objects, with message-log memory overhead, on Processors A, B, C, and D (buddy of B)]
Sayantan Chakravorty, Laxmikant V. Kale, A Fault Tolerance Protocol with Fast Fault Recovery, Proceedings of the 21st International Parallel and Distributed Processing Symposium, 2007, Long Beach, California.
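The two rules above can be sketched as a toy model. It is not the protocol from the cited paper: the `Obj` class, the use of a receive sequence number (RSN) as the determinant, and storing that determinant in the sender's log entry are assumptions chosen to keep the example short.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.sent = 0        # sender sequence number (SSN) counter
        self.received = 0    # receive sequence number (RSN) counter
        self.log = []        # sender log of every message this object sent
        self.delivered = []  # payloads in the order they were delivered

    def send(self, dest, payload):
        self.sent += 1
        entry = {"dest": dest.name, "ssn": self.sent, "payload": payload, "rsn": None}
        self.log.append(entry)            # 1. message stored in the sender log
        entry["rsn"] = dest.assign_rsn()  # 2. determinant stored before delivery
        dest.deliver(payload)             # 3. only now is the message delivered

    def assign_rsn(self):
        self.received += 1
        return self.received

    def deliver(self, payload):
        self.delivered.append(payload)


def replay(failed_name, objects):
    """Rebuild the failed object's delivery order from the surviving sender logs."""
    entries = [e for o in objects if o.name != failed_name
               for e in o.log if e["dest"] == failed_name]
    return [e["payload"] for e in sorted(entries, key=lambda e: e["rsn"])]


a, b, c = Obj("A"), Obj("B"), Obj("C")
a.send(b, "m")
c.send(b, "m2")
a.send(b, "m3")

# B fails: only B has to be replayed, using the logs kept by A and C.
replayed = replay("B", [a, b, c])
print(replayed)
```

Step 2 is the pessimistic part: the sender waits for the determinant before the message is delivered, which adds latency to every message but lets a failed object recover without rolling back the others.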

Comparison (Reactive Approaches)

Technique           Recovery Time  Memory Overhead  Communication Overhead
Checkpoint/Restart       ☹               ☻                   ☺
Message Logging          ☺               ☹                   ☹

Recent Developments

Checkpoint/Restart Optimization
• Discard old messages to resume progress as soon as possible.
• Improve quiescence detection.
• Combine the messages that update the home location of objects.

Results
[Figure: two plots, "Checkpoint Time" and "Restart Time"; y-axis: time (seconds); x-axis: number of cores (512 and 1024)]
Application: Molecular3D (APOA1, ~100K atoms)
Data size: 624 KB per core (512 cores), 351 KB per core (1024 cores)

Message Logging Optimization
• Memory overhead reduction:
  • Team-based approach.
• Latency overhead reduction:
  • Causal protocol.

Team-based Approach
• Goal: reduce memory overhead of message log.
• Only messages crossing team boundaries are logged.
[Diagram: messages m and m2 among Charm++ objects on Processors A, B, and C, grouped into Team X and Team Y, with message-log memory overhead]

Processor Teams
• Each team acts as a recovery unit:
  • All members must checkpoint in a coordinated fashion.
  • If one member fails, the whole team rolls back.
[Diagram: team size axis from 1 (Message Logging) through k to N (Checkpoint/Restart)]
Esteban Meneses, Celso L. Mendes and Laxmikant V. Kale, Team-based Message Logging: Preliminary Results, 3rd Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids (CCGRID 2010).
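The trade-off behind team size can be sketched as follows. The example assumes teams are blocks of consecutive processor ranks and that traffic forms a ring; neither assumption comes from the slides, and the function names are illustrative.

```python
def team_of(proc, team_size):
    """Assumed team assignment: consecutive processor ranks form a team."""
    return proc // team_size


def must_log(src, dst, team_size):
    """Only messages that cross a team boundary are logged."""
    return team_of(src, team_size) != team_of(dst, team_size)


def rollback_set(failed, num_procs, team_size):
    """If one member fails, the whole team rolls back."""
    team = team_of(failed, team_size)
    return [p for p in range(num_procs) if team_of(p, team_size) == team]


num_procs = 8
# Hypothetical traffic: each processor sends one message to its right neighbor.
messages = [(p, (p + 1) % num_procs) for p in range(num_procs)]

logged_by_size = {}
for k in (1, 2, 4, 8):
    logged_by_size[k] = sum(must_log(s, d, k) for s, d in messages)
    print(f"team size {k}: {logged_by_size[k]} messages logged, "
          f"failure of processor 5 rolls back {rollback_set(5, num_procs, k)}")
```

With team size 1 every message is logged and only the failed processor rolls back, which is plain message logging. With team size N nothing is logged and every processor rolls back, which is checkpoint/restart. Intermediate sizes trade log memory against the amount of work redone.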

Results

Results (cont.)

Recovery Time

Further Developments
• Highly connected objects should belong to the same team.
• Exploit communication graph, dynamic groups, team-aware load balancer.
• Teams can address some correlated failures.
• Applicable to other message-logging protocols.

Reducing Latency
[Diagram: message timelines among Objects α, β, and γ. Pessimistic Message Logging: messages m and m2. Causal Message Logging: messages m and m2, followed by m3 ⊕ {m}.]

Causal Protocol
• No need to block the delivery of a message.
• No need to contact a remote processor for a local message.
• Metadata is piggybacked on the application's messages.
• Recovery may involve more processors.
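The piggybacking idea can be sketched with a toy model. It is not the Charm++ causal protocol: the `CausalObj` class and the determinant tuple `(receiver, sender, payload, rsn)` are assumptions, and a real protocol would also stop forwarding determinants once they are known to be safely stored.

```python
class CausalObj:
    def __init__(self, name):
        self.name = name
        self.rsn = 0               # receive sequence number counter
        self.determinants = set()  # (receiver, sender, payload, rsn) tuples known here
        self.delivered = []

    def send(self, dest, payload):
        # Metadata rides along with the application message: no extra round trip.
        dest.receive(self.name, payload, frozenset(self.determinants))

    def receive(self, sender, payload, piggyback):
        self.determinants |= piggyback  # remember what the sender knew
        self.rsn += 1
        self.determinants.add((self.name, sender, payload, self.rsn))
        self.delivered.append(payload)  # delivered immediately, without blocking


def recover_order(failed_name, survivors):
    """Collect the failed object's determinants from every surviving object."""
    found = {d for o in survivors for d in o.determinants if d[0] == failed_name}
    return [d[2] for d in sorted(found, key=lambda d: d[3])]


alpha, beta, gamma = CausalObj("alpha"), CausalObj("beta"), CausalObj("gamma")
alpha.send(beta, "m")
gamma.send(beta, "m2")
beta.send(gamma, "m3")  # carries the determinants of m and m2 to gamma

# beta fails: its delivery order is rebuilt from what the survivors hold.
order = recover_order("beta", [alpha, gamma])
print(order)
```

A determinant that was never piggybacked on any outgoing message is lost with the failed object, but in that case no surviving object depends on it. Because determinants may be scattered across several objects, recovery has to query more processors than in the pessimistic case.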

Early Results

Future Work

Future Work Roadmap
• Bigger Charm++ applications.
• Enhance Proactive Approach with prediction schemes.
• Enrich Team-based Approach.
  • Smarter team formation.
  • Coupling with load balancer.
• SMP-aware fault tolerance.

Acknowledgments
• Department of Energy – FastOS Program.
  • Colony-1 and Colony-2 projects.
• NSF/NCSA
  • Deployment efforts specific for Blue Waters.
• Machine allocation
  • TeraGrid MRAC – NCSA, TACC, ORNL
• Greg Bronevetsky from LLNL.

References
[1] Nathan DeBardeleben, James Laros, John Daly, Stephen Scott, Christian Engelmann and Bill Harrod. High End Computing Resilience: Analysis of Issues Facing the HEC Community and Path Forward for Research and Development.

Q&A

Thank You!
