APIs, Architecture and Modeling for Extreme Scale Resilience


SLIDE 1

LLNL-PRES-661421

APIs, Architecture and Modeling for Extreme Scale Resilience

Dagstuhl Seminar: Resilience in Exascale Computing
Kento Sato
9/30/2014

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

SLIDE 2

Lawrence Livermore National Laboratory

LLNL-PRES-661421


Failures on HPC systems

! System resilience is critical for future extreme-scale computing

! 191 failures out of 5 million node-hours

  • A production application using a laser-plasma interaction code (pF3D)
  • Hera, Atlas and Coastal clusters @ LLNL => MTBF: 1.2 days
  – Cf. TSUBAME2.0 => MTBF: about a day

! At extreme scale, the failure rate will increase

! HPC systems must now treat failures as usual events

SLIDE 3

Motivation for resilience APIs

! Current MPI implementations do not have fault-handling capabilities

  • Standard MPI employs a fail-stop model

! When a failure occurs …

  • MPI terminates all processes
  • The user locates the failed nodes and replaces them with spare nodes
  • Re-initialize MPI
  • Restore the last checkpoint

! Applications will spend more time on recovery

  • Users manually locate and replace the failed nodes with spare nodes via a machinefile
  • The manual recovery operations may introduce extra overhead and human errors

APIs to handle the failures are critical

[Figure: recovery cycle from Start to End: application run with checkpointing, failure, terminate processes, locate failed node, replace failed node, MPI re-initialization, restore checkpoint, recovery]

SLIDE 4

Resilience APIs, Architecture and the Model

! Resilience APIs: Fault tolerant messaging interface (FMI)

[Figure: compute nodes and the parallel file system, with the resilience APIs (FMI) layered on the compute nodes]

SLIDE 5

FMI: Fault Tolerant Messaging Interface [IPDPS2014]

! FMI is a survivable messaging interface providing an MPI-like interface

  • Scalable failure detection => overlay network
  • Dynamic node allocation => FMI ranks are virtualized
  • Fast checkpoint/restart => in-memory diskless checkpoint/restart

[Figure: FMI overview: the user's view of virtual FMI ranks P0..P9 on Nodes 0..4 vs. FMI's view, with a scalable failure-detection overlay network across ranks and in-memory RAID-5 checkpoint chunks (Pk-i plus Parity k) distributed across nodes]

SLIDE 6

How do FMI applications work?

! Processes are launched via fmirun

  • fmirun spawns fmirun.task on each node listed in machine_file
  • fmirun.task forks/execs the user program
  • fmirun broadcasts connection information (endpoints) for FMI_Init(…)

! FMI_Loop enables transparent recovery and roll-back on a failure

  • Periodically writes a checkpoint
  • Restores the last checkpoint on a failure

[Figure: fmirun starts fmirun.task on node0.fmi.gov through node3.fmi.gov (processes P0..P7, two per node); node4.fmi.gov is a spare node]

FMI example code:

  int main (int argc, char *argv[]) {
    int n, rank;
    FMI_Init(&argc, &argv);
    FMI_Comm_rank(FMI_COMM_WORLD, &rank);
    /* Application's initialization */
    while ((n = FMI_Loop(…)) < numloop) {
      /* Application's program */
    }
    /* Application's finalization */
    FMI_Finalize();
  }
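FMI itself is not shown beyond the skeleton above, so the checkpoint-every-k-iterations / roll-back-on-failure behavior of FMI_Loop can be mimicked in plain, single-process C. Everything here (toy_state, toy_loop, the injected failure) is a hypothetical sketch, not FMI's implementation:

```c
/* Hypothetical, single-process sketch of the FMI_Loop pattern: checkpoint
 * registered buffers to memory every `interval` iterations, and roll back
 * to the last checkpoint when a failure is flagged. toy_state/toy_loop
 * are illustrative names, not FMI's API. */
#include <stdlib.h>
#include <string.h>

typedef struct {
    void   *ckpt[8];     /* registered application buffers  */
    size_t  sizes[8];
    int     len;
    void   *saved[8];    /* in-memory checkpoint copies     */
    int     saved_iter;  /* iteration id of last checkpoint */
    int     iter;        /* next iteration id to hand out   */
    int     interval;    /* checkpoint every N iterations   */
} toy_state;

static int toy_loop(toy_state *s, int *failed) {
    if (*failed) {       /* restore the last checkpoint     */
        for (int i = 0; i < s->len; i++)
            memcpy(s->ckpt[i], s->saved[i], s->sizes[i]);
        s->iter = s->saved_iter;
        *failed = 0;
    } else if (s->iter % s->interval == 0) {   /* take a checkpoint */
        for (int i = 0; i < s->len; i++) {
            if (!s->saved[i]) s->saved[i] = malloc(s->sizes[i]);
            memcpy(s->saved[i], s->ckpt[i], s->sizes[i]);
        }
        s->saved_iter = s->iter;
    }
    return s->iter++;
}

/* Run 4 iterations, injecting one failure at iteration 3; the rolled-back
 * work is re-executed, so the final result matches a failure-free run. */
int run_toy(void) {
    double data = 0.0;
    toy_state s = {0};
    s.ckpt[0] = &data; s.sizes[0] = sizeof data; s.len = 1;
    s.interval = 2;
    int failed = 0, inject = 1, n;
    while ((n = toy_loop(&s, &failed)) < 4) {
        data += 1.0;                       /* the application's work */
        if (n == 3 && inject) { failed = 1; inject = 0; }
    }
    for (int i = 0; i < s.len; i++) free(s.saved[i]);
    return (int)data;
}
```

After the injected failure, the loop rolls back to checkpoint 1 (iteration 2) and re-executes iterations 2 and 3, mirroring the transparent recovery described above.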

SLIDE 7

User perspective: no failures

  • User perspective when no failures happen
  • Iterations: 4
  • Checkpoint frequency: every 2 iterations
  • FMI_Loop returns the incremented iteration id

[Figure: timeline across Nodes 0..3: FMI_Init, FMI_Comm_rank, then 0 = FMI_Loop(…) (checkpoint 0), 1 = FMI_Loop(…), 2 = FMI_Loop(…) (checkpoint 1), 3 = FMI_Loop(…), 4 = FMI_Loop(…), FMI_Finalize]

FMI example code:

  int main (int argc, char *argv[]) {
    int n, rank;
    FMI_Init(&argc, &argv);
    FMI_Comm_rank(FMI_COMM_WORLD, &rank);
    /* Application's initialization */
    while ((n = FMI_Loop(…)) < 4) {
      /* Application's program */
    }
    /* Application's finalization */
    FMI_Finalize();
  }

SLIDE 8

User perspective: failure

[Figure: timeline across Nodes 0..3: after checkpoint 1 at 2 = FMI_Loop(…), a failure during 3 = FMI_Loop(…) triggers restart: 1; execution resumes at 2 = FMI_Loop(…), then 3 = FMI_Loop(…), 4 = FMI_Loop(…), FMI_Finalize]

  • FMI transparently migrates FMI ranks 0 and 1 to a spare node
  • Restarts from the last checkpoint
  – the 2nd checkpoint (checkpoint 1), taken at iteration 2
  • With FMI, applications still use the same series of ranks even after failures

FMI example code: (same as on the previous slide)

SLIDE 9

Resilience API: FMI_Loop

  int FMI_Loop(void **ckpt, size_t *sizes, int len)

  ckpt : array of pointers to variables containing data that needs to be checkpointed
  sizes: array of sizes of each checkpointed variable
  len  : length of the arrays ckpt and sizes
  Returns the iteration id

! FMI constructs an in-memory RAID-5 across compute nodes

! Checkpoint group size

  • e.g.) group_size = 4

[Figure: FMI checkpointing: ranks 1..15 on Nodes 0..7 are divided into two encoding groups of four nodes each; within a group, checkpoint chunks Pk-i and one Parity k block per rank are rotated RAID-5 style across the group's nodes]
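The RAID-5 encoding above reduces to XOR parity: the parity chunk is the XOR of the group's data chunks, and any single lost chunk can be rebuilt by XORing the survivors with the parity. A minimal sketch (block count and contents are made up for illustration):

```c
/* XOR-parity encode/recover, the core of an in-memory RAID-5 checkpoint.
 * NBLOCKS data blocks plus one parity block tolerate one lost block. */
#include <stddef.h>
#include <string.h>

#define NBLOCKS 3
#define BLKSZ   8

/* parity[j] = XOR of data[i][j] over all blocks i */
static void encode_parity(unsigned char data[NBLOCKS][BLKSZ],
                          unsigned char parity[BLKSZ]) {
    memset(parity, 0, BLKSZ);
    for (int i = 0; i < NBLOCKS; i++)
        for (int j = 0; j < BLKSZ; j++)
            parity[j] ^= data[i][j];
}

/* Rebuild one lost data block from the survivors plus parity. */
static void recover_block(unsigned char data[NBLOCKS][BLKSZ],
                          const unsigned char parity[BLKSZ], int lost) {
    memcpy(data[lost], parity, BLKSZ);
    for (int i = 0; i < NBLOCKS; i++)
        if (i != lost)
            for (int j = 0; j < BLKSZ; j++)
                data[lost][j] ^= data[i][j];
}

/* Encode, wipe one block to simulate a node failure, recover, compare. */
int raid5_demo(void) {
    unsigned char data[NBLOCKS][BLKSZ], parity[BLKSZ], orig[BLKSZ];
    for (int i = 0; i < NBLOCKS; i++)
        for (int j = 0; j < BLKSZ; j++)
            data[i][j] = (unsigned char)(i * 17 + j);  /* arbitrary bytes */
    encode_parity(data, parity);
    memcpy(orig, data[1], BLKSZ);
    memset(data[1], 0, BLKSZ);              /* "node failure": lose block 1 */
    recover_block(data, parity, 1);
    return memcmp(orig, data[1], BLKSZ) == 0;   /* 1 if fully recovered */
}
```

This is why the scheme survives any single node loss per encoding group but not two: with two blocks missing, the XOR no longer determines either one.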

SLIDE 10

Application runtime with failures

[Figure: performance (GFlops) vs. number of processes (12 processes/node) for MPI, FMI, MPI + C, FMI + C, and FMI + C/R]

  • Benchmark: Poisson's equation solver using the Jacobi iteration method
  – Stencil application benchmark
  – MPI_Isend, MPI_Irecv, MPI_Wait and MPI_Allreduce within a single iteration
  • For MPI, we use the SCR library for checkpointing
  – Since MPI is not a survivable messaging interface, we write checkpoints to memory on tmpfs
  • The checkpoint interval is optimized by Vaidya's model for both FMI and MPI

P2P communication performance:

       1-byte latency   Bandwidth (8 MB)
  MPI  3.555 usec       3.227 GB/s
  FMI  3.573 usec       3.211 GB/s

Even with a high failure rate (MTBF: 1 minute), FMI incurs only a 28% overhead

FMI directly writes checkpoints via memcpy, and can exploit the memory bandwidth

SLIDE 11

Asynchronous multi-level checkpointing (MLC) [SC12]

[Figure: timeline of frequent level-1 (RAID-5) checkpoints interleaved with less frequent, asynchronous level-2 (PFS) checkpoints]

  • Asynchronous MLC is a technique for achieving high reliability while reducing checkpointing overhead
  • Asynchronous MLC uses storage levels hierarchically
  – RAID-5 checkpoint: frequent, for one-node or few-node failures
  – PFS checkpoint: less frequent and asynchronous, for multi-node failures
  • Our previous work models asynchronous MLC

Failure analysis on the Coastal cluster:

              MTBF        Failure rate
  L1 failure  130 hours   2.13e-6
  L2 failure  650 hours   4.27e-7

Source: K. Sato, N. Maruyama, K. Mohror, A. Moody, T. Gamblin, B. R. de Supinski, and S. Matsuoka, "Design and Modeling of a Non-Blocking Checkpointing System," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '12. Salt Lake City, Utah: IEEE Computer Society Press, 2012.

Source: A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '10).

SLIDE 12

Simulation based on asynchronous MLC

! Checkpoint size: 1 and 10 GB/node
! We increase the L1 & L2 failure rates

[Figure: efficiency vs. scale factor (1 to 50) for L1 & L2 at 1 GB/node and at 10 GB/node]

High efficiency at the current failure rate; if both the L1 & L2 failure rates increase and the checkpoint size is large, efficiency decreases faster

  • Async. MLC (multi-level C/R) model

  λ_i : level-i failure rate
  c_i : level-i checkpoint time
  r_i : level-i recovery time
  t   : checkpoint interval

  p_0(T) = e^{-λT}                t_0(T) = T
  p_i(T) = (λ_i / λ)(1 − e^{-λT})
  t_i(T) = (1 − (λT + 1)·e^{-λT}) / (λ·(1 − e^{-λT}))
SLIDE 13

Resilience APIs, Architecture and the Model

! Resilience APIs

  • In the near future, applications must be able to handle failures as usual events
  ⇒ Fault tolerant messaging interface (FMI)

! Resilience architecture and model

  • Software-level approaches are not enough
  ⇒ Architecture using burst buffers

[Figure: compute nodes, burst buffers (resilience architecture), and the parallel file system, with the resilience APIs (FMI) on the compute nodes]

SLIDE 14

Burst buffer storage architecture

! Burst buffer

  • A new tier in the storage hierarchy
  • Absorbs bursty I/O requests from applications
  • Fills the performance gap between node-local storage and the PFS in both latency and bandwidth

! If you write checkpoints to burst buffers,

  • Checkpoint/restart is faster than with the PFS
  • More reliable than storing on compute nodes

[Figure: compute nodes, burst buffers, and the parallel file system]

SLIDE 15

Burst buffer storage architecture (cont'd)

Challenges for using a burst buffer system:

! Exploiting the storage bandwidth of burst buffers

  • Burst buffers are attached over the network, so the network can become the bottleneck
  ⇒ IBIO: InfiniBand-based I/O interface

! Analyzing the reliability of systems with burst buffers

  • Adding burst buffer nodes increases the total system size
  • System efficiency may decrease due to the overall failure rate added by the burst buffers
  ⇒ Reliability storage model

[Figure: compute nodes 1..4 connected over the network to burst-buffer SSDs 1..4, backed by the PFS (parallel file system); the network link is the potential bottleneck]

SLIDE 16

APIs for burst buffers: InfiniBand-based I/O interface (IBIO)

! Provides POSIX-like I/O interfaces

  • open, read, write and close operations
  • A client can open any file on any server
  – open("hostname:/path/to/file", mode)

! IBIO uses ibverbs for communication between clients and servers

  • Exploits the network bandwidth of InfiniBand
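Since IBIO's open() addresses files as "hostname:/path/to/file", a client must split the target into a server host and a remote path before connecting. The helper below is a hypothetical illustration of that parsing step, not IBIO's actual code:

```c
/* Split "hostname:/path/to/file" into host and path parts.
 * Returns 0 on success, -1 if no ':' separator fits the host buffer.
 * Purely illustrative: IBIO's real parsing is not shown on the slide. */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

int split_ibio_target(const char *target, char *host, size_t hostsz,
                      char *path, size_t pathsz) {
    const char *colon = strchr(target, ':');
    if (!colon || (size_t)(colon - target) >= hostsz)
        return -1;
    memcpy(host, target, (size_t)(colon - target));
    host[colon - target] = '\0';
    snprintf(path, pathsz, "%s", colon + 1);  /* truncates safely */
    return 0;
}
```

With a target like "node0.fmi.gov:/tmp/ckpt", the client would connect to node0.fmi.gov and ask its IBIO server to open /tmp/ckpt.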

[Figure: IBIO write and IBIO read paths, each with four IBIO clients (compute nodes 1..4) and one IBIO server; chunks flow through the server's chunk buffers, and per-file writer/reader threads (fd1..fd4) stream file1..file4 to and from storage]

SLIDE 17

Resilience modeling overview

  • To find the best checkpoint/restart strategy for systems with burst buffers, we model checkpointing strategies

[2] Kento Sato, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama and Satoshi Matsuoka, "Design and Modeling of a Non-blocking Checkpointing System", SC12

Efficiency: the fraction of time an application spends only in useful computation

Recursive structured storage model: H_N {m_1, m_2, . . . , m_N}

[Figure: recursive storage model: each level-i storage unit S_i (i > 0) sits above m_i units of the level below, H_{i-1}; compute nodes form level i = 0]

C/R strategy model:

  L_i = C_i + E_i
  O_i = C_i + E_i (sync.)  or  I_i (async.)
  C_i or R_i = (C/R data size per node × # of C/R nodes per S_i) / (write perf. w_i or read perf. r_i)
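The C_i / R_i term above is simply data volume divided by bandwidth; a one-function sketch (the function name and units are assumptions, not the model's notation):

```c
/* Checkpoint (or restart) cost: per-node C/R data size times the number
 * of nodes sharing a level-i storage unit, divided by its write (or read)
 * bandwidth. Mirrors the slide's C_i / R_i expression; units are up to
 * the caller (e.g. GB and GB/s give seconds). */
double cr_cost(double data_per_node, double nodes_per_store, double bw) {
    return data_per_node * nodes_per_store / bw;
}
```

For instance, 10 GB/node from 4 nodes through a 2 GB/s storage unit costs 20 seconds per checkpoint.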

Async. MLC model [2]:

  λ_i : level-i failure rate
  c_k : level-k checkpoint time
  r_k : level-k recovery time
  t   : checkpoint interval

  p_0(T) = e^{-λT} : probability of no failure for T seconds
  t_0(T) = T : expected time in that case
  p_i(T) = (λ_i / λ)(1 − e^{-λT}) : probability of a level-i failure within T seconds
  t_i(T) = (1 − (λT + 1)·e^{-λT}) / (λ·(1 − e^{-λT})) : expected time in that case

[Figure: Markov model of async. MLC: each checkpoint interval of duration t + c_k completes with probability p_0(t + c_k) (expected time t_0(t + c_k)) or suffers a level-i failure with probability p_i(t + c_k) (expected time t_i(t + c_k)); a failure leads to a recovery phase of duration r_k with its own outcomes p_0(r_k), t_0(r_k) and p_i(r_k), t_i(r_k)]

SLIDE 18

Sequential IBIO read/write performance

[Figure: read/write throughput (GB/sec) vs. number of processes (2 to 16) for Local, IBIO and NFS]

! The chunk size is set to 64 MB for both IBIO and NFS to maximize throughput

IBIO achieves the same remote read/write performance as local read/write by using RDMA

Node specification:

  CPU            Intel Core i7-3770K (3.50 GHz x 4 cores)
  Memory         Cetus DDR3-1600 (16 GB)
  M/B            GIGABYTE GA-Z77X-UD5H
  SSD            Crucial m4 mSATA 256 GB CT256M4SSD3 x 8 (peak read: 500 MB/s, peak write: 260 MB/s each)
  SATA converter KOUTECH IO-ASS110 mSATA to 2.5" SATA Device Converter with Metal Frame
  RAID card      Adaptec RAID 7805Q ASR-7805Q Single x 1
  Interconnect   Mellanox FDR HCA (Model No.: MCX354A-FCBT)

SLIDE 19

Efficiency with increasing failure rates and checkpoint costs

[Figure: efficiency vs. scale factor (x failure rates, x L2 checkpoint cost; 1 to 100) for flat-buffer coordinated, flat-buffer uncoordinated, burst-buffer coordinated and burst-buffer uncoordinated checkpointing; annotations mark where the scaled MTBF corresponds to days, a day, 2-3 hours, and 1 hour]

  • Assuming there is no message logging overhead

With an MTBF of days or a day, there are no big efficiency differences. With an MTBF of a few hours, systems with burst buffers can still achieve high efficiency. Even with an MTBF of an hour, uncoordinated checkpointing can still achieve 70% efficiency.

Partial restart can decrease recovery time from burst-buffer and PFS checkpoints

SLIDE 20

Allowable message logging overhead

! The logging overhead must be relatively small, less than a few percent, when the MTBF is days or a day

  • With an MTBF of a few hours or an hour, very high message logging overheads are tolerated

⇒ Uncoordinated checkpointing can be more effective on future systems

Message logging overhead allowed in uncoordinated checkpointing to achieve a higher efficiency than coordinated checkpointing:

  Scale factor   Flat buffer   Burst buffer
  1              0.0232%       0.00435%
  2              0.0929%       0.0175%
  10             2.45%         0.468%
  50             84.5%         42.0%
  100            ≈ 100%        99.9%

SLIDE 21

Effect of improving storage performance

To see which storage level impacts efficiency, we increase the performance of level-1 and level-2 storage while keeping the MTBF at one hour

[Figure: efficiency vs. scale factor (x L1 performance, and x L2 performance; 1 to 20) for flat-buffer/burst-buffer, coordinated/uncoordinated checkpointing]

L1 performance improvement: improving level-1 storage performance does not impact efficiency for either flat-buffer or burst-buffer systems

L2 performance improvement: increasing the performance of the PFS does impact system efficiency

⇒ L2 C/R overhead is a major cause of degraded efficiency, so reducing the level-2 failure rate and improving level-2 C/R performance are critical on future systems

SLIDE 22

Summary: Towards extreme-scale resiliency

! Resilience APIs

  • Resilience APIs in MPI are critical for fast and transparent recovery in HPC applications
  • In-memory C/R by FMI incurs only a 28% overhead even with a high failure rate
  • Software-level solutions may not be enough at extreme scale

! Resilient architecture

  • Burst buffers are beneficial for C/R at extreme scale
  • Uncoordinated C/R
  – When the MTBF is days or a day, uncoordinated C/R may not be effective
  – If the MTBF is a few hours or less, it will be effective
  • Level-2 failures and level-2 (PFS) performance
  – Reducing level-2 failures and increasing level-2 (PFS) performance are critical to improving overall system efficiency

SLIDE 23

Speaker: Kento Sato, Lawrence Livermore National Laboratory, kento@llnl.gov

External collaborators:

  Satoshi Matsuoka, Tokyo Tech
  Naoya Maruyama, RIKEN AICS

Q & A
