Optimizing Asynchronous Multi-Level Checkpoint/Restart - - PowerPoint PPT Presentation




SLIDE 1

Tonmoy Dey†, Kento Sato†2, Jian Guo†2, Bogdan Nicolae†3, Jens Domke†2, Weikuan Yu†, Franck Cappello†3, Kathryn Mohror†4

† Florida State University, USA
†2 RIKEN Center for Computational Science (R-CCS), Japan
†3 Argonne National Laboratory, USA
†4 Lawrence Livermore National Laboratory, USA

Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning

SLIDE 2

■ Checkpoint-and-Restart is a commonly used technique for long-running, large-scale applications that:
  § Writes a snapshot of the application at fixed intervals, and
  § On a failure, restarts the application from the last checkpoint
■ With the emergence of fast local storage, Multi-Level Checkpointing (MLC), in which checkpoints are written hierarchically, has become a common approach

Introduction

Checkpoint/Restart (C/R)

Multi-Level Checkpointing

[Figure: storage hierarchy over time — Level-1 checkpoints are written to node-local storage (LOCAL) and XOR-encoded groups at the L1 interval; Level-2 checkpoints are written to the parallel file system (PFS) at the L2 interval.]
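The hierarchical schedule above can be sketched in a few lines: fast Level-1 checkpoints are taken frequently, and every k-th checkpoint is promoted to the slower Level-2 storage. The parameter names and values here are illustrative, not the paper's actual configuration.

```python
# Sketch of a two-level checkpoint schedule (hypothetical parameters).
# Level-1 (fast, node-local) checkpoints are taken often; every k-th
# checkpoint is promoted to Level-2 (XOR groups / PFS).

def schedule(total_steps, l1_interval, l2_every):
    """Return a list of (step, level) checkpoint events."""
    events = []
    count = 0
    for step in range(l1_interval, total_steps + 1, l1_interval):
        count += 1
        # Every l2_every-th checkpoint goes to the slower level.
        level = 2 if count % l2_every == 0 else 1
        events.append((step, level))
    return events

events = schedule(total_steps=100, l1_interval=10, l2_every=5)
# Steps 50 and 100 become Level-2 checkpoints; the rest are Level-1.
```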

SLIDE 3

■ Determining the optimal checkpoint configuration is crucial for efficient checkpointing; however, finding this optimal configuration is complicated
■ There is a tradeoff in finding the optimal configuration:
  § Frequent checkpoints: more I/O time is spent on checkpointing
  § Infrequent checkpoints: more useful computation is lost on a failure

Background and Motivation

Optimal Checkpoint Configuration

Frequent checkpoint: more resilient but … huge overhead
Infrequent checkpoint: low overhead but … less resilient
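As a point of reference for this tradeoff, the classical Young approximation gives a closed-form optimal interval for a *single-level* model, T_opt ≈ √(2 · C · MTBF), balancing checkpoint cost C against lost work; the slides target the harder multi-level case where no such simple formula applies. The numbers below are hypothetical.

```python
import math

# Classical single-level baseline: Young's approximation
# T_opt ≈ sqrt(2 * C * MTBF), where C is the checkpoint cost
# and MTBF the mean time between failures.

def young_interval(ckpt_cost_s, mtbf_s):
    return math.sqrt(2.0 * ckpt_cost_s * mtbf_s)

# Hypothetical numbers: 60 s checkpoint cost, 24 h MTBF.
t_opt = young_interval(60.0, 24 * 3600.0)   # roughly 3220 s
```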

SLIDE 4

■ There are two existing approaches to determine the checkpoint configuration:
  § Approach 1: Modeling checkpointing behaviors
    § Execution states are categorized into compute, checkpoint, and recovery states. This approach works well for simpler checkpoint models but is significantly harder to implement for complex systems
  § Approach 2: Simulation for optimal checkpointing
    § The simulation approach is much more accurate than the modeling approach; however, it takes a very long time to find optimal checkpoint configurations
■ In this paper, we obtain the optimal checkpoint configuration for a given HPC system by exploiting the effectiveness and accuracy of the simulation approach, combining it with machine learning models to avoid the time taken by simulation to obtain the optimal result

Background and Motivation

Approaches to Determine Optimal Configuration

SLIDE 5

■ Apply various AI techniques to learn checkpoint schemes for different C/R scenarios. There are two distinct ways to achieve this:
  § Machine Learning (ML) Model: Use existing machine learning models on the simulated dataset to see how well they learn
  § Neural Network (NN) Model: Build our own neural network to see how well it can learn and predict the optimal configuration
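The ML-model path above amounts to supervised regression over simulator output. A minimal sketch of the idea, with entirely hypothetical feature names and training rows standing in for the paper's simulated dataset:

```python
# Sketch: train an off-the-shelf ML model on rows produced by a C/R
# simulator, mapping a failure/overhead scenario to the best Level-1
# checkpoint count found by simulation. All values are hypothetical.
from sklearn.ensemble import RandomForestRegressor

# Each row: [l1_overhead_s, l2_overhead_s, failure_rate_per_h]
X = [[5, 60, 0.01], [5, 60, 0.05], [10, 120, 0.01], [10, 120, 0.05]]
y = [8, 16, 6, 12]   # best L1 checkpoint count per simulated scenario

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
# Predict the L1 count for an unseen scenario.
pred = model.predict([[7, 90, 0.03]])[0]
```

Once trained, the model replaces a full simulator sweep with a single cheap prediction per scenario.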

Design and Implementation

Combine simulation with Machine Learning

[Figure: different C/R scenarios feed an ML model (a random forest of decision trees T1, T2, …, Tn) that approximates the optimal L1 checkpoint interval, L1 checkpoint count, and efficiency.]

SLIDE 6

■ The simulator has been developed to replicate the behavior of real-world scenarios when using three-level checkpointing for large-scale systems
■ The simulator is provided with three critical parameters for each level: checkpoint overhead, checkpoint restart time, and failure rate
■ The simulator uses these parameters to provide the user with the elapsed time and the efficiency (% of time spent on useful computation) of the system
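A heavily simplified, single-level sketch of that simulator idea follows: segments of computation either complete (paying checkpoint overhead) or are hit by a failure (losing work), and efficiency is useful compute divided by elapsed time. All parameters are hypothetical and this is far cruder than the paper's three-level simulator.

```python
import math
import random

# Minimal single-level sketch of a C/R simulator: estimate the
# fraction of wall time spent on useful computation, given a
# checkpoint interval, checkpoint overhead, and failure rate.

def simulate_efficiency(compute_s, l1_interval_s, l1_overhead_s,
                        failure_rate_per_s, seed=0):
    rng = random.Random(seed)
    done = 0.0       # useful compute completed (s)
    elapsed = 0.0    # wall-clock time (s)
    while done < compute_s:
        segment = min(l1_interval_s, compute_s - done)
        # Probability a failure hits this segment (exponential model).
        p_fail = 1.0 - math.exp(-failure_rate_per_s * segment)
        if rng.random() < p_fail:
            elapsed += segment / 2.0   # lose half the segment on average
        else:
            done += segment
            elapsed += segment + l1_overhead_s
    return done / elapsed

eff = simulate_efficiency(compute_s=3600, l1_interval_s=300,
                          l1_overhead_s=10, failure_rate_per_s=1e-5)
```

Sweeping such a simulator over many (interval, overhead, failure-rate) combinations is what makes the brute-force search slow, and is exactly the cost the ML models are meant to amortize.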

Design and Implementation

Simulation

SLIDE 7

§ Daisy Chaining: Feed the output of the checkpoint count prediction as an input to the neural network for checkpoint interval prediction
§ Parameter Optimization/Reduction: Remove interdependent, redundant parameters
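The daisy-chaining step can be sketched as follows: the count predicted by the first model is appended to the scenario features before the second model predicts the interval. The stand-in models and feature values here are purely illustrative, not the paper's trained networks.

```python
# Sketch of daisy chaining (hypothetical models and features): the
# predicted checkpoint count becomes an extra input feature for the
# checkpoint interval predictor.

def predict_daisy_chained(scenario, count_model, interval_model):
    count = count_model(scenario)
    interval = interval_model(scenario + [count])
    return count, interval

# Toy stand-in models for illustration only.
count_model = lambda s: round(s[0] * 2)           # toy rule
interval_model = lambda s: 600.0 / max(s[-1], 1)  # toy rule

count, interval = predict_daisy_chained([4.0, 0.01],
                                        count_model, interval_model)
```

The design rationale is that the two targets are interdependent: knowing how many L1 checkpoints fit between L2 checkpoints constrains what interval makes sense, so exposing the first prediction to the second model gives it strictly more information.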

Design and Implementation

Model Optimization

[Figure: daisy chaining — different C/R scenarios feed the C/R NN model and a random forest of decision trees T1, T2, …, Tn; the predicted optimal checkpoint count is daisy-chained into the prediction of the optimal checkpoint interval.]

SLIDE 8

§ For a three-level checkpoint model, the neural network showed better performance, with an accuracy improvement of 19% to 51% compared to the machine learning models.

Evaluation

Neural Network vs Machine Learning

[Chart: Neural Network performance improvement vs. Machine Learning models]

Model                       Count_L1   Count_L2
Random Forest               29.33%     30.39%
Gaussian Naïve Bayes        51.05%     33.12%
Support Vector Clustering   24.11%     19.18%

SLIDE 9

§ We present an idea to combine the simulation approach with machine learning models to determine the optimized parameter values for different C/R configurations
§ We show that our models can predict the optimized parameter values when trained with the simulation approach
§ We also demonstrate that techniques such as neural networks can improve performance over the machine learning models, with the neural network sometimes exceeding the performance of a machine learning model by 50%

Conclusion

SLIDE 10

Contact Information

Name              Contact
Tonmoy Dey        td18d@my.fsu.edu
Kento Sato        kento.sato@riken.jp
Bogdan Nicolae    bnicolae@anl.gov
Jian Guo          jian.guo@riken.jp
Jens Domke        jens.domke@riken.jp
Weikuan Yu        yuw@cs.fsu.edu
Franck Cappello   cappello@mcs.anl.gov
Kathryn Mohror    mohror1@llnl.gov