Optimizing Asynchronous Multi-Level Checkpoint/Restart - - PowerPoint PPT Presentation




SLIDE 1

Tonmoy Dey†, Kento Sato†2, Jian Guo†2, Bogdan Nicolae†3, Jens Domke†2, Weikuan Yu†, Franck Cappello†3, Kathryn Mohror†4

† Florida State University, USA
†2 RIKEN Center for Computational Science (R-CCS), Japan
†3 Argonne National Laboratory, USA
†4 Lawrence Livermore National Laboratory, USA

Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning

SLIDE 2

■ Checkpoint-and-Restart is a commonly used technique for long-running, large-scale applications that:
  § Writes a snapshot of the application at fixed intervals, and
  § On a failure, restarts the application from the last checkpoint
■ With the emergence of fast local storage, Multi-Level Checkpointing (MLC), in which checkpoints are written hierarchically, has become a common approach

Introduction

Checkpoint/Restart (C/R)

Multi-Level Checkpointing

[Figure: storage hierarchy over time — Level-1 checkpoints are written to node-local storage (LOCAL) and XOR-encoded groups at the L1 interval; Level-2 checkpoints are written to the parallel file system (PFS) at the L2 interval.]
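The hierarchical schedule above can be sketched in a few lines: fast Level-1 checkpoints are taken frequently, and every k-th checkpoint is promoted to the slower Level-2 storage. The parameter names and values here are illustrative, not the paper's actual configuration.

```python
# Sketch of a two-level checkpoint schedule (hypothetical parameters).
# Level-1 (fast, node-local) checkpoints are taken often; every k-th
# checkpoint is promoted to Level-2 (XOR groups / PFS).

def schedule(total_steps, l1_interval, l2_every):
    """Return a list of (step, level) checkpoint events."""
    events = []
    count = 0
    for step in range(l1_interval, total_steps + 1, l1_interval):
        count += 1
        # Every l2_every-th checkpoint goes to the slower level.
        level = 2 if count % l2_every == 0 else 1
        events.append((step, level))
    return events

events = schedule(total_steps=100, l1_interval=10, l2_every=5)
# Steps 50 and 100 become Level-2 checkpoints; the rest are Level-1.
```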

SLIDE 3

■ Determining the optimal checkpoint configuration is crucial for efficient checkpointing; however, finding this optimal configuration is complicated
■ There is a tradeoff in finding the optimal configuration:
  § Frequent checkpoints: more I/O time is spent on checkpointing
  § Infrequent checkpoints: more useful computation is lost on a failure

Background and Motivation

Optimal Checkpoint Configuration

Frequent checkpoint: more resilient but … huge overhead
Infrequent checkpoint: low overhead but … less resilient
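As a point of reference for this tradeoff, the classical Young approximation gives a closed-form optimal interval for a *single-level* model, T_opt ≈ √(2 · C · MTBF), balancing checkpoint cost C against lost work; the slides target the harder multi-level case where no such simple formula applies. The numbers below are hypothetical.

```python
import math

# Classical single-level baseline: Young's approximation
# T_opt ≈ sqrt(2 * C * MTBF), where C is the checkpoint cost
# and MTBF the mean time between failures.

def young_interval(ckpt_cost_s, mtbf_s):
    return math.sqrt(2.0 * ckpt_cost_s * mtbf_s)

# Hypothetical numbers: 60 s checkpoint cost, 24 h MTBF.
t_opt = young_interval(60.0, 24 * 3600.0)   # roughly 3220 s
```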

SLIDE 4

■ There are two existing approaches to determine the checkpoint configuration:
  § Approach 1: Modeling checkpointing behaviors
    § Execution states are categorized into compute, checkpoint, and recovery states. This approach works well for simpler checkpoint models but is significantly harder to implement for complex systems
  § Approach 2: Simulation for optimal checkpointing
    § The simulation approach is much more accurate than the modeling approach; however, it takes a very long time to find optimal checkpoint configurations
■ In this paper, we obtain the optimal checkpoint configuration for a given HPC system by exploiting the effectiveness and accuracy of the simulation approach, combining it with machine learning models to avoid the time taken by simulation to obtain the optimal result

Background and Motivation

Approaches to Determine Optimal Configuration

SLIDE 5

■ Apply various AI techniques to learn checkpoint schemes for different C/R scenarios. There are two distinct ways to achieve this:
  § Machine Learning (ML) Model: Use existing machine learning models on the simulated dataset to see how well they learn
  § Neural Network (NN) Model: Build our own neural network to see how well it can learn and predict the optimal configuration
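The ML-model path above amounts to supervised regression over simulator output. A minimal sketch of the idea, with entirely hypothetical feature names and training rows standing in for the paper's simulated dataset:

```python
# Sketch: train an off-the-shelf ML model on rows produced by a C/R
# simulator, mapping a failure/overhead scenario to the best Level-1
# checkpoint count found by simulation. All values are hypothetical.
from sklearn.ensemble import RandomForestRegressor

# Each row: [l1_overhead_s, l2_overhead_s, failure_rate_per_h]
X = [[5, 60, 0.01], [5, 60, 0.05], [10, 120, 0.01], [10, 120, 0.05]]
y = [8, 16, 6, 12]   # best L1 checkpoint count per simulated scenario

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
# Predict the L1 count for an unseen scenario.
pred = model.predict([[7, 90, 0.03]])[0]
```

Once trained, the model replaces a full simulator sweep with a single cheap prediction per scenario.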

Design and Implementation

Combine simulation with Machine Learning

[Figure: different C/R scenarios feed an ML model (a random forest of decision trees T1, T2, …, Tn) that approximates the optimal L1 checkpoint interval, L1 checkpoint count, and efficiency.]

SLIDE 6

■ The simulator has been developed to replicate the behavior of real-world scenarios when using three-level checkpointing for large-scale systems
■ The simulator is provided with three critical parameters for each level: checkpoint overhead, checkpoint restart time, and failure rate
■ The simulator uses these parameters to provide the user with the elapsed time and the efficiency (% of time spent on useful computation) of the system
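A heavily simplified, single-level sketch of that simulator idea follows: segments of computation either complete (paying checkpoint overhead) or are hit by a failure (losing work), and efficiency is useful compute divided by elapsed time. All parameters are hypothetical and this is far cruder than the paper's three-level simulator.

```python
import math
import random

# Minimal single-level sketch of a C/R simulator: estimate the
# fraction of wall time spent on useful computation, given a
# checkpoint interval, checkpoint overhead, and failure rate.

def simulate_efficiency(compute_s, l1_interval_s, l1_overhead_s,
                        failure_rate_per_s, seed=0):
    rng = random.Random(seed)
    done = 0.0       # useful compute completed (s)
    elapsed = 0.0    # wall-clock time (s)
    while done < compute_s:
        segment = min(l1_interval_s, compute_s - done)
        # Probability a failure hits this segment (exponential model).
        p_fail = 1.0 - math.exp(-failure_rate_per_s * segment)
        if rng.random() < p_fail:
            elapsed += segment / 2.0   # lose half the segment on average
        else:
            done += segment
            elapsed += segment + l1_overhead_s
    return done / elapsed

eff = simulate_efficiency(compute_s=3600, l1_interval_s=300,
                          l1_overhead_s=10, failure_rate_per_s=1e-5)
```

Sweeping such a simulator over many (interval, overhead, failure-rate) combinations is what makes the brute-force search slow, and is exactly the cost the ML models are meant to amortize.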

Design and Implementation

Simulation

SLIDE 7

§ Daisy Chaining: Feed the output of the checkpoint count prediction as an input to the neural network for checkpoint interval prediction
§ Parameter Optimization/Reduction: Remove interdependent, redundant parameters
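The daisy-chaining step can be sketched as follows: the count predicted by the first model is appended to the scenario features before the second model predicts the interval. The stand-in models and feature values here are purely illustrative, not the paper's trained networks.

```python
# Sketch of daisy chaining (hypothetical models and features): the
# predicted checkpoint count becomes an extra input feature for the
# checkpoint interval predictor.

def predict_daisy_chained(scenario, count_model, interval_model):
    count = count_model(scenario)
    interval = interval_model(scenario + [count])
    return count, interval

# Toy stand-in models for illustration only.
count_model = lambda s: round(s[0] * 2)           # toy rule
interval_model = lambda s: 600.0 / max(s[-1], 1)  # toy rule

count, interval = predict_daisy_chained([4.0, 0.01],
                                        count_model, interval_model)
```

The design rationale is that the two targets are interdependent: knowing how many L1 checkpoints fit between L2 checkpoints constrains what interval makes sense, so exposing the first prediction to the second model gives it strictly more information.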

Design and Implementation

Model Optimization

[Figure: daisy chaining — different C/R scenarios feed the C/R NN model and a random forest of decision trees T1, T2, …, Tn; the predicted optimal checkpoint count is daisy-chained into the prediction of the optimal checkpoint interval.]

SLIDE 8

§ For a three-level checkpoint model, the neural network showed better performance, with an accuracy improvement of 19% to 51% compared to the machine learning models.

Evaluation

Neural Network vs Machine Learning

[Chart: Neural Network performance improvement vs. Machine Learning models]

Model                       Count_L1   Count_L2
Random Forest               29.33%     30.39%
Gaussian Naïve Bayes        51.05%     33.12%
Support Vector Clustering   24.11%     19.18%

SLIDE 9

§ We present an idea to combine the simulation approach with machine learning models to determine the optimized parameter values for different C/R configurations
§ We show that our models can predict the optimized parameter values when trained with the simulation approach
§ We also demonstrate that techniques such as neural networks can improve performance over the machine learning models, with the neural network sometimes exceeding the performance of a machine learning model by 50%

Conclusion

SLIDE 10

Contact Information

Name              Contact
Tonmoy Dey        td18d@my.fsu.edu
Kento Sato        kento.sato@riken.jp
Bogdan Nicolae    bnicolae@anl.gov
Jian Guo          jian.guo@riken.jp
Jens Domke        jens.domke@riken.jp
Weikuan Yu        yuw@cs.fsu.edu
Franck Cappello   cappello@mcs.anl.gov
Kathryn Mohror    mohror1@llnl.gov