Applying Machine Learning to Understand Write Performance of - PowerPoint PPT Presentation

Applying Machine Learning to Understand Write Performance of Large-scale Parallel Filesystems
Presented by Bing Xie
Bing Xie, Zilong Tan, Philip Carns, Jeff Chase, Kevin Harms, Jay Lofstead, Sarp Oral, Sudharshan S. Vazhkudai, Feiyi Wang
ORNL is managed by UT-Battelle, LLC for the US Department of Energy

Applying Machine Learning to Understand Write Performance of Large-scale Parallel Filesystems
• Problem
  – Understand the write performance of HPC applications running on large-scale systems
• Contribution
  – Built accurate ML models for predicting I/O write performance
  – Interpreted multi-stage write behaviors of large-scale I/O subsystems
• Impact
  – Demonstrated that ML can be applied to predict the write performance of large-scale I/O subsystems
  – Delivered a generic solution applicable to various large-scale I/O subsystems and technologies

Motivation: Reduce the Write Cost
• Configure write burst size/rate tradeoffs
• Guide I/O middleware (e.g., ROMIO) to adapt write patterns
• Inform system job schedulers to yield tighter/better estimates of I/O cost and application runtime

Related Works and Our Solution
• I/O performance studies
  – Profiling supercomputer I/O subsystems under production loads
  – Darshan toolkit
  – Statistical benchmarking
• I/O middleware systems
  – ROMIO, ADIOS
• ML in I/O performance prediction
  – Tune I/O parameters at application level
  – Learn I/O patterns from job logs and system monitoring data
• Our Solution
  – First ML work to predict write performance of large-scale parallel filesystems based on application write patterns, system architecture, and configurations

Typical Scientific Applications
• HPC codes compute for a long time at large scales
• Produce write bursts that stall application executions and impact application runtime
• A generic example: XGC
  – Evaluate physical equations iteratively over space: compute cost is predictable
  – 4 types of bursts with different write frequencies and burst sizes:
    • state snapshots: 500 MB to 1.2 GB
    • diagnostic analysis bursts: 1 MB to 400 MB
    • Bursts are stored as independent files
  – Write stalls comprise 7-20% of run time

Target I/O systems
• Titan and Spider 2 at OLCF/ORNL
  – Cray XK7
  – Lustre filesystem
• Cetus and Mira-FS1 at ALCF/ANL
  – IBM Blue Gene/Q
  – GPFS filesystem
[Figure: system architecture; labels: Supercomputer, Client, SAN, Storage System, Metadata Server, Server, Target]

Challenges
• High performance variability
• Limited filesystem visibility for end-users

High Performance Variability
1. CDFs of write performance variations on Titan and Cetus.
2. The x-axis represents the relative measures (max/min) of the write bandwidths of the experiment data (IOR benchmarks).
3. Write performance on Titan and Cetus is highly variable.
[Figure: CDF curves for Cetus and Titan; x-axis: Max/Min, 0 to 30; y-axis: CDF, 0 to 1]
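The max/min measure on this slide can be illustrated with a short sketch. It assumes that repeated runs of the same benchmark configuration are grouped together and that the ratio is taken within each group; the grouping key, function names, and sample values are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def max_min_ratios(samples):
    """Group (config, bandwidth) pairs by config and return max/min per group.

    Groups with a single measurement are skipped, since they carry no
    information about variability.
    """
    groups = defaultdict(list)
    for config, bandwidth in samples:
        groups[config].append(bandwidth)
    return {c: max(v) / min(v) for c, v in groups.items() if len(v) > 1}


def empirical_cdf(values):
    """Return (value, cumulative fraction) pairs in ascending order of value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


# Hypothetical bandwidth measurements (MB/s) for two benchmark configurations.
runs = [("a", 100.0), ("a", 400.0), ("b", 50.0), ("b", 100.0), ("b", 75.0)]
ratios = max_min_ratios(runs)
cdf = empirical_cdf(ratios.values())
```

A ratio of 1 means every repetition achieved the same bandwidth; the further the CDF extends to the right, the more variable the system.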

Our Approach
• Highly variable, but reverts to mean over time
  – Model the mean performance
  – Effectively address the repeated I/O writes and aggregate impact
• Limited visibility for end users
  – Extract features from write patterns and system architecture and configurations
• Interference
  – Address noise as features
• ML solution
  – Convergence-guaranteed sampling method
  – Lasso models
  – Systematic ML methodology

End-to-end I/O Write Path
• Each Target is a RAID array.
• Striping example: Burst 0 is divided into blocks b0 to b7 of Stripe_size each; with Stripe_Count = 4 and Starting_OST = 23, the blocks are striped across Servers 23 to 26 and Targets 23 to 26.
[Figure: end-to-end write path; labels: Titan, Client, SAN, Spider 2 (Atlas1 and 2), Metadata Server, Server, Target]
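The striping example on this slide can be sketched as a round-robin mapping from blocks to targets. This is a simplified illustration: it assumes the stripe targets are numbered consecutively from Starting_OST, as drawn in the figure, and the 1 MiB stripe size in the usage example is an assumed value that the slide does not state. The function name is hypothetical.

```python
def stripe_layout(burst_bytes, stripe_size, stripe_count, starting_ost):
    """Map each stripe-sized block of a burst to a target index, round-robin.

    Assumes consecutive target indices starting at starting_ost (a
    simplification for illustration).
    """
    n_blocks = -(-burst_bytes // stripe_size)  # ceiling division
    layout = {}
    for block in range(n_blocks):
        ost = starting_ost + block % stripe_count
        layout.setdefault(ost, []).append(block)
    return layout


# Eight blocks (b0..b7) with Stripe_Count = 4 and Starting_OST = 23.
MIB = 2 ** 20
layout = stripe_layout(8 * MIB, MIB, stripe_count=4, starting_ost=23)
for ost, blocks in layout.items():
    print(f"Target {ost}: blocks {blocks}")
```

Under these assumptions each of the four targets receives two blocks, so the load is balanced; a burst that is not a multiple of Stripe_Count blocks would leave some targets with one block more than others.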

Extract Features
• Insight: infer end-to-end burst absorption time based on performance-related parameters (write load, load skew, resources in use) at each stage
• Collectable performance-related parameters on Titan and Cetus
• Predictable performance-related parameters on Spider 2 and Mira-FS1
• Positive and inverse forms of performance-related parameters on separate stages, adjacent stages, and noise
• Titan/Spider 2: 41 features; Cetus/Mira-FS1: 30 features
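As a rough illustration of positive and inverse feature forms, the sketch below derives three parameters for one stage and emits each as x and 1/x. The concrete definitions are assumptions made for this example only: load as total bytes, skew as the largest per-component share, and resources as the number of components receiving data. The slides do not define these quantities, and the function and feature names are hypothetical.

```python
def stage_parameters(bytes_per_component):
    """Summarize one stage from the bytes each of its components receives.

    The definitions of load, skew, and resources here are illustrative
    assumptions, not the definitions used by the authors.
    """
    in_use = [b for b in bytes_per_component if b > 0]
    if not in_use:
        raise ValueError("stage receives no data")
    return {
        "load": float(sum(in_use)),       # total bytes entering the stage
        "skew": float(max(in_use)),       # heaviest single component
        "resources": float(len(in_use)),  # components in use
    }


def positive_and_inverse(params, prefix):
    """Emit each parameter in positive form (x) and inverse form (1/x)."""
    features = {}
    for name, value in params.items():
        features[f"{prefix}_{name}"] = value
        features[f"{prefix}_inv_{name}"] = 1.0 / value
    return features


# Hypothetical stage with four targets, one of them idle.
params = stage_parameters([4, 4, 0, 8])
features = positive_and_inverse(params, prefix="target")
```

Including the inverse form lets a linear model express relationships such as time growing with load but shrinking with the number of resources in use.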

Systematic Machine Learning Approach
• Start from the candidate features.
• In each training set, build a Lasso model. For each model:
  1. Train the model with 10-fold cross validation.
  2. Evaluate the model by Mean Square Error (MSE).
• Search for the model with minimum MSE from the 255 Lasso models, each for 1 training set.
• The model found is the best model.
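The search described on this slide can be sketched in plain Python. Several points are assumptions rather than details from the slides: the candidate training sets are taken to be all non-empty combinations of write scales (255 equals 2^8 - 1, the number of non-empty subsets of eight items, but the slides do not say how the training sets are formed), the Lasso is fitted by a basic coordinate-descent loop with a fixed penalty `alpha`, and all function names are hypothetical.

```python
import itertools
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(X, y, alpha, n_iter=200):
    """Minimize (1/2n)*||y - Xw - b||^2 + alpha*||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered
    resid = [t - y_mean for t in y]  # residual for w = 0
    z = [sum(row[j] ** 2 for row in Xc) / n for j in range(p)]
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if z[j] == 0.0:
                continue  # constant feature carries no information
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + z[j] * w[j]
            new_w = soft_threshold(rho, alpha) / z[j]
            delta = new_w - w[j]
            if delta:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                w[j] = new_w
    b = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, b


def mse(model, X, y):
    """Mean squared error of a fitted (w, b) model on the given samples."""
    w, b = model
    preds = (b + sum(wj * xj for wj, xj in zip(w, row)) for row in X)
    return sum((pred - t) ** 2 for pred, t in zip(preds, y)) / len(y)


def cross_val_mse(X, y, alpha, k=10, seed=0):
    """Mean held-out MSE over k folds; assumes at least k samples."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    scores = []
    for fold in (idx[f::k] for f in range(k)):
        held = set(fold)
        train = [i for i in idx if i not in held]
        model = fit_lasso([X[i] for i in train], [y[i] for i in train], alpha)
        scores.append(mse(model, [X[i] for i in fold], [y[i] for i in fold]))
    return sum(scores) / k


def select_best(samples_by_scale, alpha=0.01, k=10):
    """Try every non-empty combination of scales; keep the lowest CV MSE."""
    def pool(subset):
        X = [row for s in subset for row in samples_by_scale[s][0]]
        y = [t for s in subset for t in samples_by_scale[s][1]]
        return X, y

    scales = sorted(samples_by_scale)
    best_score, best_subset = None, None
    for r in range(1, len(scales) + 1):
        for subset in itertools.combinations(scales, r):
            score = cross_val_mse(*pool(subset), alpha, k)
            if best_score is None or score < best_score:
                best_score, best_subset = score, subset
    return best_score, best_subset, fit_lasso(*pool(best_subset), alpha)
```

Calling `select_best` with a dictionary that maps each write scale to its `(X, y)` samples returns the winning cross-validated MSE, the scales in the winning training set, and the model refitted on that set. In practice a library implementation of Lasso would replace `fit_lasso`; the hand-written version only keeps the sketch dependency-free.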

Experiments
• Train models on a small-scale data set
  – 3,465 (Titan) and 4,715 (Cetus) converged samples collected with multiple IOR benchmarks on the scale of 1-128 compute nodes
• Evaluate models on medium scale
  – 668 (Titan) and 874 (Cetus) converged samples produced by 200-512 compute nodes
• Evaluation criteria
  – Accuracy of the best model
  – Effectiveness of features

Reported 4 models
• Lasso best
  – With minimum Mean Square Error from 255 Lasso models across the training set candidates
• Lasso base
  – The Lasso model trained on the write scales of 1-128 compute nodes
• Linear best
  – With minimum Mean Square Error from 255 Linear models across the training set candidates
• Linear base
  – The Linear model trained on the write scales of 1-128 compute nodes

Results on Titan and Cetus
[Figure: relative true error (y-axis, -1 to 1) of the Lasso best, Lasso base, Linear best, and Linear base models, with samples sorted by t (x-axis, unit: seconds). Panels: Titan/Spider 2 (Lustre) and Cetus/Mira-FS1 (GPFS), each with a test set of 200 and 256 nodes and a test set of 400 and 512 nodes.]

Results on Titan and Cetus
• Lasso best is highly accurate and the best model
[Figure: relative true error of the Lasso best, Lasso base, Linear best, and Linear base models on the Titan/Spider 2 (Lustre) and Cetus/Mira-FS1 (GPFS) test sets, with samples sorted by t (unit: seconds)]
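The plots report a relative true error per sample, sorted by t. The slides do not give the formula, so the sketch below assumes the common definition (predicted time minus measured time, divided by measured time); the function names and sample values are hypothetical.

```python
def relative_true_error(predicted, measured):
    """Signed relative error; assumed definition, not stated in the slides."""
    return (predicted - measured) / measured


def errors_sorted_by_time(pairs):
    """pairs: iterable of (measured_t, predicted_t) in seconds.

    Returns (measured_t, relative error) tuples ordered by measured_t,
    matching the x-axis ordering used in the plots.
    """
    return [(t, relative_true_error(p, t)) for t, p in sorted(pairs)]


# Hypothetical measured and predicted write times in seconds.
points = errors_sorted_by_time([(20.0, 15.0), (5.0, 6.0)])
```

Under this definition a positive value means the model overestimates the write time and a negative value means it underestimates it, which is consistent with the plots spanning -1 to 1.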

Conclusions
• Problem
  – Understand the I/O write performance of large-scale supercomputers
• Our Solution
  – Systematic ML approach with Lasso
  – Modeling the mean performance, extracting features from application write patterns, system architecture and configurations, convergence-guaranteed sampling
• Findings
  – Lasso best is the most accurate model for both Titan and Cetus
  – Most effective features are load skew in supercomputers and resources in use on the system side
• Applicability
  – Lasso models, features: Lustre, GPFS deployment
  – Systematic modeling method: generic supercomputer I/O subsystems

Acknowledgement
