Lecture 24: Machine Learning for HPC

High Performance Computing Systems (CMSC714) Lecture 24: Machine Learning for HPC Abhinav Bhatele, Department of Computer Science

Summary of last lecture
• Discrete-event simulations (DES)
• Parallel DES: conservative vs. optimistic
• Simulation of epidemic diffusion: agent-based, time-stepped modeling
• Trace-driven network simulations: model event sequences
Abhinav Bhatele, CMSC714

Why machine learning?
• Proliferation of performance data
• On-node hardware counters
• Switch/network port counters
• Power measurements
• Traces and profiles
• Supercomputing facilities' data
• Job queue logs, performance
• Sensors: temperature, humidity, power

Types of ML-related tasks in HPC
• Auto-tuning: parameter search
• Find a well-performing configuration
• Predictive models: time, energy, …
• Predict system state in the future
• Time-series analysis
• Identifying root causes/factors


Understanding network congestion
• Congestion and its root causes not well understood
• Study network hardware performance counters and their correlation with execution time
• Use supervised learning to identify hardware components that lead to congestion and performance degradation

Hardware resource      Contention indicator
Source node            Injection FIFO length
Network link           Number of sent packets
Intermediate router    Receive buffer length
All                    Number of hops (dilation)
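A minimal sketch of the counter-versus-execution-time correlation study, assuming synthetic data: the counter names are illustrative stand-ins for the contention indicators in the table, and the dependence of execution time on them is invented for the example, not taken from any measurement. It ranks counters by their rank correlation coefficient with execution time.

```python
import random

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def rank_correlation(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Synthetic per-run counter values (assumption: not real measurements).
rng = random.Random(0)
n_runs = 200
names = ["injection_fifo_length", "sent_packets",
         "receive_buffer_length", "avg_hops"]
counters = {name: [rng.uniform(0, 10) for _ in range(n_runs)] for name in names}

# Assumed ground truth for the example: time is driven mostly by the
# receive buffer length, weakly by hop count, plus noise.
exec_time = [
    10.0
    + 3.0 * counters["receive_buffer_length"][i]
    + 0.5 * counters["avg_hops"][i]
    + rng.gauss(0, 1)
    for i in range(n_runs)
]

# Rank counters by the strength of their monotonic relation to time.
ranking = sorted(
    ((name, rank_correlation(vals, exec_time)) for name, vals in counters.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
```

Ranking one counter at a time ignores interactions between counters, which is one reason a supervised model over combinations of features may be used instead.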

Investigating performance variability
[Figure: relative performance (y-axis, 1 to 3) of MILC, UMT, AMG, and miniVite over time (x-axis, Nov 29 to Apr 04)]
• Identify users to blame, important network counters
• Predict future performance based on historical time-series data

Identifying best performing code variants
• Many computational science and engineering (CSE) codes rely on solving sparse linear systems
• Many choices of numerical methods
• Optimal choice w.r.t. performance depends on several things:
  • Input data and its representation, algorithm and its implementation, hardware architecture
[Figure: example PDEs (−∆v = 1; −div(τ(u)) = 0; curl curl E + E = g; −grad(β div(F)) + γF = f; ⋯) fed through models that choose the preconditioner, linear solver, and platform]


Auto-tuning with limited training data
• Application performance depends on many factors:
  • Input parameters, algorithmic choices, runtime parameters
[Figure: Kripke, performance variation due to input parameters. Histogram of number of configurations (0 to 90) vs. execution time in seconds (axis ticks at 1, 10, 100, 1000)]


Auto-tuning with limited training data
• Application performance depends on many factors:
  • Input parameters, algorithmic choices, runtime parameters
• Performance also depends on:
  • Code changes, linked libraries
  • Compilers, architecture
• Surrogate models + transfer learning
[Figure: Quicksilver, performance variation due to external factors. Histogram of number of runs (0 to 70) vs. execution time in seconds (10 to 40)]
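A minimal sketch of surrogate-model auto-tuning with a small measurement budget. Everything here is an assumption for illustration: `measure` is a hypothetical stand-in for actually running the application, the two parameters (`tile`, `threads`) and their value ranges are invented, and the surrogate is a plain k-nearest-neighbor average. Transfer learning is not shown.

```python
import itertools
import random

def measure(config):
    """Hypothetical stand-in for running the application and timing it."""
    tile, threads = config
    return (tile - 32) ** 2 / 64.0 + (threads - 8) ** 2 / 4.0 + 1.0

def predict(config, observed, k=3):
    """Surrogate: mean measured time of the k closest measured configs (L1 distance)."""
    nearest = sorted(
        observed,
        key=lambda c: sum(abs(a - b) for a, b in zip(c, config)),
    )[:k]
    return sum(observed[c] for c in nearest) / len(nearest)

# Full configuration space: 5 x 5 = 25 candidates (illustrative values).
space = list(itertools.product([8, 16, 32, 64, 128], [1, 2, 4, 8, 16]))

# Start from a few random measurements.
rng = random.Random(1)
observed = {c: measure(c) for c in rng.sample(space, 5)}

# Spend the remaining budget where the surrogate predicts the lowest time.
for _ in range(5):
    candidates = [c for c in space if c not in observed]
    best_guess = min(candidates, key=lambda c: predict(c, observed))
    observed[best_guess] = measure(best_guess)

# Best configuration among the 10 that were actually measured.
best = min(observed, key=observed.get)
```

Only 10 of the 25 configurations are measured; the surrogate decides which ones. A purely greedy choice like this can get stuck near the initial samples, so practical tuners usually mix in some exploration.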

Questions: Identifying the Culprits behind Network Congestion
• Can you go over the differences between weak and strong models in ML?
• How is OS noise accounted for in these runs or is Blue Gene/Q truly noiseless?
• Why do you use R^2 and RCC?
• Was the goal of the paper to identify parameters that are important or generate accurate predictions? Seems like the parameters identified at the end were pretty reasonable to expect to be important in the first place.
• Why do the authors choose 5D torus network in particular?

Questions: Identifying the Culprits behind Network Congestion
• What about using deep learning techniques to do the nonparametric regression to predict the relation, since a lot of data samples have been collected?
• Why is the Huber loss preferred over L1 or L2 loss in this task? How is the value of delta selected?
• Why is it necessary to perform an exhaustive search over all possible feature combinations? If some features are useless, wouldn't they be automatically ignored by the machine learning models?
• Regarding the zero R^2 scores, what is meant by an "artifact" of scaling? If we scale the features in the same way for both the training and the testing set, why is there a problem? How does standardization differ from scaling? If standardization is better, why wasn't it used in this paper?
• What are the problems or limitations of selecting features according to the rank correlation between every single feature and the execution time?
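For reference on the Huber loss question, a small sketch of the standard definition: quadratic for residuals up to delta in magnitude and linear beyond that, so large residuals are penalized less heavily than under squared (L2) loss. The default `delta=1.0` is an arbitrary choice for the example, not the value used in the paper.

```python
def huber_loss(residual, delta=1.0):
    """Standard Huber loss for a single residual (prediction minus target)."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a            # quadratic region, same as half squared error
    return delta * (a - 0.5 * delta)  # linear region, grows like absolute error

# Compare half squared error, absolute error, and Huber loss per residual.
rows = [(r, 0.5 * r * r, abs(r), huber_loss(r)) for r in (0.5, 3.0, -10.0)]
# rows == [(0.5, 0.125, 0.5, 0.125),
#          (3.0, 4.5, 3.0, 2.5),
#          (-10.0, 50.0, 10.0, 9.5)]
```

The outlier residual of -10 contributes 50.0 under half squared error but only 9.5 under Huber loss with delta = 1, which illustrates the reduced sensitivity to outliers.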

Questions: Bootstrapping Parameter Space Exploration for Fast Tuning
• The mapping of the parameter space to an undirected graph was a little confusing, could you go over it?
• How does the label propagation routine choose 'prior beliefs'? How many labelled nodes are required for the results to converge?
• If the models are pre-run, how can you be sure to sample the configuration space (page 6, second column) properly? Does GEIST require an exhaustive search of the space?
• If the hyperparameters are set by the initial random sample, is it possible to start with an anomalous sample that reduces performance dramatically (especially with the 'hard optimization problems' where there are a few optimal solutions)?
• What dictates the choice of the number of iterations to perform in GEIST?

Questions: Bootstrapping Parameter Space Exploration for Fast Tuning
• How could we determine b_ik, which denotes the prior belief on associating node i with label k?
• How stable is the GEIST algorithm, given that labels are predicted and samples are chosen based on previous iterations, which can propagate errors?
• Can the autotuning problem be modeled as a regression task?
• How do we represent configurations as vectors so that the set of nearest neighbors can be determined by computing L1 distances?
• Is the prior belief term b_ik used in the GEIST algorithm? Although GEIST does not require prior knowledge, can we further reduce the number of samples to collect by incorporating expert knowledge through this term?
• How difficult is it to find suitable hyperparameters? Can this be time-consuming, because we might have to run GEIST multiple times with different hyperparameter settings?
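As an aid for the graph-mapping and L1-distance questions, here is a rough sketch of the general idea only, not the GEIST algorithm itself. It assumes each configuration is encoded as a vector of parameter-value indices, connects each configuration to its nearest neighbors in L1 distance to form an undirected graph, and spreads a "good configuration" score from a few measured nodes by repeated neighbor averaging. The function names, the 4 x 4 parameter grid, `k=4`, and the neutral starting value of 0.5 are all hypothetical choices.

```python
import itertools

def l1(a, b):
    """L1 (Manhattan) distance between two configuration vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_graph(configs, k):
    """Undirected graph linking each config to its k nearest configs in L1 distance."""
    edges = {i: set() for i in range(len(configs))}
    for i, c in enumerate(configs):
        others = sorted(
            (j for j in range(len(configs)) if j != i),
            key=lambda j: (l1(c, configs[j]), j),  # break distance ties by index
        )
        for j in others[:k]:
            edges[i].add(j)
            edges[j].add(i)
    return edges

def propagate(edges, known, iterations=50):
    """Spread scores from measured nodes to unmeasured ones by neighbor averaging."""
    belief = {i: known.get(i, 0.5) for i in edges}  # 0.5 = neutral starting belief
    for _ in range(iterations):
        updated = {}
        for i, neighbors in edges.items():
            if i in known:
                updated[i] = known[i]  # measured nodes keep their label
            else:
                updated[i] = sum(belief[j] for j in neighbors) / len(neighbors)
        belief = updated
    return belief

# Two parameters with four settings each, encoded by index.
configs = list(itertools.product(range(4), range(4)))
edges = knn_graph(configs, k=4)

# Pretend two configurations were measured: one good (1.0), one bad (0.0).
known = {configs.index((0, 0)): 1.0, configs.index((3, 3)): 0.0}
belief = propagate(edges, known)

# Next configuration to measure: the unmeasured node with the highest belief.
next_index = max((i for i in edges if i not in known), key=belief.get)
```

Unmeasured configurations close to the good sample tend to end up with higher scores, so they are the natural candidates to measure next.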

Questions? Abhinav Bhatele 5218 Brendan Iribe Center (IRB) / College Park, MD 20742 phone: 301.405.4507 / e-mail: bhatele@cs.umd.edu
