Lecture 24: Machine Learning for HPC
Abhinav Bhatele, Department of Computer Science
High Performance Computing Systems (CMSC714)
Summary of last lecture:
- Discrete-event simulations (DES)
- Parallel DES: conservative vs. optimistic
Hardware resource     Contention indicator
-------------------   --------------------------
Source node           Injection FIFO length
Network link          Number of sent packets
Intermediate router   Receive buffer length
All                   Number of hops (dilation)
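As a rough illustration of how such contention indicators can be used, the sketch below fits a regression model that predicts execution time from the four features listed above. The data, feature ranges, and model choice are hypothetical placeholders, not the models or dataset from the paper.

```python
# Hedged sketch: regress execution time on network contention features.
# All feature ranges and the synthetic target below are made up.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.uniform(0, 100, n),   # avg injection FIFO length (source nodes)
    rng.uniform(0, 1e6, n),   # number of packets sent on network links
    rng.uniform(0, 50, n),    # avg receive buffer length (routers)
    rng.integers(1, 10, n),   # avg number of hops (dilation)
])
# Synthetic target: execution time grows with each contention indicator
y = 10 + 0.05 * X[:, 0] + 1e-5 * X[:, 1] + 0.2 * X[:, 2] + 0.5 * X[:, 3]

model = GradientBoostingRegressor(random_state=0).fit(X[:150], y[:150])
print(model.score(X[150:], y[150:]))  # R^2 on held-out configurations
```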
[Figure: relative performance of MILC, AMG, UMT, and miniVite over time (Nov 29 to Apr 04)]
[Figure: given a platform and a PDE problem such as −∆𝑣 = 1, −div(𝜏(u)) = 0, or curl curl E + E = 𝑔, which preconditioner and linear solver should be chosen? The "??" box suggests models that account for the implementation and hardware architecture.]
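One way such a model could be realized is as a classifier that maps problem features to a (preconditioner, solver) choice. Everything in this sketch, including the features, labels, and solver names, is a hypothetical illustration rather than the lecture's actual approach.

```python
# Hedged sketch: a classifier selecting a (preconditioner, solver) pair.
# The feature encoding and labels below are invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# toy features: [is_SPD, is_indefinite, log10(problem size)] -- hypothetical
X = [[1, 0, 5], [1, 0, 7], [0, 1, 5], [0, 1, 6], [1, 0, 4], [0, 1, 7]]
y = ["AMG+CG", "AMG+CG", "ILU+GMRES", "ILU+GMRES", "Jacobi+CG", "ILU+GMRES"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[1, 0, 6]]))  # SPD system: likely an AMG/CG-style choice
```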
[Figure: Kripke: performance variation due to input parameters; execution time (s) vs. number of configurations]
[Figure: Quicksilver: performance variation due to external factors; execution time (s) vs. number of runs]
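Run-to-run variation of the kind shown for Quicksilver is often summarized with the coefficient of variation (standard deviation divided by mean). A minimal sketch with made-up timings:

```python
# Hedged sketch: summarizing run-to-run performance variability.
# The execution times below are invented for illustration.
import statistics

times = [42.1, 47.8, 39.5, 55.2, 44.0, 61.3, 43.7, 48.9]  # seconds per run
mean = statistics.mean(times)
cov = statistics.stdev(times) / mean  # coefficient of variation
print(f"mean = {mean:.1f} s, coefficient of variation = {cov:.1%}")
```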
Identifying the Culprits behind Network Congestion
Discussion questions:
- … relation, since a lot of data samples have been collected?
- … selected?
- If some features are useless, wouldn't they be automatically ignored by the machine learning models?
- … features in the same way for both the training and the testing set, why is there a problem? How does standardization differ from scaling? If standardization is better, why wasn't it used in this paper?
- … between every single feature and the execution time?
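The scaling-versus-standardization question can be made concrete: min-max scaling maps a feature into [0, 1] using the training minimum and maximum, while standardization subtracts the training mean and divides by the training standard deviation; in both cases the transform must be fit on the training set only and then applied unchanged to the test set. A minimal sketch with hypothetical data, using scikit-learn:

```python
# Hedged sketch: min-max scaling vs. standardization, fit on train only.
# The data points are invented for illustration.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[5.0]])  # lies outside the training range

mm = MinMaxScaler().fit(X_train)     # learns min=1, max=4 from train
std = StandardScaler().fit(X_train)  # learns mean=2.5, std from train

print(mm.transform(X_test))   # (5-1)/(4-1) = 1.33..., i.e. above 1.0
print(std.transform(X_test))  # z-score relative to the training mean/std
```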
Bootstrapping Parameter Space Exploration for Fast Tuning
Discussion questions:
- … based on previous iterations, leading to propagation of error.
- … determined by computing L1 distances?
- … prior knowledge, can we further reduce the number of samples to collect by incorporating expert knowledge through this term?
- … might have to run GEIST multiple times with different hyperparameter settings?
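For the L1-distance question, the sketch below measures closeness between parameter configurations with the L1 (Manhattan) distance. The configurations and the selection rule are simplified illustrations, not GEIST itself.

```python
# Hedged sketch: L1 distances between parameter configurations.
# Parameter names and values below are hypothetical.
import numpy as np

configs = np.array([
    [8, 2, 64],    # e.g. (groups, zones, block size) -- made-up parameters
    [8, 4, 64],
    [16, 2, 128],
    [32, 8, 256],
])
best = np.array([8, 2, 64])  # best configuration found so far

l1 = np.abs(configs - best).sum(axis=1)  # L1 distances: 0, 2, 72, 222
order = np.argsort(l1)  # explore the closest configurations next
print(l1, order)
```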
Abhinav Bhatele
5218 Brendan Iribe Center (IRB), College Park, MD 20742
Phone: 301.405.4507 / E-mail: bhatele@cs.umd.edu