A Mixture of Experts Approach for Runtime Mapping in Dynamic - - PowerPoint PPT Presentation
A Mixture of Experts Approach for Runtime Mapping in Dynamic Environments
Murali Emani, School of Informatics, University of Edinburgh
2
Modern computing hardware
Diverse Stochastic Evolving
3
Parallelism Mapping
Program Hardware
Computation Steps
4
Parallelism Mapping
Hardware Workloads Software Data
Program Hardware
5
Parallelism Mapping
Hardware Workloads Software Data
Program Hardware
Program performance is sensitive to the environment
6
What exactly is the problem?
Optimal partitioning of the parallel work is not static and is non-trivial
7
What exactly is the problem?
Existing approaches are based on a one-size-fits-all policy
8
What exactly is the problem?
Existing approaches are based on a one-size-fits-all policy
➔ Not suitable for dynamic environments
➔ Hard to extend and update
9
➔ Determine optimal resources for a parallel program
Avoid under-subscription / over-subscription
➔ Enable program auto-tuning
Adapt smartly to varying resources
➔ Program and platform aware
Generic and portable
Goals
10
Where does it fit in the stack?
[Stack: Application / Runtime / Operating System / Hardware]
11
State Space
12
Idea
[Diagram: state space partitioned into sets, each with its own expert E1 .. Ek]
➔ Identify the best mapping policy in each set
➔ Collect these policies
➔ Choose the best policy based on the current state
17
Mixture of Experts based Mapping
➔ Ensemble of experts (mapping policies)
➔ Smart way to select the best expert at runtime
➔ Combine offline prior models with online learning
18
Mixture of Experts based Mapping
[Diagram: Expert 1 .. Expert k, each mapping the environment to a # threads]
How to select the best expert? It is expensive to evaluate the # threads of all experts.
➔ Use an environment predictor
22
Predictive Modelling
Thread predictor: what is the best # threads?
Environment predictor: what should the environment look like?
23
Predictive Modelling
Input feature vector: f = (c, e) = < code features, environment features >
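The feature vector f = (c, e) can be sketched as a simple concatenation of static code features and dynamic environment features. The field names below are illustrative assumptions drawn from the features slide, not the talk's exact schema:

```python
# Sketch: building the input feature vector f = (c, e).
# Feature names are illustrative assumptions, not the exact schema of the talk.

def make_feature_vector(code, env):
    """Concatenate static code features (c) with dynamic environment features (e)."""
    c = [code["instructions"], code["branches"], code["loads_stores"]]
    e = [env["workload_threads"], env["processors"], env["run_queue_size"],
         env["cpu_load"], env["page_free_rate"], env["cached_memory"]]
    return c + e  # f = (c, e)

code = {"instructions": 1.2e9, "branches": 2.1e8, "loads_stores": 4.0e8}
env = {"workload_threads": 8, "processors": 32, "run_queue_size": 5,
       "cpu_load": 0.4, "page_free_rate": 120.0, "cached_memory": 0.6}
f = make_feature_vector(code, env)
```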
24
Approach – Machine Learning
25
Approach – Machine Learning
➔ Hand-crafted solutions infeasible
➔ Train offline, deploy online
➔ Supervised learning, cross-validated
➔ Trained on NAS, evaluated on additional benchmarks
[Pipeline: Training data → Data pre-processing → Learning algorithm → Model; new input → Model → Prediction]
* Training overhead: one-off cost of 9216 experiments
27
Training phase
➔ Various configurations of program pairs and # threads: 9216 experiments, 3 weeks of runs, 1.1 GB of logs
➔ Feature-space dimensionality reduction via information gain: a rich subset of 10 of 154 features
➔ Linear regression models
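The information-gain feature ranking mentioned above can be sketched with a standard textbook computation on discretized features; the toy features and labels below are made up for illustration and are not the talk's actual pipeline:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a list of discrete labels."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(feature, labels):
    """Information gain: H(labels) - sum_v p(v) * H(labels | feature == v)."""
    n = len(labels)
    gain = entropy(labels)
    for v, cnt in Counter(feature).items():
        subset = [l for f, l in zip(feature, labels) if f == v]
        gain -= (cnt / n) * entropy(subset)
    return gain

# Rank features by information gain and keep the top ones, analogous to the
# 154 -> 10 reduction on the slide (toy discretized data, hypothetical names).
features = {
    "cpu_load": [0, 0, 1, 1],   # perfectly predicts the label
    "noise":    [0, 1, 0, 1],   # carries no information about the label
}
labels = ["low", "low", "high", "high"]
ranking = sorted(features, key=lambda k: info_gain(features[k], labels), reverse=True)
```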
28
Features
STATIC (code): # instructions, # branches, # load/store
DYNAMIC (environment): # workload threads, # processors, run queue size, CPU load, page free list rate, cached memory
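A thread predictor as a linear regression over such features (the training slide mentions linear regression models) might look like the minimal sketch below. The synthetic training data and the resulting coefficients are made up for illustration:

```python
import numpy as np

# Sketch: a linear-regression thread predictor over environment features.
# Synthetic training data: the best thread count drops as external workload
# rises (values are illustrative, not from the talk).

# Columns: [bias, # processors, # external workload threads]
X = np.array([[1, 32,  0],
              [1, 32,  8],
              [1, 32, 16],
              [1, 32, 24]], dtype=float)
y = np.array([32.0, 24.0, 16.0, 8.0])  # observed best # threads

# Ordinary least squares fit.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_threads(processors, workload_threads):
    """Predict the best thread count, clamped to at least 1."""
    t = w @ np.array([1.0, processors, workload_threads])
    return max(1, round(t))
```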
29
How to select the best expert?
Online Expert Selector: select expert 'k'
30
How to select the best expert?
Use the 'Environment predictor' as a proxy to select the best mapping policy
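One plausible reading of the proxy idea: each expert is associated with the environment it expects to perform best in, and the selector picks the expert whose expected environment is closest to the observed one, avoiding the cost of evaluating every expert directly. The expert set and the distance metric below are illustrative assumptions:

```python
# Sketch: online expert selection using the environment predictor as a proxy.
# Each expert carries an expected environment and a thread-mapping policy;
# the selector picks the expert nearest the observed environment.
# Expert definitions and the distance metric are illustrative assumptions.

def distance(e1, e2):
    """Euclidean distance between two environment feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(e1, e2)) ** 0.5

# Environment features: (cpu_load, # external workload threads)
experts = {
    "idle_machine":   {"expected_env": (0.1,  0), "threads": lambda env: 32},
    "light_workload": {"expected_env": (0.4,  8), "threads": lambda env: 16},
    "heavy_workload": {"expected_env": (0.9, 24), "threads": lambda env: 4},
}

def select_expert(observed_env):
    """Pick the expert whose expected environment best matches the observed one."""
    return min(experts, key=lambda k: distance(experts[k]["expected_env"], observed_env))

best = select_expert((0.85, 20))
threads = experts[best]["threads"]((0.85, 20))
```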
31
All put together...
32
How many experts?
33
How many experts?
Open question
34
Started with 4 experts
35
Workloads: small (light), large (heavy)
Hardware: low, high frequency
Platform: 32-core Intel Xeon
Benchmarks: NAS, SpecOMP, Parsec (OpenMP)
Comparison: OpenMP default, Online, Offline, Analytic
Evaluation
Online: A. Raman, A. Zaks, J. W. Lee, and D. I. August, "Parcae: A System for Flexible Parallel Execution," PLDI '12.
Offline: M. K. Emani, Z. Wang, and M. O'Boyle, "Smart, Adaptive Mapping of Parallelism in the Presence of External Workload," CGO '13.
Analytic: S. Sridharan, G. Gupta, and G. S. Sohi, "Adaptive, Efficient, Parallel Execution of Parallel Programs," PLDI '14.
36
Results
1.17x over Analytic, 1.26x over Offline, 1.38x over Online
37
Why multiple experts? Why not a single model?
[Diagram: experts E1 .. Ek vs a single model M]
38
Why multiple experts? Why not a single model?
Multiple experts outperform a single model
39
Can this approach be used with other optimization techniques?
40
Can this approach be used with other optimization techniques?
Affinity-based scheduling
41
To sum up...
Developed an approach for smart parallelism mapping
➔ Adaptive to the dynamic environment
➔ Predictive modelling at its heart
➔ Environment predictor as a proxy to select the best mapping policy
42
What next?
➔ Integrating this concept in CnC
➔ Focus on the tuning component
➔ Runtime and application tuning
➔ Dynamic partitioning of resources to steps
43
Idea
Instances of computations (steps)
Step 1 Step 2 Step 3 Step 4
➔ Varying resource requirements for steps
➔ Mapping depends on when data is ready
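The dynamic-partitioning idea above could be sketched as a greedy assignment of available cores to steps as their input data becomes ready. The talk only outlines the idea, so the step names and core demands below are hypothetical:

```python
# Sketch: greedily partition available cores among computation steps whose
# input data is ready. Step names and core demands are hypothetical.

def assign_cores(steps, total_cores):
    """Give each ready step its demanded cores, in order, until cores run out."""
    assignment, free = {}, total_cores
    for name, demand, ready in steps:
        if ready and demand <= free:
            assignment[name] = demand
            free -= demand
    return assignment

steps = [("step1", 8, True), ("step2", 16, True),
         ("step3", 16, False),   # data not ready yet -> not mapped
         ("step4", 8, True)]
mapping = assign_cores(steps, total_cores=32)
```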
44
Take away
➔ One-size-fits-none
➔ A bag of multiple policies is more practical than one
➔ Machine learning can help!
Thank you
Murali Emani University of Edinburgh
m.k.emani@sms.ed.ac.uk
45
Backup
46
Adaptive Parallelism Mapping
➔ Program performance is sensitive to the environment
Target: hardware, inherent behavior, software, data
- Large number of components
- Increased chances of failure
- Various characteristics: compute/memory/disk bound
- Recurring upgrades
- Version compatibility
- Varying amount of I/O
- Scalability issues
47
48
All experts use the same features; the features' relative importance varies across experts.
49
50
Workloads: small (light), large (heavy)
Hardware: low, high frequency
Platform: 32-core Intel Xeon, 4 one-socket nodes, 8 cores/socket, Linux 3.7.10 kernel
Compiler: gcc 4.6 "-O3 -fopenmp"
Benchmarks: NAS, SpecOMP, Parsec (OpenMP)
Comparison: OpenMP default, Online, Offline, Analytic
Evaluation
Online: A. Raman, A. Zaks, J. W. Lee, and D. I. August, "Parcae: A System for Flexible Parallel Execution," PLDI '12.
Offline: M. K. Emani, Z. Wang, and M. O'Boyle, "Smart, Adaptive Mapping of Parallelism in the Presence of External Workload," CGO '13.
Analytic: S. Sridharan, G. Gupta, and G. S. Sohi, "Adaptive, Efficient, Parallel Execution of Parallel Programs," PLDI '14.
51
Graceful addition of experts