 
              A Mixture of Experts Approach for Runtime Mapping in Dynamic Environments Murali Emani School of Informatics University of Edinburgh
Modern computing hardware: diverse, stochastic, evolving
Parallelism Mapping: Program, Computation Steps, Hardware
Parallelism Mapping: Program, Workloads, Hardware, Software, Data
➔ Program performance is sensitive to the environment
What exactly is the problem? Optimal partitioning of the parallel work is non-trivial and not static
What exactly is the problem? Existing approaches are based on a one-size-fits-all policy
➔ Not suitable for dynamic environments
➔ Hard to extend and update
Goals
➔ Determine optimal resources for a parallel program: avoid under-subscription / over-subscription
➔ Enable program auto-tuning: adapt smartly to varying resources
➔ Program and platform aware: generic and portable
Where does it fit in the stack? Application, Runtime, Operating System, Hardware
State Space
Idea ➔ Identify the best mapping policy in each set [Figure: state space split into sets labelled E_1, E_2, …, E_{k-1}, E_k]
Idea ➔ Collect these policies: E_1, E_2, …, E_{k-1}, E_k
Idea ➔ Choose the best policy (from E_1, E_2, …, E_{k-1}, E_k) based on the current state
Mixture of Experts based Mapping
➔ Ensemble of experts (mapping policies)
➔ Smart way to select the best expert at runtime
➔ Combine offline prior models with online learning
Mixture of Experts based Mapping
[Figure: Expert 1, Expert 2, …, Expert k, each producing # threads and an environment prediction]
How to select the best expert? Evaluating the # threads of all experts is expensive
➔ Environment predictor
Predictive Modelling
Thread predictor: what is the best # threads?
Environment predictor: what should the environment look like?
Input feature vector = < code, environment >, f = (c, e)
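The two predictors over f = (c, e) can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes linear models (the training slide names linear regression), and the feature choices, weights, and the names `LinearModel` and `feature_vector` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LinearModel:
    """y = bias + sum(w_i * f_i). Weights below are made up for illustration."""
    weights: Sequence[float]
    bias: float

    def predict(self, f: Sequence[float]) -> float:
        if len(f) != len(self.weights):
            raise ValueError("feature vector has the wrong length")
        return self.bias + sum(w * x for w, x in zip(self.weights, f))


def feature_vector(code: Sequence[float], env: Sequence[float]) -> List[float]:
    """Concatenate static code features c and dynamic environment features e."""
    return list(code) + list(env)


# Hypothetical features (assumption): c = (#instructions, #branches) in
# normalised units, e = (#workload threads, CPU load).
c = [0.5, 0.25]
e = [8.0, 4.0]
f = feature_vector(c, e)

# Thread predictor: what is the best # threads for this f?
thread_predictor = LinearModel(weights=[4.0, 0.0, -0.5, -1.0], bias=16.0)
# Environment predictor: simplified here to one scalar, the expected CPU load.
env_predictor = LinearModel(weights=[0.0, 0.0, 0.25, 0.5], bias=1.0)

n_threads = max(1, round(thread_predictor.predict(f)))
next_cpu_load = env_predictor.predict(f)
print(n_threads, next_cpu_load)
```

In this sketch both predictors share the same input f and differ only in their weights; a real environment predictor would presumably output several environment features rather than a single number.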
Approach – Machine Learning
➔ Hand-crafted solutions infeasible
[Figure: Data → Pre-processing → Training data → Learning algorithm → Model; New input → Model → Prediction]
➔ Train offline, deploy online
➔ Supervised learning, cross-validated
➔ Trained on NAS, evaluated on additional benchmarks
* Training overhead: one-off cost of 9216 experiments
Training phase
➔ Various configurations of program pairs and # threads: 9216 experiments; 3 weeks for runs; 1.1 GB log
➔ Feature space dimensionality reduction: information gain; 10 / 154 rich subset of features
➔ Linear regression models
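The information-gain step above (keeping 10 of 154 features) can be sketched as below. This is only an illustration under stated assumptions: features and the target (a bucketed best thread count) are taken to be already discretised, the toy data is invented, and `information_gain` and `top_features` are hypothetical helper names. It does not cover fitting the linear regression models.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values: Sequence[int], labels: Sequence[int]) -> float:
    """H(labels) - H(labels | feature) for one discrete feature column."""
    total = len(labels)
    groups: Dict[int, List[int]] = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_features(columns: Dict[str, Sequence[int]],
                 labels: Sequence[int], k: int) -> List[str]:
    """Rank feature names by information gain and keep the best k."""
    ranked = sorted(columns,
                    key=lambda name: information_gain(columns[name], labels),
                    reverse=True)
    return ranked[:k]


# Toy data (assumption): four runs, target = bucket of the best thread count.
best_threads_bucket = [0, 0, 1, 1]
columns = {
    "cpu_load": [0, 0, 1, 1],  # perfectly separates the two buckets
    "noise": [0, 1, 0, 1],     # carries no information about the target
}
selected = top_features(columns, best_threads_bucket, k=1)
print(selected)
```

With the deck's numbers, the same ranking would be applied to 154 candidate columns with k = 10.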
Features
STATIC (code): # instructions, # branches, # load/store
DYNAMIC (environment): # workload threads, # processors, run queue size, CPU load, page free list rate, cached memory
How to select the best expert
Online expert selector: select expert 'k'
➔ Use the 'Environment predictor' as a proxy to select the best mapping policy
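One plausible reading of the selector is sketched below. The slides do not spell out the selection rule, so this is an assumption: each expert predicts the environment as well as # threads, and the expert whose earlier environment prediction lies closest to the environment actually observed is trusted for the next thread decision. The `Expert` class, `select_expert`, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Expert:
    name: str
    predict_threads: Callable[[Sequence[float]], int]
    predict_env: Callable[[Sequence[float]], List[float]]


def squared_error(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Sum of squared differences between predicted and observed environment."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def select_expert(experts: Sequence[Expert],
                  previous_features: Sequence[float],
                  observed_env: Sequence[float]) -> Expert:
    """Pick the expert whose environment prediction, made from the previous
    features, is closest to the environment that was then observed."""
    return min(
        experts,
        key=lambda ex: squared_error(ex.predict_env(previous_features),
                                     observed_env),
    )


# Two toy experts (assumption): one tuned for light load, one for heavy load.
experts = [
    Expert("light", lambda f: 16, lambda f: [2.0]),
    Expert("heavy", lambda f: 4, lambda f: [30.0]),
]

previous_features = [0.5, 0.25, 8.0, 4.0]   # f = (c, e) at the last decision
observed_env = [28.0]                        # CPU load seen since then
chosen = select_expert(experts, previous_features, observed_env)

current_features = [0.5, 0.25, 24.0, 28.0]
n_threads = chosen.predict_threads(current_features)
print(chosen.name, n_threads)
```

The point of the proxy is cost: checking an environment prediction only needs an observation of the system, whereas judging every expert's thread count directly would mean running the program once per expert.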
All put together...
How many experts? Open question
Started with 4 experts
Evaluation
Platform: 32-core Intel Xeon
Benchmarks: NAS, SpecOMP, Parsec (OpenMP)
Comparison: OpenMP default, Online, Offline, Analytic
Workloads: small (light), large (heavy)
Hardware: low, high frequent
Online: "Parcae: a system for flexible parallel execution", A. Raman, A. Zaks, J. W. Lee, and D. I. August. PLDI'12
Offline: "Smart, Adaptive Mapping of Parallelism in the Presence of External Workload", Murali Krishna Emani, Zheng Wang and Michael O'Boyle. CGO'13
Analytic: "Adaptive, Efficient, Parallel Execution of Parallel Programs", S. Sridharan, G. Gupta, and G. S. Sohi. PLDI'14
Results: 1.17x over analytic, 1.26x over offline, 1.38x over online
Why multiple experts? Why not a single model? [Figure: experts E_1, E_2, …, E_{k-1}, E_k versus a single model M]
➔ Multiple experts outperform a single model
Can this approach be used with other optimization techniques? Affinity-based scheduling
To sum up...
Developed an approach for smart parallelism mapping
➔ Adaptive to dynamic environment
➔ Predictive modelling at its heart
➔ Environment predictor as a proxy to select the best mapping policy
What next?
➔ Integrating this concept in CnC
➔ Focus on tuning component
➔ Runtime and application tuning
➔ Dynamic partitioning of resources to steps
Idea
Instances of computations (steps): Step 1, Step 2, Step 3, Step 4
➔ Varying resource requirements for steps
➔ Mapping depends on when data is ready
Take away
➔ One-size-fits-none
➔ A bag of multiple policies is more practical than one
➔ Machine learning can be of help!
Thank you
Murali Emani, University of Edinburgh, m.k.emani@sms.ed.ac.uk
Backup
Adaptive Parallelism Mapping
➔ Program performance is sensitive to the environment
[Figure: four groups of factors around the target]
● Inherent behavior: various characteristics; compute/memory/disk bound
● Hardware: large number of components; increased chances of failure
● Software: recurring upgrades; version compatibility
● Data: varying amount of I/O; scalability issues
All experts use the same features, but the importance of each feature varies from expert to expert.
Evaluation
Platform: 32-core Intel Xeon; 4 one-socket nodes, 8 cores/socket, 3.7.10 kernel
Compiler: gcc 4.6 "-O3 -fopenmp"
Benchmarks: NAS, SpecOMP, Parsec (OpenMP)
Comparison: OpenMP default, Online, Offline, Analytic
Workloads: small (light), large (heavy)
Hardware: low, high frequent
Online: "Parcae: a system for flexible parallel execution", A. Raman, A. Zaks, J. W. Lee, and D. I. August. PLDI'12
Offline: "Smart, Adaptive Mapping of Parallelism in the Presence of External Workload", Murali Krishna Emani, Zheng Wang and Michael O'Boyle. CGO'13
Analytic: "Adaptive, Efficient, Parallel Execution of Parallel Programs", S. Sridharan, G. Gupta, and G. S. Sohi. PLDI'14
What is the effect of increasing # experts? Graceful addition of experts
What about # experts > 4? Needs more analysis