A Mixture of Experts Approach for Runtime Mapping in Dynamic - - PowerPoint PPT Presentation



SLIDE 1

A Mixture of Experts Approach for Runtime Mapping in Dynamic Environments

Murali Emani

School of Informatics, University of Edinburgh

SLIDE 2

Modern computing hardware

Diverse. Stochastic. Evolving.

SLIDE 3

Parallelism Mapping

Program → Hardware

Computation Steps

SLIDE 4

Parallelism Mapping

Hardware, Workloads, Software, Data

Program → Hardware

SLIDE 5

Parallelism Mapping

Hardware, Workloads, Software, Data

Program → Hardware

Program performance is sensitive to the environment

SLIDE 6

What exactly is the problem?

Optimal partitioning of the parallel work is not static and is non-trivial

SLIDE 7

What exactly is the problem?

Existing approaches are based on:

  • One-size-fits-all policy

SLIDE 8

What exactly is the problem?

Existing approaches are based on:

  • One-size-fits-all policy

➔ Not suitable for dynamic environments
➔ Hard to extend and update

SLIDE 9

Goals

➔ Determine optimal resources for a parallel program
   Avoid under-subscription / over-subscription
➔ Enable program auto-tuning
   Adapt smartly to varying resources
➔ Program and platform aware
   Generic and portable

SLIDE 10

Where does it fit in the stack?

Application
Runtime
Operating System
Hardware

SLIDE 11

State Space

SLIDE 12

Idea

➔ Identify the best mapping policy in each set

SLIDE 13

Idea

E1 E2 … Ek-1 Ek

➔ Identify the best mapping policy in each set

SLIDE 14

Idea

E1 E2 … Ek-1 Ek
…
E1 E2 … Ek-1 Ek

➔ Collect these policies

SLIDE 15

Idea

E1 E2 … Ek-1 Ek
…
E1 E2 … Ek-1 Ek

➔ Choose the best policy based on the current state

SLIDE 16

Idea

E1 E2 … Ek-1 Ek
…
E1 E2 … Ek-1 Ek

➔ Choose the best policy based on the current state

SLIDE 17

Mixture of Experts based Mapping

➔ Ensemble of experts (mapping policies)
➔ Smart way to select the best expert at runtime
➔ Combine offline prior models with online learning
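The ensemble idea on this slide can be sketched in code. This is a minimal illustration, not the authors' implementation: the policy functions, their names, and the scoring hook are all assumptions.

```python
# Minimal sketch (assumed, not the talk's code): an ensemble of mapping
# policies ("experts"), each proposing a thread count for the current
# environment, plus a selector hook that picks one expert at runtime.

def expert_default(env):
    # One-size-fits-all policy: always use every processor.
    return env["processors"]

def expert_loaded(env):
    # Back off when external workload threads compete for cores.
    return max(1, env["processors"] - env["workload_threads"])

def expert_half(env):
    # Conservative policy: use half the processors.
    return max(1, env["processors"] // 2)

EXPERTS = [expert_default, expert_loaded, expert_half]

def select_expert(env, score):
    # 'score' rates how well an expert suits the current state; in the
    # talk this role is played by learned models rather than a callback.
    return max(EXPERTS, key=lambda e: score(e, env))
```

Each expert is cheap to query; the hard part, as the next slides discuss, is scoring them without actually running the program under every policy.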

SLIDE 18

Mixture of Experts based Mapping

Expert 1 → # threads
Expert 2 → # threads
…
Expert k → # threads

SLIDE 19

Mixture of Experts based Mapping

How to select the best expert? Expensive to evaluate the # threads of all experts

Expert 1 → # threads
Expert 2 → # threads
…
Expert k → # threads

SLIDE 20

Mixture of Experts based Mapping

How to select the best expert? Expensive to evaluate the # threads of all experts

Expert 1 → # threads
Expert 2 → # threads
…
Expert k → # threads

Environment predictor

SLIDE 21

Mixture of Experts based Mapping

How to select the best expert? Expensive to evaluate the # threads of all experts. Environment predictor:

Expert 1 → environment, # threads
Expert 2 → environment, # threads
…
Expert k → environment, # threads

SLIDE 22

Predictive Modelling

Thread predictor: what is the best # threads?
Environment predictor: what should the environment look like?

SLIDE 23

Predictive Modelling

Thread predictor: what is the best # threads?
Environment predictor: what should the environment look like?

Input feature vector = <code, environment>, f = (c, e)
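The feature vector f = (c, e) pairs static code features with dynamic environment features. A minimal sketch, with illustrative field names borrowed from the deck's later feature list:

```python
# Sketch of building f = (c, e): static code features (c) plus dynamic
# environment features (e). The field names are assumptions for
# illustration, following the feature categories shown in the deck.

def make_feature_vector(code, env):
    c = [code["instructions"], code["branches"], code["load_store"]]
    e = [env["workload_threads"], env["processors"],
         env["run_queue_size"], env["cpu_load"]]
    return c + e  # f = (c, e), flattened for a regression model
```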

SLIDE 24

Approach – Machine Learning

SLIDE 25

Approach – Machine Learning

➔ Hand-crafted solutions infeasible

Training data → Data pre-processing → Learning algorithm → Model
New input → Model → prediction

SLIDE 26

Approach – Machine Learning

➔ Hand-crafted solutions infeasible
➔ Train offline, deploy online
➔ Supervised learning, cross-validated
➔ Trained on NAS, evaluated on additional benchmarks

Training data → Data pre-processing → Learning algorithm → Model
New input → Model → prediction

* Training overhead: one-off cost of 9216 experiments

SLIDE 27

Training phase

➔ Various configurations of program pairs and # threads
   9216 experiments; 3 weeks of runs; 1.1 GB of logs
➔ Feature-space dimensionality reduction: information gain
   10 of 154 features kept as a rich subset
➔ Linear regression models
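The training phase above reduces 154 candidate features to 10 and fits linear regression models. A hedged sketch of both steps, with a generic usefulness score standing in for the information-gain computation:

```python
# Sketch of the offline training steps (assumed structure, not the
# paper's pipeline): rank features by a score standing in for
# information gain, keep the top k, then fit a linear model by
# ordinary least squares (single-feature case shown for brevity).

def top_k_features(scores, k):
    # scores: {feature_name: information-gain-like value}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def fit_linear(xs, ys):
    # Ordinary least squares for one feature: y = a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx
```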

SLIDE 28

Features

STATIC (code): # instructions, # branches, # load/store
DYNAMIC (environment): # workload threads, # processors, run queue size, CPU load, page free list rate, cached memory

SLIDE 29

How to select the best expert

Online Expert Selector → Select expert 'k'

SLIDE 30

How to select the best expert

Online Expert Selector → Select expert 'k'

Use the 'Environment predictor' as a proxy to select the best mapping policy
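The proxy selection described here can be sketched as follows; the distance metric and the data layout are assumptions for illustration, not the actual selector:

```python
# Hedged sketch of the online expert selector: pick the expert whose
# predicted environment is closest to the observed one, instead of
# paying to run the program with every expert's thread count.

def distance(pred_env, obs_env):
    # Squared error between predicted and observed environment vectors.
    return sum((p - o) ** 2 for p, o in zip(pred_env, obs_env))

def select_best_expert(experts, obs_env):
    # experts: list of (name, predicted_environment) pairs.
    return min(experts, key=lambda ex: distance(ex[1], obs_env))[0]
```

Only the chosen expert's thread predictor is then consulted, which keeps the per-decision cost independent of the number of experts' thread counts actually tried.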

SLIDE 31

All put together...

SLIDE 32

How many experts?

SLIDE 33

How many experts?

  • Open question
SLIDE 34

Started with 4 experts

SLIDE 35

Evaluation

Workloads: small (light), large (heavy)
Hardware: low, high frequency
Platform: 32-core Intel Xeon
Benchmarks: NAS, SpecOMP, Parsec (OpenMP)
Comparison: OpenMP default, Online, Offline, Analytic

Online: "Parcae: A System for Flexible Parallel Execution", A. Raman, A. Zaks, J. W. Lee, and D. I. August, PLDI '12.
Offline: "Smart, Adaptive Mapping of Parallelism in the Presence of External Workload", Murali Krishna Emani, Zheng Wang, and Michael O'Boyle, CGO '13.
Analytic: "Adaptive, Efficient, Parallel Execution of Parallel Programs", S. Sridharan, G. Gupta, and G. S. Sohi, PLDI '14.

SLIDE 36

Results

1.17x over Analytic; 1.26x over Offline; 1.38x over Online

SLIDE 37

Why multiple experts? Why not a single model?

E1 E2 … Ek-1 Ek vs. a single model M

SLIDE 38

Why multiple experts? Why not a single model?

  • Multiple experts outperform a single model

SLIDE 39

Can this approach be used with other optimization techniques?

SLIDE 40

Can this approach be used with other optimization techniques?

Affinity-based scheduling

SLIDE 41

To sum up...

Developed an approach for smart parallelism mapping

➔ Adaptive to dynamic environments
➔ Predictive modelling at its heart
➔ Environment predictor as a proxy to select the best mapping policy

SLIDE 42

What next?

➔ Integrating this concept in CnC
➔ Focus on the tuning component
➔ Runtime and application tuning
➔ Dynamic partitioning of resources to steps

SLIDE 43

Idea

Instances of computations (steps)

Step 1, Step 2, Step 3, Step 4

➔ Varying resource requirements for steps
➔ Mapping depends on when data is ready

SLIDE 44

Take away

➔ One-size-fits-none
➔ A bag of multiple policies is more practical than one
➔ Machine learning can help!

Thank you

Murali Emani, University of Edinburgh
m.k.emani@sms.ed.ac.uk

SLIDE 45

Backup

SLIDE 46

Adaptive Parallelism Mapping

➔ Program performance is sensitive to the environment

Target:
  • Hardware: large number of components; increased chances of failure
  • Inherent behavior: various characteristics; compute/memory/disk bound
  • Software: recurring upgrades; version compatibility
  • Data: varying amounts of I/O; scalability issues
SLIDE 47

SLIDE 48

All experts use the same features; the features vary in importance across experts.

SLIDE 49

SLIDE 50

Evaluation

Workloads: small (light), large (heavy)
Hardware: low, high frequency
Platform: 32-core Intel Xeon; 4 one-socket nodes, 8 cores/socket; Linux 3.7.10 kernel
Compiler: gcc 4.6 "-O3 -fopenmp"
Benchmarks: NAS, SpecOMP, Parsec (OpenMP)
Comparison: OpenMP default, Online, Offline, Analytic

Online: "Parcae: A System for Flexible Parallel Execution", A. Raman, A. Zaks, J. W. Lee, and D. I. August, PLDI '12.
Offline: "Smart, Adaptive Mapping of Parallelism in the Presence of External Workload", Murali Krishna Emani, Zheng Wang, and Michael O'Boyle, CGO '13.
Analytic: "Adaptive, Efficient, Parallel Execution of Parallel Programs", S. Sridharan, G. Gupta, and G. S. Sohi, PLDI '14.

SLIDE 51

Graceful addition of experts

What is the effect of increasing # experts? What about # experts > 4? Needs more analysis.