Performance and energy optimization of concurrent pipelined applications



SLIDE 1

Framework Complexity Experiments Conclusion

Performance and energy optimization of concurrent pipelined applications

Anne Benoit, Paul Renaud-Goud and Yves Robert

Institut Universitaire de France, ROMA team, LIP, École Normale Supérieure de Lyon, France

CCGSC 2010, Flat Rock, NC

Anne.Benoit@ens-lyon.fr CCGSC 2010 Performance and energy optimization 1/ 38

SLIDE 2

Motivations

  • Mapping concurrent pipelined applications onto distributed platforms: practical applications, but difficult problems
  • Assess problem hardness ⇒ different mapping rules and platform characteristics
  • Energy saving is becoming a crucial problem
  • Several concurrent objective functions: period, latency, power ⇒ multi-criteria approach: minimize power consumption while guaranteeing some performance
  • Exhaustive complexity study
  • Heuristics on most general (NP-complete) case

SLIDE 6

Why bother with energy?

Minimizing total energy consumed by processors: very important objective (economic and environmental reasons)

  • M. P. Mills, The internet begins with coal, Environment and Climate News (1999)

Algorithmic techniques:

  • Shut down idle processors
  • Dynamic speed scaling: processors can run at variable speed, e.g., Intel XScale, Intel SpeedStep, AMD PowerNow
      • The higher the speed, the higher the power consumption
      • $\text{Power} = f \times V^2$, and $V$ (voltage) increases with $f$ (frequency)
      • Speed $s$: $P(s) = s^{\alpha} + P_{static}$, with $2 \le \alpha \le 3$

Problem: decide which processors to enroll, and at which speed to run them
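The power model on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the default `alpha = 3` and the convention that a processor at speed 0 is shut down and consumes nothing are assumptions made here, not part of the talk.

```python
def processor_power(speed, alpha=3.0, p_static=0.0):
    """P(s) = s^alpha + P_static for one enrolled processor."""
    if not 2.0 <= alpha <= 3.0:
        raise ValueError("the model assumes 2 <= alpha <= 3")
    return speed ** alpha + p_static


def platform_power(speeds, alpha=3.0, p_static=0.0):
    """Total power of the platform.

    Assumption: a processor listed with speed 0 is shut down and
    contributes neither dynamic nor static power.
    """
    return sum(processor_power(s, alpha, p_static) for s in speeds if s > 0)


# Two enrolled processors at speeds 3 and 8, one processor shut down.
total = platform_power([3, 8, 0])
```

With `alpha = 3` and no static power, speeds 3 and 8 give 3^3 + 8^3 = 539, the value quoted in the motivating example of the talk.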

SLIDE 31

Motivating example

[Figure: two mappings of the example application; the drawing is not recoverable]

  • First mapping: $P = 3^3 + 8^3 = 539$; period $T = 3$; latency $L = 8$
  • Second mapping: $P = 8$; period $T = 15$; latency $L = 17$

SLIDE 32

Outline of the talk

1. Framework
      • Application and platform
      • Mapping rules
      • Metrics
2. Complexity results
      • Mono-criterion problems
      • Bi-criteria problems
      • Tri-criteria problems
      • With resource sharing
3. Experiments
      • Heuristics
      • Experiments
      • Summary
4. Conclusion

SLIDE 34

Application model and execution platform

  • Concurrent pipelined applications
      • $w_a^i$: weight of stage $S_a^i$ (ith stage of application $a$)
      • $\delta_a^i$: size of the output data of $S_a^i$
  • Processors with multiple speeds (or modes): $\{s_{u,1}, \dots, s_{u,m_u}\}$
  • Constant speed during the execution
  • Platform fully interconnected; $b_{u,v}$: bandwidth between processors $P_u$ and $P_v$; overlap or non-overlap of communications and computations

Three platform types:

  • Fully homogeneous, or speed homogeneous
  • Communication homogeneous, or speed heterogeneous
  • Fully heterogeneous

SLIDE 37

Mapping rules

  • Mapping with no processor sharing: relevant in practice (security rules)
      • One-to-one mapping
      • Interval mapping
  • General mapping with resource sharing: better resource utilization
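As an illustration of the three rules, the sketch below classifies the mapping of a single application. It relies on the usual definitions, which the slide does not spell out: one-to-one means every stage runs on its own processor, interval means each processor runs a block of consecutive stages, and anything else is a general mapping. The function name and the list-based encoding are assumptions made for this example, and sharing between different applications is not checked.

```python
def classify_mapping(alloc):
    """Classify the mapping of one application.

    alloc[i] is the (hypothetical) identifier of the processor that
    runs stage i; stages are listed in pipeline order.
    """
    if len(set(alloc)) == len(alloc):
        return "one-to-one"      # every stage has its own processor
    seen = set()
    previous = None
    for proc in alloc:
        if proc != previous:
            if proc in seen:
                return "general"  # a processor is reused after a gap
            seen.add(proc)
            previous = proc
    return "interval"             # each processor runs consecutive stages


kinds = [classify_mapping(a) for a in ([0, 1, 2], [0, 0, 1, 1, 2], [0, 1, 0])]
```

Note that a one-to-one mapping is also a valid interval mapping (all intervals have length one); the function reports the most restrictive rule that holds.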

SLIDE 40

Metrics without resource sharing

Interval mapping on a single application with no resource sharing; $k$ intervals $I_j$ of stages from $S_{d_j}$ to $S_{e_j}$

  • Period $T$ of an application: minimum delay between the processing of two consecutive data sets

$$T_{\text{overlap}} = \max_{j \in \{1,\dots,k\}} \max\left( \frac{\delta_{d_j - 1}}{b_{\text{alloc}(d_j - 1),\,\text{alloc}(d_j)}},\ \frac{\sum_{i=d_j}^{e_j} w_i}{s_{\text{alloc}(d_j)}},\ \frac{\delta_{e_j}}{b_{\text{alloc}(d_j),\,\text{alloc}(e_j + 1)}} \right)$$

  • Latency $L$ of an application: time, for a data set, to go through the whole pipeline

$$L = \frac{\delta_0}{b_{\text{alloc}(0),\,\text{alloc}(1)}} + \sum_{j=1}^{k} \left( \frac{\sum_{i=d_j}^{e_j} w_i}{s_{\text{alloc}(d_j)}} + \frac{\delta_{e_j}}{b_{\text{alloc}(d_j),\,\text{alloc}(e_j + 1)}} \right)$$

  • Power $P$ of the platform: sum of power of processors

$$P = \sum_{P_u} P(u), \qquad P(u) = P_{dyn}(s_u) + P_{stat}(u), \qquad P_{dyn}(s_u) = s_u^{\alpha}, \qquad 2 \le \alpha \le 3$$
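The three metrics can be evaluated directly from these formulas. The sketch below is a simplified illustration, not the authors' code: it assumes a single bandwidth `bandwidth` for every link (a communication-homogeneous platform), the overlap model for the period, and hypothetical names (`Interval`, `interval_mapping_metrics`) introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    first: int             # d_j: index of the first stage (1-based)
    last: int              # e_j: index of the last stage
    speed: float           # speed of the processor running the interval
    p_static: float = 0.0  # static power of that processor


def interval_mapping_metrics(w, delta, intervals, bandwidth, alpha=3.0):
    """Return (period, latency, power) of an interval mapping.

    w[i] is the weight of stage i (w[0] is unused), delta[i] is the size
    of the data output by stage i, and delta[0] is the input data size.
    Assumption: every link has the same bandwidth.
    """
    period = 0.0
    latency = delta[0] / bandwidth
    power = 0.0
    for iv in intervals:
        t_in = delta[iv.first - 1] / bandwidth
        t_comp = sum(w[iv.first:iv.last + 1]) / iv.speed
        t_out = delta[iv.last] / bandwidth
        period = max(period, t_in, t_comp, t_out)  # overlap model
        latency += t_comp + t_out
        power += iv.speed ** alpha + iv.p_static
    return period, latency, power


w = [0, 6, 4, 8]          # three stages
delta = [2, 2, 4, 2]      # input size, then output size of each stage
mapping = [Interval(1, 2, speed=5), Interval(3, 3, speed=4)]
period, latency, power = interval_mapping_metrics(w, delta, mapping, bandwidth=2)
```

On this toy instance the first processor computes for (6 + 4) / 5 = 2 time units and the second for 8 / 4 = 2, so the period is 2, the latency is 1 + (2 + 2) + (2 + 1) = 8, and the power is 5^3 + 4^3 = 189.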

SLIDE 44

Metrics with resource sharing

  • With the classical latency definition, scheduling the execution is NP-complete, given a mapping with a period/latency objective ⇒ for general mappings, latency model of Özgüner: $L = (2m - 1)T$, where $m - 1$ is the number of processor changes, and $T$ the period of the application
  • Period given ⇒ bound on number of processor changes
  • Given an application, we can check if the mapping is valid, given a bound on period and latency per application:
      • For period, check that each processor can handle its computation load and meet some communication constraints
      • For latency, check the number of processor changes
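The latency check described above reduces to counting processor changes. The following sketch shows one way to do it under the model $L = (2m - 1)T$; the function names and the list encoding of a mapping are assumptions made for the example, and the period check (load and communication constraints) is not covered.

```python
import math


def count_processor_changes(alloc):
    """Number of positions where consecutive stages use different processors."""
    return sum(1 for a, b in zip(alloc, alloc[1:]) if a != b)


def model_latency(num_changes, period):
    """L = (2m - 1) T, where m - 1 is the number of processor changes."""
    m = num_changes + 1
    return (2 * m - 1) * period


def max_processor_changes(period, latency_bound):
    """Largest number of changes allowed by (2m - 1) T <= latency_bound.

    A negative result means that even m = 1 violates the bound.
    Floating-point division is used, which is fine for a sketch.
    """
    m_max = math.floor((latency_bound / period + 1) / 2)
    return m_max - 1


def latency_ok(alloc, period, latency_bound):
    return count_processor_changes(alloc) <= max_processor_changes(period, latency_bound)


allowed = max_processor_changes(period=3, latency_bound=17)
```

For a period of 3 and a latency bound of 17, two changes give a latency of 15 and three changes give 21, so at most two processor changes are allowed.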

SLIDE 50

Optimization problems

  • Minimizing one criterion:
      • Period or latency: minimize $\max_a W_a \times T_a$ or $\max_a W_a \times L_a$
      • Power: minimize $P = \sum_u P(u)$
  • Fixing one criterion:
      • Fix the period or latency of each application → fix an array of periods or latencies
      • Fix a bound on total power consumption $P$
  • Multi-criteria approach: minimizing one criterion, fixing the other ones
  • Energy criterion = power consumption, i.e., energy per time unit ⇒ combination power/period
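To make the two ways of handling a criterion concrete, here is a small sketch: one helper evaluates the weighted objective $\max_a W_a \times T_a$, the other checks an array of per-application bounds. Both helper names are hypothetical and introduced only for this illustration.

```python
def weighted_max(weights, values):
    """Objective max_a W_a * X_a, where X is the period or the latency."""
    return max(w * x for w, x in zip(weights, values))


def meets_bounds(values, bounds):
    """Fixed criterion: every application stays within its own bound."""
    return all(x <= b for x, b in zip(values, bounds))


periods = [3, 2]    # achieved period of each application
weights = [1, 2]    # W_a
objective = weighted_max(weights, periods)
feasible = meets_bounds(periods, bounds=[3, 3])
```

In a bi-criteria problem such as power/period, `meets_bounds` plays the role of the constraint while power is the quantity being minimized.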

SLIDE 54

Mono-criterion complexity results

Period minimization:

  Columns: proc-hom | proc-het, subdivided into com-hom | special-app¹ | com-hom | com-het
  • one-to-one: polynomial (binary search) | NP-complete
  • interval: polynomial | NP-complete | NP-complete

Latency minimization:

  Columns: proc-hom | proc-het, subdivided into com-hom | special-app¹ | com-hom | com-het
  • one-to-one: polynomial | NP-complete | NP-complete
  • interval: polynomial (binary search) | NP-complete

¹ special-app: com-hom & pipe-hom

SLIDE 56

Latency minimization (1)

  • Problem: one-to-one mapping - many applications - heterogeneous platform - no communication - homogeneous pipelines - minimize $\max_a L_a$
  • Single application: greedy polynomial algorithm
  • Many applications: reduction from 3-partition

3-partition:

  • Input: $3m + 1$ integers $a_1, a_2, \dots, a_{3m}$ and $B$ such that $\sum_i a_i = mB$
  • Does there exist a partition $I_1, \dots, I_m$ of $\{1, \dots, 3m\}$ such that for all $j \in \{1, \dots, m\}$, $|I_j| = 3$ and $\sum_{i \in I_j} a_i = B$?
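For readers unfamiliar with 3-partition, the decision problem can be stated as a tiny exhaustive search. The sketch below is only meant to make the question precise on small instances; it runs in exponential time, which is expected for a strongly NP-complete problem, and the function name is an assumption of this example.

```python
from itertools import combinations


def three_partition(values, target):
    """Return index triples that each sum to target, or None if impossible."""
    if len(values) % 3 != 0 or sum(values) != (len(values) // 3) * target:
        return None

    def solve(remaining):
        if not remaining:
            return []
        first, rest = remaining[0], remaining[1:]
        # The first unused element must belong to some triple: try them all.
        for j, k in combinations(rest, 2):
            if values[first] + values[j] + values[k] == target:
                left = [i for i in rest if i not in (j, k)]
                tail = solve(left)
                if tail is not None:
                    return [(first, j, k)] + tail
        return None

    return solve(list(range(len(values))))


yes_instance = three_partition([1, 2, 3, 1, 2, 3], target=6)
no_instance = three_partition([1, 1, 1, 1, 1, 7], target=6)
```

The first instance splits into two triples of sum 6; the second has the right total (12) but no triple containing the value 7 can sum to 6, so the answer is no.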

SLIDE 57

Latency minimization (2)

3-partition: renumbering of the $a_i$ such that:

$$\begin{cases} a_{1,1} + a_{1,2} + a_{1,3} = B \\ a_{2,1} + a_{2,2} + a_{2,3} = B \\ \quad\vdots \\ a_{m,1} + a_{m,2} + a_{m,3} = B \end{cases}$$

  • Reduction: can we obtain a latency $L^0 \le B$?
  • Equivalence of problems

SLIDE 58

Bi-criteria complexity results

Period/latency minimization:

  Columns: proc-hom | proc-het, subdivided into com-hom | special-app | com-hom | com-het
  • one-to-one or interval: polynomial | NP-complete

Power/period minimization:

  Columns: proc-hom | proc-het, subdivided into com-hom | special-app | com-hom | com-het
  • one-to-one: polynomial (minimum matching) | NP-complete
  • interval: polynomial | NP-complete

SLIDE 60

Power/period minimization

  • Problem: one-to-one mapping - many applications - communication homogeneous platform - power minimization for a given array of periods
  • Minimum weighted matching of a bipartite graph
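The matching view can be illustrated as follows: stages form one side of the bipartite graph, processors the other, and the weight of an edge is the smallest power at which that processor meets the period bound of the stage. The sketch below builds this cost matrix and, to stay dependency-free, finds the best assignment by brute-force enumeration; this enumeration stands in for a polynomial matching algorithm (for example the Hungarian method) and is not how the authors solve the problem. Communication constraints are ignored, and all names are hypothetical.

```python
import math
from itertools import permutations


def cheapest_mode(weight, period, speeds, p_static, alpha=3.0):
    """Edge weight: least power at which the processor meets the period."""
    feasible = [s for s in speeds if weight / s <= period]
    if not feasible:
        return math.inf
    return min(feasible) ** alpha + p_static  # power grows with speed


def min_power_one_to_one(stages, processors, alpha=3.0):
    """stages: list of (weight, period bound); processors: list of (speeds, static power).

    Returns (total power, tuple giving the processor of each stage).
    Brute force over assignments, for small illustrative instances only.
    """
    cost = [[cheapest_mode(w, t, sp, ps, alpha) for sp, ps in processors]
            for w, t in stages]
    best = (math.inf, None)
    for perm in permutations(range(len(processors)), len(stages)):
        total = sum(cost[i][u] for i, u in enumerate(perm))
        if total < best[0]:
            best = (total, perm)
    return best


stages = [(6, 3), (8, 2)]
processors = [([2, 4], 0.0), ([1, 3, 5], 1.0), ([4], 0.0)]
best_power, assignment = min_power_one_to_one(stages, processors)
```

Here the first stage needs speed at least 2 and the second at least 4; the cheapest valid assignment puts the first stage on the first processor (power 8) and the second on the third processor (power 64).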

SLIDE 62

Single application (1)

  • Problem: interval mapping - single application - fully homogeneous platform - power minimization for a given period
  • $P(i, j, k)$: minimum power to run stages $S_i$ to $S_j$ using exactly $k$ processors → looking for $\min_{1 \le k \le p} P(1, n, k)$
  • Recurrence relation:

$$P(i, j, k) = \min_{1 \le \ell \le j-1} \left( P(i, \ell, k-1) + P(\ell+1, j, 1) \right)$$

slide-63
SLIDE 63

Framework Complexity Experiments Conclusion Mono-criterion Bi-criteria Tri-criteria With resource sharing

Single application (2)

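$P(i, i, q) = +\infty$ if $q > 1$

$F_i^j$: possible powers of a processor running the stages $S_i$ to $S_j$, fulfilling the period constraint

$$F_i^j = \left\{ P_{dyn}(s_\ell) + P_{stat} \;:\; \max\left( \frac{\delta_{i-1}}{b},\ \frac{\sum_{k=i}^{j} w_k}{s_\ell},\ \frac{\delta_j}{b} \right) \le T,\ \ell \in \{1, \dots, m\} \right\}$$

$$P(i, j, 1) = \begin{cases} \min F_i^j & \text{if } F_i^j \neq \emptyset \\ +\infty & \text{otherwise} \end{cases}$$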
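The single-application dynamic program can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: dynamic power is assumed to be `speed ** alpha`, the cut index is restricted to $i \le \ell < j$ (intervals that end before $S_i$ are empty), and the function and parameter names are hypothetical. `work[i-1]` holds $w_i$, and `delta[i]` holds $\delta_i$ for $i = 0, \dots, n$.

```python
from math import inf

def min_power_single_app(work, delta, speeds, bandwidth, period, p_stat, p, alpha=3.0):
    """Return min over k <= p of P(1, n, k) for one application.

    Assumption: dynamic power of a mode of speed s is s ** alpha.
    """
    n = len(work)

    def one_proc(i, j):
        # P(i, j, 1): cheapest mode that meets the period for stages S_i..S_j,
        # counting incoming communication, computation and outgoing communication.
        load = sum(work[i - 1:j])
        feasible = [s ** alpha + p_stat for s in speeds
                    if max(delta[i - 1] / bandwidth, load / s, delta[j] / bandwidth) <= period]
        return min(feasible, default=inf)

    P = {}
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            P[i, j, 1] = one_proc(i, j)
    for k in range(2, p + 1):
        for i in range(1, n + 1):
            for j in range(i, n + 1):
                # Last interval S_{l+1}..S_j on one processor, the rest on k - 1.
                # An empty range (i == j) yields +inf, matching P(i, i, q) = +inf for q > 1.
                P[i, j, k] = min((P[i, l, k - 1] + P[l + 1, j, 1] for l in range(i, j)),
                                 default=inf)
    return min(P[1, n, k] for k in range(1, p + 1))
```

The table has $O(p\,n^2)$ entries and each one scans at most $n$ cut points, so the sketch runs in polynomial time, in line with the complexity claim on the slides.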

slide-64
SLIDE 64

Many applications (1)

Problem: interval mapping, fully homogeneous platform, power minimization for a given period per application

$P_a^q$: minimum power consumed by $q$ processors so that the period constraint on application $a$ is met, found by the previous dynamic programming

$P(a, k)$: minimum power consumed by $k$ processors on applications $1, \dots, a$ (unknown)

Initialization: $\forall k \in \{1, \dots, p\},\ P(1, k) = P_1^k$

slide-65
SLIDE 65

Many applications (2)

Recurrence:

$$P(a, k) = \min_{1 \le q < k} \bigl( P(a-1, k-q) + P_a^q \bigr)$$
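The many-applications recurrence combines the per-application tables $P_a^q$. A minimal Python sketch, assuming those tables have already been computed (for instance by a single-application dynamic program) and are passed in as `app_power[a-1][q]`; the function name and the list layout, with index 0 unused, are hypothetical.

```python
from math import inf

def min_power_many_apps(app_power, p):
    """app_power[a-1][q] = P_a^q for q = 1..p (index 0 is unused).

    Returns min over k <= p of P(A, k), where A is the number of applications.
    """
    A = len(app_power)
    P = [[inf] * (p + 1) for _ in range(A + 1)]
    # Initialization: P(1, k) = P_1^k.
    for k in range(1, p + 1):
        P[1][k] = app_power[0][k]
    for a in range(2, A + 1):
        for k in range(1, p + 1):
            # Give q processors to application a and k - q to applications 1..a-1.
            P[a][k] = min((P[a - 1][k - q] + app_power[a - 1][q] for q in range(1, k)),
                          default=inf)
    return min(P[A][1:])
```

With two applications, `p = 3`, `app_power = [[inf, 9, 4, inf], [inf, 5, 3, inf]]`, the best split gives two processors to the first application and one to the second, for a total power of 9.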

slide-66
SLIDE 66


Tri-criteria complexity results

Platform columns: proc-hom proc-het com-hom special-app com-hom com-het

one-to-one or interval: NP-complete

Reduction from 2-partition (instance of 2-partition: $a_1, a_2, \dots, a_n$ with $\sigma = \sum_{i=1}^{n} a_i$)

slide-67
SLIDE 67

Problem instance

One-to-one mapping, fully homogeneous platform

$P^0 = P^* + \alpha X(\sigma/2 + 1/2)$, $L^0 = L^* - X(\sigma/2 - 1/2)$, $T^0 = L^0$

where $P^*$ and $L^*$ are power and latency when each $S_i$ is run at speed $s_{2i-1}$

slide-68
SLIDE 68

Main ideas

$K$ big enough and $X$ small enough so that the stage $S_i$ must be processed at speed $s_{2i-1}$ or $s_{2i}$

For a subset $I$ of $\{1, \dots, n\}$, if ($S_i$ is run at speed $s_{2i}$ ⇔ $i \in I$):

$$P = P^* + \sum_{i \in I} \bigl(\alpha a_i X + o(X)\bigr), \qquad L = L^* - \sum_{i \in I} \bigl(a_i X - o(X)\bigr)$$

Recall: $P^0 = P^* + \alpha X(\sigma/2 + 1/2)$, $L^0 = L^* - X(\sigma/2 - 1/2)$
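A short worked step may help connect these two estimates to 2-partition. It is a sketch under added assumptions, not the authors' proof: the $a_i$ are taken to be positive integers, $\sigma$ is even, and $X$ is small enough that the accumulated $o(X)$ terms are smaller in magnitude than $\alpha X/2$ for the power and $X/2$ for the latency.

$$
\begin{aligned}
P \le P^0 &\;\Longrightarrow\; \alpha X \sum_{i \in I} a_i + o(X) \le \alpha X\left(\tfrac{\sigma}{2} + \tfrac{1}{2}\right) \;\Longrightarrow\; \sum_{i \in I} a_i < \tfrac{\sigma}{2} + 1 \;\Longrightarrow\; \sum_{i \in I} a_i \le \tfrac{\sigma}{2},\\
L \le L^0 &\;\Longrightarrow\; X \sum_{i \in I} a_i - o(X) \ge X\left(\tfrac{\sigma}{2} - \tfrac{1}{2}\right) \;\Longrightarrow\; \sum_{i \in I} a_i > \tfrac{\sigma}{2} - 1 \;\Longrightarrow\; \sum_{i \in I} a_i \ge \tfrac{\sigma}{2}.
\end{aligned}
$$

Both bounds hold together only if $\sum_{i \in I} a_i = \sigma/2$, that is, only if $I$ is a solution of the 2-partition instance.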

slide-69
SLIDE 69

And for general mappings with resource sharing?

Exhaustive complexity study with no resource sharing: new polynomial algorithms for multiple applications, and NP-completeness results

With the simplified latency model: polynomial tri-criteria dynamic programming algorithm with no resource sharing and speed-homogeneous platforms

With resource sharing or speed-heterogeneous platforms: all problem instances are NP-hard, even for period minimization alone

slide-72
SLIDE 72

Outline of the talk

1. Framework: Application and platform; Mapping rules; Metrics

2. Complexity results: Mono-criterion problems; Bi-criteria problems; Tri-criteria problems; With resource sharing

3. Experiments: Heuristics; Experiments; Summary

4. Conclusion

slide-73
SLIDE 73

Heuristics

Tri-criteria problem: power consumption minimization given a bound on period and latency per application, on a speed-heterogeneous platform

Each heuristic (except H2) exists in two variants, interval mapping without resource sharing and general mapping with resource sharing, in order to evaluate the impact of processor reuse

Latency model of Özgüner: $L = (2m - 1)T$

H1: random cuts
H2: one entire application per processor (assignment problem)
H2-split: interval splitting
H3: two-step heuristic: choose a speed distribution and find a valid mapping (variants on both steps)
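As an illustration of how the latency model constrains a mapping, the sketch below assumes that $m$ is the number of intervals an application is split into (the slide does not define $m$) and shows one hypothetical way to draw random cuts in the spirit of H1. The function names are made up for this example and the actual heuristic may differ.

```python
import random
from math import floor

def latency(m, period):
    """Latency under the model L = (2m - 1) T, with m assumed to be the number of intervals."""
    return (2 * m - 1) * period

def max_intervals(latency_bound, period):
    """Largest m such that (2m - 1) * period <= latency_bound (0 if there is none)."""
    return max(0, floor((latency_bound / period + 1) / 2))

def random_cuts(n, m, rng):
    """Split stages 1..n into m consecutive intervals at m - 1 random cut points.

    Requires 1 <= m <= n. Returns a list of (first_stage, last_stage) pairs.
    """
    cuts = sorted(rng.sample(range(1, n), m - 1))
    bounds = [0] + cuts + [n]
    return [(bounds[i] + 1, bounds[i + 1]) for i in range(m)]
```

Under this reading, a latency bound of 10 with a period of 2 allows at most three intervals, so a random-cut heuristic would draw at most two cut points for that application.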

slide-74
SLIDE 74

H3-energy

Fix processor speeds

slide-75
SLIDE 75

H3-energy

Mapping heuristic: find a valid mapping

slide-78
SLIDE 78

H3-energy

Iterate the process: increase processor speeds
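The three steps shown on the H3-energy slides (fix speeds, look for a valid mapping, raise speeds and retry) can be sketched as a loop. Everything beyond that outline is an assumption made for illustration: a single application, computation-only period constraints, identical processors sharing one sorted list of modes, dynamic power `speed ** alpha`, a greedy in-order packing as the mapping step, and raising the processor whose next mode costs the least extra power. The names `greedy_mapping` and `iterate_speeds` are hypothetical and this is not the authors' H3-energy.

```python
def greedy_mapping(work, speeds, period):
    """Pack consecutive stages onto processors in order.

    Returns a list of (processor, first_stage, last_stage), or None if no fit.
    """
    intervals, start, load, u = [], 0, 0.0, 0
    for i, w in enumerate(work):
        # Close the current interval while processor u cannot absorb stage i.
        while u < len(speeds) and (load + w) / speeds[u] > period:
            if i > start:
                intervals.append((u, start, i - 1))
            start, load, u = i, 0.0, u + 1
        if u == len(speeds):
            return None  # ran out of processors
        load += w
    intervals.append((u, start, len(work) - 1))
    return intervals

def iterate_speeds(work, modes, period, p, p_stat=0.0, alpha=3.0):
    """Return (mapping, power) or None; `modes` is sorted by increasing speed."""
    level = [0] * p  # step 1: fix processor speeds (slowest mode everywhere)
    while True:
        speeds = [modes[l] for l in level]
        mapping = greedy_mapping(work, speeds, period)  # step 2: find a valid mapping
        if mapping is not None:
            power = sum(speeds[u] ** alpha + p_stat for u, _, _ in mapping)
            return mapping, power
        # Step 3: raise one processor to its next mode, cheapest increase first.
        candidates = [u for u in range(p) if level[u] + 1 < len(modes)]
        if not candidates:
            return None  # every processor already runs at full speed
        u = min(candidates,
                key=lambda v: modes[level[v] + 1] ** alpha - modes[level[v]] ** alpha)
        level[u] += 1
```

The loop terminates because each failed iteration raises one mode level and there are finitely many levels. Starting from the slowest modes keeps the power low at the price of more iterations.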

slide-89
SLIDE 89

Experimental plan

Integer linear program (ILP) to assess the absolute performance of the heuristics on small instances

Small instances: two or three applications, around 15 stages per application, around 8 processors

Execution time on 30 small instances: less than one second for all heuristics, one week for the ILP

Each heuristic and the ILP: variant without sharing ("-n") and variant with sharing ("-r")

General behavior of heuristics
Impact of resource sharing
Scalability of heuristics

slide-92
SLIDE 92

Increasing latency

[Figure: 1/Energy versus nbInter. Curves: cplex-r, H1-r, H2, H2-split-r, H3-upDown-r, H3-speed-r, H3-energy-r, best]

slide-93
SLIDE 93


Increasing number of processors

[Figure: 1/Energy versus nbProcs. Curves: cplex-r, H1-r, H2, H2-split-r, H3-upDown-r, H3-speed-r, H3-energy-r, best]

slide-94
SLIDE 94


Impact of static power

[Figure: 1/Energy versus max Estat. Curves: cplex-n, H1-n, H1-r, H2, H2-split-n, H2-split-r, H3-upDown-n, H3-upDown-r]

slide-95
SLIDE 95


Impact of mode distribution

[Figure: 1/Energy versus $s_{u,l+1} - s_{u,l}$. Curves: cplex-n, H1-n, H1-r, H2, H2-split-n, H2-split-r, H3-upDown-n, H3-upDown-r]

slide-96
SLIDE 96


Scalability

[Figure: Energy versus nbApp. Curves: H1-r, H2, H2-split-r, H3-upDown-r, H3-speed-r, H3-energy-r, best]

slide-97
SLIDE 97


Summary of experiments

Efficient heuristics: the best heuristic is always at 90% of the optimal solution on small instances

Supremacy of H2-split-r: better on average, and even better as problem instances get larger

H3 has a smaller execution time (one second versus three minutes for 20 applications); the ILP is not usable in practice

Resource sharing becomes crucial with high static power (use fewer processors) or with distant modes (better use of all available speeds)

slide-99
SLIDE 99

Outline of the talk

1. Framework: Application and platform; Mapping rules; Metrics

2. Complexity results: Mono-criterion problems; Bi-criteria problems; Tri-criteria problems; With resource sharing

3. Experiments: Heuristics; Experiments; Summary

4. Conclusion

slide-100
SLIDE 100

Conclusion and future work

Exhaustive complexity study

  new polynomial algorithms
  new NP-completeness proofs
  impact of model on complexity (tri-criteria homogeneous)

Experimental study

  efficient heuristics
  impact of resource reuse

Current/future work

  continuous speeds
  approximation algorithms
