Performance and energy optimization of concurrent pipelined applications

Framework Complexity Experiments Conclusion
Motivating example
Power: $P = 3^3 + 8^3 = 539$ versus $P = 8$
Period: $T = 3$ versus $T = 15$
Latency: $L = 8$ versus $L = 17$
Anne.Benoit@ens-lyon.fr CCGSC 2010 Performance and energy optimization 4/38

Outline of the talk
1. Framework: application and platform; mapping rules; metrics
2. Complexity results: mono-criterion problems; bi-criteria problems; tri-criteria problems; with resource sharing
3. Experiments: heuristics; experiments; summary
4. Conclusion

Application model and execution platform
Concurrent pipelined applications:
  $S_a^i$: $i$-th stage of application $a$
  $w_a^i$: weight of stage $S_a^i$
  $\delta_a^i$: size of the output data of $S_a^i$
Processors with multiple speeds (or modes): $\{s_{u,1}, \dots, s_{u,m_u}\}$
Constant speed during the execution
Platform fully interconnected; $b_{u,v}$: bandwidth between processors $P_u$ and $P_v$; overlap or non-overlap of communications and computations
Three platform types:
  Fully homogeneous, or speed homogeneous
  Communication homogeneous, or speed heterogeneous
  Fully heterogeneous

Mapping rules
Mapping with no processor sharing: relevant in practice (security rules)
  One-to-one mapping
  Interval mapping
General mapping with resource sharing: better resource utilization

Metrics without resource sharing
Interval mapping on a single application with no resource sharing; $k$ intervals $I_j$ of stages from $S^{d_j}$ to $S^{e_j}$
Period $T$ of an application: minimum delay between the processing of two consecutive data sets
$$T^{(\mathrm{overlap})} = \max_{j \in \{1,\dots,k\}} \max\left( \frac{\delta^{d_j-1}}{b_{\mathrm{alloc}(d_j-1),\,\mathrm{alloc}(d_j)}},\ \frac{\sum_{i=d_j}^{e_j} w^i}{s_{\mathrm{alloc}(d_j)}},\ \frac{\delta^{e_j}}{b_{\mathrm{alloc}(d_j),\,\mathrm{alloc}(e_j+1)}} \right)$$
Latency $L$ of an application: time, for a data set, to go through the whole pipeline
$$L = \frac{\delta^{0}}{b_{\mathrm{alloc}(0),\,\mathrm{alloc}(1)}} + \sum_{j=1}^{m} \left( \frac{\sum_{i=d_j}^{e_j} w^i}{s_{\mathrm{alloc}(d_j)}} + \frac{\delta^{e_j}}{b_{\mathrm{alloc}(d_j),\,\mathrm{alloc}(e_j+1)}} \right)$$
Power $P$ of the platform: sum of the power of the processors
$$P = \sum_{P_u} P(u), \qquad P(u) = P_{dyn}(s_u) + P_{stat}(u), \qquad P_{dyn}(s_u) = s_u^{\alpha}, \quad 2 \le \alpha \le 3$$
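As an illustration of the three metrics on this slide, here is a minimal Python sketch. It assumes a single application, one uniform bandwidth `b` on every link (the slide allows a bandwidth per processor pair), and the same static power `p_stat` on every processor. The function names and the `(d, e, s)` triple describing an interval (first stage, last stage, chosen speed) are hypothetical and not taken from the talk.

```python
def period_overlap(w, delta, intervals, b):
    """Period with overlap: slowest of input communication, computation,
    and output communication over all intervals.
    w[1..n] are stage weights (w[0] is unused), delta[0..n] are data sizes."""
    return max(
        max(delta[d - 1] / b, sum(w[d:e + 1]) / s, delta[e] / b)
        for d, e, s in intervals
    )


def latency(w, delta, intervals, b):
    """Latency: initial input communication, then for each interval its
    computation time plus its output communication."""
    return delta[0] / b + sum(
        sum(w[d:e + 1]) / s + delta[e] / b for d, e, s in intervals
    )


def power(intervals, alpha=3.0, p_stat=0.0):
    """Power: dynamic part s^alpha plus static part, summed over used processors."""
    return sum(s ** alpha + p_stat for _, _, s in intervals)


if __name__ == "__main__":
    w = [0, 2, 4, 6]            # three stages, 1-indexed
    delta = [1, 1, 1, 1]        # delta[0] is the input of the first stage
    intervals = [(1, 2, 2), (3, 3, 2)]  # stages 1-2 and stage 3, both at speed 2
    print(period_overlap(w, delta, intervals, b=1))
    print(latency(w, delta, intervals, b=1))
    print(power(intervals))
```

With heterogeneous links, `b` would become a lookup on the pair of processors involved; the structure of the three formulas stays the same.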

Metrics with resource sharing
With the classical latency definition, the execution scheduling is NP-complete, given a mapping with a period/latency objective
⇒ for general mappings, latency model of Özgüner: $L = (2m - 1)\,T$, where $m - 1$ is the number of processor changes and $T$ the period of the application
Period given ⇒ bound on the number of processor changes
Given an application, we can check whether the mapping is valid, given a bound on period and latency per application:
  For period, check that each processor can handle its computation load and meet some communication constraints
  For latency, check the number of processor changes
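The latency check on this slide reduces to simple arithmetic: with $L = (2m - 1)\,T$, a latency bound $L$ and a period $T$ allow at most $m \le (L/T + 1)/2$, that is, at most $m - 1$ processor changes. A small sketch of that check follows; the function names are hypothetical, and it assumes $T > 0$.

```python
import math


def latency_ozguner(m, period):
    """Latency under the simplified model: L = (2m - 1) * T,
    where m - 1 is the number of processor changes."""
    return (2 * m - 1) * period


def max_processor_changes(period, latency_bound):
    """Largest number of processor changes compatible with the bound.
    (2m - 1) * T <= L  is equivalent to  m <= (L / T + 1) / 2.
    A negative result means that no mapping meets the latency bound."""
    m_max = math.floor((latency_bound / period + 1) / 2)
    return m_max - 1


if __name__ == "__main__":
    # With T = 3 and a latency bound of 17: m = 3 gives L = 15, m = 4 gives L = 21.
    print(max_processor_changes(3, 17))
```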

Optimization problems
Minimizing one criterion:
  Period or latency: minimize $\max_a W_a \times T_a$ or $\max_a W_a \times L_a$
  Power: minimize $P = \sum_u P(u)$
Fixing one criterion:
  Fix the period or latency of each application → fix an array of periods or latencies
  Fix a bound on total power consumption $P$
Multi-criteria approach: minimizing one criterion, fixing the other ones
Energy criterion = power consumption, i.e., energy per time unit ⇒ combination power/period

Mono-criterion complexity results
Columns: proc-hom com-hom | proc-het special-app (1) | proc-het com-hom | proc-het com-het
(entries listed left to right; an entry may span several columns)
Period minimization:
  one-to-one: polynomial (binary search) | NP-complete
  interval: polynomial | NP-complete | NP-complete
Latency minimization:
  one-to-one: polynomial | NP-complete | NP-complete
  interval: polynomial (binary search) | NP-complete
(1) special-app: com-hom & pipe-hom

Latency minimization (1)
Problem: one-to-one mapping - many applications - heterogeneous platform - no communication - homogeneous pipelines - minimize $\max_a L_a$
Single application: greedy polynomial algorithm
Many applications: reduction from 3-partition
3-partition:
  Input: $3m + 1$ integers $a_1, a_2, \dots, a_{3m}$ and $B$ such that $\sum_i a_i = mB$
  Does there exist a partition $I_1, \dots, I_m$ of $\{1, \dots, 3m\}$ such that for all $j \in \{1, \dots, m\}$, $|I_j| = 3$ and $\sum_{i \in I_j} a_i = B$?

Latency minimization (2)
3-partition: renumbering of the $a_i$ such that:
$$\begin{cases} a_{1,1} + a_{1,2} + a_{1,3} = B \\ a_{2,1} + a_{2,2} + a_{2,3} = B \\ \quad\vdots \\ a_{m,1} + a_{m,2} + a_{m,3} = B \end{cases}$$
Reduction: can we obtain a latency $L^0 \le B$?
Equivalence of problems

Bi-criteria complexity results
Columns: proc-hom com-hom | proc-het special-app | proc-het com-hom | proc-het com-het
(entries listed left to right; an entry may span several columns)
Period/latency minimization:
  one-to-one or interval: polynomial | NP-complete
Power/period minimization:
  one-to-one: polynomial (minimum matching) | NP-complete
  interval: polynomial | NP-complete

Power/period minimization
Problem: one-to-one mapping - many applications - communication homogeneous platform - power minimization for a given array of periods
Solution: minimum weighted matching of a bipartite graph
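To make the matching formulation concrete, the sketch below builds a bipartite cost matrix (stages on one side, processors on the other) and finds the cheapest one-to-one assignment. Several things here are assumptions rather than the talk's method: the edge cost is taken to be the power of the slowest mode of a processor that meets the stage's period bound, communication constraints are assumed to have been checked separately (they do not depend on the processor on a communication homogeneous platform), static power is set to zero by default, and the matching is found by brute force over permutations for readability. A polynomial algorithm such as the Hungarian method would replace the brute-force loop in practice. All names are hypothetical.

```python
import math
from itertools import permutations


def cheapest_mode(weight, period, modes, alpha=3.0, p_stat=0.0):
    """Lowest power among the modes fast enough to meet the period bound,
    or +inf when no mode is fast enough."""
    feasible = [s ** alpha + p_stat for s in modes if weight / s <= period]
    return min(feasible, default=math.inf)


def min_power_one_to_one(stages, processors, alpha=3.0):
    """stages: list of (weight, period_bound), one entry per stage of any application.
    processors: list of mode lists, one list of speeds per processor.
    Returns the minimum total power of a one-to-one mapping (+inf if none exists)."""
    cost = [
        [cheapest_mode(w, t, modes, alpha) for modes in processors]
        for w, t in stages
    ]
    best = math.inf
    # Brute-force stand-in for a minimum weighted bipartite matching.
    for perm in permutations(range(len(processors)), len(stages)):
        best = min(best, sum(cost[i][u] for i, u in enumerate(perm)))
    return best


if __name__ == "__main__":
    stages = [(6, 3), (2, 2)]        # (weight, period bound of its application)
    processors = [[1, 2], [1, 3]]    # available speeds per processor
    print(min_power_one_to_one(stages, processors))
```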

Single application (1)
Problem: interval mapping - single application - fully homogeneous platform - power minimization for a given period
$P(i, j, k)$: minimum power to run stages $S^i$ to $S^j$ using exactly $k$ processors → looking for $\min_{1 \le k \le p} P(1, n, k)$
Recurrence relation:
$$P(i, j, k) = \min_{i \le \ell \le j-1} \big( P(i, \ell, k-1) + P(\ell+1, j, 1) \big)$$

Single application (2)
$P(i, i, q) = +\infty$ if $q > 1$
$F_i^j$: possible powers of a processor running the stages $S^i$ to $S^j$, fulfilling the period constraint
$$F_i^j = \left\{ P_{dyn}(s_\ell) + P_{stat} \ :\ \max\left( \frac{\delta^{i-1}}{b},\ \frac{\sum_{k=i}^{j} w^k}{s_\ell},\ \frac{\delta^{j}}{b} \right) \le T,\ \ell \in \{1, \dots, m\} \right\}$$
$$P(i, j, 1) = \begin{cases} \min F_i^j & \text{if } F_i^j \ne \emptyset \\ +\infty & \text{otherwise} \end{cases}$$
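The dynamic program for a single application can be sketched in Python as follows. This is a memoized transcription under stated assumptions, not the authors' implementation: the split index is taken to range over $i \le \ell \le j-1$, the case of more processors than stages is treated as infeasible (which covers $P(i, i, q) = +\infty$ for $q > 1$), and the function and parameter names are hypothetical.

```python
import math
from functools import lru_cache


def min_power_single_app(w, delta, speeds, b, T, p, alpha=3.0, p_stat=0.0):
    """w[1..n]: stage weights (w[0] unused); delta[0..n]: data sizes;
    speeds: available modes; b: bandwidth; T: period bound; p: processors.
    Returns min over k in 1..p of P(1, n, k), or +inf if the period cannot be met."""
    n = len(w) - 1
    prefix = [0.0]
    for x in w[1:]:
        prefix.append(prefix[-1] + x)

    def one_processor(i, j):
        # P(i, j, 1) = min F_i^j, or +inf when F_i^j is empty.
        if max(delta[i - 1], delta[j]) / b > T:
            return math.inf
        work = prefix[j] - prefix[i - 1]
        feasible = [s ** alpha + p_stat for s in speeds if work / s <= T]
        return min(feasible, default=math.inf)

    @lru_cache(maxsize=None)
    def P(i, j, k):
        if k == 1:
            return one_processor(i, j)
        if j - i + 1 < k:
            return math.inf  # more processors than stages
        return min(P(i, l, k - 1) + P(l + 1, j, 1) for l in range(i, j))

    return min(P(1, n, k) for k in range(1, p + 1))


if __name__ == "__main__":
    # Three stages of weights 2, 4, 6; unit data sizes; period bound 3.
    print(min_power_single_app([0, 2, 4, 6], [1, 1, 1, 1], [1, 2, 4], b=1, T=3, p=3))
```

In this toy instance the best choice splits the pipeline into stages 1-2 and stage 3, both at speed 2, rather than running everything on one processor at speed 4.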

Many applications (1)
Problem: interval mapping - fully homogeneous platform - power minimization for a given period per application
$P_a^q$: minimum power consumed by $q$ processors so that the period constraint on application $a$ is met, found by the previous dynamic programming
$P(a, k)$: minimum power consumed by $k$ processors on the applications $1, \dots, a$ (unknown)
Initialization: $\forall k \in \{1, \dots, p\}$, $P(1, k) = P_1^k$

Many applications (2)
Recurrence:
$$P(a, k) = \min_{1 \le q < k} \big( P(a-1, k-q) + P_a^q \big)$$
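A minimal sketch of this second dynamic program is given below. It takes the per-application values $P_a^q$ as an input table and fills $P(a, k)$ with the initialization and recurrence from the slides. Returning the minimum over all $k \le p$ at the end is an assumption, mirroring the single-application case; the function name and the table layout (`per_app[a][q]`, 0-based applications, index `q = 0` unused) are hypothetical.

```python
import math


def min_power_many_apps(per_app, p):
    """per_app[a][q]: minimum power for application a on q processors
    (q = 0 is an unused placeholder). p: number of processors.
    Returns the minimum total power over all applications, or +inf."""
    n_apps = len(per_app)
    P = [[math.inf] * (p + 1) for _ in range(n_apps)]

    # Initialization: P(1, k) = P_1^k for every k in 1..p.
    for k in range(1, p + 1):
        P[0][k] = per_app[0][k]

    # Recurrence: P(a, k) = min over 1 <= q < k of P(a-1, k-q) + P_a^q.
    for a in range(1, n_apps):
        for k in range(1, p + 1):
            P[a][k] = min(
                (P[a - 1][k - q] + per_app[a][q] for q in range(1, k)),
                default=math.inf,
            )

    return min(P[n_apps - 1][1:])


if __name__ == "__main__":
    inf = math.inf
    per_app = [
        [inf, 64, 16, 17],  # application 1 on 1, 2, 3 processors
        [inf, 27, 9, 10],   # application 2 on 1, 2, 3 processors
    ]
    print(min_power_many_apps(per_app, p=3))
```

Here the cheapest split gives two processors to the first application and one to the second.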

Tri-criteria complexity results
Columns: proc-hom com-hom | proc-het special-app | proc-het com-hom | proc-het com-het
  one-to-one or interval: NP-complete (all columns)
Reduction from 2-partition (instance of 2-partition: $a_1, a_2, \dots, a_n$ with $\sigma = \sum_{i=1}^{n} a_i$)

Problem instance
One-to-one mapping - fully homogeneous platform
$P^0 = P^* + \alpha X (\sigma/2 + 1/2)$, $L^0 = L^* - X (\sigma/2 - 1/2)$, $T^0 = L^0$
where $P^*$ and $L^*$ are power and latency when each $S^i$ is run at speed $s_{2i-1}$

Main ideas
$K$ big enough and $X$ small enough so that the stage $S^i$ must be processed at speed $s_{2i-1}$ or $s_{2i}$
For a subset $\mathcal{I}$ of $\{1, \dots, n\}$, if ($S^i$ is run at speed $s_{2i}$ ⇔ $i \in \mathcal{I}$):
$$P = P^* + \sum_{i \in \mathcal{I}} \big( \alpha a_i X + o(X) \big), \qquad L = L^* - \sum_{i \in \mathcal{I}} \big( a_i X - o(X) \big)$$
Recall: $P^0 = P^* + \alpha X (\sigma/2 + 1/2)$, $L^0 = L^* - X (\sigma/2 - 1/2)$

And for general mappings with resource sharing?
Exhaustive complexity study with no resource sharing: new polynomial algorithms for multiple applications and NP-completeness results
With the simplified latency model, tri-criteria polynomial dynamic programming algorithm with no resource sharing and speed-homogeneous platforms
With resource sharing or speed-heterogeneous platforms, all problem instances are NP-hard, even for period minimization alone

Heuristics
Tri-criteria problem: power consumption minimization given a bound on period and latency per application, on speed-heterogeneous platforms
Each heuristic (except H2) exists in two variants, interval mapping without resource sharing and general mapping with resource sharing, in order to evaluate the impact of processor reuse
Latency model of Özgüner: $L = (2m - 1)\,T$
H1: random cuts
H2: one entire application per processor (assignment problem)
H2-split: interval splitting
H3: two-step heuristic: choose a speed distribution and find a valid mapping (variants on both steps)

H3-energy
1. Fix processor speeds
2. Mapping heuristic: find a valid mapping
3. Iterate the process: increase processor speeds

Experimental plan
Integer linear program (ILP) to assess the absolute performance of the heuristics on small instances
Small instances: two or three applications, around 15 stages per application, around 8 processors
Execution time on 30 small instances: less than one second for all heuristics, one week for the ILP
Each heuristic and the ILP: variant without sharing ("-n") and variant with sharing ("-r")
General behavior of heuristics
Impact of resource sharing
Scalability of heuristics

Increasing latency
[Figure: 1/Energy as a function of nbInter; curves: cplex-r, H2, H3-upDown-r, H3-energy-r, H1-r, H2-split-r, H3-speed-r, best]

Increasing number of processors
[Figure: 1/Energy as a function of nbProcs; curves: cplex-r, H2, H3-upDown-r, H3-energy-r, H1-r, H2-split-r, H3-speed-r, best]

Impact of static power
[Figure: 1/Energy as a function of max Estat; curves: cplex-n, H1-r, H2-split-n, H3-upDown-n, H1-n, H2, H2-split-r, H3-upDown-r]

Impact of mode distribution
[Figure: 1/Energy as a function of $s_{u,l+1} - s_{u,l}$; curves: cplex-n, H1-r, H2-split-n, H3-upDown-n, H1-n, H2, H2-split-r, H3-upDown-r]

Scalability
[Figure: Energy as a function of nbApp; curves: H1-r, H2-split-r, H3-speed-r, best, H2, H3-upDown-r, H3-energy-r]

Summary of experiments
Efficient heuristics: best heuristic always at 90% of the optimal solution on small instances
Supremacy of H2-split-r: better on average, and even better as problem instances get larger
H3 has a smaller execution time (one second versus three minutes for 20 applications); ILP not usable in practice
Resource sharing becomes crucial with high static power (use fewer processors) or with distant modes (better use of all available speeds)

Conclusion and future work
Exhaustive complexity study
  new polynomial algorithms
  new NP-completeness proofs
  impact of model on complexity (tri-criteria homogeneous)
Experimental study
  efficient heuristics
  impact of resource reuse
Current/future work
  continuous speeds
  approximation algorithms

Performance and energy optimization of concurrent pipelined applications
Anne Benoit, Paul Renaud-Goud and Yves Robert
Institut Universitaire de France
ROMA team, LIP
Ecole Normale Sup
