A Comparative Study of Modulo Scheduling Techniques

Josep M. Codina, Josep Llosa and Antonio González
Dept. of Computer Architecture, Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
E-mail: {jmcodina,josepll,antonio}@ac.upc.es

INTRODUCTION
Software Pipelining
• Instruction scheduling for VLIW/superscalar processors
• VLIW processors in the DSP market
• EPIC/IPF
• Loop scheduling: software pipelining
• Loops consume most of the application's execution time
• Software pipelining a loop is an NP-complete problem
• Software pipelining is a big family of techniques
• Modulo scheduling is based on heuristics

Motivation
• Modulo scheduling is a framework within which techniques are defined
• Different factors have to be taken into account
• Many techniques fit in the framework, built on different ideas
• Proposals in the literature are evaluated without common
  • Platform (i.e. compiler)
  • Benchmarks
  • Target architectures
  • Measures
• Lack of a thorough comparison

Objectives
• Perform a comparison of state-of-the-art MS techniques
  • Qualitative
  • Quantitative
• The work is targeted at compiler writers
  • Is one of the techniques better than the others for all architectures?
  • Which is the most powerful technique for a given architecture?

Talk Outline
• Modulo Scheduling Background
• Selection Criteria
• Techniques Compared
• Study Environment
• Results
• Conclusions

MODULO SCHEDULING
Basic Ideas
[Figure: overlapped execution of Iterations 1 to 4. Each iteration is divided into Stages 1 to 3, consecutive iterations start one Initiation Interval (II) apart, and the overlapped code forms a Prolog, a Kernel and an Epilog.]
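The stage structure in the figure can be illustrated with a minimal sketch. It assumes a hypothetical single-iteration schedule (the operation names and cycles below are made up) and uses the standard relation that an operation starting at cycle c falls in stage c div II and kernel row c mod II.

```python
def split_into_stages(start_cycle, ii):
    """Map each operation's start cycle in one iteration to (stage, kernel row)."""
    return {op: divmod(cycle, ii) for op, cycle in start_cycle.items()}


# Hypothetical single-iteration schedule: operation -> start cycle (not from the slides).
schedule = {"load": 0, "mul": 2, "add": 4, "store": 5}

placement = split_into_stages(schedule, ii=2)

# Number of stages spanned by one iteration of this schedule.
stage_count = max(stage for stage, _ in placement.values()) + 1
```

Operations that share a kernel row compete for the same resources in the steady state, whatever stage they belong to.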

Basic Scheme
1. Find MII and set II = MII
2. Look for a schedule
3. Found it? No: increase the II and look for a schedule again. Yes: finished.

Basic Scheme: find MII and set II = MII
MII depends on
• Resources
• Recurrences
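A minimal sketch of the two bounds, under stated assumptions: every functional unit is fully pipelined (one operation issued per unit per cycle), and the recurrences are already given as (total latency, total distance in iterations) pairs. The function names and the example numbers are hypothetical.

```python
from math import ceil


def res_mii(uses, units):
    """Resource bound: each resource class must fit its uses into II cycles."""
    return max(ceil(uses[r] / units[r]) for r in uses)


def rec_mii(recurrences):
    """Recurrence bound: a cycle of latency L spanning D iterations needs II >= L / D."""
    return max((ceil(lat / dist) for lat, dist in recurrences), default=1)


def mii(uses, units, recurrences):
    """MII is the larger of the resource and recurrence bounds."""
    return max(res_mii(uses, units), rec_mii(recurrences))


# Hypothetical loop: 3 memory ops on 2 ports, 4 FP ops on 2 FP units,
# and two recurrences given as (latency, distance).
example_mii = mii({"mem": 3, "fp": 4}, {"mem": 2, "fp": 2}, [(6, 2), (5, 1)])
```

In this example the recurrence with latency 5 and distance 1 dominates, so the search would start at II = 5.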

Basic Scheme: look for a schedule
• Ordering the nodes
• Finding a feasible cycle
  • Top-down/bottom-up
  • Bi-directional
• When there is no feasible cycle
  • Use of backtracking
  • Increase the II

Basic Scheme: found it?
Can we meet the constraints?
• Resources
• Dependences

Basic Scheme: effect of the II
• The larger the II, the more likely it is that a schedule is found
• The larger the II, the lower the performance
• The II is lower than the length of a single iteration
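The loop "look for a schedule, increase the II on failure" can be sketched as follows. This is a deliberately simple top-down scheduler without backtracking, not any of the techniques compared in the talk. It assumes a single resource class with `units` issue slots per cycle, and operations listed in topological order of their intra-iteration dependences. All names are hypothetical.

```python
def try_schedule(ops, edges, units, ii):
    """Try to place every op at a cycle for the given II; return op -> cycle or None.

    edges are (src, dst, latency, distance) dependences.
    """
    mrt = [0] * ii  # modulo reservation table: units in use per kernel row
    cycle = {}
    for op in ops:
        # Earliest cycle allowed by already scheduled predecessors.
        earliest = max((cycle[s] + lat - dist * ii
                        for s, d, lat, dist in edges if d == op and s in cycle),
                       default=0)
        earliest = max(earliest, 0)
        # Scanning II consecutive cycles visits every kernel row once.
        for t in range(earliest, earliest + ii):
            if mrt[t % ii] < units:
                mrt[t % ii] += 1
                cycle[op] = t
                break
        else:
            return None  # no free row: resource constraint cannot be met
    # Check every dependence, including loop-carried ones.
    ok = all(cycle[d] + dist * ii >= cycle[s] + lat for s, d, lat, dist in edges)
    return cycle if ok else None


def modulo_schedule(ops, edges, units, start_ii, max_ii=64):
    """Increase the II from start_ii until a schedule is found."""
    ii = start_ii
    while ii <= max_ii:
        result = try_schedule(ops, edges, units, ii)
        if result is not None:
            return ii, result
        ii += 1
    return None


# Hypothetical loop a -> b -> c with a loop-carried dependence c -> a.
edges = [("a", "b", 2, 0), ("b", "c", 1, 0), ("c", "a", 1, 1)]
# Starting from the resource bound alone (3 ops on 1 unit) to show one II increase.
found = modulo_schedule(["a", "b", "c"], edges, units=1, start_ii=3)
```

At II = 3 the loop-carried dependence is violated, so the driver retries at II = 4 and succeeds.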

Backtracking
• Not always beneficial
  • Can produce better schedules
  • Can merely lengthen the search for a schedule
  • In some cases, there is no feasible schedule for a given II
• Backtracking must be limited
  • BudgetRatio: ratio of the maximum number of operation scheduling steps attempted before increasing the II to the number of operations in the loop
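A small sketch of how such a limit might be applied, assuming the usual iterative modulo scheduling convention that the budget is BudgetRatio times the number of operations in the loop. The function name and the 20-operation loop are hypothetical.

```python
def scheduling_budget(budget_ratio, num_ops):
    """Maximum number of operation scheduling steps allowed for one value of II."""
    return int(budget_ratio * num_ops)


# Budgets for a hypothetical 20-operation loop at several BudgetRatio values.
budgets = {ratio: scheduling_budget(ratio, 20) for ratio in (1, 2.5, 5, 10)}
```

Once the step counter for the current II reaches this budget, the scheduler gives up and increases the II.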

SELECTION CRITERIA
• Value of the code generated
  • Parallelism
  • Register pressure
  • Code size
  • Execution time
• Effectiveness/cost of the technique

Selection Criteria: parallelism
• Effectiveness in exploiting ILP
• What is the difference between II and MII?

Selection Criteria: register pressure
• Software pipelining puts high demands on registers
• How many registers are needed?
• How many loops fit within a given number of registers?

Selection Criteria: code size
• Crucial in embedded domains
• Stages of a schedule

Selection Criteria: execution time
• Main objective

Selection Criteria: effectiveness/cost of the technique
• Can all the loops be scheduled?
• Compilation time

TECHNIQUES COMPARED
Techniques
• Modulo scheduling techniques
  • Iterative Modulo Scheduling (IMS)
  • Swing Modulo Scheduling (SMS)
  • Slack Modulo Scheduling (Slack MS)
  • Integrated Register-sensitive Iterative Software Pipelining method (IRIS)
• Complementary techniques
  • Stage Modulo Scheduling (Stage MS)

Stage MS
• Post-pass that can be applied after an MS technique
• To reduce the register pressure
• Without increasing the II
• Moves operations by II
• Various heuristics. We selected the 3UP+RSS heuristic
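Why moving an operation by II leaves the II unchanged can be seen in a small sketch. It is not the 3UP+RSS heuristic. It assumes a hypothetical two-operation schedule and measures a value's lifetime simply as the distance between the producer's and the consumer's start cycles.

```python
def move_by_stages(start_cycle, op, stages, ii):
    """Return a copy of the schedule with op delayed by a whole number of stages."""
    moved = dict(start_cycle)
    moved[op] = start_cycle[op] + stages * ii
    return moved


def lifetime(start_cycle, producer, consumer):
    """Simplified lifetime: cycles between producer issue and consumer issue."""
    return start_cycle[consumer] - start_cycle[producer]


II = 2
# Hypothetical schedule: the load's value is consumed by the add.
before = {"load": 0, "add": 5}
after = move_by_stages(before, "load", stages=1, ii=II)

same_row = before["load"] % II == after["load"] % II
```

The load keeps its kernel row, so resource usage is untouched, while the value it produces is live for 3 cycles instead of 5. A real post-pass must also check that the move still respects the operation's latency and its other dependences.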

Main Differences
Order of nodes
• IMS: top-down
• SMS: priority to recurrences; no pred. and succ. scheduled in the partial schedule
• Slack MS: dynamic; based on slack
• IRIS: top-down
Finding a cycle
• IMS: top-down
• SMS: bi-directional; close to pred or succ
• Slack MS: bi-directional; close to pred or succ depending on the benefit
• IRIS: Stage MS heuristics
Backtracking
• IMS: yes
• SMS: no
• Slack MS: yes
• IRIS: yes

Qualitative Comparison
Parallelism
• IMS: backtracking
• SMS: order
• Slack MS: order; backtracking
• IRIS: backtracking
Register pressure
• IMS: no
• SMS: order; bi-directional
• Slack MS: bi-directional
• IRIS: Stage heuristics
Code size
• IMS: yes
• SMS: yes
• Slack MS: yes
• IRIS: yes
Effectiveness/cost
• IMS: backtracking
• SMS: no backtracking
• Slack MS: backtracking
• IRIS: backtracking

STUDY ENVIRONMENT
Environment
• Platform (i.e. compiler): ICTINEO
• Benchmarks: 1936 loops
  • SPECfp95
  • Perfect Club
• Target architectures
  • Several architectures of varying complexity, from less constrained to more constrained
    • Low Complexity architecture
    • Medium Complexity architecture
    • Complex architecture

Architectures Description
Architectures: Low Complexity, Medium Complexity, Complex Architecture
Features listed:
• Fully pipelined simple ops
• Fully pipelined ops
• Non-pipelined complex ops
• 8-issue
• 4-issue
• 4 write ports
• Unlimited register ports
• 8 read ports
• 2 memory ports
• 2 Int FU and 2 FP FU

Latencies (cycles)
Operation               Low/Medium   Complex
MEM                     2            3
INT ADD, SUB, COMP      1            3
INT MUL                 2            4
INT DIV, MOD, SQRT      6            8
FP ADD, SUB, COMP       3            5
FP MUL                  6            8
FP DIV, MOD, SQRT       18           20

Methodology
• Study of the BudgetRatio for each architecture: 1, 2.5, 5 and 10
  • Effectiveness
  • Performance
  • Cost
• Measures for each technique with and without Stage MS
  • Effectiveness and cost
  • Parallelism
  • Register pressure
  • Code size
  • Execution time

RESULTS
BudgetRatio Study
[Figure: plots for the Low Complexity, Medium Complexity and Complex architectures, with panels Performance, Effectiveness and Cost. Axes: Sum II / Sum MII, % non-scheduled ops and total time, each against BudgetRatio (1, 2.5, 5, 10). Series: IMS, IRIS, Slack.]

II vs MII
[Figure: Average (II/MII), on an axis from 1 to 1.014, for IMS, SMS, IRIS and Slack on the Low and Medium complexity architectures.]

Register Pressure
[Figure: MaxLive/MinAvg, on an axis from 1 to 1.9, for IMS, IMS+ST, SMS, SMS+ST, IRIS, IRIS+ST, Slack and Slack+ST on the Low and Medium complexity architectures.]
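MaxLive, the numerator of the metric above, is the largest number of values simultaneously live in any kernel row. A minimal sketch follows, assuming each value's lifetime is given as a half-open interval [start, end) of cycles within one iteration. The function name and the example lifetimes are hypothetical.

```python
def max_live(lifetimes, ii):
    """Largest number of simultaneously live values in any kernel row."""
    live = [0] * ii
    for start, end in lifetimes:
        # In the steady state, a value live at cycle t occupies a register in row t mod II.
        for t in range(start, end):
            live[t % ii] += 1
    return max(live)


# Three hypothetical lifetimes in a loop scheduled with II = 2.
pressure = max_live([(0, 3), (2, 4), (3, 6)], ii=2)
```

A lifetime longer than II is counted more than once in the same row, because copies of the value from successive iterations are live at the same time.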

Execution Time
[Figure: execution cycles (millions) per technique (IMS, IMS+ST, SMS, SMS+ST, IRIS, IRIS+ST, Slack, Slack+ST). Low Complexity architecture: axis from 28000 to 28700. Medium Complexity architecture: axis from 28000 to 42000.]

Complex Architecture
[Figure: three charts per technique for the Complex architecture. II vs MII: Average (II/MII), axis from 1 to 1.07, for IMS, SMS, IRIS and Slack. MaxLive/MinAvg: axis from 1 to 1.8, for each technique with and without Stage MS (+ST). Execution cycles (millions): axis from 28000 to 42000, for each technique with and without Stage MS (+ST).]
