CS553 Lecture: Instruction Scheduling II

Instruction Scheduling

Last time

– Instruction scheduling using list scheduling

Today

– Improvements on list scheduling
– Register renaming
– Unrolling
– Software pipelining


Improving Instruction Scheduling

Techniques

– Register renaming
– Scheduling loads
– Loop unrolling
– Software pipelining
– Predication and speculation (next week)

The first four deal with data hazards; predication and speculation deal with control hazards.


Register Renaming

Idea

– Reduce false data dependences by reducing register reuse
– Give the instruction scheduler greater freedom

Example

Original ($r1 is reused for two unrelated values):

    add $r1, $r2, 1
    st  $r1, [$fp+52]
    mul $r1, $r3, 2
    st  $r1, [$fp+40]

After renaming (the second value gets $r11):

    add $r1, $r2, 1
    st  $r1, [$fp+52]
    mul $r11, $r3, 2
    st  $r11, [$fp+40]

After rescheduling (the mul no longer waits for the first st):

    add $r1, $r2, 1
    mul $r11, $r3, 2
    st  $r1, [$fp+52]
    st  $r11, [$fp+40]
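The renaming step can be sketched mechanically: each redefinition of a destination register gets a fresh name, and later uses are patched to read the newest definition. This is a minimal illustration, not any particular compiler's algorithm; the tuple encoding of instructions is made up for this example.

```python
def rename(code):
    """code: list of (op, dest, srcs) tuples; dest is None for stores."""
    current = {}    # architectural register -> its newest name
    counter = {}    # how many times each register has been defined
    out = []
    for op, dest, srcs in code:
        # Sources read the most recent definition of each register.
        srcs = tuple(current.get(s, s) for s in srcs)
        if dest is not None:
            n = counter.get(dest, 0)
            counter[dest] = n + 1
            # A redefinition gets a fresh name, breaking WAW/WAR deps.
            current[dest] = dest if n == 0 else f"{dest}{n}"
            dest = current[dest]
        out.append((op, dest, srcs))
    return out

code = [
    ("add", "$r1", ("$r2", "1")),
    ("st",  None,  ("$r1", "$fp+52")),
    ("mul", "$r1", ("$r3", "2")),      # reuses $r1: false dependence
    ("st",  None,  ("$r1", "$fp+40")),
]
renamed = rename(code)
# After renaming, the mul writes $r11, so it can move above the first st.
```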


Scheduling Loads

Reality

– Loads can take many cycles (slow caches, cache misses)
– Many cycles may be wasted

Most modern architectures provide non-blocking (delayed) loads

– Loads never stall
– Instead, the use of a register stalls if the value is not yet available
– The scheduler should try to place loads well before the use of the target register
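Under a simplified single-issue model (an assumption of this sketch, not something the slides specify), each independent instruction placed between a non-blocking load and the first use of its register hides one cycle of load latency:

```python
def stall_cycles(load_latency, insts_between):
    """Cycles the use stalls, assuming the load issues at cycle 0, its
    result is ready at cycle `load_latency`, and the use issues after
    `insts_between` single-cycle independent instructions.
    (Simplified single-issue model for illustration.)"""
    use_cycle = 1 + insts_between
    return max(0, load_latency - use_cycle)

# With a 5-cycle load: using the value immediately stalls 4 cycles;
# 4 independent instructions in between hide the latency completely.
```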


Scheduling Loads (cont)

Hiding latency

– Place independent instructions behind loads
– How many instructions should we insert? Depends on latency
– The difference between a cache miss and a cache hit is growing
– If we underestimate latency: stall waiting for the load
– If we overestimate latency: hold the register longer than necessary (wasted parallelism)

[Slide figure: two schedules over cycles 1–8; in one, load r1, load r2, and add r3 are issued back to back, stalling on the loads; in the other, both loads are issued early so their latency is overlapped before add r3.]

Balanced Scheduling [Kerns and Eggers’92]

Idea

– Impossible to know the latencies statically
– Instead of estimating latency, balance the ILP (instruction-level parallelism) across all loads
– Schedule for characteristics of the code instead of for characteristics of the machine

Balancing load

– Compute load-level parallelism (LLP):

  LLP = 1 + (# independent instructions) / (# loads that can use this parallelism)
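As a sketch, the formula is just:

```python
def load_level_parallelism(num_independent, num_loads):
    # LLP = 1 + (# independent instructions) / (# loads that can use them)
    return 1 + num_independent / num_loads

# The example on the next slide: 4 independent instructions
# shared by 2 loads gives LLP = 1 + 4/2 = 3 for each load.
llp = load_level_parallelism(4, 2)
```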


Balanced Scheduling Example

Example

Loads L0 and L1 share four independent instructions among X0–X4:

  LLP for L0 = 1 + 4/2 = 3
  LLP for L1 = 1 + 4/2 = 3

– List scheduling with an optimistic latency estimate (w = 1): L0 L1 X0 X1 X2 X3 X4
– List scheduling with a pessimistic latency estimate (w = 5): L0 X0 X1 X2 X3 L1 X4
– Balanced scheduling: L0 X0 X1 L1 X2 X3 X4

Balanced scheduling spreads the independent instructions evenly behind the two loads instead of committing to a single latency estimate.

Loop Unrolling

Idea

– Replicate the body of the loop and iterate fewer times
– Reduces loop overhead (test and branch)
– Creates a larger loop body ⇒ more scheduling freedom

Example

L:  ldf   [r1], f0
    fadds f0, f1, f2
    stf   f2, [r1]
    sub   r1, 4, r1
    cmp   r1, 0
    bg    L
    nop

Cycles per iteration: 12

[Slide figure: the cycle-by-cycle schedule (cycles 1–16 axis), showing a stall waiting for the fadds result, with sub, cmp, bg, nop marked as loop overhead.]

Loop Unrolling Example

Sample loop

L:  ldf   [r1], f0
    fadds f0, f1, f2
    ldf   [r1-4], f10
    fadds f10, f1, f12
    stf   f2, [r1]
    stf   f12, [r1-4]
    sub   r1, 8, r1
    cmp   r1, 0
    bg    L
    nop

Cycles per iteration: 14/2 = 7 (71% speedup!)

The larger window lets us hide some of the latency of the fadds instruction.

[Slide figure: the cycle-by-cycle schedule for the unrolled loop (cycles 1–16 axis), with the loop overhead (sub, cmp, bg, nop) now amortized over two iterations.]
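At source level, the transformation corresponds to processing two elements per trip. A hypothetical Python analogue (assuming the trip count is a multiple of the unroll factor, so no cleanup loop is needed):

```python
def scale_add(a, c):
    for i in range(len(a)):
        a[i] = a[i] + c

def scale_add_unrolled2(a, c):
    # Unrolled by 2: half the test-and-branch overhead, and the two
    # copies of the body are independent, so a scheduler can interleave
    # them to hide latency.
    assert len(a) % 2 == 0           # simplification: no cleanup code
    for i in range(0, len(a), 2):
        a[i]     = a[i] + c
        a[i + 1] = a[i + 1] + c
```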


Loop Unrolling Summary

Benefit

– Loop unrolling allows us to schedule code across iteration boundaries, providing more scheduling freedom

Issues

– How much unrolling should we do?
– Try various unrolling factors and see which provides the best schedule?
– Unroll as much as possible within a code expansion budget?
– An alternative: software pipelining


Software Pipelining

Basic idea

– Software pipelining is a systematic approach to scheduling across iteration boundaries without doing loop unrolling
– Try to move the long-latency instructions to previous iterations of the loop
– Use independent instructions to hide their latency
– Three parts of a software pipeline:
  – Kernel: steady-state execution of the pipeline
  – Prologue: code to fill the pipeline
  – Epilogue: code to empty the pipeline
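The three parts can be seen in a toy Python rendering of a two-stage pipeline, where the "load" of iteration i+1 overlaps the "use" of iteration i. The stage split is illustrative, not taken from the slides:

```python
def pipelined(xs, use):
    """Overlap the load of iteration i+1 with the use of iteration i.
    Indexing xs stands in for a long-latency load; `use` stands in
    for the instruction that consumes the loaded value."""
    out = []
    if not xs:
        return out
    r = xs[0]                  # prologue: fill the pipeline (first load)
    for i in range(1, len(xs)):
        nxt = xs[i]            # kernel: issue the load for iteration i...
        out.append(use(r))     # ...while using the value loaded last time
        r = nxt
    out.append(use(r))         # epilogue: drain the pipeline (last use)
    return out
```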


Visualizing Software Pipelining


Software Pipelining versus Loop Unrolling


SW Pipelining (Step 1: Construct DAG and Assign Registers)

int A[100], B[100], C[100];
for (i = 0; i < 100; i++) {
  B[i] = A[i] + B[i] + C[i];
}
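Step 1 for this loop yields a small per-iteration dependence DAG. The instruction names and virtual registers below are made up for illustration; any topological order of the DAG is a legal schedule for one iteration:

```python
import graphlib   # stdlib topological sorting (Python 3.9+)

# Per-iteration instructions for B[i] = A[i] + B[i] + C[i]
# (hypothetical virtual-register assignment):
#   ld1:  ra <- load A[i]     ld2: rb <- load B[i]    ld3: rc <- load C[i]
#   add1: t1 <- ra + rb       add2: t2 <- t1 + rc
#   st:   store t2 -> B[i]
deps = {                      # node -> the nodes it depends on
    "add1": {"ld1", "ld2"},
    "add2": {"add1", "ld3"},
    "st":   {"add2"},
}
order = list(graphlib.TopologicalSorter(deps).static_order())
# `order` lists the three loads before add1, add1 before add2,
# and the store last.
```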


SW Pipelining (Step 2: “Unroll”, Schedule, Find Pattern)

This pattern does not work!!


SW Pipelining (Step 3: Satisfy register constraints)


SW Pipelining and Loop Unrolling Summary

– Unrolling removes branching overhead and helps tolerate data-dependence latency
– Software pipelining maintains maximum parallelism in steady state through continuous tolerance of data-dependence latency
– Both work best with parallel loops, getting ILP by taking instructions from different iterations


Software Pipelining

Complications

– What if there is control flow within the loop?
  – Use control-flow profiles to identify the most frequent path through the loop
  – Optimize for the most frequent path
– How do we identify the most frequent path? Profiling


Concepts

Improving instruction scheduling

– Register renaming
– Balanced load scheduling
– Loop unrolling

Instruction scheduling across basic blocks

– Software pipelining


Next Time

Lecture

– More instruction scheduling
– Profiling
– Trace scheduling