
Instruction Scheduling II (CS553 Lecture)


  1. Instruction Scheduling (CS553 Lecture, Instruction Scheduling II)

     Last time
     – Instruction scheduling using list scheduling

     Today
     – Improvements on list scheduling
     – Register renaming
     – Unrolling
     – Software pipelining

     Improving Instruction Scheduling

     Techniques
     – Register renaming
     – Scheduling loads
     – Loop unrolling
     – Software pipelining
     – Predication and speculation (next week)

     Side labels on the slide group these techniques into those that deal with data hazards and those that deal with control hazards.

  2. Register Renaming

     Idea
     – Reduce false data dependences by reducing register reuse
     – Give the instruction scheduler greater freedom

     Example

       Original code         After renaming         After rescheduling
       add $r1, $r2, 1       add $r1, $r2, 1        add $r1, $r2, 1
       st  $r1, [$fp+52]     st  $r1, [$fp+52]      mul $r11, $r3, 2
       mul $r1, $r3, 2       mul $r11, $r3, 2       st  $r1, [$fp+52]
       st  $r1, [$fp+40]     st  $r11, [$fp+40]     st  $r11, [$fp+40]

     Scheduling Loads

     Reality
     – Loads can take many cycles (slow caches, cache misses)
     – Many cycles may be wasted

     Most modern architectures provide non-blocking (delayed) loads
     – Loads never stall
     – Instead, the use of a register stalls if the value is not yet available
     – The scheduler should try to place loads well before the use of the target register
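
The register-renaming example above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's algorithm: the tuple encoding `(opcode, dest, sources)`, the function name `rename_registers`, and the pool of fresh names starting at `$r11` are all assumptions made for the example. It also assumes that no renamed register is live after the block and treats memory operands such as `[$fp+52]` as opaque strings.

```python
from itertools import count

def rename_registers(block, fresh_names):
    """Rename redefinitions of a register within one basic block.

    block: list of (opcode, dest, sources); dest is None for stores.
    fresh_names: iterator of unused register names (assumed to be available).
    Assumption: no renamed register is live after the block.
    """
    current = {}     # original register -> name that holds its latest value
    defined = set()  # registers already written earlier in this block
    renamed = []
    for opcode, dest, sources in block:
        # A use reads whichever name currently holds the register's value.
        new_sources = tuple(current.get(s, s) for s in sources)
        new_dest = dest
        if dest is not None:
            if dest in defined:
                # Reusing the register would add a false (WAR/WAW) dependence,
                # so write to a fresh name instead.
                new_dest = next(fresh_names)
            defined.add(dest)
            current[dest] = new_dest
        renamed.append((opcode, new_dest, new_sources))
    return renamed

# The four-instruction block from the slide.
block = [
    ("add", "$r1", ("$r2", "1")),
    ("st",  None,  ("$r1", "[$fp+52]")),
    ("mul", "$r1", ("$r3", "2")),
    ("st",  None,  ("$r1", "[$fp+40]")),
]
fresh = (f"$r{n}" for n in count(11))  # hypothetical pool: $r11, $r12, ...
renamed = rename_registers(block, fresh)
```

After renaming, the `mul`/`st` pair no longer shares a register with the `add`/`st` pair, so a scheduler is free to interleave the two chains as in the rescheduled column of the example.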

  3. Scheduling Loads (cont)

     Hiding latency
     – Place independent instructions behind loads

     [Figure: two timelines over cycles 0 to 8, comparing the order load r1, load r2, add r3 with the order load r1, add r3, load r2]

     – How many instructions should we insert?
       – Depends on latency
       – The gap between cache-miss and cache-hit latencies is growing
       – If we underestimate latency: stall waiting for the load
       – If we overestimate latency: hold the register longer than necessary, and waste parallelism

     Balanced Scheduling [Kerns and Eggers ’92]

     Idea
     – Impossible to know the latencies statically
     – Instead of estimating latency, balance the ILP (instruction-level parallelism) across all loads
     – Schedule for characteristics of the code instead of for characteristics of the machine

     Balancing load
     – Compute load-level parallelism (LLP):

       $$\mathrm{LLP} = 1 + \frac{\#\ \text{independent instructions}}{\#\ \text{loads that can use this parallelism}}$$
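
To make the cost of a misplaced load concrete, here is a small cycle-count model. It is a sketch under stated assumptions, not a description of any real machine: a single-issue, in-order pipeline, non-blocking loads with a fixed latency, one cycle for every other instruction, and the hypothetical function name `run_cycles`. The two schedules and their operands are invented for illustration.

```python
def run_cycles(schedule, load_latency):
    """Cycles needed to issue a schedule on a single-issue, in-order machine.

    schedule: list of (opcode, dest, sources). A "load" produces its result
    load_latency cycles after it issues; every other opcode takes 1 cycle.
    """
    ready = {}  # register -> first cycle in which its value can be read
    cycle = 0
    for opcode, dest, sources in schedule:
        # The load itself never stalls; an instruction that *uses* a register
        # waits until that register's value is available.
        issue = max([cycle] + [ready.get(s, 0) for s in sources])
        latency = load_latency if opcode == "load" else 1
        if dest is not None:
            ready[dest] = issue + latency
        cycle = issue + 1
    return cycle

# Each use directly follows its load.
naive = [
    ("load", "r1", ("a",)),
    ("add",  "r3", ("r1", "r4")),
    ("load", "r2", ("b",)),
    ("add",  "r5", ("r2", "r4")),
]
# Both loads are placed well before their uses.
hoisted = [
    ("load", "r1", ("a",)),
    ("load", "r2", ("b",)),
    ("add",  "r3", ("r1", "r4")),
    ("add",  "r5", ("r2", "r4")),
]

naive_cycles = run_cycles(naive, load_latency=3)      # stalls after each load
hoisted_cycles = run_cycles(hoisted, load_latency=3)  # second load hides latency
```

With a load latency of 3 the hoisted order finishes in fewer cycles in this model; with a latency of 1 the two orders take the same time, which is why the right amount of separation depends on a latency the compiler cannot know exactly.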

  4. Balanced Scheduling Example

     [Figure: dependence DAG over loads L0 and L1 and instructions X0 to X4, with schedules from list scheduling (load weights w=5 and w=1, labeled Pessimistic and Optimistic) and from balanced scheduling]

     LLP for L0 = 1 + 4/2 = 3
     LLP for L1 = 1 + 4/2 = 3

     Loop Unrolling

     Idea
     – Replicate body of loop and iterate fewer times
     – Reduces loop overhead (test and branch)
     – Creates larger loop body ⇒ more scheduling freedom

     Example

       L: ldf   [r1], f0
          fadds f0, f1, f2
          stf   f2, [r1]
          sub   r1, 4, r1
          cmp   r1, 0
          bg    L
          nop

     The slide marks the trailing loop-control instructions as loop overhead.

     [Figure: pipeline timeline of the loop over cycles 0 to 16]

     Cycles per iteration: 12
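
The LLP values in the balanced scheduling example (1 + 4/2 = 3 for each load) can be reproduced with a short sketch. This is a simplified reading of the slide's formula, not the full published algorithm: each non-load instruction contributes an equal share to every load it is independent of. The function names and the DAG below are hypothetical; the DAG is chosen so that four instructions are independent of both loads, since the edges of the slide's figure are not recoverable from the text.

```python
def reachable(dag, start):
    """All nodes reachable from start by following successor edges."""
    seen, stack = set(), [start]
    while stack:
        for succ in dag.get(stack.pop(), ()):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen

def load_level_parallelism(dag, loads):
    """LLP per load: 1 plus its share of each instruction independent of it."""
    nodes = set(dag) | {s for succs in dag.values() for s in succs}
    descendants = {n: reachable(dag, n) for n in nodes}

    def independent(a, b):
        # Independent: neither node is an ancestor of the other.
        return a != b and b not in descendants[a] and a not in descendants[b]

    llp = {load: 1.0 for load in loads}
    for instr in nodes - set(loads):
        users = [load for load in loads if independent(instr, load)]
        for load in users:
            # The instruction can hide latency for any of these loads,
            # so its parallelism is split evenly among them.
            llp[load] += 1.0 / len(users)
    return llp

# Hypothetical DAG: X0..X3 depend on neither load; X4 depends on both.
dag = {"L0": ["X4"], "L1": ["X4"], "X0": [], "X1": [], "X2": [], "X3": []}
llp = load_level_parallelism(dag, ["L0", "L1"])
```

A balanced scheduler would then use these values as load weights in place of a fixed machine latency, so each load gets roughly the same amount of independent work placed behind it.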

  5. Loop Unrolling Example

     Sample loop

       L: ldf   [r1], f0
          fadds f0, f1, f2
          ldf   [r1-4], f10
          fadds f10, f1, f12
          stf   f2, [r1]
          stf   f12, [r1-4]
          sub   r1, 8, r1
          cmp   r1, 0
          bg    L
          nop

     [Figure: pipeline timeline of the unrolled loop over cycles 0 to 16]

     Cycles per iteration: 14/2 = 7 (71% speedup!)
     The larger window lets us hide some of the latency of the fadds instruction

     Loop Unrolling Summary

     Benefit
     – Loop unrolling allows us to schedule code across iteration boundaries, providing more scheduling freedom

     Issues
     – How much unrolling should we do?
       – Try various unrolling factors and see which provides the best schedule?
       – Unroll as much as possible within a code expansion budget?
     – An alternative: Software pipelining
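
The shape of the unrolling transformation, and the speedup arithmetic quoted above, can be sketched as follows. Python does no instruction scheduling, so this only illustrates the source-level rewrite; the function names are hypothetical, and the loop mirrors the assembly example by adding a constant to each array element while walking downward.

```python
def add_constant(a, c):
    """Rolled loop: one load, add, and store per trip, plus one test-and-branch."""
    i = len(a) - 1
    while i >= 0:
        a[i] = a[i] + c
        i -= 1
    return a

def add_constant_unrolled2(a, c):
    """Same computation unrolled by 2: half as many test-and-branch steps."""
    i = len(a) - 1
    while i >= 1:
        # Two independent load/add/store chains that a scheduler may interleave.
        x0 = a[i]
        x1 = a[i - 1]
        y0 = x0 + c
        y1 = x1 + c
        a[i] = y0
        a[i - 1] = y1
        i -= 2
    if i == 0:
        # Leftover iteration when the trip count is odd.
        a[0] = a[0] + c
    return a

def speedup_percent(cycles_before, cycles_after):
    """Percentage speedup going from cycles_before to cycles_after per iteration."""
    return (cycles_before / cycles_after - 1.0) * 100.0

data = [1.0, 2.0, 3.0, 4.0, 5.0]
rolled = add_constant(list(data), 0.5)
unrolled = add_constant_unrolled2(list(data), 0.5)
gain = speedup_percent(12, 14 / 2)  # 12 cycles before, 14/2 = 7 after
```

The cleanup step for odd trip counts is part of the code-expansion cost that the summary's "how much unrolling" question refers to: larger unrolling factors need more leftover handling.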

  6. Software Pipelining

     Basic idea
     – Software pipelining is a systematic approach to scheduling across iteration boundaries without doing loop unrolling
     – Try to move the long latency instructions to previous iterations of the loop
     – Use independent instructions to hide their latency
     – Three parts of a software pipeline
       – Kernel: Steady state execution of the pipeline
       – Prologue: Code to fill the pipeline
       – Epilogue: Code to empty the pipeline

     Visualizing Software Pipelining

     [Figure not recoverable from the extracted text]
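
The prologue, kernel, and epilogue structure can be illustrated with a hand-pipelined loop. This is a sketch, not the schedule from the lecture: it assumes a three-stage body (load, add, store) for the loop `a[i] = a[i] + c`, and the function name is hypothetical. In each kernel trip the three stages belong to three different iterations.

```python
def add_constant_pipelined(a, c):
    """Software-pipelined form of: for i in range(n): a[i] = a[i] + c.

    Each kernel trip stores iteration i, adds for iteration i+1, and loads
    for iteration i+2.
    """
    n = len(a)
    if n < 2:
        # Too short to fill the pipeline; fall back to the plain loop.
        for i in range(n):
            a[i] = a[i] + c
        return a

    # Prologue: fill the pipeline.
    loaded = a[0]            # load  (iteration 0)
    summed = loaded + c      # add   (iteration 0)
    loaded = a[1]            # load  (iteration 1)

    # Kernel: steady state, one store, one add, and one load per trip.
    for i in range(n - 2):
        a[i] = summed        # store (iteration i)
        summed = loaded + c  # add   (iteration i+1)
        loaded = a[i + 2]    # load  (iteration i+2)

    # Epilogue: drain the pipeline.
    a[n - 2] = summed        # store (iteration n-2)
    summed = loaded + c      # add   (iteration n-1)
    a[n - 1] = summed        # store (iteration n-1)
    return a

values = [1.0, 2.0, 3.0, 4.0, 5.0]
pipelined = add_constant_pipelined(list(values), 0.5)
```

Unlike unrolling, the kernel body stays the size of one iteration; the extra code is confined to the prologue and epilogue.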

  7. Software Pipelining versus Loop Unrolling

     [Figure not recoverable from the extracted text]

     SW Pipelining (Step 1: Construct DAG and Assign Registers)

       int A[100], B[100], C[100];
       for (i=0; i<100; i++) {
         B[i] = A[i] + B[i] + C[i];
       }

  8. SW Pipelining (Step 2: “Unroll”, Schedule, Find Pattern)

     [Figure not recoverable from the extracted text]

     This pattern does not work!!

     SW Pipelining (Step 3: Satisfy register constraints)

     [Figure not recoverable from the extracted text]

  9. SW Pipelining and Loop Unrolling Summary

     – Unrolling removes branching overhead and helps tolerate data dependence latency
     – SW pipelining maintains max parallelism in steady state through continuous tolerance of data dependence latency
     – Both work best with loops that are parallel, getting ILP by taking instructions from different iterations

     Software Pipelining Complications

     – What if there is control flow within the loop?
       – Use control-flow profiles to identify most frequent path through the loop
       – Optimize for the most frequent path
     – How do we identify the most frequent path?
       – Profiling

  10. Concepts

     Improving instruction scheduling
     – Register renaming
     – Balanced load scheduling
     – Loop unrolling

     Instruction scheduling across basic blocks
     – Software pipelining

     Next Time

     Lecture
     – More instruction scheduling
       – profiling
       – trace scheduling
