263-2810: Advanced Compiler Design
8.2 Scheduling for ILP processors
Thomas R. Gross
Computer Science Department, ETH Zurich, Switzerland
Overview
§ 8.1 Instruction scheduling basics § 8.2 Scheduling for ILP processors
8.2 Scheduling for ILP processors
§ Introduction to ILP § Scheduling for acyclic regions
§ Types and shapes § Region formation § Schedule construction § Resource management
§ Scheduling for cyclic regions
§ Software pipelining § Modulo scheduling § Speculation and predication
8.2.3 Scheduling for cyclic regions
§ Scheduling loops
§ The majority of program execution time is spent in loops § We already know several techniques to speed up loop execution
§ Parallelizing loops § Loop unrolling § Loop fusion § …
§ All of these techniques have a scheduling barrier at the end of one (or several) iterations
Increasing ILP w/ loop unrolling
§ Running example § Machine model
§ 4 issue § 1 control, 2 ALU, 1 memory § Latencies:
§ Add: 1 cycle § Mul: 3 cycles § Ld: 2 cycles § St: 1 cycle § Cmp: 1 cycle § Branch: 1 cycle
for (i=0; i<0x1000; i++) { b[i] = a[i] * 3; }
mov r1 ← @a
mov r2 ← @b
add r5 ← r1, #0x4000
loop: ld r3 ← mem[r1]
mul r4 ← r3, #3
st mem[r2] ← r4
add r1 ← r1, #4
add r2 ← r2, #4
clt p1 ← r1, r5
b p1, @loop
Increasing ILP w/ loop unrolling
§ Scheduling the loop with list scheduling (baseline)
§ Throughput: 6 cycles / 1 iteration (100%) § Code size (schedule length): 6 (100%)
1 ld r3 ← mem[r1]
2 mul r4 ← r3, #3
3 st mem[r2] ← r4
4 add r1 ← r1, #4
5 add r2 ← r2, #4
6 clt p1 ← r1, r5
7 b p1, @loop
[figure: iterations 1, 2, 3, …, 1000 execute one after another in time]
cycle | ALU 1 | ALU 2 | MEM | control
  1   |   4   |       |  1  |
  2   |   6   |       |     |
  3   |   2   |       |     |
  4   |       |       |     |
  5   |       |       |     |
  6   |   5   |       |  3  |    7
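The greedy list-scheduling step behind this baseline can be sketched in Python (a hypothetical helper, not from the slides); the op numbers, latencies, and per-cycle resource capacities follow the machine model above, with delay-0 edges for the anti-dependences and extra delay-0 edges into the branch to keep it last:

```python
LATENCY = {1: 2, 2: 3, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1}
RESOURCE = {1: "MEM", 2: "ALU", 3: "MEM", 4: "ALU", 5: "ALU", 6: "ALU", 7: "CTRL"}
CAPACITY = {"ALU": 2, "MEM": 1, "CTRL": 1}   # 4-issue: 2 ALU, 1 memory, 1 control
# edges (pred, succ, delay): true dependences use the producer latency,
# anti-dependences use delay 0; 3->7 and 5->7 keep the branch last.
EDGES = [(1, 2, 2), (2, 3, 3), (4, 6, 1), (6, 7, 1),
         (1, 4, 0), (3, 5, 0), (3, 7, 0), (5, 7, 0)]

def list_schedule(ops, edges):
    time = {}                        # op -> issue cycle
    cycle = 1
    while len(time) < len(ops):
        used = {r: 0 for r in CAPACITY}
        for op in ops:               # ops considered in priority (here: numeric) order
            if op in time:
                continue
            preds = [(p, d) for (p, s, d) in edges if s == op]
            if all(p in time and cycle >= time[p] + d for (p, d) in preds):
                r = RESOURCE[op]
                if used[r] < CAPACITY[r]:
                    time[op] = cycle
                    used[r] += 1
        cycle += 1
    return time

sched = list_schedule(sorted(LATENCY), EDGES)
print(max(sched.values()))  # schedule length: 6 cycles, as in the baseline
```

Running this reproduces the 6-cycle baseline schedule above: ld and the first add issue in cycle 1, the store and the branch in cycle 6.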
Increasing ILP w/ loop unrolling
§ Unrolling twice
§ Throughput: 7 cycles / 2 iterations (-42%) § Code size (schedule length): 7 (+17%)
11 ld r3 ← mem[r1]
12 mul r4 ← r3, #3
13 st mem[r2] ← r4
14 add r1 ← r1, #4
15 add r2 ← r2, #4
21 ld r8 ← mem[r6]
22 mul r9 ← r8, #3
23 st mem[r7] ← r9
24 add r6 ← r6, #4
25 add r7 ← r7, #4
06 clt p1 ← r6, r5
07 b p1, @loop
[figure: iteration pairs 1,2 / 3,4 / 5,6 / … / 999,1000 execute one after another in time]
cycle | ALU 1 | ALU 2 | MEM | control
  1   |  14   |       | 11  |
  2   |  24   |       | 21  |
  3   |  12   |  06   |     |
  4   |  22   |       |     |
  5   |       |       |     |
  6   |  15   |       | 13  |
  7   |  25   |       | 23  |   07
Increasing ILP w/ loop unrolling
§ Unrolling 4x
§ Throughput: 9 cycles / 4 iterations (-63%) § Code size (scheduled instructions): 9 (+50%)
11 ld r3 ← mem[r1]
12 mul r4 ← r3, #3
13 st mem[r2] ← r4
14 add r1 ← r1, #4
15 add r2 ← r2, #4
21 ld r8 ← mem[r6]
22 mul r9 ← r8, #3
23 st mem[r7] ← r9
24 add r6 ← r6, #4
25 add r7 ← r7, #4
31 ld r12 ← mem[r10]
32 mul r13 ← r12, #3
33 st mem[r11] ← r13
34 add r10 ← r10, #4
35 add r11 ← r11, #4
41 ld r16 ← mem[r14]
42 mul r17 ← r16, #3
43 st mem[r15] ← r17
44 add r14 ← r14, #4
45 add r15 ← r15, #4
06 clt p1 ← r6, r5
07 b p1, @loop
[figure: iteration groups 1,2,3,4 / 5,6,7,8 / 9,10,11,12 / … / 997,998,999,1000 execute one after another in time]
cycle | ALU 1 | ALU 2 | MEM | control
  1   |  14   |       | 11  |
  2   |  24   |       | 21  |
  3   |  34   |  12   | 31  |
  4   |  44   |  22   | 41  |
  5   |  32   |       |     |
  6   |  15   |  42   | 13  |
  7   |  25   |       | 23  |
  8   |  35   |  06   | 33  |
  9   |  45   |       | 43  |   07
Increasing ILP w/ loop unrolling
§ Scheduling loops
§ Unrolling: performance improvements, but
§ The scheduling barrier is still there; only as many loop bodies as the unroll factor can be overlapped
§ Increase in
§ Code size § Register pressure
§ Unrolling is useful for loops with
§ Lots of control flow within the loop body § Trace scheduling can then find the most likely path
§ Unrolling “ignores” loop structure
8.2.3.1 Software pipelining
§ Exploit loop structure of program § Pipelining: overlap of stages
Let’s try again
§ Scheduling the first iteration on our 4-issue machine
1 ld r3 ← mem[r1]
2 mul r4 ← r3, #3
3 st mem[r2] ← r4
4 add r1 ← r1, #4
5 add r2 ← r2, #4
6 clt p1 ← r1, r5
7 b p1, @loop
iteration 1
cycle | ALU 1 | ALU 2 | MEM | control
  1   |   4   |       |  1  |
  2   |   2   |       |     |
  3   |       |       |     |
  4   |   6   |       |     |
  5   |   5   |       |  3  |    7
Let’s try again
§ Consider 8-issue machine § Duplicate 4-issue machine
§ Control the 2nd group by predicate Q § Execute only if Q==true § "Predicated execution"
Let’s try again
§ Scheduling the 2nd iteration on an 8-issue machine
1 ld r3 ← mem[r1]
2 mul r4 ← r3, #3
3 st mem[r2] ← r4
4 add r1 ← r1, #4
5 add r2 ← r2, #4
6 clt p1 ← r1, r5
7 b p1, @loop
cycle | iteration 1                | iteration 2
  1   | 1 (MEM), 4 (ALU)           |
  2   | 2 (ALU)                    | 1 (MEM)
  3   |                            | 4 (ALU)
  4   | 6 (ALU)                    | 2 (ALU)
  5   | 5 (ALU), 3 (MEM), 7 (ctrl) |
  6   |                            | 6 (ALU)
  7   |                            | 5 (ALU), 3 (MEM), 7 (ctrl)
Let’s try again
§ Scheduling the 3rd iteration on a 12-issue machine
1 ld r3 ← mem[r1]
2 mul r4 ← r3, #3
3 st mem[r2] ← r4
4 add r1 ← r1, #4
5 add r2 ← r2, #4
6 clt p1 ← r1, r5
7 b p1, @loop
cycle | iteration 1                | iteration 2                | iteration 3
  1   | 1 (MEM), 4 (ALU)           |                            |
  2   | 2 (ALU)                    | 1 (MEM)                    |
  3   |                            | 4 (ALU)                    |
  4   | 6 (ALU)                    | 2 (ALU)                    | 1 (MEM)
  5   | 5 (ALU), 3 (MEM), 7 (ctrl) |                            | 4 (ALU)
  6   |                            | 6 (ALU)                    | 2 (ALU)
  7   |                            | 5 (ALU), 3 (MEM), 7 (ctrl) |
  8   |                            |                            | 6 (ALU)
  9   |                            |                            | 5 (ALU), 3 (MEM), 7 (ctrl)
Let’s try again
§ Observation: the schedules of different iterations are identical
cycle | ALU 1 | ALU 2 | MEM | control
  1   |   4   |       |  1  |
  2   |   2   |       |  1  |
  3   |   4   |       |     |
  4   |   2   |   6   |  1  |
  5   |   4   |   5   |  3  |    7
  6   |   2   |   6   |  1  |
  7   |   4   |   5   |  3  |    7
  8   |   2   |   6   |  1  |
  9   |   4   |   5   |  3  |    7
 10   |   2   |   6   |  1  |
  …   |   …   |   …   |  …  |    …
§ except for the start-up cycles § except for the wind-down cycles
§ Schedules can be overlapped
Performance
§ Throughput
§ for each individual iteration: 6 cycles § 1st iteration: 6 cycles § each additional iteration: 2 cycles § n iterations: 2*n + 4 cycles
§ Code size (w/ predicated execution)
§ 7 instructions § Predicate Q: r1 < r5
cycle | ALU 1 | ALU 2 | MEM | control
  1   |   4   |       |  1  |
  2   |   2   |       |  1  |
  3   |   4   |       |     |
  4   |   2   |   6   |  1  |
  5   |   4   |   5   |  3  |    7
  6   |   2   |   6   |  1  |
  7   |   4   |   5   |  3  |    7
  8   |   2   |   6   |  1  |
  9   |   4   |   5   |  3  |    7
 10   |   2   |   6   |  1  |
  …   |   …   |   …   |  …  |    …
Software pipelining
§ Standard techniques (region scheduling, loop fusion, loop unrolling) do not yield sufficient ILP § Method of choice: software pipelining
§ Overlap iterations of the loop body § Steady state: kernel § Peak performance: 1 loop iteration/cycle § No scheduling barriers between iterations § No loop unrolling necessary § Requires sufficient resources
[figure: iterations 1 … 1000, each split into stages 1-4; consecutive iterations overlap in time]
Software pipelining
§ Basic idea
§ Unroll the loop "completely" § Correctly schedule the loop under two constraints
§ All iteration bodies have identical schedules § Each new iteration starts exactly II (initiation interval) cycles after the previous iteration
§ Execution time in terms of stage count (SC)
§ One loop iteration: SC×II cycles § Prologue/epilogue: (SC-1)×II cycles § Kernel steady state: II cycles
§ Execution time of a software-pipelined loop: II×(n+SC-1) cycles
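Plugging the running example into this formula gives a quick sanity check (SC = 3 is an assumption here: the 6-cycle iteration splits into 6/II = 3 stages at II = 2):

```python
# Total cycles for a software-pipelined loop: II * (n + SC - 1).
def pipelined_cycles(II, SC, n):
    return II * (n + SC - 1)

# Running example: II = 2, SC = 3.
print(pipelined_cycles(2, 3, 1))     # 6: a single iteration takes SC * II cycles
print(pipelined_cycles(2, 3, 1000))  # 2004 = 2 * 1000 + 4
```

The second value matches the "n iterations: 2*n + 4 cycles" count derived on the Performance slide.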
[figure: overlapped iterations 1-4 over time; each new iteration starts II (initiation interval) cycles after the previous one, and one iteration spans SC (stage count) stages of II cycles each]
Modulo scheduling
§ Most common technique to find software-pipelined schedules § Basic concept
§ Unroll the loop "completely" § Schedule the loop under two constraints
§ All iteration bodies have identical schedules § Each new iteration starts exactly II cycles after the previous iteration
Modulo scheduling: problem formulation
§ Problem: find a schedule for one loop body iteration such that, when the schedule is repeated at intervals of II cycles
§ No hardware resource conflict arises between operations of the same and successive iterations of the loop body § No intra-/inter-loop dependences are violated
Modulo scheduling: resource constraints
§ Handling resource constraints
§ No resource must be used by different operations at two points in time that are separated by an interval that is a multiple of the initiation interval § This requirement is identical to: within a single iteration, no resource is ever used more than once at the same time modulo II
§ Search for a suitable initiation interval
Modulo scheduling: resource constraints
§ Modulo reservation tables
§ Table containing II rows and one column for each resource § Scheduling op at time t on resource r § Entry for r at t mod II must be free § Mark t mod II busy for r
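These bookkeeping rules can be sketched as a small Python class (an illustrative helper, not from the slides):

```python
# Modulo reservation table: II rows, one column per resource.
# Scheduling an op at cycle t on resource r claims row t mod II.
class ModuloReservationTable:
    def __init__(self, II, resources):
        self.II = II
        self.busy = {r: [False] * II for r in resources}

    def try_place(self, resource, t):
        row = t % self.II
        if self.busy[resource][row]:
            return False                  # conflict: resource already taken at t mod II
        self.busy[resource][row] = True
        return True

mrt = ModuloReservationTable(2, ["ALU1", "ALU2", "MEM", "control"])
print(mrt.try_place("MEM", 1))   # ld at cycle 1 -> row 1: free
print(mrt.try_place("MEM", 2))   # st at cycle 2 -> row 0: free
print(mrt.try_place("MEM", 4))   # another MEM op at cycle 4 -> row 0: conflict
```

The third placement fails because cycles 2 and 4 map to the same row for II = 2, exactly the "multiple of the initiation interval" conflict described above.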
[figure: empty modulo reservation tables with columns ALU 1, ALU 2, MEM, control: two rows for II = 2, three rows for II = 3]
Modulo scheduling: resource constraints
§ Modulo reservation tables
§ Table containing II rows and one column for each resource
1 ld r3 ← mem[r1]
2 mul r4 ← r3, #3
3 st mem[r2] ← r4
4 add r1 ← r1, #4
5 add r2 ← r2, #4
6 clt p1 ← r1, r5
7 b p1, @loop
iteration 1
cycle | ALU 1 | ALU 2 | MEM | control
  1   |   4   |       |  1  |
  2   |   2   |       |     |
  3   |       |       |     |
  4   |   6   |       |     |
  5   |   5   |       |  3  |    7
II = 2
cycle mod II | ALU 1 | ALU 2 | MEM | control
      1      |   2   |   6   |  1  |
      2      |   4   |   5   |  3  |    7
Modulo scheduling: dependence constraints
§ Dependence constraints
§ Both loop-independent and loop-carried dependences must be considered § Annotate each edge in the data dependence graph with a tuple t = <distance, delay> § Delay: minimum time interval between the start of the two operations

Dependence | Delay                             | Conservative delay
True       | latency(pred)                     | latency(pred)
Anti       | 1 - latency(succ)                 |
Output     | 1 + latency(pred) - latency(succ) | latency(pred)
Modulo scheduling: dependence constraints
§ Dependence constraints
§ Annotate each edge in the data dependence graph with a tuple t = <distance, delay> § Distance: number of iterations separating the two operations
§ 0: loop-independent dependence § > 0: loop-carried dependence
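Together, distance and delay give the legality condition for a modulo schedule: for every edge, sigma(succ) >= sigma(pred) + delay - II * distance. A sketch in Python; the issue offsets below are an assumed schedule for the running example, chosen to be consistent with the latencies (ld: 2, mul: 3, add: 1):

```python
# Modulo-scheduling legality check: for every dependence edge <distance, delay>,
# a schedule sigma with initiation interval II must satisfy
#   sigma[succ] >= sigma[pred] + delay - II * distance
def legal(sigma, edges, II):
    return all(sigma[s] >= sigma[p] + delay - II * dist
               for (p, s, dist, delay) in edges)

# Assumed issue offsets for one iteration of the running example.
sigma = {"ld": 0, "add_r1": 1, "mul": 2, "clt": 4, "st": 5, "add_r2": 5, "b": 5}
edges = [("ld", "mul", 0, 2), ("mul", "st", 0, 3),
         ("add_r1", "clt", 0, 1), ("clt", "b", 0, 1),
         ("add_r1", "ld", 1, 1)]   # loop-carried: next iteration's ld needs r1
print(legal(sigma, edges, II=2))   # True
```

Note how the loop-carried edge is the one that constrains II: with II = 1 the same offsets would violate it, since sigma[ld] >= sigma[add_r1] + 1 - 1 would require ld at cycle 1 or later.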
8.2.3.2 Speculation and predication
Relaxing dependence constraints
§ Dependence constraints limit the amount of ILP § Speculation and predication aim at removing control dependences to increase ILP § Independent techniques that complement each other
§ Speculation: move operations up over a highly weighted branch § Predication: collapse short sequences of code after a balanced branch
Control speculation
§ Speculative code motion requires hardware support
§ Operations generating an exception
§ Memory operations § Certain ALU operations (divide by zero, overflow)
§ Two H/W models
§ Non-recovery speculation § Recovery speculation
Non-recovery speculation model
§ Uses the concept of "silent instructions" § Exceptions are simply discarded
§ "Silent load": if invalid
§ Canceled before it reaches the memory system § Garbage value returned
§ H/W implementation straightforward:
§ Opcode contains a 'speculative' bit § Speculative bit is used to mask hardware exceptions
§ Problems:
§ Handling of recoverable exceptions? § Correct speculation with abort?
Recovery speculation model
§ Supports full recovery from speculative exceptions (aka sentinel scheduling) § Basic idea:
§ Split speculated instructions into two parts
§ Non-excepting part: perform the actual operation § Executed speculatively (i.e., can be hoisted/sunk) § Sentinel part: check and flag an exception if necessary § Must remain in "home" block
§ H/W implementation:
§ Opcode bits to indicate speculation § Register flags to indicate validity of content, and also type of exception
§ Contents need to be preserved over context switches
§ Correct-speculation-with-abort problem solved
Data speculation
§ Aka run-time memory disambiguation: assume memory addresses do not collide at compile time, check at run time § Requires H/W to remember the location of speculated memory loads/stores
§ Expensive and thus typically not used
void swap(int *a, int *b) {
  int t = *a;
  *a = *b;
  *b = t;
}
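The swap body shows the hazard: *a and *b may alias, so a load moved above an earlier store needs a run-time check. A minimal illustration in Python (memory modeled as a list, function names hypothetical):

```python
# Why run-time memory disambiguation is needed: hoisting a load above a
# store is only safe when the two addresses differ.
def in_order(mem, a, b):
    mem[a] = 7            # store to *a
    return mem[b]         # later load of *b

def hoisted(mem, a, b):
    y = mem[b]            # load speculatively hoisted above the store
    mem[a] = 7
    if a == b:            # run-time check: redo the load on a collision
        y = mem[b]
    return y

print(in_order([1, 2], 0, 1), hoisted([1, 2], 0, 1))  # no alias: 2 2
print(in_order([1, 2], 0, 0), hoisted([1, 2], 0, 0))  # alias:    7 7
```

Without the run-time check, the aliased case would return the stale value 1; the hardware table mentioned above plays the role of the `a == b` test.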
Predicated execution
§ Predication also requires hardware support
§ Predication guard
§ Processor state § Predication registers
§ Full predication vs. partial predication
§ Full predication
§ Instructions take additional operands that control execution § If condition not met, treated as a NOP
§ Partial predication
§ Only certain instructions support predicated execution § i686 cmov
Predicated execution
§ Full vs. partial predication
x = *ap;
if (x > 0)
  q = *bp + 1;
else
  q = x - 1;
*zp = q;

Full predication:
<T>  load  x = 0[ap]
<T>  cmpgt p2, p3 = x, 0
<p2> load  t1 = 0[bp]
<p2> add   q = t1, 1
<p3> sub   q = x, 1
<T>  store 0[zp] = q

Partial predication:
load   x = 0[ap]
cmpgt  p2 = x, 0
load.s t1 = 0[bp]
add    t2 = t1, 1
sub    t3 = x, 1
select q = p2, t2, t3
store  0[zp] = q
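The select-based sequence computes both arms unconditionally and keeps one; a Python sketch of that semantics (assuming, as the speculative load in the slide does, that both arms are safe to execute):

```python
# Branching version vs. select-based (partially predicated) version of:
#   if (x > 0) q = b + 1; else q = x - 1;
def with_branch(x, b):
    if x > 0:
        return b + 1
    return x - 1

def with_select(x, b):
    p2 = x > 0
    t2 = b + 1                # both arms execute unconditionally...
    t3 = x - 1
    return t2 if p2 else t3   # ...select keeps the one the predicate picks

print(with_branch(5, 10), with_select(5, 10))    # 11 11
print(with_branch(-2, 10), with_select(-2, 10))  # -3 -3
```

Both versions agree on every input; the select version simply trades extra work on the untaken arm for the removal of the branch.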
Predicated execution
§ Compiler techniques
§ If-conversion
§ Translate control dependences into data dependences
§ Logical reduction of predicates
§ Build a BDD to reduce the number of predicates
§ Hyperblock scheduling
§ "Superblocks with if-conversion"
§ Modulo scheduling
§ To use the same code for prologue, kernel, and epilogue
References and further reading
§ B. Rau et al., "Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing", Proceedings of the 14th Annual Workshop on Microprogramming (MICRO), 1981
§ M. Lam, "Software Pipelining: An Effective Scheduling Technique for VLIW Machines", Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1988
§ B. Rau, "Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops", Proceedings of the 27th Annual Workshop on Microprogramming (MICRO), 1994
§ P. Faraboschi et al., "Instruction scheduling for instruction level parallel processors", Proceedings of the IEEE, vol. 89, no. 11, 2001
§ with thanks to Bernhard Egger for slide material for 2810 “Advanced Compiler Design”, Spring 2014, ETH Zurich