263-2810: Advanced Compiler Design 8.2 Scheduling for ILP processors - PowerPoint PPT Presentation



SLIDE 1

263-2810: Advanced Compiler Design 8.2 Scheduling for ILP processors

Thomas R. Gross Computer Science Department ETH Zurich, Switzerland

SLIDE 2

Overview

§ 8.1 Instruction scheduling basics
§ 8.2 Scheduling for ILP processors

SLIDE 3

8.2 Scheduling for ILP processors

§ Introduction to ILP
§ Scheduling for acyclic regions

  § Types and shapes
  § Region formation
  § Schedule construction
  § Resource management

§ Scheduling for cyclic regions

  § Software pipelining
  § Modulo scheduling
  § Speculation and predication

SLIDE 4

8.2.3 Scheduling for cyclic regions

§ Scheduling loops

  § The majority of program execution time is spent in loops
  § We already know several techniques to speed up loop execution

    § Parallelizing loops
    § Loop unrolling
    § Loop fusion
    § …

  § All of these techniques have a scheduling barrier at the end of one (or several) iterations

SLIDE 5

Increasing ILP w/ loop unrolling

§ Running example
§ Machine model

  § 4-issue
  § 1 control, 2 ALU, 1 memory
  § Latencies:

    § Add: 1 cycle
    § Mul: 3 cycles
    § Ld: 2 cycles
    § St: 1 cycle
    § Cmp: 1 cycle
    § Branch: 1 cycle

for (i=0; i<0x1000; i++) { b[i] = a[i] * 3; }

      mov r1 ← @a
      mov r2 ← @b
      add r5 ← r1, #0x4000
loop: ld  r3 ← mem[r1]
      mul r4 ← r3, #3
      st  mem[r2] ← r4
      add r1 ← r1, #4
      add r2 ← r2, #4
      clt p1 ← r1, r5
      b   p1, @loop

SLIDE 6

Increasing ILP w/ loop unrolling

§ Scheduling the loop with list scheduling (baseline)
  Throughput: 6 cycles / 1 iteration (100%)
  Code size (schedule length): 6 (100%)

1  ld  r3 ← mem[r1]
2  mul r4 ← r3, #3
3  st  mem[r2] ← r4
4  add r1 ← r1, #4
5  add r2 ← r2, #4
6  clt p1 ← r1, r5
7  b   p1, @loop

[Figure: iterations 1, 2, 3, …, 1000 executed one after another over time]

[Schedule table: ops 1–7 assigned to ALU 1 / ALU 2 / MEM / control slots over the 6-cycle schedule]

SLIDE 7

Increasing ILP w/ loop unrolling

§ Unrolling twice
  Throughput: 7 cycles / 2 iterations (−42%)
  Code size (schedule length): 7 (+17%)

11 ld  r3 ← mem[r1]
12 mul r4 ← r3, #3
13 st  mem[r2] ← r4
14 add r1 ← r1, #4
15 add r2 ← r2, #4
21 ld  r8 ← mem[r6]
22 mul r9 ← r8, #3
23 st  mem[r7] ← r9
24 add r6 ← r6, #4
25 add r7 ← r7, #4
06 clt p1 ← r6, r5
07 b   p1, @loop

[Figure: iteration pairs 1,2 / 3,4 / 5,6 / … / 999,1000 executed over time]

[Schedule table: the two unrolled bodies interleaved over 7 cycles across ALU 1, ALU 2, MEM, and control slots]

SLIDE 8

Increasing ILP w/ loop unrolling

§ Unrolling 4x
  Throughput: 9 cycles / 4 iterations (−63%)
  Code size (scheduled instructions): 9 (+50%)

11 ld  r3 ← mem[r1]
12 mul r4 ← r3, #3
13 st  mem[r2] ← r4
14 add r1 ← r1, #4
15 add r2 ← r2, #4
21 ld  r8 ← mem[r6]
22 mul r9 ← r8, #3
23 st  mem[r7] ← r9
24 add r6 ← r6, #4
25 add r7 ← r7, #4
31 ld  r12 ← mem[r10]
32 mul r13 ← r12, #3
33 st  mem[r11] ← r13
34 add r10 ← r10, #4
35 add r11 ← r11, #4
41 ld  r16 ← mem[r14]
42 mul r17 ← r16, #3
43 st  mem[r15] ← r17
44 add r14 ← r14, #4
45 add r15 ← r15, #4
06 clt p1 ← r6, r5
07 b   p1, @loop

[Figure: iteration groups 1–4 / 5–8 / 9–12 / … / 997–1000 executed over time]

[Schedule table: the four unrolled bodies interleaved over 9 cycles across ALU 1, ALU 2, MEM, and control slots]
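The throughput and code-size figures on these unrolling slides follow from simple arithmetic; here is a quick sketch to check them (the cycle counts are taken from the slides, not produced by a scheduler, and the helper name is mine):

```python
# (cycles per scheduled block, iterations per block), from the slides
schedules = {
    "baseline":  (6, 1),
    "unroll_x2": (7, 2),
    "unroll_x4": (9, 4),
}

def cycles_per_iteration(cycles, iters):
    return cycles / iters

base_cpi = cycles_per_iteration(*schedules["baseline"])
for name, (cycles, iters) in schedules.items():
    cpi = cycles_per_iteration(cycles, iters)
    # throughput change and code-size change relative to the baseline
    print(name, cpi, (cpi / base_cpi - 1) * 100, (cycles / 6 - 1) * 100)
```

For 2x unrolling this gives 3.5 cycles/iteration, and for 4x it gives 2.25, matching the roughly −42% and −63% figures on the slides.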

SLIDE 9

Increasing ILP w/ loop unrolling

§ Scheduling loops

§ Unrolling: performance improvements, but

  § The scheduling barrier is still there; only as many loop bodies as the unroll factor can be overlapped
  § Increase in

    § Code size
    § Register pressure

§ Unrolling is useful for loops with

  § Lots of control flow within the loop body
  § Trace scheduling can find the most likely path

§ Unrolling “ignores” loop structure

SLIDE 10

8.2.3.1 Software pipelining

§ Exploit loop structure of program § Pipelining: overlap of stages



SLIDE 12

Let’s try again

§ Scheduling the first iteration

  • on our 4-issue machine

1  ld  r3 ← mem[r1]
2  mul r4 ← r3, #3
3  st  mem[r2] ← r4
4  add r1 ← r1, #4
5  add r2 ← r2, #4
6  clt p1 ← r1, r5
7  b   p1, @loop

[Schedule table: iteration 1's ops fill cycles 1–6 of a 9-cycle frame; the columns for further iterations are still empty]

SLIDE 13

Let’s try again


[Empty 9-cycle schedule table with two issue groups of ALU 1 / ALU 2 / MEM / control]

§ Consider an 8-issue machine
§ Duplicate the 4-issue machine

  § Control the 2nd group by predicate Q
  § Execute only if Q==true
  § “Predicated execution”

SLIDE 14

Let’s try again

§ Scheduling the 2nd iteration

  • on an 8-issue machine

1  ld  r3 ← mem[r1]
2  mul r4 ← r3, #3
3  st  mem[r2] ← r4
4  add r1 ← r1, #4
5  add r2 ← r2, #4
6  clt p1 ← r1, r5
7  b   p1, @loop

[Schedule table: iteration 2's copy of the schedule starts 2 cycles after iteration 1, in the second issue group]

SLIDE 15

Let’s try again

§ Scheduling the 3rd iteration

  • on a 12-issue machine

1  ld  r3 ← mem[r1]
2  mul r4 ← r3, #3
3  st  mem[r2] ← r4
4  add r1 ← r1, #4
5  add r2 ← r2, #4
6  clt p1 ← r1, r5
7  b   p1, @loop

[Schedule table: iterations 1–3 in three issue groups, each iteration starting 2 cycles after the previous one]

SLIDE 16

Let’s try again

§ Observation: schedules of different iterations are identical

[Schedule table: after the start-up cycles, the machine alternates every cycle between issuing ops 2, 6, 1 and ops 4, 5, 3, 7 of successive iterations]

§ except for the start-up cycles
§ except for the wind-down cycles

§ Schedules can be overlapped
SLIDE 17

Performance

§ Throughput

  § for each individual iteration: 6 cycles
  § 1st iteration: 6 cycles
  § each additional iteration: 2 cycles
  § n iterations: 2*n + 4 cycles

§ Code size (w/ predicated execution)

  § 7 instructions
  § Predicate Q: r1 < r5

[Schedule table: same overlapped schedule as the previous slide; the steady state repeats every 2 cycles]
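Plugging in the numbers, the overlapped schedule can be compared against the earlier ones (a back-of-the-envelope sketch using the cycle counts from these slides; the helper names are mine):

```python
import math

def baseline_cycles(n):
    return 6 * n                  # 6 cycles per iteration

def unrolled4_cycles(n):
    return 9 * math.ceil(n / 4)   # 9 cycles per block of 4 iterations

def pipelined_cycles(n):
    return 2 * n + 4              # 6 cycles for iteration 1, then 2 per iteration

n = 0x1000                        # 4096 iterations, as in the running example
print(baseline_cycles(n), unrolled4_cycles(n), pipelined_cycles(n))
```

For the full 4096-iteration loop this gives 24576, 9216, and 8196 cycles respectively: the overlapped schedule beats even 4x unrolling, with no code-size growth.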

SLIDE 18

Software pipelining

§ Standard techniques (region scheduling, loop fusion, loop unrolling) do not yield sufficient ILP
§ Method of choice: software pipelining

  § Overlap iterations of the loop body
  § Steady state: kernel
  § Peak performance: 1 loop iteration/cycle
  § No scheduling barriers between iterations
  § No loop unrolling necessary
  § Requires sufficient resources

[Figure: stages 1–4 of iterations 1 through 1000 overlapped in time, forming a software pipeline]

SLIDE 19

Software pipelining

§ Basic idea

  § Unroll the loop “completely”
  § Correctly schedule the loop under two constraints

    § All iteration bodies have identical schedules
    § Each new iteration starts exactly II (initiation interval) cycles after the previous iteration

§ Execution time in terms of stage count (SC)

  § One loop iteration: SC×II cycles
  § Prologue/epilogue: (SC−1)×II cycles
  § Kernel steady state: II cycles

§ Execution time of a software-pipelined loop: II×(n+SC−1) cycles
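For the running example the kernel has II = 2, and reading its schedule as SC = 3 stages (an assumption on my part), the formula reproduces the 2*n + 4 count from the earlier throughput slide:

```python
def sw_pipeline_cycles(ii, sc, n):
    """Execution time of a software-pipelined loop: II * (n + SC - 1)."""
    return ii * (n + sc - 1)

# II = 2, SC = 3 (assumed for the running example): 2 * (n + 2) = 2n + 4
for n in (1, 10, 0x1000):
    assert sw_pipeline_cycles(2, 3, n) == 2 * n + 4
```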

[Figure: iterations 1–4 with stages 1–4; each iteration starts II (initiation interval) cycles after the previous one, and a full iteration spans SC (stage count) stages]

SLIDE 20

Modulo scheduling

§ Most common technique to find software-pipelined schedules
§ Basic concept

  § Unroll the loop “completely”
  § Schedule the loop under two constraints

    § All iteration bodies have identical schedules
    § Each new iteration starts exactly II cycles after the previous iteration

SLIDE 21

Modulo scheduling: problem formulation

§ Problem: find a schedule for one loop body iteration such that, when the schedule is repeated at intervals of II cycles,

  § No hardware resource conflict arises between operations of the same and successive iterations of the loop body
  § No intra-/inter-loop dependences are violated

SLIDE 22

Modulo scheduling: resource constraints

§ Handling resource constraints

  § No resource must be used by different operations at two points in time that are separated by an interval that is a multiple of the initiation interval
  § This requirement is identical to: within a single iteration, no resource is ever used more than once at the same time modulo II

§ Search for a suitable initiation interval
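A common starting point for that search (not spelled out on this slide, but used by the iterative modulo scheduling work cited at the end) is the resource-constrained lower bound on II: for each resource class, divide the number of operations that need it by the number of units, and round up. A sketch, with resource names chosen by me:

```python
import math

def res_mii(op_counts, unit_counts):
    """Resource-constrained minimum II: max over resource classes of
    ceil(ops using the class / units in the class)."""
    return max(math.ceil(op_counts[r] / unit_counts[r]) for r in op_counts)

# Running example: mul + 2 adds + clt on 2 ALUs, ld + st on 1 MEM port,
# branch on 1 control slot
ops   = {"alu": 4, "mem": 2, "ctrl": 1}
units = {"alu": 2, "mem": 1, "ctrl": 1}
assert res_mii(ops, units) == 2   # consistent with the II = 2 kernel found earlier
```

The search then tries II starting from this bound and increases it until a legal schedule is found.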

SLIDE 23

Modulo scheduling: resource constraints

§ Modulo reservation tables

  § Table containing II rows and one column for each resource
  § Scheduling op at time t on resource r

    § Entry for r at t mod II must be free
    § Mark t mod II busy for r

[Figure: empty modulo reservation tables for II = 2 and II = 3, with one column each for ALU 1, ALU 2, MEM, and control]
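The bookkeeping above can be sketched directly (a toy implementation; the class and resource names are mine):

```python
class ModuloReservationTable:
    """II rows, one column per resource; scheduling an op at cycle t on
    resource r claims row t mod II."""
    def __init__(self, ii, resources):
        self.ii = ii
        self.rows = [{r: None for r in resources} for _ in range(ii)]

    def can_place(self, t, r):
        return self.rows[t % self.ii][r] is None

    def place(self, op, t, r):
        if not self.can_place(t, r):
            raise ValueError(f"{r} already busy at cycle {t} mod {self.ii}")
        self.rows[t % self.ii][r] = op

mrt = ModuloReservationTable(2, ["ALU1", "ALU2", "MEM", "control"])
mrt.place("ld", 0, "MEM")
assert not mrt.can_place(2, "MEM")  # 2 mod 2 == 0: same row, so a conflict
assert mrt.can_place(1, "MEM")      # the other row is still free
```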

SLIDE 24

Modulo scheduling: resource constraints

§ Modulo reservation tables

  § Table containing II rows and one column for each resource

1  ld  r3 ← mem[r1]
2  mul r4 ← r3, #3
3  st  mem[r2] ← r4
4  add r1 ← r1, #4
5  add r2 ← r2, #4
6  clt p1 ← r1, r5
7  b   p1, @loop

[Schedule table: iteration 1's ops 1–7 in their 6-cycle schedule]

II = 2

cycle | ALU 1 | ALU 2 | MEM | control
  1   |   2   |   6   |  1  |
  2   |   4   |   5   |  3  |   7

SLIDE 25

Modulo scheduling: dependence constraints

§ Dependence constraints

  § Both loop-independent and loop-carried dependences must be considered
  § Annotate each edge in the data dependence graph with a tuple t = <distance, delay>
  § Delay: minimum time interval between the start of operations

Dependence | Delay                             | Conservative delay
True       | latency(pred)                     | latency(pred)
Anti       | 1 − latency(succ)                 |
Output     | 1 + latency(pred) − latency(succ) | latency(pred)

SLIDE 26

Modulo scheduling: dependence constraints

§ Dependence constraints

  § Annotate each edge in the data dependence graph with a tuple t = <distance, delay>
  § Distance: number of iterations separating the two operations

    § 0 : loop-independent dependence
    § > 0 : loop-carried dependence
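With <distance, delay> annotations, a candidate schedule with initiation interval II respects a dependence edge exactly when the successor (distance iterations later, hence offset by distance×II cycles) starts at least delay cycles after the predecessor. A sketch of that check; the operation names and start times below are made up for illustration:

```python
def edge_respected(start, edge, ii):
    """Dependence edge (pred, succ, distance, delay): the successor in
    iteration k+distance must start at least `delay` cycles after the
    predecessor in iteration k.  With iterations offset by II cycles:
    start[succ] + distance*II - start[pred] >= delay."""
    pred, succ, distance, delay = edge
    return start[succ] + distance * ii - start[pred] >= delay

# Hypothetical start times within one iteration body:
start = {"ld": 0, "mul": 2, "add": 0}
assert edge_respected(start, ("ld", "mul", 0, 2), ii=2)   # loop-independent
assert edge_respected(start, ("add", "ld", 1, 1), ii=2)   # loop-carried
```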

SLIDE 27

8.2.3.2 Speculation and predication

SLIDE 28

Relaxing dependence constraints

§ Dependence constraints limit the amount of ILP
§ Speculation and predication aim at removing control dependences to increase ILP
§ Independent techniques that complement each other

  § Speculation: move operations up over a highly weighted branch
  § Predication: collapse short sequences of code after a balanced branch

SLIDE 29

Control speculation

§ Speculative code motion requires hardware support

  § Operations generating an exception

    § Memory operations
    § Certain ALU operations (div by 0, overflow)

§ Two H/W models

  § Non-recovery speculation
  § Recovery speculation

SLIDE 30

Non-recovery speculation model

§ Uses the concept of “silent instructions”
§ Exceptions are simply discarded

  § “Silent load”: if invalid

    § Canceled before it reaches the memory system
    § Garbage value returned

§ H/W implementation straightforward:

  § Opcode contains a ‘speculative’ bit
  § Speculative bit is used to mask hardware exceptions

§ Problems:

  § Handling of recoverable exceptions?
  § Correct speculation with abort?
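The silent-load semantics can be sketched as a toy model (not any real ISA: a speculative load from an invalid address returns a garbage value instead of trapping):

```python
MEMORY = {0x1000: 42}   # toy address space
GARBAGE = 0             # arbitrary value; must never actually be consumed

def load(addr, speculative=False):
    """Silent-load semantics: a speculative load from an invalid address
    is canceled and returns garbage instead of raising an exception."""
    if addr in MEMORY:
        return MEMORY[addr]
    if speculative:
        return GARBAGE  # the exception is silently discarded
    raise MemoryError(hex(addr))

assert load(0x1000) == 42
assert load(0x2000, speculative=True) == GARBAGE   # no trap
```

The toy model also shows the slide's problem: nothing records that the speculative load faulted, so a recoverable exception (e.g. a page fault) cannot be replayed later.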

SLIDE 31

Recovery speculation model

§ Supports full recovery from speculative exceptions (aka sentinel scheduling)
§ Basic idea:

  § Split speculated instructions into two parts

    § Non-excepting part: performs the actual operation

      § Executed speculatively (i.e., can be hoisted/sunk)

    § Sentinel part: checks and flags an exception if necessary

      § Must remain in “home” block

§ H/W implementation:

  § Opcode bits to indicate speculation
  § Register flags to indicate validity of content, and also type of exception

    § Contents need to be preserved over context switches

§ Correct speculation with abort problem solved

SLIDE 32

Data speculaLon

§ Aka run-time memory disambiguation: assume memory addresses do not collide at compile time, check at run time
§ Requires H/W to remember the location of speculated memory loads/stores

  § Expensive and thus typically not used

void swap(int *a, int *b) {
  int t = *a;
  *a = *b;
  *b = t;
}

SLIDE 33

Predicated execution

§ Predication also requires hardware support

  § Predication guard
  § Processor state

    § Predication registers

§ Full predication vs. partial predication

  § Full predication

    § Instructions take additional operands that control execution
    § If condition not met, treated as a NOP

  § Partial predication

    § Only certain instructions support predicated execution
    § i686 cmov

SLIDE 34

Predicated execuLon

§ Full vs. partial predication

Source:

  x = *ap;
  if (x > 0) q = *bp + 1;
  else       q = x - 1;
  *zp = q;

Full predication:

  <T>  load  x = 0[ap]
  <T>  cmpgt p2, p3 = x, 0
  <p2> load  t1 = 0[bp]
  <p2> add   q = t1, 1
  <p3> sub   q = x, 1
  <T>  store 0[zp] = q

Partial predication:

  load.s t1 = 0[bp]
  load   x = 0[ap]
  cmpgt  p2 = x, 0
  add    t2 = t1, 1
  sub    t3 = x, 1
  select q = p2, t2, t3
  store  0[zp] = q

SLIDE 35

Predicated execution

§ Compiler techniques

  § If-conversion

    § Translate control dependences into data dependences

  § Logical reduction of predicates

    § Build a BDD to reduce the number of predicates

  § Hyperblock scheduling

    § “Superblocks with if-conversion”

  § Modulo scheduling

    § To use the same code for prologue, kernel, and epilogue
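The if-conversion of the full-vs-partial predication example can be mimicked in plain Python: both arms of the branch execute unconditionally and a select picks the result (a sketch of the idea, not of any compiler's IR; function names are mine):

```python
def select(p, a, b):
    """cmov/select-style primitive of partial predication."""
    return a if p else b

def branchy(x, bp):
    if x > 0:
        q = bp + 1
    else:
        q = x - 1
    return q

def if_converted(x, bp):
    t2 = bp + 1        # both arms execute unconditionally
    t3 = x - 1
    p2 = x > 0         # the branch becomes a predicate (data dependence)
    return select(p2, t2, t3)

for x in (-3, 0, 5):
    assert if_converted(x, 10) == branchy(x, 10)
```

The control dependence on `x > 0` has become a data dependence feeding `select`, which is exactly what lets the scheduler place both arms in the same cycle.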

SLIDE 36

References and further reading

§ B. Rau et al., “Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing”, Proceedings of the 14th Annual Workshop on Microprogramming (MICRO), 1981
§ M. Lam, “Software Pipelining: An Effective Scheduling Technique for VLIW Machines”, Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1988
§ B. Rau, “Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops”, Proceedings of the 27th Annual Workshop on Microprogramming (MICRO), 1994
§ P. Faraboschi et al., “Instruction scheduling for instruction level parallel processors”, Proceedings of the IEEE, vol. 89, no. 11, 2001

SLIDE 37

§ with thanks to Bernhard Egger for slide material for 2810 “Advanced Compiler Design”, Spring 2014, ETH Zurich
