263-2810: Advanced Compiler Design 8.2 Scheduling for ILP processors - PowerPoint PPT Presentation

263-2810: Advanced Compiler Design 8.2 Scheduling for ILP processors Thomas R. Gross Computer Science Department ETH Zurich, Switzerland

Overview § 8.1 InstrucLon scheduling basics § 8.2 Scheduling for ILP processors

8.2 Scheduling for ILP processors § IntroducLon to ILP § Scheduling for acyclic regions § Types and shapes § Region forma2on § Schedule construc2on § Resource management § Scheduling for cyclic regions § So8ware pipelining § Modulo scheduling § Specula2on and predica2on

8.2.3 Scheduling for cyclic regions § Scheduling loops § The majority of program execu2on 2me is spent in loops § We already know several techniques to speed up loop execu2on § Parallelizing loops § Loop unrolling § Loop fusion § … § All of these techniques have a scheduling barrier at the end of one (or several) itera2ons

Increasing ILP w/ loop unrolling § Running example mov r1 ← @a mov r2 ← @b add r5 ← r1, #0x4000 for (i=0; i<0x1000; i++) { loop: ld r3 ← mem[r1] b[i] = a[i] * 3; mul r4 ← r3, #3 } st mem[r2] ← r4 add r1 ← r1, #4 § Machine model add r2 ← r2, #4 clt p1 ← r1, r5 § 4 issue b p1, @loop § 1 control, 2 ALU, 1 memory § Latencies: § Add: 1 cycle § Mul: 3 cycles § Ld: 2 cycles § St: 1 cycle § Cmp: 1 cycle § Branch: 1 cycle

Increasing ILP w/ loop unrolling § Scheduling the loop with list scheduling ( Baseline) cycle ALU 1 ALU 2 MEM control 0 4 1 1 ld r3 ← mem[r1] 2 mul r4 ← r3, #3 1 6 3 st mem[r2] ← r4 4 add r1 ← r1, #4 2 2 5 add r2 ← r2, #4 6 clt p1 ← r1, r5 3 7 b p1, @loop 4 5 5 3 7 time ... iteration 1 2 3 1000 Throughput: 6 cycles / 1 iteraLon (100%) Code size (schedule length): 6 (100%)

Increasing ILP w/ loop unrolling § Unrolling twice cycle ALU 1 ALU 2 MEM control 0 14 11 11 ld r3 ← mem[r1] 12 mul r4 ← r3, #3 1 24 21 13 st mem[r2] ← r4 14 add r1 ← r1, #4 2 12 06 15 add r2 ← r2, #4 21 ld r8 ← mem[r6] 3 22 22 mul r9 ← r8, #3 23 st mem[r7] ← r9 4 24 add r6 ← r6, #4 25 add r7 ← r7, #4 06 clt p1 ← r6, r5 5 15 13 07 b p1, @loop 6 25 23 07 time ... iteration 1,2 3,4 5,6 999,1000 § Throughput: 7 cycles / 2 iteraLons (-42%) Code size (schedule length): 7 (+17%)

Increasing ILP w/ loop unrolling § Unrolling 4x cycle ALU 1 ALU 2 MEM control 11 ld r3 ← mem[r1] 0 14 11 12 mul r4 ← r3, #3 13 st mem[r2] ← r4 1 24 21 14 add r1 ← r1, #4 15 add r2 ← r2, #4 31 2 34 12 21 ld r8 ← mem[r6] 22 mul r9 ← r8, #3 22 41 3 44 23 st mem[r7] ← r9 24 add r6 ← r6, #4 32 4 25 add r7 ← r7, #4 31 ld r12 ← mem[r10] 42 5 15 13 32 mul r13 ← r12, #3 33 st mem[r11] ← r13 6 25 23 34 add r10 ← r10, #4 35 add r11 ← r11, #4 06 7 35 33 41 ld r16 ← mem[r14] 42 mul r17 ← r16, #3 8 45 34 07 43 st mem[r15] ← r17 44 add r14 ← r14, #4 time 45 add r15 ← r15, #4 06 clt p1 ← r6, r5 ... iteration 997,998   9,10,11,12 1,2,3,4 5,6,7,8 07 b p1, @loop 999,1000 Throughput: 9 cycles / 4 iteraLons (-63%) Code size (scheduled instrucLons): 9 (+50%)

Increasing ILP w/ loop unrolling § Scheduling loops § Unrolling: performance improvements, but § Scheduling barrier is s2ll there, only unroll factor loop bodies can be overlapped § Increase in § Code size § Register pressure § Unrolling is useful for loops with § Lots of control flow within the loop body § Trace can find most likely path § Unrolling “ignores” loop structure

8.2.3.1 So^ware pipelining § Exploit loop structure of program § Pipelining: overlap of stages 10

8.2.3.1 So^ware pipelining § Exploit loop structure of program § Pipelining: overlap of stages 11

Let’s try again 1 ld r3 ← mem[r1] 2 mul r4 ← r3, #3 3 st mem[r2] ← r4 § Scheduling the first iteraLon 4 add r1 ← r1, #4 5 add r2 ← r2, #4 on our 4-issue machine 6 clt p1 ← r1, r5 7 b p1, @loop iteraLon 1 cycle ALU 1 ALU 2 MEM control ALU 1 ALU 2 MEM control ALU 1 ALU 2 MEM control 0 1 1 4 2 2 3 6 4 5 5 3 7 6 7 8 9

Let’s try again § Consider 8-issue machine cycle ALU 1 ALU 2 MEM control ALU 1 ALU 2 MEM control § Duplicate 4-issue 0 machine 1 § Control 2 nd group 2 by predicate Q 3 § Execute only if 4 Q==true 5 § “Predicated 6 execu2on” 7 8 9 13

Let’s try again 1 ld r3 ← mem[r1] 2 mul r4 ← r3, #3 3 st mem[r2] ← r4 4 add r1 ← r1, #4 § Scheduling the 2 nd iteraLon 5 add r2 ← r2, #4 6 clt p1 ← r1, r5 on an 8-issue machine 7 b p1, @loop iteraLon 1 iteraLon 2 cycle ALU 1 ALU 2 MEM control ALU 1 ALU 2 MEM control ALU 1 ALU 2 MEM control 0 1 1 4 2 2 1 3 4 6 4 2 5 5 3 7 6 6 5 7 3 7 8 9

Let’s try again 1 ld r3 ← mem[r1] 2 mul r4 ← r3, #3 3 st mem[r2] ← r4 4 add r1 ← r1, #4 § Scheduling the 3 rd iteraLon 5 add r2 ← r2, #4 6 clt p1 ← r1, r5 on a 12-issue machine 7 b p1, @loop iteraLon 1 iteraLon 2 iteraLon 3 cycle ALU 1 ALU 2 MEM control ALU 1 ALU 2 MEM control ALU 1 ALU 2 MEM control 0 1 1 4 2 2 1 3 4 6 4 2 1 5 5 3 7 4 6 6 2 5 7 3 7 6 8 5 9 3 7

Let’s try again § ObservaLon: schedules of different iteraLons are idenLcal cycle ALU 1 ALU 2 MEM control § except for the start-up 0 1 cycles 1 4 § except for the wind-down 1 2 2 cycles 3 4 § Schedules can be 6 1 4 2 overlapped 5 5 4 3 7 6 1 6 2 5 7 4 3 7 6 1 8 2 5 9 4 3 7 6 10 2 1 … … … … …

Performance § Throughput cycle ALU 1 ALU 2 MEM control § for each individual 0 1 itera2on: 6 cycles 1 4 § 1 st itera2on: 6 cycles 1 2 2 § each addi2onal itera2on: 2 cycles 3 4 § n itera2ons: 2*n + 4 6 1 4 2 5 5 4 3 7 § Code size (w/ predicated 6 1 6 2 execuLon) 5 7 4 3 7 § 7 instruc2ons 6 1 8 2 § Predicate Q: r1 < r5 5 9 4 3 7 6 10 2 1 … … … … …

So^ware pipelining § Standard techniques (region scheduling, loop fusion, loop unrolling) do not yield sufficient ILP § Method of choice: so^ware pipelining time § Overlap itera2ons of ... the loop body iteration 1 2 3 4 997 998 999 1000 1 § Steady-state: kernel 2 1 3 2 1 § Peak performance: 4 3 2 1 1 loop itera2on/cycle 4 3 2 1 4 3 2 4 3 1 § No scheduling-barriers 4 2 1 3 2 1 between itera2ons 4 3 2 1 4 3 2 1 § No loop unrolling necessary 4 3 2 4 3 § Requires sufficient resources 4

So^ware pipelining 2me § Basic idea ... itera2on 1 2 3 4 § Unroll the loop “completely” § Correctly schedule the loop under 1 II (ini2a2on interval) two constraints II 2 1 § All itera2on bodies have iden2cal schedules II 3 2 1 § Each new itera2on starts exactly II (ini2a2on interval) cycles 4 3 2 1 a8er the previous itera2on 4 3 2 § ExecuLon Lme in terms of stage count (SC) SC (stage count) 4 3 § One loop itera2on: SC×II cycles § Prologue/epilogue: (SC-1)×II cycles 4 § Kernel steady state: II cycles § ExecuLon Lme of a so^ware pipelined loop: II×(n+sc-1) cycles

Modulo scheduling § Most common technique to find so^ware pipelined schedules § Basic concept § Unroll the loop “completely” § Schedule the loop under two constraints § All itera2on bodies have iden2cal schedules § Each new itera2on starts exactly II cycles a8er the previous itera2on

Modulo scheduling: problem formulaLon § Problem : find a schedule for one loop body iteraLon such that when the schedule is repeated at intervals of II cycles § No hardware resource conflict arises between opera2ons of the same and successive itera2ons of the loop body § No intra/inter–loop dependences are violated

Modulo scheduling: resource constraints § Handling resource constraints § No resource must be used by different opera2ons at two points in 2me that are separated by an interval that is a mul2ple of the ini2a2on interval § This requirement is iden2cal to: Within a single itera2on, no resource is ever used more than once at the same 2me modulo II § Search for suitable iniLaLon interval

Modulo scheduling: resource constraints § Modulo reservaLon tables § Table containing II rows and one column for each resource II = 2 cycle ALU 1 ALU 2 MEM control 0 1 II = 3 cycle ALU 1 ALU 2 MEM control 0 1 2 § Scheduling op at 2me t on resource r § Entry for r at t mod II must be free § Mark t mod II busy for r

Modulo scheduling: resource constraints § Modulo reservaLon tables § Table containing II rows and one column for each resource iteraLon 1 1 ld r3 ← mem[r1] 2 mul r4 ← r3, #3 cycle ALU 1 ALU 2 MEM control 3 st mem[r2] ← r4 4 add r1 ← r1, #4 0 1 5 add r2 ← r2, #4 1 4 6 clt p1 ← r1, r5 7 b p1, @loop 2 2 3 4 6 5 5 3 7 cycle ALU 1 ALU 2 MEM control II = 2 6 0 2 1 5 3 7 1 4

Modulo scheduling: dependence constraints § Dependence constraints § Both loop-independent and loop-carried dependences must be considered § Annotate each edge in the data dependence graph with a tuple t = <distance, delay> § Delay: minimum 2me interval between the start of opera2ons Dependence Delay Conserva2ve delay Latency(pred) Latency(pred) True 1-latency(succ) 0 An2 1+latency(pred)− Latency(pred) Output latency(succ)

263-2810: Advanced Compiler Design 8.2 Scheduling for ILP processors - PowerPoint PPT Presentation

263-2810: Advanced Compiler Design 8.2 Scheduling for ILP processors Thomas R. Gross Computer Science Department ETH Zurich, Switzerland Overview 8.1 InstrucLon scheduling basics 8.2 Scheduling for ILP processors 8.2 Scheduling for ILP

Partner Partner Partner Partner +1 202 263 3241 +1 202 263 3241 +1 202 263 3241 + 1 202 263

263-2810: Advanced Compiler Design 6.0 Partial redundancy elimination Thomas R. Gross Computer

263-2810: Advanced Compiler Design 3.0 SSA form for arbitrary control flow graphs Thomas R. Gross

263-2810: Advanced Compiler Design 2.0 Sta>c Single Assignment Form Thomas R. Gross Computer

263-2810: Advanced Compiler Design Compilation with dynamic information Thomas R. Gross Computer

ASD TUG 2810 HYBRID TUGS TUGS TUGS & WORKBOATS ASD TUG 2810 HYBRID TUGS POLLUTION TUGS

Compiler Construction Chapter 11 1 Compiler Construction Compiler Construction A New Compiler

Exploiting More ILP ILP = __________ _ ________

1 ILP Ferrara sept 2018 Games 2 ILP Ferrara sept 2018 Interest of games for AI Excellent

Limits to ILP Conflicting studies of amount Benchmarks (vectorized Fortran FP vs. integer C

Aperiodic Task Scheduling Radek Pel anek Preemptive Scheduling Non-preemptive Scheduling

COSC 5351 Advanced Computer Architecture Slides modified from Hennessy CS252 course slides ILP

MLP yes! Definitions ILP no ! MLP ILP = Instruction Level = Memory Level Parallelism Work

CPU Scheduling CPU Scheduling CPU Scheduling 101 CPU Scheduling 101 The CPU scheduler makes a

Module 5: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms

Chapter 6: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms

Towards a Reconfigurable Bit-Serial/Bit-Parallel Vector Accelerator Using In-Situ

Code Generation Wilhelm/Maurer: Compiler Design, Chapter 12 Reinhard Wilhelm

MANINDER KAUR professormaninder@gmail.com Maninder Kaur www.eazynotes.com 1

Design of Control Path Debdeep Mukhopadhyay IIT Madras Hardwired Hardware GCD Processor An

1 How gates work NOT ( ) NAND ( ) o Transistors can be made to

Concepts of programming languages Lecture 6 Wouter Swierstra Faculty of Science Information and

Optimizing and Comparing CMOS Implementations of the C-element in 65nm Technology: Self-Timed

Chapter Five 1 2004 Morgan Kaufmann Publishers The Processor: Datapath & Control

263-2810: Advanced Compiler Design 8.2 Scheduling for ILP processors - PowerPoint PPT Presentation

263-2810: Advanced Compiler Design 8.2 Scheduling for ILP processors Thomas R. Gross Computer Science Department ETH Zurich, Switzerland Overview 8.1 InstrucLon scheduling basics 8.2 Scheduling for ILP processors 8.2 Scheduling for ILP

Partner Partner Partner Partner +1 202 263 3241 +1 202 263 3241 +1 202 263 3241 + 1 202 263

263-2810: Advanced Compiler Design 6.0 Partial redundancy elimination Thomas R. Gross Computer

263-2810: Advanced Compiler Design 3.0 SSA form for arbitrary control flow graphs Thomas R. Gross

263-2810: Advanced Compiler Design 2.0 Sta&gt;c Single Assignment Form Thomas R. Gross Computer

263-2810: Advanced Compiler Design Compilation with dynamic information Thomas R. Gross Computer

ASD TUG 2810 HYBRID TUGS TUGS TUGS &amp; WORKBOATS ASD TUG 2810 HYBRID TUGS POLLUTION TUGS

Compiler Construction Chapter 11 1 Compiler Construction Compiler Construction A New Compiler

Exploiting More ILP ILP = __________________ _________________ ________________

1 ILP Ferrara sept 2018 Games 2 ILP Ferrara sept 2018 Interest of games for AI Excellent

Limits to ILP Conflicting studies of amount Benchmarks (vectorized Fortran FP vs. integer C

Aperiodic Task Scheduling Radek Pel anek Preemptive Scheduling Non-preemptive Scheduling

COSC 5351 Advanced Computer Architecture Slides modified from Hennessy CS252 course slides ILP

MLP yes! Definitions ILP no ! MLP ILP = Instruction Level = Memory Level Parallelism Work

CPU Scheduling CPU Scheduling CPU Scheduling 101 CPU Scheduling 101 The CPU scheduler makes a

Module 5: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms

Chapter 6: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms

Towards a Reconfigurable Bit-Serial/Bit-Parallel Vector Accelerator Using In-Situ

Code Generation Wilhelm/Maurer: Compiler Design, Chapter 12 Reinhard Wilhelm

MANINDER KAUR professormaninder@gmail.com Maninder Kaur www.eazynotes.com 1

Design of Control Path Debdeep Mukhopadhyay IIT Madras Hardwired Hardware GCD Processor An

1 How gates work NOT ( ) NAND ( ) o Transistors can be made to

Concepts of programming languages Lecture 6 Wouter Swierstra Faculty of Science Information and

Optimizing and Comparing CMOS Implementations of the C-element in 65nm Technology: Self-Timed

Chapter Five 1 2004 Morgan Kaufmann Publishers The Processor: Datapath &amp; Control

263-2810: Advanced Compiler Design 2.0 Sta>c Single Assignment Form Thomas R. Gross Computer

ASD TUG 2810 HYBRID TUGS TUGS TUGS & WORKBOATS ASD TUG 2810 HYBRID TUGS POLLUTION TUGS

Exploiting More ILP ILP = __________ _ ________

Chapter Five 1 2004 Morgan Kaufmann Publishers The Processor: Datapath & Control