
Optimal Software Pipelining of Loops with Control Flows

Han-Saem YUN, Jihong KIM, Soo-Mook MOON
Computer Architecture and Embedded Systems Lab.
Seoul National University, KOREA
16th International Conference on Supercomputing (ICS '02)


SLIDE 1
ICS '02 1/39 CARES Lab. / Seoul National Univ.

SLIDE 2

Software Pipelining (SP) Example

  • p1: r1 = r1 - r2
  • p2: cc0 = (r1 <= 0)
  • p3: if (!cc0)
  • p4: r1 = r1 + r3
  • p5: r5 = r1 << 4
  • p6: r5 = load @r5
  • p7: cc0 = (r5 <= 0)
  • p8: if (!cc0)
[Figure: the software-pipelined schedule; each parallel instruction p1..p8 packs operations op1..op8 drawn from several in-flight iterations.]

Much more complex than SP of loops without control flows! Even formulating the problem is very difficult (let alone discussing optimality).

SLIDE 3

Optimal SP

  • An optimally software-pipelined program:

– every execution path of the program runs in the shortest possible time, subject to true dependences and resource constraints (well-defined??)

Loop body: a = d+1 ; if (d == 0) ; { b = g1(a); c = f1(a) } or { b = g2(a); c = f2(a) } ; d = b+1 ; if (c == 0)

For any path, there is a dependence chain whose length is equal to the number of execution cycles of the path.

[Figure: the path unrolled over two iterations, with its critical dependence chain highlighted.]

length of dependence chain = 6 = # of execution cycles
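The chain-counting criterion can be checked mechanically. Here is a small sketch (ours, not the paper's) that models the true dependences along one execution path of the example, unrolled over two iterations, and compares a greedy ASAP schedule (unit latencies, unbounded issue width) against the longest dependence chain; the op names and the g1/f1 dependence shape are assumptions read off the slide.

```python
# ops along the path and the earlier ops each one truly depends on
path = [
    ("a1", set()),       # a = d+1      (iteration 1)
    ("b1", {"a1"}),      # b = g1(a)
    ("c1", {"a1"}),      # c = f1(a)
    ("d1", {"b1"}),      # d = b+1
    ("a2", {"d1"}),      # a = d+1      (iteration 2)
    ("b2", {"a2"}),      # b = g2(a)
    ("c2", {"a2"}),      # c = f2(a)
    ("d2", {"b2"}),      # d = b+1
]

# greedy ASAP: an op issues one cycle after its latest predecessor
cycle = {}
for op, deps in path:
    cycle[op] = 1 + max((cycle[d] for d in deps), default=0)

# longest true-dependence chain ending at each op (same recurrence:
# that coincidence is exactly the optimality criterion for ASAP)
chain = {}
for op, deps in path:
    chain[op] = 1 + max((chain[d] for d in deps), default=0)

print(max(cycle.values()), max(chain.values()))   # 6 6
```

The critical chain is a1 → b1 → d1 → a2 → b2 → d2, giving the slide's 6 cycles.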

SLIDE 4

Previous Results on Optimal SP

Loops without control flows, without resource constraints (only true dependences):

  • Poly-time optimal algorithm

– Aiken & Nicolau, PLDI '88: unroll & modulo scheduling

Loops without control flows, with resource constraints:

  • NP-hard
  • Poly-time approximation algorithm

– Gasperoni et al., PPL '94: < about 2x optimal performance

  • Practical settings: optimality formulation & heuristics (register pressure, D-cache conscious, clustered VLIW, …)

Loops with control flows:

  • NOT well-defined
  • Generally, NO optimal solution exists for some loops

– Schwiegelshohn, Gasperoni, Ebcioglu, MICRO '89: illustrated 2 loops
SLIDE 5

Schwiegelshohn et al.'s Result [MICRO '89, JPDS '91]

  • Illustrates some loops that cannot have semantically equivalent, optimally software-pipelined programs

– lacks the formalism required to develop generalized results

  • No further research results on optimal SP were reported for more than a decade

– possibly discouraged by this pessimistic result

[Figure: Venn diagram over the set of all loops with control flows]

SLIDE 6

Our Recent Result [CC '01]

  • Describes a strong necessary condition for a loop to have an equivalent optimally software-pipelined program

– i.e., presents a nonexistence proof for loops that do not satisfy the condition
– a generalization of Schwiegelshohn's result

  • As part of the formal treatment, proposes a formalization of software pipelining of loops with control flows

[Figure: Venn diagram, Schwiegelshohn's result inside the set of all loops with control flows]

SLIDE 7

Our Contribution 1: Optimality Condition

  • Necessary and sufficient condition for a loop to have a semantically equivalent optimally software-pipelined program

– We call this the "optimality condition"
– It exactly identifies what can and cannot be achieved by SP

  • Developed a decision procedure to compute the condition

[Figure: Venn diagram, Schwiegelshohn's result and our previous result]

SLIDE 8

Our Contribution 1: Optimality Condition

[Figure: Venn diagram, loops without optimal solutions vs. loops with optimal solutions]

  • Necessary and sufficient condition for a loop to have a semantically equivalent optimally software-pipelined program

– We call this the "optimality condition"
– It exactly identifies what can and cannot be achieved by SP

  • Developed a decision procedure to compute the condition
SLIDE 9

Our Contribution 2: Optimal Algorithm

  • Algorithm to compute an optimal solution for every loop satisfying the optimality condition (the necessary and sufficient condition)

– Quite expensive, but covers the right region completely
– Also serves as a proof of the sufficiency part of the optimality condition

[Figure: Venn diagram, loops without optimal solutions / loops with optimal solutions, split by the optimality condition boundary]
SLIDE 10

Our Contribution 3: Conservative Optimal Algo.

  • An efficient algorithm to compute an optimal solution for "almost every" loop satisfying the optimality condition

– "almost" ≡ more than 90% (for loops used in our experiments)
– "efficient" ≡ in less than 30 sec.

[Figure: Venn diagram with the optimality condition boundary]
SLIDE 11

Our Contribution 4: Experiments

  • Measure the actual portion of each region

– A: loops without optimal solutions
– B: loops with optimal solutions whose optimal solutions cannot be computed by the (efficient) conservative optimal algorithm
– C: loops with optimal solutions whose optimal solutions can be computed by the (efficient) conservative optimal algorithm

[Figure: Venn diagram with regions A, B, C]

SLIDE 12

Our Contribution 4: Experiments

  • Actual portion of each region

– Quite optimistic! (A ≅ 10%, C ≅ 75%) (for loops used in our experiments)

  • Resource requirements for optimal solutions

– Not excessive

[Figure: Venn diagram with regions A, B, C]

SLIDE 13

Road Map

  • 1. Optimality Condition
  • 2. Optimal SP Algorithm
  • 3. Conservative Optimal SP Algorithm
  • 4. Experimental Results
SLIDE 14

Optimality Condition: Intuitive Explanation

  • Operations in the sequential program need to be moved only a bounded range to yield time-optimal execution

– for any execution path

[Figure: a sequential path of k "a = a+1" nodes followed by k "b = b+1" nodes, next to the time-optimal parallel path that interleaves "a = a+1; b = b+1" pairs into k cycles. Each "b = b+1" is moved across k-1 "a = a+1" operations. What if k → ∞?]
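A numeric sketch of this slide's figure (our own, assuming unit latencies and a 2-wide machine): two independent k-op chains run in 2k cycles sequentially but in k cycles when interleaved, and achieving that requires hoisting b_0 across k-1 "a = a+1" operations, a distance that grows without bound in k.

```python
def schedule(k):
    issue = {}
    for i in range(k):
        issue[("a", i)] = i       # a_i waits only for a_{i-1}
        issue[("b", i)] = i       # b_i is independent of every a_j
    seq_cycles = 2 * k                     # program order: one op per cycle
    par_cycles = 1 + max(issue.values())   # time-optimal: pair (a_i, b_i)
    hoist = k - 1                          # b_0: from after a_{k-1} to beside a_0
    return seq_cycles, par_cycles, hoist

for k in (4, 100, 10_000):
    print(k, schedule(k))   # the hoist distance k-1 is unbounded as k grows
```

This is the "what if k → ∞?" point: no fixed code-motion range suffices for all k.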

SLIDE 15

Example Loops

Loops with optimal solutions vs. loops without optimal solutions:

  • For loops with optimal solutions, show the existence of the optimal solution by construction (i.e., by the optimal SP algorithm presented in the next chapter)
  • For loops without optimal solutions, present the nonexistence proof on the following slide

SLIDE 16

Sketch of Nonexistence Proof

  • What if a loop

– does NOT satisfy the optimality condition
– but has an optimal solution?

  • Then we can construct problematic execution paths s.t.

– code motions of unbounded range are needed for them to execute optimally

  • Intuitively, code motion of unbounded range incurs

– for a conditional branch: unbounded code expansion
– for a non-branch: an unbounded live range ⇒ unbounded registers
– Neither is possible

  • But how to prove it mathematically?

– Establish a formalization of software pipelining!
– Schwiegelshohn's proof is based on quite a specific property of the SP transformation, and thus fails to lead to generalized results
SLIDE 17

For Loops Violating Optimality Condition …

  • Can't we guarantee anything for them (in terms of optimal SP)?
  • From the nonexistence proof,

– the problematic paths are exposed to too much parallelism, i.e., they require too large scheduling windows
– But the same problem also arises in single-path loops. How has it been handled there?
– Add artificial dependences!!

[Figure: the k-node "a = a+1" / "b = b+1" chains, and a loop "a = f(); b = a+1; c = b*2" rewritten as "a = f'(a); b = a+1; c = b*2".]

  • e.g.: every operation in the i-th iteration is made dependent on every operation in the (i-a)-th iteration

– for some a > 0
SLIDE 18

Road Map

  • 1. Optimality Condition
  • 2. Optimal SP Algorithm
  • 3. Efficient Optimal SP Algorithm
  • 4. Experimental Results
SLIDE 19

Optimal SP Algorithm

  • Computes an optimal solution for every loop satisfying the optimality condition

– Mostly based on the latest version of Aiken & Nicolau's Perfect Pipelining [TPDS '95]

  • Modification to the renaming framework

– Aiken's original algorithm doesn't always handle false dependences appropriately: it uses Ebcioglu's on-the-fly dynamic renaming only restrictively
– For optimal SP, false dependences should be overcome completely, so that the parallel schedule is constrained by true dependences only
– We use SSA (Static Single Assignment) as the renaming framework
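A minimal single-assignment renaming sketch (ours; straight-line code only, no phi-functions, so only the spirit of the SSA framework the slide mentions): every definition gets a fresh versioned name, so anti- and output dependences vanish and only true dependences remain. The three ops echo p5-p7 from the earlier example slide.

```python
import itertools

def rename(ops):
    """ops: list of (dst, srcs); returns the list with versioned names."""
    version = {}                      # register -> current version name
    fresh = itertools.count()
    out = []
    for dst, srcs in ops:
        new_srcs = [version.get(s, s) for s in srcs]   # reads use current versions
        version[dst] = f"{dst}.{next(fresh)}"          # each def gets a fresh name
        out.append((version[dst], new_srcs))
    return out

ops = [("r5", ["r1"]),      # p5: r5 = r1 << 4
       ("r5", ["r5"]),      # p6: r5 = load @r5   (anti + output dep on r5)
       ("cc0", ["r5"])]     # p7: cc0 = (r5 <= 0)
print(rename(ops))
# [('r5.0', ['r1']), ('r5.1', ['r5.0']), ('cc0.2', ['r5.1'])]
```

After renaming, the two writes to r5 no longer conflict; only the flow dependences r5.0 → r5.1 → cc0.2 constrain the schedule.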

SLIDE 20

Optimality of the SP Algorithm

  • Important properties

– greedy: ASAP compaction scheduling
– no false dependences: code motion is never hampered by false dependences; it is constrained by true dependences only

  • From these properties, we can prove:

– for any parallel execution path of the software-pipelined program, there is a dependence chain whose length equals the number of execution cycles of the parallel path

  • Note that the algorithm also serves as a proof of the sufficiency part of the optimality condition
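The way the two properties combine can be checked on random DAGs (a sketch under our own assumptions: small integer latencies, unbounded issue width, true dependences only): a cycle-by-cycle greedy ASAP simulation always finishes in exactly the longest-chain length.

```python
import random

def asap_makespan(n, edges, latency):
    """Greedy cycle-by-cycle issue with unbounded width: an op issues as soon
    as every predecessor has finished."""
    preds = {v: [u for u, w in edges if w == v] for v in range(n)}
    finish, cycle = {}, 0
    while len(finish) < n:
        for v in range(n):
            if v not in finish and all(finish.get(u, cycle + 1) <= cycle
                                       for u in preds[v]):
                finish[v] = cycle + latency[v]
        cycle += 1
    return max(finish.values())

def longest_chain(n, edges, latency):
    best = [0] * n
    for v in range(n):                       # nodes arrive in topological order
        best[v] = latency[v] + max((best[u] for u, w in edges if w == v),
                                   default=0)
    return max(best)

random.seed(0)
for _ in range(200):                         # random DAGs: edges u -> v with u < v
    n = 8
    edges = [(u, v) for u in range(n) for v in range(u + 1, n)
             if random.random() < 0.3]
    latency = [random.randint(1, 3) for _ in range(n)]
    assert asap_makespan(n, edges, latency) == longest_chain(n, edges, latency)
print("greedy ASAP makespan == longest true-dependence chain on every trial")
```

Add a false dependence or a resource limit and the equality breaks, which is why both properties are needed.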

SLIDE 21

Road Map

  • 1. Optimality Condition
  • 2. Optimal SP Algorithm
  • 3. Conservative Optimal Algorithm
  • 4. Experimental Results
SLIDE 22

Conservative Optimal SP Algorithm

[Figure: Venn diagram with the optimality condition boundary]

  • An efficient algorithm to compute an optimal solution for "almost every" loop satisfying the optimality condition

– "almost" ≡ more than 90% (for loops used in our experiments)
– "efficient" ≡ in less than 30 sec.

SLIDE 23

Intermediate Representation (IR) for SP

[Figure: Loop in CFG form → (Transform into IR) → Loop in IR form → (Software Pipelining) → SP'ed loop in IR form → (Transform back into CFG) → SP'ed loop in CFG form]

  • Transform into IR (e.g., if-conversion in the IMPACT compiler)
  • Transform back into CFG (e.g., reverse if-conversion in the IMPACT compiler)

SLIDE 24

Some SP techniques…(e.g., EPS, Perfect Pipelining)

[Figure: Loop in CFG form → (Software Pipelining) → SP'ed loop in CFG form]

SLIDE 25

Example of Our SP (& IR)

[Figure: CFG → (Transform into IR) → NCFG → (Software Pipelining) → SP'ed NCFG → (Transform back into CFG) → SP'ed CFG]

  • Our IR: Nondeterministic Control Flow Graph (NCFG)
SLIDE 26

Our Intermediate Representation

  • Nondeterministic Control Flow Graph (NCFG)

– Milicev [IPPS '97, IPPS '98, SIGPLAN Notices '01, IJPP '02]
– Similar to the one used in PRE by Knoop [TOPLAS '94]
– Code motion on an NCFG is similar to code motion on a CFG

  • compensation code at joins, renaming, …

– NCFG : CFG = NFA : DFA

  • For a path in the CFG, there is a corresponding thread of paths in the NCFG
  • Transforming a CFG into an NCFG is direct (as DFA to NFA)
  • Transforming an NCFG into a CFG is very similar to NFA-to-DFA conversion

– A nice property (in terms of optimal SP)

  • If every path of the NCFG runs in the shortest possible time subject to true dependences only, the corresponding CFG is optimal
  • Finding an optimal SP algorithm for NCFGs is therefore sufficient!

[Figure: CFG → NCFG (trivial) → Optimal NCFG (let's find it!!) → Optimal CFG (known, Milicev)]
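The NCFG : CFG = NFA : DFA analogy can be made concrete on an actual automaton: the subset construction that determinizes an NFA is the same shape of algorithm that turns a nondeterministic control flow graph back into an ordinary CFG. The toy automaton below is our own, not from the paper.

```python
from collections import deque

def subset_construction(alphabet, delta, start):
    """delta: dict (state, symbol) -> set of states (the nondeterminism).
    Returns the DFA as a dict: frozenset-of-states -> {symbol -> frozenset}."""
    start_set = frozenset([start])
    dfa, work = {}, deque([start_set])
    while work:
        S = work.popleft()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in alphabet:
            T = frozenset(q for s in S for q in delta.get((s, a), set()))
            dfa[S][a] = T
            if T and T not in dfa:     # new reachable deterministic state
                work.append(T)
    return dfa

# tiny NFA: from state 0, symbol 'x' may go to 1 OR 2 (the "nondeterminism")
delta = {(0, "x"): {1, 2}, (1, "y"): {0}, (2, "y"): {2}}
dfa = subset_construction({"x", "y"}, delta, 0)
print(len(dfa))   # 4 deterministic states: {0}, {1,2}, {0,2}, {2}
```

Each deterministic state is a set of NFA states, just as each CFG node after conversion corresponds to a set of NCFG alternatives.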

SLIDE 27

Conservative Optimal SP Algorithm

  • Runs on loops in NCFG form
  • Can be regarded as generalized modulo scheduling
  • Guarantees optimality if a loop satisfies the conservative optimality condition:

For any pair of simple cycles (c1, c2) in the NCFG, there exist critical dependence cycles, d1 in c1 and d2 in c2, s.t. d1 and d2 are dependent on each other (informal)

– stronger than the optimality condition
– but covers more than 90% of loops satisfying the optimality condition (for loops used in our experiments)

SLIDE 28

Conservative Optimality Condition

For any pair of simple cycles (c1, c2) in the NCFG, there exist critical dependence cycles, d1 in c1 and d2 in c2, s.t. d1 and d2 are dependent on each other

[Figure: 3 simple cycles in an NCFG]

SLIDE 29

Conservative Optimality Condition

For any pair of simple cycles (c1, c2) in the NCFG, there exist critical dependence cycles, d1 in c1 and d2 in c2, s.t. d1 and d2 are dependent on each other

[Figure: the critical dependence cycles within each NCFG cycle]

We call this set of dependence cycles the dependence kernel.

SLIDE 30

Conservative Optimal Algo. : Step 1

[Figure: a two-block NCFG with II_left = 2, II_right = 2, and II_left + II_right = 4]

  • Find the II of each block

– For each simple cycle in the NCFG, there is one equation:

  • LHS = symbolic sum of the IIs of the blocks the NCFG cycle consists of
  • RHS = the length of the critical dependence cycle of that NCFG cycle

– Find any tuple (II_1, II_2, …) that satisfies all the equations
– If no such tuple of IIs exists, apply the split transformation so that the number of variables (IIs) exceeds the number of independent equations

(II_left, II_right) = (2, 2)
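Step 1 thus reduces to a small system: one unknown II per block, one equation per simple NCFG cycle. A brute-force sketch over the slide's toy instance (block names and cycle lengths assumed from the figure):

```python
from itertools import product

def solve_iis(n_blocks, equations, max_ii=8):
    """equations: list of (blocks_in_cycle, critical_dep_cycle_length).
    Searches for an integer II per block satisfying every cycle equation."""
    for iis in product(range(1, max_ii + 1), repeat=n_blocks):
        if all(sum(iis[b] for b in blocks) == length
               for blocks, length in equations):
            return iis
    return None   # no tuple exists

# blocks: 0 = left, 1 = right
# cycles: the left self-loop, the right self-loop, and the combined cycle
equations = [([0], 2), ([1], 2), ([0, 1], 4)]
print(solve_iis(2, equations))   # (2, 2)
```

When `solve_iis` returns None, no satisfying tuple exists and the split transformation of the next slide applies, adding variables until the system becomes solvable.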

SLIDE 31

Split Transformation

Before splitting: II_1 = 3, II_2 = 2, II_1 + II_2 = 4 → NO solution exists.

After splitting:

II_11 = 3, II_22 = 2
II_12 + II_21 = 4
II_11 + II_12 + II_21 = 7
II_22 + II_12 + II_21 = 7
II_11 + II_22 + II_12 + II_21 = 9
→ (II_11, II_12, II_21, II_22) = (3, 2, 2, 2)

# of variables > # of independent equations

SLIDE 32

Conservative Optimal Algo. : Step 2

(II_left, II_right) = (2, 2)

  • Given the tuple of IIs (one per block), find a schedule

– Schedule: each operation's offset from the beginning of its block

  • c.f. the flat schedule in modulo scheduling

– Use a longest-path-based algorithm (as in modulo scheduling)

  • Weight of an edge = latency (forward edge), latency − II (backedge)

[Figure: dependence graph with edge weights 1, 1 − II_left, and 1 − II_right]
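The longest-path step can be sketched as a Bellman-Ford-style relaxation (our own toy numbers: a single block with a unit-latency dependence cycle of length 3, so II = 3 and the backedge weighs 1 − II = −2, making the longest-path fixed point well-defined):

```python
def flat_schedule(n_ops, edges):
    """edges: (u, v, weight). Longest-path offsets from a zero baseline;
    forward edges weigh latency, backedges weigh latency - II."""
    offset = [0] * n_ops
    for _ in range(n_ops + 1):               # relax until a fixed point
        changed = False
        for u, v, w in edges:
            if offset[u] + w > offset[v]:
                offset[v] = offset[u] + w
                changed = True
        if not changed:
            return offset
    raise ValueError("positive-weight cycle: the chosen II is too small")

II = 3                                        # II of the (single) block
edges = [(0, 1, 1),                           # forward edge: weight = latency = 1
         (1, 2, 1),
         (2, 0, 1 - II)]                      # backedge: weight = latency - II
print(flat_schedule(3, edges))                # [0, 1, 2]
```

If II were 2, the cycle 0 → 1 → 2 → 0 would have positive total weight and no finite offsets would exist, which is the longest-path view of "II must cover the dependence cycle".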

SLIDE 33

Conservative Optimal Algo. : Step 3

  • Code motion according to the schedule

– as on a CFG
– shaded region = one parallel instruction

SLIDE 34

Final Step: Transform the NCFG back into a CFG

SLIDE 35

Road Map

  • 1. Optimality Condition
  • 2. Optimal SP Algorithm
  • 3. Conservative Optimal SP Algorithm
  • 4. Experimental Results
SLIDE 36

Experimental Settings

  • SPARC-based VLIW testbed [Moon, MICRO '97]
  • Loops (with control flows) extracted from SPEC95int

– # of paths ≤ 4
– up to 64 operations

  • as in other works, e.g., [Huff, PLDI '93], [Govindarajan, MICRO '95]
  • Latency configurations

– L.1: load = 2 cycles, others = 1 cycle
– L.2: load = 3 cycles, others = 1 cycle
– L.3: load = 3 cycles, others = 2 cycles

SLIDE 37

Experimental Scenario

[Figure: Venn diagram with regions A, B, C]

  • Exp. 1: compute the optimality condition

– If it is not satisfied within 30 sec of CPU time, the loop falls into A

  • Exp. 2: compute the sufficient condition for the efficient SP algorithm

– If it is not satisfied within 30 sec of CPU time, the loop falls into B

  • Exp. 3: compute the optimal solution with the efficient SP algorithm

– If it computes within 30 sec, the loop falls into C; otherwise it falls into B

SLIDE 38

Experimental Results

  • Actual portion of each region

– Quite optimistic!

  • L.1: (A ≅ 10%, C ≅ 75%) // load = 2 cycles, others = 1 cycle
  • L.2: (A ≅ 12%, C ≅ 72%) // load = 3 cycles, others = 1 cycle
  • L.3: (A ≅ 12%, C ≅ 77%) // load = 3 cycles, others = 2 cycles

[Figure: Venn diagram with regions A, B, C]

SLIDE 39

Summary

  • Exactly identified what can and cannot be achieved by SP
  • For eligible loops, presented optimal SP algorithms

– First one: covers all eligible loops, but expensive
– Second one: covers most eligible loops, and efficient

  • Promising experimental results
  • Future work

– Realistic resource-constrained SP algorithm
– Candidate: Enhanced Pipeline Scheduling (EPS) guided by the efficient optimal algorithm