SLIDE 1

Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time

Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen and Nicolas Vasilache

ALCHEMY, INRIA Futurs / University of Paris-Sud XI

March 12, 2007

Fifth International Symposium on Code Generation and Optimization

San Jose, California

SLIDE 2

Outline: CGO’07

Context of this study:

◮ Focus on Loop Nest Optimization for regular loops
◮ Automatic method for parallelism extraction / loop transformation
◮ Combine iterative methods with the power of the polyhedral model
◮ Solution independent of the compiler and the target machine

Our contribution:

◮ Search space construction
    ◮ 1 point in the space ⇔ 1 distinct legal program version
    ◮ Suitable for various exploration methods
◮ Performance
    ◮ 99% of the best speedup attained within 20 runs of a dedicated heuristic
    ◮ Wall-clock optimal transformation discoverable on small kernels

SLIDE 3

Scheduling in the Polyhedral Model: A Motivating Example

One-Dimensional Scheduling

Original Schedule

    for (i=0; i<n; ++i) {
        S1(i);
        for (j=0; j<n; ++j)
            S2(i,j);
    }

Schedule: $\theta_{S1} = i$, $\theta_{S2} = i$

Generated code (identical to the source for this schedule):

    for (i=0; i<n; ++i) {
        S1(i);
        for (j=0; j<n; ++j)
            S2(i,j);
    }

◮ Specify the outer-most loop only
◮ Initial outer-most loop is i

SLIDE 4

One-Dimensional Scheduling

Distribute loops

Schedule: $\theta_{S1} = i$, $\theta_{S2} = i + n$

Generated code:

    for (i=0; i<n; ++i)
        S1(i);
    for (i=n; i<2*n; ++i)
        for (j=0; j<n; ++j)
            S2(i-n,j);

◮ Specify the outer-most loop only
◮ All instances of S1 are executed before the first S2 instance

SLIDE 5

One-Dimensional Scheduling

Distribute loops + Interchange loops for S2

Schedule: $\theta_{S1} = i$, $\theta_{S2} = j + n$

Generated code:

    for (i=0; i<n; ++i)
        S1(i);
    for (j=n; j<2*n; ++j)
        for (i=0; i<n; ++i)
            S2(i,j-n);

◮ Specify the outer-most loop only
◮ The outer-most loop for S2 becomes j
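A minimal C sketch of how to read a one-dimensional schedule, assuming the distribution + interchange schedule given on the slide ($\theta_{S1} = i$, $\theta_{S2} = j + n$). The names `theta_s1`, `theta_s2` and `check_transformed_order` are illustrative and do not come from the presentation. The function walks the generated loop nest and checks that the outer loop counter is exactly the logical date of each executed instance, and that dates never decrease along the execution order:

```c
/* One-dimensional schedules from the slide (distribution + interchange):
 * theta_S1(i) = i and theta_S2(i, j) = j + n. */
static int theta_s1(int i) { return i; }
static int theta_s2(int j, int n) { return j + n; }

/* Walks the transformed loop nest for a given n and checks that
 * (1) the outer loop counter equals the logical date of the instance
 *     being executed, and
 * (2) logical dates never decrease along the execution order.
 * Returns 1 when both hold, 0 otherwise. */
int check_transformed_order(int n) {
    int last_date = -1;

    for (int i = 0; i < n; ++i) {            /* executes S1(i) */
        int date = theta_s1(i);
        if (date != i || date < last_date) return 0;
        last_date = date;
    }
    for (int j = n; j < 2 * n; ++j) {        /* outer counter is the date */
        for (int i = 0; i < n; ++i) {        /* executes S2(i, j - n) */
            int date = theta_s2(j - n, n);
            if (date != j || date < last_date) return 0;
            last_date = date;
        }
    }
    return 1;
}
```

Calling `check_transformed_order(n)` for a few small values of `n` is enough to see that the outer loop of the generated code simply scans the dates assigned by the schedule.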

SLIDE 6

One-Dimensional Scheduling

Transformation   Description
reversal         Changes the direction in which a loop traverses its iteration range
skewing          Makes the bounds of a given loop depend on an outer loop counter
interchange      Exchanges two loops in a perfectly nested loop, a.k.a. permutation
peeling          Extracts one iteration of a given loop
shifting         Allows loops to be reordered
fusion           Fuses two loops, a.k.a. jamming
distribution     Splits a single loop nest into many, a.k.a. fission or splitting

SLIDE 9

One-Dimensional Scheduling

    for (i=0; i<n; ++i) {
        s[i] = 0;                        /* S1 */
        for (j=0; j<n; ++j)
            s[i] = s[i]+a[i][j]*x[j];    /* S2 */
    }

◮ A schedule is an affine function of the iteration vector and the parameters

\[
\begin{aligned}
\theta_{S1}(\vec{x}_{S1}) &= t_{1,S1}\, i_{S1} + t_{2,S1}\, n + t_{3,S1} \\
\theta_{S2}(\vec{x}_{S2}) &= t_{1,S2}\, i_{S2} + t_{2,S2}\, j_{S2} + t_{3,S2}\, n + t_{4,S2}
\end{aligned}
\]

◮ For $-1 \le t \le 1$, there are $3^7 = 2187$ possible schedules
◮ But only 129 legal distinct schedules
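The $3^7$ count can be reproduced by brute force. The C sketch below enumerates every coefficient vector in $[-1, 1]^7$ and filters it with a sampled legality test (every dependent pair of instances must be strictly ordered). Several points are assumptions and not taken from the slides: the dependence list is hand-written (the initialization of `s[i]` precedes every update of `s[i]`, and the updates of `s[i]` keep their original order), the test only samples `n` from 1 to `max_n` instead of reasoning over all values of `n`, and all function names are illustrative. Because of these simplifications it is not expected to reproduce the figure of 129, which comes from the exact polyhedral construction.

```c
/* Coefficients of the two schedules, in the order used on the slide:
 *   theta_S1 = t[0]*i + t[1]*n + t[2]
 *   theta_S2 = t[3]*i + t[4]*j + t[5]*n + t[6] */
static int theta_s1(const int t[7], int i, int n) {
    return t[0] * i + t[1] * n + t[2];
}
static int theta_s2(const int t[7], int i, int j, int n) {
    return t[3] * i + t[4] * j + t[5] * n + t[6];
}

/* Sampled legality test (assumption: hand-written dependences, n in 1..max_n).
 * Returns 1 if every dependent pair is strictly ordered, 0 otherwise. */
int is_legal_sampled(const int t[7], int max_n) {
    for (int n = 1; n <= max_n; ++n)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                /* s[i] = 0 must run before every update of s[i]. */
                if (theta_s1(t, i, n) >= theta_s2(t, i, j, n)) return 0;
                /* Updates of s[i] keep their original relative order. */
                for (int j2 = j + 1; j2 < n; ++j2)
                    if (theta_s2(t, i, j, n) >= theta_s2(t, i, j2, n)) return 0;
            }
    return 1;
}

/* Enumerates all vectors in [-1,1]^7. Stores the number of candidates in
 * *total and returns how many pass the sampled legality test. */
int count_schedules(int max_n, int *total) {
    int legal = 0;
    *total = 0;
    for (int code = 0; code < 2187; ++code) {    /* 3^7 candidates */
        int t[7], c = code;
        for (int k = 0; k < 7; ++k) { t[k] = c % 3 - 1; c /= 3; }
        ++*total;
        legal += is_legal_sampled(t, max_n);
    }
    return legal;
}
```

For instance, the schedule $\theta_{S1} = i$, $\theta_{S2} = j + n$ corresponds to `t = {1, 0, 0, 0, 1, 1, 0}` and passes the sampled test, since every S1 date is smaller than `n` while every S2 date is at least `n`.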

SLIDE 15

Scheduling in the Polyhedral Model: Overview

Our Objective

1. Search space construction

◮ Efficiently construct a space of all legal, distinct affine schedules

            matmult    locality   fir        h264       crout
i-Bounds    −1,1       −1,1       0,1        −1,1       −3,3
c-Bounds    −1,1       −1,1       0,3        0,4        −3,3
#Sched.     1.9×10^4   5.9×10^4   1.2×10^7   1.8×10^8   2.6×10^15
#Legal      912        6561       792        360        798

◮ Rely on the polyhedral model and Integer Linear Programming to guarantee completeness and correctness of the space properties
◮ The search space will encompass unique, distinct compositions of reversal, skewing, interchange, fusion, peeling, shifting, distribution

2. Search space exploration

◮ Perform an exhaustive scan to discover the wall-clock optimal schedule, and evidence of the intricacy of the best transformation
◮ Build an efficient heuristic to accelerate the space traversal

SLIDE 17

Search Space Construction: Preliminaries

Polyhedral Representation of Programs

Static Control Parts

◮ Loops have affine control only
◮ Iteration domain: represented as integer polyhedra

    for (i=1; i<=n; ++i)
        for (j=1; j<=n; ++j)
            if (i<=n-j+2)
                s[i] = ...

\[
\mathcal{D}_{S1} :
\begin{bmatrix}
 1 &  0 & 0 & -1 \\
-1 &  0 & 1 &  0 \\
 0 &  1 & 0 & -1 \\
 0 & -1 & 1 &  0 \\
-1 & -1 & 1 &  2
\end{bmatrix}
\cdot
\begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix}
\ge \vec{0}
\]

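The matrix form of the iteration domain can be checked against the loop nest directly. The following C sketch is illustrative only (the names `DOMAIN_S1`, `in_domain` and `domain_matches_loops` are assumptions, not part of the presentation): each row of the matrix is one affine inequality over $(i, j, n, 1)$, and a point belongs to the domain when every row evaluates to a non-negative value.

```c
/* Constraint rows for the slide's example, applied to the vector (i, j, n, 1). */
static const int DOMAIN_S1[5][4] = {
    { 1,  0, 0, -1},   /* i >= 1         */
    {-1,  0, 1,  0},   /* i <= n         */
    { 0,  1, 0, -1},   /* j >= 1         */
    { 0, -1, 1,  0},   /* j <= n         */
    {-1, -1, 1,  2},   /* i <= n - j + 2 */
};

/* Returns 1 if (i, j) satisfies every row of the system for parameter n. */
int in_domain(int i, int j, int n) {
    const int v[4] = {i, j, n, 1};
    for (int r = 0; r < 5; ++r) {
        int dot = 0;
        for (int c = 0; c < 4; ++c) dot += DOMAIN_S1[r][c] * v[c];
        if (dot < 0) return 0;
    }
    return 1;
}

/* Compares the matrix test with the loop bounds and guard on a box that is
 * slightly larger than the domain. Returns 1 if they agree everywhere. */
int domain_matches_loops(int n) {
    for (int i = -1; i <= n + 2; ++i)
        for (int j = -1; j <= n + 2; ++j) {
            int executed = (i >= 1 && i <= n && j >= 1 && j <= n
                            && i <= n - j + 2);
            if (executed != in_domain(i, j, n)) return 0;
        }
    return 1;
}
```

With `n = 3`, the point `(2, 3)` lies on the boundary of the guard and is inside the domain, while `(3, 3)` is rejected by the last row.
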
SLIDE 18

Polyhedral Representation of Programs

◮ Memory accesses: static references, represented as affine functions of $\vec{x}_S$ and $\vec{p}$

    for (i=0; i<n; ++i) {
        s[i] = 0;
        for (j=0; j<n; ++j)
            s[i] = s[i]+a[i][j]*x[j];
    }

\[
f_s(\vec{x}_{S2}) =
\begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix}
\cdot
\begin{pmatrix} \vec{x}_{S2} \\ n \\ 1 \end{pmatrix}
\qquad
f_a(\vec{x}_{S2}) =
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}
\cdot
\begin{pmatrix} \vec{x}_{S2} \\ n \\ 1 \end{pmatrix}
\qquad
f_x(\vec{x}_{S2}) =
\begin{bmatrix} 0 & 1 & 0 & 0 \end{bmatrix}
\cdot
\begin{pmatrix} \vec{x}_{S2} \\ n \\ 1 \end{pmatrix}
\]

SLIDE 19

Polyhedral Representation of Programs

◮ Data dependence between S1 and S2: a subset of the Cartesian product of $\mathcal{D}_{S1}$ and $\mathcal{D}_{S2}$ (exact analysis)

    for (i=1; i<=3; ++i) {
        s[i] = 0;
        for (j=1; j<=3; ++j)
            s[i] = s[i] + 1;
    }

\[
\mathcal{D}_{S1 \delta S2} :
\begin{bmatrix}
 1 & -1 &  0 &  0 \\
 1 &  0 &  0 & -1 \\
-1 &  0 &  0 &  3 \\
 0 &  1 &  0 & -1 \\
 0 & -1 &  0 &  3 \\
 0 &  0 &  1 & -1 \\
 0 &  0 & -1 &  3
\end{bmatrix}
\cdot
\begin{pmatrix} i_{S1} \\ i_{S2} \\ j_{S2} \\ 1 \end{pmatrix}
\quad
\begin{array}{l}
= 0 \ \text{(first row)} \\
\ge \vec{0} \ \text{(remaining rows)}
\end{array}
\]

[Figure: S1 iterations and S2 iterations, plotted against i]

SLIDE 20

Polyhedral Representation of Programs

◮ Reduced dependence graph labeled by dependence polyhedra

SLIDE 21

Search Space Construction: Way to Go

Space Construction

[Diagram: Legal Distinct Schedules, Affine Schedules]

SLIDE 22

Space Construction

[Diagram: Legal Distinct Schedules, Affine Schedules; step: Causality condition]

Property (Causality condition for schedules): Given $R \delta S$, $\theta_R$ and $\theta_S$ are legal iff for each pair of instances in dependence:

\[
\theta_R(\vec{x}_R) < \theta_S(\vec{x}_S)
\]

Equivalently:

\[
\Delta_{R,S} = \theta_S(\vec{x}_S) - \theta_R(\vec{x}_R) - 1 \ge 0
\]

SLIDE 23

Space Construction

[Diagram: Legal Distinct Schedules, Affine Schedules; steps: Causality condition, Farkas Lemma]

Lemma (Affine form of Farkas lemma): Let $\mathcal{D}$ be a nonempty polyhedron defined by $A\vec{x} + \vec{b} \ge \vec{0}$. Then any affine function $f(\vec{x})$ is non-negative everywhere in $\mathcal{D}$ iff it is a positive affine combination:

\[
f(\vec{x}) = \lambda_0 + \vec{\lambda}^T (A\vec{x} + \vec{b}), \quad \text{with } \lambda_0 \ge 0 \text{ and } \vec{\lambda} \ge \vec{0}.
\]

$\lambda_0$ and $\vec{\lambda}^T$ are called the Farkas multipliers.
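A small numeric illustration of the "if" direction of the lemma, written as a C sketch. The polyhedron $\mathcal{D} = \{x \mid x \ge 0,\ 3 - x \ge 0\}$, the multiplier values and the function names are all chosen here for illustration and do not come from the slides. Any choice of non-negative multipliers yields an affine function that is non-negative on every point of $\mathcal{D}$; for example $\lambda_0 = 1$, $\vec{\lambda} = (2, 0)$ gives $f(x) = 2x + 1$.

```c
/* Example polyhedron D = { x | x >= 0, 3 - x >= 0 }, written as
 * A[k]*x + B[k] >= 0 with one variable and two constraint rows. */
static const int A[2] = {1, -1};
static const int B[2] = {0, 3};

/* Evaluates f(x) = lambda0 + sum_k lambda[k] * (A[k]*x + B[k]). */
int farkas_combination(int lambda0, const int lambda[2], int x) {
    int f = lambda0;
    for (int k = 0; k < 2; ++k) f += lambda[k] * (A[k] * x + B[k]);
    return f;
}

/* Returns 1 if f is non-negative on every integer point of D (x = 0..3). */
int nonnegative_on_domain(int lambda0, const int lambda[2]) {
    for (int x = 0; x <= 3; ++x)
        if (farkas_combination(lambda0, lambda, x) < 0) return 0;
    return 1;
}
```

Outside $\mathcal{D}$ nothing is guaranteed: with $\lambda_0 = 0$ and $\vec{\lambda} = (0, 1)$ the function is $f(x) = 3 - x$, which is non-negative on $\mathcal{D}$ but negative at $x = 4$.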

SLIDE 25

Space Construction

[Diagram: Legal Distinct Schedules, Affine Schedules, Valid Farkas Multipliers; steps: Causality condition, Farkas Lemma; mapping: many to one]

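SLIDE 27

Space Construction

[Diagram: Legal Distinct Schedules, Affine Schedules, Valid Farkas Multipliers; steps: Causality condition, Farkas Lemma, Identification]

\[
\theta_S(\vec{x}_S) - \theta_R(\vec{x}_R) - 1
= \lambda_0 + \vec{\lambda}^T \left( D_{R,S} \begin{pmatrix} \vec{x}_R \\ \vec{x}_S \end{pmatrix} + \vec{d}_{R,S} \right) \ge 0
\]

\[
\mathcal{D}_{R \delta S} :
\begin{cases}
i_R : & -t_{1,R} = \lambda_{D_1,1} - \lambda_{D_1,2} + \lambda_{D_1,3} - \lambda_{D_1,4} \\
i_S : & t_{1,S} = -\lambda_{D_1,1} + \lambda_{D_1,2} + \lambda_{D_1,5} - \lambda_{D_1,6} \\
j_S : & t_{2,S} = \lambda_{D_1,7} - \lambda_{D_1,8} \\
n : & t_{3,S} - t_{2,R} = \lambda_{D_1,4} + \lambda_{D_1,6} + \lambda_{D_1,8} \\
1 : & t_{4,S} - t_{3,R} - 1 = \lambda_{D_1,0}
\end{cases}
\]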

SLIDE 28

Space Construction

[Diagram: Legal Distinct Schedules, Affine Schedules, Valid Farkas Multipliers; steps: Causality condition, Farkas Lemma, Identification, Projection]

◮ Solve the constraint system
◮ Use (optimized) Fourier-Motzkin projection algorithm
    ◮ Reduce redundancy
    ◮ Detect implicit equalities
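For readers unfamiliar with the projection step, here is a C sketch of one textbook Fourier-Motzkin elimination step. It is not the optimized variant mentioned on the slide: it performs no redundancy removal and no detection of implicit equalities, and it computes the rational projection only. The `System` type, the fixed two-variable layout and the function name are assumptions made for the example. In the construction described above, the eliminated variables would be the Farkas multipliers, leaving constraints on the schedule coefficients.

```c
#define MAX_ROWS 64
#define NCOLS 3   /* columns: coefficient of x, coefficient of y, constant */

/* A system of inequalities; row r means c[r][0]*x + c[r][1]*y + c[r][2] >= 0. */
typedef struct {
    int rows;
    int c[MAX_ROWS][NCOLS];
} System;

/* Eliminates variable `var` (0 for x, 1 for y) with one Fourier-Motzkin step.
 * Rows that do not involve `var` are copied; every pair made of a row with a
 * positive coefficient and a row with a negative coefficient is combined so
 * that `var` cancels. Extra rows beyond MAX_ROWS are silently dropped. */
System fm_eliminate(const System *in, int var) {
    System out = { 0, {{0}} };

    for (int r = 0; r < in->rows; ++r) {
        if (in->c[r][var] != 0) continue;
        for (int k = 0; k < NCOLS; ++k) out.c[out.rows][k] = in->c[r][k];
        out.rows++;
    }
    for (int p = 0; p < in->rows; ++p) {
        if (in->c[p][var] <= 0) continue;
        for (int q = 0; q < in->rows; ++q) {
            if (in->c[q][var] >= 0 || out.rows >= MAX_ROWS) continue;
            int mp = -in->c[q][var];   /* positive multiplier for row p */
            int mq = in->c[p][var];    /* positive multiplier for row q */
            for (int k = 0; k < NCOLS; ++k)
                out.c[out.rows][k] = mp * in->c[p][k] + mq * in->c[q][k];
            out.rows++;
        }
    }
    return out;
}
```

As a usage example, eliminating `x` from the system `x - y >= 0`, `3 - x >= 0`, `y - 1 >= 0` leaves `y - 1 >= 0` and `3 - y >= 0`, i.e. the range of `y` for which some `x` exists.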

SLIDE 30

Space Construction

[Diagram: Affine Schedules, Legal Distinct Schedules, Valid Farkas Multipliers, Valid Transformation Coefficients; steps: Causality condition, Farkas Lemma, Identification, Projection; mapping: Bijection]

◮ One point in the space ⇔ one set of legal schedules w.r.t. the dependence

SLIDE 31

Overview

Algorithm

◮ Add constraints obtained for each dependence
◮ Bound the space
◮ Search space: set of linear constraints on the schedule coefficients (i.e. Z-polytope)
◮ To each integral point in the space corresponds a distinct program version where the semantics is preserved

Benchmark   i-Bounds   #Sched      #Legal   Time
matmult     −1,1       1.9×10^4    912      0.029
locality    −1,1       5.9×10^4    6561     0.022
fir         0,1        1.2×10^7    792      0.047
h264        −1,1       1.8×10^8    360      0.024
crout       −3,3       2.6×10^15   798      0.046

SLIDE 32

Search Space Exploration: Framework for Iterative Optimization

Workflow

[Workflow diagram. Recoverable labels: Source Code; Static Analysis; SCoP representation (polyhedral representation of SCoP); Space Construction (bounded search space); Space Exploration; Code generation (C compilable code); Kernel Generation; Unit Generation; Compilation; Run; Target Code; iterative compilation and run of base source code with transformed SCoP; feedback from hardware counter(s). Polyhedral computing libraries: PIPLib, PolyLib, CLooG.]

◮ CLooG: http://www.cloog.org
◮ PiPLib: http://www.piplib.org
◮ PolyLib: http://icps.u-strasbg.fr/polylib

SLIDE 33

Search Space Exploration: Exhaustive Scan

Performance Distribution [1/2]

[Plots: cycles versus transformation identifier for matmult and for locality, with the original version marked]

Figure: Performance distribution for matmult and locality

SLIDE 34

Performance Distribution [2/2]

[Plots: cycles versus transformation identifier for crout, with the original version marked: (a) GCC -O3, (b) ICC -fast]

Figure: The effect of the compiler

SLIDE 35

Performance Comparison

Figure: Best Version vs Original

SLIDE 38

Search Space Exploration: Heuristic Scan

Heuristic Scan

Propose a decoupling heuristic:

◮ The general “form” of the schedule is embedded in the iterator coefficients
◮ Decouple the schedule:

\[
\theta_S(\vec{x}_S) = \begin{pmatrix} \vec{\imath} & \vec{p} & c \end{pmatrix}
\begin{pmatrix} \vec{x}_S \\ \vec{n} \\ 1 \end{pmatrix}
\]

◮ Parameters and constant coefficients can be seen as a refinement

Addressing scalability to larger SCoPs:

1. Impose a static or dynamic limit on the number of runs (limit to the $\vec{\imath}$ part)
2. Replace an exhaustive enumeration of the $\vec{\imath}$ combinations by a limited set of random draws in the $\vec{\imath}$ space

SLIDE 39

Results

[Plots: maximum speedup achieved (in %) versus number of runs (2 to 20) for locality, matmult and mvt, comparing the Decoupling and Random heuristics]

Figure: Comparison between random and decoupling heuristics

[Plots: cycles versus transformation identifier for locality, matmult and matvecttransp, with the original version marked]

SLIDE 45

Conclusion

◮ Optimizing and/or enabling transformation framework on top of the compiler
◮ Encouraging speedups, fast heuristic convergence
◮ On small kernels, optimal transformation can be discovered

Ongoing and future work:

◮ Couple with state-of-the-art feedback-directed iterative methods
◮ Part II: multidimensional schedules
◮ Integrate into GCC GRAPHITE branch

SLIDE 46

Questions

SLIDE 47

Questions: A Transformation Example

Intricacy of the Transformed Code

Optimal Transformation for locality, GCC 4 -O3, P4 Xeon

    S1: B[j] = A[j]
    S2: C[j] = A[j + N]

Original code:

    for (i=0;i<=M;i++) {
        for (j=0;j<=M;j++) {
            S1(i,j);
            S2(i,j);
        }
    }

Transformed code:

    for (c1=-N;c1<=min(-2,M-N);c1++)
        for (j=0;j<=M;j++)
            S1(c1+N,j);
    for (c1=-1;c1<=M-N;c1++) {
        for (j=0;j<=M;j++)
            S2(c1+1,j);
        for (j=0;j<=M;j++)
            S1(c1+N,j);
    }
    for (c1=max(M-N+1,-1);c1<=M-1;c1++)
        for (j=0;j<=M;j++)
            S2(c1+1,j);

→ 19.4% speedup, without vectorization
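The transformed loop nest is intricate enough that a quick sanity check helps. The C sketch below replaces the statement bodies with counters and verifies that the transformed code executes every instance S1(i,j) and S2(i,j) of the original nest exactly once. The counter arrays, the bound `MAX_M`, the `min`/`max` macros and the function names are assumptions introduced for the check; the loop structure is copied from the slide. The check assumes $N \ge 1$ and $0 \le M \le$ `MAX_M`.

```c
#include <string.h>

#define MAX_M 16
#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

/* Execution counters per statement instance (i, j). */
static int count_s1[MAX_M + 1][MAX_M + 1];
static int count_s2[MAX_M + 1][MAX_M + 1];

static void S1(int i, int j) { count_s1[i][j]++; }
static void S2(int i, int j) { count_s2[i][j]++; }

/* The transformed loop nest, as shown on the slide. */
static void run_transformed(int M, int N) {
    int c1, j;
    for (c1 = -N; c1 <= min(-2, M - N); c1++)
        for (j = 0; j <= M; j++) S1(c1 + N, j);
    for (c1 = -1; c1 <= M - N; c1++) {
        for (j = 0; j <= M; j++) S2(c1 + 1, j);
        for (j = 0; j <= M; j++) S1(c1 + N, j);
    }
    for (c1 = max(M - N + 1, -1); c1 <= M - 1; c1++)
        for (j = 0; j <= M; j++) S2(c1 + 1, j);
}

/* Returns 1 if the transformed nest runs each instance of the original
 * nest (0 <= i, j <= M) exactly once. Requires 0 <= M <= MAX_M, N >= 1. */
int covers_original_domain(int M, int N) {
    memset(count_s1, 0, sizeof count_s1);
    memset(count_s2, 0, sizeof count_s2);
    run_transformed(M, N);
    for (int i = 0; i <= M; i++)
        for (int j = 0; j <= M; j++)
            if (count_s1[i][j] != 1 || count_s2[i][j] != 1) return 0;
    return 1;
}
```

Reading the loop counters, S1(i,j) runs at `c1 = i - N` and S2(i,j) at `c1 = i - 1`, so the outer loop again scans the logical dates of a one-dimensional schedule, with a prologue and an epilogue where only one of the two statements is active.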