slide-1
SLIDE 1

Iterative Optimization in the Polyhedral Model: Part II, Multidimensional Time

Louis-Noël Pouchet (1), Cédric Bastoul (1), Albert Cohen (1), John Cavazos (2)

(1) ALCHEMY group, INRIA Saclay / University of Paris-Sud 11, France

(2) Dept. of Computer & Information Sciences, University of Delaware, USA

June 9, 2008

ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation

Tucson, Arizona

slide-2
SLIDE 2

Introduction: Situation PLDI’08

Motivation

◮ New architecture → New high-performance libraries needed
◮ New architecture → New optimization flow needed
◮ Architecture complexity/diversity increases faster than optimization progress
◮ Traditional approaches lose performance portability...

We want a portable optimization process!

INRIA Saclay / U. of Delaware 2 / 18

slide-3
SLIDE 3

Introduction: The Problem PLDI’08

The Optimization Problem

Three inputs feed the optimizing compilation process, which produces code for architecture 1, architecture 2, ..., architecture N:

◮ Architectural characteristics (ALU, SIMD, caches, ...) = locality improvement, vectorization, parallelization, etc.
◮ Compiler optimization interaction (GCC has 205 passes...) = parameter tuning, phase ordering, etc.
◮ Domain knowledge (linear algebra, FFT, ...) = pattern recognition, hand-tuned kernel codes, etc.

= Auto-tuning libraries

Our approach: build an expressive set of program versions

In reality, there is a complex interplay between all components

slide-9
SLIDE 9

Generating Program Versions: Overview PLDI’08

Iterative Optimization Flow

[Diagram, first build: Input code → High-level transformations (Optimization 1, Optimization 2, ..., Optimization N) → Compiler → Target code]

[Diagram, final build: Input code → Set of program versions → Compiler → Target code → Run → Space explorer → Final code]

Program version = result of a sequence of loop transformations
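The flow above can be sketched as a loop that measures each candidate version and keeps the fastest. This is a minimal illustration only: `measure_fn`, `measure_from_table`, and `explore` are hypothetical names, and the exhaustive scan stands in for the space explorer, which in the deck is a heuristic or a genetic algorithm rather than a full enumeration.

```c
#include <stddef.h>

/* Hypothetical callback: compiles and runs one program version and
 * returns its running time in seconds. */
typedef double (*measure_fn)(size_t version, void *ctx);

/* Stand-in measurement that reads precomputed timings from a table
 * passed through ctx; a real driver would invoke the compiler and
 * execute the generated code instead. */
double measure_from_table(size_t version, void *ctx)
{
    const double *timings = ctx;
    return timings[version];
}

/* Evaluate every candidate version and return the index of the fastest.
 * The best time is stored in *best_time. Requires count > 0. */
size_t explore(size_t count, measure_fn measure, void *ctx, double *best_time)
{
    size_t best = 0;
    double best_t = measure(0, ctx);
    for (size_t v = 1; v < count; ++v) {
        double t = measure(v, ctx);
        if (t < best_t) {
            best_t = t;
            best = v;
        }
    }
    *best_time = best_t;
    return best;
}
```

Passing the measurement as a callback keeps the selection logic independent of how a version is built and timed on a given target.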

slide-12
SLIDE 12

Generating Program Versions: Properties PLDI’08

Set of Program Versions

What matters is the result of the application of optimizations, not the optimization sequence

All-in-one approach:

◮ Legality: semantics is always preserved
◮ Uniqueness: all versions of the set are distinct
◮ Expressiveness: a version is the result of an arbitrarily complex sequence of loop transformations
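To make the legality property concrete, here is a small sketch under simplifying assumptions that are not from the slides: two statements S1 and S2, each with a one-dimensional affine schedule theta(i) = a*i + c, and a single dependence from S1(i) to S2(i). A schedule pair is legal when every S2(i) is given a strictly later timestamp than the S1(i) it depends on. The check below enumerates a fixed range of i; a polyhedral framework would express the same condition as affine constraints instead. The names `schedule1d`, `timestamp`, and `is_legal` are hypothetical.

```c
#include <stdbool.h>

/* One-dimensional affine schedule: theta(i) = a * i + c. */
typedef struct { int a, c; } schedule1d;

static int timestamp(schedule1d s, int i) { return s.a * i + s.c; }

/* True if S2(i) is scheduled strictly after S1(i) for every 0 <= i < n,
 * i.e. the assumed dependence S1(i) -> S2(i) is respected. */
bool is_legal(schedule1d s1, schedule1d s2, int n)
{
    for (int i = 0; i < n; ++i)
        if (timestamp(s2, i) <= timestamp(s1, i))
            return false;
    return true;
}
```

With S1 scheduled at i, shifting S2 to i + 1 is accepted, while reversing S2 to -i is rejected because S2(0) would not run after S1(0).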

slide-13
SLIDE 13

Generating Program Versions: The Representation PLDI’08

The Polyhedral Model in a Nutshell

◮ Arbitrarily complex sequences of loop transformations are modeled in a single optimization step: a new scheduling matrix Θ
◮ Granularity: each executed instance of each statement

◮ First row of Θ → all outer-most loops

    for (i = ...; i < ...; ++i)
      S1(i);
    for (i = ...; i < ...; ++i)
      S2(i);

◮ Second row of Θ → all next outer-most loops

    for (i = ...; i < ...; ++i)
      for (j = ...; j < ...; ++j)
        S1(i,j);
    for (i = ...; i < ...; ++i)
      for (j = ...; j < ...; ++j)
        S2(i,j);

◮ Minor change → significant impact

    for (j = ...; j < ...; ++j)
      S2(...,j);
    for (i = ...; i < ...; ++i)
      for (j = ...; j < ...; ++j) {
        S1(i,j);
        S2(i,j);
      }

The coefficients of Θ fall into three groups: iterator coefficients (ı), parameter coefficients (p), and constant coefficients (c). Each group expresses a different family of transformations:

Part  Transformation  Description
ı     reversal        Changes the direction in which a loop traverses its iteration range
ı     skewing         Makes the bounds of a given loop depend on an outer loop counter
ı     interchange     Exchanges two loops in a perfectly nested loop, a.k.a. permutation
p     fusion          Fuses two loops, a.k.a. jamming
p     distribution    Splits a single loop nest into many, a.k.a. fission or splitting
c     peeling         Extracts one iteration of a given loop
c     shifting        Allows loops to be reordered
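As an illustration of how a scheduling matrix assigns an execution date to a statement instance, the sketch below multiplies a small matrix by the vector (i, j, n, 1) and compares the resulting timestamps lexicographically. The matrix shape (two rows, two iterators, one parameter n, one constant column) and the three example matrices are assumptions chosen for the example, not matrices taken from the slides.

```c
#include <stddef.h>

#define ROWS 2   /* time dimensions of the schedule */
#define COLS 4   /* columns: i, j (iterators), n (parameter), 1 (constant) */

/* Timestamp of instance (i, j): theta multiplied by the vector (i, j, n, 1). */
void apply_schedule(const int theta[ROWS][COLS], int i, int j, int n,
                    int time[ROWS])
{
    const int x[COLS] = { i, j, n, 1 };
    for (size_t r = 0; r < ROWS; ++r) {
        time[r] = 0;
        for (size_t c = 0; c < COLS; ++c)
            time[r] += theta[r][c] * x[c];
    }
}

/* Lexicographic order on timestamps: negative if a runs before b,
 * zero if they share the same date, positive otherwise. */
int compare_time(const int a[ROWS], const int b[ROWS])
{
    for (size_t r = 0; r < ROWS; ++r)
        if (a[r] != b[r])
            return a[r] < b[r] ? -1 : 1;
    return 0;
}

/* Illustrative schedules (assumed for this example). */
const int IDENTITY[ROWS][COLS]    = { {1, 0, 0, 0}, {0, 1, 0, 0} }; /* (i, j)         */
const int INTERCHANGE[ROWS][COLS] = { {0, 1, 0, 0}, {1, 0, 0, 0} }; /* (j, i)         */
const int SKEW_SHIFT[ROWS][COLS]  = { {1, 0, 0, 0}, {1, 1, 0, 2} }; /* (i, i + j + 2) */
```

Changing only iterator coefficients turns the identity into an interchange, and touching the constant column adds a shift, which is the sense in which one matrix stands for a whole sequence of transformations.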

slide-17
SLIDE 17

Generating Program Versions: Contributions PLDI’08

Previous Contributions

Previous work (CGO’07, Part I, One-Dimensional Time):

◮ Focus on Static Control Parts (SCoP)

◮ SCoP: Consecutive set of statements with affine control flow

◮ Complete framework for one-dimensional schedules
◮ Efficient search space construction, efficient traversal
◮ Drawbacks in applicability
◮ Drawbacks in expressiveness

We previously solved a simpler problem...

slide-18
SLIDE 18

Generating Program Versions: Contributions PLDI’08

New Contributions

Dealing with multidimensional schedules:

◮ Applicability on any Static Control Part
◮ Increased expressiveness
◮ Design of scalable traversal methods
  ◮ Dedicated genetic algorithm
  ◮ Dedicated heuristic

slide-19
SLIDE 19

Generating Program Versions: Looking Into Details PLDI’08

Deeper In The Method

Multidimensional schedules: high expressiveness, complex problem

Space construction

  • combinatorial expression of legality
  • heuristic needed: greedy selection of dependences + ordering (see Some Efficient Solutions to the Affine Scheduling Problem, Part II: Multidimensional Time, Feautrier, 1992)
  • code generation friendly bounds on the schedule coefficients

Space traversal

  • multiple polytopes to traverse
  • large and expressive spaces (up to 10^50)
  • partial enumeration (mandatory): completion mechanism + subspace partitioning
  • shape the space: optimized polytope projection (required) + constrained dynamic scan

[Diagram labels: Distinct schedules, Set of program versions, Tested versions]

slide-20
SLIDE 20

Traversing the Search Space: Extensive Analysis PLDI’08

Observations on the Performance Distribution

[Figure: Performance distribution - 8x8 DCT. x-axis: point index for the first schedule row; y-axis: performance improvement; series: Best, Average, Worst]

    for (i = 0; i < M; i++)
      for (j = 0; j < M; j++) {
        tmp[i][j] = 0.0;
        for (k = 0; k < M; k++)
          tmp[i][j] += block[i][k] * cos1[j][k];
      }
    for (i = 0; i < M; i++)
      for (j = 0; j < M; j++) {
        sum2 = 0.0;
        for (k = 0; k < M; k++)
          sum2 += cos1[i][k] * tmp[k][j];
        block[i][j] = ROUND(sum2);
      }

◮ Extensive study of 8x8 Discrete Cosine Transform (UTDSP)
◮ Search space analyzed: 66 × 19683 ≈ 1.3 × 10^6 different legal program versions
◮ Take one specific value for the first row
◮ Try the 19683 possible values for the second row
◮ Very low proportion of best points: < 0.02%

[Figure: Performance distribution (sorted) - 8x8 DCT. x-axis: point index of the second schedule dimension, first one fixed; y-axis: performance improvement]

◮ Performance variation is large for good values of the first row
◮ It is usually reduced for bad values of the first row
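The Best / Average / Worst curves summarize, for each fixed first row, the speedups measured over all second-row values. A minimal sketch of that summary follows. The names `row_stats`, `summarize`, and `fraction_near_best` are hypothetical, and the `tolerance` used to decide what counts as a "best point" is an assumption, since the slides do not give the threshold behind the "< 0.02%" figure.

```c
#include <stddef.h>

typedef struct { double best, average, worst; } row_stats;

/* Summarize the speedups measured for one fixed first schedule row.
 * Requires n > 0. */
row_stats summarize(const double *speedup, size_t n)
{
    row_stats s = { speedup[0], 0.0, speedup[0] };
    double sum = 0.0;
    for (size_t k = 0; k < n; ++k) {
        if (speedup[k] > s.best)  s.best = speedup[k];
        if (speedup[k] < s.worst) s.worst = speedup[k];
        sum += speedup[k];
    }
    s.average = sum / (double)n;
    return s;
}

/* Fraction of points whose speedup lies within `tolerance` of the best
 * one (the tolerance is an assumed definition of "best point"). */
double fraction_near_best(const double *speedup, size_t n, double tolerance)
{
    row_stats s = summarize(speedup, n);
    size_t hits = 0;
    for (size_t k = 0; k < n; ++k)
        if (speedup[k] >= s.best - tolerance)
            ++hits;
    return (double)hits / (double)n;
}
```

A large gap between `best` and `worst` for a given first row corresponds to the "large performance variation" region of the plot.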

slide-26
SLIDE 26

Traversing the Search Space: Extensive Analysis PLDI’08

Scanning The Space of Program Versions

The search space:

◮ Performance variation suggests partitioning the space
◮ Non-uniform distribution of performance
◮ No clear analytical property of the optimization function

→ Build dedicated heuristic and genetic operators aware of these static and dynamic characteristics

slide-27
SLIDE 27

Traversing the Search Space: Heuristic PLDI’08

Dedicated Heuristic

◮ Multidimensional version of the heuristic presented in Part I
◮ Discovers 80%+ of the performance improvement in less than 50 runs for small kernels
◮ Feedback directed, yet deterministic
◮ Leverages our knowledge about performance distribution
◮ Relies on the completion algorithm to instantiate the full version
◮ But unsatisfactory results for larger programs...

slide-28
SLIDE 28

Traversing the Search Space: Genetic Operators PLDI’08

Dedicated GA Operators

Mutation

◮ Performance distribution is not uniform
◮ Tailored to focus on the most promising subspaces
◮ Preserves legality (closed under affine constraints)

Cross-over

◮ Row cross-over
◮ Column cross-over
◮ Both preserve legality
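The two cross-over operators can be pictured on small schedule matrices as follows. This is only a sketch of a one-point exchange: the matrix shape, the `cut` parameter, and the function names are assumptions, and the mechanism that makes the operators legality-preserving in the presented work is not reproduced here, so a child produced by this code would still need a legality check.

```c
#include <stddef.h>

#define ROWS 2
#define COLS 4

/* Row cross-over: the child takes rows [0, cut) from parent a
 * and rows [cut, ROWS) from parent b. */
void row_crossover(const int a[ROWS][COLS], const int b[ROWS][COLS],
                   size_t cut, int child[ROWS][COLS])
{
    for (size_t r = 0; r < ROWS; ++r)
        for (size_t c = 0; c < COLS; ++c)
            child[r][c] = (r < cut) ? a[r][c] : b[r][c];
}

/* Column cross-over: the child takes columns [0, cut) from parent a
 * and columns [cut, COLS) from parent b. */
void column_crossover(const int a[ROWS][COLS], const int b[ROWS][COLS],
                      size_t cut, int child[ROWS][COLS])
{
    for (size_t r = 0; r < ROWS; ++r)
        for (size_t c = 0; c < COLS; ++c)
            child[r][c] = (c < cut) ? a[r][c] : b[r][c];
}
```

Row cross-over recombines whole time dimensions, while column cross-over recombines the coefficients attached to the same iterator, parameter, or constant across all dimensions.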

slide-29
SLIDE 29

Traversing the Search Space: Genetic Operators PLDI’08

Dedicated GA Results

[Figure: GA versus Random - 8x8 DCT. x-axis: number of runs (50 to 500); y-axis: performance improvement; series: Random, GA]

[Figure: Performance distribution (sorted) - 8x8 DCT. x-axis: point index of the second schedule dimension, first one fixed; y-axis: performance improvement]

◮ GA converges towards the maximal space speedup

slide-30
SLIDE 30

Traversing the Search Space: Experimental Results PLDI’08

Experimental Results [1/3]

[Figure: Performance improvement for AMD Athlon64. Series: Heuristic, GA. Benchmarks: dct, edge, iir, fir, lmsfir, matmult, latnrm, lpc, ludcmp, radar, average. y-axis: performance improvement, 1 to 1.7]

baseline: gcc -O3 -ftree-vectorize -msse2

slide-31
SLIDE 31

Traversing the Search Space: Experimental Results PLDI’08

Experimental Results [2/3]

[Figure: Performance improvement for ST231. Series: Heuristic, GA. Benchmarks: dct, edge, iir, fir, lmsfir, matmult, latnrm, lpc, ludcmp, radar, average. y-axis: performance improvement, 1 to 1.35]

baseline: st200cc -O3 -OPT:alias=restrict -mauto-prefetch

slide-32
SLIDE 32

Traversing the Search Space: Experimental Results PLDI’08

Experimental Results [3/3]

Looking into details (hardware counters+compilation trace):

◮ Better activity of the processing units
◮ Best version may vary significantly for different architectures
◮ Different source code may trigger different compiler optimizations

→ Our method is a portable optimization process

slide-33
SLIDE 33

Conclusion: PLDI’08

Conclusion

◮ Scalable algorithms (GA and heuristic) to traverse the space, with dedicated pruning and search strategies
◮ Part I + Part II: applicability observed on various compilers (GCC, ICC, Open64) and architectures (x86_32, x86_64, MIPS32, ST231 VLIW)
◮ Tunable framework: open to other search space construction strategies
◮ Take-home message:
  ◮ All-in-one: legality, uniqueness, expressiveness
  ◮ Generic and portable approach for high-level transformation selection

slide-34
SLIDE 34

Conclusion: PLDI’08

Tuning: Distribute and Tile

◮ Focus on fuse/distribute legality affine constraints (presented algorithm with additional constraints)
◮ Use PLuTo as a tiling / parallel backend
◮ Driven by program versions
◮ Excellent performance gains (research report coming soon...)