SLIDE 1

Polyhedral Loop Optimization (Part I)

Armin Größlinger

SPPEXA Doctoral Retreat 2015

September 14, 2015

SLIDE 2

Overview

  • Monday: Basics of the polyhedral model

  – Modeling
  – Transformation
  – Code generation

  • Wednesday: Practice

  – Available tools
  – Use LLVM+Polly to analyze and optimize codes

SLIDE 3

Polyhedral Compilation

SLIDE 4

Phases of the Model

1) Modeling

  • Describe loop iterations (= iteration domain)
  • Compute dependences between iterations

2) Transformation(s)

  • Reorder iterations to exhibit desired properties, e.g., parallelism, increased data locality, etc.

  • Must respect dependences

3) Code Generation

  • Turn transformed model into efficient code
SLIDE 5

Model

  • For each statement:

  – iteration domain: (unordered) set of loop iterations
  – schedule: relation between iteration domain and (virtual) multi-dimensional execution time
  – all schedules have the same dimensionality

  • For each array access:

– access relation: maps iteration to memory location

  • For each pair of statements:

– dependences: relation between iteration domains

  • All sets and relations are defined by Presburger formulas
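
For example, the model of a single-assignment loop can be written down directly in isl notation; the following is an illustrative sketch using islpy, the Python bindings of isl (the loop `for (i = 0; i < n; ++i) S: A[i] = A[i] + 1;` and all set/map strings are made-up examples, assuming islpy is installed):

import islpy as isl

# iteration domain: the (unordered) set of loop iterations of S
domain = isl.Set("[n] -> { S[i] : 0 <= i < n }")

# schedule: maps each iteration to a (virtual) execution time
schedule = isl.Map("[n] -> { S[i] -> [i] }")

# access relations: map each iteration to the memory cells it writes/reads
write = isl.Map("[n] -> { S[i] -> A[i] }")
read  = isl.Map("[n] -> { S[i] -> A[i] }")

print(domain)
print(schedule.intersect_domain(domain))
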
SLIDE 6

Presburger Formulas I

  • Modeling must be based on a decidable theory
  • Loop iterations and memory locations are discrete units

→ must use integers (not rationals or reals)

  • The theory of inequalities (and congruences) over the integers with addition is decidable.
  • The theory of inequalities over the integers with addition and multiplication is not decidable.
  • Consequence: use affine expressions (affine = linear + additive constant) and congruences, e.g., constraints such as 3i + 2j − n + 1 ≥ 0 or i ≡ 0 (mod 2).

SLIDE 7

Presburger Formulas II

  • Presburger formulas allow existential and universal quantification.
  • Formulas with quantifiers are equivalent to formulas without quantifiers, e.g., ∃k : i = 2k is equivalent to i ≡ 0 (mod 2).
  • Geometrically, Presburger formulas describe the integer points in polyhedra (possibly with “holes” when quantifiers or congruences are used).
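
This can be observed directly with isl: building a set with an extra variable and projecting it out performs the quantifier elimination. A small illustrative sketch using the islpy Python bindings (assuming islpy is installed):

import islpy as isl

# { [i] : exists k : i = 2k and 0 <= i <= 10 }, with k made explicit as a set dimension
s = isl.Set("{ [i, k] : i = 2*k and 0 <= i <= 10 }")

# projecting out k eliminates the existential quantifier
even = s.project_out(isl.dim_type.set, 1, 1)
print(even)   # an equivalent quantifier-free description (isl typically prints a mod/congruence constraint)
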
SLIDE 8

Modeling Example I

for (i = 1; i <= n; ++i)
  for (j = 1; j <= i; ++j)
S:    A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1]);

  • Iteration domain: D_S = { (i, j) : 1 ≤ j ≤ i ≤ n }
  • Dependences: (i−1, j) → (i, j) and (i, j−1) → (i, j) (within D_S)
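
The same domain and dependences can be computed mechanically; a sketch using the islpy Python bindings of isl (the variable names and the lexicographic-order map below are illustrative):

import islpy as isl

dom    = isl.Set("[n] -> { S[i,j] : 1 <= j <= i <= n }")
writes = isl.Map("[n] -> { S[i,j] -> A[i,j] }")
reads  = isl.Map("[n] -> { S[i,j] -> A[i-1,j]; S[i,j] -> A[i,j-1] }")

# pairs of iterations touching the same array cell (write first, read second)
dep = writes.apply_range(reads.reverse())
dep = dep.intersect_domain(dom).intersect_range(dom)

# keep only pairs where the source runs before the target (original loop order)
before = isl.Map("[n] -> { S[i,j] -> S[a,b] : a > i or (a = i and b > j) }")
print(dep.intersect(before))   # the flow dependences (i-1,j) -> (i,j) and (i,j-1) -> (i,j)
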
SLIDE 9

Modeling Example II

for (i = 2; i <= n; i += 2)
  for (j = 1 + (i/4)*4; j <= i; ++j)
S:    A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1]);

Strides and integer division are still expressible: the stride becomes the congruence i ≡ 0 (mod 2), and the lower bound 1 + (i/4)*4 = 1 + 4·⌊i/4⌋ can be written with an existentially quantified variable (or a floor term).

SLIDE 10

Modeling Example III

for (i = n; i >= 1; --i)
  for (j = 1; j <= i; ++j)
S:    A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1]);

Historical solution: loops must be normalized, i.e., loops run forward and have unit stride:

for (i = -n; i <= -1; ++i)
  for (j = 1; j <= -i; ++j)
S:    A[-i][j] = 0.5 * (A[-i-1][j] + A[-i][j-1]);

SLIDE 11

Modeling with Domain and Schedule

  • Use unordered iteration domain + schedule
  • Execution order is defined by the lexicographic order on the schedule.
  • When no explicit schedule is given, an identity schedule is assumed.

for (i = n; i >= 1; --i)
  for (j = 1; j <= i; ++j)
S:    A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1]);

For this loop, the domain D_S = { (i, j) : 1 ≤ j ≤ i ≤ n } together with the schedule θ_S(i, j) = (−i, j) reproduces the original (reversed) execution order.

SLIDE 12

Modeling Example IV

  • Do not confuse loop and array dimensions

for (i = 0; i <= n; ++i)
  for (j = 0; j <= n; ++j)
    A[i+j] = …

  – Two loops but only one array dimension!
  – Do not identify loop dimensions with array dimensions

  • Exercise:

  – Draw the iteration space and the dependences between iterations

SLIDE 13

Model Extraction from Source Code

  • Iteration domains extracted from loop bounds (directly)
  • Access relations extracted from array accesses (directly)
  • The schedule (for the sequential program) can be constructed systematically: for n nested loops, construct a schedule with 2n+1 dimensions; every second dimension ensures the textual order of the statements.

A;
for (i = …) {
  B;
  for (j = …) {
    C;
    D;
  }
  E;
}
F;
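
For this fragment the maximal nesting depth is 2, so the schedules get 2·2 + 1 = 5 dimensions (a common convention pads unused dimensions with 0); the construction gives, for example:

θ_A       = (0, 0, 0, 0, 0)
θ_B(i)    = (1, i, 0, 0, 0)
θ_C(i,j)  = (1, i, 1, j, 0)
θ_D(i,j)  = (1, i, 1, j, 1)
θ_E(i)    = (1, i, 2, 0, 0)
θ_F       = (2, 0, 0, 0, 0)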

SLIDE 14

Computing Dependences

  • Computing the dependences is the main work
  • For a write S(i) to location f(i) and a read T(j) from location g(j), there exists a dependence S(i) → T(j) when
  – existence: i ∈ D_S and j ∈ D_T
  – conflict: f(i) = g(j)
  – order: S(i) is executed before T(j)
  all hold (in the integers!)

  • Optimization: remove transitive dependences from the solution, e.g., keep only the last write before a read access → compute the lexicographic maximum of i as a function of j.

SLIDE 15

Dependence Computation Example

for (i = 1; i <= n; ++i)
  for (j = 0; j <= 1; ++j)
S:    A[2*i+j] = …
for (k = 1; k <= 2*n+1; ++k)
T:  … = A[k]

  • Existence: 1 ≤ i ≤ n, 0 ≤ j ≤ 1, 1 ≤ k ≤ 2n+1
  • Conflict: 2i + j = k
  • (Order: S textually before T)
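
This dependence system can also be handed to isl; an illustrative sketch with the islpy Python bindings:

import islpy as isl

dom_S = isl.Set("[n] -> { S[i,j] : 1 <= i <= n and 0 <= j <= 1 }")
dom_T = isl.Set("[n] -> { T[k] : 1 <= k <= 2*n + 1 }")
write_S = isl.Map("[n] -> { S[i,j] -> A[2*i + j] }")
read_T  = isl.Map("[n] -> { T[k] -> A[k] }")

# existence + conflict: S- and T-instances touching the same array element
dep = write_S.apply_range(read_T.reverse())
dep = dep.intersect_domain(dom_S).intersect_range(dom_T)
# order is trivially satisfied: S is textually before T

print(dep)                           # { S[i,j] -> T[2i+j] : ... }
print(dom_T.subtract(dep.range()))   # reads without a source: A[1] is never written (for n >= 1)
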
SLIDE 16

Tools and Compilers

  • Tools to solve dependence systems (among others):

  – PIP (“parametric integer programming”, P. Feautrier)
  – isl (“integer set library”, S. Verdoolaege)

  • Today part of production compilers

  – IBM XL
  – GCC (“Graphite”)
  – LLVM (“Polly”)

SLIDE 17

Practical Challenges in Modeling

  • Additional complications in real codes:

  – Are “A” and “B” the same array or different arrays?
  – Aliasing of inner dimensions in arrays of arrays (or arrays represented by pointers to pointers), e.g., float **A = …; A[3] = A[4]; → not enough dependences calculated

SLIDE 18

Transformations

  • Compute a new schedule for each statement
  • Desired properties

  – maximal parallelism
  – minimal latency
  – locality improvement

SLIDE 19

Transformation Example

for (k = 1; k <= m; k++) {
  for (i = 1; i < n; i++)
    A[i] = (A[i-1] + A[i+1]) * 0.5f;
}

Dependences:
  (k, i) → (k, i+1)    → flow dependence
  (k, i) → (k+1, i−1)  → flow dependence
  (k, i) → (k+1, i)    → output dependence

Which iterations can be executed in parallel?

SLIDE 20

Parallel Schedule

[Figure: iteration space in the (i, k) plane with wavefronts of independent iterations]

  • Parallel schedule: θ(k, i) = (i + 2k, k); on the next slide it is shifted to t = i + 2k − 3, p = k − 1 so that both components start at 0.
  • The second component (k) could be anything else as long as it is linearly independent from the first component.

SLIDE 21

Transformed Space

Transformation: t = i + 2k - 3, p = k - 1   (inverse: i = t - 2p + 1)

for (int t = 0; t <= n + 2*m - 4; t++) {
  int lb = max(0, (t - n + 3) / 2);
  int ub = min(m - 1, t / 2);
  parfor (int p = lb; p <= ub; p++) {
    int i = t - 2*p + 1;
    A[i] = (A[i-1] + A[i+1]) * 0.5f;
  }
}
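
Legality of this schedule can be checked mechanically: express the dependences in the new (t, p) coordinates and look at the distance vectors. A sketch with the islpy Python bindings (the dependence relation is the one from the transformation-example slide; the code is illustrative):

import islpy as isl

dom = isl.Set("[n,m] -> { S[k,i] : 1 <= k <= m and 1 <= i < n }")
dep = isl.Map("[n,m] -> { S[k,i] -> S[k,i+1]; S[k,i] -> S[k+1,i-1]; S[k,i] -> S[k+1,i] }")
dep = dep.intersect_domain(dom).intersect_range(dom)

# schedule t = i + 2k - 3, p = k - 1
sched = isl.Map("[n,m] -> { S[k,i] -> [i + 2*k - 3, k - 1] }")

# dependence distances in (t, p) coordinates
dep_tp = dep.apply_domain(sched).apply_range(sched)
print(dep_tp.deltas())   # every distance has t-component >= 1, so the p loop carries no dependence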

SLIDE 22

Schedules with Rational Coefficients

  • Schedules with fractional coefficients make sense!
  • Dependence:
  • Schedule:
  • Interpretation of the schedule:
  • Rational schedules are implicitly “floored”.
SLIDE 23

Computing a new Schedule

  • Correctness criterion for a schedule: every dependence S(i) → T(j) implies θ_S(i) ≺ θ_T(j) (lexicographically smaller)
  • Challenges:

  – How to solve the “implies” part?
  – How to treat lexicographic precedence?

SLIDE 24

Feautrier's Scheduling Algorithm I

  • For each statement S, try to find an affine schedule θ_S(i) = τ_S · i + c_S per schedule dimension d, i.e., determine the coefficients τ_S and c_S (for each dimension) such that θ_T(j) − θ_S(i) ≥ 1 holds for every dependence S(i) → T(j) handled in dimension d.

SLIDE 25

Feautrier's Scheduling Algorithm II

  • Main idea of Feautrier Scheduler:

A linear function is guaranteed to be non-negative over a polyhedron if it is a positive combination of the polyhedron's bounds. (Linear form of Farkas' Lemma)

  • Try to solve (starting with the outermost schedule dimension)
    θ_T(j) − θ_S(i) − δ = λ_0 + Σ_k λ_k · c_k(i, j),   λ_0, λ_k ≥ 0, δ ∈ {0, 1},
    where the c_k ≥ 0 are the expressions defining the iteration domains of the involved statements.

  • Technique: equate coefficients of the right and left side (the resulting system is affine!)

SLIDE 26

Feautrier Scheduler III

  • In general, not all dependences can be satisfied in the first schedule dimension → try to satisfy as many dependences as possible, satisfy remaining dependences in the second dimension if possible, and so on.

  • Every dependence is “carried” by a certain schedule dimension c. The schedule must ensure θ_T^d(j) − θ_S^d(i) ≥ 0 for all d < c and θ_T^c(j) − θ_S^c(i) ≥ 1.

SLIDE 27

Optimization Criteria

  • Order in which dependences are satisfied not determined
  • Different solutions possible even for a fixed order
  • Select an additional target, e.g.,

  – minimize latency (shortest theoretical execution)
  – do not use more processors than necessary
  – improve locality (put source and target of dependences with high data volume on the same processor)

  • Problem: relation between theoretical optimum (e.g. minimal latency) and performance on real hardware unknown → room for exploration/autotuning/machine learning

SLIDE 28

Performance of Different Solutions

3D Jacobi Stencil, Experiment by Stefan Kronawitter

SLIDE 29

Another Transformation: Tiling

  • To improve data locality, tiling is often necessary to speed up a code.

// original (assuming B divides n)
for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j)
    ...

// strip-mined: each loop split into a tile loop and an intra-tile loop
for (iT = 0; iT < n/B; ++iT)
  for (i = B*iT; i < B*iT + B; ++i)
    for (jT = 0; jT < n/B; ++jT)
      for (j = B*jT; j < B*jT + B; ++j)
        ...

// tiled: tile loops moved outside
for (iT = 0; iT < n/B; ++iT)
  for (jT = 0; jT < n/B; ++jT)
    for (i = B*iT; i < B*iT + B; ++i)
      for (j = B*jT; j < B*jT + B; ++j)
        ...
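
With a fixed tile size, the tiling above is just another relation in the model; an illustrative sketch with the islpy Python bindings (tile size 32 chosen arbitrarily):

import islpy as isl

dom = isl.Set("[n] -> { S[i,j] : 0 <= i < n and 0 <= j < n }")

# prepend the tile coordinates (floor(i/32), floor(j/32)) to each iteration
tile = isl.Map("[n] -> { S[i,j] -> [floord(i,32), floord(j,32), i, j] }")
print(tile.intersect_domain(dom))
# with a parametric tile size B, floord(i,B) is no longer (quasi-)affine -- see the next slide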

SLIDE 30

Limitations with Tiling

  • Tiling transformation is only affine for a given tile size
  • Parametric tiling requires extensions of the polyhedral model which are not part of widespread tools (yet?)

  • Tiling is not legal in all cases; sufficient criterion: no dependence points backwards in any dimension, i.e., θ_T^d(j) − θ_S^d(i) ≥ 0 for all d.

SLIDE 31

Other Transformations

  • Index Set Splitting: split iteration domains to allow for better schedules
  • Memory Layout Transformations: more cache-friendly data layout
  • Save Memory: compute minimum required working set size
  • Communication Introduction: enumerate memory elements that must be transferred between tiles

  • ...
SLIDE 32

Code Generation

  • After transformation we must generate executable code for the transformed model

  • Generated code must be efficient
  • Generation of good code is non-trivial
SLIDE 33

The Essence of Code Generation

[Figure: two overlapping rectangular iteration domains in the (x, y) plane: T1 with a ≤ x ≤ c, e ≤ y ≤ g and T2 with b ≤ x ≤ d, f ≤ y ≤ h]

Enumerate iterations in lexicographic order. For efficiency: no case distinctions inside the loops!

Generated code (no case distinctions inside the loops):

for (x = a; x <= b-1; x++)
  for (y = e; y <= g; y++)
    T1;
for (x = b; x <= c; x++) {
  for (y = e; y <= f-1; y++)
    T1;
  for (y = f; y <= g; y++) {
    T1; T2;
  }
  for (y = g+1; y <= h; y++)
    T2;
}
for (x = c+1; x <= d; x++)
  for (y = f; y <= h; y++)
    T2;

Compare: a single loop nest with conditions inside (inefficient):

for (x = a; x <= d; x++) {
  for (y = e; y <= h; y++) {
    if (a <= x && x <= c && e <= y && y <= g) T1;
    if (b <= x && x <= d && f <= y && y <= h) T2;
  }
}

SLIDE 34

Principles of Code Generation

  • Project all domains onto the outermost dimension
  • Compute a disjoint union of the projections, i.e., polyhedra with the property that for each polyhedron P and each statement S either P ⊆ proj(D_S) or P ∩ proj(D_S) = ∅ holds. These polyhedra can be ordered lexicographically and enumerated by a sequence of loops.

  • For each loop, repeat the procedure with the current dimension turned into a parameter.
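
The partitioning step can be tried on the outermost dimension of the two-rectangle example; a sketch with the islpy Python bindings (the concrete bounds 0..9 and 5..14 are made-up stand-ins for a..d):

import islpy as isl

p1 = isl.Set("{ [x] : 0 <= x <= 9 }")    # projection of the T1 domain onto x
p2 = isl.Set("{ [x] : 5 <= x <= 14 }")   # projection of the T2 domain onto x

# disjoint pieces: each piece lies completely inside or outside each projection
pieces = [p1.subtract(p2), p1.intersect(p2), p2.subtract(p1)]
for p in pieces:
    print(p)   # [0,4] only T1, [5,9] both, [10,14] only T2
# each piece becomes one outer loop; the procedure is repeated for y within each piece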

SLIDE 35

Code Generation is Tricky

  • Code generation has many tunables
  • Main trade-off: code size vs. optimization of control flow
  • One can easily generate thousands of lines of code, even millions, from relatively small inputs

  • Interplay between the polyhedral code generator and the backend compiler can be tricky, e.g., the compiler may reorder operations and thereby reduce performance
SLIDE 36

Conclusions

  • “Homework” for Wednesday:

Find a code in your area of research which may profit from polyhedral transformations.