
Polyhedral Loop Optimization (Part I)
Armin Größlinger, SPPEXA Doctoral Retreat 2015, September 14, 2015


  1. Polyhedral Loop Optimization (Part I) Armin Größlinger SPPEXA Doctoral Retreat 2015 September 14, 2015

  2. Overview
     ● Monday: Basics of the polyhedral model
       – Modeling
       – Transformation
       – Code generation
     ● Wednesday: Practice
       – Available tools
       – Use LLVM+Polly to analyze and optimize codes

  3. Polyhedral Compilation

  4. Phases of the Model
     1) Modeling
        • Describe loop iterations (= iteration domain)
        • Compute dependences between iterations
     2) Transformation(s)
        • Reorder iterations to exhibit desired properties, e.g., parallelism, increased data locality, etc.
        • Must respect dependences
     3) Code Generation
        • Turn transformed model into efficient code

  5. Model
     ● For each statement:
       – iteration domain: (unordered) set of loop iterations
       – schedule: relation between iteration domain and (virtual) multi-dimensional execution time
       – all schedules have the same dimensionality
     ● For each array access:
       – access relation: maps iteration to memory location
     ● For each pair of statements:
       – dependences: relation between iteration domains
     ● All sets and relations are defined by Presburger formulas

  6. Presburger Formulas I
     ● Modeling must be based on a decidable theory
     ● Loop iterations and memory locations are discrete units → must use integers (not rationals or reals)
     ● The theory of inequalities (and congruences) over the integers with addition is decidable.
     ● The theory of inequalities over the integers with addition and multiplication is not decidable.
     ● Consequence: use affine expressions (affine = linear + additive constant) and congruences.
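
As an illustration (not necessarily the example shown on the original slide), the constraints below use loop variables i, j and a parameter n, all of them names assumed here. The first two lines are Presburger-expressible; the third is not, because it multiplies two variables.

```latex
\[
\begin{aligned}
  & 1 \le i \le n, \qquad 1 \le j \le 2i + 3 && \text{(affine inequalities)} \\
  & i + j \equiv 0 \pmod{2}                   && \text{(congruence)} \\
  & i \cdot j \le n                           && \text{(not affine: product of two variables)}
\end{aligned}
\]
```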

  7. Presburger Formulas II
     ● Presburger formulas allow existential and universal quantification
     ● Formulas with quantifiers are equivalent to formulas without quantifiers (given that congruences are available).
     ● Geometrically, Presburger formulas describe the integer points in polyhedra (possibly with “holes” when quantifiers or congruences are used).
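
Two small equivalences may make the quantifier elimination concrete. They are illustrative choices, not taken from the slide. In both, the existentially quantified integer k disappears once congruences are allowed in the formula.

```latex
\[
\begin{aligned}
  \exists k \in \mathbb{Z}:\; i = 2k
    \quad &\Longleftrightarrow \quad i \equiv 0 \pmod{2} \\
  \exists k \in \mathbb{Z}:\; 3k \le i \le 3k + 1
    \quad &\Longleftrightarrow \quad i \equiv 0 \pmod{3} \;\vee\; i \equiv 1 \pmod{3}
\end{aligned}
\]
```

The set { i | 0 ≤ i ≤ n ∧ ∃k: i = 2k } is an example of a polyhedron with “holes”: only every second integer point of the interval [0, n] belongs to it.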

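  8. Modeling Example I
       for (i = 1; i <= n; ++i)
         for (j = 1; j <= i; ++j)
           S: A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1])
     ● Iteration domain: $D_S = \{ (i, j) \in \mathbb{Z}^2 \mid 1 \le j \le i \le n \}$
     ● Dependences: $(i-1, j) \to (i, j)$ for $1 \le j < i \le n$, and $(i, j-1) \to (i, j)$ for $2 \le j \le i \le n$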

  9. Modeling Example II
       for (i = 2; i <= n; i += 2)
         for (j = 1 + (i/4)*4; j <= i; ++j)
           S: A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1])
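
One way to write the iteration domain of this loop nest as a Presburger set is sketched below. The auxiliary integers a and b are introduced here for illustration: a encodes the stride 2 of the outer loop, and b stands for the integer quotient i/4, which equals the floor because i ≥ 2.

```latex
\[
D_S = \bigl\{\, (i, j) \in \mathbb{Z}^2 \;\bigm|\;
      \exists a, b \in \mathbb{Z}:\;
      i = 2a \;\wedge\; 2 \le i \le n \;\wedge\;
      4b \le i \le 4b + 3 \;\wedge\;
      1 + 4b \le j \le i \,\bigr\}
\]
```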

  10. Modeling Example III
        for (i = n; i >= 1; --i)
          for (j = 1; j <= i; ++j)
            S: A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1])
      Historical solution: loops must be normalized, i.e., loops run forward and have unit stride:
        for (i = -n; i <= -1; ++i)
          for (j = 1; j <= -i; ++j)
            S: A[-i][j] = 0.5 * (A[-i-1][j] + A[-i][j-1])

  11. Modeling with Domain and Schedule
        for (i = n; i >= 1; --i)
          for (j = 1; j <= i; ++j)
            S: A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1])
      ● Use unordered iteration domain + schedule
      ● Execution order is defined by lexicographic order on the schedule
      ● When no explicit schedule is given, an identity schedule is assumed.
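
A minimal C sketch of this idea follows. It assumes the schedule θ(i, j) = (−i, j) for the backward loop; that choice, like the names `Iteration`, `domain_size` and `scheduled_order`, is an assumption made for this illustration. The domain points are enumerated in an arbitrary order, and sorting them lexicographically by schedule value recovers the execution order of the loop nest.

```c
#include <assert.h>
#include <stdlib.h>

/* One iteration of statement S together with its schedule (time) vector. */
typedef struct {
    int i, j;   /* point in the iteration domain { (i, j) : 1 <= j <= i <= n } */
    int t[2];   /* assumed schedule theta(i, j) = (-i, j) */
} Iteration;

/* Lexicographic comparison of two 2-dimensional schedule vectors. */
static int compare_by_schedule(const void *a, const void *b) {
    const Iteration *x = a;
    const Iteration *y = b;
    for (int d = 0; d < 2; ++d) {
        if (x->t[d] != y->t[d])
            return x->t[d] < y->t[d] ? -1 : 1;
    }
    return 0;
}

/* Number of points in the triangular domain; use it to size the buffer. */
size_t domain_size(int n) {
    return (size_t)n * (size_t)(n + 1) / 2;
}

/* Enumerates the domain in ascending i (deliberately not the execution
 * order), attaches the schedule, and sorts by it. Returns the point count. */
size_t scheduled_order(int n, Iteration *out) {
    assert(n >= 0);
    size_t count = 0;
    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= i; ++j) {
            Iteration it = { i, j, { -i, j } };
            out[count++] = it;
        }
    }
    qsort(out, count, sizeof out[0], compare_by_schedule);
    return count;
}
```

For n = 3 the sorted sequence starts with (3, 1), (3, 2), (3, 3) and ends with (1, 1), which is the order in which the backward loop executes S.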

  12. Modeling Example IV
      ● Do not confuse loop and array dimensions
          for (i=0; i<=n; ++i)
            for (j=0; j<=n; ++j)
              A[i+j] = …
        – Two loops but only one array dimension!
        – Do not identify loop dimensions with array dimensions
      ● Exercise:
        – Draw the iteration space and the dependences between iterations

  13. Model Extraction from Source Code
      ● Iteration domains extracted from loop bounds (directly)
      ● Access relations extracted from array accesses (directly)
      ● Schedule (for sequential program) can be constructed systematically: for n nested loops, construct a schedule with 2n+1 dimensions; every second dimension ensures the textual order of the statements.
          A;
          for (i=…) {
            B;
            for (j=…) {
              C;
              D;
            }
            E;
          }
          F;
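
For the two nested loops in this example (n = 2), the construction yields 5-dimensional schedules. The sketch below writes them out; the symbol θ for the schedule is notation assumed here. Dimensions 1, 3 and 5 encode textual position, dimensions 2 and 4 carry the loop counters, and unused positions are padded with 0.

```latex
\[
\begin{aligned}
  \theta_A       &= (0,\, 0,\, 0,\, 0,\, 0) \\
  \theta_B(i)    &= (1,\, i,\, 0,\, 0,\, 0) \\
  \theta_C(i, j) &= (1,\, i,\, 1,\, j,\, 0) \\
  \theta_D(i, j) &= (1,\, i,\, 1,\, j,\, 1) \\
  \theta_E(i)    &= (1,\, i,\, 2,\, 0,\, 0) \\
  \theta_F       &= (2,\, 0,\, 0,\, 0,\, 0)
\end{aligned}
\]
```

Comparing these vectors lexicographically reproduces the sequential order: for a fixed i, B runs before every C and D, and E runs after the whole j loop.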

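  14. Computing Dependences
      ● Computing the dependences is the main work
      ● There exists a dependence between two iterations when
        – existence: both iterations lie in their iteration domains,
        – conflict: both access the same memory location (and at least one access is a write),
        – order: the first iteration is executed before the second
        all hold (in the integers!)
      ● Optimization: remove transitive dependences from the solution, e.g., keep only the last write before a read access → compute the lexicographic maximum of i as a function of j.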
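
In symbols, the three conditions might be written as follows for a source iteration i of statement S and a target iteration j of statement T. The names D (iteration domain), f (access function) and θ (schedule) are notation assumed for this sketch, since the slide's formulas were not preserved.

```latex
\[
\begin{aligned}
  \text{existence:} \quad & \mathbf{i} \in D_S \;\wedge\; \mathbf{j} \in D_T \\
  \text{conflict:}  \quad & f_S(\mathbf{i}) = f_T(\mathbf{j}) \\
  \text{order:}     \quad & \theta_S(\mathbf{i}) \prec \theta_T(\mathbf{j})
\end{aligned}
\]
\[
\text{last write before a read:} \quad
  \operatorname{source}(\mathbf{j}) =
  \operatorname{lexmax} \bigl\{\, \mathbf{i} \;\bigm|\;
    \mathbf{i} \in D_S,\; f_S(\mathbf{i}) = f_T(\mathbf{j}),\;
    \theta_S(\mathbf{i}) \prec \theta_T(\mathbf{j}) \,\bigr\}
\]
```

The maximum is taken with respect to execution order, which coincides with the lexicographic order on i when S has an identity schedule.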

  15. Dependence Computation Example
        for (i = 1; i <= n; ++i)
          for (j = 0; j <= 1; ++j)
            S: A[2*i+j] = …
        for (k = 1; k <= 2*n+1; ++k)
          T: … = A[k]
      ● Existence: $1 \le i \le n$, $0 \le j \le 1$, $1 \le k \le 2n+1$
      ● Conflict: $2i + j = k$
      ● (Order: S textually before T)
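
The dependence system of this example is small enough to solve by exhaustive search, which gives a way to check a closed-form solution. The sketch below is only an illustration of what a tool such as PIP or isl computes symbolically; the function names `find_source` and `source_closed_form` are hypothetical. Every element is written at most once by S, so the matching write is also the last write.

```c
#include <assert.h>
#include <stdbool.h>

/* Brute force: find the iteration (i, j) of S that writes A[k], i.e. that
 * satisfies existence (1 <= i <= n, 0 <= j <= 1, 1 <= k <= 2n+1) and
 * conflict (2*i + j == k). Returns false if T(k) reads an unwritten element. */
bool find_source(int n, int k, int *src_i, int *src_j) {
    assert(n >= 0);
    if (k < 1 || k > 2 * n + 1)
        return false;                 /* k is outside the domain of T */
    for (int i = 1; i <= n; ++i) {
        for (int j = 0; j <= 1; ++j) {
            if (2 * i + j == k) {
                *src_i = i;
                *src_j = j;
                return true;
            }
        }
    }
    return false;
}

/* Closed form of the same solution: i = k / 2, j = k mod 2, valid for
 * 2 <= k <= 2n+1. A[1] is never written by S, so k = 1 has no source. */
bool source_closed_form(int n, int k, int *src_i, int *src_j) {
    if (k < 2 || k > 2 * n + 1)
        return false;
    *src_i = k / 2;
    *src_j = k % 2;
    return true;
}
```

For n = 3, the read T(7) depends on the write S(3, 1), while T(1) has no source in S.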

  16. Tools and Compilers
      ● Tools to solve dependence systems (among others):
        – PIP (“parametric integer programming”, P. Feautrier)
        – isl (“integer set library”, S. Verdoolaege)
      ● Today part of production compilers
        – IBM XL
        – GCC (“Graphite”)
        – LLVM (“Polly”)

  17. Practical Challenges in Modeling
      ● Additional complications in real codes:
        – Are “A” and “B” the same array or different arrays?
        – Aliasing of inner dimensions in arrays of arrays (or arrays represented by pointers to pointers), e.g.,
            float **A = …;
            A[3] = A[4];
          → not enough dependences calculated

  18. Transformations
      ● Compute a new schedule for each statement
      ● Desired properties
        – maximal parallelism
        – minimal latency
        – locality improvement

  19. Transformation Example
        for (k=1; k<=m; k++) {
          for (i=1; i<n; i++)
            A[i] = (A[i-1] + A[i+1]) * 0.5f;
        }
      Dependences (flow and output), drawn in the (k, i) iteration space:
        $(k, i) \to (k, i+1)$
        $(k, i) \to (k+1, i-1)$
        $(k, i) \to (k+1, i)$
      Which iterations can be executed in parallel?

  20. Parallel Schedule
      (Figure: (k, i) iteration space with the independent iterations marked.)
      ● Parallel schedule: $\theta(k, i) = (2k + i,\; k)$
      ● The second component (k) could be anything else as long as it is linearly independent from the first component.
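
A short check of why such a schedule is valid, assuming the first (time) component is t(k, i) = 2k + i. That choice matches the transformation t = i + 2k − 3 used for the transformed space up to a constant offset. Each dependence of the stencil example must advance time by at least one step:

```latex
\[
\begin{aligned}
  (k, i) \to (k, i+1):   \quad & t(k, i+1) - t(k, i) = 1 \\
  (k, i) \to (k+1, i-1): \quad & t(k+1, i-1) - t(k, i) = 2 - 1 = 1 \\
  (k, i) \to (k+1, i):   \quad & t(k+1, i) - t(k, i) = 2
\end{aligned}
\]
```

All differences are at least 1, so no two iterations with the same value of t depend on each other, and they can run in parallel.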

  21. Transformed Space
      Transformation: t = i + 2k - 3, p = k - 1 ⇒ i = t - 2p + 1
        for (int t=0; t<=n+2*m-4; t++) {
          int lb = max(0, (t-n+3)/2);
          int ub = min(m-1, t/2);
          parfor (int p=lb; p<=ub; p++) {
            int i = t - 2*p + 1;
            A[i] = (A[i-1] + A[i+1]) * 0.5f;
          }
        }
      (Figure: transformed iteration space with axes p and t.)
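
The transformed loop nest can be checked against the original one by running both on the same input. The sketch below does this in plain C. It assumes that A has n+1 elements (indices 0 to n) and executes the `parfor` loop sequentially, which is one of the orders a parallel execution may take. The helper names and the test input are made up for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Original loop nest: m sweeps over A[1..n-1]; A holds n+1 elements. */
void sweep_original(float *A, int n, int m) {
    for (int k = 1; k <= m; k++)
        for (int i = 1; i < n; i++)
            A[i] = (A[i-1] + A[i+1]) * 0.5f;
}

/* Transformed loop nest with t = i + 2k - 3 and p = k - 1.
 * The p loop is the parallel one; here it simply runs sequentially. */
void sweep_transformed(float *A, int n, int m) {
    for (int t = 0; t <= n + 2*m - 4; t++) {
        int lb = max_int(0, (t - n + 3) / 2);
        int ub = min_int(m - 1, t / 2);
        for (int p = lb; p <= ub; p++) {
            int i = t - 2*p + 1;
            A[i] = (A[i-1] + A[i+1]) * 0.5f;
        }
    }
}

/* Runs both versions on identical input and returns the largest
 * absolute difference between the two results. */
float max_difference(int n, int m) {
    assert(n >= 2 && m >= 1);
    float *a = malloc((size_t)(n + 1) * sizeof *a);
    float *b = malloc((size_t)(n + 1) * sizeof *b);
    assert(a != NULL && b != NULL);
    for (int x = 0; x <= n; x++)
        a[x] = b[x] = (float)((x * x) % 7);   /* arbitrary test data */
    sweep_original(a, n, m);
    sweep_transformed(b, n, m);
    float worst = 0.0f;
    for (int x = 0; x <= n; x++) {
        float d = a[x] - b[x];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    free(a);
    free(b);
    return worst;
}
```

Because the transformation respects every dependence, each element is computed from the same operands in both versions, so the difference should be zero up to floating-point effects of the platform.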

  22. Schedules with Rational Coefficients
      ● Schedules with fractional coefficients make sense!
      ● Dependence:
      ● Schedule:
      ● Interpretation of the schedule:
      ● Rational schedules are implicitly “floored”.

  23. Computing a new Schedule
      ● Correctness criterion for a schedule:
      ● Challenges:
        – How to solve the “implies” part?
        – How to treat lexicographic precedence?
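
The usual way to state the correctness criterion is sketched below; the notation is assumed here, since the slide's formula was not preserved. Δ denotes the set of dependences from statement S to statement T, θ the schedules, and ≺ strict lexicographic order.

```latex
\[
\forall\, \mathbf{i}, \mathbf{j}: \quad
  (\mathbf{i} \to \mathbf{j}) \in \Delta_{S,T}
  \;\Longrightarrow\;
  \theta_S(\mathbf{i}) \prec \theta_T(\mathbf{j})
\]
```

Read this way, the “implies” part is the universal quantification over all dependent pairs of iterations, and ≺ is the lexicographic precedence that the two challenges refer to.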

  24. Feautrier's Scheduling Algorithm I
      ● For each statement S, try to find a schedule where, for each dimension d , i.e. determine the coefficients (for each dimension) such that holds.

  25. Feautrier's Scheduling Algorithm II
      ● Main idea of the Feautrier scheduler: an affine function is non-negative everywhere on a (non-empty) polyhedron if and only if it is a non-negative affine combination of the polyhedron's bounds (affine form of Farkas' Lemma).
      ● Try to solve (starting with ) where are the expressions defining the iteration domains of the involved statements.
      ● Technique: equate coefficients of the right and left side (the resulting system is affine!)
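
The affine form of Farkas' Lemma and its use for one schedule dimension can be sketched as follows. All symbols are assumed notation: the bounding constraints are written a_k·x + b_k ≥ 0, the multipliers are λ, and ε = 1 is taken here for a dependence that must be strictly satisfied.

```latex
\[
f(\mathbf{x}) \ge 0 \;\; \forall\, \mathbf{x} \in P
  = \{\, \mathbf{x} \mid \mathbf{a}_k \cdot \mathbf{x} + b_k \ge 0,\; k = 1, \dots, m \,\}
\quad \Longleftrightarrow \quad
\exists\, \lambda_0, \dots, \lambda_m \ge 0: \;
f(\mathbf{x}) \equiv \lambda_0 + \sum_{k=1}^{m} \lambda_k \,(\mathbf{a}_k \cdot \mathbf{x} + b_k)
\]
\[
\theta_T(\mathbf{j}) - \theta_S(\mathbf{i}) - \varepsilon
  \;\equiv\; \lambda_0 + \sum_{k=1}^{m} \lambda_k \, p_k(\mathbf{i}, \mathbf{j}),
  \qquad \lambda_k \ge 0
\]
```

Both sides of the second identity are affine in i and j, so matching the coefficient of every variable and the constant term gives affine equations in the unknown schedule coefficients and the multipliers λ.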

  26. Feautrier Scheduler III
      ● In general, not all dependences can be satisfied in the first schedule dimension → try to satisfy as many dependences as possible, satisfy the remaining dependences in the second dimension if possible, and so on.
      ● Every dependence is “carried” by a certain schedule dimension c. The schedule must ensure
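
A common formulation of the requirement for a dependence i → j carried at dimension c is given below as a sketch in assumed notation, where θ^d denotes the d-th schedule component. It is not necessarily the exact formula of the slide.

```latex
\[
\theta_T^{\,d}(\mathbf{j}) - \theta_S^{\,d}(\mathbf{i}) \ge 0 \quad \text{for all } d < c,
\qquad
\theta_T^{\,c}(\mathbf{j}) - \theta_S^{\,c}(\mathbf{i}) \ge 1
\]
```

Together with the assumption that the dependence is not strictly satisfied in any dimension before c, the differences in dimensions d < c are all zero, so the schedule vectors satisfy θ_S(i) ≺ θ_T(j).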

  27. Optimization Criteria
      ● The order in which dependences are satisfied is not determined
      ● Different solutions are possible even for a fixed order
      ● Select an additional target, e.g.,
        – minimize latency (shortest theoretical execution)
        – do not use more processors than necessary
        – improve locality (put source and target of dependences with high data volume on the same processor)
      ● Problem: the relation between the theoretical optimum (e.g. minimal latency) and performance on real hardware is unknown → room for exploration/autotuning/machine learning

  28. Performance of Different Solutions
      (Figure: 3D Jacobi stencil; experiment by Stefan Kronawitter.)

  29. Another Transformation: Tiling
      ● To improve data locality, tiling is often necessary to speed up a code.
      Original loop nest:
        for (i=0; i<n; ++i)
          for (j=0; j<n; ++j)
            ...
      Each loop split into a tile loop and a point loop (tile size B, assuming B divides n):
        for (iT=0; iT<n/B; ++iT)
          for (i=B*iT; i<B*iT+B; ++i)
            for (jT=0; jT<n/B; ++jT)
              for (j=B*jT; j<B*jT+B; ++j)
                ...
      Tile loops moved outward:
        for (iT=0; iT<n/B; ++iT)
          for (jT=0; jT<n/B; ++jT)
            for (i=B*iT; i<B*iT+B; ++i)
              for (j=B*jT; j<B*jT+B; ++j)
                ...
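
When n is not a multiple of the tile size, the point loops need an extra upper bound. The sketch below shows one way to handle this in C and verifies that the tiled traversal still visits every iteration exactly once. Unlike the slide's version, the tile loops step over tile origins instead of tile indices. The names `visit_tiled` and `tiling_covers_once` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Traverses an n x n iteration space tile by tile (tile size B) and
 * counts in visits[i*n + j] how often iteration (i, j) is executed.
 * Partial tiles at the border are clipped with min(). */
void visit_tiled(int n, int B, int *visits) {
    assert(n >= 0 && B >= 1);
    for (int iT = 0; iT < n; iT += B)
        for (int jT = 0; jT < n; jT += B)
            for (int i = iT; i < min_int(iT + B, n); ++i)
                for (int j = jT; j < min_int(jT + B, n); ++j)
                    visits[i * n + j] += 1;
}

/* Returns 1 if the tiled traversal executes every iteration exactly once. */
int tiling_covers_once(int n, int B) {
    assert(n >= 1 && B >= 1);
    int *visits = calloc((size_t)n * (size_t)n, sizeof *visits);
    assert(visits != NULL);
    visit_tiled(n, B, visits);
    int ok = 1;
    for (int x = 0; x < n * n; ++x)
        if (visits[x] != 1)
            ok = 0;
    free(visits);
    return ok;
}
```

This only checks that tiling is a reordering of the same iterations; whether the reordering is legal depends on the dependences of the loop body.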

  30. Limitations with Tiling
      ● The tiling transformation is only affine for a given tile size
      ● Parametric tiling requires extensions of the polyhedral model which are not part of widespread tools (yet?)
      ● Tiling is not legal in all cases; sufficient criterion: for all d.
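
A commonly used sufficient criterion is that no dependence points backward in any of the schedule dimensions to be tiled. It is written below in assumed notation and applied to the stencil of the earlier transformation example. This is a sketch; the slide's own formula was not preserved.

```latex
\[
\theta_T^{\,d}(\mathbf{j}) - \theta_S^{\,d}(\mathbf{i}) \ge 0
  \quad \text{for all dependences } \mathbf{i} \to \mathbf{j} \text{ and all tiled dimensions } d
\]
\[
\begin{array}{lcl}
  \text{distances in } (k,\, i): & (0, 1),\; (1, -1),\; (1, 0)
    & \text{one negative component, criterion not met} \\
  \text{distances in } (2k + i,\, k): & (1, 0),\; (1, 1),\; (2, 1)
    & \text{all components} \ge 0 \text{, criterion met}
\end{array}
\]
```

In the original (k, i) coordinates the criterion fails because of the dependence (k, i) → (k+1, i−1); after skewing to (2k + i, k) it holds.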

  31. Other Transformations
      ● Index Set Splitting: split iteration domains to allow for better schedules
      ● Memory Layout Transformations: more cache-friendly data layout
      ● Save Memory: compute minimum required working set size
      ● Communication Introduction: enumerate memory elements that must be transferred between tiles
      ● ...

  32. Code Generation
      ● After transformation we must generate executable code for the transformed model
      ● Generated code must be efficient
      ● Generation of good code is non-trivial
