SLIDE 1

Polyhedral Loop Optimization (Part I)

Armin Größlinger

SPPEXA Doctoral Retreat 2015

September 14, 2015

SLIDE 2

Overview

  • Monday: Basics of the polyhedral model

  – Modeling
  – Transformation
  – Code generation

  • Wednesday: Practice

  – Available tools
  – Use LLVM+Polly to analyze and optimize codes

SLIDE 3

Polyhedral Compilation

SLIDE 4

Phases of the Model

1) Modeling

  • Describe loop iterations (= iteration domain)
  • Compute dependences between iterations

2) Transformation(s)

  • Reorder iterations to exhibit desired properties, e.g., parallelism, increased data locality, etc.

  • Must respect dependences

3) Code Generation

  • Turn transformed model into efficient code
SLIDE 5

Model

  • For each statement:

  – iteration domain: (unordered) set of loop iterations
  – schedule: relation between iteration domain and (virtual) multi-dimensional execution time
  – all schedules have the same dimensionality

  • For each array access:

– access relation: maps iteration to memory location

  • For each pair of statements:

– dependences: relation between iteration domains

  • All sets and relations are defined by Presburger formulas
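
For example, the model of a single-assignment loop can be written down directly in isl notation; the following is an illustrative sketch using islpy, the Python bindings of isl (the loop `for (i = 0; i < n; ++i) S: A[i] = A[i] + 1;` and all set/map strings are made-up examples, assuming islpy is installed):

import islpy as isl

# iteration domain: the (unordered) set of loop iterations of S
domain = isl.Set("[n] -> { S[i] : 0 <= i < n }")

# schedule: maps each iteration to a (virtual) execution time
schedule = isl.Map("[n] -> { S[i] -> [i] }")

# access relations: map each iteration to the memory cells it writes/reads
write = isl.Map("[n] -> { S[i] -> A[i] }")
read  = isl.Map("[n] -> { S[i] -> A[i] }")

print(domain)
print(schedule.intersect_domain(domain))
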
SLIDE 6

Presburger Formulas I

  • Modeling must be based on a decidable theory
  • Loop iterations and memory locations are discrete units

→ must use integers (not rationals or reals)

  • The theory of inequalities (and congruences) over the integers with addition is decidable.
  • The theory of inequalities over the integers with addition and multiplication is not decidable.
  • Consequence: use affine expressions (affine = linear + additive constant) and congruences, e.g., constraints such as 3i + 2j − n + 1 ≥ 0 or i ≡ 0 (mod 2).

SLIDE 7

Presburger Formulas II

  • Presburger formulas allow existential and universal quantification.
  • Formulas with quantifiers are equivalent to formulas without quantifiers, e.g., ∃k : i = 2k is equivalent to i ≡ 0 (mod 2).
  • Geometrically, Presburger formulas describe the integer points in polyhedra (possibly with “holes” when quantifiers or congruences are used).
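
This can be observed directly with isl: building a set with an extra variable and projecting it out performs the quantifier elimination. A small illustrative sketch using the islpy Python bindings (assuming islpy is installed):

import islpy as isl

# { [i] : exists k : i = 2k and 0 <= i <= 10 }, with k made explicit as a set dimension
s = isl.Set("{ [i, k] : i = 2*k and 0 <= i <= 10 }")

# projecting out k eliminates the existential quantifier
even = s.project_out(isl.dim_type.set, 1, 1)
print(even)   # an equivalent quantifier-free description (isl typically prints a mod/congruence constraint)
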
SLIDE 8

Modeling Example I

for (i = 1; i <= n; ++i)
  for (j = 1; j <= i; ++j)
S:    A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1]);

  • Iteration domain: D_S = { (i, j) : 1 ≤ j ≤ i ≤ n }
  • Dependences: (i−1, j) → (i, j) and (i, j−1) → (i, j) (within D_S)
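
The same domain and dependences can be computed mechanically; a sketch using the islpy Python bindings of isl (the variable names and the lexicographic-order map below are illustrative):

import islpy as isl

dom    = isl.Set("[n] -> { S[i,j] : 1 <= j <= i <= n }")
writes = isl.Map("[n] -> { S[i,j] -> A[i,j] }")
reads  = isl.Map("[n] -> { S[i,j] -> A[i-1,j]; S[i,j] -> A[i,j-1] }")

# pairs of iterations touching the same array cell (write first, read second)
dep = writes.apply_range(reads.reverse())
dep = dep.intersect_domain(dom).intersect_range(dom)

# keep only pairs where the source runs before the target (original loop order)
before = isl.Map("[n] -> { S[i,j] -> S[a,b] : a > i or (a = i and b > j) }")
print(dep.intersect(before))   # the flow dependences (i-1,j) -> (i,j) and (i,j-1) -> (i,j)
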
SLIDE 9

Modeling Example II

for (i = 2; i <= n; i += 2)
  for (j = 1 + (i/4)*4; j <= i; ++j)
S:    A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1]);

Strides and integer division are still expressible: the stride becomes the congruence i ≡ 0 (mod 2), and the lower bound 1 + (i/4)*4 = 1 + 4·⌊i/4⌋ can be written with an existentially quantified variable (or a floor term).

SLIDE 10

Modeling Example III

for (i = n; i >= 1; --i)
  for (j = 1; j <= i; ++j)
S:    A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1]);

Historical solution: loops must be normalized, i.e., loops run forward and have unit stride:

for (i = -n; i <= -1; ++i)
  for (j = 1; j <= -i; ++j)
S:    A[-i][j] = 0.5 * (A[-i-1][j] + A[-i][j-1]);

SLIDE 11

Modeling with Domain and Schedule

  • Use unordered iteration domain + schedule
  • Execution order is defined by the lexicographic order on the schedule.
  • When no explicit schedule is given, an identity schedule is assumed.

for (i = n; i >= 1; --i)
  for (j = 1; j <= i; ++j)
S:    A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1]);

For this loop, the domain D_S = { (i, j) : 1 ≤ j ≤ i ≤ n } together with the schedule θ_S(i, j) = (−i, j) reproduces the original (reversed) execution order.

SLIDE 12

Modeling Example IV

  • Do not confuse loop and array dimensions

for (i = 0; i <= n; ++i)
  for (j = 0; j <= n; ++j)
    A[i+j] = …

  – Two loops but only one array dimension!
  – Do not identify loop dimensions with array dimensions

  • Exercise:

  – Draw the iteration space and the dependences between iterations

SLIDE 13

Model Extraction from Source Code

  • Iteration domains extracted from loop bounds (directly)
  • Access relations extracted from array accesses (directly)
  • The schedule (for the sequential program) can be constructed systematically: for n nested loops, construct a schedule with 2n+1 dimensions; every second dimension ensures the textual order of the statements.

A;
for (i = …) {
  B;
  for (j = …) {
    C;
    D;
  }
  E;
}
F;
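
For this fragment the maximal nesting depth is 2, so the schedules get 2·2 + 1 = 5 dimensions (a common convention pads unused dimensions with 0); the construction gives, for example:

θ_A       = (0, 0, 0, 0, 0)
θ_B(i)    = (1, i, 0, 0, 0)
θ_C(i,j)  = (1, i, 1, j, 0)
θ_D(i,j)  = (1, i, 1, j, 1)
θ_E(i)    = (1, i, 2, 0, 0)
θ_F       = (2, 0, 0, 0, 0)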

SLIDE 14

Computing Dependences

  • Computing the dependences is the main work
  • For a write S(i) to location f(i) and a read T(j) from location g(j), there exists a dependence S(i) → T(j) when
  – existence: i ∈ D_S and j ∈ D_T
  – conflict: f(i) = g(j)
  – order: S(i) is executed before T(j)
  all hold (in the integers!)

  • Optimization: remove transitive dependences from the solution, e.g., keep only the last write before a read access → compute the lexicographic maximum of i as a function of j.

SLIDE 15

Dependence Computation Example

for (i = 1; i <= n; ++i)
  for (j = 0; j <= 1; ++j)
S:    A[2*i+j] = …
for (k = 1; k <= 2*n+1; ++k)
T:  … = A[k]

  • Existence: 1 ≤ i ≤ n, 0 ≤ j ≤ 1, 1 ≤ k ≤ 2n+1
  • Conflict: 2i + j = k
  • (Order: S textually before T)
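
This dependence system can also be handed to isl; an illustrative sketch with the islpy Python bindings:

import islpy as isl

dom_S = isl.Set("[n] -> { S[i,j] : 1 <= i <= n and 0 <= j <= 1 }")
dom_T = isl.Set("[n] -> { T[k] : 1 <= k <= 2*n + 1 }")
write_S = isl.Map("[n] -> { S[i,j] -> A[2*i + j] }")
read_T  = isl.Map("[n] -> { T[k] -> A[k] }")

# existence + conflict: S- and T-instances touching the same array element
dep = write_S.apply_range(read_T.reverse())
dep = dep.intersect_domain(dom_S).intersect_range(dom_T)
# order is trivially satisfied: S is textually before T

print(dep)                           # { S[i,j] -> T[2i+j] : ... }
print(dom_T.subtract(dep.range()))   # reads without a source: A[1] is never written (for n >= 1)
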
SLIDE 16

Tools and Compilers

  • Tools to solve dependence systems (among others):

  – PIP (“parametric integer programming”, P. Feautrier)
  – isl (“integer set library”, S. Verdoolaege)

  • Today part of production compilers

  – IBM XL
  – GCC (“Graphite”)
  – LLVM (“Polly”)

SLIDE 17

Practical Challenges in Modeling

  • Additional complications in real codes:

  – Are “A” and “B” the same array or different arrays?
  – Aliasing of inner dimensions in arrays of arrays (or arrays represented by pointers to pointers), e.g., float **A = …; A[3] = A[4]; → not enough dependences calculated

SLIDE 18

Transformations

  • Compute a new schedule for each statement
  • Desired properties

  – maximal parallelism
  – minimal latency
  – locality improvement

SLIDE 19

Transformation Example

for (k = 1; k <= m; k++) {
  for (i = 1; i < n; i++)
    A[i] = (A[i-1] + A[i+1]) * 0.5f;
}

Dependences:
  (k, i) → (k, i+1)    → flow dependence
  (k, i) → (k+1, i−1)  → flow dependence
  (k, i) → (k+1, i)    → output dependence

Which iterations can be executed in parallel?

SLIDE 20

Parallel Schedule

[Figure: iteration space in the (i, k) plane with wavefronts of independent iterations]

  • Parallel schedule: θ(k, i) = (i + 2k, k); on the next slide it is shifted to t = i + 2k − 3, p = k − 1 so that both components start at 0.
  • The second component (k) could be anything else as long as it is linearly independent from the first component.

SLIDE 21

Transformed Space

Transformation: t = i + 2k - 3, p = k - 1   (inverse: i = t - 2p + 1)

for (int t = 0; t <= n + 2*m - 4; t++) {
  int lb = max(0, (t - n + 3) / 2);
  int ub = min(m - 1, t / 2);
  parfor (int p = lb; p <= ub; p++) {
    int i = t - 2*p + 1;
    A[i] = (A[i-1] + A[i+1]) * 0.5f;
  }
}
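
Legality of this schedule can be checked mechanically: express the dependences in the new (t, p) coordinates and look at the distance vectors. A sketch with the islpy Python bindings (the dependence relation is the one from the transformation-example slide; the code is illustrative):

import islpy as isl

dom = isl.Set("[n,m] -> { S[k,i] : 1 <= k <= m and 1 <= i < n }")
dep = isl.Map("[n,m] -> { S[k,i] -> S[k,i+1]; S[k,i] -> S[k+1,i-1]; S[k,i] -> S[k+1,i] }")
dep = dep.intersect_domain(dom).intersect_range(dom)

# schedule t = i + 2k - 3, p = k - 1
sched = isl.Map("[n,m] -> { S[k,i] -> [i + 2*k - 3, k - 1] }")

# dependence distances in (t, p) coordinates
dep_tp = dep.apply_domain(sched).apply_range(sched)
print(dep_tp.deltas())   # every distance has t-component >= 1, so the p loop carries no dependence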

SLIDE 22

Schedules with Rational Coefficients

  • Schedules with fractional coefficients make sense!
  • Dependence:
  • Schedule:
  • Interpretation of the schedule:
  • Rational schedules are implicitly “floored”.
SLIDE 23

Computing a new Schedule

  • Correctness criterion for a schedule: every dependence S(i) → T(j) implies θ_S(i) ≺ θ_T(j) (lexicographically smaller)
  • Challenges:

  – How to solve the “implies” part?
  – How to treat lexicographic precedence?

SLIDE 24

Feautrier's Scheduling Algorithm I

  • For each statement S, try to find an affine schedule θ_S(i) = τ_S · i + c_S per schedule dimension d, i.e., determine the coefficients τ_S and c_S (for each dimension) such that θ_T(j) − θ_S(i) ≥ 1 holds for every dependence S(i) → T(j) handled in dimension d.

SLIDE 25

Feautrier's Scheduling Algorithm II

  • Main idea of Feautrier Scheduler:

A linear function is guaranteed to be non-negative over a polyhedron if it is a positive combination of the polyhedron's bounds. (Linear form of Farkas' Lemma)

  • Try to solve (starting with the outermost schedule dimension)
    θ_T(j) − θ_S(i) − δ = λ_0 + Σ_k λ_k · c_k(i, j),   λ_0, λ_k ≥ 0, δ ∈ {0, 1},
    where the c_k ≥ 0 are the expressions defining the iteration domains of the involved statements.

  • Technique: equate coefficients of the right and left side (the resulting system is affine!)

SLIDE 26

Feautrier Scheduler III

  • In general, not all dependences can be satisfied in the first schedule dimension → try to satisfy as many dependences as possible, satisfy remaining dependences in the second dimension if possible, and so on.

  • Every dependence is “carried” by a certain schedule dimension c. The schedule must ensure θ_T^d(j) − θ_S^d(i) ≥ 0 for all d < c and θ_T^c(j) − θ_S^c(i) ≥ 1.

SLIDE 27

Optimization Criteria

  • Order in which dependences are satisfied not determined
  • Different solutions possible even for a fixed order
  • Select an additional target, e.g.,

  – minimize latency (shortest theoretical execution)
  – do not use more processors than necessary
  – improve locality (put source and target of dependences with high data volume on the same processor)

  • Problem: relation between theoretical optimum (e.g. minimal latency) and performance on real hardware unknown → room for exploration/autotuning/machine learning

SLIDE 28

Performance of Different Solutions

3D Jacobi Stencil, Experiment by Stefan Kronawitter

SLIDE 29

Another Transformation: Tiling

  • To improve data locality, tiling is often necessary to speed up a code.

// original (assuming B divides n)
for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j)
    ...

// strip-mined: each loop split into a tile loop and an intra-tile loop
for (iT = 0; iT < n/B; ++iT)
  for (i = B*iT; i < B*iT + B; ++i)
    for (jT = 0; jT < n/B; ++jT)
      for (j = B*jT; j < B*jT + B; ++j)
        ...

// tiled: tile loops moved outside
for (iT = 0; iT < n/B; ++iT)
  for (jT = 0; jT < n/B; ++jT)
    for (i = B*iT; i < B*iT + B; ++i)
      for (j = B*jT; j < B*jT + B; ++j)
        ...
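
With a fixed tile size, the tiling above is just another relation in the model; an illustrative sketch with the islpy Python bindings (tile size 32 chosen arbitrarily):

import islpy as isl

dom = isl.Set("[n] -> { S[i,j] : 0 <= i < n and 0 <= j < n }")

# prepend the tile coordinates (floor(i/32), floor(j/32)) to each iteration
tile = isl.Map("[n] -> { S[i,j] -> [floord(i,32), floord(j,32), i, j] }")
print(tile.intersect_domain(dom))
# with a parametric tile size B, floord(i,B) is no longer (quasi-)affine -- see the next slide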

SLIDE 30

Limitations with Tiling

  • Tiling transformation is only affine for a given tile size
  • Parametric tiling requires extensions of the polyhedral model which are not part of widespread tools (yet?)

  • Tiling is not legal in all cases; sufficient criterion: no dependence points backwards in any dimension, i.e., θ_T^d(j) − θ_S^d(i) ≥ 0 for all d.

SLIDE 31

Other Transformations

  • Index Set Splitting: split iteration domains to allow for better schedules
  • Memory Layout Transformations: more cache-friendly data layout
  • Save Memory: compute minimum required working set size
  • Communication Introduction: enumerate memory elements that must be transferred between tiles

  • ...
SLIDE 32

Code Generation

  • After transformation we must generate executable code for the transformed model

  • Generated code must be efficient
  • Generation of good code is non-trivial
SLIDE 33

The Essence of Code Generation

[Figure: two overlapping rectangular iteration domains in the (x, y) plane: T1 with a ≤ x ≤ c, e ≤ y ≤ g and T2 with b ≤ x ≤ d, f ≤ y ≤ h]

Enumerate iterations in lexicographic order. For efficiency: no case distinctions inside the loops!

Generated code (no case distinctions inside the loops):

for (x = a; x <= b-1; x++)
  for (y = e; y <= g; y++)
    T1;
for (x = b; x <= c; x++) {
  for (y = e; y <= f-1; y++)
    T1;
  for (y = f; y <= g; y++) {
    T1; T2;
  }
  for (y = g+1; y <= h; y++)
    T2;
}
for (x = c+1; x <= d; x++)
  for (y = f; y <= h; y++)
    T2;

Compare: a single loop nest with conditions inside (inefficient):

for (x = a; x <= d; x++) {
  for (y = e; y <= h; y++) {
    if (a <= x && x <= c && e <= y && y <= g) T1;
    if (b <= x && x <= d && f <= y && y <= h) T2;
  }
}

SLIDE 34

Principles of Code Generation

  • Project all domains onto the outermost dimension
  • Compute a disjoint union of the projections, i.e., polyhedra with the property that for each polyhedron P and each statement S either P ⊆ proj(D_S) or P ∩ proj(D_S) = ∅ holds. These polyhedra can be ordered lexicographically and enumerated by a sequence of loops.

  • For each loop, repeat the procedure with the current dimension turned into a parameter.
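
The partitioning step can be tried on the outermost dimension of the two-rectangle example; a sketch with the islpy Python bindings (the concrete bounds 0..9 and 5..14 are made-up stand-ins for a..d):

import islpy as isl

p1 = isl.Set("{ [x] : 0 <= x <= 9 }")    # projection of the T1 domain onto x
p2 = isl.Set("{ [x] : 5 <= x <= 14 }")   # projection of the T2 domain onto x

# disjoint pieces: each piece lies completely inside or outside each projection
pieces = [p1.subtract(p2), p1.intersect(p2), p2.subtract(p1)]
for p in pieces:
    print(p)   # [0,4] only T1, [5,9] both, [10,14] only T2
# each piece becomes one outer loop; the procedure is repeated for y within each piece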

SLIDE 35

Code Generation is Tricky

  • Code generation has many tunables
  • Main trade-off: code size vs. optimization of control flow
  • One can easily generate thousands of lines of code, even millions, from relatively small inputs

  • Interplay between the polyhedral code generator and the backend compiler can be tricky, e.g., the compiler may reorder operations and thereby reduce performance
SLIDE 36

Conclusions

  • “Homework” for Wednesday:

Find a code in your area of research which may profit from polyhedral transformations.