
CS 293S Optimizing for Parallelism and Locality: Affine Transformation
Yufei Ding
Reference Book: “Optimizing Compilers for Modern Architectures” by Allen & Kennedy
Slides adapted from Louis-Noël Pouchet, Mary Hall

Review of last lecture
• Data Dependence
  • True, Anti-, Output dependence
  • Source and Sink
  • Distance vector, direction vector
  • Relation between Reordering transformation and Direction vector
• Loop dependence
  • Loop-Carried Dependences
  • Loop-Independent Dependences
• Dependence graph

Dependence Graph
Important point: order of vectors depends on order of loops, not use in arrays

        DO I = 1, 100
    S1    D(I) = A(5, I)
          DO J = 1, 100
    S2      A(J, I-1) = B(I) + C
          ENDDO
        ENDDO

From S1 to S2: (<), level-1 antidependence. S1 is the source, S2 is the sink.
Graph: S1 δ_1^{-1} S2 (one edge from S1 to S2)
• Nodes for statements
• Edges for data dependences
• Labels on edges for dependence levels and types

        DO I = 1, 100
    S1    X(I) = Y(I) + 10
          DO J = 1, 100
    S2      B(J) = A(J,N)
            DO K = 1, 100
    S3        A(J+1,K) = B(J) + C(J,K)
            ENDDO
    S4      Y(I+J) = A(J+1, N)
          ENDDO
        ENDDO

1. True dependence is denoted by S_i δ S_j
2. Antidependence is denoted by S_i δ^{-1} S_j
3. Output dependence is denoted by S_i δ^{0} S_j
d and δ are used interchangeably.

Review
• Dependence Tests
  • GCD
• Controlling execution order
  • determining the upper/lower bound through projection by Fourier-Motzkin elimination
  • General algorithms to determine loop bounds
    • inner to outer levels to generate
    • outer to inner levels to refine

Data Dependence Tests
• Given the loop nest:
    for (i = 0; i < N; i++) {
      a[f(i)] = ...
      ... = a[g(i)]
    }
• A dependence exists if there exist integers i and i’ such that:
  • f(i) = g(i’)
  • 0 <= i, i’ < N
• If i < i’, write happens before read (true dependence)
• If i > i’, write happens after read (anti dependence)
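The definition above can be checked by brute force for small N. The following is a minimal sketch, not part of the original slides; the function name `find_dependences` and the label used for the case i = i’ are assumptions made for illustration.

```python
def find_dependences(f, g, n):
    """Enumerate pairs (i, i2) with 0 <= i, i2 < n where the write a[f(i)]
    and the read a[g(i2)] touch the same array element."""
    deps = []
    for i in range(n):          # iteration that performs the write
        for i2 in range(n):     # iteration that performs the read
            if f(i) == g(i2):
                if i < i2:
                    kind = "true"              # write happens before read
                elif i > i2:
                    kind = "anti"              # write happens after read
                else:
                    kind = "loop-independent"  # same iteration (label is an assumption)
                deps.append((i, i2, kind))
    return deps
```

For example, with the write `a[i+1]` and the read `a[i]`, every pair found has i’ = i + 1, so all dependences are true dependences. This enumeration costs O(N^2), which is why compilers rely on symbolic tests instead.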

Solution: GCD test
• Does f(i) = g(i’) have a solution?
• Assume f(i) = a*i + b and g(i) = c*i + d
• f(i) = g(i’) ⇒ a*i + b = c*i’ + d ⇒ a1*i + a2*i’ = a3 (with a1 = a, a2 = -c, a3 = d - b)
• An equation a1*i + a2*i’ = a3 has an integer solution iff gcd(a1, a2) evenly divides a3

Examples
    for (i = 1; i < 10; i++) {
      Z[2*i] = . . .;
    }
    for (j = 1; j < 10; j++) {
      Z[2*j+1] = . . .;
    }
• 2i = 2j + 1
• gcd(2, -2) = 2, and 2 does not divide 1 evenly. Thus, there is no solution.
Other Examples:
• 15*i + 6*j - 9*k = 12 has a solution (gcd = 3)
• 2*i + 7*j = 3 has a solution (gcd = 1)
• 9*i + 6*j = 10 has no solution (gcd = 3)
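The GCD test on these examples can be sketched in a few lines. This is an illustrative sketch only; the function name `has_integer_solution` is an assumption, and the test says nothing about whether a solution lies within the loop bounds.

```python
from functools import reduce
from math import gcd


def has_integer_solution(coeffs, rhs):
    """GCD test: a1*x1 + ... + an*xn = rhs has an integer solution
    iff gcd(a1, ..., an) divides rhs. Loop bounds are ignored."""
    g = reduce(gcd, (abs(c) for c in coeffs), 0)
    if g == 0:                 # all coefficients are zero
        return rhs == 0
    return rhs % g == 0
```

Calling `has_integer_solution([2, -2], 1)` corresponds to the equation 2i = 2j + 1 from the first example and returns `False`, so the two loops touch disjoint elements of `Z`.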

Finding the GCD
• Finding GCD with Euclid’s algorithm
  • Repeat (suppose a > b)
    • a = a mod b
    • swap a and b
  • until b is 0 (resulting a is the gcd)
• Example: gcd(27, 15): a = 27, b = 15
    a = 27 mod 15 = 12
    a = 15 mod 12 = 3
    a = 12 mod 3 = 0
    gcd = 3
• Why? If g divides a and b, then g divides a mod b
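A minimal sketch of the loop described above follows. The name `euclid_gcd` is an assumption; Python's standard library already provides `math.gcd`, so this is only meant to mirror the steps on the slide.

```python
def euclid_gcd(a, b):
    """Euclid's algorithm: replace (a, b) by (b, a mod b) until b is 0."""
    a, b = abs(a), abs(b)
    while b != 0:
        # "a = a mod b" followed by "swap a and b", done in one step
        a, b = b, a % b
    return a
```

For `euclid_gcd(27, 15)` the pair (a, b) goes through (27, 15), (15, 12), (12, 3), (3, 0), matching the worked example.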

Downsides to GCD test
• If f(i) = g(i’) fails the GCD test, then there is no i, i’ that can produce a dependence → loop has no dependences
• If f(i) = g(i’) passes the GCD test, there might be a dependence, but there might not
  • i and i’ that satisfy the equation might fall outside the loop bounds, as in:
      for (i = 1; i < 10; i++)
        Z[i] = Z[i+10];
  • Loop may be parallelizable, but the test cannot tell
• Unfortunately, most loops have gcd(a, b) = 1, which divides everything
• Other optimizations (loop interchange) can tolerate dependences in certain situations
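The gap between the GCD test and the actual bounds can be seen on the `Z[i] = Z[i+10]` loop. The sketch below is illustrative; both function names are assumptions, and the subscripts are taken to be f(i) = a*i + b for the write and g(i’) = c*i’ + d for the read, with a and c not both zero.

```python
from math import gcd


def gcd_test_passes(a, b, c, d):
    """True if a*i + b = c*i2 + d has some integer solution (bounds ignored).
    Assumes a and c are not both zero."""
    return (d - b) % gcd(a, c) == 0


def dependence_within_bounds(a, b, c, d, lo, hi):
    """Brute-force check restricted to lo <= i, i2 < hi."""
    return any(a * i + b == c * i2 + d
               for i in range(lo, hi)
               for i2 in range(lo, hi))
```

For the write `Z[i]` and read `Z[i+10]` with 1 <= i < 10, `gcd_test_passes(1, 0, 1, 10)` is `True` while `dependence_within_bounds(1, 0, 1, 10, 1, 10)` is `False`: the writes cover indices 1 to 9 and the reads cover 11 to 19, so the loop is parallelizable even though the GCD test cannot prove it.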

Other dependence tests
• GCD test: doesn’t account for loop bounds, does not provide useful information in many cases
• Banerjee test (Utpal Banerjee): more accurate test, takes directions and loop bounds into account
• Omega test (William Pugh): even more accurate test, precise but can be very slow
• Range test (Blume and Eigenmann): works for non-linear subscripts
• Compilers tend to perform simple tests and only perform more complex tests if they cannot prove non-existence of dependence

Code generation by loop transformation

Original:
    for (i=0; i<=5; i++)
      for (j=i; j<=7; j++)
        Z[j, i] = 0;

Transformed (loops interchanged):
    for (j=0; j<=7; j++)
      for (i=0; i<=min(5, j); i++)
        Z[j, i] = 0;

• The problem of how we choose an ordering that honors the data dependences and optimizes for data locality and parallelism is generally hard.
• Here we assume that a legal and desirable ordering is given, and show how to generate code that enforces the ordering.
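A quick way to convince oneself that the two loop nests are equivalent is to enumerate the (i, j) pairs each one visits. This is a sanity-check sketch, not part of the original slides; the function names are assumptions.

```python
def original_order():
    """(i, j) pairs visited by: for i in 0..5, for j in i..7."""
    return [(i, j) for i in range(0, 6) for j in range(i, 8)]


def interchanged_order():
    """(i, j) pairs visited by: for j in 0..7, for i in 0..min(5, j)."""
    return [(i, j) for j in range(0, 8) for i in range(0, min(5, j) + 1)]
```

Both functions produce the same 33 iterations, only in a different order, which is exactly what a reordering transformation should do.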

Code generation by loop transformation
• Analysis:
  • Rectangular: all loop bounds are constants → Easy
  • More complicated, but still quite realistic: the upper and/or lower bounds on one loop index can depend on the values of the indexes of the outer loops. → ??
• Goal:
  • outermost loop bounds: constants
  • inner loop bounds: linear combinations of outer loop index variables and constants.

Fourier-Motzkin elimination
• Input: a polyhedron S defined by a set of linear constraints on x_1, x_2, ..., x_n, and a given variable x_m that is to be eliminated.
• Output: a polyhedron S’ defined by linear constraints on x_1, x_2, ..., x_{m-1}, x_{m+1}, ..., x_n that is the projection of S onto the dimensions other than x_m.

[Figure: iteration space of the loop nest below]
    for (i=0; i<=5; i++)
      for (j=i; j<=7; j++)
        Z[j, i] = 0;

Fourier-Motzkin Elimination
Algorithm:
• For every pair of a lower bound and an upper bound on x_m, such as L <= c_1 x_m and c_2 x_m <= U (with c_1, c_2 > 0), create a new constraint c_2 L <= c_1 U.
• S’ is the set including all new constraints and those in S that do not contain x_m.
• It is possible that S’ is an empty space.
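One elimination step can be sketched as follows. This is a minimal illustration under stated assumptions, not the slides' implementation: the name `eliminate` and the encoding of each constraint as a pair `(coeffs, bound)` meaning `sum(coeffs[k] * x[k]) <= bound` are choices made here. The projection is over the rationals; it is exact for integer points when the eliminated variable only appears with coefficients +1 or -1, as in the loop nests used in these slides.

```python
def eliminate(constraints, m):
    """Project out variable m from a list of (coeffs, bound) constraints,
    each meaning sum(coeffs[k] * x[k]) <= bound."""
    uppers = [c for c in constraints if c[0][m] > 0]   # upper bounds on x_m
    lowers = [c for c in constraints if c[0][m] < 0]   # lower bounds on x_m
    result = [c for c in constraints if c[0][m] == 0]  # do not mention x_m
    for cu, bu in uppers:
        for cl, bl in lowers:
            su, sl = -cl[m], cu[m]  # positive multipliers that cancel x_m
            coeffs = tuple(su * x + sl * y for x, y in zip(cu, cl))
            bound = su * bu + sl * bl
            # Drop trivially true constraints (0 <= non-negative bound);
            # keep 0 <= negative bound, which signals an empty space.
            if any(coeffs) or bound < 0:
                result.append((coeffs, bound))
    return result
```

With variables ordered (i, j), the constraints i >= 0, i <= 5, j >= i, j <= 7 are written as `[((-1, 0), 0), ((1, 0), 5), ((1, -1), 0), ((0, 1), 7)]`. Eliminating index 0 (the variable i) leaves `((0, 1), 7)` and `((0, -1), 0)`, that is j <= 7 and j >= 0.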

Example
    for (i=0; i<=5; i++)
      for (j=i; j<=7; j++)
        Z[j, i] = 0;

Constraints: i>=0; i<=5; j>=i; j<=7;

To eliminate i:
• one lower bound: 0 <= i
• two upper bounds: i <= j and i <= 5.
• This generates two constraints: 0 <= j and 0 <= 5.
• The latter is trivially true and can be ignored.
• The former gives the lower bound on j, and the original upper bound j <= 7 gives the upper bound.

Result: i>=0; i<=min(5,j); j>=0; j<=7;

Loop-Bounds Generation Algorithm
• Compute the loop bounds from the innermost to the outer loops.

    S_n = S;
    for (i=n; i>=1; i--) {
      L_vi = all the lower bounds on v_i in S_i;
      U_vi = all the upper bounds on v_i in S_i;
      S_{i-1} = Constraints by eliminating v_i from S_i;
    }
    /* remove redundancies */
    S’ = ∅;
    for (i=1; i<=n; i++) {
      Remove any bounds in L_vi and U_vi implied by S’;
      Add the remaining constraints of L_vi and U_vi on v_i to S’;
    }

Example:
    for (i=0; i<=5; i++)
      for (j=i; j<=7; j++)
        Z[j, i] = 0;

Constraints: i>=0; i<=5; j>=i; j<=7;
Target order: j, i
• L_i: 0; U_i: 5, j; bounds on i are (0, min(5,j));
• L_j: 0; U_j: 7; bounds on j are (0, 7).

Loop-Bounds Generation
• Compute the loop bounds from the innermost to the outer loops, using the same algorithm as on the previous slide. Here the upper bound on i is changed to 8:

    for (i=0; i<=8; i++)
      for (j=i; j<=7; j++)
        Z[j, i] = 0;

Constraints: i>=0; i<=8; j>=i; j<=7;
Target order: j, i
• L_i: 0; U_i: 8, j; bounds on i are (0, j). The bound i <= 8 is implied by i <= j and j <= 7, so the redundancy-removal pass drops it.
• L_j: 0; U_j: 7; bounds on j are (0, 7).
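The first pass of the algorithm (collect bounds, then eliminate, from the innermost loop outward) can be sketched as below. This is an illustration under assumptions, not the slides' code: the names `eliminate`, `loop_bounds`, and `enumerate_points` are hypothetical, constraints are encoded as `(coeffs, bound)` meaning `sum(coeffs[k] * x[k]) <= bound`, every loop variable is assumed to have at least one lower and one upper bound, and the redundancy-removal pass is skipped (redundant bounds are harmless when taking max and min).

```python
def eliminate(constraints, m):
    """Fourier-Motzkin step: project out variable m."""
    uppers = [c for c in constraints if c[0][m] > 0]
    lowers = [c for c in constraints if c[0][m] < 0]
    result = [c for c in constraints if c[0][m] == 0]
    for cu, bu in uppers:
        for cl, bl in lowers:
            su, sl = -cl[m], cu[m]
            coeffs = tuple(su * x + sl * y for x, y in zip(cu, cl))
            bound = su * bu + sl * bl
            if any(coeffs) or bound < 0:
                result.append((coeffs, bound))
    return result


def loop_bounds(constraints, order):
    """order lists variable indices, outermost loop first.
    Returns {var: (lower_constraints, upper_constraints)}."""
    bounds, s = {}, list(constraints)
    for v in reversed(order):                 # innermost to outermost
        lowers = [c for c in s if c[0][v] < 0]
        uppers = [c for c in s if c[0][v] > 0]
        bounds[v] = (lowers, uppers)
        s = eliminate(s, v)
    return bounds


def enumerate_points(bounds, order, nvars):
    """Run the generated loop nest and collect the visited points."""
    points, x = [], [0] * nvars

    def visit(depth):
        if depth == len(order):
            points.append(tuple(x))
            return
        v = order[depth]
        lowers, uppers = bounds[v]
        rest = lambda c: sum(c[k] * x[k] for k in range(nvars) if k != v)
        # ceil((b - rest) / c[v]) for lower bounds, floor for upper bounds
        lo = max(-((-(b - rest(c))) // c[v]) for c, b in lowers)
        hi = min((b - rest(c)) // c[v] for c, b in uppers)
        for val in range(lo, hi + 1):
            x[v] = val
            visit(depth + 1)
        x[v] = 0  # reset so outer-level bound evaluation never sees stale values

    visit(0)
    return points
```

With variables ordered (i, j), the constraints from this slide are `S = [((-1, 0), 0), ((1, 0), 8), ((1, -1), 0), ((0, 1), 7)]`, and `enumerate_points(loop_bounds(S, [1, 0]), [1, 0], 2)` visits j from 0 to 7 and i from 0 to min(8, j). Since j never exceeds 7, this coincides with the simplified bounds (0, j) shown above.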

Target: sweep through diagonally.

Original:
    for (i=0; i<=5; i++)
      for (j=i; j<=7; j++)
        Z[j, i] = 0;

Constraints: i>=0; i<=5; j>=i; j<=7;

Desired order of [i, j] pairs:
    [0,0], [1,1], [2,2], [3,3], [4,4], [5,5]
    [0,1], [1,2], [2,3], [3,4], [4,5], [5,6]
    [0,2], [1,3], [2,4], [3,5], [4,6], [5,7]
    ...
    [0,6], [1,7]
    [0,7]

Let k = j-i; loop order: k, j.
Constraints after substituting i = j-k: j-k>=0; j-k<=5; j>=j-k; j<=7.
• L_j: k; U_j: 5+k, 7
• L_k: 0; U_k: 7

Generated code:
    for (k=0; k<=7; k++)
      for (j=k; j<=min(5+k,7); j++)
        Z[j, j-k] = 0;
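The generated diagonal loop nest can be checked against the original by enumerating the [i, j] pairs, recovering i as j - k. This is a sanity-check sketch with assumed function names, not part of the original slides.

```python
def diagonal_sweep():
    """[i, j] pairs visited by: for k in 0..7, for j in k..min(5+k, 7)."""
    return [(j - k, j)
            for k in range(0, 8)
            for j in range(k, min(5 + k, 7) + 1)]


def original_iterations():
    """[i, j] pairs visited by: for i in 0..5, for j in i..7."""
    return [(i, j) for i in range(0, 6) for j in range(i, 8)]
```

The first six pairs of `diagonal_sweep()` form the main diagonal [0,0] through [5,5], the last pair is [0,7], and the set of pairs is the same as for the original nest.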
