A Relaxed Criterion for Loop Tiling



slide-1
SLIDE 1

A Relaxed Criterion for Loop Tiling

Riyadh Baghdadi, Albert Cohen, Sven Verdoolaege

UPMC/INRIA/ENS

September 22, 2015

1/22

slide-3
SLIDE 3

Tiling

Main benefit: enhance data locality

Useful in architectures with a memory hierarchy

2/22

slide-4
SLIDE 4

Tiling Example

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        C[i][j] = A[j] + B[j];

[Figure: iteration space over i and j, showing the execution order]
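As a minimal sketch of what tiling does to the loop nest above, the code below shows the original loops next to a tiled version. The function names, the helper `min_size`, and the tile size `TILE` are assumptions made for illustration; the slides do not give a tile size or a tiled listing.

```c
#include <stddef.h>

/* Hypothetical tile size, chosen only for illustration. */
enum { TILE = 4 };

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Original loop nest from the slide. */
void add_rows(size_t n, double C[n][n], const double A[n], const double B[n])
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            C[i][j] = A[j] + B[j];
}

/* Tiled version: the two outer loops walk over TILE x TILE blocks,
   the two inner loops walk over the points inside one block.
   The same iterations run, only in a different order. */
void add_rows_tiled(size_t n, double C[n][n], const double A[n], const double B[n])
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = ii; i < min_size(ii + TILE, n); i++)
                for (size_t j = jj; j < min_size(jj + TILE, n); j++)
                    C[i][j] = A[j] + B[j];
}
```

Within one tile, the same few elements of `A` and `B` are reused for several values of `i` before the next block of `j` is touched, which is the data-locality benefit the earlier slides refer to.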

3/22

slide-9
SLIDE 9

Permutability Requirements

To perform tiling, we check for permutability.

Classical loop permutability criterion

Each dependence is forward in each loop of the band.

A dependence is forward if it is oriented from earlier to later iterations.
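The classical criterion can be sketched as a check on dependence distance vectors. This assumes uniform dependences that are each described by a constant distance vector with one component per loop of the band; that representation, the name `band_is_permutable`, and the constant `DEPTH` are assumptions for illustration, not something the slides prescribe.

```c
#include <stdbool.h>
#include <stddef.h>

/* Depth of the loop band under test (assumed: two loops, i and j). */
enum { DEPTH = 2 };

/* Returns true when every dependence is forward in each loop of the band,
   i.e. no distance vector has a negative component. */
bool band_is_permutable(size_t ndeps, int dist[][DEPTH])
{
    for (size_t d = 0; d < ndeps; d++)
        for (size_t k = 0; k < DEPTH; k++)
            if (dist[d][k] < 0)
                return false;   /* backward in loop k */
    return true;
}
```

For instance, the distance vectors (1, 0) and (0, 1) pass the check, while a single vector (1, -1) fails it because it points backward in the second loop.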

4/22

slide-12
SLIDE 12

Permutability Requirement Examples

A[i][j] = f(A[i-1][j], A[i][j-1]);

[Figure: iteration space over i and j, showing dependences and execution order]

OK: every dependence is forward in both i and j.

5/22

slide-15
SLIDE 15

Permutability Requirement Examples

A = f(A);

[Figure: iteration space over i and j, showing dependences (only some dependences shown) and execution order]

NOT OK

5/22

slide-16
SLIDE 16

Loop Transformation Legality

A loop transformation is correct if live ranges do not interfere after the transformation

6/22

slide-20
SLIDE 20

True and False Dependences

Types of dependences

true dependences: write → read

false dependences:

  • anti dependence: read → write

  • output dependence: write → write

False dependences

  • are caused by memory reuse

  • prevent live ranges from overlapping

[Figure: statements s1 and s2 in iterations j and j+1, with a live range and a WAR dependence]

Dependences adjacent to live ranges
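The classification above can be written down as a small sketch. It assumes two accesses to the same memory location, with `first` executed before `second`; the type and function names are hypothetical.

```c
#include <stdbool.h>

typedef enum { ACCESS_READ, ACCESS_WRITE } access_kind;
typedef enum { DEP_NONE, DEP_TRUE, DEP_ANTI, DEP_OUTPUT } dep_kind;

/* Classifies the dependence between two accesses to the same location,
   where `first` executes before `second`. */
dep_kind classify_dependence(access_kind first, access_kind second)
{
    if (first == ACCESS_WRITE && second == ACCESS_READ)
        return DEP_TRUE;     /* write -> read */
    if (first == ACCESS_READ && second == ACCESS_WRITE)
        return DEP_ANTI;     /* read -> write (WAR) */
    if (first == ACCESS_WRITE && second == ACCESS_WRITE)
        return DEP_OUTPUT;   /* write -> write */
    return DEP_NONE;         /* read -> read carries no dependence */
}

/* Anti and output dependences are the false dependences. */
bool is_false_dependence(dep_kind k)
{
    return k == DEP_ANTI || k == DEP_OUTPUT;
}
```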

7/22

slide-23
SLIDE 23

False Dependences prevent Tiling

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        S1: t = A[i];
        S2: B[i][j] = t;
    }

[Figure: iteration space over i and j; each point executes S1 then S2]

Classical tiling criterion: not allowed

but tiling is possible

8/22

slide-26
SLIDE 26

Relaxed Permutability Criterion

Main idea

  • a loop transformation is correct if it does not lead to live range interference

  • tiling only changes the order of execution of iterations

  • if live ranges are local to an iteration, then they are guaranteed not to interfere due to tiling

9/22

slide-28
SLIDE 28

Relaxed Permutability Criterion

Classical Permutability Criterion

Each dependence is forward in each loop of the band

Relaxed Permutability Criterion

The same classical criterion except that we ignore anti-dependences that are adjacent to only local live ranges
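A minimal sketch of how the two criteria differ, assuming each dependence is given as a distance vector plus a flag that some earlier live-range analysis (not shown here) has set when the dependence is adjacent only to live ranges local to one iteration. The struct layout, the flag, and all names are assumptions for illustration and are not taken from the authors' implementation.

```c
#include <stdbool.h>
#include <stddef.h>

enum { BAND_DEPTH = 2 };   /* assumed: a band of two loops */

typedef enum { FLOW_DEP, ANTI_DEP, OUTPUT_DEP } dep_type;

typedef struct {
    dep_type type;
    int dist[BAND_DEPTH];          /* distance in each loop of the band */
    bool only_local_live_ranges;   /* assumed result of a prior analysis */
} dependence;

static bool is_forward(const dependence *d)
{
    for (size_t k = 0; k < BAND_DEPTH; k++)
        if (d->dist[k] < 0)
            return false;
    return true;
}

/* Classical criterion: every dependence must be forward. */
bool permutable_classical(size_t ndeps, const dependence deps[])
{
    for (size_t d = 0; d < ndeps; d++)
        if (!is_forward(&deps[d]))
            return false;
    return true;
}

/* Relaxed criterion: same test, but anti-dependences adjacent only to
   local live ranges are skipped. */
bool permutable_relaxed(size_t ndeps, const dependence deps[])
{
    for (size_t d = 0; d < ndeps; d++) {
        if (deps[d].type == ANTI_DEP && deps[d].only_local_live_ranges)
            continue;
        if (!is_forward(&deps[d]))
            return false;
    }
    return true;
}
```

With a backward anti-dependence such as (1, -1) that is flagged as local, the classical check rejects the band while the relaxed check accepts it. The same vector without the flag is rejected by both.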

10/22

slide-31
SLIDE 31

False Dependences prevent Tiling

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        S1: t = A[i];
        S2: B[i][j] = t;
    }

[Figure: iteration space over i and j; each point executes S1 then S2]

Classical tiling criterion: not allowed

but tiling is possible

Relaxed tiling criterion: allowed
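To make "tiling is possible" concrete, the sketch below tiles the loop nest while keeping the scalar `t`. Because `t` is written and then read inside the same iteration, reordering whole iterations cannot make its live ranges interfere. The function names and the tile size are hypothetical.

```c
#include <stddef.h>

/* Hypothetical tile size, chosen only for illustration. */
enum { TILE_SIZE = 4 };

static size_t smaller(size_t a, size_t b) { return a < b ? a : b; }

/* Original loop nest from the slide. */
void broadcast_rows(size_t n, double B[n][n], const double A[n])
{
    double t;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            t = A[i];       /* S1 */
            B[i][j] = t;    /* S2 */
        }
}

/* Tiled version: t is still a single scalar, with no expansion or
   privatization. Each iteration writes t (S1) and reads it (S2) before
   any other iteration runs. */
void broadcast_rows_tiled(size_t n, double B[n][n], const double A[n])
{
    double t;
    for (size_t ii = 0; ii < n; ii += TILE_SIZE)
        for (size_t jj = 0; jj < n; jj += TILE_SIZE)
            for (size_t i = ii; i < smaller(ii + TILE_SIZE, n); i++)
                for (size_t j = jj; j < smaller(jj + TILE_SIZE, n); j++) {
                    t = A[i];       /* S1 */
                    B[i][j] = t;    /* S2 */
                }
}
```

Both functions fill `B` with the same values, which is what the relaxed criterion predicts for this example.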

11/22

slide-32
SLIDE 32

Tilability of Selected PolyBench Benchmarks

Benchmark    Original   3AC   3AC + relaxed criterion   Expanded
3mm          yes        no    yes                       yes
dynprog      yes        no    yes                       yes
fdtd-2d      yes        no    yes                       yes
syr2k        yes        no    yes                       yes
fdtd-apml    yes        no    yes                       yes
bicg         yes        no    yes                       yes
symm         no         no    yes                       yes
cholesky     no         no    yes                       yes

12/22

slide-33
SLIDE 33

Conclusion

Relaxed permutability criterion

  • Allows tiling in presence of false dependences

  • No expansion or privatization required

Future directions

  • Combination with on-demand array expansion

13/22

slide-34
SLIDE 34

Outline

Relaxed Permutability Criterion

Conclusion

PENCIL

14/22

slide-38
SLIDE 38

Motivation

Programming accelerators: low level APIs (OpenCL, CUDA, ...)

Problems of low level APIs: difficult to use, non portable code, ...

Solution

  • Write code in a high level language

  • Use a compiler for parallelization/optimization

16/22

slide-39
SLIDE 39

PENCIL

17/22

slide-42
SLIDE 42

Pencil Intermediate Language

Subset of C99

Restrictions on pointer use

  • goals: no write aliasing, constant array references, ...

  • restrictions:

      C99 VLA syntax for array declaration (int A[m] instead of int *A)

      cannot read or write to a pointer, except when passing an array reference to a function: foo(A);

No gotos

Extensions (builtins and directives)

  • __pencil_assume(expression)

  • __pencil_kill(T)

  • __pencil_reduce(...)

  • #pragma pencil independent

Summary functions
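A small sketch of what a function written in this style might look like, using only the VLA array syntax, `__pencil_assume`, and the `independent` directive named above. The function `scale` is hypothetical, and the fallback macro is an assumption added so that the sketch also compiles with a plain C99 compiler, where the unknown pragma is simply ignored.

```c
#include <stddef.h>

/* Assumption: fallback so the sketch builds without a PENCIL-aware
   compiler, which would provide the builtin itself. */
#ifndef __pencil_assume
#define __pencil_assume(expr) ((void)0)
#endif

/* Arrays are declared with C99 VLA syntax (float A[n]) instead of
   pointers (float *A), as the restrictions above require. */
void scale(int n, float A[n], float B[n], float k)
{
    /* Affine constraint on a loop parameter, added to the context. */
    __pencil_assume(n > 0);

    /* States that the loop carries no dependences between iterations. */
    #pragma pencil independent
    for (int i = 0; i < n; i++)
        B[i] = k * A[i];
}
```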

18/22

slide-43
SLIDE 43

Compiling Pencil

We use the PPCG polyhedral compiler

Polyhedral model: an algebraic representation of programs (focus on loop nests).

Static-affine control

  • Static control: control flow is not data dependent (a condition such as if (A[i]) is data dependent).

  • Loop bounds, conditionals and array subscripts should be affine with respect to the loop iterators and a set of symbolic constants.

  • Affine: i + j ≥ 0. Non-affine: i * i ≥ 0
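The distinction can be illustrated with two small loop nests. Both functions are hypothetical examples written for this note: the first satisfies the static-affine conditions listed above, the second violates them in three places.

```c
#include <stddef.h>

/* Static-affine: the bounds (i < n, j <= i) and the subscript (i + j)
   are affine in the iterators i, j and the symbolic constant n.
   A must hold at least 2*n - 1 elements. */
long sum_affine(size_t n, const int A[])
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j <= i; j++)
            s += A[i + j];
    return s;
}

/* Not static-affine. A must hold at least n elements. */
long sum_non_affine(size_t n, const int A[])
{
    long s = 0;
    for (size_t i = 0; i * i < n; i++)   /* non-affine loop bound */
        if (A[i])                        /* data-dependent condition */
            s += A[i * i];               /* non-affine subscript */
    return s;
}
```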

19/22

slide-46
SLIDE 46

Compiling Pencil Extensions

Non static-affine array accesses (read/write)

Treated as a may access to the whole array dimension

Example:

A[i] = B[foo(i)];

is treated as

A[i] = B[*];

Non static-affine conditionals

if (A[i]) {
    ...
    B[i] = 0;
    ...
    C[i] = 0;
}

20/22

slide-48
SLIDE 48

Compiling Pencil Extensions

__pencil_assume(expression)

  • expression is an affine constraint on loop parameters

  • expression is added to the context (set of affine constraints on loop parameters)

  • This information (context) is used whenever needed

independent directive

  • Currently used during parallelism detection (removes loop-carried dependences)

21/22

slide-49
SLIDE 49

Image Processing: Experiments

Speedup of code generated from Pencil over OpenCV library

[Figure: three bar charts, one per platform (Nvidia Fermi, ARM Mali, AMD Radeon), comparing OpenCV-OpenCL and PPCG-OpenCL. Vertical axis: speedups on a logarithmic scale, with ticks from 0.2 to 10.0. Benchmarks: resize, dilate, color conversion, affine warping, 2D convolution, gaussian smoothing, basic histogram.]

22/22