Compiling Affine Loop Nests for Distributed-Memory Parallel Architectures

Uday Bondhugula, Indian Institute of Science
Supercomputing 2013, Nov 16–22, 2013, Denver, Colorado

1 / 46


1. Introduction
2. Distributed-memory code generation
     – The problem, challenges, and past efforts
     – Our approach (Pluto distmem)
3. Experimental Evaluation
4. Conclusions

2 / 46


Introduction

Distributed-memory compilation

Manual parallelization for distributed memory is extremely hard (even for affine loop nests).

Objective: automatically generate MPI code from sequential C affine loop nests.

3 / 46


Introduction

Distributed-memory compilation – why again?

Large amount of literature already exists through the early 1990s
1. Past works: limited success
2. Still, no automatic tool has been available
3. However, we now have new polyhedral libraries, transformation frameworks, code generators, and tools
4. The same techniques are needed to compile for CPU–GPU heterogeneous multicores
5. Can be integrated with emerging runtimes

Make a fresh attempt to solve this problem.

4 / 46


Introduction

Why do we need communication?

Communication during parallelization is a result of data dependences.
No data dependences ⇒ (almost) no communication.
A parallel loop implies no dependences are satisfied by it.

Communication is due to dependences that are satisfied outside the parallel loop but have (non-zero) components on it.

5 / 46


Introduction

Dependences and Communication

Figure: Inner parallel loop, j: hyperplane (0,1)

The inner loop can be executed in parallel, with communication for each iteration of the outer sequential loop.
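As an illustration (an assumed example, not from the slides), consider a stencil with dependence vectors (1,0) and (1,1). Both dependences are satisfied by the outer i loop, so the inner j loop is parallel (hyperplane (0,1)); but the (1,1) dependence has a non-zero j component, so when j is distributed across processors, boundary values must be exchanged after every iteration of i:

    /* Hypothetical stencil: dependences (1,0) and (1,1) are both satisfied by
       the outer i loop; the non-zero j component of (1,1) forces communication
       between j-partitions at every i iteration. */
    void sweep(int N, double A[N][N]) {
      for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)   /* inner loop is parallel: hyperplane (0,1) */
          A[i][j] = A[i-1][j] + A[i-1][j-1];
    }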

6 / 46


Introduction

A polyhedral optimizer – various phases

1. Extracting a polyhedral representation (from sequential C)
2. Dependence analysis
3. Transformation and parallelization
4. Code generation (getting back out of the polyhedral representation)

[Figure: example loop nests with iteration domains 0 <= i, j, k <= N−1, a compiler IR fragment, and a hyperplane φ(x) = k splitting the iteration space into φ(x) ≤ k and φ(x) ≥ k]

7 / 46


Introduction

Distributed-memory parallelization

Involves a number of sub-problems

1. Finding the right computation partitioning
2. Data distribution and data allocation (weak scaling)
3. Determining communication sets given the above
4. Packing and unpacking data
5. Determining communication partners given the above

8 / 46


Distributed-memory code generation

1. Introduction
2. Distributed-memory code generation
     – The problem, challenges, and past efforts
     – Our approach (Pluto distmem)
3. Experimental Evaluation
4. Conclusions

9 / 46


Distributed-memory code generation The problem, challenges, and past efforts

Distributed-memory code generation

What to send? Whom to send to?

Difficulties:
– For non-uniform dependences, it is not known how far dependences traverse
– The number of iterations (or tiles) is not known at compile time
– The number of processors may not be known at compile time (portability)
– Virtual-to-physical processor approach: are you sending to two virtual processors that are the same physical processor?

10 / 46


Distributed-memory code generation The problem, challenges, and past efforts

A near-neighbor computation example

for (t=1; t<=T-1; t++) {
  for (j=1; j<=N-1; j++) {
    u[t%2][j] = 0.333 * (u[(t-1)%2][j-1] + u[(t-1)%2][j] + u[(t-1)%2][j+1]);
  }
}

[Figure: (t, j) iteration space of extent T × N, showing tiles and the data communicated between them]

11 / 46


Distributed-memory code generation The problem, challenges, and past efforts

Floyd-Warshall example

Used to compute all-pairs shortest paths in a directed graph.

for (k=0; k < N; k++) {
  for (y=0; y < N; y++) {
    for (x=0; x < N; x++) {
      pathDistanceMatrix[y][x] = min(pathDistanceMatrix[y][k] + pathDistanceMatrix[k][x],
                                     pathDistanceMatrix[y][x]);
    }
  }
}

Figure: Floyd-Warshall algorithm

12 / 46


Distributed-memory code generation The problem, challenges, and past efforts

Floyd-Warshall communication pattern

[Figure: N × N iteration space showing the iterations mapped to a single processor (tile size = 1 for the parallel loop) and the flow-out set (row k, column k)]

Figure: Communication for Floyd-Warshall: at outer loop iteration k − 1, the processor(s) updating the kth row and kth column broadcast them to processors along their column and row, respectively.

13 / 46


Distributed-memory code generation The problem, challenges, and past efforts

Code generation after transformation: example – 2-d seidel

Performing distributed memory code generation after transformation

for (t=0; t<=T-1; t++) {
  for (i=1; i<=N-2; i++) {
    for (j=1; j<=N-2; j++) {
      a[i][j] = (a[i-1][j-1] + a[i-1][j] + a[i-1][j+1]
               + a[i][j-1]   + a[i][j]   + a[i][j+1]
               + a[i+1][j-1] + a[i+1][j] + a[i+1][j+1]) / 9.0;
    }
  }
}

Distance vectors: (0,1,1), (0,1,0), (0,1,-1), (0,0,1), (1,0,0), (1,-1,1), (1,0,-1), (1,-1,0), (1,-1,-1)

14 / 46


Distributed-memory code generation The problem, challenges, and past efforts

Code generation after transformation

Performing distributed memory code generation on transformed code

for (t=0; t<=T-1; t++) {
  for (i=1; i<=N-2; i++) {
    for (j=1; j<=N-2; j++) {
      a[i][j] = (a[i-1][j-1] + a[i-1][j] + a[i-1][j+1]
               + a[i][j-1]   + a[i][j]   + a[i][j+1]
               + a[i+1][j-1] + a[i+1][j] + a[i+1][j+1]) / 9.0;
    }
  }
}

[Figure: (t, i) iteration space of extent (T−1) × (N−2) with dependences]

15 / 46


Distributed-memory code generation The problem, challenges, and past efforts

Code generation after transformation

Performing distributed memory code generation on transformed code

for (t=0; t<=T-1; t++) {
  for (i=1; i<=N-2; i++) {
    for (j=1; j<=N-2; j++) {
      a[i][j] = (a[i-1][j-1] + a[i-1][j] + a[i-1][j+1]
               + a[i][j-1]   + a[i][j]   + a[i][j+1]
               + a[i+1][j-1] + a[i+1][j] + a[i+1][j+1]) / 9.0;
    }
  }
}

Distance vectors: (0,1,1), (0,1,0), (0,1,-1), (0,0,1), (1,0,0), (1,-1,1), (1,0,-1), (1,-1,0), (1,-1,-1)

T(t, i, j) = (t, t + i, 2t + i + j) (a sketch of the resulting loop nest follows below)
– Tile all dimensions
– Create a tile schedule, and identify the loop to be parallelized
– Generate communication primitives on this code
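As a rough sketch (assumed from the transformation above, before tiling), applying T gives new iterators c1 = t, c2 = t + i, c3 = 2t + i + j, with the original indices recovered as i = c2 − c1 and j = c3 − c1 − c2:

    /* Skewed (not yet tiled) form of the seidel loop nest under
       T(t, i, j) = (t, t+i, 2t+i+j); bounds follow from 0 <= t <= T-1 and
       1 <= i, j <= N-2. */
    for (c1 = 0; c1 <= T-1; c1++)
      for (c2 = c1+1; c2 <= c1+N-2; c2++)
        for (c3 = c1+c2+1; c3 <= c1+c2+N-2; c3++)
          a[c2-c1][c3-c1-c2] = (a[c2-c1-1][c3-c1-c2-1] + a[c2-c1-1][c3-c1-c2]
                              + a[c2-c1-1][c3-c1-c2+1] + a[c2-c1][c3-c1-c2-1]
                              + a[c2-c1][c3-c1-c2]     + a[c2-c1][c3-c1-c2+1]
                              + a[c2-c1+1][c3-c1-c2-1] + a[c2-c1+1][c3-c1-c2]
                              + a[c2-c1+1][c3-c1-c2+1]) / 9.0;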

16 / 46


Distributed-memory code generation The problem, challenges, and past efforts

Computing data accessed

if ((N >= 3) && (T >= 1)) {
  for (t1=0; t1<=floord(N+2*T-4,32); t1++) {
    lbp = max(ceild(t1,2), ceild(32*t1-T+1,32));
    ubp = min(min(floord(N+T-3,32), floord(32*t1+N+29,64)), t1);
#pragma omp parallel for
    for (t2=lbp; t2<=ubp; t2++) {
      for (t3=max(ceild(64*t2-N-28,32),t1); t3<=min(min(min(min(floord(N+T-3,16), floord(32*t1-32*t2+N+29,16)), ... )); t3++) {
        for (t4=max(max(max(32*t1-32*t2, 32*t2-N+2), 16*t3-N+2), -32*t2+32*t3-N-29); t4<=min(min(min(min( ... ))); t4++) {
          for (t5=max(max(32*t2, t4+1), 32*t3-t4-N+2); t5<=min(min(32*t2+31, 32*t3-t4+30), t4+N-2); t5++) {
            for (t6=max(32*t3, t4+t5+1); t6<=min(32*t3+31, t4+t5+N-2); t6++) {
              a[-t4+t5][-t4-t5+t6] = (a[-t4+t5-1][-t4-t5+t6-1] + a[-t4+t5-1][-t4-t5+t6] + ... );  /* remaining terms of the 9-point stencil */
            }
          }
        }
      }
    }
    /* communication code should go here */
  }
}

Image of (−t4 + t5, −t4 − t5 + t6) over an integer set: straightforward to accomplish via polyhedral libraries.
– ISL: just create an isl map (see the sketch below)
– Polylib: use the polylib image function or projections
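A minimal sketch (an assumption for illustration, not the tool's code; the iteration-set constraints below are placeholders standing in for the real tile bounds) of computing such an image with isl: read the iteration set and the access map from strings, then apply the map to the set.

    #include <isl/ctx.h>
    #include <isl/set.h>
    #include <isl/map.h>

    /* Compute the data accessed as the image of an iteration set under the
     * access map a[-t4+t5][-t4-t5+t6]. */
    int main(void) {
      isl_ctx *ctx = isl_ctx_alloc();

      /* Iterations of one tile, parametric in t1, t2, N, T (placeholder bounds). */
      isl_set *iters = isl_set_read_from_str(ctx,
          "[t1, t2, N, T] -> { [t4, t5, t6] : 1 <= t4 <= T - 2 and "
          "t4 + 1 <= t5 <= t4 + N - 2 and t4 + t5 + 1 <= t6 <= t4 + t5 + N - 2 and "
          "32 * t2 <= t5 <= 32 * t2 + 31 and 32 * t1 <= t6 <= 32 * t1 + 31 }");

      /* Access function: (t4, t5, t6) -> a(-t4 + t5, -t4 - t5 + t6). */
      isl_map *access = isl_map_read_from_str(ctx,
          "{ [t4, t5, t6] -> a[-t4 + t5, -t4 - t5 + t6] }");

      /* Data accessed = image of the iteration set under the access map. */
      isl_set *data = isl_set_apply(iters, access);   /* consumes iters and access */
      isl_set_dump(data);                             /* print the resulting set */

      isl_set_free(data);
      isl_ctx_free(ctx);
      return 0;
    }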

17 / 46


Distributed-memory code generation The problem, challenges, and past efforts

Computing data accessed – parametric

What we are interested in: the data accessed for a given t1, t2, for example.
Parametric in t1, t2, N (don't eliminate t1, t2 from the system).
Yields the data written to or read in a given iteration.
For the previous code, given t1, t2, N, we get:

    1 ≤ d2 ≤ N − 2
    max(1, 32t2 − 31) ≤ d1 ≤ min(T − 2, 32t2 + 31)
    64t2 − 32t1 − 31 ≤ d1 ≤ 64t2 − 32t1 + 31
    −31 ≤ 32t1 − 32t2 ≤ N − 1

d1 can be bounded.

18 / 46


Distributed-memory code generation The problem, challenges, and past efforts

Past approaches

1. Access-function based [dHPF PLDI'98, Griebl-Classen IPDPS'06]
2. Dependence-based [Amarasinghe-Lam PLDI'93]

Our approach is dependence-based:
+ Dependence information is already available (the last-writer property would mean some of the analysis need not be redone)
+ Natural
− May not be the right granularity

19 / 46


Distributed-memory code generation Our approach (Pluto distmem)

1. Introduction
2. Distributed-memory code generation
     – The problem, challenges, and past efforts
     – Our approach (Pluto distmem)
3. Experimental Evaluation
4. Conclusions

20 / 46


Distributed-memory code generation Our approach (Pluto distmem)

Pluto-distmem: Dependences and Communication Sets

Flow dependences lead to communication (anti and output dependences do not).
The flow-out set of a tile is the set of all values that are written to inside the tile and then next read from outside the tile.
The write-out set of a tile is the set of all data elements whose last write access across the entire iteration space is performed in the tile.
Construct flow-out sets using flow dependences.

21 / 46


Distributed-memory code generation Our approach (Pluto distmem)

Flow-out set

for (t=1; t<=T-1; t++)
  for (j=1; j<=N-1; j++)
    u[t%2][j] = 0.333 * (u[(t-1)%2][j-1] + u[(t-1)%2][j] + u[(t-1)%2][j+1]);

[Figure: (t, i) iteration space showing dependences, tiles, and the flow-out set FO(ST) of a tile ST; FO(ST) is sent to {π(RT1) ∪ π(RT2) ∪ π(RT3)}]

22 / 46


Distributed-memory code generation Our approach (Pluto distmem)

Computing flow-out set for variable x

Input: depth of the parallel loop l; the set S_w of ⟨write access, statement⟩ pairs for variable x

 1: F_out^x = ∅
 2: for each ⟨M_w, S_i⟩ ∈ S_w do
 3:   for each dependence e(S_i → S_j) ∈ E do
 4:     if e is of type RAW and the source access of e is M_w then
 5:       E_l = { t_1^i = t_1^j ∧ t_2^i = t_2^j ∧ ... ∧ t_l^i = t_l^j }
 6:       C_e^t = D_e^T ∩ E_l
 7:       I_e^t = project_out(C_e^t, m_Si + 1, m_Sj)
 8:       O_e^t = project_out(D_e^T, m_Si + 1, m_Sj) \ I_e^t
 9:       F_out^x = F_out^x ∪ I_p(M_w^Si, O_e^t, l)
10:     end if
11:   end for
12: end for

Output: F_out^x
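As a rough illustration of the idea behind steps 6–9 (an assumed isl-based sketch, not the algorithm's exact projection-based formulation; the dependence, tiling, and access below are made up), the flow-out contribution of one RAW dependence can be obtained by removing dependence instances whose source and target fall in the same tile, and applying the source statement's write access to the remaining source iterations:

    #include <isl/ctx.h>
    #include <isl/set.h>
    #include <isl/map.h>

    /* Illustrative sketch only: a 1-d tiled space with points [tile, iter]. */
    int main(void) {
      isl_ctx *ctx = isl_ctx_alloc();

      /* Dependence polyhedron D_e^T: source (ts, s) -> target (tt, t) iterations. */
      isl_map *dep = isl_map_read_from_str(ctx,
          "[N] -> { [ts, s] -> [tt, t] : t = s + 1 and 0 <= s < N - 1 and "
          "32 * ts <= s <= 32 * ts + 31 and 32 * tt <= t <= 32 * tt + 31 }");

      /* Instances whose source and target lie in the same tile. */
      isl_map *same_tile = isl_map_read_from_str(ctx,
          "{ [ts, s] -> [tt, t] : tt = ts }");
      isl_map *intra = isl_map_intersect(isl_map_copy(dep), same_tile);

      /* Inter-tile instances: the ones that cause communication. */
      isl_map *inter = isl_map_subtract(dep, intra);

      /* Source iterations with a reader in another tile, and the data they write
       * through the (assumed) write access u[s]: this dependence's flow-out part. */
      isl_set *src = isl_map_domain(inter);
      isl_map *write = isl_map_read_from_str(ctx, "{ [ts, s] -> u[s] }");
      isl_set *flow_out_e = isl_set_apply(src, write);

      isl_set_dump(flow_out_e);
      isl_set_free(flow_out_e);
      isl_ctx_free(ctx);
      return 0;
    }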

23 / 46


Distributed-memory code generation Our approach (Pluto distmem)

Determining communication partners

A compiler-assisted runtime technique: define two functions as part of the output code for each data variable x. If t1, ..., tl is the set of sequential dimensions surrounding the parallel dimension tp:

– σ_x(t1, t2, ..., tl, tp): the set of processors that need the flow-out set for data variable x from the processor calling this function
– π(t1, t2, ..., tl, tp): the rank of the processor that executes (t1, t2, ..., tl, tp)

A hypothetical sketch of their signatures appears below.
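For concreteness, hypothetical C declarations of the two generated helpers (the names σ/sigma and π/pi come from the slides; the exact parameter lists are assumptions for illustration, shown for one sequential dimension t1 surrounding the parallel dimension t2):

    /* Ranks that need this process's flow-out set of variable u at (t1, t2);
       receiver ranks are returned through recv_ranks, their count in *nrecv. */
    void sigma_u(int t1, int t2, int my_rank, int nprocs,
                 int *recv_ranks, int *nrecv);

    /* Rank of the process that executes parallel tile (t1, t2). */
    int pi(int t1, int t2, int nprocs);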

24 / 46


Distributed-memory code generation Our approach (Pluto distmem)

The sigma function

A dependence is a relation between source and target iterations (s → t).
For each such RAW dependence: (s1, s2, ..., sp, ..., sm) → (t1, t2, ..., tp, ..., tm).
Project out the intra-tile iterators to obtain inter-tile dependences: (s1, s2, ..., sp) → (t1, t2, ..., tp).
Scanning (t1, t2, ..., tp) parametric in (s1, s2, ..., sp) enumerates the receiver tiles for a given sending tile.
Apply the π function to determine your receivers.
Code is generated at compile time; at runtime, we have the identities of the receivers for a flexible π.

25 / 46


Distributed-memory code generation Our approach (Pluto distmem)

Packing and unpacking data

Use a linearized counted buffer

for (d0=max(max(1,32*t1-32*t3),32*t3-N+32); d0<=min(T-2,32*t1-32*t3+30); d0++)
  for (d1=max(1,32*t3-d0+30); d1<=min(N-2,32*t3-d0+31); d1++)
    send_buf_u[send_count_u++] = u[d0][d1];
if (t1 <= min(floord(32*t3+T-33,32), 2*t3-1)) {
  for (d1=-32*t1+64*t3-31; d1<=min(N-1,-32*t1+64*t3); d1++)
    send_buf_u[send_count_u++] = u[32*t1-32*t3+31][d1];
}

Unpacking – just reverse the assignment
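A minimal sketch of the receiver side (the loop bounds lb0/ub0, lb1/ub1 and the helper name are assumptions standing in for the same bounds regenerated on the receiver): the packing assignment is simply reversed.

    /* Hypothetical unpack: mirror of the packing loop, with the assignment
       direction flipped. */
    void unpack_u(int N, int lb0, int ub0, int lb1, int ub1,
                  double u[][N], const double *recv_buf_u) {
      int recv_count_u = 0;
      for (int d0 = lb0; d0 <= ub0; d0++)
        for (int d1 = lb1; d1 <= ub1; d1++)
          u[d0][d1] = recv_buf_u[recv_count_u++];  /* reverse of send_buf_u[...] = u[d0][d1] */
    }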

26 / 46


Distributed-memory code generation Our approach (Pluto distmem)

Determining Communication Partners

σ_x(s1, s2, ..., sl, sp) = { π(t1, t2, ..., tl, tp) | ∃ e ∈ E on x, D_e^T(s1, .., sp, .., t1, .., tp, .., p, 1) }

D_e^T is the dependence polyhedron corresponding to e.

27 / 46


Distributed-memory code generation Our approach (Pluto distmem)

Strengths and Limitations

[Figure: (t, i) iteration space showing dependences, tiles, and the flow-out set FO(ST) of a tile ST; FO(ST) is sent to {π(RT1) ∪ π(RT2) ∪ π(RT3)}]

+ Good for broadcast or multicast style communication
+ A processor will never receive the same data twice
− Okay for disjoint point-to-point communication
− A processor could be sent data that it does not need

28 / 46


Distributed-memory code generation Our approach (Pluto distmem)

Sub-problems

1. Constructing communication sets
2. Packing and unpacking data
3. Determining receivers
4. Generating actual communication primitives

29 / 46


Distributed-memory code generation Our approach (Pluto distmem)

Improvement over previous approaches

Based on last-writer dependences: more precise.
Avoids redundant communication due to virtual-physical processor mapping in several cases.
Works with all polyhedral transformations on affine loop nests.
Further refinements possible: flow-out intersection flow-in, flow-out set partitioning, and data movement for heterogeneous systems (CPU/GPU) [Dathathri et al. PACT 2013].

30 / 46


Distributed-memory code generation Our approach (Pluto distmem)

Driven by Computation / Data flow

Code generation is for a given computation transformation / distribution.
Data moves as dictated by (last-writer) dependences for the computation partitioning specified.
There is no owning processor for data.
Data distribution only affects communication at the start, and is needed for weak scaling and allocation purposes.
We use a push model (synchronous, with a clear separation between computation and communication phases); a schematic follows below.
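A schematic of this push model (an illustration under assumptions, not the generated code; all helper names, buffer sizes, and the receiver test are placeholders): in each step of the tile schedule, every process computes its tiles, packs its flow-out set, pushes it to its receivers, and unpacks what was pushed to it before the next step.

    #include <mpi.h>

    #define MAX_PROCS 256
    #define BUF_ELEMS 4096

    /* Placeholder stubs for the generated computation, sigma, pack, and unpack code. */
    static void compute_tiles(int t1, int rank) { (void)t1; (void)rank; }
    static int  needs_flow_out(int t1, int sender, int receiver) {  /* receiver in sigma(sender)? */
      (void)t1; (void)sender; (void)receiver; return 0;
    }
    static int  pack_u(int t1, double *buf) { (void)t1; (void)buf; return 0; }
    static void unpack_u(int t1, const double *buf, int count) { (void)t1; (void)buf; (void)count; }

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, nprocs;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      double send_buf_u[BUF_ELEMS], recv_buf_u[BUF_ELEMS];
      MPI_Request reqs[MAX_PROCS];
      int last_t1 = 16;                                 /* tile-schedule steps (assumed) */

      for (int t1 = 0; t1 <= last_t1; t1++) {
        compute_tiles(t1, rank);                        /* computation phase */

        /* Communication phase: push this process's flow-out set to its receivers. */
        int count = pack_u(t1, send_buf_u), nreq = 0;
        for (int q = 0; q < nprocs; q++)
          if (q != rank && needs_flow_out(t1, rank, q))
            MPI_Isend(send_buf_u, count, MPI_DOUBLE, q, t1, MPI_COMM_WORLD, &reqs[nreq++]);

        /* Symmetrically, receive and unpack data pushed by other processes. */
        for (int q = 0; q < nprocs; q++)
          if (q != rank && needs_flow_out(t1, q, rank)) {
            MPI_Status st;
            int rcount;
            MPI_Probe(q, t1, MPI_COMM_WORLD, &st);
            MPI_Get_count(&st, MPI_DOUBLE, &rcount);
            MPI_Recv(recv_buf_u, rcount, MPI_DOUBLE, q, t1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            unpack_u(t1, recv_buf_u, rcount);
          }
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);   /* phases are clearly separated */
      }

      MPI_Finalize();
      return 0;
    }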

31 / 46


Experimental Evaluation

1. Introduction
2. Distributed-memory code generation
3. Experimental Evaluation
4. Conclusions

32 / 46


Experimental Evaluation

Experimental evaluation

Code generation support implemented in the Pluto tool (http://pluto-compiler.sourceforge.net).
Experiments on a 32-node InfiniBand cluster running MVAPICH2 (1 process per node).
The codes experimented with capture different communication styles (near-neighbor, broadcast style, multicast style).
All codes were automatically transformed.
Generated codes were compiled with icc -fast (-O3 -ipo -static), version 11.1.

33 / 46


Experimental Evaluation

Performance summary

Execution time of our generated code on 1–32 processes, and speedup of our-32 over seq (icc) and over our-1:

Benchmark        seq (icc)  pluto-seq  our-1   our-2   our-4   our-8   our-16  our-32  vs seq  vs our-1
strmm            30.4m      247s       240s    124.6s  63.5s   33.6s   17.3s   9.4s    194     26.3
trmm             35.5m      91.8s      96.4s   51.3s   27.4s   15.3s   7.14s   3.74s   570     24.5
dsyr2k           127s       39s        38.8s   22.4s   13.5s   6.80s   3.80s   1.57s   80.8    24.7
covcol           462s       30.9s      30.7s   16.7s   8.8s    4.60s   2.48s   1.30s   355     23.8
seidel           17.3m      643.5s     692s    338.7s  174.3s  94s     65.6s   33.0s   31.0    20.8
jac-2d           21.9m      206.7s     218s    111.2s  62.3s   40.7s   29.3s   21.5s   61.3    9.6
fdtd-2d          139s       129.7s     95.2s   70.7s   40.3s   25.3s   16.8s   11.7s   11.9    11.0
2d-heat          19m        266s       280s    157s    81s     52s     33s     24.0s   47.5    11.7
3d-heat          590.6s     222s       236s    118s    68.7s   41.5s   26.3s   18.8s   31.4    12.6
lu               82.9s      28s        29.5s   18.8s   9.28s   5.67s   4.3s    3.9s    21.3    7.56
floyd-warshall   2012s      2012s      2062s   1041s   527s    273s    153s    112s    18.0    18.0

Mean (geometric) speedup of 60.7× over icc-seq and of 15.9× over pluto-seq.
A more detailed comparison with manually written code and HPF is in the paper.
It is often hard to write such code by hand even for simple affine loop nests (non-rectangularity, tiling, discontiguity).

34 / 46


Experimental Evaluation

Tool available (BETA)

Available publicly at: http://pluto-compiler.sourceforge.net

$ ../../polycc floyd.c --distmem --commreport --mpiomp --tile --isldep --lastwriter --cloogsh -o floyd.distopt.c
$ mpicc -O3 -openmp floyd.distopt.c sigma.c pi.c -o distopt -lpolyrt -lm

DISCLAIMER: beta release, not responsible for crashing your cluster!

35 / 46


Conclusions

Conclusions and future work

First source-to-source tool for MPI code generation for affine loop nests.
Improves over previous distributed-memory code generation approaches.
When coupled with prior work on polyhedral transformation, it is a fully automatic distributed-memory parallelizer.
Future work: integrating it with dynamic scheduling runtimes and enabling data-flow style parallelization, so that asynchronous communication, overlap of computation and communication, and load balance come free of cost.

36 / 46


Conclusions

Thank you

Questions?

Acknowledgments:
– AMD, for an unrestricted research grant (2011–)
– Department of Science and Technology (India), for a grant under the FIST program

37 / 46