SLIDE 1

Automatic Parallelization: Parallelism and Tiling

Roshan Dathathri

Department of Computer Science and Automation Indian Institute of Science roshan@csa.iisc.ernet.in

June 25, 2013

Roshan Dathathri (CSA, IISc) Parallelism and Tiling June 25, 2013 1 / 30

SLIDE 5

Goals of program transformations/optimizations

Increase performance:
- Execute less code, e.g., Loop Invariant Code Motion
- Execute more efficient code, e.g., Algebraic Reassociation
- Utilize memory efficiently, e.g., Loop Tiling
- Parallelize execution

Reduce memory footprint
Reduce energy usage

Today: Source code transformations
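One of the optimizations named above, Loop Invariant Code Motion, can be sketched in a few lines. This is my own illustration (the function names are hypothetical, not from the slides): an expression that does not change across iterations is hoisted out of the loop.

```c
#include <assert.h>

/* Before LICM: scale / n is recomputed on every iteration even though
 * neither scale nor n changes inside the loop. */
void scale_naive(double *a, int n, double scale) {
    for (int i = 0; i < n; i++)
        a[i] *= scale / n;          /* loop-invariant expression */
}

/* After LICM: the invariant expression is computed once, outside the loop. */
void scale_hoisted(double *a, int n, double scale) {
    double factor = scale / n;      /* hoisted out of the loop */
    for (int i = 0; i < n; i++)
        a[i] *= factor;
}

/* Helper for checking: returns 1 if both versions produce the same result. */
int licm_versions_agree(void) {
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    double b[4] = {1.0, 2.0, 3.0, 4.0};
    scale_naive(a, 4, 2.0);
    scale_hoisted(b, 4, 2.0);
    for (int i = 0; i < 4; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}
```

The transformed version performs one division instead of n, without changing the result.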

SLIDE 6

Memory Hierarchy

SLIDE 7

Data Locality

Same memory location or related memory locations being frequently accessed. Different classes of locality:

- Spatial locality
- Temporal locality
- Group locality

SLIDE 9

Spatial locality

Elements close by (in space/memory) tend to be referenced soon, e.g., c[i][j] in the code below:

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

The innermost dimension of the array should vary the fastest, by a constant stride
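To make the traversal-order point concrete, here is a sketch of my own (with an illustrative size N = 8) of the same multiplication in two loop orders. In row-major C arrays, the i-k-j order walks both c[i][j] and b[k][j] along rows with stride 1 in the innermost loop, while the i-j-k order strides down a column of b.

```c
#include <assert.h>

#define N 8   /* small illustrative size */

/* i-j-k order: in the innermost k loop, b[k][j] strides down a column,
 * touching a new cache line on almost every access. */
void matmul_ijk(double c[N][N], double a[N][N], double b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* i-k-j order: the innermost j loop walks c[i][j] and b[k][j] along rows,
 * stride 1 in row-major layout, so spatial reuse is exploited. */
void matmul_ikj(double c[N][N], double a[N][N], double b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Helper for checking: both orders add the same terms in the same k order
 * per element, so the results are bitwise identical. */
int matmul_orders_agree(void) {
    double a[N][N], b[N][N], c1[N][N], c2[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (double)(i + j);
            b[i][j] = (double)(i - j);
            c1[i][j] = 0.0;
            c2[i][j] = 0.0;
        }
    matmul_ijk(c1, a, b);
    matmul_ikj(c2, a, b);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (c1[i][j] != c2[i][j]) return 0;
    return 1;
}
```

Both orders compute the same product; only the memory-access pattern differs.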

SLIDE 10

Which code exploits spatial reuse of c[i][j] better?

Snippet 1:

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Snippet 2:

for (k = 0; k < N; k++) {
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Table: Matrix multiplication code

SLIDE 12

Temporal locality

Same element tends to be referenced soon, e.g., c[i][j] in the code below:

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Temporal reuse exists when the rank of an access function is less than the dimensionality of the loop nest
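The rank condition can be spelled out for c[i][j] in this three-deep nest (a worked step, not on the slide): the access function maps (i, j, k) to (i, j), so its matrix has rank 2, less than the nest dimensionality 3.

```latex
F_c\begin{pmatrix} i \\ j \\ k \end{pmatrix}
  = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}
    \begin{pmatrix} i \\ j \\ k \end{pmatrix}
  = \begin{pmatrix} i \\ j \end{pmatrix},
\qquad \operatorname{rank}(F_c) = 2 < 3
```

The null space of F_c is spanned by (0, 0, 1): iterations that differ only in k touch the same element of c, so the k loop carries the temporal reuse.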

SLIDE 13

Which code exploits temporal reuse of c[i][j] better?

Snippet 1:

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Snippet 2:

for (k = 0; k < N; k++) {
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Table: Matrix multiplication code

SLIDE 14

Group locality

Multiple accesses of the same array tend to reference the same element soon, e.g., a[i+1], a[i], a[i-1] in the code below:

for (t = 0; t < T-1; t++) {
    for (i = 1; i < N+1; i++) {
        temp[i] = 0.125 * (a[i+1] - 2.0 * a[i] + a[i-1]);
    }
    for (i = 1; i < N+1; i++) {
        a[i] = temp[i];
    }
}

SLIDE 15

Loop Tiling/Blocking

- Executing the iteration space in blocks, block after block
- The most important of all loop transformations
- Crucial for locality and parallelism

SLIDE 16

Example – Tiling

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Original code

Figure: Locality in i, j, k dimensions

SLIDE 17

Example – Tiling

// inter-tile iterators
for (iT = 0; iT < N; iT += B) {
    for (jT = 0; jT < N; jT += B) {
        for (kT = 0; kT < N; kT += B) {
            // intra-tile iterators
            for (i = iT; i < iT + B; i++) {
                for (j = jT; j < jT + B; j++) {
                    for (k = kT; k < kT + B; k++) {
                        c[i][j] += a[i][k] * b[k][j];
                    }
                }
            }
        }
    }
}

Tiled code with tile size B × B × B
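The tiled code above assumes B divides N evenly. A variant of my own (using a flat array and a MIN macro I introduce here, not from the slides) clamps each intra-tile loop to the array bound, so ragged edge tiles stay correct for any B:

```c
#include <assert.h>

#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* Tiled matmul that stays correct when the tile size bsize does not
 * divide n: each intra-tile loop is clamped to the array bound. */
void matmul_tiled(int n, int bsize, double *c, const double *a, const double *b) {
    for (int iT = 0; iT < n; iT += bsize)
        for (int jT = 0; jT < n; jT += bsize)
            for (int kT = 0; kT < n; kT += bsize)
                for (int i = iT; i < MIN(iT + bsize, n); i++)
                    for (int j = jT; j < MIN(jT + bsize, n); j++)
                        for (int k = kT; k < MIN(kT + bsize, n); k++)
                            c[i*n + j] += a[i*n + k] * b[k*n + j];
}

/* Helper for checking: tiled and untiled versions add the same terms in
 * the same k order per element, so results match exactly. n = 5, B = 2
 * deliberately exercises the ragged-edge case. */
int tiled_matches_untiled(void) {
    enum { n = 5 };
    double a[n*n], b[n*n], c1[n*n], c2[n*n];
    for (int i = 0; i < n*n; i++) {
        a[i] = (double)(i % 7);
        b[i] = (double)(i % 3) - 1.0;
        c1[i] = 0.0;
        c2[i] = 0.0;
    }
    for (int i = 0; i < n; i++)        /* untiled reference */
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c1[i*n + j] += a[i*n + k] * b[k*n + j];
    matmul_tiled(n, 2, c2, a, b);
    for (int i = 0; i < n*n; i++)
        if (c1[i] != c2[i]) return 0;
    return 1;
}
```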

Figure: Exploiting reuse in i, j, k dimensions (tile boundaries marked)

SLIDE 18

Tiling for Data Locality

- Tiling for caches: data touched by a tile should fit in faster memory
- Improves data reuse: allows reuse in multiple directions
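As a back-of-the-envelope sizing (my numbers, not from the slides): for the B × B × B matrix-multiply tile, roughly three B × B blocks of doubles (one each of a, b, c) are live at once, so for an assumed 32 KB L1 cache:

```latex
3 \cdot B^2 \cdot 8 \;\le\; 32 \times 1024
\;\Longrightarrow\; B^2 \le 1365
\;\Longrightarrow\; B \le 36
```

Under these assumptions a power-of-two tile size such as B = 32 would be a natural choice.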

SLIDE 19

Validity of Tiling

- A tile is a piece of computation that can execute atomically, in its entirety
- It should be possible to construct a total order on the set of all tiles

SLIDE 20

Example – Validity of Tiling

Figure: Dependences (1,0), (1,1), (1,-1) in the (t, i) iteration space

for (t = 0; t < T; t++) {
    for (i = 2; i < N-1; i++) {
        a[t][i] += 0.333 * (a[t-1][i] + a[t-1][i-1] + a[t-1][i+1]);
    }
}

Original code

SLIDE 22

Example – Validity of Tiling

Figure: Dependences (1,0), (1,1), (1,-1) in the (t, i) iteration space

for (t1 = 0; t1 <= T-1; t1++) {
    for (t2 = t1+2; t2 <= t1+N-2; t2++) {
        a[t1][-t1+t2] += 0.333 * (a[t1-1][-t1+t2] + a[t1-1][-t1+t2-1] + a[t1-1][-t1+t2+1]);
    }
}

Skewed code
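The skewing that produces this code can be written out as a worked step consistent with the loop bounds above: t2 = t + i, and applying the transformation to the three dependences makes every component non-negative, so rectangular tiling of (t1, t2) becomes valid.

```latex
\begin{pmatrix} t_1 \\ t_2 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}
  \begin{pmatrix} t \\ i \end{pmatrix},
\qquad
(1,0) \mapsto (1,1), \quad
(1,1) \mapsto (1,2), \quad
(1,-1) \mapsto (1,0)
```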

SLIDE 27

Validity of Tiling

With distance vectors and tiling along the original dimensions, all dependence components along the dimensions being tiled should be non-negative.

With dependence polyhedron D (distance vectors as columns), valid tiling hyperplanes h satisfy h · D ≥ 0. For the dependences (1,0), (1,1), (1,-1), the hyperplanes (1,0) and (1,1) are valid:

\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}
\cdot
\begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & -1 \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 0 \end{pmatrix}

Consider dependences (1,0,1), (1,-2,0), (0,1,0), (0,0,1): what kind of tiling is valid? The hyperplanes (1,0,0), (2,1,0), (0,0,1) satisfy the condition:

\begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\cdot
\begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & -2 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 0 & 0 \\ 2 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}

Every entry of each product is non-negative, so the tiling is valid.
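The validity check is mechanical; here is a small sketch of my own that tests h · d ≥ 0 against the dependences (1,0,1), (1,-2,0), (0,1,0), (0,0,1) from this slide (the function names are mine):

```c
#include <assert.h>

/* A tiling hyperplane h is valid iff h . d >= 0 for every dependence
 * distance vector d, i.e., no dependence component along h is negative. */
int hyperplane_is_valid(const int h[3], const int deps[][3], int ndeps) {
    for (int d = 0; d < ndeps; d++) {
        int dot = 0;
        for (int k = 0; k < 3; k++)
            dot += h[k] * deps[d][k];
        if (dot < 0) return 0;
    }
    return 1;
}

/* The dependences from the slide: (1,0,1), (1,-2,0), (0,1,0), (0,0,1). */
static const int slide_deps[4][3] = {
    {1, 0, 1}, {1, -2, 0}, {0, 1, 0}, {0, 0, 1}
};

/* Convenience wrapper over the slide's dependences. */
int check_h(int a, int b, int c) {
    const int h[3] = {a, b, c};
    return hyperplane_is_valid(h, slide_deps, 4);
}
```

For example, (0,1,0) is rejected because its dot product with (1,-2,0) is -2.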

SLIDE 28

Example

Figure: No 2-D tiling possible (dependences in the (i, j) iteration space)

SLIDE 29

Different kinds of parallelism

- Outer parallelism / communication-free parallelism
- Inner parallelism
- Pipelined parallelism
- Reduction parallelism
- SIMD (Single Instruction Multiple Data) parallelism, or vectorization
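Two of these kinds can be sketched with OpenMP-style pragmas (my illustration, not from the slides; the pragmas are no-ops when OpenMP is disabled, so the code also runs sequentially):

```c
#include <assert.h>

/* Outer parallelism: iterations of the outermost loop are independent,
 * so they can be distributed across threads with no communication. */
void outer_parallel_scale(int n, double *out, const double *in) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = 2.0 * in[i];
}

/* Reduction parallelism: partial sums are computed in parallel and
 * combined, exploiting the associativity of + (up to FP rounding). */
double reduction_sum(int n, const double *in) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += in[i];
    return s;
}

/* Helper for checking: returns 1 if both sketches behave as expected. */
int parallelism_demo_ok(void) {
    double in[4] = {1.0, 2.0, 3.0, 4.0};
    double out[4];
    outer_parallel_scale(4, out, in);
    if (out[0] != 2.0 || out[3] != 8.0) return 0;
    if (reduction_sum(4, in) != 10.0) return 0;
    return 1;
}
```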

SLIDE 32

Outer parallelism (loops)

Figure: Outer parallel loop i

SLIDE 34

Inner parallelism (loops)

Figure: Inner parallel loop j

SLIDE 36

Pipelined parallelism (loops)

Figure: Pipelined parallel loop

SLIDE 37

Coarse-grained pipelined parallelism

Figure: Tiling an iteration space (axes φ1, φ2; instances of statements S1 and S2, with one tile marked)

SLIDE 39

Tiling for Parallelism

- Achieves coarse-grained parallelization
- Reduces the frequency of synchronization, improving the computation-to-communication ratio
- Can reduce the volume of communication

How does the tile size affect parallelism?
- Larger tiles -> less frequent synchronization, but more load imbalance
- Smaller tiles -> more frequent synchronization, but less load imbalance
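The ratio claim can be quantified with a rough model (my assumption of a cubic tile in a 3-d iteration space, not from the slides): work grows with the tile's volume while communicated data grows with its surface, so

```latex
\frac{\text{computation}}{\text{communication}}
\;\sim\; \frac{B^3}{B^2} \;=\; B
```

which is why larger tiles synchronize and communicate less often, at the cost of coarser load balancing.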

SLIDE 41

Tiling is directly related to parallelism

Figure: Dependences (1,0), (1,1), (1,-1) in the (t, i) iteration space

Tiling is valid -> Parallelism (at least pipelined parallelism) exists

SLIDE 43

References

- PLUTO - An automatic parallelizer and locality optimizer for multicores: http://pluto-compiler.sourceforge.net/
- PoCC - The Polyhedral Compiler Collection: http://www.cse.ohio-state.edu/~pouchet/software/pocc/

SLIDE 44

Complexity of architectures and input code

Figure: Two axes of difficulty. Architecture complexity: general-purpose multicores, clusters, distributed-memory machines, GPUs, heterogeneous platforms. Code complexity: dense matrices with regular control flow, sparse codes, and irregular control flow over graphs, trees, hash tables, sets, and lists.
