SLIDE 1

Motivation and challenges Parametric analysis Current implementation and results

Parametric Tiling with Inter-Tile Data Reuse

Alexandre Isoard, Alain Darte

Compsys, LIP (Laboratoire de l’Informatique du Parallélisme), Lyon

IMPACT: 4th International Workshop on Polyhedral Compilation Techniques, January 20, 2014, Vienna, Austria

SLIDE 2

Outline

1. Motivation and challenges
  • Kernel offloading: rules of the game
  • Reminders: scheduling and tiling
  • Inter-tile data reuse: example

2. Parametric analysis
  • Tile index vs tile origin index
  • Exact inter-tile reuse
  • Approximated inter-tile reuse

3. Current implementation and results
  • Current status
  • Script with iscc
  • Local memory allocation for PolyBench examples

SLIDE 3

Kernel Offloading

[Figure: Host (CPU) with Global Memory; Accelerator (FPGA/GPU/MPPA/...) with Local Memory; transfers labelled slow and fast.]

☛ Perform computations by blocks.
☛ Exploit data reuse.
☛ Use pipelining/prefetching.
☛ Reduce and coalesce communications (burst).

SLIDE 8

Rules and objectives

Parametric in terms of tile sizes?

Data reuse: on the full iteration domain
Rule 1: always use local data if already loaded or computed.
☛ Reduces communication volume, increases local memory usage.
☛ Enables full pipelining (load/compute/store sequence).

Blocking: thanks to tiling
Rule 2: tiles executed in sequence (but a tile can be parallelized).
☛ Increases temporal reuse, reduces local memory usage.
☛ Increases spatial reuse, enables burst communications.

Variants for reuse domain, i.e., where data reuse is performed
  • Iteration domain reduced thanks to hierarchical tiling.
  • Data reuse in a p-dimensional stripe, or at bounded distance.

Then: scheduling/pipelining & memory allocation
Rule 3: reuse analysis performed independently of scheduling.
Rule 4: load as late as possible, store as soon as possible.
☛ Overlaps transfer and computation (multi-buffering).
☛ Reduces live-ranges, and possibly local memory size.

SLIDE 12

Challenges and contributions

General principle for Load sets
Load a data element indexed by m just before a tile indexed by T if:
  • m is live-in for T, i.e., read but not written earlier in T;
  • m has not been loaded in a previous tile;
  • m has not been defined earlier.

Tiling defines a schedule on tile+iteration indices, and thus the meaning of “previous” and “earlier”. This schedule is not affine in terms of tile sizes.

Exact case
Reads/writes are functions of iteration points. Can we express the relation “happens before” among iterations in a quasi-affine way?
☛ Yes. Parametric tiling with exact inter-tile reuse is feasible.

Approximations
What if the contributions of reads/writes are summarized at tile level? Or approximated?
☛ No information loss if approximations are “pointwise”. More approximations are needed otherwise.

SLIDE 13

Reads, writes, schedule

[Figure: iteration space (i, j) with arrays A, B, and C.]

Product of two polynomials: arguments in A and B; result in C.

    for (int k = 0; k < 2*n - 1; k++) {
        C[k] = 0;                 // S0
    }
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            C[i+j] += A[i]*B[j];  // S1
        }
    }

SLIDE 19

Dependences

[Figure: dependences of the polynomial product code (S0, S1), drawn in the iteration space (i, j) with arrays A, B, and C.]

SLIDE 24

Scheduling alternatives

[Figures: iteration space (i, j) with arrays A, B, and C for the polynomial product code (S0, S1), one panel per alternative.]

  • Loop reversal + interchange.
  • Loop reversal + interchange + tiling.
  • Loop skewing.
  • Loop skewing + tiling, plus the possibility of intra-tile parallelism.

SLIDE 27

Inter-tile data reuse in a tile strip

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            C[i+j] = C[i+j] + A[i]*B[j];

[Figure: two tiled iteration spaces, one for the mapping (i, j) → (n − j − 1, i) (loop reversal + interchange) and one for the mapping (i, j) → (i + j, i) (loop skewing).]

  • In a tile, Load ≃ first read, Store ≃ last write.
  • In a tile strip, Load ≃ first read, Store ≃ last write.
  • In a reuse domain, Load ≃ first read, Store ≃ last write.

Can actually be adapted to any parameterized reuse domain.
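
The two mappings above can be checked on a small instance. The following Python sketch is an illustration, not part of the original slides: it treats every pair of S1 iterations touching the same C[i+j] as an ordered dependence (ignoring that the accumulation could be reordered), and tests the usual sufficient condition for rectangular tiling, namely that all dependence distance components are non-negative after the mapping. The function names are hypothetical.

```python
from itertools import product

def dependences(n):
    """Pairs (src, dst) of S1 iterations accessing the same C[i+j],
    with src executed before dst in the original (i, j) order."""
    points = list(product(range(n), repeat=2))
    return [(p, q) for p in points for q in points
            if p[0] + p[1] == q[0] + q[1] and p < q]

def reversal_interchange(n):
    """Mapping (i, j) -> (n - j - 1, i) for a given bound n."""
    return lambda i, j: (n - j - 1, i)

def skewing(i, j):
    """Mapping (i, j) -> (i + j, i)."""
    return (i + j, i)

def distances(mapping, n):
    """Dependence distance vectors in the transformed space."""
    result = set()
    for p, q in dependences(n):
        a, b = mapping(*p), mapping(*q)
        result.add((b[0] - a[0], b[1] - a[1]))
    return result

def is_tilable(mapping, n):
    """True if every distance component is non-negative."""
    return all(d0 >= 0 and d1 >= 0 for d0, d1 in distances(mapping, n))
```

On this kernel, both mappings pass the check, whereas the original (i, j) order does not, since its distances are of the form (k, −k) with k > 0.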

SLIDE 32

Objective: data transfers and local memory sizes

[Figure: tiled iteration space with axes (i, j) and tile origin (I, J).]

  • Bound n, tiles of size b × b.
  • Tiling with (i, j) → (i′, j′) = (n − j − 1, i).
  • Access function for C: m = i + j = j′ + n − i′ − 1.
  • Tile origin (I, J).
  • Transfers LoadA, LoadB, LoadC, StoreC.

Load sets, and local memory sizes with “double-buffering”:

LoadA = {m | 0 ≤ m ≤ n − 1, J ≤ m ≤ J + b − 1}
  • size 2b when n ≥ 2b + 1: at least 2 tiles available.
  • size n when n ≤ 2b: less than 2 tiles.

LoadB = {m | J = 0, 0 ≤ m ≤ n − 1, n − I − b ≤ m ≤ n − I − 1}
  • size b when n ≥ b: 1 full tile.
  • size n when n ≤ b − 1: 1 partial tile.

LoadC = {m | 0 ≤ m, n − I − b ≤ m ≤ n − 1 − I, J = 0} ∪ {m | max(1, J) ≤ m + I − n + 1 ≤ min(n − 1, J + b − 1)}
  • size 3b − 1 = (2b − 1) + b if n ≥ 2b + 1: 2 full tiles.
  • size b + n − 1 = (2b − 1) + (n − b) if b ≤ n ≤ 2b: 1 full tile, 1 partial tile.
  • size 2n − 1 if n ≤ b − 1: 1 partial tile.
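
The LoadA, LoadB, and LoadC formulas above can be compared with a brute-force enumeration. The Python sketch below is a sanity check under stated assumptions, not the authors' tool: reuse is limited to the strip of tiles sharing the same I, the tiles of a strip run by increasing J, only statement S1 is considered, and tile origins are multiples of b. All function names are hypothetical.

```python
def closed_form_loads(n, b, I, J):
    """Load sets transcribed from the slide, for the tile with origin (I, J)."""
    load_a = {m for m in range(n) if J <= m <= J + b - 1}
    load_b = {m for m in range(n) if J == 0 and n - I - b <= m <= n - I - 1}
    load_c = {m for m in range(2 * n - 1)
              if (J == 0 and n - I - b <= m <= n - 1 - I)
              or (max(1, J) <= m + I - n + 1 <= min(n - 1, J + b - 1))}
    return load_a, load_b, load_c

def tile_accesses(n, b, I, J):
    """Elements of A, B, C touched by the tile with origin (I, J),
    in the transformed space (i', j') = (n - j - 1, i)."""
    a, bb, c = set(), set(), set()
    for ip in range(I, min(I + b, n)):
        for jp in range(J, min(J + b, n)):
            i, j = jp, n - 1 - ip        # inverse of the mapping
            a.add(i)
            bb.add(j)
            c.add(i + j)
    return a, bb, c

def brute_force_loads(n, b, I, J):
    """First-use loads when reuse is limited to the strip of tiles sharing I."""
    seen = (set(), set(), set())
    for Jp in range(0, J, b):            # earlier tiles of the same strip
        for s, acc in zip(seen, tile_accesses(n, b, I, Jp)):
            s |= acc                     # in-place union on the shared set
    return tuple(acc - s for acc, s in zip(tile_accesses(n, b, I, J), seen))

def formulas_match(n, b):
    """Compare both computations on every tile of the n x n domain."""
    return all(closed_form_loads(n, b, I, J) == brute_force_loads(n, b, I, J)
               for I in range(0, n, b) for J in range(0, n, b))
```

The check covers full and partial tiles, including the case b > n where a single partial tile holds the whole domain. The local memory sizes listed on the slide are not checked by this sketch.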

SLIDE 33

Outline

1. Motivation and challenges

2. Parametric analysis
  • Tile index vs tile origin index
  • Exact inter-tile reuse
  • Approximated inter-tile reuse

3. Current implementation and results

SLIDE 35

Tiling, tiles, and schedules

With indices of tiles (tile sizes defined by $\vec{s} = (s_1, \ldots, s_n)$):

$$\vec{i} \in \mathrm{Tile}(\vec{T}) \iff \begin{cases} s_1 T_1 \le i_1 < s_1 (T_1 + 1) \\ \quad\vdots \\ s_n T_n \le i_n < s_n (T_n + 1) \end{cases}$$

☛ Schedule on iteration points: $\vec{i}' < \vec{i} \iff (\vec{T}', \vec{i}') <_{\mathrm{lex}} (\vec{T}, \vec{i})$.

With indices of tile origins:

$$\vec{i} \in \mathrm{Tile}(\vec{I}) \iff \begin{cases} I_1 \le i_1 < I_1 + s_1 \\ \quad\vdots \\ I_n \le i_n < I_n + s_n \end{cases}$$

with $\vec{I}$ the origin of $\mathrm{Tile}(\vec{T})$, i.e., $\vec{I} = (s_1 T_1, \ldots, s_n T_n)$.

☛ Schedule on iteration points, for a tiling specified by a given tile:

$$\vec{i}' <_{\vec{I}} \vec{i} \iff \vec{i}' <_{\vec{I}'} \vec{i} \iff (\vec{I}', \vec{i}') <_{\mathrm{lex}} (\vec{I}, \vec{i}) \ \text{and}\ \vec{I}' \equiv_{\vec{s}} \vec{I}$$
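
The difference between the two descriptions can be made concrete. In the sketch below (an illustration with hypothetical function names, not code from the slides), the tile-index test multiplies a tile size by a tile index, which is not affine when tile sizes are parameters, whereas the tile-origin test only adds them. The price is the alignment condition, written here with a modulo.

```python
def in_tile_by_index(i, T, s):
    """i in Tile(T): the bounds involve the products s_k * T_k."""
    return all(sk * Tk <= ik < sk * (Tk + 1) for ik, Tk, sk in zip(i, T, s))

def in_tile_by_origin(i, I, s):
    """i in Tile(I): the bounds only involve sums of I_k and s_k."""
    return all(Ik <= ik < Ik + sk for ik, Ik, sk in zip(i, I, s))

def origin_of(T, s):
    """Origin I = (s_1 T_1, ..., s_n T_n) of Tile(T)."""
    return tuple(sk * Tk for Tk, sk in zip(T, s))

def executes_before(ip, Ip, i, I, s):
    """i' <_I i, for i' in Tile(I') and i in Tile(I): the origins must be
    aligned, and (I', i') must precede (I, i) lexicographically."""
    aligned = all((a - b) % sk == 0 for a, b, sk in zip(Ip, I, s))
    return aligned and (Ip, ip) < (I, i)
```

Tuples compare lexicographically in Python, so `(Ip, ip) < (I, i)` is a direct transcription of the lexicographic order on (tile origin, point) pairs, provided all four arguments are tuples.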

SLIDE 37

Intuitive expression of Load/Store sets

For $\mathrm{Tile}(\vec{I})$ with data reuse in ReuseDomain:

$$\mathrm{Load}(\vec{I}) = \bigcup_{\vec{i} \in \mathrm{Tile}(\vec{I})} \Bigl( \mathrm{read}(\vec{i}) \setminus \bigcup_{\substack{\vec{i}' < \vec{i} \\ \vec{i}' \in \mathrm{ReuseDomain}}} \bigl( \mathrm{read}(\vec{i}') \cup \mathrm{write}(\vec{i}') \bigr) \Bigr)$$

$$\mathrm{Store}(\vec{I}) = \bigcup_{\vec{i} \in \mathrm{Tile}(\vec{I})} \Bigl( \mathrm{write}(\vec{i}) \setminus \bigcup_{\substack{\vec{i}' > \vec{i} \\ \vec{i}' \in \mathrm{ReuseDomain}}} \mathrm{write}(\vec{i}') \Bigr)$$

where $\vec{i}' < \vec{i}$ means that $\vec{i}'$ is executed before $\vec{i}$ in the tiled schedule.

☛ Can we express $\vec{i}' < \vec{i}$ (“happens before”) in a parametric way?

SLIDE 44

Tiling, relation “happens before” and unaligned tiles

[Figure: tiled iteration space; labels A, B, C.]

With tile indices: $\vec{i}' < \vec{i}$ iff, with $\vec{i} \in \mathrm{Tile}(\vec{T})$ and $\vec{i}' \in \mathrm{Tile}(\vec{T}')$,

$$(\vec{T}', \vec{i}') <_{\mathrm{lex}} (\vec{T}, \vec{i}).$$

With tile origins: $\vec{i}' <_{\vec{I}} \vec{i}$ iff, with $\vec{i} \in \mathrm{Tile}(\vec{I})$ and $\vec{i}' \in \mathrm{Tile}(\vec{I}')$,

$$(\vec{I}', \vec{i}') <_{\mathrm{lex}} (\vec{I}, \vec{i}) \ \text{and}\ \vec{I}' \equiv_{\vec{s}} \vec{I},$$

that is,

$$(\vec{I}' = \vec{I} \wedge \vec{i}' <_{\mathrm{lex}} \vec{i}) \ \text{or}\ (\vec{I}' <_{\mathrm{lex}} \vec{I} \wedge \vec{I}' \equiv_{\vec{s}} \vec{I}), \quad \text{where}\ \vec{I}' <_{\mathrm{lex}} \vec{I} \wedge \vec{I}' \equiv_{\vec{s}} \vec{I} \iff \vec{I}' \sqsubset_{\vec{s}} \vec{I}.$$

In two dimensions, the second alternative can be replaced by a condition on the point $\vec{i}'$:

$$(i'_1 < I_1) \vee (i'_1 < I_1 + s_1 \wedge i'_2 < I_2),$$

or by a condition on the tile origin $\vec{I}'$:

$$(I'_1 \le I_1 - s_1) \vee (I'_1 \le I_1 \wedge I'_2 \le I_2 - s_2),$$

which defines $\vec{I}' \prec_{\vec{s}} \vec{I}$: a partial order on tiles (aligned and unaligned tiles).
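
The three two-dimensional conditions above describe the same set of earlier points, which can be checked by enumeration. The sketch below is an illustration with hypothetical function names; it relies on the fact that the tiles containing a point p are exactly those whose origin is p − d with 0 ≤ d < s componentwise.

```python
from itertools import product

def aligned_before(Ip, I, s):
    """I' aligned with I (same residues modulo s) and lexicographically before I."""
    return all((a - b) % sk == 0 for a, b, sk in zip(Ip, I, s)) and Ip < I

def prec(Ip, I, s):
    """Two-dimensional order on tile origins, aligned or not."""
    return Ip[0] <= I[0] - s[0] or (Ip[0] <= I[0] and Ip[1] <= I[1] - s[1])

def origins_containing(p, s):
    """All (possibly unaligned) origins I' such that p is in Tile(I')."""
    return [(p[0] - d0, p[1] - d1)
            for d0, d1 in product(range(s[0]), range(s[1]))]

def covered_by_aligned(p, I, s):
    """p belongs to some aligned tile executed before Tile(I)."""
    return any(aligned_before(Ip, I, s) for Ip in origins_containing(p, s))

def covered_by_any(p, I, s):
    """p belongs to some tile, aligned or not, that precedes Tile(I)."""
    return any(prec(Ip, I, s) for Ip in origins_containing(p, s))

def closed_form(p, I, s):
    """Condition stated directly on the point p."""
    return p[0] < I[0] or (p[0] < I[0] + s[0] and p[1] < I[1])
```

The union of the aligned earlier tiles and the union of all tiles that precede Tile(I) in the order on origins therefore cover the same points, which is the premise used later when comparing unions over aligned tiles and over all tiles.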

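SLIDE 46

Load/Store computations with In/Out sets

Contribution of reads/writes summarized at tile level:

$$\mathrm{In}(\vec{I}) = \bigcup_{\vec{i} \in \mathrm{Tile}(\vec{I})} \Bigl( \mathrm{read}(\vec{i}) \setminus \bigcup_{\substack{\vec{i}' \in \mathrm{Tile}(\vec{I}) \\ \vec{i}' <_{\mathrm{lex}} \vec{i}}} \mathrm{write}(\vec{i}') \Bigr), \qquad \mathrm{Out}(\vec{I}) = \bigcup_{\vec{i} \in \mathrm{Tile}(\vec{I})} \mathrm{write}(\vec{i})$$

Load set expressed on iteration points:

$$\mathrm{Load}(\vec{I}) = \bigcup_{\vec{i} \in \mathrm{Tile}(\vec{I})} \Bigl( \mathrm{read}(\vec{i}) \setminus \bigcup_{\substack{\vec{i}' < \vec{i} \\ \vec{i}' \in \mathrm{ReuseDomain}}} \bigl( \mathrm{read}(\vec{i}') \cup \mathrm{write}(\vec{i}') \bigr) \Bigr)$$

☛ Load set expressed with In/Out sets:

$$\mathrm{Load}(\vec{I}) = \mathrm{In}(\vec{I}) \setminus \bigcup_{\vec{I}' \prec_{\vec{s}} \vec{I}} \bigl( \mathrm{In}(\vec{I}') \cup \mathrm{Out}(\vec{I}') \bigr)$$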
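
The agreement between the point-level and tile-level expressions of Load can be observed on the polynomial product. The Python sketch below is a brute-force illustration, not the authors' implementation. It assumes that only statement S1 is executed, under the skewed mapping (i, j) → (i + j, i), with aligned rectangular tiles run in lexicographic order and reuse over the whole iteration domain. Function names are hypothetical.

```python
from itertools import product

def accesses(u, v):
    """Reads and writes of S1 at the transformed point (u, v) = (i + j, i)."""
    i, j = v, u - v
    return {("A", i), ("B", j), ("C", u)}, {("C", u)}

def tiles(n, s):
    """Non-empty aligned tiles in execution order, each with its points
    listed in lexicographic order."""
    by_origin = {}
    for u, v in sorted((i + j, i) for i, j in product(range(n), repeat=2)):
        origin = (u - u % s[0], v - v % s[1])
        by_origin.setdefault(origin, []).append((u, v))
    return sorted(by_origin.items())

def point_level_loads(n, s):
    """Load(I): reads in Tile(I) not preceded by any read or write."""
    touched, loads = set(), {}
    for origin, points in tiles(n, s):
        load = set()
        for p in points:
            reads, writes = accesses(*p)
            load |= reads - touched
            touched |= reads | writes
        loads[origin] = load
    return loads

def tile_level_loads(n, s):
    """Load(I): In(I) minus the In and Out sets of all earlier tiles."""
    earlier, loads = set(), {}
    for origin, points in tiles(n, s):
        tile_in, tile_out = set(), set()
        for p in points:
            reads, writes = accesses(*p)
            tile_in |= reads - tile_out   # read before any write in this tile
            tile_out |= writes
        loads[origin] = tile_in - earlier
        earlier |= tile_in | tile_out
    return loads
```

With these assumptions, every element of A, B, and C is loaded exactly once over the whole execution, so the Load sets of all tiles contain 4n − 1 elements in total.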

SLIDE 47

Approximations: why?

  • Some operations may execute under if conditions that are not analyzable.
  • Some data may be accessed through access functions that are not fully analyzable.

Approximated In/Out sets for tiles ☛ In, Out, Out.
  • due to the analysis (e.g., array regions);
  • by choice, to represent simpler sets (e.g., hyper-rectangles);
  • to simplify the analysis (e.g., Fourier-Motzkin).

Approximated Load/Store sets ☛ Store, Load.
  • to simplify code generation;
  • to perform communications by blocks;
  • to simplify memory allocation;
  • . . .

SLIDE 53

Equality of unions

“Exact approximated” load formula:

$$\mathrm{Load}(\vec{I}) = \mathrm{Ra}_{\vec{I}} \cap \Bigl( (\mathrm{In}' \cup \mathrm{Out})(\vec{I}) \setminus (\mathrm{In}' \cup \mathrm{Out})(\vec{I}' \sqsubset_{\vec{s}} \vec{I}) \Bigr)$$

Simplified “exact” load formula, with aligned tiles:

$$\mathrm{Load}(\vec{I}) = (\mathrm{In} \cup \mathrm{Out})(\vec{I}) \setminus \bigcup_{\vec{I}' \sqsubset_{\vec{s}} \vec{I}} (\mathrm{In} \cup \mathrm{Out})(\vec{I}')$$

With $F = \mathrm{In} \cup \mathrm{Out}$: aligned tiles or all tiles?

$$\mathrm{Load}(\vec{I}) = F(\vec{I}) \setminus \bigcup_{\vec{I}' \sqsubset_{\vec{s}} \vec{I}} F(\vec{I}') \ \overset{?}{=}\ F(\vec{I}) \setminus \bigcup_{\vec{I}' \prec_{\vec{s}} \vec{I}} F(\vec{I}')$$

Definition (Function stable for unions). $F : \mathcal{C} \subseteq \mathcal{P}(A) \to \mathcal{P}(B)$ is stable for unions iff, for all $\mathcal{C}', \mathcal{C}'' \subseteq \mathcal{C}$,

$$\bigcup_{X \in \mathcal{C}'} X = \bigcup_{X \in \mathcal{C}''} X \implies \bigcup_{X \in \mathcal{C}'} F(X) = \bigcup_{X \in \mathcal{C}''} F(X).$$

Applied to tiles:

$$\bigcup_{\vec{I}' \sqsubset_{\vec{s}} \vec{I}} \mathrm{Tile}(\vec{I}') = \bigcup_{\vec{I}' \prec_{\vec{s}} \vec{I}} \mathrm{Tile}(\vec{I}') \ \overset{?}{\implies}\ \bigcup_{\vec{I}' \sqsubset_{\vec{s}} \vec{I}} F(\vec{I}') = \bigcup_{\vec{I}' \prec_{\vec{s}} \vec{I}} F(\vec{I}')$$

SLIDE 55

Pointwise functions

Definition (Function stable for unions). $F : \mathcal{C} \subseteq \mathcal{P}(A) \to \mathcal{P}(B)$ is stable for unions iff, for all $\mathcal{C}', \mathcal{C}'' \subseteq \mathcal{C}$,

$$\bigcup_{X \in \mathcal{C}'} X = \bigcup_{X \in \mathcal{C}''} X \implies \bigcup_{X \in \mathcal{C}'} F(X) = \bigcup_{X \in \mathcal{C}''} F(X).$$

equivalent to

Definition (Pointwise function). Let $A$, $B$ be two sets and $\mathcal{C} \subseteq \mathcal{P}(A)$. $F : \mathcal{C} \to \mathcal{P}(B)$ is pointwise iff there exists $f : A \to \mathcal{P}(B)$ such that

$$\forall X \in \mathcal{C}, \quad F(X) = \bigcup_{x \in X} f(x).$$

Example: $F(\vec{I}) = (\mathrm{In} \cup \mathrm{Out})(\vec{I}) = \bigcup_{\vec{i} \in \mathrm{Tile}(\vec{I})} (\mathrm{read} \cup \mathrm{write})(\vec{i})$.

Pointwise approximations
  • Largest pointwise under-approximation: $f(x) = \bigcap_{Y \in \mathcal{C},\, x \in Y} F(Y)$.
  • Pointwise over-approximation schemes are possible.
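
A small experiment shows why the pointwise property matters. The sketch below is an illustration with hypothetical names: it builds a pointwise function from a per-element function f, and compares it with an interval hull, used here as a one-dimensional stand-in for hyper-rectangle approximations. The hull is not stable for unions, while the pointwise function is stable by construction.

```python
def union(sets):
    """Union of an iterable of sets."""
    result = set()
    for x in sets:
        result |= x
    return result

def pointwise(f):
    """Build F with F(X) = union of f(x) for x in X."""
    return lambda X: union(f(x) for x in X)

def hull(X):
    """Interval hull of a set of integers (an over-approximation of X)."""
    return set(range(min(X), max(X) + 1)) if X else set()

def stable_on(F, family1, family2):
    """Check the stability implication on two given families of sets."""
    if union(family1) != union(family2):
        return True                      # premise false: nothing to check
    return union(F(X) for X in family1) == union(F(X) for X in family2)

def largest_pointwise_under(F, family):
    """f(x) = intersection of F(Y) over all Y in the family containing x."""
    def f(x):
        images = [F(Y) for Y in family if x in Y]
        return set.intersection(*images) if images else set()
    return f
```

For instance, the families [{0}, {2}] and [{0, 2}] have the same union, but the hull yields {0, 2} for the first and {0, 1, 2} for the second, so replacing one family of tiles by another covering the same points would change the result.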

SLIDE 56

Outline

1. Motivation and challenges

2. Parametric analysis

3. Current implementation and results
  • Current status
  • Script with iscc
  • Local memory allocation for PolyBench examples

SLIDE 57

Current implementation and future work

In progress: development of an automated tool
  • iscc script (see demo) ⇒ complete tool based on ISL.
  • Implement approximation schemes: due to code and/or by choice (complexity issues). Integrate with PIPS?
  • Improve memory size computation: complexity issues, schedules (parallelism), piecewise lattice-based allocation.

To do: experiments with blocking (see also DATE’13)
  • FPGA? Workstation? GPU? Kalray MPPA?
  • Cost model for hierarchical tiling.
  • Other schemes of reuse (partial storage).

Pointwise functions
  • Useful for other approximations?

SLIDE 58

Script iscc 1/3

    # Inputs
    Params := [N, s_1, s_2] -> { : s_1 >= 0 and s_2 >= 0 };
    Domain := [N] -> { # Iteration domains
        S_1[k] : 0 <= k < 2N-1;
        S_2[i, j] : 0 <= i,j < N;
    } * Params;
    Read := [N] -> { # Read access functions
        S_2[i, j] -> A[i];
        S_2[i, j] -> B[j];
        S_2[i, j] -> C[i+j];
    } * Domain;
    Write := [N] -> { # Write access functions
        S_1[k] -> C[k];
        S_2[i, j] -> C[i+j];
    } * Domain;
    Theta := [N] -> { # Preliminary mapping
        S_1[k] -> [k, 0, 0];
        S_2[i, j] -> [i+j, i, 1];
    };

SLIDE 59

Script iscc 2/3

    # Tools for set manipulations
    Tiling := [s_1, s_2] -> { # Two dimensional tiling
        [[I_1, I_2] -> [i_1, i_2, k]] -> [i_1, i_2, k] :
            I_1 <= i_1 < I_1 + s_1 and I_2 <= i_2 < I_2 + s_2 };
    Coalesce := { [I_1, I_2] -> [[I_1, I_2] -> [i_1, i_2, k]] };
    Strip := { [I_1, I_2] -> [I_1, I_2'] };
    Prev := { # Lexicographic order
        [[I_1, I_2] -> [i_1, i_2, k]] -> [[I_1, I_2] -> [i_1', i_2', k']] :
            i_1' <= i_1 - 1 or (i_1' <= i_1 and i_2' <= i_2 - 1)
            or (i_1' <= i_1 and i_2' <= i_2 and k' <= k - 1) };
    TiledPrev := [s_1, s_2] -> { # Special "lexicographic" order
        [I_1, I_2] -> [I_1', I_2'] :
            I_1' <= I_1 - s_1 or (I_1' <= I_1 and I_2' <= I_2 - s_2) } * Strip;
    TiledNext := TiledPrev^-1;
    TiledRead := Tiling.(Theta^-1).Read;
    TiledWrite := Tiling.(Theta^-1).Write;

SLIDE 60

Script iscc 3/3

    # Set/relation computations
    In := Coalesce.(TiledRead - (Prev.TiledWrite));
    Out := Coalesce.TiledWrite;
    Load := In - ((TiledPrev.In) + (TiledPrev.Out));
    Store := Out - (TiledNext.Out);
    print coalesce (Load % Params);
    print coalesce (Store % Params);

SLIDE 61

Pipelined schedule

[Figure: pipelined schedule overlapping the operations Load(0), Load(1), Load(2), Load(3), Compute(0), Compute(1), Compute(2), and Store(−1), Store(0), Store(1), Store(2).]
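
The exact overlap drawn in the figure is not recoverable from the text, so the following Python sketch only illustrates one plausible reading, consistent with loading as late as possible and storing as soon as possible: while tile t is computed, the data of tile t + 1 is loaded and the results of tile t − 1 are stored. This three-stage arrangement and the function name are assumptions.

```python
def pipelined_schedule(num_tiles):
    """Stages of a software pipeline over tiles 0 .. num_tiles - 1.

    Operations listed in the same stage may overlap in time."""
    stages = []
    for t in range(-1, num_tiles + 1):
        stage = []
        if 0 <= t + 1 < num_tiles:
            stage.append(f"Load({t + 1})")      # prefetch the next tile
        if 0 <= t < num_tiles:
            stage.append(f"Compute({t})")       # compute the current tile
        if 0 <= t - 1 < num_tiles:
            stage.append(f"Store({t - 1})")     # write back the previous tile
        stages.append(stage)
    return stages
```

Under this reading, the buffers of two consecutive tiles are live at the same time, which is one way to interpret the larger pipelined memory sizes reported for the PolyBench examples.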

SLIDE 62

Sizes of arrays in local memory

jacobi-1d-imper
  Transformation for tiling: S0(t, i) → (t, 2t + i, 0); S1(t, j) → (t, 2t + j + 1, 1)
  Sequential memory size: A[2s1 + s2]; B[2s1 + s2 − 1]

jacobi-2d-imper
  Transformation for tiling: S0(t, i, j) → (t, 2t + i, 2t + i + j, 0); S1(t, i, j) → (t, 2t + i + 1, 2t + i + j + 1, 1)
  Sequential memory size: A[2s1 + s2, min(2s1, s2 + 1) + s3]; B[2s1 + s2 − 1, min(2s1, s2) + s3 − 1]

seidel-2d
  Transformation for tiling: S0(t, i, j) → (t, t + i, 2t + i + j)
  Sequential memory size: A[s1 + s2 + 1, min(2s1 + 2, s1 + s2, 2s2 + 2) + s3]

floyd-warshall
  Transformation for tiling: S0(k, i, j) → (k, i, j)
  Sequential memory size: path[max(k + 1, n − k), max(k + 1, n − k)]
SLIDE 63

Sizes of arrays in local memory

jacobi-1d-imper
  Transformation for tiling: S0(t, i) → (t, 2t + i, 0); S1(t, j) → (t, 2t + j + 1, 1)
  Pipelined memory size: A[2s1 + 2s2]; B[2s1 + 2s2 − 2]

jacobi-2d-imper
  Transformation for tiling: S0(t, i, j) → (t, 2t + i, 2t + i + j, 0); S1(t, i, j) → (t, 2t + i + 1, 2t + i + j + 1, 1)
  Pipelined memory size: A[2s1 + s2, min(2s1, s2 + 1) + 2s3]; B[2s1 + s2 − 1, min(2s1, s2 + 1) + 2s3 − 2]

seidel-2d
  Transformation for tiling: S0(t, i, j) → (t, t + i, 2t + i + j)
  Pipelined memory size: A[s1 + s2 + 1, min(2s1 + 2, s1 + s2, 2s2 + 2) + 2s3]

floyd-warshall
  Transformation for tiling: S0(k, i, j) → (k, i, j)
  Pipelined memory size: path[max(k + 1, n − k), max(k + 1, n − k, 2s2)]
SLIDE 64

Thank you

Questions?