Beyond Polyhedral Analysis of OpenStream Programs. Nuno Miguel Nobre, nunomiguel.nobre@manchester.ac.uk. Joint work with: Andi Drebes, Graham Riley and Antoniu Pop. IMPACT 2019: January 23, 2019


slide-1
SLIDE 1

Beyond Polyhedral Analysis of OpenStream Programs

Nuno Miguel Nobre
nunomiguel.nobre@manchester.ac.uk

Joint work with: Andi Drebes, Graham Riley and Antoniu Pop

IMPACT 2019: January 23, 2019 | Valencia, Spain

slide-4
SLIDE 4

2 / 15

How to exploit today’s machines efficiently?

Task-parallel streaming dataflow models have strong assets:

  • Point-to-point synchronization

▪ Hide latency

  • Numerous opportunities for parallelism

▪ Task, data and pipeline

  • Scheduling is the runtime’s job
  • Provide functional determinism

But also disadvantages:

  • Manually specified tasks

▪ Challenging dependency specification
▪ Hard debugging
▪ What’s the right granularity?

  • Memory footprint: no in-place writes

slide-5
SLIDE 5

3 / 15

Why the polyhedral model?

  • Arbitrarily compose loop transformations, including tiling

▪ Granularity control

  • Static program analysis

▪ Streams memory footprint/bounding

  • Multi-objective: parallelism, vectorization, multi-level cache reuse
  • Compact program representation, unlike graph algorithms
  • Despite restrictions: stencils, dense linear algebra and image filters

slide-6
SLIDE 6

Outline

4 / 15

1) Manual granularity tuning

  • Motivating example: Gauss-Seidel stencil

2) Stream bounding & automatic granularity tuning

  • The polynomial indexing problem
  • Future work solutions
slide-11
SLIDE 11

OpenStream: a (very) short overview

5 / 15

Data-flow extension to OpenMP

  • Tasks: units of work spawned as concurrent coroutines
  • Streams: unbounded channels for communication between tasks

Tasks access stream elements through sliding windows, created dynamically at runtime.

Stream accesses dictate the dependencies between tasks.

[Figure: tasks p1, p2, c1 and c2 accessing the elements a, b, c, d of a stream]
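The way window positions determine dependencies can be sketched in plain C. This is a simplified model, not the OpenStream API: the `window` type, `next_window` and `depends_on` are hypothetical names, and the sketch assumes that every access advances its window (no peeking) and that tasks are created in a fixed order.

```c
#include <assert.h>
#include <stdbool.h>

/* A window covers the stream elements [start, start + size). */
typedef struct {
    long start;
    long size;
} window;

/* Windows slide: each new access of the same kind begins where the
 * previous one ended (assumption: no peeking, fixed creation order). */
window next_window(window prev, long size)
{
    assert(size > 0);
    window w = { prev.start + prev.size, size };
    return w;
}

/* A reader depends on a writer exactly when their windows overlap. */
bool depends_on(window read, window write)
{
    return read.start < write.start + write.size
        && write.start < read.start + read.size;
}
```

For instance, if two writers each produce two elements and a first reader consumes three, that reader depends on both writers, while a second reader consuming the fourth element depends only on the second writer. The window sizes here are made up for illustration and are not taken from the figure.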

slide-20
SLIDE 20

6 / 15

1D Gauss-Seidel: stencil code granularity tuning

Sequential C [SeqC]

    for (i = 0; i < I; ++i)
      for (j = 1; j < N - 1; ++j)
        phi[j] = (phi[j - 1] + phi[j + 1]) / 2;

OpenStream: Fine-grained tasks [OS-FG]

    stream_array S[N];
    for (i = 0; i < I; ++i)
      for (j = 1; j < N - 1; ++j)
        task {
          read once from S[j];      // phi[j] (discarded)
          peek once from S[j - 1];  // phi[j - 1]
          peek once from S[j + 1];  // phi[j + 1]
          write once into S[j];     // phi[j]
          // work function:
          // phi[j] = (phi[j - 1] + phi[j + 1]) / 2;
        }

[Figure: iteration space with axes j and i; legend: previous iteration, current iteration, current grid point, not yet computed, flow dependence distance vector. The current grid point is 1/2 of the left neighbour plus 1/2 of the right neighbour.]
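To make the link between [SeqC] and [OS-FG] concrete, here is a minimal sketch in plain C that emulates the streamed version without any OpenStream runtime. It assumes each stream S[j] can be modelled as an append-only history of values, so nothing is overwritten in place. The names `gauss_seidel_seq`, `gauss_seidel_sa` and `hist` are illustrative only.

```c
#include <assert.h>
#include <stdlib.h>

/* In-place sequential 1D Gauss-Seidel, as in [SeqC]. */
void gauss_seidel_seq(double *phi, int n, int iters)
{
    for (int i = 0; i < iters; ++i)
        for (int j = 1; j < n - 1; ++j)
            phi[j] = (phi[j - 1] + phi[j + 1]) / 2;
}

/* Single-assignment model: hist[j * versions + v] is the v-th value ever
 * written to grid point j (v = 0 is the initial value). No value is
 * overwritten, so memory grows with the number of iterations. */
void gauss_seidel_sa(const double *init, double *out, int n, int iters)
{
    int versions = iters + 1;
    double *hist = malloc((size_t)n * (size_t)versions * sizeof *hist);
    assert(hist != NULL);

    for (int j = 0; j < n; ++j)
        hist[j * versions] = init[j];

    for (int i = 0; i < iters; ++i)
        for (int j = 1; j < n - 1; ++j) {
            /* Left neighbour: already updated in iteration i, unless it
             * is the boundary point, which is never rewritten. */
            int lv = (j - 1 >= 1) ? i + 1 : 0;
            /* Right neighbour: still holds the value from before
             * iteration i, unless it is the boundary point. */
            int rv = (j + 1 <= n - 2) ? i : 0;
            hist[j * versions + i + 1] =
                (hist[(j - 1) * versions + lv] +
                 hist[(j + 1) * versions + rv]) / 2;
        }

    for (int j = 0; j < n; ++j) {
        int last = (j >= 1 && j <= n - 2) ? iters : 0;
        out[j] = hist[j * versions + last];
    }
    free(hist);
}
```

Both functions should produce the same grid. The version indices `lv` and `rv` spell out which value each fine-grained task reads, and the size of `hist` illustrates the memory cost of having no in-place writes.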

slide-22
SLIDE 22

7 / 15

1D Gauss-Seidel: stencil code granularity tuning

OpenStream: Pluto-tiled tasks [OS-PT]

[Figure: iteration space with axes j and i; legend: loop tile/Pluto-tiled task, flow dependence distance vector between tiles, loop iteration/fine-grained task]

1) Semantically equivalent C code (SA)
2) Pluto source-to-source compiler
3) OpenMP parallel code [OMP-PT]
4) OpenStream: Pluto-tiled tasks [OS-PT]

OpenStream: Spatially tiled tasks [OS-ST]

[Figure: iteration space with axes j and i; legend: spatially tiled task, flow dependence distance vector between tiles, loop iteration/fine-grained task]
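As a rough illustration of granularity control, the sketch below strip-mines only the spatial j loop of the sequential stencil, which is what "spatially tiled" is assumed to mean here. Each (i, jt) pair stands for one coarse task. The function name, the `tile` parameter and the returned task count are hypothetical, and the code is sequential C rather than OpenStream.

```c
#include <assert.h>

/* 1D Gauss-Seidel with the j loop strip-mined into tiles of `tile` grid
 * points. Each (i, jt) pair stands for one coarse task. Returns the
 * number of such tasks. */
long gauss_seidel_spatial_tiles(double *phi, int n, int iters, int tile)
{
    assert(tile > 0);
    long tasks = 0;
    for (int i = 0; i < iters; ++i)
        for (int jt = 1; jt < n - 1; jt += tile) {
            /* Last tile may be partial. */
            int end = (jt + tile < n - 1) ? jt + tile : n - 1;
            for (int j = jt; j < end; ++j)
                phi[j] = (phi[j - 1] + phi[j + 1]) / 2;
            ++tasks;
        }
    return tasks;
}
```

Strip-mining keeps the original iteration order, so the result matches the untiled loop for any tile size. Only the number of tasks changes: with 8 interior points and 3 iterations, a tile size of 1 gives 24 tasks and a tile size of 4 gives 6.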

slide-23
SLIDE 23

8 / 15

1D Gauss-Seidel: results

slide-24
SLIDE 24

9 / 15

2D Gauss-Seidel: a visual picture

OpenStream: Fine-grained tasks [OS-FG]

[Figure: iteration space with axes j, k and i; legend: previous iteration, current iteration, current grid point, not yet computed, flow dependence distance vector]

slide-26
SLIDE 26

2D Gauss-Seidel: a visual picture

10 / 15

[Figure: OpenStream: Spatially tiled tasks [OS-ST]; axes j, k and i]

[Figure: OpenStream: Pluto-tiled tasks [OS-PT]; axes j, k and i]

slide-27
SLIDE 27

11 / 15

2D Gauss-Seidel: results

slide-30
SLIDE 30

The polynomial problem

12 / 15

  • Stream indexing is polynomial

▪ e.g. parametric tiling

  • Deadlock detection is undecidable

▪ Albert Cohen, Alain Darte, and Paul Feautrier. 2016. Static Analysis of OpenStream Programs

  • If a schedule is found, there is no deadlock

▪ Paul Feautrier and Albert Cohen. 2018. On Polynomial Code Generation
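A small sketch of why parametric tiling leaves the affine world: when the tile size is a runtime parameter, the index reached by a tiled loop multiplies that parameter by a loop counter. The functions below are illustrative and assume the usual strip-mined index form `tile * jt + jj`. They check non-affinity through a mixed second difference, which is zero everywhere for any affine function of `tile` and `jt`.

```c
#include <assert.h>

/* Grid point (or stream element) touched by intra-tile offset jj of
 * tile jt when the tile size is a runtime parameter. */
long tiled_index(long tile, long jt, long jj)
{
    return tile * jt + jj;
}

/* Mixed second difference in (tile, jt). For an affine function of
 * tile and jt this is 0 at every point. */
long mixed_difference(long tile, long jt, long jj)
{
    return tiled_index(tile + 1, jt + 1, jj)
         - tiled_index(tile + 1, jt, jj)
         - tiled_index(tile, jt + 1, jj)
         + tiled_index(tile, jt, jj);
}
```

Here the mixed difference is always 1, because of the product `tile * jt`. With a fixed, known tile size the same expression is affine in `jt` and `jj`, which is why only the parametric case is problematic.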

slide-33
SLIDE 33

13 / 15

Future work: bounding streams

[Figure: tasks p1, p2 and c1 accessing the elements a, b, c, d of a stream]

[Figure: dataflow task graph over p1, c1 and p2]

3-element stream: deadlock

[Figure: dataflow task graph over p1, c1 and p2 with back-pressure dependencies; the new edges form a cycle]

  • Poly. model: “just” new schedule restrictions (no schedule)

If schedule found: OpenStream’s runtime can schedule the program

slide-37
SLIDE 37

Future work: coarsening task graphs

14 / 15

[Figure: dataflow task graph over t0, t1, t2 and t3]

Arbitrary coarsening: deadlock

[Figure: coarsened graph over t0 + t3, t1 and t2]

Loop strip-mining, facilitated by stream mushing
e.g. coalescing instances of the same task

[Figure: coarsened graph over t0, t1 + t2 and t3]

If schedule found: OpenStream’s runtime can schedule the program
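The difference between the two coarsenings can be checked mechanically: merge two tasks, then test whether the dependence graph is still acyclic. The sketch below does this with a simple topological peel. The edge set used in the usage note is an assumption (a diamond t0→t1, t0→t2, t1→t3, t2→t3), since the figure's edges are not recoverable from the text, and `merge_creates_cycle` is a hypothetical helper, not part of any OpenStream tooling.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TASKS 16

/* Returns true if merging tasks a and b into one coarse task leaves a
 * cycle in the dependence graph, i.e. no valid schedule exists.
 * Edge e goes from src[e] to dst[e]; tasks are numbered 0..n-1. */
bool merge_creates_cycle(int n, int n_edges, const int *src,
                         const int *dst, int a, int b)
{
    assert(n > 0 && n <= MAX_TASKS);
    bool adj[MAX_TASKS][MAX_TASKS] = {{false}};
    int indeg[MAX_TASKS] = {0};
    bool done[MAX_TASKS] = {false};

    /* Redirect every edge end that touched b to a; drop self-loops
     * and duplicate edges. */
    for (int e = 0; e < n_edges; ++e) {
        int s = (src[e] == b) ? a : src[e];
        int d = (dst[e] == b) ? a : dst[e];
        if (s != d && !adj[s][d]) {
            adj[s][d] = true;
            ++indeg[d];
        }
    }

    int remaining = n;
    if (a != b) {
        done[b] = true; /* b no longer exists as a separate task */
        --remaining;
    }

    /* Repeatedly remove tasks with no pending predecessors. */
    bool progress = true;
    while (progress) {
        progress = false;
        for (int t = 0; t < n; ++t) {
            if (!done[t] && indeg[t] == 0) {
                done[t] = true;
                --remaining;
                for (int u = 0; u < n; ++u)
                    if (adj[t][u])
                        --indeg[u];
                progress = true;
            }
        }
    }
    return remaining > 0; /* leftover tasks sit on a cycle */
}
```

On the assumed diamond graph, merging t0 with t3 makes the coarse task both a predecessor and a successor of t1, so the check reports a cycle. Merging t1 with t2 leaves a simple chain and no cycle.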

slide-38
SLIDE 38

Summary

15 / 15

  • Task-parallel dataflow programs can benefit from polyhedral transformations
  • Analyses and transformations are hindered by polynomials
  • Bounding streams: adding back-pressure dependencies and finding a schedule
  • Granularity control: loop strip-mining? How do we align this with current techniques?