Bounded Stream Scheduling in Polyhedral OpenStream



slide-1
SLIDE 1

Bounded Stream Scheduling in Polyhedral OpenStream

Nuno Miguel Nobre | nunomiguel.nobre@manchester.ac.uk
Andi Drebes | andi.drebes@inria.fr
Graham Riley | graham.riley@manchester.ac.uk
Antoniu Pop | antoniu.pop@manchester.ac.uk

IMPACT 2020: January 22, 2020 | Bologna, Italy

slide-2
SLIDE 2

The case for streaming dataflow languages

  • Point-to-point synchronization instead of barrier synchronization: hides latency and exposes more opportunities for parallelism (task, data, pipeline)
  • No in-place writes: fewer dependencies
  • Provide functional determinism
  • Scheduling is the runtime’s job
  • Memory footprint (figure labels: GPU, FPGA)

slide-10
SLIDE 10

Outline

1) OpenStream

  • Overview & polyhedral subset
  • Computing dependencies and schedules

2) Stream bounding

  • Basic strategy & limitations
  • Usage guidelines
slide-11
SLIDE 11

OpenStream: a short overview

Data-flow extension to OpenMP

  • Tasks: units of work spawned as concurrent coroutines
  • Streams: unbounded channels for communication between tasks

Tasks access stream elements through windows, created dynamically at runtime.

Task dependencies: overlapping windows

stream s;
task p1 { write three times to s; }
task p2 { write two times to s; }
task r { peek three times from s; }
task c { read five times from s; }

Figure (panels “Control program” and “Accesses on stream s”): tasks p1, p2, r and c with their windows on s; the positions Ws and Rs are marked.
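The window bookkeeping in this example can be sketched in a few lines of Python. This is an illustration only, not OpenStream's runtime: the function names are hypothetical, and it assumes that a peek uses the current read position without advancing it, which is consistent with task c reading all five written elements.

```python
def access_windows(accesses):
    """Map each task to (kind, first index, last index) on a single stream.

    `accesses` lists (task, kind, count) in creation order.
    Assumption: a "peek" reads at the current read position without advancing it.
    """
    write_pos = read_pos = 0
    windows = {}
    for task, kind, count in accesses:
        if kind == "write":
            windows[task] = (kind, write_pos, write_pos + count - 1)
            write_pos += count
        else:
            windows[task] = (kind, read_pos, read_pos + count - 1)
            if kind == "read":
                read_pos += count
    return windows


def dependencies(windows):
    """Return (producer, consumer) pairs whose windows overlap."""
    writers = {t: w for t, w in windows.items() if w[0] == "write"}
    readers = {t: w for t, w in windows.items() if w[0] != "write"}
    return {
        (p, c)
        for p, (_, p_lo, p_hi) in writers.items()
        for c, (_, c_lo, c_hi) in readers.items()
        if p_lo <= c_hi and c_lo <= p_hi
    }


# The control program from the slide, in creation order.
PROGRAM = [("p1", "write", 3), ("p2", "write", 2), ("r", "peek", 3), ("c", "read", 5)]
WINDOWS = access_windows(PROGRAM)
DEPS = dependencies(WINDOWS)
```

Under these assumptions r depends only on p1, while c depends on both p1 and p2.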

slide-17
SLIDE 17

Polyhedral OpenStream: computing dependencies

Polyhedral control program:

  • No nested task creation
  • Affine control statements

stream s; parameter N;
for(i = 0; i < N; ++i) task tw { write two times to s; }
for(j = 0; j < N/2; ++j) task tc { read four times from s; }

Can statically count Ws and Rs and obtain access windows:

  • Ehrhart polynomials
  • Brion generating functions

Ws(tw,i) = 2i, window: [2i, 2i + 1]
Rs(tc,j) = 4j, window: [4j, 4j + 3]

Compute dependencies by intersecting windows:

2i ≤ 4j + 3 ∧ 4j ≤ 2i + 1 ⟹ 2j ≤ i ≤ 2j + 1

Figure: tc,0 depends on tw,0 and tw,1.
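As a sanity check, the window intersection can be compared against the simplified relation 2j ≤ i ≤ 2j + 1 by brute force. The sketch below is not the polyhedral machinery of the talk; it enumerates small values of N, and all function names are illustrative.

```python
def window_tw(i):
    """Write window of tw,i: two elements starting at Ws = 2i."""
    return (2 * i, 2 * i + 1)


def window_tc(j):
    """Read window of tc,j: four elements starting at Rs = 4j."""
    return (4 * j, 4 * j + 3)


def overlaps(a, b):
    """True when the closed integer intervals a and b intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def deps_by_intersection(n):
    """All (i, j) such that tc,j reads something tw,i wrote."""
    return {
        (i, j)
        for i in range(n)
        for j in range(n // 2)
        if overlaps(window_tw(i), window_tc(j))
    }


def deps_closed_form(n):
    """The same relation written as 2j <= i <= 2j + 1."""
    return {(i, j) for j in range(n // 2) for i in (2 * j, 2 * j + 1) if i < n}
```

For N = 4 both functions return {(0, 0), (1, 0), (2, 1), (3, 1)}: each tc,j depends on exactly the two writers tw,2j and tw,2j+1.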

slide-20
SLIDE 20

Polyhedral OpenStream: scheduling

Dependencies: polynomial (in)equalities $p_i(x)$, semi-algebraic sets:

$$S = \{\, x \in \mathbb{R}^d \mid p_1(x) \ge 0,\ p_2(x) \ge 0,\ \dots,\ p_n(x) \ge 0 \,\}$$

A polynomial $P(x)$ is strictly positive in $S$ iff:

$$P(x) = \sum_{k \in \mathbb{N}^n} \lambda_k \, p_1^{k_1}(x) \, p_2^{k_2}(x) \cdots p_n^{k_n}(x), \qquad \lambda_k \ge 0, \quad \sum_k \lambda_k > 0$$

Cannot possibly exhaust all $k$ in finite time:

  • Semi-decidable (undecidable) problem
  • In practice, ∼ conservative ‘Farkas lemma’
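A one-dimensional instance may make the positivity certificate concrete. The example below is illustrative and not taken from the talk; it writes the constraints as $p_1, p_2$, the target polynomial as $P$, and the non-negative multipliers as $\lambda_k$ with exponent vector $k = (k_1, k_2)$.

```latex
% Set: the unit interval, described by two affine constraints.
S = \{\, x \in \mathbb{R} \mid p_1(x) = x \ge 0,\ p_2(x) = 1 - x \ge 0 \,\}

% Target polynomial, strictly positive on S:
P(x) = 1 + x - x^2

% Certificate: products of the constraints with non-negative multipliers.
% Note that p_1(x) + p_2(x) = 1.
P(x) = \underbrace{1}_{\lambda_{(1,1)}} \, p_1(x)\, p_2(x)
     + \underbrace{1}_{\lambda_{(1,0)}} \, p_1(x)
     + \underbrace{1}_{\lambda_{(0,1)}} \, p_2(x)
     = x(1 - x) + x + (1 - x)

% All other multipliers are 0, so \lambda_k \ge 0 and \sum_k \lambda_k = 3 > 0.
```

If the total degree $k_1 + k_2$ is capped in advance, matching coefficients on both sides gives a finite linear system in the $\lambda_k$. A certificate found this way is valid, but failing to find one at a given degree proves nothing, which is where the conservativeness comes from.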
slide-23
SLIDE 23

Stream bounding: back-pressure WaRs

stream s; parameter N;
for(i = 0; i < N; ++i) task tw { write two times to s; }
for(j = 0; j < N/2; ++j) task tc { read four times from s; }

Figure: tasks tw,0 … tw,3 and tc,0, tc,1 with their accesses on stream s.

Stream bound: 4 elements

tw,2 must wait for all reads of stream indices ≤ Ws(tw,2) + (# writes) – bound – 1 = 4 + 2 – 4 – 1 = 1, that is, for tc,0.

New back-pressure dependency: some parallelism is lost
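The back-pressure rule can be tried out on this example with a short Python sketch. It assumes the reading that a writer must wait for every task that reads a stream index ≤ Ws + (# writes) – bound – 1; the function name and its parameters are hypothetical.

```python
def back_pressure_deps(n, bound, writes_per_tw=2, reads_per_tc=4):
    """Return (reader, writer) pairs: the reader must finish before the writer starts.

    Assumed rule: tw,i waits for every tc,j that reads a stream index
    <= Ws(tw,i) + (# writes) - bound - 1.
    """
    deps = []
    for i in range(n):
        ws = writes_per_tw * i                      # Ws(tw,i)
        limit = ws + writes_per_tw - bound - 1      # highest index that must be consumed
        for j in range(n // 2):
            rs = reads_per_tc * j                   # Rs(tc,j), the lowest index tc,j reads
            if rs <= limit:
                deps.append((("tc", j), ("tw", i)))
    return deps


# N = 4 tasks tw, stream bounded to 4 elements, as on the slide.
DEPS = back_pressure_deps(4, 4)
```

With these values the sketch yields two new dependencies: tc,0 before tw,2 and tc,0 before tw,3. The first two writers are unaffected because their limit is negative.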

slide-26
SLIDE 26

Stream bounding: the implications of (partial) causality

stream s; parameter N;
task tsink { read once from s; }
for(k = 1; k < N; ++k) task tw { read once from s; write once to s; }
task tsource { write once to s; }

Figure (panels “Indexing” and “Filling”): tasks tsource, tsink, tw,1, tw,2, tw,3 and their accesses on stream s. The animation fills s in the order tsource, tw,3, tw,2, tw,1, tsink.

Stream bound: 2 elements, deadlock

Caveat: at run time the stream is in fact 2-element bound, so the deadlock is spurious.
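The deadlock can be reproduced on a finite instance by adding back-pressure edges to the flow dependencies and looking for a cycle. The sketch below makes two assumptions: stream indices follow creation order (tsink, tw,1 … tw,N-1, tsource), and a writer must wait for every reader of an index ≤ (its write index – bound). Function names are illustrative.

```python
def build_deps(n, bound):
    """Flow and back-pressure edges (before, after) for the tsink/tw/tsource program."""
    reads = {"tsink": 0}            # tsink is created first, so it reads index 0
    writes = {"tsource": n - 1}     # tsource is created last, so it writes index n-1
    for k in range(1, n):
        reads[f"tw,{k}"] = k
        writes[f"tw,{k}"] = k - 1
    deps = set()
    for writer, w_idx in writes.items():
        for reader, r_idx in reads.items():
            if r_idx == w_idx:
                deps.add((writer, reader))      # flow: produce before consume
            if r_idx <= w_idx - bound:
                deps.add((reader, writer))      # back-pressure: free the slot first
    return deps


def has_cycle(deps):
    """Kahn's algorithm: a cycle exists iff some task never becomes ready."""
    nodes = {task for edge in deps for task in edge}
    indegree = {task: 0 for task in nodes}
    for _, after in deps:
        indegree[after] += 1
    ready = [task for task in nodes if indegree[task] == 0]
    done = 0
    while ready:
        task = ready.pop()
        done += 1
        for before, after in deps:
            if before == task:
                indegree[after] -= 1
                if indegree[after] == 0:
                    ready.append(after)
    return done < len(nodes)


def smallest_safe_bound(n):
    """Smallest static bound for which the dependency graph stays acyclic."""
    bound = 1
    while has_cycle(build_deps(n, bound)):
        bound += 1
    return bound


print(has_cycle(build_deps(4, 2)))   # True: the 2-element bound deadlocks
print(smallest_safe_bound(4))        # 4
```

In this model tsource writes the highest index, so any bound below N forces it to wait for tsink, which transitively waits for tsource.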

slide-35
SLIDE 35

Stream bounding: global surface minimization

stream s1, s2;
task tw1 { write two times to s1; }
task tw2 { write three times to s2; }
task ta { write two times to s2; read two times from s1; }
task tb { write once to s1; read three times from s2; }
task tc1 { read once from s1; }
task tc2 { read two times from s2; }

Figure: tasks tw1, tw2, ta, tb, tc1, tc2 and their accesses on streams s1 and s2; ta and tb are highlighted.

Minimum bounds:

  • s1: 2 elements
  • s2: 3 elements

We can have one, but not both:

  • s1: 2 elements & s2: ≥5 elements
  • s1: ≥3 elements & s2: 3 elements
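The trade-off between the two bounds can be checked on this program with a small Python sketch. Acyclicity of the dependency graph is used here as a stand-in for "a schedule exists". The window indices are derived from creation order, and the back-pressure rule (a writer waits for every reader of an index ≤ its last written index – bound) is an assumed reading. All names are illustrative.

```python
# (task, stream, first index, last index), derived from creation order
# tw1, tw2, ta, tb, tc1, tc2.
WRITES = [("tw1", "s1", 0, 1), ("tw2", "s2", 0, 2), ("ta", "s2", 3, 4), ("tb", "s1", 2, 2)]
READS = [("ta", "s1", 0, 1), ("tb", "s2", 0, 2), ("tc1", "s1", 2, 2), ("tc2", "s2", 3, 4)]


def edges(bounds):
    """Flow plus back-pressure edges (before, after) for the given stream bounds."""
    deps = set()
    for writer, w_stream, w_lo, w_hi in WRITES:
        for reader, r_stream, r_lo, r_hi in READS:
            if w_stream != r_stream:
                continue
            if w_lo <= r_hi and r_lo <= w_hi:
                deps.add((writer, reader))          # overlapping windows
            if r_lo <= w_hi - bounds[w_stream]:
                deps.add((reader, writer))          # slot must be freed first
    return deps


def is_acyclic(deps):
    """Kahn's algorithm: acyclic iff every task eventually becomes ready."""
    nodes = {task for edge in deps for task in edge}
    indegree = {task: 0 for task in nodes}
    for _, after in deps:
        indegree[after] += 1
    ready = [task for task in nodes if indegree[task] == 0]
    done = 0
    while ready:
        task = ready.pop()
        done += 1
        for before, after in deps:
            if before == task:
                indegree[after] -= 1
                if indegree[after] == 0:
                    ready.append(after)
    return done == len(nodes)


def feasible(s1, s2):
    return is_acyclic(edges({"s1": s1, "s2": s2}))
```

Under these assumptions `feasible(2, 3)` is false because ta and tb end up waiting on each other, while `feasible(2, 5)` and `feasible(3, 3)` are both true, matching the two combinations listed above.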

slide-40
SLIDE 40

Stream bounding: global surface minimization

stream s1, s2;
task tw1 { write two times to s1; }
task tw2 { write three times to s2; }
task ta { write two times to s2; read two times from s1; }
task tb { write once to s1; read three times from s2; }
task tc1 { read once from s1; }
task tc2 { read two times from s2; }

Minimum bounds: s1: 2 elements, s2: 3 elements

The return of the causality caveat, assume these bounds: s1: 2 elements & s2: 3 elements

Stream occupancy after each task in the animation:

Task      s1 (elements)   s2 (elements)
(start)   0               0
tw1       2               0
ta        0               2
tc2       0               0
tw2       0               3
tb        1               0
tc1       0               0
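The occupancy figures in this animation can be reproduced by counting elements as tasks run. The sketch below is a simplification: it tracks only how many elements each stream holds, not which indices they occupy, and it assumes a task consumes its inputs before producing its outputs.

```python
# Elements each task writes to and reads from each stream (from the program above).
TASKS = {
    "tw1": {"writes": {"s1": 2}, "reads": {}},
    "tw2": {"writes": {"s2": 3}, "reads": {}},
    "ta":  {"writes": {"s2": 2}, "reads": {"s1": 2}},
    "tb":  {"writes": {"s1": 1}, "reads": {"s2": 3}},
    "tc1": {"writes": {}, "reads": {"s1": 1}},
    "tc2": {"writes": {}, "reads": {"s2": 2}},
}


def occupancy_trace(order):
    """Return (task, elements in s1, elements in s2) after each task in `order`."""
    level = {"s1": 0, "s2": 0}
    trace = []
    for name in order:
        for stream, count in TASKS[name]["reads"].items():
            level[stream] -= count
            if level[stream] < 0:
                raise ValueError(f"{name} reads from {stream} before it is filled")
        for stream, count in TASKS[name]["writes"].items():
            level[stream] += count
        trace.append((name, level["s1"], level["s2"]))
    return trace


# Execution order shown in the animation.
TRACE = occupancy_trace(["tw1", "ta", "tc2", "tw2", "tb", "tc1"])
PEAK_S1 = max(s1 for _, s1, _ in TRACE)
PEAK_S2 = max(s2 for _, _, s2 in TRACE)
```

In this execution order s1 never holds more than 2 elements and s2 never more than 3, even though static back-pressure dependencies cannot enforce both bounds at once.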

slide-47
SLIDE 47

Stream bounding: application guidelines

Can we run a given program on a device with memory M?

1) Select a stream bounds combination s.t. Σs bounds = M
2) Add back-pressure dependencies for this combination
3) Look for a schedule
4) If found: guaranteed execution
   If not found: if other combinations are available, go back to 1); if all are exhausted, conservatively assume execution is not possible
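The procedure above amounts to a search over bound combinations. The Python sketch below shows only that outer loop. `has_schedule` is a hypothetical callback standing in for steps 2 and 3 (adding back-pressure dependencies and running the scheduler), and `toy_oracle` simply encodes the outcome stated for the two-stream example (s1 ≥ 2 with s2 ≥ 5, or s1 ≥ 3 with s2 ≥ 3).

```python
def compositions(total, parts, minimum=1):
    """Yield every tuple of `parts` integers >= minimum that sums to `total`."""
    if parts == 1:
        if total >= minimum:
            yield (total,)
        return
    for first in range(minimum, total - minimum * (parts - 1) + 1):
        for rest in compositions(total - first, parts - 1, minimum):
            yield (first,) + rest


def can_run(memory, n_streams, has_schedule):
    """Return the first bound combination with a schedule, or None.

    None is a conservative answer: no tried combination could be scheduled.
    """
    for bounds in compositions(memory, n_streams):
        if has_schedule(bounds):        # steps 2 and 3, delegated to the callback
            return bounds               # step 4: execution is guaranteed
    return None                         # all combinations exhausted


def toy_oracle(bounds):
    """Hypothetical stand-in for the scheduler, based on the two-stream example."""
    s1, s2 = bounds
    return (s1 >= 2 and s2 >= 5) or (s1 >= 3 and s2 >= 3)


print(can_run(5, 2, toy_oracle))   # None
print(can_run(6, 2, toy_oracle))   # (3, 3)
print(can_run(7, 2, toy_oracle))   # (2, 5)
```

Memory is counted in stream elements here. A real implementation would replace `toy_oracle` with the actual schedule search.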

slide-48
SLIDE 48

Summary

Back-pressure dependencies:

1) Bound streams
2) Statically, but conservatively, decide execution in limited memory
3) Limitations:

  • Causality-induced ‘spurious’ deadlocks
  • Non-independent stream minimization
  • Overestimation of actual memory usage
  • Deadlock detection undecidability