

SLIDE 1

Dynamic Expressivity with Static Optimization for Streaming Languages


Robert Soulé (Cornell), Michael I. Gordon (MIT), Saman Amarasinghe (MIT), Robert Grimm (NYU), Martin Hirzel (IBM)

DEBS 2013

SLIDE 2

Problem

[Diagram: a stream is a FIFO queue connecting operators; the decoder pipeline VideoInput →* Huffman → IQuant → IDCT has one dynamic-rate edge, marked *.]

“Rate” = number of queue pushes/pops per operator firing.
  • Dynamic rate (varies at runtime) requires dynamic expressivity.
  • Static rate (known at compile time) enables static optimization.
  • How to get both? Observation: applications are “mostly static” (Thies and Amarasinghe [PACT 2010]).

SLIDE 3

StreamIt, a Streaming Language Designed for Static Optimization


    float->float pipeline ABC {
      add float->float filter A() { work pop … push 2 { … } }
      add float->float filter B() { work pop 3 push 1 { … } }
      add float->float filter C() { work pop 2 push … { … } }
    }

Statically known push/pop rates (SDF = Synchronous Dataflow).

[Diagram: pipeline A → B → C; A pushes 2 per firing, B pops 3 and pushes 1 per firing, C pops 2 per firing.]

SLIDE 4

SDF Steady-State Schedule


[Diagram: the pipeline A → B → C with rates 2, 3, 1, 2; the steady-state schedule repeats a fixed firing sequence of A, B, and C.]

Statically known firing order and FIFO queue sizes.
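The firing counts in such a schedule come from the SDF balance equations: for every edge u → v with push rate p and pop rate c, the steady state needs r(u)·p = r(v)·c. Below is a hypothetical Python sketch (my own code, not from the talk; the function name and edge encoding are assumptions) that solves these equations for a straight pipeline like the A → B → C example:

```python
from fractions import Fraction
from math import gcd

# Hypothetical sketch: solve the SDF balance equations r[u]*push == r[v]*pop
# for a straight pipeline; edges are (producer, consumer, push, pop) tuples.
def repetition_vector(edges):
    rates = {}
    for u, v, push, pop in edges:
        rates.setdefault(u, Fraction(1))      # anchor the first operator at 1
        rates[v] = rates[u] * push / pop      # propagate the rate downstream
    # Scale by the LCM of the denominators to get minimal integer counts.
    lcm = 1
    for r in rates.values():
        lcm = lcm * r.denominator // gcd(lcm, r.denominator)
    return {op: int(r * lcm) for op, r in rates.items()}

# The slides' example: A pushes 2, B pops 3 / pushes 1, C pops 2.
print(repetition_vector([("A", "B", 2, 3), ("B", "C", 1, 2)]))
# {'A': 3, 'B': 2, 'C': 1}
```

This reproduces the schedule implied by the diagram: A fires 3 times, B twice, C once per steady state.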

SLIDE 5

Scalarization


[Diagram: the steady-state firing sequence rewritten so queue slots become locals r1…r6, e.g. A writes r1=…, r2=…; B reads …=r1, …=r2, …=r3.]

Implement the FIFO queue via local variables, or even registers (more intricate with “peek”, not shown in this talk).
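To make the transformation concrete, here is a hypothetical sketch (my own code, not from the talk) of what scalarization produces for the A(push 2) → B(pop 3, push 1) → C(pop 2) pipeline: the two FIFO queues inside one steady state become plain locals, with placeholder work functions standing in for the real filter bodies.

```python
# Stand-ins for the filters' work functions (the real bodies are elided
# in the slides); the signatures encode the static push/pop rates.
def a():            # A: produces 2 values per firing
    return 1.0, 2.0

def b(x, y, z):     # B: consumes 3, produces 1 per firing
    return x + y + z

def c(x, y):        # C: consumes 2 per firing
    return x * y

def fused_steady_state():
    # One steady state = A x3, B x2, C x1 (from the balance equations).
    # Queue A->B held entirely in locals -- no FIFO exists at runtime:
    r1, r2 = a()
    r3, r4 = a()
    r5, r6 = a()
    # Queue B->C likewise:
    s1 = b(r1, r2, r3)
    s2 = b(r4, r5, r6)
    return c(s1, s2)

print(fused_steady_state())
```

A compiler can then promote r1…r6 and s1, s2 to registers, since the firing order is fixed at compile time.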

SLIDE 6

Fission (Data Parallelism)


[Diagram: a round-robin split feeding data-parallel replicas X1 and X2, joined by a round-robin merge; shown replicated across cores.]

Round-robin split and merge rely on static rates.
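A hypothetical sketch of fission (my own code, not from the talk): because the operator's rate is static, a round-robin split can deal items to replicas deterministically and a round-robin merge can restore the original order without any runtime coordination.

```python
def x(v):                      # stand-in for the operator being replicated
    return v * v

def fissed(stream, replicas=2):
    # Round-robin split: item i goes to replica i % replicas.
    shards = [stream[i::replicas] for i in range(replicas)]
    # Each replica processes its shard (data-parallel; sequential here
    # for clarity).
    outs = [[x(v) for v in shard] for shard in shards]
    # Round-robin merge restores the original order.
    merged = []
    for i in range(len(stream)):
        merged.append(outs[i % replicas][i // replicas])
    return merged

print(fissed([1, 2, 3, 4, 5]))  # [1, 4, 9, 16, 25]
```

With a dynamic rate, the merge could not know which replica holds the next item, which is why fission depends on static rates.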

SLIDE 7

Dynamic Rates

[Diagram: VideoInput →* Huffman → IQuant → IDCT]

    float->float pipeline Decoder {
      add float->float filter VideoInput() { work pop 1 push 1 { … } }
      add float->float filter Huffman() { work pop * push 1 { … } }
      add float->float filter IQuant() { work pop 64 push 64 { … } }
      add float->float filter IDCT() { work pop 8 push 8 { … } }
    }

Huffman’s pop rate is dynamic (*). No more static optimization?

SLIDE 8

Dynamic Scheduling Approaches


Scheduling approach | Description | Representative citation
OS Thread | Each operator has its own thread | SPC, Amini et al. [DMSSP 2006]
Demand | Recruit from a thread pool | Aurora, Abadi et al. [VLDBJ 2003]
No-op | Static rate; send a nonce when no data | CQL, Arasu et al. [VLDBJ 2006]

SLIDE 9

Our Approach: Locally Static + Globally Dynamic


  • 1. Partitioning into static subgraphs
  • 2. Locally optimize static subgraphs
  • 2a. Fusion
  • 2b. Scalarization
  • 2c. Fission
  • 3. Placement
  • 3a. Core placement
  • 3b. Thread placement
  • 4. Globally dynamic scheduling
SLIDE 10

Partition into Static Subgraphs


[Diagram: before partitioning, VideoInput →* Huffman → IQuant → IDCT is one graph; after, the dynamic edge is cut and the remaining operators form static subgraphs.]

Static subgraph: Weakly connected component after deleting dynamic edges.
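This definition is directly implementable. A hypothetical Python sketch (my own code, not from the talk; names and the edge encoding are assumptions) that deletes the dynamic edges and collects weakly connected components via an undirected traversal:

```python
# Hypothetical sketch of step 1: delete dynamic-rate edges, then take the
# weakly connected components of what remains as the static subgraphs.
def static_subgraphs(operators, edges, dynamic_edges):
    # Undirected adjacency over static edges only (weak connectivity
    # ignores edge direction).
    adj = {op: [] for op in operators}
    for u, v in edges:
        if (u, v) not in dynamic_edges:
            adj[u].append(v)
            adj[v].append(u)
    seen, components = set(), []
    for op in operators:
        if op in seen:
            continue
        stack, comp = [op], []
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.append(n)
            stack.extend(adj[n])
        components.append(sorted(comp))
    return components

ops = ["VideoInput", "Huffman", "IQuant", "IDCT"]
edges = [("VideoInput", "Huffman"), ("Huffman", "IQuant"), ("IQuant", "IDCT")]
dynamic = {("VideoInput", "Huffman")}  # the '*' edge: Huffman pops a dynamic amount
print(static_subgraphs(ops, edges, dynamic))
```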

SLIDE 11

Locally Optimize Static Subgraphs

[Diagram: after partitioning, each static subgraph is optimized in place: fusion combines its operators, scalarization replaces its internal queues with locals, and fission replicates the fused IQuant → IDCT subgraph; the dynamic-rate edge at Huffman is left untouched.]

SLIDE 12

Core Placement

[Diagram: three cores; the fused IQuant → IDCT replicas are placed one per core, with VideoInput and Huffman packed alongside them.]

Place fission replicas on all cores. Static weight estimate and greedy bin-packing.
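A hypothetical sketch of the greedy bin-packing step (my own code, not from the talk; the weight numbers are invented for illustration): sort operators by their static weight estimate, then repeatedly assign the heaviest remaining operator to the least-loaded core.

```python
# Hypothetical sketch of core placement by greedy bin-packing over
# static weight estimates.
def place(weights, num_cores):
    cores = [[] for _ in range(num_cores)]
    loads = [0.0] * num_cores
    # Heaviest-first assignment tends to balance the bins better.
    for op, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        i = loads.index(min(loads))   # pick the least-loaded core so far
        cores[i].append(op)
        loads[i] += w
    return cores, loads

# Invented weights for the decoder's operators and fission replicas.
weights = {"VideoInput": 1.0, "Huffman": 3.0,
           "IQuantIDCT_1": 2.0, "IQuantIDCT_2": 2.0, "IQuantIDCT_3": 2.0}
cores, loads = place(weights, 3)
print(cores, loads)
```

Greedy bin-packing is a standard heuristic here; the paper's actual weight model is not shown on the slide.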

SLIDE 13

Thread Placement

[Diagram: the same placement as before, with one thread drawn per static subgraph per core.]

One pinned thread per static subgraph per core (the runtime must be able to suspend a dynamic reader when it has no input).

SLIDE 14

Dynamic Scheduling

[Diagram: VideoInput, Huffman, and the IQuant → IDCT replicas synchronizing at a barrier; legend distinguishes control edges.]

Use condition variables for hand-off to the successor.
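A hypothetical sketch of the condition-variable hand-off (my own code, not from the talk; the class and method names are assumptions): the reader on a dynamic edge suspends when the queue is empty and is woken by its upstream writer.

```python
import threading
from collections import deque

class DynamicQueue:
    """Queue on a dynamic-rate edge: the reader blocks when empty."""
    def __init__(self):
        self.items = deque()
        self.cond = threading.Condition()
        self.closed = False

    def push(self, item):
        with self.cond:
            self.items.append(item)
            self.cond.notify()        # hand off to the suspended successor

    def close(self):
        with self.cond:
            self.closed = True
            self.cond.notify()

    def pop(self):
        with self.cond:
            while not self.items and not self.closed:
                self.cond.wait()      # suspend the reader when no input
            return self.items.popleft() if self.items else None

q = DynamicQueue()
out = []
reader = threading.Thread(
    target=lambda: [out.append(v) for v in iter(q.pop, None)])
reader.start()
for v in [1, 2, 3]:
    q.push(v)
q.close()
reader.join()
print(out)  # [1, 2, 3]
```

This is exactly the suspend-when-no-input behavior the thread-placement slide requires of dynamic readers.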

SLIDE 15

Data Pipelining

[Diagram: the same graph with the barrier replaced by buffered data edges; legend distinguishes data and control edges.]

Use a buffer for pipeline parallelism.
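A hypothetical sketch of the buffered hand-off (my own code, not from the talk): a bounded buffer between two subgraphs lets the producer run ahead while the consumer drains it, so the two stages overlap instead of alternating at a barrier.

```python
import threading
import queue

buf = queue.Queue(maxsize=8)   # buffer bound limits how far ahead we run
results = []

def producer():
    for i in range(5):
        buf.put(i * i)         # upstream subgraph keeps firing
    buf.put(None)              # end-of-stream marker

def consumer():
    while (item := buf.get()) is not None:
        results.append(item)   # downstream subgraph overlaps the producer

threads = [threading.Thread(target=producer),
           threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [0, 1, 4, 9, 16]
```

The buffer size is a tuning knob: larger buffers give more slack for pipeline parallelism at the cost of latency and memory.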

SLIDE 16

Dynamic vs. Static Performance

[Chart: throughput vs. operator weight for the benchmark pipeline FileReader → Work(W/2) →* Work(W/2) → FileWriter.]

Close enough for heavy operators, but what about light operators?

SLIDE 17

Amortizing the Thread Switching Overhead


  • 1. Partitioning into static subgraphs
  • 2. Locally optimize static subgraphs
  • 2a. Fusion
  • 2b. Scalarization
  • 2c. Fission
  • 2d. Batching
  • 3. Placement
  • 3a. Core placement
  • 3b. Thread placement
  • 4. Globally dynamic scheduling
SLIDE 18

Benefit of Batching

[Chart: throughput vs. number of dynamic queues for the benchmark pipeline FileReader → Work(1) →* Work(1) → FileWriter.]

Amortize thread-switching overhead even without heavy operators.
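The amortization argument can be illustrated with a hypothetical sketch (my own code, not from the talk): handing off a batch of N items per thread switch divides the per-switch cost by N, which matters precisely when each operator firing does little work.

```python
# Hypothetical sketch of batching (step 2d): count hand-offs when items
# cross a dynamic edge one at a time vs. in batches.
def run(stream, batch_size):
    handoffs = 0
    out = []
    batch = []
    for item in stream:
        batch.append(item + 1)        # light operator: little work per item
        if len(batch) == batch_size:
            out.extend(batch)         # one hand-off moves the whole batch
            handoffs += 1
            batch = []
    if batch:                         # flush a final partial batch
        out.extend(batch)
        handoffs += 1
    return out, handoffs

out, h1 = run(range(1000), 1)     # unbatched: one hand-off per item
_, h2 = run(range(1000), 100)     # batched: 100x fewer hand-offs
print(h1, h2)  # 1000 10
```

The trade-off, as with the pipeline buffer, is latency: items wait until their batch fills before crossing the edge.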

SLIDE 19

Our vs. Other Dynamic Schedulers Performance


Scheduling approach | Experiment | Result
OS Thread | 32 threads, 1 core, work 31 per operator | Our scheduler is 10x faster
Demand | Huffman encoder and decoder | Our scheduler is 1.2x faster
No-op | 2 programs: VWAP and predicate filter | Our scheduler is 5.1x and 4.9x faster

Our scheduler was faster in all cases (see paper for details).

SLIDE 20

Conclusions

  • Static streaming languages such as StreamIt enable powerful optimizations.
  • But many real-world applications require dynamic rates.
  • We extend the StreamIt optimizing compiler to handle dynamic rates.