

SLIDE 1

Dynamic Expressivity with Static Optimization for Streaming Languages


Robert Soulé (Cornell), Michael I. Gordon (MIT), Saman Amarasinghe (MIT), Robert Grimm (NYU), Martin Hirzel (IBM)

DEBS 2013

SLIDE 2

Problem

[Diagram: a stream is a FIFO queue connecting operators; the decoder pipeline VideoInput →* Huffman → IQuant → IDCT has one dynamic-rate edge, marked *.]

“Rate” = number of queue pushes/pops per operator firing.
  • Dynamic rate (varies at runtime) requires dynamic expressivity.
  • Static rate (known at compile time) enables static optimization.
  • How to get both? Observation: applications are “mostly static” (Thies and Amarasinghe [PACT 2010]).

SLIDE 3

StreamIt, a Streaming Language Designed for Static Optimization


    float->float pipeline ABC {
      add float->float filter A() { work pop … push 2 { … } }
      add float->float filter B() { work pop 3 push 1 { … } }
      add float->float filter C() { work pop 2 push … { … } }
    }

Statically known push/pop rates (SDF = Synchronous Dataflow).

[Diagram: pipeline A → B → C; A pushes 2 per firing, B pops 3 and pushes 1 per firing, C pops 2 per firing.]

SLIDE 4

SDF Steady-State Schedule


[Diagram: the pipeline A → B → C with rates 2, 3, 1, 2; the steady-state schedule repeats a fixed firing sequence of A, B, and C.]

Statically known firing order and FIFO queue sizes.
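The firing counts in such a schedule come from the SDF balance equations: for every edge u → v with push rate p and pop rate c, the steady state needs r(u)·p = r(v)·c. Below is a hypothetical Python sketch (my own code, not from the talk; the function name and edge encoding are assumptions) that solves these equations for a straight pipeline like the A → B → C example:

```python
from fractions import Fraction
from math import gcd

# Hypothetical sketch: solve the SDF balance equations r[u]*push == r[v]*pop
# for a straight pipeline; edges are (producer, consumer, push, pop) tuples.
def repetition_vector(edges):
    rates = {}
    for u, v, push, pop in edges:
        rates.setdefault(u, Fraction(1))      # anchor the first operator at 1
        rates[v] = rates[u] * push / pop      # propagate the rate downstream
    # Scale by the LCM of the denominators to get minimal integer counts.
    lcm = 1
    for r in rates.values():
        lcm = lcm * r.denominator // gcd(lcm, r.denominator)
    return {op: int(r * lcm) for op, r in rates.items()}

# The slides' example: A pushes 2, B pops 3 / pushes 1, C pops 2.
print(repetition_vector([("A", "B", 2, 3), ("B", "C", 1, 2)]))
# {'A': 3, 'B': 2, 'C': 1}
```

This reproduces the schedule implied by the diagram: A fires 3 times, B twice, C once per steady state.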

SLIDE 5

Scalarization


[Diagram: the steady-state firing sequence rewritten so queue slots become locals r1…r6, e.g. A writes r1=…, r2=…; B reads …=r1, …=r2, …=r3.]

Implement the FIFO queue via local variables, or even registers (more intricate with “peek”, not shown in this talk).
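To make the transformation concrete, here is a hypothetical sketch (my own code, not from the talk) of what scalarization produces for the A(push 2) → B(pop 3, push 1) → C(pop 2) pipeline: the two FIFO queues inside one steady state become plain locals, with placeholder work functions standing in for the real filter bodies.

```python
# Stand-ins for the filters' work functions (the real bodies are elided
# in the slides); the signatures encode the static push/pop rates.
def a():            # A: produces 2 values per firing
    return 1.0, 2.0

def b(x, y, z):     # B: consumes 3, produces 1 per firing
    return x + y + z

def c(x, y):        # C: consumes 2 per firing
    return x * y

def fused_steady_state():
    # One steady state = A x3, B x2, C x1 (from the balance equations).
    # Queue A->B held entirely in locals -- no FIFO exists at runtime:
    r1, r2 = a()
    r3, r4 = a()
    r5, r6 = a()
    # Queue B->C likewise:
    s1 = b(r1, r2, r3)
    s2 = b(r4, r5, r6)
    return c(s1, s2)

print(fused_steady_state())
```

A compiler can then promote r1…r6 and s1, s2 to registers, since the firing order is fixed at compile time.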

SLIDE 6

Fission (Data Parallelism)


[Diagram: a round-robin split feeding data-parallel replicas X1 and X2, joined by a round-robin merge; shown replicated across cores.]

Round-robin split and merge rely on static rates.
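A hypothetical sketch of fission (my own code, not from the talk): because the operator's rate is static, a round-robin split can deal items to replicas deterministically and a round-robin merge can restore the original order without any runtime coordination.

```python
def x(v):                      # stand-in for the operator being replicated
    return v * v

def fissed(stream, replicas=2):
    # Round-robin split: item i goes to replica i % replicas.
    shards = [stream[i::replicas] for i in range(replicas)]
    # Each replica processes its shard (data-parallel; sequential here
    # for clarity).
    outs = [[x(v) for v in shard] for shard in shards]
    # Round-robin merge restores the original order.
    merged = []
    for i in range(len(stream)):
        merged.append(outs[i % replicas][i // replicas])
    return merged

print(fissed([1, 2, 3, 4, 5]))  # [1, 4, 9, 16, 25]
```

With a dynamic rate, the merge could not know which replica holds the next item, which is why fission depends on static rates.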

SLIDE 7

Dynamic Rates

[Diagram: VideoInput →* Huffman → IQuant → IDCT]

    float->float pipeline Decoder {
      add float->float filter VideoInput() { work pop 1 push 1 { … } }
      add float->float filter Huffman() { work pop * push 1 { … } }
      add float->float filter IQuant() { work pop 64 push 64 { … } }
      add float->float filter IDCT() { work pop 8 push 8 { … } }
    }

Huffman’s pop rate is dynamic (*). No more static optimization?

SLIDE 8

Dynamic Scheduling Approaches


Scheduling approach | Description | Representative citation
OS Thread | Each operator has its own thread | SPC, Amini et al. [DMSSP 2006]
Demand | Recruit from a thread pool | Aurora, Abadi et al. [VLDBJ 2003]
No-op | Static rate; send a nonce when no data | CQL, Arasu et al. [VLDBJ 2006]

SLIDE 9

Our Approach: Locally Static + Globally Dynamic


  • 1. Partitioning into static subgraphs
  • 2. Locally optimize static subgraphs
  • 2a. Fusion
  • 2b. Scalarization
  • 2c. Fission
  • 3. Placement
  • 3a. Core placement
  • 3b. Thread placement
  • 4. Globally dynamic scheduling
SLIDE 10

Partition into Static Subgraphs


[Diagram: before partitioning, VideoInput →* Huffman → IQuant → IDCT is one graph; after, the dynamic edge is cut and the remaining operators form static subgraphs.]

Static subgraph: Weakly connected component after deleting dynamic edges.
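This definition is directly implementable. A hypothetical Python sketch (my own code, not from the talk; names and the edge encoding are assumptions) that deletes the dynamic edges and collects weakly connected components via an undirected traversal:

```python
# Hypothetical sketch of step 1: delete dynamic-rate edges, then take the
# weakly connected components of what remains as the static subgraphs.
def static_subgraphs(operators, edges, dynamic_edges):
    # Undirected adjacency over static edges only (weak connectivity
    # ignores edge direction).
    adj = {op: [] for op in operators}
    for u, v in edges:
        if (u, v) not in dynamic_edges:
            adj[u].append(v)
            adj[v].append(u)
    seen, components = set(), []
    for op in operators:
        if op in seen:
            continue
        stack, comp = [op], []
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.append(n)
            stack.extend(adj[n])
        components.append(sorted(comp))
    return components

ops = ["VideoInput", "Huffman", "IQuant", "IDCT"]
edges = [("VideoInput", "Huffman"), ("Huffman", "IQuant"), ("IQuant", "IDCT")]
dynamic = {("VideoInput", "Huffman")}  # the '*' edge: Huffman pops a dynamic amount
print(static_subgraphs(ops, edges, dynamic))
```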

SLIDE 11

Locally Optimize Static Subgraphs

[Diagram: after partitioning, each static subgraph is optimized in place: fusion combines its operators, scalarization replaces its internal queues with locals, and fission replicates the fused IQuant → IDCT subgraph; the dynamic-rate edge at Huffman is left untouched.]

SLIDE 12

Core Placement

[Diagram: three cores; the fused IQuant → IDCT replicas are placed one per core, with VideoInput and Huffman packed alongside them.]

Place fission replicas on all cores. Static weight estimate and greedy bin-packing.
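A hypothetical sketch of the greedy bin-packing step (my own code, not from the talk; the weight numbers are invented for illustration): sort operators by their static weight estimate, then repeatedly assign the heaviest remaining operator to the least-loaded core.

```python
# Hypothetical sketch of core placement by greedy bin-packing over
# static weight estimates.
def place(weights, num_cores):
    cores = [[] for _ in range(num_cores)]
    loads = [0.0] * num_cores
    # Heaviest-first assignment tends to balance the bins better.
    for op, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        i = loads.index(min(loads))   # pick the least-loaded core so far
        cores[i].append(op)
        loads[i] += w
    return cores, loads

# Invented weights for the decoder's operators and fission replicas.
weights = {"VideoInput": 1.0, "Huffman": 3.0,
           "IQuantIDCT_1": 2.0, "IQuantIDCT_2": 2.0, "IQuantIDCT_3": 2.0}
cores, loads = place(weights, 3)
print(cores, loads)
```

Greedy bin-packing is a standard heuristic here; the paper's actual weight model is not shown on the slide.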

SLIDE 13

Thread Placement

[Diagram: the same placement as before, with one thread drawn per static subgraph per core.]

One pinned thread per static subgraph per core (the runtime must be able to suspend a dynamic reader when it has no input).

SLIDE 14

Dynamic Scheduling

[Diagram: VideoInput, Huffman, and the IQuant → IDCT replicas synchronizing at a barrier; legend distinguishes control edges.]

Use condition variables for hand-off to the successor.
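A hypothetical sketch of the condition-variable hand-off (my own code, not from the talk; the class and method names are assumptions): the reader on a dynamic edge suspends when the queue is empty and is woken by its upstream writer.

```python
import threading
from collections import deque

class DynamicQueue:
    """Queue on a dynamic-rate edge: the reader blocks when empty."""
    def __init__(self):
        self.items = deque()
        self.cond = threading.Condition()
        self.closed = False

    def push(self, item):
        with self.cond:
            self.items.append(item)
            self.cond.notify()        # hand off to the suspended successor

    def close(self):
        with self.cond:
            self.closed = True
            self.cond.notify()

    def pop(self):
        with self.cond:
            while not self.items and not self.closed:
                self.cond.wait()      # suspend the reader when no input
            return self.items.popleft() if self.items else None

q = DynamicQueue()
out = []
reader = threading.Thread(
    target=lambda: [out.append(v) for v in iter(q.pop, None)])
reader.start()
for v in [1, 2, 3]:
    q.push(v)
q.close()
reader.join()
print(out)  # [1, 2, 3]
```

This is exactly the suspend-when-no-input behavior the thread-placement slide requires of dynamic readers.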

SLIDE 15

Data Pipelining

[Diagram: the same graph with the barrier replaced by buffered data edges; legend distinguishes data and control edges.]

Use a buffer for pipeline parallelism.
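A hypothetical sketch of the buffered hand-off (my own code, not from the talk): a bounded buffer between two subgraphs lets the producer run ahead while the consumer drains it, so the two stages overlap instead of alternating at a barrier.

```python
import threading
import queue

buf = queue.Queue(maxsize=8)   # buffer bound limits how far ahead we run
results = []

def producer():
    for i in range(5):
        buf.put(i * i)         # upstream subgraph keeps firing
    buf.put(None)              # end-of-stream marker

def consumer():
    while (item := buf.get()) is not None:
        results.append(item)   # downstream subgraph overlaps the producer

threads = [threading.Thread(target=producer),
           threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [0, 1, 4, 9, 16]
```

The buffer size is a tuning knob: larger buffers give more slack for pipeline parallelism at the cost of latency and memory.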

SLIDE 16

Dynamic vs. Static Performance

[Chart: throughput vs. operator weight for the benchmark pipeline FileReader → Work(W/2) →* Work(W/2) → FileWriter.]

Close enough for heavy operators, but what about light operators?

SLIDE 17

Amortizing the Thread Switching Overhead


  • 1. Partitioning into static subgraphs
  • 2. Locally optimize static subgraphs
  • 2a. Fusion
  • 2b. Scalarization
  • 2c. Fission
  • 2d. Batching
  • 3. Placement
  • 3a. Core placement
  • 3b. Thread placement
  • 4. Globally dynamic scheduling
SLIDE 18

Benefit of Batching

[Chart: throughput vs. number of dynamic queues for the benchmark pipeline FileReader → Work(1) →* Work(1) → FileWriter.]

Amortize thread-switching overhead even without heavy operators.
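The amortization argument can be illustrated with a hypothetical sketch (my own code, not from the talk): handing off a batch of N items per thread switch divides the per-switch cost by N, which matters precisely when each operator firing does little work.

```python
# Hypothetical sketch of batching (step 2d): count hand-offs when items
# cross a dynamic edge one at a time vs. in batches.
def run(stream, batch_size):
    handoffs = 0
    out = []
    batch = []
    for item in stream:
        batch.append(item + 1)        # light operator: little work per item
        if len(batch) == batch_size:
            out.extend(batch)         # one hand-off moves the whole batch
            handoffs += 1
            batch = []
    if batch:                         # flush a final partial batch
        out.extend(batch)
        handoffs += 1
    return out, handoffs

out, h1 = run(range(1000), 1)     # unbatched: one hand-off per item
_, h2 = run(range(1000), 100)     # batched: 100x fewer hand-offs
print(h1, h2)  # 1000 10
```

The trade-off, as with the pipeline buffer, is latency: items wait until their batch fills before crossing the edge.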

SLIDE 19

Our vs. Other Dynamic Schedulers Performance


Scheduling approach | Experiment | Result
OS Thread | 32 threads, 1 core, work 31 per operator | Our scheduler is 10x faster
Demand | Huffman encoder and decoder | Our scheduler is 1.2x faster
No-op | 2 programs: VWAP and predicate filter | Our scheduler is 5.1x and 4.9x faster

Our scheduler was faster in all cases (see paper for details).

SLIDE 20

Conclusions

  • Static streaming languages such as StreamIt enable powerful optimizations.
  • But many real-world applications require dynamic rates.
  • We extend the StreamIt optimizing compiler to handle dynamic rates.