Dynamic Expressivity with Static Optimization for Streaming Languages
Robert Soulé (Cornell), Michael I. Gordon (MIT), Saman Amarasinghe (MIT), Robert Grimm (NYU), Martin Hirzel (IBM)
DEBS 2013
Figure: Video Input -> Huffman -> IQuant -> IDCT, where each arrow is a stream (FIFO queue) and each box is an operator.
“Rate” = number of queue pushes/pops per operator firing.
Dynamic rate (varies at runtime): requires dynamic expressivity.
Static rate (known at compile time): enables static optimization.
How to get both?
Observation: applications are “mostly static” (Thies, Amarasinghe [PACT 2010]).
float->float pipeline ABC {
  add float->float filter A() {
    work pop … push 2 { … }
  }
  add float->float filter B() {
    work pop 3 push 1 { … }
  }
  add float->float filter C() {
    work pop 2 push … { … }
  }
}
Statically known push/pop rates (SDF = Synchronous Dataflow).
Figure: A -> B -> C with rates annotated on the queues: A pushes 2 per firing, B pops 3 per firing, B pushes 1 per firing, C pops 2 per firing.
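With static rates, the number of firings per steady-state iteration follows from balancing each queue: tokens pushed by the upstream operator must equal tokens popped by the downstream one. The following is a minimal Python sketch of that calculation for a linear pipeline, not the compiler's actual code. The function name `repetitions` and the `(pop, push)` tuple layout are assumptions made for illustration. The elided rates of A and C stay unspecified (`None`) because they do not affect the two internal queues.

```python
from fractions import Fraction
from math import lcm  # Python 3.9+

def repetitions(rates):
    """rates[i] = (pop_i, push_i) for operator i of a linear pipeline.
    Returns the smallest positive firing counts that balance every queue."""
    reps = [Fraction(1)]
    for (_, push_up), (pop_down, _) in zip(rates, rates[1:]):
        # Balance equation: reps[up] * push_up == reps[down] * pop_down
        reps.append(reps[-1] * push_up / pop_down)
    # Scale the rational solution to the smallest integer solution.
    scale = lcm(*(r.denominator for r in reps))
    return [int(r * scale) for r in reps]

# Pipeline ABC from the slide: A pushes 2; B pops 3, pushes 1; C pops 2.
abc = [(None, 2), (3, 1), (2, None)]
print(repetitions(abc))  # [3, 2, 1]
```

Under these rates A fires 3 times, B twice, and C once per iteration: 3 x 2 = 2 x 3 tokens cross the first queue and 2 x 1 = 1 x 2 cross the second.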
Statically known firing order and FIFO queue sizes (rates along the pipeline: push 2, pop 3, push 1, pop 2).
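One way to see why the firing order and queue sizes are known at compile time is to simulate a single steady-state iteration. The sketch below is an illustration under assumptions, not the scheduler from the talk. It uses a simple "fire the most downstream ready operator" policy, and the name `schedule` is hypothetical. The firing counts 3, 2, 1 for A, B, C are the ones that balance both queues (3 x 2 = 2 x 3 and 2 x 1 = 1 x 2).

```python
def schedule(rates, reps):
    """Simulate one steady-state iteration of a linear pipeline.
    rates[i] = (pop_i, push_i); queue i connects operator i to i + 1.
    Returns (firing order as operator indices, peak size of each queue)."""
    n = len(rates)
    queues = [0] * (n - 1)
    peak = [0] * (n - 1)
    left = list(reps)
    order = []
    while any(left):
        # Prefer the most downstream operator whose input is available;
        # this keeps the queues short.
        for i in reversed(range(n)):
            pop, push = rates[i]
            ready = i == 0 or queues[i - 1] >= pop
            if left[i] and ready:
                if i > 0:
                    queues[i - 1] -= pop
                if i < n - 1:
                    queues[i] += push
                    peak[i] = max(peak[i], queues[i])
                left[i] -= 1
                order.append(i)
                break
        else:
            raise ValueError("no operator can fire")
    return order, peak

rates = [(None, 2), (3, 1), (2, None)]   # A, B, C; elided rates left unset
order, peak = schedule(rates, [3, 2, 1])
print("".join("ABC"[i] for i in order))  # AABABC
print(peak)                              # [4, 2]
```

For this policy the queue between A and B never holds more than 4 items and the queue between B and C never more than 2, so both can be allocated with fixed sizes.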
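Implement FIFO queue via local variables, or even registers (more intricate with “peek”, not shown in this talk).
Scalarized code for rates push 2, pop 3, push 1, pop 2 (one firing per line, in the order A A B A B C implied by those rates):
A: r1=… r2=…
A: r3=… r4=…
B: …=r1 …=r2 …=r3 r5=…
A: r1=… r2=…
B: …=r4 …=r1 …=r2 r6=…
C: …=r5 …=r6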
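The mapping from queue slots to variables can be sketched as a small code generator. This is an illustrative reconstruction under assumptions, not the compiler's implementation. The name `scalarize`, the firing order A A B A B C, and the queue capacities 4 and 2 are taken as given inputs, and slots are reused circularly, which is only safe when each capacity is at least the queue's peak occupancy.

```python
def scalarize(order, rates, capacities):
    """Emit one line of placeholder assignments per firing.
    Each FIFO slot becomes a fixed variable r<k>; slots are reused
    circularly, so capacities must cover the peak queue occupancy."""
    # First variable number of each queue: r1.. for queue 0, then onward.
    base = [1]
    for cap in capacities[:-1]:
        base.append(base[-1] + cap)
    head = [0] * len(capacities)  # count of pops so far, per queue
    tail = [0] * len(capacities)  # count of pushes so far, per queue
    lines = []
    for i in order:
        pop, push = rates[i]
        stmts = []
        if i > 0:  # read inputs from the upstream queue
            q = i - 1
            for _ in range(pop):
                stmts.append(f"…=r{base[q] + head[q] % capacities[q]}")
                head[q] += 1
        if i < len(rates) - 1:  # write outputs to the downstream queue
            q = i
            for _ in range(push):
                stmts.append(f"r{base[q] + tail[q] % capacities[q]}=…")
                tail[q] += 1
        lines.append(" ".join(stmts))
    return lines

rates = [(None, 2), (3, 1), (2, None)]  # A, B, C; elided rates left unset
for line in scalarize([0, 0, 1, 0, 1, 2], rates, [4, 2]):
    print(line)
```

With these inputs the generator assigns r1 to r4 to the queue between A and B and r5, r6 to the queue between B and C, and the second firing of B reads r4, r1, r2 after A has refilled r1 and r2.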
Figure: replicas X1 and X2 placed between a Roundrobin Split and a Roundrobin Merge.
Round-robin split and merge rely on static rates.
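To illustrate why the split and merge need static rates, here is a small Python sketch. The operator `pair_sum` (pop 2, push 1) and the function names are hypothetical stand-ins for X. The split weights must match the replica's pop rate and the merge weights its push rate, otherwise items would be regrouped or reordered.

```python
def roundrobin_split(items, weights):
    """Deal items to len(weights) branches, weights[k] items at a time."""
    branches = [[] for _ in weights]
    it = iter(items)
    while True:
        for k, w in enumerate(weights):
            for _ in range(w):
                try:
                    branches[k].append(next(it))
                except StopIteration:
                    return branches

def roundrobin_merge(branches, weights):
    """Collect weights[k] items from branch k in turn until one runs dry."""
    iters = [iter(b) for b in branches]
    out = []
    while True:
        for it, w in zip(iters, weights):
            for _ in range(w):
                try:
                    out.append(next(it))
                except StopIteration:
                    return out

def pair_sum(xs):
    """Hypothetical operator X with static rates: pop 2, push 1."""
    return [xs[i] + xs[i + 1] for i in range(0, len(xs), 2)]

data = list(range(8))
# Split weight 2 matches X's pop rate; merge weight 1 matches its push rate.
x1_in, x2_in = roundrobin_split(data, [2, 2])
merged = roundrobin_merge([pair_sum(x1_in), pair_sum(x2_in)], [1, 1])
print(merged)                    # [1, 5, 9, 13]
print(merged == pair_sum(data))  # True
```

If X's pop rate varied at runtime, no fixed split weight could keep each replica's input aligned with firing boundaries, which is the dependence on static rates noted above.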
float->float pipeline Decoder {
  add float->float filter VideoInput() {
    work pop 1 push 1 { … }
  }
  add float->float filter Huffman() {
    work pop * push 1 { … }
  }
  add float->float filter IQuant() {
    work pop 64 push 64 { … }
  }
  add float->float filter IDCT() {
    work pop 8 push 8 { … }
  }
}
Figure: Video Input -> Huffman -> IQuant -> IDCT
No more static optimization?
Partitioning (figure): the Video Input -> Huffman -> IQuant -> IDCT pipeline divided into static subgraphs.
Static subgraph: Weakly connected component after deleting dynamic edges.
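The definition above translates directly into a union-find pass over the static edges. The following Python sketch is an illustration, not the authors' implementation. The function name and the edge encoding are assumptions, and marking only the edge into Huffman as dynamic is inferred from `pop *` in the Decoder listing; the partition shown in the talk's figure may differ.

```python
def static_subgraphs(nodes, edges):
    """edges: iterable of (src, dst, is_dynamic).
    Returns the weakly connected components that remain after deleting
    the dynamic edges (edge direction is ignored)."""
    parent = {n: n for n in nodes}

    def find(n):
        # Find the component representative, halving paths on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for src, dst, is_dynamic in edges:
        if not is_dynamic:
            parent[find(src)] = find(dst)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())

nodes = ["VideoInput", "Huffman", "IQuant", "IDCT"]
edges = [
    ("VideoInput", "Huffman", True),   # assumed dynamic: Huffman has pop *
    ("Huffman", "IQuant", False),
    ("IQuant", "IDCT", False),
]
print(static_subgraphs(nodes, edges))
# [['VideoInput'], ['Huffman', 'IQuant', 'IDCT']]
```

Each resulting component can then be optimized with the static techniques, while the deleted dynamic edges become runtime queues between components.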
Figure: optimization steps shown on the Video Input -> Huffman -> IQuant -> IDCT pipeline: Partitioning, Fusion, Scalarization, and Fission (shown for IQuant and IDCT).
Figure: Video Input, Huffman, and the IQuant/IDCT replicas assigned to cores.
Place fission replicas on all cores.
Static weight estimate and greedy bin-packing.
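The slide does not say which greedy heuristic is used. As a hedged illustration, the sketch below uses a common one: sort by estimated weight, heaviest first, and always place the next item on the least-loaded core. The function name, the operator names, and the weight values are all hypothetical.

```python
import heapq

def greedy_place(weights, num_cores):
    """Assign each item to the currently least-loaded core, heaviest first.
    weights: dict of name -> static weight estimate.
    Returns a dict of name -> core index."""
    heap = [(0, core) for core in range(num_cores)]  # (load, core)
    heapq.heapify(heap)
    placement = {}
    for name, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        load, core = heapq.heappop(heap)
        placement[name] = core
        heapq.heappush(heap, (load + w, core))
    return placement

# Hypothetical weight estimates; two replicas of the fused IQuant/IDCT.
weights = {
    "VideoInput": 1,
    "Huffman": 3,
    "IQuantIDCT_replica1": 4,
    "IQuantIDCT_replica2": 4,
}
print(greedy_place(weights, num_cores=2))
```

With these numbers the two replicas land on different cores, Huffman joins replica 1 on core 0, and Video Input fills the lighter core 1, for loads of 7 and 5.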
One pinned thread per static subgraph per core
(must be able to suspend dynamic reader when no input)
Use condition variables for hand-off to successor.
Figure legend: barrier, data, control.
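A minimal sketch of a condition-variable hand-off, written with Python's `threading.Condition`. It illustrates the general pattern only; the `HandOff` class and its single-slot design are assumptions, not the runtime described in the talk.

```python
import threading

class HandOff:
    """Single-slot hand-off: the predecessor deposits one item and the
    successor sleeps on a condition variable until an item is present."""

    def __init__(self):
        self._cond = threading.Condition()
        self._item = None
        self._full = False

    def put(self, item):
        with self._cond:
            while self._full:          # wait until the slot is free
                self._cond.wait()
            self._item, self._full = item, True
            self._cond.notify_all()    # wake the waiting successor

    def take(self):
        with self._cond:
            while not self._full:      # suspend while there is no input
                self._cond.wait()
            item, self._item, self._full = self._item, None, False
            self._cond.notify_all()    # wake a waiting predecessor
            return item

slot = HandOff()
results = []

def successor():
    for _ in range(5):
        results.append(slot.take() * 2)

t = threading.Thread(target=successor)
t.start()
for x in range(5):
    slot.put(x)
t.join()
print(results)  # [0, 2, 4, 6, 8]
```

The `while` loops around `wait()` re-check the predicate after every wake-up, which is the standard guard against spurious or stale notifications.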
Use buffer for pipeline parallelism
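As a rough illustration of buffering between pipeline stages, the sketch below connects stage threads with bounded `queue.Queue` buffers so that an upstream stage can run ahead of its successor by up to `capacity` items. The names `stage`, `run_pipeline`, and `DONE`, and the default capacity of 4, are assumptions for this example.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream

def stage(fn, inbox, outbox):
    """Run one operator: apply fn to each item until DONE arrives."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(fn(item))

def run_pipeline(items, fns, capacity=4):
    """Run each fn in its own thread, linked by bounded buffers."""
    # Bounded buffers between stages; the final output is unbounded so
    # the feeding loop below cannot block on a full result queue.
    queues = [queue.Queue(maxsize=capacity) for _ in fns] + [queue.Queue()]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)  # blocks while the first buffer is full
    queues[0].put(DONE)
    out = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        out.append(item)
    for t in threads:
        t.join()
    return out

print(run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2]))
# [2, 4, 6, 8, 10]
```

The buffer capacity trades memory for slack: a larger buffer lets stages with uneven per-item cost overlap more before one has to wait for the other.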
Close enough for heavy operators, but what about light operators?
Figure: benchmark pipeline File Reader -> Work W/2 -> Work W/2 -> File Writer; axis label: operator weight.
Amortize thread switching overhead without heavy operators.
Figure: benchmark pipeline File Reader -> Work 1 -> Work 1 -> File Writer; axis label: N dynamic queues.
Our scheduler was faster in all cases (see paper for details)