Satisfying Dataflow Programs Constraints on Multicore Architectures

Citi lab PhD day 2014, 27th March 2014
Manuel Selva
Supervisor: Lionel Morel
Director: Stéphane Frénot
Bull: Frédéric Soinne

More and more parallelism

Figures: Nehalem multicore die (source: www.intel.com); Apple/ARM dual core (source: www.cultofmac.com); Multiprocessor motherboard (source: www.bit-tech.net)

How to program?
• Threads (Java, C + Pthreads)
• Annotations to sequential code (OpenMP)
• Dataflow

Why and what is dataflow programming?

Figure: (a) H264 decoding graph with actors Parser, Text.Y, Text.U, Text.V, Mot.Y, Mot.U, Mot.V, Merger and Display, and channel rates 256, 768, 256, 256; (b) LTE-Adv decoding graph.

• Different kinds of parallelism: task, pipeline, data
• Actors exchanging data only through FIFO channels
• Many models, from SDF to DDF (figure axis: static analyses versus expressiveness)
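As a rough illustration of the static analyses that fixed SDF rates make possible, the sketch below solves the balance equations of a small graph to find how many times each actor fires per iteration. The graph, actor names and rates in the example are hypothetical and are not read from the H264 figure; `repetition_vector` is a name chosen here, not part of any DF toolchain.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, channels):
    """Smallest positive firing counts that balance every channel.

    channels: iterable of (producer, tokens_produced, consumer, tokens_consumed).
    Assumes a connected graph with fixed (SDF) rates.
    """
    rate = {actors[0]: Fraction(1)}  # relative firing rate, first actor as reference
    pending = list(channels)
    progress = True
    while pending and progress:
        progress = False
        for ch in list(pending):
            src, prod, dst, cons = ch
            if src in rate and dst in rate:
                # Both ends known: the balance equation must already hold.
                if rate[src] * prod != rate[dst] * cons:
                    raise ValueError(f"inconsistent rates on {src} -> {dst}")
            elif src in rate:
                rate[dst] = rate[src] * prod / cons
            elif dst in rate:
                rate[src] = rate[dst] * cons / prod
            else:
                continue  # neither endpoint reached yet, retry on a later pass
            pending.remove(ch)
            progress = True
    if pending or len(rate) != len(actors):
        raise ValueError("graph is not connected")
    # Scale the fractions to the smallest integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    counts = {a: int(r * scale) for a, r in rate.items()}
    divisor = reduce(gcd, counts.values())
    return {a: c // divisor for a, c in counts.items()}


# Hypothetical graph: Parser emits 3 tokens per firing, Text consumes 1, and so on.
example_actors = ["Parser", "Text", "Merger", "Display"]
example_channels = [
    ("Parser", 3, "Text", 1),
    ("Text", 1, "Merger", 3),
    ("Merger", 1, "Display", 1),
]
print(repetition_vector(example_actors, example_channels))
```

A dynamic model such as DDF has no fixed rates, so this kind of computation is not available there; that is the trade-off between static analyses and expressiveness.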

How to execute dataflow programs?

• Compilation to synchronized tasks respecting data dependencies
• Mapping of tasks and channels to hardware

Figure: the DF Compiler turns the H264 graph into nine tasks, one per actor (Task 1 to Task 9: Parser, Text.Y, Text.U, Text.V, Mot.Y, Mot.U, Mot.V, Merger, Display). The DF Mapper then places tasks T1 to T9 on the eight cores (Core1 to Core8) of a dual socket processor and channels C1 to C10 in its two memories (RAM 1 and RAM 2).
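One way to reason about such a mapping is to list which channel buffers sit in a memory that is remote to the task using them. The sketch below does this for a made-up placement; the `Channel` type, the `remote_accesses` helper, the channel endpoints and the core-to-socket layout are all assumptions for illustration and are not read from the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    name: str
    producer: str
    consumer: str


def remote_accesses(channels, task_to_core, core_to_node, channel_to_node):
    """Map each channel name to the endpoint tasks running on another memory node."""
    remote = {}
    for ch in channels:
        home = channel_to_node[ch.name]
        far = [task for task in (ch.producer, ch.consumer)
               if core_to_node[task_to_core[task]] != home]
        if far:
            remote[ch.name] = far
    return remote


# Hypothetical layout: cores 1-4 are attached to node 0, cores 5-8 to node 1.
core_to_node = {core: 0 if core <= 4 else 1 for core in range(1, 9)}
task_to_core = {"Parser": 1, "Text.Y": 2, "Mot.Y": 5}
channels = [Channel("C1", "Parser", "Text.Y"), Channel("C4", "Text.Y", "Mot.Y")]
channel_to_node = {"C1": 0, "C4": 0}

# C4 is stored on node 0 but read by Mot.Y, which runs on node 1.
print(remote_accesses(channels, task_to_core, core_to_node, channel_to_node))
```

On a dual socket machine, every entry in the result corresponds to traffic that has to cross the inter-socket link.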

Proposition

Motivations
• DF applications with throughput constraints
• Mapping satisfying constraints requires:
  • Actors internal execution time
  • Concurrent applications
  • DF actors consumption/production rates

Goals
• Compile time: extend DF language
• Runtime: monitor app/resources, adapt static choices

Contrib I - Languages/compilers extensions

• Languages extensions taken into account in compilation flow [9, 10]

Figure: H264 graph with throughput constraint (25 f/s); labels th 1 to th 9 annotate the graph.

• Propagate this value in SDF languages
• Determine actors acceptable exec time
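A minimal sketch of how a throughput constraint could be turned into per-actor time budgets, under strong simplifying assumptions: every actor runs alone on its own core, communication costs are ignored, and the repetition counts of the graph are already known. The `firing_budgets` name and the counts used in the example are hypothetical; the actual propagation used in the compilation flow may differ.

```python
from fractions import Fraction


def firing_budgets(repetitions, constrained_actor, target_rate_hz):
    """Longest acceptable time per firing (seconds) for each actor.

    repetitions: firings of each actor per graph iteration.
    target_rate_hz: required firings per second of `constrained_actor`.
    """
    # Graph iterations per second needed to meet the constraint.
    iterations_per_second = Fraction(target_rate_hz) / repetitions[constrained_actor]
    budgets = {}
    for actor, count in repetitions.items():
        firings_per_second = iterations_per_second * count
        budgets[actor] = 1 / firings_per_second
    return budgets


# Hypothetical counts: Text fires three times per displayed frame.
budgets = firing_budgets({"Parser": 1, "Text": 3, "Merger": 1, "Display": 1},
                         constrained_actor="Display", target_rate_hz=25)
for actor, seconds in budgets.items():
    print(f"{actor}: {float(seconds) * 1000:.2f} ms per firing")
```

With 25 f/s required at Display, an actor firing once per frame gets 40 ms per firing, and one firing three times per frame gets a third of that.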

Contrib II - Fine grain monitoring

Figure: H264 graph with the 25 f/s throughput constraint, compiled by the DF Compiler into Task 1 to Task 9 and placed by the DF Mapper on the dual socket processor (Core1 to Core8, RAM 1 and RAM 2).

• Merger instrumented to measure throughput
• Tasks instrumented to measure actors execution times
• Memory monitoring using PMU: RAM controllers load, QPI traffic

Conclusions
• Are we facing cores load imbalance?
• Are the actors too slow because of memory latencies?
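The application-level part of this monitoring can be sketched as a wrapper around each actor's firing function that records execution times and completion timestamps. This is only an illustration: the `Monitor` class and its method names are assumptions made here, and the hardware side (PMU counters for RAM controllers load and QPI traffic) is not covered.

```python
import time
from collections import defaultdict


class Monitor:
    """Collects per-actor execution times and completion timestamps."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.exec_times = defaultdict(list)
        self.completions = defaultdict(list)

    def instrument(self, name, fire):
        """Return a wrapped version of `fire` that records each firing."""
        def wrapped(*args, **kwargs):
            start = self.clock()
            try:
                return fire(*args, **kwargs)
            finally:
                end = self.clock()
                self.exec_times[name].append(end - start)
                self.completions[name].append(end)
        return wrapped

    def mean_exec_time(self, name):
        samples = self.exec_times[name]
        return sum(samples) / len(samples) if samples else None

    def throughput(self, name):
        """Firings per clock unit between the first and last recorded completion."""
        stamps = self.completions[name]
        if len(stamps) < 2:
            return None
        span = stamps[-1] - stamps[0]
        return (len(stamps) - 1) / span if span > 0 else None


# Usage: wrap a stand-in for the Merger actor and fire it a few times.
monitor = Monitor()
merger = monitor.instrument("Merger", lambda frame: frame)
for frame in range(5):
    merger(frame)
print(monitor.mean_exec_time("Merger"), monitor.throughput("Merger"))
```

Comparing the measured throughput of the constrained actor with the target, and the measured execution times with the time budgets, is what lets the runtime tell a load imbalance apart from slow memory accesses.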

Contrib III - Dataflow adaptations

• CPU load balancing
• Memory load balancing

Figure: same compilation and mapping flow as before (DF Compiler, DF Mapper, dual socket processor with Core1 to Core8, RAM 1 and RAM 2). In the figure, CPU load balancing moves task T9 from the core it shared with T8 to Core1, alongside T1, and memory load balancing relocates channel C9.
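A CPU load balancing step of this kind can be sketched as a greedy migration: take the busiest core and move one of its tasks to the least loaded core if that lowers the peak load. The `rebalance_once` function, the load figures and the placement below are hypothetical and only illustrate the idea; they do not describe the framework's actual policy.

```python
def rebalance_once(task_to_core, task_load, cores):
    """Return a new mapping after at most one task migration.

    task_load: measured load of each task (for example, fraction of a core).
    A task moves from the busiest to the least loaded core only if this
    lowers the load of the busiest of the two cores.
    """
    load = {core: 0.0 for core in cores}
    for task, core in task_to_core.items():
        load[core] += task_load[task]
    busiest = max(cores, key=lambda core: load[core])
    idlest = min(cores, key=lambda core: load[core])

    best = None  # (task, resulting peak load of the two cores)
    for task, core in task_to_core.items():
        if core != busiest:
            continue
        new_peak = max(load[busiest] - task_load[task],
                       load[idlest] + task_load[task])
        if new_peak < load[busiest] and (best is None or new_peak < best[1]):
            best = (task, new_peak)

    moved = dict(task_to_core)
    if best is not None:
        moved[best[0]] = idlest
    return moved


# Hypothetical situation: T8 and T9 share core 8 while core 1 is lightly loaded.
mapping = {"T1": 1, "T8": 8, "T9": 8}
loads = {"T1": 0.2, "T8": 0.6, "T9": 0.5}
print(rebalance_once(mapping, loads, cores=[1, 8]))
```

The same greedy pattern could be applied to channels and memory nodes, using the measured RAM controllers load instead of core load.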

Conclusion

Dynamic framework for DF programs
• Applicative monitoring
• Hardware monitoring
• Runtime adaptations making profit of DF information

State of the art
• DF compilation [7, 13, 4]
• DF theoretical throughput analysis [3, 11]
• DF adaptation [12, 8, 5, 1]
• Non-DF NUMA adaptations [6, 2]

Current work
• Finishing implementation in Streamit

[1] Choi, Y., Li, C.-H., Silva, D. D., Bivens, A., and Schenfeld, E. Adaptive task duplication using on-line bottleneck detection for streaming applications. In Proceedings of the 9th Conference on Computing Frontiers (New York, NY, USA, 2012), CF '12, ACM, pp. 163–172.

[2] Dashti, M., Fedorova, A., Funston, J., Gaud, F., Lachaize, R., Lepers, B., Quema, V., and Roth, M. Traffic management: A holistic approach to memory placement on NUMA systems. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2013), ASPLOS '13, ACM, pp. 381–394.

[3] Ghamarian, A.-H., Geilen, M. C. W., Stuijk, S., Basten, T., Moonen, A. J. M., Bekooij, M., Theelen, B., and Mousavi, M. Throughput analysis of synchronous data flow graphs. In Application of Concurrency to System Design, 2006. ACSD 2006. Sixth International Conference on (2006), pp. 25–36.

[4] Gordon, M. I. Compiler Techniques for Scalable Performance of Stream Programs. PhD thesis, MIT, 2010.

[5] Hormati, A. H., Choi, Y., Kudlur, M., Rabbah, R., Mudge, T., and Mahlke, S. Flextream: Adaptive compilation of streaming applications for heterogeneous architectures.
