

SLIDE 1

Do we need dataflow programming?

Anthony Danalis

Innovative Computing Laboratory University of Tennessee

CCDSC'16, Chateau des Contes

SLIDE 2

Programming vs Execution

  • Dataflow-based execution
    • Think ILP, out-of-order execution
    • Automatically derived by the hardware/compiler/etc.
  • Dataflow programming
    • Think workflows
    • Flow of data explicitly specified by a human

SLIDE 3

Task-based vs Dataflow-based

Is task-based execution the same thing?

OpenMP, StarPU, *SS, PaRSEC, HPX

SLIDE 4

Task-based vs Dataflow-based

Is task-based execution the same thing?

OpenMP, StarPU, *SS, PaRSEC, HPX

Runtime derives the dataflow  vs.  Developer specifies the dataflow
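To make the contrast concrete, here is a minimal sketch (my addition, not from the slides) of the first model: with standard OpenMP tasks the developer inserts tasks sequentially and only declares each task's data accesses, and the runtime derives the dataflow, and hence the parallelism, from the depend clauses.

    #include <stdio.h>

    int main(void) {
        double a = 0.0, b = 0.0, c = 0.0;

        #pragma omp parallel
        #pragma omp single
        {
            /* Tasks are inserted in program order; the runtime derives
               the dataflow graph from the declared depend clauses. */
            #pragma omp task depend(out: a)
            a = 1.0;                      /* produces a */

            #pragma omp task depend(out: b)
            b = 2.0;                      /* produces b, independent of a */

            #pragma omp task depend(in: a, b) depend(out: c)
            c = a + b;                    /* runs only after both producers */

            #pragma omp taskwait
            printf("c = %g\n", c);
        }
        return 0;
    }

In the second model the developer writes the dataflow itself (for example as a parametric task graph, shown on later slides), so the runtime never has to discover it.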

SLIDE 5

Limits of deriving the dataflow

P: number of nodes
N: number of kernel executions
Tk: kernel execution time
To: overhead of discovering one task

Discovery must satisfy To*N << Tk*N/P. Taking a 10x margin, To*N <= 0.1*Tk*N/P, which gives P <= 0.1*Tk/To.
With To = 100 ns and Tk = 100 us: P <= 100.
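A quick sanity check of the arithmetic above (the 100 ns and 100 us figures are the ones on the slide):

    #include <stdio.h>

    /* Discovery pays off only while To*N stays well below Tk*N/P.
       With the slide's 10x margin: To*N <= 0.1*Tk*N/P  =>  P <= 0.1*Tk/To. */
    int main(void) {
        double To = 100e-9;   /* overhead of discovering one task: 100 ns */
        double Tk = 100e-6;   /* kernel execution time: 100 us            */
        printf("P <= %.0f nodes before discovery overhead dominates\n",
               0.1 * Tk / To);
        return 0;
    }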

SLIDE 6

Explicit Dataflow Programming

Why does Explicit Dataflow Programming (EDP) differ from everything else? The human developer explicitly expresses the semantics of the algorithm/application in a way that runtimes/compilers can take direct advantage of, without having to derive the information themselves.

SLIDE 7

Explicit Dataflow Programming

Why does Explicit Dataflow Programming (EDP) differ from everything else? The human developer explicitly expresses the semantics of the algorithm/application in a way that runtimes/compilers can take direct advantage of, without having to derive the information themselves.

Benefits:

  • Perfect parallelism
  • Automatic communication/computation overlap
  • Collective operation detection

SLIDE 8

Performance case study: NWChem CCSD

DO {x4}                                          ! four nested loops
  CALL nxt_ctx_next(ctx, icounter, next)         ! global work stealing
  IF ( (int_mb(…)+...).ne.8 ) THEN
    CALL MA_PUSH_GET(…)                          ! allocate and initialize C
    CALL DFILL(…)
    DO {x2}                                      ! two nested loops
      IF ( (int_mb(…)+…).eq.int_mb(…) ) THEN
        CALL MA_PUSH_GET(…, k_a)                 ! allocate and fetch A
        CALL GET_HASH_BLOCK(d_a, dbl_mb(k_a), …) ! (same for B, not shown)
        CALL DGEMM(…)                            ! actual work
      END IF
    END DO
    CALL TCE_SORT_4(dbl_mb(k_c), …)              ! push C back
    CALL ADD_HASH_BLOCK(d_c, dbl_mb(k_c), …)
  END IF
END DO

SLIDE 9

Structure of PTG computation

SLIDE 10

CCSD Execution Time on 32 nodes

[Chart: CCSD execution time (sec) vs. cores per node on 32 nodes, for the original code and the PaRSEC version; the highlighted times are 1703 sec and 818 sec.]

SLIDE 11

Performance bottlenecks

(The same CCSD loop nest as on Slide 8.)

  1. Global atomic
  2. Coarse grain parallelism
  3. No opportunity for comm/comp overlap
SLIDE 12

Trace of Original code

SLIDE 13

Trace of Original code (zoom)

SLIDE 14

Trace of PaRSEC implementation

SLIDE 15

Whose fault is the bad performance?

Audience participation. Choose who to blame:

  • MPI
  • Developers
  • Programming paradigm (Coarse Grain Parallelism)
  • Vetter (for not telling his users about dataflow)


SLIDE 16

Whose fault is it?

Audience participation. Choose who to blame:

  • MPI
  • Developers
  • Programming paradigm (Coarse Grain Parallelism)
  • Vetter (for not telling his users about dataflow)


MPI has a simple and an advanced API and many developers use only the simple one.

  • Rusty
SLIDE 17

Message so far

  • Using CGP does not scale
  • Using Dataflow execution does
  • BUT, developers have to understand their code


SLIDE 18

Sure, but can we make EDP easy?

Can we make dataflow execution harness all the benefits without explicit dataflow programming?


SLIDE 19

Sure, but can we make EDP easy?

Can we make dataflow execution harness all the benefits without explicit dataflow programming?

Yes, we can. In some cases. Maybe?

SLIDE 20

Bridging Explicit & Implicit dataflow

  • Reduce the cost of discovery
    • Code specialization
      • Developer expertise
      • Results of compiler analysis
  • Harness the benefits of a parametric representation
    • Compress the graph on the fly (a sketch follows this list)
      • Detect patterns in series that translate to expressions, or functions
      • Use compiler-inserted hints
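A minimal sketch of the "detect patterns in series" idea; the types and names below (task_range, range_absorb) are hypothetical, not a real runtime API. The point is that a run of task insertions that differ only in a regularly increasing index can be stored as one parametric record instead of many individual graph nodes.

    #include <stdio.h>

    /* Hypothetical on-the-fly compression of a series of task insertions. */
    typedef struct {
        int task_class;            /* which kernel is being inserted       */
        int start, stride, count;  /* indices covered: start + i*stride    */
    } task_range;

    /* Try to absorb the next insertion into an open range.
       Returns 1 on success; 0 means the pattern broke and the caller
       should emit the old range and start a new one. */
    static int range_absorb(task_range *r, int task_class, int index) {
        if (r->count == 0) {                       /* open a new range */
            r->task_class = task_class;
            r->start = index; r->stride = 0; r->count = 1;
            return 1;
        }
        if (task_class != r->task_class) return 0;
        if (r->count == 1) {                       /* second element fixes the stride */
            if (index <= r->start) return 0;
            r->stride = index - r->start; r->count = 2;
            return 1;
        }
        if (index == r->start + r->count * r->stride) {   /* pattern continues */
            r->count++;
            return 1;
        }
        return 0;
    }

    int main(void) {
        task_range r = {0, 0, 0, 0};
        for (int i = 0; i < 8; i++)
            range_absorb(&r, /* hypothetical class id for Task_D */ 4, i);
        printf("class %d: start=%d stride=%d count=%d\n",
               r.task_class, r.start, r.stride, r.count);
        return 0;
    }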

SLIDE 21

Reduce the unnecessary discovery

(The same CCSD loop nest as on Slide 8.)

Callout: Insert_Task

SLIDE 22

Reduce the unnecessary discovery

(The same CCSD loop nest as on Slide 8.)

Callouts: Insert_Task, Handle Generation, Data Fetching, Data Flushing

SLIDE 23

Reduce the unnecessary discovery

(The same CCSD loop nest as on Slide 8.)

Callouts: Insert_Task, Handle Generation, Data Fetching, Data Flushing, Pleasantly Parallel
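To see why this decomposition can overlap data movement with computation, here is a hedged, self-contained sketch in the same spirit; it uses plain C with OpenMP tasks, and the helpers fetch_block and flush_block are stand-ins of my own, not the actual NWChem or PaRSEC code. The fetch of each block, the update, and the final flush become separate tasks with declared data accesses, so independent fetches and updates can proceed concurrently.

    #include <stdio.h>

    #define NB 4   /* number of blocks, toy size */

    /* Stand-ins for GET_HASH_BLOCK / ADD_HASH_BLOCK: each "block"
       is a single double here, just to keep the sketch self-contained. */
    static double fetch_block(int i) { return (double)(i + 1); }
    static void   flush_block(double c) { printf("C = %g\n", c); }

    int main(void) {
        double a[NB], b[NB], c = 0.0;

        #pragma omp parallel
        #pragma omp single
        {
            for (int i = 0; i < NB; i++) {
                /* data fetching: independent tasks, free to overlap */
                #pragma omp task depend(out: a[i]) firstprivate(i)
                a[i] = fetch_block(i);

                #pragma omp task depend(out: b[i]) firstprivate(i)
                b[i] = fetch_block(i);

                /* actual work: waits only for its own blocks, accumulates into c */
                #pragma omp task depend(in: a[i], b[i]) depend(inout: c) firstprivate(i)
                c += a[i] * b[i];
            }
            /* data flushing: runs once every update to c is done */
            #pragma omp task depend(in: c)
            flush_block(c);
        }
        return 0;
    }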

SLIDE 24

Dataflow between subroutines

SLIDE 25

Code grouping based on dataflow

SLIDE 26

Message so far

  • Discovering the whole DAG does not scale
  • Pruning the DAG requires human expertise
  • Compiler analysis can assist with pruning
  • BUT, developers have to understand their code


SLIDE 27

Compressing the DAG to a PTG?

for (k = 0; k < MT; k++) {
    Insert_Task( zgeqrt, A[k][k], INOUT,
                         T[k][k], OUTPUT );
    for (m = k+1; m < MT; m++) {
        Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D | REGION_U,
                             A[m][k], INOUT | LOCALITY,
                             T[m][k], OUTPUT );
    }
    for (n = k+1; n < NT; n++) {
        Insert_Task( zunmqr, A[k][k], INPUT | REGION_L,
                             T[k][k], INPUT,
                             A[k][n], INOUT );
        for (m = k+1; m < MT; m++) {
            Insert_Task( ztsmqr, A[k][n], INOUT,
                                 A[m][n], INOUT | LOCALITY,
                                 A[m][k], INPUT,
                                 T[m][k], INPUT );
        }
    }
}

SLIDE 28

What does a DAG look like?

[Figure: an unrolled task DAG with POTRF, TRSM, SYRK and GEMM tasks; edges are labeled with the data they carry (T=>T, C=>A, C=>B, C=>C).]

SLIDE 29

Fully compressed DAG (PTG)

Task classes and execution spaces:

GEQRT(k)        k = 0 .. mt-1
TSQRT(k,m)      k = 0 .. mt-1,  m = k+1 .. mt-1
UNMQR(k,n)      k = 0 .. mt-1,  n = k+1 .. mt-1
TSMQR(k,m,n)    k = 0 .. mt-1,  m = k+1 .. mt-1,  n = k+1 .. mt-1

Dependence relations:

{[k] -> [k,k+1] : k <= mt-2}
{[k] -> [k,n] : k < n < nt && k < nt-1}
{[k,n] -> [k,k+1,n] : k < mt-1}
{[k,m] -> [k,m+1] : m < mt-1}
{[k,m] -> [k,m,n] : k < nt-1 && k < n < nt}
{[k,m,n] -> [n] : k+1 == n && k+1 == m}
{[k,m,n] -> [n,m] : k+1 == n && m > n}
{[k,m,n] -> [k+1,n] : k+1 == m && n > m}
{[k,m,n] -> [k+1,m,n] : n > k+1 && m > k+1}
{[k,m,n] -> [k,m+1,n] : m < mt-1}
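As an illustration of why the parametric form matters: a successor of a task can be computed directly from a symbolic dependence such as {[k,m,n] -> [k,m+1,n] : m < mt-1} above, instead of being looked up in a stored, fully unrolled DAG. The helper below is a sketch of that idea only, not PaRSEC code; tsmqr_id and mt are names chosen here to mirror the slide.

    #include <stdio.h>

    typedef struct { int k, m, n; } tsmqr_id;

    /* Compute one class of TSMQR successors analytically from the
       relation {[k,m,n] -> [k,m+1,n] : m < mt-1}.
       Returns 1 and fills *succ if TSMQR(k,m+1,n) exists, 0 otherwise. */
    static int tsmqr_succ_same_panel(tsmqr_id t, int mt, tsmqr_id *succ) {
        if (t.m < mt - 1) {            /* the guard of the relation */
            succ->k = t.k;
            succ->m = t.m + 1;
            succ->n = t.n;
            return 1;
        }
        return 0;
    }

    int main(void) {
        tsmqr_id t = {0, 1, 2}, s;
        if (tsmqr_succ_same_panel(t, 4, &s))
            printf("TSMQR(%d,%d,%d) -> TSMQR(%d,%d,%d)\n",
                   t.k, t.m, t.n, s.k, s.m, s.n);
        return 0;
    }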

SLIDE 30

Compressing the DAG to a PTG?

(The same Insert_Task loop nest as on Slide 27.)

Discovered task sequence: Task_A, Task_B, Task_B, Task_B, Task_C, Task_D, Task_D, Task_D, …

SLIDE 31

Compressing the DAG to a PTG?

(The same Insert_Task loop nest as on Slide 27.)

Discovered task sequence: Task_A, Task_B, Task_B, Task_B, Task_C, Task_D, Task_D, Task_D, …

SLIDE 32

Compressing the DAG to a PTG?

(The same Insert_Task loop nest as on Slide 27.)

Task sequence: Task_A, Task_B, Task_B, Task_B, Task_C, Task_D(1-3), …

SLIDE 33

Compressing the DAG to a PTG?

(The same Insert_Task loop nest as on Slide 27.)

Iteration_vector(k,n,m)    Indices(A[k][n], k, n)

SLIDE 34

Compressing the DAG to a PTG?

(The same Insert_Task loop nest as on Slide 27.)

Task sequence: Task_A, Task_B, Task_B, Task_B, Task_C, Task_D, Task_D, Task_D, Task_C, Task_D, Task_D, Task_D, Task_A, …

SLIDE 35

Compressing the DAG to a PTG?

(The same Insert_Task loop nest as on Slide 27.)

Task sequence: Task_A, Task_B, Task_B, Task_B, Task_C(k,n), Task_D(k,n,m), …    with k < m < MT

SLIDE 36

Conclusion

☺ Dataflow execution is more scalable than CGP
☺ Dataflow programming can maximize the benefits
☹ Compilers cannot do it by themselves
    ☹ Not even Torsten's compiler!
☹ Runtimes can, but at a cost

  • Dataflow for the masses means sharing the load between developer, compiler, and runtime.

SLIDE 37

Quotes

Developers know about their program much more than a compiler can ever figure out.

  • Doug Miles

Let the human do what humans do best.

  • Jeff Hollingsworth
