

SLIDE 1

Do we need dataflow programming?

Anthony Danalis

Innovative Computing Laboratory University of Tennessee

CCDSC'16, Chateau des Contes

SLIDE 2

Programming vs Execution

  • Dataflow-based execution
    • Think ILP, out-of-order execution
    • Automatically derived by the hardware/compiler/etc.
  • Dataflow programming
    • Think workflows
    • Flow of data explicitly specified by a human

SLIDE 3

Task-based vs Dataflow-based

Is task-based execution the same thing?

OpenMP, StarPU, *SS, PaRSEC, HPX

SLIDE 4

Task-based vs Dataflow-based

Is task-based execution the same thing?

OpenMP, StarPU, *SS, PaRSEC, HPX

Runtime derives the dataflow  vs.  Developer specifies the dataflow
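To make the contrast concrete, here is a minimal sketch (my addition, not from the slides) of the first model: with standard OpenMP tasks the developer inserts tasks sequentially and only declares each task's data accesses, and the runtime derives the dataflow, and hence the parallelism, from the depend clauses.

    #include <stdio.h>

    int main(void) {
        double a = 0.0, b = 0.0, c = 0.0;

        #pragma omp parallel
        #pragma omp single
        {
            /* Tasks are inserted in program order; the runtime derives
               the dataflow graph from the declared depend clauses. */
            #pragma omp task depend(out: a)
            a = 1.0;                      /* produces a */

            #pragma omp task depend(out: b)
            b = 2.0;                      /* produces b, independent of a */

            #pragma omp task depend(in: a, b) depend(out: c)
            c = a + b;                    /* runs only after both producers */

            #pragma omp taskwait
            printf("c = %g\n", c);
        }
        return 0;
    }

In the second model the developer writes the dataflow itself (for example as a parametric task graph, shown on later slides), so the runtime never has to discover it.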

SLIDE 5

Limits of deriving the dataflow

P: number of nodes
N: number of kernel executions
Tk: kernel execution time
To: overhead of discovering one task

Discovery must satisfy To*N << Tk*N/P. Taking a 10x margin, To*N <= 0.1*Tk*N/P, which gives P <= 0.1*Tk/To.
With To = 100 ns and Tk = 100 us: P <= 100.
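A quick sanity check of the arithmetic above (the 100 ns and 100 us figures are the ones on the slide):

    #include <stdio.h>

    /* Discovery pays off only while To*N stays well below Tk*N/P.
       With the slide's 10x margin: To*N <= 0.1*Tk*N/P  =>  P <= 0.1*Tk/To. */
    int main(void) {
        double To = 100e-9;   /* overhead of discovering one task: 100 ns */
        double Tk = 100e-6;   /* kernel execution time: 100 us            */
        printf("P <= %.0f nodes before discovery overhead dominates\n",
               0.1 * Tk / To);
        return 0;
    }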

SLIDE 6

Explicit Dataflow Programming

Why does Explicit Dataflow Programming (EDP) differ from everything else? The human developer explicitly expresses the semantics of the algorithm/application in a way that runtimes/compilers can take direct advantage of, without having to derive the information themselves.

SLIDE 7

Explicit Dataflow Programming

Why does Explicit Dataflow Programming (EDP) differ from everything else? The human developer explicitly expresses the semantics of the algorithm/application in a way that runtimes/compilers can take direct advantage of, without having to derive the information themselves.

Benefits:

  • Perfect parallelism
  • Automatic communication/computation overlap
  • Collective operation detection

SLIDE 8

Performance case study: NWChem CCSD

DO {x4}                                          ! four nested loops
  CALL nxt_ctx_next(ctx, icounter, next)         ! global work stealing
  IF ( (int_mb(…)+...).ne.8 ) THEN
    CALL MA_PUSH_GET(…)                          ! allocate and initialize C
    CALL DFILL(…)
    DO {x2}                                      ! two nested loops
      IF ( (int_mb(…)+…).eq.int_mb(…) ) THEN
        CALL MA_PUSH_GET(…, k_a)                 ! allocate and fetch A
        CALL GET_HASH_BLOCK(d_a, dbl_mb(k_a), …) ! (same for B, not shown)
        CALL DGEMM(…)                            ! actual work
      END IF
    END DO
    CALL TCE_SORT_4(dbl_mb(k_c), …)              ! push C back
    CALL ADD_HASH_BLOCK(d_c, dbl_mb(k_c), …)
  END IF
END DO

SLIDE 9

Structure of PTG computation

SLIDE 10

CCSD Execution Time on 32 nodes

[Chart: CCSD execution time (sec) vs. cores per node on 32 nodes, for the original code and the PaRSEC version; the highlighted times are 1703 sec and 818 sec.]

SLIDE 11

Performance bottlenecks

(The same CCSD loop nest as on Slide 8.)

  1. Global atomic
  2. Coarse grain parallelism
  3. No opportunity for comm/comp overlap
SLIDE 12

Trace of Original code

SLIDE 13

Trace of Original code (zoom)

SLIDE 14

Trace of PaRSEC implementation

SLIDE 15

Whose fault is the bad performance?

Audience participation. Choose who to blame:

  • MPI
  • Developers
  • Programming paradigm (Coarse Grain Parallelism)
  • Vetter (for not telling his users about dataflow)


SLIDE 16

Whose fault is it?

Audience participation. Choose who to blame:

  • MPI
  • Developers
  • Programming paradigm (Coarse Grain Parallelism)
  • Vetter (for not telling his users about dataflow)


MPI has a simple and an advanced API and many developers use only the simple one.

  • Rusty
SLIDE 17

Message so far

  • Using CGP does not scale
  • Using Dataflow execution does
  • BUT, developers have to understand their code


SLIDE 18

Sure, but can we make EDP easy?

Can we make dataflow execution harness all the benefits without explicit dataflow programming?


SLIDE 19

Sure, but can we make EDP easy?

Can we make dataflow execution harness all the benefits without explicit dataflow programming?

Yes, we can. In some cases. Maybe?

SLIDE 20

Bridging Explicit & Implicit dataflow

  • Reduce the cost of discovery
    • Code specialization
      • Developer expertise
      • Results of compiler analysis
  • Harness the benefits of a parametric representation
    • Compress the graph on the fly (a sketch follows this list)
      • Detect patterns in series that translate to expressions, or functions
      • Use compiler-inserted hints
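A minimal sketch of the "detect patterns in series" idea; the types and names below (task_range, range_absorb) are hypothetical, not a real runtime API. The point is that a run of task insertions that differ only in a regularly increasing index can be stored as one parametric record instead of many individual graph nodes.

    #include <stdio.h>

    /* Hypothetical on-the-fly compression of a series of task insertions. */
    typedef struct {
        int task_class;            /* which kernel is being inserted       */
        int start, stride, count;  /* indices covered: start + i*stride    */
    } task_range;

    /* Try to absorb the next insertion into an open range.
       Returns 1 on success; 0 means the pattern broke and the caller
       should emit the old range and start a new one. */
    static int range_absorb(task_range *r, int task_class, int index) {
        if (r->count == 0) {                       /* open a new range */
            r->task_class = task_class;
            r->start = index; r->stride = 0; r->count = 1;
            return 1;
        }
        if (task_class != r->task_class) return 0;
        if (r->count == 1) {                       /* second element fixes the stride */
            if (index <= r->start) return 0;
            r->stride = index - r->start; r->count = 2;
            return 1;
        }
        if (index == r->start + r->count * r->stride) {   /* pattern continues */
            r->count++;
            return 1;
        }
        return 0;
    }

    int main(void) {
        task_range r = {0, 0, 0, 0};
        for (int i = 0; i < 8; i++)
            range_absorb(&r, /* hypothetical class id for Task_D */ 4, i);
        printf("class %d: start=%d stride=%d count=%d\n",
               r.task_class, r.start, r.stride, r.count);
        return 0;
    }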

SLIDE 21

Reduce the unnecessary discovery

(The same CCSD loop nest as on Slide 8.)

Callout: Insert_Task

SLIDE 22

Reduce the unnecessary discovery

(The same CCSD loop nest as on Slide 8.)

Callouts: Insert_Task, Handle Generation, Data Fetching, Data Flushing

SLIDE 23

Reduce the unnecessary discovery

(The same CCSD loop nest as on Slide 8.)

Callouts: Insert_Task, Handle Generation, Data Fetching, Data Flushing, Pleasantly Parallel
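To see why this decomposition can overlap data movement with computation, here is a hedged, self-contained sketch in the same spirit; it uses plain C with OpenMP tasks, and the helpers fetch_block and flush_block are stand-ins of my own, not the actual NWChem or PaRSEC code. The fetch of each block, the update, and the final flush become separate tasks with declared data accesses, so independent fetches and updates can proceed concurrently.

    #include <stdio.h>

    #define NB 4   /* number of blocks, toy size */

    /* Stand-ins for GET_HASH_BLOCK / ADD_HASH_BLOCK: each "block"
       is a single double here, just to keep the sketch self-contained. */
    static double fetch_block(int i) { return (double)(i + 1); }
    static void   flush_block(double c) { printf("C = %g\n", c); }

    int main(void) {
        double a[NB], b[NB], c = 0.0;

        #pragma omp parallel
        #pragma omp single
        {
            for (int i = 0; i < NB; i++) {
                /* data fetching: independent tasks, free to overlap */
                #pragma omp task depend(out: a[i]) firstprivate(i)
                a[i] = fetch_block(i);

                #pragma omp task depend(out: b[i]) firstprivate(i)
                b[i] = fetch_block(i);

                /* actual work: waits only for its own blocks, accumulates into c */
                #pragma omp task depend(in: a[i], b[i]) depend(inout: c) firstprivate(i)
                c += a[i] * b[i];
            }
            /* data flushing: runs once every update to c is done */
            #pragma omp task depend(in: c)
            flush_block(c);
        }
        return 0;
    }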

SLIDE 24

Dataflow between subroutines

SLIDE 25

Code grouping based on dataflow

SLIDE 26

Message so far

  • Discovering the whole DAG does not scale
  • Pruning the DAG requires human expertise
  • Compiler analysis can assist with pruning
  • BUT, developers have to understand their code


SLIDE 27

Compressing the DAG to a PTG?

for (k = 0; k < MT; k++) {
    Insert_Task( zgeqrt, A[k][k], INOUT,
                         T[k][k], OUTPUT );
    for (m = k+1; m < MT; m++) {
        Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D | REGION_U,
                             A[m][k], INOUT | LOCALITY,
                             T[m][k], OUTPUT );
    }
    for (n = k+1; n < NT; n++) {
        Insert_Task( zunmqr, A[k][k], INPUT | REGION_L,
                             T[k][k], INPUT,
                             A[k][n], INOUT );
        for (m = k+1; m < MT; m++) {
            Insert_Task( ztsmqr, A[k][n], INOUT,
                                 A[m][n], INOUT | LOCALITY,
                                 A[m][k], INPUT,
                                 T[m][k], INPUT );
        }
    }
}

SLIDE 28

What does a DAG look like?

[Figure: an unrolled task DAG with POTRF, TRSM, SYRK and GEMM tasks; edges are labeled with the data they carry (T=>T, C=>A, C=>B, C=>C).]

SLIDE 29

Fully compressed DAG (PTG)

Task classes and execution spaces:

GEQRT(k)        k = 0 .. mt-1
TSQRT(k,m)      k = 0 .. mt-1,  m = k+1 .. mt-1
UNMQR(k,n)      k = 0 .. mt-1,  n = k+1 .. mt-1
TSMQR(k,m,n)    k = 0 .. mt-1,  m = k+1 .. mt-1,  n = k+1 .. mt-1

Dependence relations:

{[k] -> [k,k+1] : k <= mt-2}
{[k] -> [k,n] : k < n < nt && k < nt-1}
{[k,n] -> [k,k+1,n] : k < mt-1}
{[k,m] -> [k,m+1] : m < mt-1}
{[k,m] -> [k,m,n] : k < nt-1 && k < n < nt}
{[k,m,n] -> [n] : k+1 == n && k+1 == m}
{[k,m,n] -> [n,m] : k+1 == n && m > n}
{[k,m,n] -> [k+1,n] : k+1 == m && n > m}
{[k,m,n] -> [k+1,m,n] : n > k+1 && m > k+1}
{[k,m,n] -> [k,m+1,n] : m < mt-1}
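As an illustration of why the parametric form matters: a successor of a task can be computed directly from a symbolic dependence such as {[k,m,n] -> [k,m+1,n] : m < mt-1} above, instead of being looked up in a stored, fully unrolled DAG. The helper below is a sketch of that idea only, not PaRSEC code; tsmqr_id and mt are names chosen here to mirror the slide.

    #include <stdio.h>

    typedef struct { int k, m, n; } tsmqr_id;

    /* Compute one class of TSMQR successors analytically from the
       relation {[k,m,n] -> [k,m+1,n] : m < mt-1}.
       Returns 1 and fills *succ if TSMQR(k,m+1,n) exists, 0 otherwise. */
    static int tsmqr_succ_same_panel(tsmqr_id t, int mt, tsmqr_id *succ) {
        if (t.m < mt - 1) {            /* the guard of the relation */
            succ->k = t.k;
            succ->m = t.m + 1;
            succ->n = t.n;
            return 1;
        }
        return 0;
    }

    int main(void) {
        tsmqr_id t = {0, 1, 2}, s;
        if (tsmqr_succ_same_panel(t, 4, &s))
            printf("TSMQR(%d,%d,%d) -> TSMQR(%d,%d,%d)\n",
                   t.k, t.m, t.n, s.k, s.m, s.n);
        return 0;
    }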

SLIDE 30

Compressing the DAG to a PTG?

(The same Insert_Task loop nest as on Slide 27.)

Discovered task sequence: Task_A, Task_B, Task_B, Task_B, Task_C, Task_D, Task_D, Task_D, …

SLIDE 31

Compressing the DAG to a PTG?

(The same Insert_Task loop nest as on Slide 27.)

Discovered task sequence: Task_A, Task_B, Task_B, Task_B, Task_C, Task_D, Task_D, Task_D, …

SLIDE 32

Compressing the DAG to a PTG?

(The same Insert_Task loop nest as on Slide 27.)

Task sequence: Task_A, Task_B, Task_B, Task_B, Task_C, Task_D(1-3), …

SLIDE 33

Compressing the DAG to a PTG?

(The same Insert_Task loop nest as on Slide 27.)

Iteration_vector(k,n,m)    Indices(A[k][n], k, n)

SLIDE 34

Compressing the DAG to a PTG?

(The same Insert_Task loop nest as on Slide 27.)

Task sequence: Task_A, Task_B, Task_B, Task_B, Task_C, Task_D, Task_D, Task_D, Task_C, Task_D, Task_D, Task_D, Task_A, …

SLIDE 35

Compressing the DAG to a PTG?

(The same Insert_Task loop nest as on Slide 27.)

Task sequence: Task_A, Task_B, Task_B, Task_B, Task_C(k,n), Task_D(k,n,m), …    with k < m < MT

SLIDE 36

Conclusion

☺ Dataflow execution is more scalable than CGP
☺ Dataflow programming can maximize the benefits
☹ Compilers cannot do it by themselves
    ☹ Not even Torsten's compiler!
☹ Runtimes can, but at a cost

  • Dataflow for the masses means sharing the load between developer, compiler, and runtime.

SLIDE 37

Quotes

Developers know about their program much more than a compiler can ever figure out.

  • Doug Miles

Let the human do what humans do best.

  • Jeff Hollingsworth
