Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures
Tal Ben-Nun, Johannes de Fine Licht, Alexandros-Nikolaos Ziogas, Timo Schneider, Torsten Hoefler
spcl.inf.ethz.ch @spcl_eth


  1. Tal Ben-Nun, Johannes de Fine Licht, Alexandros-Nikolaos Ziogas, Timo Schneider, Torsten Hoefler: Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures. This project has received funding from the European Research Council (ERC) under grant agreement "DAPP (PI: T. Hoefler)".

  2. Motivation (slide courtesy of NVIDIA)

  3. Source: US DoE Computational Scientist

  4. Domain Scientist / Performance Engineer

  5. Optimization Techniques
     ▪ Multi-core CPU: tiling for complex cache hierarchies; register optimizations; vectorization
     ▪ Many-core GPU: coalesced memory access; warp divergence minimization, register tiling; task fusion
     ▪ FPGA: maximize resource utilization (logic units, DSPs); streaming optimizations, pipelining; explicit buffering (FIFO) and wiring

  6. DaCe Overview: the domain scientist writes the problem formulation in a scientific frontend (Python, DSLs, TensorFlow, MATLAB, ...), which the compiler lowers to a data-centric intermediate representation (SDFG). The performance engineer applies graph transformations to produce a transformed dataflow, from which the system emits CPU and GPU binaries and FPGA modules; runtime performance results are fed back to the performance engineer. (Diagram)

  7. Dataflow Programming in DaCe: a tasklet computes z = y² + sin(y); memlets (the labeled edges) carry data in and out of the tasklet. (Diagram)
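The tasklet/memlet split can be sketched in plain Python (this is not DaCe syntax; the function name and return-value interface are illustrative only). The point is that the tasklet body sees only the values its memlets deliver:

```python
import math

def tasklet(y):
    # Tasklet body from the slide: z = y**2 + sin(y).
    # In an SDFG, y arrives via an incoming memlet and z leaves via
    # an outgoing memlet; the tasklet cannot touch any other memory.
    z = y * y + math.sin(y)
    return z
```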

  8. Parallel Dataflow Programming: N independent tasklets, the i-th reading A[i] and writing B[i]. (Diagram)

  9. Parallel Dataflow Programming: the N tasklets are abstracted by a map scope [i=0:N]. The map entry takes A[0:N] and routes A[i] to each tasklet instance; the map exit collects the B[i] results into B[0:N]. (Diagram)
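The map semantics can be sketched sequentially in plain Python (the doubling tasklet body is an arbitrary placeholder, not from the slide). Every iteration is data-independent, so a backend is free to run them in parallel:

```python
def mapped_computation(A):
    # Sequential sketch of a map scope [i=0:N].
    N = len(A)
    B = [0.0] * N
    for i in range(N):       # one tasklet instance per i
        a = A[i]             # incoming memlet A[i]
        B[i] = 2.0 * a       # placeholder tasklet; outgoing memlet B[i]
    return B
```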

  10. Stateful Parallel Dataflow Programming: two mapped computations, one reading A[i] and writing B[i], another reading C[i] and writing back to A[i]. (Diagram)

  11. Stateful Parallel Dataflow Programming: the two mapped computations are placed in separate states s0 and s1 of the state machine. (Diagram)

  12. Example: 2D Stencil. State s0 initializes the buffers at t=0; state s1 performs Jacobi updates over [y=0:H, x=0:W], reading the four neighbors A[y,x-1], A[y,x+1], A[y-1,x], A[y+1,x] to write B[y,x], then the neighbors of B to write back A[y,x]. The state machine loops (t < T; t++) and terminates when t ≥ T. (Diagram)
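A plain-Python sketch of the stencil's semantics, assuming a double-buffered update restricted to interior cells (the slide does not show boundary handling, so that restriction is an assumption):

```python
def jacobi(A, T):
    # State s1 of the SDFG: map over cells, averaging the four
    # neighbors of one buffer into the other; the state machine
    # swaps buffers and repeats while t < T.
    H, W = len(A), len(A[0])
    A = [row[:] for row in A]   # work on copies; leave caller's grid intact
    B = [row[:] for row in A]
    for t in range(T):
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                B[y][x] = 0.25 * (A[y][x-1] + A[y][x+1]
                                  + A[y-1][x] + A[y+1][x])
        A, B = B, A
    return A
```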

  13. Meet the Nodes
     ▪ State: state machine element
     ▪ Tasklet: fine-grained computational block
     ▪ Array: N-dimensional data container
     ▪ Map (entry/exit): parametric graph abstraction for parallelism
     ▪ Stream: streaming data container
     ▪ Consume (entry/exit): dynamic mapping of computations on streams
     ▪ Conflict Resolution: defines behavior during conflicting writes

  14. Meet the Nodes (example): an SDFG combining the node types. In state s0, a map [i=0:N] reads A[i] into a Filter tasklet, which pushes selected elements into a stream S; a consume scope drains S and writes the results to B[0:N], while the output count Bsize is accumulated with a (+) conflict resolution. (Diagram)
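The conflict-resolution node can be sketched in plain Python (the function name is illustrative): instead of racing, concurrent writes to the same location are combined by a resolution function.

```python
def write_resolved(container, index, value,
                   resolve=lambda old, new: old + new):
    # Write with conflict resolution: combine the incoming value
    # with the current contents instead of overwriting.
    container[index] = resolve(container[index], value)

B = [0, 0]
for v in [1, 2, 3]:           # three "parallel" writers target B[0]
    write_resolved(B, 0, v)   # (+) resolution: B[0] accumulates the sum
```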

  15. Hierarchical Parallelism and Heterogeneity
     ▪ Maps have schedules, arrays have storage locations: an outer CPU-scheduled map [i=0:N:TN] tiles A into chunks A[i:i+TN], and an inner Core-scheduled map [ti=0:TN] runs the tasklet out = in_A * in_A on each element A[i+ti]. (Diagram)

  16. Hierarchical Parallelism and Heterogeneity
     ▪ Maps have schedules, arrays have storage locations. The same SDFG, with a transient tA buffering each tile, lowered to CPU code:

     // ...
     #pragma omp parallel for
     for (int i = 0; i < N; i += TN) {
         vec<double, 4> tA[TN];
         Global2Stack_1D<double, 4, 1>(&A[i], min(N - i, TN), tA);
         for (int ti = 0; ti < TN; ti += 1) {
             vec<double, 4> in_A = tA[ti];
             auto out = (in_A * in_A);
             tC[ti] = out;
         }
     }

  17. Hierarchical Parallelism and Heterogeneity: the same SDFG with a transient array tA buffering each tile A[i:i+TN] between the CPU-scheduled outer map and the Core-scheduled inner map. (Diagram)

  18. Hierarchical Parallelism and Heterogeneity: retargeting the same SDFG to a GPU (global array gA, GPU Device-scheduled outer map, GPU Block-scheduled inner map, shared-memory transient tA) generates:

     __global__ void multiplication_1(...) {
         int i = blockIdx.x * TN;
         int ti = threadIdx.y + 0;
         if (i + ti >= N) return;
         __shared__ vec<double, 2> tA[TN];
         GlobalToShared1D<double, 2, TN, 1, 1, false>(gA, tA);
         vec<double, 2> in_A = tA[ti];
         auto out = (in_A * in_A);
         tC[ti] = out;
     }

  19. Mapping to Reconfigurable Hardware
     ▪ Module generation with HDL and HLS: Xilinx SDAccel, Intel FPGA (experimental)
     ▪ Parallelism: exploiting temporal locality with pipelines; exploiting spatial locality with vectorization and replication
     ▪ Replication enables parametric systolic array generation

  20. Data-Centric Parallel Programming for Python
     ▪ Programs are integrated within existing codes: in Python, as decorated functions in existing code; in MATLAB, as separate .m files; in TensorFlow, by taking an existing graph
     ▪ In Python, dataflow can be implicit (numpy syntax) or explicit (enforcing decoupling of memory access from computation):

     @dace.program
     def program_numpy(A, B):
         B[:] = np.transpose(A)

     @dace.program
     def program_explicit(A, B):
         @dace.map
         def transpose(i: _[0:N], j: _[0:M]):
             a << A[i,j]
             b >> B[j,i]
             b = a

     ▪ Output is compatible with existing programs: a C-compatible SO/DLL file with an autogenerated include file

  21. Matrix Multiplication SDFG: in state s0, a map [i=0:M, j=0:N, k=0:K] multiplies A[i,k] and B[k,j] into a transient tmp[i,j,k], followed by a Reduce node [axis: 2, sum] into C[0:M,0:N]. The source:

     @dace.program
     def gemm(A: dace.float64[M, K], B: dace.float64[K, N],
              C: dace.float64[M, N]):
         # Transient variable
         tmp = np.ndarray([M, N, K], dtype=A.dtype)

         @dace.map
         def multiplication(i: _[0:M], j: _[0:N], k: _[0:K]):
             in_A << A[i,k]
             in_B << B[k,j]
             out >> tmp[i,j,k]
             out = in_A * in_B

         dace.reduce(lambda a, b: a + b, tmp, C, axis=2)

  22. Matrix Multiplication SDFG (transformed): the same program after fusing the map and the reduction. The multiplication map now writes directly to C[i,j] with a sum conflict resolution, C (+) [i,j], and both the transient tmp and the separate Reduce node disappear. (Diagram)

  23. MapReduceFusion Transformation: a pattern-matching graph rewrite. Where a mapped tasklet (my_tasklet) writes a transient array $A that is only consumed by a $REDUCE node into $B, the transformation redirects the tasklet to write $B[$br] directly, with the reduction operator as the conflict resolution, eliminating the intermediate array. (Diagram)
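The effect of this transformation on the matrix-multiplication example can be sketched in plain Python (illustrative function names; DaCe performs the rewrite on the SDFG itself, not on loop nests):

```python
def gemm_with_tmp(A, B, M, N, K):
    # Before fusion: materialize tmp[i][j][k], then reduce over k.
    tmp = [[[A[i][k] * B[k][j] for k in range(K)]
            for j in range(N)] for i in range(M)]
    return [[sum(tmp[i][j]) for j in range(N)] for i in range(M)]

def gemm_fused(A, B, M, N, K):
    # After MapReduceFusion: the tasklet writes straight into C with
    # a (+) conflict resolution, and tmp is never allocated.
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Both produce identical results; the fused form avoids the O(M·N·K) intermediate.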

  24. Programming Model Challenges: indirect memory access (an SpMV-style gather, where an inner map [j=prow[0]:prow[1]] reads val[j] and x through the column index col[j], sum-resolving into b) and nested state machines (data-dependent control flow inside a [y=0:H, x=0:W] map over an image, e.g. an init/indirection/update loop with condition y³ + z³ < 5 and update i = i + 1, accumulating into b). (Diagrams)
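The indirect-access pattern from the slide, sketched for one CSR matrix row in plain Python (the function name is illustrative):

```python
def spmv_row(val, col, x, prow):
    # Inner map [j=prow[0]:prow[1]]: gather x through the column
    # index array col (the indirection), multiply by val[j], and
    # sum-resolve the products into b.
    b = 0.0
    for j in range(prow[0], prow[1]):
        b += val[j] * x[col[j]]
    return b
```

The challenge for a dataflow representation is that the memlet reading x depends on the runtime contents of col, not on a statically known range.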

  25. DIODE (or: Data-centric Integrated Optimization Development Environment)

  26. DIODE (or: Data-centric Integrated Optimization Development Environment): panes for available Transformations, the (malleable) SDFG, Source Code, SDFG Properties, Generated Code, and the Transformation History. (Screenshot)

  27. Performance: naïve SDFG

  28. Performance: + MapReduceFusion

  29. Performance: + LoopReorder

  30. Performance: + BlockTiling

  31. Performance: + RegisterTiling

  32. Performance: + LocalStorage

  33. Performance: + PromoteTransient

  34. Performance: DaCe vs. Intel MKL and OpenBLAS; a 25% difference from MKL, and with tuning, 98.6% of MKL. (Plot)

  35. Evaluation setup
     ▪ Hardware: Intel Xeon E5-2650 v4, NVIDIA Tesla P100, Xilinx VU9P
     ▪ General compilers: GCC 8, Clang 6, icc 18, NVCC 9.2, SDAccel
     ▪ Polyhedral optimizers: Polly 6, Pluto 0.11.4, PPCG 0.8
     ▪ Frameworks & libraries: HPX, Halide, Intel MKL, CUBLAS, CUSPARSE, CUTLASS, CUB
