Stateful Dataflow Multigraphs: A Data-Centric Model for Performance - PowerPoint PPT Presentation

spcl.inf.ethz.ch @spcl_eth
Tal Ben-Nun, Johannes de Fine Licht, Alexandros-Nikolaos Ziogas, Timo Schneider, Torsten Hoefler
Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures
This project has received funding from the European Research Council (ERC) under grant agreement "DAPP (PI: T. Hoefler)".

Motivation (slide courtesy of NVIDIA)

Computational Scientist (source: US DoE)

Domain Scientist, Performance Engineer

Optimization Techniques
▪ Multi-core CPU
    ▪ Tiling for complex cache hierarchies
    ▪ Register optimizations
    ▪ Vectorization
▪ Many-core GPU
    ▪ Coalesced memory access
    ▪ Warp divergence minimization, register tiling
    ▪ Task fusion
▪ FPGA
    ▪ Maximize resource utilization (logic units, DSPs)
    ▪ Streaming optimizations, pipelining
    ▪ Explicit buffering (FIFO) and wiring

DaCe Overview
[Diagram labels: Domain Scientist; Problem Formulation ($\frac{\partial u}{\partial t} - \alpha \nabla^2 u = 0$); Scientific Frontend (Python, DSLs, TensorFlow, MATLAB, …); Data-Centric Intermediate Representation (SDFG); Performance Engineer; Graph Transformations; Hardware Information; Transformed Dataflow; Compiler; CPU Binary, GPU Binary, FPGA Modules; Runtime; System; Performance Results]

Dataflow Programming in DaCe
[Diagram: array x feeds a Tasklet computing $y = x^2 + \sin(x / \pi)$, whose result is written to array y; the edges carrying data are Memlets.]

Parallel Dataflow Programming
[Diagram: array A feeds N independent Tasklets through the memlets A[0], A[1], …, A[N-1]; the tasklets write B[0], B[1], …, B[N-1] into array B.]

Parallel Dataflow Programming
[Diagram: the same graph next to its compact form using a map Scope: A -> A[0:N] -> map entry [i=0:N] -> A[i] -> Tasklet -> B[i] -> map exit [i=0:N] -> B[0:N] -> B.]
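As a rough illustration of the map scope, the following plain-Python sketch (not DaCe code) runs one tasklet per index i in [0, N), reading A[i] and writing B[i]. The names `run_map` and `double` are hypothetical, and the doubling tasklet is an assumption, since the slide does not say what the tasklet computes.

```python
def run_map(tasklet, A):
    """Apply `tasklet` independently to every element, like a map over i = 0:N."""
    N = len(A)
    B = [None] * N
    # Each iteration touches only A[i] and B[i], so iterations could run in parallel.
    for i in range(N):
        B[i] = tasklet(A[i])
    return B


def double(a):
    # Assumed tasklet body; the slide leaves the computation unspecified.
    return 2 * a


if __name__ == "__main__":
    print(run_map(double, [1, 2, 3]))
```

Because no iteration reads what another writes, the loop order is irrelevant, which is the property the map scope makes explicit.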

Stateful Parallel Dataflow Programming
[Diagram: two map graphs, each placed in its own state. State s0: A -> A[0:N] -> map entry [i=0:N] -> A[i] -> Tasklet -> B[i] -> map exit [i=0:N] -> B[0:N] -> B. State s1: C -> C[0:N] -> map entry [i=0:N] -> C[i] -> Tasklet -> A[i] -> map exit [i=0:N] -> A[0:N] -> A.]

Example: 2D Stencil
[Diagram: State s0 (Initialize): a map over [y=0:H, x=0:W] with an empty (∅) input memlet writes B[y,x] into B[0:H,0:W]. State s1 (Jacobi): a map over [y=0:H, x=0:W] reads A[y,x-1], A[y,x+1], A[y-1,x], A[y+1,x] and writes B[y,x]; a second map over the same range reads B[y,x-1], B[y,x+1], B[y-1,x], B[y+1,x] and writes A[y,x]. State transitions: t=0 after initialization; t < T to enter s1; t < T; t++ to repeat s1; t ≥ T to exit.]
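The stencil state machine can be sketched in plain Python as follows (not DaCe code). Several details are assumptions because the slide does not give them: the tasklet is taken to average the four neighbours, only interior points are updated so that no access falls outside the array, and each pass through the loop runs the A-to-B stencil followed by the B-to-A stencil. The names `jacobi_step` and `jacobi` are hypothetical.

```python
def jacobi_step(src, dst):
    """One stencil map: write dst[y][x] from the four neighbours of src[y][x]."""
    H, W = len(src), len(src[0])
    # Assumption: only interior points are updated; boundary values stay fixed.
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            dst[y][x] = 0.25 * (src[y][x - 1] + src[y][x + 1]
                                + src[y - 1][x] + src[y + 1][x])


def jacobi(A, T):
    """State machine: while t < T, run A -> B, then B -> A, then increment t."""
    B = [row[:] for row in A]  # B starts as a copy so its boundary matches A
    t = 0
    while t < T:
        jacobi_step(A, B)  # first map reads A and writes B
        jacobi_step(B, A)  # second map reads B and writes A
        t += 1
    return A


if __name__ == "__main__":
    grid = [[4.0, 4.0, 4.0],
            [4.0, 0.0, 4.0],
            [4.0, 4.0, 4.0]]
    print(jacobi(grid, 1))
```

The `while t < T` loop plays the role of the state transitions, while each call to `jacobi_step` corresponds to one fully parallel map inside a state.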

Meet the Nodes
▪ State: State machine element
▪ Tasklet: Fine-grained computational block
▪ Array: N-dimensional data container
▪ Map (entry and Exit): Parametric graph abstraction for parallelism
▪ Stream: Streaming data container
▪ Consume (entry and Exit): Dynamic mapping of computations on streams
▪ Conflict Resolution: Defines behavior during conflicting writes
[Example diagram, State s0: A -> A[0:N] -> map entry [i=0:N] -> A[i] -> Filter tasklet -> S and Bsize(+) -> map exit [i=0:N] -> stream S and Bsize(+); stream S -> B[0:N] -> B; Bsize is written with sum conflict resolution.]
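The filter example combines a stream with a conflict-resolved write. A minimal sequential sketch in plain Python (not DaCe code) is shown below: each element that passes the predicate is pushed to a stream S, and a counter Bsize is incremented, which stands in for a write whose conflicts are resolved by summation. The function name `filter_to_stream` and the predicate are assumptions, since the slide does not state the filter condition.

```python
from collections import deque


def filter_to_stream(A, predicate):
    """Push matching elements of A into a stream and count them."""
    S = deque()  # stands in for the Stream node
    Bsize = 0    # written by many iterations; conflicts are resolved by summation
    for i in range(len(A)):
        if predicate(A[i]):
            S.append(A[i])
            Bsize += 1  # corresponds to the Bsize(+) memlet
    B = list(S)  # drain the stream into the output array
    return B, Bsize


if __name__ == "__main__":
    print(filter_to_stream([1, 5, 2, 8], lambda v: v > 3))
```

In a parallel execution the order of elements in B would not be guaranteed; this sequential sketch happens to preserve input order.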

Hierarchical Parallelism and Heterogeneity
▪ Maps have schedules, arrays have storage locations
[Diagram: A -> A[0:N] -> map [i=0:N:TN] (CPU) -> A[i:i+TN] -> map [ti=0:TN] (Core) -> A[i+ti] -> tasklet out = in_A * in_A -> …]

Hierarchical Parallelism and Heterogeneity
▪ Maps have schedules, arrays have storage locations
[Diagram: A -> A[0:N] -> map [i=0:N:TN] (CPU) -> A[i:i+TN] -> tA -> tA[0:TN] -> map [ti=0:TN] (Core) -> tA[ti] -> tasklet out = in_A * in_A -> …]
Generated code:
    // ...
    #pragma omp parallel for
    for (int i = 0; i < N; i += TN) {
        vec<double, 4> tA[TN];
        Global2Stack_1D<double, 4, 1>(&A[i], min(N - i, TN), tA);
        for (int ti = 0; ti < TN; ti += 1) {
            vec<double, 4> in_A = tA[ti];
            auto out = (in_A * in_A);
            tC[ti] = out;
        }
    }

Hierarchical Parallelism and Heterogeneity
[Diagram: A -> A[0:N] -> map [i=0:N:TN] (CPU) -> A[i:i+TN] -> tA -> tA[0:TN] -> map [ti=0:TN] (Core) -> tA[ti] -> tasklet out = in_A * in_A -> …]
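The two nested maps can be read as a tiled loop nest. The plain-Python sketch below (not DaCe code) mirrors that structure: the outer loop steps over tiles of size TN, a local copy tA holds the current tile, and the inner loop applies the tasklet out = in_A * in_A. The function name `square_tiled` and the output array C are assumptions for illustration.

```python
def square_tiled(A, TN):
    """Square every element of A, processing it in tiles of TN elements."""
    N = len(A)
    C = [0] * N
    for i in range(0, N, TN):      # outer map [i=0:N:TN]
        tA = A[i:i + TN]           # local tile, like the transient tA
        for ti in range(len(tA)):  # inner map [ti=0:TN], clipped at the last tile
            in_A = tA[ti]
            out = in_A * in_A      # tasklet body from the slide
            C[i + ti] = out
    return C


if __name__ == "__main__":
    print(square_tiled([1, 2, 3, 4, 5], 2))
```

Slicing clips the last tile automatically, which plays the same role as the `min(N - i, TN)` bound in the generated code.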

Hierarchical Parallelism and Heterogeneity
[Diagram: A -> A[0:N] -> gA -> gA[0:N] -> map [i=0:N:TN] (GPU Device) -> gA[i:i+TN] -> tA -> tA[0:TN] -> map [ti=0:TN] (GPU Block) -> tA[ti] -> tasklet out = in_A * in_A -> …]
Generated code:
    __global__ void multiplication_1(...) {
        int i = blockIdx.x * TN;
        int ti = threadIdx.y + 0;
        if (i+ti >= N) return;
        __shared__ vec<double, 2> tA[TN];
        GlobalToShared1D<double, 2, TN, 1, 1, false>(gA, tA);
        vec<double, 2> in_A = tA[ti];
        auto out = (in_A * in_A);
        tC[ti] = out;
    }

Mapping to Reconfigurable Hardware
▪ Module generation with HDL and HLS
    ▪ Xilinx SDAccel
    ▪ Intel FPGA (experimental)
▪ Parallelism
    ▪ Exploiting temporal locality: pipelines
    ▪ Exploiting spatial locality: vectorization, replication
▪ Replication
    ▪ Enables parametric systolic array generation

Data-centric Parallel Programming for Python
▪ Programs are integrated within existing codes
    ▪ In Python, integrated functions in existing code
    ▪ In MATLAB, separate .m files
    ▪ In TensorFlow, takes existing graph
▪ In Python: Implicit and Explicit Dataflow
    ▪ Implicit: numpy syntax
    ▪ Explicit: Enforce memory access decoupling from computation
▪ Output compatible with existing programs
    ▪ C-compatible SO/DLL file with autogenerated include file
Implicit dataflow:
    @dace.program
    def program_numpy(A, B):
        B[:] = np.transpose(A)
Explicit dataflow:
    @dace.program
    def program_explicit(A, B):
        @dace.map
        def transpose(i: _[0:N], j: _[0:M]):
            a << A[i,j]
            b >> B[j,i]
            b = a

Matrix Multiplication SDFG
    @dace.program
    def gemm(A: dace.float64[M, K], B: dace.float64[K, N], C: dace.float64[M, N]):
        # Transient variable
        tmp = np.ndarray([M, N, K], dtype=A.dtype)

        @dace.map
        def multiplication(i: _[0:M], j: _[0:N], k: _[0:K]):
            in_A << A[i,k]
            in_B << B[k,j]
            out >> tmp[i,j,k]
            out = in_A * in_B

        dace.reduce(lambda a, b: a + b, tmp, C, axis=2)
[Diagram, State s0: A -> A[0:M,0:K] and B -> B[0:K,0:N] enter the map [i=0:M, j=0:N, k=0:K]; the multiplication tasklet reads A[i,k] and B[k,j] and writes tmp[i,j,k]; the map exit writes tmp[0:M,0:N,0:K] to the transient tmp; a Reduce node [axis: 2, sum] reads tmp[0:M,0:N,0:K] and writes C[0:M,0:N] to C.]

Matrix Multiplication SDFG
[Diagram, State s0, shown next to the same gemm code: A -> A[0:M,0:K] and B -> B[0:K,0:N] enter the map [i=0:M, j=0:N, k=0:K]; the multiplication tasklet reads A[i,k] and B[k,j] and writes C (+) [i,j], a write with sum conflict resolution; the map exit writes C[0:M,0:N] to C. The transient tmp and the Reduce node no longer appear.]

MapReduceFusion Transformation
[Diagram: graph rewrite rule. Matched pattern labels: my_tasklet, $A[$ar], map exit (*), $A[:], array $A, $A[:], $REDUCE, $B[$br], array $B. Replacement labels: my_tasklet, $B[$ar], map exit (*), $B[$br], array $B.]
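The effect of MapReduceFusion on the matrix multiplication above can be sketched in plain Python (this is not DaCe code, and the function names are hypothetical). The first function materializes the M x N x K transient and then reduces over the last axis; the second accumulates directly into C, which corresponds to the write with sum conflict resolution.

```python
def gemm_map_then_reduce(A, B):
    """Map writes every product into tmp[i][j][k]; a separate step sums over k."""
    M, K, N = len(A), len(B), len(B[0])
    tmp = [[[A[i][k] * B[k][j] for k in range(K)]
            for j in range(N)]
           for i in range(M)]
    return [[sum(tmp[i][j]) for j in range(N)] for i in range(M)]


def gemm_fused(A, B):
    """Fused form: each product is accumulated straight into C[i][j]."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i][j] += A[i][k] * B[k][j]  # sum-resolved write to C[i][j]
    return C


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(gemm_map_then_reduce(A, B))
    print(gemm_fused(A, B))
```

Both functions compute the same product, but the fused form needs no M x N x K intermediate storage.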

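Programming Model Challenges
▪ Indirect memory access
[Diagram: inputs col[:], x[:], val[:] and prow[0:2] enter a map [j=prow[0]:prow[1]]; an indirection tasklet reads col[j] and accesses x dynamically (x(1)[:]) to produce x_in; a multiply tasklet combines x_in with val[j] and writes b[i] (Sum); the map exit writes b[:] (Sum).]
▪ Nested state machines
[Diagram: a map [y=0:H, x=0:W] with an empty (∅) input contains a state machine over x and y: init (i = 0), then update repeated while $x^2 + y^2 < 4$ with i = i + 1; the result i is written to image[y,x], and the map exit writes image[0:H,0:W] to image.]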
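The indirect-access pattern in the diagram resembles a sparse matrix-vector product over a compressed-row layout. The plain-Python sketch below (not DaCe code) assumes that layout: `row_ptr` is a hypothetical full row-pointer array from which the slide's two-element prow window would be taken, and `spmv_csr` is a hypothetical name.

```python
def spmv_csr(row_ptr, col, val, x):
    """Compute b = M @ x for a sparse matrix M stored in compressed-row form."""
    rows = len(row_ptr) - 1
    b = [0.0] * rows
    for i in range(rows):
        # The inner range depends on data (row_ptr), not on a fixed bound.
        for j in range(row_ptr[i], row_ptr[i + 1]):
            x_in = x[col[j]]       # indirection: the index into x is read from col
            b[i] += val[j] * x_in  # multiply, accumulated with sum
    return b


if __name__ == "__main__":
    # Matrix [[1, 0], [2, 3]] times vector [1, 2]
    print(spmv_csr([0, 1, 3], [0, 0, 1], [1.0, 2.0, 3.0], [1.0, 2.0]))
```

The access `x[col[j]]` is the difficulty named on the slide: which element of x is read is known only at run time, so it cannot be expressed as a fixed index range.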

DIODE (or: Data-centric Integrated Optimization Development Environment)
[Screenshot labels, in slide order: Transformations, SDFG, Source Code, (malleable), SDFG Properties, Generated Code, Transformation History]

Performance
[Chart, built up over successive slides, with one entry added per transformation in this order: Naïve SDFG, MapReduceFusion, LoopReorder, BlockTiling, RegisterTiling, LocalStorage, PromoteTransient]

Performance
[Chart comparing DaCe with Intel MKL and OpenBLAS; annotations: "25% difference" and "With tuning: 98.6% of MKL"]

▪ Intel Xeon E5-2650 v4, NVIDIA Tesla P100, Xilinx VU9P
▪ SDFG
▪ General Compilers: GCC 8, Clang 6, icc 18, NVCC 9.2, SDAccel
▪ Polyhedral Optimizers: Polly 6, Pluto 0.11.4, PPCG 0.8
▪ Frameworks & Libraries: HPX, Halide, Intel MKL, CUBLAS, CUSPARSE, CUTLASS, CUB
