Data-Centric Parallel Programming

spcl.inf.ethz.ch @spcl_eth
Data-Centric Parallel Programming
Torsten Hoefler, invited talk at ROSS’19 at HPDC’19 in conjunction with ACM FCRC
Alexandros Ziogas, Tal Ben-Nun, Guillermo Indalecio, Timo Schneider, Mathieu Luisier, and Johannes de Fine Licht and the whole DAPP team @ SPCL
https://eurompi19.inf.ethz.ch

Changing hardware constraints and the physics of computing
How to address locality challenges on standard architectures and programming?
D. Unat et al.: “Trends in Data Locality Abstractions for HPC Systems”, IEEE Transactions on Parallel and Distributed Systems (TPDS), Vol. 28, Nr. 10, IEEE, Oct. 2017
Energy per operation (0.9 V) [1]:
▪ 32-bit FP ADD: 0.9 pJ
▪ 32-bit FP MUL: 3.2 pJ
▪ 2x32 bit from L1 (8 kiB): 10 pJ
▪ 2x32 bit from L2 (1 MiB): 100 pJ
▪ 2x32 bit from DRAM: 1.3 nJ
Process nodes: 130nm, 90nm, 65nm, 45nm, 32nm, 22nm, 14nm, 10nm, …
Three Ls of modern computing:
[1]: Mark Horowitz, Computing’s Energy Problem (and what we can do about it), ISSCC 2014, plenary
[2]: Moore: Landauer Limit Demonstrated, IEEE Spectrum 2012
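To put the energy figures above in proportion, the following minimal sketch expresses each listed cost as a multiple of one 32-bit FP ADD. The numbers are copied from the slide; the dictionary name `ENERGY_PJ` and the helper `cost_in_adds` are illustrative names, not part of the talk.

```python
# Energy figures as listed on the slide, all converted to picojoules.
ENERGY_PJ = {
    "32-bit FP ADD": 0.9,
    "32-bit FP MUL": 3.2,
    "2x32 bit from L1 (8 kiB)": 10.0,
    "2x32 bit from L2 (1 MiB)": 100.0,
    "2x32 bit from DRAM": 1300.0,  # 1.3 nJ = 1300 pJ
}


def cost_in_adds(name):
    """Return how many FP ADDs cost the same energy as the entry `name`."""
    return ENERGY_PJ[name] / ENERGY_PJ["32-bit FP ADD"]


if __name__ == "__main__":
    for name, picojoules in ENERGY_PJ.items():
        print(f"{name:26s} {picojoules:8.1f} pJ  ~ {cost_in_adds(name):7.1f} ADDs")
```

With these figures, fetching two 32-bit operands from DRAM costs about as much energy as 1,400 additions, and even an L1 access costs roughly 11.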

Data movement will dominate everything!
Sources: Kogge, Shalf; NVIDIA; Fatollahi-Fard et al.
▪ “In future microprocessors, the energy expended for data movement will have a critical effect on achievable performance.”
▪ “… movement consumes almost 58 watts with hardly any energy budget left for computation.”
▪ “…the cost of data movement starts to dominate.”
▪ “…data movement over these networks must be limited to conserve energy…”
▪ the phrase “data movement” appears 18 times on 11 pages (usually in concerning contexts)!
▪ “Efficient data orchestration will increasingly be critical, evolving to more efficient memory hierarchies and new types of interconnect tailored for locality and that depend on sophisticated software to place computation and data so as to minimize data movement.”

“Sophisticated software”: How do we program today?
▪ Well, to a good approximation how we programmed yesterday
  ▪ Or last year?
  ▪ Or four decades ago?
▪ Control-centric programming
  ▪ Worry about operation counts (flop/s is the metric, isn’t it?)
  ▪ Data movement is at best implicit (or invisible/ignored)
▪ Legion [1] is taking a good direction towards data-centric
  ▪ Tasking relies on data placement but not really dependencies (not visible to tool-chain)
  ▪ But it is still control-centric in the tasks – not (performance) portable between devices!
▪ Let’s go a step further towards an explicitly data-centric viewpoint
  ▪ For performance engineers at least!
Backus ‘77: “The assignment statement is the von Neumann bottleneck of programming languages and keeps us thinking in word-at-a-time terms in much the same way the computer’s bottleneck does.”
[1]: Bauer et al.: “Legion: expressing locality and independence with logical regions”, SC12, 2012

Performance Portability with DataCentric (DaCe) Parallel Programming
[Figure: DaCe workflow overview, with columns for Domain Scientist, Performance Engineer, and System.
 Domain Scientist: Problem Formulation ($\frac{\partial u}{\partial t} - \alpha \nabla^2 u = 0$); High-Level Program (Python / DSLs, NumPy, TensorFlow, MATLAB, SDFG Builder API).
 Dataflow: Data-Centric Intermediate Representation (SDFG, § 3).
 Performance Engineer: Hardware Information; Graph Transformations (API, Interactive, § 4); Transformed SDFG; Performance Results.
 System: Compiler; CPU Binary, GPU Binary, FPGA Modules; Thin Runtime Infrastructure.]
Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs, to appear at SC19

A first example in DaCe Python

DIODE User Interface
[Screenshot panels: Source Code, SDFG (malleable), Transformations, Generated Code, Performance]

Performance for matrix multiplication on x86
[Figure labels: Naïve, SDFG]

Performance for matrix multiplication on x86
[Figure labels: MapReduceFusion, Naïve, SDFG]
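The slide names the MapReduceFusion transformation but its graph is not preserved here. As a rough illustration of the general idea of fusing a map with a following reduction, the sketch below assumes the naïve formulation first materializes every product in a temporary and then sums it, while the fused version accumulates directly into the output. Both functions are hypothetical plain-Python stand-ins, not DaCe code.

```python
def matmul_map_then_reduce(A, B):
    """Unfused: a map fills an n*m*k temporary, then a reduction sums over k."""
    n, k, m = len(A), len(B), len(B[0])
    # Map: one product per (i, j, p) index, stored in a temporary.
    tmp = [[[A[i][p] * B[p][j] for p in range(k)] for j in range(m)]
           for i in range(n)]
    # Reduce: sum the temporary along its last axis.
    return [[sum(tmp[i][j]) for j in range(m)] for i in range(n)]


def matmul_fused(A, B):
    """Fused: products are accumulated into C directly, no temporary."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(matmul_map_then_reduce(A, B))
    print(matmul_fused(A, B))
```

The two versions compute the same result; the fused one avoids writing and re-reading n·m·k intermediate values, which is the data-movement saving the transformation is after.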

Performance for matrix multiplication on x86
[Figure labels: LoopReorder, MapReduceFusion, Naïve, SDFG]
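Loop reordering in general changes the order in which memory is touched without changing the result. The sketch below is a generic illustration in plain Python, assuming row-major storage; which loop order the talk's LoopReorder step actually selects is not stated on the slide, and the function names are illustrative.

```python
def matmul_ijk(A, B):
    """Inner loop over p: walks down a column of B (strided in row-major)."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C


def matmul_ikj(A, B):
    """Inner loop over j: walks along rows of B and C (contiguous)."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        row_c = C[i]
        for p in range(k):
            a = A[i][p]      # reused for the whole inner loop
            row_b = B[p]
            for j in range(m):
                row_c[j] += a * row_b[j]
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[1, 0], [0, 1], [2, 2]]
    print(matmul_ijk(A, B) == matmul_ikj(A, B))
```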

Performance for matrix multiplication on x86
[Figure labels: BlockTiling, LoopReorder, MapReduceFusion, Naïve, SDFG]
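Block tiling, in its textbook form, splits the iteration space into small blocks so that the data touched by one block can stay in cache. The following is a minimal plain-Python sketch of that general technique; the tile size and the function name `matmul_tiled` are assumptions for illustration and do not reflect the parameters used in the talk.

```python
def matmul_tiled(A, B, tile=2):
    """Tiled matrix multiplication: iterate over blocks, then within a block."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    # Outer loops step over tile origins.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Inner loops stay inside one tile; min() handles edge tiles.
                for i in range(i0, min(i0 + tile, n)):
                    for p in range(p0, min(p0 + tile, k)):
                        a = A[i][p]
                        for j in range(j0, min(j0 + tile, m)):
                            C[i][j] += a * B[p][j]
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    print(matmul_tiled(A, identity))
```

The sizes here deliberately do not divide evenly by the tile size, so the edge-tile handling is exercised.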

Performance for matrix multiplication on x86
[Figure labels: RegisterTiling, BlockTiling, LoopReorder, MapReduceFusion, Naïve]

Performance for matrix multiplication on x86
[Figure labels: LocalStorage, RegisterTiling, BlockTiling, LoopReorder, MapReduceFusion, Naïve]

Performance for matrix multiplication on x86
[Figure labels: PromoteTransient, LocalStorage, RegisterTiling, BlockTiling, LoopReorder, MapReduceFusion, Naïve]

Performance for matrix multiplication on x86
[Figure labels: Intel MKL, OpenBLAS, DAPP]
▪ 25% difference
▪ With more tuning: 98.6% of MKL
▪ But do we really care about MMM on x86 CPUs?

Hardware Mapping: Load/Store Architectures
▪ Recursive code generation (C++, CUDA)
  ▪ Control flow: Construct detection and gotos
▪ Parallelism
  ▪ Multi-core CPU: OpenMP, atomics, and threads
  ▪ GPU: CUDA kernels and streams
  ▪ Connected components run concurrently
▪ Memory and interaction with accelerators
  ▪ Array-array edges create intra-/inter-device copies
  ▪ Memory access validation on compilation
▪ Automatic CPU SDFG to GPU transformation
  ▪ Tasklet code immutable

Hardware Mapping: Pipelined Architectures
▪ Module generation with HDL and HLS
  ▪ Integration with Xilinx SDAccel
  ▪ Nested SDFGs become FPGA state machines
▪ Parallelism
  ▪ Exploiting temporal locality: Pipelines
  ▪ Exploiting spatial locality: Vectorization, replication
▪ Replication
  ▪ Enables parametric systolic array generation
▪ Memory access
  ▪ Burst memory access, vectorization
  ▪ Streams for inter-PE communication

Performance (Portability) Evaluation
▪ Three platforms:
  ▪ Intel Xeon E5-2650 v4 CPU (2.20 GHz, no HT)
  ▪ Tesla P100 GPU
  ▪ Xilinx VCU1525 hosting an XCVU9P FPGA
▪ Compilers and frameworks:
  ▪ Compilers: GCC 8.2.0, Clang 6.0, icc 18.0.3
  ▪ GPU and FPGA compilers: CUDA nvcc 9.2, Xilinx SDAccel 2018.2
  ▪ Polyhedral optimizing compilers: Polly 6.0, Pluto 0.11.4, PPCG 0.8
  ▪ Frameworks and optimized libraries: HPX, Halide, Intel MKL, NVIDIA CUBLAS, CUSPARSE, CUTLASS, NVIDIA CUB

Performance Evaluation: Fundamental Kernels (CPU)
▪ Database Query: roughly 50% of a 67,108,864 column
▪ Matrix Multiplication (MM): 2048x2048x2048
▪ Histogram: 8192x8192
▪ Jacobi stencil: 2048x2048 for T=1024
▪ Sparse Matrix-Vector Multiplication (SpMV): 8192x8192 CSR matrix (nnz=33,554,432)
[Plot annotations: 8.12x faster; 98.6% of MKL; 2.5x faster; 82.7% of Halide; 99.9% of MKL]
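The problem sizes above can be sanity-checked with a little arithmetic, and the SpMV kernel itself is short enough to sketch. The code below is a generic CSR sparse matrix-vector product in plain Python, not the benchmark implementation from the talk; the names `spmv_csr`, `MM_FLOPS`, and `SPMV_DENSITY` are illustrative, and the flop count assumes the classical algorithm with one multiply and one add per inner iteration.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR format.

    values  : nonzero entries, row by row
    col_idx : column index of each nonzero
    row_ptr : row r owns nonzeros row_ptr[r] .. row_ptr[r + 1] - 1
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Classical matrix multiplication: 2 * n^3 floating-point operations.
MM_FLOPS = 2 * 2048 ** 3

# Fraction of nonzeros in the 8192x8192 SpMV matrix from the slide.
SPMV_DENSITY = 33_554_432 / (8192 * 8192)

if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 3, 0]]
    print(spmv_csr([1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3], [1.0, 1.0, 1.0]))
    print(MM_FLOPS, SPMV_DENSITY)
```

Under these assumptions the MM benchmark performs about 17.2 billion floating-point operations, and the SpMV matrix has exactly half of its 67,108,864 entries populated.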
