SLIDE 1

Motivations MORSE T1 T2 T3 Conclusion

Making Dataflow Programming Ubiquitous for Scientific Computing

Hatem Ltaief, KAUST Supercomputing Lab
Synchronization-reducing and Communication-reducing Algorithms and Programming Models for Large-scale Simulations
January 9-13, 2012, Providence, RI, USA

  • H. Ltaief

ICERM Workshop 2012 1 / 45

SLIDE 2

Outline

1. Motivations
2. Background: the MORSE project
3. Cholesky-based Matrix Inversion and Generalized Symmetric Eigenvalue Problem
4. Turbulence Simulations
5. Seismic Applications
6. Conclusion

SLIDE 3

Plan

1. Motivations
2. Background: the MORSE project
3. Cholesky-based Matrix Inversion and Generalized Symmetric Eigenvalue Problem
4. Turbulence Simulations
5. Seismic Applications
6. Conclusion

SLIDE 4

DataFlow Programming

• A five-decade-old concept
• A programming paradigm that models a program as a directed graph of the data flowing between operations (cf. Wikipedia)
• Think "how things connect" rather than "how things happen"
• The assembly-line analogy
• Inherently parallel
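As a toy illustration of the idea (every name below is invented for this example), computing d = (a + b) × (a − b) forms a three-node graph: the add and subtract share no edge and could run in parallel, while the multiply fires only once both of its inputs are ready.

```c
#include <assert.h>

/* Toy dataflow sketch: d = (a + b) * (a - b) as a three-node graph.
 * Each "slot" carries a value plus a ready flag; a node may fire as
 * soon as all of its input slots are ready. */
typedef struct { double value; int ready; } Slot;

static void fire_add(Slot *out, double x, double y) { out->value = x + y; out->ready = 1; }
static void fire_sub(Slot *out, double x, double y) { out->value = x - y; out->ready = 1; }

double eval_graph(double a, double b)
{
    Slot sum = {0.0, 0}, dif = {0.0, 0}, prod = {0.0, 0};

    /* No edge between these two nodes: a dataflow runtime is free to
     * execute them in parallel. */
    fire_add(&sum, a, b);
    fire_sub(&dif, a, b);

    /* The multiply node fires only when both inputs are ready. */
    assert(sum.ready && dif.ready);
    prod.value = sum.value * dif.value;
    prod.ready = 1;
    return prod.value;
}
```

A real runtime such as QUARK or StarPU discovers this graph automatically from the declared INPUT/OUTPUT directions of task arguments rather than from hand-written ready flags.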

SLIDE 5

What did they say?

Katherine Yelick:
– High-level abstraction optimizations: e.g., in the context of linear algebra, leverage BLAS optimizations across the whole numerical algorithm.
– Load balancing is paramount, especially in sparse linear algebra computations.
– Locality is critical when computational intensity is low and memory hierarchies are deep.
Victor Eijkhout:
– Integrative Model for Parallelism design: describe parallel algorithms based on explicit partitioning of input and output data.
– MPI instruction commands are encapsulated into derivable objects. No need for direct MPI user coding ⇒ productivity!
Jonathan Cohen:
– Expose as much fine-grain parallelism as possible to exploit the underlying hardware components.

SLIDE 6

Plan

1. Motivations
2. Background: the MORSE project
3. Cholesky-based Matrix Inversion and Generalized Symmetric Eigenvalue Problem
4. Turbulence Simulations
5. Seismic Applications
6. Conclusion

SLIDE 7

Matrices Over Runtime Systems at Exascale

Mission statement: "Design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale hybrid systems."
• Runtime challenges due to the ever-growing hardware complexity.
• Algorithmic challenges to exploit the hardware capabilities to the fullest.

SLIDE 8

QUARK

• From sequential nested-loop code to parallel execution
• Task-based parallelism
• Out-of-order dynamic scheduling
• Scheduling over a window of tasks
• Data locality and cache reuse
• High user productivity
• Shipped within PLASMA, but a standalone project

SLIDE 9

QUARK

SLIDE 10

QUARK

void QUARK_core_dpotrf( Quark *quark, char uplo, int n,
                        double *A, int lda, int *info )
{
    QUARK_Insert_Task( quark, TASK_core_dpotrf, 0x00,
        sizeof(char),        &uplo, VALUE,
        sizeof(int),         &n,    VALUE,
        sizeof(double)*n*n,  A,     INOUT | LOCALITY,
        sizeof(int),         &lda,  VALUE,
        sizeof(int),         info,  OUTPUT,
        0 );
}

void TASK_core_dpotrf( Quark *quark )
{
    char uplo; int n; double *A; int lda; int *info;
    quark_unpack_args_5( quark, uplo, n, A, lda, info );
    dpotrf_( &uplo, &n, A, &lda, info );
}

SLIDE 11

StarPU

A runtime which provides:
⇒ Task scheduling
⇒ Memory management
Supports:
⇒ SMP/multicore processors (x86, PPC, ...)
⇒ NVIDIA GPUs (e.g., heterogeneous multi-GPU)
⇒ OpenCL devices
⇒ Cell processors (experimental)

SLIDE 12

StarPU

starpu_Insert_Task( &cl_dpotrf,
    VALUE,    &uplo, sizeof(char),
    VALUE,    &n,    sizeof(int),
    INOUT,    Ahandle(k, k),
    VALUE,    &lda,  sizeof(int),
    OUTPUT,   &info, sizeof(int),
    CALLBACK, profiling ? cl_dpotrf_callback : NULL, NULL,
    0 );

SLIDE 13

SMPSs

• Compiler technology
• Task parameters and directionality defined by the user through pragmas
• Translates C code with pragma annotations to standard C99 code
• Embedded locality optimizations
• Data renaming feature to reduce dependencies, leaving only the true dependencies

SLIDE 14

SMPSs

#pragma css task input(A[NB][NB]) inout(T[NB][NB])
void dsyrk(double *A, double *T);
#pragma css task inout(T[NB][NB])
void dpotrf(double *T);
#pragma css task input(A[NB][NB], B[NB][NB]) inout(C[NB][NB])
void dgemm(double *A, double *B, double *C);
#pragma css task input(T[NB][NB]) inout(B[NB][NB])
void dtrsm(double *T, double *B);

#pragma css start
for (k = 0; k < TILES; k++) {
    for (n = 0; n < k; n++)
        dsyrk(A[k][n], A[k][k]);
    dpotrf(A[k][k]);
    for (m = k+1; m < TILES; m++) {
        for (n = 0; n < k; n++)
            dgemm(A[k][n], A[m][n], A[m][k]);
        dtrsm(A[k][k], A[m][k]);
    }
}
#pragma css finish

SLIDE 15

Standardization???

Efforts to define an API standard for these runtime systems. A difficult task... but worth the time and sacrifice when it comes to making end users' lives easier.

SLIDE 16

DAGuE

• Compiler technology
• Converts sequential code to a DAG representation
• Parameterized DAG scheduler for distributed-memory systems
• Engine of the DPLASMA library

SLIDE 17

DAGuE

SLIDE 18

DAGuE

SLIDE 19

Plan

1. Motivations
2. Background: the MORSE project
3. Cholesky-based Matrix Inversion and Generalized Symmetric Eigenvalue Problem
4. Turbulence Simulations
5. Seismic Applications
6. Conclusion

SLIDE 20

Blocked Algorithms

Figure: Panel-update sequences for the LAPACK factorizations. (a) First step: panel and update. (b) Second step: finalized part, panel, and update. (c) Third step: finalized part and last panel.

SLIDE 21

Blocked Algorithms

Principles:
• Panel-update sequence
• Transformations are blocked/accumulated within the panel (Level-2 BLAS)
• Transformations applied at once on the trailing submatrix (Level-3 BLAS)
• Parallelism hidden inside the BLAS
• Fork-join model

SLIDE 22

Tile Data Layout Format

LAPACK: column-major format
PLASMA: tile format

SLIDE 23

Tile Algorithms

• Parallelism is brought to the fore
• May require the redesign of linear algebra algorithms
• Tile data layout translation
• Removes unnecessary synchronization points between panel-update sequences
• DAG execution where nodes represent tasks and edges define the dependencies between them
• Feeds the dynamic runtime system

SLIDE 24

A^{-1}, Seriously???

YES! A critical component of the variance-covariance matrix computation in statistics (cf. Higham, Accuracy and Stability of Numerical Algorithms, Second Edition, SIAM, 2002).

A is a dense symmetric positive definite matrix. Three steps:

1. Cholesky factorization (DPOTRF)
2. Inverting the Cholesky factor (DTRTRI)
3. Computing the product of the inverted Cholesky factor with its transpose (DLAUUM)

StarPU runtime used here

SLIDE 25

A^{-1}, Hybrid Architecture Targeted

⇒ PCI interconnect 16X, 64 Gb/s: a very thin pipe!
⇒ Fermi C2050: 448 CUDA cores, 515 Gflop/s

SLIDE 26

A^{-1}, Preliminary Results

Figure: Performance (Gflop/s) vs. matrix size (up to 2.5 × 10^4) for the tile hybrid CPU-GPU code, MAGMA, PLASMA, and LAPACK.

  • H. Ibeid, D. Kaushik, D. Keyes and H. Ltaief, Student Minisymposium, HIPC’11, India

SLIDE 27

GSEVP: What do we solve?

Ax = λBx

A, B ∈ R^(n×n), x ∈ R^n, λ ∈ R, or A, B ∈ C^(n×n), x ∈ C^n, λ ∈ R
A = A^T or A = A^H: A is symmetric or Hermitian
x^H B x > 0: B is symmetric positive definite

SLIDE 28

GSEVP: Why do we solve it?

To obtain energy eigenstates in:
• Chemical cluster theory
• Electronic structure of semiconductors
• Ab initio energy calculations of solids

SLIDE 29

GSEVP: How to solve it?

Ax = λBx

    Operation                    Explanation                         LAPACK routine
 1  B = L × L^T                  Cholesky factorization              POTRF
 2  C = L^{-1} × A × L^{-T}      application of triangular factors   SYGST or HEGST
 3  T = Q^T × C × Q              tridiagonal reduction               SYEVD or HEEVD
 4  T x = λ x                    QR iteration                        STERF
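The four stages compose as follows: once the tridiagonal problem T y = λy is solved, the eigenvalues carry over unchanged, and the eigenvectors of the original pencil are recovered by back-transformation.

```latex
\begin{aligned}
B &= L L^{T}, \qquad C = L^{-1} A L^{-T}, \qquad T = Q^{T} C Q, \\
T y &= \lambda y
\;\Longrightarrow\; C\,(Q y) = \lambda\,(Q y)
\;\Longrightarrow\; A x = \lambda B x
\quad \text{with } x = L^{-T} Q y .
\end{aligned}
```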

SLIDE 30

All computational stages: separately

Figure: DAG of each computational stage, shown separately (each label gives DAG level : number of tasks).
SLIDE 31

All computational stages: combined

Figure: DAG of all computational stages combined (each label gives DAG level : number of tasks).

Dependencies are tracked inside PLASMA by QUARK.

SLIDE 32

Combining stages: matrix view

SLIDE 33

Results on 4-socket AMD Magny Cours (48 cores)

Figure: Time in seconds vs. matrix size (2,000-16,000) for PLASMA xSYGST + two-stage TRD, MKL xSYGST + MKL SBR, MKL xSYGST + MKL TRD, MKL xSYGST + Netlib SBR, and LAPACK xSYGST + LAPACK TRD.

  • H. Ltaief, P. Luszczek, A. Haidar and J. Dongarra, ParCo’11, Belgium

SLIDE 34

Plan

1. Motivations
2. Background: the MORSE project
3. Cholesky-based Matrix Inversion and Generalized Symmetric Eigenvalue Problem
4. Turbulence Simulations
5. Seismic Applications
6. Conclusion

SLIDE 35

Turbulence Simulations w/ R. Yokota

SLIDE 36

Data Driven Fast Multipole Method

SLIDE 37

Dual Tree Traversal

SLIDE 38

Adjustable Granularity

SLIDE 39

Strong Scaling w/ QUARK

  • H. Ltaief and R. Yokota, to be submitted to EuroPar’12
SLIDE 40

What is next?

• Heterogeneous architectures with hardware accelerators (StarPU)
• Distributed-memory systems (DAGuE)
• Implementation of the reduction operation

SLIDE 41

Plan

1. Motivations
2. Background: the MORSE project
3. Cholesky-based Matrix Inversion and Generalized Symmetric Eigenvalue Problem
4. Turbulence Simulations
5. Seismic Applications
6. Conclusion

SLIDE 42

Stencil kernels

• Explicit time integration scheme with domain decomposition
• Communication-avoiding by reducing the frequency of halo exchanges
• Synchronization-reducing by instruction reordering
• Two phases: calculate solutions inside a cone, then outside the cone
• The cone kernel becomes the fine-grained task to schedule
• No numerical instabilities added to the original scheme
• May have load imbalance due to the high compute intensity of the PML
• Similar (to some extent) to Demmel's approach with the matrix powers kernel
• Work in progress with A. Abdelfettah, PhD student / Saudi Aramco / TOTAL

SLIDE 43

Plan

1. Motivations
2. Background: the MORSE project
3. Cholesky-based Matrix Inversion and Generalized Symmetric Eigenvalue Problem
4. Turbulence Simulations
5. Seismic Applications
6. Conclusion

SLIDE 44

Conclusion

• Dataflow programming could play a major role in meeting exascale challenges
• Need efficient runtimes designed with high productivity in mind
• Need new, flexible algorithms
• Potentially works for dense as well as sparse computations
• Need to determine an appropriate task granularity, though, to hide the scheduling overhead

SLIDE 45

Thank you!
