DAGuE
George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Thomas Herault, Jack Dongarra
2012 Scheduling Workshop, Pittsburgh, PA
DAGuE [dag] (like in Prague [prag]), not like ragout [ragu]
(the Prague Astronomical Clock was first installed in 1410, making it the third-oldest astronomical clock in the world and the oldest one still in operation)
Hardware becomes more heterogeneous (accelerators/coprocessors)
MPI, hybrid
Hard to maintain and to get good performance
When is it time to redesign software?
The gap between current programming environments, the requirements of emerging applications, and the challenges of future parallel architectures
Decouple “System issues” from Algorithm
(Micro) Tasks
System Language
portability, performance, scheduling heuristics, heterogeneity management, data movement, …
Instead of centralized scheduling or an entire-DAG representation: dynamic and independent discovery of the relevant portions during execution
Do not express communications explicitly; instead make them implicit and schedule to maximize overlap and load balance
DSBP: [22] F. G. Gustavson, L. Karlsson, and B. Kågström. Distributed SBP Cholesky factorization algorithms with near-optimal scheduling. DOI: 10.1145/1499096.1499100.
81 dual Intel Xeon L5420 @ 2.5 GHz nodes (2x4 cores/node) -> 648 cores; Myrinet MX 10 Gb/s; Intel MKL; ScaLAPACK
Figures: DPOTRF, DGEQRF, and DGETRF problem-scaling performance (TFlop/s) vs. matrix size (N = 13k-130k) on 648 cores (Myrinet 10G); curves: Theoretical peak, Practical peak (GEMM), DAGuE, DSBP (DPOTRF only), HPL (DGETRF only), ScaLAPACK.
Extracts more parallelism
Change of the data layout (static task scheduling)
Competes with hand-tuned, hardware-aware scheduling
Cores, Memory Hierarchies, Coherence, Data Movement, Accelerators; Scheduling, Data Movement, Symbolic Representation
Software stack (diagram): Dense LA, Sparse LA, Tools, … built on the Domain Specific Extensions, over the Parallel Runtime, over the Hardware
as systems scale up
DAGuE Toolchain (diagram): Serial Code -> DAGuE compiler -> Dataflow representation -> Dataflow compiler -> Parallel task stubs, which the System compiler builds together with the application code & codelets and additional libraries (MPI, CUDA, pthreads, PLASMA, MAGMA) for execution by the Runtime on the Supercomputer; the Programmer also provides the Domain Specific Extensions and the data distribution.
DAGuE Compiler
Kernels: GEQRT, TSQRT, UNMQR, TSMQR

FOR k = 0 .. SIZE - 1
  A[k][k], T[k][k] <- GEQRT( A[k][k] )
  FOR m = k+1 .. SIZE - 1
    A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )
  FOR n = k+1 .. SIZE - 1
    A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )
    FOR m = k+1 .. SIZE - 1
      A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
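Reading the in/out arguments off this loop nest for the smallest non-trivial case, SIZE = 2, gives the DAG directly (five tasks):
  GEQRT(0)       factors A[0][0]
  TSQRT(0,1)     needs A[0][0]|Up from GEQRT(0); updates A[0][0], A[1][0]
  UNMQR(0,1)     needs A[0][0]|Low and T[0][0] from GEQRT(0); updates A[0][1]
  TSMQR(0,1,1)   needs A[1][0], T[1][0] from TSQRT(0,1) and A[0][1] from UNMQR(0,1); updates A[0][1], A[1][1]
  GEQRT(1)       needs A[1][1] from TSMQR(0,1,1)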
for (k = 0; k < A.mt; k++) {
    Insert_Task( zgeqrt,
                 A[k][k], INOUT,
                 T[k][k], OUTPUT );
    for (m = k+1; m < A.mt; m++) {
        Insert_Task( ztsqrt,
                     A[k][k], INOUT | REGION_D | REGION_U,
                     A[m][k], INOUT | LOCALITY,
                     T[m][k], OUTPUT );
    }
    for (n = k+1; n < A.nt; n++) {
        Insert_Task( zunmqr,
                     A[k][k], INPUT | REGION_L,
                     T[k][k], INPUT,
                     A[k][n], INOUT );
        for (m = k+1; m < A.mt; m++) {
            Insert_Task( ztsmqr,
                         A[k][n], INOUT,
                         A[m][n], INOUT | LOCALITY,
                         A[m][k], INPUT,
                         T[m][k], INPUT );
        }
    }
}
REGION_D, …
QR
Omega Test: derives symbolic expressions for the tasks at the endpoints of each data flow and for the conditions for that data flow to exist
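For instance, the lower triangle of A[k][k] produced by GEQRT(k) flows to UNMQR(k, n) for every n in k+1 .. SIZE-1, and this flow exists only when k <= SIZE-2; both the successor set and the guard are kept as symbolic expressions rather than enumerated.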
Diagram: the resulting tile QR dataflow, with incoming/outgoing data from memory (MEM), panels k = 0 .. SIZE-1, inner indices m, n = k+1, and the LOWER/UPPER parts of the diagonal tile.
GEQRT TSQRT UNMQR TSMQR
Control flow is eliminated, therefore maximum parallelism is possible
GEQRT(k)
  /* Execution space */
  k = 0 .. ( (MT < NT) ? MT-1 : NT-1 )
  /* Locality */
  : A(k, k)
  RW    A <- (k == 0) ? A(k, k) : A1 TSMQR(k-1, k, k)
  WRITE T <- T(k, k)
  /* Priority */
  ; (NT-k)*(NT-k)*(NT-k)
  BODY
    zgeqrt( A, T )
  END
JDF (Job Data Flow)
(Suppose the operator is associative and commutative)
for (s = 1; s < N/2; s = 2*s)
    for (i = 0; i < N-s; i += s)
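A minimal, self-contained C sketch of the binary-tree reduction these loops are getting at (the loop body is not on the slide; combine() and the exact bounds here are illustrative assumptions):

  #include <stdio.h>

  /* combine() stands in for any associative and commutative operator */
  static double combine(double a, double b) { return a + b; }

  int main(void)
  {
      enum { N = 8 };
      double V[N] = { 1, 2, 3, 4, 5, 6, 7, 8 };

      /* at step s, element i absorbs element i + s; after log2(N) steps
         V[0] holds the reduction of all N values */
      for (int s = 1; s < N; s *= 2)
          for (int i = 0; i + s < N; i += 2 * s)
              V[i] = combine(V[i], V[i + s]);

      printf("reduction = %g\n", V[0]);   /* prints 36 */
      return 0;
  }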
reduce(l, p)
  l = 1 .. depth
  p = 0 .. (MT / (1<<l))
  : V(p * (1<<l))
  RW   A <- (1 == l) ? V(2*p)   : A reduce( l-1, 2*p )
            : B reduce( l+1, p/2 )
  READ B <- (1 == l) ? V(2*p+1) : A reduce( l-1, p*2+1 )
  BODY
  END
Solution: hand-writing of the data dependencies using the intermediate dataflow representation (diagram: reduction tree, node reduce(2, 1) with inputs A and B)
Data Flow Compiler
The compiled representation is just a set of functions to obtain the successors or predecessors of a task and to compute the set of initial tasks.
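Purely as an illustration (this is not DAGuE's generated code), such a successor function for the reduce(l, p) task class of the previous slide could look like:

  typedef struct { int l, p; } reduce_task_t;

  /* Successor of reduce(l, p) in a binary reduction tree of the given depth:
     its result feeds reduce(l+1, p/2); the root (l == depth) has none.
     Returns the number of successors written into succ (0 or 1). */
  static int reduce_successors(reduce_task_t t, int depth, reduce_task_t *succ)
  {
      if (t.l >= depth)
          return 0;
      succ[0].l = t.l + 1;
      succ[0].p = t.p / 2;
      return 1;
  }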
dague_object_t *reduce_create( dague_ddesc_t *V, int MT, int depth );
void            reduce_destroy( dague_object_t *o );
reduce(l, p)
  l = 1 .. depth
  p = 0 .. (MT / (1<<l))
  : V(p * (1<<l))
across all nodes
DSEs
dague_ddesc_t *V;
V = dague_onedim_bc( PTR, DAGUE_FLOAT, worldsize, M );
runtime
descriptors
be issued
int main(…)
{
    MPI_Init(…);
    dague_init(cores, worldsize, …);

    dague_ddesc_t  *V = …;
    dague_object_t *r = reduce_create(V, …);

    dague_enqueue(r);
    dague_wait(r);

    reduce_destroy(r);
    dague_fini();
    MPI_Finalize();
}
Algorithm is now expressed as a Parameterized DAG
ahead of time
heterogeneous resources
message passing
heterogeneity
DAG is considered
remote completion notifications
memory hierarchies
Diagram: tile Cholesky DAG (PO = POTRF, TR = TRSM, SY = SYRK, GE = GEMM) with tasks mapped onto Node0..Node3 by the data distribution.
User-defined data distribution function (a sketch follows below)
this ordering due to locality
DAG
Compute the successors of a task in O(d) (practically O(1))
Keep only the local & ready tasks
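A purely illustrative C sketch (not the DAGuE API) of such a user-defined distribution, here a 2-D block-cyclic mapping of tile (m, n) onto a P x Q process grid; the runtime queries it to find where data lives, hence where the task with affinity to that tile runs, and can then discard non-local tasks:

  /* rank owning tile (m, n) under a 2-D block-cyclic distribution (assumption) */
  static int rank_of(int m, int n, int P, int Q)
  {
      return (m % P) * Q + (n % Q);
  }
  /* e.g. with P = Q = 2, tile (3, 2) maps to rank (3 % 2) * 2 + (2 % 2) = 2 */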
Comparison, DAG Unrolling vs. PTG: task window vs. no window, communication-pattern detection, memory footprint (O(N) / O(w)), expressivity.
Performance; Ongoing Work
Figure: DGEQRF weak-scaling performance (TFlop/s) vs. number of cores (108-3072) on a Cray XT5; curves: Practical peak (GEMM), DAGuE, libSCI ScaLAPACK.
level
and from the GPU/Co-processors

/* POTRF Lower case */
GEMM(k, m, n)
  // Execution space
  k = 0 .. MT-3
  m = k+2 .. MT-1
  n = k+1 .. m-1
  // Parallel partitioning
  : A(m, n)
  // Parameters
  READ A <- C TRSM(m, k)
  READ B <- C TRSM(n, k)
  RW   C <- (k == 0) ? A(m, n) : C GEMM(k-1, m, n)
  BODY [CPU, CUDA, MIC, *]
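For example, the execution space above can be counted in closed form: summing over k, m, n gives MT(MT-1)(MT-2)/6 GEMM tasks (exactly one per strictly increasing triple of tile indices), a quantity the runtime can evaluate without ever unrolling the DAG.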
Figures: SPOTRF problem-scaling performance (TFlop/s) vs. matrix size (2k-12k) and weak-scaling performance vs. nodes;matrix size (1;54k to 12;176k) on 12 GPU nodes (Infiniband 20G); curves: Practical peak (GEMM), C2070 + 8 cores per node, 8 cores per node.
Scalability
Figure: performance (GFlop/s) vs. matrix size (N = 10k-50k) with 1 to 4 C1060 GPUs (C1060x1 .. C1060x4).
network, type and number of cores
Increasing the task duration (or the tile size) decreases parallelism
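As a rough illustration (tile Cholesky assumed): an N x N matrix cut into NB x NB tiles yields on the order of (N/NB)^3 / 6 tasks, so doubling NB divides the number of tasks, and hence the available parallelism, by about 8, while each task does roughly 8 times more work.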
auto-tune per system
Figure: efficiency (%) vs. block size (NB = 120-1000) on 1 node (8 cores), 4 nodes (32 cores), and 81 nodes (648 cores).
Not enough parallelism
Hermitian Band Diagonal; 16x16 tiles
high performance tuned scheduling
Total energy consumption
QR factorization (256 cores)
Work in progress with Hatem Ltaief
Figures: power (Watts) vs. time (seconds), broken down into System, CPU, Memory, and Network: (a) ScaLAPACK; (b) DPLASMA.
# Cores   Library     Cholesky    QR
128       ScaLAPACK     192000    672000
          DPLASMA       128000    540000
256       ScaLAPACK     240000    816000
          DPLASMA        96000    540000
512       ScaLAPACK     325000   1000000
          DPLASMA       125000    576000
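Reading the 512-core row: DPLASMA uses about 2.6x less energy than ScaLAPACK for Cholesky (125000 vs. 325000) and about 1.7x less for QR (576000 vs. 1000000).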
SystemG: Virginia Tech energy-monitored cluster (IB 40G, Intel, 8 cores/node)
the dague_object
distributed runs
Synchronizations: transform them into data dependencies
Related Work
Comparison (DAGuE / SMPSs / StarPU / Charm++ / FLAME / QUARK / Tblas / PTG):
  Scheduling:   Distr. (1/core) / Repl. (1/node) / Repl. (1/node) / Distr. (Actors) / w/ SuperMatrix / Repl. (1/node) / Centr. / Centr.
  Language:     Internal (Affine Loops) / Seq. w/ add_task / Seq. w/ add_task / Msg-Driven Objects / Internal (LA DSL) / Seq. w/ add_task / Seq. w/ add_task / Internal
  Accelerator:  GPU / GPU / GPU / GPU / GPU / - / - / -
  Availability: Public / Public / Public / Public / Public / Public / Not Avail. / Not Avail.
Early stage: ParalleX. Non-academic: Swarm, MadLINQ, CnC.
All projects support Distributed and Shared Memory (QUARK with QUARKd; FLAME with Elemental)
New languages should not strive to transform the easy into the trivial, but to transform the impracticable into the achievable.
The end