S7728 - MAGMA Tensors and Batched Computing for Accelerating Applications on GPUs


SLIDE 1

S7728 - MAGMA Tensors and Batched Computing for Accelerating Applications on GPUs

Abstract: Learn how to accelerate your machine learning, data mining, and other algorithms through fast matrix and tensor operations on GPUs. There's an increasing demand for accelerated independent computations on tensors and many small matrices. Although common, these workloads cannot be efficiently executed using standard linear algebra libraries. To fill the gap, we developed the MAGMA Batched library, which achieves dramatically better performance by executing the small operations in "batches." We'll describe a methodology for developing high-performance BLAS, SVD, factorizations, and solvers for both large- and small-batched matrices. We'll also present the current state-of-the-art implementations and community efforts to standardize an API that extends BLAS for batched computations.

GTC 2017, San Jose, CA, May 8-11, 2017

Stan Tomov - Research Director, UTK; Azzam Haidar - Research Scientist, UTK

SLIDE 2

MAGMA Tensors and Batched Computing for Accelerating Applications on GPUs

Stan Tomov and Azzam Haidar
Innovative Computing Laboratory
Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville

In collaboration with: LLNL, Livermore, CA, USA; University of Manchester, Manchester, UK; University of Paris-Sud, France

GTC 2017, San Jose, CA, May 8-11, 2017

SLIDE 3

Outline

  • Introduction
  • MAGMA library
    – Numerical Linear Algebra (NLA) for large problems
    – NLA for applications that need small problems

  • MAGMA Tensor contraction computations
  • MAGMA Batched Computing
  • MAGMA-DNN NLA backend for DNN
  • Algorithms and optimization techniques
  • Conclusions
SLIDE 4

A wide range of applications depends on Numerical Linear Algebra (NLA) libraries:

  • Airplane wing design,
  • Quantum chemistry,
  • Geophysical flows,
  • Stealth aircraft,
  • Diffusion of solid bodies in a liquid,
  • Adaptive mesh refinement,
  • Computational materials research,
  • Deep learning in neural networks,
  • Stochastic simulation,
  • Massively parallel data mining
SLIDE 5

Numerical Linear Algebra (NLA) in Applications

NLA is the backend that accelerates a wide variety of science and engineering applications:

  • Linear systems: solve Ax = b
    – Computational electromagnetics, material science, applications using boundary integral equations, airflow past wings, fluid flow around ships and other offshore constructions, and many more
  • Least squares: find x to minimize || Ax – b ||
    – Convex optimization, computational statistics (e.g., linear least squares or ordinary least squares), econometrics, control theory, signal processing, curve fitting, and many more
  • Eigenproblems: solve Ax = λx
    – Computational chemistry, quantum mechanics, material science, face recognition, PCA, data mining, marketing, Google PageRank, spectral clustering, vibrational analysis, compression, and many more
  • Singular Value Decomposition (SVD): A = U Σ V*
    – Information retrieval, web search, signal processing, big data analytics, low-rank matrix approximation, total least squares minimization, pseudo-inverse, and many more
  • Many variations depending on the structure of A
    – A can be symmetric, positive definite, tridiagonal, Hessenberg, banded, sparse with dense blocks, etc.
  • LA is crucial to the development of sparse solvers

SLIDE 6

Numerical Linear Algebra (NLA) in Applications

NLA is the backend that accelerates a wide variety of science and engineering applications:

  • For big NLA problems (BLAS, convolutions, SVD, linear system solvers, etc.), contemporary libraries (BLAS, LAPACK, ScaLAPACK, and MAGMA for GPUs) target large matrices.

SLIDE 7

Numerical Linear Algebra (NLA) in Applications

NLA is the backend that accelerates a wide variety of science and engineering applications:

  • For big NLA problems (BLAS, convolutions, SVD, linear system solvers, etc.), contemporary libraries (BLAS, LAPACK, ScaLAPACK, and MAGMA for GPUs) target large matrices.
  • Numerous important applications need NLA for small problems, where the data can be multidimensional / relational:
    – Machine learning / DNNs
    – Data mining / analytics
    – High-order FEM
    – Graph analysis
    – Neuroscience
    – Astrophysics
    – Quantum chemistry
    – Signal processing, and more

SLIDE 8

Numerical Linear Algebra (NLA) in Applications

NLA is the backend that accelerates a wide variety of science and engineering applications:

  • For big NLA problems (BLAS, convolutions, SVD, linear system solvers, etc.), contemporary libraries (BLAS, LAPACK, ScaLAPACK, and MAGMA for GPUs) target large matrices.
  • MAGMA is adding application backends for small problems: small matrices / tensors, supporting fixed-size batches, variable-size batches, dynamic batches, and tensors.
  • Target applications: machine learning / DNNs, data mining / analytics, high-order FEM, graph analysis, neuroscience, astrophysics, quantum chemistry, signal processing, and more.

SLIDE 9

Key Features of MAGMA 2.2

TASK-BASED ALGORITHMS

MAGMA uses task-based algorithms where the computation is split into tasks of varying granularity and their execution is scheduled over the hardware components. Scheduling can be static or dynamic. In either case, small non-parallelizable tasks, often on the critical path, are scheduled on the CPU, and larger, more parallelizable ones, often Level 3 BLAS, are scheduled on the GPUs.

PERFORMANCE & ENERGY EFFICIENCY

[Figure: MAGMA LU factorization in double precision arithmetic (BLAS tasking + hybrid scheduling): performance (GFLOP/s) for matrix sizes 2k-36k and energy efficiency (GFLOPs/Watt) on a CPU (Intel Xeon E5-2650 v3 "Haswell", 2x10 cores @ 2.30 GHz), one and two NVIDIA K40 GPUs (15 MP x 192 @ 0.88 GHz), and an NVIDIA P100 GPU (56 MP x 64 @ 1.19 GHz).]

SLIDE 10

MAGMA is designed to use Level 3 BLAS as much as possible.

[Figure: BLAS levels on an NVIDIA P100 (1.19 GHz, theoretical double precision peak of 4700 Gflop/s, CUDA 8.0), matrix size N (vector size NxN): dgemm (Level 3 BLAS, C = C + A*B) reaches ~4503 Gflop/s, dgemv (Level 2 BLAS, y = y + A*x) ~145 Gflop/s, and daxpy (Level 1 BLAS, y = αx + y) ~52 Gflop/s, a ~31x gap between Level 3 and Level 2.]
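The gap follows from arithmetic intensity; a back-of-the-envelope check (my numbers, not from the slides), counting 8 bytes per double and the minimum data movement of each routine:

\[
\text{daxpy: } \frac{2n \ \text{flops}}{24n \ \text{bytes}} = \frac{1}{12}, \qquad
\text{dgemv: } \frac{2n^2}{8(n^2 + 2n)} \approx \frac{1}{4}, \qquad
\text{dgemm: } \frac{2n^3}{32n^2} = \frac{n}{16} \ \text{flops/byte}.
\]

At the P100's ~720 GB/s memory bandwidth, the memory-bound ceilings are roughly 60 Gflop/s for daxpy and 180 Gflop/s for dgemv, consistent with the measured 52 and 145 Gflop/s, while dgemm's intensity grows with n, letting it approach the 4700 Gflop/s compute peak.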

SLIDE 11

MAGMA Algorithms (influenced by hardware trends)

Hybrid (using CPU + GPUs) and/vs. GPU-only:

[Figure: MAGMA LU factorization in double precision arithmetic, comparing magma hybrid, magma native, and magma native (opt), on a CPU (Intel Xeon E5-2650 v3 "Haswell", 2x10 cores @ 2.30 GHz), an NVIDIA K40 GPU (15 MP x 192 @ 0.88 GHz), and an NVIDIA P100 GPU (56 MP x 64 @ 1.19 GHz).]

SLIDE 12

MAGMA Algorithms (influenced by hardware trends): mixed-precision iterative refinement

Solving general dense linear systems using mixed precision iterative refinement.

[Figure: Performance (Gflop/s) vs. matrix size for CPOSV (single complex), ZCPOSV (mixed precision), and ZPOSV (double complex) on a GPU TITAN X (3,072 CUDA cores @ 1.076 GHz, Maxwell; Z/C GEMM peaks ~ 190 / 5,600 GFlop/s) and a CPU Intel Xeon X5660 @ 2.80 GHz (2 x 6 cores). The mixed-precision solver obtains double-precision accuracy at up to 26x the speed of the double-complex solve.]
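The idea behind ZCPOSV-style solvers: do the O(n^3) factorization in fast low precision, then recover double-precision accuracy with cheap O(n^2) residual corrections. A minimal self-contained sketch of the refinement loop (naive unblocked kernels for illustration, not MAGMA's implementation):

// Mixed-precision iterative refinement for an SPD system A x = b:
// factor once in single precision, iterate corrections in double.
#include <cmath>
#include <cstdio>
#include <vector>

// Unblocked Cholesky factorization (lower triangular), single precision.
static bool cholesky(std::vector<float>& L, int n) {
    for (int j = 0; j < n; ++j) {
        for (int k = 0; k < j; ++k)
            for (int i = j; i < n; ++i)
                L[i + j*n] -= L[i + k*n] * L[j + k*n];
        if (L[j + j*n] <= 0.0f) return false;       // not SPD (in float)
        float d = std::sqrt(L[j + j*n]);
        for (int i = j; i < n; ++i) L[i + j*n] /= d;
    }
    return true;
}

// Solve L L^T x = b in single precision (x holds b on entry).
static void solve(const std::vector<float>& L, std::vector<float>& x, int n) {
    for (int i = 0; i < n; ++i) {                   // forward: L y = b
        for (int j = 0; j < i; ++j) x[i] -= L[i + j*n] * x[j];
        x[i] /= L[i + i*n];
    }
    for (int i = n - 1; i >= 0; --i) {              // backward: L^T x = y
        for (int j = i + 1; j < n; ++j) x[i] -= L[j + i*n] * x[j];
        x[i] /= L[i + i*n];
    }
}

int main() {
    const int n = 4;
    // SPD test matrix A (column-major) and right-hand side b.
    std::vector<double> A = { 4,1,1,0, 1,5,2,1, 1,2,6,1, 0,1,1,3 };
    std::vector<double> b = { 1, 2, 3, 4 }, x(n, 0.0);

    std::vector<float> Lf(A.begin(), A.end());      // O(n^3) work in fast precision
    if (!cholesky(Lf, n)) return 1;

    for (int iter = 0; iter < 30; ++iter) {
        std::vector<double> r = b;                  // r = b - A x, in double
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i)
                r[i] -= A[i + j*n] * x[j];
        double rnorm = 0.0;
        for (double v : r) rnorm = std::max(rnorm, std::fabs(v));
        printf("iter %d: ||r||_inf = %.2e\n", iter, rnorm);
        if (rnorm < 1e-12) break;
        std::vector<float> c(r.begin(), r.end());   // correction solve: O(n^2), single
        solve(Lf, c, n);
        for (int i = 0; i < n; ++i) x[i] += (double)c[i];
    }
}

Each iteration shrinks the error by roughly the single-precision unit roundoff times the condition number of A, so for reasonably conditioned systems a handful of iterations reaches double-precision residuals while nearly all the flops run at single-precision speed.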

SLIDE 13

Backend for DNN and Data Analytics

Support for various batched and/or tensor contraction routines, e.g., for Convolutional Neural Networks (CNNs) used in computer vision. The key computation is the convolution of filters F_n (feature detectors) with the input image data D: for every filter F_n and every channel, the computation of every pixel value O_{n,k} is the tensor contraction

\[
O_{n,k} = \sum_i F_{n,i}\, D_{k,i}.
\]

[Figure: CNN pipeline: data D -> convolution (filters F_1 ... F_n, outputs O_1 ... O_n) -> pooling -> convolution -> pooling -> fully connected -> output predictions (e.g., chicken 0.4, boat 0.3, person 0.1, dog 0.01).]

  • Plenty of parallelism; small operations that must be batched
  • With a data "reshape" the computation can be transformed into a batched GEMM (for efficiency; among other approaches); see the identity below
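In matrix form this contraction is a single small GEMM (my notation): treating F as an N_f x N_i matrix and D as an N_k x N_i matrix,

\[
O_{n,k} = \sum_i F_{n,i} D_{k,i} \quad\Longleftrightarrow\quad O = F\, D^{T},
\]

so after the data reshape each image tile / channel block contributes one small GEMM, and the whole layer becomes one batched GEMM over thousands of such products.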

SLIDE 14

Tensor contractions for high-order FEM

Reference:

  • A. Abdelfattah, M. Baboulin, V. Dobrev, J. Dongarra, C. Earl, J. Falcou, A. Haidar, I. Karlin, Tz. Kolev, I. Masliah, S. Tomov, High-Performance Tensor Contractions for GPUs, ICCS 2016, San Diego, CA, June 6-8, 2016.

Code Generation

C++11 features will be used as much as possible. Additional needs will be handled by defining a domain-specific embedded language (DSEL). This technique is used in C++ to take advantage of DSL features while using the optimizations provided by a standard compiler. It will handle the generation of versions (index reordering, etc.) to be empirically evaluated as part of the autotuning framework.

Autotuning

We are developing fixed-size GEMM kernels for GPUs, Xeon Phi, and multicore CPUs (see the figure on the right for a single core of an Intel Xeon E5-2620 and a K40) through an autotuning framework. A number of generic versions are developed and parametrized for performance. The parameters are autotuned (empirically) to find the "best" kernels for specific sizes.

Tensor operations in high-order FEM

Consider the FE mass matrix $M_E$ for an element/zone $E$ with weight $\rho$, as a 2-dimensional tensor:
\[
(M_E)_{ij} = \sum_{k=1}^{n_q} \alpha_k\, \rho(q_k)\, \varphi_i(q_k)\, \varphi_j(q_k), \qquad i, j = 1, \dots, n_d,
\]
where $\varphi_i$ are the FE basis functions and $\alpha_k$, $q_k$ are the quadrature weights and points. Take the $n_q \times n_d$ matrix $B$, $B_{ki} = \varphi_i(q_k)$, and the diagonal matrix $D_E$, $(D_E)_{kk} = \alpha_k\, \rho(q_k)$. Then $M_E = B^T D_E B$, or, omitting the $E$ subscript, $M = B^T D B$. Using FE of order $p$, we have $n_d = O(p^d)$ and $n_q = O(p^d)$, so $B$ is a dense $O(p^d) \times O(p^d)$ matrix. If the FE basis and the quadrature rule have tensor product structure, we can decompose the dof and quadrature point indices in logical coordinate axes, $i = (i_1, \dots, i_d)$, $j = (j_1, \dots, j_d)$, $k = (k_1, \dots, k_d)$, so $M_{ij}$ can be viewed as a $2d$-dimensional tensor $M_{i_1, \dots, i_d, j_1, \dots, j_d}$.

Summary of kernels needed:

  • Assembly of M (equations (1) & (2) of the poster)
  • Evaluations of M times V (equations (3) & (4) of the poster)

Towards a High-Performance Tensor Algebra Package for Accelerators

  • M. Baboulin, V. Dobrev, J. Dongarra, C. Earl, J. Falcou, A. Haidar, I. Karlin, T. Kolev, I. Masliah, and S. Tomov

Abstract

Numerous important applications, e.g., high-order FEM simulations, can be expressed through tensors. Examples are the computation of FE matrices and SpMV products expressed as generalized tensor contractions. Contractions by the first index can often be represented as a tensor index reordering plus GEMM, which is a key factor in achieving high performance. We present ongoing work on the design of a high-performance package in MAGMA for tensor algebra that includes techniques to organize tensor contractions, data storage, and parametrization related to the batched execution of large numbers of small tensor contractions. We apply autotuning and code generation techniques to provide an architecture-aware, user-friendly interface.

Motivation

Numerous important applications can be expressed through tensors:

  • High-order FEM simulations
  • Signal processing
  • Numerical linear algebra
  • Numerical analysis
  • Data mining
  • Deep learning
  • Graph analysis
  • Neuroscience, and more

The goal is to design a:

  • High-performance package for Tensor algebra
  • Built-in architecture-awareness (GPU, Xeon Phi, multicore)
  • User-friendly interface

Example cases

Numerical linear algebra:

  • A 4-dimensional tensor contraction
  • Rank-k update on matrices in tile format (k can be small, e.g., sub-vector/warp size)
  • Must determine (in software) whether it is possible to do it through batched GEMM kernels

[1] V. Dobrev, T. Kolev, R. Rieben. High-order curvilinear finite element methods for Lagrangian hydrodynamics. SIAM J. Sci. Comp. 34(5), B606–B641. (36 pages)

APPROACH AND RESULTS

User-friendly interface

We provide various interfaces, including one using C++11. The top-level design provides features similar to the mshadow library: https://github.com/dmlc/mshadow

Index reordering/reshape

If we store tensors as column-wise 1D arrays, then (for $d = 2$) $M$ can be interpreted as a 4th-order tensor, an $n_d \times n_d$ matrix, or a vector of size $n_d^2$, without changing the storage. In general, we can define a reshape $n_1 \times \dots \times n_r \to m_1 \times \dots \times m_q$ as long as $n_1 \cdots n_r = m_1 \cdots m_q$ and, for every pair of index tuples $i_{1..r}$, $j_{1..q}$, the column-major linearizations agree: $i_1 + n_1 i_2 + \dots + (n_1 n_2 \cdots n_{r-1}) i_r = j_1 + m_1 j_2 + \dots + (m_1 m_2 \cdots m_{q-1}) j_q$. For example, a $2 \times 5 \times 2$ tensor can be viewed as a $2 \times 10$ matrix because element $(i_1, i_2, i_3)$ and element $(i_1, i_2 + 5 i_3)$ share the linear index $i_1 + 2 i_2 + 10 i_3$. Contractions can be implemented as a sequence of pairwise contractions. There is enough complexity here to search for something better: code generation, index reordering, and autotuning will be used, e.g., contractions (3a)-(4f) can be implemented as tensor index reordering plus GEMM $A, B \to A^T B$.

// Our current interface:
// Create a 2 x 5 x 2 float tensor; default locality is CPU,
// using std::vector as the default backend for the data.
Tensor<2,5,2> ts;
// Create a 2 x 5 x 2 tensor on the GPU,
// using thrust as the default backend for the data.
Tensor<2,5,2,gpu_> d_ts;
// Call a thrust function to set the values to 9.
thrust::fill(d_ts.begin(), d_ts.end(), 9);
// Send the values back to the CPU tensor.
ts = d_ts;
// Reorder the 2 x 5 x 2 tensor to a 2 x 10 matrix using views.
view<2,10> mat = ts;


Batched LA

Tensor contractions are transformed through reshapes into batched LA operations, many of which are available in MAGMA [2] (http://icl.cs.utk.edu/magma/), including LU, QR, Cholesky, GEMM, GEMV, TRSM, and SYRK.

[2] A. Haidar, T. Dong, S. Tomov, P. Luszczek, and J. Dongarra. A framework for batched and GPU-resident factorization algorithms applied to block Householder transformations. ISC High Performance 2015, Frankfurt, Germany, July 12-16, 2015.

Conclusions and Future directions

  • A high-performance package for tensor algebra has the potential for high impact on a number of important applications
  • Multidisciplinary effort
  • Current results show promising performance; various components will be leveraged from the autotuned MAGMA Batched linear algebra kernels and from BLAST (LLNL)
  • This is ongoing work

Figure: Batched dgemms on a K40 GPU, batch count 2,000. MAGMA exceeds cuBLAS in performance for "small" sizes and is currently tuned for sizes above 32; current work concentrates on kernels for fixed smaller (sub-warp) sizes.

Gatlinburg, Tennessee, Aug 31 - Sept 2, 2015. http://computing.ornl.gov/workshops/SMC15/

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

LLNL release number LLNL-POST-676632

ICL's work on this material was supported by the National Science Foundation under Grant ACI-1339822, the Department of Energy, and NVIDIA.


  • Contractions can often be implemented as index reordering plus batched GEMM (and hence be highly efficient)

Reference:

  • V. Dobrev, Tz. Kolev, R. Rieben, High-order curvilinear finite element methods for Lagrangian hydrodynamics, SIAM J. Sci. Comp. 34(5), B606–B641. (36 pages)

SLIDE 15

Batched routines released in MAGMA
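The batched routines share a common usage pattern: many small independent problems are described by arrays of device pointers and launched in a single call. A minimal sketch with cuBLAS's cublasDgemmBatched, whose pointer-array interface MAGMA's batched GEMM mirrors; the sizes and batch count are illustrative:

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 16, batch = 1000;                 // many small n x n problems
    const size_t bytes = size_t(n) * n * sizeof(double);
    std::vector<double> hA(n * n, 1.0), hB(n * n, 2.0);

    // One device matrix per batch entry, tracked via host pointer arrays.
    std::vector<double*> A(batch), B(batch), C(batch);
    for (int i = 0; i < batch; ++i) {
        cudaMalloc(&A[i], bytes); cudaMemcpy(A[i], hA.data(), bytes, cudaMemcpyHostToDevice);
        cudaMalloc(&B[i], bytes); cudaMemcpy(B[i], hB.data(), bytes, cudaMemcpyHostToDevice);
        cudaMalloc(&C[i], bytes); cudaMemset(C[i], 0, bytes);
    }
    // The pointer arrays themselves must live on the device.
    double **dA, **dB, **dC;
    cudaMalloc(&dA, batch * sizeof(double*));
    cudaMemcpy(dA, A.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMalloc(&dB, batch * sizeof(double*));
    cudaMemcpy(dB, B.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMalloc(&dC, batch * sizeof(double*));
    cudaMemcpy(dC, C.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    cublasHandle_t h; cublasCreate(&h);
    const double alpha = 1.0, beta = 0.0;
    // One call runs all `batch` GEMMs: C_i = alpha * A_i * B_i + beta * C_i.
    cublasDgemmBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, (const double**)dA, n, (const double**)dB, n,
                       &beta, dC, n, batch);
    cudaDeviceSynchronize();

    double c00;
    cudaMemcpy(&c00, C[0], sizeof(double), cudaMemcpyDeviceToHost);
    printf("C_0(0,0) = %g (expect %g)\n", c00, 1.0 * 2.0 * n);
    cublasDestroy(h);
}

The same pointer-array pattern carries over to the batched factorizations (LU, QR, Cholesky) mentioned earlier.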

SLIDE 16

Memory hierarchies for different types of architectures:

                            Haswell          KNL 7250         ARM          K40c             P100
                            E5-2650 v3       (DDR4/MCDRAM)
Cores                       10 cores         68 cores         4 cores      15 SM x 192      56 SM x 64
Registers                   16/core AVX2     32/core AVX-512  32/core      256 KB/SM        256 KB/SM
L1 cache / GPU shared mem.  32 KB/core       32 KB/core       32 KB/core   64 KB/SM         64 KB/SM
L2 cache                    256 KB/core      1024 KB/2cores   2 MB         1.5 MB           4 MB
L3 cache                    25 MB            0-16 GB          N/A          N/A              N/A
Main memory                 64 GB            384 / 16 GB      4 GB         12 GB            16 GB
Main memory bandwidth       68 GB/s          115 / 421 GB/s   26 GB/s      288 GB/s         720 GB/s
PCI Express gen3 x16        16 GB/s          16 GB/s          16 GB/s      16 GB/s          16 GB/s
Interconnect (Cray Gemini)  6 GB/s           6 GB/s           6 GB/s       6 GB/s           6 GB/s

Implementation on current hardware is becoming challenging

Draft reports:

  • Batched BLAS Draft Report: https://www.dropbox.com/s/olocmipyxfvcaui/batched_api_03_30_2016.pdf?dl=0
  • Batched BLAS Poster: https://www.dropbox.com/s/ddkym76fapddf5c/Batched%20BLAS%20Poster%2012.pdf?dl=0
  • Batched BLAS Slides: https://www.dropbox.com/s/kz4fhcipz3e56ju/BatchedBLAS-1.pptx?dl=0
  • Webpage on ReproBLAS: http://bebop.cs.berkeley.edu/reproblas/
  • Efficient Reproducible Floating Point Summation and BLAS: http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-229.pdf

Workshop on Batched, Reproducible, and Reduced Precision BLAS, Georgia Tech Computational Science and Engineering, Atlanta, GA, February 23-25, 2017: http://bit.ly/Batch-BLAS-2017

SLIDE 17

Tensor contractions – performance

Performance comparison of tensor contraction versions using batched C = αAB + βC on 100,000 square matrices of size n on a K40c GPU and 16 cores of Intel Xeon E5-2670, 2.60 GHz CPUs.

[Figure (left): Gflop/s vs. matrix size (2-32) on the K40 and the multicore CPU, comparing our MAGMA design, cuBLAS on the K40, the Rocache design, and MKL+OpenMP on the CPU, against the roofline bound.]

[Figure (right): Gflop/s vs. matrix size (2-32) on an NVIDIA P100, comparing MAGMA tensor dgemm with sizes predefined at compile time, MAGMA batched dgemm (generic small), and cuBLAS v8.0, against the roofline bound.]

Reference:

  • I. Masliah, A. Abdelfattah, A. Haidar, S. Tomov, M. Baboulin, J. Falcou, and J. Dongarra, High-performance matrix-matrix multiplications of very small matrices, Euro-Par'16, Grenoble, France, August 22-26, 2016.

SLIDE 18

MAGMA-DNN: Motivation

  • Deep learning architectures show promising results on abstract tasks like image classification
  • They consist of the same neural networks as before, except that the size of the networks has increased drastically (demanding more compute power) and the training datasets are huge
  • Any improvement in the core modules of such networks greatly influences training and increases performance, and also helps in understanding the networks well
  • Spatial convolution in convnets takes up to 70% of the total execution time
  • We optimize the spatial convolution module specifically for convnets

SLIDE 19

MAGMA-DNN: Introduction to spatial convolution

  • The input and the weights are 3D tensors
  • As a filter traverses the input volume horizontally and vertically, it generates a 2D activation map
  • Multiple filters generate multiple 2D output frames
  • These output frames are stacked to form a 3D output tensor

A direct-convolution sketch of this description follows.
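In loop form, a self-contained illustration of the definition above (unit stride, no padding; the sizes C, H, W, K, R, S follow the usual convnet notation and are illustrative, not from the slides):

// Direct spatial convolution: a 3D input (C x H x W) and K filters
// (each C x R x S) produce a 3D output (K x P x Q). Not MAGMA-DNN code.
#include <vector>
#include <cstdio>

int main() {
    const int C = 3, H = 8, W = 8, K = 4, R = 3, S = 3;
    const int P = H - R + 1, Q = W - S + 1;
    std::vector<float> D(C * H * W, 1.0f), F(K * C * R * S, 0.1f);
    std::vector<float> O(K * P * Q, 0.0f);   // K stacked 2D activation maps

    for (int k = 0; k < K; ++k)              // each filter -> one output frame
      for (int p = 0; p < P; ++p)            // filter slides vertically...
        for (int q = 0; q < Q; ++q) {        // ...and horizontally
          float acc = 0.0f;
          for (int c = 0; c < C; ++c)        // reduce over channels and window
            for (int r = 0; r < R; ++r)
              for (int s = 0; s < S; ++s)
                acc += F[((k * C + c) * R + r) * S + s]
                     * D[(c * H + (p + r)) * W + (q + s)];
          O[(k * P + p) * Q + q] = acc;
        }
    printf("O[0] = %g (expect %g)\n", O[0], 0.1f * C * R * S);
}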

SLIDE 20

MAGMA-DNN: Background / existing technique: unfold and GEMM

[Figure: the unfold-and-GEMM scheme applied to the convolution modules of VGG-16 (configuration D).]

SLIDE 21

MAGMA-DNN: Background / existing technique: unfold and GEMM

Advantages:
  • Unfold involves streaming memcpy and can be made parallel by having many threads working on many sections of the input
  • Many BLAS libraries contain fine-tuned GEMM routines that can be used
  • The output format is consistent with the actual convolution output

Disadvantages:
  • The unfold operation requires extra memory
  • The matrix shapes can be greatly skewed

A sketch of the scheme follows.
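A minimal self-contained sketch of unfold (im2col) + GEMM, producing the same output as the direct loop above (again illustrative sizes, not MAGMA-DNN code). Note that the unfolded matrix U is exactly the extra memory, and that for a 3x3 filter on a large image the resulting GEMM is skewed (CRS rows vs. PQ columns):

#include <vector>
#include <cstdio>

int main() {
    const int C = 3, H = 8, W = 8, K = 4, R = 3, S = 3;
    const int P = H - R + 1, Q = W - S + 1;
    std::vector<double> D(C * H * W, 1.0);       // input data
    std::vector<double> F(K * C * R * S, 0.1);   // filter weights

    // Unfold: each output pixel gets a column of the C*R*S values it sees.
    std::vector<double> U((C * R * S) * (P * Q));
    for (int c = 0; c < C; ++c)
      for (int r = 0; r < R; ++r)
        for (int s = 0; s < S; ++s)
          for (int p = 0; p < P; ++p)
            for (int q = 0; q < Q; ++q)
              U[(((c * R + r) * S + s) * P + p) * Q + q] =
                  D[(c * H + (p + r)) * W + (q + s)];

    // GEMM: O (K x PQ) = F (K x CRS) * U (CRS x PQ). Row-major, naive loop;
    // in practice this call goes to a tuned BLAS routine.
    std::vector<double> O(K * P * Q, 0.0);
    const int M = K, N = P * Q, L = C * R * S;
    for (int i = 0; i < M; ++i)
      for (int j = 0; j < N; ++j) {
        double sum = 0.0;
        for (int l = 0; l < L; ++l) sum += F[i * L + l] * U[l * N + j];
        O[i * N + j] = sum;
      }
    printf("O[0] = %g (expect %g)\n", O[0], 0.1 * C * R * S);
}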

SLIDE 22

MAGMA-DNN: Convolution using transformation techniques

FFT method:
  • Convolution becomes an elementwise product in the frequency domain
  • Complexity in 2D is O(RS log(HW)), better than the O(RSHW) of direct convolution
  • Caution: the filter dimension should be similar to the image dimension, but in widely used convnets RS << HW

Winograd minimal filtering:
  • Best suited for small filters, RS << HW
  • Reduces the arithmetic operations by a constant factor, thus improving the running time

SLIDE 23

MAGMA-DNN: Winograd algorithm

Steps:
  1. Transform a 4x4 image tile
  2. Transform a 3x3 filter
  3. Perform an elementwise product between the transformed tiles
  4. Apply the inverse transform to the product tile
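For concreteness, the F(2x2, 3x3) instance of these steps uses the standard minimal-filtering transform matrices (stated here from the Winograd/Lavin-Gray construction; the slide itself does not list them). With a 4x4 input tile $d$ and a 3x3 filter $g$, the 2x2 output tile is

\[
Y = A^T \left[ (G\,g\,G^T) \odot (B^T d\,B) \right] A,
\]
\[
B^T = \begin{pmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{pmatrix}, \quad
G = \begin{pmatrix} 1 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{pmatrix}, \quad
A^T = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{pmatrix},
\]

where $\odot$ is the elementwise product: 16 multiplies per 2x2 output tile instead of the 36 needed by direct convolution.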

SLIDE 24

MAGMA-DNN: Winograd algorithm: reduction to GEMMs

  • Each image tensor has 16 matrices of size T x C
  • The K filters are reduced to 16 C x K matrices
  • For a batch size of N there are 16N GEMMs; for example, N = 64 gives 1024 GEMMs
  • The 16 filter matrices are common to all the GEMMs, giving better last-level cache efficiency
  • Once the GEMMs are performed, "gather" the elements from the 16 output matrices to form a 4x4 output tile
  • Apply the inverse transform to the output tile to obtain the 2x2 convolution output

SLIDE 25

MAGMA-DNN: Winograd algorithm: advantages of GEMMs

  • Each transformed image tile is reused with the K filters; similarly, each filter tile is reused with all the input tiles across the batch of N
  • If N and K are large enough, the transformation cost is amortized because of the maximal reuse of the transformed tiles
  • Instead of applying the inverse transform across C and then accumulating, the natural form of GEMM accumulates the result across C, and the inverse can then be applied once to the GEMM output tile; this is possible because of the linearity of the Winograd convolution (see the identity below)
  • Fine-tuned GEMM APIs are available
  • Good cache efficiency; good arithmetic intensity
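The accumulate-then-inverse trick is just the linearity of the transforms; in the F(2x2, 3x3) notation above (my notation, not the slides'):

\[
Y_k = \sum_{c=1}^{C} A^T \left[ (G g_{k,c} G^T) \odot (B^T d_c B) \right] A
    = A^T \left[ \sum_{c=1}^{C} (G g_{k,c} G^T) \odot (B^T d_c B) \right] A,
\]

so the channel reduction happens inside the brackets, which is exactly what the 16 GEMMs compute, and the inverse transform is applied once per output tile instead of C times.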

SLIDE 26

Winograd algorithm


SLIDE 27

MAGMA-DNN: Optimizing GEMMs: kernel design

C = βC + αAB, with A (M x K), B (K x N), and C (M x N) partitioned into blocks (A11..A44, B11..B44, C11..C44). Each block of C is the accumulation of the block products along the K dimension, e.g.:

T1 = αA11 B11, T2 = αA12 B21, T3 = αA13 B31, T4 = αA14 B41, C11 = βC11 + Σk Tk

SLIDE 28

MAGMA-DNN: Optimizing GEMMs: kernel design

C = βC + αAB

[Figure: A (M x K), B (K x N), and C (M x N) partitioned into BLK_M x BLK_K, BLK_K x BLK_N, and BLK_M x BLK_N blocks.]

  • Assign every block of C to a thread block (TB)
  • Hold that block of C in registers / shared memory
  • Slide the tiles of A and B along the K dimension and compute C = βC + αAB
  • This design guarantees reproducibility of the results
  • The kernel is parameterized to allow tuning and optimization

SLIDE 29

MAGMA-DNN: Optimizing GEMMs: kernel design

C = βC + αAB

For the block C22, the thread block accumulates the products of the sliding tiles in a temporary T held in registers / shared memory:

T = A1 B1; T += A2 B2; T += A3 B3; T += A4 B4; C22 = βC22 + αT

SLIDE 30

MAGMA-DNN: Optimizing GEMMs: kernel design

[Figure: mapping of the threads (thx, thy) of a thread block onto the tiles of A, B, and C.]

  • Reading/writing the elements is based on the TB size (the number of threads), so it is an extra tuning parameter
  • It can also be different for A, B, and C

SLIDE 31

MAGMA-DNN: Optimizing GEMMs: kernel design

[Figure: double-buffered sliding of the A and B tiles that contribute to C22.]

  • Prefetching: the next tiles of A and B are loaded while the current ones are being multiplied, overlapping memory traffic with computation

A CUDA sketch of the overall design follows.
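Putting the last few slides together, here is a minimal CUDA sketch of the described design: each thread block owns one BLK_M x BLK_N block of C, the K dimension is traversed in BLK_K-wide tiles staged through shared memory, and the block sizes are compile-time parameters so an autotuner can sweep them. Illustrative only (no prefetch/double buffering; sizes assumed to divide M, N, K), not MAGMA's kernel:

#include <cstdio>
#include <cuda_runtime.h>

// Block sizes as template parameters: the tuning knobs of the design.
template <int BLK_M, int BLK_N, int BLK_K>
__global__ void gemm_tiled(int M, int N, int K, double alpha,
                           const double* A, const double* B,
                           double beta, double* C) {
    // This thread block owns the C block at (blockIdx.y, blockIdx.x).
    __shared__ double sA[BLK_M][BLK_K];
    __shared__ double sB[BLK_K][BLK_N];
    const int row = blockIdx.y * BLK_M + threadIdx.y;   // global row in C
    const int col = blockIdx.x * BLK_N + threadIdx.x;   // global col in C
    double acc = 0.0;                                   // C element held in a register

    for (int kb = 0; kb < K; kb += BLK_K) {             // slide tiles along K
        // Stage the current A and B tiles in shared memory (row-major).
        for (int k = threadIdx.x; k < BLK_K; k += BLK_N)
            sA[threadIdx.y][k] = A[row * K + (kb + k)];
        for (int k = threadIdx.y; k < BLK_K; k += BLK_M)
            sB[k][threadIdx.x] = B[(kb + k) * N + col];
        __syncthreads();
        for (int k = 0; k < BLK_K; ++k)                 // multiply the staged tiles
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();
    }
    // Fixed summation order per element => reproducible results.
    C[row * N + col] = beta * C[row * N + col] + alpha * acc;
}

int main() {
    const int M = 64, N = 64, K = 128;
    double *A, *B, *C;                                  // unified memory for brevity
    cudaMallocManaged(&A, M * K * sizeof(double));
    cudaMallocManaged(&B, K * N * sizeof(double));
    cudaMallocManaged(&C, M * N * sizeof(double));
    for (int i = 0; i < M * K; ++i) A[i] = 1.0;
    for (int i = 0; i < K * N; ++i) B[i] = 2.0;
    for (int i = 0; i < M * N; ++i) C[i] = 0.0;

    constexpr int BM = 16, BN = 16, BK = 32;
    dim3 grid(N / BN, M / BM), block(BN, BM);
    gemm_tiled<BM, BN, BK><<<grid, block>>>(M, N, K, 1.0, A, B, 0.0, C);
    cudaDeviceSynchronize();
    printf("C[0] = %g (expect %g)\n", C[0], 2.0 * K);   // 1.0*2.0 summed K times
}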

SLIDE 32

MAGMA-DNN: Optimizing GEMMs: kernel design

Are we done? Do we have our best kernel?

  • No: in most cases (LA algorithms, deep learning, etc.) the matrices A, B, and C are not square, which requires autotuning (a sketch of such a sweep follows)
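The autotuning loop itself is conceptually simple: generate parameterized variants, time each empirically on the target sizes, and keep the fastest. A toy harness showing the timing pattern with a stand-in kernel (the variant and its parameter are illustrative, not MAGMA's autotuner):

#include <cstdio>
#include <cuda_runtime.h>

// Toy "kernel variant", parameterized by an unroll factor; it stands in
// for a GEMM variant parameterized by (BLK_M, BLK_N, BLK_K).
template <int UNROLL>
__global__ void variant(const double* x, double* y, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * UNROLL;
    #pragma unroll
    for (int u = 0; u < UNROLL; ++u)
        if (i + u < n) y[i + u] = 2.0 * x[i + u];
}

// Time one variant with CUDA events; return average milliseconds.
template <int UNROLL>
float time_variant(const double* x, double* y, int n) {
    int threads = 256, blocks = (n / UNROLL + threads - 1) / threads;
    cudaEvent_t t0, t1; cudaEventCreate(&t0); cudaEventCreate(&t1);
    variant<UNROLL><<<blocks, threads>>>(x, y, n);      // warm-up launch
    cudaEventRecord(t0);
    for (int rep = 0; rep < 100; ++rep)
        variant<UNROLL><<<blocks, threads>>>(x, y, n);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    float ms; cudaEventElapsedTime(&ms, t0, t1);
    return ms / 100;
}

int main() {
    const int n = 1 << 22;
    double *x, *y;
    cudaMalloc(&x, n * sizeof(double)); cudaMemset(x, 0, n * sizeof(double));
    cudaMalloc(&y, n * sizeof(double));
    // Empirical sweep over the parameter space; keep the fastest variant.
    printf("unroll=1: %.3f ms\n", time_variant<1>(x, y, n));
    printf("unroll=2: %.3f ms\n", time_variant<2>(x, y, n));
    printf("unroll=4: %.3f ms\n", time_variant<4>(x, y, n));
}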

SLIDE 33

MAGMA-DNN: Optimizing GEMMs: kernel tuning

[Figure: sgemm NN, square sizes: performance (Gflops) vs. matrix size for the five best autotuned kernel variants (gemm 739, 711, 742, 741, 738), against the performance bound and the slowest variant (performance min). CPU: Intel Xeon E5-2650 v3 (Haswell), 2x10 cores @ 2.30 GHz.]

SLIDE 34

MAGMA-DNN: Optimizing GEMMs: kernel tuning

[Figure: sgemm NN, custom (non-square) sizes: performance (Gflops) vs. matrix size for the five best autotuned kernel variants (gemm 636, 777, 765, 745, 732), against the performance bound and the slowest variant (performance min). CPU: Intel Xeon E5-2650 v3 (Haswell), 2x10 cores @ 2.30 GHz.]

SLIDE 35

MAGMA-DNN: Optimizing GEMMs: kernel tuning

[Figure: sgemm NN, square sizes 112-1232: performance (Gflops) vs. matrix size for the five best autotuned kernel variants (gemm 845, 843, 847, 844, 789), against the performance bound and the slowest variant (performance min). CPU: Intel Xeon E5-2650 v3 (Haswell), 2x10 cores @ 2.30 GHz.]

SLIDE 36

MAGMA-DNN: Optimizing GEMMs: kernel tuning

[Figure: sgemm NN, custom sizes 83-913: performance (Gflops) vs. matrix size for the five best autotuned kernel variants (gemm 447, 557, 496, 440, 837), against the performance bound and the slowest variant (performance min). CPU: Intel Xeon E5-2650 v3 (Haswell), 2x10 cores @ 2.30 GHz.]

SLIDE 37
[Figure: Batched GEMM performance on an NVIDIA K40 GPU (15 MP x 192 @ 0.88 GHz) for m = n with k = 8, k = 16, and k = 64, and for m = n = k.]

SLIDE 38

Conclusions and future work

In conclusion:
  • Developed a number of NLA routines in MAGMA targeting applications: high-order FEM, DNN, and data analytics
  • Tensor abstractions and high-performance tensor contractions (for high-order FEM)
  • Multidisciplinary effort
  • Achieve 90+% of the theoretical maximum on GPUs and multicore CPUs
  • Use on-the-fly tensor reshaping to cast tensor contractions as small but many GEMMs, executed using batched approaches
  • Custom-designed GEMM kernels for small matrices, and autotuning

Future directions:
  • Release a tensor contractions package through the MAGMA library
  • Release the NLA backend for DNN and data analytics
  • Integrate the developments in applications
  • Complete the autotuning and develop all kernels
SLIDE 39

Collaborators and Support

MAGMA team

http://icl.cs.utk.edu/magma

Collaborating partners

  • University of Tennessee, Knoxville
  • Lawrence Livermore National Laboratory, Livermore, CA (the LLNL-led ECP CEED: Center for Efficient Exascale Discretizations)
  • University of Manchester, Manchester, UK
  • University of Paris-Sud, France
  • INRIA, France