S7728 - MAGMA Tensors and Batched Computing for Accelerating Applications on GPUs


SLIDE 1

S7728 - MAGMA Tensors and Batched Computing for Accelerating Applications on GPUs

Abstract: Learn how to accelerate your machine learning, data mining, and other algorithms through fast matrix and tensor operations on GPUs. There's an increasing demand for accelerated independent computations on tensors and many small matrices. Although common, these workloads cannot be efficiently executed using standard linear algebra libraries. To fill the gap, we developed the MAGMA Batched library, which achieves dramatically better performance by executing the small operations in "batches." We'll describe a methodology for developing high-performance BLAS, SVD, factorizations, and solvers for both large- and small-batched matrices. We'll also present the current state-of-the-art implementations and community efforts to standardize an API that extends BLAS for batched computations.

GTC 2017, San Jose, CA, May 8-11, 2017

Stan Tomov - Research Director, UTK; Azzam Haidar - Research Scientist, UTK

SLIDE 2

MAGMA Tensors and Batched Computing for Accelerating Applications on GPUs

Stan Tomov and Azzam Haidar
Innovative Computing Laboratory
Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville

In collaboration with: LLNL, Livermore, CA, USA; University of Manchester, Manchester, UK; University of Paris-Sud, France

GTC 2017, San Jose, CA, May 8-11, 2017

SLIDE 3

Outline

  • Introduction
  • MAGMA library
    – Numerical Linear Algebra (NLA) for large problems
    – NLA for applications that need small problems

  • MAGMA Tensor contraction computations
  • MAGMA Batched Computing
  • MAGMA-DNN NLA backend for DNN
  • Algorithms and optimization techniques
  • Conclusions
SLIDE 4

A wide range of applications depends on Numerical Linear Algebra (NLA) libraries:

  • Airplane wing design,
  • Quantum chemistry,
  • Geophysical flows,
  • Stealth aircraft,
  • Diffusion of solid bodies in a liquid,
  • Adaptive mesh refinement,
  • Computational materials research,
  • Deep learning in neural networks,
  • Stochastic simulation,
  • Massively parallel data mining
SLIDE 5

Numerical Linear Algebra (NLA) in Applications

NLA is the backend that accelerates a wide variety of science and engineering applications:

  • Linear systems: solve Ax = b
    – Computational electromagnetics, material science, applications using boundary integral equations, airflow past wings, fluid flow around ships and other offshore constructions, and many more
  • Least squares: find x to minimize || Ax – b ||
    – Convex optimization, computational statistics (e.g., linear least squares or ordinary least squares), econometrics, control theory, signal processing, curve fitting, and many more
  • Eigenproblems: solve Ax = λx
    – Computational chemistry, quantum mechanics, material science, face recognition, PCA, data mining, marketing, Google PageRank, spectral clustering, vibrational analysis, compression, and many more
  • Singular Value Decomposition (SVD): A = U Σ V*
    – Information retrieval, web search, signal processing, big data analytics, low-rank matrix approximation, total least squares minimization, pseudo-inverse, and many more
  • Many variations depending on the structure of A
    – A can be symmetric, positive definite, tridiagonal, Hessenberg, banded, sparse with dense blocks, etc.
  • LA is crucial to the development of sparse solvers

SLIDE 6

Numerical Linear Algebra (NLA) in Applications

NLA is the backend that accelerates a wide variety of science and engineering applications:

  • For big NLA problems (BLAS, convolutions, SVD, linear system solvers, etc.), contemporary libraries (BLAS, LAPACK, ScaLAPACK, and MAGMA for GPUs) target large matrices.

SLIDE 7

Numerical Linear Algebra (NLA) in Applications

NLA is the backend that accelerates a wide variety of science and engineering applications:

  • For big NLA problems (BLAS, convolutions, SVD, linear system solvers, etc.), contemporary libraries (BLAS, LAPACK, ScaLAPACK, and MAGMA for GPUs) target large matrices.
  • Numerous important applications need NLA for small problems, where the data can be multidimensional / relational:
    – Machine learning / DNNs
    – Data mining / analytics
    – High-order FEM
    – Graph analysis
    – Neuroscience
    – Astrophysics
    – Quantum chemistry
    – Signal processing, and more

SLIDE 8

Numerical Linear Algebra (NLA) in Applications

NLA is the backend that accelerates a wide variety of science and engineering applications:

  • For big NLA problems (BLAS, convolutions, SVD, linear system solvers, etc.), contemporary libraries (BLAS, LAPACK, ScaLAPACK, and MAGMA for GPUs) target large matrices.
  • MAGMA is adding application backends for small problems: small matrices / tensors, supporting fixed-size batches, variable-size batches, dynamic batches, and tensors.
  • Target applications: machine learning / DNNs, data mining / analytics, high-order FEM, graph analysis, neuroscience, astrophysics, quantum chemistry, signal processing, and more.

SLIDE 9

Key Features of MAGMA 2.2

TASK-BASED ALGORITHMS

MAGMA uses task-based algorithms where the computation is split into tasks of varying granularity and their execution is scheduled over the hardware components. Scheduling can be static or dynamic. In either case, small non-parallelizable tasks, often on the critical path, are scheduled on the CPU, and larger, more parallelizable ones, often Level 3 BLAS, are scheduled on the GPUs.

PERFORMANCE & ENERGY EFFICIENCY

[Figure: MAGMA LU factorization in double precision arithmetic (BLAS tasking + hybrid scheduling): performance (GFLOP/s) for matrix sizes 2k-36k and energy efficiency (GFLOPs/Watt) on a CPU (Intel Xeon E5-2650 v3 "Haswell", 2x10 cores @ 2.30 GHz), one and two NVIDIA K40 GPUs (15 MP x 192 @ 0.88 GHz), and an NVIDIA P100 GPU (56 MP x 64 @ 1.19 GHz).]

SLIDE 10

MAGMA is designed to use Level 3 BLAS as much as possible.

[Figure: BLAS levels on an NVIDIA P100 (1.19 GHz, theoretical double precision peak of 4700 Gflop/s, CUDA 8.0), matrix size N (vector size NxN): dgemm (Level 3 BLAS, C = C + A*B) reaches ~4503 Gflop/s, dgemv (Level 2 BLAS, y = y + A*x) ~145 Gflop/s, and daxpy (Level 1 BLAS, y = αx + y) ~52 Gflop/s, a ~31x gap between Level 3 and Level 2.]
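The gap follows from arithmetic intensity; a back-of-the-envelope check (my numbers, not from the slides), counting 8 bytes per double and the minimum data movement of each routine:

\[
\text{daxpy: } \frac{2n \ \text{flops}}{24n \ \text{bytes}} = \frac{1}{12}, \qquad
\text{dgemv: } \frac{2n^2}{8(n^2 + 2n)} \approx \frac{1}{4}, \qquad
\text{dgemm: } \frac{2n^3}{32n^2} = \frac{n}{16} \ \text{flops/byte}.
\]

At the P100's ~720 GB/s memory bandwidth, the memory-bound ceilings are roughly 60 Gflop/s for daxpy and 180 Gflop/s for dgemv, consistent with the measured 52 and 145 Gflop/s, while dgemm's intensity grows with n, letting it approach the 4700 Gflop/s compute peak.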

SLIDE 11

MAGMA Algorithms (influenced by hardware trends)

Hybrid (using CPU + GPUs) and/vs. GPU-only:

[Figure: MAGMA LU factorization in double precision arithmetic, comparing magma hybrid, magma native, and magma native (opt), on a CPU (Intel Xeon E5-2650 v3 "Haswell", 2x10 cores @ 2.30 GHz), an NVIDIA K40 GPU (15 MP x 192 @ 0.88 GHz), and an NVIDIA P100 GPU (56 MP x 64 @ 1.19 GHz).]

SLIDE 12

MAGMA Algorithms (influenced by hardware trends): mixed-precision iterative refinement

Solving general dense linear systems using mixed precision iterative refinement.

[Figure: Performance (Gflop/s) vs. matrix size for CPOSV (single complex), ZCPOSV (mixed precision), and ZPOSV (double complex) on a GPU TITAN X (3,072 CUDA cores @ 1.076 GHz, Maxwell; Z/C GEMM peaks ~ 190 / 5,600 GFlop/s) and a CPU Intel Xeon X5660 @ 2.80 GHz (2 x 6 cores). The mixed-precision solver obtains double-precision accuracy at up to 26x the speed of the double-complex solve.]
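The idea behind ZCPOSV-style solvers: do the O(n^3) factorization in fast low precision, then recover double-precision accuracy with cheap O(n^2) residual corrections. A minimal self-contained sketch of the refinement loop (naive unblocked kernels for illustration, not MAGMA's implementation):

// Mixed-precision iterative refinement for an SPD system A x = b:
// factor once in single precision, iterate corrections in double.
#include <cmath>
#include <cstdio>
#include <vector>

// Unblocked Cholesky factorization (lower triangular), single precision.
static bool cholesky(std::vector<float>& L, int n) {
    for (int j = 0; j < n; ++j) {
        for (int k = 0; k < j; ++k)
            for (int i = j; i < n; ++i)
                L[i + j*n] -= L[i + k*n] * L[j + k*n];
        if (L[j + j*n] <= 0.0f) return false;       // not SPD (in float)
        float d = std::sqrt(L[j + j*n]);
        for (int i = j; i < n; ++i) L[i + j*n] /= d;
    }
    return true;
}

// Solve L L^T x = b in single precision (x holds b on entry).
static void solve(const std::vector<float>& L, std::vector<float>& x, int n) {
    for (int i = 0; i < n; ++i) {                   // forward: L y = b
        for (int j = 0; j < i; ++j) x[i] -= L[i + j*n] * x[j];
        x[i] /= L[i + i*n];
    }
    for (int i = n - 1; i >= 0; --i) {              // backward: L^T x = y
        for (int j = i + 1; j < n; ++j) x[i] -= L[j + i*n] * x[j];
        x[i] /= L[i + i*n];
    }
}

int main() {
    const int n = 4;
    // SPD test matrix A (column-major) and right-hand side b.
    std::vector<double> A = { 4,1,1,0, 1,5,2,1, 1,2,6,1, 0,1,1,3 };
    std::vector<double> b = { 1, 2, 3, 4 }, x(n, 0.0);

    std::vector<float> Lf(A.begin(), A.end());      // O(n^3) work in fast precision
    if (!cholesky(Lf, n)) return 1;

    for (int iter = 0; iter < 30; ++iter) {
        std::vector<double> r = b;                  // r = b - A x, in double
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i)
                r[i] -= A[i + j*n] * x[j];
        double rnorm = 0.0;
        for (double v : r) rnorm = std::max(rnorm, std::fabs(v));
        printf("iter %d: ||r||_inf = %.2e\n", iter, rnorm);
        if (rnorm < 1e-12) break;
        std::vector<float> c(r.begin(), r.end());   // correction solve: O(n^2), single
        solve(Lf, c, n);
        for (int i = 0; i < n; ++i) x[i] += (double)c[i];
    }
}

Each iteration shrinks the error by roughly the single-precision unit roundoff times the condition number of A, so for reasonably conditioned systems a handful of iterations reaches double-precision residuals while nearly all the flops run at single-precision speed.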

SLIDE 13

Backend for DNN and Data Analytics

Support for various batched and/or tensor contraction routines, e.g., for Convolutional Neural Networks (CNNs) used in computer vision. The key computation is the convolution of filters F_n (feature detectors) with the input image data D: for every filter F_n and every channel, the computation of every pixel value O_{n,k} is the tensor contraction

\[
O_{n,k} = \sum_i F_{n,i}\, D_{k,i}.
\]

[Figure: CNN pipeline: data D -> convolution (filters F_1 ... F_n, outputs O_1 ... O_n) -> pooling -> convolution -> pooling -> fully connected -> output predictions (e.g., chicken 0.4, boat 0.3, person 0.1, dog 0.01).]

  • Plenty of parallelism; small operations that must be batched
  • With a data "reshape" the computation can be transformed into a batched GEMM (for efficiency; among other approaches); see the identity below
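In matrix form this contraction is a single small GEMM (my notation): treating F as an N_f x N_i matrix and D as an N_k x N_i matrix,

\[
O_{n,k} = \sum_i F_{n,i} D_{k,i} \quad\Longleftrightarrow\quad O = F\, D^{T},
\]

so after the data reshape each image tile / channel block contributes one small GEMM, and the whole layer becomes one batched GEMM over thousands of such products.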

SLIDE 14

Tensor contractions for high-order FEM

Reference:

  • A. Abdelfattah, M. Baboulin, V. Dobrev, J. Dongarra, C. Earl, J. Falcou, A. Haidar, I. Karlin, Tz. Kolev, I. Masliah, S. Tomov, High-Performance Tensor Contractions for GPUs, ICCS 2016, San Diego, CA, June 6-8, 2016.

Code Generation

C++11 features will be used as much as possible. Additional needs will be handled by defining a domain-specific embedded language (DSEL). This technique is used in C++ to take advantage of DSL features while using the optimizations provided by a standard compiler. It will handle the generation of versions (index reordering, etc.) to be empirically evaluated as part of the autotuning framework.

Autotuning

We are developing fixed-size GEMM kernels for GPUs, Xeon Phi, and multicore CPUs (see the figure on the right for a single core of an Intel Xeon E5-2620 and a K40) through an autotuning framework. A number of generic versions are developed and parametrized for performance. The parameters are autotuned (empirically) to find the "best" kernels for specific sizes.

Tensor operations in high-order FEM

Consider the FE mass matrix $M_E$ for an element/zone $E$ with weight $\rho$, as a 2-dimensional tensor:
\[
(M_E)_{ij} = \sum_{k=1}^{n_q} \alpha_k\, \rho(q_k)\, \varphi_i(q_k)\, \varphi_j(q_k), \qquad i, j = 1, \dots, n_d,
\]
where $\varphi_i$ are the FE basis functions and $\alpha_k$, $q_k$ are the quadrature weights and points. Take the $n_q \times n_d$ matrix $B$, $B_{ki} = \varphi_i(q_k)$, and the diagonal matrix $D_E$, $(D_E)_{kk} = \alpha_k\, \rho(q_k)$. Then $M_E = B^T D_E B$, or, omitting the $E$ subscript, $M = B^T D B$. Using FE of order $p$, we have $n_d = O(p^d)$ and $n_q = O(p^d)$, so $B$ is a dense $O(p^d) \times O(p^d)$ matrix. If the FE basis and the quadrature rule have tensor product structure, we can decompose the dof and quadrature point indices in logical coordinate axes, $i = (i_1, \dots, i_d)$, $j = (j_1, \dots, j_d)$, $k = (k_1, \dots, k_d)$, so $M_{ij}$ can be viewed as a $2d$-dimensional tensor $M_{i_1, \dots, i_d, j_1, \dots, j_d}$.

Summary of kernels needed:

  • Assembly of M (equations (1) & (2) of the poster)
  • Evaluations of M times V (equations (3) & (4) of the poster)

Towards a High-Performance Tensor Algebra Package for Accelerators

  • M. Baboulin, V. Dobrev, J. Dongarra, C. Earl, J. Falcou, A. Haidar, I. Karlin, T. Kolev, I. Masliah, and S. Tomov

Abstract

Numerous important applications, e.g., high-order FEM simulations, can be expressed through tensors. Examples are the computation of FE matrices and SpMV products expressed as generalized tensor contractions. Contractions by the first index can often be represented as a tensor index reordering plus GEMM, which is a key factor in achieving high performance. We present ongoing work on the design of a high-performance package in MAGMA for tensor algebra that includes techniques to organize tensor contractions, data storage, and parametrization related to the batched execution of large numbers of small tensor contractions. We apply autotuning and code generation techniques to provide an architecture-aware, user-friendly interface.

Motivation

Numerous important applications can be expressed through tensors:

  • High-order FEM simulations
  • Signal processing
  • Numerical linear algebra
  • Numerical analysis
  • Data mining
  • Deep learning
  • Graph analysis
  • Neuroscience, and more

The goal is to design a:

  • High-performance package for Tensor algebra
  • Built-in architecture-awareness (GPU, Xeon Phi, multicore)
  • User-friendly interface

Example cases

Numerical linear algebra:

  • A 4-dimensional tensor contraction
  • Rank-k update on matrices in tile format (k can be small, e.g., sub-vector/warp size)
  • Must determine (in software) whether it is possible to do it through batched GEMM kernels

[1] V. Dobrev, T. Kolev, R. Rieben. High-order curvilinear finite element methods for Lagrangian hydrodynamics. SIAM J. Sci. Comp. 34(5), B606–B641. (36 pages)

APPROACH AND RESULTS

User-friendly interface

We provide various interfaces, including one using C++11. The top-level design provides features similar to the mshadow library: https://github.com/dmlc/mshadow

Index reordering/reshape

If we store tensors as column-wise 1D arrays, then (for $d = 2$) $M$ can be interpreted as a 4th-order tensor, an $n_d \times n_d$ matrix, or a vector of size $n_d^2$, without changing the storage. In general, we can define a reshape $n_1 \times \dots \times n_r \to m_1 \times \dots \times m_q$ as long as $n_1 \cdots n_r = m_1 \cdots m_q$ and, for every pair of index tuples $i_{1..r}$, $j_{1..q}$, the column-major linearizations agree: $i_1 + n_1 i_2 + \dots + (n_1 n_2 \cdots n_{r-1}) i_r = j_1 + m_1 j_2 + \dots + (m_1 m_2 \cdots m_{q-1}) j_q$. For example, a $2 \times 5 \times 2$ tensor can be viewed as a $2 \times 10$ matrix because element $(i_1, i_2, i_3)$ and element $(i_1, i_2 + 5 i_3)$ share the linear index $i_1 + 2 i_2 + 10 i_3$. Contractions can be implemented as a sequence of pairwise contractions. There is enough complexity here to search for something better: code generation, index reordering, and autotuning will be used, e.g., contractions (3a)-(4f) can be implemented as tensor index reordering plus GEMM $A, B \to A^T B$.

// Our current interface:
// Create a 2 x 5 x 2 float tensor; default locality is CPU,
// using std::vector as the default backend for the data.
Tensor<2,5,2> ts;
// Create a 2 x 5 x 2 tensor on the GPU,
// using thrust as the default backend for the data.
Tensor<2,5,2,gpu_> d_ts;
// Call a thrust function to set the values to 9.
thrust::fill(d_ts.begin(), d_ts.end(), 9);
// Send the values back to the CPU tensor.
ts = d_ts;
// Reorder the 2 x 5 x 2 tensor to a 2 x 10 matrix using views.
view<2,10> mat = ts;


Batched LA

Tensor contractions are transformed through reshapes into batched LA operations, many of which are available in MAGMA [2] (http://icl.cs.utk.edu/magma/), including LU, QR, Cholesky, GEMM, GEMV, TRSM, and SYRK.

[2] A. Haidar, T. Dong, S. Tomov, P. Luszczek, and J. Dongarra. A framework for batched and GPU-resident factorization algorithms applied to block Householder transformations. ISC High Performance 2015, Frankfurt, Germany, July 12-16, 2015.

Conclusions and Future directions

  • A high-performance package for tensor algebra has the potential for high impact on a number of important applications
  • Multidisciplinary effort
  • Current results show promising performance; various components will be leveraged from the autotuned MAGMA Batched linear algebra kernels and from BLAST (LLNL)
  • This is ongoing work

Figure: Batched dgemms on a K40 GPU, batch count 2,000. MAGMA exceeds cuBLAS in performance for "small" sizes and is currently tuned for sizes above 32; current work concentrates on kernels for fixed smaller (sub-warp) sizes.

Gatlinburg, Tennessee, Aug 31 - Sept 2, 2015. http://computing.ornl.gov/workshops/SMC15/

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

LLNL release number LLNL-POST-676632

ICL's work on this material was supported by the National Science Foundation under Grant ACI-1339822, the Department of Energy, and NVIDIA.


  • Contractions can often be implemented as index reordering plus batched GEMM (and hence be highly efficient)

Reference:

  • V. Dobrev, Tz. Kolev, R. Rieben, High-order curvilinear finite element methods for Lagrangian hydrodynamics, SIAM J. Sci. Comp. 34(5), B606–B641. (36 pages)

SLIDE 15

Batched routines released in MAGMA
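The batched routines share a common usage pattern: many small independent problems are described by arrays of device pointers and launched in a single call. A minimal sketch with cuBLAS's cublasDgemmBatched, whose pointer-array interface MAGMA's batched GEMM mirrors; the sizes and batch count are illustrative:

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 16, batch = 1000;                 // many small n x n problems
    const size_t bytes = size_t(n) * n * sizeof(double);
    std::vector<double> hA(n * n, 1.0), hB(n * n, 2.0);

    // One device matrix per batch entry, tracked via host pointer arrays.
    std::vector<double*> A(batch), B(batch), C(batch);
    for (int i = 0; i < batch; ++i) {
        cudaMalloc(&A[i], bytes); cudaMemcpy(A[i], hA.data(), bytes, cudaMemcpyHostToDevice);
        cudaMalloc(&B[i], bytes); cudaMemcpy(B[i], hB.data(), bytes, cudaMemcpyHostToDevice);
        cudaMalloc(&C[i], bytes); cudaMemset(C[i], 0, bytes);
    }
    // The pointer arrays themselves must live on the device.
    double **dA, **dB, **dC;
    cudaMalloc(&dA, batch * sizeof(double*));
    cudaMemcpy(dA, A.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMalloc(&dB, batch * sizeof(double*));
    cudaMemcpy(dB, B.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMalloc(&dC, batch * sizeof(double*));
    cudaMemcpy(dC, C.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    cublasHandle_t h; cublasCreate(&h);
    const double alpha = 1.0, beta = 0.0;
    // One call runs all `batch` GEMMs: C_i = alpha * A_i * B_i + beta * C_i.
    cublasDgemmBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, (const double**)dA, n, (const double**)dB, n,
                       &beta, dC, n, batch);
    cudaDeviceSynchronize();

    double c00;
    cudaMemcpy(&c00, C[0], sizeof(double), cudaMemcpyDeviceToHost);
    printf("C_0(0,0) = %g (expect %g)\n", c00, 1.0 * 2.0 * n);
    cublasDestroy(h);
}

The same pointer-array pattern carries over to the batched factorizations (LU, QR, Cholesky) mentioned earlier.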

SLIDE 16

Memory hierarchies for different types of architectures:

                            Haswell          KNL 7250         ARM          K40c             P100
                            E5-2650 v3       (DDR4/MCDRAM)
Cores                       10 cores         68 cores         4 cores      15 SM x 192      56 SM x 64
Registers                   16/core AVX2     32/core AVX-512  32/core      256 KB/SM        256 KB/SM
L1 cache / GPU shared mem.  32 KB/core       32 KB/core       32 KB/core   64 KB/SM         64 KB/SM
L2 cache                    256 KB/core      1024 KB/2cores   2 MB         1.5 MB           4 MB
L3 cache                    25 MB            0-16 GB          N/A          N/A              N/A
Main memory                 64 GB            384 / 16 GB      4 GB         12 GB            16 GB
Main memory bandwidth       68 GB/s          115 / 421 GB/s   26 GB/s      288 GB/s         720 GB/s
PCI Express gen3 x16        16 GB/s          16 GB/s          16 GB/s      16 GB/s          16 GB/s
Interconnect (Cray Gemini)  6 GB/s           6 GB/s           6 GB/s       6 GB/s           6 GB/s

Implementation on current hardware is becoming challenging

Draft reports:

  • Batched BLAS Draft Report: https://www.dropbox.com/s/olocmipyxfvcaui/batched_api_03_30_2016.pdf?dl=0
  • Batched BLAS Poster: https://www.dropbox.com/s/ddkym76fapddf5c/Batched%20BLAS%20Poster%2012.pdf?dl=0
  • Batched BLAS Slides: https://www.dropbox.com/s/kz4fhcipz3e56ju/BatchedBLAS-1.pptx?dl=0
  • Webpage on ReproBLAS: http://bebop.cs.berkeley.edu/reproblas/
  • Efficient Reproducible Floating Point Summation and BLAS: http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-229.pdf

Workshop on Batched, Reproducible, and Reduced Precision BLAS, Georgia Tech Computational Science and Engineering, Atlanta, GA, February 23-25, 2017: http://bit.ly/Batch-BLAS-2017

SLIDE 17

Tensor contractions – performance

Performance comparison of tensor contraction versions using batched C = αAB + βC on 100,000 square matrices of size n on a K40c GPU and 16 cores of Intel Xeon E5-2670, 2.60 GHz CPUs.

[Figure (left): Gflop/s vs. matrix size (2-32) on the K40 and the multicore CPU, comparing our MAGMA design, cuBLAS on the K40, the Rocache design, and MKL+OpenMP on the CPU, against the roofline bound.]

[Figure (right): Gflop/s vs. matrix size (2-32) on an NVIDIA P100, comparing MAGMA tensor dgemm with sizes predefined at compile time, MAGMA batched dgemm (generic small), and cuBLAS v8.0, against the roofline bound.]

Reference:

  • I. Masliah, A. Abdelfattah, A. Haidar, S. Tomov, M. Baboulin, J. Falcou, and J. Dongarra, High-performance matrix-matrix multiplications of very small matrices, Euro-Par'16, Grenoble, France, August 22-26, 2016.

SLIDE 18

MAGMA-DNN: Motivation

  • Deep learning architectures show promising results on abstract tasks like image classification
  • They consist of the same neural networks as before, except that the size of the networks has increased drastically (demanding more compute power) and the training datasets are huge
  • Any improvement in the core modules of such networks greatly influences training and increases performance, and also helps in understanding the networks well
  • Spatial convolution in convnets takes up to 70% of the total execution time
  • We optimize the spatial convolution module specifically for convnets

SLIDE 19

MAGMA-DNN: Introduction to spatial convolution

  • The input and the weights are 3D tensors
  • As a filter traverses the input volume horizontally and vertically, it generates a 2D activation map
  • Multiple filters generate multiple 2D output frames
  • These output frames are stacked to form a 3D output tensor

A direct-convolution sketch of this description follows.
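In loop form, a self-contained illustration of the definition above (unit stride, no padding; the sizes C, H, W, K, R, S follow the usual convnet notation and are illustrative, not from the slides):

// Direct spatial convolution: a 3D input (C x H x W) and K filters
// (each C x R x S) produce a 3D output (K x P x Q). Not MAGMA-DNN code.
#include <vector>
#include <cstdio>

int main() {
    const int C = 3, H = 8, W = 8, K = 4, R = 3, S = 3;
    const int P = H - R + 1, Q = W - S + 1;
    std::vector<float> D(C * H * W, 1.0f), F(K * C * R * S, 0.1f);
    std::vector<float> O(K * P * Q, 0.0f);   // K stacked 2D activation maps

    for (int k = 0; k < K; ++k)              // each filter -> one output frame
      for (int p = 0; p < P; ++p)            // filter slides vertically...
        for (int q = 0; q < Q; ++q) {        // ...and horizontally
          float acc = 0.0f;
          for (int c = 0; c < C; ++c)        // reduce over channels and window
            for (int r = 0; r < R; ++r)
              for (int s = 0; s < S; ++s)
                acc += F[((k * C + c) * R + r) * S + s]
                     * D[(c * H + (p + r)) * W + (q + s)];
          O[(k * P + p) * Q + q] = acc;
        }
    printf("O[0] = %g (expect %g)\n", O[0], 0.1f * C * R * S);
}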

SLIDE 20

MAGMA-DNN: Background / existing technique: unfold and GEMM

[Figure: the unfold-and-GEMM scheme applied to the convolution modules of VGG-16 (configuration D).]

SLIDE 21

MAGMA-DNN: Background / existing technique: unfold and GEMM

Advantages:
  • Unfold involves streaming memcpy and can be made parallel by having many threads working on many sections of the input
  • Many BLAS libraries contain fine-tuned GEMM routines that can be used
  • The output format is consistent with the actual convolution output

Disadvantages:
  • The unfold operation requires extra memory
  • The matrix shapes can be greatly skewed

A sketch of the scheme follows.
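A minimal self-contained sketch of unfold (im2col) + GEMM, producing the same output as the direct loop above (again illustrative sizes, not MAGMA-DNN code). Note that the unfolded matrix U is exactly the extra memory, and that for a 3x3 filter on a large image the resulting GEMM is skewed (CRS rows vs. PQ columns):

#include <vector>
#include <cstdio>

int main() {
    const int C = 3, H = 8, W = 8, K = 4, R = 3, S = 3;
    const int P = H - R + 1, Q = W - S + 1;
    std::vector<double> D(C * H * W, 1.0);       // input data
    std::vector<double> F(K * C * R * S, 0.1);   // filter weights

    // Unfold: each output pixel gets a column of the C*R*S values it sees.
    std::vector<double> U((C * R * S) * (P * Q));
    for (int c = 0; c < C; ++c)
      for (int r = 0; r < R; ++r)
        for (int s = 0; s < S; ++s)
          for (int p = 0; p < P; ++p)
            for (int q = 0; q < Q; ++q)
              U[(((c * R + r) * S + s) * P + p) * Q + q] =
                  D[(c * H + (p + r)) * W + (q + s)];

    // GEMM: O (K x PQ) = F (K x CRS) * U (CRS x PQ). Row-major, naive loop;
    // in practice this call goes to a tuned BLAS routine.
    std::vector<double> O(K * P * Q, 0.0);
    const int M = K, N = P * Q, L = C * R * S;
    for (int i = 0; i < M; ++i)
      for (int j = 0; j < N; ++j) {
        double sum = 0.0;
        for (int l = 0; l < L; ++l) sum += F[i * L + l] * U[l * N + j];
        O[i * N + j] = sum;
      }
    printf("O[0] = %g (expect %g)\n", O[0], 0.1 * C * R * S);
}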

SLIDE 22

MAGMA-DNN: Convolution using transformation techniques

FFT method:
  • Convolution becomes an elementwise product in the frequency domain
  • Complexity in 2D is O(RS log(HW)), better than the O(RSHW) of direct convolution
  • Caution: the filter dimension should be similar to the image dimension, but in widely used convnets RS << HW

Winograd minimal filtering:
  • Best suited for small filters, RS << HW
  • Reduces the arithmetic operations by a constant factor, thus improving the running time

SLIDE 23

MAGMA-DNN: Winograd algorithm

Steps:
  1. Transform a 4x4 image tile
  2. Transform a 3x3 filter
  3. Perform an elementwise product between the transformed tiles
  4. Apply the inverse transform to the product tile
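For concreteness, the F(2x2, 3x3) instance of these steps uses the standard minimal-filtering transform matrices (stated here from the Winograd/Lavin-Gray construction; the slide itself does not list them). With a 4x4 input tile $d$ and a 3x3 filter $g$, the 2x2 output tile is

\[
Y = A^T \left[ (G\,g\,G^T) \odot (B^T d\,B) \right] A,
\]
\[
B^T = \begin{pmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{pmatrix}, \quad
G = \begin{pmatrix} 1 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{pmatrix}, \quad
A^T = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{pmatrix},
\]

where $\odot$ is the elementwise product: 16 multiplies per 2x2 output tile instead of the 36 needed by direct convolution.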

SLIDE 24

MAGMA-DNN: Winograd algorithm: reduction to GEMMs

  • Each image tensor has 16 matrices of size T x C
  • The K filters are reduced to 16 C x K matrices
  • For a batch size of N there are 16N GEMMs; for example, N = 64 gives 1024 GEMMs
  • The 16 filter matrices are common to all the GEMMs, giving better last-level cache efficiency
  • Once the GEMMs are performed, "gather" the elements from the 16 output matrices to form a 4x4 output tile
  • Apply the inverse transform to the output tile to obtain the 2x2 convolution output

SLIDE 25

MAGMA-DNN: Winograd algorithm: advantages of GEMMs

  • Each transformed image tile is reused with the K filters; similarly, each filter tile is reused with all the input tiles across the batch of N
  • If N and K are large enough, the transformation cost is amortized because of the maximal reuse of the transformed tiles
  • Instead of applying the inverse transform across C and then accumulating, the natural form of GEMM accumulates the result across C, and the inverse can then be applied once to the GEMM output tile; this is possible because of the linearity of the Winograd convolution (see the identity below)
  • Fine-tuned GEMM APIs are available
  • Good cache efficiency; good arithmetic intensity
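The accumulate-then-inverse trick is just the linearity of the transforms; in the F(2x2, 3x3) notation above (my notation, not the slides'):

\[
Y_k = \sum_{c=1}^{C} A^T \left[ (G g_{k,c} G^T) \odot (B^T d_c B) \right] A
    = A^T \left[ \sum_{c=1}^{C} (G g_{k,c} G^T) \odot (B^T d_c B) \right] A,
\]

so the channel reduction happens inside the brackets, which is exactly what the 16 GEMMs compute, and the inverse transform is applied once per output tile instead of C times.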

SLIDE 26

Winograd algorithm


SLIDE 27

MAGMA-DNN: Optimizing GEMMs: kernel design

C = βC + αAB, with A (M x K), B (K x N), and C (M x N) partitioned into blocks (A11..A44, B11..B44, C11..C44). Each block of C is the accumulation of the block products along the K dimension, e.g.:

T1 = αA11 B11, T2 = αA12 B21, T3 = αA13 B31, T4 = αA14 B41, C11 = βC11 + Σk Tk

SLIDE 28

MAGMA-DNN: Optimizing GEMMs: kernel design

C = βC + αAB

[Figure: A (M x K), B (K x N), and C (M x N) partitioned into BLK_M x BLK_K, BLK_K x BLK_N, and BLK_M x BLK_N blocks.]

  • Assign every block of C to a thread block (TB)
  • Hold that block of C in registers / shared memory
  • Slide the tiles of A and B along the K dimension and compute C = βC + αAB
  • This design guarantees reproducibility of the results
  • The kernel is parameterized to allow tuning and optimization

SLIDE 29

MAGMA-DNN: Optimizing GEMMs: kernel design

C = βC + αAB

For the block C22, the thread block accumulates the products of the sliding tiles in a temporary T held in registers / shared memory:

T = A1 B1; T += A2 B2; T += A3 B3; T += A4 B4; C22 = βC22 + αT

SLIDE 30

MAGMA-DNN: Optimizing GEMMs: kernel design

[Figure: mapping of the threads (thx, thy) of a thread block onto the tiles of A, B, and C.]

  • Reading/writing the elements is based on the TB size (the number of threads), so it is an extra tuning parameter
  • It can also be different for A, B, and C

SLIDE 31

MAGMA-DNN: Optimizing GEMMs: kernel design

[Figure: double-buffered sliding of the A and B tiles that contribute to C22.]

  • Prefetching: the next tiles of A and B are loaded while the current ones are being multiplied, overlapping memory traffic with computation

A CUDA sketch of the overall design follows.
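Putting the last few slides together, here is a minimal CUDA sketch of the described design: each thread block owns one BLK_M x BLK_N block of C, the K dimension is traversed in BLK_K-wide tiles staged through shared memory, and the block sizes are compile-time parameters so an autotuner can sweep them. Illustrative only (no prefetch/double buffering; sizes assumed to divide M, N, K), not MAGMA's kernel:

#include <cstdio>
#include <cuda_runtime.h>

// Block sizes as template parameters: the tuning knobs of the design.
template <int BLK_M, int BLK_N, int BLK_K>
__global__ void gemm_tiled(int M, int N, int K, double alpha,
                           const double* A, const double* B,
                           double beta, double* C) {
    // This thread block owns the C block at (blockIdx.y, blockIdx.x).
    __shared__ double sA[BLK_M][BLK_K];
    __shared__ double sB[BLK_K][BLK_N];
    const int row = blockIdx.y * BLK_M + threadIdx.y;   // global row in C
    const int col = blockIdx.x * BLK_N + threadIdx.x;   // global col in C
    double acc = 0.0;                                   // C element held in a register

    for (int kb = 0; kb < K; kb += BLK_K) {             // slide tiles along K
        // Stage the current A and B tiles in shared memory (row-major).
        for (int k = threadIdx.x; k < BLK_K; k += BLK_N)
            sA[threadIdx.y][k] = A[row * K + (kb + k)];
        for (int k = threadIdx.y; k < BLK_K; k += BLK_M)
            sB[k][threadIdx.x] = B[(kb + k) * N + col];
        __syncthreads();
        for (int k = 0; k < BLK_K; ++k)                 // multiply the staged tiles
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();
    }
    // Fixed summation order per element => reproducible results.
    C[row * N + col] = beta * C[row * N + col] + alpha * acc;
}

int main() {
    const int M = 64, N = 64, K = 128;
    double *A, *B, *C;                                  // unified memory for brevity
    cudaMallocManaged(&A, M * K * sizeof(double));
    cudaMallocManaged(&B, K * N * sizeof(double));
    cudaMallocManaged(&C, M * N * sizeof(double));
    for (int i = 0; i < M * K; ++i) A[i] = 1.0;
    for (int i = 0; i < K * N; ++i) B[i] = 2.0;
    for (int i = 0; i < M * N; ++i) C[i] = 0.0;

    constexpr int BM = 16, BN = 16, BK = 32;
    dim3 grid(N / BN, M / BM), block(BN, BM);
    gemm_tiled<BM, BN, BK><<<grid, block>>>(M, N, K, 1.0, A, B, 0.0, C);
    cudaDeviceSynchronize();
    printf("C[0] = %g (expect %g)\n", C[0], 2.0 * K);   // 1.0*2.0 summed K times
}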

SLIDE 32

MAGMA-DNN: Optimizing GEMMs: kernel design

Are we done? Do we have our best kernel?

  • No: in most cases (LA algorithms, deep learning, etc.) the matrices A, B, and C are not square, which requires autotuning (a sketch of such a sweep follows)
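The autotuning loop itself is conceptually simple: generate parameterized variants, time each empirically on the target sizes, and keep the fastest. A toy harness showing the timing pattern with a stand-in kernel (the variant and its parameter are illustrative, not MAGMA's autotuner):

#include <cstdio>
#include <cuda_runtime.h>

// Toy "kernel variant", parameterized by an unroll factor; it stands in
// for a GEMM variant parameterized by (BLK_M, BLK_N, BLK_K).
template <int UNROLL>
__global__ void variant(const double* x, double* y, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * UNROLL;
    #pragma unroll
    for (int u = 0; u < UNROLL; ++u)
        if (i + u < n) y[i + u] = 2.0 * x[i + u];
}

// Time one variant with CUDA events; return average milliseconds.
template <int UNROLL>
float time_variant(const double* x, double* y, int n) {
    int threads = 256, blocks = (n / UNROLL + threads - 1) / threads;
    cudaEvent_t t0, t1; cudaEventCreate(&t0); cudaEventCreate(&t1);
    variant<UNROLL><<<blocks, threads>>>(x, y, n);      // warm-up launch
    cudaEventRecord(t0);
    for (int rep = 0; rep < 100; ++rep)
        variant<UNROLL><<<blocks, threads>>>(x, y, n);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    float ms; cudaEventElapsedTime(&ms, t0, t1);
    return ms / 100;
}

int main() {
    const int n = 1 << 22;
    double *x, *y;
    cudaMalloc(&x, n * sizeof(double)); cudaMemset(x, 0, n * sizeof(double));
    cudaMalloc(&y, n * sizeof(double));
    // Empirical sweep over the parameter space; keep the fastest variant.
    printf("unroll=1: %.3f ms\n", time_variant<1>(x, y, n));
    printf("unroll=2: %.3f ms\n", time_variant<2>(x, y, n));
    printf("unroll=4: %.3f ms\n", time_variant<4>(x, y, n));
}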

SLIDE 33

MAGMA-DNN: Optimizing GEMMs: kernel tuning

[Figure: sgemm NN, square sizes: performance (Gflops) vs. matrix size for the five best autotuned kernel variants (gemm 739, 711, 742, 741, 738), against the performance bound and the slowest variant (performance min). CPU: Intel Xeon E5-2650 v3 (Haswell), 2x10 cores @ 2.30 GHz.]

SLIDE 34

MAGMA-DNN: Optimizing GEMMs: kernel tuning

[Figure: sgemm NN, custom (non-square) sizes: performance (Gflops) vs. matrix size for the five best autotuned kernel variants (gemm 636, 777, 765, 745, 732), against the performance bound and the slowest variant (performance min). CPU: Intel Xeon E5-2650 v3 (Haswell), 2x10 cores @ 2.30 GHz.]

SLIDE 35

MAGMA-DNN: Optimizing GEMMs: kernel tuning

[Figure: sgemm NN, square sizes 112-1232: performance (Gflops) vs. matrix size for the five best autotuned kernel variants (gemm 845, 843, 847, 844, 789), against the performance bound and the slowest variant (performance min). CPU: Intel Xeon E5-2650 v3 (Haswell), 2x10 cores @ 2.30 GHz.]

SLIDE 36

MAGMA-DNN: Optimizing GEMMs: kernel tuning

[Figure: sgemm NN, custom sizes 83-913: performance (Gflops) vs. matrix size for the five best autotuned kernel variants (gemm 447, 557, 496, 440, 837), against the performance bound and the slowest variant (performance min). CPU: Intel Xeon E5-2650 v3 (Haswell), 2x10 cores @ 2.30 GHz.]

SLIDE 37
[Figure: Batched GEMM performance on an NVIDIA K40 GPU (15 MP x 192 @ 0.88 GHz) for m = n with k = 8, k = 16, and k = 64, and for m = n = k.]

SLIDE 38

Conclusions and future work

In conclusion:
  • Developed a number of NLA routines in MAGMA targeting applications: high-order FEM, DNN, and data analytics
  • Tensor abstractions and high-performance tensor contractions (for high-order FEM)
  • Multidisciplinary effort
  • Achieve 90+% of the theoretical maximum on GPUs and multicore CPUs
  • Use on-the-fly tensor reshaping to cast tensor contractions as small but many GEMMs, executed using batched approaches
  • Custom-designed GEMM kernels for small matrices, and autotuning

Future directions:
  • Release a tensor contractions package through the MAGMA library
  • Release the NLA backend for DNN and data analytics
  • Integrate the developments in applications
  • Complete the autotuning and develop all kernels
SLIDE 39

Collaborators and Support

MAGMA team

http://icl.cs.utk.edu/magma

Collaborating partners

  • University of Tennessee, Knoxville
  • Lawrence Livermore National Laboratory, Livermore, CA (the LLNL-led ECP CEED: Center for Efficient Exascale Discretizations)
  • University of Manchester, Manchester, UK
  • University of Paris-Sud, France
  • INRIA, France