SLIDE 1

Accelerating Linear Algebra on Small Matrices – from Batched BLAS to Large Scale Solvers

Stan Tomov and Ichitaro Yamazaki

Azzam Haidar, Ahmad Abdelfattah, Mark Gates, and Jack Dongarra

Innovative Computing Laboratory
Department of Electrical Engineering and Computer Science
University of Tennessee, Knoxville

In collaboration with:
LLNL, Livermore, CA, USA
University of Manchester, Manchester, UK
University of Paris-Sud, France

GTC 2018, San Jose, CA, March 26–29, 2018

SLIDE 2

Outline

  • Introduction
  • Batched BLAS
  • MAGMA Batched functionalities and techniques
  • Accelerating applications with Batched LA

    – MAGMA DNN, Templates, and Tensors
    – Fused Batched BLAS
    – Applications in exascale discretizations (CEED project)

  • PART II: Batched computations for large-scale solvers with low-rank approximations and preconditioning

  • Conclusions

SLIDE 3

Dense Linear Algebra in Applications

Dense Linear Algebra (DLA) is needed in a wide variety of science and engineering applications:

  • Linear systems: Solve Ax = b
    – Computational electromagnetics, material science, applications using boundary integral equations, airflow past wings, fluid flow around ships and other offshore structures, and many more
  • Least squares: Find x to minimize || Ax – b ||
    – Computational statistics (e.g., linear least squares or ordinary least squares), econometrics, control theory, signal processing, curve fitting, and many more
  • Eigenproblems: Solve Ax = λx
    – Computational chemistry, quantum mechanics, material science, face recognition, PCA, data mining, marketing, Google PageRank, spectral clustering, vibrational analysis, compression, and many more
  • SVD: A = U Σ V* (Av = σu and A*u = σv)
    – Information retrieval, web search, signal processing, big data analytics, low-rank matrix approximation, total least squares minimization, pseudo-inverse, and many more
  • Many variations depending on the structure of A
    – A can be symmetric, positive definite, tridiagonal, Hessenberg, banded, sparse with dense blocks, etc.
  • DLA is crucial to the development of sparse solvers

SLIDE 4


Provided in MAGMA 2.3

http://icl.cs.utk.edu/magma https://bitbucket.org/icl/magma

SLIDE 5

Why use GPUs in HPC?

PERFORMANCE & ENERGY EFFICIENCY

[Figure: MAGMA 2.3 LU factorization in double precision arithmetic – performance (GFLOP/s) vs. matrix size N x N, for CPU, K40, P100, and V100]

CPU: Intel Xeon E5-2650 v3 (Haswell), 2 x 10 cores @ 2.30 GHz
K40: NVIDIA Kepler GPU, 15 MP x 192 @ 0.88 GHz
P100: NVIDIA Pascal GPU, 56 MP x 64 @ 1.19 GHz
V100: NVIDIA Volta GPU, 80 MP x 64 @ 1.38 GHz

[Figure: GFLOPs/Watt for CPU, K40, P100, and V100 – about 10x energy efficiency for the GPUs under ~ the same power draw]

SLIDE 6

Many applications need LA on many small matrices

Data analytics and the associated linear algebra on many small problems are needed in many applications:

  • Machine learning,
  • Data mining,
  • High-order FEM,
  • Numerical LA,
  • Graph analysis,
  • Neuroscience,
  • Astrophysics,
  • Quantum chemistry,
  • Multi-physics problems,
  • Signal processing, etc.

Machine learning:

[Figure: deep network pipeline – input data D, convolution with filters Fn, pooling, convolution, pooling, fully connected, output predictions On (e.g., chicken 0.4, boat 0.3, person 0.1, dog 0.01)]

Convolution of filters Fn (feature detection) and input image D:

  • For every filter Fn and every channel, the computation of every pixel value On,k is a tensor contraction:

      O(n,k) = Σi F(n,i) · D(k,i)

  • Plenty of parallelism; small operations that must be batched
  • With data “reshape” the computation can be transformed into a batched GEMM (for efficiency; among other approaches)

Sparse/dense solvers & preconditioners: batched LAPACK, single calls to Batched BLAS, and DAG-based factorizations on sparse/dense matrix systems.

Applications using high-order FEM:

  • Matrix-free basis evaluation needs efficient tensor contractions:

      C(i1,i2,i3) = Σk A(k,i1) · B(k,i2,i3)

  • Within the ECP CEED project, designed MAGMA batched methods to split the computation into many small high-intensity GEMMs, grouped together (batched) for efficient execution:

      Batch_{ C_i3 = A^T B_i3, for a range of i3 }
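This contraction maps directly onto a batched GEMM with a shared A. A minimal host-side sketch using cuBLAS's cublasDgemmBatched (the sizes and the shared-A setup are illustrative assumptions; error checking omitted):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        // Illustrative sizes: A is k x m (shared), each B_i is k x n, C_i = A^T B_i is m x n.
        const int k = 8, m = 8, n = 64, batchCount = 1000;
        double *dA, *dB, *dC;
        cudaMalloc((void**)&dA, sizeof(double) * k * m);
        cudaMalloc((void**)&dB, sizeof(double) * k * n * batchCount);
        cudaMalloc((void**)&dC, sizeof(double) * m * n * batchCount);
        // ... fill dA (basis matrix) and dB (batch of inputs) here ...

        // Array-of-pointers interface: A repeats, B_i and C_i are strided.
        std::vector<const double*> hA(batchCount), hB(batchCount);
        std::vector<double*> hC(batchCount);
        for (int i = 0; i < batchCount; ++i) {
            hA[i] = dA;
            hB[i] = dB + (size_t)i * k * n;
            hC[i] = dC + (size_t)i * m * n;
        }
        const double **dA_array, **dB_array; double **dC_array;
        cudaMalloc((void**)&dA_array, batchCount * sizeof(double*));
        cudaMalloc((void**)&dB_array, batchCount * sizeof(double*));
        cudaMalloc((void**)&dC_array, batchCount * sizeof(double*));
        cudaMemcpy(dA_array, hA.data(), batchCount * sizeof(double*), cudaMemcpyHostToDevice);
        cudaMemcpy(dB_array, hB.data(), batchCount * sizeof(double*), cudaMemcpyHostToDevice);
        cudaMemcpy(dC_array, hC.data(), batchCount * sizeof(double*), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const double alpha = 1.0, beta = 0.0;
        // C_i (m x n) = A^T (m x k) * B_i (k x n), for all i in one call
        cublasDgemmBatched(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                           &alpha, dA_array, k, dB_array, k,
                           &beta, dC_array, m, batchCount);
        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        cudaFree(dA_array); cudaFree(dB_array); cudaFree(dC_array);
        return 0;
    }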

SLIDE 7
1. Non-batched computation

  • Loop over the matrices one by one and compute each with multithreaded BLAS. Since the matrices are small, there is not enough work for all the cores, so we expect low performance; thread contention may further hurt it.

    for (i = 0; i < batchcount; i++)
        dgemm(…);

There is not enough work to fill all the cores. A low percentage of the resources is used.

MAGMA Batched Computations

SLIDE 8

Batched_dgemm(…)

2. Batched computation

  • Distribute all the matrices over the available resources by assigning each matrix to a group of cores (CPU) or thread blocks (TBs, GPU) that operates on it independently:
    – For very small matrices, assign one matrix per core (CPU) or per TB (GPU)
    – For medium sizes, a matrix goes to a team of cores (CPU) or to many TBs (GPU)
    – For large sizes, switch to the classical multithreaded approach, one matrix per round
  • The kernel design decides the number of TBs or threads (GPU/CPU), dispatched through the Nvidia/OpenMP scheduler (task manager/dispatcher)

A high percentage of the resources is used.

MAGMA Batched Computations
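The size-based dispatch can be summarized as in the following sketch (the thresholds are illustrative assumptions, not MAGMA's tuned crossover points):

    enum class BatchStrategy {
        OnePerThreadBlock,          // very small: one matrix per core (CPU) / thread block (GPU)
        ManyThreadBlocksPerMatrix,  // medium: a team of cores / many thread blocks per matrix
        NonBatched                  // large: classical multithreaded BLAS, one matrix at a time
    };

    // Illustrative thresholds only; real libraries tune these per architecture.
    BatchStrategy choose_strategy(int n) {
        if (n <= 32)   return BatchStrategy::OnePerThreadBlock;
        if (n <= 1000) return BatchStrategy::ManyThreadBlocksPerMatrix;
        return BatchStrategy::NonBatched;
    }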

SLIDE 11

Implementation on current hardware is becoming challenging.

Draft reports:
Batched BLAS Draft Report: https://www.dropbox.com/s/olocmipyxfvcaui/batched_api_03_30_2016.pdf?dl=0
Batched BLAS Poster: https://www.dropbox.com/s/ddkym76fapddf5c/Batched%20BLAS%20Poster%2012.pdf?dl=0
Batched BLAS Slides: https://www.dropbox.com/s/kz4fhcipz3e56ju/BatchedBLAS-1.pptx?dl=0
Webpage on ReproBLAS: http://bebop.cs.berkeley.edu/reproblas/
Efficient Reproducible Floating Point Summation and BLAS: http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-229.pdf

Workshops on Batched, Reproducible, and Reduced Precision BLAS

University of Tennessee, Knoxville, May 18-19, 2016

http://bit.ly/Batch-BLAS-2016

Georgia Tech, Computational Science and Engineering, Atlanta, GA, February 23–25, 2017

http://bit.ly/Batch-BLAS-2017

Memory hierarchies and flops-per-byte ratios for different architectures (all flop rates for 64-bit operands):

                     Intel Haswell    Intel KNL 7250    ARM Cortex A57   Nvidia P100     Nvidia V100
                     E5-2650 v3       (DDR4 | MCDRAM)
Cores                10               68                4                56 SM x 64      80 SM x 64
Peak (DP)            368 Gflop/s      2662 Gflop/s      32 Gflop/s       4700 Gflop/s    7500 Gflop/s
Power                105 Watts        215 Watts         7 Watts          250 Watts       300 Watts
Registers            16/core AVX2     32/core AVX-512   32/core          256 KB/SM       256 KB/SM
L1 cache / GPU
  shared memory      32 KB/core       32 KB/core        32 KB/core       64 KB/SM        96 KB/SM
L2 cache             256 KB/core      1024 KB/2 cores   2 MB             4 MB            6 MB
L3 cache             25 MB            0...16 GB         N/A              N/A             N/A
Main memory          64 GB            384 | 16 GB       4 GB             16 GB           16 GB
Main memory BW       68 GB/s          115 | 421 GB/s    26 GB/s          720 GB/s        900 GB/s
  flops/byte         5.4              23 | 6            1.2              6.5             8.3
PCIe gen3 x16 /
  NVLink             16 GB/s          16 GB/s           16 GB/s          16 GB/s         300 GB/s (NVL)
  flops/byte         23               166               2                294             25
InfiniBand EDR       12 GB/s          12 GB/s           12 GB/s          12 GB/s         12 GB/s
  flops/byte         30               221               2.6              392             625

SLIDE 12

Toward a standard Batched BLAS API

Status and goal:

  • Batched BLAS functionality has become a major factor in our community
  • Batched routines are gradually making their way into vendor libraries (Nvidia, Intel, etc.) as well as into research software (MAGMA, Kokkos, etc.)
  • Today’s APIs differ significantly, which can lead to poor portability
  • Thus the community needs to make an effort to standardize the Batched BLAS API

Reference: J. Dongarra, I. Duff, M. Gates, A. Haidar, S. Hammarling, N. J. Higham, J. Hogg, P. Valero-Lara, S. D. Relton, S. Tomov, and M. Zounon. A Proposed API for Batched Basic Linear Algebra Subprograms. MIMS EPrint 2016.25, 2016. (Available at http://eprints.ma.man.ac.uk/2464/)

SLIDE 13

Toward a standard Batched BLAS API

Matrices are stored BLAS-like, “the usual storage that we know”:

  • Array of pointers: an array containing a pointer to each matrix
    – Data could belong to one memory allocation
    – Data could be anywhere, in different allocations
    – Matrices could be equidistant from each other or not
  • Suitable for CPU, GPU, and Phi
  • Accommodates most of the cases
  • User has to fill in the array of pointers

SLIDE 14

Toward a standard Batched BLAS API

Matrices are stored BLAS-like, “the usual storage that we know”:

  • Array of pointers: an array containing a pointer to each matrix
  • Strided: one pointer to memory, with the matrices strided inside it
    – Fixed stride
    – Variable stride
  • Suitable for CPU, GPU, and Phi
  • For variable stride, the user has to fill in the stride array
  • Cannot accommodate data that was not allocated within the same chunk of memory. Think about adding matrices to the batch.
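A minimal sketch contrasting the two layouts (plain C++, illustrative, not a specific library's API): one strided allocation, and the array of pointers that an array-of-pointers interface would consume, derived from it.

    #include <vector>

    int main() {
        const int n = 16, lda = 16, batchCount = 1000;
        // Strided layout: one allocation, matrix i starts at a fixed stride.
        std::vector<double> storage((size_t)lda * n * batchCount);
        const size_t stride = (size_t)lda * n;

        // Array-of-pointers layout derived from it: one pointer per matrix.
        // (The pointers could equally come from separate allocations.)
        std::vector<double*> A_array(batchCount);
        for (int i = 0; i < batchCount; ++i)
            A_array[i] = storage.data() + i * stride;
        return 0;
    }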

SLIDE 15

Toward a standard Batched BLAS API

Matrices are stored BLAS-like, “the usual storage that we know”:

  • Array of pointers: an array containing a pointer to each matrix
  • Strided: one pointer to memory, with the matrices strided inside it

Matrices are stored in an interleaved or compact fashion:

  • Data can be interleaved by batchcount or by chunk (SIMD, AVX, warp)
  • Only good for sizes less than ~20, and only for some routines such as GEMM and TRSM; it has performance and implementation issues for routines like LU or QR factorization
  • Requires the user or the implementation to convert/reshuffle the memory storage, since most storage is BLAS-like

SLIDE 16

Toward a standard Batched BLAS API

API discussion:

  • Same or separate API for fixed- and variable-size batches?
    – Have two separate APIs?
    – Have a flag that switches between fixed and variable?
  • To simplify the user’s life and avoid a combinatorial explosion of parameters, we propose to distinguish between fixed- and variable-size APIs:

void batchedblas_dgemm_batched(
    batched_trans_t transA, batched_trans_t transB,
    batched_int_t m, batched_int_t n, batched_int_t k,
    double alpha,
    double const * const * dA_array, batched_int_t ldda,
    double const * const * dB_array, batched_int_t lddb,
    double beta,
    double ** dC_array, batched_int_t lddc,
    batched_int_t batchCount, batched_queue_t queue);

void batchedblas_dgemm_vbatched(
    batched_trans_t transA, batched_trans_t transB,
    batched_int_t *m, batched_int_t *n, batched_int_t *k,
    double alpha,
    double const * const * dA_array, batched_int_t *ldda,
    double const * const * dB_array, batched_int_t *lddb,
    double beta,
    double ** dC_array, batched_int_t *lddc,
    batched_int_t batchCount, batched_queue_t queue);

SLIDE 17

Toward a standard Batched BLAS API

  • Consider GEMM as an example
  • ‘F’ = Fixed, unified across a batch/group of problems
  • ‘V’ = Variable, each problem has its own assigned value

                                option args        scaling args    sizes, ld’s, inc’s          data pointers
                                (transA, transB)   (alpha, beta)   (m, n, k), (lda, ldb, ldc)  (A, B, C)
Flag-based      BATCH_FIXED     F                  F               F                           V
(flat,          BATCH_VAR       V                  V               V                           V
 standard)
Group (MKL)     per group       F                  F               F                           V
                across groups   V                  V               V                           V
Flat (MAGMA)    fixed API       F                  F               F                           V
                variable API    F                  F               V                           V
cuBLAS (flat)                   F                  F               F                           V

SLIDE 18
  • This API supports over 1000 possibilities for GEMM
  • The size of C must be equal to batchCount
  • The size of other arguments can be either 1 or batchCount
  • Error checking through std::vector<blas_int> info
  • The group API can be supported by promoting batchCount to be a std::vector

template <typename T>
void gemm_batch(
    std::vector<Op> const &transA,
    std::vector<Op> const &transB,
    std::vector<blas_int> const &m,
    std::vector<blas_int> const &n,
    std::vector<blas_int> const &k,
    std::vector<T> const &alpha,
    std::vector<T*> const &A,
    std::vector<blas_int> const &lda,
    std::vector<T*> const &B,
    std::vector<blas_int> const &ldb,
    std::vector<T> const &beta,
    std::vector<T*> const &C,
    std::vector<blas_int> const &ldc,
    const blas_int batchCount);
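As an illustration of the length-1 broadcast rule above (a hedged sketch assuming the gemm_batch declaration just shown; Op, blas_int, and Op::NoTrans come from the proposed C++ API and are assumptions here): a batch of equally sized problems passes length-1 vectors for everything except the data pointers.

    // All problems share transA/transB, sizes, leading dimensions, and scalars,
    // so those vectors have length 1; only A, B, C have length batchCount.
    const blas_int batchCount = 100, n = 16;
    std::vector<Op> transA = { Op::NoTrans }, transB = { Op::NoTrans };
    std::vector<blas_int> m = { n }, nn = { n }, k = { n };
    std::vector<blas_int> lda = { n }, ldb = { n }, ldc = { n };
    std::vector<double> alpha = { 1.0 }, beta = { 0.0 };
    std::vector<double*> A(batchCount), B(batchCount), C(batchCount); // filled by the caller
    gemm_batch<double>(transA, transB, m, nn, k, alpha, A, lda, B, ldb, beta, C, ldc, batchCount);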

Batched BLAS API for C++

SLIDE 19

Batched routines released in MAGMA

SLIDE 20

How to implement fast batched DLA?

Problem sizes influence algorithms & optimization techniques.

[Figure: Gflop/s of batched dgemm (BLAS 3) vs. standard dgemm (BLAS 3) on an Nvidia V100 GPU for 50–1,000 matrices, plotted against the (fixed) matrix sizes in the batch, for batch sizes 1,000, 300, and 50. Batched is ~19x faster for small sizes and ~1.4x for medium sizes; for large sizes, switch to non-batched.]

[Figure: GEMM kernel blocking – C (M x N) is partitioned into BLK_M x BLK_N tiles (C11 … C44); each thread block of threads (thx, thy) computes one tile of C, sweeping over BLK_K-wide blocks of A (M x K) and B (K x N).]

  • Reading/writing the elements is based on the thread block size (# threads), so it is an extra tuning parameter
  • It could also be different for A, B, and C

Optimizing GEMMs: Kernel design

Kernels are designed for various scenarios and parameterized, so that an autotuning framework can find the “best” performing kernels.
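As an illustration of such parameterization, here is a deliberately naive batched DGEMM kernel (a sketch only, not MAGMA's kernel): the thread-block shape is a compile-time parameter an autotuner could sweep, and blockIdx.z selects the matrix within the batch. Production kernels additionally block in shared memory and registers (the BLK_M, BLK_N, BLK_K tiling above).

    #include <cuda_runtime.h>

    // Naive batched DGEMM: C_i = alpha*A_i*B_i + beta*C_i, column-major,
    // one matrix per blockIdx.z, one element of C per thread.
    template <int DIM_X, int DIM_Y>
    __global__ void dgemm_batched_naive(int M, int N, int K, double alpha,
                                        const double* const* dA_array, int ldda,
                                        const double* const* dB_array, int lddb,
                                        double beta, double* const* dC_array, int lddc)
    {
        const int row = blockIdx.x * DIM_X + threadIdx.x;   // row of C
        const int col = blockIdx.y * DIM_Y + threadIdx.y;   // column of C
        if (row >= M || col >= N) return;
        const double* A = dA_array[blockIdx.z];             // this block's matrix
        const double* B = dB_array[blockIdx.z];
        double*       C = dC_array[blockIdx.z];
        double acc = 0.0;
        for (int k = 0; k < K; ++k)
            acc += A[row + (size_t)k * ldda] * B[k + (size_t)col * lddb];
        C[row + (size_t)col * lddc] = alpha * acc + beta * C[row + (size_t)col * lddc];
    }

    // Example launch (16 x 16 thread blocks; an autotuner would sweep DIM_X/DIM_Y):
    // dim3 grid((M + 15) / 16, (N + 15) / 16, batchCount), block(16, 16);
    // dgemm_batched_naive<16,16><<<grid, block>>>(M, N, K, alpha, dA_array, ldda,
    //                                             dB_array, lddb, beta, dC_array, lddc);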

SLIDE 21

GPU Optimization Summary

  • Hardware concepts
    – CUDA core
    – Warp
    – Half-warp
    – Register file
    – Shared memory
    – Atomics
    – Shuffles
    – SMX
  • Software concepts
    – Stream
    – Thread block
    – Kernel
    – Inlining
    – Intrinsics
  • Algorithmic concepts
    – Blocking
    – Recursive blocking
    – Kernel replacement
    – Out-of-place operations

How to implement fast batched DLA?

SLIDE 22

Performance of MAGMA Batched (GEMM)

SLIDE 23

Performance of MAGMA Batched (LU)

SLIDE 24

Performance of MAGMA Batched (QR)

SLIDE 25

Performance of MAGMA Batched (Cholesky)

SLIDE 26

Performance of MAGMA Batched (LU)

SLIDE 27

Performance of MAGMA Batched (QR)

SLIDE 28

Batched computations in applications

SLIDE 29

MagmaDNN – Data Analytics Tool

Ø MagmaDNN 0.1-Alpha – HP Data analytics and ML

GPU-accelerated numerical software using MAGMA as computational backend to accelerate its LA computations

Ø Open source; looking for feedback and contributions

Started with students from REU/RECSEM program https://bitbucket.org/icl/magmadnn

Ø Implemented/proposed so far

Ø Tensors and tensor operations Ø Deep learning primitives: Fully-connected layers, convolutional layers, pooling layers, activation layers, and output layers. All of them support SGD back-propagation training Ø Established adapters for calling CuDNN Ø Applied MagmaDNN to the MNIST benchmark using multilayer perceptron or a convolutional neural network.

Provided in MAGMA 2.3

http://icl.cs.utk.edu/magma https://bitbucket.org/icl/magmadnn

w/ Lucien Ng, The Chinese University of Hong Kong Kwai Wong The Joint Institute for Computational

Sciences (JICS), UTK

SLIDE 30

Fully connected layers

Fully-connected 3-layer Neural Network example

Ø Data (input, output, NN weights, etc.) is handled through tensor abstractions

// 2d tensor for n_images and n_features in the corresponding dimensions Tensor<float> Images = Tensor<float>({n_images, n_features});

Ø Support for various layers:

Fully connected (FCLayer), convolution, activation, flatten, pooling, input, output, etc. layers

// Create layers for the network FCLayer<float> *FC1 = new FCLayer<float>(&inputLayer, 128); ActivationLayer<float> *actv1 = new ActivationLayer<float>(FC1, SIGMOID); FCLayer<float> *FC2 = new FCLayer<float>(actv1, n_output_classes);

Ø Support networks – composed of layers

std::vector<Layer<float>*> vec_layer; vec_layer.push_back(&inputLayer); vec_layer.push_back(FC1); vec_layer.push_back(actv1); vec_layer.push_back(FC2); …

SLIDE 31

Convolutional network layers

Convolution Network (ConvNet) example

Ø Layers are typically 3D volumes Ø Handled through tensors Ø Each layer transforms 3D tensor to 3D tensor Ø Layers support the forward and backward pass algorithms for the training Ø Support for optimization solvers (GD and derivatives) Ø Gradient Descent (GD) Ø Stochastic Gradient Descent (SGD) Ø Mini-Batch Gradient Descent (MB-GD)

SLIDE 32

How to accelerate on manycore GPUs and CPUs?

Convolution Network (ConvNet) example

Ø Convolutions can be accelerated in various ways: Ø Unfold and GEMM Ø FFT Ø Winograd minimal filtering – reduction to batched GEMMs Ø Use autotuning to handle complexity of tuning

Require matrix-matrix products of various sizes, including batched GEMMs
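To make the unfold-and-GEMM idea concrete, here is a minimal single-channel sketch in plain C++ (an illustration of the technique, not MagmaDNN's implementation): im2col copies each k x k input patch into a column, after which the convolution is a single GEMM of the flattened filter against the patch matrix.

    #include <vector>

    // Unfold ("im2col"): each kxk patch of an h x w input becomes a column,
    // so convolution with a kxk filter reduces to a GEMM. Single channel,
    // stride 1, no padding, row-major storage, for clarity.
    std::vector<float> im2col(const std::vector<float>& img, int h, int w, int k) {
        int oh = h - k + 1, ow = w - k + 1;              // output dimensions
        std::vector<float> cols((size_t)k * k * oh * ow);
        for (int i = 0; i < oh; ++i)
            for (int j = 0; j < ow; ++j)
                for (int di = 0; di < k; ++di)
                    for (int dj = 0; dj < k; ++dj)
                        // row (di*k + dj) of the patch matrix, column (i*ow + j)
                        cols[(size_t)(di * k + dj) * oh * ow + i * ow + j] =
                            img[(size_t)(i + di) * w + (j + dj)];
        return cols;
    }

    // Convolution as GEMM: out(1 x n) = filter(1 x kk) * cols(kk x n),
    // with kk = k*k and n = oh*ow.
    std::vector<float> conv_as_gemm(const std::vector<float>& filter,
                                    const std::vector<float>& cols,
                                    int kk, int n) {
        std::vector<float> out(n, 0.0f);
        for (int r = 0; r < kk; ++r)
            for (int c = 0; c < n; ++c)
                out[c] += filter[r] * cols[(size_t)r * n + c];
        return out;
    }

With many filters and many images, each of these GEMMs becomes one entry of a batched GEMM, which is where the batched kernels of this talk come in.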

SLIDE 33

Examples …

SLIDE 34

Problem sizes influence algorithms & optimization techniques

Used in High-order (HO) Finite Element Methods (FEM)

Code Generation

C++11 features will be used as much as possible. Additional needs will be handled by defining a domain-specific embedded language (DSEL). This technique is used in C++ to take advantage of DSL features while using the optimizations provided by a standard compiler. It will handle the generation of versions (index reordering, next) to be empirically evaluated as part of the autotuning framework.

Autotuning

We are developing fixed-size gemm kernels for GPUs, Xeon Phi, and multicore (see the figure on the right for a single core of an Intel Xeon E5-2620 and a K40) through an autotuning framework. A number of generic versions are developed and parameterized for performance. The parameters are autotuned (empirically) to find the “best” kernels for a specific size.

Tensor operations in high-order FEM

Consider the FE mass matrix ME for an element/zone E with weight ρ, as a 2-dimensional tensor:

    (ME)ij = Σk αk ρ(qk) |JE(qk)| φi(qk) φj(qk),   i, j = 1, …, nd,

where nd is the number of degrees of freedom, nq the number of quadrature points, {φi} the FE basis functions, {qk, αk} the quadrature points and weights, and |JE| the determinant of the element transformation. Take the nq x nd matrix B, Bki = φi(qk), and the diagonal matrix DE, (DE)kk = αk ρ(qk) |JE(qk)|. Then ME = BT DE B, or omitting the E subscript, M = BT D B. Using FE of order p, we have nd = O(p^d) and nq = O(p^d), so B is a dense O(p^d) x O(p^d) matrix. If the FE basis and the quadrature rule have tensor product structure, we can decompose the dof and quadrature point indices into logical coordinate axes, i = (i1, …, id), j = (j1, …, jd), k = (k1, …, kd), so Mij can be viewed as a 2d-dimensional tensor Mi1,…,id,j1,…,jd.
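As a plain-C++ illustration of the factorization M = BT D B above (a sketch under the definitions of B and D just given, not a library kernel):

    #include <vector>

    // Assemble the element mass matrix M = B^T D B (column-major storage).
    // B is nq x nd with B(k,i) = phi_i(q_k); d holds the diagonal of D,
    // d[k] = alpha_k * rho(q_k) * |J(q_k)|.
    std::vector<double> assemble_mass(const std::vector<double>& B,
                                      const std::vector<double>& d,
                                      int nq, int nd)
    {
        std::vector<double> M((size_t)nd * nd, 0.0);
        for (int j = 0; j < nd; ++j)
            for (int i = 0; i < nd; ++i)
                for (int k = 0; k < nq; ++k)
                    M[i + (size_t)j * nd] += B[k + (size_t)i * nq] * d[k] * B[k + (size_t)j * nq];
        return M;
    }

In practice this triple loop is a GEMM with a diagonal scaling folded in, which is how it maps onto the batched GEMMs discussed throughout this talk.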

Summary of kernels needed:

  • Assembly of M, referred to as equations (1) & (2) below
  • Evaluations of M times V, referred to as equations (3) & (4) below

Towards a High-Performance Tensor Algebra Package for Accelerators

M. Baboulin, V. Dobrev, J. Dongarra, C. Earl, J. Falcou, A. Haidar, I. Karlin, T. Kolev, I. Masliah, and S. Tomov

Abstract

Numerous important applications, e.g., high-order FEM simulations, can be expressed through tensors. Examples are the computation of FE matrices and SpMV products expressed as generalized tensor contractions. Contractions by the first index can often be represented as tensor index reordering plus gemm, which is a key factor in achieving high performance. We present ongoing work on the design of a high-performance package in MAGMA for tensor algebra that includes techniques to organize tensor contractions, data storage, and parametrization related to the batched execution of large numbers of small tensor contractions. We apply autotuning and code generation techniques to provide an architecture-aware, user-friendly interface.

Motivation

Numerous important applications can be expressed through tensors:

  • High-order FEM simulations
  • Signal Processing
  • Numerical Linear Algebra
  • Numerical Analysis
  • Data Mining
  • Deep Learning
  • Graph Analysis
  • Neuroscience and more

The goal is to design a:

  • High-performance package for Tensor algebra
  • Built-in architecture-awareness (GPU, Xeon Phi, multicore)
  • User-friendly interface

Example cases

Numerical linear algebra:

  • A 4-dimensional tensor contraction
  • Rank-k update on matrices in tile format (k can be small, e.g., sub-vector/warp size)
  • Must determine (in software) whether it is possible to do it through batched GEMM kernels

[1] V. Dobrev, T. Kolev, R. Rieben. High order curvilinear finite element methods for Lagrangian hydrodynamics. SIAM J. Sci. Comp. 34(5), B606–B641. (36 pages)

APPROACH AND RESULTS

User-friendly interface

We provide various interfaces, including one using C++11. The top-level design provides features similar to the mshadow library: https://github.com/dmlc/mshadow

Index reordering/reshape

If we store tensors as column-wise 1D arrays, M can be interpreted as a 4th-order tensor, an nd x nd matrix, or a vector of size nd^2, without changing the storage. In general, an n1 x … x nr tensor can be reshaped into an m1 x … x mq tensor as long as n1…nr = m1…mq and, for corresponding indices i1..r and j1..q,

    i1 + n1 i2 + … + n1 n2 … nr-1 ir = j1 + m1 j2 + … + m1 m2 … mq-1 jq.

Contractions can be implemented as a sequence of pairwise contractions. There is enough complexity here to search for something better: code generation, index reordering, and autotuning will be used; e.g., contractions (3a)–(4f) can be implemented as tensor index reordering plus a gemm A, B -> ATB.
// Our current interface:
// create a 2 x 5 x 2 float tensor; default locality is cpu,
// using std::vector as the default backend for the data
Tensor<2,5,2> ts;
// create a 2 x 5 x 2 tensor on the gpu, using thrust as the default backend for the data
Tensor<2,5,2,gpu_> d_ts;
// Call a thrust function to set values to 9
thrust::fill(d_ts.begin(), d_ts.end(), 9);
// Send back values to the cpu tensor
ts = d_ts;
// Reorder the 2 x 5 x 2 tensor into a 2 x 10 matrix using views
view<2,10> mat = ts;


Batched LA

Tensor contractions are transformed through reshapes into batched LA operations, many of which are available in MAGMA [2] (http://icl.cs.utk.edu/magma/), including LU, QR, Cholesky, GEMM, GEMV, TRSM, SYRK.

[2] A. Haidar, T. Dong, S. Tomov, P. Luszczek, and J. Dongarra. A framework for batched and GPU-resident factorization algorithms applied to block Householder transformations. ISC High Performance 2015, Frankfurt, Germany, July 12-16, 2015.

Conclusions and Future directions

  • A high-performance package for tensor algebra has the potential for high impact on a number of important applications
  • Multidisciplinary effort
  • Current results show promising performance; various components will be leveraged from autotuned MAGMA Batched linear algebra kernels and BLAST from LLNL
  • This is ongoing work

Figure: Batched dgemms on a K40 GPU, batch count 2,000. MAGMA exceeds cuBLAS in performance for “small” sizes, currently tuned for sizes above 32. Current work concentrates on kernels for fixed smaller (sub-warp) sizes.

Gatlinburg, Tennessee, Aug 31 – Sept 2, 2015. http://computing.ornl.gov/workshops/SMC15/

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

LLNL release number LLNL-POST-676632

ICL's work on this material was supported by the National Science Foundation under Grant ACI-1339822, the Department of Energy, and NVIDIA.

Tensor contractions for high-order FEM
http://ceed.exascaleproject.org/

Matrix-free basis evaluation needs efficient tensor contractions. Within the ECP CEED project, we designed batched methods to split the computation into many small high-intensity GEMMs, grouped together (batched) for efficient execution:

    C(i1,i2,i3) = Σk A(k,i1) · B(k,i2,i3),  computed as  Batch_{ C_i3 = A^T B_i3, for a range of i3 }

SLIDE 35

http://ceed.exascaleproject.org/

Tensor contractions for high-order FEM

Reference: A. Abdelfattah, M. Baboulin, V. Dobrev, J. Dongarra, C. Earl, J. Falcou, A. Haidar, I. Karlin, Tz. Kolev, I. Masliah, S. Tomov. High-Performance Tensor Contractions for GPUs. ICCS 2016, San Diego, CA, June 6–8, 2016.

SLIDE 36

http://ceed.exascaleproject.org/

Tensor contractions for high-order FEM

Fused kernels: [comparison of non-fused vs. fused kernel implementations]

SLIDE 37

http://ceed.exascaleproject.org/

Tensor contractions for high-order FEM

Fused kernels:

[Figure: Gflop/s of batched B^T D (B A B^T) B contractions in double precision on a Tesla V100 GPU – cuBLAS vs. MAGMA vs. MAGMA-Fused – for sizes and batch counts from (2x2) x 1,048,576 to (10x9) x 8,192, with speedup annotations ranging from 11.69x to 100.28x.]

Performance on realistic (application) sizes. Device BLAS interfaces for fused Batched BLAS.

SLIDE 38

Collaborators and Support

MAGMA team

http://icl.cs.utk.edu/magma

Collaborating partners

University of Tennessee, Knoxville
Lawrence Livermore National Laboratory, Livermore, CA
LLNL-led ECP CEED: Center for Efficient Exascale Discretizations
University of Manchester, Manchester, UK
University of Paris-Sud, France
INRIA, France

PLASMA team

http://icl.cs.utk.edu/plasma