CUTENSOR: High-Performance CUDA Tensor Primitives, Paul Springer (PowerPoint PPT presentation)



SLIDE 1

High-Performance CUDA Tensor Primitives

CUTENSOR

Paul Springer, Chen-Han Yu, March 20th 2019

pspringer@nvidia.com and chenhany@nvidia.com

SLIDE 2

ACKNOWLEDGMENTS

  • Collaborators outside of NVIDIA
    • Dmitry Liakh (TAL-SH)
    • Jutho Haegeman (Julia)
    • Tim Besard (Julia)
  • Colleagues at NVIDIA
    • Albert Di
    • Alex Fit-Florea
    • Evghenii Gaburov
    • Harun Bayraktar
    • Sharan Chetlur
    • Timothy Costa
    • Zachary Zimmerman

* alphabetical order

SLIDE 3

WHAT IS A TENSOR?

  • mode-0: scalar β
  • mode-1: vector B_i
  • mode-2: matrix B_{i,j}
  • mode-n: general tensor B_{i,j,k}
SLIDE 4

WHAT IS A TENSOR?

  • mode-0: scalar β
  • mode-1: vector B_i
  • mode-2: matrix B_{i,j}
  • mode-n: general tensor B_{i,j,k,l}
SLIDE 5

WHAT IS A TENSOR?

  • mode-0: scalar β
  • mode-1: vector B_i
  • mode-2: matrix B_{i,j}
  • mode-n: general tensor B_{i,j,k,l,m}
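The mode count is simply the tensor's rank, i.e. the number of indices needed to address one element. A quick host-side illustration with NumPy (the shapes below are invented for the example; cuTENSOR itself is not involved):

```python
import numpy as np

# mode-0: a scalar needs no index
beta = np.float64(3.14)
# mode-1: a vector B_i needs one index
v = np.zeros(8)
# mode-2: a matrix B_{i,j} needs two indices
m = np.zeros((8, 4))
# mode-5: a general tensor B_{i,j,k,l,m} needs five indices
t = np.zeros((8, 4, 2, 5, 3))

# NumPy's ndim is exactly the mode count
print(v.ndim, m.ndim, t.ndim)  # 1 2 5
```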
SLIDE 6

BASIC LINEAR ALGEBRA SUBPROGRAMS (BLAS)

A Success Story

  • 1969 – BLAS Level 1: Vector-Vector

[Diagram: vector = β · vector + vector]

SLIDE 7

BASIC LINEAR ALGEBRA SUBPROGRAMS (BLAS)

A Success Story

  • 1969 – BLAS Level 1: Vector-Vector
  • 1972 – BLAS Level 2: Matrix-Vector

[Diagrams: vector = β · vector + vector; vector = matrix ∗ vector]

SLIDE 9

BASIC LINEAR ALGEBRA SUBPROGRAMS (BLAS)

A Success Story

  • 1969 – BLAS Level 1: Vector-Vector
  • 1972 – BLAS Level 2: Matrix-Vector
  • 1980 – BLAS Level 3: Matrix-Matrix

[Diagrams: vector = β · vector + vector; vector = matrix ∗ vector; matrix = matrix ∗ matrix]

SLIDE 11

BASIC LINEAR ALGEBRA SUBPROGRAMS (BLAS)

A Success Story

  • 1969 – BLAS Level 1: Vector-Vector
  • 1972 – BLAS Level 2: Matrix-Vector
  • 1980 – BLAS Level 3: Matrix-Matrix
  • Now? – BLAS Level 4: Tensor-Tensor

[Diagrams: vector = β · vector + vector; vector = matrix ∗ vector; matrix = matrix ∗ matrix; tensor = tensor ∗ tensor]

SLIDE 14

TENSORS ARE UBIQUITOUS

Potential Use Cases: Deep Learning, Quantum Chemistry, Condensed Matter Physics (e.g., Pyro, LS-DALTON, TAL-SH, TensorLy)

TAL-SH: https://github.com/DmitryLyakh/TAL_SH
TensorLy: http://tensorly.org
ITensor: http://itensor.org
Julia: https://github.com/Jutho/TensorOperations.jl & https://github.com/JuliaGPU/CUDAnative.jl

  • Multi-GPU
  • Out-of-Core
SLIDE 15

CUTENSOR

A High-Performance CUDA Library for Tensor Primitives

  • Tensor Contractions (generalization of matrix-matrix multiplication)
  • Element-wise operations (e.g., permutations, additions)
  • Mixed precision support
  • Generic and flexible interface

[Diagram: element-wise D = A + B + C]
[Diagram: contraction D = ∑(A ∗ B) + C]

SLIDE 16

Tensor Contractions

SLIDE 17

TENSOR CONTRACTIONS

Examples

  • Einstein notation (einsum)
  • Modes that appear in A and B are contracted
  • Examples
  • E_{i,j} = α ∑_k B_{i,k} ∗ C_{k,j}   // GEMM


SLIDE 18

TENSOR CONTRACTIONS

Examples

  • Einstein notation (einsum)
  • Modes that appear in A and B are contracted
  • Examples
  • E_{i,j} = α B_{i,k} ∗ C_{k,j}   // GEMM
  • E_{i1,j,i2} = α B_{i1,k,i2} ∗ C_{k,j}   // tensor contraction
  • E_{i1,j1,j2,i2} = α B_{i1,k,i2} ∗ C_{k,j2,j1}   // tensor contraction
  • E_{i1,j1,j2,i2} = α B_{i1,k1,i2,k2} ∗ C_{k2,k1,j2,j1}   // multi-mode tensor contraction

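The contractions above can be sketched on the host with NumPy's einsum, which uses the same Einstein notation (the mode letters and extents below are invented for the example; cuTENSOR evaluates the analogous expression on the GPU):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 2.0
B = rng.random((4, 5, 6))   # B_{i1,k,i2} -> subscripts 'a', 'k', 'b'
C = rng.random((5, 7, 8))   # C_{k,j2,j1} -> subscripts 'k', 'n', 'm'

# E_{i1,j1,j2,i2} = alpha * sum_k B_{i1,k,i2} * C_{k,j2,j1}
# (the repeated mode k appears in both inputs and is contracted)
E = alpha * np.einsum('akb,knm->amnb', B, C)
print(E.shape)  # (4, 8, 7, 6)
```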

SLIDE 19

TENSOR CONTRACTIONS

Examples (cont.)

  • Examples
  • E_{i,j} = α B_i ∗ C_j   // outer product
  • E_{i1,j,i2} = α B_{i1,i2} ∗ C_j   // outer product
  • E_{i1,j1,l1} = α B_{i1,k,l1} ∗ C_{k,j1,l1}   // batched GEMM
  • E_{i1,j1,l1,j2,i2} = α B_{i1,k,l1,i2} ∗ C_{k,j2,j1,l1}   // single-mode batched tensor contraction
  • E_{i1,j1,l1,j2,i2,l2} = α B_{i1,k,l2,l1,i2} ∗ C_{k,j2,j1,l1,l2}   // multi-mode batched tensor contraction

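The batched GEMM case above can likewise be sketched with einsum; the batch mode l1 appears in both inputs and in the output, so it is not contracted (shapes below are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 1.5
B = rng.random((4, 5, 3))   # B_{i1,k,l1}
C = rng.random((5, 6, 3))   # C_{k,j1,l1}

# E_{i1,j1,l1} = alpha * sum_k B_{i1,k,l1} * C_{k,j1,l1}
# one independent GEMM per batch index l1
E = alpha * np.einsum('akl,kbl->abl', B, C)
```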

SLIDE 20

TENSOR CONTRACTIONS

Key Features

  • Ψ are unary operators
  • E.g., Identity, RELU, CONJ, …
  • Mixed-precision
  • No additional work-space required
  • Auto-tuning capability (similar to cublasGemmEx)
  • High performance


SLIDE 21

TENSOR CONTRACTIONS

Key Challenges

  • Keep the fast FPUs busy
  • Reuse data in shared memory & registers as much as possible
  • Coalesced accesses to/from global memory
SLIDE 22

TENSOR CONTRACTIONS

Key Challenges

  • Loading a scalar β ✅
  • Loading a vector B_i ✅
  • Loading a matrix B_{i,j} (✅)
  • Loading a general tensor B_{i,j,k} ((✅))

(Parentheses indicate that coalesced loads become progressively harder.)

SLIDE 23

TENSOR CONTRACTIONS

Technical insight

[Diagram: D = A ∗ B]

[1] Paul Springer and Paolo Bientinesi: “Design of a high-performance GEMM-like Tensor-Tensor Multiplication” (2016)

SLIDE 24

TENSOR CONTRACTIONS

Technical insight

[Diagram: D = A ∗ B; tiles of A and B are packed into SHMEM: D-tile = GEMM-like(A-tile, B-tile)]

[1] Paul Springer and Paolo Bientinesi: “Design of a high-performance GEMM-like Tensor-Tensor Multiplication” (2016)

SLIDE 25

TENSOR CONTRACTIONS

Technical insight

[Diagram: D = A ∗ B; D-tile = GEMM-like(A-tile, B-tile) computed from the SHMEM tiles]

[1] Paul Springer and Paolo Bientinesi: “Design of a high-performance GEMM-like Tensor-Tensor Multiplication” (2016)

SLIDE 26

TENSOR CONTRACTIONS

Technical insight

[Diagram: D = A ∗ B; D-tile = GEMM-like(A-tile, B-tile), result written back to global memory]

[1] Paul Springer and Paolo Bientinesi: “Design of a high-performance GEMM-like Tensor-Tensor Multiplication” (2016)
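The reason a contraction is "GEMM-like" [1]: grouping the free modes of each operand flattens it into an ordinary matrix multiply. The host-side NumPy sketch below makes that mapping explicit via transpose/reshape; cuTENSOR avoids materializing such transposed copies by packing tiles of A and B directly into shared memory (shapes below are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.random((4, 5, 6))   # B_{i1,k,i2}
C = rng.random((5, 7))      # C_{k,j}

# E_{i1,j,i2} = sum_k B_{i1,k,i2} * C_{k,j}
# group the free modes (i1,i2) of B into the GEMM "M" dimension
Bmat = B.transpose(0, 2, 1).reshape(4 * 6, 5)   # rows = (i1,i2), cols = k
Emat = Bmat @ C                                 # plain GEMM: (24,5) x (5,7)
E = Emat.reshape(4, 6, 7).transpose(0, 2, 1)    # unflatten back to E_{i1,j,i2}

# same result as the direct einsum formulation
assert np.allclose(E, np.einsum('akb,kj->ajb', B, C))
```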

SLIDE 27

PERFORMANCE

Tensor Contractions

[Diagram: C = A ∗ B]

TBLIS (https://github.com/devinamatthews/tblis)

~8x over two-socket CPU (performance plotted against arithmetic intensity)

Random tensor contractions:

  • 3D to 6D tensors
  • FP64
SLIDE 28

PERFORMANCE

Tensor Contractions

[Diagram: C = A ∗ B]

TBLIS (https://github.com/devinamatthews/tblis)

(performance plotted against arithmetic intensity)

Random tensor contractions:

  • 3D to 6D tensors
  • FP64 (data) & FP32 (compute)
SLIDE 29

Element-wise Operations

SLIDE 30

ELEMENT-WISE TENSOR OPERATIONS

Examples

  • E_{w,h,c,n} = α B_{c,w,h,n}
  • E_{w,h,c,n} = α B_{c,w,h,n} + β C_{c,w,h,n}
  • E_{w,h,c,n} = min(α B_{c,w,h,n}, β C_{c,w,h,n})
  • E_{w,h,c,n} = α B_{c,w,h,n} + β C_{w,h,c,n} + δ D_{w,h,c,n}
  • E_{w,h,c,n} = α SFMV(B_{c,w,h,n}) + β C_{w,h,c,n} + δ D_{w,h,c,n}
  • E_{w,h,c,n} = GQ32(α SFMV(B_{c,w,h,n}) + β C_{w,h,c,n} + δ D_{w,h,c,n})

[Diagram: D = α A + β B + δ C]

Enables users to fuse multiple element-wise calls.
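What "fusing" buys can be sketched on the host (NumPy, invented shapes): the permutation, the two scalings, and the MIN below would otherwise be separate element-wise passes over memory, whereas a fused call applies them in one pass:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 0.5, 2.0
B = rng.random((3, 8, 8, 2))   # B_{c,w,h,n}
C = rng.random((3, 8, 8, 2))   # C_{c,w,h,n}

# E_{w,h,c,n} = min(alpha * B_{c,w,h,n}, beta * C_{c,w,h,n})
# permute both operands from (c,w,h,n) to (w,h,c,n), scale, then take the min
E = np.minimum(alpha * B.transpose(1, 2, 0, 3),
               beta * C.transpose(1, 2, 0, 3))
print(E.shape)  # (8, 8, 3, 2)
```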

SLIDE 31

ELEMENT-WISE TENSOR OPERATIONS

Key Features

  • Ψ are unary operators
  • E.g., Identity, RELU, CONJ, …
  • Φ are binary operators
  • E.g., MAX, MIN, ADD, MUL, …
  • Mixed-precision
  • High performance


SLIDE 32

PERFORMANCE

Element-wise Operation

* FP32 tensor permutation (e.g., reformatting)

[Diagram: C = α A + β B]

~5x over two-socket CPU

HPTT (https://github.com/springer13/hptt)

SLIDE 33

CUTENSOR’s API

SLIDE 34

TENSOR CONTRACTIONS

API

cutensorStatus_t cutensorContraction(
    cuTensorHandle_t handle,
    const void *alpha, const void *A, const cutensorTensorDescriptor_t descA, const int modeA[],
                       const void *B, const cutensorTensorDescriptor_t descB, const int modeB[],
    const void *beta,  const void *C, const cutensorTensorDescriptor_t descC, const int modeC[],
                             void *D, const cutensorTensorDescriptor_t descD, const int modeD[],
    cutensorOperator_t opOut,
    cudaDataType_t typeCompute,
    cutensorAlgo_t algo,
    void *workspace, uint64_t workspaceSize,   // workspace is optional and may be null
    cudaStream_t stream);


Devin Matthews et al. “Tensor interfaces”: https://github.com/MolSSI/tensor-interfaces/blob/master/interface.md

SLIDE 35

TENSOR CONTRACTIONS

API

cutensorStatus_t cutensorContraction(
    cuTensorHandle_t handle,
    const void *alpha, const void *A, const cutensorTensorDescriptor_t descA, const int modeA[],
                       const void *B, const cutensorTensorDescriptor_t descB, const int modeB[],
    const void *beta,  const void *C, const cutensorTensorDescriptor_t descC, const int modeC[],
                             void *D, const cutensorTensorDescriptor_t descD, const int modeD[],
    cutensorOperator_t opOut,
    cudaDataType_t typeCompute,
    cutensorAlgo_t algo,
    void *workspace, uint64_t workspaceSize,   // workspace is optional and may be null
    cudaStream_t stream);


Devin Matthews et al. “Tensor interfaces”: https://github.com/MolSSI/tensor-interfaces/blob/master/interface.md

  • E_{a,b,m,n,c} = α ∑_{o,p} (B_{a,o,p,b,c} ∗ C_{o,m,p,n}) + γ D_{a,b,m,n,c}

auto status = cutensorContraction(handle,
    alpha, A, descA, {'a', 'o', 'p', 'b', 'c'},
           B, descB, {'o', 'm', 'p', 'n'},
    beta,  C, descC, {'a', 'b', 'm', 'n', 'c'},
           D, descC, {'a', 'b', 'm', 'n', 'c'},
    CUTENSOR_OP_IDENTITY, CUDA_R_32F, CUTENSOR_ALGO_DEFAULT,
    nullptr, 0, stream);
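For reference, the same contraction expressed as a host-side einsum (a sketch only; the extents below are invented, and the tensor names follow the equation rather than the API arguments):

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, gamma = 1.0, 0.5
B = rng.random((2, 3, 4, 5, 6))   # modes a, o, p, b, c
C = rng.random((3, 7, 4, 8))      # modes o, m, p, n
D = rng.random((2, 5, 7, 8, 6))   # modes a, b, m, n, c

# E_{a,b,m,n,c} = alpha * sum_{o,p} B_{a,o,p,b,c} * C_{o,m,p,n} + gamma * D_{a,b,m,n,c}
# the repeated modes o and p are contracted
E = alpha * np.einsum('aopbc,ompn->abmnc', B, C) + gamma * D
```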

SLIDE 36

ELEMENT-WISE OPERATION

API

cutensorStatus_t cutensorElementwiseTrinary(
    cuTensorHandle_t handle,
    const void *alpha, const void *A, const cutensorTensorDescriptor_t descA, const int modeA[],
    const void *beta,  const void *B, const cutensorTensorDescriptor_t descB, const int modeB[],
    const void *gamma, const void *C, const cutensorTensorDescriptor_t descC, const int modeC[],
                             void *D, const cutensorTensorDescriptor_t descD, const int modeD[],
    cutensorOperator_t opAB,
    cutensorOperator_t opABC,
    cudaDataType_t typeCompute,
    cudaStream_t stream);


SLIDE 37

ELEMENT-WISE OPERATION

API

cutensorStatus_t cutensorElementwiseTrinary(
    cuTensorHandle_t handle,
    const void *alpha, const void *A, const cutensorTensorDescriptor_t descA, const int modeA[],
    const void *beta,  const void *B, const cutensorTensorDescriptor_t descB, const int modeB[],
    const void *gamma, const void *C, const cutensorTensorDescriptor_t descC, const int modeC[],
                             void *D, const cutensorTensorDescriptor_t descD, const int modeD[],
    cutensorOperator_t opAB,
    cutensorOperator_t opABC,
    cudaDataType_t typeCompute,
    cudaStream_t stream);


  • E_{w,h,c,n} = min(α B_{c,w,h,n}, γ C_{c,w,h}) + δ D_{w,h,c,n}

auto status = cutensorElementwiseTrinary(handle,
    alpha, A, descA, {'c', 'w', 'h', 'n'},
    beta,  B, descB, {'c', 'w', 'h'},
    gamma, C, descC, {'w', 'h', 'c', 'n'},
           D, descD, {'w', 'h', 'c', 'n'},
    CUTENSOR_OP_MIN, CUTENSOR_OP_ADD, CUDA_R_16F, stream);
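A host-side NumPy sketch of the same trinary operation (extents invented; opAB = MIN is applied to the first two scaled operands, then opABC = ADD combines the result with the third, and B is broadcast across the missing n mode):

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, beta, gamma = 1.0, 2.0, 0.5
A = rng.random((3, 4, 5, 2))   # modes c, w, h, n
B = rng.random((3, 4, 5))      # modes c, w, h (no n mode)
C = rng.random((4, 5, 3, 2))   # modes w, h, c, n

# D_{w,h,c,n} = min(alpha * A_{c,w,h,n}, beta * B_{c,w,h}) + gamma * C_{w,h,c,n}
D = np.minimum(alpha * A.transpose(1, 2, 0, 3),            # (c,w,h,n) -> (w,h,c,n)
               beta * B.transpose(1, 2, 0)[..., None]) \
    + gamma * C                                            # broadcast B over n
```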

SLIDE 38

REFERENCES

[1] Devin A. Matthews: “High-Performance Tensor Contraction without Transposition” (2016)
[2] Paul Springer et al.: “Design of a high-performance GEMM-like Tensor-Tensor Multiplication” (2016)
[3] Yang Shi et al.: “Tensor Contractions with Extended BLAS Kernels on CPU and GPU” (2016)
[4] Antti-Pekka Hynninen et al.: “cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs” (2017)
[5] Jinsung Kim et al.: “Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs” (2018)
[6] Jinsung Kim et al.: “A Code Generator for High-Performance Tensor Contractions on GPUs” (2019)

TensorFlow (logo): The TensorFlow logo and any related marks are trademarks of Google Inc.
PyTorch (logo): https://github.com/pytorch/pytorch/blob/master/docs/source/_static/img/pytorch-logo-dark.png
TensorLy (logo): http://tensorly.org
Julia (logo): https://github.com/JuliaGraphics/julia-logo-graphics
NWChem (logo): https://pbs.twimg.com/media/Da8JYfgV4AAKGsv.png
Pyro (logo): http://pyro.ai/img/pyro_logo.png

SLIDE 39

CUTENSOR

  • A CUDA library for high-performance tensor primitives

Pre-release available at: https://developer.nvidia.com/cuTensor
Your feedback is highly appreciated.

[Diagram: contraction D = ∑(A ∗ B) + C, ~8x speedup]
[Diagram: element-wise D = α A + β B + δ C, ~5x speedup]

SLIDE 40

SLIDE 41

CUTENSOR

API

cutensorStatus_t cutensorCreateTensorDescriptor(
    cutensorTensorDescriptor_t *desc,
    unsigned int numModes,
    const int64_t extent[],
    const int64_t stride[],   // stride is optional and may be null
    cudaDataType_t dataType,
    cutensorOperator_t unaryOp,
    const int vectorIndex,
    const int32_t vectorWidth);

cutensorStatus_t cutensorContraction(
    cuTensorHandle_t handle,
    const void *alpha, const void *A, const cutensorTensorDescriptor_t descA, const int modeA[],
                       const void *B, const cutensorTensorDescriptor_t descB, const int modeB[],
    const void *beta,  const void *C, const cutensorTensorDescriptor_t descC, const int modeC[],
                             void *D, const cutensorTensorDescriptor_t descD, const int modeD[],
    cutensorOperator_t opOut,
    cudaDataType_t typeCompute,
    cutensorAlgo_t algo,
    void *workspace, size_t workspaceSize,   // workspace is optional and may be null
    cudaStream_t stream);


Devin Matthews et al. “Tensor interfaces”: https://github.com/MolSSI/tensor-interfaces/blob/master/interface.md