CUTENSOR: High-Performance CUDA Tensor Primitives
Paul Springer and Chen-Han Yu, NVIDIA, March 20th, 2019


  1. CUTENSOR: High-Performance CUDA Tensor Primitives. Paul Springer, Chen-Han Yu, March 20th, 2019. pspringer@nvidia.com and chenhany@nvidia.com

  2. ACKNOWLEDGMENTS
  • Colleagues at NVIDIA*: Albert Di, Alex Fit-Florea, Evghenii Gaburov, Harun Bayraktar, Sharan Chetlur, Timothy Costa, Zachary Zimmerman
  • Collaborators outside of NVIDIA*: Dmitry Liakh (TAL-SH), Jutho Haegeman (Julia), Tim Besard (Julia)
  *alphabetic order

  3.–5. WHAT IS A TENSOR? (progressive build over three slides)
  • mode-0: scalar $\alpha$
  • mode-1: vector $A_i$
  • mode-2: matrix $A_{i,j}$
  • mode-n: general tensor $A_{i,j,k,\ldots}$ (the build adds one mode per slide: $A_{i,j,k}$, then $A_{i,j,k,l}$, then $A_{i,j,k,l,m}$)
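
  A minimal, self-contained illustration (plain C++, not cuTENSOR code) of what "mode-n" means in practice: an order-3 tensor is one linear buffer plus one stride per mode, and element $A_{i,j,k}$ lives at the dot product of the index and stride vectors. The extents and the column-major stride choice are assumptions for the example.

      // A mode-3 tensor stored as one linear buffer with explicit strides
      // (column-major convention: mode i is unit-stride, mode k is slowest).
      #include <cstdio>
      #include <vector>

      int main() {
          const int ni = 2, nj = 3, nk = 4;           // extents of modes i, j, k
          std::vector<float> A(ni * nj * nk);
          const int si = 1, sj = ni, sk = ni * nj;    // strides of modes i, j, k
          for (int k = 0; k < nk; ++k)
              for (int j = 0; j < nj; ++j)
                  for (int i = 0; i < ni; ++i)
                      A[i * si + j * sj + k * sk] = 100 * i + 10 * j + k;
          // A_{1,2,3} lives at offset 1*si + 2*sj + 3*sk.
          printf("A(1,2,3) = %g\n", A[1 * si + 2 * sj + 3 * sk]);
          return 0;
      }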

  6.–13. BASIC LINEAR ALGEBRA SUBPROGRAMS (BLAS): A Success Story (progressive build across slides 6–13)
  • 1969 – BLAS Level 1: vector-vector
  • 1972 – BLAS Level 2: matrix-vector
  • 1980 – BLAS Level 3: matrix-matrix
  • Now? – BLAS Level 4: tensor-tensor
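
  Spelled out, the four levels correspond to the following operations (the first three are the standard BLAS definitions; the "Level 4" line is the tensor generalization this talk proposes):

      \begin{align*}
      \text{Level 1 (1969):}\quad & y \leftarrow \alpha x + y \\
      \text{Level 2 (1972):}\quad & y \leftarrow \alpha A x + \beta y \\
      \text{Level 3 (1980):}\quad & C \leftarrow \alpha A B + \beta C \\
      \text{``Level 4'' (now?):}\quad & D_{\ldots} \leftarrow \alpha \sum_{k,\ldots} A_{\ldots} B_{\ldots} + \beta C_{\ldots}
      \end{align*}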

  14. TENSORS ARE UBIQUITOUS: Potential Use Cases
  • Deep Learning: Pyro, TensorLy
  • Quantum Chemistry: LS-DALTON, TAL-SH (multi-GPU, out-of-core)
  • Condensed Matter Physics: ITensor, Julia
  TAL-SH: https://github.com/DmitryLyakh/TAL_SH
  ITensor: http://itensor.org
  TensorLy: http://tensorly.org
  Julia: https://github.com/Jutho/TensorOperations.jl & https://github.com/JuliaGPU/CUDAnative.jl

  15. CUTENSOR: A High-Performance CUDA Library for Tensor Primitives
  • Tensor contractions (generalization of matrix-matrix multiplication): $D = \alpha \sum A \ast B + \beta C$ (sum over contracted modes)
  • Element-wise operations (e.g., permutations, additions): $D = A + B + C$ (schematically)
  • Mixed-precision support
  • Generic and flexible interface

  16. Tensor Contractions

  17. TENSOR CONTRACTIONS: Examples
  $D = \alpha \sum \Psi(A) \ast \Psi(B) + \beta\,\Psi(C)$
  • Einstein notation (einsum)
  • Modes that appear in A and B are contracted
  • Example: $D_{m,n} = \alpha \sum_k A_{m,k} \ast B_{k,n}$  // GEMM

  18. TENSOR CONTRACTIONS: Examples
  $D = \alpha \sum \Psi(A) \ast \Psi(B) + \beta\,\Psi(C)$
  • Einstein notation (einsum); modes that appear in A and B are contracted
  • $D_{m,n} = \alpha\, A_{m,k} \ast B_{k,n}$  // GEMM
  • $D_{m_1,n,m_2} = \alpha\, A_{m_1,k,m_2} \ast B_{k,n}$  // tensor contraction
  • $D_{m_1,n_1,n_2,m_2} = \alpha\, A_{m_1,k,m_2} \ast B_{k,n_2,n_1}$  // tensor contraction
  • $D_{m_1,n_1,n_2,m_2} = \alpha\, A_{m_1,k_1,m_2,k_2} \ast B_{k_2,k_1,n_2,n_1}$  // multi-mode tensor contraction
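
  To make the einsum semantics concrete, here is a naive reference loop nest (plain C++, not how cuTENSOR computes it) for the second example above, $D_{m_1,n,m_2} = \alpha\, A_{m_1,k,m_2} B_{k,n}$; the extents and column-major strides are illustrative assumptions.

      // k appears in A and B but not in D, so it is summed over (contracted);
      // m1, m2, n appear in D, so they are free modes.
      #include <vector>

      void contract(int M1, int N, int M2, int K, float alpha,
                    const std::vector<float>& A,   // A[m1 + k*M1 + m2*M1*K]
                    const std::vector<float>& B,   // B[k + n*K]
                    std::vector<float>& D) {       // D[m1 + n*M1 + m2*M1*N]
          for (int m2 = 0; m2 < M2; ++m2)
              for (int n = 0; n < N; ++n)
                  for (int m1 = 0; m1 < M1; ++m1) {
                      float acc = 0.0f;
                      for (int k = 0; k < K; ++k)
                          acc += A[m1 + k * M1 + m2 * M1 * K] * B[k + n * K];
                      D[m1 + n * M1 + m2 * M1 * N] = alpha * acc;
                  }
      }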

  19. TENSOR CONTRACTIONS: Examples (cont.)
  • $D_{m,n} = \alpha\, A_m \ast B_n$  // outer product
  • $D_{m_1,n,m_2} = \alpha\, A_{m_1,m_2} \ast B_n$  // outer product
  • $D_{m_1,n_1,l_1} = \alpha\, A_{m_1,k,l_1} \ast B_{k,n_1,l_1}$  // batched GEMM
  • $D_{m_1,n_1,l_1,n_2,m_2} = \alpha\, A_{m_1,k,l_1,m_2} \ast B_{k,n_2,n_1,l_1}$  // single-mode batched tensor contraction
  • $D_{m_1,n_1,l_1,n_2,m_2,l_2} = \alpha\, A_{m_1,k,l_2,l_1,m_2} \ast B_{k,n_2,n_1,l_1,l_2}$  // multi-mode batched tensor contraction
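
  The batched cases add modes that appear in all three tensors; such a mode is neither contracted nor free in the GEMM sense, it simply indexes independent problems. A reference sketch for the batched-GEMM example above (illustrative column-major layouts):

      // D_{m,n,l} = alpha * A_{m,k,l} * B_{k,n,l}: one independent GEMM per l.
      #include <cstddef>

      void batched_gemm(int M, int N, int K, int L, float alpha,
                        const float* A,    // A[m + k*M + l*M*K]
                        const float* B,    // B[k + n*K + l*K*N]
                        float* D) {        // D[m + n*M + l*M*N]
          for (int l = 0; l < L; ++l)      // batch mode: no summation, no reuse across l
              for (int n = 0; n < N; ++n)
                  for (int m = 0; m < M; ++m) {
                      float acc = 0.0f;
                      for (int k = 0; k < K; ++k)
                          acc += A[m + k * M + (size_t)l * M * K]
                               * B[k + n * K + (size_t)l * K * N];
                      D[m + n * M + (size_t)l * M * N] = alpha * acc;
                  }
      }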

  20. TENSOR CONTRACTIONS: Key Features
  $D = \alpha \sum \Psi(A) \ast \Psi(B) + \beta\,\Psi(C)$
  • Ψ are unary operators, e.g., Identity, RELU, CONJ, …
  • Mixed precision
  • No additional workspace required
  • Auto-tuning capability (similar to cublasGemmEx)
  • High performance
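
  Applying a unary operator Ψ to an operand means every element passes through Ψ as it is loaded, before entering the contraction. A scalar reference sketch with Ψ_A = RELU, only to pin down the semantics (in cuTENSOR the operator is fused into the GPU kernel rather than materialized):

      // D_{m,n} = alpha * sum_k RELU(A_{m,k}) * B_{k,n}, column-major layouts assumed.
      #include <algorithm>

      inline float relu(float x) { return std::max(x, 0.0f); }

      void contract_relu(int M, int N, int K, float alpha,
                         const float* A, const float* B, float* D) {
          for (int n = 0; n < N; ++n)
              for (int m = 0; m < M; ++m) {
                  float acc = 0.0f;
                  for (int k = 0; k < K; ++k)
                      acc += relu(A[m + k * M]) * B[k + n * K];  // Psi_A applied on load
                  D[m + n * M] = alpha * acc;
              }
      }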

  21. TENSOR CONTRACTIONS: Key Challenges
  • Keep the fast FPUs busy
  • Reuse data in shared memory and registers as much as possible
  • Coalesced accesses to/from global memory

  22. TENSOR CONTRACTIONS: Key Challenges
  • Loading a scalar $\alpha$: ✅
  • Loading a vector $A_i$: ✅
  • Loading a matrix $A_{i,j}$: (✅)
  • Loading a general tensor $A_{i,j,k}$: ((✅)) (increasingly hard to keep accesses coalesced as modes are added)
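
  A small CUDA sketch of why the parenthesized checkmarks degrade: a warp reading along the unit-stride mode produces one coalesced transaction, while reading along a permuted, large-stride mode (the situation a contraction creates when it needs a transposed view of an operand) scatters across cache lines. Extents and strides are illustrative assumptions.

      __global__ void load_example(const float* __restrict__ A, float* out,
                                   int ni, int nj) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= ni) return;
          int j = blockIdx.y;
          // Coalesced: consecutive threads touch consecutive addresses (stride 1 in i).
          float coalesced = A[i + j * ni];
          // Uncoalesced: consecutive threads are nj elements apart, one cache line each.
          float strided = A[j + i * nj];
          out[i + j * ni] = coalesced + strided;
      }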

  23.–26. TENSOR CONTRACTIONS: Technical Insight (progressive build over four slides)
  • $D = A \ast B$ is computed with a GEMM-like macro-kernel, GEMM-like($\mathcal{A}$, $\mathcal{B}$) → $\mathcal{C}$, operating on sub-tensors of A, B, and D [1]
  • Tiles of A and B are packed into shared memory ("to SHMEM"), which removes the irregular strides of the tensor layouts
  • The macro-kernel multiplies the contiguous shared-memory tiles, exactly as in a high-performance GEMM
  • The result tile is written back to global memory ("to Global")
  [1] Paul Springer and Paolo Bientinesi: "Design of a High-Performance GEMM-like Tensor-Tensor Multiplication" (2016)
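
  A compressed sketch of that pipeline, with illustrative tile sizes and a plain matrix layout standing in for the general strided tensor indexing (the real kernel derives its gather indices from the tensors' mode lists):

      #define TM 32  // tile extent in the free modes of A
      #define TN 32  // tile extent in the free modes of B
      #define TK 8   // tile extent in the contracted modes
      // Launch with blockDim = (TM, TN).
      __global__ void gett_sketch(const float* A, const float* B, float* D,
                                  int M, int N, int K, float alpha) {
          __shared__ float sA[TK][TM];   // packed tile of A ("to SHMEM")
          __shared__ float sB[TK][TN];   // packed tile of B
          int tm = threadIdx.x, tn = threadIdx.y;
          int m = blockIdx.x * TM + tm, n = blockIdx.y * TN + tn;
          float acc = 0.0f;
          for (int k0 = 0; k0 < K; k0 += TK) {
              // 1) Gather: strided/permuted global reads -> contiguous shared tiles.
              for (int kk = threadIdx.y; kk < TK; kk += blockDim.y)
                  sA[kk][tm] = (m < M && k0 + kk < K) ? A[m + (size_t)(k0 + kk) * M] : 0.0f;
              for (int kk = threadIdx.x; kk < TK; kk += blockDim.x)
                  sB[kk][tn] = (n < N && k0 + kk < K) ? B[(k0 + kk) + (size_t)n * K] : 0.0f;
              __syncthreads();
              // 2) GEMM-like macro-kernel on the contiguous tiles.
              for (int kk = 0; kk < TK; ++kk) acc += sA[kk][tm] * sB[kk][tn];
              __syncthreads();
          }
          // 3) "To Global": write the (scaled) result tile back.
          if (m < M && n < N) D[m + (size_t)n * M] = alpha * acc;
      }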

  27. PERFORMANCE: Tensor Contractions
  • Random tensor contractions: 3D to 6D tensors, FP64
  • ~8x speedup over a two-socket CPU (TBLIS: https://github.com/devinamatthews/tblis)
  [Plot: performance vs. arithmetic intensity]

  28. PERFORMANCE: Tensor Contractions
  • Random tensor contractions: 3D to 6D tensors
  • FP64 (data) & FP32 (compute)
  • CPU baseline: TBLIS (https://github.com/devinamatthews/tblis)
  [Plot: performance vs. arithmetic intensity]

  29. Element-wise Operations

  30. ELEMENT-WISE TENSOR OPERATIONS: Examples
  $D = \alpha\,\Psi(A) + \beta\,\Psi(B) + \gamma\,\Psi(C)$
  • $D_{w,h,c,n} = \alpha\, A_{c,w,h,n}$
  • $D_{w,h,c,n} = \alpha\, A_{c,w,h,n} + \beta\, B_{c,w,h,n}$
  • $D_{w,h,c,n} = \min(\alpha\, A_{c,w,h,n},\ \beta\, B_{c,w,h,n})$
  • $D_{w,h,c,n} = \alpha\, A_{c,w,h,n} + \beta\, B_{w,h,c,n} + \gamma\, C_{w,h,c,n}$
  • $D_{w,h,c,n} = \alpha\, \mathrm{RELU}(A_{c,w,h,n}) + \beta\, B_{w,h,c,n} + \gamma\, C_{w,h,c,n}$
  • $D_{w,h,c,n} = \mathrm{FP32}(\alpha\, \mathrm{RELU}(A_{c,w,h,n}) + \beta\, B_{w,h,c,n} + \gamma\, C_{w,h,c,n})$
  Enables users to fuse multiple element-wise calls.
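
  As a reference for the semantics, the second example above is a fused permutation-plus-scaled-addition; a plain C++ loop nest with assumed column-major strides:

      // D_{w,h,c,n} = alpha * A_{c,w,h,n} + beta * B_{c,w,h,n}
      #include <cstddef>

      void permute_add(int W, int H, int C, int N, float alpha, float beta,
                       const float* A,   // modes (c,w,h,n): A[c + w*C + h*C*W + n*C*W*H]
                       const float* B,   // modes (c,w,h,n), same layout as A
                       float* D) {       // modes (w,h,c,n): D[w + h*W + c*W*H + n*W*H*C]
          for (int n = 0; n < N; ++n)
              for (int c = 0; c < C; ++c)
                  for (int h = 0; h < H; ++h)
                      for (int w = 0; w < W; ++w) {
                          size_t a = c + (size_t)w * C + (size_t)h * C * W + (size_t)n * C * W * H;
                          size_t d = w + (size_t)h * W + (size_t)c * W * H + (size_t)n * W * H * C;
                          D[d] = alpha * A[a] + beta * B[a];  // one pass, no temporary tensor
                      }
      }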

  31. ELEMENT-WISE TENSOR OPERATIONS: Key Features
  $D = \alpha\,\Psi(A) + \beta\,\Psi(B) + \gamma\,\Psi(C)$
  • Ψ are unary operators, e.g., Identity, RELU, CONJ, …
  • Φ are binary operators, e.g., MAX, MIN, ADD, MUL, …
  • Mixed precision
  • High performance

  32. PERFORMANCE: Element-wise Operations
  • FP32 tensor permutation (e.g., reformatting): $D = \alpha\,\Psi(A) + \beta\,\Psi(B)$
  • ~5x speedup over a two-socket CPU (HPTT: https://github.com/springer13/hptt)

  33. CUTENSOR’s API

  34. TENSOR CONTRACTIONS: API
  $D = \alpha \sum \Psi(A) \ast \Psi(B) + \beta\,\Psi(C)$

      cutensorStatus_t cutensorContraction(
          cuTensorHandle_t handle,
          const void *alpha, const void *A, const cutensorTensorDescriptor_t descA, const int modeA[],
                             const void *B, const cutensorTensorDescriptor_t descB, const int modeB[],
          const void *beta,  const void *C, const cutensorTensorDescriptor_t descC, const int modeC[],
                             void *D,       const cutensorTensorDescriptor_t descD, const int modeD[],
          cutensorOperator_t opOut,
          cudaDataType_t typeCompute,
          cutensorAlgo_t algo,
          void *workspace, uint64_t workspaceSize,  // workspace is optional and may be null
          cudaStream_t stream);

  Devin Matthews et al., "Tensor interfaces": https://github.com/MolSSI/tensor-interfaces/blob/master/interface.md
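
  A hypothetical usage sketch built strictly from the signature above. This was a pre-release interface, so it is not a drop-in for the released cuTENSOR API, and the enum values CUTENSOR_OP_IDENTITY, CUDA_R_32F, and CUTENSOR_ALGO_DEFAULT are assumed names here. It computes the batched contraction $D_{m,n,l} = \alpha\, A_{m,k,l} B_{k,n,l} + \beta\, C_{m,n,l}$; the handle, descriptors, device pointers, and stream are assumed to have been created elsewhere.

      cutensorStatus_t contract_example(
          cuTensorHandle_t handle,
          const void *d_A, cutensorTensorDescriptor_t descA,
          const void *d_B, cutensorTensorDescriptor_t descB,
          const void *d_C, cutensorTensorDescriptor_t descC,
          void *d_D,       cutensorTensorDescriptor_t descD,
          cudaStream_t stream)
      {
          enum { m, n, k, l };                     // arbitrary integer mode labels
          const int modeA[] = { m, k, l };         // A_{m,k,l}
          const int modeB[] = { k, n, l };         // B_{k,n,l}: k contracted, l batched
          const int modeC[] = { m, n, l };
          const int modeD[] = { m, n, l };
          const float alpha = 1.0f, beta = 0.0f;
          return cutensorContraction(handle,
                                     &alpha, d_A, descA, modeA,
                                             d_B, descB, modeB,
                                     &beta,  d_C, descC, modeC,
                                             d_D, descD, modeD,
                                     CUTENSOR_OP_IDENTITY,   // opOut: applied to the result
                                     CUDA_R_32F,             // typeCompute
                                     CUTENSOR_ALGO_DEFAULT,  // let the library auto-tune
                                     nullptr, 0,             // no workspace
                                     stream);
      }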
