
Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs - PowerPoint PPT Presentation



  1. Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs. Jinsung Kim¹, Aravind Sukumaran-Rajam¹, Changwan Hong¹, Ajay Panyala², Rohit Kumar Srivastava¹, Sriram Krishnamoorthy², P. Sadayappan¹. ¹The Ohio State University, ²Pacific Northwest National Laboratory

  2. Contents
  • Introduction
  • Overview of Mapping Strategy
  • Optimized Execution of Tensor Contractions
  • Fusion for Symmetrized Tensor Contractions
  • Experimental Results
  • Conclusion

  3. Introduction

  4. Tensor Contraction
  • Tensor
    • A multidimensional array
    • For example, a vector is a 1D tensor and a matrix is a 2D tensor
  • Tensor contractions
    • Higher-dimensional analogs of matrix multiplication
    • Appear in higher-order models in quantum chemistry, deep learning, finite element methods, etc.

  5. Tensor Contraction
  • Example: O(a, b, c) = A(a, k, c) × B(k, b), with summation over the contracted index k
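To make the notation concrete, here is a minimal reference implementation of this contraction as plain loops (the array names, row-major layout, and extents Na, Nb, Nc, Nk are illustrative, not taken from the slides); the contracted index k is summed away:

    // Reference loop nest for O(a, b, c) = sum_k A(a, k, c) * B(k, b).
    // Dense row-major storage is assumed (last index varies fastest).
    void contract_ref(const double* A, const double* B, double* O,
                      int Na, int Nb, int Nc, int Nk) {
      for (int a = 0; a < Na; ++a)
        for (int b = 0; b < Nb; ++b)
          for (int c = 0; c < Nc; ++c) {
            double acc = 0.0;
            for (int k = 0; k < Nk; ++k)
              acc += A[(a * Nk + k) * Nc + c] * B[k * Nb + b];
            O[(a * Nb + b) * Nc + c] = acc;
          }
    }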

  6. CCSD(T)
  • Coupled Cluster Singles and Doubles with perturbative Triples correction
  • One of the most accurate methods applicable to reasonably large molecules
  • Widely used in computational chemistry suites
  • A set of symmetrized, bandwidth-bound tensor contractions
  • These contractions are the computational bottleneck of the CCSD(T) coupled-cluster method

  7. CCSD(T)
  • Symmetrized tensor contractions in CCSD(T)

  8. Background of GPU Memory Types
  • Global memory (grid-level)
    • The slowest form of I/O on the GPU
  • Shared memory (thread block-level)
    • Very fast memory located in the SM
  • Registers (thread-level)
    • The fastest memory
  [Figure: CUDA memory model (source: NVIDIA documentation)]

  9. Global Memory Coalescing
  • To maximize global memory bandwidth:
    • All threads in a warp (32 consecutive threads) execute the same instruction (SIMD)
    • For a load/store instruction, the number of memory transactions must be minimized
  • For a tensor contraction, this applies when we:
    • Load the input tensors
    • Store the output tensor
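As a sketch of what coalescing means in practice (illustrative kernels, not from the paper): when consecutive threads of a warp touch consecutive addresses, a few memory transactions serve the whole warp; when they stride through memory, a single load instruction is split across many transactions.

    // Coalesced: threadIdx.x walks the fastest-varying (contiguous) dimension,
    // so a warp's 32 loads fall into a few contiguous memory transactions.
    __global__ void copy_coalesced(const double* __restrict__ in,
                                   double* __restrict__ out, int ncols) {
      int row = blockIdx.y;
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      if (col < ncols) out[row * ncols + col] = in[row * ncols + col];
    }

    // Uncoalesced: consecutive threads are ncols elements apart, so the same
    // warp touches many memory segments for a single load instruction.
    __global__ void copy_strided(const double* __restrict__ in,
                                 double* __restrict__ out, int nrows, int ncols) {
      int row = blockIdx.x * blockDim.x + threadIdx.x;
      int col = blockIdx.y;
      if (row < nrows) out[row * ncols + col] = in[row * ncols + col];
    }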

  10. Challenges
  • Mapping of the high-dimensional iteration space to threads
  • Choice of data buffering in shared memory and registers
  • Choice of tile sizes for multi-level tiling
  • Fusing the symmetrized tensor contractions in CCSD(T)

  11. Our Implementations
  • An efficient GPU implementation of the tensor contractions in CCSD(T), using:
    • Shared-memory buffering
    • Register tiling
    • Loop fusion
    • Register transposition

  12. Overview of Mapping Strategy

  13. Overview of Mapping Strategy
  • The overall strategy:
    • Each thread holds a set of elements of the output in its registers
    • Shared memory is used to buffer slices of the input tensors
    • Each thread block computes a hyper-rectangular slice of the output
  • Key decisions:
    • Partitioning the total work of a tensor contraction among thread blocks
    • Mapping the iteration space to threads
    • Choice of tile sizes

  14. Overview of Mapping Strategy
  • Two-level tiling
    • 2D thread block: TB_X × TB_Y
    • 2D register tiles: REG_X × REG_Y (register tiling)
  • Mappings
    • External indices → 2D thread block and 2D register tiles
    • Internal (contracted) indices → all threads
  • Tile sizes
    • The extent of each index handled by a thread block
    • E.g., T_a, T_b, ...

  15. Overview of Mapping Strategy
  • Example: O(a, b, c, d) = A(a, b, k) × B(k, c, d)
    • Mapping: a → TB_X, c → TB_Y, b → REG_X, d → REG_Y, k → ∗
    • Tile sizes: T_a = 16, T_b = 4, T_c = 16, T_d = 4
  [Figure: the block's output slice; a TB_X × TB_Y = 16 × 16 thread block, each thread holding a REG_X × REG_Y = 4 × 4 register tile]
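A minimal CUDA skeleton of this mapping (an illustration under simplifying assumptions, not the authors' kernel: extents are assumed to be multiples of the tile sizes, the block's b and d tiles are taken to start at 0, and the shared-memory buffering of the inputs introduced on the next slides is omitted):

    // O(a, b, c, d) = sum_k A(a, b, k) * B(k, c, d), row-major storage.
    // Mapping: a -> TB_X, c -> TB_Y, b -> REG_X, d -> REG_Y, k -> all threads.
    // Launch with dim3 block(16, 16); each thread owns a 4x4 register tile over (b, d).
    #define REG_X 4
    #define REG_Y 4

    __global__ void contract_mapped(const double* A, const double* B, double* O,
                                    int Na, int Nb, int Nc, int Nd, int Nk) {
      int a = blockIdx.x * blockDim.x + threadIdx.x;   // external index a -> TB_X
      int c = blockIdx.y * blockDim.y + threadIdx.y;   // external index c -> TB_Y
      double acc[REG_X][REG_Y] = {};                   // output elements held in registers

      for (int k = 0; k < Nk; ++k)                     // contracted index k: walked by every thread
        for (int b = 0; b < REG_X; ++b)
          for (int d = 0; d < REG_Y; ++d)
            acc[b][d] += A[((size_t)a * Nb + b) * Nk + k]
                       * B[((size_t)k * Nc + c) * Nd + d];

      for (int b = 0; b < REG_X; ++b)                  // write the register tile back
        for (int d = 0; d < REG_Y; ++d)
          O[(((size_t)a * Nb + b) * Nc + c) * Nd + d] = acc[b][d];
    }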

  16. Optimized Execution of Tensor Contractions

  17. Optimized Execution of Tensor Contractions
  • Example: C[i, j] = A[i, k] × B[k, j]
  • A thread block computes a T_i × T_j slice of the output C (1)
  • It works with two input slices: a T_i × N_k slice of A and an N_k × T_j slice of B
  • Portions of these slices (T_i × T_k of A and T_k × T_j of B) are loaded into shared memory, over ⌈N_k / T_k⌉ steps (1)
  [Figure: tiles of A and B staged from global memory (GMEM) into shared memory (SMEM)]

  18. Optimized Execution of Tensor Contractions
  • Example: C[i, j] = A[i, k] × B[k, j]
  • Each thread reads a row-vector and a column-vector from the shared-memory slices (2)
  • and accumulates an outer-product contribution into its register tile (3)
  • After all ⌈N_k / T_k⌉ steps, threads store the output tensor slice to GMEM (4)
  [Figure: register-level outer product; data flows GMEM → SMEM → registers (REG) → GMEM]
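Put together, the scheme on the two slides above corresponds to a register-tiled multiplication like the sketch below (illustrative, with hard-coded 16×16 thread blocks, 4×4 register tiles, a k-tile of 16, and extents assumed to be multiples of these tiles; the numbered comments follow the numbered steps above):

    // C[i, j] = sum_k A[i, k] * B[k, j]; each 16x16 block computes a 64x64 tile of C.
    #define TB   16          // thread-block extent in each dimension
    #define REG  4           // register-tile extent per thread in each dimension
    #define TILE (TB * REG)  // 64: output tile per thread block
    #define TK   16          // k-slice staged in shared memory per step

    __global__ void gemm_tiled(const double* A, const double* B, double* C,
                               int Ni, int Nj, int Nk) {
      __shared__ double sA[TILE][TK];   // (1) slice of A in shared memory
      __shared__ double sB[TK][TILE];   // (1) slice of B in shared memory

      int i0 = blockIdx.y * TILE;       // origin of this block's output tile
      int j0 = blockIdx.x * TILE;
      double acc[REG][REG] = {};        // per-thread output elements in registers

      for (int k0 = 0; k0 < Nk; k0 += TK) {            // ceil(Nk / Tk) steps
        for (int r = 0; r < REG; ++r) {                // (1) cooperative, coalesced loads
          sA[threadIdx.y * REG + r][threadIdx.x] =
              A[(size_t)(i0 + threadIdx.y * REG + r) * Nk + (k0 + threadIdx.x)];
          sB[threadIdx.y][r * TB + threadIdx.x] =
              B[(size_t)(k0 + threadIdx.y) * Nj + (j0 + r * TB + threadIdx.x)];
        }
        __syncthreads();

        for (int kk = 0; kk < TK; ++kk) {
          double av[REG], bv[REG];
          for (int r = 0; r < REG; ++r) {              // (2) REG elements of A and of B for this k
            av[r] = sA[threadIdx.y * REG + r][kk];
            bv[r] = sB[kk][threadIdx.x * REG + r];
          }
          for (int x = 0; x < REG; ++x)                // (3) outer-product update of the register tile
            for (int y = 0; y < REG; ++y)
              acc[x][y] += av[x] * bv[y];
        }
        __syncthreads();
      }

      for (int x = 0; x < REG; ++x)                    // (4) store the output tile to GMEM
        for (int y = 0; y < REG; ++y)
          C[(size_t)(i0 + threadIdx.y * REG + x) * Nj + (j0 + threadIdx.x * REG + y)] = acc[x][y];
    }

A production kernel would also stagger the columns of each thread's register tile so that the final stores are coalesced; the sketch keeps the simpler layout for readability.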

  19. Fusion for Symmetrized Tensor Contractions

  20. Fusion for Symmetrized Tensor Contractions
  • Symmetrized tensor contractions
    • Accumulate into the same output tensor
    • Have an identical left-hand side (LHS)
    • This makes it possible to fuse tensor contractions that use different parts of the input tensors

  21. Fusion for Symmetrized Tensor Contractions
  • For example, the 9 sd2 functions in CCSD(T) can be fused
  • The results stay in registers; they are not stored from registers to global memory after each individual tensor contraction (see the sketch below)
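The fusion idea in miniature (a sketch with made-up 2D tensors rather than the 6D sd2 tensors, and without the tiling machinery above): several contractions that accumulate into the same output keep the accumulator in registers and write to global memory once at the end, instead of storing and re-loading the output between contractions.

    // O[i, j] += A1[i, k] * B1[k, j]  and  O[i, j] += A2[i, l] * B2[l, j],
    // fused into one kernel.  The accumulator 'acc' stays in a register across
    // both contractions; global memory is written exactly once.
    __global__ void fused_pair(const double* A1, const double* B1,
                               const double* A2, const double* B2,
                               double* O, int Ni, int Nj, int Nk, int Nl) {
      int i = blockIdx.y * blockDim.y + threadIdx.y;
      int j = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= Ni || j >= Nj) return;

      double acc = 0.0;
      for (int k = 0; k < Nk; ++k) acc += A1[(size_t)i * Nk + k] * B1[(size_t)k * Nj + j];
      for (int l = 0; l < Nl; ++l) acc += A2[(size_t)i * Nl + l] * B2[(size_t)l * Nj + j];
      O[(size_t)i * Nj + j] = acc;     // single store for all fused contractions
    }

In the fully tiled sd1/sd2 kernels the same idea is combined with the shared-memory staging and register tiling shown earlier, with each fused contraction reading different parts of the input tensors.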

  22. Fusion for Symmetrized Tensor Contractions
  • Two issues of fusion (1/2)
  1. The size of shared memory
    • Depends on the chosen tile sizes
    • Problem: different tensor contractions may need different amounts of shared memory
    • Issue: lower occupancy
    • Constraint: all fused contractions must use an identical amount of shared memory

  23. Fusion for Symmetrized Tensor Contractions
  • Two issues of fusion (2/2)
  2. Arithmetic intensity
    • Register tiles hold REG_X × REG_Y output elements
    • Number of loaded elements per step: REG_X + REG_Y
    • Number of result elements updated per step: REG_X × REG_Y
    • Problem: the indices mapped onto the register tile might both come from the same input tensor
    • Issue: low arithmetic intensity
    • Constraint: the indices mapped to the register tile should come from different input tensors
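As a worked example with the REG_X = REG_Y = 4 tiles used earlier: when the two register-tile indices come from different inputs, each step of the contracted index loads 4 + 4 = 8 elements and updates 4 × 4 = 16 results, i.e. two updates per loaded element. If both register-tile indices instead came from the same input, that input alone would have to supply 16 distinct elements per step (one per result), and the ratio would fall back to roughly one update per load.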

  24. Fusion for Symmetrized Tensor Contractions
  • Example (1/2)
    • Tile sizes: T_k = T_j = T_i = T_c = T_b = T_a = 4 and T_d (T_l) = 16
    • Mapping: two output indices → TB_X, two output indices → TB_Y, one → REG_X, one → REG_Y, and the contracted index d (l) → ∗

  25. Fusion for Symmetrized Tensor Contractions
  • Example (2/2)
    • Two different kernels (two mappings) can fuse all of the contractions
    • Mapping #1 and Mapping #2 differ only in which output index is assigned to TB_Y versus REG_X
    • This gives the Partially-Fused Kernel Version

  26. Fusion for Symmetrized Tensor Contractions
  • Register transposition
    • Within a thread block, a hyper-rectangular slice of the output can be transposed via shared memory
    • Example: a 4D output tensor over the indices i, j, k, c
    • Let the mapping be k → TB_X, j → TB_Y, i → REG_X, c → REG_Y
    • Let the tile sizes be T_i = T_j = T_k = T_c = 2
  [Figure: layout of the 16 output elements across the threads' registers, with thread coordinates (k, j) and register coordinates (i, c)]

  27. Fusion for Symmetrized Tensor Contractions
  • Example: 4D output tensor over the indices i, j, k, c, with two mappings:
    • Mapping #1: k → TB_X, i → TB_Y, j → REG_X, c → REG_Y
    • Mapping #2: k → TB_X, j → TB_Y, i → REG_X, c → REG_Y
  [Figure: the register tiles of mapping #1 are written into shared memory and read back in the layout required by mapping #2, exchanging the roles of i and j without going through global memory]
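A minimal sketch of the register transposition for a 2D view of a block's output slice (assuming a 16×16 thread block with 4×4 register tiles; the actual kernels permute higher-dimensional index mappings, but the mechanism of exchanging register tiles through shared memory is the same):

    // Each thread of a 16x16 block holds a 4x4 register tile of the block's
    // 64x64 output slice.  The tiles are round-tripped through shared memory:
    // every thread writes its tile where the current mapping placed it and reads
    // back the tile it owns under the transposed mapping.  No global-memory
    // traffic is involved.
    __device__ void transpose_register_tile(double acc[4][4]) {
      __shared__ double buf[64][64 + 1];     // +1 column of padding to reduce bank conflicts

      for (int x = 0; x < 4; ++x)            // write tiles under the current mapping
        for (int y = 0; y < 4; ++y)
          buf[threadIdx.y * 4 + x][threadIdx.x * 4 + y] = acc[x][y];
      __syncthreads();

      for (int x = 0; x < 4; ++x)            // read back the transposed ownership
        for (int y = 0; y < 4; ++y)
          acc[x][y] = buf[threadIdx.x * 4 + y][threadIdx.y * 4 + x];
      __syncthreads();
    }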

  28. Fusion for Symmetrized Tensor Contractions
  • Register transposition
    • With it, a single kernel fuses all 9 sd1 functions or all 9 sd2 functions
    • This is the Fully-Fused Kernel Version

  29. Experimental Results

  30. Experimental Results (1/3)
  • Experimental setup
    • Pascal P100 and Volta V100 GPUs (16 GB, PCI-Express)
    • CUDA 9.0 and GCC 6.2
  • Two fused variants
    • Fully-Fused and Partially-Fused Kernel Versions
    • Their performance is compared against the NWChem kernels and the TAL-SH and OpenACC implementations
  [Tables: problem sizes; parameters used in the Fully-Fused and Partially-Fused kernels]

  31. Experimental Results (2/3)
  • On P100 (Pascal)
    • NWChem kernels: max. 500 GFLOPS
    • TAL-SH and OpenACC: less than 300 GFLOPS
    • Fully-Fused and Partially-Fused Kernel Versions: max. 2.8 and 2.1 TFLOPS, respectively
  [Figures: GFLOPS for sd1 and sd2 on P100 (Pascal) across problem sizes A-E, comparing Fully-Fused, Partially-Fused, NWChem, TAL-SH, and OpenACC]

  32. Experimental Results (3/3)
  • On V100 (Volta)
    • NWChem kernels: 818–1004 GFLOPS
    • TAL-SH and OpenACC kernels: max. 400 GFLOPS
    • The two fused variants: 2.5 to 4.5 TFLOPS, with a peak of 4.5 TFLOPS
  [Figures: GFLOPS for sd1 and sd2 on V100 (Volta) across problem sizes A-E, comparing Fully-Fused, Partially-Fused, NWChem, TAL-SH, and OpenACC]

  33. Conclusion

  34. Conclusion
  • A novel strategy for executing tensor contractions in CCSD(T) on GPUs
    • Kernel-level optimizations
    • Fusion across the symmetrization kernels
    • A novel register-level transpose operation
  • Experimental evaluation
    • Significant performance improvements as compared to existing alternatives
    • Over 60% of peak floating-point performance on both Pascal P100 and Volta V100 GPUs

  35. Thank you
