Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on - - PowerPoint PPT Presentation

SLIDE 1

Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs

Jinsung Kim1, Aravind Sukumaran Rajam1, Changwan Hong1, Ajay Panyala2, Rohit Kumar Srivastava1, Sriram Krishnamoorthy2, P. Sadayappan1

1The Ohio State University, 2Pacific Northwest National Laboratory

SLIDE 2

Contents

  • Introduction
  • Overview of Mapping Strategy
  • Optimized Execution of Tensor Contractions
  • Fusion for Symmetrized Tensor Contractions
  • Experimental Results
  • Conclusion

SLIDE 3


Introduction

SLIDE 4

Tensor Contraction

  • Tensor
  • A Multidimensional Array
  • For example, a vector is a 1D tensor and a matrix is a 2D tensor
  • Tensor Contractions
  • Higher Dimensional Analogs of Matrix Multiplications
  • High Order Models in Quantum Chemistry, Deep Learning, Finite Element Methods, etc
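As an illustration (not from the slides), the 2D special case of a tensor contraction is ordinary matrix multiplication, C[i, j] = Σk A[i, k] · B[k, j]; a minimal plain-Python sketch:

```python
# Matrix multiplication written as a tensor contraction over the shared
# (contracted) index k; i and j are the "external" indices of the output.

def matmul(A, B):
    n, kdim, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(kdim):          # contracted index
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Higher-dimensional contractions follow the same pattern with more external loops.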

SLIDE 5

Tensor Contraction

  • Example: C(a, b, c) = A(a, k, c) × B(k, b)
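A minimal plain-Python sketch of a contraction of this shape (the index names a, b, c and contracted index k are as read from the example; loop-nest form, no optimization):

```python
# C[a][b][c] = sum over k of A[a][k][c] * B[k][b]
# a, b, c are external indices; k is the contracted (internal) index.

def contract(A, B):
    NA, NK, NC = len(A), len(A[0]), len(A[0][0])
    NB = len(B[0])
    C = [[[0.0] * NC for _ in range(NB)] for _ in range(NA)]
    for a in range(NA):
        for b in range(NB):
            for c in range(NC):
                for k in range(NK):
                    C[a][b][c] += A[a][k][c] * B[k][b]
    return C
```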

SLIDE 6

CCSD(T)

  • Coupled Cluster Singles and Doubles with perturbative Triples correction
  • One of the most accurate methods applicable to reasonably large molecules
  • A widely used application in computational chemistry suites
  • A set of symmetrized tensor contractions
  • Bandwidth-bound tensor contractions
  • The computational bottleneck in the CCSD(T) coupled-cluster method

SLIDE 7

CCSD(T)

  • Symmetrized Tensor Contractions in CCSD(T)

SLIDE 8

Background of GPU Memory Types

  • Global Memory (Grid-Level)
  • The slowest form of I/O on GPU
  • Shared Memory (Thread Block-Level)
  • Very fast memory located in the SM
  • Registers (Thread-Level)
  • The fastest memory


CUDA Memory Model

(source: NVIDIA documentation)

SLIDE 9

Global Memory Coalescing

  • To Maximize Global Memory Bandwidth
  • All threads in a warp (32 consecutive threads) execute the same instruction (SIMD)
  • For a load/store instruction, need to minimize the number of transactions
  • For a tensor contraction,
  • Load input tensors
  • Store the output tensor
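A back-of-the-envelope sketch (a hypothetical model, not from the slides): counting how many 32-byte memory segments one warp touches for a given access stride shows why unit-stride (coalesced) access minimizes transactions:

```python
# Count distinct 32-byte segments touched by a warp of 32 threads, each
# loading one 8-byte element at address thread_id * stride_bytes.
# Segment size and element size here are illustrative assumptions.

def warp_transactions(stride_bytes, elem_bytes=8, warp=32, seg=32):
    segments = set()
    for t in range(warp):
        addr = t * stride_bytes
        segments.add(addr // seg)                     # first byte
        segments.add((addr + elem_bytes - 1) // seg)  # element may straddle
    return len(segments)
```

Unit stride touches the minimum number of segments; large strides give one transaction per thread.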

SLIDE 10

Challenges

  • Mapping of the High-Dimensional Iteration Space to Threads
  • Choice of Data Buffering in Shared-Memory and Registers
  • Choice of Tile Sizes for Multi-Level Tiling
  • Fusing the Symmetrized Tensor Contractions in CCSD(T)

SLIDE 11

Our Implementations

  • An efficient GPU implementation of the tensor contractions in CCSD(T)
  • Shared-memory buffering
  • Register tiling
  • Loop fusion
  • Register transposition

SLIDE 12

Overview of Mapping Strategy

SLIDE 13

Overview of Mapping Strategy

  • The Overall Strategy
  • Each thread: a set of elements of the output in its registers
  • Shared-memory to buffer slices of input tensors
  • Each Thread Block: a Hyper-Rectangular Slice of the Output
  • Partitioning the total work of a tensor contraction among thread blocks
  • Mapping an iteration space to threads
  • Choice of Tile-sizes

SLIDE 14

Overview of Mapping Strategy

  • Two-Level Tiling
  • 2D Thread Block: TBX × TBY
  • 2D Register Tiles: REGX × REGY (Register Tiling)
  • Mappings
  • External Indices → 2D Thread Block and 2D Register Tiles
  • Internal Indices → All
  • Tile-Sizes
  • The size of each index handled by a thread block
  • E.g., Ti, Tj, …
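A sketch of the index arithmetic implied by two-level tiling, with illustrative (not the paper's) tile parameters; the blocked layout below is one possible choice:

```python
# Each 2D thread block has TBX x TBY threads; each thread owns a
# REGX x REGY register tile of the output. Parameter values are
# illustrative only.
TBX, TBY = 4, 4
REGX, REGY = 2, 2

def output_elements(block_x, block_y, tx, ty):
    """Output (x, y) coordinates computed by thread (tx, ty) of the given block."""
    base_x = (block_x * TBX + tx) * REGX   # blocked layout assumption
    base_y = (block_y * TBY + ty) * REGY
    return [(base_x + rx, base_y + ry)
            for rx in range(REGX) for ry in range(REGY)]
```

Together, the threads of one block cover a (TBX·REGX) × (TBY·REGY) output slice with no overlap.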

SLIDE 15

Overview of Mapping Strategy

  • Example: O(a, b, c, d) = A(a, b, k) × B(k, c, d)
  • a → TBX, c → TBY, d → REGX, b → REGY, k → ∗
  • Ta = 16, Tb = 4, Tc = 16, Td = 4

[Figure: the output slice mapped to a thread block (TBX = 16 along a, TBY = 16 along c) and per-thread register tiles (REGX = 4 along d, REGY = 4 along b)]

SLIDE 16

Optimized Execution of Tensor Contractions

SLIDE 17

Optimized Execution of Tensor Contractions

  • Example: C[i, j] = A[i, k] × B[k, j]
  • A thread block computes a slice of the output C
  • Two slices of the inputs A and B
  • Loading portions from the slices of A and B (1)

[Figure: A (Ni × Nk), B (Nk × Nj), and C (Ni × Nj) in GMEM; Ti × Tk and Tj × Tk slices are staged into SMEM (1) over ⌈Nk/Tk⌉ steps to produce a Ti × Tj output tile]

SLIDE 18


Optimized Execution of Tensor Contractions

  • Example: C[i, j] = A[i, k] × B[k, j]
  • A row-vector of A & a column-vector of B (2)
  • An outer-product contribution (3)
  • Threads store the output tensor slice to GMEM (4)
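Putting this slide and the previous one together, a plain-Python sketch of the scheme (the numbered comments (1)–(4) mirror the slides; edge handling is omitted by assuming dimensions divisible by the tile sizes, which are illustrative):

```python
# Tiled matmul with explicit "shared memory" staging buffers and a
# per-tile "register" accumulator built from outer products.

def tiled_matmul(A, B, Ti=2, Tj=2, Tk=2):
    Ni, Nk, Nj = len(A), len(B), len(B[0])
    C = [[0.0] * Nj for _ in range(Ni)]
    for i0 in range(0, Ni, Ti):
        for j0 in range(0, Nj, Tj):
            acc = [[0.0] * Tj for _ in range(Ti)]     # register tile
            for k0 in range(0, Nk, Tk):               # ceil(Nk/Tk) steps
                # (1) load Ti x Tk and Tj x Tk slices into "SMEM"
                sA = [[A[i0 + i][k0 + k] for k in range(Tk)] for i in range(Ti)]
                sB = [[B[k0 + k][j0 + j] for j in range(Tj)] for k in range(Tk)]
                for k in range(Tk):
                    # (2) a row-vector of A and a column-vector of B,
                    # (3) accumulated as an outer-product contribution
                    for i in range(Ti):
                        for j in range(Tj):
                            acc[i][j] += sA[i][k] * sB[k][j]
            # (4) store the output tile to "GMEM"
            for i in range(Ti):
                for j in range(Tj):
                    C[i0 + i][j0 + j] = acc[i][j]
    return C
```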

[Figure: per k-step, each thread loads a row-vector and a column-vector from the SMEM slices into registers (2), accumulates their outer product into its register tile (3), and finally stores the tile to GMEM (4)]

SLIDE 19

Fusion for Symmetrized Tensor Contractions

SLIDE 20

Fusion for Symmetrized Tensor Contractions

  • Symmetrized Tensor Contractions
  • The Accumulated Output Tensor
  • The identical left-hand side (LHS)
  • Possible to fuse tensor contractions with different parts of input tensors

SLIDE 21

Fusion for Symmetrized Tensor Contractions

  • For Example: the 9 sd2 functions in CCSD(T)
  • Without storing the results from registers to global memory after finishing each tensor contraction
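A toy sketch of the payoff (illustrative, a 1D stand-in for the real register tiles): fusing contractions that share an output lets all contributions accumulate before a single store, instead of a store and re-read per contraction:

```python
# Several "contractions" (here just element-wise functions) share the same
# output (LHS), so their contributions can be summed in a register before
# one write-back.

def unfused(contractions, n):
    out = [0.0] * n
    for f in contractions:
        tmp = [f(i) for i in range(n)]   # each contraction produces a result
        for i in range(n):
            out[i] += tmp[i]             # ...which is re-read to accumulate
    return out

def fused(contractions, n):
    out = []
    for i in range(n):
        acc = 0.0                        # one "register" accumulator
        for f in contractions:
            acc += f(i)                  # all contributions before the store
        out.append(acc)
    return out
```

Both produce the same output; the fused version performs one store per element instead of one per contraction.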

SLIDE 22

Fusion for Symmetrized Tensor Contractions

  • Two Issues of Fusion (1/2)

1. The Size of Shared Memory

  • Depends on chosen Tile-Sizes
  • Problem: Different amounts of shared memory among tensor contractions
  • Issue: Lower Occupancy
  • Constraint: An identical amount of shared memory
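A sketch of the constraint, assuming the Ti × Tk and Tj × Tk shared-memory buffers from the earlier slides and 8-byte elements (the tile sizes below are illustrative, not the paper's tuned values):

```python
# Shared memory used by one thread block that buffers a Ti x Tk slice of
# one input and a Tj x Tk slice of the other (8-byte doubles assumed).

def smem_bytes(Ti, Tj, Tk, elem_bytes=8):
    return (Ti * Tk + Tj * Tk) * elem_bytes

# For fusion, every contraction in the fused kernel must be assigned the
# same allocation, e.g. the maximum over the fused set:
def fused_smem_bytes(tile_sets):
    return max(smem_bytes(*t) for t in tile_sets)
```

Since occupancy is limited by the largest per-block allocation, differing requirements across the fused contractions would lower it.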

SLIDE 23

Fusion for Symmetrized Tensor Contractions

  • Two Issues of Fusion (2/2)

2. Arithmetic Intensity

  • Register Tiles (REGX * REGY)
  • # of Loaded Elements: REGX + REGY
  • # of Result-Elements: REGX * REGY
  • Problem: Indices mapped on Register Tiles might both come from one of the inputs
  • Issue: Low Arithmetic Intensity
  • Constraint: Indices mapped to register tile should come from different input tensors
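The arithmetic-intensity argument in numbers (a sketch of the reasoning, not code from the paper): per k-step a thread computes REGX·REGY partial results, and what it must load depends on where the register-tile indices come from:

```python
# Per k-step register-tile accounting (own derivation, for illustration).

def intensity_diff_inputs(REGX, REGY):
    # One register-tile index on each input: load REGX + REGY elements,
    # perform REGX * REGY multiply-adds.
    return (REGX * REGY) / (REGX + REGY)

def intensity_same_input(REGX, REGY):
    # Both register-tile indices on one input: load a REGX * REGY block
    # from it plus one scalar from the other input.
    return (REGX * REGY) / (REGX * REGY + 1)
```

With both indices on one input, intensity stays below one flop per load, which is why the constraint assigns the register-tile indices to different input tensors.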

SLIDE 24

Fusion for Symmetrized Tensor Contractions

  • Example (1/2)
  • Tile-Sizes: Tk = Tj = Ti = Tc = Tb = Ta = 4 and Td (Tl) = 16
  • Mapping: i, j → TBX, a, b → TBY, k → REGX, c → REGY, d (l) → ∗

SLIDE 25

Fusion for Symmetrized Tensor Contractions

  • Example (2/2)
  • Two different kernels can fuse all of them
  • Mapping #1: i, j → TBX, a, b → TBY, k → REGX, c → REGY, d (l) → ∗
  • Mapping #2: i, j → TBX, k, b → TBY, a → REGX, c → REGY, d (l) → ∗
  • Partially-Fused Kernel Version

SLIDE 26

Fusion for Symmetrized Tensor Contractions

  • Register Transposition
  • Within a thread block, a hyper-rectangular slice of the output can be transposed via shared memory
  • Example: 4D Output Tensor O(i, j, k, c)
  • Let a mapping be k → TBX, i → TBY, j → REGX, c → REGY
  • Let tile-sizes be Tk = Ti = Tj = Tc = 2
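A toy sketch of the idea (a single tile in plain Python): values are written into a shared-memory buffer in one index order and read back in another, which is how two mappings with swapped index roles can be reconciled inside a thread block:

```python
# Stage a per-thread register tile through a "shared memory" 2D buffer,
# writing in one index order and reading back with the roles swapped.

def transpose_via_smem(reg_tile):
    rows, cols = len(reg_tile), len(reg_tile[0])
    smem = [[0.0] * rows for _ in range(cols)]
    for r in range(rows):          # each "thread" writes its register values
        for c in range(cols):
            smem[c][r] = reg_tile[r][c]
    return smem                    # read back transposed
```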

[Figure: the register values of one thread's output elements, laid out by (k, j) and (i, c)]

SLIDE 27

Fusion for Symmetrized Tensor Contractions

  • Example: 4D Output Tensor O(i, j, k, c)

Mappings:

  • #1: k → TBX, i → TBY, j → REGX, c → REGY
  • #2: k → TBX, j → TBY, i → REGX, c → REGY

[Figure: the register layouts under mappings #1 and #2; elements are exchanged through a shared-memory tile so both mappings produce the same output layout]

SLIDE 28

Fusion for Symmetrized Tensor Contractions

  • Register Transposition
  • Finally, a kernel fuses all 9 sd1 functions or all 9 sd2 functions.
  • Fully-Fused Kernel Version

SLIDE 29

Experimental Results

SLIDE 30

Experimental Results (1/3)

  • Experimental Setup
  • Pascal P100 and Volta V100 GPUs (16GB, PCI-Express)
  • CUDA 9.0 and GCC 6.2
  • Two Fused Variants
  • Fully-Fused and Partially-Fused Kernel Versions
  • Compared against the NWChem kernels and the TAL-SH and OpenACC implementations

[Tables: parameters used in the Fully-Fused and Partially-Fused kernels; problem sizes]

SLIDE 31

Experimental Results (2/3)

[Figure: sd1 on P100 (Pascal) — GFLOPS for Fully-Fused, Partially-Fused, NWChem, TAL-SH, and OpenACC across Size-A to Size-E]

[Figure: sd2 on P100 (Pascal) — GFLOPS for the same five implementations across Size-A to Size-E]


  • On P100 (Pascal)
  • NWChem kernels: Max. 500 GFLOPS
  • TAL-SH and OpenACC: less than 300 GFLOPS
  • Fully-Fused and Partially-Fused Kernel versions: Max. 2.8 and 2.1 TFLOPS, respectively
SLIDE 32

Experimental Results (3/3)

  • On V100 (Volta)
  • The NWChem kernels: 818–1004 GFLOPS
  • TAL-SH and OpenACC kernels: Max. 400 GFLOPS
  • Two Fused Variants: Max. 2.5 and 4.5 TFLOPS

[Figure: sd1 on V100 (Volta) — GFLOPS for Fully-Fused, Partially-Fused, NWChem, TAL-SH, and OpenACC across Size-A to Size-E]

[Figure: sd2 on V100 (Volta) — GFLOPS for the same five implementations across Size-A to Size-E]

SLIDE 33

Conclusion

SLIDE 34

Conclusion

  • A novel strategy for executing tensor contractions in CCSD(T) on GPUs.
  • Kernel-level optimizations
  • Fusion across the symmetrization kernels
  • A novel register-level transpose operation
  • Experimental evaluation
  • Significant performance improvements as compared to existing alternatives
  • Over 60% of peak floating point performance on both Pascal P100 and Volta V100 GPUs.

SLIDE 35


Thank you