SLIDE 1

S7255: CUTT: A HIGH-PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA

SLIDE 2

MOTIVATION

Tensor contractions are the most computationally intensive part of the quantum many-body methods used in NWCHEM, DIRAC, LS-DALTON, and ACES-IV
E(b,c,j) += M(b,d,j,l) S(l,c,d)
Sum over the repeated indices d and l
Evaluating tensor contractions directly requires implementing a lot of hard-to-write custom code
The indirect approach transposes the tensors and uses efficient linear algebra libraries (such as cuBLAS) to perform the matrix multiply

Tensor contractions

SLIDE 3

TENSOR CONTRACTIONS

Reduction over a pair of indices shared by two tensors, e.g.
E(b,c,j) += M(b,d,j,l) S(l,c,d)
This can be evaluated as
M(b,d,j,l) -> M(b,j,l,d)          # tensor transpose
S(l,c,d) -> S(l,d,c)              # tensor transpose
E(b,j,c) += M(b,j,l,d) S(l,d,c)   # matrix multiply
E(b,j,c) -> E(b,c,j)              # tensor transpose
This takes advantage of the high-performance matrix multiply routines provided by cuBLAS; a sketch of the matrix multiply step follows below

Indirect approach
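To make the indirect approach concrete, here is a minimal sketch of the matrix multiply step for the contraction above, after the transposes have produced M(b,j,l,d), S(l,d,c), and E(b,j,c). The function and buffer names are illustrative assumptions; only the cuBLAS call itself is real. With the leftmost tensor index fastest (column-major), merged indices map directly onto GEMM dimensions.

#include <cublas_v2.h>

// E(b,j,c) += M(b,j,l,d) * S(l,d,c) as a single GEMM:
//   m = b*j (merged row indices), n = c, k = l*d (merged contraction indices)
void contractStep(cublasHandle_t handle,
                  const double* M_t,  // M transposed to (b,j,l,d), m x k
                  const double* S_t,  // S transposed to (l,d,c), k x n
                  double* E_t,        // E in (b,j,c) layout, m x n
                  int m, int n, int k) {
  const double alpha = 1.0;
  const double beta = 1.0;  // beta = 1 accumulates into E
  cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              m, n, k, &alpha, M_t, m, S_t, k, &beta, E_t, m);
}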

SLIDE 4

PREVIOUS WORK

The previous implementation by my co-author [1] was sub-optimal on GPU platforms
The work in [2] relies on a compiler to build custom kernels, i.e. it is not runtime

[1] Dmitry I. Lyakh. 2015. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. Computer Physics Communications 189 (2015), 84-91. DOI: http://dx.doi.org/10.1016/j.cpc.2014.12.013

[2] Paul Springer, Aravind Sankaran, and Paolo Bientinesi. 2016. TTC: A Tensor Transposition Compiler for Multiple Architectures.

No runtime high-performance tensor transpose library exists for GPUs

SLIDE 5

TENSOR TRANSPOSE ALGORITHMS

SLIDE 6

MATRIX TRANSPOSE: TILED ALGORITHM

Step 1: Read a 32x32 tile from global memory to shared memory
__syncthreads()
Step 2: Read the shared memory in transposed order and write to global memory

Mark Harris, "An Efficient Matrix Transpose in CUDA C/C++", Parallel Forall Blog: https://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc
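For reference, a minimal version of that tiled transpose (adapted from the blog post above; assumes the matrix extents are multiples of the 32x32 tile):

#define TILE_DIM 32
#define BLOCK_ROWS 8

// Transpose a (height x width) row-major matrix into (width x height).
__global__ void transposeTiled(float* out, const float* in,
                               int width, int height) {
  // +1 column of padding avoids shared memory bank conflicts
  __shared__ float tile[TILE_DIM][TILE_DIM + 1];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

  __syncthreads();

  // Swap the block coordinates so both reads and writes are coalesced
  x = blockIdx.y * TILE_DIM + threadIdx.x;
  y = blockIdx.x * TILE_DIM + threadIdx.y;
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}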

SLIDE 7

TILED ALGORITHM

[Figure: dimensions (1,2,3,4,5,6) permuted to (5,4,1,6,3,2); dimensions 1 and 5 form the shared-memory tile, the remaining volume (2,3,4,6) is looped over using thread blocks (TB)]

Constant shared memory usage (~32x32)
Performs well when d1 and d5 are fairly large (~32)
Poor performance for small (2-8) dimensions
Would it be possible to pack multiple small dimensions into shared memory?

SLIDE 8

PACKED ALGORITHM

[Figure: dimensions (1,2,3,4,5,6) permuted to (5,4,1,6,3,2); dimensions 1, 2, and 5 are packed into shared memory, dimensions 3, 6, and 4 form the TB loop volume]

No longer uses a 32x32 shared memory tile
Loads entire dimensions into shared memory (not tiled)
As much shared memory is allocated as it takes to store the elements
Must choose which dimensions to pack
New problem: what if e.g. d5 is very large?

SLIDE 9

PACKED-SPLIT ALGORITHM

[Figure: dimensions (1,2,3,4,5,6) permuted to (5,4,1,6,3,2); the largest dimension 5 is split, dimensions 1, 2, and a split piece of 5 go to shared memory, dimensions 3, 6, 4 and the remaining split pieces form the TB loop volume]

Split the largest dimension
The number of splits is determined by the shared memory size (see the sketch below)
Must choose which dimensions to pack, and the number of splits
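A minimal sketch of how the split count could follow from the shared memory budget (the function and variable names are assumptions, not cuTT's internals):

// packedVolume: number of elements in the packed dimensions
// sharedMemPerBlock: shared memory available per thread block, in bytes
int chooseNumSplit(size_t packedVolume, size_t sizeofType,
                   size_t sharedMemPerBlock) {
  int numSplit = 1;
  // Split the largest dimension until one piece of the packed volume fits
  while (((packedVolume + numSplit - 1) / numSplit) * sizeofType > sharedMemPerBlock) {
    numSplit++;
  }
  return numSplit;
}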

SLIDE 10

MEMORY POSITION CALCULATION

SLIDE 11

GLOBAL MEMORY POSITION CALCULATION

[Figure: the (1,2,3,4,5,6) -> (5,4,1,6,3,2) permutation with dimensions 1, 2, 5 in shared memory; shRead marks the shared-memory read order]

H = number of elements in shared memory
M = number of elements in the loop volume
Need to convert the scalar positions s = 0, ..., H-1 and p = 0, ..., M-1 into global memory positions:

glRead = global memory read position
glWrite = global memory write position
The global memory position is split into:
glRead = glMinorRead(s) + glMajorRead(p)
glWrite = glMinorWrite(s) + glMajorWrite(p)
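Putting the split positions together, a hypothetical kernel skeleton (the position helpers are declarations only here; they stand for the computations described on the following slides):

// Position computations described on the following slides
__device__ int glMinorRead(int s);
__device__ int glMajorRead(int p);
__device__ int glMinorWrite(int s);
__device__ int glMajorWrite(int p);
__device__ int shRead(int s);

// Each thread block handles one loop-volume position p; threads stride
// over the H shared-memory positions s.
__global__ void transposeSkeleton(const float* idata, float* odata, int H) {
  extern __shared__ float shmem[];
  int p = blockIdx.x;  // constant within the thread block
  for (int s = threadIdx.x; s < H; s += blockDim.x)
    shmem[s] = idata[glMinorRead(s) + glMajorRead(p)];
  __syncthreads();
  for (int s = threadIdx.x; s < H; s += blockDim.x)
    odata[glMinorWrite(s) + glMajorWrite(p)] = shmem[shRead(s)];
}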

SLIDE 12

MAJOR POSITION CALCULATION

glMajorRead(p) = sum_{i=1}^{n} ((p / c_i) mod d_i) * t_i   (integer division)

// int p = 0,...,M-1
// int c[n] = {1, d3, d3*d4}
// int d[n] = {d3, d4, d6}
// int t[n] = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5}
int glMajorRead = 0;
for (int i = 0; i < n; i++) {
  glMajorRead += ((p / c[i]) % d[i]) * t[i];
}

O(n) per position
Observation: p is constant within a thread block (and therefore within a warp)
glMajorRead(p), p = 0, ..., M-1
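The c, d, t arrays in the comments can be derived from the tensor extents. A hedged host-side sketch (assuming loopDim[] lists the 0-based loop-volume dimensions in input order; checked against the example above, where the loop volume is dimensions 3, 4, 6):

// dim[]: full tensor extents; n: number of loop-volume dimensions
void setupMajorRead(const int* dim, const int* loopDim, int n,
                    int* c, int* d, int* t) {
  int cum = 1;
  for (int i = 0; i < n; i++) {
    c[i] = cum;              // stride of loop dimension i within p
    d[i] = dim[loopDim[i]];  // extent of loop dimension i
    cum *= d[i];
    int stride = 1;          // stride of this dimension in global memory
    for (int j = 0; j < loopDim[i]; j++) stride *= dim[j];
    t[i] = stride;
  }
}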

SLIDE 13

WARP-PARALLEL POSITION CALCULATION

// int p = 0,...,M-1
// int c = {1, d3, d3*d4, 1, ..., 1}
// int d = {d3, d4, d6, 1, ..., 1}
// int t = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5, ...}
// Each warp lane computes one term of the sum; padded lanes contribute 0
// because (p / 1) % 1 == 0.
int glMajorRead = ((p / c) % d) * t;
// Butterfly reduction sums the per-lane terms across the full warp
for (int i = 16; i >= 1; i /= 2) {
  glMajorRead += __shfl_xor(glMajorRead, i);
}

glMajorRead(p) = sum_{i=1}^{n} ((p / c_i) mod d_i) * t_i, with one term per warp lane

Single divide, modulo, and multiply per lane: O(1), i.e. performance is independent of tensor rank
Works up to n = 32
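Note: __shfl_xor as written targets the CUDA 8 toolchain of the talk; on CUDA 9 and later the synchronized variant is required, i.e. glMajorRead += __shfl_xor_sync(0xffffffff, glMajorRead, i);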

SLIDE 14

MINOR POSITION CALCULATION

[Figure: dimensions 1, 2, and 5 of the (1,2,3,4,5,6) -> (5,4,1,6,3,2) permutation packed into shared memory]

For the Tiled algorithm this is trivial
For Packed and Packed-Split, pre-compute the positions and store them in registers
Number of registers per thread: numReg = (H - 1)/blockDim.x + 1
int glMinorRead[numReg]
int shRead[numReg]
int glMinorWrite[numReg]
The kernel is templated on numReg; each thread covers the positions s = 0, ..., H-1 for glMinorRead(s), glMinorWrite(s), and shRead(s)
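A sketch of the template dispatch this implies (kernel arguments elided; the shape is an assumption, not cuTT's exact code):

template <int numReg>
__global__ void packedTranspose(/* plan arguments */);

void launchPacked(dim3 grid, dim3 block, int H) {
  int numReg = (H - 1) / block.x + 1;  // registers needed per thread
  switch (numReg) {
    case 1: packedTranspose<1><<<grid, block>>>(/* ... */); break;
    case 2: packedTranspose<2><<<grid, block>>>(/* ... */); break;
    case 3: packedTranspose<3><<<grid, block>>>(/* ... */); break;
    // ... one case per supported register count
  }
}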

SLIDE 15

ALGORITHM & PARAMETER CHOICE

SLIDE 16

CHOOSING THE BEST ALGORITHM

Algorithm choice: Tiled, Packed, or Packed-Split
Tiled: no free parameters
Packed: input and output ranks
Packed-Split: input and output ranks, number of splits
Large performance differences between different algorithm and parameter choices

[Figure: the same (1,2,3,4,5,6) -> (5,4,1,6,3,2) permutation handled by each of the three algorithms]

SLIDE 17

CUTT PLANS

Measure plans perform all possible tensor transposes and choose the best-performing one. LARGE overhead.
Heuristic plans choose the best plan by estimating the transpose runtime with an analytical GPU performance model. SMALL overhead.
Heuristic plans must be used in QM calculations.
Getting the heuristic planning to work accurately was a major hurdle.
A better approach is needed for choosing heuristic plans (machine learning?).

cuttResult cuttPlan(cuttHandle* handle, int rank, int* dim, int* permutation,
                    size_t sizeofType, cudaStream_t stream);

cuttResult cuttPlanMeasure(cuttHandle* handle, int rank, int* dim, int* permutation,
                           size_t sizeofType, cudaStream_t stream,
                           void* idata, void* odata);
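A minimal usage sketch (hedged: cuttExecute and cuttDestroy follow the cuTT README rather than this slide, the extents are made up, and the 0-based permutation convention is assumed):

#include <cutt.h>

// Transpose M(b,d,j,l) -> M(b,j,l,d), i.e. output dimension i is input
// dimension permutation[i] (assumed convention; see the cuTT README)
void transposeM(double* d_in, double* d_out) {
  int dim[4]         = {64, 40, 48, 32};  // example extents (made up)
  int permutation[4] = {0, 2, 3, 1};
  cuttHandle plan;
  cuttPlan(&plan, 4, dim, permutation, sizeof(double), 0);  // heuristic plan
  cuttExecute(plan, d_in, d_out);  // d_in/d_out: device buffers
  cuttDestroy(plan);               // error checking omitted for brevity
}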

SLIDE 18

BENCHMARKS

SLIDE 19

BENCHMARK 1

Tensor ranks 2 to 7
Ratio between the largest and smallest tensor dimensions: 1:1, 5:1, and 15:1
Tensor volume normally distributed with an average of 200M elements and a standard deviation of 20M elements
500 random permutations for each tensor rank and ratio
9000 tensor transposes in total

SLIDE 20

TESLA K20X

* Maximum bandwidth measured using GPU-STREAM: Tom Deakin, James Price, Matt J. Martineau, and Simon N. McIntosh-Smith. 2016. GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. Paper presented at the P^3MA Workshop at ISC High Performance, Frankfurt, Germany.

SLIDE 21

TESLA M40

SLIDE 22

TESLA P100

SLIDE 23

BENCHMARK 2

Tensor ranks 8 and 12
Rank 8: (5, 3, 2, 4, 35, 33, 37, 40), 200M elements
Rank 12: (2, 3, 4, 3, 2, 2, 3, 2, 20, 18, 22, 24), 328M elements
500 random permutations for each tensor rank
Simulates a realistic workload in quantum chemistry calculations

SLIDE 24

TESLA K20X

SLIDE 25

TESLA M40

SLIDE 26

TESLA P100

SLIDE 27

PERFORMANCE DISTRIBUTION

SLIDE 28

BENCHMARK 3

Set of 57 tensor transposes from TTC: P. Springer, J. R. Hammond, and P. Bientinesi. TTC: A high-performance compiler for tensor transpositions. CoRR, 2016. http://arxiv.org/abs/1603.02297
Somewhat "easy" benchmark due to the small number of permutations

SLIDE 29

TESLA K40M

TTC data from: Paul Springer, Aravind Sankaran, and Paolo Bientinesi. 2016. TTC: A Tensor Transposition Compiler for Multiple Architectures. https://arxiv.org/abs/1607.01249

TTC average: 140 GiB/s
cuTT average: 144 GiB/s

SLIDE 30

BENCHMARK 4

Real-world tensor contractions performed with TAL-SH (Tensor Algebra Library for Shared Memory Computers) by Dmitry I. Lyakh at Oak Ridge National Laboratory
9306 random permutations on tensors up to rank 8
Matrix multiply performed using cuBLAS

SLIDE 31

TESLA K20X

[Figure: Tesla K20X, Benchmark 4; (a) single precision, (b) double precision; GFlop/s and percentage of maximum performance vs. arithmetic intensity, with best/average/worst curves]

The GPU evaluates E = E + M * S
Arithmetic intensity = 2 * sqrt(vol(E) * vol(M) * vol(S)) / (vol(E) + vol(M) + vol(S))

SLIDE 32

TESLA M40

Single precision

SLIDE 33

TESLA P100

[Figure: Tesla P100, Benchmark 4; (a) single precision, (b) double precision; GFlop/s and percentage of maximum performance vs. arithmetic intensity, with best/average/worst curves]

SLIDE 34

CONCLUSIONS & ACKNOWLEDGEMENTS

SLIDE 35

CONCLUSIONS

Fully runtime library for high-performance tensor transposes on NVIDIA GPUs
Extensive benchmarking
Achieves a median of 70-80% of the maximum achievable memory bandwidth
Performance equals or exceeds that of the compiler-based approach (TTC)
Enables close-to-peak-FLOP tensor contractions on the P100
Integrated as part of TAL-SH (https://github.com/DmitryLyakh/TAL_SH)
Work underway for use in NWCHEM, DIRAC, LS-DALTON, and ACES-IV
Source code available at: https://github.com/ap-hynninen/cutt
Manuscript available at: https://arxiv.org/abs/1705.01598

SLIDE 36

ACKNOWLEDGEMENTS

Dmitry I. Lyakh at the Oak Ridge Leadership Computing Facility, ORNL
ORNL, where 80% of this work was done
