Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA
S7255: CUTT: A HIGH-PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS
MOTIVATION
Tensor contractions are the most computationally intensive part of quantum many-body methods used in NWCHEM, DIRAC, LS-DALTON, and ACES-IV:

E(a,b,c) += A(a,i,b,j) B(j,i,c)

Sum over the repeated indices i and j
Evaluating tensor contractions directly requires implementing a lot of hard-to-write custom code
The indirect approach transposes the tensors and uses efficient linear algebra libraries (such as cuBLAS) to perform the matrix multiply
Tensor contractions
TENSOR CONTRACTIONS
Reduction over a pair of indices shared by two tensors, e.g.

E(a,b,c) += A(a,i,b,j) B(j,i,c)

This can be evaluated as

A(a,i,b,j) -> A'(i,j,a,b)            # tensor transpose
B(j,i,c)   -> B'(i,j,c)              # tensor transpose
E'(a,b,c) += A'(i,j,a,b) B'(i,j,c)   # matrix multiply
E'(a,b,c)  -> E(a,b,c)               # tensor transpose

Able to take advantage of the high-performance matrix multiply routines provided by cuBLAS
Indirect approach
PREVIOUS WORK
The previous implementation by my co-author [1] was sub-optimal on GPU platforms
The work in [2] relies on a compiler to build custom kernels, i.e. it is not a runtime approach
[1] Dmitry I. Lyakh. 2015. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. Computer Physics Communications 189 (2015), 84-91. DOI: http://dx.doi.org/10.1016/j.cpc.2014.12.013
[2] Paul Springer, Aravind Sankaran, and Paolo Bientinesi. 2016. TTC: A Tensor Transposition Compiler for Multiple Architectures
No runtime high-performance tensor transpose library exists for GPUs
TENSOR TRANSPOSE ALGORITHMS
MATRIX TRANSPOSE: TILED ALGORITHM
Step 1: Read a 32x32 tile from global memory into shared memory
__syncthreads()
Step 2: Read the shared memory tile in transposed order and write it to global memory
Mark Harris, "An Efficient Matrix Transpose in CUDA C/C++", Parallel Forall Blog: https://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc
TILED ALGORITHM
[Diagram: rank-6 tensor (1,2,3,4,5,6) transposed to (5,4,1,6,3,2); dimensions 1 and 5 form the shared-memory tile, the remaining volume is looped over by the thread block (TB)]
Constant shared memory usage (~32x32)
Performs well when d1 and d5 are fairly large (~32)
Poor performance for small (2-8) dimensions
Would it be possible to pack multiple small dimensions into shared memory?
PACKED ALGORITHM
[Diagram: dimensions 1, 2, and 5 are packed into shared memory; the thread block (TB) loops over the remaining volume (3, 4, 6)]
No longer uses a 32x32 shared memory tile
Loads entire dimensions into shared memory (not tiled)
As much shared memory is allocated as it takes to store the elements
Must choose which dimensions to pack
New problem: what if e.g. d5 is very large?
PACKED-SPLIT ALGORITHM
[Diagram: the largest dimension (5) is split; shared memory holds the small dimensions plus one piece of dimension 5, and the thread block (TB) loops over the remaining volume]
Split the largest dimension
The number of splits is determined by the shared memory size
Must choose which dimensions to pack, and the number of splits
MEMORY POSITION CALCULATION
GLOBAL MEMORY POSITION CALCULATION
[Diagram: shared-memory read positions (shRead) for the packed dimensions of the (1,2,3,4,5,6) -> (5,4,1,6,3,2) transpose]
H = number of elements in shared memory
M = number of elements in the loop volume
Need to convert the scalar positions s = 0, ..., H-1 and p = 0, ..., M-1 into global memory positions:
glRead = global memory read position
glWrite = global memory write position
The global memory position is split into:
glRead = glMinorRead(s) + glMajorRead(p)
glWrite = glMinorWrite(s) + glMajorWrite(p)
MAJOR POSITION CALCULATION
glMajorRead(p) = sum_{i=1}^{n} ((p / c_i) mod d_i) * t_i
// int p = 0,...,M-1
// int c[n] = {1, d3, d3*d4}
// int d[n] = {d3, d4, d6}
// int t[n] = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5}
int glMajorRead = 0;
for (int i = 0; i < n; i++) {
  glMajorRead += ((p / c[i]) % d[i]) * t[i];
}
Cost is O(n) per position
Observation: p is constant within a thread block (and therefore within a warp)
glMajorRead(p), p = 0, ..., M-1
WARP-PARALLEL POSITION CALCULATION
// int p = 0,...,M-1
// Each warp lane holds one entry of c, d, t:
// int c = {1, d3, d3*d4, 1, ..., 1}
// int d = {d3, d4, d6, 1, ..., 1}
// int t = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5,...}
int glMajorRead = ((p / c) % d) * t;
for (int i = 16; i >= 1; i /= 2) {
  glMajorRead += __shfl_xor(glMajorRead, i);
}
glMajorRead(p) = sum_{i=1}^{n} ((p / c_i) mod d_i) * t_i
A single divide, modulo, and multiply per lane: O(1), i.e. performance is independent of tensor rank
Works up to n = 32
MINOR POSITION CALCULATION
[Diagram: shared-memory layout of the packed dimensions (1, 2, 5)]
For the Tiled algorithm this is trivial
For Packed and Packed-Split, pre-compute the positions and store them in registers
Number of registers per thread: numReg = (H - 1)/blockDim.x + 1
int glMinorRead[numReg]
int shRead[numReg]
int glMinorWrite[numReg]
The kernel is templated on numReg
ALGORITHM & PARAMETER CHOICE
CHOOSING THE BEST ALGORITHM
Algorithm choice: Tiled, Packed, or Packed-Split
Tiled: no free parameters
Packed: input and output ranks
Packed-Split: input and output ranks, number of splits
Large performance differences between different algorithm and parameter choices
CUTT PLANS
Measured plans perform all possible tensor transposes and choose the best performing plan. LARGE overhead
Heuristic plans choose the best plan by estimating the transpose runtime with an analytical GPU performance model. SMALL overhead
Heuristic plans must be used in QM calculations
Getting the heuristic planning to work accurately was a major hurdle
A better approach is needed for choosing the heuristic plans (machine learning?)
cuttResult cuttPlan(cuttHandle* handle, int rank, int* dim,
                    int* permutation, size_t sizeofType,
                    cudaStream_t stream);
cuttResult cuttPlanMeasure(cuttHandle* handle, int rank, int* dim,
                           int* permutation, size_t sizeofType,
                           cudaStream_t stream, void* idata, void* odata);
BENCHMARKS
BENCHMARK 1
Tensor ranks 2 to 7
Ratio between largest and smallest tensor dimensions 1:1, 5:1, and 15:1
Tensor volume normally distributed with average 200M elements and standard deviation of 20M elements
500 random permutations for each tensor rank and ratio
9000 tensor transposes in total
TESLA K20X
* Maximum bandwidth measured using GPU-STREAM: Tom Deakin, James Price, Matt J. Martineau, and Simon N. McIntosh-Smith. 2016. GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. Paper presented at the P^3MA Workshop at ISC High Performance, Frankfurt, Germany
TESLA M40
TESLA P100
BENCHMARK 2
Tensor ranks 8 and 12
Rank 8: (5, 3, 2, 4, 35, 33, 37, 40), 200M elements
Rank 12: (2, 3, 4, 3, 2, 2, 3, 2, 20, 18, 22, 24), 328M elements
500 random permutations for both tensor ranks
Simulates a realistic workload in quantum chemistry calculations
TESLA K20X
TESLA M40
TESLA P100
PERFORMANCE DISTRIBUTION
BENCHMARK 3
Set of 57 tensor transposes from TTC: P. Springer, J. R. Hammond, and P. Bientinesi. TTC: A high-performance compiler for tensor transpositions. CoRR, 2016. http://arxiv.org/abs/1603.02297
A somewhat "easy" benchmark due to the small number of permutations
TESLA K40M
TTC data from: Paul Springer, Aravind Sankaran, and Paolo Bientinesi. 2016. TTC: A Tensor Transposition Compiler for Multiple Architectures. https://arxiv.org/abs/1607.01249
TTC average 140 GiB/s cuTT average 144 GiB/s
BENCHMARK 4
Real-world tensor contractions performed with TAL-SH (Tensor Algebra Library for Shared Memory Computers), by Dmitry I. Lyakh at Oak Ridge National Laboratory
9306 random permutations on tensors up to rank 8
Matrix multiply performed using cuBLAS
TESLA K20X
[Plots: GFlop/s and percentage of maximum performance vs. arithmetic intensity, showing best, average, and worst cases: (a) single precision, (b) double precision]

For the GPU contraction E = E + A * B, the arithmetic intensity is

2 * sqrt(vol(E) * vol(A) * vol(B)) / (vol(E) + vol(A) + vol(B))
TESLA M40
Single precision
TESLA P100
[Plots: GFlop/s and percentage of maximum performance vs. arithmetic intensity, showing best, average, and worst cases: (a) single precision, (b) double precision]
CONCLUSIONS & ACKNOWLEDGEMENTS
CONCLUSIONS
Fully runtime library for high-performance tensor transposes on NVIDIA GPUs
Extensive benchmarking: achieves a median of 70-80% of the maximum achievable memory bandwidth
Performance equals or exceeds that of the compiler-based approach (TTC)
Enables close-to-peak-FLOP tensor contractions on P100
Integrated into TAL-SH (https://github.com/DmitryLyakh/TAL_SH)
Work underway to adopt it in NWCHEM, DIRAC, LS-DALTON, and ACES-IV
Source code available at: https://github.com/ap-hynninen/cutt
Manuscript available at: https://arxiv.org/abs/1705.01598
ACKNOWLEDGEMENTS
Dmitry I. Lyakh at the Oak Ridge Leadership Computing Facility, ORNL, where 80% of this work was done