Super GPU & Super Kernels: Make programming of multi-GPU systems easy
Michael Frumkin, May 8, 2017
AGENDA
Why super GPU is needed
Extending CUDA view into clusters
Example: Sparse Matrix Vector Multiplication
Implications for SW and HW
Example: FFT
Example: Caffe AlexNet
Take-aways
MULTI-GPU SYSTEMS
Multi-GPU nodes are here and more coming
DGX-1: 8 GPUs
Coming: Summit and Aurora, about 15 K GPUs
GPU-enabled clusters
Need to orchestrate GPU computations
GPUs are connected by a network based on NVLink
Supported by the MPI-like library NCCL
WORK RASTERIZATION
// Kernel invocation with a 3 x 2 grid of 4 x 3-thread blocks
dim3 grid(3, 2, 1);
dim3 threads(4, 3, 1);
MatrixAdd<<<grid, threads>>>(A, B, C);
Allows distributing computations. CUDA's success in programming a massive number of threads can be extended to multi-GPU systems. Tiling GPUs into a super-GPU seems like a logical step in scaling.
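A minimal sketch of a MatrixAdd kernel matching the launch above; the element-wise body and the derived matrix dimensions are illustrative, not from the slides:

// Hypothetical element-wise matrix add; the dimensions come from the launch
// configuration (a 3 x 2 grid of 4 x 3-thread blocks covers a 12 x 6 matrix),
// so every thread owns exactly one element and no bounds check is needed.
__global__ void MatrixAdd(const float* A, const float* B, float* C) {
  int width = gridDim.x * blockDim.x;                 // matrix width in elements
  int col = blockIdx.x * blockDim.x + threadIdx.x;    // column handled by this thread
  int row = blockIdx.y * blockDim.y + threadIdx.y;    // row handled by this thread
  int idx = row * width + col;
  C[idx] = A[idx] + B[idx];
}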
WORK SUPER RASTERIZATION
dim3 sblock(8, 1, 1);

__host__ void SpMVKernelS(dim3 sblock, float** d_matr, int** d_idx, …) {
  #pragma omp parallel for num_threads(sblock.x)
  for (int i = 0; i < sblock.x; ++i) {
    cudaSetDevice(i);
    SpMVKernel<<<grid, threads>>>(d_matr[i], d_idx[i], …);
  }
}
[Figure: GP100 GPUs tiled into superblocks of shape dim3 sblock(8), dim3 sblock(4, 3), and dim3 sblock(2, 2, 2)]
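For the multi-dimensional superblocks shown in the figure, a small helper (hypothetical, not a CUDA API) can rasterize a linear GPU index into 3-D superblock coordinates, mirroring how CUDA rasterizes CTAs within a grid:

// Hypothetical helper: map a linear GPU index to its 3-D superblock coordinate,
// the same way a linear CTA id maps to blockIdx within a grid.
#include <cuda_runtime.h>

__host__ dim3 SuperBlockIdx(int gpu, dim3 sblock) {
  dim3 idx;
  idx.x = gpu % sblock.x;
  idx.y = (gpu / sblock.x) % sblock.y;
  idx.z = gpu / (sblock.x * sblock.y);
  return idx;
}

// Example: with dim3 sblock(2, 2, 2), GPU 5 maps to superblock coordinate (1, 0, 1).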
SPARSE MATRIX-VECTOR MULTIPLICATION (SPMV)
No cross-GPU communication. Super-linear speedup: 8.4x on DGX-1, 4.3x on PLX-connected K40m.
[Chart: Single-precision SpMV, 22 M nnz, 1 M rows; GFLOPS vs. number of GPUs (1 to 8). PLX-connected K40m node: 3.48 to 14.96 GFLOPS; DGX-1 node: 13.3 to 112.3 GFLOPS]
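A minimal per-GPU CSR SpMV kernel sketch (hypothetical; the slides only show the host-side launch): each GPU owns a contiguous block of rows plus a full copy of the input vector, which is why no cross-GPU communication is needed.

// Hypothetical per-GPU CSR SpMV kernel: one thread per local row.
// d_row_ptr, d_idx, d_matr hold this GPU's row partition; d_x is the full input vector.
__global__ void SpMVKernel(const int* d_row_ptr, const int* d_idx,
                           const float* d_matr, const float* d_x,
                           float* d_y, int local_rows) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= local_rows) return;
  float sum = 0.0f;
  for (int k = d_row_ptr[row]; k < d_row_ptr[row + 1]; ++k)
    sum += d_matr[k] * d_x[d_idx[k]];
  d_y[row] = sum;
}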
IMPLICATIONS FOR SW AND HW
The driver could have an option to recognize super-kernels and optimize their launch
- One option: CudaLaunchKernelCooperative
- Move some of the work needed for kernel launch onto the GPUs
Vectorize kernel launches
- PCIe supports broadcast
Allocation of page tables can be directed by the superblock
FFT SUPERKERNEL
dim3 sblock(4, 2, 1);
CopyDataToGPUs(sblock, r * s, h_src, d_dst);
ButterflyRightWing(sblock, r, s, d_dst, d_res);
GlobalTranspose<T2>(sblock, r, s, d_res, d_tsr, d_tmp, handle);
ButterflyLeftWing(sblock, r, s, d_tmp, d_dst);
CopyDataFromGPU(sblock, r * s, d_dst, h_res);
$F_{rs} = (F_r \otimes I_s)\, D_{s,r}\, (I_r \otimes F_s)$
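A minimal sketch (hypothetical, not from the slides) of the data exchange behind a GlobalTranspose step over a 1-D superblock, using peer-to-peer copies; only the coarse tile exchange is shown, and any local transpose within each tile is omitted:

// Hypothetical cross-GPU tile exchange for the transpose between the two butterfly wings.
// Assumes peer access was enabled (cudaDeviceEnablePeerAccess) and that d_res[g] and
// d_tmp[g] are per-GPU device buffers holding num_gpus tiles of tile_elems elements each.
#include <cuda_runtime.h>

void GlobalTransposeSketch(int num_gpus, size_t tile_elems,
                           double2** d_res, double2** d_tmp) {
  for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    for (int p = 0; p < num_gpus; ++p) {
      // Tile p of GPU g becomes tile g of GPU p.
      cudaMemcpyPeer(d_tmp[p] + g * tile_elems, p,
                     d_res[g] + p * tile_elems, g,
                     tile_elems * sizeof(double2));
    }
  }
  for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    cudaDeviceSynchronize();
  }
}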
[Chart: 32 M-point double-complex FFT on DGX-1; GFLOPS vs. number of GPUs, with single-GP100 FFT as reference]
TRAINING NEURAL NETWORKS (CAFFE)
$\mathrm{out}_{i,j} = \sum_{c=1}^{D} \sum_{s=1}^{K} \sum_{t=1}^{K} I_{c,\,i+s,\,j+t}\, W_{c,s,t}$

Many layers; big data volumes have to pass through
The most computationally expensive are the convolutional layers
Main ops: GEMM, Winograd, FFT
Data-parallel distribution requires AllReduce to update the weights
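A minimal sketch of the AllReduce step for data-parallel weight-gradient exchange, using NCCL (mentioned earlier in the deck); the function name, buffers, and setup are illustrative, not from the slides:

// Hypothetical sketch: sum weight gradients across all GPUs in the superblock with NCCL,
// so every GPU ends up with the same aggregated update.
// Assumes comms[g] and streams[g] were created beforehand (e.g. via ncclCommInitAll).
#include <nccl.h>
#include <cuda_runtime.h>

void AllReduceGradients(int num_gpus, size_t num_weights,
                        float** d_grad, ncclComm_t* comms, cudaStream_t* streams) {
  ncclGroupStart();
  for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    // In-place sum of this layer's gradients over all GPUs.
    ncclAllReduce(d_grad[g], d_grad[g], num_weights, ncclFloat, ncclSum,
                  comms[g], streams[g]);
  }
  ncclGroupEnd();
  for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    cudaStreamSynchronize(streams[g]);
  }
}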
CAFFE SUPERKERNEL
[Chart: AlexNet scalability; iterations per second vs. number of GPUs (1, 2, 4, 8)]
In train(), create dim3 sblock = get_gpus();
Refer to sblock.Volume() instead of gpus.size();
Pass sblock to the P2Psync constructor:
P2Psync::P2Psync(solver, root_id, sblock, solver->params);
Use sblock in P2Psync::Run();
Alternative: caffe_gpu_gemm<float>(sblock, gemm_params, …)
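A small sketch of the helpers assumed by the steps above (hypothetical, not part of CUDA or Caffe); the slides use a member-style sblock.Volume(), while CUDA's dim3 is a plain struct, so a free function is shown here instead:

// Hypothetical helpers: get_gpus() builds a 1-D superblock from the visible devices,
// and Volume() plays the role of gpus.size() for a dim3 superblock.
#include <cuda_runtime.h>

inline unsigned Volume(const dim3& sblock) {
  return sblock.x * sblock.y * sblock.z;
}

inline dim3 get_gpus() {
  int n = 0;
  cudaGetDeviceCount(&n);
  return dim3(static_cast<unsigned>(n), 1, 1);
}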
TAKE-AWAYS
Multi-GPU programming using superblocks is easy
Rasterization of the superkernel is as intuitive as rasterization of CTAs
- Results in good scalability, assuming good load balance and little communication (SpMV)
Transparently distributes work, allowing one to concentrate on optimization of the communication
- Pipeline communication and computation
Data distributions can be described by the superblock
QUESTIONS?
Michael Frumkin mfrumkin@nvidia.com