

SLIDE 1

Super GPU & Super Kernels: Make programming of multi-GPU systems easy

Michael Frumkin, May 8, 2017

SLIDE 2

AGENDA

  • Why super GPU is needed
  • Extending CUDA view into clusters
  • Example: Sparse Matrix Vector Multiplication
  • Implications for SW and HW
  • Example: FFT
  • Example: Caffe AlexNet
  • Take-aways

SLIDE 3

MULTI-GPU SYSTEMS

Multi-GPU nodes are here, and more are coming:

  • DGX-1: 8 GPUs
  • Coming: Summit and Aurora, about 15 K GPUs
  • GPU-enabled clusters

We need to orchestrate computations across these GPUs. The GPUs are connected by an NVLink-based network, supported by the MPI-like library NCCL.
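As background (not from the slides), a minimal sketch of setting up one NCCL communicator per visible GPU in a single process, using the standard NCCL 2 API (error handling omitted):

  #include <cuda_runtime.h>
  #include <nccl.h>
  #include <vector>

  // Create one NCCL communicator per visible GPU (single process, multi-GPU).
  std::vector<ncclComm_t> InitComms() {
      int ndev = 0;
      cudaGetDeviceCount(&ndev);
      std::vector<ncclComm_t> comms(ndev);
      std::vector<int> devs(ndev);
      for (int i = 0; i < ndev; ++i) devs[i] = i;
      ncclCommInitAll(comms.data(), ndev, devs.data());  // collectives (AllReduce, ...) now usable
      return comms;
  }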

SLIDE 4

WORK RASTERIZATION

// Kernel invocation with a 3 x 2 grid of blocks, each with 4 x 3 threads
dim3 grid(3, 2, 1);
dim3 threads(4, 3, 1);
MatrixAdd<<<grid, threads>>>(A, B, C);

Rasterization distributes computations across thread blocks. CUDA's success in programming massive numbers of threads can be extended to multi-GPU systems: tiling GPUs into a super-GPU is a logical next step in scaling.
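For completeness, a minimal sketch of what the MatrixAdd kernel launched above might look like (the kernel body is not shown on the slide; the matrix is assumed to be 12 columns by 6 rows, matching the 3 x 2 grid of 4 x 3-thread blocks):

  // Each thread rasterizes its (block, thread) coordinates into one matrix element.
  __global__ void MatrixAdd(const float* A, const float* B, float* C) {
      int col   = blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. 11
      int row   = blockIdx.y * blockDim.y + threadIdx.y;  // 0 .. 5
      int width = gridDim.x * blockDim.x;                 // 12 columns
      int idx   = row * width + col;
      C[idx] = A[idx] + B[idx];
  }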

SLIDE 5

WORK SUPER RASTERIZATION

dim3 sblock(8, 1, 1);

__host__ void SpMVKernelS(dim3 sblock, float** d_matr, int** d_idx, …) {
  #pragma omp parallel for num_threads(sblock.x)
  for (int i = 0; i < sblock.x; ++i) {
    cudaSetDevice(i);
    SpMVKernel<<<grid, threads>>>(d_matr[i], d_idx[i], …);
  }
}

[Diagram: a node of GP100 GPUs tiled into a super-GPU; example superblock shapes: dim3 sblock(8), dim3 sblock(4, 3), dim3 sblock(2, 2, 2)]
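As an illustration (not from the slides), a multi-dimensional superblock coordinate could be rasterized onto a device index the same way CUDA linearizes block coordinates within a grid, with x varying fastest:

  // Hypothetical helper: map superblock coordinate (x, y, z) to a device id.
  __host__ int DeviceIndex(dim3 sblock, int x, int y, int z) {
      return (z * sblock.y + y) * sblock.x + x;   // 0 .. sblock.x * sblock.y * sblock.z - 1
  }
  // e.g., with dim3 sblock(4, 2, 1), coordinate (3, 1, 0) maps to device 7.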

SLIDE 6

SPARSE MATRIX-VECTOR MULTIPLICATION (SPMV)

No cross-GPU communications. Super-linear speedup: 8.4x on DGX-1 and 4.3x on PLX-connected K40m.

[Chart: single-precision SpMV, 22 M nnz, 1 M rows; GFLOPs vs. number of GPUs (1-8); the K40m PLX node scales from 3.48 to 14.96 GFLOPs, the DGX-1 node from 13.3 to 112.3 GFLOPs]
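A minimal sketch (not from the slides) of the kind of partitioning that makes the superkernel communication-free: each GPU owns a contiguous block of rows of the sparse matrix plus a full copy of the input vector, so the per-GPU SpMV needs no cross-GPU traffic.

  #include <algorithm>

  // Hypothetical host-side row partition: the rows are split evenly across the
  // sblock.x GPUs; each GPU later receives only its rows (and the whole vector x).
  void PartitionRows(dim3 sblock, int num_rows, int* row_begin, int* row_end) {
      int ngpus        = (int)sblock.x;
      int rows_per_gpu = (num_rows + ngpus - 1) / ngpus;        // ceiling division
      for (int i = 0; i < ngpus; ++i) {
          row_begin[i] = std::min(num_rows, i * rows_per_gpu);
          row_end[i]   = std::min(num_rows, (i + 1) * rows_per_gpu);
      }
  }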

SLIDE 7

IMPLICATIONS FOR SW AND HW

The driver has an option to recognize super-kernels and optimize their launch

  • One option: CudaLaunchKernelCooperative
  • Move some load needed for kernel launch to GPUs

Vectorize kernel launches

  • PCIe supports broadcast

Allocation of page tables can be directed by the superblock
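As a hedged illustration of the cooperative-launch option: CUDA 9 later shipped cudaLaunchCooperativeKernelMultiDevice, which launches the same kernel on several devices in a single call and is close to the option named above. A rough sketch, assuming ndev GPUs, per-device non-default streams streams[i], and per-device argument arrays kernel_args[i]:

  // One cooperative launch spanning ndev GPUs (CUDA 9+; sketch only).
  cudaLaunchParams params[8];                       // assumes ndev <= 8, e.g. a DGX-1
  for (int i = 0; i < ndev; ++i) {
      params[i].func      = (void*)SpMVKernel;      // same kernel on every device
      params[i].gridDim   = grid;
      params[i].blockDim  = threads;
      params[i].args      = kernel_args[i];         // per-device kernel arguments
      params[i].sharedMem = 0;
      params[i].stream    = streams[i];             // must not be the default stream
  }
  cudaLaunchCooperativeKernelMultiDevice(params, ndev);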

SLIDE 8

FFT SUPERKERNEL

dim3 sblock(4, 2, 1);
CopyDataToGPUs(sblock, r * s, h_src, d_dst);
ButterflyRightWing(sblock, r, s, d_dst, d_res);
GlobalTranspose<T2>(sblock, r, s, d_res, d_tsr, d_tmp, handle);
ButterflyLeftWing(sblock, r, s, d_tmp, d_dst);
CopyDataFromGPU(sblock, r * s, d_dst, h_res);
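The helpers above are the author's; as a sketch under stated assumptions, CopyDataToGPUs could scatter contiguous chunks of the host signal to the GPUs of the superblock like this (double2 stands in for the double-complex element type T2; pinned h_src is needed for true asynchronous copies):

  // Hypothetical sketch: split the n = r * s points evenly across the GPUs of the
  // superblock and copy each chunk to its device asynchronously.
  void CopyDataToGPUs(dim3 sblock, size_t n, const double2* h_src, double2** d_dst) {
      int ngpus    = sblock.x * sblock.y * sblock.z;
      size_t chunk = n / ngpus;                     // assumes n is divisible by ngpus
      for (int i = 0; i < ngpus; ++i) {
          cudaSetDevice(i);
          cudaMemcpyAsync(d_dst[i], h_src + i * chunk,
                          chunk * sizeof(double2), cudaMemcpyHostToDevice);
      }
  }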

F_{rs} = (F_r \otimes I_s) \, D_s^r \, (I_r \otimes F_s)

where F_n is the n-point DFT matrix, I_n the identity, \otimes the Kronecker product, and D_s^r the diagonal matrix of twiddle factors.

[Chart: "32 M points FFT Double Complex DGX1", series "FFT GP100"; GFLOPS (0-1200) vs. number of GPUs (1-8)]

SLIDE 9

TRAINING NEURAL NETWORKS (CAFFE)

Z_{o,l,q,r} = \sum_{d=1}^{D} \sum_{s=1}^{S} \sum_{t=1}^{T} Y_{o,d,q+s,r+t} \, X_{l,d,s,t}

Many layers; big data volumes have to pass through. The most computationally expensive layers are the convolutional ones. Main ops: Gemm, Winograd, FFT. Data-parallel distribution requires an AllReduce to update the weights.
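Since data-parallel training needs an AllReduce over the weight gradients, here is a minimal sketch using NCCL (the MPI-like library mentioned earlier); comms[i], streams[i], the gradient buffers d_grad[i], and grad_count are assumed to exist:

  // Sum the per-GPU gradients in place after the backward pass (single process,
  // one communicator per device, so the calls are wrapped in an NCCL group).
  ncclGroupStart();
  for (int i = 0; i < ngpus; ++i) {
      ncclAllReduce(d_grad[i], d_grad[i], grad_count,
                    ncclFloat, ncclSum, comms[i], streams[i]);
  }
  ncclGroupEnd();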

SLIDE 10

CAFFE SUPERKERNEL

[Chart: AlexNet scalability; iterations per second vs. number of GPUs (1, 2, 4, 8)]

  • In train(), create dim3 sblock = get_gpus();
  • Refer to sblock.Volume() instead of gpus.size();
  • Pass sblock to the P2Psync constructor:
    P2Psync::P2Psync(solver, root_id, sblock, solver->params);
  • Use sblock in P2Psync::Run()
  • Alternative: caffe_gpu_gemm<float>(sblock, gemm_params, …)
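A rough sketch of how these steps might fit together in train() (hypothetical code following the slide's proposal; get_gpus(), the dim3 superblock, and the extended P2Psync constructor are the author's proposed additions, not existing Caffe API):

  // Proposed superblock-based multi-GPU path in train() (sketch only).
  dim3 sblock = get_gpus();                        // e.g., (8, 1, 1) on a DGX-1
  int ngpus = sblock.x * sblock.y * sblock.z;      // sblock.Volume() on the slide
  if (ngpus > 1) {
      P2Psync sync(solver, root_id, sblock, solver->params);
      sync.Run();                                  // rasterizes per-GPU solvers over sblock
  } else {
      solver->Solve();
  }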

SLIDE 11

TAKE-AWAYS

  • Multi-GPU programming using superblocks is easy
  • Rasterization of the superkernel is as intuitive as rasterization of CTAs
    • Results in good scalability, assuming good load balance and small communications (SpMV)
  • Transparently distributes work, letting you concentrate on optimization of the communications
    • Pipeline communications and computations
  • Data distributions can be described by the superblock

SLIDE 12

QUESTIONS?

Michael Frumkin mfrumkin@nvidia.com
