

SLIDE 1

Super GPU & Super Kernels: Make programming of multi-GPU systems easy

Michael Frumkin, May 8, 2017

SLIDE 2

AGENDA

  • Why super GPU is needed
  • Extending CUDA view into clusters
  • Example: Sparse Matrix Vector Multiplication
  • Implications for SW and HW
  • Example: FFT
  • Example: Caffe AlexNet
  • Take-aways

SLIDE 3

MULTI-GPU SYSTEMS

Multi-GPU nodes are here, and more are coming:

  • DGX-1: 8 GPUs
  • Coming: Summit and Aurora, about 15 K GPUs
  • GPU-enabled clusters

We need to orchestrate computations across these GPUs. The GPUs are connected by an NVLink-based network, supported by the MPI-like library NCCL.
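As background (not from the slides), a minimal sketch of setting up one NCCL communicator per visible GPU in a single process, using the standard NCCL 2 API (error handling omitted):

  #include <cuda_runtime.h>
  #include <nccl.h>
  #include <vector>

  // Create one NCCL communicator per visible GPU (single process, multi-GPU).
  std::vector<ncclComm_t> InitComms() {
      int ndev = 0;
      cudaGetDeviceCount(&ndev);
      std::vector<ncclComm_t> comms(ndev);
      std::vector<int> devs(ndev);
      for (int i = 0; i < ndev; ++i) devs[i] = i;
      ncclCommInitAll(comms.data(), ndev, devs.data());  // collectives (AllReduce, ...) now usable
      return comms;
  }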

SLIDE 4

WORK RASTERIZATION

// Kernel invocation with a 3 x 2 grid of blocks, each with 4 x 3 threads
dim3 grid(3, 2, 1);
dim3 threads(4, 3, 1);
MatrixAdd<<<grid, threads>>>(A, B, C);

Rasterization distributes computations across thread blocks. CUDA's success in programming massive numbers of threads can be extended to multi-GPU systems: tiling GPUs into a super-GPU is a logical next step in scaling.
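For completeness, a minimal sketch of what the MatrixAdd kernel launched above might look like (the kernel body is not shown on the slide; the matrix is assumed to be 12 columns by 6 rows, matching the 3 x 2 grid of 4 x 3-thread blocks):

  // Each thread rasterizes its (block, thread) coordinates into one matrix element.
  __global__ void MatrixAdd(const float* A, const float* B, float* C) {
      int col   = blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. 11
      int row   = blockIdx.y * blockDim.y + threadIdx.y;  // 0 .. 5
      int width = gridDim.x * blockDim.x;                 // 12 columns
      int idx   = row * width + col;
      C[idx] = A[idx] + B[idx];
  }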

SLIDE 5

WORK SUPER RASTERIZATION

dim3 sblock(8, 1, 1);

__host__ void SpMVKernelS(dim3 sblock, float** d_matr, int** d_idx, …) {
  #pragma omp parallel for num_threads(sblock.x)
  for (int i = 0; i < sblock.x; ++i) {
    cudaSetDevice(i);
    SpMVKernel<<<grid, threads>>>(d_matr[i], d_idx[i], …);
  }
}

[Diagram: a node of GP100 GPUs tiled into a super-GPU; example superblock shapes: dim3 sblock(8), dim3 sblock(4, 3), dim3 sblock(2, 2, 2)]
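As an illustration (not from the slides), a multi-dimensional superblock coordinate could be rasterized onto a device index the same way CUDA linearizes block coordinates within a grid, with x varying fastest:

  // Hypothetical helper: map superblock coordinate (x, y, z) to a device id.
  __host__ int DeviceIndex(dim3 sblock, int x, int y, int z) {
      return (z * sblock.y + y) * sblock.x + x;   // 0 .. sblock.x * sblock.y * sblock.z - 1
  }
  // e.g., with dim3 sblock(4, 2, 1), coordinate (3, 1, 0) maps to device 7.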

SLIDE 6

SPARSE MATRIX-VECTOR MULTIPLICATION (SPMV)

No cross-GPU communications. Super-linear speedup: 8.4x on DGX-1 and 4.3x on PLX-connected K40m.

[Chart: single-precision SpMV, 22 M nnz, 1 M rows; GFLOPs vs. number of GPUs (1-8); the K40m PLX node scales from 3.48 to 14.96 GFLOPs, the DGX-1 node from 13.3 to 112.3 GFLOPs]
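A minimal sketch (not from the slides) of the kind of partitioning that makes the superkernel communication-free: each GPU owns a contiguous block of rows of the sparse matrix plus a full copy of the input vector, so the per-GPU SpMV needs no cross-GPU traffic.

  #include <algorithm>

  // Hypothetical host-side row partition: the rows are split evenly across the
  // sblock.x GPUs; each GPU later receives only its rows (and the whole vector x).
  void PartitionRows(dim3 sblock, int num_rows, int* row_begin, int* row_end) {
      int ngpus        = (int)sblock.x;
      int rows_per_gpu = (num_rows + ngpus - 1) / ngpus;        // ceiling division
      for (int i = 0; i < ngpus; ++i) {
          row_begin[i] = std::min(num_rows, i * rows_per_gpu);
          row_end[i]   = std::min(num_rows, (i + 1) * rows_per_gpu);
      }
  }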

SLIDE 7

IMPLICATIONS FOR SW AND HW

The driver has an option to recognize super-kernels and optimize their launch

  • One option: CudaLaunchKernelCooperative
  • Move some load needed for kernel launch to GPUs

Vectorize kernel launches

  • PCIe supports broadcast

Allocation of page tables can be directed by the superblock
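As a hedged illustration of the cooperative-launch option: CUDA 9 later shipped cudaLaunchCooperativeKernelMultiDevice, which launches the same kernel on several devices in a single call and is close to the option named above. A rough sketch, assuming ndev GPUs, per-device non-default streams streams[i], and per-device argument arrays kernel_args[i]:

  // One cooperative launch spanning ndev GPUs (CUDA 9+; sketch only).
  cudaLaunchParams params[8];                       // assumes ndev <= 8, e.g. a DGX-1
  for (int i = 0; i < ndev; ++i) {
      params[i].func      = (void*)SpMVKernel;      // same kernel on every device
      params[i].gridDim   = grid;
      params[i].blockDim  = threads;
      params[i].args      = kernel_args[i];         // per-device kernel arguments
      params[i].sharedMem = 0;
      params[i].stream    = streams[i];             // must not be the default stream
  }
  cudaLaunchCooperativeKernelMultiDevice(params, ndev);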

SLIDE 8

FFT SUPERKERNEL

dim3 sblock(4, 2, 1);
CopyDataToGPUs(sblock, r * s, h_src, d_dst);
ButterflyRightWing(sblock, r, s, d_dst, d_res);
GlobalTranspose<T2>(sblock, r, s, d_res, d_tsr, d_tmp, handle);
ButterflyLeftWing(sblock, r, s, d_tmp, d_dst);
CopyDataFromGPU(sblock, r * s, d_dst, h_res);
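The helpers above are the author's; as a sketch under stated assumptions, CopyDataToGPUs could scatter contiguous chunks of the host signal to the GPUs of the superblock like this (double2 stands in for the double-complex element type T2; pinned h_src is needed for true asynchronous copies):

  // Hypothetical sketch: split the n = r * s points evenly across the GPUs of the
  // superblock and copy each chunk to its device asynchronously.
  void CopyDataToGPUs(dim3 sblock, size_t n, const double2* h_src, double2** d_dst) {
      int ngpus    = sblock.x * sblock.y * sblock.z;
      size_t chunk = n / ngpus;                     // assumes n is divisible by ngpus
      for (int i = 0; i < ngpus; ++i) {
          cudaSetDevice(i);
          cudaMemcpyAsync(d_dst[i], h_src + i * chunk,
                          chunk * sizeof(double2), cudaMemcpyHostToDevice);
      }
  }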

F_{rs} = (F_r \otimes I_s) \, D_s^r \, (I_r \otimes F_s)

where F_n is the n-point DFT matrix, I_n the identity, \otimes the Kronecker product, and D_s^r the diagonal matrix of twiddle factors.

[Chart: "32 M points FFT Double Complex DGX1", series "FFT GP100"; GFLOPS (0-1200) vs. number of GPUs (1-8)]

SLIDE 9

TRAINING NEURAL NETWORKS (CAFFE)

Z_{o,l,q,r} = \sum_{d=1}^{D} \sum_{s=1}^{S} \sum_{t=1}^{T} Y_{o,d,q+s,r+t} \, X_{l,d,s,t}

Many layers; big data volumes have to pass through. The most computationally expensive layers are the convolutional ones. Main ops: Gemm, Winograd, FFT. Data-parallel distribution requires an AllReduce to update the weights.
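Since data-parallel training needs an AllReduce over the weight gradients, here is a minimal sketch using NCCL (the MPI-like library mentioned earlier); comms[i], streams[i], the gradient buffers d_grad[i], and grad_count are assumed to exist:

  // Sum the per-GPU gradients in place after the backward pass (single process,
  // one communicator per device, so the calls are wrapped in an NCCL group).
  ncclGroupStart();
  for (int i = 0; i < ngpus; ++i) {
      ncclAllReduce(d_grad[i], d_grad[i], grad_count,
                    ncclFloat, ncclSum, comms[i], streams[i]);
  }
  ncclGroupEnd();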

SLIDE 10

CAFFE SUPERKERNEL

[Chart: AlexNet scalability; iterations per second vs. number of GPUs (1, 2, 4, 8)]

  • In train(), create dim3 sblock = get_gpus();
  • Refer to sblock.Volume() instead of gpus.size();
  • Pass sblock to the P2Psync constructor:
    P2Psync::P2Psync(solver, root_id, sblock, solver->params);
  • Use sblock in P2Psync::Run()
  • Alternative: caffe_gpu_gemm<float>(sblock, gemm_params, …)
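A rough sketch of how these steps might fit together in train() (hypothetical code following the slide's proposal; get_gpus(), the dim3 superblock, and the extended P2Psync constructor are the author's proposed additions, not existing Caffe API):

  // Proposed superblock-based multi-GPU path in train() (sketch only).
  dim3 sblock = get_gpus();                        // e.g., (8, 1, 1) on a DGX-1
  int ngpus = sblock.x * sblock.y * sblock.z;      // sblock.Volume() on the slide
  if (ngpus > 1) {
      P2Psync sync(solver, root_id, sblock, solver->params);
      sync.Run();                                  // rasterizes per-GPU solvers over sblock
  } else {
      solver->Solve();
  }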

SLIDE 11

TAKE-AWAYS

  • Multi-GPU programming using superblocks is easy
  • Rasterization of the superkernel is as intuitive as rasterization of CTAs
    • Results in good scalability, assuming good load balance and small communications (SpMV)
  • Transparently distributes work, letting you concentrate on optimization of the communications
    • Pipeline communications and computations
  • Data distributions can be described by the superblock

SLIDE 12

QUESTIONS?

Michael Frumkin mfrumkin@nvidia.com
