MULTI-GPU TRAINING WITH NCCL
Sylvain Jeaugey
MULTI-GPU COMPUTING
Harvesting the power of multiple GPUs
From 1 GPU, to multiple GPUs per system, to multiple systems connected together.
NCCL: NVIDIA Collective Communication Library
MULTI-GPU DL TRAINING
Single-GPU
Diagram: a database holding GBs of input data (images, sound, …) feeds batches (e.g. 256 images) through a forward/backward pass; the resulting gradients are used to update the parameters.
MULTI-GPU DL TRAINING
Data parallel
Diagram: each GPU runs the forward/backward pass on its own batch and produces local gradients; an NCCL Allreduce sums the gradients across GPUs, so every GPU applies the same update to its copy of the parameters. A sketch of this step follows below.
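To make this concrete, here is a minimal sketch of one data-parallel step from a single rank's point of view, assuming a communicator and stream have already been created. The buffer names and the sgdUpdate kernel are hypothetical; ncclAllReduce is the actual NCCL call.

#include <nccl.h>
#include <cuda_runtime.h>

// Hypothetical SGD update: params -= lr * (summed gradients / nranks).
__global__ void sgdUpdate(float* params, const float* grads,
                          float lr, float invRanks, size_t n) {
  size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) params[i] -= lr * invRanks * grads[i];
}

// One data-parallel training step for one rank, after the backward pass
// has filled `grads` with this GPU's local gradients.
void trainStep(float* params, float* grads, size_t count, int nranks,
               float lr, ncclComm_t comm, cudaStream_t stream) {
  // Sum the local gradients across all GPUs; the in-place result is
  // identical on every rank.
  ncclAllReduce(grads, grads, count, ncclFloat, ncclSum, comm, stream);
  // Apply the averaged global gradients to the local parameter copy.
  int threads = 256;
  int blocks = (int)((count + threads - 1) / threads);
  sgdUpdate<<<blocks, threads, 0, stream>>>(params, grads, lr,
                                            1.0f / nranks, count);
}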
NCCL
A multi-GPU communication library
Within a system: PCIe, NVLink, GPU Direct P2P.
Between systems: sockets (Ethernet) or InfiniBand, with GPU Direct RDMA.
NCCL
Architecture
Diagram: deep learning frameworks (TensorFlow (+Horovod), PyTorch, MXNet, Caffe2, Caffe, CNTK) sit on top of the CUDA libraries (NCCL, cuBLAS, cuDNN), which run on NVIDIA GPUs.
TIMELINE
NCCL history & roadmap
1.x: Intra-node communication
2.0: Inter-node communication
2.1: Improved latency
2.2: Aggregated operations
2.3: Large-scale algorithms
2.4: Point-to-point primitives (Send/Recv)
NCCL 2.0
Provide best performance to DL apps
Chart: Allreduce bandwidth (OMB, size=128MB), in GB/s: QPI 5, CPU 8, PCI switch 12, DGX-1 (Pascal) 62, DGX-1 (Volta) 132.
NCCL 2.1
ResNet50 buffer size
Latency is important in some workloads, e.g. ResNet50, in particular when reductions are done for each layer.
Chart: histogram of reduction buffer sizes (64 bytes to 2MB) and their number of occurrences for ResNet50 / MXNet, in FP32 and FP16; most per-layer reductions are small.
NCCL 2.1
Latency improvement
Chart: NCCL Allreduce latency (µs) on 2, 4, 8 and 16 GPUs, comparing 2.0.5 and 2.1.0, on 1 node (NVLink) and 2 nodes (NVLink + InfiniBand); 2.1.0 reduces latency by roughly 5x to 7x.
NCCL 2.2
Aggregated operations: principle
Diagram: the DL framework issues several ncclAllReduce calls, but only a single cudaLaunchKernel reaches CUDA.
Principle: merge multiple operations on the same CUDA device.
Pay the launch overhead only once (more operations per second) and use multiple NVLinks simultaneously (more bandwidth).
NCCL 2.2
Aggregated operations: overhead
Chart: per-operation time (µs) versus number of aggregated operations (1 to 256), for an 8-byte reduction on 8 GPUs; the per-operation cost drops as more operations are aggregated.
NCCL 2.2
Aggregated operations: usage
Use ncclGroupStart() / ncclGroupEnd() around the NCCL operations to aggregate:

ncclGroupStart();
for (int op=0; op<nops; op++) {
  ncclAllReduce(layers[op].localGradients, layers[op].globalGradients,
      layers[op].gradientSize, ncclFloat, ncclSum, ncclComm, ncclStream);
}
ncclGroupEnd();
// All operations are only guaranteed to be posted on the stream after ncclGroupEnd
cudaStreamSynchronize(ncclStream);
NCCL 2.2
Aggregated operations: usage
Can be combined/nested with multi-GPU grouping:

ncclGroupStart();
for (int op=0; op<nops; op++) {
  for (int gpu=0; gpu<ngpus; gpu++) {
    ncclGroupStart();
    ncclAllReduce(layers[op].localGradients[gpu], layers[op].globalGradients[gpu],
        layers[op].gradientSize, ncclFloat, ncclSum, ncclComms[gpu], ncclStreams[gpu]);
    ncclGroupEnd();
  }
}
ncclGroupEnd();
// All operations are only guaranteed to be posted on the streams after the last ncclGroupEnd
for (int gpu=0; gpu<ngpus; gpu++) cudaStreamSynchronize(ncclStreams[gpu]);
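Nesting works because grouped calls only take effect at the outermost ncclGroupEnd: the inner ncclGroupStart()/ncclGroupEnd() pairs extend the enclosing group, so the operations for all layers and all GPUs are launched together.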
NCCL 2.2
Aggregated operations: other uses
ReduceScatterV = aggregation of multiple reduce operations:

ncclGroupStart();
for (int rank=0; rank<nranks; rank++) {
  ncclReduce(sendbuff+offsets[rank], recvbuff+offsets[rank], recvcounts[rank],
      datatype, redOp, rank, comm, stream);
}
ncclGroupEnd();

AllGatherV = aggregation of multiple broadcast operations:

ncclGroupStart();
for (int rank=0; rank<nranks; rank++) {
  ncclBroadcast(sendbuff+offsets[rank], recvbuff+offsets[rank], recvcounts[rank],
      datatype, rank, comm, stream);
}
ncclGroupEnd();
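These snippets assume the communicators and streams already exist. As a minimal sketch for a single process driving several GPUs (error handling omitted), the setup could look like this; ncclCommInitAll, cudaSetDevice and cudaStreamCreate are the actual calls, while the setupComms wrapper is hypothetical:

#include <nccl.h>
#include <cuda_runtime.h>

// Create one communicator and one stream per local GPU (single-process mode).
void setupComms(int ngpus, ncclComm_t* ncclComms, cudaStream_t* ncclStreams) {
  // A NULL device list selects devices 0..ngpus-1.
  ncclCommInitAll(ncclComms, ngpus, NULL);
  for (int gpu=0; gpu<ngpus; gpu++) {
    cudaSetDevice(gpu);
    cudaStreamCreate(&ncclStreams[gpu]);
  }
}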
NCCL 2.3
Large-scale algorithms
Chart: Allreduce latency at increasing scale (2 to 128), comparing NCCL 2.2 with 2.3 (projected); the large-scale algorithms in 2.3 are projected to keep latency significantly lower at scale.
NCCL 2.4
Point-to-point primitives
Send/Receive, enabling Scatter[v], Gather[v], Alltoall[v,w], neighbor collectives, …
Diagram: scatter, gather, alltoall and neighbor communication patterns; a sketch of an Alltoall built from these primitives follows below.
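As an illustration, an Alltoall can be expressed as one send and one receive per peer, aggregated in a group so all transfers are posted together. This is a sketch assuming contiguous buffers with `count` elements per peer; ncclSend/ncclRecv follow the point-to-point signatures NCCL later shipped:

#include <nccl.h>
#include <cuda_runtime.h>

// Sketch: Alltoall built from aggregated point-to-point operations.
void alltoall(const float* sendbuff, float* recvbuff, size_t count,
              int nranks, ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();
  for (int peer=0; peer<nranks; peer++) {
    // Exchange `count` elements with every peer (including ourselves).
    ncclSend(sendbuff + peer*count, count, ncclFloat, peer, comm, stream);
    ncclRecv(recvbuff + peer*count, count, ncclFloat, peer, comm, stream);
  }
  ncclGroupEnd();  // all sends and receives are posted and matched together
}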
NCCL
Summary
Optimized inter-GPU communication for DL and HPC.
Optimized for all NVIDIA platforms, most OEMs, and cloud.
Scales to hundreds of GPUs, targeting tens of thousands in the near future.
Aims at covering all communication needs for multi-GPU computing.
Relies only on CUDA; no dependency on MPI or any parallel environment.
More questions? Connect with the Experts: NCCL, Wed 28, 3pm.