MULTI-GPU TRAINING WITH NCCL
Sylvain Jeaugey
MULTI-GPU COMPUTING
Harvesting the power of multiple GPUs
From 1 GPU, to multiple GPUs per system, to multiple systems connected together.
NCCL: NVIDIA Collective Communication Library
MULTI-GPU DL TRAINING
Single-GPU
Diagram: a database holding GBs of input data (images, sound, …) feeds batches (e.g. 256 images) through a forward/backward pass; the resulting gradients are used to update the parameters.
MULTI-GPU DL TRAINING
Data parallel
Diagram: each GPU runs the forward/backward pass on its own batch and produces local gradients; an NCCL Allreduce sums the gradients across GPUs, so every GPU applies the same update to its copy of the parameters. A sketch of this step follows below.
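To make this concrete, here is a minimal sketch of one data-parallel step from a single rank's point of view, assuming a communicator and stream have already been created. The buffer names and the sgdUpdate kernel are hypothetical; ncclAllReduce is the actual NCCL call.

#include <nccl.h>
#include <cuda_runtime.h>

// Hypothetical SGD update: params -= lr * (summed gradients / nranks).
__global__ void sgdUpdate(float* params, const float* grads,
                          float lr, float invRanks, size_t n) {
  size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) params[i] -= lr * invRanks * grads[i];
}

// One data-parallel training step for one rank, after the backward pass
// has filled `grads` with this GPU's local gradients.
void trainStep(float* params, float* grads, size_t count, int nranks,
               float lr, ncclComm_t comm, cudaStream_t stream) {
  // Sum the local gradients across all GPUs; the in-place result is
  // identical on every rank.
  ncclAllReduce(grads, grads, count, ncclFloat, ncclSum, comm, stream);
  // Apply the averaged global gradients to the local parameter copy.
  int threads = 256;
  int blocks = (int)((count + threads - 1) / threads);
  sgdUpdate<<<blocks, threads, 0, stream>>>(params, grads, lr,
                                            1.0f / nranks, count);
}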
NCCL
A multi-GPU communication library
Within a system: PCIe, NVLink, GPU Direct P2P.
Between systems: sockets (Ethernet) or InfiniBand, with GPU Direct RDMA.
NCCL
Architecture
Diagram: deep learning frameworks (TensorFlow (+Horovod), PyTorch, MXNet, Caffe2, Caffe, CNTK) sit on top of the CUDA libraries (NCCL, cuBLAS, cuDNN), which run on NVIDIA GPUs.
TIMELINE
NCCL history & roadmap
1.x: Intra-node communication
2.0: Inter-node communication
2.1: Improved latency
2.2: Aggregated operations
2.3: Large-scale algorithms
2.4: Point-to-point primitives (Send/Recv)
NCCL 2.0
Provide best performance to DL apps
Chart: Allreduce bandwidth (OMB, size=128MB), in GB/s: QPI 5, CPU 8, PCI switch 12, DGX-1 (Pascal) 62, DGX-1 (Volta) 132.
NCCL 2.1
ResNet50 buffer size
Latency is important in some workloads, e.g. ResNet50, in particular when reductions are done for each layer.
Chart: histogram of reduction buffer sizes (64 bytes to 2MB) and their number of occurrences for ResNet50 / MXNet, in FP32 and FP16; most per-layer reductions are small.
NCCL 2.1
Latency improvement
Chart: NCCL Allreduce latency (µs) on 2, 4, 8 and 16 GPUs, comparing 2.0.5 and 2.1.0, on 1 node (NVLink) and 2 nodes (NVLink + InfiniBand); 2.1.0 reduces latency by roughly 5x to 7x.
NCCL 2.2
Aggregated operations: principle
Diagram: the DL framework issues several ncclAllReduce calls, but only a single cudaLaunchKernel reaches CUDA.
Principle: merge multiple operations on the same CUDA device.
Pay the launch overhead only once (more operations per second) and use multiple NVLinks simultaneously (more bandwidth).
NCCL 2.2
Aggregated operations: overhead
Chart: per-operation time (µs) versus number of aggregated operations (1 to 256), for an 8-byte reduction on 8 GPUs; the per-operation cost drops as more operations are aggregated.
NCCL 2.2
Aggregated operations: usage
Use ncclGroupStart() / ncclGroupEnd() around the NCCL operations to aggregate:

ncclGroupStart();
for (int op=0; op<nops; op++) {
  ncclAllReduce(layers[op].localGradients, layers[op].globalGradients,
      layers[op].gradientSize, ncclFloat, ncclSum, ncclComm, ncclStream);
}
ncclGroupEnd();
// All operations are only guaranteed to be posted on the stream after ncclGroupEnd
cudaStreamSynchronize(ncclStream);
NCCL 2.2
Aggregated operations: usage
Can be combined/nested with multi-GPU grouping:

ncclGroupStart();
for (int op=0; op<nops; op++) {
  for (int gpu=0; gpu<ngpus; gpu++) {
    ncclGroupStart();
    ncclAllReduce(layers[op].localGradients[gpu], layers[op].globalGradients[gpu],
        layers[op].gradientSize, ncclFloat, ncclSum, ncclComms[gpu], ncclStreams[gpu]);
    ncclGroupEnd();
  }
}
ncclGroupEnd();
// All operations are only guaranteed to be posted on the streams after the last ncclGroupEnd
for (int gpu=0; gpu<ngpus; gpu++) cudaStreamSynchronize(ncclStreams[gpu]);
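Nesting works because grouped calls only take effect at the outermost ncclGroupEnd: the inner ncclGroupStart()/ncclGroupEnd() pairs extend the enclosing group, so the operations for all layers and all GPUs are launched together.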
NCCL 2.2
Aggregated operations: other uses
ReduceScatterV = aggregation of multiple reduce operations:

ncclGroupStart();
for (int rank=0; rank<nranks; rank++) {
  ncclReduce(sendbuff+offsets[rank], recvbuff+offsets[rank], recvcounts[rank],
      datatype, redOp, rank, comm, stream);
}
ncclGroupEnd();

AllGatherV = aggregation of multiple broadcast operations:

ncclGroupStart();
for (int rank=0; rank<nranks; rank++) {
  ncclBroadcast(sendbuff+offsets[rank], recvbuff+offsets[rank], recvcounts[rank],
      datatype, rank, comm, stream);
}
ncclGroupEnd();
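These snippets assume the communicators and streams already exist. As a minimal sketch for a single process driving several GPUs (error handling omitted), the setup could look like this; ncclCommInitAll, cudaSetDevice and cudaStreamCreate are the actual calls, while the setupComms wrapper is hypothetical:

#include <nccl.h>
#include <cuda_runtime.h>

// Create one communicator and one stream per local GPU (single-process mode).
void setupComms(int ngpus, ncclComm_t* ncclComms, cudaStream_t* ncclStreams) {
  // A NULL device list selects devices 0..ngpus-1.
  ncclCommInitAll(ncclComms, ngpus, NULL);
  for (int gpu=0; gpu<ngpus; gpu++) {
    cudaSetDevice(gpu);
    cudaStreamCreate(&ncclStreams[gpu]);
  }
}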
NCCL 2.3
Large-scale algorithms
Chart: Allreduce latency at increasing scale (2 to 128), comparing NCCL 2.2 with 2.3 (projected); the large-scale algorithms in 2.3 are projected to keep latency significantly lower at scale.
NCCL 2.4
Point-to-point primitives
Send/Receive, enabling Scatter[v], Gather[v], Alltoall[v,w], neighbor collectives, …
Diagram: scatter, gather, alltoall and neighbor communication patterns; a sketch of an Alltoall built from these primitives follows below.
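As an illustration, an Alltoall can be expressed as one send and one receive per peer, aggregated in a group so all transfers are posted together. This is a sketch assuming contiguous buffers with `count` elements per peer; ncclSend/ncclRecv follow the point-to-point signatures NCCL later shipped:

#include <nccl.h>
#include <cuda_runtime.h>

// Sketch: Alltoall built from aggregated point-to-point operations.
void alltoall(const float* sendbuff, float* recvbuff, size_t count,
              int nranks, ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();
  for (int peer=0; peer<nranks; peer++) {
    // Exchange `count` elements with every peer (including ourselves).
    ncclSend(sendbuff + peer*count, count, ncclFloat, peer, comm, stream);
    ncclRecv(recvbuff + peer*count, count, ncclFloat, peer, comm, stream);
  }
  ncclGroupEnd();  // all sends and receives are posted and matched together
}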
NCCL
Summary
Optimized inter-GPU communication for DL and HPC.
Optimized for all NVIDIA platforms, most OEMs, and cloud.
Scales to hundreds of GPUs, targeting tens of thousands in the near future.
Aims at covering all communication needs for multi-GPU computing.
Relies only on CUDA; no dependency on MPI or any parallel environment.
More questions? Connect with the Experts: NCCL, Wed 28, 3pm.