NCCL 2.0
Sylvain Jeaugey
DEEP LEARNING ON GPUS
Making DL training times shorter
Deeper neural networks, larger data sets … training is a very, very long operation !
Scaling path : multi-core CPU -> GPU (CUDA) -> multi-GPU (NCCL 1) -> multi-GPU, multi-node (NCCL 2)
NCCL
A multi-GPU communication library
Within a system : PCIe, NVLink (GPU Direct P2P). To other systems : Sockets (Ethernet), InfiniBand with GPU Direct RDMA.
NCCL
Architecture
Deep Learning frameworks (Caffe, Caffe2, Torch, TF, MXNet, CNTK) run on top of NVIDIA libraries (cuDNN, cuBLAS, NCCL) and CUDA, on NVIDIA GPUs.
AGENDA
NCCL : History, Design
NCCL 2.0 : New features, API changes, Performance, Future
HISTORY
Q4 2015 : NCCL 1.x. Open-source research project on GitHub, helping Deep Learning frameworks compute on multiple GPUs with efficient collective operations. Limited to intra-node.
Q2 2017 : NCCL 2.x and beyond. NVIDIA library, adding multi-node support and an improved API.
DESIGN
What is NCCL ?
An optimized collective communication library between CUDA devices. Easy to integrate into any DL framework, as well as into traditional HPC apps using MPI.
Runs on the GPU using asynchronous CUDA kernels, for faster access to GPU memory, parallel reductions and NVLink usage.
Operates on CUDA pointers. Operations are tied to a CUDA stream.
Uses as few threads as possible to let other computation progress simultaneously.
DESIGN
Rings
NCCL uses rings to move data across all GPUs and perform reductions.
PCIe / QPI : 1 unidirectional ring
DGX-1 : 4 unidirectional rings
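To make the ring schedule concrete, here is a small, self-contained host-side simulation of a ring allreduce. It is an illustration only, not NCCL code : the chunk indexing follows the textbook reduce-scatter + allgather schedule, and N, data and the single-process layout are all assumptions.

#include <stdio.h>

#define N 4   /* number of GPUs in the ring */

int main(void) {
  /* data[r][c] : chunk c held by rank r ; rank r contributes the value r+1 everywhere. */
  double data[N][N];
  for (int r = 0; r < N; r++)
    for (int c = 0; c < N; c++)
      data[r][c] = r + 1;

  /* Reduce-scatter phase : N-1 steps. At step s, rank r receives chunk (r-s-1) mod N
     from the previous rank in the ring and accumulates it. */
  for (int s = 0; s < N - 1; s++)
    for (int r = 0; r < N; r++) {
      int prev = (r - 1 + N) % N;
      int c = ((r - s - 1) % N + N) % N;
      data[r][c] += data[prev][c];
    }

  /* Allgather phase : N-1 steps. At step s, rank r receives the fully reduced
     chunk (r-s) mod N from the previous rank. */
  for (int s = 0; s < N - 1; s++)
    for (int r = 0; r < N; r++) {
      int prev = (r - 1 + N) % N;
      int c = ((r - s) % N + N) % N;
      data[r][c] = data[prev][c];
    }

  /* Every rank now holds the full sum 1+2+...+N = N*(N+1)/2 in every chunk. */
  for (int r = 0; r < N; r++)
    for (int c = 0; c < N; c++)
      printf("rank %d, chunk %d = %g\n", r, c, data[r][c]);
  return 0;
}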
DESIGN
Kernels
Reduction kernel, per GPU : data flows from the local sendbuff and from a FIFO written by the previous GPU in the ring, through the reduction, into a FIFO read by the next GPU and into recvbuff.
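A hypothetical CUDA sketch of the kind of reduce-and-forward step the diagram describes. The function name, the flat element-wise form and the float type are assumptions; the real NCCL kernels add chunking, pipelining and synchronization flags.

__global__ void reduceAndForward(const float* sendbuff,  /* local input                */
                                 const float* prevFifo,  /* filled by the previous GPU */
                                 float* nextFifo,        /* read by the next GPU       */
                                 float* recvbuff,        /* local output               */
                                 int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = sendbuff[i] + prevFifo[i];  /* reduce local data with incoming data   */
    nextFifo[i] = v;                      /* forward the partial result on the ring */
    recvbuff[i] = v;                      /* keep a local copy                      */
  }
}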
NCCL 2.0
NCCL 2.0
Inter-node communication
Inter-node communication using Sockets or InfiniBand verbs, with multi-rail support, topology detection and automatic use of GPU Direct RDMA.
Optimal combination of NVLink, PCI and network interfaces to maximize bandwidth and create rings across nodes.
Examples : PCIe + InfiniBand ; DGX-1 : NVLink + 4x InfiniBand.
NCCL 2.0
Processes, threads and GPUs
Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.
Example platform : n nodes, 2 sockets per node, 4 GPUs per socket (GPU0-GPU3 on CPU0, GPU4-GPU7 on CPU1).
Configuration 1 : 1 process per GPU, i.e. processes P0 to P8n-1 across the n nodes.
Configuration 2 : 1 process per socket, 1 thread per GPU, i.e. processes 0 to 2n-1, each running threads t0-t3.
Configuration 3 : 1 process per node, 8 GPUs per process, i.e. processes 0 to n-1 (see the sketch below for the single-node case).
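A minimal sketch of configuration 3 restricted to a single node, assuming the ncclCommInitAll entry point and placeholder names (devs, comms, streams); the multi-node case uses ncclCommInitRank as shown on the API slides that follow.

#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
  int devs[8] = {0, 1, 2, 3, 4, 5, 6, 7};
  ncclComm_t comms[8];
  cudaStream_t streams[8];

  /* One CUDA stream per GPU so NCCL work can overlap with other computation. */
  for (int i = 0; i < 8; i++) {
    cudaSetDevice(devs[i]);
    cudaStreamCreate(&streams[i]);
  }

  /* Create the 8 intra-process communicators in a single call. */
  ncclCommInitAll(comms, 8, devs);

  /* ... enqueue collectives on streams[i] with comms[i] (see the group calls slide) ... */

  for (int i = 0; i < 8; i++) ncclCommDestroy(comms[i]);
  return 0;
}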
NCCL 2.0 API
Group calls
NCCL 2.0 introduces mandatory new verbs, ncclGroupStart/ncclGroupEnd, when managing multiple devices from a single thread. Scenario : a single process (Process 0) driving all 8 GPUs of a node.
NCCL 1.x :
for (int i=0; i<ngpus; i++) {
  cudaSetDevice(devices[i]);
  ncclAllReduce(…, comms[i], streams[i]);
}
NCCL 2.0 :
ncclGroupStart();
for (int i=0; i<ngpus; i++) {
  ncclAllReduce(…, comms[i], streams[i]);
}
ncclGroupEnd();
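For completeness, a hedged sketch of the NCCL 2.0 loop with the elided ncclAllReduce arguments spelled out; sendbuff, recvbuff, count, ngpus, devices, comms and streams are assumed to have been set up beforehand.

ncclGroupStart();
for (int i = 0; i < ngpus; i++) {
  /* Sum `count` floats from sendbuff[i] into recvbuff[i] across all GPUs. */
  ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                comms[i], streams[i]);
}
ncclGroupEnd();

/* The collectives are asynchronous : wait for them on each stream. */
for (int i = 0; i < ngpus; i++) {
  cudaSetDevice(devices[i]);
  cudaStreamSynchronize(streams[i]);
}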
NCCL 2.0 API
Integration with parallel environments
Inter-node communicator creation still uses the NCCL 1.x verbs : ncclGetUniqueId / ncclCommInitRank. (My_Bcast stands for any out-of-band broadcast of the id, e.g. using MPI.)
if (rank == 0) ncclGetUniqueId(&id);
My_Bcast(&id);
ncclCommInitRank(&comm, nranks, id, rank);
Multi-process + multi-GPU per process (from a single thread) : combine ncclCommInitRank with ncclGroupStart/ncclGroupEnd.
if (rank == 0) ncclGetUniqueId(&id);
My_Bcast(&id);
ncclGroupStart();
for (int i=0; i<ndev; i++) {
  cudaSetDevice(devices[i]);
  ncclCommInitRank(&comms[i], ndev*nranks, id, ndev*rank+i);
}
ncclGroupEnd();
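As a concrete illustration of the pattern above (an assumption, not part of the slides), My_Bcast can be implemented with MPI_Bcast; this sketch assumes one GPU per MPI rank and ranks packed 8 per node.

#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  /* Rank 0 creates the id; everyone else receives it out-of-band. */
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  /* One GPU per rank; assumes 8 GPUs per node. */
  cudaSetDevice(rank % 8);
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  /* ... collectives on a CUDA stream ... */

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}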
NCCL 2.0 API
Others
Other small API adjustments over the NCCL 1.x API :
Counts are now of type size_t instead of int.
allGather argument order has been fixed to be consistent with the other operations (see the sketch below).
Additions/clarifications on datatypes :
integral : int8 = char, uint8, int32 = int, uint32, int64, uint64
floating point : float16 = half, float32 = float, float64 = double
Clarifications and fixes for allgather and reduce_scatter send/receive counts and in-place operations.
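A hedged sketch of the allGather argument-order change, based on my reading of the NCCL 1.x and 2.0 headers; treat the exact prototypes as an assumption rather than as part of the slides.

/* NCCL 1.x (recalled prototype) : count and datatype sit between the two buffers.
ncclResult_t ncclAllGather(const void* sendbuff, int count, ncclDataType_t datatype,
                           void* recvbuff, ncclComm_t comm, cudaStream_t stream);
*/

/* NCCL 2.0 : buffers first, then a size_t count, matching the other collectives. */
ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount,
                           ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream);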
PERFORMANCE
PERFORMANCE
Intra-node performance
Chart : AllReduce bandwidth (OMB, size=128MB, in GB/s) for four intra-node configurations : 4 QPI, 4 CPU, 4 PCI, DGX-1.
PERFORMANCE
Inter-node performance
Chart : AllReduce bandwidth (OMB, size=128MB, in GB/s), comparing MPI, Baidu Allreduce and NCCL on 2 nodes x 4 GPUs (IB EDR, PCI switch) and on 4 nodes x 8 GPUs (DGX-1 : 4x IB EDR, 4x NVLink).
PERFORMANCE
Deep Learning - CNTK
Chart : CNTK scaling with ResNet50 (images/s) on 8, 16, 24 and 32 GPUs, comparing MPI and NCCL against ideal scaling.
FUTURE
FUTURE
Additional communication primitives :
point-to-point communication
scatter (1 to N), gather (N to 1), alltoall (N to N)
neighbor collectives (send/receive in multiple dimensions)
User-defined reduction operations; also, trying to merge computation and communication better.
Windows support.
Please let us know your needs !
Connect with experts / NCCL session : Wed Apr 10, 4pm