 
NCCL 2.0
Sylvain Jeaugey
DEEP LEARNING ON GPUS: Making DL training times shorter
Deeper neural networks, larger data sets … training is a very, very long operation!
- CUDA: from multi-core CPU to GPU
- NCCL 1: from GPU to multi-GPU
- NCCL 2: from multi-GPU to multi-GPU, multi-node
NCCL: A multi-GPU communication library
- Within a system: NVLink, PCIe, GPU Direct P2P
- To other systems: Sockets (Ethernet), Infiniband with GPU Direct RDMA
NCCL: Architecture
Software stack, top to bottom:
- Deep Learning frameworks: Caffe, Caffe2, Torch, TF, MXNET, CNTK
- NCCL, CUDNN, CUBLAS
- CUDA
- NVIDIA GPUs
AGENDA
- NCCL: History, Design
- NCCL 2.0: New features, API changes
- Performance
- Future
HISTORY
- Q4 2015: NCCL 1.x. Open-source research project on GitHub, helping Deep Learning frameworks compute on multiple GPUs with efficient collective operations. Limited to intra-node.
- Q2 2017: NCCL 2.x and beyond. NVIDIA library, multi-node support and improved API.
DESIGN: What is NCCL?
- Optimized collective communication library between CUDA devices.
- Easy to integrate into any DL framework, as well as traditional HPC apps using MPI.
- Runs on the GPU using asynchronous CUDA kernels, for faster access to GPU memory, parallel reductions and NVLink usage.
- Operates on CUDA pointers. Operations are tied to a CUDA stream.
- Uses as few threads as possible to permit other computation to progress simultaneously.
DESIGN: Rings
NCCL uses rings to move data across all GPUs and perform reductions.
- PCIe / QPI: 1 unidirectional ring
- DGX-1: 4 unidirectional rings
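To make the ring idea concrete, here is a minimal host-side sketch of a ring allreduce (a reduce-scatter pass followed by an allgather pass around one unidirectional ring). The slides only state that NCCL uses rings; the chunking scheme below is the textbook ring algorithm and is an assumption, not NCCL's actual kernel code. The names `chunk_begin` and `ring_allreduce_sum` are hypothetical, and plain host arrays stand in for GPU buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Start offset of chunk c when `count` elements are split into nranks chunks.
 * Hypothetical helper, not part of NCCL. */
size_t chunk_begin(size_t count, int nranks, int c)
{
    return count * (size_t)c / (size_t)nranks;
}

/* Host-side model of a sum allreduce over one unidirectional ring.
 * bufs[r] is rank r's array of `count` floats; on return every rank
 * holds the element-wise sum of all inputs. */
void ring_allreduce_sum(float **bufs, int nranks, size_t count)
{
    assert(bufs != NULL && nranks > 0);

    /* Phase 1, reduce-scatter: at each step rank r passes one chunk to
     * rank r+1, which adds it to its own copy. After nranks-1 steps,
     * rank r owns the fully reduced chunk (r+1) % nranks. */
    for (int step = 0; step < nranks - 1; step++) {
        for (int r = 0; r < nranks; r++) {
            int next = (r + 1) % nranks;
            int c = (r - step + nranks) % nranks;
            size_t end = chunk_begin(count, nranks, c + 1);
            for (size_t k = chunk_begin(count, nranks, c); k < end; k++)
                bufs[next][k] += bufs[r][k];
        }
    }

    /* Phase 2, allgather: the reduced chunks travel once more around
     * the ring, each rank overwriting its stale copy. */
    for (int step = 0; step < nranks - 1; step++) {
        for (int r = 0; r < nranks; r++) {
            int next = (r + 1) % nranks;
            int c = (r + 1 - step + nranks) % nranks;
            size_t end = chunk_begin(count, nranks, c + 1);
            for (size_t k = chunk_begin(count, nranks, c); k < end; k++)
                bufs[next][k] = bufs[r][k];
        }
    }
}
```

In this model each rank only ever talks to its ring neighbour, and within a step the chunk a rank sends is never the chunk it receives, which is why the sequential loop over ranks gives the same result as simultaneous transfers would.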
DESIGN: Kernels
[Diagram: sendbuff, recvbuff, FIFO and reduction stages; data arrives from the previous GPU in the ring and is passed on to the next GPU in the ring.]
NCCL 2.0
NCCL 2.0: Inter-node communication
Inter-node communication using Sockets or Infiniband verbs, with multi-rail support, topology detection and automatic use of GPU Direct RDMA.
Optimal combination of NVLink, PCI and network interfaces to maximize bandwidth and create rings across nodes.
[Diagrams: rings across nodes over PCIe and Infiniband; on DGX-1, over NVLink and 4x Infiniband.]
NCCL 2.0: Processes, threads and GPUs
Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.
Example system: n nodes (Node 0 to Node n-1), 2 sockets per node (CPU0, CPU1), 4 GPUs per socket (GPU0 to GPU7 on each node). Layouts shown:
- 1 process per GPU: 8n processes (P0 to P8n-1).
- 1 process per socket, 1 thread per GPU: 2n processes (Process 0 to Process 2n-1), each with 4 threads (t0 to t3).
- 1 process per node, 8 GPUs per process: n processes (Process 0 to Process n-1).
NCCL 2.0 API: Group calls
NCCL 2.0 introduces new verbs, ncclGroupStart/ncclGroupEnd, which are mandatory when managing multiple devices from a single thread.
[Diagram: one process (Process 0) driving GPU0 to GPU7 across CPU0 and CPU1.]

NCCL 1.x:

    for (int i=0; i<ngpus; i++) {
      cudaSetDevice(devices[i]);
      ncclAllReduce(…, comms[i], streams[i]);
    }

NCCL 2.0:

    ncclGroupStart();
    for (int i=0; i<ngpus; i++) {
      ncclAllReduce(…, comms[i], streams[i]);
    }
    ncclGroupEnd();
NCCL 2.0 API: Integration with parallel environments
Inter-node communicator creation still uses the NCCL 1.x verbs ncclGetUniqueId/ncclCommInitRank:

    if (rank == 0) ncclGetUniqueId(&id);
    My_Bcast(&id);
    ncclCommInitRank(&comm, nranks, id, rank);

[Diagram: 8 processes (P0 to P7), one per GPU (GPU0 to GPU7), across CPU0 and CPU1.]

Multi-process + multi-GPU per process (from a single thread): combine ncclCommInitRank with ncclGroupStart/ncclGroupEnd:

    if (rank == 0) ncclGetUniqueId(&id);
    My_Bcast(&id);
    ncclGroupStart();
    for (int i=0; i<ndev; i++) {
      cudaSetDevice(devices[i]);
      ncclCommInitRank(&comms[i], ndev*nranks, id, ndev*rank+i);
    }
    ncclGroupEnd();

[Diagram: one process (Process 0) driving GPU0 to GPU7 across CPU0 and CPU1.]
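The rank arithmetic in the multi-GPU-per-process case (ndev*nranks total ranks, ndev*rank+i for device i) can be checked with a small host-only sketch. The helper names `global_rank`, `total_ranks`, `owning_process` and `local_device` are hypothetical and not part of the NCCL API; only the two formulas come from the slide, and the inverse mapping is derived from them.

```c
#include <assert.h>

/* NCCL rank of local device i in process `rank`, when every process
 * drives ndev devices (formula from the slide: ndev*rank + i). */
int global_rank(int ndev, int rank, int i)
{
    assert(ndev > 0 && rank >= 0 && i >= 0 && i < ndev);
    return ndev * rank + i;
}

/* Communicator size passed to ncclCommInitRank in that layout. */
int total_ranks(int ndev, int nranks)
{
    return ndev * nranks;
}

/* Inverse mapping: which process and which local device own a rank. */
int owning_process(int ndev, int grank)
{
    return grank / ndev;
}

int local_device(int ndev, int grank)
{
    return grank % ndev;
}
```

For example, with one process per node and 8 GPUs per process on 3 nodes, `total_ranks(8, 3)` is 24 and device 2 of process 1 gets rank `global_rank(8, 1, 2)`, which is 10. Every rank from 0 to 23 is produced exactly once, which is what the communicator needs.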
NCCL 2.0 API: Others
Other small API adjustments over the NCCL 1.x API:
- Counts are now of type size_t instead of int.
- The allGather argument order has been fixed to be similar to other operations.
- Additions/clarifications on datatypes:
  - integral: int8 = char, uint8, int32 = int, uint32, int64, uint64
  - floating point: float16 = half, float32 = float, float64 = double
- Clarifications and fixes for allgather and reduce_scatter send/receive counts and in-place operations.
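As an illustration of the allgather count and in-place points, the following host-only sketch models the buffer layout. It assumes the convention described in NCCL's documentation rather than on this slide: each rank sends `sendcount` elements, receives `nranks*sendcount`, finds rank i's data at offset `i*sendcount`, and the operation is in place when `sendbuff == recvbuff + rank*sendcount`. The function names are hypothetical, and host arrays stand in for device memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Nonzero if the send buffer already sits in its own slot of the
 * receive buffer (assumed in-place convention for allgather). */
int allgather_is_in_place(const float *sendbuff, const float *recvbuff,
                          int rank, size_t sendcount)
{
    return sendbuff == recvbuff + (size_t)rank * sendcount;
}

/* Host-side model of allgather: after the call, every recvbufs[dst]
 * holds nranks*sendcount elements, with rank src's contribution at
 * offset src*sendcount. Counts are size_t, as in the NCCL 2.0 API. */
void allgather_model(float **sendbufs, float **recvbufs,
                     int nranks, size_t sendcount)
{
    assert(sendbufs != NULL && recvbufs != NULL && nranks > 0);
    for (int dst = 0; dst < nranks; dst++) {
        for (int src = 0; src < nranks; src++) {
            float *slot = recvbufs[dst] + (size_t)src * sendcount;
            if (slot != sendbufs[src])  /* in place: data already there */
                memmove(slot, sendbufs[src], sendcount * sizeof(float));
        }
    }
}
```

With this layout a rank can allocate a single buffer of `nranks*sendcount` elements, write its own contribution into its slot and gather in place, avoiding a separate send buffer.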
PERFORMANCE
PERFORMANCE: Intra-node performance
[Chart: AllReduce bandwidth (OMB, size=128MB, in GB/s) for four configurations: 4 QPI, 4 CPU, 4 PCI, DGX-1. Y-axis from 0 to 60 GB/s.]
PERFORMANCE: Inter-node performance
[Chart: AllReduce bandwidth (OMB, size=128MB, in GB/s) comparing MPI, Baidu Allreduce and NCCL on two configurations: 2 nodes x 4 GPUs (IB EDR, PCI Switch) and 4 nodes x 8 GPUs (DGX-1: 4x IB EDR, 4x NVLink). Y-axis from 0 to 45 GB/s.]
PERFORMANCE: Deep Learning - CNTK
[Chart: CNTK scaling, ResNet50, images/s, comparing Ideal, MPI and NCCL. X-axis ticks: 0, 8, 16, 24, 32. Y-axis from 0 to 8000 images/s. Data labels visible in the chart: 217, 1645, 1684, 1744, 3281, 3360, 6569.]
FUTURE
FUTURE: Top asked features
- Additional communication primitives:
  - point-to-point communication
  - scatter (1 to N), gather (N to 1), alltoall (N to N)
  - neighbor collectives (send/receive in multiple dimensions)
- User-defined reduction operations (also, trying to merge computation and communication better)
- Windows support
Please let us know your needs!
Connect with experts / NCCL session: Wed Apr 10, 4pm