NCCL 2.0
Sylvain Jeaugey
DEEP LEARNING ON GPUS
Making DL training times shorter
Deeper neural networks, larger data sets … training is a very, very long operation !
Scaling path : multi-core CPU -> GPU (CUDA) -> multi-GPU (NCCL 1) -> multi-GPU, multi-node (NCCL 2)
NCCL
A multi-GPU communication library
Within a system : PCIe, NVLink (GPU Direct P2P). To other systems : Sockets (Ethernet), InfiniBand with GPU Direct RDMA.
NCCL
Architecture
Deep Learning frameworks (Caffe, Caffe2, Torch, TF, MXNet, CNTK) run on top of NVIDIA libraries (cuDNN, cuBLAS, NCCL) and CUDA, on NVIDIA GPUs.
AGENDA
NCCL : History, Design
NCCL 2.0 : New features, API changes, Performance, Future
HISTORY
Q4 2015 : NCCL 1.x. Open-source research project on GitHub, helping Deep Learning frameworks compute on multiple GPUs with efficient collective operations. Limited to intra-node.
Q2 2017 : NCCL 2.x and beyond. NVIDIA library, adding multi-node support and an improved API.
DESIGN
What is NCCL ?
An optimized collective communication library between CUDA devices. Easy to integrate into any DL framework, as well as into traditional HPC apps using MPI.
Runs on the GPU using asynchronous CUDA kernels, for faster access to GPU memory, parallel reductions and NVLink usage.
Operates on CUDA pointers. Operations are tied to a CUDA stream.
Uses as few threads as possible to let other computation progress simultaneously.
DESIGN
Rings
NCCL uses rings to move data across all GPUs and perform reductions.
PCIe / QPI : 1 unidirectional ring
DGX-1 : 4 unidirectional rings
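To make the ring schedule concrete, here is a small, self-contained host-side simulation of a ring allreduce. It is an illustration only, not NCCL code : the chunk indexing follows the textbook reduce-scatter + allgather schedule, and N, data and the single-process layout are all assumptions.

#include <stdio.h>

#define N 4   /* number of GPUs in the ring */

int main(void) {
  /* data[r][c] : chunk c held by rank r ; rank r contributes the value r+1 everywhere. */
  double data[N][N];
  for (int r = 0; r < N; r++)
    for (int c = 0; c < N; c++)
      data[r][c] = r + 1;

  /* Reduce-scatter phase : N-1 steps. At step s, rank r receives chunk (r-s-1) mod N
     from the previous rank in the ring and accumulates it. */
  for (int s = 0; s < N - 1; s++)
    for (int r = 0; r < N; r++) {
      int prev = (r - 1 + N) % N;
      int c = ((r - s - 1) % N + N) % N;
      data[r][c] += data[prev][c];
    }

  /* Allgather phase : N-1 steps. At step s, rank r receives the fully reduced
     chunk (r-s) mod N from the previous rank. */
  for (int s = 0; s < N - 1; s++)
    for (int r = 0; r < N; r++) {
      int prev = (r - 1 + N) % N;
      int c = ((r - s) % N + N) % N;
      data[r][c] = data[prev][c];
    }

  /* Every rank now holds the full sum 1+2+...+N = N*(N+1)/2 in every chunk. */
  for (int r = 0; r < N; r++)
    for (int c = 0; c < N; c++)
      printf("rank %d, chunk %d = %g\n", r, c, data[r][c]);
  return 0;
}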
DESIGN
Kernels
Reduction kernel, per GPU : data flows from the local sendbuff and from a FIFO written by the previous GPU in the ring, through the reduction, into a FIFO read by the next GPU and into recvbuff.
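A hypothetical CUDA sketch of the kind of reduce-and-forward step the diagram describes. The function name, the flat element-wise form and the float type are assumptions; the real NCCL kernels add chunking, pipelining and synchronization flags.

__global__ void reduceAndForward(const float* sendbuff,  /* local input                */
                                 const float* prevFifo,  /* filled by the previous GPU */
                                 float* nextFifo,        /* read by the next GPU       */
                                 float* recvbuff,        /* local output               */
                                 int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = sendbuff[i] + prevFifo[i];  /* reduce local data with incoming data   */
    nextFifo[i] = v;                      /* forward the partial result on the ring */
    recvbuff[i] = v;                      /* keep a local copy                      */
  }
}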
NCCL 2.0
NCCL 2.0
Inter-node communication
Inter-node communication using Sockets or InfiniBand verbs, with multi-rail support, topology detection and automatic use of GPU Direct RDMA.
Optimal combination of NVLink, PCI and network interfaces to maximize bandwidth and create rings across nodes.
Examples : PCIe + InfiniBand ; DGX-1 : NVLink + 4x InfiniBand.
NCCL 2.0
Processes, threads and GPUs
Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.
Example platform : n nodes, 2 sockets per node, 4 GPUs per socket (GPU0-GPU3 on CPU0, GPU4-GPU7 on CPU1).
Configuration 1 : 1 process per GPU, i.e. processes P0 to P8n-1 across the n nodes.
Configuration 2 : 1 process per socket, 1 thread per GPU, i.e. processes 0 to 2n-1, each running threads t0-t3.
Configuration 3 : 1 process per node, 8 GPUs per process, i.e. processes 0 to n-1 (see the sketch below for the single-node case).
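A minimal sketch of configuration 3 restricted to a single node, assuming the ncclCommInitAll entry point and placeholder names (devs, comms, streams); the multi-node case uses ncclCommInitRank as shown on the API slides that follow.

#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
  int devs[8] = {0, 1, 2, 3, 4, 5, 6, 7};
  ncclComm_t comms[8];
  cudaStream_t streams[8];

  /* One CUDA stream per GPU so NCCL work can overlap with other computation. */
  for (int i = 0; i < 8; i++) {
    cudaSetDevice(devs[i]);
    cudaStreamCreate(&streams[i]);
  }

  /* Create the 8 intra-process communicators in a single call. */
  ncclCommInitAll(comms, 8, devs);

  /* ... enqueue collectives on streams[i] with comms[i] (see the group calls slide) ... */

  for (int i = 0; i < 8; i++) ncclCommDestroy(comms[i]);
  return 0;
}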
NCCL 2.0 API
Group calls
NCCL 2.0 introduces mandatory new verbs, ncclGroupStart/ncclGroupEnd, when managing multiple devices from a single thread. Scenario : a single process (Process 0) driving all 8 GPUs of a node.
NCCL 1.x :
for (int i=0; i<ngpus; i++) {
  cudaSetDevice(devices[i]);
  ncclAllReduce(…, comms[i], streams[i]);
}
NCCL 2.0 :
ncclGroupStart();
for (int i=0; i<ngpus; i++) {
  ncclAllReduce(…, comms[i], streams[i]);
}
ncclGroupEnd();
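For completeness, a hedged sketch of the NCCL 2.0 loop with the elided ncclAllReduce arguments spelled out; sendbuff, recvbuff, count, ngpus, devices, comms and streams are assumed to have been set up beforehand.

ncclGroupStart();
for (int i = 0; i < ngpus; i++) {
  /* Sum `count` floats from sendbuff[i] into recvbuff[i] across all GPUs. */
  ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                comms[i], streams[i]);
}
ncclGroupEnd();

/* The collectives are asynchronous : wait for them on each stream. */
for (int i = 0; i < ngpus; i++) {
  cudaSetDevice(devices[i]);
  cudaStreamSynchronize(streams[i]);
}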
NCCL 2.0 API
Integration with parallel environments
Inter-node communicator creation still uses the NCCL 1.x verbs : ncclGetUniqueId / ncclCommInitRank. (My_Bcast stands for any out-of-band broadcast of the id, e.g. using MPI.)
if (rank == 0) ncclGetUniqueId(&id);
My_Bcast(&id);
ncclCommInitRank(&comm, nranks, id, rank);
Multi-process + multi-GPU per process (from a single thread) : combine ncclCommInitRank with ncclGroupStart/ncclGroupEnd.
if (rank == 0) ncclGetUniqueId(&id);
My_Bcast(&id);
ncclGroupStart();
for (int i=0; i<ndev; i++) {
  cudaSetDevice(devices[i]);
  ncclCommInitRank(&comms[i], ndev*nranks, id, ndev*rank+i);
}
ncclGroupEnd();
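As a concrete illustration of the pattern above (an assumption, not part of the slides), My_Bcast can be implemented with MPI_Bcast; this sketch assumes one GPU per MPI rank and ranks packed 8 per node.

#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  /* Rank 0 creates the id; everyone else receives it out-of-band. */
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  /* One GPU per rank; assumes 8 GPUs per node. */
  cudaSetDevice(rank % 8);
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  /* ... collectives on a CUDA stream ... */

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}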
NCCL 2.0 API
Others
Other small API adjustments over the NCCL 1.x API :
Counts are now of type size_t instead of int.
allGather argument order has been fixed to be consistent with the other operations (see the sketch below).
Additions/clarifications on datatypes :
integral : int8 = char, uint8, int32 = int, uint32, int64, uint64
floating point : float16 = half, float32 = float, float64 = double
Clarifications and fixes for allgather and reduce_scatter send/receive counts and in-place operations.
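A hedged sketch of the allGather argument-order change, based on my reading of the NCCL 1.x and 2.0 headers; treat the exact prototypes as an assumption rather than as part of the slides.

/* NCCL 1.x (recalled prototype) : count and datatype sit between the two buffers.
ncclResult_t ncclAllGather(const void* sendbuff, int count, ncclDataType_t datatype,
                           void* recvbuff, ncclComm_t comm, cudaStream_t stream);
*/

/* NCCL 2.0 : buffers first, then a size_t count, matching the other collectives. */
ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount,
                           ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream);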
PERFORMANCE
PERFORMANCE
Intra-node performance
Chart : AllReduce bandwidth (OMB, size=128MB, in GB/s) for four intra-node configurations : 4 QPI, 4 CPU, 4 PCI, DGX-1.
PERFORMANCE
Inter-node performance
Chart : AllReduce bandwidth (OMB, size=128MB, in GB/s), comparing MPI, Baidu Allreduce and NCCL on 2 nodes x 4 GPUs (IB EDR, PCI switch) and on 4 nodes x 8 GPUs (DGX-1 : 4x IB EDR, 4x NVLink).
PERFORMANCE
Deep Learning - CNTK
Chart : CNTK scaling with ResNet50 (images/s) on 8, 16, 24 and 32 GPUs, comparing MPI and NCCL against ideal scaling.
FUTURE
FUTURE
Additional communication primitives :
point-to-point communication
scatter (1 to N), gather (N to 1), alltoall (N to N)
neighbor collectives (send/receive in multiple dimensions)
User-defined reduction operations; also, trying to merge computation and communication better.
Windows support.
Please let us know your needs !
Connect with experts / NCCL session : Wed Apr 10, 4pm