

SLIDE 1

NCCL 2.0

Sylvain Jeaugey

SLIDE 2

DEEP LEARNING ON GPUS

Making DL training times shorter

Deeper neural networks, larger data sets … training is a very, very long operation !

[Diagram: scaling path from multi-core CPU, to GPU (CUDA), to multi-GPU (NCCL 1), to multi-GPU multi-node (NCCL 2)]

SLIDE 3

NCCL

A multi-GPU communication library

[Diagram: within a system, GPUs communicate over PCIe and NVLink using GPU Direct P2P; to other systems, over Sockets (Ethernet) or Infiniband, with GPU Direct RDMA]

SLIDE 4

NCCL

Architecture

[Diagram: software stack. Deep Learning frameworks (Caffe, Caffe2, Torch, TF, MXNET, CNTK) on top of NCCL, CUDNN and CUBLAS, running on CUDA and NVIDIA GPUs]

SLIDE 5

AGENDA

NCCL
  History
  Design
NCCL 2.0
  New features
  API Changes
Performance
Future

SLIDE 6

HISTORY

Q4 2015: NCCL 1.x
Open-source research project on GitHub, helping Deep Learning frameworks compute on multiple GPUs with efficient collective operations. Limited to intra-node.

Q2 2017: NCCL 2.x and beyond
NVIDIA library, adding multi-node support and an improved API.

SLIDE 7

DESIGN

Optimized collective communication library between CUDA devices.
Easy to integrate into any DL framework, as well as traditional HPC apps using MPI.
Runs on the GPU using asynchronous CUDA kernels, for faster access to GPU memory, parallel reductions and NVLink usage.
Operates on CUDA pointers. Operations are tied to a CUDA stream.
Uses as few threads as possible to let other computation progress simultaneously.

What is NCCL ?
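A minimal sketch of how these design points surface in the API, assuming a communicator and device buffers created elsewhere (the buffer names, count and float/sum choice are illustrative):

#include <cuda_runtime.h>
#include <nccl.h>

// One allreduce issued on a CUDA stream. NCCL takes CUDA device pointers
// directly and enqueues an asynchronous kernel on 'stream'.
void allreduce_on_stream(ncclComm_t comm, cudaStream_t stream,
                         const float* d_send, float* d_recv, size_t count) {
  ncclAllReduce(d_send, d_recv, count, ncclFloat, ncclSum, comm, stream);
  // The call returns immediately; other work can progress concurrently.
  // Synchronize only when the reduced result is actually needed.
  cudaStreamSynchronize(stream);
}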

SLIDE 8

DESIGN

Rings

NCCL uses rings to move data across all GPUs and perform reductions.

SLIDE 9

DESIGN

Rings

NCCL uses rings to move data across all GPUs and perform reductions.

PCIe / QPI : 1 unidirectional ring

SLIDE 10

DESIGN

Rings

NCCL uses rings to move data across all GPUs and perform reductions.

DGX-1 : 4 unidirectional rings
PCIe / QPI : 1 unidirectional ring

SLIDE 11

DESIGN

Kernels

[Diagram: reduction in the ring. On each GPU, a CUDA kernel combines data from its sendbuff with data arriving through a FIFO from the previous GPU in the ring, forwards the result to the next GPU, and writes the final result to recvbuff]
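A rough sketch of the ring idea in host-style C (NCCL actually does this inside CUDA kernels, moving data over NVLink/PCIe; the send_to_next/recv_from_prev helpers, the chunking, and the float/sum reduction are illustrative assumptions):

#include <stdlib.h>

// Hypothetical transport helpers standing in for the FIFO transfers between
// neighboring GPUs in the ring.
void send_to_next(const float* data, size_t n);
void recv_from_prev(float* data, size_t n);

// Conceptual ring allreduce over nranks GPUs: reduce-scatter, then allgather.
void ring_allreduce(float* buf, size_t count, int rank, int nranks) {
  size_t chunk = count / nranks;            // assume count divides evenly
  float* tmp = malloc(chunk * sizeof(float));

  // Phase 1: reduce-scatter. After nranks-1 steps, this rank holds one chunk
  // that has been reduced across all ranks.
  for (int step = 0; step < nranks - 1; step++) {
    int send_idx = (rank - step + nranks) % nranks;
    int recv_idx = (rank - step - 1 + nranks) % nranks;
    send_to_next(buf + send_idx * chunk, chunk);
    recv_from_prev(tmp, chunk);
    for (size_t i = 0; i < chunk; i++)      // the reduction itself
      buf[recv_idx * chunk + i] += tmp[i];
  }

  // Phase 2: allgather. Circulate the fully reduced chunks so that every
  // rank ends up with the complete result in buf.
  for (int step = 0; step < nranks - 1; step++) {
    int send_idx = (rank + 1 - step + nranks) % nranks;
    int recv_idx = (send_idx - 1 + nranks) % nranks;
    send_to_next(buf + send_idx * chunk, chunk);
    recv_from_prev(buf + recv_idx * chunk, chunk);
  }
  free(tmp);
}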

SLIDE 12

NCCL 2.0

SLIDE 13

NCCL 2.0

Inter-node communication using Sockets or Infiniband verbs, with multi-rail support, topology detection and automatic use of GPU Direct RDMA. Optimal combination of NVLink, PCI and network interfaces to maximize bandwidth and create rings across nodes.

Inter-node communication

[Diagram: a PCIe + Infiniband node, and a DGX-1 node combining NVLink with 4x Infiniband]


SLIDE 15

NCCL 2.0

Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.

Processes, threads and GPUs

[Diagram: n nodes (Node 0 … Node n-1), each with 2 CPU sockets (CPU0, CPU1) and 4 GPUs per socket (GPU0-GPU7); 4 GPUs per socket, 2 sockets per node, n nodes]

SLIDE 16

NCCL 2.0

Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.

Processes, threads and GPUs

[Diagram: same n-node, 8-GPU-per-node layout, launched with 1 process per GPU: processes P0 … P15 on nodes 0 and 1, up to P(8n-8) … P(8n-1) on node n-1]
SLIDE 17

NCCL 2.0

Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.

Processes, threads and GPUs

[Diagram: same layout, launched with 1 process per socket and 1 thread per GPU: processes 0 … 2n-1, each with threads t0-t3 driving its socket's 4 GPUs]

SLIDE 18

NCCL 2.0

Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.

Processes, threads and GPUs

[Diagram: same layout, launched with 1 process per node and 8 GPUs per process: processes 0 … n-1]
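For the single-node variant of this last layout (one process driving all of its node's GPUs), a minimal sketch using ncclCommInitAll; the multi-node variant instead combines ncclCommInitRank with group calls, as shown on the API slides that follow. The 8-GPU device list is an illustrative assumption.

#include <nccl.h>

// One process, 8 local GPUs: create one communicator per device in a single
// call, with no ncclGetUniqueId / out-of-band exchange needed.
void init_single_node_comms(ncclComm_t comms[8]) {
  int devices[8] = {0, 1, 2, 3, 4, 5, 6, 7};
  ncclCommInitAll(comms, 8, devices);
}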

SLIDE 19

NCCL 2.0 API

NCCL 2.0 introduces new verbs, ncclGroupStart/ncclGroupEnd, which are mandatory when managing multiple devices from a single thread.

NCCL 1.x :

for (int i=0; i<ngpus; i++) {
  cudaSetDevice(devices[i]);
  ncclAllReduce(…, comms[i], streams[i]);
}

NCCL 2.0 :

ncclGroupStart();
for (int i=0; i<ngpus; i++) {
  ncclAllReduce(…, comms[i], streams[i]);
}
ncclGroupEnd();
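A slightly fuller sketch of the NCCL 2.0 pattern with the elided arguments filled in for illustration (the per-device buffer arrays, count and float/sum reduction are assumptions):

#include <cuda_runtime.h>
#include <nccl.h>

// Grouped allreduce from a single thread over ngpus devices. comms/streams
// were created earlier, one per device; sendbuffs/recvbuffs are per-device
// CUDA pointers (all names here are illustrative).
void grouped_allreduce(int ngpus, float** sendbuffs, float** recvbuffs,
                       size_t count, ncclComm_t* comms, cudaStream_t* streams) {
  ncclGroupStart();
  for (int i = 0; i < ngpus; i++) {
    // No cudaSetDevice inside the group: the device is taken from each
    // communicator, as on the slide above.
    ncclAllReduce(sendbuffs[i], recvbuffs[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  }
  ncclGroupEnd();  // operations are only launched once the group closes
}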

Group calls

[Diagram: Process 0 driving all 8 GPUs (2 sockets, 4 GPUs each) of a single node]

SLIDE 20

NCCL 2.0 API

Inter-node communicator creation still uses the NCCL 1.x verbs : ncclGetUniqueId/ncclCommInitRank

if (rank == 0) ncclGetUniqueId(&id);
My_Bcast(&id);
ncclCommInitRank(&comm, nranks, id, rank);

Multi-process + multi-GPU per process (from a single thread) : combine ncclCommInitRank with ncclGroupStart/ncclGroupEnd

if (rank == 0) ncclGetUniqueId(&id);
My_Bcast(&id);
ncclGroupStart();
for (int i=0; i<ndev; i++) {
  cudaSetDevice(devices[i]);
  ncclCommInitRank(&comms[i], ndev*nranks, id, ndev*rank+i);
}
ncclGroupEnd();
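A minimal sketch of the same flow in an MPI program, with MPI_Bcast standing in for My_Bcast (the MPI usage, the one-GPU-per-rank choice and the 8-GPUs-per-node assumption are illustrative):

#include <cuda_runtime.h>
#include <mpi.h>
#include <nccl.h>

// One NCCL rank per MPI rank, one GPU per rank.
void init_nccl_with_mpi(ncclComm_t* comm) {
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Rank 0 creates the unique id; everyone receives it out-of-band over MPI.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  cudaSetDevice(rank % 8);  // pick the local GPU (assumes 8 GPUs per node)
  ncclCommInitRank(comm, nranks, id, rank);
}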

Integration with parallel environments

[Diagram: one node with 8 GPUs driven by 8 processes P0 … P7 (one per GPU), and one node with 8 GPUs driven by a single Process 0]

SLIDE 21

NCCL 2.0 API

Other small API adjustments over the NCCL 1.x API :
Counts are now of type size_t instead of int.
allGather argument order has been fixed to match the other operations (see the sketch below).
Additions/clarifications on datatypes :
  integral : int8 = char, uint8, int32 = int, uint32, int64, uint64
  floating point : float16 = half, float32 = float, float64 = double
Clarifications and fixes for allgather and reduce_scatter send/receive counts and in-place operations.
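As an illustration of the allgather argument-order change (buffer names and the float datatype are illustrative, and the exact NCCL 1.x prototype shown in the comment is an assumption to contrast against):

#include <cuda_runtime.h>
#include <nccl.h>

// Per-rank allgather of 'count' floats using the NCCL 2.0 argument order.
void allgather_2_0(const float* sendbuff, float* recvbuff, size_t count,
                   ncclComm_t comm, cudaStream_t stream) {
  // NCCL 1.x: ncclAllGather(sendbuff, count, ncclFloat, recvbuff, comm, stream);
  // NCCL 2.0: recvbuff moves next to sendbuff, like the other collectives,
  // and the count becomes a size_t.
  ncclAllGather(sendbuff, recvbuff, count, ncclFloat, comm, stream);
}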

Others

SLIDE 22

PERFORMANCE

SLIDE 23

PERFORMANCE

Intra-node performance

AllReduce bandwidth (OMB, size=128MB, in GB/s)

[Chart: bandwidth for four configurations: 4 QPI, 4 CPU, 4 PCI, DGX-1]

SLIDE 24

PERFORMANCE

Inter-node performance

AllReduce bandwidth (OMB, size=128MB, in GB/s)

[Chart: MPI vs Baidu Allreduce vs NCCL, on 2 nodes x 4 GPUs (IB EDR, PCI Switch) and on 4 nodes x 8 GPUs (DGX-1 : 4x IB EDR, 4x NVLink)]

SLIDE 25

PERFORMANCE

Deep Learning - CNTK

CNTK scaling, ResNet50, images/s

[Chart: throughput at 8, 16, 24 and 32 GPUs, comparing Ideal, MPI and NCCL]

SLIDE 26

FUTURE

SLIDE 27

FUTURE

Additional communication primitives :
  point-to-point communication
  scatter (1 to N), gather (N to 1), alltoall (N to N)
  neighbor collectives (send/receive in multiple dimensions)
User-defined reduction operations
  also, trying to merge computation and communication better
Windows support
Please let us know your needs !
Connect with experts / NCCL session : Wed Apr 10, 4pm

Top asked features
