

SLIDE 1

MULTI-GPU TRAINING WITH NCCL

Sylvain Jeaugey

SLIDE 2

MULTI-GPU COMPUTING

Harvesting the power of multiple GPUs

From 1 GPU, to multiple GPUs per system, to multiple systems connected.
NCCL : NVIDIA Collective Communication Library

SLIDE 3

MULTI-GPU DL TRAINING

Single-GPU

Diagram: database (GBs of input data: images, sound, …) → batch (e.g. 256 images) → forward/backward → gradients → update → parameters

SLIDE 4

MULTI-GPU DL TRAINING

Data parallel

Diagram: each GPU runs forward/backward on its own batch and produces local gradients; NCCL Allreduce sums the gradients across GPUs, and each GPU then updates its own copy of the parameters.
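To make the data-parallel picture concrete, below is a minimal single-process sketch (not from the original slides): it creates one communicator per GPU with ncclCommInitAll and sums each GPU's local gradients in place with ncclAllReduce. Buffer names, sizes and error handling are simplified placeholder assumptions.

#include <cuda_runtime.h>
#include <nccl.h>

// Sketch: sum per-GPU gradient buffers so every GPU ends up with the global gradients.
// Assumes localGradients[gpu] is a device buffer of gradientSize floats and ngpus <= 8.
void allreduceGradients(float** localGradients, size_t gradientSize, int ngpus) {
  ncclComm_t comms[8];
  cudaStream_t streams[8];
  int devs[8];
  for (int gpu = 0; gpu < ngpus; gpu++) {
    devs[gpu] = gpu;
    cudaSetDevice(gpu);
    cudaStreamCreate(&streams[gpu]);
  }
  ncclCommInitAll(comms, ngpus, devs);   // one communicator per GPU, single process

  ncclGroupStart();                      // group the calls: several GPUs driven by one thread
  for (int gpu = 0; gpu < ngpus; gpu++) {
    ncclAllReduce(localGradients[gpu], localGradients[gpu], gradientSize,
                  ncclFloat, ncclSum, comms[gpu], streams[gpu]);  // in-place sum
  }
  ncclGroupEnd();

  for (int gpu = 0; gpu < ngpus; gpu++) {  // wait for the reductions to complete
    cudaSetDevice(gpu);
    cudaStreamSynchronize(streams[gpu]);
  }
  for (int gpu = 0; gpu < ngpus; gpu++) ncclCommDestroy(comms[gpu]);
}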

SLIDE 5

NCCL

A multi-GPU communication library

Within a system: PCIe, NVLink, GPU Direct P2P.
Between systems: Sockets (Ethernet), InfiniBand with GPU Direct RDMA.

SLIDE 6

NCCL

Architecture

Diagram: Deep Learning Frameworks (TensorFlow (+Horovod), PyTorch, MXNet, Caffe2, Caffe, CNTK) sit on top of the NVIDIA libraries (NCCL, CUBLAS, CUDNN), which run on CUDA and NVIDIA GPUs.

SLIDE 7

TIMELINE

NCCL history & roadmap

1.x : Intra-node communication
2.0 : Inter-node communication
2.1 : Improved latency
2.2 : Aggregated operations
2.3 : Large scale algorithms
2.4 : Point-to-point primitives (Send/Recv)

SLIDE 8

NCCL 2.0

Provide best performance to DL apps

Chart: Allreduce bandwidth (OMB, size=128MB), in GB/s, across interconnects: QPI (5), CPU (8), PCI switch (12), with DGX-1 (Pascal) and DGX-1 (Volta) reaching 62 and 132 GB/s.

SLIDE 9

NCCL 2.1

ResNet50 buffer size

Latency is important in some workloads, e.g. ResNet 50, in particular when reductions are done for each layer.

Histogram: ResNet50 / MXNet allreduce buffer sizes (64 B to 2 MB) vs. number of occurrences, shown for FP32 and FP16.

SLIDE 10

NCCL 2.1

Latency improvement

Chart: NCCL latency (µs) for 2, 4, 8 and 16 GPUs, comparing 2.0.5 and 2.1.0 on one node (NVLink) and two nodes (NVLink+InfiniBand); 2.1.0 lowers latency by roughly 5x to 7x.

SLIDE 11

NCCL 2.2

Aggregated operations : principle

Diagram: each ncclAllReduce call from the DL framework results in a CUDA kernel launch (cudaLaunchKernel).

Principle : Merge multiple operations on the same CUDA device
Pay the launch overhead only once (more operations per second)
Use multiple NVLinks simultaneously (more bandwidth)

SLIDE 12

NCCL 2.2

Aggregated operations : overhead

Chart: per-operation time (µs) vs. number of aggregated operations (1 to 256), for an 8-byte reduction on 8 GPUs; per-operation overhead drops as more operations are aggregated.

SLIDE 13

NCCL 2.2

Aggregated operations : usage

Use ncclGroupStart() / ncclGroupEnd() around the NCCL operations we want to aggregate :

ncclGroupStart();
for (int op=0; op<nops; op++) {
  ncclAllReduce(layers[op].localGradients, layers[op].globalGradients,
                layers[op].gradientSize, ncclFloat, ncclSum,
                ncclComm, ncclStream);
}
ncclGroupEnd();
// All operations are only guaranteed to be posted on the stream after ncclGroupEnd
cudaStreamSynchronize(ncclStream);

SLIDE 14

NCCL 2.2

Aggregated operations : usage

Can be combined/nested with multi-GPU grouping :

ncclGroupStart();
for (int op=0; op<nops; op++) {
  for (int gpu=0; gpu<ngpus; gpu++) {
    ncclGroupStart();
    ncclAllReduce(layers[op].localGradients[gpu], layers[op].globalGradients[gpu],
                  layers[op].gradientSize, ncclFloat, ncclSum,
                  ncclComms[gpu], ncclStreams[gpu]);
    ncclGroupEnd();
  }
}
ncclGroupEnd();
// All operations are only guaranteed to be posted on the streams after the last ncclGroupEnd
for (int gpu=0; gpu<ngpus; gpu++) cudaStreamSynchronize(ncclStreams[gpu]);

SLIDE 15

NCCL 2.2

Aggregated operations : other uses

ReduceScatterV = Aggregation of multiple reduction operations

ncclGroupStart();
for (int rank=0; rank<nranks; rank++) {
  ncclReduce(sendbuff+offsets[rank], recvbuff+offsets[rank], recvcounts[rank],
             datatype, redOp, rank, comm, stream);
}
ncclGroupEnd();

AllGatherV = Aggregation of multiple broadcast operations

ncclGroupStart();
for (int rank=0; rank<nranks; rank++) {
  ncclBroadcast(sendbuff+offsets[rank], recvbuff+offsets[rank], recvcounts[rank],
                datatype, rank, comm, stream);
}
ncclGroupEnd();

SLIDE 16

NCCL 2.3

Large scale algorithms

Chart: Allreduce latency for 2 to 128 GPUs, comparing NCCL 2.2 with projected 2.3; the large-scale algorithms in 2.3 are expected to keep latency from growing as steeply at scale.

SLIDE 17

NCCL 2.4

Point-to-point primitives

Send / Receive primitives, enabling Scatter[v], Gather[v], Alltoall[v,w], neighbor collectives, …
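To illustrate what such primitives enable, here is a minimal sketch (not from the slides) of an alltoall composed from grouped point-to-point calls, assuming the ncclSend/ncclRecv API; sendbuff, recvbuff, count, comm and stream are placeholder names, and ncclFloat is an assumed datatype.

// Hypothetical alltoall built from send/recv: rank r sends chunk p to peer p
// and receives its chunk from peer p. Grouping the calls avoids deadlock.
ncclGroupStart();
for (int peer = 0; peer < nranks; peer++) {
  ncclSend((const float*)sendbuff + peer * count, count, ncclFloat, peer, comm, stream);
  ncclRecv((float*)recvbuff + peer * count, count, ncclFloat, peer, comm, stream);
}
ncclGroupEnd();
cudaStreamSynchronize(stream);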

SLIDE 18

NCCL

Summary

Optimized inter-GPU communication for DL and HPC.
Optimized for all NVIDIA platforms, most OEMs and Cloud.
Scales to 100s of GPUs, targeting 10,000s in the near future.
Aims at covering all communication needs for multi-GPU computing.
Only relies on CUDA. No dependency on MPI or any parallel environment.
More questions? Connect with the Experts : NCCL, Wed 28, 3pm.

SLIDE 19