S8688: INSIDE DGX-2
Glenn Dearth, Vyas Venkataraman | Mar 28, 2018


SLIDE 1

S8688: INSIDE DGX-2

Glenn Dearth, Vyas Venkataraman
Mar 28, 2018

SLIDE 2

Agenda

• Why was DGX-2 created
• DGX-2 internal architecture
• Software programming model
• Simple application
• Results

SLIDE 3

DEEP LEARNING TRENDS

Application properties

Explosive DL growth:

• Increasing data, computation & complexity demands
• Exceeds the memory capacity of a single GPU
• Exceeds the compute performance of a single GPU

Driving scale-out across GPUs

[Chart: model demand (GOP * Bandwidth) for AlexNet, GoogleNet, Inception-v2, Inception-v4 and ResNet-50, 2011-2017, showing ~350X growth]

SLIDE 4

DGX-1

• 8 V100 GPUs
• 6 NVLinks per GPU; each link is 50 GB/s (bidirectional)
• 300 GB/s bidirectional bandwidth per GPU
• DGX-1 uses a Hybrid Cube Mesh topology
• Internal bisection bandwidth: 300 GB/s
• Optimized data-parallel training with NCCL

SLIDE 5

DESIRED SCALE-OUT BEHAVIOR

• Scale up to 16 GPUs
• Direct peer GPU memory access
• Full non-blocking bandwidth
• Utilize all GPU links when accessing memory
• Simplify multi-GPU programming

SLIDE 6

SCALE UP TO 16 GPUS

[Diagram: sixteen GPUs, GPU 0 through GPU 15]

SLIDE 7

DIRECT PEER MEMORY ACCESS

[Diagram: GPU 0 through GPU 15 connected via NVSwitches]

SLIDE 8

FULL NON-BLOCKING BANDWIDTH

[Diagram: GPU 0 through GPU 15 fully connected through twelve NVSwitches]

SLIDE 9

DGX-2 AT A GLANCE

[Diagram: GPUs connected to NVSwitches, 6x NVLink per GPU]

DGX-2:
• GPU density: 16
• 1 GPU to 1 GPU: always 6 NVLinks
• Connectivity: fully connected
• Topology: symmetric
• Bisection bandwidth: 2.4 TB/s

SLIDE 10

DESIGNED TO TRAIN THE PREVIOUSLY IMPOSSIBLE

Introducing NVIDIA DGX-2:

1. NVIDIA Tesla V100 32GB
2. Two GPU boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB total HBM2 memory, interconnected by plane card
3. Twelve NVSwitches: 2.4 TB/sec bisection bandwidth
4. Eight EDR InfiniBand/100 GigE: 1600 Gb/sec total bidirectional bandwidth
5. PCIe switch complex
6. Two Intel Xeon Platinum CPUs
7. 1.5 TB system memory
8. 30 TB NVMe SSD internal storage
9. Dual 10/25 GigE

SLIDE 11

NVSWITCH

Features:
• 18 NVLink ports @ 50GB/s per port, 900 GB/s total
• Fully connected crossbar
• x4 PCIe Gen2 management port
• GPIO, I2C
• Transistor count: 2 billion
• Package: 47.5 x 47.5 mm, 1937 balls @ 1 mm pitch

SLIDE 12

SWITCH FUNCTIONS

• NVLink: performs physical, datalink & transaction layer functions
• Forwarding: determines packet routing
• Crossbar (non-blocking): schedules traffic flows to outputs
• Management: configuration, errors, monitoring

[Diagram: NVLINK ports feeding FORWARDING logic into the central CROSSBAR, with MANAGEMENT alongside]

SLIDE 13

NVSWITCH RELIABILITY FEATURES

• Link CRC and retry
• ECC on routing structures and data path
• Secondary checks:
  • Routing checks
  • Data path overflow/underflow checks
  • Access control checks

SLIDE 14

Programming Model

SLIDE 15

MULTI-GPU PROGRAMMING IN CUDA

[Diagram: GPU 0 and GPU 1]

SLIDE 16

EXECUTION CONTROL

• Asynchronous CUDA calls execute in a CUDA stream
  • Default is the null stream
  • A stream can be specified explicitly
• CUDA runtime API calls implicitly operate on the currently selected device
  • The current device can be changed with cudaSetDevice()
• Cooperative groups offer a multi-device launch: cudaLaunchCooperativeKernelMultiDevice()
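
A minimal sketch of these pieces together (the kernel step, the blocks/threads launch configuration, and the per-device buffers ptr[] are hypothetical placeholders): each GPU gets its own explicit stream, so all devices run concurrently.

// Hypothetical example: launch `step` on every GPU concurrently,
// each launch placed in that device's own explicit stream.
cudaStream_t streams[MAX_DEVICES];
for (int i = 0; i < numDevices; ++i) {
    cudaSetDevice(i);                                  // select the current device
    cudaStreamCreate(&streams[i]);                     // create an explicit (non-null) stream
    step<<<blocks, threads, 0, streams[i]>>>(ptr[i]);  // asynchronous kernel launch
}
for (int i = 0; i < numDevices; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);                 // wait for each device's work
}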

SLIDE 17

CUDA ON DGX-2

• DGX-2 enables up to 16 peer GPUs
• DGX-2 enables full NVLink bandwidth to peer GPUs
• The GPU memory model is extended to all GPUs
• Unified Memory and CUDA-aware MPI use NVLink for transfers
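
A program can verify the peer topology at runtime with standard CUDA runtime calls; a small self-contained sketch (on DGX-2, every ordered pair of the 16 GPUs should report peer access):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);
    for (int i = 0; i < numDevices; ++i) {
        for (int j = 0; j < numDevices; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            // Can device i directly map device j's memory?
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU %2d -> GPU %2d: %s\n", i, j, canAccess ? "peer" : "no peer");
        }
    }
    return 0;
}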

SLIDE 18

MEMORY MANAGEMENT

NVLink provides:
• All-to-all high-bandwidth peer mapping between GPUs
• Full inter-GPU memory interconnect (incl. atomics)

[Diagram: GPU 0 through GPU 15, 16x 32GB independent memory regions]
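
Because the interconnect carries atomics, a kernel on one GPU can atomically update memory that physically resides on another. A minimal sketch, assuming peer access has already been enabled between the devices (see slide 19):

// Hypothetical example: GPU 0 atomically increments a counter held in GPU 1's memory.
__global__ void peer_atomic_inc(int *peerCounter) {
    atomicAdd(peerCounter, 1);              // atomic travels over NVLink to the peer GPU
}

// Host side
int *counter;
cudaSetDevice(1);
cudaMalloc((void**)&counter, sizeof(int));  // counter lives in GPU 1's memory
cudaMemset(counter, 0, sizeof(int));
cudaSetDevice(0);
peer_atomic_inc<<<1, 256>>>(counter);       // kernel runs on GPU 0
cudaDeviceSynchronize();                    // counter now holds 256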

SLIDE 19

PINNED MEMORY ALLOCATION

Enable peer memory access:

// Enable peer accesses between all pairs of GPUs
for (int i = 0; i < numDevices; ++i) {
    cudaSetDevice(i);                         // peer access is enabled from the current device
    for (int j = 0; j < numDevices; ++j) {
        if (i != j) {
            cudaDeviceEnablePeerAccess(j, 0); // let the current device access device j's memory
        }
    }
}

SLIDE 20

PINNED MEMORY ALLOCATION

cudaMalloc with CUDA P2P:

int* ptr[MAX_DEVICES];
for (int i = 0; i < numDevices; ++i) {
    // Set a device
    cudaSetDevice(i);
    // Allocate memory on the device
    cudaMalloc((void**)&ptr[i], size);
}
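
With peer access enabled, these buffers can also be copied between GPUs directly over NVLink, without staging through host memory; a minimal sketch using the ptr array above:

// Copy `size` bytes from GPU 0's buffer straight into GPU 1's buffer.
cudaMemcpyPeer(ptr[1], 1,   // destination pointer, destination device
               ptr[0], 0,   // source pointer, source device
               size);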

SLIDE 21

UNIFIED MEMORY + DGX-2

Unified Memory provides:
• A single memory view shared by all GPUs
• Automatic migration of data between GPUs
• User control of data locality

[Diagram: GPU 0 through GPU 15 sharing 512 GB of Unified Memory]

SLIDE 22

UNIFIED MEMORY

Allocating across multiple GPUs:

int* ptr;
// Allocate one managed region spanning all GPUs
// (size = number of ints per GPU)
cudaMallocManaged((void**)&ptr, (size_t)numDevices * size * sizeof(int));

SLIDE 23

UNIFIED MEMORY

Allocating across multiple GPUs:

int* ptr;
// Allocate one managed region spanning all GPUs
// (size = number of ints per GPU)
cudaMallocManaged((void**)&ptr, (size_t)numDevices * size * sizeof(int));
for (int i = 0; i < numDevices; ++i) {
    // Mark this slice as preferring to live on GPU i
    cudaMemAdvise(ptr + i * size, size * sizeof(int), cudaMemAdviseSetPreferredLocation, i);
    // Mark the slice as accessed by all devices
    for (int j = 0; j < numDevices; ++j) {
        cudaMemAdvise(ptr + i * size, size * sizeof(int), cudaMemAdviseSetAccessedBy, j);
    }
}
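
The advice above sets placement policy but does not itself move pages. As a sketch of explicit locality control (same ptr, size, and numDevices as above), cudaMemPrefetchAsync migrates each slice to its preferred GPU up front instead of waiting for demand paging:

// Eagerly migrate each GPU's slice to that GPU (null stream)
for (int i = 0; i < numDevices; ++i) {
    cudaMemPrefetchAsync(ptr + i * size, size * sizeof(int), i, 0);
}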

SLIDE 24

BROADCAST ON DGX-1

Ring Scatter

[Animation frame: Time = 0]

SLIDE 25

BROADCAST ON DGX-1

Ring Scatter

[Animation frames: Time = 0, Time = 1]

SLIDE 26

BROADCAST ON DGX-1

Ring Scatter

[Animation frames: Time = 0, Time = 1, Time = 2]

SLIDE 27

BROADCAST ON DGX-1

Ring Scatter

[Animation frames: Time = 0 through Time = 7]

SLIDE 28

BROADCAST ON DGX-2

Direct Broadcast (DGX-2)

[Animation frame: Time = 0]

SLIDE 29

IMPLEMENTATION COMPARISON

Ring Scatter (DGX-1):

__global__ void broadcast_ring(int *src, int *dst)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    dst[index] = src[index];   // forward this GPU's data to the next GPU in the ring
}

// CPU code
cudaEvent_t ev[MAX_DEVICES];
for (int i = 0; i < numDevices - 1; i++) {
    cudaEventCreate(&ev[i]);
    cudaSetDevice(i);
    if (i > 0)
        cudaStreamWaitEvent(NULL, ev[i - 1], 0);   // wait for the previous hop to land
    broadcast_ring<<<blocks, threads>>>(ptr[i], ptr[i + 1]);
    cudaEventRecord(ev[i]);
}
cudaSetDevice(0);
cudaStreamWaitEvent(NULL, ev[numDevices - 2], 0);
cudaDeviceSynchronize();

Direct Broadcast (DGX-2):

__global__ void broadcast_direct(int *src, int **pDst, int numDevices)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < numDevices; ++i) {
        int *dst = pDst[i];
        dst[index] = src[index];   // write directly into every peer GPU's buffer
    }
}

// CPU code (dPtr is a device-side array holding the per-GPU destination pointers)
cudaSetDevice(0);
broadcast_direct<<<blocks, threads>>>(ptr[0], dPtr, numDevices);
cudaDeviceSynchronize();

SLIDE 30

ALL-REDUCE BENCHMARK

[Chart: achieved bandwidth (MB/sec) vs. message size (kB) for 2x DGX-1V w/ 100Gb InfiniBand, 2x DGX-1V w/ 400Gb InfiniBand, DGX-2 (ring), and DGX-2 (direct)]

• Direct all-reduce: ~50 lines of code
• Better performance for small messages, since a single direct kernel avoids the ring's chain of dependent launches and inter-GPU hops

Source: performance measured on pre-production NVSwitch hardware

SLIDE 31

NVSWITCH FFT BENCHMARK

3D FFT 1280 x 1280 x 1280 in GFLOPS (FP32 complex):

               4 GPU   8 GPU   16 GPU
DGX-1 (Volta)  3484    1374    N/A
DGX-2          3598    6965    13K

Performance is measured. NVSwitch uses early bring-up software. FFT is measured with cufftbench.

SLIDE 32

2X HIGHER PERFORMANCE WITH NVSWITCH

DGX-2 with NVSwitch vs. 2x DGX-1 (Volta):

• Physics (MILC benchmark), 4D grid: 2X faster
• Weather (ECMWF benchmark), all-to-all: 2.4X faster
• Recommender (sparse embedding), reduce & broadcast: 2X faster
• Language model (Transformer with MoE), all-to-all: 2.7X faster

2x DGX-1V servers have dual-socket Xeon E5-2698v4 processors and 8x V100 GPUs, connected via 4x 100Gb IB ports. The DGX-2 server has dual-socket Xeon Platinum 8168 processors and 16 V100 GPUs.

SLIDE 33

SUMMARY

DGX-2 advantages:
• Faster solutions (e.g., Physics MILC benchmark on a 4D grid: 2X faster)
• Faster development
• Gigantic problems (16 GPUs sharing 512 GB of Unified Memory)

SLIDE 34

OTHER NVIDIA SESSIONS TO ATTEND

S8670, Wed 3/28 (time TBD): Multi-GPU Programming Techniques in CUDA
  • Stephen Jones (Software Architect)

S8474, Thu 3/29, 10:00 am: GPUDirect: Life in the Fast Lane
  • Davide Rossetti (Software Architect)