S8688: INSIDE DGX-2
Glenn Dearth, Vyas Venkataraman
Mar 28, 2018
2
Agenda
Why was DGX-2 created
DGX-2 internal architecture
Software programming model
Simple application
Results
3
DEEP LEARNING TRENDS
Application properties
Explosive DL growth:
Increasing data, computation & complexity demands
Exceeds memory capacity of a single GPU
Exceeds compute performance of a single GPU
Driving scale-out across GPUs
[Chart: image-network compute demand (GOP * bandwidth) for AlexNet, GoogleNet, ResNet-50, Inception-v2 and Inception-v4, 2011-2017; roughly 350X growth]
4
DGX-1
8 V100 GPUs
6 NVLinks per GPU
Each link is 50 GB/s (bidirectional)
300 GB/s bidirectional bandwidth per GPU
DGX-1 uses a Hybrid Cube Mesh topology
Internal bisection bandwidth: 300 GB/s
Optimized data-parallel training with NCCL
5
DESIRED SCALE-OUT BEHAVIOR
Scale up to 16 GPUs
Direct peer GPU memory access
Full non-blocking bandwidth
Utilize all GPU links when accessing memory
Simplify multi-GPU programming
6
SCALE UP TO 16 GPUS
[Diagram: 16 GPUs, GPU 0 through GPU 15]
7
DIRECT PEER MEMORY ACCESS
[Diagram: GPU 0 through GPU 15 connected through NVSwitches for direct peer memory access]
8
FULL NON-BLOCKING BANDWIDTH
[Diagram: GPU 0 through GPU 15 fully connected through twelve NVSwitches for non-blocking bandwidth]
9
DGX-2 AT A GLANCE
[Diagram: two GPU baseboards linked by NVSwitches, 6x NVLink per GPU]

DGX-2:
GPU density: 16
GPU to GPU: always 6 NVLinks
Connectivity: fully connected topology
Symmetric bisection bandwidth: 2.4 TB/s
10
DESIGNED TO TRAIN THE PREVIOUSLY IMPOSSIBLE
1. NVIDIA Tesla V100 32GB
2. Two GPU boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB total HBM2 memory, interconnected by plane card
3. Twelve NVSwitches: 2.4 TB/sec bisection bandwidth
4. Eight EDR InfiniBand/100 GigE: 1600 Gb/sec total bidirectional bandwidth
5. PCIe switch complex
6. Two Intel Xeon Platinum CPUs
7. 1.5 TB system memory
8. 30 TB NVMe SSD internal storage
9. Dual 10/25 GigE
Introducing NVIDIA DGX-2
11
NVSWITCH
Features:
18 NVLink ports @ 50 GB/s per port, 900 GB/s total
Fully connected crossbar
x4 PCIe Gen2 management port, GPIO, I2C
Transistor count: 2 billion
Package: 47.5 x 47.5 mm, 1937 balls @ 1 mm pitch
12
SWITCH FUNCTIONS
NVLink: performs physical, data link & transaction layer functions
Forwarding: determines packet routing
Crossbar (non-blocking): schedules traffic flows to outputs
Management: configuration, errors, monitors
13
NVSWITCH RELIABILITY FEATURES
Link CRC and retry
ECC on routing structures and data path
Secondary checks:
  Routing checks
  Data path overflow/underflow checks
  Access control checks
14
Programming Model
15
MULTI GPU PROGRAMMING IN CUDA
16
EXECUTION CONTROL
Asynchronous CUDA calls execute in a CUDA stream
  Default to the null stream
  A stream can be specified explicitly
CUDA runtime API calls implicitly target the currently selected device
  The current device can be changed with a cudaSetDevice() call
Cooperative groups offer a multi-device launch: cudaLaunchCooperativeKernelMultiDevice()
(A minimal per-device stream example is sketched below.)
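As an illustration of these points (not part of the original deck), the sketch below drives each GPU with its own explicit stream; the kernel name work_kernel, the buffer array and the launch geometry are assumptions.

// Hypothetical sketch: one explicit stream per GPU, asynchronous launches on each.
__global__ void work_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                 // placeholder work
}

void launch_on_all_gpus(float **devBufs, int n, int numDevices) {
    cudaStream_t streams[16];                   // DGX-2 has at most 16 GPUs
    for (int d = 0; d < numDevices; ++d) {
        cudaSetDevice(d);                       // select the current device
        cudaStreamCreate(&streams[d]);          // explicit (non-null) stream
        work_kernel<<<(n + 255) / 256, 256, 0, streams[d]>>>(devBufs[d], n);
    }
    for (int d = 0; d < numDevices; ++d) {      // wait for every device to finish
        cudaSetDevice(d);
        cudaStreamSynchronize(streams[d]);
        cudaStreamDestroy(streams[d]);
    }
}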
17
CUDA ON DGX-2
DGX-2 enables up to 16 peer GPUs
DGX-2 enables full NVLink bandwidth to peer GPUs
GPU memory model extended to all GPUs
Unified Memory and CUDA-aware MPI use NVLink for transfers
(A peer-capability query sketch follows.)
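As a small, assumed illustration (not from the deck), peer accessibility between GPU pairs can be queried with cudaDeviceCanAccessPeer; on DGX-2 every pair of the 16 GPUs should report peer capability.

// Sketch: enumerate which GPU pairs can access each other's memory directly.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);
    for (int i = 0; i < numDevices; ++i) {
        for (int j = 0; j < numDevices; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);  // 1 if GPU i can map GPU j's memory
            printf("GPU %d -> GPU %d : %s\n", i, j, canAccess ? "peer" : "no peer");
        }
    }
    return 0;
}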
18
MEMORY MANAGEMENT
NVLINK PROVIDES
All-to-all high-bandwidth peer mapping between GPUs
Full inter-GPU memory interconnect (including atomics; illustrated in the sketch below)
[Diagram: GPU 0 through GPU 15, 16x 32GB independent memory regions]
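As an assumed illustration of peer atomics (not from the deck), one GPU can atomically update a counter that lives in another GPU's memory once peer access is enabled; the kernel and counter names below are hypothetical.

// Sketch: GPU 0 performs atomics directly on a counter allocated on GPU 1.
__global__ void add_into_peer(int *peerCounter, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(peerCounter, 1);   // atomic lands in the peer GPU's memory over NVLink
}

void demo_peer_atomics() {
    int *counterOnGpu1 = nullptr;
    cudaSetDevice(1);
    cudaMalloc(&counterOnGpu1, sizeof(int));
    cudaMemset(counterOnGpu1, 0, sizeof(int));
    cudaDeviceSynchronize();                // make sure the counter is initialized first

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);       // let GPU 0 map GPU 1's memory
    add_into_peer<<<4, 256>>>(counterOnGpu1, 1024);  // GPU 0 increments GPU 1's counter
    cudaDeviceSynchronize();
}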
19
PINNED MEMORY ALLOCATION
Enable peer memory access
// Enable peer access between all pairs of GPUs
for (int i = 0; i < numDevices; ++i) {
    cudaSetDevice(i);                          // subsequent calls apply to GPU i
    for (int j = 0; j < numDevices; ++j) {
        if (i != j) {
            cudaDeviceEnablePeerAccess(j, 0);  // let GPU i access GPU j's memory
        }
    }
}
20
PINNED MEMORY ALLOCATION
cudaMalloc with CUDA P2P
int* ptr[MAX_DEVICES];
for (int i = 0; i < numDevices; ++i) {
    // Set a device
    cudaSetDevice(i);
    // Allocate memory on the device
    cudaMalloc((void**)&ptr[i], size);
}
21
UNIFIED MEMORY PROVIDES
Single memory view shared by all GPUs
Automatic migration of data between GPUs
User control of data locality
UNIFIED MEMORY + DGX-2
[Diagram: GPU 0 through GPU 15 sharing 512 GB of Unified Memory]
22
UNIFIED MEMORY
Allocating across multiple GPUs
int* ptr;
// Allocate managed memory visible to all GPUs
cudaMallocManaged((void**)&ptr, size * numDevices);
23
UNIFIED MEMORY
Allocating across multiple GPUs
int* ptr;
// Allocate managed memory, one region of 'size' bytes per device
cudaMallocManaged((void**)&ptr, size * numDevices);

for (int i = 0; i < numDevices; ++i) {
    void *region = (char*)ptr + i * size;
    // Mark the region as preferring to live on a specific GPU
    cudaMemAdvise(region, size, cudaMemAdviseSetPreferredLocation, i);
    // Mark the region as accessed by all devices
    for (int j = 0; j < numDevices; ++j) {
        cudaMemAdvise(region, size, cudaMemAdviseSetAccessedBy, j);
    }
}
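As an optional, assumed addition (not shown in the deck), cudaMemPrefetchAsync can migrate each region to its preferred GPU ahead of the computation instead of relying on on-demand migration.

// Sketch: prefetch each per-GPU region of the managed allocation to its preferred device.
for (int i = 0; i < numDevices; ++i) {
    cudaSetDevice(i);
    // Migrate the i-th region to GPU i on the default stream
    cudaMemPrefetchAsync((char*)ptr + i * size, size, i, 0);
}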
24
BROADCAST ON DGX-1
Ring Scatter
Time = 0
25
BROADCAST ON DGX-1
Ring Scatter
Time = 0 Time = 1
26
BROADCAST ON DGX-1
Ring Scatter
Time = 0 Time = 1 Time = 2
27
BROADCAST ON DGX-1
Ring Scatter
Time = 0 Time = 7
28
BROADCAST ON DGX-2
Direct Broadcast (DGX-2)
Time = 0
29
IMPLEMENTATION COMPARISON
// Ring scatter (DGX-1)
__global__ void broadcast_ring(int *src, int *dst) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    dst[index] = src[index];
}

// CPU code
cudaEvent_t ev[MAX_DEVICES];
for (int i = 0; i < numDevices - 1; i++) {
    cudaEventCreate(&ev[i]);
    cudaSetDevice(i);
    if (i > 0)
        cudaStreamWaitEvent(NULL, ev[i-1], 0);
    broadcast_ring<<<blocks, threads>>>(ptr[i], ptr[i+1]);
    cudaEventRecord(ev[i]);
}
cudaSetDevice(0);
cudaStreamWaitEvent(NULL, ev[numDevices - 2], 0);
cudaDeviceSynchronize();

// Direct broadcast (DGX-2)
__global__ void broadcast_direct(int *src, int **pDst, int numDevices) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < numDevices; ++i) {
        int *dst = pDst[i];
        dst[index] = src[index];
    }
}

// CPU code
cudaSetDevice(0);
broadcast_direct<<<blocks, threads>>>(ptr[0], dPtr, numDevices);
cudaDeviceSynchronize();
Ring Scatter (DGX-1) Direct broadcast (DGX-2)
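The slide does not show how dPtr, the device-resident array of destination pointers used by broadcast_direct, is built; below is a minimal assumed sketch constructing it from the per-GPU allocations in ptr[].

// Sketch: copy the array of per-GPU destination pointers to GPU 0 so the
// direct-broadcast kernel can dereference them.
int **dPtr = nullptr;
cudaSetDevice(0);
cudaMalloc((void**)&dPtr, numDevices * sizeof(int*));
cudaMemcpy(dPtr, ptr, numDevices * sizeof(int*), cudaMemcpyHostToDevice);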
30
ALL REDUCE BENCHMARK
[Chart: achieved bandwidth (MB/sec) vs. message size (kB) for 2x DGX-1V with 100Gb InfiniBand, 2x DGX-1V with 400Gb InfiniBand, DGX-2 (ring), and DGX-2 (direct)]
Direct all-reduce: ~50 lines of code
Better performance for small messages (a sketch follows below)
Source: performance measured on pre-production NVSwitch hardware
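For illustration only (the deck's ~50-line implementation is not shown), a direct all-reduce can be sketched as each GPU reducing a disjoint chunk across every peer buffer and writing the result back to all GPUs; the names and launch geometry below are assumptions, and peer access is assumed to be enabled with all inputs already in place.

// Sketch: bufs[d] is a device-visible array of every GPU's buffer pointer.
// GPU 'dev' owns elements [chunkStart, chunkStart + chunkLen); chunks are disjoint,
// so reads and writes from different GPUs never overlap.
__global__ void direct_all_reduce(int **bufs, int numDevices,
                                  int chunkStart, int chunkLen) {
    int idx = chunkStart + blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= chunkStart + chunkLen) return;
    int sum = 0;
    for (int d = 0; d < numDevices; ++d)
        sum += bufs[d][idx];          // read this element from every peer over NVLink
    for (int d = 0; d < numDevices; ++d)
        bufs[d][idx] = sum;           // write the reduced value back to every peer
}

// Host side: dBufs[d] is GPU d's device copy of the pointer array.
void run_direct_all_reduce(int **dBufs[], int count, int numDevices) {
    int chunk = count / numDevices;
    for (int d = 0; d < numDevices; ++d) {
        cudaSetDevice(d);
        direct_all_reduce<<<(chunk + 255) / 256, 256>>>(dBufs[d], numDevices,
                                                        d * chunk, chunk);
    }
    for (int d = 0; d < numDevices; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
}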
31
NVSWITCH FFT BENCHMARK
[Chart: 3D FFT, 1280 x 1280 x 1280, in GFLOPS (FP32 complex), DGX-1 (Volta) vs. DGX-2 at 4, 8 and 16 GPUs; measured values 1374, 3484, 3598, 6965 and ~13K GFLOPS, DGX-1 N/A at 16 GPUs]

Performance is measured. NVSwitch uses early bring-up software. FFT is measured with cufftbench.
32
2X HIGHER PERFORMANCE WITH NVSWITCH
2x DGX-1V servers, each with dual-socket Xeon E5-2698 v4 processors and 8x V100 GPUs, connected via 4x 100Gb IB ports | DGX-2 server with dual-socket Xeon Platinum 8168 processors and 16 V100 GPUs
DGX-2 with NVSwitch vs. 2x DGX-1 (Volta):
Physics (MILC benchmark), 4D grid: 2X faster
Weather (ECMWF benchmark), all-to-all: 2.4X faster
Recommender (sparse embedding), reduce & broadcast: 2X faster
Language model (Transformer with MoE), all-to-all: 2.7X faster
33
SUMMARY
DGX-2 Advantages:
Faster Solutions
Faster Development
Gigantic Problems

[Recap: benchmark speedups such as Physics (MILC benchmark, 4D grid) 2X faster; 16 GPUs sharing 512 GB of Unified Memory]
34
OTHER NVIDIA SESSIONS TO ATTEND
S8670 Wed 3/28 Time? Multi-GPU Programming Techniques in CUDA
- Stephen Jones (Software Architect)
S8474 Thur 3/29 10:00 am GPUDirect: Life in the Fast Lane
- Davide Rossetti (Software Architect)