S7750 - THE MAKING OF DGX SATURNV: BREAKING THE BARRIERS TO AI SCALE
Presenter: Louis Capps, Solution Architect, NVIDIA, lcapps@nvidia.com

A TALE OF ENLIGHTENMENT

Basic (1 FPS):
OK
List
10 for x = 1 to 3
20 print x
30 next x
Run
1
2
3
OK

Assembly:
      LDX #$00
dec:  INX
      JSR printx
      CPX #$03
      BNE dec
      BRK
DEEP LEARNING – A NEW COMPUTING PLATFORM
DL -> 30 FPS
Innovation is fueled by the right engine!
- Deep Learning scalability; move outside the box
- Drive research and Deep Learning applications
- Partner with university research, government, and industry collaborations
- Enable data science in HPC
SATURNV PURPOSE
124 Node Supercomputing Cluster
124 NVIDIA DGX-1 Nodes – 992 P100 GPUs
8x NVIDIA Tesla P100 SXM GPUs – NVLink cube mesh
2x Intel Xeon 20-core CPUs
512 GB DDR4 system memory
SSD – 7 TB scratch + 0.5 TB OS
Mellanox 36 port EDR L1 and L2 switches
4 ports per system, fat-tree topology
Ubuntu 14.04, CUDA 8, OpenMPI 1.10.5a1, Docker, DL Frameworks
NVIDIA GPU BLAS + Intel MKL (NVIDIA GPU HPL)
Deep Learning applied research
Many users, frameworks, algorithms, networks, new approaches
Embedded, robotics, automotive, hyperscale, HPC
NVIDIA DGX SATURNV ARCHITECTURE
124 node Cluster
nvidia.com/dgx1
SATURNV STACK
DGX-1 MULTI-SYSTEM
NVIDIA DGX SATURNV
Greenest Supercomputer
HPL Setup
Problem contained mainly in GPU memory (~16 GB/GPU)
124 nodes * 8 GPUs/node * 16 GB/GPU = 15,872 GB of GPU memory
N = 1,419,552
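A back-of-the-envelope sketch of how a problem size in that range follows from the memory budget (the ~95% fill factor here is an illustrative assumption, not a figure from the run):

# Estimate HPL problem size N from aggregate GPU memory (double precision = 8 bytes/element)
# Assumes ~95% of the 15,872 GB is devoted to the HPL matrix (assumption for illustration)
awk 'BEGIN { bytes = 15872 * 2^30; printf "N ~ %d\n", sqrt(0.95 * bytes / 8) }'
# Prints roughly 1.42 million, consistent with the N = 1,419,552 used for the run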
Measurement
PDU input power time-stamped during the full run
All cluster hardware measured – nodes, switches, storage
Performance
HPL Rpeak – 4,896 TF; HPL Rmax – 3,307 TF
Power (full run average) – 321.2 kW; Power (core-phase average) – 349.5 kW
9.4 GF/W (Rmax divided by the core-phase average power) – 40% better than the nearest competing technology
NVIDIA DGX-1 SATURNV HPL RUN
124 node Supercomputing cluster
SATURNV produced groundbreaking 9.4 GF/W at full scale
-> Sets the stage for future exascale-class computing
~15 kW sustained per rack
NOV 2016 TOP GREEN500 SYSTEM
Green500.org Top500.org
SATURNV produced groundbreaking 9.4 GF/W at full scale
-> Sets the stage for future exascale-class computing
HPL – High Performance Linpack
Multi-system benchmark – measures optimized double-precision floating-point performance
Solves a system of dense linear equations
One system or many connected in a cluster – usually Ethernet or InfiniBand
Single problem split across many systems – single final performance number
Well designed to scale across large clusters and push limits
Top500 (top500.org)
List of the fastest HPL clusters in the world
Updated twice a year – June and November – published at the ISC and SC conferences
Green500 (green500.org)
Same HPL clusters, but ranked by performance per watt measured during the HPL run
Published at the same time as the Top500
WHAT IS HPL, TOP500, GREEN500?
Compute
- Significant math performance – FP32, FP16, INT8
- Highly optimized frameworks
- Training, Inference
Interconnect
- Multiple compute units inside node
- Multiple systems
Storage
- Low latency, high bandwidth
- Equal perf to all systems
- Local caching for DL workloads
Facilities
- Sufficient for bursts
- Maintain inlet air temperature at all times
- High power density
DGX-1 SUPERCOMPUTER CHALLENGES
Giant Leap Towards Exascale AI
NVIDIA DGX-1 COMPUTE
NCCL Collective Library
DGX-1 single system considerations
- Higher performance per system
- 27x to 58x faster
- Ingest data faster, provides faster results
- Also more power and heat
- High data ingest for DL workloads
- More storage and I/O into single system
- Cache data locally
- NFS cache on local SSD for training data
- Higher power/thermal density
- Example: 32 racks @ 750 kW vs 200 racks @ 1,000 kW
- Ambient temperatures are very important
- Silicon uses more power at higher temperatures
- Clocks will gate at thermal and power limits
- Variability lowers overall performance of multi-GPU and multi-system runs (a monitoring sketch follows below)
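A minimal monitoring sketch for the thermal and power points above (the query fields are standard nvidia-smi properties; the 5-second interval is just an example):

# Log GPU temperature, power draw, SM clock, and any active throttle reasons every 5 seconds
nvidia-smi --query-gpu=index,temperature.gpu,power.draw,clocks.sm,clocks_throttle_reasons.active \
           --format=csv -l 5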
DGX-1 COMPUTE AND MULTI-SYSTEM
DGX-1 COMPUTE CONSIDERATIONS
#1 Recommendation - Using containers improves performance (a minimal launch example follows below)
- Access to the latest NVIDIA-tuned codes
- Latest NCCL libraries
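A minimal launch sketch (the image name and tag are illustrative placeholders, not a specific NVIDIA-published container):

# Pull and run an NVIDIA-tuned framework container with GPU access
# (nvidia-docker handles the driver/device plumbing into the container)
docker pull nvcr.io/nvidia/tensorflow:<tag>          # <tag> is a placeholder
nvidia-docker run --rm -it nvcr.io/nvidia/tensorflow:<tag>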
Clocking
- Set CPUs to performance mode to improve memory/I/O bandwidth
- Leave GPU clocks at default – if you do set them, use base clocks or slightly higher
- Running pinned at max clocks can cause extreme variation and reduced performance, depending on workload
- Monitor with nvidia-smi dmon (see the sketch below)
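A minimal sketch of the two knobs above (the sysfs paths assume the standard Linux cpufreq interface):

# Set every CPU's frequency governor to performance mode
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$g" > /dev/null
done

# Watch per-GPU power/temperature, utilization, and clocks, one line per second
nvidia-smi dmon -s puc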
[Chart: Effects of Clocking – GPU clock frequency (MHz) over time]
DGX-1 COMPUTE CONSIDERATIONS
Affinity
- Best performance when CPU/GPU/memory/IB affinities are aligned
- E.g. CPU socket 0 <-> GPU0/1 <-> mlx5_0
Interrupt traffic can be high
- Keep core 0 and core 20 free for interrupts
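Before pinning anything, the CPU/GPU/IB layout of a node can be inspected; a minimal sketch (output formatting varies by driver version):

# Show the GPU <-> GPU and GPU <-> NIC connection matrix plus CPU affinity masks
nvidia-smi topo -m
# Show the NUMA node layout, to pick --physcpubind ranges on the right socket
numactl --hardware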
Example affinity with numactl:
mpirun \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_0 -x CUDA_VISIBLE_DEVICES=0 numactl --physcpubind=1-4   ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_0 -x CUDA_VISIBLE_DEVICES=1 numactl --physcpubind=6-9   ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_1 -x CUDA_VISIBLE_DEVICES=2 numactl --physcpubind=10-13 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_1 -x CUDA_VISIBLE_DEVICES=3 numactl --physcpubind=15-18 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_2 -x CUDA_VISIBLE_DEVICES=4 numactl --physcpubind=21-24 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_2 -x CUDA_VISIBLE_DEVICES=5 numactl --physcpubind=25-28 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_3 -x CUDA_VISIBLE_DEVICES=6 numactl --physcpubind=30-33 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_3 -x CUDA_VISIBLE_DEVICES=7 numactl --physcpubind=35-38 ./mycode
DGX-1 MULTI-SYSTEM CONSIDERATIONS
[Diagram: DGX-1 system topology – CPU0/CPU1 each with local memory and PCIe switches, eight GPUs, and four Mellanox IB adapters uplinked to an IB leaf switch]
Design topologies that reduce latency and improve total bandwidth
- Fat-tree topologies for instance
- Equal bandwidth from a system all the way up to the top-level switch
- Ensure GPUDirect RDMA is enabled (a quick check is sketched after this list)
- DL and many compute workloads rely on fast synchronization
- Collectives
- Consistent iteration times
System hierarchy
- CPU0 <-> GPU0/1 <-> mlx5_0
- CPU0 <-> GPU2/3 <-> mlx5_1
- CPU1 <-> GPU4/5 <-> mlx5_2
- CPU1 <-> GPU6/7 <-> mlx5_3
If designing with only two IB ports per system, connect mlx5_0 and mlx5_2 (one per CPU socket)
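A minimal check for the GPUDirect RDMA point above (this assumes the Mellanox nv_peer_mem module is the mechanism in use, which is typical for Tesla GPUs with Mellanox HCAs):

# GPUDirect RDMA between GPUs and Mellanox HCAs requires the nv_peer_mem kernel module
lsmod | grep nv_peer_mem || echo "nv_peer_mem not loaded - GPUDirect RDMA unavailable"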
DGX-1 MULTI-NODE INTERCONNECT DESIGN
DGX-1 multi-system considerations
- High node-to-node communication
- DL and HPC workloads
4 IB ports -> 2 ports
- DL: up to 5% loss
- Compute: up to 18% loss
1 IB port per system: low performance
- Significant contention for many workloads
- Can't use GPUDirect RDMA across the full system
Switch hierarchy critical
- Low bandwidth on second level
- Same issues as lowering ports per system
- Contention, lower bandwidth, variability
DGX-1 MULTI-SYSTEM INTERCONNECT
Storage needs
- HPC needs are well known
- Parallel file systems like Lustre and Spectrum Scale are well suited
- DL workload needs are just now being understood
- Read dominated
- Input data rarely changes
- Can be raw or formatted in a DB (like LMDB)
- Large group of random reads, then reread of the same data later
- Approaches
- Local caching helps significantly
- Can be many GB (>16 GB, for instance)
- Another approach is to keep full datasets local (>100 GB for ImageNet)
- Local SSD RAID
- Alternately, copy all data to the nodes at the beginning of the job (a sketch follows below)
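A hedged sketch of the "copy everything at job start" approach (the host list, paths, and /raid mount point are illustrative, not the SATURNV configuration):

# Stage the dataset from the shared filesystem onto each node's local SSD RAID before training
pdsh -w dgx[001-124] "mkdir -p /raid/datasets && \
    rsync -a /mnt/central-nfs/imagenet/ /raid/datasets/imagenet/"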
Reference designs
- 10 GbE-attached central NFS with local caching (a mount sketch follows below)
- Spectrum Scale, IB attached (still evaluating)
- Lustre, IB attached (still evaluating)
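A hedged sketch of the central-NFS-with-local-caching design (server name and paths are illustrative; FS-Cache via cachefilesd is one way to put the cache on local SSD, not necessarily the exact SATURNV setup):

# Install and start the FS-Cache user-space daemon; point its cache dir at the local SSD
sudo apt-get install -y cachefilesd
sudo sed -i 's|^dir .*|dir /raid/fscache|' /etc/cachefilesd.conf
sudo service cachefilesd start
# Mount the central NFS export read-only with 'fsc' so reads are cached on the local SSD
sudo mount -t nfs -o ro,fsc nfs-server:/datasets /mnt/datasets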
DGX-1 STORAGE CONSIDERATIONS
CANDLE – Accelerate cancer research
Energy / Fusion – Future of low-cost energy
Weather and Climate – Disaster preparedness
Astrophysics – Our future?
Autonomous Cars
AI GRAND CHALLENGES
Summary
DGX-1 crafted for AI and Computational workloads
- High compute density, but also high power and thermal density
- Watch ambient temperature – it can cause large variability
A single system has large demands for data ingest and GPU-to-GPU communication
Multiple DGX-1 systems have large demands on inter-node communication for most workloads
- Need at least two IB rails per system (1 EDR IB port for every 2 GPUs)
DL Storage needs are very high
- But read dominated (vs. the write-heavy patterns of HPC)
Many codes benefit significantly from paying attention to affinity
- Align CPU/memory with GPUs and IB cards
- Avoid cores handling interrupts
NVIDIA pre-built containers significantly reduce user work
- Affinity is already handled
- Provide technologies like NCCL and the latest tuned code and frameworks
DGX-1 DL SCALABILITY SUMMARY
Thanks! More info in the NVIDIA DGX-1 System Architecture whitepaper:
- http://www.nvidia.com/object/dgx-1-system-architecture-whitepaper.html
CANDLE sessions (http://www.gputechconf.com/agenda/schedule)
- S7788 - CANDLE: PREDICTING TUMOR CELL RESPONSE TO DRUG TREATMENTS
- S7782 - THE DOE AND NCI PARTNERSHIP ON PRECISION ONCOLOGY AND THE CANCER MOONSHOT
- S7792 - BUILDING EXASCALE DEEP LEARNING TOOLS TO HELP UNDERSTAND CANCER BIOLOGY AT THE MOLECULAR SCALE
- S7780 - BUILDING EXASCALE DEEP TEXT COMPREHENSION TOOLS FOR EFFECTIVE CANCER SURVEILLANCE
DGX-1 DL SCALABILITY SUMMARY
S7754 - WHAT'S NEXT IN DGX SERVER SOLUTIONS FOR DEEP LEARNING
- Thursday, May 11, 10:00 AM - 10:50 AM – Room 210B