S7750 - THE MAKING OF DGX SATURNV: BREAKING THE BARRIERS TO AI SCALE
Presenter: Louis Capps, Solution Architect, NVIDIA, lcapps@nvidia.com

A TALE OF ENLIGHTENMENT

Basic (1 FPS):
OK
List
10 for x = 1 to 3
20 print x
30 next x
Run
1
2
3
OK

Assembly:
      LDX #$00
dec:  INX
      JSR printx
      CPX #$03
      BNE dec
      BRK
DEEP LEARNING – A NEW COMPUTING PLATFORM
DL -> 30 FPS
Innovation is fueled by the right engine!
- Deep Learning scalability; move outside the box
- Drive research and Deep Learning applications
- Partner with university research, government, and industry collaborations
- Enable data science in HPC
SATURNV PURPOSE
124 Node Supercomputing Cluster
124 NVIDIA DGX-1 Nodes – 992 P100 GPUs
8x NVIDIA Tesla P100 SXM GPUs – NVLink cube mesh
2x Intel Xeon 20-core CPUs
512 GB DDR4 system memory
SSD – 7 TB scratch + 0.5 TB OS
Mellanox 36 port EDR L1 and L2 switches
4 ports per system, fat-tree topology
Ubuntu 14.04, CUDA 8, OpenMPI 1.10.5a1, Docker, DL Frameworks
NVIDIA GPU BLAS + Intel MKL (NVIDIA GPU HPL)
Deep Learning applied research
Many users, frameworks, algorithms, networks, new approaches
Embedded, robotics, automotive, hyperscale, HPC
NVIDIA DGX SATURNV ARCHITECTURE
124 node Cluster
nvidia.com/dgx1
SATURNV STACK
DGX-1 MULTI-SYSTEM
NVIDIA DGX SATURNV
Greenest Supercomputer
HPL Setup
Problem contained mainly in GPU memory (~16 GB/GPU)
124 nodes * 8 GPUs/node * 16 GB/GPU = 15,872 GB of GPU memory
N = 1,419,552
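A back-of-the-envelope sketch of how a problem size in that range follows from the memory budget (the ~95% fill factor here is an illustrative assumption, not a figure from the run):

# Estimate HPL problem size N from aggregate GPU memory (double precision = 8 bytes/element)
# Assumes ~95% of the 15,872 GB is devoted to the HPL matrix (assumption for illustration)
awk 'BEGIN { bytes = 15872 * 2^30; printf "N ~ %d\n", sqrt(0.95 * bytes / 8) }'
# Prints roughly 1.42 million, consistent with the N = 1,419,552 used for the run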
Measurement
PDU input power time-stamped during the full run
All cluster hardware measured – nodes, switches, storage
Performance
HPL Rpeak – 4,896 TF; HPL Rmax – 3,307 TF
Power (full run average) – 321.2 kW; Power (core-phase average) – 349.5 kW
9.4 GF/W (Rmax divided by the core-phase average power) – 40% better than the nearest competing technology
NVIDIA DGX-1 SATURNV HPL RUN
124 node Supercomputing cluster
SATURNV produced groundbreaking 9.4 GF/W at full scale
-> Sets the stage for future exascale-class computing
~15 kW sustained per rack
NOV 2016 TOP GREEN500 SYSTEM
Green500.org Top500.org
SATURNV produced groundbreaking 9.4 GF/W at full scale
-> Sets the stage for future exascale-class computing
HPL – High Performance Linpack
Multi-system benchmark – measures optimized double-precision floating-point performance
Solves a system of dense linear equations
One system or many connected in a cluster – usually Ethernet or InfiniBand
Single problem split across many systems – single final performance number
Well designed to scale across large clusters and push limits
Top500 (top500.org)
List of the fastest HPL clusters in the world
Updated twice a year – June and November – published at the ISC and SC conferences
Green500 (green500.org)
Same HPL clusters, but ranked by performance per watt measured during the HPL run
Published at the same time as the Top500
WHAT IS HPL, TOP500, GREEN500?
Compute
- Significant math performance – FP32, FP16, INT8
- Highly optimized frameworks
- Training, Inference
Interconnect
- Multiple compute units inside node
- Multiple systems
Storage
- Low latency, high bandwidth
- Equal perf to all systems
- Local caching for DL workloads
Facilities
- Sufficient for bursts
- Maintain inlet air temperature at all times
- High power density
DGX-1 SUPERCOMPUTER CHALLENGES
Giant Leap Towards Exascale AI
NVIDIA DGX-1 COMPUTE
NCCL Collective Library
DGX-1 single system considerations
- Higher performance per system
- 27x to 58x faster
- Ingest data faster, provides faster results
- Also more power and heat
- High data ingest for DL workloads
- More storage and I/O into single system
- Cache data locally
- NFS cache on local SSD for training data
- Higher power/thermal density
- Example: 32 racks @ 750 kW vs 200 racks @ 1,000 kW
- Ambient temperatures are very important
- Silicon uses more power at higher temperatures
- Clocks will gate at thermal and power limits
- Variability lowers overall performance of multi-GPU and multi-system runs (a monitoring sketch follows below)
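A minimal monitoring sketch for the thermal and power points above (the query fields are standard nvidia-smi properties; the 5-second interval is just an example):

# Log GPU temperature, power draw, SM clock, and any active throttle reasons every 5 seconds
nvidia-smi --query-gpu=index,temperature.gpu,power.draw,clocks.sm,clocks_throttle_reasons.active \
           --format=csv -l 5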
DGX-1 COMPUTE AND MULTI-SYSTEM
DGX-1 COMPUTE CONSIDERATIONS
#1 Recommendation - Using containers improves performance (a minimal launch example follows below)
- Access to the latest NVIDIA-tuned codes
- Latest NCCL libraries
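A minimal launch sketch (the image name and tag are illustrative placeholders, not a specific NVIDIA-published container):

# Pull and run an NVIDIA-tuned framework container with GPU access
# (nvidia-docker handles the driver/device plumbing into the container)
docker pull nvcr.io/nvidia/tensorflow:<tag>          # <tag> is a placeholder
nvidia-docker run --rm -it nvcr.io/nvidia/tensorflow:<tag>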
Clocking
- Set CPUs to performance mode to improve memory/I/O bandwidth
- Leave GPU clocks at default – if you do set them, use base clocks or slightly higher
- Running pinned at max clocks can cause extreme variation and reduced performance, depending on workload
- Monitor with nvidia-smi dmon (see the sketch below)
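A minimal sketch of the two knobs above (the sysfs paths assume the standard Linux cpufreq interface):

# Set every CPU's frequency governor to performance mode
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$g" > /dev/null
done

# Watch per-GPU power/temperature, utilization, and clocks, one line per second
nvidia-smi dmon -s puc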
[Chart: Effects of Clocking – GPU clock frequency (MHz) over time]
DGX-1 COMPUTE CONSIDERATIONS
Affinity
- Best performance when CPU/GPU/memory/IB affinities are aligned
- E.g. CPU socket 0 <-> GPU0/1 <-> mlx5_0
Interrupt traffic can be high
- Keep core 0 and core 20 free for interrupts
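Before pinning anything, the CPU/GPU/IB layout of a node can be inspected; a minimal sketch (output formatting varies by driver version):

# Show the GPU <-> GPU and GPU <-> NIC connection matrix plus CPU affinity masks
nvidia-smi topo -m
# Show the NUMA node layout, to pick --physcpubind ranges on the right socket
numactl --hardware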
Example affinity with numactl:
mpirun \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_0 -x CUDA_VISIBLE_DEVICES=0 numactl --physcpubind=1-4   ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_0 -x CUDA_VISIBLE_DEVICES=1 numactl --physcpubind=6-9   ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_1 -x CUDA_VISIBLE_DEVICES=2 numactl --physcpubind=10-13 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_1 -x CUDA_VISIBLE_DEVICES=3 numactl --physcpubind=15-18 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_2 -x CUDA_VISIBLE_DEVICES=4 numactl --physcpubind=21-24 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_2 -x CUDA_VISIBLE_DEVICES=5 numactl --physcpubind=25-28 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_3 -x CUDA_VISIBLE_DEVICES=6 numactl --physcpubind=30-33 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_3 -x CUDA_VISIBLE_DEVICES=7 numactl --physcpubind=35-38 ./mycode
DGX-1 MULTI-SYSTEM CONSIDERATIONS
[Diagram: DGX-1 system topology – CPU0/CPU1 each with local memory and PCIe switches, eight GPUs, and four Mellanox IB adapters uplinked to an IB leaf switch]
Design topologies that reduce latency and improve total bandwidth
- Fat-tree topologies for instance
- Equal bandwidth from a system all the way up to the top-level switch
- Ensure GPUDirect RDMA is enabled (a quick check is sketched after this list)
- DL and many compute workloads rely on fast synchronization
- Collectives
- Consistent iteration times
System hierarchy
- CPU0 <-> GPU0/1 <-> mlx5_0
- CPU0 <-> GPU2/3 <-> mlx5_1
- CPU1 <-> GPU4/5 <-> mlx5_2
- CPU1 <-> GPU6/7 <-> mlx5_3
If designing with only two IB ports per system, connect mlx5_0 and mlx5_2 (one per CPU socket)
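A minimal check for the GPUDirect RDMA point above (this assumes the Mellanox nv_peer_mem module is the mechanism in use, which is typical for Tesla GPUs with Mellanox HCAs):

# GPUDirect RDMA between GPUs and Mellanox HCAs requires the nv_peer_mem kernel module
lsmod | grep nv_peer_mem || echo "nv_peer_mem not loaded - GPUDirect RDMA unavailable"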
DGX-1 MULTI-NODE INTERCONNECT DESIGN
DGX-1 multi-system considerations
- High node-to-node communication
- DL and HPC workloads
4 IB ports -> 2 ports
- DL: up to 5% loss
- Compute: up to 18% loss
1 IB port per system: low performance
- Significant contention for many workloads
- Can't use GPUDirect RDMA across the full system
Switch hierarchy critical
- Low bandwidth on second level
- Same issues as lowering ports per system
- Contention, lower bandwidth, variability
DGX-1 MULTI-SYSTEM INTERCONNECT
Storage needs
- HPC needs are well known
- Parallel file systems like Lustre and Spectrum Scale are well suited
- DL workload needs are just now being understood
- Read dominated
- Input data rarely changes
- Can be raw or formatted in a DB (like LMDB)
- Large group of random reads, then reread of the same data later
- Approaches
- Local caching helps significantly
- Can be many GB (>16 GB, for instance)
- Another approach is to keep full datasets local (>100 GB for ImageNet)
- Local SSD RAID
- Alternately, copy all data to the nodes at the beginning of the job (a sketch follows below)
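A hedged sketch of the "copy everything at job start" approach (the host list, paths, and /raid mount point are illustrative, not the SATURNV configuration):

# Stage the dataset from the shared filesystem onto each node's local SSD RAID before training
pdsh -w dgx[001-124] "mkdir -p /raid/datasets && \
    rsync -a /mnt/central-nfs/imagenet/ /raid/datasets/imagenet/"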
Reference designs
- 10 GbE-attached central NFS with local caching (a mount sketch follows below)
- Spectrum Scale, IB attached (still evaluating)
- Lustre, IB attached (still evaluating)
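A hedged sketch of the central-NFS-with-local-caching design (server name and paths are illustrative; FS-Cache via cachefilesd is one way to put the cache on local SSD, not necessarily the exact SATURNV setup):

# Install and start the FS-Cache user-space daemon; point its cache dir at the local SSD
sudo apt-get install -y cachefilesd
sudo sed -i 's|^dir .*|dir /raid/fscache|' /etc/cachefilesd.conf
sudo service cachefilesd start
# Mount the central NFS export read-only with 'fsc' so reads are cached on the local SSD
sudo mount -t nfs -o ro,fsc nfs-server:/datasets /mnt/datasets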
DGX-1 STORAGE CONSIDERATIONS
CANDLE – Accelerate cancer research
Energy / Fusion – Future of low-cost energy
Weather and Climate – Disaster preparedness
Astrophysics – Our future?
Autonomous Cars
AI GRAND CHALLENGES
Summary
DGX-1 crafted for AI and Computational workloads
- High compute density, but also high power and thermal density
- Watch ambient temperature – it can cause large variability
A single system has large demands for data ingest and GPU-to-GPU communication
Multiple DGX-1 systems have large demands on inter-node communication for most workloads
- Need at least two IB rails per system (1 EDR IB port for every 2 GPUs)
DL Storage needs are very high
- But read dominated (vs. the write-heavy patterns of HPC)
Many codes benefit significantly from paying attention to affinity
- Align CPU/memory with GPUs and IB cards
- Avoid cores handling interrupts
NVIDIA pre-built containers significantly reduce user work
- Affinity is already handled
- Provide technologies like NCCL and the latest tuned code and frameworks
DGX-1 DL SCALABILITY SUMMARY
Thanks! More info in the NVIDIA DGX-1 System Architecture whitepaper:
- http://www.nvidia.com/object/dgx-1-system-architecture-whitepaper.html
CANDLE sessions (http://www.gputechconf.com/agenda/schedule)
- S7788 - CANDLE: PREDICTING TUMOR CELL RESPONSE TO DRUG TREATMENTS
- S7782 - THE DOE AND NCI PARTNERSHIP ON PRECISION ONCOLOGY AND THE CANCER MOONSHOT
- S7792 - BUILDING EXASCALE DEEP LEARNING TOOLS TO HELP UNDERSTAND CANCER BIOLOGY AT THE MOLECULAR SCALE
- S7780 - BUILDING EXASCALE DEEP TEXT COMPREHENSION TOOLS FOR EFFECTIVE CANCER SURVEILLANCE
DGX-1 DL SCALABILITY SUMMARY
S7754 - WHAT'S NEXT IN DGX SERVER SOLUTIONS FOR DEEP LEARNING
- Thursday, May 11, 10:00 AM - 10:50 AM – Room 210B