Maggie Zhang (张雪萌) maggiez@nvidia.com
Accelerate Deep Learning Training at Scale on GPUs
AGENDA
- Introduction
- Why do we need to scale training
- How to achieve scaling
DL Training: from single GPU to multi-node
ResNet50 v1.5 training time:
- 2015: 36,000 mins (25 days) on 1x K80 (CUDA)
- 2016: 1,200 mins (20 hours) on DGX-1P (NVLink)
- 2017: 480 mins (8 hours) on DGX-1V (Tensor Core)
- 2018: 70 minutes on MLPerf with DGX-2H (NVSwitch); 6.3 minutes on MLPerf at scale with a DGX cluster
- 2019: 52.7 minutes on MLPerf with DGX-2H (NVSwitch); 1.33 minutes on MLPerf at scale with DGX SuperPOD
The whole stack must be considered
- Compute
- Network
- Storage
- Frameworks & Libraries
- Numerical methods
- Training recipes
MLPerf: NVIDIA advancing AI training
Time to Train From 8 Hours to 80 Seconds
2019 MLPerf ID (in order from top to bottom of chart): ResNet-50: 0.6-30 | Transformer: 0.6-28 | GNMT: 0.6-14 | SSD: 0.6-27 | Mini-Go: 0.6-11 | Mask R-CNN: 0.6-23
Largest TensorFlow model at scale
Oak Ridge National Lab scales TensorFlow climate analytics model up to 27,360 V100 GPUs
Source: https://arxiv.org/pdf/1810.01993.pdf
2018 Gordon Bell Prize Winner
Datasets getting larger
- Unlabeled data:
  ○ Language model: BooksCorpus (800M words), English Wikipedia (2.5B words), WebText (8M documents, 40 GB), C4 (Common Crawl, 745 GB)
  ○ GAN: unlabeled images and videos
  ○ Reinforcement learning: unsupervised self-play generates unlimited data
- Labeled data:
  ○ ImageNet (2012): 1.3M images, 1000 categories
  ○ Open Images (2019): 9M images, 6000 categories
  ○ Semi-autonomous vehicles: 0.5-1.1 TB of data for every 8 h of driving
DL models increasing in complexity
Next-level use-cases require gigantic models
- Task categories: Image Recognition, NLP, NLP (generative tasks)
- Model sizes shown: 26M, 340M, 1.5Bn
- Use cases: Autonomous Vehicles, Social Tagging, Visual Search, Q&A, Sentiment, Translation, Chatbots, E-mail auto-completion, Document Summarization
- Other workloads: Speech Recognition, Translation, Object Detection
Project Megatron (https://github.com/NVIDIA/Megatron-LM)
- 8.3B parameters
- 8-way Model Parallel
- 64-way Data Parallel
- 24x larger than BERT
Scaling == whack-a-mole?
Solve one bottleneck and another one pops up
Multi-node infrastructure requirements
Multi-node success depends on: System Design, Data Center Management, SW Stack
Challenges of multi-node DL training
- Hardware GPU cluster design:
  ○ Compute: significant CPU to GPU ratio, interconnect with GPU
  ○ Storage: high speed NFS, multi-tier caching
  ○ Networking: topology and bandwidth, NVLINK, GPUDirect RDMA
- GPU cluster management:
  ○ Scheduler: Slurm vs. Kubernetes
  ○ Container technologies: Docker, Enroot, Singularity, etc.
- Integrated software stack:
  ○ NVIDIA libraries: CUDA, cuDNN, NCCL
  ○ DL Framework scale-out optimization
  ○ Model scale-out implementation & optimization
A basic recipe for deep learning scaling
- Step 1: Optimize your single GPU model
- Step 2: Scale to multiple GPUs on one node
- Step 3: Scale to multiple nodes
Case study
BERT: Bidirectional Encoder Representations from Transformers
Super Human Question & Answering
- BERT model scripts:
  ○ https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/
  ○ https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT
  ○ Configurations for convergence, from 8 to 1500 GPUs, multi-node ready
- Clone and train your own BERT model on multi-node, or download a pre-trained BERT model from NGC and fine-tune it for your NLP task
- NVIDIA Deep Learning Examples have many model scripts with best practices for accuracy and performance
Why multi-node BERT training
- Pre-training on unlabeled data opens up the opportunity to use massive amounts of data:
  ○ BooksCorpus (800 million words)
  ○ English Wikipedia (2.5 billion words), multi-language Wikipedia
  ○ WebText (OpenAI, 8M documents, 40 GB of text)
- More data tends to lead to better accuracy
- BERT pre-training is computationally intensive and takes days to train even on the most powerful single node: BERT-Large (330M parameters) takes ~2.5 days to train on a single DGX-2 server with 16 V100 GPUs.
BERT multi-node pre-training performance
Metric: Time to train

DGX-1 (16 GB) nodes    GPUs    Time to train (hrs)
1                      8       153.6 (6.3 days)
4                      32      39.3
16                     128     10.4

DGX-2H (32 GB) nodes   GPUs    Time to train (hrs)
1                      16      58.4 (2.4 days)
4                      64      15.4
16                     256     3.9
64                     1024    1.2

Source: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#pre-training-loss-results
* Time to train is measured with mixed precision, to a training loss of 1.3, in PyTorch with the LAMB optimizer
** Gradient accumulation is applied for the DGX-2H 1-, 4-, and 16-node runs
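One way to read the time-to-train table is as speedup and scaling efficiency relative to the single-node run. The sketch below is only an illustration of that arithmetic: the hours are copied from the table, while the function name and the definition of efficiency (speedup divided by the increase in GPU count) are choices made here, not something the slides define.

```python
# (nodes, GPUs, hours) copied from the time-to-train table.
DGX1 = [(1, 8, 153.6), (4, 32, 39.3), (16, 128, 10.4)]
DGX2H = [(1, 16, 58.4), (4, 64, 15.4), (16, 256, 3.9), (64, 1024, 1.2)]


def scaling_report(rows):
    """Return (gpus, speedup, efficiency) relative to the first (single-node) row."""
    _, base_gpus, base_hours = rows[0]
    report = []
    for _, gpus, hours in rows:
        speedup = base_hours / hours   # how much faster than one node
        ideal = gpus / base_gpus       # perfect linear scaling
        report.append((gpus, speedup, speedup / ideal))
    return report


for name, rows in (("DGX-1", DGX1), ("DGX-2H", DGX2H)):
    for gpus, speedup, eff in scaling_report(rows):
        print(f"{name}: {gpus:5d} GPUs  speedup {speedup:5.1f}x  efficiency {eff:4.0%}")
```

By this measure the DGX-1 runs keep about 92% efficiency at 128 GPUs and the DGX-2H runs about 76% at 1024 GPUs. The DGX-2H rows are not a strict like-for-like comparison, since the footnote says gradient accumulation is used on the 1-, 4-, and 16-node runs only.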
- Create efficient data pipeline
- Enable mixed precision training
- Enable XLA
- Ensure latest GPU libraries
- Develop model in container to facilitate scaling out
Step 1: Optimize model
Step 1: Optimize model
- Use tf.data to create performant input pipelines
- Test I/O bottlenecks with a trivial model
- NVIDIA DALI accelerates image-based input pipelines
Data pipeline
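A simple way to check for an I/O bottleneck is to time the input pipeline on its own. The sketch below goes one step further than the "trivial model" suggested above and uses no model at all, which is a simplification. `measure_throughput` and `fake_batches` are hypothetical helpers that accept any Python iterable of batches; in TF 1.x graph mode you would instead time `session.run` on the iterator's `get_next` op.

```python
import time
from itertools import islice


def measure_throughput(batches, num_batches=100, warmup=5):
    """Iterate over `batches` with no model work and return batches per second."""
    it = iter(batches)
    # Skip a few batches so one-time setup cost is not timed.
    for _ in islice(it, warmup):
        pass
    start = time.perf_counter()
    consumed = sum(1 for _ in islice(it, num_batches))
    elapsed = time.perf_counter() - start
    return consumed / elapsed if elapsed > 0 else float("inf")


def fake_batches(n, batch_size=32):
    """Stand-in for a real input pipeline (hypothetical data)."""
    for i in range(n):
        yield [i] * batch_size


rate = measure_throughput(fake_batches(1000), num_batches=200)
print(f"{rate:.0f} batches/s")
```

If the rate measured this way is not comfortably above what the GPU consumes during training, the data pipeline is the first thing to optimize.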
Data pipeline (BERT)
https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py
- TFRecord: fast binary format
- Parallel read, map, & batch
- Fused map & batch op

d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
d = d.repeat()
d = d.shuffle(buffer_size=len(input_files))

# `cycle_length` is the number of parallel files that get read.
cycle_length = min(num_cpu_threads, len(input_files))

d = d.apply(
    tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset,
        cycle_length=cycle_length))
d = d.shuffle(buffer_size=100)

d = d.apply(
    tf.contrib.data.map_and_batch(
        lambda record: _decode_record(record, name_to_features),
        batch_size=batch_size,
        num_parallel_batches=num_cpu_threads,
        drop_remainder=True if is_training else False))
Step 1: Optimize model
- 1-line optimizer wrapper:
  opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
- Up to 3x speedup in training on Tensor Cores, with:
- Same accuracy
- No change in hyperparameters
- ½ memory bandwidth & footprint
- Optimal on Volta and Turing GPUs
Automatic Mixed Precision (AMP)
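The "½ memory" point and the reason mixed precision needs loss scaling can both be seen with Python's standard `struct` module, which supports IEEE half precision via the `"e"` format. This is a toy illustration only: it is not how the TensorFlow graph rewrite is implemented, and `to_fp16`, the gradient value, and the loss scale are all made up for the example.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


# Half precision stores each value in 2 bytes instead of 4.
fp32_bytes = struct.calcsize("f")
fp16_bytes = struct.calcsize("e")

# A very small gradient underflows to zero when cast to FP16 ...
tiny_grad = 1e-8
underflowed = to_fp16(tiny_grad)

# ... but survives if it is scaled up before the cast and scaled back
# down afterwards in higher precision.
loss_scale = 1024.0  # illustrative value, not taken from the slides
recovered = to_fp16(tiny_grad * loss_scale) / loss_scale

print(fp32_bytes, fp16_bytes, underflowed, recovered)
```

Automatic mixed precision takes care of both the casts and the loss scaling, which is why the one-line optimizer wrapper is enough.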
Step 1: Optimize model
Automatic Mixed Precision (AMP)
- Robust speedup across different TensorFlow workloads
- https://arxiv.org/abs/1710.03740
Step 1: Optimize model
XLA (Accelerated Linear Algebra)
- TensorFlow XLA can accelerate models with minimal code changes
- XLA optimizes the graph, mostly by fusing compatible kernels
- Set the XLA optimization level:
  config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
  https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py#L531
System config: Xeon E5-2698 v4 CPU with 256 GB system RAM, single V100 Tensor Core GPU (32 GB). Tests run using the NVIDIA 18.11 TensorFlow container.
Step 1: Optimize model
- Latest compatible features and tuning from the CUDA toolkit and Deep Learning Libraries (cuDNN, cuBLAS, NCCL)
Latest GPU optimizations
Step 1: Optimize model
- NGC containers: fully featured DL containers
- DL frameworks compiled with latest GPU libraries
- Portability of application libraries facilitates multi-node scale-out
Latest GPU optimizations
- Understand Data Parallel training concepts
- Ensure optimal inter-GPU communication
- Apply high level API for multi-GPU training
Step 2: Scale to multiple GPUs
Step 2: Scale to multiple GPUs
- Single GPU
Under the hood
Step 2: Scale to multiple GPUs
- Multiple GPU
- Data parallel training
Under the hood
- Allreduce algorithm
- NCCL: NVIDIA Collective Communications Library
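In data parallel training every GPU computes gradients on its own slice of the batch, and an allreduce leaves each GPU holding the sum. The slide does not say which allreduce algorithm is used; ring allreduce is one widely used variant, and the pure-Python simulation below sketches it under that assumption. It is not NCCL's implementation, and `ring_allreduce` is a name chosen for this example.

```python
def ring_allreduce(worker_grads):
    """Sum equal-length gradient lists across workers using a simulated ring."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Split each worker's gradient into n contiguous chunks (index ranges).
    bounds = [(i * length // n, (i + 1) * length // n) for i in range(n)]
    data = [list(g) for g in worker_grads]

    def exchange(chunk_for_rank, combine):
        # Snapshot outgoing chunks first so all sends in a step happen "at once".
        outgoing = []
        for rank in range(n):
            lo, hi = bounds[chunk_for_rank(rank)]
            outgoing.append((lo, hi, data[rank][lo:hi]))
        # Each rank sends to its right-hand neighbour on the ring.
        for rank, (lo, hi, payload) in enumerate(outgoing):
            dst = (rank + 1) % n
            for k, value in zip(range(lo, hi), payload):
                data[dst][k] = combine(data[dst][k], value)

    # Phase 1, reduce-scatter: after n-1 steps each rank owns one fully summed chunk.
    for step in range(n - 1):
        exchange(lambda r, s=step: (r - s) % n, lambda old, new: old + new)
    # Phase 2, allgather: pass the summed chunks around until every rank has all of them.
    for step in range(n - 1):
        exchange(lambda r, s=step: (r + 1 - s) % n, lambda old, new: new)
    return data


grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
result = ring_allreduce(grads)
print(result[0])
```

Each worker sends and receives only one chunk per step, so the traffic per link stays roughly constant as workers are added, which is the property that makes this pattern attractive for multi-GPU training.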
Step 2: Scale to multiple GPUs
Under the hood
- Inter-GPU communication (figure: effective bandwidth in GB/s)
- Full non-blocking bandwidth
Step 2: Scale to multiple GPUs
Under the hood
Step 2: Scale to multiple GPUs
- Popular approach to enable multi-GPU/multi-node in TensorFlow/Keras
- Strong NCCL integration
- Sample commands:
- Single-node (4 GPUs):
horovodrun -np 4 -H localhost:4 python train.py
- Multi-node (4 nodes with 4 GPUs each):
horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
Approach 1: Horovod
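The two sample commands follow one pattern: `-np` is the total number of processes and `-H` lists each host with its GPU count. A small helper can build the command from a host map so the two values never drift apart. `horovodrun_command` is a hypothetical helper written for this example; only the `-np` and `-H` flags shown in the sample commands are used.

```python
def horovodrun_command(hosts, script="train.py"):
    """Build a horovodrun argument list from a {hostname: gpus_per_host} mapping."""
    total = sum(hosts.values())  # one process per GPU
    host_arg = ",".join(f"{name}:{slots}" for name, slots in hosts.items())
    return ["horovodrun", "-np", str(total), "-H", host_arg, "python", script]


single_node = horovodrun_command({"localhost": 4})
multi_node = horovodrun_command({f"server{i}": 4 for i in range(1, 5)})

print(" ".join(single_node))
print(" ".join(multi_node))
```

The resulting list can be passed to `subprocess.run` directly, which avoids shell quoting issues.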
Step 2: Scale to multiple GPUs
Approach 1: Horovod

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
# Scale the learning rate by the number of workers
opt = tf.train.AdamOptimizer(learning_rate=0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to synchronize initial state
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

# Session
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
- Recently released native API that also supports Allreduce with NCCL
- Multi-GPU:
tf.distribute.MirroredStrategy
- Multi-node:
tf.distribute.experimental.MultiWorkerMirroredStrategy
Step 2: Scale to multiple GPUs
Approach 2: tf.distribute.Strategy
Source: https://www.tensorflow.org/guide/distributed_training
- Adopt optimizer designed for large batch size
- Ensure effective inter-node communication
- Move data close to compute
- Consider full application & system software stack
Step 3: Scale to multiple nodes
- Optimizer inspired by LARS
- Layerwise Adaptive learning rate (You et al.)
- Allows training at huge global batch size
- Originally, BERT+Adam (Devlin et al.) – global batch 256
- BERT+LAMB (You et al.) – global batch 64k
- Massive data parallelism
- Lower interconnect pressure with gradient accumulation
Step 3: Scale to multiple nodes
LAMB optimizer
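Gradient accumulation lets a fixed global batch be reached with fewer GPUs: each GPU processes several micro-batches locally and the gradients are exchanged once per optimizer update rather than once per micro-batch. The sketch below shows the arithmetic only. Reading "64k" as 65,536 and the per-GPU micro-batch of 64 are assumptions made for the example, and `accumulation_steps` is a hypothetical helper.

```python
def accumulation_steps(global_batch, num_gpus, per_gpu_batch):
    """Micro-batches each GPU accumulates before one allreduce and weight update."""
    per_step = num_gpus * per_gpu_batch
    if global_batch % per_step:
        raise ValueError("global batch must be a multiple of num_gpus * per_gpu_batch")
    return global_batch // per_step


GLOBAL_BATCH = 65536  # "64k", read here as 64 * 1024 (assumption)
PER_GPU_BATCH = 64    # hypothetical micro-batch size

for gpus in (16, 64, 256, 1024):
    steps = accumulation_steps(GLOBAL_BATCH, gpus, PER_GPU_BATCH)
    print(f"{gpus:5d} GPUs -> {steps:3d} accumulation steps per update")
```

With these numbers, 16 GPUs communicate once every 64 micro-batches, while 1024 GPUs reach the same global batch with no accumulation at all.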
Step 3: Scale to multiple nodes
LAMB optimizer
BERT+LAMB: robustly scale to large batch size
https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/optimization.py

class LAMBOptimizer(tf.train.Optimizer):
  """A LAMB optimizer that includes "correct" L2 weight decay."""

  def __init__(self,
               learning_rate,
               weight_decay_rate=0.0,
               beta_1=0.9,
               beta_2=0.999,
               epsilon=1e-6,
               exclude_from_weight_decay=None,
               name="LAMBOptimizer"):
    """Constructs a LAMBOptimizer."""
    super(LAMBOptimizer, self).__init__(False, name)
    . . .
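The core idea behind the layer-wise adaptive rate can be shown for a single layer in plain Python. This sketch follows the published LAMB algorithm as commonly described (Adam-style moments, decoupled weight decay, then a per-layer trust ratio); it is not the code in the linked optimization.py, which may differ in details such as bias correction. The default betas and epsilon match the constructor above, while `lamb_step`, the learning rate, and the weight decay value are assumptions for illustration.

```python
import math


def lamb_step(w, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB update for a single layer; w, g, m, v are equal-length lists."""
    # Adam-style first and second moment estimates with bias correction.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** step) for mi in m]
    v_hat = [vi / (1 - beta2 ** step) for vi in v]
    # Per-element update direction plus decoupled weight decay.
    update = [mh / (math.sqrt(vh) + eps) + weight_decay * wi
              for mh, vh, wi in zip(m_hat, v_hat, w)]
    # Layer-wise trust ratio: scale the step by ||w|| / ||update||.
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    u_norm = math.sqrt(sum(ui * ui for ui in update))
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    w = [wi - lr * trust * ui for wi, ui in zip(w, update)]
    return w, m, v, trust


w0 = [3.0, 4.0]
w1, m1, v1, trust = lamb_step(w0, [0.1, -0.2], [0.0, 0.0], [0.0, 0.0], step=1)
step_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(w1, w0)))
print(trust, step_norm)
```

Because of the trust ratio, the size of the step taken by each layer is proportional to the norm of that layer's weights, which is what keeps training stable when the global batch becomes very large.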
Step 3: Scale to multiple nodes
Under the hood
- Inter-GPU communication, bigger picture (figure: effective bandwidth in GB/s)
- Tensor Fusion
- Batch tensors together during allreduce
- HOROVOD_FUSION_THRESHOLD=<bytes> HOROVOD_CYCLE_TIME=<ms> horovodrun ...
- Gradient Compression (FP16 Allreduce):
- hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)
- Reduces network utilization
Step 3: Scale to multiple nodes
Further Horovod optimizations
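Tensor Fusion reduces the number of allreduce calls by packing several small gradient tensors into one buffer, up to a size threshold. The greedy grouping below is a simplified sketch of that idea, not Horovod's actual scheduling (which also depends on the cycle time and on which tensors are ready). `fuse_tensors` and the tensor names and sizes are hypothetical.

```python
def fuse_tensors(tensor_sizes, threshold_bytes):
    """Greedily group (name, nbytes) pairs into buffers of at most threshold_bytes.

    A tensor larger than the threshold is sent on its own.
    """
    buffers, current, current_bytes = [], [], 0
    for name, nbytes in tensor_sizes:
        # Start a new buffer when adding this tensor would exceed the threshold.
        if current and current_bytes + nbytes > threshold_bytes:
            buffers.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buffers.append(current)
    return buffers


# Hypothetical gradient tensors: (name, size in bytes).
grads = [("embed", 48_000_000), ("layer0.w", 4_000_000), ("layer0.b", 4_096),
         ("layer1.w", 4_000_000), ("layer1.b", 4_096), ("head", 2_000_000)]

buffers = fuse_tensors(grads, threshold_bytes=8 * 1024 * 1024)
print(len(grads), "tensors ->", len(buffers), "allreduce calls")
```

A larger threshold means fewer, bigger allreduce calls; a smaller one starts communication sooner. That trade-off is what `HOROVOD_FUSION_THRESHOLD` exposes for tuning.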
- DNN datasets are large
- Read-dominated at beginning of each epoch
- Keep data close to compute as much as possible:
- RAM disk, SSDs in RAID 0, Fast network attached storage
Step 3: Scale to multiple nodes
Storage
- Integrated software and hardware system for multi-node scaling
- State-of-the-art compute, GPU interconnect, node interconnect, and storage
Step 3: Scale to multiple nodes
Reference architecture: DGX SuperPOD
NVIDIA DGX SuperPOD
Terabit-Speed InfiniBand Networking per Node
- Mellanox EDR 100G InfiniBand Network
- Mellanox Smart Director Switches
- In-Network Computing Acceleration Engines
- Fast and Efficient Storage Access with RDMA
- Up to 130Tb/s Switching Capacity per Switch
- Ultra-Low Latency of 300ns
- Integrated Network Manager
Topology (from diagram):
- 64 DGX-2 across Rack 1 through Rack 16
- Compute Backplane Switch: 800 Gb/s per node
- Storage Backplane Switch (GPFS): 200 Gb/s per node
White paper: https://www.nvidia.com/en-us/data-center/resources/nvidia-dgx-superpod-reference-architecture/
- Deep Learning Model:
  ○ Hyperparameters tuned for multi-node scaling
  ○ Multi-node launcher scripts
- Deep Learning Container:
  ○ Optimized DL frameworks, GPU libraries, and multi-node software
- Host:
  ○ Host OS, GPU driver, IB driver, container runtime engine (Docker, Enroot)
Step 3: Scale to multiple nodes
Software stack - Application
Step 3: Scale to multiple nodes
Software stack - System
- Slurm: User job scheduling & management
- Enroot: NVIDIA open-source tool to convert traditional container/OS images into unprivileged sandboxes
- Pyxis: NVIDIA open-source plugin integrating Enroot with Slurm
- DeepOps: NVIDIA open-source toolbox for GPU cluster management with Ansible playbooks
Stack components (from diagram): Login nodes; DGX Pod: DGX servers with DGX base OS; Slurm controller; Enroot | Docker; Pyxis; NGC model containers (PyTorch, TensorFlow from 19.09); DCGM
Step 3: Scale to multiple nodes
Deployment with DeepOps
DeepOps leverages Ansible for automated large-scale cluster deployment.
Deployment steps (from diagram):
- Bootstrap all nodes
- Prepare provisioning node
- Provision all node(s)
- Deploy Slurm on Slurm nodes
- Deploy DL/ML development tools
- Deploy production AI applications
- Deploy management services
Getting started:
- Build your own GPU cluster following the DGX Pod and DGX SuperPOD reference architectures.
- Clone the DeepOps repo and follow the cluster setup guide. Open a GitHub issue if you run into any problems.
- Scaling requires careful consideration of algorithms and infrastructure at each step
- Optimized single-GPU model
- Efficient & scalable Allreduce library
- GPU interconnect, networking, storage
...
- NVIDIA platform makes scaling DL training easier and more efficient
- Deep Learning Examples with SOTA accuracy and performance
- NVIDIA NGC Container with optimized multi-GPU/multi-node software stack
- Accelerated compute platform designed for performance and scaling