
SLIDE 1

Accelerate Deep Learning Training at Scale on GPUs

Maggie Zhang (张雪萌), maggiez@nvidia.com

SLIDE 2

AGENDA

  • Introduction
  • Why do we need to scale training
  • How to achieve scaling
SLIDE 3

DL Training: from single GPU to multi-node

ResNet-50 v1.5 training time by year:

  • 2015: 36,000 mins (25 days), 1x K80, CUDA
  • 2016: 1,200 mins (20 hours), DGX-1P, NVLink
  • 2017: 480 mins (8 hours), DGX-1V, Tensor Cores
  • 2018: 70 mins on MLPerf, DGX-2H, NVSwitch
  • 2018: 6.3 mins on MLPerf at scale, DGX cluster
  • 2019: 52.7 mins on MLPerf, DGX-2H, NVSwitch
  • 2019: 1.33 mins on MLPerf at scale, DGX SuperPOD

SLIDE 4

The whole stack must be considered

  • Compute
  • Network
  • Storage
  • Frameworks & Libraries
  • Numerical methods
  • Training recipes
SLIDE 5

MLPerf: NVIDIA advancing AI training

Time to train: from 8 hours to 80 seconds

2019 MLPerf IDs (in order from top to bottom of chart): ResNet-50: 0.6-30 | Transformer: 0.6-28 | GNMT: 0.6-14 | SSD: 0.6-27 | MiniGo: 0.6-11 | Mask R-CNN: 0.6-23

SLIDE 6

Largest TensorFlow model at scale

Oak Ridge National Lab scaled a TensorFlow climate analytics model to 27,360 V100 GPUs

Source: https://arxiv.org/pdf/1810.01993.pdf

2018 Gordon Bell Prize Winner

SLIDE 7

AGENDA

  • Introduction
  • Why do we need to scale training
  • How to achieve scaling
SLIDE 8

Datasets getting larger

  • Unlabeled data:
    ○ Language models: BooksCorpus (800M words), English Wikipedia (2.5B words), WebText (8M documents, 40 GB), C4 (Common Crawl, 745 GB)
    ○ GANs: unlabeled images and videos
    ○ Reinforcement learning: unsupervised self-play generates unlimited data
  • Labeled data:
    ○ ImageNet (2012): 1.3M images, 1,000 categories
    ○ Open Images (2019): 9M images, 6,000 categories
  • Semi-autonomous vehicles: 0.5-1.1 TB of data for every 8 hours of driving

SLIDE 9

DL models increasing in complexity

Use cases: image recognition, speech recognition, object detection, translation, NLP (Q&A, sentiment), NLP generative tasks (chatbots, e-mail auto-completion, document summarization), autonomous vehicles, social tagging, visual search

Model sizes range from 26M to 340M to 1.5B parameters: next-level use cases require gigantic models.

Project Megatron (https://github.com/NVIDIA/Megatron-LM):

  • 8.3B parameters
  • 8-way model parallel, 64-way data parallel
  • 24x larger than BERT

SLIDE 10

AGENDA

  • Introduction
  • Why do we need to scale training
  • How to achieve scaling
SLIDE 11

Scaling == whack-a-mole?

Solve one bottleneck and another one pops up.

SLIDE 12

Multi-node infrastructure requirements

System design + data center management + SW stack = multi-node success

SLIDE 13

Challenges of multi-node DL training

  • Hardware GPU cluster design:
    ○ Compute: significant CPU-to-GPU ratio, interconnect with GPU
    ○ Storage: high-speed NFS, multi-tier caching
    ○ Networking: topology and bandwidth, NVLink, GPUDirect RDMA
  • GPU cluster management:
    ○ Scheduler: Slurm vs. Kubernetes
    ○ Container technologies: Docker, Enroot, Singularity, etc.
  • Integrated software stack:
    ○ NVIDIA libraries: CUDA, cuDNN, NCCL
    ○ DL framework scale-out optimization
    ○ Model scale-out implementation & optimization

SLIDE 14

A basic recipe for deep learning scaling

Step 1: Optimize your single-GPU model
Step 2: Scale to multiple GPUs on one node
Step 3: Scale to multiple nodes

SLIDE 15

Case study: BERT (Bidirectional Encoder Representations from Transformers)

  • Super-human question answering
  • BERT model scripts:
    https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/
    https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT
  • Configurations for convergence, from 8 to 1500 GPUs, multi-node ready
  • Clone and train your own BERT model on multi-node, or download a pre-trained BERT model from NGC and fine-tune it for your NLP task

NVIDIA Deep Learning Examples include many model scripts with best practices for accuracy and performance.

SLIDE 16

Why multi-node BERT training

  • Pre-training on unlabeled data opens up the use of massive datasets:
    ○ BooksCorpus (800 million words)
    ○ English Wikipedia (2.5 billion words), multi-language Wikipedia
    ○ WebText (OpenAI, 8M documents, 40 GB of text)
  • More data tends to lead to better accuracy
  • BERT pre-training is computationally intensive and takes days even on the most powerful single node: BERT-Large (330M parameters) takes ~2.5 days to train on a single DGX-2 server with 16 V100 GPUs

SLIDE 17

BERT multi-node pre-training performance

Metric: time to train

DGX-1 (16 GB):

  Nodes   GPUs   Time to train (hrs)
  1       8      153.6 (6.4 days)
  4       32     39.3
  16      128    10.4

DGX-2H (32 GB):

  Nodes   GPUs   Time to train (hrs)
  1       16     58.4 (2.4 days)
  4       64     15.4
  16      256    3.9
  64      1024   1.2

Source: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#pre-training-loss-results
* Time to train is measured for mixed precision, to training loss 1.3, in PyTorch with the LAMB optimizer.
** Gradient accumulation is applied to the DGX-2H 1-, 4-, and 16-node configurations.

SLIDE 18

Step 1: Optimize model

  • Create an efficient data pipeline
  • Enable mixed precision training
  • Enable XLA
  • Ensure the latest GPU libraries
  • Develop the model in a container to facilitate scaling out

SLIDE 19

Step 1: Optimize model

Data pipeline

  • Use tf.data to create performant input pipelines
  • Test for I/O bottlenecks with a trivial model (see the sketch below)
  • NVIDIA DALI accelerates image-based input pipelines
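One quick way to test for an I/O bottleneck is to iterate over the dataset with no model attached and measure throughput; if this is not much faster than your training step, the input pipeline is the limiter. A minimal TF 1.x sketch (the shard name and batch size are placeholders, not from the BERT scripts):

import time
import tensorflow as tf

# Placeholder TFRecord shard; substitute your own training files.
dataset = tf.data.TFRecordDataset(["train-00000.tfrecord"])
dataset = dataset.batch(256).prefetch(1)

next_batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    start, num_batches = time.time(), 0
    try:
        while True:
            sess.run(next_batch)  # no model: measures pure input throughput
            num_batches += 1
    except tf.errors.OutOfRangeError:
        pass
    elapsed = time.time() - start
    print("%d batches in %.1f s (%.1f batches/s)"
          % (num_batches, elapsed, num_batches / elapsed))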

SLIDE 20

Data pipeline: BERT

  • TFRecord: fast binary format
  • Parallel read, map, & batch
  • Fused map-and-batch op

d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
d = d.repeat()
d = d.shuffle(buffer_size=len(input_files))

# `cycle_length` is the number of parallel files that get read.
cycle_length = min(num_cpu_threads, len(input_files))
d = d.apply(
    tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset,
        cycle_length=cycle_length))
d = d.shuffle(buffer_size=100)
d = d.apply(
    tf.contrib.data.map_and_batch(
        lambda record: _decode_record(record, name_to_features),
        batch_size=batch_size,
        num_parallel_batches=num_cpu_threads,
        drop_remainder=True if is_training else False))

Source: https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py

SLIDE 21

Step 1: Optimize model

Automatic Mixed Precision (AMP)

  • 1-line optimizer wrapper:
    opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
  • Up to 3x training speedup on Tensor Cores, with:
    ○ Same accuracy
    ○ No change in hyperparameters
    ○ Half the memory bandwidth & footprint
  • Optimal on Volta and Turing GPUs
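In context, the rewrite wraps an existing optimizer before minimize() is called. A minimal sketch with a toy loss (the tensors below are illustrative, not from the BERT scripts):

import tensorflow as tf

x = tf.random.normal([32, 1024])
w = tf.Variable(tf.random.normal([1024, 10]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

opt = tf.train.AdamOptimizer(learning_rate=1e-4)
# Graph rewrite: casts eligible ops to FP16 and adds automatic loss scaling.
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
train_op = opt.minimize(loss)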

SLIDE 22

Step 1: Optimize model

Automatic Mixed Precision (AMP)

  • Robust speedup across different TensorFlow workloads
  • https://arxiv.org/abs/1710.03740

SLIDE 23

Step 1: Optimize model

XLA (Accelerated Linear Algebra)

  • TensorFlow XLA can accelerate models with minimal code changes
  • XLA optimizes the graph, mostly by fusing compatible kernels
  • Set the XLA optimization level:

config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py#L531

System config: Xeon E5-2698 v4 CPU with 256 GB system RAM, single V100 Tensor Core GPU 32 GB. Tests run using the NVIDIA 18.11 TensorFlow container.
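A minimal sketch of enabling XLA auto-clustering through the session config (TF 1.x; the toy computation is illustrative):

import tensorflow as tf

config = tf.ConfigProto()
# Auto-clustering: XLA fuses compatible ops into larger kernels.
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

a = tf.random.normal([1024, 1024])
b = tf.nn.relu(tf.matmul(a, a))  # matmul + ReLU are fusible

with tf.Session(config=config) as sess:
    sess.run(b)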

SLIDE 24

Step 1: Optimize model

Latest GPU optimizations

  • Latest compatible features and tuning from the CUDA toolkit and deep learning libraries (cuDNN, cuBLAS, NCCL)

SLIDE 25

Step 1: Optimize model

Latest GPU optimizations

  • NGC containers: fully featured DL containers
  • DL frameworks compiled with the latest GPU libraries
  • Portability of application libraries facilitates multi-node scale-out

SLIDE 26

SLIDE 27

Step 2: Scale to multiple GPUs

  • Understand data-parallel training concepts
  • Ensure optimal inter-GPU communication
  • Apply a high-level API for multi-GPU training

SLIDE 28

Step 2: Scale to multiple GPUs

Under the hood

  • Single GPU

SLIDE 29

Step 2: Scale to multiple GPUs

Under the hood

  • Multiple GPUs: data-parallel training
  • Allreduce algorithm (sketched below)
  • NCCL: NVIDIA Collective Communication Library
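Conceptually, data-parallel training computes gradients on each GPU's shard of the batch, then allreduces them so every replica applies the identical update. A framework-agnostic sketch of the semantics (NCCL implements this with optimized ring/tree algorithms, not this naive loop):

import numpy as np

num_workers = 4
# Each worker computes a gradient on its own shard of the global batch.
local_grads = [np.random.randn(10) for _ in range(num_workers)]

# Allreduce (average): every worker ends up with the same result.
avg_grad = sum(local_grads) / num_workers
synced_grads = [avg_grad.copy() for _ in range(num_workers)]

assert all(np.allclose(g, avg_grad) for g in synced_grads)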

SLIDE 30

Step 2: Scale to multiple GPUs

Under the hood

  • Inter-GPU communication

[Chart: effective bandwidth in GB/s]

SLIDE 31

Step 2: Scale to multiple GPUs

Under the hood

  • Full non-blocking bandwidth

SLIDE 32

Step 2: Scale to multiple GPUs

Approach 1: Horovod

  • Popular approach to enable multi-GPU/multi-node in TensorFlow/Keras
  • Strong NCCL integration
  • Sample commands:
    ○ Single node (4 GPUs):
      horovodrun -np 4 -H localhost:4 python train.py
    ○ Multi-node (4 nodes with 4 GPUs each):
      horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py

SLIDE 33

Step 2: Scale to multiple GPUs

Approach 1: Horovod

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...

# Scale the learning rate by the number of workers
opt = tf.train.AdamOptimizer(learning_rate=0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to synchronize initial state
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

# Session
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir,
                                       config=config, hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)

SLIDE 34

Step 2: Scale to multiple GPUs

Approach 2: tf.distribute.Strategy

  • Recently released native API that also supports allreduce with NCCL
  • Multi-GPU: tf.distribute.MirroredStrategy
  • Multi-node: tf.distribute.experimental.MultiWorkerMirroredStrategy

Source: https://www.tensorflow.org/guide/distributed_training
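A minimal Keras sketch with MirroredStrategy (the synthetic data is a placeholder): variables created under strategy.scope() are replicated across local GPUs, and gradients are allreduced (via NCCL on GPUs) each step:

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Synthetic stand-in data; the global batch is split across replicas.
x = np.random.randn(1024, 64).astype("float32")
y = np.random.randint(0, 10, size=(1024,)).astype("int64")
model.fit(x, y, batch_size=256, epochs=1)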

SLIDE 35

Step 3: Scale to multiple nodes

  • Adopt an optimizer designed for large batch sizes
  • Ensure effective inter-node communication
  • Move data close to compute
  • Consider the full application & system software stack

SLIDE 36

Step 3: Scale to multiple nodes

LAMB optimizer

  • Optimizer inspired by LARS
  • Layerwise adaptive learning rate (You et al.)
  • Allows training at huge global batch sizes:
    ○ Originally, BERT + Adam (Devlin et al.): global batch 256
    ○ BERT + LAMB (You et al.): global batch 64k
  • Massive data parallelism
  • Lower interconnect pressure with gradient accumulation (sketched below)
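Gradient accumulation runs several micro-batches per GPU and applies (and, in data-parallel training, allreduces) only the summed gradient, raising the effective global batch size while reducing communication frequency. A framework-agnostic sketch with a hypothetical compute_grad placeholder:

import numpy as np

accum_steps = 4                # micro-batches per optimizer step
params = np.zeros(10)
lr = 0.01

def compute_grad(params, micro_batch):
    # Hypothetical stand-in for a real forward/backward pass.
    return np.random.randn(*params.shape)

accum = np.zeros_like(params)
for step in range(8):          # 8 micro-batches -> 2 optimizer steps
    accum += compute_grad(params, micro_batch=step)
    if (step + 1) % accum_steps == 0:
        # Only this averaged gradient needs to be allreduced,
        # once per accum_steps micro-batches.
        params -= lr * accum / accum_steps
        accum[:] = 0.0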

SLIDE 37

Step 3: Scale to multiple nodes

LAMB optimizer

BERT + LAMB robustly scales to large batch sizes.

class LAMBOptimizer(tf.train.Optimizer):
  """A LAMB optimizer that includes "correct" L2 weight decay."""

  def __init__(self,
               learning_rate,
               weight_decay_rate=0.0,
               beta_1=0.9,
               beta_2=0.999,
               epsilon=1e-6,
               exclude_from_weight_decay=None,
               name="LAMBOptimizer"):
    """Constructs a LAMBOptimizer."""
    super(LAMBOptimizer, self).__init__(False, name)
    . . .

Source: https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/optimization.py
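For reference, the per-layer update this class implements follows You et al.; roughly (a sketch omitting bias correction and the exact trust-ratio clipping; see the paper):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
u_t = m_t / (\sqrt{v_t} + \epsilon) + \lambda w_t
w_{t+1} = w_t - \eta \, (\|w_t\| / \|u_t\|) \, u_t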

SLIDE 38

Step 3: Scale to multiple nodes

Under the hood

  • Inter-GPU communication (bigger picture)

[Chart: effective bandwidth in GB/s]

SLIDE 39

Step 3: Scale to multiple nodes

Further Horovod optimizations

  • Tensor fusion: batch small tensors together during allreduce
    HOROVOD_FUSION_THRESHOLD=<bytes> HOROVOD_CYCLE_TIME=<ms> horovodrun ...
  • Gradient compression (FP16 allreduce): reduces network utilization
    hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)
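For example, an illustrative launch combining both knobs (the 32 MB threshold and 3.5 ms cycle time are arbitrary example values, not recommendations):

HOROVOD_FUSION_THRESHOLD=33554432 HOROVOD_CYCLE_TIME=3.5 \
horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py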

SLIDE 40

Step 3: Scale to multiple nodes

Storage

  • DNN datasets are large
  • Access is read-dominated at the beginning of each epoch
  • Keep data as close to compute as possible: RAM disk, SSDs in RAID 0, fast network-attached storage

SLIDE 41

Step 3: Scale to multiple nodes

Reference architecture: DGX SuperPOD

  • Integrated software and hardware system for multi-node scaling
  • State-of-the-art compute, GPU interconnect, node interconnect, and storage

SLIDE 42

NVIDIA DGX SuperPOD

  • 64 DGX-2 nodes (Rack 1 through Rack 16), with separate compute and storage backplane switches
  • Terabit-speed InfiniBand networking per node: Mellanox EDR 100G InfiniBand network with Mellanox Smart Director switches
  • In-network computing acceleration engines
  • Fast and efficient storage access with RDMA; GPFS storage
  • Up to 130 Tb/s switching capacity per switch; ultra-low latency of 300 ns
  • Integrated network manager
  • Per-node bandwidth: 800 Gb/s compute, 200 Gb/s storage

White paper: https://www.nvidia.com/en-us/data-center/resources/nvidia-dgx-superpod-reference-architecture/

SLIDE 43

Step 3: Scale to multiple nodes

Software stack - Application

  • Deep learning model:
    ○ Hyperparameters tuned for multi-node scaling
    ○ Multi-node launcher scripts
  • Deep learning container:
    ○ Optimized DL frameworks, GPU libraries, and multi-node software
  • Host:
    ○ Host OS, GPU driver, IB driver, container runtime engine (Docker, Enroot)

SLIDE 44

Step 3: Scale to multiple nodes

Software stack - System

  • Slurm: user job scheduling & management
  • Enroot: NVIDIA open-source tool to convert traditional container/OS images into unprivileged sandboxes
  • Pyxis: NVIDIA open-source plugin integrating Enroot with Slurm (see the sample launch below)
  • DeepOps: NVIDIA open-source toolbox for GPU cluster management with Ansible playbooks

Cluster stack: login nodes; DGX POD (DGX servers with DGX Base OS); Slurm controller; Enroot | Docker; Pyxis; NGC model containers (PyTorch, TensorFlow from 19.09); DCGM
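With Pyxis installed, srun gains container flags backed by Enroot. A hypothetical two-node, 16-GPU launch of an NGC container (image tag, mount, and script are placeholders):

srun -N 2 --ntasks-per-node=8 \
     --container-image=nvcr.io#nvidia/tensorflow:19.09-py3 \
     --container-mounts=/data:/data \
     python run_pretraining.py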

SLIDE 45

Step 3: Scale to multiple nodes

Deployment with DeepOps

DeepOps leverages Ansible for automated large-scale cluster deployment (see the deployment doc):

  • Prepare provisioning node
  • Bootstrap all nodes
  • Provision all node(s)
  • Deploy management services
  • Deploy Slurm on Slurm nodes
  • Deploy DL/ML development tools
  • Deploy production AI applications

  • Build your own GPU cluster following the DGX POD and DGX SuperPOD reference architectures
  • Clone the DeepOps repo and follow the cluster setup guide (commands sketched below); open a GitHub issue if you hit any problems
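Getting started, in command form (a sketch; the setup script name follows recent DeepOps releases and may differ in yours):

git clone https://github.com/NVIDIA/deepops.git
cd deepops
./scripts/setup.sh   # prepares the provisioning node, per the setup guide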

SLIDE 46

Summary

Scaling is important and we are here to help.

  • Scaling requires careful consideration of algorithms and infrastructure at each step:
    ○ An optimized single-GPU model
    ○ An efficient & scalable allreduce library
    ○ GPU interconnect, networking, storage
    ○ ...
  • The NVIDIA platform makes scaling DL training easier and more efficient:
    ○ Deep Learning Examples with SOTA accuracy and performance
    ○ NVIDIA NGC containers with an optimized multi-GPU/multi-node software stack
    ○ An accelerated compute platform designed for performance and scaling

SLIDE 47