Accelerate Deep Learning Training at Scale on GPUs
Maggie Zhang (张雪萌), maggiez@nvidia.com

AGENDA
● Introduction
● Why do we need to scale training
● How to achieve scaling

DL Training: from single GPU to multi-node
ResNet50 v1.5 training time, 2015 to 2019:
● 2015: 36000 Mins (25 Days), 1xK80, CUDA
● 2016: 1200 Mins (20 Hours), DGX-1P, NVLink
● 2017: 480 Mins (8 Hours), DGX-1V, Tensor Core
● 2018: 70 Minutes on MLPerf, DGX-2H, NVSwitch
● 2019: 52.7 Minutes on MLPerf, DGX-2H, NVSwitch
● 2018: 6.3 Minutes on MLPerf, At Scale, DGX Cluster
● 2019: 1.33 Minutes on MLPerf, At Scale, DGX SuperPOD

The whole stack must be considered
● Compute
● Network
● Storage
● Frameworks & Libraries
● Numerical methods
● Training recipes

Largest TensorFlow model at scale
Oak Ridge National Lab scales TensorFlow climate analytics model up to 27,360 V100 GPUs
2018 Gordon Bell Prize Winner
Source: https://arxiv.org/pdf/1810.01993.pdf

Datasets getting larger
● Unlabeled data:
  ○ Language model: BooksCorpus (800M words), English Wikipedia (2.5B words), WebText (8M documents, 40 GB), C4 (Common Crawl, 745 GB)
  ○ GAN: unlabeled images and videos
  ○ Reinforcement learning: unsupervised self-play generates unlimited data
● Labeled data:
  ○ ImageNet (2012) - 1.3M images, 1000 categories
  ○ Open Images (2019) - 9M images, 6000 categories
  ○ Semi-autonomous vehicles: 0.5-1.1TB of data for every 8h driving

DL models increasing in complexity
Next-level use-cases require gigantic models
● Use cases shown on the chart: Image Recognition, Object Detection, Visual Search, Social Tagging, Autonomous Vehicles, Speech Recognition, Translation, NLP (Q&A, Sentiment, Translation), NLP Generative Tasks (Chatbots, E-mail auto-completion, Document Summarization)
● Model sizes shown on the chart: 26M, 340M, 1.5Bn, 8.3B parameters
● Project Megatron: 8.3B parameters, 8-way Model Parallel, 64-way Data Parallel, 24x larger than BERT
https://github.com/NVIDIA/Megatron-LM

Scaling == whack-a-mole?
Solving one bottleneck and another one pops up

Multi-node infrastructure requirements
Multi-Node Success depends on:
● System Design
● Data Center Management
● SW Stack

Challenges of multi-node DL training
● Hardware GPU cluster design:
  ○ Compute: significant CPU to GPU ratio, interconnect with GPU
  ○ Storage: high speed NFS, multi-tier caching
  ○ Networking: topology and bandwidth, NVLink, GPUDirect RDMA
● GPU cluster management:
  ○ Scheduler: Slurm vs. Kubernetes
  ○ Container technologies: Docker, Enroot, Singularity, etc.
● Integrated software stack:
  ○ NVIDIA libraries: CUDA, cuDNN, NCCL
  ○ DL Framework scale-out optimization
  ○ Model scale-out implementation & optimization

A basic recipe for deep learning scaling
Step 1: Optimize your single GPU model
Step 2: Scale to multiple GPUs on one node
Step 3: Scale to multiple nodes

Case study
Bidirectional Encoder Representations from Transformers (BERT)
• BERT model scripts:
  https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/
  https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT
• Configurations for convergence, from 8 to 1500 GPUs, multi-node ready
• Clone and train your own BERT model on multi-node
• Or download a pre-trained BERT model from NGC and fine-tune for your NLP task
• Super Human Question & Answering
NVIDIA Deep Learning Examples have many model scripts with best practices for accuracy and performance

Why multi-node BERT training
• Pre-training on non-labelled data opens up opportunities to use massive amounts of data:
  • BooksCorpus (800 million words)
  • English Wikipedia (2.5 billion words), multi-language Wikipedia
  • WebText (OpenAI, 8M documents, 40 GB of text)
• More data tends to lead to better accuracy
• BERT pre-training is computationally intensive and takes days to train even on the most powerful single node: BERT-Large (330M parameters) takes ~2.5 days to train on a single DGX-2 server with 16 V100 GPUs.

BERT multi-node pre-training performance
Metric: Time to train

DGX-1 (16 GB) nodes    GPUs    Time to train (Hrs)
1                      8       153.6 (6.3 days)
4                      32      39.3
16                     128     10.4

DGX-2H (32 GB) nodes   GPUs    Time to train (Hrs)
1                      16      58.4 (2.4 days)
4                      64      15.4
16                     256     3.9
64                     1024    1.2

Source: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#pre-training-loss-results
* Above time to train is measured for Mixed precision, training loss 1.3 in PyTorch; with LAMB optimizer
** Gradient accumulation is applied to DGX-2H 1,4,16 node
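The time-to-train figures above can be turned into speedup and scaling efficiency with a few lines of arithmetic. The sketch below is not from the talk; it takes the single-node row of each system as the baseline (an assumption) and ignores the gradient-accumulation caveat in the footnote, so treat the percentages as rough indicators only.

```python
# Time-to-train figures (hours) copied from the slide: (nodes, GPUs, hours).
DGX1 = [(1, 8, 153.6), (4, 32, 39.3), (16, 128, 10.4)]
DGX2H = [(1, 16, 58.4), (4, 64, 15.4), (16, 256, 3.9), (64, 1024, 1.2)]


def scaling_report(rows):
    """Return (gpus, speedup, efficiency) for each row, relative to the first row."""
    _, base_gpus, base_hours = rows[0]
    report = []
    for _, gpus, hours in rows:
        speedup = base_hours / hours      # how much faster than the baseline
        ideal = gpus / base_gpus          # speedup under perfect linear scaling
        report.append((gpus, speedup, speedup / ideal))
    return report


for name, rows in (("DGX-1", DGX1), ("DGX-2H", DGX2H)):
    for gpus, speedup, eff in scaling_report(rows):
        print(f"{name:7s} {gpus:5d} GPUs  speedup {speedup:6.2f}x  efficiency {eff:6.1%}")
```

Under these assumptions, going from 16 to 1024 GPUs on DGX-2H gives roughly a 49x speedup for 64x the hardware, about 76% of linear scaling.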

Step 1: Optimize model
• Create efficient data pipeline
• Enable mixed precision training
• Enable XLA
• Ensure latest GPU libraries
• Develop model in container to facilitate scaling out

Step 1: Optimize model
Data pipeline
• Use tf.data to create performant input pipelines
• Test I/O bottlenecks with a trivial model
• NVIDIA DALI accelerates image-based input pipelines
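One way to look for an I/O bottleneck is to take the "trivial model" idea to its extreme and time the input pipeline with no model attached at all. The sketch below is an illustration, not code from the talk: `measure_throughput` and `synthetic_batches` are hypothetical names, and the generator merely stands in for a real pipeline such as a tf.data dataset.

```python
import time
from itertools import islice


def measure_throughput(batches, num_batches, warmup=2):
    """Iterate an input pipeline with no model attached; return batches per second."""
    it = iter(batches)
    for _ in islice(it, warmup):          # discard warm-up batches (caches, start-up cost)
        pass
    start = time.perf_counter()
    consumed = sum(1 for _ in islice(it, num_batches))
    elapsed = time.perf_counter() - start
    return consumed / elapsed if elapsed > 0 else float("inf")


def synthetic_batches(batch_size=32, feature_len=128):
    """Hypothetical stand-in for a real input pipeline; yields dummy batches forever."""
    while True:
        yield [[0] * feature_len for _ in range(batch_size)]


rate = measure_throughput(synthetic_batches(), num_batches=50)
print(f"input pipeline alone: {rate:.1f} batches/s")
```

If the rate measured this way is not comfortably higher than the training steps per second seen with the real model, the input pipeline is a likely bottleneck.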

BERT data pipeline
• TFRecord - fast binary format
• Parallel read, map, & batch
• Fused map & batch op

    import tensorflow as tf  # TensorFlow 1.x (uses tf.contrib)

    # Excerpt: input_files, num_cpu_threads, batch_size, is_training,
    # name_to_features and _decode_record are defined elsewhere in the script.
    d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
    d = d.repeat()
    d = d.shuffle(buffer_size=len(input_files))

    # `cycle_length` is the number of parallel files that get read.
    cycle_length = min(num_cpu_threads, len(input_files))

    # Parallel read of TFRecord files.
    d = d.apply(
        tf.contrib.data.parallel_interleave(
            tf.data.TFRecordDataset,
            cycle_length=cycle_length))
    d = d.shuffle(buffer_size=100)

    # Fused map & batch op.
    d = d.apply(
        tf.contrib.data.map_and_batch(
            lambda record: _decode_record(record, name_to_features),
            batch_size=batch_size,
            num_parallel_batches=num_cpu_threads,
            drop_remainder=True if is_training else False))

https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py

Step 1: Optimize model
Automatic Mixed Precision (AMP)
• 1-line optimizer wrapper:
    opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
• Up to 3x speed up in training on Tensor Cores with
  • Same accuracy
  • No change in hyperparameters
• ½ memory bandwidth & footprint
• Optimal on Volta and Turing GPUs

Step 1: Optimize model
Automatic Mixed Precision (AMP)
• Robust speedup across different TensorFlow workloads
• https://arxiv.org/abs/1710.03740
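Mixed precision training relies on loss scaling to keep small gradient values from vanishing in FP16. The sketch below is an illustration using only the Python standard library, not the AMP implementation: the gradient value and the fixed scale of 1024 are made-up assumptions, and real AMP adjusts the scale dynamically.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 1024.0   # illustrative constant (assumption)
tiny_grad = 1e-8      # hypothetical gradient value below the FP16 range

# Stored directly in FP16, the value underflows to zero.
unscaled = to_fp16(tiny_grad)

# Scaled before the FP16 round-trip, then unscaled in higher precision, it survives.
scaled = to_fp16(tiny_grad * LOSS_SCALE) / LOSS_SCALE

print("without loss scaling:", unscaled)
print("with loss scaling:   ", scaled)
```

The recovered value is not bit-exact, because FP16 still rounds the scaled number, but it is no longer lost entirely.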

Step 1: Optimize model
XLA (Accelerated Linear Algebra)
• TensorFlow XLA can accelerate models with minimal code changes
• XLA optimizes graph, mostly by fusing compatible kernels
• Set XLA optimization level:
    config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
  https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py#L531
System config: Xeon E5-2698 v4 CPU with 256GB system RAM, single V100 Tensor Core GPU 32GB. Tests run using NVIDIA 18.11 TensorFlow container.
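Kernel fusion can be pictured with a toy analogy in plain Python. This is not how XLA is implemented; it only shows why merging several element-wise operations into one pass avoids writing and re-reading intermediate buffers. The function names and the multiply, add, ReLU chain are made up for illustration.

```python
def unfused(xs, scale, bias):
    """Three separate 'kernels': each pass writes a full intermediate list."""
    scaled = [x * scale for x in xs]          # kernel 1: multiply
    shifted = [s + bias for s in scaled]      # kernel 2: add
    return [max(v, 0.0) for v in shifted]     # kernel 3: ReLU


def fused(xs, scale, bias):
    """One 'kernel': the same arithmetic in a single pass, with no intermediates."""
    return [max(x * scale + bias, 0.0) for x in xs]


data = [-2.0, -0.5, 0.0, 0.5, 2.0]
print(unfused(data, 2.0, -1.0))
print(fused(data, 2.0, -1.0))
```

Both functions return the same values; the fused version touches each element once instead of three times, which on a GPU translates into fewer kernel launches and less memory traffic.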

Step 1: Optimize model
Latest GPU optimizations
• Latest compatible features and tuning from CUDA toolkit and Deep Learning Libraries (cuDNN, cuBLAS, NCCL)

Step 1: Optimize model
Latest GPU optimizations
• NGC containers: fully featured DL containers
• DL frameworks compiled with latest GPU libraries
• Portability of application libraries facilitates multi-node scale-out

Step 2: Scale to multiple GPUs
• Understand Data Parallel training concepts
• Ensure optimal inter-GPU communication
• Apply high level API for multi-GPU training

Step 2: Scale to multiple GPUs
Under the hood
• Single GPU

Step 2: Scale to multiple GPUs
Under the hood
• Multiple GPU
• Data parallel training
• Allreduce algorithm
• NCCL: NVIDIA Collective Communications Library
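In data parallel training each GPU computes gradients on its own slice of the batch, and an allreduce leaves every GPU with the same combined gradient. The slide does not say which allreduce algorithm is meant; the ring variant below is one well-known choice, shown as a single-process simulation over Python lists. It is a sketch, not NCCL's implementation, and the gradient values are made up.

```python
def ring_allreduce(worker_grads):
    """Sum equal-length gradient vectors across simulated workers arranged in a ring.

    Returns one buffer per worker; after the call every buffer holds the full sum.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    bufs = [list(g) for g in worker_grads]      # each worker's local copy
    # Split the vector into n contiguous chunks (sizes differ by at most one).
    bounds = [(c * length) // n for c in range(n + 1)]

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    def exchange(chunk_for_rank, combine):
        # All workers send at once: snapshot the payloads before applying them.
        payloads = [[bufs[r][i] for i in chunk(chunk_for_rank(r))] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n                   # right-hand neighbour in the ring
            for i, v in zip(chunk(chunk_for_rank(r)), payloads[r]):
                bufs[dst][i] = combine(bufs[dst][i], v)

    # Phase 1, reduce-scatter: after n-1 steps worker r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        exchange(lambda r, s=step: (r - s) % n, lambda mine, recv: mine + recv)
    # Phase 2, all-gather: circulate the finished chunks until every worker has all of them.
    for step in range(n - 1):
        exchange(lambda r, s=step: (r + 1 - s) % n, lambda mine, recv: recv)
    return bufs


# Four simulated GPUs, each with its own local gradient vector (made-up integers).
grads = [[1, 2, 3, 4, 5, 6],
         [10, 20, 30, 40, 50, 60],
         [100, 200, 300, 400, 500, 600],
         [1000, 2000, 3000, 4000, 5000, 6000]]
summed = ring_allreduce(grads)
averaged = [[v / len(grads) for v in buf] for buf in summed]
print(summed[0])
print(averaged[0])
```

Each worker sends and receives only one chunk per step, so the data moved per worker stays roughly constant as the ring grows, which is why the pattern suits bandwidth-bound gradient exchange.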

Step 2: Scale to multiple GPUs
Under the hood
• Inter-GPU communication: Effective bandwidth in GB/s

Step 2: Scale to multiple GPUs
Under the hood
• Full non-blocking bandwidth

Step 2: Scale to multiple GPUs
Approach 1: Horovod
• Popular approach to enable multi-GPU/multi-node in TensorFlow/Keras
• Strong NCCL integration
• Sample commands:
  • Single-node (4 GPUs):
      horovodrun -np 4 -H localhost:4 python train.py
  • Multi-node (4 nodes with 4 GPUs each):
      horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
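The two sample commands follow one pattern: `-np` is the total number of processes (one per GPU) and `-H` lists each host with its GPU count. The helper below is a hypothetical convenience for illustration, not part of Horovod; it only assembles the command line from a host-to-GPU mapping and uses no flags beyond the ones shown on the slide.

```python
import shlex


def horovodrun_command(hosts, script="train.py"):
    """Build a horovodrun command line from a {hostname: gpus_on_host} mapping."""
    total = sum(hosts.values())                                   # value for -np
    host_list = ",".join(f"{name}:{slots}" for name, slots in hosts.items())
    return ["horovodrun", "-np", str(total), "-H", host_list, "python", script]


single = horovodrun_command({"localhost": 4})
multi = horovodrun_command({f"server{i}": 4 for i in range(1, 5)})
print(shlex.join(single))
print(shlex.join(multi))
```

The returned list can be passed as-is to `subprocess.run` on a machine where Horovod is installed.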
