


  1. Accelerate Deep Learning Training at Scale on GPUs Maggie Zhang ( 张雪萌 ) maggiez@nvidia.com

  2. AGENDA ● Introduction ● Why do we need to scale training ● How to achieve scaling

  3. DL Training: from single GPU to multi-node (ResNet-50 v1.5 training time)
● 2015: 1x K80 (CUDA): 36,000 minutes (25 days)
● 2016: DGX-1P (NVLink): 1,200 minutes (20 hours)
● 2017: DGX-1V (Tensor Core): 480 minutes (8 hours)
● 2018: DGX-2H (NVSwitch): 70 minutes on MLPerf
● 2018: At Scale, DGX cluster: 6.3 minutes on MLPerf
● 2019: DGX-2H (NVSwitch): 52.7 minutes on MLPerf
● 2019: At Scale, DGX SuperPOD: 1.33 minutes on MLPerf

  4. The whole stack must be considered ● Compute ● Network ● Storage ● Frameworks & Libraries ● Numerical methods ● Training recipes

  5. MLPerf: NVIDIA advancing AI training. Time to train: from 8 hours to 80 seconds. 2019 MLPerf IDs (in order from top to bottom of chart): ResNet-50: 0.6-30 | Transformer: 0.6-28 | GNMT: 0.6-14 | SSD: 0.6-27 | Mini-Go: 0.6-11 | Mask R-CNN: 0.6-23

  6. Largest TensorFlow model at scale: Oak Ridge National Lab scales a TensorFlow climate analytics model up to 27,360 V100 GPUs (2018 Gordon Bell Prize winner). Source: https://arxiv.org/pdf/1810.01993.pdf

  7. AGENDA ● Introduction ● Why do we need to scale training ● How to achieve scaling

  8. Datasets getting larger
● Unlabeled data:
○ Language models: BooksCorpus (800M words), English Wikipedia (2.5B words), WebText (8M documents, 40 GB), C4 (Common Crawl, 745 GB)
○ GANs: unlabeled images and videos
○ Reinforcement learning: unsupervised self-play generates unlimited data
● Labeled data:
○ ImageNet (2012): 1.3M images, 1,000 categories
○ Open Images (2019): 9M images, 6,000 categories
○ Semi-autonomous vehicles: 0.5-1.1 TB of data for every 8 hours of driving

  9. DL models increasing in complexity: next-level use cases require gigantic models
● NLP, generative tasks: chatbots, e-mail auto-completion, document summarization, Q&A, sentiment, translation
● Speech recognition
● Image recognition: autonomous vehicles, social tagging, visual search, object detection
● (Chart: model parameter counts growing from 26M to 340M to 1.5B)
● Project Megatron: 8.3B parameters, 8-way model parallel, 64-way data parallel, 24x larger than BERT
https://github.com/NVIDIA/Megatron-LM

  10. AGENDA ● Introduction ● Why do we need to scale training ● How to achieve scaling

  11. Scaling == whack-a-mole? Solve one bottleneck and another one pops up.

  12. Multi-node infrastructure requirements: multi-node success rests on system design, data center management, and the SW stack.

  13. Challenges of multi-node DL training
● GPU cluster hardware design:
○ Compute: significant CPU-to-GPU ratio, interconnect with GPU
○ Storage: high-speed NFS, multi-tier caching
○ Networking: topology and bandwidth, NVLINK, GPUDirect RDMA
● GPU cluster management:
○ Scheduler: Slurm vs. Kubernetes
○ Container technologies: Docker, Enroot, Singularity, etc.
● Integrated software stack:
○ NVIDIA libraries: CUDA, cuDNN, NCCL
○ DL framework scale-out optimization
○ Model scale-out implementation & optimization

  14. A basic recipe for deep learning scaling
Step 1: Optimize your single-GPU model
Step 2: Scale to multiple GPUs on one node
Step 3: Scale to multiple nodes

  15. Case study: BERT (Bidirectional Encoder Representations from Transformers)
• BERT model scripts:
• https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/
• https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT
• Configurations for convergence, from 8 to 1,500 GPUs, multi-node ready
• Clone and train your own BERT model on multi-node, or download a pre-trained BERT model from NGC and fine-tune it for your NLP task (e.g. super-human question answering)
• NVIDIA Deep Learning Examples include many model scripts with best practices for accuracy and performance

  16. Why multi-node BERT training
• Pre-training on non-labelled data opens up opportunities to use massive amounts of data:
• BooksCorpus (800 million words)
• English Wikipedia (2.5 billion words), multi-language Wikipedia
• WebText (OpenAI, 8M documents, 40 GB of text)
• More data tends to lead to better accuracy
• BERT pre-training is computationally intensive and takes days even on the most powerful single node: BERT-Large (330M parameters) takes ~2.5 days to train on a single DGX-2 server with 16 V100 GPUs.

  17. BERT multi-node pre-training performance (metric: time to train)

DGX-1 nodes (16 GB)    GPUs    Time to train (hrs)
1                      8       153.6 (6.3 days)
4                      32      39.3
16                     128     10.4

DGX-2H nodes (32 GB)   GPUs    Time to train (hrs)
1                      16      58.4 (2.4 days)
4                      64      15.4
16                     256     3.9
64                     1024    1.2

Source: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#pre-training-loss-results
* Time to train is measured with mixed precision, to training loss 1.3, in PyTorch with the LAMB optimizer
** Gradient accumulation is applied to the DGX-2H 1-, 4-, and 16-node runs

  18. Step 1: Optimize model
• Create an efficient data pipeline
• Enable mixed precision training
• Enable XLA
• Ensure the latest GPU libraries
• Develop the model in a container to facilitate scaling out

  19. Step 1: Optimize model
Data pipeline
• Use tf.data to create performant input pipelines
• Test for I/O bottlenecks with a trivial model (see the sketch below)
• NVIDIA DALI accelerates image-based input pipelines
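The following is a minimal sketch (not from the deck) of the "trivial model" I/O test mentioned above: it iterates the tf.data pipeline alone and reports batches per second, assuming TensorFlow 2.x eager execution. If this rate is not much higher than your training throughput, the input pipeline is the bottleneck. The synthetic dataset is a placeholder for your real pipeline.

import time
import tensorflow as tf

def benchmark_input_pipeline(dataset, num_batches=100):
    # "Trivial model" I/O test: consume batches and do nothing with them,
    # so the measured rate reflects the input pipeline only.
    start = time.time()
    n = 0
    for _ in dataset.take(num_batches):
        n += 1
    elapsed = time.time() - start
    print("Input pipeline throughput: %.1f batches/sec" % (n / elapsed))

# Placeholder dataset standing in for the real one (e.g. TFRecords + decode).
ds = tf.data.Dataset.from_tensor_slices(tf.zeros([256, 224, 224, 3]))
ds = ds.repeat().batch(32).prefetch(tf.data.experimental.AUTOTUNE)
benchmark_input_pipeline(ds)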

  20. BERT data pipeline: TFRecord (fast binary format), parallel read, map & batch, fused map & batch op

d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
d = d.repeat()
d = d.shuffle(buffer_size=len(input_files))

# `cycle_length` is the number of parallel files that get read.
cycle_length = min(num_cpu_threads, len(input_files))

# Parallel read: interleave records from multiple TFRecord files.
d = d.apply(
    tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset,
        cycle_length=cycle_length))
d = d.shuffle(buffer_size=100)

# Fused map & batch op: decode and batch in one step.
d = d.apply(
    tf.contrib.data.map_and_batch(
        lambda record: _decode_record(record, name_to_features),
        batch_size=batch_size,
        num_parallel_batches=num_cpu_threads,
        drop_remainder=True if is_training else False))

https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py
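For context, here is a hedged sketch of what a decode function like _decode_record typically does for TFRecord inputs; the authoritative version lives in run_pretraining.py in the linked repo, and the TF 1.x calls and the example feature spec below are assumptions rather than a quote from the slide.

def _decode_record(record, name_to_features):
    # Parse one serialized tf.Example into a dict of tensors according to the
    # feature spec, e.g. {"input_ids": tf.FixedLenFeature([seq_len], tf.int64)}.
    example = tf.parse_single_example(record, name_to_features)
    # TFRecords store integers as int64; cast to int32 for faster downstream ops.
    for name in list(example.keys()):
        t = example[name]
        if t.dtype == tf.int64:
            t = tf.to_int32(t)
        example[name] = t
    return example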

  21. Step 1: Optimize model
Automatic Mixed Precision (AMP)
• 1-line optimizer wrapper: opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
• Up to 3x speed-up in training on Tensor Cores, with:
• the same accuracy
• no change in hyperparameters
• ½ the memory bandwidth & footprint
• Optimal on Volta and Turing GPUs
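A minimal sketch of where the one-line AMP rewrite from the slide fits in a TF 1.x-style training graph; the toy model below is a placeholder, not part of the deck.

import tensorflow as tf  # assumes TensorFlow 1.14+ in graph mode

# Placeholder network: a single dense layer standing in for the real model.
features = tf.placeholder(tf.float32, [None, 1024])
labels = tf.placeholder(tf.float32, [None, 10])
logits = tf.layers.dense(features, 10)
loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)

opt = tf.train.AdamOptimizer(learning_rate=1e-4)
# The 1-line wrapper from the slide: casts eligible ops to FP16 on Tensor Cores
# and adds automatic loss scaling; no hyperparameter changes required.
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
train_op = opt.minimize(loss)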

  22. Step 1: Optimize model
Automatic Mixed Precision (AMP)
• Robust speedup across different TensorFlow workloads
• https://arxiv.org/abs/1710.03740

  23. Step 1: Optimize model
XLA (Accelerated Linear Algebra)
• TensorFlow XLA can accelerate models with minimal code changes
• XLA optimizes the graph, mostly by fusing compatible kernels
• Set the XLA optimization level: config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py#L531
System config: Xeon E5-2698 v4 CPU with 256 GB system RAM, single V100 Tensor Core GPU 32 GB. Tests run using the NVIDIA 18.11 TensorFlow container.
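A minimal sketch (assuming TF 1.x graph mode) of applying the global_jit_level setting from the slide to a session; in TF 2.x the equivalent switch is tf.config.optimizer.set_jit(True).

import tensorflow as tf  # assumes TensorFlow 1.x in graph mode

# Enable XLA auto-clustering for every graph launched in this session.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

with tf.Session(config=config) as sess:
    a = tf.random_normal([1024, 1024])
    b = tf.random_normal([1024, 1024])
    # matmul + add + relu are the kind of compatible kernels XLA fuses.
    c = tf.nn.relu(tf.matmul(a, b) + 1.0)
    sess.run(c)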

  24. Step 1: Optimize model
Latest GPU optimizations
• Latest compatible features and tuning from the CUDA toolkit and deep learning libraries (cuDNN, cuBLAS, NCCL)

  25. Step 1: Optimize model
Latest GPU optimizations
• NGC containers: fully featured DL containers
• DL frameworks compiled with the latest GPU libraries
• Portability of application libraries facilitates multi-node scale-out


  27. Step 2: Scale to multiple GPUs
• Understand data parallel training concepts
• Ensure optimal inter-GPU communication
• Apply a high-level API for multi-GPU training

  28. Step 2: Scale to multiple GPUs
Under the hood
• Single GPU

  29. Step 2: Scale to multiple GPUs
Under the hood
• Multiple GPUs
• Data parallel training
• Allreduce algorithm (a conceptual sketch follows)
• NCCL: NVIDIA Collective Communication Library
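Not from the deck and not how NCCL is implemented, the NumPy sketch below only illustrates the semantics of the allreduce step in data parallel training: after the operation every worker holds the same element-wise sum of all local gradients, which is then averaged and applied identically everywhere.

import numpy as np

def allreduce_sum(per_worker_grads):
    # Conceptual allreduce: every worker receives the element-wise sum of all
    # workers' tensors (NCCL does this with ring/tree algorithms over NVLink/IB).
    total = np.sum(per_worker_grads, axis=0)
    return [total.copy() for _ in per_worker_grads]

# Four "GPUs", each with its own local gradient for the same parameter tensor.
local_grads = [np.random.randn(3) for _ in range(4)]
reduced = allreduce_sum(local_grads)
# Each worker applies the same averaged gradient, keeping replicas in sync.
avg_grad = reduced[0] / len(local_grads)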

  30. Step 2: Scale to multiple GPUs
Under the hood
• Inter-GPU communication (chart: effective bandwidth in GB/s)

  31. Step 2: Scale to multiple GPUs
Under the hood
• Full non-blocking bandwidth

  32. Step 2: Scale to multiple GPUs
Approach 1: Horovod
• Popular approach to enable multi-GPU/multi-node training in TensorFlow/Keras
• Strong NCCL integration
• Sample commands (a minimal training-script sketch follows):
• Single node (4 GPUs): horovodrun -np 4 -H localhost:4 python train.py
• Multi-node (4 nodes with 4 GPUs each): horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
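A hedged sketch of the typical Horovod changes inside train.py for a TensorFlow/Keras script (assumes TensorFlow 2.x and Horovod built with NCCL; the tiny model and synthetic data are placeholders, not from the deck). It would be launched with the horovodrun commands above.

import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU, started by horovodrun

# Pin each process to a single GPU based on its local rank on the node.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation='softmax')])

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged across workers with NCCL allreduce.
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

# Broadcast initial weights from rank 0 so every replica starts identical.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

x = np.random.rand(1024, 64).astype('float32')
y = np.random.randint(0, 10, size=(1024,))
model.fit(x, y, batch_size=32, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)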
