

  1. HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow
     Ammar Ahmad Awan, Arpan Jain, Quentin Anthony, Hari Subramoni, and Dhabaleswar K. Panda
     Network Based Computing Laboratory (NBCL), Dept. of Computer Science and Engineering, The Ohio State University
     {awan.10, jain.575, anthony.301, subramoni.1, panda.2}@osu.edu

  2. Agenda
     • Introduction and Motivation
     • Problems and Challenges
     • Key Contribution
     • Performance Characterization
     • Conclusion

  3. The Deep Learning (DL) Revolution
     • Deep Learning – a technique to achieve Artificial Intelligence
       – Uses Deep Neural Networks (DNNs)
     • [Figure: AI ⊃ Machine Learning (ML) ⊃ Deep Learning (DL); ML examples: Logistic Regression; DL examples: MLPs, DNNs]
     • Adopted from: http://www.deeplearningbook.org/contents/intro.html
     • Source: https://thenewstack.io/demystifying-deep-learning-and-artificial-intelligence/

  4. Deep Learning meets Super Computers
     • NVIDIA GPUs – a major force for accelerating DL workloads
     • Computational requirement is increasing exponentially
     • [Figure: Accelerator/CP family performance share, www.top500.org]
     • Courtesy: https://openai.com/blog/ai-and-compute/

  5. How to make Training Faster?
     • Data parallelism (see the sketch after this slide)
       – Horovod: TensorFlow, PyTorch, and MXNet
       – TensorFlow: tf.distribute.Strategy API
       – PyTorch: torch.nn.parallel.DistributedDataParallel
     • Model parallelism and hybrid parallelism
       – No framework-level support
       – Only LBANN supports it within the framework
       – Higher-level frameworks: GPipe, Mesh-TensorFlow, etc.
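
The data-parallel path above is the one frameworks already support out of the box. As a point of reference (not part of the original slides), here is a minimal data-parallel training sketch using TensorFlow's tf.distribute.MirroredStrategy; the toy model, synthetic data, and batch size are placeholders.

```python
# Minimal data-parallel sketch with tf.distribute (illustrative; not HyPar-Flow code).
# Assumes TensorFlow 2.x; the toy model and synthetic data are placeholders.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # replicates the model across local devices

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Synthetic data stands in for a real dataset.
x = tf.random.normal((1024, 784))
y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)

# Each replica processes a shard of every global batch; gradients are all-reduced.
model.fit(x, y, batch_size=256, epochs=1)
```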

  6. Distributed/Parallel Training Strategies for DNNs
     • Data Parallelism (most common)
     • Model and Hybrid Parallelism (emerging)
     • ‘X’-Parallelism – ‘X’ → Spatial, Channel, Filter, etc.
     • [Figure: Data Parallelism vs. Model Parallelism vs. Hybrid (Model and Data) Parallelism]
     • Courtesy: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks

  7. Why Model Parallelism?
     • Data parallelism – only for models that fit in memory
     • Out-of-core models
       – Deeper model → better accuracy, but more memory required!
     • Model parallelism can work for out-of-core models (a layer-splitting sketch follows this slide)
     • Designing a system for model parallelism is challenging
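
To make the idea concrete, here is a minimal sketch (not from the slides) of splitting a Keras model's layers into two partitions inside one process; the device placement and toy layers are assumptions. HyPar-Flow itself partitions across processes with MPI, which later slides cover.

```python
# Minimal layer-splitting sketch (illustrative only; not HyPar-Flow's implementation).
# Assumes TensorFlow 2.x; "/CPU:0" placement keeps the example runnable anywhere.
# In practice the two halves would live on different devices or different nodes.
import tensorflow as tf

# Partition 1: first half of the layers.
with tf.device("/CPU:0"):
    part1 = tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(256, activation="relu"),
    ])

# Partition 2: second half of the layers.
with tf.device("/CPU:0"):
    part2 = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(256,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

x = tf.random.normal((32, 784))
activations = part1(x)       # forward through partition 1
logits = part2(activations)  # forward through partition 2
print(logits.shape)          # (32, 10)
```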

  8. Agenda
     • Introduction and Motivation
     • Problems and Challenges
     • Key Contribution
     • Performance Characterization
     • Conclusion

  9. Major Problems
     • Defining a distributed model – necessary but difficult
       – Requires knowledge of the model, communication library, and distributed hardware
     • Implementing distributed forward/back-propagation (see the MPI sketch after this list)
       – Needed because partitions reside in different memory spaces and need explicit communication
     • Obtaining parallel speedup on an inherently sequential task
       – Forward pass followed by a backward pass
       – Limited opportunity for parallelism and scalability
     • Achieving scalability without losing out on a model’s accuracy
       – A valid concern for all types of parallelism strategies
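
Because the partitions live in separate processes, activations and gradients must be exchanged explicitly. Below is a minimal sketch of that idea using mpi4py point-to-point calls; the two-rank layout, tensor shapes, and toy layers are assumptions for illustration, not HyPar-Flow's actual trainer.

```python
# Two-rank model-parallel forward/backward handshake (illustrative; assumes mpi4py and TensorFlow 2.x).
# Run with: mpirun -np 2 python mp_forward_sketch.py
from mpi4py import MPI
import numpy as np
import tensorflow as tf

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Partition 0 owns the first half of the layers.
    part = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu", input_shape=(784,))])
    x = tf.random.normal((32, 784))
    activations = part(x).numpy()
    comm.Send(activations, dest=1, tag=0)         # hand activations to the next partition
    grad_wrt_act = np.empty_like(activations)
    comm.Recv(grad_wrt_act, source=1, tag=1)      # receive gradients during the backward pass
else:
    # Partition 1 owns the second half of the layers.
    part = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(256,))])
    activations = np.empty((32, 256), dtype=np.float32)
    comm.Recv(activations, source=0, tag=0)
    with tf.GradientTape() as tape:
        act = tf.convert_to_tensor(activations)
        tape.watch(act)
        loss = tf.reduce_mean(part(act) ** 2)     # placeholder loss
    grad_wrt_act = tape.gradient(loss, act).numpy()
    comm.Send(grad_wrt_act, dest=0, tag=1)        # send gradient of loss w.r.t. received activations
```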

  10. Research Challenges
      • Challenge-1: Model-Definition APIs and Framework-specific Features
      • Challenge-2: Communication between Partitions and Replicas
      • Challenge-3: Applying HPC Techniques to Improve Performance
      Meet HyPar-Flow!

  11. Agenda
      • Introduction and Motivation
      • Problems and Challenges
      • Key Contribution
      • Performance Evaluation
      • Conclusion

  12. Key Contribution: Propose, Design, and Evaluate HyPar-Flow
      • HyPar-Flow is practical (easy to use) and high-performance (uses MPI); a usage sketch follows this slide
        – Based on Keras models and exploits TF 2.0 Eager Execution
        – Leverages the performance of MPI point-to-point and collective operations for communication
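
To give a sense of the "easy to use" claim, the sketch below shows roughly what a Keras-plus-MPI entry point could look like. The hf.fit-style wrapper, its argument names, and the launch command are hypothetical illustrations drawn from the slide text, not HyPar-Flow's documented API.

```python
# Hypothetical usage sketch (the hyparflow module and argument names are assumptions,
# not the library's documented API). The point: the user writes a plain Keras model,
# and a fit-style call handles partitioning and MPI communication behind the scenes.
import tensorflow as tf
# import hyparflow as hf   # hypothetical import; shown for illustration only

def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

model = build_model()

# A hybrid-parallel call might look something like this (names are assumptions):
# hf.fit(model,
#        x_train, y_train,
#        num_partitions=4,   # model parallelism: layers split across 4 processes
#        num_replicas=8,     # data parallelism: 8 replicas of the partitioned model
#        epochs=10)
#
# Launched across processes with MPI, e.g.:
#   mpirun -np 32 python train_hyparflow.py
```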

  13. HyPar-Flow: Overview

  14. HyPar-Flow: Components
      • Model Generator – crucial for productivity
      • Load Balancer – crucial for performance
      • Trainer
        – Core of back-propagation
        – Several system-level challenges
        – Communication of tensors: blocking or non-blocking (see the overlap sketch after this list)
        – Efficient pipelining is needed
      • Communication Engine
        – Isolates communication interfaces
        – Unified data, model, and hybrid parallelism
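
The blocking-versus-non-blocking point is where overlap comes from: a partition can post a non-blocking send of one micro-batch's activations and keep computing the next one. Below is a minimal mpi4py sketch of that overlap; the shapes, the two-rank layout, and the compute placeholder are assumptions, not HyPar-Flow's trainer code.

```python
# Non-blocking send/compute overlap sketch (illustrative; assumes mpi4py and NumPy).
# Run with: mpirun -np 2 python overlap_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
NUM_MICRO_BATCHES, SHAPE = 4, (32, 256)

def compute_activations(step):
    # Placeholder for a partition's forward compute on micro-batch `step`.
    return np.full(SHAPE, float(step), dtype=np.float32)

if rank == 0:
    pending, buffers = [], []
    for step in range(NUM_MICRO_BATCHES):
        acts = compute_activations(step)
        buffers.append(acts)                             # keep send buffers alive until completion
        pending.append(comm.Isend(acts, dest=1, tag=step))  # post send, move on to next micro-batch
    MPI.Request.Waitall(pending)                         # drain outstanding sends at the end
else:
    for step in range(NUM_MICRO_BATCHES):
        buf = np.empty(SHAPE, dtype=np.float32)
        req = comm.Irecv(buf, source=0, tag=step)
        req.Wait()                                       # consume micro-batches in order
        # ... forward compute on `buf` would go here ...
```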

  15. Special Handling for Models with Skip Connections

  16. Agenda
      • Introduction and Motivation
      • Problems and Challenges
      • Key Contribution
      • Performance Characterization
      • Conclusion

  17. Evaluation Setup
      • Three systems
        – Frontera at the Texas Advanced Computing Center (TACC)
        – Stampede2 (Skylake partition) at TACC
        – AMD EPYC: local system with dual-socket AMD EPYC 7551 32-core processors
      • Interconnects
        – Frontera – Mellanox InfiniBand HDR-100 HCAs
        – Stampede2 – Intel Omni-Path HFIs
      • Software: TensorFlow v1.13; MVAPICH2 2.3.2 on Frontera and EPYC; Intel MPI 2018 on Stampede2
      • We use and modify model definitions for VGG and ResNet(s) from keras.applications

  18. Verifying the Correctness of HyPar-Flow
      • The following variants have been compared (a minimal tf.GradientTape baseline is sketched after this list):
        – SEQ (GT) – sequential using tf.GradientTape (GT)
        – SEQ (MF) – sequential using model.fit (MF)
        – SEQ (MF-E) – sequential using model.fit (MF) and (E)ager execution
        – HF-MP (2)/(56) – HyPar-Flow model-parallel with 2/48 model-partitions
      • [Figure: correctness comparison for VGG-16, ResNet-110, and ResNet-1k]
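
For reference (not part of the slides), this is roughly what the SEQ (GT) baseline looks like: a plain single-process training loop written with tf.GradientTape, against which the partitioned runs' loss curves can be compared. The toy model and data are placeholders.

```python
# Sequential tf.GradientTape baseline sketch (illustrative; assumes TensorFlow 2.x).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

x = tf.random.normal((256, 784))
y = tf.random.uniform((256,), maxval=10, dtype=tf.int32)

for step in range(10):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss = loss_fn(y, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # The loss values from this loop are what the HF-MP runs are checked against.
    print(step, float(loss))
```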

  19. Model/Hybrid Parallelism on Single/Two Nodes
      • ResNet-1k scales with batch size on one node as well as two nodes
      • Reason for scaling?
        – Counter-intuitive for model parallelism to scale better than data parallelism
        – Poor CPU implementation?

  20. Hybrid Parallelism for AmoebaNet
      • AmoebaNet – a different architecture compared to ResNet(s)
      • More branches and skip connections
      • Scales well using HyPar-Flow
      • Memory-hungry, so a single node is restricted to batch size = 64

  21. HyPar-Flow (HF): Flexibility and Scalability
      • CPU-based results
        – AMD EPYC
        – Intel Xeon
      • Excellent speedups for
        – VGG-19
        – ResNet-110
        – ResNet-1000 (1k layers)
      • Able to train “future” models
        – E.g., ResNet-5000 (a synthetic 5000-layer model we benchmarked)
      • 110x speedup on 128 Intel Xeon Skylake nodes (TACC Stampede2)

  22. HyPar-Flow at Scale (512 nodes on TACC Frontera)
      • ResNet-1001 with variable batch size
      • Approach:
        – 48 model-partitions per 56-core node
        – 512 model-replicas for 512 nodes
        – Total cores used: 48 x 512 = 24,576
      • Speedup
        – 253x on 256 nodes
        – 481x on 512 nodes
      • Scaling efficiency (computed in the sketch below)
        – 98% up to 256 nodes
        – 93.9% for 512 nodes
      • 481x speedup on 512 Intel Xeon Skylake nodes (TACC Frontera)
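
As a quick check on the efficiency figures above (my arithmetic, not from the slides), scaling efficiency here is simply the reported speedup divided by the node count:

```python
# Scaling-efficiency check for the reported speedups (simple arithmetic, not HyPar-Flow code).
def scaling_efficiency(speedup, nodes):
    return speedup / nodes

print(f"256 nodes: {scaling_efficiency(253, 256):.1%}")   # ~98.8%, reported as 98%
print(f"512 nodes: {scaling_efficiency(481, 512):.1%}")   # ~93.9%, as reported
```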

  23. Agenda
      • Introduction and Motivation
      • Problems and Challenges
      • Key Contribution
      • Performance Evaluation
      • Conclusion

  24. Conclusion
      • In-depth analysis of data/model/hybrid parallelism
        – The need for model/hybrid parallelism – larger models
      • Proposed and designed HyPar-Flow
        – A flexible and user-transparent system
        – Leverages existing technologies instead of reinventing anything
        – Keras, TensorFlow, and MPI for flexibility and scalability
      • Performance evaluation on large systems
        – Three HPC clusters, including Frontera at TACC (#5 on the Top500)
        – Three DNNs with diverse requirements and sizes (VGG, ResNet-110/1k, and AmoebaNet)
        – 93% scaling efficiency on 512 nodes (Frontera)
