5194.01: Introduction to High-Performance Deep Learning


SLIDE 1

5194.01: Introduction to High-Performance Deep Learning

Mesh-TensorFlow & SparkNet

Shen Wang

The Ohio State University

10/21/2020


SLIDE 2

SparkNet & Mesh-TensorFlow

SparkNet: Training Deep Networks in Spark
Mesh-TensorFlow: Deep Learning for Supercomputers


SLIDE 3

SLIDE 4

Background

• Training DNNs is time-consuming; computational clusters can be used to speed up training.

• Many attempts to speed up the training of deep networks rely on asynchronous, lock-free optimization, and batch-processing frameworks have become popular. However, state-of-the-art deep learning systems rely on custom implementations to support their asynchronous, communication-intensive workloads.

• SparkNet is designed to integrate a distributed training algorithm with existing batch-processing frameworks such as MapReduce and Spark.


SLIDE 5

Architecture of SparkNet: the parameter server model

• Master node: keeps the latest model parameters and serves them to the worker nodes.
• Worker nodes: compute gradients with respect to the parameters and ship them back to the master node.


SLIDE 6

Advantages

• It is convenient to integrate model training with existing data-processing pipelines.
• Data can be kept in memory from start to finish, so one can train and visualize within a single framework.
• Hardware requirements are minimal.

• Many distributed training approaches require heavy communication; SparkNet does not require communication optimizations within the cluster.


SLIDE 7

Implementation: distributed training

• The master node broadcasts the parameters to the worker nodes.
• Worker nodes train on their batches independently for 50 iterations and ship the resulting parameters back.
• The master node updates the parameters with the average over the workers and broadcasts the new parameters (see the sketch below).
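The loop is short enough to simulate end to end. Below is a minimal numpy sketch of the SparkNet averaging scheme on a toy least-squares model; the real system runs the inner loop in Caffe on Spark executors, so the model, the data, and the local_sgd helper here are illustrative stand-ins.

```python
import numpy as np

def local_sgd(w, X, y, tau=50, lr=0.01):
    """Stand-in for a worker's local solver: tau SGD steps on a
    least-squares objective, starting from the broadcast parameters."""
    for _ in range(tau):
        i = np.random.randint(len(X))
        grad = (X[i] @ w - y[i]) * X[i]
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
w_true = rng.normal(size=10)

# One (X, y) shard per simulated worker node (K = 5 workers).
shards = []
for _ in range(5):
    X = rng.normal(size=(1000, 10))
    shards.append((X, X @ w_true + 0.01 * rng.normal(size=1000)))

w = np.zeros(10)  # master's parameters
for _ in range(20):
    # Master broadcasts w; each worker runs tau local steps; master averages.
    w = np.mean([local_sgd(w.copy(), X, y) for X, y in shards], axis=0)

print("parameter error:", np.linalg.norm(w - w_true))
```

The only cross-node traffic per round is one broadcast of w and one set of parameter vectors shipped back, which is what makes the scheme tolerant of slow networks.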


SLIDE 8

Theoretical limitations for different parallelization schemes

No parallelization

• Na(b): the number of serial SGD iterations required to obtain an accuracy of a when training with batch size b.
• C(b): the time to compute the gradient over a batch of size b.
• Total time: T0 = Na(b) × C(b)
• Each block corresponds to a single SGD update with batch size b.


SLIDE 9

Theoretical limitations for different parallelization schemes

Naive parallelization

• Distribute the computation by dividing each minibatch across K machines; the time for a single node in a single iteration becomes C(b/K).
• Broadcasting the parameters takes time S.
• Total time: T1 = Na(b) × (C(b/K) + S)
• Speedup: T0 / T1 = (Na(b) × C(b)) / (Na(b) × (C(b/K) + S)) = C(b) / (C(b/K) + S) < C(b) / S. This bound limits the achievable speedup no matter how many machines are used.


SLIDE 10

Theoretical limitations for different parallelization schemes

SparkNet parallelization

• Distribute the computation in rounds across K machines; in each round, every machine runs SGD for τ iterations with batch size b.
• Ma(b, K, τ): the number of rounds required to achieve an accuracy of a.
• The time for a single node in a single iteration is still C(b); broadcasting the parameters takes time S.
• Total time: T2 = Ma(b, K, τ) × (τ × C(b) + S)
• Speedup: T0 / T2 = (Na(b) × C(b)) / (Ma(b, K, τ) × (τ × C(b) + S)) (compared numerically below)
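To make these formulas concrete, here is a small sketch that plugs illustrative numbers into the three schemes; every constant below (C(b), S, Na, Ma) is an assumption for illustration, not a measurement from the paper.

```python
# Illustrative speedup comparison for the three schemes above.
K, tau = 5, 50          # workers, local SGD steps per round
C_b = 1.0               # time for one gradient on batch size b (normalized)
C_b_over_K = C_b / K    # assume perfect scaling of per-iteration compute
S = 5.0                 # parameter-broadcast time (a high-latency cluster)
N_a = 10_000            # serial iterations to reach accuracy a (assumed)
M_a = 400               # SparkNet rounds to reach the same accuracy (assumed)

T0 = N_a * C_b                   # no parallelization
T1 = N_a * (C_b_over_K + S)      # naive parallelization
T2 = M_a * (tau * C_b + S)       # SparkNet parallelization

print(f"naive speedup    T0/T1 = {T0 / T1:.2f} (bounded by C(b)/S = {C_b / S:.2f})")
print(f"SparkNet speedup T0/T2 = {T0 / T2:.2f}")
```

With these numbers the naive scheme is slower than serial SGD (speedup 0.19, capped at 0.20), while SparkNet's infrequent synchronization more than doubles that, which is exactly the regime the next slides explore.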


SLIDE 11

Theoretical limitations of SparkNet parallelization

Speedup: T0 / T2 = (Na(b) × C(b)) / (Ma(b, K, τ) × (τ × C(b) + S))

Disregarding the overhead due to synchronization (S = 0), the speedup becomes Na(b) / (τ × Ma(b, K, τ)).

To measure Ma(b, K, τ), they ran SparkNet using a modified version of AlexNet on a subset of ImageNet (the first 100 classes, each with approximately 1,000 images).


SLIDE 12

Speedup disregarding communication

• K = 1 case: with only one worker, τ has no effect.
• τ = 1 case: equivalent to running serial SGD with a batch size of K × b.
• For a fixed K: surprisingly, the speedup does not increase as τ decreases.


SLIDE 13

Speedup with communication taken into account

• τ = 1, 2, 5, 10, 25, 100, 500, 1000, 2500.
• Naive parallelization gives no speedup when the communication overhead is large.
• SparkNet, however, gives a relatively consistent speedup even when the communication overhead is quite large.


SLIDE 14

Training benchmarks

Figure: Performance with AlexNet (left) and GoogLeNet (right) on the ImageNet dataset

• Train the default Caffe model of AlexNet on the ImageNet dataset; compare the wall-clock time required to reach an accuracy of 45%.
• Train the default Caffe model of GoogLeNet on the ImageNet dataset; compare the wall-clock time required to reach an accuracy of 40%.


SLIDE 15

Dependence of the parallelization scheme on τ

Each experiment was run with K = 5 workers.


SLIDE 16

Conclusion

• SparkNet is an easy-to-use deep learning implementation for Spark that is based on Caffe and enables parallelization of existing Caffe models with minimal modification.
• The approach is effective even in highly bandwidth-limited settings.


SLIDE 17

SLIDE 18

Background

• Batch-splitting (data parallelism) is the dominant training strategy for distributed DNNs, but it suffers from:
  • an inability to train very large models (memory constraints), and
  • high latency and inefficiency at small batch sizes.

• Model parallelism can solve these problems of batch-splitting, but it is:
  • complicated to specify the distribution strategies, and
  • difficult to compile and optimize.

• Mesh-TensorFlow: a language for specifying a general class of distributed tensor computations.

• Users can specify any tensor dimensions to be split across any dimensions of a multi-dimensional mesh of processors.


SLIDE 19

Hardware assumptions

• Assumes clusters of identical, reliable processors.
• A mesh is defined as an n-dimensional array of processors.
• Take a 512-core TPU cluster as an example; it could be represented by:
  • a 3-dimensional mesh with shape [16, 16, 2],
  • a 2-dimensional mesh with shape [32, 16], or
  • a 1-dimensional mesh with shape [512].

• The physical network topology of the cluster does not affect how the mesh is defined, nor the performance (illustrated below).
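These views really are just different factorizations of the same 512 processors, as a quick numpy sketch shows (mesh dimensions in Mesh-TensorFlow are named; this shows only the geometry):

```python
import numpy as np

processors = np.arange(512)               # one entry per TPU core
mesh_3d = processors.reshape(16, 16, 2)   # mesh shape [16, 16, 2]
mesh_2d = processors.reshape(32, 16)      # mesh shape [32, 16]
mesh_1d = processors.reshape(512)         # mesh shape [512]

# The same processor appears under different logical coordinates:
print(mesh_3d[0, 1, 1], mesh_2d[0, 3], mesh_1d[3])  # all refer to processor 3
```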


SLIDE 20

Single-Program-Multiple-Data (SPMD) Batch-Splitting

Inspired by the synchronous data-parallel scheme (similar to the parameter server model):

• Each processor keeps an identical copy of all parameters.
• The batch is split into sub-batches, one fed to each processor.
• Each processor computes the forward and backward pass on its sub-batch.
• The gradients over the sub-batches are summed, and the parameters are updated identically on all processors.

We can describe this as splitting the computation across the "batch" dimension. Mesh-TensorFlow generalizes this idea to splitting computations across arbitrary dimensions (see the sketch below).
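Below is a minimal numpy sketch of this synchronous batch-splitting scheme for a single linear layer; the all-reduce is simulated by a plain sum over the processor list, and the model and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_proc, batch, d_in, d_out = 4, 32, 8, 2
W = rng.normal(size=(d_in, d_out))   # identical copy on every processor
X = rng.normal(size=(batch, d_in))
Y = rng.normal(size=(batch, d_out))

# Split the batch dimension across the processors.
X_shards = np.split(X, n_proc)
Y_shards = np.split(Y, n_proc)

def local_grad(W, Xs, Ys):
    """Each processor's forward/backward pass on its sub-batch."""
    err = Xs @ W - Ys                # forward pass (squared loss residual)
    return Xs.T @ err                # backward pass: dLoss/dW

grads = [local_grad(W, Xs, Ys) for Xs, Ys in zip(X_shards, Y_shards)]

# All-reduce: sum the sub-batch gradients, then update every copy of W.
g = np.sum(grads, axis=0)
W -= 0.01 * g / batch
```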


SLIDE 21

Mesh Shape and Computation Layout

• A global "computation layout" is a partial map from tensor dimensions to mesh dimensions; it specifies which tensor dimensions are split across which dimensions of the processor mesh.
• mesh_shape: how the processors are organized, e.g. [('all', n)] or [('rows', r), ('cols', c)].
• computation_layout: how the data is split, e.g. [('batch', 'all')] or [('batch', 'rows'), ('hidden', 'cols')] (resolved in the sketch below).
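To make the mapping concrete, the sketch below resolves a computation layout against a mesh shape and reports each processor's shard shape; the shard_shape helper is a made-up illustration, not part of the Mesh-TensorFlow API.

```python
mesh_shape = [("rows", 4), ("cols", 2)]       # 8 processors in a 4 x 2 grid
layout = {"batch": "rows", "hidden": "cols"}  # partial map: tensor dim -> mesh dim

def shard_shape(tensor_dims, mesh_shape, layout):
    """Each tensor dimension mapped to a mesh dimension is divided by that
    mesh dimension's size; unmapped dimensions are replicated in full."""
    mesh = dict(mesh_shape)
    return {name: size // mesh[layout[name]] if name in layout else size
            for name, size in tensor_dims.items()}

# A [batch=512, hidden=1024] tensor under this layout:
print(shard_shape({"batch": 512, "hidden": 1024}, mesh_shape, layout))
# -> {'batch': 128, 'hidden': 512}: each processor holds a 128 x 512 shard
```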


SLIDE 22

Example: Two Fully-Connected Layers

• Consider two fully-connected layers in the middle of a neural network.
• The input layer x and the output layer y each have d_io units.
• The hidden layer h has d_h units and is activated by ReLU.
• y = ReLU(xw + bias)v (expressed in Mesh-TensorFlow below)
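This example is what Mesh-TensorFlow's named-dimension API is designed to express. The sketch below follows the style of the code in the paper, but the setup lines (graph, mesh, sizes, and the imported x) are our own assumptions, so treat it as an API sketch rather than a verbatim excerpt.

```python
import mesh_tensorflow as mtf
import tensorflow.compat.v1 as tf

b, d_io, d_h = 64, 512, 2048          # illustrative sizes
graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# Named dimensions; the layout decides which of these get split.
batch_dim = mtf.Dimension("batch", b)
io_dim = mtf.Dimension("io", d_io)
hidden_dim = mtf.Dimension("hidden", d_h)

# Import the activations x with shape [batch, io].
x = mtf.import_tf_tensor(mesh, tf.random.normal([b, d_io]),
                         shape=mtf.Shape([batch_dim, io_dim]))

w = mtf.get_variable(mesh, "w", mtf.Shape([io_dim, hidden_dim]))
bias = mtf.get_variable(mesh, "bias", mtf.Shape([hidden_dim]))
v = mtf.get_variable(mesh, "v", mtf.Shape([hidden_dim, io_dim]))

# h = ReLU(xw + bias); y = hv. einsum contracts the dimensions that are
# missing from output_shape, whatever the distribution layout is.
h = mtf.relu(mtf.einsum([x, w], output_shape=[batch_dim, hidden_dim]) + bias)
y = mtf.einsum([h, v], output_shape=[batch_dim, io_dim])
```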


SLIDE 23

Illustration of the Two Fully-Connected Layers


SLIDE 24

Data-Parallel Layout

• mesh_shape = [('all', n)]
• computation_layout = [('batch', 'all')]
• The number of values that must be communicated per processor is 2·d_io·d_h (the number of parameters).


SLIDE 25

Model-Parallel Layout

• mesh_shape = [('all', n)]
• computation_layout = [('hidden', 'all')]
• The number of values that must be communicated per processor is 2·b·d_io (when computing y and the gradient of x).


SLIDE 26

Data-Parallel, Model-Parallel Layout

• mesh_shape = [('rows', r), ('cols', c)]
• computation_layout = [('batch', 'rows'), ('hidden', 'cols')]
• Each row of processors handles a fraction of the batch; each column of processors handles a fraction of the hidden units.
• The number of values that must be communicated per processor is 2·d_io·d_h / c + 2·b·d_io / r.


SLIDE 27

Data-Parallel, Model-Parallel Layout (3-dimensional)

• mesh_shape = [('rows', r), ('cols', c), ('planes', p)]
• computation_layout = [('batch', 'rows'), ('hidden', 'cols'), ('io', 'planes')]
• The first mesh dimension handles a fraction of the batch, the second a fraction of the hidden units, and the third a fraction of the io units.
• The number of values that must be communicated per processor is 2·d_io·d_h / (c·p) + 2·b·d_io / (r·p) + 2·b·d_h / (r·c) (compared numerically in the sketch below).
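The per-processor communication costs of the four layouts are easy to compare numerically; in the sketch below, b, d_io, d_h and the mesh factorizations are illustrative assumptions, not values from the paper.

```python
# Per-processor communication volume for each layout (in values, not bytes).
b, d_io, d_h = 512, 1024, 4096

data_parallel  = 2 * d_io * d_h          # layout [('batch', 'all')]
model_parallel = 2 * b * d_io            # layout [('hidden', 'all')]

r, c = 8, 8                              # 2-D mesh [('rows', r), ('cols', c)]
mixed_2d = 2 * d_io * d_h / c + 2 * b * d_io / r

r, c, p = 4, 4, 4                        # 3-D mesh with an extra 'planes' axis
mixed_3d = (2 * d_io * d_h / (c * p)
            + 2 * b * d_io / (r * p)
            + 2 * b * d_h / (r * c))

for name, v in [("data-parallel", data_parallel),
                ("model-parallel", model_parallel),
                ("2-D mixed", mixed_2d), ("3-D mixed", mixed_3d)]:
    print(f"{name:>14}: {v:,.0f} values per processor")
```

With these sizes the mixed layouts spread the cost: the 2-D mesh cuts communication roughly 7x versus pure data parallelism, and the 3-D mesh roughly 10x.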


SLIDE 28

Performance Comparison

Efficiency is governed by the ratio of communication to computation:

• Communication (numerator): grows only linearly as the batch size and layer sizes increase.
• Computation (denominator): grows quadratically or cubically in those sizes, so linearly increasing the batch size and the hidden-layer size is enough to maintain good efficiency.


SLIDE 29

Experiments and Results: "Transformer"

• Train a model-parallel layout of the Transformer attention-based sequence-to-sequence model.
• Also train a data-parallel, model-parallel layout of the Transformer (TPUv2, 16 × 32 = 512 cores).


SLIDE 30

Billion-word language modeling benchmark

• Trained the models for 10 epochs.
• The largest model (4.9B parameters) took 13 hours to train on a 512-core TPUv2 cluster.
• The batch size for all models was 256 sequences of 256 tokens each.


SLIDE 31

WMT14 En-Fr translation task

• Trained the models for 3 epochs.
• The largest model (2.9B parameters) was trained for 22 hours on a 128-core TPUv2 cluster.


SLIDE 32

Conclusion

• The Mesh-TensorFlow language facilitates a broad class of SPMD distributed tensor computations.
• It makes it possible to train very large models (5 billion parameters) on large clusters (512 cores).
• Performance is excellent, establishing new state-of-the-art results on the WMT14 En-Fr translation task and the One Billion Word language modeling benchmark.


SLIDE 33

Thank you
