
Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis - PowerPoint PPT Presentation



  1. T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. https://www.arxiv.org/abs/1802.09941 (spcl.inf.ethz.ch, @spcl_eth)

  2. What is Deep Learning good for? Digit recognition, image captioning, object classification, segmentation, translation, gameplay AI, neural computers. A very promising area of research: about 23 papers per day! [Chart: number of papers per year, 1989-2017]

  3. How does Deep Learning work? A network f(x) maps an input image to class probabilities (Cat 0.54, Dog 0.28, Airplane 0.07, ...) that are compared with the one-hot true label (Cat 1.00) to drive layer-wise weight updates. With ~0.8 bn users of such services, Deep Learning is Supercomputing! (Canziani et al. 2017)
     ▪ ImageNet (1k): 180 GB; ImageNet (22k): a few TB; industry: much larger and growing (e.g., face recognition)
     ▪ 10-22k labels, 100-200 layers deep, ~100M-2B parameters
     ▪ 0.1-8 GiB parameter storage, weeks to train

  4. A brief theory of supervised deep learning. Labeled samples y ∈ Y ⊂ 𝒟, label domain Z, true label m(y). The network g(y) = g_o(g_{o−1}(⋯ g_1(y) ⋯)), g: Y → Z, is built from a fixed structure of layers (convolution 1, convolution 2, pooling, convolution 3, fully connected, ...) with learned weights x. Loss functions:
     ▪ ℓ_{0−1}(x, y) = 0 if g(y) = m(y), 1 otherwise
     ▪ ℓ_{sq}(x, y) = ‖g(y) − m(y)‖²
     ▪ ℓ_{ce}(x, y) = −Σ_j m(y)_j · log( e^{g(y)_j} / Σ_l e^{g(y)_l} )  (cross-entropy over a softmax)
     Learning problem: x* = argmin_{x ∈ ℝ^d} 𝔼_{y∼𝒟}[ ℓ(x, y) ]
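As a small worked example (not on the slide), the cross-entropy loss defined above can be evaluated directly; the NumPy code and names below are illustrative only.

```python
import numpy as np

def cross_entropy(g_y, m_y):
    """l_ce for one sample: g_y are the network outputs, m_y is the one-hot true label."""
    p = np.exp(g_y - g_y.max())      # softmax, shifted for numerical stability
    p /= p.sum()
    return -np.sum(m_y * np.log(p))

# Example: a three-class output scored against the true class "Cat".
print(cross_entropy(np.array([2.0, 0.5, -1.0]), np.array([1.0, 0.0, 0.0])))
```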

  5. Stochastic Gradient Descent minimizes x* = argmin_{x ∈ ℝ^d} 𝔼_{y∼𝒟}[ ℓ(x, y) ] by following stochastic gradient estimates through the layered network (convolution 1 → convolution 2 → pooling → ⋯ → convolution 3 → fully connected).
     ▪ Layer storage = |x_ℓ| + |g_ℓ(p_{ℓ−1})| + |∇x_ℓ| + |∇p_ℓ|  (weights, forward activations, weight gradients, activation gradients)
     T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018

  6. Trends in deep learning: hardware and multi-node. The field is moving fast, trying everything imaginable; survey results from 227 papers in the area of parallel deep learning. [Charts: hardware used; shared vs. distributed memory.] Deep Learning is largely on distributed memory today! T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018

  7. Trends in distributed deep learning: node count and communication. The field is moving fast, trying everything imaginable; survey results from 227 papers in the area of parallel deep learning. Deep Learning research is converging to MPI! T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018

  8. Minibatch Stochastic Gradient Descent (SGD): each iteration draws a small batch of samples, averages their gradients, and applies a single weight update. [Figure: predicted class probabilities vs. the true one-hot label, as before.] T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018
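To make the procedure concrete, here is a minimal NumPy sketch of minibatch SGD on a toy two-layer network; the sizes, the squared loss, and all names are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((1000, 32)), rng.standard_normal((1000, 10))   # toy dataset
W1, W2 = 0.1 * rng.standard_normal((32, 64)), 0.1 * rng.standard_normal((64, 10))
lr, B = 1e-2, 64                                    # learning rate and minibatch size

for step in range(100):
    idx = rng.choice(len(X), B, replace=False)      # draw a minibatch
    x, y = X[idx], Y[idx]
    h = np.maximum(x @ W1, 0.0)                     # forward pass (ReLU hidden layer)
    out = h @ W2
    d_out = 2.0 * (out - y) / B                     # gradient of the mean squared loss
    dW2 = h.T @ d_out                               # backward through the output layer
    dh = (d_out @ W2.T) * (h > 0)                   # backward through the ReLU
    dW1 = x.T @ dh
    W1 -= lr * dW1                                  # SGD update with the averaged gradients
    W2 -= lr * dW2
```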

  9. A primer of relevant parallelism and communication theory. Work-depth model: e.g., work W = 39, depth D = 7; average parallelism = W/D. Parallel reductions for parameter updates, with latency L, cost G per byte, γ bytes per value, m values, and P processes:
     ▪ Tree (small vectors): T = 2L log₂P + 2γmG log₂P
     ▪ Butterfly (small vectors): T = L log₂P + γmG log₂P
     ▪ Pipeline (large vectors): T = 2L(P − 1) + 2γmG(P − 1)/P
     ▪ Reduce-Scatter + Gather (large vectors): T = 2L log₂P + 2γmG(P − 1)/P
     Lower bound: T ≥ L log₂P + 2γmG(P − 1)/P
     E. Chan et al.: Collective communication: theory, practice, and experience, CCPE'07; T. Hoefler, D. Moor: Energy, Memory, and Runtime Tradeoffs for Implementing Collective Communication Operations, JSFI'14
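For a concrete comparison, the small helper below evaluates the four cost models and the lower bound from this slide for given machine parameters; the default values for L, G, and γ are illustrative.

```python
from math import log2

def allreduce_costs(P, m, L=1e-6, G=1e-9, gamma=4):
    """Evaluate the slide's reduction cost models for P processes
    all-reducing m values of gamma bytes each (latency L, per-byte cost G)."""
    band = gamma * m * G                 # time to transfer the full vector once
    return {
        "tree":         2 * L * log2(P) + 2 * band * log2(P),
        "butterfly":    L * log2(P) + band * log2(P),
        "pipeline":     2 * L * (P - 1) + 2 * band * (P - 1) / P,
        "redscat+gat":  2 * L * log2(P) + 2 * band * (P - 1) / P,
        "lower bound":  L * log2(P) + 2 * band * (P - 1) / P,
    }

# Example: 256 processes all-reducing 25M float32 gradient values.
print(allreduce_costs(P=256, m=25_000_000))
```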

  10. GoogLeNet in more detail: ~6.8M parameters, 22 layers deep. C. Szegedy et al.: Going Deeper with Convolutions, CVPR'15

  11. Parallelism in the different layer types. Applying an operator g_ℓ(y) and its gradients ∇x, ∇p_ℓ is highly parallel: W is linear and D logarithmic – large average parallelism. [Figure: a small convolution and a max-pooling example, with the forward outputs g_ℓ(y) and backward gradients ∇x, ∇p_ℓ.] T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018

  12. Computing fully connected layers. Each output neuron computes σ(Σ_j x_{j,i} · y_j + c_i) from inputs y_j, weights x_{j,i}, and bias c_i. For a minibatch this becomes a single dense matrix-matrix multiplication: the activation matrix (one row per sample, with a trailing column of ones) times the weight matrix (with the biases c_i as its last row).
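A minimal NumPy sketch of this lowering, assuming a logistic activation; the function name and shapes are illustrative.

```python
import numpy as np

def fc_forward(Y, X, c):
    """Fully connected layer for a minibatch as one dense matrix product.
    Y: (batch, n_in) activations, X: (n_in, n_out) weights, c: (n_out,) biases."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))       # example activation function
    Y1 = np.hstack([Y, np.ones((Y.shape[0], 1))])    # append a column of ones for the bias
    X1 = np.vstack([X, c[None, :]])                  # biases become the last weight row
    return sigma(Y1 @ X1)

out = fc_forward(np.random.rand(8, 3), np.random.rand(3, 2), np.zeros(2))   # shape (8, 2)
```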

  13. Computing convolutional layers: direct convolution; indirect via im2col lowering to a matrix multiplication, via the FFT (ℱ(x), pointwise product, ℱ⁻¹), or via Winograd's minimal filtering algorithms.
     S. Chetlur et al.: cuDNN: Efficient Primitives for Deep Learning, arXiv 2014; K. Chellapilla et al.: High Performance Convolutional Neural Networks for Document Processing, Int'l Workshop on Frontiers in Handwriting Recognition 2006; M. Mathieu et al.: Fast Training of Convolutional Networks through FFTs, ICLR'14; A. Lavin and S. Gray: Fast Algorithms for Convolutional Neural Networks, CVPR'16; X. Liu et al.: Efficient Sparse-Winograd Convolutional Neural Networks, ICLR'17 Workshop
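To illustrate the im2col lowering mentioned above, here is a minimal NumPy sketch for a single-channel input with stride 1 and no padding; the function is illustrative and not cuDNN's API.

```python
import numpy as np

def im2col(x, k):
    """Rearrange all k-by-k patches of a 2D input into columns (stride 1, no padding)."""
    h, w = x.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

# The convolution (cross-correlation) then becomes a single matrix product.
x = np.random.rand(4, 4)                      # input feature map
w = np.random.rand(3, 3)                      # filter
y = (w.ravel() @ im2col(x, 3)).reshape(2, 2)  # output feature map
```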

  14. Model parallelism
     ▪ Parameters can be distributed across processors
     ▪ Mini-batch has to be copied to all processors
     ▪ Backpropagation requires all-to-all communication every layer
     U.A. Muller and A. Gunzinger: Neural Net Simulation on Parallel Computers, IEEE Int'l Conf. on Neural Networks 1994
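A minimal sketch of model parallelism for one fully connected layer, assuming mpi4py and NumPy; all sizes and names are illustrative.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()

batch, n_in, n_out = 32, 256, 512             # the n_out neurons are split across the P processes
assert n_out % P == 0
w_shard = np.random.rand(n_in, n_out // P)    # each rank stores only its slice of the parameters

y = np.random.rand(batch, n_in)               # the minibatch is copied to every processor
z_shard = y @ w_shard                         # each rank computes its slice of the layer output

z_all = np.empty((P, batch, n_out // P))
comm.Allgather(z_shard, z_all)                # assemble the full output on every rank
# Backpropagation similarly needs all-to-all style communication in every layer.
```

Run with, e.g., `mpiexec -n 4 python model_parallel.py` (the file name is illustrative).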

  15. Pipeline parallelism
     ▪ Layers/parameters can be distributed across processors
     ▪ Sparse communication pattern (only between pipeline stages)
     ▪ Mini-batch has to be copied through all processors
     G. Blelloch and C.R. Rosenberg: Network Learning on the Connection Machine, IJCAI'87
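A minimal sketch of the idea, assuming mpi4py and NumPy: one layer per rank, activations forwarded stage to stage; the microbatch sizes and names are illustrative and the backward pass is omitted.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()

n = 128
w = np.random.rand(n, n)                  # this rank's (this stage's) layer weights

for _ in range(4):                        # a few microbatches flow through the pipeline
    if rank == 0:
        a = np.random.rand(8, n)          # stage 0 produces the input microbatch
    else:
        a = np.empty((8, n))
        comm.Recv(a, source=rank - 1)     # receive activations from the previous stage
    a = np.maximum(a @ w, 0.0)            # apply this stage's layer (ReLU)
    if rank < P - 1:
        comm.Send(a, dest=rank + 1)       # pass activations on to the next stage
```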

  16. Data parallelism
     ▪ Simple and efficient solution, easy to implement
     ▪ Duplicate parameters at all processors
     X. Zhang et al.: An Efficient Implementation of the Back-propagation Algorithm on the Connection Machine CM-2, NIPS'89

  17. Hybrid parallelism: combining data parallelism, model parallelism, and layer (pipeline) parallelism
     ▪ Layers/parameters can be distributed across processors
     ▪ Can distribute the minibatch
     ▪ Often specific to layer types (e.g., distribute fully connected layers but handle convolutional layers data-parallel)
     ▪ Enables arbitrary combinations of data, model, and pipeline parallelism – very powerful!
     A. Krizhevsky: One weird trick for parallelizing convolutional neural networks, arXiv 2014; J. Dean et al.: Large scale distributed deep networks, NIPS'12; T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018

  18. Updating parameters in distributed data parallelism
     ▪ Decentral: collective allreduce of ∇x, then every training agent applies x' = u(x, ∇x); T = 2L log₂P + 2γmG(P − 1)/P. Design space: collective operations, topologies, neighborhood collectives, RMA?
     ▪ Central: training agents send ∇x to a (sharded) parameter server, which applies x' = u(x, ∇x) and returns x; T = 2L + 2P γ(m/s) G for s shards.
     ▪ Hierarchical Parameter Server: S. Gupta et al.: Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study, ICDM'16
     ▪ Adaptive Minibatch Size: S. L. Smith et al.: Don't Decay the Learning Rate, Increase the Batch Size, arXiv 2017
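A minimal sketch of the decentralized variant, assuming mpi4py and NumPy; the random gradients stand in for real backpropagation output, and the update rule u is plain SGD here.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P = comm.Get_size()

x = np.ones(1_000_000, dtype=np.float32)                 # parameters, replicated on every training agent
lr = 0.01

for step in range(10):
    grad = np.random.rand(x.size).astype(np.float32)     # local gradient from this agent's part of the minibatch
    comm.Allreduce(MPI.IN_PLACE, grad, op=MPI.SUM)       # collective allreduce of the gradients
    grad /= P                                            # average over the P training agents
    x -= lr * grad                                       # identical update x' = u(x, grad) everywhere
```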
