

SLIDE 1

spcl.inf.ethz.ch @spcl_eth

  • T. HOEFLER

High-Performance Communication in Machine Learning

Keynote at Austrian HPC Meeting, Feb. 2019, Grundlsee

https://www.arxiv.org/abs/1802.09941

WITH CONTRIBUTIONS FROM TAL BEN-NUN, DAN ALISTARH, SHOSHANA JAKOBOVITS, CEDRIC RENGGLI, AND OTHERS AT SPCL AND IST AUSTRIA

SLIDE 2

What is Deep Learning good for?

[Timeline figure: Digit Recognition (1989), Object Classification (2012), Gameplay AI (2013), Translation (2014), Neural Computers (2016), Image Captioning and Segmentation (2017); plot of the number of papers per year]

A very active area of research!

23 papers per day!

SLIDE 3

How does Deep Learning work?

[Figure: a network maps an image to class confidences (Cat 0.54, Dog 0.28, …), which training drives toward the one-hot true label (Cat 1.00); cost data after Canziani et al. 2017; number of users 0.8 bn]

▪ f(x): layer-wise weight update
▪ ImageNet (1k): 180 GB; ImageNet (22k): a few TB; industry datasets: much larger
▪ Networks are 100-200 layers deep with ~100M-2B parameters (0.1-8 GiB of parameter storage)
▪ 10-22k labels, and growing (e.g., face recognition)
▪ Training takes weeks

Deep Learning is Supercomputing!

SLIDE 4

A brief theory of supervised deep learning

[Figure: a sample passes through convolution 1, convolution 2, convolution 3, pooling, and fully connected layers to class confidences (Cat 0.54, Dog 0.28, …), compared against the one-hot true label (Cat 1.00)]

Labeled samples $y \in Y \subset \mathcal{D}$, label domain $Z$, network function $g(y): Y \to Z$ with fixed structure and learned weights $x$, true label $m(y)$:

$$x^* = \operatorname{argmin}_{x \in \mathbb{R}^d} \; \mathbb{E}_{y \sim \mathcal{D}} \left[ \ell(x, y) \right]$$

$$\ell_{0-1}(x, y) = \begin{cases} 0 & g(y) = m(y) \\ 1 & g(y) \neq m(y) \end{cases}$$

The network is a composition of layer functions:

$$g(y) = g_L\left(g_{L-1}\left(g_{L-2}\left(\cdots g_1(y) \cdots\right)\right)\right)$$

Typical continuous losses are the cross-entropy over the softmax output and the squared error:

$$\ell_{ce}(x, y) = -\sum_j m(y)_j \cdot \log \frac{e^{g(y)_j}}{\sum_l e^{g(y)_l}}$$

$$\ell_{sq}(x, y) = \left\| g(y) - m(y) \right\|^2$$
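The cross-entropy loss over the softmax output is easy to make concrete. A minimal numpy sketch (the names `g_y` for the network output g(y) and `m_y` for the one-hot label m(y) are mine, not from the slides):

```python
import numpy as np

def softmax(g_y):
    # subtract the max for numerical stability; does not change the result
    e = np.exp(g_y - np.max(g_y))
    return e / e.sum()

def cross_entropy(g_y, m_y):
    # -sum_j m(y)_j * log(softmax(g(y))_j)
    return -np.sum(m_y * np.log(softmax(g_y)))

g_y = np.array([2.0, 1.0, 0.1])   # raw network outputs (logits)
m_y = np.array([1.0, 0.0, 0.0])   # one-hot true label
loss = cross_entropy(g_y, m_y)    # small when the network is confident and right
```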

SLIDE 5

Stochastic Gradient Descent

[Figure: forward pass through convolution 1, convolution 2, convolution 3, pooling, and fully connected layers, producing $g_1(y)$, $g_2(g_1(y))$, …, $g(y)$]

▪ Layer storage: weights $x_m$, forward activations $g_m(p_{m-1})$, weight gradients $\nabla x_m$, and activation gradients $\nabla p_m$

$$x^* = \operatorname{argmin}_{x \in \mathbb{R}^d} \; \mathbb{E}_{y \sim \mathcal{D}} \left[ \ell(x, y) \right]$$

  • T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018
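A minimal SGD loop on a toy least-squares model makes the update rule concrete. This is purely illustrative (numpy only, no deep network); all names and constants are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))             # toy dataset: 256 labeled samples
w_true = np.array([1.0, -2.0, 0.5, 3.0])  # ground-truth weights
t = X @ w_true                            # labels

x = np.zeros(4)                           # weights to learn
eta, batch = 0.1, 32                      # learning rate, minibatch size
for step in range(500):
    idx = rng.integers(0, len(X), size=batch)    # draw a random minibatch
    Xb, tb = X[idx], t[idx]
    grad = 2 * Xb.T @ (Xb @ x - tb) / batch      # gradient of the squared loss
    x -= eta * grad                              # the SGD update
```

After a few hundred steps `x` is close to `w_true`; a real training loop differs only in how the gradient is computed (backpropagation through the layers).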
SLIDE 6

Trends in deep learning: hardware and multi-node

The field is moving fast, trying everything imaginable. Survey results from 227 papers in the area of parallel deep learning. [Charts: hardware used; shared vs. distributed memory]

Deep Learning is largely on distributed memory today!

  • T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018
SLIDE 7

Trends in distributed deep learning: node count and communication

The field is moving fast, trying everything imaginable. Survey results from 227 papers in the area of parallel deep learning. [Charts: node count over time; communication mode]

Deep Learning research is converging to MPI!

  • T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018

SLIDE 8

  • E. Chan et al.: Collective communication: theory, practice, and experience. CCPE’07
  • TH, D. Moor: Energy, Memory, and Runtime Tradeoffs for Implementing Collective Communication Operations, JSFI’14

A primer of relevant parallelism and communication theory

[Figure: an example task graph with work $W = 39$ and depth $D = 7$]

Average parallelism $= W / D$

Parallel Reductions for Parameter Updates ($Q$ processes, latency $M$ per message, $\delta n H$ time to transfer the full $n$-element vector):

Small vectors:
▪ Tree: $U = 2M \log_2 Q + 2\delta n H \log_2 Q$
▪ Butterfly: $U = M \log_2 Q + \delta n H \log_2 Q$

Large vectors:
▪ Pipeline: $U = 2M(Q-1) + 2\delta n H (Q-1)/Q$
▪ RedScat+Gat: $U = 2M \log_2 Q + 2\delta n H (Q-1)/Q$

Lower bound: $U \geq M \log_2 Q + 2\delta n H (Q-1)/Q$
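The closed forms can be evaluated numerically to pick a reduction algorithm for a given vector size. This sketch just plugs numbers into the formulas above (symbols follow the slide: M latency, δnH data-transfer term, Q processes; the example values are invented):

```python
from math import log2

# cost models for an allreduce of an n-element vector; D stands for the
# aggregate data term (delta * n * H in the slide's notation)
def tree(M, D, Q):        return 2*M*log2(Q) + 2*D*log2(Q)
def butterfly(M, D, Q):   return M*log2(Q) + D*log2(Q)
def pipeline(M, D, Q):    return 2*M*(Q-1) + 2*D*(Q-1)/Q
def redscat_gat(M, D, Q): return 2*M*log2(Q) + 2*D*(Q-1)/Q

def best(M, D, Q):
    """Name of the cheapest algorithm under the given parameters."""
    algs = {'tree': tree, 'butterfly': butterfly,
            'pipeline': pipeline, 'redscat+gat': redscat_gat}
    return min(algs, key=lambda a: algs[a](M, D, Q))

small = best(M=1.0, D=0.001, Q=64)   # latency-bound: butterfly wins
large = best(M=1.0, D=1000.0, Q=64)  # bandwidth-bound: redscat+gat wins
```

This mirrors what MPI libraries do internally: switch the collective algorithm based on message size and process count.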

SLIDE 9

Parallelism in Deep Learning

▪ Individual operators
▪ Network parallelism
▪ Optimization algorithm
▪ Distributed training

[Figure: the three levels (Operators, Networks, Training) with training agents connected to a parameter server]

SLIDE 10

Parallelism in the different layer types

[Figure: pipeline of layers, each holding $g_m(y)$, $\nabla x$, and $\nabla p_m$]

Example convolution of a 4×4 input with a 3×3 kernel:

Input: [4 1 9 8; 5 9 9 8; 0 7 3 4; 2 6 3 1]
Kernel: [1 -1 0; 0.1 -2 0; 3 4 1.1]
Output: [21.9 59.3 53.9 43.9; -6.3 16.8 12.3 12; 9.6 15.3 25.8 14; 0.4 7.1 52.1 53.1]
2×2 max-pooled output: [59.3 53.9; 15.3 53.1]

W is linear and D logarithmic – large average parallelism

  • T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018
SLIDE 11

Example: Options for computing convolutional layers

▪ Direct
▪ Indirect: im2col (convolution as matrix multiplication)
▪ Indirect: FFT, $x * w = \mathcal{F}^{-1}\left(\mathcal{F}(x) \times \mathcal{F}(w)\right)$
▪ Indirect: Winograd (minimal filtering algorithms)

[Figure: a 4×4 input convolved with a 3×3 kernel directly, and the same convolution via FFT: transform, elementwise multiply, inverse transform]

  • X. Liu et al.: Efficient Sparse-Winograd Convolutional Neural Networks, ICLR’17 Workshop
  • S. Chetlur et al.: cuDNN: Efficient Primitives for Deep Learning, arXiv 2014
  • K. Chellapilla et al.: High Performance Convolutional Neural Networks for Document Processing, Int’l Workshop on Frontiers in Handwriting Recognition 2006
  • M. Mathieu et al.: Fast Training of Convolutional Networks through FFTs, ICLR’14
  • A. Lavin and S. Gray: Fast Algorithms for Convolutional Neural Networks, CVPR’16
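The im2col option can be sketched in a few lines of numpy: unfold every k×k patch of the input into a column, and the convolution collapses into a single matrix multiplication (a GEMM). As deep-learning frameworks do, this computes cross-correlation, i.e., the kernel is not flipped; all names are illustrative:

```python
import numpy as np

def im2col(x, k):
    """Unfold all k-by-k patches of a 2-D image x into columns (valid mode)."""
    H, W = x.shape
    cols = [x[i:i+k, j:j+k].ravel()
            for i in range(H - k + 1) for j in range(W - k + 1)]
    return np.array(cols).T               # shape: (k*k, number of patches)

def conv2d_im2col(x, w):
    k = w.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    # the whole convolution is now one matrix-vector (or matrix-matrix) product
    return (w.ravel() @ im2col(x, k)).reshape(out_h, out_w)

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3))
y = conv2d_im2col(x, w)   # 2x2 output; each entry sums a 3x3 patch of x
```

With many input channels and filters, the unfolded matrix times the filter matrix is exactly the large GEMM that makes this formulation fast on GPUs.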
SLIDE 12

Minibatch Stochastic Gradient Descent (SGD)

[Figure: network output confidences (Cat 0.54, Dog 0.28, …) versus the one-hot true label (Cat 1.00)]

  • T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018
SLIDE 13

▪ In cuDNN there are ~16 convolution implementations
▪ Performance depends on temporary memory (workspace) size
▪ Key idea: segment the minibatch into microbatches, reuse the workspace, use different algorithms
▪ How to choose microbatch sizes and algorithms? Dynamic Programming (space reuse) and Integer Linear Programming (space sharing)

Microbatching (µ-cuDNN) – how to implement layers best in practice?

Yosuke Oyama, Tal Ben-Nun, TH and Satoshi Matsuoka: µ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching, Cluster 2018

[Chart: fast (up to 4.54x faster on DeepBench); microbatching strategies: none (undivided), powers-of-two only, any (unrestricted)]
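The space-reuse variant can be sketched as a small dynamic program over microbatch splits. Everything numeric below is invented for illustration (the time/workspace models and the limit); µ-cuDNN itself works from measured cuDNN algorithm times and also offers an ILP formulation for workspace sharing:

```python
from functools import lru_cache

# Hypothetical per-algorithm cost models: time(b) and workspace(b) for a
# microbatch of b samples. Real microbatching would measure these in cuDNN.
ALGS = {
    'implicit_gemm': (lambda b: 10.0 * b,       lambda b: 0),
    'fft':           (lambda b: 2.0 * b + 50.0, lambda b: 64 * b),
    'winograd':      (lambda b: 1.0 * b + 30.0, lambda b: 256 * b),
}

def best_alg(b, ws_limit):
    """Fastest algorithm whose workspace fits a microbatch of size b."""
    fits = [(t(b), name) for name, (t, w) in ALGS.items() if w(b) <= ws_limit]
    return min(fits)

@lru_cache(maxsize=None)
def plan(n, ws_limit):
    """Minimum total time to process n samples split into microbatches.
    The workspace is reused across microbatches, so only the per-microbatch
    limit matters (space reuse, not space sharing)."""
    if n == 0:
        return 0.0, ()
    best = None
    for b in range(1, n + 1):               # size of the first microbatch
        t, name = best_alg(b, ws_limit)
        rest_t, rest_sched = plan(n - b, ws_limit)
        cand = (t + rest_t, ((b, name),) + rest_sched)
        if best is None or cand < best:
            best = cand
    return best

time_taken, schedule = plan(256, ws_limit=16384)
# with these made-up costs, several Winograd microbatches beat one
# undivided FFT pass, even though Winograd alone cannot fit the full batch
```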

SLIDE 14

▪ Parameters can be distributed across processors
▪ Mini-batch has to be copied to all processors
▪ Backpropagation requires all-to-all communication every layer

Model parallelism – limited by network size

  • U.A. Muller and A. Gunzinger: Neural Net Simulation on Parallel Computers, IEEE Int’l Conf. on Neural Networks 1994

SLIDE 15

Pipeline parallelism – limited by network size

▪ Layers/parameters can be distributed across processors
▪ Sparse communication pattern (only pipeline stages)
▪ Mini-batch has to be copied through all processors

  • G. Blelloch and C.R. Rosenberg: Network Learning on the Connection Machine, IJCAI’87

SLIDE 16

Data parallelism – limited by batch-size

▪ Simple and efficient solution, easy to implement
▪ Duplicate parameters at all processors

  • X. Zhang et al.: An Efficient Implementation of the Back-propagation Algorithm on the Connection Machine CM-2, NIPS’89
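In essence, each data-parallel replica computes a gradient on its shard of the minibatch, and an allreduce averages the gradients so every replica applies the same update. A toy numpy simulation (the allreduce is stood in by a local mean; in practice this would be MPI_Allreduce or a NCCL collective, and the replicas would be separate processes):

```python
import numpy as np

def allreduce_mean(grads):
    # stand-in for an allreduce: every replica ends up with the average
    avg = np.mean(grads, axis=0)
    return [avg.copy() for _ in grads]

rng = np.random.default_rng(1)
X = rng.normal(size=(128, 3))
t = X @ np.array([2.0, -1.0, 0.5])      # toy labels from known weights

P = 4                                    # number of data-parallel replicas
shards = np.array_split(np.arange(128), P)
x = np.zeros(3)                          # parameters, replicated on all P
for step in range(300):
    grads = []
    for s in shards:                     # each replica: gradient on its shard
        Xb, tb = X[s], t[s]
        grads.append(2 * Xb.T @ (Xb @ x - tb) / len(s))
    grads = allreduce_mean(grads)        # collective allreduce of gradients
    x -= 0.1 * grads[0]                  # identical update on every replica
```

Because all replicas see the same averaged gradient, their parameter copies never diverge, which is exactly why only the gradients (not the weights) need to be communicated.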
SLIDE 17

Hybrid parallelism

  • A. Krizhevsky: One weird trick for parallelizing convolutional neural networks, arXiv 2014
  • J. Dean et al.: Large scale distributed deep networks, NIPS’12.
  • T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018

▪ Layers/parameters can be distributed across processors
▪ Can distribute the minibatch
▪ Often specific to layer types (e.g., distribute fc layers but handle conv layers data-parallel)
▪ Enables arbitrary combinations of data, model, and pipeline parallelism – very powerful!

[Figure: Model Parallelism, Data Parallelism, and Layer (pipeline) Parallelism combined]

SLIDE 18

Updating parameters in distributed data parallelism

Central: a (sharded) parameter server receives gradients $\nabla x$ from the training agents and returns updated parameters $x' = v(x, \nabla x)$:

$$U = 2M + 2Q \, \delta (n/t) H \quad \text{(parameter server with } t \text{ shards)}$$

Decentral: the training agents perform a collective allreduce of $x$:

$$U = 2M \log_2 Q + 2\delta n H (Q-1)/Q \quad \text{(allreduce)}$$

MPI machinery that applies here:
  • Collective operations
  • Topologies
  • Neighborhood collectives
  • RMA?

Hierarchical Parameter Server
  • S. Gupta et al.: Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study. ICDM’16

Adaptive Minibatch Size
  • S. L. Smith et al.: Don’t Decay the Learning Rate, Increase the Batch Size, arXiv 2017

SLIDE 19

▪ Started with Hogwild! [Niu et al. 2011] – shared memory, by chance
▪ DistBelief [Dean et al. 2012] moved the idea to distributed memory
▪ Trades off “statistical performance” for “hardware performance”
▪ Parameter exchange frequency can be controlled, while still attaining convergence

Parameter (and Model) consistency - centralized

A (sharded) parameter server holds $x$ and applies updates $x' = v(x, \nabla x)$ from the training agents. Three consistency regimes:

▪ Synchronous: all agents wait at a synchronization point; the server produces $x^{(1)}, x^{(2)}, \ldots, x^{(U)}$ in lockstep
▪ Stale synchronous / bounded asynchronous: agents may run ahead up to a maximum staleness before synchronizing
▪ Asynchronous: each agent $j$ pushes updates $x^{(1,j)}, x^{(2,j)}, \ldots$ whenever it is ready, with no synchronization

[Timelines: parameter server with agents 1..m under synchronous, stale-synchronous, and asynchronous execution]

  • J. Dean et al.: Large scale distributed deep networks, NIPS’12.
  • F. Niu et al.: Hogwild: A lock-free approach to parallelizing stochastic gradient descent, NIPS’11.

SLIDE 20

Parameter (and Model) consistency - decentralized

▪ Parameter exchange frequency can be controlled, while still attaining convergence
▪ May also consider limited/slower distribution – gossip [Jin et al. 2016]

The training agents replace the parameter server with a collective allreduce of $x$; the consistency regimes mirror the centralized case:

▪ Synchronous: all agents join each allreduce, producing $x^{(1)}, x^{(2)}, \ldots, x^{(U)}$
▪ Stale synchronous / bounded asynchronous: agents merge with a bounded maximum staleness
▪ Asynchronous / gossip: agents (e.g., agent r and agent k) exchange and merge their parameter versions pairwise whenever they meet

[Timelines: agents 1..m under synchronous, stale-synchronous, and gossip-style asynchronous allreduce]

Peter H. Jin et al., “How to scale distributed deep learning?”, NIPS MLSystems 2016

SLIDE 21

Parameter consistency in deep learning

Inconsistent Ensemble Learning Synchronous SGD Consistent Stale-Synchronous SGD Model Averaging (e.g., elastic) Asynchronous SGD (HOGWILD!)

𝑥 𝑢+1,𝑗 = 𝑥 𝑢,𝑗 − 𝜃𝛼𝑥 𝑢,𝑗 − 𝛽 𝑥 𝑢,𝑗 − ෥ 𝑥𝑢 ෥ 𝑥𝑢+1 = 1 − 𝛾 ෥ 𝑥𝑢 + 𝛾 𝑛 ෍

𝑗=1 𝑛

𝑥 𝑢,𝑗

𝑥 1,1

Time

Parameter Server

Agent 1 Agent m

. . .

𝑥 𝑈 𝑥 0

Sync.

𝑥 2,1 𝑥 3,1 𝑥 4,1 𝑥 5,1 𝑥 6,1 𝑥 1,𝑛 𝑥 2,𝑛 𝑥 3,𝑛 𝑥 4,𝑛 𝑥 5,𝑛 𝑥 6,𝑛

Elastic Average

  • S. Zhang et al.: Deep learning with Elastic Averaging SGD, NIPS’15

Using physical forces between different versions of 𝑥:
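One elastic-averaging round per the update equations can be sketched directly; this is a toy, single-process illustration (function and parameter names are mine, default constants invented):

```python
import numpy as np

def easgd_step(x_workers, x_center, grads, theta=0.1, beta=0.05, gamma=0.5):
    """One round of Elastic Averaging SGD (cf. Zhang et al., NIPS'15).

    x_workers: list of per-agent parameter vectors x_{u,j}
    x_center:  the center variable (the "elastic anchor")
    grads:     per-agent gradients
    """
    n = len(x_workers)
    # each worker takes an SGD step and is pulled toward the center
    new_workers = [x - theta * g - beta * (x - x_center)
                   for x, g in zip(x_workers, grads)]
    # the center relaxes toward the average of the workers
    new_center = (1 - gamma) * x_center + (gamma / n) * sum(x_workers)
    return new_workers, new_center

workers = [np.array([1.0, 1.0]), np.array([3.0, 3.0])]
center = np.zeros(2)
grads = [np.zeros(2), np.zeros(2)]
workers, center = easgd_step(workers, center, grads)
# with zero gradients, only the elastic forces act: workers move toward
# the center and the center toward the workers' average
```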

SLIDE 22

Parameter consistency in deep learning

The inconsistent end of the spectrum: Ensemble Learning. Train several models independently and average their predictions.

[Figure: multiple networks classify the same input (Cat 0.54, Dog 0.28, …) and their outputs are averaged]

  • T. G. Dietterich: Ensemble Methods in Machine Learning, MCS 2000
SLIDE 23

Communication optimizations

▪ Different options how to optimize updates
  ▪ Send $\nabla x$, receive $x$
  ▪ Send FC factors ($p_{m-1}$, $p_m$), compute $\nabla x$ on the parameter server; broadcast factors to avoid receiving the full $x$
  ▪ Use lossy compression when sending, accumulate the error locally!
▪ Quantization
  ▪ Quantize weight updates and potentially weights
  ▪ Main trick is stochastic rounding [1] – the expectation is more accurate; enables low precision (half, quarter) to become standard
  ▪ TernGrad - ternary weights [2], 1-bit SGD [3], …
▪ Sparsification
  ▪ Do not send small weight updates, or only send the top-k [4]; accumulate omitted gradients locally

[Figure: sharded parameter server computing $x' = v(x, \nabla x)$ for four training agents; source: ai.intel.com]

[1] S. Gupta et al. Deep Learning with Limited Numerical Precision, ICML’15
[2] F. Li and B. Liu. Ternary Weight Networks, arXiv 2016
[3] F. Seide et al. 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech 2014
[4] C. Renggli et al. SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018
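The stochastic-rounding trick from [1] is small enough to sketch: round to the quantization grid up or down with probability proportional to proximity, so the quantized value is unbiased in expectation. A numpy illustration (names and the grid step are mine):

```python
import numpy as np

def stochastic_round(x, step, rng):
    """Round x to multiples of `step`, up or down at random, such that
    E[stochastic_round(x)] == x -- the unbiasedness that makes very low
    precision work for gradient updates."""
    scaled = x / step
    low = np.floor(scaled)
    p_up = scaled - low                       # fractional part = P(round up)
    return (low + (rng.random(x.shape) < p_up)) * step

rng = np.random.default_rng(0)
g = np.full(100_000, 0.3)                     # gradients between grid points
q = stochastic_round(g, step=1.0, rng=rng)    # each entry becomes 0.0 or 1.0
# deterministic rounding would map everything to 0.0 and lose the signal;
# here the mean of q stays close to 0.3
```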

SLIDE 24

▪ Pick the k-largest elements of the vector at each node!

▪ Accumulate the remainder locally (convergence proof, similar to async. SGD with implicit staleness bounds [1])


Sparsification – top-k Stochastic Gradient Descent

[1] Dan Alistarh, TH, et al.: “The Convergence of Sparsified Gradient Methods”, NIPS’18
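A minimal sketch of the scheme, assuming a dense gradient vector per node (function and variable names are mine): communicate only the k largest-magnitude entries and fold everything else into a local residual that is added back next iteration.

```python
import numpy as np

def topk_with_feedback(grad, residual, k):
    """Keep the k largest-magnitude entries of grad+residual for
    communication; accumulate the rest locally so no gradient
    information is permanently lost (error feedback)."""
    acc = grad + residual                         # add previously omitted mass
    idx = np.argpartition(np.abs(acc), -k)[-k:]   # indices of the k largest
    sparse = np.zeros_like(acc)
    sparse[idx] = acc[idx]                        # what gets sent
    new_residual = acc - sparse                   # what stays local
    return sparse, new_residual

g = np.array([0.9, -0.05, 0.02, -1.2, 0.1])
r = np.zeros_like(g)
sparse, r = topk_with_feedback(g, r, k=2)
# sparse keeps only -1.2 and 0.9; the small entries accumulate in r and
# will eventually be large enough to be selected
```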

SLIDE 25

SparCML – Quantified sparse allreduce for decentral updates

[Figure: four sparse contributions $\nabla x_1 \ldots \nabla x_4$ combined in a sparse allreduce]

  • C. Renggli, TH et al. SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018

Microsoft speech production workload results (six epochs, 60 million parameters): training time reduced from 2 weeks to 2 days!

SLIDE 26

Optimizing parallel deep learning systems is a bit like navigating Tokyo by public transit: at first glance impossibly complex, but eventually doable with the right guidelines.
SLIDE 27

Deep500: An HPC Deep Learning Benchmark and Competition

500 ways to train DNNs

▪ Integrates TensorFlow, PyTorch, and Caffe2 into a single benchmarking framework
▪ Separate definition of benchmark metrics, shared across all levels
▪ Lean reference implementations – simple to understand and change
  ▪ Operators (layer computations)
  ▪ Optimizers (SGD etc.)
  ▪ Distribution schemes (cf. Horovod)
  ▪ Similar to the reference LINPACK benchmark
▪ Supports optimization of components
  ▪ E.g., no need to reimplement an optimizer to replace gradient compression – easily compare to all frameworks!

SLIDE 28

How to not do this

“Twelve ways to fool the masses when reporting performance of deep learning workloads”

(my humorous guide to floptimize deep learning, blog post Nov. 2018)

https://htor.inf.ethz.ch/blog/index.php/2018/11/08/twelve-ways-to-fool-the-masses-when-reporting-performance-of-deep-learning-workloads/

SLIDE 29

▪ Too obvious for this audience
▪ Was very popular in 2015!
▪ Surprisingly many (still) do this

1) Ignore accuracy when scaling up!

[Quotes: the learning community’s self-correction (Y. LeCun); HPC picking up! Scalability without a good baseline? (D. Bailey)]

https://htor.inf.ethz.ch/blog/index.php/2018/11/08/twelve-ways-to-fool-the-masses-when-reporting-performance-of-deep-learning-workloads/

SLIDE 30

▪ Training accuracy is sufficient, isn’t it?

2) Do not report test accuracy!

[Image source: quora.com]

https://htor.inf.ethz.ch/blog/index.php/2018/11/08/twelve-ways-to-fool-the-masses-when-reporting-performance-of-deep-learning-workloads/

SLIDE 31

▪ Report the best run – SGD is a bit fragile, so don’t worry
▪ At the end, the minutes for the final run matter most!

3) Do not report all training runs needed to tune hyperparameters!

https://htor.inf.ethz.ch/blog/index.php/2018/11/08/twelve-ways-to-fool-the-masses-when-reporting-performance-of-deep-learning-workloads/

SLIDE 32

“Twelve ways to fool the masses when reporting performance of deep learning workloads”

(my humorous guide to floptimize deep learning, blog post Nov. 2018)

https://htor.inf.ethz.ch/blog/index.php/2018/11/08/twelve-ways-to-fool-the-masses-when-reporting-performance-of-deep-learning-workloads/

How to not do this

SLIDE 33

HPC for Deep Learning – Summary

▪ Deep learning is HPC – very similar computational structure, in fact very friendly
  ▪ Amenable to specialization, static scheduling, all established tricks (e.g., microbatching)
▪ Main bottleneck is communication – reduced by trading off:
  ▪ Parameter Consistency: bounded synchronous SGD; central vs. distributed parameter server; EASGD to ensemble learning
  ▪ Parameter Accuracy: lossless compression, quantization, and sparsification of gradient updates
▪ Very different environment from traditional HPC
  ▪ Trade off accuracy for performance!
  ▪ The performance-centric view in HPC can be harmful for accuracy!

  • T. Hoefler: “Twelve ways to fool the masses when reporting performance of deep learning workloads” (my humorous guide to floptimization in deep learning will be published this week during IPAM)

SLIDE 34

Turning 180 degrees – Deep Learning for HPC – Neural Code Comprehension

▪ In 2017, GitHub reports 1 billion git commits in 337 languages!
▪ Can DNNs understand code? (C/C++, FORTRAN, Python, Java, CUDA, OpenCL)
▪ Previous approaches read the code directly → suboptimal (loops, functions)

Ben-Nun, Jakobovits, TH: Neural Code Comprehension: A Learnable Representation of Code Semantics, NIPS 2018

Example source:

    double thres = 5.0;
    if (x < thres)
      x = y * y;
    else
      x = 2.0 * y;
    x += 1.0;

Compiled to LLVM IR:

    %cmp = fcmp olt double %x, 5.0
    br i1 %cmp, label %LT, label %GE
    LT: %2 = fmul double %y, %y
    GE: %3 = fmul double 2.0, %y
    AFTER: %4 = phi double [%2,%LT], [%3,%GE]
    %5 = fadd double %4, 1.0

[Figure: the dataflow between basic blocks and the resulting conteXtual Flow Graph over %x, %y, %cmp, %2 … %5]

SLIDE 35

Deep Learning for HPC – Neural Code Comprehension

▪ Embedding space (using the Skip-gram model): each IR statement (e.g., “%cmp = fcmp olt double %x, 5.0”, “%3 = fmul double 2.0, %y”) is mapped from a vocabulary of size #stmts to a dense vector of fixed embedding dimensions, trained from its neighbors in the conteXtual Flow Graph

Ben-Nun, Jakobovits, TH: Neural Code Comprehension: A Learnable Representation of Code Semantics, NIPS 2018

SLIDE 36

Deep Learning for HPC – Neural Code Comprehension

▪ Embedding space (using the Skip-gram model), fed into LSTM units for downstream tasks:
  ▪ Malicious code detection
  ▪ Guided programming
  ▪ Code optimization: optimal hardware mapping (predicts which device is faster, CPU or GPU), optimal tiling

Ben-Nun, Jakobovits, TH: Neural Code Comprehension: A Learnable Representation of Code Semantics, NIPS 2018

SLIDE 37

Outlook

▪ Full details in the survey (60 pages): parallelism, distribution, synchronization
  https://www.arxiv.org/abs/1802.09941
▪ Newest developments at NIPS’18: Top-K SGD and neural code comprehension (inst2vec)
▪ Call to action to the HPC and ML/DL communities to join forces!
  ▪ Need more joint events!
  ▪ Establish benchmarking discipline, SC18 BoF: “Deep500: An HPC Deep Learning Benchmark and Competition” – to be continued …

SLIDE 38

  • T. HOEFLER

Twelve ways to fool the masses when reporting performance of deep learning workloads! (not to be taken too seriously)

RWTH Aachen, Jan. 2019

https://www.arxiv.org/abs/1802.09941

http://htor.inf.ethz.ch/blog/index.php/2018/11/08/twelve-ways-to-fool-the-masses-when-reporting-performance-of-deep-learning-workloads/

All images belong to the respective owners!

SLIDE 39

▪ Deep learning is HPC
  ▪ In fact, it’s probably (soon?) bigger than traditional HPC – definitely more money …
▪ Interest in the HPC community is tremendous
  ▪ The number of learning papers at HPC conferences seems to be growing exponentially – besides at SC18, whut!?
▪ Risk of unrealism
  ▪ HPC people know how to do HPC – and deep learning is HPC, right?
  ▪ Not quite … it’s really similar (tensor contractions), but it’s also quite different!

Deep learning and HPC

Yann LeCun’s conclusion slide yesterday!

SLIDE 40

▪ Tradeoffs between the two
▪ Very weird for HPC people – we always operated in double precision, mostly out of fear of rounding issues
▪ Deep learning shows how little accuracy one can get away with
  ▪ Well, examples are drawn randomly from some distribution we don’t know …
  ▪ Usually, the noise is quite high …
  ▪ So the computation doesn’t need higher precision than that noise – pretty obvious! In fact, it’s similar in scientific computing, but in tighter bounds and not as well known
▪ But we HPC folks like flop/s! Or maybe now just ops, or even aiops? Whatever, fast compute!
▪ A humorous guide to floptimization: twelve rules to help present your (not so great?) results in a much better light

“Statistical performance” vs. “hardware performance”

SLIDE 41

▪ Too obvious for this audience
▪ Was very popular in 2015!
▪ Surprisingly many (still) do this

1) Ignore accuracy when scaling up!

[Quotes: the learning community’s self-correction (Y. LeCun); HPC picking up! Scalability without a good baseline? (D. Bailey)]

SLIDE 42

▪ Training accuracy is sufficient, isn’t it?

2) Do not report test accuracy!

[Image source: quora.com]

SLIDE 43

▪ Report the best run – SGD is a bit fragile, so don’t worry
▪ At the end, the minutes for the final run matter most!

3) Do not report all training runs needed to tune hyperparameters!

SLIDE 44

▪ Tesla K20 in 2018!?
▪ Even though the older machines would win the beauty contest!

4) Compare outdated hardware with special-purpose hardware!

vs.

SLIDE 45

▪ Run layers or communication kernels in isolation
▪ Avoids issues with accuracy completely ☺
▪ Doesn’t that look a bit like GoogLeNet?

5) Show only kernels/subsets when scaling!

vs.

SLIDE 46

▪ Reading the data? Nah, make sure it’s staged in memory when the benchmark starts!

6) Do not consider I/O!

SLIDE 47

▪ Yes, we’re talking ops today; 64-bit flops was so yesterday!
▪ If we don’t achieve a target fast enough, let’s redefine it! And never talk about how many more of those ops one needs to find a solution – it’s all about the rate, op/s!
▪ Actually, my laptop achieves an “exaop”: each of its 3e9 transistors switching a binary digit at 2.4e9 Hz

7) Report highest ops numbers (whatever that means)!

vs.

SLIDE 48

▪ Pretty cool idea, isn’t it? Hyperparameters sometimes conflict.
▪ So always tune them to show the best result, whatever the result shall be!

8) Show performance when enabling option set A and show accuracy when enabling option set B!

SLIDE 49

▪ The pinnacle of floptimization! Very hard to catch!
▪ But Dr. Catlock Holmes below can catch it.

9) Train on (unreasonably) large inputs!

Low-resolution cat (244x244 – 1 Gflop/example) vs. high-resolution cat (8kx8k – 1 Tflop/example)

SLIDE 50

▪ Train for a fixed wall-time when scaling processors
▪ So when you use twice as many processors you get twice as many flop/s! But who cares about application speedup?

10) Run training just for the right time!

SLIDE 51

▪ All DL is strong scaling – limited model and limited data
▪ So just redefine the terms relative to minibatches:
  ▪ Weak scaling keeps the minibatch size per process constant – the overall minibatch grows (fewer iterations per epoch, duh!)
  ▪ Strong scaling keeps the overall minibatch size constant (better, but harder)
▪ Microbatching is not a problem!

11) Minibatch sizing for fun and profit – weak vs. strong scaling.
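The redefined terms can be stated as a tiny helper (function and argument names are mine):

```python
def minibatch_sizes(base_mb, procs, mode):
    """Per-process and global minibatch size under the two conventions.

    weak:   per-process size constant -> global minibatch grows with procs
    strong: global size constant      -> per-process share shrinks
    Returns (per_process, global_minibatch).
    """
    if mode == 'weak':
        return base_mb, base_mb * procs
    if mode == 'strong':
        return base_mb // procs, base_mb
    raise ValueError(mode)

# weak scaling: 64 per process on 16 processes -> global minibatch 1024
# strong scaling: global 1024 on 16 processes -> 64 per process
```

The punchline of the slide is that both configurations above are the same run; only the label changes.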

SLIDE 52

▪ Compare either time to solution or accuracy if both together don’t look strong!
▪ There used to be conventions, but let’s redefine them.

12) Select carefully how to compare to the state of the art!

SLIDE 53

Hyperparameter and Architecture search

▪ Meta-optimization of hyperparameters (e.g., momentum) and DNN architecture
  ▪ Using Reinforcement Learning [1] (explore/exploit different configurations)
  ▪ Genetic Algorithms with modified (specialized) mutations [2]
  ▪ Particle Swarm Optimization [3] and other meta-heuristics

[Figures: Reinforcement Learning [1]; Evolutionary Algorithms [4]]

[1] M. Jaderberg et al.: Population Based Training of Neural Networks, arXiv 2017
[2] E. Real et al.: Regularized Evolution for Image Classifier Architecture Search, arXiv 2018
[3] P. R. Lorenzo et al.: Hyper-parameter Selection in Deep Neural Networks Using Parallel Particle Swarm Optimization, GECCO’17
[4] H. Liu et al.: Hierarchical Representations for Efficient Architecture Search, ICLR’18

SLIDE 54

GoogLeNet in more detail

  • C. Szegedy et al. Going Deeper with Convolutions, CVPR’15

▪ ~6.8M parameters
▪ 22 layers deep

SLIDE 55

Computing fully connected layers

Each output neuron $i$ applies an activation $\tau$ to a weighted sum of its inputs $y_j$ plus a bias $c_i$, e.g., for three inputs $y_1, y_2, y_3$, weights $x_{1,1}, x_{1,2}, x_{2,1}, x_{2,2}, x_{3,1}, x_{3,2}$, and biases $c_1, c_2$:

$$z_i = \tau\left( \sum_j x_{j,i} \, y_j + c_i \right), \quad i = 1, 2$$

Over a minibatch of $O$ samples, stack the inputs (with a trailing 1 to absorb the bias) into a matrix, so the whole layer becomes a single matrix-matrix multiplication:

$$\begin{pmatrix} y_{1,1} & y_{1,2} & y_{1,3} & 1 \\ y_{2,1} & y_{2,2} & y_{2,3} & 1 \\ \vdots & \vdots & \vdots & \vdots \\ y_{O,1} & y_{O,2} & y_{O,3} & 1 \end{pmatrix}$$
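The batched form can be sketched as one GEMM in numpy; the function name `fc_forward` and the choice of tanh as the activation τ are illustrative, not from the slides:

```python
import numpy as np

def fc_forward(Y, X, c, tau=np.tanh):
    """Fully connected layer over a minibatch as a single GEMM.

    Y: (samples, inputs) -- row k is one sample (y_k1, y_k2, y_k3)
    X: (inputs, outputs) -- weights x_{j,i}
    c: (outputs,)        -- biases c_i (broadcast over the batch)
    """
    return tau(Y @ X + c)

Y = np.array([[1.0, 2.0, 3.0],
              [0.5, -1.0, 0.0]])   # 2 samples, 3 inputs each
X = np.zeros((3, 2))               # 3 inputs -> 2 outputs
c = np.array([0.0, 1.0])
Z = fc_forward(Y, X, c)            # shape (2, 2)
```

Folding the bias into a trailing column of ones, as the slide's stacked matrix does, is equivalent to the broadcast addition of `c` here; either way the layer is one matrix multiplication, which is why fully connected layers map so well to BLAS.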

SLIDE 56

Computing convolutional layers

▪ Direct
▪ Indirect: im2col (convolution as matrix multiplication)
▪ Indirect: FFT, $x * w = \mathcal{F}^{-1}\left(\mathcal{F}(x) \times \mathcal{F}(w)\right)$
▪ Indirect: Winograd (minimal filtering algorithms)

[Figure: a 4×4 input convolved with a 3×3 kernel directly, and the same convolution via FFT]

  • X. Liu et al.: Efficient Sparse-Winograd Convolutional Neural Networks, ICLR’17 Workshop
  • S. Chetlur et al.: cuDNN: Efficient Primitives for Deep Learning, arXiv 2014
  • K. Chellapilla et al.: High Performance Convolutional Neural Networks for Document Processing, Int’l Workshop on Frontiers in Handwriting Recognition 2006
  • M. Mathieu et al.: Fast Training of Convolutional Networks through FFTs, ICLR’14
  • A. Lavin and S. Gray: Fast Algorithms for Convolutional Neural Networks, CVPR’16