

SLIDE 1

Stochastic Gradient Push for Distributed Deep Learning

Mido Assran, Nicolas Loizou, Nicolas Ballas, Mike Rabbat

SLIDE 2

Data Parallel Training

Parallel stochastic gradient descent: each of the $n$ nodes evaluates a stochastic gradient of its local loss $\tilde f_i$ at the shared parameters, the gradients are combined by an inter-node average, and every node applies the same update:

$x^{(k+1)} = x^{(k)} - \gamma^{(k)} \frac{1}{n} \sum_{i=1}^{n} \nabla \tilde f_i(x^{(k)})$

[Figure: $n$ nodes, each holding a copy of $x$ and its local stochastic gradient $\nabla \tilde f_i(x)$, connected through an inter-node average.]

Equivalently, each node takes a local SGD step and the resulting parameters are averaged:

$x^{(k+1)} = \frac{1}{n} \sum_{i=1}^{n} \left( x^{(k)} - \gamma^{(k)} \nabla \tilde f_i(x^{(k)}) \right)$
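Below is a minimal sketch of this data-parallel update in PyTorch style, assuming a process group is already initialized; the function name and arguments are illustrative, not taken from the talk or the released code.

    import torch
    import torch.distributed as dist

    def parallel_sgd_step(model, loss_fn, batch, lr):
        """One data-parallel SGD step: each node computes a local stochastic
        gradient, the gradients are averaged across all n nodes with AllReduce,
        and every node applies the same update (illustrative sketch)."""
        inputs, targets = batch
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()

        n = dist.get_world_size()
        for p in model.parameters():
            # inter-node average of stochastic gradients: (1/n) * sum_i grad_i
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(n)
            # x^(k+1) = x^(k) - gamma^(k) * averaged gradient
            p.data.add_(p.grad, alpha=-lr)

Since the same averaged gradient is applied to identical parameter copies on every node, this is exactly the parameter-averaging form shown above: averaging the locally updated copies $x^{(k)} - \gamma^{(k)} \nabla \tilde f_i(x^{(k)})$ yields the same $x^{(k+1)}$.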

SLIDE 3

Data Parallel Training: Existing Approaches

1. Parallel SGD (AllReduce gradient aggregation, all nodes)

SLIDE 4

Data Parallel Training: Existing Approaches

1. Parallel SGD (AllReduce gradient aggregation, all nodes): blocks all nodes

SLIDE 5

Data Parallel Training: Existing Approaches

1. Parallel SGD (AllReduce gradient aggregation, all nodes): blocks all nodes
2. D-PSGD (PushPull parameter aggregation, neighboring nodes)
3. AD-PSGD (PushPull parameter aggregation, pairs of nodes)

  • 1. Goyal et al., "Accurate, large minibatch SGD: Training ImageNet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
  • 2. Lian et al., "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent," NeurIPS, 2017.
  • 3. Lian et al., "Asynchronous decentralized parallel stochastic gradient descent," ICML, 2018.

SLIDE 6

Data Parallel Training: Existing Approaches

1. Parallel SGD (AllReduce gradient aggregation, all nodes): blocks all nodes
2. D-PSGD (PushPull parameter aggregation, neighboring nodes): blocks subsets of nodes and requires deadlock avoidance
3. AD-PSGD (PushPull parameter aggregation, pairs of nodes): blocks subsets of nodes and requires deadlock avoidance


SLIDE 7

Data Parallel Training

Existing Approaches:
1. Parallel SGD (AllReduce gradient aggregation, all nodes): blocks all nodes
2. D-PSGD (PushPull parameter aggregation, neighboring nodes): blocks subsets of nodes and requires deadlock avoidance
3. AD-PSGD (PushPull parameter aggregation, pairs of nodes): blocks subsets of nodes and requires deadlock avoidance

Proposed Approach:
Stochastic Gradient Push (PushSum parameter aggregation): nonblocking, no deadlock avoidance required

SLIDE 8


Stochastic Gradient Push

Enables optimization over directed and time-varying graphs

  • 1. Nedic, A. and Olshevsky, A., "Stochastic gradient-push for strongly convex functions on time-varying directed graphs," IEEE Trans. Automatic Control, 2016.
SLIDE 9

Stochastic Gradient Push

Enables optimization over directed and time-varying graphs, and naturally enables asynchronous implementations.

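As a rough illustration of the PushSum-style aggregation (a single-process simulation sketch, not the authors' implementation): each node i keeps parameters x_i, a push-sum weight w_i, and de-biased parameters z_i = x_i / w_i; it takes a local SGD step using a gradient evaluated at z_i, then pushes scaled copies of (x_i, w_i) to its out-neighbors along directed edges. The uniform mixing weights and function names below are illustrative assumptions.

    import numpy as np

    def sgp_round(x, w, grad_fn, out_neighbors, lr):
        """One synchronous round of a PushSum-style SGP update (sketch).

        x: (n, d) array of node parameters     w: (n,) push-sum weights
        grad_fn(z): returns (n, d) stochastic gradients at the de-biased z
        out_neighbors[i]: nodes that i pushes to (each list includes i itself)
        """
        n, _ = x.shape
        z = x / w[:, None]                  # de-biased estimates z_i = x_i / w_i
        x_half = x - lr * grad_fn(z)        # local SGD step

        new_x = np.zeros_like(x)
        new_w = np.zeros(n)
        for i in range(n):
            dests = out_neighbors[i]
            p = 1.0 / len(dests)            # column-stochastic: node i's outgoing weights sum to 1
            for j in dests:                 # push scaled (x, w) along directed edges
                new_x[j] += p * x_half[i]
                new_w[j] += p * w[i]
        return new_x, new_w

Because each node's outgoing weights sum to one, the network-wide sum (and hence average) of the parameters is preserved at every round, and the de-biased ratios z_i track that average even on directed graphs; this is what removes the need for the symmetric PushPull exchanges used by D-PSGD and AD-PSGD.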
SLIDE 10

Stochastic Gradient Push

[Figure: timeline in which each node repeatedly runs local optimization while its gossip messages are handled by nonblocking communication, so communication overlaps with computation over time.]
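A sketch of what this overlap might look like with nonblocking point-to-point calls in PyTorch, assuming an initialized process group; the single out-peer/in-peer topology, the half-and-half mixing weights, and the function names are illustrative assumptions rather than the released implementation:

    import torch
    import torch.distributed as dist

    def overlapped_round(params, weight, out_peer, in_peer, local_step):
        """Overlap one gossip exchange with local computation (sketch).

        params: flat 1-D tensor of model parameters
        weight: 1-element tensor holding this node's push-sum weight
        local_step(): local work (e.g. the next forward/backward pass)
                      that runs while messages are in flight
        """
        # Keep half of (x, w) locally and push the other half to one out-peer.
        send_buf = torch.cat([0.5 * params, 0.5 * weight])
        recv_buf = torch.empty_like(send_buf)

        # Nonblocking sends/receives return immediately with request handles.
        reqs = [dist.isend(send_buf, dst=out_peer),
                dist.irecv(recv_buf, src=in_peer)]

        local_step()                        # computation overlaps with communication

        for r in reqs:                      # complete the exchange before mixing
            r.wait()
        params.mul_(0.5).add_(recv_buf[:-1])
        weight.mul_(0.5).add_(recv_buf[-1:])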

SLIDE 11

Distributed Stochastic Optimization

Setup: ImageNet, ResNet-50; 32 nodes (256 GPUs) interconnected via 10 Gbps Ethernet.

[Figure: validation accuracy (%) versus training time relative to SGD, comparing SGP, AllReduce SGD, AD-PSGD, and D-PSGD, with runs at 90 and 270 epochs.]

SLIDE 12

Data Parallelism: Stochastic Gradient Push

Algorithm features:
* asynchronous gossip
* nonblocking communication

SLIDE 13

Data Parallelism: Stochastic Gradient Push

Algorithm features:
* asynchronous gossip
* nonblocking communication
* convergence guarantees for smooth non-convex functions with arbitrary (bounded) message staleness

paper: arxiv.org/pdf/1811.10792.pdf
code: github.com/facebookresearch/stochastic_gradient_push
poster: Pacific Ballroom #183