Stochastic Gradient Push for Distributed Deep Learning

  1. Stochastic Gradient Push for Distributed Deep Learning
     Mido Assran, Nicolas Loizou, Nicolas Ballas, Mike Rabbat

  2. Data Parallel Training: parallel Stochastic Gradient Descent
     Each of the $n$ nodes holds a copy of the parameters $x$ and computes a stochastic gradient $\tilde{\nabla} f_i(x)$ of its local objective $f_i(x)$. An inter-node average then gives the update
     $x^{(k+1)} = \frac{1}{n} \sum_{i=1}^{n} \big( x^{(k)} - \gamma^{(k)} \tilde{\nabla} f_i(x^{(k)}) \big) = x^{(k)} - \gamma^{(k)} \frac{1}{n} \sum_{i=1}^{n} \tilde{\nabla} f_i(x^{(k)})$.
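
To make the averaged update concrete, here is a minimal PyTorch sketch of one AllReduce-based parallel SGD step. It assumes a torch.distributed process group is already initialized; the function name and arguments are illustrative, not part of the paper's released code.

    import torch
    import torch.distributed as dist

    def parallel_sgd_step(model, loss_fn, batch, lr):
        # Compute the local stochastic gradient on this node's mini-batch.
        inputs, targets = batch
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()

        n = dist.get_world_size()
        for p in model.parameters():
            if p.grad is None:
                continue
            # Blocking AllReduce: sum the gradients over all n nodes, then
            # divide to obtain the inter-node average.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(n)
            # x^(k+1) = x^(k) - gamma^(k) * (1/n) * sum_i grad~ f_i(x^(k))
            with torch.no_grad():
                p.add_(p.grad, alpha=-lr)

Every node applies the same averaged gradient, so all replicas stay synchronized; the cost is that the AllReduce blocks all nodes on every iteration, which is the limitation the following slides address.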

  3. Data Parallel Training: Existing Approaches
     1. Parallel SGD (AllReduce gradient aggregation, all nodes).

  4. Data Parallel Training: Existing Approaches
     1. Parallel SGD (AllReduce gradient aggregation, all nodes); blocks all nodes.

  5. Data Parallel Training: Existing Approaches
     1. Parallel SGD (AllReduce gradient aggregation, all nodes); blocks all nodes [1].
     2. D-PSGD (PushPull parameter aggregation, neighboring nodes) [2].
     3. AD-PSGD (PushPull parameter aggregation, pairs of nodes) [3].
     [1] Goyal et al., "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour," arXiv:1706.02677, 2017.
     [2] Lian et al., "Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent," NeurIPS, 2017.
     [3] Lian et al., "Asynchronous Decentralized Parallel Stochastic Gradient Descent," ICML, 2018.

  6. Data Parallel Training: Existing Approaches
     1. Parallel SGD (AllReduce gradient aggregation, all nodes); blocks all nodes [1].
     2. D-PSGD (PushPull parameter aggregation, neighboring nodes) [2].
     3. AD-PSGD (PushPull parameter aggregation, pairs of nodes) [3].
     Approaches 2 and 3 block subsets of nodes and require deadlock avoidance.
     [1] Goyal et al., "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour," arXiv:1706.02677, 2017.
     [2] Lian et al., "Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent," NeurIPS, 2017.
     [3] Lian et al., "Asynchronous Decentralized Parallel Stochastic Gradient Descent," ICML, 2018.

  7. Data Parallel Training: Existing Approaches
     1. Parallel SGD (AllReduce gradient aggregation, all nodes); blocks all nodes.
     2. D-PSGD (PushPull parameter aggregation, neighboring nodes).
     3. AD-PSGD (PushPull parameter aggregation, pairs of nodes).
     Approaches 2 and 3 block subsets of nodes and require deadlock avoidance.
     Proposed approach: Stochastic Gradient Push (PushSum parameter aggregation); nonblocking, no deadlock avoidance required.
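
Unlike AllReduce or PushPull, PushSum only requires each node to send along its current out-edges; a scalar PushSum weight per node corrects the bias that one-directional, possibly asymmetric messages introduce. The single-process NumPy simulation below is a minimal sketch of one PushSum mixing round under a uniform-split (column-stochastic) weighting assumption; it is an illustration, not the released implementation.

    import numpy as np

    def push_sum_round(x, w, out_neighbors):
        """One mixing round. x: (n, d) parameters, w: (n,) PushSum weights,
        out_neighbors[i]: ids the i-th node currently sends to."""
        n, _ = x.shape
        new_x, new_w = np.zeros_like(x), np.zeros_like(w)
        for i in range(n):
            dests = [i] + list(out_neighbors[i])   # keep a share for yourself
            share = 1.0 / len(dests)               # columns of the mixing matrix sum to 1
            for j in dests:
                new_x[j] += share * x[i]
                new_w[j] += share * w[i]
        return new_x, new_w

    # With w initialized to all ones and the graphs strongly connected over
    # time, the de-biased ratio x[i] / w[i] converges to the average of the
    # initial parameters at every node.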

  8. Stochastic Gradient Push
     Enables optimization over directed and time-varying graphs [1].
     [1] Nedic, A. and Olshevsky, A., "Stochastic Gradient-Push for Strongly Convex Functions on Time-Varying Directed Graphs," IEEE Trans. Automatic Control, 2016.

  9. Stochastic Gradient Push
     Enables optimization over directed and time-varying graphs [1] ... and naturally enables asynchronous implementations.
     [1] Nedic, A. and Olshevsky, A., "Stochastic Gradient-Push for Strongly Convex Functions on Time-Varying Directed Graphs," IEEE Trans. Automatic Control, 2016.
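
Concretely, one SGP iteration interleaves a local SGD step on the de-biased parameters z_i = x_i / w_i with a PushSum exchange over a directed, possibly time-varying graph. The sketch below is one way to write that per-node logic; the one-peer exponential topology, the keep-half/push-half weights, and the send/recv/grad_fn callbacks are illustrative assumptions, not the paper's exact implementation.

    import numpy as np

    def sgp_step(i, k, x_i, w_i, grad_fn, send, recv, n, lr):
        """One SGP iteration at node i (n assumed to be a power of two)."""
        # 1) Local stochastic gradient step on the de-biased estimate.
        z_i = x_i / w_i
        x_i = x_i - lr * grad_fn(i, z_i)

        # 2) Pick this iteration's out-neighbor on a directed exponential
        #    graph: the peer at distance 1, 2, 4, ... cycling with k.
        peer = (i + 2 ** (k % int(np.log2(n)))) % n

        # 3) PushSum mixing: keep half the mass, push half to the peer.
        send(peer, (0.5 * x_i, 0.5 * w_i))          # nonblocking in practice
        x_i, w_i = 0.5 * x_i, 0.5 * w_i
        for x_msg, w_msg in recv(i):                 # add whatever has arrived
            x_i, w_i = x_i + x_msg, w_i + w_msg
        return x_i, w_i

Because messages only need to flow one way and are simply summed on arrival, nothing in step 3 forces the sender and receiver to synchronize, which is what makes an asynchronous implementation natural.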

  10. Stochastic Gradient Push
      [Timeline figure: each node's local optimization steps interleaved with nonblocking communication over time.]
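
The overlap in that timeline comes from initiating sends without waiting for the receiver. A hedged PyTorch-style sketch of the pattern, assuming torch.distributed is initialized and the outgoing parameters have been flattened into a single tensor (both assumptions made for illustration):

    import torch.distributed as dist

    def overlapped_iteration(send_buffer, peer, do_local_optimization):
        # isend returns immediately with a request handle; the transfer
        # proceeds in the background.
        req = dist.isend(send_buffer, dst=peer)
        # Local optimization runs while the message is in flight.
        do_local_optimization()
        # Wait only when the buffer is about to be reused for the next send.
        req.wait()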

  11. Distributed Stochastic Optimization: ImageNet, ResNet-50
      [Figure: top-1 validation accuracy (%) versus training time relative to SGD (x-axis 0 to 1, y-axis 74 to 78), comparing SGP, AllReduce SGD, AD-PSGD, and D-PSGD at 90 and 270 epochs.]
      32 nodes (256 GPUs) interconnected via 10 Gbps Ethernet.

  12. Stochastic Gradient Push: Data Parallelism
      Algorithm features:
      * Nonblocking communication (asynchronous gossip).

  13. Stochastic Gradient Push: Data Parallelism
      Algorithm features:
      * Nonblocking communication (asynchronous gossip).
      * Convergence guarantees for smooth non-convex functions with arbitrary (bounded) message staleness.
      Paper: arxiv.org/pdf/1811.10792.pdf
      Code: github.com/facebookresearch/stochastic_gradient_push
      Poster: Pacific Ballroom #183
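
For reference, the guarantee advertised on this slide can be paraphrased as follows; this is a hedged restatement of its flavor (see the paper for the exact conditions and constants), not a verbatim theorem.

    % Assumed setting: L-smooth non-convex f, bounded gradient variance,
    % bounded message staleness, suitable step sizes. The de-biased average
    % iterate then satisfies, for sufficiently large K,
    \[
      \frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E}\left\| \nabla f\!\left(\bar{x}^{(k)}\right) \right\|^{2}
      = \mathcal{O}\!\left(\frac{1}{\sqrt{nK}}\right),
    \]
    % i.e., the same order as synchronous parallel SGD, so the linear
    % speedup in the number of nodes n is preserved.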
