Stochastic Gradient Push for Distributed Deep Learning
Mido Assran, Nicolas Loizou, Nicolas Ballas, Mike Rabbat
Data Parallel Training

[Figure: n nodes, each holding a copy of the parameters x and a local objective f_i(x), i = 1, ..., n.]
Parallel Stochastic Gradient Descent

$$x^{(k+1)} = x^{(k)} - \gamma^{(k)} \left( \frac{1}{n} \sum_{i=1}^{n} \nabla \tilde{f}_i\big(x^{(k)}\big) \right) \qquad \text{(inter-node average of gradients)}$$

Equivalently, each node takes a local step with its own stochastic gradient and the resulting parameters are averaged:

$$x^{(k+1)} = \frac{1}{n} \sum_{i=1}^{n} \left( x^{(k)} - \gamma^{(k)} \nabla \tilde{f}_i\big(x^{(k)}\big) \right)$$

[Figure: each of four nodes holds x and computes its local stochastic gradient ∇f̃_i(x).]
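The two forms are identical by linearity of the average, which is what makes the parameter-averaging view possible. A minimal numerical check in Python/NumPy (the quadratic objectives and the local_grad helper are hypothetical stand-ins for the training loss, not from the paper's code):

```python
# Numerical check that the two parallel-SGD forms above coincide:
# averaging gradients before the step equals averaging the locally
# updated parameters.
import numpy as np

rng = np.random.default_rng(1)
n, d, lr = 4, 5, 0.1
targets = rng.normal(size=(n, d))   # node i's data defines f_i
x = rng.normal(size=d)              # current iterate x^(k)

def local_grad(i, x):
    # Gradient of f_i(x) = 0.5 * ||x - t_i||^2, standing in for grad f~_i(x).
    return x - targets[i]

# Form 1: inter-node average of gradients, then one step.
g = np.mean([local_grad(i, x) for i in range(n)], axis=0)
x1 = x - lr * g

# Form 2: each node steps with its own gradient, then average the parameters.
x2 = np.mean([x - lr * local_grad(i, x) for i in range(n)], axis=0)

assert np.allclose(x1, x2)          # identical by linearity of the average
```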
Existing Approaches

1. Parallel SGD (AllReduce gradient aggregation, all nodes): blocks all nodes
2. D-PSGD (PushPull parameter aggregation, neighboring nodes)
3. AD-PSGD (PushPull parameter aggregation, pairs of nodes)

Approaches 2 and 3 block subsets of nodes and require deadlock avoidance.

Lian et al., "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent," NeurIPS, 2017.

Proposed Approach

Stochastic Gradient Push (PushSum parameter aggregation): nonblocking, no deadlock avoidance required
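For reference, a sketch of the SGP update in the paper's notation (each node i keeps parameters x_i, a PushSum weight w_i, and de-biased parameters z_i = x_i / w_i; P^{(k)} is a column-stochastic mixing matrix supported on the communication graph at step k; the exact step ordering is simplified here):

$$
\begin{aligned}
x_i^{(k+\frac{1}{2})} &= x_i^{(k)} - \gamma^{(k)} \nabla \tilde{f}_i\big(z_i^{(k)}\big) && \text{(local SGD step)} \\
x_i^{(k+1)} &= \sum_{j=1}^{n} p_{ij}^{(k)} \, x_j^{(k+\frac{1}{2})} && \text{(sum messages pushed by in-neighbors)} \\
w_i^{(k+1)} &= \sum_{j=1}^{n} p_{ij}^{(k)} \, w_j^{(k)} && \text{(PushSum weight update)} \\
z_i^{(k+1)} &= x_i^{(k+1)} \big/ w_i^{(k+1)} && \text{(de-biased parameters)}
\end{aligned}
$$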
Enables optimization over directed and time-varying graphs, and naturally enables asynchronous implementations
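A one-line sanity check on why directed graphs suffice (a standard PushSum argument, stated here for exposition rather than quoted from the slides): because each sender j splits its mass column-stochastically, $\sum_i p_{ij}^{(k)} = 1$, the total PushSum weight is conserved,

$$\sum_{i=1}^{n} w_i^{(k+1)} = \sum_{j=1}^{n} \Big( \sum_{i=1}^{n} p_{ij}^{(k)} \Big) w_j^{(k)} = \sum_{j=1}^{n} w_j^{(k)} = n,$$

so the ratio $z_i = x_i / w_i$ cancels the imbalance introduced by one-directional communication; a node only needs to know its out-neighbors (to split its mass), never to wait on its in-neighbors.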
[Figure: per-node timelines (Time on the horizontal axis) alternating Local Optimization with Nonblocking Communication.]
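That overlap can be simulated end to end. Below is a minimal single-process sketch in Python/NumPy, not the paper's implementation (that is the PyTorch code at github.com/facebookresearch/stochastic_gradient_push); the directed ring topology, the 1-2 step message delays, and the list-based transport are all illustrative assumptions:

```python
# Toy simulation of Stochastic Gradient Push with nonblocking, possibly
# stale messages. Illustrative assumptions: a directed ring topology,
# quadratic local objectives, and lists standing in for network transport.
import numpy as np

rng = np.random.default_rng(0)
n, d, steps, lr = 8, 10, 400, 0.05

targets = rng.normal(size=(n, d))   # node i's objective: 0.5 * ||z - t_i||^2
x = np.zeros((n, d))                # PushSum numerators
w = np.ones(n)                      # PushSum weights
inbox = [[] for _ in range(n)]      # (arrival_step, x_msg, w_msg) triples

for k in range(steps):
    z = x / w[:, None]              # de-biased parameter estimates
    for i in range(n):
        # Local stochastic gradient step, taken on the de-biased parameters.
        grad = (z[i] - targets[i]) + 0.01 * rng.normal(size=d)
        x[i] -= lr * grad
        # Push half of (x_i, w_i) to the next node on the directed ring;
        # the message arrives 1-2 steps later (simulated staleness).
        arrival = k + int(rng.integers(1, 3))
        inbox[(i + 1) % n].append((arrival, 0.5 * x[i].copy(), 0.5 * w[i]))
        x[i] *= 0.5
        w[i] *= 0.5
    for i in range(n):
        # Nonblocking receive: absorb messages that have arrived, keep the
        # rest queued; no node ever blocks waiting for a neighbor.
        arrived = [m for m in inbox[i] if m[0] <= k]
        inbox[i] = [m for m in inbox[i] if m[0] > k]
        for _, xj, wj in arrived:
            x[i] += xj
            w[i] += wj

# All de-biased estimates should approach the global minimizer (the mean
# of the targets) despite the one-directional, stale communication.
print("max consensus error:", np.abs(x / w[:, None] - targets.mean(0)).max())
```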
32 nodes (256 GPUs) interconnected via 10 Gbps Ethernet
[Figure: validation accuracy (74 to 78) vs. Training Time Relative to SGD (1/6 to 1); curves for SGP, AllReduce SGD, AD-PSGD, and D-PSGD, with runs at 90 and 270 epochs.]
Algorithm features: asynchronous gossip, with convergence analysis for non-convex functions under arbitrary (bounded) message staleness
paper: arxiv.org/pdf/1811.10792.pdf
code: github.com/facebookresearch/stochastic_gradient_push
poster: Pacific Ballroom #183