On the Computation and Communication Complexity of Parallel SGD with Dynamic Batch Sizes for Stochastic Non-Convex Optimization
Hao Yu, Rong Jin Machine Intelligence Technology Alibaba Group (US) Inc., Bellevue, WA
Poster @ Pacific Ballroom #103
Stochastic non-convex optimization:
\[
\min_{x \in \mathbb{R}^m} \; f(x) \;\triangleq\; \mathbb{E}_{\zeta \sim \mathcal{D}}\big[F(x;\zeta)\big]
\]
SGD update:
\[
x_{t+1} \;=\; x_t \;-\; \gamma\,\frac{1}{B}\sum_{i=1}^{B} \nabla F(x_t;\zeta_i)
\]
i.e., a stochastic gradient averaged from a mini-batch of size B, using B stochastic gradients at each step.
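A minimal runnable sketch of this update (our own illustration, not from the paper; the quadratic toy objective, grad_F, and the sampler are placeholder names):

import numpy as np

def sgd_step(x, batch, grad_F, gamma=0.1):
    # x <- x - gamma * (1/B) * sum_i grad F(x; zeta_i), with B = len(batch)
    g = np.mean([grad_F(x, z) for z in batch], axis=0)  # mini-batch averaged stochastic gradient
    return x - gamma * g

# Toy problem: f(x) = E[0.5 * ||x - zeta||^2] with zeta ~ N([1,1], I), minimized at x = E[zeta].
rng = np.random.default_rng(0)
grad_F = lambda x, z: x - z                 # gradient of 0.5 * ||x - z||^2 w.r.t. x
x = np.zeros(2)
for _ in range(200):
    x = sgd_step(x, rng.normal(1.0, 1.0, size=(32, 2)), grad_F)
print(x)                                    # should be close to E[zeta] = [1, 1]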
Is SGD with a large batch size (BS) close to GD?
Under a fixed budget of stochastic gradient (Stochastic First Order, SFO) access, GD (i.e., SGD with an extremely large BS) has slower convergence than SGD with small batch sizes: small-BS SGD attains a better stochastic opt error than GD.
For strongly convex stochastic opt, [Friedlander & Schmidt'12] and [Bottou et al.'18] show that SGD with an exponentially increasing BS can achieve the same $O(1/T)$ SFO convergence as SGD with a fixed small BS.
Parallel mini-batch SGD with B = 1 at each of N workers has SFO convergence $O(1/\sqrt{NT})$, where T is the SFO access budget at each node: linear speedup w.r.t. the number of nodes, i.e., computation power is perfectly scaled out.
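A quick sanity check on why exponentially increasing batch sizes pair well with a fixed SFO budget (our own back-of-the-envelope, not from the slides): with $B_t = \lfloor \rho^{t-1} B_1 \rfloor$ and $\rho > 1$, a per-worker budget of $T$ samples is exhausted after only $O(\log T)$ iterations, which is exactly why CR-PSGD below needs so few communication rounds:
\[
\sum_{\tau=1}^{t} B_\tau \;\approx\; B_1\,\frac{\rho^{t}-1}{\rho-1} \;\le\; T
\quad\Longrightarrow\quad
t \;\le\; \log_\rho\!\Big(1 + \frac{(\rho-1)\,T}{B_1}\Big) \;=\; O(\log T).
\]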
Polyak-Łojasiewicz (PL) condition:
\[
\tfrac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\big(f(x) - f^*\big), \qquad \forall x.
\]
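For context (a standard fact, not stated explicitly on the slide): every $\mu$-strongly convex $f$ satisfies the PL condition with the same $\mu$, so PL is a weaker assumption that also admits non-convex functions. Minimizing the strong-convexity lower bound over $y$ gives
\[
f(y) \ \ge\ f(x) + \langle \nabla f(x),\, y - x\rangle + \tfrac{\mu}{2}\|y - x\|^2
\;\Longrightarrow\;
f^* \ \ge\ f(x) - \tfrac{1}{2\mu}\|\nabla f(x)\|^2
\;\Longleftrightarrow\;
\tfrac{1}{2}\|\nabla f(x)\|^2 \ \ge\ \mu\big(f(x) - f^*\big).
\]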
Algorithm 1 CR-PSGD$(f, N, T, x_1, B_1, \rho, \gamma)$
1: Input: $N$, $T$, $x_1 \in \mathbb{R}^m$, $\gamma$, $B_1$ and $\rho > 1$.
2: Initialize $t = 1$.
3: while $\sum_{\tau=1}^{t} B_\tau \le T$ do
4:   Each worker $i$ calculates its batch gradient average $\bar g_{t,i} = \frac{1}{B_t}\sum_{j=1}^{B_t} \nabla F(x_t;\zeta_{i,j})$.
5:   Workers aggregate the gradient average $\bar g_t = \frac{1}{N}\sum_{i=1}^{N} \bar g_{t,i}$.
6:   Each worker updates in parallel via $x_{t+1} = x_t - \gamma\,\bar g_t$.
7:   Set batch size $B_{t+1} = \lfloor \rho^{t} B_1 \rfloor$.
8:   Update $t \leftarrow t + 1$.
9: end while
10: Return: $x_t$
T: budget of SFO access at each worker.
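A runnable sketch of CR-PSGD under the same toy setup as the SGD snippet above (our own illustration; the serial loop over workers stands in for the parallel all-reduce, and all names are ours):

import numpy as np

def cr_psgd(grad_F, samplers, x1, T, B1=1, rho=2.0, gamma=0.1):
    # Parallel SGD with exponentially increasing batch sizes (Algorithm 1 sketch).
    # samplers[i](B) draws a size-B mini-batch at worker i; T is the per-worker SFO budget.
    N = len(samplers)
    x, t, Bt, used = x1, 1, B1, 0
    while used + Bt <= T:
        # each worker averages Bt stochastic gradients at the current iterate
        g_workers = [np.mean([grad_F(x, z) for z in samplers[i](Bt)], axis=0)
                     for i in range(N)]
        x = x - gamma * np.mean(g_workers, axis=0)   # one aggregation (communication round)
        used += Bt
        Bt = int(np.floor(rho ** t * B1))            # B_{t+1} = floor(rho^t * B1)
        t += 1
    return x

# Example with the toy quadratic and N = 4 workers (about 13 updates, hence 13 aggregations):
rng = np.random.default_rng(0)
grad_F = lambda x, z: x - z
samplers = [lambda B: rng.normal(1.0, 1.0, size=(B, 2)) for _ in range(4)]
print(cr_psgd(grad_F, samplers, x1=np.zeros(2), T=10_000, gamma=0.5))  # approaches E[zeta] = [1, 1]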
Result (under the PL condition): CR-PSGD over $N$ workers has SFO convergence $O(\frac{1}{NT})$ using only $O(\log T)$ communication rounds.
Compare: local SGD in [Stich'18], for strongly convex opt only, attains $O(\frac{1}{NT})$ with $O(\sqrt{NT})$ communication rounds.
For general smooth non-convex opt (without the PL condition): combine CR-PSGD with proximal minimization, i.e., a Catalyst-style wrapper [... et al.'18].
Algorithm 2 CR-PSGD-Catalyst$(f, N, T, y_0, B_1, \rho, \gamma)$
1: Input: $N$, $T$, $\theta$, $y_0 \in \mathbb{R}^m$, $\gamma$, $B_1$ and $\rho > 1$.
2: Initialize $y^{(0)} = y_0$ and $k = 1$.
3: while $k \le \lfloor \sqrt{NT} \rfloor$ do
4:   Define $h_\theta(x; y^{(k-1)}) \triangleq f(x) + \frac{\theta}{2}\,\|x - y^{(k-1)}\|^2$.
5:   Update $y^{(k)}$ via $y^{(k)} = $ CR-PSGD$\big(h_\theta(\cdot\,; y^{(k-1)}), N, \lfloor\sqrt{T/N}\rfloor, y^{(k-1)}, B_1, \rho, \gamma\big)$.
6:   Update $k \leftarrow k + 1$.
7: end while
$h_\theta(\cdot\,; y^{(k-1)})$ is a strongly convex function whose unbiased stochastic gradient is easily estimated.
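A sketch of the Catalyst wrapper built on the cr_psgd sketch above (again our own illustration; an unbiased stochastic gradient of the subproblem is simply grad_F(x, z) + theta * (x - y)):

import numpy as np

def cr_psgd_catalyst(grad_F, samplers, y0, T, theta=1.0, B1=1, rho=2.0, gamma=0.1):
    # Repeatedly apply CR-PSGD (sketched above) to the strongly convex proximal
    # subproblem h_theta(x; y) = f(x) + (theta/2) * ||x - y||^2  (Algorithm 2 sketch).
    N = len(samplers)
    y = y0
    for _ in range(int(np.floor(np.sqrt(N * T)))):
        grad_h = lambda x, z, y=y: grad_F(x, z) + theta * (x - y)   # unbiased grad of h_theta
        y = cr_psgd(grad_h, samplers, x1=y, T=int(np.floor(np.sqrt(T / N))),
                    B1=B1, rho=rho, gamma=gamma)
    return y

Each outer round reuses CR-PSGD with a per-call budget of $\lfloor\sqrt{T/N}\rfloor$, so the $\lfloor\sqrt{NT}\rfloor$ outer rounds together consume roughly the per-worker budget $T$.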
Result: CR-PSGD-Catalyst has SFO convergence $O(1/\sqrt{NT})$ with $O(\sqrt{NT}\,\log(T/N))$ communication rounds.
Compare: local SGD with fixed batch sizes attains $O(1/\sqrt{NT})$ with $O(N^{3/4} T^{3/4})$ communication rounds.
Experiments (plots on the poster): distributed logistic regression with N = 10 workers; training ResNet20 on CIFAR-10 with N = 8 workers.
Poster on Wed Jun 12th 06:30 -- 09:00 PM @ Pacific Ballroom #103