Distributed Approaches to Mirror Descent for Stochastic Learning over Rate-Limited Networks
Matthew Nokleby, Wayne State University, Detroit MI (joint work with Waheed Bajwa, Rutgers)
Motivation: Autonomous Driving
A network of autonomous vehicles gathers streaming sensor data and must learn cooperatively over rate-limited wireless links.
Challenges:
- sensor data are noisy
- it is too costly to exchange raw data over the network

Questions:
- What if the communication links are slow, i.e., have limited rates?
- How many message exchanges do we need, and how should we balance gradient averaging across the network against local mini-batching?
Problem setup: each sensor observes a data stream {ξ_i(t)}, for sensor i at time t:
(ξ_1(1), ξ_1(2), …), (ξ_2(1), ξ_2(2), …), …, (ξ_6(1), ξ_6(2), …)

The network cooperatively solves the stochastic optimization problem
min_{x∈X} ψ(x) = min_{x∈X} E_ξ[φ(x,ξ)]
where ψ has Lipschitz gradients:
||∇ψ(x) − ∇ψ(y)|| ≤ L||x − y||,  x, y ∈ X
Each sensor computes noisy stochastic gradients:
g_i(t) := ∇φ(x_i(t), ξ_i(t)),  E_ξ[g_i(t)] = ∇ψ(x_i(t)),  E_ξ[||g_i(t) − ∇ψ(x_i(t))||²] ≤ σ²
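To make the oracle model concrete, here is a small NumPy check on a toy instance (the quadratic φ and Gaussian ξ are stand-ins of mine, not from the talk): the sample mean of g matches ∇ψ(x), and the gradient-noise variance is bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
x = rng.normal(size=d)
x_star = np.ones(d)                          # hypothetical optimum

# Toy instance: phi(x, xi) = 0.5*||x - xi||^2 with xi ~ N(x_star, I), so
# psi(x) = E[phi(x, xi)] has gradient x - x_star and noise variance sigma^2 = d.
xi = x_star + rng.normal(size=(100_000, d))
g = x - xi                                   # stochastic gradients grad_x phi(x, xi)

print(g.mean(axis=0), x - x_star)            # unbiased: E[g] = grad psi(x)
print(((g - (x - x_star))**2).sum(axis=1).mean(), d)   # E||g - grad psi||^2 ~= d
```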
Algorithm: Stochastic Mirror Descent
[Lan, “An Optimal Method for Stochastic Composite Optimization”, 2012]
Initialize x_i(0) ← 0
for t = 1 to T:
    x_i(t) ← P_X[x_i(t−1) − γ_t g_i(t−1)]
    x_i^av(t) ← (1/t) Σ_τ x_i(τ)
end for
[Xiao, “Dual averaging methods for regularized stochastic learning and online optimization”, 2010]
E[ψ(x_i^av(T)) − ψ(x*)] ≤ O(1)·[ L/T + σ/√T ]
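A minimal NumPy sketch of the update above, assuming a Euclidean prox (so P_X is projection onto a norm ball), a toy quadratic ψ, and a 1/√t step size; all three choices are illustrative, not the talk's.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, sigma = 10, 1000, 1.0
x_star = rng.normal(size=d)               # optimum of the toy objective 0.5*||x - x_star||^2

def proj_ball(x, radius=10.0):            # P_X: Euclidean projection onto a norm ball
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

x, x_av = np.zeros(d), np.zeros(d)
for t in range(1, T + 1):
    g = (x - x_star) + sigma * rng.normal(size=d)  # noisy gradient: E[g] = grad psi(x)
    x = proj_ball(x - g / np.sqrt(t))              # mirror-descent step, gamma_t = 1/sqrt(t)
    x_av += (x - x_av) / t                         # running average x^av(t) = (1/t) sum x(tau)
print(np.linalg.norm(x_av - x_star))               # shrinks roughly like sigma/sqrt(T)
```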
[Lan, “An Optimal Method for Stochastic Composite Optimization”, 2012]
Accelerated stochastic mirror descent improves the smooth term:
E[ψ(x_i(T)) − ψ(x*)] ≤ O(1)·[ L/T² + σ/√T ]
This rate is order-optimal for the stochastic optimization (SO) problem.
Ideally, pooling all m sensors' data (mT samples in total) would give
E[ψ(x_i(T)) − ψ(x*)] ≤ O(1)·[ L/(mT)² + σ/√(mT) ]
Two standard routes toward this ideal:
- incremental methods: [Ram et al., "Incremental stochastic sub-gradient algorithms for convex optimization", 2009]
- gradient averaging: [Duchi et al., "Dual averaging for distributed optimization…", 2012]; [Dekel et al., "Optimal distributed online prediction using mini-batches", 2012]
Communication model: at each communication round r, each node i sends a message m_i(r) to its neighbors.
[Figure: networked nodes exchanging messages m_1(r), m_2(r), m_3(r), m_4(r).]
Let ρ be the number of communication rounds per data round. With ρ = 2, each data round ξ_i(t) is followed by two communication rounds m_i(r); with ρ = 1/2, one communication round occurs per two data rounds.
[Figure: timelines of data rounds and comms rounds for ρ = 2 and ρ = 1/2.]
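To pin down the timing model, here is a toy scheduler under the reading that ρ counts communication rounds per data round; the function name and rounding rule are mine, for illustration only.

```python
# Sketch of the round structure for a given comms-to-data ratio rho (assumed interpretation):
# rho = 2 -> two communication rounds per data round; rho = 0.5 -> one per two data rounds.
def schedule(T, rho):
    events, r = [], 0
    for t in range(1, T + 1):
        events.append(f"xi(t={t})")
        comms = int(rho * t) - int(rho * (t - 1))   # comms rounds owed after this data round
        for _ in range(comms):
            r += 1
            events.append(f"m(r={r})")
    return events

print(schedule(2, 2))    # ['xi(t=1)', 'm(r=1)', 'm(r=2)', 'xi(t=2)', 'm(r=3)', 'm(r=4)']
print(schedule(4, 0.5))  # ['xi(t=1)', 'xi(t=2)', 'm(r=1)', 'xi(t=3)', 'xi(t=4)', 'm(r=2)']
```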
Proposed approach:
1. Compute local mini-batch gradients
2. Compute distributed gradient averages via consensus
3. Update search points via mirror descent
[Figure: with ρ = 2, data rounds ξ_i(1), ξ_i(2) interleave with consensus rounds m_i(1), …, m_i(4) and search-point updates x_i(1), x_i(2).]
Each node averages its gradients over mini-batches of size b:

θ_i(s) = (1/b) Σ_{t=(s−1)b+1}^{sb} g_i(t)

[Figure: with ρ = 1/2 and b = 4, data rounds ξ_i(1), …, ξ_i(8) yield mini-batch gradients θ_i(1), θ_i(2), consensus rounds m_i(1), …, m_i(4), and search points x_i(1), x_i(5).]
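In code, the mini-batch average is a single reshape-and-mean; this sketch uses made-up gradients, and averaging b i.i.d. gradients cuts the noise variance by a factor of b.

```python
import numpy as np

rng = np.random.default_rng(0)
T, b, d = 16, 4, 3
g = rng.normal(size=(T, d))                    # g_i(t): T stochastic gradients at one node
theta = g.reshape(T // b, b, d).mean(axis=1)   # theta_i(s) = (1/b) * sum over the s-th block
print(theta.shape)                             # (4, 3): one averaged gradient per mini-batch s
```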
Nodes then average their mini-batch gradients via consensus iterations, which converge on the global average:

θ_i^r(s) = Σ_j w_ij θ_j^{r−1}(s)

where w_ij > 0 only if nodes i and j are connected, i.e. (i,j) ∈ E, and nodes perform ρb consensus rounds per mini-batch.
[Tsianos and Rabbat, “Efficient distributed online prediction and stochastic optimization”, 2016] [Duchi et al., “Dual averaging for distributed optimization…”, 2012]
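A minimal sketch of these consensus iterations on a 6-node ring with Metropolis-style weights; the topology and the particular doubly stochastic W are assumptions for illustration.

```python
import numpy as np

m, d = 6, 4
rng = np.random.default_rng(1)

# Doubly stochastic W for a ring: each node mixes itself and its two neighbors equally
W = np.zeros((m, m))
for i in range(m):
    W[i, i] = W[i, (i - 1) % m] = W[i, (i + 1) % m] = 1/3

theta = rng.normal(size=(m, d))        # theta_i(s): one mini-batch gradient per node
avg = theta.mean(axis=0)               # the global average consensus converges to
for r in range(20):                    # theta^r = W theta^{r-1}, applied rho*b times
    theta = W @ theta
print(np.abs(theta - avg).max())       # residual decays like lambda_2(W)^r
```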
Lemma: The equivalent gradient noise variance is bounded by

σ²_eq := E[||θ_i^{ρb}(s) − ∇ψ(x_i(s))||²] ≤ O(1)·[ λ_2(W)^{2ρb}·||x_i(s) − x_j(s)||² + λ_2(W)^{2ρb}·σ²/b + σ²/(mb) ]

The λ_2(W)^{2ρb} terms are residual noise from imperfect consensus averaging; the σ²/(mb) term is the fully averaged sampling noise.
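To get a feel for the bound, this snippet evaluates λ_2(W) on the same assumed 6-node ring and the damping factor λ_2(W)^{2ρb} multiplying the consensus-residual terms, next to the σ²/b and σ²/(mb) sampling terms.

```python
import numpy as np

m, b, rho, sigma = 6, 4, 2, 1.0
W = np.zeros((m, m))
for i in range(m):
    W[i, i] = W[i, (i - 1) % m] = W[i, (i + 1) % m] = 1/3

lam2 = np.sort(np.abs(np.linalg.eigvalsh(W)))[-2]  # second-largest eigenvalue modulus of W
print(lam2, lam2 ** (2 * rho * b))                 # residual damping after rho*b rounds
print(sigma**2 / b, sigma**2 / (m * b))            # sampling-noise terms in the lemma
```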
Algorithm: Distributed Stochastic Approximation Mirror Descent (D-SAMD)

Initialize x_i(0) ← 0, for all i
for s = 1 to T/b:                          [iterate over mini-batches]
    θ_i^0(s) ← θ_i(s)
    for r = 1 to ρb:                       [iterate over consensus rounds]
        θ_i^r(s) ← Σ_j w_ij θ_j^{r−1}(s), for all i
    end for
    x_i(sb+1) ← P_X[x_i(sb) − γ_s θ_i^{ρb}(s)]
    x_i^av(s) ← (1/s) Σ_τ x_i(τb)
end for
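Putting the pieces together, a compact NumPy sketch of D-SAMD on a toy quadratic over the same assumed ring network; the objective, step sizes, and projection radius are illustrative choices of mine, not the talk's.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, T, b, rho, sigma = 6, 5, 400, 4, 1, 1.0
x_star = rng.normal(size=d)                # optimum of the toy objective 0.5*||x - x_star||^2

W = np.zeros((m, m))                       # ring network, doubly stochastic mixing
for i in range(m):
    W[i, i] = W[i, (i - 1) % m] = W[i, (i + 1) % m] = 1/3

def proj(x, radius=10.0):                  # P_X: row-wise projection onto a Euclidean ball
    n = np.linalg.norm(x, axis=-1, keepdims=True)
    return np.where(n <= radius, x, x * (radius / np.maximum(n, 1e-12)))

x = np.zeros((m, d))                       # per-node search points x_i
x_av = np.zeros((m, d))
for s in range(1, T // b + 1):
    # local mini-batch gradients theta_i(s), averaged over b noisy samples
    theta = (x - x_star) + sigma * rng.normal(size=(m, b, d)).mean(axis=1)
    for _ in range(rho * b):               # rho*b consensus rounds on the gradients
        theta = W @ theta
    x = proj(x - theta / np.sqrt(s))       # mirror-descent step, gamma_s = 1/sqrt(s)
    x_av += (x - x_av) / s                 # running average of search points
print(np.linalg.norm(x_av - x_star, axis=1))   # every node approaches x_star
```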
Recall the centralized SMD rate:
E[ψ(x_i^av(T)) − ψ(x*)] ≤ O(1)·[ L/T + σ/√T ]

D-SAMD obeys the analogous bound, with mini-batch size b and equivalent noise σ²_eq:
E[ψ(x_i^av(T)) − ψ(x*)] ≤ O(1)·[ Lb/T + √(σ²_eq·b/T) ]
σ²_eq = O(1)·[ λ_2(W)^{2ρb}·||x_i(s) − x_j(s)||² + λ_2(W)^{2ρb}·σ²/b + σ²/(mb) ]
Lemma: D-SAMD iterates are guaranteed to converge provided

b ≥ O(1)·(1 + log(mT)) / (ρ log(1/λ_2(W)))

Furthermore, this condition is sufficient to ensure that the equivalent-noise term matches the centralized level,

√(σ²_eq·b/T) ≤ O(1)·√(σ²/(mT))

so that

E[ψ(x_i(T)) − ψ(x*)] ≤ O(1)·[ L·log(mT)/(ρ log(1/λ_2(W))·T) + √(σ²/(mT)) ]
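As a worked example (with the O(1) constant set to 1 and parameters of my choosing), the lemma's mini-batch requirement is easy to evaluate:

```python
import numpy as np

m, T, rho, lam2 = 16, 10_000, 1.0, 0.9     # assumed network size, horizon, comms ratio, lambda_2(W)
b_min = (1 + np.log(m * T)) / (rho * np.log(1 / lam2))
print(int(np.ceil(b_min)))                 # smallest b with b >= (1 + log(mT)) / (rho*log(1/lambda_2))
```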
Theorem: If the communications ratio satisfies

ρ ≥ O(1)·( m^{1/2} log(mT) ) / ( σ T^{1/2} log(1/λ_2(W)) )

then the conditions of the previous lemma ensure order-optimal scaling:

E[ψ(x_i(T)) − ψ(x*)] ≤ O(1)·√(σ²/(mT))

(Mini-batching reduces the number of MD steps taken from T to T/b.)
Algorithm: Accelerated Distributed Stochastic Approximation Mirror Descent (AD-SAMD) [simplified]

for s = 1 to T/b:                          [iterate over mini-batches]
    compute mini-batch gradients
    for r = 1 to ρb:
        perform consensus iterations on gradients
    end for
    perform accelerated MD updates
end for

The accelerated MD update uses a more complicated averaging scheme over the search points; a sketch follows below.
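For intuition, here is a minimal Euclidean sketch of one accelerated stochastic mirror-descent loop in the style of Lan's AC-SA, with the consensus step omitted; the weights α_t and step sizes are illustrative, not the talk's.

```python
import numpy as np

rng = np.random.default_rng(3)
d, T, sigma = 10, 2000, 1.0
x_star = rng.normal(size=d)                 # optimum of the toy objective 0.5*||x - x_star||^2

def proj_ball(x, radius=10.0):
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

x = np.zeros(d)                             # prox sequence
x_ag = np.zeros(d)                          # aggregated sequence (the algorithm's output)
for t in range(1, T + 1):
    alpha = 2.0 / (t + 1)                   # illustrative averaging weights
    x_md = (1 - alpha) * x_ag + alpha * x   # point at which the gradient is queried
    g = (x_md - x_star) + sigma * rng.normal(size=d)
    x = proj_ball(x - alpha * g / np.sqrt(t))    # prox step (illustrative gamma_t)
    x_ag = (1 - alpha) * x_ag + alpha * x   # the "more complicated" averaging of search points
print(np.linalg.norm(x_ag - x_star))
```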
AD-SAMD has

E[ψ(x_i(T)) − ψ(x*)] ≤ O(1)·[ Lb²/T² + √(σ²_eq·b/T) ]
σ²_eq = O(1)·[ λ_2(W)^{2ρb}·||x_i(s) − x_j(s)||² + λ_2(W)^{2ρb}·σ²/b + σ²/(mb) ]

and has optimum scaling, provided

b ≥ O(1)·(1 + log(mT)) / (ρ log(1/λ_2(W)))

in which case

E[ψ(x_i(T)) − ψ(x*)] ≤ O(1)·[ L·log²(mT)/(ρ² log²(1/λ_2(W))·T²) + √(σ²/(mT)) ]
Theorem: If the communications ratio satisfies

ρ ≥ O(1)·( m^{1/4} log(mT) ) / ( σ T^{3/4} log(1/λ_2(W)) )

then the conditions of the previous lemma ensure that

E[ψ(x_i(T)) − ψ(x*)] ≤ O(1)·[ L·log²(mT)/(ρ² log²(1/λ_2(W))·T²) + √(σ²/(mT)) ] = O(1)·√(σ²/(mT))

That is, AD-SAMD achieves order-optimal scaling under a weaker requirement on ρ than D-SAMD; a numeric comparison follows below.
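Comparing the two theorems numerically (constants set to 1, parameters assumed): AD-SAMD's required comms-to-data ratio shrinks like T^{-3/4} versus D-SAMD's T^{-1/2}, so it tolerates much slower links at long horizons.

```python
import numpy as np

m, T, sigma, lam2 = 16, 10_000, 1.0, 0.9    # illustrative network/runtime parameters
logw = np.log(1 / lam2)
rho_dsamd = np.sqrt(m) * np.log(m * T) / (sigma * T**0.5 * logw)
rho_adsamd = m**0.25 * np.log(m * T) / (sigma * T**0.75 * logw)
print(rho_dsamd, rho_adsamd)     # D-SAMD needs a ~(mT)^{1/4} times larger rho
```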
[Figure: simulation results for (a) ρ = 1 and (b) ρ = 10.]
Extension to composite objectives ψ(x) = f(x) + h(x), with
||∇f(x) − ∇f(y)|| ≤ L||x − y||,  ||h(x) − h(y)|| ≤ M||x − y||

Centralized rate: E[ψ(x_i(T)) − ψ(x*)] ≤ O(1)·[ L/T² + (M + σ)/√T ]

Distributed, mini-batched rate: E[ψ(x_i(T)) − ψ(x*)] ≤ O(1)·[ Lb²/T² + (M + σ/√(mb))/√(T/b) + M ]
Summary:
- Distributed mirror-descent methods (D-SAMD, AD-SAMD) for stochastic learning over rate-limited, wireless links
- Order-optimal convergence by balancing network gradient averaging and local mini-batching
Future work:
Preprint available: https://arxiv.org/abs/1704.07888