Communication-Efficient Decentralized Learning
Yuejie Chi
EdgeComm Workshop, 2020

Acknowledgements
Boyue Li (CMU), Shicong Cen (CMU), Yuxin Chen (Princeton)

Communication-Efficient Distributed Optimization in Networks with Gradient Tracking and Variance Reduction, JMLR, 2020.
Distributed empirical risk minimization

Distributed/federated learning: due to privacy and scalability, data are distributed across multiple locations / workers / agents. Let M = \cup_i M_i be a data partition with equal splitting:

f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \quad \text{where} \quad f_i(x) := \frac{1}{N/n} \sum_{z \in M_i} \ell(x; z).

[Figure: agents holding local objectives f_1(x), ..., f_5(x).]

Notation: N = number of total samples, n = number of agents, m = N/n = number of local samples per agent.
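To make the notation concrete, here is a minimal numerical sketch of this objective under an equal data split, assuming a least-squares loss ℓ(x; (a, y)) = (a^T x - y)^2 on synthetic data; all variable names and numbers below are illustrative, not from the talk.

```python
# Minimal sketch of the distributed ERM objective with an equal data split.
# Assumes a least-squares loss ell(x; (a, y)) = (a @ x - y)**2 on synthetic
# data; all names and numbers here are illustrative, not from the talk.
import numpy as np

rng = np.random.default_rng(0)
N, n, d = 1000, 5, 10                 # total samples, agents, dimension
m = N // n                            # local samples per agent
A, y = rng.standard_normal((N, d)), rng.standard_normal(N)

# Equal split M = M_1 ∪ ... ∪ M_n across agents.
shards = [(A[i * m:(i + 1) * m], y[i * m:(i + 1) * m]) for i in range(n)]

def f_local(x, Ai, yi):
    """f_i(x) = (1/m) * sum of squared errors over the local shard M_i."""
    return np.mean((Ai @ x - yi) ** 2)

def f_global(x):
    """f(x) = (1/n) * sum_i f_i(x)."""
    return np.mean([f_local(x, Ai, yi) for Ai, yi in shards])

# With an equal split, the average of local losses equals the full-data average.
x0 = np.zeros(d)
assert np.isclose(f_global(x0), np.mean((A @ x0 - y) ** 2))
```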
Decentralized ERM - algorithmic framework

\text{minimize}_{x} \quad f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)
\Downarrow
\text{minimize}_{x_i} \quad \frac{1}{n} \sum_{i=1}^{n} f_i(x_i) \quad \text{subject to} \quad x_i = x_j

- Local computation: agents update local estimates ⇒ need to be scalable!
- Global communication: agents exchange information to reach consensus ⇒ need to be communication-efficient!

Guiding principle: more local computation leads to less communication.
Two distributed schemes

- Master/slave model: a parameter server (PS) coordinates global information sharing.
- Network model: agents share local information over a graph topology.

[Figure: five agents with local objectives f_1(x), ..., f_5(x), connected either to a central parameter server or to each other over a graph.]
Distributed first-order methods in the master/slave setting

x_i^t = \text{LocalUpdate}(f_i, \nabla f(x^t), x^t),

where the parameter server maintains the global quantities by averaging:

x^t = \frac{1}{n} \sum_{i=1}^{n} x_i^{t-1} \quad \text{(parameter consensus)}, \qquad \nabla f(x^t) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x^t) \quad \text{(gradient consensus)},

and each f_i is defined by local data.

Distributed Approximate NEwton (DANE) (Shamir et al., 2014):

x_i^t = \arg\min_x \; f_i(x) - \langle \nabla f_i(x^{t-1}) - \nabla f(x^{t-1}), x \rangle + \frac{\mu}{2} \| x - x^{t-1} \|_2^2

- Quasi-Newton-type method, less sensitive to ill-conditioning.

Distributed Stochastic Variance-Reduced Gradients (Cen et al., 2020):

x_i^{t,s} = x_i^{t,s-1} - \eta \, v_i^{t,s-1}, \quad s = 1, 2, \ldots,

where v_i^{t,s-1} is a variance-reduced stochastic gradient.

- Better local computation efficiency.
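As a concrete illustration of this template, here is a minimal sketch, assuming quadratic local objectives and solving the DANE subproblem approximately with a few inner gradient steps; the problem data, step sizes, and helper names are illustrative rather than a reference implementation of the algorithms above.

```python
# Minimal sketch of the master/slave template with a DANE-style local update.
# Quadratic local objectives f_i(x) = 0.5 x'Q_i x - b_i'x are assumed, and the
# DANE subproblem is solved approximately with a few inner gradient steps;
# problem data, step sizes, and helper names are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 10
Qs = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(n)]
Qs = [Q @ Q.T for Q in Qs]                    # make each Q_i positive definite
bs = [rng.standard_normal(d) for _ in range(n)]
grad_i = lambda i, x: Qs[i] @ x - bs[i]       # local gradient of f_i

def dane_local_update(i, x_prev, g_global, mu=1.0, inner_steps=50, lr=0.05):
    """Approximately solve argmin_x f_i(x) - <grad f_i(x_prev) - grad f(x_prev), x>
    + (mu/2) ||x - x_prev||^2 by inner gradient descent."""
    shift = grad_i(i, x_prev) - g_global
    x = x_prev.copy()
    for _ in range(inner_steps):
        x -= lr * (grad_i(i, x) - shift + mu * (x - x_prev))
    return x

x = np.zeros(d)
for t in range(100):
    g_global = np.mean([grad_i(i, x) for i in range(n)], axis=0)             # gradient consensus at the PS
    local_iterates = [dane_local_update(i, x, g_global) for i in range(n)]   # local computation
    x = np.mean(local_iterates, axis=0)                                      # parameter consensus at the PS
```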
Naive extension to the network setting

- Communicate: agent i transmits \{x_i^t, \nabla f_i(x_i^t)\} to its neighbors;
- Compute:
  x_i^t \Leftarrow \text{LocalUpdate}(f_i, \text{Avg}\{\nabla f_j(x_j^t)\}_{j \in N_i}, \text{Avg}\{x_j^t\}_{j \in N_i}),
  where the neighborhood averages serve as surrogates of \nabla f(x^t) and x^t, respectively.

[Figure: optimality gap vs. iterations for SVRG and the naive Network-SVRG; the naive scheme doesn't converge to the global optimum!]

Consensus needs to be designed carefully in the network setting!
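For concreteness, the following sketch implements this naive scheme with a ring topology and a single gradient step as LocalUpdate; the topology, quadratic costs, and step size are illustrative assumptions. As emphasized above, this is exactly the construction that can fail to reach the global optimum.

```python
# Sketch of the naive network extension: each agent averages its neighbors'
# iterates and gradients and treats them as surrogates of x^t and grad f(x^t),
# then performs one gradient step as LocalUpdate. Ring topology, quadratic
# costs, and step size are illustrative; as noted above, this naive scheme
# need not converge to the global optimum.
import numpy as np

rng = np.random.default_rng(2)
n, d, eta = 5, 10, 0.1
Qs = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(n)]
Qs = [Q @ Q.T for Q in Qs]
bs = [rng.standard_normal(d) for _ in range(n)]
grad_i = lambda i, x: Qs[i] @ x - bs[i]

neighbors = [[(i - 1) % n, i, (i + 1) % n] for i in range(n)]  # ring graph, incl. self

X = [np.zeros(d) for _ in range(n)]
for t in range(200):
    x_avg = [np.mean([X[j] for j in neighbors[i]], axis=0) for i in range(n)]             # surrogate of x^t
    g_avg = [np.mean([grad_i(j, X[j]) for j in neighbors[i]], axis=0) for i in range(n)]  # surrogate of grad f(x^t)
    X = [x_avg[i] - eta * g_avg[i] for i in range(n)]          # naive LocalUpdate: one gradient step
```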
Average dynamic consensus

Assume that each agent j generates some time-varying quantity r_j^t. How can every agent track the dynamic average

\frac{1}{n} \sum_{j=1}^{n} r_j^t = \frac{1}{n} \mathbf{1}_n^\top r^t, \quad \text{where } r^t = [r_1^t, \cdots, r_n^t]^\top?

- Dynamic average consensus (Zhu and Martinez, 2010):

q^t = \underbrace{W q^{t-1}}_{\text{mixing}} + \underbrace{r^t - r^{t-1}}_{\text{correction}},

where q^t = [q_1^t, \cdots, q_n^t]^\top and W is the mixing matrix.

- Key property: the average of \{q_i^t\} dynamically tracks the average of \{r_i^t\}:

\mathbf{1}_n^\top q^t = \mathbf{1}_n^\top r^t.

M. Zhu and S. Martinez, "Discrete-time dynamic average consensus," Automatica, 2010.
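A minimal sketch of this recursion, assuming a doubly stochastic mixing matrix on a ring and slowly drifting signals r_j^t (both illustrative); it checks numerically that the sum of q^t always equals the sum of r^t when the recursion starts from q^0 = r^0.

```python
# Minimal sketch of dynamic average consensus: q^t = W q^{t-1} + r^t - r^{t-1}.
# The doubly stochastic mixing matrix on a ring and the slowly drifting
# signals r_j^t are illustrative; the point is the invariant 1'q^t = 1'r^t
# when the recursion is initialized with q^0 = r^0.
import numpy as np

n, T = 6, 200

# Doubly stochastic mixing matrix for a ring: weight 1/3 to self and each neighbor.
W = np.zeros((n, n))
for i in range(n):
    W[i, [(i - 1) % n, i, (i + 1) % n]] = 1.0 / 3.0

def r(t):
    """Each agent's time-varying quantity; drifts slowly so tracking can keep up."""
    return np.sin(0.01 * t + np.arange(n))

q = r(0)                                      # initialize q^0 = r^0
for t in range(1, T):
    q = W @ q + r(t) - r(t - 1)
    assert np.isclose(q.sum(), r(t).sum())    # key property: 1'q^t = 1'r^t at every t

print(np.max(np.abs(q - r(T - 1).mean())))    # each q_i hovers near the current average
```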
Gradient tracking

Replace the global quantities in the local update with locally tracked surrogates:

x_j^t \Leftarrow \text{LocalUpdate}(f_j, s_j^t, y_j^t),

where s_j^t stands in for \nabla f(x^t) and y_j^t stands in for x^t.

- Parameter averaging:

y_j^t = \sum_{k \in N_j} w_{jk} x_k^{t-1};

- Gradient tracking:

s_j^t = \sum_{k \in N_j} w_{jk} s_k^{t-1} + \nabla f_j(y_j^t) - \nabla f_j(y_j^{t-1}).

We can now apply the same DANE- and SVRG-type local updates!
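A minimal sketch of this recursion, using a single gradient step as LocalUpdate (i.e., decentralized gradient descent with gradient tracking); the ring topology, quadratic local costs, and step size are illustrative assumptions, not the settings used in the talk.

```python
# Minimal sketch of gradient tracking with a single gradient step as
# LocalUpdate (decentralized gradient descent with gradient tracking).
# Ring topology, quadratic local costs, and step size are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n, d, eta, T = 5, 10, 0.05, 500
Qs = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(n)]
Qs = [Q @ Q.T for Q in Qs]
bs = [rng.standard_normal(d) for _ in range(n)]
grad = lambda j, x: Qs[j] @ x - bs[j]

# Doubly stochastic mixing matrix on a ring (self + two neighbors).
W = np.zeros((n, n))
for j in range(n):
    W[j, [(j - 1) % n, j, (j + 1) % n]] = 1.0 / 3.0

X = np.zeros((n, d))                                      # x_j^0
Y = X.copy()                                              # y_j^0
S = np.stack([grad(j, Y[j]) for j in range(n)])           # s_j^0 = grad f_j(y_j^0)

for t in range(T):
    Y_new = W @ X                                         # parameter averaging: y_j^t
    S = W @ S + np.stack([grad(j, Y_new[j]) - grad(j, Y[j]) for j in range(n)])  # gradient tracking: s_j^t
    X = Y_new - eta * S                                   # LocalUpdate: one step along the tracked gradient
    Y = Y_new

# All agents should approach the minimizer of (1/n) * sum_j f_j.
x_star = np.linalg.solve(np.mean(Qs, axis=0), np.mean(bs, axis=0))
print(np.max(np.abs(X - x_star)))
```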
Linear Regression: Well-Conditioned

f_i(x) = \| y_i - A_i x \|_2^2, \quad A_i \in \mathbb{R}^{1000 \times 40}

[Figure: the optimality gap (f(\bar{x}^{(t)}) - f^\star)/f^\star w.r.t. iterations and w.r.t. gradient evaluations (normalized by the number of samples), for DANE, DGD, Network-DANE, Network-SVRG, and Network-SARAH. Condition number \kappa = 10; ER graph (p = 0.3), 20 agents.]
Linear Regression: Ill-Conditioned

f_i(x) = \| y_i - A_i x \|_2^2, \quad A_i \in \mathbb{R}^{1000 \times 40}

[Figure: the optimality gap (f(\bar{x}^{(t)}) - f^\star)/f^\star w.r.t. iterations and w.r.t. gradient evaluations (normalized by the number of samples), for DANE, DGD, Network-DANE, Network-SVRG, and Network-SARAH. Condition number \kappa = 10^4; ER graph (p = 0.3), 20 agents.]
Extra Mixing

The mixing rate of the graph is \alpha_0 = 0.922. A single round of mixing within each iteration cannot ensure the convergence of Network-SVRG.

[Figure: the optimality gap (f(\bar{x}^{(t)}) - f^\star)/f^\star of Network-SVRG w.r.t. iterations and w.r.t. K times the number of iterations, for K = 1, 2, 5, 20, 50 rounds of mixing per iteration.]
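A small sketch of what extra mixing does to the mixing rate, assuming a symmetric doubly stochastic W: performing K communication rounds per iteration amounts to mixing once with W^K, whose rate is \alpha_0^K. The matrix below is illustrative and does not reproduce the \alpha_0 = 0.922 graph used in the experiments.

```python
# Small sketch of extra mixing: performing K communication rounds per
# iteration amounts to replacing the mixing matrix W by W^K, which lowers
# the mixing rate from alpha_0 to alpha_0^K (exactly so for symmetric W).
# The W below is illustrative and does not match the alpha_0 = 0.922 graph
# used in the experiments.
import numpy as np

def extra_mixing(W, K):
    """K consecutive rounds of mixing are equivalent to mixing once with W^K."""
    return np.linalg.matrix_power(W, K)

n = 20
J = np.full((n, n), 1.0 / n)                 # projector onto the consensus direction
W = 0.1 * J + 0.9 * np.eye(n)                # a symmetric, doubly stochastic example
alpha_0 = np.linalg.norm(W - J, 2)           # mixing rate (second-largest singular value)

for K in (1, 2, 5, 20, 50):
    alpha_K = np.linalg.norm(extra_mixing(W, K) - J, 2)
    print(K, alpha_0 ** K, alpha_K)          # alpha_K equals alpha_0 ** K here
```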
Final remarks

- Gradient tracking provides a way to extend master/slave algorithms (DANE and SVRG) to network settings;
- Computation-communication trade-offs can be probed by employing different local updates and extra mixing.

Future work:
- Convergence analysis in the nonconvex case.

Thank you!

B. Li, S. Cen, Y. Chen, and Y. Chi, "Communication-Efficient Distributed Optimization in Networks with Gradient Tracking and Variance Reduction," JMLR, 2020.