SLIDE 1

Communication-Efficient Decentralized Learning

Yuejie Chi, EdgeComm Workshop, 2020

SLIDE 2

Acknowledgements

Boyue Li (CMU), Shicong Cen (CMU), Yuxin Chen (Princeton)

B. Li, S. Cen, Y. Chen, and Y. Chi, "Communication-Efficient Distributed Optimization in Networks with Gradient Tracking and Variance Reduction," JMLR, 2020.


SLIDE 3

Distributed empirical risk minimization

Distributed/Federated learning: due to privacy and scalability, data are distributed across multiple locations / workers / agents. Let $\mathcal{M} = \cup_i \mathcal{M}_i$ be a data partition with equal splitting:

$$f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad \text{where } f_i(x) := \frac{1}{N/n}\sum_{z\in\mathcal{M}_i} \ell(x; z).$$

[Figure: five agents, each holding a local loss $f_1(x), \dots, f_5(x)$.]

Notation: $N$ = total number of samples; $n$ = number of agents; $m = N/n$ = number of local samples per agent.
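To make the decomposition concrete, here is a minimal numpy sketch, assuming a squared loss $\ell(x; z) = (a^\top x - b)^2$ with $z = (a, b)$; the data, sizes, and function names are illustrative, not from the talk.

```python
import numpy as np

# Toy instance of the distributed ERM objective with squared loss
# l(x; z) = (a^T x - b)^2, z = (a, b). Sizes and data are illustrative.
rng = np.random.default_rng(0)
N, n, d = 1000, 5, 10        # total samples, agents, dimension
m = N // n                   # local samples per agent (equal splitting)
A = rng.standard_normal((N, d))
b = rng.standard_normal(N)

def f_i(x, i):
    """Local empirical risk of agent i over its partition M_i."""
    sl = slice(i * m, (i + 1) * m)
    return np.mean((A[sl] @ x - b[sl]) ** 2)

def f(x):
    """Global objective: average of the n local risks."""
    return np.mean([f_i(x, i) for i in range(n)])

x = np.zeros(d)
print(f(x))                  # equals np.mean((A @ x - b) ** 2)
```

With equal splitting, averaging the $n$ local risks reproduces the global empirical risk exactly.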

SLIDES 4-7

Decentralized ERM - algorithmic framework

$$\operatorname*{minimize}_{x}\; f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x)$$
$$\Downarrow$$
$$\operatorname*{minimize}_{\{x_i\}}\; \frac{1}{n}\sum_{i=1}^{n} f_i(x_i) \quad \text{subject to } x_i = x_j$$

  • Local computation: agents update local estimates; ⇒ need to be scalable!
  • Global communication: agents exchange information to reach consensus; ⇒ need to be communication-efficient!

Guiding principle: more local computation leads to less communication.

SLIDES 8-9

Two distributed schemes

[Figure: five agents with local losses $f_1(x), \dots, f_5(x)$ connected to a parameter server.]

Master/slave model: a parameter server (PS) coordinates global information sharing.

[Figure: five agents with local losses $f_1(x), \dots, f_5(x)$ connected over a graph.]

Network model: agents share local information over a graph topology.

SLIDES 10-15

Distributed first-order methods in the master/slave setting

$$x_i^t \Leftarrow \mathrm{LocalUpdate}\big(f_i,\; \nabla f(x^t),\; x^t\big),$$

where the parameter consensus $x^t$ and the gradient consensus $\nabla f(x^t)$ are formed from the agents' local data:

$$x^t = \frac{1}{n}\sum_{i=1}^{n} x_i^{t-1}, \qquad \nabla f(x^t) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x^t).$$

Distributed Approximate NEwton (DANE) (Shamir et al., 2014):

$$x_i^t = \operatorname*{argmin}_{x}\; f_i(x) - \big\langle \nabla f_i(x^{t-1}) - \nabla f(x^{t-1}),\, x \big\rangle + \frac{\mu}{2}\big\|x - x^{t-1}\big\|_2^2$$

  • A quasi-Newton-type method that is less sensitive to ill-conditioning; a sketch of the local update follows below.
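A minimal sketch of this local update, assuming gradient-oracle access to $f_i$ and an illustrative gradient-descent inner solver; `dane_local_update` and its parameters are hypothetical names, not the authors' implementation.

```python
import numpy as np

def dane_local_update(grad_fi, grad_f_global, x_prev, mu=1.0,
                      inner_lr=0.1, inner_steps=100):
    """Approximately solve the DANE surrogate
        argmin_x f_i(x) - <grad f_i(x_prev) - grad f(x_prev), x>
                        + (mu/2) * ||x - x_prev||^2
    by gradient descent (an illustrative inner solver)."""
    shift = grad_fi(x_prev) - grad_f_global        # local-vs-global gradient drift
    x = x_prev.copy()
    for _ in range(inner_steps):
        g = grad_fi(x) - shift + mu * (x - x_prev)  # gradient of the surrogate
        x = x - inner_lr * g
    return x
```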

Distributed Stochastic Variance-Reduced Gradients (Cen et al., 2020):

$$x_i^{t,s} \Leftarrow x_i^{t,s-1} - \eta\, \underbrace{v_i^{t,s-1}}_{\text{variance-reduced stochastic gradient}}, \qquad s = 1, 2, \ldots$$

  • Better local computation efficiency; see the sketch after this bullet.
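A minimal sketch of one SVRG-style inner epoch producing such variance-reduced gradients, assuming per-sample gradient oracles; the fixed-anchor scheme and all names here are illustrative and may differ from the exact construction of Cen et al. (2020).

```python
import numpy as np

def svrg_local_update(sample_grads, x_anchor, eta=0.01, num_steps=50, seed=0):
    """sample_grads: list of per-sample gradient functions g_j(x).
    Runs one SVRG epoch anchored at x_anchor."""
    rng = np.random.default_rng(seed)
    full_grad = np.mean([g(x_anchor) for g in sample_grads], axis=0)
    x = x_anchor.copy()
    for _ in range(num_steps):
        j = rng.integers(len(sample_grads))
        # variance-reduced stochastic gradient v^{t,s-1}
        v = sample_grads[j](x) - sample_grads[j](x_anchor) + full_grad
        x = x - eta * v
    return x
```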

SLIDES 16-18

Naive extension to the network setting

[Figure: five agents over a graph topology, each transmitting $\{x_i^t, \nabla f_i(x_i^t)\}$ to its neighbors.]

  • Communicate: agent $i$ transmits $\{x_i^t, \nabla f_i(x_i^t)\}$;
  • Compute: $x_i^t \Leftarrow \mathrm{LocalUpdate}\big(f_i,\; \underbrace{\mathrm{Avg}\{\nabla f_j(x_j^t)\}_{j\in N_i}}_{\text{surrogate of } \nabla f(x^t)},\; \underbrace{\mathrm{Avg}\{x_j^t\}_{j\in N_i}}_{\text{surrogate of } x^t}\big)$; a sketch of this round follows below.

[Figure: optimality gap vs. #iters, comparing SVRG with naive Network-SVRG.]

It doesn't converge to the global optimum! Consensus needs to be designed carefully in the network setting!
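For concreteness, a minimal sketch of one such naive round, where each agent simply averages its neighbors' iterates and raw gradients; `neighbors`, `grad_f`, and `local_update` are illustrative stand-ins.

```python
import numpy as np

def naive_network_round(neighbors, X, grad_f, local_update):
    """X: (n, d) array of local iterates; neighbors[i] lists agent i's
    neighbors (including i itself). Each agent feeds neighborhood
    averages -- surrogates of x^t and grad f(x^t) -- to its local update."""
    X_new = np.empty_like(X)
    for i in range(X.shape[0]):
        x_avg = np.mean([X[j] for j in neighbors[i]], axis=0)
        g_avg = np.mean([grad_f[j](X[j]) for j in neighbors[i]], axis=0)
        X_new[i] = local_update(i, g_avg, x_avg)
    return X_new
```

Because the neighborhood averages are biased surrogates of the global quantities, the iterates plateau away from the optimum, as the plot shows.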

SLIDES 19-21

Dynamic average consensus

Assume that each agent $j$ generates some time-varying quantity $r_j^t$. How can every agent track the dynamic average $\frac{1}{n}\sum_{j=1}^{n} r_j^t = \frac{1}{n}\mathbf{1}_n^\top r^t$, where $r^t = [r_1^t, \cdots, r_n^t]^\top$?

  • Dynamic average consensus (Zhu and Martínez, 2010):

$$q^t = \underbrace{W q^{t-1}}_{\text{mixing}} + \underbrace{r^t - r^{t-1}}_{\text{correction}},$$

where $q^t = [q_1^t, \cdots, q_n^t]^\top$ and $W$ is the mixing matrix.

  • Key property: the average of $\{q_i^t\}$ dynamically tracks the average of $\{r_i^t\}$, since $\mathbf{1}_n^\top q^t = \mathbf{1}_n^\top r^t$; the sketch below checks this invariant numerically.

  • M. Zhu and S. Martínez, "Discrete-time dynamic average consensus," Automatica, 2010.
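A numerical check of the invariant, assuming a doubly stochastic mixing matrix on a small ring graph; the graph, signal model, and sizes are illustrative.

```python
import numpy as np

n, T = 5, 200
W = np.zeros((n, n))                 # doubly stochastic mixing on a ring
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25

rng = np.random.default_rng(0)
r_prev = rng.standard_normal(n)      # r^0
q = r_prev.copy()                    # initialize q^0 = r^0
for t in range(1, T):
    r = r_prev + 0.01 * rng.standard_normal(n)   # slowly varying r^t
    q = W @ q + (r - r_prev)                      # mixing + correction
    r_prev = r

print(np.isclose(q.sum(), r_prev.sum()))  # invariant 1^T q^t = 1^T r^t
print(np.abs(q - r_prev.mean()).max())    # each q_i stays near the average
```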

SLIDES 22-23

Gradient tracking

$$x_j^t \Leftarrow \mathrm{LocalUpdate}\big(f_j,\; \underbrace{s_j^t}_{\text{replaces } \nabla f(x^t)},\; \underbrace{y_j^t}_{\text{replaces } x^t}\big)$$

  • Parameter averaging: $y_j^t = \sum_{k\in N_j} w_{jk}\, x_k^{t-1}$;
  • Gradient tracking: $s_j^t = \sum_{k\in N_j} w_{jk}\, s_k^{t-1} + \nabla f_j(y_j^t) - \nabla f_j(y_j^{t-1})$.

We can now apply the same DANE and SVRG-type local updates! A sketch of one communication round follows below.
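A minimal sketch of one round combining both updates, assuming a doubly stochastic mixing matrix $W$ (with $w_{jk} = 0$ for $k \notin N_j$, so dense matrix products realize the neighborhood sums) and per-agent gradient oracles; all names are illustrative.

```python
import numpy as np

def gradient_tracking_round(W, X_prev, S_prev, Y_prev, grad_f):
    """W: (n, n) mixing matrix; X_prev, S_prev, Y_prev: (n, d) stacks of
    local iterates, gradient trackers, and previous averaged parameters;
    grad_f[j]: gradient oracle of f_j. Returns the new (Y, S), which feed
    LocalUpdate(f_j, S[j], Y[j]) at each agent."""
    Y = W @ X_prev                                  # parameter averaging
    g_new = np.stack([grad_f[j](Y[j]) for j in range(len(grad_f))])
    g_old = np.stack([grad_f[j](Y_prev[j]) for j in range(len(grad_f))])
    S = W @ S_prev + g_new - g_old                  # gradient tracking
    return Y, S
```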

SLIDE 24

Linear Regression: Well-Conditioned

$$f_i(x) = \|y_i - A_i x\|_2^2, \qquad A_i \in \mathbb{R}^{1000 \times 40}$$

[Figure: optimality gap $(f(\bar{x}^{(t)}) - f^\star)/f^\star$ vs. #iters and vs. #grads/#samples, comparing DANE, DGD, Network-DANE, Network-SVRG, and Network-SARAH.]

Figure: The optimality gap w.r.t. iterations and gradient evaluations. The condition number $\kappa = 10$. ER graph ($p = 0.3$), 20 agents.

SLIDE 25

Linear Regression: Ill-Conditioned

$$f_i(x) = \|y_i - A_i x\|_2^2, \qquad A_i \in \mathbb{R}^{1000 \times 40}$$

[Figure: optimality gap $(f(\bar{x}^{(t)}) - f^\star)/f^\star$ vs. #iters and vs. #grads/#samples, comparing DANE, DGD, Network-DANE, Network-SVRG, and Network-SARAH.]

Figure: The optimality gap w.r.t. iterations and gradient evaluations. The condition number $\kappa = 10^4$. ER graph ($p = 0.3$), 20 agents.

SLIDES 26-28

Extra Mixing

The mixing rate of the graph is $\alpha_0 = 0.922$. A single round of mixing within each iteration cannot ensure the convergence of Network-SVRG.

[Figure: optimality gap $(f(\bar{x}^{(t)}) - f^\star)/f^\star$ of Network-SVRG vs. #iters and vs. $K\cdot$#iters, for $K = 1, 2, 5, 20, 50$.]
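Extra mixing applies the mixing matrix $K$ times per iteration; for a symmetric doubly stochastic $W$, this drives the effective mixing rate from $\alpha_0$ to $\alpha_0^K$. A tiny sketch under that assumption, with $\alpha_0 = 0.922$ taken from the slide:

```python
# Effective mixing rate after K mixing rounds per iteration, assuming the
# rate of W^K is alpha_0**K; alpha_0 = 0.922 is the value from the slide.
alpha0 = 0.922
for K in (1, 2, 5, 20, 50):
    print(f"K = {K:2d}: effective mixing rate = {alpha0 ** K:.3e}")
```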

SLIDE 29

Final remarks

  • Gradient tracking provides a way to extend master/slave algorithms (DANE and SVRG) to network settings;
  • Probing computation-communication trade-offs by employing different local updates and extra mixing.

Future work:

  • Convergence analysis in the nonconvex case.

Thank you!

  • B. Li, S. Cen, Y. Chen, and Y. Chi, "Communication-Efficient Distributed Optimization in Networks with Gradient Tracking and Variance Reduction," JMLR, 2020.