SLIDE 1

Acceleration for Compressed Gradient Descent in Distributed and Federated Optimization

Zhize Li

King Abdullah University of Science and Technology (KAUST), https://zhizeli.github.io
Joint work with Dmitry Kovalev (KAUST), Xun Qian (KAUST), and Peter Richtárik (KAUST)

ICML 2020

SLIDE 2

Overview

1. Problem
2. Related Work
3. Our Contributions
   - Single Device Setting
   - Distributed Setting
4. Experiments

SLIDE 3

Problem

Training distributed/federated learning models is typically performed by solving an optimization problem

$$\min_{x \in \mathbb{R}^d} \left\{ P(x) := \frac{1}{n} \sum_{i=1}^n f_i(x) + \psi(x) \right\},$$

where
- $f_i(x)$: loss function associated with the data stored on node/device $i$
- $\psi(x)$: regularization term (e.g., the $\ell_1$ regularizer $\|x\|_1$, the $\ell_2$ regularizer $\|x\|_2^2$, or the indicator function $I_C(x)$ of some set $C$)

SLIDE 4

Examples

$$\min_{x \in \mathbb{R}^d} \left\{ P(x) := \frac{1}{n} \sum_{i=1}^n f_i(x) + \psi(x) \right\}$$

Each node/device $i$ stores $m$ data samples $\{(a_{i,j}, b_{i,j}) \in \mathbb{R}^{d+1}\}_{j=1}^m$:

- Lasso regression: $f_i(x) = \frac{1}{m} \sum_{j=1}^m (a_{i,j}^\top x - b_{i,j})^2$, $\psi(x) = \lambda \|x\|_1$
- Logistic regression: $f_i(x) = \frac{1}{m} \sum_{j=1}^m \log\big(1 + \exp(-b_{i,j} a_{i,j}^\top x)\big)$
- SVM: $f_i(x) = \frac{1}{m} \sum_{j=1}^m \max\big(0,\, 1 - b_{i,j} a_{i,j}^\top x\big)$, $\psi(x) = \frac{\lambda}{2} \|x\|_2^2$
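
For concreteness, here is a minimal NumPy sketch (ours, not from the slides) of the three local objectives above; the helper names and array shapes are illustrative assumptions.

```python
import numpy as np

# Node i's local data: A has m rows a_{i,j}^T (shape m x d), b has m labels b_{i,j}.

def lasso_loss(x, A, b, lam):
    # f_i(x) + psi(x) = (1/m) sum_j (a_j^T x - b_j)^2 + lam * ||x||_1
    return np.mean((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def logistic_loss(x, A, b):
    # f_i(x) = (1/m) sum_j log(1 + exp(-b_j a_j^T x)), with labels b_j in {-1, +1}
    return np.mean(np.log1p(np.exp(-b * (A @ x))))

def svm_loss(x, A, b, lam):
    # f_i(x) + psi(x) = (1/m) sum_j max(0, 1 - b_j a_j^T x) + (lam/2) ||x||_2^2
    return np.mean(np.maximum(0.0, 1.0 - b * (A @ x))) + 0.5 * lam * np.dot(x, x)
```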

SLIDES 5-6

Goal

$$\min_{x \in \mathbb{R}^d} \left\{ P(x) := \frac{1}{n} \sum_{i=1}^n f_i(x) + \psi(x) \right\}$$

Goal: find an $\epsilon$-solution (parameters) $\hat{x}$, e.g., $P(\hat{x}) - P(x^*) \le \epsilon$ or $\|\hat{x} - x^*\|_2^2 \le \epsilon$, where $x^* := \arg\min_{x \in \mathbb{R}^d} P(x)$.

For optimization methods:
- Bottleneck: communication cost.
- Common strategy: compress the communicated messages (lower communication cost in each iteration/communication round) and hope that this will not increase the total number of iterations/communication rounds.

SLIDES 7-9

Related Work

- Several recent works show that the total communication complexity can be improved via compression; see, e.g., QSGD [Alistarh et al., 2017], DIANA [Mishchenko et al., 2019], and natural compression [Horváth et al., 2019].

- However, previous work usually leads to this kind of improvement:
  Communication cost per iteration (- -), Iterations (+) ⇒ Total (-)
  ('-' denotes decrease, '+' denotes increase)

- In this work, we provide the first optimization methods provably combining the benefits of gradient compression and acceleration:
  Communication cost per iteration (- -), Iterations (- -) ⇒ Total (- - - -)

SLIDES 10-11

Single Device Setting

- First, consider the simple single device (i.e., n = 1) case:
  $$\min_{x \in \mathbb{R}^d} f(x),$$
  where $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth, and convex or $\mu$-strongly convex.

- $f$ is $L$-smooth, i.e., has $L$-Lipschitz continuous gradient (for $L > 0$), if
  $$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|, \qquad (1)$$
  and $\mu$-strongly convex (for $\mu \ge 0$) if
  $$f(x) - f(y) - \langle \nabla f(y),\, x - y \rangle \ge \frac{\mu}{2} \|x - y\|^2 \qquad (2)$$
  for all $x, y \in \mathbb{R}^d$. The case $\mu = 0$ reduces to standard convexity.
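
As a standard worked example (ours, not on the slide), a convex quadratic makes the constants in (1) and (2) concrete:

```latex
\[
f(x) = \tfrac{1}{2}\, x^\top A x, \qquad A \ \text{symmetric},\ A \succeq 0,
\]
\[
\|\nabla f(x) - \nabla f(y)\| = \|A(x - y)\| \le \lambda_{\max}(A)\,\|x - y\|,
\]
% so (1) holds with L = \lambda_{\max}(A) and (2) holds with \mu = \lambda_{\min}(A);
% the condition number appearing in the rates below is L/\mu = \lambda_{\max}(A)/\lambda_{\min}(A).
```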

SLIDES 12-13

Compressed Gradient Descent (CGD)

- Problem: $\min_{x \in \mathbb{R}^d} f(x)$
  1) Given initial point $x^0$ and step size $\eta$
  2) CGD update: $x^{k+1} = x^k - \eta\, C(\nabla f(x^k))$, for $k \ge 0$

Definition (Compression operator)
A randomized map $C : \mathbb{R}^d \to \mathbb{R}^d$ is an $\omega$-compression operator if
$$\mathbb{E}[C(x)] = x, \qquad \mathbb{E}\big[\|C(x) - x\|^2\big] \le \omega \|x\|^2, \quad \forall x \in \mathbb{R}^d. \qquad (3)$$
In particular, no compression ($C(x) \equiv x$) implies $\omega = 0$. Note that Condition (3) is satisfied by many practical compression schemes, e.g., random-$k$ sparsification and $(p, s)$-quantization.
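
A minimal runnable sketch (ours) of CGD with random-$k$ sparsification, which satisfies (3) with $\omega = d/k - 1$; the toy quadratic objective and the step size choice are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_k(x, k):
    # Random-k sparsification: keep k coordinates at random, scale by d/k.
    # Unbiased: E[C(x)] = x; variance: E||C(x) - x||^2 = (d/k - 1) ||x||^2,
    # so it satisfies (3) with omega = d/k - 1.
    d = x.size
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0
    return (d / k) * mask * x

# Toy L-smooth, mu-strongly convex objective f(x) = (1/2) x^T A x (illustrative).
d, k = 100, 10
A = np.diag(np.linspace(1.0, 10.0, d))   # mu = 1, L = 10
grad = lambda x: A @ x

omega = d / k - 1
eta = 1.0 / ((1 + omega) * 10.0)         # step size ~ 1/((1+omega) L), a common choice
x = rng.standard_normal(d)
for _ in range(2000):
    x = x - eta * random_k(grad(x), k)   # CGD update: x_{k+1} = x_k - eta * C(grad f(x_k))
print(np.linalg.norm(x))                 # approaches 0 = x*
```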

SLIDE 14

Accelerated Compressed Gradient Descent (ACGD)

Inspired by Nesterov's accelerated gradient descent (AGD) [Nesterov, 2004] and FISTA [Beck and Teboulle, 2009], here we propose the first accelerated compressed gradient descent (ACGD) method. Our ACGD update:
1) $x^k = \alpha_k y^k + (1 - \alpha_k) z^k$
2) $y^{k+1} = x^k - \eta_k\, C(\nabla f(x^k))$
3) $z^{k+1} = \beta_k \big( \theta_k z^k + (1 - \theta_k) x^k \big) + (1 - \beta_k) \big( \gamma_k y^{k+1} + (1 - \gamma_k) y^k \big)$
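
The slide gives the three-step structure but not the parameter schedule (the paper sets $\alpha_k, \beta_k, \gamma_k, \theta_k, \eta_k$ from $L$, $\mu$, and $\omega$). A minimal skeleton under that caveat, with all scalars passed in as placeholders:

```python
import numpy as np

def acgd(grad, compress, x0, eta, alpha, beta, gamma, theta, iters):
    # ACGD skeleton following the slide; the scalar parameters are placeholders
    # (the paper derives their schedules from L, mu and omega).
    y = x0.copy()
    z = x0.copy()
    for _ in range(iters):
        x = alpha * y + (1.0 - alpha) * z                         # 1) extrapolated point
        y_new = x - eta * compress(grad(x))                       # 2) compressed gradient step
        z = beta * (theta * z + (1.0 - theta) * x) \
            + (1.0 - beta) * (gamma * y_new + (1.0 - gamma) * y)  # 3) momentum combination
        y = y_new
    return y
```

For example, `acgd(grad, lambda g: random_k(g, k), x0, ...)` plugs in the sparsifier from the previous sketch.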

SLIDES 15-16

Convergence Results in Single Device Setting

Table: Convergence results (iterations) for the single device ($n = 1$) case $\min_{x \in \mathbb{R}^d} f(x)$

| Algorithm | $\mu$-strongly convex $f$ | convex $f$ |
|---|---|---|
| Compressed Gradient Descent (CGD [Khirirat et al., 2018]) | $O\big((1+\omega)\, \frac{L}{\mu} \log \frac{1}{\epsilon}\big)$ | $O\big((1+\omega)\, \frac{L}{\epsilon}\big)$ |
| ACGD (this paper) | $O\big((1+\omega) \sqrt{\frac{L}{\mu}} \log \frac{1}{\epsilon}\big)$ | $O\big((1+\omega) \sqrt{\frac{L}{\epsilon}}\big)$ |

- If no compression (i.e., $\omega = 0$): CGD recovers the results of vanilla (uncompressed) GD, i.e., $O\big(\frac{L}{\mu} \log \frac{1}{\epsilon}\big)$ and $O\big(\frac{L}{\epsilon}\big)$.
- If the compression parameter $\omega \le O\big(\sqrt{L/\mu}\big)$ or $O\big(\sqrt{L/\epsilon}\big)$: our ACGD enjoys the benefits of compression and acceleration, i.e., both the communication cost per iteration (compression) and the total number of iterations (acceleration) are smaller than those of GD.
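
To make the improvement concrete, a back-of-the-envelope comparison with random-$k$ sparsification; the specific numbers are our own illustration, not from the slides.

```latex
% Random-k sparsification has \omega = d/k - 1 (see the sketch above).
% Take d = 10^6, k = 10^4 (so 1 + \omega = 100) and L/\mu = 10^8:
\[
\text{GD:}\quad \tfrac{L}{\mu}\log\tfrac{1}{\epsilon} = 10^8 \log\tfrac{1}{\epsilon}
\ \text{iterations} \times 10^6 \ \text{coordinates each},
\]
\[
\text{ACGD:}\quad (1+\omega)\sqrt{\tfrac{L}{\mu}}\log\tfrac{1}{\epsilon} = 10^6 \log\tfrac{1}{\epsilon}
\ \text{iterations} \times 10^4 \ \text{coordinates each},
\]
% i.e. 100x fewer iterations and 100x less communication per iteration
% (10^4 x fewer coordinates sent in total), consistent with the condition
% \omega = 99 \le O(\sqrt{L/\mu}) = O(10^4).
```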

SLIDE 17

Recall the Discussion in Related Work

- Previous work usually leads to this kind of improvement:
  Communication cost per iteration (- -), Iterations (+) ⇒ Total (-)
  ('-' denotes decrease, '+' denotes increase)

- In this work, we provide the first optimization methods provably combining the benefits of gradient compression and acceleration:
  Communication cost per iteration (- -), Iterations (- -) ⇒ Total (- - - -)

SLIDES 18-19

Distributed Setting

Now, we consider the general distributed setting with $n$ devices/nodes:

$$\min_{x \in \mathbb{R}^d} \left\{ P(x) := \frac{1}{n} \sum_{i=1}^n f_i(x) + \psi(x) \right\}.$$

The presence of multiple nodes ($n > 1$) and of the regularizer $\psi$ poses additional challenges. We propose a distributed variant of ACGD (called ADIANA), which can be seen as an accelerated version of DIANA [Mishchenko et al., 2019].

SLIDE 20

Accelerated DIANA (ADIANA)

Main update of our ADIANA:
1) $x^k = \theta_1 z^k + \theta_2 w^k + (1 - \theta_1 - \theta_2) y^k$
2i) All devices/nodes/machines compress the shifted local gradient $C_i^k(\nabla f_i(x^k) - h_i^k)$ in parallel and send it to the server
2ii) Update the local shift: $h_i^{k+1} = h_i^k + \alpha\, C_i^k(\nabla f_i(w^k) - h_i^k)$
3) Aggregate the received compressed gradient information: $g^k = \frac{1}{n} \sum_{i=1}^n C_i^k(\nabla f_i(x^k) - h_i^k) + h^k$, where $h^k = \frac{1}{n} \sum_{i=1}^n h_i^k$
4) Perform a proximal update step: $y^{k+1} = \mathrm{prox}_{\eta \psi}(x^k - \eta g^k)$
5) $z^{k+1} = \beta z^k + (1 - \beta) x^k + \frac{\gamma}{\eta}(y^{k+1} - x^k)$
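
A sketch (ours) of one ADIANA round following the slide's five steps; every scalar parameter here is a placeholder (the paper sets them from $L$, $\mu$, $\omega$, and $n$), and `prox` stands for a generic proximal operator of $\eta\psi$.

```python
import numpy as np

def adiana_step(grads, state, compress, prox, eta, alpha, beta, gamma, theta1, theta2):
    # One ADIANA round. grads[i](x) returns grad f_i(x) on node i;
    # state holds y, z, w and the list h of local shifts h_i.
    y, z, w, h = state["y"], state["z"], state["w"], state["h"]
    n = len(grads)
    x = theta1 * z + theta2 * w + (1 - theta1 - theta2) * y      # 1)
    msgs = [compress(grads[i](x) - h[i]) for i in range(n)]      # 2i) nodes -> server
    g = np.mean(msgs, axis=0) + np.mean(h, axis=0)               # 3) aggregate, h^k = avg of h_i^k
    for i in range(n):                                           # 2ii) local shift update
        h[i] = h[i] + alpha * compress(grads[i](w) - h[i])
    y_new = prox(x - eta * g, eta)                               # 4) proximal step, prox_{eta psi}
    z_new = beta * z + (1 - beta) * x + (gamma / eta) * (y_new - x)  # 5)
    state.update(y=y_new, z=z_new)
    # In the full method w is also refreshed occasionally (roughly, w <- y with
    # some probability); we leave that bookkeeping to the caller.
    return state
```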

SLIDES 21-23

Convergence Results in Distributed Setting

Table: Convergence results (iterations) for the general distributed case with $n$ devices (the result in the case $n < \omega$ can be found in Table 2 of our paper)

| Algorithm | In the case $n \ge \omega$ (lots of devices or low compression) |
|---|---|
| Distributed CGD (DIANA [Mishchenko et al., 2019]) | $O\big(\big(\omega + \frac{L}{\mu}\big) \log \frac{1}{\epsilon}\big)$ |
| ADIANA (this paper) | $O\big(\big(\omega + \sqrt{\frac{L}{\mu}} + \sqrt[4]{\frac{\omega}{n}} \sqrt{\frac{\omega L}{\mu}}\big) \log \frac{1}{\epsilon}\big)$ |

- Note that $\omega + \frac{L}{\mu} \ge 2\sqrt{\frac{\omega L}{\mu}}$ and $\frac{\omega}{n} \le 1$.
- If the compression parameter $\omega \le O\big(\min\big\{\sqrt{L/\mu},\ n^{1/3}\big\}\big)$: our ADIANA enjoys the benefits of compression and acceleration, i.e., lower communication cost per iteration (compression) and a smaller total number of iterations (acceleration): $\sqrt{\frac{L}{\mu}} \log \frac{1}{\epsilon}$ vs. $\frac{L}{\mu} \log \frac{1}{\epsilon}$.

SLIDES 24-25

Experiments

We demonstrate the performance of our accelerated distributed method ADIANA and previous methods with different compression operators on the regularized logistic regression problem

$$\min_{x \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \log\big(1 + \exp(-b_i a_i^\top x)\big) + \frac{\lambda}{2} \|x\|^2. \qquad (4)$$

Compression operators: we adopt three compression operators: random sparsification (see, e.g., [Stich et al., 2018]), random dithering (see, e.g., [Alistarh et al., 2017]), and natural compression (see, e.g., [Horváth et al., 2019]).
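
As an illustration of one of these compressors, here is a minimal random dithering sketch in the QSGD style; the level count $s$ and the use of $\ell_2$ normalization are our assumptions about the standard construction, not details given on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dithering(x, s=4):
    # QSGD-style dithering: C(x) = ||x||_2 * sign(x_i) * xi_i / s, where xi_i
    # rounds s * |x_i| / ||x||_2 to one of its two neighboring integer levels,
    # with probabilities chosen so that E[C(x)] = x (unbiased).
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return x.copy()
    level = s * np.abs(x) / norm                       # values in [0, s]
    lower = np.floor(level)
    xi = lower + (rng.random(x.size) < level - lower)  # unbiased stochastic rounding
    return norm * np.sign(x) * xi / s

# Usage: send random_dithering(grad) instead of the dense gradient grad.
```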

SLIDE 26

[Six plots omitted: $f(x^k) - f(x^*)$ versus communication bits on the a5a (top row) and mushrooms (bottom row) datasets, one panel per compressor, comparing DCGD, DIANA, and ADIANA.]

Figure: The communication complexity of three different methods (DCGD, DIANA, ADIANA) for three different compression operators (random sparsification, random dithering, natural compression) on the a5a (top) and mushrooms (bottom) datasets.

SLIDE 27

[Four plots omitted: $f(x^k) - f(x^*)$ versus communication bits on the a5a (top row) and mushrooms (bottom row) datasets under random dithering and natural compression, comparing DIANA and ADIANA against their uncompressed counterparts (UN_DIANA, UN_ADIANA).]

Figure: The communication complexity of DIANA and ADIANA with and without compression on the a5a (top) and mushrooms (bottom) datasets.

SLIDE 28

Conclusion

- We provide the first accelerated compressed gradient descent methods (ACGD for $n = 1$ and ADIANA for general $n > 1$), which combine the benefits of compression and acceleration.
- The experimental results validate our theoretical results and confirm the practical superiority of our accelerated methods.

SLIDE 29

Thanks!

Zhize Li

SLIDE 30

References

- Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709-1720, 2017.
- Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
- Samuel Horváth, Chen-Yu Ho, Ľudovít Horváth, Atal Narayan Sahu, Marco Canini, and Peter Richtárik. Natural compression for distributed deep learning. arXiv preprint arXiv:1905.10988, 2019.
- Sarit Khirirat, Hamid Reza Feyzmahdavian, and Mikael Johansson. Distributed learning with compressed gradients. arXiv preprint arXiv:1806.06573, 2018.
- Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
- Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
- Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pages 4447-4458, 2018.