SLIDE 1

Average-Case Acceleration Through Spectral Density Estimation and Universal Asymptotic Optimality of Polyak Momentum

Fabian Pedregosa (Google Brain, Montreal) Damien Scieur (SAIT AI Lab, Montreal)

SLIDE 2-3

Motivation

Consider the convex, quadratic optimization problem

    min_{x∈ℝᵈ} f(x) = ½ (x − x⋆)ᵀ H (x − x⋆) + f⋆.

Efficient methods:

  • Conjugate gradients ("most optimal")
  • Chebyshev acceleration, 1st kind (worst-case optimal)
  • Polyak heavy-ball method (asymptotically worst-case optimal)

SLIDE 4

Polyak Heavy-Ball

The Polyak momentum algorithm, for ℓI ≼ H ≼ LI:

    x_{t+1} = x_t − h ∇f(x_t) + m (x_t − x_{t−1}),   where h = 4/(√L + √ℓ)²,  m = ((√L − √ℓ)/(√L + √ℓ))².

  • Requires knowledge of ℓ and L.
  • Easy to use (widely used in deep learning).
  • Works well on non-quadratic problems (deterministic or stochastic).
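A minimal numpy sketch of this iteration on a synthetic quadratic, assuming a full-gradient oracle and known extreme eigenvalues (the matrix construction and variable names are illustrative, not from the talk):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 50
    A = rng.standard_normal((d, d))
    H = A @ A.T / d + 0.1 * np.eye(d)     # symmetric positive-definite Hessian
    x_star = rng.standard_normal(d)
    grad = lambda x: H @ (x - x_star)     # gradient of 1/2 (x - x*)^T H (x - x*)

    eigs = np.linalg.eigvalsh(H)
    ell, L = eigs[0], eigs[-1]            # assumed known here
    h = 4 / (np.sqrt(L) + np.sqrt(ell)) ** 2
    m = ((np.sqrt(L) - np.sqrt(ell)) / (np.sqrt(L) + np.sqrt(ell))) ** 2

    x_prev = x = np.zeros(d)
    for _ in range(300):
        x, x_prev = x - h * grad(x) + m * (x - x_prev), x
    print(np.linalg.norm(x - x_star))     # distance to the minimizer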

SLIDE 5

Deep learning and large-scale problems

In deep learning, we solve

    min_{x∈ℝᵈ} Σ_{i=1}^N f_i(x)   for huge d.

Consequences:

  • The minimum eigenvalue ℓ is extremely hard to estimate.
  • The objective behaves like a quadratic when using gradient descent.
  • Nice statistical properties, such as an expected spectral density.

SLIDE 6

[Figure 1: Empirical vs expected spectral density of ∇²f(x); x-axis: eigenvalue, y-axis: density.]

SLIDE 7

In this talk

We study average-case convergence on quadratic problems.

  • Standard optimization methods only use ℓ and L. What if we use the expected density function instead?
  • How do we build average-case optimal methods for a given spectral density? Can we get rid of ℓ?
  • What is the asymptotic behavior of these optimal methods?

SLIDE 8

Part 1: Spectral density, optimal methods and orthogonal polynomials
SLIDE 9-10

Setting

Consider a class of convex, quadratic optimization problems

    min_{x∈ℝᵈ} ½ (x − x⋆)ᵀ H (x − x⋆) + f⋆.

For simplicity, assume that H is sampled randomly from some unknown distribution. We define the expected spectral density µ of H by

    P(λ_i(H) ∈ [a, b]) = ∫_a^b dµ,   for random i and H.

Remark: we are not interested in knowing the distribution over H!

SLIDE 11

Beyond the condition number: spectral density

We know the distribution of the eigenvalues of H.

[Figure: a spectral density, annotated "Likely to have eigenvalues here. Unlikely to see them over here."]

SLIDE 12-13

First-order methods and polynomials

We will use first-order methods to solve the quadratic problem:

    x_t ∈ x_0 + span{∇f(x_0), . . . , ∇f(x_{t−1})}.

Main property. The error at iteration t is a residual polynomial of degree t in H:

    x_t − x⋆ = P_t(H) (x_0 − x⋆),   with P_t(0) = 1.

Example: gradient descent. Since ∇f(x_{t−1}) = H(x_{t−1} − x⋆),

    x_t − x⋆ = (x_{t−1} − x⋆) − h H (x_{t−1} − x⋆) = (I − hH)^t (x_0 − x⋆) = P_t^{Grad}(H) (x_0 − x⋆),   with P_t^{Grad}(λ) = (1 − hλ)^t.
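A quick numerical check of this polynomial identity for gradient descent (a toy setup; the names are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    d = 20
    A = rng.standard_normal((d, d))
    H = A @ A.T / d + 0.5 * np.eye(d)     # positive-definite Hessian
    x_star, x0 = rng.standard_normal(d), rng.standard_normal(d)
    h = 1.0 / np.linalg.eigvalsh(H)[-1]   # stepsize 1/L

    t, x = 10, x0.copy()
    for _ in range(t):                    # t steps of gradient descent
        x = x - h * H @ (x - x_star)

    # Error equals P_t(H)(x0 - x*) with P_t(H) = (I - hH)^t
    Pt_H = np.linalg.matrix_power(np.eye(d) - h * H, t)
    print(np.allclose(x - x_star, Pt_H @ (x0 - x_star)))   # True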

SLIDE 14

From algorithms to polynomials

All first-order methods are polynomials*, and all polynomials* are first-order methods!

(*If P_t(0) = 1, i.e., P_t is a residual polynomial.)

SLIDE 15

Comparison of Polynomials

Visualizing the polynomials for gradient descent and momentum:

  • Gradient descent: x_t = x_{t−1} − h ∇f(x_{t−1}).
  • Optimal momentum: x_t = x_{t−1} − h_t ∇f(x_{t−1}) + m_t (x_{t−1} − x_{t−2}).

The worst-case rate of convergence is given by the largest value of the squared polynomial over the spectrum:

    ‖x_t − x⋆‖² ≤ ‖x_0 − x⋆‖² max_{λ∈[ℓ,L]} P_t²(λ).
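A short sketch that evaluates this worst-case factor on a grid, here for the gradient-descent polynomial P_t(λ) = (1 − hλ)^t (a discretization for illustration, not an exact maximization):

    import numpy as np

    ell, L, t = 0.1, 2.0, 10
    h = 2.0 / (ell + L)                  # classic worst-case-tuned stepsize
    lam = np.linspace(ell, L, 10_000)    # grid over the spectrum [ell, L]
    worst = np.max(((1 - h * lam) ** t) ** 2)
    print(worst)                         # bound factor: max of P_t^2 over [ell, L]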

SLIDE 16-18

Residual Polynomial for Momentum

[Figure: the residual polynomial p_t on [λ_min, λ_max], for t = 4, 6, 12. Gradient descent: p_t(z) = (1 − 2z/(λ_min + λ_max))^t; momentum: p_t given by Chebyshev polynomials.]

SLIDE 19-20

The worst-case rate of convergence is given by the largest value:

    ‖x_t − x⋆‖² ≤ ‖x_0 − x⋆‖² max_{λ∈[ℓ,L]} P_t²(λ).

What about the average case?

Proposition. If the eigenvalues λ_i of H are distributed according to µ, then

    E_H ‖x_t − x⋆‖² ≤ ‖x_0 − x⋆‖² ∫ P_t² dµ.

Note: the expectation is taken over the inputs. Contrary to the worst case, the average case is aware of the whole spectrum of the matrix H!
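A Monte Carlo sanity check of this proposition for gradient descent, in a toy model where H is diagonal with eigenvalues drawn uniformly from [ℓ, L] (a setup chosen for illustration; the integral is taken as a fine grid average):

    import numpy as np

    rng = np.random.default_rng(4)
    ell, L, d, t = 0.1, 2.0, 100, 15
    h = 2.0 / (ell + L)                       # gradient-descent stepsize
    e0 = rng.standard_normal(d)
    e0 /= np.linalg.norm(e0)                  # initial error x0 - x*, norm 1

    # Empirical average of ||x_t - x*||^2 over random diagonal H
    sq_errors = []
    for _ in range(2000):
        lam = rng.uniform(ell, L, size=d)     # eigenvalues ~ mu = Uniform[ell, L]
        sq_errors.append(np.sum((e0 * (1 - h * lam) ** t) ** 2))
    print(np.mean(sq_errors))

    # Right-hand side: integral of P_t^2 dmu, on a fine grid
    grid = np.linspace(ell, L, 100_001)
    print(np.mean((1 - h * grid) ** (2 * t)))  # matches the value above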

SLIDE 21-22

Optimal Worst Case vs Optimal Average Case

The optimal worst-case method solves

    min_{P: P(0)=1} max_{λ∈[ℓ,L]} P²(λ).

The unique solution is given by the Chebyshev polynomials of the first kind, depending only on ℓ and L.

The optimal average-case method solves

    min_{P: P(0)=1} ∫ P² dµ.

The solution depends on µ, and uses the concept of orthogonal residual polynomials.

SLIDE 23

Optimal Polynomial

Proposition (e.g., Bernd Fischer). If {P_i} is a sequence of residual orthogonal polynomials w.r.t. λµ(λ), i.e.,

    ∫ P_i(λ) P_j(λ) d[λµ(λ)]  = 0 if i ≠ j,  > 0 if i = j,

then the degree-t element solves

    P_t ∈ argmin_{P: P(0)=1, deg(P)=t} ∫ P² dµ.

SLIDE 24

From polynomials to algorithms

The optimal polynomials come from an orthogonal basis, and therefore follow a two-term recursion!

Proposition. Let {P_1, P_2, . . .} be residual orthogonal polynomials. Then, for some m_i and h_i (functions of λµ(λ)),

    P_i(λ) = P_{i−1}(λ) − h_i λ P_{i−1}(λ) + m_i (P_{i−1}(λ) − P_{i−2}(λ)).

The optimal average-case algorithm reads

    x_t = x_{t−1} − h_t ∇f(x_{t−1}) + m_t (x_{t−1} − x_{t−2}).

SLIDE 25-28

The recipe to create your own optimal method!

  1. Find the distribution µ of the eigenvalues of H, or of ∇²f(x).
  2. Find a sequence of orthogonal polynomials P_t w.r.t. λµ(λ). (This gives you m_t and h_t.)
  3. Iterate over t:

    x_t = x_{t−1} − h_t ∇f(x_{t−1}) + m_t (x_{t−1} − x_{t−2}).
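Step 3 as a small, generic driver that takes the coefficient sequences as callables (a sketch; the function name and signature are illustrative):

    import numpy as np

    def average_case_momentum(grad, x0, h, m, n_iters):
        """Iterate x_t = x_{t-1} - h(t)*grad(x_{t-1}) + m(t)*(x_{t-1} - x_{t-2})."""
        x_prev = x = np.asarray(x0, dtype=float)
        for t in range(1, n_iters + 1):
            x, x_prev = x - h(t) * grad(x) + m(t) * (x - x_prev), x
        return x

    # Usage on a toy quadratic with constant (Polyak-style) coefficients:
    H = np.diag([0.5, 1.0, 2.0])
    x_star = np.ones(3)
    x = average_case_momentum(grad=lambda x: H @ (x - x_star),
                              x0=np.zeros(3),
                              h=lambda t: 0.5, m=lambda t: 0.25, n_iters=200)
    print(np.linalg.norm(x - x_star))   # close to 0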

SLIDE 29

Part 2: Spectral density estimation

SLIDE 30

Examples of spectral densities

In the paper we study three different cases:

  • Uniform distribution on [ℓ, L].
  • Exponential decay, µ(λ) = e^{−λ}: a midpoint between quadratic convex optimization and convex non-smooth optimization.
  • Marchenko-Pastur distribution: the typical expected spectral distribution of ∇²f(x⋆) for deep neural networks.

SLIDE 31

Exponential decay

Assume the spectral density of H is µ(λ) = e^{−λ/λ₀}.

Optimal algorithm:

    x_t = x_{t−1} − 1/(λ₀(t+1)) ∇f(x_{t−1}) + (t−1)/(t+1) (x_{t−1} − x_{t−2}).

Very close to stochastic averaged gradient for quadratics [Flammarion and Bach, 2015].

Rate of convergence:

    E_H ‖x_t − x⋆‖² = 1/(λ₀(t+1)) ‖x_0 − x⋆‖².
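A sketch of this method on a toy diagonal problem, using the coefficients as reconstructed above (they should be checked against the paper; the slow, averaged-gradient-like decay is the behavior to look for):

    import numpy as np

    rng = np.random.default_rng(2)
    d, lam0 = 500, 1.0
    eigs = rng.exponential(scale=lam0, size=d)   # eigenvalues with density ~ e^{-lam/lam0}
    x_star = rng.standard_normal(d)
    grad = lambda x: eigs * (x - x_star)         # diagonal quadratic

    h = lambda t: 1.0 / (lam0 * (t + 1))         # reconstructed stepsize
    m = lambda t: (t - 1) / (t + 1)              # reconstructed momentum

    x_prev = x = np.zeros(d)
    for t in range(1, 2001):
        x, x_prev = x - h(t) * grad(x) + m(t) * (x - x_prev), x
    print(np.linalg.norm(x - x_star) ** 2)       # squared error after 2000 steps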

SLIDE 32-33

Marchenko-Pastur distribution

We now study the Marchenko-Pastur distribution:

    µ(λ) = (1 − r)₊ δ₀(λ) + √((L − λ)(λ − ℓ)) / (2πσ²λ) · 1_{λ∈[ℓ,L]},   with ℓ := σ²(1 − √r)², L := σ²(1 + √r)².

  • σ² is the variance.
  • σ²r is the mean.
  • Zero eigenvalues are present when r < 1!

Motivation: for a certain class of nonlinear activation functions, the Hessian spectrum of a neural network follows the MP distribution [Pennington et al., 2018].

SLIDE 34

[Figure 2: MP distribution for different values of r.]

SLIDE 35-37

Optimal for the MP distribution

The optimal polynomials for the MP distribution are Chebyshev polynomials of the 2nd kind. (Fun fact: Chebyshev polynomials of the 1st kind are optimal for the worst case!)

Asymptotic (t → ∞) version:

    x_t = x_{t−1} − (1/σ²) min{1/r, 1} ∇f(x_{t−1}) + min{1/r, r} (x_{t−1} − x_{t−2}).

Main advantage: σ and r can be estimated cheaply!

    σ²r ≈ tr(H)/d,   σ² ≈ tr(H²)/d − r².
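A sketch of the asymptotic method on a synthetic Wishart problem (the construction H = XᵀX/n with r = d/n is my illustrative convention, not from the talk; note that the coefficients require neither ℓ nor λ_min):

    import numpy as np

    rng = np.random.default_rng(3)
    n, d, sigma = 300, 100, 1.0           # ratio r = d/n < 1 here
    X = sigma * rng.standard_normal((n, d))
    H = X.T @ X / n                       # Wishart-type Hessian with MP spectrum
    r = d / n
    x_star = rng.standard_normal(d)
    grad = lambda x: H @ (x - x_star)

    # Asymptotically optimal coefficients; in practice sigma and r would be
    # estimated from tr(H) and tr(H^2), e.g. via Hessian-vector products.
    h = min(1.0 / r, 1.0) / sigma ** 2
    m = min(1.0 / r, r)

    x_prev = x = np.zeros(d)
    for _ in range(200):
        x, x_prev = x - h * grad(x) + m * (x - x_prev), x
    print(np.linalg.norm(x - x_star))     # converges without knowing ell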

SLIDE 38-43

Optimal Polynomial for the MP distribution

[Figure: the residual polynomial p_t overlaid on the MP density, for t = 4, 6, 8, 10, 12, 14: momentum (Chebyshev polynomials) vs the MP-optimal p_t.]

SLIDE 44

Numerical experiments

SLIDE 45

Real data I

Least squares, f(x) = ‖Mx − b‖², on the Digits dataset (n = 1797, p = 64).

[Figure: eigenvalue density, and suboptimality f(x_t) − f⋆ vs iteration for Gradient Descent, Nesterov (Cvx), Nesterov (StrCvx), Heavy-ball, and Density acceleration.]

Note: GD and Density acceleration don't have access to λ_min.

SLIDE 46

Real data II

Least squares, f(x) = ‖Mx − b‖², on the Covtype dataset (n = 581012, p = 54).

[Figure: eigenvalue density, and suboptimality f(x_t) − f⋆ vs iteration for the same methods.]

Note: GD and Density acceleration don't have access to λ_min.

SLIDE 47

Part 3: Asymptotic Universal Optimality of Polyak Momentum

SLIDE 48-49

That's curious...

This project started with a simple observation.

Optimal worst-case method (Chebyshev, 1st kind), t → ∞:

    x_t = x_{t−1} − 4/(√L + √ℓ)² ∇f(x_{t−1}) + ((√L − √ℓ)/(√L + √ℓ))² (x_{t−1} − x_{t−2}).

Optimal method for MP (Chebyshev, 2nd kind), r > 1 and t → ∞:

    x_t = x_{t−1} − 1/(σ²r) ∇f(x_{t−1}) + (1/r) (x_{t−1} − x_{t−2}).

Replacing ℓ = σ²(1 − √r)², L = σ²(1 + √r)², we get the same method!
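Indeed, for r > 1 we have √L = σ(√r + 1) and √ℓ = σ(√r − 1), so the substitution checks out:

    √L + √ℓ = 2σ√r   ⟹   4/(√L + √ℓ)² = 4/(4σ²r) = 1/(σ²r),
    √L − √ℓ = 2σ     ⟹   ((√L − √ℓ)/(√L + √ℓ))² = (2σ/(2σ√r))² = 1/r.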

SLIDE 51

Main results

Theorem (Scieur and Pedregosa, 2020). Let λµ(λ) be defined on [ℓ, L], and assume λµ(λ) > 0. Then, as t → ∞, the stepsize and momentum of the optimal method converge to

    h_t → 4/(√L + √ℓ)²,   m_t → ((√L − √ℓ)/(√L + √ℓ))²,

which are exactly those of the Polyak heavy-ball method. This implies that Polyak heavy-ball is asymptotically optimal!

SLIDE 52

Main results

Theorem (Scieur and Pedregosa, 2020). Asymptotically, the rate of the optimal method converges to the rate of Polyak momentum:

    lim_{t→∞} ( E_H ‖x_t − x⋆‖² / ‖x_0 − x⋆‖² )^{1/t} = ((√L − √ℓ)/(√L + √ℓ))².

SLIDE 53

[Figure: stepsize and momentum of the optimal method, and rate suboptimality of heavy-ball, vs iteration counter (= polynomial degree), for nine combinations of two parameters with values in {−0.5, 0, 0.5}.]
SLIDE 54-56

Summary

  • We introduced a framework to study the expected rate of convergence (over the inputs).
  • We derived practical methods for the case where we have a good idea of the spectral density function.
  • We showed that, under mild assumptions, those methods converge to Polyak heavy-ball.

SLIDE 57-58

Samsung SAIL Montreal is recruiting!

(Benefits package may include a typical Korean dinner.) Contact SAIL.montreal.lab@gmail.com for more info :-)

SLIDE 59

Thank you!

Our papers:

  • Pedregosa, Scieur (2020), Average-case Acceleration Through Spectral Density Estimation. arXiv:2002.04756.
  • Scieur, Pedregosa (2020), Universal Average-Case Optimality of Polyak Momentum. arXiv:2002.04664.