Average-Case Acceleration Through Spectral Density Estimation and Universal Asymptotic Optimality of Polyak Momentum
Fabian Pedregosa (Google Brain, Montreal) and Damien Scieur (SAIT AI Lab, Montreal)
Motivation
Consider the convex, quadratic optimization problem
min_{x ∈ R^d} f(x) = (1/2)(x − x⋆)ᵀ H (x − x⋆) + f⋆.
Efficient methods:
- Conjugate gradients ("most optimal")
- Chebyshev (1st kind) acceleration (worst-case optimal)
- Polyak heavy-ball method (asymptotically worst-case optimal)
Polyak Heavy-Ball
The Polyak momentum algorithm, for ℓI ⪯ H ⪯ LI:
x_{t+1} = x_t − h ∇f(x_t) + m (x_t − x_{t−1}), where h = 4/(√L + √ℓ)², m = ((√L − √ℓ)/(√L + √ℓ))².
- Requires knowledge of ℓ and L.
- Easy to use (widely used in deep learning).
- Works well on non-quadratic problems (deterministic or stochastic).
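As a concrete illustration, here is a minimal NumPy sketch of this update on the quadratic f(x) = ½(x − x⋆)ᵀH(x − x⋆); the function name heavy_ball and its arguments are illustrative, not from the paper.

```python
import numpy as np

def heavy_ball(H, x_star, x0, ell, L, n_iter=100):
    """Polyak heavy-ball on f(x) = 0.5 (x - x_star)^T H (x - x_star),
    assuming ell * I <= H <= L * I (illustrative sketch)."""
    h = 4 / (np.sqrt(L) + np.sqrt(ell)) ** 2                               # step size
    m = ((np.sqrt(L) - np.sqrt(ell)) / (np.sqrt(L) + np.sqrt(ell))) ** 2   # momentum
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(n_iter):
        grad = H @ (x - x_star)                       # gradient of the quadratic
        x, x_prev = x - h * grad + m * (x - x_prev), x
    return x
```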
Deep learning and large-scale problems
In deep learning, we solve min_{x ∈ R^d} Σ_{i=1}^N f_i(x) for huge d. Consequences:
- The minimum eigenvalue ℓ is extremely hard to estimate.
- The objective behaves like a quadratic when using gradient descent.
- The Hessian has nice statistical properties, such as an expected spectral density.
Figure 1: Empirical vs. expected spectral density of ∇²f(x).
In this talk
We study the average-case convergence on quadratic problems.
- Standard optimization methods only use ℓ and L. What if we use the expected spectral density instead?
- How do we build average-case optimal methods for a given spectral density? Can we get rid of ℓ?
- What is the asymptotic behavior of these optimal methods?
Part 1: Spectral density, optimal methods and orthogonal polynomials
Setting
Consider a class of convex, quadratic optimization problems
min_{x ∈ R^d} (1/2)(x − x⋆)ᵀ H (x − x⋆) + f⋆.
For simplicity, assume that H is sampled randomly from some unknown distribution. We define the expected spectral density µ of H by
P(λ_i(H) ∈ [a, b]) = ∫_a^b dµ, for random i and H.
Remark: we are not interested in knowing the distribution over H!
Beyond the condition number: spectral density
We know the distribution of the eigenvalues of H.
Figure: spectral density. Eigenvalues are likely to appear where the density is high, and unlikely where it is low.
First-order methods and polynomials
We will use first-order methods to solve the quadratic problem:
x_t ∈ x_0 + span{∇f(x_0), ..., ∇f(x_{t−1})}.
Main property. The error at iteration t is a residual polynomial of degree t in H:
x_t − x⋆ = P_t(H)(x_0 − x⋆),  P_t(0) = 1.
Example: gradient descent.
x_t − x⋆ = x_{t−1} − x⋆ − h ∇f(x_{t−1}) = (I − hH)(x_{t−1} − x⋆)
         = (I − hH)^t (x_0 − x⋆) = P_t^Grad(H)(x_0 − x⋆), with P_t^Grad(λ) = (1 − hλ)^t.
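A quick numerical sanity check of this identity for gradient descent; this is a sketch with a randomly generated Hessian, nothing here is specific to the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 20, 10
A = rng.standard_normal((d, d))
H = A @ A.T / d + 0.1 * np.eye(d)          # a random positive-definite Hessian
x_star, x0 = rng.standard_normal(d), rng.standard_normal(d)

h = 1.0 / np.linalg.eigvalsh(H).max()      # step size h = 1/L
x = x0.copy()
for _ in range(t):
    x = x - h * H @ (x - x_star)           # gradient step: grad f(x) = H (x - x_star)

# Residual-polynomial form of the same error: P_t(H) (x0 - x_star) with P_t(lam) = (1 - h*lam)^t
P_t_of_H = np.linalg.matrix_power(np.eye(d) - h * H, t)
print(np.allclose(x - x_star, P_t_of_H @ (x0 - x_star)))   # True
```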
From algorithm to polynomials
All first-order methods are polynomials∗ and all polynomials∗ are first-order methods!
(∗ if P_t(0) = 1, i.e., P_t is a residual polynomial.)
Comparison of Polynomials
Visualizing the polynomial for gradient descent and momentum.
- Gradient descent: x_t = x_{t−1} − h ∇f(x_{t−1}).
- Optimal momentum: x_t = x_{t−1} − h_t ∇f(x_{t−1}) + m_t (x_{t−1} − x_{t−2}).
The worst-case rate of convergence is given by the largest value of the polynomial over the spectrum:
‖x_t − x⋆‖² ≤ ‖x_0 − x⋆‖² · max_{λ ∈ [ℓ, L]} P_t²(λ).
Residual Polynomial for Momentum
Figure: residual polynomials p_t (t = 4, 6, 12) on [λ_min, λ_max] for gradient descent, p_t(z) = (1 − 2z/(λ_min + λ_max))^{2t}, and for momentum (Chebyshev polynomials).
The worst-case rate of convergence is given by the largest value of the polynomial over the spectrum:
‖x_t − x⋆‖² ≤ ‖x_0 − x⋆‖² · max_{λ ∈ [ℓ, L]} P_t²(λ).
What about the average case?
Proposition. If the eigenvalues λ_i of H are distributed according to µ, then
E_H[‖x_t − x⋆‖²] ≤ ‖x_0 − x⋆‖² · ∫ P_t² dµ.
Note: the expectation is taken over the inputs. Contrary to the worst case, the average case is aware of the whole spectrum of the matrix H!
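To make the proposition concrete, here is a small Monte Carlo sketch for gradient descent on random quadratics: the left-hand side averages the actual squared error over sampled H, the right-hand side uses the sampled eigenvalues as an empirical µ. The Wishart-type ensemble and the step size are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_samples, t, h = 30, 200, 8, 0.05      # gradient descent: P_t(lam) = (1 - h*lam)^t
u = rng.standard_normal(d)                 # x0 - x_star, kept fixed across instances

lhs, rhs = 0.0, 0.0
for _ in range(n_samples):
    X = rng.standard_normal((2 * d, d))
    H = X.T @ X / (2 * d)                  # a random (Wishart-type) Hessian
    err = np.linalg.matrix_power(np.eye(d) - h * H, t) @ u
    lhs += err @ err                       # ||x_t - x_star||^2 for this instance
    lam = np.linalg.eigvalsh(H)
    rhs += (u @ u) * np.mean((1 - h * lam) ** (2 * t))   # ||x0 - x_star||^2 * int P_t^2 dmu

print(lhs / n_samples, rhs / n_samples)    # the two averages should be close
```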
Optimal Worst Case vs Optimal Average Case
The optimal worst-case method solves
min_{P: P(0)=1} max_{λ ∈ [ℓ, L]} P²(λ).
The unique solution is given by the Chebyshev polynomials of the first kind, depending only on ℓ and L.
The optimal average-case method solves
min_{P: P(0)=1} ∫ P² dµ.
The solution depends on µ, and uses the concept of residual orthogonal polynomials.
Optimal Polynomial
Proposition (e.g., Bernd Fischer). If {P_i} is a sequence of residual orthogonal polynomials w.r.t. λµ(λ), i.e.,
∫ P_i(λ) P_j(λ) d[λµ(λ)] = 0 if i ≠ j, and > 0 otherwise,
then P_t solves
P_t ∈ argmin_{P: P(0)=1, deg(P)=t} ∫ P² dµ.
Polynomial to algorithms
The optimal polynomial comes from an orthogonal basis, and follows a short recursion involving only the two previous polynomials!
Proposition. Let {P_1, P_2, ...} be residual orthogonal polynomials. Then, for some m_i and h_i (functions of λµ(λ)),
P_i(λ) = P_{i−1}(λ) − h_i λ P_{i−1}(λ) + m_i (P_{i−1}(λ) − P_{i−2}(λ)).
The optimal average-case algorithm reads
x_t = x_{t−1} − h_t ∇f(x_{t−1}) + m_t (x_{t−1} − x_{t−2}).
The recipe to create your own optimal method!
1. Find the distribution µ of the eigenvalues of H, or ∇²f(x).
2. Find a sequence of orthogonal polynomials P_t w.r.t. λµ(λ) (it gives you m_t and h_t).
3. Iterate over t (a minimal sketch of this loop is given below):
   x_t = x_{t−1} − h_t ∇f(x_{t−1}) + m_t (x_{t−1} − x_{t−2}).
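Step 3 is just a momentum-type loop driven by the coefficient sequences from step 2. A minimal sketch, assuming the sequences (h_t, m_t) have already been computed from the orthogonal polynomials; the function name and signature are hypothetical.

```python
import numpy as np

def average_case_method(grad, x0, steps, momenta):
    """Run x_t = x_{t-1} - h_t * grad(x_{t-1}) + m_t * (x_{t-1} - x_{t-2}),
    where steps = [h_1, h_2, ...] and momenta = [m_1, m_2, ...] come from step 2."""
    x_prev, x = np.asarray(x0, dtype=float).copy(), np.asarray(x0, dtype=float).copy()
    for h_t, m_t in zip(steps, momenta):
        x, x_prev = x - h_t * grad(x) + m_t * (x - x_prev), x
    return x
```

With constant h_t = h and m_t = m, this loop reduces to the Polyak heavy-ball update above.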
Part 2: Spectral density estimation
Examples of spectral densities
In the paper we study three different cases:
- Uniform distribution on [ℓ, L].
- Exponential decay, µ(λ) = e^{−λ}: a midpoint between convex quadratic and convex non-smooth optimization.
- Marchenko-Pastur distribution: the typical expected spectral distribution of ∇²f(x⋆) for deep neural networks.
Exponential decay
Assume the spectral density of H is µ(λ) = e^{−λ/λ0}.
Optimal algorithm:
x_t = x_{t−1} − (λ0/(t+1)) ∇f(x_{t−1}) + ((t−1)/(t+1)) (x_{t−1} − x_{t−2}).
Very close to stochastic averaged gradient for quadratics [Flammarion and Bach, 2015].
Rate of convergence: E_H[‖x_t − x⋆‖²] = (1/(λ0(t+1))) ‖x_0 − x⋆‖².
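A sketch of this iteration, taking the coefficients exactly as displayed above (h_t = λ0/(t+1), m_t = (t−1)/(t+1)); these values are read off the reconstructed formula, so treat the sketch as an assumption rather than a verified implementation of the paper's method.

```python
import numpy as np

def exp_decay_method(grad, x0, lam0, n_iter=100):
    """Average-case method for spectral density mu(lambda) = exp(-lambda / lam0).
    Coefficients as read from the slide above (assumed): h_t = lam0/(t+1), m_t = (t-1)/(t+1)."""
    x_prev, x = np.asarray(x0, dtype=float).copy(), np.asarray(x0, dtype=float).copy()
    for t in range(1, n_iter + 1):
        h_t = lam0 / (t + 1)
        m_t = (t - 1) / (t + 1)              # m_1 = 0, so the first step is plain gradient descent
        x, x_prev = x - h_t * grad(x) + m_t * (x - x_prev), x
    return x
```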
Marchenko-Pastur distribution
We now study the Marchenko-Pastur distribution:
µ(λ) = (1 − r)₊ δ_0(λ) + (√((L − λ)(λ − ℓ)) / (2πσ²λ)) · 1_{λ ∈ [ℓ, L]},
with ℓ := σ²(1 − √r)², L := σ²(1 + √r)².
- σ² is the variance,
- σ²r is the mean,
- there is an atom at zero if r < 1!
Motivation: for a certain class of nonlinear activation functions, the spectrum of the Hessian of a neural network follows the MP distribution [Pennington et al., 2018].
Figure 2: MP distribution for different values of r.
Optimal for MP distribution
The optimal polynomials for the MP distribution are Chebyshev polynomials of the 2nd kind. (Fun fact: Chebyshev polynomials of the 1st kind are optimal for the worst case!)
Asymptotic (t → ∞) version:
x_t = x_{t−1} − (1/σ²) min{1/r, 1} ∇f(x_{t−1}) + min{1/r, r} (x_{t−1} − x_{t−2}).
Main advantage: σ and r can be estimated cheaply from tr(H)/d and tr(H²)/d, e.g. σ²r ≈ tr(H)/d.
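A sketch of the asymptotic MP method using the constant coefficients above; sigma2 and r are assumed to be known or estimated from the Hessian traces as mentioned, and the function name and signature are illustrative.

```python
import numpy as np

def mp_asymptotic_method(grad, x0, sigma2, r, n_iter=100):
    """Asymptotic average-case method for a Marchenko-Pastur spectrum with parameters
    sigma2 and r (e.g. estimated from tr(H)/d and tr(H^2)/d as described above)."""
    h = min(1.0 / r, 1.0) / sigma2           # step size: (1/sigma^2) * min{1/r, 1}
    m = min(1.0 / r, r)                      # momentum: min{1/r, r}
    x_prev, x = np.asarray(x0, dtype=float).copy(), np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        x, x_prev = x - h * grad(x) + m * (x - x_prev), x
    return x
```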
Optimal Polynomial for the MP distribution
Figure: residual polynomials p_t (t = 4 to 14) for momentum (Chebyshev polynomials) and for the MP-optimal method, overlaid on the MP density.
Numerical experiments
Real data I
Least squares, f(x) = ‖Mx − b‖², on the Digits dataset (n = 1797, p = 64).
Figure: eigenvalue density (left) and suboptimality f(x_t) − f⋆ vs. iteration (right) for Gradient Descent, Nesterov (Cvx), Nesterov (StrCvx), Heavy-ball, and Density acceleration.
Note: GD and Density acceleration don’t have access to λmin.
Real data II
Least squares, f(x) = ‖Mx − b‖², on the Covtype dataset (n = 581012, p = 54).
Figure: eigenvalue density (left) and suboptimality f(x_t) − f⋆ vs. iteration (right) for Gradient Descent, Nesterov (Cvx), Nesterov (StrCvx), Heavy-ball, and Density acceleration.
Note: GD and Density acceleration don’t have access to λmin.
Part 3: Asymptotic Universal Optimality of Polyak Momentum
That's curious...
This project started with a simple observation.
Optimal worst-case method (Chebyshev, 1st kind), t → ∞:
x_t = x_{t−1} − (4/(√L + √ℓ)²) ∇f(x_{t−1}) + ((√L − √ℓ)/(√L + √ℓ))² (x_{t−1} − x_{t−2}).
Optimal for MP (Chebyshev, 2nd kind), r > 1 and t → ∞:
x_t = x_{t−1} − (1/(σ²r)) ∇f(x_{t−1}) + (1/r)(x_{t−1} − x_{t−2}).
Replacing ℓ = σ²(1 − √r)², L = σ²(1 + √r)², we get the same method!
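A two-line numerical check of this identity; any σ² > 0 and r > 1 will do, the specific numbers below are arbitrary.

```python
import numpy as np

sigma2, r = 1.7, 2.5                            # arbitrary sigma^2 > 0 and r > 1
ell = sigma2 * (1 - np.sqrt(r)) ** 2            # edges of the MP support
L = sigma2 * (1 + np.sqrt(r)) ** 2

h_polyak = 4 / (np.sqrt(L) + np.sqrt(ell)) ** 2
m_polyak = ((np.sqrt(L) - np.sqrt(ell)) / (np.sqrt(L) + np.sqrt(ell))) ** 2

print(np.isclose(h_polyak, 1 / (sigma2 * r)))   # True: Polyak step size = MP step size
print(np.isclose(m_polyak, 1 / r))              # True: Polyak momentum = MP momentum
```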
Main results
Theorem (Scieur, Pedregosa, 2020). Let λµ(λ) be supported on [ℓ, L], and assume λµ(λ) > 0 there. Then, as t → ∞, the step size and momentum of the optimal average-case method converge to
h_t → 4/(√L + √ℓ)²,  m_t → ((√L − √ℓ)/(√L + √ℓ))²,
which are those of the Polyak heavy-ball method. This implies that Polyak heavy-ball is asymptotically optimal!
Main results
Theorem (Scieur, Pedregosa, 2020). Asymptotically, the rate of the optimal method converges to the rate of Polyak momentum:
lim_{t→∞} (E_H[‖x_t − x⋆‖²] / ‖x_0 − x⋆‖²)^{1/t} = ((√L − √ℓ)/(√L + √ℓ))².
Figure: suboptimality of the heavy-ball momentum, step size, and rate vs. the iteration counter (= polynomial degree), for several spectral densities.
Summary
- We introduce a framework to study the expected rate of convergence (over the inputs).
- We derive practical methods when we have a good idea of the spectral density function.
- We show that, under mild assumptions, those methods converge to the Polyak heavy-ball method.
Samsung SAIL Montreal is recruiting!
(Benefits package may include a typical Korean dinner.) Contact SAIL.montreal.lab@gmail.com for more info :-)
Thank you!
Our papers:
- Pedregosa, Scieur (2020). Average-case Acceleration Through Spectral Density Estimation. arXiv:2002.04756.
- Scieur, Pedregosa (2020). Universal Average-Case Optimality of Polyak Momentum. arXiv:2002.04664.