Average-Case Acceleration Through Spectral Density Estimation and Universal Asymptotic Optimality of Polyak Momentum
Fabian Pedregosa (Google Brain, Montreal) and Damien Scieur (SAIT AI Lab, Montreal)
Motivation
Consider the convex, quadratic optimization problem
min_{x ∈ R^d} f(x) = (1/2)(x − x⋆)ᵀ H (x − x⋆) + f⋆.
Efficient methods:
- Conjugate gradients ("most optimal")
- Chebyshev (1st kind) acceleration (worst-case optimal)
- Polyak heavy-ball method (asymptotically worst-case optimal)
Polyak Heavy-Ball
The Polyak momentum algorithm, for ℓI ⪯ H ⪯ LI:
x_{t+1} = x_t − h ∇f(x_t) + m (x_t − x_{t−1}), where h = 4/(√L + √ℓ)², m = ((√L − √ℓ)/(√L + √ℓ))².
- Requires knowledge of ℓ and L.
- Easy to use (widely used in deep learning).
- Works well on non-quadratic problems (deterministic or stochastic).
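As a concrete illustration, here is a minimal NumPy sketch of this update on the quadratic f(x) = ½(x − x⋆)ᵀH(x − x⋆); the function name heavy_ball and its arguments are illustrative, not from the paper.

```python
import numpy as np

def heavy_ball(H, x_star, x0, ell, L, n_iter=100):
    """Polyak heavy-ball on f(x) = 0.5 (x - x_star)^T H (x - x_star),
    assuming ell * I <= H <= L * I (illustrative sketch)."""
    h = 4 / (np.sqrt(L) + np.sqrt(ell)) ** 2                               # step size
    m = ((np.sqrt(L) - np.sqrt(ell)) / (np.sqrt(L) + np.sqrt(ell))) ** 2   # momentum
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(n_iter):
        grad = H @ (x - x_star)                       # gradient of the quadratic
        x, x_prev = x - h * grad + m * (x - x_prev), x
    return x
```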
Deep learning and large-scale problems
In deep learning, we solve min_{x ∈ R^d} Σ_{i=1}^N f_i(x) for huge d. Consequences:
- The minimum eigenvalue ℓ is extremely hard to estimate.
- The objective behaves like a quadratic when using gradient descent.
- The Hessian has nice statistical properties, such as an expected spectral density.
Figure 1: Empirical vs. expected spectral density of ∇²f(x).
In this talk
We study the average-case convergence on quadratic problems.
- Standard optimization methods only use ℓ and L. What if we use the expected spectral density instead?
- How do we build average-case optimal methods for a given spectral density? Can we get rid of ℓ?
- What is the asymptotic behavior of these optimal methods?
Part 1: Spectral density, optimal methods and orthogonal polynomials
Setting
Consider a class of convex, quadratic optimization problems
min_{x ∈ R^d} (1/2)(x − x⋆)ᵀ H (x − x⋆) + f⋆.
For simplicity, assume that H is sampled randomly from some unknown distribution. We define the expected spectral density µ of H by
P(λ_i(H) ∈ [a, b]) = ∫_a^b dµ, for random i and H.
Remark: we are not interested in knowing the distribution over H!
Beyond the condition number: spectral density
We know the distribution of the eigenvalues of H.
Figure: spectral density. Eigenvalues are likely to appear where the density is high, and unlikely where it is low.
First-order methods and polynomials
We will use first-order methods to solve the quadratic problem:
x_t ∈ x_0 + span{∇f(x_0), ..., ∇f(x_{t−1})}.
Main property. The error at iteration t is a residual polynomial of degree t in H:
x_t − x⋆ = P_t(H)(x_0 − x⋆),  P_t(0) = 1.
Example: gradient descent.
x_t − x⋆ = x_{t−1} − x⋆ − h ∇f(x_{t−1}) = (I − hH)(x_{t−1} − x⋆)
         = (I − hH)^t (x_0 − x⋆) = P_t^Grad(H)(x_0 − x⋆), with P_t^Grad(λ) = (1 − hλ)^t.
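A quick numerical sanity check of this identity for gradient descent; this is a sketch with a randomly generated Hessian, nothing here is specific to the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 20, 10
A = rng.standard_normal((d, d))
H = A @ A.T / d + 0.1 * np.eye(d)          # a random positive-definite Hessian
x_star, x0 = rng.standard_normal(d), rng.standard_normal(d)

h = 1.0 / np.linalg.eigvalsh(H).max()      # step size h = 1/L
x = x0.copy()
for _ in range(t):
    x = x - h * H @ (x - x_star)           # gradient step: grad f(x) = H (x - x_star)

# Residual-polynomial form of the same error: P_t(H) (x0 - x_star) with P_t(lam) = (1 - h*lam)^t
P_t_of_H = np.linalg.matrix_power(np.eye(d) - h * H, t)
print(np.allclose(x - x_star, P_t_of_H @ (x0 - x_star)))   # True
```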
From algorithm to polynomials
All first-order methods are polynomials∗ and all polynomials∗ are first-order methods!
(∗ if P_t(0) = 1, i.e., P_t is a residual polynomial.)
Comparison of Polynomials
Visualizing the polynomial for gradient descent and momentum.
- Gradient descent: x_t = x_{t−1} − h ∇f(x_{t−1}).
- Optimal momentum: x_t = x_{t−1} − h_t ∇f(x_{t−1}) + m_t (x_{t−1} − x_{t−2}).
The worst-case rate of convergence is given by the largest value of the polynomial over the spectrum:
‖x_t − x⋆‖² ≤ ‖x_0 − x⋆‖² · max_{λ ∈ [ℓ, L]} P_t²(λ).
Residual Polynomial for Momentum
Figure: residual polynomials p_t (t = 4, 6, 12) on [λ_min, λ_max] for gradient descent, p_t(z) = (1 − 2z/(λ_min + λ_max))^{2t}, and for momentum (Chebyshev polynomials).
The worst-case rate of convergence is given by the largest value of the polynomial over the spectrum:
‖x_t − x⋆‖² ≤ ‖x_0 − x⋆‖² · max_{λ ∈ [ℓ, L]} P_t²(λ).
What about the average case?
Proposition. If the eigenvalues λ_i of H are distributed according to µ, then
E_H[‖x_t − x⋆‖²] ≤ ‖x_0 − x⋆‖² · ∫ P_t² dµ.
Note: the expectation is taken over the inputs. Contrary to the worst case, the average case is aware of the whole spectrum of the matrix H!
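To make the proposition concrete, here is a small Monte Carlo sketch for gradient descent on random quadratics: the left-hand side averages the actual squared error over sampled H, the right-hand side uses the sampled eigenvalues as an empirical µ. The Wishart-type ensemble and the step size are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_samples, t, h = 30, 200, 8, 0.05      # gradient descent: P_t(lam) = (1 - h*lam)^t
u = rng.standard_normal(d)                 # x0 - x_star, kept fixed across instances

lhs, rhs = 0.0, 0.0
for _ in range(n_samples):
    X = rng.standard_normal((2 * d, d))
    H = X.T @ X / (2 * d)                  # a random (Wishart-type) Hessian
    err = np.linalg.matrix_power(np.eye(d) - h * H, t) @ u
    lhs += err @ err                       # ||x_t - x_star||^2 for this instance
    lam = np.linalg.eigvalsh(H)
    rhs += (u @ u) * np.mean((1 - h * lam) ** (2 * t))   # ||x0 - x_star||^2 * int P_t^2 dmu

print(lhs / n_samples, rhs / n_samples)    # the two averages should be close
```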
Optimal Worst Case vs Optimal Average Case
The optimal worst-case method solves
min_{P: P(0)=1} max_{λ ∈ [ℓ, L]} P²(λ).
The unique solution is given by the Chebyshev polynomials of the first kind, depending only on ℓ and L.
The optimal average-case method solves
min_{P: P(0)=1} ∫ P² dµ.
The solution depends on µ, and uses the concept of residual orthogonal polynomials.
Optimal Polynomial
Proposition (e.g., Bernd Fischer). If {P_i} is a sequence of residual orthogonal polynomials w.r.t. λµ(λ), i.e.,
∫ P_i(λ) P_j(λ) d[λµ(λ)] = 0 if i ≠ j, and > 0 otherwise,
then P_t solves
P_t ∈ argmin_{P: P(0)=1, deg(P)=t} ∫ P² dµ.
Polynomial to algorithms
The optimal polynomial comes from an orthogonal basis, and follows a short recursion involving only the two previous polynomials!
Proposition. Let {P_1, P_2, ...} be residual orthogonal polynomials. Then, for some m_i and h_i (functions of λµ(λ)),
P_i(λ) = P_{i−1}(λ) − h_i λ P_{i−1}(λ) + m_i (P_{i−1}(λ) − P_{i−2}(λ)).
The optimal average-case algorithm reads
x_t = x_{t−1} − h_t ∇f(x_{t−1}) + m_t (x_{t−1} − x_{t−2}).
The recipe to create your own optimal method!
1. Find the distribution µ of the eigenvalues of H, or ∇²f(x).
2. Find a sequence of orthogonal polynomials P_t w.r.t. λµ(λ) (it gives you m_t and h_t).
3. Iterate over t (a minimal sketch of this loop is given below):
   x_t = x_{t−1} − h_t ∇f(x_{t−1}) + m_t (x_{t−1} − x_{t−2}).
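Step 3 is just a momentum-type loop driven by the coefficient sequences from step 2. A minimal sketch, assuming the sequences (h_t, m_t) have already been computed from the orthogonal polynomials; the function name and signature are hypothetical.

```python
import numpy as np

def average_case_method(grad, x0, steps, momenta):
    """Run x_t = x_{t-1} - h_t * grad(x_{t-1}) + m_t * (x_{t-1} - x_{t-2}),
    where steps = [h_1, h_2, ...] and momenta = [m_1, m_2, ...] come from step 2."""
    x_prev, x = np.asarray(x0, dtype=float).copy(), np.asarray(x0, dtype=float).copy()
    for h_t, m_t in zip(steps, momenta):
        x, x_prev = x - h_t * grad(x) + m_t * (x - x_prev), x
    return x
```

With constant h_t = h and m_t = m, this loop reduces to the Polyak heavy-ball update above.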
Part 2: Spectral density estimation
Examples of spectral densities
In the paper we study three different cases:
- Uniform distribution on [ℓ, L].
- Exponential decay, µ(λ) = e^{−λ}: a midpoint between convex quadratic and convex non-smooth optimization.
- Marchenko-Pastur distribution: the typical expected spectral distribution of ∇²f(x⋆) for deep neural networks.
Exponential decay
Assume the spectral density of H is µ(λ) = e^{−λ/λ0}.
Optimal algorithm:
x_t = x_{t−1} − (λ0/(t+1)) ∇f(x_{t−1}) + ((t−1)/(t+1)) (x_{t−1} − x_{t−2}).
Very close to stochastic averaged gradient for quadratics [Flammarion and Bach, 2015].
Rate of convergence: E_H[‖x_t − x⋆‖²] = (1/(λ0(t+1))) ‖x_0 − x⋆‖².
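A sketch of this iteration, taking the coefficients exactly as displayed above (h_t = λ0/(t+1), m_t = (t−1)/(t+1)); these values are read off the reconstructed formula, so treat the sketch as an assumption rather than a verified implementation of the paper's method.

```python
import numpy as np

def exp_decay_method(grad, x0, lam0, n_iter=100):
    """Average-case method for spectral density mu(lambda) = exp(-lambda / lam0).
    Coefficients as read from the slide above (assumed): h_t = lam0/(t+1), m_t = (t-1)/(t+1)."""
    x_prev, x = np.asarray(x0, dtype=float).copy(), np.asarray(x0, dtype=float).copy()
    for t in range(1, n_iter + 1):
        h_t = lam0 / (t + 1)
        m_t = (t - 1) / (t + 1)              # m_1 = 0, so the first step is plain gradient descent
        x, x_prev = x - h_t * grad(x) + m_t * (x - x_prev), x
    return x
```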
Marchenko-Pastur distribution
We now study the Marchenko-Pastur distribution:
µ(λ) = (1 − r)₊ δ_0(λ) + (√((L − λ)(λ − ℓ)) / (2πσ²λ)) · 1_{λ ∈ [ℓ, L]},
with ℓ := σ²(1 − √r)², L := σ²(1 + √r)².
- σ² is the variance,
- σ²r is the mean,
- there is an atom at zero if r < 1!
Motivation: for a certain class of nonlinear activation functions, the spectrum of the Hessian of a neural network follows the MP distribution [Pennington et al., 2018].
Figure 2: MP distribution for different values of r.
Optimal for MP distribution
The optimal polynomials for the MP distribution are Chebyshev polynomials of the 2nd kind. (Fun fact: Chebyshev polynomials of the 1st kind are optimal for the worst case!)
Asymptotic (t → ∞) version:
x_t = x_{t−1} − (1/σ²) min{1/r, 1} ∇f(x_{t−1}) + min{1/r, r} (x_{t−1} − x_{t−2}).
Main advantage: σ and r can be estimated cheaply from tr(H)/d and tr(H²)/d, e.g. σ²r ≈ tr(H)/d.
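A sketch of the asymptotic MP method using the constant coefficients above; sigma2 and r are assumed to be known or estimated from the Hessian traces as mentioned, and the function name and signature are illustrative.

```python
import numpy as np

def mp_asymptotic_method(grad, x0, sigma2, r, n_iter=100):
    """Asymptotic average-case method for a Marchenko-Pastur spectrum with parameters
    sigma2 and r (e.g. estimated from tr(H)/d and tr(H^2)/d as described above)."""
    h = min(1.0 / r, 1.0) / sigma2           # step size: (1/sigma^2) * min{1/r, 1}
    m = min(1.0 / r, r)                      # momentum: min{1/r, r}
    x_prev, x = np.asarray(x0, dtype=float).copy(), np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        x, x_prev = x - h * grad(x) + m * (x - x_prev), x
    return x
```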
Optimal Polynomial for the MP distribution
Figure: residual polynomials p_t (t = 4 to 14) for momentum (Chebyshev polynomials) and for the MP-optimal method, overlaid on the MP density.
Numerical experiments
Real data I
Least squares, f(x) = ‖Mx − b‖², on the Digits dataset (n = 1797, p = 64).
Figure: eigenvalue density (left) and suboptimality f(x_t) − f⋆ vs. iteration (right) for Gradient Descent, Nesterov (Cvx), Nesterov (StrCvx), Heavy-ball, and Density acceleration.
Note: GD and Density acceleration don’t have access to λmin.
Real data II
Least squares, f(x) = ‖Mx − b‖², on the Covtype dataset (n = 581012, p = 54).
Figure: eigenvalue density (left) and suboptimality f(x_t) − f⋆ vs. iteration (right) for Gradient Descent, Nesterov (Cvx), Nesterov (StrCvx), Heavy-ball, and Density acceleration.
Note: GD and Density acceleration don’t have access to λmin.
Part 3: Asymptotic Universal Optimality of Polyak Momentum
That's curious...
This project started with a simple observation.
Optimal worst-case method (Chebyshev, 1st kind), t → ∞:
x_t = x_{t−1} − (4/(√L + √ℓ)²) ∇f(x_{t−1}) + ((√L − √ℓ)/(√L + √ℓ))² (x_{t−1} − x_{t−2}).
Optimal for MP (Chebyshev, 2nd kind), r > 1 and t → ∞:
x_t = x_{t−1} − (1/(σ²r)) ∇f(x_{t−1}) + (1/r)(x_{t−1} − x_{t−2}).
Replacing ℓ = σ²(1 − √r)², L = σ²(1 + √r)², we get the same method!
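A two-line numerical check of this identity; any σ² > 0 and r > 1 will do, the specific numbers below are arbitrary.

```python
import numpy as np

sigma2, r = 1.7, 2.5                            # arbitrary sigma^2 > 0 and r > 1
ell = sigma2 * (1 - np.sqrt(r)) ** 2            # edges of the MP support
L = sigma2 * (1 + np.sqrt(r)) ** 2

h_polyak = 4 / (np.sqrt(L) + np.sqrt(ell)) ** 2
m_polyak = ((np.sqrt(L) - np.sqrt(ell)) / (np.sqrt(L) + np.sqrt(ell))) ** 2

print(np.isclose(h_polyak, 1 / (sigma2 * r)))   # True: Polyak step size = MP step size
print(np.isclose(m_polyak, 1 / r))              # True: Polyak momentum = MP momentum
```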
Main results
Theorem (Scieur, Pedregosa, 2020). Let λµ(λ) be supported on [ℓ, L], and assume λµ(λ) > 0 there. Then, as t → ∞, the step size and momentum of the optimal average-case method converge to
h_t → 4/(√L + √ℓ)²,  m_t → ((√L − √ℓ)/(√L + √ℓ))²,
which are those of the Polyak heavy-ball method. This implies that Polyak heavy-ball is asymptotically optimal!
Main results
Theorem (Scieur, Pedregosa, 2020). Asymptotically, the rate of the optimal method converges to the rate of Polyak momentum:
lim_{t→∞} (E_H[‖x_t − x⋆‖²] / ‖x_0 − x⋆‖²)^{1/t} = ((√L − √ℓ)/(√L + √ℓ))².
Figure: suboptimality of the heavy-ball momentum, step size, and rate vs. the iteration counter (= polynomial degree), for several spectral densities.
Summary
- We introduce a framework to study the expected rate of convergence (over the inputs).
- We derive practical methods when we have a good idea of the spectral density function.
- We show that, under mild assumptions, those methods converge to the Polyak heavy-ball method.
Samsung SAIL Montreal is recruiting!
(Benefits package may include a typical Korean dinner.) Contact SAIL.montreal.lab@gmail.com for more info :-)
Thank you!
Our papers:
- Pedregosa, Scieur (2020). Average-case Acceleration Through Spectral Density Estimation. arXiv:2002.04756.
- Scieur, Pedregosa (2020). Universal Average-Case Optimality of Polyak Momentum. arXiv:2002.04664.