Average-case Acceleration Through Spectral Density Estimation


SLIDE 1

Average-case Acceleration Through Spectral Density Estimation

Fabian Pedregosa (Google Research) Damien Scieur (Samsung SAIT AI Lab, Montréal) International Conference on Machine Learning 2020

SLIDE 2

Complexity Analysis in Optimization

Worst-case analysis ✓ Bound on the complexity for any input. ✗ Potentially worse than observed runtime.

Simplex method (Dantzig, '98, Spielman & Teng '04) ✗ Exponential worst-case. ✓ Runtime typically polynomial.

SLIDE 3

Average-case Complexity

✓ Complexity averaged over all problem instances. ✓ Representative of the typical complexity. Better bounds, sometimes better algorithms → Quicksort (Hoare '62): fast average-case sorting. Rarely used in optimization.

SLIDE 4

Main contributions

  • Average-case analysis for optimization on quadratics.
  • Optimal methods under this analysis.

SLIDE 5

Problem Distribution: Random Quadratics

minimize f(x) = ½ (x − x★)ᵀ H (x − x★), where H, x★ are a random matrix and vector. ✓ Exact runtime known; depends on eigenvalues(H). ✓ Shares (some) dynamics of real problems, e.g., the Neural Tangent Kernel (Jacot et al., 2018).
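As a concrete illustration, the sketch below (sizes and names hypothetical, not the paper's exact distribution) samples one such random quadratic and runs plain gradient descent on it; convergence is governed entirely by the eigenvalues of H.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 50

# One draw of a random quadratic f(x) = 1/2 (x - x_star)^T H (x - x_star),
# with a random PSD Hessian and a random optimum.
A = rng.standard_normal((n, d))
H = A.T @ A / n
x_star = rng.standard_normal(d)

# Plain gradient descent with step size 1 / lambda_max(H).
L = np.linalg.eigvalsh(H).max()
x = np.zeros(d)
for _ in range(300):
    x = x - (1.0 / L) * (H @ (x - x_star))

rel_err = np.linalg.norm(x - x_star) / np.linalg.norm(x_star)
print(rel_err)  # small: the iterates converge to x_star
```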

SLIDE 6

Example: Random Least Squares

When the elements of A are iid and standardized, the spectrum of H will be close to the Marchenko–Pastur distribution.
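This can be checked empirically: for iid standardized A, the eigenvalues of H = AᵀA/n concentrate on the Marchenko–Pastur support [(1 − √r)², (1 + √r)²] with r = d/n (a standard random-matrix fact; sizes below are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4000, 1000                    # aspect ratio r = d/n = 0.25
A = rng.standard_normal((n, d))      # iid entries, mean 0, variance 1
H = A.T @ A / n

eigs = np.linalg.eigvalsh(H)
r = d / n
lower, upper = (1 - np.sqrt(r)) ** 2, (1 + np.sqrt(r)) ** 2  # MP edges

print(eigs.min(), lower)  # both close to 0.25
print(eigs.max(), upper)  # both close to 2.25
```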

SLIDE 7

Expected Error For Gradient-Based Methods

Expected error: 𝔼‖x_t − x★‖² = R² ∫ P_t(λ)² dν(λ), where R² is the distance to the optimum at initialization, dν is the expected density of Hessian eigenvalues (representing problem difficulty), and P_t is a polynomial of degree t determined by the optimization algorithm.

R² and dν are fixed by the problem; P_t is flexible → algorithm design.
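For gradient descent on a quadratic, the residual polynomial is explicit, P_t(λ) = (1 − γλ)ᵗ, and x_t − x★ = P_t(H)(x₀ − x★). The sketch below (hypothetical small instance) verifies this identity numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 30
A = rng.standard_normal((60, d))
H = A.T @ A / 60
x_star = rng.standard_normal(d)
x0 = np.zeros(d)

gamma = 1.0 / np.linalg.eigvalsh(H).max()
t = 15

# Run t steps of gradient descent on f(x) = 1/2 (x - x*)^T H (x - x*).
x = x0.copy()
for _ in range(t):
    x = x - gamma * (H @ (x - x_star))

# Gradient descent's residual polynomial: P_t(lambda) = (1 - gamma*lambda)^t,
# applied to H via its eigendecomposition.
eigvals, V = np.linalg.eigh(H)
Pt_H = V @ np.diag((1 - gamma * eigvals) ** t) @ V.T
predicted = x_star + Pt_H @ (x0 - x_star)

print(np.linalg.norm(x - predicted))  # ~0: the two agree
```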

SLIDE 8

Average-case Optimal Method

Goal: find the method with minimal expected error = find the polynomial P_t of degree t that minimizes the expected error (with proper normalization). Algorithms ↔ Polynomials.

Solution: polynomials of degree t, orthogonal w.r.t. λ dν(λ).
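A discretized version of this problem is plain least squares: over polynomials normalized to P(0) = 1, minimize the empirical mean of P(λ)² on eigenvalues sampled from the spectral density. The sketch below (hypothetical sizes; a numerical stand-in, not the paper's closed-form method) confirms the minimizer beats gradient descent's residual polynomial (1 − γλ)ᵗ on the same spectrum.

```python
import numpy as np

rng = np.random.default_rng(3)
t = 4
# Empirical spectrum standing in for d_nu (Marchenko-Pastur-like,
# via a random Gram matrix).
A = rng.standard_normal((2000, 500))
lam = np.linalg.eigvalsh(A.T @ A / 2000)

# Residual polynomials satisfy P(0) = 1, so parametrize
# P(lambda) = 1 + c_1*lambda + ... + c_t*lambda^t and minimize
# mean_i P(lam_i)^2 -- a linear least-squares problem in c.
M = lam[:, None] ** np.arange(1, t + 1)
c, *_ = np.linalg.lstsq(M, -np.ones_like(lam), rcond=None)

def P(x):
    return 1 + (x[:, None] ** np.arange(1, t + 1)) @ c

mse_opt = np.mean(P(lam) ** 2)
gamma = 1.0 / lam.max()
mse_gd = np.mean((1 - gamma * lam) ** (2 * t))  # gradient descent's P_t
print(mse_opt, mse_gd)  # optimal polynomial achieves the smaller error
```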

SLIDE 9

Marchenko-Pastur Acceleration

Model for d𝜈 = Marchenko–Pastur(r, τ). Algorithm: a simple momentum-like method with low memory requirements. r and τ are estimated from:

  • Largest eigenvalue
  • Trace of H

No need to know strong convexity constant.
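The slide only states that r and τ come from the largest eigenvalue and the trace; the estimators below are one plausible reconstruction (an assumption, not taken from the paper), using the Marchenko–Pastur facts that the mean eigenvalue is τ and the largest is ≈ τ(1 + √r)².

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 3000, 750                     # true ratio r = d/n = 0.25
A = rng.standard_normal((n, d))
H = A.T @ A / n

# Hypothetical moment-matching estimators:
#   mean(eigs) = tau        ->  tau_hat = Tr(H) / d
#   lambda_max ~ tau*(1+sqrt(r))^2  ->  solve for r
tau_hat = np.trace(H) / d
lam_max = np.linalg.eigvalsh(H).max()
r_hat = (np.sqrt(lam_max / tau_hat) - 1) ** 2

print(tau_hat, r_hat)  # close to 1.0 and 0.25
```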

SLIDE 10

Decaying Exponential Acceleration

Model for d𝜈 = decaying exponential(λ₀). Unbounded largest eigenvalue. Only access to Tr(H).

Algorithm

  • Decaying step size, similar to Polyak averaging.

SLIDE 11

Benchmarks: Least Squares

SLIDE 12

Conclusions

Average-case analysis based on random quadratics. Optimal methods under different eigenvalue distributions. ✓ Acceleration without knowledge of the strong convexity constant. In the paper: more methods, convergence rates, and an empirical extension to non-quadratic objectives.

Follow-up work on asymptotic analysis (Scieur and P., "Universal Average-Case Optimality of Polyak Momentum")