Average-case Acceleration Through Spectral Density Estimation


  1. Average-case Acceleration Through Spectral Density Estimation Fabian Pedregosa (Google Research) Damien Scieur (Samsung SAIT AI Lab, Montréal) International Conference on Machine Learning 2020

  2. Complexity Analysis in Optimization Worst-case analysis ✓ Bound on the complexity for any input. ✗ Potentially worse than observed runtime. Simplex method (Dantzig, '98, Spielman & Teng '04) ✗ Exponential worst-case. ✓ Runtime typically polynomial.

  3. Average-case Complexity ✓ Complexity averaged over all problem instances. ✓ Representative of the typical complexity. Better bounds, sometimes better algorithms → Quicksort (Hoare '62): fast average-case sorting. Rarely used in optimization.

  4. Main contributions Average-case analysis for optimization on quadratics. Optimal methods under this analysis.

  5. Problem Distribution: Random Quadratics, where H , x ★ are a random matrix and vector. ✓ Exact runtime is known and depends on eigenvalues( H ). ✓ Shares (some) dynamics of real problems, e.g., the Neural Tangent Kernel (Jacot et al., 2018).
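As a concrete illustration of this setting, here is a minimal sketch. It assumes the standard random-quadratic form f(x) = ½ (x − x★)ᵀ H (x − x★) with H = AᵀA/n for Gaussian A; the paper's exact construction may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50

# Hypothetical random quadratic: H = A^T A / n with iid Gaussian A,
# minimizer x_star drawn at random (a modeling assumption).
A = rng.standard_normal((n, d))
H = A.T @ A / n
x_star = rng.standard_normal(d)

def f(x):
    """Quadratic objective 0.5 * (x - x*)^T H (x - x*)."""
    r = x - x_star
    return 0.5 * r @ H @ r

# Plain gradient descent; its trajectory depends only on eigenvalues(H)
# and the initial error, which is what makes exact runtime analysis possible.
step = 1.0 / np.linalg.eigvalsh(H).max()
x = np.zeros(d)
for _ in range(500):
    x = x - step * H @ (x - x_star)

print(f(np.zeros(d)), f(x))  # error decreases sharply
```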

  6. Example: Random Least Squares. When the elements of A are iid and standardized, the spectrum of H is close to the Marchenko-Pastur distribution.
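This claim is easy to check numerically. Assuming unit-variance iid Gaussian entries, the extreme eigenvalues of H = AᵀA/n should land near the Marchenko-Pastur support edges (1 ± √r)² for ratio r = d/n:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 500                    # aspect ratio r = d / n
A = rng.standard_normal((n, d))     # iid, mean 0, variance 1
H = A.T @ A / n

eigs = np.linalg.eigvalsh(H)
r = d / n
# Marchenko-Pastur support for unit-variance entries: [(1-sqrt(r))^2, (1+sqrt(r))^2]
lo, hi = (1 - np.sqrt(r)) ** 2, (1 + np.sqrt(r)) ** 2
print(eigs.min(), lo)   # empirical min vs. lower edge
print(eigs.max(), hi)   # empirical max vs. upper edge
```

With n = 2000 the edge fluctuations are small, so the empirical extremes sit within a few percent of the theoretical edges.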

  7. Expected Error For Gradient-Based Methods. R 2 is the (fixed) distance to the optimum at initialization. Problem difficulty is represented by the expected Hessian eigenvalue density d 𝜈 . P t is a polynomial of degree t determined by the optimization algorithm. Flexible: enables algorithm design.
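The polynomial view can be verified directly: for gradient descent with step size γ, the residual polynomial is P_t(λ) = (1 − γλ)^t, and applying P_t(H) to the initial error reproduces the error at step t exactly. The quadratic and step size below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 30
A = rng.standard_normal((100, d))
H = A.T @ A / 100
x_star = rng.standard_normal(d)
x0 = np.zeros(d)

gamma = 1.0 / np.linalg.eigvalsh(H).max()
t = 10

# Run t steps of gradient descent on 0.5 * (x - x*)^T H (x - x*).
x = x0.copy()
for _ in range(t):
    x -= gamma * H @ (x - x_star)

# Gradient descent's residual polynomial: P_t(lambda) = (1 - gamma*lambda)^t.
# Applying P_t(H) (via the eigendecomposition) to the initial error
# reproduces the error at step t.
lam, U = np.linalg.eigh(H)
P_t = (1 - gamma * lam) ** t
err_poly = U @ (P_t * (U.T @ (x0 - x_star)))

print(np.linalg.norm((x - x_star) - err_poly))  # ~ 0
```

Averaging the squared norm of this polynomial error over random H and x★ is exactly what produces the R² ∫ P_t(λ)² dν(λ) structure of the expected error.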

  8. Average-case Optimal Method. Goal : find the method with minimal expected error. Algorithms ↔ polynomials of degree t: find the polynomial P t that minimizes the expected error (with proper normalization). Solution: the polynomial of degree t orthogonal wrt λ d 𝜈 (λ) .
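A rough numerical sketch of this idea: parametrize degree-t polynomials normalized so that P(0) = 1, minimize an empirical version of the expected error over sampled eigenvalues by least squares, and compare against gradient descent's residual polynomial. The eigenvalue density and the λ-weighting below are assumptions for illustration, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(3)
# Sampled Hessian eigenvalues stand in for the expected density d(nu).
lam = np.sort(rng.uniform(0.1, 2.0, size=400))
t = 5

# Residual polynomials satisfy P(0) = 1, so write P(l) = 1 + sum_k c_k l^k
# and minimize the empirical weighted error sum_i lam_i * P(lam_i)^2.
# This is a weighted linear least-squares problem in the coefficients c.
V = np.vander(lam, t + 1, increasing=True)[:, 1:]   # columns l, l^2, ..., l^t
w = np.sqrt(lam)
c, *_ = np.linalg.lstsq(w[:, None] * V, -w, rcond=None)

P_opt = 1 + V @ c
# Compare with gradient descent's residual polynomial (1 - gamma*l)^t.
gamma = 1.0 / lam.max()
P_gd = (1 - gamma * lam) ** t
print((lam * P_opt**2).sum(), (lam * P_gd**2).sum())
```

Since gradient descent's polynomial is feasible for the same minimization, the optimized polynomial can only do better, which is the sense in which the orthogonal-polynomial method is average-case optimal.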

  9. Marchenko-Pastur Acceleration. Model for d 𝜈 = Marchenko-Pastur(r, 𝛕 ). r and 𝛕 are estimated from the largest eigenvalue and the trace of H ; no need to know the strong convexity constant. Algorithm: a simple momentum-like method with low memory requirements.
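A sketch of the parameter-estimation step, assuming the standard Marchenko-Pastur parametrization in which the mean eigenvalue equals the variance σ² and the top support edge is σ²(1 + √r)²; the paper's estimator may differ in detail:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 1500, 500
A = rng.standard_normal((n, d))
H = A.T @ A / n          # true ratio r = d/n = 1/3, variance sigma^2 = 1

# Estimate the two Marchenko-Pastur parameters from quantities that are
# cheap to obtain: the trace and the largest eigenvalue (computed exactly
# here; power iteration would suffice in practice).
sigma2_hat = np.trace(H) / d                  # MP mean eigenvalue = sigma^2
lam_max = np.linalg.eigvalsh(H).max()         # top edge = sigma^2 (1 + sqrt(r))^2
r_hat = (np.sqrt(lam_max / sigma2_hat) - 1) ** 2

print(sigma2_hat, r_hat)   # close to 1 and d/n
```

Note that neither estimate requires the smallest eigenvalue, which is why the method needs no strong convexity constant.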

  10. Decaying Exponential Acceleration. Model for d 𝜈 = decaying exponential(λ 0 ). Unbounded largest eigenvalue; only access to Tr( H ). Algorithm: decaying step-size, similar to Polyak averaging.
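A minimal sketch of this regime, where both the eigenvalue model and the 1/t step-size schedule are hypothetical stand-ins for the paper's method (the averaging step is omitted):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 200
# Illustrative model (an assumption, not the paper's construction):
# Hessian eigenvalues drawn from a decaying exponential density, so the
# spectrum is unbounded in principle and has no strong convexity floor.
lam = rng.exponential(1.0, size=d)
H = np.diag(lam)
x_star = rng.standard_normal(d)

# Only Tr(H) is assumed known; it sets the step-size scale, and the
# step decays as 1/t over iterations (a hypothetical schedule).
scale = d / np.trace(H)
x = np.zeros(d)
for t in range(1, 3001):
    x -= (scale / t) * H @ (x - x_star)

err0 = np.linalg.norm(x_star)          # ||x_0 - x*|| with x_0 = 0
err = np.linalg.norm(x - x_star)
print(err0, err)                       # error shrinks despite no lambda_max bound
```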

  11. Benchmarks: Least Squares

  12. Conclusions Average-case analysis based on random quadratics. Optimal methods under different eigenvalue distributions. ✓ Acceleration without knowledge of strong convexity. In paper + More methods, convergence rates, empirical extension to non-quadratic objectives. Follow-up work on asymptotic analysis (Scieur and P., "Universal Average-Case Optimality of Polyak Momentum" )
