First-order methods (OPTML++ Meeting 2) - Suvrit Sra, Massachusetts Institute of Technology - PowerPoint PPT Presentation


  1. First-order methods (OPTML++ Meeting 2). Suvrit Sra, Massachusetts Institute of Technology. OPTML++, Fall 2015.

  2. Outline – Lect 1: Recap on convexity – Lect 1: Recap on duality, optimality – First-order optimization algorithms – Proximal methods, operator splitting

  3. Descent methods: min_x f(x)

  4. Descent methods: min_x f(x). [Figure: iterates x_k, x_{k+1}, ... converging to x*, where ∇f(x*) = 0.]

  5.-8. Descent methods. [Figure build-up: a point x, its gradient ∇f(x) and the negative gradient −∇f(x); candidate steps x − α∇f(x) and x − δ∇f(x) along the negative gradient; and a step along a second descent direction d.]

  9. Algorithm: 1) Start with some guess x_0. 2) For each k = 0, 1, ...: x_{k+1} ← x_k + α_k d_k; check when to stop (e.g., if ∇f(x_{k+1}) = 0).
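To make the generic loop above concrete, here is a minimal Python sketch of the iteration x_{k+1} ← x_k + α_k d_k with the plain negative-gradient direction and a constant stepsize; the quadratic test function, the stepsize value, and the stopping tolerance are illustrative choices, not from the slides.

```python
import numpy as np

def descent(grad_f, x0, stepsize=0.1, tol=1e-8, max_iter=1000):
    """Generic descent loop: x_{k+1} = x_k + alpha_k * d_k with d_k = -grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:      # stopping test: gradient (nearly) zero
            break
        d = -g                            # plain (unscaled) gradient direction
        x = x + stepsize * d              # the update from the slide
    return x

# Illustrative quadratic f(x) = 1/2 x^T A x - b^T x, so grad f(x) = A x - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_hat = descent(lambda x: A @ x - b, x0=np.zeros(2))
print(x_hat, np.linalg.solve(A, b))       # the two should approximately agree
```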

  10.-13. Gradient methods: x_{k+1} = x_k + α_k d_k, k = 0, 1, ... Stepsize α_k ≥ 0, usually chosen to ensure f(x_{k+1}) < f(x_k). Descent direction d_k satisfies ⟨∇f(x_k), d_k⟩ < 0. Numerous ways to select α_k and d_k. Usually methods seek monotonic descent f(x_{k+1}) < f(x_k).

  14.-15. Gradient methods – direction: x_{k+1} = x_k + α_k d_k, k = 0, 1, ... Different choices of direction d_k. Scaled gradient: d_k = −D_k ∇f(x_k), D_k ≻ 0. Newton's method: D_k = [∇²f(x_k)]⁻¹. Quasi-Newton: D_k ≈ [∇²f(x_k)]⁻¹. Steepest descent: D_k = I. Diagonally scaled: D_k diagonal with (D_k)_ii ≈ [∂²f(x_k)/∂x_i²]⁻¹. Discretized Newton: D_k = [H(x_k)]⁻¹, H via finite differences. ... Exercise: verify that ⟨∇f(x_k), d_k⟩ < 0 for the above choices.
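As a rough illustration of a few of the scalings listed above, the snippet below builds d_k = −D_k ∇f(x_k) for the steepest-descent, Newton, and diagonally scaled choices and checks the exercise's condition ⟨∇f(x_k), d_k⟩ < 0. This is a sketch assuming the gradient and Hessian are available in closed form; the quadratic test function is an illustrative choice.

```python
import numpy as np

# Illustrative quadratic: f(x) = 1/2 x^T Q x - b^T x
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: Q @ x - b
hess = lambda x: Q

x = np.array([1.0, -1.0])
g, H = grad(x), hess(x)

directions = {
    "steepest descent":  -g,                         # D_k = I
    "Newton":            -np.linalg.solve(H, g),     # D_k = [Hessian]^{-1}
    "diagonally scaled": -g / np.diag(H),            # (D_k)_ii = (d^2 f / dx_i^2)^{-1}
}

for name, d in directions.items():
    # Each choice should satisfy <grad f(x), d> < 0, i.e., be a descent direction.
    print(f"{name:18s} <grad, d> = {g @ d:.4f}")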

  16.-20. Gradient methods – stepsize. Exact: α_k := argmin_{α ≥ 0} f(x_k + α d_k). Limited minimization: α_k := argmin_{0 ≤ α ≤ s} f(x_k + α d_k). Armijo rule: given fixed scalars s, β, σ with 0 < β < 1 and 0 < σ < 1 (chosen experimentally), set α_k = β^{m_k} s, where we try β^m s for m = 0, 1, ... until sufficient descent: f(x_k) − f(x_k + β^m s d_k) ≥ −σ β^m s ⟨∇f(x_k), d_k⟩. If ⟨∇f(x_k), d_k⟩ < 0, such a stepsize is guaranteed to exist. Usually σ is small, in [10⁻⁵, 0.1], while β ranges from 1/2 to 1/10 depending on how confident we are about the initial stepsize s. Constant: α_k = 1/L (for a suitable value of L). Diminishing: α_k → 0 but Σ_k α_k = ∞.
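A minimal sketch of the Armijo rule as stated above: try α = β^m s for m = 0, 1, ... until f(x_k) − f(x_k + β^m s d_k) ≥ −σ β^m s ⟨∇f(x_k), d_k⟩ holds. The default values of s, β, σ and the quadratic test problem are illustrative choices within the ranges mentioned on the slide.

```python
import numpy as np

def armijo_stepsize(f, grad_f, x, d, s=1.0, beta=0.5, sigma=1e-2, max_tries=50):
    """Return alpha = beta**m * s for the smallest m passing the Armijo test."""
    fx = f(x)
    slope = grad_f(x) @ d          # <grad f(x), d>, assumed negative (descent direction)
    alpha = s
    for _ in range(max_tries):
        if fx - f(x + alpha * d) >= -sigma * alpha * slope:   # sufficient descent
            return alpha
        alpha *= beta              # shrink: s, beta*s, beta^2*s, ...
    return alpha

# Usage on an illustrative quadratic with the plain gradient direction
Q = np.array([[10.0, 0.0], [0.0, 1.0]])
f = lambda x: 0.5 * x @ Q @ x
g = lambda x: Q @ x
x = np.array([1.0, 1.0])
d = -g(x)
print(armijo_stepsize(f, g, x, d))
```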

  21.-24. Gradient methods – nonmonotonic steps*. Stepsize computation can be expensive; convergence analysis depends on monotonic descent. Give up the search for stepsizes: use closed-form formulae for stepsizes. Don't insist on monotonic descent? (E.g., diminishing stepsizes do not give monotonic descent.) Barzilai & Borwein stepsizes: x_{k+1} = x_k − α_k ∇f(x_k), k = 0, 1, ..., with α_k = ‖u_k‖² / ⟨u_k, v_k⟩ or α_k = ⟨u_k, v_k⟩ / ‖v_k‖², where u_k = x_k − x_{k−1}, v_k = ∇f(x_k) − ∇f(x_{k−1}).
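A sketch of gradient descent with a Barzilai-Borwein stepsize as defined above, α_k = ‖u_k‖² / ⟨u_k, v_k⟩. The bootstrap stepsize for the first iteration and the stopping tolerance are implementation choices, not from the slides; note there is no line search, so descent may be nonmonotone.

```python
import numpy as np

def bb_gradient_descent(grad_f, x0, n_iter=100, alpha0=1e-3, tol=1e-10):
    """Gradient descent with the (first) Barzilai-Borwein stepsize."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad_f(x_prev)
    x = x_prev - alpha0 * g_prev          # bootstrap step: no history yet
    for k in range(n_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:      # stop once the gradient is (nearly) zero
            break
        u = x - x_prev                    # u_k = x_k - x_{k-1}
        v = g - g_prev                    # v_k = grad f(x_k) - grad f(x_{k-1})
        alpha = (u @ u) / (u @ v)         # BB stepsize ||u_k||^2 / <u_k, v_k>
        x_prev, g_prev = x, g
        x = x - alpha * g                 # no line search; possibly nonmonotone
    return x

# Illustrative quadratic f(x) = 1/2 x^T Q x - b^T x
Q = np.diag([100.0, 1.0])
b = np.array([1.0, 1.0])
x_bb = bb_gradient_descent(lambda x: Q @ x - b, x0=np.zeros(2))
print(x_bb, np.linalg.solve(Q, b))        # should roughly agree
```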

  25. Least-squares

  26. Nonnegative least squares: min_x ½‖Ax − b‖₂² subject to x ≥ 0 (x models intensities, concentrations, frequencies, ...). Applications: Machine learning, Physics, Statistics, Bioinformatics, Image Processing, Remote Sensing, Computer Vision, Engineering, Medical Imaging, Inverse problems, Astronomy, Finance.

  27. NNLS: min ‖Ax − b‖² s.t. x ≥ 0. Unconstrained solution x_uc = (AᵀA)⁻¹Aᵀb, obtained by solving ∇f(x) = 0; so just set x = (x_uc)_+? Cannot just truncate. [Figure: x*, (x_uc)_+, and x_uc are in general distinct points.] The constraint x ≥ 0 makes the problem trickier as the problem size grows.
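A small numerical check of the point above, assuming SciPy is available; the 2x2 problem below is an illustrative choice (not from the slides) for which truncating the unconstrained solution gives a different, worse answer than the true NNLS solution.

```python
import numpy as np
from scipy.optimize import nnls

# Illustrative problem (not from the slides) where truncation fails.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
b = np.array([1.0, -1.0])

x_uc = np.linalg.lstsq(A, b, rcond=None)[0]   # unconstrained least-squares solution
x_trunc = np.maximum(x_uc, 0.0)               # naive truncation (x_uc)_+
x_nnls, _ = nnls(A, b)                        # solution of min ||Ax-b|| s.t. x >= 0

print("truncated:", x_trunc, "residual:", np.linalg.norm(A @ x_trunc - b))
print("NNLS:     ", x_nnls,  "residual:", np.linalg.norm(A @ x_nnls - b))
```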

  28.-29. Solving NNLS scalably: x ← (x − α∇f(x))_+. [Figure: projected-gradient iterates, with x* and x_uc marked.] A good choice of α is crucial: backtracking line-search, Armijo, and many others. Too slow!
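A minimal sketch of the projected-gradient update x ← (x − α∇f(x))_+ for f(x) = ½‖Ax − b‖², using the constant stepsize α = 1/L with L = ‖AᵀA‖₂ (one standard choice of a "suitable value of L" from the earlier stepsize slide). The random test problem and iteration count are illustrative choices.

```python
import numpy as np

def nnls_projected_gradient(A, b, n_iter=500):
    """Projected gradient for min 0.5*||Ax - b||^2 s.t. x >= 0."""
    AtA, Atb = A.T @ A, A.T @ b
    L = np.linalg.norm(AtA, 2)                   # Lipschitz constant of the gradient
    alpha = 1.0 / L                              # constant stepsize 1/L
    x = np.zeros(A.shape[1])
    for k in range(n_iter):
        grad = AtA @ x - Atb                     # grad f(x) = A^T (Ax - b)
        x = np.maximum(x - alpha * grad, 0.0)    # gradient step, then project onto x >= 0
    return x

# Illustrative random problem
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)
x = nnls_projected_gradient(A, b)
print(x.min() >= 0, 0.5 * np.linalg.norm(A @ x - b) ** 2)
```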

  30. NNLS: a long-studied problem.
  Method        | Remarks            | Scalability | Accuracy
  NNLS (1976)   | MATLAB default     | poor        | high
  FNNLS (1989)  | fast NNLS          | poor        | high
  LBFGS-B (1997)| famous solver      | fair        | medium
  TRON (1999)   | TR Newton          | poor        | high
  SPG (2000)    | spectral proj.     | fair+       | medium
  ASA (2006)    | prev. state-of-art | fair+       | medium
  SBB (2011)    | subspace BB steps  | very good   | medium

  31. Spectacular failure of projection: x′ = (x − α∇f(x))_+. [Figure: objective function value (log scale) vs. running time in seconds; legend: Naive BB+Projxn.]
