
Optimization for Machine Learning, Lecture 4: Quasi-Newton Methods (presentation transcript)



  1. Optimization for Machine Learning, Lecture 4: Quasi-Newton Methods. S.V.N. (vishy) Vishwanathan, Purdue University, vishy@purdue.edu. July 11, 2012.

  2. The Story So Far
     Two Different Philosophies
     - Online algorithms: use a small subset of the data at a time and repeatedly cycle.
     - Batch optimization: use the entire dataset to compute gradients and function values.
     Gradient-Based Approaches
     - Bundle methods: lower-bound the objective function using gradients.
     - Quasi-Newton algorithms: use the gradients to estimate the Hessian (build a quadratic approximation of the objective).
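To make the two philosophies concrete, here is a minimal sketch (not part of the slides) contrasting a batch gradient step, which touches the entire dataset, with an online pass that repeatedly cycles through small subsets. The least-squares objective and the names batch_step and online_epoch are illustrative assumptions.

```python
import numpy as np

def grad(w, X, y):
    """Gradient of the least-squares objective 0.5 * ||X w - y||^2 / n."""
    return X.T @ (X @ w - y) / len(y)

def batch_step(w, X, y, eta):
    """Batch philosophy: one step uses the gradient over the entire dataset."""
    return w - eta * grad(w, X, y)

def online_epoch(w, X, y, eta, batch_size=10):
    """Online philosophy: repeatedly cycle, using a small subset at a time."""
    for start in range(0, len(y), batch_size):
        sl = slice(start, start + batch_size)
        w = w - eta * grad(w, X[sl], y[sl])
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.arange(5.0)
y = X @ w_true

w_batch, w_online = np.zeros(5), np.zeros(5)
for _ in range(50):
    w_batch = batch_step(w_batch, X, y, eta=0.5)
    w_online = online_epoch(w_online, X, y, eta=0.1)
print(np.round(w_batch, 2), np.round(w_online, 2))
```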

  4. Classical Quasi-Newton Algorithms: Outline
     1. Classical Quasi-Newton Algorithms
     2. Non-smooth Problems
     3. BFGS with Subgradients
     4. Experiments

  5. Classical Quasi-Newton Algorithms: Broyden, Fletcher, Goldfarb, Shanno, the namesakes of the BFGS method.

  6. Classical Quasi-Newton Algorithms: Standard BFGS - I
     Locally quadratic approximation
     - ∇J(w_t) is the gradient of J at w_t
     - H_t is an n × n estimate of the Hessian of J
     - m_t(w) = J(w_t) + ⟨∇J(w_t), w − w_t⟩ + (1/2)(w − w_t)^⊤ H_t (w − w_t)
     Parameter update
     - Minimize the model: w_{t+1} = argmin_w m_t(w), which gives w_{t+1} = w_t − H_t^{-1} ∇J(w_t)
     - Damped step: w_{t+1} = w_t − η_t B_t ∇J(w_t)
     - η_t is a step size, usually found via a line search
     - B_t = H_t^{-1} is a symmetric PSD matrix
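A minimal sketch of the slide's update, assuming a generic smooth objective: build the local quadratic model m_t and take the step w_{t+1} = w_t − η_t B_t ∇J(w_t). The two-dimensional quadratic test objective below is an illustrative assumption, chosen so the exact inverse Hessian is available to check the step.

```python
import numpy as np

def quadratic_model(w, w_t, J_t, g_t, H_t):
    """m_t(w) = J(w_t) + <grad J(w_t), w - w_t> + 0.5 (w - w_t)^T H_t (w - w_t)."""
    d = w - w_t
    return J_t + g_t @ d + 0.5 * d @ H_t @ d

def quasi_newton_step(w_t, g_t, B_t, eta_t=1.0):
    """Minimize the model and damp the step: w_{t+1} = w_t - eta_t * B_t grad J(w_t)."""
    return w_t - eta_t * B_t @ g_t

# Illustrative objective J(w) = 0.5 w^T A w with symmetric positive definite A,
# so the true Hessian is A and its inverse is available in closed form.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
J = lambda w: 0.5 * w @ A @ w
grad_J = lambda w: A @ w

w_t = np.array([1.0, -2.0])
H_t, B_t = A, np.linalg.inv(A)
w_next = quasi_newton_step(w_t, grad_J(w_t), B_t)   # full step, eta_t = 1
print(w_next)                                        # ~[0, 0], the minimizer
print(quadratic_model(w_next, w_t, J(w_t), grad_J(w_t), H_t), J(w_next))  # model is exact for a quadratic J
```

In BFGS, H_t (equivalently B_t) is not computed exactly; it is estimated from gradient differences, as the next slide describes.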

  11. Classical Quasi-Newton Algorithms: Standard BFGS - II
     B matrix update
     - Update B by B_{t+1} = argmin_B ||B − B_t||_W (a weighted norm), subject to s_t = B y_t
     - y_t = ∇J(w_{t+1}) − ∇J(w_t) is the difference of gradients
     - s_t = w_{t+1} − w_t is the difference in parameters
     This yields the update formula
       B_{t+1} = (I − s_t y_t^⊤ / ⟨s_t, y_t⟩) B_t (I − y_t s_t^⊤ / ⟨s_t, y_t⟩) + s_t s_t^⊤ / ⟨s_t, y_t⟩
     Limited-memory variant: use a low-rank approximation to B.
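A minimal sketch of the update formula on this slide, written directly in terms of s_t and y_t; the function name bfgs_inverse_update and the random secant-condition check are illustrative, and a practical implementation would also guard against a tiny or negative ⟨s_t, y_t⟩.

```python
import numpy as np

def bfgs_inverse_update(B, s, y):
    """B_{t+1} = (I - s y^T / <s,y>) B_t (I - y s^T / <s,y>) + s s^T / <s,y>,
    with s = w_{t+1} - w_t and y = grad J(w_{t+1}) - grad J(w_t)."""
    rho = 1.0 / (s @ y)                     # assumes the curvature condition <s, y> > 0
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ B @ V.T + rho * np.outer(s, s)

# Sanity check: the updated matrix satisfies the secant condition B_{t+1} y = s.
rng = np.random.default_rng(1)
s, y = rng.normal(size=3), rng.normal(size=3)
if s @ y <= 0:
    y = -y                                  # flip so <s, y> > 0 holds for the demo
B_next = bfgs_inverse_update(np.eye(3), s, y)
print(np.allclose(B_next @ y, s))           # True
```

The limited-memory variant mentioned on the slide avoids storing this dense n × n matrix and instead keeps only a few recent (s_t, y_t) pairs.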

  15. Classical Quasi-Newton Algorithms: Line Search
     Wolfe conditions (for search direction d_t and step size η_t, with 0 < c_1 < c_2 < 1)
     - Sufficient decrease: J(w_t + η_t d_t) ≤ J(w_t) + c_1 η_t ⟨∇J(w_t), d_t⟩
     - Curvature condition: ⟨∇J(w_t + η_t d_t), d_t⟩ ≥ c_2 ⟨∇J(w_t), d_t⟩
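A minimal sketch that checks the two Wolfe conditions for a candidate step size. The quadratic test objective, the steepest-descent direction, and the constants c_1 = 1e-4 and c_2 = 0.9 are illustrative assumptions; the slide only requires 0 < c_1 < c_2 < 1.

```python
import numpy as np

def satisfies_wolfe(J, grad_J, w, d, eta, c1=1e-4, c2=0.9):
    """Check both Wolfe conditions for step size eta along a descent direction d."""
    g_dot_d = grad_J(w) @ d                                     # negative for a descent direction
    sufficient_decrease = J(w + eta * d) <= J(w) + c1 * eta * g_dot_d
    curvature = grad_J(w + eta * d) @ d >= c2 * g_dot_d
    return sufficient_decrease and curvature

# Illustrative check on J(w) = 0.5 ||w||^2 with the steepest-descent direction.
J = lambda w: 0.5 * w @ w
grad_J = lambda w: w
w = np.array([2.0, -1.0])
d = -grad_J(w)
for eta in (2.5, 1.0, 0.5, 0.01):
    print(eta, satisfies_wolfe(J, grad_J, w, d, eta))
```

On this example the largest step fails sufficient decrease and the smallest fails the curvature condition; together the two conditions bracket a range of acceptable step sizes.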

  17. Non-smooth Problems: Outline
     1. Classical Quasi-Newton Algorithms
     2. Non-smooth Problems
     3. BFGS with Subgradients
     4. Experiments

  18. Non-smooth Problems: Non-smooth Convex Optimization
     - BFGS assumes that the objective function is smooth.
     - But some of our losses look like this: [figure: a non-smooth loss with kinks]
     - Houston, we have a problem!
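For concreteness, here is a minimal sketch using the hinge loss as an illustrative example of such a non-smooth loss (the slide does not name a specific one): the one-sided difference quotients at the kink disagree, so there is no gradient there for BFGS to use.

```python
def hinge(margin):
    """Hinge loss, non-differentiable at margin = 1."""
    return max(0.0, 1.0 - margin)

eps = 1e-6
left = (hinge(1.0) - hinge(1.0 - eps)) / eps    # approximately -1
right = (hinge(1.0 + eps) - hinge(1.0)) / eps   # approximately  0
print(left, right)   # the one-sided slopes disagree: no gradient at the kink
```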

  21. Non-smooth Problems: Subgradients
     - A subgradient of f at x′ is any vector s which satisfies f(x) ≥ f(x′) + ⟨x − x′, s⟩ for all x.
     - The set of all subgradients at x′ is denoted ∂f(x′).
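A minimal numerical sketch of the definition, assuming f(x) = |x| (the example used on the next slide): the inequality f(x) ≥ f(x′) + ⟨x − x′, s⟩ is checked on a grid of points for two candidate values of s at x′ = 0.

```python
import numpy as np

f = abs   # f(x) = |x|

def is_subgradient(s, x_prime, xs):
    """Check f(x) >= f(x') + (x - x') * s on a grid of test points xs."""
    return all(f(x) >= f(x_prime) + (x - x_prime) * s for x in xs)

xs = np.linspace(-3, 3, 61)
print(is_subgradient(0.5, 0.0, xs))    # True:  0.5 lies in the subdifferential [-1, 1]
print(is_subgradient(1.5, 0.0, xs))    # False: 1.5 violates the inequality at some x
```

Any s in [−1, 1] passes this check at x′ = 0, matching ∂f(0) = [−1, 1] on the next slide.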

  24. Non-smooth Problems: Why is Non-smooth Optimization Hard?
     The key difficulties
     - A negative subgradient direction ≠ a descent direction
     - Abrupt changes in function value can occur
     - It is difficult to detect convergence
     [figure: f(x) = |x| and ∂f(0) = [−1, 1]]
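To make the first difficulty concrete, a minimal sketch (an illustrative example, not from the slides) with f(x) = |x| from the figure: s = 0.5 is a valid subgradient at the minimizer x = 0, yet every step along −s increases f, so a negative subgradient direction need not be a descent direction.

```python
f = abs

x = 0.0            # the minimizer of f(x) = |x|
s = 0.5            # a valid subgradient: 0.5 lies in the subdifferential [-1, 1] at 0
for eta in (1.0, 0.1, 0.01):
    x_new = x - eta * s
    print(eta, f(x_new) > f(x))   # True for every step size: the step moves uphill
```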
