Numerical Optimization Techniques – Léon Bottou, NEC Labs America (presentation transcript)

  1. Numerical Optimization Techniques – Léon Bottou, NEC Labs America – COS 424 – 3/2/2010

  2. Today’s Agenda Goals – Classification, clustering, regression, other. Representation – Parametric vs. kernels vs. nonparametric. – Probabilistic vs. nonprobabilistic. – Linear vs. nonlinear. – Deep vs. shallow. Capacity control – Explicit: architecture, feature selection. – Explicit: regularization, priors. – Implicit: approximate optimization. – Implicit: Bayesian averaging, ensembles. Loss functions. Operational considerations – Budget constraints. – Online vs. offline. Computational considerations – Exact algorithms for small datasets. – Stochastic algorithms for big datasets. – Parallel algorithms.

  3. Introduction General scheme: – Set a goal. – Define a parametric model. – Choose a suitable loss function. – Choose suitable capacity control methods. – Optimize the average loss over the training set. Optimization: – Sometimes analytic (e.g. linear model with squared loss). – Usually numerical (e.g. everything else).

  4. Summary 1. Convex vs. nonconvex 2. Differentiable vs. nondifferentiable 3. Constrained vs. unconstrained 4. Line search 5. Gradient descent 6. Hessian matrix, etc. 7. Stochastic optimization

  5. Convex Definition: ∀ x, y, ∀ λ ∈ [0, 1]: f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y). Property: any local minimum is a global minimum. Conclusion: optimization algorithms are easy to use; they always return the same solution. Example: linear model with convex loss function. – Curve fitting with mean squared error. – Linear classification with log-loss or hinge loss.
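
A minimal numeric sanity check of the convexity inequality, using the squared loss f(w) = (w − 3)² as an assumed example (the target value 3 is my own choice):

    import numpy as np

    def f(w):
        return (w - 3.0) ** 2   # squared loss around an assumed target of 3

    rng = np.random.default_rng(0)
    x, y = rng.uniform(-10, 10, size=2)
    for lam in np.linspace(0.0, 1.0, 11):
        lhs = f(lam * x + (1 - lam) * y)
        rhs = lam * f(x) + (1 - lam) * f(y)
        # convexity: f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
        assert lhs <= rhs + 1e-12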

  6. Nonconvex Landscape – Local minima, saddle points. – Plateaux, ravines, etc. Optimization algorithms – Usually find local minima. – Good and bad local minima. – Results depend on subtle details. Examples – Multilayer networks. – Mixture models. – Clustering algorithms. – Hidden Markov Models. – Learning features. – Selecting features (some). – Semi-supervised learning. – Transfer learning.

  7. Differentiable vs. Nondifferentiable [Figure: for a differentiable function, the derivative gives a local cue pointing toward the minimum.] No such local cues without derivatives. – Derivatives may not exist. – Derivatives may be too costly to compute. Example – Log loss versus hinge loss.
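
A small sketch contrasting the two example losses as functions of the classification margin: the log loss has a derivative everywhere, while the hinge loss has a kink at margin 1 where only a subgradient is available (function names are my own):

    import numpy as np

    def log_loss(margin):            # smooth: derivative exists everywhere
        return np.log1p(np.exp(-margin))

    def log_loss_grad(margin):       # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m))
        return -1.0 / (1.0 + np.exp(margin))

    def hinge_loss(margin):          # kink at margin == 1: not differentiable there
        return np.maximum(0.0, 1.0 - margin)

    def hinge_subgrad(margin):       # any value in [-1, 0] is a valid subgradient at 1
        return np.where(margin < 1.0, -1.0, 0.0)

    m = np.array([-2.0, 0.0, 1.0, 3.0])
    print(log_loss(m), log_loss_grad(m))
    print(hinge_loss(m), hinge_subgrad(m))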

  8. Constrained vs. Unconstrained Compare: min_w f(w) subject to ‖w‖² < C versus min_w f(w) + λ‖w‖². Constraints – Adding constraints leads to very different algorithms. Keywords – Lagrange coefficients. – Karush-Kuhn-Tucker theorem. – Primal optimization, dual optimization.
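
A sketch of the two formulations on a toy quadratic: the penalized version takes plain gradient steps on f(w) + λ‖w‖², while the constrained version projects back onto the ball ‖w‖² ≤ C after each step (the objective, λ, C, and step size are all assumptions for illustration, not from the slides):

    import numpy as np

    target = np.array([3.0, -1.0])
    def f_grad(w):                       # gradient of an assumed quadratic f(w) = ||w - target||^2
        return 2.0 * (w - target)

    lam, C, gamma = 0.1, 4.0, 0.05
    w_pen = np.zeros(2)
    w_con = np.zeros(2)
    for _ in range(500):
        # unconstrained, penalized:  min_w f(w) + lam * ||w||^2
        w_pen -= gamma * (f_grad(w_pen) + 2.0 * lam * w_pen)
        # constrained:  min_w f(w)  subject to ||w||^2 <= C  (projected gradient step)
        w_con -= gamma * f_grad(w_con)
        norm = np.linalg.norm(w_con)
        if norm ** 2 > C:
            w_con *= np.sqrt(C) / norm
    print(w_pen, w_con)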

  9. Line search - Bracketing a minimum [Figure] Three points a < b < c such that f(b) < f(a) and f(b) < f(c).
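
A minimal bracketing sketch: starting from two points, walk downhill with growing steps until the middle point is lower than both ends. This is a simplified routine of my own, not the exact procedure behind the slide:

    def bracket_minimum(f, a=0.0, b=1.0, grow=2.0, max_iter=100):
        """Return points a < b < c with f(b) <= f(a) and f(b) < f(c)."""
        if f(a) < f(b):                        # walk in the downhill direction
            a, b = b, a
        c = b + grow * (b - a)
        for _ in range(max_iter):
            if f(b) < f(c):                    # middle point is lowest: bracket found
                lo, mid, hi = sorted((a, b, c))
                return lo, mid, hi
            a, b, c = b, c, c + grow * (c - b) # keep stepping with growing steps
        raise RuntimeError("no bracket found (function may be unbounded below)")

    # example usage on an assumed 1-d function
    print(bracket_minimum(lambda x: (x - 5.0) ** 2))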

  10. Line search - Refining the bracket [Figure] Split the largest half and compute f(x).

  11. Line search - Refining the bracket [Figure] – Redefine a < b < c. Here a ← x. – Split the largest half and compute f(x).

  12. Line search - Refining the bracket [Figure] – Redefine a < b < c. Here a ← b, b ← x. – Split the largest half and compute f(x).

  13. Line search - Refining the bracket [Figure] – Redefine a < b < c. Here c ← x. – Split the largest half and compute f(x).

  14. Line search - Golden Section Algorithm [Figure: bracket split in golden-ratio proportions] – Optimal improvement by splitting at the golden ratio.
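
A sketch of golden-section refinement: given a bracket a < b < c, repeatedly place a trial point in the larger half at the golden-ratio fraction and shrink the bracket around the lowest point. This is my own simplified version of the standard algorithm, not the code behind the slides:

    GOLDEN = 0.381966   # 2 minus the golden ratio: fraction of the larger half to split off

    def golden_section(f, a, b, c, tol=1e-8):
        """Refine a bracket a < b < c with f(b) < f(a), f(b) < f(c)."""
        fb = f(b)
        while c - a > tol:
            # put the trial point x inside the larger of the two halves
            if (c - b) > (b - a):
                x = b + GOLDEN * (c - b)
            else:
                x = b - GOLDEN * (b - a)
            fx = f(x)
            if fx < fb:               # x becomes the new middle point
                if x > b:
                    a = b
                else:
                    c = b
                b, fb = x, fx
            else:                     # x becomes a new end point
                if x > b:
                    c = x
                else:
                    a = x
        return b

    print(golden_section(lambda x: (x - 5.0) ** 2, 3.0, 7.0, 15.0))   # close to 5.0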

  15. Line search - Parabolic Interpolation [Figure] – Fitting a parabola can give a much better guess.

  16. Line search - Parabolic Interpolation [Figure] – Fitting a parabola sometimes gives a much better guess.
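
A sketch of one parabolic-interpolation step: fit a parabola through (a, f(a)), (b, f(b)), (c, f(c)) and jump to its vertex. This is the standard textbook formula in my own wording:

    def parabolic_step(f, a, b, c):
        """Minimum of the parabola through (a, f(a)), (b, f(b)), (c, f(c))."""
        fa, fb, fc = f(a), f(b), f(c)
        num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
        den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
        return b - 0.5 * num / den     # caller should guard against den == 0

    # exact in a single step when f really is a parabola
    print(parabolic_step(lambda x: (x - 5.0) ** 2, 3.0, 7.0, 15.0))   # -> 5.0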

  17. Line search - Brent Algorithm Brent algorithm for line search – Alternate golden section and parabolic interpolation. – No more than twice slower than golden section. – No more than twice slower than parabolic interpolation. – In practice, almost as good as the better of the two. Variants with derivatives – Improvements if we can compute f(x) and f′(x) together. – Improvements if we can compute f(x), f′(x), f′′(x) together.
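
In practice one rarely codes this by hand; assuming SciPy is available, its Brent-based scalar minimizer can be used directly (the objective and bracket below are my own example values):

    from scipy.optimize import minimize_scalar

    # Brent's method alternates golden-section steps with parabolic interpolation.
    res = minimize_scalar(lambda x: (x - 5.0) ** 2 + 1.0,
                          bracket=(0.0, 4.0, 10.0), method='brent')
    print(res.x, res.fun)    # approximately 5.0 and 1.0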

  18. Coordinate Descent [Figure: zig-zagging coordinate-descent path] Perform successive line searches along the coordinate axes. – Tends to zig-zag.
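
A sketch of coordinate descent on a 2-d quadratic, with SciPy's scalar minimizer standing in for each axis-aligned line search (the quadratic is an assumed example whose axes are deliberately not aligned with the coordinate axes, which is what produces the zig-zag):

    import numpy as np
    from scipy.optimize import minimize_scalar

    A = np.array([[3.0, 2.0], [2.0, 3.0]])         # assumed quadratic f(w) = w'Aw / 2
    f = lambda w: 0.5 * w @ A @ w

    w = np.array([4.0, -3.0])
    for it in range(20):
        for i in range(len(w)):                     # successive line searches along the axes
            def f_axis(t, i=i):
                v = w.copy()
                v[i] = t
                return f(v)
            w[i] = minimize_scalar(f_axis).x
        print(it, w, f(w))                          # converges to 0, but zig-zags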

  19. Gradient The gradient ∂f/∂w = (∂f/∂w_1, …, ∂f/∂w_d); its negative gives the steepest descent direction.
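
A small finite-difference sketch of the gradient, useful for checking an analytic ∂f/∂w (the test function is an assumption of mine):

    import numpy as np

    def numeric_gradient(f, w, eps=1e-6):
        """Central finite-difference approximation of (df/dw_1, ..., df/dw_d)."""
        g = np.zeros_like(w)
        for i in range(len(w)):
            e = np.zeros_like(w)
            e[i] = eps
            g[i] = (f(w + e) - f(w - e)) / (2 * eps)
        return g

    f = lambda w: (w[0] - 1.0) ** 2 + 3.0 * w[1] ** 2
    print(numeric_gradient(f, np.array([2.0, 1.0])))   # close to the analytic [2, 6]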

  20. Steepest Descent [Figure] Perform successive line searches along the negative gradient direction. – Beneficial if computing the gradients is cheap enough. – Line searches can be expensive.
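
A sketch of steepest descent: at each step, a full line search along the negative gradient, with SciPy's scalar minimizer standing in for the line search (the badly conditioned quadratic and starting point are assumptions):

    import numpy as np
    from scipy.optimize import minimize_scalar

    A = np.array([[10.0, 0.0], [0.0, 1.0]])        # badly conditioned quadratic
    f = lambda w: 0.5 * w @ A @ w
    grad = lambda w: A @ w

    w = np.array([1.0, 10.0])
    for it in range(20):
        d = -grad(w)                                        # steepest descent direction
        step = minimize_scalar(lambda t: f(w + t * d)).x    # line search along d
        w = w + step * d
    print(w, f(w))                                          # slow zig-zag toward the minimum at 0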

  21. Gradient Descent Repeat: w ← w − γ ∂f/∂w(w). [Figure: gradient-descent trajectories for small and large gains] – Merges the gradient step and the line search. – A large gain γ increases the zig-zag tendency and can even diverge. – The high-curvature direction limits the usable gain size. – The low-curvature direction limits the speed of approach.
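
A sketch of plain gradient descent with a fixed gain γ on an assumed badly conditioned quadratic: for this example the iteration is stable only when γ is below 2 divided by the largest curvature (here 2/10 = 0.2), while the smallest curvature sets how slowly the other direction converges:

    import numpy as np

    A = np.array([[10.0, 0.0], [0.0, 1.0]])    # curvatures 10 and 1
    grad = lambda w: A @ w                      # gradient of f(w) = w'Aw / 2

    for gamma in (0.05, 0.15, 0.25):            # stable below 2/10 = 0.2, divergent above
        w = np.array([1.0, 10.0])
        for _ in range(100):
            w = w - gamma * grad(w)             # w <- w - gamma * df/dw(w)
        print(gamma, w)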

  22. Hessian matrix Hessian matrix: H(w) is the d × d matrix with entries H_ij(w) = ∂²f/∂w_i ∂w_j. Curvature information – Taylor expansion near the optimum w*: f(w) ≈ f(w*) + ½ (w − w*)⊤ H(w*) (w − w*). – This paraboloid has ellipsoidal level curves. – Principal axes are the eigenvectors of the Hessian. – Ratio of curvatures = ratio of eigenvalues of the Hessian.
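
A sketch computing the Hessian of an assumed test function by finite differences and reading the ratio of curvatures off its eigenvalues:

    import numpy as np

    def numeric_hessian(f, w, eps=1e-4):
        """Finite-difference approximation of H_ij = d^2 f / dw_i dw_j."""
        d = len(w)
        H = np.zeros((d, d))
        for i in range(d):
            for j in range(d):
                ei = np.zeros(d); ei[i] = eps
                ej = np.zeros(d); ej[j] = eps
                H[i, j] = (f(w + ei + ej) - f(w + ei) - f(w + ej) + f(w)) / eps ** 2
        return H

    f = lambda w: 5.0 * w[0] ** 2 + 0.5 * w[1] ** 2 + w[0] * w[1]
    H = numeric_hessian(f, np.zeros(2))
    eigvals = np.linalg.eigvalsh(H)
    print(H)                                # approximately [[10, 1], [1, 1]]
    print(eigvals.max() / eigvals.min())    # ratio of curvatures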

  23. Newton method Idea: since Taylor says ∂f/∂w(w) ≈ H(w)(w − w*), then w* ≈ w − H(w)⁻¹ ∂f/∂w(w). Newton algorithm: w ← w − H(w)⁻¹ ∂f/∂w(w). – Succession of paraboloidal approximations. – Exact when f(w) is a paraboloid, e.g. linear model + squared loss. – Very few iterations needed when H(w) is positive definite! – Beware when H(w) is not positive definite. – Computing and storing H(w)⁻¹ can be too costly. Quasi-Newton methods – Methods that avoid the drawbacks of Newton – but behave like Newton during the final convergence.
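
A sketch of the Newton iteration with the gradient and Hessian supplied analytically; the mildly non-quadratic test objective is an assumption, chosen so the Hessian stays positive definite:

    import numpy as np

    # assumed smooth test objective: a quadratic plus a small quartic term
    f    = lambda w: 5.0 * w[0] ** 2 + 0.5 * w[1] ** 2 + 0.1 * w[1] ** 4
    grad = lambda w: np.array([10.0 * w[0], w[1] + 0.4 * w[1] ** 3])
    hess = lambda w: np.array([[10.0, 0.0], [0.0, 1.0 + 1.2 * w[1] ** 2]])

    w = np.array([2.0, 3.0])
    for it in range(6):
        w = w - np.linalg.solve(hess(w), grad(w))   # w <- w - H(w)^-1 df/dw(w)
        print(it, w, f(w))                          # very fast convergence to the minimum at 0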

  24. Conjugate Gradient algorithm Conjugate directions: u, v conjugate ⇔ u⊤Hv = 0 (non-interacting directions). Conjugate Gradient algorithm – Compute g_t = ∂f/∂w(w_t). – Determine a line search direction d_t = g_t − λ d_{t−1}. – Choose λ such that d_t⊤ H d_{t−1} = 0. – Since g_t − g_{t−1} ≈ H(w_t − w_{t−1}) ∝ H d_{t−1}, this means λ = g_t⊤(g_t − g_{t−1}) / d_{t−1}⊤(g_t − g_{t−1}). – Perform a line search in direction d_t. – Loop. This is a fast and robust quasi-Newton algorithm. A solution for all our learning problems?
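
A sketch of this nonlinear conjugate-gradient recipe on an assumed quadratic, with SciPy's scalar minimizer used for the line searches. I flip the sign so the search direction is a descent direction, and the λ below is the Hestenes-Stiefel form that I read off the slide's formula:

    import numpy as np
    from scipy.optimize import minimize_scalar

    A = np.array([[10.0, 2.0], [2.0, 1.0]])         # assumed positive definite quadratic
    f    = lambda w: 0.5 * w @ A @ w
    grad = lambda w: A @ w

    w = np.array([4.0, -3.0])
    g_prev, d_prev = None, None
    for it in range(10):
        g = grad(w)
        if np.linalg.norm(g) < 1e-10:
            break
        if d_prev is None:
            d = -g                                   # first step: plain steepest descent
        else:
            y = g - g_prev                           # y ~ H (w_t - w_{t-1}), proportional to H d_{t-1}
            lam = (g @ y) / (d_prev @ y)             # chosen so that d_t' H d_{t-1} = 0
            d = -g + lam * d_prev
        step = minimize_scalar(lambda t: f(w + t * d)).x   # line search along d_t
        w, g_prev, d_prev = w + step * d, g, d
        print(it, w, f(w))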

  25. Optimization vs. learning Empirical cost – Usually f(w) = (1/n) Σ_{i=1}^n L(x_i, y_i, w). – The number n of training examples can be large (billions?). Redundant examples – Examples are redundant (otherwise there is nothing to learn). – Doubling the number of examples brings a little more information. – Do we need it during the first optimization iterations? Examples on the fly – All examples may not be available simultaneously. – Sometimes they come on the fly (e.g. web click stream). – In quantities that are too large to store or retrieve (e.g. click stream).

  26. Offline vs. Online Minimize C(w) = (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n L(x_i, y_i, w). Offline: process all examples together – Example: minimization by gradient descent. Repeat: w ← w − γ ( λw + (1/n) Σ_{i=1}^n ∂L/∂w(x_i, y_i, w) ). Online: process examples one by one – Example: minimization by stochastic gradient descent. Repeat: (a) Pick a random example (x_t, y_t). (b) w ← w − γ_t ( λw + ∂L/∂w(x_t, y_t, w) ).
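
A sketch contrasting the two updates on an assumed tiny synthetic regression set, with L(x, y, w) = ½(w⊤x − y)²; the data, the number of iterations, and the decreasing gain schedule γ_t are all illustrative choices of mine:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 1000, 5, 0.01
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.1 * rng.normal(size=n)

    def loss_grad(x, y_i, w):                  # gradient of L(x, y, w) = 0.5 * (w'x - y)^2
        return (w @ x - y_i) * x

    # offline: each update uses all n examples
    w_off = np.zeros(d)
    for _ in range(100):
        g = X.T @ (X @ w_off - y) / n          # average of the per-example gradients
        w_off -= 0.1 * (lam * w_off + g)

    # online: each update uses one randomly picked example, with a decreasing gain
    w_on = np.zeros(d)
    for t in range(1, 5001):
        i = rng.integers(n)
        gamma_t = 1.0 / (10.0 + t)
        w_on -= gamma_t * (lam * w_on + loss_grad(X[i], y[i], w_on))

    print(np.linalg.norm(w_off - w_true), np.linalg.norm(w_on - w_true))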
