SLIDE 1

Numerical Optimization Techniques

Léon Bottou

NEC Labs America

COS 424 – 3/2/2010

SLIDE 2

Today’s Agenda

Goals
– Classification, clustering, regression, other.

Representation
– Parametric vs. kernels vs. nonparametric
– Probabilistic vs. nonprobabilistic
– Linear vs. nonlinear
– Deep vs. shallow

Capacity Control
– Explicit: architecture, feature selection
– Explicit: regularization, priors
– Implicit: approximate optimization
– Implicit: Bayesian averaging, ensembles

Operational Considerations
– Loss functions
– Budget constraints
– Online vs. offline

Computational Considerations
– Exact algorithms for small datasets.
– Stochastic algorithms for big datasets.
– Parallel algorithms.

SLIDE 3

Introduction

General scheme
– Set a goal.
– Define a parametric model.
– Choose a suitable loss function.
– Choose suitable capacity control methods.
– Optimize average loss over the training set.

Optimization
– Sometimes analytic (e.g. linear model with squared loss.)
– Usually numerical (e.g. everything else.)

SLIDE 4

Summary

1. Convex vs. Nonconvex
2. Differentiable vs. Nondifferentiable
3. Constrained vs. Unconstrained
4. Line search
5. Gradient descent
6. Hessian matrix, etc.
7. Stochastic optimization

SLIDE 5

Convex

Definition

∀ x, y, ∀ 0 ≤ λ ≤ 1, f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)

Property
– Any local minimum is a global minimum.

Conclusion
– Optimization algorithms are easy to use.
– They always return the same solution.

Example: Linear model with convex loss function.
– Curve fitting with mean squared error.
– Linear classification with log-loss or hinge loss.
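For illustration (assuming NumPy; the data and tolerance are arbitrary choices), the defining inequality can be checked numerically for the mean squared error of a linear model, which is convex in w:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))        # arbitrary data for the illustration
    t = rng.normal(size=50)

    def f(w):
        # mean squared error of a linear model -- convex in w
        return np.mean((X @ w - t) ** 2)

    # check f(lam*u + (1-lam)*v) <= lam*f(u) + (1-lam)*f(v) on random pairs
    for _ in range(1000):
        u, v = rng.normal(size=5), rng.normal(size=5)
        lam = rng.uniform()
        assert f(lam * u + (1 - lam) * v) <= lam * f(u) + (1 - lam) * f(v) + 1e-9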

SLIDE 6

Nonconvex

Landscape
– Local minima, saddle points.
– Plateaux, ravines, etc.

Optimization algorithms
– Usually find local minima.
– Good and bad local minima.
– Results depend on subtle details.

Examples
– Multilayer networks.
– Clustering algorithms.
– Learning features.
– Semi-supervised learning.
– Mixture models.
– Hidden Markov Models.
– Selecting features (some).
– Transfer learning.

SLIDE 7

Differentiable vs. Nondifferentiable

  • No such local cues without derivatives

– Derivatives may not exist.
– Derivatives may be too costly to compute.

Examples
– Log loss versus Hinge loss.

SLIDE 8

Constrained vs. Unconstrained

Compare

    min_w f(w)   subject to   ‖w‖² < C        versus        min_w f(w) + λ ‖w‖²

Constraints
– Adding constraints leads to very different algorithms.

Keywords
– Lagrange coefficients.
– Karush–Kuhn–Tucker theorem.
– Primal optimization, dual optimization.
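A minimal Python sketch of how the two forms relate on a least-squares objective (the data, step size, and the choice of C are illustrative): plain gradient descent handles the penalized form, and a projection step handles the norm constraint.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = rng.normal(size=100)

    def grad_f(w):
        # gradient of f(w) = (1/n) ||Xw - y||^2
        return 2 * X.T @ (X @ w - y) / len(y)

    # Penalized form:  min_w f(w) + lam * ||w||^2   (plain gradient descent)
    lam, gamma = 0.1, 0.05
    w = np.zeros(10)
    for _ in range(500):
        w -= gamma * (grad_f(w) + 2 * lam * w)

    # Constrained form:  min_w f(w)  s.t.  ||w||^2 <= C   (projected gradient descent)
    C = w @ w                      # illustrative: choose C so the two problems match
    v = np.zeros(10)
    for _ in range(500):
        v -= gamma * grad_f(v)
        if v @ v > C:              # project back onto the ball ||v||^2 <= C
            v *= np.sqrt(C / (v @ v))

    print(np.linalg.norm(w - v))   # for this C the two solutions essentially coincide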

SLIDE 9

Line search - Bracketing a minimum

  • Three points a < b < c such that f(b) < f(a) and f(b) < f(c).

SLIDE 10

Line search - Refining the bracket

  • Split the largest half and compute f(x).

SLIDE 11

Line search - Refining the bracket

– Redefine a < b < c. Here a ← x.

– Split the largest half and compute f(x).

SLIDE 12

Line search - Refining the bracket

– Redefine a < b < c. Here a ← b, b ← x.

– Split the largest half and compute f(x).

SLIDE 13

Line search - Refining the bracket

– Redefine a < b < c. Here c ← x.

– Split the largest half and compute f(x).

SLIDE 14

Line search - Golden Section Algorithm

– Optimal improvement by splitting at the golden ratio.
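A minimal sketch of this bracket-refinement scheme with golden-ratio splits (a textbook version; the tolerance and test function are illustrative):

    def golden_section(f, a, b, c, tol=1e-8):
        """Minimize f given a bracket a < b < c with f(b) < f(a) and f(b) < f(c)."""
        invphi = 2.0 / (1.0 + 5.0 ** 0.5)            # 1/golden ratio ~ 0.618
        fb = f(b)
        while c - a > tol:
            # split the larger half at the golden ratio
            if c - b > b - a:
                x = b + (1.0 - invphi) * (c - b)
            else:
                x = b - (1.0 - invphi) * (b - a)
            fx = f(x)
            if fx < fb:
                # x becomes the new middle point
                if x > b:
                    a = b
                else:
                    c = b
                b, fb = x, fx
            else:
                # x becomes a new end point of the bracket
                if x > b:
                    c = x
                else:
                    a = x
        return b

    # usage: minimum of (x - 1)^2 from the bracket (-4, 0, 5)
    xmin = golden_section(lambda x: (x - 1.0) ** 2, -4.0, 0.0, 5.0)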

SLIDE 15

Line search - Parabolic Interpolation

– Fitting a parabola can give a much better guess.

SLIDE 16

Line search - Parabolic Interpolation

– Fitting a parabola sometimes gives a much better guess.
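The parabolic guess is the vertex of the parabola through the three bracket points; a small sketch using the standard vertex formula (the slide itself does not spell it out):

    def parabolic_step(a, b, c, fa, fb, fc):
        """Abscissa of the minimum of the parabola through (a,fa), (b,fb), (c,fc)."""
        num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
        den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
        return b - 0.5 * num / den      # caller must guard against den == 0

    # exact for a true parabola: recovers the minimum of (x - 3)^2 in one step
    f = lambda x: (x - 3.0) ** 2
    x_new = parabolic_step(0.0, 1.0, 5.0, f(0.0), f(1.0), f(5.0))   # -> 3.0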

SLIDE 17

Line search - Brent Algorithm

Brent algorithm for line search
– Alternates golden section and parabolic interpolation.
– No more than twice slower than golden section.
– No more than twice slower than parabolic interpolation.
– In practice, almost as good as the better of the two.

Variants with derivatives
– Improvements if we can compute f(x) and f′(x) together.
– Improvements if we can compute f(x), f′(x), f′′(x) together.
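Brent's method is what 1-D minimizers in standard libraries typically implement; for example, assuming SciPy is available, scipy.optimize.brent accepts a bracketing triple like the one constructed on the previous slides:

    from scipy.optimize import brent

    f = lambda x: (x - 2.0) ** 4 + (x - 2.0) ** 2
    xmin = brent(f, brack=(-5.0, 0.0, 10.0))     # (a, b, c) with f(b) < f(a), f(c)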

SLIDE 18

Coordinate Descent

  • Perform successive line searches along the axes.

– Tends to zig-zag.
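A minimal coordinate descent sketch (the test problem is illustrative), using a library 1-D minimizer for each axis-aligned line search:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def coordinate_descent(f, w0, sweeps=20):
        """Successive 1-D line searches along the coordinate axes."""
        w = np.array(w0, dtype=float)
        for _ in range(sweeps):
            for i in range(len(w)):
                # minimize f along axis i, all other coordinates held fixed
                def f_i(t, i=i):
                    v = w.copy()
                    v[i] = t
                    return f(v)
                w[i] = minimize_scalar(f_i).x
        return w

    # usage: a convex quadratic whose axes are correlated, hence the zig-zag path
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    w_min = coordinate_descent(lambda w: 0.5 * w @ A @ w - b @ w, w0=[5.0, -5.0])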

SLIDE 19

Gradient

The gradient

    ∂f/∂w = ( ∂f/∂w1, . . . , ∂f/∂wd )

gives the steepest descent direction: the update moves along −∂f/∂w.

SLIDE 20

Steepest Descent

  • Perform successive line searches along the gradient direction.

– Beneficial if computing the gradients is cheap enough.
– Line searches can be expensive.
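A small steepest descent sketch with an explicit line search over the step length (the ill-conditioned quadratic is an illustrative choice):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def steepest_descent(f, grad, w0, steps=100):
        w = np.array(w0, dtype=float)
        for _ in range(steps):
            d = -grad(w)                                        # steepest descent direction
            gamma = minimize_scalar(lambda t: f(w + t * d)).x   # 1-D line search
            w = w + gamma * d
        return w

    # badly conditioned quadratic: the iterates zig-zag between the two axes
    A = np.diag([1.0, 25.0])
    w_min = steepest_descent(lambda w: 0.5 * w @ A @ w, lambda w: A @ w, w0=[10.0, 1.0])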

SLIDE 21

Gradient Descent

Repeat:

    w ← w − γ ∂f/∂w (w)

– Merges gradient and line search into a single update with a fixed gain γ.
– Large gains increase zig-zag tendencies and can cause divergence.
– High curvature directions limit the gain size.
– Low curvature directions limit the speed of approach.
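A tiny illustration of the last two points on a quadratic whose curvatures differ by a factor of 100 (the values are illustrative): for a quadratic, the fixed gain must stay below 2 divided by the largest curvature, while the smallest curvature sets the speed of approach.

    import numpy as np

    h = np.array([1.0, 100.0])          # curvatures of f(w) = 0.5 * sum(h * w**2)
    grad = lambda w: h * w

    def gradient_descent(gamma, steps=200):
        w = np.array([1.0, 1.0])
        for _ in range(steps):
            w = w - gamma * grad(w)
        return w

    print(gradient_descent(0.019))   # stable (gamma < 2/100 = 0.02), but axis 1 converges slowly
    print(gradient_descent(0.021))   # diverges: gain too large for the high-curvature axis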

SLIDE 22

Hessian matrix

    H(w) = [ ∂²f/∂w1∂w1   · · ·   ∂²f/∂w1∂wd ]
           [     ...                  ...    ]
           [ ∂²f/∂wd∂w1   · · ·   ∂²f/∂wd∂wd ]

Curvature information
– Taylor expansion near the optimum w∗:

    f(w) ≈ f(w∗) + ½ (w − w∗)⊤ H(w∗) (w − w∗)

– This paraboloid has ellipsoidal level curves.
– Principal axes are the eigenvectors of the Hessian.
– Ratio of curvatures = ratio of eigenvalues of the Hessian.

SLIDE 23

Newton method

Idea
– Since Taylor says ∂f/∂w (w) ≈ H(w) (w − w∗), then w∗ ≈ w − H(w)⁻¹ ∂f/∂w (w).

Newton algorithm

    w ← w − H(w)⁻¹ ∂f/∂w (w)

– Succession of paraboloidal approximations.
– Exact when f(w) is a paraboloid, e.g. linear model + squared loss.
– Very few iterations needed when H(w) is positive definite!
– Beware when H(w) is not positive definite.
– Computing and storing H(w)⁻¹ can be too costly.

Quasi-Newton methods
– Methods that avoid the drawbacks of Newton
– but behave like Newton during the final convergence.
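A minimal Newton sketch (the data, loss, and regularizer are illustrative), solving the linear system H d = g rather than forming H(w)⁻¹ explicitly:

    import numpy as np

    def newton(grad, hess, w0, steps=10):
        """Newton iteration  w <- w - H(w)^{-1} grad(w), via a linear solve."""
        w = np.array(w0, dtype=float)
        for _ in range(steps):
            w = w - np.linalg.solve(hess(w), grad(w))
        return w

    # usage: regularized log-loss of a linear classifier (positive definite Hessian)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.sign(X @ np.ones(5) + 0.1 * rng.normal(size=200))   # labels in {-1, +1}
    lam = 0.1

    def grad(w):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))                     # sigma(x_i . w)
        return X.T @ (p - (y + 1) / 2) / len(y) + lam * w

    def hess(w):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        return (X * (p * (1 - p))[:, None]).T @ X / len(y) + lam * np.eye(len(w))

    w_star = newton(grad, hess, w0=np.zeros(5))

Each solve costs O(d³) for a dense Hessian, which is what makes the storage and cost caveat bite in high dimension.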

SLIDE 24

Conjugate Gradient algorithm

Conjugate directions
– u, v conjugate ⇐⇒ u⊤ H v = 0. Non-interacting directions.

Conjugate Gradient algorithm
– Compute gt = ∂f/∂w (wt).
– Determine a line search direction dt = gt − λ dt−1.
– Choose λ such that dt⊤ H dt−1 = 0.
  Since gt − gt−1 ≈ H (wt − wt−1) ∝ H dt−1, this means λ = gt⊤(gt − gt−1) / dt−1⊤(gt − gt−1).
– Perform a line search in direction dt.
– Loop.

This is a fast and robust quasi-Newton algorithm. A solution for all our learning problems?
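A sketch of nonlinear conjugate gradient using the λ rule above (the Hestenes–Stiefel formula), written with the usual minimization convention d = −g + β d_prev; the stopping rule and test quadratic are illustrative:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def nonlinear_cg(f, grad, w0, steps=50):
        w = np.array(w0, dtype=float)
        g = grad(w)
        d = -g
        for _ in range(steps):
            gamma = minimize_scalar(lambda t: f(w + t * d)).x   # line search along d
            w = w + gamma * d
            g_new = grad(w)
            if np.linalg.norm(g_new) < 1e-8:
                break
            y = g_new - g                         # ~ H d_{t-1}, up to scaling
            beta = (g_new @ y) / (d @ y)          # Hestenes-Stiefel rule from the slide
            d = -g_new + beta * d
            g = g_new
        return w

    # on a convex quadratic, converges in about as many steps as there are dimensions
    A = np.diag([1.0, 10.0, 100.0])
    w_min = nonlinear_cg(lambda w: 0.5 * w @ A @ w, lambda w: A @ w, w0=[1.0, 1.0, 1.0])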

SLIDE 25

Optimization vs. learning

Empirical cost
– Usually f(w) = (1/n) Σi=1..n L(xi, yi, w).
– The number n of training examples can be large (billions?)

Redundant examples
– Examples are redundant (otherwise there is nothing to learn.)
– Doubling the number of examples brings a little more information.
– Do we need it during the first optimization iterations?

Examples on-the-fly
– All examples may not be available simultaneously.
– Sometimes they come on the fly (e.g. web click stream.)
– In quantities that are too large to store or retrieve (e.g. click stream.)

SLIDE 26

Offline vs. Online

Minimize

    C(w) = (λ/2) ‖w‖² + (1/n) Σi=1..n L(xi, yi, w)

Offline: process all examples together
– Example: minimization by gradient descent.
  Repeat:   w ← w − γ [ λw + (1/n) Σi=1..n ∂L/∂w (xi, yi, w) ]

Online: process examples one by one
– Example: minimization by stochastic gradient descent.
  Repeat:   (a) Pick a random example (xt, yt).
            (b) w ← w − γt [ λw + ∂L/∂w (xt, yt, w) ]
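A side-by-side sketch of the two updates on a small synthetic problem (hinge loss; the data sizes, gains, and the decreasing-gain schedule from the next slide are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 1000, 10, 1e-4
    X = rng.normal(size=(n, d))
    y = np.sign(X @ rng.normal(size=d))            # labels in {-1, +1}

    def dLdw(w, x, yi):
        # subgradient of the hinge loss L = max(0, 1 - yi * x.w) for one example
        return -yi * x if yi * (x @ w) < 1 else np.zeros_like(w)

    # Offline gradient descent: every update touches all n examples
    w, gamma = np.zeros(d), 0.1
    for _ in range(50):
        g = sum(dLdw(w, X[i], y[i]) for i in range(n)) / n
        w = w - gamma * (lam * w + g)

    # Stochastic gradient descent: one random example per update,
    # with decreasing gains gamma_t = gamma_0 / (1 + lam * gamma_0 * t)
    w_sgd, gamma0 = np.zeros(d), 0.1
    for t in range(50 * n):                        # same number of example visits
        i = rng.integers(n)
        gamma_t = gamma0 / (1 + lam * gamma0 * t)
        w_sgd = w_sgd - gamma_t * (lam * w_sgd + dLdw(w_sgd, X[i], y[i]))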

SLIDE 27

Stochastic Gradient Descent

– Very noisy estimates of the gradient.
– The gain γt controls the size of the cloud.
– Decreasing gains: γt = γ0 (1 + λ γ0 t)⁻¹.
– Why is it attractive?

SLIDE 28

Stochastic Gradient Descent

Redundant examples
– Increase the computing cost of offline learning.
– Do not change the computing cost of online learning.

Imagine the dataset contains 10 copies of the same 100 examples.

  • Offline Gradient Descent

Computation is 10 times larger than necessary.

  • Stochastic Gradient Descent

No difference regardless of the number of copies.

SLIDE 29

Practical example

Document classification
– Similar to homework #2 but bigger.
– 781,264 training examples.
– 47,152 dimensions.

Linear classifier with Hinge Loss
– Offline dual coordinate descent (svmlight): 6 hours.
– Offline primal bundle optimizer (svmperf): 66 seconds.
– Stochastic Gradient Descent: 1.4 seconds.

Linear classifier with Log Loss
– Offline truncated Newton (tron): 44 seconds.
– Offline conjugate gradient descent: 40 seconds.
– Stochastic Gradient Descent: 2.3 seconds.

These are times to reach the same test set error.

SLIDE 30

The wall

[Figure: testing cost and optimization accuracy (trainingCost − optimalTrainingCost) as a function of training time in seconds, comparing SGD with TRON (LibLinear).]
