1. Machine Learning - MT 2016, 6. Optimisation
Varun Kanade, University of Oxford
October 26, 2016

2. Outline

Most machine learning methods can (ultimately) be cast as optimization problems.

◮ Linear Programming
◮ Basics: Gradients, Hessians
◮ Gradient Descent
◮ Stochastic Gradient Descent
◮ Constrained Optimization

Most machine learning packages, such as scikit-learn, tensorflow, octave, torch etc., will have optimization methods implemented. But you will have to understand the basics of optimization to use them effectively.

3. Linear Programming

We are looking for solutions x ∈ R^n to the following optimization problem:

minimize    c^T x
subject to: a_i^T x ≤ b_i,   i = 1, ..., m
            ā_i^T x = b̄_i,   i = 1, ..., l

◮ No analytic solution
◮ “Efficient” algorithms exist

4. Linear Model with Absolute Loss

Suppose we have data ⟨(x_i, y_i)⟩_{i=1}^N and that we want to minimise the objective:

L(w) = Σ_{i=1}^N |x_i^T w − y_i|

Let us introduce ζ_i, one for each datapoint, and consider the linear program in the D + N variables w_1, ..., w_D, ζ_1, ..., ζ_N:

minimize    Σ_{i=1}^N ζ_i
subject to: w^T x_i − y_i ≤ ζ_i,   i = 1, ..., N
            y_i − w^T x_i ≤ ζ_i,   i = 1, ..., N
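For illustration, this LP can be handed to an off-the-shelf solver. A minimal sketch in Python, assuming numpy and scipy are available; the function name absolute_loss_fit and the variable ordering are illustrative, not part of the slides:

import numpy as np
from scipy.optimize import linprog

def absolute_loss_fit(X, y):
    # Variables are [w_1, ..., w_D, zeta_1, ..., zeta_N]; minimise sum_i zeta_i.
    N, D = X.shape
    c = np.concatenate([np.zeros(D), np.ones(N)])
    # Constraints:  x_i^T w - zeta_i <= y_i   and   -x_i^T w - zeta_i <= -y_i
    A_ub = np.block([[X, -np.eye(N)], [-X, -np.eye(N)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * D + [(0, None)] * N   # w free, zeta >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:D]   # the fitted weights w

At the optimum each ζ_i equals |x_i^T w − y_i|, so the LP objective equals the absolute-loss objective L(w).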

5. Minimising the Lasso Objective

For the Lasso objective, i.e., a linear model with ℓ1-regularisation, we have

L_lasso(w) = Σ_{i=1}^N (w^T x_i − y_i)^2 + λ Σ_{i=1}^D |w_i|

◮ The quadratic part of the loss function can’t be framed as linear programming
◮ Lasso regularisation does not allow for closed-form solutions
◮ We must resort to general optimisation methods

6. Calculus Background: Gradients

z = f(w_1, w_2) = w_1^2/a^2 + w_2^2/b^2

∂f/∂w_1 = 2w_1/a^2,   ∂f/∂w_2 = 2w_2/b^2

∇_w f = [ 2w_1/a^2 , 2w_2/b^2 ]^T

◮ Gradient vectors are orthogonal to contour curves
◮ The gradient points in the direction of steepest increase

7. Calculus Background: Hessians

z = f(w_1, w_2) = w_1^2/a^2 + w_2^2/b^2

∇_w f = [ 2w_1/a^2 , 2w_2/b^2 ]^T

H = [ ∂^2 f/∂w_1^2       ∂^2 f/∂w_1 ∂w_2 ]   =   [ 2/a^2    0     ]
    [ ∂^2 f/∂w_2 ∂w_1    ∂^2 f/∂w_2^2    ]       [ 0        2/b^2 ]

◮ As long as all second derivatives exist, the Hessian H is symmetric
◮ The Hessian captures the curvature of the surface

8. Calculus Background: Chain Rule

z = f( w_1(θ_1, θ_2) , w_2(θ_1, θ_2) )

(Computational graph: θ_1 and θ_2 feed into w_1 and w_2, which feed into f, producing z.)

∂f/∂θ_1 = ∂f/∂w_1 · ∂w_1/∂θ_1 + ∂f/∂w_2 · ∂w_2/∂θ_1

We will use this a lot when we study neural networks and backpropagation.

9. General Form for Gradient and Hessian

Suppose w ∈ R^D and f : R^D → R.

The gradient vector contains all first-order partial derivatives:

∇_w f(w) = [ ∂f/∂w_1 , ∂f/∂w_2 , ... , ∂f/∂w_D ]^T

The Hessian matrix of f contains all second-order partial derivatives:

H = [ ∂^2 f/∂w_1^2       ∂^2 f/∂w_1 ∂w_2    ...   ∂^2 f/∂w_1 ∂w_D ]
    [ ∂^2 f/∂w_2 ∂w_1    ∂^2 f/∂w_2^2       ...   ∂^2 f/∂w_2 ∂w_D ]
    [       ...                ...          ...         ...        ]
    [ ∂^2 f/∂w_D ∂w_1    ∂^2 f/∂w_D ∂w_2    ...   ∂^2 f/∂w_D^2    ]
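As a quick sanity check, the analytic gradient from the earlier quadratic example can be verified numerically with central finite differences. A minimal sketch in Python, assuming numpy; the helper numerical_gradient is illustrative, not part of the slides:

import numpy as np

def f(w, a=2.0, b=1.0):
    # The quadratic from the slides: f(w1, w2) = w1^2/a^2 + w2^2/b^2
    return w[0]**2 / a**2 + w[1]**2 / b**2

def numerical_gradient(f, w, eps=1e-6):
    # Central differences: (f(w + eps e_j) - f(w - eps e_j)) / (2 eps), one coordinate at a time
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

w = np.array([1.0, -0.5])
print(numerical_gradient(f, w))   # close to the analytic gradient [2*1/2^2, 2*(-0.5)/1^2] = [0.5, -1.0]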

10. Gradient Descent Algorithm

Gradient descent is one of the simplest, yet very general, algorithms for optimization. It is an iterative algorithm, producing a new w_{t+1} at each iteration as

w_{t+1} = w_t − η_t g_t = w_t − η_t ∇f(w_t)

We denote the gradient at step t by g_t; η_t > 0 is the learning rate or step size.
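A minimal sketch of this update rule in Python, assuming a numpy weight vector and a user-supplied gradient function; the fixed step size and iteration count are illustrative:

def gradient_descent(grad, w0, eta=0.1, num_iters=100):
    # Repeatedly apply w_{t+1} = w_t - eta * grad(w_t)
    w = w0.copy()
    for t in range(num_iters):
        w = w - eta * grad(w)
    return w

# Example: gradient_descent(lambda w: 2 * w, np.array([3.0])) shrinks w towards the minimiser 0.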

11. Gradient Descent for Least Squares Regression

L(w) = (Xw − y)^T (Xw − y) = Σ_{i=1}^N (x_i^T w − y_i)^2

We can compute the gradient of L with respect to w:

∇_w L = 2 ( X^T X w − X^T y )

◮ Why would you want to use gradient descent instead of directly plugging in the formula?
◮ If N and D are both very large:
  ◮ Computational complexity of the matrix formula is O( min{ N^2 D , N D^2 } )
  ◮ Each gradient calculation is O(ND)
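An end-to-end sketch in Python with synthetic data (the step size, iteration count, and data sizes are illustrative choices, not from the slides): gradient descent with the gradient above recovers essentially the same solution as the closed-form least-squares answer.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

def grad_L(w):
    # 2 (X^T X w - X^T y), evaluated as X^T (X w - y) so each call costs O(ND)
    return 2 * (X.T @ (X @ w - y))

w = np.zeros(5)
eta = 1e-4              # small enough for this X; a much larger step would diverge
for t in range(5000):
    w = w - eta * grad_L(w)

w_closed_form = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(w, w_closed_form, atol=1e-3))   # True: both give essentially the same w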

12. Choosing a Step Size

◮ Choosing a good step size is important
◮ If the step size is too large, the algorithm may never converge
◮ If the step size is too small, convergence may be very slow
◮ We may want a time-varying step size

13. Newton’s Method (Second Order Method)

◮ Gradient descent uses only the first derivative: a local linear approximation around the current point
◮ Newton’s method uses second derivatives: a degree-2 Taylor approximation around the current point

14. Newton’s Method in High Dimensions

The updates depend on the gradient g_t and the Hessian H_t at the point w_t:

w_{t+1} = w_t − H_t^{-1} g_t

Approximate f around w_t using a second-order Taylor approximation:

f_quad(w) = f(w_t) + g_t^T (w − w_t) + (1/2) (w − w_t)^T H_t (w − w_t)

We move directly to the (unique) stationary point of f_quad. The gradient of f_quad is given by

∇_w f_quad = g_t + H_t (w − w_t)

Setting ∇_w f_quad = 0 to get w_{t+1}, we have w_{t+1} = w_t − H_t^{-1} g_t.
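A minimal sketch of this update in Python, assuming numpy and user-supplied grad and hess callables; the fixed iteration count stands in for a proper convergence test:

import numpy as np

def newton(grad, hess, w0, num_iters=20):
    # Each step jumps to the stationary point of the local quadratic model:
    # w_{t+1} = w_t - H_t^{-1} g_t, computed by solving H_t d = g_t
    w = w0.copy()
    for t in range(num_iters):
        g, H = grad(w), hess(w)
        w = w - np.linalg.solve(H, g)
    return w

Solving the linear system H_t d = g_t is preferable to forming H_t^{-1} explicitly, but the per-iteration cost is still cubic in D in general, which is the expense mentioned on the next slide.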

15. Newton’s Method gives Stationary Points

(Figure: three surfaces, one where H has positive eigenvalues (a minimum), one where H has negative eigenvalues (a maximum), and one where H has mixed eigenvalues (a saddle point).)

The Hessian will tell you which kind of stationary point has been found.

Newton’s method can be computationally expensive in high dimensions: we need to compute and invert a Hessian at each iteration.

16. Minimising the Lasso Objective

For the Lasso objective, i.e., a linear model with ℓ1-regularisation, we have

L_lasso(w) = Σ_{i=1}^N (w^T x_i − y_i)^2 + λ Σ_{i=1}^D |w_i|

◮ The quadratic part of the loss function can’t be framed as linear programming
◮ Lasso regularisation does not allow for closed-form solutions
◮ We must resort to general optimisation methods
◮ We still have the problem that the objective function is not differentiable!

17. Sub-gradient Descent

Focus on the case when f is convex:

f( αx + (1 − α)y ) ≤ α f(x) + (1 − α) f(y)   for all x, y and α ∈ [0, 1]

In one dimension: f(x) ≥ f(x_0) + g (x − x_0), where g is a sub-derivative at x_0.
In higher dimensions: f(x) ≥ f(x_0) + g^T (x − x_0), where g is a sub-gradient at x_0.

Any g satisfying the above inequality is called a sub-gradient at x_0.

18. Sub-gradient Descent

f(w) = |w_1| + |w_2| + |w_3| + |w_4| for w ∈ R^4. What is a sub-gradient at the point w = [2, −3, 0, 1]^T?

g = [ 1 , −1 , γ , 1 ]^T   for any γ ∈ [−1, 1]

(Figure: plot of f(x) = max(x, 0).) The sub-derivative of f(x) = max(x, 0) at x = 0 is the interval [0, 1].
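In code, one convenient member of the sub-gradient set of the ℓ1 norm is given by the sign function, which picks γ = 0 at coordinates where w_j = 0. A small sketch in Python, assuming numpy:

import numpy as np

def subgradient_l1(w):
    # sign(w_j) where w_j != 0; at w_j == 0 any value in [-1, 1] is valid, and sign returns 0
    return np.sign(w)

print(subgradient_l1(np.array([2.0, -3.0, 0.0, 1.0])))   # [ 1. -1.  0.  1.]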

19. Optimization Algorithms for Machine Learning

We have data D = ⟨(x_i, y_i)⟩_{i=1}^N. We are minimizing the objective function

L(w; D) = (1/N) Σ_{i=1}^N ℓ(w; x_i, y_i) + λ R(w)

where λ R(w) is the regularisation term. The gradient of the objective function is

∇_w L = (1/N) Σ_{i=1}^N ∇_w ℓ(w; x_i, y_i) + λ ∇_w R(w)

For Ridge Regression we have

L_ridge(w) = (1/N) Σ_{i=1}^N (w^T x_i − y_i)^2 + λ w^T w

∇_w L_ridge = (1/N) Σ_{i=1}^N 2 (w^T x_i − y_i) x_i + 2 λ w
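A direct transcription of the ridge gradient into Python, vectorised over the whole dataset; assuming numpy, with illustrative names:

import numpy as np

def grad_ridge(w, X, y, lam):
    # (1/N) sum_i 2 (w^T x_i - y_i) x_i + 2 lam w  ==  (2/N) X^T (X w - y) + 2 lam w
    N = X.shape[0]
    return (2.0 / N) * (X.T @ (X @ w - y)) + 2.0 * lam * w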

20. Stochastic Gradient Descent

As part of the learning algorithm, we calculate the following gradient:

∇_w L = (1/N) Σ_{i=1}^N ∇_w ℓ(w; x_i, y_i) + λ ∇_w R(w)

Suppose we pick a random datapoint (x_i, y_i) and evaluate

g_i = ∇_w ℓ(w; x_i, y_i)

What is E[g_i]?

E[g_i] = (1/N) Σ_{i=1}^N ∇_w ℓ(w; x_i, y_i)

Instead of computing the entire gradient, we can compute the gradient at just a single datapoint! In expectation, g_i points in the same direction as the entire gradient (except for the regularisation term).
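A minimal SGD sketch in Python, assuming numpy and a user-supplied per-example gradient function grad_i(w, i); the regularisation term is left out here, matching the caveat above, and the hyperparameters are illustrative:

import numpy as np

def sgd(grad_i, w0, N, eta=0.01, num_iters=10000, seed=0):
    # Each step uses the gradient at one randomly chosen datapoint,
    # an unbiased estimate of the full (unregularised) gradient.
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for t in range(num_iters):
        i = rng.integers(N)          # pick a random datapoint index
        w = w - eta * grad_i(w, i)
    return w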

21. Online Learning: Stochastic Gradient Descent

◮ Using stochastic gradient descent it is possible to learn “online”, i.e., when we receive the data a little at a time
◮ The cost of computing the gradient in stochastic gradient descent (SGD) is significantly lower than that of the gradient on the full dataset
◮ Learning rates should be chosen by (cross-)validation

22. Batch/Offline Learning

w_{t+1} = w_t − η [ (1/N) Σ_{i=1}^N ∇_w ℓ(w_t; x_i, y_i) + λ ∇_w R(w_t) ]

Online Learning

w_{t+1} = w_t − η [ ∇_w ℓ(w_t; x_i, y_i) + λ ∇_w R(w_t) ]

Minibatch Online Learning (minibatch of size b)

w_{t+1} = w_t − η [ (1/b) Σ_{i=1}^b ∇_w ℓ(w_t; x_i, y_i) + λ ∇_w R(w_t) ]
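A sketch of a single minibatch update in Python, assuming numpy; grad_loss (per-example loss gradient) and grad_reg (regulariser gradient) are user-supplied callables, and all names are illustrative:

import numpy as np

def minibatch_step(w, X, y, grad_loss, grad_reg, eta, lam, b, rng):
    # Average the per-example gradients over b randomly chosen points,
    # add the regulariser gradient, and take one step.
    idx = rng.integers(len(y), size=b)
    g = sum(grad_loss(w, X[i], y[i]) for i in idx) / b
    return w - eta * (g + lam * grad_reg(w))

Setting b = N (and sampling without replacement) recovers the batch update, while b = 1 recovers the online update.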

23. Many Optimisation Techniques (Tricks)

First Order / (Sub-)Gradient Methods
◮ Nesterov’s Accelerated Gradient
◮ Line search to find the step size
◮ Momentum-based methods
◮ AdaGrad, AdaDelta, Adam, RMSProp

Second Order / Newton / Quasi-Newton Methods
◮ Conjugate Gradient Method
◮ BFGS and L-BFGS

24. Adagrad: Example Application for Text Data

Example: documents represented by binary features x_1, ..., x_4 with labels y ∈ {−1, +1}:

 y   x1  x2  x3  x4
 1    1   0   0   1
-1    1   1   0   0
-1    1   1   1   0
...
 1    1   1   0   0
 1    1   0   0   0
-1    1   1   1   0
 1    1   1   0   0
 1    1   1   0   1
 1    1   1   0   0

(Accompanying text excerpt: “Heathrow: Will Boris Johnson lie down in front of the bulldozers? He was happy to lie down the side of a bus. ... On his part, Johnson has already sought to clarify the comments, telling Sky News that what he in fact said was not that he would lie down in front of the bulldozers, but that he would lie down the side. And he never actually said bulldozers, he said bus.”)

Adagrad Update

w_{t+1, i} ← w_{t, i} − η g_{t, i} / sqrt( Σ_{s=1}^t g_{s, i}^2 )

Rare features (which are 0 in most datapoints) can be the most predictive.
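A sketch of this per-coordinate update in Python, assuming numpy; the small eps added to the denominator is common practice to avoid division by zero and is not shown on the slide:

import numpy as np

def adagrad(grad, w0, eta=0.1, num_iters=1000, eps=1e-8):
    w = w0.copy()
    G = np.zeros_like(w)                       # running sum of squared gradients, per coordinate
    for t in range(num_iters):
        g = grad(w)
        G += g ** 2
        w = w - eta * g / (np.sqrt(G) + eps)   # rarely-updated coordinates keep a larger effective step
    return w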
