Machine Learning - MT 2016: 6. Optimisation (Varun Kanade, University of Oxford)


SLIDE 1

Machine Learning - MT 2016

6. Optimisation

Varun Kanade, University of Oxford, October 26, 2016

SLIDE 2

Outline

Most machine learning methods can (ultimately) be cast as optimization problems.

◮ Linear Programming
◮ Basics: Gradients, Hessians
◮ Gradient Descent
◮ Stochastic Gradient Descent
◮ Constrained Optimization

Most machine learning packages, such as scikit-learn, tensorflow, octave, torch, etc., have optimization methods implemented. But you will have to understand the basics of optimization to use them effectively.

SLIDE 3

Linear Programming

Looking for solutions x ∈ R^n to the following optimization problem:

    minimize    c^T x
    subject to: a_i^T x ≤ b_i,    i = 1, ..., m
                ā_i^T x = b̄_i,    i = 1, ..., l

◮ No analytic solution
◮ "Efficient" algorithms exist

SLIDE 4

Linear Model with Absolute Loss

Suppose we have data (x_i, y_i), i = 1, ..., N, and that we want to minimise the objective:

    L(w) = Σ_{i=1}^N |x_i^T w − y_i|

Let us introduce a variable ζ_i for each datapoint. Consider the linear program in the D + N variables w_1, ..., w_D, ζ_1, ..., ζ_N:

    minimize    Σ_{i=1}^N ζ_i
    subject to: w^T x_i − y_i ≤ ζ_i,    i = 1, ..., N
                y_i − w^T x_i ≤ ζ_i,    i = 1, ..., N
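This LP can be handed to any generic solver. Below is a minimal sketch using scipy.optimize.linprog, stacking the variables as [w_1, ..., w_D, ζ_1, ..., ζ_N]; the function name and the synthetic data are illustrative, not from the lecture.

    import numpy as np
    from scipy.optimize import linprog

    def l1_regression_lp(X, y):
        """Fit a linear model under absolute loss by solving the LP above."""
        N, D = X.shape
        c = np.concatenate([np.zeros(D), np.ones(N)])    # minimise sum_i zeta_i
        # Constraints: w^T x_i - y_i <= zeta_i  and  y_i - w^T x_i <= zeta_i
        A_ub = np.block([[X, -np.eye(N)],
                         [-X, -np.eye(N)]])
        b_ub = np.concatenate([y, -y])
        bounds = [(None, None)] * D + [(0, None)] * N    # w free, zeta >= 0
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        return res.x[:D]

    # Hypothetical usage on synthetic data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.laplace(scale=0.1, size=50)
    print(l1_regression_lp(X, y))                        # should be close to [1, -2, 0.5]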

SLIDE 5

Minimising the Lasso Objective

For the Lasso objective, i.e., linear model with ℓ1-regularisation, we have:

    L_lasso(w) = Σ_{i=1}^N (w^T x_i − y_i)^2 + λ Σ_{i=1}^D |w_i|

◮ The quadratic part of the loss function can't be framed as a linear program
◮ Lasso regularisation does not allow for a closed-form solution
◮ Must resort to general optimisation methods

SLIDE 6

Calculus Background: Gradients

    z = f(w_1, w_2) = w_1²/a² + w_2²/b²

    ∂f/∂w_1 = 2w_1/a²        ∂f/∂w_2 = 2w_2/b²

    ∇_w f = [∂f/∂w_1, ∂f/∂w_2]^T = [2w_1/a², 2w_2/b²]^T

◮ Gradient vectors are orthogonal to contour curves
◮ The gradient points in the direction of steepest increase
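As a quick sanity check of the formulas above, here is a small finite-difference comparison in NumPy (the values of a, b and w are arbitrary):

    import numpy as np

    a, b = 1.5, 2.5
    f = lambda w: w[0]**2 / a**2 + w[1]**2 / b**2
    grad_f = lambda w: np.array([2 * w[0] / a**2, 2 * w[1] / b**2])

    w = np.array([0.3, -1.2])
    eps = 1e-6
    numeric = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps) for e in np.eye(2)])
    print(grad_f(w))   # analytic gradient
    print(numeric)     # finite-difference approximation; the two should agree closely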

SLIDE 7

Calculus Background: Hessians

    z = f(w_1, w_2) = w_1²/a² + w_2²/b²

    ∇_w f = [∂f/∂w_1, ∂f/∂w_2]^T = [2w_1/a², 2w_2/b²]^T

    H = [ ∂²f/∂w_1²       ∂²f/∂w_1∂w_2 ]   [ 2/a²    0   ]
        [ ∂²f/∂w_2∂w_1    ∂²f/∂w_2²    ] = [  0    2/b²  ]

◮ As long as all second derivatives exist and are continuous, the Hessian H is symmetric
◮ The Hessian captures the curvature of the surface

SLIDE 8

Calculus Background: Chain Rule

    z = f(w_1(θ_1, θ_2), w_2(θ_1, θ_2))

[Diagram: computation graph with inputs θ_1, θ_2, intermediate nodes w_1, w_2, and output z = f]

    ∂f/∂θ_1 = (∂f/∂w_1)·(∂w_1/∂θ_1) + (∂f/∂w_2)·(∂w_2/∂θ_1)

We will use this a lot when we study neural networks and back propagation
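A tiny numerical illustration of the chain rule (the particular functions w_1 = θ_1 θ_2, w_2 = θ_1 + θ_2 and f(w_1, w_2) = w_1² + 3w_2 are my own example, not from the slides):

    import numpy as np

    f = lambda w1, w2: w1**2 + 3.0 * w2
    z = lambda t1, t2: f(t1 * t2, t1 + t2)       # w1 = t1*t2, w2 = t1 + t2

    t1, t2 = 1.5, -0.7
    w1, w2 = t1 * t2, t1 + t2

    # Chain rule: df/dt1 = (df/dw1)(dw1/dt1) + (df/dw2)(dw2/dt1)
    analytic = (2 * w1) * t2 + 3.0 * 1.0

    eps = 1e-6
    numeric = (z(t1 + eps, t2) - z(t1 - eps, t2)) / (2 * eps)
    print(analytic, numeric)                     # the two values should match closely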

SLIDE 9

General Form for Gradient and Hessian

Suppose w ∈ R^D and f : R^D → R.

The gradient vector contains all first-order partial derivatives:

    ∇_w f(w) = [∂f/∂w_1, ∂f/∂w_2, ..., ∂f/∂w_D]^T

The Hessian matrix of f contains all second-order partial derivatives:

    H = [ ∂²f/∂w_1²       ∂²f/∂w_1∂w_2    ...   ∂²f/∂w_1∂w_D ]
        [ ∂²f/∂w_2∂w_1    ∂²f/∂w_2²       ...   ∂²f/∂w_2∂w_D ]
        [     ...              ...        ...        ...     ]
        [ ∂²f/∂w_D∂w_1    ∂²f/∂w_D∂w_2    ...   ∂²f/∂w_D²    ]

SLIDE 10

Gradient Descent Algorithm

Gradient descent is one of the simplest, yet very general, algorithms for optimisation.

It is an iterative algorithm, producing a new w_{t+1} at each iteration as

    w_{t+1} = w_t − η_t g_t = w_t − η_t ∇f(w_t)

We will denote the gradients by g_t; η_t > 0 is the learning rate or step size.
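A minimal sketch of this update rule in NumPy (the test function reuses f(w_1, w_2) = w_1²/a² + w_2²/b² from the earlier slides; the step size and iteration count are placeholders):

    import numpy as np

    def gradient_descent(grad_f, w0, eta=0.1, n_iters=100):
        """Plain gradient descent: w_{t+1} = w_t - eta * grad_f(w_t)."""
        w = np.asarray(w0, dtype=float)
        for t in range(n_iters):
            w = w - eta * grad_f(w)
        return w

    a, b = 1.0, 2.0
    grad_f = lambda w: np.array([2 * w[0] / a**2, 2 * w[1] / b**2])
    print(gradient_descent(grad_f, w0=[3.0, -4.0]))      # approaches the minimiser [0, 0]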

SLIDE 11

Gradient Descent for Least Squares Regression

    L(w) = (Xw − y)^T (Xw − y) = Σ_{i=1}^N (x_i^T w − y_i)^2

We can compute the gradient of L with respect to w:

    ∇_w L = 2 (X^T X w − X^T y)

◮ Why would you want to use gradient descent instead of directly plugging in the formula?
◮ If N and D are both very large:
◮ Computational complexity of the matrix formula is O(min{N²D, ND²})
◮ Each gradient calculation is O(ND)
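A sketch of gradient descent specialised to least squares, computing the gradient as 2·X^T(Xw − y) so that each iteration costs O(ND) without forming X^T X (the data and step size below are illustrative):

    import numpy as np

    def least_squares_gd(X, y, eta=1e-3, n_iters=500):
        """Gradient descent for L(w) = ||Xw - y||^2 with gradient 2(X^T X w - X^T y)."""
        w = np.zeros(X.shape[1])
        for t in range(n_iters):
            grad = 2 * X.T @ (X @ w - y)    # O(ND); no D x D matrix is formed
            w = w - eta * grad
        return w

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = X @ np.array([1.0, 2.0, 0.0, -1.0, 3.0]) + 0.01 * rng.normal(size=200)
    print(least_squares_gd(X, y))           # close to [1, 2, 0, -1, 3]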

SLIDE 12

Choosing a Step Size

◮ Choosing a good step size is important
◮ If the step size is too large, the algorithm may never converge
◮ If the step size is too small, convergence may be very slow
◮ May want a time-varying step size

SLIDE 13

Newton’s Method (Second Order Method)

◮ Gradient descent uses only the first derivative
◮ Local linear approximation
◮ Newton's method uses second derivatives
◮ Degree-2 Taylor approximation around the current point

SLIDE 14

Newton’s Method in High Dimensions

The update depends on the gradient g_t and the Hessian H_t at the point w_t:

    w_{t+1} = w_t − H_t^{-1} g_t

Approximate f around w_t using the second-order Taylor approximation:

    f_quad(w) = f(w_t) + g_t^T (w − w_t) + (1/2)(w − w_t)^T H_t (w − w_t)

We move directly to the (unique) stationary point of f_quad. The gradient of f_quad is given by:

    ∇_w f_quad = g_t + H_t (w − w_t)

Setting ∇_w f_quad = 0 to get w_{t+1}, we have

    w_{t+1} = w_t − H_t^{-1} g_t
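A minimal sketch of the Newton update, solving the linear system rather than explicitly inverting H_t; on the quadratic f(w_1, w_2) = w_1²/a² + w_2²/b² from earlier, a single step reaches the minimiser:

    import numpy as np

    def newton_step(w, grad, hess):
        """One Newton update: w_{t+1} = w_t - H_t^{-1} g_t (via a linear solve)."""
        return w - np.linalg.solve(hess(w), grad(w))

    a, b = 1.0, 2.0
    grad = lambda w: np.array([2 * w[0] / a**2, 2 * w[1] / b**2])
    hess = lambda w: np.array([[2 / a**2, 0.0], [0.0, 2 / b**2]])
    print(newton_step(np.array([3.0, -4.0]), grad, hess))    # -> [0. 0.]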

SLIDE 15

Newton’s Method gives Stationary Points

[Figure: three quadratic surfaces: H with positive eigenvalues (a minimum), H with negative eigenvalues (a maximum), H with mixed eigenvalues (a saddle point)]

The Hessian will tell you which kind of stationary point has been found.

Newton's method can be computationally expensive in high dimensions: we need to compute and invert a Hessian at each iteration.

SLIDE 16

Minimising the Lasso Objective

For the Lasso objective, i.e., linear model with ℓ1-regularisation, we have:

    L_lasso(w) = Σ_{i=1}^N (w^T x_i − y_i)^2 + λ Σ_{i=1}^D |w_i|

◮ The quadratic part of the loss function can't be framed as a linear program
◮ Lasso regularisation does not allow for a closed-form solution
◮ Must resort to general optimisation methods
◮ We still have the problem that the objective function is not differentiable!

SLIDE 17

Sub-gradient Descent

Focus on the case when f is convex:

    f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y)   for all x, y and α ∈ [0, 1]

In one dimension, g is a sub-derivative at x_0 if

    f(x) ≥ f(x_0) + g(x − x_0)   for all x

In higher dimensions, g is a sub-gradient at x_0 if

    f(x) ≥ f(x_0) + g^T(x − x_0)   for all x

Any g satisfying the above inequality will be called a sub-gradient at x_0.

SLIDE 18

Sub-gradient Descent

f(w) = |w_1| + |w_2| + |w_3| + |w_4| for w ∈ R^4. What is a sub-gradient at the point w = [2, −3, 0, 1]^T?

    g = [1, −1, γ, 1]^T   for any γ ∈ [−1, 1]

[Figure: plot of f(x) = max(x, 0)]

The set of sub-derivatives of f(x) = max(x, 0) at x = 0 is the interval [0, 1].
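A small sketch of a sub-gradient computation for the ℓ1 norm (choosing γ = 0 at coordinates where w_i = 0 is just one valid option):

    import numpy as np

    def l1_subgradient(w, gamma=0.0):
        """A sub-gradient of f(w) = sum_i |w_i|: sign(w_i) where w_i != 0,
        and any value in [-1, 1] (here gamma) where w_i == 0."""
        g = np.sign(w).astype(float)
        g[w == 0] = gamma
        return g

    print(l1_subgradient(np.array([2.0, -3.0, 0.0, 1.0])))   # -> [ 1. -1.  0.  1.]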

SLIDE 19

Optimization Algorithms for Machine Learning

We have data D = (x_i, y_i), i = 1, ..., N. We are minimizing the objective function:

    L(w; D) = (1/N) Σ_{i=1}^N ℓ(w; x_i, y_i) + λR(w)

where λR(w) is the regularisation term.

The gradient of the objective function is:

    ∇_w L = (1/N) Σ_{i=1}^N ∇_w ℓ(w; x_i, y_i) + λ∇_w R(w)

For Ridge Regression we have:

    L_ridge(w) = (1/N) Σ_{i=1}^N (w^T x_i − y_i)^2 + λ w^T w

    ∇_w L_ridge = (1/N) Σ_{i=1}^N 2(w^T x_i − y_i) x_i + 2λw
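A direct transcription of the ridge objective and its gradient into NumPy, vectorised over the datapoints (the finite-difference check at the end is just an illustration on random data):

    import numpy as np

    def ridge_objective_and_grad(w, X, y, lam):
        """L_ridge(w) = (1/N) sum_i (w^T x_i - y_i)^2 + lam * w^T w, and its gradient."""
        N = X.shape[0]
        r = X @ w - y                              # residuals w^T x_i - y_i
        loss = r @ r / N + lam * w @ w
        grad = 2 * X.T @ r / N + 2 * lam * w
        return loss, grad

    rng = np.random.default_rng(0)
    X, y, w = rng.normal(size=(30, 4)), rng.normal(size=30), rng.normal(size=4)
    loss, grad = ridge_objective_and_grad(w, X, y, lam=0.1)
    eps, e0 = 1e-6, np.eye(4)[0]
    fd = (ridge_objective_and_grad(w + eps * e0, X, y, 0.1)[0]
          - ridge_objective_and_grad(w - eps * e0, X, y, 0.1)[0]) / (2 * eps)
    print(grad[0], fd)                             # should agree closely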

SLIDE 20

Stochastic Gradient Descent

As part of the learning algorithm, we calculate the following gradient:

    ∇_w L = (1/N) Σ_{i=1}^N ∇_w ℓ(w; x_i, y_i) + λ∇_w R(w)

Suppose we pick a random datapoint (x_i, y_i) and evaluate g_i = ∇_w ℓ(w; x_i, y_i). What is E[g_i]?

    E[g_i] = (1/N) Σ_{i=1}^N ∇_w ℓ(w; x_i, y_i)

Instead of computing the entire gradient, we can compute the gradient at just a single datapoint! In expectation, g_i points in the same direction as the entire gradient (except for the regularisation term).
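To make the expectation argument concrete, here is a small empirical check (squared loss on hypothetical random data): averaging the single-datapoint gradients over all i recovers the full data-term gradient, so a uniformly random g_i is unbiased for it.

    import numpy as np

    rng = np.random.default_rng(0)
    X, y, w = rng.normal(size=(100, 3)), rng.normal(size=100), rng.normal(size=3)

    def grad_point(w, x_i, y_i):
        """Gradient of the per-point squared loss (w^T x_i - y_i)^2."""
        return 2 * (w @ x_i - y_i) * x_i

    full_grad = np.mean([grad_point(w, X[i], y[i]) for i in range(100)], axis=0)
    i = rng.integers(100)
    print(grad_point(w, X[i], y[i]))    # one noisy single-point estimate g_i
    print(full_grad)                    # E[g_i] over a uniformly random i equals this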

SLIDE 21

Online Learning: Stochastic Gradient Descent

◮ Using stochastic gradient descent it is possible to learn "online", i.e., we get data a little at a time
◮ The cost of computing the gradient in Stochastic Gradient Descent (SGD) is significantly less than computing the gradient on the full dataset
◮ Learning rates should be chosen by (cross-)validation

SLIDE 22

Batch/Offline Learning

    w_{t+1} = w_t − (η/N) Σ_{i=1}^N ∇_w ℓ(w; x_i, y_i) − λ∇_w R(w)

Online Learning

    w_{t+1} = w_t − η ∇_w ℓ(w; x_i, y_i) − λ∇_w R(w)

Minibatch Online Learning

    w_{t+1} = w_t − (η/b) Σ_{i=1}^b ∇_w ℓ(w; x_i, y_i) − λ∇_w R(w)
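A sketch of the minibatch version in NumPy, following the update rule as written above (the squared-loss and ridge gradients plugged in at the end, and all hyperparameter values, are illustrative assumptions):

    import numpy as np

    def minibatch_sgd(grad_point, grad_reg, w0, X, y,
                      eta=0.01, lam=0.001, batch_size=10, n_epochs=5, seed=0):
        """w <- w - (eta / b) * sum over batch of grad_point(w, x_i, y_i) - lam * grad_reg(w)."""
        rng = np.random.default_rng(seed)
        w = np.asarray(w0, dtype=float)
        N = X.shape[0]
        for epoch in range(n_epochs):
            order = rng.permutation(N)                   # shuffle once per epoch
            for start in range(0, N, batch_size):
                batch = order[start:start + batch_size]
                g = np.mean([grad_point(w, X[i], y[i]) for i in batch], axis=0)
                w = w - eta * g - lam * grad_reg(w)
        return w

    # Hypothetical plug-ins: squared loss and ridge regulariser R(w) = w^T w
    grad_point = lambda w, x, y: 2 * (w @ x - y) * x
    grad_reg = lambda w: 2 * w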

SLIDE 23

Many Optimisation Techniques (Tricks)

First Order Methods/(Sub) Gradient Methods

◮ Nesterov's Accelerated Gradient
◮ Line-Search to Find the Step Size
◮ Momentum-based Methods
◮ AdaGrad, AdaDelta, Adam, RMSProp

Second Order/Newton/Quasi-Newton Methods

◮ Conjugate Gradient Method
◮ BFGS and L-BFGS

SLIDE 24

Adagrad: Example Application for Text Data

Heathrow: Will Boris Johnson lie down in front of the bulldozers? He was happy to lie down the side of a bus. . . . On his part, Johnson has already sought to clarify the comments, telling Sky News that what he in fact said was not that he would lie down in front of the bulldozers, but that he would lie down the side. And he never actually said bulldozers, he said bus.

[Table: small example dataset with label y and sparse binary features x_1, ..., x_4]

Adagrad Update:

    w_{t+1,i} ← w_{t,i} − η / √(Σ_{s=1}^t g_{s,i}²) · g_{t,i}

Rare features (which are 0 in most datapoints) can be the most predictive.
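A minimal AdaGrad sketch (the small eps added to the denominator and the toy gradient function are my own assumptions, not from the slide):

    import numpy as np

    def adagrad(grad_f, w0, eta=0.1, n_iters=100, eps=1e-8):
        """Per-coordinate step size eta / sqrt(sum of squared past gradients)."""
        w = np.asarray(w0, dtype=float)
        sum_sq = np.zeros_like(w)                  # running sum_{s<=t} g_{s,i}^2
        for t in range(n_iters):
            g = grad_f(w)
            sum_sq += g ** 2
            w = w - eta * g / (np.sqrt(sum_sq) + eps)
        return w

    # Coordinates with rare/small gradients keep a comparatively large effective step size
    grad_f = lambda w: np.array([2.0 * w[0], 0.02 * w[1]])
    print(adagrad(grad_f, w0=[1.0, 1.0]))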

SLIDE 25

Constrained Convex Optimization

Often we want to look for a solution in a constrained set (not all of R^D).

For example, minimise (Xw − y)^T(Xw − y) in the sets w^T w < R, or Σ_{i=1}^D |w_i| < R.

The gradient step is followed by a projection step.
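A sketch of projected gradient descent for the w^T w ≤ R constraint with the least-squares objective from earlier (the ℓ1-ball case needs a different projection, which is omitted here):

    import numpy as np

    def project_l2_ball(w, R):
        """Project w onto {w : w^T w <= R}, i.e. the Euclidean ball of radius sqrt(R)."""
        norm_sq = w @ w
        return w if norm_sq <= R else w * np.sqrt(R / norm_sq)

    def projected_gd(X, y, R, eta=1e-3, n_iters=500):
        """Minimise ||Xw - y||^2 subject to w^T w <= R: gradient step, then projection."""
        w = np.zeros(X.shape[1])
        for t in range(n_iters):
            grad = 2 * X.T @ (X @ w - y)
            w = project_l2_ball(w - eta * grad, R)
        return w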

SLIDE 26

Summary

Convex Optimization

◮ Convex optimization is 'efficient' (i.e., polynomial time)
◮ Try to cast the learning problem as a convex optimization problem
◮ Many, many extensions exist: Adagrad, momentum-based methods, BFGS, L-BFGS, Adam, etc.
◮ Books: Boyd and Vandenberghe; Nesterov's book

Non-Convex Optimization

◮ Encountered frequently in deep learning
◮ Stochastic Gradient Descent gives local minima
◮ Book: Nonlinear Programming, Dimitri Bertsekas
