Machine Learning - MT 2016
6. Optimisation
Varun Kanade
University of Oxford
October 26, 2016

Outline
Most machine learning methods can (ultimately) be cast as optimization problems.
◮ Linear Programming
◮ Basics: Gradients, Hessians
◮ Gradient Descent
◮ Stochastic Gradient Descent
◮ Constrained Optimization
A linear program has the form: minimise $c^\top x$ subject to $a_i^\top x \le b_i$ (inequality constraints) and $\bar{a}_i^\top x = \bar{b}_i$ (equality constraints).
◮ No analytic solution
◮ ‘‘Efficient’’ algorithms exist
Suppose we have data $\langle (x_i, y_i) \rangle_{i=1}^N$ and that we want to minimise the absolute loss:
$L_{\mathrm{abs}}(w) = \sum_{i=1}^N |x_i^\top w - y_i|$
This can be framed as a linear program: introduce slack variables $\zeta_i \ge |x_i^\top w - y_i|$ and minimise $\sum_{i=1}^N \zeta_i$ subject to $\zeta_i \ge x_i^\top w - y_i$ and $\zeta_i \ge -(x_i^\top w - y_i)$.
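As a concrete illustration (not part of the original slides), this reformulation can be handed to an off-the-shelf LP solver. The sketch below uses scipy.optimize.linprog with the decision vector stacked as z = [w, ζ]:

```python
# A minimal sketch (not from the slides): absolute-loss regression as a linear
# program, with decision vector z = [w (D entries), zeta (N entries)].
import numpy as np
from scipy.optimize import linprog

def absolute_loss_fit(X, y):
    """Minimise sum_i |x_i^T w - y_i| via slack variables zeta_i >= |x_i^T w - y_i|."""
    N, D = X.shape
    c = np.concatenate([np.zeros(D), np.ones(N)])   # objective: sum of zetas
    A_ub = np.block([[X, -np.eye(N)],               #  x_i^T w - zeta_i <=  y_i
                     [-X, -np.eye(N)]])             # -x_i^T w - zeta_i <= -y_i
    b_ub = np.concatenate([y, -y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None), method="highs")
    return res.x[:D]

# Toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(absolute_loss_fit(X, y))
```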
The Lasso objective: $L_{\mathrm{lasso}}(w) = \sum_{i=1}^N (x_i^\top w - y_i)^2 + \lambda \sum_{j=1}^D |w_j|$
◮ Quadratic part of the loss function can’t be framed as linear
◮ Lasso regularization does not allow for closed form solutions
◮ Must resort to general optimisation methods
Example: $f(w_1, w_2) = \frac{w_1^2}{a^2} + \frac{w_2^2}{b^2}$
Partial derivatives: $\frac{\partial f}{\partial w_1} = \frac{2 w_1}{a^2}$, $\frac{\partial f}{\partial w_2} = \frac{2 w_2}{b^2}$
Gradient: $\nabla f(w) = \begin{bmatrix} \frac{\partial f}{\partial w_1} \\ \frac{\partial f}{\partial w_2} \end{bmatrix} = \begin{bmatrix} \frac{2 w_1}{a^2} \\ \frac{2 w_2}{b^2} \end{bmatrix}$
◮ Gradient vectors are orthogonal to contour curves
◮ Gradient points in the direction of steepest increase
For the same $f(w_1, w_2) = \frac{w_1^2}{a^2} + \frac{w_2^2}{b^2}$:
Gradient: $\nabla f(w) = \begin{bmatrix} \frac{\partial f}{\partial w_1} \\ \frac{\partial f}{\partial w_2} \end{bmatrix} = \begin{bmatrix} \frac{2 w_1}{a^2} \\ \frac{2 w_2}{b^2} \end{bmatrix}$
Hessian: $H = \begin{bmatrix} \frac{\partial^2 f}{\partial w_1^2} & \frac{\partial^2 f}{\partial w_1 \partial w_2} \\ \frac{\partial^2 f}{\partial w_2 \partial w_1} & \frac{\partial^2 f}{\partial w_2^2} \end{bmatrix} = \begin{bmatrix} \frac{2}{a^2} & 0 \\ 0 & \frac{2}{b^2} \end{bmatrix}$
◮ As long as all second derivatives exist, the Hessian H is symmetric
◮ Hessian captures the curvature of the surface
If $w_1, w_2$ are themselves functions of $\theta_1$, the chain rule gives:
$\frac{\partial f}{\partial \theta_1} = \frac{\partial f}{\partial w_1} \cdot \frac{\partial w_1}{\partial \theta_1} + \frac{\partial f}{\partial w_2} \cdot \frac{\partial w_2}{\partial \theta_1}$
In $D$ dimensions:
Gradient: $\nabla f(w) = \begin{bmatrix} \frac{\partial f}{\partial w_1} & \frac{\partial f}{\partial w_2} & \cdots & \frac{\partial f}{\partial w_D} \end{bmatrix}^\top$
Hessian: $H = \begin{bmatrix} \frac{\partial^2 f}{\partial w_1^2} & \frac{\partial^2 f}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_1 \partial w_D} \\ \frac{\partial^2 f}{\partial w_2 \partial w_1} & \frac{\partial^2 f}{\partial w_2^2} & \cdots & \frac{\partial^2 f}{\partial w_2 \partial w_D} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial w_D \partial w_1} & \frac{\partial^2 f}{\partial w_D \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_D^2} \end{bmatrix}$
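Where deriving these by hand is awkward, both objects can be approximated numerically. The sketch below (an added illustration, not from the slides) uses central finite differences and checks the answer on the ellipse example above:

```python
# A minimal sketch: numerical gradient and Hessian via central finite differences,
# checked on the ellipse example f(w) = w1^2/a^2 + w2^2/b^2.
import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    g = np.zeros_like(w, dtype=float)
    for i in range(len(w)):
        e = np.zeros_like(w, dtype=float)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

def numerical_hessian(f, w, eps=1e-4):
    D = len(w)
    H = np.zeros((D, D))
    for i in range(D):
        e = np.zeros(D)
        e[i] = eps
        H[i, :] = (numerical_gradient(f, w + e) - numerical_gradient(f, w - e)) / (2 * eps)
    return H

a, b = 2.0, 3.0
f = lambda w: w[0] ** 2 / a ** 2 + w[1] ** 2 / b ** 2
print(numerical_gradient(f, np.array([1.0, 1.0])))   # ~ [2/a^2, 2/b^2]
print(numerical_hessian(f, np.array([1.0, 1.0])))    # ~ diag(2/a^2, 2/b^2)
```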
Least squares: $L(w) = \sum_{i=1}^N (x_i^\top w - y_i)^2$, with closed-form (matrix) solution $w = (X^\top X)^{-1} X^\top y$
◮ If N and D are both very large, the matrix formula becomes expensive
◮ Computational complexity of the matrix formula is roughly $O(N D^2 + D^3)$ (forming $X^\top X$, then solving the $D \times D$ system)
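For reference, a minimal sketch of the matrix formula in NumPy (an added illustration; the function name is mine):

```python
# A minimal sketch of the "matrix formula" w = (X^T X)^{-1} X^T y; forming X^T X
# costs O(N D^2) and solving the D x D system costs O(D^3).
import numpy as np

def least_squares_closed_form(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)   # solve rather than explicitly invert
```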
◮ Choosing a good step-size is important (see the sketch below)
◮ If the step size is too large, the algorithm may never converge
◮ If the step size is too small, convergence may be very slow
◮ May want a time-varying step size
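A minimal sketch of batch gradient descent for least squares (an added illustration; the mean-squared-error scaling and the fixed step size eta are assumptions, not taken from the slides):

```python
# Batch gradient descent for least squares with a fixed step size eta.
import numpy as np

def gradient_descent_ls(X, y, eta=0.1, num_iters=1000):
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        grad = (2.0 / N) * X.T @ (X @ w - y)   # gradient of (1/N) sum_i (x_i^T w - y_i)^2
        w -= eta * grad                        # too large: may diverge; too small: slow
    return w
```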
◮ Gradient descent uses only the first derivatives (the gradient)
◮ Local linear approximation
◮ Newton’s method uses second derivatives (the Hessian) as well
◮ Degree 2 Taylor approximation
Writing $g_t = \nabla L(w_t)$ and $H_t$ for the Hessian at $w_t$, the degree-2 Taylor approximation around $w_t$ is:
$L(w) \approx L(w_t) + g_t^\top (w - w_t) + \frac{1}{2} (w - w_t)^\top H_t (w - w_t)$
Minimising this approximation over $w$ gives the Newton update: $w_{t+1} = w_t - H_t^{-1} g_t$
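A minimal sketch of Newton's method (an added illustration; grad and hess are user-supplied callables, and the example reuses the ellipse f from earlier):

```python
# Newton's method: w_{t+1} = w_t - H_t^{-1} g_t.
import numpy as np

def newton_method(grad, hess, w0, num_iters=20):
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - np.linalg.solve(hess(w), grad(w))   # solve H d = g rather than invert H
    return w

a, b = 2.0, 3.0
grad = lambda w: np.array([2 * w[0] / a ** 2, 2 * w[1] / b ** 2])
hess = lambda w: np.diag([2 / a ** 2, 2 / b ** 2])
print(newton_method(grad, hess, [1.0, -1.0]))   # converges to the minimiser (0, 0)
```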
The Lasso objective again: $L_{\mathrm{lasso}}(w) = \sum_{i=1}^N (x_i^\top w - y_i)^2 + \lambda \sum_{j=1}^D |w_j|$
◮ Quadratic part of the loss function can’t be framed as linear
◮ Lasso regularization does not allow for closed form solutions
◮ Must resort to general optimisation methods
◮ We still have the problem that the objective function is not differentiable (because of the $|w_j|$ terms); see the subgradient sketch below
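One simple general-purpose option, sketched here as an added illustration rather than as the lecture's prescribed method, is subgradient descent, using sign(w_j) as a subgradient of |w_j|:

```python
# Subgradient descent for the Lasso objective (rescaled by 1/N for stable step sizes).
import numpy as np

def lasso_subgradient_descent(X, y, lam=0.1, eta=0.01, num_iters=5000):
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        # subgradient of (1/N) sum_i (x_i^T w - y_i)^2 + lam * sum_j |w_j|
        g = (2.0 / N) * X.T @ (X @ w - y) + lam * np.sign(w)
        w -= eta * g
    return w
```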
Suppose we have data $\langle (x_i, y_i) \rangle_{i=1}^N$. We are minimizing the objective function:
$L(w) = \sum_{i=1}^N (x_i^\top w - y_i)^2 + \underbrace{\lambda R(w)}_{\text{Regularisation Term}}$
This decomposes as a sum over datapoints, $L(w) = \sum_{i=1}^N L_i(w)$ with $L_i(w) = (x_i^\top w - y_i)^2 + \frac{\lambda}{N} R(w)$.
Computing the full gradient $\nabla L(w) = \sum_{i=1}^N \nabla L_i(w)$ requires a pass over all $N$ datapoints. Stochastic gradient descent instead picks $i \in \{1, \dots, N\}$ uniformly at random and updates $w \leftarrow w - \eta \, \nabla L_i(w)$.
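A minimal sketch of SGD for the squared-error terms above (an added illustration; the learning rate eta and the per-epoch random permutation are assumptions):

```python
# Stochastic gradient descent: one datapoint per update.
import numpy as np

def sgd(X, y, eta=0.01, num_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_epochs):
        for i in rng.permutation(N):                # visit datapoints in random order
            grad_i = 2 * (X[i] @ w - y[i]) * X[i]   # gradient of a single term L_i
            w -= eta * grad_i
    return w
```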
◮ Using stochastic gradient descent it is possible to learn ‘‘online’’, i.e., we can update the parameters as each new datapoint arrives
◮ Cost of computing the gradient in ‘Stochastic Gradient Descent (SGD)’ is independent of N (a single datapoint per update)
◮ Learning rates should be chosen by (cross-)validation
Minibatch SGD: instead of a single datapoint, pick a random minibatch $B \subseteq \{1, \dots, N\}$ of size $b$ and update $w \leftarrow w - \eta \cdot \frac{1}{b} \sum_{i \in B} \nabla L_i(w)$.
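A minimal sketch of minibatch SGD (an added illustration; the batch size b and learning rate are placeholders):

```python
# Minibatch SGD with batch size b over squared-error terms.
import numpy as np

def minibatch_sgd(X, y, b=16, eta=0.05, num_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_epochs):
        idx = rng.permutation(N)
        for start in range(0, N, b):
            batch = idx[start:start + b]
            grad = (2.0 / len(batch)) * X[batch].T @ (X[batch] @ w - y[batch])
            w -= eta * grad
    return w
```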
◮ Nesterov’s Accelerated Gradient
◮ Line-Search to Find Step-Size
◮ Momentum-based Methods (see the sketch below)
◮ AdaGrad, AdaDelta, Adam, RMSProp
◮ Conjugate Gradient Method
◮ BFGS and L-BFGS
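As an added illustration of one item in the list above, a minimal sketch of gradient descent with (heavy-ball) momentum; the hyperparameters eta and beta are placeholders:

```python
# Gradient descent with momentum: a velocity term accumulates past gradients.
import numpy as np

def momentum_gd(grad, w0, eta=0.05, beta=0.9, num_iters=200):
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(num_iters):
        v = beta * v - eta * grad(w)   # decay old velocity, add current gradient step
        w = w + v
    return w
```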
AdaGrad scales each coordinate by its accumulated squared gradients:
$w_{t+1,i} = w_{t,i} - \frac{\eta}{\sqrt{\sum_{s=1}^{t} g_{s,i}^2}} \, g_{t,i}$
where $g_{s,i}$ is the $i$-th coordinate of the gradient at step $s$.
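A minimal sketch of this update (an added illustration; eta and the small constant eps, added for numerical stability, are assumptions):

```python
# AdaGrad: each coordinate's step size is divided by the root of its
# accumulated squared gradients.
import numpy as np

def adagrad(grad, w0, eta=0.5, num_iters=500, eps=1e-8):
    w = np.asarray(w0, dtype=float)
    acc = np.zeros_like(w)                    # running sum of squared gradients
    for _ in range(num_iters):
        g = grad(w)
        acc += g ** 2
        w -= eta * g / (np.sqrt(acc) + eps)   # per-coordinate step size
    return w
```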
Constrained optimisation: minimise $L(w)$ subject to $\sum_{i=1}^D |w_i| \le R$ (an $\ell_1$-ball constraint).
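One way to handle such a constraint, sketched below as an added illustration (the lecture may instead use Lagrange multipliers or another approach), is projected gradient descent with an explicit projection onto the ℓ1 ball (Duchi et al., 2008):

```python
# Projected gradient descent: gradient step, then project back onto the l1 ball.
import numpy as np

def project_l1_ball(v, R):
    """Euclidean projection of v onto {w : sum_i |w_i| <= R}."""
    if np.abs(v).sum() <= R:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - R))[0][-1]
    theta = (css[rho] - R) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def projected_gd(X, y, R=1.0, eta=0.05, num_iters=500):
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        grad = (2.0 / N) * X.T @ (X @ w - y)
        w = project_l1_ball(w - eta * grad, R)
    return w
```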
◮ Convex Optimization is ‘efficient’ (i.e., polynomial time)
◮ Try to cast learning problem as a convex optimization problem
◮ Many, many extensions exist: AdaGrad, momentum-based methods, BFGS, ...
◮ Books: Boyd and Vandenberghe; Nesterov’s book
◮ Non-convex optimisation is encountered frequently in deep learning
◮ In the non-convex case, stochastic gradient descent gives (only) local minima
◮ Book: Nonlinear Programming, Dimitri Bertsekas