IAML: Optimization
Charles Sutton and Victor Lavrenko School of Informatics Semester 1
Outline
◮ Why we use optimization in machine learning
◮ The general optimization problem
◮ Gradient descent
◮ Problems with gradient descent
Many illustrations, text, and general ideas from these slides are taken from Sam Roweis (1972-2010).
[Figure: error surface E(w) plotted over a two-dimensional weight space (wi, wj)]
◮ How do we compute the gradient ∇E efficiently?
◮ Once we have the gradient, how do we minimize the error?
◮ Where will we end up in weight space?
◮ A procedure that computes E(w)
◮ A procedure that computes the partial derivatives ∂E/∂wj (see the sketch below)
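As a concrete illustration of these two procedures, here is a minimal Python sketch. The squared-error function for a linear model, the data arrays X and y, and the function names are assumptions chosen for illustration, not taken from the slides.

    import numpy as np

    # Minimal sketch of the two procedures gradient descent needs,
    # using a squared-error E(w) for a linear model as an illustrative choice.

    def error(w, X, y):
        """E(w): half the sum of squared errors of the predictions X @ w."""
        residuals = X @ w - y
        return 0.5 * np.sum(residuals ** 2)

    def gradient(w, X, y):
        """Vector of partial derivatives ∂E/∂wj, one entry per weight."""
        residuals = X @ w - y
        return X.T @ residuals

Any model/error pair that supplies these two procedures can be plugged into the same optimizer; this is the modularity between modelling and optimization mentioned in the summary.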
◮ Gradient descent repeatedly moves the weights a small step against the gradient: w ← w − η∇E(w).
◮ We must choose the step size η > 0.
◮ η too small → convergence is too slow.
◮ η too large → instability: the updates overshoot and can diverge.
A sketch of the resulting loop is given below.
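A minimal sketch of the gradient descent loop, reusing the illustrative error and gradient procedures from the earlier sketch; the stopping rule and default η are assumptions for illustration.

    import numpy as np

    def gradient_descent(w0, X, y, eta=0.01, n_steps=1000, tol=1e-6):
        """Repeatedly step against the gradient: w <- w - eta * grad E(w)."""
        w = w0.copy()
        for _ in range(n_steps):
            g = gradient(w, X, y)
            if np.linalg.norm(g) < tol:   # gradient (almost) zero: stop
                break
            w = w - eta * g               # step of size eta downhill
        return w

With η too small the loop needs very many steps; with η too large each step overshoots the minimum and the error can grow instead of shrink.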
[Figure: a one-dimensional error curve E(w) plotted against w]
[Figure: the same one-dimensional error curve E(w) against w]
◮ On an error surface shaped like a long, narrow valley, gradient descent moves quickly down the valley walls but very slowly along the valley bottom (illustrated by the sketch below).
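An illustrative sketch of this effect, not taken from the slides: gradient descent on an elongated quadratic valley E(w) = ½(w1² + 100·w2²). The steep direction is corrected almost immediately, while progress along the valley floor is slow.

    import numpy as np

    def E(w):
        # Elongated quadratic valley: very steep in w2, very shallow in w1.
        return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)

    def grad_E(w):
        return np.array([w[0], 100.0 * w[1]])

    w = np.array([5.0, 1.0])
    eta = 0.015          # must stay below 0.02 here or the steep direction diverges
    for step in range(50):
        w = w - eta * grad_E(w)
    print(w)  # w2 is essentially 0, but w1 has only moved from 5 to about 2.3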
◮ The negative gradient does not necessarily point directly at the nearest local minimum; it only gives the locally steepest downhill direction.
[Figure: error plotted over the parameter space]
◮ Some of these are second-order methods, like Newton's method.
◮ There are also fancy first-order methods, like quasi-Newton methods.
◮ These are the state-of-the-art methods for logistic regression.
◮ We will not discuss these methods in the course (a brief sketch is given below for the curious).
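For the curious, a hedged sketch of fitting logistic regression with a quasi-Newton method (L-BFGS) via SciPy; the toy data and the true weight vector are invented purely for illustration.

    import numpy as np
    from scipy.optimize import minimize

    def nll_and_grad(w, X, y):
        """Negative log-likelihood of logistic regression and its gradient."""
        z = X @ w
        p = 1.0 / (1.0 + np.exp(-z))                 # predicted probabilities
        nll = np.sum(np.logaddexp(0.0, z) - y * z)   # sum_i log(1 + e^{z_i}) - y_i z_i
        grad = X.T @ (p - y)
        return nll, grad

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                    # toy inputs (illustrative)
    y = (X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100) > 0).astype(float)

    result = minimize(nll_and_grad, x0=np.zeros(3), args=(X, y),
                      jac=True, method="L-BFGS-B")
    print(result.x)   # fitted weights

Note the same modularity as before: the optimizer only needs a procedure returning the error and its gradient.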
◮ Example: observe the points {0.5, 1.0} drawn from a Gaussian.
◮ Constraint: σ must be positive.
◮ In this case, to find the maximum likelihood solution, we maximize the likelihood over both parameters: max over µ, σ² of p(0.5, 1.0 | µ, σ²).
◮ There are ways to solve this (in this case it can be done analytically; see the worked example below).
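A worked version of this example, assuming the standard Gaussian density for the likelihood:

    \max_{\mu,\,\sigma^2} \; \mathcal{N}(0.5 \mid \mu, \sigma^2)\,\mathcal{N}(1.0 \mid \mu, \sigma^2)
    \quad\Rightarrow\quad
    \hat{\mu} = \frac{0.5 + 1.0}{2} = 0.75, \qquad
    \hat{\sigma}^2 = \frac{(0.5 - 0.75)^2 + (1.0 - 0.75)^2}{2} = 0.0625

So σ̂ = 0.25, which happens to satisfy the constraint σ > 0 automatically here; in general, such constraints are what make the problem harder than the unconstrained ones above.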
◮ How and why we convert learning problems into optimization problems
◮ Modularity between modelling and optimization
◮ Gradient descent
◮ Why gradient descent can run into problems
◮ Especially local minima