SLIDE 1
CSE 446: Machine Learning Lecture
Stationary points, non-convex optimization, and more...
Instructor: Sham Kakade
1 Terminology
- stationary point of f(w): a point which has zero gradient.
- local minimum of f(w): a point which is locally a minimum (i.e. no infinitesimal change to the point will decrease the function value).
- global minimum of f(w): a point w∗ which achieves the minimal possible value of f(w) over all w.
- saddle point of f(w): a stationary point at which the function value goes up under some infinitesimal perturbation and goes down under some other infinitesimal perturbation (see the sketch after the list below).

Issues related to training are:
- non-convexity
- initialization
- weight symmetries and “symmetry breaking”
- saddle points & local optima & global optima
- vanishing gradients
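To make the terminology above concrete, here is a minimal sketch (not from the lecture) using the function f(w) = w1^2 − w2^2, whose only stationary point, w = (0, 0), is a saddle point: the gradient is zero there, but the Hessian has eigenvalues of mixed sign.

```python
import numpy as np

def f(w):
    # f(w) = w1^2 - w2^2, a simple non-convex function with a saddle at the origin
    return w[0] ** 2 - w[1] ** 2

def grad_f(w):
    return np.array([2.0 * w[0], -2.0 * w[1]])

def hess_f(w):
    # Hessian is constant for this quadratic
    return np.array([[2.0, 0.0], [0.0, -2.0]])

w = np.zeros(2)
print("gradient at w:", grad_f(w))                    # zero gradient => stationary point
print("Hessian eigenvalues:", np.linalg.eigvalsh(hess_f(w)))  # mixed signs => saddle point
# If instead all eigenvalues were positive (e.g. f(w) = w1^2 + w2^2),
# the stationary point would be a local minimum, here also the global minimum.
```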
2 Gradient descent in the non-convex setting
Suppose we do gradient descent on a function F:

w(k+1) = w(k) − η(k) · ∇F(w(k)).

We could also do SGD:

w(k+1) = w(k) − η(k) · ∇̂F(w(k)),

where ∇̂F(w(k)) is some (unbiased) estimate of the gradient ∇F(w(k)). The basic question is: where do these updates lead us?
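Below is a minimal sketch of both updates on a simple non-convex function; the choice F(w) = w^4 − 3w^2 + w and the Gaussian noise model for the gradient estimate are illustrative assumptions, not the lecture's setup. Adding zero-mean noise to the true gradient keeps the estimate unbiased, which is all the SGD update above requires.

```python
import numpy as np

def grad_F(w):
    # gradient of F(w) = w^4 - 3*w^2 + w, a one-dimensional non-convex function
    return 4.0 * w ** 3 - 6.0 * w + 1.0

rng = np.random.default_rng(0)
eta = 0.01              # step size eta^(k), held constant here for simplicity
w_gd, w_sgd = 2.0, 2.0  # same initialization for both methods

for k in range(500):
    w_gd = w_gd - eta * grad_F(w_gd)                      # gradient descent
    w_sgd = w_sgd - eta * (grad_F(w_sgd) + rng.normal())  # SGD: true gradient + zero-mean noise

print("GD:  w =", w_gd, " gradient =", grad_F(w_gd))
print("SGD: w =", w_sgd, " gradient =", grad_F(w_sgd))
```

With a small constant step size, gradient descent drives the gradient toward zero, i.e. it approaches a stationary point (which may be a local minimum or a saddle point rather than the global minimum), while SGD hovers in a neighborhood of one because of the gradient noise.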