 
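Gradient descent
● Start from a random point $u$.
● How do I get closer to the solution?
● Follow the opposite of the gradient: the gradient indicates the direction of steepest increase.
[Figure: graph of $f$ showing the point $u$, its value $f(u)$, the direction $-\nabla f(u)$, and the value of $f$ after a step in that direction.]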
Gradient descent algorithm
● Choose an initial point $u^{(0)}$
● Repeat for k = 1, 2, 3, …: $u^{(k)} = u^{(k-1)} - \alpha_k \nabla f(u^{(k-1)})$, where $\alpha_k$ is the step size
  – If the step size is too big, the search might diverge.
  – If the step size is too small, the search might take a very long time.
  – Backtracking line search makes it possible to choose the step size adaptively.
● Stop at some point (stopping criterion). Usually: stop when $\|\nabla f(u^{(k)})\|_2 \le \epsilon$
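The loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the fixed step size, the tolerance, and the quadratic objective are all assumptions chosen for the example.

```python
import math

def gradient_descent(grad, u0, step_size=0.1, tol=1e-8, max_iter=10_000):
    """Fixed-step gradient descent; stops when the gradient norm is at most tol."""
    u = list(u0)
    for k in range(max_iter):
        g = grad(u)
        if math.sqrt(sum(gi * gi for gi in g)) <= tol:  # stopping criterion
            return u, k
        # Step in the direction opposite to the gradient.
        u = [ui - step_size * gi for ui, gi in zip(u, g)]
    return u, max_iter

# Illustrative objective (an assumption): f(u) = (u1 - 1)^2 + 2 (u2 + 3)^2
def grad_f(u):
    return [2.0 * (u[0] - 1.0), 4.0 * (u[1] + 3.0)]

u_star, n_iter = gradient_descent(grad_f, [0.0, 0.0])
print(u_star, n_iter)
```

On this objective, a step size of 0.6 multiplies the error on the second coordinate by |1 − 4·0.6| = 1.4 at every iteration, so the iterates diverge, which is the "too big" case mentioned above.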
BLS: shrinking needed
$f(u - \alpha \nabla f(u)) > f(u) - \frac{\alpha}{2} \nabla f(u)^\top \nabla f(u)$
The step size is too big and we are overshooting our goal.
[Figure: graph of $f$ marking the points $u$, $u - \frac{\alpha}{2} \nabla f(u)$ and $u - \alpha \nabla f(u)$; the value $f(u - \alpha \nabla f(u))$ lies above the threshold $f(u) - \frac{\alpha}{2} \nabla f(u)^\top \nabla f(u)$.]
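BLS: no shrinking needed
$f(u - \alpha \nabla f(u)) \le f(u) - \frac{\alpha}{2} \nabla f(u)^\top \nabla f(u)$
The step size is small enough.
[Figure: graph of $f$ marking the points $u$, $u - \frac{\alpha}{2} \nabla f(u)$ and $u - \alpha \nabla f(u)$; the value $f(u - \alpha \nabla f(u))$ lies below the threshold $f(u) - \frac{\alpha}{2} \nabla f(u)^\top \nabla f(u)$.]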
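Backtracking line search
● Shrinking parameter $\beta \in (0, 1)$, initial step size $\alpha_0$
● Choose an initial point $u^{(0)}$
● Repeat for k = 1, 2, 3, …
  – If $f(u^{(k-1)} - \alpha_{k-1} \nabla f(u^{(k-1)})) > f(u^{(k-1)}) - \frac{\alpha_{k-1}}{2} \nabla f(u^{(k-1)})^\top \nabla f(u^{(k-1)})$, shrink the step size: $\alpha_k = \beta \alpha_{k-1}$
  – Else: $\alpha_k = \alpha_{k-1}$
  – Update: $u^{(k)} = u^{(k-1)} - \alpha_k \nabla f(u^{(k-1)})$
● Stop when $\|\nabla f(u^{(k)})\|_2 \le \epsilon$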
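A possible sketch of gradient descent with backtracking, in plain Python. It is a common variant rather than a transcription of the slide: the inner loop keeps shrinking the step until the overshoot test passes, and the step size is carried over between iterations. The names `alpha0` and `beta`, the default values, and the test objective are assumptions made for the example.

```python
def gd_backtracking(f, grad, u0, alpha0=1.0, beta=0.5, tol=1e-8, max_iter=1000):
    """Gradient descent whose step size is shrunk by backtracking."""
    u, alpha = list(u0), alpha0
    for _ in range(max_iter):
        g = grad(u)
        sq_norm = sum(gi * gi for gi in g)        # grad^T grad
        if sq_norm ** 0.5 <= tol:                 # stopping criterion
            break
        # Shrink alpha while the candidate step overshoots, i.e. while
        # f(u - alpha * g) > f(u) - (alpha / 2) * g^T g
        while f([ui - alpha * gi for ui, gi in zip(u, g)]) > f(u) - 0.5 * alpha * sq_norm:
            alpha *= beta
        u = [ui - alpha * gi for ui, gi in zip(u, g)]
    return u, alpha

# Illustrative objective (an assumption): f(u) = (u1 - 1)^2 + 10 (u2 + 2)^2
def f(u):
    return (u[0] - 1.0) ** 2 + 10.0 * (u[1] + 2.0) ** 2

def grad_f(u):
    return [2.0 * (u[0] - 1.0), 20.0 * (u[1] + 2.0)]

u_min, final_alpha = gd_backtracking(f, grad_f, [0.0, 0.0])
print(u_min, final_alpha)
```

Starting from a deliberately large `alpha0`, the test rejects the first few candidate steps and the step size settles at a value small enough for this objective.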
Newton’s method
● Suppose $f$ is twice differentiable
● Second-order Taylor expansion: $f(v) \approx f(u) + \nabla f(u)^\top (v - u) + \frac{1}{2} (v - u)^\top \nabla^2 f(u) (v - u)$
● Minimize this approximation in $v$ (instead of minimizing $f$ in $u$): setting its gradient with respect to $v$ to zero gives $\nabla f(u) + \nabla^2 f(u) (v - u) = 0$, hence $v = u - [\nabla^2 f(u)]^{-1} \nabla f(u)$
● Update rule: $u^{(k)} = u^{(k-1)} - [\nabla^2 f(u^{(k-1)})]^{-1} \nabla f(u^{(k-1)})$
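As an illustration of the Newton update, here is a small two-dimensional sketch in plain Python. The objective, the starting point, and the helper `solve_2x2` are assumptions made for the example; the step is obtained by solving a 2×2 linear system rather than by forming the inverse Hessian explicitly.

```python
import math

def solve_2x2(H, g):
    """Solve H d = g for a 2x2 matrix H using Cramer's rule."""
    (a, b), (c, d) = H
    det = a * d - b * c
    return [(d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det]

def newton(grad, hess, u0, tol=1e-10, max_iter=50):
    """Pure Newton iterations (no line search) in two dimensions."""
    u = list(u0)
    for k in range(max_iter):
        g = grad(u)
        if math.hypot(g[0], g[1]) <= tol:       # stopping criterion
            return u, k
        step = solve_2x2(hess(u), g)            # step = [Hessian]^{-1} gradient
        u = [u[0] - step[0], u[1] - step[1]]
    return u, max_iter

# Illustrative convex objective (an assumption):
# f(u) = exp(u1) - u1 + (u1 - u2)^2, minimised at (0, 0)
def grad_f(u):
    return [math.exp(u[0]) - 1.0 + 2.0 * (u[0] - u[1]), -2.0 * (u[0] - u[1])]

def hess_f(u):
    return [[math.exp(u[0]) + 2.0, -2.0], [-2.0, 2.0]]

u_min, n_iter = newton(grad_f, hess_f, [1.0, -1.0])
print(u_min, n_iter)
```

On this example the error is roughly squared at each iteration, so only a handful of iterations are needed.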
Newton CG (conjugate gradient)
● Computing the inverse of the Hessian is computationally intensive.
● Instead, compute $\nabla f(u^{(k-1)})$ and $\nabla^2 f(u^{(k-1)})$ and solve $\nabla^2 f(u^{(k-1)}) \, \delta_k = \nabla f(u^{(k-1)})$ for $\delta_k$
● New update rule: $u^{(k)} = u^{(k-1)} - \delta_k$
● This is a problem of the form $Ax = b$, where $A = \nabla^2 f(u^{(k-1)})$ is symmetric positive semidefinite (second-order characterization of convex functions). Solve it using the conjugate gradient method.
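Conjugate gradient method
Solve $Ax = b$
● Idea: build a set of A-conjugate vectors $v_0, \dots, v_{n-1}$ (a basis of $\mathbb{R}^n$), i.e. $v_i^\top A v_j = 0$ for $i \ne j$
  – Initialisation: choose $x_0$; set $r_0 = b - A x_0$ and $v_0 = r_0$
  – At step t:
    ● Update rule: $x_t = x_{t-1} + \alpha_t v_{t-1}$ with $\alpha_t = \frac{r_{t-1}^\top r_{t-1}}{v_{t-1}^\top A v_{t-1}}$
    ● residual: $r_t = b - A x_t$
    ● $v_t = r_t + \beta_t v_{t-1}$ with $\beta_t = \frac{r_t^\top r_t}{r_{t-1}^\top r_{t-1}}$ ensures $v_t^\top A v_{t-1} = 0$
  – Convergence: $\{v_0, \dots, v_{n-1}\}$ is a basis of $\mathbb{R}^n$, hence $r_n = 0$, i.e. $A x_n = b$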
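A minimal plain-Python sketch of the conjugate gradient method for $Ax = b$, assuming $A$ is symmetric positive definite. The coefficients `alpha` and `beta` follow the standard textbook form, which may differ in notation from the slide; the 2×2 system is an assumption chosen so the result is easy to check by hand.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def conjugate_gradient(A, b, x0=None, tol=1e-12):
    """Solve A x = b for a symmetric positive definite A."""
    n = len(b)
    x = [0.0] * n if x0 is None else list(x0)
    r = [bi - ax for bi, ax in zip(b, matvec(A, x))]   # residual r_0 = b - A x_0
    v = list(r)                                        # first direction v_0 = r_0
    rr = dot(r, r)
    for _ in range(n):                 # at most n steps in exact arithmetic
        if rr ** 0.5 <= tol:
            break
        Av = matvec(A, v)
        alpha = rr / dot(v, Av)                        # step length along v
        x = [xi + alpha * vi for xi, vi in zip(x, v)]
        r = [ri - alpha * avi for ri, avi in zip(r, Av)]   # equals b - A x
        rr_new = dot(r, r)
        beta = rr_new / rr             # makes the new direction A-conjugate to v
        v = [ri + beta * vi for ri, vi in zip(r, v)]
        rr = rr_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
print(x)  # exact solution is (1/11, 7/11)
```

In a Newton CG step, `A` would be the Hessian and `b` the gradient at the current iterate, so the Hessian is never inverted explicitly.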
Conjugate gradient method
? Prove … given
  – Initialisation: …
  – At step t:
    ● Update rule: …
    ● residual: …
    ● and assuming …
Conjugate gradient method
? Prove … and conclude the proof, given
  – Initialisation: …
  – At step t:
    ● Update rule: …
    ● residual: …
Quasi-Newton methods
● What if the Hessian is unavailable or expensive to compute at each iteration?
● Approximate the inverse Hessian by a matrix $W_k$, updated iteratively
● Conditions:
  – $W_k$ is symmetric positive definite
  – Secant equation, from a first-order Taylor expansion applied to $\nabla f$: the mean value $G$ of $\nabla^2 f$ between $u$ and $v$ satisfies $\nabla f(v) - \nabla f(u) = G (v - u)$ ⇒ $W_k \left( \nabla f(u^{(k)}) - \nabla f(u^{(k-1)}) \right) = u^{(k)} - u^{(k-1)}$
● Initialization: $W_0 = I$ (identity)
● BFGS: Broyden-Fletcher-Goldfarb-Shanno
● L-BFGS: limited-memory variant. Do not store the full matrix $W_k$.
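For reference, one standard way to write the BFGS update of the inverse-Hessian approximation is given below, followed by a one-line check that it satisfies the secant equation. The shorthand $s_k$, $y_k$, $\rho_k$ is introduced here for illustration and is not taken from the slides.

```latex
\begin{aligned}
s_k &= u^{(k)} - u^{(k-1)}, \qquad
y_k = \nabla f(u^{(k)}) - \nabla f(u^{(k-1)}), \qquad
\rho_k = \frac{1}{y_k^\top s_k} \\
W_k &= \left(I - \rho_k s_k y_k^\top\right) W_{k-1} \left(I - \rho_k y_k s_k^\top\right) + \rho_k s_k s_k^\top \\
W_k y_k &= \left(I - \rho_k s_k y_k^\top\right) W_{k-1} \underbrace{\left(y_k - \rho_k y_k (s_k^\top y_k)\right)}_{=\,0}
          + \rho_k s_k (s_k^\top y_k) = s_k
\end{aligned}
```

The last line uses $\rho_k \, s_k^\top y_k = 1$. The update keeps $W_k$ symmetric, and positive definite whenever $W_{k-1}$ is positive definite and $y_k^\top s_k > 0$.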
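Stochastic gradient descent
● For $f(u) = \sum_{i=1}^{m} f_i(u)$
● Gradient descent: $u^{(k)} = u^{(k-1)} - \alpha_k \sum_{i=1}^{m} \nabla f_i(u^{(k-1)})$
● Stochastic gradient descent: $u^{(k)} = u^{(k-1)} - \alpha_k \nabla f_{i_k}(u^{(k-1)})$
  – Cyclic: cycle over 1, 2, …, m, 1, 2, …, m, …
  – Randomized: choose $i_k$ uniformly at random in {1, 2, …, m}.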
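The two index-selection rules can be sketched as follows in plain Python. Everything specific here is an assumption made for the example: the scalar objective $f(u) = \sum_i \frac{1}{2}(u - a_i)^2$ (minimised at the mean of the $a_i$), the decreasing step size $1/k$, and the zero-based indices.

```python
import random

def sgd(grad_i, m, u0, n_steps, step_size, order="cyclic", seed=0):
    """Stochastic gradient descent on f(u) = sum_i f_i(u), for a scalar u.

    grad_i(i, u) returns the gradient of the i-th term (i is zero-based).
    step_size(k) returns the step size used at iteration k = 1, 2, ...
    """
    rng = random.Random(seed)
    u = u0
    for k in range(1, n_steps + 1):
        if order == "cyclic":
            i = (k - 1) % m          # cycle over 0, 1, ..., m-1, 0, 1, ...
        else:
            i = rng.randrange(m)     # uniform at random in {0, ..., m-1}
        u = u - step_size(k) * grad_i(i, u)
    return u

# Illustrative terms (an assumption): f_i(u) = 0.5 * (u - a_i)^2
a = [1.0, 2.0, 6.0]

def grad_i(i, u):
    return u - a[i]

u_cyclic = sgd(grad_i, len(a), 0.0, 300, lambda k: 1.0 / k)
u_random = sgd(grad_i, len(a), 0.0, 3000, lambda k: 1.0 / k, order="randomized")
print(u_cyclic, u_random)
```

With the step size $1/k$ on this objective, each iterate is the running average of the $a_i$ visited so far, so the cyclic variant lands on the mean after every full pass while the randomized variant fluctuates around it.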
Coordinate Descent
● For $f(u) = g(u) + \sum_{i=1}^{n} h_i(u_i)$
  – $g$: convex and differentiable
  – $h_i$: convex
  ⇒ the non-smooth part of $f$ is separable.
● Minimize coordinate by coordinate:
  – Initialisation: $u^{(0)}$
  – For k = 1, 2, …: for i = 1, …, n: $u_i^{(k)} = \arg\min_{u_i} f(u_1^{(k)}, \dots, u_{i-1}^{(k)}, u_i, u_{i+1}^{(k-1)}, \dots, u_n^{(k-1)})$
● Variants:
  – re-order the coordinates randomly
  – proceed by blocks of coordinates (2 or more at a time)
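As an illustration, the sketch below applies coordinate descent to one particular choice of $g$ and $h_i$, assumed here for the example: $g(u) = \frac{1}{2} u^\top A u - b^\top u$ with $A$ symmetric positive definite, and $h_i(u_i) = \lambda |u_i|$. For this choice each one-dimensional minimisation has a closed form given by soft-thresholding.

```python
def soft_threshold(z, lam):
    """Minimiser of 0.5 * (x - z)^2 + lam * |x| over x."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def coordinate_descent(A, b, lam, n_sweeps=100):
    """Minimise 0.5 u^T A u - b^T u + lam * sum_i |u_i| one coordinate at a time."""
    n = len(b)
    u = [0.0] * n
    for _ in range(n_sweeps):
        for i in range(n):
            # All other coordinates are held fixed at their latest values.
            rho = b[i] - sum(A[i][j] * u[j] for j in range(n) if j != i)
            u[i] = soft_threshold(rho, lam) / A[i][i]
    return u

A = [[2.0, 0.5], [0.5, 1.0]]
b = [3.0, 0.2]
u = coordinate_descent(A, b, lam=0.5)
print(u)
```

On this small problem the second coordinate is set exactly to zero, which a plain gradient step on the non-smooth objective would not produce.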
Summary: Unconstrained convex optimization
● If $f$ is differentiable:
  – Set its gradient to zero
  – If hard to solve: gradient descent. Setting the learning rate: Backtracking Line Search (adapt heuristically to avoid “overshooting”)
● Newton’s method: suppose $f$ is twice differentiable
  – $u^{(k)} = u^{(k-1)} - [\nabla^2 f(u^{(k-1)})]^{-1} \nabla f(u^{(k-1)})$
  – If the Hessian is hard to invert, compute $[\nabla^2 f(u^{(k-1)})]^{-1} \nabla f(u^{(k-1)})$ by solving $\nabla^2 f(u^{(k-1)}) \, x = \nabla f(u^{(k-1)})$ with the conjugate gradient method
  – If the Hessian is hard to compute, approximate the inverse Hessian with a quasi-Newton method such as BFGS (L-BFGS: less memory)
● If $f$ is separable (a sum $\sum_i f_i$): stochastic gradient descent
● If the non-smooth part of $f$ is separable: coordinate descent.
Constrained convex optimization
● Convex optimization program/problem: $\min_{u} f(u)$ subject to $g_i(u) \le 0$ for $i = 1, \dots, m$ and $h_j(u) = 0$ for $j = 1, \dots, r$
  – $f$ is convex
  – the $g_i$ are convex
  – the $h_j$ are affine
  – The feasible set is convex
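Lagrangian
● Lagrangian: $L(u, \alpha, \beta) = f(u) + \sum_{i=1}^{m} \alpha_i g_i(u) + \sum_{j=1}^{r} \beta_j h_j(u)$, where $g_i(u) \le 0$ are the inequality constraints and $h_j(u) = 0$ the equality constraints
  – $\alpha_i$, $\beta_j$ = Lagrange multipliers = dual variables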
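Lagrange dual function
● Lagrangian: $L(u, \alpha, \beta) = f(u) + \sum_{i=1}^{m} \alpha_i g_i(u) + \sum_{j=1}^{r} \beta_j h_j(u)$
● Lagrange dual function: $Q(\alpha, \beta) = \inf_{u} L(u, \alpha, \beta)$
  – Infimum = the greatest value $x$ such that $x \le L(u, \alpha, \beta)$ for all $u$
● $Q$ is concave (independently of the convexity of $f$): it is a pointwise infimum of functions that are affine in $(\alpha, \beta)$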
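Lagrange dual function
● The dual function gives a lower bound on our solution
  – Let $p^* = \min_{u \in \mathcal{U}} f(u)$, where $\mathcal{U}$ is the feasible set
  – Then for any $\alpha \ge 0$ and any $\beta$: $Q(\alpha, \beta) \le p^*$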
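The lower bound follows in two lines. A sketch of the argument, writing the constraints as $g_i(u) \le 0$ and $h_j(u) = 0$ and the dual function as $Q$ (notation assumed here for illustration), for any feasible point $\tilde{u}$, any $\alpha \ge 0$ and any $\beta$:

```latex
\begin{aligned}
L(\tilde{u}, \alpha, \beta)
  &= f(\tilde{u}) + \sum_i \underbrace{\alpha_i \, g_i(\tilde{u})}_{\le\, 0}
     + \sum_j \beta_j \underbrace{h_j(\tilde{u})}_{=\, 0}
  \;\le\; f(\tilde{u}) \\
Q(\alpha, \beta) &= \inf_{u} L(u, \alpha, \beta)
  \;\le\; L(\tilde{u}, \alpha, \beta) \;\le\; f(\tilde{u})
\end{aligned}
```

Since this holds for every feasible $\tilde{u}$, it holds in particular at a minimiser of $f$ over the feasible set, which gives $Q(\alpha, \beta) \le p^*$.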
Weak duality
● $Q(\alpha, \beta) \le p^*$ for any $\alpha \ge 0$ and any $\beta$
● What is the best lower bound on $p^*$ we can get?
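To make the question concrete, the sketch below evaluates the dual function on a toy problem chosen for illustration (an assumption, not taken from the slides): minimise $f(u) = u^2$ subject to $1 - u \le 0$, whose optimal value is $p^* = 1$ at $u = 1$. The Lagrangian is $u^2 + \alpha (1 - u)$, and it is minimised in $u$ at $u = \alpha / 2$.

```python
def lagrangian(u, alpha):
    """L(u, alpha) = f(u) + alpha * g(u) with f(u) = u^2 and g(u) = 1 - u."""
    return u ** 2 + alpha * (1.0 - u)

def dual(alpha):
    """Q(alpha) = inf_u L(u, alpha); the infimum is reached at u = alpha / 2."""
    u_min = alpha / 2.0          # solves dL/du = 2u - alpha = 0
    return lagrangian(u_min, alpha)

p_star = 1.0                     # optimal value of the toy problem
bounds = {alpha: dual(alpha) for alpha in (0.0, 1.0, 2.0, 3.0)}
for alpha, q in bounds.items():
    print(f"alpha = {alpha}: Q(alpha) = {q} <= p* = {p_star}")
```

Each value of $\alpha \ge 0$ gives a valid lower bound on $p^*$, and some are tighter than others: here the bounds are 0, 0.75, 1 and 0.75.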