2. Elements of convex optimization


Introduction to Machine Learning, CentraleSupélec Paris, Fall 2017. 2. Elements of convex optimization. Chloé-Agathe Azencott, Centre for Computational Biology, Mines ParisTech, chloe-agathe.azencott@mines-paristech.fr. Why talk about convex optimization?


  1. Gradient descent ● Start from a random point u. ● How do I get closer to the solution? ● Follow the opposite of the gradient: the gradient indicates the direction of steepest increase. [Figure: for a small enough step, moving from u to u⁺ = u − α∇f(u) decreases the objective, from f(u) to f(u⁺).]

  2. Gradient descent algorithm ● Choose an initial point u^(0) ● Repeat for k = 1, 2, 3, …: u^(k) = u^(k-1) − α ∇f(u^(k-1)), where α is the step size ● Stop at some point (stopping criterion). Usually: stop when the gradient ∇f(u^(k)) is (close to) zero.
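A minimal NumPy sketch of this loop (not from the slides; the test function, step size and tolerance are illustrative choices):

```python
import numpy as np

def gradient_descent(grad_f, u0, step_size=0.1, tol=1e-6, max_iter=10000):
    """Minimize f by repeatedly stepping opposite to its gradient."""
    u = np.asarray(u0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(u)
        if np.linalg.norm(g) < tol:      # stopping criterion: gradient (close to) zero
            break
        u = u - step_size * g            # u^(k) = u^(k-1) - alpha * grad f(u^(k-1))
    return u

# Example: minimize f(u) = ||u - 1||^2, whose gradient is 2 (u - 1).
u_star = gradient_descent(lambda u: 2 * (u - np.ones(3)), u0=np.zeros(3))
```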

  5. Gradient descent algorithm ● Choose an initial point ● Repeat for k = 1, 2, 3, … ● How to choose the step size α? – If the step size is too big, the search might diverge – If the step size is too small, the search might take a very long time – Backtracking line search makes it possible to choose the step size adaptively.

  11. BLS: shrinking needed ● The step size is too big and we are overshooting our goal: f(u − α∇f(u)) > f(u) − (α/2) ∇f(u)ᵀ∇f(u). [Figure: f plotted at u, u − (α/2)∇f(u) and u − α∇f(u); the value f(u − α∇f(u)) lies above the comparison value f(u) − (α/2) ∇f(u)ᵀ∇f(u).]

  16. BLS: no shrinking needed ● The step size is small enough: f(u − α∇f(u)) ≤ f(u) − (α/2) ∇f(u)ᵀ∇f(u).

  17. Backtracking line search ● Shrinking parameter β ∈ (0, 1), initial step size α ● Choose an initial point ● Repeat for k = 1, 2, 3, … – If f(u^(k) − α∇f(u^(k))) > f(u^(k)) − (α/2) ∇f(u^(k))ᵀ∇f(u^(k)), shrink the step size: α ← βα – Else, update: u^(k+1) = u^(k) − α ∇f(u^(k)) ● Stop when the gradient is (close to) zero.
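A sketch of one gradient step with backtracking, using the same shrinking condition as above (parameter names and the test function are illustrative):

```python
import numpy as np

def bls_gradient_step(f, grad_f, u, alpha_init=1.0, beta=0.5):
    """One gradient step whose step size is chosen by backtracking line search."""
    g = grad_f(u)
    alpha = alpha_init
    # Shrink alpha while f(u - alpha * g) > f(u) - (alpha / 2) * ||g||^2.
    while f(u - alpha * g) > f(u) - 0.5 * alpha * (g @ g):
        alpha *= beta                    # alpha <- beta * alpha
    return u - alpha * g

# Example: one step on f(u) = ||u||^2 starting from u = (3, 4).
u_next = bls_gradient_step(lambda u: u @ u, lambda u: 2 * u, np.array([3.0, 4.0]))
```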

  18. Newton's method ● Suppose f is twice differentiable ● Second-order Taylor expansion: f(v) ≈ f(u) + ∇f(u)ᵀ(v − u) + ½ (v − u)ᵀ ∇²f(u) (v − u) ● Minimize in v instead of in u: the quadratic model is minimized at v = u − [∇²f(u)]⁻¹ ∇f(u), which gives the Newton update.
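A minimal sketch of the resulting Newton iteration (the test function is illustrative; here the Newton step is obtained by solving the Hessian system directly):

```python
import numpy as np

def newton(grad_f, hess_f, u0, tol=1e-8, max_iter=100):
    """Newton's method: repeatedly minimize the local quadratic model of f."""
    u = np.asarray(u0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(u)
        if np.linalg.norm(g) < tol:
            break
        # Newton step: solve hess_f(u) d = grad_f(u) rather than forming the inverse Hessian.
        d = np.linalg.solve(hess_f(u), g)
        u = u - d
    return u

# Example: f(u) = u1^4 + u2^2, with gradient (4 u1^3, 2 u2) and a diagonal Hessian.
u_star = newton(lambda u: np.array([4 * u[0]**3, 2 * u[1]]),
                lambda u: np.diag([12 * u[0]**2, 2.0]),
                u0=np.array([2.0, 1.0]))
```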

  21. Newton CG (conjugate gradient) ● Computing the inverse of the Hessian is computationally intensive. ● Instead, compute ∇f(u^(k)) and ∇²f(u^(k)), and solve ∇²f(u^(k)) d = ∇f(u^(k)) for d. ● New update rule: u^(k+1) = u^(k) − d.

  24. Newton CG (conjugate gradient) ● This is a problem of the form Ax = b, where A = ∇²f(u^(k)) is symmetric positive semidefinite (second-order characterization of convex functions). ● Solve it using the conjugate gradient method.

  27. Conjugate gradient method ● Solve Ax = b ● Idea: build a set of A-conjugate vectors p_0, p_1, … (a basis of ℝⁿ) – Initialisation: pick x_0, set r_0 = b − A x_0 and p_0 = r_0 – At step t: ● Update rule: x_{t+1} = x_t + α_t p_t ● residual: r_t = b − A x_t ● α_t = (p_tᵀ r_t) / (p_tᵀ A p_t) ensures r_{t+1} ⊥ p_t – Convergence: the directions p_t span ℝⁿ, hence the residual vanishes after at most n steps and x_n solves Ax = b.
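A compact sketch of the conjugate gradient iteration for Ax = b with A symmetric positive definite (variable names follow the standard presentation, not necessarily the slides'):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    """Solve Ax = b for symmetric positive definite A using A-conjugate directions."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x            # residual
    p = r.copy()             # first search direction
    for _ in range(n):       # at most n steps in exact arithmetic
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        alpha = rr / (p @ A @ p)
        x = x + alpha * p
        r = r - alpha * (A @ p)
        beta = (r @ r) / rr  # makes the new direction A-conjugate to the previous ones
        p = r + beta * p
    return x

# Example: a small symmetric positive definite system.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)   # x satisfies A @ x ≈ b
```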

  28. Conjugate gradient method ● Exercise: prove …, given the initialisation, the update rule x_{t+1} = x_t + α_t p_t, the residual r_t = b − A x_t, and assuming …

  30. Conjugate gradient method ● Exercise: prove … and conclude the proof, given the initialisation, the update rule, and the residual.

  32. Quasi-Newton methods ● What if the Hessian is unavailable / expensive to compute at each iteration? ● Approximate the inverse Hessian by a matrix W_k, updated iteratively ● Conditions: a 1st-order Taylor expansion applied to ∇f gives the secant equation: W_{k+1} (∇f(u^(k+1)) − ∇f(u^(k))) = u^(k+1) − u^(k) ● Initialization: W_0 = Identity.

  33. Quasi-Newton methods ● BFGS: Broyden–Fletcher–Goldfarb–Shanno update of W_k – Secant equation: the mean value G of the Hessian between u^(k) and u^(k+1) verifies G (u^(k+1) − u^(k)) = ∇f(u^(k+1)) − ∇f(u^(k)) ⇒ W_{k+1} should map this gradient difference back onto the step. ● L-BFGS: limited-memory variant – do not store the full matrix W_k.
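A sketch of the standard BFGS update of the inverse-Hessian approximation W_k (not taken verbatim from the slides; only the update rule is shown, not the full optimizer loop):

```python
import numpy as np

def bfgs_update(W, s, y):
    """BFGS update of the inverse-Hessian approximation W.

    s = u_{k+1} - u_k and y = grad f(u_{k+1}) - grad f(u_k);
    the returned matrix satisfies the secant equation W_new @ y = s.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ W @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

# Inside a quasi-Newton loop one would take d = W @ grad, move to u_new = u - step * d,
# then call W = bfgs_update(W, u_new - u, grad_new - grad), starting from W = identity.
```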

  34. Stochastic gradient descent ● For f of the form f(u) = Σ_{i=1}^m f_i(u) ● Gradient descent: u^(k) = u^(k-1) − α_k Σ_{i=1}^m ∇f_i(u^(k-1)) ● Stochastic gradient descent: u^(k) = u^(k-1) − α_k ∇f_{i_k}(u^(k-1)) – Cyclic: cycle over 1, 2, …, m, 1, 2, …, m, … – Randomized: choose i_k uniformly at random in {1, 2, …, m}.
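A minimal randomized-SGD sketch for a sum of per-example losses (the least-squares example and all parameter values are illustrative):

```python
import numpy as np

def sgd(grad_fi, m, u0, step_size=0.01, n_epochs=50, seed=0):
    """Stochastic gradient descent: each step follows the gradient of a single term f_i."""
    rng = np.random.default_rng(seed)
    u = np.asarray(u0, dtype=float)
    for _ in range(n_epochs * m):
        i = rng.integers(m)              # randomized variant: i_k uniform in {0, ..., m-1}
        u = u - step_size * grad_fi(u, i)
    return u

# Example: least squares, f_i(u) = (x_i^T u - y_i)^2 with gradient 2 (x_i^T u - y_i) x_i.
rng = np.random.default_rng(0)
X, u_true = rng.normal(size=(100, 3)), np.array([1.0, -2.0, 0.5])
y = X @ u_true
u_hat = sgd(lambda u, i: 2 * (X[i] @ u - y[i]) * X[i], m=100, u0=np.zeros(3))
```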

  35. Coordinate Descent ● For f of the form f(u) = g(u) + Σ_i h_i(u_i) – g: convex and differentiable – h_i: convex ⇒ the non-smooth part of f is separable. ● Minimize coordinate by coordinate: – Initialisation: u^(0) – For k = 1, 2, …: for each coordinate i, minimize f over u_i with all other coordinates fixed. ● Variants: – re-order the coordinates randomly – proceed by blocks of coordinates (2 or more at a time).
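As an illustration (not from the slides), coordinate descent for the lasso, where g is the squared error and the separable l1 penalty is the non-smooth part; each coordinate update has a closed form via soft-thresholding:

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Coordinate descent for f(u) = 0.5 * ||X u - y||^2 + lam * ||u||_1."""
    n, p = X.shape
    u = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)            # ||X_i||^2 for each column (assumed nonzero)
    for _ in range(n_iters):
        for i in range(p):
            # Partial residual with coordinate i's current contribution removed.
            r_i = y - X @ u + X[:, i] * u[i]
            rho = X[:, i] @ r_i
            # Closed-form minimizer over u_i: soft-thresholding of rho.
            u[i] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[i]
    return u

# Example with a sparse ground truth.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + 0.1 * rng.normal(size=50)
u_hat = lasso_coordinate_descent(X, y, lam=1.0)
```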

  38. Summary: Unconstrained convex optimization ● If f is differentiable: – set its gradient to zero – if that is hard to solve: gradient descent. Setting the learning rate: Backtracking Line Search (adapt the step size heuristically to avoid “overshooting”). ● Newton's method: suppose f is twice differentiable – if the Hessian is hard to invert, compute the Newton step by solving the linear system ∇²f(u) d = ∇f(u) with the conjugate gradient method – if the Hessian is hard to compute, approximate the inverse Hessian with a quasi-Newton method such as BFGS (L-BFGS: less memory). ● If f is separable: stochastic gradient descent. ● If the non-smooth part of f is separable: coordinate descent.
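In practice these solvers are available off the shelf; a short sketch using scipy.optimize.minimize (the smooth test function and its gradient are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# A smooth test function, its gradient, and a starting point.
f = lambda u: (u[0] - 1) ** 2 + 10 * (u[1] - u[0] ** 2) ** 2
grad = lambda u: np.array([2 * (u[0] - 1) - 40 * u[0] * (u[1] - u[0] ** 2),
                           20 * (u[1] - u[0] ** 2)])
u0 = np.zeros(2)

# Gradient-based, Newton-CG and (L-)BFGS solvers from this section.
for method in ("CG", "Newton-CG", "BFGS", "L-BFGS-B"):
    res = minimize(f, u0, jac=grad, method=method)
    print(method, res.x)
```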

  39. Constrained convex optimization

  40. Constrained convex optimization ● Convex optimization program/problem: minimize f(u) subject to g_i(u) ≤ 0 (i = 1, …, m) and h_j(u) = 0 (j = 1, …, p) – f is convex – the g_i are convex – the h_j are affine – The feasible set is convex.

  41. Lagrangian ● Lagrangian: L(u, α, β) = f(u) + Σ_i α_i g_i(u) + Σ_j β_j h_j(u) ● α, β = Lagrange multipliers = dual variables.

  42. Lagrange dual function ● Lagrangian: L(u, α, β) ● Lagrange dual function: Q(α, β) = inf_u L(u, α, β). Infimum = the greatest value q such that q ≤ L(u, α, β) for all u. ● Q is concave (independently of the convexity of f).

  44. Lagrange dual function ● The dual function gives a lower bound on our solution. Let p* = min { f(u) : u in the feasible set }. Then Q(α, β) ≤ p* for any α ≥ 0 and any β.

  45. Weak duality ● Q(α, β) ≤ p* for any α ≥ 0 and any β ● What is the best lower bound on p* we can get?
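A small worked example (not from the slides) illustrating the lower bound: minimize f(u) = u² subject to 1 − u ≤ 0, so the optimal value is p* = 1.

```latex
L(u, \alpha) = u^2 + \alpha (1 - u), \quad \alpha \ge 0
\qquad\Rightarrow\qquad
Q(\alpha) = \inf_u L(u, \alpha) = \alpha - \frac{\alpha^2}{4}
\quad (\text{infimum attained at } u = \alpha/2).
```

Indeed Q(α) ≤ 1 = p* for every α ≥ 0, and the best such bound, max over α ≥ 0 of Q(α) = Q(2) = 1, is obtained by maximizing Q: this is the dual problem.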
