

  1. CS/ECE/ISyE 524 Introduction to Optimization, Spring 2017–18
  25. NLP algorithms
  • Overview
  • Local methods
  • Constrained optimization
  • Global methods
  • Black-box methods
  • Course wrap-up
  Laurent Lessard (www.laurentlessard.com)

  2. Review of algorithms
  When studying Linear Programs, we talked about:
  • Simplex method: traverse the surface of the feasible polyhedron looking for the vertex with minimum cost. Only applicable to linear programs. Used by solvers such as Clp and CPLEX. Hybrid versions used by Gurobi and Mosek.
  • Interior point methods: traverse the inside of the feasible polyhedron and move toward the boundary point with minimum cost. Applicable to many different types of optimization problems. Used by SCS, ECOS, and Ipopt. 25-2

  3. Review of algorithms
  When studying Mixed Integer Programs, we talked about:
  • Cutting plane methods: solve a sequence of LP relaxations and keep adding cuts (special extra linear constraints) until the solution is integral, and therefore optimal. Also applicable to more general convex problems.
  • Branch and bound methods: solve a sequence of LP relaxations (upper bounding), and branch on fractional variables (lower bounding). Store problems in a tree, and prune branches that aren't fruitful. Most optimization problems can be solved this way. You just need a way to branch (split the feasible set) and a way to bound (efficiently relax).
  • Variants of the methods above are used by all MIP solvers. 25-3

  4. Overview of NLP algorithms
  To solve Nonlinear Programs with continuous variables, there is a wide variety of available algorithms. We'll assume the problem has the standard form:

      minimize_x   f_0(x)
      subject to:  f_i(x) ≤ 0  for i = 1, …, m

  • What works best depends on the kind of problem you're solving. We need to talk about problem categories. 25-4

  5. Overview of NLP algorithms
  1. Are the functions differentiable? Can we efficiently compute gradients or second derivatives of the f_i?
  2. What problem size are we dealing with? A few variables and constraints? Hundreds? Thousands? Millions?
  3. Do we want to find local optima, or do we need the global optimum (more difficult!)?
  4. Does the objective function have a large number of local minima, or a relatively small number?
  Note: items 3 and 4 don't matter if the problem is convex. In that case any local minimum is also a global minimum! 25-5

  6. Survey of NLP algorithms
  • Local methods using derivative information. This is what most NLP solvers use (and what most JuMP solvers use).
    ◦ unconstrained case
    ◦ constrained case
  • Global methods
  • Derivative-free methods 25-6

  7. Local methods using derivatives
  Let's start with the unconstrained case:

      minimize_x   f(x)

  Many methods available! Ordered from cheap (but slow) to expensive (but fast):
  • Stochastic gradient descent
  • Gradient descent
  • Accelerated methods
  • Conjugate gradient
  • Quasi-Newton methods
  • Newton's method 25-7

  8. Iterative methods
  Local methods iteratively step through the space looking for a point where ∇f(x) = 0.
  1. Pick a starting point x_0.
  2. Choose a direction Δ_k to move in. This is the part where different algorithms do different things.
  3. Update your location: x_{k+1} = x_k + Δ_k.
  4. Repeat until you're happy with the function value or the algorithm has ceased to make progress. 25-8
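The loop above can be sketched in a few lines of Python. This is my own illustrative skeleton (names like `iterative_descent` are not from the course); step 2 is the pluggable `direction` argument:

```python
import numpy as np

def iterative_descent(grad, x0, direction, tol=1e-8, max_iter=1000):
    """Generic local method: step until the gradient is (numerically) zero."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stationary point: grad f(x) = 0
            break
        x = x + direction(x, g, k)    # x_{k+1} = x_k + Delta_k
    return x

# Plugging in a constant-step gradient direction recovers gradient descent
# on f(x) = ||x||^2, whose minimizer is the origin:
grad = lambda x: 2 * x
x_star = iterative_descent(grad, [3.0, -4.0],
                           direction=lambda x, g, k: -0.25 * g)
```

Different choices of `direction` give the different algorithms on the previous slide.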

  9. Vector calculus
  Suppose f : R^n → R is a twice-differentiable function.
  • The gradient of f is a function ∇f : R^n → R^n defined by:

      (∇f)_i = ∂f/∂x_i

  ∇f(x) points in the direction of greatest increase of f at x.
  • The Hessian of f is a function ∇²f : R^n → R^{n×n} where:

      (∇²f)_{ij} = ∂²f/∂x_i ∂x_j

  ∇²f(x) is a matrix that encodes the curvature of f at x. 25-9

  10. Vector calculus
  Example: suppose f(x, y) = x² + 3xy + 5y² − 7x + 2

  • ∇f = [∂f/∂x; ∂f/∂y] = [2x + 3y − 7; 3x + 10y]

  • ∇²f = [∂²f/∂x², ∂²f/∂x∂y; ∂²f/∂x∂y, ∂²f/∂y²] = [2, 3; 3, 10]

  Taylor's theorem in n dimensions:

      f(x) ≈ f(x_0) + ∇f(x_0)ᵀ(x − x_0) + ½(x − x_0)ᵀ∇²f(x_0)(x − x_0) + ⋯

  The first two terms give the best linear approximation of f near x_0; the first three give the best quadratic approximation. 25-10
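The worked example can be checked numerically. The finite-difference helper below is not from the slides, just a standard sanity check for hand-computed derivatives:

```python
import numpy as np

# f(x, y) = x^2 + 3xy + 5y^2 - 7x + 2 and its hand-computed derivatives
def f(p):
    x, y = p
    return x**2 + 3*x*y + 5*y**2 - 7*x + 2

def grad(p):
    x, y = p
    return np.array([2*x + 3*y - 7, 3*x + 10*y])

hess = np.array([[2.0, 3.0], [3.0, 10.0]])   # constant, since f is quadratic

def numerical_grad(f, p, h=1e-6):
    """Central finite differences: (f(p + h e_i) - f(p - h e_i)) / 2h."""
    g = np.zeros(len(p))
    for i in range(len(p)):
        e = np.zeros(len(p)); e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

p = np.array([1.3, -0.7])
assert np.allclose(grad(p), numerical_grad(f, p), atol=1e-4)
```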

  11. Gradient descent
  • The simplest of all iterative methods. It's a first-order method, which means it only uses gradient information:

      x_{k+1} = x_k − t_k ∇f(x_k)

  • −∇f(x_k) points in the direction of local steepest decrease of the function. We will move in this direction.
  • t_k is the stepsize. Many ways to choose it:
    ◦ Pick a constant: t_k = t.
    ◦ Pick a slowly decreasing stepsize, such as t_k = 1/√k.
    ◦ Exact line search: t_k = argmin_t f(x_k − t∇f(x_k)).
    ◦ A heuristic method (most common in practice). Example: backtracking line search. 25-11
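A compact sketch of gradient descent with backtracking (Armijo) line search; the constants 0.3 and 0.5 and the test problem are typical but arbitrary choices of mine, not from the slides:

```python
import numpy as np

def backtracking(f, g, x, alpha=0.3, beta=0.5):
    """Shrink t until f decreases 'enough' along the negative gradient."""
    t = 1.0
    while f(x - t * g) > f(x) - alpha * t * (g @ g):
        t *= beta
    return t

def gradient_descent(f, grad, x0, tol=1e-8, max_iter=5000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - backtracking(f, g, x) * g    # x_{k+1} = x_k - t_k grad f(x_k)
    return x

Q = np.diag([10.0, 1.0])                     # condition number 10
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
x_star = gradient_descent(f, grad, [1.0, 1.0])   # minimizer is the origin
```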

  12. Gradient descent
  We can gain insight into the effectiveness of a method by seeing how it performs on a quadratic: f(x) = ½ xᵀQx. The condition number κ := λ_max(Q)/λ_min(Q) determines convergence.

  [Figure: iterate trajectories and semi-log plots of distance to the optimal point vs. number of iterations, for κ = 10 and κ = 1.2, each comparing the optimal step, a shorter step, and an even shorter step.] 25-12
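The dependence on κ can be reproduced numerically. This sketch (my own toy setup, not the slide's exact experiment) counts iterations to a fixed accuracy using the optimal constant step t = 2/(λ_max + λ_min):

```python
import numpy as np

def gd_iterations(kappa, tol=1e-10, max_iter=100000):
    """Constant-step gradient descent on f(x) = 1/2 x^T Q x, Q = diag(kappa, 1)."""
    Q = np.diag([kappa, 1.0])
    t = 2.0 / (kappa + 1.0)            # optimal constant step for this Q
    x = np.array([1.0, 1.0])
    for k in range(max_iter):
        if np.linalg.norm(x) < tol:    # distance to the optimum x* = 0
            return k
        x = x - t * (Q @ x)
    return max_iter

# Better conditioning converges in far fewer iterations:
assert gd_iterations(1.2) < gd_iterations(10.0)
```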

  13. Gradient descent
  Advantages
  • Simple to implement and cheap to execute.
  • Can be easily adjusted.
  • Robust in the presence of noise and uncertainty.
  Disadvantages
  • Convergence is slow.
  • Sensitive to conditioning. Even rescaling a variable can have a substantial effect on performance!
  • Not always easy to tune the stepsize.
  Note: The idea of preconditioning (rescaling) before solving adds another layer of possible customizations and tradeoffs. 25-13

  14. Other first-order methods
  Accelerated methods (momentum methods)
  • Still a first-order method, but makes use of past iterates to accelerate convergence. Example: the heavy-ball method:

      x_{k+1} = x_k − α_k ∇f(x_k) + β_k (x_k − x_{k−1})

  Other examples: Nesterov, Beck & Teboulle, and others.
  • Can achieve substantial improvement over gradient descent with only a moderate increase in computational cost.
  • Not as robust to noise as gradient descent, and can be more difficult to tune because there are more parameters. 25-14
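A sketch of the heavy-ball iteration on a quadratic, using the classical (Polyak) parameter choices for eigenvalues in [m, L]; the test problem and tuning are mine, for illustration:

```python
import numpy as np

def heavy_ball(grad, x0, alpha, beta, iters=200):
    """x_{k+1} = x_k - alpha * grad f(x_k) + beta * (x_k - x_{k-1})."""
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    return x

Q = np.diag([10.0, 1.0])               # eigenvalues m = 1, L = 10
grad = lambda x: Q @ x                 # minimizer is the origin
L, m = 10.0, 1.0
alpha = 4.0 / (np.sqrt(L) + np.sqrt(m))**2
beta = ((np.sqrt(L) - np.sqrt(m)) / (np.sqrt(L) + np.sqrt(m)))**2
x_star = heavy_ball(grad, [1.0, 1.0], alpha, beta)
```

Note the extra parameter β relative to gradient descent, which is the tuning burden mentioned above.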

  15. Other first-order methods
  Mini-batch stochastic gradient descent (SGD)
  • Useful if f(x) = Σ_{i=1}^N f_i(x). Use the direction Σ_{i∈S} ∇f_i(x_k), where S ⊆ {1, …, N}. The size of S determines the "batch size": |S| = 1 is SGD and |S| = N is ordinary gradient descent.
  • Same pros and cons as gradient descent, but allows a further tradeoff of speed vs. computation.
  • Industry standard for big-data problems like deep learning.
  Nonlinear conjugate gradient
  • Variant of the standard conjugate gradient algorithm for solving Ax = b, but adapted for use in general optimization.
  • Requires more computation than accelerated methods.
  • Converges exactly in a finite number of steps when applied to quadratic functions. 25-15
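A minimal mini-batch SGD sketch on a toy sum-structured objective; the data and stepsize are my own. Here each f_i(x) = ½‖x − a_i‖², so the full objective's minimizer is the mean of the a_i:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(100, 2))          # data; f_i(x) = 1/2 ||x - a_i||^2
grad_i = lambda x, i: x - a[i]         # gradient of one term f_i

def minibatch_sgd(x0, batch_size, steps=1000, t=0.02):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        S = rng.choice(len(a), size=batch_size, replace=False)
        x = x - t * sum(grad_i(x, i) for i in S)   # sum over the batch S
    return x

x_hat = minibatch_sgd([0.0, 0.0], batch_size=10)
# x_hat hovers near the full minimizer a.mean(axis=0); the noise shrinks
# as the batch size grows toward N.
```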

  16. Newton's method
  Basic idea: approximate the function as a quadratic, move directly to the minimum of that quadratic, and repeat.
  • If we're at x_k, then by Taylor's theorem:

      f(x) ≈ f(x_k) + ∇f(x_k)ᵀ(x − x_k) + ½(x − x_k)ᵀ∇²f(x_k)(x − x_k)

  • If ∇²f(x_k) ≻ 0, the minimum of the quadratic occurs at:

      x_{k+1} := x_opt = x_k − ∇²f(x_k)⁻¹ ∇f(x_k)

  • Newton's method is a second-order method; it requires computing the Hessian (second derivatives). 25-16
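For a quadratic objective, "one step suffices" can be verified directly; the particular Q, b, and starting point below are arbitrary choices of mine:

```python
import numpy as np

# f(x) = 1/2 x^T Q x - b^T x has gradient Qx - b and constant Hessian Q,
# so one Newton step from any x_0 lands exactly on the minimizer Q^{-1} b.
Q = np.array([[2.0, 3.0], [3.0, 10.0]])      # positive definite
b = np.array([7.0, 0.0])
grad = lambda x: Q @ x - b

x0 = np.array([100.0, -50.0])                # arbitrary starting point
x1 = x0 - np.linalg.solve(Q, grad(x0))       # x_{k+1} = x_k - Hessian^{-1} grad
assert np.allclose(x1, np.linalg.solve(Q, b))
```

In practice one solves the linear system, as here, rather than forming the inverse Hessian explicitly.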

  17. Newton's method in 1D
  Example: f(x) = log(e^{x+3} + e^{−2x+2}), starting at x_0 = 0.5.

  [Figure: plot of f for −1 ≤ x ≤ 1 showing the iterates (x_0, f_0), (x_1, f_1), (x_2, f_2) converging to the minimum.]

  example by: L. El Ghaoui, UC Berkeley, EE127a 25-17

  18. Newton's method in 1D
  Example: f(x) = log(e^{x+3} + e^{−2x+2}), starting at x_0 = 1.5. Divergent! x_2 = 2.3 × 10⁶…

  [Figure: plot of f for −30 ≤ x ≤ 30 showing the first step jumping from (x_0, f_0) far away to (x_1, f_1).]

  example by: L. El Ghaoui, UC Berkeley, EE127a 25-18
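Both behaviors can be reproduced. I worked out f′ and f″ by hand from the chain rule: with u = e^{x+3} and v = e^{−2x+2}, f′ = (u − 2v)/(u + v) and f″ = 9uv/(u + v)², so the minimizer solves u = 2v, i.e. x* = (log 2 − 1)/3:

```python
import numpy as np

def newton_step(x):
    """One Newton step x - f'(x)/f''(x) for f(x) = log(e^{x+3} + e^{-2x+2})."""
    u, v = np.exp(x + 3.0), np.exp(-2.0 * x + 2.0)
    return x - (u - 2.0 * v) * (u + v) / (9.0 * u * v)

x = 0.5                      # good start: converges to (log(2) - 1)/3
for _ in range(10):
    x = newton_step(x)

x_bad = newton_step(1.5)     # bad start: the first step overshoots to about
                             # -25.6, and subsequent iterates blow up
```

The quadratic model at x_0 = 1.5 is nearly flat (f″ is tiny there), which is exactly why the step is so large.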

  19. Newton's method
  Advantages
  • It's usually very fast. Converges to the exact optimum in one iteration if the objective is quadratic.
  • It's scale-invariant. The convergence rate is not affected by any linear scaling or transformation of the variables.
  Disadvantages
  • If n is large, storing the Hessian (an n × n matrix) and computing ∇²f(x_k)⁻¹ ∇f(x_k) can be prohibitively expensive.
  • If ∇²f(x_k) ⊁ 0, Newton's method may converge to a local maximum or a saddle point.
  • May fail to converge at all if we start too far from the optimal point. 25-19

  20. Quasi-Newton methods
  • An approximate Newton's method that doesn't require computing the Hessian.
  • Uses an approximation H_k ≈ ∇²f(x_k)⁻¹ that can be updated directly and is faster to compute than the full Hessian:

      x_{k+1} = x_k − H_k ∇f(x_k)
      H_{k+1} = g(H_k, ∇f(x_k), x_k)

  • Several popular update schemes for H_k:
    ◦ DFP (Davidon–Fletcher–Powell)
    ◦ BFGS (Broyden–Fletcher–Goldfarb–Shanno) 25-20
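A sketch of BFGS maintaining the inverse-Hessian approximation H_k. The backtracking line search and the quadratic test problem are my additions (production solvers use Wolfe-condition line searches):

```python
import numpy as np

def bfgs(f, grad, x0, max_iter=100):
    x = np.asarray(x0, dtype=float)
    n = len(x)
    H = np.eye(n)                            # initial guess for (Hessian)^{-1}
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < 1e-8:
            break
        d = -H @ g                           # quasi-Newton direction
        t = 1.0
        while f(x + t * d) > f(x) + 1e-4 * t * (g @ d):
            t *= 0.5                         # backtrack until sufficient decrease
        x_new = x + t * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 1e-12:                    # curvature condition keeps H pos. def.
            rho = 1.0 / (y @ s)
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)   # BFGS inverse update
        x, g = x_new, g_new
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b
x_star = bfgs(f, grad, [0.0, 0.0])           # should satisfy Q x = b
```

The update uses only gradient differences (s, y), never second derivatives, which is the whole point of the method.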
