 
              CSCI 1951-G – Optimization Methods in Finance Part 06: Algorithms for Unconstrained Convex Optimization March 9, 2018 1 / 28
This material is covered in S. Boyd and L. Vandenberghe's book Convex Optimization, https://web.stanford.edu/~boyd/cvxbook/ . Some of the material and the figures are taken from it. 2 / 28
Outline
1 Unconstrained minimization: descent methods
2 Equality constrained minimization: Newton's method
3 General minimization: Interior point methods
3 / 28
Unconstrained minimization
Consider the unconstrained minimization problem:
$$\min f(x)$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is convex and twice continuously differentiable.
$x^*$: optimal solution, with optimal objective value $p^*$.
Necessary and sufficient condition for $x^*$ to be optimal:
$$\nabla f(x^*) = 0$$
The above is a system of $n$ equations in $n$ variables.
Solving $\nabla f(x) = 0$ analytically is often difficult or impossible.
4 / 28
Example: unconstrained geometric program
$$\min f(x) = \ln\left(\sum_{i=1}^{m} \exp(a_i^T x + b_i)\right)$$
$f(x)$ is convex. The optimality condition is
$$0 = \nabla f(x^*) = \frac{1}{\sum_{j=1}^{m} \exp(a_j^T x^* + b_j)} \sum_{i=1}^{m} \exp(a_i^T x^* + b_i)\, a_i$$
which in general has no analytical solution.
5 / 28
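As an illustration of the objective and gradient above, here is a minimal sketch in plain Python. The function names (`affine_terms`, `objective`, `gradient`) and the sample data `A`, `b` are assumptions made for this example; the shift by the largest exponent is a standard numerical-stability trick and does not change the value of either formula.

```python
import math

def affine_terms(A, b, x):
    """Return z_i = a_i^T x + b_i, where a_i is the i-th row of A."""
    return [sum(a_j * x_j for a_j, x_j in zip(a, x)) + b_i
            for a, b_i in zip(A, b)]

def objective(A, b, x):
    """f(x) = ln(sum_i exp(a_i^T x + b_i))."""
    z = affine_terms(A, b, x)
    z_max = max(z)  # shift exponents so the largest one is 0
    return z_max + math.log(sum(math.exp(z_i - z_max) for z_i in z))

def gradient(A, b, x):
    """grad f(x) = sum_i exp(z_i) a_i / sum_j exp(z_j)."""
    z = affine_terms(A, b, x)
    z_max = max(z)
    e = [math.exp(z_i - z_max) for z_i in z]  # common factor cancels in the ratio
    total = sum(e)
    return [sum(e_i * a[j] for e_i, a in zip(e, A)) / total
            for j in range(len(x))]

# Hypothetical data: three terms in R^2.
A = [[1.0, 3.0], [1.0, -3.0], [-1.0, 0.0]]
b = [-0.1, -0.1, -0.1]
grad_at_origin = gradient(A, b, [0.0, 0.0])
```

At the origin all three exponents are equal, so the gradient is just the average of the rows of `A`; it is nonzero, so the origin is not optimal for this data.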
Iterative algorithms
Iterative algorithms for minimization compute a minimizing sequence $x^{(0)}, x^{(1)}, \ldots$ of feasible points s.t.
$$f(x^{(k)}) \to p^* \quad \text{as } k \to \infty$$
The algorithm terminates when
$$f(x^{(k)}) - p^* \le \varepsilon,$$
for a specified tolerance $\varepsilon > 0$.
6 / 28
How to know when to stop?
Consider the sublevel set $S = \{x : f(x) \le f(x^{(0)})\}$.
Additional assumption: $f$ is strongly convex on $S$, i.e., there exists $m > 0$ s.t.
$$\nabla^2 f(x) - mI \succ 0 \quad \text{for all } x \in S,$$
i.e., the difference on the l.h.s. is positive definite.
Consequence:
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2} \|y - x\|_2^2 \quad \text{for all } x, y \in S.$$
(What happens when $f$ is "just" convex?)
7 / 28
Strong convexity gives a stopping rule
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2} \|y - x\|_2^2$$
For any fixed $x$, the r.h.s. is a convex quadratic function $g_x(y)$ of $y$.
Let's find the $y$ for which the r.h.s. is minimal. How? Solve $\nabla g_x(y) = 0$!
Solution: $\tilde{y} = x - \frac{1}{m} \nabla f(x)$
Then, since $g_x(y) \ge g_x(\tilde{y})$ for every $y$:
$$f(y) \ge f(x) + \nabla f(x)^T (\tilde{y} - x) + \frac{m}{2} \|\tilde{y} - x\|_2^2 = f(x) - \frac{1}{2m} \|\nabla f(x)\|_2^2$$
8 / 28
Strong convexity gives a stopping rule
$$f(y) \ge f(x) - \frac{1}{2m} \|\nabla f(x)\|_2^2 \quad \text{for any } x, y \in S$$
For $y = x^*$, the above becomes:
$$p^* \ge f(x) - \frac{1}{2m} \|\nabla f(x)\|_2^2 \quad \text{for any } x \in S$$
Intuition: if $\|\nabla f(x)\|_2^2$ is small, $x$ is nearly optimal.
Suboptimality condition: to guarantee $f(x) - p^* \le \varepsilon$, it suffices that
$$\|\nabla f(x)\|_2 \le \sqrt{2 m \varepsilon}$$
Strong convexity also gives us a bound on $\|x - x^*\|_2$ in terms of $\|\nabla f(x)\|_2$:
$$\|x - x^*\|_2 \le \frac{2}{m} \|\nabla f(x)\|_2$$
9 / 28
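The following sketch checks these gradient-based bounds numerically. The test function $f(x) = x_1^2 + 5 x_2^2$ is an assumption chosen for illustration: its Hessian is diag(2, 10), so $m = 1$ satisfies the strong convexity condition, and $p^* = 0$ is attained at the origin. The helper names are hypothetical as well.

```python
import math

def f(x):
    # Illustrative objective (an assumption): f(x) = x1^2 + 5*x2^2
    return x[0] ** 2 + 5.0 * x[1] ** 2

def grad_f(x):
    return [2.0 * x[0], 10.0 * x[1]]

M_STRONG = 1.0  # strong convexity constant m; Hessian diag(2, 10) minus m*I is positive definite
P_STAR = 0.0    # optimal value, attained at x* = (0, 0)

def suboptimality_bound(grad, m):
    """Upper bound on f(x) - p*: ||grad||_2^2 / (2m)."""
    return sum(g * g for g in grad) / (2.0 * m)

def distance_bound(grad, m):
    """Upper bound on ||x - x*||_2: (2/m) * ||grad||_2."""
    return 2.0 / m * math.sqrt(sum(g * g for g in grad))

def gradient_tolerance(m, eps):
    """Gradient norm threshold that guarantees f(x) - p* <= eps."""
    return math.sqrt(2.0 * m * eps)

x = [1.0, 1.0]
gap = f(x) - P_STAR
gap_bound = suboptimality_bound(grad_f(x), M_STRONG)
```

The bound can be loose (here the true gap is much smaller than `gap_bound`), but it only needs the gradient and $m$, not the unknown $p^*$.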
Descent methods
We now describe algorithms producing a minimizing sequence $(x^{(k)})_{k \ge 1}$ where
$$x^{(k+1)} = x^{(k)} + t^{(k)} \Delta x^{(k)}$$
• $\Delta x^{(k)} \in \mathbb{R}^n$ (vector): step/search direction.
• $t^{(k)} > 0$ (scalar): step size/length.
The algorithms are descent methods, i.e.,
$$f(x^{(k+1)}) < f(x^{(k)})$$
10 / 28
Descent direction
How to choose $\Delta x^{(k)}$ so that $f(x^{(k+1)}) < f(x^{(k)})$?
From convexity we know that
$$\nabla f(x^{(k)})^T (y - x^{(k)}) \ge 0 \;\Rightarrow\; f(y) \ge f(x^{(k)})$$
so $\Delta x^{(k)}$ must satisfy:
$$\nabla f(x^{(k)})^T \Delta x^{(k)} < 0$$
I.e., the angle between $-\nabla f(x^{(k)})$ and $\Delta x^{(k)}$ must be acute.
Such a direction is known as a descent direction.
11 / 28
General descent method
input: function $f$, starting point $x$
repeat
  1 Determine a descent direction $\Delta x$;
  2 Line search: choose a step size $t \ge 0$;
  3 Update: $x \leftarrow x + t \Delta x$;
until stopping criterion is satisfied
Step 2 is called line search because it determines where on the ray $\{x + t \Delta x : t \ge 0\}$ the next iterate will be.
12 / 28
Exact line search
Choose $t$ to minimize $f$ along the ray $\{x + t \Delta x : t \ge 0\}$:
$$t = \arg\min_{s \ge 0} f(x + s \Delta x)$$
Useful when the cost of the above minimization problem is low compared to the cost of computing $\Delta x$ (e.g., when it has an analytical solution).
13 / 28
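One case where the exact line search has an analytical solution is a convex quadratic. Assuming $f(x) = \frac{1}{2} x^T Q x + q^T x$ with $Q$ positive definite (this objective and the names `Q`, `q`, `exact_step_quadratic` are assumptions for illustration), setting the derivative of $f(x + s \Delta x)$ with respect to $s$ to zero gives $s = -\nabla f(x)^T \Delta x / (\Delta x^T Q \Delta x)$. A minimal sketch:

```python
def dot(u, v):
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def matvec(Q, v):
    return [dot(row, v) for row in Q]

def exact_step_quadratic(Q, q, x, dx):
    """Exact line search step for f(x) = 0.5 x^T Q x + q^T x along dx."""
    grad = [g_i + q_i for g_i, q_i in zip(matvec(Q, x), q)]  # grad f(x) = Qx + q
    slope = dot(grad, dx)           # directional derivative at s = 0
    curvature = dot(dx, matvec(Q, dx))  # positive when Q is positive definite
    return max(0.0, -slope / curvature)  # clamp to the ray s >= 0

# Hypothetical data: Q = diag(1, 10), q = 0, step along the negative gradient.
Q = [[1.0, 0.0], [0.0, 10.0]]
q = [0.0, 0.0]
x = [10.0, 1.0]
dx = [-10.0, -10.0]  # equals -grad f(x) at this x
t = exact_step_quadratic(Q, q, x, dx)
x_next = [x_i + t * d_i for x_i, d_i in zip(x, dx)]
```

At `x_next` the gradient is orthogonal to `dx`, which is the first-order condition for the one-dimensional minimization.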
Backtracking line search
Most line searches are inexact: they approximately minimize $f$ along the ray $\{x + t \Delta x : t \ge 0\}$.
Backtracking line search:
input: descent direction $\Delta x$ for $f$ at $x$, $\alpha \in (0, 0.5)$, $\beta \in (0, 1)$
$t \leftarrow 1$
while $f(x + t \Delta x) > f(x) + \alpha t \nabla f(x)^T \Delta x$
  $t \leftarrow \beta t$
end
"Backtracking": starts with a large $t$ and iteratively shrinks it.
14 / 28
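The backtracking loop translates almost line by line into code. This is a minimal sketch in plain Python; the function name, the list-based vectors, and the default values of `alpha` and `beta` are assumptions for illustration.

```python
def backtracking_line_search(f, grad_fx, x, dx, alpha=0.1, beta=0.7):
    """Return a step size t satisfying f(x + t*dx) <= f(x) + alpha*t*grad^T dx.

    f       : callable taking a list of floats
    grad_fx : gradient of f at x (list of floats)
    dx      : descent direction at x, i.e. grad^T dx < 0
    """
    slope = sum(g_i * d_i for g_i, d_i in zip(grad_fx, dx))
    if slope >= 0.0:
        raise ValueError("dx is not a descent direction")
    fx = f(x)
    t = 1.0
    # Shrink t until the sufficient-decrease condition holds.
    while f([x_i + t * d_i for x_i, d_i in zip(x, dx)]) > fx + alpha * t * slope:
        t *= beta
    return t

# Hypothetical one-dimensional example: f(x) = x^2 at x = 1, stepping along -grad.
square = lambda x: x[0] ** 2
t = backtracking_line_search(square, [2.0], [1.0], [-2.0])
```

In this example the full step $t = 1$ overshoots to $x = -1$ with no decrease, so it is rejected, and the first shrunken step is accepted.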
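Why does backtracking line search terminate?
For small $t$,
$$f(x + t \Delta x) \approx f(x) + t \nabla f(x)^T \Delta x$$
It holds that
$$f(x) + t \nabla f(x)^T \Delta x < f(x) + \alpha t \nabla f(x)^T \Delta x$$
because $\alpha < 1$ and $\nabla f(x)^T \Delta x < 0$, since $\Delta x$ is a descent direction. Hence the exit condition of the loop is satisfied once $t$ is small enough.
15 / 28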
Visualization
[Figure 9.1: Backtracking line search. Plotted against $t$: the curve $f(x + t \Delta x)$ and the lines $f(x) + t \nabla f(x)^T \Delta x$ and $f(x) + \alpha t \nabla f(x)^T \Delta x$, with $t = 0$ and $t_0$ marked on the axis.]
Figure 9.1 Backtracking line search. The curve shows $f$, restricted to the line over which we search. The lower dashed line shows the linear extrapolation of $f$, and the upper dashed line has a slope a factor of $\alpha$ smaller. The backtracking condition is that $f$ lies below the upper dashed line, i.e., $0 \le t \le t_0$.
16 / 28
Gradient descent method
input: function $f$, starting point $x$
repeat
  1 $\Delta x \leftarrow -\nabla f(x)$;
  2 Line search: choose a step size $t \ge 0$ via exact or backtracking line search;
  3 Update: $x \leftarrow x + t \Delta x$;
until stopping criterion is satisfied (e.g., $\|\nabla f(x)\|_2 \le \eta$)
17 / 28
Example
$$\min f(x_1, x_2) = e^{x_1 + 3 x_2 - 0.1} + e^{x_1 - 3 x_2 - 0.1} + e^{-x_1 - 0.1}$$
Let's solve it with gradient descent and backtracking line search with $\alpha = 0.1$ and $\beta = 0.7$.
[Figure 9.3: level curves of $f$ with the iterates $x^{(0)}, x^{(1)}, x^{(2)}, \ldots$]
Figure 9.3 Iterates of the gradient method with backtracking line search, for the problem in $\mathbb{R}^2$ with objective $f$ given in (9.20). The dashed curves are level curves of $f$, and the small circles are the iterates of the gradient method. The solid lines, which connect successive iterates, show the scaled steps $t^{(k)} \Delta x^{(k)}$.
The lines connecting successive iterates show the scaled steps:
$$x^{(k+1)} - x^{(k)} = -t^{(k)} \nabla f(x^{(k)})$$
18 / 28
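The example can be reproduced with a few lines of code. This is a minimal sketch of gradient descent with backtracking line search using $\alpha = 0.1$ and $\beta = 0.7$; the starting point $(-1, 1)$, the tolerance `eta`, the iteration cap, and all function names are assumptions, since the slides do not state them.

```python
import math

def f(x):
    x1, x2 = x
    return (math.exp(x1 + 3 * x2 - 0.1) + math.exp(x1 - 3 * x2 - 0.1)
            + math.exp(-x1 - 0.1))

def grad_f(x):
    x1, x2 = x
    e1 = math.exp(x1 + 3 * x2 - 0.1)
    e2 = math.exp(x1 - 3 * x2 - 0.1)
    e3 = math.exp(-x1 - 0.1)
    return [e1 + e2 - e3, 3 * e1 - 3 * e2]

def gradient_descent(f, grad_f, x0, alpha=0.1, beta=0.7, eta=1e-6, max_iter=1000):
    """Gradient descent with backtracking; stops when ||grad f(x)||_2 <= eta."""
    x = list(x0)
    for k in range(max_iter):
        g = grad_f(x)
        if math.sqrt(sum(g_i * g_i for g_i in g)) <= eta:
            return x, k
        dx = [-g_i for g_i in g]           # step direction: negative gradient
        slope = -sum(g_i * g_i for g_i in g)  # grad^T dx, always negative here
        fx = f(x)
        t = 1.0
        while f([x_i + t * d_i for x_i, d_i in zip(x, dx)]) > fx + alpha * t * slope:
            t *= beta                      # backtrack
        x = [x_i + t * d_i for x_i, d_i in zip(x, dx)]
    return x, max_iter

x_min, iterations = gradient_descent(f, grad_f, [-1.0, 1.0])
```

For this objective the minimizer can be found by hand: symmetry gives $x_2^* = 0$, and setting the partial derivative in $x_1$ to zero gives $x_1^* = -\frac{1}{2}\ln 2$, with $p^* = 2\sqrt{2}\, e^{-0.1}$. The returned `x_min` can be compared against these values.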
Example
[Figure 9.5: level curves of $f$ with the iterates $x^{(0)}, x^{(1)}, \ldots$]
Figure 9.5 Iterates of the gradient method with exact line search for the problem in $\mathbb{R}^2$ with objective $f$ given in (9.20).
19 / 28
Example
[Figure 9.4: log-scale plot of $f(x^{(k)}) - p^\star$ versus iteration $k$, with one curve for backtracking line search and one for exact line search.]
Figure 9.4 Error $f(x^{(k)}) - p^\star$ versus iteration $k$ of the gradient method with backtracking and exact line search, for the problem in $\mathbb{R}^2$ with objective $f$ given in (9.20). The plot shows nearly linear convergence, with the error reduced approximately by the factor 0.4 in each iteration of the gradient method with backtracking line search, and by the factor 0.2 in each iteration of the gradient method with exact line search.
20 / 28
Convergence analysis
Fact: if $f$ is strongly convex on $S$, then there exists $M \in \mathbb{R}_+$ s.t. $\nabla^2 f(x) \preceq MI$ for all $x \in S$.
Convergence of gradient descent (with exact line search): Let $\varepsilon > 0$. Let
$$k \ge \frac{\log\left(\frac{f(x^{(0)}) - p^*}{\varepsilon}\right)}{-\log\left(1 - \frac{m}{M}\right)}$$
After $k$ iterations it must hold that
$$f(x^{(k)}) - p^* \le \varepsilon$$
More interpretable bound:
$$f(x^{(k)}) - p^* \le \left(1 - \frac{m}{M}\right)^k \left(f(x^{(0)}) - p^*\right)$$
I.e., the error converges to 0 at least as fast as a geometric series (linear convergence: the bound is a straight line on a log-linear plot).
21 / 28
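To get a feel for the iteration bound, the sketch below evaluates it for given constants. The function name and the sample values ($m = 1$, $M = 10$, initial gap 1, $\varepsilon = 10^{-6}$) are assumptions for illustration, and the contraction factor $1 - m/M$ is taken from the bound as stated here.

```python
import math

def iterations_needed(initial_gap, eps, m, M):
    """Smallest integer k with (1 - m/M)^k * initial_gap <= eps."""
    if initial_gap <= eps:
        return 0
    c = 1.0 - m / M  # contraction factor per iteration
    if c == 0.0:
        return 1     # m == M: the bound is already zero after one iteration
    return math.ceil(math.log(initial_gap / eps) / -math.log(c))

k = iterations_needed(initial_gap=1.0, eps=1e-6, m=1.0, M=10.0)
```

Because $-\log(1 - m/M) \approx m/M$ when $m/M$ is small, the required number of iterations grows roughly in proportion to $M/m$.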
Steepest descent method
We saw that gradient descent may converge very slowly if $M/m$ is large.
Is the gradient the best descent direction to take (and in what sense)?
First-order Taylor approximation of $f(x + v)$ around $x$:
$$f(x + v) \approx f(x) + \nabla f(x)^T v$$
$\nabla f(x)^T v$ is the directional derivative of $f$ at $x$ in the direction $v$.
22 / 28
Steepest descent method
$v$ is a descent direction if the directional derivative $\nabla f(x)^T v$ is negative.
How to choose $v$ to make the directional derivative as negative as possible?
Since $\nabla f(x)^T v$ is linear in $v$, we must restrict the choice of $v$ somehow (otherwise we could just keep growing the magnitude of $v$).
Let $\|\cdot\|$ be any norm on $\mathbb{R}^n$.
Normalized steepest descent direction w.r.t. $\|\cdot\|$:
$$\Delta x_{\text{nsd}} = \arg\min \{\nabla f(x)^T v : \|v\| = 1\}$$
It gives the largest decrease in the linear approximation of $f$.
23 / 28
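Example
If $\|\cdot\|$ is the Euclidean norm, then
$$\Delta x_{\text{nsd}} = -\frac{\nabla f(x)}{\|\nabla f(x)\|_2}$$
i.e., the steepest descent direction is the negative gradient $-\nabla f(x)$, scaled to unit length.
24 / 28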
Example
Consider the quadratic norm
$$\|z\|_P = (z^T P z)^{1/2} = \|P^{1/2} z\|_2$$
where $P$ is positive definite.
The normalized steepest descent direction is
$$\Delta x_{\text{nsd}} = -\left(\nabla f(x)^T P^{-1} \nabla f(x)\right)^{-1/2} P^{-1} \nabla f(x)$$
i.e., the step $v = -P^{-1} \nabla f(x)$ scaled to have unit $\|\cdot\|_P$ norm.
25 / 28
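A small numerical check of the quadratic-norm formula, as a sketch restricted to the 2-by-2 case so that $P^{-1} \nabla f(x)$ can be computed in closed form. The helper names and the sample $P$ and gradient are assumptions for illustration.

```python
import math

def dot(u, v):
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def solve_2x2(P, g):
    """Solve P w = g for a 2x2 matrix P via Cramer's rule."""
    (a, b), (c, d) = P
    det = a * d - b * c
    return [(d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det]

def p_norm(P, z):
    """Quadratic norm ||z||_P = sqrt(z^T P z)."""
    Pz = [dot(row, z) for row in P]
    return math.sqrt(dot(z, Pz))

def normalized_steepest_descent(P, grad):
    """-(grad^T P^{-1} grad)^{-1/2} P^{-1} grad, for positive definite 2x2 P."""
    w = solve_2x2(P, grad)            # w = P^{-1} grad
    scale = math.sqrt(dot(grad, w))   # (grad^T P^{-1} grad)^{1/2}
    return [-w_i / scale for w_i in w]

# Hypothetical data.
P = [[2.0, 0.0], [0.0, 8.0]]
grad = [2.0, 4.0]
dx_nsd = normalized_steepest_descent(P, grad)
```

The result has unit $\|\cdot\|_P$ norm, and its directional derivative is more negative than that of the negative gradient rescaled to the same norm, which is what "steepest" means here.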
Geometric interpretation
[Figure 9.9: an ellipsoid centered at $x$, with the vectors $-\nabla f(x)$ and $\Delta x_{\text{nsd}}$ drawn from $x$.]
Figure 9.9 Normalized steepest descent direction for a quadratic norm. The ellipsoid shown is the unit ball of the norm, translated to the point $x$. The normalized steepest descent direction $\Delta x_{\text{nsd}}$ at $x$ extends as far as possible in the direction $-\nabla f(x)$ while staying in the ellipsoid. The gradient and normalized steepest descent directions are shown.
26 / 28