CSCI 1951-G Optimization Methods in Finance
Part 06: Algorithms for Unconstrained Convex Optimization



CSCI 1951-G – Optimization Methods in Finance Part 06: Algorithms for Unconstrained Convex Optimization

March 9, 2018

1 / 28


This material is covered in S. Boyd, L. Vandenberghe's book Convex Optimization https://web.stanford.edu/~boyd/cvxbook/. Some of the material and the figures are taken from it.

2 / 28


Outline

1. Unconstrained minimization: descent methods
2. Equality constrained minimization: Newton's method
3. General minimization: Interior point methods

3 / 28


Unconstrained minimization

Consider the unconstrained minimization problem:

min f(x)

where f : R^n → R is convex and twice continuously differentiable. x*: optimal solution with optimal obj. value p*.

Necessary and sufficient condition for x* to be optimal:

∇f(x*) = 0

The above is a system of n equations in n variables. Solving ∇f(x) = 0 analytically is often not easy or not possible.

4 / 28


Example: unconstrained geometric program

min f(x) = ln Σ_{i=1}^m exp(a_i^T x + b_i)

f(x) is convex.

The optimality condition is

0 = ∇f(x*) = (1 / Σ_{j=1}^m exp(a_j^T x* + b_j)) Σ_{i=1}^m exp(a_i^T x* + b_i) a_i

which in general has no analytical solution.

5 / 28


Iterative algorithms

Iterative algorithms for minimization compute a minimizing sequence x(0), x(1), . . . of feasible points s.t.

f(x(k)) → p* as k → ∞

The algorithm terminates when f(x(k)) − p* ≤ ε, for a specified tolerance ε > 0.

6 / 28


How to know when to stop?

Consider the sublevel set S = {x : f(x) ≤ f(x(0))}. Additional assumption: f is strongly convex on S, i.e., there exists m > 0 s.t.

∇²f(x) ⪰ mI for all x ∈ S

i.e., the difference ∇²f(x) − mI is positive semidefinite. Consequence:

f(y) ≥ f(x) + ∇f(x)^T(y − x) + (m/2)‖y − x‖₂²

for all x and y in S.

(What happens when f is “just” convex?)

7 / 28


Strong convexity gives a stopping rule

f(y) ≥ f(x) + ∇f(x)^T(y − x) + (m/2)‖y − x‖₂²

For any fixed x, the r.h.s. is a convex quadratic function g_x(y) of y. Let's find the y for which the r.h.s. is minimal. How? Solve ∇g_x(y) = 0! Solution:

ỹ = x − (1/m)∇f(x)

Then:

f(y) ≥ f(x) + ∇f(x)^T(ỹ − x) + (m/2)‖ỹ − x‖₂² = f(x) − (1/(2m))‖∇f(x)‖₂²

8 / 28


Strong convexity gives a stopping rule

f(y) ≥ f(x) − (1/(2m))‖∇f(x)‖₂² for any x and y in S

For y = x*, the above becomes:

p* ≥ f(x) − (1/(2m))‖∇f(x)‖₂² for any x ∈ S

Intuition: if ‖∇f(x)‖₂² is small, x is nearly optimal.

Suboptimality condition: if ‖∇f(x)‖₂ ≤ √(2mε), then f(x) − p* ≤ ε.

Strong convexity also gives us a bound on ‖x − x*‖₂ in terms of ‖∇f(x)‖₂:

‖x − x*‖₂ ≤ (2/m)‖∇f(x)‖₂
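A quick numerical illustration of the suboptimality bound, on a made-up one-dimensional example where m and p* are known exactly (for a quadratic the bound is in fact tight):

```python
# f(x) = (x - 3)^2 is strongly convex with f''(x) = 2, so m = 2, and p* = 0.
def f(x):
    return (x - 3.0) ** 2

def grad(x):
    return 2.0 * (x - 3.0)

m = 2.0
p_star = 0.0
for x in [-5.0, 0.0, 2.9, 10.0]:
    # p* >= f(x) - (1/(2m)) * ||∇f(x)||^2
    lower = f(x) - grad(x) ** 2 / (2.0 * m)
    assert p_star >= lower - 1e-12
    # and with eps = ||∇f(x)||^2 / (2m), indeed f(x) - p* <= eps
    eps = grad(x) ** 2 / (2.0 * m)
    assert f(x) - p_star <= eps + 1e-12
```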

9 / 28


Descent methods

We now describe algorithms producing a minimizing sequence (x(k))_{k≥1} where

x(k+1) = x(k) + t(k)∆x(k)

  • ∆x(k) ∈ Rn (vector): step/search direction.
  • t(k) > 0 (scalar): step size/length.

The algorithms are descent methods, i.e., f(x(k+1)) < f(x(k))

10 / 28


Descent direction

How to choose ∆x(k) so that f(x(k+1)) < f(x(k))? From convexity we know that

∇f(x(k))^T(y − x(k)) ≥ 0 ⇒ f(y) ≥ f(x(k))

so ∆x(k) must satisfy:

∇f(x(k))^T∆x(k) < 0

I.e., the angle between −∇f(x(k)) and ∆x(k) must be acute. Such a direction is known as a descent direction.

11 / 28


General descent method

input: function f, starting point x
repeat
  1. Determine a descent direction ∆x;
  2. Line search: choose a step size t ≥ 0;
  3. Update: x ← x + t∆x;
until stopping criterion is satisfied

Step 2 is called line search because it determines where on the ray {x + t∆x : t ≥ 0} the next iterate will be.

12 / 28


Exact line search

Choose t to minimize f along the ray {x + t∆x : t ≥ 0}:

t = arg min_{s≥0} f(x + s∆x)

Useful when the cost of the above minimization problem is low w.r.t. computing ∆x (e.g., analytical solution)

13 / 28


Backtracking line search

Most line searches are inexact: they approximately minimize f along the ray {x + t∆x : t ≥ 0}.

Backtracking line search:
input: descent direction ∆x for f at x, α ∈ (0, 0.5), β ∈ (0, 1)
t ← 1
while f(x + t∆x) > f(x) + αt∇f(x)^T∆x
  t ← βt
end

"Backtracking": starts with large t and iteratively shrinks it.
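The pseudocode above translates directly into Python. A minimal sketch; the quadratic test function and evaluation point are made up:

```python
def backtracking(f, x, dx, grad_x, alpha=0.1, beta=0.7):
    """Backtracking line search: shrink t until
    f(x + t*dx) <= f(x) + alpha*t*∇f(x)^T ∆x holds."""
    fx = f(x)
    slope = sum(g * d for g, d in zip(grad_x, dx))  # ∇f(x)^T ∆x, negative
    t = 1.0
    while f([xi + t * di for xi, di in zip(x, dx)]) > fx + alpha * t * slope:
        t *= beta
    return t

# Example: f(x) = x1^2 + x2^2 at x = (1, 1), with ∆x = -∇f(x) = (-2, -2).
f = lambda x: x[0] ** 2 + x[1] ** 2
x = [1.0, 1.0]
g = [2.0, 2.0]
dx = [-2.0, -2.0]
t = backtracking(f, x, dx, g)
# The returned t satisfies the stopping condition of the search.
assert f([xi + t * di for xi, di in zip(x, dx)]) <= f(x) + 0.1 * t * (-8.0)
```

Here t = 1 overshoots (the full step lands at (-1, -1), where f is unchanged), so one shrink to t = 0.7 is needed.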

14 / 28


Why does backtracking line search terminate?

For small t, f(x + t∆x) ≈ f(x) + t∇f(x)^T∆x. It holds that f(x) + t∇f(x)^T∆x < f(x) + αt∇f(x)^T∆x because α < 1 and ∇f(x)^T∆x < 0, since ∆x is a descent direction. Hence the loop condition eventually fails for small enough t.

15 / 28


Visualization

Figure 9.1: Backtracking line search. The curve shows f, restricted to the line over which we search. The lower dashed line shows the linear extrapolation of f, and the upper dashed line has a slope a factor of α smaller. The backtracking condition is that f lies below the upper dashed line, i.e., 0 ≤ t ≤ t0.

16 / 28


Gradient descent method

input: function f, starting point x
repeat
  1. ∆x ← −∇f(x);
  2. Line search: choose a step size t ≥ 0 via exact or backtracking line search;
  3. Update: x ← x + t∆x;
until stopping criterion is satisfied (e.g., ‖∇f(x)‖₂ ≤ η)
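The whole loop fits in a few lines of Python. A minimal sketch with backtracking line search and the gradient-norm stopping criterion; the quadratic test problem is made up:

```python
def gradient_descent(f, grad, x, eta=1e-8, alpha=0.1, beta=0.7, max_iter=1000):
    """Gradient descent with backtracking line search.
    Stops when ||∇f(x)||_2 <= eta."""
    for _ in range(max_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 <= eta:
            break
        dx = [-gi for gi in g]                    # ∆x = -∇f(x)
        fx = f(x)
        slope = sum(gi * di for gi, di in zip(g, dx))
        t = 1.0                                   # backtracking line search
        while f([xi + t * di for xi, di in zip(x, dx)]) > fx + alpha * t * slope:
            t *= beta
        x = [xi + t * di for xi, di in zip(x, dx)]
    return x

# Example: minimize f(x) = (x1 - 1)^2 + 2*(x2 + 3)^2, minimizer (1, -3).
f = lambda x: (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 3.0) ** 2
grad = lambda x: [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)]
x_opt = gradient_descent(f, grad, [10.0, 10.0])
assert abs(x_opt[0] - 1.0) < 1e-6 and abs(x_opt[1] + 3.0) < 1e-6
```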

17 / 28


Example

min f(x1, x2) = e^{x1+3x2−0.1} + e^{x1−3x2−0.1} + e^{−x1−0.1}

Let's solve it with gradient descent and backtracking line search with α = 0.1 and β = 0.7.

Figure 9.3: Iterates of the gradient method with backtracking line search, for the problem in R² with objective f given in (9.20). The dashed curves are level curves of f, and the small circles are the iterates of the gradient method. The solid lines, which connect successive iterates, show the scaled steps t(k)∆x(k).

The lines connecting successive iterates show the scaled steps:

x(k+1) − x(k) = −t(k)∇f(x(k))
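This run can be reproduced numerically. A self-contained sketch with the course's α = 0.1, β = 0.7; the starting point is made up. By symmetry in x2, the minimizer is x2* = 0 and x1* = −ln(2)/2 ≈ −0.3466, with p* ≈ 2.5593:

```python
import math

def f(x):
    return (math.exp(x[0] + 3 * x[1] - 0.1)
            + math.exp(x[0] - 3 * x[1] - 0.1)
            + math.exp(-x[0] - 0.1))

def grad(x):
    e1 = math.exp(x[0] + 3 * x[1] - 0.1)
    e2 = math.exp(x[0] - 3 * x[1] - 0.1)
    e3 = math.exp(-x[0] - 0.1)
    return [e1 + e2 - e3, 3 * e1 - 3 * e2]

alpha, beta = 0.1, 0.7
x = [-1.0, 1.0]                      # made-up starting point
for _ in range(500):
    g = grad(x)
    if sum(gi * gi for gi in g) ** 0.5 <= 1e-6:
        break
    dx = [-gi for gi in g]
    fx, slope = f(x), -sum(gi * gi for gi in g)
    t = 1.0
    while f([xi + t * di for xi, di in zip(x, dx)]) > fx + alpha * t * slope:
        t *= beta
    x = [xi + t * di for xi, di in zip(x, dx)]

assert abs(x[1]) < 1e-4
assert abs(x[0] + math.log(2) / 2) < 1e-4
```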

18 / 28


Example

Figure 9.5: Iterates of the gradient method with exact line search for the problem in R² with objective f given in (9.20).

19 / 28


Example

Figure 9.4: Error f(x(k)) − p⋆ versus iteration k of the gradient method with backtracking and exact line search, for the problem in R² with objective f given in (9.20). The plot shows nearly linear convergence, with the error reduced approximately by the factor 0.4 in each iteration of the gradient method with backtracking line search, and by the factor 0.2 in each iteration of the gradient method with exact line search.

20 / 28


Convergence analysis

Fact: if f is strongly convex on S, then ∃M ∈ R+ s.t. ∇²f(x) ⪯ MI for all x ∈ S.

Convergence of gradient descent: Let ε > 0. Let

k ≥ log((f(x(0)) − p*)/ε) / (−log(1 − m/M))

After k iterations it must hold that

f(x(k)) − p* ≤ ε

More interpretable bound:

f(x(k)) − p* ≤ (1 − m/M)^k (f(x(0)) − p*)

I.e., the error converges to 0 at least as fast as a geometric series (linear convergence: a straight line on a log-linear plot).
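The iteration bound is easy to evaluate. A small sketch showing how the number of iterations guaranteed by the bound grows with the condition number M/m (the sample gap and tolerance values are made up):

```python
import math

def iters_needed(gap0, eps, m, M):
    # k >= log((f(x(0)) - p*)/eps) / (-log(1 - m/M))
    return math.ceil(math.log(gap0 / eps) / (-math.log(1.0 - m / M)))

# Well-conditioned problem (M/m = 2): a few dozen iterations suffice.
assert iters_needed(1.0, 1e-6, 1.0, 2.0) == 20
# Ill-conditioned problem (M/m = 100): the guarantee degrades sharply.
assert iters_needed(1.0, 1e-6, 1.0, 100.0) > 1000
```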

21 / 28


Steepest descent method

We saw that gradient descent may converge very slowly if M/m is large. Is the gradient the best descent direction to take (and in what sense)? First-order Taylor approximation of f(x + v) around x:

f(x + v) ≈ f(x) + ∇f(x)^Tv

∇f(x)^Tv is the directional derivative of f at x in the direction v.

22 / 28


Steepest descent method

v is a descent direction if the directional derivative ∇f(x)^Tv is negative. How to choose v to make the directional derivative as negative as possible? Since ∇f(x)^Tv is linear in v, we must restrict the choice of v somehow (otherwise we could just keep growing the magnitude of v). Let ‖·‖ be any norm on R^n. Normalized steepest descent direction w.r.t. ‖·‖:

∆x_nsd = arg min{∇f(x)^Tv : ‖v‖ = 1}

It gives the largest decrease in the linear approximation of f.

23 / 28


Example

If ‖·‖ is the Euclidean norm, then ∆x_nsd = −∇f(x)/‖∇f(x)‖₂, i.e., steepest descent coincides (up to scaling) with gradient descent.

24 / 28


Example

Consider the quadratic norm ‖z‖_P = (z^T P z)^{1/2} = ‖P^{1/2}z‖₂, where P is positive definite. The normalized steepest descent direction is

∆x_nsd = −(∇f(x)^T P^{−1} ∇f(x))^{−1/2} P^{−1}∇f(x)

corresponding to the (unnormalized) step v = −P^{−1}∇f(x).
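A small numeric check of the formula in 2-D (plain Python; P and the gradient are made up): the resulting direction has unit P-norm and negative directional derivative, as it should.

```python
# P = [[2, 0], [0, 1]] is positive definite, so P^{-1} = [[0.5, 0], [0, 1]].
P = [[2.0, 0.0], [0.0, 1.0]]
P_inv = [[0.5, 0.0], [0.0, 1.0]]
g = [2.0, 2.0]                      # a made-up gradient ∇f(x)

# ∆x_nsd = -(g^T P^{-1} g)^{-1/2} * P^{-1} g
Pg = [sum(P_inv[i][j] * g[j] for j in range(2)) for i in range(2)]
scale = sum(g[i] * Pg[i] for i in range(2)) ** -0.5
dx_nsd = [-scale * Pg[i] for i in range(2)]

# Unit P-norm: dx^T P dx = 1.
norm_P_sq = sum(dx_nsd[i] * P[i][j] * dx_nsd[j]
                for i in range(2) for j in range(2))
assert abs(norm_P_sq - 1.0) < 1e-12
# Descent: ∇f(x)^T ∆x_nsd < 0.
assert sum(g[i] * dx_nsd[i] for i in range(2)) < 0
```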

25 / 28


Geometric interpretation

Figure 9.9: Normalized steepest descent direction for a quadratic norm. The ellipsoid shown is the unit ball of the norm, translated to the point x. The normalized steepest descent direction ∆x_nsd at x extends as far as possible in the direction −∇f(x) while staying in the ellipsoid. The gradient and normalized steepest descent directions are shown.

26 / 28


Coordinate-descent

Let ‖·‖ be the ℓ1 norm. Let i be any index for which ‖∇f(x)‖_∞ = |(∇f(x))_i|. Then

∆x_nsd = −sign(∂f(x)/∂x_i) e_i

where e_i is the ith standard basis vector. Thus, only one component of x is going to change! This can greatly simplify the line search step.
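A sketch of the ℓ1 steepest descent (coordinate descent) direction in plain Python; the gradient values below are made up:

```python
def nsd_l1(grad):
    """ℓ1-norm normalized steepest descent: move along the single coordinate
    with the largest-magnitude partial derivative, opposite to its sign."""
    i = max(range(len(grad)), key=lambda j: abs(grad[j]))
    e = [0.0] * len(grad)
    e[i] = -1.0 if grad[i] > 0 else 1.0   # -sign((∇f(x))_i) e_i
    return e

# Only one component of the iterate changes per step.
assert nsd_l1([3.0, -7.0, 2.0]) == [0.0, 1.0, 0.0]
assert nsd_l1([0.5, 0.2]) == [-1.0, 0.0]
```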

27 / 28


Geometric interpretation

Figure 9.10: Normalized steepest descent direction for the ℓ1-norm. The diamond is the unit ball of the ℓ1-norm, translated to the point x. The normalized steepest descent direction can always be chosen in the direction of a standard basis vector; in this example we have ∆x_nsd = e1.

28 / 28