Algorithms for unconstrained local optimization - Fabio Schoen, 2008 - PowerPoint PPT Presentation



SLIDE 1

Algorithms for unconstrained local optimization

Fabio Schoen 2008

http://gol.dsi.unifi.it/users/schoen

Algorithms for unconstrained local optimization – p.

SLIDE 2

Optimization Algorithms

Most common form for optimization algorithms: line search-based methods. Given a starting point x0, a sequence is generated by

xk+1 = xk + αk dk

where dk ∈ Rn is the search direction and αk > 0 is the step size. Usually dk is chosen first, and then the step is obtained, often from a one-dimensional optimization.
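The scheme can be sketched in a few lines (a minimal Python illustration, not from the slides; the step rule here is a fixed constant and the quadratic test function is an arbitrary choice):

```python
import numpy as np

def line_search_method(f, grad, x0, step, tol=1e-8, max_iter=1000):
    """Generic line-search scheme: x_{k+1} = x_k + alpha_k * d_k."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = -g                          # here: steepest descent direction
        alpha = step(f, grad, x, d)     # step rule / 1-D optimization
        x = x + alpha * d
    return x

# Toy quadratic f(x) = 0.5 x^T x with a fixed small step:
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x_star = line_search_method(f, grad, [3.0, -4.0], step=lambda *a: 0.5)
```

With this fixed step the error halves at every iteration, so the iterates converge to the minimizer at the origin.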

SLIDE 3

Trust-region algorithms

A model m(x) and a confidence region U(xk) containing xk are defined. The new iterate is chosen as the solution of the constrained optimization problem

min_{x ∈ U(xk)} m(x)

The model and the confidence region are possibly updated at each iteration.

SLIDE 4

Speed measures

Let x⋆ be a local optimum. The error in xk might be measured, e.g., as e(xk) = ∥xk − x⋆∥ or e(xk) = |f(xk) − f(x⋆)|. Given {xk} → x⋆, if ∃ q > 0, β ∈ (0, 1) such that (for k large enough) e(xk) ≤ qβ^k, then {xk} is linearly convergent, or converges with order 1; β is the convergence rate. A sufficient condition for linear convergence:

lim sup e(xk+1)/e(xk) ≤ β

SLIDE 5

super–linear convergence

If for every β ∈ (0, 1) there exists q such that e(xk) ≤ qβ^k, then convergence is super-linear. Sufficient condition: lim sup e(xk+1)/e(xk) = 0

SLIDE 6

Higher order convergence

If, given p > 1, ∃ q > 0, β ∈ (0, 1) : e(xk) ≤ qβ^(p^k), then {xk} is said to converge with order at least p. If p = 2 ⇒ quadratic convergence. Sufficient condition: lim sup e(xk+1)/e(xk)^p < ∞

SLIDE 7

Examples

1/k converges to 0 with order 1 (linear convergence)
1/k² converges to 0 with order 1
2^(−k) converges to 0 with order 1
k^(−k) converges to 0 with order 1; convergence is super-linear
1/2^(2^k) converges to 0 with order 2: quadratic convergence

(Slides 7–11 build this list one item at a time.)

SLIDE 12

Descent directions and the gradient

Let f ∈ C1(Rn), xk ∈ Rn with ∇f(xk) ≠ 0, and let d ∈ Rn. If dT∇f(xk) < 0 then d is a descent direction. Taylor expansion:

f(xk + αd) − f(xk) = α dT∇f(xk) + o(α)
(f(xk + αd) − f(xk))/α = dT∇f(xk) + o(1)

Thus if α is small enough, f(xk + αd) − f(xk) < 0. NB: d might be a descent direction even if dT∇f(xk) = 0.
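A quick numerical check of this fact (an illustrative sketch; the function and the points are chosen here, not taken from the slides):

```python
import numpy as np

# f(x, y) = x^2 + 3 y^2; at xk = (1, 1) the gradient is (2, 6)
f = lambda x: x[0]**2 + 3 * x[1]**2
g = np.array([2.0, 6.0])
xk = np.array([1.0, 1.0])

d = np.array([-1.0, 0.0])          # dT grad = -2 < 0: a descent direction
assert d @ g < 0
alpha = 1e-3                        # a small enough step
assert f(xk + alpha * d) < f(xk)   # f decreases along d
```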

SLIDE 13

Convergence of line search methods

If a sequence xk+1 = xk + αkdk is generated in such a way that:

  • L0 = {x : f(x) ≤ f(x0)} is compact
  • dk ≠ 0 whenever ∇f(xk) ≠ 0
  • f(xk+1) ≤ f(xk) if ∇f(xk) ≠ 0, ∀ k
  • lim_{k→∞} dkT∇f(xk) / ∥dk∥ = 0

SLIDE 14

and if dk ≠ 0 then

|dkT∇f(xk)| / ∥dk∥ ≥ σ(∥∇f(xk)∥)

where σ is such that limk→∞ σ(tk) = 0 ⇒ limk→∞ tk = 0 (σ is called a forcing function).

SLIDE 15

Then either there exists a finite index k̄ such that ∇f(xk̄) = 0, or otherwise:

  • xk ∈ L0 and all of its limit points are in L0
  • {f(xk)} admits a limit
  • limk→∞ ∥∇f(xk)∥ = 0
  • for every limit point x̄ of {xk} we have ∇f(x̄) = 0

SLIDE 16

Comments on the assumptions

f(xk+1) ≤ f(xk): most optimization methods choose dk as a descent direction. If dk is a descent direction, choosing αk "sufficiently small" ensures the validity of the assumption.

limk→∞ dkT∇f(xk)/∥dk∥ = 0: given a normalized direction dk, the scalar product dkT∇f(xk) is the directional derivative of f along dk; it is required that this goes to zero. This can be achieved through precise line searches (choosing the step so that f is minimized along dk).

|dkT∇f(xk)|/∥dk∥ ≥ σ(∥∇f(xk)∥): letting, e.g., σ(t) = ct with c > 0, if dk is such that dkT∇f(xk) < 0, the condition becomes

dkT∇f(xk) / (∥dk∥ ∥∇f(xk)∥) ≤ −c

SLIDE 17

Recalling that cos θk = dkT∇f(xk) / (∥dk∥ ∥∇f(xk)∥), the condition becomes cos θk ≤ −c; that is, the angle between dk and ∇f(xk) is bounded away from orthogonality.

(figure: the angle θk between dk and ∇f(xk))

SLIDE 18

Gradient Algorithms

General scheme: xk+1 = xk − αkDk∇f(xk) with Dk ≻ 0 and αk > 0. If ∇f(xk) ≠ 0, then dk = −Dk∇f(xk) is a descent direction; in fact

dkT∇f(xk) = −∇Tf(xk)Dk∇f(xk) < 0

SLIDE 19

Steepest Descent

Or "gradient" method: Dk := I, i.e. xk+1 = xk − αk∇f(xk). If ∇f(xk) ≠ 0, then dk = −∇f(xk) is a descent direction. Moreover, it is the steepest w.r.t. the Euclidean norm: it solves

min_{d ∈ Rn, ∥d∥ ≤ 1} ∇Tf(xk)d

SLIDE 20

(figure showing ∇f(xk))

SLIDE 21

. . .

min_{d ∈ Rn, √(dTd) ≤ 1} ∇Tf(xk)d

KKT conditions: in the interior ⇒ ∇f(xk) = 0; if the constraint is active ⇒

∇f(xk) + λ d/∥d∥ = 0,  √(dTd) = 1,  λ ≥ 0  ⇒  d = −∇f(xk)/∥∇f(xk)∥

SLIDE 22

Newton’s method

Dk := [∇2f(xk)]^(−1). Motivation: Taylor expansion of f:

f(x) ≈ f(xk) + ∇Tf(xk)(x − xk) + ½(x − xk)T∇2f(xk)(x − xk)

Minimizing the approximation: ∇f(xk) + ∇2f(xk)(x − xk) = 0. If the Hessian is non-singular ⇒

x = xk − [∇2f(xk)]^(−1) ∇f(xk)
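For a quadratic f the second-order Taylor model is exact, so a single Newton step reaches the stationary point. A small sketch (the matrix Q, vector c, and starting point are arbitrary choices for illustration):

```python
import numpy as np

# f(x) = 0.5 x^T Q x + c^T x with Q positive definite
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
c = np.array([1.0, -2.0])
grad = lambda x: Q @ x + c
hess = lambda x: Q

xk = np.array([5.0, 5.0])
x_new = xk - np.linalg.solve(hess(xk), grad(xk))   # Newton step

# for a quadratic, the model is exact: one step lands on the stationary point
assert np.allclose(grad(x_new), 0.0)
```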

SLIDE 23

Step choice

Given dk, how should αk be chosen in xk+1 = xk + αkdk? "Optimal" choice (one-dimensional optimization): αk = arg min_{α ≥ 0} f(xk + αdk). An analytical expression of the optimal step is available only in a few cases, e.g. if f(x) = ½xTQx + cTx with Q ≻ 0. Then

f(xk + αdk) = ½(xk + αdk)TQ(xk + αdk) + cT(xk + αdk)
            = ½α² dkTQdk + α(Qxk + c)Tdk + β

where β does not depend on α.

SLIDE 24

Minimizing w.r.t. α: α dkTQdk + (Qxk + c)Tdk = 0

⇒ α = −(Qxk + c)Tdk / (dkTQdk) = −dkT∇f(xk) / (dkT∇2f(xk)dk)

E.g., in steepest descent: αk = ∥∇f(xk)∥² / (∇Tf(xk)∇2f(xk)∇f(xk))
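The exact step can be checked numerically (an illustrative sketch; Q, c, and xk are arbitrary choices, and the check samples the 1-D function along dk):

```python
import numpy as np

Q = np.array([[3.0, 0.0], [0.0, 1.0]])
c = np.array([-1.0, 2.0])
f = lambda x: 0.5 * x @ Q @ x + c @ x
grad = lambda x: Q @ x + c

xk = np.array([2.0, -2.0])
g = grad(xk)
dk = -g                                   # steepest descent direction
alpha = -(dk @ g) / (dk @ Q @ dk)         # exact minimizer along dk

# sanity check: no sampled step along dk does better than alpha
best = f(xk + alpha * dk)
assert all(best <= f(xk + a * dk) + 1e-12 for a in np.linspace(0, 2 * alpha, 50))
```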

SLIDE 25

Approximate step size

Rules for choosing a step size (from the sufficient conditions for convergence):

  • f(xk+1) < f(xk)
  • limk→∞ dkT∇f(xk)/∥dk∥ = 0

Often it is also required that:

  • ∥xk+1 − xk∥ → 0
  • dkT∇f(xk + αkdk) → 0

In general it is important to ensure a sufficient reduction of f and a sufficiently large step ∥xk+1 − xk∥.

SLIDE 26

Avoid too large steps

(figure: iterates with too large steps)

SLIDE 27

Avoid too small steps

(figure: iterates with too small steps)

SLIDE 28

Armijo’s rule

Input: δ ∈ (0, 1), γ ∈ (0, 1/2), ∆k > 0
α := ∆k;
while f(xk + αdk) > f(xk) + γα dkT∇f(xk) do
    α := δα;
end
return α

Typical values: δ ∈ [0.1, 0.5], γ ∈ [10^(−4), 10^(−3)]. On exit the returned step is such that f(xk + αdk) ≤ f(xk) + γα dkT∇f(xk).
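The rule transcribes directly into code (a sketch; the quadratic test function and the default parameter values are illustrative choices):

```python
import numpy as np

def armijo(f, xk, dk, gk, delta=0.5, gamma=1e-4, Delta=1.0):
    """Backtracking: shrink alpha until the sufficient-decrease test holds."""
    alpha = Delta
    while f(xk + alpha * dk) > f(xk) + gamma * alpha * (dk @ gk):
        alpha *= delta
    return alpha

f = lambda x: 0.5 * x @ x
xk = np.array([10.0, 0.0])
gk = xk.copy()                 # gradient of f at xk
dk = -gk                       # a descent direction
alpha = armijo(f, xk, dk, gk)

# the returned step satisfies Armijo's sufficient-decrease condition
assert f(xk + alpha * dk) <= f(xk) + 1e-4 * alpha * (dk @ gk)
```

For a descent direction and γ < 1 the loop always terminates, since sufficiently small steps pass the test.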

SLIDE 29

(figure: the acceptable steps of Armijo's rule, between the lines αdkT∇f(xk) and γαdkT∇f(xk))

SLIDE 30

Line search in practice

How should the initial step size ∆k be chosen? Let φ(α) = f(xk + αdk). A possibility is to choose ∆k = α⋆, the minimizer of a quadratic approximation to φ(·). Example: q(α) = c0 + c1α + ½c2α², with

q(0) = c0 := f(xk)
q′(0) = c1 := dkT∇f(xk)

Then α⋆ = −c1/c2.

SLIDE 31

Third condition? If an estimate f̂ of the minimum of f(xk + αdk) is available ⇒ choose c2 so that min q(α) = f̂:

min q(α) = q(−c1/c2) = c0 − c1²/(2c2) := f̂  ⇒  c2 = c1²/(2(c0 − f̂))

α⋆ = −c1/c2 = 2(f̂ − c0)/c1

SLIDE 32

Thus it is reasonable to start with

∆k = 2(f̂ − f(xk)) / dkT∇f(xk)

Using the reduction obtained at the previous iteration as an estimate of f(xk) − f̂, a reasonable choice is ∆k = 2(f(xk) − f(xk−1)) / dkT∇f(xk).

SLIDE 33

Convergence of steepest descent

xk+1 = xk − αk∇f(xk). If a sufficiently accurate step size is used ⇒ the conditions of the theorem on global convergence are satisfied ⇒ the steepest descent algorithm globally converges to a stationary point. "Sufficiently accurate" means an exact line search or, e.g., Armijo's rule.

SLIDE 34

Local analysis of steepest descent

Behaviour of the algorithm when minimizing f(x) = ½xTQx, where Q ≻ 0. (Local and global) optimum: x⋆ = 0. Steepest descent method:

xk+1 = xk − αk∇f(xk) = xk − αkQxk = (I − αkQ)xk

Error (in x) at step k + 1:

∥xk+1 − 0∥ = ∥(I − αkQ)xk∥ = √(xkT(I − αkQ)²xk)

SLIDE 35

Analysis

Let A be symmetric with eigenvalues λ1 ≤ · · · ≤ λn. Then λ1∥v∥² ≤ vTAv ≤ λn∥v∥² ∀ v ∈ Rn ⇒

xkT(I − αkQ)²xk ≤ λ⋆ xkTxk

where λ⋆ is the largest eigenvalue of (I − αkQ)².

SLIDE 36

. . .

λ is an eigenvalue of A iff αλ is an eigenvalue of αA; λ is an eigenvalue of A iff 1 + λ is an eigenvalue of I + A. Thus the eigenvalues of (I − αkQ) are 1 − αkλi, where λi are the eigenvalues of Q. The maximum eigenvalue of (I − αkQ)² will be max{(1 − αkλ1)², (1 − αkλn)²}, thus

∥xk+1∥ ≤ √(max{(1 − αkλ1)², (1 − αkλn)²}) ∥xk∥ = max{|1 − αkλ1|, |1 − αkλn|} ∥xk∥

SLIDE 37

. . .

Eliminating the dependency on αk: max{|1 − αλ1|, |1 − αλn|} = max{1 − αλ1, −1 + αλ1, 1 − αλn, −1 + αλn} 1 2 3 4 5 0.2 0.4 0.6 0.8 1

|1 − αλ1| |1 − αλn|

SLIDE 38

. . .

α ≥ 0 and λ1 ≤ λn, ⇒ 1 − αλ1 ≥ 1 − αλn −1 + αλ1 ≤ −1 + αλn and thus max{|1 − αkλ1|, |1 − αkλn|}xk = max{1 − αλ1, −1 + αλn} Minimum point: 1 − αλ1 = −1 + αλn i.e. α⋆ = 2 λ1 + λn

SLIDE 39

Analysis

In the best possible case xk+1 xk ≤ |1 − α⋆λ1| = |1 − 2 λ1 + λn λ1| = λn − λ1 λn + λ1 = ρ − 1 ρ + 1 where ρ = λn/λ1: condition number of Q ρ ≫ 1 (ill–conditioned problem) ⇒very slow convergence ρ ≈ 1 ⇒very speed convergence

SLIDE 40

Zig–zagging

min ½(x² + My²), where M > 0. Optimum: x⋆ = y⋆ = 0. Starting point: (M, 1). Iterates:

(xk+1, yk+1) = (xk, yk) − α(xk, Myk)

With the optimal step size ⇒

xk = M ((M − 1)/(M + 1))^k,  yk = (−(M − 1)/(M + 1))^k

SLIDE 41

Convergence is rapid if M ≈ 1; very slow and "zig-zagging" if M ≫ 1 or M ≪ 1. Slow convergence and zig-zagging are general phenomena (especially when the starting point is near the longest axis of the ellipsoidal level sets).
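The effect of the condition number is easy to reproduce (an illustrative sketch; M = 100 and the iteration count are arbitrary choices):

```python
import numpy as np

def steepest_descent_quadratic(Q, x0, iters):
    """Steepest descent with exact line search on f(x) = 0.5 x^T Q x."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = Q @ x
        if g @ g == 0:
            break
        alpha = (g @ g) / (g @ Q @ g)    # exact step for the quadratic
        x = x - alpha * g
    return x

M = 100.0                                 # condition number rho = 100
Q = np.diag([1.0, M])
x_ill = steepest_descent_quadratic(Q, [M, 1.0], 50)
x_well = steepest_descent_quadratic(np.eye(2), [M, 1.0], 50)

# well-conditioned: immediate convergence; ill-conditioned: still far away,
# since the error contracts only by (rho-1)/(rho+1) ~ 0.98 per iteration
assert np.linalg.norm(x_well) < 1e-10
assert np.linalg.norm(x_ill) > 1e-10
```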

SLIDE 42

Zig–zagging

(figure: zig-zagging iterates)

SLIDE 43

Analysis of Newton’s method

Newton-Raphson method: xk+1 = xk − (∇2f(xk))^(−1)∇f(xk). Let x⋆ be a local optimum. Taylor expansion of ∇f:

∇f(x⋆) = 0 = ∇f(xk) + ∇2f(xk)(x⋆ − xk) + o(∥x⋆ − xk∥)

If ∇2f(xk) is non-singular and ∥(∇2f(xk))^(−1)∥ is bounded ⇒

0 = (∇2f(xk))^(−1)∇f(xk) + (x⋆ − xk) + (∇2f(xk))^(−1) o(∥x⋆ − xk∥)
  = x⋆ − xk+1 + o(∥x⋆ − xk∥)

SLIDE 44

Thus x⋆ − xk+1 = o(∥x⋆ − xk∥), i.e.

∥x⋆ − xk+1∥ / ∥x⋆ − xk∥ = o(∥x⋆ − xk∥) / ∥x⋆ − xk∥ → 0

⇒ convergence is at least super-linear.

SLIDE 45

Local Convergence of Newton’s Method

Let f ∈ C2(U(x⋆, δ1)), where U(x⋆, δ1) is the ball with radius δ1 and center x⋆; let ∇2f(x⋆) be non-singular. Then:

  • 1. ∃ δ > 0 : if x0 ∈ U(x⋆, δ) ⇒ {xk} is well defined and converges to x⋆ at least superlinearly.
  • 2. If ∃ δ > 0, L > 0, M > 0 : ∥∇2f(x) − ∇2f(y)∥ ≤ L∥x − y∥ and ∥(∇2f(x))^(−1)∥ ≤ M, then, if x0 ∈ U(x⋆, δ), Newton's method converges with order at least 2 and

∥xk+1 − x⋆∥ ≤ (LM/2) ∥xk − x⋆∥²
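The error squaring predicted by point 2 is easy to observe in one dimension (a sketch on an arbitrarily chosen function whose minimizer is x⋆ = 0 with non-singular second derivative):

```python
grad = lambda x: x + x**3          # gradient of f(x) = x**2/2 + x**4/4
hess = lambda x: 1 + 3 * x**2      # second derivative; minimizer x* = 0

x = 0.5
errors = [abs(x)]
for _ in range(4):
    x = x - grad(x) / hess(x)      # Newton step
    errors.append(abs(x))

# quadratic convergence: e_{k+1} <= C * e_k^2 with a moderate constant
assert all(e1 <= 2.0 * e0**2 for e0, e1 in zip(errors, errors[1:]))
assert errors[-1] < 1e-12
```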

SLIDE 46

Difficulties

Many things might go wrong:

  • at some iteration ∇2f(xk) might be singular, for example if xk belongs to a flat region where f(x) = constant
  • even if it is non-singular, inverting ∇2f(xk), or in any case solving a linear system with coefficient matrix ∇2f(xk), can be numerically unstable and computationally demanding
  • there is no guarantee that ∇2f(xk) ≻ 0 ⇒ the Newton direction might not be a descent direction

SLIDE 47

Difficulties

Newton's method just tries to solve the system ∇f(x) = 0 and thus might very well be attracted towards a maximum; moreover, the method lacks global convergence: it converges only if started "near" a local optimum.

SLIDE 48

Newton–type methods

Line search variant: xk+1 = xk − αk(∇2f(xk))^(−1)∇f(xk). Modified Newton method: replace ∇2f(xk) by (∇2f(xk) + Dk), where Dk is chosen so that ∇2f(xk) + Dk is positive definite.

SLIDE 49

Quasi-Newton methods

Consider solving the nonlinear system ∇f(x) = 0. Taylor expansion of the gradient:

∇f(xk) ≈ ∇f(xk+1) + ∇2f(xk+1)(xk − xk+1)

Let Bk+1 be an approximation of the Hessian in xk+1. Quasi-Newton equation: Bk+1(xk+1 − xk) = ∇f(xk+1) − ∇f(xk).

SLIDE 50

Quasi–Newton equation

Let sk := xk+1 − xk and yk := ∇f(xk+1) − ∇f(xk). Quasi-Newton equation: Bk+1sk = yk. If Bk was the previous approximate Hessian, we ask that:

  • 1. the variation between Bk and Bk+1 is "small";
  • 2. nothing changes along directions which are normal to the step sk: Bkz = Bk+1z ∀ z : zTsk = 0.

Choosing n − 1 vectors z which are orthogonal to sk ⇒ n² linearly independent equations in n² unknowns ⇒ ∃ a unique solution.

SLIDE 51

Broyden updating

It can be shown that the unique solution is given by:

Bk+1 = Bk + (yk − Bksk)skT / (skTsk)

Theorem: let Bk ∈ Rn×n and sk ≠ 0. The unique solution to

min_B̂ ∥Bk − B̂∥F  s.t.  B̂sk = yk

is Broyden's update Bk+1; here ∥X∥F = √(Tr XTX) denotes the Frobenius norm.
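Both defining properties of the update, the quasi-Newton (secant) equation and invariance on directions orthogonal to sk, can be verified directly (a sketch with random data; n = 3 and the seed are arbitrary choices):

```python
import numpy as np

def broyden_update(B, s, y):
    """Rank-one update: the closest matrix (Frobenius norm) to B whose
    updated version maps s to y."""
    return B + np.outer(y - B @ s, s) / (s @ s)

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
s = rng.standard_normal(3)
y = rng.standard_normal(3)

B1 = broyden_update(B, s, y)
assert np.allclose(B1 @ s, y)                # quasi-Newton (secant) equation
z = np.cross(s, rng.standard_normal(3))      # a vector orthogonal to s
assert np.allclose(B1 @ z, B @ z)            # unchanged on directions normal to s
```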

SLIDE 52

Proof

For any feasible B̂ (i.e. any B̂ with B̂sk = yk):

∥Bk+1 − Bk∥F = ∥(yk − Bksk)skT / (skTsk)∥F
             = ∥(B̂sk − Bksk)skT / (skTsk)∥F
             = ∥(B̂ − Bk) skskT / (skTsk)∥F
             ≤ ∥B̂ − Bk∥F ∥skskT / (skTsk)∥F
             = ∥B̂ − Bk∥F √(Tr(skskT skskT)) / (skTsk)
             = ∥B̂ − Bk∥F (skTsk) / (skTsk)
             = ∥B̂ − Bk∥F

so Bk+1 is a minimizer. Uniqueness is a consequence of the strict convexity of the norm and the convexity of the feasible region.

SLIDE 53

Quasi-Newton and optimization

Special situation:

  • 1. the Hessian matrix in optimization problems is symmetric;
  • 2. in gradient methods, when we let xk+1 = xk − (Bk+1)^(−1)∇f(xk), it is desirable that Bk+1 be positive definite.

Broyden's update

Bk+1 = Bk + (yk − Bksk)skT / (skTsk)

is generally not symmetric, even if Bk is.

SLIDE 54

Symmetry

Remedy: let C1 = Bk + (yk − Bksk)skT / (skTsk); symmetrization: C2 = ½(C1 + C1T). However, C2 does not satisfy the quasi-Newton equation. Broyden update of C2:

C3 = C2 + (yk − C2sk)skT / (skTsk)

which again is not symmetric, . . .

SLIDE 55

PSB update

In the limit of this alternating process:

Bk+1 = Bk + [(yk − Bksk)skT + sk(yk − Bksk)T] / (skTsk) − [skT(yk − Bksk)] skskT / (skTsk)²

(PSB, the Powell-Symmetric-Broyden update). Imposing also hereditary positive definiteness, DFP (Davidon-Fletcher-Powell) is obtained:

Bk+1 = Bk + [(yk − Bksk)ykT + yk(yk − Bksk)T] / (ykTsk) − [skT(yk − Bksk)] ykykT / (ykTsk)²
     = (I − ykskT/(ykTsk)) Bk (I − skykT/(ykTsk)) + ykykT/(ykTsk)

SLIDE 56

BFGS

Same ideas, but applied to the approximate inverse Hessian. The inverse quasi-Newton equation sk = Hk+1yk leads to the most common quasi-Newton update, BFGS (Broyden-Fletcher-Goldfarb-Shanno):

Hk+1 = (I − skykT/(ykTsk)) Hk (I − ykskT/(ykTsk)) + skskT/(ykTsk)

SLIDE 57

BFGS method

xk+1 = xk − αkHk∇f(xk)

Hk+1 = (I − skykT/(ykTsk)) Hk (I − ykskT/(ykTsk)) + skskT/(ykTsk)

yk = ∇f(xk+1) − ∇f(xk),  sk = xk+1 − xk
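A sketch of the update, with checks of the inverse quasi-Newton equation, symmetry, and positive definiteness (the data are random; the curvature condition ykTsk > 0 is enforced by construction and is what guarantees hereditary positive definiteness):

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of the inverse-Hessian approximation H."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

rng = np.random.default_rng(1)
H = np.eye(3)
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)     # keeps y @ s > 0 (curvature condition)
assert y @ s > 0

H1 = bfgs_update(H, s, y)
assert np.allclose(H1 @ y, s)                 # inverse quasi-Newton equation
assert np.allclose(H1, H1.T)                  # symmetry preserved
assert np.all(np.linalg.eigvalsh(H1) > 0)     # positive definiteness preserved
```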

SLIDE 58

Trust Region methods

Possible defect of the standard Newton method: the approximation becomes less and less precise as we move away from the current point. Long step ⇒ bad approximation. Idea: constrained minimization of the quadratic approximation:

xk+1 = arg min_{x : ∥x − xk∥ ≤ ∆k} mk(x)

where

mk(x) = f(xk) + ∇Tf(xk)(x − xk) + ½(x − xk)T∇2f(xk)(x − xk)

and ∆k > 0 is a parameter. First advantage (over pure Newton): the step is always well defined (thanks to Weierstrass's theorem).

SLIDE 59

Outline of Trust Region

Let mk(·) be a local model function. E.g., in Newton trust region methods,

mk(s) = f(xk) + sT∇f(xk) + ½sT∇2f(xk)s

or, in a quasi-Newton trust region method,

mk(s) = f(xk) + sT∇f(xk) + ½sTBks

SLIDE 60

How should the trust region radius ∆k be chosen and updated? Given a step sk, let

ρk = (f(xk) − f(xk + sk)) / (mk(0) − mk(sk))

the ratio between the actual reduction and the predicted reduction.

SLIDE 61

Model updating

ρk = (f(xk) − f(xk + sk)) / (mk(0) − mk(sk)). The predicted reduction is always non-negative; if ρk is small (surely if it is negative), the model and the function strongly disagree ⇒ the step must be rejected and the trust region reduced; if ρk ≥ 1 it is safe to expand the trust region; intermediate ρk values lead us to keep the region unchanged.

SLIDE 62

Algorithm

Data: ∆̂ > 0, ∆0 ∈ (0, ∆̂), η ∈ [0, 1/4]
for k = 0, 1, . . . do
    find the step sk minimizing the model in the trust region, and compute ρk;
    if ρk < 1/4 then
        ∆k+1 = ∆k/4;
    else if ρk > 3/4 and ∥sk∥ = ∆k then
        ∆k+1 = min{2∆k, ∆̂};
    else
        ∆k+1 = ∆k;
    end
    if ρk > η then
        xk+1 = xk + sk;
    else
        xk+1 = xk;
    end
end
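The radius/acceptance logic can be isolated in a small helper (a sketch; η = 0.2 is an illustrative choice within the allowed range [0, 1/4]):

```python
def update_trust_region(rho, Delta, s_norm, Delta_max, eta=0.2):
    """One radius/acceptance decision of the trust-region loop."""
    if rho < 0.25:
        Delta_new = Delta / 4                    # poor agreement: shrink
    elif rho > 0.75 and s_norm == Delta:
        Delta_new = min(2 * Delta, Delta_max)    # good agreement on the boundary: expand
    else:
        Delta_new = Delta                        # intermediate: keep the radius
    accept = rho > eta
    return Delta_new, accept

assert update_trust_region(0.1, 1.0, 1.0, 4.0) == (0.25, False)   # reject, shrink
assert update_trust_region(0.9, 1.0, 1.0, 4.0) == (2.0, True)     # accept, expand
assert update_trust_region(0.5, 1.0, 0.3, 4.0) == (1.0, True)     # accept, keep
```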

SLIDE 63

Solving the model

How to find

min_s ∇f(xk)Ts + ½sTBks  s.t.  ∥s∥ ≤ ∆

If Bk ≻ 0, KKT conditions are necessary and sufficient; rewriting the constraint as sTs ≤ ∆² ⇒

∇f(xk) + Bks + 2λs = 0
λ(∆ − ∥s∥) = 0

SLIDE 64

Thus either s is in the interior of the ball with radius ∆, in which case λ = 0 and we have the (quasi-)Newton step

p = −Bk^(−1)∇f(xk)

or ∥s∥ = ∆ and, if λ > 0,

2λs = −∇f(xk) − Bks = −∇mk(s)

⇒ s is parallel to the negative gradient of the model and normal to its contour lines.

SLIDE 65

The Cauchy Point

Strategy to approximately solve the trust region sub-problem. Find the "Cauchy point": the minimizer of mk along the direction −∇f(xk) within the trust region. First find the direction:

psk = arg min_p fk + ∇f(xk)Tp  s.t.  ∥p∥ ≤ ∆k

Then along this direction find a minimizer:

τk = arg min_{τ ≥ 0} mk(τ psk)  s.t.  ∥τ psk∥ ≤ ∆k

The Cauchy point is xk + τk psk.

SLIDE 66

Finding the Cauchy point

Finding psk is easy; it has the analytic solution

psk = −(∆k / ∥∇f(xk)∥) ∇f(xk)

For the step size τk: if ∇f(xk)TBk∇f(xk) ≤ 0 ⇒ negative curvature direction ⇒ largest possible step ⇒ τk = 1. Otherwise the model along the line is strictly convex, so

τk = min{1, ∥∇f(xk)∥³ / (∆k ∇f(xk)TBk∇f(xk))}

Choosing the Cauchy point gives global but extremely slow convergence (similar to steepest descent). Usually an improved point is searched for, starting from the Cauchy one.

SLIDE 67

Derivative Free Optimization

SLIDE 68

Pattern Search

For smooth optimization, but without knowledge of derivatives. Elementary idea: if x ∈ R2 is not a local minimum of f, then at least one of the directions e1, e2, −e1, −e2 (moving towards E, N, W, S) forms an acute angle with −∇f(x) ⇒ it is a descent direction. Direct search: explore all the directions in search of one which gives a descent.

SLIDE 69

Coordinate search

Let D⊕ = {±ei} be the set of coordinate directions and their opposites.

Data: k = 0, ∆0 an initial step length, x0 a starting point
while ∆k is large enough do
    if f(xk + ∆kd) < f(xk) for some d ∈ D⊕ then
        xk+1 = xk + ∆kd (step accepted); ∆k+1 = ∆k;
    else
        xk+1 = xk; ∆k+1 = 0.5∆k;
    end
    k = k + 1;
end
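A runnable sketch of the scheme above (illustrative; the test function and tolerance are choices made here, not from the slides):

```python
import numpy as np

def coordinate_search(f, x0, Delta=1.0, tol=1e-6, max_iter=10_000):
    """Try +/- each coordinate direction; halve the step when none improves."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    directions = [sign * np.eye(n)[i] for i in range(n) for sign in (+1.0, -1.0)]
    for _ in range(max_iter):
        if Delta < tol:
            break
        for d in directions:
            if f(x + Delta * d) < f(x):
                x = x + Delta * d        # step accepted
                break
        else:
            Delta *= 0.5                 # no direction improved: shrink the step
    return x

f = lambda x: (x[0] - 1.0)**2 + 4 * (x[1] + 2.0)**2
x = coordinate_search(f, [0.0, 0.0])
assert np.linalg.norm(x - np.array([1.0, -2.0])) < 1e-3
```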

SLIDE 70

Pattern search

It is not necessary to explore 2n directions. It is sufficient that the set of directions forms a positive span, i.e. every v ∈ Rn should be expressible as a non-negative linear combination of the vectors in the set. Formally, G is a generating set iff

∀ v ≠ 0 ∈ Rn ∃ g ∈ G : vTg > 0

A "good" generating set should be characterized by a sufficiently high cosine measure:

κ(G) := min_{v ≠ 0} max_{d ∈ G} vTd / (∥v∥∥d∥)
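κ(G) can be estimated by sampling unit vectors (a Monte-Carlo sketch, not an exact computation; the sample size and seed are arbitrary; for the coordinate directions D⊕ in R2 the exact value is √0.5):

```python
import numpy as np

def cosine_measure(G, n_samples=20_000, seed=0):
    """Monte-Carlo estimate of kappa(G) = min_v max_{d in G} cos(v, d)."""
    rng = np.random.default_rng(seed)
    G = np.array([d / np.linalg.norm(d) for d in np.asarray(G, dtype=float)])
    V = rng.standard_normal((n_samples, G.shape[1]))
    V /= np.linalg.norm(V, axis=1, keepdims=True)     # random unit vectors v
    return (V @ G.T).max(axis=1).min()                # min over v of max over d

# coordinate directions and their opposites in R^2
D = [(1, 0), (-1, 0), (0, 1), (0, -1)]
kappa = cosine_measure(D)
assert abs(kappa - np.sqrt(0.5)) < 0.01
```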

SLIDE 71

Examples

(figure: three generating sets of directions)

In the first case κ ≈ 0.19612, in the second κ = 0.5, in the third κ = √0.5 ≈ 0.7071.

SLIDE 72

Step Choice

xk+1 = xk + ∆kdk if f(xk + ∆kdk) < f(xk) − ρ(∆k) (success); xk+1 = xk otherwise (failure); here ρ(t) = o(t). We let ∆k+1 = φk∆k, where φk ≥ 1 for successful iterations and φk < 1 otherwise. Direct methods possess good convergence properties.

SLIDES 73–76: figure-only slides
SLIDE 77

Nelder-Mead Simplex

Given a simplex S = {v1, . . . , vn+1} in Rn, let vr be the worst point: r = arg maxi{f(vi)}. Let C be the centroid of S \ {vr}:

C = (Σ_{i ≠ r} vi) / n

The algorithm performs a sort of line search along the direction C − vr. Let R = C + (C − vr) be the reflection of the worst point along that direction, and let f̄ be the best function value in the current simplex. Three cases might occur:

SLIDE 78

1: Reflection

Check f(R): if it is intermediate, i.e. better than the worst and worse than the best, then accept the reflection, i.e. discard the worst point in the simplex and replace it with R.

SLIDE 79

Reflection step

(figure: reflection step, showing the worst point and its reflection)

SLIDE 80

2: improvement

If the trial step is an improvement, f(R) < f̄, then attempt an expansion: try to move R to R̄ = R + (R − C). If successful (f(R̄) < f(R)), accept the expansion and discard the worst point. If unsuccessful, accept R as a new point and discard the worst one.

SLIDE 81

Expansion

(figure: expansion step, showing the worst point, its reflection, and the expansion)

SLIDE 82

3: contraction

If however the reflected point R is worse than all points in the simplex (possibly except the worst vr), then a contraction step is performed: if f(R) > f(vr) (R is worse than all points in the simplex), add 0.5(vr + C) to the simplex and discard vr; otherwise, if R is better than vr, add 0.5(R + C) to the simplex and discard vr.
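The three cases can be condensed into one simplified iteration (a sketch that omits the shrink step and other refinements of the full method; the quadratic test function and starting simplex are arbitrary choices):

```python
import numpy as np

def nelder_mead_step(f, simplex):
    """One reflect/expand/contract decision on a list of vertices."""
    simplex = sorted(simplex, key=f)               # best first, worst last
    worst = simplex[-1]
    C = np.mean(simplex[:-1], axis=0)              # centroid without the worst
    R = C + (C - worst)                            # reflection
    if f(simplex[0]) <= f(R) < f(simplex[-2]):     # intermediate: accept reflection
        simplex[-1] = R
    elif f(R) < f(simplex[0]):                     # improvement: try an expansion
        E = R + (R - C)
        simplex[-1] = E if f(E) < f(R) else R
    else:                                          # worse than the rest: contract
        simplex[-1] = 0.5 * (R + C) if f(R) < f(worst) else 0.5 * (worst + C)
    return simplex

f = lambda v: v[0]**2 + v[1]**2
simplex = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]
new_simplex = nelder_mead_step(f, simplex)
# the worst vertex (value 8) has been replaced by a better point
assert max(f(v) for v in new_simplex) < 8.0
```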

SLIDE 83

Contraction

(figure: contraction step, showing the worst point, its reflection, and the contraction)

SLIDE 84

Nelder-Mead is not a direct search method (only a single direction at a time is explored). It is widely used by practitioners; however, it may fail to converge to a local minimum: there are examples of strictly convex functions in R2 on which the method converges to a non-stationary point. The bad convergence properties are connected to the event that the n-dimensional simplex degenerates into a lower dimensional space. Moreover, the method has a strong tendency to generate directions which are almost normal to that of the gradient! Convergent variants of the Nelder-Mead method do exist.

SLIDE 85

Implicit filtering

Let f(x) = h(x) + w(x), where h(x) is a smooth function, while w(x) can be considered as additive, typically random, noise. The method computes a rough estimate of the gradient (finite differences with a "large" step) and proceeds with an Armijo line search. If unsuccessful, the step for the finite differences is reduced.

SLIDE 86

Implicit filtering

Data: {εk} ↓ 0, parameters δ, γ, ∆ of Armijo's rule
repeat
    OuterIteration = false;
    repeat
        compute f(xk) and a finite difference estimate of ∇f(xk):
        ∇εk f(xk) = [(f(xk + εkei) − f(xk − εkei)) / (2εk)]_{i=1,...,n}
        if ∥∇εk f(xk)∥ ≤ εk then
            OuterIteration = true
        else
            Armijo: if successful, accept the Armijo step; otherwise let OuterIteration = true
        end
    until OuterIteration;
    k = k + 1;
until convergence criterion;
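The gradient estimate of the inner iteration is a central finite difference with stencil size εk (a sketch; the noisy test function, noise level, and tolerance are illustrative choices):

```python
import numpy as np

def fd_gradient(f, x, eps):
    """Central finite-difference gradient estimate with stencil size eps."""
    n = len(x)
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = 1.0
        g[i] = (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    return g

rng = np.random.default_rng(2)
noise = lambda x: 1e-6 * rng.standard_normal()     # additive noise w(x)
f = lambda x: x @ x + noise(x)                     # smooth part h(x) = ||x||^2

x = np.array([1.0, -2.0])
g = fd_gradient(f, x, eps=0.1)        # a "large" step filters out the noise
assert np.linalg.norm(g - 2 * x) < 1e-3
```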

SLIDE 87

Convergence properties

If

  • ∇2h(x) is Lipschitz continuous
  • the sequence {xk} generated by the method is infinite
  • lim_{k→∞} (ε²k + η(xk; εk)) / εk = 0, where η(x; ε) = sup_{z : ∥z − x∥∞ ≤ ε} |w(z)|
  • unsuccessful Armijo steps occur at most a finite number of times

then all limit points of {xk} are stationary.
