Convex Optimization: 9. Unconstrained minimization (Prof. Ying Cui)



SLIDE 1

Convex Optimization

  • 9. Unconstrained minimization
  • Prof. Ying Cui

Department of Electrical Engineering Shanghai Jiao Tong University

2017 Autumn Semester

SJTU Ying Cui 1 / 40

SLIDE 2

Outline

◮ Unconstrained minimization problems
◮ Descent methods
◮ Gradient descent method
◮ Steepest descent method
◮ Newton's method
◮ Self-concordance
◮ Implementation

SLIDE 3

Unconstrained minimization

minimize f(x)

assumptions:
◮ f : Rn → R is convex and twice continuously differentiable (implying that dom f is open)
◮ there exists an optimal point x∗ (the optimal value p∗ = infx f(x) is attained and finite)

a necessary and sufficient condition for optimality: ∇f(x∗) = 0
◮ solving the unconstrained minimization problem is the same as finding a solution of the optimality equation
◮ in a few special cases, it can be solved analytically; usually it must be solved by an iterative algorithm
  ◮ produce a sequence of points x(k) ∈ dom f, k = 0, 1, . . ., with f(x(k)) → p∗ as k → ∞
  ◮ terminate when f(x(k)) − p∗ ≤ ǫ for some tolerance ǫ > 0
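As an illustration of the analytically solvable case: for the convex quadratic f(x) = ||Ax − b||2^2 we have ∇f(x) = 2A^T(Ax − b), so the optimality equation ∇f(x∗) = 0 reduces to the normal equations A^T A x = A^T b. A small sketch (the data A, b below are made up):

```python
import numpy as np

# f(x) = ||Ax - b||_2^2 is convex and twice differentiable; the optimality
# equation grad f(x) = 2 A^T (A x - b) = 0 is the normal equations.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

x_star = np.linalg.solve(A.T @ A, A.T @ b)  # analytic solution of grad f = 0
grad = 2 * A.T @ (A @ x_star - b)           # gradient at the solution (zero)
```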

SLIDE 4

Initial point and sublevel set

algorithms in this chapter require a starting point x(0) such that
◮ x(0) ∈ dom f
◮ the sublevel set S = {x | f(x) ≤ f(x(0))} is closed (hard to verify in general)

the 2nd condition is satisfied for all x(0) ∈ dom f if f is closed, i.e., all its sublevel sets are closed, equivalently epi f is closed
◮ true if f is continuous and dom f = Rn
◮ true if f(x) → ∞ as x → bd dom f

examples of differentiable functions with closed sublevel sets:
    f(x) = log(Σ_{i=1}^m exp(ai^T x + bi)),    f(x) = −Σ_{i=1}^m log(bi − ai^T x)

SLIDE 5

Strong convexity and implications

f is strongly convex on S if there exists an m > 0 such that ∇2f(x) ⪰ mI for all x ∈ S

implications
◮ for x, y ∈ S, f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)||y − x||2^2
  ◮ m = 0: recovers the basic inequality characterizing convexity
  ◮ m > 0: a better lower bound than follows from convexity alone
  ◮ implies that S is bounded
◮ p∗ > −∞ and, for x ∈ S, f(x) − p∗ ≤ (1/(2m))||∇f(x)||2^2
  ◮ if the gradient is small at a point, then the point is nearly optimal
  ◮ a condition for suboptimality generalizing the optimality condition:
      ||∇f(x)||2 ≤ (2mǫ)^{1/2}  =⇒  f(x) − p∗ ≤ ǫ
  ◮ useful as a stopping criterion if m is known
◮ upper bound on ∇2f(x): there exists an M > 0 such that ∇2f(x) ⪯ MI for all x ∈ S

SLIDE 6

Condition number of matrix and convex set

◮ condition number of a matrix: the ratio of its largest eigenvalue to its smallest eigenvalue
◮ condition number of a convex set: the square of the ratio of its maximum width to its minimum width
  ◮ width of a convex set C in the direction q, with ||q||2 = 1: W(C, q) = sup_{z∈C} q^T z − inf_{z∈C} q^T z
  ◮ minimum and maximum width of C: Wmin = inf_{||q||2=1} W(C, q) and Wmax = sup_{||q||2=1} W(C, q)
  ◮ condition number of C: cond(C) = Wmax^2 / Wmin^2
◮ a measure of its anisotropy or eccentricity: small cond(C) means C has approximately the same width in all directions (nearly spherical); large cond(C) means C is far wider in some directions than in others
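For the particular case of an ellipsoid C = {x | x^T Px ≤ 1} with P ≻ 0, the width in a unit direction q is W(C, q) = 2(q^T P^{−1}q)^{1/2}, so cond(C) works out to λmax(P)/λmin(P). A small numerical check (the matrix P here is an arbitrary example):

```python
import numpy as np

# Ellipsoid C = {x | x^T P x <= 1}: Wmax = 2/sqrt(lambda_min(P)),
# Wmin = 2/sqrt(lambda_max(P)), so cond(C) = lambda_max(P)/lambda_min(P).
P = np.diag([1.0, 25.0])
Pinv = np.linalg.inv(P)

def width(q):
    q = q / np.linalg.norm(q)          # direction with ||q||_2 = 1
    return 2.0 * np.sqrt(q @ Pinv @ q)

eigs = np.linalg.eigvalsh(P)           # ascending eigenvalues
cond_C = eigs[-1] / eigs[0]            # = 25: far wider along x1 than x2
w_x1 = width(np.array([1.0, 0.0]))     # widest direction: 2
w_x2 = width(np.array([0.0, 1.0]))     # narrowest direction: 0.4
```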

SLIDE 7

Condition number of sublevel sets

mI ⪯ ∇2f(x) ⪯ MI for all x ∈ S
◮ upper bound on the condition number of ∇2f(x): cond(∇2f(x)) ≤ M/m
◮ upper bound on the condition number of the sublevel set Cα = {x | f(x) ≤ α}, p∗ < α ≤ f(x(0)): cond(Cα) ≤ M/m
◮ geometric interpretation: lim_{α→p∗} cond(Cα) = cond(∇2f(x∗))
◮ the condition number of the sublevel sets of f (which is bounded by M/m) has a strong effect on the efficiency of some common methods for unconstrained minimization

SLIDE 8

Descent methods

the algorithms described in this chapter produce a minimizing sequence x(k), k = 0, 1, . . ., where
    x(k+1) = x(k) + t(k)∆x(k),  with f(x(k+1)) < f(x(k)) and t(k) > 0
◮ other notations: x+ = x + t∆x, x := x + t∆x
◮ ∆x is the step (or search direction); t is the step size (or step length)
◮ for convex f, descent requires ∇f(x(k))^T ∆x(k) < 0 (i.e., ∆x(k) must be a descent direction)

General descent method. given a starting point x ∈ dom f. repeat

  • 1. Determine a descent direction ∆x.
  • 2. Line search. Choose a step size t > 0.
  • 3. Update. x := x + t∆x

until stopping criterion is satisfied.
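The loop above can be sketched in code; `direction` and `line_search` are assumed caller-supplied callables, and a gradient-norm stopping criterion is used as one common choice:

```python
import numpy as np

# General descent method: direction(x, g) returns a descent direction,
# line_search(x, dx, g) returns a step size t > 0.
def descent(f, grad, x0, direction, line_search, tol=1e-8, max_iter=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:   # stopping criterion
            break
        dx = direction(x, g)           # 1. determine a descent direction
        t = line_search(x, dx, g)      # 2. choose a step size t > 0
        x = x + t * dx                 # 3. update
    return x

# example: gradient direction with a fixed step, on f(x) = ||x||^2 / 2
x_min = descent(lambda x: 0.5 * x @ x, lambda x: x, [3.0, -4.0],
                direction=lambda x, g: -g,
                line_search=lambda x, dx, g: 0.5)
```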

SLIDE 9

Line search types

exact line search: t = argmin_{t>0} f(x + t∆x)
◮ minimizes f along the ray {x + t∆x | t ≥ 0}
◮ used when the cost of the one-variable minimization is low compared to the cost of computing the search direction itself
◮ in some special cases the minimizer can be found analytically, and in others it can be computed efficiently

SLIDE 10

Line search types

backtracking line search (with parameters α ∈ (0, 1/2), β ∈ (0, 1))
◮ reduce f "enough" along the ray {x + t∆x | t ≥ 0}
◮ starting at t = 1, repeat t := βt until
    f(x + t∆x) < f(x) + αt∇f(x)^T ∆x
◮ convexity of f: f(x + t∆x) ≥ f(x) + t∇f(x)^T ∆x
◮ the constant α can be interpreted as the fraction of the decrease in f predicted by linear extrapolation that we will accept
◮ graphical interpretation: backtrack until t ≤ t0

Figure 9.1 Backtracking line search. The curve shows f, restricted to the line over which we search. The lower dashed line shows the linear extrapolation of f, and the upper dashed line has a slope a factor of α smaller. The backtracking condition is that f lie below the upper dashed line, i.e., 0 ≤ t ≤ t0.
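A direct transcription of the backtracking rule (the parameter values α = 0.25, β = 0.5 are arbitrary choices within the allowed ranges):

```python
import numpy as np

# Backtracking line search: start at t = 1, shrink by beta until the
# sufficient-decrease condition holds (g @ dx < 0 guarantees termination).
def backtracking(f, x, dx, g, alpha=0.25, beta=0.5):
    t = 1.0
    while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
        t *= beta
    return t

# usage on f(x) = ||x||^2 / 2 with the descent direction dx = -grad f(x)
f = lambda x: 0.5 * x @ x
x = np.array([4.0, 0.0])
g = x.copy()                 # grad f(x) = x
t = backtracking(f, x, -g, g)
```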

SLIDE 11

Gradient descent method

general descent method with ∆x = −∇f(x)

Gradient descent method. given a starting point x ∈ dom f. repeat

  • 1. ∆x := −∇f(x).
  • 2. Line search. Choose step size t via exact or backtracking line search.
  • 3. Update. x := x + t∆x.

until stopping criterion is satisfied.

◮ stopping criterion usually of the form ||∇f(x)||2 ≤ ǫ
◮ convergence result: for strongly convex f,
    f(x(k)) − p∗ ≤ c^k (f(x(0)) − p∗)
  ◮ exact line search: c = 1 − m/M < 1
  ◮ backtracking line search: c = 1 − min{2mα, 2βαm/M} < 1
  ◮ linear convergence: the error lies below a line on a log-linear plot of error versus iteration number
◮ very simple, but often very slow; rarely used in practice
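The method can be exercised on a strongly convex quadratic to observe the (approximately) linear convergence; the matrix Q and starting point below are illustrative:

```python
import numpy as np

# Gradient descent with backtracking on f(x) = 0.5 x^T Q x, whose extreme
# Hessian eigenvalues are m = 1 and M = 10; p* = 0 is attained at x* = 0.
Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

x = np.array([10.0, 1.0])
errs = [f(x)]                  # f(x(k)) - p* with p* = 0
for _ in range(100):
    g = grad(x)
    t, alpha, beta = 1.0, 0.25, 0.5
    while f(x - t * g) > f(x) - alpha * t * (g @ g):  # backtracking
        t *= beta
    x = x - t * g
    errs.append(f(x))
```

The error sequence decreases monotonically and, on a log scale, roughly along a straight line, consistent with linear convergence.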

SLIDE 12

Examples

a quadratic problem in R2: f(x) = (1/2)(x1^2 + γx2^2), with γ > 0

with exact line search, starting at x(0) = (γ, 1), the iterates have closed-form expressions:
    x1(k) = γ((γ − 1)/(γ + 1))^k,  x2(k) = (−(γ − 1)/(γ + 1))^k,  f(x(k)) = ((γ − 1)/(γ + 1))^{2k} f(x(0))

◮ the exact solution is found in one iteration if γ = 1; convergence is rapid if γ is not far from 1, and very slow if γ ≫ 1 or γ ≪ 1

Figure 9.2 Some contour lines of the function f(x) = (1/2)(x1^2 + 10x2^2). The condition number of the sublevel sets, which are ellipsoids, is exactly 10. The figure shows the iterates of the gradient method with exact line search, started at x(0) = (10, 1).
SLIDE 13

Examples

a nonquadratic problem in R2: f(x1, x2) = e^{x1+3x2−0.1} + e^{x1−3x2−0.1} + e^{−x1−0.1}

◮ backtracking line search: approximately linear convergence (the sublevel sets of f are not too badly conditioned, i.e., M/m is not too large)
◮ exact line search: approximately linear convergence, about twice as fast as with backtracking line search

Figure 9.3 Iterates of the gradient method with backtracking line search, for the problem in R2 with objective f given in (9.20). The dashed curves are level curves of f, the small circles are the iterates, and the solid lines connecting successive iterates show the scaled steps t(k)∆x(k).
Figure 9.5 Iterates of the gradient method with exact line search for the same problem.
Figure 9.4 Error f(x(k)) − p⋆ versus iteration k for both line searches. The plot shows nearly linear convergence, with the error reduced approximately by the factor 0.4 per iteration with backtracking line search and by the factor 0.2 per iteration with exact line search.

SLIDE 14

Examples

a problem in R100: f(x) = c^T x − Σ_{i=1}^{500} log(bi − ai^T x)

◮ backtracking line search: approximately linear convergence
◮ exact line search: approximately linear convergence, only a bit faster than with backtracking line search

Figure 9.6 Error f(x(k)) − p⋆ versus iteration k for the gradient method with backtracking and exact line search, for a problem in R100.

SLIDE 15

Conclusions on the gradient descent method

characteristics:
◮ exhibits approximately linear convergence, i.e., the error converges to zero approximately as a geometric series
◮ the choice of backtracking parameters α, β has a noticeable but not dramatic effect on the convergence
◮ exact line search sometimes improves the convergence, but not by much (and is probably not worth the trouble of implementing)
◮ the convergence rate depends greatly on the condition number of the Hessian, or of the sublevel sets

main advantage and disadvantage:
◮ main advantage: simplicity
◮ main disadvantage: the convergence rate depends critically on the condition number of the Hessian or sublevel sets

SLIDE 16

Steepest descent method

normalized steepest descent direction (at x, for the norm || · ||):
    ∆xnsd = argmin{∇f(x)^T v | ||v|| = 1}
◮ the first-order Taylor approximation of f(x + v) around x is f(x + v) ≈ f(x) + ∇f(x)^T v
◮ the directional derivative of f at x in the direction v is ∇f(x)^T v
◮ ∆xnsd is the unit-norm direction with the most negative directional derivative

(unnormalized) steepest descent direction: ∆xsd = ||∇f(x)||∗ ∆xnsd, which satisfies ∇f(x)^T ∆xsd = −||∇f(x)||∗^2

SLIDE 17

Steepest descent method

general descent method with ∆x = ∆xsd

Steepest descent method. given a starting point x ∈ dom f. repeat

  • 1. Compute steepest descent direction ∆xsd.
  • 2. Line search. Choose step size t via exact or backtracking line search.
  • 3. Update. x := x + t∆xsd.

until stopping criterion is satisfied.

◮ when exact line search is used, scale factors in the descent direction have no effect, so either ∆xnsd or ∆xsd can be used
◮ convergence result: for strongly convex f,
    f(x(k)) − p∗ ≤ c^k (f(x(0)) − p∗)
  ◮ backtracking line search: c = 1 − 2mαγ̃^2 min{1, βγ^2/M} < 1
  ◮ any norm can be bounded in terms of the Euclidean norm, i.e., there exist constants γ, γ̃ ∈ (0, 1] such that ||x|| ≥ γ||x||2 and ||x||∗ ≥ γ̃||x||2
◮ linear convergence, the same as the gradient descent method

SLIDE 18

Steepest descent for different norms

◮ Euclidean norm: ∆xsd = −∇f(x)
  ◮ coincides with the gradient descent method
◮ quadratic norm ||x||P = (x^T Px)^{1/2} (P ∈ Sn++): ∆xsd = −P^{−1}∇f(x)
  ◮ can be thought of as the gradient descent method applied after the change of coordinates x̄ = P^{1/2}x
◮ ℓ1-norm: ∆xsd = −(∂f(x)/∂xi)ei, where i is an index with |∂f(x)/∂xi| = ||∇f(x)||∞
  ◮ a coordinate-descent algorithm (update the component with maximum absolute partial derivative)

Figure 9.9 Normalized steepest descent direction for a quadratic norm. The ellipsoid shown is the unit ball of the norm, translated to the point x. The normalized steepest descent direction ∆xnsd at x extends as far as possible in the direction −∇f(x) while staying in the ellipsoid.
Figure 9.10 Normalized steepest descent direction for the ℓ1-norm. The diamond is the unit ball of the ℓ1-norm, translated to the point x. The normalized steepest descent direction can always be chosen in the direction of a standard basis vector; in this example ∆xnsd = e1.
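The three directions can be computed directly; the gradient vector and matrix P below are arbitrary illustrative values:

```python
import numpy as np

# Unnormalized steepest descent directions at a point x, for three norms,
# given the gradient of some convex f at x.
grad = np.array([3.0, -1.0])

dx_euclid = -grad                         # Euclidean norm: -grad f(x)

P = np.array([[2.0, 0.0], [0.0, 8.0]])    # quadratic P-norm (P pos. def.)
dx_P = -np.linalg.solve(P, grad)          # -P^{-1} grad f(x)

i = np.argmax(np.abs(grad))               # l1-norm: coordinate with largest
dx_l1 = np.zeros_like(grad)               # absolute partial derivative
dx_l1[i] = -grad[i]                       # -(df/dxi) e_i
```

All three satisfy ∇f(x)^T ∆xsd < 0, and for the ℓ1-norm the identity ∇f(x)^T ∆xsd = −||∇f(x)||∞^2 (the dual of ℓ1 being ℓ∞) can be read off directly.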

SLIDE 19

Choice of norm for steepest descent

the choice of norm has a strong effect on the speed of convergence of the steepest descent method (consider the quadratic P-norm)
◮ steepest descent with the quadratic P-norm is the same as the gradient method after the change of coordinates x̄ = P^{1/2}x
◮ to increase the speed of convergence, choose P so that the sublevel sets of f, transformed by P^{−1/2}, are well conditioned
  ◮ the ellipsoid {x | x^T Px ≤ 1} approximates the shape of the sublevel sets
◮ works well in cases where we can identify a matrix P for which the transformed problem has moderate condition number
◮ steepest descent with backtracking line search for two quadratic norms (the ellipses show {x | ||x − x(k)||P = 1})

Figure 9.11 Steepest descent method with the quadratic norm || · ||P1. The ellipses are the boundaries of the norm balls {x | ||x − x(k)||P1 ≤ 1} at x(0) and x(1).
Figure 9.12 Steepest descent method with the quadratic norm || · ||P2.
Figure 9.13 Error f(x(k)) − p⋆ versus iteration k for the two norms. Convergence is rapid for || · ||P1 and very slow for || · ||P2.

SLIDE 20

Newton step

Newton step for f at x: ∆xnt = −∇2f(x)^{−1}∇f(x)
◮ convexity of f (∇2f(x) ≻ 0) implies ∇f(x)^T ∆xnt < 0 unless ∇f(x) = 0
  ◮ the Newton step is a descent direction unless x is optimal
◮ affine invariance: the Newton step of f̄(y) = f(Ty) (T nonsingular) at y and the Newton step of f at x = Ty satisfy ∆xnt = T∆ynt

SLIDE 21

Interpretations of the Newton step

Minimizer of the second-order approximation: x + ∆xnt minimizes the second-order Taylor approximation of f at x (a convex quadratic function of v)
    f̂(x + v) = f(x) + ∇f(x)^T v + (1/2)v^T ∇2f(x)v
◮ if f is quadratic, then x + ∆xnt is the exact minimizer of f
◮ if f is nearly quadratic, then x + ∆xnt should be a very good estimate of the minimizer of f
◮ when x is near x∗ (so the quadratic model of f is very accurate), x + ∆xnt should be a very good approximation of x∗

Figure 9.16 The function f (shown solid) and its second-order approximation f̂ at x (dashed). The Newton step ∆xnt is what must be added to x to give the minimizer of f̂.

SLIDE 22

Interpretations of the Newton step

Solution of the linearized optimality condition: x + ∆xnt solves the linearized optimality condition
    ∇f(x + v) ≈ ∇f(x) + ∇2f(x)v = 0
◮ when x is near x∗ (so the optimality condition almost holds), x + ∆xnt should be a very good approximation of x∗

Figure 9.18 The solid curve is the derivative f′ of the function f shown in figure 9.16; f̂′ is the linear approximation of f′ at x. The Newton step ∆xnt is the difference between the root of f̂′ and the point x.

SLIDE 23

Interpretations of the Newton step

Steepest descent direction in the Hessian norm: ∆xnt is the steepest descent direction at x for the quadratic norm defined by the Hessian ∇2f(x), i.e., ||u||∇2f(x) = (u^T ∇2f(x)u)^{1/2}
◮ when x is near x∗ (so that ∇2f(x), after the associated change of coordinates x̄ = (∇2f(x))^{1/2}x, has small condition number), steepest descent with || · ||∇2f(x) converges very rapidly

Figure 9.17 The dashed lines are level curves of a convex function. The ellipsoid shown (solid line) is {x + v | v^T ∇2f(x)v ≤ 1}. The arrow shows −∇f(x), the gradient descent direction. The Newton step ∆xnt is the steepest descent direction in the norm || · ||∇2f(x). The figure also shows ∆xnsd, the normalized steepest descent direction for the same norm.

SLIDE 24

Newton decrement

Newton decrement at x (a measure of the proximity of x to x∗):
    λ(x) = (∇f(x)^T ∇2f(x)^{−1}∇f(x))^{1/2}

properties
◮ (1/2)λ(x)^2 is an estimate of f(x) − p∗ based on the quadratic approximation f̂:
    f(x) − inf_v f̂(x + v) = f(x) − f̂(x + ∆xnt) = (1/2)λ(x)^2
◮ λ(x) equals the norm of the Newton step at x in the quadratic Hessian norm ||u||∇2f(x) = (u^T ∇2f(x)u)^{1/2}:
    λ(x) = ||∆xnt||∇2f(x) = (∆xnt^T ∇2f(x)∆xnt)^{1/2}
◮ −λ(x)^2 is the directional derivative of f at x in the Newton direction:
    −λ(x)^2 = ∇f(x)^T ∆xnt = (d/dt)f(x + t∆xnt)|_{t=0}
◮ affine invariance: the Newton decrement of f̄(y) = f(Ty) (T nonsingular) at y is the same as the Newton decrement of f at x = Ty
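For a quadratic f the estimate (1/2)λ(x)^2 of f(x) − p∗ is exact, since f coincides with its own second-order approximation; a quick numerical check (Q, b, and the point x are arbitrary):

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T Q x + b^T x with positive definite Hessian Q:
# the gap f(x) - p* equals exactly (1/2) lambda(x)^2.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ Q @ x + b @ x

x = np.array([2.0, 5.0])
g = Q @ x + b                        # gradient at x
dxnt = -np.linalg.solve(Q, g)        # Newton step
lam2 = -(g @ dxnt)                   # lambda(x)^2 = g^T Q^{-1} g

x_star = np.linalg.solve(Q, -b)      # exact minimizer
gap = f(x) - f(x_star)               # f(x) - p*
```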

SLIDE 25

Newton’s method

general descent method with ∆x = ∆xnt

Newton's method. given a starting point x ∈ dom f, tolerance ǫ > 0. repeat

  • 1. Compute the Newton step and decrement: ∆xnt := −∇2f(x)^{−1}∇f(x); λ^2 := ∇f(x)^T ∇2f(x)^{−1}∇f(x).
  • 2. Stopping criterion. quit if λ^2/2 ≤ ǫ.
  • 3. Line search. Choose step size t by backtracking line search.
  • 4. Update. x := x + t∆xnt.

Newton's method is affine invariant, due to the affine invariance of the Newton step and decrement
◮ independent of linear changes of coordinates
◮ the Newton iterates for f̄(y) = f(Ty) with starting point y(0) = T^{−1}x(0) are y(k) = T^{−1}x(k)
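A compact implementation, run on the nonquadratic R2 example f(x) = e^{x1+3x2−0.1} + e^{x1−3x2−0.1} + e^{−x1−0.1} of (9.20); the gradient and Hessian are worked out by hand below, and the parameter values are illustrative:

```python
import numpy as np

# Objective (9.20) and its derivatives, with e1, e2, e3 the three exponentials.
def f(x):
    return (np.exp(x[0] + 3*x[1] - 0.1) + np.exp(x[0] - 3*x[1] - 0.1)
            + np.exp(-x[0] - 0.1))

def grad(x):
    e1 = np.exp(x[0] + 3*x[1] - 0.1)
    e2 = np.exp(x[0] - 3*x[1] - 0.1)
    e3 = np.exp(-x[0] - 0.1)
    return np.array([e1 + e2 - e3, 3*e1 - 3*e2])

def hess(x):
    e1 = np.exp(x[0] + 3*x[1] - 0.1)
    e2 = np.exp(x[0] - 3*x[1] - 0.1)
    e3 = np.exp(-x[0] - 0.1)
    return np.array([[e1 + e2 + e3, 3*e1 - 3*e2],
                     [3*e1 - 3*e2, 9*e1 + 9*e2]])

def newton(x, eps=1e-10, alpha=0.1, beta=0.7, max_iter=50):
    for k in range(max_iter):
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)     # 1. Newton step
        lam2 = -(g @ dx)                #    decrement squared: g^T H^{-1} g
        if lam2 / 2 <= eps:             # 2. stopping criterion
            return x, k
        t = 1.0                         # 3. backtracking line search
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
            t *= beta
        x = x + t * dx                  # 4. update
    return x, max_iter

x_star, iters = newton(np.array([-1.0, 1.0]))
```

Setting the gradient to zero by hand gives x2∗ = 0 and x1∗ = −(1/2) log 2, which the iterates reach in a handful of steps.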

SLIDE 26

Classical convergence analysis

assumptions
◮ f is strongly convex on S with constant m, and mI ⪯ ∇2f(x) ⪯ MI for all x ∈ S
◮ ∇2f is Lipschitz continuous on S with constant L > 0, i.e., ||∇2f(x) − ∇2f(y)||2 ≤ L||x − y||2 for all x, y ∈ S (L measures how well f can be approximated by a quadratic function)

outline: there exist constants η ∈ (0, m^2/L), γ > 0 such that
◮ if ||∇f(x(k))||2 ≥ η, then f(x(k+1)) − f(x(k)) ≤ −γ
◮ if ||∇f(x(k))||2 < η, then (L/(2m^2))||∇f(x(k+1))||2 ≤ ((L/(2m^2))||∇f(x(k))||2)^2
  ◮ implying that ||∇f(x(l))||2 < η for all l ≥ k

SLIDE 27

Classical convergence analysis

damped Newton phase (||∇f(x)||2 ≥ η)
◮ most iterations require backtracking steps
◮ the function value decreases by at least γ, i.e., f(x(k+1)) − f(x(k)) ≤ −γ
◮ if p∗ > −∞, this phase ends after at most (f(x(0)) − p∗)/γ iterations

quadratically convergent phase (||∇f(x)||2 < η)
◮ all iterations use step size t = 1 (no backtracking steps)
◮ ||∇f(x)||2 converges to zero quadratically: if ||∇f(x(k))||2 < η, then for l ≥ k
    (L/(2m^2))||∇f(x(l))||2 ≤ ((L/(2m^2))||∇f(x(k))||2)^{2^{l−k}} ≤ (1/2)^{2^{l−k}}
  =⇒ f(x(l)) − p∗ ≤ (1/(2m))||∇f(x(l))||2^2 ≤ (2m^3/L^2)(1/2)^{2^{l−k+1}}
◮ this phase ends (f(x(l)) − p∗ ≤ ǫ) after at most log2 log2(ǫ0/ǫ) iterations

SLIDE 28

Classical convergence analysis

conclusion: the number of iterations until f(x) − p∗ ≤ ǫ is bounded above by
    (f(x(0)) − p∗)/γ + log2 log2(ǫ0/ǫ)
◮ γ = αβη^2 m/M^2, η = min{1, 3(1 − 2α)}m^2/L, ǫ0 = 2m^3/L^2
◮ the second term is small (on the order of 6) and almost constant for practical purposes
◮ in practice the constants m, L (hence γ, ǫ0) are usually unknown
◮ the analysis provides qualitative insight into the convergence properties (i.e., it explains the two phases of the algorithm)

SLIDE 29

Examples

example in R2 (page 12)

Figure 9.19 Newton's method for the problem in R2, with objective f given in (9.20) and backtracking line search parameters α = 0.1, β = 0.7. Also shown are the ellipsoids {x | ||x − x(k)||∇2f(x(k)) ≤ 1} at the first two iterates.
Figure 9.20 Error versus iteration k of Newton's method for the problem in R2. Convergence to a very high accuracy is achieved in five iterations.

◮ backtracking parameters α = 0.1, β = 0.7
◮ converges in only 5 steps
◮ apparent quadratic convergence

SLIDE 30

Examples

example in R100 (page 13)

Figure 9.21 Error versus iteration for Newton's method for the problem in R100, with backtracking line search parameters α = 0.01, β = 0.5. Here too convergence is extremely rapid: very high accuracy is attained in only seven or eight iterations; Newton's method with exact line search is only one iteration faster than with backtracking line search.
Figure 9.22 Step size t versus iteration for Newton's method with backtracking and exact line search. The backtracking line search takes one backtracking step in the first two iterations; after that it always selects t = 1.

◮ backtracking parameters α = 0.01, β = 0.5
◮ backtracking line search is almost as fast as exact line search (and much simpler)
◮ clearly shows the two convergence phases (a damped phase of 2 iterations)

SLIDE 31

Examples

example in R10000 (with sparse ai):
    f(x) = −Σ_{i=1}^{10000} log(1 − xi^2) − Σ_{i=1}^{100000} log(bi − ai^T x)

Figure 9.23 Error versus iteration of Newton's method for a problem in R10000, with backtracking line search parameters α = 0.01, β = 0.5. Even for this large-scale problem, Newton's method requires only 18 iterations to achieve very high accuracy.

◮ backtracking parameters α = 0.01, β = 0.5
◮ a linearly convergent phase of about 13 iterations, followed by a quadratically convergent phase of 4 or 5 iterations
◮ convergence behavior similar to the small examples

SLIDE 32

Conclusions of Newton’s method

strong advantages over the gradient and steepest descent methods:
◮ convergence of Newton's method is rapid in general, and quadratic near x∗
◮ Newton's method is affine invariant: insensitive to the choice of coordinates, or to the condition number of the sublevel sets of f
◮ Newton's method scales well with problem size
◮ the good performance of Newton's method does not depend on the choice of algorithm parameters

main disadvantages:
◮ the cost of forming and storing the Hessian
◮ the cost of computing the Newton step, which requires solving a set of linear equations
◮ in many cases it is possible to exploit problem structure to substantially reduce the cost of computing the Newton step

SLIDE 33

Self-concordance

shortcomings of the classical convergence analysis
◮ depends on the unknown constants m, M, L, so it is only conceptually useful
◮ the bound is not affinely invariant (m, M, L change if the coordinates change), although Newton's method is

convergence analysis via self-concordance (Nesterov and Nemirovski)
◮ the analysis of Newton's method for self-concordant functions does not depend on any unknown constants
◮ gives an affine-invariant bound
◮ self-concordant functions include many logarithmic barrier functions that play an important role in interior-point methods for solving convex optimization problems

SLIDE 34

Self-concordant functions

definition
◮ a convex f : R → R is self-concordant if |f′′′(x)| ≤ 2f′′(x)^{3/2} for all x ∈ dom f
◮ f : Rn → R is self-concordant if it is self-concordant along every line in its domain, i.e., g(t) = f(x + tv) is self-concordant for all x ∈ dom f, v ∈ Rn

examples on R
◮ linear functions (zero second and third derivatives)
◮ quadratic functions (zero third derivative and nonnegative second derivative)
◮ negative logarithm f(x) = − log x
◮ negative entropy plus negative logarithm: f(x) = x log x − log x
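The last two examples can be verified numerically: for f(x) = −log x the defining inequality holds with equality, since f′′(x) = 1/x^2 and f′′′(x) = −2/x^3 give |f′′′(x)| = 2/x^3 = 2f′′(x)^{3/2}; for f(x) = x log x − log x it holds with some slack for x > 0:

```python
import numpy as np

xs = np.linspace(0.1, 10.0, 100)       # sample points in dom f = R++

# f(x) = -log x: f'' = 1/x^2, f''' = -2/x^3, equality in |f'''| <= 2 f''^{3/2}
f2 = 1.0 / xs**2
f3 = -2.0 / xs**3
ok = np.allclose(np.abs(f3), 2.0 * f2**1.5)

# f(x) = x log x - log x: f'' = 1/x + 1/x^2, f''' = -1/x^2 - 2/x^3
g2 = 1.0 / xs + 1.0 / xs**2
g3 = -1.0 / xs**2 - 2.0 / xs**3
ok2 = np.all(np.abs(g3) <= 2.0 * g2**1.5 + 1e-12)
```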

SLIDE 35

Self-concordant functions

remarks
◮ the constant 2 is chosen for convenience, to simplify later formulas; any other positive constant could be used instead
  ◮ if f : R → R satisfies |f′′′(x)| ≤ kf′′(x)^{3/2}, then f̃(x) = (k^2/4)f(x) satisfies |f̃′′′(x)| ≤ 2f̃′′(x)^{3/2}
  ◮ what matters is that the third derivative of the function is bounded by some multiple of the 3/2-power of its second derivative
◮ self-concordance is affine invariant
  ◮ if f : R → R is s.c., then f̃(y) = f(ay + b) is s.c.
  ◮ the self-concordance condition limits the third derivative of a function in a way that is independent of affine coordinate changes
SLIDE 36

Self-concordant calculus

properties
◮ preserved under scaling by a ≥ 1, and under sums
  ◮ if f is s.c. and a ≥ 1, then af is s.c.
  ◮ if f1 and f2 are s.c., then f1 + f2 is s.c.
◮ preserved under composition with an affine function
  ◮ if f : Rn → R is s.c. and A ∈ Rn×m, b ∈ Rn, then f(Ax + b) is s.c.
◮ preserved under composition with the logarithm
  ◮ if g : R → R is convex with dom g = R++ and |g′′′(x)| ≤ 3g′′(x)/x for all x, then f(x) = − log(−g(x)) − log x is s.c. on {x | x > 0, g(x) < 0}
  ◮ if |g′′′(x)| ≤ 3g′′(x)/x holds for g, then it also holds for g(x) + ax^2 + bx + c with a ≥ 0

examples: the following are s.c.
◮ f(x) = −Σ_{i=1}^m log(bi − ai^T x) on {x | ai^T x < bi, i = 1, . . . , m}
◮ f(X) = − log det X on Sn++
◮ f(x, y) = − log(y^2 − x^T x) on {(x, y) | ||x||2 < y}
SLIDE 37

Convergence analysis for self-concordant functions

assumptions: f : Rn → R is s.c.

summary: there exist constants η ∈ (0, 1/4], γ > 0 (given by η = (1 − 2α)/4 and γ = αβη^2/(1 + η)) such that
◮ if λ(x(k)) > η, then f(x(k+1)) − f(x(k)) ≤ −γ
◮ if λ(x(k)) ≤ η, then 2λ(x(k+1)) ≤ (2λ(x(k)))^2, which implies
    f(x(l)) − p∗ ≤ λ(x(l))^2 ≤ (1/2)^{2^{l−k+1}}, l ≥ k

complexity bound: the number of Newton iterations is bounded by
    (f(x(0)) − p∗)/γ + log2 log2(1/ǫ) = ((20 − 8α)/(αβ(1 − 2α)^2))(f(x(0)) − p∗) + log2 log2(1/ǫ)
which depends only on the line search parameters α, β and the final accuracy ǫ
◮ the second term is small and can be safely replaced with 6
◮ example: for α = 0.1, β = 0.8, ǫ = 10^{−10}, the bound is 375(f(x(0)) − p∗) + 6

SLIDE 38

Numerical example

150 randomly generated instances of ai and bi for  minimize f(x) = −Σ_{i=1}^m log(bi − ai^T x)

Figure 9.25 Number of Newton iterations required to minimize self-concordant functions versus f(x(0)) − p⋆, where the problem data ai and bi are randomly generated. The circles show problems with m = 100, n = 50; the squares show problems with m = 1000, n = 500; and the diamonds show problems with m = 1000, n = 50. Fifty instances of each are shown.

◮ the number of iterations is much smaller than 375(f(x(0)) − p∗) + 6
◮ a bound of the form c(f(x(0)) − p∗) + 6 with a much smaller c is (empirically) valid

SLIDE 39

Implementation

main effort in each iteration: evaluate the derivatives and solve the Newton system H∆x = −g, where H = ∇2f(x) and g = ∇f(x), via the Cholesky factorization
    H = LL^T,  ∆xnt = −L^{−T}L^{−1}g,  λ(x) = ||L^{−1}g||2
where L is a lower triangular matrix
◮ cost: (1/3)n^3 flops for an unstructured system
◮ cost ≪ (1/3)n^3 if H is sparse or banded

SLIDE 40

Example of dense Newton system with structure

f(x) = Σ_{i=1}^n ψi(xi) + ψ0(Ax + b),  H = D + A^T H0 A

◮ assume A ∈ Rp×n is dense, with p ≪ n
◮ D is diagonal with diagonal elements ψi′′(xi); H0 = ∇2ψ0(Ax + b)

method 1: form H, solve via dense Cholesky factorization (cost (1/3)n^3)

method 2: factor H0 = L0L0^T; write the Newton system as
    D∆x + A^T L0 w = −g,  L0^T A∆x − w = 0
eliminate ∆x from the first equation; compute w and ∆x from
    (I + L0^T AD^{−1}A^T L0)w = −L0^T AD^{−1}g,  D∆x = −g − A^T L0 w
cost: 2p^2 n flops (dominated by the computation of L0^T AD^{−1}A^T L0)
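The two methods can be checked against each other numerically; the dimensions and data below are made up, with p ≪ n as assumed:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 3, 50
A = rng.standard_normal((p, n))        # dense, p << n
d = rng.uniform(1.0, 2.0, n)           # diagonal of D
D = np.diag(d)
M0 = rng.standard_normal((p, p))
H0 = M0 @ M0.T + np.eye(p)             # positive definite p x p block
g = rng.standard_normal(n)

# method 1: form H = D + A^T H0 A explicitly and solve
H = D + A.T @ H0 @ A
dx1 = np.linalg.solve(H, -g)

# method 2: factor H0 = L0 L0^T, solve the small p x p system for w,
# then recover dx from the diagonal system D dx = -g - A^T L0 w
L0 = np.linalg.cholesky(H0)
B = L0.T @ A                           # p x n, so B^T = A^T L0
S = np.eye(p) + B @ (B / d).T          # I + L0^T A D^{-1} A^T L0
w = np.linalg.solve(S, -(B @ (g / d))) # right-hand side -L0^T A D^{-1} g
dx2 = (-g - B.T @ w) / d
```

Both solves produce the same Newton step; method 2 only ever factors the small p × p system, which is where the 2p^2 n cost estimate comes from.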
