Unconstrained Minimization (II)
Lijun Zhang
zlj@nju.edu.cn
http://cs.nju.edu.cn/zlj
Outline
Gradient Descent Method
  Convergence Analysis
  Examples
  General Convex Functions
Steepest Descent Method
  Euclidean and Quadratic Norms
  ℓ1-Norm
  Convergence Analysis
  Discussion and Examples
General Descent Method

The Algorithm
Given a starting point x ∈ dom f.
Repeat
  1. Determine a descent direction Δx.
  2. Line search: choose a step size t > 0.
  3. Update: x := x + tΔx.
until stopping criterion is satisfied.

Descent Direction
Δx must satisfy
  ∇f(x)ᵀΔx < 0
i.e., it must make an acute angle with the negative gradient.
Gradient Descent Method

The Algorithm
Given a starting point x ∈ dom f.
Repeat
  1. Δx := −∇f(x).
  2. Line search: choose step size t via exact or backtracking line search.
  3. Update: x := x + tΔx.
until stopping criterion is satisfied.
Stopping Criterion
Usually of the form ‖∇f(x)‖₂ ≤ η, where η is small and positive.
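To make the method concrete, here is a minimal Python sketch of gradient descent with backtracking line search. The quadratic test function, the stopping tolerance, and all parameter values are illustrative choices of mine, not from the slides.

```python
import math

def gradient_descent(f, grad, x0, alpha=0.1, beta=0.7, tol=1e-8, max_iter=5000):
    """Gradient descent with backtracking line search.

    alpha, beta are the backtracking parameters (alpha in (0, 0.5),
    beta in (0, 1)); the loop stops when ||grad f(x)||_2 <= tol.
    """
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= tol:
            break
        dx = [-gi for gi in g]                       # descent direction
        t = 1.0
        # Backtracking: shrink t until the sufficient-decrease test passes.
        while f([xi + t * di for xi, di in zip(x, dx)]) > \
                f(x) + alpha * t * sum(gi * di for gi, di in zip(g, dx)):
            t *= beta
        x = [xi + t * di for xi, di in zip(x, dx)]   # update
    return x

# Example: minimize a strongly convex quadratic, f(x) = x1^2 + 4*x2^2,
# whose unique minimizer is the origin.
f = lambda x: x[0] ** 2 + 4 * x[1] ** 2
grad = lambda x: [2 * x[0], 8 * x[1]]
x_star = gradient_descent(f, grad, [1.0, 1.0])
```

For this well-conditioned quadratic the iterates approach the origin rapidly, as the linear-convergence analysis below predicts.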
Preliminary

Assume f is both strongly convex and smooth:
  mI ⪯ ∇²f(x) ⪯ MI for all x in the initial sublevel set S
Define f̃(t) = f(x − t∇f(x)) as a function of the step size t.
A quadratic upper bound on f̃:
  f̃(t) ≤ f(x) − t‖∇f(x)‖₂² + (Mt²/2)‖∇f(x)‖₂²
Analysis for Exact Line Search

1. Minimize Both Sides of the Quadratic Bound
Left side: min_t f̃(t) = f(x⁺), where x⁺ = x − t_exact∇f(x) and t_exact is the step length that minimizes f̃.
Right side: t = 1/M is the minimizer, which gives
  f(x⁺) ≤ f(x) − (1/(2M))‖∇f(x)‖₂²
2. Subtracting p∗ from Both Sides
  f(x⁺) − p∗ ≤ f(x) − p∗ − (1/(2M))‖∇f(x)‖₂²
3. f Is Strongly Convex on S, so
  ‖∇f(x)‖₂² ≥ 2m(f(x) − p∗)
4. Combining
  f(x⁺) − p∗ ≤ (1 − m/M)(f(x) − p∗)
5. Applying it Recursively
  f(x^(k)) − p∗ ≤ c^k (f(x⁰) − p∗), where c = 1 − m/M < 1
f(x^(k)) converges to p∗ as k → ∞.
Discussions

Iteration Complexity
We obtain f(x^(k)) − p∗ ≤ ε after at most
  log((f(x⁰) − p∗)/ε) / log(1/c)
iterations.
The numerator, log((f(x⁰) − p∗)/ε), indicates that initialization is important.
The denominator, log(1/c), is a function of the condition number M/m:
  log(1/c) = −log(1 − m/M) ≈ m/M when M/m is large
so the required number of iterations grows roughly linearly in the condition number.

Linear Convergence
Since the error decreases by at least a constant factor at every iteration, the error lies below a line on a log-linear plot of error versus iteration number.
Analysis for Backtracking Line Search

Backtracking Line Search
given a descent direction Δx for f at x ∈ dom f, α ∈ (0, 0.5), β ∈ (0, 1)
t := 1
while f(x + tΔx) > f(x) + αt∇f(x)ᵀΔx:  t := βt

1. The Exit Condition Holds for All Small Steps
For 0 ≤ t ≤ 1/M,
  t − Mt²/2 ≥ t/2
so the quadratic upper bound gives
  f̃(t) ≤ f(x) − (t/2)‖∇f(x)‖₂² ≤ f(x) − αt‖∇f(x)‖₂²
since α < 1/2.

2. Backtracking Line Search Terminates
Either with t = 1, or with a value t ≥ β/M. So,
  f(x⁺) ≤ f(x) − α min{1, β/M}‖∇f(x)‖₂²

3. Subtracting p∗ from Both Sides
  f(x⁺) − p∗ ≤ f(x) − p∗ − α min{1, β/M}‖∇f(x)‖₂²

4. Combining with Strong Convexity (‖∇f(x)‖₂² ≥ 2m(f(x) − p∗))
  f(x⁺) − p∗ ≤ c(f(x) − p∗), where c = 1 − min{2mα, 2βαm/M} < 1

5. Applying it Recursively
  f(x^(k)) − p∗ ≤ c^k (f(x⁰) − p∗)
f(x^(k)) converges to p∗ linearly, with an exponent that depends on the condition number.
A Quadratic Problem in ℝ²

A Quadratic Objective Function
  f(x) = (1/2)(x₁² + γx₂²), where γ > 0
The optimal point is x∗ = 0, and the optimal value is f(x∗) = 0.
The Hessian of f is constant and has eigenvalues 1 and γ, so
  m = min{1, γ},  M = max{1, γ}
Condition number: max{1, γ}/min{1, γ} = max{γ, 1/γ}

Gradient Descent Method
Exact line search starting at x⁰ = (γ, 1) gives closed-form iterates:
  x₁^(k) = γ((γ−1)/(γ+1))^k,  x₂^(k) = (−(γ−1)/(γ+1))^k
  f(x^(k)) = ((γ−1)/(γ+1))^(2k) f(x⁰)
The error is reduced by the factor ((γ−1)/(γ+1))² at every iteration.
Convergence is exactly linear.
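The closed-form iterates above can be checked numerically. The following sketch (my own code; γ = 10 is an arbitrary choice) runs exact line search on the quadratic, using the fact that for a quadratic the exact step size is t = gᵀg / gᵀHg, and compares the iterates against the formulas.

```python
# Check that exact line search on f(x) = (x1^2 + gamma*x2^2)/2, started at
# (gamma, 1), reproduces the closed-form iterates
#   x1_k = gamma * r^k,  x2_k = (-r)^k,  r = (gamma - 1)/(gamma + 1).
gamma = 10.0
r = (gamma - 1) / (gamma + 1)

x = [gamma, 1.0]
for k in range(1, 6):
    g = [x[0], gamma * x[1]]                 # gradient of f
    # Exact step size for a quadratic: t = g^T g / g^T H g, H = diag(1, gamma).
    t = (g[0] ** 2 + g[1] ** 2) / (g[0] ** 2 + gamma * g[1] ** 2)
    x = [x[0] - t * g[0], x[1] - t * g[1]]
    closed_form = [gamma * r ** k, (-r) ** k]
    assert all(abs(a - b) < 1e-9 for a, b in zip(x, closed_form))
```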
A Quadratic Problem in ℝ²

Comparisons
From our general analysis, the error is reduced per iteration by the factor
  1 − m/M
From the closed-form solution, the error is reduced by
  ((γ−1)/(γ+1))² = ((1 − m/M)/(1 + m/M))²
When M/m is large, −log of the second factor is about 4m/M versus m/M for the first, so the iteration complexity predicted by the general bound differs from the true one by a factor of about 4.
A Quadratic Problem in ℝ²

Experiments
For γ not far from one, convergence is rapid.
A Non-Quadratic Problem in ℝ²

The Objective Function
  f(x) = e^(x₁+3x₂−0.1) + e^(x₁−3x₂−0.1) + e^(−x₁−0.1)
Gradient descent method with backtracking line search, α = 0.1 and β = 0.7.
Gradient descent method with exact line search.

Comparisons
Both convergence curves are linear, and exact line search is faster.
A Problem in ℝ¹⁰⁰

A Larger Problem
  f(x) = cᵀx − Σᵢ₌₁⁵⁰⁰ log(bᵢ − aᵢᵀx), with x ∈ ℝ¹⁰⁰
Gradient descent method with backtracking line search (α = 0.1, β = 0.5), and gradient descent method with exact line search.

Comparisons
Both convergence curves are linear, and exact line search is only a bit faster.
Gradient Method and Condition Number

A Family of Optimization Problems
Take the larger problem above and replace x by x̄ = Tx, where
  T = diag(1, γ^(1/n), γ^(2/n), …, γ^((n−1)/n))
This yields a family of problems, indexed by γ, whose conditioning degrades as γ moves away from 1.

Number of iterations required to obtain f(x^(k)) − p∗ below a fixed accuracy, using backtracking line search with α = 0.3 and β = 0.7.

The condition number of the Hessian ∇²f(x∗) at the optimum: the larger the condition number, the larger the number of iterations.
Conclusions
1. The gradient method often exhibits approximately linear convergence.
2. The convergence rate depends greatly on the condition number of the Hessian, or of the sublevel sets.
3. An exact line search sometimes improves the convergence of the gradient method, but the effect is not large.
4. The choice of backtracking parameters has a noticeable but not dramatic effect on the convergence.
General Convex Functions

Assumptions
  f is convex
  f is Lipschitz continuous: ‖∇f(x)‖₂ ≤ G for all x

Gradient Descent Method
Given a starting point x₁ ∈ dom f
For k = 1, 2, …, K do
  Update: x_{k+1} := x_k − η∇f(x_k)
End for
Return x̄ = (1/K) Σ_{k=1}^{K} x_k
Analysis

Define D such that ‖x₁ − x∗‖₂ ≤ D, and let η be the fixed step size.

1. By convexity and the update rule,
  f(x_k) − f(x∗) ≤ ∇f(x_k)ᵀ(x_k − x∗)
    = (1/η)(x_k − x_{k+1})ᵀ(x_k − x∗)
    = (1/(2η))(‖x_k − x∗‖₂² − ‖x_{k+1} − x∗‖₂² + ‖x_k − x_{k+1}‖₂²)
2. Since ‖x_k − x_{k+1}‖₂ = η‖∇f(x_k)‖₂ ≤ ηG,
  f(x_k) − f(x∗) ≤ (1/(2η))(‖x_k − x∗‖₂² − ‖x_{k+1} − x∗‖₂²) + (η/2)G²
3. Summing over k = 1, …, K (the first terms telescope):
  Σ_{k=1}^{K} f(x_k) − K f(x∗) ≤ (1/(2η))‖x₁ − x∗‖₂² + (ηK/2)G² ≤ D²/(2η) + (ηK/2)G²
Dividing both sides by K:
  (1/K) Σ_{k=1}^{K} f(x_k) − f(x∗) ≤ D²/(2ηK) + (η/2)G²
4. By Jensen's Inequality,
  f(x̄) − f(x∗) = f((1/K) Σ_{k=1}^{K} x_k) − f(x∗)
    ≤ (1/K) Σ_{k=1}^{K} f(x_k) − f(x∗)
    ≤ D²/(2ηK) + (η/2)G² = GD/√K
where the last equality uses the step size η = D/(G√K).
So the averaged iterate converges at rate O(1/√K).
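The averaged-iterate bound can be observed on a toy problem. The sketch below (my own example) runs gradient descent with the fixed step η = D/(G√K) on the convex, 1-Lipschitz function f(x) = |x| (using a subgradient at 0) and checks the averaged iterate against the GD/√K bound.

```python
import math

f = abs                      # convex, 1-Lipschitz, minimized at x* = 0
K = 100
D = 1.0                      # |x_1 - x*| <= D for the start point x_1 = 1
G = 1.0                      # Lipschitz constant of f
eta = D / (G * math.sqrt(K)) # step size from the analysis

x = 1.0
iterates = []
for _ in range(K):
    iterates.append(x)
    g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)   # (sub)gradient of |x|
    x -= eta * g

x_bar = sum(iterates) / K    # averaged iterate returned by the method
bound = G * D / math.sqrt(K) # theoretical bound on f(x_bar) - f(x*)
```

Here the last iterate oscillates around the optimum, but the averaged iterate stays well within the theoretical bound.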
Discussions

How to Ensure ‖x₁ − x∗‖₂ ≤ D?
Add a domain constraint:
  min f(x)  s.t.  x ∈ X
where X is a closed convex set. This formulation can model any constrained convex optimization problem.
Gradient Descent with Projection

Property of Euclidean Projection
Let Π_X(x) = argmin_{z ∈ X} ‖z − x‖₂. For any x and any x∗ ∈ X,
  ‖Π_X(x) − x∗‖₂ ≤ ‖x − x∗‖₂

The Problem
  min f(x)  s.t.  x ∈ X
The Algorithm
Given a starting point x₁ ∈ X
For k = 1, 2, …, K do
  Update: x′_{k+1} := x_k − η∇f(x_k)
  Projection: x_{k+1} := Π_X(x′_{k+1})
End for
Return x̄ = (1/K) Σ_{k=1}^{K} x_k

Assumptions
  ‖∇f(x)‖₂ ≤ G, ∀x ∈ X
Analysis

Define D such that ‖x − x∗‖₂ ≤ D for all x ∈ X, where x∗ ∈ X is an optimal point, and let η be the fixed step size.

For each k,
  f(x_k) − f(x∗) ≤ ∇f(x_k)ᵀ(x_k − x∗)
    = (1/η)(x_k − x′_{k+1})ᵀ(x_k − x∗)
    = (1/(2η))(‖x_k − x∗‖₂² − ‖x′_{k+1} − x∗‖₂²) + (η/2)‖∇f(x_k)‖₂²
    ≤ (1/(2η))(‖x_k − x∗‖₂² − ‖x_{k+1} − x∗‖₂²) + (η/2)G²
where the last step uses the property of Euclidean projection:
  ‖x_{k+1} − x∗‖₂ = ‖Π_X(x′_{k+1}) − x∗‖₂ ≤ ‖x′_{k+1} − x∗‖₂
The rest of the analysis is unchanged, so f(x̄) − f(x∗) ≤ GD/√K with η = D/(G√K).
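A minimal sketch of gradient descent with projection (my own example, not from the slides): minimize f(x) = (x − 2)² over the interval X = [−1, 1], whose Euclidean projection is simple clipping. The constrained optimum is x∗ = 1.

```python
def project(x):
    """Euclidean projection onto the interval [-1, 1] (clipping)."""
    return max(-1.0, min(1.0, x))

K = 200
eta = 0.1
x = -1.0
iterates = []
for _ in range(K):
    iterates.append(x)
    g = 2.0 * (x - 2.0)              # gradient of f(x) = (x - 2)^2
    x = project(x - eta * g)         # gradient step, then projection

x_bar = sum(iterates) / K            # averaged iterate returned by the method
```

After a few steps the iterates hit the boundary x = 1 and stay there, since the unconstrained gradient step always overshoots into the projected region.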
Motivation

The First-order Taylor Approximation
  f(x + v) ≈ f(x) + ∇f(x)ᵀv
∇f(x)ᵀv is the directional derivative of f at x in the direction v.
It gives the approximate change in f for a small step v.
v is a descent direction if ∇f(x)ᵀv is negative.

A Good Search Direction
Make ∇f(x)ᵀv as negative as possible. Since ∇f(x)ᵀv is linear in v, the size of v must be limited, e.g., by a norm constraint.
Steepest Descent Method

Normalized Steepest Descent Direction
  Δx_nsd = argmin { ∇f(x)ᵀv : ‖v‖ = 1 }
with respect to the norm ‖·‖. Equivalent to minimizing over ‖v‖ ≤ 1:
the direction in the unit ball of ‖·‖ that extends farthest in the direction −∇f(x).

Unnormalized Steepest Descent Direction
  Δx_sd = ‖∇f(x)‖∗ Δx_nsd
where ‖·‖∗ is the dual norm. It satisfies
  ∇f(x)ᵀΔx_sd = ‖∇f(x)‖∗ ∇f(x)ᵀΔx_nsd = −‖∇f(x)‖∗²
Steepest Descent Method

The Algorithm
Given a starting point x ∈ dom f.
Repeat
  1. Compute steepest descent direction Δx_sd.
  2. Line search: choose t via exact or backtracking line search.
  3. Update: x := x + tΔx_sd.
until stopping criterion is satisfied.
When exact line search is used, scale factors in the direction have no effect.
Steepest Descent Method

Steepest Descent for Euclidean Norm
  Δx_sd = −∇f(x)
The steepest descent method for the Euclidean norm coincides with the gradient descent method.
Steepest Descent Method

Steepest Descent for Quadratic Norm
Consider the quadratic norm
  ‖z‖_P = (zᵀPz)^(1/2) = ‖P^(1/2)z‖₂, where P ∈ S^n_{++}
The dual norm is ‖z‖∗ = ‖P^(−1/2)z‖₂.
Normalized Steepest Descent Direction:
  Δx_nsd = −(∇f(x)ᵀP⁻¹∇f(x))^(−1/2) P⁻¹∇f(x)
Unnormalized Steepest Descent Direction:
  Δx_sd = ‖∇f(x)‖∗ Δx_nsd = −P⁻¹∇f(x)
Steepest Descent Method

Steepest Descent for Quadratic Norm
The ellipsoid {v : ‖v‖_P ≤ 1} is the unit ball of the norm ‖·‖_P.
Δx_nsd extends as far as possible in the direction −∇f(x) while staying in the ellipsoid.
Steepest Descent Method

Steepest Descent for Quadratic Norm
Interpretation via Change of Coordinates
Define x̄ = P^(1/2)x, so ‖x‖_P = ‖x̄‖₂.
An Equivalent Problem: minimize f̄(x̄) = f(P^(−1/2)x̄).
Applying the gradient descent method to f̄ gives the direction
  Δx̄ = −∇f̄(x̄) = −P^(−1/2)∇f(P^(−1/2)x̄) = −P^(−1/2)∇f(x)
which corresponds to the direction
  Δx = P^(−1/2)Δx̄ = −P⁻¹∇f(x)
in the original variable. So steepest descent in the P-norm is gradient descent after the change of coordinates x̄ = P^(1/2)x.
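The direction formula Δx_sd = −P⁻¹∇f(x) is easy to sketch for a diagonal P (my own example). If P is chosen equal to the constant Hessian of a quadratic, a unit steepest descent step lands exactly at the optimum:

```python
# Steepest descent in the quadratic norm ||z||_P with diagonal P:
# the unnormalized direction is dx_sd = -P^{-1} grad f(x).
# Here f(x) = (x1^2 + 25*x2^2)/2, whose Hessian is diag(1, 25),
# and we pick P equal to that Hessian, so x + dx_sd = x* = 0.
P = [1.0, 25.0]                       # diagonal of P

def grad(x):                          # gradient of f
    return [x[0], 25.0 * x[1]]

def steepest_direction(x):
    g = grad(x)
    return [-g[i] / P[i] for i in range(2)]   # -P^{-1} grad f(x)

x = [3.0, -2.0]
dx = steepest_direction(x)
x_next = [x[i] + dx[i] for i in range(2)]     # step size t = 1
```

This is a preview of the norm-choice discussion below: with P equal to (an approximation of) the Hessian, the method behaves like gradient descent on a perfectly conditioned problem.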
Steepest Descent Method

Steepest Descent for ℓ1-Norm
Normalized Steepest Descent Direction
Let i be any index for which |∇f(x)ᵢ| = ‖∇f(x)‖∞. Then
  Δx_nsd = −sign(∇f(x)ᵢ) eᵢ
where eᵢ is the i-th standard basis vector.
Unnormalized Steepest Descent Direction
  Δx_sd = ‖∇f(x)‖∞ Δx_nsd = −∇f(x)ᵢ eᵢ
(the dual of the ℓ1-norm is the ℓ∞-norm).
Steepest Descent Method

Steepest Descent for ℓ1-Norm
The diamond {v : ‖v‖₁ ≤ 1} is the unit ball of the ℓ1-norm.
Δx_nsd can always be chosen to be a standard basis vector (or the negative of one).
Steepest Descent Method

Steepest Descent for ℓ1-Norm: Coordinate-Descent Algorithm
1. Select the component of ∇f(x) with maximum absolute value.
2. Decrease or increase the corresponding component of x, keeping all other components fixed.
This can simplify, or even trivialize, the line search.
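The coordinate-descent view can be sketched in a few lines (my own example; the quadratic objective and the fixed step size are illustrative choices, standing in for a line search):

```python
# Steepest descent in the l1-norm = coordinate descent: each step moves
# only along the coordinate with the largest gradient magnitude.
# Objective: f(x) = x1^2 + 4*x2^2, minimized at the origin.
def grad(x):
    return [2.0 * x[0], 8.0 * x[1]]

x = [2.0, 1.0]
t = 0.1                                # fixed step size for simplicity
for _ in range(200):
    g = grad(x)
    i = max(range(2), key=lambda j: abs(g[j]))   # index with largest |grad_i|
    x[i] -= t * g[i]                   # unnormalized direction: -grad_i * e_i
```

Each iteration touches a single coordinate, which is why per-step line searches (here replaced by a fixed step) become one-dimensional and cheap.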
Convergence Analysis

1. Any Norm Can Be Bounded in Terms of the Euclidean Norm
There exist γ, γ̃ ∈ (0, 1] such that
  ‖z‖ ≥ γ‖z‖₂,  ‖z‖∗ ≥ γ̃‖z‖₂

2. f Is Smooth, i.e., ∇²f(x) ⪯ MI, so
  f(x + tΔx_sd) ≤ f(x) + t∇f(x)ᵀΔx_sd + (Mt²/2)‖Δx_sd‖₂²
    ≤ f(x) − t‖∇f(x)‖∗² + (Mt²/(2γ²))‖∇f(x)‖∗²
using ‖Δx_sd‖₂ ≤ ‖Δx_sd‖/γ = ‖∇f(x)‖∗/γ.

3. Exit Condition for the Backtracking Line Search
  0 ≤ t ≤ γ²/M ⇒ t − Mt²/(2γ²) ≥ t/2
so the exit condition holds for all such t (since α < 1/2).
Backtracking line search terminates either with t = 1 or with a value t ≥ βγ²/M. So,
  f(x⁺) = f(x + tΔx_sd) ≤ f(x) − α min{1, βγ²/M}‖∇f(x)‖∗²
    ≤ f(x) − αγ̃² min{1, βγ²/M}‖∇f(x)‖₂²

4. Subtracting p∗ from Both Sides
  f(x⁺) − p∗ ≤ f(x) − p∗ − αγ̃² min{1, βγ²/M}‖∇f(x)‖₂²

5. Combining with Strong Convexity (‖∇f(x)‖₂² ≥ 2m(f(x) − p∗))
  f(x⁺) − p∗ ≤ c(f(x) − p∗), where c = 1 − 2mαγ̃² min{1, βγ²/M} < 1

6. Applying it Recursively
  f(x^(k)) − p∗ ≤ c^k (f(x⁰) − p∗)
Linear convergence. Note, however, that this worst-case bound fails to illustrate the advantage of a well-chosen norm.
Choice of Norm for Steepest Descent

Steepest Descent Method with Quadratic P-Norm
Equivalent to the gradient method after the change of coordinates x̄ = P^(1/2)x.

Gradient Method Works Well
When the condition numbers of the sublevel sets (or the Hessian) are moderate.

Steepest Descent Method Will Work Well
When the sublevel sets, after the change of coordinates, are moderately conditioned.
Choice of Norm for Steepest Descent

Choose P so that the sublevel sets of f, transformed by x̄ = P^(1/2)x, are well conditioned.
If an approximation Ĥ of the Hessian at the optimal point, ∇²f(x∗), were known, a good choice would be P = Ĥ, since the transformed Hessian at the optimum is then approximately the identity:
  Ĥ^(−1/2) ∇²f(x∗) Ĥ^(−1/2) ≈ I
Equivalently: choose P so that the ellipsoid {x : xᵀPx ≤ 1} approximates the shape of the sublevel sets of f near x∗.
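A tiny numerical illustration of this norm choice (my own example, using diagonal matrices so the transformation is elementwise): with P equal to a Hessian approximation at the optimum, the transformed Hessian P^(−1/2) H P^(−1/2) has condition number close to 1.

```python
# Diagonal case: H is the Hessian at x*, P is the chosen norm matrix.
# The transformed Hessian P^{-1/2} H P^{-1/2} is then diag(H_i / P_i).
H = [1.0, 100.0]                  # Hessian eigenvalues; condition number 100
P = [1.0, 100.0]                  # choose P = (approximate) Hessian

transformed = [H[i] / P[i] for i in range(2)]   # P^{-1/2} H P^{-1/2}
cond_before = max(H) / min(H)
cond_after = max(transformed) / min(transformed)
```

Steepest descent in the ‖·‖_P norm then behaves like gradient descent on a problem with condition number `cond_after` instead of `cond_before`.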
Example

The Objective Function
  f(x) = e^(x₁+3x₂−0.1) + e^(x₁−3x₂−0.1) + e^(−x₁−0.1)
Steepest descent method using two quadratic norms ‖·‖_{P₁} and ‖·‖_{P₂}, with backtracking line search, α = 0.1 and β = 0.7.
Figures: iterates and convergence curves for the two norms.
Example

Why is one quadratic norm better than the other?
Compare the problems after the changes of coordinates x̄ = P^(1/2)x: the change of variables associated with the better norm yields sublevel sets with modest condition number, so the equivalent gradient method converges quickly; the other norm yields poorly conditioned sublevel sets.
Summary
Gradient Descent Method
  Convergence Analysis
  General Convex Functions
Steepest Descent Method
  Euclidean and Quadratic Norms
  ℓ1-Norm