Unconstrained Minimization (II)
Lijun Zhang
zlj@nju.edu.cn
http://cs.nju.edu.cn/zlj
Outline
Gradient Descent Method
  Convergence Analysis
  Examples
  General Convex Functions
Steepest Descent Method
  Euclidean and Quadratic Norms
  ℓ1-Norm
  Convergence Analysis
  Discussion and Examples
General Descent Method

The Algorithm
Given a starting point x ∈ dom f.
Repeat
  1. Determine a descent direction Δx.
  2. Line search: choose a step size t > 0.
  3. Update: x := x + tΔx.
until stopping criterion is satisfied.

Descent Direction
Δx must satisfy
  ∇f(x)ᵀΔx < 0
i.e., it must make an acute angle with the negative gradient.
Gradient Descent Method

The Algorithm
Given a starting point x ∈ dom f.
Repeat
  1. Δx := −∇f(x).
  2. Line search: choose step size t via exact or backtracking line search.
  3. Update: x := x + tΔx.
until stopping criterion is satisfied.
Stopping Criterion
Usually of the form ‖∇f(x)‖₂ ≤ η, where η is small and positive.
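To make the method concrete, here is a minimal Python sketch of gradient descent with backtracking line search. The quadratic test function, the stopping tolerance, and all parameter values are illustrative choices of mine, not from the slides.

```python
import math

def gradient_descent(f, grad, x0, alpha=0.1, beta=0.7, tol=1e-8, max_iter=5000):
    """Gradient descent with backtracking line search.

    alpha, beta are the backtracking parameters (alpha in (0, 0.5),
    beta in (0, 1)); the loop stops when ||grad f(x)||_2 <= tol.
    """
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= tol:
            break
        dx = [-gi for gi in g]                       # descent direction
        t = 1.0
        # Backtracking: shrink t until the sufficient-decrease test passes.
        while f([xi + t * di for xi, di in zip(x, dx)]) > \
                f(x) + alpha * t * sum(gi * di for gi, di in zip(g, dx)):
            t *= beta
        x = [xi + t * di for xi, di in zip(x, dx)]   # update
    return x

# Example: minimize a strongly convex quadratic, f(x) = x1^2 + 4*x2^2,
# whose unique minimizer is the origin.
f = lambda x: x[0] ** 2 + 4 * x[1] ** 2
grad = lambda x: [2 * x[0], 8 * x[1]]
x_star = gradient_descent(f, grad, [1.0, 1.0])
```

For this well-conditioned quadratic the iterates approach the origin rapidly, as the linear-convergence analysis below predicts.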
Preliminary

Assume f is both strongly convex and smooth:
  mI ⪯ ∇²f(x) ⪯ MI for all x in the initial sublevel set S
Define f̃(t) = f(x − t∇f(x)) as a function of the step size t.
A quadratic upper bound on f̃:
  f̃(t) ≤ f(x) − t‖∇f(x)‖₂² + (Mt²/2)‖∇f(x)‖₂²
Analysis for Exact Line Search

1. Minimize Both Sides of the Quadratic Bound
Left side: min_t f̃(t) = f(x⁺), where x⁺ = x − t_exact∇f(x) and t_exact is the step length that minimizes f̃.
Right side: t = 1/M is the minimizer, which gives
  f(x⁺) ≤ f(x) − (1/(2M))‖∇f(x)‖₂²
2. Subtracting p∗ from Both Sides
  f(x⁺) − p∗ ≤ f(x) − p∗ − (1/(2M))‖∇f(x)‖₂²
3. f Is Strongly Convex on S, so
  ‖∇f(x)‖₂² ≥ 2m(f(x) − p∗)
4. Combining
  f(x⁺) − p∗ ≤ (1 − m/M)(f(x) − p∗)
5. Applying it Recursively
  f(x^(k)) − p∗ ≤ c^k (f(x⁰) − p∗), where c = 1 − m/M < 1
f(x^(k)) converges to p∗ as k → ∞.
Discussions

Iteration Complexity
We obtain f(x^(k)) − p∗ ≤ ε after at most
  log((f(x⁰) − p∗)/ε) / log(1/c)
iterations.
The numerator, log((f(x⁰) − p∗)/ε), indicates that initialization is important.
The denominator, log(1/c), is a function of the condition number M/m:
  log(1/c) = −log(1 − m/M) ≈ m/M when M/m is large
so the required number of iterations grows roughly linearly in the condition number.

Linear Convergence
Since the error decreases by at least a constant factor at every iteration, the error lies below a line on a log-linear plot of error versus iteration number.
Analysis for Backtracking Line Search

Backtracking Line Search
given a descent direction Δx for f at x ∈ dom f, α ∈ (0, 0.5), β ∈ (0, 1)
t := 1
while f(x + tΔx) > f(x) + αt∇f(x)ᵀΔx:  t := βt

1. The Exit Condition Holds for All Small Steps
For 0 ≤ t ≤ 1/M,
  t − Mt²/2 ≥ t/2
so the quadratic upper bound gives
  f̃(t) ≤ f(x) − (t/2)‖∇f(x)‖₂² ≤ f(x) − αt‖∇f(x)‖₂²
since α < 1/2.

2. Backtracking Line Search Terminates
Either with t = 1, or with a value t ≥ β/M. So,
  f(x⁺) ≤ f(x) − α min{1, β/M}‖∇f(x)‖₂²

3. Subtracting p∗ from Both Sides
  f(x⁺) − p∗ ≤ f(x) − p∗ − α min{1, β/M}‖∇f(x)‖₂²

4. Combining with Strong Convexity (‖∇f(x)‖₂² ≥ 2m(f(x) − p∗))
  f(x⁺) − p∗ ≤ c(f(x) − p∗), where c = 1 − min{2mα, 2βαm/M} < 1

5. Applying it Recursively
  f(x^(k)) − p∗ ≤ c^k (f(x⁰) − p∗)
f(x^(k)) converges to p∗ linearly, with an exponent that depends on the condition number.
A Quadratic Problem in ℝ²

A Quadratic Objective Function
  f(x) = (1/2)(x₁² + γx₂²), where γ > 0
The optimal point is x∗ = 0, and the optimal value is f(x∗) = 0.
The Hessian of f is constant and has eigenvalues 1 and γ, so
  m = min{1, γ},  M = max{1, γ}
Condition number: max{1, γ}/min{1, γ} = max{γ, 1/γ}

Gradient Descent Method
Exact line search starting at x⁰ = (γ, 1) gives closed-form iterates:
  x₁^(k) = γ((γ−1)/(γ+1))^k,  x₂^(k) = (−(γ−1)/(γ+1))^k
  f(x^(k)) = ((γ−1)/(γ+1))^(2k) f(x⁰)
The error is reduced by the factor ((γ−1)/(γ+1))² at every iteration.
Convergence is exactly linear.
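The closed-form iterates above can be checked numerically. The following sketch (my own code; γ = 10 is an arbitrary choice) runs exact line search on the quadratic, using the fact that for a quadratic the exact step size is t = gᵀg / gᵀHg, and compares the iterates against the formulas.

```python
# Check that exact line search on f(x) = (x1^2 + gamma*x2^2)/2, started at
# (gamma, 1), reproduces the closed-form iterates
#   x1_k = gamma * r^k,  x2_k = (-r)^k,  r = (gamma - 1)/(gamma + 1).
gamma = 10.0
r = (gamma - 1) / (gamma + 1)

x = [gamma, 1.0]
for k in range(1, 6):
    g = [x[0], gamma * x[1]]                 # gradient of f
    # Exact step size for a quadratic: t = g^T g / g^T H g, H = diag(1, gamma).
    t = (g[0] ** 2 + g[1] ** 2) / (g[0] ** 2 + gamma * g[1] ** 2)
    x = [x[0] - t * g[0], x[1] - t * g[1]]
    closed_form = [gamma * r ** k, (-r) ** k]
    assert all(abs(a - b) < 1e-9 for a, b in zip(x, closed_form))
```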
A Quadratic Problem in ℝ²

Comparisons
From our general analysis, the error is reduced per iteration by the factor
  1 − m/M
From the closed-form solution, the error is reduced by
  ((γ−1)/(γ+1))² = ((1 − m/M)/(1 + m/M))²
When M/m is large, −log of the second factor is about 4m/M versus m/M for the first, so the iteration complexity predicted by the general bound differs from the true one by a factor of about 4.
A Quadratic Problem in ℝ²

Experiments
For γ not far from one, convergence is rapid.
A Non-Quadratic Problem in ℝ²

The Objective Function
  f(x) = e^(x₁+3x₂−0.1) + e^(x₁−3x₂−0.1) + e^(−x₁−0.1)
Gradient descent method with backtracking line search, α = 0.1 and β = 0.7.
Gradient descent method with exact line search.

Comparisons
Both convergence curves are linear, and exact line search is faster.
A Problem in ℝ¹⁰⁰

A Larger Problem
  f(x) = cᵀx − Σᵢ₌₁⁵⁰⁰ log(bᵢ − aᵢᵀx), with x ∈ ℝ¹⁰⁰
Gradient descent method with backtracking line search (α = 0.1, β = 0.5), and gradient descent method with exact line search.

Comparisons
Both convergence curves are linear, and exact line search is only a bit faster.
Gradient Method and Condition Number

A Family of Optimization Problems
Take the larger problem above and replace x by x̄ = Tx, where
  T = diag(1, γ^(1/n), γ^(2/n), …, γ^((n−1)/n))
This yields a family of problems, indexed by γ, whose conditioning degrades as γ moves away from 1.

Number of iterations required to obtain f(x^(k)) − p∗ below a fixed accuracy, using backtracking line search with α = 0.3 and β = 0.7.

The condition number of the Hessian ∇²f(x∗) at the optimum: the larger the condition number, the larger the number of iterations.
Conclusions
1. The gradient method often exhibits approximately linear convergence.
2. The convergence rate depends greatly on the condition number of the Hessian, or of the sublevel sets.
3. An exact line search sometimes improves the convergence of the gradient method, but the effect is not large.
4. The choice of backtracking parameters has a noticeable but not dramatic effect on the convergence.
General Convex Functions

Assumptions
  f is convex
  f is Lipschitz continuous: ‖∇f(x)‖₂ ≤ G for all x

Gradient Descent Method
Given a starting point x₁ ∈ dom f
For k = 1, 2, …, K do
  Update: x_{k+1} := x_k − η∇f(x_k)
End for
Return x̄ = (1/K) Σ_{k=1}^{K} x_k
Analysis

Define D such that ‖x₁ − x∗‖₂ ≤ D, and let η be the fixed step size.

1. By convexity and the update rule,
  f(x_k) − f(x∗) ≤ ∇f(x_k)ᵀ(x_k − x∗)
    = (1/η)(x_k − x_{k+1})ᵀ(x_k − x∗)
    = (1/(2η))(‖x_k − x∗‖₂² − ‖x_{k+1} − x∗‖₂² + ‖x_k − x_{k+1}‖₂²)
2. Since ‖x_k − x_{k+1}‖₂ = η‖∇f(x_k)‖₂ ≤ ηG,
  f(x_k) − f(x∗) ≤ (1/(2η))(‖x_k − x∗‖₂² − ‖x_{k+1} − x∗‖₂²) + (η/2)G²
3. Summing over k = 1, …, K (the first terms telescope):
  Σ_{k=1}^{K} f(x_k) − K f(x∗) ≤ (1/(2η))‖x₁ − x∗‖₂² + (ηK/2)G² ≤ D²/(2η) + (ηK/2)G²
Dividing both sides by K:
  (1/K) Σ_{k=1}^{K} f(x_k) − f(x∗) ≤ D²/(2ηK) + (η/2)G²
4. By Jensen's Inequality,
  f(x̄) − f(x∗) = f((1/K) Σ_{k=1}^{K} x_k) − f(x∗)
    ≤ (1/K) Σ_{k=1}^{K} f(x_k) − f(x∗)
    ≤ D²/(2ηK) + (η/2)G² = GD/√K
where the last equality uses the step size η = D/(G√K).
So the averaged iterate converges at rate O(1/√K).
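The averaged-iterate bound can be observed on a toy problem. The sketch below (my own example) runs gradient descent with the fixed step η = D/(G√K) on the convex, 1-Lipschitz function f(x) = |x| (using a subgradient at 0) and checks the averaged iterate against the GD/√K bound.

```python
import math

f = abs                      # convex, 1-Lipschitz, minimized at x* = 0
K = 100
D = 1.0                      # |x_1 - x*| <= D for the start point x_1 = 1
G = 1.0                      # Lipschitz constant of f
eta = D / (G * math.sqrt(K)) # step size from the analysis

x = 1.0
iterates = []
for _ in range(K):
    iterates.append(x)
    g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)   # (sub)gradient of |x|
    x -= eta * g

x_bar = sum(iterates) / K    # averaged iterate returned by the method
bound = G * D / math.sqrt(K) # theoretical bound on f(x_bar) - f(x*)
```

Here the last iterate oscillates around the optimum, but the averaged iterate stays well within the theoretical bound.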
Discussions

How to Ensure ‖x₁ − x∗‖₂ ≤ D?
Add a domain constraint:
  min f(x)  s.t.  x ∈ X
where X is a closed convex set. This formulation can model any constrained convex optimization problem.
Gradient Descent with Projection

Property of Euclidean Projection
Let Π_X(x) = argmin_{z ∈ X} ‖z − x‖₂. For any x and any x∗ ∈ X,
  ‖Π_X(x) − x∗‖₂ ≤ ‖x − x∗‖₂

The Problem
  min f(x)  s.t.  x ∈ X
The Algorithm
Given a starting point x₁ ∈ X
For k = 1, 2, …, K do
  Update: x′_{k+1} := x_k − η∇f(x_k)
  Projection: x_{k+1} := Π_X(x′_{k+1})
End for
Return x̄ = (1/K) Σ_{k=1}^{K} x_k

Assumptions
  ‖∇f(x)‖₂ ≤ G, ∀x ∈ X
Analysis

Define D such that ‖x − x∗‖₂ ≤ D for all x ∈ X, where x∗ ∈ X is an optimal point, and let η be the fixed step size.

For each k,
  f(x_k) − f(x∗) ≤ ∇f(x_k)ᵀ(x_k − x∗)
    = (1/η)(x_k − x′_{k+1})ᵀ(x_k − x∗)
    = (1/(2η))(‖x_k − x∗‖₂² − ‖x′_{k+1} − x∗‖₂²) + (η/2)‖∇f(x_k)‖₂²
    ≤ (1/(2η))(‖x_k − x∗‖₂² − ‖x_{k+1} − x∗‖₂²) + (η/2)G²
where the last step uses the property of Euclidean projection:
  ‖x_{k+1} − x∗‖₂ = ‖Π_X(x′_{k+1}) − x∗‖₂ ≤ ‖x′_{k+1} − x∗‖₂
The rest of the analysis is unchanged, so f(x̄) − f(x∗) ≤ GD/√K with η = D/(G√K).
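A minimal sketch of gradient descent with projection (my own example, not from the slides): minimize f(x) = (x − 2)² over the interval X = [−1, 1], whose Euclidean projection is simple clipping. The constrained optimum is x∗ = 1.

```python
def project(x):
    """Euclidean projection onto the interval [-1, 1] (clipping)."""
    return max(-1.0, min(1.0, x))

K = 200
eta = 0.1
x = -1.0
iterates = []
for _ in range(K):
    iterates.append(x)
    g = 2.0 * (x - 2.0)              # gradient of f(x) = (x - 2)^2
    x = project(x - eta * g)         # gradient step, then projection

x_bar = sum(iterates) / K            # averaged iterate returned by the method
```

After a few steps the iterates hit the boundary x = 1 and stay there, since the unconstrained gradient step always overshoots into the projected region.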
Motivation

The First-order Taylor Approximation
  f(x + v) ≈ f(x) + ∇f(x)ᵀv
∇f(x)ᵀv is the directional derivative of f at x in the direction v.
It gives the approximate change in f for a small step v.
v is a descent direction if ∇f(x)ᵀv is negative.

A Good Search Direction
Make ∇f(x)ᵀv as negative as possible. Since ∇f(x)ᵀv is linear in v, the size of v must be limited, e.g., by a norm constraint.
Steepest Descent Method

Normalized Steepest Descent Direction
  Δx_nsd = argmin { ∇f(x)ᵀv : ‖v‖ = 1 }
with respect to the norm ‖·‖. Equivalent to minimizing over ‖v‖ ≤ 1:
the direction in the unit ball of ‖·‖ that extends farthest in the direction −∇f(x).

Unnormalized Steepest Descent Direction
  Δx_sd = ‖∇f(x)‖∗ Δx_nsd
where ‖·‖∗ is the dual norm. It satisfies
  ∇f(x)ᵀΔx_sd = ‖∇f(x)‖∗ ∇f(x)ᵀΔx_nsd = −‖∇f(x)‖∗²
Steepest Descent Method

The Algorithm
Given a starting point x ∈ dom f.
Repeat
  1. Compute steepest descent direction Δx_sd.
  2. Line search: choose t via exact or backtracking line search.
  3. Update: x := x + tΔx_sd.
until stopping criterion is satisfied.
When exact line search is used, scale factors in the direction have no effect.
Steepest Descent Method

Steepest Descent for Euclidean Norm
  Δx_sd = −∇f(x)
The steepest descent method for the Euclidean norm coincides with the gradient descent method.
Steepest Descent Method

Steepest Descent for Quadratic Norm
Consider the quadratic norm
  ‖z‖_P = (zᵀPz)^(1/2) = ‖P^(1/2)z‖₂, where P ∈ S^n_{++}
The dual norm is ‖z‖∗ = ‖P^(−1/2)z‖₂.
Normalized Steepest Descent Direction:
  Δx_nsd = −(∇f(x)ᵀP⁻¹∇f(x))^(−1/2) P⁻¹∇f(x)
Unnormalized Steepest Descent Direction:
  Δx_sd = ‖∇f(x)‖∗ Δx_nsd = −P⁻¹∇f(x)
Steepest Descent Method

Steepest Descent for Quadratic Norm
The ellipsoid {v : ‖v‖_P ≤ 1} is the unit ball of the norm ‖·‖_P.
Δx_nsd extends as far as possible in the direction −∇f(x) while staying in the ellipsoid.
Steepest Descent Method

Steepest Descent for Quadratic Norm
Interpretation via Change of Coordinates
Define x̄ = P^(1/2)x, so ‖x‖_P = ‖x̄‖₂.
An Equivalent Problem: minimize f̄(x̄) = f(P^(−1/2)x̄).
Applying the gradient descent method to f̄ gives the direction
  Δx̄ = −∇f̄(x̄) = −P^(−1/2)∇f(P^(−1/2)x̄) = −P^(−1/2)∇f(x)
which corresponds to the direction
  Δx = P^(−1/2)Δx̄ = −P⁻¹∇f(x)
in the original variable. So steepest descent in the P-norm is gradient descent after the change of coordinates x̄ = P^(1/2)x.
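The direction formula Δx_sd = −P⁻¹∇f(x) is easy to sketch for a diagonal P (my own example). If P is chosen equal to the constant Hessian of a quadratic, a unit steepest descent step lands exactly at the optimum:

```python
# Steepest descent in the quadratic norm ||z||_P with diagonal P:
# the unnormalized direction is dx_sd = -P^{-1} grad f(x).
# Here f(x) = (x1^2 + 25*x2^2)/2, whose Hessian is diag(1, 25),
# and we pick P equal to that Hessian, so x + dx_sd = x* = 0.
P = [1.0, 25.0]                       # diagonal of P

def grad(x):                          # gradient of f
    return [x[0], 25.0 * x[1]]

def steepest_direction(x):
    g = grad(x)
    return [-g[i] / P[i] for i in range(2)]   # -P^{-1} grad f(x)

x = [3.0, -2.0]
dx = steepest_direction(x)
x_next = [x[i] + dx[i] for i in range(2)]     # step size t = 1
```

This is a preview of the norm-choice discussion below: with P equal to (an approximation of) the Hessian, the method behaves like gradient descent on a perfectly conditioned problem.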
Steepest Descent Method

Steepest Descent for ℓ1-Norm
Normalized Steepest Descent Direction
Let i be any index for which |∇f(x)ᵢ| = ‖∇f(x)‖∞. Then
  Δx_nsd = −sign(∇f(x)ᵢ) eᵢ
where eᵢ is the i-th standard basis vector.
Unnormalized Steepest Descent Direction
  Δx_sd = ‖∇f(x)‖∞ Δx_nsd = −∇f(x)ᵢ eᵢ
(the dual of the ℓ1-norm is the ℓ∞-norm).
Steepest Descent Method

Steepest Descent for ℓ1-Norm
The diamond {v : ‖v‖₁ ≤ 1} is the unit ball of the ℓ1-norm.
Δx_nsd can always be chosen to be a standard basis vector (or the negative of one).
Steepest Descent Method

Steepest Descent for ℓ1-Norm: Coordinate-Descent Algorithm
1. Select the component of ∇f(x) with maximum absolute value.
2. Decrease or increase the corresponding component of x, keeping all other components fixed.
This can simplify, or even trivialize, the line search.
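The coordinate-descent view can be sketched in a few lines (my own example; the quadratic objective and the fixed step size are illustrative choices, standing in for a line search):

```python
# Steepest descent in the l1-norm = coordinate descent: each step moves
# only along the coordinate with the largest gradient magnitude.
# Objective: f(x) = x1^2 + 4*x2^2, minimized at the origin.
def grad(x):
    return [2.0 * x[0], 8.0 * x[1]]

x = [2.0, 1.0]
t = 0.1                                # fixed step size for simplicity
for _ in range(200):
    g = grad(x)
    i = max(range(2), key=lambda j: abs(g[j]))   # index with largest |grad_i|
    x[i] -= t * g[i]                   # unnormalized direction: -grad_i * e_i
```

Each iteration touches a single coordinate, which is why per-step line searches (here replaced by a fixed step) become one-dimensional and cheap.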
Convergence Analysis

1. Any Norm Can Be Bounded in Terms of the Euclidean Norm
There exist γ, γ̃ ∈ (0, 1] such that
  ‖z‖ ≥ γ‖z‖₂,  ‖z‖∗ ≥ γ̃‖z‖₂

2. f Is Smooth, i.e., ∇²f(x) ⪯ MI, so
  f(x + tΔx_sd) ≤ f(x) + t∇f(x)ᵀΔx_sd + (Mt²/2)‖Δx_sd‖₂²
    ≤ f(x) − t‖∇f(x)‖∗² + (Mt²/(2γ²))‖∇f(x)‖∗²
using ‖Δx_sd‖₂ ≤ ‖Δx_sd‖/γ = ‖∇f(x)‖∗/γ.

3. Exit Condition for the Backtracking Line Search
  0 ≤ t ≤ γ²/M ⇒ t − Mt²/(2γ²) ≥ t/2
so the exit condition holds for all such t (since α < 1/2).
Backtracking line search terminates either with t = 1 or with a value t ≥ βγ²/M. So,
  f(x⁺) = f(x + tΔx_sd) ≤ f(x) − α min{1, βγ²/M}‖∇f(x)‖∗²
    ≤ f(x) − αγ̃² min{1, βγ²/M}‖∇f(x)‖₂²

4. Subtracting p∗ from Both Sides
  f(x⁺) − p∗ ≤ f(x) − p∗ − αγ̃² min{1, βγ²/M}‖∇f(x)‖₂²

5. Combining with Strong Convexity (‖∇f(x)‖₂² ≥ 2m(f(x) − p∗))
  f(x⁺) − p∗ ≤ c(f(x) − p∗), where c = 1 − 2mαγ̃² min{1, βγ²/M} < 1

6. Applying it Recursively
  f(x^(k)) − p∗ ≤ c^k (f(x⁰) − p∗)
Linear convergence. Note, however, that this worst-case bound fails to illustrate the advantage of a well-chosen norm.
Choice of Norm for Steepest Descent

Steepest Descent Method with Quadratic P-Norm
Equivalent to the gradient method after the change of coordinates x̄ = P^(1/2)x.

Gradient Method Works Well
When the condition numbers of the sublevel sets (or the Hessian) are moderate.

Steepest Descent Method Will Work Well
When the sublevel sets, after the change of coordinates, are moderately conditioned.
Choice of Norm for Steepest Descent

Choose P so that the sublevel sets of f, transformed by x̄ = P^(1/2)x, are well conditioned.
If an approximation Ĥ of the Hessian at the optimal point, ∇²f(x∗), were known, a good choice would be P = Ĥ, since the transformed Hessian at the optimum is then approximately the identity:
  Ĥ^(−1/2) ∇²f(x∗) Ĥ^(−1/2) ≈ I
Equivalently: choose P so that the ellipsoid {x : xᵀPx ≤ 1} approximates the shape of the sublevel sets of f near x∗.
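A tiny numerical illustration of this norm choice (my own example, using diagonal matrices so the transformation is elementwise): with P equal to a Hessian approximation at the optimum, the transformed Hessian P^(−1/2) H P^(−1/2) has condition number close to 1.

```python
# Diagonal case: H is the Hessian at x*, P is the chosen norm matrix.
# The transformed Hessian P^{-1/2} H P^{-1/2} is then diag(H_i / P_i).
H = [1.0, 100.0]                  # Hessian eigenvalues; condition number 100
P = [1.0, 100.0]                  # choose P = (approximate) Hessian

transformed = [H[i] / P[i] for i in range(2)]   # P^{-1/2} H P^{-1/2}
cond_before = max(H) / min(H)
cond_after = max(transformed) / min(transformed)
```

Steepest descent in the ‖·‖_P norm then behaves like gradient descent on a problem with condition number `cond_after` instead of `cond_before`.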
Example

The Objective Function
  f(x) = e^(x₁+3x₂−0.1) + e^(x₁−3x₂−0.1) + e^(−x₁−0.1)
Steepest descent method using two quadratic norms ‖·‖_{P₁} and ‖·‖_{P₂}, with backtracking line search, α = 0.1 and β = 0.7.
Figures: iterates and convergence curves for the two norms.
Example

Why is one quadratic norm better than the other?
Compare the problems after the changes of coordinates x̄ = P^(1/2)x: the change of variables associated with the better norm yields sublevel sets with modest condition number, so the equivalent gradient method converges quickly; the other norm yields poorly conditioned sublevel sets.
Summary
Gradient Descent Method
  Convergence Analysis
  General Convex Functions
Steepest Descent Method
  Euclidean and Quadratic Norms
  ℓ1-Norm