Descent Algorithms for Optimizing Unconstrained Problems


SLIDE 1

Descent Algorithms for Optimizing Unconstrained Problems

Techniques relevant for most (convex) optimization problems that do not lend themselves to closed-form solutions. We will start with unconstrained minimization.

min_{x∈D} f(x)

For analysis: assume that f is convex and differentiable and that it attains a finite optimal value p*.

Minimization techniques produce a sequence of points x^(k) ∈ D, k = 0, 1, ..., such that f(x^(k)) → p* as k → ∞, or ∇f(x^(k)) → 0 as k → ∞. Iterative techniques for optimization further require a starting point x^(0) ∈ D and sometimes that epi(f) is closed. The epi(f) can be inferred to be closed either if

  • D = ℜ^n, or
  • f(x) → ∞ as x → ∂D.

The function f(x) = 1/x for x > 0 satisfies the second condition, so its epi(f) is closed even though D ≠ ℜ^n.


SLIDE 2

Descent Algorithms

Descent methods for minimization have been in use for the last 70 years or more. General idea: the next iterate x^(k+1) is the current iterate x^(k) added with a descent or search direction ∆x^(k) (a unit vector), which is multiplied by a scale factor t^(k), called the step length:

x^(k+1) = x^(k) + t^(k)∆x^(k)

The incremental step is determined while aiming that f(x^(k+1)) < f(x^(k)); ideally we make progress in every iteration.

We assume that we are dealing with the extended value extension f̃ of the convex function f : D → ℜ, with D ⊆ ℜ^n, which returns ∞ for any point outside its domain:

Definition

f̃(x) = { f(x) if x ∈ D; ∞ if x ∉ D }    (27)

However, if we do so, we need to make sure that the initial point indeed lies in the domain D.
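A minimal sketch of the extended-value extension (27) in Python; the helper names (f_tilde, in_domain) are illustrative assumptions, not from the slides:

```python
import math

def f_tilde(f, in_domain, x):
    """Extended-value extension (27): f(x) inside the domain D, +inf outside."""
    return f(x) if in_domain(x) else math.inf

# Example with f(x) = 1/x on D = {x : x > 0}
print(f_tilde(lambda x: 1.0 / x, lambda x: x > 0, 2.0))   # 0.5
print(f_tilde(lambda x: 1.0 / x, lambda x: x > 0, -1.0))  # inf
```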


SLIDE 3

Descent Algorithms

A single iteration of the general descent algorithm consists of two main steps, viz.,

1. determining a good descent direction ∆x^(k), which is typically forced to have unit norm, and
2. determining the step size using some line search technique.

GIVEN: if the function f is convex, then from the necessary and sufficient condition for convexity, restated here for reference,

f(x^(k+1)) ≥ f(x^(k)) + ∇ᵀf(x^(k))(x^(k+1) − x^(k))

NEED: we require that f(x^(k+1)) < f(x^(k)), and since t^(k) > 0, we need a necessary condition on ∆x^(k) based on what is given.


SLIDE 4

Descent Algorithms

A single iteration of the general descent algorithm consists of two main steps, viz.,

1. determining a good descent direction ∆x^(k), which is typically forced to have unit norm, and
2. determining the step size using some line search technique.

If the function f is convex, from the necessary and sufficient condition for convexity, restated here for reference:

f(x^(k+1)) ≥ f(x^(k)) + ∇ᵀf(x^(k))(x^(k+1) − x^(k))

We require that f(x^(k+1)) < f(x^(k)), and since t^(k) > 0, we must have

∇ᵀf(x^(k))∆x^(k) < 0

That is, the descent direction ∆x^(k) must make a (sufficiently) obtuse angle (θ ∈ (π/2, 3π/2)) with the gradient vector. A natural choice of ∆x^(k) that satisfies the above necessary condition is ∆x^(k) = −∇f(x^(k)). Since the inequality above is only necessary, the step size t^(k) must still be chosen so that the function value actually decreases.
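As a quick numeric check of this condition in Python, using the Homework 1 quadratic that appears later in the deck (the helper names are mine):

```python
import numpy as np

def grad_f(x):
    """Gradient of f(x) = x1^2 - 4*x1 + 2*x1*x2 + 2*x2^2 + 2*x2 + 14 (Homework 1)."""
    return np.array([2*x[0] - 4 + 2*x[1], 2*x[0] + 4*x[1] + 2])

x = np.array([4.0, -4.0])
g = grad_f(x)                  # [-4. -6.]
dx = -g / np.linalg.norm(g)    # unit-norm negative gradient
print(g @ dx)                  # -7.21... < 0, so dx is a valid descent direction
```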


SLIDE 5

Descent Algorithms (contd.)

Find a starting point x^(0) ∈ D
repeat

  • 1. Determine ∆x^(k).
  • 2. Choose a step size t^(k) > 0 using ray search.ᵃ
  • 3. Obtain x^(k+1) = x^(k) + t^(k)∆x^(k).
  • 4. Set k = k + 1.

until stopping criterion (such as ||∇f(x^(k+1))|| < ϵ) is satisfied

ᵃ Many textbooks refer to this as line search, but we prefer to call it ray search, since the step must be positive.

Figure 7: The general descent algorithm.

There are many different empirical techniques for ray search, though it matters much less than the search for the descent direction. These techniques reduce the n-dimensional problem to a 1-dimensional problem, which can be easy to solve by plotting and eyeballing, or even by exact search.
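A minimal runnable sketch of Figure 7 in Python. For simplicity it uses the raw negative gradient as ∆x^(k) with a fixed step t (the slides take ∆x^(k) unit-norm with t chosen by ray search); all names and parameter values are illustrative assumptions:

```python
import numpy as np

def descent(f, grad_f, x0, t=0.1, eps=1e-6, max_iter=10_000):
    """General descent algorithm (Figure 7): x <- x + t*dx until ||grad f(x)|| < eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:   # stopping criterion
            break
        dx = -g                       # descent direction (gradient descent)
        x = x + t * dx
    return x

# Homework 1 quadratic (minimum at (5, -3)):
f = lambda x: x[0]**2 - 4*x[0] + 2*x[0]*x[1] + 2*x[1]**2 + 2*x[1] + 14
grad = lambda x: np.array([2*x[0] - 4 + 2*x[1], 2*x[0] + 4*x[1] + 2])
print(descent(f, grad, [4.0, -4.0]))   # ~[ 5. -3.]
```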


SLIDE 6

Finding the step size t

If t is too large, we get diverging updates of x.
If t is too small, we get a very slow descent.
We need to find a t that is just right. We discuss two ways of finding t:

1. Exact ray search
2. Backtracking ray search
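A tiny illustration of these two failure modes (my example, not from the slides), using gradient descent on f(x) = x², where the update is x ← x − t·f′(x) = (1 − 2t)x:

```python
# Fixed-step gradient descent on f(x) = x^2
for t in (1.1, 0.001, 0.25):
    x = 2.0
    for _ in range(20):
        x -= t * 2 * x
    print(f"t = {t}: x after 20 steps = {x:.4g}")
# t = 1.1 diverges, t = 0.001 barely moves, t = 0.25 is "just right"
```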


SLIDE 7

Exact ray search

This method gives the optimal step size in the given descent direction ∆x^k. It ensures that f(x^(k+1)) ≤ f(x^k). Why?

Choose a step length that minimizes the function in the chosen descent direction:

t^(k+1) = argmin_t f(x^k + t∆x^k) = argmin_t ϕ(t)

Given the myopic goal of making f(x^(k+1)) as small as possible relative to f(x^k), this is the best possible choice.

SLIDE 8

Exact ray search

t^(k+1) = argmin_t f(x^k + t∆x^k) = argmin_t ϕ(t)

This method gives the optimal step size in the given descent direction ∆x^k. It ensures that f(x^(k+1)) ≤ f(x^k). Why? Because

ϕ(t^(k+1)) = f(x^k + t^(k+1)∆x^k) = min_t ϕ(t) = min_t f(x^k + t∆x^k) ≤ ϕ(0) = f(x^k)

From the last class: a convex function f restricted to a line (ϕ) is also convex along that line (that is, along t).

Homework 1: Consider the function

f(x) = x₁² − 4x₁ + 2x₁x₂ + 2x₂² + 2x₂ + 14

This function has a minimum at x = (5, −3)ᵀ. Suppose you are at the point (4, −4)ᵀ after a few iterations, and ∆x = −∇f(x) at every x. Using the exact line search algorithm, find the point for the next iteration. In how many steps will the algorithm converge?
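Since ϕ(t) here is a one-dimensional quadratic, the exact step has a closed form: writing f(x) = (1/2)xᵀAx + bᵀx + c (my rewriting of the homework function) and descent direction d, setting ϕ′(t) = 0 gives t* = −∇f(x)ᵀd / (dᵀAd). A sketch:

```python
import numpy as np

# Homework 1 in matrix form: f(x) = 0.5 x^T A x + b^T x + 14
A = np.array([[2.0, 2.0], [2.0, 4.0]])
b = np.array([-4.0, 2.0])
grad_f = lambda x: A @ x + b

x = np.array([4.0, -4.0])
d = -grad_f(x)                        # steepest-descent direction: (4, 6)
t = -(grad_f(x) @ d) / (d @ A @ d)    # exact minimizer of phi(t) = f(x + t*d)
print(t, x + t * d)                   # t = 13/68 ~ 0.191, next point ~ (4.76, -2.85)
```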


SLIDE 9

Ray Search for Descent: Options

1. Exact ray search: The exact ray search seeks a scaling factor t that satisfies

t = argmin_{t>0} f(x + t∆x)    (28)

This might itself require us to invoke a numerical solver for ϕ(t), which may be expensive. But more importantly, is it worth it? Can we look at the geometry of descent and come up with some intuitive criteria that a ray search should meet?

1) Sufficient decrease in the function
2) Sufficient decrease in the slope after the update
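When no closed form is available, the 1-dimensional problem ϕ(t) can be handed to a scalar solver; a sketch using SciPy's bounded minimizer (the upper bound 10.0 is an arbitrary assumption, since a bounded method needs a finite interval):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def exact_ray_search(f, x, dx, t_max=10.0):
    """Numerically minimize phi(t) = f(x + t*dx) over 0 < t <= t_max."""
    res = minimize_scalar(lambda t: f(x + t * dx), bounds=(0.0, t_max), method="bounded")
    return res.x

f = lambda x: x[0]**2 - 4*x[0] + 2*x[0]*x[1] + 2*x[1]**2 + 2*x[1] + 14
x, dx = np.array([4.0, -4.0]), np.array([4.0, 6.0])   # dx = -grad f(x), as in Homework 1
print(exact_ray_search(f, x, dx))                     # ~0.1912, matching the closed form
```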


SLIDE 10

Ray Search for Descent: Options

1. Exact ray search: The exact ray search seeks a scaling factor t that satisfies

t = argmin_{t>0} f(x + t∆x)    (28)

2. Backtracking ray search: The exact line search may not be feasible or could be expensive to compute for complex non-linear functions. A relatively simpler ray search iterates over values of the step size, starting from 1 and scaling it down by a factor of β ∈ (0, 1/2) after every iteration, till the following condition, called the Armijo condition, is satisfied for some 0 < c₁ < 1:

f(x + t∆x) ≤ f(x) + c₁t∇ᵀf(x)∆x    (29)

Based on the first order convexity condition, it can be inferred that when c₁ = 1, the right hand side of (29) gives a lower bound on the value of f(x + t∆x), and hence (29) cannot hold.


SLIDE 11

Ray Search for Descent: Options

1. Exact ray search: The exact ray search seeks a scaling factor t that satisfies

t = argmin_{t>0} f(x + t∆x)    (28)

2. Backtracking ray search: The exact line search may not be feasible or could be expensive to compute for complex non-linear functions. A relatively simpler ray search iterates over values of the step size, starting from 1 and scaling it down by a factor of β ∈ (0, 1/2) after every iteration, till the following condition, called the Armijo condition, is satisfied for some 0 < c₁ < 1:

f(x + t∆x) ≤ f(x) + c₁t∇ᵀf(x)∆x    (29)

Note that c₁t∇ᵀf(x)∆x is a negative term, assuming ∆x is a descent direction. Based on the first order convexity condition, it can be inferred that when c₁ = 1, the inequality in (29) cannot hold (and gets flipped).


SLIDE 12

Backtracking ray search

The algorithm:

▶ Choose a β ∈ (0, 1)
▶ Start with t = 1
▶ Until f(x + t∆x) ≤ f(x) + c₁t∇ᵀf(x)∆x, do
  ⋆ Update t ← βt

Here c₁ is fixed to a value in (0, 1) so that sufficient decrease is ensured when the search for t is complete.

Questions:
1) What is a good choice of c₁ in (0, 1)? A value farther from 1 makes the Armijo condition feasible to satisfy; a value farther from 0 makes the decrease sufficient. Often c₁ = 0.5.
2) Will the Armijo condition be satisfied for any given c₁ in (0, 1)? Given that l(t) = f(x) + t∇ᵀf(x)∆x is the tightest supporting hyperplane, for any c₁ < 1 the rotated line f(x) + c₁t∇ᵀf(x)∆x should intersect the graph of the function. Hence the Armijo condition holds for all sufficiently small t, and the backtracking loop terminates.
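A runnable sketch of this backtracking loop in Python (β = 0.5, and c₁ = 0.5 following the slide's suggestion; the cap on halvings is my safeguard, not part of the slides):

```python
import numpy as np

def backtracking(f, grad_f, x, dx, beta=0.5, c1=0.5, max_halvings=50):
    """Backtracking ray search: start at t = 1, shrink by beta until (29) holds."""
    t = 1.0
    slope = grad_f(x) @ dx            # negative, assuming dx is a descent direction
    for _ in range(max_halvings):
        if f(x + t * dx) <= f(x) + c1 * t * slope:   # Armijo condition (29)
            break
        t *= beta
    return t

f = lambda x: x[0]**2 - 4*x[0] + 2*x[0]*x[1] + 2*x[1]**2 + 2*x[1] + 14
grad = lambda x: np.array([2*x[0] - 4 + 2*x[1], 2*x[0] + 4*x[1] + 2])
x = np.array([4.0, -4.0])
print(backtracking(f, grad, x, -grad(x)))   # 0.125 for Homework 1's function
```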

SLIDE 13

Interpretation of backtracking line search

∆x = direction of descent = −∇f(x^k) for gradient descent.

A different way of understanding the varying step size with β: multiplying t by β causes the interpolating line to tilt, as indicated in the figure. In the best case, with t = 1 and a valid descent direction, the Armijo condition holds right at the first step; in general, we might have to make several β updates before the Armijo condition is met.

Homework 2: Let f(x) = x² for x ∈ ℜ. Let x^0 = 2 and ∆x^k = −1 for all k (a valid descent direction for x > 0), and suppose x^k = 1 + 2^(−k). What is the step size t^k implicitly being used? While t^k satisfies the Armijo condition (determine a c₁), is this choice of step size okay?

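Since x^(k+1) = x^k − t^k, the implicit step is t^k = x^k − x^(k+1) = 2^(−(k+1)). A small check in Python (a hint sketch with c₁ = 0.5 assumed, not the full homework answer):

```python
# Homework 2: x^k = 1 + 2**-k, dx = -1, so the implicit step is t^k = 2**-(k+1)
f = lambda x: x**2
for k in range(5):
    x, t = 1 + 2**-k, 2**-(k + 1)
    lhs = f(x - t)                          # f(x^(k+1))
    rhs = f(x) + 0.5 * t * (2 * x) * (-1)   # Armijo RHS with c1 = 0.5, f'(x) = 2x, dx = -1
    print(k, t, lhs <= rhs)                 # True for every k
# Yet x^k -> 1 while the minimizer is 0: satisfying Armijo alone does not
# guarantee convergence when the steps shrink too quickly.
```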