

SLIDE 1

Constrained Optimization in ℜ: Recap

SLIDE 2

Global Extrema on Closed Intervals

Recall the extreme value theorem: a continuous function f on a closed bounded interval [a, b] attains its maximum at some c ∈ [a, b] and its minimum at some d ∈ [a, b]. A consequence is that if either of c or d lies in (a, b), then it is a critical number of f; else each of c and d must lie on one of the boundaries of [a, b]. This gives us a procedure for finding the maximum and minimum of a continuous function f on a closed bounded interval I:

Procedure [Finding extreme values on closed, bounded intervals]:
1. Find the critical points in int(I).
2. Compute the values of f at the critical points and at the endpoints of the interval.
3. Select the least and greatest of the computed values.

SLIDE 3

Global Extrema on Closed Intervals (contd)

To compute the maximum and minimum values of f(x) = 4x³ − 8x² + 5x on the interval [0, 1]:

▶ We first compute f′(x) = 12x² − 16x + 5, which is 0 at x = 1/2 and x = 5/6.
▶ The values at the critical points are f(1/2) = 1 and f(5/6) = 25/27.
▶ The values at the endpoints are f(0) = 0 and f(1) = 1.
▶ Therefore, the minimum value is f(0) = 0 and the maximum value is f(1) = f(1/2) = 1.
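The three-step procedure mechanizes directly; here is a minimal sketch using SymPy on this example (the helper name `closed_interval_extrema` is ours):

```python
import sympy as sp

def closed_interval_extrema(f, x, a, b):
    # Step 1: find the critical points in the interior (a, b)
    crit = list(sp.solveset(sp.diff(f, x), x, sp.Interval.open(a, b)))
    # Step 2: compute f at the critical points and at the endpoints
    values = {p: f.subs(x, p) for p in crit + [a, b]}
    # Step 3: select the least and greatest of the computed values
    return min(values.values()), max(values.values()), values

x = sp.symbols('x')
fmin, fmax, values = closed_interval_extrema(4*x**3 - 8*x**2 + 5*x, x, 0, 1)
print(values)        # {1/2: 1, 5/6: 25/27, 0: 0, 1: 1}
print(fmin, fmax)    # 0 1
```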

In this context, it is relevant to discuss the one-sided derivatives of a function at the endpoints of the closed interval on which it is defined.


SLIDE 4

Global Extrema on Closed Intervals (contd)

Definition
[One-sided derivatives at endpoints]: Let f be defined on a closed bounded interval [a, b]. The (right-sided) derivative of f at x = a is defined as

f′(a) = lim_{h→0⁺} [f(a + h) − f(a)] / h

Similarly, the (left-sided) derivative of f at x = b is defined as

f′(b) = lim_{h→0⁻} [f(b + h) − f(b)] / h

Essentially, each of the one-sided derivatives defines a one-sided slope at the corresponding endpoint.

SLIDE 6

Global Extrema on Closed Intervals (contd)

Based on these definitions, the following result can be derived.

Claim

If f is continuous on [a, b] and f′(a) exists as a real number or as ±∞, then we have the following necessary conditions for an extremum at a:
▶ If f(a) is the maximum value of f on [a, b], then f′(a) ≤ 0 or f′(a) = −∞.
▶ If f(a) is the minimum value of f on [a, b], then f′(a) ≥ 0 or f′(a) = ∞.
If f is continuous on [a, b] and f′(b) exists as a real number or as ±∞, then we have the following necessary conditions for an extremum at b:
▶ If f(b) is the maximum value of f on [a, b], then f′(b) ≥ 0 or f′(b) = ∞.
▶ If f(b) is the minimum value of f on [a, b], then f′(b) ≤ 0 or f′(b) = −∞.

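These conditions can be eyeballed numerically on the running example f(x) = 4x³ − 8x² + 5x on [0, 1] (a small sketch; the difference step is an arbitrary choice):

```python
def one_sided_derivative(f, x, side, h=1e-6):
    # Right-sided derivative uses h > 0; left-sided uses h < 0.
    h = h if side == "right" else -h
    return (f(x + h) - f(x)) / h

f = lambda x: 4*x**3 - 8*x**2 + 5*x
# f(0) = 0 is the minimum on [0, 1]: the claim requires f'(0+) >= 0.
print(one_sided_derivative(f, 0.0, "right"))   # ~5.0 >= 0, as required
# f(1) = 1 is the maximum on [0, 1]: the claim requires f'(1-) >= 0.
print(one_sided_derivative(f, 1.0, "left"))    # ~1.0 >= 0, as required
```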

SLIDE 7

Global Extrema on Closed Intervals (contd)

The following result gives a useful procedure for finding extrema on closed intervals.

Claim

If f is continuous on [a, b] and f′′(x) exists for all x ∈ (a, b), then:
▶ If f′′(x) ≤ 0, ∀x ∈ (a, b), then the minimum value of f on [a, b] is either f(a) or f(b). If, in addition, f has a critical point c ∈ (a, b), then f(c) is the maximum value of f on [a, b].
▶ If f′′(x) ≥ 0, ∀x ∈ (a, b), then the maximum value of f on [a, b] is either f(a) or f(b). If, in addition, f has a critical point c ∈ (a, b), then f(c) is the minimum value of f on [a, b].

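A tiny numerical illustration of the first case, with our own example f(x) = −(x − 1)² on [0, 3] (so f′′ ≡ −2 ≤ 0, with critical point c = 1):

```python
import numpy as np

f = lambda x: -(x - 1)**2          # f'' = -2 <= 0 on (0, 3); critical point at c = 1
xs = np.linspace(0, 3, 100001)
ys = f(xs)
print(xs[np.argmax(ys)])           # ~1.0: the interior critical point gives the maximum
print(xs[np.argmin(ys)])           # 3.0: the minimum sits at an endpoint
```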

SLIDE 8

Global Extrema on Open Intervals

The next result is very useful for finding extrema on open intervals.

Claim

Let I be an open interval and let f′′(x) exist ∀x ∈ I.
▶ If f′′(x) ≥ 0, ∀x ∈ I, and if there is a number c ∈ I where f′(c) = 0, then f(c) is the global minimum value of f on I.
▶ If f′′(x) ≤ 0, ∀x ∈ I, and if there is a number c ∈ I where f′(c) = 0, then f(c) is the global maximum value of f on I.

For example, let f(x) = (2/3)x − sec x and I = (−π/2, π/2). Then f′(x) = 2/3 − sec x tan x = 2/3 − sin x / cos² x, which is 0 at x = π/6. Further, f′′(x) = −sec x (tan² x + sec² x) < 0 on (−π/2, π/2). Therefore, f attains the maximum value f(π/6) = π/9 − 2/√3 on I.
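The stationary point and the maximum value can be checked numerically (a sketch using SciPy root bracketing; the bracket endpoints ±1.5 are an arbitrary choice inside I):

```python
import numpy as np
from scipy.optimize import brentq

fp = lambda x: 2/3 - np.sin(x) / np.cos(x)**2   # f'(x)
c = brentq(fp, -1.5, 1.5)                        # the unique stationary point in I
print(c, np.pi / 6)                              # both ~0.523598...
f = lambda x: (2/3) * x - 1 / np.cos(x)          # f(x) = (2/3)x - sec(x)
print(f(c), np.pi/9 - 2/np.sqrt(3))              # both ~ -0.8056
```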

SLIDE 9

Global Extrema on Open Intervals (contd)

As another example, let us find the dimensions of the cone with minimum volume that can contain a sphere with radius R. Let h be the height of the cone and r the radius of its base. The objective to be minimized is the volume f(r, h) = (1/3)πr²h. The constraint between r and h is shown in Figure 10: the triangle AEF is similar to the triangle ADB, and therefore

R / (h − R) = r / √(h² + r²).

Figure 10: [figure omitted] A sphere of radius R inscribed in a cone of height h and base radius r.
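The example can be finished symbolically. Squaring the constraint and solving for r² gives r² = R²h/(h − 2R), which reduces the problem to one variable; a SymPy sketch under this elimination (the closed-form answers in the comments follow from it):

```python
import sympy as sp

h, R = sp.symbols('h R', positive=True)

# Constraint R/(h - R) = r/sqrt(h^2 + r^2), squared and solved for r^2:
r2 = R**2 * h / (h - 2*R)               # requires h > 2R for the sphere to fit
V = sp.pi * r2 * h / 3                  # volume (1/3)*pi*r^2*h as a function of h alone

h_star = sp.solve(sp.diff(V, h), h)     # stationary points
print(h_star)                           # -> [4*R]
print(sp.sqrt(r2.subs(h, 4*R)))         # -> sqrt(2)*R
print(sp.simplify(V.subs(h, 4*R)))      # -> 8*pi*R**3/3
```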

SLIDE 10

Constrained Optimization and Subgradient Descent


SLIDE 12

Constrained Optimization

Consider the objective

min_x f(x)   s.t.   g_i(x) ≤ 0, ∀i

Recall: the indicator function for g_i(x),

I_{g_i}(x) = { 0, if g_i(x) ≤ 0;  ∞, otherwise }

▶ We have shown that this is convex if each g_i(x) is convex.

Option 1: Use subgradient descent to minimize f(x) + ∑_i I_{g_i}(x).

Option 2: Barrier Method (approximate I_{g_i}(x) using some differentiable and non-decreasing function of u = g_i(x), such as −(1/t) log(−u)), Augmented Lagrangian, ADMM, etc.
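To make the barrier idea concrete: for a single constraint g(x) ≤ 0, the surrogate −(1/t) log(−g(x)) is finite only where the constraint holds strictly, blows up at the boundary, and approaches the 0/∞ indicator as t grows. A small numeric sketch (the constraint and the values of t are our own):

```python
import numpy as np

g = lambda x: x - 1.0                       # constraint g(x) = x - 1 <= 0
barrier = lambda x, t: -np.log(-g(x)) / t   # smooth surrogate for the indicator I_g

for t in (1, 10, 100):
    print(t, barrier(0.5, t))               # 0.693.., 0.069.., 0.0069..: -> 0 (feasible)
print(barrier(1.0 - 1e-9, 100))             # ~0.207: grows without bound near the boundary
```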

SLIDE 15

Option 1: (Sub)Gradient Descent with Sum of indicators

Convert our objective to the following unconstrained optimization problem. Each C_i = { x | g_i(x) ≤ 0 } is convex if g_i(x) is convex. We take

min_x F(x) = min_x f(x) + ∑_i I_{C_i}(x)

Recap a subgradient of F: h_F(x) = h_f(x) + ∑_i h_{I_{C_i}}(x). Recall that

▶ h_f(x) = ∇f(x) if f(x) is differentiable. Also, −∇f(x^k) optimizes the first order approximation for f(x) around x^k:

−∇f(x^k) = argmin_h f(x^k) + ∇ᵀf(x^k) h + (1/2)∥h∥²

Variations on the form of (1/2)∥h∥² (e.g., replacing it with an entropic regularizer) lead to Mirror Descent etc.

▶ h_{I_{C_i}}(x) is any d ∈ ℜⁿ s.t. dᵀx ≥ dᵀy, ∀y ∈ C_i. In particular, h_{I_{C_i}}(x) = 0 if x is in the interior of C_i, and there are other solutions if x is on the boundary of C_i. The analysis for convex g_i's leads to KKT conditions, Dual Ascent, etc.
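The optimality claim in the first bullet can be sanity-checked numerically: minimizing the model ∇ᵀf(x^k)h + (1/2)∥h∥² over h recovers h = −∇f(x^k). A tiny sketch (the gradient vector below is an arbitrary stand-in):

```python
import numpy as np
from scipy.optimize import minimize

grad_fk = np.array([2.0, -3.0])                  # stand-in for grad f(x^k)
model = lambda h: grad_fk @ h + 0.5 * h @ h      # first order model, constant f(x^k) dropped
h_star = minimize(model, np.zeros(2)).x
print(h_star)                                    # ~[-2.  3.] = -grad_fk, as claimed
```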

SLIDE 18

Option 1: Generalized Gradient Descent

Consider the problem of minimizing the following sum of a differentiable function f(x) and a (possibly) nondifferentiable function c(x) (an example being ∑_i I_{C_i}(x)):

min_x F(x) = min_x f(x) + c(x)

As in gradient descent, consider the first order approximation for f(x) around x^k, leaving c(x) alone, to obtain the next iterate x^{k+1}:

x^{k+1} = argmin_x f(x^k) + ∇ᵀf(x^k)(x − x^k) + (1/2t)∥x − x^k∥² + c(x)

Deleting f(x^k) from the objective and adding (t/2)∥∇f(x^k)∥² to the objective (without any loss) to complete squares, we obtain x^{k+1} as:

x^{k+1} = argmin_x (1/2t)∥x − (x^k − t∇f(x^k))∥² + c(x)

In general, such a step is called a proximal step with respect to c(x):

x^{k+1} = prox_c(x^k − t∇f(x^k)) = argmin_x (1/2t)∥x − (x^k − t∇f(x^k))∥² + c(x)

(the point closest to the unregulated gradient descent update, with a later regulation using c(x); this unregulated descent update will often be referred to as z)

SLIDE 19

prox gives you the point closest to the unregulated (w.r.t. c(x)) update while also bringing in the effect of (minimizing) c(x). Basically, we have phased the (sub)gradient descent update into two phases: (a) an unregulated update (such as a gradient descent step) for f(x) alone; (b) a course correction based on c(x).

SLIDE 20

Algorithm: The Generalized Gradient Descent

min_x f(x) + c(x)

Find a starting point x^0.
Set k = 1.
repeat
  1. Choose a step size t^k ∝ 1/k or using exact or backtracking ray search, etc.
  2. Set z^k = x^{k−1} − t^k ∇f(x^{k−1}).
  3. Set x^k = prox_c(z^k).
  4. Set k = k + 1.
until the stopping criterion (such as ∥x^k − x^{k−1}∥ ≤ ϵ or f(x^k) > f(x^{k−1})) is satisfied.ᵃ

ᵃ Better criteria can be found using Lagrange duality theory, etc.

Figure 11: The generalized gradient descent algorithm.
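A compact sketch of this algorithm in Python (the example at the bottom is our own illustration: f(x) = ∥x − b∥² with c the indicator of the box [0, 1]ⁿ, whose prox is simply the projection onto the box; a fixed step size stands in for a ray search):

```python
import numpy as np

def generalized_gradient_descent(grad_f, prox_c, x0, t=0.1, eps=1e-8, max_iter=1000):
    """min f(x) + c(x): gradient step on f, then a prox (course-correction) step w.r.t. c."""
    x = x0
    for k in range(max_iter):
        z = x - t * grad_f(x)                   # step 2: unregulated gradient descent update
        x_new = prox_c(z, t)                    # step 3: proximal step
        if np.linalg.norm(x_new - x) <= eps:    # stopping criterion
            return x_new
        x = x_new
    return x

b = np.array([1.5, -0.3])
grad_f = lambda x: 2 * (x - b)                   # f(x) = ||x - b||^2
prox_box = lambda z, t: np.clip(z, 0.0, 1.0)     # prox of the box indicator = projection
print(generalized_gradient_descent(grad_f, prox_box, np.zeros(2)))   # -> [1. 0.]
```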

SLIDE 23

Option 1: Generalized Gradient Descent

Interesting because in many settings, prox_c(z) can be computed efficiently:

prox_c(z) = argmin_x (1/2t) ∥x − z∥² + c(x)

Theorem: If c is a proper convex⁹ function with a closed epigraph, then (for t > 0) prox_c(z) has a unique value. Hint: the quadratic term introduces strong convexity ⇒ strict convexity, and a strictly convex function has a unique minimizer. For non-convex c, the solution set is non-empty under similar conditions.

A prox calculus (fill in the ??):

c(x) =                                                        For t = 1, prox_c(z) =
Constant: c                                                   ??
Affine: aᵀx + b                                               ??
Convex quadratic: (1/2)xᵀAx + bᵀx + c (A ∈ S₊ⁿ, b ∈ ℜⁿ)       ??
Sum over components: c(x) = ∑_{i=1}ⁿ c_i(x_i)                 ??
λc(x/λ)                                                       ??
c(λx + a)                                                     ??
c(x) + aᵀx + (β/2)∥x∥² + γ                                    ??
c(Ax + b)                                                     ??
c(∥x∥)                                                        ??

For x ∈ ℜ, c(x) =                                             For z ∈ ℜ & t = 1, prox_c(z) =
Simplified Lasso: λ|x|                                        ??
µx if x ≥ 0; ∞ if x < 0                                       ??
λx³ if x ≥ 0; ∞ if x < 0                                      ??
−λ log x if x > 0; ∞ if x ≤ 0 (a log-barrier function)        ??
δ_{[0,η]∩ℜ}                                                   ??

⁹ c takes values in the extended real number line such that c(x) < +∞ for at least one x and c(x) > −∞ for every x.

SLIDE 25

Iterative Soft Thresholding Algorithm (Proximal Subgradient Descent) for Lasso

Let f(x) = ∥Ax − y∥₂², c(x) = λ∥x∥₁ and F(x) = f(x) + c(x).

Proximal Subgradient Descent Algorithm:
Initialization: Find a starting point x^(0).
1. ▶ Let z^(k+1) be the next gradient descent iterate for f at x^(k).
2. ▶ Compute prox_{λ∥x∥₁}(z^(k+1)) = x^(k+1) = argmin_x (1/2t) ∥x − z^(k+1)∥₂² + λ∥x∥₁ by setting a subgradient of this objective to 0 (see https://www.cse.iitb.ac.in/~cs709/notes/enotes/lassoElaboration.pdf). The vector x^(k+1) is obtained by componentwise minimization:
   ▶ If z_i^(k+1) > λt, then x_i^(k+1) = z_i^(k+1) − λt.
   ▶ If z_i^(k+1) < −λt, then x_i^(k+1) = z_i^(k+1) + λt.
   ▶ Otherwise, x_i^(k+1) = 0.
   That is, if the unregulated update z_i is greater than λt it is reduced by that amount, if it is less than −λt it is increased by that amount, and otherwise it is set to 0: soft thresholding.
3. ▶ Set k = k + 1, until the stopping criterion is satisfied (such as no significant change in x^(k) w.r.t. x^(k−1)).
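A direct implementation of this algorithm (a sketch: the step size 1/L uses the Lipschitz constant of ∇f, the stopping rule follows the slide, and the data at the bottom is made up for illustration):

```python
import numpy as np

def ista(A, y, lam, max_iter=500, eps=1e-8):
    """Proximal (sub)gradient descent for min ||Ax - y||_2^2 + lam*||x||_1."""
    t = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)       # 1/L, L = Lipschitz const of grad f
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        z = x - t * (2 * A.T @ (A @ x - y))         # unregulated gradient step on f
        x_new = np.sign(z) * np.maximum(np.abs(z) - lam * t, 0.0)   # soft thresholding
        if np.linalg.norm(x_new - x) <= eps:        # no significant change in x
            return x_new
        x = x_new
    return x

# Tiny usage example with made-up data:
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
y = A @ np.array([1.0, 0.0, -2.0, 0.0, 0.0]) + 0.01 * rng.standard_normal(20)
print(ista(A, y, lam=0.5))    # sparse estimate, close to the planted coefficients
```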

SLIDE 26

Tables for the Proximal Operator

prox_c(z) = argmin_x (1/2t) ∥x − z∥² + c(x)

For x ∈ ℜ, c(x) =                                    For z ∈ ℜ & t = 1, prox_c(z) =
Simplified Lasso: λ|x|                               [|z| − λ]₊ sign(z)
µx if x ≥ 0; ∞ if x < 0                              [z − µ]₊
λx³ if x ≥ 0; ∞ if x < 0                             (−1 + √(1 + 12λ[z]₊)) / (6λ)
−λ log x if x > 0; ∞ if x ≤ 0                        (z + √(z² + 4λ)) / 2
δ_{[0,η]∩ℜ}                                          min{max{z, 0}, η}
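Entries in such tables can be sanity-checked by brute-force minimization of the prox objective (a sketch; the grid bounds and test points are arbitrary choices of ours):

```python
import numpy as np

def prox_numeric(c, z, t=1.0):
    """Brute-force prox_c(z): minimize (1/2t)*(x - z)^2 + c(x) over a fine grid."""
    xs = np.linspace(-10, 10, 2_000_001)
    return xs[np.argmin((xs - z) ** 2 / (2 * t) + c(xs))]

lam = 0.5
c_l1 = lambda x: lam * np.abs(x)                       # simplified Lasso row
soft = lambda z: np.sign(z) * max(abs(z) - lam, 0.0)   # closed form from the table
for z in (-2.0, 0.3, 1.7):
    print(z, soft(z), prox_numeric(c_l1, z))           # grid answer matches closed form
```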