(Log) Barrier methods
November 9, 2018 339 / 429
Barrier Methods for Constrained Optimization

Consider a more general constrained optimization problem:

min_{x ∈ ℝⁿ} f(x)   s.t.  g_i(x) ≤ 0, i = 1…m,  and  Ax = b

Possible reformulations of this problem include

min_x f(x) + λ B(x)

where B is a barrier function like:
1. B(x) = (ρ/2) ‖Ax − b‖₂² (in the Augmented Lagrangian; for a specific type of strong convexity w.r.t. ‖.‖₂)
2. B(x) = Σ_i I_{g_i}(x) (Projected Gradient Descent is built on this and a linear approximation of f(x))
3. B(x) = Σ_i φ_{g_i}(x), where φ_{g_i}(x) = −(1/t) log(−g_i(x))
▶ Here −1/t is used instead of λ. Let's discuss this in more detail.
November 9, 2018 340 / 429
As a very simple example, consider the following inequality-constrained optimization problem:

minimize x²   subject to  x ≥ 1

The logarithmic barrier formulation of this problem is

minimize x² − μ ln(x − 1)

The unconstrained minimizer of this convex logarithmic barrier function is

x̂(μ) = 1/2 + (1/2)√(1 + 2μ)

As μ → 0, the optimal point of the logarithmic barrier problem approaches the actual point of optimality x = 1 (which, as we can see, lies on the boundary of the feasible region). The generalized idea, that as μ → 0, f(x̂) → p* (where p* is the optimal value for the primal), will be proved next.
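A quick numerical check of this example (a sketch; the function names are ours). The closed-form minimizer comes from setting the derivative 2x − μ/(x − 1) to zero:

```python
import math

# Log-barrier formulation of:  minimize x^2  subject to  x >= 1
def barrier_objective(x, mu):
    return x * x - mu * math.log(x - 1.0)

def closed_form_minimizer(mu):
    # Stationarity: 2x - mu/(x - 1) = 0  =>  2x^2 - 2x - mu = 0
    return 0.5 + 0.5 * math.sqrt(1.0 + 2.0 * mu)

# As mu shrinks, the barrier minimizer approaches the constrained optimum x = 1
for mu in (1.0, 0.1, 0.001):
    print(mu, closed_form_minimizer(mu))
```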
November 9, 2018 341 / 429
Recap:

  Problem type     Objective   Constraints   L*(λ)     Dual constraints   Strong duality
  Linear Program   cᵀx         Ax ≤ b        −bᵀλ      Aᵀλ + c = 0        Feasible primal

What are the necessary conditions at primal-dual optimality?
November 9, 2018 342 / 429
The log barrier function is defined as

B(x) = Σ_i φ_{g_i}(x) = (1/t) Σ_i (−log(−g_i(x)))

It approximates Σ_i I_{g_i}(x) (a better approximation as t → ∞).

f(x) + Σ_i φ_{g_i}(x) is convex if f and the g_i are convex.
▶ Why? φ_{g_i}(x) is the negative of a monotonically increasing concave function (log) of the concave function −g_i(x).

Let λ_i be the Lagrange multiplier associated with the inequality constraint g_i(x) ≤ 0. We have taken care of the inequality constraints; let's also consider an equality constraint Ax = b with corresponding Lagrange multiplier (vector) ν.
November 9, 2018 343 / 429
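A small sketch (ours, not from the slides) of how the log-barrier term mimics the indicator: for a strictly feasible point the term vanishes as t grows, and for a fixed t it blows up as the boundary is approached:

```python
import math

# phi_{g_i}(x) = -(1/t) * log(-g_i(x)), defined only where g_i(x) < 0
def log_barrier_term(g_val, t):
    return -(1.0 / t) * math.log(-g_val)

# At a strictly feasible point (g(x) = -0.5 here, an arbitrary choice) the term
# shrinks toward 0 as t grows, matching I(x) = 0 inside the feasible set; for
# fixed t, as g(x) -> 0^- the term blows up, mimicking the indicator's wall.
for t in (1.0, 10.0, 1000.0):
    print(t, log_barrier_term(-0.5, t), log_barrier_term(-1e-9, t))
```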
Our objective becomes

min_x f(x) + (1/t) Σ_i (−log(−g_i(x)))   s.t.  Ax = b

At different values of t, we get a different x*(t). Let

λ*_i(t) = −1 / (t g_i(x*(t)))

First-order necessary conditions for optimality (and strong duality)¹⁷ at x*(t), λ*_i(t):
1. g_i(x*(t)) ≤ 0
2. Ax*(t) = b
3. ∇f(x*(t)) + Σ_{i=1}^m λ*_i(t) ∇g_i(x*(t)) + A⊤ν*(t) = 0
4. λ*_i(t) ≥ 0
⋆ since g_i(x*(t)) ≤ 0 and t ≥ 0

All the above conditions hold at the optimal solution x*(t), ν*(t) of the barrier problem ⇒ (λ*_i(t), ν*(t)) are dual feasible. (Only complementary slackness is violated.)

¹⁷ of the original problem

November 9, 2018 345 / 429
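To make these conditions concrete, here is a small numerical sketch (ours) for the running 1-D example min x² s.t. x ≥ 1, i.e. g(x) = 1 − x with no equality constraint: at the barrier minimizer x*(t), the multiplier λ*(t) = −1/(t g(x*(t))) is nonnegative and stationarity holds.

```python
import math

# Toy problem: f(x) = x^2, one constraint g(x) = 1 - x <= 0, no equality constraint
t = 50.0
x_t = 0.5 + 0.5 * math.sqrt(1.0 + 2.0 / t)   # minimizer of x^2 - (1/t)*log(x - 1)
g_t = 1.0 - x_t                              # condition 1: g(x*(t)) < 0 (strictly feasible)
lam_t = -1.0 / (t * g_t)                     # condition 4: lambda*(t) = -1/(t g(x*(t))) >= 0
stationarity = 2.0 * x_t + lam_t * (-1.0)    # condition 3: grad f + lambda* grad g = 0
print(g_t, lam_t, stationarity)
```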
If the necessary conditions are satisfied, and if f and the g_i's are convex, and the g_i's strictly feasible, the conditions are also sufficient. Thus, x*(t), λ*(t), ν*(t) form a critical point for the Lagrangian

L(x, λ, ν) = f(x) + Σ_{i=1}^m λ_i g_i(x) + ν⊤(Ax − b)

Lagrange dual function:

L*(λ, ν) = min_x L(x, λ, ν)

L*(λ*(t), ν*(t)) = f(x*(t)) + Σ_{i=1}^m λ*_i(t) g_i(x*(t)) + ν*(t)⊤(Ax*(t) − b) = f(x*(t)) − m/t

(each term λ*_i(t) g_i(x*(t)) equals −1/t, and Ax*(t) − b = 0)

▶ m/t here is the duality gap upper bound.
▶ As t → ∞, the duality gap → 0, but computing the optimal solution x*(t) of the barrier problem becomes that much harder.

November 9, 2018 347 / 429
At optimality, primal optimal = dual optimal, i.e., p* = d*. From weak duality,

f(x*(t)) − m/t ≤ p*   ⟹   f(x*(t)) − p* ≤ m/t

▶ The duality gap is always ≤ m/t.
▶ The more we increase t, the smaller the duality gap will be.

Log Barrier method: start with a small t (conservative about the feasibility set), iteratively solve the barrier formulation (starting from the solution to the previous iteration), and increase the value of t.
November 9, 2018 348 / 429
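For the running toy problem min x² s.t. x ≥ 1 (so p* = 1 and m = 1), the bound f(x*(t)) − p* ≤ m/t can be checked numerically (a sketch, using the closed-form barrier minimizer derived earlier):

```python
import math

# Toy problem: minimize x^2 s.t. x >= 1, so p* = 1 and m = 1 constraint
def barrier_minimizer(t):
    return 0.5 + 0.5 * math.sqrt(1.0 + 2.0 / t)

for t in (1.0, 10.0, 100.0):
    gap = barrier_minimizer(t) ** 2 - 1.0   # f(x*(t)) - p*
    print(t, gap, 1.0 / t)                  # the gap stays within the bound m/t
```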
Also known as the sequential unconstrained minimization technique (SUMT), barrier method, and path-following method.

Start with t = t⁽⁰⁾, μ > 1, and a tolerance ϵ.
Repeat:
1. Solve
   x*(t) = argmin_x f(x) + Σ_{i=1}^m −(1/t) log(−g_i(x))   s.t.  Ax = b
2. If m/t < ϵ, quit; else set t = μt.

▶ Scale up the value of t multiplicatively in every outer iteration.
▶ Inner iteration (solved using Dual Ascent or the Augmented Lagrangian); the Newton algorithm is especially good for this.
▶ For solving for x*(t), initialize using x*(t−1), since x*(t−1) will not be far from x*(t).
▶ Note: computing x*(t) exactly is not necessary, since the central path has no significance in itself.
▶ Small μ ⇒ faster inner iterations. Large μ ⇒ faster outer iterations (the upper bound on the duality gap will shrink quickly).

November 9, 2018 349 / 429
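The loop above can be sketched end-to-end for the running 1-D example min x² s.t. x ≥ 1. This is our illustrative code, not the slides' implementation: the inner solver is a Newton iteration with step-halving to stay strictly feasible (the slides mention Dual Ascent / Augmented Lagrangian or Newton), and t⁽⁰⁾, μ, ϵ are arbitrary choices.

```python
import math

# SUMT / barrier method sketch for:  minimize x^2  s.t.  x >= 1
# (g(x) = 1 - x, so m = 1). Inner objective: F_t(x) = x^2 - (1/t)*log(x - 1).

def inner_newton(x, t, iters=50):
    # 1-D Newton on F_t, halving the step to keep x strictly feasible (x > 1)
    for _ in range(iters):
        grad = 2.0 * x - 1.0 / (t * (x - 1.0))
        hess = 2.0 + 1.0 / (t * (x - 1.0) ** 2)
        step = grad / hess
        while x - step <= 1.0:
            step *= 0.5
        x -= step
    return x

def barrier_method(t0=1.0, mu=10.0, eps=1e-6, x0=2.0):
    t, x, m = t0, x0, 1
    while True:
        x = inner_newton(x, t)   # warm start: previous x*(t) initializes the next solve
        if m / t < eps:          # duality-gap bound m/t below tolerance: done
            return x
        t *= mu                  # scale t up multiplicatively

x = barrier_method()
print(x)  # close to the constrained optimum x = 1
```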
Central path for an LP with n = 2 and m = 6. The dashed curves show three contour lines of the logarithmic barrier function φ. The central path converges to the optimal point x* as t → ∞. Also shown is the point on the central path with t = 10. [Figure source: Boyd & Vandenberghe]
In the process, we can also obtain λ*(t) and ν*(t).

Convergence of outer iterations: we get ϵ accuracy after

log(m / (ϵ t⁽⁰⁾)) / log(μ)

updates of t.
November 9, 2018 350 / 429
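The outer-iteration count follows from needing m/(μᵏ t⁽⁰⁾) ≤ ϵ. A one-line check (the parameter values are illustrative; m = 6 matches the LP in the figure above):

```python
import math

# Smallest k with m / (mu^k * t0) <= eps, i.e. k >= log(m/(eps*t0)) / log(mu)
def outer_iterations(m, t0, mu, eps):
    return math.ceil(math.log(m / (eps * t0)) / math.log(mu))

# e.g. m = 6 constraints, t0 = 1, mu = 10, eps = 1e-6
print(outer_iterations(6, 1.0, 10.0, 1e-6))  # 7
```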
The inner optimization in the iterative algorithm using a barrier method,

x*(t) = argmin_x f(x) + (1/t) Σ_i (−log(−g_i(x)))   s.t.  Ax = b

can be solved using (sub)gradient descent, starting from the value of x from the previous iteration. We must start with a strictly feasible x, otherwise −log(−g_i(x)) → ∞.
November 9, 2018 351 / 429
Basic Phase I method:

x⁽⁰⁾ = argmin_{x,Γ} Γ   s.t.  g_i(x) ≤ Γ

We solve this using the barrier method, and thus will also need a strictly feasible starting point x̂⁽⁰⁾. Here,

Γ = max_{i=1…m} g_i(x̂⁽⁰⁾) + δ,   where δ > 0

▶ i.e., Γ is slightly larger than the largest g_i(x̂⁽⁰⁾).
November 9, 2018 353 / 429
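A small sketch (ours) of the Phase I setup. The constraints, starting point, and δ are all hypothetical choices; the point is that (x̂⁽⁰⁾, Γ) is strictly feasible for the relaxed problem even when x̂⁽⁰⁾ itself is infeasible, and a negative optimal Γ* certifies strict feasibility.

```python
# Hypothetical constraints g1(x) = x - 3 <= 0 and g2(x) = -x - 1 <= 0
# (feasible set: -1 <= x <= 3); all names and numbers are illustrative only
gs = [lambda x: x - 3.0, lambda x: -x - 1.0]

x_hat, delta = 10.0, 0.1                    # starting guess: need NOT be feasible
Gamma0 = max(g(x_hat) for g in gs) + delta  # slightly above the largest g_i(x_hat)
# (x_hat, Gamma0) is strictly feasible for the Phase I problem g_i(x) <= Gamma,
# so the barrier method can minimize Gamma from there. For these constraints the
# minimizer of max_i g_i(x) is x = 1, giving Gamma* = -2 < 0: strictly feasible.
x0 = 1.0
Gamma_star = max(g(x0) for g in gs)
print(Gamma0, Gamma_star)
```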
On solving this optimization for finding x⁽⁰⁾:
▶ If Γ* < 0, x⁽⁰⁾ is strictly feasible.
▶ If Γ* = 0, x⁽⁰⁾ is feasible (but not strictly).
▶ If Γ* > 0, x⁽⁰⁾ is not feasible.

A slightly ‘richer’ problem can consider a different Γ_i for each g_i, to improve numerical precision:

x⁽⁰⁾ = argmin_{x, Γ₁…Γ_m} max_i Γ_i   s.t.  g_i(x) ≤ Γ_i
November 9, 2018 354 / 429
The choice of a good x̂⁽⁰⁾ or x⁽⁰⁾ depends on the nature/class of the problem; use domain knowledge to decide it.
November 9, 2018 355 / 429
We need not obtain x*(t) exactly in each outer iteration. If we do not solve for x*(t) exactly, we will get ϵ accuracy after more than log(m / (ϵ t⁽⁰⁾)) / log(μ) updates of t.

▶ However, solving the inner iteration exactly may take too much time.
▶ Fewer inner-loop iterations correspond to more outer-loop iterations: a tradeoff.

Second-order descent algorithms (such as Newton descent) are found effective in such settings for the following reasons:
▶ They account for the curvature of the function; useful to converge to x(μt) quickly from x(t).
▶ Quadratic convergence when close to x*(t).
▶ Less (or no) dependence on the step size t_k.

Recall: curvature is naturally characterized by the Hessian.

November 9, 2018 356 / 429
Proved in Boyd. Accounting for curvature reduces sensitivity to the step size.
November 9, 2018 357 / 429
[Figure: Newton's method vs. gradient descent (which takes a roundabout path)]
November 9, 2018 358 / 429
This choice of Δx⁽ᵏ⁺¹⁾ corresponds to the direction of steepest descent under the matrix norm¹⁹ induced by the Hessian ∇²f(x⁽ᵏ⁾):

Δx = argmin_v { ∇f(x⁽ᵏ⁾)ᵀ v  |  ‖v‖_{∇²f(x⁽ᵏ⁾)} = 1 }

Equivalently, it is based on approximating the function around the current iterate x⁽ᵏ⁾ using a second-degree Taylor expansion:

Q(x) ≈ f̃(x) = f(x⁽ᵏ⁾) + ∇f(x⁽ᵏ⁾)ᵀ(x − x⁽ᵏ⁾) + (1/2)(x − x⁽ᵏ⁾)ᵀ ∇²f(x⁽ᵏ⁾)(x − x⁽ᵏ⁾)

Convex f ⇒ the Hessian is positive semi-definite ⇒ Q(x) will ALSO be a convex quadratic approximation. (The second-order information in the Hessian gives an ellipsoid aligned with the curvature in that specific region.)

Newton's method is based on solving the approximation exactly. Setting the gradient of the quadratic approximation (with respect to x) to 0 gives

∇f(x⁽ᵏ⁾) + ∇²f(x⁽ᵏ⁾)(x⁽ᵏ⁺¹⁾ − x⁽ᵏ⁾) = 0

Assuming ∇²f(x⁽ᵏ⁾) is invertible, the next iterate is

x⁽ᵏ⁺¹⁾ = x⁽ᵏ⁾ − (∇²f(x⁽ᵏ⁾))⁻¹ ∇f(x⁽ᵏ⁾)

¹⁹ ‖v‖_{∇²f(x⁽ᵏ⁾)} = (vᵀ ∇²f(x⁽ᵏ⁾) v)^(1/2)

November 9, 2018 358 / 429
Find a starting point x⁽⁰⁾ ∈ D.
Select an appropriate tolerance ϵ > 0.
repeat:
1. Δx⁽ᵏ⁾ = −(∇²f(x⁽ᵏ⁾))⁻¹ ∇f(x⁽ᵏ⁾)
2. λ² = ∇f(x⁽ᵏ⁾)ᵀ (∇²f(x⁽ᵏ⁾))⁻¹ ∇f(x⁽ᵏ⁾)   (the directional derivative in the Newton direction)
3. If λ²/2 ≤ ϵ, quit.
4. x⁽ᵏ⁺¹⁾ = x⁽ᵏ⁾ + Δx⁽ᵏ⁾
until convergence.

Figure 32: Newton's method, which typically uses a step size of 1. Δx⁽ᵏ⁾ can be shown to always be a Descent Direction (Theorem 83 of the notes). For x ∈ ℝⁿ, each Newton step takes O(n³) time (without using any fast matrix multiplication methods). The most expensive step is computing the Hessian inverse.
November 9, 2018 359 / 429
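The algorithm in Figure 32 can be sketched as follows (our code; for clarity this is the scalar case, where the Hessian inverse is just 1/f''(x), and the test objective is a hypothetical choice):

```python
# Newton's method sketch with unit step size (1-D case; names are ours)
def newton(grad, hess_inv, x0, eps=1e-10, max_iter=100):
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        dx = -hess_inv(x) * g    # Newton direction
        lam_sq = -g * dx         # lambda^2 = grad * H^{-1} * grad (squared Newton decrement)
        if lam_sq / 2.0 <= eps:  # stopping criterion from the algorithm above
            break
        x = x + dx
    return x

# Hypothetical convex objective f(x) = x^4 + x^2 with minimum at x = 0
x_star = newton(grad=lambda x: 4.0 * x**3 + 2.0 * x,
                hess_inv=lambda x: 1.0 / (12.0 * x**2 + 2.0),
                x0=3.0)
print(x_star)
```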
[Figure: Newton's method vs. gradient descent (which takes a roundabout path)]

Special cases: when the objective function is a composition of two functions (such as Loss l … BasicsOfConvexOptimization.pdf) and Levenberg-Marquardt (Section 4.5.5).

Quasi-Newton algorithms: when the Hessian inverse (∇²f(x⁽ᵏ⁺¹⁾))⁻¹ is approximated by a matrix B⁽ᵏ⁺¹⁾ such that
▶ the gradient of the quadratic approximation Q(x⁽ᵏ⁾) agrees at x⁽ᵏ⁾ and x⁽ᵏ⁺¹⁾, and
▶ B⁽ᵏ⁺¹⁾ is as close as possible to B⁽ᵏ⁾ in some norm (such as the Frobenius norm).

See BFGS (Section 4.5.6), LBFGS, etc.
November 9, 2018 360 / 429
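In one dimension the quasi-Newton idea reduces to the secant method: the condition "the gradient of the quadratic model agrees at x⁽ᵏ⁾ and x⁽ᵏ⁺¹⁾" becomes the scalar secant condition. This sketch (ours, not BFGS itself, which is the multidimensional version) illustrates that:

```python
# 1-D quasi-Newton sketch: B_k is a secant estimate of f'' satisfying
#   B_{k+1} * (x_{k+1} - x_k) = f'(x_{k+1}) - f'(x_k)
# (the 1-D analogue of the BFGS secant condition; names are ours)
def quasi_newton_1d(grad, x_prev, x_cur, iters=30):
    g_prev, g_cur = grad(x_prev), grad(x_cur)
    for _ in range(iters):
        if x_cur == x_prev or g_cur == g_prev:
            break                                # converged (or B undefined)
        B = (g_cur - g_prev) / (x_cur - x_prev)  # secant Hessian estimate
        x_prev, g_prev = x_cur, g_cur
        x_cur = x_cur - g_cur / B                # quasi-Newton step, B in place of f''
        g_cur = grad(x_cur)
    return x_cur

# Hypothetical convex objective f(x) = x^2 + 2x (f'(x) = 2x + 2, minimum at x = -1)
x_star = quasi_newton_1d(lambda x: 2.0 * x + 2.0, 0.0, 1.0)
print(x_star)  # -1.0
```

For a quadratic, the secant estimate recovers the Hessian exactly, so the method lands on the minimizer in one corrected step.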
Consider another general formulation of convex optimization problems²⁰:

minimize cᵀx
subject to g_j(x) ≤ 0 for j = 1, 2, …, m    (92)

where the g_j(x) are convex functions. How can every convex optimization problem be presented in this form? For an objective function f(x), translate it into a constraint f(x) − c ≤ 0 and minimize c.

Let s_j(xⁱ) be a subgradient for g_j at xⁱ. By the definition of a subgradient,

g_j(x) ≥ g_j(xⁱ) + s_j(xⁱ)ᵀ(x − xⁱ)   for all x ∈ dom(g_j).

[E.g., s_j(xⁱ) could be ∇g_j(xⁱ).]

²⁰ All convex optimization problems of the form discussed so far can be cast in this form.

November 9, 2018 362 / 429
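A concrete check of the subgradient inequality (our illustrative choice of function; g(x) = |x| is convex but not differentiable at 0, so subgradients are needed there):

```python
# Subgradient check for g(x) = |x|: at x_i != 0 the subgradient is sign(x_i);
# at x_i = 0 any s in [-1, 1] is a valid subgradient (0.3 is one arbitrary pick)
def subgradient_abs(x_i):
    return 1.0 if x_i > 0 else (-1.0 if x_i < 0 else 0.3)

def lower_bound(x, x_i):
    # g(x_i) + s(x_i)^T (x - x_i): the linear under-estimator from the definition
    return abs(x_i) + subgradient_abs(x_i) * (x - x_i)

checks = [lower_bound(x, x_i) <= abs(x)
          for x in (-2.0, -0.5, 0.0, 1.5)
          for x_i in (-1.0, 0.0, 2.0)]
print(all(checks))  # True: the inequality holds for every pair tested
```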