Introductory Course on Non-smooth Optimisation
Lecture 09: Non-convex optimisation
Jingwei Liang
Department of Applied Mathematics and Theoretical Physics

Table of contents
1 Examples
2 Non-convex optimisation
3 Convex relaxation
4 Łojasiewicz inequality
5 Kurdyka-Łojasiewicz inequality
Compressed sensing

Forward observation: b = Ax̊, where x̊ ∈ R^n is sparse and A : R^n → R^m with m ≪ n.

Compressed sensing:
  min_{x ∈ R^n} ||x||_0  s.t.  Ax = b.

NB: this is an NP-hard problem.
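To see why the problem is combinatorial, the ℓ0 minimisation can be solved by exhaustive search over supports on a toy instance. The dimensions, random seed and 2-sparse signal below are hypothetical choices:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
m, n = 5, 8
A = rng.standard_normal((m, n))
x0 = np.zeros(n)
x0[[1, 6]] = [1.0, -2.0]          # hypothetical 2-sparse ground truth
b = A @ x0

def l0_min(A, b, tol=1e-9):
    """Brute-force min ||x||_0 s.t. Ax = b: try supports of increasing size."""
    n = A.shape[1]
    for k in range(1, n + 1):
        for S in combinations(range(n), k):
            S = list(S)
            xS, *_ = np.linalg.lstsq(A[:, S], b, rcond=None)
            if np.linalg.norm(A[:, S] @ xS - b) <= tol:
                x = np.zeros(n)
                x[S] = xS
                return x
    return None

x = l0_min(A, b)
print(np.nonzero(x)[0])   # expected: the support {1, 6}
```

Each support size k requires checking C(n, k) subsets, which is the combinatorial explosion behind the NP-hardness remark.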
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Image processing

Two-phase segmentation: given an image I, which consists of foreground and background, segment the foreground. Ideally, I = f·1_C + b·1_{Ω\C}.

Mumford–Shah model:
  E(u, C) = ∫_Ω (u − I)^2 dx + λ ∫_{Ω\C} ||∇u||^2 dx + α|C|,
where |C| = peri(C), the perimeter of C.
Principal component pursuit

Forward mixture model: w = x̊ + ẙ + ε, where x̊ ∈ R^{m×n} is κ-sparse, ẙ ∈ R^{m×n} is σ-low-rank and ε is noise.

Non-convex PCP:
  min_{x,y ∈ R^{m×n}} (1/2)||x + y − w||^2  s.t.  ||x||_0 ≤ κ and rank(y) ≤ σ.
Neural networks

Each layer of a neural network is convex:
- Linear operation, e.g. convolution.
- Non-linear activation function, e.g. the rectifier max{x, 0}.
However, the composition of convex functions is not necessarily convex...
- Neural networks are universal function approximators, hence they need to approximate non-convex functions.
- One cannot approximate non-convex functions with convex functions.
Non-convex optimisation

Non-convex problem: any problem that is not convex (or concave) is non-convex...
Challenges
- Potentially many local minima.
- Saddle points.
- Very flat regions.
- Widely varying curvature.
- NP-hard in general.
Convex relaxation

Non-convex optimisation problem:
  min_x E(x).
Convex optimisation problem:
  min_x F(x).

What if Argmin(F) ⊆ Argmin(E)?
- Subtle and case-dependent.
- Somehow, finding such an F is almost equivalent to solving E.
Convex relaxation

[Figures: a loose relaxation vs. an ideal relaxation]
- In practice, it is easier to obtain Argmin(E) ⊆ Argmin(F).
- A loose relaxation will work if the two global minima are close enough.
- An ideal relaxation will fail if Argmin(F) is too large.
Convolution

For certain problems, non-convexity can be treated as noise...

[Figures: original function vs. its convolution with a smoothing kernel]
- Symmetric boundary condition for the convolution.
- Almost convex problem after convolution.
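A small numerical illustration of this smoothing idea (the test function and kernel width are hypothetical choices): convolving an oscillatory perturbation of a convex function with a Gaussian kernel, using a symmetric boundary condition, removes most of the spurious local minima.

```python
import numpy as np

def gaussian_kernel(width, sigma):
    t = np.arange(-width, width + 1)
    k = np.exp(-t**2 / (2.0 * sigma**2))
    return k / k.sum()

def smooth(f_vals, width=60, sigma=20.0):
    # symmetric (reflecting) boundary condition, as on the slide
    padded = np.pad(f_vals, width, mode="reflect")
    return np.convolve(padded, gaussian_kernel(width, sigma), mode="valid")

def local_minima(v):
    # count strict interior local minima of a sampled function
    return int(np.sum((v[1:-1] < v[:-2]) & (v[1:-1] < v[2:])))

x = np.linspace(-2.0, 2.0, 801)
f = x**2 + 0.3 * np.sin(25 * x)   # convex term plus non-convex "noise"
fs = smooth(f)
print(local_minima(f), local_minima(fs))
```

The smoothed function has far fewer local minima than the original, i.e. it is almost convex.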
Smooth problem

Let F ∈ C^1_L, i.e. F is differentiable with L-Lipschitz gradient.

Gradient descent:
  x_{k+1} = x_k − γ∇F(x_k).
Descent property:
  F(x_k) − F(x_{k+1}) ≥ γ(1 − γL/2) ||∇F(x_k)||^2.
Let γ ∈ ]0, 2/L[; summing over the iterations,
  γ(1 − γL/2) Σ_{i=0}^{k} ||∇F(x_i)||^2 ≤ F(x_0) − F(x_{k+1}) ≤ F(x_0) − F(x⋆).
Since F(x⋆) > −∞, the right-hand side is a finite constant; letting k → +∞ on the left-hand side,
  lim_{k→+∞} ||∇F(x_k)||^2 = 0.

NB: in the smooth case, a critical point is guaranteed. For non-smooth problems...
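These estimates can be checked numerically. Below, gradient descent is run on a hypothetical smooth non-convex double well F(x) = x^4/4 − x^2/2, so ∇F(x) = x^3 − x; ∇F is 11-Lipschitz on [−2, 2], and the (arbitrary) choice γ = 0.05 satisfies γ < 2/L:

```python
def grad_F(x):
    # F(x) = x**4 / 4 - x**2 / 2, a smooth double well with critical points -1, 0, 1
    return x**3 - x

x, gamma = 2.0, 0.05          # starting point and step size (illustrative choices)
for k in range(200):
    x -= gamma * grad_F(x)
print(x, abs(grad_F(x)))      # the gradient vanishes along the iterates
```

As predicted, ||∇F(x_k)|| → 0 and the iterates settle at the minimiser x = 1 (one of the two wells).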
Semi-algebraic sets and functions

Semi-algebraic set: a semi-algebraic subset of R^n is a finite union of sets of the form
  {x ∈ R^n : f_i(x) = 0, g_j(x) ≤ 0, i ∈ I, j ∈ J},
where I, J are finite and f_i, g_j : R^n → R are real polynomial functions.
- Stability under finite intersection, union and complementation.

Semi-algebraic function: a function or a mapping is semi-algebraic if its graph is a semi-algebraic set. The same definition applies to extended real-valued functions and multivalued mappings.
Properties

Tarski–Seidenberg:
- The image of a semi-algebraic set by a linear projection is semi-algebraic.
- The closure of a semi-algebraic set A is semi-algebraic.

Examples:
- The graph of the derivative of a semi-algebraic function is semi-algebraic.
- Let A be a semi-algebraic subset of R^n and f : R^n → R^p semi-algebraic; then f(A) is semi-algebraic.
- g(x) = max{F(x, y) : y ∈ S} is semi-algebraic if F and S are semi-algebraic.

Other examples:
  min_x (1/2)||Ax − b||^2 + μ||x||_p, where p is rational;
  min_X (1/2)||AX − B||^2 + μ·rank(X).
Subdifferential

Convex subdifferential: for R ∈ Γ_0(R^n),
  ∂R(x) = {g : R(x′) ≥ R(x) + ⟨g, x′ − x⟩, ∀x′ ∈ R^n}.

Fréchet subdifferential: given x ∈ dom(R), the Fréchet subdifferential ∂̂R(x) of R at x is the set of vectors v such that
  liminf_{x′→x, x′≠x} (1/||x − x′||) (R(x′) − R(x) − ⟨v, x′ − x⟩) ≥ 0.
If x ∉ dom(R), then ∂̂R(x) = ∅.

Limiting subdifferential: the limiting subdifferential (or simply subdifferential) of R at x, written ∂R(x), reads
  ∂R(x) def= {v ∈ R^n : ∃x_k → x, R(x_k) → R(x), v_k ∈ ∂̂R(x_k) → v}.
- ∂̂R(x) is convex and ∂R(x) is closed.
Critical points

Minimal norm subgradient:
  ||∂R(x)||_− = min{||v|| : v ∈ ∂R(x)}.

Critical points:
- Fermat's rule: if x is a minimiser of R, then 0 ∈ ∂R(x). Conversely, when 0 ∈ ∂R(x), the point x is called a critical point.
- When R is convex, any minimiser is a global minimiser.
- When R is non-convex, a critical point can be a local minimum, a local maximum, or a saddle point.
Sharpness

A function R : R^n → R ∪ {+∞} is called sharp on the slice
  [a < R < b] def= {x ∈ R^n : a < R(x) < b}
if there exists α > 0 such that ||∂R(x)||_− ≥ α for all x ∈ [a < R < b].
- Example: norms, e.g. R(x) = ||x||.
Łojasiewicz inequality

Łojasiewicz inequality: let R : R^n → R ∪ {+∞} be proper, lower semi-continuous and continuous along its domain. Then R is said to have the Łojasiewicz property if for any critical point x̄ there exist C, ε > 0 and θ ∈ [0, 1[ such that
  |R(x) − R(x̄)|^θ ≤ C||v||,  ∀x ∈ B_{x̄}(ε), ∀v ∈ ∂R(x).
By convention, 0^0 = 0.

Property: suppose that R has the Łojasiewicz property. If S is a connected subset of the set of critical points of R, that is 0 ∈ ∂R(x) for all x ∈ S, then R is constant on S. If in addition S is a compact set, then there exist C, ε > 0 and θ ∈ [0, 1[ such that for all x ∈ R^n with dist(x, S) ≤ ε and all v ∈ ∂R(x):
  |R(x) − R(x̄)|^θ ≤ C||v||,
where R(x̄) denotes the common value of R on S.
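A worked example of the exponent θ: for the scalar function R(x) = |x|^q with q > 1, the critical point is x̄ = 0 and a short computation recovers θ = 1 − 1/q.

```latex
R(x) = |x|^q,\quad q > 1,\qquad R'(x) = q\,\mathrm{sign}(x)\,|x|^{q-1}.
% The Łojasiewicz inequality at \bar{x} = 0 requires, near x = 0,
|R(x) - R(0)|^{\theta} = |x|^{q\theta} \le C\,q\,|x|^{q-1},
% which holds if and only if q\theta \ge q - 1, i.e.
\theta \ge 1 - 1/q.
% The smallest admissible exponent is \theta = 1 - 1/q; for q = 2, \theta = 1/2.
```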
Non-convex PPA

Proximal point algorithm: let R : R^n → R ∪ {+∞} be proper and lower semi-continuous. From an arbitrary x_0 ∈ R^n,
  x_{k+1} ∈ argmin_x γR(x) + (1/2)||x − x_k||^2.

Assumptions:
1. R is bounded from below, that is inf_{x∈R^n} R(x) > −∞. This implies that argmin_x γR(x) + (1/2)||x − x_k||^2 is non-empty and compact.
2. The restriction of R to its domain is a continuous function.
3. R has the Łojasiewicz property.
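A sketch of the non-convex PPA on a hypothetical one-dimensional double well R(x) = (x^2 − 1)^2, which is semi-algebraic and hence has the Łojasiewicz property. The proximal sub-problem is solved by brute force on a fine grid; all parameters are illustrative choices:

```python
import numpy as np

def R(x):
    # non-convex, semi-algebraic double well with minimisers at -1 and 1
    return (x**2 - 1.0)**2

grid = np.linspace(-3.0, 3.0, 60001)

def ppa_step(z, gamma):
    # x_{k+1} in argmin_x gamma * R(x) + 0.5 * (x - z)**2, evaluated on the grid
    vals = gamma * R(grid) + 0.5 * (grid - z)**2
    return grid[np.argmin(vals)]

x, gamma = 2.0, 0.1
for _ in range(100):
    x = ppa_step(x, gamma)
print(x)   # settles at a critical point of R
```

Starting from x_0 = 2, the iterates decrease the objective monotonically and settle at the nearby minimiser x = 1, up to the grid resolution.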
Property

Let {x_k}_{k∈N} be the sequence generated by the non-convex PPA and ω(x_k) the set of its limit points. Then:
- The sequence {R(x_k)}_{k∈N} is decreasing.
- Σ_k ||x_k − x_{k+1}||^2 < +∞.
- If R satisfies assumption 2, then ω(x_k) ⊂ crit(R). If moreover {x_k}_{k∈N} is bounded, then ω(x_k) is a non-empty compact set and dist(x_k, ω(x_k)) → 0.
- If R satisfies assumption 2, then R is finite and constant on ω(x_k).

NB: boundedness can be guaranteed if R is coercive.
Convergence of PPA

Suppose the sequence {x_k}_{k∈N} generated by the non-convex PPA is bounded. Then
  Σ_k ||x_k − x_{k+1}|| < +∞,
and the whole sequence converges to some critical point x̄ ∈ crit(R).

Proof sketch. From the definition of x_{k+1}:
  R(x_{k+1}) + (1/(2γ))||x_k − x_{k+1}||^2 ≤ R(x_k).
Consider g(s) = s^{1−θ} for s > 0, with g′(s) = (1 − θ)s^{−θ}. By concavity of g,
  g(R(x_k)) − g(R(x_{k+1})) ≥ (1 − θ) R(x_k)^{−θ} (R(x_k) − R(x_{k+1})) ≥ (1 − θ) R(x_k)^{−θ} (1/(2γ)) ||x_k − x_{k+1}||^2.
WLOG, assume R(x̄) = 0 for x̄ ∈ ω(x_k). The optimality condition of the PPA step gives v_k = (x_{k−1} − x_k)/γ ∈ ∂R(x_k); then for all k large enough
  0 < R(x_k)^θ ≤ C||v_k|| = (C/γ)||x_k − x_{k−1}||.
Combining the above, there exists M > 0 such that
  ||x_k − x_{k+1}||^2 / ||x_k − x_{k−1}|| ≤ M (R(x_k)^{1−θ} − R(x_{k+1})^{1−θ}).
Convergence of PPA (continued)

Take r ∈ ]0, 1[. If ||x_k − x_{k+1}|| ≥ r||x_k − x_{k−1}||, then
  ||x_k − x_{k+1}|| ≤ (M/r) (R(x_k)^{1−θ} − R(x_{k+1})^{1−θ}).
In either case, for all k large enough,
  ||x_k − x_{k+1}|| ≤ r||x_k − x_{k−1}|| + (M/r) (R(x_k)^{1−θ} − R(x_{k+1})^{1−θ}).
There exists some K > 0 such that for k ≥ K,
  Σ_{i=K}^{k} ||x_i − x_{i+1}|| ≤ (r/(1 − r)) ||x_K − x_{K−1}|| + (M/(r(1 − r))) (R(x_K)^{1−θ} − R(x_{k+1})^{1−θ}).
R(x) is bounded from below; take k → +∞...
Rate of convergence

Convergence rate: suppose the non-convex PPA converges, and denote by θ the Łojasiewicz exponent at x_∞. The following statements hold:
- If θ = 0, then {x_k}_{k∈N} converges in a finite number of steps.
- If θ ∈ ]0, 1/2], then there exists η ∈ ]0, 1[ such that ||x_k − x_∞|| = O(η^k).
- If θ ∈ ]1/2, 1[, then ||x_k − x_∞|| = O(k^{−(1−θ)/(2θ−1)}).
Kurdyka-Łojasiewicz inequality (KL)

Let R : R^n → R ∪ {+∞} be proper l.s.c. For a, b such that −∞ < a < b < +∞, define
  [a < R < b] def= {x ∈ R^n : a < R(x) < b}.

Kurdyka-Łojasiewicz inequality: R is said to have the KL property at x̄ ∈ dom(R) if there exist η ∈ ]0, +∞], a neighbourhood U of x̄ and a continuous concave function φ : [0, η[ → R_+ such that
- φ(0) = 0;
- φ is C^1 on ]0, η[;
- φ′(s) > 0 for all s ∈ ]0, η[;
- for all x ∈ U ∩ [R(x̄) < R < R(x̄) + η], the KL inequality holds:
  φ′(R(x) − R(x̄)) · dist(0, ∂R(x)) ≥ 1.

Remarks:
- Proper l.s.c. functions are KL at non-critical points.
- Proper l.s.c. functions which satisfy KL at each point of dom(∂R) are called KL functions.
- Typical KL functions are the class of semi-algebraic functions.
Kurdyka-Łojasiewicz functions
||∇F(x)|| ≥ 0 ||∇(ϕ ◦ F)(x)|| ≥ 1 When R(¯ x) = 0, then the condition becomes ||∂(ϕ ◦ F)(x)||− ≥ 1. ϕ is called a desingularising function for R, i.e. sharp up to reparameterization via ϕ.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Abstract descent methods

Let Φ be proper and lower semi-continuous. Suppose a sequence {x_k}_{k∈N} is generated such that the following conditions are satisfied, for some constants c, d > 0.

A.1 Sufficient decrease condition: for each k ∈ N,
  Φ(x_{k+1}) + c||x_{k+1} − x_k||^2 ≤ Φ(x_k).
A.2 Relative error condition: for each k ∈ N, there exists g_{k+1} ∈ ∂Φ(x_{k+1}) such that
  ||g_{k+1}|| ≤ d||x_{k+1} − x_k||.
A.3 Continuity condition: there exist a subsequence {x_{k_j}}_{j∈N} and x̄ such that
  x_{k_j} → x̄ and Φ(x_{k_j}) → Φ(x̄).
Convergence

Let Φ : R^n → R ∪ {+∞} be proper, l.s.c. and KL at some x̄ ∈ R^n. Let U, η and φ be as in the KL property, and let δ, ρ > 0 be such that B_{x̄}(δ) ⊂ U with ρ ∈ ]0, δ[. Consider a sequence {x_k}_{k∈N} which satisfies (A.1)-(A.2). Suppose moreover that
  Φ(x̄) < Φ(x_0) < Φ(x̄) + η,
  ||x_0 − x̄|| + 2√((Φ(x_0) − Φ(x̄))/c) + (d/c)·φ(Φ(x_0) − Φ(x̄)) < ρ,
and that for all k ∈ N,
  x_k ∈ B_{x̄}(ρ) ⇒ x_{k+1} ∈ B_{x̄}(δ) with Φ(x_{k+1}) ≥ Φ(x̄).
Then the sequence {x_k}_{k∈N} satisfies:
- x_k ∈ B_{x̄}(δ) for all k ∈ N;
- Σ_k ||x_k − x_{k+1}|| < +∞;
- Φ(x_k) → Φ(x̄);
and it converges to a point x⋆ ∈ B_{x̄}(δ) such that Φ(x⋆) ≤ Φ(x̄). If moreover (A.3) is true, then x⋆ is a critical point and Φ(x⋆) = Φ(x̄).
Convergence (proof)

Condition (A.1) implies that {Φ(x_k)}_{k∈N} is non-increasing, and for all k ∈ N,
  ||x_{k+1} − x_k|| ≤ √((Φ(x_k) − Φ(x_{k+1}))/c).
Condition (A.2) and the KL inequality give
  φ′(Φ(x_k) − Φ(x̄)) ≥ 1/||g_k|| ≥ 1/(d||x_k − x_{k−1}||).
Since φ is concave,
  φ(Φ(x_k) − Φ(x̄)) − φ(Φ(x_{k+1}) − Φ(x̄)) ≥ φ′(Φ(x_k) − Φ(x̄)) (Φ(x_k) − Φ(x_{k+1})) ≥ φ′(Φ(x_k) − Φ(x̄)) · c||x_k − x_{k+1}||^2.
Combining the above two yields
  ||x_k − x_{k+1}||^2 / ||x_k − x_{k−1}|| ≤ (d/c) (φ(Φ(x_k) − Φ(x̄)) − φ(Φ(x_{k+1}) − Φ(x̄))).
Applying the inequality 2√(xy) ≤ x + y,
  2||x_k − x_{k+1}|| ≤ ||x_k − x_{k−1}|| + (d/c) (φ(Φ(x_k) − Φ(x̄)) − φ(Φ(x_{k+1}) − Φ(x̄))).
Convergence (proof continued)

Continuing with (A.1),
  ||x_1 − x_0|| ≤ √((Φ(x_0) − Φ(x_1))/c) ≤ √((Φ(x_0) − Φ(x̄))/c).
Then
  ||x_1 − x̄|| ≤ ||x_1 − x_0|| + ||x_0 − x̄|| ≤ √((Φ(x_0) − Φ(x̄))/c) + ||x_0 − x̄|| ≤ ρ.
By induction, we can show that for all k ∈ N, x_k ∈ B_{x̄}(ρ) and
  Σ_{i=1}^{k} ||x_{i+1} − x_i|| + ||x_{k+1} − x_k|| ≤ ||x_1 − x_0|| + (d/c) (φ(Φ(x_1) − Φ(x̄)) − φ(Φ(x_{k+1}) − Φ(x̄))).
The above directly implies
  Σ_k ||x_k − x_{k+1}|| ≤ ||x_1 − x_0|| + (d/c) φ(Φ(x_1) − Φ(x̄)) < +∞.
Hence, there exists x⋆ ∈ ω(x_k) with
  x_k → x⋆,  g_k → 0,  Φ(x_k) → v ≥ Φ(x̄).
The KL inequality φ′(v − Φ(x̄)) ||g_k|| ≥ 1 indicates v = Φ(x̄), and lower semi-continuity yields Φ(x⋆) ≤ Φ(x̄).
Forward–Backward splitting

Consider minimising
  min_{x∈R^n} {Φ(x) def= R(x) + F(x)},
where R : R^n → R ∪ {+∞} is proper l.s.c. and bounded from below, and F : R^n → R is finite-valued, differentiable with ∇F L-Lipschitz.

Forward–Backward splitting: let γ ∈ ]0, 1/L[,
  x_{k+1} ∈ prox_{γR}(x_k − γ∇F(x_k)).

- Sufficient decrease:
  Φ(x_{k+1}) + ((1 − γL)/(2γ)) ||x_k − x_{k+1}||^2 ≤ Φ(x_k).
- Relative error: g_{k+1} def= (1/γ)(x_k − x_{k+1}) − ∇F(x_k) + ∇F(x_{k+1}) ∈ ∂Φ(x_{k+1}), with
  ||g_{k+1}|| ≤ (1/γ + L) ||x_k − x_{k+1}||.
- Continuity: the sequence {x_k}_{k∈N} is bounded.
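As an illustration, Forward–Backward applied to a hypothetical ℓ0-regularised least-squares problem, Φ(x) = (1/2)||Ax − b||^2 + μ||x||_0. The proximity operator of μ||·||_0 is hard thresholding, so the iteration is iterative hard thresholding; all problem data below are synthetic, and the sufficient-decrease property can be observed on the recorded objective values:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, mu = 40, 100, 0.01
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[[3, 17, 42]] = [2.0, -1.5, 1.0]    # synthetic 3-sparse signal
b = A @ x_true

L = np.linalg.norm(A, 2)**2               # Lipschitz constant of grad F
gamma = 0.9 / L                           # gamma in ]0, 1/L[

def prox_l0(z, t):
    # prox of t*||.||_0: hard thresholding at sqrt(2t)
    out = z.copy()
    out[np.abs(z) < np.sqrt(2.0 * t)] = 0.0
    return out

def Phi(x):
    return 0.5 * np.sum((A @ x - b)**2) + mu * np.count_nonzero(x)

x = np.zeros(n)
vals = [Phi(x)]
for _ in range(500):
    x = prox_l0(x - gamma * A.T @ (A @ x - b), gamma * mu)
    vals.append(Phi(x))
print(vals[0], vals[-1])
```

The objective values are non-increasing along the iterations, exactly as the sufficient-decrease property predicts for γ < 1/L.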
A coupled minimisation problem

Consider minimising
  min_{x∈R^n, y∈R^m} {E(x, y) def= R(x) + F(x, y) + J(y)},
where R : R^n → R ∪ {+∞} and J : R^m → R ∪ {+∞} are proper l.s.c. and bounded from below, and F : R^n × R^m → R is finite-valued, differentiable with ∇F L-Lipschitz.

Subdifferential:
  ∂E(x, y) = (∂R(x) + ∇_x F(x, y)) × (∂J(y) + ∇_y F(x, y)) = ∂_x E(x, y) × ∂_y E(x, y).

Separate Lipschitz continuity for F: ∇_x F is L_x-Lipschitz and ∇_y F is L_y-Lipschitz.
Proximal alternating minimisation (PAM)

PAM is an alternating minimisation algorithm. Let γ_x, γ_y ∈ ]0, 1/L[:
  x_{k+1} ∈ argmin_{x∈R^n} E(x, y_k) + (1/(2γ_x))||x − x_k||^2,
  y_{k+1} ∈ argmin_{y∈R^m} E(x_{k+1}, y) + (1/(2γ_y))||y − y_k||^2.

- PAM is an instance of the PPA; for convergence, let Φ(x, y) = E(x, y).
- In general, the sub-problems admit no closed-form solution:
  x_{k+1} ∈ argmin_{x∈R^n} E(x, y_k) + (1/(2γ_x))||x − x_k||^2 = argmin_{x∈R^n} R(x) + F(x, y_k) + (1/(2γ_x))||x − x_k||^2.
Proximal alternating linearised minimisation (PALM)

PALM is linearised PAM: the smooth coupling term is majorised by
  F(x, y_k) ≤ F(x_k, y_k) + ⟨∇_x F(x_k, y_k), x − x_k⟩ + (1/(2γ_x))||x − x_k||^2.

PALM: let γ_x, γ_y ∈ ]0, 1/L[:
  x_{k+1} ∈ prox_{γ_x R}(x_k − γ_x ∇_x F(x_k, y_k)),
  y_{k+1} ∈ prox_{γ_y J}(y_k − γ_y ∇_y F(x_{k+1}, y_k)).

- PALM is an instance of Forward–Backward splitting; for convergence, let Φ(x, y) = E(x, y).
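A sketch of PALM on a hypothetical sparse non-negative rank-one factorisation, E(x, y) = μ||x||_1 + (1/2)||xy^T − W||_F^2 + ι_{y≥0}(y). The block step sizes use the per-block Lipschitz constants L_x = ||y||^2 and L_y = ||x||^2; all data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, mu = 20, 30, 0.01
W = np.outer(np.abs(rng.standard_normal(m)), np.abs(rng.standard_normal(n)))

def soft(z, t):
    # prox of t*||.||_1: soft thresholding
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def E(x, y):
    # the indicator of y >= 0 is enforced by the projection step below
    return 0.5 * np.sum((np.outer(x, y) - W)**2) + mu * np.sum(np.abs(x))

x, y = np.ones(m), np.ones(n)
vals = [E(x, y)]
for _ in range(200):
    gx = 0.9 / max(y @ y, 1e-12)                      # gamma_x in ]0, 1/L_x[
    x = soft(x - gx * (np.outer(x, y) - W) @ y, gx * mu)
    gy = 0.9 / max(x @ x, 1e-12)                      # gamma_y in ]0, 1/L_y[
    y = np.maximum(y - gy * (np.outer(x, y) - W).T @ x, 0.0)  # prox of indicator
    vals.append(E(x, y))
print(vals[0], vals[-1])
```

Each block update is a Forward–Backward step on one variable with the other frozen, so the objective decreases monotonically, as the PALM theory predicts.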
Remarks

- Converges to a global minimiser if started close enough.
- Inertial acceleration can be applied to all of these methods.
- Step-size vs. inertial parameter.
- Step-size and critical points.
- Stochastic optimisation methods can escape saddle points or find a global minimiser...
Reference

- H. Attouch and J. Bolte. "On the convergence of the proximal algorithm for nonsmooth functions involving analytic features". Mathematical Programming 116.1-2 (2009): 5-16.
- H. Attouch, J. Bolte, and B. Svaiter. "Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods". Mathematical Programming 137.1-2 (2013): 91-129.
- H. Attouch, et al. "Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality". Mathematics of Operations Research 35.2 (2010): 438-457.
- J. Bolte, S. Sabach, and M. Teboulle. "Proximal alternating linearized minimization for nonconvex and nonsmooth problems". Mathematical Programming 146.1-2 (2014): 459-494.
- J. Liang, J. Fadili, and G. Peyré. "A multi-step inertial forward-backward splitting method for