Introductory Course on Non-smooth Optimisation
Lecture 01 - Gradient methods

Jingwei Liang
Department of Applied Mathematics and Theoretical Physics
Table of contents
1. Unconstrained smooth optimisation
2. Descent methods
3. Gradient of convex functions
4. Gradient descent
5. Heavy-ball method
6. Nesterov's optimal schemes
7. Dynamical system
Convexity

Convex set
A set S ⊂ Rⁿ is convex if for any θ ∈ [0, 1] and two points x, y ∈ S, θx + (1 − θ)y ∈ S.

Convex function
A function F : Rⁿ → R is convex if dom(F) is convex and for all x, y ∈ dom(F) and θ ∈ [0, 1],
F(θx + (1 − θ)y) ≤ θF(x) + (1 − θ)F(y).
Proper convex: F(x) < +∞ for at least one x, and F(x) > −∞ for all x.
1st-order condition: if F is continuously differentiable,
F(y) ≥ F(x) + ⟨∇F(x), y − x⟩, ∀x, y ∈ dom(F).
2nd-order condition: if F is twice differentiable,
∇²F(x) ⪰ 0, ∀x ∈ dom(F).
Problem: unconstrained smooth optimisation

min_{x∈Rⁿ} F(x),

where F : Rⁿ → R is proper convex and smooth (differentiable).
Optimality condition: let x⋆ be a minimiser of F, then 0 = ∇F(x⋆).
Example: quadratic minimisation

General quadratic programming problem:
min_{x∈Rⁿ} (1/2)xᵀAx + bᵀx + c,
where A ∈ Rⁿˣⁿ is symmetric positive definite, b ∈ Rⁿ and c ∈ R.
Optimality condition: 0 = Ax⋆ + b.

Special case: least squares,
||Ax − b||² = xᵀ(AᵀA)x − 2(Aᵀb)ᵀx + bᵀb.
Optimality condition (the normal equations): AᵀAx⋆ = Aᵀb.
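As a quick numerical check (a minimal sketch of my own, not from the slides; the random A and b are purely illustrative), the least-squares minimiser can be obtained by solving the normal equations directly:

    import numpy as np

    # Minimal sketch: solve min_x ||Ax - b||^2 via the normal equations
    # A^T A x* = A^T b.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 5))
    b = rng.standard_normal(20)
    x_star = np.linalg.solve(A.T @ A, A.T @ b)
    # Optimality condition: the gradient 2 A^T (A x* - b) vanishes.
    assert np.allclose(A.T @ (A @ x_star - b), 0.0, atol=1e-10)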
Example: geometric programming

min_{x∈Rⁿ} log( Σ_{i=1}^m exp(aᵢᵀx + bᵢ) ).

Optimality condition:
0 = (1 / Σ_{i=1}^m exp(aᵢᵀx⋆ + bᵢ)) Σ_{i=1}^m exp(aᵢᵀx⋆ + bᵢ) aᵢ.
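The gradient is thus a softmax-weighted average of the vectors aᵢ. A small sketch (my own illustration, not from the slides) computing it with the usual max-subtraction trick for numerical stability:

    import numpy as np

    # Sketch: gradient of F(x) = log(sum_i exp(a_i^T x + b_i)),
    # stabilised by subtracting the maximum before exponentiating.
    def logsumexp_grad(A, b, x):
        """Rows of A are the a_i^T; returns grad F(x) = sum_i w_i a_i."""
        z = A @ x + b
        w = np.exp(z - z.max())      # avoids overflow; the ratios are unchanged
        w /= w.sum()                 # softmax weights: non-negative, sum to 1
        return A.T @ w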
Problem: unconstrained smooth optimisation

Consider minimising
min_{x∈Rⁿ} F(x),
where F : Rⁿ → R is proper convex and smooth (differentiable).

The set of minimisers, i.e.
Argmin(F) = {x ∈ Rⁿ : F(x) = min_{x∈Rⁿ} F(x)},
is non-empty. However, for x⋆ ∈ Argmin(F) there is in general no closed-form expression.

Iterative strategy to find some x⋆ ∈ Argmin(F): start from x0 and generate a sequence {xk}_{k∈N} such that
lim_{k→∞} xk = x⋆ ∈ Argmin(F).
Descent methods

Iterative scheme
For each k = 1, 2, ..., find γk > 0 and dk ∈ Rⁿ, then update
xk+1 = xk + γk dk,
where dk is called the search/descent direction and γk the step-size.

Descent method
An algorithm is called a descent method if there holds
F(xk+1) < F(xk).
NB: if xk ∈ Argmin(F), then xk+1 = xk.
Conditions

From convexity of F, we have
F(xk+1) ≥ F(xk) + ⟨∇F(xk), xk+1 − xk⟩,
which gives
⟨∇F(xk), xk+1 − xk⟩ ≥ 0 ⟹ F(xk+1) ≥ F(xk).
Since xk+1 − xk = γk dk, the direction dk should be such that
⟨∇F(xk), dk⟩ < 0.
General descent method

initial: x0 ∈ dom(F);
repeat:
1. Find a descent direction dk.
2. Choose a step-size γk: line search.
3. Update xk+1 = xk + γk dk.
until: stopping criterion is satisfied.

Stopping criteria, with tolerance ε > 0:
- Function value: F(xk) − F(xk+1) ≤ ε (can be time consuming).
- Sequence: ||xk+1 − xk|| ≤ ε.
- Optimality condition: ||∇F(xk)|| ≤ ε.
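As a concrete skeleton (a sketch under assumed helper functions `direction` and `step_size`, which are not part of the slides), the loop looks like:

    import numpy as np

    # Generic descent loop (illustrative sketch). `direction` returns a dk
    # with <grad F(xk), dk> < 0; `step_size` is any line-search routine,
    # e.g. the backtracking sketch further below.
    def descent(F, gradF, x0, direction, step_size, eps=1e-8, max_iter=10_000):
        x = x0
        for _ in range(max_iter):
            g = gradF(x)
            if np.linalg.norm(g) <= eps:     # stopping: ||grad F(xk)|| <= eps
                break
            d = direction(x, g)
            x = x + step_size(F, g, x, d) * d
        return x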
Exact line search

Suppose the direction dk is given. Choose γk such that F is minimised along the ray xk + γdk, γ > 0:
γk = argmin_{γ>0} F(xk + γdk).
- Useful when the minimisation problem for γk is simple.
- γk can be found analytically in special cases.

[Figure: F(xk + γdk) as a function of γ ≥ 0, minimised at γ = γk.]
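One such special case (a sketch of my own, not from the slides) is the quadratic F(x) = (1/2)xᵀAx + bᵀx, for which the exact step has a closed form:

    import numpy as np

    # Exact line search for F(x) = 0.5 x^T A x + b^T x along direction d:
    # d/dgamma F(x + gamma d) = 0 gives gamma = -<grad F(x), d> / (d^T A d).
    def exact_step_quadratic(A, b, x, d):
        g = A @ x + b                      # grad F(x)
        return -(g @ d) / (d @ (A @ d))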
Backtracking/inexact line search

Suppose the direction dk is given. Choose δ ∈ ]0, 0.5[ and β ∈ ]0, 1[, and let γ = 1;
while F(xk + γdk) > F(xk) + δγ⟨∇F(xk), dk⟩: γ = βγ.

- The idea is to reduce F sufficiently along the direction dk.
- Since dk is a descent direction, ⟨∇F(xk), dk⟩ < 0.
- Stopping criterion for backtracking: F(xk + γdk) ≤ F(xk) + δγ⟨∇F(xk), dk⟩.
- When γ is small enough,
F(xk + γdk) ≈ F(xk) + γ⟨∇F(xk), dk⟩ < F(xk) + δγ⟨∇F(xk), dk⟩,
which means backtracking eventually stops.
[Figure: F(xk + γdk) against the lines F(xk) + γ⟨∇F(xk), dk⟩ and F(xk) + δγ⟨∇F(xk), dk⟩.]
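A direct transcription of the while-loop above (sketch; δ = 0.3 and β = 0.5 are typical choices, not values fixed by the slides):

    # Armijo backtracking line search, following the while-loop above.
    def backtracking(F, g, x, d, delta=0.3, beta=0.5):
        gamma = 1.0
        slope = g @ d                     # <grad F(x), d>, negative for descent d
        while F(x + gamma * d) > F(x) + delta * gamma * slope:
            gamma *= beta                 # shrink until sufficient decrease holds
        return gamma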
Monotonicity of gradient

Let F : Rⁿ → R be proper convex and smooth (differentiable), then
⟨∇F(x) − ∇F(y), x − y⟩ ≥ 0, ∀x, y ∈ dom(F).

C¹: the class of proper convex, continuously differentiable functions on Rⁿ.

Proof. Owing to convexity, given x, y ∈ dom(F), we have
F(y) ≥ F(x) + ⟨∇F(x), y − x⟩ and F(x) ≥ F(y) + ⟨∇F(y), x − y⟩.
Summing them up yields ⟨∇F(x) − ∇F(y), x − y⟩ ≥ 0.

NB: for F ∈ C¹, F is convex if and only if ∇F is monotone.
Lipschitz continuous gradient

The gradient of F is L-Lipschitz continuous if there exists L > 0 such that
||∇F(x) − ∇F(y)|| ≤ L||x − y||, ∀x, y ∈ dom(F).

C¹_L: the class of proper convex functions with L-Lipschitz continuous gradient on Rⁿ.

If F ∈ C¹_L, then
H(x) := (L/2)||x||² − F(x)
is convex.
Hint: monotonicity of ∇H, i.e.
⟨∇H(x) − ∇H(y), x − y⟩ = L||x − y||² − ⟨∇F(x) − ∇F(y), x − y⟩ ≥ L||x − y||² − L||x − y||² = 0.
Descent lemma (quadratic upper bound)

Let F ∈ C¹_L, then there holds
F(y) ≤ F(x) + ⟨∇F(x), y − x⟩ + (L/2)||y − x||², ∀x, y ∈ dom(F).

Proof. Define H(t) = F(x + t(y − x)), then
F(y) − F(x) = H(1) − H(0) = ∫₀¹ H′(t) dt = ∫₀¹ ⟨y − x, ∇F(x + t(y − x))⟩ dt
= ∫₀¹ ⟨y − x, ∇F(x)⟩ dt + ∫₀¹ ⟨y − x, ∇F(x + t(y − x)) − ∇F(x)⟩ dt
≤ ⟨y − x, ∇F(x)⟩ + ∫₀¹ ||y − x|| ||∇F(x + t(y − x)) − ∇F(x)|| dt
≤ ⟨y − x, ∇F(x)⟩ + ||y − x|| ∫₀¹ tL||y − x|| dt
= ⟨y − x, ∇F(x)⟩ + (L/2)||y − x||².

NB: alternatively, apply the first-order condition of convexity to H(x) := (L/2)||x||² − F(x).
Descent lemma: consequences

Corollary
Let F ∈ C¹_L and x⋆ ∈ Argmin(F), then
(1/2L)||∇F(x)||² ≤ F(x) − F(x⋆) ≤ (L/2)||x − x⋆||², ∀x ∈ dom(F).

Proof.
Right-hand inequality: since ∇F(x⋆) = 0,
F(x) ≤ F(x⋆) + ⟨∇F(x⋆), x − x⋆⟩ + (L/2)||x − x⋆||², ∀x ∈ dom(F).
Left-hand inequality:
F(x⋆) ≤ min_{y∈dom(F)} { F(x) + ⟨∇F(x), y − x⟩ + (L/2)||y − x||² } = F(x) − (1/2L)||∇F(x)||²,
where the minimising y is y = x − (1/L)∇F(x).
Co-coercivity of gradient

Co-coercivity
Let F ∈ C¹_L, then
⟨x − y, ∇F(x) − ∇F(y)⟩ ≥ (1/L)||∇F(x) − ∇F(y)||².

Co-coercivity implies Lipschitz continuity (by Cauchy-Schwarz). For F ∈ C¹_L, with H(x) := (L/2)||x||² − F(x), the implications close into a circle:
Lipschitz continuity of ∇F ⟹ convexity of H ⟹ co-coercivity of ∇F ⟹ Lipschitz continuity of ∇F.
Proof. Define R(z) = F(z) − ⟨∇F(x), z⟩, then ∇R(x) = 0, i.e. x ∈ Argmin(R).
Recall the corollary above for F ∈ C¹_L and x⋆ ∈ Argmin(F):
(1/2L)||∇F(x)||² ≤ F(x) − F(x⋆) ≤ (L/2)||x − x⋆||².
Applying it to R ∈ C¹_L, we have
F(y) − F(x) − ⟨∇F(x), y − x⟩ = R(y) − R(x) ≥ (1/2L)||∇R(y)||² = (1/2L)||∇F(y) − ∇F(x)||².
Similarly, define S(z) = F(z) − ⟨∇F(y), z⟩, then
F(x) − F(y) − ⟨∇F(y), x − y⟩ = S(x) − S(y) ≥ (1/2L)||∇F(x) − ∇F(y)||².
Summing the two inequalities yields the claim.
Strongly convex functions

Strong convexity
A function F : Rⁿ → R is strongly convex if dom(F) is convex and there exists α > 0 such that for all x, y ∈ dom(F) and θ ∈ [0, 1],
F(θx + (1 − θ)y) ≤ θF(x) + (1 − θ)F(y) − (α/2)θ(1 − θ)||x − y||².

- Equivalently, F is strongly convex with parameter α > 0 if G(x) := F(x) − (α/2)||x||² is convex.
- Monotonicity: ⟨∇F(x) − ∇F(y), x − y⟩ ≥ α||x − y||², ∀x, y ∈ dom(F).
- Second-order condition: if F ∈ C², ∇²F(x) ⪰ αId, ∀x ∈ dom(F).
Quadratic lower bound

Let F ∈ C¹ be α-strongly convex, then
F(y) ≥ F(x) + ⟨∇F(x), y − x⟩ + (α/2)||y − x||², ∀x, y ∈ dom(F).

Proof. First-order condition of convexity applied to G(x) := F(x) − (α/2)||x||².

Corollary
Let F ∈ C¹ be α-strongly convex and x⋆ ∈ Argmin(F), then
(α/2)||x − x⋆||² ≤ F(x) − F(x⋆) ≤ (1/2α)||∇F(x)||², ∀x ∈ dom(F).

Proof.
Left-hand inequality: quadratic lower bound taken at x⋆, where ∇F(x⋆) = 0.
Right-hand inequality:
F(x⋆) ≥ min_{y∈dom(F)} { F(x) + ⟨∇F(x), y − x⟩ + (α/2)||y − x||² } = F(x) − (1/2α)||∇F(x)||².
Extension of co-coercivity

If F ∈ C¹_L is α-strongly convex, then
G(x) := F(x) − (α/2)||x||²
is convex and ∇G is (L − α)-Lipschitz continuous. The co-coercivity of ∇G yields
⟨∇F(x) − ∇F(y), x − y⟩ ≥ (αL/(α + L))||x − y||² + (1/(α + L))||∇F(x) − ∇F(y)||²
for all x, y ∈ dom(F).

S¹_{α,L}: the class of functions in C¹_L that are α-strongly convex.
Rate of convergence

A sequence {xk} converges linearly to x⋆ if
lim_{k→+∞} ||xk+1 − x⋆|| / ||xk − x⋆|| = ρ
holds for some ρ ∈ ]0, 1[; ρ is called the rate of convergence.

If xk converges, let ρk = ||xk+1 − x⋆|| / ||xk − x⋆||:
- if lim_{k→+∞} ρk = 0: super-linear convergence;
- if lim_{k→+∞} ρk = 1: sub-linear convergence.

Super-linear convergence of order q > 1:
lim_{k→+∞} ||xk+1 − x⋆|| / ||xk − x⋆||^q < η
for some η ∈ ]0, 1[.
- q = 2: quadratic convergence;
- q = 3: cubic convergence.
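In practice one can inspect these ratios numerically; a small sketch (my own, not from the slides) for diagnosing the type of convergence from stored iterates:

    import numpy as np

    # Estimate the empirical ratios rho_k = ||x_{k+1} - x*|| / ||x_k - x*||;
    # rho_k -> 0 suggests super-linear, rho_k -> 1 sub-linear decay.
    def empirical_rates(iterates, x_star):
        errs = [np.linalg.norm(x - x_star) for x in iterates]
        return [e1 / e0 for e0, e1 in zip(errs, errs[1:]) if e0 > 0]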
Unconstrained smooth optimisation

Consider minimising
min_{x∈Rⁿ} F(x),
where F : Rⁿ → R is proper convex and smooth (differentiable).

Assumptions:
- F ∈ C¹ is convex;
- ∇F is L-Lipschitz continuous for some L > 0;
- the set of minimisers is non-empty, i.e. Argmin(F) ≠ ∅.
Gradient descent

Descent direction: let d = −∇F(x), then ⟨∇F(x), d⟩ = −||∇F(x)||² ≤ 0.

initial: x0 ∈ dom(F);
repeat:
1. Choose a step-size γk > 0.
2. Update xk+1 = xk − γk∇F(xk).
until: stopping criterion is satisfied.
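A minimal sketch with the constant step-size γ = 1/L used in the analysis below (the function names are mine, not from the slides):

    import numpy as np

    # Gradient descent with constant step-size gamma = 1/L, stopping on
    # the optimality criterion ||grad F(xk)|| <= eps.
    def gradient_descent(gradF, x0, L, eps=1e-8, max_iter=10_000):
        x, gamma = x0, 1.0 / L
        for _ in range(max_iter):
            g = gradF(x)
            if np.linalg.norm(g) <= eps:
                break
            x = x - gamma * g
        return x

For the quadratic example above, one would pass gradF = lambda x: A @ x + b and take L as the largest eigenvalue of A, e.g. np.linalg.eigvalsh(A).max().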
Convergence analysis: constant step-size

Owing to the quadratic upper bound,
F(xk+1) ≤ F(xk) + ⟨∇F(xk), xk+1 − xk⟩ + (L/2)||xk+1 − xk||²
= F(xk) − γ||∇F(xk)||² + (γ²L/2)||∇F(xk)||²
= F(xk) − γ(1 − γL/2)||∇F(xk)||².
Hence
F(xk) − F(xk+1) ≥ γ(1 − γL/2)||∇F(xk)||².
Let γ ∈ ]0, 2/L[; summing over the iterations,
γ(1 − γL/2) Σ_{i=0}^k ||∇F(xi)||² ≤ F(x0) − F(xk+1) ≤ F(x0) − F(x⋆).
Since F(x⋆) > −∞, the right-hand side is a finite constant; letting k → +∞ on the left-hand side,
lim_{k→+∞} ||∇F(xk)||² = 0.
NB: convexity is not required here.
Let γ ∈ ]0, 1/L], then γ(1 − γL/2) ≥ γ/2, and
F(xk+1) ≤ F(xk) − (γ/2)||∇F(xk)||²
(convexity of F at xk) ≤ F(x⋆) + ⟨∇F(xk), xk − x⋆⟩ − (γ/2)||∇F(xk)||²
= F(x⋆) + (1/2γ)( ||xk − x⋆||² − ||xk − x⋆ − γ∇F(xk)||² )
= F(x⋆) + (1/2γ)( ||xk − x⋆||² − ||xk+1 − x⋆||² ).

Summability of F(xk) − F(x⋆):
Σ_{i=1}^k ( F(xi) − F(x⋆) ) ≤ (1/2γ) Σ_{i=1}^k ( ||xi−1 − x⋆||² − ||xi − x⋆||² )
= (1/2γ)( ||x0 − x⋆||² − ||xk − x⋆||² ) ≤ (1/2γ)||x0 − x⋆||².

Since F(xk) − F(x⋆) is decreasing,
F(xk) − F(x⋆) ≤ (1/k) Σ_{i=1}^k ( F(xi) − F(x⋆) ) ≤ (1/2γk)||x0 − x⋆||².
Convergence analysis: strongly convex F

Besides the basic assumptions, further assume F ∈ S¹_{α,L}. Recall that, for all x, y ∈ dom(F),
⟨∇F(x) − ∇F(y), x − y⟩ ≥ (αL/(α + L))||x − y||² + (1/(α + L))||∇F(x) − ∇F(y)||².

Analysis for constant step-size: let γ ∈ ]0, 2/(α + L)[, then
||xk+1 − x⋆||² = ||xk − γ∇F(xk) − x⋆||²
= ||xk − x⋆||² − 2γ⟨∇F(xk), xk − x⋆⟩ + γ²||∇F(xk)||²
(using ∇F(x⋆) = 0) ≤ (1 − 2γαL/(α + L))||xk − x⋆||² + γ(γ − 2/(α + L))||∇F(xk)||²
≤ (1 − 2γαL/(α + L))||xk − x⋆||².
Distance to the minimiser: with ρ = 1 − 2γαL/(α + L),
||xk − x⋆||² ≤ ρᵏ||x0 − x⋆||²,
i.e. linear convergence; for γ = 2/(α + L),
ρ = ((L − α)/(L + α))².
Convergence rate of the objective value:
F(xk) − F(x⋆) ≤ (L/2)||xk − x⋆||² ≤ ρᵏ(L/2)||x0 − x⋆||².
Number of iterations k needed to reach F(xk) − F(x⋆) ≤ ε:
- F ∈ C¹_L: O(1/ε);
- F ∈ S¹_{α,L}: O(log(1/ε)).
Limits on the convergence rate of gradient descent

First-order method: xk is an element of the set
xk ∈ x0 + span{ ∇F(x0), ..., ∇F(xk−1) }.   (4.1)

Nesterov's lower bound (problem class C¹_L)
For every integer k ≤ (n − 1)/2 and every x0, there exist functions in the problem class such that, for any first-order method satisfying (4.1),
F(xk) − F(x⋆) ≥ (3/32) L||x0 − x⋆||²/(k + 1)²  and  ||xk − x⋆||² ≥ (1/8)||x0 − x⋆||².

- This suggests O(1/k) is not the optimal rate.
- Accelerated gradient methods can achieve the O(1/k²) rate.
Observations

Gradient descent: −γ∇F(xk) = xk+1 − xk.
Consider the angle θk := angle(∇F(xk+1), ∇F(xk)); then
lim_{k→+∞} θk = 0.
Exercise: prove this claim for least squares.
Consequently, for some constant a > 0,
−∇F(xk+1) ≈ a(xk+1 − xk).
Heavy-ball method (Polyak)

initial: x0 ∈ dom(F) and γ ∈ ]0, 2/L[;
repeat:
yk = xk + ak(xk − xk−1), ak ∈ [0, 1],
xk+1 = yk − γ∇F(xk).

- xk − xk−1 is called the inertial or momentum term, and ak the inertial parameter.
- Convergence can be proved by studying the Lyapunov function
E(xk) := F(xk) + (ak/2γ)||xk − xk−1||².
- In general, no convergence rate for F ∈ C¹_L; a local rate holds for F ∈ S²_{α,L}.
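A direct transcription of the iteration (a sketch; note the gradient is evaluated at xk, not at the extrapolated point yk, which is what distinguishes heavy-ball from Nesterov's scheme below):

    # Heavy-ball iteration as stated above; a and gamma are assumed to be
    # chosen according to the theory (e.g. the optimal pair in the theorem).
    def heavy_ball(gradF, x0, gamma, a, n_iter=1000):
        x_prev, x = x0, x0
        for _ in range(n_iter):
            y = x + a * (x - x_prev)             # inertial/momentum step
            x_prev, x = x, y - gamma * gradF(x)  # gradient step at x_k
        return x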
Convergence rate

Theorem
Let x⋆ be a (local) minimiser of F such that αId ⪯ ∇²F(x⋆) ⪯ LId, and choose a, γ with a ∈ [0, 1[, γ ∈ ]0, 2(1 + a)/L[. There exists ρ̄ < 1 such that, if ρ̄ < ρ < 1 and if x0, x1 are close enough to x⋆, one has
||xk − x⋆|| ≤ Cρᵏ.
Moreover, for
a = ((√L − √α)/(√L + √α))², γ = 4/(√L + √α)²,
one obtains ρ = (√L − √α)/(√L + √α).

- The starting points need to be close enough to x⋆.
- This is almost the optimal rate achievable by a gradient (first-order) method.
- Compare gradient descent: ρ = (L − α)/(L + α).
Convergence rate: proof

Taylor expansion:
xk+1 = xk + a(xk − xk−1) − γ∇²F(x⋆)(xk − x⋆) + o(||xk − x⋆||).
Let zk = (xk − x⋆, xk−1 − x⋆)ᵀ and H = ∇²F(x⋆), then
zk+1 = M zk + o(||zk||),  M = [ (1 + a)Id − γH  −aId ; Id  0 ].
Spectral radius ρ(M): for an eigenvalue λ of H, setting η = 1 − γλ, the eigenvalues ρ of M satisfy
0 = ρ² − (a + η)ρ + a,
so ρ(M) is a function of a and η (essentially of γ).
Convergence rate of gradient descent

Gradient descent with constant step-size:
- F ∈ C¹_L:
F(xk) − F(x⋆) ≤ L||x0 − x⋆||²/(k + 4).
- F ∈ S¹_{α,L}:
F(xk) − F(x⋆) ≤ (L/2)((L − α)/(L + α))²ᵏ||x0 − x⋆||².
Nesterov's optimal scheme

Optimal scheme with constant step-size
initial: choose x0 ∈ Rⁿ and φ0 ∈ ]0, 1[; let y0 = x0 and q = α/L.
repeat:
1. Compute φk+1 ∈ ]0, 1[ from the equation
φk+1² = (1 − φk+1)φk² + qφk+1,
let ak = φk(1 − φk)/(φk² + φk+1), and set
yk = xk + ak(xk − xk−1).
2. Update
xk+1 = yk − (1/L)∇F(yk).
until: stopping criterion is satisfied.
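A compact sketch of the scheme (mine, not from the slides); the quadratic for φk+1 is solved in closed form, and phi = 1 with alpha = 0 corresponds to the plain convex case q = 0 from the parameter choices below:

    import numpy as np

    # Nesterov's optimal scheme with constant step-size 1/L.
    def nesterov(gradF, x0, L, alpha=0.0, phi=1.0, n_iter=1000):
        q = alpha / L
        x_prev, x = x0, x0
        for _ in range(n_iter):
            # positive root of phi_new^2 + (phi^2 - q) phi_new - phi^2 = 0
            b = phi**2 - q
            phi_new = (np.sqrt(b**2 + 4 * phi**2) - b) / 2
            a = phi * (1 - phi) / (phi**2 + phi_new)
            y = x + a * (x - x_prev)               # extrapolated point
            x_prev, x = x, y - gradF(y) / L        # gradient step at y
            phi = phi_new
        return x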
Convergence rate

Let φ0 ≥ √(α/L), then
F(xk) − F(x⋆) ≤ min{ (1 − √(α/L))ᵏ, 4L/(2√L + k√ν)² } × ( F(x0) − F(x⋆) + (ν/2)||x0 − x⋆||² ),
where ν = φ0(φ0L − α)/(1 − φ0).

Parameter choices:
- F ∈ C¹_L: φ0 = 1, so that
q = 0, φk ≈ 2/(k + 1) → 0 and ak ≈ (1 − φk)/(1 + φk) → 1.
- F ∈ S¹_{α,L}: φ0 = √(α/L), so that
q = α/L, φk ≡ √(α/L) and ak ≡ (√L − √α)/(√L + √α).
Dynamical system of gradient descent

From gradient descent,
(xk+1 − xk)/γ = −∇F(xk).
Letting γ be small enough, we obtain the gradient flow
Ẋ(t) + ∇F(X(t)) = 0.

Discretisation:
- Explicit Euler method: Ẋ(t) = (X(t + h) − X(t))/h.
- Implicit Euler method: Ẋ(t) = (X(t) − X(t − h))/h.
Dynamical system of inertial schemes

Given a 2nd-order dynamical system
Ẍ(t) + λ(t)Ẋ(t) + ∇F(X(t)) = 0.
Discretisation of the 2nd-order term:
Ẍ(t) = (X(t + h) − 2X(t) + X(t − h))/h².
Implicit Euler for the 1st-order term:
Ẋ(t) = (X(t) − X(t − h))/h.
Combining them,
X(t + h) − X(t) − (1 − hλ(t))(X(t) − X(t − h)) + h²∇F(X(t)) = 0.
Choices:
- Heavy-ball: hλ(t) ∈ ]0, 1[.
- Nesterov: λ(t) = d/t, d > 3.
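As a sanity check (a sketch of my own), identifying X(t) with xk and X(t ± h) with xk±1 in the combined recursion reproduces an inertial gradient step; with hλ(t) = 1 − a it is exactly heavy-ball with γ = h²:

    # One step of the discretised inertial flow:
    # x_{k+1} = x_k + (1 - h*lam)(x_k - x_{k-1}) - h^2 grad F(x_k).
    def inertial_flow_step(gradF, x, x_prev, h, lam):
        a = 1.0 - h * lam                  # inertial parameter a_k
        return x + a * (x - x_prev) - h**2 * gradF(x)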
Reference

- S. Boyd and L. Vandenberghe. "Convex Optimization". Cambridge University Press, 2004.
- B. Polyak. "Introduction to Optimization". Optimization Software, 1987.
- Y. Nesterov. "Introductory Lectures on Convex Optimization: A Basic Course". Vol. 87. Springer Science & Business Media, 2013.
- W. Su, S. Boyd, and E. Candès. "A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights". Journal of Machine Learning Research, 17(153):1-43, 2016.