

SLIDE 1

Introductory Course on Non-smooth Optimisation

Lecture 01 - Gradient methods

Jingwei Liang

Department of Applied Mathematics and Theoretical Physics

SLIDE 2

Table of contents

1. Unconstrained smooth optimisation
2. Descent methods
3. Gradient of convex functions
4. Gradient descent
5. Heavy-ball method
6. Nesterov's optimal schemes
7. Dynamical system

SLIDE 3

Convexity

Convex set
A set S ⊂ R^n is convex if for any θ ∈ [0, 1] and two points x, y ∈ S, θx + (1 − θ)y ∈ S.

Convex function
A function F : R^n → R is convex if dom(F) is convex and for all x, y ∈ dom(F) and θ ∈ [0, 1],
    F(θx + (1 − θ)y) ≤ θF(x) + (1 − θ)F(y).
Proper convex: F(x) < +∞ for at least one x and F(x) > −∞ for all x.
1st-order condition: if F is continuously differentiable,
    F(y) ≥ F(x) + ⟨∇F(x), y − x⟩, ∀x, y ∈ dom(F).
2nd-order condition: if F is twice differentiable,
    ∇²F(x) ⪰ 0, ∀x ∈ dom(F).

SLIDE 4

Unconstrained smooth optimisation

Problem
Unconstrained smooth optimisation:
    min_{x ∈ R^n} F(x),
where F : R^n → R is proper convex and smooth (differentiable).
Optimality condition: let x⋆ be a minimiser of F, then 0 = ∇F(x⋆).
[Figure: graph of F with gradients ∇F(x), ∇F(xk) and ∇F(x⋆) = 0 at the minimiser.]

SLIDE 5

Example: quadratic minimisation

Quadratic programming
General quadratic programming problem:
    min_{x ∈ R^n}  (1/2)xᵀAx + bᵀx + c,
where A ∈ R^{n×n} is symmetric positive definite, b ∈ R^n and c ∈ R.
Optimality condition: 0 = Ax⋆ + b.

Special case: least squares
    ||Ax − b||² = xᵀ(AᵀA)x − 2(Aᵀb)ᵀx + bᵀb.
Optimality condition: AᵀAx⋆ = Aᵀb.
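
A quick numerical check, not from the slides: for an assumed random least-squares problem, the solution of the normal equations AᵀAx⋆ = Aᵀb has (numerically) zero gradient. Problem sizes and variable names are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 5))   # arbitrary data matrix
    b = rng.standard_normal(20)

    # F(x) = ||Ax - b||^2, gradient 2 A^T (Ax - b); optimality: A^T A x* = A^T b
    x_star = np.linalg.solve(A.T @ A, A.T @ b)
    grad = 2 * A.T @ (A @ x_star - b)
    print(np.linalg.norm(grad))        # ~ 0 up to round-off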

SLIDE 6

Example: geometric programming

Geometric programming
    min_{x ∈ R^n}  log( Σ_{i=1}^{m} exp(aᵢᵀx + bᵢ) ).
Optimality condition:
    0 = ( 1 / Σ_{i=1}^{m} exp(aᵢᵀx⋆ + bᵢ) ) · Σ_{i=1}^{m} exp(aᵢᵀx⋆ + bᵢ) aᵢ.
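
A minimal sketch with made-up data aᵢ, bᵢ comparing the analytic gradient of the log-sum-exp objective above with a finite-difference estimate; the dimensions and test point are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    a = rng.standard_normal((4, 3))    # rows a_i
    b = rng.standard_normal(4)

    def F(x):
        return np.log(np.sum(np.exp(a @ x + b)))

    def grad_F(x):
        w = np.exp(a @ x + b)
        return (w @ a) / np.sum(w)     # sum_i w_i a_i / sum_i w_i

    x = rng.standard_normal(3)
    eps = 1e-6
    fd = np.array([(F(x + eps * e) - F(x - eps * e)) / (2 * eps) for e in np.eye(3)])
    print(np.max(np.abs(fd - grad_F(x))))   # small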

SLIDE 7

Outline

1. Unconstrained smooth optimisation
2. Descent methods
3. Gradient of convex functions
4. Gradient descent
5. Heavy-ball method
6. Nesterov's optimal schemes
7. Dynamical system

SLIDE 8

Problem

Unconstrained smooth optimisation
Consider minimising
    min_{x ∈ R^n} F(x),
where F : R^n → R is proper convex and smooth (differentiable).
[Figure: the graph of F and its minimiser x⋆.]

SLIDE 9

Problem

Unconstrained smooth optimisation
Consider minimising
    min_{x ∈ R^n} F(x),
where F : R^n → R is proper convex and smooth (differentiable).
The set of minimisers, i.e.
    Argmin(F) = {x ∈ R^n : F(x) = min_{x ∈ R^n} F(x)},
is non-empty. However, for x⋆ ∈ Argmin(F) there is in general no closed-form expression.
Iterative strategy to find one x⋆ ∈ Argmin(F): start from x0 and generate a sequence {xk}k∈N such that
    lim_{k→∞} xk = x⋆ ∈ Argmin(F).

SLIDE 10

Problem

Unconstrained smooth optimisation
Consider minimising
    min_{x ∈ R^n} F(x),
where F : R^n → R is proper convex and smooth (differentiable).
[Figure: iterates xk−1, xk, xk+1, xk+2 approaching the minimiser x⋆.]

SLIDE 11

Descent methods

Iterative scheme
For each k = 1, 2, ..., find γk > 0 and dk ∈ R^n, and then update
    xk+1 = xk + γk dk,
where dk is called the search/descent direction and γk is called the step-size.

Descent methods
An algorithm is called a descent method if F(xk+1) < F(xk) holds.
NB: if xk ∈ Argmin(F), then xk+1 = xk.

SLIDE 12

Conditions

From the convexity of F, we have
    F(xk+1) ≥ F(xk) + ⟨∇F(xk), xk+1 − xk⟩,
which gives
    ⟨∇F(xk), xk+1 − xk⟩ ≥ 0  ⟹  F(xk+1) ≥ F(xk).
Since xk+1 − xk = γk dk, the direction dk should be such that ⟨∇F(xk), dk⟩ < 0.
[Figure: the gradient ∇F(xk) at xk and the minimiser x⋆.]

SLIDE 13

General descent method

General descent method
initial: x0 ∈ dom(F);
repeat:
  1. Find a descent direction dk.
  2. Choose a step-size γk: line search.
  3. Update xk+1 = xk + γk dk.
until: stopping criterion is satisfied.

Stopping criterion: ǫ > 0 is the tolerance,
  Function value: F(xk) − F(xk+1) ≤ ǫ (can be time consuming).
  Sequence: ||xk+1 − xk|| ≤ ǫ.
  Optimality condition: ||∇F(xk)|| ≤ ǫ.
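
A skeleton of this general descent loop, using the gradient-norm stopping rule; `direction` and `step_size` are hypothetical callables standing in for the method-specific choices in steps 1 and 2.

    import numpy as np

    def descent_method(F, grad_F, x0, direction, step_size, tol=1e-6, max_iter=1000):
        """Generic descent loop: x_{k+1} = x_k + gamma_k * d_k."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_F(x)
            if np.linalg.norm(g) <= tol:        # optimality-condition stopping rule
                break
            d = direction(x, g)                 # must satisfy <grad F(x), d> < 0
            gamma = step_size(F, grad_F, x, d)  # e.g. exact or backtracking line search
            x = x + gamma * d
        return x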

SLIDE 14

Exact line search

Exact line search
Suppose that the direction dk is given. Choose γk such that F is minimised along the ray xk + γdk, γ > 0:
    γk = argmin_{γ>0} F(xk + γdk).
Useful when the minimisation problem for γk is simple; γk can be found analytically in special cases.
[Figure: F(xk + γdk) as a function of γ ≥ 0, minimised at γ = γk.]
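
One case where the exact step is available analytically: for a quadratic F(x) = (1/2)xᵀAx + bᵀx and a direction d, minimising γ ↦ F(x + γd) gives γ = −⟨∇F(x), d⟩ / ⟨d, Ad⟩. A small sketch with assumed data:

    import numpy as np

    rng = np.random.default_rng(2)
    M = rng.standard_normal((5, 5))
    A = M @ M.T + np.eye(5)                # symmetric positive definite
    b = rng.standard_normal(5)

    def grad(x):
        return A @ x + b                   # gradient of (1/2) x^T A x + b^T x

    x = rng.standard_normal(5)
    d = -grad(x)                           # steepest-descent direction
    gamma = -(grad(x) @ d) / (d @ A @ d)   # exact minimiser of F(x + gamma d)
    # sanity check: the directional derivative at the chosen step is (numerically) zero
    print(grad(x + gamma * d) @ d)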

SLIDE 15

Backtracking/inexact line search

Backtracking line search
Suppose that the direction dk is given. Choose δ ∈ ]0, 0.5[ and β ∈ ]0, 1[, let γ = 1,
    while F(xk + γdk) > F(xk) + δγ⟨∇F(xk), dk⟩ :  γ = βγ.
The goal is to reduce F enough along the direction dk. Since dk is a descent direction, ⟨∇F(xk), dk⟩ < 0.
Stopping criterion for backtracking: F(xk + γdk) ≤ F(xk) + δγ⟨∇F(xk), dk⟩.
When γ is small enough,
    F(xk + γdk) ≈ F(xk) + γ⟨∇F(xk), dk⟩ < F(xk) + δγ⟨∇F(xk), dk⟩,
which means the backtracking eventually stops.
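
A direct transcription of the backtracking rule above into code, as a sketch; the default values of δ and β and the toy usage are illustrative only.

    import numpy as np

    def backtracking(F, grad_F, x, d, delta=0.25, beta=0.5):
        """Shrink gamma until F(x + gamma d) <= F(x) + delta * gamma * <grad F(x), d>."""
        gamma = 1.0
        slope = grad_F(x) @ d              # < 0 for a descent direction
        while F(x + gamma * d) > F(x) + delta * gamma * slope:
            gamma *= beta
        return gamma

    # example use on F(x) = 1/2 ||x||^2 with the steepest-descent direction
    F = lambda x: 0.5 * np.dot(x, x)
    g = lambda x: x
    x0 = np.array([3.0, -4.0])
    print(backtracking(F, g, x0, -g(x0)))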

SLIDE 16

Backtracking/inexact line search

Backtracking line search
Suppose that the direction dk is given. Choose δ ∈ ]0, 0.5[ and β ∈ ]0, 1[, let γ = 1,
    while F(xk + γdk) > F(xk) + δγ⟨∇F(xk), dk⟩ :  γ = βγ.
[Figure: F(xk + γdk) versus γ, together with the lines F(xk) + γ∇F(xk)ᵀdk and F(xk) + δγ∇F(xk)ᵀdk.]

SLIDE 17

Outline

1. Unconstrained smooth optimisation
2. Descent methods
3. Gradient of convex functions
4. Gradient descent
5. Heavy-ball method
6. Nesterov's optimal schemes
7. Dynamical system

SLIDE 18

Monotonicity

Monotonicity of gradient
Let F : R^n → R be proper convex and smooth (differentiable), then
    ⟨∇F(x) − ∇F(y), x − y⟩ ≥ 0, ∀x, y ∈ dom(F).
C^1: the class of proper convex and smooth (differentiable) functions on R^n.

Proof
Owing to convexity, given x, y ∈ dom(F), we have
    F(y) ≥ F(x) + ⟨∇F(x), y − x⟩  and  F(x) ≥ F(y) + ⟨∇F(y), x − y⟩.
Summing them up yields ⟨∇F(x) − ∇F(y), x − y⟩ ≥ 0.
NB: Let F ∈ C^1; F is convex if and only if ∇F is monotone.

SLIDE 19

Lipschitz continuous gradient

Lipschitz continuity
The gradient of F is L-Lipschitz continuous if there exists L > 0 such that
    ||∇F(x) − ∇F(y)|| ≤ L||x − y||, ∀x, y ∈ dom(F).
C^1_L: the class of proper convex functions with L-Lipschitz continuous gradient on R^n.
If F ∈ C^1_L, then
    H(x) := (L/2)||x||² − F(x)
is convex.
Hint: monotonicity of ∇H, i.e.
    ⟨∇H(x) − ∇H(y), x − y⟩ = L||x − y||² − ⟨∇F(x) − ∇F(y), x − y⟩ ≥ L||x − y||² − L||x − y||² = 0.

SLIDE 20

Descent lemma

Descent lemma, quadratic upper bound
Let F ∈ C^1_L, then there holds
    F(y) ≤ F(x) + ⟨∇F(x), y − x⟩ + (L/2)||y − x||², ∀x, y ∈ dom(F).

Proof
Define H(t) = F(x + t(y − x)), then
    F(y) − F(x) = H(1) − H(0) = ∫₀¹ H′(t) dt = ∫₀¹ (y − x)ᵀ∇F(x + t(y − x)) dt
                = ∫₀¹ (y − x)ᵀ∇F(x) dt + ∫₀¹ (y − x)ᵀ( ∇F(x + t(y − x)) − ∇F(x) ) dt
                ≤ (y − x)ᵀ∇F(x) + ∫₀¹ ||y − x|| ||∇F(x + t(y − x)) − ∇F(x)|| dt
                ≤ (y − x)ᵀ∇F(x) + ||y − x|| ∫₀¹ tL||y − x|| dt
                = (y − x)ᵀ∇F(x) + (L/2)||y − x||².

NB: alternatively, this is the first-order condition of convexity for H(x) := (L/2)||x||² − F(x).

SLIDE 21

Descent lemma: consequences

Corollary
Let F ∈ C^1_L and x⋆ ∈ Argmin(F), then
    (1/(2L))||∇F(x)||² ≤ F(x) − F(x⋆) ≤ (L/2)||x − x⋆||², ∀x ∈ dom(F).

Proof
Right-hand inequality: since ∇F(x⋆) = 0,
    F(x) ≤ F(x⋆) + ⟨∇F(x⋆), x − x⋆⟩ + (L/2)||x − x⋆||² = F(x⋆) + (L/2)||x − x⋆||², ∀x ∈ dom(F).
Left-hand inequality:
    F(x⋆) ≤ min_{y ∈ dom(F)} { F(x) + ⟨∇F(x), y − x⟩ + (L/2)||y − x||² } = F(x) − (1/(2L))||∇F(x)||².
The corresponding minimising y is y = x − (1/L)∇F(x).

SLIDE 22

Co-coercivity of gradient

Co-coercivity
Let F ∈ C^1_L, then
    ⟨x − y, ∇F(x) − ∇F(y)⟩ ≥ (1/L)||∇F(x) − ∇F(y)||².
Co-coercivity implies Lipschitz continuity (apply the Cauchy–Schwarz inequality to the left-hand side).
For F ∈ C^1_L and H(x) := (L/2)||x||² − F(x), we thus have the chain
    Lipschitz continuity of ∇F ⟹ convexity of H ⟹ co-coercivity of ∇F ⟹ Lipschitz continuity of ∇F.

SLIDE 23

Co-coercivity of gradient

Co-coercivity
Let F ∈ C^1_L, then
    ⟨x − y, ∇F(x) − ∇F(y)⟩ ≥ (1/L)||∇F(x) − ∇F(y)||².

Proof
Define R(z) = F(z) − ⟨∇F(x), z⟩, then ∇R(x) = 0, i.e. x ∈ Argmin(R).
Recall the corollary: for F ∈ C^1_L and x⋆ ∈ Argmin(F),
    (1/(2L))||∇F(x)||² ≤ F(x) − F(x⋆) ≤ (L/2)||x − x⋆||².
Then we have
    F(y) − F(x) − ⟨∇F(x), y − x⟩ = R(y) − R(x) ≥ (1/(2L))||∇R(y)||² = (1/(2L))||∇F(y) − ∇F(x)||².
Similarly, define S(z) = F(z) − ⟨∇F(y), z⟩, then
    F(x) − F(y) − ⟨∇F(y), x − y⟩ = S(x) − S(y) ≥ (1/(2L))||∇F(x) − ∇F(y)||².
Summing the two inequalities yields the claim.

SLIDE 24

Strongly convex function

Strong convexity
Function F : R^n → R is strongly convex if dom(F) is convex and for all x, y ∈ dom(F) and θ ∈ [0, 1], there exists α > 0 such that
    F(θx + (1 − θ)y) ≤ θF(x) + (1 − θ)F(y) − (α/2)θ(1 − θ)||x − y||².
F is strongly convex with parameter α > 0 if
    G(x) := F(x) − (α/2)||x||²
is convex.
Monotonicity: ⟨∇F(x) − ∇F(y), x − y⟩ ≥ α||x − y||², ∀x, y ∈ dom(F).
Second-order condition for strong convexity: if F ∈ C², then ∇²F(x) ⪰ αId, ∀x ∈ dom(F).

SLIDE 25

Quadratic lower bound

Quadratic lower bound
Let F ∈ C^1 be strongly convex, then
    F(y) ≥ F(x) + ⟨∇F(x), y − x⟩ + (α/2)||y − x||², ∀x, y ∈ dom(F).

Proof
First-order condition of convexity for G(x) := F(x) − (α/2)||x||².

Corollary
Let F ∈ C^1 be α-strongly convex and x⋆ ∈ Argmin(F), then
    (α/2)||x − x⋆||² ≤ F(x) − F(x⋆) ≤ (1/(2α))||∇F(x)||², ∀x ∈ dom(F).

Proof
Left-hand inequality: quadratic lower bound taken at x⋆, where ∇F(x⋆) = 0.
Right-hand inequality:
    F(x⋆) ≥ min_{y ∈ dom(F)} { F(x) + ⟨∇F(x), y − x⟩ + (α/2)||y − x||² } = F(x) − (1/(2α))||∇F(x)||².

SLIDE 26

Extension of co-coercivity

If F ∈ C^1_L and α-strongly convex, then
    G(x) := F(x) − (α/2)||x||²
is convex, and ∇G is (L − α)-Lipschitz continuous. The co-coercivity of ∇G yields
    ⟨∇F(x) − ∇F(y), x − y⟩ ≥ (αL/(α + L))||x − y||² + (1/(α + L))||∇F(x) − ∇F(y)||²
for all x, y ∈ dom(F).
S^1_{α,L}: the class of functions in C^1_L that are α-strongly convex.

SLIDE 27

Rate of convergence

A sequence xk converges linearly to x⋆ if
    lim_{k→+∞} ||xk+1 − x⋆|| / ||xk − x⋆|| = ρ
holds for ρ ∈ ]0, 1[, and ρ is called the rate of convergence.
If xk converges, let ρk = ||xk+1 − x⋆|| / ||xk − x⋆||,
  – if lim_{k→+∞} ρk = 0: super-linear convergence.
  – if lim_{k→+∞} ρk = 1: sub-linear convergence.
Super-linear convergence of order q > 1:
    lim_{k→+∞} ||xk+1 − x⋆|| / ||xk − x⋆||^q < η
for some η ∈ ]0, 1[.
  – q = 2: quadratic convergence.
  – q = 3: cubic convergence.

SLIDE 28

Outline

1. Unconstrained smooth optimisation
2. Descent methods
3. Gradient of convex functions
4. Gradient descent
5. Heavy-ball method
6. Nesterov's optimal schemes
7. Dynamical system

SLIDE 29

Unconstrained smooth optimisation

Unconstrained smooth optimisation
Consider minimising
    min_{x ∈ R^n} F(x),
where F : R^n → R is proper convex and smooth (differentiable).
Assumptions:
  F ∈ C^1 is convex.
  ∇F is L-Lipschitz continuous for some L > 0.
  The set of minimisers is non-empty, i.e. Argmin(F) ≠ ∅.

SLIDE 30

Gradient descent

Descent direction: let d = −∇F(x), then ⟨∇F(x), d⟩ = −||∇F(x)||² ≤ 0.

Gradient descent
initial: x0 ∈ dom(F);
repeat:
  1. Choose step-size γk > 0.
  2. Update xk+1 = xk − γk∇F(xk).
until: stopping criterion is satisfied.
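
A minimal gradient-descent sketch with constant step-size and the gradient-norm stopping criterion; the quadratic test problem and the choice γ = 1/L are assumptions for illustration.

    import numpy as np

    def gradient_descent(grad_F, x0, gamma, tol=1e-8, max_iter=10000):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_F(x)
            if np.linalg.norm(g) <= tol:
                break
            x = x - gamma * g                  # x_{k+1} = x_k - gamma * grad F(x_k)
        return x

    # example: F(x) = 1/2 x^T A x + b^T x, step-size 1/L with L = largest eigenvalue of A
    rng = np.random.default_rng(3)
    M = rng.standard_normal((5, 5))
    A = M @ M.T + np.eye(5)
    b = rng.standard_normal(5)
    L = np.linalg.eigvalsh(A).max()
    x_hat = gradient_descent(lambda x: A @ x + b, np.zeros(5), 1.0 / L)
    print(np.linalg.norm(A @ x_hat + b))       # close to zero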

SLIDE 31

Convergence analysis: constant step-size

Owing to the quadratic upper bound,
    F(xk+1) ≤ F(xk) + ⟨∇F(xk), xk+1 − xk⟩ + (L/2)||xk+1 − xk||²
            = F(xk) − γ||∇F(xk)||² + (γ²L/2)||∇F(xk)||²
            = F(xk) − γ(1 − γL/2)||∇F(xk)||².
Hence
    F(xk) − F(xk+1) ≥ γ(1 − γL/2)||∇F(xk)||².
Let γ ∈ ]0, 2/L[; summing over the iterations,
    γ(1 − γL/2) Σ_{i=0}^{k} ||∇F(xi)||² ≤ F(x0) − F(xk+1) ≤ F(x0) − F(x⋆).
Since F(x⋆) > −∞, the right-hand side is a finite constant; letting k → +∞ on the left-hand side,
    lim_{k→+∞} ||∇F(xk)||² = 0.
NB: convexity is not required here.

SLIDE 32

Convergence analysis: constant step-size

Let γ ∈ ]0, 1/L], then γ(1 − γL/2) ≥ γ/2, and
    F(xk+1) ≤ F(xk) − (γ/2)||∇F(xk)||²
  (by convexity of F at xk)
            ≤ F(x⋆) + ⟨∇F(xk), xk − x⋆⟩ − (γ/2)||∇F(xk)||²
            = F(x⋆) + (1/(2γ)) ( ||xk − x⋆||² − ||xk − x⋆ − γ∇F(xk)||² )
            = F(x⋆) + (1/(2γ)) ( ||xk − x⋆||² − ||xk+1 − x⋆||² ).
Summability of F(xk) − F(x⋆):
    Σ_{i=1}^{k} ( F(xi) − F(x⋆) ) ≤ (1/(2γ)) Σ_{i=1}^{k} ( ||xi−1 − x⋆||² − ||xi − x⋆||² )
                                  = (1/(2γ)) ( ||x0 − x⋆||² − ||xk − x⋆||² )
                                  ≤ (1/(2γ)) ||x0 − x⋆||².
Since F(xk) − F(x⋆) is non-increasing,
    F(xk) − F(x⋆) ≤ (1/k) Σ_{i=1}^{k} ( F(xi) − F(x⋆) ) ≤ (1/(2γk)) ||x0 − x⋆||².

SLIDE 33

Convergence analysis: strongly convex F

Besides the basic assumptions, let us further assume F ∈ S^1_{α,L}.
Recall that, for all x, y ∈ dom(F),
    ⟨∇F(x) − ∇F(y), x − y⟩ ≥ (αL/(α + L))||x − y||² + (1/(α + L))||∇F(x) − ∇F(y)||².
Analysis for constant step-size: let γ ∈ ]0, 2/(α + L)],
    ||xk+1 − x⋆||² = ||xk − γ∇F(xk) − x⋆||²
                   = ||xk − x⋆||² − 2γ⟨∇F(xk), xk − x⋆⟩ + γ²||∇F(xk)||²
  (using ∇F(x⋆) = 0)
                   ≤ ( 1 − 2γαL/(α + L) ) ||xk − x⋆||² + γ( γ − 2/(α + L) ) ||∇F(xk)||²
                   ≤ ( 1 − 2γαL/(α + L) ) ||xk − x⋆||².

SLIDE 34

Convergence analysis: strongly convex F

Distance to the minimiser: with ρ = 1 − 2γαL/(α + L),
    ||xk − x⋆||² ≤ ρ^k ||x0 − x⋆||²,
i.e. linear convergence; for γ = 2/(α + L),
    ρ = ( (L − α)/(L + α) )².
Convergence rate of the objective function value:
    F(xk) − F(x⋆) ≤ (L/2)||xk − x⋆||² ≤ ρ^k (L/2)||x0 − x⋆||².
Number of iterations k needed for F(xk) − F(x⋆) ≤ ǫ:
  F ∈ C^1_L: O(1/ǫ).
  F ∈ S^1_{α,L}: O(log(1/ǫ)).
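
As a numeric sanity check (with an assumed diagonal quadratic, so α and L are the extreme Hessian eigenvalues), the contraction factor per iteration for γ = 2/(α + L) should approach (L − α)/(L + α):

    import numpy as np

    # quadratic with Hessian eigenvalues in [alpha, L], minimiser x* = 0
    alpha, L = 1.0, 10.0
    A = np.diag(np.linspace(alpha, L, 5))
    gamma = 2.0 / (alpha + L)

    x = np.ones(5)
    for k in range(20):
        prev = np.linalg.norm(x)
        x = x - gamma * (A @ x)                 # grad F(x) = A x
        print(k, np.linalg.norm(x) / prev)      # approaches (L - alpha)/(L + alpha) = 9/11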

SLIDE 35

Limits on convergence rate of gradient descent

First-order method: xk is an element of the set
    x0 + span{ ∇F(x0), ..., ∇F(xi), ..., ∇F(xk−1) }.        (4.1)
Problem class: C^1_L.

Nesterov's lower bound
For every integer k ≤ (n − 1)/2 and every x0, there exist functions in the problem class such that any first-order method satisfying (4.1) obeys
    F(xk) − F(x⋆) ≥ (3L||x0 − x⋆||²)/(32(k + 1)²),    ||xk − x⋆||² ≥ (1/8)||x0 − x⋆||².
This suggests that O(1/k) is not the optimal rate; accelerated gradient methods can achieve an O(1/k²) rate.

SLIDE 36

Outline

1. Unconstrained smooth optimisation
2. Descent methods
3. Gradient of convex functions
4. Gradient descent
5. Heavy-ball method
6. Nesterov's optimal schemes
7. Dynamical system

SLIDE 37

Observations

Gradient descent: −γ∇F(xk) = xk+1 − xk.
Consider the angle θk := angle(∇F(xk+1), ∇F(xk)); then
    lim_{k→+∞} θk = 0.
Exercise: prove this claim for least squares.
Let a > 0 be some constant, then −∇F(xk+1) ≈ a(xk+1 − xk).
[Figure: iterates xk−1, xk, xk+1, xk+2 approaching the minimiser x⋆.]

SLIDE 38

Heavy-ball method

Heavy-ball method (Polyak)
Initial: x0 ∈ dom(F) and γ ∈ ]0, 2/L[;
    yk = xk + ak(xk − xk−1), ak ∈ [0, 1],
    xk+1 = yk − γ∇F(xk).
[Figure: iterates xk−1, xk, xk+1, xk+2 and the extrapolated point yk on the way towards x⋆.]

SLIDE 39

Heavy-ball method

Heavy-ball method (Polyak)
Initial: x0 ∈ dom(F) and γ ∈ ]0, 2/L[;
    yk = xk + ak(xk − xk−1), ak ∈ [0, 1],
    xk+1 = yk − γ∇F(xk).
xk − xk−1 is called the inertial term or momentum term; ak is called the inertial parameter.
Convergence can be proved by studying the Lyapunov function
    E(xk) := F(xk) + (ak/(2γ))||xk − xk−1||².
In general, there is no convergence rate for F ∈ C^1_L; a local rate is available for F ∈ S^2_{α,L}.
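
A heavy-ball sketch with constant inertial parameter; the quadratic test problem, the initialisation x1 = x0, and the tuned values a = ((√L − √α)/(√L + √α))² and γ = 4/(√L + √α)² quoted on the next slide are assumptions for illustration.

    import numpy as np

    def heavy_ball(grad_F, x0, gamma, a, n_iter=200):
        """x_{k+1} = x_k + a (x_k - x_{k-1}) - gamma * grad F(x_k), with x_{-1} = x_0."""
        x_prev = np.asarray(x0, dtype=float)
        x = x_prev.copy()
        for _ in range(n_iter):
            y = x + a * (x - x_prev)
            x_prev, x = x, y - gamma * grad_F(x)
        return x

    # assumed strongly convex quadratic test: F(x) = 1/2 x^T A x, alpha = 1, L = 100
    alpha, L = 1.0, 100.0
    A = np.diag(np.linspace(alpha, L, 10))
    a = ((np.sqrt(L) - np.sqrt(alpha)) / (np.sqrt(L) + np.sqrt(alpha))) ** 2
    gamma = 4.0 / (np.sqrt(L) + np.sqrt(alpha)) ** 2
    x_hat = heavy_ball(lambda x: A @ x, np.ones(10), gamma, a)
    print(np.linalg.norm(x_hat))       # close to 0, the unique minimiser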

SLIDE 40

Convergence rate

Theorem
Let x⋆ be a (local) minimiser of F such that αId ⪯ ∇²F(x⋆) ⪯ LId, and choose a, γ with a ∈ [0, 1[, γ ∈ ]0, 2(1 + a)/L[. There exists ρ̄ < 1 such that, for any ρ with ρ̄ < ρ < 1, if x0, x1 are close enough to x⋆ one has
    ||xk − x⋆|| ≤ Cρ^k.
Moreover, if
    a = ( (√L − √α)/(√L + √α) )²,    γ = 4/(√L + √α)²,
then
    ρ = (√L − √α)/(√L + √α).

The starting points need to be close enough to x⋆.
This is almost the optimal rate that can be achieved by a gradient (first-order) method.
Gradient descent: ρ = (L − α)/(L + α).

SLIDE 41

Convergence rate: proof

Taylor expansion:
    xk+1 = xk + a(xk − xk−1) − γ∇²F(x⋆)(xk − x⋆) + o(||xk − x⋆||).
Let zk = (xk − x⋆, xk−1 − x⋆)ᵀ and H = ∇²F(x⋆), then
    zk+1 = M zk + o(||zk||),    M = ( (1 + a)Id − γH   −aId
                                       Id                0  ).
Spectral radius ρ(M): with η = 1 − γα, the eigenvalues of the corresponding block solve
    0 = ρ² − (a + η)ρ + a.
ρ(M) is a function of a and η (essentially of γ).

SLIDE 42

Outline

1. Unconstrained smooth optimisation
2. Descent methods
3. Gradient of convex functions
4. Gradient descent
5. Heavy-ball method
6. Nesterov's optimal schemes
7. Dynamical system

SLIDE 43

Convergence rate of gradient descent

Gradient descent with constant step-size:
  F ∈ C^1_L:
    F(xk) − F(x⋆) ≤ L||x0 − x⋆||² / (k + 4).
  F ∈ S^1_{α,L}:
    F(xk) − F(x⋆) ≤ (L/2) ( (L − α)/(L + α) )^{2k} ||x0 − x⋆||².
[Figure: iterates xk−1, xk, xk+1, xk+2 and the extrapolated point yk approaching x⋆.]

SLIDE 44

Nesterov's optimal scheme

Optimal scheme with constant step-size
initial: choose x0 ∈ R^n, φ0 ∈ ]0, 1[; let y0 = x0 and q = α/L.
repeat:
  1. Compute φk+1 ∈ ]0, 1[ from the equation
         φk+1² = (1 − φk+1)φk² + qφk+1.
     Let ak = φk(1 − φk)/(φk² + φk+1) and
         yk = xk + ak(xk − xk−1).
  2. Update xk+1 by
         xk+1 = yk − (1/L)∇F(yk).
until: stopping criterion is satisfied.
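
A sketch of the scheme above: φk+1 is obtained as the positive root of the quadratic equation in step 1, and the 1/L gradient step follows step 2; the quadratic test problem and the choice φ0 = √(α/L) are assumptions for illustration.

    import numpy as np

    def nesterov(grad_F, x0, L, alpha=0.0, phi0=0.5, n_iter=500):
        """Nesterov's scheme with constant step 1/L and momentum from the phi-recursion."""
        q = alpha / L
        x_prev = np.asarray(x0, dtype=float)
        x = x_prev.copy()
        phi = phi0
        for _ in range(n_iter):
            # positive root of phi_{k+1}^2 = (1 - phi_{k+1}) phi_k^2 + q phi_{k+1}
            phi_next = ((q - phi**2) + np.sqrt((phi**2 - q)**2 + 4 * phi**2)) / 2
            a = phi * (1 - phi) / (phi**2 + phi_next)
            y = x + a * (x - x_prev)
            x_prev, x = x, y - grad_F(y) / L
            phi = phi_next
        return x

    # assumed strongly convex quadratic test, alpha = 1, L = 100
    alpha, L = 1.0, 100.0
    A = np.diag(np.linspace(alpha, L, 10))
    x_hat = nesterov(lambda x: A @ x, np.ones(10), L, alpha=alpha, phi0=np.sqrt(alpha / L))
    print(np.linalg.norm(x_hat))       # close to 0

With φ0 = √(α/L), the recursion keeps φk ≡ √(α/L) fixed and ak ≡ (√L − √α)/(√L + √α), matching the parameter choices listed on the next slide.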

SLIDE 45

Convergence rate

Convergence rate
Let φ0 ≥ √(α/L), then
    F(xk) − F(x⋆) ≤ min{ (1 − √(α/L))^k, 4L/(2√L + k√ν)² } × ( F(x0) − F(x⋆) + (ν/2)||x0 − x⋆||² ),
where ν = φ0(φ0L − α)/(1 − φ0).

Parameter choices:
  F ∈ C^1_L: φ0 = 1, q = 0,
      φk ≈ 2/(k + 1) → 0  and  ak ≈ (1 − φk)/(1 + φk) → 1.
  F ∈ S^1_{α,L}: φ0 = √(α/L), q = α/L,
      φk ≡ √(α/L)  and  ak ≡ (√L − √α)/(√L + √α).

SLIDE 46

Outline

1. Unconstrained smooth optimisation
2. Descent methods
3. Gradient of convex functions
4. Gradient descent
5. Heavy-ball method
6. Nesterov's optimal schemes
7. Dynamical system

SLIDE 47

Dynamical system of gradient descent

From gradient descent,
    (xk+1 − xk)/γ = −∇F(xk).
Letting γ be small enough, this is a discretisation of the gradient flow
    Ẋ(t) + ∇F(X(t)) = 0.
Discretisation:
  Explicit Euler method: Ẋ(t) = ( X(t + h) − X(t) )/h.
  Implicit Euler method: Ẋ(t) = ( X(t) − X(t − h) )/h.
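
A tiny sketch (with an assumed quadratic F) showing that the explicit Euler discretisation of the gradient flow with step h is exactly gradient descent with step-size γ = h:

    import numpy as np

    # gradient flow for F(x) = 1/2 x^T A x : dX/dt = -A X
    A = np.diag([1.0, 4.0, 9.0])
    h = 0.05                               # discretisation step = gradient-descent step-size

    x_euler = np.ones(3)
    x_gd = np.ones(3)
    for _ in range(100):
        x_euler = x_euler + h * (-(A @ x_euler))   # explicit Euler: X(t+h) = X(t) + h * X'(t)
        x_gd = x_gd - h * (A @ x_gd)               # gradient descent with gamma = h
    print(np.linalg.norm(x_euler - x_gd))          # identical (the two updates coincide)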

SLIDE 48

Dynamical system of inertial schemes

Given a 2nd-order dynamical system
    Ẍ(t) + λ(t)Ẋ(t) + ∇F(X(t)) = 0.
Discretisation:
  2nd-order term: Ẍ(t) = ( X(t + h) − 2X(t) + X(t − h) )/h².
  Implicit Euler method: Ẋ(t) = ( X(t) − X(t − h) )/h.
Combining them:
    X(t + h) − X(t) − (1 − hλ(t))(X(t) − X(t − h)) + h²∇F(X(t)) = 0.
Choices:
  Heavy-ball: hλ(t) ∈ ]0, 1[.
  Nesterov: λ(t) = d/t, d > 3.

SLIDE 49

Reference

  • S. Boyd and L. Vandenberghe. "Convex Optimization". Cambridge University Press, 2004.
  • B. Polyak. "Introduction to Optimization". Optimization Software, 1987.
  • Y. Nesterov. "Introductory Lectures on Convex Optimization: A Basic Course". Vol. 87. Springer Science & Business Media, 2013.
  • W. Su, S. Boyd, and E. Candès. "A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights". Advances in Neural Information Processing Systems, 2014.