Geometry of First-order Methods: Trajectory and Adaptive Acceleration



slide-1
SLIDE 1

Geometry of First-order Methods: Trajectory and Adaptive Acceleration

Clarice Poon, University of Bath

Joint work with: Jingwei Liang, University of Cambridge

slide-2
SLIDE 2

Outline

Introduction
Trajectory of first-order methods
Adaptive acceleration via linear prediction
Relation with previous work
Numerical experiments
Conclusions

1

slide-3
SLIDE 3

Outline

Introduction
Trajectory of first-order methods
Adaptive acceleration via linear prediction
Relation with previous work
Numerical experiments
Conclusions

2

slide-4
SLIDE 4

Non-smooth optimization

Constrained non-smooth optimisation:

min_{x∈R^n} F(x) + Σ_{i=1}^r R_i(K_i x),

where F is smooth with L-Lipschitz continuous gradient, each R_i is proper and lower semi-continuous, and each K_i is a bounded linear mapping.

3

slide-5
SLIDE 5

Non-smooth optimization

Constrained non-smooth optimisation:

min_{x∈R^n} F(x) + Σ_{i=1}^r R_i(K_i x),

where F is smooth with L-Lipschitz continuous gradient, each R_i is proper and lower semi-continuous, and each K_i is a bounded linear mapping.

Applications: signal/image processing, inverse problems, machine learning, data science, statistics...
Challenges: non-smooth, composite, high dimension...
Hope: well structured.

3

slide-6
SLIDE 6

First-order methods: two basic ingredients

Gradient descent [Cauchy 1847]:

min_x F(x),

where F is convex and smooth with ∇F being L-Lipschitz.

4

slide-7
SLIDE 7

First-order methods: two basic ingredients

Gradient descent [Cauchy 1847]:

min_x F(x),

where F is convex and smooth with ∇F being L-Lipschitz.

Explicit Euler scheme: x_{k+1} = x_k − γ_k∇F(x_k),  γ_k ∈ ]0, 2/L[.

4

slide-8
SLIDE 8

First-order methods: two basic ingredients

Gradient descent [Cauchy 1847]:

min_x F(x),

where F is convex and smooth with ∇F being L-Lipschitz.

Explicit Euler scheme: x_{k+1} = x_k − γ_k∇F(x_k),  γ_k ∈ ]0, 2/L[.

Proximal point algorithm [Rockafellar ’76]:

min_x R(x),

with R being proper, closed and convex.

4

slide-9
SLIDE 9

First-order methods: two basic ingredients

Gradient descent [Cauchy 1847]:

min_x F(x),

where F is convex and smooth with ∇F being L-Lipschitz.

Explicit Euler scheme: x_{k+1} = x_k − γ_k∇F(x_k),  γ_k ∈ ]0, 2/L[.

Proximal point algorithm [Rockafellar ’76]:

min_x R(x),

with R being proper, closed and convex. Define the “proximity operator” by

prox_{γR}(v) def= argmin_x γR(x) + (1/2)||x − v||².

4

slide-10
SLIDE 10

First-order methods: two basic ingredients

Gradient descent [Cauchy 1847]:

min_x F(x),

where F is convex and smooth with ∇F being L-Lipschitz.

Explicit Euler scheme: x_{k+1} = x_k − γ_k∇F(x_k),  γ_k ∈ ]0, 2/L[.

Proximal point algorithm [Rockafellar ’76]:

min_x R(x),

with R being proper, closed and convex. Define the “proximity operator” by

prox_{γR}(v) def= argmin_x γR(x) + (1/2)||x − v||².

Implicit Euler scheme: x_{k+1} = prox_{γ_k R}(x_k),  γ_k > 0.

4
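As a concrete illustration of these two ingredients, here is a minimal NumPy sketch (an illustrative assumption, not from the slides) of an explicit gradient step and of the proximity operator of the ℓ1-norm (soft-thresholding), which reappears in the Lasso examples later.

```python
import numpy as np

def gradient_step(x, grad_F, gamma):
    """Explicit Euler / gradient step: x_{k+1} = x_k - gamma * grad F(x_k)."""
    return x - gamma * grad_F(x)

def prox_l1(v, gamma):
    """Proximity operator of gamma*||.||_1, i.e. argmin_x gamma*||x||_1 + 0.5*||x - v||^2
    (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

# Tiny smooth example: F(x) = 0.5*||A x - b||^2, so grad F(x) = A^T (A x - b).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((5, 3)), rng.standard_normal(5)
grad_F = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of grad F
x = gradient_step(np.zeros(3), grad_F, 1.0 / L)
print(prox_l1(x, 0.1))
```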

slide-11
SLIDE 11

First-order methods

First-order methods:

F + R: Forward–Backward splitting [Lions & Mercier ’79]
R1 + R2: Douglas–Rachford splitting [Douglas & Rachford ’56; Lions & Mercier ’79]; ADMM [Glowinski & Marrocco ’75; Gabay & Mercier ’76]...
F + R(K·): Primal–Dual splitting methods [Arrow, Hurwicz & Uzawa ’58; Esser, Zhang & Chan ’10; Chambolle & Pock ’11]
F + Σ_i R_i: Generalized Forward–Backward splitting [Raguet, Fadili & Peyré ’13]
· · ·

Origins in numerical PDE dating back to the 1950s; now ubiquitous in signal/image processing, inverse problems, data science, statistics, machine learning...

5

slide-12
SLIDE 12

First-order methods

First-order methods:

F + R: Forward–Backward splitting [Lions & Mercier ’79]
R1 + R2: Douglas–Rachford splitting [Douglas & Rachford ’56; Lions & Mercier ’79]; ADMM [Glowinski & Marrocco ’75; Gabay & Mercier ’76]...
F + R(K·): Primal–Dual splitting methods [Arrow, Hurwicz & Uzawa ’58; Esser, Zhang & Chan ’10; Chambolle & Pock ’11]
F + Σ_i R_i: Generalized Forward–Backward splitting [Raguet, Fadili & Peyré ’13]
· · ·

Fixed-point formulation: let F : H → H be non-expansive with fix(F) def= {z ∈ H | z = F(z)} ≠ ∅,

z_{k+1} = F(z_k).

5
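All of the splitting methods above fit this template: iterate a non-expansive operator and monitor the step sizes. A minimal generic driver (an illustrative helper, not from the talk) that the operators sketched later can be plugged into:

```python
import numpy as np

def fixed_point_iterate(F, z0, max_iter=1000, tol=1e-10):
    """Generic fixed-point iteration z_{k+1} = F(z_k) for a non-expansive operator F.
    Returns the final iterate and the history of ||z_{k+1} - z_k||, the quantity whose
    trajectory is studied in this talk."""
    z = np.asarray(z0, dtype=float)
    residuals = []
    for _ in range(max_iter):
        z_new = F(z)
        residuals.append(np.linalg.norm(z_new - z))
        z = z_new
        if residuals[-1] < tol:
            break
    return z, residuals
```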

slide-13
SLIDE 13

Acceleration: inertial/momentum acceleration

Inertial technique [Polyak ’64, Nesterov ’83, Beck & Teboulle ’09]:

x̄_k = x_k + a_k(x_k − x_{k−1}),  x_{k+1} = F(x̄_k).

[Figure: illustration of the inertial step, showing x_{k−1}, x_k, x̄_k, x_{k+1}, x_{k+2} and the limit x⋆.]

6

slide-14
SLIDE 14

Acceleration: successive over-relaxation

Successive over-relaxation [Richardson 1911, Young ’50]:

x_{k+1} = (1 − λ_k)x_k + λ_k F(x_k),

which, setting a_k = λ_k − 1, can be rewritten as

x̄_k = x_k + a_k(x_k − x̄_{k−1}),  x_{k+1} = F(x̄_k).

[Figure: illustration of over-relaxation, showing x_{k−1}, x_k, F(x_{k−1}), x_{k+1}, x_{k+2} and the limit x⋆.]

7

slide-15
SLIDE 15

Example: Forward–Backward splitting

Problem: min_x F(x) + R(x).

8

slide-16
SLIDE 16

Example: Forward–Backward splitting

Problem: min_x F(x) + R(x).

Let γ > 0 and define F_FB def= prox_{γR}(Id − γ∇F).

8

slide-17
SLIDE 17

Example: Forward–Backward splitting

Problem: min_x F(x) + R(x). Let γ > 0 and define F_FB def= prox_{γR}(Id − γ∇F).

Forward–Backward splitting: γ ∈ ]0, 2/L[,

x_{k+1} = F_FB(x_k).

Sequence o(1/√k), objective o(1/k).

8

slide-18
SLIDE 18

Example: Forward–Backward splitting

Problem: min_x F(x) + R(x). Let γ > 0 and define F_FB def= prox_{γR}(Id − γ∇F).

Nesterov/FISTA: γ ∈ ]0, 1/L], d > 2,

x̄_k = x_k + ((k−1)/(k+d))(x_k − x_{k−1}),  x_{k+1} = F_FB(x̄_k).

Sequence o(1/k), objective o(1/k²).

8

slide-19
SLIDE 19

Example: Forward–Backward splitting

Problem: min_x F(x) + R(x). Let γ > 0 and define F_FB def= prox_{γR}(Id − γ∇F).

Nesterov/FISTA: γ ∈ ]0, 1/L], d > 2,

x̄_k = x_k + ((k−1)/(k+d))(x_k − x_{k−1}),  x_{k+1} = F_FB(x̄_k).

Sequence o(1/k), objective o(1/k²).

8
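A minimal sketch of Forward–Backward and its Nesterov/FISTA variant on a Lasso-type instance (F(x) = (1/2)||Ax − f||², R = ||·||_1); the soft-thresholding prox and the random data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def prox_l1(v, gamma):
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

def forward_backward(A, f, gamma, d=3.0, n_iter=500, inertia=True):
    """FB splitting on 0.5*||Ax - f||^2 + ||x||_1; with inertia=True this is the
    FISTA step x_bar = x_k + (k-1)/(k+d) * (x_k - x_{k-1})."""
    x_prev = x = np.zeros(A.shape[1])
    for k in range(1, n_iter + 1):
        x_bar = x + (k - 1) / (k + d) * (x - x_prev) if inertia else x
        grad = A.T @ (A @ x_bar - f)          # gradient of the smooth part at x_bar
        x_prev, x = x, prox_l1(x_bar - gamma * grad, gamma)
    return x

rng = np.random.default_rng(0)
A, f = rng.standard_normal((20, 50)), rng.standard_normal(20)
gamma = 1.0 / np.linalg.norm(A, 2) ** 2       # gamma in ]0, 1/L]
x_fista = forward_backward(A, f, gamma)
```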

slide-20
SLIDE 20

Example: Douglas–Rachford splitting

Problem: min_x J(x) + R(x).

9

slide-21
SLIDE 21

Example: Douglas–Rachford splitting

Problem: min_x J(x) + R(x). Let γ > 0 and define

F_DR def= (1/2)(Id + (2prox_{γR} − Id)(2prox_{γJ} − Id)).

9

slide-22
SLIDE 22

Example: Douglas–Rachford splitting

Problem: min_x J(x) + R(x). Let γ > 0 and define

F_DR def= (1/2)(Id + (2prox_{γR} − Id)(2prox_{γJ} − Id)).

Douglas–Rachford splitting:

z_{k+1} = F_DR(z_k).

Sequence o(1/√k), objective NA.

9

slide-23
SLIDE 23

Example: Douglas–Rachford splitting

Problem: min_x J(x) + R(x). Let γ > 0 and define

F_DR def= (1/2)(Id + (2prox_{γR} − Id)(2prox_{γJ} − Id)).

Inertial Douglas–Rachford:

z̄_k = z_k + a_k(z_k − z_{k−1}),  z_{k+1} = F_DR(z̄_k).

No rates available; may fail to provide acceleration.

9

slide-24
SLIDE 24

Example: Douglas–Rachford splitting

Problem: min_x J(x) + R(x). Let γ > 0 and define

F_DR def= (1/2)(Id + (2prox_{γR} − Id)(2prox_{γJ} − Id)).

Inertial Douglas–Rachford:

z̄_k = z_k + a_k(z_k − z_{k−1}),  z_{k+1} = F_DR(z̄_k).

No rates available; may fail to provide acceleration.

NB: performance of inertial is problem- and parameter-dependent!

9
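A minimal sketch of one Douglas–Rachford step in this reflected form, instantiated on a basis-pursuit-style pair J = ι_{Ax=f}, R = ||·||_1; the projection, data and step size are illustrative assumptions.

```python
import numpy as np

def prox_l1(v, gamma):
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

def make_dr_operator(prox_J, prox_R):
    """F_DR = 0.5*(Id + (2 prox_R - Id)(2 prox_J - Id)), returned as a callable."""
    def F_DR(z):
        y = 2.0 * prox_J(z) - z               # reflection through prox_J
        return 0.5 * (z + 2.0 * prox_R(y) - y)
    return F_DR

# J = indicator of {x : A x = f} (its prox is the projection), R = ||.||_1.
rng = np.random.default_rng(0)
A, f = rng.standard_normal((10, 30)), rng.standard_normal(10)
A_pinv = np.linalg.pinv(A)
proj_affine = lambda z: z - A_pinv @ (A @ z - f)
gamma = 0.5
F_DR = make_dr_operator(proj_affine, lambda z: prox_l1(z, gamma))

z = np.zeros(30)
for _ in range(500):
    z = F_DR(z)
x = proj_affine(z)        # the primal point is read off through prox_J of the DR variable
```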

slide-25
SLIDE 25

Example: Douglas–Rachford splitting

Feasibility problem in R²: let T1, T2 ⊂ R² be two subspaces such that T1 ∩ T2 ≠ ∅. Find x ∈ R² such that x ∈ T1 ∩ T2.

Define F_DR def= (1/2)(Id + (2P_{T1} − Id)(2P_{T2} − Id)).

10

slide-26
SLIDE 26

Example: Douglas–Rachford splitting

Feasibility problem in R²: let T1, T2 ⊂ R² be two subspaces such that T1 ∩ T2 ≠ ∅. Find x ∈ R² such that x ∈ T1 ∩ T2.

Inertial Douglas–Rachford:

z̄_k = z_k + a(z_k − z_{k−1}) + b(z_{k−1} − z_{k−2}),  z_{k+1} = F_DR(z̄_k).

1-step inertial: a = 0.3. 2-step inertial: a = 0.6, b = −0.3.

[Figure: ||z_k − z⋆|| over 1000 iterations; curve: normal DR.]

10

slide-27
SLIDE 27

Example: Douglas–Rachford splitting

Feasibility problem in R²: let T1, T2 ⊂ R² be two subspaces such that T1 ∩ T2 ≠ ∅. Find x ∈ R² such that x ∈ T1 ∩ T2.

Inertial Douglas–Rachford:

z̄_k = z_k + a(z_k − z_{k−1}) + b(z_{k−1} − z_{k−2}),  z_{k+1} = F_DR(z̄_k).

1-step inertial: a = 0.3. 2-step inertial: a = 0.6, b = −0.3.

[Figure: ||z_k − z⋆|| over 1000 iterations; curves: normal DR, 1-step inertial DR.]

10

slide-28
SLIDE 28

Example: Douglas–Rachford splitting

Feasibility problem in R²: let T1, T2 ⊂ R² be two subspaces such that T1 ∩ T2 ≠ ∅. Find x ∈ R² such that x ∈ T1 ∩ T2.

Inertial Douglas–Rachford:

z̄_k = z_k + a(z_k − z_{k−1}) + b(z_{k−1} − z_{k−2}),  z_{k+1} = F_DR(z̄_k).

1-step inertial: a = 0.3. 2-step inertial: a = 0.6, b = −0.3.

[Figure: ||z_k − z⋆|| over 1000 iterations; curves: normal DR, 1-step inertial DR, 2-step inertial DR.]

10

slide-29
SLIDE 29

Example: Douglas–Rachford splitting

Feasibility problem in R²: let T1, T2 ⊂ R² be two subspaces such that T1 ∩ T2 ≠ ∅. Find x ∈ R² such that x ∈ T1 ∩ T2.

Inertial Douglas–Rachford:

z̄_k = z_k + a(z_k − z_{k−1}) + b(z_{k−1} − z_{k−2}),  z_{k+1} = F_DR(z̄_k).

1-step inertial: a = 0.3. 2-step inertial: a = 0.6, b = −0.3.

[Figure: ||z_k − z⋆|| over 1000 iterations; curves: normal DR, 1-step inertial DR, 2-step inertial DR.]

NB: 1-step inertial will always worsen the rate!

10
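A small reproduction sketch of this two-subspace experiment; the lines through the origin and their angle are assumptions chosen for illustration, while the inertial parameters a, b are those of the slide.

```python
import numpy as np

def proj_line(theta):
    """Orthogonal projector onto the line spanned by (cos theta, sin theta)."""
    d = np.array([np.cos(theta), np.sin(theta)])
    return np.outer(d, d)

P1, P2 = proj_line(0.0), proj_line(0.5)       # two subspaces of R^2 (angle 0.5 rad, assumed)
I2 = np.eye(2)
F_DR = lambda z: 0.5 * (z + (2*P1 - I2) @ ((2*P2 - I2) @ z))

def run(a=0.0, b=0.0, n_iter=1000):
    """DR with optional 1-/2-step inertia; here T1 and T2 meet only at 0, so the error is ||z_k||."""
    z2 = z1 = z = np.array([1.0, 2.0])
    errs = []
    for _ in range(n_iter):
        z_bar = z + a * (z - z1) + b * (z1 - z2)
        z2, z1, z = z1, z, F_DR(z_bar)
        errs.append(np.linalg.norm(z))
    return errs

plain    = run()                   # normal DR
one_step = run(a=0.3)              # 1-step inertial
two_step = run(a=0.6, b=-0.3)      # 2-step inertial
```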

slide-30
SLIDE 30

Problems

Nesterov/FISTA achieve the worst-case optimal convergence rate.

11

slide-31
SLIDE 31

Problems

Nesterov/FISTA achieve the worst-case optimal convergence rate. The performance of inertial schemes in general is unclear; no rate improvements are guaranteed.

11

slide-32
SLIDE 32

Problems

Nesterov/FISTA achieve the worst-case optimal convergence rate. The performance of inertial schemes in general is unclear; no rate improvements are guaranteed. Generalisation of the inertial technique to first-order methods, or to fixed-point iterations in general, is achievable:

11

slide-33
SLIDE 33

Problems

Nesterov/FISTA achieve the worst-case optimal convergence rate. The performance of inertial schemes in general is unclear; no rate improvements are guaranteed. Generalisation of the inertial technique to first-order methods, or to fixed-point iterations in general, is achievable:

Guaranteed sequence convergence [Alvarez & Attouch ’01]. NO acceleration guarantees, unless stronger assumptions are imposed, e.g. strong convexity or Lipschitz smoothness.

11

slide-34
SLIDE 34

Problems

Nesterov/FISTA achieve the worst-case optimal convergence rate. The performance of inertial schemes in general is unclear; no rate improvements are guaranteed. Generalisation of the inertial technique to first-order methods, or to fixed-point iterations in general, is achievable:

Guaranteed sequence convergence [Alvarez & Attouch ’01]. NO acceleration guarantees, unless stronger assumptions are imposed, e.g. strong convexity or Lipschitz smoothness.

For a given method, e.g. Douglas–Rachford, the outcome of inertial/SOR is problem- and parameter-dependent.

11

slide-35
SLIDE 35

Problems

Nesterov/FISTA achieve the worst-case optimal convergence rate. The performance of inertial schemes in general is unclear; no rate improvements are guaranteed. Generalisation of the inertial technique to first-order methods, or to fixed-point iterations in general, is achievable:

Guaranteed sequence convergence [Alvarez & Attouch ’01]. NO acceleration guarantees, unless stronger assumptions are imposed, e.g. strong convexity or Lipschitz smoothness.

For a given method, e.g. Douglas–Rachford, the outcome of inertial/SOR is problem- and parameter-dependent.

A general acceleration framework with acceleration guarantees is missing!

11

slide-36
SLIDE 36

Outline

Introduction
Trajectory of first-order methods
Adaptive acceleration via linear prediction
Relation with previous work
Numerical experiments
Conclusions

12

slide-37
SLIDE 37

Why the difference?

Trivial answer: because they are different problems and different methods...

13

slide-38
SLIDE 38

Why the difference?

Trivial answer: because they are different problems and different methods... What happens when a problem can be solved by different methods?

13

slide-39
SLIDE 39

Why the difference?

Trivial answer: because they are different problems and different methods...

Gradient descent

Trajectory of {xk}k∈N.

13

slide-40
SLIDE 40

Why the difference?

Trivial answer: because they are different problems and different methods...

normal DR

Trajectory of {zk}k∈N.

13

slide-41
SLIDE 41

Why the difference?

Trivial answer: because they are different problems and different methods...

1-step inertial DR normal DR

Trajectory of {zk}k∈N.

13

slide-42
SLIDE 42

Why the difference?

Trivial answer: because they are different problems and different methods...

1-step inertial DR normal DR 2-step inertial DR

Trajectory of {zk}k∈N.

13

slide-43
SLIDE 43

Why the difference?

Trivial answer: because they are different problems and different methods...

Non-smooth opt. (structure) ⇒ FoM (geometry) ⇒ Trajectory of sequence

13

slide-44
SLIDE 44

Why the difference?

Trivial answer: because they are different problems and different methods...

Non-smooth opt. (structure) ⇒ FoM (geometry) ⇒ Trajectory of sequence

13

slide-45
SLIDE 45

Why the difference?

Trivial answer: because they are different problems and different methods...

Non-smooth opt. (structure) ⇒ FoM (geometry) ⇒ Trajectory of sequence

13

slide-46
SLIDE 46

Why the difference?

Trivial answer: because they are different problems and different methods...

Non-smooth opt. (structure) ⇒ FoM (geometry) ⇒ Trajectory of sequence

13

slide-47
SLIDE 47

Why the difference?

Trivial answer: because they are different problems and different methods...

Non-smooth opt. (structure) ⇒ FoM (geometry) ⇒ Trajectory of sequence

13

slide-48
SLIDE 48

Why the difference?

Trivial answer: because they are different problems and different methods... However, FoM are non-linear in general...

13

slide-49
SLIDE 49

Partial smoothness

Partly smooth function [Lewis ’03]: R is partly smooth at x relative to a set M_x containing x if ∂R(x) ≠ ∅ and

Smoothness: M_x is a C²-manifold and R|_{M_x} is C² near x.
Sharpness: the tangent space T_{M_x}(x) is T_x def= par(∂R(x))^⊥.
Continuity: ∂R : R^n ⇒ R^n is continuous along M_x near x.

Here par(C) is the subspace parallel to C, where C ⊂ R^n is a non-empty convex set.

[Figure: a partly smooth function f(x) and its manifold M in coordinates (x1, x2).]

Examples: ℓ1-, ℓ1,2-, ℓ∞-norms, nuclear norm, total variation.

14

slide-50
SLIDE 50

Trajectory of first-order methods

Framework for analysing the local trajectory of FoM (a):

First-order method (non-linear)
⇓
Convergence & non-degeneracy: finite identification of M
⇓
Local linearisation along M: matrix M (linear)
⇓
Spectral properties of M
⇓
Local trajectory

(a) Local convergence previously studied in [Liang et al ’16].

15

slide-51
SLIDE 51

Trajectory of first-order methods

Local linearisation of FoM: z_{k+1} − z_k = M(z_k − z_{k−1}) + o(||z_k − z_{k−1}||).

15

slide-52
SLIDE 52

Trajectory of first-order methods

Local linearisation of FoM: z_{k+1} − z_k = M(z_k − z_{k−1}) + o(||z_k − z_{k−1}||).

FB: M is similar to a symmetric matrix with real eigenvalues in ]−1, 1]: straight-line trajectory.

15

slide-53
SLIDE 53

Trajectory of first-order methods

Local linearisation of FoM: z_{k+1} − z_k = M(z_k − z_{k−1}) + o(||z_k − z_{k−1}||).

FB: M is similar to a symmetric matrix with real eigenvalues in ]−1, 1]: straight-line trajectory.

DR: if both functions are locally polyhedral, M is a normal matrix with complex eigenvalues of the form cos(θ)e^{±iθ}: logarithmic spiral trajectory.

15

slide-54
SLIDE 54

Trajectory of first-order methods

Local linearisation of FoM: z_{k+1} − z_k = M(z_k − z_{k−1}) + o(||z_k − z_{k−1}||).

FB: M is similar to a symmetric matrix with real eigenvalues in ]−1, 1]: straight-line trajectory.

DR: if both functions are locally polyhedral, M is a normal matrix with complex eigenvalues of the form cos(θ)e^{±iθ}: logarithmic spiral trajectory.

PD: if both functions are locally polyhedral, M is, up to an orthogonal transform, a block diagonal matrix composed of circular and elliptical rotations: elliptical spiral trajectory.

15

slide-55
SLIDE 55

Trajectory of first-order methods

Local linearisation of FoM: z_{k+1} − z_k = M(z_k − z_{k−1}) + o(||z_k − z_{k−1}||).

FB: M is similar to a symmetric matrix with real eigenvalues in ]−1, 1]: straight-line trajectory.

DR: if both functions are locally polyhedral, M is a normal matrix with complex eigenvalues of the form cos(θ)e^{±iθ}: logarithmic spiral trajectory.

PD: if both functions are locally polyhedral, M is, up to an orthogonal transform, a block diagonal matrix composed of circular and elliptical rotations: elliptical spiral trajectory.

NB: for DR/ADMM, if one term is locally C²-smooth, a straight-line trajectory can be obtained under proper parameters.

15
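One way to see these trajectories numerically (an illustrative diagnostic, not from the paper): track the angle θ_k between successive differences v_k = z_k − z_{k−1}. For FB, cos θ_k tends to 1 (straight line); for DR/PD on polyhedral problems it settles at the cosine of the spiral's rotation angle.

```python
import numpy as np

def trajectory_cosines(zs):
    """Given iterates zs = [z_0, z_1, ...], return cos(theta_k) between successive
    differences v_k = z_k - z_{k-1}: values tending to 1 indicate a straight-line
    trajectory, a constant value below 1 indicates a spiral."""
    vs = [zs[i] - zs[i - 1] for i in range(1, len(zs))]
    return [float(np.dot(vs[i], vs[i - 1])
                  / (np.linalg.norm(vs[i]) * np.linalg.norm(vs[i - 1]) + 1e-30))
            for i in range(1, len(vs))]
```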

slide-56
SLIDE 56

Trajectory of first-order methods

[Figure: trajectory diagnostics for Forward–Backward, Douglas–Rachford/ADMM and Primal–Dual; plotted quantities: 1 − cos(θ_k), cos(ψ) − cos(θ_k) and cos(θ_k).]

15

slide-57
SLIDE 57

Failure of inertial

Consider the Lasso for a random Gaussian matrix A ∈ R^{m×n} with m < n:

min_{x∈R^n} ||x||_1 + (1/2)||Ax − f||_2².

Solving using DR with γ = 0.9/||A||².

[Figure: convergence and eventual trajectory of the DR iterates.]

Eventual trajectory: straight line when γ < (||A||²)^{−1}.

16

slide-58
SLIDE 58

Failure of inertial

Consider the Lasso for a random Gaussian matrix A ∈ R^{m×n} with m < n:

min_{x∈R^n} ||x||_1 + (1/2)||Ax − f||_2².

Solving using DR with γ = 10/||A||².

[Figure: convergence and eventual trajectory of the DR iterates.]

Eventual trajectory: the linearisation matrix may have a complex leading eigenvalue if γ ≥ (||A||²)^{−1}.

16

slide-59
SLIDE 59

Outline

Introduction
Trajectory of first-order methods
Adaptive acceleration via linear prediction
Relation with previous work
Numerical experiments
Conclusions

17

slide-60
SLIDE 60

Linear prediction: illustration

Idea: given past points {z_{k−j}}_{j=0}^{q+1}, how to predict z_{k+1}?

Define v_{k−j} def= z_{k−j} − z_{k−j−1} for j = 0, . . . , q.

18

slide-61
SLIDE 61

Linear prediction: illustration

Idea: given past points {z_{k−j}}_{j=0}^{q+1}, how to predict z_{k+1}?

Define v_{k−j} def= z_{k−j} − z_{k−j−1} for j = 0, . . . , q.

Fit the past directions v_{k−1}, . . . , v_{k−q} to the latest direction v_k:

c_k def= argmin_{c∈R^q} ||Σ_{j=1}^q c_j v_{k−j} − v_k||.

18

slide-62
SLIDE 62

Linear prediction: illustration

Idea: given past points {z_{k−j}}_{j=0}^{q+1}, how to predict z_{k+1}?

Define v_{k−j} def= z_{k−j} − z_{k−j−1} for j = 0, . . . , q.

Fit the past directions v_{k−1}, . . . , v_{k−q} to the latest direction v_k:

c_k def= argmin_{c∈R^q} ||Σ_{j=1}^q c_j v_{k−j} − v_k||.

Let z̄_{k,1} def= z_k + Σ_{j=1}^q c_{k,j} v_{k−j+1}.

18

slide-63
SLIDE 63

Linear prediction: illustration

Idea: given past points {z_{k−j}}_{j=0}^{q+1}, how to predict z_{k+1}?

Define v_{k−j} def= z_{k−j} − z_{k−j−1} for j = 0, . . . , q.

Fit the past directions v_{k−1}, . . . , v_{k−q} to the latest direction v_k:

c_k def= argmin_{c∈R^q} ||Σ_{j=1}^q c_j v_{k−j} − v_k||.

Let z̄_{k,1} def= z_k + Σ_{j=1}^q c_{k,j} v_{k−j+1}.

Repeat on {z_{k−j}}_{j=0}^q ∪ {z̄_{k,1}} and so on.

18

slide-64
SLIDE 64

Linear prediction: illustration

Idea: given past points {z_{k−j}}_{j=0}^{q+1}, how to predict z_{k+1}?

Define v_{k−j} def= z_{k−j} − z_{k−j−1} for j = 0, . . . , q.

Fit the past directions v_{k−1}, . . . , v_{k−q} to the latest direction v_k:

c_k def= argmin_{c∈R^q} ||Σ_{j=1}^q c_j v_{k−j} − v_k||.

Let z̄_{k,1} def= z_k + Σ_{j=1}^q c_{k,j} v_{k−j+1}.

Repeat on {z_{k−j}}_{j=0}^q ∪ {z̄_{k,1}} and so on.

18

slide-65
SLIDE 65

Linear prediction: illustration

Idea: given past points {z_{k−j}}_{j=0}^{q+1}, how to predict z_{k+1}?

Define v_{k−j} def= z_{k−j} − z_{k−j−1} for j = 0, . . . , q.

Fit the past directions v_{k−1}, . . . , v_{k−q} to the latest direction v_k:

c_k def= argmin_{c∈R^q} ||Σ_{j=1}^q c_j v_{k−j} − v_k||.

Let z̄_{k,1} def= z_k + Σ_{j=1}^q c_{k,j} v_{k−j+1}.

Repeat on {z_{k−j}}_{j=0}^q ∪ {z̄_{k,1}} and so on.

18

slide-66
SLIDE 66

Linear prediction: illustration

Idea: given past points {z_{k−j}}_{j=0}^{q+1}, how to predict z_{k+1}?

Define v_{k−j} def= z_{k−j} − z_{k−j−1} for j = 0, . . . , q.

Fit the past directions v_{k−1}, . . . , v_{k−q} to the latest direction v_k:

c_k def= argmin_{c∈R^q} ||Σ_{j=1}^q c_j v_{k−j} − v_k||.

Let z̄_{k,1} def= z_k + Σ_{j=1}^q c_{k,j} v_{k−j+1}.

Repeat on {z_{k−j}}_{j=0}^q ∪ {z̄_{k,1}} and so on.

18

slide-67
SLIDE 67

Linear prediction: illustration

Idea: given past points {z_{k−j}}_{j=0}^{q+1}, how to predict z_{k+1}?

Define v_{k−j} def= z_{k−j} − z_{k−j−1} for j = 0, . . . , q.

Fit the past directions v_{k−1}, . . . , v_{k−q} to the latest direction v_k:

c_k def= argmin_{c∈R^q} ||Σ_{j=1}^q c_j v_{k−j} − v_k||.

Let z̄_{k,1} def= z_k + Σ_{j=1}^q c_{k,j} v_{k−j+1}.

Repeat on {z_{k−j}}_{j=0}^q ∪ {z̄_{k,1}} and so on.

The s-step extrapolation is z̄_{k,s} = z_k + E_{s,q,k}, where E_{s,q,k} = Σ_{j=1}^q ĉ_j v_{k−j+1} and ĉ def= (Σ_{j=1}^s H(c_k)^j)_{(:,1)}, with

H(c_k) def= [ c_k  [Id_{q−1}; 0_{1,q−1}] ],

i.e. the q×q matrix whose first column is c_k and whose remaining columns are Id_{q−1} stacked on a zero row.

18

slide-68
SLIDE 68

Adaptive acceleration for FoM (A2FoM)

Given a first-order method z_{k+1} = F(z_k).

A2FoM via linear prediction: let s ≥ 1, q ≥ 1 be integers, let z_0 ∈ R^n and z̄_0 = z_0, and set D = 0 ∈ R^{n×(q+1)}. For k ≥ 1:
  z_k = F(z̄_{k−1}), v_k = z_k − z_{k−1}, D = [v_k, D(:, 1:q)].
  If mod(k, q+2) = 0: compute c and H_c;
    if ρ(H_c) < 1: z̄_k = z_k + V_k (Σ_{i=1}^s H_c^i)_{(:,1)};
    else: z̄_k = z_k.
  If mod(k, q+2) ≠ 0: z̄_k = z_k.

19
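A compact NumPy sketch of this scheme, written directly from the pseudo-code above; the operator F is any fixed-point map (for instance the F_DR sketched earlier), and details such as the least-squares solve are illustrative choices.

```python
import numpy as np

def a2fom(F, z0, q=4, s=20, n_iter=500):
    """Adaptive acceleration via linear prediction around z_{k+1} = F(z_k).
    Every q+2 iterations, fit v_k ~ sum_j c_j v_{k-j}, build the companion matrix H_c
    and extrapolate z_bar = z_k + V_k (sum_{i=1}^s H_c^i)[:, 0] when rho(H_c) < 1."""
    z_bar = z_prev = np.asarray(z0, dtype=float)
    V = np.zeros((z_prev.size, q + 1))          # most recent differences, newest first
    for k in range(1, n_iter + 1):
        z = F(z_bar)
        v = z - z_prev
        V = np.column_stack([v, V[:, :q]])      # D = [v_k, D(:, 1:q)]
        z_bar = z                               # default: no extrapolation this iteration
        if k % (q + 2) == 0:
            # fit the past directions to the latest one: V[:, 1:q+1] c ~ V[:, 0]
            c, *_ = np.linalg.lstsq(V[:, 1:q + 1], V[:, 0], rcond=None)
            H = np.zeros((q, q))                # companion matrix H(c)
            H[:, 0] = c
            H[:q - 1, 1:] = np.eye(q - 1)
            if np.max(np.abs(np.linalg.eigvals(H))) < 1:       # rho(H_c) < 1
                S = sum(np.linalg.matrix_power(H, i) for i in range(1, s + 1))
                z_bar = z + V[:, :q] @ S[:, 0]
        z_prev = z
    return z_prev
```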

slide-69
SLIDE 69

Adaptive acceleration for FoM (A2FoM)

Simplification: if ρ(H_c) < 1, the Neumann series is convergent:

Σ_{i=0}^{+∞} H_c^i = (Id − H_c)^{−1}.

19

slide-70
SLIDE 70

Adaptive acceleration for FoM (A2FoM)

Simplification: if ρ(H_c) < 1, the Neumann series is convergent:

Σ_{i=0}^{+∞} H_c^i = (Id − H_c)^{−1}.

For the summation Σ_{i=1}^s H_c^i,

Σ_{i=1}^s H_c^i = (H_c − H_c^{s+1})(Id − H_c)^{−1}.

19

slide-71
SLIDE 71

Adaptive acceleration for FoM (A2FoM)

Simplification: if ρ(H_c) < 1, the Neumann series is convergent:

Σ_{i=0}^{+∞} H_c^i = (Id − H_c)^{−1}.

For the summation Σ_{i=1}^s H_c^i,

Σ_{i=1}^s H_c^i = (H_c − H_c^{s+1})(Id − H_c)^{−1}.

When s = +∞, we have Σ_{i=1}^{+∞} H_c^i = H_c(Id − H_c)^{−1} = (Id − H_c)^{−1} − Id.

19
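A quick numerical check of these closed forms on a small random contraction (an illustrative test, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 4))
H *= 0.9 / np.max(np.abs(np.linalg.eigvals(H)))        # rescale so that rho(H) < 1

s = 7
lhs = sum(np.linalg.matrix_power(H, i) for i in range(1, s + 1))
rhs = (H - np.linalg.matrix_power(H, s + 1)) @ np.linalg.inv(np.eye(4) - H)
assert np.allclose(lhs, rhs)                            # finite geometric sum
assert np.allclose(np.linalg.inv(np.eye(4) - H) - np.eye(4),
                   H @ np.linalg.inv(np.eye(4) - H))    # the s = +infinity case
```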

slide-72
SLIDE 72

Adaptive acceleration for FoM (A2FoM)

Remark: we extrapolate every (q + 2) iterations.

20

slide-73
SLIDE 73

Adaptive acceleration for FoM (A2FoM)

Remark: we extrapolate every (q + 2) iterations. Only apply the linear prediction when ρ(H_c) < 1.

20

slide-74
SLIDE 74

Adaptive acceleration for FoM (A2FoM)

Remark: we extrapolate every (q + 2) iterations. Only apply the linear prediction when ρ(H_c) < 1. Extra memory cost: n × (q + 1) (the difference vector matrix); usually q ≤ 10.

20

slide-75
SLIDE 75

Adaptive acceleration for FoM (A2FoM)

Remark: we extrapolate every (q + 2) iterations. Only apply the linear prediction when ρ(H_c) < 1. Extra memory cost: n × (q + 1) (the difference vector matrix); usually q ≤ 10. Extra computation cost: q²n, from V_{k−1}^+.

20

slide-76
SLIDE 76

Adaptive acceleration for FoM (A2FoM)

Remark: we extrapolate every (q + 2) iterations. Only apply the linear prediction when ρ(H_c) < 1. Extra memory cost: n × (q + 1) (the difference vector matrix); usually q ≤ 10. Extra computation cost: q²n, from V_{k−1}^+.

Global convergence can be obtained by treating extrapolation as a perturbation error [Alvarez & Attouch ’01], i.e. z_{k+1} = F(z_k + ε_k). Weighted LP:

z̄_k = z_k + a_k V_k (Σ_{i=1}^s H_c^i)_{(:,1)},

with a_k updated online.

20

slide-77
SLIDE 77

Example: Douglas–Rachford continue

[Figure: trajectory of normal DR.]

21

slide-78
SLIDE 78

Example: Douglas–Rachford continue

[Figure: trajectories of normal DR and LP with s = 4.]

21

slide-79
SLIDE 79

Example: Douglas–Rachford continue

[Figure: trajectories of normal DR, LP with s = 4 and LP with s = 25.]

21

slide-80
SLIDE 80

Example: Douglas–Rachford continue

[Figure: trajectories of normal DR, LP with s = 4, s = 25 and s = +∞.]

21

slide-81
SLIDE 81

Outline

Introduction
Trajectory of first-order methods
Adaptive acceleration via linear prediction
Relation with previous work
Numerical experiments
Conclusions

22

slide-82
SLIDE 82

Convergence acceleration

Given a sequence {z_k}_{k∈N} which converges to z⋆, can we generate another sequence {z̄_k}_{k∈N} such that ||z̄_k − z⋆|| = o(||z_k − z⋆||)?

23

slide-83
SLIDE 83

Convergence acceleration

Given a sequence {z_k}_{k∈N} which converges to z⋆, can we generate another sequence {z̄_k}_{k∈N} such that ||z̄_k − z⋆|| = o(||z_k − z⋆||)?

This is called convergence acceleration and is well-established in numerical analysis:
1927: Aitken’s ∆² process.
1965: Anderson’s acceleration.
1970s: vector extrapolation techniques such as minimal polynomial extrapolation (MPE) and reduced rank extrapolation (RRE) [Sidi ’17].
Now: regularized non-linear acceleration (RNA), a regularised version of RRE introduced by [Scieur, D’Aspremont, Bach ’16].

23

slide-84
SLIDE 84

Vector extrapolation techniques

Polynomial extrapolation [Cabay & Jackson ’76]: consider z_{k+1} = M z_k + d with ρ(M) < 1, so that z_k → z⋆:

24

slide-85
SLIDE 85

Vector extrapolation techniques

Polynomial extrapolation [Cabay & Jackson ’76]: consider z_{k+1} = M z_k + d with ρ(M) < 1, so that z_k → z⋆:

z_k − z⋆ = M(z_{k−1} − z⋆) = M^k(z_0 − z⋆),

24

slide-86
SLIDE 86

Vector extrapolation techniques

Polynomial extrapolation [Cabay & Jackson ’76]: consider z_{k+1} = M z_k + d with ρ(M) < 1, so that z_k → z⋆:

z_k − z⋆ = M(z_{k−1} − z⋆) = M^k(z_0 − z⋆).

If P(λ) = Σ_{j=0}^q c_j λ^j is the minimal polynomial of M w.r.t. z_0 − z⋆, that is

P(M)(z_0 − z⋆) = Σ_{j=0}^q c_j M^j(z_0 − z⋆) = 0,

then z⋆ = (Σ_{j=0}^q c_j z_j) / (Σ_j c_j).

The coefficients c can be computed without knowledge of z⋆: V_q c_{(0:q−1)} = −v_{q+1} and c_q = 1, where V_q = [v_1 | v_2 | · · · | v_q] and v_j = z_j − z_{j−1}.

24

slide-87
SLIDE 87

Vector extrapolation techniques

Vector extrapolation methods with applications (SIAM, 2017) by Avram Sidi.

Given a sequence generated by z_k = F(z_{k−1}).

Minimal polynomial extrapolation (MPE). Let z_0 = z̄:
S.1 Generate points {z_j}_{j=0}^{q+1} and let v_j = z_j − z_{j−1}.
S.2 Let c ∈ R^{q+1} be such that c_q = 1 and V_q c_{(0:q−1)} = −v_{q+1}, where V_q = [v_1 | · · · | v_q]. Normalise: c̃_j def= c_j/(Σ_{j=0}^q c_j).
S.3 z̄ def= Σ_{j=0}^q c̃_j z_j.

25

slide-88
SLIDE 88

Vector extrapolation techniques

Vector extrapolation methods with applications (SIAM, 2017) by Avram Sidi.

Given a sequence generated by z_k = F(z_{k−1}).

Minimal polynomial extrapolation (MPE). Let z_0 = z̄:
S.1 Generate points {z_j}_{j=0}^{q+1} and let v_j = z_j − z_{j−1}.
S.2 Let c ∈ R^{q+1} be such that c_q = 1 and V_q c_{(0:q−1)} = −v_{q+1}, where V_q = [v_1 | · · · | v_q]. Normalise: c̃_j def= c_j/(Σ_{j=0}^q c_j).
S.3 z̄ def= Σ_{j=0}^q c̃_j z_j.

Reduced rank extrapolation (RRE) [Anderson ’65; Kaniel & Stein ’74; Eddy ’79; Mešina ’77]: replace step S.2 by

c̃ ∈ argmin_c ||V_{q+1} c|| subject to 1ᵀc = 1.

25
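A minimal sketch of one MPE and one RRE extrapolation from q+2 stored iterates, following the steps above; the least-squares/normal-equation solves and the small ridge term are illustrative implementation choices.

```python
import numpy as np

def mpe(zs):
    """One MPE step from iterates zs = [z_0, ..., z_{q+1}] (list of 1-D arrays)."""
    q = len(zs) - 2
    V = np.column_stack([zs[j] - zs[j - 1] for j in range(1, q + 2)])   # [v_1 | ... | v_{q+1}]
    c = np.ones(q + 1)
    c[:q], *_ = np.linalg.lstsq(V[:, :q], -V[:, q], rcond=None)         # V_q c_(0:q-1) = -v_{q+1}
    c_tilde = c / c.sum()
    return sum(c_tilde[j] * zs[j] for j in range(q + 1))

def rre(zs):
    """One RRE step: minimise ||V_{q+1} c|| subject to 1^T c = 1 (normal equations)."""
    q = len(zs) - 2
    V = np.column_stack([zs[j] - zs[j - 1] for j in range(1, q + 2)])
    G = V.T @ V + 1e-12 * np.eye(q + 1)      # tiny ridge for numerical safety
    w = np.linalg.solve(G, np.ones(q + 1))
    c_tilde = w / w.sum()
    return sum(c_tilde[j] * zs[j] for j in range(q + 1))
```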

slide-89
SLIDE 89

Vector extrapolation techniques

Vector extrapolation methods with applications (SIAM, 2017) by Avram Sidi.

Given a sequence generated by z_k = F(z_{k−1}).

Minimal polynomial extrapolation (MPE). Let z_0 = z̄:
S.1 Generate points {z_j}_{j=0}^{q+1} and let v_j = z_j − z_{j−1}.
S.2 Let c ∈ R^{q+1} be such that c_q = 1 and V_q c_{(0:q−1)} = −v_{q+1}, where V_q = [v_1 | · · · | v_q]. Normalise: c̃_j def= c_j/(Σ_{j=0}^q c_j).
S.3 z̄ def= Σ_{j=0}^q c̃_j z_j.

Reduced rank extrapolation (RRE) [Anderson ’65; Kaniel & Stein ’74; Eddy ’79; Mešina ’77]: replace step S.2 by

c̃ ∈ argmin_c ||V_{q+1} c|| subject to 1ᵀc = 1.

LP is equivalent to MPE with S.3 replaced by z̄ def= Σ_{j=0}^q c̃_j z_{j+1}.

25

slide-90
SLIDE 90

Regularised non-linear acceleration (RNA)

[Scieur, D’Aspremont, Bach ’16] proposed a regularised version of RRE for the case of z_{k+1} − z⋆ = A(z_k − z⋆) + O(||z_k − z⋆||²), where A is symmetric with 0 ⪯ A ⪯ σId, σ < 1.

26

slide-91
SLIDE 91

Regularised non-linear acceleration (RNA)

[Scieur, D’Aspremont, Bach ’16] proposed a regularised version of RRE for the case of z_{k+1} − z⋆ = A(z_k − z⋆) + O(||z_k − z⋆||²), where A is symmetric with 0 ⪯ A ⪯ σId, σ < 1.

To deal with the possible ill-conditioning of V_q, regularise with λ > 0:

c̃ ∈ Argmin_c cᵀ(V_qᵀV_q + λId)c subject to 1ᵀc = 1.

26

slide-92
SLIDE 92

Regularised non-linear acceleration (RNA)

[Scieur, D’Aspremont, Bach ’16] proposed a regularised version of RRE for the case of z_{k+1} − z⋆ = A(z_k − z⋆) + O(||z_k − z⋆||²), where A is symmetric with 0 ⪯ A ⪯ σId, σ < 1.

To deal with the possible ill-conditioning of V_q, regularise with λ > 0:

c̃ ∈ Argmin_c cᵀ(V_qᵀV_q + λId)c subject to 1ᵀc = 1.

In practice, grid search on the objective to find the optimal λ ∈ [λ_min, λ_max].

26
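The regularised weights have a closed form through the normal equations; a small sketch, with the λ grid being an illustrative choice:

```python
import numpy as np

def rna_weights(V, lam):
    """Solve min_c c^T (V^T V + lam*Id) c  s.t.  1^T c = 1 in closed form."""
    k = V.shape[1]
    w = np.linalg.solve(V.T @ V + lam * np.eye(k), np.ones(k))
    return w / w.sum()

def rna_extrapolate(zs, lams=(1e-8, 1e-6, 1e-4, 1e-2), objective=None):
    """Try each lambda on a grid and keep the combination with the smallest objective."""
    V = np.column_stack([zs[j] - zs[j - 1] for j in range(1, len(zs))])
    candidates = [sum(c * z for c, z in zip(rna_weights(V, lam), zs[:-1])) for lam in lams]
    return candidates[-1] if objective is None else min(candidates, key=objective)
```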

slide-93
SLIDE 93

Regularised non-linear acceleration (RNA)

[Scieur, D’Aspremont, Bach ’16] proposed a regularised version of RRE for the case of z_{k+1} − z⋆ = A(z_k − z⋆) + O(||z_k − z⋆||²), where A is symmetric with 0 ⪯ A ⪯ σId, σ < 1.

To deal with the possible ill-conditioning of V_q, regularise with λ > 0:

c̃ ∈ Argmin_c cᵀ(V_qᵀV_q + λId)c subject to 1ᵀc = 1.

In practice, grid search on the objective to find the optimal λ ∈ [λ_min, λ_max].

The angle between z_k − z_{k−1} and z_{k+1} − z_k converges to zero; intuitively, this is the regime where standard inertia works well...

26

slide-94
SLIDE 94

Acceleration guarantees

We have local acceleration guarantees thanks to results on MPE and RRE [Sidi ’98]:

27

slide-95
SLIDE 95

Acceleration guarantees

We have local acceleration guarantees thanks to results on MPE and RRE [Sidi ’98]:

When z_{k+1} − z_k = M(z_k − z_{k−1}),

||z̄_{k,s} − z⋆|| ≤ ||z_{k+s} − z⋆|| + Bε_k,

where ε_k = ||V_{k−1}c − v_k|| and B def= Σ_{ℓ=1}^s ||M^ℓ|| |Σ_{i=0}^{s−ℓ}(H_c^i)_{(1,1)}|.

27

slide-96
SLIDE 96

Acceleration guarantees

We have local acceleration guarantees thanks to results on MPE and RRE [Sidi ’98]:

When z_{k+1} − z_k = M(z_k − z_{k−1}),

||z̄_{k,s} − z⋆|| ≤ ||z_{k+s} − z⋆|| + Bε_k,

where ε_k = ||V_{k−1}c − v_k|| and B def= Σ_{ℓ=1}^s ||M^ℓ|| |Σ_{i=0}^{s−ℓ}(H_c^i)_{(1,1)}|.

Asymptotic bound (k → ∞): ε_k = O(|λ_{q+1}|^k), where λ_{q+1} is the (q+1)-th largest eigenvalue. Without extrapolation, we just have O(|λ_1|^k).

27

slide-97
SLIDE 97

Acceleration guarantees

We have local acceleration guarantees thanks to results on MPE and RRE [Sidi ’98]:

When z_{k+1} − z_k = M(z_k − z_{k−1}),

||z̄_{k,s} − z⋆|| ≤ ||z_{k+s} − z⋆|| + Bε_k,

where ε_k = ||V_{k−1}c − v_k|| and B def= Σ_{ℓ=1}^s ||M^ℓ|| |Σ_{i=0}^{s−ℓ}(H_c^i)_{(1,1)}|.

Asymptotic bound (k → ∞): ε_k = O(|λ_{q+1}|^k), where λ_{q+1} is the (q+1)-th largest eigenvalue. Without extrapolation, we just have O(|λ_1|^k).

Non-asymptotic bound: if Σ(M) ⊂ [α, β] with −1 < α < β < 1, then

Bε_k ≤ K β^{k−q} ((√η − 1)/(√η + 1))^q,  where η = (1 − α)/(1 − β).

27

slide-98
SLIDE 98

Acceleration guarantees

We have local acceleration guarantees thanks to results on MPE and RRE [Sidi ’98]:

When z_{k+1} − z_k = M(z_k − z_{k−1}),

||z̄_{k,s} − z⋆|| ≤ ||z_{k+s} − z⋆|| + Bε_k,

where ε_k = ||V_{k−1}c − v_k|| and B def= Σ_{ℓ=1}^s ||M^ℓ|| |Σ_{i=0}^{s−ℓ}(H_c^i)_{(1,1)}|.

Asymptotic bound (k → ∞): ε_k = O(|λ_{q+1}|^k), where λ_{q+1} is the (q+1)-th largest eigenvalue. Without extrapolation, we just have O(|λ_1|^k).

Non-asymptotic bound: if Σ(M) ⊂ [α, β] with −1 < α < β < 1, then

Bε_k ≤ K β^{k−q} ((√η − 1)/(√η + 1))^q,  where η = (1 − α)/(1 − β).

For PD, DR with polyhedral functions, guaranteed acceleration with q = 2.

27

slide-99
SLIDE 99

Our contributions

We tackle the non-smoothness of the methods using partial smoothness and give insight as to why vector extrapolation techniques work.

28

slide-100
SLIDE 100

Our contributions

We tackle the non-smoothness of the methods using partial smoothness and give insight as to why vector extrapolation techniques work. Our acceleration is derived via the sequence trajectory; minor difference in the final form.

28

slide-101
SLIDE 101

Our contributions

We tackle the non-smoothness of the methods using partial smoothness and give insight as to why vector extrapolation techniques work. Our acceleration is derived via the sequence trajectory; minor difference in the final form. Understanding the eventual sequence trajectory can aid in acceleration design.

[Figure: least-squares, enforcing straight-line trajectory; convergence against iterations and against time for Forward–Backward, FISTA, Restarting FISTA, LP, MPE, RRE and RNA.]

28

slide-102
SLIDE 102

Our contributions

We tackle the non-smoothness of the methods using partial smoothness and give insight as to why vector extrapolation techniques work. Our acceleration is derived via the sequence trajectory; minor difference in the final form. Understanding the eventual sequence trajectory can aid in acceleration design.

[Figure: Lasso, enforcing straight-line trajectory; convergence against iterations and against time for Forward–Backward, FISTA, Restarting FISTA, LP, MPE, RRE and RNA.]

28

slide-103
SLIDE 103

Our contributions

We tackle the non-smoothness of the methods using partial smoothness and give insight as to why vector extrapolation techniques work. Our acceleration is derived via the sequence trajectory; minor difference in the final form. Understanding the eventual sequence trajectory can aid in acceleration design.

[Figure: basis pursuit with DR (spiral trajectory), extrapolating with 3 points; convergence against iterations and against time for DR, 1-iDR, 2-iDR, DR+LP, MPE and RRE.]

28

slide-104
SLIDE 104

Our contributions

We tackle the non-smoothness of the methods using partial smoothness and give insight as to why vector extrapolation techniques work. Our acceleration is derived via the sequence trajectory; minor difference in the final form. Understanding the eventual sequence trajectory can aid in acceleration design.

[Figure: basis pursuit with DR (spiral trajectory), extrapolating with 4 points; convergence against iterations and against time for DR, 1-iDR, 2-iDR, DR+LP, MPE and RRE.]

28

slide-105
SLIDE 105

Outline

Introduction
Trajectory of first-order methods
Adaptive acceleration via linear prediction
Relation with previous work
Numerical experiments
Conclusions

29

slide-106
SLIDE 106

Experiment: 2 non-smooth terms

Basis pursuit type problem with Ω def= {x ∈ R^n : Kx = f}:

min_{x,y∈R^n} R(x) + ι_Ω(y) such that x − y = 0.

[Figures: results for R = ℓ1-norm, ℓ1,2-norm and nuclear norm.]

30

slide-107
SLIDE 107

Experiment: 2 non-smooth terms

[Figure: ℓ1-norm case.]

Inertial ADMM is slower than ADMM, as the eventual trajectory is a spiral.

30

slide-108
SLIDE 108

Experiment: LASSO

The LASSO problem:

min_{x,y∈R^n} R(x) + (1/2)||Ky − f||² such that x − y = 0.

[Figures: convergence on the covtype, ijcnn1 and phishing datasets.]

31

slide-109
SLIDE 109

Experiment: LASSO

[Figure: covtype dataset.]

Inertial ADMM does accelerate, but A3DMM is significantly faster.

31

slide-110
SLIDE 110

Experiment: Total variation based image inpainting

Let Ω def= {x ∈ R^{n×n} : P_D(x) = f}, where P_D randomly sets 50% of the pixels to zero, and consider

min_{x,y} ||y||_1 + ι_Ω(x) such that ∇x − y = 0.

[Figures: angle {θ_k}, error ||x_k − x⋆|| and PSNR over the iterations.]

Both functions are polyhedral, so the trajectory is a spiral. Inertial ADMM is slower than ADMM.

32

slide-111
SLIDE 111

Experiment: Total variation based image inpainting

[Figure: original image; corrupted image; reconstructions by ADMM (PSNR = 26.5448), inertial ADMM (PSNR = 26.1096), A3DMM with s = 100 (PSNR = 27.0402) and A3DMM with s = +∞ (PSNR = 27.0402).]

32

slide-112
SLIDE 112

Experiment: Partial Fourier reconstruction

We consider reconstruction from partial Fourier measurements (along radial lines), regularised with translation-invariant wavelets:

min_{x∈R^{n×n}} ||Wx||_1 subject to P_Ω F x = y.

[Figures: PSNR over iterations for PD, iPD, PD-LP with q = 2 and PD-LP with q = 1; reconstructions by Primal-Dual (72.9s) and PD-LP (72.1s).]

33

slide-113
SLIDE 113

Take-away messages

Trajectory of FoM: for the fixed-point sequence {z_k}_{k∈N}, linearisation of the FoM. Locally, different FoM demonstrate distinct trajectories: straight line, (logarithmic and elliptic) spirals.

34

slide-114
SLIDE 114

Take-away messages

Trajectory of FoM: for the fixed-point sequence {z_k}_{k∈N}, linearisation of the FoM. Locally, different FoM demonstrate distinct trajectories: straight line, (logarithmic and elliptic) spirals.

An adaptive acceleration for FoM: though motivated by the local trajectory, linear prediction works globally. For polyhedral functions, guaranteed local acceleration for DR/PD using 4 past points. For FB, the trajectory is eventually a straight line and one can guarantee local acceleration by extrapolating from 3 past points under our framework, but one could just apply restarted FISTA...

34

slide-115
SLIDE 115

Take-away messages

Trajectory of FoM: for the fixed-point sequence {z_k}_{k∈N}, linearisation of the FoM. Locally, different FoM demonstrate distinct trajectories: straight line, (logarithmic and elliptic) spirals.

An adaptive acceleration for FoM: though motivated by the local trajectory, linear prediction works globally. For polyhedral functions, guaranteed local acceleration for DR/PD using 4 past points. For FB, the trajectory is eventually a straight line and one can guarantee local acceleration by extrapolating from 3 past points under our framework, but one could just apply restarted FISTA...

Many thanks for your attention!

  • P. & Liang. Trajectory of Alternating Direction Method of Multipliers and Adaptive Acceleration. NeurIPS ’19.
  • Liang & P. The Geometry of First-Order Methods and Adaptive Acceleration. arXiv:2003.03910.

34