Geometry of First-order Methods: Trajectory and Adaptive Acceleration

Clarice Poon, University of Bath
Joint work with: Jingwei Liang, University of Cambridge
Outline

Introduction
Trajectory of first-order methods
Adaptive acceleration via linear prediction
Relation with previous work
Numerical experiments
Conclusions
Non-smooth optimization

Constrained non-smooth optimisation:
    min_{x ∈ R^n} F(x) + Σ_{i=1}^r R_i(K_i x),
where
F: smooth, differentiable with L-Lipschitz continuous gradient;
R_i: proper and lower semi-continuous;
K_i: bounded linear mapping.

Applications: signal/image processing, inverse problems, machine learning, data science, statistics...
Challenges: non-smooth, composite, high dimension...
Hope: well structured.
First-order methods: two basic ingredients

Gradient descent [Cauchy 1847]: min_x F(x), where F is convex, smooth and differentiable with ∇F L-Lipschitz.
Explicit Euler scheme: x_{k+1} = x_k − γ_k ∇F(x_k), γ_k ∈ ]0, 2/L[.

Proximal point algorithm [Rockafellar ’76]: min_x R(x), with R proper, closed and convex.
Define the "proximity operator" by
    prox_{γR}(v) := argmin_x γR(x) + (1/2)||x − v||².
Implicit Euler scheme: x_{k+1} = prox_{γ_k R}(x_k), γ_k > 0.
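To make the two ingredients concrete, here is a minimal NumPy sketch (not from the slides) of one explicit gradient step and one proximal step, using a least-squares F and the ℓ1-norm as R, whose proximity operator is soft-thresholding; the data, step size and threshold are illustrative choices.

```python
import numpy as np

def grad_step(x, grad_F, gamma):
    """Explicit Euler / gradient descent step: x+ = x - gamma * grad F(x)."""
    return x - gamma * grad_F(x)

def prox_l1(v, gamma):
    """Proximity operator of gamma*||.||_1, i.e. soft-thresholding:
    argmin_x gamma*||x||_1 + 0.5*||x - v||^2."""
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

# Illustrative smooth term F(x) = 0.5*||A x - b||^2, so grad F(x) = A^T (A x - b).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 50)), rng.standard_normal(20)
grad_F = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2           # Lipschitz constant of grad F
x = np.zeros(50)
x = grad_step(x, grad_F, 1.0 / L)       # explicit Euler step, gamma in ]0, 2/L[
x = prox_l1(x, 0.1)                     # implicit Euler / proximal step on 0.1*||.||_1
```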
First-order methods

F + R: Forward–Backward splitting [Lions & Mercier ’79].
R_1 + R_2: Douglas–Rachford splitting [Douglas & Rachford ’56; Lions & Mercier ’79]; ADMM [Glowinski & Marrocco ’75; Gabay & Mercier ’76]...
F + R(K·): Primal–Dual splitting methods [Arrow, Hurwicz & Uzawa ’58; Esser, Zhang & Chan ’10; Chambolle & Pock ’11].
F + Σ_i R_i: Generalized Forward–Backward splitting [Raguet, Fadili & Peyré ’13].
· · ·

Origins in numerical PDEs going back to the 1950s; now ubiquitous in signal/image processing, inverse problems, data science, statistics, machine learning...

Fixed-point formulation: let F : H → H be non-expansive with fix(F) := {z ∈ H | z = F(z)} ≠ ∅, and iterate z_{k+1} = F(z_k).
Acceleration: inertial/momentum acceleration

Inertial technique [Polyak ’64, Nesterov ’83, Beck & Teboulle ’09]:
    x̄_k = x_k + a_k(x_k − x_{k−1}),   x_{k+1} = F(x̄_k).

[Figure: iterates x_{k−1}, x_k, x̄_k, x_{k+1}, x_{k+2} and the limit x⋆.]
Acceleration: successive over-relaxation

Successive over-relaxation [Richardson 1911, Young ’50]:
    x_{k+1} = (1 − λ_k)x_k + λ_k F(x_k).
With a_k = λ_k − 1, this can be rewritten as
    x̄_k = x_k + a_k(x_k − x̄_{k−1}),   x_{k+1} = F(x̄_k).

[Figure: iterates x_{k−1}, x_k, F(x_{k−1}), x_{k+1}, x_{k+2} and the limit x⋆.]
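Both acceleration mechanisms act on a generic fixed-point map F; the following minimal sketch (my own, with a placeholder contraction as F and arbitrary parameter values) spells out the two update rules side by side.

```python
import numpy as np

def inertial_step(F, x_curr, x_prev, a):
    """Inertial/momentum update: x_bar = x_k + a*(x_k - x_{k-1}); x_{k+1} = F(x_bar)."""
    x_bar = x_curr + a * (x_curr - x_prev)
    return F(x_bar)

def over_relaxed_step(F, x_curr, lam):
    """Successive over-relaxation: x_{k+1} = (1 - lam)*x_k + lam*F(x_k)."""
    return (1.0 - lam) * x_curr + lam * F(x_curr)

# Placeholder non-expansive map: F(x) = M x with ||M|| < 1 (a scaled rotation).
M = np.array([[0.5, -0.4], [0.4, 0.5]])
F = lambda x: M @ x
x_prev, x_curr = np.array([1.0, 1.0]), np.array([0.9, 1.1])
x_next_inertial = inertial_step(F, x_curr, x_prev, a=0.3)
x_next_sor = over_relaxed_step(F, x_curr, lam=1.5)
```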
Example: Forward–Backward splitting

Problem: min_x F(x) + R(x). Let γ > 0 and F_FB := prox_{γR} ∘ (Id − γ∇F).

Forward–Backward splitting, γ ∈ ]0, 2/L[:
    x_{k+1} = F_FB(x_k).
Sequence rate o(1/√k), objective rate o(1/k).

Nesterov/FISTA, γ ∈ ]0, 1/L], d > 2:
    x̄_k = x_k + (k − 1)/(k + d) (x_k − x_{k−1}),   x_{k+1} = F_FB(x̄_k).
Sequence rate o(1/k), objective rate o(1/k²).
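A minimal sketch of the two schemes on a Lasso-type instance with F(x) = ½||Ax − b||² and R = λ||·||_1 (the data, λ and the iteration count are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def fb_fista(A, b, lam, n_iter=500, accel=True, d=3):
    """Forward-Backward splitting, optionally with Nesterov/FISTA extrapolation."""
    prox_l1 = lambda v, g: np.sign(v) * np.maximum(np.abs(v) - g, 0.0)
    L = np.linalg.norm(A, 2) ** 2             # Lipschitz constant of grad F
    gamma = 1.0 / L                           # step size in ]0, 1/L]
    x = x_prev = np.zeros(A.shape[1])
    for k in range(1, n_iter + 1):
        x_bar = x + (k - 1) / (k + d) * (x - x_prev) if accel else x
        x_prev = x
        # F_FB = prox_{gamma*R} applied to (Id - gamma*grad F)
        x = prox_l1(x_bar - gamma * (A.T @ (A @ x_bar - b)), gamma * lam)
    return x

rng = np.random.default_rng(0)
A, b = rng.standard_normal((100, 300)), rng.standard_normal(100)
x_fb = fb_fista(A, b, lam=0.1, accel=False)    # plain Forward-Backward
x_fista = fb_fista(A, b, lam=0.1, accel=True)  # FISTA
```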
Example: Douglas–Rachford splitting

Problem: min_x J(x) + R(x). Let γ > 0 and
    F_DR := (1/2)(Id + (2prox_{γR} − Id)(2prox_{γJ} − Id)).

Douglas–Rachford splitting:
    z_{k+1} = F_DR(z_k).
Sequence rate o(1/√k); no objective rate available.

Inertial Douglas–Rachford:
    z̄_k = z_k + a_k(z_k − z_{k−1}),   z_{k+1} = F_DR(z̄_k).
No rates available; may fail to provide acceleration.

NB: the performance of inertial schemes is problem and parameter dependent!
Feasibility problem in R²: let T_1, T_2 ⊂ R² be two subspaces such that T_1 ∩ T_2 ≠ ∅; find x ∈ R² such that x ∈ T_1 ∩ T_2. Define
    F_DR := (1/2)(Id + (2P_{T_1} − Id)(2P_{T_2} − Id)).

Inertial Douglas–Rachford:
    z̄_k = z_k + a(z_k − z_{k−1}) + b(z_{k−1} − z_{k−2}),   z_{k+1} = F_DR(z̄_k),
with 1-step inertial: a = 0.3; 2-step inertial: a = 0.6, b = −0.3.

[Figure: error versus iteration (up to 1000) for normal DR, 1-step inertial DR and 2-step inertial DR.]

NB: 1-step inertial will always worsen the rate here!
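The feasibility experiment is easy to reproduce; the sketch below runs DR and its 1-/2-step inertial variants for two lines through the origin in R² (the angle between the lines and the starting point are illustrative assumptions):

```python
import numpy as np

def proj_line(theta):
    """Orthogonal projector onto the line spanned by (cos(theta), sin(theta))."""
    u = np.array([np.cos(theta), np.sin(theta)])
    return lambda z: u * (u @ z)

P1, P2 = proj_line(0.0), proj_line(np.pi / 3)       # T_1, T_2: two lines through 0

def F_dr(z):
    """Douglas-Rachford operator 0.5*(Id + (2*P_T1 - Id)(2*P_T2 - Id))."""
    r2 = 2 * P2(z) - z          # reflection through T_2
    r1 = 2 * P1(r2) - r2        # reflection through T_1
    return 0.5 * (z + r1)

def run(a=0.0, b=0.0, n_iter=1000):
    """DR with optional 1-step (a) and 2-step (a, b) inertial terms."""
    z_km2 = z_km1 = z_k = np.array([1.0, 2.0])
    errs = []
    for _ in range(n_iter):
        z_bar = z_k + a * (z_k - z_km1) + b * (z_km1 - z_km2)
        z_km2, z_km1, z_k = z_km1, z_k, F_dr(z_bar)
        errs.append(np.linalg.norm(z_k))    # the unique fixed point here is 0
    return errs

err_dr = run()                    # normal DR
err_1i = run(a=0.3)               # 1-step inertial
err_2i = run(a=0.6, b=-0.3)       # 2-step inertial
```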
Problems

Nesterov/FISTA achieves the worst-case optimal convergence rate.
The performance of inertial schemes in general is unclear: no rate improvements.
Generalising the inertial technique to first-order methods, or to fixed-point iterations in general, is achievable:
- guaranteed sequence convergence [Alvarez & Attouch ’01], but NO acceleration guarantees unless stronger assumptions are imposed, e.g. strong convexity or Lipschitz smoothness;
- for a given method, e.g. Douglas–Rachford, the outcome of inertial/SOR is problem and parameter dependent.

A general acceleration framework with acceleration guarantees is missing!
Trajectory of first-order methods
Why the difference?

Trivial answer: because they are different problems and different methods...
But what happens when one problem can be solved by different methods?

[Figures: trajectory of {x_k} for gradient descent; trajectory of {z_k} for normal DR, 1-step inertial DR and 2-step inertial DR.]

Non-smooth optimisation (structure) ⇒ first-order method (geometry) ⇒ trajectory of the sequence.

However, first-order methods are non-linear in general...
Partial smoothness

Partly smooth function [Lewis ’03]: R is partly smooth at x relative to a set M_x containing x if ∂R(x) ≠ ∅ and
Smoothness: M_x is a C²-manifold and R restricted to M_x is C² near x;
Sharpness: the tangent space T_{M_x}(x) is T_x := par(∂R(x))^⊥;
Continuity: ∂R : R^n ⇒ R^n is continuous along M_x near x.
Here par(C) denotes the subspace parallel to C, for C ⊂ R^n a non-empty convex set.

[Figure: a partly smooth function f(x) and its manifold M.]

Examples: ℓ_1, ℓ_{1,2}, ℓ_∞-norm; nuclear norm; total variation.
Trajectory of first-order methods

Framework for analysing the local trajectory of first-order methods (local convergence previously studied in [Liang et al. ’16]):
First-order method (non-linear)
⇓ Convergence & non-degeneracy: finite identification of M
⇓ Local linearisation along M: matrix M (linear)
⇓ Spectral properties of M
⇓ Local trajectory
Local linearisation of the first-order method:
    z_{k+1} − z_k = M(z_k − z_{k−1}) + o(||z_k − z_{k−1}||).

FB: M is similar to a symmetric matrix with real eigenvalues in ]−1, 1]: straight-line trajectory.
DR: if both functions are locally polyhedral, M is a normal matrix with complex eigenvalues of the form cos(θ)e^{±iθ}: logarithmic spiral trajectory.
PD: if both functions are locally polyhedral, M is, up to an orthogonal transform, a block-diagonal matrix composed of circular and elliptical rotations: elliptical spiral trajectory.

NB: for DR/ADMM, if one term is locally C²-smooth, a straight-line trajectory can be obtained under proper parameters.
[Figure: Forward–Backward, Douglas–Rachford/ADMM and Primal–Dual; plots of 1 − cos(θ_k), cos(ψ) − cos(θ_k) and cos(θ_k) along the iterations.]
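The angle diagnostics shown in the figure can be computed directly from the iterates; a small helper of my own for measuring the angle between consecutive difference vectors v_k = z_k − z_{k−1}:

```python
import numpy as np

def trajectory_angles(zs):
    """Given iterates z_0, ..., z_K (rows), return cos(theta_k) between consecutive
    differences v_k = z_k - z_{k-1} and v_{k+1}."""
    zs = np.asarray(zs, dtype=float)
    v = np.diff(zs, axis=0)                                 # v_1, ..., v_K
    num = np.sum(v[:-1] * v[1:], axis=1)
    den = np.linalg.norm(v[:-1], axis=1) * np.linalg.norm(v[1:], axis=1)
    return num / np.maximum(den, 1e-30)

# cos(theta_k) -> 1 indicates an eventual straight-line trajectory (e.g. FB);
# cos(theta_k) -> cos(theta) with theta != 0 indicates a spiral (e.g. DR, polyhedral case).
```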
Failure of inertial

Consider the Lasso for a random Gaussian matrix A ∈ R^{m×n} with m < n:
    min_{x ∈ R^n} ||x||_1 + (1/2)||Ax − f||²_2.

Solving using DR with γ = 0.9/||A||²:
[Figure: convergence and trajectory. The eventual trajectory is a straight line when γ < (||A||²)^{−1}.]

Solving using DR with γ = 10/||A||²:
[Figure: convergence and trajectory. The linearisation matrix may have a complex leading eigenvalue if γ ≥ (||A||²)^{−1}.]
Adaptive acceleration via linear prediction
Linear prediction: illustration

Idea: given the past points {z_{k−j}}_{j=0}^{q+1}, how do we predict z_{k+1}?

Define v_{k−j} := z_{k−j} − z_{k−j−1} for j = 0, ..., q.
Fit the past directions v_{k−1}, ..., v_{k−q} to the latest direction v_k:
    c_k := argmin_{c ∈ R^q} || Σ_{j=1}^q c_j v_{k−j} − v_k ||.
Let z̄_{k,1} := z_k + Σ_{j=1}^q c_{k,j} v_{k−j+1}.
Repeat on {z_{k−j}}_{j=0}^{q} ∪ {z̄_{k,1}}, and so on.

The s-step extrapolation is z̄_{k,s} = z_k + E_{s,q,k}, where E_{s,q,k} = Σ_{j=1}^q ĉ_j v_{k−j+1} and
    ĉ := Σ_{j=1}^s H(c_k)^j_{(:,1)},
with H(c_k) the q × q companion-type matrix whose first row is c_k^T and whose remaining rows are [Id_{q−1}, 0_{q−1,1}].
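A minimal sketch of this construction, in my own notation: fit the coefficients by least squares, form the companion-type matrix H(c), and assemble the s-step extrapolation; the toy spiral sequence at the end is purely illustrative.

```python
import numpy as np

def lp_extrapolate(Z, q, s):
    """Linear-prediction extrapolation from the last q+2 iterates.
    Z has shape (n, K) with columns z_0, ..., z_{K-1}; returns z_bar_{k,s}."""
    V = np.diff(Z[:, -(q + 2):], axis=1)          # columns v_{k-q}, ..., v_k
    V_past, v_k = V[:, :-1], V[:, -1]
    # c = argmin_c || sum_j c_j v_{k-j} - v_k ||  (columns ordered v_{k-1}, ..., v_{k-q})
    c, *_ = np.linalg.lstsq(V_past[:, ::-1], v_k, rcond=None)
    H = np.zeros((q, q))                          # companion-type matrix H(c)
    H[0, :] = c
    H[1:, :-1] = np.eye(q - 1)
    S = sum(np.linalg.matrix_power(H, i) for i in range(1, s + 1))
    c_hat = S[:, 0]                               # first column of sum_{i=1}^s H^i
    return Z[:, -1] + V[:, ::-1][:, :q] @ c_hat   # z_k + sum_j c_hat_j v_{k-j+1}

# Illustrative use: a linearly convergent spiral sequence z_{k+1} = M z_k.
theta, rho = 0.3, 0.95
M = rho * np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
Z = np.zeros((2, 30)); Z[:, 0] = [1.0, 1.0]
for k in range(29):
    Z[:, k + 1] = M @ Z[:, k]
z_bar = lp_extrapolate(Z, q=2, s=20)   # should land much closer to the fixed point 0 than Z[:, -1]
```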
Adaptive acceleration for FoM (A2FoM)

Given a first-order method z_{k+1} = F(z_k).

A2FoM via linear prediction. Let s ≥ 1 and q ≥ 1 be integers, let z_0 ∈ R^n with z̄_0 = z_0, and set D = 0 ∈ R^{n×(q+1)}. For k ≥ 1:
    z_k = F(z̄_{k−1}),  v_k = z_k − z_{k−1},  D = [v_k, D(:, 1:q)].
    If mod(k, q+2) = 0: compute c and H_c;
        if ρ(H_c) < 1: z̄_k = z_k + V_k (Σ_{i=1}^s H_c^i)_{(:,1)};
        else: z̄_k = z_k.
    If mod(k, q+2) ≠ 0: z̄_k = z_k.
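A self-contained sketch of the loop above for a generic fixed-point map F; the coefficient fit and the companion matrix follow the construction of the previous section, while q, s and the iteration count are illustrative defaults.

```python
import numpy as np

def a2fom(F, z0, q=2, s=20, n_iter=300):
    """Sketch of A2FoM: iterate z_k = F(z_bar_{k-1}) and, every (q+2) iterations,
    extrapolate by linear prediction when the companion matrix H_c is stable."""
    z_prev = np.asarray(z0, dtype=float).copy()
    z_bar = z_prev.copy()
    D = []                                          # most recent differences, newest first
    for k in range(1, n_iter + 1):
        z = F(z_bar)
        D = [z - z_prev] + D[:q]                    # keep at most q+1 difference vectors
        z_prev = z
        z_bar = z                                   # default: no extrapolation
        if k % (q + 2) == 0 and len(D) == q + 1:
            V = np.column_stack(D)                  # columns v_k, v_{k-1}, ..., v_{k-q}
            c, *_ = np.linalg.lstsq(V[:, 1:], V[:, 0], rcond=None)
            H = np.zeros((q, q))
            H[0, :] = c
            H[1:, :-1] = np.eye(q - 1)
            if np.max(np.abs(np.linalg.eigvals(H))) < 1:    # rho(H_c) < 1
                S = sum(np.linalg.matrix_power(H, i) for i in range(1, s + 1))
                z_bar = z + V[:, :q] @ S[:, 0]      # z_bar_k = z_k + V_k (sum_i H_c^i)_{(:,1)}
    return z

# Usage: pass any fixed-point map, e.g. a DR or FB operator: a2fom(F_dr, z0).
```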
Simplification. If ρ(H_c) < 1, the Neumann series converges:
    Σ_{i=0}^{+∞} H_c^i = (Id − H_c)^{−1}.
For the finite sum:
    Σ_{i=1}^{s} H_c^i = (H_c − H_c^{s+1})(Id − H_c)^{−1}.
When s = +∞:
    Σ_{i=1}^{+∞} H_c^i = H_c(Id − H_c)^{−1} = (Id − H_c)^{−1} − Id.
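With these identities the matrix powers need not be summed term by term; a short sketch, assuming ρ(H_c) < 1 so that Id − H_c is invertible:

```python
import numpy as np

def extrapolation_weights(H, s=None):
    """Return the first column of sum_{i=1}^s H^i via the Neumann-series identity;
    s=None corresponds to s = +infinity. Assumes rho(H) < 1."""
    q = H.shape[0]
    inv = np.linalg.inv(np.eye(q) - H)              # (Id - H)^{-1}
    if s is None:
        S = inv - np.eye(q)                         # sum_{i>=1} H^i = (Id - H)^{-1} - Id
    else:
        S = (H - np.linalg.matrix_power(H, s + 1)) @ inv
    return S[:, 0]
```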
Remarks:
- We extrapolate every (q + 2) iterations.
- The linear prediction is only applied when ρ(H_c) < 1.
- Extra memory cost: n × (q + 1) (the matrix of difference vectors); usually q ≤ 10.
- Extra computation cost: q²n, from forming V_{k−1}^+.
- Global convergence can be obtained by treating the extrapolation as a perturbation error [Alvarez & Attouch ’01], i.e. z_{k+1} = F(z_k + ε_k). Weighted LP: z̄_k = z_k + a_k V_k (Σ_{i=1}^s H_c^i)_{(:,1)}, with a_k updated online.
Example: Douglas–Rachford, continued

[Figure: spiral trajectory of normal DR together with the linear-prediction extrapolations for s = 4, s = 25 and s = +∞.]
Relation with previous work
Convergence acceleration

Given a sequence {z_k}_{k∈N} which converges to z⋆, can we generate another sequence {z̄_k}_{k∈N} such that ||z̄_k − z⋆|| = o(||z_k − z⋆||)?

This is called convergence acceleration and is well established in numerical analysis:
1927: Aitken’s Δ² process.
1965: Anderson acceleration.
1970s: vector extrapolation techniques such as minimal polynomial extrapolation (MPE) and reduced rank extrapolation (RRE) [Sidi ’17].
Now: regularized non-linear acceleration (RNA), a regularised version of RRE introduced by [Scieur, d’Aspremont, Bach ’16].
Vector extrapolation techniques

Polynomial extrapolation [Cabay & Jackson ’76]. Consider z_{k+1} = Mz_k + d with ρ(M) < 1, so that z_k → z⋆:
    z_k − z⋆ = M(z_{k−1} − z⋆) = M^k(z_0 − z⋆).
If P(λ) = Σ_{j=0}^{q} c_j λ^j is the minimal polynomial of M w.r.t. z_0 − z⋆, that is,
    P(M)(z_0 − z⋆) = Σ_{j=0}^{q} c_j M^j (z_0 − z⋆) = 0,
then
    z⋆ = (Σ_{j=0}^{q} c_j z_j) / (Σ_j c_j).
The coefficients c can be computed without knowledge of z⋆: V_q c_{(0:q−1)} = −v_{q+1} and c_q = 1, where V_q = [v_1 | v_2 | · · · | v_q] and v_j = z_j − z_{j−1}.
Reference: Vector Extrapolation Methods with Applications (SIAM, 2017) by Avram Sidi.

Given a sequence generated by z_k = F(z_{k−1}).

Minimal polynomial extrapolation (MPE). Let z_0 = z̄:
S.1 Generate the points {z_j}_{j=0}^{q+1} and let v_j = z_j − z_{j−1}.
S.2 Let c ∈ R^{q+1} be such that c_q = 1 and V_q c_{(0:q−1)} = −v_{q+1}, where V_q = [v_1 | · · · | v_q]. For j ∈ [0, q], set c̃_j := c_j / (Σ_{i=0}^{q} c_i).
S.3 z̄ := Σ_{j=0}^{q} c̃_j z_j.

Reduced rank extrapolation (RRE) [Anderson ’65; Kaniel & Stein ’74; Eddy ’79; Mešina ’77]: replace step S.2 by c̃ ∈ argmin_c ||V_{q+1} c|| subject to 1^T c = 1.

LP is equivalent to MPE with S.3 replaced by z̄ := Σ_{j=0}^{q} c̃_j z_{j+1}.
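For comparison, a compact sketch of MPE and RRE on a window of iterates, in my own rendering of steps S.1–S.3 (the tiny ridge in the RRE solve is only for numerical stability and is not part of the method):

```python
import numpy as np

def mpe(Z):
    """Minimal polynomial extrapolation from columns z_0, ..., z_{q+1} of Z."""
    V = np.diff(Z, axis=1)                          # v_1, ..., v_{q+1}
    q = V.shape[1] - 1
    c, *_ = np.linalg.lstsq(V[:, :q], -V[:, q], rcond=None)   # V_q c_(0:q-1) = -v_{q+1}
    c = np.append(c, 1.0)                           # c_q = 1
    c_tilde = c / c.sum()
    return Z[:, :q + 1] @ c_tilde                   # z_bar = sum_j c_tilde_j z_j

def rre(Z):
    """Reduced rank extrapolation: minimise ||V_{q+1} c|| subject to 1^T c = 1."""
    V = np.diff(Z, axis=1)
    G = V.T @ V
    c = np.linalg.solve(G + 1e-12 * np.eye(G.shape[0]), np.ones(G.shape[0]))
    c /= c.sum()
    return Z[:, :V.shape[1]] @ c
```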
Regularised non-linear acceleration (RNA)

[Scieur, d’Aspremont, Bach ’16] proposed a regularised version of RRE for the case
    z_{k+1} − z⋆ = A(z_k − z⋆) + O(||z_k − z⋆||²),
where A is symmetric with 0 ⪯ A ⪯ σ Id, σ < 1.

To deal with the possible ill-conditioning of V_q, regularise with λ > 0:
    c̃ ∈ Argmin_c  c^T(V_q^T V_q + λ Id)c   subject to 1^T c = 1.
In practice, a grid search over the objective is used to find the best λ ∈ [λ_min, λ_max].

The angle between z_k − z_{k−1} and z_{k+1} − z_k converges to zero; intuitively, this is the regime where standard inertial schemes work well...
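For a fixed λ the RNA coefficients have a closed form; the sketch below solves the regularised problem on a small grid of λ values. The grid and the selection rule (keeping the candidate with the smallest ||Vc||) are illustrative simplifications of the grid search mentioned above, not the authors' exact procedure.

```python
import numpy as np

def rna(Z, lambdas=(1e-10, 1e-8, 1e-6, 1e-4)):
    """Regularised non-linear acceleration: for each lambda solve
    min_c c^T (V^T V + lambda*Id) c  s.t.  1^T c = 1, then keep the best candidate."""
    V = np.diff(Z, axis=1)
    G = V.T @ V
    G = G / np.linalg.norm(G, 2)                    # normalise so lambda is scale-free
    best, best_res = None, np.inf
    for lam in lambdas:
        c = np.linalg.solve(G + lam * np.eye(G.shape[0]), np.ones(G.shape[0]))
        c /= c.sum()
        res = np.linalg.norm(V @ c)                 # cheap proxy for the quality of the candidate
        if res < best_res:
            best, best_res = Z[:, :V.shape[1]] @ c, res
    return best
```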
Acceleration guarantees

We have local acceleration guarantees thanks to results on MPE and RRE [Sidi ’98]:

When z_{k+1} − z_k = M(z_k − z_{k−1}),
    ||z̄_{k,s} − z⋆|| ≤ ||z_{k+s} − z⋆|| + B ε_k,
where ε_k = ||V_{k−1}c − v_k|| and B := Σ_{ℓ=1}^{s} ||M^ℓ|| |Σ_{i=0}^{s−ℓ} (H_c^i)_{(1,1)}|.

Asymptotic bound (k → ∞): ε_k = O(|λ_{q+1}|^k), where λ_{q+1} is the (q+1)-th largest eigenvalue. Without extrapolation we only have O(|λ_1|^k).

Non-asymptotic bound: if Σ(M) ⊂ [α, β] with −1 < α < β < 1, then
    B ε_k ≤ K β^{k−q} ((√η − 1)/(√η + 1))^q,   where η = (1 − α)/(1 − β).

For PD and DR with polyhedral functions: guaranteed acceleration with q = 2.
Our contributions

We tackle the non-smoothness of the methods using partial smoothness and give insight into why vector extrapolation techniques work.
Our acceleration is derived via the sequence trajectory; there is only a minor difference in the final form.
Understanding the eventual sequence trajectory can aid acceleration design.

[Figure: Least-squares and Lasso (enforcing a straight-line trajectory): Forward–Backward, FISTA, restarting FISTA, LP, MPE, RRE and RNA.]
[Figure: Basis pursuit with DR (spiral trajectory): DR, 1-iDR, 2-iDR, DR+LP, MPE and RRE, extrapolating with 3 and with 4 points.]
Numerical experiments
Experiment: two non-smooth terms

Basis pursuit type problem with Ω := {x ∈ R^n : Kx = f}:
    min_{x,y ∈ R^n} R(x) + ι_Ω(y)   such that x − y = 0.

[Figure: angle {θ_k} and error over the iterations for R the ℓ_1-norm, ℓ_{1,2}-norm and nuclear norm.]

For the ℓ_1-norm, inertial ADMM is slower than ADMM, as the eventual trajectory is a spiral.
Experiment: LASSO

The LASSO problem:
    min_{x,y ∈ R^n} R(x) + (1/2)||Ky − f||²   such that x − y = 0.

[Figure: convergence on the covtype, ijcnn1 and phishing datasets.]

On covtype, inertial ADMM does accelerate, but A3DMM is significantly faster.
Experiment: total variation based image inpainting

Let Ω := {x ∈ R^{n×n} : P_D(x) = f}, where P_D randomly sets 50% of the pixels to zero, and consider
    min_{x ∈ R^{n×n}, y} ||y||_1 + ι_Ω(x)   such that ∇x − y = 0.

[Figure: angle {θ_k}, error ||x_k − x⋆|| and PSNR over the iterations.]

Both functions are polyhedral, so the trajectory is a spiral; inertial ADMM is slower than ADMM.
[Figure: original and corrupted images, and reconstructions: ADMM, PSNR = 26.5448; inertial ADMM, PSNR = 26.1096; A3DMM with s = 100, PSNR = 27.0402; A3DMM with s = +∞, PSNR = 27.0402.]
Experiment: partial Fourier reconstruction

We consider reconstruction from partial Fourier measurements (along radial lines), regularised with translation-invariant wavelets:
    min_{x ∈ R^{n×n}} ||Wx||_1   subject to P_Ω F x = y.

[Figure: PD, inertial PD, and PD-LP with q = 1 and q = 2; reconstructions by Primal–Dual (72.9 s) and PD-LP (72.1 s).]
Take-away messages

Trajectory of FoM:
- for the fixed-point sequence {z_k}_{k∈N}, linearisation of the FoM;
- locally, different FoM exhibit distinct trajectories: straight lines and (logarithmic or elliptical) spirals.

An adaptive acceleration for FoM:
- though motivated by the local trajectory, linear prediction works globally;
- for polyhedral functions, guaranteed local acceleration for DR/PD using 4 past points;
- for FB, the trajectory is eventually a straight line and one can guarantee local acceleration by extrapolating from 3 past points under our framework, but one could just apply restarted FISTA...

Many thanks for your attention!

References:
Poon & Liang. Trajectory of Alternating Direction Method of Multipliers and Adaptive Acceleration. NeurIPS 2019.
Liang & Poon. The Geometry of First-Order Methods and Adaptive Acceleration. arXiv:2003.03910.