Boosting Frank-Wolfe by Chasing Gradients

Cyrille W. Combettes, with Sebastian Pokutta
School of Industrial and Systems Engineering
Georgia Institute of Technology, Atlanta, GA, USA

37th International Conference on Machine Learning, July 2020
Outline
1. Introduction
2. The Frank-Wolfe algorithm
3. Boosting Frank-Wolfe
4. Computational experiments
Introduction

Let H be a Euclidean space (e.g., R^n or R^{m×n}) and consider

    min f(x)   s.t.   x ∈ C

where
- f : H → R is a smooth convex function
- C ⊂ H is a compact convex set, C = conv(V)

Example
- Sparse logistic regression:

    min_{x ∈ R^n}  (1/m) Σ_{i=1}^{m} ln(1 + exp(−yi ai^⊤ x))   s.t.   ‖x‖₁ ≤ τ

- Low-rank matrix completion:

    min_{X ∈ R^{m×n}}  (1/(2|I|)) Σ_{(i,j)∈I} (Y_{i,j} − X_{i,j})²   s.t.   ‖X‖_nuc ≤ τ
Introduction

- A natural approach is to use any efficient method and add projections back onto C to ensure feasibility (xt ↦ xt − γt∇f(xt) ↦ project onto C to obtain xt+1)
- However, in many situations projections onto C are very expensive
- This is an issue with the method of projections, not necessarily with the geometry of C: linear minimizations over C can still be relatively cheap

    Feasible region C              Linear minimization    Projection
    ℓ1/ℓ2/ℓ∞-ball                  O(n)                   O(n)
    ℓp-ball, p ∈ ]1, ∞[ \ {2}      O(n)                   N/A
    Nuclear norm-ball              O(nnz)                 O(mn min{m, n})
    Flow polytope                  O(n)                   O(n^3.5)
    Birkhoff polytope              O(n³)                  N/A
    Matroid polytope               O(n ln(n))             O(poly(n))

    N/A: no closed form exists and the solution must be computed via nontrivial optimization

- Can we avoid projections?
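To make the cost comparison concrete, here is a minimal sketch (my own illustration, not from the slides) of the linear minimization oracle over the ℓ1-ball of radius τ: the minimizer of ⟨g, v⟩ over {v : ‖v‖₁ ≤ τ} puts all its mass on a single coordinate, so it only needs one O(n) scan, whereas a projection requires more work.

```python
import numpy as np

def lmo_l1_ball(g, tau):
    """argmin_{||v||_1 <= tau} <g, v>: place -tau*sign(g_i) on a coordinate i
    maximizing |g_i| (a single O(n) scan, no projection needed)."""
    i = int(np.argmax(np.abs(g)))
    v = np.zeros_like(g, dtype=float)
    v[i] = -tau * np.sign(g[i])
    return v
```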
The Frank-Wolfe algorithm

The Frank-Wolfe algorithm (Frank & Wolfe, 1956), a.k.a. the conditional gradient algorithm (Levitin & Polyak, 1966):

Algorithm Frank-Wolfe (FW)
Input: x0 ∈ C, γt ∈ [0, 1].
1: for t = 0 to T − 1 do
2:     vt ← argmin_{v∈V} ⟨∇f(xt), v⟩
3:     xt+1 ← xt + γt(vt − xt)

- xt+1 is obtained as a convex combination of xt ∈ C and vt ∈ C, thus xt+1 ∈ C
- FW uses linear minimizations (the “FW oracle”) instead of projections
- FW = pick a vertex (using gradient information) and move in that direction
- Successfully applied to: traffic assignment, computer vision, optimal transport, adversarial learning, etc.
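For reference, a minimal sketch of the FW loop with the default step size γt = 2/(t+2); `lmo` is a user-supplied linear minimization oracle and the function names are illustrative, not the authors' code.

```python
import numpy as np

def frank_wolfe(x0, grad_f, lmo, T=1000):
    """Vanilla Frank-Wolfe with the default step size gamma_t = 2/(t+2)."""
    x = np.array(x0, dtype=float)
    for t in range(T):
        v = lmo(grad_f(x))            # vt in argmin_{v in V} <grad f(xt), v>
        gamma = 2.0 / (t + 2)
        x = x + gamma * (v - x)       # convex combination, so x stays in C
    return x
```

For instance, `frank_wolfe(x0, grad_f, lambda g: lmo_l1_ball(g, tau))` would run FW over the ℓ1-ball using the oracle sketched earlier.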
The Frank-Wolfe algorithm

Theorem (Levitin & Polyak, 1966; Jaggi, 2013). Let C ⊂ H be a compact convex set with diameter D and f : H → R be an L-smooth convex function, and let x0 ∈ argmin_{v∈V} ⟨∇f(y), v⟩ for some y ∈ C. If γt = 2/(t + 2) (default) or γt = min{⟨∇f(xt), xt − vt⟩ / (L‖xt − vt‖²), 1} (“short step”), then

    f(xt) − min_C f ≤ 4LD² / (t + 2)

- The convergence rate cannot be improved (Canon & Cullum, 1968; Jaggi, 2013; Lan, 2013)
- Why?
The Frank-Wolfe algorithm

Consider the simple problem

    min (1/2)‖x‖₂²   s.t.   x ∈ conv{(1, 0), (−1, 0), (0, 1)},

whose solution is x* = (0, 0).

- Let x0 = (0, 1)
- FW tries to reach x* by moving towards vertices (x1, x2, x3, x4, ...)
- This yields an inefficient zig-zagging trajectory
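A small numeric illustration (my own, under the assumption that the toy instance is the one reconstructed above): running vanilla FW with γt = 2/(t+2) on this triangle shows the iterates alternating between pulls toward (1, 0) and (−1, 0).

```python
import numpy as np

V = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])   # triangle vertices
x = np.array([0.0, 1.0])                               # x0

for t in range(8):
    grad = x                                           # gradient of (1/2)||x||_2^2
    v = V[np.argmin(V @ grad)]                         # FW oracle over the vertices
    x = x + 2.0 / (t + 2) * (v - x)
    print(t, x)
# After the first step the first coordinate flips sign at every iteration while
# shrinking slowly: the zig-zag toward x* = (0, 0).
```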
Improved Frank-Wolfe variants

- Away-Step Frank-Wolfe (AFW) (Wolfe, 1970; Lacoste-Julien & Jaggi, 2015): enhances FW by also allowing moves away from vertices; on the toy problem it reaches x* = x5 after five steps
- Decomposition-Invariant Pairwise Conditional Gradient (DICG) (Garber & Meshi, 2016): memory-free variant of AFW
- Blended Conditional Gradients (BCG) (Braun et al., 2019): blends FCFW and FW
Boosting Frank-Wolfe

- Can we speed up FW in a simple way?
- Rule of thumb in optimization: follow the steepest direction

Idea:
- Speed up FW by moving in a direction better aligned with −∇f(xt)
- Build this direction by using V to maintain the projection-free property
Boosting Frank-Wolfe

- How can we build a direction better aligned with −∇f(xt) that still allows updating xt+1 without a projection?

- v0 ∈ argmax_{v∈V} ⟨−∇f(xt), v⟩
  λ0u0 = (⟨−∇f(xt), v0 − xt⟩ / ‖v0 − xt‖²) (v0 − xt),   r1 = −∇f(xt) − λ0u0
- v1 ∈ argmax_{v∈V} ⟨r1, v⟩
  λ1u1 = (⟨r1, v1 − xt⟩ / ‖v1 − xt‖²) (v1 − xt),   r2 = r1 − λ1u1
- We could continue: v2 ∈ argmax_{v∈V} ⟨r2, v⟩, etc.
- d = λ0u0 + λ1u1
- gt = d / (λ0 + λ1)

- The boosted direction gt is better aligned with −∇f(xt) than is the FW direction v0 − xt, and it satisfies [xt, xt + gt] ⊆ C, so we can update xt+1 = xt + γt gt for any γt ∈ [0, 1]
Boosting Frank-Wolfe

Why is [xt, xt + gt] ⊆ C? Let Kt be the number of alignment rounds. We have

    d = Σ_{k=0}^{Kt−1} λk (vk − xt)   where λk > 0 and vk ∈ V,

so if Λt = Σ_{k=0}^{Kt−1} λk, then

    gt = (1/Λt) Σ_{k=0}^{Kt−1} λk (vk − xt) = (1/Λt) Σ_{k=0}^{Kt−1} λk vk − xt,

where the first term is a convex combination of points of V and hence lies in C. Thus xt + gt ∈ C, so [xt, xt + gt] ⊆ C by convexity.
Boosting Frank-Wolfe

Algorithm: Finding a direction g well aligned with ∇ from a reference point z
Input: z ∈ C, ∇ ∈ H, K ∈ N\{0}, δ ∈ ]0, 1[.
1:  d0 ← 0, Λ ← 0
2:  for k = 0 to K − 1 do
3:      rk ← ∇ − dk                                      ⊲ k-th residual
4:      vk ← argmax_{v∈V} ⟨rk, v⟩                         ⊲ FW oracle
5:      uk ← argmax_{u ∈ {vk − z, −dk/‖dk‖}} ⟨rk, u⟩
6:      λk ← ⟨rk, uk⟩ / ‖uk‖²
7:      d′k ← dk + λk uk
8:      if align(∇, d′k) − align(∇, dk) ≥ δ then
9:          dk+1 ← d′k
10:         Λ ← Λ + λk if uk = vk − z,   Λ(1 − λk/‖dk‖) if uk = −dk/‖dk‖
11:     else
12:         break                                         ⊲ exit k-loop
13: g ← dk/Λ                                              ⊲ normalization

- The candidate direction −dk/‖dk‖ in line 5 is a technicality to ensure convergence of the procedure (Locatello et al., 2017)
- The stopping criterion is an alignment improvement condition (typically δ = 10⁻³ and K = +∞)
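Below is a minimal sketch of this alignment procedure, assuming a user-supplied oracle `lmo(c)` returning argmin_{v∈V} ⟨c, v⟩; the slides do not spell out align(·,·), so it is taken here to be cosine similarity, which is the natural reading of the alignment-improvement criterion. All names (`align`, `boost_direction`, `lmo`) are mine, not the authors' code.

```python
import numpy as np

def align(a, b):
    # Cosine of the angle between a and b (0 if either is zero).
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0.0 or nb == 0.0 else float(a @ b) / (na * nb)

def boost_direction(z, neg_grad, lmo, K=100, delta=1e-3):
    """Chase neg_grad = -grad f(z) with directions built from vertices of C."""
    d = np.zeros_like(neg_grad, dtype=float)
    Lam = 0.0
    for _ in range(K):
        r = neg_grad - d                              # k-th residual
        v = lmo(-r)                                   # argmax_v <r, v> = argmin_v <-r, v>
        candidates = [v - z]
        if Lam > 0:                                   # extra candidate -d/||d||
            candidates.append(-d / np.linalg.norm(d))
        u = max(candidates, key=lambda u: float(r @ u))
        lam = float(r @ u) / float(u @ u)
        d_new = d + lam * u
        if align(neg_grad, d_new) - align(neg_grad, d) >= delta:
            if u is candidates[0]:                    # u = v - z
                Lam += lam
            else:                                     # u = -d/||d||
                Lam *= (1.0 - lam / np.linalg.norm(d))
            d = d_new
        else:
            break                                     # alignment no longer improves
    return d / Lam if Lam > 0 else d                  # normalized boosted direction g
```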
Boosting Frank-Wolfe

Algorithm Frank-Wolfe (FW)
Input: x0 ∈ C, γt ∈ [0, 1].
1: for t = 0 to T − 1 do
2:     vt ← argmin_{v∈V} ⟨∇f(xt), v⟩
3:     xt+1 ← xt + γt(vt − xt)

Algorithm Boosted Frank-Wolfe (BoostFW)
Input: x0 ∈ C, γt ∈ [0, 1], K ∈ N\{0}, δ ∈ ]0, 1[.
1: for t = 0 to T − 1 do
2:     gt ← procedure(xt, −∇f(xt), K, δ)
3:     xt+1 ← xt + γt gt

On the toy triangle instance, FW zig-zags through x1, x2, x3, x4, ... while BoostFW reaches x* = x1 in a single step.

- What is the convergence rate of BoostFW?
- Is BoostFW expensive in practice?
- How does it compare to the state of the art?
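A minimal sketch of the BoostFW outer loop: it reuses the `boost_direction` sketch given after the procedure slide and the "short step" rule stated in the convergence theorem on the next slide. The names (`boost_fw`, `lmo`, `L`) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def boost_fw(x0, grad_f, lmo, L, T=1000, K=100, delta=1e-3):
    """BoostFW outer loop with the short step gamma_t = min(<-grad, g>/(L||g||^2), 1)."""
    x = np.array(x0, dtype=float)
    for t in range(T):
        ng = -grad_f(x)                                   # -grad f(xt)
        g = boost_direction(x, ng, lmo, K=K, delta=delta) # boosted direction gt
        gg = float(g @ g)
        if gg == 0.0:
            break                                         # no aligned direction found
        gamma = min(float(ng @ g) / (L * gg), 1.0)        # short step
        x = x + gamma * g                                 # stays in C since [x, x+g] in C
    return x
```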
Boosting Frank-Wolfe

- Let Nt be the number of iterations up to t in which at least 2 rounds of alignment were performed (FW = always 1 round)

Theorem. Let C ⊂ H be a compact convex set with diameter D and f : H → R be an L-smooth, convex, and µ-gradient-dominated function, and let x0 ∈ argmin_{v∈V} ⟨∇f(y), v⟩ for some y ∈ C. Set γt = min{⟨−∇f(xt), gt⟩ / (L‖gt‖²), 1} (“short step”) and suppose that Nt ≥ ωt^p where p ∈ ]0, 1]. Then

    f(xt) − min_C f ≤ (LD²/2) exp(−δ² (µ/L) ω t^p)

- The assumption Nt ≥ ωt^p simply states that Nt is nonnegligible, i.e., that the boosting procedure is active
- Otherwise BoostFW reduces to FW and the convergence rate is 4LD²/(t + 2)
- In practice, Nt ≈ t (so ω ≈ 1 and p = 1)
Computational experiments

- We compare BoostFW to AFW, BCG, and DICG on a series of experiments involving various objective functions and feasible regions:

Sparse signal recovery:
    min_{x∈R^n} ‖y − Ax‖₂²   s.t.   ‖x‖₁ ≤ τ

Traffic assignment:
    min_{x∈R^{|A|}} Σ_{a∈A} τa xa (1 + 0.03 (xa/ca)⁴)
    s.t.   xa = Σ_{r∈R} 1{a∈r} yr   for a ∈ A
           Σ_{r∈R_{i,j}} yr = d_{i,j}   for (i, j) ∈ S
           yr ≥ 0   for r ∈ R_{i,j}, (i, j) ∈ S

Sparse logistic regression:
    min_{x∈R^n} (1/m) Σ_{i=1}^{m} ln(1 + exp(−yi ai^⊤ x))   s.t.   ‖x‖₁ ≤ τ

Collaborative filtering:
    min_{X∈R^{m×n}} (1/|I|) Σ_{(i,j)∈I} hρ(Y_{i,j} − X_{i,j})   s.t.   ‖X‖_nuc ≤ τ

- For BoostFW and AFW we also run the line search-free variants (the “short step” strategy) and label them with an “L”
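For the collaborative filtering problem, here is a minimal sketch (an assumption of mine, not from the slides) of the linear minimization oracle over the nuclear-norm ball {X : ‖X‖_nuc ≤ τ}: the minimizer of ⟨G, X⟩ is −τ u1 v1ᵀ, where (u1, v1) is the top singular vector pair of the gradient G, so only one top-SVD is needed per iteration, in line with the O(nnz) entry in the earlier table.

```python
import numpy as np
from scipy.sparse.linalg import svds

def nuclear_lmo(G, tau):
    """argmin_{||X||_nuc <= tau} <G, X> = -tau * u1 v1^T, with (u1, v1) the top
    singular vectors of G (dense or scipy.sparse)."""
    u, s, vt = svds(G, k=1)                       # top singular triplet of G
    return -tau * np.outer(u[:, 0], vt[0, :])
```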
Computational experiments

Results on:
- Sparse signal recovery
- Traffic assignment
- Sparse logistic regression on the Gisette dataset
- Collaborative filtering on the MovieLens 100k dataset
Boosting DICG

- DICG is known to perform particularly well on the video co-localization experiment (YouTube-Objects dataset)
- BoostDICG: application of our method to DICG
- Details:

    DICG                                      BoostDICG
    at ← away vertex                          at ← away vertex
    vt ← argmin_{v∈V} ⟨∇f(xt), v⟩             gt ← procedure(at, −∇f(xt), K, δ)
    xt+1 ← xt + γt(vt − at)                   xt+1 ← xt + γt gt
Takeaways and final remarks

- Projection-free algorithms are of considerable interest in optimization
- We have proposed an intuitive and generic boosting procedure to speed up Frank-Wolfe algorithms
- Although our method may perform more linear minimizations per iteration, the progress obtained greatly outweighs their cost
- We focused on smooth convex objective functions, but we expect our method to provide significant gains in performance in other areas of optimization as well
- E.g., large-scale finite-sum/stochastic constrained optimization (see the sketch below):

    gt ← procedure(xt, −∇̃f(xt), K, δ)
    xt+1 ← xt + γt gt
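A hedged sketch of that stochastic extension: the exact gradient is simply replaced by a minibatch estimate before calling the boosting procedure. It reuses the `boost_direction` sketch from earlier; `stochastic_grad` and the diminishing step size 2/(t+2) are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def boost_fw_stochastic(x0, stochastic_grad, lmo, T=1000, K=100, delta=1e-3):
    """BoostFW with a stochastic gradient estimate in place of grad f(xt)."""
    x = np.array(x0, dtype=float)
    for t in range(T):
        g_est = stochastic_grad(x, t)                         # minibatch estimate of grad f(x)
        d = boost_direction(x, -g_est, lmo, K=K, delta=delta) # boosted direction gt
        gamma = 2.0 / (t + 2)                                 # simple diminishing step size
        x = x + gamma * d
    return x
```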
References

- G. Braun, S. Pokutta, D. Tu, and S. Wright. Blended conditional gradients: the unconditioning of conditional gradients. ICML, 2019.
- M. D. Canon and C. D. Cullum. A tight upper bound on the rate of convergence of Frank-Wolfe algorithm. SIAM J. Control, 1968.
- C. W. Combettes and S. Pokutta. Boosting Frank-Wolfe by chasing gradients. ICML, 2020.
- M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Res. Logist. Q., 1956.
- D. Garber and O. Meshi. Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. NIPS, 2016.
- M. Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. ICML, 2013.
- S. Lacoste-Julien and M. Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. NIPS, 2015.
- G. Lan. The complexity of large-scale convex programming under a linear optimization oracle. Technical report, University of Florida, 2013.
- E. S. Levitin and B. T. Polyak. Constrained minimization methods. USSR Comput. Math. Math. Phys., 1966.
- F. Locatello, M. Tschannen, G. Rätsch, and M. Jaggi. Greedy algorithms for cone constrained optimization with convergence guarantees. NIPS, 2017.
- P. Wolfe. Convergence theory in nonlinear programming. In Integer and Nonlinear Programming. North-Holland, 1970.