Boosting Frank-Wolfe by Chasing Gradients

Cyrille W. Combettes, with Sebastian Pokutta
School of Industrial and Systems Engineering
Georgia Institute of Technology, Atlanta, GA, USA

37th International Conference on Machine Learning, July 2020
Outline
1. Introduction
2. The Frank-Wolfe algorithm
3. Boosting Frank-Wolfe
4. Computational experiments
Introduction

Let H be a Euclidean space (e.g., R^n or R^{m×n}) and consider

    min f(x)   s.t.   x ∈ C

where
- f : H → R is a smooth convex function
- C ⊂ H is a compact convex set, C = conv(V)

Example
- Sparse logistic regression:

    min_{x ∈ R^n}  (1/m) Σ_{i=1}^{m} ln(1 + exp(−yi ai^⊤ x))   s.t.   ‖x‖₁ ≤ τ

- Low-rank matrix completion:

    min_{X ∈ R^{m×n}}  (1/(2|I|)) Σ_{(i,j)∈I} (Y_{i,j} − X_{i,j})²   s.t.   ‖X‖_nuc ≤ τ
Introduction

- A natural approach is to use any efficient method and add projections back onto C to ensure feasibility (xt ↦ xt − γt∇f(xt) ↦ project onto C to obtain xt+1)
- However, in many situations projections onto C are very expensive
- This is an issue with the method of projections, not necessarily with the geometry of C: linear minimizations over C can still be relatively cheap

    Feasible region C              Linear minimization    Projection
    ℓ1/ℓ2/ℓ∞-ball                  O(n)                   O(n)
    ℓp-ball, p ∈ ]1, ∞[ \ {2}      O(n)                   N/A
    Nuclear norm-ball              O(nnz)                 O(mn min{m, n})
    Flow polytope                  O(n)                   O(n^3.5)
    Birkhoff polytope              O(n³)                  N/A
    Matroid polytope               O(n ln(n))             O(poly(n))

    N/A: no closed form exists and the solution must be computed via nontrivial optimization

- Can we avoid projections?
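To make the cost comparison concrete, here is a minimal sketch (my own illustration, not from the slides) of the linear minimization oracle over the ℓ1-ball of radius τ: the minimizer of ⟨g, v⟩ over {v : ‖v‖₁ ≤ τ} puts all its mass on a single coordinate, so it only needs one O(n) scan, whereas a projection requires more work.

```python
import numpy as np

def lmo_l1_ball(g, tau):
    """argmin_{||v||_1 <= tau} <g, v>: place -tau*sign(g_i) on a coordinate i
    maximizing |g_i| (a single O(n) scan, no projection needed)."""
    i = int(np.argmax(np.abs(g)))
    v = np.zeros_like(g, dtype=float)
    v[i] = -tau * np.sign(g[i])
    return v
```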
The Frank-Wolfe algorithm

The Frank-Wolfe algorithm (Frank & Wolfe, 1956), a.k.a. the conditional gradient algorithm (Levitin & Polyak, 1966):

Algorithm Frank-Wolfe (FW)
Input: x0 ∈ C, γt ∈ [0, 1].
1: for t = 0 to T − 1 do
2:     vt ← argmin_{v∈V} ⟨∇f(xt), v⟩
3:     xt+1 ← xt + γt(vt − xt)

- xt+1 is obtained as a convex combination of xt ∈ C and vt ∈ C, thus xt+1 ∈ C
- FW uses linear minimizations (the “FW oracle”) instead of projections
- FW = pick a vertex (using gradient information) and move in that direction
- Successfully applied to: traffic assignment, computer vision, optimal transport, adversarial learning, etc.
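For reference, a minimal sketch of the FW loop with the default step size γt = 2/(t+2); `lmo` is a user-supplied linear minimization oracle and the function names are illustrative, not the authors' code.

```python
import numpy as np

def frank_wolfe(x0, grad_f, lmo, T=1000):
    """Vanilla Frank-Wolfe with the default step size gamma_t = 2/(t+2)."""
    x = np.array(x0, dtype=float)
    for t in range(T):
        v = lmo(grad_f(x))            # vt in argmin_{v in V} <grad f(xt), v>
        gamma = 2.0 / (t + 2)
        x = x + gamma * (v - x)       # convex combination, so x stays in C
    return x
```

For instance, `frank_wolfe(x0, grad_f, lambda g: lmo_l1_ball(g, tau))` would run FW over the ℓ1-ball using the oracle sketched earlier.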
The Frank-Wolfe algorithm

Theorem (Levitin & Polyak, 1966; Jaggi, 2013). Let C ⊂ H be a compact convex set with diameter D and f : H → R be an L-smooth convex function, and let x0 ∈ argmin_{v∈V} ⟨∇f(y), v⟩ for some y ∈ C. If γt = 2/(t + 2) (default) or γt = min{⟨∇f(xt), xt − vt⟩ / (L‖xt − vt‖²), 1} (“short step”), then

    f(xt) − min_C f ≤ 4LD² / (t + 2)

- The convergence rate cannot be improved (Canon & Cullum, 1968; Jaggi, 2013; Lan, 2013)
- Why?
The Frank-Wolfe algorithm

Consider the simple problem

    min (1/2)‖x‖₂²   s.t.   x ∈ conv{(1, 0), (−1, 0), (0, 1)},

whose solution is x* = (0, 0).

- Let x0 = (0, 1)
- FW tries to reach x* by moving towards vertices (x1, x2, x3, x4, ...)
- This yields an inefficient zig-zagging trajectory
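A small numeric illustration (my own, under the assumption that the toy instance is the one reconstructed above): running vanilla FW with γt = 2/(t+2) on this triangle shows the iterates alternating between pulls toward (1, 0) and (−1, 0).

```python
import numpy as np

V = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])   # triangle vertices
x = np.array([0.0, 1.0])                               # x0

for t in range(8):
    grad = x                                           # gradient of (1/2)||x||_2^2
    v = V[np.argmin(V @ grad)]                         # FW oracle over the vertices
    x = x + 2.0 / (t + 2) * (v - x)
    print(t, x)
# After the first step the first coordinate flips sign at every iteration while
# shrinking slowly: the zig-zag toward x* = (0, 0).
```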
Improved Frank-Wolfe variants

- Away-Step Frank-Wolfe (AFW) (Wolfe, 1970; Lacoste-Julien & Jaggi, 2015): enhances FW by also allowing moves away from vertices; on the toy problem it reaches x* = x5 after five steps
- Decomposition-Invariant Pairwise Conditional Gradient (DICG) (Garber & Meshi, 2016): memory-free variant of AFW
- Blended Conditional Gradients (BCG) (Braun et al., 2019): blends FCFW and FW
Boosting Frank-Wolfe

- Can we speed up FW in a simple way?
- Rule of thumb in optimization: follow the steepest direction

Idea:
- Speed up FW by moving in a direction better aligned with −∇f(xt)
- Build this direction by using V to maintain the projection-free property
Boosting Frank-Wolfe

- How can we build a direction better aligned with −∇f(xt) that still allows updating xt+1 without a projection?

- v0 ∈ argmax_{v∈V} ⟨−∇f(xt), v⟩
  λ0u0 = (⟨−∇f(xt), v0 − xt⟩ / ‖v0 − xt‖²) (v0 − xt),   r1 = −∇f(xt) − λ0u0
- v1 ∈ argmax_{v∈V} ⟨r1, v⟩
  λ1u1 = (⟨r1, v1 − xt⟩ / ‖v1 − xt‖²) (v1 − xt),   r2 = r1 − λ1u1
- We could continue: v2 ∈ argmax_{v∈V} ⟨r2, v⟩, etc.
- d = λ0u0 + λ1u1
- gt = d / (λ0 + λ1)

- The boosted direction gt is better aligned with −∇f(xt) than is the FW direction v0 − xt, and it satisfies [xt, xt + gt] ⊆ C, so we can update xt+1 = xt + γt gt for any γt ∈ [0, 1]
Boosting Frank-Wolfe

Why is [xt, xt + gt] ⊆ C? Let Kt be the number of alignment rounds. We have

    d = Σ_{k=0}^{Kt−1} λk (vk − xt)   where λk > 0 and vk ∈ V,

so if Λt = Σ_{k=0}^{Kt−1} λk, then

    gt = (1/Λt) Σ_{k=0}^{Kt−1} λk (vk − xt) = (1/Λt) Σ_{k=0}^{Kt−1} λk vk − xt,

where the first term is a convex combination of points of V and hence lies in C. Thus xt + gt ∈ C, so [xt, xt + gt] ⊆ C by convexity.
Boosting Frank-Wolfe

Algorithm: Finding a direction g well aligned with ∇ from a reference point z
Input: z ∈ C, ∇ ∈ H, K ∈ N\{0}, δ ∈ ]0, 1[.
1:  d0 ← 0, Λ ← 0
2:  for k = 0 to K − 1 do
3:      rk ← ∇ − dk                                      ⊲ k-th residual
4:      vk ← argmax_{v∈V} ⟨rk, v⟩                         ⊲ FW oracle
5:      uk ← argmax_{u ∈ {vk − z, −dk/‖dk‖}} ⟨rk, u⟩
6:      λk ← ⟨rk, uk⟩ / ‖uk‖²
7:      d′k ← dk + λk uk
8:      if align(∇, d′k) − align(∇, dk) ≥ δ then
9:          dk+1 ← d′k
10:         Λ ← Λ + λk if uk = vk − z,   Λ(1 − λk/‖dk‖) if uk = −dk/‖dk‖
11:     else
12:         break                                         ⊲ exit k-loop
13: g ← dk/Λ                                              ⊲ normalization

- The candidate direction −dk/‖dk‖ in line 5 is a technicality to ensure convergence of the procedure (Locatello et al., 2017)
- The stopping criterion is an alignment improvement condition (typically δ = 10⁻³ and K = +∞)
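Below is a minimal sketch of this alignment procedure, assuming a user-supplied oracle `lmo(c)` returning argmin_{v∈V} ⟨c, v⟩; the slides do not spell out align(·,·), so it is taken here to be cosine similarity, which is the natural reading of the alignment-improvement criterion. All names (`align`, `boost_direction`, `lmo`) are mine, not the authors' code.

```python
import numpy as np

def align(a, b):
    # Cosine of the angle between a and b (0 if either is zero).
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0.0 or nb == 0.0 else float(a @ b) / (na * nb)

def boost_direction(z, neg_grad, lmo, K=100, delta=1e-3):
    """Chase neg_grad = -grad f(z) with directions built from vertices of C."""
    d = np.zeros_like(neg_grad, dtype=float)
    Lam = 0.0
    for _ in range(K):
        r = neg_grad - d                              # k-th residual
        v = lmo(-r)                                   # argmax_v <r, v> = argmin_v <-r, v>
        candidates = [v - z]
        if Lam > 0:                                   # extra candidate -d/||d||
            candidates.append(-d / np.linalg.norm(d))
        u = max(candidates, key=lambda u: float(r @ u))
        lam = float(r @ u) / float(u @ u)
        d_new = d + lam * u
        if align(neg_grad, d_new) - align(neg_grad, d) >= delta:
            if u is candidates[0]:                    # u = v - z
                Lam += lam
            else:                                     # u = -d/||d||
                Lam *= (1.0 - lam / np.linalg.norm(d))
            d = d_new
        else:
            break                                     # alignment no longer improves
    return d / Lam if Lam > 0 else d                  # normalized boosted direction g
```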
Boosting Frank-Wolfe

Algorithm Frank-Wolfe (FW)
Input: x0 ∈ C, γt ∈ [0, 1].
1: for t = 0 to T − 1 do
2:     vt ← argmin_{v∈V} ⟨∇f(xt), v⟩
3:     xt+1 ← xt + γt(vt − xt)

Algorithm Boosted Frank-Wolfe (BoostFW)
Input: x0 ∈ C, γt ∈ [0, 1], K ∈ N\{0}, δ ∈ ]0, 1[.
1: for t = 0 to T − 1 do
2:     gt ← procedure(xt, −∇f(xt), K, δ)
3:     xt+1 ← xt + γt gt

On the toy triangle instance, FW zig-zags through x1, x2, x3, x4, ... while BoostFW reaches x* = x1 in a single step.

- What is the convergence rate of BoostFW?
- Is BoostFW expensive in practice?
- How does it compare to the state of the art?
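A minimal sketch of the BoostFW outer loop: it reuses the `boost_direction` sketch given after the procedure slide and the "short step" rule stated in the convergence theorem on the next slide. The names (`boost_fw`, `lmo`, `L`) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def boost_fw(x0, grad_f, lmo, L, T=1000, K=100, delta=1e-3):
    """BoostFW outer loop with the short step gamma_t = min(<-grad, g>/(L||g||^2), 1)."""
    x = np.array(x0, dtype=float)
    for t in range(T):
        ng = -grad_f(x)                                   # -grad f(xt)
        g = boost_direction(x, ng, lmo, K=K, delta=delta) # boosted direction gt
        gg = float(g @ g)
        if gg == 0.0:
            break                                         # no aligned direction found
        gamma = min(float(ng @ g) / (L * gg), 1.0)        # short step
        x = x + gamma * g                                 # stays in C since [x, x+g] in C
    return x
```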
Boosting Frank-Wolfe

- Let Nt be the number of iterations up to t in which at least 2 rounds of alignment were performed (FW = always 1 round)

Theorem. Let C ⊂ H be a compact convex set with diameter D and f : H → R be an L-smooth, convex, and µ-gradient-dominated function, and let x0 ∈ argmin_{v∈V} ⟨∇f(y), v⟩ for some y ∈ C. Set γt = min{⟨−∇f(xt), gt⟩ / (L‖gt‖²), 1} (“short step”) and suppose that Nt ≥ ωt^p where p ∈ ]0, 1]. Then

    f(xt) − min_C f ≤ (LD²/2) exp(−δ² (µ/L) ω t^p)

- The assumption Nt ≥ ωt^p simply states that Nt is nonnegligible, i.e., that the boosting procedure is active
- Otherwise BoostFW reduces to FW and the convergence rate is 4LD²/(t + 2)
- In practice, Nt ≈ t (so ω ≈ 1 and p = 1)
Computational experiments

- We compare BoostFW to AFW, BCG, and DICG on a series of experiments involving various objective functions and feasible regions:

Sparse signal recovery:
    min_{x∈R^n} ‖y − Ax‖₂²   s.t.   ‖x‖₁ ≤ τ

Traffic assignment:
    min_{x∈R^{|A|}} Σ_{a∈A} τa xa (1 + 0.03 (xa/ca)⁴)
    s.t.   xa = Σ_{r∈R} 1{a∈r} yr   for a ∈ A
           Σ_{r∈R_{i,j}} yr = d_{i,j}   for (i, j) ∈ S
           yr ≥ 0   for r ∈ R_{i,j}, (i, j) ∈ S

Sparse logistic regression:
    min_{x∈R^n} (1/m) Σ_{i=1}^{m} ln(1 + exp(−yi ai^⊤ x))   s.t.   ‖x‖₁ ≤ τ

Collaborative filtering:
    min_{X∈R^{m×n}} (1/|I|) Σ_{(i,j)∈I} hρ(Y_{i,j} − X_{i,j})   s.t.   ‖X‖_nuc ≤ τ

- For BoostFW and AFW we also run the line search-free variants (the “short step” strategy) and label them with an “L”
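For the collaborative filtering problem, here is a minimal sketch (an assumption of mine, not from the slides) of the linear minimization oracle over the nuclear-norm ball {X : ‖X‖_nuc ≤ τ}: the minimizer of ⟨G, X⟩ is −τ u1 v1ᵀ, where (u1, v1) is the top singular vector pair of the gradient G, so only one top-SVD is needed per iteration, in line with the O(nnz) entry in the earlier table.

```python
import numpy as np
from scipy.sparse.linalg import svds

def nuclear_lmo(G, tau):
    """argmin_{||X||_nuc <= tau} <G, X> = -tau * u1 v1^T, with (u1, v1) the top
    singular vectors of G (dense or scipy.sparse)."""
    u, s, vt = svds(G, k=1)                       # top singular triplet of G
    return -tau * np.outer(u[:, 0], vt[0, :])
```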
Computational experiments

Results on:
- Sparse signal recovery
- Traffic assignment
- Sparse logistic regression on the Gisette dataset
- Collaborative filtering on the MovieLens 100k dataset
Boosting DICG

- DICG is known to perform particularly well on the video co-localization experiment (YouTube-Objects dataset)
- BoostDICG: application of our method to DICG
- Details:

    DICG                                      BoostDICG
    at ← away vertex                          at ← away vertex
    vt ← argmin_{v∈V} ⟨∇f(xt), v⟩             gt ← procedure(at, −∇f(xt), K, δ)
    xt+1 ← xt + γt(vt − at)                   xt+1 ← xt + γt gt
Takeaways and final remarks

- Projection-free algorithms are of considerable interest in optimization
- We have proposed an intuitive and generic boosting procedure to speed up Frank-Wolfe algorithms
- Although our method may perform more linear minimizations per iteration, the progress obtained greatly outweighs their cost
- We focused on smooth convex objective functions, but we expect our method to provide significant gains in performance in other areas of optimization as well
- E.g., large-scale finite-sum/stochastic constrained optimization (see the sketch below):

    gt ← procedure(xt, −∇̃f(xt), K, δ)
    xt+1 ← xt + γt gt
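A hedged sketch of that stochastic extension: the exact gradient is simply replaced by a minibatch estimate before calling the boosting procedure. It reuses the `boost_direction` sketch from earlier; `stochastic_grad` and the diminishing step size 2/(t+2) are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def boost_fw_stochastic(x0, stochastic_grad, lmo, T=1000, K=100, delta=1e-3):
    """BoostFW with a stochastic gradient estimate in place of grad f(xt)."""
    x = np.array(x0, dtype=float)
    for t in range(T):
        g_est = stochastic_grad(x, t)                         # minibatch estimate of grad f(x)
        d = boost_direction(x, -g_est, lmo, K=K, delta=delta) # boosted direction gt
        gamma = 2.0 / (t + 2)                                 # simple diminishing step size
        x = x + gamma * d
    return x
```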
References

- G. Braun, S. Pokutta, D. Tu, and S. Wright. Blended conditional gradients: the unconditioning of conditional gradients. ICML, 2019.
- M. D. Canon and C. D. Cullum. A tight upper bound on the rate of convergence of Frank-Wolfe algorithm. SIAM J. Control, 1968.
- C. W. Combettes and S. Pokutta. Boosting Frank-Wolfe by chasing gradients. ICML, 2020.
- M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Res. Logist. Q., 1956.
- D. Garber and O. Meshi. Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. NIPS, 2016.
- M. Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. ICML, 2013.
- S. Lacoste-Julien and M. Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. NIPS, 2015.
- G. Lan. The complexity of large-scale convex programming under a linear optimization oracle. Technical report, University of Florida, 2013.
- E. S. Levitin and B. T. Polyak. Constrained minimization methods. USSR Comput. Math. Math. Phys., 1966.
- F. Locatello, M. Tschannen, G. Rätsch, and M. Jaggi. Greedy algorithms for cone constrained optimization with convergence guarantees. NIPS, 2017.
- P. Wolfe. Convergence theory in nonlinear programming. In Integer and Nonlinear Programming. North-Holland, 1970.