Boosting Frank-Wolfe by Chasing Gradients

Cyrille W. Combettes, with Sebastian Pokutta

School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, GA, USA

37th International Conference on Machine Learning July 12–18, 2020

Outline

1. Introduction
2. The Frank-Wolfe algorithm
3. Boosting Frank-Wolfe
4. Computational experiments

Introduction

Let $\mathcal{H}$ be a Euclidean space (e.g., $\mathbb{R}^n$ or $\mathbb{R}^{m\times n}$) and consider

$$\min\, f(x) \quad \text{s.t.} \quad x \in \mathcal{C}$$

where
  • $f\colon \mathcal{H} \to \mathbb{R}$ is a smooth convex function
  • $\mathcal{C} \subset \mathcal{H}$ is a compact convex set, $\mathcal{C} = \operatorname{conv}(\mathcal{V})$

Example
  • Sparse logistic regression:
$$\min_{x \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^{m} \ln\bigl(1 + \exp(-y_i a_i^\top x)\bigr) \quad \text{s.t.} \quad \|x\|_1 \le \tau$$
  • Low-rank matrix completion:
$$\min_{X \in \mathbb{R}^{m\times n}} \frac{1}{2|I|} \sum_{(i,j)\in I} (Y_{i,j} - X_{i,j})^2 \quad \text{s.t.} \quad \|X\|_{\mathrm{nuc}} \le \tau$$

Introduction

  • A natural approach is to use any efficient method and add projections back onto $\mathcal{C}$ to ensure feasibility

[Figure: a gradient step $x_t - \gamma_t \nabla f(x_t)$ leaves $\mathcal{C}$ and is projected back to obtain $x_{t+1}$]

  • However, in many situations projections onto $\mathcal{C}$ are very expensive
  • This is an issue with the method of projections, not necessarily with the geometry of $\mathcal{C}$: linear minimizations over $\mathcal{C}$ can still be relatively cheap (see the sketch after the table)

Feasible region C                  Linear minimization    Projection
ℓ1/ℓ2/ℓ∞-ball                      O(n)                   O(n)
ℓp-ball, p ∈ ]1, ∞[ \ {2}          O(n)                   N/A
Nuclear norm-ball                  O(nnz)                 O(mn min{m, n})
Flow polytope                      O(n)                   O(n^3.5)
Birkhoff polytope                  O(n^3)                 N/A
Matroid polytope                   O(n ln(n))             O(poly(n))

N/A: no closed form exists and the solution must be computed via nontrivial optimization

  • Can we avoid projections?
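
To make the contrast concrete, here is a minimal sketch (not from the talk, with a hypothetical signature) of a linear minimization oracle over the ℓ1-ball of radius τ: a single O(n) pass, whereas projecting onto, e.g., the nuclear norm-ball requires a full SVD.

```python
import numpy as np

def lmo_l1_ball(grad, tau):
    """Linear minimization oracle over {v : ||v||_1 <= tau}:
    returns argmin_v <grad, v>, which is attained at a signed, scaled
    standard basis vector -- found in one O(n) scan of grad."""
    i = int(np.argmax(np.abs(grad)))      # coordinate with largest |grad_i|
    v = np.zeros_like(grad)
    v[i] = -tau * np.sign(grad[i])        # move against the gradient's sign
    return v
```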

The Frank-Wolfe algorithm

The Frank-Wolfe algorithm (Frank & Wolfe, 1956), a.k.a. the conditional gradient algorithm (Levitin & Polyak, 1966):

Algorithm Frank-Wolfe (FW)
Input: $x_0 \in \mathcal{C}$, $\gamma_t \in [0, 1]$.
1: for t = 0 to T − 1 do
2:   $v_t \leftarrow \arg\min_{v \in \mathcal{V}} \langle \nabla f(x_t), v \rangle$
3:   $x_{t+1} \leftarrow x_t + \gamma_t (v_t - x_t)$

[Figure: $-\nabla f(x_t)$ drawn at $x_t$; the vertex $v_t$ minimizes the linear model; $x_{t+1}$ lies on the segment $[x_t, v_t]$]

  • $x_{t+1}$ is obtained by convex combination of $x_t \in \mathcal{C}$ and $v_t \in \mathcal{C}$, thus $x_{t+1} \in \mathcal{C}$
  • FW uses linear minimizations (the "FW oracle") instead of projections
  • FW = pick a vertex (using gradient information) and move in that direction (see the sketch below)
  • Successfully applied to: traffic assignment, computer vision, optimal transport, adversarial learning, etc.
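
Pulling the pseudocode into runnable form, here is a minimal Python sketch of vanilla FW with the default step size $\gamma_t = 2/(t+2)$; `grad_f` and `lmo` are hypothetical oracle arguments (e.g., the `lmo_l1_ball` above).

```python
def frank_wolfe(grad_f, lmo, x0, T):
    """Vanilla Frank-Wolfe with the default open-loop step 2/(t+2).
    grad_f: x -> gradient of f at x; lmo: g -> argmin_{v in C} <g, v>."""
    x = x0
    for t in range(T):
        v = lmo(grad_f(x))           # FW oracle (linear minimization)
        gamma = 2.0 / (t + 2)        # default step size
        x = x + gamma * (v - x)      # convex combination: stays in C
    return x
```

For instance, sparse signal recovery could be run as `frank_wolfe(lambda x: 2 * A.T @ (A @ x - y), lambda g: lmo_l1_ball(g, tau), np.zeros(n), 1000)`, with `A`, `y`, `tau`, `n` as in the experiments section.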

The Frank-Wolfe algorithm

Theorem (Levitin & Polyak, 1966; Jaggi, 2013). Let $\mathcal{C} \subset \mathcal{H}$ be a compact convex set with diameter $D$ and $f\colon \mathcal{H} \to \mathbb{R}$ be an $L$-smooth convex function, and let $x_0 \in \arg\min_{v \in \mathcal{V}} \langle \nabla f(y), v \rangle$ for some $y \in \mathcal{C}$. If $\gamma_t = \frac{2}{t+2}$ (default) or $\gamma_t = \min\left\{\frac{\langle \nabla f(x_t), x_t - v_t \rangle}{L \|x_t - v_t\|^2}, 1\right\}$ ("short step"), then

$$f(x_t) - \min_{\mathcal{C}} f \le \frac{4 L D^2}{t + 2}$$

  • The convergence rate cannot be improved (Canon & Cullum, 1968; Jaggi, 2013; Lan, 2013)
  • Why?

The Frank-Wolfe algorithm

Consider the simple problem

$$\min\, \frac{1}{2}\|x\|_2^2 \quad \text{s.t.} \quad x \in \operatorname{conv}\left\{\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}-1\\0\end{pmatrix}, \begin{pmatrix}0\\1\end{pmatrix}\right\}, \quad \text{with } x^* = \begin{pmatrix}0\\0\end{pmatrix}$$

  • Let $x_0 = \begin{pmatrix}0\\1\end{pmatrix}$
  • FW tries to reach $x^*$ by moving towards vertices (the slides animate the iterates $x_1, x_2, x_3, x_4$ bouncing between the two bottom vertices)
  • This yields an inefficient zig-zagging trajectory (a numeric check follows below)
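
A quick numeric check of the zig-zagging, under the reading of the slides above (triangle $\operatorname{conv}\{(1,0)^\top, (-1,0)^\top, (0,1)^\top\}$ and $x_0 = (0,1)^\top$ are assumptions taken from the garbled figure); the LMO is a brute-force argmin over the three vertices.

```python
import numpy as np

vertices = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])

def lmo_triangle(g):
    """argmin over the three vertices of <g, v>."""
    return vertices[np.argmin(vertices @ g)]

x = np.array([0.0, 1.0])
for t in range(5):
    v = lmo_triangle(x)              # gradient of (1/2)||x||^2 is x itself
    x = x + 2.0 / (t + 2) * (v - x)  # default step 2/(t+2)
    print(t, x)                      # (1,0), (-1/3,0), (1/3,0), (-1/5,0), ...
```

The iterates overshoot $x^* = (0,0)^\top$ from alternating sides along the bottom edge: exactly the zig-zag the slides depict.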

Improved Frank-Wolfe variants

  • Away-Step Frank-Wolfe (AFW) (Wolfe, 1970; Lacoste-Julien & Jaggi, 2015): enhances FW by allowing moves away from vertices (on the example above, it reaches $x^* = x_5$ after five iterations)
  • Decomposition-Invariant Pairwise Conditional Gradient (DICG) (Garber & Meshi, 2016): memory-free variant of AFW
  • Blended Conditional Gradients (BCG) (Braun et al., 2019): blends Fully-Corrective FW (FCFW) and FW

Boosting Frank-Wolfe

  • Can we speed up FW in a simple way?
  • Rule of thumb in optimization: follow the steepest direction

Idea:
  • Speed up FW by moving in a direction better aligned with $-\nabla f(x_t)$
  • Build this direction using $\mathcal{V}$, to maintain the projection-free property

Boosting Frank-Wolfe

  • How can we build a direction better aligned with $-\nabla f(x_t)$ that still allows updating $x_{t+1}$ without projection? Chase $-\nabla f(x_t)$ greedily, round by round (a worked example follows below):
  • $v_0 \in \arg\max_{v \in \mathcal{V}} \langle -\nabla f(x_t), v \rangle$,
    $\lambda_0 u_0 = \dfrac{\langle -\nabla f(x_t), v_0 - x_t \rangle}{\|v_0 - x_t\|^2} (v_0 - x_t)$,
    $r_1 = -\nabla f(x_t) - \lambda_0 u_0$
  • $v_1 \in \arg\max_{v \in \mathcal{V}} \langle r_1, v \rangle$,
    $\lambda_1 u_1 = \dfrac{\langle r_1, v_1 - x_t \rangle}{\|v_1 - x_t\|^2} (v_1 - x_t)$,
    $r_2 = r_1 - \lambda_1 u_1$
  • We could continue: $v_2 \in \arg\max_{v \in \mathcal{V}} \langle r_2, v \rangle$
  • $d = \lambda_0 u_0 + \lambda_1 u_1$
  • $g_t = d / (\lambda_0 + \lambda_1)$
  • The boosted direction $g_t$ is better aligned with $-\nabla f(x_t)$ than is the FW direction $v_0 - x_t$, and satisfies $[x_t, x_t + g_t] \subseteq \mathcal{C}$, so we can update $x_{t+1} = x_t + \gamma_t g_t$ for any $\gamma_t \in [0, 1]$
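
As an illustration I am adding (not from the talk), two rounds on the zig-zagging triangle at $x_0 = (0,1)^\top$ recover the negative gradient exactly. Here $-\nabla f(x_0) = (0,-1)^\top$. Round 0 picks $v_0 = (1,0)^\top$, so $\lambda_0 u_0 = \frac{\langle (0,-1), (1,-1) \rangle}{\|(1,-1)\|^2} (1,-1)^\top = (\tfrac12, -\tfrac12)^\top$ and $r_1 = (-\tfrac12, -\tfrac12)^\top$. Round 1 picks $v_1 = (-1,0)^\top$, so $\lambda_1 u_1 = (-\tfrac12, -\tfrac12)^\top$ and $r_2 = 0$. Hence $d = (0,-1)^\top = -\nabla f(x_0)$ and, with $\lambda_0 + \lambda_1 = 1$, $g_t = (0,-1)^\top$: a single boosted step with $\gamma_t = 1$ lands on $x^* = (0,0)^\top$, matching the "$x^* = x_1$" picture later in the deck.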

Boosting Frank-Wolfe

Why $[x_t, x_t + g_t] \subseteq \mathcal{C}$? Let $K_t$ be the number of alignment rounds. We have

$$d = \sum_{k=0}^{K_t - 1} \lambda_k (v_k - x_t) \quad \text{where } \lambda_k > 0 \text{ and } v_k \in \mathcal{V},$$

so if $\Lambda_t = \sum_{k=0}^{K_t - 1} \lambda_k$, then

$$g_t = \frac{1}{\Lambda_t} \sum_{k=0}^{K_t - 1} \lambda_k (v_k - x_t) = \underbrace{\frac{1}{\Lambda_t} \sum_{k=0}^{K_t - 1} \lambda_k v_k}_{\in\,\mathcal{C}} -\; x_t$$

Thus $x_t + g_t \in \mathcal{C}$, so $[x_t, x_t + g_t] \subseteq \mathcal{C}$ by convexity.

Boosting Frank-Wolfe

Algorithm: Finding a direction $g$ well aligned with $\nabla$ from a reference point $z$
Input: $z \in \mathcal{C}$, $\nabla \in \mathcal{H}$, $K \in \mathbb{N} \setminus \{0\}$, $\delta \in \,]0, 1[$.
1: $d_0 \leftarrow 0$, $\Lambda \leftarrow 0$
2: for k = 0 to K − 1 do
3:   $r_k \leftarrow \nabla - d_k$   ▷ k-th residual
4:   $v_k \leftarrow \arg\max_{v \in \mathcal{V}} \langle r_k, v \rangle$   ▷ FW oracle
5:   $u_k \leftarrow \arg\max_{u \in \{v_k - z,\, -d_k/\|d_k\|\}} \langle r_k, u \rangle$
6:   $\lambda_k \leftarrow \langle r_k, u_k \rangle / \|u_k\|^2$
7:   $d'_k \leftarrow d_k + \lambda_k u_k$
8:   if $\operatorname{align}(\nabla, d'_k) - \operatorname{align}(\nabla, d_k) \ge \delta$ then
9:     $d_{k+1} \leftarrow d'_k$
10:    $\Lambda \leftarrow \Lambda + \lambda_k$ if $u_k = v_k - z$, and $\Lambda \leftarrow \Lambda(1 - \lambda_k/\|d_k\|)$ if $u_k = -d_k/\|d_k\|$
11:  else
12:    break   ▷ exit k-loop
13: $g \leftarrow d_k / \Lambda$   ▷ normalization

  • The second candidate $-d_k/\|d_k\|$ in line 5 is a technicality to ensure convergence of the procedure (Locatello et al., 2017)
  • The stopping criterion is an alignment improvement condition, where $\operatorname{align}(a, b) = \langle a, b \rangle / (\|a\| \|b\|)$ is the cosine similarity (typically $\delta = 10^{-3}$ and $K = +\infty$); a runnable sketch follows below
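
A runnable sketch of the procedure, with assumptions flagged: `lmo_max` is a hypothetical max-oracle over $\mathcal{V}$ (line 4 above), `align` is the cosine similarity as stated, and the guard for $\Lambda = 0$ is mine.

```python
import numpy as np

def align(a, b):
    """Cosine alignment <a, b> / (||a|| ||b||); 0 if either vector is 0."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b) / (na * nb) if na > 0 and nb > 0 else 0.0

def boosting_procedure(z, nabla, lmo_max, K, delta):
    """Build a direction g well aligned with `nabla` from reference point z,
    using only vertices of C, so that [z, z + g] stays inside C."""
    d, Lam = np.zeros_like(nabla), 0.0
    for k in range(K):
        r = nabla - d                         # k-th residual
        v = lmo_max(r)                        # argmax_{v in V} <r, v>
        u, took_fw = v - z, True
        if Lam > 0:                           # backtracking candidate -d/||d||
            u_away = -d / np.linalg.norm(d)
            if float(r @ u_away) > float(r @ u):
                u, took_fw = u_away, False
        lam = float(r @ u) / float(u @ u)     # step along the chosen direction
        d_new = d + lam * u
        if align(nabla, d_new) - align(nabla, d) >= delta:
            Lam = Lam + lam if took_fw else Lam * (1 - lam / np.linalg.norm(d))
            d = d_new
        else:
            break                             # alignment improvement < delta
    return d / Lam if Lam > 0 else d          # normalization (guard is mine)
```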

Boosting Frank-Wolfe

Algorithm Frank-Wolfe (FW)
Input: $x_0 \in \mathcal{C}$, $\gamma_t \in [0, 1]$.
1: for t = 0 to T − 1 do
2:   $v_t \leftarrow \arg\min_{v \in \mathcal{V}} \langle \nabla f(x_t), v \rangle$
3:   $x_{t+1} \leftarrow x_t + \gamma_t (v_t - x_t)$

Algorithm Boosted Frank-Wolfe (BoostFW)
Input: $x_0 \in \mathcal{C}$, $\gamma_t \in [0, 1]$, $K \in \mathbb{N} \setminus \{0\}$, $\delta \in \,]0, 1[$.
1: for t = 0 to T − 1 do
2:   $g_t \leftarrow \text{procedure}(x_t, -\nabla f(x_t), K, \delta)$
3:   $x_{t+1} \leftarrow x_t + \gamma_t g_t$

[Figure: on the zig-zagging example, FW produces $x_1, \dots, x_4$ while BoostFW reaches $x^* = x_1$ in one step]

  • What is the convergence rate of BoostFW? (a sketch of the loop follows below)
  • Is BoostFW expensive in practice?
  • How does it compare to the state of the art?
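
Completing the picture, a minimal BoostFW loop built on `boosting_procedure` above; the "short step" from the theorem on the next slide stands in for line search, and `L` (a smoothness estimate) is an assumed input.

```python
def boost_fw(grad_f, lmo_max, x0, T, L, K=1000, delta=1e-3):
    """Boosted Frank-Wolfe: chase -grad f(x_t) through the procedure,
    then take the short step gamma_t = min{<-grad, g> / (L ||g||^2), 1}."""
    x = x0
    for t in range(T):
        neg_grad = -grad_f(x)
        g = boosting_procedure(x, neg_grad, lmo_max, K, delta)
        g_sq = float(g @ g)
        if g_sq == 0.0:
            break                              # no aligned direction left
        gamma = min(float(neg_grad @ g) / (L * g_sq), 1.0)
        x = x + gamma * g                      # [x, x+g] in C keeps x feasible
    return x
```

On the toy triangle, `boost_fw(lambda x: x, lmo_triangle_max, np.array([0.0, 1.0]), T=1, L=1.0)` would land on $x^* = (0,0)^\top$ in a single iteration, where `lmo_triangle_max(r)` is a hypothetical helper returning the vertex maximizing $\langle r, v \rangle$.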

Boosting Frank-Wolfe

  • Let $N_t$ be the number of iterations up to $t$ where at least 2 rounds of alignment were performed (FW = always 1 round)

Theorem. Let $\mathcal{C} \subset \mathcal{H}$ be a compact convex set with diameter $D$ and $f\colon \mathcal{H} \to \mathbb{R}$ be an $L$-smooth, convex, and $\mu$-gradient-dominated function, and let $x_0 \in \arg\min_{v \in \mathcal{V}} \langle \nabla f(y), v \rangle$ for some $y \in \mathcal{C}$. Set $\gamma_t = \min\left\{\frac{\langle -\nabla f(x_t), g_t \rangle}{L \|g_t\|^2}, 1\right\}$ ("short step") and suppose that $N_t \ge \omega t^p$ where $p \in \,]0, 1]$. Then

$$f(x_t) - \min_{\mathcal{C}} f \le \frac{L D^2}{2} \exp\left(-\delta^2 \frac{\mu}{L} \omega t^p\right)$$

  • The assumption $N_t \ge \omega t^p$ simply states that $N_t$ is nonnegligible, i.e., that the boosting procedure is active
  • Otherwise BoostFW reduces to FW and the convergence rate is $\frac{4 L D^2}{t + 2}$
  • In practice, $N_t \approx t$ (so $\omega \approx 1$ and $p = 1$)

Computational experiments

  • We compare BoostFW to AFW, BCG, and DICG on a series of experiments involving various objective functions and feasible regions:

Sparse signal recovery:
$$\min_{x \in \mathbb{R}^n} \|y - Ax\|_2^2 \quad \text{s.t.} \quad \|x\|_1 \le \tau$$

Traffic assignment:
$$\min_{x \in \mathbb{R}^{|A|}} \sum_{a \in A} \tau_a x_a \left(1 + 0.03 \left(\frac{x_a}{c_a}\right)^4\right) \quad \text{s.t.} \quad x_a = \sum_{r \in R} \mathbb{1}_{\{a \in r\}} y_r \;\; (a \in A), \quad \sum_{r \in R_{i,j}} y_r = d_{i,j} \;\; ((i,j) \in S), \quad y_r \ge 0 \;\; (r \in R_{i,j},\, (i,j) \in S)$$

Sparse logistic regression:
$$\min_{x \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^{m} \ln\bigl(1 + \exp(-y_i a_i^\top x)\bigr) \quad \text{s.t.} \quad \|x\|_1 \le \tau$$

Collaborative filtering:
$$\min_{X \in \mathbb{R}^{m \times n}} \frac{1}{|I|} \sum_{(i,j) \in I} h_\rho(Y_{i,j} - X_{i,j}) \quad \text{s.t.} \quad \|X\|_{\mathrm{nuc}} \le \tau$$

  • For BoostFW and AFW we also run the line-search-free variants (the "short step" strategy) and label them with an "L"

Computational experiments

  • Sparse signal recovery
  • Traffic assignment
  • Sparse logistic regression on the Gisette dataset
  • Collaborative filtering on the MovieLens 100k dataset

[Figure: performance plots for each of the four experiments]

Boosting DICG

  • DICG is known to perform particularly well on the video co-localization experiment (YouTube-Objects dataset)
  • BoostDICG: application of our method to DICG (details):

DICG:
  $a_t \leftarrow$ away vertex
  $v_t \leftarrow \arg\min_{v \in \mathcal{V}} \langle \nabla f(x_t), v \rangle$
  $x_{t+1} \leftarrow x_t + \gamma_t (v_t - a_t)$

BoostDICG:
  $a_t \leftarrow$ away vertex
  $g_t \leftarrow \text{procedure}(a_t, -\nabla f(x_t), K, \delta)$
  $x_{t+1} \leftarrow x_t + \gamma_t g_t$

Takeaways and final remarks

  • Projection-free algorithms are of considerable interest in optimization
  • We have proposed an intuitive and generic boosting procedure to speed up Frank-Wolfe algorithms
  • Although our method may perform more linear minimizations per iteration, the progress obtained greatly outweighs their cost
  • We focused on smooth convex objective functions, but we expect our method to provide significant gains in performance in other areas of optimization as well, e.g., large-scale finite-sum/stochastic constrained optimization:

$$g_t \leftarrow \text{procedure}(x_t, -\tilde{\nabla} f(x_t), K, \delta), \qquad x_{t+1} \leftarrow x_t + \gamma_t g_t$$

References

G. Braun, S. Pokutta, D. Tu, and S. Wright. Blended conditional gradients: the unconditioning of conditional gradients. ICML, 2019.
M. D. Canon and C. D. Cullum. A tight upper bound on the rate of convergence of the Frank-Wolfe algorithm. SIAM J. Control, 1968.
C. W. Combettes and S. Pokutta. Boosting Frank-Wolfe by chasing gradients. ICML, 2020.
M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Res. Logist. Q., 1956.
D. Garber and O. Meshi. Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. NIPS, 2016.
M. Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. ICML, 2013.
S. Lacoste-Julien and M. Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. NIPS, 2015.
G. Lan. The complexity of large-scale convex programming under a linear optimization oracle. Technical report, University of Florida, 2013.
E. S. Levitin and B. T. Polyak. Constrained minimization methods. USSR Comput. Math. Math. Phys., 1966.
F. Locatello, M. Tschannen, G. Rätsch, and M. Jaggi. Greedy algorithms for cone constrained optimization with convergence guarantees. NIPS, 2017.
P. Wolfe. Convergence theory in nonlinear programming. In Integer and Nonlinear Programming, North-Holland, 1970.