SLIDE 1

Frank-Wolfe Algorithms for Saddle Point problems

Gauthier Gidel1,3 Tony Jebara2 Simon Lacoste-Julien3

1INRIA Paris, Sierra Team 2Department of CS, Columbia University 3Department of CS & OR (DIRO) Université de Montréal

25th May 2017

Gauthier Gidel Frank-Wolfe Algorithms for SP 25th May 2017


SLIDE 3

Overview

◮ The Frank-Wolfe algorithm (FW) has gained popularity in the last couple of years.
◮ Main advantage: FW only needs a linear minimization oracle (LMO).
◮ We extend FW properties to solve saddle point problems.¹
◮ The extension is straightforward, but the analysis is non-trivial.

Question for the audience: call for applications.

¹Gauthier Gidel, Tony Jebara, and Simon Lacoste-Julien. “Frank-Wolfe Algorithms for Saddle Point Problems”. In: AISTATS. 2017.

SLIDE 8

Saddle point and link with variational inequalities

Let L : X × Y → R, where X and Y are convex and compact.

Saddle point problem: solve

  min_{x∈X} max_{y∈Y} L(x, y)

A solution (x∗, y∗) is called a saddle point.

◮ Necessary stationarity conditions:
  ⟨x − x∗, ∇xL(x∗, y∗)⟩ ≥ 0 for all x ∈ X
  ⟨y − y∗, −∇yL(x∗, y∗)⟩ ≥ 0 for all y ∈ Y
◮ Variational inequality:
  ⟨z − z∗, g(z∗)⟩ ≥ 0 for all z ∈ X × Y,
  where z∗ = (x∗, y∗) and g(z) := (∇xL(z), −∇yL(z)).
◮ Sufficient condition: (x∗, y∗) is a global solution if L is convex-concave, i.e. for all (x, y) ∈ X × Y, x′ ↦ L(x′, y) is convex and y′ ↦ L(x, y′) is concave.
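The variational inequality view suggests a computable optimality certificate: the quantity max_{z′∈X×Y} ⟨z − z′, g(z)⟩ (a Frank-Wolfe gap) is nonnegative everywhere and zero at a solution. A minimal numeric sketch on matching pennies, where the payoff matrix and helper names are illustrative choices, not from the slides:

```python
import numpy as np

# Matching pennies: L(x, y) = x^T M y over the product of two simplices,
# with saddle point at the uniform mixed strategy for both players.
M = np.array([[1., -1.], [-1., 1.]])

def g(x, y):
    # the operator g(z) = (grad_x L(z), -grad_y L(z))
    return M @ y, -(M.T @ x)

def fw_gap(x, y):
    # gap(z) = max_{z' in X x Y} <z - z', g(z)>; over a simplex the inner
    # linear minimization just picks the smallest coordinate of the gradient.
    gx, gy = g(x, y)
    return (x @ gx - gx.min()) + (y @ gy - gy.min())

x_star = y_star = np.array([0.5, 0.5])
assert abs(fw_gap(x_star, y_star)) < 1e-12   # the saddle point has zero gap
assert fw_gap(np.array([1., 0.]), np.array([1., 0.])) > 0   # a pure profile does not
```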


SLIDE 10

Motivations: games and robust learning

◮ Zero-sum games with two players:
  min_{x∈Δ(I)} max_{y∈Δ(J)} x⊤My
◮ Robust learning:² we want to learn
  min_{θ∈Θ} (1/n) Σ_{i=1}^{n} ℓ(f_θ(x_i), y_i) + λΩ(θ)
  with uncertainty regarding the data:
  min_{θ∈Θ} max_{ω∈Δ_n} Σ_{i=1}^{n} ω_i ℓ(f_θ(x_i), y_i) + λΩ(θ)
  Minimizing the worst case gives robustness.

²J. Wen, C. Yu, and R. Greiner. “Robust Learning under Uncertain Test Distributions: Relating Covariate Shift to Model Misspecification”. In: ICML. 2014.
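The inner maximization has a simple consequence: over the simplex Δ_n, the worst-case weighted loss is just the largest per-example loss, since the maximum of a linear function over the simplex is attained at a vertex. A quick check with illustrative loss values:

```python
import numpy as np

losses = np.array([0.2, 1.5, 0.7])   # per-example losses l(f_theta(x_i), y_i)

# max over w in the simplex of w @ losses is attained at a vertex:
worst_case = losses.max()

# sanity check against random feasible weightings
rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(3), size=1000)
assert (samples @ losses <= worst_case + 1e-12).all()
```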


SLIDE 17

Problem with hard projection

The structured SVM:

  min_{ω∈R^d} λΩ(ω) + (1/n) Σ_{i=1}^{n} max_{y∈Y_i} (L_i(y) − ⟨ω, φ_i(y)⟩)

where the sum is the structured empirical loss.

Regularization: penalized → constrained, giving a bilinear saddle point problem:

  min_{Ω(ω)≤β} max_{α∈Δ(|Y|)} b⊤α − ω⊤Mα

Difficult to project when:
◮ Structured sparsity norm (e.g. group lasso norm).
◮ The output space Y is structured: exponential size.


SLIDE 19

Standard approaches in the literature

◮ Projected gradient algorithm:
  x(t+1) = P_X(x(t) − η∇xL(x(t), y(t)))
  y(t+1) = P_Y(y(t) + η∇yL(x(t), y(t)))
◮ Projected extragradient:³
  x̄(t+1) = P_X(x(t) − η∇xL(x(t), y(t)))
  ȳ(t+1) = P_Y(y(t) + η∇yL(x(t), y(t)))
  x(t+1) = P_X(x(t) − η∇xL(x̄(t+1), ȳ(t+1)))
  y(t+1) = P_Y(y(t) + η∇yL(x̄(t+1), ȳ(t+1)))
  Intuition: a lookahead move: look at what your opponent would do before deciding your own move. This prevents oscillations for non-strongly-convex objectives.

³G. M. Korpelevich. “The extragradient method for finding saddle points and other problems”. In: Matecon (1976).
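Both updates can be sketched on the bilinear toy problem L(x, y) = (x − 1/2)(y − 1/2) over [0, 1]², whose saddle point is (1/2, 1/2); the problem and step size are illustrative choices. Plain projected gradient descent-ascent circles around the saddle, while the lookahead step makes extragradient converge:

```python
import numpy as np

def grad(x, y):
    # L(x, y) = (x - 0.5) * (y - 0.5)
    return y - 0.5, x - 0.5          # (dL/dx, dL/dy)

P = lambda v: min(1.0, max(0.0, v))  # projection onto [0, 1]
eta = 0.5

# plain projected gradient descent-ascent: rotates away from the saddle
x, y = 0.9, 0.2
for _ in range(500):
    gx, gy = grad(x, y)
    x, y = P(x - eta * gx), P(y + eta * gy)
gda_dist = np.hypot(x - 0.5, y - 0.5)

# projected extragradient: a lookahead step, then the real step
x, y = 0.9, 0.2
for _ in range(500):
    gx, gy = grad(x, y)
    xb, yb = P(x - eta * gx), P(y + eta * gy)    # lookahead point
    gx, gy = grad(xb, yb)
    x, y = P(x - eta * gx), P(y + eta * gy)      # step with lookahead gradient
eg_dist = np.hypot(x - 0.5, y - 0.5)

assert eg_dist < 1e-6 < gda_dist   # EG converges, plain GDA keeps oscillating
```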


SLIDE 22

Standard approaches in the literature

◮ The gradient method works for non-smooth optimization, but only the averaged iterates converge:
  (1/T) Σ_{t=1}^{T} (x(t), y(t)) → (x∗, y∗) as T → ∞
◮ The extragradient method works for smooth optimization, and the iterates themselves converge:
  (x(t), y(t)) → (x∗, y∗)
  Even when projections are expensive, one can use an LMO to compute approximate projections.⁴

⁴N. He and Z. Harchaoui. “Semi-proximal Mirror-Prox for Nonsmooth Composite Minimization”. In: NIPS. 2015.


SLIDE 26

The FW algorithm

Algorithm: Frank-Wolfe algorithm
1: Let x(0) ∈ X
2: for t = 0 … T do
3:   Compute r(t) = ∇f(x(t))
4:   Compute s(t) ∈ argmin_{s∈X} ⟨s, r(t)⟩
5:   Compute g_t := ⟨x(t) − s(t), r(t)⟩
6:   if g_t ≤ ε then return x(t)
7:   Let γ = 2/(2+t) (or do line-search)
8:   Update x(t+1) := (1−γ)x(t) + γ s(t)
9: end for

[Figure: the linearization f(α) + ⟨s − α, ∇f(α)⟩ of f over the feasible set M, minimized at the FW corner s.]
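The loop above can be sketched in a few lines. The quadratic objective, its data, and the simplex feasible set are illustrative choices (not from the talk); the LMO over the simplex simply returns the vertex with the smallest gradient coordinate:

```python
import numpy as np

b = np.array([0.8, 0.6, 0.1])
f = lambda x: 0.5 * np.sum((x - b) ** 2)    # smooth convex objective
grad = lambda x: x - b

d = b.size
x = np.ones(d) / d                          # x(0) in the simplex
for t in range(2000):
    r = grad(x)                             # step 3
    s = np.zeros(d); s[np.argmin(r)] = 1.0  # step 4: LMO over the simplex
    gap = (x - s) @ r                       # step 5: FW duality gap
    if gap <= 1e-6:                         # step 6
        break
    gamma = 2.0 / (2.0 + t)                 # step 7: universal step size
    x = (1 - gamma) * x + gamma * s         # step 8

# the optimum is the Euclidean projection of b onto the simplex, (0.6, 0.4, 0),
# with f* = 0.045; the iterate approaches it at the usual O(1/t) rate.
```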


SLIDE 35

SP-FW

Algorithm: Saddle point FW algorithm
1: for t = 0 … T do
2:   Compute r(t) := (∇xL(x(t), y(t)), −∇yL(x(t), y(t)))
3:   Compute s(t) ∈ argmin_{z∈X×Y} ⟨z, r(t)⟩
4:   Compute g_t := ⟨z(t) − s(t), r(t)⟩
5:   if g_t ≤ ε then return z(t)
6:   Let γ = min{1, (ν/C) g_t} or γ = 2/(2+t)
7:   Update z(t+1) := (1 − γ)z(t) + γ s(t)
8: end for

◮ Originally proposed by Hammond⁵ with γ_t = 1/(t + 1).
◮ One can define a FW extension with away steps.
◮ This is crucial for our linear convergence results.
◮ γ_t = 1/(1+t) ⇒ z(t) = (1/t) Σ_{i=0}^{t−1} s(i).
◮ (γ_t = 1/(1+t)) + bilinear objective ↔ fictitious play algorithm.⁵

⁵J. Hammond. “Solving asymmetric variational inequality problems and systems of equations with generalized nonlinear programming algorithms”. PhD thesis. MIT, 1984.
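The bilinear case of the last bullet can be sketched directly: with γ_t = 1/(t+1) and L(x, y) = x⊤My over two simplices, SP-FW reduces to fictitious play, each player best-responding to the opponent's running average. Rock-paper-scissors is an illustrative choice of game:

```python
import numpy as np

M = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])  # rock-paper-scissors
d = 3
x = np.zeros(d); x[0] = 1.0   # z(0): both players start at a vertex
y = np.zeros(d); y[0] = 1.0

for t in range(50000):
    rx = M @ y                      # grad_x L for L(x, y) = x^T M y
    ry = -(M.T @ x)                 # -grad_y L
    sx = np.eye(d)[np.argmin(rx)]   # the LMO decomposes over X x Y:
    sy = np.eye(d)[np.argmin(ry)]   # a best pure response for each player
    gamma = 1.0 / (1.0 + t)         # Hammond's step: z(t) averages the s(i)
    x = (1 - gamma) * x + gamma * sx
    y = (1 - gamma) * y + gamma * sy

# exploitability (the primal + dual gap h_t) of the averaged strategies;
# it shrinks toward 0, the game value at the uniform equilibrium.
h = (M.T @ x).max() - (M @ y).min()
```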


SLIDE 42

Advantages of SP-FW

Same main property as FW: only an LMO (linear minimization oracle) is needed.

Same other advantages as FW:
◮ Convergence certificate g_t for free.
◮ Affine invariance of the algorithm.
◮ Sparsity of the iterates.
◮ Universal step size γ_t := 2/(2+t); adaptive step size γ_t := (ν/C) g_t.

Main difference with FW:
◮ No line-search.

When the constraint set is a “complicated” structured polytope, projection is difficult whereas the LMO is tractable.


SLIDE 44

Hypothesis

Similar hypotheses as for AFW:
◮ L is L-smooth and µ-strongly convex-concave.
◮ X and Y are polytopes.
◮ Additional assumption on the bilinear coupling:
  L(x, y) = f(x) + x⊤My − h(y)
  Roughly, M must be smaller than the strong convexity constant:
  ν := 1/2 − √2 (M/µ)(D/δ) > 0,
  where D := max{diam(X), diam(Y)} and δ := min{PWidth(X), PWidth(Y)}.


SLIDE 47

Theoretical contribution

SP extension of FW with away steps⁷:

Linear rate with the adaptive step size γ_t := (ν/(LD²)) g_t:
  min_{s≤t} g_s ≤ O(1) (1 − ν² (δ²/D²) (µ/2L))^{k(t)}

Sublinear rate with the universal step size γ_t := 2/(2+k(t)):
  min_{s≤t} g_s ≤ O(1/t)

◮ k(t): number of non-drop steps, k(t) ≥ t/3.
◮ The proof uses recent advances on AFW → growth condition.
◮ Partially answers a 30-year-old conjecture⁸: convergence for strongly monotone objectives with step size 1/(t+1) over a polytope.

⁷Gauthier Gidel, Tony Jebara, and Simon Lacoste-Julien. “Frank-Wolfe Algorithms for Saddle Point Problems”. In: AISTATS. 2017.
⁸J. Hammond. “Solving asymmetric variational inequality problems and systems of equations with generalized nonlinear programming algorithms”. PhD thesis. MIT, 1984.

SLIDE 48

Growth condition: pairwise Frank-Wolfe gap

◮ s_t := argmin_{s∈X} ⟨∇f(x(t)), s⟩ (the FW corner).
◮ v_t := argmax_{v∈S(t)} ⟨∇f(x(t)), v⟩ (the away corner, over the active set S(t)).

  g_t^{PW} := ⟨∇f(x(t)), v_t − s_t⟩


SLIDE 51

Growth condition

Key quantity, independent of any algorithm⁹:

◮ If X is a polytope and f is strongly convex,
  f(x(t)) − f∗ ≤ (g_t^{PW})² / (2µ^{FW}).
◮ In the unconstrained case, this is the analog of:
  f(x(t)) − f∗ ≤ ‖∇f(x(t))‖² / (2µ)
◮ This growth condition can be extended to SP.

⁹Simon Lacoste-Julien and Martin Jaggi. “On the global linear convergence of Frank-Wolfe optimization variants”. In: NIPS. 2015.
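The unconstrained analog can be sanity-checked numerically on a strongly convex quadratic; the particular function below is an illustrative choice:

```python
import numpy as np

# f(x) = 1/2 x^T A x with A positive definite; mu = lambda_min(A)
A = np.diag([1.0, 4.0])
mu = 1.0
f = lambda x: 0.5 * x @ A @ x          # minimum f* = 0 at x = 0
grad = lambda x: A @ x

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.normal(size=2)
    # the growth (Polyak-Lojasiewicz-type) bound: f(x) - f* <= ||grad f(x)||^2 / (2 mu)
    assert f(x) - 0.0 <= (grad(x) @ grad(x)) / (2 * mu) + 1e-12
```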


SLIDE 53

Difficulties for saddle point problems

Usual descent lemma:
  h_{t+1} ≤ h_t − γ_t g_t + γ_t² L d(t)²/2, with g_t ≥ 0.
With γ_t small enough the sequence decreases.

For saddle point problems the Lipschitz gradient property only gives
  L_{t+1} − L∗ ≤ L_t − L∗ − γ_t (g_t^(x) − g_t^(y)) + γ_t² L d(t)²/2,
where g_t^(x) − g_t^(y) has arbitrary sign.

◮ We cannot control the oscillation of the sequence.
◮ We must introduce other quantities to establish convergence.


SLIDE 56

Difficulties for saddle point problems

Standard merit function: primal + dual gaps
  h_t := max_{y∈Y} L(x(t), y) − min_{x∈X} L(x, y(t)) ≥ 0.

Problem: ŷ(t) := argmax_{y∈Y} L(x(t), y) depends on t. Instead, consider
  w_t := (L(x(t), y∗) − L∗) + (L∗ − L(x∗, y(t))) =: w_t^(x) + w_t^(y).
We have 0 ≤ w_t ≤ h_t ≤ g_t. In general, w_t can be zero even if we have not reached a solution, but for a strongly convex-concave function¹⁰:
  h_t ≤ C √w_t for some constant C.

¹⁰Gauthier Gidel, Tony Jebara, and Simon Lacoste-Julien. “Frank-Wolfe Algorithms for Saddle Point Problems”. In: AISTATS. 2017.
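The chain 0 ≤ w_t ≤ h_t ≤ g_t can be checked numerically on the toy bilinear-quadratic L used in the experiments. Here µ = 1 and M = 0.2·I are illustrative values chosen small enough that the inner max and min stay in the interior of the box, so h_t is available in closed form:

```python
import numpy as np

# L(x, y) = mu/2 ||x - x*||^2 + (x - x*)^T M (y - y*) - mu/2 ||y - y*||^2
# on the box [0,1]^2 x [0,1]^2, with interior saddle point and L* = 0.
mu = 1.0
M = 0.2 * np.eye(2)
xs = ys = np.array([0.5, 0.5])

def L(x, y):
    u, v = x - xs, y - ys
    return 0.5 * mu * u @ u + u @ M @ v - 0.5 * mu * v @ v

rng = np.random.default_rng(1)
for _ in range(200):
    x, y = rng.uniform(0, 1, 2), rng.uniform(0, 1, 2)
    u, v = x - xs, y - ys
    w = L(x, ys) - L(xs, y)                  # w_t = w_t^(x) + w_t^(y)
    # the inner max/min have interior solutions here, giving h_t in closed form
    h = L(x, ys + M.T @ u / mu) - L(xs - M @ v / mu, y)
    r = np.concatenate([mu * u + M @ v, mu * v - M.T @ u])  # (grad_x L, -grad_y L)
    z = np.concatenate([x, y])
    s = (r < 0).astype(float)                # LMO over the box [0,1]^4
    g = (z - s) @ r                          # FW gap g_t
    assert -1e-12 <= w <= h + 1e-12 <= g + 2e-12
```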

SLIDE 57

Toy experiments

◮ SP-AFW with the theoretical step size γ_t = (ν/C) g_t.

Setting:
  L(x, y) := (µ/2)‖x − x∗‖² + (x − x∗)⊤M(y − y∗) − (µ/2)‖y − y∗‖²
  • X = Y := [0, 1]^d
  • d = 30
  • C := 2LD²
  • L = µ

SLIDE 58

Toy experiments

◮ SP-AFW vs. extragradient (EG) with approximate projection [He & Harchaoui, NIPS 2015], both with the theoretical step size γ_t = (ν/C) g_t.

Same setting (L, X, Y, d, C) as the previous slide.

SLIDE 59

Toy experiments

◮ SP-AFW with a heuristic step size (used when ν < 0):
  γ_t = g_t / (C + 2M²D²/µ)
  Recall the theoretical step size γ_t = (ν/C) g_t.

Same setting (L, X, Y, d, C) as the previous slides.



SLIDE 68

Conclusion

◮ SP-FW is one of the first SP solvers that works only with an LMO.
◮ The FW resurgence has led to new structured problems.
◮ Same hope for SP-FW as for FW: call for applications!
◮ With a bilinear objective this algorithm is highly related to the fictitious play algorithm.
◮ Rich interplay tapping into the game theory literature.
◮ Still many open theoretical questions:
  • Karlin’s conjecture.¹¹
  • Convergence without the assumption on the bilinear coupling.

¹¹Samuel Karlin. Mathematical methods and theory in games, programming and economics. 1960.

SLIDE 69

Thank you!

Slides available on www.di.ens.fr/~gidel.

SLIDE 70

Problems with difficult projection

University game:
1. A game between two universities (A and B).
2. Each admits d students and has to assign pairs of students to dorms.
3. The game has a payoff matrix M belonging to R^{(d(d−1)/2)²}.
4. M_{ij,kl} is the expected tuition that B gets (or A gives up) if A pairs student i with j and B pairs student k with l.
5. Here both action sets are the marginal polytope of all perfect unipartite matchings. It is hard to project onto this polytope, whereas the LMO can be solved efficiently with the blossom algorithm.¹²

¹²J. Edmonds. “Paths, trees and flowers”. In: Canadian Journal of Mathematics (1965).


SLIDE 73

Experiments

Figure: SP-FW on the University game.

◮ Sublinear convergence rate, faster than expected (≈ O(t^{−2}) observed).
◮ Best theoretical rate proved: O(t^{−1/d}).
◮ Scales well with dimension.