

SLIDE 1

Block-Coordinate Frank-Wolfe Optimization
with applications to structured prediction

Martin Jaggi, CMAP, École Polytechnique, Paris
Optimization and Big Data Workshop, Edinburgh, 2 May 2013

Co-Authors: Simon Lacoste-Julien, Mark Schmidt and Patrick Pletscher

SLIDE 2

Outline

  • Two Old First-Order Optimization Algorithms
      • Coordinate Descent
      • The Frank-Wolfe Algorithm
  • Duality for Constrained Convex Optimization
  • Combining Frank-Wolfe and Coordinate Descent
  • Applications: Large Margin Prediction
      • binary SVMs
      • structural SVMs

SLIDE 3

Coordinate Descent

SLIDE 4

Coordinate Descent (on R^d)

Selection of the next coordinate:

  • the coordinate of steepest descent
  • cyclic order (hard to analyze!)
  • random sampling

(figure: the graph of f(x) with a one-dimensional coordinate slice through the current iterate x)
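To make the random-sampling rule concrete, here is a minimal coordinate-descent sketch in Python; the function names, the fixed step size, and the quadratic example are illustrative assumptions, not from the slides.

```python
import numpy as np

def coordinate_descent(grad_i, x0, step=0.1, iters=2000, seed=0):
    """Minimize a smooth f over R^d by single-coordinate updates.

    grad_i(x, i) is assumed to return the i-th partial derivative of f
    at x. A fixed step size is used for simplicity; per-coordinate
    Lipschitz step sizes are the usual refinement.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        i = rng.integers(len(x))       # random sampling rule from the slide
        x[i] -= step * grad_i(x, i)    # move along coordinate i only
    return x

# Example: f(x) = 0.5 * ||x - c||^2, so the i-th partial is x_i - c_i.
c = np.array([1.0, -2.0, 3.0])
print(coordinate_descent(lambda x, i: x[i] - c[i], np.zeros(3)))  # ≈ c
```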

SLIDE 5

The Frank-Wolfe Algorithm
Frank and Wolfe (1956)

Setting: a compact convex domain D ⊂ R^d.

SLIDES 6–9

min_{x ∈ D} f(x),   D ⊂ R^d

(figures: four animation frames of the same illustration, the graph of f over the domain D with the current iterate x)

SLIDE 10

The Linearized Problem

min_{s′ ∈ D}  f(x) + ⟨s′ − x, ∇f(x)⟩

(figure: f(x), its linearization at x, and the linear minimizer s over D ⊂ R^d)

Algorithm 1: Frank-Wolfe
  for k = 0 … K do
    Compute s := argmin_{s′ ∈ D} ⟨s′, ∇f(x^(k))⟩
    Let γ := 2/(k+2)
    Update x^(k+1) := (1 − γ) x^(k) + γ s
  end for
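Algorithm 1 is easy to instantiate once the linear subproblem is cheap. Below is a minimal sketch on the probability simplex, where the linearized problem is solved by picking the vertex with the smallest gradient entry; the function names and the quadratic test objective are illustrative assumptions.

```python
import numpy as np

def frank_wolfe_simplex(grad, d, K=2000):
    """Frank-Wolfe (Algorithm 1) on the probability simplex Δ_d.

    argmin_{s ∈ Δ_d} <s, ∇f(x)> is attained at the vertex e_i with
    i = argmin_i ∇f(x)_i, so every iterate stays feasible as a convex
    combination of vertices and no projection is ever needed.
    """
    x = np.full(d, 1.0 / d)                # feasible starting point
    for k in range(K):
        g = grad(x)
        s = np.zeros(d)
        s[np.argmin(g)] = 1.0              # vertex minimizing the linearization
        gamma = 2.0 / (k + 2.0)            # step size from Algorithm 1
        x = (1.0 - gamma) * x + gamma * s  # convex update
    return x

# Example: minimize ||x - c||^2 over the simplex (c already lies in it).
c = np.array([0.1, 0.7, 0.2])
print(frank_wolfe_simplex(lambda x: 2.0 * (x - c), d=3))  # ≈ c
```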

SLIDE 11

The Linearized Problem

min_{s′ ∈ D}  f(x) + ⟨s′ − x, ∇f(x)⟩,   D ⊂ R^d

(figure: f(x), the iterate x, and the gradient ∇f(x))

Cost per step:
  • Frank-Wolfe: (approximately) solve the linearized problem on D; gives sparse solutions (in terms of used vertices)
  • Gradient Descent: projection back to D

SLIDE 12

Some Examples of Atomic Domains Suitable for Frank-Wolfe
[Dudík et al. 2011, Tewari et al. 2011, J. 2011, J. 2013]

Complexity of one Frank-Wolfe iteration:

| X       | Atoms A                          | D = conv(A)        | sup_{s∈D} ⟨s, y⟩      | Complexity                 |
| R^n     | Sparse vectors                   | ‖·‖₁-ball          | ‖y‖_∞                 | O(n)                       |
| R^n     | Sign vectors                     | ‖·‖_∞-ball         | ‖y‖₁                  | O(n)                       |
| R^n     | ℓ_p-sphere                       | ‖·‖_p-ball         | ‖y‖_q                 | O(n)                       |
| R^n     | Sparse non-neg. vectors          | Simplex Δ_n        | max_i {y_i}           | O(n)                       |
| R^n     | Latent group sparse vectors      | ‖·‖_G-ball         | max_{g∈G} ‖y_(g)‖*_g  | O(∑_{g∈G} |g|)             |
| R^{m×n} | Matrix trace norm                | ‖·‖_tr-ball        | ‖y‖_op = σ₁(y)        | Õ(N_f /√ε′) (Lanczos)      |
| R^{m×n} | Matrix operator norm             | ‖·‖_op-ball        | ‖y‖_tr = ‖σ(y)‖₁      | SVD                        |
| R^{m×n} | Schatten matrix norms            | ‖σ(·)‖_p-ball      | ‖σ(y)‖_q              | SVD                        |
| R^{m×n} | Matrix max-norm                  | ‖·‖_max-ball       |                       | Õ(N_f (n+m)^1.5 / ε′^2.5)  |
| R^{n×n} | Permutation matrices             | Birkhoff polytope  |                       | O(n³)                      |
| R^{n×n} | Rotation matrices                |                    |                       | SVD (Procrustes prob.)     |
| S^{n×n} | Rank-1 PSD matrices, unit trace  | {x ⪰ 0, Tr(x)=1}   | λ_max(y)              | Õ(N_f /√ε′) (Lanczos)      |
| S^{n×n} | PSD matrices, bounded diagonal   | {x ⪰ 0, x_ii ≤ 1}  |                       | Õ(N_f n^1.5 / ε′^2.5)      |

Table 1: Some examples of atomic domains suitable for optimization using the Frank-Wolfe algorithm. Here SVD refers to the complexity of computing a singular value decomposition, which is O(min{mn², m²n}). N_f is the number of non-zero entries in the gradient of the objective function f, and ε′ = 2δC_f/(k+2) is the required accuracy for the linear subproblems. For any p ∈ [1, ∞], the conjugate value q is meant to satisfy 1/p + 1/q = 1, allowing q = ∞ for p = 1 and vice versa.
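The per-iteration costs in the table come from how cheap each domain's linear minimization oracle is. The sketch below spells out two rows under stated assumptions: the ‖·‖₁-ball oracle is a single pass over the gradient (an O(n) row), and the trace-norm-ball oracle needs only the top singular pair (a full SVD is used here for brevity where Lanczos would suffice).

```python
import numpy as np

def lmo_l1_ball(y, r=1.0):
    """argmin_{||s||_1 <= r} <s, y>: one signed vertex of the scaled
    cross-polytope, found in O(n) via the largest |y_i|."""
    i = np.argmax(np.abs(y))
    s = np.zeros_like(y)
    s[i] = -r * np.sign(y[i])
    return s

def lmo_trace_norm_ball(Y, r=1.0):
    """argmin_{||S||_tr <= r} <S, Y>: the rank-1 matrix -r * u1 v1^T built
    from the top singular pair of Y (optimal value -r * σ1(Y))."""
    U, _, Vt = np.linalg.svd(Y)
    return -r * np.outer(U[:, 0], Vt[0, :])
```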

SLIDE 13

The Linearized Problem

min_{s′ ∈ D}  f(x) + ⟨s′ − x, ∇f(x)⟩,   D ⊂ R^d

  • Primal convergence: the algorithm obtains f(x^(k)) − f(x*) ≤ O(1/k) after k steps. [Frank & Wolfe 1956]
  • Primal-dual convergence: the algorithm obtains gap(x^(k)) ≤ O(1/k) after k steps. [Clarkson 2008, J. 2013]

SLIDE 14

A Simple Optimization Duality

Original problem:  min_{x ∈ D} f(x),   D ⊂ R^d

The dual value:  ω(x) := min_{s′ ∈ D} f(x) + ⟨s′ − x, ∇f(x)⟩

Duality gap:  gap(x) := f(x) − ω(x)

Weak duality:  ω(x) ≤ f(x*) ≤ f(x′)  for any x, x′ ∈ D

(figure: f over D with its linearization at x; gap(x) is the vertical distance between f(x) and ω(x))
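The same linearization yields a certificate that is computable at every iterate. A minimal sketch, assuming a callable `lmo` that returns argmin_{s ∈ D} ⟨s, g⟩ (as in the table of atomic domains above):

```python
import numpy as np

def fw_duality_gap(x, g, lmo):
    """gap(x) = f(x) - ω(x) = <x - s, ∇f(x)>, where s solves the
    linearized problem. By weak duality this upper-bounds f(x) - f(x*),
    so it serves directly as a stopping criterion."""
    s = lmo(g)                       # one call to the linear oracle
    return float(np.dot(x - s, g))   # always >= 0 at a feasible x
```

Plugging in `lmo_l1_ball` from above gives a ready stopping criterion for ℓ₁-constrained problems.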

SLIDE 15

Block-Separable Optimization Problems

min_{x ∈ D^(1) × ··· × D^(n)} f(x),   x = (x_(1), …, x_(n)),   x_(i) ∈ D^(i) ⊂ R^{d_i}

(figure: the domain as a product D^(1) × ··· × D^(n) ⊂ R^{d_1} × ··· × R^{d_n}, with f seen block-wise in the coordinates x_(1), …, x_(n))

SLIDE 16

Algorithm 2: Uniform Coordinate Descent
  Let x^(0) ∈ D
  for k = 0 … K do
    Pick i uniformly at random in [n]
    Compute s_(i) := argmin_{s_(i) ∈ D^(i)} ⟨s_(i), ∇_(i) f(x^(k))⟩ + (L_i/2) ‖s_(i) − x_(i)‖²
    Update x^(k+1)_(i) := x^(k)_(i) + (s_(i) − x^(k)_(i))
  end for

Theorem: the algorithm obtains accuracy O(2n/(k+2n)) after k steps.
[Nesterov (2012); Richtárik, Takáč (2012), "huge-scale" coordinate descent]

Algorithm 3: Block-Coordinate "Frank-Wolfe"
  Let x^(0) ∈ D
  for k = 0 … K do
    Pick i uniformly at random in [n]
    Compute s_(i) := argmin_{s_(i) ∈ D^(i)} ⟨s_(i), ∇_(i) f(x^(k))⟩
    Let γ := 2n/(k+2n), or optimize γ by line-search
    Update x^(k+1)_(i) := x^(k)_(i) + γ (s_(i) − x^(k)_(i))
  end for

Hidden constant: curvature ≤ ∑_i L_f diam²(D^(i))
(also in the duality gap, and with inexact subproblems)

(figures: block-wise view of f over the product domain, with one block x_(i) highlighted)
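Algorithm 3 only touches one block per iteration. Here is a minimal sketch on a product of simplices, matching the block-separable setting of the previous slide; `grad_block` and the block dimensions are illustrative assumptions.

```python
import numpy as np

def bcfw_product_of_simplices(grad_block, dims, K=20000, seed=0):
    """Block-Coordinate Frank-Wolfe (Algorithm 3) on D = Δ_{d_1} × ... × Δ_{d_n}.

    grad_block(x, i) is assumed to return the partial gradient ∇_(i) f(x).
    Each step solves the linear subproblem on a single block's simplex
    and leaves all other blocks untouched.
    """
    rng = np.random.default_rng(seed)
    n = len(dims)
    x = [np.full(d, 1.0 / d) for d in dims]   # feasible start in every block
    for k in range(K):
        i = rng.integers(n)                   # pick a block uniformly at random
        g = grad_block(x, i)
        s = np.zeros(dims[i])
        s[np.argmin(g)] = 1.0                 # block-wise linear minimizer
        gamma = 2.0 * n / (k + 2.0 * n)       # step size from Algorithm 3
        x[i] = (1.0 - gamma) * x[i] + gamma * s
    return x
```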

SLIDE 17

Applications: Large Margin Prediction

  • Binary Support Vector Machine (no bias)
  • also: Ranking SVM

Margin constraints:  y_i ⟨w, φ(x_i)⟩ ≥ 1 − ξ_i

primal problem:

min_w  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n max{0, 1 − y_i ⟨w, φ(x_i)⟩}

SLIDE 18

Binary SVM

Margin constraints:  y_i ⟨w, φ(x_i)⟩ ≥ 1 − ξ_i

primal (d-dim; unconstrained; non-smooth, strongly convex):

min_{w ∈ R^d}  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n max{0, 1 − y_i ⟨w, φ(x_i)⟩}
    (the term y_i φ(x_i) gives the i-th column of A)

dual (n-dim; box-constrained; smooth, not strongly convex):

min_{α ∈ R^n}  f(α) := (λ/2) ‖Aα‖² − b^T α   s.t.  0 ≤ α_i ≤ 1  ∀i ∈ [n]
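On this dual, one training example per block means a single coordinate, and the block LMO over the interval [0, 1] reduces to a sign test on one partial derivative. A minimal sketch of one such step, using the slide's f(α) = (λ/2)‖Aα‖² − bᵀα and the correspondence w = Aα from the slides that follow; the function name and argument layout are illustrative.

```python
import numpy as np

def bcfw_svm_dual_step(alpha, A, b, lam, i, k):
    """One block-coordinate Frank-Wolfe step on the binary SVM dual
    over the box 0 <= alpha_i <= 1, with one coordinate per block."""
    n = len(alpha)
    w = A @ alpha                           # primal correspondence w = A alpha
    grad_i = lam * (A[:, i] @ w) - b[i]     # i-th partial of (λ/2)||Aα||² − bᵀα
    s_i = 0.0 if grad_i > 0 else 1.0        # LMO over the interval [0, 1]
    gamma = 2.0 * n / (k + 2.0 * n)         # block step size
    alpha[i] += gamma * (s_i - alpha[i])    # update coordinate i only
    return alpha
```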

SLIDE 19

Structural SVM

φ : X × Y → R^d   "joint" feature map

large margin "separation":  ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ ≥ L(y, y_i) − ξ_i  ∀y

primal problem:

min_{w ∈ R^d}  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n max_{y ∈ Y} { L(y_i, y) − ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ }
    (the term φ(x_i, y_i) − φ(x_i, y) gives the (i, y)-th column of A)

(figure: feature maps φ(·, y) of the same input for several candidate labels y)

SLIDE 20

Structural SVM

φ : X × Y → R^d   "joint" feature map

large margin "separation":  ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ ≥ L(y, y_i) − ξ_i  ∀y

primal problem:

min_{w ∈ R^d}  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n max_{y ∈ Y} { L(y_i, y) − ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ }
    (the term φ(x_i, y_i) − φ(x_i, y) gives the (i, y)-th column of A)

The inner maximization over y ∈ Y is the decoding oracle. The label space is exponentially large: for sequence labeling of an image of the word "unexpected", candidate outputs y include "unexpected", "nuexpcted", "aaaaaaa", "uxtecpsss"; for "donaudampfschifffahrtsgesellschaftskapitän" (42 letters), |Y| = 26^42.
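For the structural SVM, the block LMO is exactly one call to this loss-augmented decoding oracle, which is what makes block-coordinate Frank-Wolfe practical here. A minimal sketch of one step in the primal representation maintained by the BCFW paper (keeping w = Σ_i w_i); all names, and `decode` in particular, are illustrative assumptions:

```python
def bcfw_struct_svm_step(w, w_i, i, k, n, lam, decode, phi, x_i, y_i):
    """One block-coordinate Frank-Wolfe step for the structural SVM.

    decode(w, x_i, y_i) is assumed to solve the loss-augmented decoding
    problem argmax_{y in Y} L(y_i, y) - <w, phi(x_i, y_i) - phi(x_i, y)>,
    i.e. the inner maximization of the primal objective above.
    """
    y_hat = decode(w, x_i, y_i)                    # one oracle call
    w_s = (phi(x_i, y_i) - phi(x_i, y_hat)) / (lam * n)
    gamma = 2.0 * n / (k + 2.0 * n)                # or optimize by line-search
    w_i_new = (1.0 - gamma) * w_i + gamma * w_s    # update block i's contribution
    w_new = w - w_i + w_i_new                      # keep w = sum_i w_i consistent
    return w_new, w_i_new
```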

SLIDE 21

Binary SVM vs. Structural SVM

Binary SVM primal:

min_{w ∈ R^d}  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n max{0, 1 − y_i ⟨w, φ(x_i)⟩}
    (y_i φ(x_i): i-th column of A)

Binary SVM dual:

min_{α ∈ R^n}  f(α) := (λ/2) ‖Aα‖² − b^T α   s.t.  0 ≤ α_i ≤ 1  ∀i ∈ [n]

Structural SVM primal:

min_{w ∈ R^d}  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n max_{y ∈ Y} { L(y_i, y) − ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ }
    (φ(x_i, y_i) − φ(x_i, y): (i, y)-th column of A)

Structural SVM dual:

min_{α ∈ R^{n·|Y|}}  f(α) := (λ/2) ‖Aα‖² − b^T α
    s.t.  ∑_{y ∈ Y} α_i(y) = 1  ∀i ∈ [n]   and   α_i(y) ≥ 0  ∀i ∈ [n], ∀y ∈ Y

primal-dual correspondence:  w = Aα

SLIDE 22

Optimization Algorithms: Binary SVM

primal (d-dim; unconstrained; non-smooth, strongly convex):
min_{w ∈ R^d}  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n max{0, 1 − y_i ⟨w, φ(x_i)⟩}
    (y_i φ(x_i): i-th column of A)

dual (n-dim; box-constrained; smooth, not strongly convex):
min_{α ∈ R^n}  f(α) := (λ/2) ‖Aα‖² − b^T α   s.t.  0 ≤ α_i ≤ 1  ∀i ∈ [n]

batch (cost n per iteration):
  • primal: subgradient descent
  • dual: Frank-Wolfe = cutting plane (SVM-light)

online (cost 1 per iteration) — rate O(R²/(λε)):
  • primal: stochastic subgradient (SGD, Pegasos)
  • dual: coordinate descent (Hsieh 2008) = block-coordinate descent = block-coordinate Frank-Wolfe

SLIDE 23

Optimization Algorithms: Structural SVM

primal (d-dim; unconstrained; non-smooth, strongly convex):
min_{w ∈ R^d}  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n max_{y ∈ Y} { L(y_i, y) − ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ }
    (φ(x_i, y_i) − φ(x_i, y): (i, y)-th column of A)

dual (n·|Y|-dim; block-constrained; smooth, not strongly convex):
min_{α ∈ R^{n·|Y|}}  f(α) := (λ/2) ‖Aα‖² − b^T α
    s.t.  ∑_{y ∈ Y} α_i(y) = 1  ∀i ∈ [n]   and   α_i(y) ≥ 0  ∀i ∈ [n], ∀y ∈ Y

batch (cost n per iteration):
  • primal: subgradient descent
  • dual: Frank-Wolfe = cutting plane (SVM-struct)

online (cost 1 per iteration) — rate O(R²/(λε)):
  • primal: stochastic subgradient (SGD, Pegasos)
  • dual: block coordinate descent (Nesterov); block-coordinate Frank-Wolfe

SLIDE 24

Experimental Results

(figures: primal suboptimality for problem (1) vs. effective passes, comparing BCFW, SSG, online-EG, FW, and cutting plane)
  (a) OCR dataset, λ = 0.01
  (b) OCR dataset, λ = 0.001
  (c) OCR dataset, λ = 1/n
  (d) CoNLL dataset, λ = 1/n
  (e) Test error for λ = 1/n on CoNLL
  (f) Matching dataset, λ = 0.001

| dataset  | task                  | n    | d       |
| OCR      | sequence labeling     | 6251 | 4028    |
| CoNLL    | POS sequence labeling | 8936 | 1643026 |
| Matching | word alignment        | 5000 | 82      |

SLIDE 25

Thanks!

Co-Authors: Simon Lacoste-Julien, Mark Schmidt and Patrick Pletscher

References:
  • Lacoste-Julien, S.*, Jaggi, M.*, Schmidt, M., & Pletscher, P. Block-Coordinate Frank-Wolfe Optimization for Structural SVMs. ICML 2013.
  • Jaggi, M. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. ICML 2013.

SLIDE 26

Related Work

Table 1. Convergence rates given in the number of calls to the oracles for different optimization algorithms for the structural SVM objective (1) in the case of a Markov random field structure, to reach a specific accuracy ε measured for different types of gaps, in terms of the number of training examples n, regularization parameter λ, size of the label space |Y|, and maximum feature norm R := max_{i,y} ‖ψ_i(y)‖₂ (some minor terms were ignored for succinctness). Table inspired by Zhang et al. (2011). Notice that only stochastic subgradient and our proposed algorithm have rates independent of n.

| Optimization algorithm                                | Online | Primal/Dual   | Type of guarantee    | Oracle type         | # Oracle calls               |
| dual extragradient (Taskar et al., 2006)              | no     | primal-"dual" | saddle point gap     | Bregman projection  | O(nR log|Y| / (λε))          |
| online exponentiated gradient (Collins et al., 2008)  | yes    | dual          | expected dual error  | expectation         | O((n + log|Y|) R² / (λε))    |
| excessive gap reduction (Zhang et al., 2011)          | no     | primal-dual   | duality gap          | expectation         | O(nR √(log|Y| / (λε)))       |
| BMRM (Teo et al., 2010)                               | no     | primal        | ≥ primal error       | maximization        | O(nR² / (λε))                |
| 1-slack SVM-Struct (Joachims et al., 2009)            | no     | primal-dual   | duality gap          | maximization        | O(nR² / (λε))                |
| stochastic subgradient (Shalev-Shwartz et al., 2010)  | yes    | primal        | primal error w.h.p.  | maximization        | Õ(R² / (λε))                 |
| this paper: stochastic block-coordinate Frank-Wolfe   | yes    | primal-dual   | expected duality gap | maximization        | O(R² / (λε))  (Thm. 3)       |
SLIDE 27

Experimental Results (with averaging)

| dataset  | task                  | n    | d       |
| OCR      | sequence labeling     | 6251 | 4028    |
| CoNLL    | POS sequence labeling | 8936 | 1643026 |
| Matching | word alignment        | 5000 | 82      |

(figures: primal suboptimality for problem (1) vs. effective passes, comparing BCFW, BCFW-wavg, SSG, SSG-wavg, online-EG, FW, and cutting plane)
  (a) OCR dataset, λ = 0.01
  (b) OCR dataset, λ = 0.001
  (c) OCR dataset, λ = 1/n
  (d) CoNLL dataset, λ = 1/n
  (e) Test error for λ = 1/n on CoNLL
  (f) Matching dataset, λ = 0.001