SLIDE 1

Block Conditional Gradient Algorithms

E. Pauwels, joint work with A. Beck and S. Sabach.

GdT Mathématiques de l'apprentissage, September 24, 2015

SLIDE 2

Context: large scale convex optimization

Two old ideas have received renewed attention in the past years:

Block decomposition:
$$x = \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix}$$

Linear oracles:
$$\min_{x \in X} \langle x, c \rangle$$

  • Coordinate descent: large dimension, distributed data.
  • Conditional gradient: "complex constraints", primal-dual interpretation.

Theoretical properties and empirical performances?

SLIDE 3

Scope of the presentation

Most results in the literature hold for random block selection rules. Lacoste-Julien and co-authors analyzed the random block conditional gradient method (RBCG):

◮ Block-Coordinate Frank-Wolfe Optimization for Structural SVMs (ICML 2013)

We propose a convergence analysis for the cyclic block variant (CBCG).

This presentation focuses on machine learning related aspects:
  • General introduction to linear oracle based optimization methods.
  • Specification to (regularized) empirical risk minimization (ERM).
  • Details about the application to structured SVM (Taskar et al., 2003; Tsochantaridis et al., 2005).

SLIDE 4

Outline

  • 1. Context
  • 2. Conditional Gradient algorithm
  • 3. CG and convex duality
  • 4. Block CG and L2 regularized ERM
  • 5. Results

SLIDE 5

Main idea

Optimization setting: $f : \mathbb{R}^n \to \mathbb{R}$ is convex, $C^1$ with $L$-Lipschitz gradient over $X \subset \mathbb{R}^n$, which is convex and compact.
$$\bar f := \min_{x \in X} f(x)$$

Start with $x^0 \in X$ and iterate:
$$p^k \in \operatorname{argmax}_{y \in X} \langle \nabla f(x^k), x^k - y \rangle$$
$$x^{k+1} = (1 - \alpha_k) x^k + \alpha_k p^k, \qquad 0 \le \alpha_k \le 1$$

Step size choices:
  • Open loop: $\alpha_k = \frac{2}{k+2}$
  • Exact line search: $x^{k+1} = \operatorname{argmin}_{y \in [x^k, p^k]} f(y)$
  • Approximate line search: $x^{k+1} = \operatorname{argmin}_{y \in [x^k, p^k]} Q(x^k, y)$

where $f(y) \le Q(x, y) := f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|_2^2$ (tangent quadratic upper bound, descent lemma).
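As a concrete illustration, here is a minimal sketch of the iteration in Python with the open-loop step size. The quadratic objective and the box constraint set below are illustrative assumptions; any gradient/linear-oracle pair would do.

```python
import numpy as np

def conditional_gradient(grad, linear_oracle, x0, n_iter=100):
    """Conditional gradient with the open-loop step size alpha_k = 2/(k+2).

    grad(x) returns the gradient of f at x; linear_oracle(g) returns a
    minimizer of <g, y> over X, i.e. a maximizer of <g, x - y>.
    """
    x = x0.copy()
    for k in range(n_iter):
        p = linear_oracle(grad(x))        # p^k: one linear oracle call
        alpha = 2.0 / (k + 2)             # open-loop step size
        x = (1 - alpha) * x + alpha * p   # convex combination stays in X
    return x

# Illustrative instance: f(x) = 0.5 * ||Ax - b||^2 over the cube [0, 1]^n.
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
grad = lambda x: A.T @ (A @ x - b)
oracle = lambda g: (g < 0).astype(float)  # LP over a box: pick extreme points
x_cg = conditional_gradient(grad, oracle, np.full(n, 0.5))
```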

SLIDE 6

Historical remarks

Fifty years ago:
  • First appearance for quadratic programs (Frank, Wolfe, 1956).
  • $f(x^k) - \bar f = O(1/k)$ (Polyak, Dunn, Dem'yanov, ..., 60's).
  • For any $\epsilon > 0$, the rate cannot be $O(1/k^{1+\epsilon})$ (Canon, Cullum, Polyak, 60's).

Recent developments (illustrations follow):
  • Revival for large scale problems.
  • Primal-dual interpretation (Bach 2015) and convergence analysis (Jaggi 2013).
  • Block decomposition variants (Lacoste-Julien et al. 2013).

SLIDE 7

Why is it interesting?

$O(1/k^2)$ can be achieved by using projections (Beck, Teboulle 2009), so conditional gradient does not compete in practice. In some situations, however, projection does not constitute a practical alternative, whereas linear programs on convex compact sets attain their value at extreme points.

Trace norm: for $M \in \mathbb{R}^{m \times n}$, $\|M\|_* = \sum_i \sigma_i$, where $\{\sigma_i\}$ is the set of singular values of $M$.
  • Projection on the trace norm ball is a thresholding of singular values → full SVD.
  • Linear programming on the trace norm ball is finding the largest singular value → leading singular vectors.
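To make the comparison concrete, a sketch of the trace norm linear oracle is below; it only needs the leading singular triplet (computed here with `scipy.sparse.linalg.svds`), whereas the projection would need a full SVD plus spectrum thresholding. The function name and `radius` argument are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import svds

def trace_norm_linear_oracle(G, radius=1.0):
    """Solve min <G, M> over the ball ||M||_* <= radius.

    The minimum is attained at the rank-one extreme point
    -radius * u1 v1^T, where (u1, v1) are the leading singular vectors
    of G, so only the top singular triplet is needed (no full SVD).
    """
    u, s, vt = svds(G, k=1)
    return -radius * np.outer(u[:, 0], vt[0, :])
```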

SLIDE 8

Outline

  • 1. Context
  • 2. Conditional Gradient algorithm
  • 3. CG and convex duality
  • 4. Block CG and L2 regularized ERM
  • 5. Results

SLIDE 9

Convex duality

Recall that $X$ is convex and compact. Define its support function $g : \mathbb{R}^n \to \mathbb{R}$:
$$g : w \mapsto \max_{x \in X} \langle x, w \rangle$$

Given $A \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^n$, consider the problems
$$\bar p = \min_{w \in \mathbb{R}^m} \frac{1}{2}\|w\|_2^2 + g(-Aw + b) \quad (= P(w))$$
$$\bar d = \min_{x \in X} \frac{1}{2}\|A^T x\|_2^2 - \langle x, b \rangle \quad (= D(x))$$

Weak duality: for any $w \in \mathbb{R}^m$ and $x \in X$, $P(w) + D(x) \ge 0$. Strong duality holds: $\bar p + \bar d = 0$.
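Weak duality can be checked in one line from the definition of the support function, since $g(-Aw + b) \ge \langle x, -Aw + b \rangle$ for every $x \in X$:
$$P(w) + D(x) \ge \frac{1}{2}\|w\|_2^2 + \langle x, -Aw + b \rangle + \frac{1}{2}\|A^T x\|_2^2 - \langle x, b \rangle = \frac{1}{2}\|w - A^T x\|_2^2 \ge 0.$$
Equality requires $w = A^T x$, which is exactly the primal-dual mapping $w^k = A^T x^k$ used on the next slide.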

SLIDE 10

Primal subgradient and dual conditional gradient

$$g : w \mapsto \max_{x \in X} \langle x, w \rangle \qquad (x \in \operatorname{argmax} \Leftrightarrow x \in \partial g(w))$$
$$\bar p = \min_{w \in \mathbb{R}^m} \frac{1}{2}\|w\|_2^2 + g(-Aw + b) \quad (= P(w)), \qquad \bar d = \min_{x \in X} \frac{1}{2}\|A^T x\|_2^2 - \langle x, b \rangle \quad (= D(x))$$

A conditional gradient step in the dual computes
$$p^k \in \operatorname{argmax}_{y \in X} \langle AA^T x^k - b, x^k - y \rangle,$$
and the attained value is exactly the duality gap:
$$\max_{y \in X} \langle AA^T x^k - b, x^k - y \rangle = \|A^T x^k\|_2^2 - \langle b, x^k \rangle + g(-AA^T x^k + b) = P(A^T x^k) + D(x^k).$$

Consider the primal variable $w^k = A^T x^k$: we have $p^k \in \partial g(-Aw^k + b)$ and
$$w^{k+1} - w^k = \alpha_k A^T(p^k - x^k) \in -\alpha_k \, \partial P(w^k).$$

Implicit subgradient steps in the primal!
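Since the maximized linear form equals $P(A^T x^k) + D(x^k)$, every conditional gradient step yields a duality gap certificate at no extra cost. A minimal sketch (the `linear_oracle` callable standing in for the set $X$ is an assumption):

```python
import numpy as np

def duality_gap(A, b, x, linear_oracle):
    """Duality gap P(A^T x) + D(x) from a single linear oracle call.

    g = nabla D(x) = A A^T x - b, and max_y <g, x - y> over X equals the
    gap, so the oracle output doubles as a stopping-criterion certificate.
    """
    g = A @ (A.T @ x) - b
    p = linear_oracle(g)          # argmin over X of <g, y>
    return float(g @ (x - p))
```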

SLIDE 11

Primal subgradient and dual conditional gradient

  • The primal-dual interpretation holds in much more general settings (Bach 2015).
  • Primal-dual convergence analysis: $\min_{i=1,\ldots,k} P(w^i) + D(x^i) = O(1/k)$ (Jaggi 2013).
  • Automatic step size tuning for subgradient descent in the primal.

SLIDE 12

Outline

  • 1. Context
  • 2. Conditional Gradient algorithm
  • 3. CG and convex duality
  • 4. Block CG and L2 regularized ERM
  • 5. Results

SLIDE 13

L2 regularized ERM

Consider a problem of the form:
$$\bar p = \min_{w \in \mathbb{R}^m} \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{N}\sum_{i=1}^N g(-A_i w + b_i) \quad (= P(w))$$
$$\bar d = \min_{x_i \in X,\ i=1,\ldots,N} \frac{\lambda}{2}\Big\|\frac{1}{\lambda N}\sum_{i=1}^N A_i^T x_i\Big\|_2^2 - \frac{1}{N}\sum_{i=1}^N \langle x_i, b_i \rangle \quad (= D(x))$$

Binary SVM: dataset $(a_i, l_i) \in \mathbb{R}^m \times \{-1, 1\}$, $i = 1, \ldots, N$:
$$P(w) = \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{N}\sum_{i=1}^N \max(0, 1 - l_i a_i^T w)$$

Prediction: $l(a, w) = \operatorname{argmax}_{l \in \{-1,1\}} l\, a^T w = \operatorname{sign}(a^T w)$. The hinge loss is a convex surrogate of the empirical risk $\frac{1}{N}\sum_{i=1}^N \mathbb{1}\left(l(a_i, w) \ne l_i\right)$.
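To see how the binary SVM fits this template, note that the hinge loss is the support function of the segment $X = [0, 1]$ evaluated at $1 - l_i a_i^T w$, i.e. with $A_i = l_i a_i^T$ and $b_i = 1$:
$$\max(0,\ 1 - l_i a_i^T w) = \max_{x_i \in [0,1]} x_i \left(1 - l_i a_i^T w\right) = g(-A_i w + b_i).$$
The dual blocks therefore live in segments, which is why the dual constraint set of the binary SVM is a box.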

SLIDE 14

L2 regularized ERM: dual block conditional gradient

The dual has a separable block structure: $x_i \in X$, $i = 1, \ldots, N$. Start with $x_i^0 \in X$, $i = 1, \ldots, N$, and iterate for $k \in \mathbb{N}$ and $i = 1, \ldots, N$:
$$p_i^k \in \operatorname{argmax}_{y \in X} \langle \nabla_{x_i} D(x^k), x_i^k - y \rangle$$
$$x_i^{k+1} = (1 - \alpha_i^k) x_i^k + \alpha_i^k p_i^k, \qquad 0 \le \alpha_i^k \le 1$$

Mainly three ways to choose blocks:
  • Uniformly at random (Lacoste-Julien et al. 2013).
  • Cyclic (Beck et al. 2015).
  • Essentially cyclic, "random permutation" (Beck et al. 2015).

Primal interpretation: a subgradient method (stochastic, cyclic, etc.), with $p_i^k \in \partial g(-A_i w^k + b_i)$.
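As an illustration, here is a minimal cyclic BCG sketch for the binary SVM dual above, where each block is the segment $[0, 1]$ and the primal iterate is maintained through the mapping $w = \frac{1}{\lambda N}\sum_i x_i l_i a_i$ (per the scaling assumed in the dual two slides back); the open-loop step is refreshed once per pass, and all names are illustrative.

```python
import numpy as np

def cbcg_binary_svm(a, l, lam=0.1, n_epochs=50):
    """Cyclic block conditional gradient on the binary SVM dual.

    a: (N, m) features; l: (N,) labels in {-1, +1}. Each dual block
    x_i lies in [0, 1]; w = (1 / (lam * N)) * sum_i x_i * l_i * a_i.
    """
    N, m = a.shape
    x = np.zeros(N)
    w = np.zeros(m)
    for epoch in range(n_epochs):
        alpha = 2.0 / (epoch + 2)                   # open-loop step per pass
        for i in range(N):                          # cyclic block selection
            grad_i = (l[i] * (a[i] @ w) - 1.0) / N  # nabla_{x_i} D(x)
            p_i = 1.0 if grad_i < 0 else 0.0        # linear oracle on [0, 1]
            step = alpha * (p_i - x[i])
            x[i] += step
            w += step * l[i] * a[i] / (lam * N)     # keep w in sync with x
    return w, x
```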
SLIDE 15

Structured output learning and structured SVM

Dataset: $(a_i, l_i) \in A \times L$, $i = 1, \ldots, N$, where $L$ is discrete and structured:
  • Feature function: $\phi : A \times L \to \mathbb{R}^m$
  • Prediction: $l(a, w) = \operatorname{argmax}_{l \in L} \langle w, \phi(a, l) \rangle$
  • Risk function: $\Delta : L^2 \to \mathbb{R}_+$
  • Empirical risk: $w \mapsto \sum_{i=1}^N \Delta(l_i, l(a_i, w))$

Convex relaxation:
$$w \mapsto \sum_{i=1}^N \max_{l \in L} \left\{\Delta(l_i, l) - \langle w, \phi(a_i, l_i) - \phi(a_i, l) \rangle\right\}$$

Structured SVM:
$$\min_{w \in \mathbb{R}^m} \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^N \max_{l \in L} \left\{\Delta(l_i, l) - \langle w, \phi(a_i, l_i) - \phi(a_i, l) \rangle\right\}$$

Binary SVM: $L = \{-1, 1\}$, $\phi(a, l) = la$, and $\Delta$ is the 0-1 loss. Prediction is a sign (optimize over a set of size 2). The dual constraint set is a box (product of segments).

Label sequence learning: $L$ is the set of possible words over an alphabet, $\phi$ is inspired by HMMs (unary and binary terms over a chain), and $\Delta$ is the Hamming distance. Prediction (or decoding) is done by dynamic programming (the Viterbi algorithm). The dual constraint set is a product of simplices (of size $|L|$).
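For label sequence learning, the block linear oracle of the dual amounts to loss-augmented decoding, $\operatorname{argmax}_{l \in L} \{\Delta(l_i, l) + \langle w, \phi(a_i, l) \rangle\}$, and the Hamming distance decomposes over positions, so the same Viterbi recursion applies. A sketch under assumed `unary`/`binary` score tables (illustrative names, not from the deck):

```python
import numpy as np

def loss_augmented_viterbi(unary, binary, y_true):
    """Loss-augmented decoding for a chain model with Hamming loss:
    argmax_y sum_t unary[t, y_t] + sum_t binary[y_t, y_{t+1}] + Hamming(y, y_true).

    unary: (T, S) per-position state scores <w, phi>; binary: (S, S)
    transition scores; y_true: (T,) gold labels.
    """
    T, S = unary.shape
    aug = unary + 1.0                      # Hamming: +1 for every state ...
    aug[np.arange(T), y_true] -= 1.0       # ... except the gold one
    score = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    score[0] = aug[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + binary   # (prev state, current state)
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(S)] + aug[t]
    y = np.zeros(T, dtype=int)
    y[-1] = int(np.argmax(score[-1]))
    for t in range(T - 1, 0, -1):               # backtrack the best path
        y[t - 1] = back[t, y[t]]
    return y
```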

SLIDE 16

Outline

  • 1. Context
  • 2. Conditional Gradient algorithm
  • 3. CG and convex duality
  • 4. Block CG and L2 regularized ERM
  • 5. Results

SLIDE 17

Convergence rates

$\tilde k$: number of effective passes through the $N$ blocks. The rates are given for the duality gap. $B$: diameter of the dual constraint set $X \times X \times \ldots \times X$. $L$: Lipschitz modulus of $\nabla D$.

Random block: the rate relates to an expectation (Lacoste-Julien et al. 2013):
$$O\left(\frac{1}{\tilde k}\left(L B^2 + D(x^0)\right)\right)$$

Cyclic block: deterministic rate (Beck et al. 2015):
  • Approximate line search: $O\left(\frac{1}{\tilde k}\, L B^2 N \frac{L}{\beta}\right)$
  • Open loop, $\alpha_i^{\tilde k} = \frac{2}{\tilde k + 2}$: $O\left(\frac{1}{\tilde k}\, L B^2 \sqrt{N}\right)$

where $\beta$ is the smallest block Lipschitz modulus of $\nabla D$ (variations constrained to a single block).

SLIDE 18

Results on synthetic problems

1000 random QPs over the unit cube in $\mathbb{R}^{100}$ (normalized).

[Figure: objective measure $f$ (log scale, $10^{-6}$ to $10^0$) versus number of passes $k$; one panel per step-size rule (predefined step, backtracking, exact line search); methods compared: CBCG-P, CBCG-C, RBCG, CG.]

SLIDE 19

Results on structural SVM

Handwritten word recognition.

[Figure: duality gap (log scale) versus number of passes $k$; one panel per regularization level $\lambda \in \{0.001, 0.01, 0.1\}$; methods compared: CBCG-C, CBCG-P, RBCG.]

SLIDE 20

Conclusion regarding the cyclic block selection rule

One of the few attempts to analyse essentially cyclic methods. There remains a huge gap in the guarantees compared to random selection, yet the method is efficient in practice.

Future directions: closing the gap between theory and practice, linear convergence, exact line search, inexact oracles.

SLIDE 21

General conclusion

Nice duality between constraint block decomposition and sequential methods for sums. Conditional gradient is "bad", but it is good in settings for which nothing else is affordable.