

slide-1
SLIDE 1

Large-scale learning for image classification

Zaid Harchaoui

CVML’13, July 2013

Zaid Harchaoui (INRIA) LL July 25th 2013 1 / 75

slide-2
SLIDE 2

Large-scale image datasets

From “The Promise and Perils of Benchmark Datasets and Challenges”, D. Forsyth, A. Efros, F.-F. Li, A. Torralba and A. Zisserman, Talk at “Frontiers of Computer Vision”

Zaid Harchaoui (INRIA) LL July 25th 2013 2 / 75

slide-3
SLIDE 3

Large-scale supervised learning

Large-scale image classification

Let (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {1, ..., k} be labelled training images. Minimize over W ∈ R^(d×k):

λ Ω(W) + (1/n) Σ_{i=1}^n L(y_i, W^T x_i)

Problem: minimizing such objectives in the large-scale setting, where

n ≫ 1, d ≫ 1, k ≫ 1

Zaid Harchaoui (INRIA) LL July 25th 2013 3 / 75

slide-4
SLIDE 4

Large-scale supervised learning

Large-scale image classification

Let (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {1, ..., k} be labelled training images. Minimize over W ∈ R^(d×k):

λ Ω(W) + (1/n) Σ_{i=1}^n L(y_i, W^T x_i)

Problem: minimizing such objectives in the large-scale setting, where

n ≫ 1, d ≫ 1, k ≫ 1

Zaid Harchaoui (INRIA) LL July 25th 2013 4 / 75

slide-5
SLIDE 5

Machine learning cuboid

[Figure: the machine-learning cuboid, with axes n (examples), d (features), k (categories)]

Zaid Harchaoui (INRIA) LL July 25th 2013 5 / 75

slide-6
SLIDE 6

Working example : ImageNet dataset

ImageNet dataset
Large number of examples: n = 17 million
Large feature size: d = 4·10^3, ..., 2·10^5
Large number of categories: k = 10,000

Zaid Harchaoui (INRIA) LL July 25th 2013 6 / 75

slide-7
SLIDE 7

General strategy for large-scale problems

Strategy
Most approaches boil down to a general "divide-and-conquer" strategy: break the large learning problem into small and easy pieces.

Zaid Harchaoui (INRIA) LL July 25th 2013 7 / 75

slide-8
SLIDE 8

Machine learning cuboid

[Figure: the machine-learning cuboid, with axes n (examples), d (features), k (categories)]

Zaid Harchaoui (INRIA) LL July 25th 2013 8 / 75

slide-9
SLIDE 9

Decomposition principle

Decomposition principle
Decomposition over examples: stochastic/incremental gradient descent
Decomposition over features: (primal) regular coordinate descent
Decomposition over categories: one-versus-rest strategy
Decomposition over latent structure: atomic decomposition

Zaid Harchaoui (INRIA) LL July 25th 2013 9 / 75

slide-10
SLIDE 10

Decomposition principle

Decomposition principle
Decomposition over examples: stochastic/incremental gradient descent
Decomposition over features: (primal) coordinate descent
Decomposition over categories: one-versus-rest strategy
Decomposition over latent structure: atomic decomposition

Zaid Harchaoui (INRIA) LL July 25th 2013 10 / 75

slide-11
SLIDE 11

Decomposition over examples

Decomposition over examples

Stochastic/incremental gradient descent

Bru, 1890: algorithm to adjust the slant θ of a cannon in order to obtain a specified range r by trial and error, firing one shell after another:

θ_t = θ_{t−1} − (γ_0 / t) (r − r_t)

Perceptron, Rosenblatt, 1957:

w_t = w_{t−1} + γ_t y_t φ(x_t)   if y_t w_{t−1}^T φ(x_t) ≤ 0
w_t = w_{t−1}                    otherwise

Zaid Harchaoui (INRIA) LL July 25th 2013 11 / 75

slide-12
SLIDE 12

Decomposition over examples

Decomposition over examples

Stochastic/incremental gradient descent
Bru, 1890: algorithm to adjust the slant θ of a cannon in order to obtain a specified range r by trial and error
Perceptron, Rosenblatt, 1957
60s–70s: extensions in learning, optimal control, and adaptive signal processing
80s–90s: extensions to non-convex learning problems
See "Efficient backprop" in Neural Networks: Tricks of the Trade, LeCun et al., 1998, for wise advice and an overview of SGD algorithms

Zaid Harchaoui (INRIA) LL July 25th 2013 12 / 75

slide-13
SLIDE 13

Decomposition over examples

Decomposition over examples

Stochastic/incremental gradient descent
Initialize: W = 0
Iterate: pick an example (x_t, y_t) and update

W_{t+1} = W_t − γ_t ∇_W Q(W_t; x_t, y_t)    ← one example at a time

Why? Where do these update rules come from?

Zaid Harchaoui (INRIA) LL July 25th 2013 13 / 75

slide-14
SLIDE 14

Decomposition over examples

Plain gradient descent

Plain gradient descent versus stochastic/incremental gradient descent

Grouping the regularization penalty and the empirical risk:

∇_W J(W) = (1/n) ∇_W [ nλ Ω(W) + Σ_{i=1}^n L(y_i, W^T x_i) ]

Zaid Harchaoui (INRIA) LL July 25th 2013 14 / 75

slide-15
SLIDE 15

Decomposition over examples

Plain gradient descent

Plain gradient descent versus stochastic/incremental gradient descent

Grouping the regularization penalty and the empirical risk, and expanding the sum onto the examples:

∇_W J(W) = (1/n) ∇_W [ nλ Ω(W) + Σ_{i=1}^n L(y_i, W^T x_i) ] = ∇_W [ (1/n) Σ_{i=1}^n Q(W; x_i, y_i) ]

Zaid Harchaoui (INRIA) LL July 25th 2013 15 / 75

slide-16
SLIDE 16

Decomposition over examples

Plain gradient descent

Plain gradient descent
Initialize: W = 0
Iterate:

W_{t+1} = W_t − γ_t ∇_W J(W_t) = W_t − γ_t ∇_W [ (1/n) Σ_{i=1}^n Q(W_t; x_i, y_i) ]

Zaid Harchaoui (INRIA) LL July 25th 2013 16 / 75

slide-17
SLIDE 17

Decomposition over examples

Plain gradient descent

Plain gradient descent
Initialize: W = 0
Iterate:

W_{t+1} = W_t − γ_t ∇_W J(W_t) = W_t − γ_t ∇_W [ (1/n) Σ_{i=1}^n Q(W_t; x_i, y_i) ]    ← sum over all examples!

Strengths and weaknesses
Strength: robust to the setting of the step-size sequence (line-search)
Weakness: demanding disk/memory requirements
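For concreteness, a minimal numpy sketch of one full-gradient step, assuming a squared loss and an ℓ2 penalty; the function and variable names are illustrative, not from the talk:

```python
import numpy as np

def full_gradient_step(W, X, Y, lam, gamma):
    """One full-batch gradient step on J(W) = lam/2*||W||^2 + (1/n)*sum_i ||W^T x_i - y_i||^2.

    X: (n, d) features, Y: (n, k) targets, W: (d, k) weights.
    Every example is touched at every iteration, hence the heavy
    disk/memory requirements mentioned above.
    """
    n = X.shape[0]
    residual = X @ W - Y                      # (n, k)
    grad = lam * W + (2.0 / n) * X.T @ residual
    return W - gamma * grad
```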

Zaid Harchaoui (INRIA) LL July 25th 2013 17 / 75

slide-18
SLIDE 18

Decomposition over examples

Stochastic/incremental gradient descent

Stochastic/incremental gradient descent
Leveraging the decomposable structure over examples:

∇_W J(W) = (1/n) Σ_{i=1}^n ∇_W Q(W; x_i, y_i) = (1/n) ∇_W Q(W; x_1, y_1) + ··· + (1/n) ∇_W Q(W; x_n, y_n)

Zaid Harchaoui (INRIA) LL July 25th 2013 18 / 75

slide-19
SLIDE 19

Decomposition over examples

Decomposition over examples

Stochastic/incremental gradient descent
Leveraging the decomposable structure over examples:

∇_W J(W) = (1/n) [ ∇_W Q(W; x_1, y_1) + ··· + ∇_W Q(W; x_n, y_n) ]    (each term is cheap to compute)

Make incremental gradient steps along ∇_W Q(W; x_t, y_t) at each iteration t, instead of full gradient steps along ∇_W J(W) at each iteration.

Zaid Harchaoui (INRIA) LL July 25th 2013 19 / 75

slide-20
SLIDE 20

Decomposition over examples

Stochastic/incremental gradient descent

Stochastic/incremental gradient descent
Initialize: W = 0
Iterate: pick an example (x_t, y_t) and update

W_{t+1} = W_t − γ_t ∇_W Q(W_t; x_t, y_t)

Zaid Harchaoui (INRIA) LL July 25th 2013 20 / 75

slide-21
SLIDE 21

Decomposition over examples

Stochastic/incremental gradient descent

Stochastic/incremental gradient descent
Initialize: W = 0
Iterate: pick an example (x_t, y_t) and update

W_{t+1} = W_t − γ_t ∇_W Q(W_t; x_t, y_t)    ← one example at a time

Strengths and weaknesses
Strength: low disk requirements
Weakness: may be sensitive to the setting of the step-size sequence
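A minimal sketch of the full SGD loop, under the same squared-loss assumption as the plain-gradient sketch above; the per-epoch shuffling and the decreasing step size anticipate the practical advice given on the next slides (all names are illustrative):

```python
import numpy as np

def sgd(X, Y, lam, gamma0=0.1, t0=10.0, epochs=5, seed=0):
    """Stochastic gradient descent: one example at a time, shuffled at every epoch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):              # shuffle to break correlated orderings
            gamma = gamma0 / (t + t0)             # decreasing step size gamma_t = gamma0/(t+t0)
            xi, yi = X[i], Y[i]
            grad = lam * W + 2.0 * np.outer(xi, xi @ W - yi)   # grad of Q(W; x_i, y_i)
            W -= gamma * grad
            t += 1
    return W
```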

Zaid Harchaoui (INRIA) LL July 25th 2013 21 / 75

slide-22
SLIDE 22

Decomposition over examples

Stochastic/incremental gradient descent

What's "stochastic" in this algorithm?
Looking at the objective as a stochastic approximation of the expected training error:

∇_W J(W) = (1/n) Σ_{i=1}^n ∇_W Q(W; x_i, y_i) = (1/n) ∇_W Q(W; x_1, y_1) + ··· + (1/n) ∇_W Q(W; x_n, y_n)

Zaid Harchaoui (INRIA) LL July 25th 2013 22 / 75

slide-23
SLIDE 23

Decomposition over examples

Stochastic/incremental gradient descent

What's "stochastic" in this algorithm?

∇_W J(W) = (1/n) Σ_{i=1}^n ∇_W Q(W; x_i, y_i) ≈ E_{x,y}[ ∇_W Q(W; x, y) ]

Practical consequences
Shuffle the examples before launching the algorithm, in case they form a correlated sequence
Perform several passes/epochs over the training data, shuffling the examples before each pass/epoch

Zaid Harchaoui (INRIA) LL July 25th 2013 23 / 75

slide-24
SLIDE 24

Decomposition over examples

Mini-batch extensions

Mini-batch extensions
Regular stochastic gradient descent: extreme decomposition strategy, picking one example at a time
Mini-batch extensions: decomposition onto mini-batches of size B_t at iteration t
When to choose one or the other?
Regular stochastic gradient descent converges for simple objectives with "moderate non-smoothness"
For more sophisticated objectives, SGD does not converge, and mini-batch SGD is a must
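A sketch of the mini-batch variant, again under the illustrative squared-loss assumption; `batch_size` plays the role of |B_t|:

```python
import numpy as np

def minibatch_sgd(X, Y, lam, batch_size=128, gamma=0.01, epochs=5, seed=0):
    """Mini-batch SGD: average the per-example gradients over a batch B_t before stepping."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, Yb = X[idx], Y[idx]
            grad = lam * W + (2.0 / len(idx)) * Xb.T @ (Xb @ W - Yb)
            W -= gamma * grad
    return W
```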

Zaid Harchaoui (INRIA) LL July 25th 2013 24 / 75

slide-25
SLIDE 25

Decomposition over examples

Theory digest

Theory digest
Fixed stepsize γ_t ≡ γ → stable convergence
Decreasing stepsize γ_t = γ_0 / (t + t_0) → faster local convergence, with γ_0 and t_0 properly set
Note: stochastic gradient descent is an extreme decomposition strategy, picking one example at a time

In practice
Pick a random batch of reasonable size, and find the best pair (γ_0, t_0) through cross-validation
Run stochastic gradient descent with the sequence of decreasing stepsizes γ_t = γ_0 / (t + t_0)

Zaid Harchaoui (INRIA) LL July 25th 2013 25 / 75

slide-26
SLIDE 26

Decomposition over examples

Tricks of the trade : life is simpler in large-scale settings

Life is simpler in large-scale settings
Shuffle the examples before launching the algorithm, and process the examples in a balanced manner w.r.t. the categories
Regularization through early stopping: perform only a few passes/epochs over the training data, and stop when the accuracy on a held-out validation set does not increase anymore
Fixed step-size works fine: find the best γ through cross-validation on a small batch

Zaid Harchaoui (INRIA) LL July 25th 2013 26 / 75

slide-27
SLIDE 27

Decomposition over examples

Stochastic/incremental gradient descent

Put the shoulder to the wheel Let’s try it out !

Zaid Harchaoui (INRIA) LL July 25th 2013 27 / 75

slide-28
SLIDE 28

Decomposition over examples

Ridge regression

Ridge regression
Training data: (x_1, y_1), ..., (x_n, y_n) ∈ R^d × R
Minimize over w ∈ R^d:

(λ/2) ‖w‖₂² + (1/n) Σ_{i=1}^n L(y_i, w^T x_i)

Key calculations

Q(w; x_i, y_i) = (nλ/2) ‖w‖₂² + (y_i − w^T x_i)²
∇Q(w; x_i, y_i) = nλ w − 2 (y_i − w^T x_i) x_i
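A per-example gradient sketch matching the key calculations above; `reg` stands for the nλ factor written on the slide, and the names are illustrative:

```python
import numpy as np

def ridge_grad(w, x, y, reg):
    """Per-example gradient of Q(w; x, y) = reg/2 * ||w||^2 + (y - w.x)^2."""
    return reg * w - 2.0 * (y - w @ x) * x

def ridge_sgd_step(w, x, y, reg, gamma):
    # w_{t+1} = w_t - gamma_t * grad Q(w_t; x_t, y_t)
    return w - gamma * ridge_grad(w, x, y, reg)
```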

Zaid Harchaoui (INRIA) LL July 25th 2013 28 / 75

slide-29
SLIDE 29

Decomposition over examples

Logistic regression

Logistic regression
Training data: (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {−1, +1}
Minimize over w ∈ R^d:

(λ/2) ‖w‖₂² + (1/n) Σ_{i=1}^n L(y_i, w^T x_i)

Key calculations

Q(w; x_i, y_i) = (nλ/2) ‖w‖₂² + log(1 + exp(−y_i w^T x_i))
∇Q(w; x_i, y_i) = nλ w − [ 1 / (1 + exp(y_i w^T x_i)) ] y_i x_i
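The corresponding sketch for the logistic loss; the only subtlety is computing the factor 1/(1 + exp(y_i w^T x_i)) in a numerically stable way:

```python
import numpy as np

def logistic_grad(w, x, y, reg):
    """Per-example gradient of Q(w; x, y) = reg/2*||w||^2 + log(1 + exp(-y * w.x)), y in {-1, +1}."""
    margin = y * (w @ x)
    # 1/(1 + exp(margin)) without overflow, since logaddexp(0, m) = log(1 + exp(m))
    sigma = np.exp(-np.logaddexp(0.0, margin))
    return reg * w - sigma * y * x
```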

Zaid Harchaoui (INRIA) LL July 25th 2013 29 / 75

slide-30
SLIDE 30

Decomposition over examples

Linear SVM with linear hinge loss

Linear SVM with linear hinge loss
Training data: (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {−1, +1}
Minimize over w ∈ R^d:

(λ/2) ‖w‖₂² + (1/n) Σ_{i=1}^n L(y_i, w^T x_i)

Key calculations

Q(w; x_i, y_i) = (nλ/2) ‖w‖₂² + max(0, 1 − y_i w^T x_i)
∇Q(w; x_i, y_i) = nλ w − y_i x_i   if 1 − y_i w^T x_i > 0
∇Q(w; x_i, y_i) = nλ w            otherwise
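A sketch of the hinge (sub)gradient; at the kink (margin exactly 1) it makes no loss-driven update, which is the rule formalized on the next slide:

```python
import numpy as np

def hinge_grad(w, x, y, reg):
    """Per-example (sub)gradient of Q(w; x, y) = reg/2*||w||^2 + max(0, 1 - y * w.x), y in {-1, +1}.

    At the non-differentiable point 1 - y*w.x == 0 only the regularization
    part is returned, i.e. no loss-driven update is made there.
    """
    if 1.0 - y * (w @ x) > 0.0:
        return reg * w - y * x
    return reg * w
```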

Zaid Harchaoui (INRIA) LL July 25th 2013 30 / 75

slide-31
SLIDE 31

Decomposition over examples

Linear SVM with linear hinge loss

Linear SVM with linear hinge loss
Training data: (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {−1, +1}
Minimize over w ∈ R^d:

(λ/2) ‖w‖₂² + (1/n) Σ_{i=1}^n L(y_i, w^T x_i)

Non-differentiable loss
Rule: if Q(w; x, y) has a finite number of non-differentiable points, then just make no update at such a point and pick another example.
Theoretical justification: the set of non-differentiable points has measure zero, so the convergence guarantee is still valid:

∇ E_{x,y}[ Q(W; x, y) ] = E_{x,y}[ ∇Q(W; x, y) ]

Zaid Harchaoui (INRIA) LL July 25th 2013 31 / 75

slide-32
SLIDE 32

Decomposition over examples

A quick overview

Convergence guarantees
Least-squares loss: smooth → fast and stable convergence
Logistic loss: smooth → fast and stable convergence
Linear hinge loss: non-smooth → slower convergence
Take-home message: a smooth loss is nicer

Zaid Harchaoui (INRIA) LL July 25th 2013 32 / 75

slide-33
SLIDE 33

Decomposition over examples

Machine learning cuboid

[Figure: the machine-learning cuboid, with axes n (examples), d (features), k (categories)]

Zaid Harchaoui (INRIA) LL July 25th 2013 33 / 75

slide-34
SLIDE 34

Decomposition over examples

Decomposition principle

Decomposition principle
Decomposition over examples: stochastic/incremental gradient descent
Decomposition over categories: one-versus-rest strategy
Decomposition over latent structure: atomic decomposition

Zaid Harchaoui (INRIA) LL July 25th 2013 34 / 75

slide-35
SLIDE 35

Decomposition over categories : one-versus-rest strategy

Multi-class linear SVM with regular linear hinge loss

Multi-class linear SVM with regular linear hinge loss
Training data: (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {0, 1}^k

min_{W ∈ R^(d×k)}  λ ‖W‖₂² + (1/n) Σ_{i=1}^n BinaryHingeLoss_i

One-versus-rest reduction
Turn the original label y_i ∈ {0, 1}^k into a binary label ỹ_i ∈ {−1, +1}
BinaryHingeLoss_i = max(0, 1 − ỹ_i w^T x_i)
Note: any loss could do, e.g. also the logistic loss

Zaid Harchaoui (INRIA) LL July 25th 2013 35 / 75

slide-36
SLIDE 36

Decomposition over categories : one-versus-rest strategy

Multi-class linear SVM with regular linear hinge loss

Multi-class linear SVM with regular linear hinge loss
Training data: (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {0, 1}^k

min_{W ∈ R^(d×k)}  Σ_{ℓ=1}^k λ_ℓ ‖w_ℓ‖₂² + (1/n) Σ_{ℓ=1}^k Σ_{i=1}^n BinaryHingeLoss_i^(ℓ)    (binary label ỹ_i = +1 iff y_i ≡ class ℓ)

Decomposition over categories
Leverage the decomposable structure over categories

Zaid Harchaoui (INRIA) LL July 25th 2013 36 / 75

slide-37
SLIDE 37

Decomposition over categories : one-versus-rest strategy

Multi-class linear SVM with regular linear hinge loss

Multi-class linear SVM with regular linear hinge loss
Training data: (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {0, 1}^k

The objective splits into k independent binary problems, one per class:

min_{w_1 ∈ R^d}  λ_1 ‖w_1‖₂² + (1/n) Σ_{i=1}^n BinaryHingeLoss_i^(1)    (ỹ_i = +1 iff y_i ≡ class 1)
...
min_{w_ℓ ∈ R^d}  λ_ℓ ‖w_ℓ‖₂² + (1/n) Σ_{i=1}^n BinaryHingeLoss_i^(ℓ)    (ỹ_i = +1 iff y_i ≡ class ℓ)
...
min_{w_k ∈ R^d}  λ_k ‖w_k‖₂² + (1/n) Σ_{i=1}^n BinaryHingeLoss_i^(k)    (ỹ_i = +1 iff y_i ≡ class k)
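A sketch of this one-vs-rest decomposition: k independent binary hinge-loss classifiers trained by SGD, one per class (illustrative names; in practice each inner loop would run on a separate core or machine):

```python
import numpy as np

def one_vs_rest_sgd(X, y, k, reg=1e-4, gamma=0.01, epochs=5, seed=0):
    """Train k independent binary hinge-loss classifiers (one-vs-rest).

    X: (n, d) features, y: (n,) integer labels in {0, ..., k-1}.
    The k problems are fully decoupled, so they can be trained in parallel.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, k))
    for c in range(k):
        yb = np.where(y == c, 1.0, -1.0)       # binarize the labels for class c
        w = np.zeros(d)
        for _ in range(epochs):
            for i in rng.permutation(n):
                if 1.0 - yb[i] * (w @ X[i]) > 0.0:
                    w -= gamma * (reg * w - yb[i] * X[i])
                else:
                    w -= gamma * reg * w
        W[:, c] = w
    return W
```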

Zaid Harchaoui (INRIA) LL July 25th 2013 37 / 75

slide-38
SLIDE 38

Decomposition over categories : one-versus-rest strategy

Multi-class through one-vs-rest

Multi-class through one-vs-rest
Overall: the simplest multi-class classification algorithm
Computational strength: easy to optimize by decomposition over classes
Statistical weakness: no universally consistent loss can be decomposable over classes (do we really care? we'll see)

Zaid Harchaoui (INRIA) LL July 25th 2013 38 / 75

slide-39
SLIDE 39

Decomposition over categories : one-versus-rest strategy

Multi-class through one-vs-rest

In practice
State-of-the-art performance using a balanced version of the binary loss, and learning the optimal imbalance β through cross-validation:

Empirical risk = (β / n₊) Σ_{i ∈ positive examples} BinaryHingeLoss_i + ((1 − β) / n₋) Σ_{i ∈ negative examples} BinaryHingeLoss_i
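A sketch of this weighted empirical risk for one binary one-vs-rest problem; β would be selected by cross-validation as described above:

```python
import numpy as np

def weighted_binary_risk(w, X, yb, beta):
    """beta/n+ * sum over positives + (1-beta)/n- * sum over negatives of the hinge loss.

    yb: binarized labels in {-1, +1}; mean() over each group divides by n+ and n-.
    """
    losses = np.maximum(0.0, 1.0 - yb * (X @ w))
    pos, neg = yb > 0, yb < 0
    return beta * losses[pos].mean() + (1.0 - beta) * losses[neg].mean()
```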

Zaid Harchaoui (INRIA) LL July 25th 2013 39 / 75

slide-40
SLIDE 40

Decomposition over categories : one-versus-rest strategy

Multi-class with non-decomposable loss functions

Other multi-class loss functions
Multinomial logistic loss
Crammer & Singer multi-class loss:

R_MUL = (1/n) Σ_{i=1}^n [ max_y ( Δ(y_i, y) + w_y^T x_i ) − w_{y_i}^T x_i ]

The multi-class binary hinge loss is the only decomposable loss
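A vectorized sketch of the Crammer & Singer loss above, taking Δ(y_i, y) to be a 0/1 cost (an assumption made for illustration):

```python
import numpy as np

def multiclass_hinge_loss(W, X, y, delta=1.0):
    """R_MUL = (1/n) * sum_i [ max_y (Delta(y_i, y) + w_y.x_i) - w_{y_i}.x_i ].

    W: (d, k), X: (n, d), y: (n,) integer labels; Delta(y_i, y_i) = 0.
    """
    n = X.shape[0]
    scores = X @ W                                        # (n, k)
    costs = scores + delta
    costs[np.arange(n), y] = scores[np.arange(n), y]      # zero cost for the true class
    return np.mean(costs.max(axis=1) - scores[np.arange(n), y])
```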

Zaid Harchaoui (INRIA) LL July 25th 2013 40 / 75

slide-41
SLIDE 41

Decomposition over categories : one-versus-rest strategy

Multi-class with non-decomposable loss functions

More sophisticated losses
Loss functions tailored to optimize a convex surrogate of top-k accuracy:

Accuracy_top-k = (# images whose correct label lies in the top-k scores) / (total number of images)

Ranking losses:

R_RNK = (1/n) Σ_{i=1}^n Σ_{y=1}^k max( 0, Δ(y_i, y) − (w_{y_i} − w_y)^T x_i )

Weighted ranking losses, and other variations
Yet to prove themselves compared to one-vs-rest with the binary loss on real-world datasets
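Top-k accuracy itself is straightforward to evaluate; a small sketch:

```python
import numpy as np

def top_k_accuracy(scores, y, k=5):
    """Fraction of images whose correct label lies among the k highest scores.

    scores: (n, num_classes) classifier scores, y: (n,) true labels.
    """
    topk = np.argsort(-scores, axis=1)[:, :k]     # indices of the k best scores per image
    return np.mean([y[i] in topk[i] for i in range(len(y))])
```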

Zaid Harchaoui (INRIA) LL July 25th 2013 41 / 75

slide-42
SLIDE 42

Decomposition over categories : one-versus-rest strategy

Multi-class with non-decomposable loss functions

Sampling and update rules

R_OVR: Draw (x_i, y_i) from S. Set δ_i = 1 if L_OVR(x_i, y_i; w) > 0, 0 otherwise. Update
w^(t) = (1 − η_t) w^(t−1) + η_t δ_i y_i x_i

R_MUL: Draw (x_i, y_i) from S. Set ȳ = argmax_y Δ(y_i, y) + w_y^T x_i, and δ_i = 1 if ȳ ≠ y_i, 0 otherwise. Update
w_y^(t) = (1 − η_t) w_y^(t−1) + δ_i η_t x_i    if y = y_i
w_y^(t) = (1 − η_t) w_y^(t−1) − δ_i η_t x_i    if y = ȳ
w_y^(t) = (1 − η_t) w_y^(t−1)                  otherwise

R_RNK: Draw (x_i, y_i) from S and draw ȳ ≠ y_i from Y. Set δ_i = 1 if L_tri(x_i, y_i, ȳ; w) > 0, 0 otherwise. Update
w_y^(t) = (1 − η_t) w_y^(t−1) + δ_i η_t x_i    if y = y_i
w_y^(t) = (1 − η_t) w_y^(t−1) − δ_i η_t x_i    if y = ȳ
w_y^(t) = (1 − η_t) w_y^(t−1)                  otherwise
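A sketch of one sampled update for the MUL loss following the scheme above (Δ again taken as a 0/1 cost for illustration):

```python
import numpy as np

def mul_sgd_update(W, x, yi, eta, delta=1.0):
    """One sampled MUL update: shrink all columns, pull w_{y_i} towards x, push w_{y_bar} away.

    W: (d, k) weights, x: (d,) example, yi: true label index, eta: step size.
    """
    scores = W.T @ x
    costs = scores + delta
    costs[yi] = scores[yi]                  # Delta(y_i, y_i) = 0
    y_bar = int(np.argmax(costs))           # most violating class
    delta_i = 1.0 if y_bar != yi else 0.0
    W_new = (1.0 - eta) * W                 # weight decay on every column
    W_new[:, yi] += delta_i * eta * x
    W_new[:, y_bar] -= delta_i * eta * x
    return W_new
```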

Zaid Harchaoui (INRIA) LL July 25th 2013 42 / 75

slide-43
SLIDE 43

Decomposition over categories : one-versus-rest strategy

Experimental results

Datasets

Dataset        Total # of images   # classes   Train    Val    Test
Fungus         88K                 134         44K      5K     39K
Ungulate       183K                183         91.5K    5K     86.5K
Vehicle        226K                262         113K     5K     108K
ILSVRC10       1.4M                1,000       1.2M     50K    150K
ImageNet10K    9M                  10,184      4.5M     50K    4.45M

Table: Datasets considered

Zaid Harchaoui (INRIA) LL July 25th 2013 43 / 75

slide-44
SLIDE 44

Decomposition over categories : one-versus-rest strategy

Stochastic gradient descent is competitive with batch solvers

Stochastic gradient descent is competitive with batch solvers

Average training time (in CPU seconds) on 3 fine-grained datasets

(a) w-OVR SVM: LibSVM (batch) / SGD (online)
        Fungus        Ungulate       Vehicle
10      12 / 7        31 / 18        107 / 39
25      95 / 16       175 / 36       835 / 119
50      441 / 38      909 / 67       3,223 / 271
100     1,346 / 71    3,677 / 133    11,679 / 314

(b) MUL SVM: SVM-light (batch) / SGD (online)
        Fungus        Ungulate       Vehicle
10      45 / 36       324 / 81       557 / 209
25      99 / 72       441 / 198      723 / 369
50      198 / 261     855 / 420      1,265 / 747
100     972 / 522     1,674 / 765    3,752 / 1,503

Zaid Harchaoui (INRIA) LL July 25th 2013 44 / 75

slide-45
SLIDE 45

Decomposition over categories : one-versus-rest strategy

Superiority of one-vs-rest with weighted binary loss

Superiority of one-vs-rest with weighted binary loss over unweighted one

Figure: Influence of data rebalancing in weighted one-vs-rest (w-OVR) vs unweighted one-vs-rest (u-OVR) on Fungus (134 classes). [Plot: Top-1 accuracy (%) as a function of the weight β, for BOV and Fisher-vector features of increasing dimensionality (D = 4,096 up to 131,072).]

Zaid Harchaoui (INRIA) LL July 25th 2013 45 / 75

slide-46
SLIDE 46

Decomposition over categories : one-versus-rest strategy

One-vs-rest with binary hinge-loss works fine for expressive features

One-vs-rest with binary hinge-loss works fine for expressive features

Figure: Comparison of Top-1 accuracy between the w-OVR, MUL, RNK and WAR SVMs as a function of the number of Gaussians used to compute the FV (i.e. as a function of the FV dimensionality), on (a) Fungus, (b) Ungulate, (c) Vehicle. No spatial pyramids were used, to speed up these experiments.

Zaid Harchaoui (INRIA) LL July 25th 2013 46 / 75

slide-47
SLIDE 47

Decomposition over categories : one-versus-rest strategy

Beyond one-vs-rest strategies

Large-scale learning
Training data: (x_1, y_1), ..., (x_n, y_n) ∈ R^d × Y = {0, 1}^k
Minimize over W ∈ R^(d×k):

λ Ω(W) + (1/n) Σ_{i=1}^n Loss_i

Discover the latent structure of the classes, and keep the scalability and efficiency of one-versus-rest strategies

Zaid Harchaoui (INRIA) LL July 25th 2013 47 / 75

slide-48
SLIDE 48

Decomposition over categories : one-versus-rest strategy

Machine learning cuboid

[Figure: the machine-learning cuboid, with axes n (examples), d (features), k (categories)]

Zaid Harchaoui (INRIA) LL July 25th 2013 48 / 75

slide-49
SLIDE 49

Decomposition over categories : one-versus-rest strategy

Decomposition principle

Decomposition principle
Decomposition over examples: stochastic/incremental gradient descent
Decomposition over latent structure: atomic decomposition

Zaid Harchaoui (INRIA) LL July 25th 2013 49 / 75

slide-50
SLIDE 50

Decomposition over latent structure

Learning with atom. penalty

Learning with a low-rank regularization penalty
Training data: (x_1, y_1), ..., (x_n, y_n) ∈ R^d × Y = {0, 1}^k
Minimize over W ∈ R^(d×k):

λ Rank(W) + (1/n) Σ_{i=1}^n Loss_i   ?

Embedding motivation: classes may be embedded in a low-dimensional subspace of the feature space
Computational motivation: the algorithm scales with the number of latent classes r, assuming that r ≪ k
Extension of Reduced-Rank Regression (see e.g. Velu & Reinsel, 1998) → non-smooth, non-convex optimization problem

Zaid Harchaoui (INRIA) LL July 25th 2013 50 / 75

slide-51
SLIDE 51

Decomposition over latent structure

Learning with atom. penalty

Learning with a low-rank regularization penalty
Training data: (x_1, y_1), ..., (x_n, y_n) ∈ R^d × Y = {0, 1}^k
Minimize over W ∈ R^(d×k):

λ ‖σ(W)‖₁ + (1/n) Σ_{i=1}^n Loss_i   ← convex

Embedding motivation: classes may be embedded in a low-dimensional subspace of the feature space
Computational motivation: the algorithm scales with the number of latent classes r, assuming that r ≪ k
Extension of Reduced-Rank Regression (see e.g. Velu & Reinsel, 1998) → non-smooth, non-convex optimization problem

Zaid Harchaoui (INRIA) LL July 25th 2013 51 / 75

slide-52
SLIDE 52

Decomposition over latent structure

Learning with atom. penalty

Learning with a low-rank regularization penalty
Let (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {1, ..., k} be labelled training images
Minimize over W ∈ R^(d×k):

λ ‖σ(W)‖₁ + (1/n) Σ_{i=1}^n Loss_i   ← convex

Tight convex relaxation (Amit et al., 2007; Argyriou et al., 2007)
Enforces a low-rank structure on W (sparsity of the spectrum σ(W))
Convex, but non-differentiable

Zaid Harchaoui (INRIA) LL July 25th 2013 52 / 75

slide-53
SLIDE 53

Decomposition over latent structure

Learning with atom. penalty

Learning with a low-rank regularization penalty
Let (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {1, ..., k} be labelled training images
Minimize over W ∈ R^(d×k):

λ ‖σ(W)‖₁  [non-smooth]  +  R_n(W)  [smooth]

where R_n(W) is the empirical risk with the multinomial logistic loss:

R_n(W) = (1/n) Σ_{i=1}^n log( 1 + Σ_{ℓ ∈ Y\{y_i}} exp( w_ℓ^T x_i − w_{y_i}^T x_i ) )

Zaid Harchaoui (INRIA) LL July 25th 2013 53 / 75

slide-54
SLIDE 54

Decomposition over latent structure

Learning with atom. penalty

Learning with a low-rank regularization penalty
Let (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {1, ..., k} be labelled training images
Minimize over W ∈ R^(d×k):

λ ‖σ(W)‖₁  [decomposable?]  +  R_n(W)  [smooth]

where R_n(W) is the empirical risk with the multinomial logistic loss:

R_n(W) = (1/n) Σ_{i=1}^n log( 1 + Σ_{ℓ ∈ Y\{y_i}} exp( w_ℓ^T x_i − w_{y_i}^T x_i ) )

Zaid Harchaoui (INRIA) LL July 25th 2013 54 / 75

slide-55
SLIDE 55

Decomposition over latent structure

Stochastic atom descent

We want an efficient and scalable algorithm: let's get inspiration from the ℓ1 case...
Atom-descent algorithms
Leverage a decomposable structure of the regularization: the atomic decomposition
Perform a stochastic version of coordinate descent on this representation
Efficient and scalable algorithms
Atom descent for trace-norm regularization
Leverage a non-flat decomposable structure, in contrast to one-vs-rest
Learn a latent embedding of the classes

Zaid Harchaoui (INRIA) LL July 25th 2013 55 / 75

slide-56
SLIDE 56

Decomposition over latent structure

Lifting to an infinite-dimensional space

The trace-norm is the smallest ℓ1-norm of the weight vector associated with an atomic decomposition onto rank-one subspaces:

W = θ_1 u_1 v_1^T + ··· + θ_i u_i v_i^T + ···

‖σ(W)‖₁ = min_θ { ‖θ‖₁ : ∃ N, M_i ∈ M, with W = Σ_{i=1}^N θ_i M_i }

M = { u v^T | u ∈ R^d, v ∈ R^k, ‖u‖₂ = ‖v‖₂ = 1 }
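A quick numerical illustration of this variational definition: the SVD of W is itself an admissible atomic decomposition onto rank-one atoms, and its weight vector σ(W) attains the minimal ℓ1-norm:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 15))
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
print(np.allclose(W, W_rebuilt))      # True: the SVD is a valid atomic decomposition
print(s.sum())                        # ||sigma(W)||_1, i.e. the trace-norm of W
```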

Zaid Harchaoui (INRIA) LL July 25th 2013 56 / 75

slide-57
SLIDE 57

Decomposition over latent structure

Lifted objective

Lifting
Original objective: J(W) := λ ‖σ(W)‖₁ + R_n(W)
Lifted objective: I(θ) := λ Σ_{j ∈ supp(θ)} θ_j + R_n(W_θ)

Zaid Harchaoui (INRIA) LL July 25th 2013 57 / 75

slide-58
SLIDE 58

Decomposition over latent structure

Equivalence

Equivalence Assume that the loss function L(y, ·) is convex and smooth. Then the two problems are equivalent

Zaid Harchaoui (INRIA) LL July 25th 2013 58 / 75

slide-59
SLIDE 59

Decomposition over latent structure

Stochastic atom-descent descent : high-level idea

Sketch
At each iteration, pick a random mini-batch B_t, then pick the rank-one subspace yielding the steepest descent, and perform descent along that direction
Periodically perform second-order minimization on the current subspace

[Illustration: W = θ_1 u_1 v_1^T + ··· + θ_i u_i v_i^T + ···]

Zaid Harchaoui (INRIA) LL July 25th 2013 59 / 75

slide-60
SLIDE 60

Decomposition over latent structure

Stochastic atom descent

Algorithm
Initialize: θ = 0
Iterate: pick a random mini-batch B_t and find the coordinate θ_i of steepest descent:

i(t) = Argmax_i ∂I_{B_t}(θ)/∂θ_i = Argmax_i ⟨u_i v_i^T, −∇R_{B_t}(W_θ)⟩ = Argmax_{‖u‖₂=‖v‖₂=1} u^T (−∇R_{B_t}(W_θ)) v

then perform a 1D line-search along θ_{i(t)}:

(θ_{t+1,1}, ..., θ_{t+1,i(t)}, ...) ← (θ_{t,1}, ..., θ_{t,i(t)}, ...) + δ (0, ..., 1, ...)

Zaid Harchaoui (INRIA) LL July 25th 2013 60 / 75

slide-61
SLIDE 61

Decomposition over latent structure

Stochastic atom descent

Algorithm
Initialize: θ = 0
Iterate: pick a random mini-batch B_t and find the coordinate θ_i of steepest descent:

i(t) = Argmax_i ∂I_{B_t}(θ)/∂θ_i = Argmax_i ⟨u_i v_i^T, −∇R_{B_t}(W_θ)⟩ = Argmax_{‖u‖₂=‖v‖₂=1} u^T (−∇R_{B_t}(W_θ)) v

then perform a 1D line-search along θ_{i(t)}:

θ_{t+1} ← θ_t + δ e_{i(t)}

Zaid Harchaoui (INRIA) LL July 25th 2013 61 / 75

slide-62
SLIDE 62

Decomposition over latent structure

Stochastic atom descent

Algorithm
Initialize: θ = 0
Iterate: pick a random mini-batch B_t and find the coordinate θ_i of steepest descent
Finding i(t) corresponds to finding the top singular vectors u_1 and v_1 of −∇R_{B_t}(W_θ)
Runtime: O(dk) (a few Power/Lanczos iterations)
Descend along θ_{i(t)}:

θ_{t+1} ← θ_t + δ e_{i(t)},    W_{t+1} ← W_t + δ u_t v_t^T

Periodically minimize I(θ) over supp(θ) up to optimality.
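A simplified sketch of one stochastic atom-descent iteration, assuming callables `risk` and `grad_risk` that evaluate the empirical risk and its gradient on the current mini-batch; the crude grid search over δ stands in for the exact 1D line search and the periodic refit of the slides:

```python
import numpy as np

def top_singular_pair(G, n_iter=30, seed=0):
    """Leading singular vectors of G via a few power iterations (the cheap O(dk)-per-pass step)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=G.shape[1])
    v /= np.linalg.norm(v)
    u = G @ v
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        u = G @ v
        u /= np.linalg.norm(u)
        v = G.T @ u
        v /= np.linalg.norm(v)
    return u, v

def atom_descent_step(W, risk, grad_risk, lam, deltas=np.geomspace(1e-4, 1.0, 20)):
    """One simplified stochastic atom-descent iteration.

    The steepest-descent rank-one atom u v^T is the top singular pair of
    -grad_risk(W); its weight delta is chosen by minimizing the lifted
    objective increase lam*delta + R_Bt(W + delta * u v^T) over a grid.
    """
    u, v = top_singular_pair(-grad_risk(W))
    atom = np.outer(u, v)
    objective = lambda delta: lam * delta + risk(W + delta * atom)
    best = min(deltas, key=objective)
    return W + best * atom if objective(best) < objective(0.0) else W
```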

Zaid Harchaoui (INRIA) LL July 25th 2013 62 / 75

slide-63
SLIDE 63

Decomposition over latent structure

Stochastic atom descent

Algorithm
Initialize: θ = 0
Iterate: pick a random mini-batch B_t and find the coordinate θ_i of steepest descent
Finding i(t) corresponds to finding the top singular vectors u_1 and v_1 of −∇R_{B_t}(W_θ), then descending along θ_{i(t)}
Periodically minimize I(θ) over supp(θ) up to optimality: (quasi-)Newton method with box constraints

Zaid Harchaoui (INRIA) LL July 25th 2013 63 / 75

slide-64
SLIDE 64

Decomposition over latent structure

Stochastic atom descent

Subspace acceleration
Let s = |supp(θ)| be the size of the support of θ at iteration t. The coordinates of θ are re-ordered using indexes j = 1, ..., s.

min_{θ_1,...,θ_s}  λ Σ_{j=1}^s θ_j + R_emp( Σ_{j=1}^s θ_j u_j v_j^T )    subject to  θ_j ≥ 0, j = 1, ..., s

Convex and smooth objective with simple box constraints
Coordinate descent works fine
A quasi-Newton algorithm with box constraints (L-BFGS-B) works fine
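A sketch of this periodic subspace refit with L-BFGS-B, assuming the same `risk`/`grad_risk` callables as in the atom-descent sketch above:

```python
import numpy as np
from scipy.optimize import minimize

def refit_support(theta, atoms, risk, grad_risk, lam):
    """Minimize lam*sum(theta) + R_emp(sum_j theta_j * atoms[j]) over theta >= 0.

    `atoms` is the list of rank-one matrices u_j v_j^T currently in the support;
    `risk` / `grad_risk` evaluate the (smooth) empirical risk and its gradient at W.
    """
    A = np.stack(atoms)                                   # (s, d, k)

    def build_W(th):
        return np.tensordot(th, A, axes=1)                # sum_j theta_j * atoms[j]

    def fun(th):
        return lam * th.sum() + risk(build_W(th))

    def jac(th):
        G = grad_risk(build_W(th))                        # (d, k)
        return lam + np.tensordot(A, G, axes=([1, 2], [0, 1]))   # lam + <atom_j, G>

    res = minimize(fun, theta, jac=jac, method="L-BFGS-B",
                   bounds=[(0.0, None)] * len(theta))
    return res.x
```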

Zaid Harchaoui (INRIA) LL July 25th 2013 64 / 75

slide-65
SLIDE 65

Decomposition over latent structure

Generalization to gauge regularization penalty

Properties
Ω(tW) = t Ω(W) for all W and t ≥ 0
Ω(W + W′) ≤ Ω(W) + Ω(W′) for all W and W′

Additional properties
Assuming 0 ∈ int B, we also have: Ω(W) ≥ 0, with equality if and only if W = 0; {W : Ω(W) ≤ t} = tB for t ≥ 0, i.e. the level sets are compact

Polar duality
Support function: Ω°(G) := sup_{M ∈ B} ⟨M, G⟩ = sup_{M ∈ M} ⟨M, G⟩

Zaid Harchaoui (INRIA) LL July 25th 2013 65 / 75

slide-66
SLIDE 66

Decomposition over latent structure

Beyond trace-norm

Different types of atomic decomposition

M_{ℓ1-norm} = { s e_j e_ℓ^T | s ∈ {−1, 1}, j ∈ {1, ..., d}, ℓ ∈ {1, ..., k} }
M_{ℓ1/ℓ2-norm} = { e_j v^T | j ∈ {1, ..., d}, v ∈ R^k, ‖v‖₂ = 1 }
M_{trace-norm} = { u v^T | u ∈ R^d, v ∈ R^k, ‖u‖₂ = ‖v‖₂ = 1 }

where (e_1, ..., e_d) is the canonical basis of R^d.

Zaid Harchaoui (INRIA) LL July 25th 2013 66 / 75

slide-67
SLIDE 67

Decomposition over latent structure

Experimental results

Benchmark
ImageNet dataset: subsets of classes "Vehicles262", "Fungus134", and "Ungulate183"
Fisher vector image representation (Perronnin & Dance, 2007):
1. Extract SIFT and local color descriptors, reduced to 128 dimensions
2. Train a Gaussian mixture model with 16 centroids → Fisher vectors of dim. 4,096
3. Explicit embedding (Perronnin et al., 2010; Vedaldi & Zisserman, 2010)

Zaid Harchaoui (INRIA) LL July 25th 2013 67 / 75

slide-68
SLIDE 68

Decomposition over latent structure

Experimental results

Computations on each mini-batch
1. Parallelized objective evaluation and gradient evaluation
2. Efficient matrix computations for high-dimensional features

Efficient strategy: training with compression and testing without compression
1. Product quantization of visual descriptors (Jegou et al., 2011)
2. During training, all matrix computations on features are performed in the compressed domain
3. For testing, all matrix computations are performed on uncompressed features
4. Note: compared to train/test without compression, the average loss of performance is only 0.9% on a subset of Vehicles with 10 categories

Zaid Harchaoui (INRIA) LL July 25th 2013 68 / 75

slide-69
SLIDE 69

Decomposition over latent structure

Experimental results

Classification accuracy comparison
Classification accuracy: top-k accuracy, i.e.
Accuracy_top-k = (# images whose correct label lies in the top-k scores) / (total number of images)
Competitors: our approach (TR-Multiclass) and k independently trained one-vs-rest classifiers (OVR)

Zaid Harchaoui (INRIA) LL July 25th 2013 69 / 75

slide-70
SLIDE 70

Decomposition over latent structure

Experimental results

Cheatsheet

OVR: Minimize over W ∈ R^(d×k)
Σ_{ℓ=1}^k λ_ℓ ‖w_ℓ‖₂² + (1/n) Σ_{i=1}^n BinaryHingeLoss_i

L2-Multiclass: Minimize over W ∈ R^(d×k)
λ ‖W‖₂² + (1/n) Σ_{i=1}^n MultinomialLogisticLoss_i

TR-Multiclass: Minimize over W ∈ R^(d×k)
λ ‖σ(W)‖₁ + (1/n) Σ_{i=1}^n MultinomialLogisticLoss_i

Zaid Harchaoui (INRIA) LL July 25th 2013 70 / 75

slide-71
SLIDE 71

Decomposition over latent structure

Experimental results

Zaid Harchaoui (INRIA) LL July 25th 2013 71 / 75

slide-72
SLIDE 72

Decomposition over latent structure

A posteriori low-dimensional embedding

Zaid Harchaoui (INRIA) LL July 25th 2013 72 / 75

slide-73
SLIDE 73

Decomposition over latent structure

Conclusion

Large-scale learning through decomposition Stochastic gradient descent is a decomposition over examples One-vs-rest is a decomposition over categories Stochastic atom descent is a decomposition over latent structure

Zaid Harchaoui (INRIA) LL July 25th 2013 73 / 75

slide-74
SLIDE 74

Decomposition over latent structure

Conclusion

Be your own cook
Mix these decompositions and come up with your own algorithms for new problems
Implement your own code and master the algorithm before using an off-the-shelf implementation

Public code
Download it and start running your own large-scale experiments
Joust SGD: lear.inrialpes.fr/software

Zaid Harchaoui (INRIA) LL July 25th 2013 74 / 75

slide-75
SLIDE 75

Decomposition over latent structure

References

Recent references

Good practices in large-scale learning for image classification, Z. Akata, F. Perronnin, Z. Harchaoui, C. Schmid, TPAMI 2013
Large-scale classification with trace-norm regularization, Z. Harchaoui, M. Douze, M. Paulin, J. Malick, CVPR 2012
Learning with matrix gauge regularizers, M. Dudik, Z. Harchaoui, J. Malick, NIPS Opt. 2011
Image Classification with the Fisher Vector: Theory and Practice, J. Sanchez, F. Perronnin, T. Mensink, J. Verbeek, IJCV 2013
Product quantization for nearest-neighbor search, H. Jegou, M. Douze, C. Schmid, IEEE Trans. PAMI 2011
Efficient additive kernels via explicit feature maps, A. Vedaldi, A. Zisserman, CVPR 2010

Zaid Harchaoui (INRIA) LL July 25th 2013 75 / 75