


1/21

Multi-class to Binary reduction of Large-scale classification Problems

Bikash Joshi

Joint work with Massih-Reza Amini, Ioannis Partalas, Liva Ralaivola, Nicolas Usunier and Eric Gaussier

BigTargets workshop, ECML 2015, September 11, 2015


2/21

Outline

❑ Motivation
❑ Learning objective and reduction strategy
❑ Experimental results
❑ Conclusion



3/21

Multiclass classification: emerging problems

❑ The number of classes, K, in new emerging multiclass problems, for example in text and image classification, may reach $10^5$ to $10^6$ categories.
❑ For example: DMOZ and Wikipedia (used in the experiments later in the talk).


4/21

Large-scale classification: power-law distribution of classes

Collection | K    | d
DMOZ       | 7500 | 594158

[Figure: histogram for DMOZ-7500 of the number of classes (y-axis, 500 to 4000) per number-of-documents bucket (x-axis: 2-5, 6-10, 11-30, 31-100, 101-200, >200).]


5/21

Multiclass classification approaches

❑ Uncombined approaches, e.g. M-SVM or MLP: the number of parameters, M, is at least O(K × d).
❑ Combined approaches based on binary classification:
  ❑ One-vs-One: M ≥ O(K² × d)
  ❑ One-vs-Rest: M ≥ O(K × d)
❑ For K ≫ 1 and d ≫ 1, these traditional approaches do not scale.
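As an illustration of these orders of magnitude (numbers chosen for the example; the experiments later report $K \times d = O(10^9)$): with $K = 10^4$ classes and $d = 10^5$ features, One-vs-Rest already requires $K \times d = 10^9$ parameters, about 4 GB in single precision, and One-vs-One roughly $K/2 = 5000$ times more.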


6/21

Outline

❑ Motivation
❑ Learning objective and reduction strategy
❑ Experimental results
❑ Conclusion


7/21

Learning objective

❑ Setting: large-scale multiclass classification.
❑ Hypothesis: observations $x^y = (x, y) \in \mathcal{X} \times \mathcal{Y}$ are i.i.d. with respect to a distribution $\mathcal{D}$.
❑ For a class of functions $\mathcal{H} = \{h : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}\}$, the instantaneous ranking loss of $h \in \mathcal{H}$ over an example $x^y$ is defined by:

$$e(h, x^y) = \frac{1}{K-1} \sum_{y' \in \mathcal{Y} \setminus \{y\}} \mathbb{1}_{h(x^y) \le h(x^{y'})}$$

❑ The aim is to find a function $h \in \mathcal{H}$ that minimizes the generalization error $L(h)$:

$$L(h) = \mathbb{E}_{x^y \sim \mathcal{D}}\left[e(h, x^y)\right]$$

❑ The empirical error of a function $h \in \mathcal{H}$ over a training set $S = (x_i^{y_i})_{i=1}^m$ is:

$$\hat{L}_m(h, S) = \frac{1}{m} \sum_{i=1}^m e(h, x_i^{y_i})$$
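A minimal sketch (not from the slides) of these two quantities in Python, assuming the scores $h(x^1), \dots, h(x^K)$ of one example are available as a vector:

```python
import numpy as np

def ranking_loss(scores, y):
    """e(h, x^y): fraction of the K - 1 other classes y' whose score
    h(x^{y'}) is at least as large as the true class's score h(x^y)."""
    K = len(scores)
    mistakes = sum(1 for yp in range(K) if yp != y and scores[y] <= scores[yp])
    return mistakes / (K - 1)

def empirical_error(score_matrix, labels):
    """Empirical error over S: mean of e(h, x_i^{y_i}) for i = 1..m,
    with score_matrix of shape (m, K) and labels of length m."""
    return float(np.mean([ranking_loss(s, y) for s, y in zip(score_matrix, labels)]))
```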


8/21

Reduction strategy

❑ Consider the empirical loss:

$$\hat{L}_m(h, S) = \frac{1}{m(K-1)} \sum_{i=1}^m \sum_{y' \in \mathcal{Y} \setminus \{y_i\}} \mathbb{1}_{h(x_i^{y_i}) \le h(x_i^{y'})} = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\tilde{y}_i g(Z_i) \le 0} \doteq \hat{L}^T_n(g, T(S))$$

where $n = m(K-1)$; $Z_i$ is a pair of couples, constituted of the couple (example, its true class) and the couple of the same example with another class; $\tilde{y}_i = 1$ if the first couple in $Z_i$ is the true couple and $-1$ otherwise; and $g(x^y, x^{y'}) = h(x^y) - h(x^{y'})$.
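A minimal sketch (details assumed, not from the slides) of the transformation $T$; the couple order is swapped at random so that both labels $\tilde{y}_i \in \{-1, +1\}$ occur in $T(S)$:

```python
import random

def transform(S, classes, rng=random.Random(0)):
    """T(S): map each (x, y) in S to K - 1 dyadic examples (Z_i, y~_i).
    Z_i pairs the true couple (x, y) with (x, y') for another class y';
    the label is +1 if the true couple comes first and -1 if swapped."""
    T = []
    for x, y in S:
        for y_other in (c for c in classes if c != y):
            if rng.random() < 0.5:
                T.append((((x, y), (x, y_other)), +1))
            else:
                T.append((((x, y_other), (x, y)), -1))
    return T
```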



9/21

Reduction strategy for the class of linear functions

Problems:
❑ How to define $\Phi(x^y)$?
❑ Consistency of the ERM principle with interdependent data.
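For this class, $h(x^y) = \langle w, \Phi(x^y) \rangle$, so $g$ reduces to a dot product with a difference of joint feature vectors, and prediction ranks the classes by their score. A minimal sketch under that assumption (`phi` stands for the joint feature map $\Phi$, discussed below):

```python
import numpy as np

def g(w, phi, x, y, y_other):
    """g(x^y, x^{y'}) = <w, Phi(x^y) - Phi(x^{y'})>: a single binary
    classifier over difference vectors, whatever the number of classes."""
    return float(np.dot(w, phi(x, y) - phi(x, y_other)))

def predict(w, phi, x, classes):
    """Predict the class y whose couple (x, y) receives the highest score."""
    return max(classes, key=lambda y: float(np.dot(w, phi(x, y))))
```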


10/21

Consistency of the ERM principle with interdependent data

❑ Several statistical tools exist for extending concentration inequalities to the case of interdependent data.
❑ Here: tools based on the coloring of dependency graphs, proposed by (Janson, 2004)¹.

[Figure: from $S = \{x_1^1, x_2^2, x_3^3\}$ ($m = 3$, $K = 3$), the transformed set $T(S)$ contains the six dyadic pairs $(x_1^1, x_1^2)$, $(x_1^1, x_1^3)$, $(x_2^2, x_2^1)$, $(x_2^2, x_2^3)$, $(x_3^3, x_3^1)$, $(x_3^3, x_3^2)$; they split into two independent sets of the dependency graph, $(C_1, \alpha_1 = 1) = \{(x_1^1, x_1^2), (x_2^2, x_2^1), (x_3^3, x_3^1)\}$ and $(C_2, \alpha_2 = 1) = \{(x_1^1, x_1^3), (x_2^2, x_2^3), (x_3^3, x_3^2)\}$, each containing one pair per original example.]

1. S. Janson. Large deviations for sums of partly dependent random variables. Random Structures and Algorithms, 24(3):234–248, 2004.
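A minimal sketch (an illustration, not from the slides) of such a cover: the $n = m(K-1)$ dyadic pairs split into $K-1$ sets, each holding one pair per original example, so the pairs inside a set involve distinct i.i.d. examples and are independent:

```python
def fractional_cover(m, K):
    """Split the indices 0..m(K-1)-1 of the dyadic pairs into K - 1 sets;
    set j holds the j-th pair of every original example, mirroring the
    (C1, alpha1 = 1) and (C2, alpha2 = 1) sets in the figure above."""
    return [[i * (K - 1) + j for i in range(m)] for j in range(K - 1)]
```

For $m = 3$ and $K = 3$ this gives the two sets of three pairs shown in the figure.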

11/21

Theorem

Let $S = (x_i^{y_i})_{i=1}^m \in (\mathcal{X} \times \mathcal{Y})^m$ be a training set constituted of $m$ examples generated i.i.d. with respect to a probability distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, and $T(S) = ((Z_i, \tilde{y}_i))_{i=1}^n \in (\mathcal{Z} \times \{-1, 1\})^n$ the transformed set obtained with the application $T$. Let $\kappa : (\mathcal{X} \times \mathcal{Y})^2 \to \mathbb{R}$ be a PSD kernel, and $\Phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{H}$ the associated mapping function. For all $1 > \delta > 0$ and all $g_w \in \mathcal{G}_B = \{x \mapsto \langle w, \Phi(x) \rangle \mid \|w\| \le B\}$, with probability at least $(1 - \delta)$ over $T(S)$ we then have:

$$L^T(g_w) \le \hat{L}^T_n(g_w, T(S)) + \frac{2 B \, G(T(S))}{m\sqrt{K-1}} + 3\sqrt{\frac{\ln(2/\delta)}{2m}} \qquad (1)$$

where $\hat{L}^T_n(g_w, T(S)) = \frac{1}{n} \sum_{i=1}^n \mathcal{L}(\tilde{y}_i g_w(Z_i))$ with the surrogate hinge loss $\mathcal{L} : t \mapsto \min(1, \max(1 - t, 0))$, $L^T(g_w) = \mathbb{E}_{T(S)}[\hat{L}^T_n(g_w, T(S))]$, and $G(T(S)) = \sqrt{\sum_{i=1}^n d_\kappa(Z_i)}$ with

$$d_\kappa(x^y, x^{y'}) = \kappa(x^y, x^y) + \kappa(x^{y'}, x^{y'}) - 2\kappa(x^y, x^{y'})$$
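A minimal sketch of evaluating the right-hand side of (1) with a linear kernel; it assumes $G(T(S)) = \sqrt{\sum_i d_\kappa(Z_i)}$ (the square root is inferred from the $\sqrt{n}$ bound on the next slide) and that each row of `Z` holds a difference vector $\Phi(x^y) - \Phi(x^{y'})$, so that $d_\kappa(Z_i) = \|Z_i\|^2$:

```python
import numpy as np

def bound_rhs(w, Z, y_tilde, B, m, K, delta):
    """Right-hand side of (1): empirical hinge term + complexity term
    + confidence term, for the linear kernel where d_kappa(Z_i) = ||Z_i||^2.
    w: weight vector; Z: (n, d) array of difference vectors; y_tilde: labels."""
    margins = y_tilde * (Z @ w)                   # y~_i * g_w(Z_i)
    hinge = np.minimum(1.0, np.maximum(1.0 - margins, 0.0)).mean()
    G = np.sqrt((Z ** 2).sum())                   # G(T(S)) = sqrt(sum_i d_kappa(Z_i))
    return hinge + 2 * B * G / (m * np.sqrt(K - 1)) \
        + 3 * np.sqrt(np.log(2 / delta) / (2 * m))
```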


12/21

Key Features of the Algorithm

❑ Data-dependent bound: if the dimension of the feature representation of (x, y) couples is independent of the original dimension, then:

$$G(T(S)) \le \sqrt{n} \times \text{Constant} = \sqrt{m(K-1)} \times \text{Constant}$$

❑ Non-trivial joint feature representation (example-class couples)
❑ The same model for any number of classes
❑ A single parameter vector shared by all classes


13/21

Outline

❑ Motivation
❑ Learning objective and reduction strategy
❑ Experimental results
❑ Conclusion


14/21

Feature representation $\Phi(x^y)$

Features:
1. $\sum_{t \in y \cap x} \ln(1 + y_t)$
2. $\sum_{t \in y \cap x} \ln\left(1 + \frac{l_S}{S_t}\right)$
3. $\sum_{t \in y \cap x} I_t$
4. $\sum_{t \in y \cap x} \ln\left(1 + \frac{y_t}{|y|}\right)$
5. $\sum_{t \in y \cap x} \ln\left(1 + \frac{y_t}{|y|} \cdot I_t\right)$
6. $\sum_{t \in y \cap x} \ln\left(1 + \frac{y_t}{|y|} \cdot \frac{l_S}{S_t}\right)$
7. $\sum_{t \in y \cap x} 1$
8. $\sum_{t \in y \cap x} \frac{y_t}{|y|} \cdot I_t$
9. $d_1(x^y)$
10. $d_2(x^y)$

Notation:
❑ $x_t$: number of occurrences of term $t$ in document $x$,
❑ $V$: number of distinct terms in $S$,
❑ $y_t = \sum_{x \in y} x_t$, $|y| = \sum_{t \in V} y_t$, $S_t = \sum_{x \in S} x_t$, $l_S = \sum_{t \in V} S_t$,
❑ $I_t$: idf of term $t$.
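A minimal sketch (data structures assumed, not from the slides) of features 1-8 in Python; features 9 and 10 ($d_1$, $d_2$) are omitted since the slide does not define them:

```python
import numpy as np

def joint_features(x_counts, y_counts, idf, S_t, l_S):
    """Features 1-8 of Phi(x^y). x_counts: term -> occurrences in document x;
    y_counts: term -> y_t, occurrences over all documents of class y;
    idf: term -> I_t; S_t: term -> corpus count of t; l_S: total corpus count."""
    common = set(x_counts) & set(y_counts)   # terms t in y ∩ x
    y_size = sum(y_counts.values())          # |y|
    f = np.zeros(8)
    for t in common:
        yt, It, ls_st = y_counts[t], idf[t], l_S / S_t[t]
        f[0] += np.log(1 + yt)                     # feature 1
        f[1] += np.log(1 + ls_st)                  # feature 2
        f[2] += It                                 # feature 3
        f[3] += np.log(1 + yt / y_size)            # feature 4
        f[4] += np.log(1 + (yt / y_size) * It)     # feature 5
        f[5] += np.log(1 + (yt / y_size) * ls_st)  # feature 6
        f[6] += 1.0                                # feature 7
        f[7] += (yt / y_size) * It                 # feature 8
    return f
```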


15/21

Experimental results on text classification

Collection | K    | d      | m      | Test size
DMOZ       | 7500 | 594158 | 394756 | 104263
WIKIPEDIA  | 7500 | 346299 | 456886 | 81262

$K \times d = O(10^9)$

❑ Random samples of 100, 500, 1000, 3000, 5000 and 7500 classes.


16/21

Experimental Setup

Implementation and comparison:
❑ SVM with a linear kernel as the binary classification algorithm (sketched below)
❑ Value of C chosen by cross-validation
❑ Comparison with OVA, OVO, M-SVM and LogT

Performance evaluation:
❑ Accuracy: proportion of correctly classified examples in the test set
❑ Macro F-measure: harmonic mean of macro precision and macro recall
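The slides do not name the SVM implementation; a minimal sketch with scikit-learn (an assumption) that trains the single linear binary SVM on the reduced set and selects C by cross-validation:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Placeholder for the reduced binary set T(S): in practice Z holds the
# difference vectors Phi(x^y) - Phi(x^{y'}) and y_tilde the labels {-1, +1}.
rng = np.random.RandomState(0)
Z = rng.randn(200, 10)
y_tilde = rng.choice([-1, 1], size=200)

search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(Z, y_tilde)
w = search.best_estimator_.coef_.ravel()  # one weight vector shared by all classes
```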


17/21

Experimental Results

Results for 7500 classes:
❑ OVO and M-SVM did not scale to 7500 classes.
❑ Nc: proportion of classes for which at least one true-positive document is found.
❑ mRb covers 6-9.5% more classes than OVA (500-700 classes).


18/21

# of Classes vs. Macro F-Measure

[Figure: Macro F-measure as a function of the number of classes.]


19/21

# of Classes vs. Macro F-Measure

[Figure: Macro F-measure as a function of the number of classes, continued.]


20/21

Conclusion

❑ A new method for large-scale multiclass classification, based on reducing the multiclass problem to binary classification.
❑ The efficiency of the derived algorithm is comparable to or better than state-of-the-art multiclass classification approaches.