

SLIDE 1

Similarity Learning for Provably Accurate Sparse Linear Classification

(ICML 2012)

Aurélien Bellet, Amaury Habrard, Marc Sebban

Laboratoire Hubert Curien, UMR CNRS 5516, Université de Saint-Étienne

Alicante - September 2012

Bellet, Habrard and Sebban (LaHC) Similarity Learning for Linear Classification Alicante September 2012 1 / 34

SLIDE 2

Introduction: Supervised Classification, Similarity Learning

SLIDE 3


Similarity functions in classification

Common approach in supervised classification: learn to classify objects using a pairwise similarity (or distance) function.

Successful examples: k-Nearest Neighbor (k-NN), Support Vector Machines (SVM).


Best way to get a “good” similarity function for a specific task: learn it from data!


SLIDE 4


Similarity learning overview

Learning a similarity function K(x, x′) that induces a new instance space in which the performance of a given algorithm is improved.


Very popular approach

Learn a positive semi-definite (PSD) matrix M ∈ R^{d×d} that parameterizes a (squared) Mahalanobis distance

d²_M(x, x′) = (x − x′)ᵀ M (x − x′)

according to local constraints.
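As a quick illustration of the distance above, a minimal numpy sketch (the matrix values are arbitrary, not learned from constraints):

```python
import numpy as np

# Build an arbitrary PSD matrix M = L L^T; any matrix of this form is PSD.
rng = np.random.default_rng(0)
L = rng.standard_normal((3, 3))
M = L @ L.T

def mahalanobis_sq(x, x_prime, M):
    """Squared Mahalanobis distance d^2_M(x, x') = (x - x')^T M (x - x')."""
    diff = x - x_prime
    return float(diff @ M @ diff)

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.5, 1.0, 1.5])
d2 = mahalanobis_sq(x, y, M)
# PSD M guarantees d2 >= 0; with M = I it reduces to squared Euclidean distance.
```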


SLIDE 5


Mahalanobis distance learning

Existing methods typically use two types of constraints (derived from the labels): equivalence constraints (x and x′ are similar/dissimilar), or relative constraints (x is more similar to x′ than to x′′).

Goal: find the M that best satisfies the constraints. d_M is then plugged into a k-NN classifier (or a clustering algorithm) and is expected to improve results w.r.t. the Euclidean distance.


SLIDE 6


Motivation of our work

Limitations of Mahalanobis distance learning

Must enforce M ⪰ 0 (costly). No theoretical link between the learned metric and the error of the classifier. d_M is learned using local constraints: this works well in practice with k-NN (which is based on a local neighborhood), but is arguably not appropriate for global classifiers.

Goal of our work

Learn a non-PSD similarity function, designed to improve global linear classifiers, with theoretical guarantees on the classifier error, building on the theory of (ǫ, γ, τ)-good similarity functions.


SLIDE 7

(ǫ, γ, τ)-Good Similarity Functions

SLIDE 8


Definition

The theory of Balcan et al. (2006, 2008) bridges the gap between the properties of a similarity function and its performance in linear classification. They proposed the following definition.

Definition (Balcan et al., 2008)

A similarity function K ∈ [−1, 1] is an (ǫ, γ, τ)-good similarity function for a learning problem P if there exists an indicator function R(x) defining a set of "reasonable points" such that the following conditions hold:

1. A 1 − ǫ probability mass of examples (x, ℓ) satisfy E_{(x′,ℓ′)∼P}[ℓℓ′K(x, x′) | R(x′)] ≥ γ,

2. Pr_{x′}[R(x′)] ≥ τ,

where ǫ, γ, τ ∈ [0, 1].
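To make the two conditions concrete, a small numpy sketch that checks them empirically on a toy sample (the data, the similarity, and the choice of reasonable points are all illustrative assumptions):

```python
import numpy as np

# Toy labeled sample with norm <= 1 inputs, so K(x, x') = <x, x'> lies in [-1, 1].
X = np.array([[0.9, 0.1], [0.8, 0.2], [-0.9, -0.1], [-0.7, -0.3]])
labels = np.array([1, 1, -1, -1])
reasonable = np.array([True, False, True, False])  # indicator R(x')

def K(x, x_prime):
    return float(x @ x_prime)

gamma = 0.1
margins = []
for x, l in zip(X, labels):
    # E_{(x',l')}[l l' K(x, x') | R(x')]: average over the reasonable points
    vals = [l * lp * K(x, xp) for xp, lp, r in zip(X, labels, reasonable) if r]
    margins.append(np.mean(vals))

epsilon_hat = np.mean([m < gamma for m in margins])  # mass of margin violators
tau_hat = reasonable.mean()                          # empirical Pr[R(x')]
# Here every example meets the margin, so K is (0, 0.1, 0.5)-good on this sample.
```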



SLIDE 10


Intuition behind the definition

[Figure: eight labeled points A–H (positive and negative classes), three of which (A, C and G) are reasonable points.]

K(x, x′) = −‖x − x′‖₂ is good with ǫ = 0, γ = 0.03, τ = 3/8, i.e. for every example (x, ℓ_x):

(ℓ_x / 3) (K(x, A) + K(x, C) − K(x, G)) ≥ 0.03

SLIDE 11


Intuition behind the definition

[Figure: the same eight points A–H; example E now violates the margin condition.]

K(x, x′) = −‖x − x′‖₂ is good with ǫ = 1/8, γ = 0.12, τ = 3/8. With example (E, −1):

(−1/3) (K(E, A) + K(E, C) − K(E, G)) < 0.12


SLIDE 12


Implications for learning

Strategy

Each example is mapped to the space of “the similarity scores with the reasonable points”.

[Figure: examples mapped to the space (K(x, A), K(x, C), K(x, G)).]
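This mapping can be sketched in a few lines of numpy (the similarity and the reasonable points are illustrative placeholders):

```python
import numpy as np

def phi(x, reasonable_points, K):
    """Map x to its similarity scores with the reasonable points:
    phi(x) = (K(x, r_1), ..., K(x, r_m))."""
    return np.array([K(x, r) for r in reasonable_points])

K = lambda a, b: float(a @ b)  # any bounded similarity works here
R = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
x = np.array([0.6, 0.8])
z = phi(x, R, K)  # 3-dimensional: one similarity score per reasonable point
```

A linear separator is then learned directly on these score vectors z.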


SLIDE 13


Implications for learning

Theorem (Balcan et al., 2008)

Given K is (ǫ, γ, τ)-good, there exists a linear separator α in the above-defined projection space that has error close to ǫ at margin γ.

[Figure: the linear separator α in the (K(x, A), K(x, C), K(x, G)) space.]


SLIDE 14


Hinge loss definition

Hinge loss version of the definition.

Definition (Balcan et al., 2008)

A similarity function K is an (ǫ, γ, τ)-good similarity function in hinge loss for a learning problem P if there exists a (random) indicator function R(x) defining a (probabilistic) set of “reasonable points” such that the following conditions hold:

1. E_{(x,ℓ)∼P}[[1 − ℓg(x)/γ]_+] ≤ ǫ, where g(x) = E_{(x′,ℓ′)∼P}[ℓ′K(x, x′) | R(x′)] and [1 − c]_+ = max(1 − c, 0) is the hinge loss,

2. Pr_{x′}[R(x′)] ≥ τ.

This version penalizes the expected amount of margin violation, which makes it easier to optimize.


SLIDE 15


Learning rule

Learning the separator α with a linear program

min_α  Σ_{i=1}^{d_l} [ 1 − Σ_{j=1}^{d_u} α_j ℓ_i K(x_i, x′_j) ]_+ + λ‖α‖₁

Advantage: sparsity

Thanks to the L1 regularization, α will have some zero coordinates (depending on λ), which makes prediction much faster than (for instance) k-NN.
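As a rough illustration of this learning rule, a numpy sketch that minimizes the L1-regularized empirical hinge loss by subgradient descent (the paper casts it as a linear program; the similarity matrix S, the toy labels, and the hyperparameters here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 40, 5                      # labeled examples, landmark points
S = rng.uniform(-1, 1, (n, m))    # S[i, j] = K(x_i, x'_j), precomputed scores
labels = np.sign(S[:, 0] + 1e-9)  # toy labels perfectly tied to landmark 0
lam, lr = 0.01, 0.1               # L1 weight and step size

alpha = np.zeros(m)
for _ in range(500):
    margins = labels * (S @ alpha)
    active = margins < 1                       # hinge-loss violators
    if active.any():
        grad = -(labels[active, None] * S[active]).sum(axis=0) / n
    else:
        grad = np.zeros(m)
    alpha -= lr * (grad + lam * np.sign(alpha))  # subgradient of the L1 term

pred = np.sign(S @ alpha)
accuracy = (pred == labels).mean()
```

Since the labels depend only on landmark 0, the learned α concentrates its weight there, which is exactly the sparsity the L1 term is meant to produce.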


SLIDE 16


L1-norm and Sparsity

Why does an L1-norm constraint/regularization induce sparsity? Geometric interpretation: the L1 ball is a polytope whose corners lie on the coordinate axes, so the optimum under an L1 constraint often lands at a point with zero coordinates, unlike under the smooth L2 ball. [Figure: level sets of the loss meeting the L1 (diamond) and L2 (circle) constraint sets.]
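A small numerical demonstration of this effect on a synthetic least-squares problem (everything here is an illustrative assumption): the L1 solution, computed by ISTA soft-thresholding, zeroes out irrelevant coefficients exactly, while the L2 (ridge) solution only shrinks them.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
w_true = np.zeros(10)
w_true[:2] = [3.0, -2.0]                     # only 2 relevant features
y = X @ w_true + 0.1 * rng.standard_normal(100)

lam = 50.0
step = 1.0 / np.linalg.norm(X, 2) ** 2       # 1/L, L = Lipschitz constant of grad
w_l1 = np.zeros(10)
for _ in range(2000):                        # ISTA iterations
    g = w_l1 - step * (X.T @ (X @ w_l1 - y))
    w_l1 = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # soft threshold

w_l2 = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)      # ridge

n_zero_l1 = int(np.sum(np.abs(w_l1) < 1e-8))  # exact zeros under L1
n_zero_l2 = int(np.sum(np.abs(w_l2) < 1e-8))  # ridge never hits exact zero
```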


SLIDE 17

Learning Good Similarity Functions for Linear Classification

SLIDE 18


Form of similarity function

We propose to optimize a bilinear similarity K_A:

K_A(x, x′) = xᵀ A x′,

parameterized by the matrix A ∈ R^{d×d} (not required to be PSD or symmetric). K_A is efficiently computable for sparse inputs. To ensure K_A ∈ [−1, 1], we assume the inputs are normalized such that ‖x‖₂ ≤ 1, and we require ‖A‖_F ≤ 1.
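A minimal sketch of this similarity, with an arbitrary (not learned) matrix A, checking the claimed bound: with ‖x‖₂ ≤ 1 and ‖A‖_F ≤ 1, Cauchy–Schwarz gives |xᵀAx′| ≤ ‖A‖_F.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
A /= np.linalg.norm(A)           # Frobenius norm is numpy's matrix default

def K_A(x, x_prime, A):
    """Bilinear similarity K_A(x, x') = x^T A x'."""
    return float(x @ A @ x_prime)

x = rng.standard_normal(4); x /= np.linalg.norm(x)  # ||x||_2 = 1
y = rng.standard_normal(4); y /= np.linalg.norm(y)
s = K_A(x, y, A)
# A is generally neither symmetric nor PSD, yet s stays in [-1, 1].
```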


SLIDE 19


Empirical goodness

Goal

Optimize the (ǫ, γ, τ)-goodness of KA on a finite-size sample.

Notations

Given a training sample T = {z_i = (x_i, ℓ_i)}_{i=1}^{N_T}, a subsample R ⊆ T of N_R reasonable points, and a margin γ,

V(A, z_i, R) = [ 1 − (ℓ_i / (γ N_R)) Σ_{k=1}^{N_R} ℓ_k K_A(x_i, x_k) ]_+

is the empirical goodness of K_A w.r.t. a single training point z_i ∈ T, and

ǫ_T = (1/N_T) Σ_{i=1}^{N_T} V(A, z_i, R)

is the empirical goodness over T.
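These two quantities translate directly into code; a toy sketch (the data, the choice of reasonable points, and A = I are illustrative assumptions, not learned values):

```python
import numpy as np

def V(A, x_i, l_i, R_x, R_l, gamma):
    """Hinge loss of z_i's average signed bilinear similarity to R."""
    scores = [l_i * l_k * (x_i @ A @ x_k) for x_k, l_k in zip(R_x, R_l)]
    return max(0.0, 1.0 - sum(scores) / (gamma * len(R_x)))

X = np.array([[0.9, 0.0], [-0.9, 0.1], [0.8, 0.3]])
labels = np.array([1, -1, 1])
R_x, R_l = X[:2], labels[:2]   # first two points chosen as reasonable set
A = np.eye(2)                  # arbitrary similarity matrix for illustration
gamma = 0.5

eps_T = np.mean([V(A, x, l, R_x, R_l, gamma) for x, l in zip(X, labels)])
# On this toy sample every point clears the margin, so eps_T is zero.
```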


SLIDE 20


Formulation

SLLC (Similarity Learning for Linear Classification)

min_{A ∈ R^{d×d}}  ǫ_T + β ‖A‖²_F

where β is a regularization parameter. SLLC can be cast as a convex QP and solved efficiently, with only one constraint per training example. It is very different from classic metric learning approaches: the similarity constraints must be satisfied only on average, and the similarity is global (the same R is used for all training examples).


SLIDE 21


Kernelization

Our approach is very simple: learn a global linear similarity, then use it to learn a global linear classifier. It would be interesting to learn more powerful similarities and classifiers, so we kernelize SLLC to learn in a nonlinear feature space induced by a kernel. This is done with the KPCA trick (Chatpatanasiri et al., 2010): project the data into the kernel space and reduce dimensionality, then apply SLLC in this new feature space. The method is not prone to overfitting (coming up: confirmation by theory and experiments).
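A minimal sketch of the KPCA projection step, assuming an RBF kernel (the bandwidth and dimensionality choices are illustrative; the projected coordinates would then be fed to SLLC in place of the raw inputs):

```python
import numpy as np

def kpca(X, n_components, sigma=1.0):
    """Project X onto the top principal components of an RBF kernel space."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma ** 2))
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                           # center the kernel matrix
    vals, vecs = np.linalg.eigh(Kc)          # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[idx], vecs[:, idx]
    return vecs * np.sqrt(np.maximum(vals, 0.0))  # projected coordinates

X = np.random.default_rng(0).standard_normal((30, 2))
Z = kpca(X, n_components=5)  # Z replaces X as input to the linear learner
```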


SLIDE 22


Theoretical analysis

We want to bound the goodness in generalization of our learned similarity,

ǫ = E_{z=(x,ℓ)∼P} V(A, z, R),

by its empirical goodness

ǫ̂_T = (1/N_T) Σ_{i=1}^{N_T} V(A, z_i, R).

This is a non-i.i.d. setting, because R is drawn from T.


SLIDE 23


Theoretical analysis (continued)

Uniform stability (Bousquet & Elisseeff, 2002)

Idea: study the impact of a small change in the training sample. For all T and all i,

sup_z |V(A, z, R) − V(A^i, z, R^i)| ≤ κ/N_T,

where T^i is the sample obtained by replacing z_i ∈ T with an example z′_i independent from T, R^i is the set of reasonable points associated with T^i, and A^i is the matrix learned from T^i and R^i.

Theorem (Bousquet & Elisseeff, 2002)

If an algorithm has uniform stability, then it has generalization guarantees.


SLIDE 24


Theoretical analysis (continued)

Theorem: SLLC has uniform stability κ/N_T

With probability 1 − δ,

κ = (1/γ) (1/(βγ) + 2/τ̂) = (τ̂ + 2βγ) / (τ̂ β γ²),

where β is the regularization parameter, γ the margin and τ̂ the proportion of reasonable points in the training sample.

Theorem: Generalization bound, with convergence in O(√(1/N_T))

ǫ ≤ ǫ̂_T + κ/N_T + (2κ + 1) √(ln(1/δ) / (2 N_T))

⇒ Guarantee on the error of the classifier, with a convergence rate independent of the dimensionality.

SLIDE 25


Experiments


SLIDE 26


Experiments

We conducted experiments on 7 datasets of varying domain, size and difficulty.

              Breast   Iono.   Rings   Pima   Splice   Svmguide1   Cod-RNA
train size       488     245     700    537    1,000       3,089    59,535
test size        211     106     300    231    2,175       4,000   271,617
# dimensions       9      34       2      8       60           4         8
# dim. KPCA       27     102       8     24      180          16        24
# runs           100     100     100    100        1           1         1

We compare SLLC to KI (cosine baseline) and two widely-used Mahalanobis distance learning methods: LMNN and ITML.


SLIDE 27


Experiments (continued)

Linear classification results:

Each entry: classification accuracy (model size).

            Breast          Iono.           Rings           Pima            Splice       Svmguide1    Cod-RNA
KI          96.57 (20.39)   89.81 (52.93)   100.00 (18.20)  75.62 (25.93)   83.86 (362)  96.95 (64)   95.91 (557)
SLLC        96.90 (1.00)    93.25 (1.00)    100.00 (1.00)   75.94 (1.00)    87.36 (1)    96.55 (8)    94.08 (1)
LMNN        96.81 (9.98)    90.21 (13.30)   100.00 (18.04)  75.15 (69.71)   85.61 (315)  95.80 (157)  88.40 (61)
LMNN KPCA   96.01 (8.46)    86.12 (9.96)    100.00 (8.73)   74.92 (22.20)   86.85 (156)  96.53 (82)   95.15 (591)
ITML        96.80 (9.79)    92.09 (9.51)    100.00 (17.85)  75.25 (56.22)   81.47 (377)  96.70 (49)   95.06 (164)
ITML KPCA   96.23 (17.17)   93.05 (18.01)   100.00 (15.21)  75.25 (16.40)   85.29 (287)  96.55 (89)   95.14 (206)

SLLC outperforms KI, LMNN and ITML on 5 out of 7 datasets. Always leads to extremely sparse models.


SLIDE 28


Experiments (continued)

[Figure: test data mapped to each learned projection space, with linear separability scores: KI (0.50), SLLC (1.00), LMNN (0.86), ITML (0.50).]


SLIDE 29


Experiments (continued): another projection space

[Figure: the same visualization in another projection space, with linear separability scores: KI (0.70), SLLC (0.93), LMNN (0.81), ITML (0.97).]


SLIDE 30


Experiments (continued)

k-NN results:

            Breast   Iono.   Rings    Pima   Splice   Svmguide1   Cod-RNA
KI           96.71   83.57   100.00   72.78   77.52       93.93     90.07
SLLC         96.90   93.25   100.00   75.94   87.36       93.82     94.08
LMNN         96.46   88.68   100.00   72.84   83.49       96.23     94.98
LMNN KPCA    96.23   87.13   100.00   73.50   87.59       95.85     94.43
ITML         92.67   88.29   100.00   72.07   77.43       95.97     95.42
ITML KPCA    96.38   87.56   100.00   72.80   84.41       96.80     95.32

Surprisingly, SLLC also outperforms KI, LMNN and ITML on the small datasets.


SLIDE 31

Conclusion

SLIDE 32


Conclusion

Making use of Balcan et al.'s theory and the KPCA trick, we propose a novel similarity learning method that:

• is tailored to linear classifiers,
• has guarantees in terms of the error of the classifier,
• is effective and efficient in practice compared to the state of the art,
• produces extremely sparse models, and
• is robust to overfitting.

Future work could include: exploring other similarities/regularizers (e.g. the nuclear norm), developing a specific solver, and deriving an online algorithm.


SLIDE 33


Thanks


SLIDE 34


Experiments - overfitting

Surprising? Not so much!

[Figure: classification accuracy (74–94%) of SLLC, LMNN and ITML as the KPCA dimension grows from 50 to 250; SLLC remains stable while LMNN and ITML degrade.]

LMNN and ITML overfit the data as dimensionality grows.


SLIDE 35


Experiments (continued): running times

LMNN and ITML have their own specific, sophisticated solvers, while we simply use a standard convex optimization tool. Even so, SLLC is much faster than LMNN (because it has fewer constraints), but remains slower than ITML on the same number of constraints.

            Breast   Iono.   Rings    Pima    Splice   Svmguide1    Cod-RNA
SLLC          4.76    5.36    0.05     4.01   158.38      185.53    2471.25
LMNN         25.99   16.27   37.95    32.14   309.36      331.28   10418.73
LMNN KPCA    41.06   34.57   84.86    48.28  1122.60      369.31   24296.41
ITML          2.09    3.09    0.19     2.96     3.41        0.83       5.98
ITML KPCA     1.68    5.77    0.20     2.74    56.14        5.30      25.25