
Similarity Learning for Provably Accurate Sparse Linear Classification - PowerPoint PPT Presentation



  1. Similarity Learning for Provably Accurate Sparse Linear Classification (ICML 2012). Aurélien Bellet, Amaury Habrard, Marc Sebban. Laboratoire Hubert Curien, UMR CNRS 5516, Université de Saint-Étienne. Alicante, September 2012.

  2. Introduction: Supervised Classification, Similarity Learning

  3. Similarity functions in classification. A common approach in supervised classification is to learn to classify objects using a pairwise similarity (or distance) function. Successful examples: k-Nearest Neighbors (k-NN), Support Vector Machines (SVM). The best way to get a "good" similarity function for a specific task: learn it from data!

  4. Similarity learning overview. Learning a similarity function K(x, x′) that induces a new instance space in which the performance of a given algorithm is improved. A very popular approach: learn a positive semi-definite (PSD) matrix M ∈ R^(d×d) that parameterizes a (squared) Mahalanobis distance d_M^2(x, x′) = (x − x′)^T M (x − x′), according to local constraints.

  5. Mahalanobis distance learning. Existing methods typically use two types of constraints (derived from the labels): equivalence constraints (x and x′ are similar/dissimilar) or relative constraints (x is more similar to x′ than to x′′). Goal: find the M that best satisfies the constraints. d_M is then plugged into a k-NN classifier (or into a clustering algorithm) and is expected to improve results with respect to the Euclidean distance.
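For concreteness, here is a minimal sketch (illustrative code, not from the slides) of how such a learned PSD matrix M is used: the hypothetical helper mahalanobis_sq computes d_M^2(x, x′) = (x − x′)^T M (x − x′), which is what the constraints above and the downstream k-NN classifier would rely on.

```python
import numpy as np

def mahalanobis_sq(x, x_prime, M):
    """Squared Mahalanobis distance d_M^2(x, x') = (x - x')^T M (x - x')."""
    diff = x - x_prime
    return float(diff @ M @ diff)

# Toy usage with an arbitrary PSD matrix (L @ L.T is always PSD).
rng = np.random.default_rng(0)
L = rng.standard_normal((3, 3))
M = L @ L.T
x, x_prime = rng.standard_normal(3), rng.standard_normal(3)
d2 = mahalanobis_sq(x, x_prime, M)

# A hypothetical equivalence constraint "x and x' are similar" could then be
# encoded as d_M^2(x, x') <= u for some threshold u chosen by the learner.
print(d2)
```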

  6. Motivation of our work. Limitations of Mahalanobis distance learning: one must enforce M ⪰ 0 (costly); there is no theoretical link between the learned metric and the error of the classifier; and d_M is learned from local constraints, which works well in practice with k-NN (based on a local neighborhood) but is not really appropriate for global classifiers. Goal of our work: learn a non-PSD similarity function, designed to improve global linear classifiers, with theoretical guarantees on the classifier error, building on the theory of (ε, γ, τ)-good similarity functions.

  7. (ε, γ, τ)-Good Similarity Functions

  8. Definition. The theory of Balcan et al. (2006, 2008) bridges the gap between the properties of a similarity function and its performance in linear classification. They proposed the following definition. Definition (Balcan et al., 2008): a similarity function K ∈ [−1, 1] is an (ε, γ, τ)-good similarity function for a learning problem P, with ε, γ, τ ∈ [0, 1], if there exists an indicator function R(x) defining a set of "reasonable points" such that the following conditions hold: (1) a 1 − ε probability mass of examples (x, ℓ) satisfy E_(x′,ℓ′)∼P[ℓ ℓ′ K(x, x′) | R(x′)] ≥ γ; (2) Pr_x′[R(x′)] ≥ τ.
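To make the definition concrete, a hedged sketch of how (ε, τ) could be estimated on a finite sample (my own illustration; the function name, the toy data, and the choice of reasonable points are hypothetical): average ℓ ℓ′ K(x, x′) over the reasonable points for each example, count the fraction falling below the margin γ, and measure the fraction of reasonable points.

```python
import numpy as np

def empirical_goodness(K, X, y, reasonable, gamma):
    """Estimate (epsilon, tau) for a similarity K on a labeled sample.

    K          : function (x, x') -> similarity value in [-1, 1]
    X, y       : (n, d) examples and (n,) labels in {-1, +1}
    reasonable : (n,) boolean mask marking the "reasonable points" R
    """
    R = np.where(reasonable)[0]
    # Empirical version of E_(x',l')~P[l l' K(x, x') | R(x')] for each example x.
    margins = np.array([
        np.mean([y[i] * y[j] * K(X[i], X[j]) for j in R]) for i in range(len(X))
    ])
    epsilon = np.mean(margins < gamma)   # mass of examples violating the margin
    tau = np.mean(reasonable)            # mass of reasonable points
    return epsilon, tau

# Toy usage with hypothetical data and K(x, x') = -||x - x'||_2 (as on the next slides).
rng = np.random.default_rng(0)
X = rng.uniform(-0.35, 0.35, size=(8, 2))
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])
reasonable = np.array([1, 0, 1, 0, 0, 0, 1, 0], dtype=bool)
K = lambda a, b: -np.linalg.norm(a - b)
print(empirical_goodness(K, X, y, reasonable, gamma=0.03))
```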

  10. Intuition behind the definition. [Figure: eight points A–H in the plane; legend: positive class, negative class, reasonable point.] K(x, x′) = −||x − x′||_2 is good with ε = 0, γ = 0.03, τ = 3/8: ∀(x, ℓ_x), (ℓ_x/3)(K(x, A) + K(x, C) − K(x, G)) ≥ 0.03.

  11. Intuition behind the definition. [Figure: same eight points A–H; legend: positive class, negative class, reasonable point.] K(x, x′) = −||x − x′||_2 is good with ε = 1/8, γ = 0.12, τ = 3/8. With the example (E, −1): (−1/3)(K(E, A) + K(E, C) − K(E, G)) < 0.12.

  12. Implications for learning. Strategy: each example is mapped to the space of "the similarity scores with the reasonable points". [Figure: examples plotted in the space with coordinates K(x, A), K(x, C), K(x, G).]
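The mapping is simple to write down explicitly; a minimal sketch (illustrative, with hypothetical function names), assuming a set of landmark "reasonable" points has already been selected:

```python
import numpy as np

def similarity_map(X, landmarks, K):
    """phi(x) = [K(x, r_1), ..., K(x, r_R)]: similarity scores with the reasonable points."""
    return np.array([[K(x, r) for r in landmarks] for x in X])

def predict(X, landmarks, K, alpha):
    """Linear classifier in the projected space: sign(sum_j alpha_j * K(x, r_j))."""
    return np.sign(similarity_map(X, landmarks, K) @ alpha)
```

A linear separator α in that projected space is exactly what the theorem on the next slide guarantees to exist.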

  13. Implications for learning. Theorem (Balcan et al., 2008): given that K is (ε, γ, τ)-good, there exists a linear separator α in the above-defined projection space that has error close to ε at margin γ. [Figure: examples in the projection space with coordinates K(x, A), K(x, C), K(x, G).]

  14. Hinge loss definition. Hinge-loss version of the definition. Definition (Balcan et al., 2008): a similarity function K is an (ε, γ, τ)-good similarity function in hinge loss for a learning problem P if there exists a (random) indicator function R(x) defining a (probabilistic) set of "reasonable points" such that the following conditions hold: (1) E_(x,ℓ)∼P[[1 − ℓ g(x)/γ]_+] ≤ ε, where g(x) = E_(x′,ℓ′)∼P[ℓ′ K(x, x′) | R(x′)] and [1 − c]_+ = max(1 − c, 0) is the hinge loss; (2) Pr_x′[R(x′)] ≥ τ. The condition now bounds the expected amount of margin violation ⇒ easier to optimize.
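The hinge-loss quantity can be estimated on a sample in the same spirit as before (again an illustrative sketch with hypothetical names, not the authors' code):

```python
import numpy as np

def hinge_goodness(K, X, y, reasonable, gamma):
    """Estimate E[[1 - l * g(x) / gamma]_+] with g(x) = E[l' K(x, x') | R(x')]."""
    R = np.where(reasonable)[0]
    g = np.array([np.mean([y[j] * K(X[i], X[j]) for j in R]) for i in range(len(X))])
    return float(np.mean(np.maximum(0.0, 1.0 - y * g / gamma)))
```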

  15. Learning rule. Balcan et al.'s learning rule: learn the separator α with a linear program: min_α Σ_{i=1..d_l} [1 − Σ_{j=1..d_u} α_j ℓ_i K(x_i, x′_j)]_+ + λ||α||_1. Advantage: sparsity. Thanks to the L1-regularization, α will have some zero coordinates (depending on λ). This makes prediction much faster than (for instance) k-NN.
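The problem above becomes a standard linear program once the hinge losses and the absolute values in ||α||_1 are encoded with auxiliary variables. Below is a sketch of that reformulation using scipy.optimize.linprog (my own encoding of the slide's objective; variable and function names are hypothetical): write α = α⁺ − α⁻ with α⁺, α⁻ ≥ 0, and introduce slacks ξ_i for the hinge terms.

```python
import numpy as np
from scipy.optimize import linprog

def learn_alpha(K_mat, labels, lam):
    """Solve  min_alpha  sum_i [1 - l_i * (K_mat @ alpha)_i]_+  +  lam * ||alpha||_1
    as a linear program.

    K_mat  : (n_l, n_u) matrix with K_mat[i, j] = K(x_i, x'_j)
    labels : (n_l,) array of labels in {-1, +1}
    """
    n_l, n_u = K_mat.shape
    # Variables z = [alpha_plus (n_u), alpha_minus (n_u), xi (n_l)], all >= 0.
    c = np.concatenate([lam * np.ones(2 * n_u), np.ones(n_l)])
    # Hinge constraints: xi_i >= 1 - l_i * sum_j (alpha+_j - alpha-_j) K_ij,
    # rewritten as  -l_i * K_i . (alpha+ - alpha-) - xi_i <= -1.
    LK = labels[:, None] * K_mat
    A_ub = np.hstack([-LK, LK, -np.eye(n_l)])
    b_ub = -np.ones(n_l)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * n_u + n_l))
    z = res.x
    return z[:n_u] - z[n_u:2 * n_u]   # recovered alpha
```

For a sufficiently large λ, many coordinates of the recovered α come out exactly zero, which is the sparsity advantage mentioned above.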

  16. L1-norm and Sparsity. Why does an L1-norm constraint/regularization induce sparsity? Geometric interpretation: [Figure: feasible region of an L2 constraint (a ball) vs. an L1 constraint (a diamond whose corners lie on the coordinate axes).]
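A complementary numerical way to see the same effect (my own illustration, not from the slide): the proximal operator associated with λ||·||_1 is soft-thresholding, which sets coordinates of magnitude below λ exactly to zero, while the analogous L2 operator only shrinks them.

```python
import numpy as np

def prox_l1(v, lam):
    """Soft-thresholding: argmin_a 0.5*||a - v||^2 + lam*||a||_1 (zeros out small coords)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_l2_sq(v, lam):
    """argmin_a 0.5*||a - v||^2 + 0.5*lam*||a||_2^2 (only shrinks, never zeros)."""
    return v / (1.0 + lam)

v = np.array([0.8, -0.05, 0.3, -0.02])
print(prox_l1(v, 0.1))     # small coordinates become exactly zero
print(prox_l2_sq(v, 0.1))  # all coordinates merely rescaled, none exactly zero
```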

  17. Learning Good Similarity Functions for Linear Classification

  18. Form of similarity function. We propose to optimize a bilinear similarity K_A: K_A(x, x′) = x^T A x′, parameterized by the matrix A ∈ R^(d×d) (not constrained to be PSD nor symmetric). K_A is efficiently computable for sparse inputs. To ensure K_A ∈ [−1, 1], we assume the inputs are normalized such that ||x||_2 ≤ 1, and we require ||A||_F ≤ 1.
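A minimal sketch of this bilinear similarity and of the normalization it relies on (illustrative code with hypothetical names):

```python
import numpy as np

def bilinear_similarity(x, x_prime, A):
    """K_A(x, x') = x^T A x'; A need not be PSD nor symmetric."""
    return float(x @ A @ x_prime)

def normalize_rows(X):
    """Scale each row so that ||x||_2 <= 1 (exactly 1 for nonzero rows)."""
    norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    return X / norms

# With ||x||_2 <= 1, ||x'||_2 <= 1 and ||A||_F <= 1, Cauchy-Schwarz gives
# |x^T A x'| <= ||x||_2 * ||A x'||_2 <= ||A||_F * ||x'||_2 <= 1, so K_A is in [-1, 1].
A = np.array([[0.5, 0.2], [-0.3, 0.4]])
A = A / max(np.linalg.norm(A, "fro"), 1.0)   # enforce ||A||_F <= 1
X = normalize_rows(np.array([[3.0, 4.0], [1.0, -2.0]]))
print(bilinear_similarity(X[0], X[1], A))
```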
