

SLIDE 1

Similarity Learning for Provably Accurate Sparse Linear Classification

(ICML 2012)

Aurélien Bellet, Amaury Habrard, Marc Sebban

Laboratoire Hubert Curien, UMR CNRS 5516, Université de Saint-Étienne

Alicante - September 2012

Bellet, Habrard and Sebban (LaHC) Similarity Learning for Linear Classification Alicante September 2012 1 / 34

SLIDE 2

Introduction: Supervised Classification, Similarity Learning

SLIDE 3


Similarity functions in classification

Common approach in supervised classification: learn to classify objects using a pairwise similarity (or distance) function.

Successful examples: k-Nearest Neighbor (k-NN), Support Vector Machines (SVM).


Best way to get a “good” similarity function for a specific task: learn it from data!


SLIDE 4


Similarity learning overview

Learning a similarity function K(x, x′) that induces a new instance space in which the performance of a given algorithm is improved.


Very popular approach

Learn a positive semi-definite (PSD) matrix M ∈ R^{d×d} that parameterizes a (squared) Mahalanobis distance

d²_M(x, x′) = (x − x′)ᵀ M (x − x′)

according to local constraints.
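As a quick illustration of the distance above, a minimal numpy sketch (the matrix values are arbitrary, not learned from constraints):

```python
import numpy as np

# Build an arbitrary PSD matrix M = L L^T; any matrix of this form is PSD.
rng = np.random.default_rng(0)
L = rng.standard_normal((3, 3))
M = L @ L.T

def mahalanobis_sq(x, x_prime, M):
    """Squared Mahalanobis distance d^2_M(x, x') = (x - x')^T M (x - x')."""
    diff = x - x_prime
    return float(diff @ M @ diff)

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.5, 1.0, 1.5])
d2 = mahalanobis_sq(x, y, M)
# PSD M guarantees d2 >= 0; with M = I it reduces to squared Euclidean distance.
```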


SLIDE 5


Mahalanobis distance learning

Existing methods typically use two types of constraints (derived from the labels): equivalence constraints (x and x′ are similar/dissimilar), or relative constraints (x is more similar to x′ than to x′′).

Goal: find the M that best satisfies the constraints. d_M is then plugged into a k-NN classifier (or a clustering algorithm) and is expected to improve results w.r.t. the Euclidean distance.


SLIDE 6


Motivation of our work

Limitations of Mahalanobis distance learning

Must enforce M ⪰ 0 (costly). No theoretical link between the learned metric and the error of the classifier. d_M is learned using local constraints: this works well in practice with k-NN (which is based on a local neighborhood), but is arguably not appropriate for global classifiers.

Goal of our work

Learn a non-PSD similarity function, designed to improve global linear classifiers, with theoretical guarantees on the classifier error, building on the theory of (ǫ, γ, τ)-good similarity functions.


SLIDE 7

(ǫ, γ, τ)-Good Similarity Functions

SLIDE 8


Definition

The theory of Balcan et al. (2006, 2008) bridges the gap between the properties of a similarity function and its performance in linear classification. They proposed the following definition.

Definition (Balcan et al., 2008)

A similarity function K ∈ [−1, 1] is an (ǫ, γ, τ)-good similarity function for a learning problem P if there exists an indicator function R(x) defining a set of "reasonable points" such that the following conditions hold:

1. A 1 − ǫ probability mass of examples (x, ℓ) satisfy E_{(x′,ℓ′)∼P}[ℓℓ′K(x, x′) | R(x′)] ≥ γ,

2. Pr_{x′}[R(x′)] ≥ τ,

where ǫ, γ, τ ∈ [0, 1].
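To make the two conditions concrete, a small numpy sketch that checks them empirically on a toy sample (the data, the similarity, and the choice of reasonable points are all illustrative assumptions):

```python
import numpy as np

# Toy labeled sample with norm <= 1 inputs, so K(x, x') = <x, x'> lies in [-1, 1].
X = np.array([[0.9, 0.1], [0.8, 0.2], [-0.9, -0.1], [-0.7, -0.3]])
labels = np.array([1, 1, -1, -1])
reasonable = np.array([True, False, True, False])  # indicator R(x')

def K(x, x_prime):
    return float(x @ x_prime)

gamma = 0.1
margins = []
for x, l in zip(X, labels):
    # E_{(x',l')}[l l' K(x, x') | R(x')]: average over the reasonable points
    vals = [l * lp * K(x, xp) for xp, lp, r in zip(X, labels, reasonable) if r]
    margins.append(np.mean(vals))

epsilon_hat = np.mean([m < gamma for m in margins])  # mass of margin violators
tau_hat = reasonable.mean()                          # empirical Pr[R(x')]
# Here every example meets the margin, so K is (0, 0.1, 0.5)-good on this sample.
```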



SLIDE 10


Intuition behind the definition

[Figure: eight labeled points A–H (positive and negative classes), three of which (A, C and G) are reasonable points.]

K(x, x′) = −‖x − x′‖₂ is good with ǫ = 0, γ = 0.03, τ = 3/8, i.e. for every example (x, ℓ_x):

(ℓ_x / 3) (K(x, A) + K(x, C) − K(x, G)) ≥ 0.03

SLIDE 11


Intuition behind the definition

[Figure: the same eight points A–H; example E now violates the margin condition.]

K(x, x′) = −‖x − x′‖₂ is good with ǫ = 1/8, γ = 0.12, τ = 3/8. With example (E, −1):

(−1/3) (K(E, A) + K(E, C) − K(E, G)) < 0.12


SLIDE 12


Implications for learning

Strategy

Each example is mapped to the space of “the similarity scores with the reasonable points”.

[Figure: examples mapped to the space (K(x, A), K(x, C), K(x, G)).]
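This mapping can be sketched in a few lines of numpy (the similarity and the reasonable points are illustrative placeholders):

```python
import numpy as np

def phi(x, reasonable_points, K):
    """Map x to its similarity scores with the reasonable points:
    phi(x) = (K(x, r_1), ..., K(x, r_m))."""
    return np.array([K(x, r) for r in reasonable_points])

K = lambda a, b: float(a @ b)  # any bounded similarity works here
R = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
x = np.array([0.6, 0.8])
z = phi(x, R, K)  # 3-dimensional: one similarity score per reasonable point
```

A linear separator is then learned directly on these score vectors z.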


SLIDE 13


Implications for learning

Theorem (Balcan et al., 2008)

Given K is (ǫ, γ, τ)-good, there exists a linear separator α in the above-defined projection space that has error close to ǫ at margin γ.

[Figure: the linear separator α in the (K(x, A), K(x, C), K(x, G)) space.]


SLIDE 14


Hinge loss definition

Hinge loss version of the definition.

Definition (Balcan et al., 2008)

A similarity function K is an (ǫ, γ, τ)-good similarity function in hinge loss for a learning problem P if there exists a (random) indicator function R(x) defining a (probabilistic) set of “reasonable points” such that the following conditions hold:

1. E_{(x,ℓ)∼P}[[1 − ℓg(x)/γ]_+] ≤ ǫ, where g(x) = E_{(x′,ℓ′)∼P}[ℓ′K(x, x′) | R(x′)] and [1 − c]_+ = max(1 − c, 0) is the hinge loss,

2. Pr_{x′}[R(x′)] ≥ τ.

This version penalizes the expected amount of margin violation, which makes it easier to optimize.


SLIDE 15


Learning rule

Learning the separator α with a linear program

min_α  Σ_{i=1}^{d_l} [ 1 − Σ_{j=1}^{d_u} α_j ℓ_i K(x_i, x′_j) ]_+ + λ‖α‖₁

Advantage: sparsity

Thanks to the L1 regularization, α will have some zero coordinates (depending on λ), which makes prediction much faster than (for instance) k-NN.
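As a rough illustration of this learning rule, a numpy sketch that minimizes the L1-regularized empirical hinge loss by subgradient descent (the paper casts it as a linear program; the similarity matrix S, the toy labels, and the hyperparameters here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 40, 5                      # labeled examples, landmark points
S = rng.uniform(-1, 1, (n, m))    # S[i, j] = K(x_i, x'_j), precomputed scores
labels = np.sign(S[:, 0] + 1e-9)  # toy labels perfectly tied to landmark 0
lam, lr = 0.01, 0.1               # L1 weight and step size

alpha = np.zeros(m)
for _ in range(500):
    margins = labels * (S @ alpha)
    active = margins < 1                       # hinge-loss violators
    if active.any():
        grad = -(labels[active, None] * S[active]).sum(axis=0) / n
    else:
        grad = np.zeros(m)
    alpha -= lr * (grad + lam * np.sign(alpha))  # subgradient of the L1 term

pred = np.sign(S @ alpha)
accuracy = (pred == labels).mean()
```

Since the labels depend only on landmark 0, the learned α concentrates its weight there, which is exactly the sparsity the L1 term is meant to produce.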


SLIDE 16


L1-norm and Sparsity

Why does an L1-norm constraint/regularization induce sparsity? Geometric interpretation: the L1 ball is a polytope whose corners lie on the coordinate axes, so the optimum under an L1 constraint often lands at a point with zero coordinates, unlike under the smooth L2 ball. [Figure: level sets of the loss meeting the L1 (diamond) and L2 (circle) constraint sets.]
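A small numerical demonstration of this effect on a synthetic least-squares problem (everything here is an illustrative assumption): the L1 solution, computed by ISTA soft-thresholding, zeroes out irrelevant coefficients exactly, while the L2 (ridge) solution only shrinks them.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
w_true = np.zeros(10)
w_true[:2] = [3.0, -2.0]                     # only 2 relevant features
y = X @ w_true + 0.1 * rng.standard_normal(100)

lam = 50.0
step = 1.0 / np.linalg.norm(X, 2) ** 2       # 1/L, L = Lipschitz constant of grad
w_l1 = np.zeros(10)
for _ in range(2000):                        # ISTA iterations
    g = w_l1 - step * (X.T @ (X @ w_l1 - y))
    w_l1 = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # soft threshold

w_l2 = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)      # ridge

n_zero_l1 = int(np.sum(np.abs(w_l1) < 1e-8))  # exact zeros under L1
n_zero_l2 = int(np.sum(np.abs(w_l2) < 1e-8))  # ridge never hits exact zero
```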


SLIDE 17

Learning Good Similarity Functions for Linear Classification

SLIDE 18


Form of similarity function

We propose to optimize a bilinear similarity K_A:

K_A(x, x′) = xᵀ A x′,

parameterized by the matrix A ∈ R^{d×d} (not required to be PSD or symmetric). K_A is efficiently computable for sparse inputs. To ensure K_A ∈ [−1, 1], we assume the inputs are normalized such that ‖x‖₂ ≤ 1, and we require ‖A‖_F ≤ 1.
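A minimal sketch of this similarity, with an arbitrary (not learned) matrix A, checking the claimed bound: with ‖x‖₂ ≤ 1 and ‖A‖_F ≤ 1, Cauchy–Schwarz gives |xᵀAx′| ≤ ‖A‖_F.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
A /= np.linalg.norm(A)           # Frobenius norm is numpy's matrix default

def K_A(x, x_prime, A):
    """Bilinear similarity K_A(x, x') = x^T A x'."""
    return float(x @ A @ x_prime)

x = rng.standard_normal(4); x /= np.linalg.norm(x)  # ||x||_2 = 1
y = rng.standard_normal(4); y /= np.linalg.norm(y)
s = K_A(x, y, A)
# A is generally neither symmetric nor PSD, yet s stays in [-1, 1].
```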


SLIDE 19


Empirical goodness

Goal

Optimize the (ǫ, γ, τ)-goodness of KA on a finite-size sample.

Notations

Given a training sample T = {z_i = (x_i, ℓ_i)}_{i=1}^{N_T}, a subsample R ⊆ T of N_R reasonable points, and a margin γ,

V(A, z_i, R) = [ 1 − (ℓ_i / (γ N_R)) Σ_{k=1}^{N_R} ℓ_k K_A(x_i, x_k) ]_+

is the empirical goodness of K_A w.r.t. a single training point z_i ∈ T, and

ǫ_T = (1/N_T) Σ_{i=1}^{N_T} V(A, z_i, R)

is the empirical goodness over T.
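These two quantities translate directly into code; a toy sketch (the data, the choice of reasonable points, and A = I are illustrative assumptions, not learned values):

```python
import numpy as np

def V(A, x_i, l_i, R_x, R_l, gamma):
    """Hinge loss of z_i's average signed bilinear similarity to R."""
    scores = [l_i * l_k * (x_i @ A @ x_k) for x_k, l_k in zip(R_x, R_l)]
    return max(0.0, 1.0 - sum(scores) / (gamma * len(R_x)))

X = np.array([[0.9, 0.0], [-0.9, 0.1], [0.8, 0.3]])
labels = np.array([1, -1, 1])
R_x, R_l = X[:2], labels[:2]   # first two points chosen as reasonable set
A = np.eye(2)                  # arbitrary similarity matrix for illustration
gamma = 0.5

eps_T = np.mean([V(A, x, l, R_x, R_l, gamma) for x, l in zip(X, labels)])
# On this toy sample every point clears the margin, so eps_T is zero.
```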


SLIDE 20


Formulation

SLLC (Similarity Learning for Linear Classification)

min_{A ∈ R^{d×d}}  ǫ_T + β ‖A‖²_F

where β is a regularization parameter. SLLC can be cast as a convex QP and solved efficiently, with only one constraint per training example. It is very different from classic metric learning approaches: the similarity constraints must be satisfied only on average, and the similarity is global (the same R is used for all training examples).


SLIDE 21


Kernelization

Our approach is very simple: learn a global linear similarity, then use it to learn a global linear classifier. It would be interesting to learn more powerful similarities and classifiers, so we kernelize SLLC to learn in a nonlinear feature space induced by a kernel. This is done with the KPCA trick (Chatpatanasiri et al., 2010): project the data into the kernel space and reduce dimensionality, then apply SLLC in this new feature space. The method is not prone to overfitting (coming up: confirmation by theory and experiments).
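A minimal sketch of the KPCA projection step, assuming an RBF kernel (the bandwidth and dimensionality choices are illustrative; the projected coordinates would then be fed to SLLC in place of the raw inputs):

```python
import numpy as np

def kpca(X, n_components, sigma=1.0):
    """Project X onto the top principal components of an RBF kernel space."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma ** 2))
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                           # center the kernel matrix
    vals, vecs = np.linalg.eigh(Kc)          # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[idx], vecs[:, idx]
    return vecs * np.sqrt(np.maximum(vals, 0.0))  # projected coordinates

X = np.random.default_rng(0).standard_normal((30, 2))
Z = kpca(X, n_components=5)  # Z replaces X as input to the linear learner
```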


SLIDE 22


Theoretical analysis

We want to bound the goodness in generalization of our learned similarity,

ǫ = E_{z=(x,ℓ)∼P} V(A, z, R),

by its empirical goodness

ǫ̂_T = (1/N_T) Σ_{i=1}^{N_T} V(A, z_i, R).

This is a non-i.i.d. setting, because R is drawn from T.


SLIDE 23


Theoretical analysis (continued)

Uniform stability (Bousquet & Elisseeff, 2002)

Idea: study the impact of a small change in the training sample. For all T and all i,

sup_z |V(A, z, R) − V(A^i, z, R^i)| ≤ κ/N_T,

where T^i is the sample obtained by replacing z_i ∈ T with an example z′_i independent from T, R^i is the set of reasonable points associated with T^i, and A^i is the matrix learned from T^i and R^i.

Theorem (Bousquet & Elisseeff, 2002)

If an algorithm has uniform stability, then it has generalization guarantees.


SLIDE 24


Theoretical analysis (continued)

Theorem: SLLC has uniform stability κ/N_T

With probability 1 − δ,

κ = (1/γ) (1/(βγ) + 2/τ̂) = (τ̂ + 2βγ) / (τ̂ β γ²),

where β is the regularization parameter, γ the margin and τ̂ the proportion of reasonable points in the training sample.

Theorem: Generalization bound, with convergence in O(√(1/N_T))

ǫ ≤ ǫ̂_T + κ/N_T + (2κ + 1) √(ln(1/δ) / (2 N_T))

⇒ Guarantee on the error of the classifier, with a convergence rate independent of the dimensionality.

SLIDE 25


Experiments


SLIDE 26


Experiments

We conducted experiments on 7 datasets of varying domain, size and difficulty.

              Breast   Iono.   Rings   Pima   Splice   Svmguide1   Cod-RNA
train size       488     245     700    537    1,000       3,089    59,535
test size        211     106     300    231    2,175       4,000   271,617
# dimensions       9      34       2      8       60           4         8
# dim. KPCA       27     102       8     24      180          16        24
# runs           100     100     100    100        1           1         1

We compare SLLC to KI (cosine baseline) and two widely-used Mahalanobis distance learning methods: LMNN and ITML.


SLIDE 27


Experiments (continued)

Linear classification results:

Each entry: classification accuracy (model size).

            Breast          Iono.           Rings           Pima            Splice       Svmguide1    Cod-RNA
KI          96.57 (20.39)   89.81 (52.93)   100.00 (18.20)  75.62 (25.93)   83.86 (362)  96.95 (64)   95.91 (557)
SLLC        96.90 (1.00)    93.25 (1.00)    100.00 (1.00)   75.94 (1.00)    87.36 (1)    96.55 (8)    94.08 (1)
LMNN        96.81 (9.98)    90.21 (13.30)   100.00 (18.04)  75.15 (69.71)   85.61 (315)  95.80 (157)  88.40 (61)
LMNN KPCA   96.01 (8.46)    86.12 (9.96)    100.00 (8.73)   74.92 (22.20)   86.85 (156)  96.53 (82)   95.15 (591)
ITML        96.80 (9.79)    92.09 (9.51)    100.00 (17.85)  75.25 (56.22)   81.47 (377)  96.70 (49)   95.06 (164)
ITML KPCA   96.23 (17.17)   93.05 (18.01)   100.00 (15.21)  75.25 (16.40)   85.29 (287)  96.55 (89)   95.14 (206)

SLLC outperforms KI, LMNN and ITML on 5 out of 7 datasets. Always leads to extremely sparse models.


SLIDE 28


Experiments (continued)

[Figure: test data mapped to each learned projection space, with linear separability scores: KI (0.50), SLLC (1.00), LMNN (0.86), ITML (0.50).]


SLIDE 29


Experiments (continued): another projection space

[Figure: the same visualization in another projection space, with linear separability scores: KI (0.70), SLLC (0.93), LMNN (0.81), ITML (0.97).]


SLIDE 30


Experiments (continued)

k-NN results:

            Breast   Iono.   Rings    Pima   Splice   Svmguide1   Cod-RNA
KI           96.71   83.57   100.00   72.78   77.52       93.93     90.07
SLLC         96.90   93.25   100.00   75.94   87.36       93.82     94.08
LMNN         96.46   88.68   100.00   72.84   83.49       96.23     94.98
LMNN KPCA    96.23   87.13   100.00   73.50   87.59       95.85     94.43
ITML         92.67   88.29   100.00   72.07   77.43       95.97     95.42
ITML KPCA    96.38   87.56   100.00   72.80   84.41       96.80     95.32

Surprisingly, SLLC also outperforms KI, LMNN and ITML on the small datasets.


SLIDE 31

Conclusion

SLIDE 32


Conclusion

Making use of Balcan et al.'s theory and the KPCA trick, we propose a novel similarity learning method that:

• is tailored to linear classifiers,
• has guarantees in terms of the error of the classifier,
• is effective and efficient in practice compared to the state of the art,
• produces extremely sparse models, and
• is robust to overfitting.

Future work could include: exploring other similarities/regularizers (e.g. the nuclear norm), developing a specific solver, and deriving an online algorithm.


SLIDE 33


Thanks


SLIDE 34


Experiments - overfitting

Surprising? Not so much!

[Figure: classification accuracy (74–94%) of SLLC, LMNN and ITML as the KPCA dimension grows from 50 to 250; SLLC remains stable while LMNN and ITML degrade.]

LMNN and ITML overfit the data as dimensionality grows.


SLIDE 35


Experiments (continued): running times

LMNN and ITML have their own specific, sophisticated solvers, while we simply use a standard convex optimization tool. Even so, SLLC is much faster than LMNN (because it has fewer constraints), but remains slower than ITML on the same number of constraints.

            Breast   Iono.   Rings    Pima    Splice   Svmguide1    Cod-RNA
SLLC          4.76    5.36    0.05     4.01   158.38      185.53    2471.25
LMNN         25.99   16.27   37.95    32.14   309.36      331.28   10418.73
LMNN KPCA    41.06   34.57   84.86    48.28  1122.60      369.31   24296.41
ITML          2.09    3.09    0.19     2.96     3.41        0.83       5.98
ITML KPCA     1.68    5.77    0.20     2.74    56.14        5.30      25.25