SLIDE 1

Distance Metric Learning: Beyond 0/1 Loss

Praveen Krishnan

CVIT, IIIT Hyderabad

June 14, 2017

SLIDE 2

Outline

◮ Distances and Similarities
◮ Distance Metric Learning
  ◮ Mahalanobis Distances
  ◮ Metric Learning Formulation
  ◮ Mahalanobis metric for clustering
  ◮ Large Margin Nearest Neighbor
◮ Distance Metric Learning using CNNs
  ◮ Siamese Network
    ◮ Contrastive loss function
    ◮ Applications
  ◮ Triplet Network
    ◮ Triplet Loss
    ◮ Applications
  ◮ Mining Triplets
◮ Adaptive Density Discrimination
  ◮ Magnet loss

SLIDE 3

Distances and Similarities

Distance Functions

The concept of a distance function d(·, ·) is inherent to any pattern recognition problem, e.g. clustering (k-means), classification (k-NN, SVM), etc.

Typical Choices

◮ Minkowski distance: $L_p(P, Q) = \left( \sum_i |P_i - Q_i|^p \right)^{1/p}$
◮ Cosine: $L(P, Q) = \frac{P^T Q}{\|P\|\,\|Q\|}$
◮ Earth Mover's distance: uses an optimization algorithm.
◮ Edit distance: uses dynamic programming between sequences.
◮ KL divergence: $KL(P \,\|\, Q) = \sum_i P_i \log \frac{P_i}{Q_i}$ (not symmetric!)
◮ many more ... (depending on the type of problem)
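These definitions are easy to check numerically. A minimal NumPy sketch of three of them (the function names and toy vectors are purely illustrative):

```python
import numpy as np

def minkowski(p, q, order=2):
    """L_p distance: (sum_i |p_i - q_i|^p)^(1/p)."""
    return np.sum(np.abs(p - q) ** order) ** (1.0 / order)

def cosine_similarity(p, q):
    """Cosine of the angle between p and q: p^T q / (|p| |q|)."""
    return p @ q / (np.linalg.norm(p) * np.linalg.norm(q))

def kl_divergence(p, q):
    """KL(P || Q) = sum_i P_i log(P_i / Q_i); note it is not symmetric."""
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(minkowski(p, q, order=2))                  # Euclidean distance
print(cosine_similarity(p, q))
print(kl_divergence(p, q), kl_divergence(q, p))  # asymmetric
```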

SLIDE 4

Distances and Similarities

Choosing the right distance function?

Image Credit: Brian Kulis, ECCV'10 Tutorial on Distance Functions and Metric Learning.

SLIDE 5

Metric Learning

Distance Metric Learning

Learn a function that maps input patterns into a target space such that the simple distance in the target space (Euclidean) approximates the “semantic” distance in the input space.

Figure 1: Hadsell et al., CVPR'06

SLIDE 6

Metric Learning

Many applications

Figure 2: A subset of applications using metric learning.

◮ Scaling to a large number of categories. [Schroff et al., 2015]
◮ Fine-grained classification. [Rippel et al., 2015]
◮ Visualization of high-dimensional data. [van der Maaten and Hinton, 2008]
◮ Ranking and retrieval. [Wang et al., CVPR'14]

SLIDE 7

Properties of a Metric

What defines a metric?

1. Non-negativity: D(P, Q) ≥ 0
2. Identity of indiscernibles: D(P, Q) = 0 iff P = Q
3. Symmetry: D(P, Q) = D(Q, P)
4. Triangle inequality: D(P, Q) ≤ D(P, K) + D(K, Q)

Pseudo/Semi Metric

If the second property is relaxed from "iff" to "if" (so two distinct points may be at distance zero), the function is called a pseudo-metric or semi-metric.

SLIDE 8

Metric learning as learning transformations

◮ Feature weighting
  ◮ Learn weightings over the features, then use a standard distance (e.g. Euclidean) after re-weighting.
◮ Full linear transformation
  ◮ In addition to scaling features, also rotates the data.
  ◮ For transformations to r < d dimensions, this is linear dimensionality reduction.
◮ Non-linear transformation
  ◮ Neural nets
  ◮ Kernelization of linear transformations

Slide Credit: Brian Kulis, ECCV'10 Tutorial on Distance Functions and Metric Learning.

SLIDE 9

Supervised Metric Learning

Main focus of this talk.

◮ Constraints or labels are given to the algorithm, e.g. a set of similarity and dissimilarity constraints.
◮ Recent popular methods use CNN architectures for the non-linear transformation.

Before getting into deep architectures, let us explore some basic and classical works.

SLIDE 10

Mahalanobis Distances

◮ Assume the data is represented as N vectors of length d: X = [x1, x2, · · · , xN].
◮ Squared Euclidean distance:

  $d(x_1, x_2) = \|x_1 - x_2\|_2^2 = (x_1 - x_2)^T (x_1 - x_2)$   (1)

◮ Let $\Sigma = \sum_i (x_i - \mu)(x_i - \mu)^T$ be the covariance of the data.
◮ The original Mahalanobis distance is given as:

  $d_M(x_1, x_2) = (x_1 - x_2)^T \Sigma^{-1} (x_1 - x_2)$   (2)
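A short NumPy sketch of equation (2), estimating Σ from data with np.cov (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # N = 100 points, d = 3
mu = X.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance

def mahalanobis_sq(x1, x2, Sigma_inv):
    """Squared Mahalanobis distance (x1 - x2)^T Sigma^{-1} (x1 - x2), eq. (2)."""
    diff = x1 - x2
    return diff @ Sigma_inv @ diff

print(mahalanobis_sq(X[0], X[1], Sigma_inv))
print(mahalanobis_sq(X[0], mu, Sigma_inv))   # distance of a point from the mean
```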

SLIDE 11

Mahalanobis Distances

Equivalent to applying a whitening transform

Image Credit: Brian Kulis, ECCV'10 Tutorial on Distance Functions and Metric Learning.

SLIDE 12

Mahalanobis Distances

Mahalanobis distances for metric learning

In general, the distance can be parameterized by a d × d positive semi-definite matrix A:

  $d_A(x_1, x_2) = (x_1 - x_2)^T A (x_1 - x_2)$   (3)

Metric learning as linear transformation

Derives a family of metrics over X by computing Euclidean distances after the linear transformation $x' = L^T x$, where $A = L L^T$ (Cholesky decomposition):

  $d_A(x_1, x_2) = \|L^T (x_1 - x_2)\|_2^2$   (4)
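The equivalence between (3) and (4) can be verified numerically; a small sketch, assuming an arbitrary A built as L Lᵀ so that it is PSD by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
L = rng.normal(size=(d, d))
A = L @ L.T                           # A is PSD by construction

x1, x2 = rng.normal(size=d), rng.normal(size=d)
diff = x1 - x2

d_A = diff @ A @ diff                 # (x1 - x2)^T A (x1 - x2), eq. (3)
d_L = np.sum((L.T @ diff) ** 2)       # ||L^T (x1 - x2)||_2^2, eq. (4)
print(np.isclose(d_A, d_L))           # True: the two forms agree
```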

SLIDE 13

Mahalanobis Distances

Why is A positive semi-definite (PSD)?

SLIDE 14

Mahalanobis Distances

Why is A positive semi-definite (PSD)?

◮ If A is not PSD, then dA could be negative.
◮ Suppose v = x1 − x2 is an eigenvector corresponding to a negative eigenvalue λ of A:

  $d_A(x_1, x_2) = (x_1 - x_2)^T A (x_1 - x_2) = v^T A v = \lambda\, v^T v < 0$   (5)
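A short numeric illustration of this argument, using a hand-picked non-PSD A with eigenvalues 1 and −1:

```python
import numpy as np

A = np.diag([1.0, -1.0])     # not PSD: one negative eigenvalue
v = np.array([0.0, 1.0])     # eigenvector of A for lambda = -1
x1, x2 = v, np.zeros(2)      # so that v = x1 - x2

d_A = (x1 - x2) @ A @ (x1 - x2)
print(d_A)                   # -1.0: a "distance" that is negative
```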

SLIDE 15

Metric Learning Formulation

Two main components:

◮ A set of constraints on the distance.
◮ A regularizer on the distance / objective function.

Constrained Case

  $\min_A\ r(A)$ s.t. $c_i(A) \le 0,\ i = 1, \dots, C$; $A \succeq 0$   (6)

Here r is the regularizer; a popular choice is $\|A\|_F^2$. The constraint $A \succeq 0$ enforces positive semi-definiteness.

Unconstrained Case

  $\min_{A \succeq 0}\ r(A) + \lambda \sum_{i=1}^{C} c_i(A)$   (7)
SLIDE 16

Metric Learning Formulation: Defining Constraints

Similarity / Dissimilarity constraints

Given a set S of pairs (xi, xj) of points that should be similar, and a set D of pairs of points that should be dissimilar:

  $d_A(x_i, x_j) \le l \quad \forall (i, j) \in S$
  $d_A(x_i, x_j) \ge u \quad \forall (i, j) \in D$   (8)

Popular in verification problems.

SLIDE 17

Metric Learning Formulation: Defining Constraints

Similarity / Dissimilarity constraints

Given a set S of pairs (xi, xj) of points that should be similar, and a set D of pairs of points that should be dissimilar:

  $d_A(x_i, x_j) \le l \quad \forall (i, j) \in S$
  $d_A(x_i, x_j) \ge u \quad \forall (i, j) \in D$   (8)

Popular in verification problems.

Relative distance constraints

Given a triplet (xi, xj, xk) such that the distance between xi and xj should be smaller than the distance between xi and xk:

  $d_A(x_i, x_j) \le d_A(x_i, x_k) - m$   (9)

Here m is the margin. This form is popular for ranking problems.
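A small sketch that counts how many constraints of each kind a candidate metric A violates; the thresholds l, u and margin m are illustrative placeholders, not values from any paper:

```python
import numpy as np

def d_A(A, x, y):
    diff = x - y
    return diff @ A @ diff

def count_violations(A, S, D, T, l=1.0, u=2.0, m=0.5):
    """S: similar pairs, D: dissimilar pairs, T: (anchor, pos, neg) triplets."""
    sim = sum(d_A(A, x, y) > l for x, y in S)                    # violates eq. (8), similar side
    dis = sum(d_A(A, x, y) < u for x, y in D)                    # violates eq. (8), dissimilar side
    tri = sum(d_A(A, a, p) > d_A(A, a, n) - m for a, p, n in T)  # violates eq. (9)
    return sim, dis, tri

rng = np.random.default_rng(2)
A = np.eye(3)                    # start from the plain Euclidean metric
S = [(rng.normal(size=3), rng.normal(size=3))]
D = [(rng.normal(size=3), rng.normal(size=3))]
T = [tuple(rng.normal(size=3) for _ in range(3))]
print(count_violations(A, S, D, T))
```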

SLIDE 18

Mahalanobis metric for clustering

Key Components

◮ A convex objective function for distance metric learning.
◮ Similar in spirit to linear discriminant analysis.

  $\max_A \sum_{(x_i, x_j) \in D} d_A(x_i, x_j)$
  s.t. $c(A) = \sum_{(x_i, x_j) \in S} d_A(x_i, x_j) \le 1, \quad A \succeq 0$   (10)

◮ Here, D is a set of dissimilar pairs and S is a set of similar pairs.
◮ The objective tries to maximize the sum of dissimilar distances.
◮ The constraint keeps the sum of similar distances small.

Xing et al., NIPS'02

SLIDE 19

Large Margin Nearest Neighbor

Key Components

Learns a Mahalanobis distance metric using:

◮ a convex loss function,
◮ margin maximization,
◮ constraints imposed for accurate kNN classification.
◮ Promotes a local notion of distance instead of a global similarity.

Intuition

◮ Each training input xi should share the same label yi as its k nearest neighbors, and
◮ training inputs with different labels should be widely separated.

Weinberger et al., JMLR'09

SLIDE 20

Large Margin Nearest Neighbor

Target Neighbors

Chosen using prior knowledge, or computed as the k nearest neighbors under the Euclidean distance. Target neighbors do not change during training.

Impostors

Differently labeled inputs that invade the perimeter plus a unit margin:

  $\|L^T (x_i - x_l)\|^2 \le \|L^T (x_i - x_j)\|^2 + 1$   (11)

Here xi and xj share the label yi, and xl is an impostor with label $y_l \neq y_i$.

Weinberger et al., JMLR'09

SLIDE 21

Large Margin Nearest Neighbor

Loss Function

$\varepsilon_{pull}(L) = \sum_{j \rightsquigarrow i} \|L^T (x_i - x_j)\|^2$

$\varepsilon_{push}(L) = \sum_{i,\, j \rightsquigarrow i} \sum_{l} (1 - y_{il}) \big[ 1 + \|L^T (x_i - x_j)\|^2 - \|L^T (x_i - x_l)\|^2 \big]_+$   (12)

Here $j \rightsquigarrow i$ means xj is a target neighbor of xi, $y_{il} = 1$ iff $y_i = y_l$, and $[z]_+ = \max(0, z)$ denotes the standard hinge loss.

$\varepsilon(L) = (1 - \mu)\, \varepsilon_{pull}(L) + \mu\, \varepsilon_{push}(L)$   (13)

Here (xi, xj, xl) forms a triplet sample. The loss above is non-convex in L; the original paper discusses a convex reformulation (in terms of M = L Lᵀ) using semidefinite programming.

Weinberger et. al. JMLR’09

SLIDE 22

Distance Metric Learning using CNNs

SLIDE 23

Distance Metric Learning using CNNs

Siamese Network

Siamese is an informal term for conjoined or fused.

◮ Contains two or more identical sub-networks with a shared set of parameters and weights.
◮ Popularly used for similarity learning tasks such as verification and ranking.

Figure 3: Signature verification. Bromley et al., NIPS'93

SLIDE 24

Siamese Architecture

Given a family of functions $G_W(X)$ parameterized by W, find W such that the similarity metric $D_W(X_1, X_2)$ is small for similar pairs and large for dissimilar pairs:

  $D_W(X_1, X_2) = \|G_W(X_1) - G_W(X_2)\|$   (14)

Chopra et al., CVPR'05 and Hadsell et al., CVPR'06

SLIDE 25

Contrastive Loss Function

Let $X_1, X_2 \in I$ be a pair of input vectors and Y be the binary label, where Y = 0 means the pair is similar and Y = 1 means dissimilar. We define a parameterized distance function $D_W$ as:

  $D_W(X_1, X_2) = \|G_W(X_1) - G_W(X_2)\|_2$   (15)

The contrastive loss function is given as:

  $L(W, Y, X_1, X_2) = (1 - Y)\, \tfrac{1}{2} (D_W)^2 + Y\, \tfrac{1}{2} \{\max(0,\, m - D_W)\}^2$   (16)

Here m > 0 is the margin, which enforces robustness.
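Equation (16) translated into code; a hedged sketch following the convention above (Y = 1 for dissimilar pairs, margin value illustrative):

```python
import torch

def contrastive_loss(d_w, y, m=1.0):
    """Eq. (16): (1 - Y) * 0.5 * D^2 + Y * 0.5 * max(0, m - D)^2.

    d_w: batch of distances D_W; y: 0 = similar pair, 1 = dissimilar pair.
    """
    similar_term = (1 - y) * 0.5 * d_w ** 2
    dissimilar_term = y * 0.5 * torch.clamp(m - d_w, min=0) ** 2
    return (similar_term + dissimilar_term).mean()

d_w = torch.tensor([0.2, 1.5])
y = torch.tensor([0.0, 1.0])
print(contrastive_loss(d_w, y))   # dissimilar pair beyond m contributes zero
```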

SLIDE 26

Contrastive loss function

Spring model analogy: F = −KX

Attraction (similar pairs):

  $\frac{\partial L_S}{\partial W} = D_W \frac{\partial D_W}{\partial W}$

Repulsion (dissimilar pairs):

  $\frac{\partial L_D}{\partial W} = -(m - D_W) \frac{\partial D_W}{\partial W}$

The repulsive force is absent when $D_W \ge m$.

SLIDE 27

Dimensionality Reduction

SLIDE 28

Face Verification

Discriminative Deep Metric Learning

◮ Face verification in the wild.
◮ Defines a threshold for both positive and negative face pairs.

Hu et al., CVPR'14

SLIDE 29

Face Verification

Discriminative Deep Metric Learning

$\arg\min_f\ \frac{1}{2} \sum_{i,j} g\!\left(1 - l_{ij}\left(\tau - d_f^2(x_i, x_j)\right)\right) + \frac{\lambda}{2} \sum_{m=1}^{M} \left( \|W^{(m)}\|_F^2 + \|b^{(m)}\|_2^2 \right)$

Here $g(z) = \frac{1}{\beta} \log(1 + \exp(\beta z))$ is the generalized logistic loss function, $l_{ij} = 1$ for a positive pair and $l_{ij} = -1$ for a negative pair, and $\tau$ is the threshold from the previous slide.

Hu et al., CVPR'14

SLIDE 30

Triplet network

Building on the idea of triplets as formulated in LMNN, the Siamese architecture is modified into a triplet network:

SLIDE 31

Triplet Loss

◮ Learn an embedding function f(·) that assigns smaller distances to similar image pairs.
◮ Given a triplet $t_i = (p_i, p_i^+, p_i^-)$, the triplet loss is defined as:

  $l(p_i, p_i^+, p_i^-) = \max\{0,\ m + D(f(p_i), f(p_i^+)) - D(f(p_i), f(p_i^-))\}$

Figure 5: Network architecture of deep ranking model

Wang et. al. CVPR’14

SLIDE 32

Fine-grained Image Similarity with Deep Ranking

Wang et. al. CVPR’14

SLIDE 33

FaceNet

Face Embedding

◮ State-of-the-art results on the LFW dataset.
◮ Additional constraint: the embedding lives on the d-dimensional hypersphere, $\|f(x)\|_2 = 1$.
◮ Uses an online triplet selection method.

Figure 6: Results of Face Clustering using learned embedded features.

Schroff et. al. arxiv’15

SLIDE 34

How to mine triplets?

The selection of triplets is important for faster convergence and better training.

Challenges

◮ Given N examples, picking all triplets is $O(N^3)$.
◮ Triplets need to be freshly selected after each epoch.

Typical Strategies

◮ Select hard positives and hard negatives.
◮ Generate triplets offline every n steps, or online from each mini-batch.

SLIDE 35

How to mine triplets?

Schroff et. al. arxiv’15

◮ Generate triplets online with large mini-batch sizes, ensuring a minimum number of exemplars per class.
◮ Pick semi-hard negatives where:

  $\|G(X_i^a) - G(X_i^p)\|_2^2 < \|G(X_i^a) - G(X_i^n)\|_2^2$

  These negatives are farther from the anchor than the positive but still lie inside the margin m.

Wang et. al. CVPR’14

◮ Uses pairwise relevance scores (prior knowledge). ◮ Uses an online triplet sampling algorithm based on reservoir

sampling.
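A sketch of semi-hard mining within a mini-batch along these lines: for a given anchor-positive pair, pick a negative that is farther than the positive but still inside the margin. All names and the margin value are illustrative:

```python
import numpy as np

def semi_hard_negative(emb, labels, i_anchor, i_pos, m=0.2):
    """Return the index of a semi-hard negative for (anchor, positive), or None."""
    d_pos = np.sum((emb[i_anchor] - emb[i_pos]) ** 2)
    candidates = []
    for j in range(len(emb)):
        if labels[j] == labels[i_anchor]:
            continue                           # only differently-labeled points
        d_neg = np.sum((emb[i_anchor] - emb[j]) ** 2)
        if d_pos < d_neg < d_pos + m:          # farther than positive, inside margin
            candidates.append((d_neg, j))
    # pick the hardest among the semi-hard negatives (smallest d_neg)
    return min(candidates)[1] if candidates else None
```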

SLIDE 36

Issues in the Existing Approaches

◮ Predefined target neighborhood structure
  ◮ Defined a priori using supervised knowledge.
  ◮ Does not allow exploiting structure shared among different classes.
  ◮ The local similarity of LMNN [Weinberger et al., 2009] partly addresses this issue; however, target neighbors are determined a priori and never updated again.
◮ Objective formulation
  ◮ Penalizing individual triplets does not exploit sufficient contextual insight into the neighborhood structure.
  ◮ Suboptimal mining of triplets leads to slower convergence and suboptimal solutions.

SLIDE 37

Issues in the Existing Approaches

Figure 7: Both triplet and softmax result in unimodal separation, due to enforcement of semantic similarity.

Image Credit: Rippel et al., arXiv'16

SLIDE 38

Magnet Loss

Key Points

◮ Adaptive assessment of similarity as a function of the current representation.
◮ Local discrimination by penalizing class distribution overlap.
◮ A clustering-based approach enables efficient hard negative mining.

Rippel et al., arXiv'16

SLIDE 39

Magnet Loss

Model Formulation

Jointly manipulate clusters in pursuit of local discrimination. For each class c, the K cluster assignments are given by:

  $I_1^c, \dots, I_K^c = \arg\min_{I_1^c, \dots, I_K^c} \sum_{k=1}^{K} \sum_{r \in I_k^c} \|r - \mu_k^c\|_2^2, \qquad \mu_k^c = \frac{1}{|I_k^c|} \sum_{r \in I_k^c} r$

Here $r_n = f(x_n; \Theta)$ denotes the representation produced by the CNN.

Rippel et al., arXiv'16

SLIDE 40

Magnet Loss

Model Formulation

The magnet loss is defined as:

  $L(\Theta) = \frac{1}{N} \sum_{n=1}^{N} \left\{ -\log \frac{ \exp\!\left( -\frac{1}{2\sigma^2} \|r_n - \mu(r_n)\|_2^2 - \alpha \right) }{ \sum_{c \neq C(r_n)} \sum_{k=1}^{K} \exp\!\left( -\frac{1}{2\sigma^2} \|r_n - \mu_k^c\|_2^2 \right) } \right\}_+$

Here C(r) is the class of representation r, $\mu(r)$ is its assigned cluster center, $\alpha \in \mathbb{R}$, and $\sigma^2 = \frac{1}{N-1} \sum_{r \in D} \|r - \mu(r)\|_2^2$ is the variance of all examples away from their respective centers.

Key points

◮ The hinge $\{\cdot\}_+$ makes the loss vanish for examples already close to their cluster center and far from impostor clusters.
◮ Variance standardization gives invariance to the characteristic length scale of the embedding.
◮ α acts as the cluster separation gap.

Rippel et al., arXiv'16

SLIDE 41

Magnet Loss

Training Procedure

◮ Neighbourhood sampling
  ◮ Sample a seed cluster $I_1 \sim p_I(\cdot)$.
  ◮ Retrieve the M − 1 nearest impostor clusters $I_2, \dots, I_M$ of $I_1$.
  ◮ For each cluster $I_m$, m = 1, …, M, sample D examples $x_1^m, \dots, x_D^m \sim p_{I_m}(\cdot)$.

  Here $p_I \propto L_I$ (clusters with higher loss are sampled more often) and $p_{I_m}$ is a uniform distribution.

◮ Cluster index: K-means clustering is periodically recomputed on the current representations taken from the CNN.

Figure 8: Triplet vs. magnet loss in terms of training curves.

Rippel et. al. arxiv’16

SLIDE 42

References

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, Roopak Shah. "Signature Verification Using a Siamese Time Delay Neural Network." NIPS 1993.

Sumit Chopra, Raia Hadsell, Yann LeCun. "Learning a Similarity Metric Discriminatively, with Application to Face Verification." CVPR 2005.

Raia Hadsell, Sumit Chopra, Yann LeCun. "Dimensionality Reduction by Learning an Invariant Mapping." CVPR 2006.

Brian Kulis. "Distance Functions and Metric Learning: Part 2." ECCV 2010 Tutorial.

Kilian Q. Weinberger, Lawrence K. Saul. "Distance Metric Learning for Large Margin Nearest Neighbor Classification." JMLR 2009.

SLIDE 43

Junlin Hu, Jiwen Lu, Yap-Peng Tan. "Discriminative Deep Metric Learning for Face Verification in the Wild." CVPR 2014.

Jiang Wang et al. "Learning Fine-Grained Image Similarity with Deep Ranking." CVPR 2014.

Florian Schroff, Dmitry Kalenichenko, James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." arXiv preprint, 2015.

Oren Rippel et al. "Metric Learning with Adaptive Density Discrimination." arXiv 2015.

SLIDE 44

Thank You
