SLIDE 1

Distance Metric Learning: Beyond 0/1 Loss

Praveen Krishnan

CVIT, IIIT Hyderabad

June 14, 2017

SLIDE 2

Outline

◮ Distances and Similarities
◮ Distance Metric Learning
  ◮ Mahalanobis Distances
  ◮ Metric Learning Formulation
  ◮ Mahalanobis metric for clustering
  ◮ Large Margin Nearest Neighbor
◮ Distance Metric Learning using CNNs
  ◮ Siamese Network
    ◮ Contrastive loss function
    ◮ Applications
  ◮ Triplet Network
    ◮ Triplet Loss
    ◮ Applications
  ◮ Mining Triplets
◮ Adaptive Density Discrimination
  ◮ Magnet loss

SLIDE 3

Distances and Similarities

Distance Functions

The concept of a distance function d(·, ·) is inherent to any pattern recognition problem, e.g. clustering (k-means), classification (k-NN, SVM), etc.

Typical Choices

◮ Minkowski distance: $L_p(P, Q) = \left( \sum_i |P_i - Q_i|^p \right)^{1/p}$
◮ Cosine: $L(P, Q) = \frac{P^T Q}{\|P\|\,\|Q\|}$
◮ Earth Mover's distance: uses an optimization algorithm.
◮ Edit distance: uses dynamic programming between sequences.
◮ KL divergence: $KL(P \,\|\, Q) = \sum_i P_i \log \frac{P_i}{Q_i}$ (not symmetric!)
◮ many more ... (depending on the type of problem)
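These definitions are easy to check numerically. A minimal NumPy sketch of three of them (the function names and toy vectors are purely illustrative):

```python
import numpy as np

def minkowski(p, q, order=2):
    """L_p distance: (sum_i |p_i - q_i|^p)^(1/p)."""
    return np.sum(np.abs(p - q) ** order) ** (1.0 / order)

def cosine_similarity(p, q):
    """Cosine of the angle between p and q: p^T q / (|p| |q|)."""
    return p @ q / (np.linalg.norm(p) * np.linalg.norm(q))

def kl_divergence(p, q):
    """KL(P || Q) = sum_i P_i log(P_i / Q_i); note it is not symmetric."""
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(minkowski(p, q, order=2))                  # Euclidean distance
print(cosine_similarity(p, q))
print(kl_divergence(p, q), kl_divergence(q, p))  # asymmetric
```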

SLIDE 4

Distances and Similarities

Choosing the right distance function?

Image Credit: Brian Kulis, ECCV'10 Tutorial on Distance Functions and Metric Learning.

SLIDE 5

Metric Learning

Distance Metric Learning

Learn a function that maps input patterns into a target space such that the simple distance in the target space (Euclidean) approximates the “semantic” distance in the input space.

Figure 1: Hadsell et al., CVPR'06

SLIDE 6

Metric Learning

Many applications

Figure 2: A subset of applications using metric learning.

◮ Scaling to a large number of categories. [Schroff et al., 2015]
◮ Fine-grained classification. [Rippel et al., 2015]
◮ Visualization of high-dimensional data. [van der Maaten and Hinton, 2008]
◮ Ranking and retrieval. [Wang et al., CVPR'14]

SLIDE 7

Properties of a Metric

What defines a metric?

1. Non-negativity: D(P, Q) ≥ 0
2. Identity of indiscernibles: D(P, Q) = 0 iff P = Q
3. Symmetry: D(P, Q) = D(Q, P)
4. Triangle inequality: D(P, Q) ≤ D(P, K) + D(K, Q)

Pseudo/Semi Metric

If the second property is relaxed from "iff" to "if" (so two distinct points may be at distance zero), the function is called a pseudo-metric or semi-metric.

SLIDE 8

Metric learning as learning transformations

◮ Feature weighting
  ◮ Learn weightings over the features, then use a standard distance (e.g. Euclidean) after re-weighting.
◮ Full linear transformation
  ◮ In addition to scaling features, also rotates the data.
  ◮ For transformations to r < d dimensions, this is linear dimensionality reduction.
◮ Non-linear transformation
  ◮ Neural nets
  ◮ Kernelization of linear transformations

Slide Credit: Brian Kulis, ECCV'10 Tutorial on Distance Functions and Metric Learning.

SLIDE 9

Supervised Metric Learning

Main focus of this talk.

◮ Constraints or labels are given to the algorithm, e.g. a set of similarity and dissimilarity constraints.
◮ Recent popular methods use CNN architectures for the non-linear transformation.

Before getting into deep architectures, let us explore some basic and classical works.

SLIDE 10

Mahalanobis Distances

◮ Assume the data is represented as N vectors of length d: X = [x1, x2, · · · , xN].
◮ Squared Euclidean distance:

  $d(x_1, x_2) = \|x_1 - x_2\|_2^2 = (x_1 - x_2)^T (x_1 - x_2)$   (1)

◮ Let $\Sigma = \sum_i (x_i - \mu)(x_i - \mu)^T$ be the covariance of the data.
◮ The original Mahalanobis distance is given as:

  $d_M(x_1, x_2) = (x_1 - x_2)^T \Sigma^{-1} (x_1 - x_2)$   (2)
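A short NumPy sketch of equation (2), estimating Σ from data with np.cov (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # N = 100 points, d = 3
mu = X.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance

def mahalanobis_sq(x1, x2, Sigma_inv):
    """Squared Mahalanobis distance (x1 - x2)^T Sigma^{-1} (x1 - x2), eq. (2)."""
    diff = x1 - x2
    return diff @ Sigma_inv @ diff

print(mahalanobis_sq(X[0], X[1], Sigma_inv))
print(mahalanobis_sq(X[0], mu, Sigma_inv))   # distance of a point from the mean
```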

SLIDE 11

Mahalanobis Distances

Equivalent to applying a whitening transform

Image Credit: Brian Kulis, ECCV'10 Tutorial on Distance Functions and Metric Learning.

SLIDE 12

Mahalanobis Distances

Mahalanobis distances for metric learning

In general, the distance can be parameterized by a d × d positive semi-definite matrix A:

  $d_A(x_1, x_2) = (x_1 - x_2)^T A (x_1 - x_2)$   (3)

Metric learning as linear transformation

Derives a family of metrics over X by computing Euclidean distances after the linear transformation $x' = L^T x$, where $A = L L^T$ (Cholesky decomposition):

  $d_A(x_1, x_2) = \|L^T (x_1 - x_2)\|_2^2$   (4)
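The equivalence between (3) and (4) can be verified numerically; a small sketch, assuming an arbitrary A built as L Lᵀ so that it is PSD by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
L = rng.normal(size=(d, d))
A = L @ L.T                           # A is PSD by construction

x1, x2 = rng.normal(size=d), rng.normal(size=d)
diff = x1 - x2

d_A = diff @ A @ diff                 # (x1 - x2)^T A (x1 - x2), eq. (3)
d_L = np.sum((L.T @ diff) ** 2)       # ||L^T (x1 - x2)||_2^2, eq. (4)
print(np.isclose(d_A, d_L))           # True: the two forms agree
```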

SLIDE 13

Mahalanobis Distances

Why is A positive semi-definite (PSD)?

SLIDE 14

Mahalanobis Distances

Why is A positive semi-definite (PSD)?

◮ If A is not PSD, then dA could be negative.
◮ Suppose v = x1 − x2 is an eigenvector corresponding to a negative eigenvalue λ of A:

  $d_A(x_1, x_2) = (x_1 - x_2)^T A (x_1 - x_2) = v^T A v = \lambda\, v^T v < 0$   (5)
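A short numeric illustration of this argument, using a hand-picked non-PSD A with eigenvalues 1 and −1:

```python
import numpy as np

A = np.diag([1.0, -1.0])     # not PSD: one negative eigenvalue
v = np.array([0.0, 1.0])     # eigenvector of A for lambda = -1
x1, x2 = v, np.zeros(2)      # so that v = x1 - x2

d_A = (x1 - x2) @ A @ (x1 - x2)
print(d_A)                   # -1.0: a "distance" that is negative
```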

SLIDE 15

Metric Learning Formulation

Two main components:

◮ A set of constraints on the distance.
◮ A regularizer on the distance / objective function.

Constrained Case

  $\min_A\ r(A)$ s.t. $c_i(A) \le 0,\ i = 1, \dots, C$; $A \succeq 0$   (6)

Here r is the regularizer; a popular choice is $\|A\|_F^2$. The constraint $A \succeq 0$ enforces positive semi-definiteness.

Unconstrained Case

  $\min_{A \succeq 0}\ r(A) + \lambda \sum_{i=1}^{C} c_i(A)$   (7)
SLIDE 16

Metric Learning Formulation: Defining Constraints

Similarity / Dissimilarity constraints

Given a set S of pairs (xi, xj) of points that should be similar, and a set D of pairs of points that should be dissimilar:

  $d_A(x_i, x_j) \le l \quad \forall (i, j) \in S$
  $d_A(x_i, x_j) \ge u \quad \forall (i, j) \in D$   (8)

Popular in verification problems.

SLIDE 17

Metric Learning Formulation: Defining Constraints

Similarity / Dissimilarity constraints

Given a set S of pairs (xi, xj) of points that should be similar, and a set D of pairs of points that should be dissimilar:

  $d_A(x_i, x_j) \le l \quad \forall (i, j) \in S$
  $d_A(x_i, x_j) \ge u \quad \forall (i, j) \in D$   (8)

Popular in verification problems.

Relative distance constraints

Given a triplet (xi, xj, xk) such that the distance between xi and xj should be smaller than the distance between xi and xk:

  $d_A(x_i, x_j) \le d_A(x_i, x_k) - m$   (9)

Here m is the margin. This form is popular for ranking problems.
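A small sketch that counts how many constraints of each kind a candidate metric A violates; the thresholds l, u and margin m are illustrative placeholders, not values from any paper:

```python
import numpy as np

def d_A(A, x, y):
    diff = x - y
    return diff @ A @ diff

def count_violations(A, S, D, T, l=1.0, u=2.0, m=0.5):
    """S: similar pairs, D: dissimilar pairs, T: (anchor, pos, neg) triplets."""
    sim = sum(d_A(A, x, y) > l for x, y in S)                    # violates eq. (8), similar side
    dis = sum(d_A(A, x, y) < u for x, y in D)                    # violates eq. (8), dissimilar side
    tri = sum(d_A(A, a, p) > d_A(A, a, n) - m for a, p, n in T)  # violates eq. (9)
    return sim, dis, tri

rng = np.random.default_rng(2)
A = np.eye(3)                    # start from the plain Euclidean metric
S = [(rng.normal(size=3), rng.normal(size=3))]
D = [(rng.normal(size=3), rng.normal(size=3))]
T = [tuple(rng.normal(size=3) for _ in range(3))]
print(count_violations(A, S, D, T))
```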

SLIDE 18

Mahalanobis metric for clustering

Key Components

◮ A convex objective function for distance metric learning.
◮ Similar in spirit to linear discriminant analysis.

  $\max_A \sum_{(x_i, x_j) \in D} d_A(x_i, x_j)$
  s.t. $c(A) = \sum_{(x_i, x_j) \in S} d_A(x_i, x_j) \le 1, \quad A \succeq 0$   (10)

◮ Here, D is a set of dissimilar pairs and S is a set of similar pairs.
◮ The objective tries to maximize the sum of dissimilar distances.
◮ The constraint keeps the sum of similar distances small.

Xing et al., NIPS'02

SLIDE 19

Large Margin Nearest Neighbor

Key Components

Learns a Mahalanobis distance metric using:

◮ a convex loss function,
◮ margin maximization,
◮ constraints imposed for accurate kNN classification.
◮ Promotes a local notion of distance instead of a global similarity.

Intuition

◮ Each training input xi should share the same label yi as its k nearest neighbors, and
◮ training inputs with different labels should be widely separated.

Weinberger et al., JMLR'09

SLIDE 20

Large Margin Nearest Neighbor

Target Neighbors

Chosen using prior knowledge, or computed as the k nearest neighbors under the Euclidean distance. Target neighbors do not change during training.

Impostors

Differently labeled inputs that invade the perimeter plus a unit margin:

  $\|L^T (x_i - x_l)\|^2 \le \|L^T (x_i - x_j)\|^2 + 1$   (11)

Here xi and xj share the label yi, and xl is an impostor with label $y_l \neq y_i$.

Weinberger et al., JMLR'09

SLIDE 21

Large Margin Nearest Neighbor

Loss Function

$\varepsilon_{pull}(L) = \sum_{j \rightsquigarrow i} \|L^T (x_i - x_j)\|^2$

$\varepsilon_{push}(L) = \sum_{i,\, j \rightsquigarrow i} \sum_{l} (1 - y_{il}) \big[ 1 + \|L^T (x_i - x_j)\|^2 - \|L^T (x_i - x_l)\|^2 \big]_+$   (12)

Here $j \rightsquigarrow i$ means xj is a target neighbor of xi, $y_{il} = 1$ iff $y_i = y_l$, and $[z]_+ = \max(0, z)$ denotes the standard hinge loss.

$\varepsilon(L) = (1 - \mu)\, \varepsilon_{pull}(L) + \mu\, \varepsilon_{push}(L)$   (13)

Here (xi, xj, xl) forms a triplet sample. The loss above is non-convex in L; the original paper discusses a convex reformulation (in terms of M = L Lᵀ) using semidefinite programming.

Weinberger et. al. JMLR’09

SLIDE 22

Distance Metric Learning using CNNs

SLIDE 23

Distance Metric Learning using CNNs

Siamese Network

Siamese is an informal term for conjoined or fused.

◮ Contains two or more identical sub-networks with a shared set of parameters and weights.
◮ Popularly used for similarity learning tasks such as verification and ranking.

Figure 3: Signature verification. Bromley et al., NIPS'93

SLIDE 24

Siamese Architecture

Given a family of functions $G_W(X)$ parameterized by W, find W such that the similarity metric $D_W(X_1, X_2)$ is small for similar pairs and large for dissimilar pairs:

  $D_W(X_1, X_2) = \|G_W(X_1) - G_W(X_2)\|$   (14)

Chopra et al., CVPR'05 and Hadsell et al., CVPR'06

SLIDE 25

Contrastive Loss Function

Let $X_1, X_2 \in I$ be a pair of input vectors and Y be the binary label, where Y = 0 means the pair is similar and Y = 1 means dissimilar. We define a parameterized distance function $D_W$ as:

  $D_W(X_1, X_2) = \|G_W(X_1) - G_W(X_2)\|_2$   (15)

The contrastive loss function is given as:

  $L(W, Y, X_1, X_2) = (1 - Y)\, \tfrac{1}{2} (D_W)^2 + Y\, \tfrac{1}{2} \{\max(0,\, m - D_W)\}^2$   (16)

Here m > 0 is the margin, which enforces robustness.
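Equation (16) translated into code; a hedged sketch following the convention above (Y = 1 for dissimilar pairs, margin value illustrative):

```python
import torch

def contrastive_loss(d_w, y, m=1.0):
    """Eq. (16): (1 - Y) * 0.5 * D^2 + Y * 0.5 * max(0, m - D)^2.

    d_w: batch of distances D_W; y: 0 = similar pair, 1 = dissimilar pair.
    """
    similar_term = (1 - y) * 0.5 * d_w ** 2
    dissimilar_term = y * 0.5 * torch.clamp(m - d_w, min=0) ** 2
    return (similar_term + dissimilar_term).mean()

d_w = torch.tensor([0.2, 1.5])
y = torch.tensor([0.0, 1.0])
print(contrastive_loss(d_w, y))   # dissimilar pair beyond m contributes zero
```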

SLIDE 26

Contrastive loss function

Spring model analogy: F = −KX

Attraction (similar pairs):

  $\frac{\partial L_S}{\partial W} = D_W \frac{\partial D_W}{\partial W}$

Repulsion (dissimilar pairs):

  $\frac{\partial L_D}{\partial W} = -(m - D_W) \frac{\partial D_W}{\partial W}$

The repulsive force is absent when $D_W \ge m$.

SLIDE 27

Dimensionality Reduction

SLIDE 28

Face Verification

Discriminative Deep Metric Learning

◮ Face verification in the wild.
◮ Defines a threshold for both positive and negative face pairs.

Hu et al., CVPR'14

SLIDE 29

Face Verification

Discriminative Deep Metric Learning

$\arg\min_f\ \frac{1}{2} \sum_{i,j} g\!\left(1 - l_{ij}\left(\tau - d_f^2(x_i, x_j)\right)\right) + \frac{\lambda}{2} \sum_{m=1}^{M} \left( \|W^{(m)}\|_F^2 + \|b^{(m)}\|_2^2 \right)$

Here $g(z) = \frac{1}{\beta} \log(1 + \exp(\beta z))$ is the generalized logistic loss function, $l_{ij} = 1$ for a positive pair and $l_{ij} = -1$ for a negative pair, and $\tau$ is the threshold from the previous slide.

Hu et al., CVPR'14

SLIDE 30

Triplet network

Building on the idea of triplets as formulated in LMNN, the Siamese architecture is modified into a triplet network:

SLIDE 31

Triplet Loss

◮ Learn an embedding function f(·) that assigns smaller distances to similar image pairs.
◮ Given a triplet $t_i = (p_i, p_i^+, p_i^-)$, the triplet loss is defined as:

  $l(p_i, p_i^+, p_i^-) = \max\{0,\ m + D(f(p_i), f(p_i^+)) - D(f(p_i), f(p_i^-))\}$

Figure 5: Network architecture of deep ranking model

Wang et. al. CVPR’14

SLIDE 32

Fine-grained Image Similarity with Deep Ranking

Wang et. al. CVPR’14

SLIDE 33

FaceNet

Face Embedding

◮ State-of-the-art results on the LFW dataset.
◮ Additional constraint: the embedding lives on the d-dimensional hypersphere, $\|f(x)\|_2 = 1$.
◮ Uses an online triplet selection method.

Figure 6: Results of Face Clustering using learned embedded features.

Schroff et. al. arxiv’15

SLIDE 34

How to mine triplets?

The selection of triplets is important for faster convergence and better training.

Challenges

◮ Given N examples, picking all triplets is $O(N^3)$.
◮ Triplets need to be freshly selected after each epoch.

Typical Strategies

◮ Select hard positives and hard negatives.
◮ Generate triplets offline every n steps, or online from each mini-batch.

SLIDE 35

How to mine triplets?

Schroff et. al. arxiv’15

◮ Generate triplets online with large mini-batch sizes, ensuring a minimum number of exemplars per class.
◮ Pick semi-hard negatives where:

  $\|G(X_i^a) - G(X_i^p)\|_2^2 < \|G(X_i^a) - G(X_i^n)\|_2^2$

  These negatives are farther from the anchor than the positive but still lie inside the margin m.

Wang et. al. CVPR’14

◮ Uses pairwise relevance scores (prior knowledge). ◮ Uses an online triplet sampling algorithm based on reservoir

sampling.
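A sketch of semi-hard mining within a mini-batch along these lines: for a given anchor-positive pair, pick a negative that is farther than the positive but still inside the margin. All names and the margin value are illustrative:

```python
import numpy as np

def semi_hard_negative(emb, labels, i_anchor, i_pos, m=0.2):
    """Return the index of a semi-hard negative for (anchor, positive), or None."""
    d_pos = np.sum((emb[i_anchor] - emb[i_pos]) ** 2)
    candidates = []
    for j in range(len(emb)):
        if labels[j] == labels[i_anchor]:
            continue                           # only differently-labeled points
        d_neg = np.sum((emb[i_anchor] - emb[j]) ** 2)
        if d_pos < d_neg < d_pos + m:          # farther than positive, inside margin
            candidates.append((d_neg, j))
    # pick the hardest among the semi-hard negatives (smallest d_neg)
    return min(candidates)[1] if candidates else None
```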

SLIDE 36

Issues in the Existing Approaches

◮ Predefined target neighborhood structure
  ◮ Defined a priori using supervised knowledge.
  ◮ Does not allow exploiting structure shared among different classes.
  ◮ The local similarity of LMNN [Weinberger et al., 2009] partly addresses this issue; however, target neighbors are determined a priori and never updated again.
◮ Objective formulation
  ◮ Penalizing individual triplets does not exploit sufficient contextual insight into the neighborhood structure.
  ◮ Suboptimal mining of triplets leads to slower convergence and suboptimal solutions.

SLIDE 37

Issues in the Existing Approaches

Figure 7: Both triplet and softmax result in unimodal separation, due to enforcement of semantic similarity.

Image Credit: Rippel et al., arXiv'16

SLIDE 38

Magnet Loss

Key Points

◮ Adaptive assessment of similarity as a function of the current representation.
◮ Local discrimination by penalizing class distribution overlap.
◮ A clustering-based approach enables efficient hard negative mining.

Rippel et al., arXiv'16

SLIDE 39

Magnet Loss

Model Formulation

Jointly manipulate clusters in pursuit of local discrimination. For each class c, the K cluster assignments are given by:

  $I_1^c, \dots, I_K^c = \arg\min_{I_1^c, \dots, I_K^c} \sum_{k=1}^{K} \sum_{r \in I_k^c} \|r - \mu_k^c\|_2^2, \qquad \mu_k^c = \frac{1}{|I_k^c|} \sum_{r \in I_k^c} r$

Here $r_n = f(x_n; \Theta)$ denotes the representation produced by the CNN.

Rippel et al., arXiv'16

SLIDE 40

Magnet Loss

Model Formulation

The magnet loss is defined as:

  $L(\Theta) = \frac{1}{N} \sum_{n=1}^{N} \left\{ -\log \frac{ \exp\!\left( -\frac{1}{2\sigma^2} \|r_n - \mu(r_n)\|_2^2 - \alpha \right) }{ \sum_{c \neq C(r_n)} \sum_{k=1}^{K} \exp\!\left( -\frac{1}{2\sigma^2} \|r_n - \mu_k^c\|_2^2 \right) } \right\}_+$

Here C(r) is the class of representation r, $\mu(r)$ is its assigned cluster center, $\alpha \in \mathbb{R}$, and $\sigma^2 = \frac{1}{N-1} \sum_{r \in D} \|r - \mu(r)\|_2^2$ is the variance of all examples away from their respective centers.

Key points

◮ The hinge $\{\cdot\}_+$ makes the loss vanish for examples already close to their cluster center and far from impostor clusters.
◮ Variance standardization gives invariance to the characteristic length scale of the embedding.
◮ α acts as the cluster separation gap.

Rippel et al., arXiv'16

SLIDE 41

Magnet Loss

Training Procedure

◮ Neighbourhood sampling
  ◮ Sample a seed cluster $I_1 \sim p_I(\cdot)$.
  ◮ Retrieve the M − 1 nearest impostor clusters $I_2, \dots, I_M$ of $I_1$.
  ◮ For each cluster $I_m$, m = 1, …, M, sample D examples $x_1^m, \dots, x_D^m \sim p_{I_m}(\cdot)$.

  Here $p_I \propto L_I$ (clusters with higher loss are sampled more often) and $p_{I_m}$ is a uniform distribution.

◮ Cluster index: K-means clustering is periodically recomputed on the current representations taken from the CNN.

Figure 8: Triplet vs. magnet loss in terms of training curves.

Rippel et. al. arxiv’16

SLIDE 42

References

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, Roopak Shah. "Signature Verification Using a Siamese Time Delay Neural Network." NIPS 1993.

Sumit Chopra, Raia Hadsell, Yann LeCun. "Learning a Similarity Metric Discriminatively, with Application to Face Verification." CVPR 2005.

Raia Hadsell, Sumit Chopra, Yann LeCun. "Dimensionality Reduction by Learning an Invariant Mapping." CVPR 2006.

Brian Kulis. "Distance Functions and Metric Learning: Part 2." ECCV 2010 Tutorial.

Kilian Q. Weinberger, Lawrence K. Saul. "Distance Metric Learning for Large Margin Nearest Neighbor Classification." JMLR 2009.

SLIDE 43

Junlin Hu, Jiwen Lu, Yap-Peng Tan. "Discriminative Deep Metric Learning for Face Verification in the Wild." CVPR 2014.

Jiang Wang et al. "Learning Fine-Grained Image Similarity with Deep Ranking." CVPR 2014.

Florian Schroff, Dmitry Kalenichenko, James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." arXiv preprint, 2015.

Oren Rippel et al. "Metric Learning with Adaptive Density Discrimination." arXiv 2015.

SLIDE 44

Thank You
