SLIDE 1

On the Role and Impact of the Metaparameters in t-distributed Stochastic Neighbor Embedding

John A. Lee and Michel Verleysen

Machine Learning Group, Université catholique de Louvain, Louvain-la-Neuve, Belgium
michel.verleysen@uclouvain.be

SLIDE 2

Motivation for nonlinear dimensionality reduction

High-dimensional data are

– difficult to represent
– difficult to understand
– difficult to analyze

Motivation #1:

– To visualize data living in a d-dimensional space (d > 3)

Motivation #2:

– Models (regression, classification, clustering) based on high-dimensional data suffer from the curse of dimensionality
– Need to reduce the dimension of data while keeping information content!

Compstat 2010 On the role and impact of the metaparameters in t-distributed SNE 2

Motivation

SLIDE 3

Visualization

These are data:
it is difficult to see anything in them…

Variables: annual increase (%), infant mortality (‰), illiteracy ratio (%), school attendance (%), GIP, annual GIP increase (%)

SLIDE 4

Visualization

These are the same data under different visualization paradigms:
it becomes possible to see groups, relations, outliers, …

SLIDE 5

Not all NLDR methods perform equally!

[Figure panels: Geodesic NLM, CDA, Isomap]

SLIDE 6

Stochastic Neighbor Embedding

SNE and t-SNE are nowadays considered as ‘good’ methods for NLDR
Examples

From: L. Van der Maaten & G. Hinton, Visualizing Data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579-2605

[Figure panels: t-SNE, MDS]

SLIDE 8

Outline

NLDR: a historical perspective

– stress function
– intrusions and extrusions
– geodesic distances

SNE and t-SNE

– algorithm
– gradient
– transformed distances

Experiments

– with Euclidean distances
– with geodesic distances

Conclusions

SLIDE 9

From MDS to more general cost functions

MDS follows the idea of

$$\min_X \sum_{i<j} \left( \delta_{ij}^2 - d_{ij}^2 \right)^2 \quad \text{where } \delta_{ij} = \| y_i - y_j \| \text{ and } d_{ij} = \| x_i - x_j \|$$

Extension:

$$\min_X \sum_{i<j} w_{ij} \left( \delta_{ij}^2 - d_{ij}^2 \right)^2$$

to give more importance to

– small distances
– close data
– …

Traditional « stress » function:

$$\min_X \sum_{i<j} w_{ij} \left( \delta_{ij} - d_{ij} \right)^2$$

Breakthrough #1
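
As an illustration of the traditional stress function, the sketch below evaluates the weighted sum of squared differences between original distances δ_ij and embedding distances d_ij. It is a minimal example in plain Python, not the authors' code: the helper names `pairwise_distances` and `weighted_stress` and the toy coordinates are hypothetical, and unit weights are assumed when no weight matrix is given.

```python
import math

def pairwise_distances(points):
    """Euclidean distance between every pair of points (list of tuples)."""
    return [[math.dist(a, b) for b in points] for a in points]

def weighted_stress(delta, d, weights=None):
    """Sum over i < j of w_ij * (delta_ij - d_ij)**2 (unit weights by default)."""
    n = len(delta)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            w = 1.0 if weights is None else weights[i][j]
            total += w * (delta[i][j] - d[i][j]) ** 2
    return total

# Toy data (assumption): three points in 3-d and a candidate 2-d embedding.
Y = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
X = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]

delta = pairwise_distances(Y)   # distances in the original space
d = pairwise_distances(X)       # distances in the embedding
stress = weighted_stress(delta, d)
```

A perfect embedding gives a stress of zero; passing a weight matrix that decreases with distance puts the emphasis on small distances, as the slide suggests.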

SLIDE 11

Limitations of linear projections

Even simple manifolds can be poorly projected
Points originally far from each other are projected close to each other: this is an intrusion

SLIDE 13

Nonlinear projections

Goal: to unfold, rather than to project (linearly)
Intrusions can hopefully be reduced, but extrusions may appear

SLIDE 14

The user’s point of view

Favouring intrusions or extrusions depends on the application (user’s point of view)

General way of handling the compromise:

$$w_{ij} = \lambda \, f\!\left( \frac{d_{ij}}{\sigma} \right) + (1 - \lambda) \, f\!\left( \frac{\delta_{ij}}{\sigma} \right)$$

The term built on the original distances δ_ij allows intrusions; the term built on the embedding distances d_ij allows extrusions.

Nowadays, few methods acknowledge this need for a trade-off!

Breakthrough #2
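
A small numerical sketch of this trade-off weight, w_ij = λ f(d_ij/σ) + (1 − λ) f(δ_ij/σ). The slide does not specify f; the decreasing function f(u) = exp(−u) used here is an assumption made only for illustration, and the function name `tradeoff_weight` is hypothetical.

```python
import math

def tradeoff_weight(delta_ij, d_ij, lam, sigma, f=lambda u: math.exp(-u)):
    """w_ij = lam * f(d_ij / sigma) + (1 - lam) * f(delta_ij / sigma).

    f defaults to exp(-u), an assumed decreasing function (not from the slides).
    """
    return lam * f(d_ij / sigma) + (1.0 - lam) * f(delta_ij / sigma)

# A pair that is far apart in the original space but close in the embedding,
# i.e. an intrusion.
delta_ij, d_ij = 5.0, 0.5

w_original_only = tradeoff_weight(delta_ij, d_ij, lam=0.0, sigma=1.0)
w_embedding_only = tradeoff_weight(delta_ij, d_ij, lam=1.0, sigma=1.0)
```

With λ = 0 the weight depends only on δ_ij, so this intruding pair receives a small weight and the intrusion is tolerated; with λ = 1 the weight depends only on d_ij and the same pair is weighted heavily.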

SLIDE 15

Geodesic distances

Goal: to measure distances along the manifold
Such distances are more easily preserved

Breakthrough # 3

SLIDE 16

Geodesic and graph distances

Geodesic distances: finding the shortest path between data points along the manifold
Problem: the manifold is unknown → approximate it by a graph
Efficient algorithms exist for finding shortest paths in a graph
The graph can be built by connecting each data point to its k nearest neighbors, or to all points within an ε-ball

[Figure: 2-d data; approximation of the geodesic distance]
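
The graph approximation can be sketched as follows: build a k-nearest-neighbor graph, then run a shortest-path algorithm on it. This is a minimal illustration using Dijkstra's algorithm from the Python standard library; the function names `knn_graph` and `graph_distances`, the choice k = 2 and the half-circle data are assumptions, not taken from the slides.

```python
import heapq
import math

def knn_graph(points, k):
    """Undirected graph linking each point to its k nearest neighbors."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted((math.dist(points[i], points[j]), j)
                         for j in range(n) if j != i)[:k]
        for dist, j in nearest:
            graph[i][j] = dist
            graph[j][i] = dist  # keep the graph symmetric
    return graph

def graph_distances(graph, source):
    """Dijkstra: shortest path length from source to every reachable node."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, weight in graph[node].items():
            candidate = dist + weight
            if candidate < best.get(neighbor, math.inf):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return best

# Toy manifold (assumption): 21 points on a half circle of radius 1.
points = [(math.cos(math.pi * t / 20), math.sin(math.pi * t / 20))
          for t in range(21)]
graph = knn_graph(points, k=2)
geodesic = graph_distances(graph, 0)[20]      # along the curve
euclidean = math.dist(points[0], points[20])  # straight through
```

The graph distance between the two ends of the arc is close to the true arc length π, while the Euclidean distance is only 2: the graph distance follows the manifold instead of cutting across it.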

SLIDE 17

Distance preservation methods

                   | Euclidean distances in HD space | Geodesic distances in HD space
                   | Metric MDS                      | Isomap
Favors intrusions  | Sammon NLM                      | Geodesic NLM
Favors extrusions  | CCA                             | CDA

$$E = \sum_{i,j=1}^{N} \left( d_y(i,j) - d_x(i,j) \right)^2$$

$$E_{NLM} = \sum_{\substack{i=1 \\ i<j}}^{N} \frac{\left( d_y(i,j) - d_x(i,j) \right)^2}{d_y(i,j)}$$

$$E_{CCA} = \sum_{\substack{i=1 \\ i<j}}^{N} \left( d_y(i,j) - d_x(i,j) \right)^2 F_\lambda\!\left( d_x(i,j) \right)$$

(d_y: distances in the original HD space; d_x: distances in the embedding space)
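
The three cost functions differ only in how each squared distance error is weighted. The sketch below computes them from two distance matrices, with d_y taken in the original space and d_x in the embedding. The slides do not give F_λ explicitly; a step function (1 if d_x ≤ λ, else 0) is assumed here, and the function names and toy matrices are hypothetical.

```python
def mds_error(dy, dx):
    """E: sum over all pairs (i, j) of (d_y - d_x)**2."""
    n = len(dy)
    return sum((dy[i][j] - dx[i][j]) ** 2 for i in range(n) for j in range(n))

def nlm_error(dy, dx):
    """E_NLM: squared errors divided by the ORIGINAL distance d_y (i < j)."""
    n = len(dy)
    return sum((dy[i][j] - dx[i][j]) ** 2 / dy[i][j]
               for i in range(n) for j in range(i + 1, n))

def cca_error(dy, dx, lam):
    """E_CCA: squared errors weighted by F_lambda of the EMBEDDING distance d_x.

    F_lambda is assumed to be a step function: 1 if d_x <= lam, else 0.
    """
    n = len(dy)
    return sum((dy[i][j] - dx[i][j]) ** 2 * (1.0 if dx[i][j] <= lam else 0.0)
               for i in range(n) for j in range(i + 1, n))

# Toy distance matrices (assumption): points at 0, 1, 4 on a line,
# embedded at 0, 1, 2.
dy = [[0.0, 1.0, 4.0], [1.0, 0.0, 3.0], [4.0, 3.0, 0.0]]
dx = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]

e_mds = mds_error(dy, dx)
e_nlm = nlm_error(dy, dx)
e_cca = cca_error(dy, dx, lam=1.5)
```

In this example the pair (0, 2) ends up at distance 2 in the embedding, beyond λ = 1.5, so CCA ignores its error entirely, whereas NLM only down-weights it by its large original distance.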

SLIDE 18

Distance preservation methods

                   | Euclidean distances in HD space | Geodesic distances in HD space
                   | Metric MDS                      | Isomap
Favors intrusions  | Sammon NLM                      | Geodesic NLM
Favors extrusions  | CCA                             | CDA

Computational load ↓, performances ↓ versus computational load ↑, performances ↑

SLIDE 21

SNE and t-SNE

In the original space, the similarity between y_i and y_j is defined as

$$p_{j|i}(\lambda_i) = \begin{cases} \dfrac{g(\delta_{ij} / \lambda_i)}{\sum_{k \neq i} g(\delta_{ik} / \lambda_i)} & \text{if } i \neq j \\ 0 & \text{otherwise} \end{cases} \qquad g(u) = \exp\!\left( -\frac{u^2}{2} \right)$$

Similarities are not symmetric (individual widths)!
p_{j|i} is the empirical probability of y_j to be a neighbor of y_i
Individual widths λ_i: set (individually) through a global « perplexity » parameter

$$2^{H(p_{j|i})} = \mathrm{PPXT}$$
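
The link between the width λ_i and the perplexity can be made concrete with a short sketch. It computes p_{j|i} with the Gaussian-like g(u) = exp(−u²/2) and searches for the λ_i whose distribution reaches a target perplexity 2^H. The bisection on a logarithmic scale, the search bounds and the function names are assumptions for illustration; the slides do not say how the search is performed.

```python
import math

def conditional_probabilities(delta_i, i, lam):
    """p_{j|i} for point i, given its distances delta_i to all points."""
    # Subtracting the smallest squared distance avoids underflow for tiny
    # widths; it cancels out in the normalization.
    d_min2 = min(d for j, d in enumerate(delta_i) if j != i) ** 2
    g = [0.0 if j == i else math.exp(-(d * d - d_min2) / (2.0 * lam * lam))
         for j, d in enumerate(delta_i)]
    total = sum(g)
    return [v / total for v in g]

def perplexity(p):
    """2 ** H(p), with the entropy H measured in bits."""
    entropy = -sum(v * math.log2(v) for v in p if v > 0.0)
    return 2.0 ** entropy

def width_for_perplexity(delta_i, i, target, lo=1e-3, hi=1e3, iterations=100):
    """Bisection (on a log scale) for the width giving the target perplexity."""
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)
        if perplexity(conditional_probabilities(delta_i, i, mid)) < target:
            lo = mid   # distribution too peaked: widen
        else:
            hi = mid   # distribution too flat: narrow
    return math.sqrt(lo * hi)

# Toy example (assumption): distances from point 0 to five points.
delta_0 = [0.0, 1.0, 2.0, 3.0, 4.0]
lam_0 = width_for_perplexity(delta_0, 0, target=2.0)
p_0 = conditional_probabilities(delta_0, 0, lam_0)
```

The perplexity ranges from 1 (only the nearest neighbor matters) to N − 1 (all neighbors equally likely), so it can be read as a soft number of neighbors; each point gets its own λ_i to reach the same value.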

SLIDE 22

SNE and t-SNE

In the embedding space, the similarity between x_i and x_j is defined as

$$q_{ij}(n) = \begin{cases} \dfrac{t(d_{ij}, n)}{\sum_{k \neq l} t(d_{kl}, n)} & \text{if } i \neq j \\ 0 & \text{otherwise} \end{cases} \qquad t(u, n) = \left( 1 + \frac{u^2}{n} \right)^{-\frac{n+1}{2}}$$

Similarities are symmetric
t(u,n) is proportional to a Student t with n degrees of freedom (n controls the thickness of the tail)
SNE: n → ∞
t-SNE: n = 1
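
To see how n controls the tail, the sketch below evaluates t(u, n) = (1 + u²/n)^(−(n+1)/2) and the normalized similarities q_ij for a small embedding. Function names and coordinates are hypothetical; a very large n is used as a stand-in for the SNE limit n → ∞.

```python
import math

def student_kernel(u, n):
    """t(u, n) = (1 + u**2 / n) ** (-(n + 1) / 2)."""
    return (1.0 + u * u / n) ** (-(n + 1.0) / 2.0)

def embedding_similarities(X, n):
    """q_ij: kernel values normalized over all ordered pairs k != l."""
    size = len(X)
    t = [[0.0 if i == j else student_kernel(math.dist(X[i], X[j]), n)
          for j in range(size)] for i in range(size)]
    total = sum(sum(row) for row in t)
    return [[value / total for value in row] for row in t]

# Tail comparison at distance u = 3.
heavy_tail = student_kernel(3.0, 1.0)   # t-SNE, n = 1
light_tail = student_kernel(3.0, 1e6)   # close to the Gaussian limit (SNE)

# Toy embedding (assumption).
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
Q = embedding_similarities(X, n=1.0)
```

At u = 3 the kernel with n = 1 is roughly ten times larger than the near-Gaussian one, which is what lets moderately distant points keep a non-negligible similarity in the embedding.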

SLIDE 26

SNE and t-SNE: gradient

Now that similarities are defined in both spaces, how to compare them?

$$E = D_{KL}(p \,\|\, q)$$

– This seems to be a major difference with respect to other methods, which are based on squared errors!

E is minimized by gradient descent, to find the locations x_i.

$$\frac{\partial E}{\partial x_i} = \frac{2n+2}{n} \sum_{j=1}^{N} \frac{p_{ij}(\lambda) - q_{ij}(n)}{1 + d_{ij}^2 / n} \, (x_i - x_j)$$

– (x_i − x_j): x_i moves towards x_j
– p_ij(λ) − q_ij(n): similarity error, adjusts the amplitude
– 1 / (1 + d_ij²/n): damping factor
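
The gradient can be written out term by term: a constant (2n + 2)/n, the similarity error p_ij − q_ij, the damping factor 1/(1 + d_ij²/n) and the direction x_i − x_j. The sketch below is a plain-Python illustration, not the authors' implementation; it assumes a symmetric matrix P that sums to one over all ordered pairs, and the function names and toy values are hypothetical.

```python
import math

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def similarities(X, n):
    """q_ij from the kernel (1 + d**2 / n) ** (-(n + 1) / 2), normalized."""
    size = len(X)
    t = [[0.0 if i == j else
          (1.0 + squared_distance(X[i], X[j]) / n) ** (-(n + 1.0) / 2.0)
          for j in range(size)] for i in range(size)]
    total = sum(sum(row) for row in t)
    return [[value / total for value in row] for row in t]

def kl_divergence(P, Q):
    """E = sum over i != j of p_ij * log(p_ij / q_ij)."""
    size = len(P)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(size) for j in range(size)
               if i != j and P[i][j] > 0.0)

def gradient(P, X, n):
    """dE/dx_i for every point i (P symmetric, summing to one)."""
    Q = similarities(X, n)
    size, dim = len(X), len(X[0])
    grad = [[0.0] * dim for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == j:
                continue
            error = P[i][j] - Q[i][j]                                  # amplitude
            damping = 1.0 / (1.0 + squared_distance(X[i], X[j]) / n)  # tail
            for c in range(dim):
                grad[i][c] += ((2.0 * n + 2.0) / n * error * damping
                               * (X[i][c] - X[j][c]))
    return grad

# Toy problem (assumption): three points, symmetric P summing to one.
P = [[0.0, 0.3, 0.05], [0.3, 0.0, 0.15], [0.05, 0.15, 0.0]]
X = [[0.0, 0.0], [1.0, 0.5], [-0.5, 2.0]]
n = 1.0
grad = gradient(P, X, n)
```

A gradient descent step moves each x_i by a small multiple of −grad[i]; for n = 1 the leading constant equals 4, which is the usual t-SNE form.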

SLIDE 27

SNE and t-SNE: gradient

The damping factor 1 / (1 + d_ij²/n) is similar to F_λ(d_ij) in CCA and CDA:

– Large distances are less important
– Distances in the embedding space are used, to allow tears (favoring extrusions)

SLIDE 28

SNE and t-SNE: distributions

Why different distributions for p_ij and q_ij?
Remember that distances often have to be enlarged: heavier tails (in the embedding space) help!

SLIDE 29

SNE and t-SNE: distributions

Non-trivial solution of min E
After some (rough) approximations:

$$d_{ij} \approx f(\delta_{ij}) = \sqrt{ n \exp\!\left( \frac{\delta_{ij}^2}{(n+1) \lambda_i^2} \right) - n }$$

Properties

– f is monotonically increasing
– with SNE (n → ∞): f(δ_ij) = δ_ij / λ_i
– if δ_ij ≪ λ_i, then f(δ_ij) ≈ √(n/(n+1)) δ_ij / λ_i (first-order expansion of the formula above)

t-SNE tries to preserve stretched distances
SNE distances are scaled by λ_i
n and λ_i act more or less in the same way
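
The distance transformation f(δ) = sqrt(n exp(δ² / ((n + 1) λ²)) − n) can be checked numerically against the properties listed on the slide. This is a minimal sketch; the function name `transformed_distance` and the sample values are assumptions, and a very large n stands in for the SNE limit.

```python
import math

def transformed_distance(delta, lam, n):
    """f(delta) = sqrt(n * exp(delta**2 / ((n + 1) * lam**2)) - n)."""
    # expm1(a) = exp(a) - 1 keeps precision when the exponent is tiny.
    return math.sqrt(n * math.expm1(delta ** 2 / ((n + 1.0) * lam ** 2)))

# SNE limit (n very large): distances are only rescaled by lambda.
sne_like = transformed_distance(2.0, lam=1.0, n=1e8)

# t-SNE (n = 1): a distance of 3 widths is strongly stretched.
stretched = transformed_distance(3.0, lam=1.0, n=1.0)

# Small distances (delta << lambda): nearly linear behaviour.
small = transformed_distance(0.01, lam=1.0, n=1.0)
```

For n = 1 the small-distance slope is √(1/2)/λ while large distances grow exponentially, which is the stretching that the slide attributes to t-SNE.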

SLIDE 31

Experiments

Data: swiss roll
Quality measures: in a K-neighborhood, we count the number of intrusions and extrusions. Then

– Q_NX(K) measures the overall number of intrusions and extrusions (higher Q_NX(K) means better quality)
– B_NX(K) measures the difference between the number of intrusions and extrusions (positive B_NX(K) means intrusive)

Use of both Euclidean and geodesic distances
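
One possible way to compute such rank-based criteria is sketched below. The slides do not give formulas, so the definitions used here are an assumption: for each point, neighbors are ranked by distance in both spaces; a pair counts when both ranks are at most K; it is an intrusion if its rank decreases in the embedding, an extrusion if it increases. Q_NX(K) is the fraction of counted pairs and B_NX(K) the fraction of intrusions minus the fraction of extrusions, both normalized by K·N. All function names are hypothetical.

```python
import math

def neighbor_ranks(points):
    """ranks[i][j] = rank of j among the neighbors of i (1 = nearest)."""
    n = len(points)
    ranks = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: (math.dist(points[i], points[j]), j))
        row = [0] * n
        for rank, j in enumerate(order, start=1):
            row[j] = rank
        ranks.append(row)
    return ranks

def qnx_bnx(hd_points, ld_points, K):
    """Return (Q_NX(K), B_NX(K)) under the assumed rank-based definitions."""
    rho = neighbor_ranks(hd_points)   # ranks in the original space
    r = neighbor_ranks(ld_points)     # ranks in the embedding
    n = len(hd_points)
    preserved = intrusions = extrusions = 0
    for i in range(n):
        for j in range(n):
            if i == j or rho[i][j] > K or r[i][j] > K:
                continue              # pair outside one of the K-neighborhoods
            if r[i][j] < rho[i][j]:
                intrusions += 1       # j came closer to i in the embedding
            elif r[i][j] > rho[i][j]:
                extrusions += 1       # j moved away from i in the embedding
            else:
                preserved += 1
    norm = K * n
    return (preserved + intrusions + extrusions) / norm, (intrusions - extrusions) / norm

# Toy data (assumption): six points on a line in 3-d.
hd = [(float(t), 0.0, 0.0) for t in range(6)]
perfect = [(float(t),) for t in range(6)]             # ranks unchanged
folded = [(abs(t - 2.5), float(t % 2)) for t in range(6)]  # a distorted embedding

q_perfect, b_perfect = qnx_bnx(hd, perfect, K=3)
q_folded, b_folded = qnx_bnx(hd, folded, K=3)
```

An embedding that keeps every rank unchanged reaches Q_NX(K) = 1 and B_NX(K) = 0; distorted embeddings lower Q_NX(K), and the sign of B_NX(K) tells which kind of error dominates.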

SLIDE 33

Results with Euclidean distances

Difficult problem! (low values of Q_NX(K))
t-SNE largely depends on perplexity

[Figure: results for increasing perplexity]

SLIDE 34

Results with Euclidean distances

SLIDE 36

Results with geodesic distances

Geodesic distances facilitate the task
CCA performs well!
t-SNE still depends on perplexity, but large values help

[Figure: results for increasing perplexity]

SLIDE 38

Conclusions

t-SNE is a distance preservation method
Stretching distances: good idea!
But the transformation in t-SNE is not always optimal (not data-driven)
Careful tuning of parameters!
Damping factor for large distances: good idea
But this does not solve the issue of non-Euclidean manifolds (e.g. hollow sphere)
Situation is better with clustered data (stretching large distances improves the separation between clusters)

SLIDE 39