

SLIDE 1

Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection

A Remedy Against the Curse of Dimensionality?

Erich Schubert, Michael Gertz
October 4, 2017, Munich, Germany

Heidelberg University

SLIDE 2

t-Stochastic Neighbor Embedding

t-SNE [MH08], based on SNE [HR02], is a popular "neural network" visualization technique using stochastic gradient descent (SGD).

[Figure: the same data in 10-dimensional space (left) and as a 2-dimensional embedding (right).]

Tries to preserve the neighbors – but not the distances.
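For readers who want to reproduce this kind of picture, here is a minimal sketch using scikit-learn's TSNE; the data and parameter values are illustrative stand-ins, not the settings behind these slides.

```python
import numpy as np
from sklearn.manifold import TSNE

# toy stand-in for the 10-dimensional input data
X = np.random.RandomState(0).rand(500, 10)

# embed into 2 dimensions; perplexity roughly controls how many neighbors to preserve
Y = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0).fit_transform(X)
print(Y.shape)  # (500, 2)
```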


SLIDE 4

t-Stochastic Neighbor Embedding

SNE [HR02] and t-SNE [MH08] are popular “neural network” visualization techniques using stochastic gradient descent (SGD)

[Figure: SNE embedding of the data.]

SNE/t-SNE do not preserve density / distances. Can get stuck in a local optimum!

SLIDE 5

t-Stochastic Neighbor Embedding

SNE [HR02] and t-SNE [MH08] are popular “neural network” visualization techniques using stochastic gradient descent (SGD)

[Figure: t-SNE embedding of the data.]

SNE/t-SNE do not preserve density / distances. Can get stuck in a local optimum!

SLIDE 6

t-Stochastic Neighbor Embedding

SNE and t-SNE use a Gaussian kernel in the input domain:

$$p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}$$

where each $\sigma_i^2$ is optimized to obtain the desired perplexity (perplexity ≈ number of neighbors to preserve).

This affinity is asymmetric, so they simply use $p_{ij} := (p_{i|j} + p_{j|i})/2$. (We suggest preferring $p_{ij} = p_{i|j} \cdot p_{j|i}$ for outlier detection.)

In the output domain, as $q_{ij}$, SNE uses a Gaussian (with constant $\sigma$), while t-SNE uses a Student-t distribution.

The Kullback-Leibler divergence can be minimized using stochastic gradient descent to make input and output affinities similar.
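As an illustration of how each $\sigma_i$ can be matched to the perplexity, here is a minimal sketch (not the authors' code): a binary search on $\beta_i = 1/(2\sigma_i^2)$ until the entropy of $p_{\cdot|i}$ equals log(perplexity). The function name and the dense distance matrix are assumptions for exposition only.

```python
import numpy as np

def conditional_affinities(D2, perplexity=30.0, tol=1e-5, max_iter=50):
    """Compute p_{j|i} from a dense matrix of squared distances D2 (n x n),
    calibrating beta_i = 1/(2 sigma_i^2) so the entropy matches log(perplexity)."""
    n = D2.shape[0]
    P = np.zeros((n, n))
    target = np.log(perplexity)
    for i in range(n):
        lo, hi, beta = 0.0, np.inf, 1.0
        d = np.delete(D2[i], i)                   # affinities exclude the point itself
        for _ in range(max_iter):
            w = np.exp(-d * beta)
            p = w / max(w.sum(), 1e-12)
            entropy = -np.sum(p * np.log(np.maximum(p, 1e-12)))
            if abs(entropy - target) < tol:
                break
            if entropy > target:                  # kernel too wide: increase beta
                lo, beta = beta, beta * 2.0 if np.isinf(hi) else (beta + hi) / 2.0
            else:                                 # kernel too narrow: decrease beta
                hi, beta = beta, (lo + beta) / 2.0
        P[i, np.arange(n) != i] = p
    return P
```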


SLIDE 7

SNE vs. t-SNE

Gaussian weights in the output domain as used by SNE vs. t-SNE:

[Figure: Gaussian weight (σ² = 1) vs. Student-t weight (t = 1) as a function of distance.]

t-SNE puts more emphasis on separating points: even neighbors will be "fanned out" a bit, and far points get "better" separation (SNE has 0 weight on far points).
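To make the "fanned out" effect concrete, a tiny sketch comparing the two unnormalized output weights (the distance values are illustrative):

```python
import numpy as np

d = np.linspace(0.0, 4.0, 9)              # distances in the low-dimensional embedding
gaussian = np.exp(-d**2 / 2.0)            # SNE output weight, sigma^2 = 1
student_t = 1.0 / (1.0 + d**2)            # t-SNE output weight (Student-t, 1 d.o.f.)
print(np.round(gaussian, 3))              # drops to ~0 beyond d of about 3
print(np.round(student_t, 3))             # heavy tail keeps weight on far points
```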


SLIDE 8

The Curse of Dimensionality

Loss of "discrimination" of distances [Bey+99]:

$$\lim_{\dim \to \infty} \mathrm{E}\left[\frac{\max_{y \neq x} d(x,y) - \min_{y \neq x} d(x,y)}{\min_{y \neq x} d(x,y)}\right] = 0$$

Distances to near points and to far points become similar.


SLIDE 9

The Curse of Dimensionality

Loss of "discrimination" of distances [Bey+99]:

$$\lim_{\dim \to \infty} \mathrm{E}\left[\frac{\max_{y \neq x} d(x,y) - \min_{y \neq x} d(x,y)}{\min_{y \neq x} d(x,y)}\right] = 0$$

Distances to near points and to far points become similar.

The Gaussian kernel uses relative distances: $\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)$

[Figure: distribution of distances vs. the expected distance.]

With high-dimensional data, all $p_{ij}$ become similar! We cannot find a "good" $\sigma_i$ anymore.
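A small simulation (a sketch on uniform random data, not taken from the paper) shows the relative contrast of [Bey+99] shrinking as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (3, 10, 100, 1000):
    X = rng.random((2000, dim))
    d = np.linalg.norm(X[1:] - X[0], axis=1)          # distances from one query point
    contrast = (d.max() - d.min()) / d.min()          # relative contrast of [Bey+99]
    print(f"dim={dim:5d}  relative contrast={contrast:.3f}")
```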


SLIDE 10

Distribution of Distances

On the short tail, distance distributions often look like this:

[Figure: neighbor distance densities for 3D, 10D, and 50D data.]

In high-dimensional data, almost all nearest neighbors concentrate on the right hand side of this plot.


SLIDE 11

Distribution of Distances

Gaussian weights as used by SNE / t-SNE:

[Figure: Gaussian weights for σ² = 0.3, σ² = 1, and σ² = 2 as a function of distance.]

For low-dimensional data, Gaussian weights work well. For high-dimensional data, they give almost the same weight to all points.

SLIDE 12

Distribution of Distances

Gaussian kernels adjusted for intrinsic dimensionality:

[Figure: Gaussian weights adjusted for intrinsic dimensionality, id = 3, 10, and 50, all with σ² = 1.]

In theory, they behave like Gaussian kernels in low dimensionality.


SLIDE 13

Distance Power Transform

Let $X$ be a random variable ("of distances") as in [Hou15]. For constants $c$ and $m$, use the transformation $Y = g(X)$ with $g(x) := c \cdot x^m$. Let $F_X$, $F_Y$ be the cumulative distribution functions of $X$, $Y$. Then

$$\mathrm{ID}_{F_X}(x) = m \cdot \mathrm{ID}_{F_Y}(c \cdot x^m) \quad \text{[Hou15, Table 1]}.$$

By choosing $m = \mathrm{ID}_{F_X}(x)/t$ for any $t > 0$, one therefore obtains

$$\mathrm{ID}_{F_Y}(c \cdot x^m) = \mathrm{ID}_{F_X}(x)/m = t,$$

where one can choose $c > 0$ as desired, e.g., for numerical reasons.

We can transform distances to any desired ID = t!
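A quick numeric check of this claim (a sketch; the Hill/MLE estimator below is one common way to estimate ID from a distance sample, not necessarily the estimator used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# distances whose small-distance tail has ID ~ 8: P(D <= u) ~ u^8, i.e. D = U^(1/8)
d = np.sort(rng.random(5000) ** (1.0 / 8.0))[:200]          # keep the short tail

def id_mle(dist):
    """Hill/MLE estimate of intrinsic dimensionality from ascending kNN distances."""
    return -1.0 / np.mean(np.log(dist[:-1] / dist[-1]))

t = 2.0
m = id_mle(d) / t                                            # m = ID / t
print(id_mle(d))                                             # roughly 8
print(id_mle(1.0 * d ** m))                                  # exactly ID/m = t = 2
```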


SLIDE 14

Distance Power Transform

For each point p:

  • 1. Find k′ nearest neighbors of p (should be k′ > 100, k′ > k)
  • 2. Estimate ID at p
  • 3. Choose $m = \mathrm{ID}_{F_X}(x)/t$ with $t = 2$ and $c$ = the k-distance of p
  • 4. Transform distances: $d'(p, q) := c \cdot d(p, q)^m$

  • 5. Use Gaussian kernel, perplexity, t-SNE, ...

Can we defeat the curse this easily?


SLIDE 15

Distance Power Transform

For each point p:

  • 1. Find k′ nearest neighbors of p (should be k′ > 100, k′ > k)
  • 2. Estimate ID at p
  • 3. Choose $m = \mathrm{ID}_{F_X}(x)/t$ with $t = 2$ and $c$ = the k-distance of p
  • 4. Transform distances: $d'(p, q) := c \cdot d(p, q)^m$

  • 5. Use Gaussian kernel, perplexity, t-SNE, ...

Can we defeat the curse this easily? Probably not: this is a hack to cure one symptom. Question: is our definition of ID too permissive?
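A sketch of this per-point recipe follows. It assumes an MLE/Hill-type estimator for step 2 and a scikit-learn kNN search; the function names are illustrative, and this is not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def id_mle(knn_dist):
    """MLE (Hill-type) estimate of local intrinsic dimensionality from the
    ascending distances to the k' nearest neighbors (no zero distances)."""
    return -1.0 / np.mean(np.log(knn_dist[:-1] / knn_dist[-1]))

def power_transform_knn_distances(X, k=20, k_prime=100, t=2.0):
    """Steps 1-4: find k' neighbors, estimate ID, choose m and c, transform."""
    dist, idx = NearestNeighbors(n_neighbors=k_prime + 1).fit(X).kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]        # drop each query point itself
    out = np.empty_like(dist)
    for i in range(len(X)):
        m = id_mle(dist[i]) / t                # step 3: m = ID / t
        c = dist[i, k - 1]                     # step 3: c = k-distance of p (the slide's choice)
        out[i] = c * dist[i] ** m              # step 4: d'(p, q) = c * d(p, q)^m
    return out, idx                            # step 5: feed into the Gaussian kernel / t-SNE
```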


SLIDE 16

Experimental Results: it-SNE

Projections of the ALOI outlier data set (as available at [Cam+16]):

[Figure: PCA, t-SNE, and it-SNE projections side by side.]

Data set: color histograms of 50,000 photos of 1000 objects. Each class: the same object under different angles and different lighting. Labeled outliers: classes reduced to 1-3 objects; the data may contain other "true" outliers!

SLIDE 17

Experimental Results: it-SNE

Projection of the ALOI outlier data set with t-SNE:


SLIDE 18

Experimental Results: it-SNE

Projection of the ALOI outlier data set with it-SNE:

Labeled & Unlabeled Outliers!


SLIDE 19

Experimental Results: it-SNE

t-SNE on the well-known MNIST data set:

SLIDE 20

Experimental Results: it-SNE

it-SNE on the well-known MNIST data set: outliers become visible!

SLIDE 21

Outlier Detection: ODIN

ODIN (Outlier Detection using Indegree Number) [HKF04]:

  • 1. Find the k nearest neighbors of each object.
  • 2. Count how often each object was returned.

= in-degree of the k nearest neighbor graph

  • 3. Objects with no (or fewest) occurrences are outliers.

Works, but many objects will have the exact same score. Which k to use? Can change abruptly with k. Can we make a continuous (“smooth”) version of this idea?
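A minimal sketch of these three steps (assuming a scikit-learn kNN search; the function name is illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def odin_indegree(X, k=10):
    """ODIN sketch: in-degree of each object in the kNN graph."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    indegree = np.zeros(len(X), dtype=int)
    for neighbors in idx[:, 1:]:               # skip the query point itself
        indegree[neighbors] += 1               # count how often each object is returned
    return indegree                            # lowest in-degree = most outlying
```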


SLIDE 22

Outlier Detection: SOS

SOS (Stochastic Outlier Selection) [JPH13]. Idea: assume every object links to one neighbor at random. Inliers are likely to be linked to; outliers are likely not to be linked to.

  • 1. Compute $p_{j|i}$ of SNE / t-SNE for all $i, j$:

$$p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}$$

Use Gaussian weights to prefer near neighbors.

  • 2. The SOS outlier score is then

$$\mathrm{SOS}(x_j) := \prod_{i \neq j} \left(1 - p_{j|i}\right)$$

i.e., the probability that no neighbor links to object $j$.
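Reusing the conditional_affinities sketch from the t-SNE slide above, this score is a few lines (again a sketch, not the reference implementation; the perplexity here plays the role of SOS's own perplexity parameter):

```python
import numpy as np

def sos_scores(X, perplexity=10.0):
    """SOS sketch: probability that no other object links to each point."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)     # dense squared distances
    P = conditional_affinities(D2, perplexity)               # p_{j|i}, sketched earlier
    return np.prod(1.0 - P, axis=0)                          # prod over i of (1 - p_{j|i}), per column j
```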


SLIDE 23

KNNSOS and ISOS Outlier Detection

We propose two variants of this idea:

  • 1. Since most $p_{j|i}$ will be zero, use only the k nearest neighbors. This reduces the runtime from $O(n^2)$ to possibly $O(n \log n)$ or $O(n^{4/3})$.

$$\mathrm{KNNSOS}(x_j) := \prod_{i \in k\mathrm{NN}(x_j)} \left(1 - p_{j|i}\right)$$

  • 2. Estimate $\mathrm{ID}(x_i)$, and use transformed distances for $p_{j|i}$.

ISOS: Intrinsic-dimensionality Stochastic Outlier Selection. Note: the t-SNE author, van der Maaten, already proposed an approximate, index-based variant of t-SNE, Barnes-Hut t-SNE, which also uses only the kNN [Maa14].
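A combined sketch of both variants. Assumptions: a scikit-learn kNN search, the MLE ID estimator from before, and, for brevity, the same k for the ID estimate and the score, where the slides suggest a larger k'; none of this is the paper's reference code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def _row_affinities(d2, perplexity, max_iter=50, tol=1e-5):
    """Calibrate beta = 1/(2 sigma^2) for one point so that the entropy of its
    kNN affinities matches log(perplexity) (same binary search as before)."""
    lo, hi, beta, target = 0.0, np.inf, 1.0, np.log(perplexity)
    p = np.full_like(d2, 1.0 / len(d2))
    for _ in range(max_iter):
        w = np.exp(-d2 * beta)
        p = w / max(w.sum(), 1e-12)
        entropy = -np.sum(p * np.log(np.maximum(p, 1e-12)))
        if abs(entropy - target) < tol:
            break
        if entropy > target:
            lo, beta = beta, beta * 2.0 if np.isinf(hi) else (beta + hi) / 2.0
        else:
            hi, beta = beta, (lo + beta) / 2.0
    return p

def knnsos_scores(X, k=30, perplexity=10.0, t=None):
    """KNNSOS sketch; setting t (e.g. t=2) applies the per-point power transform
    to the distances first, which gives the ISOS variant."""
    n = len(X)
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]                     # drop the query point itself
    if t is not None:                                       # ISOS: reduce intrinsic dim. to t
        for i in range(n):
            m = -1.0 / np.mean(np.log(dist[i, :-1] / dist[i, -1])) / t   # MLE ID / t
            dist[i] = dist[i, k - 1] * dist[i] ** m         # c = k-distance, d' = c * d^m
    P = np.vstack([_row_affinities(d ** 2, perplexity) for d in dist])   # p_{j|i} on the kNN
    scores = np.ones(n)
    for j in range(n):
        for i in idx[j]:                                    # i in kNN(x_j)
            hit = np.flatnonzero(idx[i] == j)               # p_{j|i} = 0 unless j in kNN(x_i)
            if hit.size:
                scores[j] *= 1.0 - P[i, hit[0]]
    return scores
```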


SLIDE 24

Experimental Results: Outlier Detection

[Figure: Adjusted AP vs. neighborhood size (1 to 200) on ALOI_withoutdupl_norm, comparing ISOS, kNNSOS, ODIN, kNN, kNNW, LOF, SimplifiedLOF, LoOP, INFLO, and KDEOS.]

SLIDE 25

Experimental Results: Outlier Detection

[Figure: Adjusted AP vs. neighborhood size (1 to 200) on Annthyroid_withoutdupl_norm_02_v01, comparing ISOS, kNNSOS, ODIN, kNN, kNNW, LOF, SimplifiedLOF, LoOP, INFLO, and KDEOS.]

SLIDE 26

Experimental Results: Outlier Detection

[Figure: Adjusted AP vs. neighborhood size (1 to 200) on Pima_withoutdupl_norm_02_v01, comparing ISOS, kNNSOS, ODIN, kNN, kNNW, LOF, SimplifiedLOF, LoOP, INFLO, and KDEOS.]

SLIDE 27

Experimental Results: Outliers in MNIST

[Figure: the top MNIST outliers as digit images, each annotated with its class label and outlier score, e.g., 0 (33.4%), 5 (33.1%), 8 (32.7%), ...]

SLIDE 28

Conclusions

◮ We can "reduce" intrinsic dimensionality to ID = t using $m = \mathrm{ID}_{F_X}(x)/t$. But is this more than a cure for a symptom (for our estimate)?

◮ t-SNE benefits from this adjustment: we get more difference in the neighbor weights. (We can also use SNE, but we did not experiment with this.)

◮ t-SNE tends to hide outliers, unless we use $p_{ij} = p_{i|j} \cdot p_{j|i}$ instead of $p_{ij} = \frac{1}{2}(p_{i|j} + p_{j|i})$.

◮ We can make SOS outlier detection faster using the kNN only.

◮ ISOS improves SOS by adjusting for ID.

SLIDE 29

Thank You! Questions?

SLIDE 30

Thank You! Questions?

How do we fix ID?


SLIDE 31

References i

[Bey+99] K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. "When Is "Nearest Neighbor" Meaningful?" In: Int. Conf. Database Theory, ICDT. 1999.

[Cam+16] G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent, and M. E. Houle. "On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study". In: Data Min. Knowl. Discov. 30.4 (2016).

[HKF04] V. Hautamäki, I. Kärkkäinen, and P. Fränti. "Outlier Detection Using k-Nearest Neighbour Graph". In: Int. Conf. Pattern Recognition, ICPR. 2004.

[Hou15] M. E. Houle. Inlierness, outlierness, hubness and discriminability: an extreme-value-theoretic foundation. Tech. rep. NII-2015-002E. National Institute of Informatics, Tokyo, Japan, 2015.

[HR02] G. E. Hinton and S. T. Roweis. "Stochastic Neighbor Embedding". In: Adv. in Neural Information Processing Systems 15, NIPS. 2002.

[JPH13] J. H. M. Janssens, E. O. Postma, and H. J. van den Herik. "Density-Based Anomaly Detection in the Maritime Domain". In: Situation Awareness with Systems of Systems. 2013.

SLIDE 32

References ii

[Maa14] L. van der Maaten. "Accelerating t-SNE using tree-based algorithms". In: J. Machine Learning Research 15.1 (2014).

[MH08] L. van der Maaten and G. Hinton. "Visualizing Data using t-SNE". In: J. Machine Learning Research 9.11 (2008).