Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection
A Remedy Against the Curse of Dimensionality?
Erich Schubert, Michael Gertz (Heidelberg University)
October 4, 2017, Munich, Germany


  1. Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection
     A Remedy Against the Curse of Dimensionality?
     Erich Schubert, Michael Gertz
     October 4, 2017, Munich, Germany
     Heidelberg University

  2. t-Stochastic Neighbor Embedding
     t-SNE [MH08], based on SNE [HR02], is a popular “neural network” visualization technique using stochastic gradient descent (SGD).
     [Figure: a 10-dimensional data set and its 2-dimensional embedding.]
     It tries to preserve the neighbors – but not the distances.
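
     A quick way to reproduce this kind of figure is to run an off-the-shelf t-SNE implementation on synthetic 10-dimensional data. This is a minimal sketch; scikit-learn's TSNE and the three-cluster toy data are my assumptions, the slides do not name a library.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Three well-separated clusters in 10-dimensional space (illustrative data).
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 10))
               for c in (0.0, 2.0, 4.0)])

# Embed into 2 dimensions; neighborhoods are preserved, absolute distances are not.
Y = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(Y.shape)  # (600, 2)
```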

  4. t-Stochastic Neighbor Embedding
     SNE [HR02] and t-SNE [MH08] are popular “neural network” visualization techniques using stochastic gradient descent (SGD).
     [Figure: SNE embedding of the input data.]
     SNE/t-SNE do not preserve density / distances. They can get stuck in a local optimum!

  5. t-Stochastic Neighbor Embedding
     SNE [HR02] and t-SNE [MH08] are popular “neural network” visualization techniques using stochastic gradient descent (SGD).
     [Figure: t-SNE embedding of the same input data.]
     SNE/t-SNE do not preserve density / distances. They can get stuck in a local optimum!

  6. t-Stochastic Neighbor Embedding
     SNE and t-SNE use a Gaussian kernel in the input domain:
     $$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$$
     where each $\sigma_i^2$ is optimized to have the desired perplexity (perplexity ≈ number of neighbors to preserve).
     This is asymmetric, so they simply use $p_{ij} := (p_{i|j} + p_{j|i}) / 2$.
     (We suggest to prefer $p_{ij} = \sqrt{p_{i|j} \cdot p_{j|i}}$ for outlier detection.)
     In the output domain, as $q_{ij}$, SNE uses a Gaussian (with constant $\sigma$); t-SNE uses a Student-t distribution.
     ⇒ The Kullback-Leibler divergence can be minimized using stochastic gradient descent to make input and output affinities similar.
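
     A minimal sketch of this affinity computation, assuming a simple bisection search for each $\sigma_i$ and the geometric-mean symmetrization suggested above; the function names and the dense O(n²) layout are illustrative, not the authors' reference code.

```python
import numpy as np

def conditional_affinities(sq_dists_i, perplexity, tol=1e-5, max_iter=50):
    """Compute p_{j|i} from point i's squared distances to the other points,
    tuning beta = 1/(2*sigma_i^2) by bisection so that 2**entropy ~ perplexity."""
    beta, beta_lo, beta_hi = 1.0, 0.0, np.inf
    target_entropy = np.log2(perplexity)
    for _ in range(max_iter):
        w = np.exp(-sq_dists_i * beta)
        p = w / max(np.sum(w), 1e-12)
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:          # distribution too flat: shrink sigma_i
            beta_lo = beta
            beta = beta * 2.0 if np.isinf(beta_hi) else (beta_lo + beta_hi) / 2.0
        else:                                 # too peaked: grow sigma_i
            beta_hi = beta
            beta = (beta_lo + beta_hi) / 2.0
    return p

def symmetric_affinities(X, perplexity=30.0):
    """p_{j|i} for every i, then p_ij = sqrt(p_{i|j} * p_{j|i}) (geometric mean)."""
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.zeros((n, n))
    for i in range(n):
        others = np.arange(n) != i
        P[i, others] = conditional_affinities(sq[i, others], perplexity)
    return np.sqrt(P * P.T)
```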

  7. SNE vs. t-SNE
     Gaussian weights in the output domain as used by SNE vs. t-SNE:
     [Figure: Gaussian weight (σ² = 1) vs. Student-t weight (t = 1) as a function of distance.]
     t-SNE puts more emphasis on separating points:
     ⇒ even neighbors will be “fanned out” a bit
     ⇒ “better” separation of far points (SNE has 0 weight on far points)
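
     A tiny sketch of the two (unnormalized) output-domain weight functions plotted above, with the slide's parameters σ² = 1 and one degree of freedom:

```python
import numpy as np

def sne_weight(d, sigma2=1.0):
    """Gaussian output weight used by SNE (before normalization)."""
    return np.exp(-d ** 2 / (2.0 * sigma2))

def tsne_weight(d):
    """Student-t output weight with 1 degree of freedom, used by t-SNE."""
    return 1.0 / (1.0 + d ** 2)

d = np.linspace(0, 4, 9)
print(np.round(sne_weight(d), 3))   # drops to ~0 quickly: far points get no weight
print(np.round(tsne_weight(d), 3))  # heavy tail: far points are still pushed apart
```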

  8. The Curse of Dimensionality
     Loss of “discrimination” of distances [Bey+99]:
     $$\lim_{\dim \to \infty} E\left[\frac{\max_{y \neq x} d(x,y) - \min_{y \neq x} d(x,y)}{\min_{y \neq x} d(x,y)}\right] = 0$$
     ⇒ Distances to near points and to far points become similar.

  9. The Curse of Dimensionality
     Loss of “discrimination” of distances [Bey+99]:
     $$\lim_{\dim \to \infty} E\left[\frac{\max_{y \neq x} d(x,y) - \min_{y \neq x} d(x,y)}{\min_{y \neq x} d(x,y)}\right] = 0$$
     ⇒ Distances to near points and to far points become similar.
     The Gaussian kernel $\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)$ uses these relative distances.
     [Figure: distance vs. expected distance.]
     With high-dimensional data, all $p_{ij}$ become similar!
     ⇒ We cannot find a “good” $\sigma_i$ anymore.
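
     A small illustrative experiment showing the relative contrast from [Bey+99] shrinking as the dimension grows; uniform random data and the sample sizes are my assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (3, 10, 50, 500):
    X = rng.random((2000, dim))            # points uniform in the unit cube
    q = rng.random(dim)                    # one query point
    d = np.linalg.norm(X - q, axis=1)
    contrast = (d.max() - d.min()) / d.min()
    print(f"dim={dim:4d}  relative contrast={contrast:.3f}")
# The ratio shrinks with the dimension: the nearest and the farthest neighbor
# look almost alike, so a single Gaussian bandwidth sigma_i cannot separate them.
```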

  10. Distribution of Distances
      On the short tail, distance distributions often look like this:
      [Figure: neighbor density curves for 3D, 10D, and 50D data.]
      In high-dimensional data, almost all nearest neighbors concentrate on the right-hand side of this plot.

  11. Distribution of Distances
      Gaussian weights as used by SNE / t-SNE:
      [Figure: Gaussian weight curves for σ² = 0.3, 1, and 2.]
      For low-dimensional data, Gaussian weights work well.
      For high-dimensional data, they assign almost the same weight to all points.

  12. Distribution of Distances
      Gaussian kernels adjusted for intrinsic dimensionality:
      [Figure: adjusted Gaussian weight curves for id = 3, 10, and 50 (σ² = 1).]
      In theory, they behave like Gaussian kernels in low dimensionality.

  13. Distance Power Transform
      Let X be a random variable (“of distances”) as in [Hou15]. For constants c and m, use the transformation Y = g(X) with g(x) := c · x^m.
      Let F_X, F_Y be the cumulative distributions of X and Y. Then [Hou15, Table 1]:
      $$\mathrm{ID}_{F_X}(x) = m \cdot \mathrm{ID}_{F_Y}(c \cdot x^m)$$
      By choosing $m = \mathrm{ID}_{F_X}(x) / t$ for any $t > 0$, one therefore obtains:
      $$\mathrm{ID}_{F_Y}(c \cdot x^m) = \mathrm{ID}_{F_X}(x) / m = t$$
      where one can choose $c > 0$ as desired, e.g., for numerical reasons.
      ⇒ We can transform distances to any desired ID = t!
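
      A quick numerical check of this identity. The maximum-likelihood (Hill-type) ID estimator and the synthetic distance sample with a known ID are my assumptions; the slides only require some ID estimate.

```python
import numpy as np

def id_mle(knn_dists):
    """Maximum-likelihood (Hill-type) intrinsic-dimensionality estimate
    from the sorted k-nearest-neighbor distances of one point."""
    w = knn_dists[-1]                                   # k-distance
    return -1.0 / np.mean(np.log(knn_dists[:-1] / w))

rng = np.random.default_rng(1)
# Simulated neighbor distances with F(x) = x^12 on [0, 1], i.e. a true local ID of 12.
d = np.sort(rng.random(100) ** (1.0 / 12.0))

t = 2.0
m = id_mle(d) / t                                       # exponent from slide 13
c = d[-1] ** (1.0 - m)                                  # any c > 0 works; this keeps the k-distance fixed
d_transformed = c * d ** m                              # g(x) = c * x^m
print(id_mle(d), id_mle(d_transformed))                 # ~12, and exactly id/m = t = 2
```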

  14. Distance Power Transform
      For each point p:
      1. Find the k′ nearest neighbors of p (should be k′ > 100, k′ > k)
      2. Estimate the ID at p
      3. Choose m = ID_{F_X}(x) / t, with t = 2 and c = the k-distance
      4. Transform the distances: d′(p, q) := c · d(p, q)^m
      5. Use the Gaussian kernel, perplexity, t-SNE, ...
      Can we defeat the curse this easily?

  15. Distance Power Transform
      For each point p:
      1. Find the k′ nearest neighbors of p (should be k′ > 100, k′ > k)
      2. Estimate the ID at p
      3. Choose m = ID_{F_X}(x) / t, with t = 2 and c = the k-distance
      4. Transform the distances: d′(p, q) := c · d(p, q)^m
      5. Use the Gaussian kernel, perplexity, t-SNE, ...
      Can we defeat the curse this easily?
      Probably not: this is a hack to cure one symptom.
      Question: is our definition of ID too permissive?
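
      A sketch of the per-point recipe from slide 14, assuming scikit-learn's NearestNeighbors for the kNN search and the MLE/Hill ID estimate from the previous sketch; the helper name transformed_knn_distances is hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def transformed_knn_distances(X, k=50, k_prime=150, t=2.0):
    """For each point: find k' nearest neighbors, estimate its local ID, and
    power-transform its neighbor distances so that their ID becomes roughly t."""
    nn = NearestNeighbors(n_neighbors=k_prime + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]                 # drop the self-neighbor
    out = np.empty_like(dist)
    for i in range(X.shape[0]):
        d = dist[i]
        ratios = np.maximum(d[:-1], 1e-12) / d[-1]
        id_i = -1.0 / np.mean(np.log(ratios))           # MLE / Hill ID estimate at point i
        m = id_i / t                                    # exponent from slide 13
        c = d[k - 1]                                    # c = k-distance, as on slide 14
        out[i] = c * d ** m                             # d'(p, q) = c * d(p, q)^m
    return out, idx
# out[i] can then be fed into the Gaussian kernel / perplexity calibration of t-SNE.
```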

  16. Experimental Results: it-SNE
      Projections of the ALOI outlier data set (as available at [Cam+16]):
      [Figure: PCA, t-SNE, and it-SNE projections side by side.]
      Data set: color histograms of 50,000 photos of 1,000 objects.
      Each class: the same object under different angles and different lighting.
      Labeled outliers: classes reduced to 1-3 objects (the data may contain other “true” outliers!)

  17. Experimental Results: it-SNE
      Projection of the ALOI outlier data set with t-SNE:
      [Figure: t-SNE projection of ALOI.]

  18. Experimental Results: it-SNE
      Projection of the ALOI outlier data set with it-SNE:
      [Figure: it-SNE projection of ALOI, highlighting labeled & unlabeled outliers.]

  19. Experimental Results: it-SNE
      On the well-known MNIST data set, t-SNE:
      [Figure: t-SNE projection of MNIST.]

  20. Experimental Results: it-SNE
      On the well-known MNIST data set, it-SNE:
      [Figure: it-SNE projection of MNIST, highlighting outliers.]

  21. Outlier Detection: ODIN
      ODIN (Outlier Detection using Indegree Number) [HKF04]:
      1. Find the k nearest neighbors of each object.
      2. Count how often each object was returned (= its in-degree in the k nearest neighbor graph).
      3. Objects with no (or the fewest) occurrences are outliers.
      This works, but many objects will have the exact same score. Which k to use? Scores can change abruptly with k.
      Can we make a continuous (“smooth”) version of this idea?
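
      A minimal sketch of ODIN's in-degree score, assuming scikit-learn's NearestNeighbors for the kNN queries.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def odin_scores(X, k=10):
    """ODIN [HKF04]: in-degree of each object in the kNN graph.
    Low in-degree (few or no occurrences in other points' kNN lists) = outlier."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    indegree = np.bincount(idx[:, 1:].ravel(), minlength=X.shape[0])
    return indegree   # objects with the smallest counts are the outlier candidates
```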

  22. Outlier Detection: SOS
      SOS (Stochastic Outlier Selection) [JPH13]
      Idea: assume every object can link to one neighbor randomly. Inliers are likely to be linked to; outliers are likely not to be linked to.
      1. Compute the p_{j|i} of SNE / t-SNE for all i, j:
      $$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$$
      i.e., use Gaussian weights to prefer near neighbors.
      2. The SOS outlier score is then:
      $$\mathrm{SOS}(x_j) := \prod_{i \neq j} (1 - p_{j|i})$$
      = the probability that no neighbor links to object j.
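
      A sketch of this score in its dense O(n²) form, reusing the conditional_affinities helper sketched after slide 6; the function name and the perplexity default are illustrative assumptions.

```python
import numpy as np

def sos_scores(X, perplexity=30.0):
    """SOS [JPH13]: probability that no other object links to j,
    using perplexity-calibrated affinities p_{j|i} (see conditional_affinities)."""
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.zeros((n, n))
    for i in range(n):
        others = np.arange(n) != i
        P[i, others] = conditional_affinities(sq[i, others], perplexity)
    # SOS(x_j) = prod_{i != j} (1 - p_{j|i}); the zero diagonal contributes a factor of 1.
    return np.prod(1.0 - P, axis=0)
```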

  23. KNNSOS and ISOS Outlier Detection
      We propose two variants of this idea:
      1. Since most p_{j|i} will be zero, use only the k nearest neighbors. This reduces the runtime from O(n²) to possibly O(n log n) or O(n^{4/3}):
      $$\mathrm{KNNSOS}(x_j) := \prod_{i \in k\mathrm{NN}(x_j)} (1 - p_{j|i})$$
      2. Estimate ID(x_i), and use the transformed distances for p_{j|i}:
      ISOS: Intrinsic-dimensionality Stochastic Outlier Selection
      Note: the t-SNE author, van der Maaten, already proposed an approximate and index-based variant of t-SNE, Barnes-Hut t-SNE, which also uses only the kNN [Maa14].
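
      A sketch of the kNN-restricted variant, again assuming scikit-learn's NearestNeighbors and the conditional_affinities helper from above. It aggregates each (1 − p_{j|i}) over the points i that have j among their k nearest neighbors, which is where the sparse p_{j|i} are supported. For ISOS, one would feed the power-transformed distances from the sketch after slide 15 instead of the raw ones.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knnsos_scores(X, k=30, perplexity=10.0):
    """KNNSOS: like SOS, but each p_{j|i} is supported only on the k nearest
    neighbors of x_i, so only O(n*k) affinities are ever computed."""
    n = X.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]                  # drop the self-neighbor
    scores = np.ones(n)
    for i in range(n):
        p = conditional_affinities(dist[i] ** 2, perplexity)  # p_{j|i} over kNN(x_i)
        scores[idx[i]] *= 1.0 - p            # multiply (1 - p_{j|i}) into each neighbor j
    return scores                            # high score = unlikely to be linked to = outlier
```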
