

SLIDE 1

Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection

A Remedy Against the Curse of Dimensionality?

Erich Schubert, Michael Gertz
October 4, 2017, Munich, Germany

Heidelberg University

SLIDE 2

t-Stochastic Neighbor Embedding

t-SNE [MH08], based on SNE [HR02], is a popular "neural network" visualization technique using stochastic gradient descent (SGD).

[Figure: the same data in 10-dimensional space (left) and as a 2-dimensional embedding (right).]

Tries to preserve the neighbors – but not the distances.
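For readers who want to reproduce this kind of picture, here is a minimal sketch using scikit-learn's TSNE; the data and parameter values are illustrative stand-ins, not the settings behind these slides.

```python
import numpy as np
from sklearn.manifold import TSNE

# toy stand-in for the 10-dimensional input data
X = np.random.RandomState(0).rand(500, 10)

# embed into 2 dimensions; perplexity roughly controls how many neighbors to preserve
Y = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0).fit_transform(X)
print(Y.shape)  # (500, 2)
```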


SLIDE 4

t-Stochastic Neighbor Embedding

SNE [HR02] and t-SNE [MH08] are popular “neural network” visualization techniques using stochastic gradient descent (SGD)

[Figure: SNE embedding of the data.]

SNE/t-SNE do not preserve density / distances. Can get stuck in a local optimum!

SLIDE 5

t-Stochastic Neighbor Embedding

SNE [HR02] and t-SNE [MH08] are popular “neural network” visualization techniques using stochastic gradient descent (SGD)

[Figure: t-SNE embedding of the data.]

SNE/t-SNE do not preserve density / distances. Can get stuck in a local optimum!

SLIDE 6

t-Stochastic Neighbor Embedding

SNE and t-SNE use a Gaussian kernel in the input domain:

$$p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}$$

where each $\sigma_i^2$ is optimized to obtain the desired perplexity (perplexity ≈ number of neighbors to preserve).

This affinity is asymmetric, so they simply use $p_{ij} := (p_{i|j} + p_{j|i})/2$. (We suggest preferring $p_{ij} = p_{i|j} \cdot p_{j|i}$ for outlier detection.)

In the output domain, as $q_{ij}$, SNE uses a Gaussian (with constant $\sigma$), while t-SNE uses a Student-t distribution.

The Kullback-Leibler divergence can be minimized using stochastic gradient descent to make input and output affinities similar.
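As an illustration of how each $\sigma_i$ can be matched to the perplexity, here is a minimal sketch (not the authors' code): a binary search on $\beta_i = 1/(2\sigma_i^2)$ until the entropy of $p_{\cdot|i}$ equals log(perplexity). The function name and the dense distance matrix are assumptions for exposition only.

```python
import numpy as np

def conditional_affinities(D2, perplexity=30.0, tol=1e-5, max_iter=50):
    """Compute p_{j|i} from a dense matrix of squared distances D2 (n x n),
    calibrating beta_i = 1/(2 sigma_i^2) so the entropy matches log(perplexity)."""
    n = D2.shape[0]
    P = np.zeros((n, n))
    target = np.log(perplexity)
    for i in range(n):
        lo, hi, beta = 0.0, np.inf, 1.0
        d = np.delete(D2[i], i)                   # affinities exclude the point itself
        for _ in range(max_iter):
            w = np.exp(-d * beta)
            p = w / max(w.sum(), 1e-12)
            entropy = -np.sum(p * np.log(np.maximum(p, 1e-12)))
            if abs(entropy - target) < tol:
                break
            if entropy > target:                  # kernel too wide: increase beta
                lo, beta = beta, beta * 2.0 if np.isinf(hi) else (beta + hi) / 2.0
            else:                                 # kernel too narrow: decrease beta
                hi, beta = beta, (lo + beta) / 2.0
        P[i, np.arange(n) != i] = p
    return P
```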


SLIDE 7

SNE vs. t-SNE

Gaussian weights in the output domain as used by SNE vs. t-SNE:

[Figure: Gaussian weight (σ² = 1) vs. Student-t weight (t = 1) as a function of distance.]

t-SNE puts more emphasis on separating points: even neighbors will be "fanned out" a bit, and far points get "better" separation (SNE has 0 weight on far points).
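To make the "fanned out" effect concrete, a tiny sketch comparing the two unnormalized output weights (the distance values are illustrative):

```python
import numpy as np

d = np.linspace(0.0, 4.0, 9)              # distances in the low-dimensional embedding
gaussian = np.exp(-d**2 / 2.0)            # SNE output weight, sigma^2 = 1
student_t = 1.0 / (1.0 + d**2)            # t-SNE output weight (Student-t, 1 d.o.f.)
print(np.round(gaussian, 3))              # drops to ~0 beyond d of about 3
print(np.round(student_t, 3))             # heavy tail keeps weight on far points
```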


SLIDE 8

The Curse of Dimensionality

Loss of "discrimination" of distances [Bey+99]:

$$\lim_{\dim \to \infty} \mathrm{E}\left[\frac{\max_{y \neq x} d(x,y) - \min_{y \neq x} d(x,y)}{\min_{y \neq x} d(x,y)}\right] = 0$$

Distances to near points and to far points become similar.


SLIDE 9

The Curse of Dimensionality

Loss of "discrimination" of distances [Bey+99]:

$$\lim_{\dim \to \infty} \mathrm{E}\left[\frac{\max_{y \neq x} d(x,y) - \min_{y \neq x} d(x,y)}{\min_{y \neq x} d(x,y)}\right] = 0$$

Distances to near points and to far points become similar.

The Gaussian kernel uses relative distances: $\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)$

[Figure: distribution of distances vs. the expected distance.]

With high-dimensional data, all $p_{ij}$ become similar! We cannot find a "good" $\sigma_i$ anymore.
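A small simulation (a sketch on uniform random data, not taken from the paper) shows the relative contrast of [Bey+99] shrinking as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (3, 10, 100, 1000):
    X = rng.random((2000, dim))
    d = np.linalg.norm(X[1:] - X[0], axis=1)          # distances from one query point
    contrast = (d.max() - d.min()) / d.min()          # relative contrast of [Bey+99]
    print(f"dim={dim:5d}  relative contrast={contrast:.3f}")
```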


SLIDE 10

Distribution of Distances

On the short tail, distance distributions often look like this:

[Figure: neighbor distance densities for 3D, 10D, and 50D data.]

In high-dimensional data, almost all nearest neighbors concentrate on the right hand side of this plot.


SLIDE 11

Distribution of Distances

Gaussian weights as used by SNE / t-SNE:

[Figure: Gaussian weights for σ² = 0.3, σ² = 1, and σ² = 2 as a function of distance.]

For low-dimensional data, Gaussian weights work well. For high-dimensional data, they give almost the same weight to all points.

SLIDE 12

Distribution of Distances

Gaussian kernels adjusted for intrinsic dimensionality:

[Figure: Gaussian weights adjusted for intrinsic dimensionality, id = 3, 10, and 50, all with σ² = 1.]

In theory, they behave like Gaussian kernels in low dimensionality.


SLIDE 13

Distance Power Transform

Let $X$ be a random variable ("of distances") as in [Hou15]. For constants $c$ and $m$, use the transformation $Y = g(X)$ with $g(x) := c \cdot x^m$. Let $F_X$, $F_Y$ be the cumulative distribution functions of $X$, $Y$. Then

$$\mathrm{ID}_{F_X}(x) = m \cdot \mathrm{ID}_{F_Y}(c \cdot x^m) \quad \text{[Hou15, Table 1]}.$$

By choosing $m = \mathrm{ID}_{F_X}(x)/t$ for any $t > 0$, one therefore obtains

$$\mathrm{ID}_{F_Y}(c \cdot x^m) = \mathrm{ID}_{F_X}(x)/m = t,$$

where one can choose $c > 0$ as desired, e.g., for numerical reasons.

We can transform distances to any desired ID = t!
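A quick numeric check of this claim (a sketch; the Hill/MLE estimator below is one common way to estimate ID from a distance sample, not necessarily the estimator used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# distances whose small-distance tail has ID ~ 8: P(D <= u) ~ u^8, i.e. D = U^(1/8)
d = np.sort(rng.random(5000) ** (1.0 / 8.0))[:200]          # keep the short tail

def id_mle(dist):
    """Hill/MLE estimate of intrinsic dimensionality from ascending kNN distances."""
    return -1.0 / np.mean(np.log(dist[:-1] / dist[-1]))

t = 2.0
m = id_mle(d) / t                                            # m = ID / t
print(id_mle(d))                                             # roughly 8
print(id_mle(1.0 * d ** m))                                  # exactly ID/m = t = 2
```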


SLIDE 14

Distance Power Transform

For each point p:

  • 1. Find k′ nearest neighbors of p (should be k′ > 100, k′ > k)
  • 2. Estimate ID at p
  • 3. Choose $m = \mathrm{ID}_{F_X}(x)/t$ with $t = 2$ and $c$ = the k-distance of p
  • 4. Transform distances: $d'(p, q) := c \cdot d(p, q)^m$

  • 5. Use Gaussian kernel, perplexity, t-SNE, ...

Can we defeat the curse this easily?


SLIDE 15

Distance Power Transform

For each point p:

  • 1. Find k′ nearest neighbors of p (should be k′ > 100, k′ > k)
  • 2. Estimate ID at p
  • 3. Choose $m = \mathrm{ID}_{F_X}(x)/t$ with $t = 2$ and $c$ = the k-distance of p
  • 4. Transform distances: $d'(p, q) := c \cdot d(p, q)^m$

  • 5. Use Gaussian kernel, perplexity, t-SNE, ...

Can we defeat the curse this easily? Probably not: this is a hack to cure one symptom. Question: is our definition of ID too permissive?
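A sketch of this per-point recipe follows. It assumes an MLE/Hill-type estimator for step 2 and a scikit-learn kNN search; the function names are illustrative, and this is not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def id_mle(knn_dist):
    """MLE (Hill-type) estimate of local intrinsic dimensionality from the
    ascending distances to the k' nearest neighbors (no zero distances)."""
    return -1.0 / np.mean(np.log(knn_dist[:-1] / knn_dist[-1]))

def power_transform_knn_distances(X, k=20, k_prime=100, t=2.0):
    """Steps 1-4: find k' neighbors, estimate ID, choose m and c, transform."""
    dist, idx = NearestNeighbors(n_neighbors=k_prime + 1).fit(X).kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]        # drop each query point itself
    out = np.empty_like(dist)
    for i in range(len(X)):
        m = id_mle(dist[i]) / t                # step 3: m = ID / t
        c = dist[i, k - 1]                     # step 3: c = k-distance of p (the slide's choice)
        out[i] = c * dist[i] ** m              # step 4: d'(p, q) = c * d(p, q)^m
    return out, idx                            # step 5: feed into the Gaussian kernel / t-SNE
```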


SLIDE 16

Experimental Results: it-SNE

Projections of the ALOI outlier data set (as available at [Cam+16]):

[Figure: PCA, t-SNE, and it-SNE projections side by side.]

Data set: color histograms of 50,000 photos of 1000 objects. Each class: the same object under different angles and different lighting. Labeled outliers: classes reduced to 1-3 objects; the data may contain other "true" outliers!

SLIDE 17

Experimental Results: it-SNE

Projection of the ALOI outlier data set with t-SNE:


SLIDE 18

Experimental Results: it-SNE

Projection of the ALOI outlier data set with it-SNE:

Labeled & Unlabeled Outliers!


SLIDE 19

Experimental Results: it-SNE

t-SNE on the well-known MNIST data set:

SLIDE 20

Experimental Results: it-SNE

it-SNE on the well-known MNIST data set: outliers become visible!

SLIDE 21

Outlier Detection: ODIN

ODIN (Outlier Detection using Indegree Number) [HKF04]:

  • 1. Find the k nearest neighbors of each object.
  • 2. Count how often each object was returned.

= in-degree of the k nearest neighbor graph

  • 3. Objects with no (or fewest) occurrences are outliers.

Works, but many objects will have the exact same score. Which k to use? Can change abruptly with k. Can we make a continuous (“smooth”) version of this idea?
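A minimal sketch of these three steps (assuming a scikit-learn kNN search; the function name is illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def odin_indegree(X, k=10):
    """ODIN sketch: in-degree of each object in the kNN graph."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    indegree = np.zeros(len(X), dtype=int)
    for neighbors in idx[:, 1:]:               # skip the query point itself
        indegree[neighbors] += 1               # count how often each object is returned
    return indegree                            # lowest in-degree = most outlying
```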


SLIDE 22

Outlier Detection: SOS

SOS (Stochastic Outlier Selection) [JPH13]. Idea: assume every object links to one neighbor at random. Inliers are likely to be linked to; outliers are likely not to be linked to.

  • 1. Compute $p_{j|i}$ of SNE / t-SNE for all $i, j$:

$$p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}$$

Use Gaussian weights to prefer near neighbors.

  • 2. The SOS outlier score is then

$$\mathrm{SOS}(x_j) := \prod_{i \neq j} \left(1 - p_{j|i}\right)$$

i.e., the probability that no neighbor links to object $j$.
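Reusing the conditional_affinities sketch from the t-SNE slide above, this score is a few lines (again a sketch, not the reference implementation; the perplexity here plays the role of SOS's own perplexity parameter):

```python
import numpy as np

def sos_scores(X, perplexity=10.0):
    """SOS sketch: probability that no other object links to each point."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)     # dense squared distances
    P = conditional_affinities(D2, perplexity)               # p_{j|i}, sketched earlier
    return np.prod(1.0 - P, axis=0)                          # prod over i of (1 - p_{j|i}), per column j
```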


SLIDE 23

KNNSOS and ISOS Outlier Detection

We propose two variants of this idea:

  • 1. Since most $p_{j|i}$ will be zero, use only the k nearest neighbors. This reduces the runtime from $O(n^2)$ to possibly $O(n \log n)$ or $O(n^{4/3})$.

$$\mathrm{KNNSOS}(x_j) := \prod_{i \in k\mathrm{NN}(x_j)} \left(1 - p_{j|i}\right)$$

  • 2. Estimate $\mathrm{ID}(x_i)$, and use transformed distances for $p_{j|i}$.

ISOS: Intrinsic-dimensionality Stochastic Outlier Selection. Note: the t-SNE author, van der Maaten, already proposed an approximate, index-based variant of t-SNE, Barnes-Hut t-SNE, which also uses only the kNN [Maa14].
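A combined sketch of both variants. Assumptions: a scikit-learn kNN search, the MLE ID estimator from before, and, for brevity, the same k for the ID estimate and the score, where the slides suggest a larger k'; none of this is the paper's reference code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def _row_affinities(d2, perplexity, max_iter=50, tol=1e-5):
    """Calibrate beta = 1/(2 sigma^2) for one point so that the entropy of its
    kNN affinities matches log(perplexity) (same binary search as before)."""
    lo, hi, beta, target = 0.0, np.inf, 1.0, np.log(perplexity)
    p = np.full_like(d2, 1.0 / len(d2))
    for _ in range(max_iter):
        w = np.exp(-d2 * beta)
        p = w / max(w.sum(), 1e-12)
        entropy = -np.sum(p * np.log(np.maximum(p, 1e-12)))
        if abs(entropy - target) < tol:
            break
        if entropy > target:
            lo, beta = beta, beta * 2.0 if np.isinf(hi) else (beta + hi) / 2.0
        else:
            hi, beta = beta, (lo + beta) / 2.0
    return p

def knnsos_scores(X, k=30, perplexity=10.0, t=None):
    """KNNSOS sketch; setting t (e.g. t=2) applies the per-point power transform
    to the distances first, which gives the ISOS variant."""
    n = len(X)
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]                     # drop the query point itself
    if t is not None:                                       # ISOS: reduce intrinsic dim. to t
        for i in range(n):
            m = -1.0 / np.mean(np.log(dist[i, :-1] / dist[i, -1])) / t   # MLE ID / t
            dist[i] = dist[i, k - 1] * dist[i] ** m         # c = k-distance, d' = c * d^m
    P = np.vstack([_row_affinities(d ** 2, perplexity) for d in dist])   # p_{j|i} on the kNN
    scores = np.ones(n)
    for j in range(n):
        for i in idx[j]:                                    # i in kNN(x_j)
            hit = np.flatnonzero(idx[i] == j)               # p_{j|i} = 0 unless j in kNN(x_i)
            if hit.size:
                scores[j] *= 1.0 - P[i, hit[0]]
    return scores
```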


SLIDE 24

Experimental Results: Outlier Detection

[Figure: Adjusted AP vs. neighborhood size (1 to 200) on ALOI_withoutdupl_norm, comparing ISOS, kNNSOS, ODIN, kNN, kNNW, LOF, SimplifiedLOF, LoOP, INFLO, and KDEOS.]

SLIDE 25

Experimental Results: Outlier Detection

[Figure: Adjusted AP vs. neighborhood size (1 to 200) on Annthyroid_withoutdupl_norm_02_v01, comparing ISOS, kNNSOS, ODIN, kNN, kNNW, LOF, SimplifiedLOF, LoOP, INFLO, and KDEOS.]

SLIDE 26

Experimental Results: Outlier Detection

[Figure: Adjusted AP vs. neighborhood size (1 to 200) on Pima_withoutdupl_norm_02_v01, comparing ISOS, kNNSOS, ODIN, kNN, kNNW, LOF, SimplifiedLOF, LoOP, INFLO, and KDEOS.]

SLIDE 27

Experimental Results: Outliers in MNIST

[Figure: the top MNIST outliers as digit images, each annotated with its class label and outlier score, e.g., 0 (33.4%), 5 (33.1%), 8 (32.7%), ...]

SLIDE 28

Conclusions

◮ We can "reduce" intrinsic dimensionality to ID = t using $m = \mathrm{ID}_{F_X}(x)/t$. But is this more than a cure for a symptom (for our estimate)?

◮ t-SNE benefits from this adjustment: we get more difference in the neighbor weights. (We can also use SNE, but we did not experiment with this.)

◮ t-SNE tends to hide outliers, unless we use $p_{ij} = p_{i|j} \cdot p_{j|i}$ instead of $p_{ij} = \frac{1}{2}(p_{i|j} + p_{j|i})$.

◮ We can make SOS outlier detection faster using the kNN only.

◮ ISOS improves SOS by adjusting for ID.

SLIDE 29

Thank You! Questions?

SLIDE 30

Thank You! Questions?

How do we fix ID?


SLIDE 31

References i

[Bey+99] K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. "When Is "Nearest Neighbor" Meaningful?" In: Int. Conf. Database Theory, ICDT. 1999.

[Cam+16] G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent, and M. E. Houle. "On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study". In: Data Min. Knowl. Discov. 30.4 (2016).

[HKF04] V. Hautamäki, I. Kärkkäinen, and P. Fränti. "Outlier Detection Using k-Nearest Neighbour Graph". In: Int. Conf. Pattern Recognition, ICPR. 2004.

[Hou15] M. E. Houle. Inlierness, outlierness, hubness and discriminability: an extreme-value-theoretic foundation. Tech. rep. NII-2015-002E. National Institute of Informatics, Tokyo, Japan, 2015.

[HR02] G. E. Hinton and S. T. Roweis. "Stochastic Neighbor Embedding". In: Adv. in Neural Information Processing Systems 15, NIPS. 2002.

[JPH13] J. H. M. Janssens, E. O. Postma, and H. J. van den Herik. "Density-Based Anomaly Detection in the Maritime Domain". In: Situation Awareness with Systems of Systems. 2013.

SLIDE 32

References ii

[Maa14] L. van der Maaten. "Accelerating t-SNE using tree-based algorithms". In: J. Machine Learning Research 15.1 (2014).

[MH08] L. van der Maaten and G. Hinton. "Visualizing Data using t-SNE". In: J. Machine Learning Research 9.11 (2008).