Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection
A Remedy Against the Curse of Dimensionality?
Erich Schubert, Michael Gertz
October 4, 2017, Munich, Germany
Heidelberg University
t-Stochastic Neighbor Embedding
t-SNE [MH08], based on SNE [HR02], is a popular “neural network” visualization technique using stochastic gradient descent (SGD).
[Figure: data in 10-dimensional space and its 2-dimensional embedding.]
Tries to preserve the neighbors, but not the distances.
t-Stochastic Neighbor Embedding
SNE [HR02] and t-SNE [MH08] are popular “neural network” visualization techniques using stochastic gradient descent (SGD).
[Figure: the same data embedded with SNE and with t-SNE.]
SNE/t-SNE do not preserve density or distances, and can get stuck in a local optimum!
t-Stochastic Neighbor Embedding
SNE and t-SNE use a Gaussian kernel in the input domain:

p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²)

where each σ_i² is optimized to obtain the desired perplexity (perplexity ≈ number of neighbors to preserve).
This is asymmetric, so they simply use p_ij := (p_{i|j} + p_{j|i}) / 2.
(We suggest to prefer p_ij = p_{i|j} · p_{j|i} for outlier detection.)
In the output domain, as q_ij, SNE uses a Gaussian (with constant σ); t-SNE uses a Student-t distribution.
Kullback-Leibler divergence can be minimized using stochastic gradient descent to
make input and output affinities similar.
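To make this concrete, here is a minimal Python/NumPy sketch (illustrative names, not the authors’ implementation) of how each σ_i can be calibrated by bisection on the precision β_i = 1/(2σ_i²) until the row entropy matches log(perplexity), together with the two symmetrization variants; the (lack of) normalization in the product variant is an assumption.

```python
import numpy as np

def conditional_affinities(D, perplexity=30.0, tol=1e-5, max_iter=50):
    """Compute p_{j|i} from a squared-distance matrix D: each beta_i = 1/(2 sigma_i^2)
    is tuned by bisection until the entropy of row i matches log(perplexity)."""
    n = D.shape[0]
    P = np.zeros((n, n))
    target = np.log(perplexity)
    for i in range(n):
        idx = np.concatenate((np.arange(i), np.arange(i + 1, n)))  # all j != i
        beta, beta_lo, beta_hi = 1.0, 0.0, np.inf
        for _ in range(max_iter):
            w = np.exp(-beta * D[i, idx])
            sum_w = w.sum()
            H = np.log(sum_w) + beta * np.sum(D[i, idx] * w) / sum_w  # Shannon entropy
            p = w / sum_w
            if abs(H - target) < tol:
                break
            if H > target:   # kernel too wide -> increase beta (shrink sigma)
                beta_lo = beta
                beta = beta * 2.0 if np.isinf(beta_hi) else (beta + beta_hi) / 2.0
            else:            # kernel too narrow -> decrease beta
                beta_hi = beta
                beta = (beta + beta_lo) / 2.0
        P[i, idx] = p
    return P

def joint_affinities(P, mode="mean"):
    """Symmetrize: t-SNE's default mean, or the product variant suggested above
    for outlier detection (shown unnormalized)."""
    if mode == "mean":
        return (P + P.T) / (2.0 * P.shape[0])
    return P * P.T
```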
SNE vs. t-SNE
Gaussian weights in the output domain as used by SNE vs. t-SNE:
[Figure: Gaussian weight (σ²=1) vs. Student-t weight (t=1) as a function of distance.]
t-SNE puts more emphasis on separating points:
even neighbors will be “fanned out” a bit, and far points are “better” separated (SNE puts ≈0 weight on far points).
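A throwaway snippet comparing the two output weight functions from the plot above (Gaussian with σ²=1 vs. Student-t with one degree of freedom); the distance values are arbitrary.

```python
import numpy as np

d = np.array([0.5, 1.0, 2.0, 5.0, 10.0])   # some output-space distances
gaussian = np.exp(-d ** 2 / 2.0)            # SNE-style weight, sigma^2 = 1
student_t = 1.0 / (1.0 + d ** 2)            # t-SNE weight, 1 degree of freedom
print(gaussian)    # drops to ~0 already at d=5: far points get no gradient
print(student_t)   # heavy tail: far points still repel each other noticeably
```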
The Curse of Dimensionality
Loss of “discrimination” of distances [Bey+99]:

lim_{dim→∞} E[ (max_{y≠x} d(x,y) − min_{y≠x} d(x,y)) / min_{y≠x} d(x,y) ] → 0
Distances to near points and to far points become similar.
The Gaussian kernel uses relative distances: exp(−‖x_i − x_j‖² / 2σ_i²)
[Figure: distance vs. expected distance.]
With high-dimensional data, all p_ij become similar!
We cannot find a “good” σ_i anymore.
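A quick simulation (uniform random data, Euclidean distance; sample size and seed are arbitrary) that illustrates this vanishing relative contrast:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
for dim in (3, 10, 100, 1000):
    X = rng.random((n, dim))                   # uniform data in [0,1]^dim
    q = rng.random(dim)                        # one query point
    d = np.linalg.norm(X - q, axis=1)
    contrast = (d.max() - d.min()) / d.min()   # relative contrast of [Bey+99]
    print(f"dim={dim:5d}  relative contrast={contrast:.2f}")
# The printed contrast typically shrinks as dim grows: the nearest and the
# farthest neighbor become almost equally far away.
```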
Distribution of Distances
On the short tail, distance distributions often look like this:
[Figure: neighbor density over distance for 3-, 10-, and 50-dimensional data.]
In high-dimensional data, almost all nearest neighbors concentrate on the right hand side of this plot.
Distribution of Distances
Gaussian weights as used by SNE / t-SNE:
[Figure: Gaussian weights over distance for σ²=0.3, σ²=1, and σ²=2.]
For low-dimensional data, Gaussian weights work well. For high-dimensional data, they assign almost the same weight to all points.
Distribution of Distances
Gaussian kernels adjusted for intrinsic dimensionality:
[Figure: ID-adjusted Gaussian weights (σ²=1) for intrinsic dimensionality 3, 10, and 50.]
In theory, they then behave like Gaussian kernels on low-dimensional data.
Distance Power Transform
Let X be a random variable (“of distances”) as in [Hou15].
For constants c and m, use the transformation Y = g(X) with g(x) := c · x^m.
Let F_X, F_Y be the cumulative distribution functions of X and Y.
Then ID_{F_X}(x) = m · ID_{F_Y}(c · x^m) [Hou15, Table 1].
By choosing m = ID_{F_X}(x) / t for any t > 0, one therefore obtains
ID_{F_Y}(c · x^m) = ID_{F_X}(x) / m = t,
where one can choose c > 0 as desired, e.g., for numerical reasons.
We can transform distances to any desired ID = t!
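As an illustration, the following sketch estimates the local intrinsic dimensionality with a maximum-likelihood (Hill-type) estimator on the kNN distances of a point and then applies the power transform; the estimator choice, c, and the function names are assumptions, not necessarily what the paper uses.

```python
import numpy as np

def lid_mle(dists):
    """Maximum-likelihood (Hill-type) estimate of local intrinsic dimensionality
    from a point's distances to its k nearest neighbors."""
    d = np.sort(np.asarray(dists, dtype=float))
    w = d[-1]                                    # distance to the k-th neighbor
    return -1.0 / np.mean(np.log(d[:-1] / w))    # last term omitted (log 1 = 0)

def power_transform(dists, target_id=2.0, c=1.0):
    """Rescale neighbor distances so that their local ID becomes ~ target_id:
    d' = c * d^m with m = ID / t, as derived above."""
    m = lid_mle(dists) / target_id
    return c * np.asarray(dists, dtype=float) ** m
```

In it-SNE, such a per-point transform would roughly be applied to the kNN distances before computing the Gaussian affinities.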
Distance Power Transform
For each point p:
d′(p, q) := c · d(p, q)^m
Can we defeat the curse this easily?
Probably not: this is a hack to cure one symptom. Question: is our definition of ID too permissive?
Experimental Results: it-SNE
Projections of the ALOI outlier data set (as available at [Cam+16]):
[Figure: PCA, t-SNE, and it-SNE projections side by side.]
Data set: color histograms of 50,000 photos of 1,000 objects.
Each class: the same object, under different angles and different lighting.
Labeled outliers: classes reduced to 1–3 objects (the data may contain other “true” outliers!)
[Figure: projection of the ALOI outlier data set with t-SNE.]
[Figure: projection of the ALOI outlier data set with it-SNE; labeled and unlabeled outliers become visible.]
Experimental Results: it-SNE
On the well-known MNIST data set:
[Figure: t-SNE projection of MNIST.]
[Figure: it-SNE projection of MNIST; outliers become visible.]
Outlier Detection: ODIN
ODIN (Outlier Detection using Indegree Number) [HKF04]:
ODIN(x) := in-degree of x in the k nearest neighbor graph
Works, but many objects will have the exact same score. Which k to use? Can change abruptly with k. Can we make a continuous (“smooth”) version of this idea?
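A brute-force sketch of this score (illustrative only; an O(n²) distance matrix instead of an index):

```python
import numpy as np

def odin_scores(X, k=10):
    """ODIN sketch: score = in-degree in the kNN graph (low in-degree = outlier)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                   # a point is not its own neighbor
    knn = np.argsort(D, axis=1)[:, :k]            # k nearest neighbors of every point
    return np.bincount(knn.ravel(), minlength=n)  # ties are common, as noted above
```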
Outlier Detection: SOS
SOS (Stochastic Outlier Selection) [JPH13]
Idea: assume every object links to one neighbor at random.
Inliers are likely to be linked to; outliers are likely not to be linked to.
p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²)
i.e., use Gaussian weights to prefer near neighbors.
SOS(x_j) := ∏_{i≠j} (1 − p_{j|i}) = probability that no neighbor links to object j.
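Reusing the conditional_affinities sketch from the t-SNE slide above, the SOS score can be sketched like this (an illustration, not the reference implementation):

```python
import numpy as np

def sos_scores(X, perplexity=30.0):
    """SOS sketch: probability that no other point picks x_j as its random neighbor."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sum(diff ** 2, axis=-1)                 # squared Euclidean distances
    P = conditional_affinities(D, perplexity)      # p_{j|i}, from the earlier sketch
    return np.prod(1.0 - P, axis=0)                # product over all potential linkers i
```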
KNNSOS and ISOS Outlier Detection
We propose two variants of this idea:
KNNSOS: restrict SOS to the k nearest neighbors only, i.e., KNNSOS(x_j) := ∏_{i: x_j ∈ kNN(x_i)} (1 − p_{j|i}).
This reduces the runtime from O(n²) to possibly O(n log n) or O(n^{4/3}).
ISOS: Intrinsic-dimensionality Stochastic Outlier Selection, which additionally adjusts the distances for the intrinsic dimensionality.
Note: the t-SNE author, van der Maaten, already proposed an approximate and index-based variant of t-SNE, Barnes-Hut t-SNE, which also uses the kNN only [Maa14].
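Sketching how the two variants could be combined, using lid_mle from the power-transform sketch above; the bandwidth heuristic and the parameter defaults are assumptions (the actual method calibrates a perplexity as in SOS):

```python
import numpy as np

def isos_scores(X, k=30, target_id=2.0):
    """KNNSOS/ISOS sketch: SOS restricted to the kNN, on ID-adjusted distances."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    P = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(D[i])[:k]                  # k nearest neighbors of i
        m = lid_mle(D[i, nn]) / target_id          # exponent of the power transform
        d = (D[i, nn] / np.median(D[i, nn])) ** m  # d' = c * d^m, with c = 1/median^m
        w = np.exp(-d ** 2 / 2.0)                  # assumed unit-bandwidth Gaussian
        P[i, nn] = w / w.sum()                     # p_{j|i}, zero outside the kNN
    return np.prod(1.0 - P, axis=0)                # prob. that nobody links to x_j
```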
Experimental Results: Outlier Detection
[Figures (three result plots): Adjusted AP vs. neighborhood size (1–200) for kNNSOS, ODIN, kNN, kNNW, LOF, SimplifiedLOF, LoOP, INFLO, and KDEOS.]
Experimental Results: Outliers in MNIST
[Figure: the 50 top-ranked MNIST outliers shown as digit images, labeled with their class and outlier score, from 0 (33.4%) down to 4 (28.6%).]
Conclusions
◮ We can “reduce” intrinsic dimensionality to ID = t using m = ID_{F_X}(x)/t.
  But is this more than a cure for a symptom (for our estimate)?
◮ t-SNE benefits from this adjustment: we get more difference in the neighbor weights.
  (We can also use SNE, but we did not experiment with this.)
◮ t-SNE tends to hide outliers, unless we use p_ij = p_{i|j} · p_{j|i} instead of p_ij = ½ (p_{i|j} + p_{j|i}).
◮ We can make SOS outlier detection faster using the kNN only.
◮ ISOS improves SOS by adjusting for ID.
References

[Bey+99] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft. “When Is ‘Nearest Neighbor’ Meaningful?” In: Int. Conf. on Database Theory, ICDT. 1999.
[Cam+16] G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent, M. E. Houle. “On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study”. In: Data Min. Knowl. Discov. 30.4 (2016).
[HKF04] V. Hautamäki, I. Kärkkäinen, P. Fränti. “Outlier Detection Using k-Nearest Neighbour Graph”. In: Int. Conf. on Pattern Recognition, ICPR. 2004.
[Hou15] M. E. Houle. 2015.
[HR02] G. Hinton, S. Roweis. “Stochastic Neighbor Embedding”. In: Advances in Neural Information Processing Systems 15, NIPS. 2002.
[JPH13] J. Janssens, E. Postma, J. van den Herik. In: Situation Awareness with Systems of Systems. 2013.
[Maa14] L. van der Maaten. “Accelerating t-SNE using Tree-Based Algorithms”. In: Journal of Machine Learning Research 15 (2014).
[MH08] L. van der Maaten, G. Hinton. “Visualizing Data using t-SNE”. In: Journal of Machine Learning Research 9 (2008).