DIMENSIONALITY REDUCTION
MATTHIEU BLOCH April 21, 2020
MULTIDIMENSIONAL SCALING
- There are situations for which Euclidean distance is not appropriate
- Suppose we have access to a dissimilarity matrix D in R^{n x n} and some distance function d(., .)
- A dissimilarity matrix satisfies D_ii = 0, D_ij >= 0, and D_ij = D_ji; the triangle inequality is not required
- Multidimensional scaling (MDS): find a dimension d and points z_1, ..., z_n in R^d such that d(z_i, z_j) ~ D_ij
- In general, a perfect embedding into the desired dimension will not exist
- Many variants of MDS, based on the choice of d(., .), whether D is completely known, and the loss function
- Two types of MDS:
  - Metric MDS: try to ensure that d(z_i, z_j) ~ D_ij
  - Non-metric MDS: try to ensure only that the ordering of the d(z_i, z_j) matches the ordering of the D_ij
- Assume D is completely known (no missing entries) and d(., .) is the Euclidean distance
- Algorithm to create the embedding:
  1. Form B = -(1/2) H D^(2) H, where H = I - (1/n) 1 1^T is the centering matrix and D^(2)_ij = D_ij^2
  2. Compute the eigendecomposition B = U Lambda U^T
  3. Return Z = U_d Lambda_d^(1/2), where U_d consists of the first d columns of U and Lambda_d consists of the first d rows and columns of Lambda
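As a sketch, the three steps above fit in a few lines of NumPy (the function name and the random sanity check are my own, not from the slides):

```python
import numpy as np

def classical_mds(D, d):
    """Classical MDS: embed n points in R^d from an n x n matrix D
    of pairwise Euclidean distances (not squared)."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * H @ (D ** 2) @ H                  # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:d]            # keep the top-d eigenpairs
    L = np.sqrt(np.maximum(evals[idx], 0.0))     # clip tiny negative eigenvalues
    return evecs[:, idx] * L                     # Z = U_d Lambda_d^{1/2}

# Sanity check: for truly Euclidean data, the embedding reproduces D exactly
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Z = classical_mds(D, 5)
DZ = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
print(np.allclose(D, DZ))  # True
```

Note that Z is only recovered up to rotation and reflection, which is fine since distances are invariant to both.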
- Where is this coming from?
- Theorem (Eckart-Young). The above algorithm returns the best rank-d approximation of B, in the sense that ZZ^T minimizes both ||B - A||_F and ||B - A||_2 over all matrices A of rank at most d.
- Suppose we have x_1, ..., x_n in R^p such that sum_i x_i = 0 (centered data); set X in R^{n x p} with rows x_i^T
- PCA computes an eigendecomposition of the p x p matrix X^T X
  - Equivalent to computing the SVD X = U Sigma V^T
  - New representation computed as Z = X V_d = U_d Sigma_d, where V_d consists of the first d columns of V
- MDS computes an eigendecomposition of the n x n Gram matrix B = X X^T = U Sigma^2 U^T
  - Returns Z = U_d Lambda_d^(1/2) = U_d Sigma_d: the same embedding
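The equivalence of the two routes can be checked numerically; this is a sketch on synthetic data of my own (embeddings agree up to a sign flip per coordinate, since eigenvectors are only defined up to sign):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))
X = X - X.mean(axis=0)                 # center: sum_i x_i = 0
d = 3

# PCA route: SVD of X, project onto the top-d right singular vectors
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z_pca = X @ Vt[:d].T                   # = U_d Sigma_d

# MDS route: eigendecomposition of the Gram matrix B = X X^T
B = X @ X.T
evals, evecs = np.linalg.eigh(B)
idx = np.argsort(evals)[::-1][:d]
Z_mds = evecs[:, idx] * np.sqrt(evals[idx])   # = U_d Lambda_d^{1/2}

# Align the per-coordinate signs, then compare
signs = np.sign(np.sum(Z_pca * Z_mds, axis=0))
print(np.allclose(Z_pca, Z_mds * signs))  # True
```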
- Subtle difference between PCA and MDS
- PCA gives us access to V and Sigma: we can extract features z = V_d^T x and reconstruct approximations
- MDS only gives us the embedding Z: we would need to recover the analogue of V_d
- How can we extract features in MDS and compute the embedding of a new point?
- Important to be able to add new points
- Suppose we have embedded x_1, ..., x_n and want to add a new point x to our embedding
- Define b in R^n with b_i = x_i^T x (inner products with the new point)
- Then z = Lambda_d^(-1/2) U_d^T b, where U_d consists of the first d columns of U from the SVD X = U Sigma V^T
- Classical MDS minimizes the loss function ||B - Z Z^T||_F^2; many other choices exist
- A common choice is the stress function sum_{i<j} w_ij (D_ij - ||z_i - z_j||)^2
  - The w_ij are fixed weights; setting w_ij = 0 handles missing data, and large w_ij penalizes error on nearby points
- Nonlinear embeddings
  - High-dimensional data sets can have nonlinear structure that is not captured via linear methods
  - Kernelize PCA and MDS with a nonlinear kernel k(., .) and feature map Phi: use PCA on Phi(x_1), ..., Phi(x_n)
- Input: dataset {x_i}_{i=1}^n, kernel k(., .), dimension d
- Kernel PCA:
  1. Compute the eigendecomposition H K H = U Lambda U^T, where K_ij = k(x_i, x_j) is the kernel matrix and H = I - (1/n) 1 1^T is the centering matrix
  2. Return the embedding Z = U_d Lambda_d^(1/2), where U_d consists of the first d columns of U
- The projection of a transformed point is computed with kernel evaluations only: the embedding of x is computed as z = Lambda_d^(-1/2) U_d^T kappa(x), with kappa(x)_i = k(x_i, x)
- No computation in the large-dimensional Hilbert space!
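A minimal kernel PCA sketch on the training points; the Gaussian (RBF) kernel, the function name, and the toy data are my own choices for illustration:

```python
import numpy as np

def kernel_pca(X, d, gamma=1.0):
    """Kernel PCA sketch with a Gaussian (RBF) kernel k(x, y) = exp(-gamma ||x - y||^2).
    All work happens on the n x n kernel matrix, never in feature space."""
    n = X.shape[0]
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)                       # kernel matrix K_ij = k(x_i, x_j)
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    Kc = H @ K @ H                                # centers the features implicitly
    evals, evecs = np.linalg.eigh(Kc)
    idx = np.argsort(evals)[::-1][:d]             # top-d eigenpairs
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0.0))

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
Z = kernel_pca(X, d=2)
print(Z.shape)  # (30, 2)
```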
- Isomap can be viewed as an extension of MDS
- Assumes that the data lies in a low-dimensional manifold (looks Euclidean in small neighborhoods)
- Given a dataset {x_i}_{i=1}^n, try to compute an estimate Delta_ij of the geodesic distance along the manifold
Swiss roll manifold
How do we estimate the geodesic distance?
- Compute shortest paths using a proximity graph
- Form a matrix W as follows:
  - For each x_i, define a local neighborhood N(x_i) (e.g., the k nearest neighbors, or all x_j such that ||x_i - x_j|| <= epsilon)
  - For x_j in N(x_i), set W_ij = ||x_i - x_j||; W is a weighted adjacency matrix of the graph
- Compute Delta by setting Delta_ij to the length of the shortest path from node i to node j in the graph described by W
- Can then compute the embedding similarly to MDS, using Delta in place of D
- Challenge: Isomap can become inaccurate for points far apart
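The pipeline above (kNN graph, then shortest paths, then classical MDS) can be sketched as follows; the function name and the circle example are my own, and `scipy.sparse.csgraph.shortest_path` treats infinite entries of a dense matrix as missing edges:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap_embedding(X, n_neighbors, d):
    """Isomap sketch: kNN graph -> graph shortest paths -> classical MDS."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    W = np.full((n, n), np.inf)                         # inf = no edge
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:n_neighbors + 1]   # skip the point itself
        W[i, nbrs] = dist[i, nbrs]
    # Geodesic distance estimate: shortest paths in the symmetrized graph
    Delta = shortest_path(np.minimum(W, W.T), directed=False)
    # Classical MDS on the geodesic distances
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (Delta ** 2) @ H
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:d]
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0.0))

# Points on a circle: geodesic distances follow the curve, not chords
t = np.linspace(0, 2 * np.pi, 40, endpoint=False)
X = np.c_[np.cos(t), np.sin(t)]
Z = isomap_embedding(X, n_neighbors=2, d=2)
print(Z.shape)  # (40, 2)
```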
- Idea: a data manifold that is globally nonlinear still appears linear in local pieces
- Don't try to explicitly model global geodesic distances
- Try to preserve structure in the data by patching together local pieces of the manifold
- LLE algorithm for dataset {x_i}_{i=1}^n:
  1. For each x_i, define a local neighborhood N(x_i)
  2. Find reconstruction weights W by solving
     min_W sum_i ||x_i - sum_{j in N(x_i)} W_ij x_j||^2 subject to sum_j W_ij = 1
- Eigenvalue problem in compact form: min_Z tr(Z^T M Z) with M = (I - W)^T (I - W), subject to Z^T Z = I
- Same problem encountered in PCA! Use the eigendecomposition of M, keeping the eigenvectors with the smallest (nonzero) eigenvalues
- Can compute the embedding of a new point x as z = sum_{j in N(x)} w_j z_j, with the weights w computed from the same constrained least squares problem
- Demo notebook
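Both LLE stages (local weights, then the bottom eigenvectors of M) can be sketched in NumPy; the regularization of the local Gram matrix, the function name, and the random data are my own choices:

```python
import numpy as np

def lle_embedding(X, n_neighbors, d, reg=1e-3):
    """LLE sketch: local reconstruction weights, then bottom eigenvectors of M."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:n_neighbors + 1]    # skip the point itself
        G = (X[nbrs] - X[i]) @ (X[nbrs] - X[i]).T        # local Gram matrix
        G += reg * np.trace(G) * np.eye(n_neighbors)     # regularize for stability
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, nbrs] = w / w.sum()                         # enforce sum_j W_ij = 1
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    evals, evecs = np.linalg.eigh(M)                     # ascending eigenvalues
    return evecs[:, 1:d + 1]       # skip the constant eigenvector (eigenvalue 0)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 3))
Z = lle_embedding(X, n_neighbors=8, d=2)
print(Z.shape)  # (60, 2)
```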
Density estimation problem: given samples x_1, ..., x_n drawn from an unknown density f, estimate f
Applications: classification, clustering, anomaly detection, etc.
- General form of the kernel density estimate: f_hat(x) = (1/n) sum_{i=1}^n (1/h^d) k((x - x_i)/h)
  - k is called a kernel; the estimate is non-parametric, also known as the Parzen window method
  - h is the bandwidth
  - Looks like a ridge regression kernel, but with equal weights, and k need not be an inner product kernel
- A kernel should satisfy k(x) >= 0 and integral k(x) dx = 1 (and typically symmetry, k(x) = k(-x))
- Plenty of standard kernels: rectangular, Gaussian, etc.
- Demo: kernel density estimation
- How do we choose h?
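A minimal 1-D Parzen window sketch with a Gaussian kernel; the function name, grid, and bandwidth value are my own illustration choices:

```python
import numpy as np

def kde(x_eval, samples, h):
    """1-D kernel density estimate with a Gaussian kernel and bandwidth h."""
    u = (x_eval[:, None] - samples[None, :]) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return k.mean(axis=1) / h                        # (1/n) sum (1/h) k((x - x_i)/h)

rng = np.random.default_rng(4)
samples = rng.normal(size=500)            # draws from an unknown density (here N(0,1))
grid = np.linspace(-4, 4, 81)
f_hat = kde(grid, samples, h=0.3)
# The estimate is a valid density: nonnegative, and its mass is ~1 on the grid
print(abs(f_hat.sum() * (grid[1] - grid[0]) - 1.0) < 0.05)  # True
```

Varying `h` shows the usual trade-off: small `h` gives a spiky, high-variance estimate, large `h` oversmooths.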
- Let f_hat_n be a kernel density estimate based on kernel k; suppose we scale h with n so that h_n -> 0 and n h_n^d -> infinity; then f_hat_n converges to f
- Seems like a very powerful result: kernel density estimation always works given enough points
- In practice, choose h = 1.06 sigma_hat n^(-1/5) for d = 1 with a Gaussian kernel (Silverman's rule of thumb)
- Can also use model selection techniques (split the dataset into training and testing)
- Ugly truth: kernel density estimation only works well with a lot of points in low dimension
- Clustering problem: given samples x_1, ..., x_n, assign points to disjoint subsets called clusters, so that points in the same cluster are more similar to each other than to points in different clusters
- Clustering is a map c: {1, ..., n} -> {1, ..., K}, with K the number of clusters; how do we choose c?
- Natural approach: choose c minimizing the within-cluster scatter W(c) = (1/2) sum_k sum_{c(i)=k} sum_{c(j)=k} ||x_i - x_j||^2
- Lemma. The number of possible clusterings of n points into K clusters is given by the Stirling number of the second kind S(n, K)
- No known efficient search strategy for this space: the exact solution by complete enumeration has intractable complexity
- We want to find c* = argmin_c sum_k sum_{i: c(i)=k} ||x_i - x_bar_k||^2, where x_bar_k is the mean of cluster k
- Lemma. For any set S of points, the mean x_bar_S = (1/|S|) sum_{i in S} x_i satisfies x_bar_S = argmin_mu sum_{i in S} ||x_i - mu||^2
- Hence we can solve the enlarged optimization problem min_{c, {mu_k}} sum_k sum_{i: c(i)=k} ||x_i - mu_k||^2
- To find the k-means solution, alternate between two subproblems:
  1. For fixed means {mu_k}, choose c to minimize sum_i ||x_i - mu_c(i)||^2
  2. For fixed assignment c, choose {mu_k} to minimize sum_k sum_{i: c(i)=k} ||x_i - mu_k||^2
- The solution to subproblem 1 is c(i) = argmin_k ||x_i - mu_k||^2 (assign each point to its nearest mean)
- The solution to subproblem 2 is mu_k = x_bar_k (each mean is its cluster's average)
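The alternating scheme above can be sketched as follows; the initialization, iteration count, and the two-blob toy data are my own choices:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """k-means sketch: alternate nearest-mean assignment and mean updates."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # init at random data points
    for _ in range(n_iter):
        # Subproblem 1: for fixed means, assign each point to its nearest mean
        c = np.argmin(np.linalg.norm(X[:, None] - mu[None, :], axis=-1), axis=1)
        # Subproblem 2: for fixed assignments, each mean is the cluster average
        for j in range(k):
            if np.any(c == j):
                mu[j] = X[c == j].mean(axis=0)
    return c, mu

# Two well-separated blobs: each should end up with a single label
rng = np.random.default_rng(5)
X = np.r_[rng.normal(0, 0.3, size=(50, 2)), rng.normal(4, 0.3, size=(50, 2))]
c, mu = kmeans(X, k=2)
print(len(set(c[:50].tolist())), len(set(c[50:].tolist())))  # 1 1
```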
- Algorithmic notes:
  - Algorithm typically initialized with the means chosen as random points in the dataset
  - Use several random initializations to avoid bad local minima
  - Cluster boundaries are parts of hyperplanes; regions are intersections of halfspaces and hence convex
  - k-means fails if clusters are non-convex
  - Geometry changes if we change the norm
- Extend the idea behind k-means: clusters are elliptical
  - Cluster k can be modeled using a multivariate Gaussian density N(x; mu_k, Sigma_k), with mean mu_k and covariance Sigma_k
  - The full data set is modeled using a Gaussian mixture model (GMM) f(x) = sum_{k=1}^K pi_k N(x; mu_k, Sigma_k), where pi_k >= 0 and sum_k pi_k = 1
  - Cluster estimation by performing MLE on the GMM
- Interpretation of GMM:
  - State variable t with P(t = k) = pi_k, such that every realization x_i comes with a hidden realization t_i of the state variable
  - The challenge is to perform clustering without observing the hidden states
- Example (easy): MLE of a single multivariate Gaussian
- Example (hard): MLE of a mixture of multivariate Gaussians with incomplete data (hidden states unobserved)
- Example (ideal): MLE of a mixture of multivariate Gaussians with complete data
- Efficient algorithm to address incomplete data
- Key idea is to work with the MLE for complete data and average out the unobserved hidden states
- EM algorithm: iterate
  - E step: evaluate Q(theta; theta^(t)) = E_{t | x; theta^(t)}[log p(x, t; theta)], where theta collects the GMM parameters (pi, mu, Sigma)
  - M step (maximization step): set theta^(t+1) = argmax_theta Q(theta; theta^(t))
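A minimal 1-D sketch of EM for a GMM, where the E step reduces to computing responsibilities and the M step to weighted MLE updates; the quantile-based initialization and the two-component toy data are my own choices:

```python
import numpy as np

def em_gmm(X, k, n_iter=100):
    """EM for a 1-D Gaussian mixture: E step = posterior state probabilities
    (responsibilities), M step = weighted maximum-likelihood re-fits."""
    pi = np.full(k, 1.0 / k)
    mu = np.quantile(X, (np.arange(k) + 0.5) / k)   # spread-out initial means
    var = np.full(k, X.var())
    for _ in range(n_iter):
        # E step: r_ik = P(hidden state t_i = k | x_i; current parameters)
        dens = np.exp(-0.5 * (X[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M step: maximize the expected complete-data log-likelihood
        nk = r.sum(axis=0)
        pi = nk / len(X)
        mu = (r * X[:, None]).sum(axis=0) / nk
        var = (r * (X[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(6)
X = np.r_[rng.normal(-3, 0.5, 300), rng.normal(3, 0.5, 300)]
pi, mu, var = em_gmm(X, k=2)
print(np.sort(mu))  # close to [-3, 3]
```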