SLIDE 1

DIMENSIONALITY REDUCTION

MATTHIEU BLOCH April 21, 2020

1 / 26

SLIDE 2

MULTIDIMENSIONAL SCALING

There are situations for which Euclidean distance is not appropriate. Suppose we have access to a dissimilarity matrix $\Delta = [\delta_{ij}]$ and some distance function $d(\cdot,\cdot)$.

  • A dissimilarity matrix satisfies $\delta_{ij} \geq 0$, $\delta_{ii} = 0$, and $\delta_{ij} = \delta_{ji}$; the triangle inequality is not required.
  • Multidimensional scaling (MDS): find a dimension $k$ and points $z_1, \dots, z_n \in \mathbb{R}^k$ such that $d(z_i, z_j) \approx \delta_{ij}$.
  • In general, a perfect embedding into the desired dimension will not exist.
  • Many variants of MDS exist, based on the choice of $d(\cdot,\cdot)$, whether $\Delta$ is completely known, and the loss function.
  • Two types of MDS:
  • Metric MDS: try to ensure that $d(z_i, z_j) \approx \delta_{ij}$.
  • Non-metric MDS: try to ensure only that the ordering is preserved, i.e., $d(z_i, z_j) \leq d(z_k, z_\ell)$ whenever $\delta_{ij} \leq \delta_{k\ell}$.

2 / 26

SLIDE 3

EUCLIDEAN EMBEDDINGS

Assume $\Delta$ is completely known (no missing entries) and $\delta_{ij} = \|x_i - x_j\|_2$. Algorithm to create the embedding:

  • Form $B = -\frac{1}{2} H \Delta^{(2)} H$, where $\Delta^{(2)} = [\delta_{ij}^2]$ and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$.
  • Compute the eigendecomposition $B = U \Lambda U^T$ (eigenvalues in decreasing order).
  • Return $Z = U_k \Lambda_k^{1/2}$, where $U_k$ consists of the first $k$ columns of $U$ and $\Lambda_k$ consists of the first $k$ rows and columns of $\Lambda$.

Where is this coming from?

  • Theorem (Eckart-Young). The above algorithm returns the best rank-$k$ approximation in the sense that $ZZ^T$ minimizes $\|B - M\|_F$ over all matrices $M$ of rank at most $k$, and hence $Z$ minimizes the strain $\sum_{i,j} (b_{ij} - z_i^T z_j)^2$.
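The steps above can be sketched in a few lines of numpy (a minimal sketch; the function name and the sanity check are mine, and the test data is drawn in 2-D so a rank-2 embedding is exact):

```python
import numpy as np

def classical_mds(Delta, k):
    """Embed points into R^k from a matrix of pairwise Euclidean distances."""
    n = Delta.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * H @ (Delta ** 2) @ H          # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)         # eigh returns ascending order
    idx = np.argsort(evals)[::-1][:k]        # keep the k largest eigenvalues
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))

# Sanity check: pairwise distances of the embedding match the input distances
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
Delta = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Z = classical_mds(Delta, k=2)
Delta_hat = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
err = np.abs(Delta - Delta_hat).max()
```

The embedding is only recovered up to rotation and reflection, which is why the check compares distances rather than coordinates.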

3 / 26

SLIDE 4

4 / 26

SLIDE 5

5 / 26

SLIDE 6

6 / 26

SLIDE 7

PCA VS MDS

Suppose we have data $X \in \mathbb{R}^{n \times d}$ (rows $x_i^T$) and $\Delta$ such that $\delta_{ij} = \|x_i - x_j\|_2$. Set $\tilde{X} = HX$ with $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ (centered data).

  • PCA computes an eigendecomposition of the $d \times d$ matrix $\tilde{X}^T \tilde{X} = W \Lambda W^T$.
  • Equivalent to computing the SVD $\tilde{X} = U \Sigma W^T$.
  • New representation computed as $Z = \tilde{X} W_k$, where $W_k$ consists of the first $k$ columns of $W$.
  • MDS computes an eigendecomposition of the $n \times n$ matrix $B = \tilde{X} \tilde{X}^T = U \Sigma^2 U^T$.
  • Return $Z = U_k \Sigma_k$; since $\tilde{X} W_k = U_k \Sigma_k$, the two embeddings coincide.
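The equivalence can be checked numerically; in this sketch (my own code) the PCA features $\tilde{X} W_k$ and the MDS features $U_k \Sigma_k$ agree up to column signs:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
Xt = X - X.mean(axis=0)               # centered data, Xt = H X
k = 2

# PCA route: eigenvectors of the d x d matrix Xt^T Xt
_, evecs = np.linalg.eigh(Xt.T @ Xt)
W_k = evecs[:, ::-1][:, :k]           # top-k eigenvectors
Z_pca = Xt @ W_k

# MDS route: SVD Xt = U S W^T, i.e. eigendecomposition of B = Xt Xt^T
U, S, _ = np.linalg.svd(Xt, full_matrices=False)
Z_mds = U[:, :k] * S[:k]

# Eigenvectors are only defined up to sign, so align signs before comparing
signs = np.sign((Z_pca * Z_mds).sum(axis=0))
diff = np.abs(Z_pca - Z_mds * signs).max()
```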

7 / 26

SLIDE 8

PCA VS MDS

Subtle difference between PCA and MDS:

  • PCA gives us access to $W_k$ and $Z$: we can extract features $z = W_k^T \tilde{x}$ for a new point and reconstruct approximations $\tilde{x} \approx W_k z$.
  • MDS only gives us $Z$. How can we extract features in MDS and compute an embedding for a new point? Important to be able to add new points.

  • Lemma. Assume we have access to the embedding $Z = U_k \Sigma_k$ and want to add a new point $x$ to our embedding. Define $b \in \mathbb{R}^n$ by $b_i = -\frac{1}{2}\left(\delta^2(x, x_i) - \frac{1}{n}\sum_{\ell} \delta^2(x, x_\ell) - \frac{1}{n}\sum_{\ell} \delta^2(x_\ell, x_i) + \frac{1}{n^2}\sum_{\ell, m} \delta^2(x_\ell, x_m)\right)$. Then $z = \Sigma_k^{-1} U_k^T b$, where $U_k$ consists of the first $k$ columns of $U$ from the SVD.
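The lemma can be checked numerically. The sketch below (my own code, using the double-centering form of $b$ stated above) re-embeds a training point as if it were new and recovers its MDS coordinates:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 3))
n, k = X.shape[0], 3
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances

# Classical MDS on the training set
H = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * H @ D2 @ H
U, S2, _ = np.linalg.svd(B)                          # B = U diag(S2) U^T
U_k, S_k = U[:, :k], np.sqrt(S2[:k])
Z = U_k * S_k

# Out-of-sample formula for a "new" point x (here: training point 0)
x = X[0]
d2 = ((X - x) ** 2).sum(-1)                          # squared distances to training set
b = -0.5 * (d2 - d2.mean() - D2.mean(axis=0) + D2.mean())
z_new = (U_k.T @ b) / S_k                            # z = Sigma_k^{-1} U_k^T b

err = np.abs(z_new - Z[0]).max()
```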

8 / 26

SLIDE 9

EXTENSIONS OF MDS

  • Classical MDS minimizes the loss function $\sum_{i,j} (b_{ij} - z_i^T z_j)^2$. Many other choices exist.
  • A common choice is the stress function $\sum_{i < j} w_{ij} (\delta_{ij} - \|z_i - z_j\|)^2$, where the $w_{ij} \geq 0$ are fixed weights.
  • Setting $w_{ij} = 0$ handles missing data; setting $w_{ij} = 1/\delta_{ij}$ penalizes error on nearby points.
  • Nonlinear embeddings: high-dimensional data sets can have nonlinear structure that is not captured via linear methods.
  • Kernelize PCA and MDS with a nonlinear feature map $\Phi$: use PCA on $\{\Phi(x_i)\}$ or MDS on $\delta_{ij} = \|\Phi(x_i) - \Phi(x_j)\|$.

9 / 26

SLIDE 10

COMPUTING KERNEL PCA

Dataset $\{x_i\}_{i=1}^n$, kernel $\kappa$, dimension $k$. Kernel PCA:

  • 1. Form $\tilde{K} = HKH$, where $K = [\kappa(x_i, x_j)]$ is the kernel matrix and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is the centering matrix.
  • 2. Compute the eigendecomposition $\tilde{K} = U \Lambda U^T$.
  • 3. Set $Z = U_k \Lambda_k^{1/2}$, formed from the first $k$ columns of $U$.

The projection of the transformed data is computed from kernel evaluations alone: no computation in the large-dimensional Hilbert space!
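The three steps fit in a short numpy function (a sketch; the Gaussian/RBF kernel choice and the names are mine, the slides leave the kernel generic):

```python
import numpy as np

def kernel_pca(X, k, gamma=1.0):
    """Kernel PCA with an RBF kernel kappa(x, y) = exp(-gamma ||x - y||^2)."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                   # kernel matrix
    H = np.eye(n) - np.ones((n, n)) / n
    Kt = H @ K @ H                            # centered kernel matrix
    evals, evecs = np.linalg.eigh(Kt)
    idx = np.argsort(evals)[::-1][:k]         # top-k eigenpairs
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 4))
Z = kernel_pca(X, k=2)
col_means = Z.mean(axis=0)   # centering makes the embedding zero-mean
```

Note that only the $n \times n$ matrix $\tilde{K}$ is ever formed, which is the whole point: the feature map is never evaluated.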

10 / 26

SLIDE 11

ISOMETRIC FEATURE MAPPING

  • Can be viewed as an extension of MDS.
  • Assumes that the data lies on a low-dimensional manifold (looks Euclidean in small neighborhoods).
  • Given a dataset $\{x_i\}_{i=1}^n$, try to compute an estimate of the geodesic distance along the manifold.

[Figure: Swiss roll manifold]

How do we estimate the geodesic distance?

11 / 26

SLIDE 12

ESTIMATING GEODESIC DISTANCE

Compute shortest paths in a proximity graph. Form a matrix $W$ as follows:

  • 1. For every $x_i$, define a local neighborhood $\mathcal{N}_i$ (e.g., the $k$ nearest neighbors, or all $x_j$ s.t. $\|x_i - x_j\| \leq \epsilon$).
  • 2. For each pair $(i, j)$, set $W_{ij} = \|x_i - x_j\|$ if $x_j \in \mathcal{N}_i$ and $W_{ij} = \infty$ otherwise; $W$ is a weighted adjacency matrix of the graph.

Compute $\Delta$ by setting $\delta_{ij}$ to the length of the shortest path from node $i$ to node $j$ in the graph described by $W$. Can then compute the embedding similarly to MDS. Challenge: isomap can become inaccurate for points that are far apart.
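The construction above can be sketched as a k-nearest-neighbor graph followed by Floyd-Warshall shortest paths (my own code; the spiral test data and parameter choices are illustrative, not from the slides):

```python
import numpy as np

def isomap_distances(X, n_neighbors=5):
    """Estimate geodesic distances via shortest paths in a kNN graph."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.full((n, n), np.inf)                    # inf = no edge
    np.fill_diagonal(W, 0.0)
    for i in range(n):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]  # skip the point itself
        W[i, nbrs] = D[i, nbrs]                     # edge weight = Euclidean distance
        W[nbrs, i] = D[i, nbrs]                     # symmetrize the graph
    for m in range(n):                              # Floyd-Warshall
        W = np.minimum(W, W[:, m:m + 1] + W[m:m + 1, :])
    return W

# A 1-D spiral in the plane: geodesic distance follows the curve
t = np.linspace(0, 4 * np.pi, 60)
X = np.c_[t * np.cos(t), t * np.sin(t)]
Delta = isomap_distances(X, n_neighbors=5)
connected = np.isfinite(Delta).all()
exceeds_chord = Delta[0, -1] > np.linalg.norm(X[0] - X[-1])
```

On the spiral, the estimated geodesic between the endpoints is much longer than the straight-line chord, which is exactly what the Euclidean dissimilarity matrix misses.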

12 / 26

SLIDE 13

13 / 26

SLIDE 14

LOCALLY LINEAR EMBEDDING (LLE)

Idea: a data manifold that is globally nonlinear still appears linear in local pieces. Don't try to explicitly model global geodesic distances; instead, try to preserve structure in the data by patching together local pieces of the manifold. LLE algorithm for dataset $\{x_i\}_{i=1}^n$:

  • 1. For each $x_i$, define a local neighborhood $\mathcal{N}_i$.
  • 2. Solve $\min_W \sum_i \|x_i - \sum_{j \in \mathcal{N}_i} w_{ij} x_j\|^2$ subject to $\sum_j w_{ij} = 1$ (and $w_{ij} = 0$ for $j \notin \mathcal{N}_i$).
  • 3. Fix $W$ and solve $\min_Z \sum_i \|z_i - \sum_j w_{ij} z_j\|^2$ subject to $\sum_i z_i = 0$ and $\frac{1}{n} Z^T Z = I$.
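Step 2 decouples across points: each $w_i$ solves a small constrained least squares problem with the closed form $G w = \mathbf{1}$ followed by normalization, where $G$ is the local Gram matrix of the centered neighbors. A sketch (my own naming; the regularizer handles the case of more neighbors than ambient dimensions):

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Solve min ||x - sum_j w_j n_j||^2 subject to sum_j w_j = 1."""
    G = (neighbors - x) @ (neighbors - x).T          # local Gram matrix
    G += reg * np.trace(G) * np.eye(len(neighbors))  # regularize if singular
    w = np.linalg.solve(G, np.ones(len(neighbors)))
    return w / w.sum()                               # enforce sum-to-one

rng = np.random.default_rng(5)
x = rng.normal(size=3)
nbrs = rng.normal(size=(4, 3))     # 4 neighbors in R^3
w = lle_weights(x, nbrs)
recon = w @ nbrs                   # local linear reconstruction of x
```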

14 / 26

SLIDE 15

15 / 26

SLIDE 16

LOCALLY LINEAR EMBEDDING (LLE)

  • Eigenvalue problem in compact form: $\min_Z \operatorname{tr}\left(Z^T (I - W)^T (I - W) Z\right)$ subject to the constraints above.
  • Same kind of problem encountered in PCA! Use the eigendecomposition of $M = (I - W)^T (I - W)$, keeping the eigenvectors with the smallest nonzero eigenvalues.
  • Can compute the embedding of new points as $z = \sum_j w_j z_j$, with the weights $w_j$ computed from the same constrained least squares problem.
  • Demo notebook.
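Given the weight matrix $W$, the embedding step reduces to the bottom eigenvectors of $M$, skipping the constant eigenvector (eigenvalue 0, since the rows of $W$ sum to one). A sketch with a toy ring-shaped weight matrix of my own choosing:

```python
import numpy as np

def lle_embed(W, k):
    """Bottom-k nonzero eigenvectors of M = (I - W)^T (I - W)."""
    n = W.shape[0]
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    evals, evecs = np.linalg.eigh(M)   # ascending eigenvalues
    return evecs[:, 1:k + 1]           # index 0 is the constant eigenvector

# Toy W: each point reconstructed from its two ring neighbors
n = 10
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5
Z = lle_embed(W, k=2)
centering = np.abs(Z.sum(axis=0)).max()  # orthogonal to the constant vector
```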

16 / 26

SLIDE 17

KERNEL DENSITY ESTIMATION

Density estimation problem: given samples $x_1, \dots, x_n$ drawn from an unknown density $p$, estimate $p$.

[Figure: density estimation problem]

Applications: classification, clustering, anomaly detection, etc.

17 / 26

SLIDE 18

KERNEL DENSITY ESTIMATION

  • General form of a kernel density estimate: $\hat{p}(x) = \frac{1}{n h^d} \sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right)$, where $K$ is called a kernel and $h > 0$ is the bandwidth.
  • The estimate is nonparametric; this is also known as the Parzen window method.
  • Looks like a ridge regression kernel machine but with equal weights, and $K$ need not be an inner-product kernel.
  • A kernel should satisfy $K(x) \geq 0$ and $\int K(x)\,dx = 1$.
  • Plenty of standard kernels: rectangular, Gaussian, etc.
  • Demo: kernel density estimation.
  • How do we choose $h$?
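A minimal 1-D sketch of this estimator with a Gaussian kernel (my own code; the bandwidth uses the rule of thumb discussed on the next slide):

```python
import numpy as np

def kde(x_grid, samples, h):
    """Parzen estimate p_hat(x) = (1/(n h)) sum_i K((x - x_i)/h),
    with K the standard Gaussian density."""
    u = (x_grid[:, None] - samples[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.mean(axis=1) / h

rng = np.random.default_rng(6)
samples = rng.normal(size=500)
x_grid = np.linspace(-5, 5, 1001)
h = 1.06 * samples.std() * len(samples) ** (-1 / 5)
p_hat = kde(x_grid, samples, h)

# A valid density: non-negative and integrating to ~1 (Riemann sum on the grid)
mass = p_hat.sum() * (x_grid[1] - x_grid[0])
```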

18 / 26

SLIDE 19

KERNEL DENSITY ESTIMATION

  • Theorem. Let $\hat{p}_n$ be a kernel density estimate based on the kernel $K$. Suppose we scale the bandwidth $h_n$ with $n$ so that $h_n \to 0$ and $n h_n^d \to \infty$. Then $\hat{p}_n$ converges to $p$.

Seems like a very powerful result: kernel density estimation always works given enough points. In practice, for $d = 1$ choose $h = 1.06\, \hat{\sigma}\, n^{-1/5}$ (Silverman's rule of thumb). Can also use model selection techniques (split the dataset into training and testing sets). Ugly truth: kernel density estimation only works well with a lot of points in low dimension.

19 / 26

SLIDE 20

CLUSTERING

Clustering problem: given samples $\{x_i\}_{i=1}^n$, assign points to disjoint subsets called clusters, so that points in the same cluster are more similar to each other than to points in different clusters. A clustering is a map $C : \{1, \dots, n\} \to \{1, \dots, K\}$, with $K$ the number of clusters; how do we choose $K$?

  • Definition (within-cluster scatter). $W(C) = \frac{1}{2} \sum_{k=1}^K \sum_{i : C(i) = k} \sum_{j : C(j) = k} \|x_i - x_j\|^2$.
  • $K$-means clustering: find $C$ minimizing $W(C)$.
  • Lemma. The number of possible clusterings is given by the Stirling numbers of the second kind. No known efficient search strategy exists for this space; exact solution by complete enumeration has intractable complexity.

20 / 26

SLIDE 21

SUB-OPTIMAL K-MEANS CLUSTERING

We want to find $C^* = \arg\min_C W(C)$.

  • Lemma. $W(C) = \sum_{k=1}^K n_k \sum_{i : C(i) = k} \|x_i - \bar{x}_k\|^2$, where $n_k = |\{i : C(i) = k\}|$ and $\bar{x}_k = \frac{1}{n_k} \sum_{i : C(i) = k} x_i$.
  • Lemma. For a fixed clustering $C$, we have $\sum_{i : C(i) = k} \|x_i - \bar{x}_k\|^2 = \min_{m_k} \sum_{i : C(i) = k} \|x_i - m_k\|^2$.

Solve the enlarged optimization problem $\min_{C, \{m_k\}} \sum_{k=1}^K \sum_{i : C(i) = k} \|x_i - m_k\|^2$.

21 / 26

SLIDE 22

ALTERNATING OPTIMIZATION PROCEDURE

To find a $K$-means solution, alternate between two subproblems:

  • 1. Given the means $\{m_k\}$, choose $C$ to minimize $\sum_{i=1}^n \|x_i - m_{C(i)}\|^2$.
  • 2. Given $C$, choose $\{m_k\}$ to minimize $\sum_{k=1}^K \sum_{i : C(i) = k} \|x_i - m_k\|^2$.

The solution to subproblem 1 is $C(i) = \arg\min_k \|x_i - m_k\|$. The solution to subproblem 2 is $m_k = \bar{x}_k$.
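The alternating procedure in a short numpy sketch (my own code; initialization and the two-blob test data are illustrative choices):

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    """Alternate assignment (subproblem 1) and mean update (subproblem 2)."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=K, replace=False)]  # init from data points
    for _ in range(iters):
        d = ((X[:, None, :] - m[None, :, :]) ** 2).sum(-1)
        C = d.argmin(axis=1)                # C(i) = argmin_k ||x_i - m_k||
        for k in range(K):
            if (C == k).any():
                m[k] = X[C == k].mean(axis=0)   # m_k = cluster mean
    return C, m

# Two well-separated blobs
rng = np.random.default_rng(7)
X = np.r_[rng.normal(0, 0.3, size=(50, 2)), rng.normal(5, 0.3, size=(50, 2))]
C, m = kmeans(X, K=2)
first_pure = (C[:50] == C[0]).all()     # first blob lands in one cluster
second_pure = (C[50:] == C[50]).all()   # second blob in the other
```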

22 / 26

SLIDE 23

K-MEANS REMARKS

Algorithmic notes:

  • The algorithm is typically initialized with the $m_k$ chosen as random points in the dataset.
  • Use several random initializations to avoid bad local minima.
  • Cluster boundaries are parts of hyperplanes, so regions are intersections of half-spaces and hence convex; $K$-means fails if clusters are non-convex.
  • The geometry changes if we change the norm.

23 / 26

SLIDE 24

GAUSSIAN MIXTURE MODELS

Extend the idea behind $K$-means clustering to allow for more general cluster shapes: clusters are elliptical.

  • Each cluster can be modeled using a multivariate Gaussian density $\mathcal{N}(x; \mu_k, \Sigma_k)$ with mean $\mu_k$ and covariance $\Sigma_k$.
  • The full data set is modeled using a Gaussian mixture model (GMM): $p(x) = \sum_{k=1}^K \pi_k\, \mathcal{N}(x; \mu_k, \Sigma_k)$, where $\pi_k \geq 0$ and $\sum_{k=1}^K \pi_k = 1$.
  • Cluster estimation is performed by MLE on the GMM.
  • Interpretation of the GMM: there is a state variable $t$ with $\mathbb{P}(t = k) = \pi_k$, such that every realization $x_i$ comes with a hidden realization $t_i$ of the state variable; the challenge is to perform clustering without observing the hidden states.

24 / 26

SLIDE 25

MAXIMUM LIKELIHOOD ESTIMATION

  • Example (easy): MLE of a single multivariate Gaussian.
  • Example (hard): MLE of a mixture of multivariate Gaussians with incomplete data (hidden states unobserved).
  • Example (ideal): MLE of a mixture of multivariate Gaussians with complete data.

25 / 26

SLIDE 26

EXPECTATION-MAXIMIZATION (EM) ALGORITHM

Efficient algorithm to address incomplete data. Key idea: work with the MLE for complete data and average out the unobserved hidden states. EM algorithm:

  • 1. Initialize $\theta^{(0)} = (\pi^{(0)}, \mu^{(0)}, \Sigma^{(0)})$.
  • 2. For $\ell = 0, 1, 2, \dots$:
  • E step: evaluate $Q(\theta; \theta^{(\ell)}) = \mathbb{E}\left[\log p(x, t; \theta) \mid x; \theta^{(\ell)}\right]$, where the expectation is over the hidden states given the data and the current parameters.
  • M step (maximization): set $\theta^{(\ell+1)} = \arg\max_\theta Q(\theta; \theta^{(\ell)})$.

  • Lemma. The algorithm gets monotonically better: the likelihood of the data is non-decreasing at every iteration.
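A condensed sketch of the E and M steps for a 1-D two-component GMM (my own code; the monotonicity lemma shows up as a non-decreasing log-likelihood trace):

```python
import numpy as np

def gaussian(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm(x, K=2, iters=30, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1 / K)
    mu = rng.choice(x, size=K, replace=False)   # init means from data
    var = np.full(K, x.var())
    loglik = []
    for _ in range(iters):
        # E step: responsibilities gamma_ik = P(t_i = k | x_i; theta)
        p = pi * gaussian(x[:, None], mu, var)          # shape (n, K)
        loglik.append(np.log(p.sum(axis=1)).sum())
        g = p / p.sum(axis=1, keepdims=True)
        # M step: responsibility-weighted MLE updates
        Nk = g.sum(axis=0)
        pi = Nk / len(x)
        mu = (g * x[:, None]).sum(axis=0) / Nk
        var = (g * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var, np.array(loglik)

rng = np.random.default_rng(8)
x = np.r_[rng.normal(-3, 1, 200), rng.normal(3, 1, 200)]
pi, mu, var, loglik = em_gmm(x)
monotone = np.all(np.diff(loglik) >= -1e-8)   # likelihood never decreases
```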

26 / 26

    