SLIDE 1
Dimension Reduction
CS 6242 Ramakrishnan Kannan
Thanks: Prof. Jaegul Choo and Prof. Le Song
SLIDE 2 What is Dimension Reduction?
[Figure: a high-dimensional data matrix (dimension index d, data item index n; columns as data items) is transformed by "Dimension Reduction" into low-dim data. Annotations: "How big is this?" and "Why?"]
Attribute = Feature = Variable = Dimension
SLIDE 3 Image Data
Serialized/rasterized pixel values
[Figure: raw images → pixel values → serialized pixels, e.g., 3 80 24 58 63 45 5 34 78 …]
A 4K (4096x2160) image has 8,847,360 pixels in total, about 8.8 million.
SLIDE 4 Video Data
Serialized/rasterized pixel values
[Figure: raw images → pixel values → serialized pixels, one long vector per frame]
Huge dimensions: a 4096x2160 frame → 8,847,360 dimensions. At 30 fps, a 2-minute video generates a matrix of size 8,847,360 x 3,600.
SLIDE 5 Text Documents
Bag-of-words vector
Document 1 = "Life of Pi won Oscar"
Document 2 = "Life of Pi is also a book."

Vocabulary   Doc 1   Doc 2
Life         1       1
Pi           1       1
movies       0       0
also         0       1
book         0       1
won          1       0
…
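To make this concrete, here is a minimal sketch of building such binary bag-of-words vectors; the use of scikit-learn's CountVectorizer is an assumption for illustration, not something the slides prescribe:

```python
# Minimal bag-of-words sketch. CountVectorizer and its defaults
# (lowercasing, dropping single-character tokens) are assumptions.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Life of Pi won Oscar", "Life of Pi is also a book."]
vectorizer = CountVectorizer(binary=True)   # record presence/absence only
X = vectorizer.fit_transform(docs)          # 2 x |vocabulary| sparse matrix

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # one 0/1 row per document
```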
SLIDE 6 Two Axes of Data Set
Data items: how many data items?
Dimensions: how many dimensions representing each item?
[Figure: a d x n matrix with data item index (n) and dimension index (d)]
Columns as data items vs. rows as data items: we will use columns as data items during this lecture.
SLIDE 7 Dimension Reduction
[Diagram] Inputs: high-dim data (d x n; dimension index d, n data items), the number of reduced dimensions k (user-specified), additional info about the data, and other parameters. Outputs: low-dim data (k x n; reduced dimension k) and a dim-reducing transformation for new data.
SLIDE 8
Benefits of Dimension Reduction
Obviously:
- Compression
- Visualization
- Faster computation, e.g., computing distances between 100,000-dim vs. 10-dim vectors (see the timing sketch after this list)
More importantly:
- Noise removal (improving data quality)
- Separates the data into general pattern + sparse + noise (but is the noise itself the important signal?)
- Works as pre-processing for better performance, e.g., microarray data analysis, information retrieval, face recognition, protein disorder prediction, network intrusion detection, document categorization, speech recognition
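As a rough illustration of the "faster computation" point, the sketch below times all pairwise distances in 100,000 vs. 10 dimensions; the point count and dimensions are assumptions chosen for illustration:

```python
# Timing sketch: all pairwise Euclidean distances among 200 points
# in 100,000 vs. 10 dimensions (sizes are assumed for illustration).
import time
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (100_000, 10):
    X = rng.random((200, d))
    t0 = time.perf_counter()
    pdist(X)                                # condensed pairwise-distance vector
    print(f"{d:>7} dims: {time.perf_counter() - t0:.4f} s")
```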
SLIDE 9 Two Main Techniques
Feature selection: selects a subset of the original variables as reduced dimensions, relevant for a particular task; e.g., the number of genes responsible for a particular disease may be small.
Feature extraction: each reduced dimension combines multiple original dimensions; the original dataset is transformed into some other numbers.
Feature = Variable = Dimension
SLIDE 10 Feature Selection
What is the optimal subset of m features that maximizes a given criterion?
Widely-used criteria: information gain, correlation, …
Typically a combinatorial optimization problem, so greedy methods are popular:
- Forward selection: empty set → add one variable at a time (see the sketch below)
- Backward elimination: entire set → remove one variable at a time
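A minimal sketch of greedy forward selection follows; the dataset, the logistic-regression model, and the use of scikit-learn's SequentialFeatureSelector are all assumptions for illustration:

```python
# Greedy forward selection sketch (model and dataset are assumptions).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,     # the target subset size m
    direction="forward",        # empty set -> add one variable at a time
)
selector.fit(X, y)
print(selector.get_support(indices=True))   # indices of the selected features
```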
SLIDE 11
Feature Extraction
SLIDE 12 Aspects of Dimension Reduction
- Linear vs. Nonlinear
- Unsupervised vs. Supervised
- Global vs. Local
- Feature vectors vs. Similarity (as an input)
SLIDE 13 Linear vs. Nonlinear
Linear: represents each reduced dimension as a linear combination of original dimensions
Of the form AX + b, where A is a matrix of coefficients and b a vector
e.g., Y1 = 3*X1 – 4*X2 + 0.3*X3 – 1.5*X4
      Y2 = 2*X1 + 3.2*X2 – X3 + 2*X4
Naturally capable of mapping new data to the same space (computed below as a matrix product)
[Table: worked example mapping a data item with values in dimensions X1–X4 to reduced values Y1 = 1.75, Y2 = 0.58]
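The slide's example mapping written as a single matrix product; the coefficients come from the slide, while the random input data is an assumption:

```python
# Linear dimension reduction as a matrix product (columns as data items).
import numpy as np

A = np.array([[3.0, -4.0,  0.3, -1.5],   # Y1 = 3*X1 - 4*X2 + 0.3*X3 - 1.5*X4
              [2.0,  3.2, -1.0,  2.0]])  # Y2 = 2*X1 + 3.2*X2 -   X3 + 2*X4
X = np.random.rand(4, 10)                # 10 data items in 4 dimensions
Y = A @ X                                # the same 10 items in 2 reduced dimensions
print(Y.shape)                           # (2, 10)
```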
SLIDE 14 Linear vs. Nonlinear
Linear: represents each reduced dimension as a linear combination of original dimensions
e.g., Y1 = 3*X1 – 4*X2 + 0.3*X3 – 1.5*X4, Y2 = 2*X1 + 3.2*X2 – X3 + 2*X4
Naturally capable of mapping new data to the same space
Nonlinear: more complicated, but generally more powerful; a recently popular topic
SLIDE 15 Unsupervised vs. Supervised
Unsupervised: uses only the input data
[Diagram: the dimension reduction box from Slide 7, with only the high-dim data input active]
SLIDE 16 Unsupervised vs. Supervised
Supervised: uses the input data + additional info
[Diagram: same as Slide 7, with the "additional info about data" input highlighted]
SLIDE 17 Unsupervised vs. Supervised
Supervised: uses the input data + additional info, e.g., grouping label
[Diagram: same as Slide 7, with the "additional info about data" input highlighted]
SLIDE 18 Global vs. Local
Dimension reduction typically tries to preserve all the relationships/distances in data, but information loss is unavoidable. So what should we emphasize more?
Global: treats all pairwise distances as equally important; focuses on preserving large distances
Local: focuses on small distances and neighborhood relationships; an active research area, e.g., manifold learning
SLIDE 19 Feature vectors vs. Similarity (as an input)
Typical setup: feature vectors as an input
[Diagram: high-dim data (n items, dimension index d) → dimension reduction (k dimensions, other parameters, additional info) → low-dim data (reduced dimension k) plus a dim-reducing transformer for new data]
SLIDE 20 Feature vectors vs. Similarity (as an input)
Alternatively, takes a similarity matrix instead:
- the (i,j)-th component indicates the similarity between the i-th and j-th data items
- assuming the distance is a metric, the similarity matrix is symmetric
[Diagram: similarity matrix → dimension reduction → low-dim data]
SLIDE 21 Feature vectors vs. Similarity (as an input)
Some methods internally convert feature vectors to a similarity matrix before performing dimension reduction (a sketch of such a conversion follows).
[Diagram: high-dim data (d x n) → similarity matrix (n x n) → dimension reduction → low-dim data (k x n), labeled "Graph Embedding"]
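Here is a minimal sketch of such a feature-vectors-to-similarity conversion; the Gaussian-kernel formula and the bandwidth heuristic are assumptions for illustration, since the slides do not fix a particular similarity:

```python
# Feature vectors -> n x n similarity matrix via a Gaussian kernel
# (kernel choice and bandwidth heuristic are assumptions).
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(5, 100)               # d=5 dims, n=100 items (columns as items)
D = squareform(pdist(X.T))               # n x n pairwise Euclidean distances
sigma = D.mean()                         # heuristic bandwidth
S = np.exp(-D**2 / (2 * sigma**2))       # similarities in (0, 1], 1 on the diagonal
assert np.allclose(S, S.T)               # symmetric, as noted on Slide 20
```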
SLIDE 22 Feature vectors vs. Similarity (as an input)
Why is this called graph embedding? The similarity matrix can be viewed as a graph where similarity represents edge weight.
[Diagram: high-dim data (d x n) → similarity matrix viewed as a weighted graph → dimension reduction → low-dim data, labeled "Graph Embedding"]
SLIDE 23 Methods
Traditional:
- Principal component analysis (PCA)
- Multidimensional scaling (MDS)
- Linear discriminant analysis (LDA)
Advanced (nonlinear, kernelized, manifold learning):
- Isometric feature mapping (Isomap)
* Matlab codes are available at
http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
SLIDE 24 Principal Component Analysis
Finds the axis showing the largest variation and projects all points onto this axis; the reduced dimensions are orthogonal
Algorithm: eigen-decomposition
Pros: fast
Cons: limited performance
[Linear | Unsupervised | Global | Feature vectors]
Image source: http://en.wikipedia.org/wiki/Principal_component_analysis (PC1, PC2)
SLIDE 25 PCA – Some Questions
Algorithm:
1. Subtract the mean from the dataset (X − µ)
2. Form the covariance matrix of the centered data, (X − µ)(X − µ)′ with columns as data items
3. Eigen-decompose (or SVD) this covariance matrix to find the leading eigenvectors
4. Project the data onto these leading eigenvectors, i.e., multiply
Key questions: Why the covariance matrix? Can't we perform SVD on the original matrix?
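The four steps above as a numpy sketch, following the lecture's columns-as-data-items convention; the random data and the choice of k = 2 are assumptions:

```python
# The slide's PCA recipe in numpy (columns of X are data items).
import numpy as np

X = np.random.rand(100, 500)             # d=100 dims, n=500 items
mu = X.mean(axis=1, keepdims=True)
Xc = X - mu                              # step 1: subtract the mean
C = Xc @ Xc.T / (X.shape[1] - 1)         # step 2: d x d covariance matrix
vals, vecs = np.linalg.eigh(C)           # step 3: eigh suits symmetric matrices
W = vecs[:, ::-1][:, :2]                 # leading 2 eigenvectors (largest eigenvalues)
Y = W.T @ Xc                             # step 4: project, giving 2 x n low-dim data
```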
SLIDE 26 Multidimensional Scaling (MDS)
Main idea: tries to preserve given pairwise distances in low-dimensional space
Metric MDS: preserves the given distance values
Nonmetric MDS: for when you only know/care about the ordering of distances; preserves only the orderings of distance values
Algorithm: gradient-descent type (cf. classical MDS is equivalent to PCA)
[Nonlinear | Unsupervised | Global | Similarity input]
Objective (reconstructed from the slide's "ideal distance" vs. "low-dim distance" sketch): minimize the stress $\sum_{i<j} \big( \delta_{ij} - \lVert \mathbf{y}_i - \mathbf{y}_j \rVert \big)^2$, where $\delta_{ij}$ is the given ideal distance and $\lVert \mathbf{y}_i - \mathbf{y}_j \rVert$ is the low-dimensional distance.
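A metric-MDS sketch with scikit-learn follows; the library choice and data are assumptions, and note that scikit-learn expects rows, not columns, as data items:

```python
# Metric MDS sketch (library choice and data are assumptions).
import numpy as np
from sklearn.manifold import MDS

X = np.random.rand(200, 50)              # 200 items in 50 dimensions
mds = MDS(n_components=2, metric=True)   # metric=False would give nonmetric MDS
Y = mds.fit_transform(X)                 # 200 x 2 embedding
print(mds.stress_)                       # residual mismatch between distances
```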
SLIDE 27 Multidimensional Scaling
Pros: widely used (works well in general)
Cons: slow (an n-body-type problem); nonmetric MDS is much slower than metric MDS
Fast algorithms are available: the Barnes-Hut algorithm, GPU-based implementations
SLIDE 28
Linear Discriminant Analysis
What if clustering information is available? LDA tries to separate clusters by
- putting different clusters as far apart as possible
- making each cluster as compact as possible
[Figure panels (a) and (b)]
SLIDE 29 Aspects of Dimension Reduction
Unsupervised vs. Supervised
Supervised: uses the input data + additional info, e.g., grouping label
[Diagram: same as Slide 17]
SLIDE 30 Linear Discriminant Analysis (LDA) vs. Principal Component Analysis
[Figure: 2D visualizations of a mixture of 7 Gaussians in 1000 dimensions; linear discriminant analysis (supervised) vs. principal component analysis (unsupervised)]
SLIDE 31 LDA
Algorithm:
1. Compute the mean of each class and the global mean (µ1, µ2, µ)
2. Compute the within-class scatter matrix Sw
3. Compute the between-class scatter matrix Sb using the means
4. Eigen-decompose inv(Sw)*Sb; the leading eigenvectors give the projection
Question: why is inv(Sw)*Sb the solution?
- Fisher’s LDA generalizes gracefully
– 𝑧 𝐷 − 1 [𝑧1, 𝑧2, … 𝑧𝐷−1] 𝐷 − 1 𝑥𝑗 𝑋 = [𝑥1|𝑥2| … |𝑥𝐷−1] 𝑧𝑗 = 𝑥𝑗
𝑈𝑦 ⇒ 𝑧 = 𝑋𝑈𝑦
𝑇𝑋 = 𝑇𝑗
𝐷 𝑗=1
𝑦 − 𝜈𝑗 𝑦 − 𝜈𝑗 𝑈
𝑦∈𝜕𝑗
𝜈𝑗 =
1 𝑂𝑗
𝑦
𝑦∈𝜕𝑗
– es 𝑇𝐶 = 𝑂𝑗 𝜈𝑗 − 𝜈 𝜈𝑗 − 𝜈 𝑈
𝐷 𝑗=1
1 𝑂
𝑦
∀𝑦
=
1 𝑂
𝑂𝑗𝜈𝑗
𝐷 𝑗=1
– 𝑇𝑈 = 𝑇𝐶 + 𝑇𝑋
1 2 3
S B 3 S B 2 S W 3 S W 1 S W 2 x 1 x 2 1 2 3
S B 3 S B 2 S W 3 S W 1 S W 2 x 1 x 2
*http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf
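Putting the scatter-matrix definitions above into code, here is a minimal numpy/scipy sketch of Fisher's LDA; it is an illustration, not the lecture's reference implementation, and the nonsingularity caveat is an added note:

```python
# Fisher's LDA sketch following the scatter matrices above.
import numpy as np
from scipy.linalg import eigh

def lda_directions(X, labels, k):
    """X: d x n data, columns as items; returns the top-k projection directions."""
    d = X.shape[0]
    mu = X.mean(axis=1, keepdims=True)                 # global mean
    Sw = np.zeros((d, d))                              # within-class scatter
    Sb = np.zeros((d, d))                              # between-class scatter
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mu_c = Xc.mean(axis=1, keepdims=True)          # class mean
        Sw += (Xc - mu_c) @ (Xc - mu_c).T
        Sb += Xc.shape[1] * (mu_c - mu) @ (mu_c - mu).T
    # Generalized eigenproblem Sb w = lambda Sw w, i.e., eig of inv(Sw) Sb;
    # Sw must be nonsingular (e.g., n > d), otherwise regularize it.
    vals, vecs = eigh(Sb, Sw)
    return vecs[:, ::-1][:, :k]                        # directions for the largest lambdas
```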
SLIDE 32 Linear Discriminant Analysis
Maximally separates clusters by putting different clusters far apart and shrinking each cluster compactly
Algorithm: generalized eigendecomposition
Pros: better shows cluster structure
Cons: may distort the original relationships of the data
[Linear | Supervised | Global | Feature vectors]
SLIDE 33 Methods
Traditional:
- Principal component analysis (PCA)
- Multidimensional scaling (MDS)
- Linear discriminant analysis (LDA)
Advanced (nonlinear, kernelized, manifold learning):
- Isometric feature mapping (Isomap)
* Matlab codes are available at
http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
SLIDE 34 Manifold Learning
Swiss Roll Data
Swiss roll data, originally in 3D. What is the intrinsic dimensionality (allowing flattening)?
SLIDE 35 Manifold Learning
Swiss Roll Data
Swiss roll data, originally in 3D. What is the intrinsic dimensionality (allowing flattening)? → 2D
What if your data has low intrinsic dimensionality but resides in high-dimensional space?
SLIDE 36 Isomap
(Isometric Feature Mapping)
Let's preserve pairwise geodesic distance (along the manifold):
- Compute geodesic distance as the shortest-path length in the k-nearest-neighbor (k-NN) graph
- Eigen-decompose* the pairwise geodesic distance matrix to obtain the embedding that best preserves the given distances (see the sketch below)
* Eigen-decomposition is also the main algorithm of PCA
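A short Isomap sketch on the swiss roll from Slides 34–35; the use of scikit-learn and the parameter values are assumptions:

```python
# Isomap sketch: k-NN graph + geodesic distances + eigen-decomposition,
# all handled internally by scikit-learn (an assumed library choice).
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000)        # 1000 points on a 3D swiss roll
iso = Isomap(n_neighbors=10, n_components=2)  # k for the k-NN graph, target dims
Y = iso.fit_transform(X)                      # 1000 x 2 "unrolled" embedding
```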
SLIDE 37 Isomap
(Isometric Feature Mapping)
Algorithm: all-pairs shortest path computation + eigen-decomposition
Pros: performs well in general
Cons: slow (shortest paths), sensitive to parameters
[Nonlinear | Unsupervised | Global (all pairwise distances are considered) | Feature vectors]
SLIDE 38 Practitioner's Guide
Caveats
Trustworthiness of dimension reduction results:
- Distortion/information loss in 2D/3D is inevitable
- The best result of a method may not align with what we want, e.g., PCA visualization of facial image data
[Figure: facial images projected onto PCA dimensions (1, 2) vs. dimensions (3, 4)]
SLIDE 39 Practitioner's Guide
General Recommendation
Want something simple and fast to visualize data? PCA, force-directed layout
Want to first try a manifold learning method? Isomap; empirically, it is the method that tends to give the best results.
Have a cluster label to use (pre-given or computed)? LDA (supervised). A supervised approach is sometimes the only viable option when your data do not have clearly separable clusters.
No labels, but still want some clusters to be revealed? Or simply want some state-of-the-art method for visualization?
SLIDE 40 Practitioner's Guide
Results Still Not Good?
Try various pre-processing (a sketch follows this list):
- Data centering: subtract the global mean from each vector
- Normalization: make each vector have unit Euclidean norm; otherwise, a few outliers can affect dimension reduction significantly
- Application-specific pre-processing:
  - Documents: TF-IDF weighting; remove terms that are too rare and/or too short
  - Images: histogram normalization
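A minimal sketch of the centering and normalization steps; the random data is an assumption, and rows are used as data items here for brevity:

```python
# Centering and unit-norm normalization sketch (rows as data items).
import numpy as np

X = np.random.rand(100, 20)
Xc = X - X.mean(axis=0)                       # centering: subtract the global mean
norms = np.linalg.norm(Xc, axis=1, keepdims=True)
Xn = Xc / np.where(norms == 0, 1.0, norms)    # unit Euclidean norm per data item
```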
SLIDE 41 Practitioner's Guide
Too Slow?
Apply PCA to reduce to an intermediate dimension before the main dimension reduction step (sketched below); the results may even improve due to the noise removed by PCA.
See if there is an approximate but faster version:
- Landmarked versions (only using a subset of data items), e.g., landmarked Isomap
- Linearized versions (the same criterion, but only allowing a linear mapping), e.g., Laplacian Eigenmaps → locality preserving projection
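The PCA pre-reduction idea as a two-step sketch; the data shape, the intermediate dimension of 50, and the choice of Isomap as the main method are assumptions:

```python
# PCA pre-reduction before a slower nonlinear method (assumed pipeline).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X = np.random.rand(500, 10_000)                   # hypothetical very high-dim data
Z = PCA(n_components=50).fit_transform(X)         # fast intermediate reduction
Y = Isomap(n_neighbors=10, n_components=2).fit_transform(Z)  # main (slow) step
```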
SLIDE 42 Practitioner's Guide
Still Need More?
Tweak dimension reduction for your own purpose:
- Play with its algorithm, convergence criteria, etc.
- See if you can impose label information
- Restrict the number of iterations to save computation time
The main purpose of dimension reduction is to serve us in exploring data and solving complicated real-world problems.
SLIDE 43 Take Away

                  PCA   MDS   LDA   Isomap
Supervised        ✖     ✖     ✔     ✖
Linear            ✔     ✖     ✔     ✖
Global            ✔     ✔     ✔     ✔
Feature vectors   ✔     ✖     ✔     ✔
SLIDE 44 Useful Resources
Tutorial on PCA: http://arxiv.org/pdf/1404.1100.pdf
Tutorial on LDA: http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf
Review article: http://www.iai.uni-bonn.de/~jz/dimensionality_reduction_a_comparative_review.pdf
Matlab toolbox for dimension reduction: http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
Matlab manifold learning demo: http://www.math.ucla.edu/~wittman/mani/