Dimensionality Reduction: Linear Discriminant Analysis and Principal Component Analysis
CMSC 678, UMBC
Outline
Linear Algebra/Math Review
Two Methods of Dimensionality Reduction:
Linear Discriminant Analysis (LDA, LDiscA)
Principal Component Analysis (PCA)
Covariance
covariance: how (linearly) correlated are variables

$$\sigma_{ij} = \frac{1}{N-1} \sum_{k=1}^{N} (x_{ki} - \mu_i)(x_{kj} - \mu_j)$$

$x_{ki}$ is the value of variable $i$ in object $k$, $\mu_i$ is the mean of variable $i$ (likewise for $j$), and $\sigma_{ij}$ is the covariance of variables $i$ and $j$.

Covariance is symmetric, $\sigma_{ij} = \sigma_{ji}$, and collecting every pair gives the covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1D} \\ \vdots & \ddots & \vdots \\ \sigma_{D1} & \cdots & \sigma_{DD} \end{pmatrix}$$
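A quick numpy sketch of this formula (the toy data matrix below is made up for illustration):

```python
import numpy as np

# Toy data (made up for illustration): N = 4 objects, D = 2 variables
X = np.array([[ 2.0,  1.0],
              [ 2.0, -1.0],
              [-2.0, -1.0],
              [ 4.0,  2.0]])

mu = X.mean(axis=0)                           # mean of each variable
centered = X - mu
Sigma = centered.T @ centered / (len(X) - 1)  # D x D covariance matrix

# np.cov with rowvar=False uses the same (N - 1) denominator
assert np.allclose(Sigma, np.cov(X, rowvar=False))
print(Sigma)
```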
Eigenvalues and Eigenvectors

$$A v = \lambda v \qquad \text{(matrix } A\text{, vector } v\text{, scalar } \lambda\text{)}$$

for a given matrix $A$ (acting by multiplication): which non-zero vectors are only scaled, by a single multiplication?

Example: $A = \begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix}$

$$\begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x + 5y \\ y \end{pmatrix}, \qquad \text{so we need } \begin{pmatrix} x + 5y \\ y \end{pmatrix} = \lambda \begin{pmatrix} x \\ y \end{pmatrix}$$

The only non-zero vector (up to scale) that satisfies this is $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$, with $\lambda = 1$:

$$\begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = 1 \cdot \begin{pmatrix} 1 \\ 0 \end{pmatrix}$$
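A small numpy check of the example, assuming the matrix on the slide is the 2x2 shear [[1, 5], [0, 1]]:

```python
import numpy as np

A = np.array([[1.0, 5.0],
              [0.0, 1.0]])

vals, vecs = np.linalg.eig(A)   # columns of vecs are eigenvectors
print(vals)                     # [1. 1.] -- the eigenvalue 1, repeated
print(vecs)                     # both columns lie (numerically) along (1, 0): one eigen-direction

v = np.array([1.0, 0.0])
assert np.allclose(A @ v, 1.0 * v)   # A v = 1 * v, as on the slide
```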
Outline
Linear Algebra/Math Review
Two Methods of Dimensionality Reduction:
Linear Discriminant Analysis (LDA, LDiscA)
Principal Component Analysis (PCA)
Dimensionality Reduction
Original (lightly preprocessed) data → compressed representation
[Figure: an N × D data matrix is mapped to an N × L matrix: N instances, D input features, L reduced features]
Dimensionality Reduction
clarity of representation vs. ease of understanding
oversimplification: loss of important or relevant information
Courtesy Antano Žilinsko
Why "maximize" the variance?
How can we efficiently summarize the data?
We maximize the variance within our summarization
We don't increase the variance in the dataset
How can we capture the most information with the fewest axes?
Summarizing Redundant Information
Four points: (2,1), (2,-1), (-2,-1), (4,2)
In the standard basis: (2,1) = 2*(1,0) + 1*(0,1)
Instead, take u1 = (2,1) and u2 = (2,-1) as the basis:
(2,1) = 1*u1 + 0*u2
(4,2) = 2*u1 + 0*u2
(-2,-1) = -1*u1 + 0*u2
(Is this the most general choice? These vectors aren't orthogonal.)
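A short sketch of the change of basis above; the points and the basis u1, u2 come from the slide, and numpy's linear solver recovers each point's coefficients:

```python
import numpy as np

# Points from the slide, one per row
points = np.array([[ 2.0,  1.0],
                   [ 2.0, -1.0],
                   [-2.0, -1.0],
                   [ 4.0,  2.0]])

# New (non-orthogonal) basis vectors u1 = (2,1), u2 = (2,-1), as columns
U = np.array([[2.0,  2.0],
              [1.0, -1.0]])

# Coordinates of each point in the u1/u2 basis: solve U @ coeffs = point
coeffs = np.linalg.solve(U, points.T).T
print(coeffs)
# [[ 1.  0.]    (2,1)   =  1*u1 + 0*u2
#  [ 0.  1.]    (2,-1)  =  0*u1 + 1*u2
#  [-1.  0.]    (-2,-1) = -1*u1 + 0*u2
#  [ 2.  0.]]   (4,2)   =  2*u1 + 0*u2
```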
Outline
Linear Algebra/Math Review
Two Methods of Dimensionality Reduction:
Linear Discriminant Analysis (LDA, LDiscA)
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA, LDiscA) and Principal Component Analysis (PCA)
Summarize D-dimensional input data by uncorrelated axes
Uncorrelated axes are also called principal components
Use the first L components to account for as much variance as possible
Geometric Rationale of LDiscA & PCA
Objective: to rigidly rotate the axes of the D-dimensional space to new positions (principal axes):
ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, ..., and axis D has the lowest variance
covariance among each pair of the principal axes is zero (the principal axes are uncorrelated)
Courtesy Antano Žilinsko
Remember: MAP Classifiers are Optimal for Classification
$$\min_f \sum_i \mathbb{E}_{\hat{y}_i}\!\left[\ell_{0/1}(y_i, \hat{y}_i)\right] \;\Longrightarrow\; \max_f \sum_i p(\hat{y}_i = y_i \mid x_i)$$

$$p(\hat{y}_i = y_i \mid x_i) \;\propto\; p(x_i \mid \hat{y}_i)\, p(\hat{y}_i)$$

posterior $\propto$ class-conditional likelihood $\times$ class prior, with $x_i \in \mathbb{R}^D$
Linear Discriminant Analysis
MAP Classifier where:
- 1. class-conditional likelihoods are Gaussian
- 2. a common covariance is shared across the class likelihoods
LDiscA: (1) What if likelihoods are Gaussian?

$$p(\hat{y}_i = y_i \mid x_i) \propto p(x_i \mid \hat{y}_i)\, p(\hat{y}_i), \qquad p(x_i \mid k) = \mathcal{N}(\mu_k, \Sigma_k) = \frac{\exp\!\left(-\tfrac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right)}{(2\pi)^{D/2}\, |\Sigma_k|^{1/2}}$$
https://upload.wikimedia.org/wikipedia/commons/5/57/Multivariate_Gaussian.png
LDiscA: (2) Shared Covariance

$$\log\frac{p(\hat{y}_i = k \mid x_i)}{p(\hat{y}_i = l \mid x_i)} = \log\frac{p(x_i \mid k)}{p(x_i \mid l)} + \log\frac{p(k)}{p(l)}$$

$$= \log\frac{p(k)}{p(l)} + \log\frac{\exp\!\left(-\tfrac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1}(x_i - \mu_k)\right)\big/\left((2\pi)^{D/2}|\Sigma_k|^{1/2}\right)}{\exp\!\left(-\tfrac{1}{2}(x_i - \mu_l)^T \Sigma_l^{-1}(x_i - \mu_l)\right)\big/\left((2\pi)^{D/2}|\Sigma_l|^{1/2}\right)}$$

With a shared covariance, $\Sigma_k = \Sigma_l = \Sigma$:

$$= \log\frac{p(k)}{p(l)} - \tfrac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1}(\mu_k - \mu_l) + x_i^T \Sigma^{-1}(\mu_k - \mu_l)$$

linear in $x_i$ (check for yourself: why did the quadratic $x_i$ terms cancel?)

rewrite only in terms of $x_i$ (data) and single-class terms:

$$= \left[x_i^T \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log p(k)\right] - \left[x_i^T \Sigma^{-1}\mu_l - \tfrac{1}{2}\mu_l^T \Sigma^{-1}\mu_l + \log p(l)\right]$$
Classify via Linear Discriminant Functions

$$\delta_k(x_i) = x_i^T \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log p(k)$$

$\arg\max_k \delta_k(x_i)$ is equivalent to the MAP classifier
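A hedged sketch of these discriminant functions in numpy (the function names and array shapes are illustrative, not from the slides; the parameters come from the estimation step on the next slides):

```python
import numpy as np

def lda_discriminants(X, means, Sigma, priors):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log p(k).
    X: (N, D), means: (K, D), Sigma: (D, D), priors: (K,). Returns (N, K)."""
    Sigma_inv = np.linalg.inv(Sigma)
    lin = X @ Sigma_inv @ means.T                                   # x^T Sigma^{-1} mu_k
    quad = 0.5 * np.einsum('kd,de,ke->k', means, Sigma_inv, means)  # 0.5 mu_k^T Sigma^{-1} mu_k
    return lin - quad + np.log(priors)

def lda_classify(X, means, Sigma, priors):
    # MAP classification: pick the class with the largest discriminant
    return np.argmax(lda_discriminants(X, means, Sigma, priors), axis=1)
```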
LDiscA
Parameters to learn: $p(k)$, $\mu_k$, $\Sigma$

$$p(k) \propto N_k \qquad \text{(number of items labeled with class } k\text{)}$$

$$\mu_k = \frac{1}{N_k}\sum_{i:\, y_i = k} x_i$$

$$\Sigma = \frac{1}{N-K}\sum_k \mathrm{scatter}_k = \frac{1}{N-K}\sum_k \sum_{i:\, y_i = k} (x_i - \mu_k)(x_i - \mu_k)^T$$

within-class covariance (one option for estimating $\Sigma$)
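A minimal estimation sketch for the three parameters, assuming `X` is an (N, D) data matrix and `y` an integer label vector (the helper name is illustrative):

```python
import numpy as np

def fit_lda_params(X, y):
    """Estimate priors p(k), class means mu_k, and the pooled within-class covariance Sigma."""
    classes = np.unique(y)
    N, D = X.shape
    K = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])        # p(k) proportional to N_k
    means = np.array([X[y == k].mean(axis=0) for k in classes])  # mu_k
    Sigma = np.zeros((D, D))
    for k, mu_k in zip(classes, means):
        diff = X[y == k] - mu_k
        Sigma += diff.T @ diff                                   # per-class scatter
    Sigma /= (N - K)                                             # pooled over N - K
    return priors, means, Sigma
```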
Computational Steps for Full-Dimensional LDiscA
- 1. Compute means, priors, and covariance
- 2. Diagonalize the covariance with the eigen decomposition $\Sigma = U D U^T$ ($D$ a diagonal matrix of eigenvalues, $U$ an orthonormal matrix of eigenvectors)
- 3. Sphere the data (get unit covariance): $X^{*} = D^{-1/2} U^T X$
- 4. Classify according to the linear discriminant functions $\delta_k(x_i^{*})$
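A sketch of steps 2-3 (diagonalize, then sphere), assuming row-wise data; the symbols follow the slide:

```python
import numpy as np

def sphere(X, Sigma):
    """Whiten X using the eigen decomposition Sigma = U D U^T, so the sphered
    data has (approximately) identity covariance."""
    eigvals, U = np.linalg.eigh(Sigma)            # Sigma is symmetric: use eigh
    D_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals))
    return X @ U @ D_inv_sqrt                     # row-wise version of X* = D^{-1/2} U^T X

# After sphering, classification by the linear discriminant functions can use the
# sphered class means with an identity covariance.
```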
Two Extensions to LDiscA
Quadratic Discriminant Analysis (QDA): keep separate covariances per class

$$\delta_k(x_i) = -\tfrac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1}(x_i - \mu_k) + \log p(k) - \tfrac{1}{2}\log|\Sigma_k|$$

Regularized LDiscA: interpolate between the shared covariance estimate (LDiscA) and the class-specific estimate (QDA)

$$\Sigma_k(\alpha) = \alpha \Sigma_k + (1 - \alpha)\Sigma$$
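The interpolation itself is a single line; a sketch assuming the per-class covariances and the pooled covariance have already been estimated:

```python
import numpy as np

def regularized_covariance(Sigma_k, Sigma, alpha):
    """Interpolate between the class-specific (QDA) and shared (LDiscA) estimates.
    alpha = 1 recovers QDA's Sigma_k; alpha = 0 recovers LDiscA's shared Sigma."""
    return alpha * Sigma_k + (1.0 - alpha) * Sigma
```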
Vowel Classification
LDiscA (left) vs. QDA (right)
ESL 4.3
Vowel Classification
LDiscA (left) vs. QDA (right); Regularized LDiscA
ESL 4.3
$\Sigma_k(\alpha) = \alpha \Sigma_k + (1 - \alpha)\Sigma$
LDA for Dimensionality Reduction
Classifying D-dimensional inputs (features) into K-dimensional space (labels)
Can we view the data faithfully (optimally) in smaller dimensions?
Fisher's optimal: spread out the centroids (means)
Fisher's Argument
"Find a linear combination such that the between-class variance is maximized relative to the within-class variance" (ESL, 4.3)
separating the means isn't enough; also consider the covariance
L-Dimensional LDiscA
"Find a linear combination such that the between-class variance is maximized relative to the within-class variance" (ESL, 4.3)

between-class scatter (covariance):
$$B = \sum_k (\mu_k - \bar{\mu})(\mu_k - \bar{\mu})^T$$

$$\max_v \frac{v^T B v}{v^T \Sigma v} \quad\Longleftrightarrow\quad \max_v\; v^T B v \;\text{ s.t. }\; v^T \Sigma v = 1$$

This is a generalized eigenvalue problem; the solution $v_1$ is the first (largest) eigenvector.

Then find the next largest eigenvector:
$$\max_{v_2}\; v_2^T B v_2 \;\text{ s.t. }\; v_2^T \Sigma v_2 = 1,\; v_1^T v_2 = 0$$

and the next largest eigenvector, and so on:
$$\max_{v_3}\; v_3^T B v_3 \;\text{ s.t. }\; v_3^T \Sigma v_3 = 1,\; v_1^T v_2 = 0,\; v_1^T v_3 = 0,\; v_2^T v_3 = 0$$
L-Dimensional LDiscA
- 1. Compute means $\mu_k$, priors, and common covariance $\Sigma$:
$$\Sigma = \frac{1}{N-K}\sum_k \mathrm{scatter}_k = \frac{1}{N-K}\sum_k \sum_{i:\, y_i = k}(x_i - \mu_k)(x_i - \mu_k)^T$$
- 2. Compute the between-class scatter (covariance):
$$B = \sum_k (\mu_k - \bar{\mu})(\mu_k - \bar{\mu})^T$$
- 3. Compute the eigen decomposition of B: $B = V D_B V^T$
- 4. Take the top L eigenvectors from V
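A sketch of these steps for the reduced-rank case, given class means and the pooled covariance (names are illustrative); it solves the generalized eigenvalue problem by whitening with Σ:

```python
import numpy as np

def lda_projection(means, Sigma, L):
    """Return a D x L projection matrix that maximizes between-class scatter
    relative to the shared within-class covariance Sigma."""
    mu_bar = means.mean(axis=0)                 # mean of the class means
    diffs = means - mu_bar
    B = diffs.T @ diffs                         # between-class scatter: sum_k (mu_k - mu_bar)(mu_k - mu_bar)^T
    d, U = np.linalg.eigh(Sigma)                # Sigma = U diag(d) U^T
    W = U @ np.diag(1.0 / np.sqrt(d))           # W W^T = Sigma^{-1}: a whitening map
    evals, evecs = np.linalg.eigh(W.T @ B @ W)  # ordinary eigenproblem in the whitened space
    order = np.argsort(evals)[::-1]             # largest between-class variance first
    return W @ evecs[:, order[:L]]              # columns are the v's; project data with X @ V
```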
Vowel Classification
ESL 4.3
Supervised → Unsupervised
Supervised learning: learning with a teacher. You had training data consisting of (feature, label) pairs, and the goal was to learn a mapping from features to labels.
Unsupervised learning: learning without a teacher. Only features, no labels.
Why is unsupervised learning useful?
Visualization: dimensionality reduction
lower-dimensional features might help learning
Discover hidden structure in the data: clustering
Outline
Linear Algebra/Math Review
Two Methods of Dimensionality Reduction:
Linear Discriminant Analysis (LDA, LDiscA)
Principal Component Analysis (PCA)
Geometric Rationale of LDiscA & PCA
Objective: to rigidly rotate the axes of the D-dimensional space to new positions (principal axes):
ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, ..., and axis D has the lowest variance
covariance among each pair of the principal axes is zero (the principal axes are uncorrelated)
Adapted from Antano Žilinsko
L-Dimensional PCA
- 1. Compute the mean $\mu$ and covariance $\Sigma$:
$$\mu = \frac{1}{N}\sum_i x_i, \qquad \Sigma = \frac{1}{N}\sum_i (x_i - \mu)(x_i - \mu)^T$$
- 2. Center the data (zero mean), optionally scaling each feature to unit variance
- 3. Compute the (top L) eigenvectors of the centered data's covariance via the eigen decomposition $\Sigma = V D V^T$
- 4. Project the data onto the top L eigenvectors
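A compact sketch of these four steps (eigendecomposition-based; the function name and return values are illustrative):

```python
import numpy as np

def pca_project(X, L):
    """Project the rows of X onto the top-L principal components."""
    mu = X.mean(axis=0)                  # step 1: mean
    Xc = X - mu                          # step 2: center the data
    Sigma = Xc.T @ Xc / len(X)           # covariance of the centered data
    evals, V = np.linalg.eigh(Sigma)     # step 3: eigen decomposition, Sigma = V D V^T
    order = np.argsort(evals)[::-1]      # sort axes by decreasing variance
    V_L = V[:, order[:L]]
    return Xc @ V_L, V_L, mu             # step 4: projected data, components, mean
```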
2D Example of PCA
variables X1 and X2 have positive covariance & each has a similar variance
[Figure: scatter plot of Variable X2 vs. Variable X1, with sample means X̄1 = 8.35 and X̄2 = 4.91 marked]
Courtesy Antano Žilinsko
Configuration is Centered
subtract the component-wise mean
[Figure: the same data after centering, plotted as Variable X2 vs. Variable X1]
Courtesy Antano Žilinsko
Compute Principal Components
PC 1 has the highest possible variance (9.88)
PC 2 has a variance of 3.03
PC 1 and PC 2 have zero covariance
[Figure: the centered data with the PC 1 and PC 2 axes overlaid]
Courtesy Antano Žilinsko
[Figure: PC 1 and PC 2 drawn on top of the original Variable X1 / Variable X2 axes]
PC axes are a rigid rotation of the original variables
PC 1 is simultaneously the direction of maximum variance and a least-squares "line of best fit" (squared distances of points away from PC 1 are minimized)
Courtesy Antano Žilinsko
Generalization to p-dimensions
if we take the first k principal components, they define the k-dimensional "hyperplane of best fit" to the point cloud
of the total variance of all p variables: PCs 1 to k represent the maximum possible proportion of that variance that can be displayed in k dimensions
Courtesy Antano Žilinsko
How many axes are needed?
does the (k+1)th principal axis represent more variance than would be expected by chance?
a common "rule of thumb" when PCA is based on correlations is that axes with eigenvalues > 1 are worth interpreting
Courtesy Antano Žilinsko
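A sketch of how both criteria might be checked numerically (this example is not from the slides; it applies the eigenvalue > 1 rule to a correlation-based PCA):

```python
import numpy as np

def explained_variance_report(X):
    """Eigenvalues of the correlation matrix (correlation-based PCA), their cumulative
    share of total variance, and how many pass the 'eigenvalue > 1' rule of thumb."""
    R = np.corrcoef(X, rowvar=False)               # correlation matrix of the variables
    evals = np.sort(np.linalg.eigvalsh(R))[::-1]   # principal-axis variances, largest first
    cumulative = np.cumsum(evals) / evals.sum()
    keep = int(np.sum(evals > 1.0))                # axes "worth interpreting" by the rule
    return evals, cumulative, keep
```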
PCA as Reconstruction Error

$$\min_V \|X - Z V^T\|^2, \qquad Z = X V$$

($X$ is $N \times D$; $V$ is $D \times L$ with orthonormal columns; $Z$ is $N \times L$)

$$\min_V \|X - X V V^T\|^2 = \min_V\; \|X\|^2 - 2\,\mathrm{tr}(V^T X^T X V) + \mathrm{tr}(V^T X^T X V) = \min_V\; \|X\|^2 - \|X V\|^2$$

$\|X\|^2$ does not depend on $V$, so minimizing the reconstruction error is the same as maximizing $\|X V\|^2$, the variance captured by the projection (for centered $X$).

maximizing variance ⟺ minimizing reconstruction error
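A numerical check of this equivalence on made-up data (L = 2 components; the identity asserted at the end is the one derived above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)                          # centered data

evals, V = np.linalg.eigh(X.T @ X / len(X))
V_L = V[:, np.argsort(evals)[::-1][:2]]      # top L = 2 eigenvectors

reconstruction_error = np.linalg.norm(X - X @ V_L @ V_L.T) ** 2
captured = np.linalg.norm(X @ V_L) ** 2

# ||X - X V V^T||^2 = ||X||^2 - ||X V||^2: minimizing one maximizes the other
assert np.isclose(reconstruction_error, np.linalg.norm(X) ** 2 - captured)
```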