Dimensionality Reduction: Linear Discriminant Analysis and Principal Component Analysis
CMSC 678, UMBC
Outline
Linear Algebra/Math Review
Two Methods of Dimensionality Reduction:
Linear Discriminant Analysis (LDA, LDiscA)
Principal Component Analysis (PCA)
Covariance
covariance: how (linearly) correlated are variables

$$\sigma_{ij} = \frac{1}{N-1} \sum_{k=1}^{N} (x_{ki} - \mu_i)(x_{kj} - \mu_j)$$

$x_{ki}$ is the value of variable $i$ in object $k$, $\mu_i$ is the mean of variable $i$ (likewise for $j$), and $\sigma_{ij}$ is the covariance of variables $i$ and $j$.

Covariance is symmetric, $\sigma_{ij} = \sigma_{ji}$, and collecting every pair gives the covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1D} \\ \vdots & \ddots & \vdots \\ \sigma_{D1} & \cdots & \sigma_{DD} \end{pmatrix}$$
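A quick numpy sketch of this formula (the toy data matrix below is made up for illustration):

```python
import numpy as np

# Toy data (made up for illustration): N = 4 objects, D = 2 variables
X = np.array([[ 2.0,  1.0],
              [ 2.0, -1.0],
              [-2.0, -1.0],
              [ 4.0,  2.0]])

mu = X.mean(axis=0)                           # mean of each variable
centered = X - mu
Sigma = centered.T @ centered / (len(X) - 1)  # D x D covariance matrix

# np.cov with rowvar=False uses the same (N - 1) denominator
assert np.allclose(Sigma, np.cov(X, rowvar=False))
print(Sigma)
```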
Eigenvalues and Eigenvectors

$$A v = \lambda v \qquad \text{(matrix } A\text{, vector } v\text{, scalar } \lambda\text{)}$$

for a given matrix $A$ (acting by multiplication): which non-zero vectors are only scaled, by a single multiplication?

Example: $A = \begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix}$

$$\begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x + 5y \\ y \end{pmatrix}, \qquad \text{so we need } \begin{pmatrix} x + 5y \\ y \end{pmatrix} = \lambda \begin{pmatrix} x \\ y \end{pmatrix}$$

The only non-zero vector (up to scale) that satisfies this is $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$, with $\lambda = 1$:

$$\begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = 1 \cdot \begin{pmatrix} 1 \\ 0 \end{pmatrix}$$
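A small numpy check of the example, assuming the matrix on the slide is the 2x2 shear [[1, 5], [0, 1]]:

```python
import numpy as np

A = np.array([[1.0, 5.0],
              [0.0, 1.0]])

vals, vecs = np.linalg.eig(A)   # columns of vecs are eigenvectors
print(vals)                     # [1. 1.] -- the eigenvalue 1, repeated
print(vecs)                     # both columns lie (numerically) along (1, 0): one eigen-direction

v = np.array([1.0, 0.0])
assert np.allclose(A @ v, 1.0 * v)   # A v = 1 * v, as on the slide
```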
Outline
Linear Algebra/Math Review
Two Methods of Dimensionality Reduction:
Linear Discriminant Analysis (LDA, LDiscA)
Principal Component Analysis (PCA)
Dimensionality Reduction
Original (lightly preprocessed) data → compressed representation
[Figure: an N × D data matrix is mapped to an N × L matrix: N instances, D input features, L reduced features]
Dimensionality Reduction
clarity of representation vs. ease of understanding
oversimplification: loss of important or relevant information
Courtesy Antano Žilinsko
Why "maximize" the variance?
How can we efficiently summarize the data?
We maximize the variance within our summarization
We don't increase the variance in the dataset
How can we capture the most information with the fewest axes?
Summarizing Redundant Information
Four points: (2,1), (2,-1), (-2,-1), (4,2)
In the standard basis: (2,1) = 2*(1,0) + 1*(0,1)
Instead, take u1 = (2,1) and u2 = (2,-1) as the basis:
(2,1) = 1*u1 + 0*u2
(4,2) = 2*u1 + 0*u2
(-2,-1) = -1*u1 + 0*u2
(Is this the most general choice? These vectors aren't orthogonal.)
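A short sketch of the change of basis above; the points and the basis u1, u2 come from the slide, and numpy's linear solver recovers each point's coefficients:

```python
import numpy as np

# Points from the slide, one per row
points = np.array([[ 2.0,  1.0],
                   [ 2.0, -1.0],
                   [-2.0, -1.0],
                   [ 4.0,  2.0]])

# New (non-orthogonal) basis vectors u1 = (2,1), u2 = (2,-1), as columns
U = np.array([[2.0,  2.0],
              [1.0, -1.0]])

# Coordinates of each point in the u1/u2 basis: solve U @ coeffs = point
coeffs = np.linalg.solve(U, points.T).T
print(coeffs)
# [[ 1.  0.]    (2,1)   =  1*u1 + 0*u2
#  [ 0.  1.]    (2,-1)  =  0*u1 + 1*u2
#  [-1.  0.]    (-2,-1) = -1*u1 + 0*u2
#  [ 2.  0.]]   (4,2)   =  2*u1 + 0*u2
```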
Outline
Linear Algebra/Math Review
Two Methods of Dimensionality Reduction:
Linear Discriminant Analysis (LDA, LDiscA)
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA, LDiscA) and Principal Component Analysis (PCA)
Summarize D-dimensional input data by uncorrelated axes
Uncorrelated axes are also called principal components
Use the first L components to account for as much variance as possible
Geometric Rationale of LDiscA & PCA
Objective: to rigidly rotate the axes of the D-dimensional space to new positions (principal axes):
ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, ..., and axis D has the lowest variance
covariance among each pair of the principal axes is zero (the principal axes are uncorrelated)
Courtesy Antano Žilinsko
Remember: MAP Classifiers are Optimal for Classification
$$\min_f \sum_i \mathbb{E}_{\hat{y}_i}\!\left[\ell_{0/1}(y_i, \hat{y}_i)\right] \;\Longrightarrow\; \max_f \sum_i p(\hat{y}_i = y_i \mid x_i)$$

$$p(\hat{y}_i = y_i \mid x_i) \;\propto\; p(x_i \mid \hat{y}_i)\, p(\hat{y}_i)$$

posterior $\propto$ class-conditional likelihood $\times$ class prior, with $x_i \in \mathbb{R}^D$
Linear Discriminant Analysis
MAP Classifier where:
- 1. class-conditional likelihoods are Gaussian
- 2. a common covariance is shared across the class likelihoods
LDiscA: (1) What if likelihoods are Gaussian?

$$p(\hat{y}_i = y_i \mid x_i) \propto p(x_i \mid \hat{y}_i)\, p(\hat{y}_i), \qquad p(x_i \mid k) = \mathcal{N}(\mu_k, \Sigma_k) = \frac{\exp\!\left(-\tfrac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right)}{(2\pi)^{D/2}\, |\Sigma_k|^{1/2}}$$
https://upload.wikimedia.org/wikipedia/commons/5/57/Multivariate_Gaussian.png
LDiscA: (2) Shared Covariance

$$\log\frac{p(\hat{y}_i = k \mid x_i)}{p(\hat{y}_i = l \mid x_i)} = \log\frac{p(x_i \mid k)}{p(x_i \mid l)} + \log\frac{p(k)}{p(l)}$$

$$= \log\frac{p(k)}{p(l)} + \log\frac{\exp\!\left(-\tfrac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1}(x_i - \mu_k)\right)\big/\left((2\pi)^{D/2}|\Sigma_k|^{1/2}\right)}{\exp\!\left(-\tfrac{1}{2}(x_i - \mu_l)^T \Sigma_l^{-1}(x_i - \mu_l)\right)\big/\left((2\pi)^{D/2}|\Sigma_l|^{1/2}\right)}$$

With a shared covariance, $\Sigma_k = \Sigma_l = \Sigma$:

$$= \log\frac{p(k)}{p(l)} - \tfrac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1}(\mu_k - \mu_l) + x_i^T \Sigma^{-1}(\mu_k - \mu_l)$$

linear in $x_i$ (check for yourself: why did the quadratic $x_i$ terms cancel?)

rewrite only in terms of $x_i$ (data) and single-class terms:

$$= \left[x_i^T \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log p(k)\right] - \left[x_i^T \Sigma^{-1}\mu_l - \tfrac{1}{2}\mu_l^T \Sigma^{-1}\mu_l + \log p(l)\right]$$
Classify via Linear Discriminant Functions

$$\delta_k(x_i) = x_i^T \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log p(k)$$

$\arg\max_k \delta_k(x_i)$ is equivalent to the MAP classifier
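A hedged sketch of these discriminant functions in numpy (the function names and array shapes are illustrative, not from the slides; the parameters come from the estimation step on the next slides):

```python
import numpy as np

def lda_discriminants(X, means, Sigma, priors):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log p(k).
    X: (N, D), means: (K, D), Sigma: (D, D), priors: (K,). Returns (N, K)."""
    Sigma_inv = np.linalg.inv(Sigma)
    lin = X @ Sigma_inv @ means.T                                   # x^T Sigma^{-1} mu_k
    quad = 0.5 * np.einsum('kd,de,ke->k', means, Sigma_inv, means)  # 0.5 mu_k^T Sigma^{-1} mu_k
    return lin - quad + np.log(priors)

def lda_classify(X, means, Sigma, priors):
    # MAP classification: pick the class with the largest discriminant
    return np.argmax(lda_discriminants(X, means, Sigma, priors), axis=1)
```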
LDiscA
Parameters to learn: $p(k)$, $\mu_k$, $\Sigma$

$$p(k) \propto N_k \qquad \text{(number of items labeled with class } k\text{)}$$

$$\mu_k = \frac{1}{N_k}\sum_{i:\, y_i = k} x_i$$

$$\Sigma = \frac{1}{N-K}\sum_k \mathrm{scatter}_k = \frac{1}{N-K}\sum_k \sum_{i:\, y_i = k} (x_i - \mu_k)(x_i - \mu_k)^T$$

within-class covariance (one option for estimating $\Sigma$)
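A minimal estimation sketch for the three parameters, assuming `X` is an (N, D) data matrix and `y` an integer label vector (the helper name is illustrative):

```python
import numpy as np

def fit_lda_params(X, y):
    """Estimate priors p(k), class means mu_k, and the pooled within-class covariance Sigma."""
    classes = np.unique(y)
    N, D = X.shape
    K = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])        # p(k) proportional to N_k
    means = np.array([X[y == k].mean(axis=0) for k in classes])  # mu_k
    Sigma = np.zeros((D, D))
    for k, mu_k in zip(classes, means):
        diff = X[y == k] - mu_k
        Sigma += diff.T @ diff                                   # per-class scatter
    Sigma /= (N - K)                                             # pooled over N - K
    return priors, means, Sigma
```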
Computational Steps for Full-Dimensional LDiscA
- 1. Compute means, priors, and covariance
- 2. Diagonalize the covariance with the eigen decomposition $\Sigma = U D U^T$ ($D$ a diagonal matrix of eigenvalues, $U$ an orthonormal matrix of eigenvectors)
- 3. Sphere the data (get unit covariance): $X^{*} = D^{-1/2} U^T X$
- 4. Classify according to the linear discriminant functions $\delta_k(x_i^{*})$
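A sketch of steps 2-3 (diagonalize, then sphere), assuming row-wise data; the symbols follow the slide:

```python
import numpy as np

def sphere(X, Sigma):
    """Whiten X using the eigen decomposition Sigma = U D U^T, so the sphered
    data has (approximately) identity covariance."""
    eigvals, U = np.linalg.eigh(Sigma)            # Sigma is symmetric: use eigh
    D_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals))
    return X @ U @ D_inv_sqrt                     # row-wise version of X* = D^{-1/2} U^T X

# After sphering, classification by the linear discriminant functions can use the
# sphered class means with an identity covariance.
```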
Two Extensions to LDiscA
Quadratic Discriminant Analysis (QDA): keep separate covariances per class

$$\delta_k(x_i) = -\tfrac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1}(x_i - \mu_k) + \log p(k) - \tfrac{1}{2}\log|\Sigma_k|$$

Regularized LDiscA: interpolate between the shared covariance estimate (LDiscA) and the class-specific estimate (QDA)

$$\Sigma_k(\alpha) = \alpha \Sigma_k + (1 - \alpha)\Sigma$$
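The interpolation itself is a single line; a sketch assuming the per-class covariances and the pooled covariance have already been estimated:

```python
import numpy as np

def regularized_covariance(Sigma_k, Sigma, alpha):
    """Interpolate between the class-specific (QDA) and shared (LDiscA) estimates.
    alpha = 1 recovers QDA's Sigma_k; alpha = 0 recovers LDiscA's shared Sigma."""
    return alpha * Sigma_k + (1.0 - alpha) * Sigma
```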
Vowel Classification
LDiscA (left) vs. QDA (right)
ESL 4.3
Vowel Classification
LDiscA (left) vs. QDA (right); Regularized LDiscA
ESL 4.3
$\Sigma_k(\alpha) = \alpha \Sigma_k + (1 - \alpha)\Sigma$
LDA for Dimensionality Reduction
Classifying D-dimensional inputs (features) into K-dimensional space (labels)
Can we view the data faithfully (optimally) in smaller dimensions?
Fisher's optimal: spread out the centroids (means)
Fisher's Argument
"Find a linear combination such that the between-class variance is maximized relative to the within-class variance" (ESL, 4.3)
separating the means isn't enough; also consider the covariance
L-Dimensional LDiscA
"Find a linear combination such that the between-class variance is maximized relative to the within-class variance" (ESL, 4.3)

between-class scatter (covariance):
$$B = \sum_k (\mu_k - \bar{\mu})(\mu_k - \bar{\mu})^T$$

$$\max_v \frac{v^T B v}{v^T \Sigma v} \quad\Longleftrightarrow\quad \max_v\; v^T B v \;\text{ s.t. }\; v^T \Sigma v = 1$$

This is a generalized eigenvalue problem; the solution $v_1$ is the first (largest) eigenvector.

Then find the next largest eigenvector:
$$\max_{v_2}\; v_2^T B v_2 \;\text{ s.t. }\; v_2^T \Sigma v_2 = 1,\; v_1^T v_2 = 0$$

and the next largest eigenvector, and so on:
$$\max_{v_3}\; v_3^T B v_3 \;\text{ s.t. }\; v_3^T \Sigma v_3 = 1,\; v_1^T v_2 = 0,\; v_1^T v_3 = 0,\; v_2^T v_3 = 0$$
L-Dimensional LDiscA
- 1. Compute means $\mu_k$, priors, and common covariance $\Sigma$:
$$\Sigma = \frac{1}{N-K}\sum_k \mathrm{scatter}_k = \frac{1}{N-K}\sum_k \sum_{i:\, y_i = k}(x_i - \mu_k)(x_i - \mu_k)^T$$
- 2. Compute the between-class scatter (covariance):
$$B = \sum_k (\mu_k - \bar{\mu})(\mu_k - \bar{\mu})^T$$
- 3. Compute the eigen decomposition of B: $B = V D_B V^T$
- 4. Take the top L eigenvectors from V
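A sketch of these steps for the reduced-rank case, given class means and the pooled covariance (names are illustrative); it solves the generalized eigenvalue problem by whitening with Σ:

```python
import numpy as np

def lda_projection(means, Sigma, L):
    """Return a D x L projection matrix that maximizes between-class scatter
    relative to the shared within-class covariance Sigma."""
    mu_bar = means.mean(axis=0)                 # mean of the class means
    diffs = means - mu_bar
    B = diffs.T @ diffs                         # between-class scatter: sum_k (mu_k - mu_bar)(mu_k - mu_bar)^T
    d, U = np.linalg.eigh(Sigma)                # Sigma = U diag(d) U^T
    W = U @ np.diag(1.0 / np.sqrt(d))           # W W^T = Sigma^{-1}: a whitening map
    evals, evecs = np.linalg.eigh(W.T @ B @ W)  # ordinary eigenproblem in the whitened space
    order = np.argsort(evals)[::-1]             # largest between-class variance first
    return W @ evecs[:, order[:L]]              # columns are the v's; project data with X @ V
```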
Vowel Classification
ESL 4.3
Supervised → Unsupervised
Supervised learning: learning with a teacher. You had training data consisting of (feature, label) pairs, and the goal was to learn a mapping from features to labels.
Unsupervised learning: learning without a teacher. Only features, no labels.
Why is unsupervised learning useful?
Visualization: dimensionality reduction
lower-dimensional features might help learning
Discover hidden structure in the data: clustering
Outline
Linear Algebra/Math Review
Two Methods of Dimensionality Reduction:
Linear Discriminant Analysis (LDA, LDiscA)
Principal Component Analysis (PCA)
Geometric Rationale of LDiscA & PCA
Objective: to rigidly rotate the axes of the D-dimensional space to new positions (principal axes):
ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, ..., and axis D has the lowest variance
covariance among each pair of the principal axes is zero (the principal axes are uncorrelated)
Adapted from Antano Žilinsko
L-Dimensional PCA
- 1. Compute the mean $\mu$ and covariance $\Sigma$:
$$\mu = \frac{1}{N}\sum_i x_i, \qquad \Sigma = \frac{1}{N}\sum_i (x_i - \mu)(x_i - \mu)^T$$
- 2. Center the data (zero mean), optionally scaling each feature to unit variance
- 3. Compute the (top L) eigenvectors of the centered data's covariance via the eigen decomposition $\Sigma = V D V^T$
- 4. Project the data onto the top L eigenvectors
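A compact sketch of these four steps (eigendecomposition-based; the function name and return values are illustrative):

```python
import numpy as np

def pca_project(X, L):
    """Project the rows of X onto the top-L principal components."""
    mu = X.mean(axis=0)                  # step 1: mean
    Xc = X - mu                          # step 2: center the data
    Sigma = Xc.T @ Xc / len(X)           # covariance of the centered data
    evals, V = np.linalg.eigh(Sigma)     # step 3: eigen decomposition, Sigma = V D V^T
    order = np.argsort(evals)[::-1]      # sort axes by decreasing variance
    V_L = V[:, order[:L]]
    return Xc @ V_L, V_L, mu             # step 4: projected data, components, mean
```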
2D Example of PCA
variables X1 and X2 have positive covariance & each has a similar variance
[Figure: scatter plot of Variable X2 vs. Variable X1, with sample means X̄1 = 8.35 and X̄2 = 4.91 marked]
Courtesy Antano Žilinsko
Configuration is Centered
subtract the component-wise mean
[Figure: the same data after centering, plotted as Variable X2 vs. Variable X1]
Courtesy Antano Žilinsko
Compute Principal Components
PC 1 has the highest possible variance (9.88)
PC 2 has a variance of 3.03
PC 1 and PC 2 have zero covariance
[Figure: the centered data with the PC 1 and PC 2 axes overlaid]
Courtesy Antano Žilinsko
[Figure: PC 1 and PC 2 drawn on top of the original Variable X1 / Variable X2 axes]
PC axes are a rigid rotation of the original variables
PC 1 is simultaneously the direction of maximum variance and a least-squares "line of best fit" (squared distances of points away from PC 1 are minimized)
Courtesy Antano Žilinsko
Generalization to p-dimensions
if we take the first k principal components, they define the k-dimensional "hyperplane of best fit" to the point cloud
of the total variance of all p variables: PCs 1 to k represent the maximum possible proportion of that variance that can be displayed in k dimensions
Courtesy Antano Žilinsko
How many axes are needed?
does the (k+1)th principal axis represent more variance than would be expected by chance?
a common "rule of thumb" when PCA is based on correlations is that axes with eigenvalues > 1 are worth interpreting
Courtesy Antano Žilinsko
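A sketch of how both criteria might be checked numerically (this example is not from the slides; it applies the eigenvalue > 1 rule to a correlation-based PCA):

```python
import numpy as np

def explained_variance_report(X):
    """Eigenvalues of the correlation matrix (correlation-based PCA), their cumulative
    share of total variance, and how many pass the 'eigenvalue > 1' rule of thumb."""
    R = np.corrcoef(X, rowvar=False)               # correlation matrix of the variables
    evals = np.sort(np.linalg.eigvalsh(R))[::-1]   # principal-axis variances, largest first
    cumulative = np.cumsum(evals) / evals.sum()
    keep = int(np.sum(evals > 1.0))                # axes "worth interpreting" by the rule
    return evals, cumulative, keep
```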
PCA as Reconstruction Error

$$\min_V \|X - Z V^T\|^2, \qquad Z = X V$$

($X$ is $N \times D$; $V$ is $D \times L$ with orthonormal columns; $Z$ is $N \times L$)

$$\min_V \|X - X V V^T\|^2 = \min_V\; \|X\|^2 - 2\,\mathrm{tr}(V^T X^T X V) + \mathrm{tr}(V^T X^T X V) = \min_V\; \|X\|^2 - \|X V\|^2$$

$\|X\|^2$ does not depend on $V$, so minimizing the reconstruction error is the same as maximizing $\|X V\|^2$, the variance captured by the projection (for centered $X$).

maximizing variance ⟺ minimizing reconstruction error
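A numerical check of this equivalence on made-up data (L = 2 components; the identity asserted at the end is the one derived above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)                          # centered data

evals, V = np.linalg.eigh(X.T @ X / len(X))
V_L = V[:, np.argsort(evals)[::-1][:2]]      # top L = 2 eigenvectors

reconstruction_error = np.linalg.norm(X - X @ V_L @ V_L.T) ** 2
captured = np.linalg.norm(X @ V_L) ** 2

# ||X - X V V^T||^2 = ||X||^2 - ||X V||^2: minimizing one maximizes the other
assert np.isclose(reconstruction_error, np.linalg.norm(X) ** 2 - captured)
```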