
SLIDE 1

Dimensionality Reduction: Linear Discriminant Analysis and Principal Component Analysis

CMSC 678 UMBC

SLIDE 2

Outline

Linear Algebra/Math Review Two Methods of Dimensionality Reduction

Linear Discriminant Analysis (LDA, LDiscA) Principal Component Analysis (PCA)

SLIDE 3

Covariance

covariance: how (linearly) correlated are variables

\sigma_{jk} = \frac{1}{N-1} \sum_{i=1}^{N} (x_{ij} - \mu_j)(x_{ik} - \mu_k)

where x_{ij} is the value of variable j in object i, \mu_j is the mean of variable j, and \sigma_{jk} is the covariance of variables j and k.

SLIDE 4

Covariance

covariance: how (linearly) correlated are variables

\sigma_{jk} = \frac{1}{N-1} \sum_{i=1}^{N} (x_{ij} - \mu_j)(x_{ik} - \mu_k)

\sigma_{jk} = \sigma_{kj} (the covariance matrix is symmetric)

\Sigma = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1D} \\ \vdots & \ddots & \vdots \\ \sigma_{D1} & \cdots & \sigma_{DD} \end{pmatrix}

SLIDE 5

Eigenvalues and Eigenvectors

๐ต๐‘ฆ = ๐œ‡๐‘ฆ

matrix vector scalar

for a given matrix operation (multiplication): what non-zero vector(s) change linearly? (by a single multiplication)

SLIDE 6

Eigenvalues and Eigenvectors

๐ต๐‘ฆ = ๐œ‡๐‘ฆ

matrix vector scalar

๐ต = 1 5 1

SLIDE 7

Eigenvalues and Eigenvectors

๐ต๐‘ฆ = ๐œ‡๐‘ฆ

matrix vector scalar

๐ต = 1 5 1

1 5 1 ๐‘ฆ ๐‘ง = ๐‘ฆ + 5๐‘ง ๐‘ง ๐‘ฆ + 5๐‘ง ๐‘ง = ๐œ‡ ๐‘ฆ ๐‘ง

SLIDE 8

Eigenvalues and Eigenvectors

๐ต๐‘ฆ = ๐œ‡๐‘ฆ

matrix vector scalar

๐ต = 1 5 1

  • nly non-zero vector

to scale 1 5 1 1 0 = 1 1 ๐‘ฆ + 5๐‘ง ๐‘ง = ๐œ‡ ๐‘ฆ ๐‘ง
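We can verify this with a short NumPy sketch (a sanity check, not part of the original slides):

```python
import numpy as np

A = np.array([[1.0, 5.0], [0.0, 1.0]])
eigvals, eigvecs = np.linalg.eig(A)

print(eigvals)   # both eigenvalues are 1 (A is a shear)
print(eigvecs)   # both returned columns are numerically parallel to (1, 0)

v = np.array([1.0, 0.0])
assert np.allclose(A @ v, 1.0 * v)  # A v = lambda v with lambda = 1
```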

SLIDE 9

Outline

Linear Algebra/Math Review Two Methods of Dimensionality Reduction

Linear Discriminant Analysis (LDA, LDiscA) Principal Component Analysis (PCA)

SLIDE 10

Dimensionality Reduction

Original (lightly preprocessed) data

Compressed representation

N instances × D input features → N instances × L reduced features

SLIDE 11

Dimensionality Reduction

clarity of representation vs. ease of understanding

oversimplification: loss of important or relevant information

Courtesy Antano Žilinsko

SLIDE 12

Why "maximize" the variance?

How can we efficiently summarize? We maximize the variance within our summarization; we don't increase the variance in the dataset. How can we capture the most information with the fewest number of axes?

SLIDE 13

Summarizing Redundant Information

(2,1) (2,-1) (-2,-1) (4,2)

SLIDE 14

Summarizing Redundant Information

(2,1) (2,-1) (-2,-1) (4,2)

(2,1) = 2*(1,0) + 1*(0,1)

SLIDE 15

Summarizing Redundant Information

(2,1) (2,-1) (-2,-1) (4,2), expressed in the basis u1 = (2,1), u2 = (2,-1):

(2,1) = 1*u1 + 0*u2
(4,2) = 2*u1 + 0*u2

SLIDE 16

Summarizing Redundant Information

(2,1) (2,-1) (-2,-1) (4,2), expressed in the basis u1 = (2,1), u2 = (2,-1):

(2,1) = 1*u1 + 0*u2
(4,2) = 2*u1 + 0*u2

(Is it the most general? These vectors arenโ€™t orthogonal)

SLIDE 17

Outline

Linear Algebra/Math Review Two Methods of Dimensionality Reduction

Linear Discriminant Analysis (LDA, LDiscA) Principal Component Analysis (PCA)

SLIDE 18

Linear Discriminant Analysis (LDA, LDiscA) and Principal Component Analysis (PCA)

Summarize D-dimensional input data by uncorrelated axes. Uncorrelated axes are also called principal components. Use the first L components to account for as much variance as possible.

SLIDE 19

Geometric Rationale of LDiscA & PCA

Objective: to rigidly rotate the axes of the D-dimensional space to new positions (principal axes):

  • ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, ..., and axis D has the lowest variance
  • the covariance among each pair of the principal axes is zero (the principal axes are uncorrelated)

Courtesy Antano Žilinsko

SLIDE 20

Remember: MAP Classifiers are Optimal for Classification

\min \sum_i \mathbb{E}_{y_i}\left[ \ell_{0/1}(y_i, \hat{y}_i) \right] \;\rightarrow\; \max \sum_i p(\hat{y}_i = y_i \mid x_i)

p(\hat{y}_i = y_i \mid x_i) \propto p(x_i \mid \hat{y}_i)\, p(\hat{y}_i)

(posterior) ∝ (class-conditional likelihood) × (class prior)

x_i \in \mathbb{R}^D

SLIDE 21

Linear Discriminant Analysis

MAP Classifier where:

  • 1. class-conditional likelihoods are Gaussian
  • 2. common covariance among class likelihoods
SLIDE 22

LDiscA: (1) What if likelihoods are Gaussian?

p(\hat{y}_i = y_i \mid x_i) \propto p(x_i \mid \hat{y}_i)\, p(\hat{y}_i)

p(x_i \mid k) = \mathcal{N}(\mu_k, \Sigma_k) = \frac{\exp\left(-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right)}{(2\pi)^{D/2}\, |\Sigma_k|^{1/2}}

https://upload.wikimedia.org/wikipedia/commons/5/57/Multivariate_Gaussian.png

SLIDE 23

LDiscA: (2) Shared Covariance

log ๐‘ž เท ๐‘ง๐‘— = ๐‘™ ๐‘ฆ๐‘— ๐‘ž เท ๐‘ง๐‘— = ๐‘š ๐‘ฆ๐‘— = log ๐‘ž(๐‘ฆ๐‘—|๐‘™) ๐‘ž(๐‘ฆ๐‘—|๐‘š) + log ๐‘ž(๐‘™) ๐‘ž ๐‘š

SLIDE 24

LDiscA: (2) Shared Covariance

log ๐‘ž เท ๐‘ง๐‘— = ๐‘™ ๐‘ฆ๐‘— ๐‘ž เท ๐‘ง๐‘— = ๐‘š ๐‘ฆ๐‘— = log ๐‘ž(๐‘ฆ๐‘—|๐‘™) ๐‘ž(๐‘ฆ๐‘—|๐‘š) + log ๐‘ž(๐‘™) ๐‘ž ๐‘š = log ๐‘ž(๐‘™) ๐‘ž ๐‘š + log exp โˆ’ 1 2 ๐‘ฆ๐‘— โˆ’ ๐œˆ๐‘™ ๐‘ˆฮฃ๐‘™

โˆ’1 ๐‘ฆ๐‘— โˆ’ ๐œˆ๐‘™

2๐œŒ ๐ธ/2 ฮฃ๐‘™ 1/2 exp โˆ’ 1 2 ๐‘ฆ๐‘— โˆ’ ๐œˆ๐‘š ๐‘ˆฮฃ๐‘š

โˆ’1 ๐‘ฆ๐‘— โˆ’ ๐œˆ๐‘š

2๐œŒ ๐ธ/2 ฮฃ๐‘š 1/2

SLIDE 25

LDiscA: (2) Shared Covariance

log ๐‘ž เท ๐‘ง๐‘— = ๐‘™ ๐‘ฆ๐‘— ๐‘ž เท ๐‘ง๐‘— = ๐‘š ๐‘ฆ๐‘— = log ๐‘ž(๐‘ฆ๐‘—|๐‘™) ๐‘ž(๐‘ฆ๐‘—|๐‘š) + log ๐‘ž(๐‘™) ๐‘ž ๐‘š = log ๐‘ž(๐‘™) ๐‘ž ๐‘š + log exp โˆ’ 1 2 ๐‘ฆ๐‘— โˆ’ ๐œˆ๐‘™ ๐‘ˆฮฃโˆ’1 ๐‘ฆ๐‘— โˆ’ ๐œˆ๐‘™ 2๐œŒ ๐ธ/2 ฮฃ๐‘™ 1/2 exp โˆ’ 1 2 ๐‘ฆ๐‘— โˆ’ ๐œˆ๐‘š ๐‘ˆฮฃโˆ’1 ๐‘ฆ๐‘— โˆ’ ๐œˆ๐‘š 2๐œŒ ๐ธ/2 ฮฃ๐‘š 1/2

ฮฃ๐‘š = ฮฃ๐‘™

SLIDE 26

LDiscA: (2) Shared Covariance

log ๐‘ž เท ๐‘ง๐‘— = ๐‘™ ๐‘ฆ๐‘— ๐‘ž เท ๐‘ง๐‘— = ๐‘š ๐‘ฆ๐‘— = log ๐‘ž(๐‘ฆ๐‘—|๐‘™) ๐‘ž(๐‘ฆ๐‘—|๐‘š) + log ๐‘ž(๐‘™) ๐‘ž ๐‘š = log ๐‘ž(๐‘™) ๐‘ž ๐‘š โˆ’ 1 2 ๐œˆ๐‘™ โˆ’ ๐œˆ๐‘š ๐‘ˆฮฃโˆ’1 ๐œˆ๐‘™ โˆ’ ๐œˆ๐‘š + ๐‘ฆ๐‘—

๐‘ˆฮฃโˆ’1(๐œˆ๐‘™ โˆ’ ๐œˆ๐‘š)

linear in xi (check for yourself: why did the quadratic xi terms cancel?)

SLIDE 27

LDiscA: (2) Shared Covariance

log ๐‘ž เท ๐‘ง๐‘— = ๐‘™ ๐‘ฆ๐‘— ๐‘ž เท ๐‘ง๐‘— = ๐‘š ๐‘ฆ๐‘— = log ๐‘ž(๐‘ฆ๐‘—|๐‘™) ๐‘ž(๐‘ฆ๐‘—|๐‘š) + log ๐‘ž(๐‘™) ๐‘ž ๐‘š = log ๐‘ž(๐‘™) ๐‘ž ๐‘š โˆ’ 1 2 ๐œˆ๐‘™ โˆ’ ๐œˆ๐‘š ๐‘ˆฮฃโˆ’1 ๐œˆ๐‘™ โˆ’ ๐œˆ๐‘š + ๐‘ฆ๐‘—

๐‘ˆฮฃโˆ’1(๐œˆ๐‘™ โˆ’ ๐œˆ๐‘š)

linear in xi (check for yourself: why did the quadratic xi terms cancel?)

= ๐‘ฆ๐‘—

๐‘ˆฮฃโˆ’1๐œˆ๐‘™ โˆ’ 1

2 ๐œˆ๐‘™

๐‘ˆฮฃโˆ’1๐œˆ๐‘™ + log ๐‘ž(๐‘™)

+๐‘ฆ๐‘—

๐‘ˆฮฃโˆ’1๐œˆ๐‘š โˆ’ 1

2 ๐œˆ๐‘š

๐‘ˆฮฃโˆ’1๐œˆ๐‘š + log ๐‘ž ๐‘š

rewrite only in terms of xi (data) and single-class terms

SLIDE 28

Classify via Linear Discriminant Functions

๐œ€๐‘™ ๐‘ฆ๐‘— = ๐‘ฆ๐‘—

๐‘ˆฮฃโˆ’1๐œˆ๐‘™ โˆ’ 1

2 ๐œˆ๐‘™

๐‘ˆฮฃโˆ’1๐œˆ๐‘™ + log ๐‘ž(๐‘™)

arg max ๐‘™ ๐œ€๐‘™ ๐‘ฆ๐‘— MAP classifier

equivalent to
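A minimal NumPy sketch of these linear discriminant functions (names like means, priors, and Sigma are illustrative placeholders, assuming the parameters have already been estimated):

```python
import numpy as np

def linear_discriminants(X, means, Sigma, priors):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log p(k).

    X: (N, D) data; means: (K, D); Sigma: (D, D) shared covariance; priors: (K,).
    Returns an (N, K) matrix of discriminant scores.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    lin = X @ Sigma_inv @ means.T                                   # x^T Sigma^{-1} mu_k
    quad = 0.5 * np.einsum('kd,de,ke->k', means, Sigma_inv, means)  # mu_k^T Sigma^{-1} mu_k / 2
    return lin - quad + np.log(priors)

# MAP classification: pick the class with the largest discriminant
# y_hat = np.argmax(linear_discriminants(X, means, Sigma, priors), axis=1)
```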

SLIDE 29

LDiscA

Parameters to learn: p(k) for each class k, \mu_k for each class k, and \Sigma

p(k) \propto N_k, the number of items labeled with class k

SLIDE 30

LDiscA

Parameters to learn: p(k) for each class k, \mu_k for each class k, and \Sigma

p(k) \propto N_k \qquad \mu_k = \frac{1}{N_k} \sum_{i : y_i = k} x_i

SLIDE 31

LDiscA

Parameters to learn: p(k) for each class k, \mu_k for each class k, and \Sigma

p(k) \propto N_k \qquad \mu_k = \frac{1}{N_k} \sum_{i : y_i = k} x_i

\Sigma = \frac{1}{N-K} \sum_k \mathrm{scatter}_k = \frac{1}{N-K} \sum_k \sum_{i : y_i = k} (x_i - \mu_k)(x_i - \mu_k)^T

within-class covariance: one option for \Sigma (see the sketch below)
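Putting the three estimates together, a short sketch (assuming integer class labels; array names are illustrative):

```python
import numpy as np

def fit_ldisca(X, y):
    """Estimate LDiscA parameters: priors p(k), means mu_k, pooled covariance Sigma."""
    N, D = X.shape
    classes = np.unique(y)
    K = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])        # p(k) proportional to N_k
    means = np.array([X[y == k].mean(axis=0) for k in classes])  # mu_k
    Sigma = np.zeros((D, D))
    for idx, k in enumerate(classes):
        diff = X[y == k] - means[idx]
        Sigma += diff.T @ diff                                   # within-class scatter
    Sigma /= (N - K)                                             # pooled covariance
    return priors, means, Sigma
```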
SLIDE 32

Computational Steps for Full-Dimensional LDiscA

  • 1. Compute means, priors, and covariance
SLIDE 33

Computational Steps for Full-Dimensional LDiscA

  • 1. Compute means, priors, and covariance
  • 2. Diagonalize covariance

\Sigma = U D U^T (eigen decomposition)

where U is the D × D orthonormal matrix of eigenvectors and D is the diagonal matrix of eigenvalues

SLIDE 34

Computational Steps for Full-Dimensional LDiscA

  • 1. Compute means, priors, and covariance
  • 2. Diagonalize covariance
  • 3. Sphere the data

\Sigma = U D U^T \qquad X^* = D^{-1/2}\, U^T X

SLIDE 35

Computational Steps for Full-Dimensional LDiscA

  • 1. Compute means, priors, and covariance
  • 2. Diagonalize covariance
  • 3. Sphere the data (get unit covariance)
  • 4. Classify according to linear discriminant

functions ๐œ€๐‘™(๐‘ฆ๐‘—

โˆ—)

ฮฃ = UDUT Xโˆ— = ๐ธ

โˆ’1 2 ๐‘‰๐‘ˆ๐‘Œ

SLIDE 36

Two Extensions to LDiscA

Quadratic Discriminant Analysis (QDA)

Keep separate covariances per class:

\delta_k(x_i) = -\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) + \log p(k) - \frac{1}{2}\log|\Sigma_k|

SLIDE 37

Two Extensions to LDiscA

Quadratic Discriminant Analysis (QDA)

Keep separate covariances per class:

\delta_k(x_i) = -\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) + \log p(k) - \frac{1}{2}\log|\Sigma_k|

Regularized LDiscA: interpolate between the shared covariance estimate (LDiscA) and the class-specific estimate (QDA):

\Sigma_k(\alpha) = \alpha \Sigma_k + (1 - \alpha)\, \Sigma

SLIDE 38

Vowel Classification

LDiscA (left) vs. QDA (right)

ESL 4.3

SLIDE 39

Vowel Classification

LDiscA (left) vs. QDA (right); Regularized LDiscA

ESL 4.3

\Sigma_k(\alpha) = \alpha \Sigma_k + (1 - \alpha)\, \Sigma

SLIDE 40

LDA for Dimensionality Reduction

Classifying D-dimensional inputs (features) into K classes (labels). Can we view the data faithfully (optimally) in smaller dimensions? Fisher's optimum: spread out the centroids (means).

SLIDE 41

Fisher's Argument

"Find a linear combination such that the between-class variance is maximized relative to the within-class variance" (ESL, 4.3)

separating the means isn't enough; also consider the covariance

SLIDE 42

Fisher's Argument

"Find a linear combination such that the between-class variance is maximized relative to the within-class variance" (ESL, 4.3)

separating the means isn't enough; also consider the covariance

SLIDE 43

L-Dimensional LDiscA

B = \sum_k (\mu_k - \mu)(\mu_k - \mu)^T \quad \text{(between-class scatter / covariance)}

\max_v \frac{v^T B v}{v^T \Sigma v}

"Find a linear combination such that the between-class variance is maximized relative to the within-class variance" (ESL, 4.3)

SLIDE 44

L-Dimensional LDiscA

B = \sum_k (\mu_k - \mu)(\mu_k - \mu)^T \quad \text{(between-class scatter / covariance)}

\max_v \frac{v^T B v}{v^T \Sigma v}

"Find a linear combination such that the between-class variance is maximized relative to the within-class variance" (ESL, 4.3)

\max_v v^T B v \quad \text{s.t.} \quad v^T \Sigma v = 1

SLIDE 45

L-Dimensional LDiscA

max ๐‘ฃ๐‘ˆ๐ถ๐‘ฃ ๐‘ฃ๐‘ˆฮฃ๐‘ฃ

โ€œFind a linear combination such that the between-class variance is maximized relative to the within-class varianceโ€ (ESL, 4.3)

max ๐‘ฃ๐‘ˆ๐ถ๐‘ฃ s. t. ๐‘ฃ๐‘ˆฮฃ๐‘ฃ = 1

generalized eigenvalue problem first (largest) eigenvector
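A hedged sketch of solving it with SciPy (scipy.linalg.eigh accepts a second matrix for the generalized problem B v = lambda * Sigma v; B and Sigma are assumed to be the scatter matrices defined above):

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(B, Sigma, L):
    """Top-L generalized eigenvectors of B v = lambda * Sigma v."""
    evals, V = eigh(B, Sigma)          # eigenvalues returned in ascending order
    order = np.argsort(evals)[::-1]    # sort descending by between/within variance ratio
    return V[:, order[:L]]             # columns satisfy v^T Sigma v = 1
```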

SLIDE 46

L-Dimensional LDiscA

"Find a linear combination such that the between-class variance is maximized relative to the within-class variance" (ESL, 4.3)

\max_{v_2} v_2^T B v_2 \quad \text{s.t.} \quad v_2^T \Sigma v_2 = 1, \; v_1^T v_2 = 0

find the next largest eigenvector

SLIDE 47

L-Dimensional LDiscA

"Find a linear combination such that the between-class variance is maximized relative to the within-class variance" (ESL, 4.3)

\max_{v_3} v_3^T B v_3 \quad \text{s.t.} \quad v_3^T \Sigma v_3 = 1, \; v_1^T v_2 = 0, \; v_1^T v_3 = 0, \; v_2^T v_3 = 0

and the next largest eigenvector...

SLIDE 48

L-Dimensional LDiscA

  • 1. Compute means \mu_k, priors, and common covariance \Sigma

\Sigma = \frac{1}{N-K} \sum_k \mathrm{scatter}_k = \frac{1}{N-K} \sum_k \sum_{i : y_i = k} (x_i - \mu_k)(x_i - \mu_k)^T

SLIDE 49

L-Dimensional LDiscA

  • 1. Compute means \mu_k, priors, and common covariance \Sigma
  • 2. Compute the between-class scatter (covariance)

\Sigma = \frac{1}{N-K} \sum_k \sum_{i : y_i = k} (x_i - \mu_k)(x_i - \mu_k)^T \qquad B = \sum_k (\mu_k - \mu)(\mu_k - \mu)^T

SLIDE 50

L-Dimensional LDiscA

  • 1. Compute means \mu_k, priors, and common covariance \Sigma
  • 2. Compute the between-class scatter (covariance)
  • 3. Compute the eigen decomposition of B

\Sigma = \frac{1}{N-K} \sum_k \sum_{i : y_i = k} (x_i - \mu_k)(x_i - \mu_k)^T \qquad B = \sum_k (\mu_k - \mu)(\mu_k - \mu)^T \qquad B = V D_B V^T

SLIDE 51

L-Dimensional LDiscA

  • 1. Compute means \mu_k, priors, and common covariance \Sigma
  • 2. Compute the between-class scatter (covariance)
  • 3. Compute the eigen decomposition of B
  • 4. Take the top L eigenvectors from V (see the sketch after these formulas)

\Sigma = \frac{1}{N-K} \sum_k \sum_{i : y_i = k} (x_i - \mu_k)(x_i - \mu_k)^T \qquad B = \sum_k (\mu_k - \mu)(\mu_k - \mu)^T \qquad B = V D_B V^T

SLIDE 52

Vowel Classification

ESL 4.3

SLIDE 53

Vowel Classification

ESL 4.3

SLIDE 54

Supervised → Unsupervised

Supervised learning: learning with a teacher. You had training data which was (feature, label) pairs, and the goal was to learn a mapping from features to labels.

SLIDE 55

Supervised → Unsupervised

Supervised learning: learning with a teacher. You had training data which was (feature, label) pairs, and the goal was to learn a mapping from features to labels.

Unsupervised learning: learning without a teacher. Only features, no labels.

Why is unsupervised learning useful?

Visualization: dimensionality reduction (lower dimensional features might help learning)

Discover hidden structures in the data: clustering

SLIDE 56

Outline

Linear Algebra/Math Review Two Methods of Dimensionality Reduction

Linear Discriminant Analysis (LDA, LDiscA) Principal Component Analysis (PCA)

SLIDE 57

Geometric Rationale of LDiscA & PCA

Objective: to rigidly rotate the axes of the D-dimensional space to new positions (principal axes):

  • ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, ..., and axis D has the lowest variance
  • the covariance among each pair of the principal axes is zero (the principal axes are uncorrelated)

Adapted from Antano Žilinsko

SLIDE 58

L-Dimensional PCA

  • 1. Compute the mean \mu and covariance \Sigma
  • 2. Sphere the data (zero mean; unit variance per feature)
  • 3. Compute the (top L) eigenvectors of the sphered data's covariance: \Sigma = V D V^T
  • 4. Project the data: X^* = X V_L (see the sketch below)

\mu = \frac{1}{N} \sum_i x_i \qquad \Sigma = \frac{1}{N} \sum_i (x_i - \mu)(x_i - \mu)^T

SLIDE 59

2D Example of PCA

[Figure: scatter plot of Variable X1 vs. Variable X2]

variables X1 and X2 have positive covariance & each has a similar variance

\bar{X}_1 = 8.35 \qquad \bar{X}_2 = 4.91

Courtesy Antano Žilinsko

SLIDE 60
Configuration is Centered

[Figure: the centered scatter plot of Variable X1 vs. Variable X2]

subtract the component-wise mean

Courtesy Antano Žilinsko

SLIDE 61
Compute Principal Components

[Figure: centered data with new axes PC 1 and PC 2]

PC 1 has the highest possible variance (9.88). PC 2 has a variance of 3.03. PC 1 and PC 2 have zero covariance.

Courtesy Antano Žilinsko

SLIDE 62

Compute Principal Components

PC 1 has the highest possible variance (9.88). PC 2 has a variance of 3.03. PC 1 and PC 2 have zero covariance.

[Figure: the same data shown with both the original axes (Variable X1, Variable X2) and the principal axes (PC 1, PC 2)]

Courtesy Antano Žilinsko

SLIDE 63
[Figure: PC 1 and PC 2 overlaid on the Variable X1 vs. Variable X2 scatter plot]

PC axes are a rigid rotation of the original variables. PC 1 is simultaneously the direction of maximum variance and a least-squares "line of best fit" (squared distances of points away from PC 1 are minimized).

Courtesy Antano Žilinsko

SLIDE 64

Generalization to p-dimensions

if we take the first k principal components, they define the k-dimensional "hyperplane of best fit" to the point cloud

of the total variance of all p variables: PCs 1 to k represent the maximum possible proportion of that variance that can be displayed in k dimensions

Courtesy Antano Žilinsko

SLIDE 65

How many axes are needed?

Does the (k+1)th principal axis represent more variance than would be expected by chance? A common "rule of thumb" when PCA is based on correlations is that axes with eigenvalues > 1 are worth interpreting.

Courtesy Antano Žilinsko
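That rule of thumb (often called the Kaiser criterion; the helper below is a hypothetical illustration, not from the slides) is easy to apply. A sketch:

```python
import numpy as np

def axes_worth_interpreting(X):
    """Count principal axes of the correlation matrix with eigenvalue > 1."""
    R = np.corrcoef(X, rowvar=False)   # PCA "based on correlations"
    evals = np.linalg.eigvalsh(R)
    return int(np.sum(evals > 1.0))
```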

SLIDE 66

PCA as Reconstruction Error

\min_V \| X - a V^T \|^2 \quad \text{with } a = X V

(X is N × D; V is D × L)

SLIDE 67

PCA as Reconstruction Error

\min_V \| X - a V^T \|^2 \quad \text{with } a = X V

= \min_V \| X - X V V^T \|^2

(X is N × D; V is D × L)

SLIDE 68

PCA as Reconstruction Error

\min_V \| X - a V^T \|^2 \quad \text{with } a = X V

= \min_V \| X - X V V^T \|^2 = \min_V \left( \|X\|^2 - 2\,\mathrm{tr}(V^T X^T X V) + \mathrm{tr}(V V^T X^T X V V^T) \right)

(X is N × D; V is D × L)

SLIDE 69

PCA as Reconstruction Error

\min_V \| X - a V^T \|^2 \quad \text{with } a = X V

= \min_V \| X - X V V^T \|^2 = \min_V \left( \|X\|^2 - 2\,\mathrm{tr}(V^T X^T X V) + \mathrm{tr}(V V^T X^T X V V^T) \right)

= \min_V \left( \|X\|^2 - \| X V \|^2 \right) \quad \text{(using } V^T V = I \text{)}

Since \|X\|^2 is constant in V, minimizing reconstruction error is the same as maximizing \|XV\|^2, the variance captured by the projection:

maximizing variance ↔ minimizing reconstruction error (X is N × D; V is D × L)
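A numerical sanity check of this equivalence (a sketch; assumes centered data so that ||XV||^2 measures projected variance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)                          # center

# top-L principal axes maximize ||XV||^2
evals, V = np.linalg.eigh(X.T @ X)
V_L = V[:, ::-1][:, :2]                      # L = 2 leading eigenvectors

recon_err = np.linalg.norm(X - X @ V_L @ V_L.T) ** 2
identity = np.linalg.norm(X) ** 2 - np.linalg.norm(X @ V_L) ** 2
assert np.isclose(recon_err, identity)       # ||X - XVV^T||^2 = ||X||^2 - ||XV||^2
```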

SLIDE 70

Slides Credit

https://www.mii.lt/zilinskas/uploads/visualization/lectures/lect4/lect4_pca/PCA1.ppt