SLIDE 1

Dimensionality Reduction

Jia-Bin Huang Virginia Tech

Spring 2019

ECE-5424G / CS-5824

SLIDE 2

Administrative

  • HW 3 due March 27.
  • HW 4 out tonight
SLIDE 3
  • J. Mark Sowers Distinguished Lecture
  • Michael Jordan, Pehong Chen Distinguished Professor, Department of Statistics and Electrical Engineering and Computer Sciences, University of California, Berkeley
  • 3/28/19, 7:30 PM, McBryde 100
SLIDE 4

ECE Faculty Candidate Talk

  • Siheng Chen
  • Ph.D. Carnegie Mellon University
  • Data science with graphs: From social network analysis to autonomous driving

  • Time: 10:00 AM - 11:00 AM March 28
  • Location: 457B Whittemore
SLIDE 5

Expectation Maximization (EM) Algorithm

  • Goal: Find $\theta$ that maximizes the log-likelihood $\sum_i \log p(x^{(i)}; \theta)$

$$\sum_i \log p(x^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)})\, \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \;\geq\; \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$

Jensen's inequality: $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$ for concave $f$ (here $f = \log$)
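As a quick illustrative check of the direction of the bound: let $X$ take values 1 and 4 with equal probability. Then

$$\log \mathbb{E}[X] = \log 2.5 \approx 0.916 \;\geq\; \mathbb{E}[\log X] = \tfrac{1}{2}(\log 1 + \log 4) \approx 0.693,$$

as Jensen's inequality guarantees for the concave $\log$.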

SLIDE 6

Expectation Maximization (EM) Algorithm

  • Goal: Find $\theta$ that maximizes the log-likelihood $\sum_i \log p(x^{(i)}; \theta)$

$$\sum_i \log p(x^{(i)}; \theta) \;\geq\; \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$

  • The lower bound holds for every possible choice of the distributions $Q_i$
  • We want a tight lower bound: $f(\mathbb{E}[X]) = \mathbb{E}[f(X)]$
  • When will that happen? When $X = \mathbb{E}[X]$ with probability 1 ($X$ is a constant), i.e.,

$$\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$$

SLIDE 7

How should we choose $Q_i(z^{(i)})$?

  • $\dfrac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$
  • $Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta)$
  • $\sum_{z} Q_i(z^{(i)}) = 1$ (because it is a distribution)
  • $Q_i(z^{(i)}) = \dfrac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \dfrac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$

SLIDE 8

EM algorithm

Repeat until convergence {
    (E-step) For each $i$, set $Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$ (probabilistic inference)
    (M-step) Set $\theta := \operatorname{argmax}_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \dfrac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
}

SLIDE 9

Expectation Maximization (EM) Algorithm

Goal: $\hat{\theta} = \operatorname{argmax}_\theta \log\left(\sum_z p(x, z \mid \theta)\right)$

The log of a sum is intractable.

Jensen's inequality: $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$ for concave functions $f(x)$ (so we maximize the lower bound!)

See here for a proof: www.stanford.edu/class/cs229/notes/cs229-notes8.ps

SLIDE 10

Expectation Maximization (EM) Algorithm

Goal: $\hat{\theta} = \operatorname{argmax}_\theta \log\left(\sum_z p(x, z \mid \theta)\right)$

  • 1. E-step: compute

$$\mathbb{E}_{p(z \mid x, \theta^{(t)})}\left[\log p(x, z \mid \theta)\right] = \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$$

  • 2. M-step: solve

$$\theta^{(t+1)} = \operatorname{argmax}_\theta \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$$

SLIDE 11
Goal: $\hat{\theta} = \operatorname{argmax}_\theta \log\left(\sum_z p(x, z \mid \theta)\right)$

  • 1. E-step: compute

$$\mathbb{E}_{p(z \mid x, \theta^{(t)})}\left[\log p(x, z \mid \theta)\right] = \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$$

  • 2. M-step: solve

$$\theta^{(t+1)} = \operatorname{argmax}_\theta \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$$

Jensen's inequality, $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$: the log of the expectation of $p(x \mid z)$ is at least the expectation of the log of $p(x \mid z)$.

SLIDE 12

EM for Mixture of Gaussians - derivation

$$p(x_n \mid \mu, \sigma, \pi) = \sum_m p(x_n \mid z_n = m, \mu, \sigma)\, \pi_m = \sum_m \frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\left(-\frac{(x_n - \mu_m)^2}{2\sigma_m^2}\right) \pi_m$$

1. E-step: compute

$$\mathbb{E}_{p(z \mid x, \theta^{(t)})}\left[\log p(x, z \mid \theta)\right] = \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$$

2. M-step: solve

$$\theta^{(t+1)} = \operatorname{argmax}_\theta \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$$

SLIDE 13

EM for Mixture of Gaussians

$$p(x_n \mid \mu, \sigma, \pi) = \sum_m \frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\left(-\frac{(x_n - \mu_m)^2}{2\sigma_m^2}\right) \pi_m$$

1. E-step: compute the responsibilities

$$\alpha_{nm} = p(z_n = m \mid x_n, \mu^{(t)}, \sigma^{2(t)}, \pi^{(t)})$$

2. M-step: update the parameters

$$\hat{\mu}_m^{(t+1)} = \frac{\sum_n \alpha_{nm}\, x_n}{\sum_n \alpha_{nm}}, \qquad \hat{\sigma}_m^{2(t+1)} = \frac{\sum_n \alpha_{nm} \left(x_n - \hat{\mu}_m^{(t+1)}\right)^2}{\sum_n \alpha_{nm}}, \qquad \hat{\pi}_m^{(t+1)} = \frac{1}{N} \sum_n \alpha_{nm}$$
SLIDE 14

EM algorithm - derivation

http://lasa.epfl.ch/teaching/lectures/ML_Phd/Notes/GP-GMM.pdf

SLIDE 15

EM algorithm โ€“ E-Step

SLIDE 16

EM algorithm โ€“ E-Step

SLIDE 17

EM algorithm โ€“ M-Step

SLIDE 18

EM algorithm โ€“ M-Step

Take the derivative with respect to $\mu_m$
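A sketch of the step this slide carries out: differentiate the weighted log-likelihood and set it to zero:

$$\frac{\partial}{\partial \mu_m} \sum_n \alpha_{nm}\left[-\frac{(x_n - \mu_m)^2}{2\sigma_m^2}\right] = \sum_n \alpha_{nm}\,\frac{x_n - \mu_m}{\sigma_m^2} = 0 \;\Rightarrow\; \hat{\mu}_m = \frac{\sum_n \alpha_{nm}\, x_n}{\sum_n \alpha_{nm}}$$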

SLIDE 19

EM algorithm โ€“ M-Step

Take the derivative with respect to $\sigma_m^{-1}$
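Similarly, a sketch of this step, writing the Gaussian log-density in terms of $\sigma_m^{-1}$:

$$\frac{\partial}{\partial \sigma_m^{-1}} \sum_n \alpha_{nm}\left[\log \sigma_m^{-1} - \frac{(x_n - \mu_m)^2}{2\sigma_m^2}\right] = \sum_n \alpha_{nm}\left[\sigma_m - \sigma_m^{-1}(x_n - \mu_m)^2\right] = 0 \;\Rightarrow\; \hat{\sigma}_m^2 = \frac{\sum_n \alpha_{nm}(x_n - \mu_m)^2}{\sum_n \alpha_{nm}}$$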

SLIDE 20

EM Algorithm for GMM

SLIDE 21

EM Algorithm

  • Maximizes a lower bound on the data likelihood at each iteration
  • Each step increases the data likelihood
  • Converges to a local maximum
  • Common tricks in the derivation:
  • Find terms that sum or integrate to 1
  • Use a Lagrange multiplier to deal with constraints (see the sketch after this list)
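As a sketch of the Lagrange-multiplier trick applied to the mixing weights (using the constraint $\sum_m \pi_m = 1$ and the responsibilities $\alpha_{nm}$ from the earlier slides):

$$\mathcal{L} = \sum_n \sum_m \alpha_{nm} \log \pi_m + \lambda\left(1 - \sum_m \pi_m\right), \qquad \frac{\partial \mathcal{L}}{\partial \pi_m} = \frac{\sum_n \alpha_{nm}}{\pi_m} - \lambda = 0 \;\Rightarrow\; \pi_m = \frac{\sum_n \alpha_{nm}}{\lambda}$$

Since $\sum_m \pi_m = 1$ and $\sum_m \alpha_{nm} = 1$ for every $n$, we get $\lambda = N$, recovering $\hat{\pi}_m = \frac{1}{N}\sum_n \alpha_{nm}$.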
SLIDE 22

Convergence of EM Algorithm

SLIDE 23

โ€œHard EMโ€

  • Same as EM, except it computes $z^*$ as the most likely values for the hidden variables
  • K-means is an example (see the sketch after this list)
  • Advantages
  • Simpler: can be applied when the EM updates cannot be derived
  • Sometimes works better if you want to make hard predictions at the end
  • But
  • Generally, the pdf parameters are not as accurate as with EM
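An illustrative Octave sketch of the hard E-step, assuming the soft responsibilities alpha (N-by-M) from the earlier mixture-of-Gaussians code:

    [~, z_star] = max(alpha, [], 2);      % z*: most likely component per point
    alpha_hard = zeros(size(alpha));      % one-hot "responsibilities"
    alpha_hard(sub2ind(size(alpha), (1:size(alpha,1))', z_star)) = 1;
    % plugging alpha_hard into the M-step recovers k-means-style updates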
SLIDE 24

Dimensionality Reduction

  • Motivation
  • Data compression
  • Data visualization
  • Principal component analysis
  • Formulation
  • Algorithm
  • Reconstruction
  • Choosing the number of principal components
  • Applying PCA
SLIDE 25

Dimensionality Reduction

  • Motivation
  • Principal component analysis
  • Formulation
  • Algorithm
  • Reconstruction
  • Choosing the number of principal components
  • Applying PCA
SLIDE 26

Data Compression

  • Reduces the required time and storage space
  • Removing multi-collinearity improves the interpretation of the parameters of the machine learning model.

$$x^{(1)} \in \mathbb{R}^2 \rightarrow z^{(1)} \in \mathbb{R}, \quad x^{(2)} \in \mathbb{R}^2 \rightarrow z^{(2)} \in \mathbb{R}, \quad \cdots, \quad x^{(m)} \in \mathbb{R}^2 \rightarrow z^{(m)} \in \mathbb{R}$$

[Figure: 2-D points $(x_1, x_2)$ projected onto a line, yielding the 1-D coordinate $z_1$]

SLIDE 28

Data Compression

  • Reduce data from 3D to 2D (in general, e.g., 1000D -> 100D)

[Figure: 3-D points $(x_1, x_2, x_3)$ projected onto a plane, yielding 2-D coordinates $(z_1, z_2)$]

SLIDE 29
SLIDE 30

Dimensionality Reduction

  • Motivation
  • Principal component analysis
  • Formulation
  • Algorithm
  • Reconstruction
  • Choosing the number of principal components
  • Applying PCA
SLIDE 31

Principal Component Analysis Formulation

๐‘ฆ2 ๐‘ฆ1

SLIDE 32

Principal Component Analysis Formulation

  • Reduce n-D to k-D: find directions $u^{(1)}, u^{(2)}, \cdots, u^{(k)} \in \mathbb{R}^n$ onto which to project the data, so as to minimize the projection error

[Figure: 2-D data with the projection direction $u^{(1)}$; for $k = 2$, a second direction $u^{(2)}$]

SLIDE 33

PCA vs. Linear regression

[Figure: left, linear regression predicts $y$ from $x_1$ by minimizing vertical errors; right, PCA on $(x_1, x_2)$ minimizes orthogonal projection errors]

SLIDE 34

Data pre-processing

  • Training set: $x^{(1)}, x^{(2)}, \cdots, x^{(m)}$
  • Preprocessing (feature scaling / mean normalization):

$$\mu_j = \frac{1}{m} \sum_i x_j^{(i)}$$

Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$.

If different features are on different scales, scale them to have a comparable range of values:

$$x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}$$
SLIDE 35

Principal Component Analysis Algorithm

  • Goal: Reduce data from n dimensions to k dimensions
  • Step 1: Compute the "covariance matrix"

$$\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \left( x^{(i)} \right)^\top$$

  • Step 2: Compute the "eigenvectors" of the covariance matrix

[U, S, V] = svd(Sigma);

$U = \left[ u^{(1)}, u^{(2)}, \cdots, u^{(n)} \right] \in \mathbb{R}^{n \times n}$; principal components: $u^{(1)}, u^{(2)}, \cdots, u^{(k)} \in \mathbb{R}^n$

SLIDE 36

Principal Component Analysis Algorithm

  • Goal: Reduce data from n dimensions to k dimensions
  • Principal components: $u^{(1)}, u^{(2)}, \cdots, u^{(k)} \in \mathbb{R}^n$

$$z^{(i)} = \left[ u^{(1)}, u^{(2)}, \cdots, u^{(k)} \right]^\top x^{(i)} \in \mathbb{R}^k$$

SLIDE 37

PCA algorithm summary

  • After mean normalization (ensure every feature has zero mean) and optionally feature scaling:

$$\text{Sigma} = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \left( x^{(i)} \right)^\top$$

  • [U, S, V] = svd(Sigma);
  • Ureduce = U(:, 1:k);
  • z = Ureduce' * x;
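Assembled into one runnable Octave sketch (assuming X is an m-by-n data matrix and k the target dimension; projecting all examples at once is my rearrangement of the slide's per-example z = Ureduce' * x):

    X_norm = X - mean(X, 1);                        % mean normalization
    Sigma = (1 / size(X, 1)) * (X_norm' * X_norm);  % n-by-n covariance matrix
    [U, S, V] = svd(Sigma);                         % columns of U: principal components
    Ureduce = U(:, 1:k);                            % keep the top-k components
    Z = X_norm * Ureduce;                           % m-by-k compressed representation

For a single column-vector example x, the last line reduces to the slide's z = Ureduce' * x.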
SLIDE 38

Reconstruction from compressed representation

  • Compression: $z^{(i)} = U_{\text{reduce}}^\top\, x^{(i)}$
  • Reconstruction: $x_{\text{approx}}^{(i)} = U_{\text{reduce}}\, z^{(i)}$
  • $x_{\text{approx}}^{(i)} \in \mathbb{R}^n$, $U_{\text{reduce}} \in \mathbb{R}^{n \times k}$, $z^{(i)} \in \mathbb{R}^{k \times 1}$

[Figure: original 2-D points and their reconstructions on the 1-D projection line]

SLIDE 39

3D face modeling

A morphable model for the synthesis of 3D faces, SIGGRAPH 1999

SLIDE 40

Shape modeling

SMPL: Skinned multi-person linear model, SIGGRAPH Asia 2015

SLIDE 41

Dimensionality Reduction

  • Motivation
  • Principal component analysis
  • Formulation
  • Algorithm
  • Reconstruction
  • Choosing the number of principal components
  • Applying PCA
SLIDE 42

How do we choose k (number of principal components)?

  • Average squared projection error: $\frac{1}{m} \sum_i \left\| x^{(i)} - x_{\text{approx}}^{(i)} \right\|^2$
  • Total variation in the data: $\frac{1}{m} \sum_i \left\| x^{(i)} \right\|^2$
  • Typically, choose $k$ to be the smallest value such that

$$\frac{\frac{1}{m} \sum_i \left\| x^{(i)} - x_{\text{approx}}^{(i)} \right\|^2}{\frac{1}{m} \sum_i \left\| x^{(i)} \right\|^2} \leq 0.01 \quad (1\%)$$

"99% of the variance is retained"

SLIDE 43

How do we choose k (number of principal components)?

  • Try PCA with $k = 1, 2, \cdots$
  • Compute $U_{\text{reduce}}$, $z^{(1)}, z^{(2)}, \cdots, z^{(m)}$, $x_{\text{approx}}^{(1)}, x_{\text{approx}}^{(2)}, \cdots, x_{\text{approx}}^{(m)}$
  • Check if

$$\frac{\frac{1}{m} \sum_i \left\| x^{(i)} - x_{\text{approx}}^{(i)} \right\|^2}{\frac{1}{m} \sum_i \left\| x^{(i)} \right\|^2} \leq 0.01\ ?$$

A cheaper alternative uses a single SVD:

  • [U, S, V] = svd(Sigma)
  • $S = \begin{bmatrix} s_{11} & & \\ & \ddots & \\ & & s_{nn} \end{bmatrix}$
  • For a given $k$, check

$$1 - \frac{\sum_{i=1}^{k} s_{ii}}{\sum_{i=1}^{n} s_{ii}} \leq 0.01 \quad \Longleftrightarrow \quad \frac{\sum_{i=1}^{k} s_{ii}}{\sum_{i=1}^{n} s_{ii}} \geq 0.99$$

SLIDE 44

Dimensionality Reduction

  • Motivation
  • Principal component analysis
  • Formulation
  • Algorithm
  • Reconstruction
  • Choosing the number of principal components
  • Applying PCA
SLIDE 45

Application of PCA

  • Compression
  • Reduce memory/disk needed to store data
  • Speed up learning algorithm
  • Visualization (k=2, k=3)
  • Bad use of PCA
  • Reduce the number of features -> less likely to overfit?
  • Use regularization instead.
SLIDE 46

Taxonomy for dimensionality reduction

http://www.math.chalmers.se/Stat/Grundutb/GU/MSA220/S18/DimRed2.pdf

SLIDE 47

Things to remember

  • Compression, visualization
  • Principal component analysis
  • Formulation
  • Algorithm
  • Reconstruction
  • Choosing the number of principal components
  • Applying PCA