  1. COMS 4721: Machine Learning for Data Science, Lecture 19, 4/6/2017. Prof. John Paisley, Department of Electrical Engineering & Data Science Institute, Columbia University.

  2. PRINCIPAL COMPONENT ANALYSIS

  3. DIMENSIONALITY REDUCTION
  We're given data x_1, ..., x_n, where x ∈ R^d. This data is often high-dimensional, but the "information" doesn't use the full d dimensions. For example, we could represent the above images with three numbers, since they have three degrees of freedom: two for shifts and a third for rotation.
  Principal component analysis can be thought of as a way of automatically mapping data x_i into a new low-dimensional coordinate system.
  ◮ It captures most of the information in the data in a few dimensions.
  ◮ Extensions allow us to handle missing data and "unwrap" the data.

  4. PRINCIPAL COMPONENT ANALYSIS
  Example: How can we approximate this data using a unit-length vector q?
  ◮ q is a unit-length vector, q^T q = 1.
  ◮ Red dot: the length q^T x_i of the projection of x_i onto the line defined by q.
  ◮ The vector (q^T x_i) q stretches q to the corresponding red dot, so qq^T x_i = (q^T x_i) q is the approximation of x_i by stretching q to the "red dot."
  So what's a good q? How about minimizing the squared approximation error,

  $$ q = \arg\min_q \sum_{i=1}^n \| x_i - q q^T x_i \|^2 \quad \text{subject to} \quad q^T q = 1. $$

  5. PCA: THE FIRST PRINCIPAL COMPONENT
  This is related to the problem of finding the largest eigenvalue:

  $$ q = \arg\min_q \sum_{i=1}^n \| x_i - q q^T x_i \|^2 \quad \text{s.t. } q^T q = 1
       = \arg\min_q \sum_{i=1}^n x_i^T x_i - q^T \Big( \underbrace{\sum_{i=1}^n x_i x_i^T}_{=\, XX^T} \Big) q. $$

  We've defined X = [x_1, ..., x_n]. Since the first term doesn't depend on q and we have a negative sign in front of the second term, equivalently we solve

  $$ q = \arg\max_q \; q^T (XX^T) q \quad \text{subject to} \quad q^T q = 1. $$

  This is the eigendecomposition problem:
  ◮ q is the first eigenvector of XX^T
  ◮ λ = q^T (XX^T) q is the first eigenvalue

  6. PCA: GENERAL
  The general form of PCA considers K eigenvectors,

  $$ q = \arg\min_q \sum_{i=1}^n \Big\| x_i - \underbrace{\sum_{k=1}^K (x_i^T q_k) q_k}_{\text{approximates } x_i} \Big\|^2 \quad \text{s.t. } q_k^T q_{k'} = \begin{cases} 1, & k = k' \\ 0, & k \neq k' \end{cases} $$
  $$ \;\; = \arg\min_q \sum_{i=1}^n x_i^T x_i - \sum_{k=1}^K q_k^T \big( \underbrace{XX^T}_{} \big) q_k. $$

  The vectors in Q = [q_1, ..., q_K] give us a K-dimensional subspace with which to represent the data:

  $$ x_{\text{proj}} = \begin{bmatrix} q_1^T x \\ \vdots \\ q_K^T x \end{bmatrix}, \qquad x \approx \sum_{k=1}^K (q_k^T x) q_k = Q x_{\text{proj}}. $$

  The eigenvectors of XX^T can be learned using built-in software.
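  As a concrete illustration of "built-in software," here is a minimal NumPy sketch of this slide's procedure on made-up data (the names X, Q, K follow the slide's notation; the data itself is an assumption, not from the lecture):

  ```python
  # PCA via eigendecomposition of X X^T, columns of X are mean-subtracted points.
  import numpy as np

  np.random.seed(0)
  d, n, K = 5, 200, 2
  X = np.random.randn(d, n) * np.array([3.0, 2.0, 1.0, 0.1, 0.1])[:, None]  # synthetic data
  X = X - X.mean(axis=1, keepdims=True)       # subtract the mean of each dimension

  eigvals, eigvecs = np.linalg.eigh(X @ X.T)  # eigenpairs of X X^T (ascending order)
  order = np.argsort(eigvals)[::-1]           # reorder so lambda_1 >= lambda_2 >= ...
  Q = eigvecs[:, order[:K]]                   # first K eigenvectors, Q = [q_1, ..., q_K]

  X_proj = Q.T @ X                            # new K-dimensional coordinates q_k^T x_i
  X_approx = Q @ X_proj                       # x_i ~ sum_k (q_k^T x_i) q_k
  print("squared approximation error:", np.sum((X - X_approx) ** 2))
  ```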

  7. EIGENVALUES, EIGENVECTORS AND THE SVD
  An equivalent formulation of the problem is to find (λ, q) such that

  $$ (XX^T) q = \lambda q. $$

  Since XX^T is a PSD matrix, there are r ≤ min{d, n} pairs, with

  $$ \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_r > 0, \qquad q_k^T q_k = 1, \qquad q_k^T q_{k'} = 0 \;\; (k \neq k'). $$

  Why is XX^T PSD? Using the SVD, X = USV^T, we have that

  $$ XX^T = U S^2 U^T \;\;\Rightarrow\;\; Q = U, \qquad \lambda_i = (S^2)_{ii} \geq 0. $$

  Preprocessing: Usually we first subtract off the mean of each dimension of x.
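  A quick numerical check of this equivalence, again on assumed synthetic data (not from the slides): the eigenvalues of XX^T should match the squared singular values of the mean-subtracted X.

  ```python
  # Verify lambda_i = (S^2)_ii for X = U S V^T.
  import numpy as np

  np.random.seed(0)
  X = np.random.randn(5, 200)
  X = X - X.mean(axis=1, keepdims=True)             # preprocessing: subtract the mean

  U, S, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(S) V^T
  eigvals = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]

  print(np.allclose(eigvals, S ** 2))               # True: eigenvalues of XX^T are S^2
  ```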

  8. PCA: EXAMPLE OF PROJECTING FROM R^3 TO R^2
  For this data, most information (structure in the data) can be captured in R^2.
  (left) The original data in R^3. The hyperplane is defined by q_1 and q_2.
  (right) The new coordinates for the data: x_i → x_proj,i = [x_i^T q_1, x_i^T q_2]^T.

  9. EXAMPLE: DIGITS
  Data: 16 × 16 images of handwritten 3's (as vectors in R^256).
  Above: the mean and the first four eigenvectors q with their eigenvalues, λ_1 = 3.4·10^5, λ_2 = 2.8·10^5, λ_3 = 2.4·10^5, λ_4 = 1.6·10^5.
  Above: reconstructing a 3 using the first M − 1 eigenvectors plus the mean (shown for M = 1, 10, 50, 250 alongside the original), with approximation

  $$ x \approx \text{mean} + \sum_{k=1}^{M-1} (x^T q_k) q_k. $$
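  A hedged sketch of this reconstruction: the digit data is not included here, so a hypothetical array `digits` of shape (n, 256) with one vectorized 16 × 16 image per row stands in for it (random values below).

  ```python
  # Reconstruct one image as mean + sum_{k=1}^{M-1} (x^T q_k) q_k.
  import numpy as np

  np.random.seed(0)
  digits = np.random.rand(500, 256)            # placeholder for the handwritten 3's
  mean = digits.mean(axis=0)
  Xc = (digits - mean).T                       # 256 x n matrix of centered images

  eigvals, eigvecs = np.linalg.eigh(Xc @ Xc.T)
  Q = eigvecs[:, np.argsort(eigvals)[::-1]]    # eigenvectors, largest eigenvalue first

  M = 50
  x = digits[0] - mean
  coeffs = Q[:, :M - 1].T @ x                  # x^T q_k for the first M-1 eigenvectors
  x_recon = mean + Q[:, :M - 1] @ coeffs       # reconstruction of the original image
  ```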

  10. PROBABILISTIC PCA

  11. PCA AND THE SVD
  We've discussed how any matrix X has a singular value decomposition,

  $$ X = USV^T, \qquad U^T U = I, \qquad V^T V = I, $$

  and S is a diagonal matrix with non-negative entries. Therefore,

  $$ XX^T = U S^2 U^T \;\;\Leftrightarrow\;\; (XX^T) U = U S^2. $$

  U is a matrix of eigenvectors, and S^2 is a diagonal matrix of eigenvalues.

  12. A MODELING APPROACH TO PCA
  Using the SVD perspective of PCA, we can also derive a probabilistic model for the problem and use the EM algorithm to learn it. This model has the advantages of:
  ◮ handling the problem of missing data,
  ◮ allowing us to learn additional parameters such as the noise,
  ◮ providing a framework that can be extended to more complex models,
  ◮ giving distributions used to characterize uncertainty in predictions,
  ◮ etc.

  13. PROBABILISTIC PCA
  In effect, this is a new matrix factorization model.
  ◮ With the SVD, we had X = USV^T.
  ◮ We now approximate X ≈ WZ, where
    ◮ W is a d × K matrix. In different settings this is called a "factor loadings" matrix or a "dictionary." It's like the eigenvectors, but without the orthonormality constraint.
    ◮ the i-th column of Z is called z_i ∈ R^K. Think of it as a low-dimensional representation of x_i.
  The generative process of probabilistic PCA is

  $$ x_i \sim N(W z_i, \sigma^2 I), \qquad z_i \sim N(0, I). $$

  In this case, we don't know W or any of the z_i.
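  A small sketch of this generative process with assumed parameter values (d, K, n, and σ² below are arbitrary choices, not from the lecture):

  ```python
  # Sample data from the probabilistic PCA model: z_i ~ N(0, I), x_i ~ N(W z_i, sigma^2 I).
  import numpy as np

  np.random.seed(0)
  d, K, n, sigma2 = 10, 3, 1000, 0.1
  W = np.random.randn(d, K)                              # "dictionary" / factor loadings

  Z = np.random.randn(K, n)                              # z_i ~ N(0, I), one column per point
  X = W @ Z + np.sqrt(sigma2) * np.random.randn(d, n)    # x_i ~ N(W z_i, sigma^2 I)
  ```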

  14. THE LIKELIHOOD
  Maximum likelihood: Our goal is to find the maximum likelihood solution of the matrix W under the marginal distribution, i.e., with the z_i vectors integrated out,

  $$ W_{ML} = \arg\max_W \; \ln p(x_1, \ldots, x_n \mid W) = \arg\max_W \; \sum_{i=1}^n \ln p(x_i \mid W). $$

  This is intractable because

  $$ p(x_i \mid W) = N(x_i \mid 0, \sigma^2 I + WW^T) = \frac{1}{(2\pi)^{d/2} \, |\sigma^2 I + WW^T|^{1/2}} \, e^{-\frac{1}{2} x_i^T (\sigma^2 I + WW^T)^{-1} x_i}. $$

  We can set up an EM algorithm that uses the vectors z_1, ..., z_n.
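  For reference, the marginal density above is straightforward to evaluate numerically; a sketch with assumed W, σ², and x (none taken from the lecture):

  ```python
  # Evaluate log p(x | W) = log N(x | 0, sigma^2 I + W W^T).
  import numpy as np
  from scipy.stats import multivariate_normal

  np.random.seed(0)
  d, K, sigma2 = 10, 3, 0.1
  W = np.random.randn(d, K)
  x = np.random.randn(d)

  cov = sigma2 * np.eye(d) + W @ W.T            # marginal covariance, z integrated out
  log_px = multivariate_normal.logpdf(x, mean=np.zeros(d), cov=cov)
  ```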

  15. EM FOR PROBABILISTIC PCA
  Setup: The marginal log likelihood can be expressed using EM as

  $$ \sum_{i=1}^n \ln \int p(x_i, z_i \mid W)\, dz_i = \underbrace{\sum_{i=1}^n \int q(z_i) \ln \frac{p(x_i, z_i \mid W)}{q(z_i)}\, dz_i}_{\leftarrow\, \mathcal{L}} \;+\; \underbrace{\sum_{i=1}^n \int q(z_i) \ln \frac{q(z_i)}{p(z_i \mid x_i, W)}\, dz_i}_{\leftarrow\, KL}. $$

  EM Algorithm: Remember that EM has two iterated steps:
  1. Set q(z_i) = p(z_i | x_i, W) for each i (making KL = 0) and calculate L.
  2. Maximize L with respect to W.
  Again, for this to work well we need that
  ◮ we can calculate the posterior distribution p(z_i | x_i, W), and
  ◮ maximizing L is easy, i.e., we update W using a simple equation.

  16. THE ALGORITHM
  EM for Probabilistic PCA
  Given: data x_{1:n}, x_i ∈ R^d, and model x_i ~ N(W z_i, σ² I), z_i ~ N(0, I), z ∈ R^K.
  Output: point estimate of W and posterior distribution on each z_i.
  E-Step: Set each q(z_i) = p(z_i | x_i, W) = N(z_i | μ_i, Σ_i), where

  $$ \Sigma_i = (I + W^T W / \sigma^2)^{-1}, \qquad \mu_i = \Sigma_i W^T x_i / \sigma^2. $$

  M-Step: Update W by maximizing the objective L from the E-step,

  $$ W = \Big( \sum_{i=1}^n x_i \mu_i^T \Big) \Big( \sigma^2 I + \sum_{i=1}^n (\mu_i \mu_i^T + \Sigma_i) \Big)^{-1}. $$

  Iterate the E and M steps until the increase in Σ_{i=1}^n ln p(x_i | W) is "small."
  Comment:
  ◮ The probabilistic framework gives a way to learn K and σ² as well.
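  A minimal EM-for-probabilistic-PCA sketch that mirrors the E- and M-steps above. The data X, the choice of K, and the fixed σ² are assumptions (placeholder values); the M-step follows the update as reconstructed on this slide, including the σ²I term.

  ```python
  # EM for probabilistic PCA (point estimate of W, Gaussian posterior on each z_i).
  import numpy as np

  np.random.seed(0)
  d, K, n, sigma2 = 10, 3, 1000, 0.1
  X = np.random.randn(d, n)                                 # placeholder data, one column per x_i
  W = np.random.randn(d, K)                                 # random initialization

  for it in range(100):
      # E-step: q(z_i) = N(mu_i, Sigma), Sigma = (I + W^T W / sigma^2)^{-1} (same for all i)
      Sigma = np.linalg.inv(np.eye(K) + W.T @ W / sigma2)
      Mu = Sigma @ W.T @ X / sigma2                         # column i is mu_i

      # M-step: W = (sum_i x_i mu_i^T)(sigma^2 I + sum_i (mu_i mu_i^T + Sigma_i))^{-1}
      A = X @ Mu.T                                          # sum_i x_i mu_i^T
      B = sigma2 * np.eye(K) + Mu @ Mu.T + n * Sigma        # sigma^2 I + sum_i (mu_i mu_i^T + Sigma)
      W = A @ np.linalg.inv(B)
  ```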

  17. EXAMPLE: IMAGE PROCESSING
  Figure: each 8 × 8 patch becomes a column of the data matrix X, e.g., X is 64 × 262,144.
  For image problems such as denoising or inpainting (missing data):
  ◮ Extract overlapping patches (e.g., 8 × 8) and vectorize them to construct X.
  ◮ Model X with a factor model such as probabilistic PCA.
  ◮ Approximate x_i ≈ W μ_i, where μ_i is the posterior mean of z_i.
  ◮ Reconstruct the image by replacing x_i with W μ_i (and averaging).
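  A sketch of the patch-extraction step; the grayscale image `img` and its size are assumptions (random values stand in for real pixel data):

  ```python
  # Build the data matrix X from overlapping 8x8 patches, one vectorized patch per column.
  import numpy as np

  np.random.seed(0)
  img = np.random.rand(480, 320)                 # placeholder grayscale image
  p = 8
  patches = [img[r:r + p, c:c + p].ravel()       # vectorize each 8x8 patch into R^64
             for r in range(img.shape[0] - p + 1)
             for c in range(img.shape[1] - p + 1)]
  X = np.stack(patches, axis=1)                  # shape (64, number of overlapping patches)
  print(X.shape)
  ```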

  18. EXAMPLE: DENOISING
  Noisy image on the left, denoised image on the right. The noise variance parameter σ² was learned for this example.

  19. EXAMPLE: MISSING DATA
  Another somewhat extreme example:
  ◮ The image is 480 × 320 × 3 (RGB dimension).
  ◮ Throw away 80% at random.
  ◮ (left) Missing data, (middle) reconstruction, (right) original image.

  20. KERNEL PCA

  21. KERNEL PCA
  We've seen how we can take an algorithm that uses dot products, x^T x, and generalize it with a nonlinear kernel. This generalization can be made to PCA.
  Recall: With PCA we find the eigenvectors of the matrix Σ_{i=1}^n x_i x_i^T = XX^T.
  ◮ Let φ(x) be a feature mapping from R^d to R^D, where D ≫ d.
  ◮ We want to solve the eigendecomposition

  $$ \Big( \sum_{i=1}^n \phi(x_i) \phi(x_i)^T \Big) q_k = \lambda_k q_k $$

  without having to work in the higher-dimensional space.
  ◮ That is, how can we do PCA without explicitly using φ(·) and q?

  22. KERNEL PCA
  Notice that we can reorganize the operations of the eigendecomposition:

  $$ q_k = \sum_{i=1}^n \underbrace{\big( \phi(x_i)^T q_k / \lambda_k \big)}_{=\, a_{ki}} \, \phi(x_i). $$

  That is, the eigenvector q_k = Σ_{i=1}^n a_ki φ(x_i) for some vector a_k ∈ R^n. The trick is that instead of learning q_k, we'll learn a_k.
  Plug this equation for q_k back into the first equation,

  $$ \sum_{i=1}^n \sum_{j=1}^n a_{kj} \, \underbrace{\phi(x_i)^T \phi(x_j)}_{=\, K(x_i, x_j)} \, \phi(x_i) = \lambda_k \sum_{i=1}^n a_{ki} \, \phi(x_i), $$

  and multiply both sides by φ(x_l)^T for each l ∈ {1, ..., n}.
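  The slide stops mid-derivation; as a hedged preview of where it leads, here is a sketch of standard kernel PCA. The RBF kernel choice, the Gram-matrix centering step, and the eigenvector normalization are assumptions about the standard method, not taken verbatim from this slide.

  ```python
  # Kernel PCA: eigendecompose the (centered) Gram matrix to get the coefficient vectors a_k,
  # then compute projections phi(x_i)^T q_k = sum_j a_kj K(x_i, x_j) without ever forming phi.
  import numpy as np

  np.random.seed(0)
  n, d, num_components = 200, 5, 2
  X = np.random.randn(n, d)                              # one data point per row (synthetic)

  # Gram matrix K_ij = kernel(x_i, x_j), here an RBF kernel
  sq = np.sum(X ** 2, axis=1)
  D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
  Kmat = np.exp(-D2 / d)

  # center the Gram matrix (equivalent to centering phi(x) in feature space)
  one = np.ones((n, n)) / n
  Kc = Kmat - one @ Kmat - Kmat @ one + one @ Kmat @ one

  eigvals, eigvecs = np.linalg.eigh(Kc)                  # eigenvectors of Kc give the a_k
  order = np.argsort(eigvals)[::-1][:num_components]
  A = eigvecs[:, order] / np.sqrt(np.maximum(eigvals[order], 1e-12))  # scale so q_k^T q_k = 1

  X_kpca = Kc @ A                                        # projections phi(x_i)^T q_k
  ```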
