COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017
- Prof. John Paisley
Department of Electrical Engineering & Data Science Institute Columbia University
PRINCIPAL COMPONENT ANALYSIS

DIMENSIONALITY REDUCTION
We're given data $x_1, \dots, x_n$, where $x \in \mathbb{R}^d$. This data is often high-dimensional, but the "information" doesn't use the full $d$ dimensions. For example, we could represent the above images with three numbers, since they have three degrees of freedom: two for shifts and a third for rotation. Principal component analysis can be thought of as a way of automatically mapping the data $x_i$ into some new low-dimensional coordinate system.
◮ It captures most of the information in the data in a few dimensions.
◮ Extensions allow us to handle missing data and "unwrap" the data.
Example: How can we approximate this data using a unit-length vector q?
[Figure: a data point $x_i$ and its projection $(q^Tx_i)q$ onto the line defined by $q$.]
$q$ is a unit-length vector, $q^Tq = 1$. Red dot: the length, $q^Tx_i$, along the axis after projecting $x_i$ onto the line defined by $q$. The vector $(q^Tx_i)q$ takes $q$ and stretches it to the corresponding red dot.

So what's a good $q$? How about minimizing the squared approximation error,
$$q = \arg\min_q \sum_{i=1}^n \|x_i - qq^Tx_i\|^2 \quad \text{subject to} \quad q^Tq = 1,$$
where $qq^Tx_i = (q^Tx_i)q$ is the approximation of $x_i$ by stretching $q$ to the "red dot."
This is related to the problem of finding the largest eigenvalue,
$$q = \arg\min_q \sum_{i=1}^n \|x_i - qq^Tx_i\|^2 \ \text{ s.t. } q^Tq = 1 \;=\; \arg\min_q \sum_{i=1}^n x_i^Tx_i - q^T\Big(\sum_{i=1}^n x_ix_i^T\Big)q.$$

We've defined $X = [x_1, \dots, x_n]$, so $\sum_{i=1}^n x_ix_i^T = XX^T$. Since the first term doesn't depend on $q$ and we have a negative sign in front of the second term, equivalently we solve
$$q = \arg\max_q \ q^T(XX^T)q \quad \text{subject to} \quad q^Tq = 1.$$
This is the eigendecomposition problem:
◮ $q$ is the first eigenvector of $XX^T$
◮ $\lambda = q^T(XX^T)q$ is the first eigenvalue
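This can be carried out with a standard eigensolver. Below is a minimal NumPy sketch; the data matrix and its dimensions are synthetic assumptions for illustration, not part of the lecture.

```python
import numpy as np

# Assumed synthetic data: X is d x n with columns x_1, ..., x_n.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 200))

# Eigendecomposition of XX^T; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(X @ X.T)

q = eigvecs[:, -1]            # first eigenvector (largest eigenvalue), q^T q = 1
lam = q @ (X @ X.T) @ q       # first eigenvalue, equals eigvals[-1]
print(lam, eigvals[-1])
```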
The general form of PCA considers $K$ eigenvectors,
$$Q = \arg\min_{q_1,\dots,q_K} \sum_{i=1}^n \Big\|x_i - \sum_{k=1}^K (x_i^Tq_k)q_k\Big\|^2 \ \text{ s.t. } \ q_k^Tq_{k'} = \begin{cases}1 & k = k' \\ 0 & k \neq k'\end{cases}$$
$$= \arg\min_{q_1,\dots,q_K} \sum_{i=1}^n x_i^Tx_i - \sum_{k=1}^K q_k^T\Big(\sum_{i=1}^n x_ix_i^T\Big)q_k.$$

The vectors in $Q = [q_1, \dots, q_K]$ give us a $K$-dimensional subspace with which to represent the data:
$$x_{\mathrm{proj}} = \begin{bmatrix} q_1^Tx \\ \vdots \\ q_K^Tx \end{bmatrix}, \qquad x \approx \sum_{k=1}^K (q_k^Tx)q_k = Qx_{\mathrm{proj}}.$$
The eigenvectors of (XXT) can be learned using built-in software.
An equivalent formulation of the problem is to find $(\lambda, q)$ such that
$$(XX^T)q = \lambda q.$$
Since $XX^T$ is a PSD matrix, there are $r \le \min\{d, n\}$ pairs,
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad q_k^Tq_k = 1, \qquad q_k^Tq_{k'} = 0 \ \ (k \ne k').$$

Why is $XX^T$ PSD? Using the SVD, $X = USV^T$, we have
$$XX^T = US^2U^T \ \Rightarrow\ Q = U, \quad \lambda_i = (S^2)_{ii} \ge 0.$$

Preprocessing: Usually we first subtract off the mean of each dimension of $x$.
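A minimal sketch of this SVD view, including the mean-subtraction preprocessing; the data below is synthetic and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 200))            # d x n data matrix (synthetic)

# Preprocessing: subtract the mean of each dimension.
X = X - X.mean(axis=1, keepdims=True)

# SVD: X = U S V^T, so XX^T = U S^2 U^T and the q's are the columns of U.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Q = U                                        # eigenvectors of XX^T
lam = S ** 2                                 # eigenvalues, largest to smallest

# Check against a direct eigendecomposition of XX^T.
eigvals = np.linalg.eigvalsh(X @ X.T)[::-1]
print(np.allclose(lam, eigvals))
```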
For this data, most information (structure in the data) can be captured in $\mathbb{R}^2$.
(left) The original data in $\mathbb{R}^3$. The hyperplane is defined by $q_1$ and $q_2$.
(right) The new coordinates for the data: $x_i \to x_{\mathrm{proj},i} = \begin{bmatrix} x_i^Tq_1 \\ x_i^Tq_2 \end{bmatrix}$.
Data: 16×16 images of handwritten 3’s (as vectors in R256)
[Figure: the mean image and the first four eigenvectors, with eigenvalues $\lambda_1 = 3.4\cdot10^5$, $\lambda_2 = 2.8\cdot10^5$, $\lambda_3 = 2.4\cdot10^5$, $\lambda_4 = 1.6\cdot10^5$.]

Above: The first four eigenvectors $q$ and their eigenvalues $\lambda$.
[Figure: the original 3 and reconstructions for several values of M.]

Above: Reconstructing a 3 using the first $M-1$ eigenvectors plus the mean, with the approximation
$$x \approx \text{mean} + \sum_{k=1}^{M-1} (x^Tq_k)q_k.$$
We've discussed how any matrix $X$ has a singular value decomposition,
$$X = USV^T, \qquad U^TU = I, \qquad V^TV = I,$$
where $S$ is a diagonal matrix with non-negative entries. Therefore,
$$XX^T = US^2U^T \ \Leftrightarrow\ (XX^T)U = US^2.$$
$U$ is a matrix of eigenvectors, and $S^2$ is a diagonal matrix of eigenvalues.
Using the SVD perspective of PCA, we can also derive a probabilistic model for the problem and use the EM algorithm to learn it. This model will have the advantages of:
◮ Handling the problem of missing data
◮ Allowing us to learn additional parameters such as the noise variance
◮ Providing a framework that could be extended to more complex models
◮ Giving distributions used to characterize uncertainty in predictions
◮ etc.
In effect, this is a new matrix factorization model.
◮ With the SVD, we had $X = USV^T$.
◮ We now approximate $X \approx WZ$, where
  ◮ $W$ is a $d \times K$ matrix. In different settings this is called a "factor loadings" matrix, or a "dictionary." It's like the eigenvectors, but with no orthonormality constraint.
  ◮ The $i$th column of $Z$ is called $z_i \in \mathbb{R}^K$. Think of it as a low-dimensional representation of $x_i$.
The generative process of probabilistic PCA is
$$x_i \sim N(Wz_i, \sigma^2 I), \qquad z_i \sim N(0, I).$$
In this case, we don't know $W$ or any of the $z_i$.
Our goal is to find the maximum likelihood solution of the matrix $W$ under the marginal distribution, i.e., with the $z_i$ vectors integrated out,
$$W_{\mathrm{ML}} = \arg\max_W \ \ln p(x_1, \dots, x_n\,|\,W) = \arg\max_W \sum_{i=1}^n \ln p(x_i\,|\,W).$$
This is intractable because $p(x_i|W) = N(x_i\,|\,0, \sigma^2 I + WW^T)$, where
$$N(x\,|\,0, \sigma^2 I + WW^T) = \frac{1}{(2\pi)^{d/2}\,|\sigma^2 I + WW^T|^{1/2}}\, e^{-\frac{1}{2}x^T(\sigma^2 I + WW^T)^{-1}x}.$$
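For reference, the marginal density itself is just a Gaussian and can be evaluated directly. A minimal sketch, where the dimensions, $\sigma^2$, $W$, and $x$ are all assumed for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

d, K, sigma2 = 5, 2, 0.1                   # assumed dimensions and noise variance
rng = np.random.default_rng(0)
W = rng.standard_normal((d, K))            # an arbitrary loading matrix
x = rng.standard_normal(d)                 # one observation

# Marginal of x with z integrated out: N(x | 0, sigma^2 I + W W^T).
cov = sigma2 * np.eye(d) + W @ W.T
log_px = multivariate_normal(mean=np.zeros(d), cov=cov).logpdf(x)
print(log_px)
```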
We can set up an EM algorithm that uses the vectors z1, . . . , zn.
The marginal log likelihood can be expressed using EM as
$$\sum_{i=1}^n \ln p(x_i|W) = \underbrace{\sum_{i=1}^n \int q(z_i)\,\ln\frac{p(x_i, z_i|W)}{q(z_i)}\,dz_i}_{\mathcal{L}} \;+\; \underbrace{\sum_{i=1}^n \int q(z_i)\,\ln\frac{q(z_i)}{p(z_i|x_i, W)}\,dz_i}_{\text{KL}}$$

EM Algorithm: Remember that EM has two iterated steps.
Again, for this to work well we need that
◮ we can calculate the posterior distribution $p(z_i|x_i, W)$, and
◮ maximizing $\mathcal{L}$ is easy, i.e., we update $W$ using a simple equation.
Given: Data $x_{1:n}$, $x_i \in \mathbb{R}^d$, and model $x_i \sim N(Wz_i, \sigma^2 I)$, $z_i \sim N(0, I)$, $z \in \mathbb{R}^K$

Output: Point estimate of $W$ and posterior distribution on each $z_i$

E-Step: Set each $q(z_i) = p(z_i|x_i, W) = N(z_i|\mu_i, \Sigma_i)$, where
$$\Sigma_i = (I + W^TW/\sigma^2)^{-1}, \qquad \mu_i = \Sigma_i W^Tx_i/\sigma^2.$$

M-Step: Update $W$ by maximizing the objective $\mathcal{L}$ from the E-step,
$$W = \Big(\sum_{i=1}^n x_i\mu_i^T\Big)\Big(\sigma^2 I + \sum_{i=1}^n (\mu_i\mu_i^T + \Sigma_i)\Big)^{-1}.$$

Iterate the E and M steps until the increase in $\sum_{i=1}^n \ln p(x_i|W)$ is "small."
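Below is a minimal NumPy sketch of these E and M steps, following the updates above. The data is synthetic, and $K$, $\sigma^2$, and the number of iterations are assumed for illustration; note that $\Sigma_i$ is the same for every $i$, so it is computed once per iteration.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d, K, n, sigma2 = 5, 2, 500, 0.1                    # assumed sizes and noise variance
X = rng.standard_normal((d, n))                     # columns are x_1, ..., x_n (synthetic)
W = rng.standard_normal((d, K))                     # initialization

for it in range(50):
    # E-step: q(z_i) = N(mu_i, Sigma), Sigma = (I + W^T W / sigma^2)^(-1).
    Sigma = np.linalg.inv(np.eye(K) + W.T @ W / sigma2)
    Mu = Sigma @ W.T @ X / sigma2                   # K x n, column i is mu_i

    # M-step: W = (sum_i x_i mu_i^T)(sigma^2 I + sum_i (mu_i mu_i^T + Sigma))^(-1).
    A = X @ Mu.T                                    # sum_i x_i mu_i^T
    B = sigma2 * np.eye(K) + Mu @ Mu.T + n * Sigma  # sigma^2 I + sum_i (mu_i mu_i^T + Sigma)
    W = A @ np.linalg.inv(B)

    # Monitor the marginal log likelihood sum_i ln N(x_i | 0, sigma^2 I + W W^T).
    mvn = multivariate_normal(mean=np.zeros(d), cov=sigma2 * np.eye(d) + W @ W.T)
    print(it, mvn.logpdf(X.T).sum())
```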
Comment:
◮ The probabilistic framework gives a way to learn K and σ2 as well.
[Figure: the data matrix X (e.g., 64 × 262,144), where each column is a vectorized 8 × 8 patch.]
For image problems such as denoising or inpainting (missing data):
◮ Extract overlapping patches (e.g., 8×8) and vectorize them to construct $X$
◮ Model with a factor model such as probabilistic PCA
◮ Approximate $x_i \approx W\mu_i$, where $\mu_i$ is the posterior mean of $z_i$
◮ Reconstruct the image by replacing $x_i$ with $W\mu_i$ (and averaging)
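A rough sketch of the patch bookkeeping only (the PPCA step itself is omitted; the function names and patch size are illustrative, not from the lecture):

```python
import numpy as np

def extract_patches(img, p=8):
    """Stack all overlapping p x p patches of a grayscale image as columns of X."""
    H, W = img.shape
    cols = [img[r:r + p, c:c + p].reshape(-1)
            for r in range(H - p + 1) for c in range(W - p + 1)]
    return np.stack(cols, axis=1)                    # d x n, with d = p * p

def average_patches(patches, shape, p=8):
    """Place (denoised) patches back at their locations and average the overlaps."""
    H, W = shape
    out, counts = np.zeros(shape), np.zeros(shape)
    idx = 0
    for r in range(H - p + 1):
        for c in range(W - p + 1):
            out[r:r + p, c:c + p] += patches[:, idx].reshape(p, p)
            counts[r:r + p, c:c + p] += 1
            idx += 1
    return out / counts
```

Each column $x_i$ of $X$ would then be replaced by $W\mu_i$ before averaging the patches back into the image.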
Noisy image on left, denoised image on right. The noise variance parameter σ2 was learned for this example.
Another somewhat extreme example:
◮ Image is 480×320×3 (RGB dimension)
◮ Throw away 80% at random
◮ (left) Missing data, (middle) reconstruction, (right) original image
We've seen how we can take an algorithm that uses dot products, $x^Tx$, and generalize it with a nonlinear kernel. This generalization can be made to PCA.

Recall: With PCA we find the eigenvectors of the matrix $\sum_{i=1}^n x_ix_i^T = XX^T$.
◮ Let $\varphi(x)$ be a feature mapping from $\mathbb{R}^d$ to $\mathbb{R}^D$, where $D \gg d$.
◮ We want to solve the eigendecomposition
$$\Big(\sum_{i=1}^n \varphi(x_i)\varphi(x_i)^T\Big)q_k = \lambda_k q_k$$
without having to work in the higher dimensional space.
◮ That is, how can we do PCA without explicitly using $\varphi(\cdot)$ and $q$?
Notice that we can reorganize the operations of the eigendecomposition:
$$\Big(\sum_{i=1}^n \varphi(x_i)\varphi(x_i)^T\Big)q_k = \lambda_k q_k \quad\Longrightarrow\quad \sum_{i=1}^n \varphi(x_i)\,\frac{\varphi(x_i)^Tq_k}{\lambda_k} = q_k.$$
That is, the eigenvector $q_k = \sum_{i=1}^n a_{ki}\varphi(x_i)$ for some vector $a_k \in \mathbb{R}^n$.
The trick is that instead of learning $q_k$, we'll learn $a_k$. Plug this equation for $q_k$ back into the first equation:
$$\sum_{i=1}^n \varphi(x_i)\sum_{j=1}^n a_{kj}\,\varphi(x_i)^T\varphi(x_j) = \lambda_k \sum_{i=1}^n a_{ki}\varphi(x_i).$$
When we multiply both sides by $\varphi(x_l)^T$ for each $l \in \{1, \dots, n\}$, we get a new set of linear equations,
$$K^2a_k = \lambda_k Ka_k \ \Longleftrightarrow\ Ka_k = \lambda_k a_k,$$
where $K$ is the $n \times n$ kernel matrix constructed on the data. Because $K$ is guaranteed to be PSD (it is a matrix of dot products), the left and right equations above share a solution for $(\lambda_k, a_k)$.

Now perform "regular" PCA, but on the kernel matrix $K$ instead of the data matrix $XX^T$. We summarize the algorithm on the following slide.
Given: Data $x_1, \dots, x_n$, $x \in \mathbb{R}^d$, and a kernel function $K(x_i, x_j)$.

Construct: The kernel matrix on the data, e.g., $K_{ij} = b\exp\{-\|x_i - x_j\|^2/c\}$.

Solve: The eigendecomposition $Ka_k = \lambda_k a_k$ for the first $r \ll n$ eigenvector/eigenvalue pairs $(\lambda_1, a_1), \dots, (\lambda_r, a_r)$.

Output: A new coordinate system for $x_i$ by (implicitly) mapping $\varphi(x_i)$ and then projecting $q_k^T\varphi(x_i)$:
$$x_i \ \xrightarrow{\ \text{projection}\ }\ \big[\lambda_1 a_{1i}, \dots, \lambda_r a_{ri}\big],$$
where $a_{ki}$ is the $i$th dimension of the $k$th eigenvector $a_k$.
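A compact sketch of this procedure with the Gaussian kernel; the data and the values of $b$, $c$, and $r$ below are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 100))                 # d x n data matrix (synthetic)
b, c, r = 1.0, 2.0, 2                             # kernel parameters, number of components

def kern(u, v):
    # K(x_i, x_j) = b exp(-||x_i - x_j||^2 / c)
    return b * np.exp(-np.sum((u - v) ** 2) / c)

n = X.shape[1]
K = np.array([[kern(X[:, i], X[:, j]) for j in range(n)] for i in range(n)])

# Solve K a_k = lambda_k a_k; keep the r largest eigenvalue/eigenvector pairs.
lam, A = np.linalg.eigh(K)
lam, A = lam[::-1], A[:, ::-1]

# New coordinates: x_i -> (lambda_1 a_1i, ..., lambda_r a_ri).
coords = (lam[:r, None] * A[:, :r].T).T           # n x r
```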
Q: How do we handle new data, $x_0$? Before, we could take the eigenvectors $q_k$ and project $x_0^Tq_k$, but $a_k$ is different here.

A: Recall the relationship of $a_k$ to $q_k$ in kernel PCA is $q_k = \sum_{i=1}^n a_{ki}\varphi(x_i)$. We used the "kernel trick" to avoid working with, or even defining, $\varphi(x_i)$. As with regular PCA, after mapping $x_0$ we want to project onto the eigenvectors:
$$x_0 \ \xrightarrow{\ \text{projection}\ }\ \big[\varphi(x_0)^Tq_1, \dots, \varphi(x_0)^Tq_r\big].$$
Plugging in for $q_k$:
$$\varphi(x_0)^Tq_k = \sum_{i=1}^n a_{ki}\,\varphi(x_0)^T\varphi(x_i) = \sum_{i=1}^n a_{ki}\,K(x_0, x_i).$$
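Continuing the kernel PCA sketch above, the projection of a new point uses only kernel evaluations; here `X`, `A`, `r`, and `kern` refer to the assumed names from that sketch.

```python
import numpy as np

def project_new(x0, X, A, r, kern):
    """Project a new point x0 onto the first r kernel principal directions:
    phi(x0)^T q_k = sum_i a_ki K(x0, x_i)."""
    k0 = np.array([kern(x0, X[:, i]) for i in range(X.shape[1])])
    return A[:, :r].T @ k0            # entry k is sum_i a_ki K(x0, x_i)
```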
An example of kernel PCA using the Gaussian kernel.
(left) Original data, colored for reference (but they may be classes).
(middle) New coordinates using kernel width c = 2.
(right) New coordinates using kernel width c = 10.

Terminology: What we are doing is closely related to "spectral clustering" and can be considered an instance of "manifold learning."