SLIDE 1

Principal Component Analysis

Ken Kreutz-Delgado (Nuno Vasconcelos)

UCSD — ECE 175A — Winter 2012

SLIDE 2

Curse of dimensionality

Typical observation in Bayes decision theory:

  • Error increases when number of features is large

Even for simple models (e.g. Gaussian) we need a large number of examples n to obtain good estimates.

Q: What does "large" mean? It depends on the dimension of the space.

The best way to see this is to think of a histogram:

  • suppose you have 100 points and you need at least 10 bins per axis in order to get a reasonable quantization
  • for uniform data you then get, on average, the points-per-bin counts below, which is decent in 1D, bad in 2D, and terrible in 3D (9 out of every 10 bins are empty!)

    dimension  | 1  | 2 | 3
    points/bin | 10 | 1 | 0.1
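
A tiny Python check of this arithmetic (not from the slides; it just reproduces the 100-point, 10-bins-per-axis example above):

    # average points per bin falls as n / bins**d, i.e. exponentially in the dimension d
    n_points, bins_per_axis = 100, 10
    for d in (1, 2, 3):
        print(d, n_points / bins_per_axis ** d)    # 10.0, 1.0, 0.1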

SLIDE 3

Curse of Dimensionality

This is the curse of dimensionality:

  • For a given classifier, the number of examples required to maintain classification accuracy increases exponentially with the dimension of the feature space

In higher dimensions the classifier has more parameters:

  • Therefore: higher complexity & harder to learn
SLIDE 4

Dimensionality Reduction

What do we do about this? Avoid unnecessary dimensions. "Unnecessary" features arise in two ways:

  1. features are not discriminant
  2. features are not independent (they are highly correlated)

Non-discriminant means that they do not separate the classes well

[Figure: a discriminant feature vs. a non-discriminant feature]

SLIDE 5

Dimensionality Reduction

Q: How do we detect the presence of feature correlations?
A: When features are correlated, the data "lives" in a lower dimensional subspace (up to some amount of noise). E.g. in the example above we have a 3D hyper-plane in 5D. If we can find this hyper-plane we can:

  • Project the data onto it
  • Get rid of two dimensions without introducing significant error

[Figure: 2D data (salary vs. car loan) projected onto a 1D subspace, y = aᵀx, yielding a single new feature y]

SLIDE 6

Principal Components

Basic idea:

  • If the data lives in a (lower dimensional) subspace, it is going to look very flat when viewed from the full space, e.g. a 1D subspace in 2D, or a 2D subspace in 3D

This means that:

  • If we fit a Gaussian to the data, the iso-probability contours are going to be highly skewed ellipsoids
  • The directions that explain most of the variance in the fitted data give the Principal Components of the data

SLIDE 7

Principal Components

How do we find these ellipsoids? When we talked about metrics we said that the Mahalanobis distance

    d(x_1, x_2) = (x_1 - x_2)^T \Sigma^{-1} (x_1 - x_2)

measures the "natural" units for the problem, because it is "adapted" to the covariance of the data. We also know that what is special about it is that it uses Σ⁻¹.

Hence, information about possible subspace structure must be in the covariance matrix Σ.
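
As an aside (not on the slides), a minimal Python/NumPy sketch of this distance computed from a sample covariance; the data and names are made up for illustration:

    import numpy as np

    def mahalanobis(x, y, Sigma):
        """Mahalanobis distance d(x, y) = (x - y)^T Sigma^{-1} (x - y), as defined above."""
        diff = x - y
        return float(diff @ np.linalg.solve(Sigma, diff))

    # toy example: covariance estimated from stretched 2D data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])
    Sigma = np.cov(X, rowvar=False)
    print(mahalanobis(X[0], X[1], Sigma))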

SLIDE 8

Multivariate Gaussian Review

The equiprobability contours (level sets) of a Gaussian are the points x such that

    (x - \mu)^T \Sigma^{-1} (x - \mu) = \text{const}

Let's consider the change of variable z = x - μ, which only moves the origin by μ. The resulting equation

    z^T \Sigma^{-1} z = \text{const}

is the equation of an ellipse (a hyperellipse). This is easy to see when Σ is diagonal, Σ = diag(σ₁², ..., σ_d²):

    z^T \Sigma^{-1} z = \sum_{i=1}^{d} \frac{z_i^2}{\sigma_i^2}

SLIDE 9

Gaussian Review

This is the equation of an ellipse with principal lengths σ_i.

  • E.g. when d = 2,

        \frac{z_1^2}{\sigma_1^2} + \frac{z_2^2}{\sigma_2^2} = 1

    is the ellipse with principal lengths σ₁ and σ₂ along the z₁ and z₂ axes.

[Figure: axis-aligned ellipse in the (z₁, z₂) plane with semi-axes σ₁ and σ₂]

SLIDE 10

Gaussian Review

Introduce a transformation y = Φ z. Then y has covariance

    \Sigma_y = \Phi \Sigma_z \Phi^T

If Φ is proper orthogonal (ΦᵀΦ = ΦΦᵀ = I), this is just a rotation, and we obtain a rotated ellipse whose principal components φ₁ and φ₂ are the columns of Φ.

Note that \Sigma_y = \Phi \Sigma_z \Phi^T, with Σ_z diagonal, is the eigendecomposition of Σ_y.

[Figure: the axis-aligned ellipse in z, with principal lengths σ₁, σ₂, is mapped by y = Φ z to a rotated ellipse whose axes are the columns φ₁, φ₂ of Φ]
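
A hedged numerical check of this fact (mine, not the slides'), in Python/NumPy:

    import numpy as np

    # For a rotation y = Phi z, the covariance transforms as Sigma_y = Phi Sigma_z Phi^T.
    rng = np.random.default_rng(0)
    theta = 0.5
    Phi = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])   # proper orthogonal (a rotation)
    Sigma_z = np.diag([4.0, 0.25])                      # diagonal covariance, sigma_1 >> sigma_2

    Z = rng.multivariate_normal(mean=[0, 0], cov=Sigma_z, size=100_000)  # rows are samples z^T
    Y = Z @ Phi.T                                       # apply y = Phi z to every sample

    print(np.round(np.cov(Y, rowvar=False), 2))         # sample covariance of y ...
    print(np.round(Phi @ Sigma_z @ Phi.T, 2))           # ... matches Phi Sigma_z Phi^T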

SLIDE 11

Principal Component Analysis (PCA)

If y is Gaussian with covariance Σ, the equiprobability contours are the ellipses whose

  • Principal Components φ_i are the eigenvectors of Σ
  • Principal Values (lengths) σ_i are the square roots of the eigenvalues λ_i of Σ

By computing the eigenvalues we know whether the data is flat:

    σ₁ >> σ₂ : flat        σ₁ = σ₂ : not flat

[Figures: a strongly skewed ellipse (σ₁ >> σ₂, flat data) vs. a near-circular one (σ₁ ≈ σ₂, not flat), with principal components φ₁, φ₂]
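
A minimal Python/NumPy sketch of this idea (mine, not the slides'): fit a covariance to some flat 2D data, eigendecompose it, and compare σ₁ to σ₂. All names and data are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.pi / 6
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    # flat data: large spread along one direction, small along the other, then rotated
    X = rng.normal(size=(1000, 2)) * np.array([3.0, 0.3]) @ R.T   # rows are samples

    Sigma = np.cov(X, rowvar=False)            # sample covariance
    lam, Phi = np.linalg.eigh(Sigma)           # eigenvalues (ascending) and eigenvectors
    lam, Phi = lam[::-1], Phi[:, ::-1]         # sort in decreasing order
    sigma = np.sqrt(lam)                       # principal values (lengths)

    print("principal values:", np.round(sigma, 2))    # sigma_1 >> sigma_2  =>  flat data
    print("principal components (columns):\n", np.round(Phi, 2))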

SLIDE 12

Learning-based PCA

SLIDE 13

Learning-based PCA
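
The body of these two slides did not survive extraction. Purely as a hedged sketch of what sample-based ("learning-based") PCA typically amounts to, consistent with the covariance view above and the SVD route derived below (the function name and steps are my assumptions, not the original slide content):

    import numpy as np

    def pca_from_samples(X, k):
        """Sketch of sample-based PCA. X is d x n, one example per column."""
        n = X.shape[1]
        mu = X.mean(axis=1, keepdims=True)      # sample mean
        Xc = X - mu                             # centered data matrix
        Sigma = (Xc @ Xc.T) / n                 # sample covariance (d x d)
        lam, Phi = np.linalg.eigh(Sigma)        # eigendecomposition of Sigma
        order = np.argsort(lam)[::-1]           # sort eigenvalues in decreasing order
        return Phi[:, order[:k]], lam[order[:k]]

    # usage: project the centered data onto the top-k principal components
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 200))               # d = 5 features, n = 200 examples
    Phi_k, lam_k = pca_from_samples(X, k=2)
    Y = Phi_k.T @ (X - X.mean(axis=1, keepdims=True))   # k x n PCA coefficients
    print(Y.shape, np.round(lam_k, 2))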

SLIDE 14

Principal Component Analysis

How to determine the number of eigenvectors to keep? One possibility is to plot eigenvalue magnitudes

  • This is called a Scree Plot
  • Usually there is a fast decrease in the eigenvalue magnitude, followed by a flat area
  • One good choice is the knee of this curve (see the sketch below)
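
A minimal Python/NumPy + matplotlib sketch of a scree plot (mine, not the slides'; the data is made up):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 500)) * np.linspace(3.0, 0.1, 10)[:, None]   # d = 10, n = 500
    Xc = X - X.mean(axis=1, keepdims=True)
    lam = np.sort(np.linalg.eigvalsh(Xc @ Xc.T / X.shape[1]))[::-1]       # eigenvalues, descending

    plt.plot(np.arange(1, lam.size + 1), lam, "o-")
    plt.xlabel("eigenvalue index k")
    plt.ylabel("eigenvalue magnitude")
    plt.title("Scree plot: keep the eigenvectors before the knee")
    plt.show()
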
SLIDE 15

Principal Component Analysis

Another possibility: Percentage of Explained Variance

  • Remember that eigenvalues are a measure of variance along the principal directions (eigenvectors)
  • The ratio r_k measures the % of the total variance contained in the top k eigenvalues
  • It is a measure of the fraction of data variability along the associated eigenvectors:

        r_k = \frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{n} \sigma_i^2}

SLIDE 16

Principal Component Analysis

Given r_k, a natural criterion is to pick the eigenvectors that explain p% of the data variability.

  • This can be done by plotting the ratio r_k as a function of k
  • E.g. we need 3 eigenvectors to cover 70% of the variability of this dataset (see the sketch below)
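
A minimal Python/NumPy sketch of this criterion (mine; the eigenvalues are made up): compute r_k and pick the smallest k with r_k >= p.

    import numpy as np

    def choose_k(eigenvalues, p=0.7):
        lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
        r = np.cumsum(lam) / lam.sum()              # r_k for k = 1..n
        return int(np.searchsorted(r, p) + 1), r

    lam = [4.0, 2.0, 1.5, 1.5, 1.0]                 # made-up eigenvalues
    k, r = choose_k(lam, p=0.7)
    print(k, np.round(r, 2))                        # here k = 3 reaches 70%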

SLIDE 17

PCA by SVD

There is an alternative way to compute the principal components, based on the Singular Value Decomposition.

("Condensed") Singular Value Decomposition (SVD):

  • Any full-rank n x m matrix A (n > m) can be decomposed as

        A = M \Pi N^T

  • M is an n x m (non-square) column-orthogonal matrix of left singular vectors (the columns of M)
  • Π is an m x m (square) diagonal matrix containing the m singular values (which are nonzero and strictly positive)
  • N is an m x m (square) orthogonal matrix of right singular vectors (columns of N = rows of Nᵀ)

        M^T M = I_m, \qquad N^T N = N N^T = I_m

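A quick Python/NumPy shape check of the condensed SVD (mine, not the slides'):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 8, 3
    A = rng.normal(size=(n, m))                        # full-rank n x m matrix, n > m

    M, pi, Nt = np.linalg.svd(A, full_matrices=False)  # condensed SVD: A = M diag(pi) N^T
    print(M.shape, pi.shape, Nt.shape)                 # (8, 3) (3,) (3, 3)
    print(np.allclose(A, M @ np.diag(pi) @ Nt))        # True
    print(np.allclose(M.T @ M, np.eye(m)))             # True: M is column-orthogonal
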
SLIDE 18

PCA by SVD

To relate this to PCA, we construct the d x n Data Matrix

    X = [\, x_1 \;\; \cdots \;\; x_n \,]

whose columns are the examples. The sample mean is

    \mu = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n} X \mathbf{1}

where 1 is the n x 1 vector of ones.

SLIDE 19

PCA by SVD

We center the data by subtracting the mean from each column of X. This yields the d x n Centered Data Matrix

    X_c = [\, x_1^c \;\; \cdots \;\; x_n^c \,]
        = X - \mu \mathbf{1}^T
        = X - \frac{1}{n} X \mathbf{1}\mathbf{1}^T
        = X \left( I - \frac{1}{n}\mathbf{1}\mathbf{1}^T \right)

SLIDE 20

PCA by SVD

The Sample Covariance is the d x d matrix

    \Sigma = \frac{1}{n} \sum_{i=1}^{n} x_i^c \, (x_i^c)^T

where x_i^c is the i-th column of X_c. This can be written as

    \Sigma = \frac{1}{n} X_c X_c^T

SLIDE 21

PCA by SVD

The transposed centered data matrix X_cᵀ is n x d. Assuming it has rank d, it has the (condensed) SVD

    X_c^T = \begin{bmatrix} (x_1^c)^T \\ \vdots \\ (x_n^c)^T \end{bmatrix} = M \Pi N^T,
    \qquad M^T M = I, \quad N^T N = N N^T = I

This yields

    \Sigma = \frac{1}{n} X_c X_c^T = \frac{1}{n} N \Pi M^T M \Pi N^T = \frac{1}{n} N \Pi^2 N^T

SLIDE 22

PCA by SVD

Noting that N is d x d and orthonormal, and that Π² is diagonal,

    \Sigma = N \left( \frac{1}{n} \Pi^2 \right) N^T

shows that this is just the eigenvalue decomposition of Σ. It follows that

  • The eigenvectors of Σ are the columns of N
  • The eigenvalues of Σ are

        \lambda_i = \sigma_i^2 = \frac{1}{n} \pi_i^2

This gives an alternative algorithm for PCA.

SLIDE 23

PCA by SVD

Summary of the computation of PCA by SVD (a code sketch follows below). Given X with one example per column:

  • 1) Create the (transposed) Centered Data Matrix:

        X_c^T = \left( I - \frac{1}{n}\mathbf{1}\mathbf{1}^T \right) X^T

  • 2) Compute its SVD:

        X_c^T = M \Pi N^T

  • 3) The Principal Components are the columns of N; the Principal Values are:

        \sigma_i = \sqrt{\lambda_i} = \frac{\pi_i}{\sqrt{n}}
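
A minimal Python/NumPy sketch of this recipe (my own naming; one example per column, as on the slide), with a check against the covariance route:

    import numpy as np

    def pca_by_svd(X):
        """X is d x n, one example per column."""
        d, n = X.shape
        Xc_T = (np.eye(n) - np.ones((n, n)) / n) @ X.T        # 1) transposed centered data matrix
        M, pi, Nt = np.linalg.svd(Xc_T, full_matrices=False)  # 2) SVD: Xc^T = M Pi N^T
        N = Nt.T                                              # 3) principal components = columns of N
        sigma = pi / np.sqrt(n)                               #    principal values = pi_i / sqrt(n)
        return N, sigma

    rng = np.random.default_rng(0)
    X = rng.normal(size=(3, 50))
    N, sigma = pca_by_svd(X)

    # sanity check against the eigendecomposition of the sample covariance
    lam = np.sort(np.linalg.eigvalsh(np.cov(X, bias=True)))[::-1]
    print(np.allclose(sigma**2, lam))                         # True: lambda_i = sigma_i^2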

SLIDE 24

Principal Component Analysis

Principal components are often quite informative about the structure of the data.

Example: Eigenfaces, the principal components for the space of images of faces.

  • The figure only shows the first 16 eigenvectors (eigenfaces)
  • Note lighting, structure, etc.

SLIDE 25

Principal Components Analysis

PCA has been applied to virtually all learning problems, e.g. eigenshapes for face morphing.

[Figure: morphed faces]

SLIDE 26

Principal Component Analysis

Sound

[Figures: average sound images, and the eigensounds corresponding to the three highest eigenvalues]

SLIDE 27

Principal Component Analysis

Turbulence

[Figures: flames, and the corresponding eigenflames]

SLIDE 28

Principal Component Analysis

Video

[Figures: eigenrings, and the reconstruction]

SLIDE 29

Principal Component Analysis

Text: Latent Semantic Indexing

  • Represent each document by a word histogram
  • Perform SVD on the document x word matrix
  • The principal components act as the directions of semantic concepts

[Figure: the (documents x terms) matrix is factored into (documents x concepts) times (concepts x terms)]
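
A toy Python/NumPy sketch of this (mine, not the slides'): SVD of a small document x term count matrix, keeping 2 "concept" directions. The matrix and names are made up.

    import numpy as np

    # rows = documents, columns = terms (word-histogram counts); values are invented
    A = np.array([
        [2, 1, 0, 0],     # documents about cars ("car", "truck")
        [1, 2, 0, 0],
        [0, 0, 3, 1],     # documents about flowers ("flower", "garden")
        [0, 0, 1, 2],
    ], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                    # keep the top-2 semantic "concepts"
    doc_concepts = U[:, :k] * s[:k]          # documents expressed in concept space
    concept_terms = Vt[:k, :]                # concepts expressed over terms
    print(np.round(doc_concepts, 2))
    print(np.round(concept_terms, 2))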

SLIDE 30

Latent Semantic Analysis

Applications:

  • document classification, information retrieval

Goal: solve two fundamental problems in language:

  • Synonymy: different writers use different words to describe the same idea
  • Polysemy: the same word can have multiple meanings

Reasons:

  • The original term-document matrix is too large for the computing resources
  • The original term-document matrix is noisy: for instance, anecdotal instances of terms are to be eliminated
  • The original term-document matrix is overly sparse relative to the "true" term-document matrix: it lists only the words actually in each document, whereas we might be interested in all words related to each document, a much larger set due to synonymy

SLIDE 31

Latent Semantic Analysis

After PCA some dimensions get "merged":

  • {(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}

This mitigates synonymy:

  • it merges the dimensions associated with terms that have similar meanings

And mitigates polysemy:

  • components of polysemous words that point in the "right" direction are added to the components of words that share this sense
  • conversely, components that point in other directions tend to either simply cancel out or, at worst, to be smaller than the components in the directions corresponding to the intended sense

SLIDE 32

Extensions

Soon we will talk about kernels.

  • It turns out that any algorithm which depends on the data only through dot-products, i.e. through the matrix of elements x_iᵀ x_j, can be kernelized
  • This is usually beneficial; we will see why later
  • For now we look at the question of whether PCA can be written in the inner-product form mentioned above

Recall that the data matrix is

    X = [\, x_1 \;\; \cdots \;\; x_n \,]

SLIDE 33

Extensions

Recall the centered data matrix, its SVD, and the covariance:

    X_c = X \left( I - \frac{1}{n}\mathbf{1}\mathbf{1}^T \right), \qquad
    X_c^T = M \Pi N^T, \qquad
    \Sigma = \frac{1}{n} X_c X_c^T = \frac{1}{n} N \Pi^2 N^T

This yields

    X_c^T X_c = M \Pi N^T N \Pi M^T = M \Pi^2 M^T, \qquad
    \Phi = N = X_c M \Pi^{-1}

Hence, solving for the d positive (nonzero) eigenvalues of the inner-product matrix X_cᵀX_c, and for their associated eigenvectors (the columns of M), provides an alternative way to compute the eigendecomposition of the sample covariance matrix needed to perform PCA.

SLIDE 34

Extensions

In summary, we have

    \frac{1}{n} X_c^T X_c = M \left( \frac{1}{n}\Pi^2 \right) M^T

This means that we can obtain PCA by

  • 1) Assembling the inner-product matrix X_cᵀ X_c
  • 2) Computing its eigendecomposition (M, Π²)

PCA: For the covariance matrix Σ = Φ Λ Φᵀ,

  • The principal components are then given by Φ = X_c M Π⁻¹
  • The eigenvalues are given by Λ = (1/n) Π²

SLIDE 35

Extensions

What is interesting here is that we only need the matrix

    K_c = X_c^T X_c =
    \begin{bmatrix}
      (x_1^c)^T x_1^c & \cdots & (x_1^c)^T x_n^c \\
      \vdots & \ddots & \vdots \\
      (x_n^c)^T x_1^c & \cdots & (x_n^c)^T x_n^c
    \end{bmatrix}

This is the matrix of "dot-products" (inner products) of the centered data points. Notice that you don't need the points themselves, only their dot-products (similarities).

SLIDE 36

Extensions

In summary, to get PCA:

  • 1) Compute the dot-product matrix K_c = X_cᵀ X_c
  • 2) Compute its eigendecomposition (M, Π²)

PCA: For the covariance matrix Σ = Φ Λ Φᵀ,

  • the Principal Components are given by Φ = X_c M Π⁻¹
  • the Eigenvalues are given by Λ = (1/n) Π²
  • the Projection of the centered data points onto the principal components is given by

        \Phi^T X_c = \Pi^{-1} M^T X_c^T X_c = \Pi^{-1} M^T K_c

This allows the computation of the eigenvalues and of the PCA coefficients when we only have access to the dot-product (inner product) matrix K_c (see the code sketch below).
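
A minimal Python/NumPy sketch of this inner-product route (my own code and names), checked against covariance-based PCA:

    import numpy as np

    def pca_from_gram(Kc, n):
        """Eigenvalues and PCA coefficients from the dot-product matrix K_c = X_c^T X_c."""
        w, M = np.linalg.eigh(Kc)                  # eigendecomposition of K_c (ascending)
        order = np.argsort(w)[::-1]
        w, M = w[order], M[:, order]
        keep = w > 1e-10                           # keep the strictly positive eigenvalues (pi_i^2)
        pi = np.sqrt(w[keep])
        lam = w[keep] / n                          # Lambda = (1/n) Pi^2
        coeffs = np.diag(1.0 / pi) @ M[:, keep].T @ Kc   # Phi^T X_c = Pi^{-1} M^T K_c
        return lam, coeffs

    rng = np.random.default_rng(0)
    X = rng.normal(size=(3, 40))                   # d = 3, n = 40, one example per column
    Xc = X - X.mean(axis=1, keepdims=True)
    lam, coeffs = pca_from_gram(Xc.T @ Xc, n=X.shape[1])

    lam_cov = np.sort(np.linalg.eigvalsh(Xc @ Xc.T / X.shape[1]))[::-1]
    print(np.allclose(lam, lam_cov))               # True: same eigenvalues as covariance PCA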

SLIDE 37

END