

Factor Analysis and Beyond
Chris Williams, School of Informatics, University of Edinburgh
October 2011

Overview
◮ Principal Components Analysis
◮ Factor Analysis
◮ Independent Components Analysis
◮ Non-linear Factor Analysis
◮ Reading: Handout on “Factor Analysis and Beyond”, Bishop §12.1, 12.2 (but not 12.2.1, 12.2.2, 12.2.3), 12.4 (but not 12.4.2)

Covariance matrix
◮ Let ⟨·⟩ denote an average
◮ Suppose we have a random vector X = (X_1, X_2, ..., X_d)^T
◮ ⟨X⟩ denotes the mean of X, (μ_1, μ_2, ..., μ_d)^T
◮ σ_ii = ⟨(X_i − μ_i)^2⟩ is the variance of component i (gives a measure of the “spread” of component i)
◮ σ_ij = ⟨(X_i − μ_i)(X_j − μ_j)⟩ is the covariance between components i and j
◮ In d dimensions there are d variances and d(d − 1)/2 covariances, which can be arranged into a covariance matrix Σ
◮ The population covariance matrix is denoted Σ, the sample covariance matrix is denoted S
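To make the definitions above concrete, here is a minimal NumPy sketch (my own toy example, not part of the lecture) that builds the sample covariance matrix S from a small made-up dataset and checks it against np.cov.

```python
# A minimal sketch of the sample covariance matrix S; the data matrix X
# holds n observations in its rows (toy, correlated 3-D data).
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
X = rng.normal(size=(n, d)) @ A.T        # correlated toy data

mu = X.mean(axis=0)                      # sample mean (mu_1, ..., mu_d)
S = (X - mu).T @ (X - mu) / (n - 1)      # sample covariance matrix S (d x d)

# S holds the d variances on its diagonal and the d(d-1)/2 distinct
# covariances off the diagonal; np.cov computes the same quantity.
assert np.allclose(S, np.cov(X, rowvar=False))
```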

Principal Components Analysis
◮ If you want to use a single number to describe a whole vector drawn from a known distribution, pick the projection of the vector onto the direction of maximum variation (variance)
◮ Assume ⟨x⟩ = 0
◮ y = w · x
◮ Choose w to maximize ⟨y^2⟩, subject to w · w = 1
◮ Solution: w is the eigenvector corresponding to the largest eigenvalue of Σ = ⟨xx^T⟩

◮ Generalize this to consider projection from d dimensions down to m
◮ Σ has eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_d ≥ 0
◮ The directions to choose are the first m eigenvectors of Σ, corresponding to λ_1, ..., λ_m
◮ w_i · w_j = 0 for i ≠ j
◮ Fraction of total variation explained by using m principal components is (Σ_{i=1}^m λ_i) / (Σ_{i=1}^d λ_i)
◮ PCA is basically a rotation of the axes in the data space

Factor Analysis
◮ A latent variable model; can the observations be explained in terms of a small number of unobserved latent variables?
◮ FA is a proper statistical model of the data; it explains covariance between variables rather than variance (cf PCA)
◮ FA has a controversial rôle in the social sciences

◮ visible variables: x = (x_1, ..., x_d)
◮ latent variables: z = (z_1, ..., z_m), z ∼ N(0, I_m)
◮ noise variables: e = (e_1, ..., e_d), e ∼ N(0, Ψ), where Ψ = diag(ψ_1, ..., ψ_d)
◮ Assume x = μ + Wz + e; then the covariance structure of x is C = WW^T + Ψ
◮ W is called the factor loadings matrix
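As a quick illustration of both slides (my own sketch, with toy sizes d = 5, m = 2 and randomly chosen W and Ψ), the code below samples data from the FA model x = μ + Wz + e and then runs PCA by eigendecomposition of the sample covariance, reporting the fraction of variance explained by the first m components.

```python
# Sample from the FA generative model, then do PCA on the samples.
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 2000, 5, 2
W = rng.normal(size=(d, m))                 # factor loadings (toy choice)
psi = rng.uniform(0.1, 0.5, size=d)         # diagonal of Psi (toy choice)
mu = np.zeros(d)

Z = rng.normal(size=(n, m))                 # z ~ N(0, I_m)
E = rng.normal(size=(n, d)) * np.sqrt(psi)  # e ~ N(0, Psi)
X = mu + Z @ W.T + E                        # x = mu + W z + e

# PCA: eigenvectors of the sample covariance, eigenvalues sorted descending
S = np.cov(X, rowvar=False)
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]

Y = (X - X.mean(axis=0)) @ U[:, :m]         # project onto the first m PCs
explained = lam[:m].sum() / lam.sum()       # fraction of total variation
print(f"fraction of variance explained by {m} PCs: {explained:.2f}")

# For large n the model covariance WW^T + Psi should be close to S
print(np.round(W @ W.T + np.diag(psi) - S, 2))
```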

◮ p(x | z) ∼ N(Wz + μ, Ψ)
◮ p(x) = ∫ p(x | z) p(z) dz
◮ p(x) ∼ N(μ, WW^T + Ψ)
◮ p(x) is like a multivariate Gaussian pancake

◮ Rotation of solution: if W is a solution, so is WR where RR^T = I_m, as (WR)(WR)^T = WW^T. This causes a problem if we want to interpret factors. A unique solution can be imposed by various conditions, e.g. that W^T Ψ^{-1} W is diagonal.
◮ Is the FA model a simplification of the covariance structure? S has d(d + 1)/2 independent entries. Ψ and W together have d + dm free parameters (and the uniqueness condition above can reduce this). The FA model makes sense if the number of free parameters is less than d(d + 1)/2.

FA example [from Mardia, Kent & Bibby, table 9.4.1]
◮ Correlation matrix of five exam scores:

                mechanics  vectors  algebra  analysis  statistics
    mechanics       1       0.553    0.547    0.410      0.389
    vectors                   1      0.610    0.485      0.437
    algebra                            1      0.711      0.665
    analysis                                    1        0.607
    statistics                                              1

◮ Maximum likelihood FA (impose that W^T Ψ^{-1} W is diagonal). Require m ≤ 2, otherwise there are more free parameters than entries in S.

◮ Factor loadings:

    Variable   m = 1    m = 2 (not rotated)   m = 2 (rotated)
                 w1        w1       w2           w̃1      w̃2
    1          0.600     0.628    0.372        0.270    0.678
    2          0.667     0.696    0.313        0.360    0.673
    3          0.917     0.899   -0.050        0.743    0.510
    4          0.772     0.779   -0.201        0.740    0.317
    5          0.724     0.728   -0.200        0.698    0.286

◮ The 1-factor solution and the first factor of the 2-factor solution differ (cf PCA)
◮ Problem of interpretation due to rotation of factors
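A small sketch (my own, with arbitrary toy dimensions) that checks the rotation point numerically and compares the FA parameter count against the number of independent entries in S.

```python
# (i) Rotating W by any orthogonal R leaves WW^T (and hence WW^T + Psi) unchanged.
# (ii) Count the FA free parameters d + d*m versus the d(d+1)/2 entries of S.
import numpy as np

rng = np.random.default_rng(2)
d, m = 5, 2
W = rng.normal(size=(d, m))

# Random orthogonal R from a QR decomposition, so R R^T = I_m
R, _ = np.linalg.qr(rng.normal(size=(m, m)))
assert np.allclose((W @ R) @ (W @ R).T, W @ W.T)   # same covariance structure

free_params = d + d * m            # Psi (d) plus W (d*m); the uniqueness condition removes m(m-1)/2 of these
cov_entries = d * (d + 1) // 2     # independent entries of S
print(free_params, cov_entries)    # 15 vs 15 here, so m <= 2 is the largest sensible choice for d = 5
```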

FA for visualization
◮ p(z | x) ∝ p(z) p(x | z). The posterior is a Gaussian. If z is low dimensional, it can be used for visualization (as with PCA).
[Figure: data points in (x_1, x_2) space and the one-dimensional latent space x = zw]

Learning W, Ψ
◮ Maximum likelihood solution available (Lawley/Jöreskog)
◮ EM algorithm for ML solution (Rubin and Thayer, 1982)
◮ E-step: for each x_i, infer p(z | x_i)
◮ M-step: do linear regression from z to x to get W
◮ Choice of m difficult (see Bayesian methods later)

Comparing FA and PCA
◮ Both are linear methods and model second-order structure S
◮ FA is invariant to changes in scaling on the axes, but not rotation invariant (cf PCA)
◮ FA models covariance, PCA models variance

Probabilistic PCA
◮ Tipping and Bishop (1997), see Bishop §12.2. Let Ψ = σ^2 I.
◮ In this case W_ML spans the space defined by the first m eigenvectors of S
◮ PCA and FA give the same results as Ψ → 0
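A minimal EM sketch for maximum-likelihood FA following the E-step/M-step outline above (my own illustration, not code from the lecture; X is assumed centred, with n rows of d-dimensional observations).

```python
import numpy as np

def fa_em(X, m, n_iter=200, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(d, m))    # factor loadings
    psi = X.var(axis=0)                       # diagonal of Psi (noise variances)

    for _ in range(n_iter):
        # E-step: p(z | x) = N(G W^T Psi^{-1} x, G), with G = (I + W^T Psi^{-1} W)^{-1}
        G = np.linalg.inv(np.eye(m) + (W.T / psi) @ W)
        Ez = X @ (W / psi[:, None]) @ G       # E[z | x_i], one row per data point
        Ezz = n * G + Ez.T @ Ez               # sum_i E[z z^T | x_i]

        # M-step: "linear regression from z to x" for W, then refit the noise variances
        W = (X.T @ Ez) @ np.linalg.inv(Ezz)
        psi = (np.sum(X * X, axis=0) - np.sum((Ez @ W.T) * X, axis=0)) / n
        psi = np.maximum(psi, 1e-8)           # guard against numerical underflow

    return W, psi
```

After fitting, W @ W.T + np.diag(psi) should be close to the sample covariance of the centred data; scikit-learn's FactorAnalysis offers a maximum-likelihood implementation if a tested routine is preferred.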

Example Application: Handwritten Digits Recognition
Hinton, Dayan and Revow, IEEE Trans Neural Networks 8(1), 1997
◮ Do digit recognition with class-conditional densities
◮ 8 × 8 images ⇒ 64 · 65/2 entries in the covariance matrix
◮ 10-dimensional latent space used
◮ Visualization of the W matrix: each hidden unit gives rise to a weight image ...
◮ In practice use a mixture of FAs!

Useful Texts on PCA and FA
◮ B. S. Everitt and G. Dunn, “Applied Multivariate Data Analysis”, Edward Arnold, 1991.
◮ C. Chatfield and A. J. Collins, “Introduction to Multivariate Analysis”, Chapman and Hall, 1980.
◮ K. V. Mardia, J. T. Kent and J. M. Bibby, “Multivariate Analysis”, Academic Press, 1979.

Independent Components Analysis
◮ A non-Gaussian latent variable model, plus a linear transformation, e.g.
    p(z) ∝ ∏_{i=1}^m e^{−|z_i|},    x = Wz + μ + e
◮ Rotational symmetry in z-space is now broken
◮ p(x) is non-Gaussian; go beyond second-order statistics of the data for fitting the model
◮ Can be used with dim(z) = dim(x) for blind source separation

[Figure: mixed and unmixed signals]
◮ Blind source separation demo: Te-Won Lee
◮ http://www.cnl.salk.edu/~tony/ica.html
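A toy blind-source-separation sketch in the spirit of the ICA slide (my own example, not the Te-Won Lee demo): two Laplacian sources are mixed with a random matrix and recovered with scikit-learn's FastICA.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(3)
n = 5000
Z = rng.laplace(size=(n, 2))            # non-Gaussian sources, p(z_i) proportional to exp(-|z_i|)
W = rng.normal(size=(2, 2))             # mixing matrix, dim(z) = dim(x)
X = Z @ W.T                             # observed mixtures x = W z

ica = FastICA(n_components=2, random_state=0)
Z_hat = ica.fit_transform(X)            # estimated sources (up to order, scale and sign)

# Correlation between true and recovered sources; each row should contain
# one entry near +1 or -1 if the separation worked.
C = np.corrcoef(Z.T, Z_hat.T)[:2, 2:]
print(np.round(C, 2))
```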

A General View of Latent Variable Models
◮ p(x) = ∫ p(x | z) p(z) dz
[Figure: graphical model with latent variables z and observed variables x]
◮ Clustering: z is a 1-of-m encoding
◮ Factor analysis: z ∼ N(0, I_m)
◮ ICA: p(z) = ∏_i p(z_i), and each p(z_i) is non-Gaussian
◮ Latent Dirichlet Allocation: z ∼ Dir(α) (Blei et al, 2003). Used especially for “topic modelling” of documents

Non-linear Factor Analysis
◮ For PPCA, p(x | z) ∼ N(Wz + μ, σ^2 I)
◮ If we make the prediction of the mean a non-linear function of z, we get non-linear factor analysis, with p(x | z) ∼ N(φ(z), σ^2 I) and φ(z) = (φ_1(z), φ_2(z), ..., φ_d(z))^T
◮ However, there is a problem: we can't do the integral analytically, so we need to approximate it by sampling,
    p(x) ≃ (1/K) Σ_{k=1}^K p(x | z_k)
  where the samples z_k are drawn from the density p(z). Note that the approximation to p(x) is a mixture of Gaussians.

Fitting the Model to Data
◮ Adjust the parameters of φ and σ^2 to maximize the log likelihood of the data
◮ For a simple form of mapping, φ(z) = Σ_i w_i ψ_i(z), we can obtain EM updates for the weights {w_i} and the variance σ^2
◮ We are fitting a constrained mixture of Gaussians to the data. The algorithm works quite like Kohonen's self-organizing map (SOM), but is more principled as there is an objective function.

Generative Topographic Mapping (Bishop, Svensen and Williams, 1997/8)
[Figure: a regular grid in latent space (z_1, z_2) mapped by φ onto a curved manifold in data space (x_1, x_2, x_3)]
◮ Do GTM demo
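A short sketch of the sampling approximation above (the mapping φ, σ^2 and the test points are toy choices of my own): p(x) is approximated by a K-component mixture of Gaussians with means φ(z_k).

```python
import numpy as np

def phi(z):
    # a toy non-linear mapping from m = 1 latent dimension to d = 2 data dimensions
    return np.stack([z[:, 0], np.sin(2.0 * z[:, 0])], axis=1)

def log_px(x, K=1000, sigma2=0.05, seed=4):
    rng = np.random.default_rng(seed)
    zk = rng.normal(size=(K, 1))                      # z_k ~ p(z) = N(0, I_m)
    mu = phi(zk)                                      # component means phi(z_k)
    d = x.shape[0]
    sq = np.sum((x - mu) ** 2, axis=1)                # squared distance to each mean
    log_comp = -0.5 * sq / sigma2 - 0.5 * d * np.log(2 * np.pi * sigma2)
    return np.logaddexp.reduce(log_comp) - np.log(K)  # log of the mixture average

print(log_px(np.array([0.5, np.sin(1.0)])))           # a point near the manifold
print(log_px(np.array([0.5, 2.5])))                   # a point far from it: much lower density
```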

Visualization
◮ The mean may be a bad summary of the posterior distribution
[Figure: a bimodal posterior P(z | x) over z, with its mean marked by +]

Manifold Learning
◮ A manifold is a topological space that is locally Euclidean
◮ We are particularly interested in the case of non-linear dimensionality reduction, where a low-dimensional nonlinear manifold is embedded in a high-dimensional space
◮ As well as GTM, there are other methods for non-linear dimensionality reduction. Some recent methods based on eigendecomposition include:
◮ Isomap (Tenenbaum et al, 2000)
◮ Locally linear embedding (Roweis and Saul, 2000)
◮ Laplacian eigenmaps (Belkin and Niyogi, 2001)
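A brief usage sketch (my own example, not from the lecture) of two of the eigendecomposition-based methods listed above, using scikit-learn's implementations on the standard swiss-roll dataset, a 2-D nonlinear manifold embedded in 3-D.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

iso = Isomap(n_neighbors=10, n_components=2)
Y_iso = iso.fit_transform(X)              # 2-D embedding from geodesic distances

lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
Y_lle = lle.fit_transform(X)              # 2-D embedding from local linear fits

print(Y_iso.shape, Y_lle.shape)           # (1500, 2) (1500, 2)
```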
