
Latent Variable Models

CS3750 Xiaoting Li


Outline

  • Latent Variable Models
  • Expectation Maximization Algorithm (EM)
  • Factor Analysis
  • Probabilistic Principal Component Analysis
    • Model Formulation
    • Maximum Likelihood for PPCA
    • EM for PPCA
    • Examples
  • Sensible Principal Component Analysis
    • Model Formulation
    • EM for SPCA
  • References




Latent Variable Models: Motivation

  • Gaussian mixture models
  • A single Gaussian is not a good fit to data
  • But two different Gaussians may do
  • True class of each point is unobservable


Latent Variable Models

A latent variable model is a probability distribution p over two sets of variables s, x: p(x, s; θ), where the x variables are observed at learning time in a dataset D and the s variables are never observed


Latent Variable Models

  • The goal of a latent variable model is to express the distribution p(x) of the variables x_1, …, x_d in terms of a smaller number of latent variables s = (s_1, …, s_q), where q < d

[Graphical model: latent nodes s_1, s_2, s_3 with directed edges to observed nodes x_1, …, x_4]

Latent variable: s, q dimensions, q < d. Observed variable: x, d dimensions.


Expectation-Maximization (EM) Algorithm

  • The EM algorithm is a hugely important and widely used algorithm for learning directed latent-variable graphical models
  • The key idea of the method: compute the parameter estimates iteratively by performing the following two steps:
  • 1. Expectation step. For all hidden and missing variables (and their possible value assignments), calculate their expectations under the current set of parameters Θ'
  • 2. Maximization step. Compute the new estimates of Θ by considering the expectations of the different value completions

  • Stop when no improvement possible
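As a concrete illustration of the two steps above, here is a minimal EM sketch for the two-component Gaussian mixture from the motivation slide; the data, initialization, and iteration count are illustrative assumptions, not part of the slides.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """Minimal EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    # Initial parameter guess Theta' = (mixing weights, means, variances)
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: expected ("soft") assignment of each point to each component
        dens = np.stack([w[k] / np.sqrt(2 * np.pi * var[k])
                         * np.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)])
        resp = dens / dens.sum(axis=0)
        # M-step: new estimates of Theta from the expected value completions
        nk = resp.sum(axis=1)
        w = nk / len(x)
        mu = (resp * x).sum(axis=1) / nk
        var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk
    return w, mu, var

# Hypothetical data: two overlapping clusters whose true labels are unobserved
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])
print(em_gmm_1d(x))
```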


Factor Analysis

  • Assumptions:
  • Underlying latent variable has a Gaussian distribution
  • s ~ N(0, I): independent, Gaussian with unit variance
  • Linear relationship between latent and observed variables
  • Diagonal Gaussian noise in the data dimensions
  • ε ~ N(0, Ψ): Gaussian noise


Factor Analysis

  • A common latent variable model where the relationship is linear:

  x = Ws + μ + ε

  • d-dimensional observation vector x
  • q-dimensional vector of latent variables s
  • d × q matrix W relates the two sets of variables, q < d
  • μ permits the model to have a non-zero mean
  • s ~ N(0, I): independent, Gaussian with unit variance
  • ε ~ N(0, Ψ): Gaussian noise
  • Then x ~ N(μ, WWᵀ + Ψ)
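A minimal sketch (with hypothetical dimensions and parameters) of the generative process x = Ws + μ + ε above; the empirical covariance of the samples should approach WWᵀ + Ψ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, n = 5, 2, 100_000                       # hypothetical dimensions and sample size
W = rng.normal(size=(d, q))                   # d x q weight (factor loading) matrix
mu = rng.normal(size=d)                       # location parameter
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))  # diagonal noise covariance

s = rng.normal(size=(n, q))                                 # s ~ N(0, I)
eps = rng.multivariate_normal(np.zeros(d), Psi, size=n)     # eps ~ N(0, Psi)
x = s @ W.T + mu + eps                                      # x = W s + mu + eps

# Empirical covariance of x should be close to W W^T + Psi
print(np.allclose(np.cov(x, rowvar=False), W @ W.T + Psi, atol=0.05))
```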


Factor Analysis

[Graphical model: latent nodes s_1, s_2, s_3 with directed edges to observed nodes x_1, …, x_4]

Latent variable: s, q dimensions. Observed variable: x, d dimensions.
s ~ N(0, I)
Mapping: Ws (weight matrix W), μ (location parameter)
ε ~ N(0, Ψ): Gaussian noise
x = Ws + μ + ε
x ~ N(μ, WWᵀ + Ψ)

Parameters of interest: W (weight matrix), Ψ (variance of noise), μ


Factor Analysis: Optimization

  • Use EM to solve for the parameters
  • E-step:
  • compute the posterior p(s|x)
  • M-step:
  • take derivatives of the expected complete log-likelihood with respect to the parameters


Principal Component Analysis

  • General motivation is to transform the data into some reduced-dimensionality representation
  • Linear transformation of a d-dimensional input x to a q-dimensional vector s, such that q < d, under which the retained variance is maximal
  • Limitations:
  • Absence of an associated probabilistic model for the observed data
  • Computationally intensive for the covariance matrix
  • Does not deal properly with missing data


Probabilistic PCA

  • Motivations:
  • The corresponding likelihood measure would permit comparison with other density-estimation techniques and would facilitate statistical testing
  • Provides a natural framework for thinking about hypothesis testing
  • Offers the potential to extend the scope of conventional PCA
  • Can be utilized as a constrained Gaussian density model
  • Constrained covariance
  • Allows us to deal with missing values in the data set
  • Can be used to model class-conditional densities, and hence can be applied to classification problems


Generative View of PPCA

  • Generative view of the PPCA model for a 2-d data space and a 1-d latent space

[Figure: samples of the 1-d latent variable s mapped through W into the 2-d data space]

PPCA

  • Assumptions:
  • Underlying q-dimensional latent variable s has a Gaussian distribution
  • Linear relationship between the q-dimensional latent s and the d-dimensional observed variables x
  • Isotropic Gaussian noise in the observed dimensions
  • Noise variances constrained to be equal


PPCA

  • A special case of factor analysis, with the noise variances constrained to be equal:
  • ε ~ N(0, σ²I)
  • The conditional probability distribution over x-space given s:
  • x|s ~ N(Ws + μ, σ²I)
  • Latent variables:
  • s ~ N(0, I)
  • The distribution of the observed data x is obtained by integrating out the latent variables:
  • x ~ N(μ, C)
  • E[x] = E[μ + Ws + ε] = μ + W E[s] + E[ε] = μ + W·0 + 0 = μ
  • C = WWᵀ + σ²I (the observation covariance model)
  • C = Cov[x] = E[(μ + Ws + ε − μ)(μ + Ws + ε − μ)ᵀ] = E[(Ws + ε)(Ws + ε)ᵀ] = WWᵀ + σ²I
  • The maximum-likelihood estimator for μ is given by the mean of the data; S is the sample covariance matrix of the observations {x_n}
  • Estimates for W and σ² can be obtained in two ways:
  • Closed form
  • EM algorithm


PPCA

[Graphical model: latent nodes s_1, s_2, s_3 with directed edges to observed nodes x_1, …, x_4]

Latent variable: s, q dimensions. Observed variable: x, d dimensions.
s ~ N(0, I)
Mapping: Ws (weight matrix W), μ (location parameter)
Random error (noise): ε ~ N(0, σ²I)
x = Ws + μ + ε
x ~ N(μ, WWᵀ + σ²I)

Parameters of interest: W (weight matrix), σ² (variance of noise), μ


Factor Analysis vs. PPCA

  • PPCA
  • x ~ N(μ, WWᵀ + σ²I)
  • Isotropic error
  • Factor Analysis
  • x ~ N(μ, WWᵀ + Ψ)
  • The error covariance is a diagonal matrix
  • FA doesn't change if you scale the variables
  • FA looks for directions of large correlation in the data
  • FA doesn't chase large-noise features that are uncorrelated with other features
  • FA changes if you rotate the data
  • Can't interpret multiple factors as being unique


Maximum Likelihood for PPCA

  • The log-likelihood of the observed data under this model is given by

  ℒ = Σ_{n=1}^{N} ln p(x_n) = −(Nd/2) ln(2π) − (N/2) ln|C| − (N/2) tr(C⁻¹S)

  • where S is the sample covariance matrix of the observations {x_n}:

  S = (1/N) Σ_{n=1}^{N} (x_n − μ)(x_n − μ)ᵀ

  • C = WWᵀ + σ²I
  • The log-likelihood is maximized when the columns of W span the principal subspace of the data
  • Fit the parameters (W, μ, σ²) by maximum likelihood: make the constrained model covariance as close as possible to the observed covariance
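A small sketch of evaluating this log-likelihood for given parameters (W, μ, σ²); the function name and data shapes are assumptions for illustration.

```python
import numpy as np

def ppca_log_likelihood(X, W, mu, sigma2):
    """L = -N/2 * (d*ln(2*pi) + ln|C| + tr(C^-1 S)) with C = W W^T + sigma^2 I; X is N x d."""
    N, d = X.shape
    C = W @ W.T + sigma2 * np.eye(d)
    Xc = X - mu
    S = Xc.T @ Xc / N                       # sample covariance of the observations
    _, logdet = np.linalg.slogdet(C)
    return -N / 2 * (d * np.log(2 * np.pi) + logdet + np.trace(np.linalg.solve(C, S)))
```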


Maximum Likelihood for PPCA

  • Consider the derivative of the log-likelihood with respect to W:

  ∂ℒ/∂W = N(C⁻¹SC⁻¹W − C⁻¹W)

  • Maximizing with respect to W gives

  W_ML = U_q(Λ_q − σ²I)^{1/2}R

  • where
  • the q column vectors in U_q are eigenvectors of S, with corresponding eigenvalues in the diagonal matrix Λ_q
  • R is an arbitrary q × q orthogonal rotation matrix
  • For W = W_ML, the maximum-likelihood estimator for σ² is given by

  σ²_ML = (1/(d − q)) Σ_{j=q+1}^{d} λ_j

  • the average variance associated with the discarded dimensions
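A minimal sketch of this closed-form ML fit: eigendecompose the sample covariance S, keep the top q eigenvectors, and form W_ML and σ²_ML as above (R is taken to be the identity); names and shapes are illustrative.

```python
import numpy as np

def ppca_closed_form(X, q):
    """Closed-form ML fit of PPCA; X is an N x d data matrix."""
    N, d = X.shape
    mu = X.mean(axis=0)                            # ML estimate of the mean
    S = np.cov(X, rowvar=False, bias=True)         # sample covariance S (1/N normalisation)
    eigvals, eigvecs = np.linalg.eigh(S)           # ascending eigenvalues
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    sigma2 = eigvals[q:].mean()                    # average variance of the discarded dimensions
    W = eigvecs[:, :q] @ np.diag(np.sqrt(eigvals[:q] - sigma2))  # W_ML with R = I
    return W, mu, sigma2
```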

Maximum Likelihood for PPCA

  • Consider the derivative with respect to W:

  ∂ℒ/∂W = N(C⁻¹SC⁻¹W − C⁻¹W)

  • At the stationary points, SC⁻¹W = W, assuming that C⁻¹ exists
  • Three possible classes of solutions:
  • W = 0: a minimum of the log-likelihood
  • C = S:
  • the covariance model is exact
  • WWᵀ = S − σ²I has a known solution W = U(Λ − σ²I)^{1/2}R, where U is a square matrix whose columns are the eigenvectors of S, Λ is the corresponding diagonal matrix of eigenvalues, and R is an arbitrary orthogonal matrix
  • SC⁻¹W = W, with W ≠ 0 and C ≠ S


Maximum Likelihood for PPCA

  • Consider the derivative with respect to W:

  ∂ℒ/∂W = N(C⁻¹SC⁻¹W − C⁻¹W)

  • At the stationary points, SC⁻¹W = W, assuming that C⁻¹ exists
  • Case: SC⁻¹W = W, with W ≠ 0 and C ≠ S
  • Express the parameter matrix W in terms of its singular value decomposition (SVD):
  • W = ULVᵀ, where U: d × q orthonormal vectors, L: q × q matrix of singular values, V: q × q orthogonal matrix
  • C⁻¹W = W(σ²I + WᵀW)⁻¹ = UL(σ²I + L²)⁻¹Vᵀ
  • At the stationary points:
  • SUL(σ²I + L²)⁻¹Vᵀ = ULVᵀ
  • SUL = U(σ²I + L²)L


Maximum Likelihood for PPCA

  • The column vectors u_j of U are eigenvectors of S, with eigenvalues λ_j such that σ² + l_j² = λ_j:
  • Su_j = (σ² + l_j²)u_j
  • l_j = (λ_j − σ²)^{1/2}
  • Substituting back into the SVD: W = U_q(Λ_q − σ²I)^{1/2}R
  • U_q: d × q matrix whose q columns are eigenvectors u_j of S
  • Λ_q: q × q diagonal matrix with elements λ_1, …, λ_q (the eigenvalues corresponding to the u_j), or σ² (equivalent to l_j = 0)
  • R: arbitrary orthogonal matrix, equivalent to a rotation in the principal subspace (or a re-parametrization)


EM for PPCA

  • Goal: estimate the model parameters W and σ², based on the observed dataset
  • Rather than solving directly, we can apply EM
  • EM can be scaled to very large high-dimensional datasets
  • Consider the latent variables s_n to be 'missing' data
  • Need the complete-data log-likelihood:

  ℒ_C = Σ_{n=1}^{N} ln p(x_n, s_n)

  • Since x|s ~ N(Ws + μ, σ²I) and s ~ N(0, I), we have

  p(x_n, s_n) = (2πσ²)^{−d/2} exp(−‖x_n − Ws_n − μ‖² / (2σ²)) · (2π)^{−q/2} exp(−‖s_n‖² / 2)

EM for PPCA

  • E-step:
  • Compute the expectation of the complete-data log-likelihood with respect to the posterior of the latent variables
  • Take the expectation of ℒ_C with respect to the distributions p(s_n | x_n, W, σ²):

  ⟨ℒ_C⟩ = −Σ_{n=1}^{N} { (d/2) ln σ² + (1/2) tr(⟨s_n s_nᵀ⟩) + (1/(2σ²)) (x_n − μ)ᵀ(x_n − μ) − (1/σ²) ⟨s_n⟩ᵀWᵀ(x_n − μ) + (1/(2σ²)) tr(WᵀW⟨s_n s_nᵀ⟩) }


PPCA Examples

  • Missing data
  • A natural approach to the estimation of the principal axes in cases where some, or indeed all, of the data vectors exhibit one or more missing (at random) values
  • Fig. 1(a): projection of 38 examples from the 18-dimensional Tobamovirus data (Ripley 1996) using standard PCA
  • Fig. 1(b): an equivalent PPCA projection obtained by using an EM algorithm
  • Missing data simulated by randomly removing each value in the data set with probability 20%


PPCA Examples

  • Mixtures of probabilistic principal component analysis models
  • Combining multiple PCA models, notably for image compression
  • Fig. 2: three PCA projections of the virus data obtained from a three-component mixture model, optimized using an EM algorithm
  • Effectively implements simultaneous automated clustering and visualization of the data


PPCA Examples

  • Controlling the degrees of freedom
  • Applied as a covariance model of the data
  • Permits control of the model complexity through the choice of q
  • The covariance model in PPCA comprises dq + 1 − q(q − 1)/2 free parameters
  • Table 1: estimated prediction error for various Gaussian models fitted to the Tobamovirus data
  • PPCA with q = 2 gives the lowest error


Sensible Principal Component Analysis (SPCA)

  • SPCA
  • x = Ws + ε
  • x ~ N(0, WWᵀ + σ²I)
  • Similar to PCA; the differences are:
  • The noise covariance matrix is required to be a multiple σ²I of the identity matrix, but we do not take the limit σ² → 0
  • During EM iterations, data can be directly generated from the SPCA model, and the likelihood estimated on a test data set
  • The likelihood is much lower for data far away from the training set, even if they are near the principal subspace


EM for SPCA

  • SPCA
  • x ~ N(0, WWᵀ + σ²I)
  • E-step:
  • β = Wᵀ(WWᵀ + σ²I)⁻¹
  • ⟨s_n | x_n⟩ = β(X − μ)
  • Σ_s = nI − nβW + ⟨s_n | x_n⟩⟨s_n | x_n⟩ᵀ
  • The log-likelihood is written in terms of the weight matrix W, the centered observed data matrix X − μ, the noise covariance σ²I, and the conditional latent mean ⟨s_n | x_n⟩

EM for SPCA

  • SPCA
  • x ~ N(0, WWᵀ + σ²I)
  • M-step:
  • W_new = (X − μ)⟨s_n | x_n⟩ᵀ Σ_s⁻¹
  • σ²_new = trace[XXᵀ − W_new⟨s_n | x_n⟩(X − μ)ᵀ] / (nd)
  • Obtained by differentiating the log-likelihood with respect to W and σ² and setting the derivatives to zero
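A minimal sketch of these E- and M-step updates in data-matrix form (one column per example), assuming the data are centred inside the function; the initialization and iteration count are arbitrary choices, not prescribed by the slides.

```python
import numpy as np

def em_spca(X, q, n_iter=100, seed=0):
    """EM for SPCA; X is a d x n data matrix (one column per example)."""
    d, n = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu                                   # centred data (X - mu)
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(d, q))                   # arbitrary initialization
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: beta = W^T (W W^T + sigma^2 I)^-1, <S> = beta (X - mu)
        beta = np.linalg.solve(W @ W.T + sigma2 * np.eye(d), W).T
        Es = beta @ Xc                            # q x n matrix of conditional latent means
        Sigma_s = n * np.eye(q) - n * beta @ W + Es @ Es.T
        # M-step: W_new and sigma^2_new
        W = Xc @ Es.T @ np.linalg.inv(Sigma_s)
        sigma2 = np.trace(Xc @ Xc.T - W @ Es @ Xc.T) / (n * d)
    return W, mu, sigma2
```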


EM for SPCA

  • Since σ²I is diagonal, the inversion in the E-step can be performed efficiently using the matrix inversion lemma:

  (WWᵀ + σ²I)⁻¹ = I/σ² − W(I + WᵀW/σ²)⁻¹Wᵀ/(σ²)²

  • Since we only take the trace of the matrix in the M-step, we do not need to compute the full sample covariance XXᵀ, but instead can compute only the variance along each coordinate:

  σ²_new = trace[XXᵀ − W_new⟨s_n | x_n⟩(X − μ)ᵀ] / (nd)

  • This shows that learning for SPCA has a complexity limited by O(dnq), and not worse
  • Methods that explicitly compute the sample covariance matrix have complexity O(nd²)
  • The EM algorithm does not require computation of the sample covariance matrix: O(dnq)
  • A huge advantage when q << d (the number of principal components is much smaller than the original number of variables)
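A quick numerical check of the inversion identity above, with hypothetical sizes; the lemma form only requires a q × q inverse rather than a d × d one.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, sigma2 = 50, 3, 0.5                 # hypothetical sizes
W = rng.normal(size=(d, q))

direct = np.linalg.inv(W @ W.T + sigma2 * np.eye(d))    # d x d inversion
lemma = (np.eye(d) / sigma2
         - W @ np.linalg.inv(np.eye(q) + W.T @ W / sigma2) @ W.T / sigma2 ** 2)  # only a q x q inverse
print(np.allclose(direct, lemma))         # True
```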


Software

  • Matlab
  • https://www.mathworks.com/help/stats/ppca.html
  • R Programming
  • https://www.rdocumentation.org/packages/pcaMethods/versions/1.64.0/topics/ppca


References

  • Roweis, Sam T. "EM algorithms for PCA and SPCA." Advances in Neural Information Processing Systems. 1998.
  • Tipping, Michael E., and Christopher M. Bishop. "Probabilistic principal component analysis." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61.3 (1999): 611-622.
  • Bishop, Christopher M. "Latent variable models." Learning in Graphical Models. Springer, Dordrecht, 1998. 371-403.
  • https://www.cs.toronto.edu/~hinton/csc2515/notes/lec7middle.pdf
  • https://www.cs.toronto.edu/~rsalakhu/STA4273_2015/notes/Lecture8_2015.pdf
  • https://www.cs.ubc.ca/~schmidtm/Courses/540-W16/L12.pdf
  • https://www.seas.upenn.edu/~cis520/lectures/PCA_PLS_CCA.pdf
  • http://people.cs.pitt.edu/~milos/courses/cs3750/lectures/class8.pdf
  • http://people.cs.pitt.edu/~milos/courses/cs2750-Spring2019/Lectures/Class19.pdf
  • https://ermongroup.github.io/cs228-notes/learning/latent/
  • https://people.cs.pitt.edu/~milos/courses/cs3750-Fall2007/lectures/class17.pdf
  • https://people.cs.pitt.edu/~milos/courses/cs3750-Fall2014/lectures/class13.pdf
  • https://liorpachter.wordpress.com/tag/ppca/
