SLIDE 1

Probabilistic PCA and Factor analysis

Course of Machine Learning, Master Degree in Computer Science, University of Rome “Tor Vergata”. Giorgio Gambosi, a.a. 2018-2019.

SLIDE 2

Idea

Introduce a latent variable model relating a d-dimensional observation vector x to a corresponding d′-dimensional gaussian latent variable z (with d′ < d):

x = Wz + µ + ϵ

where

  • z is a d′-dimensional gaussian latent variable (the “projection” of x on a lower-dimensional subspace)
  • W is a d × d′ matrix, relating the original space with the lower-dimensional subspace
  • ϵ is d-dimensional gaussian noise: noise covariance on different dimensions is assumed to be 0, and noise variance is assumed equal on all dimensions, hence p(ϵ) = N(0, σ²I)
  • µ is the d-dimensional vector of the means

ϵ and z are assumed independent.

SLIDE 3

Graphical model

[Graphical model: plate over the n observations; latent zi and noise ϵi generate each xi; parameters W, µ, σ.]

  1. z ∈ ℝ^d′, x, ϵ ∈ ℝ^d, d′ < d
  2. p(z) = N(0, I)
  3. p(ϵ) = N(0, σ²I) (isotropic gaussian noise)

SLIDE 4

Generative process

This can be interpreted in terms of a generative process:

  1. sample the latent variable z ∈ ℝ^d′ from p(z) = (2π)^(−d′/2) exp(−||z||²/2)
  2. linearly project onto ℝ^d: y = Wz + µ
  3. sample the noise component ϵ ∈ ℝ^d from p(ϵ) = (2πσ²)^(−d/2) exp(−||ϵ||²/(2σ²))
  4. add the noise component: x = y + ϵ

This results in p(x|z) = N(Wz + µ, σ²I).
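A minimal NumPy sketch of this sampling scheme (dimensions and parameter values below are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

d, d_latent = 5, 2                   # observed and latent dimensions (illustrative)
W = rng.normal(size=(d, d_latent))   # arbitrary d x d' projection matrix
mu = np.zeros(d)                     # mean vector
sigma2 = 0.1                         # isotropic noise variance

def sample_ppca(n):
    """Draw n observations from the PPCA generative process."""
    z = rng.normal(size=(n, d_latent))                    # 1. z ~ N(0, I)
    y = z @ W.T + mu                                      # 2. project onto R^d
    eps = rng.normal(scale=np.sqrt(sigma2), size=(n, d))  # 3. eps ~ N(0, sigma^2 I)
    return y + eps                                        # 4. x = y + eps

X = sample_ppca(1000)   # X has shape (1000, 5)
```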

SLIDE 5

Generative process

[Figure illustrating the generative process.]

SLIDE 6

Probability recall

Let

x1 ∈ ℝ^r, x2 ∈ ℝ^s, x = [x1; x2]

Assume x is normally distributed, p(x) = N(µ, Σ), and let

µ = [µ1; µ2]    Σ = [Σ11 Σ12; Σ21 Σ22]

with

µ1 ∈ ℝ^r, µ2 ∈ ℝ^s, Σ11 ∈ ℝ^(r×r), Σ12 = Σ21ᵀ ∈ ℝ^(r×s), Σ22 ∈ ℝ^(s×s)

SLIDE 7

Probability recall

Under the above assumptions:

  • The marginal distribution p(x1) is a gaussian on ℝ^r, with

    E[x1] = µ1    Cov(x1) = Σ11

  • The conditional distribution p(x1|x2) is a gaussian on ℝ^r, with

    E[x1|x2] = µ1 + Σ12Σ22⁻¹(x2 − µ2)
    Cov(x1|x2) = Σ11 − Σ12Σ22⁻¹Σ21
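A quick numeric check of the conditional formulas on a made-up bivariate example (r = s = 1, values chosen arbitrarily):

```python
import numpy as np

mu1, mu2 = 1.0, -2.0
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])   # blocks: [[S11, S12], [S21, S22]]
x2 = 0.5                         # observed value of x2

S11, S12, S21, S22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]
cond_mean = mu1 + S12 / S22 * (x2 - mu2)   # mu1 + S12 S22^-1 (x2 - mu2)
cond_var  = S11 - S12 / S22 * S21          # S11 - S12 S22^-1 S21

print(cond_mean, cond_var)   # 3.0 1.36
```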

SLIDE 9

Latent variable model

The joint distribution is

p([z; x]) = N(µzx, Σ)

By definition,

µzx = [µz; µx]

  • Since p(z) = N(0, I), then µz = 0.
  • Since x = Wz + µ + ϵ, then

    µx = E[x] = E[Wz + µ + ϵ] = W E[z] + µ + E[ϵ] = µ

Hence

µzx = [0; µ]

SLIDE 10

Latent variable model

As for the covariance of the joint distribution,

Σ = [Σzz Σzx; Σxz Σxx]

where, using x − E[x] = Wz + ϵ and the independence of z and ϵ,

Σzz = E[(z − E[z])(z − E[z])ᵀ] = E[zzᵀ] = I
Σzx = E[(z − E[z])(x − E[x])ᵀ] = E[z(Wz + ϵ)ᵀ] = E[zzᵀ]Wᵀ = Wᵀ
Σxx = E[(x − E[x])(x − E[x])ᵀ] = E[(Wz + ϵ)(Wz + ϵ)ᵀ] = WWᵀ + σ²I

and Σxz = Σzxᵀ = W.

SLIDE 11

Latent variable model

Joint distribution. As a consequence, we get

µzx = [0; µ]    Σ = [I Wᵀ; W WWᵀ + σ²I]

Marginal distribution. The marginal distribution of x is then

p(x) = N(µ, WWᵀ + σ²I)

Conditional distribution. The conditional distribution of z given x is p(z|x) = N(µz|x, Σz|x) with

µz|x = Wᵀ(WWᵀ + σ²I)⁻¹(x − µ)
Σz|x = I − Wᵀ(WWᵀ + σ²I)⁻¹W = σ²(σ²I + WᵀW)⁻¹
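A short NumPy check of these expressions with random parameter values (purely illustrative), including the two equivalent forms of the posterior covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_latent = 6, 2
W = rng.normal(size=(d, d_latent))
mu = rng.normal(size=d)
sigma2 = 0.5
x = rng.normal(size=d)                     # an arbitrary observation

C = W @ W.T + sigma2 * np.eye(d)           # marginal covariance of x
post_mean = W.T @ np.linalg.solve(C, x - mu)   # E[z|x]

cov_a = np.eye(d_latent) - W.T @ np.linalg.solve(C, W)
cov_b = sigma2 * np.linalg.inv(sigma2 * np.eye(d_latent) + W.T @ W)
assert np.allclose(cov_a, cov_b)           # the two forms of Sigma_z|x agree
```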

SLIDE 12

Maximum likelihood for PCA

Setting C = WWᵀ + σ²I, the log-likelihood of the dataset in the model is

log p(X|W, µ, σ²) = ∑_{i=1}^n log p(xi|W, µ, σ²) = −(nd/2) log(2π) − (n/2) log|C| − (1/2) ∑_{i=1}^n (xi − µ)ᵀC⁻¹(xi − µ)

Setting the derivative wrt µ to zero results in

µ = x̄ = (1/n) ∑_{i=1}^n xi
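Evaluating this log-likelihood is a one-liner once C is formed; a sketch relying on scipy.stats, with parameter names matching the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def ppca_loglik(X, W, mu, sigma2):
    """Log-likelihood of the data under the PPCA marginal N(mu, W W^T + sigma^2 I)."""
    d = W.shape[0]
    C = W @ W.T + sigma2 * np.eye(d)
    return multivariate_normal(mean=mu, cov=C).logpdf(X).sum()
```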

SLIDE 13

Maximum likelihood for PCA

Maximization wrt W and σ² is more complex; however, a closed-form solution exists:

W = Ud′(Ld′ − σ²I)^(1/2) R

where

  • Ud′ is the d × d′ matrix whose columns are the eigenvectors of the sample covariance matrix corresponding to its d′ largest eigenvalues
  • Ld′ is the d′ × d′ diagonal matrix of those eigenvalues
  • R is an arbitrary d′ × d′ orthogonal matrix, which can be interpreted as a rotation in the latent space

If R = I, the columns of W are the principal eigenvectors scaled by √(λi − σ²).

SLIDE 14

Maximum likelihood for PCA

For what concerns maximization wrt σ², the result is

σ² = (1/(d − d′)) ∑_{i=d′+1}^d λi

Since each eigenvalue measures the dataset variance along the corresponding eigenvector direction, this is the average variance along the discarded directions.
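Putting the last two slides together, a sketch of the closed-form ML fit (taking R = I; X is assumed to be an n × d data array):

```python
import numpy as np

def ppca_ml(X, d_latent):
    """Closed-form maximum-likelihood PPCA fit (with R = I)."""
    n, d = X.shape
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)        # sample covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(S)          # eigh returns ascending eigenvalues
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending
    sigma2 = eigvals[d_latent:].mean()            # average of the discarded eigenvalues
    U = eigvecs[:, :d_latent]                     # top d' eigenvectors
    W = U * np.sqrt(eigvals[:d_latent] - sigma2)  # scale column i by sqrt(lambda_i - sigma^2)
    return W, mu, sigma2
```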

SLIDE 15

Mapping points to subspace

The conditional distribution

p(z|x) = N(Wᵀ(WWᵀ + σ²I)⁻¹(x − µ), σ²(σ²I + WᵀW)⁻¹)

can be applied. In particular, the conditional expectation

E[z|x] = Wᵀ(WWᵀ + σ²I)⁻¹(x − µ)

can be taken as the latent space point corresponding to x. The projection onto the d′-dimensional subspace can then be performed as

x′ = W E[z|x] + µ = WWᵀ(WWᵀ + σ²I)⁻¹(x − µ) + µ
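As a standalone sketch (with parameters W, mu, sigma2 as returned, e.g., by the ppca_ml helper above):

```python
import numpy as np

def ppca_project(X, W, mu, sigma2):
    """Map each row of X to E[z|x], then reconstruct x' = W E[z|x] + mu."""
    d = W.shape[0]
    C = W @ W.T + sigma2 * np.eye(d)
    Z = (X - mu) @ np.linalg.solve(C, W)   # rows: E[z|x] = W^T C^-1 (x - mu)
    X_proj = Z @ W.T + mu                  # rows: x'
    return Z, X_proj
```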

SLIDE 16

EM for PCA

Even if the log-likelihood has a closed-form maximum, applying the Expectation-Maximization algorithm can be useful in high-dimensional spaces, since it avoids forming and eigendecomposing the d × d sample covariance matrix.
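A sketch of the EM iterations for PPCA, following the standard updates (cf. Tipping & Bishop); the initialization and the fixed iteration count are arbitrary choices here:

```python
import numpy as np

def ppca_em(X, d_latent, n_iter=100, seed=0):
    """EM for probabilistic PCA (standard updates, cf. Tipping & Bishop)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu                              # centered data
    W = rng.normal(size=(d, d_latent))       # arbitrary initialization
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables
        M = W.T @ W + sigma2 * np.eye(d_latent)
        Minv = np.linalg.inv(M)
        Ez = Xc @ W @ Minv                   # rows: E[z_i] = M^-1 W^T (x_i - mu)
        Ezz = n * sigma2 * Minv + Ez.T @ Ez  # sum_i E[z_i z_i^T]
        # M-step: maximize the expected complete-data log-likelihood
        W = np.linalg.solve(Ezz, Ez.T @ Xc).T     # (sum (x_i - mu) E[z_i]^T) Ezz^-1
        sigma2 = (np.sum(Xc**2)
                  - 2 * np.sum((Ez @ W.T) * Xc)
                  + np.trace(Ezz @ W.T @ W)) / (n * d)
    return W, mu, sigma2
```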

SLIDE 17

Factor analysis

Graphical model. Noise components are still gaussian and independent, but may now have a different variance on each dimension. Note the change of notation: the latent dimension is now d and the observed dimension D.

[Graphical model: plate over the n observations; latent zi and noise ϵi generate each xi; parameters W, µ, Ψ.]

  1. z ∈ ℝ^d, x, ϵ ∈ ℝ^D, d << D
  2. p(z) = N(0, I)
  3. p(ϵ) = N(0, Ψ), with Ψ diagonal (independent gaussian noise)

SLIDE 18

Factor analysis

Generative model

  1. sample the vector of factors z ∈ ℝ^d from p(z) = (2π)^(−d/2) exp(−||z||²/2)
  2. linearly project onto ℝ^D (onto a subspace of dimension d of ℝ^D): y = Wz + µ
  3. sample the noise component ϵ ∈ ℝ^D from p(ϵ) = (2π)^(−D/2)|Ψ|^(−1/2) exp(−ϵᵀΨ⁻¹ϵ/2)
  4. add the noise component: x = y + ϵ
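The same sampling sketch as for PPCA, now with a diagonal noise covariance (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

D, d = 8, 3                            # observed and latent dimensions (illustrative)
W = rng.normal(size=(D, d))            # factor loading matrix
mu = np.zeros(D)
psi = rng.uniform(0.05, 0.5, size=D)   # diagonal of Psi: one variance per dimension

def sample_fa(n):
    """Draw n observations from the factor analysis generative process."""
    z = rng.normal(size=(n, d))                        # z ~ N(0, I)
    eps = rng.normal(scale=np.sqrt(psi), size=(n, D))  # eps ~ N(0, Psi), Psi diagonal
    return z @ W.T + mu + eps

X = sample_fa(500)
```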

SLIDE 19

Factor analysis

The model distributions are modified accordingly.

  • Joint distribution

    p([z; x]) = N([0; µ], [I Wᵀ; W WWᵀ + Ψ])

  • Marginal distribution

    p(x) = N(µ, WWᵀ + Ψ)

  • Conditional distribution. The conditional distribution of z given x is now p(z|x) = N(µz|x, Σz|x) with

    µz|x = Wᵀ(WWᵀ + Ψ)⁻¹(x − µ)
    Σz|x = I − Wᵀ(WWᵀ + Ψ)⁻¹W

SLIDE 20

Maximum likelihood for FA

The log-likelihood of the dataset in the model is now

log p(X|W, µ, Ψ) = ∑_{i=1}^n log p(xi|W, µ, Ψ) = −(nD/2) log(2π) − (n/2) log|WWᵀ + Ψ| − (1/2) ∑_{i=1}^n (xi − µ)ᵀ(WWᵀ + Ψ)⁻¹(xi − µ)

Setting the derivative wrt µ to zero results in

µ = x̄ = (1/n) ∑_{i=1}^n xi

Estimating W and Ψ through log-likelihood maximization does not admit a closed-form solution: iterative techniques such as Expectation-Maximization must be applied.
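For reference, a sketch of one standard form of the EM updates for factor analysis (cf. Ghahramani & Hinton's classic note); initialization and iteration count are arbitrary choices:

```python
import numpy as np

def fa_em(X, d_latent, n_iter=100, seed=0):
    """EM for factor analysis (standard updates, cf. Ghahramani & Hinton, 1996)."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu                            # centered data
    W = rng.normal(size=(D, d_latent))     # arbitrary initialization
    psi = Xc.var(axis=0)                   # diagonal of Psi
    for _ in range(n_iter):
        # E-step: posterior moments, with G = (I + W^T Psi^-1 W)^-1
        PinvW = W / psi[:, None]           # Psi^-1 W (Psi is diagonal)
        G = np.linalg.inv(np.eye(d_latent) + W.T @ PinvW)
        Ez = Xc @ PinvW @ G                # rows: E[z_i] = G W^T Psi^-1 (x_i - mu)
        Ezz = n * G + Ez.T @ Ez            # sum_i E[z_i z_i^T]
        # M-step
        W = np.linalg.solve(Ezz, Ez.T @ Xc).T   # (sum (x_i - mu) E[z_i]^T) Ezz^-1
        psi = (np.sum(Xc**2, axis=0) - np.sum((Ez @ W.T) * Xc, axis=0)) / n
    return W, mu, psi
```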
