SLIDE 1

Probabilistic PCA and Factor analysis

Course of Machine Learning, Master Degree in Computer Science, University of Rome “Tor Vergata”. Giorgio Gambosi, a.a. 2018-2019.

SLIDE 2

Idea

Introduce a latent variable model relating a d-dimensional observation vector x to a corresponding d′-dimensional gaussian latent variable z (with d′ < d):

x = Wz + µ + ϵ

where

  • z is a d′-dimensional gaussian latent variable (the “projection” of x on a lower-dimensional subspace)
  • W is a d × d′ matrix, relating the original space with the lower-dimensional subspace
  • ϵ is d-dimensional gaussian noise: noise covariance on different dimensions is assumed to be 0, and noise variance is assumed equal on all dimensions, hence p(ϵ) = N(0, σ²I)
  • µ is the d-dimensional vector of the means

ϵ and z are assumed independent.

SLIDE 3

Graphical model

[Graphical model: plate over the n observations; latent zi and noise ϵi generate each xi; parameters W, µ, σ.]

  1. z ∈ ℝ^d′, x, ϵ ∈ ℝ^d, d′ < d
  2. p(z) = N(0, I)
  3. p(ϵ) = N(0, σ²I) (isotropic gaussian noise)

SLIDE 4

Generative process

This can be interpreted in terms of a generative process:

  1. sample the latent variable z ∈ ℝ^d′ from p(z) = (2π)^(−d′/2) exp(−||z||²/2)
  2. linearly project onto ℝ^d: y = Wz + µ
  3. sample the noise component ϵ ∈ ℝ^d from p(ϵ) = (2πσ²)^(−d/2) exp(−||ϵ||²/(2σ²))
  4. add the noise component: x = y + ϵ

This results in p(x|z) = N(Wz + µ, σ²I).
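A minimal NumPy sketch of this sampling scheme (dimensions and parameter values below are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

d, d_latent = 5, 2                   # observed and latent dimensions (illustrative)
W = rng.normal(size=(d, d_latent))   # arbitrary d x d' projection matrix
mu = np.zeros(d)                     # mean vector
sigma2 = 0.1                         # isotropic noise variance

def sample_ppca(n):
    """Draw n observations from the PPCA generative process."""
    z = rng.normal(size=(n, d_latent))                    # 1. z ~ N(0, I)
    y = z @ W.T + mu                                      # 2. project onto R^d
    eps = rng.normal(scale=np.sqrt(sigma2), size=(n, d))  # 3. eps ~ N(0, sigma^2 I)
    return y + eps                                        # 4. x = y + eps

X = sample_ppca(1000)   # X has shape (1000, 5)
```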

SLIDE 5

Generative process

[Figure illustrating the generative process.]

SLIDE 6

Probability recall

Let

x1 ∈ ℝ^r, x2 ∈ ℝ^s, x = [x1; x2]

Assume x is normally distributed, p(x) = N(µ, Σ), and let

µ = [µ1; µ2]    Σ = [Σ11 Σ12; Σ21 Σ22]

with

µ1 ∈ ℝ^r, µ2 ∈ ℝ^s, Σ11 ∈ ℝ^(r×r), Σ12 = Σ21ᵀ ∈ ℝ^(r×s), Σ22 ∈ ℝ^(s×s)

SLIDE 7

Probability recall

Under the above assumptions:

  • The marginal distribution p(x1) is a gaussian on ℝ^r, with

    E[x1] = µ1    Cov(x1) = Σ11

  • The conditional distribution p(x1|x2) is a gaussian on ℝ^r, with

    E[x1|x2] = µ1 + Σ12Σ22⁻¹(x2 − µ2)
    Cov(x1|x2) = Σ11 − Σ12Σ22⁻¹Σ21
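A quick numeric check of the conditional formulas on a made-up bivariate example (r = s = 1, values chosen arbitrarily):

```python
import numpy as np

mu1, mu2 = 1.0, -2.0
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])   # blocks: [[S11, S12], [S21, S22]]
x2 = 0.5                         # observed value of x2

S11, S12, S21, S22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]
cond_mean = mu1 + S12 / S22 * (x2 - mu2)   # mu1 + S12 S22^-1 (x2 - mu2)
cond_var  = S11 - S12 / S22 * S21          # S11 - S12 S22^-1 S21

print(cond_mean, cond_var)   # 3.0 1.36
```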

SLIDE 9

Latent variable model

The joint distribution is

p([z; x]) = N(µzx, Σ)

By definition,

µzx = [µz; µx]

  • Since p(z) = N(0, I), then µz = 0.
  • Since x = Wz + µ + ϵ, then

    µx = E[x] = E[Wz + µ + ϵ] = W E[z] + µ + E[ϵ] = µ

Hence

µzx = [0; µ]

SLIDE 10

Latent variable model

As for the covariance of the joint distribution,

Σ = [Σzz Σzx; Σxz Σxx]

where, using x − E[x] = Wz + ϵ and the independence of z and ϵ,

Σzz = E[(z − E[z])(z − E[z])ᵀ] = E[zzᵀ] = I
Σzx = E[(z − E[z])(x − E[x])ᵀ] = E[z(Wz + ϵ)ᵀ] = E[zzᵀ]Wᵀ = Wᵀ
Σxx = E[(x − E[x])(x − E[x])ᵀ] = E[(Wz + ϵ)(Wz + ϵ)ᵀ] = WWᵀ + σ²I

and Σxz = Σzxᵀ = W.

SLIDE 11

Latent variable model

Joint distribution. As a consequence, we get

µzx = [0; µ]    Σ = [I Wᵀ; W WWᵀ + σ²I]

Marginal distribution. The marginal distribution of x is then

p(x) = N(µ, WWᵀ + σ²I)

Conditional distribution. The conditional distribution of z given x is p(z|x) = N(µz|x, Σz|x) with

µz|x = Wᵀ(WWᵀ + σ²I)⁻¹(x − µ)
Σz|x = I − Wᵀ(WWᵀ + σ²I)⁻¹W = σ²(σ²I + WᵀW)⁻¹
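A short NumPy check of these expressions with random parameter values (purely illustrative), including the two equivalent forms of the posterior covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_latent = 6, 2
W = rng.normal(size=(d, d_latent))
mu = rng.normal(size=d)
sigma2 = 0.5
x = rng.normal(size=d)                     # an arbitrary observation

C = W @ W.T + sigma2 * np.eye(d)           # marginal covariance of x
post_mean = W.T @ np.linalg.solve(C, x - mu)   # E[z|x]

cov_a = np.eye(d_latent) - W.T @ np.linalg.solve(C, W)
cov_b = sigma2 * np.linalg.inv(sigma2 * np.eye(d_latent) + W.T @ W)
assert np.allclose(cov_a, cov_b)           # the two forms of Sigma_z|x agree
```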

SLIDE 12

Maximum likelihood for PCA

Setting C = WWᵀ + σ²I, the log-likelihood of the dataset in the model is

log p(X|W, µ, σ²) = ∑_{i=1}^n log p(xi|W, µ, σ²) = −(nd/2) log(2π) − (n/2) log|C| − (1/2) ∑_{i=1}^n (xi − µ)ᵀC⁻¹(xi − µ)

Setting the derivative wrt µ to zero results in

µ = x̄ = (1/n) ∑_{i=1}^n xi
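Evaluating this log-likelihood is a one-liner once C is formed; a sketch relying on scipy.stats, with parameter names matching the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def ppca_loglik(X, W, mu, sigma2):
    """Log-likelihood of the data under the PPCA marginal N(mu, W W^T + sigma^2 I)."""
    d = W.shape[0]
    C = W @ W.T + sigma2 * np.eye(d)
    return multivariate_normal(mean=mu, cov=C).logpdf(X).sum()
```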

SLIDE 13

Maximum likelihood for PCA

Maximization wrt W and σ² is more complex; however, a closed-form solution exists:

W = Ud′(Ld′ − σ²I)^(1/2) R

where

  • Ud′ is the d × d′ matrix whose columns are the eigenvectors of the sample covariance matrix corresponding to its d′ largest eigenvalues
  • Ld′ is the d′ × d′ diagonal matrix of those eigenvalues
  • R is an arbitrary d′ × d′ orthogonal matrix, which can be interpreted as a rotation in the latent space

If R = I, the columns of W are the principal eigenvectors scaled by √(λi − σ²).

SLIDE 14

Maximum likelihood for PCA

For what concerns maximization wrt σ², the result is

σ² = (1/(d − d′)) ∑_{i=d′+1}^d λi

Since each eigenvalue measures the dataset variance along the corresponding eigenvector direction, this is the average variance along the discarded directions.
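Putting the last two slides together, a sketch of the closed-form ML fit (taking R = I; X is assumed to be an n × d data array):

```python
import numpy as np

def ppca_ml(X, d_latent):
    """Closed-form maximum-likelihood PPCA fit (with R = I)."""
    n, d = X.shape
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)        # sample covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(S)          # eigh returns ascending eigenvalues
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending
    sigma2 = eigvals[d_latent:].mean()            # average of the discarded eigenvalues
    U = eigvecs[:, :d_latent]                     # top d' eigenvectors
    W = U * np.sqrt(eigvals[:d_latent] - sigma2)  # scale column i by sqrt(lambda_i - sigma^2)
    return W, mu, sigma2
```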

SLIDE 15

Mapping points to subspace

The conditional distribution

p(z|x) = N(Wᵀ(WWᵀ + σ²I)⁻¹(x − µ), σ²(σ²I + WᵀW)⁻¹)

can be applied. In particular, the conditional expectation

E[z|x] = Wᵀ(WWᵀ + σ²I)⁻¹(x − µ)

can be taken as the latent space point corresponding to x. The projection onto the d′-dimensional subspace can then be performed as

x′ = W E[z|x] + µ = WWᵀ(WWᵀ + σ²I)⁻¹(x − µ) + µ
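As a standalone sketch (with parameters W, mu, sigma2 as returned, e.g., by the ppca_ml helper above):

```python
import numpy as np

def ppca_project(X, W, mu, sigma2):
    """Map each row of X to E[z|x], then reconstruct x' = W E[z|x] + mu."""
    d = W.shape[0]
    C = W @ W.T + sigma2 * np.eye(d)
    Z = (X - mu) @ np.linalg.solve(C, W)   # rows: E[z|x] = W^T C^-1 (x - mu)
    X_proj = Z @ W.T + mu                  # rows: x'
    return Z, X_proj
```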

SLIDE 16

EM for PCA

Even if the log-likelihood has a closed-form maximum, applying the Expectation-Maximization algorithm can be useful in high-dimensional spaces, since it avoids forming and eigendecomposing the d × d sample covariance matrix.
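A sketch of the EM iterations for PPCA, following the standard updates (cf. Tipping & Bishop); the initialization and the fixed iteration count are arbitrary choices here:

```python
import numpy as np

def ppca_em(X, d_latent, n_iter=100, seed=0):
    """EM for probabilistic PCA (standard updates, cf. Tipping & Bishop)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu                              # centered data
    W = rng.normal(size=(d, d_latent))       # arbitrary initialization
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables
        M = W.T @ W + sigma2 * np.eye(d_latent)
        Minv = np.linalg.inv(M)
        Ez = Xc @ W @ Minv                   # rows: E[z_i] = M^-1 W^T (x_i - mu)
        Ezz = n * sigma2 * Minv + Ez.T @ Ez  # sum_i E[z_i z_i^T]
        # M-step: maximize the expected complete-data log-likelihood
        W = np.linalg.solve(Ezz, Ez.T @ Xc).T     # (sum (x_i - mu) E[z_i]^T) Ezz^-1
        sigma2 = (np.sum(Xc**2)
                  - 2 * np.sum((Ez @ W.T) * Xc)
                  + np.trace(Ezz @ W.T @ W)) / (n * d)
    return W, mu, sigma2
```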

SLIDE 17

Factor analysis

Graphical model. Noise components are still gaussian and independent, but may now have a different variance on each dimension. Note the change of notation: the latent dimension is now d and the observed dimension D.

[Graphical model: plate over the n observations; latent zi and noise ϵi generate each xi; parameters W, µ, Ψ.]

  1. z ∈ ℝ^d, x, ϵ ∈ ℝ^D, d << D
  2. p(z) = N(0, I)
  3. p(ϵ) = N(0, Ψ), with Ψ diagonal (independent gaussian noise)

SLIDE 18

Factor analysis

Generative model

  1. sample the vector of factors z ∈ ℝ^d from p(z) = (2π)^(−d/2) exp(−||z||²/2)
  2. linearly project onto ℝ^D (onto a subspace of dimension d of ℝ^D): y = Wz + µ
  3. sample the noise component ϵ ∈ ℝ^D from p(ϵ) = (2π)^(−D/2)|Ψ|^(−1/2) exp(−ϵᵀΨ⁻¹ϵ/2)
  4. add the noise component: x = y + ϵ
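The same sampling sketch as for PPCA, now with a diagonal noise covariance (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

D, d = 8, 3                            # observed and latent dimensions (illustrative)
W = rng.normal(size=(D, d))            # factor loading matrix
mu = np.zeros(D)
psi = rng.uniform(0.05, 0.5, size=D)   # diagonal of Psi: one variance per dimension

def sample_fa(n):
    """Draw n observations from the factor analysis generative process."""
    z = rng.normal(size=(n, d))                        # z ~ N(0, I)
    eps = rng.normal(scale=np.sqrt(psi), size=(n, D))  # eps ~ N(0, Psi), Psi diagonal
    return z @ W.T + mu + eps

X = sample_fa(500)
```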

SLIDE 19

Factor analysis

The model distributions are modified accordingly.

  • Joint distribution

    p([z; x]) = N([0; µ], [I Wᵀ; W WWᵀ + Ψ])

  • Marginal distribution

    p(x) = N(µ, WWᵀ + Ψ)

  • Conditional distribution. The conditional distribution of z given x is now p(z|x) = N(µz|x, Σz|x) with

    µz|x = Wᵀ(WWᵀ + Ψ)⁻¹(x − µ)
    Σz|x = I − Wᵀ(WWᵀ + Ψ)⁻¹W

SLIDE 20

Maximum likelihood for FA

The log-likelihood of the dataset in the model is now

log p(X|W, µ, Ψ) = ∑_{i=1}^n log p(xi|W, µ, Ψ) = −(nD/2) log(2π) − (n/2) log|WWᵀ + Ψ| − (1/2) ∑_{i=1}^n (xi − µ)ᵀ(WWᵀ + Ψ)⁻¹(xi − µ)

Setting the derivative wrt µ to zero results in

µ = x̄ = (1/n) ∑_{i=1}^n xi

Estimating W and Ψ through log-likelihood maximization does not admit a closed-form solution: iterative techniques such as Expectation-Maximization must be applied.
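For reference, a sketch of one standard form of the EM updates for factor analysis (cf. Ghahramani & Hinton's classic note); initialization and iteration count are arbitrary choices:

```python
import numpy as np

def fa_em(X, d_latent, n_iter=100, seed=0):
    """EM for factor analysis (standard updates, cf. Ghahramani & Hinton, 1996)."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu                            # centered data
    W = rng.normal(size=(D, d_latent))     # arbitrary initialization
    psi = Xc.var(axis=0)                   # diagonal of Psi
    for _ in range(n_iter):
        # E-step: posterior moments, with G = (I + W^T Psi^-1 W)^-1
        PinvW = W / psi[:, None]           # Psi^-1 W (Psi is diagonal)
        G = np.linalg.inv(np.eye(d_latent) + W.T @ PinvW)
        Ez = Xc @ PinvW @ G                # rows: E[z_i] = G W^T Psi^-1 (x_i - mu)
        Ezz = n * G + Ez.T @ Ez            # sum_i E[z_i z_i^T]
        # M-step
        W = np.linalg.solve(Ezz, Ez.T @ Xc).T   # (sum (x_i - mu) E[z_i]^T) Ezz^-1
        psi = (np.sum(Xc**2, axis=0) - np.sum((Ez @ W.T) * Xc, axis=0)) / n
    return W, mu, psi
```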
