SLIDE 1

Probabilistic & Unsupervised Learning Latent Variable Models

Maneesh Sahani

maneesh@gatsby.ucl.ac.uk

Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London Term 1, Autumn 2018

SLIDES 2-4

Exponential family models

◮ Simple, 'single-stage' generative models.
◮ Easy, often closed-form expressions for learning and model comparison.
◮ . . . but limited in expressiveness.

What about distributions like these? In each case, data may be generated by combining and transforming latent exponential family variates.

SLIDE 5

Latent variable models

Explain correlations in x by assuming dependence on latent variables z.

[Figure: hierarchical generative model — latents such as objects, illumination, pose → object parts, surfaces → edges → retinal image, i.e. pixels.]

z ∼ P[θz]
x | z ∼ P[θx]

p(x, z; θx, θz) = p(x | z; θx) p(z; θz)
p(x; θx, θz) = ∫ dz p(x | z; θx) p(z; θz)
SLIDE 6

Latent variable models

◮ Describe structured distributions.

◮ Correlations in high-dimensional x may be captured by fewer parameters.

◮ Capture an underlying generative process.
  ◮ z may describe causes of x.
  ◮ may help to separate signal from noise.

◮ Combine exponential family distributions into richer, more flexible forms.
  ◮ P(z), P(x|z) and even P(x, z) may be in the exponential family.
  ◮ P(x) rarely is. (Exception: linear Gaussian models.)

SLIDES 7-8

Latent variables and Gaussians

Gaussian correlation can be composed from latent components and uncorrelated noise.

x ∼ N( 0, [3 2; 2 3] )

can be decomposed as

z ∼ N(0, 1)
x | z ∼ N( √2 [1; 1] z, [1 0; 0 1] )

so that the marginal covariance is ΛΛᵀ + I = [3 2; 2 3], with Λ = √2 (1, 1)ᵀ.
SLIDES 9-14

Probabilistic Principal Components Analysis (PPCA)

If the uncorrelated noise is assumed to be isotropic, this model is called PPCA.

Data: D = X = {x1, x2, . . . , xN}; xi ∈ R^D
Latents: Z = {z1, z2, . . . , zN}; zi ∈ R^K

Linear generative model:

xd = Σ_{k=1}^{K} Λdk zk + εd

◮ zk are independent N(0, 1) Gaussian factors
◮ εd are independent N(0, ψ) Gaussian noise
◮ K < D

[Graphical model: latents z1 . . . zK each connected to observed x1 . . . xD.]

Model for observations x is a correlated Gaussian, where Λ is a D × K matrix:

p(z) = N(0, I)
p(x|z) = N(Λz, ψI)

p(x) = ∫ p(z) p(x|z) dz
     = N( Ez[Λz], Ez[Λz zᵀ Λᵀ] + ψI )
     = N( 0, ΛΛᵀ + ψI )

[Note: Ex[f(x)] = Ez[ Ex|z[f(x)] ];  Vx[x] = Ez[V[x|z]] + Vz[E[x|z]].]
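A minimal NumPy sketch (not part of the original slides) of the PPCA generative process; the dimensions and parameter values (D, K, N, Lambda, psi) are arbitrary illustrative choices. It checks that the sample covariance approaches ΛΛᵀ + ψI.

```python
# Sketch of the PPCA generative model (illustrative values, not from the slides).
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 5, 2, 100_000            # arbitrary dimensions / sample size
Lambda = rng.normal(size=(D, K))   # loading matrix (D x K)
psi = 0.5                          # isotropic noise variance

Z = rng.normal(size=(N, K))                                 # z_n ~ N(0, I)
X = Z @ Lambda.T + np.sqrt(psi) * rng.normal(size=(N, D))   # x_n | z_n ~ N(Lambda z_n, psi I)

# The marginal covariance should approach Lambda Lambda^T + psi I.
empirical_cov = X.T @ X / N
model_cov = Lambda @ Lambda.T + psi * np.eye(D)
print(np.max(np.abs(empirical_cov - model_cov)))            # small for large N
```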
SLIDE 15

Multivariate Gaussians and latent variables

Two models:

p(x) = N(0, Σ)
◮ Descriptive density model: correlations are captured by off-diagonal elements of Σ.
◮ Σ has D(D+1)/2 free parameters.
◮ Only constrained to be positive definite.
◮ Simple ML estimate.

p(z) = N(0, I),  p(x|z) = N(Λz, ψI)  ⇒  p(x) = N(0, ΛΛᵀ + ψI)
◮ Interpretable causal model: correlations captured by common influence of latent variable.
◮ ΛΛᵀ + ψI has DK + 1 free parameters.
◮ For K < D covariance structure is constrained ("blurry pancake").
◮ ML estimation is more complex.

SLIDE 16

PPCA likelihood

The marginal distribution on x gives us the PPCA likelihood:

log p(X|Λ, ψ) = −(N/2) log |2π(ΛΛᵀ + ψI)| − (1/2) Tr[ (ΛΛᵀ + ψI)⁻¹ Σn xn xnᵀ ]
              = −(N/2) log |2π(ΛΛᵀ + ψI)| − (N/2) Tr[ (ΛΛᵀ + ψI)⁻¹ S ],   with Σn xn xnᵀ = NS.

To find the ML values of (Λ, ψ) we could optimise numerically (gradient ascent / Newton's method), or we could use a different iterative algorithm called EM which we'll introduce soon. In fact, however, ML for PPCA is more straightforward in principle, as we will see by first considering the limit ψ → 0.

[Note: We may also add a constant mean µ to the output, so as to model data that are not distributed around 0. In this case, the ML estimate is µ̂ = (1/N) Σn xn and we can define S = (1/N) Σn (xn − µ̂)(xn − µ̂)ᵀ in the likelihood above.]
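A short sketch (assumed helper name ppca_log_likelihood, not from the slides) evaluating this likelihood for given Λ and ψ with zero-mean data:

```python
# Sketch: evaluate log p(X | Lambda, psi) = -N/2 log|2 pi C| - N/2 Tr[C^{-1} S],
# with C = Lambda Lambda^T + psi I and S the sample covariance of zero-mean data.
import numpy as np

def ppca_log_likelihood(X, Lambda, psi):
    N, D = X.shape
    S = X.T @ X / N                            # sample covariance (data assumed zero-mean)
    C = Lambda @ Lambda.T + psi * np.eye(D)    # model covariance
    sign, logdet = np.linalg.slogdet(2 * np.pi * C)
    return -0.5 * N * logdet - 0.5 * N * np.trace(np.linalg.solve(C, S))
```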

SLIDES 17-21

The ψ → 0 limit

As ψ → 0, the latent model can only capture K dimensions of variance. In a Gaussian model, the ML parameters will find the K-dimensional space of most variance.

SLIDE 22

Principal Components Analysis

This leads us to an (old) algorithm called Principal Components Analysis (PCA).

[Figure: three-dimensional data scatter along axes x1, x2, x3.]

Assume data D = {xi} have zero mean (if not, subtract it).

◮ Find direction of greatest variance – λ(1):
  λ(1) = argmax_{‖v‖=1} Σn (xnᵀ v)²
◮ Find direction orthogonal to λ(1) with greatest variance – λ(2)
  . . .
◮ Find direction orthogonal to {λ(1), λ(2), . . . , λ(n−1)} with greatest variance – λ(n).
◮ Terminate when remaining variance drops below a threshold.

SLIDES 23-28

Eigendecomposition of a covariance matrix

The eigendecomposition of a covariance matrix makes finding the PCs easy.

Recall that u is an eigenvector, with scalar eigenvalue ω, of a matrix S if

Su = ωu

u can have any norm, but we will define it to be unity (i.e., uᵀu = 1).

For a covariance matrix S = ⟨xxᵀ⟩ (which is D × D, symmetric, positive semi-definite):

◮ In general there are D eigenvector-eigenvalue pairs (u(i), ω(i)), except if two or more eigenvectors share the same eigenvalue (in which case the eigenvectors are degenerate — any linear combination is also an eigenvector).

◮ The D eigenvectors are orthogonal (or orthogonalisable, if ω(i) = ω(j)). Thus, they form an orthonormal basis: Σi u(i) u(i)ᵀ = I.

◮ Any vector v can be written as

v = ( Σi u(i) u(i)ᵀ ) v = Σi (u(i)ᵀ v) u(i) = Σi v(i) u(i)

◮ The original matrix S can be written:

S = Σi ω(i) u(i) u(i)ᵀ = UWUᵀ

where U = [u(1), u(2), . . . , u(D)] collects the eigenvectors and W = diag(ω(1), ω(2), . . . , ω(D)).
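A small NumPy sketch (not from the slides) of this decomposition, using np.linalg.eigh for a symmetric matrix and sorting by decreasing eigenvalue; the data are synthetic and illustrative.

```python
# Sketch: eigendecomposition of a covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))    # some correlated data
X -= X.mean(axis=0)
S = X.T @ X / X.shape[0]                                    # covariance S = <x x^T>

omega, U = np.linalg.eigh(S)                                # S u_(i) = omega_(i) u_(i)
order = np.argsort(omega)[::-1]                             # descending eigenvalue order
omega, U = omega[order], U[:, order]

assert np.allclose(U @ np.diag(omega) @ U.T, S)             # S = U W U^T
assert np.allclose(U.T @ U, np.eye(U.shape[1]))             # orthonormal eigenvectors
```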
SLIDES 29-33

PCA and eigenvectors

◮ The variance in direction u(i) is

⟨(xᵀu(i))²⟩ = ⟨u(i)ᵀ x xᵀ u(i)⟩ = u(i)ᵀ S u(i) = u(i)ᵀ ω(i) u(i) = ω(i)

◮ The variance in an arbitrary direction v is

⟨(xᵀv)²⟩ = ⟨( xᵀ Σi v(i) u(i) )²⟩ = Σij v(i) u(i)ᵀ S u(j) v(j) = Σij v(i) ω(j) v(j) u(i)ᵀ u(j) = Σi v(i)² ω(i)

◮ If vᵀv = 1, then Σi v(i)² = 1 and so

argmax_{‖v‖=1} ⟨(xᵀv)²⟩ = u(max)

The direction of greatest variance is the eigenvector with the largest eigenvalue.

◮ In general, the PCs are exactly the eigenvectors of the empirical covariance matrix, ordered by decreasing eigenvalue.

◮ The eigenspectrum shows how the variance is distributed across dimensions; it can identify transitions that might separate signal from noise, or the number of PCs that capture a pre-determined fraction of variance.

[Figure: eigenspectrum — eigenvalue (variance) against eigenvalue number, and fractional variance remaining against eigenvalue number.]

SLIDE 34

PCA subspace

The K principal components define the K-dimensional subspace of greatest variance.

[Figure: three-dimensional data scatter (x1, x2, x3) with the principal subspace shown.]

◮ Each data point xn is associated with a projection x̂n into the principal subspace:

x̂n = Σ_{k=1}^{K} xn(k) λ(k)

◮ This can be used for lossy compression, denoising, recognition, . . .
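A sketch (function name pca_project is illustrative, not from the slides) that projects data onto the K leading PCs and reconstructs it, the basic operation behind lossy compression and denoising:

```python
# Sketch: project onto the K-dimensional principal subspace and reconstruct.
import numpy as np

def pca_project(X, K):
    mean = X.mean(axis=0)
    Xc = X - mean
    omega, U = np.linalg.eigh(Xc.T @ Xc / X.shape[0])
    order = np.argsort(omega)[::-1]
    U_K = U[:, order[:K]]                    # K leading eigenvectors (the PCs)
    Z = Xc @ U_K                             # coordinates x_n(k) in the principal subspace
    X_hat = Z @ U_K.T + mean                 # reconstruction \hat{x}_n
    return Z, X_hat

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))
Z, X_hat = pca_project(X, K=2)
print(np.mean((X - X_hat) ** 2))             # error comes from the discarded variance
```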

SLIDE 35

Example of PCA: Eigenfaces

vismod.media.mit.edu/vismod/demos/facerec/basic.html

SLIDES 36-38

Example of PCA: Genetic variation within Europe

Novembre et al. (2008) Nature 456:98-101

SLIDE 39

Equivalent definitions of PCA

◮ Find K directions of greatest variance in data.
◮ Find K-dimensional orthogonal projection that preserves greatest variance.
◮ Find K-dimensional vectors zi and matrix Λ so that x̂i = Λzi is as close as possible (in squared distance) to xi.
◮ . . . (many others)

SLIDES 40-45

Another view of PCA: Mutual information

Problem: Given x, find z = Ax with columns of A unit vectors, s.t. I(z; x) is maximised (assuming that P(x) is Gaussian).

I(z; x) = H(z) + H(x) − H(z, x) = H(z)

So we want to maximise the entropy of z. What is the entropy of a Gaussian?

H(z) = − ∫ dz p(z) ln p(z) = (1/2) ln |Σ| + (D/2)(1 + ln 2π)

Therefore we want the distribution of z to have largest volume (i.e. det of covariance matrix):

Σz = A Σx Aᵀ = A U Wx Uᵀ Aᵀ

So, A should be aligned with the columns of U which are associated with the largest eigenvalues (variances).

Projection to the principal component subspace preserves the most information about the (Gaussian) data.

SLIDE 46

Linear autoencoders: From supervised learning to PCA

[Figure: network with input units x1 . . . xD, hidden units z1 . . . zK, and output units x̂1 . . . x̂D; the encoder P maps inputs to hidden units ("recognition") and the decoder Q maps hidden units to outputs ("generation").]

Learning: argmin_{P,Q} ‖x̂ − x‖²   with   z = Px,  x̂ = Qz

At the optimum, P and Q perform the projection and reconstruction steps of PCA (Baldi & Hornik 1989).
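A rough gradient-descent sketch of such a linear autoencoder (not from the slides; the data, step size and iteration count are arbitrary illustrative choices). With enough steps its reconstruction error approaches that of projection onto the principal subspace.

```python
# Sketch: linear autoencoder x -> z = Px -> x_hat = Qz, trained on squared error.
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 2000, 5, 2
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, D)) + 0.1 * rng.normal(size=(N, D))
X -= X.mean(axis=0)

P = 0.1 * rng.normal(size=(K, D))     # encoder ("recognition")
Q = 0.1 * rng.normal(size=(D, K))     # decoder ("generation")
lr = 0.02
for _ in range(5000):
    Z = X @ P.T                       # z_n = P x_n
    Xhat = Z @ Q.T                    # x_hat_n = Q z_n
    E = Xhat - X
    gQ = (2 / N) * E.T @ Z            # gradient of mean squared error w.r.t. Q
    gP = (2 / N) * Q.T @ E.T @ X      # ... and w.r.t. P
    Q -= lr * gQ
    P -= lr * gP
print(np.mean((Xhat - X) ** 2))       # approaches the PCA reconstruction error
```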

SLIDE 47

ML learning for PPCA

ℓ = −(N/2) log |2πC| − (N/2) Tr[C⁻¹S],   where C = ΛΛᵀ + ψI

∂ℓ/∂Λ = (N/2)( −∂ log |C| /∂Λ − ∂ Tr[C⁻¹S] /∂Λ ) = N( −C⁻¹Λ + C⁻¹SC⁻¹Λ )

So at the stationary points we have SC⁻¹Λ = Λ. This implies either:

◮ Λ = 0, which turns out to be a minimum.
◮ C = S ⇒ ΛΛᵀ = S − ψI. Now rank(ΛΛᵀ) ≤ K ⇒ rank(S − ψI) ≤ K ⇒ S has D − K eigenvalues equal to ψ and Λ aligns with the space of remaining eigenvectors.
◮ or, taking the SVD Λ = ULVᵀ:

S(ULVᵀVLUᵀ + ψI)⁻¹ULVᵀ = ULVᵀ          [right-multiply by VL⁻¹]
S(UL²Uᵀ + ψI)⁻¹U = U
U(L² + ψI) = (UL²Uᵀ + ψI)U  ⇒  (UL²Uᵀ + ψI)⁻¹U = U(L² + ψI)⁻¹
SU(L² + ψI)⁻¹ = U                        [right-multiply by (L² + ψI)]
SU = U(L² + ψI)                          [(L² + ψI) diagonal]

⇒ columns of U are eigenvectors of S with eigenvalues given by lᵢ² + ψ.

Thus, Λ = ULVᵀ spans a space defined by K eigenvectors of S; and the lengths of the column vectors of L are given by the eigenvalues minus ψ (lᵢ² = ω(i) − ψ; V selects an arbitrary basis in the latent space). It remains to show (we won't, but it's intuitively reasonable) that the global ML solution is attained when Λ aligns with the K leading eigenvectors.
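A sketch of the resulting closed-form fit, following the stationary-point analysis above together with the standard result (Tipping & Bishop, 1999) that the ML noise variance is the mean of the D − K discarded eigenvalues; the function name ppca_ml is illustrative.

```python
# Sketch: closed-form ML PPCA fit from the eigendecomposition of the sample covariance.
import numpy as np

def ppca_ml(X, K):
    N, D = X.shape
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu) / N
    omega, U = np.linalg.eigh(S)
    order = np.argsort(omega)[::-1]
    omega, U = omega[order], U[:, order]
    psi = omega[K:].mean()                            # ML isotropic noise variance (standard result)
    L = np.sqrt(np.maximum(omega[:K] - psi, 0.0))     # l_i^2 = omega_(i) - psi
    Lambda = U[:, :K] * L                             # Lambda = U_K L (taking V = I)
    return mu, Lambda, psi
```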

SLIDES 48-58

PPCA latents

◮ In PCA the "noise" is orthogonal to the subspace, and we can project xn → x̂n trivially.
◮ In PPCA, the noise is more sensible (equal in all directions). But what is the projection?

Find the expected value E[zn|xn] and then take x̂n = Λ E[zn|xn].

◮ Tactic: write p(zn, xn|θ), consider xn to be fixed. What is this as a function of zn?

p(zn, xn) = p(zn) p(xn|zn)
          = (2π)^(−K/2) exp{−½ znᵀzn} · |2πΨ|^(−1/2) exp{−½ (xn − Λzn)ᵀ Ψ⁻¹ (xn − Λzn)}
          = c × exp{−½ [znᵀzn + (xn − Λzn)ᵀ Ψ⁻¹ (xn − Λzn)]}
          = c′ × exp{−½ [znᵀ(I + ΛᵀΨ⁻¹Λ)zn − 2znᵀΛᵀΨ⁻¹xn]}
          = c″ × exp{−½ [znᵀΣ⁻¹zn − 2znᵀΣ⁻¹µ + µᵀΣ⁻¹µ]}

So Σ = (I + ΛᵀΨ⁻¹Λ)⁻¹ = I − βΛ and µ = ΣΛᵀΨ⁻¹xn = βxn, where β = ΣΛᵀΨ⁻¹.

◮ Thus, x̂n = Λ(I + ΛᵀΨ⁻¹Λ)⁻¹ΛᵀΨ⁻¹xn = xn − Ψ(ΛΛᵀ + Ψ)⁻¹xn

◮ This is not the same projection. PPCA takes into account noise in the principal subspace.

◮ As ψ → 0, the PPCA estimate → the PCA value.
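A sketch of this posterior projection (helper name latent_posterior is illustrative, not from the slides); Ψ is passed as a diagonal, so the same code covers the isotropic PPCA case (ψ on every dimension) and the FA case introduced later.

```python
# Sketch: posterior mean E[z_n | x_n] = beta x_n and projection x_hat_n = Lambda E[z_n | x_n],
# with Sigma = (I + Lambda^T Psi^{-1} Lambda)^{-1} and beta = Sigma Lambda^T Psi^{-1}.
import numpy as np

def latent_posterior(X, Lambda, Psi_diag):
    D, K = Lambda.shape
    Psi_inv = np.diag(1.0 / Psi_diag)
    Sigma = np.linalg.inv(np.eye(K) + Lambda.T @ Psi_inv @ Lambda)   # posterior covariance
    beta = Sigma @ Lambda.T @ Psi_inv
    Z_mean = X @ beta.T                                              # E[z_n | x_n], one row per x_n
    X_hat = Z_mean @ Lambda.T                                        # Lambda E[z_n | x_n]
    return Z_mean, Sigma, X_hat
```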

SLIDES 59-62

PPCA latents

[Figure sequence: the principal subspace; the PCA projection; the PPCA noise and latent prior; the PPCA projection and posterior.]
SLIDE 63

Factor Analysis

If dimensions are not equivalent, equal variance assumption is inappropriate.

Data: D = X = {x1, x2, . . . , xN}; xi ∈ R^D
Latents: Z = {z1, z2, . . . , zN}; zi ∈ R^K

Linear generative model:

xd = Σ_{k=1}^{K} Λdk zk + εd

◮ zk are independent N(0, 1) Gaussian factors
◮ εd are independent N(0, Ψdd) Gaussian noise
◮ K < D

[Graphical model: latents z1 . . . zK each connected to observed x1 . . . xD.]

Model for observations x is still a correlated Gaussian:

p(z) = N(0, I)
p(x|z) = N(Λz, Ψ)
p(x) = ∫ p(z) p(x|z) dz = N(0, ΛΛᵀ + Ψ)

where Λ is D × K, and Ψ is D × D and diagonal.

Dimensionality Reduction: Finds a low-dimensional projection of high dimensional data that captures the correlation structure of the data.

SLIDE 64

Factor Analysis (cont.)

[Graphical model: latents z1 . . . zK each connected to observed x1 . . . xD.]

◮ ML learning finds Λ ("common factors") and Ψ ("unique factors" or "uniquenesses") given data.
◮ Parameters (corrected for symmetries): DK + D − K(K−1)/2.
◮ If the number of parameters > D(D+1)/2, the model is not identifiable (even after accounting for the rotational degeneracy discussed later).
◮ No closed form solution for the ML parameters of N(0, ΛΛᵀ + Ψ).

SLIDES 65-70

Factor Analysis projections

Our analysis for PPCA still applies:

x̂n = Λ(I + ΛᵀΨ⁻¹Λ)⁻¹ΛᵀΨ⁻¹xn = xn − Ψ(ΛΛᵀ + Ψ)⁻¹xn

but now Ψ is diagonal but not spherical. Note, though, that Λ is generally different from that found by PPCA.

And Λ is not unique: the latent space may be transformed by an arbitrary orthogonal transform U (UᵀU = UUᵀ = I) without changing the likelihood.

z̃ = Uz and Λ̃ = ΛUᵀ ⇒ Λ̃z̃ = ΛUᵀUz = Λz

−ℓ = ½ log |2π(ΛΛᵀ + Ψ)| + ½ xᵀ(ΛΛᵀ + Ψ)⁻¹x
   = ½ log |2π(ΛUᵀUΛᵀ + Ψ)| + ½ xᵀ(ΛUᵀUΛᵀ + Ψ)⁻¹x
   = ½ log |2π(Λ̃Λ̃ᵀ + Ψ)| + ½ xᵀ(Λ̃Λ̃ᵀ + Ψ)⁻¹x
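A quick numerical check of this rotational degeneracy (not from the slides; the sizes and random values are illustrative): an orthogonal rotation of the latent space leaves the model covariance, and hence the likelihood, unchanged.

```python
# Sketch: Lambda and Lambda U^T give the same FA covariance for orthogonal U.
import numpy as np

rng = np.random.default_rng(0)
D, K = 6, 2
Lambda = rng.normal(size=(D, K))
Psi = np.diag(rng.uniform(0.1, 1.0, size=D))

U, _ = np.linalg.qr(rng.normal(size=(K, K)))      # an arbitrary orthogonal K x K matrix
Lambda_tilde = Lambda @ U.T

C = Lambda @ Lambda.T + Psi
C_tilde = Lambda_tilde @ Lambda_tilde.T + Psi
print(np.allclose(C, C_tilde))                    # True: same model covariance
```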

SLIDE 71

Gradient methods for learning FA

Optimise the negative log-likelihood

−ℓ = ½ log |2π(ΛΛᵀ + Ψ)| + ½ xᵀ(ΛΛᵀ + Ψ)⁻¹x

w.r.t. Λ and Ψ (needs matrix calculus) subject to constraints.

◮ No spectral short-cut exists.
◮ The likelihood can have more than one (local) optimum, making it difficult to find the global value.
◮ For some data ("Heywood cases") the likelihood may grow unboundedly by taking one or more Ψdd → 0. This can be eliminated by assuming a prior on Ψ with zero density at Ψdd = 0, but results are sensitive to the precise choice of prior.

Expectation maximisation (next week) provides an alternative approach to maximisation, but doesn't solve these issues.
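A minimal numerical-optimisation sketch of this (not the course's recommended algorithm, and the helper name fit_fa is illustrative): the negative log-likelihood is minimised with a generic optimiser, Ψ is kept positive by optimising its log, and random restarts address local optima.

```python
# Sketch: direct minimisation of the FA negative log-likelihood (no spectral short-cut).
import numpy as np
from scipy.optimize import minimize

def fit_fa(X, K, n_restarts=3, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    S = (X - X.mean(0)).T @ (X - X.mean(0)) / N

    def neg_loglik(params):
        Lambda = params[:D * K].reshape(D, K)
        Psi = np.exp(params[D * K:])                    # log-parameterised uniquenesses
        C = Lambda @ Lambda.T + np.diag(Psi)
        _, logdet = np.linalg.slogdet(2 * np.pi * C)
        return 0.5 * N * (logdet + np.trace(np.linalg.solve(C, S)))

    best = None
    for _ in range(n_restarts):                         # likelihood may have local optima
        x0 = np.concatenate([0.1 * rng.normal(size=D * K), np.zeros(D)])
        res = minimize(neg_loglik, x0, method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    Lambda = best.x[:D * K].reshape(D, K)
    Psi = np.exp(best.x[D * K:])
    return Lambda, Psi
```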

SLIDE 72

FA vs PCA

◮ PCA and PPCA are rotationally invariant; FA is not.
  If x → Ux for unitary U, then λPCA(i) → UλPCA(i).
◮ FA is measurement scale invariant; PCA and PPCA are not.
  If x → Sx for diagonal S, then λFA(i) → SλFA(i).
◮ FA and PPCA define a probabilistic model; PCA does not.

[Note: it may be tempting to try to eliminate the scale-dependence of (P)PCA by pre-processing data to equalise total variance on each axis. But (P)PCA assume equal noise variance. Total variance has contributions from both ΛΛᵀ and noise, so this approach does not exactly solve the problem.]

SLIDE 73

Canonical Correlations Analysis

Data vector pairs: D = {(u1, v1), (u2, v2), . . . } in spaces U and V.

Classic CCA
◮ Find unit vectors υ1 ∈ U, φ1 ∈ V such that the correlation of uiᵀυ1 and viᵀφ1 is maximised.
◮ As with PCA, repeat in orthogonal subspaces.

Probabilistic CCA
◮ Generative model with latent zi ∈ R^K:

z ∼ N(0, I)
u ∼ N(Υz, Ψu)
v ∼ N(Φz, Ψv)

◮ Block diagonal noise: Ψ = [Ψu 0; 0 Ψv].

SLIDE 74

Limitations of Gaussian, FA and PCA models

◮ Gaussian, FA and PCA models are easy to understand and use in practice.
◮ They are a convenient way to find interesting directions in very high dimensional data sets, e.g. as preprocessing.
◮ However, they make strong assumptions about the distribution of the data: only the mean and variance of the data are taken into account. The class of densities which can be modelled is too restrictive.

[Figure: scatter of two-dimensional data (xi1, xi2).]

By using mixtures of simple distributions, such as Gaussians, we can expand the class of densities greatly.

SLIDE 75

Mixture Distributions

[Figure: scatter of two-dimensional data (xi1, xi2).]

A mixture distribution has a single discrete latent variable:

si ∼iid Discrete[π]
xi | si ∼ Psi[θsi]

Mixtures arise naturally when observations from different sources have been collated. They can also be used to approximate arbitrary distributions.

SLIDES 76-80

The Mixture Likelihood

The mixture model is

si ∼iid Discrete[π]
xi | si ∼ Psi[θsi]

Under the discrete distribution

P(si = m) = πm;   πm ≥ 0,  Σ_{m=1}^{k} πm = 1

Thus, the probability (density) at a single data point xi is

P(xi) = Σ_{m=1}^{k} P(xi | si = m) P(si = m)
      = Σ_{m=1}^{k} πm Pm(xi; θm)

The mixture distribution (density) is a convex combination (or weighted average) of the component distributions (densities).
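A tiny sketch of this convex combination (not from the slides; the weights and one-dimensional Gaussian components are illustrative choices):

```python
# Sketch: a mixture density P(x) = sum_m pi_m P_m(x; theta_m).
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.5, 0.2])          # mixing proportions, sum to 1
mus = np.array([-2.0, 0.0, 3.0])
sigmas = np.array([0.5, 1.0, 0.8])

def mixture_pdf(x):
    return sum(p * norm.pdf(x, mu, s) for p, mu, s in zip(pi, mus, sigmas))

print(mixture_pdf(0.5))
```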

SLIDE 81

Approximation with a Mixture of Gaussians (MoG)

The component densities may be viewed as elements of a basis which can be combined to approximate arbitrary distributions. Here are examples where non-Gaussian densities are modelled (approximated) as a mixture of Gaussians. The red curves show the (weighted) Gaussians, and the blue curve the resulting density.

[Figure: three panels — Uniform, Triangle, Heavy tails — each approximated by a mixture of Gaussians.]

Given enough mixture components we can model (almost) any density (as accurately as desired), but still only need to work with the well-known Gaussian form.

SLIDES 82-84

Clustering with a MoG

In clustering applications, the latent variable si represents the (unknown) identity of the cluster to which the ith observation belongs. Thus, the latent distribution gives the prior probability of a data point coming from each cluster:

P(si = m | π) = πm

Data from the mth cluster are distributed according to the mth component:

P(xi | si = m) = Pm(xi)

Once we observe a data point, the posterior probability distribution for the cluster it belongs to is

P(si = m | xi) = Pm(xi) πm / Σ_{m′} Pm′(xi) πm′

This is often called the responsibility of the mth cluster for the ith data point.

SLIDE 85

The MoG likelihood

Each component of a MoG is a Gaussian, with mean µm and covariance matrix Σm. Thus, the probability density evaluated at a set of n iid observations, D = {x1 . . . xn} (i.e. the likelihood) is

p(D | {µm}, {Σm}, π) = Π_{i=1}^{n} Σ_{m=1}^{k} πm N(xi | µm, Σm)
                     = Π_{i=1}^{n} Σ_{m=1}^{k} πm (1/√|2πΣm|) exp{−½ (xi − µm)ᵀ Σm⁻¹ (xi − µm)}

The log of the likelihood is

log p(D | {µm}, {Σm}, π) = Σ_{i=1}^{n} log Σ_{m=1}^{k} πm (1/√|2πΣm|) exp{−½ (xi − µm)ᵀ Σm⁻¹ (xi − µm)}

Note that the logarithm fails to simplify the component density terms. A mixture distribution does not lie in the exponential family. Direct optimisation is not easy.
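A short sketch of evaluating this log-likelihood (helper name mog_log_likelihood is illustrative): since the log does not pass through the inner sum, each term is computed in log space and combined with logsumexp for numerical stability.

```python
# Sketch: log p(D) = sum_i log sum_m pi_m N(x_i | mu_m, Sigma_m), via logsumexp.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def mog_log_likelihood(X, pis, mus, Sigmas):
    log_terms = np.stack([
        np.log(pi_m) + multivariate_normal.logpdf(X, mu_m, Sigma_m)
        for pi_m, mu_m, Sigma_m in zip(pis, mus, Sigmas)
    ], axis=1)                            # shape (n, k)
    return logsumexp(log_terms, axis=1).sum()
```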

SLIDES 86-93

Maximum Likelihood for a Mixture Model

The log likelihood is:

L = Σ_{i=1}^{n} log Σ_{m=1}^{k} πm Pm(xi; θm)

Its partial derivative wrt θm is

∂L/∂θm = Σ_{i=1}^{n} [ πm / Σ_{m′=1}^{k} πm′ Pm′(xi; θm′) ] ∂Pm(xi; θm)/∂θm

or, using ∂P/∂θ = P × ∂ log P/∂θ,

       = Σ_{i=1}^{n} [ πm Pm(xi; θm) / Σ_{m′=1}^{k} πm′ Pm′(xi; θm′) ] ∂ log Pm(xi; θm)/∂θm
       = Σ_{i=1}^{n} rim ∂ log Pm(xi; θm)/∂θm

And its partial derivative wrt πm is

∂L/∂πm = Σ_{i=1}^{n} Pm(xi; θm) / Σ_{m′=1}^{k} πm′ Pm′(xi; θm′)
       = Σ_{i=1}^{n} rim / πm

SLIDE 94

MoG Derivatives

For a MoG, with θm = {µm, Σm}, we get

∂L/∂µm = Σ_{i=1}^{n} rim Σm⁻¹ (xi − µm)

∂L/∂Σm⁻¹ = ½ Σ_{i=1}^{n} rim [ Σm − (xi − µm)(xi − µm)ᵀ ]

These equations can be used (along with Lagrangian derivatives wrt πm that enforce normalisation) for gradient based learning; e.g., taking small steps in the direction of the gradient (or using conjugate gradients).

SLIDES 95-99

The K-means Algorithm

The K-means algorithm is a limiting case of the mixture of Gaussians (c.f. PCA and Factor Analysis).

Take πm = 1/k and Σm = σ²I, with σ² → 0. Then the responsibilities become binary

rim → δ(m, argmin_l ‖xi − µl‖²)

with 1 for the component with the closest mean and 0 for all other components. We can then solve directly for the means by setting the gradient to 0.

The k-means algorithm iterates these two steps:

◮ assign each point to its closest mean: set rim = δ(m, argmin_l ‖xi − µl‖²)
◮ update the means to the average of their assigned points: set µm = Σi rim xi / Σi rim

This usually converges within a few iterations, although the fixed point depends on the initial values chosen for µm. The algorithm has no learning rate, but the assumptions are quite limiting.
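A compact sketch of this two-step iteration (not from the slides; the initialisation — k points drawn at random from the data — is an illustrative choice):

```python
# Sketch of k-means: alternate assignment and mean-update steps until a fixed point.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=k, replace=False)]           # initial means
    for _ in range(n_iters):
        # assignment step: each point goes to its closest mean
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)    # squared distances (n, k)
        assign = d2.argmin(axis=1)
        # update step: each mean becomes the average of its assigned points
        new_mus = np.array([X[assign == m].mean(axis=0) if np.any(assign == m) else mus[m]
                            for m in range(k)])
        if np.allclose(new_mus, mus):                             # converged (fixed point)
            break
        mus = new_mus
    return mus, assign
```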

SLIDE 100

A preview of the EM algorithm

We wrote the k-means algorithm in terms of binary responsibilities. Suppose, instead, we used the fractional responsibilities from the full (non-limiting) MoG, but still neglected the dependence of the responsibilities on the parameters. We could then solve for both µm and Σm.

The EM algorithm for MoGs iterates these two steps:

◮ Evaluate the responsibilities for each point given the current parameters.
◮ Optimise the parameters assuming the responsibilities stay fixed:

µm = Σi rim xi / Σi rim   and   Σm = Σi rim (xi − µm)(xi − µm)ᵀ / Σi rim

Although this appears ad hoc, we will see (later) that it is a special case of a general algorithm, and is actually guaranteed to increase the likelihood at each iteration.
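A sketch of that iteration for a MoG (not from the slides; the initialisation and the small ridge added to keep covariances valid are illustrative choices, and the helper name em_mog is assumed):

```python
# Sketch: EM for a mixture of Gaussians — alternate responsibilities and parameter updates.
import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, k, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pis = np.full(k, 1.0 / k)
    mus = X[rng.choice(n, size=k, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iters):
        # E step: responsibilities r_im proportional to pi_m N(x_i | mu_m, Sigma_m)
        dens = np.stack([pis[m] * multivariate_normal.pdf(X, mus[m], Sigmas[m])
                         for m in range(k)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M step: weighted means, covariances and mixing proportions with r held fixed
        Nm = r.sum(axis=0)
        mus = (r.T @ X) / Nm[:, None]
        for m in range(k):
            Xc = X - mus[m]
            Sigmas[m] = (r[:, m, None] * Xc).T @ Xc / Nm[m] + 1e-6 * np.eye(d)
        pis = Nm / n
    return pis, mus, Sigmas
```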

SLIDE 101

Issues

There are several problems with these algorithms:

◮ slow convergence for the gradient based method
◮ the gradient based method may develop invalid covariance matrices
◮ local minima; the end configuration may depend on the starting state
◮ how do you adjust k? Using the likelihood alone is no good.
◮ singularities; components with a single data point will have their covariance going to zero and the likelihood will tend to infinity.

We will attempt to address many of these as the course goes on.