Probabilistic & Unsupervised Learning: Latent Variable Models
Maneesh Sahani (maneesh@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science, University College London
Term 1, Autumn 2018
Exponential family models
◮ Simple, 'single-stage' generative models.
◮ Easy, often closed-form expressions for learning and model comparison.
◮ ... but limited in expressiveness.
What about distributions like these? In each case, data may be generated by combining and transforming latent exponential family variates.
Latent variable models
Explain correlations in x by assuming dependence on latent variables z.

[Figure: a hierarchy of latent variables (e.g. objects, illumination, pose; object parts, surfaces; edges) generating the observed data (retinal image, i.e. pixels).]

    z ∼ P[θz]
    x | z ∼ P[θx]

    p(x, z; θx, θz) = p(x | z; θx) p(z; θz)
    p(x; θx, θz) = ∫ dz p(x | z; θx) p(z; θz)
Latent variable models
◮ Describe structured distributions.
◮ Correlations in high-dimensional x may be captured by fewer parameters.
◮ Capture an underlying generative process.
  ◮ z may describe causes of x.
  ◮ Help to separate signal from noise.
◮ Combine exponential family distributions into richer, more flexible forms.
  ◮ P(z), P(x|z) and even P(x, z) may be in the exponential family.
  ◮ P(x) rarely is. (Exception: linear Gaussian models.)
Latent variables and Gaussians
Gaussian correlation can be composed from latent components and uncorrelated noise.

    x ∼ N( 0, [3 2; 2 3] )

⇔

    z ∼ N(0, 1)
    x ∼ N( √2 [1; 1] z, [1 0; 0 1] )
Probabilistic Principal Components Analysis (PPCA)
If the uncorrelated noise is assumed to be isotropic, this model is called PPCA.

Data: D = X = {x1, x2, . . . , xN}; xi ∈ R^D
Latents: Z = {z1, z2, . . . , zN}; zi ∈ R^K

Linear generative model:
    xd = Σ_{k=1}^{K} Λdk zk + εd
◮ zk are independent N(0, 1) Gaussian factors
◮ εd are independent N(0, ψ) Gaussian noise
◮ K < D

[Graphical model: latent units z1 . . . zK each connected to every observed unit x1 . . . xD.]

Model for observations x is a correlated Gaussian:
    p(z) = N(0, I)
    p(x|z) = N(Λz, ψI)
    p(x) = ∫ p(z) p(x|z) dz = N( Ez[Λz], Ez[ΛzzᵀΛᵀ] + ψI ) = N( 0, ΛΛᵀ + ψI )
where Λ is a D × K matrix.

[Note: Ex[f(x)] = Ez[ Ex|z[f(x)] ];   Vx[x] = Ez[V[x|z]] + Vz[E[x|z]].]
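A minimal sketch (NumPy) of this generative process, checking that samples drawn via z and ε have the marginal covariance ΛΛᵀ + ψI derived above; the dimensions, noise level and loading matrix below are illustrative choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N, psi = 5, 2, 100_000, 0.5
Lam = rng.normal(size=(D, K))                        # an arbitrary D x K loading matrix Λ

Z = rng.normal(size=(N, K))                          # z_n ~ N(0, I_K)
eps = rng.normal(scale=np.sqrt(psi), size=(N, D))    # ε_n ~ N(0, ψ I_D)
X = Z @ Lam.T + eps                                  # x_n = Λ z_n + ε_n

C_model = Lam @ Lam.T + psi * np.eye(D)              # marginal covariance ΛΛᵀ + ψI
C_emp = np.cov(X, rowvar=False)                      # empirical covariance of the samples
print(np.abs(C_emp - C_model).max())                 # small for large N
```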
Multivariate Gaussians and latent variables
Two models:

Descriptive density model:  p(x) = N(0, Σ)
◮ Correlations are captured by off-diagonal elements of Σ.
◮ Σ has D(D+1)/2 free parameters.
◮ Only constrained to be positive definite.
◮ Simple ML estimate.

Interpretable causal model:  p(z) = N(0, I), p(x|z) = N(Λz, ψI) ⇒ p(x) = N(0, ΛΛᵀ + ψI)
◮ Correlations are captured by the common influence of the latent variable.
◮ ΛΛᵀ + ψI has DK + 1 free parameters.
◮ For K < D the covariance structure is constrained ("blurry pancake").
◮ ML estimation is more complex.
PPCA likelihood
The marginal distribution on x gives us the PPCA likelihood:

    log p(X|Λ, ψ) = −(N/2) log|2π(ΛΛᵀ + ψI)| − ½ Tr[(ΛΛᵀ + ψI)⁻¹ Σn xn xnᵀ]
                  = −(N/2) log|2π(ΛΛᵀ + ψI)| − (N/2) Tr[(ΛΛᵀ + ψI)⁻¹ S],   with Σn xn xnᵀ = N S.

To find the ML values of (Λ, ψ) we could optimise numerically (gradient ascent / Newton's method), or we could use a different iterative algorithm called EM which we'll introduce soon. In fact, however, ML for PPCA is more straightforward in principle, as we will see by first considering the limit ψ → 0.

[Note: We may also add a constant mean µ to the output, so as to model data that are not distributed around 0. In this case, the ML estimate is µ = (1/N) Σn xn and we can define S = (1/N) Σn (xn − µ)(xn − µ)ᵀ in the likelihood above.]
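As a concrete check of the expression above, here is a minimal sketch of evaluating log p(X | Λ, ψ) for given parameters; the function name and the assumption that X is already centred (so µ = 0) are illustrative.

```python
import numpy as np

def ppca_loglik(X, Lam, psi):
    """log p(X | Λ, ψ) for centred data X of shape (N, D)."""
    N, D = X.shape
    C = Lam @ Lam.T + psi * np.eye(D)       # model covariance C = ΛΛᵀ + ψI
    S = (X.T @ X) / N                       # empirical covariance S = (1/N) Σ_n x_n x_nᵀ
    _, logdet = np.linalg.slogdet(2 * np.pi * C)
    return -0.5 * N * (logdet + np.trace(np.linalg.solve(C, S)))
```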
The ψ → 0 limit
As ψ → 0, the latent model can only capture K dimensions of variance. In a Gaussian model, the ML parameters will find the K-dimensional space of most variance.
Principal Components Analysis
This leads us to an (old) algorithm called Principal Components Analysis (PCA).
[Figure: three-dimensional data (axes x1, x2, x3) with the successive directions of greatest variance.]

Assume data D = {xi} have zero mean (if not, subtract it).
◮ Find the direction of greatest variance – λ(1):
    λ(1) = argmax_{‖v‖=1} Σn (xnᵀv)²
◮ Find the direction orthogonal to λ(1) with greatest variance – λ(2).
◮ . . .
◮ Find the direction orthogonal to {λ(1), λ(2), . . . , λ(n−1)} with greatest variance – λ(n).
◮ Terminate when the remaining variance drops below a threshold.
Eigendecomposition of a covariance matrix
The eigendecomposition of a covariance matrix makes finding the PCs easy.

Recall that u is an eigenvector, with scalar eigenvalue ω, of a matrix S if
    Su = ωu
u can have any norm, but we will define it to be unity (i.e. uᵀu = 1).

For a covariance matrix S = ⟨xxᵀ⟩ (which is D × D, symmetric, positive semi-definite):
◮ In general there are D eigenvector–eigenvalue pairs (u(i), ω(i)), except if two or more eigenvectors share the same eigenvalue (in which case the eigenvectors are degenerate — any linear combination is also an eigenvector).
◮ The D eigenvectors are orthogonal (or orthogonalisable, if ω(i) = ω(j)). Thus, they form an orthonormal basis: Σi u(i)u(i)ᵀ = I.
◮ Any vector v can be written as
    v = (Σi u(i)u(i)ᵀ) v = Σi (u(i)ᵀv) u(i) = Σi v(i) u(i)
◮ The original matrix S can be written:
    S = Σi ω(i) u(i)u(i)ᵀ = UWUᵀ
  where U = [u(1), u(2), . . . , u(D)] collects the eigenvectors and W = diag(ω(1), ω(2), . . . , ω(D)).
PCA and eigenvectors
◮ The variance in direction u(i) is
    ⟨(xᵀu(i))²⟩ = ⟨u(i)ᵀ x xᵀ u(i)⟩ = u(i)ᵀ S u(i) = u(i)ᵀ ω(i) u(i) = ω(i)
◮ The variance in an arbitrary direction v is
    ⟨(xᵀv)²⟩ = ⟨(xᵀ Σi v(i)u(i))²⟩ = Σij v(i) u(i)ᵀ S u(j) v(j) = Σij v(i) ω(j) v(j) u(i)ᵀu(j) = Σi v(i)² ω(i)
◮ If vᵀv = 1, then Σi v(i)² = 1 and so argmax_{‖v‖=1} ⟨(xᵀv)²⟩ = u(max).
  The direction of greatest variance is the eigenvector with the largest eigenvalue.
◮ In general, the PCs are exactly the eigenvectors of the empirical covariance matrix, ordered by decreasing eigenvalue.
◮ The eigenspectrum shows how the variance is distributed across dimensions; it can identify transitions that might separate signal from noise, or the number of PCs that capture a predetermined fraction of variance.

[Figures: eigenvalue (variance) vs. eigenvalue number; fractional variance remaining vs. eigenvalue number.]
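A minimal sketch of PCA computed exactly this way: eigendecompose the empirical covariance, sort by decreasing eigenvalue, and read off the eigenspectrum; the function name and return values are illustrative.

```python
import numpy as np

def pca(X, K):
    """Top-K principal directions, the full eigenspectrum, and the projected data."""
    Xc = X - X.mean(axis=0)                   # centre the data
    S = (Xc.T @ Xc) / len(Xc)                 # empirical covariance
    w, U = np.linalg.eigh(S)                  # eigenvalues ascending for symmetric S
    order = np.argsort(w)[::-1]               # reorder by decreasing eigenvalue (variance)
    w, U = w[order], U[:, order]
    Z = Xc @ U[:, :K]                         # coordinates in the principal subspace
    frac_remaining = 1.0 - np.cumsum(w) / w.sum()   # fractional variance remaining
    return U[:, :K], w, Z, frac_remaining
```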
PCA subspace
The K principal components define the K-dimensional subspace of greatest variance.

[Figure: three-dimensional data (axes x1, x2, x3) with the principal subspace shown as a plane.]

◮ Each data point xn is associated with a projection x̂n into the principal subspace:
    x̂n = Σ_{k=1}^{K} xn(k) λ(k)
◮ This can be used for lossy compression, denoising, recognition, . . .
Example of PCA: Eigenfaces
vismod.media.mit.edu/vismod/demos/facerec/basic.html
Example of PCA: Genetic variation within Europe
Novembre et al. (2008) Nature 456:98-101
Equivalent definitions of PCA
◮ Find K directions of greatest variance in data.
◮ Find K-dimensional orthogonal projection that preserves greatest variance.
◮ Find K-dimensional vectors zi and matrix Λ so that x̂i = Λzi is as close as possible (in squared distance) to xi.
◮ . . . (many others)
Another view of PCA: Mutual information
Problem: Given x, find z = Ax with the columns of A unit vectors, s.t. I(z; x) is maximised (assuming that P(x) is Gaussian).

    I(z; x) = H(z) + H(x) − H(z, x) = H(z)

So we want to maximise the entropy of z. What is the entropy of a Gaussian?

    H(z) = −∫ dz p(z) ln p(z) = ½ ln|Σ| + (D/2)(1 + ln 2π)

Therefore we want the distribution of z to have the largest volume (i.e. determinant of the covariance matrix):

    Σz = AΣxAᵀ = AUWxUᵀAᵀ

So, A should be aligned with the columns of U which are associated with the largest eigenvalues (variances).

Projection to the principal component subspace preserves the most information about the (Gaussian) data.
Linear autoencoders: From supervised learning to PCA
[Figure: a linear autoencoder network: input units x1 . . . xD, hidden units z1 . . . zK, output units x̂1 . . . x̂D; the encoder P performs "recognition" and the decoder Q performs "generation".]

Learning:
    argmin_{P,Q} ‖x̂ − x‖²   with   z = Px,  x̂ = Qz

At the optimum, P and Q perform the projection and reconstruction steps of PCA (Baldi & Hornik, 1989).
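A minimal sketch of such a linear autoencoder trained by gradient descent on the squared reconstruction error; the toy data, layer sizes, learning rate and iteration count are illustrative choices. At convergence the composite map QP projects onto the principal subspace (Baldi & Hornik, 1989).

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K, lr = 500, 10, 3, 0.01

X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # correlated toy data
X = (X - X.mean(axis=0)) / X.std(axis=0)                 # centre and standardise

P = rng.normal(scale=0.1, size=(K, D))   # encoder ("recognition")
Q = rng.normal(scale=0.1, size=(D, K))   # decoder ("generation")

for _ in range(2000):
    Z = X @ P.T                          # z = P x
    Xhat = Z @ Q.T                       # x̂ = Q z
    E = Xhat - X                         # reconstruction errors
    gQ = 2.0 * E.T @ Z / N               # gradient of (1/N) Σ ‖x̂ − x‖² w.r.t. Q
    gP = 2.0 * Q.T @ E.T @ X / N         # gradient of (1/N) Σ ‖x̂ − x‖² w.r.t. P
    Q -= lr * gQ
    P -= lr * gP
```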
ML learning for PPCA
    ℓ = −(N/2) log|2πC| − (N/2) Tr[C⁻¹S]    where C = ΛΛᵀ + ψI

    ∂ℓ/∂Λ = (N/2)[ −∂/∂Λ log|C| − ∂/∂Λ Tr[C⁻¹S] ] = N[ −C⁻¹Λ + C⁻¹SC⁻¹Λ ]

So at the stationary points we have SC⁻¹Λ = Λ. This implies either:
◮ Λ = 0, which turns out to be a minimum.
◮ C = S ⇒ ΛΛᵀ = S − ψI. Now rank(ΛΛᵀ) ≤ K ⇒ rank(S − ψI) ≤ K
  ⇒ S has D − K eigenvalues equal to ψ, and Λ aligns with the space of the remaining eigenvectors.
◮ or, taking the SVD Λ = ULVᵀ:

    S(ULVᵀVLUᵀ + ψI)⁻¹ULVᵀ = ULVᵀ                [multiply on the right by VL⁻¹]
    ⇒ S(UL²Uᵀ + ψI)⁻¹U = U                        [since U(L² + ψI) = (UL²Uᵀ + ψI)U ⇒ (UL²Uᵀ + ψI)⁻¹U = U(L² + ψI)⁻¹]
    ⇒ SU(L² + ψI)⁻¹ = U                           [multiply on the right by (L² + ψI)]
    ⇒ SU = U(L² + ψI)    with (L² + ψI) diagonal
    ⇒ the columns of U are eigenvectors of S with eigenvalues given by li² + ψ.

Thus, Λ = ULVᵀ spans a space defined by K eigenvectors of S; the squared lengths of the columns of L are given by the corresponding eigenvalues minus ψ (V selects an arbitrary basis in the latent space). It remains to show (we won't, but it's intuitively reasonable) that the global ML solution is attained when Λ aligns with the K leading eigenvectors.
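A minimal sketch of the resulting closed-form fit (the Tipping & Bishop 1999 solution): take the K leading eigenvectors of S, set the column lengths from the eigenvalues minus ψ, and estimate ψ as the mean of the discarded eigenvalues. That last choice is the standard ML result rather than something derived on this slide, and the names are illustrative.

```python
import numpy as np

def ppca_ml(X, K):
    """Maximum-likelihood Λ and ψ for centred data X of shape (N, D)."""
    N, D = X.shape
    S = (X.T @ X) / N
    w, U = np.linalg.eigh(S)
    order = np.argsort(w)[::-1]                 # decreasing eigenvalues
    w, U = w[order], U[:, order]
    psi = w[K:].mean()                          # ML ψ: mean of the D − K discarded eigenvalues
    Lam = U[:, :K] * np.sqrt(w[:K] - psi)       # Λ = U_K (W_K − ψI)^(1/2), up to a rotation V
    return Lam, psi
```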
PPCA latents
◮ In PCA the "noise" is orthogonal to the subspace, and we can project xn → x̂n trivially.
◮ In PPCA, the noise is more sensible (equal in all directions). But what is the projection?
  Find the expected value z̄n = E[zn|xn] and then take x̂n = Λz̄n.
◮ Tactic: write p(zn, xn|θ), consider xn to be fixed. What is this as a function of zn?

    p(zn, xn) = p(zn) p(xn|zn)
              = (2π)^{−K/2} exp{−½ znᵀzn} · |2πΨ|^{−1/2} exp{−½ (xn − Λzn)ᵀΨ⁻¹(xn − Λzn)}
              = c  × exp{−½ [znᵀzn + (xn − Λzn)ᵀΨ⁻¹(xn − Λzn)]}
              = c′ × exp{−½ [znᵀ(I + ΛᵀΨ⁻¹Λ)zn − 2znᵀΛᵀΨ⁻¹xn]}
              = c″ × exp{−½ [znᵀΣ⁻¹zn − 2znᵀΣ⁻¹µ + µᵀΣ⁻¹µ]}

So Σ = (I + ΛᵀΨ⁻¹Λ)⁻¹ = I − βΛ and µ = ΣΛᵀΨ⁻¹xn = βxn, where β = ΣΛᵀΨ⁻¹.

◮ Thus, x̂n = Λ(I + ΛᵀΨ⁻¹Λ)⁻¹ΛᵀΨ⁻¹xn = xn − Ψ(ΛΛᵀ + Ψ)⁻¹xn
◮ This is not the same projection. PPCA takes into account noise in the principal subspace.
◮ As ψ → 0, the PPCA estimate → the PCA value.
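A minimal sketch of this posterior computation: Σ = (I + ΛᵀΨ⁻¹Λ)⁻¹ is shared across data points, µn = ΣΛᵀΨ⁻¹xn, and x̂n = Λµn. Ψ is passed as a vector of diagonal entries so the same code covers PPCA (all entries ψ) and FA; the names are illustrative.

```python
import numpy as np

def latent_posterior(X, Lam, psi_diag):
    """Posterior means of z_n for each row of X, the shared posterior covariance, and x̂_n."""
    D, K = Lam.shape
    Psi_inv = np.diag(1.0 / np.asarray(psi_diag, dtype=float))
    Sigma = np.linalg.inv(np.eye(K) + Lam.T @ Psi_inv @ Lam)   # posterior covariance Σ
    beta = Sigma @ Lam.T @ Psi_inv                             # β = ΣΛᵀΨ⁻¹  (K x D)
    Mu = X @ beta.T                                            # posterior means µ_n = β x_n
    Xhat = Mu @ Lam.T                                          # projections x̂_n = Λ µ_n
    return Mu, Sigma, Xhat
```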
PPCA latents

[Figure: the principal subspace, illustrating the PCA projection, the PPCA noise, the PPCA latent prior, the PPCA projection and the PPCA posterior.]
Factor Analysis
If dimensions are not equivalent, the equal-variance assumption is inappropriate.

Data: D = X = {x1, x2, . . . , xN}; xi ∈ R^D
Latents: Z = {z1, z2, . . . , zN}; zi ∈ R^K

Linear generative model:
    xd = Σ_{k=1}^{K} Λdk zk + εd
◮ zk are independent N(0, 1) Gaussian factors
◮ εd are independent N(0, Ψdd) Gaussian noise
◮ K < D

[Graphical model: latent units z1 . . . zK each connected to every observed unit x1 . . . xD.]

Model for observations x is still a correlated Gaussian:
    p(z) = N(0, I)
    p(x|z) = N(Λz, Ψ)
    p(x) = ∫ p(z) p(x|z) dz = N(0, ΛΛᵀ + Ψ)
where Λ is D × K, and Ψ is D × D and diagonal.

Dimensionality reduction: finds a low-dimensional projection of high-dimensional data that captures the correlation structure of the data.
Factor Analysis (cont.)
[Graphical model: latent units z1 . . . zK each connected to every observed unit x1 . . . xD.]

◮ ML learning finds Λ ("common factors") and Ψ ("unique factors" or "uniquenesses") given the data.
◮ Number of parameters (corrected for symmetries): DK + D − K(K−1)/2.
◮ If the number of parameters > D(D+1)/2, the model is not identifiable (even after accounting for the rotational degeneracy discussed later).
◮ There is no closed-form solution for the ML parameters of N(0, ΛΛᵀ + Ψ).
Factor Analysis projections
Our analysis for PPCA still applies:

    x̂n = Λ(I + ΛᵀΨ⁻¹Λ)⁻¹ΛᵀΨ⁻¹xn = xn − Ψ(ΛΛᵀ + Ψ)⁻¹xn

but now Ψ is diagonal but not spherical. Note, though, that Λ is generally different from that found by PPCA.

And Λ is not unique: the latent space may be transformed by an arbitrary orthogonal transform U (UᵀU = UUᵀ = I) without changing the likelihood.

    z̃ = Uz and Λ̃ = ΛUᵀ  ⇒  Λ̃z̃ = ΛUᵀUz = Λz

    −ℓ = ½ log|2π(ΛΛᵀ + Ψ)| + ½ xᵀ(ΛΛᵀ + Ψ)⁻¹x
       = ½ log|2π(ΛUᵀUΛᵀ + Ψ)| + ½ xᵀ(ΛUᵀUΛᵀ + Ψ)⁻¹x
       = ½ log|2π(Λ̃Λ̃ᵀ + Ψ)| + ½ xᵀ(Λ̃Λ̃ᵀ + Ψ)⁻¹x
Gradient methods for learning FA
Optimise the negative log-likelihood
    −ℓ = ½ log|2π(ΛΛᵀ + Ψ)| + ½ xᵀ(ΛΛᵀ + Ψ)⁻¹x
w.r.t. Λ and Ψ (needs matrix calculus), subject to constraints.

◮ No spectral short-cut exists.
◮ The likelihood can have more than one (local) optimum, making it difficult to find the global value.
◮ For some data ("Heywood cases") the likelihood may grow unboundedly by taking one or more Ψdd → 0. This can be eliminated by assuming a prior on Ψ with zero density at Ψdd = 0, but the results are sensitive to the precise choice of prior.

Expectation maximisation (next week) provides an alternative approach to maximisation, but doesn't solve these issues.
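Since no spectral short-cut exists, here is a minimal sketch of plain gradient ascent on this likelihood. The gradients follow from ∂ℓ/∂C with C = ΛΛᵀ + Ψ; optimising log Ψdd to keep the noise variances positive, the step size and the iteration count are all illustrative choices (and do nothing about local optima or Heywood cases).

```python
import numpy as np

def fa_gradient_fit(X, K, lr=0.01, iters=2000, seed=0):
    """Gradient ascent on the FA log-likelihood for centred data X of shape (N, D)."""
    N, D = X.shape
    S = (X.T @ X) / N
    rng = np.random.default_rng(seed)
    Lam = rng.normal(scale=0.1, size=(D, K))
    log_psi = np.zeros(D)                                   # parametrise Ψdd = exp(log_psi_d) > 0
    for _ in range(iters):
        C = Lam @ Lam.T + np.diag(np.exp(log_psi))          # C = ΛΛᵀ + Ψ
        Cinv = np.linalg.inv(C)
        G = Cinv @ S @ Cinv - Cinv                          # proportional to ∂ℓ/∂C
        Lam += lr * (G @ Lam)                               # ∂ℓ/∂Λ ∝ (C⁻¹SC⁻¹ − C⁻¹)Λ
        log_psi += lr * 0.5 * np.diag(G) * np.exp(log_psi)  # chain rule through Ψ = exp(log Ψ)
    return Lam, np.exp(log_psi)
```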
FA vs PCA
◮ PCA and PPCA are rotationally invariant; FA is not.
  If x → Ux for unitary U, then the PCA components transform as λ(i) → Uλ(i).
◮ FA is measurement-scale invariant; PCA and PPCA are not.
  If x → Sx for diagonal S, then the FA factors transform as λ(i) → Sλ(i).
◮ FA and PPCA define a probabilistic model; PCA does not.

[Note: it may be tempting to try to eliminate the scale-dependence of (P)PCA by pre-processing data to equalise total variance on each axis. But (P)PCA assume equal noise variance. Total variance has contributions from both ΛΛᵀ and noise, so this approach does not exactly solve the problem.]
Canonical Correlations Analysis
Data vector pairs: D = {(u1, v1), (u2, v2), . . . } in spaces U and V.

Classic CCA
◮ Find unit vectors υ1 ∈ U, φ1 ∈ V such that the correlation of uiᵀυ1 and viᵀφ1 is maximised.
◮ As with PCA, repeat in orthogonal subspaces.

Probabilistic CCA
◮ Generative model with latent zi ∈ R^K:
    z ∼ N(0, I)
    u ∼ N(Υz, Ψu)
    v ∼ N(Φz, Ψv)
◮ Block-diagonal noise: the joint noise covariance on (u, v) is [Ψu 0; 0 Ψv].
Limitations of Gaussian, FA and PCA models
◮ Gaussian, FA and PCA models are easy to understand and use in practice.
◮ They are a convenient way to find interesting directions in very high dimensional data sets, e.g. as preprocessing.
◮ However, they make strong assumptions about the distribution of the data: only the mean and variance of the data are taken into account. The class of densities which can be modelled is too restrictive.

[Figure: a scatter plot of two-dimensional data (xi1 vs. xi2).]

By using mixtures of simple distributions, such as Gaussians, we can expand the class of densities greatly.
Mixture Distributions
[Figure: a scatter plot of two-dimensional data (xi1 vs. xi2).]

A mixture distribution has a single discrete latent variable:
    si ∼ Discrete[π]   (i.i.d.)
    xi | si ∼ P_si[θ_si]

Mixtures arise naturally when observations from different sources have been collated. They can also be used to approximate arbitrary distributions.
The Mixture Likelihood
The mixture model is
    si ∼ Discrete[π]   (i.i.d.)
    xi | si ∼ P_si[θ_si]

Under the discrete distribution
    P(si = m) = πm;   πm ≥ 0,  Σ_{m=1}^{k} πm = 1

Thus, the probability (density) at a single data point xi is
    P(xi) = Σ_{m=1}^{k} P(xi | si = m) P(si = m)
          = Σ_{m=1}^{k} πm Pm(xi; θm)

The mixture distribution (density) is a convex combination (or weighted average) of the component distributions (densities).
Approximation with a Mixture of Gaussians (MoG)
The component densities may be viewed as elements of a basis which can be combined to approximate arbitrary distributions. Here are examples where non-Gaussian densities are modelled (approximated) as a mixture of Gaussians. The red curves show the (weighted) Gaussians, and the blue curve the resulting density.

[Figures: mixture-of-Gaussian approximations to a uniform density, a triangular density, and a heavy-tailed density.]

Given enough mixture components we can model (almost) any density (as accurately as desired), but still only need to work with the well-known Gaussian form.
Clustering with a MoG
In clustering applications, the latent variable si represents the (unknown) identity of the cluster to which the ith observation belongs. Thus, the latent distribution gives the prior probability of a data point coming from each cluster:
    P(si = m | π) = πm
Data from the mth cluster are distributed according to the mth component:
    P(xi | si = m) = Pm(xi)
Once we observe a data point, the posterior probability distribution for the cluster it belongs to is
    P(si = m | xi) = πm Pm(xi) / Σ_{m′} πm′ Pm′(xi)
This is often called the responsibility of the mth cluster for the ith data point.
The MoG likelihood
Each component of a MoG is a Gaussian, with mean µm and covariance matrix Σm. Thus, the probability density evaluated at a set of n iid observations, D = {x1 . . . xn} (i.e. the likelihood), is

    p(D | {µm}, {Σm}, π) = Π_{i=1}^{n} Σ_{m=1}^{k} πm N(xi | µm, Σm)
                         = Π_{i=1}^{n} Σ_{m=1}^{k} πm (1/√|2πΣm|) exp{−½ (xi − µm)ᵀ Σm⁻¹ (xi − µm)}

The log of the likelihood is

    log p(D | {µm}, {Σm}, π) = Σ_{i=1}^{n} log Σ_{m=1}^{k} πm (1/√|2πΣm|) exp{−½ (xi − µm)ᵀ Σm⁻¹ (xi − µm)}

Note that the logarithm fails to simplify the component density terms. A mixture distribution does not lie in the exponential family. Direct optimisation is not easy.
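A minimal sketch of evaluating this log-likelihood, together with the responsibilities defined earlier, using a log-sum-exp for numerical stability; the argument names and shapes are illustrative.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mog_loglik_and_resp(X, pis, mus, Sigmas):
    """X: (n, D) data; pis: (k,) mixing weights; mus, Sigmas: k component means/covariances."""
    n, k = len(X), len(pis)
    log_joint = np.empty((n, k))                              # log[π_m N(x_i | µ_m, Σ_m)]
    for m in range(k):
        log_joint[:, m] = np.log(pis[m]) + multivariate_normal.logpdf(X, mus[m], Sigmas[m])
    log_px = logsumexp(log_joint, axis=1)                     # log Σ_m π_m N(x_i | µ_m, Σ_m)
    resp = np.exp(log_joint - log_px[:, None])                # responsibilities r_im
    return log_px.sum(), resp
```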
Maximum Likelihood for a Mixture Model
The log likelihood is:

    L = Σ_{i=1}^{n} log Σ_{m=1}^{k} πm Pm(xi; θm)

Its partial derivative w.r.t. θm is

    ∂L/∂θm = Σ_{i=1}^{n} [ πm / Σ_{m′=1}^{k} πm′ Pm′(xi; θm′) ] ∂Pm(xi; θm)/∂θm

or, using ∂P/∂θ = P × ∂ log P/∂θ,

           = Σ_{i=1}^{n} [ πm Pm(xi; θm) / Σ_{m′=1}^{k} πm′ Pm′(xi; θm′) ] ∂ log Pm(xi; θm)/∂θm
           = Σ_{i=1}^{n} rim ∂ log Pm(xi; θm)/∂θm

And its partial derivative w.r.t. πm is

    ∂L/∂πm = Σ_{i=1}^{n} Pm(xi; θm) / Σ_{m′=1}^{k} πm′ Pm′(xi; θm′)
           = Σ_{i=1}^{n} rim / πm
MoG Derivatives
For a MoG, with θm = {µm, Σm} we get

    ∂L/∂µm = Σ_{i=1}^{n} rim Σm⁻¹ (xi − µm)
    ∂L/∂Σm⁻¹ = ½ Σ_{i=1}^{n} rim [ Σm − (xi − µm)(xi − µm)ᵀ ]

These equations can be used (along with Lagrangian derivatives w.r.t. πm that enforce normalisation) for gradient based learning; e.g., taking small steps in the direction of the gradient (or using conjugate gradients).
The K-means Algorithm
The K-means algorithm is a limiting case of the mixture of Gaussians (c.f. PCA and Factor Analysis).

Take πm = 1/k and Σm = σ²I, with σ² → 0. Then the responsibilities become binary,
    rim → δ(m, argmin_l ‖xi − µl‖²)
with 1 for the component with the closest mean and 0 for all other components. We can then solve directly for the means by setting the gradient to 0.

The k-means algorithm iterates these two steps:
◮ assign each point to its closest mean: set rim = δ(m, argmin_l ‖xi − µl‖²)
◮ update the means to the average of their assigned points: set µm = Σi rim xi / Σi rim

This usually converges within a few iterations, although the fixed point depends on the initial values chosen for µm. The algorithm has no learning rate, but the assumptions are quite limiting.
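A minimal sketch of these two alternating steps; initialisation from randomly chosen data points and the stopping rule are illustrative choices.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()      # initial means
    for _ in range(iters):
        # assignment step: r_im = δ(m, argmin_l ‖x_i − µ_l‖²)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        # update step: µ_m = average of the points assigned to component m
        new_mu = np.array([X[assign == m].mean(axis=0) if np.any(assign == m) else mu[m]
                           for m in range(k)])
        if np.allclose(new_mu, mu):                               # converged
            break
        mu = new_mu
    return mu, assign
```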
A preview of the EM algorithm
We wrote the k-means algorithm in terms of binary responsibilities. Suppose, instead, we used the fractional responsibilities from the full (non-limiting) MoG, but still neglected the dependence of the responsibilities on the parameters. We could then solve for both µm and Σm.

The EM algorithm for MoGs iterates these two steps (a code sketch follows below):
◮ Evaluate the responsibilities for each point given the current parameters.
◮ Optimise the parameters assuming the responsibilities stay fixed:
    µm = Σi rim xi / Σi rim    and    Σm = Σi rim (xi − µm)(xi − µm)ᵀ / Σi rim
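A minimal sketch of the iteration previewed here. The responsibility (E) step matches the earlier MoG sketch; the M step uses the updates above, plus the standard update πm = (1/n) Σi rim, which comes from the Lagrangian derivative mentioned earlier rather than from this slide. Initialisation and the fixed iteration count are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, k, iters=50, seed=0):
    n, D = X.shape
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(n, size=k, replace=False)].copy()          # initial means
    Sigmas = np.array([np.cov(X, rowvar=False) for _ in range(k)])
    pis = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E step: responsibilities r_im ∝ π_m N(x_i | µ_m, Σ_m)
        R = np.stack([pis[m] * multivariate_normal.pdf(X, mus[m], Sigmas[m])
                      for m in range(k)], axis=1)
        R /= R.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters with the responsibilities held fixed
        Nm = R.sum(axis=0)
        pis = Nm / n
        mus = (R.T @ X) / Nm[:, None]
        Sigmas = np.array([(R[:, m, None] * (X - mus[m])).T @ (X - mus[m]) / Nm[m]
                           for m in range(k)])
    return pis, mus, Sigmas
```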