
Probabilistic & Unsupervised Learning: Expectation Maximisation
Maneesh Sahani (maneesh@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept. of Computer Science, University College London
Term 1, Autumn 2018


  1–2. The Expectation Maximisation (EM) algorithm

The EM algorithm (Dempster, Laird & Rubin, 1977; with significant earlier precedents) finds a (local) maximum of the likelihood of a latent-variable model. Start from arbitrary values of the parameters, and iterate two steps:

E step: Fill in values of the latent variables according to their posterior given the data.
M step: Maximise the likelihood as if the latent variables were not hidden.

◮ Decomposes difficult problems into a series of tractable steps.
◮ An alternative to gradient-based iterative methods.
◮ No learning rate.
◮ In ML, the E step is called inference, and the M step learning. In statistics, these are often called imputation and inference or estimation, respectively.
◮ Not essential for simple models (like MoGs or FA), though often more efficient than the alternatives. Crucial for learning in complex settings.
◮ Provides a framework for principled approximations.
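Before specialising to any particular model, the two-step iteration can be written as a generic loop. Below is a minimal Python sketch (my own illustration, not from the slides); e_step, m_step and log_likelihood are hypothetical placeholders to be supplied by the specific latent-variable model.

```python
import numpy as np

def em(X, theta, e_step, m_step, log_likelihood, max_iter=100, tol=1e-8):
    """Generic EM loop: alternate inference (E step) and estimation (M step).

    e_step(X, theta)         -> posterior statistics q over the latent variables
    m_step(X, q)             -> parameters maximising <log P(Z, X | theta)>_q
    log_likelihood(X, theta) -> log P(X | theta), used here only for monitoring
    All three are model-specific placeholders supplied by the caller.
    """
    ll_old = -np.inf
    for _ in range(max_iter):
        q = e_step(X, theta)               # fill in latents via their posterior
        theta = m_step(X, q)               # maximise as if latents were observed
        ll = log_likelihood(X, theta)
        if ll - ll_old < tol:              # EM never decreases ll, so this tests convergence
            break
        ll_old = ll
    return theta, ll
```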

  3–11. Jensen's inequality

One view: EM iteratively refines a lower bound on the log-likelihood.

[Figure: the graph of \log(x), with the chord between x_1 and x_2 lying below the curve, illustrating
\log(\alpha x_1 + (1-\alpha) x_2) \ge \alpha \log(x_1) + (1-\alpha) \log(x_2).]

In general, for \alpha_i \ge 0 with \sum_i \alpha_i = 1 (and \{x_i > 0\}):

\log\Big(\sum_i \alpha_i x_i\Big) \ge \sum_i \alpha_i \log(x_i)

and, for a probability measure \alpha and concave f:

f(\mathbb{E}_\alpha[x]) \ge \mathbb{E}_\alpha[f(x)]

Equality (if and) only if f(x) is almost surely constant, or linear on the (convex) support of \alpha.
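The discrete form of the inequality is easy to check numerically; here is a small sketch (my own, not from the slides) with random positive points and Dirichlet weights:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 10.0, size=5)      # positive points x_i
alpha = rng.dirichlet(np.ones(5))       # weights alpha_i >= 0 summing to 1

lhs = np.log(np.sum(alpha * x))         # log of the weighted average
rhs = np.sum(alpha * np.log(x))         # weighted average of the logs
assert lhs >= rhs                       # Jensen: log is concave
print(lhs, rhs)
```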

  12–19. The lower bound for EM – "free energy"

Observed data X = \{x_i\}; latent variables Z = \{z_i\}; parameters \theta = \{\theta_x, \theta_z\}.

Log-likelihood:

\ell(\theta) = \log P(X|\theta) = \log \int dZ \; P(Z, X|\theta)

By Jensen's inequality, any distribution q(Z) over the latent variables generates a lower bound:

\ell(\theta) = \log \int dZ \; q(Z) \frac{P(Z, X|\theta)}{q(Z)} \ge \int dZ \; q(Z) \log \frac{P(Z, X|\theta)}{q(Z)} \;\stackrel{\mathrm{def}}{=}\; F(q, \theta).

Now,

\int dZ \; q(Z) \log \frac{P(Z, X|\theta)}{q(Z)} = \int dZ \; q(Z) \log P(Z, X|\theta) - \int dZ \; q(Z) \log q(Z)
                                                 = \int dZ \; q(Z) \log P(Z, X|\theta) + H[q],

where H[q] is the entropy of q(Z). So:

F(q, \theta) = \langle \log P(Z, X|\theta) \rangle_{q(Z)} + H[q]
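To make the bound concrete, here is a small numerical sketch (my own, not from the slides) for a single observation from a two-component Gaussian mixture with a discrete latent s: any q over s keeps F(q, θ) at or below the log-likelihood.

```python
import numpy as np

def log_gauss(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var

x = 0.3
pi = np.array([0.5, 0.5])                        # mixing proportions
mu = np.array([-1.0, 1.0])                       # component means (unit variances)

log_joint = np.log(pi) + log_gauss(x, mu, 1.0)   # log P(s = m, x | theta)
ell = np.log(np.sum(np.exp(log_joint)))          # log-likelihood log P(x | theta)

def free_energy(q):
    """F(q, theta) = <log P(s, x | theta)>_q + H[q] for a discrete latent."""
    return np.sum(q * log_joint) - np.sum(q * np.log(q))

for q1 in (0.1, 0.5, 0.9):                       # arbitrary q(s = 1), not the posterior
    q = np.array([1.0 - q1, q1])
    assert free_energy(q) <= ell + 1e-12
    print(q1, free_energy(q), ell)
```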

  20–22. The E and M steps of EM

The free-energy lower bound on \ell(\theta) is a function of \theta and a distribution q:

F(q, \theta) = \langle \log P(Z, X|\theta) \rangle_{q(Z)} + H[q].

The EM steps can be re-written:

◮ E step: optimise F(q, \theta) with respect to the distribution over hidden variables, holding the parameters fixed:

q^{(k)}(Z) := \arg\max_{q(Z)} F\big(q(Z), \theta^{(k-1)}\big).

◮ M step: maximise F(q, \theta) with respect to the parameters, holding the hidden distribution fixed:

\theta^{(k)} := \arg\max_{\theta} F\big(q^{(k)}(Z), \theta\big) = \arg\max_{\theta} \langle \log P(Z, X|\theta) \rangle_{q^{(k)}(Z)}

The second equality comes from the fact that H\big[q^{(k)}(Z)\big] does not depend on \theta.

  23–30. The E step

The free energy can be re-written:

F(q, \theta) = \int dZ \; q(Z) \log \frac{P(Z, X|\theta)}{q(Z)}
             = \int dZ \; q(Z) \log \frac{P(Z|X, \theta) \, P(X|\theta)}{q(Z)}
             = \int dZ \; q(Z) \log P(X|\theta) + \int dZ \; q(Z) \log \frac{P(Z|X, \theta)}{q(Z)}
             = \ell(\theta) - \mathrm{KL}\big[q(Z) \,\|\, P(Z|X, \theta)\big]

The second term is the Kullback-Leibler divergence.

This means that, for fixed \theta, F is bounded above by \ell, and achieves that bound when \mathrm{KL}[q(Z) \| P(Z|X, \theta)] = 0. But \mathrm{KL}[q \| p] is zero if and only if q = p (see appendix).

So the E step sets

q^{(k)}(Z) = P(Z|X, \theta^{(k-1)})    [inference / imputation]

and, after an E step, the free energy equals the likelihood.
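Continuing the toy mixture from the free-energy sketch above (again my own illustration), setting q to the exact posterior closes the gap: the KL term vanishes and F equals ℓ.

```python
import numpy as np

def log_gauss(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var

x, pi, mu = 0.3, np.array([0.5, 0.5]), np.array([-1.0, 1.0])
log_joint = np.log(pi) + log_gauss(x, mu, 1.0)   # log P(s = m, x | theta)
ell = np.log(np.sum(np.exp(log_joint)))          # log P(x | theta)

post = np.exp(log_joint - ell)                   # E step: q(s) = P(s | x, theta)
F = np.sum(post * log_joint) - np.sum(post * np.log(post))
kl = np.sum(post * (np.log(post) - (log_joint - ell)))

assert np.isclose(F, ell) and np.isclose(kl, 0.0)
print(F, ell, kl)                                # F equals ell; KL is zero
```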

  31–56. Coordinate Ascent in F (Demo)

To visualise, we consider a one-parameter / one-latent-variable mixture:

s \sim \mathrm{Bernoulli}[\pi], \qquad x \,|\, s = 0 \sim \mathcal{N}[-1, 1], \qquad x \,|\, s = 1 \sim \mathcal{N}[1, 1].

Single data point x_1 = 0.3. q(s) is a distribution on a single binary latent variable, and so is represented by r_1 \in [0, 1].

[Figure: the two-component mixture density over x. Slides 32–56 step through successive frames of the coordinate-ascent demo; the figures are not reproduced in this transcript.]
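The demo is easy to reproduce in a few lines (a sketch of my own; the slide figures themselves are not available here). With one observation, the mixing proportion π is the only parameter and r_1 = q(s = 1) the only free value in q, so EM alternates exact coordinate updates of F in these two directions:

```python
import numpy as np

def log_gauss(x, mu):                     # unit-variance Gaussian log density
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2

x = 0.3                                   # the single data point
l0, l1 = log_gauss(x, -1.0), log_gauss(x, 1.0)

def free_energy(r1, pi):
    """F(q, pi) with q(s = 1) = r1; 0 log 0 is taken as 0."""
    ent = -sum(p * np.log(p) for p in (r1, 1.0 - r1) if p > 0)
    return (1 - r1) * (np.log(1 - pi) + l0) + r1 * (np.log(pi) + l1) + ent

def log_lik(pi):
    return np.log((1 - pi) * np.exp(l0) + pi * np.exp(l1))

pi = 0.9                                  # arbitrary starting parameter
for k in range(10):
    # E step: r1 <- posterior p(s = 1 | x, pi); F(r1, pi) now equals log_lik(pi)
    r1 = pi * np.exp(l1) / (pi * np.exp(l1) + (1 - pi) * np.exp(l0))
    # M step: pi <- argmax_pi F(r1, pi); with a single data point this is just r1
    pi = r1
    print(k, pi, free_energy(r1, pi), log_lik(pi))   # both F and ell increase monotonically
```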

  57–62. EM never decreases the likelihood

The E and M steps together never decrease the log-likelihood:

\ell\big(\theta^{(k-1)}\big) \;\underset{\text{E step}}{=}\; F\big(q^{(k)}, \theta^{(k-1)}\big) \;\underset{\text{M step}}{\le}\; F\big(q^{(k)}, \theta^{(k)}\big) \;\underset{\text{Jensen}}{\le}\; \ell\big(\theta^{(k)}\big).

◮ The E step brings the free energy to the likelihood.
◮ The M step maximises the free energy with respect to \theta.
◮ F \le \ell by Jensen – or, equivalently, from the non-negativity of KL.

If the M step is executed so that \theta^{(k)} \ne \theta^{(k-1)} iff F increases, then the overall EM iteration will step to a new value of \theta iff the likelihood increases.

Can also show that fixed points of EM (generally) correspond to maxima of the likelihood (see appendices).

  63–67. EM summary

◮ An iterative algorithm that finds (local) maxima of the likelihood of a latent-variable model:

\ell(\theta) = \log P(X|\theta) = \log \int dZ \; P(X|Z, \theta) \, P(Z|\theta)

◮ Increases a variational lower bound on the likelihood by coordinate ascent:

F(q, \theta) = \langle \log P(Z, X|\theta) \rangle_{q(Z)} + H[q] = \ell(\theta) - \mathrm{KL}\big[q(Z) \,\|\, P(Z|X)\big] \le \ell(\theta)

◮ E step:

q^{(k)}(Z) := \arg\max_{q(Z)} F\big(q(Z), \theta^{(k-1)}\big) = P\big(Z|X, \theta^{(k-1)}\big)

◮ M step:

\theta^{(k)} := \arg\max_{\theta} F\big(q^{(k)}(Z), \theta\big) = \arg\max_{\theta} \langle \log P(Z, X|\theta) \rangle_{q^{(k)}(Z)}

◮ After an E step, F(q, \theta) = \ell(\theta), so a maximum of the free energy is a maximum of the likelihood.

  68. Partial M steps and partial E steps

Partial M steps: The proof holds even if we just increase F with respect to \theta rather than maximising it. (Dempster, Laird and Rubin (1977) call this the generalised EM, or GEM, algorithm.) In fact, immediately after an E step,

\frac{\partial}{\partial \theta} \Big\langle \log P(X, Z|\theta) \Big\rangle_{q^{(k)}(Z) \,[=\, P(Z|X, \theta^{(k-1)})]} \bigg|_{\theta^{(k-1)}} = \frac{\partial}{\partial \theta} \log P(X|\theta) \bigg|_{\theta^{(k-1)}}

[cf. mixture gradients from the last lecture]. So the E step (inference) can be used to construct other gradient-based optimisation schemes (e.g. "Expectation Conjugate Gradient", Salakhutdinov et al., ICML 2003).

Partial E steps: We can also just increase F with respect to some of the q's. For example, sparse or online versions of the EM algorithm would compute the posterior for a subset of the data points, or as the data arrive, respectively. One might also update the posterior over a subset of the hidden variables, while holding others fixed.

  69. EM for MoGs

◮ Evaluate responsibilities:

r_{im} = \frac{\pi_m P_m(x_i)}{\sum_{m'} \pi_{m'} P_{m'}(x_i)}

◮ Update parameters:

\mu_m \leftarrow \frac{\sum_i r_{im} x_i}{\sum_i r_{im}}, \qquad
\Sigma_m \leftarrow \frac{\sum_i r_{im} (x_i - \mu_m)(x_i - \mu_m)^T}{\sum_i r_{im}}, \qquad
\pi_m \leftarrow \frac{\sum_i r_{im}}{N}
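These two steps translate almost line for line into numpy. The sketch below is my own (assuming full-covariance components, a small numerical jitter on the covariances, and synthetic 2-D data; it is not the lecture's code), and runs the E and M updates exactly as written above:

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """Log density of each row of X under N(mu, Sigma)."""
    d = X.shape[1]
    diff = X - mu
    sol = np.linalg.solve(Sigma, diff.T).T
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + np.sum(diff * sol, axis=1))

def em_mog(X, k, n_iter=100, seed=0):
    """EM for a mixture of full-covariance Gaussians, following the updates above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, size=k, replace=False)]       # initialise means at random data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E step: responsibilities r_im proportional to pi_m P_m(x_i)
        logr = np.stack([np.log(pi[m]) + log_gauss(X, mu[m], Sigma[m]) for m in range(k)], axis=1)
        logr -= logr.max(axis=1, keepdims=True)        # stabilise before exponentiating
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M step: responsibility-weighted means, covariances and mixing proportions
        Nm = r.sum(axis=0)
        mu = (r.T @ X) / Nm[:, None]
        for m in range(k):
            diff = X - mu[m]
            Sigma[m] = (r[:, m, None] * diff).T @ diff / Nm[m] + 1e-6 * np.eye(d)  # jitter for stability
        pi = Nm / n
    return pi, mu, Sigma, r

# toy usage on synthetic 2-D data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-2.0, 0.0], 1.0, size=(100, 2)),
               rng.normal([2.0, 1.0], 0.5, size=(100, 2))])
pi, mu, Sigma, r = em_mog(X, k=2)
print(pi)
print(mu)
```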

  70–74. The Gaussian mixture model (E step)

In a univariate Gaussian mixture model, the density of a data point x is:

p(x|\theta) = \sum_{m=1}^{k} p(s = m|\theta) \, p(x|s = m, \theta) \propto \sum_{m=1}^{k} \frac{\pi_m}{\sigma_m} \exp\Big\{ -\frac{(x - \mu_m)^2}{2\sigma_m^2} \Big\},

where \theta is the collection of parameters: means \mu_m, variances \sigma_m^2 and mixing proportions \pi_m = p(s = m|\theta).

The hidden variable s_i indicates which component generated observation x_i.

The E step computes the posterior for s_i given the current parameters:

q(s_i) = p(s_i|x_i, \theta) \propto p(x_i|s_i, \theta) \, p(s_i|\theta)

r_{im} \stackrel{\mathrm{def}}{=} q(s_i = m) = \langle \delta_{s_i = m} \rangle_q \propto \frac{\pi_m}{\sigma_m} \exp\Big\{ -\frac{(x_i - \mu_m)^2}{2\sigma_m^2} \Big\} \quad \text{(responsibilities)}

with the normalisation such that \sum_m r_{im} = 1.

  75–78. The Gaussian mixture model (M step)

In the M step we optimise the sum (since s is discrete):

E = \langle \log p(x, s|\theta) \rangle_{q(s)} = \sum q(s) \log\big[ p(s|\theta) \, p(x|s, \theta) \big]
  = \sum_{i,m} r_{im} \Big[ \log \pi_m - \log \sigma_m - \frac{(x_i - \mu_m)^2}{2\sigma_m^2} \Big].

The optimum is found by setting the partial derivatives of E to zero:

\frac{\partial E}{\partial \mu_m} = \sum_i \frac{r_{im} (x_i - \mu_m)}{\sigma_m^2} = 0 \;\Rightarrow\; \mu_m = \frac{\sum_i r_{im} x_i}{\sum_i r_{im}},

\frac{\partial E}{\partial \sigma_m} = \sum_i r_{im} \Big[ -\frac{1}{\sigma_m} + \frac{(x_i - \mu_m)^2}{\sigma_m^3} \Big] = 0 \;\Rightarrow\; \sigma_m^2 = \frac{\sum_i r_{im} (x_i - \mu_m)^2}{\sum_i r_{im}},

\frac{\partial E}{\partial \pi_m} = \sum_i r_{im} \frac{1}{\pi_m}, \qquad \frac{\partial E}{\partial \pi_m} + \lambda = 0 \;\Rightarrow\; \pi_m = \frac{1}{n} \sum_i r_{im},

where \lambda is a Lagrange multiplier ensuring that the mixing proportions sum to unity.

  79. EM for Factor Analysis

[Graphical model: latent factors z_1, ..., z_K with directed edges to observed variables x_1, ..., x_D.]

The model for x:

p(x|\theta) = \int p(z|\theta) \, p(x|z, \theta) \, dz = \mathcal{N}(0, \Lambda\Lambda^T + \Psi)

Model parameters: \theta = \{\Lambda, \Psi\}.

E step: For each data point x_n, compute the posterior distribution of hidden factors given the observed data: q_n(z_n) = p(z_n|x_n, \theta_t).

M step: Find the \theta_{t+1} that maximises F(q, \theta):

F(q, \theta) = \sum_n \int q_n(z_n) \big[ \log p(z_n|\theta) + \log p(x_n|z_n, \theta) - \log q_n(z_n) \big] \, dz_n
             = \sum_n \int q_n(z_n) \big[ \log p(z_n|\theta) + \log p(x_n|z_n, \theta) \big] \, dz_n + c.

  80. The E step for Factor Analysis

E step: For each data point x_n, compute the posterior distribution of hidden factors given the observed data:

q_n(z_n) = p(z_n|x_n, \theta) = p(z_n, x_n|\theta) / p(x_n|\theta)

Tactic: write p(z_n, x_n|\theta), and consider x_n to be fixed. What is this as a function of z_n?

p(z_n, x_n) = p(z_n) \, p(x_n|z_n)
  = (2\pi)^{-K/2} \exp\{ -\tfrac{1}{2} z_n^T z_n \} \; |2\pi\Psi|^{-1/2} \exp\{ -\tfrac{1}{2} (x_n - \Lambda z_n)^T \Psi^{-1} (x_n - \Lambda z_n) \}
  = c \times \exp\{ -\tfrac{1}{2} [ z_n^T z_n + (x_n - \Lambda z_n)^T \Psi^{-1} (x_n - \Lambda z_n) ] \}
  = c' \times \exp\{ -\tfrac{1}{2} [ z_n^T (I + \Lambda^T \Psi^{-1} \Lambda) z_n - 2 z_n^T \Lambda^T \Psi^{-1} x_n ] \}
  = c'' \times \exp\{ -\tfrac{1}{2} [ z_n^T \Sigma^{-1} z_n - 2 z_n^T \Sigma^{-1} \mu_n + \mu_n^T \Sigma^{-1} \mu_n ] \}

So \Sigma = (I + \Lambda^T \Psi^{-1} \Lambda)^{-1} = I - \beta\Lambda and \mu_n = \Sigma \Lambda^T \Psi^{-1} x_n = \beta x_n, where \beta = \Sigma \Lambda^T \Psi^{-1}.

Note that \mu_n is a linear function of x_n and \Sigma does not depend on x_n.
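The posterior computation is just a couple of matrix products; here is a numpy sketch (my own, with arbitrary illustrative values of Λ and Ψ) that also checks the identity Σ = I − βΛ numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 5, 2                                      # observed and latent dimensions (illustrative)
Lam = rng.normal(size=(D, K))                    # loading matrix Lambda (arbitrary values)
Psi = np.diag(rng.uniform(0.5, 2.0, size=D))     # diagonal observation noise Psi

Psi_inv = np.diag(1.0 / np.diag(Psi))
Sigma = np.linalg.inv(np.eye(K) + Lam.T @ Psi_inv @ Lam)   # posterior covariance
beta = Sigma @ Lam.T @ Psi_inv                              # K x D matrix mapping x_n to mu_n

assert np.allclose(Sigma, np.eye(K) - beta @ Lam)           # Sigma = I - beta Lambda

x_n = rng.normal(size=D)                         # one (synthetic) data point
mu_n = beta @ x_n                                # posterior mean: linear in x_n
print(mu_n)
print(Sigma)                                     # same for every x_n
```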
