  1. EM & Variational Bayes
     Hanxiao Liu
     September 9, 2014

  2. Outline
     1. EM Algorithm
        1.1 Introduction
        1.2 Example: Mixture of vMFs
     2. Variational Bayes
        2.1 Introduction
        2.2 Example: Bayesian Mixture of Gaussians

  3. MLE by Gradient Ascent
     Goal: maximize L(θ; X) = log p(X | θ) w.r.t. θ
     Gradient Ascent (GA)
     - One-step view: θ^{t+1} ← θ^t + ∇L(θ^t; X)
     - Two-step view:
       1. Q(θ; θ^t) = L(θ^t; X) + ∇L(θ^t; X)^⊤ (θ − θ^t) − (1/2) ||θ − θ^t||_2^2
       2. θ^{t+1} ← argmax_θ Q(θ; θ^t)
     Drawbacks
     1. ∇L can be too complicated to work with
     2. Too general to be efficient for structured problems
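
A minimal Python sketch of the one-step view; the grad_loglik callable and the step size eta are assumptions on top of the slide, whose update corresponds to eta = 1:

    import numpy as np

    def gradient_ascent_mle(X, theta0, grad_loglik, eta=0.1, n_iter=1000, tol=1e-6):
        # theta^{t+1} <- theta^t + eta * grad L(theta^t; X)
        theta = np.asarray(theta0, dtype=float)
        for _ in range(n_iter):
            g = grad_loglik(theta, X)      # gradient of the log-likelihood at theta^t
            theta = theta + eta * g
            if np.linalg.norm(g) < tol:    # stop once the gradient is (numerically) zero
                break
        return theta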

  4. MLE by EM
     Expectation-Maximization (EM)
     1. Expectation: Q(θ; θ^t) = E_{Z|X,θ^t} L(θ; X, Z)
     2. Maximization: θ^{t+1} ← argmax_θ Q(θ; θ^t)
     - Replace L(θ; X) (the log-likelihood) by L(θ; X, Z) (the complete log-likelihood)
     - L(θ; X, Z) is a random function w.r.t. Z, so use the expected function as a surrogate
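
A minimal, model-agnostic sketch of this alternation; the e_step and m_step callables are hypothetical placeholders for the model-specific computations, with m_step assumed to return the new parameters and the new log-likelihood:

    import numpy as np

    def em(X, theta, e_step, m_step, n_iter=100, tol=1e-6):
        prev_ll = -np.inf
        for _ in range(n_iter):
            stats = e_step(X, theta)       # expectation: statistics needed for Q(theta; theta^t)
            theta, ll = m_step(X, stats)   # maximization: theta^{t+1} <- argmax_theta Q(theta; theta^t)
            if ll - prev_ll < tol:         # each round cannot decrease the log-likelihood
                break
            prev_ll = ll
        return theta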

  5. Why EM is Superior
     A comparison between the two Q(θ; θ^t), i.e., the local concave models:
     1. EM: Q(θ; θ^t) = E_{Z|X,θ^t} L(θ; X, Z)
                      = L(θ; X) − D_KL( p(Z|X, θ^t) || p(Z|X, θ) ) + C
     2. GA: Q(θ; θ^t) = L(θ^t; X) + ∇L(θ^t; X)^⊤ (θ − θ^t) − (1/2) ||θ − θ^t||_2^2

  6. Example: vMF Mixture
     Notations
     - X = {x_i}_{i=1}^n,  θ = ( π ∈ Δ^{k−1}, {(μ_j, κ_j)}_{j=1}^k )
     - Z = {z_ij ∈ {0, 1}}
     - z_ij = 1 ⟹ x_i ∼ the j-th mixture component
     Log-likelihood
       L(θ; X) = Σ_{i=1}^n log p(x_i | θ) = Σ_{i=1}^n log Σ_{j=1}^k π_j vMF(x_i | μ_j, κ_j)   (log-sum coupling)
     Complete log-likelihood
       L(θ; X, Z) = Σ_{i=1}^n log p(x_i, z_i | θ) = Σ_{i=1}^n Σ_{j=1}^k z_ij log( π_j vMF(x_i | μ_j, κ_j) )
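
A small numerical sketch of the two densities involved, assuming the rows of X are unit vectors and using scipy's exponentially scaled Bessel function for a stable log C_p(κ); the helper names are my own:

    import numpy as np
    from scipy.special import ive, logsumexp

    def log_vmf(X, mu, kappa):
        # log vMF(x | mu, kappa) = log C_p(kappa) + kappa * mu^T x, rows of X on the unit sphere
        p = X.shape[1]
        log_cp = ((p / 2 - 1) * np.log(kappa) - (p / 2) * np.log(2 * np.pi)
                  - (np.log(ive(p / 2 - 1, kappa)) + kappa))   # log I_v(k) = log ive(v, k) + k
        return log_cp + kappa * (X @ mu)

    def log_likelihood(X, pi, mus, kappas):
        # L(theta; X) = sum_i log sum_j pi_j vMF(x_i | mu_j, kappa_j): the log-sum coupling
        log_terms = np.stack([np.log(pi[j]) + log_vmf(X, mus[j], kappas[j])
                              for j in range(len(pi))], axis=1)
        return logsumexp(log_terms, axis=1).sum()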

  7. E-step
     Compute Q(θ; θ^t) ≜ E_{Z|X,θ^t} L(θ; X, Z):
       Q(π, μ, κ; π^t, μ^t, κ^t)
         = E_{Z|X,π^t,μ^t,κ^t} Σ_{i=1}^n Σ_{j=1}^k z_ij log( π_j vMF(x_i | μ_j, κ_j) )
         = Σ_{i=1}^n Σ_{j=1}^k [ w^t_ij log vMF(x_i | μ_j, κ_j) + w^t_ij log π_j ]
     where
       w^t_ij = E_{z_ij|X,π^t,μ^t,κ^t}[z_ij] = p(z_ij = 1 | x_i, π^t, μ^t, κ^t)
              = π^t_j vMF(x_i | μ^t_j, κ^t_j) / Σ_{u=1}^k π^t_u vMF(x_i | μ^t_u, κ^t_u)
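
In code the E-step reduces to normalized responsibilities; a sketch reusing the hypothetical log_vmf helper from above:

    def e_step(X, pi, mus, kappas):
        # w[i, j] = p(z_ij = 1 | x_i, theta^t), the responsibility of component j for x_i
        log_w = np.stack([np.log(pi[j]) + log_vmf(X, mus[j], kappas[j])
                          for j in range(len(pi))], axis=1)
        log_w -= logsumexp(log_w, axis=1, keepdims=True)   # divide by the mixture density, in log space
        return np.exp(log_w)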

  8. M-step
     Maximize
       Q(π, μ, κ; π^t, μ^t, κ^t) = Σ_{i=1}^n Σ_{j=1}^k [ w^t_ij log vMF(x_i | μ_j, κ_j) + w^t_ij log π_j ]
     w.r.t. π, μ and κ, s.t. ||π||_1 = 1 and ||μ_j||_2 = 1, ∀j ∈ [k]
     To impose the constraints, maximize the Lagrangian
       Q̃ ≜ Q + λ (1 − π^⊤ 1) + Σ_{j=1}^k ν_j (1 − μ_j^⊤ μ_j)

  9. M-step
       Q̃(π, μ, κ; π^t, μ^t, κ^t) = Σ_{i=1}^n Σ_{j=1}^k [ w^t_ij log vMF(x_i | μ_j, κ_j) + w^t_ij log π_j ]
                                    + λ (1 − π^⊤ 1) + Σ_{j=1}^k ν_j (1 − μ_j^⊤ μ_j)
     Updating π_j
     Combining Σ_{j=1}^k π_j = 1 with ∂_{π_j} Q̃ = Σ_{i=1}^n w^t_ij / π_j − λ = 0
       ⟹ π^{t+1}_j ← Σ_{i=1}^n w^t_ij / n

  10. M-step
       Q̃(π, μ, κ; π^t, μ^t, κ^t) = Σ_{i=1}^n Σ_{j=1}^k [ w^t_ij log vMF(x_i | μ_j, κ_j) + w^t_ij log π_j ]
                                    + λ (1 − π^⊤ 1) + Σ_{j=1}^k ν_j (1 − μ_j^⊤ μ_j)
     Updating μ_j
       log vMF(x_i | μ_j, κ_j) = κ_j μ_j^⊤ x_i + C   (w.r.t. μ_j)
       ∂_{μ_j} Q̃ = κ_j Σ_{i=1}^n w^t_ij x_i − ν_j μ_j = 0
       ⟹ μ^{t+1}_j ← r_j / ||r_j||_2,  where r_j = Σ_{i=1}^n w^t_ij x_i

  11. M-step
     Updating κ_j
     - C_p(κ_j) = κ_j^{p/2 − 1} / ( (2π)^{p/2} I_{p/2 − 1}(κ_j) )
     - the recurrence property of the modified Bessel function [1]:
         ∂_{κ_j} log I_{p/2 − 1}(κ_j) = I_{p/2}(κ_j) / I_{p/2 − 1}(κ_j) + (p/2 − 1) / κ_j
       ∂_{κ_j} Q̃ = Σ_{i=1}^n w^t_ij ( − I_{p/2}(κ_j) / I_{p/2 − 1}(κ_j) + μ_j^⊤ x_i ) = 0
       ⟹ I_{p/2}(κ_j) / I_{p/2 − 1}(κ_j) = r̄_j  ⟹  κ^{t+1}_j ≈ ( r̄_j p − r̄_j^3 ) / ( 1 − r̄_j^2 )
     where r̄_j = Σ_{i=1}^n w^t_ij μ_j^⊤ x_i / Σ_{i=1}^n w^t_ij
     [1] http://functions.wolfram.com/Bessel-TypeFunctions/BesselK/introductions/Bessels/05/
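
Putting the updates from slides 9-11 together, a sketch of the M-step given the responsibilities W from the E-step above; the closed-form κ update uses the approximation to the Bessel-ratio equation on this slide, so it is an approximation rather than an exact maximizer:

    def m_step(X, W):
        n, p = X.shape
        Nj = W.sum(axis=0)                                    # sum_i w_ij
        pi = Nj / n                                           # pi_j <- sum_i w_ij / n
        R = W.T @ X                                           # r_j = sum_i w_ij x_i, shape (k, p)
        r_norm = np.linalg.norm(R, axis=1)
        mus = R / r_norm[:, None]                             # mu_j <- r_j / ||r_j||_2
        r_bar = r_norm / Nj                                   # r_bar_j = sum_i w_ij mu_j^T x_i / sum_i w_ij
        kappas = (r_bar * p - r_bar ** 3) / (1 - r_bar ** 2)  # approximate Bessel-ratio inversion
        return pi, mus, kappas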

  12. An Alternative View of EM
     EM - original definition
     1. Expectation: Q(θ; θ^t) = E_{Z|X,θ^t} L(θ; X, Z)   (why?)
     2. Maximization: θ^{t+1} ← argmax_θ Q(θ; θ^t)
     For any distribution q(Z):
       L(θ; X) = E_q log p(X | θ)
               = E_q[ log ( p(X, Z | θ) / q(Z) ) ] + E_q[ log ( q(Z) / p(Z|X, θ) ) ]
               = VLB(q, θ) + D_KL( q(Z) || p(Z|X, θ) )
     EM - coordinate ascent
     1. q^{t+1} = argmax_q VLB(q, θ^t)
     2. θ^{t+1} = argmax_θ VLB(q^{t+1}, θ)
     Show the equivalence? (a sketch follows)
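
One way to fill in the equivalence the slide asks about: in step 1, VLB(q, θ^t) = L(θ^t; X) − D_KL( q(Z) || p(Z|X, θ^t) ) and the first term does not depend on q, so the unconstrained maximizer is q^{t+1}(Z) = p(Z|X, θ^t). Plugging this into step 2 gives VLB(q^{t+1}, θ) = E_{Z|X,θ^t} log p(X, Z | θ) − E_{Z|X,θ^t} log p(Z|X, θ^t) = Q(θ; θ^t) + const w.r.t. θ, so maximizing VLB(q^{t+1}, θ) over θ is exactly the original M-step.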

  13. Bayes Inference
     Notations
     - θ: hyperparameters
     - Z: hidden variables + random parameters
     Goals
     1. find a good posterior q(Z) ≈ p(Z | X; θ)
     2. estimate θ by Empirical Bayes, i.e., maximize L(θ; X) w.r.t. θ
       L(θ; X) = E_q[ log ( p(X, Z | θ) / q(Z) ) ] + E_q[ log ( q(Z) / p(Z|X, θ) ) ]
               = VLB(q, θ) + D_KL( q(Z) || p(Z|X, θ) )
     Both goals can be achieved via the same procedure as EM.

  14. Variational Bayes Inference
     One should have q → p(Z | X; θ*) by alternating between
     1. q^{t+1} = argmax_q VLB(q, θ^t)
     2. θ^{t+1} = argmax_θ VLB(q^{t+1}, θ)
     However, we do not want q to be too complicated
     - e.g., Q(θ; θ^t) = E_q L(θ; X, Z) can be intractable
     Solution: modify the first step to
       q^{t+1} = argmax_{q ∈ Q} VLB(q, θ^t)
     where Q is some tractable family of distributions
     - Recall: without the restriction to Q, q^{t+1} ≡ p(Z | X, θ^t)

  15. Variational Bayes Inference
     Goal: solve argmax_{q ∈ Q} VLB(q, θ^t)
     Usually, Q ≜ { q | q(Z) = Π_{i=1}^M q_i(Z_i) }   (mean-field factorization)
     Coordinate ascent: fixing q_{−j} and optimizing over q_j,
       VLB(q_j; q_{−j}, θ^t) = E_q[ log ( p(X, Z; θ^t) / q(Z) ) ]
         = E_q log p(X, Z; θ^t) − Σ_{i=1}^M E_q log q_i
         = E_{q_j}[ E_{q_{−j}} log p(X, Z; θ^t) ] − E_{q_j} log q_j + C
         = − D_KL( q_j || q*_j ) + C
     where the optimal factor satisfies
       log q*_j = E_{q_{−j}} log p(X, Z; θ^t) + const
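
A minimal, model-agnostic sketch of this coordinate ascent; the factors list and the update_factor callable are hypothetical placeholders, with each update assumed to set q_j to the normalized exponential of E_{q_{−j}} log p(X, Z; θ^t):

    def cavi(factors, update_factor, n_iter=100):
        # factors[j] holds the parameters of q_j; sweep the factors until convergence
        for _ in range(n_iter):
            for j in range(len(factors)):
                factors[j] = update_factor(j, factors)   # log q_j^* = E_{q_-j} log p(X, Z; theta^t) + const
        return factors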

  16. Example: Bayes Mixture of Gaussians
     Consider putting a prior over the means in a Gaussian mixture [2]:
     - For k = 1, 2, ..., K: μ_k ∼ N(0, τ^2)
     - For i = 1, 2, ..., N:
       1. z_i ∼ Mult(π)
       2. x_i ∼ N(μ_{z_i}, σ^2)
     Exact posterior:
       p(z, μ | X) = p(X | z, μ) p(z) p(μ) / p(X)
                   = Π_{i=1}^N p(z_i) p(x_i | z_i, μ) Π_{k=1}^K p(μ_k)
                     / Σ_z ∫ Π_{i=1}^N p(z_i) p(x_i | z_i, μ) Π_{k=1}^K p(μ_k) dμ
     Mean-field approximation:
       q(z, μ) = Π_{i=1}^N q(z_i; φ_i) Π_{k=1}^K q(μ_k; μ̃_k, σ̃^2_k)
     [2] https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
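
To make the generative process concrete, a small simulation sketch under the model above; the function and parameter names are my own:

    import numpy as np

    def sample_bayes_gmm(N, K, pi, tau2, sigma2=1.0, seed=0):
        # mu_k ~ N(0, tau^2);  z_i ~ Mult(pi);  x_i ~ N(mu_{z_i}, sigma^2)
        rng = np.random.default_rng(seed)
        mu = rng.normal(0.0, np.sqrt(tau2), size=K)
        z = rng.choice(K, size=N, p=pi)
        x = rng.normal(mu[z], np.sqrt(sigma2))
        return x, z, mu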

  17. Example: Bayes Mixture of Gaussians
       log q*(z_j) = E_{q \ z_j} log p(z, μ, X)
         = E_{q \ z_j}[ Σ_{i=1}^N ( log p(z_i) + log p(x_i | z_i, μ) ) + Σ_{k=1}^K log p(μ_k) ] + C
         = log p(z_j) + E_{q(μ_{z_j})} log p(x_j | z_j, μ_{z_j}) + C
         = log π_{z_j} + x_j E_{q(μ_{z_j})}[μ_{z_j}] − (1/2) E_{q(μ_{z_j})}[μ^2_{z_j}] + C
     where E_{q(μ_{z_j})}[μ_{z_j}] = μ̃_{z_j} and E_{q(μ_{z_j})}[μ^2_{z_j}] = μ̃^2_{z_j} + σ̃^2_{z_j}
     By observation q*(z_j) ∼ Mult, so we can update φ_j accordingly.

  18. Example: Bayes Mixture of Gaussians
       log q*(μ_j) = E_{q \ μ_j} log p(z, μ, X)
         = E_{q \ μ_j}[ Σ_{i=1}^N ( log p(z_i) + log p(x_i | z_i, μ_{z_i}) ) + Σ_{k=1}^K log p(μ_k) ]
         = Σ_{i=1}^N Σ_{k=1}^K E_{q \ μ_j}[ δ_{z_i = k} log N(x_i | μ_k) ] + log p(μ_j) + C
         = Σ_{i=1}^N E_{z_i}[δ_{z_i = j}] log N(x_i | μ_j) + log p(μ_j) + C
     where E_{z_i}[δ_{z_i = j}] = φ_{ij}
     Observing that q*(μ_j) ∼ N, μ̃_j and σ̃^2_j can be updated accordingly (a sketch follows).
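
A sketch of one coordinate-ascent sweep that combines the q(z) update from the previous slide with the q(μ) update here. It takes σ^2 = 1, as the slides implicitly do, and the closed forms for μ̃_j and σ̃^2_j follow from collecting the Gaussian terms in log q*(μ_j); the variable names (phi, mu_tilde, s2_tilde) are my own:

    import numpy as np

    def cavi_sweep(x, pi, tau2, mu_tilde, s2_tilde):
        # q(z_i) update: log phi_ij = log pi_j + x_i E[mu_j] - E[mu_j^2] / 2 + const
        E_mu, E_mu2 = mu_tilde, mu_tilde ** 2 + s2_tilde
        log_phi = np.log(pi)[None, :] + np.outer(x, E_mu) - 0.5 * E_mu2[None, :]
        log_phi -= log_phi.max(axis=1, keepdims=True)
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)
        # q(mu_j) update: a Gaussian combining the N(0, tau^2) prior with the phi-weighted data
        s2_tilde = 1.0 / (1.0 / tau2 + phi.sum(axis=0))
        mu_tilde = s2_tilde * (phi * x[:, None]).sum(axis=0)
        return phi, mu_tilde, s2_tilde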

  19. Stay Tuned
     Next topics
     - LDA (Wanli)
     - Bayes vMF
