  1. Variational Inference for Bayes vMF Mixture. Hanxiao Liu, September 23, 2014.

  2. Variational Inference Review
     Lower bound the likelihood:
     $\mathcal{L}(\theta; X) = \mathbb{E}_q \log p(X \mid \theta)
       = \underbrace{\mathbb{E}_q \log \frac{p(X, Z \mid \theta)}{q(Z)}}_{\mathrm{VLB}(q,\theta)}
       + \underbrace{\mathbb{E}_q \log \frac{q(Z)}{p(Z \mid X, \theta)}}_{D_{\mathrm{KL}}(q(Z) \,\|\, p(Z \mid X, \theta))}$
     Raise $\mathrm{VLB}(q, \theta)$ by coordinate ascent:
     1. $q^{t+1} = \operatorname{argmax}_{q = \prod_{i=1}^{M} q_i} \mathrm{VLB}(q, \theta^t)$
     2. $\theta^{t+1} = \operatorname{argmax}_{\theta} \mathrm{VLB}(q^{t+1}, \theta)$
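
     A minimal Python sketch of this coordinate-ascent scheme, assuming the caller supplies the per-factor updates (derived on the later slides) and the hyperparameter update as callables. All names here are illustrative, not from the paper.

```python
def coordinate_ascent_vi(X, q_params, theta, factor_updates, update_theta, n_iters=50):
    """Alternate between raising VLB(q, theta) in q (one factor at a time)
    and in theta (empirical Bayes), as in steps 1 and 2 above."""
    for _ in range(n_iters):
        # Step 1: q^{t+1} = argmax_q VLB(q, theta^t), updating one factor at a time.
        for update in factor_updates:
            q_params = update(X, q_params, theta)
        # Step 2: theta^{t+1} = argmax_theta VLB(q^{t+1}, theta).
        theta = update_theta(X, q_params)
    return q_params, theta
```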

  3. Variational Inference Review
     Goal: solve $\operatorname{argmax}_{q = \prod_{i=1}^{M} q_i} \mathrm{VLB}(q, \theta^t)$ by coordinate ascent, i.e. sequentially updating a single $q_i$ in each iteration.
     Each coordinate step has a closed-form solution:
     $\mathrm{VLB}(q_j; q_{-j}, \theta^t)
       = \mathbb{E}_q \log \frac{p(X, Z \mid \theta^t)}{q(Z)}
       = \mathbb{E}_q \log p(X, Z \mid \theta^t) - \sum_{i=1}^{M} \mathbb{E}_q \log q_i$
     $\quad = \mathbb{E}_{q_j} \underbrace{\mathbb{E}_{q_{-j}} \log p(X, Z \mid \theta^t)}_{\log \tilde{q}_j + \text{const}} - \mathbb{E}_{q_j} \log q_j + \text{const}
       = \mathbb{E}_{q_j} \log \frac{\tilde{q}_j}{q_j} + \text{const}
       = -D_{\mathrm{KL}}(q_j \,\|\, \tilde{q}_j) + \text{const}$
     $\Rightarrow \log q_j^{*} = \mathbb{E}_{q_{-j}} \log p(X, Z \mid \theta^t) + \text{const}$

  4. Bayes vMF Mixture [Gopal and Yang, 2014]
     Generative model and the corresponding variational factors:
     ◮ $\pi \sim \mathrm{Dirichlet}(\cdot \mid \alpha)$, with $q(\pi) \equiv \mathrm{Dirichlet}(\cdot \mid \rho)$
     ◮ $\mu_k \sim \mathrm{vMF}(\cdot \mid \mu_0, C_0)$, with $q(\mu_k) \equiv \mathrm{vMF}(\cdot \mid \psi_k, \gamma_k)$
     ◮ $\kappa_k \sim \mathrm{logNormal}(\cdot \mid m, \sigma^2)$, with $q(\kappa_k) \equiv \mathrm{logNormal}(\cdot \mid a_k, b_k)$
     ◮ $z_i \sim \mathrm{Multi}(\cdot \mid \pi)$, with $q(z_i) \equiv \mathrm{Multi}(\cdot \mid \lambda_i)$
     ◮ $x_i \sim \mathrm{vMF}(\cdot \mid \mu_{z_i}, \kappa_{z_i})$
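
     As a concrete starting point, here is a minimal Python sketch of how the variational parameters of these four factors might be initialized, for unit-normalized data X of shape (N, D) with K components. The initialization choices (responsibilities from a flat Dirichlet, mean directions from random data points, and so on) are illustrative assumptions, not something prescribed by the slides.

```python
import numpy as np

def init_variational_params(X, K, rng=None):
    """Initialize rho, lambda, psi, gamma, a, b for the four variational factors."""
    rng = np.random.default_rng(rng)
    N, D = X.shape
    lam = rng.dirichlet(np.ones(K), size=N)              # q(z_i)     = Multi(.| lambda_i)
    rho = np.full(K, 1.0 + N / K)                        # q(pi)      = Dirichlet(.| rho)
    psi = X[rng.choice(N, K, replace=False)].copy()      # q(mu_k)    = vMF(.| psi_k, gamma_k)
    psi /= np.linalg.norm(psi, axis=1, keepdims=True)    # mean directions must be unit norm
    gam = np.ones(K)
    a, b = np.full(K, np.log(10.0)), np.ones(K)          # q(kappa_k) = logNormal(.| a_k, b_k)
    return dict(lam=lam, rho=rho, psi=psi, gam=gam, a=a, b=b)
```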

  5. Compute $\log p(X, Z \mid \theta)$
     $p(X, Z \mid \theta) = \mathrm{Dirichlet}(\pi \mid \alpha)
       \times \prod_{i=1}^{N} \mathrm{Multi}(z_i \mid \pi)\, \mathrm{vMF}(x_i \mid \mu_{z_i}, \kappa_{z_i})
       \times \prod_{k=1}^{K} \mathrm{vMF}(\mu_k \mid \mu_0, C_0)\, \mathrm{logNormal}(\kappa_k \mid m, \sigma^2)$
     $\log p(X, Z \mid \theta) = -\log B(\alpha) + \sum_{k=1}^{K} (\alpha - 1) \log \pi_k
       + \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \log \pi_k
       + \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \left[ \log C_D(\kappa_k) + \kappa_k x_i^{\top} \mu_k \right]$
     $\quad + \sum_{k=1}^{K} \left[ \log C_D(C_0) + C_0 \mu_k^{\top} \mu_0 \right]
       + \sum_{k=1}^{K} \left[ -\frac{(\log \kappa_k - m)^2}{2\sigma^2} - \log \kappa_k - \frac{1}{2} \log\!\left( 2\pi\sigma^2 \right) \right]$
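
     A brief NumPy/SciPy sketch of the two ingredients above: the vMF log-normalizer $\log C_D(\kappa) = (D/2 - 1)\log\kappa - (D/2)\log 2\pi - \log I_{D/2-1}(\kappa)$, and the complete-data log-likelihood for one-hot assignments Z. The function names are my own; the symmetric scalar $\alpha$ (so that $-\log B(\alpha) = \log\Gamma(K\alpha) - K\log\Gamma(\alpha)$) is an assumption suggested by the slide's $(\alpha - 1)$ term.

```python
import numpy as np
from scipy.special import ive, gammaln

def log_C_D(kappa, D):
    """vMF log-normalizer; ive(v, x) = I_v(x) * exp(-x) keeps the Bessel term stable."""
    v = D / 2.0 - 1.0
    return v * np.log(kappa) - (D / 2.0) * np.log(2 * np.pi) - (np.log(ive(v, kappa)) + kappa)

def complete_data_loglik(X, Z, pi, mu, kappa, alpha, mu0, C0, m, sigma2):
    """log p(X, Z | theta) as on this slide, with one-hot Z of shape (N, K)."""
    N, D = X.shape
    K = len(pi)
    lp = gammaln(K * alpha) - K * gammaln(alpha) + np.sum((alpha - 1) * np.log(pi))  # -log B(alpha) + prior on pi
    lp += np.sum(Z * np.log(pi))                                                     # sum_ik z_ik log pi_k
    lp += np.sum(Z * (log_C_D(kappa, D) + kappa * (X @ mu.T)))                       # vMF likelihood terms
    lp += np.sum(log_C_D(C0, D) + C0 * (mu @ mu0))                                   # vMF prior on mu_k
    lp += np.sum(-(np.log(kappa) - m) ** 2 / (2 * sigma2)
                 - np.log(kappa) - 0.5 * np.log(2 * np.pi * sigma2))                 # logNormal prior on kappa_k
    return lp
```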

  6. Updating $q(\pi) \equiv \mathrm{Dirichlet}(\cdot \mid \rho)$
     $\log q^{*}(\pi) = \mathbb{E}_{q \setminus \pi} \log p(X, Z \mid \theta) + \text{const}
       = \mathbb{E}_{q \setminus \pi} \left[ \sum_{k=1}^{K} (\alpha - 1) \log \pi_k + \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \log \pi_k \right] + \text{const}
       = \sum_{k=1}^{K} \left( \alpha + \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}] - 1 \right) \log \pi_k + \text{const}$
     $\Rightarrow q^{*}(\pi) \propto \prod_{k=1}^{K} \pi_k^{\alpha + \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}] - 1} \sim \mathrm{Dirichlet}$
     $\Rightarrow \rho_k^{*} = \alpha + \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}]$
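
     As a sketch, assuming a NumPy array lam of shape (N, K) stores the responsibilities $\mathbb{E}_q[z_{ik}]$, this update is one line:

```python
def update_q_pi(lam, alpha):
    """rho_k = alpha + sum_i E_q[z_ik]; returns the Dirichlet parameter rho of shape (K,)."""
    return alpha + lam.sum(axis=0)
```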

  7. Updating $q(z_i) \equiv \mathrm{Multi}(\cdot \mid \lambda_i)$
     $\log q^{*}(z_i) = \mathbb{E}_{q \setminus z_i} \log p(X, Z \mid \theta) + \text{const}
       = \mathbb{E}_{q \setminus z_i} \left[ \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \log \pi_k + \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \left( \log C_D(\kappa_k) + \kappa_k x_i^{\top} \mu_k \right) \right] + \text{const}$
     $\quad = \sum_{k=1}^{K} z_{ik} \left[ \mathbb{E}_q \log \pi_k + \mathbb{E}_q \log C_D(\kappa_k) + \mathbb{E}_q[\kappa_k]\, x_i^{\top} \mathbb{E}_q[\mu_k] \right] + \text{const}$
     $\Rightarrow q^{*}(z_i) \sim \mathrm{Multi}$, with $\lambda_{ik}^{*} \propto e^{\mathbb{E}_q \log \pi_k + \mathbb{E}_q \log C_D(\kappa_k) + \mathbb{E}_q[\kappa_k]\, x_i^{\top} \mathbb{E}_q[\mu_k]}$
     Assume $\mathbb{E}_q \log \pi_k$, $\mathbb{E}_q \log C_D(\kappa_k)$, $\mathbb{E}_q[\kappa_k]$ and $\mathbb{E}_q[\mu_k]$ are already known. We will explicitly compute them later.
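
     A short vectorized sketch of this update. Because the slide only gives $\lambda_{ik}^{*}$ up to proportionality, the rows are normalized with a log-sum-exp for numerical stability; the argument names are my own shorthands for the expectations listed above.

```python
import numpy as np
from scipy.special import logsumexp

def update_q_z(X, E_log_pi, E_log_CD, E_kappa, E_mu):
    """lam_ik ∝ exp(E_q log pi_k + E_q log C_D(kappa_k) + E_q[kappa_k] x_i^T E_q[mu_k]).
    X: (N, D); E_log_pi, E_log_CD, E_kappa: (K,); E_mu: (K, D). Returns lam: (N, K)."""
    logits = E_log_pi + E_log_CD + E_kappa * (X @ E_mu.T)    # (N, K), broadcast over rows
    logits -= logsumexp(logits, axis=1, keepdims=True)       # normalize each row to sum to 1
    return np.exp(logits)
```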

  8. Updating $q(\mu_k) \equiv \mathrm{vMF}(\cdot \mid \psi_k, \gamma_k)$
     $\log q^{*}(\mu_k) = \mathbb{E}_{q \setminus \mu_k} \log p(X, Z \mid \theta) + \text{const}
       = \mathbb{E}_{q \setminus \mu_k} \left[ \sum_{i=1}^{N} \sum_{j=1}^{K} z_{ij} \kappa_j x_i^{\top} \mu_j + \sum_{j=1}^{K} C_0 \mu_j^{\top} \mu_0 \right] + \text{const}
       = \mathbb{E}_q[\kappa_k] \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}]\, x_i^{\top} \mu_k + C_0 \mu_k^{\top} \mu_0 + \text{const}$
     $\Rightarrow q^{*}(\mu_k) \propto e^{\left( \mathbb{E}_q[\kappa_k] \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}]\, x_i + C_0 \mu_0 \right)^{\top} \mu_k} \sim \mathrm{vMF}$
     $\gamma_k^{*} = \left\| \mathbb{E}_q[\kappa_k] \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}]\, x_i + C_0 \mu_0 \right\|, \qquad
      \psi_k^{*} = \frac{\mathbb{E}_q[\kappa_k] \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}]\, x_i + C_0 \mu_0}{\gamma_k^{*}}$
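
     A vectorized sketch of this update for all K components at once; lam holds $\mathbb{E}_q[z_{ik}]$ and E_kappa holds $\mathbb{E}_q[\kappa_k]$, with variable names again my own.

```python
import numpy as np

def update_q_mu(X, lam, E_kappa, mu0, C0):
    """psi_k and gamma_k from s_k = E[kappa_k] * sum_i E[z_ik] x_i + C0 * mu0."""
    s = E_kappa[:, None] * (lam.T @ X) + C0 * mu0      # (K, D) natural parameters
    gam = np.linalg.norm(s, axis=1)                    # gamma_k = ||s_k||
    psi = s / gam[:, None]                             # psi_k  = s_k / gamma_k (unit norm)
    return psi, gam
```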

  9. Updating $q(\kappa_k) \equiv \mathrm{logNormal}(\cdot \mid a_k, b_k)$?
     $\log q^{*}(\kappa_k) = \mathbb{E}_{q \setminus \kappa_k} \log p(X, Z \mid \theta) + \text{const}
       = \mathbb{E}_{q \setminus \kappa_k} \left[ \sum_{i=1}^{N} \sum_{j=1}^{K} z_{ij} \left( \log C_D(\kappa_j) + \kappa_j x_i^{\top} \mu_j \right) + \sum_{j=1}^{K} \left( -\frac{(\log \kappa_j - m)^2}{2\sigma^2} - \log \kappa_j \right) \right] + \text{const}$
     $\quad = \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}] \left( \log C_D(\kappa_k) + \kappa_k x_i^{\top} \mathbb{E}_q[\mu_k] \right) - \frac{(\log \kappa_k - m)^2}{2\sigma^2} - \log \kappa_k + \text{const}$
     $\Rightarrow q^{*}(\kappa_k) \not\sim \mathrm{logNormal}$, due to the presence of $\log C_D(\kappa_k)$.

  10. Intermediate Quantities
     Some intermediate quantities are in closed form:
     ◮ $q(z_i) \equiv \mathrm{Multi}(z_i \mid \lambda_i) \Rightarrow \mathbb{E}_q[z_{ij}] = \lambda_{ij}$
     ◮ $q(\pi) \equiv \mathrm{Dirichlet}(\pi \mid \rho) \Rightarrow \mathbb{E}_q \log \pi_k = \Psi(\rho_k) - \Psi\!\left( \sum_j \rho_j \right)$
     ◮ $q(\mu_k) \equiv \mathrm{vMF}(\mu_k \mid \psi_k, \gamma_k) \Rightarrow \mathbb{E}_q[\mu_k] = \frac{I_{D/2}(\gamma_k)}{I_{D/2-1}(\gamma_k)} \psi_k$ [Rothenbuehler, 2005]¹
     Some are not: $\mathbb{E}_q[\kappa_k]$ and $\mathbb{E}_q \log C_D(\kappa_k)$.
     1. In the absence of a good parametric form of $q(\kappa_k)$: apply sampling.
     2. Even if $\kappa_k \sim \mathrm{logNormal}$ is assumed, $\mathbb{E}_q \log C_D(\kappa_k)$ is still hard to deal with: bound $\log C_D(\cdot)$ by some simple functions.
     ¹ Can be derived from the characteristic function of the vMF.
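
     A sketch of the closed-form expectations listed above, using the digamma function for $\mathbb{E}_q \log \pi_k$ and exponentially scaled Bessel functions for the ratio $I_{D/2}(\gamma_k) / I_{D/2-1}(\gamma_k)$ (the exponential scaling factors cancel). Function and variable names are mine.

```python
import numpy as np
from scipy.special import digamma, ive

def closed_form_expectations(rho, psi, gam, D):
    """E_q[log pi_k] and E_q[mu_k] from the current variational parameters."""
    E_log_pi = digamma(rho) - digamma(rho.sum())                 # Psi(rho_k) - Psi(sum_j rho_j)
    bessel_ratio = ive(D / 2.0, gam) / ive(D / 2.0 - 1.0, gam)   # I_{D/2}(gam)/I_{D/2-1}(gam)
    E_mu = bessel_ratio[:, None] * psi                           # (K, D)
    return E_log_pi, E_mu
```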

  11. Sampling
     In principle we could sample $\kappa_k$ from $p(\kappa_k \mid X, \theta)$. Unfortunately, that sampling procedure requires samples of $z_i$, $\mu_k$, $\pi$, ..., which are not maintained by variational inference.
     Recall that the optimal posterior for $\kappa_k$ satisfies²
     $\log q^{*}(\kappa_k) = \sum_{i=1}^{N} \mathbb{E}[z_{ik}] \left( \log C_D(\kappa_k) + \kappa_k x_i^{\top} \mathbb{E}_q[\mu_k] \right) - \frac{(\log \kappa_k - m)^2}{2\sigma^2} - \log \kappa_k + \text{const}$
     $\Rightarrow q^{*}(\kappa_k) \propto \exp\!\left\{ \sum_{i=1}^{N} \mathbb{E}[z_{ik}] \left( \log C_D(\kappa_k) + \kappa_k x_i^{\top} \mathbb{E}_q[\mu_k] \right) \right\} \times \mathrm{logNormal}(\kappa_k \mid m, \sigma^2)$
     We can sample from $q^{*}(\kappa_k)$!
     ² See the derivation on the slide "Updating $q(\kappa_k)$".
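
     Since $q^{*}(\kappa_k)$ is the logNormal prior reweighted by the exponentiated likelihood term, one simple way to approximate $\mathbb{E}_q[\kappa_k]$ and $\mathbb{E}_q \log C_D(\kappa_k)$ is self-normalized importance sampling with the prior as proposal. The sketch below uses the hypothetical shorthands $N_k = \sum_i \mathbb{E}[z_{ik}]$ and $r_k = \sum_i \mathbb{E}[z_{ik}]\, x_i^{\top} \mathbb{E}_q[\mu_k]$, and reuses log_C_D from the earlier sketch; this is an illustrative option, not necessarily the sampler used in Gopal and Yang (2014).

```python
import numpy as np
from scipy.special import logsumexp

def sample_q_kappa(Nk, rk, D, m, sigma2, n_samples=2000, rng=None):
    """Monte Carlo estimates of E_q[kappa_k] and E_q[log C_D(kappa_k)] under
    q*(kappa_k) ∝ exp{Nk * log C_D(kappa_k) + kappa_k * rk} * logNormal(kappa_k | m, sigma2)."""
    rng = np.random.default_rng(rng)
    kappa = rng.lognormal(mean=m, sigma=np.sqrt(sigma2), size=n_samples)  # proposals from the prior
    log_w = Nk * log_C_D(kappa, D) + kappa * rk                           # log importance weights
    w = np.exp(log_w - logsumexp(log_w))                                  # self-normalize stably
    E_kappa = np.sum(w * kappa)
    E_log_CD = np.sum(w * log_C_D(kappa, D))
    return E_kappa, E_log_CD
```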

  12. Bounding Outline
     ◮ Assume $q(\kappa_k) \equiv \mathrm{logNormal}(\cdot \mid a_k, b_k)$.
     ◮ Lower bound $\mathbb{E}_q \log C_D(\kappa_k)$ in the VLB by some simple terms.
     ◮ To optimize $q(\kappa_k)$, use gradient ascent w.r.t. $a_k$ and $b_k$ to raise the VLB.
     Empirically, sampling outperforms bounding.

  13. Empirical Bayes for Hyperparameters
     Raise $\mathrm{VLB}(q, \theta)$ by coordinate ascent:
     1. $q^{t+1} = \operatorname{argmax}_{q = \prod_{i=1}^{M} q_i} \mathrm{VLB}(q, \theta^t)$
     2. $\theta^{t+1} = \operatorname{argmax}_{\theta} \mathrm{VLB}(q^{t+1}, \theta) = \operatorname{argmax}_{\theta} \mathbb{E}_{q^{t+1}} \log p(X, Z \mid \theta)$
     For example, one can use gradient ascent to optimize $\alpha$:
     $\max_{\alpha > 0} \; -\log B(\alpha) + (\alpha - 1) \sum_{k=1}^{K} \mathbb{E}_{q^{t+1}}[\log \pi_k]$
     $m$, $\sigma^2$, $\mu_0$ and $C_0$ can be optimized in a similar manner.³
     ³ Unlike $\alpha$, their solutions can be written in closed form.
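
     A sketch of the gradient-ascent update for $\alpha$, assuming a symmetric Dirichlet prior so that $-\log B(\alpha) = \log\Gamma(K\alpha) - K\log\Gamma(\alpha)$ and the gradient is $K\Psi(K\alpha) - K\Psi(\alpha) + \sum_k \mathbb{E}_{q^{t+1}}[\log \pi_k]$. The step size, iteration count, and projection to keep $\alpha > 0$ are illustrative choices.

```python
from scipy.special import digamma

def update_alpha(E_log_pi, alpha0=1.0, lr=0.01, n_steps=200):
    """Gradient ascent on -log B(alpha) + (alpha - 1) * sum_k E_q[log pi_k].
    E_log_pi is a NumPy array of shape (K,) holding E_{q^{t+1}}[log pi_k]."""
    K = len(E_log_pi)
    alpha = alpha0
    for _ in range(n_steps):
        grad = K * digamma(K * alpha) - K * digamma(alpha) + E_log_pi.sum()
        alpha = max(alpha + lr * grad, 1e-6)   # projected step keeps alpha > 0
    return alpha
```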

  14. References
     Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, pages 1345-1382.
     Gopal, S. and Yang, Y. (2014). Von Mises-Fisher clustering models. In Proceedings of the 31st International Conference on Machine Learning, pages 154-162.
     Rothenbuehler, J. (2005). Dependence Structures beyond Copulas: A New Model of a Multivariate Regular Varying Distribution Based on a Finite von Mises-Fisher Mixture Model. PhD thesis, Cornell University.
