SLIDE 1

EM & Variational Bayes

Hanxiao Liu, September 9, 2014

SLIDE 2

Outline

1. EM Algorithm
   1.1 Introduction
   1.2 Example: Mixture of vMFs

2. Variational Bayes
   2.1 Introduction
   2.2 Example: Bayesian Mixture of Gaussians

SLIDE 3

MLE by Gradient Ascent

Goal: maximize L(θ; X) = log p(X|θ) w.r.t. θ

Gradient Ascent (GA)

◮ One-step view: θ^{t+1} ← θ^t + ∇L(θ^t; X)

◮ Two-step view:
  1. Q(θ; θ^t) = L(θ^t; X) + (θ − θ^t)^⊤ ∇L(θ^t; X) − (1/2) ‖θ − θ^t‖₂²
  2. θ^{t+1} ← argmax_θ Q(θ; θ^t)

Drawbacks

1. ∇L can be too complicated to work with
2. Too general to be efficient for structured problems
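To make the one-step view concrete, here is a minimal gradient-ascent sketch (not from the slides); `grad_log_lik` is a hypothetical user-supplied gradient of the log-likelihood, and `lr` is a step size the slide leaves implicit:

```python
import numpy as np

def mle_gradient_ascent(grad_log_lik, theta0, X, lr=0.1, n_iters=100):
    """One-step view: theta <- theta + lr * grad L(theta; X)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta + lr * grad_log_lik(theta, X)  # ascend the log-likelihood
    return theta
```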

SLIDE 4

MLE by EM

Expectation-Maximization (EM)

1. Expectation: Q(θ; θ^t) = E_{Z|X,θ^t} L(θ; X, Z)
2. Maximization: θ^{t+1} ← argmax_θ Q(θ; θ^t)

◮ Replace L(θ; X) (the log-likelihood) by L(θ; X, Z) (the complete log-likelihood)
◮ L(θ; X, Z) is a random function w.r.t. Z; use the expected function as a surrogate
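The two steps translate into a generic loop; a minimal sketch with hypothetical `e_step`/`m_step` callbacks standing in for a concrete model (the vMF mixture below instantiates both):

```python
def em(e_step, m_step, theta0, X, n_iters=100):
    """Generic EM: alternate Expectation and Maximization.

    e_step(theta, X) returns whatever statistics define Q(.; theta);
    m_step(stats, X) returns the maximizer of that surrogate.
    """
    theta = theta0
    for _ in range(n_iters):
        stats = e_step(theta, X)   # Expectation: build the surrogate Q
        theta = m_step(stats, X)   # Maximization: maximize Q
    return theta
```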

SLIDE 5

Why EM is superior

A comparison between the Q(θ; θ^t)'s, i.e., the local concave models:

1. EM:
   Q(θ; θ^t) = E_{Z|X,θ^t} L(θ; X, Z) = L(θ; X) − D_KL(p(Z|X, θ^t) ‖ p(Z|X, θ)) + C

2. GA:
   Q(θ; θ^t) = L(θ^t; X) + (θ − θ^t)^⊤ ∇L(θ^t; X) − (1/2) ‖θ − θ^t‖₂²
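The EM line follows from Bayes' rule in one step; writing p(X, Z|θ) = p(X|θ) p(Z|X, θ):

```latex
\begin{aligned}
Q(\theta;\theta^t)
&= \mathbb{E}_{Z|X,\theta^t}\bigl[\log p(X\mid\theta) + \log p(Z\mid X,\theta)\bigr] \\
&= \mathcal{L}(\theta;X)
   - D_{\mathrm{KL}}\bigl(p(Z\mid X,\theta^t)\,\big\|\,p(Z\mid X,\theta)\bigr)
   + \underbrace{\mathbb{E}_{Z|X,\theta^t}\log p(Z\mid X,\theta^t)}_{C,\ \text{free of }\theta}.
\end{aligned}
```

So EM's surrogate lower-bounds L and touches it at θ^t with no step-size or smoothness assumptions, whereas GA's quadratic model is only a local guess.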

SLIDE 6

Example: vMF mixture

Notations

◮ X = {x_i}_{i=1}^n, θ = (π ∈ Δ^{k−1}, {(µ_j, κ_j)}_{j=1}^k)
◮ Z = {z_ij ∈ {0, 1}}
◮ z_ij = 1 ⟹ x_i ∼ the j-th mixture component

Log-likelihood

L(θ; X) = Σ_{i=1}^n log p(x_i|θ) = Σ_{i=1}^n log Σ_{j=1}^k π_j vMF(x_i|µ_j, κ_j)    (log-sum coupling)

Complete log-likelihood

L(θ; X, Z) = Σ_{i=1}^n log p(x_i, z_i|θ) = Σ_{i=1}^n Σ_{j=1}^k z_ij log(π_j vMF(x_i|µ_j, κ_j))
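As a concrete helper for the steps that follow, a sketch of the vMF log-density (normalizer C_p(κ) as given on the κ-update M-step slide, Slide 11), using SciPy's exponentially scaled Bessel function for numerical stability:

```python
import numpy as np
from scipy.special import ive  # ive(v, k) = iv(v, k) * exp(-k)

def log_vmf(X, mu, kappa):
    """Row-wise log vMF(x_i | mu, kappa) for unit vectors X (n x p).

    log C_p(kappa) = (p/2 - 1) log kappa - (p/2) log 2*pi - log I_{p/2-1}(kappa),
    with log I_v(kappa) = log ive(v, kappa) + kappa for kappa > 0.
    """
    p = X.shape[1]
    v = p / 2.0 - 1.0
    log_cp = v * np.log(kappa) - (p / 2.0) * np.log(2.0 * np.pi) \
             - (np.log(ive(v, kappa)) + kappa)
    return log_cp + kappa * (X @ mu)
```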
SLIDE 7

E-step

Compute Q(θ; θ^t) ≜ E_{Z|X,θ^t} L(θ; X, Z):

Q(π, µ, κ; π^t, µ^t, κ^t) = E_{Z|X,π^t,µ^t,κ^t} Σ_{i=1}^n Σ_{j=1}^k z_ij log(π_j vMF(x_i|µ_j, κ_j))
                          = Σ_{i=1}^n Σ_{j=1}^k w^t_ij log vMF(x_i|µ_j, κ_j) + w^t_ij log π_j

where

w^t_ij = E_{z_ij|X,π^t,µ^t,κ^t}[z_ij] = p(z_ij = 1|x_i, π^t, µ^t, κ^t)
       = π^t_j · vMF(x_i|µ^t_j, κ^t_j) / Σ_{u=1}^k π^t_u · vMF(x_i|µ^t_u, κ^t_u)
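A sketch of this E-step, reusing `log_vmf` from above and working in log-space so that small densities do not underflow:

```python
import numpy as np

def e_step(X, log_pi, mus, kappas):
    """Responsibilities w[i, j] = p(z_ij = 1 | x_i, pi^t, mu^t, kappa^t)."""
    # n x k matrix of log pi_j + log vMF(x_i | mu_j, kappa_j)
    log_w = np.stack([log_pi[j] + log_vmf(X, mus[j], kappas[j])
                      for j in range(len(log_pi))], axis=1)
    log_w -= log_w.max(axis=1, keepdims=True)   # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)     # normalize over the k components
```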

SLIDE 8

M-step

Maximize

Q(π, µ, κ; π^t, µ^t, κ^t) = Σ_{i=1}^n Σ_{j=1}^k w^t_ij log vMF(x_i|µ_j, κ_j) + w^t_ij log π_j

w.r.t. π, µ and κ, s.t. ‖π‖₁ = 1 and ‖µ_j‖₂ = 1, ∀j ∈ [k].

To impose the constraints, maximize the Lagrangian

Q̃ ≜ Q + λ(1 − π^⊤1) + Σ_{j=1}^k ν_j(1 − µ_j^⊤µ_j)
SLIDE 9

M-step

Q̃(π, µ, κ; π^t, µ^t, κ^t) = Σ_{i=1}^n Σ_{j=1}^k [w^t_ij log vMF(x_i|µ_j, κ_j) + w^t_ij log π_j]
                           + λ(1 − π^⊤1) + Σ_{j=1}^k ν_j(1 − µ_j^⊤µ_j)

Updating π^t_j

Combining Σ_{j=1}^k π_j = Σ_{j=1}^k w^t_ij = 1 with

∂_{π_j} Q̃ = (Σ_{i=1}^n w^t_ij) / π_j − λ = 0  ⟹  π^{t+1}_j = (Σ_{i=1}^n w^t_ij) / n

SLIDE 10

M-step

Q̃(π, µ, κ; π^t, µ^t, κ^t) = Σ_{i=1}^n Σ_{j=1}^k [w^t_ij log vMF(x_i|µ_j, κ_j) + w^t_ij log π_j]
                           + λ(1 − π^⊤1) + Σ_{j=1}^k ν_j(1 − µ_j^⊤µ_j)

Updating µ^t_j

log vMF(x_i|µ_j, κ_j) = κ_j µ_j^⊤ x_i + C    (w.r.t. µ_j)

∂_{µ_j} Q̃ = κ_j Σ_{i=1}^n w^t_ij x_i − 2ν_j µ_j = 0

⟹  µ^{t+1}_j = r_j / ‖r_j‖₂, where r_j = Σ_{i=1}^n w^t_ij x_i
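Both closed-form updates (the π and µ slides above) in one sketch, with `w` the n × k responsibility matrix from the E-step:

```python
import numpy as np

def m_step_pi_mu(X, w):
    """pi_j = (1/n) sum_i w_ij;  mu_j = r_j / ||r_j||_2 with r_j = sum_i w_ij x_i."""
    pi = w.sum(axis=0) / X.shape[0]   # stays on the simplex by construction
    R = w.T @ X                       # k x p matrix whose rows are the r_j
    mus = R / np.linalg.norm(R, axis=1, keepdims=True)
    return pi, mus
```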

SLIDE 11

M-step

Updating κ^t_j

◮ the vMF normalizing constant: C_p(κ_j) = κ_j^{p/2−1} / ((2π)^{p/2} I_{p/2−1}(κ_j))
◮ the recurrence property of the modified Bessel function¹:

  ∂_{κ_j} log I_{p/2−1}(κ_j) = (p/2 − 1)/κ_j + I_{p/2}(κ_j) / I_{p/2−1}(κ_j)

∂_{κ_j} Q̃ = Σ_{i=1}^n w^t_ij (µ_j^⊤ x_i − I_{p/2}(κ_j) / I_{p/2−1}(κ_j)) = 0

⟹  I_{p/2}(κ_j) / I_{p/2−1}(κ_j) = r̄_j  ⟹  κ^{t+1}_j ≈ (r̄_j p − r̄_j³) / (1 − r̄_j²)    (Banerjee et al., 2005)

where r̄_j = Σ_{i=1}^n w^t_ij µ_j^⊤ x_i / Σ_{i=1}^n w^t_ij

¹ http://functions.wolfram.com/Bessel-TypeFunctions/BesselK/introductions/Bessels/05/
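A sketch of the approximate κ update under the same conventions as the previous snippets:

```python
import numpy as np

def m_step_kappa(X, w, mus, p):
    """kappa_j ~= (r_bar_j * p - r_bar_j**3) / (1 - r_bar_j**2), the approximate
    inversion of the Bessel-function ratio cited on this slide."""
    # r_bar_j = sum_i w_ij mu_j^T x_i / sum_i w_ij; (X @ mus.T)[i, j] = mu_j^T x_i
    r_bar = np.einsum('ij,ij->j', w, X @ mus.T) / w.sum(axis=0)
    return (r_bar * p - r_bar**3) / (1.0 - r_bar**2)
```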

SLIDE 12

An alternative view of EM

EM - original definition

1. Expectation: Q(θ; θ^t) = E_{Z|X,θ^t} L(θ; X, Z)    (why?)
2. Maximization: θ^{t+1} ← argmax_θ Q(θ; θ^t)

For any distribution q(Z),

L(θ; X) = E_q log p(X|θ) = E_q[log (p(X, Z|θ) / q(Z))] + E_q[log (q(Z) / p(Z|X, θ))]
        = VLB(q, θ) + D_KL(q(Z) ‖ p(Z|X, θ))

EM - coordinate ascent

1. q^{t+1} = argmax_q VLB(q, θ^t)
2. θ^{t+1} = argmax_θ VLB(q^{t+1}, θ)

Show the equivalence?
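To show the equivalence: L(θ; X) does not depend on q and D_KL ≥ 0, so the first step is solved exactly by the posterior, and the second step then maximizes the original Q up to a constant:

```latex
\begin{aligned}
q^{t+1} &= \operatorname*{argmax}_q \mathrm{VLB}(q,\theta^t)
         = p(Z\mid X,\theta^t)
         \qquad\text{(drives the KL term to } 0\text{)},\\
\mathrm{VLB}(q^{t+1},\theta)
        &= \mathbb{E}_{Z|X,\theta^t}\log p(X,Z\mid\theta)
         + H\bigl(p(Z\mid X,\theta^t)\bigr)
         = Q(\theta;\theta^t) + C.
\end{aligned}
```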

SLIDE 13

Bayes Inference

Notations

◮ θ: hyperparameters
◮ Z: hidden variables + random parameters

Goals

1. find a good posterior approximation q(Z) ≈ p(Z|X; θ)
2. estimate θ by Empirical Bayes, i.e., maximize L(θ; X) w.r.t. θ

L(θ; X) = E_q[log (p(X, Z|θ) / q(Z))] + E_q[log (q(Z) / p(Z|X, θ))] = VLB(q, θ) + D_KL(q(Z) ‖ p(Z|X, θ))

Both goals can be achieved via the same procedure as EM.

SLIDE 14

Variational Bayes Inference

One should have q → p(Z|X, θ*) by alternating between

1. q^{t+1} = argmax_q VLB(q, θ^t)
2. θ^{t+1} = argmax_θ VLB(q^{t+1}, θ)

However, we do not want q to be too complicated

◮ e.g., Q(θ; θ^t) = E_q L(θ; X, Z) can be intractable

Solution: modify the first step as q^{t+1} = argmax_{q∈Q} VLB(q, θ^t), where Q is some tractable family of distributions.

◮ Recall: without the constraint q ∈ Q, q^{t+1} ≡ p(Z|X, θ^t)

SLIDE 15

Variational Bayes Inference

Goal: solve argmax_{q∈Q} VLB(q, θ^t). Usually Q = {q | q(Z) = Π_{i=1}^M q_i(Z_i)}, abbreviating q_i(Z_i) as q_i (the mean-field family).

Coordinate ascent

VLB(q_j; q_{−j}, θ^t) = E_q[log (p(X, Z; θ^t) / q(Z))]
                      = E_q log p(X, Z; θ^t) − Σ_{i=1}^M E_{q_i} log q_i
                      = E_{q_j}[E_{q_{−j}} log p(X, Z; θ^t)] − E_{q_j} log q_j + C
                      = −D_KL(q_j ‖ q̃_j) + C,  where q̃_j ∝ exp(E_{q_{−j}} log p(X, Z; θ^t))

⟹  log q*_j = E_{q_{−j}} log p(X, Z; θ^t) + const

SLIDE 16

Example: Bayes Mixture of Gaussians

Consider putting a prior over the means in a Gaussian mixture²:

◮ For k = 1, 2, ..., K: µ_k ∼ N(0, τ²)
◮ For i = 1, 2, ..., N:
  1. z_i ∼ Mult(π)
  2. x_i ∼ N(µ_{z_i}, σ²)

p(z, µ|X) = p(X|z, µ) p(z) p(µ) / p(X)
          = Π_{i=1}^N p(z_i) p(x_i|z_i, µ) Π_{k=1}^K p(µ_k) / (Σ_z ∫ Π_{i=1}^N p(z_i) p(x_i|z_i, µ) Π_{k=1}^K p(µ_k) dµ)

The mean-field family:

q(z, µ) = Π_{i=1}^N q(z_i; φ_i) Π_{k=1}^K q(µ_k; µ̃_k, σ̃²_k)

² https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
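A sketch of this generative process (function and variable names are illustrative only):

```python
import numpy as np

def sample_bmog(N, K, pi, tau, sigma, seed=0):
    """mu_k ~ N(0, tau^2); z_i ~ Mult(pi); x_i ~ N(mu_{z_i}, sigma^2)."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, tau, size=K)   # component means, drawn once
    z = rng.choice(K, size=N, p=pi)     # cluster assignments
    x = rng.normal(mu[z], sigma)        # one observation per assignment
    return x, z, mu
```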

SLIDE 17

Example: Bayes Mixture of Gaussians

log q*(z_j) = E_{q\z_j} log p(z, µ, X)
            = E_{q\z_j}[Σ_{i=1}^N (log p(z_i) + log p(x_i|z_i, µ)) + Σ_{k=1}^K log p(µ_k)]
            = log p(z_j) + E_{q(µ_{z_j})} log p(x_j|z_j, µ_{z_j}) + C
            = log π_{z_j} + x_j E_{q(µ_{z_j})}[µ_{z_j}] − (1/2) E_{q(µ_{z_j})}[µ²_{z_j}] + C

where E_{q(µ_{z_j})}[µ_{z_j}] = µ̃_{z_j} and E_{q(µ_{z_j})}[µ²_{z_j}] = µ̃²_{z_j} + σ̃²_{z_j}.

By observation q*(z_j) ∼ Mult, so we can update φ_j accordingly.
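In code, the update just exponentiates and normalizes the line above; a sketch, taking σ² = 1 as the slide's expansion implicitly does:

```python
import numpy as np

def update_phi(x, log_pi, mu_tilde, var_tilde):
    """phi[i, k] ∝ pi_k exp(x_i E[mu_k] - E[mu_k^2]/2), E[mu_k^2] = mu~_k^2 + var~_k."""
    log_phi = log_pi + np.outer(x, mu_tilde) - 0.5 * (mu_tilde**2 + var_tilde)
    log_phi -= log_phi.max(axis=1, keepdims=True)  # drop the additive constant C
    phi = np.exp(log_phi)
    return phi / phi.sum(axis=1, keepdims=True)
```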

SLIDE 18

Example: Bayes Mixture of Gaussians

log q*(µ_j) = E_{q\µ_j} log p(z, µ, X)
            = E_{q\µ_j}[Σ_{i=1}^N (log p(z_i) + log p(x_i|z_i, µ_{z_i})) + Σ_{k=1}^K log p(µ_k)]
            = E_{q\µ_j}[Σ_{i=1}^N Σ_{k=1}^K δ_{z_i=k} log N(x_i|µ_k)] + log p(µ_j) + C
            = Σ_{i=1}^N E_{z_i}[δ_{z_i=j}] log N(x_i|µ_j) + log p(µ_j) + C,  where E_{z_i}[δ_{z_i=j}] = φ^j_i

Observing that q*(µ_j) ∼ N, µ̃_j and σ̃²_j can be updated accordingly.
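Completing the square in log q*(µ_j) (prior N(0, τ²), and again σ² = 1) gives a Gaussian with precision 1/τ² + Σ_i φ^j_i and mean σ̃²_j Σ_i φ^j_i x_i. A sketch of that update and the full coordinate-ascent loop, reusing `update_phi` from the previous slide:

```python
import numpy as np

def update_mu(x, phi, tau):
    """q*(mu_k) = N(mu~_k, var~_k) by completing the square."""
    var_tilde = 1.0 / (1.0 / tau**2 + phi.sum(axis=0))  # posterior variances
    mu_tilde = var_tilde * (phi.T @ x)                  # posterior means
    return mu_tilde, var_tilde

def cavi(x, K, pi, tau, n_iters=50, seed=0):
    """Alternate the two mean-field coordinate updates."""
    rng = np.random.default_rng(seed)
    mu_tilde = rng.normal(0.0, tau, size=K)  # random initialization
    var_tilde = np.ones(K)
    for _ in range(n_iters):
        phi = update_phi(x, np.log(pi), mu_tilde, var_tilde)
        mu_tilde, var_tilde = update_mu(x, phi, tau)
    return phi, mu_tilde, var_tilde
```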

SLIDE 19

Stay tuned

Next topics

◮ LDA (Wanli)
◮ Bayes vMF