SLIDE 1

Notes on Neal and Hinton’s Generalized Expectation Maximization (GEM) Algorithm

Mark Johnson, Brown University
February 2005, updated November 2008

SLIDE 2

Talk overview

  • What kinds of problems does expectation maximization solve?
  • An example of EM
  • Relaxation, and proving that EM converges
  • Sufficient statistics and EM
  • The generalized EM algorithm

SLIDE 3

Hidden Markov Models

[Figure: chain y0 → y1 → y2 → y3 → y4 (states, e.g., parts of speech), each yi emitting xi (observations, e.g., words)]

$$P(Y, X \mid \theta) = \prod_{i=1}^{n} P(Y_i \mid Y_{i-1}, \theta)\, P(X_i \mid Y_i, \theta)$$

$$P(y_i \mid y_{i-1}, \theta) = \theta_{y_i, y_{i-1}} \qquad P(x_i \mid y_i, \theta) = \theta_{x_i, y_i}$$
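To make the factorization concrete, here is a minimal Python sketch that evaluates P(y, x|θ) for a toy tag set. The parameter tables, the vocabulary, and the START symbol standing in for y0 are all invented for illustration.

```python
import math

# Hypothetical toy parameters: trans[(y, y_prev)] = P(y | y_prev) = theta_{y, y_prev},
# emit[(x, y)] = P(x | y) = theta_{x, y}.
trans = {("D", "START"): 0.7, ("N", "START"): 0.3,
         ("N", "D"): 0.9, ("D", "D"): 0.1,
         ("D", "N"): 0.4, ("N", "N"): 0.6}
emit = {("the", "D"): 0.8, ("a", "D"): 0.2,
        ("dog", "N"): 0.5, ("cat", "N"): 0.5}

def log_joint(y, x, trans, emit):
    """log P(y, x | theta): sum over positions of
    log P(y_i | y_{i-1}, theta) + log P(x_i | y_i, theta)."""
    logp, y_prev = 0.0, "START"          # "START" plays the role of y0
    for yi, xi in zip(y, x):
        logp += math.log(trans[(yi, y_prev)]) + math.log(emit[(xi, yi)])
        y_prev = yi
    return logp

print(math.exp(log_joint(("D", "N"), ("the", "dog"), trans, emit)))
# 0.7 * 0.8 * 0.9 * 0.5 = 0.252
```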

SLIDE 4

Maximum likelihood estimation

  • Given visible data (y, x), how can we estimate θ?
  • Maximum likelihood principle:

$$\hat\theta = \operatorname*{argmax}_\theta L_{(y,x)}(\theta), \quad \text{where}\quad L_{(y,x)}(\theta) = \log P_\theta(y, x) = \log P(y, x \mid \theta)$$

  • For a HMM, these are simple to calculate:

$$\hat\theta_{y_i, y_j} = \frac{n_{y_i, y_j}(y, x)}{\sum_{y'_i} n_{y'_i, y_j}(y, x)} \qquad\qquad \hat\theta_{x_i, y_i} = \frac{n_{x_i, y_i}(y, x)}{\sum_{x'_i} n_{x'_i, y_i}(y, x)}$$
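In code, these estimators are one pass of counting followed by normalization. A minimal sketch, assuming the same toy representation as the previous sketch (count keys ordered as (outcome, conditioning event)):

```python
from collections import Counter, defaultdict

def mle(tagged_data):
    """Visible-data MLE for a HMM: theta_{y_i, y_{i-1}} and theta_{x_i, y_i}
    are counts divided by the total count for the conditioning tag."""
    trans_n, emit_n = Counter(), Counter()
    for y, x in tagged_data:             # each item: (tag sequence, word sequence)
        y_prev = "START"
        for yi, xi in zip(y, x):
            trans_n[(yi, y_prev)] += 1   # n_{y_i, y_{i-1}}(y, x)
            emit_n[(xi, yi)] += 1        # n_{x_i, y_i}(y, x)
            y_prev = yi
    def normalize(counts):
        totals = defaultdict(float)      # total count per conditioning event
        for (_, cond), n in counts.items():
            totals[cond] += n
        return {k: n / totals[k[1]] for k, n in counts.items()}
    return normalize(trans_n), normalize(emit_n)

trans, emit = mle([(("D", "N"), ("the", "dog")),
                   (("D", "N"), ("a", "cat"))])
print(trans[("N", "D")])                 # 2/2 = 1.0
```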

SLIDE 5

ML estimation from hidden data

  • Our model defines P(Y, X), but our data only contains values for X, i.e., the variable Y is hidden
    – HMM example: D only contains words x but not their labels y
  • Maximum likelihood principle still applies:

$$\hat\theta = \operatorname*{argmax}_\theta L_x(\theta), \quad \text{where}\quad L_x(\theta) = \log P(x \mid \theta) = \log \sum_{y \in \mathcal{Y}} P(y, x \mid \theta)$$

  • But maximizing Lx(θ) may now be a non-trivial problem!
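To see why, here is a brute-force sketch of Lx(θ): the sum ranges over all |Y|^n label sequences, and because it sits inside the log it couples the parameters across positions, so the closed-form relative-frequency solution no longer applies. (Same hypothetical table layout as the earlier sketches; feasible only for tiny inputs.)

```python
import itertools, math

def log_marginal(x, tags, trans, emit):
    """L_x(theta) = log sum_{y in Y} P(y, x | theta), computed by
    enumerating every possible label sequence y (exponential in len(x))."""
    total = 0.0
    for y in itertools.product(tags, repeat=len(x)):
        p, y_prev = 1.0, "START"
        for yi, xi in zip(y, x):
            p *= trans.get((yi, y_prev), 0.0) * emit.get((xi, yi), 0.0)
            y_prev = yi
        total += p                       # accumulate P(y, x | theta)
    return math.log(total)

# With the toy tables from the earlier sketches:
# log_marginal(("the", "dog"), ("D", "N"), trans, emit)
```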

SLIDE 6

What does Expectation Maximization do?

  • Expectation Maximization (EM) is a maximum likelihood estimation procedure for problems with hidden variables
  • EM is good for problems where:
    – our model P(Y, X|θ) involves variables Y and X
    – our training data contains x but not y
    – maximizing P(x|θ) is hard
    – maximizing P(y, x|θ) is easy
  • In the HMM example: the training data consists of words x alone, and does not contain their labels y

SLIDE 7

The EM algorithm

  • The EM algorithm:
    – Guess an initial model θ(0)
    – For t = 1, 2, . . ., compute Q(t)(y) and θ(t), where:

$$\begin{aligned}
Q^{(t)}(y) &= P(y \mid x, \theta^{(t-1)}) && \text{(E-step)} \\
\theta^{(t)} &= \operatorname*{argmax}_\theta\; E_{Y \sim Q^{(t)}}[\log P(Y, x \mid \theta)] && \text{(M-step)} \\
&= \operatorname*{argmax}_\theta\; \sum_{y \in \mathcal{Y}} Q^{(t)}(y) \log P(y, x \mid \theta) \\
&= \operatorname*{argmax}_\theta\; \prod_{y \in \mathcal{Y}} P(y, x \mid \theta)^{Q^{(t)}(y)}
\end{aligned}$$

  • Q(t)(y) is probability of “pseudo-data” y using model θ(t−1)
  • θ(t) is the MLE based on pseudo-data (y, x), where each (y, x) is weighted according to Q(t)(y)
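As a self-contained illustration of this loop, here is EM for a model deliberately simpler than the HMM: a mixture of two biased coins, where the hidden y for each item says which coin produced it. The model, initialization, and data are all invented for illustration.

```python
def em_two_coins(data, iters=50):
    """EM for a two-coin mixture. Each item (h, t) counts the heads
    and tails produced by one unknown coin y in {0, 1}.
    theta = (mix, p): P(y = 0) and the two heads-probabilities."""
    mix, p = 0.6, [0.4, 0.7]                  # theta^(0): arbitrary guess
    for _ in range(iters):
        heads = [0.0, 0.0]                    # Q-weighted pseudo-counts
        tails = [0.0, 0.0]
        mass = [0.0, 0.0]
        for h, t in data:
            # E-step: Q(y = k) = P(y = k | item, theta^(t-1))
            lik = [mix * p[0]**h * (1 - p[0])**t,
                   (1 - mix) * p[1]**h * (1 - p[1])**t]
            z = lik[0] + lik[1]
            for k in (0, 1):
                q = lik[k] / z
                heads[k] += q * h
                tails[k] += q * t
                mass[k] += q
        # M-step: MLE from the Q-weighted pseudo-data
        mix = mass[0] / (mass[0] + mass[1])
        p = [heads[k] / (heads[k] + tails[k]) for k in (0, 1)]
    return mix, p

print(em_two_coins([(9, 1), (8, 2), (2, 8), (1, 9)]))
# converges to roughly mix = 0.5, p = [0.15, 0.85]
```

Each E-step turns the data into Q-weighted pseudo-data; each M-step is exactly the supervised relative-frequency estimator applied to those weights.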

SLIDE 8

HMM example

  • For a HMM, the EM formulae are:

$$Q^{(t)}(y) = P(y \mid x, \theta^{(t-1)}) = \frac{P(y, x \mid \theta^{(t-1)})}{\sum_{y' \in \mathcal{Y}} P(y', x \mid \theta^{(t-1)})}$$

$$\theta^{(t)}_{y_i, y_j} = \frac{\sum_{y \in \mathcal{Y}} Q^{(t)}(y)\, n_{y_i, y_j}(y, x)}{\sum_{y'_i} \sum_{y \in \mathcal{Y}} Q^{(t)}(y)\, n_{y'_i, y_j}(y, x)}$$

$$\theta^{(t)}_{x_i, y_i} = \frac{\sum_{y \in \mathcal{Y}} Q^{(t)}(y)\, n_{x_i, y_i}(y, x)}{\sum_{x'_i} \sum_{y \in \mathcal{Y}} Q^{(t)}(y)\, n_{x'_i, y_i}(y, x)}$$
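A sketch of one such iteration for a toy HMM, computing Q(t)(y) by brute-force enumeration of label sequences; a practical implementation would use the forward-backward algorithm (not covered in these slides) to obtain the same expected counts in polynomial time. Table layout as in the earlier sketches.

```python
import itertools
from collections import Counter, defaultdict

def em_step(x, tags, trans, emit):
    """One EM iteration for a HMM: weight every label sequence y by
    Q(y) = P(y, x | theta) / P(x | theta), then re-estimate theta from
    the Q-weighted counts (exponential-time; tiny inputs only)."""
    weights = {}
    for y in itertools.product(tags, repeat=len(x)):
        p, y_prev = 1.0, "START"
        for yi, xi in zip(y, x):
            p *= trans.get((yi, y_prev), 0.0) * emit.get((xi, yi), 0.0)
            y_prev = yi
        weights[y] = p                        # P(y, x | theta^(t-1))
    z = sum(weights.values())                 # P(x | theta^(t-1))
    trans_n, emit_n = Counter(), Counter()
    for y, w in weights.items():
        q, y_prev = w / z, "START"            # Q^(t)(y)
        for yi, xi in zip(y, x):
            trans_n[(yi, y_prev)] += q        # expected n_{y_i, y_{i-1}}
            emit_n[(xi, yi)] += q             # expected n_{x_i, y_i}
            y_prev = yi
    def normalize(counts):
        totals = defaultdict(float)
        for (_, cond), n in counts.items():
            totals[cond] += n
        return {k: n / totals[k[1]] for k, n in counts.items()}
    return normalize(trans_n), normalize(emit_n)
```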

SLIDE 9

EM converges — overview

  • Neal and Hinton define a function F(Q, θ) where:
    – Q(Y) is a probability distribution over the hidden variables
    – θ are the model parameters

$$\begin{aligned}
\operatorname*{argmax}_\theta \max_Q F(Q, \theta) &= \hat\theta, \text{ the MLE of } \theta \\
\max_Q F(Q, \theta) &= L_x(\theta), \text{ the log likelihood of } \theta \\
\operatorname*{argmax}_Q F(Q, \theta) &= P(Y \mid x, \theta) \text{ for all } \theta
\end{aligned}$$

  • The EM algorithm is an alternating maximization of F:

$$\begin{aligned}
Q^{(t)} &= \operatorname*{argmax}_Q F(Q, \theta^{(t-1)}) && \text{(E-step)} \\
\theta^{(t)} &= \operatorname*{argmax}_\theta F(Q^{(t)}, \theta) && \text{(M-step)}
\end{aligned}$$

SLIDE 10

The EM algorithm converges

$$\begin{aligned}
F(Q, \theta) &= E_{Y \sim Q}[\log P(Y, x \mid \theta)] + H(Q) \\
&= L_x(\theta) - \mathrm{KL}(Q(Y) \,\|\, P(Y \mid x, \theta))
\end{aligned}$$

where H(Q) is the entropy of Q, Lx(θ) = log P(x|θ) is the log likelihood of θ, and KL(Q||P) is the KL divergence between Q and P.

$$\begin{aligned}
Q^{(t)}(Y) &= P(Y \mid x, \theta^{(t-1)}) = \operatorname*{argmax}_Q F(Q, \theta^{(t-1)}) && \text{(E-step)} \\
\theta^{(t)} &= \operatorname*{argmax}_\theta\; E_{Y \sim Q^{(t)}}[\log P(Y, x \mid \theta)] = \operatorname*{argmax}_\theta F(Q^{(t)}, \theta) && \text{(M-step)}
\end{aligned}$$

  • The maximum value of F is achieved at θ = θ̂ and Q(Y) = P(Y|x, θ̂).
  • The sequence of F values produced by the EM algorithm is non-decreasing and bounded above by Lx(θ̂).
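Both identities are easy to verify numerically. A sketch with a two-valued hidden variable and invented probabilities; x is held fixed throughout, so P(Y, x|θ) is just a table of two numbers:

```python
import math

joint = {"y1": 0.12, "y2": 0.08}             # P(y, x | theta), x fixed
Z = sum(joint.values())                      # P(x | theta)
Lx = math.log(Z)                             # L_x(theta)
posterior = {y: p / Z for y, p in joint.items()}   # P(y | x, theta)

def F(Q):
    """F(Q, theta) = E_{Y~Q}[log P(Y, x | theta)] + H(Q)."""
    return sum(q * (math.log(joint[y]) - math.log(q))
               for y, q in Q.items() if q > 0)

def kl(Q, P):
    """KL(Q || P)."""
    return sum(q * math.log(q / P[y]) for y, q in Q.items() if q > 0)

Q = {"y1": 0.5, "y2": 0.5}                   # an arbitrary distribution
print(F(Q), Lx - kl(Q, posterior))           # equal: the two forms of F
print(F(posterior), Lx)                      # equal: E-step attains the bound
```

The second printed pair shows the E-step property: taking Q to be the posterior closes the KL gap and makes F equal the log likelihood.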

SLIDE 11

Generalized EM

  • Idea: anything that increases F gets you closer to θ̂
  • Idea: insert any additional operations you want into the EM algorithm so long as they don’t decrease F
    – Update θ after each data item has been processed
    – Visit some data items more often than others
    – Only update some components of θ on some iterations

SLIDE 12

Incremental EM for factored models

  • Data and model both factor: Y = (Y1, . . . , Yn), X = (X1, . . . , Xn)

$$P(Y, X \mid \theta) = \prod_{i=1}^{n} P(Y_i, X_i \mid \theta)$$

  • Incremental EM algorithm:
    – Initialize $\theta^{(0)}$ and $Q^{(0)}_i(Y_i)$ for i = 1, . . . , n
    – E-step: choose some data item i to be updated

$$Q^{(t)}_j = Q^{(t-1)}_j \;\text{ for all } j \neq i, \qquad Q^{(t)}_i(Y_i) = P(Y_i \mid x_i, \theta^{(t-1)})$$

    – M-step:

$$\theta^{(t)} = \operatorname*{argmax}_\theta\; E_{Y \sim Q^{(t)}}[\log P(Y, x \mid \theta)]$$
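A skeleton of this procedure; e_step_item and m_step are placeholder arguments, not part of the slides (for the two-coin mixture above, e_step_item would return one item's posterior over coins and m_step the Q-weighted MLE):

```python
def incremental_em(items, e_step_item, m_step, theta0, schedule):
    """Incremental EM: refresh Q_i for one data item at a time, leaving
    every other Q_j untouched, then re-run the M-step.
    e_step_item(x_i, theta) -> Q_i(Y_i)   (per-item posterior)
    m_step(items, Qs)       -> theta      (MLE from Q-weighted data)
    schedule: indices to visit, e.g. several round-robin passes."""
    theta = theta0
    Qs = [e_step_item(x, theta) for x in items]   # Q_i^(0)
    for i in schedule:
        Qs[i] = e_step_item(items[i], theta)      # E-step for item i only
        theta = m_step(items, Qs)                 # M-step over all items
    return theta
```

Because the stored Q_j of unvisited items are reused, each visit costs one item's E-step rather than a full pass over the data.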

SLIDE 13

EM using sufficient statistics

  • Model parameters θ estimated from sufficient statistics S:

$$(Y, X) \to S \to \theta$$

  • In HMM example, pseudo-counts are sufficient statistics
  • EM algorithm with sufficient statistics:

$$\begin{aligned}
\tilde{s}^{(t)} &= E_{Y \sim P(Y \mid x, \theta^{(t-1)})}[S] && \text{(E-step)} \\
\theta^{(t)} &= \text{maximum likelihood value for } \theta \text{ based on } \tilde{s}^{(t)} && \text{(M-step)}
\end{aligned}$$
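For the two-coin mixture used earlier, the sufficient statistics are per-coin expected heads, tails, and posterior mass. This sketch refactors that example so the (Y, X) → S → θ pipeline is explicit; the representation of s̃ is an assumption of the sketch:

```python
def expected_stats(data, theta):
    """E-step: s~ = E_{Y ~ P(Y | x, theta)}[S] for the two-coin mixture,
    where S collects per-coin (heads, tails, mass) pseudo-counts."""
    mix, p = theta
    s = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # [heads, tails, mass] per coin
    for h, t in data:
        lik = [mix * p[0]**h * (1 - p[0])**t,
               (1 - mix) * p[1]**h * (1 - p[1])**t]
        z = lik[0] + lik[1]
        for k in (0, 1):
            q = lik[k] / z                   # posterior weight for coin k
            s[k][0] += q * h
            s[k][1] += q * t
            s[k][2] += q
    return s

def mle_from_stats(s):
    """M-step: theta depends on the data only through s~."""
    mix = s[0][2] / (s[0][2] + s[1][2])
    p = [s[k][0] / (s[k][0] + s[k][1]) for k in (0, 1)]
    return mix, p

theta = (0.6, [0.4, 0.7])                    # theta^(0)
for _ in range(50):
    theta = mle_from_stats(expected_stats([(9, 1), (8, 2), (2, 8), (1, 9)], theta))
print(theta)
```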

SLIDE 14

Incremental EM using sufficient statistics

  • Incremental EM algorithm with sufficient statistics:

$$(Y_i, X_i) \to S_i \to S \to \theta, \qquad S = \sum_i S_i$$

    – Initialize $\theta^{(0)}$ and $\tilde{s}^{(0)}_i$ for i = 1, . . . , n
    – E-step: choose some data item i to be updated

$$\begin{aligned}
\tilde{s}^{(t)}_j &= \tilde{s}^{(t-1)}_j \;\text{ for all } j \neq i \\
\tilde{s}^{(t)}_i &= E_{Y_i \sim P(Y_i \mid x_i, \theta^{(t-1)})}[S_i] \\
\tilde{s}^{(t)} &= \tilde{s}^{(t-1)} + (\tilde{s}^{(t)}_i - \tilde{s}^{(t-1)}_i)
\end{aligned}$$

    – M-step: θ(t) = maximum likelihood value for θ based on s̃(t)
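A sketch of the bookkeeping, with per-item statistics stored as numpy arrays so the running total can be updated by one subtraction and one addition; the decomposition into item_stats and mle_from_stats helpers is this sketch's assumption, not the slides' notation:

```python
import numpy as np

def item_stats(item, theta):
    """E-step for one item: s~_i = E_{Y_i ~ P(Y_i | x_i, theta)}[S_i],
    a 2x3 array of per-coin (heads, tails, mass) for the coin mixture."""
    mix, p = theta
    h, t = item
    lik = np.array([mix * p[0]**h * (1 - p[0])**t,
                    (1 - mix) * p[1]**h * (1 - p[1])**t])
    q = lik / lik.sum()                       # posterior over the two coins
    return np.outer(q, [h, t, 1.0])

def mle_from_stats(s):
    """M-step: read theta off the total statistics s~ = sum_i s~_i."""
    mix = s[0, 2] / s[:, 2].sum()
    p = [s[k, 0] / (s[k, 0] + s[k, 1]) for k in (0, 1)]
    return mix, p

def incremental_em(items, theta, passes=20):
    s_i = [item_stats(x, theta) for x in items]     # per-item contributions
    s = sum(s_i)                                    # s~ = sum_i s~_i
    for _ in range(passes):
        for i, x in enumerate(items):               # round-robin schedule
            new = item_stats(x, theta)
            s = s + (new - s_i[i])                  # s~ <- s~ + (new - old)
            s_i[i] = new
            theta = mle_from_stats(s)               # M-step after every item
    return theta

print(incremental_em([(9, 1), (8, 2), (2, 8), (1, 9)], (0.6, [0.4, 0.7])))
```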

SLIDE 15

Conclusion

  • The Expectation-Maximization algorithm is a general technique for using supervised maximum likelihood estimators to solve unsupervised estimation problems
  • The E-step and the M-step can be viewed as steps of an alternating maximization procedure
    – The functional F is bounded above by the log likelihood
    – No E-step or M-step ever decreases F
    ⇒ Eventually the EM algorithm converges to a local optimum (not necessarily a global optimum)
  • We can insert any steps we like into the EM algorithm so long as they do not decrease F
    ⇒ Incremental versions of the EM algorithm
