
slide-1
SLIDE 1

MLE part 2

18 / 70

slide-3
SLIDE 3

Gaussian Mixture Model

◮ Suppose data is drawn from k Gaussians, meaning Y = j ∼ Discrete(π1, . . . , πk), X = x | Y = j ∼ N(µj, Σj), and the parameters are θ = ((π1, µ1, Σ1), . . . , (πk, µk, Σk)). (Note: this is a generative model, and we have a way to sample.)

◮ The probability density (with parameters θ = ((πj, µj, Σj))_{j=1}^k) at a given x is

p_θ(x) = ∑_{j=1}^k p_θ(x | y = j) p_θ(y = j) = ∑_{j=1}^k p_{µj,Σj}(x | Y = j) πj,

and the likelihood problem is

L(θ) = ∑_{i=1}^n ln ∑_{j=1}^k ( πj / √((2π)^d |Σj|) ) exp( −(1/2) (xi − µj)^T Σj^{-1} (xi − µj) ).

The ln and the exp are no longer next to each other; we can't just take the derivative and set the answer to 0.

19 / 70
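As a concrete companion to the density and likelihood above, here is a minimal Python sketch (not from the slides; the component count and parameter values are made up for illustration) that samples from a GMM exactly as the generative story describes and evaluates L(θ) directly.

```python
# Hedged sketch (not from the slides): sampling from a GMM as the generative
# model describes, and evaluating the log-likelihood L(theta) directly.
# Parameter values below are made up for illustration.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# theta = ((pi_j, mu_j, Sigma_j))_{j=1}^k with k = 2, d = 2
pis = np.array([0.3, 0.7])
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

def sample_gmm(n):
    """Draw Y ~ Discrete(pi), then X | Y = j ~ N(mu_j, Sigma_j)."""
    ys = rng.choice(len(pis), size=n, p=pis)
    xs = np.array([rng.multivariate_normal(mus[j], Sigmas[j]) for j in ys])
    return xs, ys

def log_likelihood(xs):
    """L(theta) = sum_i ln sum_j pi_j N(x_i; mu_j, Sigma_j)."""
    dens = np.column_stack([
        pi * multivariate_normal(mean=mu, cov=S).pdf(xs)
        for pi, mu, S in zip(pis, mus, Sigmas)
    ])                      # shape (n, k): pi_j * p_{mu_j,Sigma_j}(x_i)
    return np.sum(np.log(dens.sum(axis=1)))

xs, _ = sample_gmm(200)
print(log_likelihood(xs))
```

Maximizing this log-likelihood directly runs into exactly the issue the slide points out: the ln sits outside the sum over components, so setting the gradient to 0 has no closed-form solution.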

slide-6
SLIDE 6

Lloyd’s method for k-means

Original k-means formulation:

φ((µ1, . . . , µk)) = ∑_{i=1}^n min_j ‖xi − µj‖².

To make an algorithm, we introduced assignment matrix A ∈ A_{n,k}:

φ((µ1, . . . , µk); A) = ∑_{i=1}^n ∑_{j=1}^k Aij ‖xi − µj‖².

Let's do the same thing with Gaussians!

20 / 70
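For reference, a minimal sketch of one Lloyd iteration written to mirror φ(·; A): choose A by nearest centers, then re-optimize each µj as the mean of its assigned points. This is an illustrative sketch under stated assumptions (numpy arrays X of shape (n, d) and mus of shape (k, d)), not the course's reference code.

```python
# Hedged sketch (not from the slides): one pass of Lloyd's method, written to
# mirror phi((mu_1..mu_k); A): assign each x_i to its nearest center (choose A),
# then set each mu_j to the mean of its assigned points (optimize the centers).
import numpy as np

def lloyd_step(X, mus):
    # dists[i, j] = ||x_i - mu_j||^2
    dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)                 # argmin_j, i.e. the rows of A
    new_mus = np.array([
        X[assign == j].mean(axis=0) if np.any(assign == j) else mus[j]
        for j in range(len(mus))
    ])
    phi = dists.min(axis=1).sum()                 # objective before the update
    return new_mus, assign, phi
```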

slide-8
SLIDE 8

Gaussian mixture likelihood with responsibility matrix R

Let's replace ∑_{i=1}^n ln ∑_{j=1}^k πj p_{µj,Σj}(xi) with

∑_{i=1}^n ∑_{j=1}^k Rij ln( πj p_{µj,Σj}(xi) ),

where R ∈ R_{n,k} := {R ∈ [0, 1]^{n×k} : R 1_k = 1_n} is a responsibility matrix.

Holding R fixed and optimizing θ gives

πj := (∑_{i=1}^n Rij) / (∑_{i=1}^n ∑_{l=1}^k Ril) = (∑_{i=1}^n Rij) / n;

µj := (∑_{i=1}^n Rij xi) / (∑_{i=1}^n Rij) = (∑_{i=1}^n Rij xi) / (n πj);

Σj := (∑_{i=1}^n Rij (xi − µj)(xi − µj)^T) / (n πj).

(Should use the new mean in Σj so that all derivatives are 0.)

21 / 70
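A minimal sketch of these closed-form updates with R held fixed (assuming numpy arrays X of shape (n, d) and R of shape (n, k) with rows summing to 1; not the course's code):

```python
# Hedged sketch (not from the slides): the closed-form updates for theta with R
# held fixed, exactly as in the formulas above.
import numpy as np

def m_step(X, R):
    n, d = X.shape
    Nj = R.sum(axis=0)                    # sum_i R_ij, one value per component
    pis = Nj / n                          # pi_j = sum_i R_ij / n
    mus = (R.T @ X) / Nj[:, None]         # mu_j = sum_i R_ij x_i / (n pi_j)
    Sigmas = []
    for j in range(R.shape[1]):
        diff = X - mus[j]                 # uses the *new* mean, as the slide notes
        Sigmas.append((R[:, j, None] * diff).T @ diff / Nj[j])
    return pis, mus, np.array(Sigmas)
```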

slide-10
SLIDE 10

Updating µj

Recall our new likelihood with responsibilities R:

∑_{i=1}^n ∑_{j=1}^k Rij ln( πj p_{µj,Σj}(xi) )

(In the literature, this quantity is "expected complete data likelihood".)

Taking derivative and setting to 0:

0 = ∑_{i=1}^n Rij ∇_{µj} [ ln exp( −(1/2)(xi − µj)^T Σj^{-1} (xi − µj) ) + terms w/o µj ] = ∑_{i=1}^n Rij Σj^{-1} (xi − µj).

Rearranging, µj = (∑_{i=1}^n Rij xi) / (n πj).

22 / 70

slide-13
SLIDE 13

Updating π

Recall our new likelihood with responsibilities R:

∑_{i=1}^n ∑_{j=1}^k Rij ln( πj p_{µj,Σj}(xi) )

Taking derivative and setting to 0:

0 = ∑_{i=1}^n Rij / πj ; oops?

Fix: we forgot the constraints on π!

23 / 70

slide-16
SLIDE 16

Updating π

Include constraint ∑_{j=1}^k πj = 1 with a Lagrangian:

∑_{i=1}^n ∑_{j=1}^k Rij ln( πj p_{µj,Σj}(xi) ) + λ ( 1 − ∑_{j=1}^k πj ).

Differentiating and setting this Lagrangian to 0, we get

λ = ∑_{i=1}^n Rij / πj, and ∑_j πj = 1.

Together, πj = ∑_{i=1}^n Rij / λ, and

1 = ∑_{j=1}^k πj = ∑_{j=1}^k ∑_{i=1}^n Rij / λ = n / λ,

so λ = n and πj = ∑_{i=1}^n Rij / n.

24 / 70

slide-19
SLIDE 19

Updating Σj

Starting again from likelihood with responsibilities R:

∑_{i=1}^n ∑_{j=1}^k Rij ln( πj p_{µj,Σj}(xi) ).

Taking derivative and setting to 0,

0 = ∑_{i=1}^n Rij ∇_{Σj} [ −(1/2)(xi − µj)^T Σj^{-1} (xi − µj) − (1/2) ln |Σj| + other stuff ].

By magic matrix derivative rules, Σj = ∑_{i=1}^n Rij (xi − µj)(xi − µj)^T / (n πj).

25 / 70

slide-21
SLIDE 21

Summary of θ optimization

Replace ∑_{i=1}^n ln ∑_{j=1}^k πj p_{µj,Σj}(xi) with

∑_{i=1}^n ∑_{j=1}^k Rij ln( πj p_{µj,Σj}(xi) ).

Hold R fixed and optimize θ:

πj := (∑_{i=1}^n Rij) / (∑_{i=1}^n ∑_{l=1}^k Ril) = (∑_{i=1}^n Rij) / n;
µj := (∑_{i=1}^n Rij xi) / (∑_{i=1}^n Rij) = (∑_{i=1}^n Rij xi) / (n πj);
Σj := (∑_{i=1}^n Rij (xi − µj)(xi − µj)^T) / (n πj).

How to optimize Rij?

◮ Likelihood lacks the min_j from the k-means cost.

◮ We'll now develop the E-M method, which picks R in a way that guarantees likelihood increases.

26 / 70

slide-22
SLIDE 22

E-M (Expectation-Maximization)

27 / 70

slide-25
SLIDE 25

Generalizing the assignment matrix to GMMs

We introduced an assignment matrix A ∈ {0, 1}^{n×k}:

◮ For each xi, define µ(xi) to be a closest center: ‖xi − µ(xi)‖ = min_j ‖xi − µj‖.

◮ For each i, set Aij = 1[µ(xi) = µj].

◮ Key property: by this choice,

φ(C; A) = ∑_{i=1}^n ∑_{j=1}^k Aij ‖xi − µj‖² = ∑_{i=1}^n min_j ‖xi − µj‖² = φ(C);

therefore we can decrease φ(C) = φ(C; A) first by optimizing C to get φ(C′; A) ≤ φ(C; A), then setting A′ as above to get φ(C′) = φ(C′; A′) ≤ φ(C′; A) ≤ φ(C; A) = φ(C). In other words: we minimize φ(C) via φ(C; A).

What fulfills the same role for L?

28 / 70

slide-27
SLIDE 27

Latent variable models.

Since 1 = ∑_{j=1}^k pθ(yi = j | xi) and pθ(yi = j | xi) = pθ(yi = j, xi) / pθ(xi), then

L(θ) = ∑_{i=1}^n ln pθ(xi) = ∑_{i=1}^n 1 · ln pθ(xi)
     = ∑_{i=1}^n ∑_{j=1}^k pθ(yi = j | xi) ln pθ(xi)
     = ∑_{i=1}^n ∑_{j=1}^k pθ(yi = j | xi) ln [ pθ(xi, yi = j) / pθ(yi = j | xi) ].

Therefore: define augmented likelihood

L(θ; R) := ∑_{i=1}^n ∑_{j=1}^k Rij ln [ pθ(xi, yi = j) / Rij ];

note that Rij := pθ(yi = j | xi) implies L(θ; R) = L(θ).

29 / 70
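The identity Rij := pθ(yi = j | xi) ⇒ L(θ; R) = L(θ) can also be checked numerically; here is a small sketch (made-up data and parameters, purely illustrative):

```python
# Hedged sketch (not from the slides): numerically checking that the augmented
# likelihood matches, i.e. L(theta; R) = L(theta) when R_ij = p_theta(y_i=j|x_i).
# The data and parameters below are made up for illustration.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
xs = rng.normal(size=(50, 2))                           # any data will do
pis = np.array([0.4, 0.6])
mus = [np.zeros(2), np.array([3.0, 0.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

# J[i, j] = p_theta(x_i, y_i = j) = pi_j * N(x_i; mu_j, Sigma_j)
J = np.column_stack([pi * multivariate_normal(mean=mu, cov=S).pdf(xs)
                     for pi, mu, S in zip(pis, mus, Sigmas)])
R = J / J.sum(axis=1, keepdims=True)                    # posterior responsibilities
L_aug = np.sum(R * np.log(J / R))                       # L(theta; R)
L_plain = np.sum(np.log(J.sum(axis=1)))                 # L(theta)
print(np.isclose(L_aug, L_plain))                       # True
```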

slide-29
SLIDE 29

E-M method for latent variable models

Define augmented likelihood

L(θ; R) := ∑_{i=1}^n ∑_{j=1}^k Rij ln [ pθ(xi, yi = j) / Rij ],

with responsibility matrix R ∈ R_{n,k} := {R ∈ [0, 1]^{n×k} : R 1_k = 1_n}.

Alternate two steps:

◮ E-step: set (Rt)ij := p_{θ_{t−1}}(yi = j | xi).

◮ M-step: set θt = arg max_{θ∈Θ} L(θ; Rt).

Soon: we'll see this gives nondecreasing likelihood!

30 / 70

slide-30
SLIDE 30

E-M for Gaussian mixtures

Initialization: a standard choice is πj = 1/k, Σj = I, and (µj)_{j=1}^k given by k-means.

◮ E-step: Set Rij = pθ(yi = j | xi), meaning

Rij = pθ(yi = j | xi) = pθ(yi = j, xi) / pθ(xi) = πj p_{µj,Σj}(xi) / ∑_{l=1}^k πl p_{µl,Σl}(xi).

◮ M-step: solve arg max_{θ∈Θ} L(θ; R), meaning

πj := (∑_{i=1}^n Rij) / (∑_{i=1}^n ∑_{l=1}^k Ril) = (∑_{i=1}^n Rij) / n,
µj := (∑_{i=1}^n Rij xi) / (∑_{i=1}^n Rij) = (∑_{i=1}^n Rij xi) / (n πj),
Σj := (∑_{i=1}^n Rij (xi − µj)(xi − µj)^T) / (n πj).

(These are as before.)

31 / 70
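Putting the E-step and M-step together, a minimal E-M loop for a GMM might look like the following sketch (assumptions: numpy array X of shape (n, d); random data points in place of the k-means initialization mentioned above; no convergence check and no safeguards against the singularities discussed later):

```python
# Hedged sketch (not from the slides): an E-M loop for a GMM following the
# updates above. Initialization here uses random data points, not k-means.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pis = np.full(k, 1.0 / k)                        # standard initialization
    mus = X[rng.choice(n, size=k, replace=False)]
    Sigmas = np.array([np.eye(d) for _ in range(k)])
    for _ in range(iters):
        # E-step: R_ij = pi_j N(x_i; mu_j, Sigma_j) / sum_l pi_l N(x_i; mu_l, Sigma_l)
        J = np.column_stack([pis[j] * multivariate_normal(mus[j], Sigmas[j]).pdf(X)
                             for j in range(k)])
        R = J / J.sum(axis=1, keepdims=True)
        # M-step: closed-form updates from the previous slides
        Nj = R.sum(axis=0)
        pis = Nj / n
        mus = (R.T @ X) / Nj[:, None]
        Sigmas = np.array([((R[:, j, None] * (X - mus[j])).T @ (X - mus[j])) / Nj[j]
                           for j in range(k)])
    return pis, mus, Sigmas, R
```

In practice one would also track L(θt) across iterations, which the theorem below guarantees is nondecreasing.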

slide-31
SLIDE 31

Demo: spherical clusters

[demo figure omitted]

(Initialized with k-means, thus not so dramatic.)

32 / 70

slide-41
SLIDE 41

Demo: elliptical clusters

◮ E. . . M. . . E. . . M. . . (alternating E and M steps)

[demo figure omitted]

33 / 70

slide-70
SLIDE 70

Theorem. Suppose (R0, θ0) ∈ R_{n,k} × Θ is arbitrary, and thereafter (Rt, θt) are given by E-M:

(Rt)ij := p_{θ_{t−1}}(yi = j | xi) and θt := arg max_{θ∈Θ} L(θ; Rt).

Then

L(θt; Rt) ≤ max_{R∈R_{n,k}} L(θt; R) = L(θt; Rt+1) = L(θt) ≤ L(θt+1; Rt+1).

In particular, L(θt) ≤ L(θt+1).

Remarks.

◮ We proved a similar guarantee for k-means, which is also an alternating minimization scheme.

◮ Similarly, MLE for Gaussian mixtures is NP-hard; it is also known to need exponentially many samples in k to information-theoretically recover the parameters.

34 / 70

slide-72
SLIDE 72
Proof. We've already shown:

◮ L(θt; Rt+1) = L(θt);

◮ L(θt; Rt+1) ≤ max_{θ∈Θ} L(θ; Rt+1) = L(θt+1; Rt+1), by definition of θt+1.

We still need to show: L(θt; Rt+1) = max_{R∈R_{n,k}} L(θt; R). We'll give two proofs.

By concavity of ln ("Jensen's inequality" in convexity lectures), for any R ∈ R_{n,k},

L(θt; R) = ∑_{i=1}^n ∑_{j=1}^k Rij ln [ p_{θt}(xi, yi = j) / Rij ]
         ≤ ∑_{i=1}^n ln ( ∑_{j=1}^k Rij · p_{θt}(xi, yi = j) / Rij )
         = ∑_{i=1}^n ln p_{θt}(xi) = L(θt) = L(θt; Rt+1).

Since R was arbitrary, max_{R∈R_{n,k}} L(θt; R) = L(θt; Rt+1).

35 / 70

slide-73
SLIDE 73

Proof (continued). Here's a second proof of that missing fact. To evaluate arg max_{R∈R_{n,k}} L(θ; R), consider Lagrangian

∑_{i=1}^n [ ∑_{j=1}^k Rij ln pθ(xi, y = j) − ∑_{j=1}^k Rij ln Rij + λi ( ∑_{j=1}^k Rij − 1 ) ].

Fixing i and taking the gradient with respect to Rij for any j,

0 = ln pθ(xi, yi = j) − ln Rij − 1 + λi,

giving Rij = pθ(xi, y = j) exp(λi − 1). Since moreover

1 = ∑_j Rij = exp(λi − 1) ∑_j pθ(xi, y = j) = exp(λi − 1) pθ(xi),

it follows that exp(λi − 1) = 1/pθ(xi), and the optimal R satisfies Rij = pθ(xi, y = j) / pθ(xi) = pθ(y = j | xi).

36 / 70
slide-74
SLIDE 74

Related issues.

37 / 70

slide-77
SLIDE 77

Parameter constraints.

E-M for GMMs still works if we freeze or constrain some parameters. Examples:

◮ No weights: initialize π = (1/k, . . . , 1/k) and never update it.

◮ Diagonal covariance matrices: update everything as before, except Σj := diag((σj)²_1, . . . , (σj)²_d) where

(σj)²_l := (∑_{i=1}^n Rij (xi − µj)²_l) / (n πj);

that is: we use coordinate-wise sample variances weighted by R.

Why is this a good idea? Computation (of inverse), sample complexity, . . .

38 / 70
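A sketch of the diagonal-covariance M-step (same assumptions as the earlier M-step sketch: numpy arrays X of shape (n, d) and R of shape (n, k)); each covariance now needs only d numbers and its inverse is trivial:

```python
# Hedged sketch (not from the slides): the diagonal-covariance variant of the
# M-step, using coordinate-wise variances weighted by R as described above.
import numpy as np

def m_step_diag(X, R):
    n, d = X.shape
    Nj = R.sum(axis=0)
    pis = Nj / n
    mus = (R.T @ X) / Nj[:, None]
    # (sigma_j)^2_l = sum_i R_ij (x_i - mu_j)^2_l / (n pi_j), one scalar per coordinate
    vars_ = np.stack([(R[:, j, None] * (X - mus[j]) ** 2).sum(axis=0) / Nj[j]
                      for j in range(R.shape[1])])
    Sigmas = np.array([np.diag(v) for v in vars_])     # d x d diagonal matrices
    return pis, mus, Sigmas
```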

slide-78
SLIDE 78

Gaussian Mixture Model with diagonal covariances.

[demo figure omitted]

39 / 70

slide-110
SLIDE 110

Singularities

E-M with GMMs suffers from singularities: trivial situations where the likelihood goes to ∞ but the solution is bad.

◮ Suppose: d = 1, k = 2, πj = 1/2, n = 3 with x1 = −1 and x2 = +1 and x3 = +3. Initialize with µ1 = 0 and σ1 = 1, but µ2 = +3 = x3 and σ2 = 1/100. Then σ2 → 0 and L ↑ ∞.

40 / 70
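The 1-d example above can be checked numerically; a small sketch (illustrative only):

```python
# Hedged sketch (not from the slides): the slide's 1-d singularity example in
# numbers -- as sigma_2 shrinks toward 0 with mu_2 sitting exactly on x_3, the
# mixture log-likelihood grows without bound.
import numpy as np

xs = np.array([-1.0, 1.0, 3.0])
mu1, s1, mu2 = 0.0, 1.0, 3.0

def normal_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for s2 in [1e-2, 1e-4, 1e-8]:
    mix = 0.5 * normal_pdf(xs, mu1, s1) + 0.5 * normal_pdf(xs, mu2, s2)
    print(s2, np.log(mix).sum())     # log-likelihood increases as s2 -> 0
```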

slide-113
SLIDE 113

Interpolating between k-means and GMM E-M

Same M-step: fix π = (1/k, . . . , 1/k) and Σj = cI for a fixed c > 0.

Same E-step: define qij := (1/2)‖xi − µj‖²; the E-step chooses

Rij := pθ(yi = j | xi) = pθ(yi = j, xi) / pθ(xi) = pθ(yi = j, xi) / ∑_{l=1}^k pθ(yi = l, xi)
     = πj p_{µj,Σj}(xi) / ∑_{l=1}^k πl p_{µl,Σl}(xi) = exp(−qij/c) / ∑_{l=1}^k exp(−qil/c).

Fix i ∈ {1, . . . , n} and suppose the minimum qi := min_j qij is unique:

lim_{c↓0} Rij = lim_{c↓0} exp(−qij/c) / ∑_{l=1}^k exp(−qil/c) = lim_{c↓0} exp((qi − qij)/c) / ∑_{l=1}^k exp((qi − qil)/c)
             = 1 if qij = qi, and 0 if qij ≠ qi.

That is, R becomes hard assignment A as c ↓ 0.

41 / 70
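The E-step on this slide is a softmax over −qij with temperature c; here is a small sketch (made-up points and centers) showing the responsibilities hardening as c ↓ 0, using the same shift by qi = min_j qij as in the limit argument:

```python
# Hedged sketch (not from the slides): responsibilities as a softmax over
# -q_ij / c, computed stably by subtracting q_i = min_j q_ij. As c shrinks,
# each row of R approaches the hard k-means assignment.
import numpy as np

def responsibilities(X, mus, c):
    q = 0.5 * ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # q_ij
    shifted = -(q - q.min(axis=1, keepdims=True)) / c                # (q_i - q_ij)/c
    w = np.exp(shifted)
    return w / w.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [2.0, 0.1], [0.2, 1.9]])
mus = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
for c in [5.0, 0.5, 0.01]:
    print(c, np.round(responsibilities(X, mus, c), 3))   # rows approach one-hot
```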

slide-114
SLIDE 114

Interpolating between k-means and GMM E-M (part 2)

We can interpolate algorithmically, meaning we can create algorithms that have elements of both. Here’s something like k-means but with weights and covariances.

[demo figure omitted]

42 / 70

slide-138
SLIDE 138

Summary of MLE part 2

43 / 70