SLIDE 1

Parametric Models Part II: Expectation-Maximization and Mixture Density Estimation

Selim Aksoy

Department of Computer Engineering, Bilkent University
saksoy@cs.bilkent.edu.tr

CS 551, Spring 2019

SLIDE 2

Missing Features

◮ Suppose that we have a Bayesian classifier that uses the feature vector x, but only a subset x_g of x is observed and the values of the remaining features x_b are missing.

◮ How can we make a decision?

  ◮ Throw away the observations with missing values.

  ◮ Or, substitute x_b by their average x̄_b in the training data, and use x = (x_g, x̄_b).

  ◮ Or, marginalize the posterior over the missing features, and use the resulting posterior

    P(w_i | x_g) = ∫ P(w_i | x_g, x_b) p(x_g, x_b) dx_b / ∫ p(x_g, x_b) dx_b.
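The marginalization option can be made concrete with a short numerical sketch. The example below discretizes the missing feature and integrates the class-conditional joint densities over it; the two-class setup, priors, and parameter values are invented purely for illustration.

```python
import numpy as np

# Minimal numerical sketch of marginalizing the posterior over a missing feature.
# The two-class setup below (priors, means, covariances, observed value) is
# made up for illustration; it is not data from the lecture.

def gauss2d(points, mu, cov):
    """Bivariate Gaussian density evaluated at an (n, 2) array of points."""
    d = points - mu
    quad = np.einsum('ni,ij,nj->n', d, np.linalg.inv(cov), d)
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

priors = np.array([0.6, 0.4])
mus    = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs   = [np.eye(2), np.array([[1.0, 0.3], [0.3, 1.0]])]

x_g = 1.2                              # observed component of x
xb  = np.linspace(-6.0, 8.0, 2001)     # integration grid for the missing x_b
pts = np.column_stack([np.full_like(xb, x_g), xb])

# Numerator of P(w_i | x_g): P(w_i) * integral of p(x_g, x_b | w_i) over x_b
num = np.array([priors[i] * np.trapz(gauss2d(pts, mus[i], covs[i]), xb)
                for i in range(2)])
posterior = num / num.sum()            # denominator: integral of p(x_g, x_b)
print("P(w_i | x_g) =", posterior)
```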

SLIDE 3

Expectation-Maximization

◮ We can also extend maximum likelihood techniques to allow learning of parameters when some training patterns have missing features.

◮ The Expectation-Maximization (EM) algorithm is a general iterative method of finding the maximum likelihood estimates of the parameters of a distribution from training data.

SLIDE 4

Expectation-Maximization

◮ There are two main applications of the EM algorithm:

  ◮ Learning when the data is incomplete or has missing values.

  ◮ Optimizing a likelihood function that is analytically intractable but can be simplified by assuming the existence of, and values for, additional but missing (or hidden) parameters.

◮ The second problem is more common in pattern recognition applications.

SLIDE 5

Expectation-Maximization

◮ Assume that the observed data X is generated by some distribution.

◮ Assume that a complete dataset Z = (X, Y) exists as a combination of the observed but incomplete data X and the missing data Y.

◮ The observations in Z are assumed to be i.i.d. from the joint density

  p(z | Θ) = p(x, y | Θ) = p(y | x, Θ) p(x | Θ).

SLIDE 6

Expectation-Maximization

◮ We can define a new likelihood function

  L(Θ | Z) = L(Θ | X, Y) = p(X, Y | Θ),

  called the complete-data likelihood, where L(Θ | X) is referred to as the incomplete-data likelihood.

◮ The EM algorithm:

  ◮ First, finds the expected value of the complete-data log-likelihood using the current parameter estimates (expectation step).

  ◮ Then, maximizes this expectation (maximization step).

SLIDE 7

Expectation-Maximization

◮ Define

  Q(Θ, Θ^(i−1)) = E[ log p(X, Y | Θ) | X, Θ^(i−1) ]

  as the expected value of the complete-data log-likelihood w.r.t. the unknown data Y, given the observed data X and the current parameter estimates Θ^(i−1).

◮ The expected value can be computed as

  E[ log p(X, Y | Θ) | X, Θ^(i−1) ] = ∫ log p(X, y | Θ) p(y | X, Θ^(i−1)) dy.

◮ This is called the E-step.

SLIDE 8

Expectation-Maximization

◮ Then, the expectation can be maximized by finding optimum values for the new parameters Θ as

  Θ^(i) = arg max_Θ Q(Θ, Θ^(i−1)).

◮ This is called the M-step.

◮ These two steps are repeated iteratively, where each iteration is guaranteed to increase the log-likelihood.

◮ The EM algorithm is also guaranteed to converge to a local maximum of the likelihood function.

SLIDE 9

Mixture Densities

◮ A mixture model is a linear combination of m densities

  p(x | Θ) = Σ_{j=1}^m α_j p_j(x | θ_j)

  where Θ = (α_1, ..., α_m, θ_1, ..., θ_m) such that α_j ≥ 0 and Σ_{j=1}^m α_j = 1.

◮ α_1, ..., α_m are called the mixing parameters.

◮ p_j(x | θ_j), j = 1, ..., m, are called the component densities.

SLIDE 10

Mixture Densities

◮ Suppose that X = {x_1, ..., x_n} is a set of observations drawn i.i.d. from the distribution p(x | Θ).

◮ The log-likelihood function of Θ becomes

  log L(Θ | X) = log Π_{i=1}^n p(x_i | Θ) = Σ_{i=1}^n log ( Σ_{j=1}^m α_j p_j(x_i | θ_j) ).

◮ We cannot obtain an analytical solution for Θ by simply setting the derivatives of log L(Θ | X) to zero because of the logarithm of the sum.
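Even though the maximization has no closed form, the log-likelihood itself is easy to evaluate; the only practical subtlety is avoiding underflow in the sum of component densities. The sketch below uses the log-sum-exp trick with Gaussian components; the function names and test values are mine, not from the lecture.

```python
import numpy as np

# Sketch: evaluating log L(Theta | X) for a Gaussian mixture without underflow.
# Parameters and sample data are made-up illustration values.

def log_gauss(X, mu, Sigma):
    """Log of a multivariate Gaussian density at each row of X."""
    d = X.shape[1]
    diff = X - mu
    maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return -0.5 * (maha + np.linalg.slogdet(Sigma)[1] + d * np.log(2 * np.pi))

def mixture_log_likelihood(X, alphas, mus, Sigmas):
    # log p(x_i | Theta) = logsumexp_j( log alpha_j + log p_j(x_i | theta_j) )
    logp = np.stack([np.log(a) + log_gauss(X, m, S)
                     for a, m, S in zip(alphas, mus, Sigmas)], axis=1)  # (n, m)
    mx = logp.max(axis=1, keepdims=True)
    per_point = mx[:, 0] + np.log(np.exp(logp - mx).sum(axis=1))
    return per_point.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
print(mixture_log_likelihood(X, [0.5, 0.5],
                             [np.zeros(2), np.ones(2)],
                             [np.eye(2), np.eye(2)]))
```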

SLIDE 11

Mixture Density Estimation via EM

◮ Consider X as incomplete and define hidden variables Y = {y_i}_{i=1}^n where y_i indicates which mixture component generated the data vector x_i.

◮ In other words, y_i = j if the i'th data vector was generated by the j'th mixture component.

◮ Then, the log-likelihood becomes

  log L(Θ | X, Y) = log p(X, Y | Θ) = Σ_{i=1}^n log ( p(x_i | y_i, Θ) p(y_i | Θ) ) = Σ_{i=1}^n log ( α_{y_i} p_{y_i}(x_i | θ_{y_i}) ).

SLIDE 12

Mixture Density Estimation via EM

◮ Assume we have the initial parameter estimates Θ^(g) = (α_1^(g), ..., α_m^(g), θ_1^(g), ..., θ_m^(g)).

◮ Compute

  p(y_i | x_i, Θ^(g)) = α_{y_i}^(g) p_{y_i}(x_i | θ_{y_i}^(g)) / p(x_i | Θ^(g)) = α_{y_i}^(g) p_{y_i}(x_i | θ_{y_i}^(g)) / Σ_{j=1}^m α_j^(g) p_j(x_i | θ_j^(g))

  and

  p(Y | X, Θ^(g)) = Π_{i=1}^n p(y_i | x_i, Θ^(g)).
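These component posteriors (often called the responsibilities) are what the E-step computes in practice. A minimal sketch for Gaussian components follows; the helper names and test points are my own.

```python
import numpy as np

# Sketch of the posterior p(j | x_i, Theta^(g)) for Gaussian components.
# Helper names and test values are mine, not from the slides.

def gauss_pdf(X, mu, Sigma):
    d = X.shape[1]
    diff = X - mu
    maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * maha) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def responsibilities(X, alphas, mus, Sigmas):
    # numerator: alpha_j * p_j(x_i | theta_j); denominator: sum over all j
    num = np.stack([a * gauss_pdf(X, m, S)
                    for a, m, S in zip(alphas, mus, Sigmas)], axis=1)   # (n, m)
    return num / num.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [3.0, 3.0]])
r = responsibilities(X, [0.5, 0.5],
                     [np.zeros(2), 3 * np.ones(2)], [np.eye(2), np.eye(2)])
print(r)   # each row sums to 1
```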

SLIDE 13

Mixture Density Estimation via EM

◮ Then, Q(Θ, Θ^(g)) takes the form

  Q(Θ, Θ^(g)) = Σ_y log p(X, y | Θ) p(y | X, Θ^(g))

              = Σ_{j=1}^m Σ_{i=1}^n log ( α_j p_j(x_i | θ_j) ) p(j | x_i, Θ^(g))

              = Σ_{j=1}^m Σ_{i=1}^n log(α_j) p(j | x_i, Θ^(g)) + Σ_{j=1}^m Σ_{i=1}^n log ( p_j(x_i | θ_j) ) p(j | x_i, Θ^(g)).

SLIDE 14

Mixture Density Estimation via EM

◮ We can maximize the two sets of summations for α_j and θ_j independently because they are not related.

◮ The estimate for α_j can be computed as

  α̂_j = (1/n) Σ_{i=1}^n p(j | x_i, Θ^(g))

  where

  p(j | x_i, Θ^(g)) = α_j^(g) p_j(x_i | θ_j^(g)) / Σ_{t=1}^m α_t^(g) p_t(x_i | θ_t^(g)).

SLIDE 15

Mixture of Gaussians

◮ We can obtain analytical expressions for θ_j for the special case of a Gaussian mixture where θ_j = (µ_j, Σ_j) and

  p_j(x | θ_j) = p_j(x | µ_j, Σ_j) = (1 / ((2π)^{d/2} |Σ_j|^{1/2})) exp( −(1/2) (x − µ_j)^T Σ_j^{−1} (x − µ_j) ).

◮ Equating the partial derivative of Q(Θ, Θ^(g)) with respect to µ_j to zero gives

  µ̂_j = Σ_{i=1}^n p(j | x_i, Θ^(g)) x_i / Σ_{i=1}^n p(j | x_i, Θ^(g)).
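Both of these estimates, like the covariance estimates on the next two slides, are weighted averages under the responsibilities p(j | x_i, Θ^(g)). A small sketch of the α̂_j and µ̂_j updates, with my own variable names and made-up responsibilities:

```python
import numpy as np

# Sketch of the M-step updates for the mixing weights and the Gaussian means,
# given an (n, m) matrix R of responsibilities R[i, j] = p(j | x_i, Theta^(g)).
# Variable names and the fake responsibilities are my own.

def update_alphas_means(X, R):
    Nj = R.sum(axis=0)                 # effective number of points per component
    alphas = Nj / X.shape[0]           # alpha_j = (1/n) sum_i p(j | x_i, Theta^(g))
    mus = (R.T @ X) / Nj[:, None]      # mu_j = responsibility-weighted average of the x_i
    return alphas, mus

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
R = rng.dirichlet(np.ones(3), size=6)  # fake responsibilities, rows sum to 1
alphas, mus = update_alphas_means(X, R)
print(alphas, mus, sep="\n")
```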

SLIDE 16

Mixture of Gaussians

◮ We consider five models for the covariance matrix Σ_j:

  ◮ Σ_j = σ²I:

    σ̂² = (1 / (nd)) Σ_{j=1}^m Σ_{i=1}^n p(j | x_i, Θ^(g)) ‖x_i − µ̂_j‖²

  ◮ Σ_j = σ_j²I:

    σ̂_j² = Σ_{i=1}^n p(j | x_i, Θ^(g)) ‖x_i − µ̂_j‖² / ( d Σ_{i=1}^n p(j | x_i, Θ^(g)) )
SLIDE 17

Mixture of Gaussians

◮ Covariance models continued:

  ◮ Σ_j = diag({σ_jk²}_{k=1}^d):

    σ̂_jk² = Σ_{i=1}^n p(j | x_i, Θ^(g)) (x_ik − µ̂_jk)² / Σ_{i=1}^n p(j | x_i, Θ^(g))

  ◮ Σ_j = Σ:

    Σ̂ = (1/n) Σ_{j=1}^m Σ_{i=1}^n p(j | x_i, Θ^(g)) (x_i − µ̂_j)(x_i − µ̂_j)^T

  ◮ Σ_j = arbitrary:

    Σ̂_j = Σ_{i=1}^n p(j | x_i, Θ^(g)) (x_i − µ̂_j)(x_i − µ̂_j)^T / Σ_{i=1}^n p(j | x_i, Θ^(g))

SLIDE 18

Mixture of Gaussians

◮ Summary:

  ◮ Estimates for α_j, µ_j and Σ_j perform both expectation and maximization steps simultaneously.

  ◮ EM iterations proceed by using the current estimates as the initial estimates for the next iteration.

  ◮ The priors are computed from the proportion of examples belonging to each mixture component.

  ◮ The means are the component centroids.

  ◮ The covariance matrices are calculated as the sample covariance of the points associated with each component.
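As a concrete reference, here is a compact sketch of one way the full iteration can be implemented for the arbitrary-covariance model, with initialization from randomly chosen data points and a log-likelihood stopping rule as discussed on a later slide. Everything below (function names, the small ridge added to the covariances, the synthetic test data) is my own choice rather than the lecture's code; the other covariance models only change the Σ_j update.

```python
import numpy as np

# Sketch: EM for a Gaussian mixture with arbitrary (full) covariance matrices.

def log_gauss(X, mu, Sigma):
    d = X.shape[1]
    diff = X - mu
    maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return -0.5 * (maha + np.linalg.slogdet(Sigma)[1] + d * np.log(2 * np.pi))

def em_gmm(X, m, n_iter=200, tol=1e-6, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(n, m, replace=False)]            # random data points as means
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * m)
    alphas = np.full(m, 1.0 / m)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities p(j | x_i, Theta^(g)) via log-sum-exp
        logp = np.stack([np.log(alphas[j]) + log_gauss(X, mus[j], Sigmas[j])
                         for j in range(m)], axis=1)
        mx = logp.max(axis=1, keepdims=True)
        ll = (mx[:, 0] + np.log(np.exp(logp - mx).sum(axis=1))).sum()
        R = np.exp(logp - mx)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted updates for alpha_j, mu_j, Sigma_j
        Nj = R.sum(axis=0)
        alphas = Nj / n
        mus = (R.T @ X) / Nj[:, None]
        for j in range(m):
            diff = X - mus[j]
            Sigmas[j] = (R[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
        if ll - prev_ll < tol:          # stop when the log-likelihood stalls
            break
        prev_ll = ll
    return alphas, mus, Sigmas, ll

rng = np.random.default_rng(42)
X = np.vstack([rng.normal([0, 0], 0.5, (200, 2)), rng.normal([3, 3], 0.8, (200, 2))])
alphas, mus, Sigmas, ll = em_gmm(X, m=2)
print(alphas, mus, ll, sep="\n")
```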

SLIDE 19

Examples

◮ Mixture of Gaussians examples

◮ 1-D Bayesian classification examples

◮ 2-D Bayesian classification examples

SLIDE 20

[Panels (a)-(f): the fitted mixture at successive EM iterations; both axes span roughly −2 to 2 in each panel.]

Figure 1: Illustration of the EM algorithm iterations for a mixture of two Gaussians.

SLIDE 21

(a) Scatter plot. (b) Same spherical covariance, log-likelihood = -806.08. (c) Different spherical covariance, log-likelihood = -804.21. (d) Different diagonal covariance, log-likelihood = -630.46. (e) Same arbitrary covariance, log-likelihood = -810.93. (f) Different arbitrary covariance, log-likelihood = -523.11.

Figure 2: Fitting mixtures of 5 Gaussians to data from a circular distribution.

SLIDE 22

(a) True densities and sample histograms. (b) Linear Gaussian classifier with Pe = 0.0914. (c) Quadratic Gaussian classifier with Pe = 0.0837. (d) Mixture of Gaussian classifier with Pe = 0.0869.

Figure 3: 1-D Bayesian classification examples where the data for each class come from a mixture of three Gaussians. Bayes error is Pe = 0.0828.

SLIDE 23

(a) Scatter plot. (b) Linear Gaussian classifier with Pe = 0.094531. (c) Quadratic Gaussian classifier with Pe = 0.012829. (d) Mixture of Gaussian classifier with Pe = 0.002026.

Figure 4: 2-D Bayesian classification examples where the data for the classes come from a banana shaped distribution and a bivariate Gaussian.

SLIDE 24

(a) Scatter plot. (b) Quadratic Gaussian classifier with Pe = 0.1570. (c) Mixture of Gaussian classifier with Pe = 0.0100.

Figure 5: 2-D Bayesian classification examples where the data for each class come from a banana shaped distribution.

SLIDE 25

Mixture of Gaussians

◮ Questions:

  ◮ How can we find the initial estimates for Θ?

    ◮ Choose random data points, make them the initial means, assign all points to these means, and compute the priors and covariance matrices.

    ◮ Or, run a clustering algorithm for an initial grouping of all points, and compute the initial estimates from these groups (see the sketch after this list).

  ◮ How do we know when to stop the iterations?

    ◮ Stop if the change in log-likelihood between two iterations is less than a threshold.

    ◮ Or, use a threshold for the number of iterations.

  ◮ How can we find the number of components in the mixture?
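Before turning to the last question, here is a minimal sketch of the clustering-based initialization mentioned above, using a few k-means style passes in plain NumPy; the iteration count, the guard for tiny groups, and all names are my own choices.

```python
import numpy as np

# Sketch: initial grouping of the data with a few k-means style iterations,
# then priors, means, and covariances computed from the resulting groups.

def kmeans_init(X, m, n_iter=10, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(n, m, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center, then recompute the centers
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(m):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    alphas = np.array([(labels == j).mean() for j in range(m)])   # group proportions
    Sigmas = np.array([np.cov(X[labels == j].T) + 1e-6 * np.eye(d)
                       if (labels == j).sum() > d else np.eye(d)
                       for j in range(m)])
    return alphas, centers, Sigmas

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
a0, m0, S0 = kmeans_init(X, m=2)
print(a0, m0, sep="\n")
```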

SLIDE 26

Minimum Description Length Principle

◮ The Minimum Description Length (MDL) principle tries to find a compromise between the model complexity (while still having a good data approximation) and the complexity of the data approximation (while using a simple model).

◮ Under the MDL principle, the best model is the one that minimizes the sum of the model's complexity L(M) and the cost of describing the training data with respect to that model, L(D|M), i.e.,

  L(D, M) = L(M) + L(D|M).

SLIDE 27

Minimum Description Length Principle

◮ According to Shannon, the shortest code-length needed to encode data D with a distribution p(D|M) under model M is given by

  L(D|M) = − log L(M|D) = − log p(D|M)

  where L(M|D) is the likelihood function for model M given the sample D.

SLIDE 28

Minimum Description Length Principle

◮ The model complexity is measured as the number of bits required to describe the model parameters.

◮ According to Rissanen, the code-length needed to encode κ_M real-valued parameters characterizing n data points is

  L(M) = (κ_M / 2) log n

  where κ_M is the number of free parameters in model M and n is the size of the sample used to estimate those parameters.

SLIDE 29

Minimum Description Length Principle

◮ Once the description lengths for different models have been calculated, we select the one having the smallest such length.

◮ It can be shown theoretically that classifiers designed with the minimum description length principle are guaranteed to converge to the ideal or true model in the limit of more and more data.

SLIDE 30

Minimum Description Length Principle

◮ As an example, let's derive the description lengths for Gaussian mixture models with m components.

◮ The total numbers of free parameters for the different covariance matrix models are:

  Σ_j = σ²I:                     κ_M = (m − 1) + md + 1

  Σ_j = σ_j²I:                   κ_M = (m − 1) + md + m

  Σ_j = diag({σ_jk²}_{k=1}^d):   κ_M = (m − 1) + md + md

  Σ_j = Σ:                       κ_M = (m − 1) + md + d(d + 1)/2

  Σ_j = arbitrary:               κ_M = (m − 1) + md + m d(d + 1)/2

  where d is the dimension of the feature vectors.
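These counts translate directly into a small helper; the model labels below are my shorthand, not the lecture's notation.

```python
# Sketch: number of free parameters kappa_M for an m-component, d-dimensional
# Gaussian mixture under each covariance model (labels are my own shorthand).
def kappa(m, d, cov_model):
    weights, means = m - 1, m * d                     # mixing weights and means
    cov = {
        "shared_spherical": 1,                        # Sigma_j = sigma^2 I
        "spherical":        m,                        # Sigma_j = sigma_j^2 I
        "diagonal":         m * d,                    # Sigma_j = diag({sigma_jk^2})
        "shared_full":      d * (d + 1) // 2,         # Sigma_j = Sigma
        "full":             m * d * (d + 1) // 2,     # arbitrary Sigma_j
    }[cov_model]
    return weights + means + cov

print(kappa(3, 2, "full"))   # e.g. 3 components in 2-D: 2 + 6 + 9 = 17
```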

SLIDE 31

Minimum Description Length Principle

◮ The first term describes the mixture weights {α_j}_{j=1}^m, the second term describes the means {µ_j}_{j=1}^m, and the third term describes the covariance matrices {Σ_j}_{j=1}^m.

◮ Hence, the best m can be found as

  m* = arg min_m [ (κ_M / 2) log n − Σ_{i=1}^n log ( Σ_{j=1}^m α_j p_j(x_i | µ_j, Σ_j) ) ].
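In code, the selection rule is simply a loop over candidate values of m that fits a mixture, evaluates its log-likelihood, and adds the parameter-cost term. The sketch below keeps the fitting routine and parameter count as injectable callables (for example, the hypothetical em_gmm and kappa helpers sketched earlier); the candidate range is my own choice.

```python
import numpy as np

# Sketch: pick the number of components m by minimizing
#     L(D, M) = (kappa_M / 2) log n  -  log L(Theta_hat | X).
# fit_fn and kappa_fn stand for any EM fitting routine and parameter-count
# helper; they are assumptions here, not part of the lecture material.

def select_m_by_mdl(X, fit_fn, kappa_fn, candidates=range(1, 8)):
    n, d = X.shape
    lengths = {}
    for m in candidates:
        log_lik = fit_fn(X, m)                     # log L(Theta_hat | X) for m components
        lengths[m] = 0.5 * kappa_fn(m, d) * np.log(n) - log_lik
    return min(lengths, key=lengths.get), lengths

# Example wiring with the earlier sketches (assuming they are in scope):
#   best_m, dl = select_m_by_mdl(X, lambda X, m: em_gmm(X, m)[3],
#                                lambda m, d: kappa(m, d, "full"))
```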

SLIDE 32

Minimum Description Length Principle

[Figure panels: (a) the true mixture; (b) Σ_j = σ²I; (c) Σ_j = σ_j²I; (d) Σ_j = diag({σ_jk²}); (e) Σ_j = Σ; (f) Σ_j = arbitrary. Each panel pairs a description length vs. number of components curve with a plot of the data and the fitted covariances.]

Figure 6: Example fits for a sample from a mixture of three bivariate Gaussians. For each covariance model, description length vs. the number of components (left) and the fitted Gaussians as ellipses at one standard deviation (right) are shown. Using MDL with the arbitrary covariance matrix gave the smallest description length and also captured the true number of components.
