

  1. Parametric Models Part II: Expectation-Maximization and Mixture Density Estimation
  Selim Aksoy
  Department of Computer Engineering, Bilkent University
  saksoy@cs.bilkent.edu.tr
  CS 551, Spring 2019

  2. Missing Features
  ◮ Suppose that we have a Bayesian classifier that uses the feature vector $\mathbf{x}$, but only a subset $\mathbf{x}_g$ of $\mathbf{x}$ is observed and the values of the remaining features $\mathbf{x}_b$ are missing.
  ◮ How can we make a decision?
    ◮ Throw away the observations with missing values.
    ◮ Or, substitute $\mathbf{x}_b$ by their average $\bar{\mathbf{x}}_b$ in the training data, and use $\mathbf{x} = (\mathbf{x}_g, \bar{\mathbf{x}}_b)$.
    ◮ Or, marginalize the posterior over the missing features, and use the resulting posterior
      $$P(w_i \mid \mathbf{x}_g) = \frac{\int P(w_i \mid \mathbf{x}_g, \mathbf{x}_b)\, p(\mathbf{x}_g, \mathbf{x}_b)\, d\mathbf{x}_b}{\int p(\mathbf{x}_g, \mathbf{x}_b)\, d\mathbf{x}_b}.$$
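A minimal sketch of the marginalization option, assuming Gaussian class-conditional densities (an assumption not made on this slide): integrating out the missing features of a Gaussian simply leaves the marginal over the observed dimensions, so $P(w_i \mid \mathbf{x}_g) \propto P(w_i)\, p(\mathbf{x}_g \mid w_i)$ can be computed directly. All parameter values below are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_given_observed(x_g, obs_idx, priors, means, covs):
    """P(w_i | x_g) when the features not in obs_idx are missing.

    For Gaussian class conditionals, integrating out the missing features
    leaves the marginal Gaussian over the observed dimensions.
    """
    scores = np.array([
        prior * multivariate_normal.pdf(x_g,
                                        mean=mu[obs_idx],
                                        cov=cov[np.ix_(obs_idx, obs_idx)])
        for prior, mu, cov in zip(priors, means, covs)])
    return scores / scores.sum()

# Hypothetical two-class, two-feature example where feature 1 is missing.
priors = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
print(posterior_given_observed(np.array([1.0]), [0], priors, means, covs))
```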

  3. Expectation-Maximization
  ◮ We can also extend maximum likelihood techniques to allow learning of parameters when some training patterns have missing features.
  ◮ The Expectation-Maximization (EM) algorithm is a general iterative method of finding the maximum likelihood estimates of the parameters of a distribution from training data.

  4. Expectation-Maximization
  ◮ There are two main applications of the EM algorithm:
    ◮ Learning when the data is incomplete or has missing values.
    ◮ Optimizing a likelihood function that is analytically intractable but can be simplified by assuming the existence of, and values for, additional but missing (or hidden) parameters.
  ◮ The second problem is more common in pattern recognition applications.

  5. Expectation-Maximization
  ◮ Assume that the observed data $\mathcal{X}$ is generated by some distribution.
  ◮ Assume that a complete dataset $\mathcal{Z} = (\mathcal{X}, \mathcal{Y})$ exists as a combination of the observed but incomplete data $\mathcal{X}$ and the missing data $\mathcal{Y}$.
  ◮ The observations in $\mathcal{Z}$ are assumed to be i.i.d. from the joint density
    $$p(\mathbf{z} \mid \Theta) = p(\mathbf{x}, \mathbf{y} \mid \Theta) = p(\mathbf{y} \mid \mathbf{x}, \Theta)\, p(\mathbf{x} \mid \Theta).$$

  6. Expectation-Maximization
  ◮ We can define a new likelihood function
    $$\mathcal{L}(\Theta \mid \mathcal{Z}) = \mathcal{L}(\Theta \mid \mathcal{X}, \mathcal{Y}) = p(\mathcal{X}, \mathcal{Y} \mid \Theta),$$
    called the complete-data likelihood, where $\mathcal{L}(\Theta \mid \mathcal{X})$ is referred to as the incomplete-data likelihood.
  ◮ The EM algorithm:
    ◮ First, finds the expected value of the complete-data log-likelihood using the current parameter estimates (expectation step).
    ◮ Then, maximizes this expectation (maximization step).

  7. Expectation-Maximization
  ◮ Define
    $$Q(\Theta, \Theta^{(i-1)}) = E\left[ \log p(\mathcal{X}, \mathcal{Y} \mid \Theta) \,\middle|\, \mathcal{X}, \Theta^{(i-1)} \right]$$
    as the expected value of the complete-data log-likelihood with respect to the unknown data $\mathcal{Y}$, given the observed data $\mathcal{X}$ and the current parameter estimates $\Theta^{(i-1)}$.
  ◮ The expected value can be computed as
    $$E\left[ \log p(\mathcal{X}, \mathcal{Y} \mid \Theta) \,\middle|\, \mathcal{X}, \Theta^{(i-1)} \right] = \int \log p(\mathcal{X}, \mathbf{y} \mid \Theta)\, p(\mathbf{y} \mid \mathcal{X}, \Theta^{(i-1)})\, d\mathbf{y}.$$
  ◮ This is called the E-step.

  8. Expectation-Maximization
  ◮ Then, the expectation can be maximized by finding optimum values for the new parameters $\Theta$ as
    $$\Theta^{(i)} = \arg\max_{\Theta}\, Q(\Theta, \Theta^{(i-1)}).$$
  ◮ This is called the M-step.
  ◮ These two steps are repeated iteratively, where each iteration is guaranteed to increase the log-likelihood.
  ◮ The EM algorithm is also guaranteed to converge to a local maximum of the likelihood function.
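As a rough illustration (not from the slides), the overall procedure is just a small driver loop that alternates the two steps and monitors the incomplete-data log-likelihood. The `e_step`, `m_step`, and `log_likelihood` callables here are hypothetical placeholders for the model-specific computations described on the following slides.

```python
import numpy as np

def run_em(X, theta_init, e_step, m_step, log_likelihood,
           max_iter=100, tol=1e-6):
    """Generic EM driver: alternate E- and M-steps until the
    incomplete-data log-likelihood stops improving."""
    theta = theta_init
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: expected values of the hidden quantities under theta.
        expectations = e_step(X, theta)
        # M-step: re-estimate parameters by maximizing Q(theta, theta_old).
        theta = m_step(X, expectations)
        # EM never decreases this quantity, so it is a natural convergence check.
        ll = log_likelihood(X, theta)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return theta
```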

  9. Mixture Densities
  ◮ A mixture model is a linear combination of $m$ densities
    $$p(\mathbf{x} \mid \Theta) = \sum_{j=1}^{m} \alpha_j\, p_j(\mathbf{x} \mid \theta_j)$$
    where $\Theta = (\alpha_1, \ldots, \alpha_m, \theta_1, \ldots, \theta_m)$ such that $\alpha_j \geq 0$ and $\sum_{j=1}^{m} \alpha_j = 1$.
  ◮ $\alpha_1, \ldots, \alpha_m$ are called the mixing parameters.
  ◮ $p_j(\mathbf{x} \mid \theta_j)$, $j = 1, \ldots, m$, are called the component densities.
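A concrete sketch of evaluating such a density, assuming Gaussian components (the special case the later slides focus on); the parameter values are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, alphas, means, covs):
    """Evaluate p(x | Theta) = sum_j alpha_j N(x; mu_j, Sigma_j)."""
    return sum(a * multivariate_normal.pdf(x, mean=m, cov=c)
               for a, m, c in zip(alphas, means, covs))

# Example: a two-component mixture in 2-D.
alphas = [0.3, 0.7]
means = [np.zeros(2), np.array([2.0, 2.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(mixture_density(np.array([1.0, 1.0]), alphas, means, covs))
```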

  10. Mixture Densities
  ◮ Suppose that $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ is a set of observations, i.i.d. with distribution $p(\mathbf{x} \mid \Theta)$.
  ◮ The log-likelihood function of $\Theta$ becomes
    $$\log \mathcal{L}(\Theta \mid \mathcal{X}) = \log \prod_{i=1}^{n} p(\mathbf{x}_i \mid \Theta) = \sum_{i=1}^{n} \log \left( \sum_{j=1}^{m} \alpha_j\, p_j(\mathbf{x}_i \mid \theta_j) \right).$$
  ◮ We cannot obtain an analytical solution for $\Theta$ by simply setting the derivatives of $\log \mathcal{L}(\Theta \mid \mathcal{X})$ to zero because of the logarithm of the sum.
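The log of a sum cannot be split into per-component terms, but it can at least be evaluated stably. A minimal sketch, again assuming Gaussian components and using SciPy's log-sum-exp:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def incomplete_log_likelihood(X, alphas, means, covs):
    """log L(Theta | X) = sum_i log sum_j alpha_j p_j(x_i | theta_j),
    computed in log space for numerical stability."""
    n, m = X.shape[0], len(alphas)
    log_terms = np.empty((n, m))
    for j in range(m):
        log_terms[:, j] = (np.log(alphas[j]) +
                           multivariate_normal.logpdf(X, mean=means[j], cov=covs[j]))
    return logsumexp(log_terms, axis=1).sum()
```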

  11. Mixture Density Estimation via EM
  ◮ Consider $\mathcal{X}$ as incomplete and define hidden variables $\mathcal{Y} = \{y_i\}_{i=1}^{n}$, where $y_i$ corresponds to which mixture component generated the data vector $\mathbf{x}_i$.
  ◮ In other words, $y_i = j$ if the $i$'th data vector was generated by the $j$'th mixture component.
  ◮ Then, the log-likelihood becomes
    $$\log \mathcal{L}(\Theta \mid \mathcal{X}, \mathcal{Y}) = \log p(\mathcal{X}, \mathcal{Y} \mid \Theta) = \sum_{i=1}^{n} \log\left( p(\mathbf{x}_i \mid y_i, \Theta)\, p(y_i \mid \Theta) \right) = \sum_{i=1}^{n} \log\left( \alpha_{y_i}\, p_{y_i}(\mathbf{x}_i \mid \theta_{y_i}) \right).$$
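If the assignments $y_i$ were actually observed, this complete-data log-likelihood would split into easy per-point terms; a small illustrative helper (hypothetical, Gaussian components assumed) makes the contrast with the log-of-a-sum above explicit.

```python
import numpy as np
from scipy.stats import multivariate_normal

def complete_log_likelihood(X, y, alphas, means, covs):
    """log L(Theta | X, Y) = sum_i log(alpha_{y_i} p_{y_i}(x_i | theta_{y_i})),
    assuming the component labels y_i are known."""
    ll = 0.0
    for x_i, j in zip(X, y):
        ll += np.log(alphas[j]) + multivariate_normal.logpdf(x_i, mean=means[j], cov=covs[j])
    return ll
```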

  12. Mixture Density Estimation via EM
  ◮ Assume we have the initial parameter estimates $\Theta^{(g)} = (\alpha_1^{(g)}, \ldots, \alpha_m^{(g)}, \theta_1^{(g)}, \ldots, \theta_m^{(g)})$.
  ◮ Compute
    $$p(y_i \mid \mathbf{x}_i, \Theta^{(g)}) = \frac{\alpha_{y_i}^{(g)}\, p_{y_i}(\mathbf{x}_i \mid \theta_{y_i}^{(g)})}{p(\mathbf{x}_i \mid \Theta^{(g)})} = \frac{\alpha_{y_i}^{(g)}\, p_{y_i}(\mathbf{x}_i \mid \theta_{y_i}^{(g)})}{\sum_{j=1}^{m} \alpha_j^{(g)}\, p_j(\mathbf{x}_i \mid \theta_j^{(g)})}$$
    and
    $$p(\mathcal{Y} \mid \mathcal{X}, \Theta^{(g)}) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \Theta^{(g)}).$$
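In code, this E-step amounts to a responsibility matrix with one row per observation and one column per component; a minimal sketch assuming Gaussian components:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, alphas, means, covs):
    """Return r[i, j] = p(y_i = j | x_i, Theta^(g)) for a Gaussian mixture."""
    n, m = X.shape[0], len(alphas)
    r = np.empty((n, m))
    for j in range(m):
        r[:, j] = alphas[j] * multivariate_normal.pdf(X, mean=means[j], cov=covs[j])
    r /= r.sum(axis=1, keepdims=True)  # normalize each row over the components
    return r
```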

  13. Mixture Density Estimation via EM
  ◮ Then, $Q(\Theta, \Theta^{(g)})$ takes the form
    $$\begin{aligned}
    Q(\Theta, \Theta^{(g)}) &= \sum_{\mathbf{y}} \log p(\mathcal{X}, \mathbf{y} \mid \Theta)\, p(\mathbf{y} \mid \mathcal{X}, \Theta^{(g)}) \\
    &= \sum_{j=1}^{m} \sum_{i=1}^{n} \log\left( \alpha_j\, p_j(\mathbf{x}_i \mid \theta_j) \right) p(j \mid \mathbf{x}_i, \Theta^{(g)}) \\
    &= \sum_{j=1}^{m} \sum_{i=1}^{n} \log(\alpha_j)\, p(j \mid \mathbf{x}_i, \Theta^{(g)}) + \sum_{j=1}^{m} \sum_{i=1}^{n} \log\left( p_j(\mathbf{x}_i \mid \theta_j) \right) p(j \mid \mathbf{x}_i, \Theta^{(g)}).
    \end{aligned}$$

  14. Mixture Density Estimation via EM
  ◮ We can maximize the two sets of summations for $\alpha_j$ and $\theta_j$ independently because they are not related.
  ◮ The estimate for $\alpha_j$ can be computed as
    $$\hat{\alpha}_j = \frac{1}{n} \sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})$$
    where
    $$p(j \mid \mathbf{x}_i, \Theta^{(g)}) = \frac{\alpha_j^{(g)}\, p_j(\mathbf{x}_i \mid \theta_j^{(g)})}{\sum_{t=1}^{m} \alpha_t^{(g)}\, p_t(\mathbf{x}_i \mid \theta_t^{(g)})}.$$
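The slides state this update directly; it follows from maximizing the first double sum in $Q(\Theta, \Theta^{(g)})$ subject to $\sum_j \alpha_j = 1$ with a Lagrange multiplier $\lambda$ (a standard derivation sketched here for completeness, not part of the original slides):

$$\frac{\partial}{\partial \alpha_j} \left[ \sum_{j'=1}^{m} \sum_{i=1}^{n} \log(\alpha_{j'})\, p(j' \mid \mathbf{x}_i, \Theta^{(g)}) + \lambda \left( \sum_{j'=1}^{m} \alpha_{j'} - 1 \right) \right] = \frac{1}{\alpha_j} \sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)}) + \lambda = 0.$$

Multiplying by $\alpha_j$, summing over $j$, and using $\sum_{j=1}^{m} p(j \mid \mathbf{x}_i, \Theta^{(g)}) = 1$ for each $i$ gives $\lambda = -n$, which yields the estimate $\hat{\alpha}_j$ above.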

  15. Mixture of Gaussians
  ◮ We can obtain analytical expressions for $\theta_j$ for the special case of a Gaussian mixture, where $\theta_j = (\mu_j, \Sigma_j)$ and
    $$p_j(\mathbf{x} \mid \theta_j) = p_j(\mathbf{x} \mid \mu_j, \Sigma_j) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \mu_j)^T \Sigma_j^{-1} (\mathbf{x} - \mu_j) \right).$$
  ◮ Equating the partial derivative of $Q(\Theta, \Theta^{(g)})$ with respect to $\mu_j$ to zero gives
    $$\hat{\mu}_j = \frac{\sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})\, \mathbf{x}_i}{\sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})}.$$
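Given the responsibility matrix from the E-step sketch above, the mixing-weight and mean updates are just weighted averages; a short illustrative helper (with `r` as returned by the hypothetical `e_step` above):

```python
import numpy as np

def m_step_weights_means(X, r):
    """M-step updates for the mixing parameters and component means.

    r[i, j] = p(j | x_i, Theta^(g)) from the E-step.
    """
    n_j = r.sum(axis=0)               # effective number of points per component
    alphas = n_j / X.shape[0]         # alpha_j = (1/n) sum_i r_ij
    means = (r.T @ X) / n_j[:, None]  # mu_j = sum_i r_ij x_i / sum_i r_ij
    return alphas, means
```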

  16. Mixture of Gaussians
  ◮ We consider five models for the covariance matrix $\Sigma_j$:
    ◮ $\Sigma_j = \sigma^2 I$:
      $$\hat{\sigma}^2 = \frac{1}{nd} \sum_{j=1}^{m} \sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})\, \| \mathbf{x}_i - \hat{\mu}_j \|^2$$
    ◮ $\Sigma_j = \sigma_j^2 I$:
      $$\hat{\sigma}_j^2 = \frac{\sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})\, \| \mathbf{x}_i - \hat{\mu}_j \|^2}{d \sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})}$$

  17. Mixture of Gaussians
  ◮ Covariance models continued:
    ◮ $\Sigma_j = \mathrm{diag}(\{\sigma_{jk}^2\}_{k=1}^{d})$:
      $$\hat{\sigma}_{jk}^2 = \frac{\sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})\, (x_{ik} - \hat{\mu}_{jk})^2}{\sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})}$$
    ◮ $\Sigma_j = \Sigma$:
      $$\hat{\Sigma} = \frac{1}{n} \sum_{j=1}^{m} \sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})\, (\mathbf{x}_i - \hat{\mu}_j)(\mathbf{x}_i - \hat{\mu}_j)^T$$
    ◮ $\Sigma_j$ = arbitrary:
      $$\hat{\Sigma}_j = \frac{\sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})\, (\mathbf{x}_i - \hat{\mu}_j)(\mathbf{x}_i - \hat{\mu}_j)^T}{\sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})}$$
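Continuing the sketch, the unconstrained model is the responsibility-weighted sample covariance around each mean; the other four models are constrained versions of the same quantity. This is an illustrative implementation, not taken from the slides, and the small diagonal regularizer is an added assumption to keep the estimate well conditioned.

```python
import numpy as np

def m_step_full_covariances(X, r, means, reg=1e-6):
    """M-step for Sigma_j = arbitrary: responsibility-weighted sample covariance."""
    n, d = X.shape
    m = r.shape[1]
    covs = np.empty((m, d, d))
    for j in range(m):
        diff = X - means[j]                          # (x_i - mu_j) for all i
        weighted = r[:, j, None] * diff              # r_ij (x_i - mu_j)
        covs[j] = weighted.T @ diff / r[:, j].sum()  # weighted outer products, normalized
        covs[j] += reg * np.eye(d)                   # regularizer (assumption, not on the slide)
    return covs
```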

  18. Mixture of Gaussians
  ◮ Summary:
    ◮ The estimates for $\alpha_j$, $\mu_j$, and $\Sigma_j$ perform both the expectation and maximization steps simultaneously.
    ◮ EM iterations proceed by using the current estimates as the initial estimates for the next iteration.
    ◮ The priors are computed from the proportion of examples belonging to each mixture component.
    ◮ The means are the component centroids.
    ◮ The covariance matrices are calculated as the sample covariance of the points associated with each component.
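Putting the pieces above together, a minimal end-to-end EM loop for a Gaussian mixture with arbitrary covariances might look like the following. This is only a sketch: the random initialization, the convergence tolerance, and the covariance regularizer are assumptions for illustration, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def fit_gmm(X, m, max_iter=200, tol=1e-6, seed=0):
    """EM for a Gaussian mixture with full covariances (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: uniform weights, random data points as means, shared data covariance.
    alphas = np.full(m, 1.0 / m)
    means = X[rng.choice(n, size=m, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(m)])
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities r[i, j] = p(j | x_i, Theta^(g)), computed in log space.
        log_r = np.column_stack([
            np.log(alphas[j]) + multivariate_normal.logpdf(X, mean=means[j], cov=covs[j])
            for j in range(m)])
        ll = logsumexp(log_r, axis=1).sum()  # incomplete-data log-likelihood
        r = np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))
        # M-step: mixing proportions, weighted centroids, and weighted covariances.
        n_j = r.sum(axis=0)
        alphas = n_j / n
        means = (r.T @ X) / n_j[:, None]
        for j in range(m):
            diff = X - means[j]
            covs[j] = (r[:, j, None] * diff).T @ diff / n_j[j] + 1e-6 * np.eye(d)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return alphas, means, covs
```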

  19. Examples
  ◮ Mixture of Gaussians examples
  ◮ 1-D Bayesian classification examples
  ◮ 2-D Bayesian classification examples

  20. Figure 1: Illustration of the EM algorithm iterations for a mixture of two Gaussians (panels (a)-(f)).
