Gaussian Mixture Models & EM (CE-717: Machine Learning, Sharif University of Technology) - slide transcript

  1. Gaussian Mixture Models & EM. CE-717: Machine Learning, Sharif University of Technology. M. Soleymani, Fall 2016.

  2. Mixture Models: definition
  - Mixture models: a linear superposition of mixture components
      $p(\boldsymbol{x} \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} P(c_k)\, p(\boldsymbol{x} \mid c_k; \boldsymbol{\theta}_k)$,  with $\sum_{k=1}^{K} P(c_k) = 1$
  - $P(c_k)$: the prior probability of the $k$-th mixture component
  - $\boldsymbol{\theta}_k$: the parameters of the $k$-th mixture component
  - $p(\boldsymbol{x} \mid c_k; \boldsymbol{\theta}_k)$: the probability of $\boldsymbol{x}$ under the $k$-th mixture component
  - A framework for finding more complex probability distributions
  - Goal: estimate $p(\boldsymbol{x} \mid \boldsymbol{\theta})$, e.g., for multi-modal density estimation

  3. Gaussian Mixture Models (GMMs)
  - Gaussian mixture model: each component is Gaussian, $p(\boldsymbol{x} \mid c_k; \boldsymbol{\theta}_k) = \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \Sigma_k)$, so
      $p(\boldsymbol{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \Sigma_k)$,  with $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$
  - Fitting the Gaussian mixture model:
    - Input: data points $\{\boldsymbol{x}^{(n)}\}_{n=1}^{N}$
    - Goal: find the parameters of the GMM ($\pi_k, \boldsymbol{\mu}_k, \Sigma_k$ for $k = 1, \dots, K$)

  4. GMM: 1-D Example
  - $\mu_1 = -2$, $\sigma_1 = 2$, $\pi_1 = 0.6$
  - $\mu_2 = 4$, $\sigma_2 = 1$, $\pi_2 = 0.3$
  - $\mu_3 = 8$, $\sigma_3 = 0.2$, $\pi_3 = 0.1$
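
As a small illustration (added here, not part of the slides), the following Python sketch evaluates this 1-D mixture density, $p(x) = 0.6\,\mathcal{N}(x \mid -2, 2^2) + 0.3\,\mathcal{N}(x \mid 4, 1^2) + 0.1\,\mathcal{N}(x \mid 8, 0.2^2)$, on a grid; scipy and matplotlib are assumed to be available.

```python
# Sketch: evaluate and plot the 1-D GMM density from this example.
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

pis    = [0.6, 0.3, 0.1]    # mixing coefficients pi_k
mus    = [-2.0, 4.0, 8.0]   # component means mu_k
sigmas = [2.0, 1.0, 0.2]    # component standard deviations sigma_k

xs = np.linspace(-10.0, 12.0, 1000)
density = sum(pi * norm.pdf(xs, loc=mu, scale=sigma)
              for pi, mu, sigma in zip(pis, mus, sigmas))

plt.plot(xs, density)
plt.xlabel("x")
plt.ylabel("p(x)")
plt.title("Three-component 1-D Gaussian mixture")
plt.show()
```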

  5. GMM: 2-D Example ($K = 3$ components)
  - $\boldsymbol{\mu}_1 = \begin{bmatrix} -2 \\ 3 \end{bmatrix}$, $\Sigma_1 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 4 \end{bmatrix}$, $\pi_1 = 0.6$
  - $\boldsymbol{\mu}_2 = \begin{bmatrix} 0 \\ -4 \end{bmatrix}$, $\Sigma_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\pi_2 = 0.25$
  - $\boldsymbol{\mu}_3 = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$, $\Sigma_3 = \begin{bmatrix} 3 & 1 \\ 1 & 1 \end{bmatrix}$, $\pi_3 = 0.15$

  6. GMM: 2-D Example (continued)
  - The resulting GMM distribution for the same $K = 3$ components (means, covariances, and mixing coefficients as on the previous slide).
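
A short Python sketch (an addition, not from the slides) that draws samples from this 2-D mixture by ancestral sampling: first pick a component index $z$ with probabilities $\boldsymbol{\pi}$, then draw $\boldsymbol{x}$ from the chosen Gaussian.

```python
# Sketch: ancestral sampling from the 2-D GMM of this example,
# z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z).
import numpy as np

rng = np.random.default_rng(0)

pis  = np.array([0.6, 0.25, 0.15])
mus  = [np.array([-2.0, 3.0]), np.array([0.0, -4.0]), np.array([3.0, 2.0])]
covs = [np.array([[1.0, 0.5], [0.5, 4.0]]),
        np.eye(2),
        np.array([[3.0, 1.0], [1.0, 1.0]])]

def sample_gmm(n_samples):
    zs = rng.choice(len(pis), size=n_samples, p=pis)                 # component labels
    xs = np.array([rng.multivariate_normal(mus[z], covs[z]) for z in zs])
    return xs, zs

X, Z = sample_gmm(500)
print(X.shape, np.bincount(Z) / len(Z))   # empirical mixing proportions
```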

  7. How to Fit a GMM?
  - We want to maximize the log-likelihood of the data $\boldsymbol{X} = \{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\}$:
      $\ln p(\boldsymbol{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\boldsymbol{x}^{(n)} \mid \boldsymbol{\mu}_k, \Sigma_k)$
  - The sum over components appears inside the log, so there is no closed-form maximum-likelihood solution.
  - Setting the derivatives to zero (with a Lagrange multiplier $\lambda$ for the constraint $\sum_k \pi_k = 1$) gives, for $k = 1, \dots, K$:
      $\dfrac{\partial \ln p(\boldsymbol{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \boldsymbol{\mu}_k} = \boldsymbol{0}$,  $\dfrac{\partial \ln p(\boldsymbol{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \Sigma_k} = \boldsymbol{0}$,  $\dfrac{\partial}{\partial \pi_k}\left[ \ln p(\boldsymbol{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) + \lambda \left( \sum_{j=1}^{K} \pi_j - 1 \right) \right] = 0$
  - A numerically stable way to evaluate this log-likelihood is sketched below.
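
Evaluating the objective above directly can underflow for points far from every component, so implementations usually work in log space with the log-sum-exp trick; a minimal sketch, assuming scipy is available:

```python
# Sketch: GMM log-likelihood sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k),
# computed with log-densities and logsumexp for numerical stability.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pis, mus, covs):
    # log_probs[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_probs = np.stack(
        [np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=cov)
         for pi, mu, cov in zip(pis, mus, covs)], axis=1)
    return logsumexp(log_probs, axis=1).sum()

# Example call with the parameters of the 2-D example and random query points.
X = np.random.default_rng(0).normal(size=(100, 2))
print(gmm_log_likelihood(
    X,
    [0.6, 0.25, 0.15],
    [np.array([-2.0, 3.0]), np.array([0.0, -4.0]), np.array([3.0, 2.0])],
    [np.array([[1.0, 0.5], [0.5, 4.0]]), np.eye(2), np.array([[3.0, 1.0], [1.0, 1.0]])]))
```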

  8. ML for GMM
  - Setting the derivatives to zero yields, with responsibilities $\gamma_k^{(n)} = \dfrac{\pi_k\, \mathcal{N}(\boldsymbol{x}^{(n)} \mid \boldsymbol{\mu}_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\boldsymbol{x}^{(n)} \mid \boldsymbol{\mu}_j, \Sigma_j)}$ and $N_k = \sum_{n=1}^{N} \gamma_k^{(n)}$:
      $\boldsymbol{\mu}_k = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma_k^{(n)}\, \boldsymbol{x}^{(n)}$
      $\Sigma_k = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma_k^{(n)}\, (\boldsymbol{x}^{(n)} - \boldsymbol{\mu}_k^{\text{new}})(\boldsymbol{x}^{(n)} - \boldsymbol{\mu}_k^{\text{new}})^T$
      $\pi_k^{\text{new}} = \dfrac{N_k}{N}$
  - These are not closed-form solutions: the responsibilities $\gamma_k^{(n)}$ on the right-hand side depend on the parameters themselves, which motivates the iterative EM procedure.
  - Matrix-calculus identities used in the derivation: $\dfrac{\partial \log |\boldsymbol{A}|}{\partial \boldsymbol{A}} = \boldsymbol{A}^{-T}$ and $\dfrac{\partial\, \boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x}}{\partial \boldsymbol{A}} = \boldsymbol{x}\boldsymbol{x}^T$
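
As a worked step (added here; the slide states only the result), the mean update follows from setting the gradient of the log-likelihood with respect to $\boldsymbol{\mu}_k$ to zero:

```latex
\frac{\partial}{\partial \boldsymbol{\mu}_k}
  \sum_{n=1}^{N} \ln \sum_{j=1}^{K} \pi_j\, \mathcal{N}(\boldsymbol{x}^{(n)} \mid \boldsymbol{\mu}_j, \Sigma_j)
  = \sum_{n=1}^{N}
    \underbrace{\frac{\pi_k\, \mathcal{N}(\boldsymbol{x}^{(n)} \mid \boldsymbol{\mu}_k, \Sigma_k)}
                     {\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\boldsymbol{x}^{(n)} \mid \boldsymbol{\mu}_j, \Sigma_j)}}_{\gamma_k^{(n)}}
    \, \Sigma_k^{-1} \left(\boldsymbol{x}^{(n)} - \boldsymbol{\mu}_k\right) = \boldsymbol{0}
\;\;\Longrightarrow\;\;
\boldsymbol{\mu}_k = \frac{\sum_{n=1}^{N} \gamma_k^{(n)}\, \boldsymbol{x}^{(n)}}{\sum_{n=1}^{N} \gamma_k^{(n)}}
```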

  9. EM algorithm
  - An iterative algorithm in which each iteration is guaranteed not to decrease the log-likelihood.
  - A general algorithm for finding ML estimates when the data is incomplete (has missing or unobserved parts).
  - EM finds maximum-likelihood parameters for models that involve unobserved variables $z$ in addition to unknown parameters $\boldsymbol{\theta}$ and observed data $X$.

  10. Mixture models: discrete latent variables
      $p(\boldsymbol{x}) = \sum_{k=1}^{K} P(z_k = 1)\, p(\boldsymbol{x} \mid z_k = 1) = \sum_{k=1}^{K} \pi_k\, p(\boldsymbol{x} \mid z_k = 1)$
  - $\boldsymbol{z}$: latent (hidden) variable that specifies the mixture component
  - $P(z_k = 1) = \pi_k$, with $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$

  11. EM for GMM
  - $\boldsymbol{\theta} = [\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}]$; the latent variable $z^{(n)} \in \{1, 2, \dots, K\}$ indicates the mixture component from which $\boldsymbol{x}^{(n)}$ was generated.
  - Initialize $\boldsymbol{\mu}_k, \Sigma_k, \pi_k$ for $k = 1, \dots, K$.
  - E step (for $n = 1, \dots, N$ and $k = 1, \dots, K$):
      $\gamma_k^{(n)} = P(z^{(n)} = k \mid \boldsymbol{x}^{(n)}, \boldsymbol{\theta}^{\text{old}}) = \dfrac{\pi_k^{\text{old}}\, \mathcal{N}(\boldsymbol{x}^{(n)} \mid \boldsymbol{\mu}_k^{\text{old}}, \Sigma_k^{\text{old}})}{\sum_{j=1}^{K} \pi_j^{\text{old}}\, \mathcal{N}(\boldsymbol{x}^{(n)} \mid \boldsymbol{\mu}_j^{\text{old}}, \Sigma_j^{\text{old}})}$
  - M step (for $k = 1, \dots, K$):
      $\boldsymbol{\mu}_k^{\text{new}} = \dfrac{\sum_{n=1}^{N} \gamma_k^{(n)}\, \boldsymbol{x}^{(n)}}{\sum_{n=1}^{N} \gamma_k^{(n)}}$
      $\Sigma_k^{\text{new}} = \dfrac{\sum_{n=1}^{N} \gamma_k^{(n)}\, (\boldsymbol{x}^{(n)} - \boldsymbol{\mu}_k^{\text{new}})(\boldsymbol{x}^{(n)} - \boldsymbol{\mu}_k^{\text{new}})^T}{\sum_{n=1}^{N} \gamma_k^{(n)}}$
      $\pi_k^{\text{new}} = \dfrac{1}{N} \sum_{n=1}^{N} \gamma_k^{(n)}$
  - Repeat the E and M steps until convergence (a Python sketch follows below).
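
A compact implementation sketch of the E/M updates above (an illustration, not the course's reference code; the initialization scheme, regularization constant, and convergence test are choices made here):

```python
# EM for a Gaussian mixture, following the E/M equations above.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape

    # Initialization: K means picked from the data, identity covariances, uniform weights.
    mus = X[rng.choice(N, size=K, replace=False)]
    covs = np.array([np.eye(d) for _ in range(K)])
    pis = np.full(K, 1.0 / K)

    prev_ll = -np.inf
    for _ in range(n_iters):
        # E step: responsibilities gamma[n, k] = P(z^(n) = k | x^(n), theta)
        dens = np.stack([pis[k] * multivariate_normal.pdf(X, mus[k], covs[k])
                         for k in range(K)], axis=1)              # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M step: re-estimate the parameters from the responsibilities.
        Nk = gamma.sum(axis=0)                                    # shape (K,)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            covs[k] += 1e-6 * np.eye(d)                           # keep covariances well conditioned
        pis = Nk / N

        # Monitor the incomplete-data log-likelihood for convergence.
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pis, mus, covs, gamma
```

Run on samples drawn with the 2-D sampling sketch above, `em_gmm(X, K=3)` should approximately recover the example's parameters, up to a permutation of the components.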

  12. EM & GMM: example [Bishop]

  13. EM & GMM: example (continued) [Bishop]

  14. Local Minima

  15. Local Minima: 2-D example
  - True parameters (as in the 2-D example above): $\boldsymbol{\mu}_1 = \begin{bmatrix} -2 \\ 3 \end{bmatrix}$, $\Sigma_1 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 4 \end{bmatrix}$, $\pi_1 = 0.6$; $\boldsymbol{\mu}_2 = \begin{bmatrix} 0 \\ -4 \end{bmatrix}$, $\Sigma_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\pi_2 = 0.25$; $\boldsymbol{\mu}_3 = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$, $\Sigma_3 = \begin{bmatrix} 3 & 1 \\ 1 & 1 \end{bmatrix}$, $\pi_3 = 0.15$
  - Two EM solutions (different local optima) obtained from different initializations:
    - Solution 1: $\boldsymbol{\mu}_1 = \begin{bmatrix} 0.36 \\ -4.09 \end{bmatrix}$, $\Sigma_1 = \begin{bmatrix} 0.89 & 0.26 \\ 0.26 & 0.83 \end{bmatrix}$, $\pi_1 = 0.249$; $\boldsymbol{\mu}_2 = \begin{bmatrix} 3.25 \\ 2.09 \end{bmatrix}$, $\Sigma_2 = \begin{bmatrix} 2.23 & 1.08 \\ 1.09 & 1.41 \end{bmatrix}$, $\pi_2 = 0.146$; $\boldsymbol{\mu}_3 = \begin{bmatrix} -2.11 \\ 3.36 \end{bmatrix}$, $\Sigma_3 = \begin{bmatrix} 1.12 & 0.61 \\ 0.61 & 3.61 \end{bmatrix}$, $\pi_3 = 0.604$
    - Solution 2: $\boldsymbol{\mu}_1 = \begin{bmatrix} 1.45 \\ -1.81 \end{bmatrix}$, $\Sigma_1 = \begin{bmatrix} 3.30 & 4.76 \\ 4.76 & 10.01 \end{bmatrix}$, $\pi_1 = 0.392$; $\boldsymbol{\mu}_2 = \begin{bmatrix} -2.20 \\ 3.16 \end{bmatrix}$, $\Sigma_2 = \begin{bmatrix} 1.30 & 1.10 \\ 1.10 & 2.80 \end{bmatrix}$, $\pi_2 = 0.429$; $\boldsymbol{\mu}_3 = \begin{bmatrix} -1.88 \\ 3.74 \end{bmatrix}$, $\Sigma_3 = \begin{bmatrix} 5.83 & -0.82 \\ -0.82 & 5.83 \end{bmatrix}$, $\pi_3 = 0.178$

  16. EM+GMM vs. k-means
  - k-means:
    - is not probabilistic
    - has fewer parameters (and is faster)
    - is limited by its underlying assumption of spherical clusters
    - can be extended to use covariances, giving a "hard EM" (ellipsoidal k-means)
  - Both EM and k-means depend on the initialization and can get stuck in local optima; EM+GMM has more local optima.
  - Useful trick: first run k-means and then use its result to initialize EM (see the sketch below).
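
A sketch of this initialization trick, assuming scikit-learn is available (the dataset here is a placeholder; in practice `X` would be the data being modeled):

```python
# Sketch: initialize a GMM fit from a k-means solution.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))   # placeholder data

# Option 1: let GaussianMixture run k-means internally for initialization.
gmm = GaussianMixture(n_components=3, init_params="kmeans", random_state=0).fit(X)

# Option 2: run k-means explicitly and pass its centers as the initial means.
centers = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).cluster_centers_
gmm2 = GaussianMixture(n_components=3, means_init=centers, random_state=0).fit(X)

print(gmm.means_)
print(gmm2.means_)
```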

  17. EM algorithm: general
  - A general algorithm for finding ML estimates when the data is incomplete (has missing or unobserved parts).

  18. Incomplete log-likelihood
  - Complete log-likelihood: maximizing the likelihood (i.e., $\log P(X, Z \mid \boldsymbol{\theta})$) for fully labeled data is straightforward.
  - Incomplete log-likelihood: with $z$ unobserved, the objective becomes the log of a marginal probability,
      $\log P(X \mid \boldsymbol{\theta}) = \log \sum_{z} P(X, z \mid \boldsymbol{\theta})$
  - This objective does not decouple, and we use the EM algorithm to optimize it (a worked comparison follows below).
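
To make the decoupling point concrete (an added illustration for a mixture model with one latent label $z^{(n)}$ per data point):

```latex
% Complete-data log-likelihood: the log reaches each factor directly,
% so every (x^(n), z^(n)) pair contributes an isolated term.
\log P(X, Z \mid \boldsymbol{\theta})
  = \sum_{n=1}^{N} \log \left[ \pi_{z^{(n)}}\; p\!\left(\boldsymbol{x}^{(n)} \mid z^{(n)}; \boldsymbol{\theta}_{z^{(n)}}\right) \right]

% Incomplete-data log-likelihood: the sum over the latent label sits inside
% the log, coupling all components' parameters in every term.
\log P(X \mid \boldsymbol{\theta})
  = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\; p\!\left(\boldsymbol{x}^{(n)} \mid z^{(n)} = k; \boldsymbol{\theta}_k\right)
```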

  19. EM Algorithm
  - Assumptions: $X$ (observed or known variables), $z$ (unobserved or latent variables); $X$ comes from a specified parametric model with unknown parameters $\boldsymbol{\theta}$.
  - If $z$ is related to $X$ in any way, we can hope to extract information about it from $X$ under that parametric model.
  - Steps:
    - Initialization: initialize the unknown parameters $\boldsymbol{\theta}$.
    - Iterate the following until convergence:
      - Expectation step: compute the probability of the unobserved variables given the current parameter estimates and the observed data.
      - Maximization step: from the observed data and the distribution over the unobserved data, find the most likely parameters (a better estimate of the parameters).
  - A generic skeleton of this loop is sketched below.
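
A generic skeleton of the loop just described (illustrative only; `e_step`, `m_step`, and `log_likelihood` are hypothetical callables that a concrete model such as the GMM above would supply):

```python
# Generic EM loop skeleton; the three callables are hypothetical placeholders.
def expectation_maximization(theta, e_step, m_step, log_likelihood,
                             max_iters=100, tol=1e-6):
    prev_ll = float("-inf")
    for _ in range(max_iters):
        q = e_step(theta)              # posterior over the latent variables given theta
        theta = m_step(q)              # parameters maximizing the expected complete log-likelihood
        ll = log_likelihood(theta)     # incomplete-data log-likelihood, for monitoring
        if ll - prev_ll < tol:         # EM never decreases this quantity
            break
        prev_ll = ll
    return theta
```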

  20. EM algorithm: intuition
  - When learning with hidden variables, we are trying to solve two problems at once:
    - hypothesizing values of the unobserved variables in each data sample
    - learning the parameters
  - Each of these tasks is fairly easy when we have the solution to the other:
    - given complete data, we have the sufficient statistics and can estimate the parameters with the MLE formulas;
    - conversely, computing the probability of the missing data given the parameters is a probabilistic inference problem.

  21. EM algorithm

  22. EM theoretical analysis
  - What is the underlying theory for using the expected complete log-likelihood, $\mathbb{E}_{P(z \mid X, \boldsymbol{\theta}^{\text{old}})}\!\left[ \log P(X, z \mid \boldsymbol{\theta}) \right]$, in the M-step?
  - In the following slides we show that maximizing this function also maximizes the (incomplete-data) likelihood.

  23. EM theoretical foundation: objective function

  24. Jensen's inequality
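
A standard statement of the inequality and of the lower bound it gives on the incomplete log-likelihood (reconstructed here in the notation of the following slides, not taken verbatim from the slide):

```latex
% Jensen's inequality for the (concave) logarithm: for any distribution q(z),
\log \mathbb{E}_{q}\!\left[ f(z) \right] \;\ge\; \mathbb{E}_{q}\!\left[ \log f(z) \right]

% Applied to the incomplete log-likelihood, it yields the EM lower bound F:
\ell(\boldsymbol{\theta}; X) = \log \sum_{z} P(X, z \mid \boldsymbol{\theta})
  = \log \sum_{z} q(z)\, \frac{P(X, z \mid \boldsymbol{\theta})}{q(z)}
  \;\ge\; \sum_{z} q(z) \log \frac{P(X, z \mid \boldsymbol{\theta})}{q(z)}
  \;=\; F(\boldsymbol{\theta}, q)
```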

  25. EM theoretical foundation: algorithm in general form
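
Written out with the notation of the surrounding slides (a reconstruction based on the standard presentation), the general form is coordinate ascent on the lower bound $F(\boldsymbol{\theta}, q)$:

```latex
% E-step: with the parameters fixed, pick the best q (it is the posterior).
q_t = \arg\max_{q} F(\boldsymbol{\theta}_t, q) = P(z \mid X, \boldsymbol{\theta}_t)

% M-step: with q fixed, pick the best parameters.
\boldsymbol{\theta}_{t+1} = \arg\max_{\boldsymbol{\theta}} F(\boldsymbol{\theta}, q_t)
  = \arg\max_{\boldsymbol{\theta}} \mathbb{E}_{q_t}\!\left[ \log P(X, z \mid \boldsymbol{\theta}) \right]
```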

  26. EM theoretical foundation: E-step
      $q_t = P(z \mid X, \boldsymbol{\theta}_t) \;\Longrightarrow\; q_t = \arg\max_{q} F(\boldsymbol{\theta}_t, q)$
  - Proof:
      $F\big(\boldsymbol{\theta}_t, P(z \mid X, \boldsymbol{\theta}_t)\big) = \sum_{z} P(z \mid X, \boldsymbol{\theta}_t) \log \dfrac{P(X, z \mid \boldsymbol{\theta}_t)}{P(z \mid X, \boldsymbol{\theta}_t)} = \sum_{z} P(z \mid X, \boldsymbol{\theta}_t) \log P(X \mid \boldsymbol{\theta}_t) = \log P(X \mid \boldsymbol{\theta}_t) = \ell(\boldsymbol{\theta}_t; X)$
  - Since $F(\boldsymbol{\theta}, q)$ is a lower bound on $\ell(\boldsymbol{\theta}; X)$, the bound is maximized (and made tight) by setting $q$ to $P(z \mid X, \boldsymbol{\theta}_t)$:
      $F\big(\boldsymbol{\theta}_t, P(z \mid X, \boldsymbol{\theta}_t)\big) = \ell(\boldsymbol{\theta}_t; X) \;\Rightarrow\; P(z \mid X, \boldsymbol{\theta}_t) = \arg\max_{q} F(\boldsymbol{\theta}_t, q)$

  27. EM algorithm: illustration
  [Figure: the log-likelihood $\ell(\boldsymbol{\theta}; X)$, its lower bound $F(\boldsymbol{\theta}, q_t)$, and the step from $\boldsymbol{\theta}_t$ to $\boldsymbol{\theta}_{t+1}$]

  28. EM theoretical foundation: M-step
  - The M-step can equivalently be viewed as maximizing the expected complete log-likelihood:
      $\boldsymbol{\theta}_{t+1} = \arg\max_{\boldsymbol{\theta}} F(\boldsymbol{\theta}, q_t) = \arg\max_{\boldsymbol{\theta}} \mathbb{E}_{q_t}\!\left[ \log P(X, z \mid \boldsymbol{\theta}) \right]$
  - Proof:
      $F(\boldsymbol{\theta}, q_t) = \sum_{z} q_t(z) \log \dfrac{P(X, z \mid \boldsymbol{\theta})}{q_t(z)} = \sum_{z} q_t(z) \log P(X, z \mid \boldsymbol{\theta}) - \sum_{z} q_t(z) \log q_t(z)$
      $\Rightarrow\; F(\boldsymbol{\theta}, q_t) = \mathbb{E}_{q_t}\!\left[ \log P(X, z \mid \boldsymbol{\theta}) \right] + H(q_t)$, where the entropy $H(q_t)$ is independent of $\boldsymbol{\theta}$.
