  1. Gaussian Mixture Models & EM
     CE-717: Machine Learning
     Sharif University of Technology
     M. Soleymani, Fall 2016

  2. Mixture Models: definition
      Mixture models: a linear superposition of mixtures (components)
       $p(\mathbf{x} \mid \theta) = \sum_{k=1}^{K} P(z_k)\, p(\mathbf{x} \mid z_k; \theta_k)$, with $\sum_{k=1}^{K} P(z_k) = 1$
       $P(z_k)$: the prior probability of the $k$-th mixture component
       $\theta_k$: the parameters of the $k$-th mixture component
       $p(\mathbf{x} \mid z_k; \theta_k)$: the probability of $\mathbf{x}$ according to the $k$-th mixture component
      A framework for modeling more complex probability distributions
      Goal: estimate $p(\mathbf{x} \mid \theta)$, e.g., for multi-modal density estimation

  3. Gaussian Mixture Models (GMMs)
      Gaussian Mixture Models: each component is Gaussian, $p(\mathbf{x} \mid z_k; \theta_k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
       $p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, with $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$ (a density-evaluation sketch follows below)
      Fitting the Gaussian mixture model
       Input: data points $\{\mathbf{x}^{(n)}\}_{n=1}^{N}$
       Goal: find the parameters of the GMM ($\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k$ for $k = 1, \dots, K$)
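Below is a minimal NumPy sketch (not from the slides) that evaluates a GMM density of this form; the helper names gaussian_pdf and gmm_pdf are my own.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma), evaluated for each row of X."""
    d = mu.shape[0]
    diff = X - mu                                          # shape (N, d)
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def gmm_pdf(X, pis, mus, Sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k), evaluated row-wise."""
    return sum(pi * gaussian_pdf(X, mu, Sigma)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))
```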

  4. GMM: 1-D Example
      [Figure: density of a three-component 1-D GMM with mixing weights $\pi_1 = 0.6$, $\pi_2 = 0.3$, $\pi_3 = 0.1$.]

  5. GMM: 2-D Example ($K = 3$)
       $\boldsymbol{\mu}_1 = \begin{bmatrix} -2 \\ 3 \end{bmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 4 \end{bmatrix}$, $\pi_1 = 0.6$
       $\boldsymbol{\mu}_2 = \begin{bmatrix} 0 \\ -4 \end{bmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\pi_2 = 0.25$
       $\boldsymbol{\mu}_3 = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$, $\boldsymbol{\Sigma}_3 = \begin{bmatrix} 3 & 1 \\ 1 & 1 \end{bmatrix}$, $\pi_3 = 0.15$

  6. GMM: 2-D Example
      The GMM density for the same three components (parameters as on the previous slide). (A sampling sketch follows below.)
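The samples and density surface shown on these two slides can be reproduced by ancestral sampling: draw a component index with probabilities $\pi_k$, then draw from that component's Gaussian. A sketch using the slide's parameters (the sample size and random seed are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

pis = np.array([0.6, 0.25, 0.15])
mus = [np.array([-2.0, 3.0]), np.array([0.0, -4.0]), np.array([3.0, 2.0])]
Sigmas = [np.array([[1.0, 0.5], [0.5, 4.0]]),
          np.eye(2),
          np.array([[3.0, 1.0], [1.0, 1.0]])]

N = 500
z = rng.choice(3, size=N, p=pis)                 # latent component index per sample
X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
```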

  7. How to Fit GMM?
      To fit the model, maximize the log-likelihood of the data $X = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\}$:
       $\ln p(X \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
      The sum over components appears inside the log, so there is no closed-form solution for the maximum-likelihood estimate. Setting the derivatives to zero (with a Lagrange multiplier $\lambda$ enforcing $\sum_k \pi_k = 1$) gives:
       $\dfrac{\partial \ln p(X \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \boldsymbol{\mu}_j} = \mathbf{0}$,  $\dfrac{\partial \ln p(X \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \boldsymbol{\Sigma}_j} = \mathbf{0}$,  for $j = 1, \dots, K$
       $\dfrac{\partial}{\partial \pi_j}\left[ \ln p(X \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right) \right] = 0$
      (A numerical sketch of this log-likelihood follows below.)
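A numerically stable way to evaluate this log-likelihood is to work in log space and take a log-sum-exp over components. A minimal sketch, assuming SciPy is available:

```python
import numpy as np
from scipy.special import logsumexp

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """ln p(X | pi, mu, Sigma) = sum_n ln sum_k pi_k N(x^(n) | mu_k, Sigma_k)."""
    N, d = X.shape
    K = len(pis)
    log_terms = np.empty((N, K))
    for k in range(K):
        diff = X - mus[k]
        quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigmas[k]), diff)
        _, logdet = np.linalg.slogdet(Sigmas[k])
        # log pi_k + log N(x^(n) | mu_k, Sigma_k) for every data point
        log_terms[:, k] = np.log(pis[k]) - 0.5 * (quad + logdet + d * np.log(2 * np.pi))
    return logsumexp(log_terms, axis=1).sum()
```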

  8. ML for GMM
      Setting the derivatives to zero yields the stationarity conditions:
       $\boldsymbol{\mu}_j = \dfrac{1}{N_j} \sum_{n=1}^{N} \dfrac{\pi_j\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}\, \mathbf{x}^{(n)}$
       $\boldsymbol{\Sigma}_j = \dfrac{1}{N_j} \sum_{n=1}^{N} \dfrac{\pi_j\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}\, (\mathbf{x}^{(n)} - \boldsymbol{\mu}_j^{\text{new}})(\mathbf{x}^{(n)} - \boldsymbol{\mu}_j^{\text{new}})^{T}$
       $\pi_j^{\text{new}} = \dfrac{N_j}{N}$, where $N_j = \sum_{n=1}^{N} \dfrac{\pi_j\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}$
      Useful matrix identities for the derivation: $\dfrac{\partial \log |\mathbf{B}^{-1}|}{\partial \mathbf{B}^{-1}} = \mathbf{B}^{T}$ and $\dfrac{\partial\, \mathbf{x}^{T} \mathbf{B} \mathbf{x}}{\partial \mathbf{B}} = \mathbf{x} \mathbf{x}^{T}$
      These are not closed-form solutions: each right-hand side depends on the parameters through the responsibilities, which motivates the iterative EM procedure that follows.

  9. EM algorithm
      An iterative algorithm in which each iteration is guaranteed not to decrease the log-likelihood function.
      A general algorithm for finding ML estimates when the data is incomplete (missing or unobserved data).
      EM finds the maximum-likelihood parameters in cases where the model involves unobserved variables $Z$ in addition to unknown parameters $\theta$ and observed data $X$.

  10. Mixture models: discrete latent variables
       $p(\mathbf{x}) = \sum_{k=1}^{K} P(z_k = 1)\, p(\mathbf{x} \mid z_k = 1) = \sum_{k=1}^{K} \pi_k\, p(\mathbf{x} \mid z_k = 1)$
       $\mathbf{z}$: latent or hidden variable that specifies the mixture component
       $P(z_k = 1) = \pi_k$, with $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$

  11. EM for GMM
      Notation: $\theta = [\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}]$; $z^{(n)} \in \{1, 2, \dots, K\}$ shows the mixture component from which $\mathbf{x}^{(n)}$ is generated.
      Initialize $\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k$ for $k = 1, \dots, K$.
      E step: for $n = 1, \dots, N$ and $k = 1, \dots, K$:
       $\gamma_k^{(n)} = P(z^{(n)} = k \mid \mathbf{x}^{(n)}, \theta^{\text{old}}) = \dfrac{\pi_k^{\text{old}}\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k^{\text{old}}, \boldsymbol{\Sigma}_k^{\text{old}})}{\sum_{j=1}^{K} \pi_j^{\text{old}}\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_j^{\text{old}}, \boldsymbol{\Sigma}_j^{\text{old}})}$
      M step: for $k = 1, \dots, K$:
       $\boldsymbol{\mu}_k^{\text{new}} = \dfrac{\sum_{n=1}^{N} \gamma_k^{(n)}\, \mathbf{x}^{(n)}}{\sum_{n=1}^{N} \gamma_k^{(n)}}$
       $\boldsymbol{\Sigma}_k^{\text{new}} = \dfrac{\sum_{n=1}^{N} \gamma_k^{(n)}\, (\mathbf{x}^{(n)} - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}^{(n)} - \boldsymbol{\mu}_k^{\text{new}})^{T}}{\sum_{n=1}^{N} \gamma_k^{(n)}}$
       $\pi_k^{\text{new}} = \dfrac{1}{N} \sum_{n=1}^{N} \gamma_k^{(n)}$
      Repeat the E and M steps until convergence. (A NumPy sketch follows below.)
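A compact NumPy sketch of these updates; the initialization and the fixed iteration count are my own simplifications, not prescribed by the slide:

```python
import numpy as np

def em_gmm(X, K, n_iters=100, seed=0):
    """Fit a GMM with EM: the E step computes responsibilities, the M step re-estimates parameters."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mus = X[rng.choice(N, size=K, replace=False)].astype(float)   # random data points as initial means
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)       # shared initial covariance
    pis = np.full(K, 1.0 / K)

    for _ in range(n_iters):
        # E step: gamma[n, k] = P(z^(n) = k | x^(n), theta_old)
        gamma = np.empty((N, K))
        for k in range(K):
            diff = X - mus[k]
            quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigmas[k]), diff)
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigmas[k]))
            gamma[:, k] = pis[k] * np.exp(-0.5 * quad) / norm
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: responsibility-weighted ML estimates
        Nk = gamma.sum(axis=0)                                    # effective number of points per component
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pis = Nk / N
    return pis, mus, Sigmas, gamma
```

Running em_gmm with K = 3 on samples from the 2-D example above should recover parameters close to the true ones, up to component relabeling and the local optima discussed on the next slides.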

  12. EM & GMM: example [Bishop]

  13. EM & GMM: Example [Bishop]

  14. Local Minima

  15. Local Minima
      True parameters (same as the 2-D example):
       $\boldsymbol{\mu}_1 = \begin{bmatrix} -2 \\ 3 \end{bmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 4 \end{bmatrix}$, $\pi_1 = 0.6$;  $\boldsymbol{\mu}_2 = \begin{bmatrix} 0 \\ -4 \end{bmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\pi_2 = 0.25$;  $\boldsymbol{\mu}_3 = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$, $\boldsymbol{\Sigma}_3 = \begin{bmatrix} 3 & 1 \\ 1 & 1 \end{bmatrix}$, $\pi_3 = 0.15$
      Two different EM solutions (local optima) obtained from different initializations:
      Solution 1:
       $\boldsymbol{\mu}_1 = \begin{bmatrix} 0.36 \\ -4.09 \end{bmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 0.89 & 0.26 \\ 0.26 & 0.83 \end{bmatrix}$, $\pi_1 = 0.249$
       $\boldsymbol{\mu}_2 = \begin{bmatrix} 3.25 \\ 2.09 \end{bmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{bmatrix} 2.23 & 1.08 \\ 1.08 & 1.41 \end{bmatrix}$, $\pi_2 = 0.146$
       $\boldsymbol{\mu}_3 = \begin{bmatrix} -2.11 \\ 3.36 \end{bmatrix}$, $\boldsymbol{\Sigma}_3 = \begin{bmatrix} 1.12 & 0.61 \\ 0.61 & 3.61 \end{bmatrix}$, $\pi_3 = 0.604$
      Solution 2:
       $\boldsymbol{\mu}_1 = \begin{bmatrix} 1.45 \\ -1.81 \end{bmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 3.30 & 4.76 \\ 4.76 & 10.01 \end{bmatrix}$, $\pi_1 = 0.392$
       $\boldsymbol{\mu}_2 = \begin{bmatrix} -2.20 \\ 3.16 \end{bmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{bmatrix} 1.30 & 1.10 \\ 1.10 & 2.80 \end{bmatrix}$, $\pi_2 = 0.429$
       $\boldsymbol{\mu}_3 = \begin{bmatrix} -1.88 \\ 3.74 \end{bmatrix}$, $\boldsymbol{\Sigma}_3 = \begin{bmatrix} 5.83 & -0.82 \\ -0.82 & 5.83 \end{bmatrix}$, $\pi_3 = 0.178$

  16. EM+GMM vs. k-means
      k-means:
       It is not probabilistic.
       It has fewer parameters (and is faster).
       It is limited by the underlying assumption of spherical clusters; it can be extended to use covariances, giving "hard EM" (ellipsoidal k-means).
      Both EM and k-means depend on initialization and can get stuck in local optima.
       EM+GMM has more local minima.
      Useful trick: first run k-means and then use its result to initialize EM (sketched below).
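The initialization trick can be sketched as follows; KMeans here is scikit-learn's implementation, the helper name kmeans_init is my own, and the sketch assumes every cluster receives at least two points. The returned statistics can then be used as starting values for the EM updates above.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_init(X, K, seed=0):
    """Build initial GMM parameters (pi, mu, Sigma) from a k-means clustering of X."""
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    d = X.shape[1]
    pis, mus, Sigmas = [], [], []
    for k in range(K):
        Xk = X[labels == k]
        pis.append(len(Xk) / len(X))                       # cluster fraction as mixing weight
        mus.append(Xk.mean(axis=0))                        # cluster centroid as mean
        Sigmas.append(np.cov(Xk.T) + 1e-6 * np.eye(d))     # cluster covariance, slightly regularized
    return np.array(pis), np.stack(mus), np.stack(Sigmas)
```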

  17. EM algorithm: general
      A general algorithm for finding ML estimates when the data is incomplete (missing or unobserved data).

  18. Incomplete log-likelihood
      Complete log-likelihood: maximizing the likelihood (i.e., $\log P(X, Z \mid \theta)$) for fully observed (labeled) data is straightforward.
      Incomplete log-likelihood: with $Z$ unobserved, our objective becomes the log of a marginal probability:
       $\log P(X \mid \theta) = \log \sum_{Z} P(X, Z \mid \theta)$
      This objective does not decouple, and we use the EM algorithm to optimize it.

  19. EM Algorithm
      Assumptions: $X$ (observed or known variables) and $Z$ (unobserved or latent variables); $X$ comes from a specific model with unknown parameters $\theta$.
      If $Z$ is relevant to $X$ (in any way), we can hope to extract information about it from $X$, assuming a specific parametric model of the data.
      Steps (a generic skeleton is sketched below):
       Initialization: initialize the unknown parameters $\theta$.
       Iterate the following steps until convergence:
        Expectation step: find the probability of the unobserved variables given the current parameter estimates and the observed data.
        Maximization step: from the observed data and the probability of the unobserved data, find the most likely parameters (a better estimate of the parameters).
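Independently of the GMM case, the loop described above can be written as follows; e_step and m_step are hypothetical placeholders for the model-specific computations:

```python
def em(X, theta, e_step, m_step, max_iters=100, tol=1e-6):
    """Generic EM loop: alternate posterior inference over Z and parameter re-estimation."""
    prev_ll = -float('inf')
    for _ in range(max_iters):
        q, ll = e_step(X, theta)       # posterior over the unobserved variables and current log-likelihood
        theta = m_step(X, q)           # maximize the expected complete log-likelihood
        if ll - prev_ll < tol:         # the log-likelihood is non-decreasing, so this test terminates
            break
        prev_ll = ll
    return theta
```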

  20. EM algorithm intuition
      When learning with hidden variables, we are trying to solve two problems at once:
       hypothesizing values for the unobserved variables in each data sample
       learning the parameters
      Each of these tasks is fairly easy when we have the solution to the other:
       Given complete data, we have the sufficient statistics and can estimate the parameters using the MLE formulas.
       Conversely, computing the probability of the missing data given the parameters is a probabilistic inference problem.

  21. EM algorithm

  22. EM theoretical analysis
      What is the underlying theory for the use of the expected complete log-likelihood in the M-step?
       $\mathbb{E}_{P(Z \mid X, \theta^{\text{old}})}\left[ \log P(X, Z \mid \theta) \right]$
      Next, we show that maximizing this function also maximizes the likelihood.

  23. EM theoretical foundation: Objective function

  24. Jensen's inequality

  25. EM theoretical foundation: Algorithm in general form
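As a reference for the next two slides, the lower bound $F(\theta, q)$ they use comes from applying Jensen's inequality to the incomplete log-likelihood (a standard derivation, stated here rather than transcribed from the slide):

```latex
\begin{align*}
\ell(\theta; X) = \log P(X \mid \theta)
  &= \log \sum_{Z} q(Z)\, \frac{P(X, Z \mid \theta)}{q(Z)} \\
  &\ge \sum_{Z} q(Z)\, \log \frac{P(X, Z \mid \theta)}{q(Z)}
   \;=\; F(\theta, q) \qquad \text{(Jensen's inequality, since $\log$ is concave)}
\end{align*}
```

Equality holds when $q(Z) = P(Z \mid X, \theta)$.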

  26. EM theoretical foundation: E-step
       $q^{t} = P(Z \mid X, \theta^{t}) \;\Longrightarrow\; q^{t} = \underset{q}{\arg\max}\; F(\theta^{t}, q)$
      Proof:
       $F\big(\theta^{t}, P(Z \mid X, \theta^{t})\big) = \sum_{Z} P(Z \mid X, \theta^{t}) \log \dfrac{P(X, Z \mid \theta^{t})}{P(Z \mid X, \theta^{t})} = \sum_{Z} P(Z \mid X, \theta^{t}) \log P(X \mid \theta^{t}) = \log P(X \mid \theta^{t}) = \ell(\theta^{t}; X)$
       $F(\theta, q)$ is a lower bound on $\ell(\theta; X)$. Thus, $F(\theta^{t}, q)$ has been maximized by setting $q$ to $P(Z \mid X, \theta^{t})$:
       $F\big(\theta^{t}, P(Z \mid X, \theta^{t})\big) = \ell(\theta^{t}; X) \;\Rightarrow\; P(Z \mid X, \theta^{t}) = \underset{q}{\arg\max}\; F(\theta^{t}, q)$

  27. EM algorithm: illustration
      [Figure: the log-likelihood $\ell(\theta; X)$ and the lower bound $F(\theta, q^{t})$, with $\theta^{t}$ and $\theta^{t+1}$ marked.]

  28. EM theoretical foundation: M-step
      The M-step can be equivalently viewed as maximizing the expected complete log-likelihood:
       $\theta^{t+1} = \underset{\theta}{\arg\max}\; F(\theta, q^{t}) = \underset{\theta}{\arg\max}\; \mathbb{E}_{q^{t}}\left[ \log P(X, Z \mid \theta) \right]$
      Proof:
       $F(\theta, q^{t}) = \sum_{Z} q^{t}(Z) \log \dfrac{P(X, Z \mid \theta)}{q^{t}(Z)} = \sum_{Z} q^{t}(Z) \log P(X, Z \mid \theta) - \sum_{Z} q^{t}(Z) \log q^{t}(Z)$
       $\Rightarrow F(\theta, q^{t}) = \mathbb{E}_{q^{t}}\left[ \log P(X, Z \mid \theta) \right] + H\big(q^{t}(Z)\big)$, where the entropy $H\big(q^{t}(Z)\big)$ is independent of $\theta$.
