Gaussian Mixture Models & EM
CE-717: Machine Learning
Sharif University of Technology
M. Soleymani, Fall 2016
Mixture Models: definition

- Mixture models: linear superposition of mixture components

  p(x | \theta) = \sum_{k=1}^{K} P(z_k) \, p(x; \theta_k), \qquad \sum_{k=1}^{K} P(z_k) = 1

  - P(z_k): the prior probability of the k-th mixture component
  - \theta_k: the parameters of the k-th mixture component
  - p(x; \theta_k): the probability of x according to the k-th mixture component
- A framework for building more complex probability distributions
- Goal: estimate p(x), e.g., multi-modal density estimation
Gaussian Mixture Models (GMMs)

- Gaussian mixture model: each component density is a Gaussian, p(x; \theta_k) = \mathcal{N}(x | \mu_k, \Sigma_k)

  p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x | \mu_k, \Sigma_k), \qquad 0 \le \pi_k \le 1, \qquad \sum_{k=1}^{K} \pi_k = 1

- Fitting the Gaussian mixture model
  - Input: data points \{x^{(i)}\}_{i=1}^{N}
  - Goal: find the parameters of the GMM (\pi_k, \mu_k, \Sigma_k, \; k = 1, ..., K)
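As a concrete illustration (not part of the original slides), the following minimal Python/NumPy sketch evaluates a GMM density; the parameter values are taken from the 2-D example on the next slides, and the function name gmm_density is my own choice.

```python
import numpy as np
from scipy.stats import multivariate_normal

# GMM parameters (the 2-D example used later in these slides)
pis = [0.6, 0.25, 0.15]                                   # mixing coefficients, sum to 1
mus = [np.array([-2.0, 3.0]), np.array([0.0, -4.0]), np.array([3.0, 2.0])]
Sigmas = [np.array([[1.0, 0.5], [0.5, 4.0]]),
          np.array([[1.0, 0.0], [0.0, 1.0]]),
          np.array([[3.0, 1.0], [1.0, 1.0]])]

def gmm_density(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal(mean=mu, cov=S).pdf(x)
               for pi, mu, S in zip(pis, mus, Sigmas))

print(gmm_density(np.array([0.0, 0.0]), pis, mus, Sigmas))
```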
GMM: 1-D Example

[Figure: a 1-D mixture of three Gaussians with
 \mu_1 = -2, \sigma_1 = 2, \pi_1 = 0.6;
 \mu_2 = 4, \sigma_2 = 1, \pi_2 = 0.3;
 \mu_3 = 8, \sigma_3 = 0.2, \pi_3 = 0.1]
GMM: 2-D Example

[Figure: samples from a 2-D GMM with K = 3 components]
 \mu_1 = (-2, 3)^T,  \Sigma_1 = [[1, 0.5], [0.5, 4]],  \pi_1 = 0.6
 \mu_2 = (0, -4)^T,  \Sigma_2 = [[1, 0], [0, 1]],      \pi_2 = 0.25
 \mu_3 = (3, 2)^T,   \Sigma_3 = [[3, 1], [1, 1]],      \pi_3 = 0.15
GMM: 2-D Example

- GMM distribution

[Figure: the GMM density for the same parameters as the previous slide (K = 3):
 \mu_1 = (-2, 3)^T, \Sigma_1 = [[1, 0.5], [0.5, 4]], \pi_1 = 0.6;
 \mu_2 = (0, -4)^T, \Sigma_2 = [[1, 0], [0, 1]], \pi_2 = 0.25;
 \mu_3 = (3, 2)^T, \Sigma_3 = [[3, 1], [1, 1]], \pi_3 = 0.15]
How to Fit a GMM?

- We want to maximize the log-likelihood of the data X = \{x^{(1)}, ..., x^{(N)}\}:

  \ln p(X | \pi, \mu, \Sigma) = \sum_{i=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x^{(i)} | \mu_k, \Sigma_k) \right)

- The sum over components appears inside the log, so there is no closed-form maximum-likelihood solution.
- Setting the derivatives to zero (with a Lagrange multiplier for the constraint \sum_k \pi_k = 1):

  \frac{\partial \ln p(X | \pi, \mu, \Sigma)}{\partial \mu_k} = 0, \qquad \frac{\partial \ln p(X | \pi, \mu, \Sigma)}{\partial \Sigma_k} = 0, \qquad k = 1, ..., K

  \frac{\partial}{\partial \pi_k} \left[ \ln p(X | \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right) \right] = 0
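A numerical sketch of this log-likelihood (my addition, not the slides'): the log-sum-exp trick used below is a standard way to keep the inner sum stable and assumes SciPy is available.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """ln p(X | pi, mu, Sigma) = sum_i ln sum_k pi_k * N(x_i | mu_k, Sigma_k)."""
    K = len(pis)
    # log_pdf[i, k] = ln pi_k + ln N(x_i | mu_k, Sigma_k)
    log_pdf = np.column_stack([
        np.log(pis[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(X)
        for k in range(K)
    ])
    return logsumexp(log_pdf, axis=1).sum()   # sum over data points
```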
ML for GMM

- Define the responsibilities

  \gamma_k^{(i)} = \frac{\pi_k \, \mathcal{N}(x^{(i)} | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x^{(i)} | \mu_j, \Sigma_j)}

- Setting the derivatives to zero gives

  \mu_k^{new} = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_k^{(i)} x^{(i)}

  \Sigma_k^{new} = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_k^{(i)} (x^{(i)} - \mu_k^{new})(x^{(i)} - \mu_k^{new})^T

  \pi_k^{new} = \frac{N_k}{N}, \qquad N_k = \sum_{i=1}^{N} \gamma_k^{(i)}

- Useful matrix derivatives in this derivation: \frac{\partial (a^T B a)}{\partial B} = a a^T and \frac{\partial \ln |B|}{\partial B} = (B^{-1})^T
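These closed-form updates translate directly into code. A minimal NumPy sketch (mine, not the slides'), assuming the responsibilities gamma are already available as an N x K array:

```python
import numpy as np

def m_step(X, gamma):
    """Re-estimate (pi, mu, Sigma) from data X (N x d) and responsibilities gamma (N x K)."""
    N, d = X.shape
    Nk = gamma.sum(axis=0)                        # effective number of points per component
    pis = Nk / N                                  # pi_k = N_k / N
    mus = (gamma.T @ X) / Nk[:, None]             # mu_k = (1/N_k) sum_i gamma_k^(i) x^(i)
    Sigmas = []
    for k in range(gamma.shape[1]):
        diff = X - mus[k]                         # x^(i) - mu_k^new
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return pis, mus, np.array(Sigmas)
```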
EM algorithm

- An iterative algorithm in which each iteration is guaranteed not to decrease the log-likelihood.
- A general algorithm for finding ML estimates when the data is incomplete (missing or unobserved data).
- EM finds the maximum-likelihood parameters in cases where the model involves unobserved variables Z in addition to unknown parameters \theta and observed data X.
Mixture models: discrete latent variables

  p(x) = \sum_{k=1}^{K} P(z_k = 1) \, p(x | z_k = 1) = \sum_{k=1}^{K} \pi_k \, p(x | z_k = 1)

- z: latent (hidden) variable that specifies the mixture component
- P(z_k = 1) = \pi_k, with 0 \le \pi_k \le 1 and \sum_{k=1}^{K} \pi_k = 1
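This latent-variable view corresponds to ancestral sampling: draw the component indicator z from a categorical distribution with probabilities \pi, then draw x from the selected Gaussian. A short sketch (my illustration, reusing the 2-D example parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

pis = np.array([0.6, 0.25, 0.15])
mus = np.array([[-2.0, 3.0], [0.0, -4.0], [3.0, 2.0]])
Sigmas = np.array([[[1.0, 0.5], [0.5, 4.0]],
                   [[1.0, 0.0], [0.0, 1.0]],
                   [[3.0, 1.0], [1.0, 1.0]]])

def sample_gmm(n):
    """Ancestral sampling: z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z)."""
    zs = rng.choice(len(pis), size=n, p=pis)          # latent component indicators
    xs = np.array([rng.multivariate_normal(mus[z], Sigmas[z]) for z in zs])
    return xs, zs

X, Z = sample_gmm(500)
```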
EM for GMM

- Notation: \theta = [\pi, \mu, \Sigma]; z^{(i)} \in \{1, 2, ..., K\} indicates the mixture component from which x^{(i)} was generated.
- Initialize \pi_k, \mu_k, \Sigma_k for k = 1, ..., K
- E-step: for i = 1, ..., N and k = 1, ..., K

  \gamma_k^{(i)} = P(z_k^{(i)} = 1 | x^{(i)}, \theta^{old}) = \frac{\pi_k^{old} \, \mathcal{N}(x^{(i)} | \mu_k^{old}, \Sigma_k^{old})}{\sum_{j=1}^{K} \pi_j^{old} \, \mathcal{N}(x^{(i)} | \mu_j^{old}, \Sigma_j^{old})}

- M-step: for k = 1, ..., K

  \mu_k^{new} = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_k^{(i)} x^{(i)}

  \Sigma_k^{new} = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_k^{(i)} (x^{(i)} - \mu_k^{new})(x^{(i)} - \mu_k^{new})^T

  \pi_k^{new} = \frac{N_k}{N}, \qquad N_k = \sum_{i=1}^{N} \gamma_k^{(i)}

- Repeat the E and M steps until convergence (a code sketch of the full loop follows below).
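Putting the E and M steps together, here is a compact, self-contained EM loop for a GMM. This is an illustrative sketch I added; the initialization scheme, the small ridge added to the covariances, and the convergence tolerance are my choices, not something the slides specify.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def em_gmm(X, K, n_iter=200, tol=1e-6, seed=0):
    """Fit a K-component GMM to X (N x d) with EM; returns (pis, mus, Sigmas, log_likelihoods)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Simple initialization: uniform weights, random data points as means, shared covariance
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    lls = []
    for _ in range(n_iter):
        # E-step: log of pi_k * N(x_i | mu_k, Sigma_k), normalized per data point
        log_w = np.column_stack([np.log(pis[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(X)
                                 for k in range(K)])
        log_norm = logsumexp(log_w, axis=1, keepdims=True)
        gamma = np.exp(log_w - log_norm)              # responsibilities, N x K
        lls.append(log_norm.sum())                    # current log-likelihood
        # M-step: closed-form updates from the responsibilities
        Nk = gamma.sum(axis=0)
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        if len(lls) > 1 and abs(lls[-1] - lls[-2]) < tol:
            break
    return pis, mus, Sigmas, lls
```

Typical usage: pis, mus, Sigmas, lls = em_gmm(X, K=3). The recorded log-likelihoods in lls should be non-decreasing, which is a handy sanity check on the implementation.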
EM & GMM: Example [Bishop]
EM & GMM: Example (continued) [Bishop]
Local Minima
Local Minima

- True parameters (the 2-D example): \mu_1 = (-2, 3)^T, \Sigma_1 = [[1, 0.5], [0.5, 4]], \pi_1 = 0.6; \mu_2 = (0, -4)^T, \Sigma_2 = [[1, 0], [0, 1]], \pi_2 = 0.25; \mu_3 = (3, 2)^T, \Sigma_3 = [[3, 1], [1, 1]], \pi_3 = 0.15
- Two different local optima reached by EM from different initializations:
  - Solution 1: \mu_1 = (0.36, -4.09)^T, \Sigma_1 = [[0.89, 0.26], [0.26, 0.83]], \pi_1 = 0.249; \mu_2 = (3.25, 2.09)^T, \Sigma_2 = [[2.23, 1.08], [1.09, 1.41]], \pi_2 = 0.146; \mu_3 = (-2.11, 3.36)^T, \Sigma_3 = [[1.12, 0.61], [0.61, 3.61]], \pi_3 = 0.604
  - Solution 2: \mu_1 = (1.45, -1.81)^T, \Sigma_1 = [[3.30, 4.76], [4.76, 10.01]], \pi_1 = 0.392; \mu_2 = (-2.20, 3.16)^T, \Sigma_2 = [[1.30, 1.10], [1.10, 2.80]], \pi_2 = 0.429; \mu_3 = (-1.88, 3.74)^T, \Sigma_3 = [[5.83, -0.82], [-0.82, 5.83]], \pi_3 = 0.178
EM+GMM vs. k-means

- k-means:
  - is not probabilistic
  - has fewer parameters (and is faster)
  - is limited by the underlying assumption of spherical clusters
  - can be extended to use covariances, giving "hard EM" (ellipsoidal k-means)
- Both EM and k-means depend on initialization and can get stuck in local optima
  - EM+GMM has more local optima
- Useful trick: first run k-means and then use its result to initialize EM (see the sketch below).
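One way to realize this trick, assuming scikit-learn is available (the slides do not prescribe any library; this is just an illustration with placeholder data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))   # placeholder data

# Run k-means first, then hand its centers to EM as the initial means.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=3, means_init=kmeans.cluster_centers_,
                      random_state=0).fit(X)
print(gmm.weights_, gmm.means_)
```

Note that scikit-learn's GaussianMixture already defaults to a k-means-based initialization (init_params='kmeans'), so the explicit means_init above mainly makes the trick visible.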
EM algorithm: general

- A general algorithm for finding ML estimates when the data is incomplete (missing or unobserved data).
Incomplete log-likelihood

- Complete log-likelihood
  - Maximizing the likelihood (i.e., \log p(X, Z | \theta)) for fully observed (labeled) data is straightforward.
- Incomplete log-likelihood
  - With Z unobserved, our objective becomes the log of a marginal probability:

    \log p(X | \theta) = \log \sum_{Z} p(X, Z | \theta)

  - This objective does not decouple, and we use the EM algorithm to optimize it.
EM Algorithm

- Assumptions: X (observed or known variables) and Z (unobserved or latent variables) come from a specific model with unknown parameters \theta.
- If Z is relevant to X (in any way), we can hope to extract information about it from X, assuming a specific parametric model on the data.
- Steps:
  - Initialization: initialize the unknown parameters \theta.
  - Iterate the following steps until convergence:
    - Expectation step: find the distribution of the unobserved variables given the current parameter estimates and the observed data.
    - Maximization step: from the observed data and the distribution over the unobserved data, find the most likely parameters (a better estimate of the parameters).
  - (A generic skeleton of this loop is sketched in code below.)
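A generic skeleton of the loop just described, in Python. The callables e_step, m_step, and log_likelihood are hypothetical placeholders a user would supply for a specific model; they are not defined in the slides.

```python
def em(X, theta_init, e_step, m_step, log_likelihood, n_iter=100, tol=1e-6):
    """Generic EM loop: alternate inference over Z (E-step) and parameter updates (M-step)."""
    theta = theta_init
    prev_ll = -float("inf")
    for _ in range(n_iter):
        q = e_step(X, theta)          # distribution over unobserved Z given current theta
        theta = m_step(X, q)          # parameters maximizing the expected complete log-likelihood
        ll = log_likelihood(X, theta)
        if ll - prev_ll < tol:        # log-likelihood is non-decreasing, so this detects convergence
            break
        prev_ll = ll
    return theta
```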
EM algorithm intuition

- When learning with hidden variables, we are trying to solve two problems at once:
  - hypothesizing values for the unobserved variables in each data sample
  - learning the parameters
- Each of these tasks is fairly easy when we have the solution to the other:
  - Given complete data, we have the sufficient statistics and can estimate the parameters using the MLE formulas.
  - Conversely, computing the probability of the missing data given the parameters is a probabilistic inference problem.
EM algorithm
EM theoretical analysis

- What is the underlying theory for using the expected complete log-likelihood in the M-step?

  E_{p(Z | X, \theta^{old})} \left[ \log p(X, Z | \theta) \right]

- We now show that maximizing this function also increases the likelihood.
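For the GMM, this expected complete log-likelihood takes a particularly convenient form (a standard result, e.g. Bishop Ch. 9, added here for concreteness):

  E_{p(Z | X, \theta^{old})} \left[ \log p(X, Z | \theta) \right]
    = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_k^{(i)} \left[ \ln \pi_k + \ln \mathcal{N}(x^{(i)} | \mu_k, \Sigma_k) \right],
  \qquad \gamma_k^{(i)} = p(z_k^{(i)} = 1 | x^{(i)}, \theta^{old})

Because the logarithm now acts directly on each Gaussian term, maximizing this expression over \pi_k, \mu_k, \Sigma_k yields exactly the closed-form M-step updates given earlier.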
EM theoretical foundation: Objective function

- The (incomplete) log-likelihood objective: \ell(\theta; D) = \log p(X | \theta)
Jensen's inequality
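For reference, Jensen's inequality for the concave logarithm, and the lower bound F(\theta, q) on the log-likelihood that it yields (the same F used on the following slides):

  \text{Jensen: for concave } \varphi, \quad \varphi(E[Y]) \ge E[\varphi(Y)]

  \ell(\theta; D) = \log p(x | \theta)
    = \log \sum_{z} q(z) \, \frac{p(x, z | \theta)}{q(z)}
    \ge \sum_{z} q(z) \log \frac{p(x, z | \theta)}{q(z)}
    \equiv F(\theta, q), \quad \text{for any distribution } q(z)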
EM theoretical foundation: Algorithm in general form

- E-step: q^t = \arg\max_q F(\theta^t, q)
- M-step: \theta^{t+1} = \arg\max_\theta F(\theta, q^t)
EM theoretical foundation: E-step

  q^t = p(Z | X, \theta^t) \implies q^t = \arg\max_q F(\theta^t, q)

- Proof:

  F(\theta^t, p(Z | X, \theta^t)) = \sum_{Z} p(Z | X, \theta^t) \log \frac{p(X, Z | \theta^t)}{p(Z | X, \theta^t)}
    = \sum_{Z} p(Z | X, \theta^t) \log p(X | \theta^t) = \log p(X | \theta^t) = \ell(\theta^t; D)

- F(\theta, q) is a lower bound on \ell(\theta; D). Thus F(\theta^t, q) has been maximized over q by setting q to p(Z | X, \theta^t):

  F(\theta^t, p(Z | X, \theta^t)) = \ell(\theta^t; D) \implies p(Z | X, \theta^t) = \arg\max_q F(\theta^t, q)
EM algorithm: illustration

[Figure: the lower bound F(\theta, q^t) touches \ell(\theta; D) at \theta^t; maximizing the bound over \theta gives \theta^{t+1}]
EM theoretical foundation: M-step

- The M-step can be equivalently viewed as maximizing the expected complete log-likelihood:

  \theta^{t+1} = \arg\max_\theta F(\theta, q^t) = \arg\max_\theta E_{q^t}\left[ \log p(X, Z | \theta) \right]

- Proof:

  F(\theta, q^t) = \sum_{Z} q^t(Z) \log \frac{p(X, Z | \theta)}{q^t(Z)}
    = \sum_{Z} q^t(Z) \log p(X, Z | \theta) - \sum_{Z} q^t(Z) \log q^t(Z)

  \implies F(\theta, q^t) = E_{q^t}\left[ \log p(X, Z | \theta) \right] + H(q^t)

  where the entropy term H(q^t) is independent of \theta.