Applied Machine Learning
Expectation Maximization
for Mixture of Gaussians
Siamak Ravanbakhsh
COMP 551 (Fall 2020)
Learning objectives

what is a latent variable model?
Gaussian mixture model
the intuition behind the Expectation-Maximization algorithm
relationship to k-means
given data D = {x^(1), …, x^(N)} and a model p(x; θ), e.g., multivariate Gaussian, Bernoulli
we saw generative models for classification: D = {(x^(1), y^(1)), …, (x^(N), y^(N))}
p(x, y; θ) = p(y; θ) p(x∣y; θ)
Learning: use maximum likelihood to fit the data
θ̂ = arg max_θ ∑_n log p(x^(n), y^(n); θ)
e.g., we used this to fit naive Bayes
or use Bayesian inference: p(θ∣D) ∝ p(θ)p(D∣θ)
sometimes we do not observe all the variables that we wish to model; these are called hidden variables or latent variables

examples:
bias (unobserved) leading to a hiring practice (observed)
3D scene (unobserved) producing a 2D photograph (observed)
gravity (unobserved) leading to an apple falling (observed)
genotype (unobserved) leading to some phenotype (observed)
input features (observed) having some unobserved class labels
...
the data D = {x^(1), …, x^(N)} is partial or incomplete
the model p(x, z; θ) accounts for both observed (x) and latent (z) variables
p(x, z; θ) = p(z; θ) p(x∣z; θ)

a latent variable model gives us a lot of flexibility in modeling the data: we can find hidden factors and learn how they lead to our observations. It is both a natural and powerful way to model complex observations, but it is difficult to "learn" the model from partial observations.
if the latent variable is the class label, this resembles generative classification:
p(x, y; θ) = p(y; θ) p(x∣y; θ)
but here we do not observe the labels. We saw that clustering performs classification without having labels, so can we use latent variable models for clustering?
suppose the latent variable has a categorical distribution p(z; π) (an unobserved class label):
p(x, z; θ, π) = Categorical(z; π) p(x∣z; θ)
each datapoint x comes from the component p(x∣z = k; θ_k) with probability π_k
we only observe x; marginalizing out z gives the data distribution, which is a mixture of K distributions:
p(x; θ, π) = ∑_k Categorical(z = k; π) p(x∣z = k; θ_k) = ∑_k π_k p(x∣z = k; θ_k)
let's consider the case where each component is Gaussian: model the data as a mixture of K Gaussian distributions
p(x; π, {μ_k, Σ_k}) = ∑_k π_k N(x; μ_k, Σ_k)
where π are the mixture weights
[figure: a Gaussian mixture model for D=2]
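As a concrete illustration, here is a minimal numpy/scipy sketch of evaluating this mixture density; the weights, means, and covariances below are made-up values for a K=2, D=2 example, not parameters from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

# illustrative parameters for a K=2, D=2 Gaussian mixture (made-up values)
pi = np.array([0.3, 0.7])                              # mixture weights, sum to 1
mus = [np.array([0., 0.]), np.array([3., 3.])]         # component means
Sigmas = [np.eye(2), np.array([[1., .5], [.5, 1.]])]   # component covariances

def gmm_density(x, pi, mus, Sigmas):
    """p(x) = sum_k pi_k N(x; mu_k, Sigma_k)"""
    return sum(pi[k] * multivariate_normal.pdf(x, mean=mus[k], cov=Sigmas[k])
               for k in range(len(pi)))

print(gmm_density(np.array([1., 1.]), pi, mus, Sigmas))
```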
we can calculate the probability of each datapoint x^(n) belonging to a cluster k:
p(z = k ∣ x^(n)) = π_k N(x^(n); μ_k, Σ_k) / ∑_c π_c N(x^(n); μ_c, Σ_c)
this is also called the responsibility of cluster k for data point (n): the weighted density of the k'th Gaussian at x^(n), divided by the density of the whole mixture at that point
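A small sketch of computing these responsibilities for a whole dataset at once, assuming X is an (N, D) numpy array and the parameter containers (pi, mus, Sigmas) follow the previous snippet; the function name is just a choice made here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    """r[n, k] = pi_k N(x_n; mu_k, Sigma_k) / sum_c pi_c N(x_n; mu_c, Sigma_c)"""
    K = len(pi)
    # weighted density of each component at each point, shape (N, K)
    dens = np.stack([pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
                     for k in range(K)], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)   # normalize by the mixture density
```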
complete data (we have both x and z) vs. incomplete data (we only have x)
marginal distribution: p(x) = ∑_k p(x∣z = k) p(z = k)
visualizing samples from the joint distribution p(x, z):
z^(n) ∼ p(z; π), then x^(n) ∼ p(x∣z^(n); θ)
[figure: samples from the joint distribution; colors show the value of z]
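Such samples can be drawn by ancestral sampling: first the cluster label, then the point from that cluster's Gaussian. A minimal sketch, assuming the same parameter containers as above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(N, pi, mus, Sigmas):
    """Draw z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z), for N samples."""
    z = rng.choice(len(pi), size=N, p=pi)                              # latent cluster labels
    x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z
```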
responsibilities: p(z = k ∣ x^(n)) = π_k N(x^(n); μ_k, Σ_k) / ∑_c π_c N(x^(n); μ_c, Σ_c)
mixture of Gaussians: p(x; π, {μ_k, Σ_k}) = ∑_k π_k N(x; μ_k, Σ_k)
we can calculate the probability of each datapoint belonging to a cluster k:
p(z = k ∣ x^(n)) = π_k N(x^(n); μ_k, Σ_k) / ∑_c π_c N(x^(n); μ_c, Σ_c)

a probabilistic alternative to K-means:
K-means: hard cluster membership r_{n,k} ∈ {0, 1} and cluster mean μ_k
mixture of Gaussians: soft cluster membership (responsibility) r_{n,k} = p(z = k ∣ x^(n)), cluster mean μ_k, and cluster covariance matrix Σ_k
maximize the marginal likelihood of the observations p(x) under our model:
ℓ(π, {μ_k, Σ_k}) = ∑_n log( ∑_k π_k N(x^(n); μ_k, Σ_k) )
set the derivatives to zero (see our references for a step-by-step calculation):

∂ℓ/∂μ_k = 0 gives μ_k = (1 / ∑_n r_{n,k}) ∑_n r_{n,k} x^(n)
a weighted mean, where the weight r_{n,k} = p(z = k ∣ x^(n)) is the responsibility: the probability of sample (n) belonging to cluster k

∂ℓ/∂Σ_k = 0 gives Σ_k = (1 / ∑_{n'} r_{n',k}) ∑_n r_{n,k} (x^(n) − μ_k)(x^(n) − μ_k)^⊤
a weighted covariance

∂ℓ/∂π_k = 0 gives π_k = (1/N) ∑_n r_{n,k}
the total amount of responsibility accepted by cluster k (normalized by N)
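These closed-form updates are just weighted averages, which is easy to see in code. Below is a minimal numpy sketch of the three update equations, assuming the responsibilities r have already been computed as an (N, K) array; the helper name m_step and the array layout are choices made here, not code from the course.

```python
import numpy as np

def m_step(X, r):
    """Closed-form updates given responsibilities r of shape (N, K) and data X of shape (N, D)."""
    N, K = r.shape
    Nk = r.sum(axis=0)                               # total responsibility per cluster
    pi = Nk / N                                      # mixture weights
    mus = (r.T @ X) / Nk[:, None]                    # weighted means, shape (K, D)
    Sigmas = []
    for k in range(K):
        diff = X - mus[k]                            # (N, D)
        Sigmas.append((r[:, k, None] * diff).T @ diff / Nk[k])   # weighted covariance
    return pi, mus, Sigmas
```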
problem: the model parameters depend on the responsibilities, and the responsibilities depend on the model parameters
solution: iteratively update both the parameters and the responsibilities until convergence
EM for the Gaussian mixture:
start from some initial model {μ_k, Σ_k}, π
repeat until convergence:
  expectation step: update the responsibilities given the model, ∀n, k
    r_{n,k} ← π_k N(x^(n); μ_k, Σ_k) / ∑_c π_c N(x^(n); μ_c, Σ_c)
  maximization step: update the model given the responsibilities, ∀k
    μ_k ← (1 / ∑_n r_{n,k}) ∑_n r_{n,k} x^(n)
    Σ_k ← (1 / ∑_{n'} r_{n',k}) ∑_n r_{n,k} (x^(n) − μ_k)(x^(n) − μ_k)^⊤
    π_k ← (1/N) ∑_n r_{n,k}
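Putting the E-step and M-step together gives a compact EM loop. The following is a sketch, not the course's reference implementation: the initialization (uniform weights, random data points as means, identity covariances) and the stopping rule (small change in the marginal log-likelihood) are simple choices made here for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iter=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture; returns weights, means, covariances, responsibilities."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # simple initialization: uniform weights, random data points as means, identity covariances
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)]
    Sigmas = np.stack([np.eye(D)] * K)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities r[n, k] proportional to pi_k N(x_n; mu_k, Sigma_k)
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                         for k in range(K)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted means, covariances, and mixture weights
        Nk = r.sum(axis=0)
        pi = Nk / N
        mus = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
        # check convergence of the marginal log-likelihood
        ll = np.log(dens.sum(axis=1)).sum()
        if np.abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mus, Sigmas, r
```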
example for the Gaussian mixture
[figure: initialization, expectation step (finding responsibilities), maximization step (finding model parameters), and the fit after iterations 2, 5, and 20; EM converges after 20 iterations (D=2, K=2)]
example for the Gaussian mixture: Iris flowers dataset, multiple runs
converged after 34 iterations, average log-likelihood: -1.49
converged after 120 iterations, average log-likelihood: -1.47
converged after 50 iterations, average log-likelihood: -1.45
converged after 43 iterations, average log-likelihood: -1.45
which model is better? under the maximum-likelihood objective, the runs with the higher average log-likelihood (-1.45) give the better fit
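One way to reproduce this kind of experiment is scikit-learn's GaussianMixture with different random initializations; the iteration counts and log-likelihoods on the slide come from the course's own runs, so the numbers printed below will differ.

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
# several random restarts; each run may converge to a different local optimum
for seed in range(4):
    gm = GaussianMixture(n_components=3, init_params='random', random_state=seed).fit(X)
    print(seed, gm.n_iter_, gm.score(X))   # iterations and average log-likelihood per sample
```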
K-Means vs. EM for the Gaussian mixture model

objective: K-Means minimizes the sum of squared Euclidean distances to the cluster centers; EM minimizes the negative log-(marginal) likelihood
parameters: K-Means learns the cluster centers; EM learns the means, covariances, and mixture weights
responsibilities: K-Means uses hard cluster memberships; EM uses soft cluster memberships
algorithm: both alternate minimization with respect to the parameters and the responsibilities
feature scaling: K-Means is sensitive; EM is robust, because it learns the covariance
efficiency: K-Means converges faster; EM converges more slowly

both converge to a local optimum, and in both, swapping the cluster indices makes no difference in the objective
we saw an application of EM to the Gaussian mixture, but EM is a general algorithm for learning latent variable models: we have a model p(x, z; θ) and partial observations D = {x^(1), …, x^(N)}
to learn the model parameters and infer the latent variables, use EM:
start from some initial model θ
repeat until convergence:
  E-step: do a probabilistic completion p(z^(n) ∣ x^(n); θ) ∀n
  M-step: fit the model θ to the (probabilistically) completed data
a simple variation is called the hard EM algorithm:
start from some initial model θ
repeat until convergence:
  E-step: do a deterministic completion z^(n) = arg max_z p(z ∣ x^(n); θ) ∀n
  M-step: fit the model θ to the completed data using maximum likelihood

K-means is performing hard EM with a fixed covariance and fixed mixture weights:
  E-step: find the closest center (i.e., find the Gaussian with the highest probability)
  M-step: fit the Gaussians to the completed data (x, z)
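The only change from soft EM is in the E-step: the responsibilities are replaced by a one-hot assignment to the most probable cluster. A tiny sketch of that replacement, assuming r is the (N, K) responsibility array from before:

```python
import numpy as np

def hard_assign(r):
    """z[n] = argmax_k r[n, k]; with equal, fixed covariances and mixture weights
    this is exactly the K-means rule of picking the closest center."""
    z = r.argmax(axis=1)
    r_hard = np.zeros_like(r)
    r_hard[np.arange(len(z)), z] = 1.0   # one-hot version of the soft responsibilities
    return r_hard
```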
Latent variable models: a general and powerful type of probabilistic model
  we have only partial observations
  we can use EM to learn the parameters and infer the hidden values
Expectation Maximization (EM):
  useful when we have hidden variables or missing values
  tries to maximize the log-likelihood of the observations
  iterates between learning the model parameters and inferring the latents
  converges to a local optimum (performance depends on initialization)
The only concrete example that we saw: the Gaussian mixture model (GMM)
  EM in the GMM for soft clustering
  relationship to K-means