

SLIDE 1

Applied Machine Learning

Expectation Maximization

for Mixture of Gaussians

Siamak Ravanbakhsh

COMP 551 (Fall 2020)

SLIDE 2

Learning objectives

- what is a latent variable model?
- the Gaussian mixture model
- the intuition behind the Expectation-Maximization algorithm
- relationship to k-means

SLIDE 3

Probabilistic modeling so far...

given data D = {x^(1), …, x^(N)}, we fit a model p(x; θ), e.g., a multivariate Gaussian or Bernoulli

or, if we have labels D = {(x^(1), y^(1)), …, (x^(N), y^(N))}, we saw generative models for classification: p(x, y; θ) = p(y; θ) p(x∣y; θ)

learning: we used maximum likelihood to fit the data,
θ̂ = arg max_θ ∑_n log p(x^(n), y^(n); θ)
(e.g., we used this to fit naive Bayes), or Bayesian inference, p(θ∣D) ∝ p(θ) p(D∣θ)
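As a concrete sketch of the maximum-likelihood fit above, a minimal NumPy example (the function name fit_gaussian_mle and the toy data are mine, not from the slides) computes the closed-form MLE of a multivariate Gaussian: the sample mean and sample covariance.

    import numpy as np

    def fit_gaussian_mle(X):
        """Maximum-likelihood fit of a multivariate Gaussian p(x; mu, Sigma).

        X: array of shape (N, D), one row per observation x^(n).
        """
        mu = X.mean(axis=0)                    # MLE of the mean
        diff = X - mu
        Sigma = diff.T @ diff / X.shape[0]     # MLE of the covariance (divides by N, not N-1)
        return mu, Sigma

    # example: N = 500 samples from a 2-D Gaussian
    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 0.5]], size=500)
    mu_hat, Sigma_hat = fit_gaussian_mle(X)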

SLIDE 4

Latent variable models

sometimes we do not observe all the variables that we wish to model; these are called hidden variables or latent variables

examples
- bias (unobserved) leading to a hiring practice (observed)
- 3D scene (unobserved) producing a 2D photograph (observed)
- gravity (unobserved) leading to an apple falling (observed)
- genotype (unobserved) leading to some phenotype (observed)
- input features (observed) having some unobserved class labels
- ...

the data D = {x^(1), …, x^(N)} is partial or incomplete; the model p(x, z; θ) accounts for both the observed variables (x) and the latent variables (z)

SLIDE 5

Latent variable models

sometimes we do not observe all the variables that we wish to model; these are called hidden variables or latent variables

the data D = {x^(1), …, x^(N)} is partial or incomplete; often we model p(x, z; θ) = p(z; θ) p(x∣z; θ)

- it gives us a lot of flexibility in modeling the data
- we can find hidden factors and learn about how they lead to our observations
- it is both a natural and a powerful way to model complex observations
- but it is difficult to "learn" the model from partial observations

SLIDE 6

Latent variable models

sometimes we do not observe all the variables that we wish to model; these are called hidden variables or latent variables

the data D = {x^(1), …, x^(N)} is partial or incomplete; often we model p(x, z; θ) = p(z; θ) p(x∣z; θ)

if the latent variable is the class label, this resembles generative classification, p(x, y; θ) = p(y; θ) p(x∣y; θ), but here we don't observe the labels

we saw that clustering performs classification without having labels; so can we use latent variable models for clustering?

SLIDE 7

Mixture models

suppose the latent variable has a categorical distribution (an unobserved class label)

p(x, z; θ, π) = Categorical(z; π) p(x∣z; θ)

each datapoint x comes from p(x∣z = k; θ_k) with probability π_k; p(z; π) gives the mixture weights, and we only observe x

we can marginalize out z to get the data distribution; the marginal over the observed variables is a mixture of K distributions

p(x; θ, π) = ∑_k Categorical(z = k; π) p(x∣z = k; θ_k) = ∑_k π_k p(x∣z = k; θ_k)

let's consider the case where each component p(x∣z = k; θ_k) is Gaussian
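To make the marginal concrete, a small sketch (the K = 2 component parameters are made-up illustrations, not from the slides) evaluates p(x) = ∑_k π_k N(x; μ_k, σ_k²) for a 1-D Gaussian mixture.

    import numpy as np
    from scipy.stats import norm

    # made-up 1-D mixture with K = 2 components
    pi = np.array([0.3, 0.7])          # mixture weights, sum to 1
    mu = np.array([-2.0, 1.5])         # component means
    sigma = np.array([0.8, 1.2])       # component standard deviations

    def mixture_density(x):
        # p(x) = sum_k pi_k * N(x; mu_k, sigma_k^2)
        return np.sum(pi * norm.pdf(x, loc=mu, scale=sigma))

    print(mixture_density(0.0))        # density of the mixture at x = 0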

SLIDE 8

Mixture of Gaussians

model the data as a mixture of K Gaussian distributions

p(x; π, {μ_k, Σ_k}) = ∑_k π_k N(x; μ_k, Σ_k)

[figure: a Gaussian mixture model for D = 2]

we can calculate the probability of each datapoint belonging to a cluster k

p(z = k∣x^(n)) = π_k N(x^(n); μ_k, Σ_k) / ∑_c π_c N(x^(n); μ_c, Σ_c)

this is also called the responsibility of cluster k for data point (n): the weighted density of the k'th Gaussian at x^(n), divided by the density of the whole mixture at that point
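The responsibility formula translates almost literally into NumPy/SciPy; a hedged sketch (the function name and argument shapes are my own conventions):

    import numpy as np
    from scipy.stats import multivariate_normal

    def responsibilities(X, pi, mus, Sigmas):
        """r[n, k] = p(z = k | x^(n)) for a Gaussian mixture.

        X: (N, D) data, pi: (K,) mixture weights,
        mus: list of K mean vectors, Sigmas: list of K covariance matrices.
        """
        K = len(pi)
        # weighted density of the k'th Gaussian at every point: shape (N, K)
        weighted = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
            for k in range(K)
        ])
        # normalize by the density of the whole mixture at each point
        return weighted / weighted.sum(axis=1, keepdims=True)

Each row of the result sums to 1: it is the posterior over cluster assignments for that datapoint.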

SLIDE 9

Mixture of Gaussians

complete data: we have both x and z; incomplete data: we only have x

visualizing samples from the joint distribution p(x, z): draw z^(n) ∼ p(z; π), then x^(n) ∼ p(x∣z^(n); θ); colors show the value of z

marginal distribution: p(x) = ∑_k p(x∣z = k) p(z = k)

responsibilities: p(z = k∣x^(n)) = π_k N(x^(n); μ_k, Σ_k) / ∑_c π_c N(x^(n); μ_c, Σ_c)
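The sampling picture on this slide corresponds to ancestral sampling: draw z^(n) from the categorical prior, then x^(n) from the chosen Gaussian. A minimal sketch with illustrative parameters (the specific values are mine):

    import numpy as np

    rng = np.random.default_rng(1)

    # illustrative 2-D mixture with K = 2 components
    pi = np.array([0.4, 0.6])
    mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
    Sigmas = [np.eye(2), np.array([[1.0, 0.6], [0.6, 1.0]])]

    def sample_joint(N):
        # z^(n) ~ Categorical(pi), then x^(n) ~ N(mu_z, Sigma_z)
        z = rng.choice(len(pi), size=N, p=pi)
        x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
        return x, z   # scatter-plotting x colored by z gives the "complete data" picture

    X, Z = sample_joint(500)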

SLIDE 10

Clustering with Gaussian mixture

mixture of Gaussians: p(x; π, {μ_k, Σ_k}) = ∑_k π_k N(x; μ_k, Σ_k)

we can calculate the probability of each datapoint belonging to a cluster k

p(z = k∣x^(n)) = π_k N(x^(n); μ_k, Σ_k) / ∑_c π_c N(x^(n); μ_c, Σ_c)

a probabilistic alternative to K-means:
- K-means: hard cluster membership r_{n,k} ∈ {0, 1}, cluster mean μ_k
- Gaussian mixture: soft cluster membership (responsibility) r_{n,k} = p(z = k∣x^(n)), cluster mean μ_k, cluster covariance matrix Σ_k

SLIDE 11

Learning the Gaussian mixture model

maximize the marginal likelihood of the observations p(x) under our model

ℓ(π, {μ_k, Σ_k}) = ∑_n log ( ∑_k π_k N(x^(n); μ_k, Σ_k) )

setting the derivatives ∂ℓ/∂μ_k, ∂ℓ/∂Σ_k and ∂ℓ/∂π_k to zero (see our references for the step-by-step calculation) gives

μ_k = (∑_n r_{n,k} x^(n)) / (∑_n r_{n,k})
a weighted mean; the weight is the responsibility r_{n,k} = p(z = k∣x^(n)), the probability of sample (n) belonging to cluster k

Σ_k = (∑_n r_{n,k} (x^(n) − μ_k)(x^(n) − μ_k)^⊤) / (∑_{n'} r_{n',k})
a weighted covariance

π_k = (∑_n r_{n,k}) / N
the total amount of responsibility accepted by cluster k

problem: the model parameters depend on the responsibilities, and the responsibilities depend on the model parameters
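The three closed-form updates map directly onto array operations; a hedged NumPy sketch (the function name and array shapes are mine) of the weighted mean, weighted covariance and mixture-weight updates given responsibilities r:

    import numpy as np

    def m_step(X, r):
        """Weighted-MLE updates for a Gaussian mixture.

        X: (N, D) data, r: (N, K) responsibilities r_{n,k}.
        Returns pi: (K,), mus: (K, D), Sigmas: (K, D, D).
        """
        N, D = X.shape
        K = r.shape[1]
        Nk = r.sum(axis=0)                    # total responsibility accepted by each cluster
        pi = Nk / N                           # mixture weights
        mus = (r.T @ X) / Nk[:, None]         # weighted means
        Sigmas = np.empty((K, D, D))
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance
        return pi, mus, Sigmas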

SLIDE 12

Expectation Maximization algorithm for Gaussian Mixture

solution: iteratively update both the parameters and the responsibilities until convergence

start from some initial model {μ_k, Σ_k}, π

repeat until convergence:

expectation step: update the responsibilities given the model, ∀n, k
r_{n,k} ← π_k N(x^(n); μ_k, Σ_k) / ∑_c π_c N(x^(n); μ_c, Σ_c)

maximization step: update the model given the responsibilities, ∀k
μ_k ← (∑_n r_{n,k} x^(n)) / (∑_n r_{n,k})
Σ_k ← (∑_n r_{n,k} (x^(n) − μ_k)(x^(n) − μ_k)^⊤) / (∑_{n'} r_{n',k})
π_k ← (∑_n r_{n,k}) / N
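Putting the two steps into a loop gives a minimal, self-contained EM implementation for a Gaussian mixture; this is only a sketch, and the initialization and convergence test below are simple choices of mine rather than anything prescribed by the slides.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
        """EM for a mixture of Gaussians (minimal sketch, no numerical safeguards)."""
        rng = np.random.default_rng(seed)
        N, D = X.shape
        # initial model: random data points as means, data covariance for every component,
        # uniform mixture weights
        mus = X[rng.choice(N, size=K, replace=False)]
        Sigmas = np.array([np.cov(X.T) for _ in range(K)])
        pi = np.full(K, 1.0 / K)
        prev_ll = -np.inf
        for _ in range(n_iter):
            # E-step: responsibilities r[n, k] = p(z = k | x^(n))
            weighted = np.column_stack([
                pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
                for k in range(K)
            ])
            r = weighted / weighted.sum(axis=1, keepdims=True)
            # M-step: weighted-MLE updates of the parameters
            Nk = r.sum(axis=0)
            pi = Nk / N
            mus = (r.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mus[k]
                Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
            # convergence check on the average log-likelihood
            ll = np.log(weighted.sum(axis=1)).mean()
            if ll - prev_ll < tol:
                break
            prev_ll = ll
        return pi, mus, Sigmas, r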

SLIDE 13

EM algorithm: example for Gaussian Mixture

[figure: initialization, then alternating expectation steps (finding responsibilities) and maximization steps (finding model parameters), shown at iterations 2, 5 and 20; EM converges after 20 iterations (D=2, K=2)]

SLIDE 14

EM algorithm: example for Gaussian Mixture

Iris flowers dataset, multiple runs:
- converged after 34 iterations, average log-likelihood: -1.49
- converged after 120 iterations, average log-likelihood: -1.47
- converged after 50 iterations, average log-likelihood: -1.45
- converged after 43 iterations, average log-likelihood: -1.45

which model is better?
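Under the maximum-likelihood objective, the run with the highest average log-likelihood is the preferred fit, which is why EM is usually restarted several times. A hedged sketch using scikit-learn's GaussianMixture, which runs EM from n_init initializations and keeps the best run by likelihood (the Iris loading is just one way to set up something like the slide's experiment):

    from sklearn.datasets import load_iris
    from sklearn.mixture import GaussianMixture

    X = load_iris().data                      # 150 x 4 Iris feature matrix

    # n_init=10 runs EM from 10 initializations and keeps the highest-likelihood run
    gmm = GaussianMixture(n_components=3, n_init=10, random_state=0).fit(X)

    print(gmm.score(X))                       # average log-likelihood per sample
    print(gmm.n_iter_, gmm.converged_)        # iterations used by the best run, convergence flag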

SLIDE 15

Comparison with K-Means

- objective: K-Means minimizes the sum of squared Euclidean distances to the cluster centers; EM for the Gaussian mixture model minimizes the negative log-(marginal) likelihood
- parameters: cluster centers for K-Means; means, covariances and mixture weights for the Gaussian mixture
- responsibilities: hard cluster memberships for K-Means; soft cluster memberships for the Gaussian mixture
- algorithm: both use alternating minimization with respect to the parameters and the responsibilities
- feature scaling: K-Means is sensitive; the Gaussian mixture is robust, because it learns the covariance
- efficiency: K-Means has faster convergence; EM for the Gaussian mixture has slower convergence
- optimality: both converge to a local optimum, and in both, swapping cluster indices makes no difference in the objective
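The first rows of the comparison show up directly in code; a usage sketch (the synthetic two-cluster data is only for illustration) contrasting KMeans' hard labels with the Gaussian mixture's soft responsibilities:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
                   rng.normal([4, 4], [1.0, 3.0], size=(100, 2))])   # anisotropic second cluster

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    hard = km.labels_                          # hard memberships r_{n,k} in {0, 1}

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    soft = gmm.predict_proba(X)                # soft memberships (responsibilities), rows sum to 1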
SLIDE 16

Expectation Maximization

we saw the application of EM to the Gaussian mixture; EM is a general algorithm for learning latent variable models: we have a model p(x, z; θ) and partial observations D = {x^(1), …, x^(N)}

to learn the model parameters θ and infer the latent variables, use EM

start from some initial model θ
repeat until convergence:
- E-step: do a probabilistic completion p(z^(n)∣x^(n); θ) ∀n
- M-step: fit the model θ to the (probabilistically) completed data

SLIDE 17

Expectation Maximization

a simple variation called the hard EM algorithm

start from some initial model θ
repeat until convergence:
- E-step: do a deterministic completion z^(n) = arg max_z p(z∣x^(n); θ) ∀n
- M-step: fit the model θ to the completed data using maximum likelihood

K-means is performing hard EM using a fixed covariance and fixed mixture weights:
- find the closest center (i.e., find the Gaussian with the highest probability)
- fit Gaussians to the completed data (x, z)
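Hard EM changes only the E-step: each point is completed deterministically with its most probable component. A minimal sketch of that step (parameter shapes follow the earlier sketches and are my own convention):

    import numpy as np
    from scipy.stats import multivariate_normal

    def hard_e_step(X, pi, mus, Sigmas):
        """Deterministic completion: z^(n) = argmax_k p(z = k | x^(n); theta)."""
        K = len(pi)
        log_post = np.column_stack([
            np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
            for k in range(K)
        ])                                # unnormalized log-posterior, shape (N, K)
        return log_post.argmax(axis=1)    # hard assignments z^(n)

With uniform mixture weights and a shared spherical covariance σ²I, this argmax reduces to picking the closest center in Euclidean distance, which is exactly the K-means assignment step.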

SLIDE 18

Summary

Latent variable models: a general and powerful type of probabilistic model
- we have only partial observations
- we can use EM to learn the parameters and infer the hidden values

Expectation maximization (EM):
- useful when we have hidden variables or missing values
- tries to maximize the log-likelihood of the observations
- iterates between learning the model parameters and inferring the latents
- converges to a local optimum (performance depends on the initialization)

The only concrete example that we saw: the Gaussian mixture model (GMM)
- EM in the GMM for soft clustering
- relationship to K-means