SLIDE 1

GMM & EM

SLIDE 2

Last time summary

  • Normalization
  • Bias-Variance trade-off
  • Overfitting and underfitting
  • MLE vs MAP estimate
  • How to use the prior
  • LRT (Bayes Classifier)
  • Naïve Bayes
SLIDE 3

A simple decision rule

  • If we know either p(x|w) or p(w|x), we can make a classification guess

Goal: Find p(x|w) or p(w|x) by estimating the parameters of the distribution

SLIDE 4

A simple way to estimate p(x|w)

Make a histogram! But what happens if there is no data in a bin? The estimated probability there is zero, even for values right next to the observed data.
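Below is a minimal numpy sketch of this histogram estimate (made-up data; the bin settings are arbitrary choices, not from the slides):

```python
import numpy as np

# Minimal sketch with made-up data: estimate p(x|w) for one class
# by normalizing a histogram of that class's samples.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=200)   # pretend these are class-w samples

counts, edges = np.histogram(x, bins=30, range=(-5.0, 10.0))
widths = np.diff(edges)
density = counts / (counts.sum() * widths)     # piecewise-constant density estimate

# Empty bins have density 0: a test point landing there is judged
# impossible, even if it sits right next to the observed data.
print("empty bins:", int(np.sum(counts == 0)))
```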

SLIDE 5

The parametric approach

  • We assume p(x|w) or p(w|x) follows some distribution with parameter θ

Goal: Find θ so that we can estimate p(x|w) or p(w|x). The histogram method is the non-parametric approach.

SLIDE 6

Gaussian Mixture Models (GMMs)

  • Gaussians cannot handle multi-modal data well
  • Consider that a class can be further divided into additional factors
  • Mixing weights make sure the overall probability sums to 1

SLIDE 7

Model of one Gaussian
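For reference, a single (univariate) Gaussian with parameters µ and σ² has density

p(x | µ, σ²) = (1 / √(2πσ²)) · exp(−(x − µ)² / (2σ²))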

SLIDE 8

Mixture of two Gaussians
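For reference, with mixing weight ϕ ∈ [0, 1], a mixture of two Gaussians has density

p(x) = ϕ · N(x; µ1, σ1²) + (1 − ϕ) · N(x; µ2, σ2²)

and the weights ϕ and 1 − ϕ are what make it integrate to 1.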

SLIDE 9

Mixture models

  • A mixture of models from the same distribution (but with different parameters)
  • Different mixture components can correspond to different sub-classes
  • Cat class
  • Siamese cats
  • Persian cats
  • p(k) is usually categorical (discrete classes)
  • Usually the exact mixture component for a sample point is unknown
  • It is a latent variable
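In general, a mixture model with K components, indexed by the latent variable k, has density

p(x) = Σk p(k) · p(x | k; θk),  where Σk p(k) = 1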
SLIDE 10

Parametric models

A parametric model: a parameter θ defines a distribution P(x|θ), and the data D are drawn from it. Gaussian example: the parameter is θ = [µ, σ²] and the data are drawn from N(µ, σ²).

SLIDE 11

Maximum A Posteriori (MAP) Estimate

MAP

  • Maximize the posterior (model parameters given data): argmax_θ p(θ|x)
  • But we don’t know p(θ|x)
  • Use Bayes’ rule

p(θ|x) = p(x|θ) p(θ) / p(x)

  • Taking the argmax over θ, we can ignore p(x)
  • argmax_θ p(x|θ) p(θ)

MLE

  • Maximize the likelihood (probability of data given model parameters): argmax_θ p(x|θ), where L(θ) = p(x|θ)
  • Usually done on the log likelihood
  • Take the partial derivative with respect to θ and solve for the θ that maximizes the likelihood
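As a standard worked example (not on the slide): for N i.i.d. samples x1, …, xN from N(µ, σ²), the log likelihood is

log L(µ, σ²) = −(N/2) log(2πσ²) − Σn (xn − µ)² / (2σ²)

and setting ∂ log L / ∂µ = Σn (xn − µ)/σ² = 0 gives µ̂ = (1/N) Σn xn, the sample mean.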

SLIDE 12

What if some data is missing?

Mixture of Gaussians: parameter θ = [µ1, σ1², µ2, σ2²]; data drawn from N(µ1, σ1²) and N(µ2, σ2²).

Unknown mixture labels: the same parameter θ = [µ1, σ1², µ2, σ2²] and the same two Gaussians N(µ1, σ1²) and N(µ2, σ2²), but we no longer know which Gaussian each sample came from.

SLIDE 13

Estimating missing data

A parametric model with latent variables: a parameter θ defines a joint distribution P(x, k|θ), and the data D are drawn from it, but the latent variables k are unknown. We need to estimate both the latent variables and the model parameters.

SLIDE 14

Estimating latent variables and model parameters

  • GMM
  • Observed (x1,x2,…,xN)
  • Latent (k1,k2,…,kN) from K possible mixtures
  • Parameter for p(k) is ϕ: p(k = 1) = ϕ1, p(k = 2) = ϕ2, …

With the mixture labels unknown, this cannot be solved by differentiating.
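The difficulty shows in the (standard) GMM log likelihood, where the sum over mixtures sits inside the log:

log p(x | θ) = Σn log [ Σk ϕk · N(xn; µk, σk²) ]

Setting its derivatives to zero gives equations with no closed-form solution.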

SLIDE 15

Assuming k

  • What if we somehow know kn?
  • Maximizing with respect to ϕ, µ, σ gives closed-form estimates
  • (HW3 ☺)

Indicator function: equals one if the condition is met, zero otherwise.
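With known labels kn, the standard maximum-likelihood estimates are

ϕj = (1/N) Σn 1[kn = j]
µj = Σn 1[kn = j] xn / Σn 1[kn = j]
σj² = Σn 1[kn = j] (xn − µj)² / Σn 1[kn = j]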

SLIDE 16

Iterative algorithm

  • Initialize ϕ, µ, σ
  • Repeat till convergence
  • Expectation step (E-step): Estimate the latent labels k
  • Maximization step (M-step): Estimate the parameters ϕ, µ, σ given the latent labels
  • Called the Expectation-Maximization (EM) algorithm
  • How to estimate the latent labels?
SLIDE 17

Iterative algorithm

  • Initialize ϕ, µ, σ
  • Repeat till convergence
  • Expectation step (E-step): Estimate the latent labels k by finding the expected value of k given everything else, E[k | ϕ, µ, σ, x]
  • Maximization step (M-step): Estimate the parameters ϕ, µ, σ given the latent labels
  • Extension of MLE for latent variables
  • MLE: argmax_θ log p(x|θ)
  • EM: argmax_θ Ek[ log p(x, k|θ) ]
SLIDE 18

EM on GMM

  • E-step
  • Set soft labels: wn,j = probability that the nth sample comes from the jth mixture
  • Using Bayes’ rule
  • p(k|x; µ, σ, ϕ) = p(x|k; µ, σ, ϕ) p(k; µ, σ, ϕ) / p(x; µ, σ, ϕ)
  • p(k|x; µ, σ, ϕ) ∝ p(x|k; µ, σ, ϕ) p(k; ϕ)
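Written out, the soft label (responsibility) of the jth mixture for the nth sample is

wn,j = ϕj · N(xn; µj, σj²) / Σl ϕl · N(xn; µl, σl²)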
SLIDE 19

EM on GMM

  • M-step (hard labels)
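With hard labels, each sample commits to its most probable mixture, k̂n = argmax_j wn,j, and the parameters are re-estimated exactly as in the known-label case above, with 1[k̂n = j] in place of 1[kn = j].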
SLIDE 20

EM on GMM

  • M-step (soft labels)
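With soft labels, the indicator is replaced by the responsibility wn,j, giving the standard updates

ϕj = (1/N) Σn wn,j
µj = Σn wn,j xn / Σn wn,j
σj² = Σn wn,j (xn − µj)² / Σn wn,j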
SLIDE 21

K-means vs EM

K-means can be considered as EM with hard labels and standard (fixed, identical) Gaussians as mixtures; EM on GMM is the soft-label version.

SLIDE 22

K-means clustering

  • Task: cluster data into groups
  • K-means algorithm
  • Initialization: Pick K data points as cluster centers
  • Assign: Assign each data point to the closest center
  • Update: Re-compute the cluster centers
  • Repeat: Assign and Update (see the sketch below)
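A minimal numpy sketch of this loop (the function name, data shapes, and stopping rule are my own choices, not from the slides):

```python
import numpy as np

def kmeans(x, K, iters=100, seed=0):
    """Plain K-means on points x of shape (N, d)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick K data points as the cluster centers
    centers = x[rng.choice(len(x), size=K, replace=False)]
    for _ in range(iters):
        # Assign: index of the closest center for every point
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Update: re-compute each center as the mean of its points
        # (an emptied cluster would need re-seeding; omitted here)
        new_centers = np.array([x[labels == j].mean(axis=0) for j in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```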
SLIDE 23

EM algorithm for GMM

  • Task: cluster data into Gaussians
  • EM algorithm
  • Initialization: Randomly initialize the Gaussians’ parameters
  • Expectation: Assign data points to the closest Gaussians
  • Maximization: Re-compute the Gaussians’ parameters according to the assigned data points
  • Repeat: Expectation and Maximization
  • Note: assigning data points is actually a soft assignment (with probabilities); see the sketch below
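For comparison, a minimal 1-D EM-for-GMM sketch under the same caveat (my own naming; no variance floor or restarts, which the later slides motivate):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def em_gmm_1d(x, K, iters=100, seed=0):
    """EM for a 1-D Gaussian mixture on samples x of shape (N,)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=K, replace=False)   # init means at data points
    var = np.full(K, x.var())                   # init variances to the data variance
    phi = np.full(K, 1.0 / K)                   # uniform mixing weights
    for _ in range(iters):
        # E-step: soft labels w[n, j] proportional to phi_j N(x_n; mu_j, var_j)
        w = phi * gaussian_pdf(x[:, None], mu, var)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted re-estimates of phi, mu, var
        nj = w.sum(axis=0)
        phi = nj / len(x)
        mu = (w * x[:, None]).sum(axis=0) / nj
        var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / nj
    return phi, mu, var
```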

SLIDE 24

EM/GMM notes

  • Converges to a local maximum (of the likelihood)
  • Just like k-means, need to try different initialization points
  • Just like k-means, a centroid can get stuck with one sample point and no longer move
  • For EM on GMM this causes the variance to go to 0…
  • Introduce a variance floor (minimum variance a Gaussian can have)
  • Tricks to avoid bad local maxima:
  • Start with 1 Gaussian
  • Split the Gaussians along the direction of maximum variance
  • Repeat until you arrive at K Gaussians
  • Does not guarantee a global maximum but works well in practice
SLIDE 25

Gaussian splitting

[Figure: alternate split and EM steps — split, EM, split, …]
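A minimal 1-D sketch of one split step under these assumptions (the perturbation size eps is an arbitrary choice): each Gaussian becomes two children nudged by ±eps standard deviations, each keeping half the weight and the parent’s variance.

```python
import numpy as np

def split_gaussians(phi, mu, var, eps=0.5):
    """Split every 1-D Gaussian into two children nudged by +/- eps std devs."""
    sd = np.sqrt(var)
    phi2 = np.concatenate([phi, phi]) / 2.0   # each child gets half the weight
    mu2 = np.concatenate([mu - eps * sd, mu + eps * sd])
    var2 = np.concatenate([var, var])         # children inherit the variance
    return phi2, mu2, var2                    # run EM again, then split again
```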

SLIDE 26

Picking the number of Gaussians

  • As we increase K, the likelihood will keep increasing
  • More mixtures → more parameters → overfitting

http://staffblogs.le.ac.uk/bayeswithstata/2014/05/22/mixture-models-how-many-components/

SLIDE 27

Picking the number of Gaussians

  • Need a measure of goodness (like the elbow method in k-means)
  • Bayesian Information Criterion (BIC)
  • Penalize the log likelihood of the data by the number of parameters in the model
  • BIC = −2 log L + t log n
  • t = number of parameters in the model
  • n = number of data points
  • We want to minimize BIC
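As a concrete count (standard, not from the slide): a GMM with K mixtures in d dimensions with full covariances has t = (K − 1) + K·d + K·d(d + 1)/2 parameters (mixing weights, means, covariances). In 1-D that is t = 3K − 1, so each extra Gaussian must improve −2 log L by more than 3 log n to lower the BIC.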
SLIDE 28

BIC is bad; use cross-validation!

  • BIC is bad; use cross-validation!
  • BIC is bad; use cross-validation!
  • BIC is bad; use cross-validation!
  • Test on the actual goal of your model
SLIDE 29

EM on a simple example

  • Grades in a class: P(A) = 0.5, P(B) = 1 − θ, P(C) = θ
  • We want to estimate θ from three known counts
  • NA, NB, NC (the number of students with each grade)
  • Find the maximum likelihood estimate of θ
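Worked out: P(A) does not involve θ, so the likelihood is proportional to (1 − θ)^NB · θ^NC. Setting the derivative of the log likelihood to zero, −NB/(1 − θ) + NC/θ = 0, gives θ̂ = NC / (NB + NC).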
SLIDE 30

EM on a simple example

  • Grades in a class: P(A) = 0.5, P(B) = 1 − θ, P(C) = θ
  • We want to estimate θ from ONE known count
  • NC (we also know N, the total number of students)
  • Find θ using EM
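A sketch of the EM iteration, assuming the grade probabilities are normalized so they sum to 1 (P(A) = 1/2, P(B) = (1 − θ)/2, P(C) = θ/2):

  • E-step: of the N − NC students who got an A or a B, the expected number of Bs is E[NB] = (N − NC) · (1 − θ) / (2 − θ)
  • M-step: re-estimate θ = NC / (E[NB] + NC), as in the MLE above
  • Repeat until θ converges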
SLIDE 31

EM usage examples

SLIDE 32

Image segmentation with GMM EM

  • D: the {r, g, b} value at each pixel
  • K: the segment each pixel comes from
  • Hyperparameters: number of mixtures, initial values

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.1371&rep=rep1&type=pdf

SLIDE 33

Face pose estimation (estimating 3D coordinates from a 2D picture)

https://www.researchgate.net/publication/2489713_Estimating_3D_Facial_Pose_using_the_EM_Algorithm

SLIDE 34

Language modeling

Latent variable: the topic, with P(word|topic). For examples, see probabilistic latent semantic analysis (pLSA).

SLIDE 35

Summary

  • GMM
  • Mixture of Gaussians
  • EM
  • Expectation
  • Maximization