SLIDE 1

Clustering: Models and Algorithms

Shikui Tu 2019-03-07

SLIDE 2

Outline

  • Gaussian Mixture Models (GMM)
  • Expectation-Maximization (EM) for maximum likelihood
  • Model selection, Bayesian learning

SLIDE 3

From distance to probability

A squared distance ||x − µ||² can be turned into a likelihood via exp{−λ||x − µ||²}: the closer, the more likely.

Normalizing so that the sum (or integral) equals one turns this into a probability; the result is a Gaussian distribution, and using the Mahalanobis distance (x − µ)^T Σ^{-1} (x − µ) in place of the Euclidean one gives a Gaussian with a general covariance matrix.

It is more powerful to consider everything in a probability framework!
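As a small illustration (a minimal numpy sketch; the rate λ below plays the role of 1/(2σ²) and is an assumed value, not taken from the slides), squared distances can be exponentiated and normalized into cluster probabilities:

```python
import numpy as np

# Distance-to-probability sketch: exp(-lam * d^2), then normalize.
lam = 0.5                                      # assumed value, plays the role of 1/(2*sigma^2)
centers = np.array([[0.0, 0.0], [3.0, 3.0]])   # two illustrative cluster centers
x = np.array([1.0, 1.0])                       # a query point

d2 = ((x - centers) ** 2).sum(axis=1)          # squared Euclidean distances to each center
scores = np.exp(-lam * d2)                     # "the closer, the more likely"
prob = scores / scores.sum()                   # normalize so the probabilities sum to one
print(prob)                                    # approx. [0.95, 0.05]: much more likely under center 0
```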

SLIDE 4

Review the clustering problem again

We have the following data, and we want to cluster it into two clusters (red and blue).

SLIDE 5

Instead of using only the means {µ1, µ2} as in K-means, each cluster is represented as a Gaussian distribution.

(Figure: the K-means centers µ1, µ2 and the corresponding two Gaussian components k = 1, 2 of a Gaussian Mixture Model (GMM).)

SLIDE 6

Maximum likelihood

Maximizing the log-likelihood function ln p(X|µ, Σ) = Σ_n ln N(x_n|µ, Σ) by setting the derivatives with respect to µ and Σ to zero gives

$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu_{ML})(x_n - \mu_{ML})^{\top},$$

the maximum likelihood estimates of the mean and the covariance matrix.
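A minimal numpy sketch of these estimates (function and variable names are mine, not from the slides):

```python
import numpy as np

def gaussian_mle(X):
    """Maximum likelihood estimates for a single Gaussian.

    X has shape (N, D): N data points in D dimensions.
    """
    mu = X.mean(axis=0)                    # mu_ML = (1/N) * sum_n x_n
    diff = X - mu
    sigma = diff.T @ diff / X.shape[0]     # Sigma_ML = (1/N) * sum_n (x_n - mu)(x_n - mu)^T
    return mu, sigma

# Example: estimates from 500 samples of a known Gaussian are close to the true parameters.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.6], [0.6, 1.0]], size=500)
mu_hat, sigma_hat = gaussian_mle(X)
print(mu_hat)
print(sigma_hat)
```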

SLIDE 7

Matrix derivatives


http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf
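The identities needed for the derivation above are presumably the standard ones collected in the linked Matrix Cookbook; a sketch (stated for symmetric Σ):

```latex
\frac{\partial}{\partial \mu}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu) = -2\,\Sigma^{-1}(x-\mu), \qquad
\frac{\partial}{\partial \Sigma}\,\ln|\Sigma| = \Sigma^{-1}, \qquad
\frac{\partial}{\partial \Sigma}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu) = -\,\Sigma^{-1}(x-\mu)(x-\mu)^{\top}\Sigma^{-1}.
```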

SLIDE 8

Gaussian Mixture Model (GMM)

Assume the points in the same cluster follow a Gaussian distribution. We use z_k = 1 to indicate that a point x belongs to cluster k, with indicator vector z = (z1, …, zK). A mixing weight π_k for each cluster is the prior probability of a point belonging to that cluster. So, we get a distribution for the data point x:

$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x\,|\,\mu_k, \Sigma_k)$$

SLIDE 9

Introduce a latent variable

We use z_k = 1 to indicate that a point x belongs to cluster k; z = (z1, …, zK) is a one-of-K indicator vector.

Assume the points in the same cluster k follow a Gaussian distribution: p(x|z_k = 1) = N(x|µ_k, Σ_k).

A mixing weight π_k for each cluster is the prior probability of a point belonging to that cluster: p(z_k = 1) = π_k.

SLIDE 10

Gaussian Mixture Model (GMM)

So, we get a distribution for the data point x:

$$p(x) = \sum_{z} p(z)\,p(x|z) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x\,|\,\mu_k, \Sigma_k)$$

Generative process

  • Randomly sample a z from a categorical distribution with weights [π1, …, πK];
  • Generate x according to the Gaussian distribution N(x|µ_k, Σ_k) of the selected component k.

Graphical representation of p(x, z) = p(z) p(x|z).
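A minimal numpy sketch of this generative process (the parameter values are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative GMM parameters: K = 2 components in 2-D.
pi = np.array([0.3, 0.7])                          # mixing weights
mu = np.array([[0.0, 0.0], [4.0, 4.0]])            # component means
Sigma = np.array([[[1.0, 0.2], [0.2, 1.0]],
                  [[1.5, -0.3], [-0.3, 0.5]]])     # component covariances

def sample_gmm(n):
    # Step 1: z ~ Categorical(pi) for every point.
    z = rng.choice(len(pi), size=n, p=pi)
    # Step 2: x | z from the chosen Gaussian component.
    x = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return x, z

X, Z = sample_gmm(1000)
print(X.shape, np.bincount(Z) / len(Z))            # component frequencies approximate pi
```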

SLIDE 11

From minimizing sum of square distances to finding maximum likelihood

K-means minimizes the sum of squared distances with hard assignments (e.g., r_n1 = 1, r_n2 = 0 for a point assigned to cluster 1):

$$J = \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\, \|x_n - \mu_k\|^2$$

The GMM instead maximizes the likelihood of the data X = {x1, ..., xN} with respect to the parameters π = {π1, ..., πK}, µ = {µ1, …, µK}, Σ = {Σ1, ..., ΣK}:

$$\ln p(X\,|\,\pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n\,|\,\mu_k, \Sigma_k)$$

Remember: the closer the distance, the more likely the point.

SLIDE 12

Outline

  • Gaussian Mixture Models (GMM)
  • Expectation-Maximization (EM) for maximum likelihood
  • Model selection, Bayesian learning

SLIDE 13

Expectation-Maximization (EM) algorithm for maximum likelihood

Initialization: choose starting values for the means, covariances, and mixing weights (figure: two initial Gaussian components, k = 1 and k = 2).

SLIDE 14

E Step

When the parameters are given, the assignments of the points can be calculated by the posterior probability, i.e., the probability of a data point belonging to a cluster once we have observed the data point:

$$\gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}(x_n\,|\,\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n\,|\,\mu_j, \Sigma_j)}$$

Soft assignment: a point fractionally belongs to the clusters; for example, 0.2 to cluster 1 and 0.8 to cluster 2.
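A minimal numpy/scipy sketch of this E step (function and variable names are mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, Sigma):
    """E step: responsibilities gamma[n, k] = p(z_k = 1 | x_n, current parameters)."""
    N, K = X.shape[0], len(pi)
    gamma = np.zeros((N, K))
    for k in range(K):
        # Unnormalized posterior: pi_k * N(x_n | mu_k, Sigma_k)
        gamma[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
    gamma /= gamma.sum(axis=1, keepdims=True)   # normalize over clusters for each point
    return gamma
```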

SLIDE 15

M Step

When the assignments γ(znk) of the points to the clusters are known, parameters could be calculated for each cluster (Gaussian) separately.

L denotes the number of cycles of the EM algorithm. With the effective counts N_k = Σ_n γ(z_nk), the updates are

$$\pi_k = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad \Sigma_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^{\top}.$$

Mixing weight π_k: the (responsibility-weighted) proportion of points in cluster k among all data points. µ_k, Σ_k: the mean and the covariance matrix, calculated for each cluster with the responsibilities as weights.
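A matching numpy sketch of the M step (same assumed variable names as the E-step sketch above):

```python
import numpy as np

def m_step(X, gamma):
    """M step: re-estimate (pi, mu, Sigma) from the responsibilities gamma[n, k]."""
    N, D = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                     # effective number of points per cluster
    pi = Nk / N                                # mixing weights
    mu = (gamma.T @ X) / Nk[:, None]           # responsibility-weighted means, shape (K, D)
    Sigma = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariances
    return pi, mu, Sigma
```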

SLIDE 16

The algorithm cycles: initialization → E-Step → M-Step → E-Step → … until convergence; L denotes the number of cycles of E-Step and M-Step.
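In practice the whole loop is available off the shelf; a sketch with scikit-learn's GaussianMixture, which fits a GMM by this EM iteration (the toy data are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two Gaussian blobs.
X = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
               rng.normal([4, 4], 0.8, size=(300, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", max_iter=100, random_state=0)
gmm.fit(X)                        # runs EM until convergence (or max_iter)

print(gmm.weights_)               # mixing weights pi_k
print(gmm.means_)                 # means mu_k
print(gmm.predict_proba(X[:3]))   # responsibilities gamma(z_nk): soft assignments
```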

SLIDE 17

Details of the EM Algorithm

SLIDE 18

K-means is a hard-cut EM

Σk = εI

{µk}

GMM considers covariance and mixing weights. One-in-K assignment Soft assignment

18

Fixed equal mixing weights
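A small numpy sketch of this limit (centers and test point are illustrative): with Σ_k = εI and equal mixing weights, the responsibilities approach a one-hot (hard, K-means-style) assignment as ε shrinks.

```python
import numpy as np

centers = np.array([[0.0, 0.0], [3.0, 0.0]])   # illustrative cluster centers
x = np.array([1.0, 0.0])                       # a point closer to center 0

for eps in [10.0, 1.0, 0.1, 0.01]:
    d2 = ((x - centers) ** 2).sum(axis=1)
    # With Sigma_k = eps * I and equal mixing weights: gamma_k proportional to exp(-d2 / (2*eps)).
    logits = -d2 / (2 * eps)
    gamma = np.exp(logits - logits.max())      # subtract the max for numerical stability
    gamma /= gamma.sum()
    print(eps, gamma)                          # tends to [1, 0]: a hard assignment
```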

SLIDE 19


The General EM Algorithm. Given a joint distribution p(X, Z|θ) over observed variables X and latent variables Z, governed by parameters θ, the goal is to maximize the likelihood function p(X|θ) with respect to θ.

  • 1. Choose an initial setting for the parameters θ_old.
  • 2. E step: Evaluate p(Z|X, θ_old).
  • 3. M step: Evaluate θ_new given by

$$\theta^{\mathrm{new}} = \arg\max_{\theta}\, \mathcal{Q}(\theta, \theta^{\mathrm{old}}), \qquad (9.32)$$

where

$$\mathcal{Q}(\theta, \theta^{\mathrm{old}}) = \sum_{Z} p(Z|X, \theta^{\mathrm{old}}) \ln p(X, Z|\theta). \qquad (9.33)$$

  • 4. Check for convergence of either the log likelihood or the parameter values. If the convergence criterion is not satisfied, then let θ_old ← θ_new (9.34) and return to step 2.

SLIDE 20

Summary for the EM algorithm for GMM

  • Does it find the global optimum?
    – No. Like K-means, EM only finds the nearest local optimum, and that optimum depends on the initialization.
  • GMM is more general than K-means by considering mixing weights, covariance matrices, and soft assignments.
  • Like K-means, it does not tell you the best K.

SLIDE 21

EM never decreases the likelihood

For any distribution q(Z), the log likelihood decomposes as

$$\ln p(X|\theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q\,\|\,p(Z|X, \theta)),$$

where the lower bound L(q, θ) (also written ℱ(q, θ)) satisfies L(q, θ) ≤ ln p(X|θ), because the KL term is non-negative.

  • At θ^(t): ln p(X|θ^(t)) = ℱ(q^(t), θ^(t)) + KL[q^(t) ‖ p(Z|X, θ^(t))].
  • E-Step: set q^(t+1) = p(Z|X, θ^(t)), so that KL[q^(t+1) ‖ p(Z|X, θ^(t))] = 0 and the lower bound touches the log likelihood: ln p(X|θ^(t)) = ℱ(q^(t+1), θ^(t)).
  • M-Step: maximize the lower bound over θ to obtain θ^(t+1), so ℱ(q^(t+1), θ^(t+1)) ≥ ℱ(q^(t+1), θ^(t)).
  • New log likelihood: ln p(X|θ^(t+1)) = ℱ(q^(t+1), θ^(t+1)) + KL[q^(t+1) ‖ p(Z|X, θ^(t+1))] ≥ ln p(X|θ^(t)).

SLIDE 22

SLIDE 23


The KL divergence KL[q(x) ‖ p(x)] is non-negative, and zero iff ∀x: p(x) = q(x).

First let's consider discrete distributions; the Kullback-Leibler divergence is:

$$\mathrm{KL}[q \| p] = \sum_i q_i \log \frac{q_i}{p_i}.$$

To find the distribution q which minimizes KL[q‖p] we add a Lagrange multiplier to enforce the normalization constraint:

$$E \stackrel{\mathrm{def}}{=} \mathrm{KL}[q \| p] + \lambda\Big(1 - \sum_i q_i\Big) = \sum_i q_i \log \frac{q_i}{p_i} + \lambda\Big(1 - \sum_i q_i\Big)$$

We then take partial derivatives and set them to zero:

$$\frac{\partial E}{\partial q_i} = \log q_i - \log p_i + 1 - \lambda = 0 \;\Rightarrow\; q_i = p_i \exp(\lambda - 1), \qquad \frac{\partial E}{\partial \lambda} = 1 - \sum_i q_i = 0 \;\Rightarrow\; \sum_i q_i = 1,$$

which together give q_i = p_i.

Check that the curvature (Hessian) is positive (definite), corresponding to a minimum:

$$\frac{\partial^2 E}{\partial q_i \partial q_i} = \frac{1}{q_i} > 0, \qquad \frac{\partial^2 E}{\partial q_i \partial q_j} = 0 \;(i \neq j),$$

showing that q_i = p_i is a genuine minimum. At the minimum it is easily verified that KL[p‖p] = 0.

SLIDE 24

Jensen’s Inequality due to convexity
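The standard statement (my summary, not copied from the slide):

```latex
% Jensen's inequality: for a convex function f and weights q_i >= 0 with \sum_i q_i = 1,
f\Big(\sum_i q_i\, x_i\Big) \;\le\; \sum_i q_i\, f(x_i).
% For the concave function \ln the inequality reverses, which is the bound used by EM:
\ln \sum_i q_i\, x_i \;\ge\; \sum_i q_i \ln x_i .
```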

SLIDE 25

EM never decreases the likelihood

For any q(Z), ln p(X|θ) = ℱ(q, θ) + KL(q ‖ p(Z|X, θ)) (the same decomposition as on Slide 21).

  • E-Step at θ^(t): q^(t+1) = p(Z|X, θ^(t)), so KL[q^(t+1) ‖ p(Z|X, θ^(t))] = 0 and ln p(X|θ^(t)) = ℱ(q^(t+1), θ^(t)).
  • M-Step: θ^(t+1) maximizes ℱ(q^(t+1), θ), so ℱ(q^(t+1), θ^(t+1)) ≥ ℱ(q^(t+1), θ^(t)).
  • Hence ln p(X|θ^(t+1)) ≥ ℱ(q^(t+1), θ^(t+1)) ≥ ℱ(q^(t+1), θ^(t)) = ln p(X|θ^(t)): EM never decreases the likelihood.

SLIDE 26

Outline

  • Gaussian Mixture Models (GMM)
  • Expectation-Maximization (EM) for maximum likelihood
  • Model selection, Bayesian learning

SLIDE 27

How to determine the cluster number K?

K-means: the objective J keeps decreasing as K grows past the true number of clusters K0, so J does not tell which K is better.

GMM: the log-likelihood behaves the same way; the negative log-likelihood also decreases as K increases, so it cannot select K by itself.

SLIDE 28

Model selection in general

Probabilistic model: p(X_N | Θ_K). The candidate models are nested and become more and more complex as K increases:

$$\Theta_1 \subseteq \Theta_2 \subseteq \cdots \subseteq \Theta_K \subseteq \cdots$$

The negative log-likelihood (or fitting error) makes sure the model fits the data well, but it keeps decreasing with K, so a criterion must trade off fitting the data well against keeping the model simple:

Akaike's Information Criterion (AIC):  ln p(X_N | Θ̂_K) − d_K
Bayesian Information Criterion (BIC):  ln p(X_N | Θ̂_K) − (1/2) d_K ln N

d_K: number of free parameters; N: sample size.
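A sketch of selecting K with these criteria via scikit-learn (note that sklearn's aic()/bic() use the −2·log-likelihood-plus-penalty convention, so there the smallest value wins, whereas the slide's form is maximized; the toy data are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data drawn from 3 clusters.
X = np.vstack([rng.normal(c, 0.7, size=(150, 2)) for c in ([0, 0], [4, 0], [2, 3])])

for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
    # sklearn convention: AIC = -2 ln L + 2 d_K,  BIC = -2 ln L + d_K ln N  (lower is better)
    print(f"K={k}  AIC={gmm.aic(X):9.1f}  BIC={gmm.bic(X):9.1f}")
# BIC (and usually AIC) is minimized near the true K = 3 for data like these.
```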

SLIDE 29

Bayesian learning

  • Maximum A Posteriori (MAP)

MAP maximizes the posterior, max_Θ p(Θ|X), which is equivalent to maximizing

$$\log p(X, \Theta) = \log p(X|\Theta) + \log p(\Theta).$$

Consider a simple example:

$$p(x|\Theta) = \mathcal{N}(x\,|\,\mu, \Sigma), \qquad p(\mu) = \mathcal{N}(\mu\,|\,\mu_0, \sigma_0^2 I).$$
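For the one-dimensional version of this example with known variance σ² (a standard result sketched here, not taken from the slides), the MAP estimate interpolates between the prior mean and the sample mean:

```latex
% x_n ~ N(mu, sigma^2) i.i.d., prior mu ~ N(mu_0, sigma_0^2); the posterior over mu is Gaussian with mode
\mu_{\mathrm{MAP}}
  = \frac{\sigma^{2}}{N\sigma_{0}^{2} + \sigma^{2}}\,\mu_{0}
  + \frac{N\sigma_{0}^{2}}{N\sigma_{0}^{2} + \sigma^{2}}\,\bar{x},
\qquad \bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_{n}.
```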

SLIDE 30

Model selection

Probabilistic model: p(X_N | Θ_K). Candidate models:

$$\Theta_1 \subseteq \Theta_2 \subseteq \cdots \subseteq \Theta_K \subseteq \cdots$$

SLIDE 31


Using Occam’s Razor to Learn Model Structure

Compare model classes m using their posterior probability given the data:

$$P(m|y) = \frac{P(y|m)\,P(m)}{P(y)}, \qquad P(y|m) = \int_{\Theta_m} P(y|\theta_m, m)\,P(\theta_m|m)\, d\theta_m.$$

Interpretation of P(y|m): the probability that randomly selected parameter values from the model class would generate data set y. Model classes that are too simple are unlikely to generate the data set. Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.

(Figure: P(Y|M_i) over all possible data sets Y for model classes that are too simple, too complex, and "just right".)

SLIDE 32

Bayesian model selection

  • A model class m is a set of models parameterised by θ_m, e.g. the set of all possible mixtures of m Gaussians.
  • The marginal likelihood of model class m, also known as the Bayesian evidence for model m:

$$P(y|m) = \int_{\Theta_m} P(y|\theta_m, m)\,P(\theta_m|m)\, d\theta_m$$

  • The ratio of two marginal likelihoods, P(y|m) / P(y|m₀), is known as the Bayes factor.
  • The Occam's Razor principle is, roughly speaking, that one should prefer simpler explanations over more complex ones.
  • Bayesian inference formalises and automatically implements the Occam's Razor principle.
SLIDE 33

VBEM for GMM

  • Model descriptions:

$$p(Z|\pi) = \prod_{n=1}^{N}\prod_{k=1}^{K} \pi_k^{z_{nk}}, \qquad p(X|Z, \mu, \Lambda) = \prod_{n=1}^{N}\prod_{k=1}^{K} \mathcal{N}\!\left(x_n \mid \mu_k, \Lambda_k^{-1}\right)^{z_{nk}}$$

  • Prior distributions over parameters:

$$p(\pi) = \mathrm{Dir}(\pi|\alpha_0) = C(\alpha_0)\prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}, \qquad p(\mu, \Lambda) = p(\mu|\Lambda)\,p(\Lambda) = \prod_{k=1}^{K} \mathcal{N}\!\left(\mu_k \mid m_0, (\beta_0\Lambda_k)^{-1}\right)\mathcal{W}(\Lambda_k|W_0, \nu_0)$$

(Graphical model: a plate over n = 1, …, N containing the observed x_n and latent z_n, with global variables π, µ, Λ.)

The joint distribution factorizes as

$$p(X, Z, \pi, \mu, \Lambda) = p(X|Z, \mu, \Lambda)\, p(Z|\pi)\, p(\pi)\, p(\mu|\Lambda)\, p(\Lambda).$$

SLIDE 34

How VBEM for GMM works


http://www.cs.ubc.ca/~murphyk/Software/VBEMGMM/index.html http://scikit-learn.org/stable/modules/mixture.html

(Figure: the VB-GMM fit shown after 15, 60, and 120 iterations.)
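A sketch of VB for GMMs in practice, using the scikit-learn module linked above (BayesianGaussianMixture; the data and prior settings here are illustrative):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Toy data from 3 clusters, fitted with an intentionally generous upper bound on K.
X = np.vstack([rng.normal(c, 0.7, size=(200, 2)) for c in ([0, 0], [4, 0], [2, 3])])

vbgmm = BayesianGaussianMixture(
    n_components=10,                                    # upper bound on the number of components
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1e-2,                    # small value favors fewer active components
    max_iter=500,
    random_state=0,
).fit(X)

# Components the data do not support are driven to near-zero weight.
print(np.round(vbgmm.weights_, 3))
print("effective K:", int((vbgmm.weights_ > 0.01).sum()))
```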

SLIDE 35

Determine K by the variational lower bound (free energy)


(Figure: p(D|K), the variational lower bound, plotted for K = 1, …, 6 mixture components; suboptimal local solutions can occur.)
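One way to mimic this with scikit-learn (a rough sketch: lower_bound_ is the variational bound of the fitted model, and comparing it across K is a heuristic rather than a full evidence computation):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.7, size=(200, 2)) for c in ([0, 0], [4, 0], [2, 3])])

for k in range(1, 7):
    vb = BayesianGaussianMixture(
        n_components=k,
        weight_concentration_prior_type="dirichlet_distribution",
        n_init=5,                 # several restarts to avoid poor local optima
        max_iter=500,
        random_state=0,
    ).fit(X)
    print(f"K={k}  lower bound={vb.lower_bound_: .3f}")
# The bound typically peaks near the number of clusters actually supported by the data.
```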

SLIDE 36

Thank you!

SLIDE 37


http://www.elecfans.com/d/604076.html