SLIDE 1

Clustering: Models and Algorithms

Shikui Tu 2019-03-07

SLIDE 2

Outline

  • Gaussian Mixture Models (GMM)
  • Expectation-Maximization (EM) for maximum likelihood
  • Model selection, Bayesian learning

SLIDE 3

From distance to probability

A squared distance ||x − µ||² can be turned into a likelihood via exp{−λ||x − µ||²}: the closer, the more likely.

Normalizing so that the sum (or integral) equals one turns this into a probability; the result is a Gaussian distribution, and using the Mahalanobis distance (x − µ)^T Σ^{-1} (x − µ) in place of the Euclidean one gives a Gaussian with a general covariance matrix.

It is more powerful to consider everything in a probability framework!
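As a small illustration (a minimal numpy sketch; the rate λ below plays the role of 1/(2σ²) and is an assumed value, not taken from the slides), squared distances can be exponentiated and normalized into cluster probabilities:

```python
import numpy as np

# Distance-to-probability sketch: exp(-lam * d^2), then normalize.
lam = 0.5                                      # assumed value, plays the role of 1/(2*sigma^2)
centers = np.array([[0.0, 0.0], [3.0, 3.0]])   # two illustrative cluster centers
x = np.array([1.0, 1.0])                       # a query point

d2 = ((x - centers) ** 2).sum(axis=1)          # squared Euclidean distances to each center
scores = np.exp(-lam * d2)                     # "the closer, the more likely"
prob = scores / scores.sum()                   # normalize so the probabilities sum to one
print(prob)                                    # approx. [0.95, 0.05]: much more likely under center 0
```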

SLIDE 4

Review the clustering problem again

We have the following data, and we want to cluster it into two clusters (red and blue).

SLIDE 5

Instead of using only the means {µ1, µ2} as in K-means, each cluster is represented as a Gaussian distribution.

(Figure: the K-means centers µ1, µ2 and the corresponding two Gaussian components k = 1, 2 of a Gaussian Mixture Model (GMM).)

SLIDE 6

Maximum likelihood

Maximizing the log-likelihood function ln p(X|µ, Σ) = Σ_n ln N(x_n|µ, Σ) by setting the derivatives with respect to µ and Σ to zero gives

$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu_{ML})(x_n - \mu_{ML})^{\top},$$

the maximum likelihood estimates of the mean and the covariance matrix.
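A minimal numpy sketch of these estimates (function and variable names are mine, not from the slides):

```python
import numpy as np

def gaussian_mle(X):
    """Maximum likelihood estimates for a single Gaussian.

    X has shape (N, D): N data points in D dimensions.
    """
    mu = X.mean(axis=0)                    # mu_ML = (1/N) * sum_n x_n
    diff = X - mu
    sigma = diff.T @ diff / X.shape[0]     # Sigma_ML = (1/N) * sum_n (x_n - mu)(x_n - mu)^T
    return mu, sigma

# Example: estimates from 500 samples of a known Gaussian are close to the true parameters.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.6], [0.6, 1.0]], size=500)
mu_hat, sigma_hat = gaussian_mle(X)
print(mu_hat)
print(sigma_hat)
```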

SLIDE 7

Matrix derivatives


http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf
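The identities needed for the derivation above are presumably the standard ones collected in the linked Matrix Cookbook; a sketch (stated for symmetric Σ):

```latex
\frac{\partial}{\partial \mu}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu) = -2\,\Sigma^{-1}(x-\mu), \qquad
\frac{\partial}{\partial \Sigma}\,\ln|\Sigma| = \Sigma^{-1}, \qquad
\frac{\partial}{\partial \Sigma}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu) = -\,\Sigma^{-1}(x-\mu)(x-\mu)^{\top}\Sigma^{-1}.
```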

SLIDE 8

Gaussian Mixture Model (GMM)

Assume the points in the same cluster follow a Gaussian distribution. We use z_k = 1 to indicate that a point x belongs to cluster k, with indicator vector z = (z1, …, zK). A mixing weight π_k for each cluster is the prior probability of a point belonging to that cluster. So, we get a distribution for the data point x:

$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x\,|\,\mu_k, \Sigma_k)$$

SLIDE 9

Introduce a latent variable

We use z_k = 1 to indicate that a point x belongs to cluster k; z = (z1, …, zK) is a one-of-K indicator vector.

Assume the points in the same cluster k follow a Gaussian distribution: p(x|z_k = 1) = N(x|µ_k, Σ_k).

A mixing weight π_k for each cluster is the prior probability of a point belonging to that cluster: p(z_k = 1) = π_k.

SLIDE 10

Gaussian Mixture Model (GMM)

So, we get a distribution for the data point x:

$$p(x) = \sum_{z} p(z)\,p(x|z) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x\,|\,\mu_k, \Sigma_k)$$

Generative process

  • Randomly sample a z from a categorical distribution with weights [π1, …, πK];
  • Generate x according to the Gaussian distribution N(x|µ_k, Σ_k) of the selected component k.

Graphical representation of p(x, z) = p(z) p(x|z).
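A minimal numpy sketch of this generative process (the parameter values are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative GMM parameters: K = 2 components in 2-D.
pi = np.array([0.3, 0.7])                          # mixing weights
mu = np.array([[0.0, 0.0], [4.0, 4.0]])            # component means
Sigma = np.array([[[1.0, 0.2], [0.2, 1.0]],
                  [[1.5, -0.3], [-0.3, 0.5]]])     # component covariances

def sample_gmm(n):
    # Step 1: z ~ Categorical(pi) for every point.
    z = rng.choice(len(pi), size=n, p=pi)
    # Step 2: x | z from the chosen Gaussian component.
    x = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return x, z

X, Z = sample_gmm(1000)
print(X.shape, np.bincount(Z) / len(Z))            # component frequencies approximate pi
```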

SLIDE 11

From minimizing sum of square distances to finding maximum likelihood

K-means minimizes the sum of squared distances with hard assignments (e.g., r_n1 = 1, r_n2 = 0 for a point assigned to cluster 1):

$$J = \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\, \|x_n - \mu_k\|^2$$

The GMM instead maximizes the likelihood of the data X = {x1, ..., xN} with respect to the parameters π = {π1, ..., πK}, µ = {µ1, …, µK}, Σ = {Σ1, ..., ΣK}:

$$\ln p(X\,|\,\pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n\,|\,\mu_k, \Sigma_k)$$

Remember: the closer the distance, the more likely the point.

SLIDE 12

Outline

  • Gaussian Mixture Models (GMM)
  • Expectation-Maximization (EM) for maximum likelihood
  • Model selection, Bayesian learning

SLIDE 13

Expectation-Maximization (EM) algorithm for maximum likelihood

Initialization: choose starting values for the means, covariances, and mixing weights (figure: two initial Gaussian components, k = 1 and k = 2).

SLIDE 14

E Step

When the parameters are given, the assignments of the points can be calculated by the posterior probability, i.e., the probability of a data point belonging to a cluster once we have observed the data point:

$$\gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}(x_n\,|\,\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n\,|\,\mu_j, \Sigma_j)}$$

Soft assignment: a point fractionally belongs to the clusters; for example, 0.2 to cluster 1 and 0.8 to cluster 2.
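A minimal numpy/scipy sketch of this E step (function and variable names are mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, Sigma):
    """E step: responsibilities gamma[n, k] = p(z_k = 1 | x_n, current parameters)."""
    N, K = X.shape[0], len(pi)
    gamma = np.zeros((N, K))
    for k in range(K):
        # Unnormalized posterior: pi_k * N(x_n | mu_k, Sigma_k)
        gamma[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
    gamma /= gamma.sum(axis=1, keepdims=True)   # normalize over clusters for each point
    return gamma
```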

SLIDE 15

M Step

When the assignments γ(znk) of the points to the clusters are known, parameters could be calculated for each cluster (Gaussian) separately.

L denotes the number of cycles of the EM algorithm. With the effective counts N_k = Σ_n γ(z_nk), the updates are

$$\pi_k = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad \Sigma_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^{\top}.$$

Mixing weight π_k: the (responsibility-weighted) proportion of points in cluster k among all data points. µ_k, Σ_k: the mean and the covariance matrix, calculated for each cluster with the responsibilities as weights.
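A matching numpy sketch of the M step (same assumed variable names as the E-step sketch above):

```python
import numpy as np

def m_step(X, gamma):
    """M step: re-estimate (pi, mu, Sigma) from the responsibilities gamma[n, k]."""
    N, D = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                     # effective number of points per cluster
    pi = Nk / N                                # mixing weights
    mu = (gamma.T @ X) / Nk[:, None]           # responsibility-weighted means, shape (K, D)
    Sigma = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariances
    return pi, mu, Sigma
```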

SLIDE 16

The algorithm cycles: initialization → E-Step → M-Step → E-Step → … until convergence; L denotes the number of cycles of E-Step and M-Step.
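In practice the whole loop is available off the shelf; a sketch with scikit-learn's GaussianMixture, which fits a GMM by this EM iteration (the toy data are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two Gaussian blobs.
X = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
               rng.normal([4, 4], 0.8, size=(300, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", max_iter=100, random_state=0)
gmm.fit(X)                        # runs EM until convergence (or max_iter)

print(gmm.weights_)               # mixing weights pi_k
print(gmm.means_)                 # means mu_k
print(gmm.predict_proba(X[:3]))   # responsibilities gamma(z_nk): soft assignments
```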

SLIDE 17

Details of the EM Algorithm

SLIDE 18

K-means is a hard-cut EM

Σk = εI

{µk}

GMM considers covariance and mixing weights. One-in-K assignment Soft assignment

18

Fixed equal mixing weights
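A small numpy sketch of this limit (centers and test point are illustrative): with Σ_k = εI and equal mixing weights, the responsibilities approach a one-hot (hard, K-means-style) assignment as ε shrinks.

```python
import numpy as np

centers = np.array([[0.0, 0.0], [3.0, 0.0]])   # illustrative cluster centers
x = np.array([1.0, 0.0])                       # a point closer to center 0

for eps in [10.0, 1.0, 0.1, 0.01]:
    d2 = ((x - centers) ** 2).sum(axis=1)
    # With Sigma_k = eps * I and equal mixing weights: gamma_k proportional to exp(-d2 / (2*eps)).
    logits = -d2 / (2 * eps)
    gamma = np.exp(logits - logits.max())      # subtract the max for numerical stability
    gamma /= gamma.sum()
    print(eps, gamma)                          # tends to [1, 0]: a hard assignment
```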

SLIDE 19


The General EM Algorithm. Given a joint distribution p(X, Z|θ) over observed variables X and latent variables Z, governed by parameters θ, the goal is to maximize the likelihood function p(X|θ) with respect to θ.

  • 1. Choose an initial setting for the parameters θ_old.
  • 2. E step: Evaluate p(Z|X, θ_old).
  • 3. M step: Evaluate θ_new given by

$$\theta^{\mathrm{new}} = \arg\max_{\theta}\, \mathcal{Q}(\theta, \theta^{\mathrm{old}}), \qquad (9.32)$$

where

$$\mathcal{Q}(\theta, \theta^{\mathrm{old}}) = \sum_{Z} p(Z|X, \theta^{\mathrm{old}}) \ln p(X, Z|\theta). \qquad (9.33)$$

  • 4. Check for convergence of either the log likelihood or the parameter values. If the convergence criterion is not satisfied, then let θ_old ← θ_new (9.34) and return to step 2.

SLIDE 20

Summary for the EM algorithm for GMM

  • Does it find the global optimum?
    – No. Like K-means, EM only finds the nearest local optimum, and that optimum depends on the initialization.
  • GMM is more general than K-means by considering mixing weights, covariance matrices, and soft assignments.
  • Like K-means, it does not tell you the best K.

SLIDE 21

EM never decreases the likelihood

For any distribution q(Z), the log likelihood decomposes as

$$\ln p(X|\theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q\,\|\,p(Z|X, \theta)),$$

where the lower bound L(q, θ) (also written ℱ(q, θ)) satisfies L(q, θ) ≤ ln p(X|θ), because the KL term is non-negative.

  • At θ^(t): ln p(X|θ^(t)) = ℱ(q^(t), θ^(t)) + KL[q^(t) ‖ p(Z|X, θ^(t))].
  • E-Step: set q^(t+1) = p(Z|X, θ^(t)), so that KL[q^(t+1) ‖ p(Z|X, θ^(t))] = 0 and the lower bound touches the log likelihood: ln p(X|θ^(t)) = ℱ(q^(t+1), θ^(t)).
  • M-Step: maximize the lower bound over θ to obtain θ^(t+1), so ℱ(q^(t+1), θ^(t+1)) ≥ ℱ(q^(t+1), θ^(t)).
  • New log likelihood: ln p(X|θ^(t+1)) = ℱ(q^(t+1), θ^(t+1)) + KL[q^(t+1) ‖ p(Z|X, θ^(t+1))] ≥ ln p(X|θ^(t)).

SLIDE 22

SLIDE 23


The KL divergence KL[q(x) ‖ p(x)] is non-negative, and zero iff ∀x: p(x) = q(x).

First let's consider discrete distributions; the Kullback-Leibler divergence is:

$$\mathrm{KL}[q \| p] = \sum_i q_i \log \frac{q_i}{p_i}.$$

To find the distribution q which minimizes KL[q‖p] we add a Lagrange multiplier to enforce the normalization constraint:

$$E \stackrel{\mathrm{def}}{=} \mathrm{KL}[q \| p] + \lambda\Big(1 - \sum_i q_i\Big) = \sum_i q_i \log \frac{q_i}{p_i} + \lambda\Big(1 - \sum_i q_i\Big)$$

We then take partial derivatives and set them to zero:

$$\frac{\partial E}{\partial q_i} = \log q_i - \log p_i + 1 - \lambda = 0 \;\Rightarrow\; q_i = p_i \exp(\lambda - 1), \qquad \frac{\partial E}{\partial \lambda} = 1 - \sum_i q_i = 0 \;\Rightarrow\; \sum_i q_i = 1,$$

which together give q_i = p_i.

Check that the curvature (Hessian) is positive (definite), corresponding to a minimum:

$$\frac{\partial^2 E}{\partial q_i \partial q_i} = \frac{1}{q_i} > 0, \qquad \frac{\partial^2 E}{\partial q_i \partial q_j} = 0 \;(i \neq j),$$

showing that q_i = p_i is a genuine minimum. At the minimum it is easily verified that KL[p‖p] = 0.

SLIDE 24

Jensen’s Inequality due to convexity
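The standard statement (my summary, not copied from the slide):

```latex
% Jensen's inequality: for a convex function f and weights q_i >= 0 with \sum_i q_i = 1,
f\Big(\sum_i q_i\, x_i\Big) \;\le\; \sum_i q_i\, f(x_i).
% For the concave function \ln the inequality reverses, which is the bound used by EM:
\ln \sum_i q_i\, x_i \;\ge\; \sum_i q_i \ln x_i .
```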

SLIDE 25

EM never decreases the likelihood

For any q(Z), ln p(X|θ) = ℱ(q, θ) + KL(q ‖ p(Z|X, θ)) (the same decomposition as on Slide 21).

  • E-Step at θ^(t): q^(t+1) = p(Z|X, θ^(t)), so KL[q^(t+1) ‖ p(Z|X, θ^(t))] = 0 and ln p(X|θ^(t)) = ℱ(q^(t+1), θ^(t)).
  • M-Step: θ^(t+1) maximizes ℱ(q^(t+1), θ), so ℱ(q^(t+1), θ^(t+1)) ≥ ℱ(q^(t+1), θ^(t)).
  • Hence ln p(X|θ^(t+1)) ≥ ℱ(q^(t+1), θ^(t+1)) ≥ ℱ(q^(t+1), θ^(t)) = ln p(X|θ^(t)): EM never decreases the likelihood.

SLIDE 26

Outline

  • Gaussian Mixture Models (GMM)
  • Expectation-Maximization (EM) for maximum likelihood
  • Model selection, Bayesian learning

SLIDE 27

How to determine the cluster number K?

K-means: the objective J keeps decreasing as K grows past the true number of clusters K0, so J does not tell which K is better.

GMM: the log-likelihood behaves the same way; the negative log-likelihood also decreases as K increases, so it cannot select K by itself.

SLIDE 28

Model selection in general

Probabilistic model: p(X_N | Θ_K). The candidate models are nested and become more and more complex as K increases:

$$\Theta_1 \subseteq \Theta_2 \subseteq \cdots \subseteq \Theta_K \subseteq \cdots$$

The negative log-likelihood (or fitting error) makes sure the model fits the data well, but it keeps decreasing with K, so a criterion must trade off fitting the data well against keeping the model simple:

Akaike's Information Criterion (AIC):  ln p(X_N | Θ̂_K) − d_K
Bayesian Information Criterion (BIC):  ln p(X_N | Θ̂_K) − (1/2) d_K ln N

d_K: number of free parameters; N: sample size.
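A sketch of selecting K with these criteria via scikit-learn (note that sklearn's aic()/bic() use the −2·log-likelihood-plus-penalty convention, so there the smallest value wins, whereas the slide's form is maximized; the toy data are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data drawn from 3 clusters.
X = np.vstack([rng.normal(c, 0.7, size=(150, 2)) for c in ([0, 0], [4, 0], [2, 3])])

for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
    # sklearn convention: AIC = -2 ln L + 2 d_K,  BIC = -2 ln L + d_K ln N  (lower is better)
    print(f"K={k}  AIC={gmm.aic(X):9.1f}  BIC={gmm.bic(X):9.1f}")
# BIC (and usually AIC) is minimized near the true K = 3 for data like these.
```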

SLIDE 29

Bayesian learning

  • Maximum A Posteriori (MAP)

MAP maximizes the posterior, max_Θ p(Θ|X), which is equivalent to maximizing

$$\log p(X, \Theta) = \log p(X|\Theta) + \log p(\Theta).$$

Consider a simple example:

$$p(x|\Theta) = \mathcal{N}(x\,|\,\mu, \Sigma), \qquad p(\mu) = \mathcal{N}(\mu\,|\,\mu_0, \sigma_0^2 I).$$
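For the one-dimensional version of this example with known variance σ² (a standard result sketched here, not taken from the slides), the MAP estimate interpolates between the prior mean and the sample mean:

```latex
% x_n ~ N(mu, sigma^2) i.i.d., prior mu ~ N(mu_0, sigma_0^2); the posterior over mu is Gaussian with mode
\mu_{\mathrm{MAP}}
  = \frac{\sigma^{2}}{N\sigma_{0}^{2} + \sigma^{2}}\,\mu_{0}
  + \frac{N\sigma_{0}^{2}}{N\sigma_{0}^{2} + \sigma^{2}}\,\bar{x},
\qquad \bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_{n}.
```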

SLIDE 30

Model selection

Probabilistic model: p(X_N | Θ_K). Candidate models:

$$\Theta_1 \subseteq \Theta_2 \subseteq \cdots \subseteq \Theta_K \subseteq \cdots$$

SLIDE 31


Using Occam’s Razor to Learn Model Structure

Compare model classes m using their posterior probability given the data:

$$P(m|y) = \frac{P(y|m)\,P(m)}{P(y)}, \qquad P(y|m) = \int_{\Theta_m} P(y|\theta_m, m)\,P(\theta_m|m)\, d\theta_m.$$

Interpretation of P(y|m): the probability that randomly selected parameter values from the model class would generate data set y. Model classes that are too simple are unlikely to generate the data set. Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.

(Figure: P(Y|M_i) over all possible data sets Y for model classes that are too simple, too complex, and "just right".)

SLIDE 32

Bayesian model selection

  • A model class m is a set of models parameterised by θ_m, e.g. the set of all possible mixtures of m Gaussians.
  • The marginal likelihood of model class m, also known as the Bayesian evidence for model m:

$$P(y|m) = \int_{\Theta_m} P(y|\theta_m, m)\,P(\theta_m|m)\, d\theta_m$$

  • The ratio of two marginal likelihoods, P(y|m) / P(y|m₀), is known as the Bayes factor.
  • The Occam's Razor principle is, roughly speaking, that one should prefer simpler explanations over more complex ones.
  • Bayesian inference formalises and automatically implements the Occam's Razor principle.
SLIDE 33

VBEM for GMM

  • Model descriptions:

$$p(Z|\pi) = \prod_{n=1}^{N}\prod_{k=1}^{K} \pi_k^{z_{nk}}, \qquad p(X|Z, \mu, \Lambda) = \prod_{n=1}^{N}\prod_{k=1}^{K} \mathcal{N}\!\left(x_n \mid \mu_k, \Lambda_k^{-1}\right)^{z_{nk}}$$

  • Prior distributions over parameters:

$$p(\pi) = \mathrm{Dir}(\pi|\alpha_0) = C(\alpha_0)\prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}, \qquad p(\mu, \Lambda) = p(\mu|\Lambda)\,p(\Lambda) = \prod_{k=1}^{K} \mathcal{N}\!\left(\mu_k \mid m_0, (\beta_0\Lambda_k)^{-1}\right)\mathcal{W}(\Lambda_k|W_0, \nu_0)$$

(Graphical model: a plate over n = 1, …, N containing the observed x_n and latent z_n, with global variables π, µ, Λ.)

The joint distribution factorizes as

$$p(X, Z, \pi, \mu, \Lambda) = p(X|Z, \mu, \Lambda)\, p(Z|\pi)\, p(\pi)\, p(\mu|\Lambda)\, p(\Lambda).$$

SLIDE 34

How VBEM for GMM works


http://www.cs.ubc.ca/~murphyk/Software/VBEMGMM/index.html http://scikit-learn.org/stable/modules/mixture.html

(Figure: the VB-GMM fit shown after 15, 60, and 120 iterations.)
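A sketch of VB for GMMs in practice, using the scikit-learn module linked above (BayesianGaussianMixture; the data and prior settings here are illustrative):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Toy data from 3 clusters, fitted with an intentionally generous upper bound on K.
X = np.vstack([rng.normal(c, 0.7, size=(200, 2)) for c in ([0, 0], [4, 0], [2, 3])])

vbgmm = BayesianGaussianMixture(
    n_components=10,                                    # upper bound on the number of components
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1e-2,                    # small value favors fewer active components
    max_iter=500,
    random_state=0,
).fit(X)

# Components the data do not support are driven to near-zero weight.
print(np.round(vbgmm.weights_, 3))
print("effective K:", int((vbgmm.weights_ > 0.01).sum()))
```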

SLIDE 35

Determine K by the variational lower bound (free energy)


(Figure: p(D|K), the variational lower bound, plotted for K = 1, …, 6 mixture components; suboptimal local solutions can occur.)
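One way to mimic this with scikit-learn (a rough sketch: lower_bound_ is the variational bound of the fitted model, and comparing it across K is a heuristic rather than a full evidence computation):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.7, size=(200, 2)) for c in ([0, 0], [4, 0], [2, 3])])

for k in range(1, 7):
    vb = BayesianGaussianMixture(
        n_components=k,
        weight_concentration_prior_type="dirichlet_distribution",
        n_init=5,                 # several restarts to avoid poor local optima
        max_iter=500,
        random_state=0,
    ).fit(X)
    print(f"K={k}  lower bound={vb.lower_bound_: .3f}")
# The bound typically peaks near the number of clusters actually supported by the data.
```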

SLIDE 36

Thank you!

SLIDE 37


http://www.elecfans.com/d/604076.html