Clustering: Models and Algorithms
Shikui Tu 2019-03-07
Outline
- Gaussian Mixture Models (GMM)
- Expectation-Maximization (EM) for maximum likelihood
- Model selection, Bayesian learning

From distance to probability
Distance: “The closer, the more likely.” Probability: must sum (or integrate) to one. The two are connected by the Gaussian distribution, whose exponent is the squared Mahalanobis distance:
N(x | µ, Σ) = (2π)^{-d/2} |Σ|^{-1/2} exp( -(1/2) (x - µ)ᵀ Σ⁻¹ (x - µ) ).
We have the following data, and we want to cluster it into two clusters (red and blue).
[Figure: scatter plot of the data points, to be grouped into two clusters, k = 1 and k = 2.]
Gaussian Mixture Model (GMM)
Maximizing the log-likelihood function of a single Gaussian: setting the derivative with respect to µ to zero gives µ_ML = (1/N) Σ_{n=1}^{N} xn; similarly, setting the derivative with respect to Σ to zero gives Σ_ML = (1/N) Σ_{n=1}^{N} (xn − µ_ML)(xn − µ_ML)ᵀ. These are the maximum likelihood estimates of the mean and the covariance matrix.
http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf
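As a quick illustration, here is a minimal NumPy sketch of these maximum likelihood estimates (the data array X below is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))           # hypothetical data, shape (N, d)

mu_ml = X.mean(axis=0)                   # mu_ML = (1/N) sum_n x_n
diff = X - mu_ml
sigma_ml = diff.T @ diff / len(X)        # Sigma_ML = (1/N) sum_n (x_n - mu)(x_n - mu)^T
print(mu_ml)
print(sigma_ml)
```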
Assume the points in the same cluster follow a Gaussian distribution. We use a one-of-K indicator vector z = (z1, …, zK), with zk = 1 indicating that a point x belongs to cluster k. Each cluster has a mixing weight πk = p(zk = 1), the prior probability of a point belonging to cluster k. So, we get a distribution for the data point x:
p(x) = Σ_{k=1}^{K} πk N(x | µk, Σk),  with πk ≥ 0 and Σ_k πk = 1.
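A minimal sketch of evaluating this mixture density with NumPy/SciPy; the two-component parameters below are purely hypothetical:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component GMM in 2D
weights = np.array([0.3, 0.7])                  # mixing weights pi_k (sum to 1)
means   = [np.zeros(2), np.array([3.0, 3.0])]   # cluster means mu_k
covs    = [np.eye(2), 2.0 * np.eye(2)]          # cluster covariances Sigma_k

def gmm_density(x):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

print(gmm_density(np.array([1.0, 1.0])))
```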
Graphical representation of p(x, z) = p(z) p(x | z).
K-means minimizes the distortion with hard assignments, e.g., rn1 = 1, rn2 = 0 for a point assigned to cluster 1. GMM instead maximizes the likelihood. Remember: the closer the distance, the more likely the probability.
Given data X = {x1, …, xN} and parameters π = {π1, …, πK}, µ = {µ1, …, µK}, Σ = {Σ1, …, ΣK}, the log-likelihood to maximize is
ln p(X | π, µ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} πk N(xn | µk, Σk) }.
Initialization of the parameters {πk, µk, Σk}.
E-Step: when the parameters are given, the assignments of the points can be calculated by the posterior probability, i.e., the probability of a data point belonging to a cluster once we have the current parameter values:
γ(znk) = p(znk = 1 | xn) = πk N(xn | µk, Σk) / Σ_{j=1}^{K} πj N(xn | µj, Σj).
Soft assignment: a point fractionally belongs to several clusters, e.g., 0.2 to cluster 1 and 0.8 to cluster 2.
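A minimal NumPy/SciPy sketch of this E-step; it takes the (hypothetical) parameters of the mixture-density sketch above as arguments:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Responsibilities gamma[n, k] = p(z_nk = 1 | x_n) for data X of shape (N, d)."""
    # Unnormalized: pi_k * N(x_n | mu_k, Sigma_k) for each component k
    gamma = np.column_stack([w * multivariate_normal.pdf(X, mean=m, cov=c)
                             for w, m, c in zip(weights, means, covs)])
    return gamma / gamma.sum(axis=1, keepdims=True)   # each row sums to 1 (soft assignment)
```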
M-Step: when the assignments γ(znk) of the points to the clusters are known, the parameters can be calculated for each cluster (Gaussian) separately. With Nk = Σ_{n=1}^{N} γ(znk):
Mixing weight πk = Nk / N: the (responsibility-weighted) proportion of points in cluster k among all data points.
µk = (1/Nk) Σ_n γ(znk) xn and Σk = (1/Nk) Σ_n γ(znk)(xn − µk)(xn − µk)ᵀ: the mean and the covariance matrix are calculated for each cluster.
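And a matching NumPy sketch of these M-step updates (same hypothetical setup as in the sketches above):

```python
import numpy as np

def m_step(X, gamma):
    """Re-estimate (weights, means, covs) from data X (N, d) and responsibilities gamma (N, K)."""
    N, d = X.shape
    Nk = gamma.sum(axis=0)                    # effective number of points per cluster
    weights = Nk / N                          # pi_k = N_k / N
    means = (gamma.T @ X) / Nk[:, None]       # mu_k = (1/N_k) sum_n gamma_nk x_n
    covs = []
    for k in range(gamma.shape[1]):
        diff = X - means[k]
        covs.append((gamma[:, k, None] * diff).T @ diff / Nk[k])   # Sigma_k
    return weights, means, covs
```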
The EM algorithm: initialization, then alternate the E-Step and the M-Step until convergence. L denotes the number of cycles of E-Step and M-Step.
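Putting the pieces together, a minimal EM loop that reuses the hypothetical `e_step` and `m_step` sketches above and monitors the log-likelihood for convergence:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm(X, K, n_cycles=100, tol=1e-6, seed=0):
    """Run EM for a K-component GMM on X (N, d); stop when the log-likelihood gain < tol."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization: random data points as means, identity covariances, equal weights
    weights = np.full(K, 1.0 / K)
    means = X[rng.choice(N, K, replace=False)]
    covs = [np.eye(d) for _ in range(K)]
    prev_ll = -np.inf
    for _ in range(n_cycles):                        # L cycles of E-step + M-step
        gamma = e_step(X, weights, means, covs)      # E-step: responsibilities
        weights, means, covs = m_step(X, gamma)      # M-step: re-estimate parameters
        ll = np.sum(np.log(sum(w * multivariate_normal.pdf(X, mean=m, cov=c)
                               for w, m, c in zip(weights, means, covs))))
        if ll - prev_ll < tol:                       # convergence check
            break
        prev_ll = ll
    return weights, means, covs, gamma
```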
Relation to K-means: K-means uses a one-in-K (hard) assignment, a fixed spherical covariance Σk = εI shared by all clusters, and fixed equal mixing weights. GMM uses a soft assignment and additionally estimates the covariances and mixing weights; in the limit ε → 0, EM for this restricted GMM reduces to K-means.
The General EM Algorithm
Given a joint distribution p(X, Z|θ) over observed variables X and latent variables Z, governed by parameters θ, the goal is to maximize the likelihood function p(X|θ) with respect to θ.
1. Choose an initial setting for the parameters θold.
2. E step: evaluate p(Z | X, θold).
3. M step: evaluate θnew given by
   θnew = arg max_θ Q(θ, θold),  where  Q(θ, θold) = Σ_Z p(Z | X, θold) ln p(X, Z | θ).
4. If the convergence criterion is not satisfied, then let θold ← θnew and return to step 2.
Decomposition of the log-likelihood:
ln p(X | θ) = F(q, θ) + KL(q ‖ p(Z | X, θ)),
where the lower bound (also written L(q, θ)) is F(q, θ) = Σ_Z q(Z) ln [ p(X, Z | θ) / q(Z) ], and KL(q ‖ p) ≥ 0.
E-step: set q^{(t+1)}(Z) = p(Z | X, θ^{(t)}), so that KL[q^{(t+1)} ‖ p(Z | X, θ^{(t)})] = 0 and the lower bound touches the log-likelihood: ln p(X | θ^{(t)}) = F(q^{(t+1)}, θ^{(t)}).
M-step: maximize the lower bound over θ to obtain θ^{(t+1)}. This gives a new lower bound F(q^{(t+1)}, θ^{(t+1)}) and a new (larger) log-likelihood ln p(X | θ^{(t+1)}), with the gap KL[q^{(t+1)} ‖ p(Z | X, θ^{(t+1)})] ≥ 0.
The KL divergence KL[q ‖ p] is non-negative, and zero iff ∀x: p(x) = q(x).
First let's consider discrete distributions; the Kullback-Leibler divergence is:
KL[q ‖ p] = Σ_i qi log(qi / pi).
To find the distribution q which minimizes KL[q ‖ p] we add a Lagrange multiplier to enforce the normalization constraint:
E ≝ KL[q ‖ p] + λ(1 − Σ_i qi) = Σ_i qi log(qi / pi) + λ(1 − Σ_i qi).
Setting the derivatives to zero:
∂E/∂qi = log qi − log pi + 1 − λ = 0  ⇒  qi = pi exp(λ − 1),
∂E/∂λ = 1 − Σ_i qi = 0  ⇒  Σ_i qi = 1,
and together these give qi = pi.
Check that the curvature (Hessian) is positive (definite), corresponding to a minimum:
∂²E/∂qi² = 1/qi > 0,  ∂²E/∂qi∂qj = 0 (i ≠ j),
showing that qi = pi is a genuine minimum. At the minimum it is easily verified that KL[p ‖ p] = 0.
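A small numeric check of these two properties, with hypothetical discrete distributions:

```python
import numpy as np

def kl_divergence(q, p):
    """Discrete KL divergence KL[q || p] = sum_i q_i * log(q_i / p_i)."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0                                  # terms with q_i = 0 contribute 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, p))                        # 0.0: KL[p || p] = 0, the minimum
print(kl_divergence([0.6, 0.3, 0.1], p))          # > 0: KL is non-negative
```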
Jensen’s Inequality due to convexity
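As a worked statement of the bound via Jensen's inequality (ln is concave), using the F(q, θ) notation above:

```latex
\ln p(X \mid \theta)
  = \ln \sum_{Z} q(Z)\,\frac{p(X, Z \mid \theta)}{q(Z)}
  \;\ge\; \sum_{Z} q(Z)\,\ln \frac{p(X, Z \mid \theta)}{q(Z)}
  = F(q, \theta)
```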
How to choose K? The K-means objective J keeps decreasing as K increases (past the true number of clusters K0), so J does not tell which K is better. The negative log-likelihood also decreases as K increases.
Model selection criteria (plotted as a criterion versus K):
Akaike's Information Criterion (AIC): AIC = −2 ln L + 2 dk.
Bayesian Information Criterion (BIC): BIC = −2 ln L + dk ln N.
Models become more and more complex as K increases. The negative log-likelihood (or fitting error) makes sure the model fits the data well, while the penalty keeps the model simple; the criterion is a trade-off between fitting the data well and keeping the model simple.
dk: number of free parameters; N: sample size.
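For instance, scikit-learn's GaussianMixture exposes both criteria directly; a minimal sketch on hypothetical two-cluster data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical data drawn from two well-separated clusters
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])

for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, round(gmm.aic(X), 1), round(gmm.bic(X), 1))   # smaller is better; expect K = 2 to win
```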
Using Occam’s Razor to Learn Model Structure
Compare model classes m using their posterior probability given the data:
P(m | y) = P(y | m) P(m) / P(y),   where   P(y | m) = ∫_{Θm} P(y | θm, m) P(θm | m) dθm.
Interpretation of P(y | m): the probability that randomly selected parameter values from the model class would generate data set y. Model classes that are too simple are unlikely to generate the data set. Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.
[Figure: the evidence P(Y|Mi) plotted over all possible data sets Y, with curves for "too simple", "too complex", and "just right" model classes.]
For example, the model classes can be mixtures of m Gaussians. The quantity
P(y | m) = ∫_{Θm} P(y | θm, m) P(θm | m) dθm
is also known as the Bayesian evidence for model m. Comparing P(y | m) against P(y | m0) for another model class m0 embodies a Bayesian Occam's razor: it automatically prefers simpler explanations over more complex explanations.
A Bayesian GMM places priors on the parameters. In the graphical model, the observed xn and latent zn sit inside a plate of size N, with π, µ, Λ treated as random variables (Λk = Σk⁻¹ is the precision matrix). The latent assignments and the likelihood are
p(Z | π) = ∏_{n=1}^{N} ∏_{k=1}^{K} πk^{znk},
p(X | Z, µ, Λ) = ∏_{n=1}^{N} ∏_{k=1}^{K} N(xn | µk, Λk⁻¹)^{znk},
with a Dirichlet prior on the mixing weights π and a Gaussian-Wishart prior on the Gaussian parameters:
p(µ, Λ) = p(µ | Λ) p(Λ) = ∏_{k=1}^{K} N(µk | m0, (β0 Λk)⁻¹) W(Λk | W0, ν0).
http://www.cs.ubc.ca/~murphyk/Software/VBEMGMM/index.html http://scikit-learn.org/stable/modules/mixture.html
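For example, scikit-learn's variational BayesianGaussianMixture can be tried on hypothetical data; with a small weight-concentration prior, redundant components get weights close to zero, which gives a Bayesian way of choosing the effective K:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Hypothetical data from three 2D clusters
X = np.vstack([rng.normal(c, 0.5, (150, 2)) for c in (0.0, 3.0, 6.0)])

# Deliberately over-specify the number of components; the variational posterior prunes the extras
vb = BayesianGaussianMixture(n_components=10, weight_concentration_prior=1e-2,
                             max_iter=500, random_state=0).fit(X)
print(np.round(vb.weights_, 3))   # weights of unused components shrink toward 0
```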
[Figure: Bayesian model selection on mixture data with sample sizes 15, 60, and 120; the evidence/bound p(D|K) is shown for K = 1, …, 6, and suboptimal local solutions can occur.]
http://www.elecfans.com/d/604076.html