Clustering Techniques
Berlin Chen 2003
References:
1. Modern Information Retrieval, Chapters 5 and 7
2. Foundations of Statistical Natural Language Processing, Chapter 14
Clustering: place similar objects in the same …
$$\mu_j = \frac{\sum_i P(c_j \mid x_i) \cdot x_i}{\sum_i P(c_j \mid x_i)}$$
Hierarchical clustering
– Preferable for detailed data analysis
– Provides more information than flat clustering
– No single best algorithm (each algorithm is optimal only for some applications)
– Less efficient than flat clustering (at minimum, an n × n matrix of similarity coefficients has to be computed)
Flat (non-hierarchical) clustering
– Preferable if efficiency is a consideration or data sets are very large
– K-means is the conceptually simplest method and should probably be used first on a new data set
– K-means assumes a simple Euclidean representation space, and so cannot be used for many data sets, e.g., nominal data like colors
– For such data, the EM algorithm is the method of choice. It can accommodate definition of clusters and allocation of objects based on complex probabilistic models
$$\mathrm{sim}(x, y) = \frac{1}{1 + d(x, y)}$$
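The conversion from a distance d to a similarity can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and using Euclidean distance as the default d is an assumption, since any distance measure could be plugged in.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def similarity(x, y, distance=euclidean_distance):
    """sim(x, y) = 1 / (1 + d(x, y)); equals 1 when d = 0 and decays toward 0."""
    return 1.0 / (1.0 + distance(x, y))
```

Identical objects get similarity 1, and the value shrinks toward 0 as the distance grows, so the result always lies in (0, 1].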
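Euclidean distance (L2 norm):
$$|x - y|_2 = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
L1 norm:
$$|x - y|_1 = \sum_{i=1}^{m} |x_i - y_i|$$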
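Single-link (greatest similarity between members of the two clusters):
$$\mathrm{sim}(c_i, c_j) = \max_{\vec{x} \in c_i,\ \vec{y} \in c_j} \mathrm{sim}(\vec{x}, \vec{y})$$
Complete-link (least similarity between members of the two clusters):
$$\mathrm{sim}(c_i, c_j) = \min_{\vec{x} \in c_i,\ \vec{y} \in c_j} \mathrm{sim}(\vec{x}, \vec{y})$$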
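As a rough illustration of the two cluster-similarity definitions (greatest versus least pairwise similarity), the sketch below computes both for small clusters. The function names are hypothetical, and the pairwise similarity 1 / (1 + Euclidean distance) is an assumed choice; any other sim(x, y) could be passed in instead.

```python
import math
from itertools import product

def pairwise_sim(x, y):
    """Assumed pairwise similarity: 1 / (1 + Euclidean distance)."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + d)

def single_link(ci, cj, sim=pairwise_sim):
    """Similarity of the two MOST similar members, one from each cluster."""
    return max(sim(x, y) for x, y in product(ci, cj))

def complete_link(ci, cj, sim=pairwise_sim):
    """Similarity of the two LEAST similar members, one from each cluster."""
    return min(sim(x, y) for x, y in product(ci, cj))
```

Both functions enumerate every cross-cluster pair, so each call costs |c_i| · |c_j| similarity evaluations; the complete-link value can never exceed the single-link value for the same pair of clusters.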
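Group-average similarity of a cluster $c_j$ (average over all pairs of distinct members):
$$\mathrm{SIM}(c_j) = \frac{1}{|c_j|\,(|c_j| - 1)} \sum_{\vec{x} \in c_j} \ \sum_{\vec{y} \in c_j,\ \vec{y} \neq \vec{x}} \mathrm{sim}(\vec{x}, \vec{y})$$
Sum of the member vectors:
$$\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$$
For length-normalized vectors ($\vec{x} \cdot \vec{x} = 1$) with $\mathrm{sim}(\vec{x}, \vec{y}) = \vec{x} \cdot \vec{y}$:
$$\vec{s}(c_j) \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y} = |c_j|\,(|c_j| - 1)\,\mathrm{SIM}(c_j) + \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{x} = |c_j|\,(|c_j| - 1)\,\mathrm{SIM}(c_j) + |c_j|$$
$$\therefore\ \mathrm{SIM}(c_j) = \frac{\vec{s}(c_j) \cdot \vec{s}(c_j) - |c_j|}{|c_j|\,(|c_j| - 1)}$$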
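Merging clusters $c_i$ and $c_j$, where $\vec{s}(c)$ is the sum of the member vectors of $c$:
$$\mathrm{SIM}(c_i \cup c_j) = \frac{\left(\vec{s}(c_i) + \vec{s}(c_j)\right) \cdot \left(\vec{s}(c_i) + \vec{s}(c_j)\right) - \left(|c_i| + |c_j|\right)}{\left(|c_i| + |c_j|\right)\left(|c_i| + |c_j| - 1\right)}$$
$$c_{New} = c_i \cup c_j, \qquad \vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j)$$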
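The sum-vector shortcut for group-average similarity can be checked numerically. The sketch below is a minimal illustration under the same assumptions as the derivation: all vectors are length-normalized and pairwise similarity is the dot product. The helper names are hypothetical, and each cluster passed to the averaging functions must have at least two members.

```python
import math

def normalize(v):
    """Scale v to unit length so that v . v = 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sum_vector(cluster):
    """s(c): component-wise sum of the member vectors."""
    return [sum(column) for column in zip(*cluster)]

def group_average_direct(cluster):
    """Average dot product over all ordered pairs of distinct members."""
    n = len(cluster)
    total = sum(dot(x, y)
                for a, x in enumerate(cluster)
                for b, y in enumerate(cluster) if a != b)
    return total / (n * (n - 1))

def group_average_from_sum(s, n):
    """SIM(c) = (s . s - |c|) / (|c| (|c| - 1)), valid for unit vectors."""
    return (dot(s, s) - n) / (n * (n - 1))

def merged_group_average(s_i, n_i, s_j, n_j):
    """Group average of the merged cluster from the two sum vectors alone."""
    s_new = [a + b for a, b in zip(s_i, s_j)]
    n_new = n_i + n_j
    return group_average_from_sum(s_new, n_new), s_new, n_new
```

The direct computation touches every pair of members, whereas the sum-vector form needs only one dot product per candidate merge once each cluster's sum vector and size are stored.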
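– Measures of cluster quality for choosing the number of clusters k: mutual information (MI), group-average similarity, likelihood
– Compare the measure as the number of clusters changes: k−1 → k → k+1
– Hierarchical clustering also has to face this problem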
cluster centroid → cluster assignment → calculation of new centroid
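The cycle of centroids, cluster assignment, and centroid recalculation can be sketched as a small K-means loop. This is a minimal illustration, not a tuned implementation: the function names are hypothetical, squared Euclidean distance is assumed, the caller supplies the initial centroids, and an empty cluster simply keeps its previous centroid (one of several possible conventions).

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(column) / len(points) for column in zip(*points)]

def k_means(points, initial_centroids, max_iter=100):
    centroids = [list(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Cluster assignment: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Calculation of new centroids (empty clusters keep the old one).
        new_centroids = [mean(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:  # assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters
```

The loop stops as soon as recomputing the centroids changes nothing, which means the assignments have stabilized; `max_iter` is only a safety bound.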
government, finance, sports, research, name
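Mixture of $k$ clusters with priors $P(c_1) = \pi_1$, $P(c_2) = \pi_2$, ..., $P(c_k) = \pi_k$ and component densities $P(\vec{x}_i \mid c_1)$, $P(\vec{x}_i \mid c_2)$, ..., $P(\vec{x}_i \mid c_k)$:
$$P(\vec{x}_i) = \sum_{j=1}^{k} \pi_j \, P(\vec{x}_i \mid c_j)$$
$$P(\vec{x}_1, \ldots, \vec{x}_n) = \prod_{i=1}^{n} P(\vec{x}_i) = \prod_{i=1}^{n} \sum_{j=1}^{k} \pi_j \, P(\vec{x}_i \mid c_j)$$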
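Continuous case:
$$P(\vec{x}_i \mid c_j; \Theta) = \frac{1}{\sqrt{(2\pi)^m \, |\Sigma_j|}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_j)^T \, \Sigma_j^{-1} \, (\vec{x}_i - \vec{\mu}_j) \right)$$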
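Expectation step, where $z_{ij} = 1$ if $\vec{x}_i$ was generated by cluster $c_j$ and 0 otherwise:
$$E[z_{ij} \mid \vec{x}_i] = P(c_j \mid \vec{x}_i) = \frac{P(\vec{x}_i \mid c_j)\, \pi_j}{\sum_{l=1}^{k} P(\vec{x}_i \mid c_l)\, \pi_l}$$
Maximization step:
$$\vec{\mu}_j = \frac{\sum_{i=1}^{n} P(c_j \mid \vec{x}_i)\, \vec{x}_i}{\sum_{i=1}^{n} P(c_j \mid \vec{x}_i)}$$
$$\Sigma_j = \frac{\sum_{i=1}^{n} P(c_j \mid \vec{x}_i)\, (\vec{x}_i - \vec{\mu}_j)(\vec{x}_i - \vec{\mu}_j)^T}{\sum_{i=1}^{n} P(c_j \mid \vec{x}_i)}$$
$$\pi_j = \frac{\sum_{i=1}^{n} P(c_j \mid \vec{x}_i)}{\sum_{j=1}^{k} \sum_{i=1}^{n} P(c_j \mid \vec{x}_i)}$$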
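The EM updates for a Gaussian mixture can be sketched for the one-dimensional case, where the covariance matrix reduces to a scalar variance. Restricting to m = 1 is a simplification made here to keep the example free of matrix algebra; the function names and the small variance floor `var_floor` are likewise assumptions for illustration.

```python
import math

def gaussian(x, mu, var):
    """1-D Gaussian density with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(xs, pis, mus, variances, var_floor=1e-6):
    """One EM iteration for a 1-D Gaussian mixture; returns updated parameters."""
    n, k = len(xs), len(pis)
    # E-step: posterior P(c_j | x_i) for every point i and cluster j.
    post = []
    for x in xs:
        joint = [pis[j] * gaussian(x, mus[j], variances[j]) for j in range(k)]
        total = sum(joint)
        post.append([p / total for p in joint])
    # M-step: posterior-weighted means, variances, and priors.
    weights = [sum(post[i][j] for i in range(n)) for j in range(k)]
    new_mus = [sum(post[i][j] * xs[i] for i in range(n)) / weights[j]
               for j in range(k)]
    new_vars = [max(sum(post[i][j] * (xs[i] - new_mus[j]) ** 2
                        for i in range(n)) / weights[j], var_floor)
                for j in range(k)]  # floor (an assumption) avoids zero variance
    total_weight = sum(weights)
    new_pis = [w / total_weight for w in weights]
    return new_pis, new_mus, new_vars

def fit(xs, pis, mus, variances, iterations=50):
    """Run a fixed number of EM iterations from the given starting parameters."""
    for _ in range(iterations):
        pis, mus, variances = em_step(xs, pis, mus, variances)
    return pis, mus, variances
```

A fixed iteration count keeps the sketch short; in practice one would more likely stop when the log-likelihood stops improving. As with K-means, the result depends on the starting parameters.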