Clustering
CE-324: Modern Information Retrieval
Sharif University of Technology
- M. Soleymani
Fall 2017
Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Ch. 16
What is clustering?
Clustering: the process of grouping a set of objects into classes of similar objects.
Docs within a cluster should be similar. Docs from different clusters should be dissimilar.
Unsupervised learning: learning from raw data, as opposed to supervised data where a classification of examples is given.
Clustering is a common and important task that finds many applications in IR and other places.
News reading is not really search, but rather a process of browsing.
Cluster docs in corpus a priori. When a query matches a doc 𝑑, also return other docs in the cluster.
E.g., the query "car" can also return docs containing automobile, because clustering grouped together docs containing car with those containing automobile.
Issues for clustering:
- Doc representation: vector space? Normalization? (Centroids aren't length normalized.)
- Need a notion of similarity/distance.
- How many clusters? Fixed a priori, or completely data driven?
- Avoid "trivial" clusters, too large or small. If too large, for navigation purposes you've wasted an extra user click without narrowing down the set much.
We will use cosine similarity.
For many algorithms, it is easier to think in terms of a distance (rather than a similarity).
We will mostly speak of Euclidean distance, but real implementations use cosine similarity.
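A quick numerical check of why the two views coexist (a minimal NumPy sketch, not from the slides): for length-normalized vectors, squared Euclidean distance is a monotone function of cosine similarity, so nearest-by-distance and nearest-by-cosine agree.

    import numpy as np

    # For unit vectors x, y:  ||x - y||^2 = 2 - 2 x.y = 2 (1 - cos(x, y)),
    # so ranking docs by Euclidean distance = ranking by cosine similarity.
    rng = np.random.default_rng(0)
    x, y = rng.normal(size=5), rng.normal(size=5)
    x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)

    assert np.isclose(((x - y) ** 2).sum(), 2 * (1 - x @ y))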
Hard clustering, in which each doc belongs to exactly one cluster, is more common and easier to do than soft clustering.
Given: a set of docs and the number 𝐾. Find: a partition of the docs into 𝐾 clusters that optimizes the chosen partitioning criterion.
Finding a global optimum is intractable for many objective functions of interest.
Effective heuristic methods: K-means and K-medoids algorithms
K-means alternates two steps:
- Assignment: assign each doc $\boldsymbol{x}^{(j)}$ to the cluster $k$ such that $\mathrm{dist}(\boldsymbol{x}^{(j)}, \boldsymbol{\mu}_k)$ is minimal.
- Update: recompute each centroid as the mean of its docs, $\boldsymbol{\mu}_k = \frac{1}{|\mathcal{D}_k|} \sum_{j \in \mathcal{D}_k} \boldsymbol{x}^{(j)}$.
Termination conditions (several possibilities):
- A fixed number of iterations.
- Doc partition unchanged.
- $J < \theta$: the cost function falls below a threshold.
- $\Delta J < \theta$: the decrease in the cost function over two successive iterations falls below a threshold.
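To make the two steps and the stopping rules concrete, here is a minimal NumPy sketch (my naming, not the course's reference code) that alternates the assignment and update rules above and stops when the partition is unchanged or an iteration cap is hit:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        # Plain K-means on the rows of X (one doc vector per row).
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), size=k, replace=False)].copy()  # seeds: k random docs
        assign = None
        for _ in range(max_iter):                  # stop: fixed number of iterations
            # Assignment step: each doc joins the cluster with the nearest centroid.
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            new_assign = d2.argmin(axis=1)
            if assign is not None and np.array_equal(new_assign, assign):
                break                              # stop: doc partition unchanged
            assign = new_assign
            # Update step: each centroid becomes the mean of the docs assigned to it.
            for j in range(k):
                members = X[assign == j]
                if len(members):                   # keep old centroid if a cluster empties
                    mu[j] = members.mean(axis=0)
        return assign, mu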
We must use tie-breaking when there are samples with the same distance to two or more centroids.
$\sum_{j \in D_l} \|\boldsymbol{x}^{(j)} - \boldsymbol{\mu}_l\|^2$ reaches its minimum for $\boldsymbol{\mu}_l = \frac{1}{|D_l|} \sum_{j \in D_l} \boldsymbol{x}^{(j)}$.
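Why the mean is the minimizer: set the gradient of the within-cluster cost with respect to $\boldsymbol{\mu}_l$ to zero.

$\frac{\partial}{\partial \boldsymbol{\mu}_l} \sum_{j \in D_l} \|\boldsymbol{x}^{(j)} - \boldsymbol{\mu}_l\|^2 = -2 \sum_{j \in D_l} \bigl(\boldsymbol{x}^{(j)} - \boldsymbol{\mu}_l\bigr) = 0 \;\Rightarrow\; \boldsymbol{\mu}_l = \frac{1}{|D_l|} \sum_{j \in D_l} \boldsymbol{x}^{(j)}$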
Time complexity: computing the distance between two docs is 𝑂(𝑀), where 𝑀 is the dimensionality of the vectors.
Try out multiple starting points.
Select good seeds using a heuristic (e.g., the doc least similar to any existing centroid).
Initialize with the results of another method.
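One such heuristic, sketched below (furthest-first seeding; the function name is mine and this is only one possible reading of the slide's hint): pick the first seed at random, then repeatedly pick the doc least similar to all seeds chosen so far.

    import numpy as np

    def furthest_first_seeds(X, k, seed=0):
        # Next seed = the doc farthest from every already-chosen seed.
        rng = np.random.default_rng(seed)
        seeds = [int(rng.integers(len(X)))]
        for _ in range(k - 1):
            # distance of every doc to its nearest already-chosen seed
            d2 = ((X[:, None, :] - X[seeds][None, :, :]) ** 2).sum(axis=2).min(axis=1)
            seeds.append(int(d2.argmax()))
        return X[seeds]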
Instead, the centroid can be recomputed after every single assignment (rather than after each full pass); this can improve the speed of convergence of K-means.
K-means is sensitive to coordinate changes, weighting, etc.
It doesn't have a notion of "outliers" by default, but outlier filtering can be added.
If K is given: partition n docs into the predetermined number of clusters.
If K is not given: given docs, partition them into an "appropriate" number of subsets. E.g., for query results, the ideal value of K is not known up front, though the UI may impose limits.
How many clusters? [Figure: the same set of points grouped into two, four, and six clusters.]
The penalty for having many clusters is application dependent; e.g., a compressed summary of a search results list.
Define the Benefit of a doc as its cosine similarity to its cluster centroid, and the Total Benefit as the sum of the individual doc Benefits. Why is there always a clustering of Total Benefit n? (Make each doc its own cluster: each then has Benefit 1.) Total Benefit increases with increasing K, but we can stop when it doesn't increase by "much": the Cost term, a cost c per cluster giving Total Cost Kc, enforces this. We choose the K maximizing Value = Total Benefit − Total Cost.
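A sketch of that selection rule (assuming Benefit = cosine similarity of a doc to its centroid and a per-cluster cost c; kmeans is the sketch above, and all names are mine):

    import numpy as np

    def clustering_value(X, assign, mu, c):
        # Value = Total Benefit - Total Cost, with Total Cost = K * c.
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        Mn = mu / np.linalg.norm(mu, axis=1, keepdims=True)
        total_benefit = (Xn * Mn[assign]).sum(axis=1).sum()  # cosine to own centroid
        return total_benefit - len(mu) * c

    # choose K with the highest Value over a candidate range, e.g.:
    # best_k = max(range(1, 10), key=lambda k: clustering_value(X, *kmeans(X, k), c=0.5))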
A good clustering has high intra-class (that is, intra-cluster) similarity and low inter-class similarity.
$F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$, where $P$ and $R$ are precision and recall over pairs of docs.
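In code, from the pairwise counts TP, FP, FN of the Rand-index contingency table (a sketch; names are mine):

    def f_beta(tp, fp, fn, beta=1.0):
        # F measure over pairs of docs: P = TP/(TP+FP), R = TP/(TP+FN).
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)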
Hierarchical agglomerative clustering (HAC): treat each doc as a singleton cluster at the outset, then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all the docs.
Starting with each doc in a separate cluster, it then repeatedly joins the closest pair of clusters, until there is only one cluster.
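A minimal sketch of that loop (my naming; the inter-cluster similarity is left as a pluggable function, anticipating the variants below):

    def hac(items, cluster_sim):
        # Naive HAC: returns the merge history as (cluster_a, cluster_b) pairs.
        clusters = [[x] for x in items]              # start: one singleton per doc
        history = []
        while len(clusters) > 1:
            # find the closest pair of clusters under cluster_sim
            i, j = max(((a, b) for a in range(len(clusters))
                               for b in range(a + 1, len(clusters))),
                       key=lambda ab: cluster_sim(clusters[ab[0]], clusters[ab[1]]))
            history.append((clusters[i], clusters[j]))
            clusters[i] = clusters[i] + clusters[j]  # merge j into i
            del clusters[j]
        return history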
Variants of "closest pair of clusters":
- Single-link: similarity of the most similar pair.
- Complete-link: similarity of the "furthest" points, i.e., the least similar pair.
- Centroid: clusters whose centroids (centers of gravity) are the most similar.
- Average-link: average similarity between pairs of elements.
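The four variants as pluggable cluster-similarity functions for the HAC sketch above (assuming a base sim(x, y) on doc vectors; for average-link this version averages over cross-cluster pairs, one common variant):

    import numpy as np

    def single_link(a, b, sim):
        return max(sim(x, y) for x in a for y in b)         # most similar pair

    def complete_link(a, b, sim):
        return min(sim(x, y) for x in a for y in b)         # least similar pair

    def centroid_link(a, b, sim):
        return sim(np.mean(a, axis=0), np.mean(b, axis=0))  # centers of gravity

    def average_link(a, b, sim):
        return np.mean([sim(x, y) for x in a for y in b])   # mean over pairs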
Single-link merges on the maximum similarity of pairs:
$sim(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} sim(x, y)$
It can produce long, straggly clusters due to the chaining effect. After merging $C_i$ and $C_j$:
$sim(C_i \cup C_j, C_k) = \max\bigl(sim(C_i, C_k), sim(C_j, C_k)\bigr)$
Complete-link merges on the minimum similarity of pairs:
$sim(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} sim(x, y)$
After merging $C_i$ and $C_j$:
$sim(C_i \cup C_j, C_k) = \min\bigl(sim(C_i, C_k), sim(C_j, C_k)\bigr)$
Graph-theoretic view: let $G(s_k)$ be the graph joining pairs of docs with similarity at least $s_k$, where $s_k$ is the similarity of the $k$-th merged pair. The clusters at step $k$ of single-link clustering are the connected components of $G(s_k)$; the clusters of complete-link clustering are the maximal cliques of $G(s_k)$.
Group-average (average-link) clustering uses the average similarity over all pairs in the merged cluster:
$sim(C_i, C_j) = \frac{1}{(|C_i|+|C_j|)(|C_i|+|C_j|-1)} \sum_{x \in C_i \cup C_j} \; \sum_{y \in C_i \cup C_j,\, y \neq x} sim(x, y)$
Computing group-average similarity efficiently: maintain each cluster's sum vector, $\boldsymbol{s}(C_j) = \sum_{x \in C_j} \boldsymbol{x}$. For length-normalized doc vectors,
$sim(C_i, C_j) = \frac{\bigl(\boldsymbol{s}(C_i) + \boldsymbol{s}(C_j)\bigr) \cdot \bigl(\boldsymbol{s}(C_i) + \boldsymbol{s}(C_j)\bigr) - (|C_i| + |C_j|)}{(|C_i| + |C_j|)(|C_i| + |C_j| - 1)}$
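A sketch of that shortcut (assuming all doc vectors are length-normalized, so each self-similarity x·x is 1):

    import numpy as np

    def group_average_sim(sum_a, n_a, sum_b, n_b):
        # Group-average similarity from the clusters' sum vectors alone:
        # sum of all cross terms = s.s - n, over the n(n-1) ordered pairs.
        s = sum_a + sum_b
        n = n_a + n_b
        return (s @ s - n) / (n * (n - 1))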
Labeling a cluster:
- Title of the doc closest to the centroid: titles are easier to read than a list of terms, but a single doc is unlikely to be representative of all the docs in a cluster.
- A list of terms with high weights in the cluster centroid.
We can use measures such as Mutual Information to pick the terms:
$I(C_l; X_j) = \sum_{c_l \in \{0,1\}} \sum_{x_j \in \{0,1\}} P(x_j, c_l) \log \frac{P(x_j, c_l)}{P(x_j)\, P(c_l)}$
where $C_l$ indicates membership in cluster $l$ and $X_j$ indicates occurrence of term $j$.
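To make the formula concrete, a sketch that estimates $I(C_l; X_j)$ from boolean per-doc indicators of cluster membership and term occurrence (names are mine):

    import numpy as np

    def mutual_information(in_cluster, has_term):
        # I(C; X) for one cluster and one term, from boolean arrays over docs.
        mi = 0.0
        for c in (0, 1):
            for x in (0, 1):
                p_xc = np.mean((in_cluster == c) & (has_term == x))
                if p_xc > 0:
                    mi += p_xc * np.log(p_xc / (np.mean(has_term == x)
                                                * np.mean(in_cluster == c)))
        return mi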
There are many ways of influencing the outcome of clustering: the number of clusters, the similarity measure, the representation of docs, and so on.
Resources: IIR Ch. 16 (except 16.5) and IIR Ch. 17 (except 17.5).