Clustering
CE-324: Modern Information Retrieval
Sharif University of Technology
- M. Soleymani
Fall 2018
Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Ch. 16
What is clustering?
} Clustering: the process of grouping a set of objects into classes of similar objects
} Docs within a cluster should be similar.
} Docs from different clusters should be dissimilar.
} Unsupervised learning
¨ learning from raw data, as opposed to supervised learning where a classification of examples is given
} A common and important task that finds many applications in IR and other places
} Effective “user recall” will be higher
} Better user interface: search without typing
} Better search results (like pseudo RF)
} Cluster-based retrieval gives faster search
} News reading is not really search, but rather a process of selection
} Cluster docs in corpus a priori
} When a query matches a doc 𝑑, also return other docs in the cluster
} Because clustering grouped together docs containing car with those containing automobile
} Doc representation
¨ Vector space? Normalization?
¨ Centroids aren't length normalized
} Need a notion of similarity/distance
} How many clusters?
¨ Fixed a priori? Completely data driven?
¨ Avoid "trivial" clusters - too large or small
¨ too large: for navigation purposes you've wasted an extra user click without whittling down the set much
} We will use cosine similarity.
} For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs
} We will mostly speak of Euclidean distance
¨ But real implementations use cosine similarity
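For length-normalized vectors the two views coincide; a short derivation (added for clarity, not from the slides):
$\|\vec{x} - \vec{y}\|^2 = \|\vec{x}\|^2 + \|\vec{y}\|^2 - 2\,\vec{x} \cdot \vec{y} = 2\,(1 - \cos(\vec{x}, \vec{y}))$
so ranking by smallest Euclidean distance between unit vectors is the same as ranking by largest cosine similarity.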
} More common and easier to do
} Given: a set of docs and the number 𝐾
} Find: a partition of the docs into 𝐾 clusters that optimizes the chosen partitioning criterion
} Finding a global optimum is intractable for many objective functions of interest
} Effective heuristic methods: K-means and K-medoids algorithms
} Assignment: assign each doc $\vec{x}^{(i)}$ to the cluster $k$ such that $dist(\vec{x}^{(i)}, \vec{\mu}_k)$ is minimal
} Update: recompute each centroid as the mean of the docs assigned to it:
$\vec{\mu}_k = \frac{1}{|\omega_k|} \sum_{\vec{x} \in \omega_k} \vec{x}$
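A minimal sketch of these two steps in Python/NumPy. This is a sketch under the definitions above, not the lecture's code; the function name, random seeding, and iteration cap are all illustrative.

    import numpy as np

    def kmeans(X, K, n_iters=100, seed=0):
        """Minimal K-means; X is an (n_docs, n_dims) array."""
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), size=K, replace=False)]  # random seed docs
        for _ in range(n_iters):
            # Assignment: each doc goes to its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            # Update: each centroid becomes the mean of its assigned docs
            new_mu = np.array([X[assign == k].mean(axis=0)
                               if np.any(assign == k) else mu[k]
                               for k in range(K)])
            if np.allclose(new_mu, mu):  # doc partition unchanged: stop
                break
            mu = new_mu
        return assign, mu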
} A fixed number of iterations
} Doc partition unchanged
} $J < \theta$: cost function falls below a threshold
} $\Delta J < \theta$: the decrease in the cost function (in two successive iterations) falls below a threshold
} We must use tie-breaking when there are samples with the same distance to more than one centroid
} Define the goodness measure of cluster $k$ as the sum of squared distances from the cluster centroid:
$G_k = \sum_{\vec{x} \in \omega_k} \|\vec{x} - \vec{\mu}_k\|^2$
} Total goodness: $G = \sum_{k=1}^{K} G_k$
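A one-function sketch of this measure, reusing the assign/mu outputs of the hypothetical kmeans sketch above:

    import numpy as np

    def goodness(X, assign, mu):
        """G = sum over clusters of squared distances to the centroid.
        Both K-means steps can only decrease this value."""
        return sum(((X[assign == k] - mu[k]) ** 2).sum()
                   for k in range(len(mu)))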
} 𝑁 is the dimensionality of the vectors.
} Exclude outliers from the seed set
} Try out multiple starting points
} Select good seeds using a heuristic (e.g., doc least similar to any existing mean)
} Obtain seeds from another method such as hierarchical clustering
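One way to act on "try out multiple starting points": run K-means from several seed sets and keep the run with the lowest cost. A sketch reusing the hypothetical kmeans and goodness helpers above:

    def best_of_restarts(X, K, n_restarts=10):
        """Keep the clustering with the lowest goodness value G."""
        best = None
        for seed in range(n_restarts):
            assign, mu = kmeans(X, K, seed=seed)
            g = goodness(X, assign, mu)
            if best is None or g < best[0]:
                best = (g, assign, mu)
        return best[1], best[2]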
} Instead, the centroid can be recomputed after every single assignment (rather than only after all points are re-assigned)
¨ It can improve speed of convergence of K-means
} Sensitive to coordinate changes, weighting etc.
} Doesn’t have a notion of “outliers” by default
} But can add outlier filtering
} Partition n docs into predetermined number of clusters
} Given docs, partition them into an "appropriate" number of subsets.
} E.g., for query results - the ideal value of 𝐾 is not known up front - though UI may impose limits.
How many clusters?
[Figure: the same data set clustered into two, four, or six clusters]
} The right number of clusters is application dependent
} e.g., compressed summary of search results list.
} Why is there always a clustering of Total Benefit n?
¨ Put each doc in its own cluster: its similarity to its centroid (itself) is 1, so the Total Benefit sums to n.
} Total Benefit increases with increasing K.
} But can stop when it doesn't increase by "much". The Cost term enforces this.
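A sketch of the resulting selection rule, Value = Total Benefit - Total Cost, assuming the usual definitions (Benefit of a doc = cosine similarity to its cluster centroid; a fixed cost per cluster). These assumptions and all names are illustrative:

    import numpy as np

    def clustering_value(X, assign, mu, cost_per_cluster=1.0):
        """Value = Total Benefit - Total Cost; pick the K maximizing it."""
        X_n = X / np.linalg.norm(X, axis=1, keepdims=True)
        mu_n = mu / np.linalg.norm(mu, axis=1, keepdims=True)
        benefit = (X_n * mu_n[assign]).sum()         # sum of cosine sims
        return benefit - cost_per_cluster * len(mu)  # penalize many clusters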
} intra-class (that is, intra-cluster) similarity is high
} inter-class similarity is low
$F_\beta = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$
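A sketch, taking P and R to be pair-counting precision and recall from the usual Rand-index contingency table (an assumption; those definitions are not visible above):

    def f_beta(P, R, beta=1.0):
        """F measure over pair-counting precision P and recall R."""
        return (beta ** 2 + 1) * P * R / (beta ** 2 * P + R)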
} treat each doc as a singleton cluster at the outset
} then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all docs
} Starts with each doc in a separate cluster, then repeatedly joins the closest pair of clusters, until there is only one cluster
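A naive O(n^3) sketch of that loop (single-link similarity chosen for concreteness; all names illustrative):

    def hac(S):
        """S: (n, n) pairwise similarity matrix; returns the merge history."""
        clusters = [{i} for i in range(len(S))]
        merges = []
        while len(clusters) > 1:
            # find the closest pair of clusters (here: single-link)
            i, j = max(((a, b) for a in range(len(clusters))
                        for b in range(a + 1, len(clusters))),
                       key=lambda ab: max(S[x][y] for x in clusters[ab[0]]
                                          for y in clusters[ab[1]]))
            merges.append((clusters[i], clusters[j]))
            clusters[i] = clusters[i] | clusters[j]
            del clusters[j]
        return merges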
} Single-link
} Similarity of the most similar pair (single-link)
} Complete-link
} Similarity of the “furthest” points, the least similar pair
} Centroid
} Clusters whose centroids (centers of gravity) are the most similar
} Average-link
} Average similarities between pairs of elements
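All four criteria are implemented in standard libraries; a small sketch using SciPy's hierarchy module on made-up toy vectors:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(20, 5)  # toy doc vectors (made up)
    for method in ("single", "complete", "centroid", "average"):
        Z = linkage(X, method=method)                    # full merge history
        labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
        print(method, labels)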
} Can result in long and thin clusters due to chaining effect.
} $sim(C_i, C_j) = \max_{\vec{x} \in C_i,\, \vec{y} \in C_j} sim(\vec{x}, \vec{y})$
} After merging $C_i$ and $C_j$: $sim((C_i \cup C_j), C_k) = \max(sim(C_i, C_k),\, sim(C_j, C_k))$
} $sim(C_i, C_j) = \min_{\vec{x} \in C_i,\, \vec{y} \in C_j} sim(\vec{x}, \vec{y})$
} After merging $C_i$ and $C_j$: $sim((C_i \cup C_j), C_k) = \min(sim(C_i, C_k),\, sim(C_j, C_k))$
} The clusters of single-link clustering are the connected components of $G(s_k)$
} The clusters of complete-link clustering are the maximal cliques of $G(s_k)$
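A sketch of the single-link half of this statement, assuming $G(s_k)$ connects docs whose similarity is at least $s_k$ (networkx handles the graph bookkeeping; names illustrative):

    import networkx as nx

    def single_link_clusters(S, s_k):
        """Clusters at threshold s_k = connected components of G(s_k)."""
        G = nx.Graph()
        G.add_nodes_from(range(len(S)))
        for i in range(len(S)):
            for j in range(i + 1, len(S)):
                if S[i][j] >= s_k:
                    G.add_edge(i, j)
        return list(nx.connected_components(G))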
} Average similarity over all pairs within the merged cluster (excluding self-pairs):
$sim(C_i, C_j) = \frac{1}{|C_i \cup C_j|(|C_i \cup C_j| - 1)} \sum_{\vec{x} \in C_i \cup C_j} \; \sum_{\vec{y} \in C_i \cup C_j,\, \vec{y} \neq \vec{x}} sim(\vec{x}, \vec{y})$
} Maintain the sum of vectors in each cluster: $\vec{s}(C_j) = \sum_{\vec{x} \in C_j} \vec{x}$
} Assuming unit-length vectors, the group-average similarity of a merge can then be computed in constant time:
$sim(C_i \cup C_j) = \frac{(\vec{s}(C_i) + \vec{s}(C_j)) \cdot (\vec{s}(C_i) + \vec{s}(C_j)) - (|C_i| + |C_j|)}{(|C_i| + |C_j|)(|C_i| + |C_j| - 1)}$
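A sketch of this constant-time merge bookkeeping, assuming unit-length vectors so that $\vec{x} \cdot \vec{x} = 1$ (helper names illustrative):

    import numpy as np

    def group_average_sim(s_i, n_i, s_j, n_j):
        """Similarity of the merged cluster from cached sums s and sizes n."""
        s, n = s_i + s_j, n_i + n_j
        return (s @ s - n) / (n * (n - 1))  # subtract the n self-similarities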
} Title of the doc closest to the centroid
} Titles are easier to read than a list of terms
} However, a single doc is unlikely to be representative of all the docs in a cluster
} A list of terms with high weights in the cluster centroid
} We can use measures such as "Mutual Information"
} $I(C_k; t_i) = \sum_{c \in \{0,1\}} \sum_{t \in \{0,1\}} P(c, t) \log \frac{P(c, t)}{P(c) P(t)}$
¨ where $c$ indicates membership in cluster $k$ and $t$ the occurrence of term $i$
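A sketch of this score for one candidate label term; in_cluster and has_term are boolean arrays over the collection, and probabilities are estimated by counts (all names illustrative):

    import numpy as np

    def mutual_information(in_cluster, has_term):
        """I(C; t) between cluster membership and term occurrence."""
        mi = 0.0
        for c in (False, True):
            for t in (False, True):
                p_ct = np.mean((in_cluster == c) & (has_term == t))
                p_c, p_t = np.mean(in_cluster == c), np.mean(has_term == t)
                if p_ct > 0:
                    mi += p_ct * np.log(p_ct / (p_c * p_t))
        return mi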
} There are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of docs, ...
} IIR 16 except 16.5
} IIR 17 except 17.5