 
Clustering
CE-324: Modern Information Retrieval
Sharif University of Technology
M. Soleymani, Fall 2017
Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Ch. 16 What is clustering?
- Clustering: grouping a set of objects into groups of similar objects
  - Docs within a cluster should be similar.
  - Docs from different clusters should be dissimilar.
- The commonest form of unsupervised learning
  - Unsupervised learning: learning from raw data, as opposed to supervised learning, where a classification of examples is given
- A common and important task that finds many applications in IR and other places
Ch. 16 A data set with clear cluster structure
- How would you design an algorithm for finding the three clusters in this case?
Applications of clustering in IR
Search result clustering
yippy.com – grouping search results
Clustering the collection
- Cluster-based navigation is an interesting alternative to keyword searching (i.e., the standard IR paradigm)
- Users may prefer browsing over searching when they are unsure about which terms to use
- Well suited to a collection of news stories
  - News reading is not really search, but rather a process of selecting a subset of stories about recent events
Yahoo! Hierarchy
[Figure: tree rooted at www.yahoo.com/Science … (30), with top-level categories agriculture, biology, physics, CS, space, and lower-level nodes including dairy, crops, agronomy, forestry, botany, cell, evolution, magnetism, relativity, AI, HCI, courses, craft, missions]
- The Yahoo! hierarchy isn't clustering, but it is the kind of output you want from clustering
Google News: automatic clustering gives an effective news presentation metaphor
To improve efficiency and effectiveness of the search system
- Improve language modeling: replace the collection model used for smoothing with a model derived from the doc's cluster
- Clustering can speed up search (via an inexact algorithm)
- Clustering can improve recall
Sec. 16.1 For improving search recall
- Cluster hypothesis: Docs in the same cluster behave similarly with respect to relevance to information needs
- Therefore, to improve search recall:
  - Cluster docs in corpus a priori
  - When a query matches a doc $d$, also return other docs in the cluster containing $d$
- Query car: also return docs containing automobile
  - Because clustering grouped together docs containing car with those containing automobile.
  - Why might this happen?
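The recall-expansion step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the doc IDs, the `doc_to_cluster` mapping, and the function name `expand_with_clusters` are hypothetical, and the clustering is taken as already computed.

```python
from collections import defaultdict

def expand_with_clusters(matched_docs, doc_to_cluster):
    """Return the matched docs plus every doc that shares a cluster with one of them."""
    # Invert the doc -> cluster mapping into cluster -> set of docs.
    cluster_to_docs = defaultdict(set)
    for doc, cluster in doc_to_cluster.items():
        cluster_to_docs[cluster].add(doc)

    expanded = set(matched_docs)
    for doc in matched_docs:
        expanded |= cluster_to_docs[doc_to_cluster[doc]]
    return expanded

# Hypothetical toy data: d1 and d2 contain "car", d3 contains only "automobile",
# and an a priori clustering has placed all three in cluster 0.
doc_to_cluster = {"d1": 0, "d2": 0, "d3": 0, "d4": 1}
matched = {"d1", "d2"}  # docs matching the query "car"
result = expand_with_clusters(matched, doc_to_cluster)
```

Here `result` also contains `d3`, which the keyword match alone would have missed, while `d4` stays out because it lies in a different cluster.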
Sec. 16.2 Issues for clustering
- Representation for clustering
  - Doc representation
    - Vector space? Normalization?
    - Centroids aren't length normalized
  - Need a notion of similarity/distance
- How many clusters?
  - Fixed a priori?
  - Completely data driven?
    - Avoid "trivial" clusters: too large or too small
    - Too large: for navigation purposes you've wasted an extra user click without whittling down the set of docs much
Notion of similarity/distance
- Ideal: semantic similarity.
- Practical: term-statistical similarity
  - We will use cosine similarity.
- For many algorithms, it is easier to think in terms of a distance (rather than a similarity)
  - We will mostly speak of Euclidean distance
  - But real implementations use cosine similarity
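One reason the two views are interchangeable is that, for length-normalized vectors, squared Euclidean distance equals $2(1 - \cos)$, so ranking by distance and ranking by cosine similarity agree. The sketch below checks this numerically; the two term-weight vectors are hypothetical and the helper names are chosen here for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def cosine_similarity(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Hypothetical term-weight vectors for two docs.
doc_a = [3.0, 1.0, 0.0]
doc_b = [1.0, 2.0, 2.0]

cos_ab = cosine_similarity(doc_a, doc_b)
dist_sq = squared_euclidean(normalize(doc_a), normalize(doc_b))
# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b)), up to rounding.
```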
Clustering algorithms categorization
- Flat algorithms (K-means)
  - Usually start with a random (partial) partitioning
  - Refine it iteratively
- Hierarchical algorithms
  - Bottom-up, agglomerative
  - Top-down, divisive
Hard vs. soft clustering
- Hard clustering: Each doc belongs to exactly one cluster
  - More common and easier to do
- Soft clustering: A doc can belong to more than one cluster.
Partitioning algorithms
- Construct a partition of $N$ docs into $K$ clusters
  - Given: a set of docs and the number $K$
  - Find: a partition of docs into $K$ clusters that optimizes the chosen partitioning criterion
- Finding a global optimum is intractable for many objective functions of clustering
- Effective heuristic methods: K-means and K-medoids algorithms
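Sec. 16.4 K-means
- Assumes docs are real-valued vectors $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}$.
- Clusters are based on centroids (aka the center of gravity or mean) of the points in a cluster:
  $$\boldsymbol{\mu}_j = \frac{1}{|\mathcal{C}_j|} \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_j} \mathbf{x}^{(i)}$$
- K-means cost function:
  $$J(\mathcal{C}) = \sum_{j=1}^{K} \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_j} \left\| \mathbf{x}^{(i)} - \boldsymbol{\mu}_j \right\|^2$$
  - $\mathcal{C} = \{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_K\}$
  - $\mathcal{C}_j$: the set of data points assigned to the j-th cluster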
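As a small worked instance of the centroid and cost definitions, the following sketch computes both for a hand-made clustering. The 2-D points and the function names are assumptions made for illustration; docs would normally be high-dimensional term vectors.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def sq_dist(x, mu):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def kmeans_cost(clusters):
    """Sum over clusters of squared distances from each point to its cluster centroid."""
    total = 0.0
    for cluster in clusters:
        mu = centroid(cluster)
        total += sum(sq_dist(x, mu) for x in cluster)
    return total

# Hypothetical clustering of four 2-D points into two clusters.
clusters = [
    [[0.0, 0.0], [2.0, 0.0]],  # centroid (1, 0); each point at squared distance 1
    [[5.0, 5.0], [5.0, 7.0]],  # centroid (5, 6); each point at squared distance 1
]
cost = kmeans_cost(clusters)  # 1 + 1 + 1 + 1 = 4
```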
Sec. 16.4 K-means algorithm
Select $K$ random points $\{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_K\}$ as the clusters' initial centroids.
Until clustering converges (or other stopping criterion):
    For each doc $\mathbf{x}^{(i)}$:
        Assign $\mathbf{x}^{(i)}$ to the cluster $\mathcal{C}_j$ such that $dist(\mathbf{x}^{(i)}, \boldsymbol{\mu}_j)$ is minimal.
    For each cluster $\mathcal{C}_j$:
        $\boldsymbol{\mu}_j = \frac{1}{|\mathcal{C}_j|} \sum_{i \in \mathcal{C}_j} \mathbf{x}^{(i)}$
Reassignment of instances to clusters is based on distance to the current cluster centroids (can equivalently be in terms of similarities)
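The pseudocode above translates fairly directly into plain Python. This is a minimal sketch rather than a reference implementation: it assumes squared Euclidean distance, seeds with K distinct docs drawn at random, stops when the partition is unchanged or after `max_iters` iterations, and keeps the old centroid if a cluster becomes empty (a choice made here, not stated in the slides).

```python
import random

def sq_dist(x, mu):
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def mean(points):
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def kmeans(docs, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Initial centroids: k distinct docs chosen at random.
    centroids = [list(p) for p in rng.sample(docs, k)]
    assignment = None
    for _ in range(max_iters):
        # Assignment step: nearest centroid; min() breaks ties by lowest index.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(x, centroids[j])) for x in docs
        ]
        if new_assignment == assignment:
            break  # doc partition unchanged
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [x for x, a in zip(docs, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    return assignment, centroids

# Hypothetical 2-D data with two well-separated groups.
docs = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignment, centroids = kmeans(docs, k=2)
```

On this toy input the two groups are recovered whichever pair of docs is drawn as seeds; on less separated data the result can depend on the seeds.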
[Figure omitted; source: Bishop]
Sec. 16.4 Termination conditions
- Several possibilities for the termination condition, e.g.,
  - A fixed number of iterations
  - Doc partition unchanged
  - $J < \theta$: the cost function falls below a threshold
  - $\Delta J < \theta$: the decrease in the cost function (in two successive iterations) falls below a threshold
Sec. 16.4 Convergence of K-means
- The K-means algorithm always reaches a fixed point in which the clusters don't change.
- We must use tie-breaking when a sample has the same distance from two or more clusters (e.g., by assigning it to the lower-index cluster)
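Sec. 16.4 K-means decreases $J(\mathcal{C})$ in each iteration (before convergence)
- First, reassignment monotonically decreases $J(\mathcal{C})$ since each vector is assigned to the closest centroid.
- Second, recomputation monotonically decreases each $\sum_{i \in \mathcal{C}_k} \left\| \mathbf{x}^{(i)} - \boldsymbol{\mu}_k \right\|^2$:
  - $\sum_{i \in \mathcal{C}_k} \left\| \mathbf{x}^{(i)} - \boldsymbol{\mu}_k \right\|^2$ reaches its minimum for $\boldsymbol{\mu}_k = \frac{1}{|\mathcal{C}_k|} \sum_{i \in \mathcal{C}_k} \mathbf{x}^{(i)}$
- K-means typically converges quickly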
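The claim that the mean minimizes the within-cluster sum of squares follows from setting the gradient to zero. A short derivation, written for a single cluster $\mathcal{C}_k$ with a free centroid variable $\boldsymbol{\mu}$ (the function name $f$ is introduced here only for the derivation):

```latex
\begin{align*}
f(\boldsymbol{\mu}) &= \sum_{i \in \mathcal{C}_k} \left\| \mathbf{x}^{(i)} - \boldsymbol{\mu} \right\|^2 \\
\nabla_{\boldsymbol{\mu}} f(\boldsymbol{\mu}) &= -2 \sum_{i \in \mathcal{C}_k} \left( \mathbf{x}^{(i)} - \boldsymbol{\mu} \right) = \mathbf{0} \\
\Rightarrow \quad |\mathcal{C}_k| \, \boldsymbol{\mu} &= \sum_{i \in \mathcal{C}_k} \mathbf{x}^{(i)}
\quad \Rightarrow \quad
\boldsymbol{\mu} = \frac{1}{|\mathcal{C}_k|} \sum_{i \in \mathcal{C}_k} \mathbf{x}^{(i)}
\end{align*}
```

Since $f$ is a convex quadratic in $\boldsymbol{\mu}$, this stationary point is the global minimum, so replacing the old centroid by the mean cannot increase the cluster's contribution to the cost.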
Sec. 16.4 Time complexity of K-means
- Computing the distance between two docs: $O(M)$
  - $M$ is the dimensionality of the vectors.
- Reassigning clusters: $O(KN)$ distance computations ⇒ $O(KNM)$.
- Computing centroids: each doc gets added once to some centroid: $O(NM)$.
- Assume these two steps are each done once per iteration, for $I$ iterations: $O(IKNM)$.
Sec. 16.4 Seed choice
- Results can vary based on random selection of initial centroids.
- Some initializations get poor convergence rate, or convergence to sub-optimal clustering
  - Try out multiple starting points
  - Select good seeds using a heuristic (e.g., doc least similar to any existing mean)
  - Initialize with the results of another method.
- Example showing sensitivity to seeds:
  - If you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}
  - If you start with D and F, you converge to {A,B,D,E} and {C,F}
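One way to read the "doc least similar to any existing mean" heuristic is as farthest-first selection: repeatedly pick the doc whose nearest already-chosen seed is farthest away. The sketch below is an interpretation under that assumption, using squared Euclidean distance and starting from the first doc; the data and the function name are hypothetical.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def farthest_first_seeds(docs, k):
    """Pick k seeds; each new seed is the doc farthest from its nearest existing seed."""
    seeds = [docs[0]]  # assumption: start from the first doc (could be random instead)
    while len(seeds) < k:
        next_seed = max(docs, key=lambda x: min(sq_dist(x, s) for s in seeds))
        seeds.append(next_seed)
    return seeds

# Hypothetical 2-D docs forming three loose groups.
docs = [[0, 0], [1, 0], [10, 0], [10, 1], [5, 9]]
seeds = farthest_first_seeds(docs, k=3)
```

Such seeds tend to be spread across the data, but the heuristic is sensitive to outliers, so it is often combined with multiple restarts.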
Sec. 16.4 K-means issues, variations, etc.
- Computes the centroid only after all points are reassigned
  - Instead, we can recompute the centroid after every assignment
  - It can improve speed of convergence of K-means
- Assumes clusters are spherical in vector space
  - Sensitive to coordinate changes, weighting etc.
- Disjoint and exhaustive
  - Doesn't have a notion of "outliers" by default
  - But can add outlier filtering
- Dhillon et al. ICDM 2002: variation to fix some issues with small document clusters
How many clusters?
- Number of clusters $K$ is given
  - Partition n docs into a predetermined number of clusters
- Finding the "right" number is part of the problem
  - Given docs, partition into an "appropriate" no. of subsets.
  - E.g., for query results: ideal value of K not known up front, though UI may impose limits.
How many clusters?
[Figure panels: How many clusters? / Two Clusters / Four Clusters / Six Clusters]
Selecting k
K not specified in advance
- Tradeoff between having better focus within each cluster and having too many clusters
- Solve an optimization problem: penalize having lots of clusters
  - application dependent
  - e.g., compressed summary of search results list.
  $$k^* = \arg\min_k \left[ J_{\min}(k) + \lambda k \right]$$
- $J_{\min}(k)$: the minimum value of $J(\{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_k\})$ obtained in, e.g., 100 runs of K-means (with different initializations)
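The selection rule can be sketched as a one-line minimization once the best cost per $k$ is known. The cost values and the penalty weight below are hypothetical numbers chosen for illustration, and `select_k` is a name introduced here; in practice each $J_{\min}(k)$ would come from repeated K-means runs.

```python
def select_k(j_min, lam):
    """Return the k minimizing j_min[k] + lam * k (ties go to the smaller k)."""
    return min(sorted(j_min), key=lambda k: j_min[k] + lam * k)

# Hypothetical best-of-many-runs costs J_min(k) for k = 1..5.
j_min = {1: 100.0, 2: 40.0, 3: 15.0, 4: 12.0, 5: 10.0}
lam = 5.0  # hypothetical penalty per cluster

# Penalized scores: k=1 -> 105, k=2 -> 50, k=3 -> 30, k=4 -> 32, k=5 -> 35
k_star = select_k(j_min, lam)
```

With these numbers the cost keeps falling as $k$ grows, but beyond $k = 3$ the improvement is smaller than the penalty, so the penalized score is lowest at $k = 3$.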
Penalize lots of clusters
- Benefit for a doc: cosine similarity to its centroid
- Total Benefit: sum of the individual doc Benefits.
  - Why is there always a clustering of Total Benefit n?
- For each cluster, we have a Cost C.
  - For K clusters, the Total Cost is KC.
- Value of a clustering = Total Benefit - Total Cost.
- Find the clustering of highest value, over all choices of K.
  - Total Benefit increases with increasing K.
  - But can stop when it doesn't increase by "much". The Cost term enforces this.
Sec. 16.3 What is a good clustering?
- Internal criterion:
  - intra-class (that is, intra-cluster) similarity is high
  - inter-class similarity is low
- The measured quality of a clustering depends on both the doc representation and the similarity measure