

  1. Clustering CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Spring 2020 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

  2. Ch. 16 What is clustering?
     - Clustering: grouping a set of objects into clusters of similar ones
       - Docs within a cluster should be similar.
       - Docs from different clusters should be dissimilar.
     - The commonest form of unsupervised learning
     - Unsupervised learning: learning from raw data, as opposed to supervised learning, where a classification of the examples is given
     - A common and important task that finds many applications in IR and other places

  3. Ch. 16 A data set with clear cluster structure
     - How would you design an algorithm for finding the three clusters in this case?

  4. Sec. 16.1 Applications of clustering in IR
     - For better navigation of search results
       - Effective "user recall" will be higher
     - Whole corpus analysis/navigation
       - Better user interface: search without typing
     - For improving recall in search applications
       - Better search results (like pseudo relevance feedback)
     - For speeding up vector space retrieval
       - Cluster-based retrieval gives faster search

  5. Applications of clustering in IR

  6. Search result clustering

  7. yippy.com – grouping search results

  8. Clustering the collection
     - Cluster-based navigation is an interesting alternative to keyword searching (i.e., the standard IR paradigm)
     - Users may prefer browsing over searching when they are unsure about which terms to use
     - Well suited to a collection of news stories
       - News reading is not really search, but rather a process of selecting a subset of stories about recent events

  9. Google News: automatic clustering gives an effective news presentation metaphor


  11. To improve efficiency and effectiveness of the search system
      - Improve language modeling: replace the collection model used for smoothing with a model derived from the doc's cluster (a sketch follows this slide)
      - Clustering can speed up search (via an inexact algorithm)
      - Clustering can improve recall
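
A minimal sketch (not part of the slides) of the first bullet: smoothing a document language model with the model of the doc's cluster rather than the whole collection. The names smoothed_prob, doc_tf, cluster_tf, and lam are all illustrative.

```python
def smoothed_prob(term, doc_tf, cluster_tf, lam=0.7):
    """P(term | doc): interpolate the doc's maximum-likelihood estimate with a
    model derived from the doc's cluster (lam is an illustrative mixing weight)."""
    p_doc = doc_tf.get(term, 0) / max(sum(doc_tf.values()), 1)
    p_cluster = cluster_tf.get(term, 0) / max(sum(cluster_tf.values()), 1)
    return lam * p_doc + (1 - lam) * p_cluster
```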

  12. Sec. 16.1 For improving search recall
      - Cluster hypothesis: docs in the same cluster behave similarly with respect to relevance to information needs
      - Therefore, to improve search recall:
        - Cluster docs in the corpus a priori
        - When a query matches a doc d, also return other docs in the cluster containing d (see the sketch after this slide)
      - Query "car": also return docs containing "automobile"
        - Because clustering grouped together docs containing car with those containing automobile. Why might this happen?
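
A small illustrative sketch of the recall idea above, assuming the corpus has already been clustered; assignments maps each doc id to its cluster id (all names here are hypothetical).

```python
def expand_with_clusters(matched_docs, assignments):
    """Return the matched docs plus every other doc sharing a cluster with one of them."""
    clusters_hit = {assignments[d] for d in matched_docs}
    return [d for d, c in assignments.items() if c in clusters_hit]

# Example: a query matching only doc 1 also brings back doc 2 from the same cluster.
assignments = {1: "c0", 2: "c0", 3: "c1"}
print(expand_with_clusters([1], assignments))  # [1, 2]
```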

  13. Sec. 16.2 Issues for clustering
      - Representation for clustering
        - Doc representation: vector space? normalization?
          - Centroids aren't length normalized
        - Need a notion of similarity/distance
      - How many clusters?
        - Fixed a priori? Completely data driven?
        - Avoid "trivial" clusters - too large or small
          - Too large: for navigation purposes you've wasted an extra user click without whittling down the set of docs much

  14. Notion of similarity/distance
      - Ideal: semantic similarity
      - Practical: term-statistical similarity
        - We will use cosine similarity.
      - For many algorithms, it is easier to think in terms of a distance (rather than a similarity)
        - We will mostly speak of Euclidean distance, but real implementations use cosine similarity (see the sketch after this slide)
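
A quick check of the last point, assuming vectors are length-normalized: Euclidean distance and cosine similarity then induce the same ordering, since ‖x − y‖² = 2 − 2·cos(x, y). The snippet below is illustrative, not from the slides.

```python
import numpy as np

def cosine_sim(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)   # length-normalize
# For unit vectors, squared Euclidean distance = 2 - 2 * cosine similarity.
assert np.isclose(np.sum((x - y) ** 2), 2 - 2 * cosine_sim(x, y))
```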

  15. Clustering algorithms categorization
      - Flat algorithms (e.g., K-means)
        - Usually start with a random (partial) partitioning
        - Refine it iteratively
      - Hierarchical algorithms
        - Bottom-up, agglomerative
        - Top-down, divisive

  16. Hard vs. soft clustering
      - Hard clustering: each doc belongs to exactly one cluster
        - More common and easier to do
      - Soft clustering: a doc can belong to more than one cluster

  17. Partitioning algorithms
      - Construct a partition of n docs into K clusters
      - Given: a set of docs and the number K
      - Find: a partition of the docs into K clusters that optimizes the chosen partitioning criterion
      - Finding a global optimum is intractable for many clustering objective functions
      - Effective heuristic methods: K-means and K-medoids algorithms

  18. K-means clustering
      - Input: data {x^(1), ..., x^(n)} and the number of clusters K
      - Output: a partition D = {D_1, ..., D_K}
      - Optimization problem:
            J(D) = Σ_{k=1}^{K} Σ_{x^(i) ∈ D_k} ‖x^(i) − μ_k‖²
        where μ_k is the representative (centroid) of cluster D_k
      - This is an NP-hard problem in general.

  19. Sec. 16.4 K-means
      - Assumes docs are real-valued vectors x^(1), ..., x^(n)
      - Clusters based on the centroids (aka the center of gravity or mean) of the points in a cluster:
            μ_k = (1 / |D_k|) Σ_{x^(i) ∈ D_k} x^(i)
      - K-means cost function (a sketch follows this slide):
            J(D) = Σ_{k=1}^{K} Σ_{x^(i) ∈ D_k} ‖x^(i) − μ_k‖²
      - D = {D_1, D_2, ..., D_K}; D_k is the set of data points assigned to the k-th cluster
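
The centroid and cost function above, written out as a small sketch (assuming docs are rows of a NumPy array X and assignments[i] gives the cluster index of doc i; these names are mine, not the slides').

```python
import numpy as np

def centroid(X, members):
    """Mean of the docs in one cluster; `members` indexes rows of X."""
    return X[members].mean(axis=0)

def kmeans_cost(X, assignments, centroids):
    """J(D): sum of squared distances of every doc to its cluster's centroid."""
    return float(sum(np.sum((X[i] - centroids[k]) ** 2)
                     for i, k in enumerate(assignments)))
```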

  20. Sec. 16.4 K-means algorithm
      Select K random points {μ_1, μ_2, ..., μ_K} as the clusters' initial centroids.
      Until the clustering converges (or another stopping criterion is met):
          For each doc x^(i):
              Assign x^(i) to the cluster D_k whose centroid μ_k minimizes dist(x^(i), μ_k).
          For each cluster D_k:
              μ_k = (1 / |D_k|) Σ_{x^(i) ∈ D_k} x^(i)
      - Reassignment of instances to clusters is based on distance to the current cluster centroids (it can equivalently be phrased in terms of similarities); a sketch follows this slide
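
A compact sketch of the algorithm above in NumPy, stopping when the partition no longer changes (the variable names and the fixed iteration cap are my choices).

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Basic K-means: X is an (n, m) float array of doc vectors.
    Returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()  # K random docs as seeds
    assignments = None
    for _ in range(n_iters):
        # Assignment step: each doc goes to the cluster with the closest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # doc partition unchanged: converged
        assignments = new_assignments
        # Update step: each centroid becomes the mean of the docs assigned to it.
        for k in range(K):
            members = X[assignments == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return assignments, centroids
```

Later sketches in this section (restarts, selecting K) reuse this kmeans() together with kmeans_cost() from the previous sketch.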

  21. [Figure from Bishop: K-means iterations]


  23. Sec. 16.4 Termination conditions
      - Several possibilities for the termination condition, e.g. (a small sketch follows this slide):
        - A fixed number of iterations
        - Doc partition unchanged
        - J < θ: the cost function falls below a threshold
        - ΔJ < θ: the decrease in the cost function (between two successive iterations) falls below a threshold
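
The criteria above, folded into one illustrative helper (the threshold theta is a made-up default):

```python
def should_stop(iteration, max_iters, prev_cost, cost, theta=1e-4):
    """Stop on an iteration budget, a small cost, or a small decrease in cost."""
    return (iteration >= max_iters
            or cost < theta
            or (prev_cost is not None and prev_cost - cost < theta))
```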

  24. Sec. 16.4 Convergence of K-means
      - The K-means algorithm always reaches a fixed point in which the clusters no longer change.
      - We must break ties when a sample is at the same distance from two or more clusters (e.g., by assigning it to the cluster with the lower index).

  25. Sec. 16.4 K-means decreases J(D) in each iteration (before convergence)
      - First, the reassignment step monotonically decreases J(D), since each vector is assigned to its closest centroid.
      - Second, the recomputation step monotonically decreases each Σ_{x^(i) ∈ D_k} ‖x^(i) − μ_k‖², since this sum reaches its minimum for μ_k = (1 / |D_k|) Σ_{x^(i) ∈ D_k} x^(i) (see the derivation after this slide).
      - K-means typically converges quickly.
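
A one-line check (standard calculus, not spelled out on the slide) of why the mean minimizes each cluster's term:

```latex
\frac{\partial}{\partial \boldsymbol{\mu}_k}
  \sum_{\mathbf{x}^{(i)} \in D_k} \lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}_k \rVert^2
  = -2 \sum_{\mathbf{x}^{(i)} \in D_k} \left( \mathbf{x}^{(i)} - \boldsymbol{\mu}_k \right) = 0
  \;\Longrightarrow\;
  \boldsymbol{\mu}_k = \frac{1}{\lvert D_k \rvert} \sum_{\mathbf{x}^{(i)} \in D_k} \mathbf{x}^{(i)}
```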

  26. Sec. 16.4 Time complexity of K-means
      - Computing the distance between two docs: O(M), where M is the dimensionality of the vectors
      - Reassigning clusters: O(KN) distance computations, i.e., O(KNM)
      - Computing centroids: each doc gets added once to some centroid, i.e., O(NM)
      - Assume these two steps are each done once in each of I iterations: O(IKNM)
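
A back-of-the-envelope instance of O(IKNM) with made-up numbers, just to give a feel for the magnitude:

```python
I, K, N, M = 10, 100, 1_000_000, 10_000   # iterations, clusters, docs, vector dimensionality
print(f"~{I * K * N * M:.1e} basic operations")   # ~1.0e+13
```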

  27. Sec. 16.4 Seed choice
      - Results can vary based on the random selection of initial centroids
        - Some initializations give a poor convergence rate, or convergence to a sub-optimal clustering
      - Remedies (a sketch follows this slide):
        - Exclude outliers from the seed set
        - Try out multiple starting points and choose the clustering with the lowest cost
        - Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
        - Obtain seeds from another method, such as hierarchical clustering
      - Example showing sensitivity to seeds (points A-F): if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F, you converge to {A,B,D,E} and {C,F}
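
One of the remedies above ("try out multiple starting points and choose the clustering with the lowest cost"), sketched on top of the kmeans() and kmeans_cost() helpers from the earlier sketches:

```python
import numpy as np

def kmeans_best_of(X, K, n_restarts=10, seed=0):
    """Run K-means from several random seeds and keep the lowest-cost result."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        assignments, centroids = kmeans(X, K, seed=int(rng.integers(1 << 31)))
        cost = kmeans_cost(X, assignments, centroids)
        if best is None or cost < best[0]:
            best = (cost, assignments, centroids)
    return best  # (cost, assignments, centroids)
```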

  28. Sec. 16.4 K-means issues, variations, etc.
      - Computes the centroids after all points are reassigned
        - Instead, we can recompute a centroid after every single assignment (see the sketch after this slide)
        - This can improve the speed of convergence of K-means
      - Assumes clusters are spherical in vector space
        - Sensitive to coordinate changes, weighting, etc.
      - Disjoint and exhaustive
        - Doesn't have a notion of "outliers" by default, but outlier filtering can be added
      - Dhillon et al., ICDM 2002: a variation to fix some issues with small document clusters
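
An illustrative incremental variant of the first bullet: when a doc is reassigned, both affected centroids can be updated exactly in O(M) instead of recomputing all centroids from scratch (the names and bookkeeping are my own, not the slides').

```python
def move_doc(x, from_k, to_k, centroids, counts):
    """Exactly update the two affected centroids when doc vector x (a NumPy array)
    moves from cluster from_k to cluster to_k; counts[k] is the size of cluster k."""
    if counts[from_k] > 1:
        centroids[from_k] = (counts[from_k] * centroids[from_k] - x) / (counts[from_k] - 1)
    counts[from_k] -= 1
    centroids[to_k] = (counts[to_k] * centroids[to_k] + x) / (counts[to_k] + 1)
    counts[to_k] += 1
```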

  29. How many clusters?
      - Number of clusters K is given
        - Partition n docs into a predetermined number of clusters
      - Finding the "right" number is part of the problem
        - Given the docs, partition them into an "appropriate" number of subsets
        - E.g., for query results: the ideal value of K is not known up front, though the UI may impose limits

  30. How many clusters? [Figure: the same data shown as two, four, and six clusters]

  31. Selecting K
      - Is it possible by assessing the cost function for different numbers of clusters?
      - Keep adding clusters until adding more no longer decreases the cost significantly (e.g., by finding knees in the cost-vs-K curve)

  32. K not specified in advance
      - Tradeoff between having better focus within each cluster and having too many clusters
      - Solve an optimization problem: penalize having lots of clusters
        - The penalty is application dependent
        - E.g., a compressed summary of a search results list
      - K* = argmin_K [ J_min(K) + λK ]
        - J_min(K): the minimum value of J({D_1, D_2, ..., D_K}) obtained in, e.g., 100 runs of K-means (with different initializations)
      - A sketch follows this slide.
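
A direct transcription of the penalized objective above, reusing the kmeans_best_of() sketch; lam corresponds to the application-dependent penalty weight λ on the slide.

```python
def select_k(X, k_values, lam=1.0, n_restarts=10):
    """K* = argmin over K of ( J_min(K) + lam * K ), with J_min(K) estimated as the
    best cost found over several random restarts of K-means."""
    scores = {}
    for K in k_values:
        cost, _, _ = kmeans_best_of(X, K, n_restarts=n_restarts)
        scores[K] = cost + lam * K
    return min(scores, key=scores.get)
```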

  33. Sec. 16.3 What is a good clustering?
      - Internal criterion:
        - intra-class (that is, intra-cluster) similarity is high
        - inter-class similarity is low
      - The measured quality of a clustering depends on both the doc representation and the similarity measure
