
Clustering - CE-324: Modern Information Retrieval (Sharif University of Technology)


  1. Clustering
     CE-324: Modern Information Retrieval
     Sharif University of Technology
     M. Soleymani, Fall 2017
     Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

  2. What is clustering? (Ch. 16)
     - Clustering: grouping a set of objects into clusters of similar objects
       - Docs within a cluster should be similar.
       - Docs from different clusters should be dissimilar.
     - The commonest form of unsupervised learning
       - Unsupervised learning: learning from raw data, as opposed to supervised learning, where a labeling of the examples is given
     - A common and important task that finds many applications in IR and elsewhere

  3. A data set with clear cluster structure (Ch. 16)
     [Scatter plot of points forming three visible clusters]
     - How would you design an algorithm for finding the three clusters in this case?

  4. Applications of clustering in IR

  5. Search result clustering

  6. yippy.com – grouping search results

  7. Clustering the collection
     - Cluster-based navigation is an interesting alternative to keyword searching (i.e., the standard IR paradigm).
       - Users may prefer browsing over searching when they are unsure about which terms to use.
     - Well suited to a collection of news stories
       - News reading is not really search, but rather a process of selecting a subset of stories about recent events.

  8. Yahoo! hierarchy (www.yahoo.com/Science)
     [Figure: a topic tree rooted at Science, with children such as agriculture, biology, physics, CS, and space, each with further subcategories, e.g., dairy, crops, agronomy, forestry; botany, cell, evolution; magnetism, relativity; AI, HCI, courses; craft, missions]
     - The Yahoo! hierarchy isn't clustering, but it is the kind of output you want from clustering.

  9. Google News: automatic clustering gives an effective news presentation metaphor

  10. [Figure-only slide]

  11. To improve efficiency and effectiveness of the search system
      - Improve language modeling: replace the collection model used for smoothing with a model derived from the doc's cluster.
      - Clustering can speed up search (via an inexact algorithm).
      - Clustering can improve recall.

  12. For improving search recall (Sec. 16.1)
      - Cluster hypothesis: docs in the same cluster behave similarly with respect to relevance to information needs.
      - Therefore, to improve search recall:
        - Cluster the docs in the corpus a priori.
        - When a query matches a doc d, also return the other docs in the cluster containing d.
      - Query "car": also return docs containing "automobile",
        because clustering grouped docs containing "car" together with those containing "automobile". Why might this happen?
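A minimal sketch of this recall trick, assuming the cluster assignments have been computed offline. The data structures `cluster_of` and `members_of` are hypothetical placeholders for whatever the search system stores, not something from the slides.

```python
from collections import defaultdict

def expand_with_clusters(matched_doc_ids, cluster_of, members_of):
    """Given the doc ids matched by a query, also return their cluster-mates.

    cluster_of: dict mapping doc_id -> cluster_id (computed a priori)
    members_of: dict mapping cluster_id -> list of doc_ids in that cluster
    """
    expanded = set(matched_doc_ids)
    for d in matched_doc_ids:
        expanded.update(members_of[cluster_of[d]])
    return expanded

# Toy example: the "car" doc and the "automobile" doc ended up in the same cluster.
cluster_of = {"d_car": 0, "d_automobile": 0, "d_other": 1}
members_of = defaultdict(list)
for doc, c in cluster_of.items():
    members_of[c].append(doc)

print(expand_with_clusters({"d_car"}, cluster_of, members_of))
# {'d_car', 'd_automobile'} (set order may vary)
```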

  13. Issues for clustering (Sec. 16.2)
      - Representation for clustering
        - Doc representation: vector space? normalization?
          - Centroids aren't length-normalized.
        - Need a notion of similarity/distance.
      - How many clusters?
        - Fixed a priori?
        - Completely data driven?
        - Avoid "trivial" clusters, i.e., ones that are too large or too small.
          - Too large: for navigation purposes, you've wasted an extra user click without whittling down the set of docs much.

  14. Notion of similarity/distance
      - Ideal: semantic similarity
      - Practical: term-statistical similarity
        - We will use cosine similarity.
      - For many algorithms, it is easier to think in terms of a distance (rather than a similarity).
        - We will mostly speak of Euclidean distance, but real implementations use cosine similarity.
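A small numeric sketch (not from the slides) of why the two views coexist: for length-normalized vectors, squared Euclidean distance is a monotone function of cosine similarity, ||x - y||^2 = 2(1 - cos(x, y)), so nearest neighbors are the same under either measure. The helper name `cosine` is just illustrative.

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(0)
x, y = rng.random(5), rng.random(5)

# Length-normalize, as is typical for doc vectors in the vector space model.
x_hat, y_hat = x / np.linalg.norm(x), y / np.linalg.norm(y)

lhs = np.sum((x_hat - y_hat) ** 2)     # squared Euclidean distance
rhs = 2 * (1 - cosine(x_hat, y_hat))   # 2 * (1 - cosine similarity)
print(np.isclose(lhs, rhs))            # True
```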

  15. Clustering algorithms: categorization
      - Flat algorithms (e.g., k-means)
        - Usually start with a random (partial) partitioning
        - Refine it iteratively
      - Hierarchical algorithms
        - Bottom-up (agglomerative)
        - Top-down (divisive)

  16. Hard vs. soft clustering
      - Hard clustering: each doc belongs to exactly one cluster.
        - More common and easier to do
      - Soft clustering: a doc can belong to more than one cluster.

  17. Partitioning algorithms
      - Construct a partition of N docs into K clusters.
        - Given: a set of docs and the number K
        - Find: a partition of the docs into K clusters that optimizes the chosen partitioning criterion
      - Finding a global optimum is intractable for many clustering objective functions.
      - Effective heuristic methods: K-means and K-medoids algorithms

  18. K-means (Sec. 16.4)
      - Assumes docs are real-valued vectors x^(1), ..., x^(N).
      - Clusters are based on centroids (aka the center of gravity or mean) of the points in a cluster:
        \mu_k = \frac{1}{|\omega_k|} \sum_{x^{(i)} \in \omega_k} x^{(i)}
      - K-means cost function:
        J(\Omega) = \sum_{k=1}^{K} \sum_{x^{(i)} \in \omega_k} \left\| x^{(i)} - \mu_k \right\|^2
        where \Omega = \{\omega_1, \omega_2, ..., \omega_K\} and \omega_k is the set of data points assigned to the k-th cluster.
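A short NumPy sketch (an illustration, not code from the slides) of this cost function. `X` is assumed to be an N x M matrix of doc vectors, `assign` holds each doc's cluster index, and `centroids` is a K x M matrix of cluster means.

```python
import numpy as np

def kmeans_cost(X, assign, centroids):
    """J(Omega): sum over all docs of the squared distance to their cluster centroid."""
    diffs = X - centroids[assign]       # row i is x^(i) - mu_{assign[i]}
    return float(np.sum(diffs ** 2))

# Tiny example with two obvious clusters on a line.
X = np.array([[0.0], [0.2], [5.0], [5.2]])
assign = np.array([0, 0, 1, 1])
centroids = np.array([[0.1], [5.1]])
print(kmeans_cost(X, assign, centroids))   # ~0.04
```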

  19. K-means algorithm (Sec. 16.4)
      Select K random points {\mu_1, \mu_2, ..., \mu_K} as the clusters' initial centroids.
      Until the clustering converges (or another stopping criterion is met):
        For each doc x^(i):
          Assign x^(i) to the cluster \omega_k such that dist(x^(i), \mu_k) is minimal.
        For each cluster \omega_k:
          \mu_k = \frac{1}{|\omega_k|} \sum_{x^{(i)} \in \omega_k} x^{(i)}
      - Reassignment of instances to clusters is based on distance to the current cluster centroids (it can equivalently be phrased in terms of similarities).
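A compact NumPy sketch of the loop above, under the assumptions that Euclidean distance is used and that iteration stops once the doc partition no longer changes; seeding and empty-cluster handling are deliberately simplified.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-means on an (N, M) array X of doc vectors; returns (assign, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # K random docs as seeds
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Reassignment step: each doc goes to its closest centroid
        # (argmin breaks ties toward the lower cluster index).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):   # partition unchanged -> converged
            break
        assign = new_assign
        # Recomputation step: each centroid becomes the mean of its assigned docs.
        for k in range(K):
            if np.any(assign == k):
                centroids[k] = X[assign == k].mean(axis=0)
    return assign, centroids

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
assign, centroids = kmeans(X, K=2)
print(assign)   # e.g., [0 0 1 1] (cluster labels may be permuted)
```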

  20. [Figure-only slide: illustration from Bishop]

  21. [Figure-only slide]

  22. Termination conditions (Sec. 16.4)
      - Several possibilities for the termination condition, e.g.:
        - A fixed number of iterations
        - The doc partition is unchanged.
        - J < \theta: the cost function falls below a threshold.
        - \Delta J < \theta: the decrease in the cost function between two successive iterations falls below a threshold.
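A minimal sketch of the last two checks. The function and the threshold name `theta` are illustrative; it assumes the cost J has been recorded after each iteration.

```python
def converged(cost_history, theta=1e-4):
    """Stop when the cost, or its decrease over one iteration, falls below theta."""
    if not cost_history:
        return False
    if cost_history[-1] < theta:                             # J < theta
        return True
    if len(cost_history) >= 2:
        return cost_history[-2] - cost_history[-1] < theta   # delta J < theta
    return False

print(converged([10.0, 4.0, 3.99995]))   # True: the last decrease is below theta
```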

  23. Convergence of K-means (Sec. 16.4)
      - The K-means algorithm always reaches a fixed point in which the clusters don't change.
      - We must use tie-breaking when a sample is at the same distance from two or more clusters (e.g., by assigning it to the cluster with the lower index).

  24. K-means decreases J(\Omega) in each iteration before convergence (Sec. 16.4)
      - First, reassignment monotonically decreases J(\Omega), since each vector is assigned to its closest centroid.
      - Second, recomputation monotonically decreases each \sum_{x^{(i)} \in \omega_k} \| x^{(i)} - \mu_k \|^2, since this sum reaches its minimum for \mu_k = \frac{1}{|\omega_k|} \sum_{x^{(i)} \in \omega_k} x^{(i)}.
      - K-means typically converges quickly.

  25. Time complexity of K-means (Sec. 16.4)
      - Computing the distance between two docs: O(M), where M is the dimensionality of the vectors.
      - Reassigning clusters: O(KN) distance computations, i.e., O(KNM).
      - Computing centroids: each doc gets added once to some centroid, i.e., O(NM).
      - Assume these two steps are each done once in each of I iterations: O(IKNM).

  26. Seed choice (Sec. 16.4)
      - Results can vary based on the random selection of initial centroids.
      - Some initializations get a poor convergence rate, or converge to a sub-optimal clustering.
        - Try out multiple starting points.
        - Select good seeds using a heuristic (e.g., a doc least similar to any existing mean).
        - Initialize with the results of another method.
      - Example showing sensitivity to seeds (points A-F): if you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}; if you start with D and F, you converge to {A,B,D,E} and {C,F}.
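A sketch of the "try out multiple starting points" heuristic. It assumes the `kmeans` and `kmeans_cost` sketches from the earlier slides (and the toy matrix `X`) are in scope; none of these names come from the slides themselves.

```python
def best_of_restarts(X, K, n_restarts=10):
    """Run K-means from several random seeds and keep the lowest-cost clustering."""
    best = None
    for seed in range(n_restarts):
        assign, centroids = kmeans(X, K, seed=seed)      # sketch defined earlier
        cost = kmeans_cost(X, assign, centroids)         # sketch defined earlier
        if best is None or cost < best[0]:
            best = (cost, assign, centroids)
    return best

cost, assign, centroids = best_of_restarts(X, K=2)
```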

  27. K-means issues, variations, etc. (Sec. 16.4)
      - The basic algorithm recomputes the centroids only after all points have been reassigned.
        - Instead, we can recompute a centroid after every single assignment; this can improve the speed of convergence of K-means.
      - Assumes clusters are spherical in vector space
        - Sensitive to coordinate changes, weighting, etc.
      - Disjoint and exhaustive
        - No notion of "outliers" by default, but outlier filtering can be added.
      - Dhillon et al., ICDM 2002: a variation that fixes some issues with small document clusters.
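One way to read the "recompute the centroid after every assignment" variation is as an incremental running-mean update. The sketch below is an assumption about that variant (the slides give no code): it adjusts the two affected centroids when a single point moves between clusters.

```python
import numpy as np

def move_point(x, old_k, new_k, centroids, counts):
    """Incrementally update centroids when point x moves from cluster old_k to new_k.

    centroids: (K, M) float array of current cluster means
    counts:    number of points currently in each cluster
    Running-mean identities used:
      removing x:  mu <- (n * mu - x) / (n - 1)
      adding x:    mu <- (n * mu + x) / (n + 1)
    """
    n_old, n_new = counts[old_k], counts[new_k]
    if n_old > 1:                                   # skip if the old cluster empties
        centroids[old_k] = (n_old * centroids[old_k] - x) / (n_old - 1)
    centroids[new_k] = (n_new * centroids[new_k] + x) / (n_new + 1)
    counts[old_k] -= 1
    counts[new_k] += 1

centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
counts = np.array([3, 2])
move_point(np.array([1.0, 1.0]), old_k=0, new_k=1, centroids=centroids, counts=counts)
print(centroids, counts)
```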

  28. How many clusters?
      - If the number of clusters K is given: partition the n docs into the predetermined number of clusters.
      - Otherwise, finding the "right" number is part of the problem.
        - Given docs, partition them into an "appropriate" number of subsets.
        - E.g., for query results, the ideal value of K is not known up front, though the UI may impose limits.

  29. How many clusters?
      [Figure: the same data set shown partitioned into two, four, and six clusters]

  30. Selecting K

  31. K not specified in advance
      - Tradeoff between having better focus within each cluster and having too many clusters
      - Solve an optimization problem: penalize having lots of clusters.
        - Application dependent, e.g., a compressed summary of a search results list
      - K^* = \arg\min_K \left[ J_{\min}(K) + \lambda K \right]
        - J_{\min}(K): the minimum value of J(\{\omega_1, \omega_2, ..., \omega_K\}) obtained in, e.g., 100 runs of K-means (with different initializations)
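A sketch of this selection rule. It assumes the `best_of_restarts` helper and the toy matrix `X` from the earlier seed-choice sketch are available, and the penalty weight `lam` (the λ above) is an illustrative, application-dependent value.

```python
def select_k(X, k_range, lam=1.0, n_restarts=10):
    """Pick K minimizing J_min(K) + lambda * K over a range of candidate K values."""
    scores = {}
    for K in k_range:
        cost, _, _ = best_of_restarts(X, K, n_restarts=n_restarts)
        scores[K] = cost + lam * K          # penalized objective
    return min(scores, key=scores.get), scores

best_K, scores = select_k(X, k_range=range(1, 4))
print(best_K, scores)
```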

  32. Penalize lots of clusters
      - Benefit for a doc: cosine similarity to its centroid
      - Total benefit: sum of the individual doc benefits
        - Why is there always a clustering of total benefit n? (Put each doc in its own cluster; every doc then has similarity 1 to its centroid.)
      - For each cluster, we have a cost C; for K clusters, the total cost is KC.
      - Value of a clustering = total benefit - total cost
      - Find the clustering of highest value, over all choices of K.
        - Total benefit increases with increasing K, but we can stop when it doesn't increase by "much": the cost term enforces this.

  33. What is a good clustering? (Sec. 16.3)
      - Internal criterion:
        - Intra-class (i.e., intra-cluster) similarity is high.
        - Inter-class similarity is low.
      - The measured quality of a clustering depends on both the doc representation and the similarity measure used.
