 
INFO 4300 / CS4300 Information Retrieval
  slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
  IR 21/25: Flat clustering
  Paul Ginsparg
  Cornell University, Ithaca, NY
  15 Nov 2011
Administrativa
  Assignment 4 due 2 Dec (extended until 4 Dec).
Overview
  1 Clustering: Introduction
  2 Clustering in IR
  3 K-means
  4 How many clusters?
  5 Evaluation
  6 Labeling clusters
  7 Feature selection
Classification vs. Clustering
  Classification: supervised learning
  Clustering: unsupervised learning
  Classification: Classes are human-defined and part of the input to the learning algorithm.
  Clustering: Clusters are inferred from the data without human input.
    However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, ...
What is clustering?
  (Document) clustering is the process of grouping a set of documents into clusters of similar documents.
  Documents within a cluster should be similar.
  Documents from different clusters should be dissimilar.
  Clustering is the most common form of unsupervised learning.
  Unsupervised = there are no labeled or annotated data.
Data set with clear cluster structure
  [Figure: scatter plot of points forming clearly separated clusters]
The cluster hypothesis
  Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs.
  All applications in IR are based (directly or indirectly) on the cluster hypothesis.
Applications of clustering in IR
  Application              | What is clustered?      | Benefit                                                      | Example
  Search result clustering | search results          | more effective information presentation to user              | next slide
  Scatter-Gather           | (subsets of) collection | alternative user interface: “search without typing”          | two slides ahead
  Collection clustering    | collection              | effective information presentation for exploratory browsing  | McKeown et al. 2002, news.google.com
  Cluster-based retrieval  | collection              | higher efficiency: faster search                             | Salton 1971
Global clustering for navigation: Google News
  http://news.google.com
Clustering for improving recall
  To improve search recall:
    Cluster docs in collection a priori
    When a query matches a doc d, also return other docs in the cluster containing d
  Hope: if we do this, the query “car” will also return docs containing “automobile”,
    because clustering groups together docs containing “car” with those containing “automobile”.
    Both types of documents contain words like “parts”, “dealer”, “mercedes”, “road trip”.
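The recall-expansion idea can be sketched in a few lines of Python. This is only an illustration, assuming the collection has already been clustered: the function name `expand_with_clusters`, the `doc_to_cluster` mapping, and the document IDs are all hypothetical and not part of the slides.

```python
from collections import defaultdict

def expand_with_clusters(matching_docs, doc_to_cluster):
    """Return the matching docs plus every doc sharing a cluster with one of them."""
    # Invert the doc -> cluster mapping once so cluster members are easy to look up.
    cluster_members = defaultdict(set)
    for doc_id, cluster_id in doc_to_cluster.items():
        cluster_members[cluster_id].add(doc_id)

    expanded = set(matching_docs)
    for doc_id in matching_docs:
        expanded |= cluster_members[doc_to_cluster[doc_id]]
    return expanded

# Hypothetical toy collection: cluster 0 holds vehicle docs, cluster 1 holds other docs.
doc_to_cluster = {"d1": 0, "d2": 0, "d3": 1, "d4": 1}

# Suppose only d1 contains the literal term "car", while d2 says "automobile".
result = expand_with_clusters({"d1"}, doc_to_cluster)
```

Here `result` contains both d1 and d2, so the "automobile" document is returned even though it never matched the query term. The price is precision: every other member of the cluster is returned as well, relevant or not.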
Data set with clear cluster structure
  Exercise: Come up with an algorithm for finding the three clusters in this case
  [Figure: scatter plot of points forming three clearly separated clusters]
Document representations in clustering
  Vector space model
  As in vector space classification, we measure relatedness between vectors by Euclidean distance ...
  ... which is almost equivalent to cosine similarity.
  Almost: centroids are not length-normalized.
  For centroids, distance and cosine give different results.
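A small numeric check makes the "almost equivalent" claim concrete. For unit-length vectors, |x - y|^2 = 2 (1 - cos(x, y)), so ranking by Euclidean distance and ranking by cosine agree; a centroid of unit vectors is generally shorter than unit length, which is where the two measures part ways. The vectors below are made up for illustration.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    # For unit-length vectors the dot product equals the cosine similarity.
    return sum(a * b for a, b in zip(u, v))

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Two hypothetical document vectors, length-normalized.
x = normalize([3.0, 1.0, 0.0])
y = normalize([1.0, 2.0, 2.0])

# Identity for unit vectors: |x - y|^2 = 2 * (1 - cos(x, y)).
lhs = squared_euclidean(x, y)
rhs = 2 * (1 - dot(x, y))

# The centroid of the two unit vectors is not itself unit length.
centroid = [(a + b) / 2 for a, b in zip(x, y)]
centroid_length = math.sqrt(dot(centroid, centroid))
```

`lhs` and `rhs` agree up to floating-point error, while `centroid_length` is strictly below 1, so the identity no longer applies once a centroid is one of the two vectors being compared.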
Issues in clustering
  General goal: put related docs in the same cluster, put unrelated docs in different clusters.
    But how do we formalize this?
  How many clusters?
    Initially, we will assume the number of clusters K is given.
  Often: secondary goals in clustering
    Example: avoid very small and very large clusters
  Flat vs. hierarchical clustering
  Hard vs. soft clustering
Flat vs. Hierarchical clustering
  Flat algorithms
    Usually start with a random (partial) partitioning of docs into groups
    Refine iteratively
    Main algorithm: K-means
  Hierarchical algorithms
    Create a hierarchy
    Bottom-up, agglomerative
    Top-down, divisive
Hard vs. Soft clustering
  Hard clustering: Each document belongs to exactly one cluster.
    More common and easier to do
  Soft clustering: A document can belong to more than one cluster.
    Makes more sense for applications like creating browsable hierarchies
    You may want to put a pair of sneakers in two clusters:
      sports apparel
      shoes
    You can only do that with a soft clustering approach.
  For soft clustering, see course text: 16.5, 18
  Today: Flat, hard clustering
  Next time: Hierarchical, hard clustering
Flat algorithms
  Flat algorithms compute a partition of N documents into a set of K clusters.
  Given: a set of documents and the number K
  Find: a partition into K clusters that optimizes the chosen partitioning criterion
  Global optimization: exhaustively enumerate partitions, pick optimal one
    Not tractable
  Effective heuristic method: K-means algorithm
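To get a feel for why exhaustive enumeration is not tractable, one can count the candidate partitions. Assuming every cluster must be non-empty and clusters are unlabeled, the count is the Stirling number of the second kind S(N, K). The helper name `stirling2` is illustrative, not from the slides.

```python
def stirling2(n, k):
    """Number of ways to partition n items into k non-empty, unlabeled clusters."""
    # table[i][j] = number of partitions of i items into j non-empty clusters.
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            # Item i either joins one of the j existing clusters
            # or opens a new cluster on its own.
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]

small = stirling2(4, 2)    # 7 ways to split 4 docs into 2 clusters
medium = stirling2(10, 3)  # 9330 ways to split 10 docs into 3 clusters
large = stirling2(100, 3)  # already an astronomically large integer
```

The count grows roughly like K^N / K!, so even a collection of a hundred documents is far beyond what enumeration can handle, which motivates a heuristic such as K-means.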
K-means
  Perhaps the best known clustering algorithm
  Simple, works well in many cases
  Use as default / baseline for clustering documents
K-means
  Each cluster in K-means is defined by a centroid.
  Objective/partitioning criterion: minimize the average squared difference from the centroid
  Recall definition of centroid:
    \[ \vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x} \]
  where we use ω to denote a cluster.
  We try to find the minimum average squared difference by iterating two steps:
    reassignment: assign each vector to its closest centroid
    recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment
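The centroid definition and the objective can be written out directly. The sketch below reads "average squared difference" as the average squared Euclidean distance of each vector to its own cluster centroid; the function names and the two toy clusterings are made up for illustration.

```python
def centroid(cluster):
    """Component-wise mean of the vectors in one cluster."""
    dim = len(cluster[0])
    return [sum(x[i] for x in cluster) / len(cluster) for i in range(dim)]

def avg_squared_distance(clusters):
    """Average squared Euclidean distance of each vector to its cluster centroid."""
    total, count = 0.0, 0
    for cluster in clusters:
        mu = centroid(cluster)
        for x in cluster:
            total += sum((a - b) ** 2 for a, b in zip(x, mu))
            count += 1
    return total / count

# Four hypothetical points, grouped two different ways.
good = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
bad = [[[0.0, 0.0], [10.0, 0.0]], [[0.0, 2.0], [10.0, 2.0]]]

good_score = avg_squared_distance(good)  # nearby points grouped together
bad_score = avg_squared_distance(bad)    # distant points grouped together
```

The grouping that keeps nearby points together scores 1.0, while the grouping that pairs distant points scores 25.0. K-means searches for a partition with a low value of this criterion.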
K-means algorithm
  K-means({x_1, ..., x_N}, K)
    (s_1, s_2, ..., s_K) ← SelectRandomSeeds({x_1, ..., x_N}, K)
    for k ← 1 to K
      do µ_k ← s_k
    while stopping criterion has not been met
      do for k ← 1 to K
           do ω_k ← {}
         for n ← 1 to N
           do j ← arg min_{j'} |µ_{j'} − x_n|
              ω_j ← ω_j ∪ {x_n}                      (reassignment of vectors)
         for k ← 1 to K
           do µ_k ← (1/|ω_k|) Σ_{x ∈ ω_k} x          (recomputation of centroids)
    return {µ_1, ..., µ_K}
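A minimal Python sketch of the pseudocode follows. The slides leave the stopping criterion open; this version assumes "stop when no assignment changes, or after max_iters iterations". Keeping the old centroid when a cluster ends up empty is also an assumption made here, as are the parameter names `max_iters` and `seed`.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(vectors[0])
    return [sum(x[i] for x in vectors) / len(vectors) for i in range(dim)]

def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # SelectRandomSeeds: pick k distinct input points as initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iters):
        # Reassignment: each vector goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(centroids[j], p))
            for p in points
        ]
        # Assumed stopping criterion: assignments did not change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recomputation: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    return centroids, assignment

# Hypothetical toy data: two well-separated groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, assignment = k_means(points, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random seeds, and on less well-separated data different seeds can converge to different partitions.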
Set of points to be clustered
  [Figure: scatter plot of unlabeled points]
Random selection of initial cluster centers (k = 2 means)
  [Figure: the same points with two initial centroids marked ×]
  Centroids after convergence?
Assign points to closest centroid / Assignment / Recompute cluster centroids
  [Figure sequence: these three steps repeat seven times; points are labeled 1 or 2 by cluster and each centroid × moves to its recomputed position]
Centroids and assignments after convergence
  [Figure: final centroids × and cluster labels 1 and 2]
Set of points clustered
  [Figure: scatter plot of the points]
Set of points to be clustered
  [Figure: scatter plot of unlabeled points]