Introduction to Information Retrieval
TDT4215 Web Intelligence
Based on slides by: Hinrich Schütze and Christina Lioma
Chapter 17: Hierarchical Clustering

Overview
❶ Introduction
❸ Single-link / Complete-link
❹ Centroid / GAAC
❺ Variants
❻ Labeling clusters
Outline
❶ Introduction
❸ Single-link / Complete-link
❹ Centroid / GAAC
❺ Variants
❻ Labeling clusters

Hierarchical clustering
Our goal in hierarchical clustering is to create a hierarchy like the one we saw earlier in Reuters. We want to create this hierarchy automatically. We can do this either top-down or bottom-up. The best known bottom-up method is hierarchical agglomerative clustering.
Hierarchical agglomerative clustering (HAC)
HAC creates a hierarchy in the form of a binary tree. It assumes a similarity measure for determining the similarity of two clusters. Up to now, our similarity measures were for documents. We will look at four different cluster similarity measures.

Hierarchical agglomerative clustering (HAC)
Start with each document in a separate cluster. Then repeatedly merge the two clusters that are most similar, until there is only one cluster. The history of merging is a hierarchy in the form of a binary tree. The standard way of depicting this history is a dendrogram.
A dendrogram
The history of mergers can be read off from bottom to top. The horizontal line of each merger tells us what the similarity of the merger was. We can cut the dendrogram at a particular point (e.g., at 0.1 or 0.4) to get a flat clustering.

Divisive clustering
Divisive clustering is top-down, an alternative to HAC (which is bottom-up). Divisive clustering: start with all docs in one big cluster, then recursively split clusters. Eventually each node forms a cluster on its own. → Bisecting K-means at the end. For now: HAC (= bottom-up).
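As a minimal sketch of cutting a dendrogram to get a flat clustering, the snippet below uses SciPy's hierarchical-clustering utilities; the document vectors are made-up toy data, not the Reuters example from the slides.

```python
# Build a merge history and cut the dendrogram at a distance threshold.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

# SciPy works with distances rather than similarities.
Z = linkage(docs, method="single", metric="cosine")

# "Cutting the dendrogram at a particular point": every merge above the
# threshold distance is undone, leaving a flat clustering.
flat = fcluster(Z, t=0.5, criterion="distance")
print(flat)  # e.g. [1 1 2 2]: two flat clusters
```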
Naive HAC algorithm
[The slide shows pseudocode for the naive HAC algorithm.]

Computational complexity of the naive algorithm
First, we compute the similarity of all N × N pairs of documents.
Then, in each of N iterations:
We scan the O(N × N) similarities to find the maximum similarity.
We merge the two clusters with maximum similarity.
We compute the similarity of the new cluster with all other (surviving) clusters.
There are O(N) iterations, each performing an O(N × N) "scan" operation. Overall complexity is O(N³).
We'll look at more efficient algorithms later.
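A minimal sketch of this naive O(N³) loop, assuming unit-normalized document vectors with dot-product (cosine) similarity and, as one concrete choice, the single-link update; the function name is illustrative, not from the slides.

```python
import numpy as np

def naive_hac(docs):
    """docs: (N, d) array of unit-normalized document vectors.
    Returns the merge history as (cluster_i, cluster_j, similarity) triples."""
    N = len(docs)
    sim = docs @ docs.T                # N x N pairwise similarities
    np.fill_diagonal(sim, -np.inf)     # a cluster never merges with itself
    active = list(range(N))
    merges = []
    for _ in range(N - 1):             # N - 1 merges until one cluster remains
        # O(N^2) scan for the most similar pair of active clusters
        i, j = max(((a, b) for a in active for b in active if a < b),
                   key=lambda p: sim[p[0], p[1]])
        merges.append((i, j, sim[i, j]))
        # single-link update: similarity to the merged cluster is the max
        sim[i, :] = np.maximum(sim[i, :], sim[j, :])
        sim[:, i] = sim[i, :]          # keep the matrix symmetric
        sim[i, i] = -np.inf
        active.remove(j)               # cluster j is absorbed into i
    return merges
```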
Key question: How to define cluster similarity
Single-link: Maximum similarity. Maximum similarity of any two documents.
Complete-link: Minimum similarity. Minimum similarity of any two documents.
Centroid: Average "intersimilarity". Average similarity of all document pairs (but excluding pairs of docs in the same cluster). This is equivalent to the similarity of the centroids.
Group-average: Average "intrasimilarity". Average similarity of all document pairs, including pairs of docs in the same cluster.

Cluster similarity: Example
[Figure: a small example point set used to illustrate the four measures.]
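The following small sketch computes all four measures for two toy clusters, assuming dot-product similarity; the vectors are illustrative stand-ins, not the example points from the slides.

```python
import numpy as np

A = np.array([[1.0, 0.2], [0.9, 0.3]])    # cluster omega_1
B = np.array([[0.1, 1.0], [0.2, 0.8]])    # cluster omega_2

cross = A @ B.T                            # similarities across the clusters
single_link   = cross.max()               # max similarity of any cross pair
complete_link = cross.min()               # min similarity of any cross pair
centroid      = cross.mean()              # average intersimilarity

# group-average: all pairs in the merged cluster, including within-cluster
# pairs but excluding self-pairs
M = np.vstack([A, B])
S = M @ M.T
n = len(M)
group_avg = (S.sum() - np.trace(S)) / (n * (n - 1))
```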
Single-link: Maximum similarity
[Figure: single-link similarity on the example.]

Complete-link: Minimum similarity
[Figure: complete-link similarity on the example.]
Centroid: Average intersimilarity
intersimilarity = similarity of two documents in different clusters
[Figure: centroid similarity on the example.]

Group average: Average intrasimilarity
intrasimilarity = similarity of any pair, including cases where the two documents are in the same cluster
[Figure: group-average similarity on the example.]
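For dot-product similarity, the average intersimilarity of two clusters equals the similarity of their centroids, which is why the measure is called "centroid". A quick numeric check, reusing the toy clusters from the earlier sketch:

```python
import numpy as np

A = np.array([[1.0, 0.2], [0.9, 0.3]])
B = np.array([[0.1, 1.0], [0.2, 0.8]])

avg_inter = (A @ B.T).mean()                       # average intersimilarity
centroid_sim = A.mean(axis=0) @ B.mean(axis=0)     # dot product of centroids
print(np.allclose(avg_inter, centroid_sim))        # True
```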
Cluster similarity: Larger example
[Figure: a larger example point set.]

Single-link: Maximum similarity
[Figure: single-link similarity on the larger example.]
Complete-link: Minimum similarity
[Figure: complete-link similarity on the larger example.]

Centroid: Average intersimilarity
[Figure: centroid similarity on the larger example.]
Group average: Average intrasimilarity
[Figure: group-average similarity on the larger example.]

Outline
❶ Introduction
❸ Single-link / Complete-link
❹ Centroid / GAAC
❺ Variants
❻ Labeling clusters
Single-link HAC
The similarity of two clusters is the maximum intersimilarity – the maximum similarity of a document from the first cluster and a document from the second cluster. Once we have merged two clusters, how do we update the similarity matrix? This is simple for single-link:
SIM(ωi, (ωk1 ∪ ωk2)) = max(SIM(ωi, ωk1), SIM(ωi, ωk2))

This dendrogram was produced by single-link
Notice: many small clusters (1 or 2 members) being added to the main cluster. There is no balanced 2-cluster or 3-cluster clustering that can be derived by cutting the dendrogram.
Complete-link HAC
The similarity of two clusters is the minimum intersimilarity – the minimum similarity of a document from the first cluster and a document from the second cluster. Once we have merged two clusters, how do we update the similarity matrix? Again, this is simple:
SIM(ωi, (ωk1 ∪ ωk2)) = min(SIM(ωi, ωk1), SIM(ωi, ωk2))
We measure the similarity of two clusters by computing the diameter of the cluster that we would get if we merged them.

Complete-link dendrogram
Notice that this dendrogram is much more balanced than the single-link one. We can create a 2-cluster clustering with two clusters of about the same size.
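A minimal sketch of the similarity-matrix update behind both formulas above, with cluster k1 absorbing cluster k2; `sim` is a symmetric NumPy array of cluster similarities, and the stale row/column of the absorbed cluster is assumed to be ignored by the caller.

```python
import numpy as np

def update_after_merge(sim, k1, k2, linkage="single"):
    if linkage == "single":      # SIM(i, k1 ∪ k2) = max(SIM(i,k1), SIM(i,k2))
        sim[k1, :] = np.maximum(sim[k1, :], sim[k2, :])
    elif linkage == "complete":  # SIM(i, k1 ∪ k2) = min(SIM(i,k1), SIM(i,k2))
        sim[k1, :] = np.minimum(sim[k1, :], sim[k2, :])
    sim[:, k1] = sim[k1, :]      # keep the matrix symmetric
    return sim
```

Either way, the update touches only one row and one column, which is what makes these two linkage criteria so cheap to maintain.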
Exercise: Compute single-link and complete-link clustering
[The slide shows a small set of points to cluster.]

Single-link clustering
[Figure: the single-link clustering of the exercise points.]
Complete-link clustering
[Figure: the complete-link clustering of the exercise points.]

Single-link vs. complete-link clustering
[Figure: the two clusterings side by side.]
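Since the exercise points only appear as a figure in the slides, here is a hedged sketch of how one would compute and compare both clusterings with SciPy on stand-in points. Single link tends to chain clusters together, while complete link favors compact clusters of small diameter.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in points: two parallel rows, a shape where the two criteria
# can be compared directly.
pts = np.array([(0, 0), (1, 0), (2, 0), (3, 0),
                (0, 4), (1, 4), (2, 4), (3, 4)], dtype=float)

for method in ("single", "complete"):
    Z = linkage(pts, method=method, metric="euclidean")
    print(method, fcluster(Z, t=2, criterion="maxclust"))
```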