
Information Retrieval (TDT4215 Web Intelligence), based on slides by Hinrich Schütze and Christina Lioma



  1. Introduction to Information Retrieval. TDT4215 Web Intelligence. Based on slides by Hinrich Schütze and Christina Lioma. Chapter 17: Hierarchical Clustering.
     Overview: ❶ Introduction ❸ Single-link/Complete-link ❹ Centroid/GAAC ❺ Variants ❻ Labeling clusters

  2. Introduction to Information Retrieval Introduction to Information Retrieval Outline ❶ I t ❶ Introduction d ti ❸ Single link/ Complete link ❸ Single ‐ link/ Complete ‐ link ❹ Centroid/ GAAC ❹ Centroid/ GAAC ❺ Variants ❺ ❻ Labeling clusters 3 Introduction to Information Retrieval Introduction to Information Retrieval Hierarchical clustering Our goal in hierarchical clustering is to Our goal in hierarchical clustering is to create a hierarchy like the one we saw earlier in Reuters: We want to create this hierarchy automatically. We can do this either top ‐ down or bottom ‐ up. The best known b tt bottom ‐ up method is hierarchical th d i hi hi l agglomerative clustering. 4 4

  3. Introduction to Information Retrieval Introduction to Information Retrieval Hierarchical agglomerative clustering (HAC)  HAC creates a hierachy in the form of a binary tree  HAC creates a hierachy in the form of a binary tree.  Assumes a similarity measure for determining the similarity of two clusters of two clusters.  Up to now, our similarity measures were for documents.  We will look at four different cluster similarity measures  We will look at four different cluster similarity measures. 5 5 Introduction to Information Retrieval Introduction to Information Retrieval Hierarchical agglomerative clustering (HAC)  Start with each document in a separate cluster  Start with each document in a separate cluster  Then repeatedly merge the two clusters that are most similar similar  Until there is only one cluster  The history of merging is a hierarchy in the form of a binary  The history of merging is a hierarchy in the form of a binary tree.  The standard way of depicting this history is a dendrogram  The standard way of depicting this history is a dendrogram. 6 6

  4. A dendrogram [figure]
     ▪ The history of mergers can be read off from bottom to top.
     ▪ The horizontal line of each merger tells us what the similarity of the merger was.
     ▪ We can cut the dendrogram at a particular point (e.g., at 0.1 or 0.4) to get a flat clustering.

     Divisive clustering
     ▪ Divisive clustering is top-down, an alternative to HAC (which is bottom-up).
     ▪ Start with all docs in one big cluster, then recursively split clusters; eventually each node forms a cluster on its own.
     ▪ → Bisecting K-means at the end.
     ▪ For now: HAC (= bottom-up).
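As a usage sketch (not part of the slides), SciPy can build the dendrogram and cut it at a chosen level. SciPy works with distances rather than similarities, so cutting at cosine similarity 0.4 corresponds to cutting at cosine distance 0.6; the toy vectors are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy document vectors (any numeric feature matrix would do).
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

# Bottom-up clustering; method="single" is single-link HAC.
Z = linkage(X, method="single", metric="cosine")

# Cut the dendrogram at cosine distance 0.6 (= similarity 0.4)
# to obtain a flat clustering.
labels = fcluster(Z, t=0.6, criterion="distance")
print(labels)  # e.g. [1 1 2 2]: two flat clusters
```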

  5. Introduction to Information Retrieval Introduction to Information Retrieval Naive HAC algorithm 9 9 Introduction to Information Retrieval Introduction to Information Retrieval Computational complexity of the naive algorithm  First, we compute the similarity of all N × N pairs of documents.  Then, in each of N iterations: Th i h f N it ti  We scan the O(N × N ) similarities to find the maximum similarity. similarity  We merge the two clusters with maximum similarity.  We compute the similarity of the new cluster with all other  We compute the similarity of the new cluster with all other (surviving) clusters.  There are O ( N ) iterations, each performing a O(N × N ) There are O ( N ) iterations, each performing a O(N N ) “scan” operation.  Overall complexity is O ( N 3 ). p y ( )  We’ll look at more efficient algorithms later. 10 10

  6. Introduction to Information Retrieval Introduction to Information Retrieval Key question: How to define cluster similarity  Single ‐ link: Maximum similarity  Maximum similarity of any two documents  Complete ‐ link: Minimum similarity  Minimum similarity of any two documents  Centroid: Average “intersimilarity”  Average similarity of all document pairs (but excluding pairs of docs in the same cluster) f d i h l )  This is equivalent to the similarity of the centroids.  Group ‐ average: Average “intrasimilarity” G A “i i il i ”  Average similary of all document pairs, including pairs of docs in the same cluster in the same cluster 11 11 Introduction to Information Retrieval Introduction to Information Retrieval Cluster similarity: Example 12 12

  7. Introduction to Information Retrieval Introduction to Information Retrieval Single ‐ link: Maximum similarity 13 13 Introduction to Information Retrieval Introduction to Information Retrieval Complete ‐ link: Minimum similarity 14 14

  8. Introduction to Information Retrieval Introduction to Information Retrieval Centroid: Average intersimilarity intersimilarity = similarity of two documents in different clusters i i il i i il i f d i diff l 15 15 Introduction to Information Retrieval Introduction to Information Retrieval Group average: Average intrasimilarity intrasimilarity = similarity of any pair, including cases where the i i il i i il i f i i l di h h two documents are in the same cluster 16 16

  9. Introduction to Information Retrieval Introduction to Information Retrieval Cluster similarity: Larger Example 17 17 Introduction to Information Retrieval Introduction to Information Retrieval Single ‐ link: Maximum similarity 18 18

  10. Introduction to Information Retrieval Introduction to Information Retrieval Complete ‐ link: Minimum similarity 19 19 Introduction to Information Retrieval Introduction to Information Retrieval Centroid: Average intersimilarity 20 20

  11. Introduction to Information Retrieval Introduction to Information Retrieval Group average: Average intrasimilarity 21 21 Introduction to Information Retrieval Introduction to Information Retrieval Outline ❶ Introduction ❶ I t d ti ❸ Single link/ Complete link ❸ Single ‐ link/ Complete ‐ link ❹ Centroid/ GAAC ❹ Centroid/ GAAC ❺ Variants ❺ ❻ Labeling clusters 22

  12. Introduction to Information Retrieval Introduction to Information Retrieval Single link HAC  The similarity of two clusters is the maximum intersimilarity – the maximum similarity of a document from the first cluster and a document from the second from the first cluster and a document from the second cluster.  Once we have merged two clusters how do we update the Once we have merged two clusters, how do we update the similarity matrix?  This is simple for single link: This is simple for single link: SIM ( ω i ( ω k 1 ∪ ω k 2 )) = max( SIM ( ω i ω k 1 ) SIM ( ω i ω k 2 )) SIM ( ω i , ( ω k 1 ∪ ω k 2 )) max( SIM ( ω i , ω k 1 ), SIM ( ω i , ω k 2 )) 23 23 Introduction to Information Retrieval Introduction to Information Retrieval This dendogram was produced by single ‐ link  Notice: many small clusters (1 or 2 members) b being added to the main dd d h cluster  There is no balanced 2 ‐ Th i b l d 2 cluster or 3 ‐ cluster clustering that can be clustering that can be derived by cutting the dendrogram. 24 24

  13. Introduction to Information Retrieval Introduction to Information Retrieval Complete link HAC  The similarity of two clusters is the minimum intersimilarity  The similarity of two clusters is the minimum intersimilarity – the minimum similarity of a document from the first cluster and a document from the second cluster. and a document from the second cluster.  Once we have merged two clusters, how do we update the similarity matrix?  Again, this is simple: SIM( ω i , ( ω k 1 ∪ ω k 2 )) = min( SIM ( ω i , ω k 1 ), SIM ( ω i , ω k 2 )) ∪ ω )) = min( SIM ( ω ω ) SIM ( ω SIM( ω ( ω ω ))  We measure the similarity of two clusters by computing the y y p g diameter of the cluster that we would get if we merged them. 25 25 Introduction to Information Retrieval Introduction to Information Retrieval Complete ‐ link dendrogram  Notice that this dendrogram is much more balanced than the b l d h h single ‐ link one.  We can create a 2 ‐ cluster W t 2 l t clustering with two clusters of about the clusters of about the same size. 26 26

  14. Introduction to Information Retrieval Introduction to Information Retrieval Exercise: Compute single and complete link clustering 27 27 Introduction to Information Retrieval Introduction to Information Retrieval Single ‐ link clustering 28 28

  15. Introduction to Information Retrieval Introduction to Information Retrieval Complete link clustering 29 29 Introduction to Information Retrieval Introduction to Information Retrieval Single ‐ link vs. Complete link clustering 30 30
