
Recap Introduction Single-link/Complete-link Centroid/GAAC Variants Labeling clusters

Introduction to Information Retrieval

http://informationretrieval.org IIR 17: Hierarchical Clustering

Hinrich Schütze

Institute for Natural Language Processing, Universität Stuttgart

2008.07.01

Schütze: Hierarchical clustering 1 / 58


Outline

1. Recap
2. Introduction
3. Single-link/Complete-link
4. Centroid/GAAC
5. Variants
6. Labeling clusters


Hierarchical clustering

Our goal in hierarchical clustering is to create a hierarchy like the one we saw earlier in Reuters:

[Figure: example hierarchy from Reuters. TOP is split into regions (France, UK, China, Kenya) and industries (oil & gas, coffee, poultry).]

We want to create this hierarchy automatically. We can do this either top-down or bottom-up. The best known bottom-up method is hierarchical agglomerative clustering.


Hierarchical agglomerative clustering (HAC)

Assumes a similarity measure for determining the similarity of two clusters (up to now: similarity of documents). We will look at four different cluster similarity measures.

Start with each document in a separate cluster.
Then repeatedly merge the two clusters that are most similar,
until there is only one cluster.

The history of merging forms a binary tree or hierarchy. The standard way of depicting this history is a dendrogram.


A dendrogram

[Figure: dendrogram of a clustering of 30 Reuters news stories, with merge similarities on a scale from 1.0 down to 0.0; near-duplicate stories such as the four "Fed ... interest rates steady" headlines merge first.]

The history of mergers can be read off from bottom to top. The horizontal line of each merger tells us what the similarity of the merger was. We can cut the dendrogram at a particular point (e.g., at 0.1 or 0.4) to get a flat clustering.
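The cut-at-a-threshold idea is easy to try out with SciPy's hierarchical-clustering routines (a sketch, not code from the lecture; note that SciPy works with distances rather than the similarities used on these slides, so cutting at a small distance corresponds to cutting at a high similarity):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight groups of 2-D points (made-up example data).
points = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])

# Single-link HAC merge history (SciPy's analogue of the dendrogram).
Z = linkage(points, method="single")

# Cut the dendrogram at distance 1.0: only the two tight pairs survive.
labels = fcluster(Z, t=1.0, criterion="distance")
print(len(set(labels)))   # 2 flat clusters
```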


Divisive clustering

Top-down (instead of bottom-up as in HAC):

Start with all docs in one big cluster.
Then recursively split clusters.
Eventually each document forms a cluster on its own.

→ Bisecting K-means at the end


Naive HAC algorithm

SimpleHAC(d1, . . . , dN)
 1  for n ← 1 to N
 2  do for i ← 1 to N
 3     do C[n][i] ← Sim(dn, di)
 4     I[n] ← 1  (keeps track of active clusters)
 5  A ← []  (collects clustering as a sequence of merges)
 6  for k ← 1 to N − 1
 7  do ⟨i, m⟩ ← arg max{⟨i,m⟩: i≠m ∧ I[i]=1 ∧ I[m]=1} C[i][m]
 8     A.Append(⟨i, m⟩)  (store merge)
 9     for j ← 1 to N
10     do C[i][j] ← Sim(i, m, j)
11        C[j][i] ← Sim(i, m, j)
12     I[m] ← 0  (deactivate cluster)
13  return A
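A direct Python transcription may make the bookkeeping concrete (a sketch, not code from the lecture: the document similarity function is passed in as a parameter, and the merge update of lines 10–11 is instantiated with the single-link rule that the deck introduces later):

```python
import numpy as np

def simple_hac(vectors, sim):
    """Naive HAC following the SimpleHAC pseudocode, with a single-link merge rule."""
    N = len(vectors)
    C = np.array([[sim(vectors[n], vectors[i]) for i in range(N)] for n in range(N)])
    I = [True] * N   # active-cluster flags
    A = []           # clustering as a sequence of merges
    for _ in range(N - 1):
        # scan all similarities for the most similar pair of distinct active clusters
        best, i, m = -np.inf, -1, -1
        for a in range(N):
            for b in range(N):
                if a != b and I[a] and I[b] and C[a][b] > best:
                    best, i, m = C[a][b], a, b
        A.append((i, m))
        # single-link update: row/column i becomes the max of the two merged rows
        for j in range(N):
            C[i][j] = C[j][i] = max(C[i][j], C[m][j])
        I[m] = False   # deactivate the absorbed cluster
    return A
```

For example, with 1-D points and similarity defined as negative distance, `simple_hac([0.0, 1.0, 10.0, 11.0], lambda x, y: -abs(x - y))` first merges the two close pairs and joins them last.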


Computational complexity of the naive algorithm

First, we compute the similarity of all N × N pairs of documents. Then, in each iteration:

We scan the O(N × N) similarities to find the maximum similarity.
We merge the two clusters with maximum similarity.
We compute the similarity of the new cluster with all other (surviving) clusters.

There are O(N) iterations, each performing an O(N × N) "scan" operation, so the overall complexity is O(N³). We'll look at more efficient algorithms later.


Key question: How to define cluster similarity

Single-link: maximum similarity
  Maximum over all document pairs

Complete-link: minimum similarity
  Minimum over all document pairs

Centroid: average "intersimilarity"
  Average over all pairs of documents from different clusters; this is equivalent to the similarity of the centroids.

Group-average: average "intrasimilarity"
  Average over all document pairs, including pairs of docs in the same cluster


Cluster similarity: Example

[Figure: four example points on a 7 × 4 grid]


Single-link: Maximum similarity

[Figure: the four example points, with the maximum-similarity pair indicated]


Complete-link: Minimum similarity

[Figure: the four example points, with the minimum-similarity pair indicated]


Centroid: Average intersimilarity

intersimilarity = similarity of two documents in different clusters

[Figure: the four example points on the 7 × 4 grid]


Group average: Average intrasimilarity

intrasimilarity = similarity of any pair of documents, including pairs within cluster 1 and pairs within cluster 2

[Figure: the four example points on the 7 × 4 grid]


Cluster similarity: Larger example

[Figure: 20 example points on a 7 × 4 grid]


Single-link: Maximum similarity

[Figure: the 20 example points, with the maximum-similarity pair indicated]


Complete-link: Minimum similarity

[Figure: the 20 example points, with the minimum-similarity pair indicated]


Centroid: Average intersimilarity

[Figure: the 20 example points, with the average cross-cluster similarity indicated]


Group average: Average intrasimilarity

[Figure: the 20 example points, with the average over all pairs indicated]


Single link HAC

The similarity of two clusters is the maximum intersimilarity – the maximum similarity of a document from the first cluster and a document from the second cluster. Once we have merged two clusters, how do we update the similarity matrix? This is simple for single-link:

sim(ωi, (ωk1 ∪ ωk2)) = max(sim(ωi, ωk1), sim(ωi, ωk2))
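A quick sanity check of this update rule on toy 1-D points (hypothetical numbers; similarity taken as negative distance):

```python
# Single-link: similarity of two clusters = max similarity over cross-cluster pairs.
def single_link_sim(A, B):
    return max(-abs(a - b) for a in A for b in B)

wi, wk1, wk2 = [0.0], [3.0, 4.0], [9.0]   # three clusters of 1-D points
merged = wk1 + wk2                        # ωk1 ∪ ωk2

# Recomputing from scratch agrees with the max-of-the-old-similarities update.
assert single_link_sim(wi, merged) == max(single_link_sim(wi, wk1),
                                          single_link_sim(wi, wk2))
```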


Single-link clustering: Example

[Figure: eight points d1–d8 on a 4 × 3 grid, merged step by step by single-link clustering]


This dendrogram was produced by single-link

[Figure: single-link dendrogram of the 30 Reuters news stories from before]

Notice: many small clusters (1 or 2 members) are added to the main cluster one by one. There is no balanced 2-cluster or 3-cluster clustering that can be derived by cutting the dendrogram.


What cluster structure after 10 mergers?

[Figure: 12 points spaced along a line]


Single-link: Chaining

[Figure: the 12 points joined by single-link into one long chain]

Single-link clustering often produces long, straggly clusters. For most applications, these are undesirable.


Complete link HAC

The similarity of two clusters is the minimum intersimilarity – the minimum similarity of a document from the first cluster and a document from the second cluster. Once we have merged two clusters, how do we update the similarity matrix? Again, this is simple:

sim(ωi, (ωk1 ∪ ωk2)) = min(sim(ωi, ωk1), sim(ωi, ωk2))

We measure the similarity of two clusters by computing the diameter of the cluster that we would get if we merged them.


Complete link clustering: Example

[Figure: the eight points d1–d8 on the 4 × 3 grid, merged step by step by complete-link clustering]


Single-link vs. Complete link clustering

[Figure: the single-link clustering (left) and the complete-link clustering (right) of the eight points d1–d8]


Complete-link dendrogram

[Figure: complete-link dendrogram of the same 30 Reuters news stories]

Notice that this dendrogram is much more balanced than the single-link one. We can create a 2-cluster clustering with two clusters of about the same size.


Complete-link: Sensitivity to outliers

[Figure: five points d1–d5 spaced along a line from 1 to 7]

What is the intuitively best 2-cluster clustering here?


The complete-link clustering of this set is not intuitive. This shows that a single outlier can have a large effect on the final outcome of complete-link clustering. Coordinates of the five points: 1 + 2ε, 4, 5 + 2ε, 6, 7 − ε.


Centroid HAC

The similarity of two clusters is the average intersimilarity – the average similarity of documents from the first cluster with documents from the second cluster. This definition is inefficient (O(N²)), but it is equivalent to computing the similarity of the centroids:

sim-cent(ωi, ωj) = µ(ωi) · µ(ωj)

Hence the name: centroid HAC. Note: this is the dot product, not cosine similarity!
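The claimed equivalence is easy to verify numerically (a sketch with made-up random vectors; NumPy assumed): the mean of all cross-cluster dot products equals the dot product of the two centroids, because the mean factors through the sums.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))   # cluster ωi: 3 documents, 4-dim vectors
B = rng.normal(size=(5, 4))   # cluster ωj: 5 documents

avg_intersim = np.mean([a @ b for a in A for b in B])   # average intersimilarity
centroid_sim = A.mean(axis=0) @ B.mean(axis=0)          # µ(ωi) · µ(ωj)

assert np.isclose(avg_intersim, centroid_sim)
```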


Centroid clustering: Example

[Figure: six points d1–d6 on a 7 × 5 grid; successive merges add the centroids µ1, µ2, µ3]
Sch¨ utze: Hierarchical clustering 35 / 58

slide-57
SLIDE 57

Recap Introduction Single-link/Complete-link Centroid/GAAC Variants Labeling clusters

Centroid clustering: Example

1 2 3 4 5 6 7 1 2 3 4 5

× d1 × d2 × d3 × d4 ×

d5

× d6

b c

µ1

Sch¨ utze: Hierarchical clustering 35 / 58

slide-58
SLIDE 58

Recap Introduction Single-link/Complete-link Centroid/GAAC Variants Labeling clusters

Centroid clustering: Example

1 2 3 4 5 6 7 1 2 3 4 5

× d1 × d2 × d3 × d4 ×

d5

× d6

b c

µ1

b c µ2

Sch¨ utze: Hierarchical clustering 35 / 58

slide-59
SLIDE 59

Recap Introduction Single-link/Complete-link Centroid/GAAC Variants Labeling clusters

Centroid clustering: Example

1 2 3 4 5 6 7 1 2 3 4 5

× d1 × d2 × d3 × d4 ×

d5

× d6

b c

µ1

b c µ2 b c

µ3

Sch¨ utze: Hierarchical clustering 35 / 58

slide-60
SLIDE 60

Recap Introduction Single-link/Complete-link Centroid/GAAC Variants Labeling clusters

Inversion in centroid clustering

In an inversion, the similarity increases during a sequence of merges. This results in an "inverted" dendrogram.

Below: the similarity of the first merge (d1 ∪ d2) is −4.0; the similarity of the second merge ((d1 ∪ d2) ∪ d3) is ≈ −3.5.

[Figure: three points d1, d2, d3 with the centroid of d1 ∪ d2 marked, next to the corresponding inverted dendrogram with merge similarities −4.0 and ≈ −3.5]


Inversions

Hierarchical clustering algorithms that allow inversions are inferior. The rationale for hierarchical clustering is that at any given point, we’ve found the most coherent cluster of a given size. Intuitively: smaller clusters should be more coherent than larger clusters. An inversion contradicts this intuition: we have a large cluster that is more coherent than one of its subclusters.


Group-average agglomerative clustering (GAAC)

GAAC also has an “average-similarity” criterion, but does not have inversions. The similarity of two clusters is the average intrasimilarity – the average similarity of all document pairs (including those from the same cluster). But we exclude self-similarities.


Group-average agglomerative clustering (GAAC)

Again, the above definition is inefficient (O(N²)), but there is an equivalent, more efficient definition based on the sum vector of the merged cluster:

sim-ga(ωi, ωj) = [ (Σ_{dm ∈ ωi∪ωj} dm)² − (Ni + Nj) ] / [ (Ni + Nj)(Ni + Nj − 1) ]

where the square denotes the dot product of the sum vector with itself. Again, this is the dot product, not cosine similarity.


Which HAC clustering should I use?

Don’t use centroid HAC, because of inversions.
In most cases, GAAC is best, since it isn’t subject to chaining or to sensitivity to outliers.
However, we can only use GAAC for vector representations.
For other types of document representations (or if only pairwise similarities for documents are available): use complete-link.
There are also some applications for single-link (e.g., duplicate detection in web search).


Flat or hierarchical clustering?

For high efficiency: use flat clustering (or perhaps bisecting K-means).
For deterministic results: HAC.
When a hierarchical structure is desired: use a hierarchical algorithm.
HAC can also be applied if K cannot be predetermined (we can start without knowing K).


Time complexity of HAC

The single-link algorithm we just saw is O(N²). Much more efficient than the O(N³) algorithm we looked at earlier! There is no known O(N²) algorithm for complete-link, centroid and GAAC. The best time complexity for these three is O(N² log N): see the book. In practice there is little difference between O(N² log N) and O(N²).


Combination similarities of the four algorithms

clustering algorithm   sim(ℓ, k1, k2)
single-link            max(sim(ℓ, k1), sim(ℓ, k2))
complete-link          min(sim(ℓ, k1), sim(ℓ, k2))
centroid               ((1/Nm) vm) · ((1/Nℓ) vℓ)
group-average          [ (vm + vℓ)² − (Nm + Nℓ) ] / [ (Nm + Nℓ)(Nm + Nℓ − 1) ]

(vm, vℓ: sum vectors; Nm, Nℓ: cluster sizes)


Comparison of HAC algorithms

method         combination similarity               time compl.   optimal?  comment
single-link    max intersimilarity of any 2 docs    Θ(N²)         yes       chaining effect
complete-link  min intersimilarity of any 2 docs    Θ(N² log N)   no        sensitive to outliers
group-average  average of all sims                  Θ(N² log N)   no        best choice for most applications
centroid       average intersimilarity              Θ(N² log N)   no        inversions can occur


What to do with the hierarchy?

Use as is (e.g., for browsing, as in the Yahoo hierarchy).
Cut at a predetermined threshold.
Cut to get a predetermined number of clusters K.

Hierarchical clustering is often used to get K flat clusters; the hierarchy is then ignored.


Bisecting K-means: A top-down algorithm

Start with all documents in one cluster.
Split the cluster into 2 using K-means.
Of the clusters produced so far, select one to split (e.g., select the largest one).
Repeat until we have produced the desired number of clusters.


Bisecting K-means

BisectingKMeans(d1, . . . , dN)
1  ω0 ← {d1, . . . , dN}
2  leaves ← {ω0}
3  for k ← 1 to K − 1
4  do ωk ← PickClusterFrom(leaves)
5     {ωi, ωj} ← KMeans(ωk, 2)
6     leaves ← leaves \ {ωk} ∪ {ωi, ωj}
7  return leaves
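A runnable sketch of this pseudocode (assumptions not in the slide: NumPy, a minimal hand-rolled 2-means standing in for KMeans(ωk, 2), and "split the largest leaf" as the PickClusterFrom policy):

```python
import numpy as np

def two_means(points, seed=0):
    """Minimal 2-means (Lloyd iterations) standing in for KMeans(omega_k, 2)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=2, replace=False)]
    for _ in range(20):
        # assign each point to its nearest center
        labels = np.argmin(((points[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in (0, 1):
            if np.any(labels == c):        # keep the old center if a side goes empty
                centers[c] = points[labels == c].mean(axis=0)
    return [points[labels == c] for c in (0, 1)]

def bisecting_kmeans(points, K, seed=0):
    leaves = [np.asarray(points, dtype=float)]
    for _ in range(K - 1):
        leaves.sort(key=len, reverse=True)  # PickClusterFrom: split the largest leaf
        target = leaves.pop(0)
        leaves.extend(two_means(target, seed))
    return leaves
```

On two well-separated 2-D blobs, e.g. `bisecting_kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], K=2)`, the first (and only) split recovers the two blobs. The random center initialization in `two_means` is also why bisecting K-means is not deterministic.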


Bisecting K-means

If we don’t generate a complete hierarchy, then a top-down algorithm like bisecting K-means is much more efficient than HAC algorithms. But bisecting K-means is not deterministic. Why?


Major issue in clustering – labeling

After a clustering algorithm finds a set of clusters: how can they be made useful to the end user? We need a pithy label for each cluster. For example, in search result clustering for "jaguar": "animal", "car", "operating system". How can we do this?
