Information Retrieval TDT4215 Web intelligence g Based on slides - PDF document

Introduction to Information Retrieval Introduction to Information Retrieval Introduction to Information Retrieval TDT4215 Web ‐ intelligence g Based on slides by: Hinrich Schütze and Christina Lioma Hinrich Schütze and Christina Lioma Chapter 17: Hierarchical Clustering 1 Introduction to Information Retrieval Introduction to Information Retrieval Overview ❶ I t ❶ Introduction d ti ❸ Single link/ Complete link ❸ Single ‐ link/ Complete ‐ link ❹ Centroid/ GAAC ❹ Centroid/ GAAC ❺ Variants ❺ ❻ Labeling clusters 2

Introduction to Information Retrieval Introduction to Information Retrieval Outline ❶ I t ❶ Introduction d ti ❸ Single link/ Complete link ❸ Single ‐ link/ Complete ‐ link ❹ Centroid/ GAAC ❹ Centroid/ GAAC ❺ Variants ❺ ❻ Labeling clusters 3 Introduction to Information Retrieval Introduction to Information Retrieval Hierarchical clustering Our goal in hierarchical clustering is to Our goal in hierarchical clustering is to create a hierarchy like the one we saw earlier in Reuters: We want to create this hierarchy automatically. We can do this either top ‐ down or bottom ‐ up. The best known b tt bottom ‐ up method is hierarchical th d i hi hi l agglomerative clustering. 4 4

Introduction to Information Retrieval Introduction to Information Retrieval Hierarchical agglomerative clustering (HAC)  HAC creates a hierachy in the form of a binary tree  HAC creates a hierachy in the form of a binary tree.  Assumes a similarity measure for determining the similarity of two clusters of two clusters.  Up to now, our similarity measures were for documents.  We will look at four different cluster similarity measures  We will look at four different cluster similarity measures. 5 5 Introduction to Information Retrieval Introduction to Information Retrieval Hierarchical agglomerative clustering (HAC)  Start with each document in a separate cluster  Start with each document in a separate cluster  Then repeatedly merge the two clusters that are most similar similar  Until there is only one cluster  The history of merging is a hierarchy in the form of a binary  The history of merging is a hierarchy in the form of a binary tree.  The standard way of depicting this history is a dendrogram  The standard way of depicting this history is a dendrogram. 6 6

Introduction to Information Retrieval Introduction to Information Retrieval A dendogram  The history of mergers Th hi f can be read off from bottom to top bottom to top.  The horizontal line of each merger tells us what each merger tells us what the similarity of the merger was.  We can cut the dendrogram at a particular point (e.g., at 0.1 or 0.4) to get a flat clustering. clustering 7 7 Introduction to Information Retrieval Introduction to Information Retrieval Divisive clustering  Divisive clustering is top ‐ down.  Alternative to HAC (which is bottom up)  Alternative to HAC (which is bottom up).  Divisive clustering:  Start with all docs in one big cluster  Start with all docs in one big cluster  Then recursively split clusters  Eventually each node forms a cluster on its own  Eventually each node forms a cluster on its own.  → Bisec � ng K ‐ means at the end  For now: HAC (= bottom ‐ up) F HAC ( b tt ) 8 8

Introduction to Information Retrieval Introduction to Information Retrieval Naive HAC algorithm 9 9 Introduction to Information Retrieval Introduction to Information Retrieval Computational complexity of the naive algorithm  First, we compute the similarity of all N × N pairs of documents.  Then, in each of N iterations: Th i h f N it ti  We scan the O(N × N ) similarities to find the maximum similarity. similarity  We merge the two clusters with maximum similarity.  We compute the similarity of the new cluster with all other  We compute the similarity of the new cluster with all other (surviving) clusters.  There are O ( N ) iterations, each performing a O(N × N ) There are O ( N ) iterations, each performing a O(N N ) “scan” operation.  Overall complexity is O ( N 3 ). p y ( )  We’ll look at more efficient algorithms later. 10 10

Introduction to Information Retrieval Introduction to Information Retrieval Key question: How to define cluster similarity  Single ‐ link: Maximum similarity  Maximum similarity of any two documents  Complete ‐ link: Minimum similarity  Minimum similarity of any two documents  Centroid: Average “intersimilarity”  Average similarity of all document pairs (but excluding pairs of docs in the same cluster) f d i h l )  This is equivalent to the similarity of the centroids.  Group ‐ average: Average “intrasimilarity” G A “i i il i ”  Average similary of all document pairs, including pairs of docs in the same cluster in the same cluster 11 11 Introduction to Information Retrieval Introduction to Information Retrieval Cluster similarity: Example 12 12

Introduction to Information Retrieval Introduction to Information Retrieval Single ‐ link: Maximum similarity 13 13 Introduction to Information Retrieval Introduction to Information Retrieval Complete ‐ link: Minimum similarity 14 14

Introduction to Information Retrieval Introduction to Information Retrieval Centroid: Average intersimilarity intersimilarity = similarity of two documents in different clusters i i il i i il i f d i diff l 15 15 Introduction to Information Retrieval Introduction to Information Retrieval Group average: Average intrasimilarity intrasimilarity = similarity of any pair, including cases where the i i il i i il i f i i l di h h two documents are in the same cluster 16 16

Introduction to Information Retrieval Introduction to Information Retrieval Cluster similarity: Larger Example 17 17 Introduction to Information Retrieval Introduction to Information Retrieval Single ‐ link: Maximum similarity 18 18

Introduction to Information Retrieval Introduction to Information Retrieval Complete ‐ link: Minimum similarity 19 19 Introduction to Information Retrieval Introduction to Information Retrieval Centroid: Average intersimilarity 20 20

Introduction to Information Retrieval Introduction to Information Retrieval Group average: Average intrasimilarity 21 21 Introduction to Information Retrieval Introduction to Information Retrieval Outline ❶ Introduction ❶ I t d ti ❸ Single link/ Complete link ❸ Single ‐ link/ Complete ‐ link ❹ Centroid/ GAAC ❹ Centroid/ GAAC ❺ Variants ❺ ❻ Labeling clusters 22

Introduction to Information Retrieval Introduction to Information Retrieval Single link HAC  The similarity of two clusters is the maximum intersimilarity – the maximum similarity of a document from the first cluster and a document from the second from the first cluster and a document from the second cluster.  Once we have merged two clusters how do we update the Once we have merged two clusters, how do we update the similarity matrix?  This is simple for single link: This is simple for single link: SIM ( ω i ( ω k 1 ∪ ω k 2 )) = max( SIM ( ω i ω k 1 ) SIM ( ω i ω k 2 )) SIM ( ω i , ( ω k 1 ∪ ω k 2 )) max( SIM ( ω i , ω k 1 ), SIM ( ω i , ω k 2 )) 23 23 Introduction to Information Retrieval Introduction to Information Retrieval This dendogram was produced by single ‐ link  Notice: many small clusters (1 or 2 members) b being added to the main dd d h cluster  There is no balanced 2 ‐ Th i b l d 2 cluster or 3 ‐ cluster clustering that can be clustering that can be derived by cutting the dendrogram. 24 24

Introduction to Information Retrieval Introduction to Information Retrieval Complete link HAC  The similarity of two clusters is the minimum intersimilarity  The similarity of two clusters is the minimum intersimilarity – the minimum similarity of a document from the first cluster and a document from the second cluster. and a document from the second cluster.  Once we have merged two clusters, how do we update the similarity matrix?  Again, this is simple: SIM( ω i , ( ω k 1 ∪ ω k 2 )) = min( SIM ( ω i , ω k 1 ), SIM ( ω i , ω k 2 )) ∪ ω )) = min( SIM ( ω ω ) SIM ( ω SIM( ω ( ω ω ))  We measure the similarity of two clusters by computing the y y p g diameter of the cluster that we would get if we merged them. 25 25 Introduction to Information Retrieval Introduction to Information Retrieval Complete ‐ link dendrogram  Notice that this dendrogram is much more balanced than the b l d h h single ‐ link one.  We can create a 2 ‐ cluster W t 2 l t clustering with two clusters of about the clusters of about the same size. 26 26

Introduction to Information Retrieval Introduction to Information Retrieval Exercise: Compute single and complete link clustering 27 27 Introduction to Information Retrieval Introduction to Information Retrieval Single ‐ link clustering 28 28

Introduction to Information Retrieval Introduction to Information Retrieval Complete link clustering 29 29 Introduction to Information Retrieval Introduction to Information Retrieval Single ‐ link vs. Complete link clustering 30 30

Information Retrieval TDT4215 Web intelligence g Based on slides - PDF document

Introduction to Information Retrieval Introduction to Information Retrieval Introduction to Information Retrieval TDT4215 Web intelligence g Based on slides by: Hinrich Schtze and Christina Lioma Hinrich Schtze and Christina Lioma Chapter

Information Retrieval Introducing Information Retrieval and Web Search Information Retrieval

XML Retrieval XML Retrieval XML Retrieval XML Retrieval DB/IR in DB/IR in Theory Theory Web

CS54701: Information Retrieval CS-54701 Information Retrieval Retrieval Models: Language models

CS54701: Information Retrieval CS-54701 Information Retrieval Luo Si Department of Computer

Retrieval by Content Part 2: Text Retrieval Term Frequency and Inverse Document Frequency

Model Divergence Retrieval LM, session 10 CS6200: Information Retrieval Slides by: Jesse

Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar

Information Retrieval Introducing Information Retrieval and Web Search

Information Retrieval CS276: Information Retrieval and Web Search Text Classification 1 Chris

Retrieval Models: Outline CS490W: Web I nformation Search & Management Retrieval Models

Retrieval by Content Image Retrieval Image Retrieval Problem Large Image and video data sets

Accessing XML content: An information retrieval perspective Mounia Lalmas mounia@acm.org 1

Information Retrieval CS-7961: Topics in Information retrieval (IR) is finding material (usually

INFORMATION RETRIEVAL USING NEURAL NETWORKS VINEETH REDDY ANUGU CMSC 676 INFORMATION RETRIEVAL

Retrieval Max Gubin mail@maxgubin.com Information Retrieval History 4000 1950 2000 BC

Information Retrieval CS4611 Professor M. P. Schellekens Assistant: Ang Gao Slides adapted from

Sinnar Taluka Overview and preparation for field trip Pooja Prasad (Ph D scholar) 22/8/2017 1

India - At a Glance (+) Size : 3,287,263 square kilometers 7th largest country in the

#rsif2016 www.sruc.ac.uk/rsif2016 Rural Scotland in Focus 2016: Informing Rural Policy in

FNCE4040 Derivatives Chapter 1 Introduction The Landscape Forwards and Option Contracts

www.crblprogramasoya.org Bogot Laureles Distrito 4281 DRIVERS FOR SUCCESS OF SOYCOW

recommended school supply list to purchase supplies. Please D DO O N OT T label

Reinforcement Learning in Continuous Environments 64.425 Integrated Seminar: Intelligent Robotics

Semantics in Practice Semantics of Practice How do we write semantics? 1: pen-and-paper How do

Information Retrieval TDT4215 Web intelligence g Based on slides - PDF document

Introduction to Information Retrieval Introduction to Information Retrieval Introduction to Information Retrieval TDT4215 Web intelligence g Based on slides by: Hinrich Schtze and Christina Lioma Hinrich Schtze and Christina Lioma Chapter

Information Retrieval Introducing Information Retrieval and Web Search Information Retrieval

XML Retrieval XML Retrieval XML Retrieval XML Retrieval DB/IR in DB/IR in Theory Theory Web

CS54701: Information Retrieval CS-54701 Information Retrieval Retrieval Models: Language models

CS54701: Information Retrieval CS-54701 Information Retrieval Luo Si Department of Computer

Retrieval by Content Part 2: Text Retrieval Term Frequency and Inverse Document Frequency

Model Divergence Retrieval LM, session 10 CS6200: Information Retrieval Slides by: Jesse

Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar

Information Retrieval Introducing Information Retrieval and Web Search

Information Retrieval CS276: Information Retrieval and Web Search Text Classification 1 Chris

Retrieval Models: Outline CS490W: Web I nformation Search &amp; Management Retrieval Models

Retrieval by Content Image Retrieval Image Retrieval Problem Large Image and video data sets

Accessing XML content: An information retrieval perspective Mounia Lalmas mounia@acm.org 1

Information Retrieval CS-7961: Topics in Information retrieval (IR) is finding material (usually

INFORMATION RETRIEVAL USING NEURAL NETWORKS VINEETH REDDY ANUGU CMSC 676 INFORMATION RETRIEVAL

Retrieval Max Gubin mail@maxgubin.com Information Retrieval History 4000 1950 2000 BC

Information Retrieval CS4611 Professor M. P. Schellekens Assistant: Ang Gao Slides adapted from

Sinnar Taluka Overview and preparation for field trip Pooja Prasad (Ph D scholar) 22/8/2017 1

India - At a Glance (+) Size : 3,287,263 square kilometers 7th largest country in the

#rsif2016 www.sruc.ac.uk/rsif2016 Rural Scotland in Focus 2016: Informing Rural Policy in

FNCE4040 Derivatives Chapter 1 Introduction The Landscape Forwards and Option Contracts

www.crblprogramasoya.org Bogot Laureles Distrito 4281 DRIVERS FOR SUCCESS OF SOYCOW

recommended school supply list to purchase supplies. Please D DO O N OT T label

Reinforcement Learning in Continuous Environments 64.425 Integrated Seminar: Intelligent Robotics

Semantics in Practice Semantics of Practice How do we write semantics? 1: pen-and-paper How do

Retrieval Models: Outline CS490W: Web I nformation Search & Management Retrieval Models