
Document Clustering

CISC489/689-010, Lecture #17
Monday, April 20th
Ben Carterette


Classification Review

• Items (documents, web pages, emails) are represented with features
• Some items are assigned a class from a fixed set
• Classification goal: use known class assignments to "learn" a general function f(x) for classifying new instances
• Naïve Bayes classifier:





Clustering

• A set of algorithms that attempt to find latent (hidden) structure in a set of items
• Goal is to identify groups (clusters) of similar items
  – Two items in the same group should be similar to one another
  – An item in one group should be dissimilar to an item in another group


Clustering Example

• Suppose I gave you the shape, color, vitamin C content, and price of various fruits and asked you to cluster them
  – What criteria would you use?
  – How would you define similarity?
• Clustering is very sensitive to how items are represented and how similarity is defined!




Clustering in Two Dimensions

How would you cluster these points? [Figure: a scatter plot of points in two dimensions]


Classification vs Clustering

• Classification is supervised
  – You are given a fixed set of classes
  – You are given class labels for certain instances
  – This is data you can use to learn the classification function
• Clustering is unsupervised
  – You are not given any information about how documents should be grouped
  – You don't even know how many groups there should be
  – There is no training data to learn from
• One way to think of it: learning vs discovery



Clustering in IR

• Cluster hypothesis:
  – "Closely associated documents tend to be relevant to the same requests" – van Rijsbergen '79
• Document clusters may capture relevance better than individual documents
• Clusters may capture "subtopics"

Cluster-Based Search

[Figure: example of cluster-based search results]




Yahoo! Hierarchy

[Figure: part of the www.yahoo.com/Science category hierarchy — top-level categories such as agriculture, biology, physics, CS, space, with subcategories including dairy, crops, agronomy, forestry, AI, HCI, craft, missions, botany, evolution, cell, magnetism, relativity, courses]

Not based on clustering approaches, but one possible use of clustering.

Example from "Introduction to IR" slides by Hinrich Schutze


Clustering Algorithms

• General outline of clustering algorithms:
  1. Decide how items will be represented (e.g., feature vectors)
  2. Define a similarity measure between pairs or groups of items (e.g., cosine similarity)
  3. Determine what makes a "good" clustering
  4. Iteratively construct clusters that are increasingly "good"
  5. Stop after a local/global optimum clustering is found
• Steps 3 and 4 differ the most across algorithms



Item Representation

• Typical representation for documents in IR:
  – "Bag of words" – a vector of terms appearing in the document with associated weights
  – N-grams
  – etc.
• Any representation used in retrieval can (theoretically) be used in clustering or classification
  – Though specialized representations may be better for particular tasks
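The bag-of-words representation above can be sketched in a few lines. This is an illustrative sketch, not code from the lecture: it assumes plain whitespace tokenization and raw term-frequency weights, where a real IR system would add stemming, stopword removal, and tf-idf weighting.

```python
from collections import Counter

def bag_of_words(text):
    # Sparse term-weight vector: {term: weight}.
    # Lowercased whitespace tokens with raw term-frequency weights
    # are simplifying assumptions for illustration.
    return dict(Counter(text.lower().split()))

vec = bag_of_words("The cat saw the other cat")
# vec maps each distinct term to its count in the document
```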


Item Similarity

• Cluster hypothesis suggests that document similarity should be based on information content
  – Ideally semantic content, but we have already seen how hard that is
• Instead, use the same idea as in query-based retrieval
  – The score of a document to a query is based on how similar they are in the words they contain
    • Cosine angle between vectors; P(R | Q, D); P(Q | D)
  – The similarity of two documents will be based on how similar they are in the words they contain


Document Similarity

[Figure: cosine similarity, Euclidean distance, and Manhattan distance illustrated for documents D1 and D2]

"similarity" vs "distance": in practice, you can use either

What Makes a Good Cluster?

• Large vs small?
  – Is it OK to have a cluster with one item?
  – Is it OK to have a cluster with 10,000 items?
• Similarity between items?
  – Is it OK for things in a cluster to be very far apart, as long as they are closer to each other than to things in other clusters?
  – Is it OK for things to be so close together that other similar things are excluded from the cluster?
• Overlapping vs non-overlapping?
  – Is it OK for two clusters to contain some items in common?
  – Should clusters "nest" within one another?




Example Approaches

• "Hard" clustering
  – Every item is in only one cluster
• "Soft" clustering
  – Items can belong to more than one cluster
  – Nested hierarchy: item belongs to a cluster, as well as the cluster's parent cluster, and so on
  – Non-nested: item belongs to two separate clusters
    • E.g. a document about jaguar cats riding in Jaguar cars might belong to the "animal" cluster and the "car" cluster


Example Approaches

• Flat clustering:
  – No overlap: every item in exactly one cluster
  – K clusters total
  – Start with random groups, then refine them until they are "good"
• Hierarchical clustering:
  – Clusters are nested: a cluster can be made up of two or more smaller clusters
  – No fixed number
  – Start with one group and split it until there are good clusters
  – Or start with N groups and agglomerate them until there are good clusters




Flat Clustering

• Goal: partition N documents into K clusters
• Given: N document feature vectors, a number K
• Optimal algorithm:
  – Try every possible clustering and take whichever one is the "best"
  – Computation time: O(K^N)
• Heuristic approach:
  – Split documents into K clusters randomly
  – Move documents from one cluster to another until the clusters seem "good"


K-Means Clustering

• K-means is a partitioning heuristic
• Documents are represented as vectors
• Clusters are represented as a centroid vector
• Basic algorithm:
  – Step 0: Choose K docs to be initial cluster centroids
  – Step 1: Assign points to closest centroid
  – Step 2: Recompute cluster centroids
  – Step 3: Goto 1




K-Means Clustering Algorithm

Input: N documents, a number K

A[1], A[2], …, A[N] := 0
C1, C2, …, CK := initial cluster assignment (pick K docs)
do
  changed := false
  for each document Di, i = 1 to N
    k := argmin_k dist(Di, Ck)   (equivalently, k := argmax_k sim(Di, Ck))
    if A[i] != k then
      A[i] := k
      changed := true
  if changed then C1, C2, …, CK := cluster centroids
until changed is false
return A[1..N]
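The pseudocode above translates fairly directly into Python. This is a sketch under some assumptions not in the slides: documents are dense feature vectors, distance is Euclidean, and the K initial centroids are K documents picked at random.

```python
import math
import random

def kmeans(docs, k, max_iters=100, seed=0):
    # Step 0: pick K documents as initial centroids.
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(docs, k)]
    assign = [0] * len(docs)  # A[1..N] in the slide's pseudocode
    for _ in range(max_iters):
        changed = False
        # Step 1: assign each document to its closest centroid.
        for i, d in enumerate(docs):
            nearest = min(range(k), key=lambda c: math.dist(d, centroids[c]))
            if assign[i] != nearest:
                assign[i] = nearest
                changed = True
        if not changed:  # until changed is false
            break
        # Step 2: recompute centroids as the mean of cluster members.
        for c in range(k):
            members = [docs[i] for i in range(len(docs)) if assign[i] == c]
            if members:  # guard against an emptied cluster
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign, centroids

docs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assign, cents = kmeans(docs, 2)  # two obvious groups of three points each
```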


K-Means Decisions

• K – number of clusters
  – K=2? K=10? K=500?
• Cluster initialization
  – Random initialization often used
  – A bad initial assignment can result in bad clusters
• Distance measure
  – Cosine similarity most common
  – Euclidean distance, Manhattan distance, manifold distances
• Stopping condition
  – Until no documents have changed clusters
  – Until centroids do not change
  – Fixed number of iterations




K-Means Advantages

• Computationally efficient
  – Distance between two documents = O(V)
  – Distance of each doc to each centroid = O(KNV)
  – Calculating centroids = O(NV)
  – For m iterations, O(m(KNV + NV)) = O(mKNV)
• Tends to converge quickly (m is relatively small)
• Easy to implement


K-Means Disadvantages

• What should K be?
• Clusters have fixed geometric shape
  – Spherical
  – Very sensitive to dimensions and weights
• No notion of outliers
  – A document that's far away from everything will either be in a cluster on its own or in some very wide (geometrically speaking) cluster




Hierarchical Clustering

• Goal: construct a hierarchy of clusters
  – The top level of the hierarchy consists of a single cluster with all items in it
  – The bottom level of the hierarchy consists of N (# items) singleton clusters
• Two types of hierarchical clustering
  – Divisive ("top down")
  – Agglomerative ("bottom up")
• Hierarchy can be visualized as a dendrogram

Example Dendrogram

[Figure: dendrogram over items A through M]

Obtain clusters by cutting the dendrogram at some threshold. The clusters are the connected components.




Divisive and Agglomerative Hierarchical Clustering

• Divisive
  – Start with a single cluster consisting of all of the items
  – Until only singleton clusters exist…
    • Divide an existing cluster into two new clusters
• Agglomerative
  – Start with N (# items) singleton clusters
  – Until a single cluster exists…
    • Combine two existing clusters into a new cluster
• How do we know how to divide or combine clusters?
  – Define a division or combination cost
  – Perform the division or combination with the lowest cost

Agglomerative Hierarchical Clustering

[Figure: agglomerative clustering example]




Divisive Hierarchical Clustering

[Figure: divisive clustering example]

Clustering Costs

• Similarity measured between two different clusters
• Single linkage
• Complete linkage
• Average linkage
• Average group linkage

Clustering Strategies

[Figure: single linkage, complete linkage, average linkage, and average group linkage illustrated]

Single Linkage


• Similarity between two clusters = minimum distance between all pairs of documents
  – (Or maximum similarity)
• After merging two clusters, the distance from the new cluster to any other cluster is the minimum of the two merged clusters' distances
• Tends to produce "stringier" hierarchies
• Example:




Complete Linkage

• Similarity between two clusters = maximum distance between all pairs of documents
  – (Or minimum similarity)
• After merging two clusters, the distance from the new cluster to any other cluster is the maximum of the two merged clusters' distances
• Tends to produce more "spherical" clusters
• Example:
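The two linkage costs can be contrasted in a naive agglomerative sketch: start from singleton clusters and repeatedly merge the cheapest pair, where single linkage scores a pair of clusters by the minimum pairwise distance and complete linkage by the maximum. This is an illustrative O(N^3)-ish implementation (matching the naive analysis later in the lecture), not code from the slides.

```python
import math

def agglomerate(points, linkage="single"):
    # Distance between two clusters (lists of point indices):
    # single linkage = min pairwise distance, complete = max.
    def cluster_dist(a, b):
        dists = [math.dist(points[i], points[j]) for i in a for j in b]
        return min(dists) if linkage == "single" else max(dists)

    clusters = [[i] for i in range(len(points))]
    merges = []  # record of (cluster, cluster) combinations, in order
    while len(clusters) > 1:
        # Find the cheapest pair of clusters to combine.
        x, y = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]),
        )
        merges.append((clusters[x], clusters[y]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return merges

pts = [(0, 0), (0, 1), (5, 0), (5, 1)]
merges = agglomerate(pts, linkage="single")
# the two tight pairs merge first, then the two resulting clusters
```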


Hierarchical Clustering Advantages

• Flexibility
  – No fixed number of clusters
  – Can change threshold to get different clusters
    • Lower threshold: more specific clusters
    • Higher threshold: broader clusters
  – Can change cost function to get different clusters
• Hierarchical structure may be meaningful
  – E.g. articles about jaguar cats agglomerate together, articles about tigers agglomerate, then both agglomerate to articles about big cats




Hierarchical Clustering Disadvantages

• Computationally inefficient
  – Similarity between two documents = O(V)
  – Requires similarity between all pairs of documents = O(VN^2)
  – Then requires similarity between the most recent cluster and all existing clusters, naïvely O(N^3)
    • O(N^2 log N) with a little cleverness


K-Nearest Neighbor Clustering

• K-means clustering partitions items into clusters
• Hierarchical clustering creates nested clusters
• K-nearest neighbor clustering forms one cluster per item
  – The cluster for item j consists of j and j's K nearest neighbors
  – Clusters now overlap
  – Some things don't get clustered
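The one-cluster-per-item idea above can be sketched directly: for each item, sort the other items by distance and keep the K closest. An illustrative sketch (Euclidean distance over point tuples is an assumption, not from the slides):

```python
import math

def knn_clusters(points, k):
    # One cluster per item: item j plus its k nearest neighbors.
    # Clusters overlap, since an item can be a neighbor of many others.
    clusters = {}
    for j, p in enumerate(points):
        others = sorted(
            (i for i in range(len(points)) if i != j),
            key=lambda i: math.dist(p, points[i]),
        )
        clusters[j] = [j] + others[:k]
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10)]
cl = knn_clusters(pts, 2)
# even the outlier (10, 10) anchors a cluster of itself plus its
# 2 nearest neighbors, however far away they are
```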


5-Nearest Neighbor Clustering

[Figure: 5-nearest-neighbor clusters over points labeled A, B, C, and D]

Evaluating Clustering


• Clustering will never be 100% accurate
  – Documents will be placed in clusters they don't belong in
  – Documents will be excluded from clusters they should be part of
  – A natural consequence of using term statistics to represent the information contained in documents
• Like retrieval and classification, clustering effectiveness must be evaluated
• Evaluating clustering is challenging, since it is an unsupervised learning task




Evaluating Clustering

• If labels exist, can use standard IR metrics, such as precision and recall
  – In this case we are evaluating the ability of our algorithm to discover the "true" latent information
• This only works if you have some way to "match" clusters to classes
• What if there are fewer or more clusters than classes?

            Class A   Class B   Class C   Class D
Cluster 1   A1        B1        C1        D1
Cluster 2   A2        B2        C2        D2
Cluster 3   A3        B3        C3        D3
Cluster 4   A4        B4        C4        D4


Evaluating Clusters

• "Purity": the ratio between the number of documents from the dominant class in C and the size of C

    purity(Ci) = (1/|Ci|) max_j |Ci ∩ Kj|

  – Ci is a cluster; Kj is a class
• Not such a great measure
  – Does not take into account coherence of the class
  – Optimized by making N clusters, one for each document
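The purity measure above can be sketched as follows, in both per-cluster and overall forms. This is an illustrative sketch; note how it exposes the last bullet's criticism, since a singleton cluster trivially scores 1.0.

```python
from collections import Counter

def cluster_purity(cluster, labels):
    # Fraction of a cluster's items belonging to its dominant class:
    # |Ci ∩ Kj| / |Ci| for the best-matching class Kj.
    counts = Counter(labels[i] for i in cluster)
    return counts.most_common(1)[0][1] / len(cluster)

def purity(clusters, labels):
    # Overall purity: dominant-class counts summed over clusters,
    # divided by the total number of items.
    n = sum(len(c) for c in clusters)
    dominant = sum(
        Counter(labels[i] for i in c).most_common(1)[0][1] for c in clusters
    )
    return dominant / n

labels = {0: "A", 1: "A", 2: "B", 3: "B"}
p = purity([[0, 1, 2], [3]], labels)  # (2 + 1) / 4
```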




Evaluating Clusters

• With no labeled data, evaluation is even more difficult
• Best approach:
  – Evaluate the system that the clustering is part of
  – E.g. if clustering is used to aid retrieval, evaluate the cluster-aided retrieval
  – More on Wednesday