SLIDE 1 Clustering the Tagged Web
Daniel Ramage, Paul Heymann, Christopher D. Manning, Hector Garcia-Molina
Stanford University WSDM 2009
Images from del.icio.us, lbaumann.com, www.hometrainingtools.com
SLIDE 2
Web document text
SLIDE 3
Web document text
Words: information about catalog pricing changes in 2008 welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals
SLIDE 4
Web document text
Words: information about catalog pricing changes in 2008 welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals Anchor Text: home science tools hometrainingtools.com links click follow supplies training experiments other pages
SLIDE 5
Web document text
Words: information about catalog pricing changes in 2008 welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals Anchor Text: home science tools hometrainingtools.com links click follow supplies training experiments other pages Tags: science homeschool education shopping curriculum homeschooling experiments tools chemistry supplies
SLIDE 6
Why tags? – del.icio.us
SLIDE 7
Why tags? – del.icio.us
≈120,000 posts/day
12–75 million (≈10^7–10^8) unique URLs (versus 10^9–10^11 total URLs)
Disproportionately the web’s most useful URLs (and those URLs have many tags)
SLIDE 8 Using tags to understand the web
The web is large and growing: anything that helps
us understand high level structure is useful
Tags encode semantically meaningful labels Tags cover much of the web’s best content How can we use tags to provide high-level insight?
SLIDE 9 Web page clustering task
Given a collection of web
pages
SLIDE 10 Web page clustering task
A A A B B A C C
Given a collection of web
pages
Assign each page to a
cluster, maximizing similarity within clusters
SLIDE 11 Web page clustering task
A A A B B A C C
Given a collection of web
pages
Assign each page to a
cluster, maximizing similarity within clusters
Applications: improved
user interfaces, collection clustering, search result diversity, language-model based retrieval
SLIDE 12
Structure of this talk
Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
Features Models
SLIDE 13
Models: K-means and MM-LDA
Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
Features Models
SLIDE 14 K-means assumes the
standard Vector Space Model: documents are Euclidean normalized real-valued vectors
Algorithm: iteratively
Re-assign documents to closest cluster centroid Update cluster centroids from document assignments
Model 1: K-means clustering
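The iterate-reassign-update loop described above can be sketched as follows (a minimal NumPy illustration, not the paper's implementation; all names and parameters are made up):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means over real-valued document vectors X (n_docs x n_dims)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: re-assign each document to the closest cluster centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 2: update each centroid from its document assignments.
        new = np.array([X[assign == j].mean(axis=0) if (assign == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break  # assignments stable: converged
        centroids = new
    return assign, centroids
```

With two well-separated groups of points, the loop converges in a handful of iterations regardless of which seeds are picked.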
SLIDE 15 LDA assumes each
document’s words generated by some topic’s word distribution
Model 2: Latent Dirichlet Allocation
Document 22
Words: information about catalog pricing changes 2008 welcome looking hands-on science ideas try kitchen
Topic 12: catalog shopping buy Internet checkout cart
Topic 5: science experiment learning ideas practice information …
SLIDE 16 LDA assumes each
document’s words generated by some topic’s word distribution
Paired with an inference
mechanism (Gibbs sampling), learns per-document topic distributions and per-topic distributions over words
Model 2: Latent Dirichlet Allocation
Document 22
Words: information about catalog pricing changes 2008 welcome looking hands-on science ideas try kitchen
Topic 12: catalog shopping buy Internet checkout cart
Topic 5: science experiment learning ideas practice information …
SLIDE 17
Features: words, anchors, and tags
Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
Features Models
SLIDE 18 Feature Combination Feature Space Size Words Anchors Tags
Combining features
Words Tags Anchors
SLIDE 19 Feature Combination Feature Space Size Words Anchors Tags Tags as Words
Combining features
Words Tags as Words Tags Anchors Tags & Anchors as Words Anchors as Words
SLIDE 20 Feature Combination Feature Space Size Words Anchors Tags Tags as Words Tags as New Words
Combining features
Words Tags as Words Words Tags Tags Anchors Words Tags Anchors Words Anchors
SLIDE 21 Feature Combination Feature Space Size Words Anchors Tags Tags as Words Tags as New Words
Combining features
Words Tags as Words Words Tags Tags Anchors Words Tags Anchors Words Anchors
Simple feature space modifications for existing models
SLIDE 22 Feature Combination Feature Space Size Words Anchors Tags Tags as Words Tags as New Words Words + Tags
Combining features
Words Tags as Words Words Tags Tags Anchors Words Anchors Words Tags Anchors Words Tags
SLIDE 23 Feature Combination Feature Space Size Words Anchors Tags Tags as Words Tags as New Words Words + Tags
Combining features
Words Tags as Words Words Tags Tags Anchors Words Anchors Words Tags Anchors
K-means: normalize feature input vectors (Words, Tags) independently
LDA: multiple parallel sets of observations (Words, Tags)
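The K-means side of this combination — normalizing each feature channel to unit length before concatenating — might look like this (a hypothetical sketch; the vector contents are made up):

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit Euclidean length (leave zero vectors alone)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def words_plus_tags(word_vec, tag_vec):
    """Normalize each channel independently, then concatenate, so that
    words and tags contribute equally regardless of their raw counts."""
    return np.concatenate([l2_normalize(word_vec), l2_normalize(tag_vec)])
```

For example, `words_plus_tags(np.array([3., 4.]), np.array([0., 2.]))` yields `[0.6, 0.8, 0., 1.]`: each channel carries unit weight on its own.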
SLIDE 24 Experiments
Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
Features Models
words and tags in the VSM
SLIDE 25 Experiments
Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
models, at multiple levels of specificity
Features Models
SLIDE 26 Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
Experiments
complement or substitute for anchor text?
Features Models
SLIDE 27 Experimental Setup
Construct surrogate “gold standard” clustering
using Open Directory Project
Reflects a (problematic) consensus clustering, with
known number of clusters
ODP Category   # Documents   Top Tags
Computers      5361          web css tools software programming
Health         434           parenting medicine healthcare medical
Reference      1325          education reference time research dictionary
SLIDE 28 Experimental Setup
Score predicted clusterings with ODP, but not
trying to predict ODP
Useful for relative system performance
ODP Category   # Documents   Top Tags
Computers      5361          web css tools software programming
Health         434           parenting medicine healthcare medical
Reference      1325          education reference time research dictionary
SLIDE 29
Evaluation: Cluster F1
A A A B B A C C Reference Health
Intuition: balance pairwise precision (place only similar documents together) with pairwise recall (keep all similar documents together)
SLIDE 30
Evaluation: Cluster F1
A A A B B A C C Reference Health
                    Same Label   Different Label
Same Cluster
Different Cluster
SLIDE 31
Evaluation: Cluster F1
A A A B B A C C Reference Health

                    Same Label   Different Label
Same Cluster            5
Different Cluster
SLIDE 32
Evaluation: Cluster F1
A A A B B A C C Reference Health

                    Same Label   Different Label
Same Cluster            5              3
Different Cluster

Cluster Precision: 5/8
SLIDE 33
Evaluation: Cluster F1
A A A B B A C C Reference Health

                    Same Label   Different Label
Same Cluster            5              3
Different Cluster       8

Cluster Precision: 5/8 Cluster Recall: 5/13
SLIDE 34
Evaluation: Cluster F1
A A A B B A C C Reference Health

                    Same Label   Different Label
Same Cluster            5              3
Different Cluster       8

Cluster Precision: 5/8 Cluster Recall: 5/13 Cluster F1: .476
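The pairwise precision/recall computation behind Cluster F1 can be sketched as follows (a minimal illustration; the example labels below are made up, not the slide's eight documents):

```python
from itertools import combinations

def cluster_f1(labels, clusters):
    """Pairwise cluster precision, recall, and F1 against gold labels."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        if same_cluster and same_label:
            tp += 1   # similar documents correctly placed together
        elif same_cluster and not same_label:
            fp += 1   # dissimilar documents wrongly placed together
        elif same_label and not same_cluster:
            fn += 1   # similar documents wrongly separated
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For instance, `cluster_f1([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])` gives precision 0.5, recall 0.5, F1 0.5: of the 4 same-cluster pairs, 2 share a label, and of the 4 same-label pairs, 2 share a cluster.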
SLIDE 35 Experiments
Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
Features Models
words and tags in the VSM
SLIDE 36
Features      K-means
Words         .139
Tags          .219
Words+Tags    .225
Result: normalize words and tags independently in the Vector Space Model
Possible utility for other applications of the VSM
Words Words Tags Tags
SLIDE 37
Features             K-means
Words                .139
Tags                 .219
Words+Tags           .225
Tags as Words (×1)   .158
Tags as Words (×2)   .176
Tags as New Words    .154
Result: normalize words and tags independently in the Vector Space Model
Possible utility for other applications of the VSM
Words Words Tags Tags Words Tags Tags as Words Tags as Words
SLIDE 38 Experiments
Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
models, at multiple levels of specificity
Features Models
SLIDE 39 Result: MM-LDA outperforms K-means on top-level ODP categories
Features      K-means   (MM-)LDA
Words         .139      .260
Tags          .219      .270
Words+Tags    .225      .307
Words Words Tags Tags
SLIDE 40 Tagging at multiple basic levels
People use tags to help find the same page later, often at a “natural” level of specificity
Programming/Languages (1094 documents)
Java PHP Python C++ JavaScript Perl Lisp Ruby C
Society/Social Sciences (1590 documents)
Issues, Religion & Spirituality, People, Politics, History, Law, Philosophy
SLIDE 41 Tagging at multiple basic levels
People use tags to help find the same page later, often at a “natural” level of specificity
Programming/Languages (1094 documents)
Java PHP Python C++ JavaScript Perl Lisp Ruby C
Society/Social Sciences (1590 documents)
Issues, Religion & Spirituality, People, Politics, History, Law, Philosophy
“java” applies to 73% of Programming/Java pages, but “software” applies to only 21% of Top/Computer pages
SLIDE 42 Result: Sometimes, tags tell you more about cluster membership than words do
Tags are very discriminating in subcategories K-means wins when the feature space is cleaner
Features                K-means   (MM-)LDA
Programming Languages
  Words                 .189      .288
  Tags                  .567      .463
  Words+Tags            .556      .297
Social Sciences
  Words                 .196      .300
  Tags                  .307      .310
  Words+Tags            .308      .302
SLIDE 43 Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
Experiments
complement or substitute for anchor text?
Features Models
SLIDE 44
Result: Tags complement anchor text
Anchors alone can depress performance, but adding tags brings results to within a small delta of Words+Tags.
Features             K-means   (MM-)LDA
Words                .139      .260
Words+Anchors        .128      .248
Words+Anchors+Tags   .224      .306
SLIDE 45 Conclusions
Tags add real value when high-level semantic
information is needed
Tags act differently than words, anchor text At the right level of specificity, tags describe pages
better than anything else
Treat tags and words as separate information
channels to maximize utility Thanks! Questions?
SLIDE 46
SLIDE 47
Backup material
SLIDE 48 Result: Tags complement anchor text
Anchor text acts as annotations from another web author. Noisier than words and tags, but can be usefully integrated into a joint model.
SLIDE 49 Future directions
More targeted graphical models
Individual users with individual vocabularies Time series
Direct evaluation in retrieval / browsing More types of annotated documents
Product reviews; academic papers; blog posts
SLIDE 50
Content age: ODP versus del.icio.us
57% of Tag Crawl data initially indexed by Google
SLIDE 51 Clustering (flat, parametric)
Input
Number of clusters K Set of documents: <words,tags,anchors>
Output
Assignment of documents to clusters
Evaluation
Comparison to a gold standard
SLIDE 52 Outline
The tagged web Dataset and methodology Clustering with tags and words
K-Means in tag-augmented vector space Multi-Multinomial LDA
Experiments Discussion
SLIDE 53 Outline
The tagged web Dataset and methodology Clustering with tags and words
K-Means in tag-augmented vector space Multi-Multinomial LDA
Experiments Discussion
SLIDE 54 Outline
The tagged web Dataset and methodology Clustering with tags and words
K-Means in Tag-Augmented Vector Space Multi-Multinomial LDA
Experiments Discussion
SLIDE 55 Automatic cluster evaluation
Pick a slice of ODP with k subtrees Cluster relevant documents into k sets Compare inferred assignments to ODP labeling
SLIDE 56 Automatic cluster evaluation
Pick a slice of ODP with k subtrees Cluster relevant documents into k sets Compare inferred assignments to ODP labeling
Advantages
Scalable, automatic, reflects “consensus” clustering
Drawbacks
May not translate to performance gains in task Does not address choosing best k
SLIDE 57
F-measure of cluster quality
                    Same ODP Category   Different ODP Category
Same cluster               A                    B
Different cluster          C                    D

(cell values count pairs of examples)

P = A / (A + B)
R = A / (A + C)
F1 = 2PR / (P + R)
SLIDE 58 A tagged document
Tags curriculum education(2) homeschool imported learning science(4) shopping slinky teachers teaching tools
ODP Label: Top/Reference
Top/Reference/Education/K_through_12/Home_Schooling/Curriculum/Science
SLIDE 59 MM-LDA implementation
Collapsed Gibbs-sampler with hard assignments
Repeatedly samples a new z for each word
Usually converges within several dozen passes
Could be parallelized
Runtime:
22 min (MM-LDA) versus 6 min (K-means) on 2000 documents
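A collapsed Gibbs sampler for plain LDA, in the spirit of the implementation described above, might look like this (a simplified sketch; hyperparameters, sizes, and names are made up, and the paper's version uses hard assignments):

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, n_iters=20, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: lists of word ids."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # doc-topic counts
    n_kw = np.zeros((n_topics, vocab_size))  # topic-word counts
    n_k = np.zeros(n_topics)                 # topic totals
    z = [rng.integers(n_topics, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):           # initialize counts randomly
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(n_iters):                 # repeatedly resample z per word
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                  # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # new topic ∝ (doc-topic count) * (topic-word count), smoothed
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k                  # add the new assignment back
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    theta = n_dk + alpha                     # per-document topic distributions
    return theta / theta.sum(axis=1, keepdims=True)
```

Each pass touches every word once, which is why runtime scales with collection size and topic count.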
SLIDE 60
K-means generated clusters
SLIDE 61
MM-LDA generated clusters
SLIDE 62
K-means term weighting
SLIDE 63 Impact
Social bookmarking is big and getting bigger Tags hold promise of specific, relevant indexing
vocabulary for the web
Not quite full-text indexing Not quite controlled pre-coordinate indexing
Tagging data improves web clustering
performance, which promises better IR
How else will tagging impact IR?
SLIDE 64
Scatter/Gather [Cutting et al 1992]
SLIDE 65 Stanford tag crawl dataset
Heymann et al., 2008
SLIDE 66
Stanford tag crawl dataset
SLIDE 67 K-means [CS276]
Assumes documents are real-valued vectors
Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster c:

μ_c = (1/|c|) Σ_{x∈c} x

Reassignment of instances to clusters is based on distances to the current cluster centroids
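The centroid definition can be checked numerically (a toy example with made-up points):

```python
import numpy as np

# Hypothetical cluster c of three 2-D document vectors; the centroid
# mu_c is their coordinate-wise mean.
c = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
mu_c = c.mean(axis=0)
print(mu_c)  # [1. 1.]
```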
SLIDE 68 K-means example (K=2) [CS276]
Pick seeds Reassign clusters Compute centroids
x x
Reassign clusters
x x x x
Compute centroids
Reassign clusters Converged!
SLIDE 69
MM-LDA outperforms K-means
On top-level ODP categories
SLIDE 70
Latent Dirichlet Allocation (LDA)
D – number of documents
N – number of words in a document
α – symmetric Dirichlet prior
θ – per-document topic multinomial
z_w – per-word topic assignment
w – word observation
β – per-topic word multinomial
SLIDE 71 MM-LDA Properties
Natural extension of LDA Jointly models multiple types of observations
Similar to Blei et al.'s GM-LDA for images with captions
Words and tags counted independently, contribute
jointly to document topic model
SLIDE 72
SLIDE 73
A web document collection
Stanford Tag Crawl Dataset: One month of del.icio.us posts in May/June 2007
SLIDE 74
Most web pages come with words
Words: welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals ...
SLIDE 75
Words can be used to cluster
SLIDE 76
Text surrounds links from other pages
Words: welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals ... Anchor Text: tools home science links click buy supplies experiments ...
SLIDE 77
Social bookmarking websites add tags
Tags: curriculum education homeschool imported learning science shopping slinky teachers teaching tools Words: welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals ... Anchor Text: tools home science links click buy supplies experiments ...
SLIDE 78
How do we use words, anchor text, and tags together to most improve clustering?
SLIDE 79
How do we test if clustering improves?
SLIDE 80
Many pages have a “gold standard” label
Tags: curriculum education homeschool imported learning science shopping slinky teachers teaching tools Words: welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals ... Anchor Text: tools home science links click buy supplies experiments ...
Reference/Education
SLIDE 81
Open Directory Project
SLIDE 82
Cluster evaluation
Reference Arts
SLIDE 83
Cluster evaluation
Reference Arts Precision: 4/7 pairs Recall: 4/12 pairs
SLIDE 84
Cluster evaluation
Reference Arts Precision: 9 / 9 pairs Recall: 9 / 12 pairs
SLIDE 85 Outline
The tagged web Dataset and methodology Algorithms for clustering with tags and words
K-Means in tag-augmented vector space Multi-Multinomial LDA
Results
Tag and word normalization Clustering at varying levels of specificity Incorporating anchor text
SLIDE 86 Multi-Multinomial LDA (MM-LDA)
Extends Latent Dirichlet Allocation: Words and tags (and anchors, etc.) are counted independently, contribute jointly to topic probabilities
SLIDE 87 Multi-Multinomial LDA (MM-LDA)
Words Tags
Extends Latent Dirichlet Allocation: Words and tags (and anchors, etc.) are counted independently, contribute jointly to topic probabilities
SLIDE 88 Multi-Multinomial LDA (MM-LDA)
Words Tags
Which topic generates each Word/Tag (from 1 to N)
Per-topic Word/Tag distribution
Per-document topic distribution
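The shared per-document topic distribution can be illustrated with a toy generative sketch (all sizes, names, and hyperparameters below are made up; this is not the paper's code):

```python
import numpy as np

def mm_lda_generate(n_docs=3, n_topics=2, n_words=6, n_tags=4,
                    doc_len=10, tags_per_doc=3, alpha=0.5, seed=0):
    """Generative sketch of MM-LDA: words and tags are drawn from separate
    per-topic multinomials, but share one per-document topic distribution."""
    rng = np.random.default_rng(seed)
    phi_words = rng.dirichlet(np.full(n_words, 0.1), size=n_topics)  # per-topic word dists
    phi_tags = rng.dirichlet(np.full(n_tags, 0.1), size=n_topics)    # per-topic tag dists
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(np.full(n_topics, alpha))  # shared by both channels
        z_w = rng.choice(n_topics, size=doc_len, p=theta)       # topic per word
        words = [rng.choice(n_words, p=phi_words[k]) for k in z_w]
        z_t = rng.choice(n_topics, size=tags_per_doc, p=theta)  # topic per tag
        tags = [rng.choice(n_tags, p=phi_tags[k]) for k in z_t]
        docs.append((words, tags))
    return docs
```

The key design point is that `theta` is sampled once per document and reused for both channels, so words and tags contribute jointly to the inferred topic mixture.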
SLIDE 89
The tagged web (Heymann et al., WSDM 2008)
SLIDE 90
Evaluation: Cluster F1
A A A B B A C C Reference Arts

                    Same Label   Different Label
Same Cluster            5              3
Different Cluster       8             12

Cluster Precision: 5/8 Cluster Recall: 5/13
SLIDE 91 Goal: clustering for information retrieval
Better user interfaces
e.g. Clusty, Vivisimo, Scatter/Gather, and friends
Collection clustering
e.g. Columbia Newsblaster, Google News
Improved language models for better retrieval
e.g. Liu and Croft 2004; Wei and Croft 2006
Better cluster based-retrieval
e.g. Salton 1971
SLIDE 92
Stanford tag crawl / ODP intersection
SLIDE 93 K-means feature vectors
Strategy                 Feature Space
Tags as Weighted Words   Tags folded into the word vocabulary (with weighting)
Tags as New Words        Words plus Tags as a disjoint vocabulary
Tags+Words               Words and Tags as independently normalized vectors
SLIDE 94 Experiments
Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
for multiple feature types
Features Models
SLIDE 95 Result: Sometimes, tags tell you more about cluster membership than words do
Tags are very discriminating in subcategories
“java” applies to 73% of Programming/Java pages, but “software” applies to only 21% of Computer pages
K-means wins when the feature space is cleaner
Features                K-means   (MM-)LDA
All
  Words                 .139      .260
  Tags                  .219      .270
  Words+Tags            .225      .307
Programming Languages
  Words                 .189      .288
  Tags                  .567      .463
  Words+Tags            .556      .297
Social Sciences
  Words                 .196      .300
  Tags                  .307      .310
  Words+Tags            .308      .302