

slide-1
SLIDE 1

Clustering the Tagged Web

Daniel Ramage, Paul Heymann, Christopher D. Manning, Hector Garcia-Molina

Stanford University WSDM 2009

Images from del.icio.us, lbaumann.com, www.hometrainingtools.com

slide-2
SLIDE 2

Web document text

slide-3
SLIDE 3

Web document text

Words: information about catalog pricing changes in 2008 welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals

slide-4
SLIDE 4

Web document text

Words: information about catalog pricing changes in 2008 welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals Anchor Text: home science tools hometrainingtools.com links click follow supplies training experiments other pages

slide-5
SLIDE 5

Web document text

Words: information about catalog pricing changes in 2008 welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals Anchor Text: home science tools hometrainingtools.com links click follow supplies training experiments other pages Tags: science homeschool education shopping curriculum homeschooling experiments tools chemistry supplies

slide-6
SLIDE 6

Why tags? – del.icio.us

slide-7
SLIDE 7

Why tags? – del.icio.us

≈120,000 posts/day. 12–75 million (≈10^7–10^8) unique URLs (versus 10^9–10^11 total URLs). Disproportionately the web's most useful URLs (and those URLs have many tags).

slide-8
SLIDE 8

Using tags to understand the web

- The web is large and growing: anything that helps us understand high-level structure is useful
- Tags encode semantically meaningful labels
- Tags cover much of the web's best content
- How can we use tags to provide high-level insight?

slide-9
SLIDE 9

Web page clustering task

- Given a collection of web pages

slide-10
SLIDE 10

Web page clustering task

[Diagram: eight pages assigned to clusters A, B, and C]

- Given a collection of web pages
- Assign each page to a cluster, maximizing similarity within clusters

slide-11
SLIDE 11

Web page clustering task

[Diagram: eight pages assigned to clusters A, B, and C]

- Given a collection of web pages
- Assign each page to a cluster, maximizing similarity within clusters
- Applications: improved user interfaces, collection clustering, search result diversity, language-model based retrieval

slide-12
SLIDE 12

Structure of this talk

Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA

Features Models

slide-13
SLIDE 13

Models: K-means and MM-LDA

Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA

Features Models

slide-14
SLIDE 14

- K-means assumes the standard Vector Space Model: documents are Euclidean-normalized real-valued vectors
- Algorithm: iteratively re-assign documents to the closest cluster centroid, then update cluster centroids from document assignments

Model 1: K-means clustering
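The iterative re-assign/update loop described above can be sketched as a minimal K-means over normalized document vectors (illustrative only; function names and the toy data are assumptions, not the paper's implementation):

```python
# Minimal K-means sketch: re-assign documents to the closest centroid,
# then recompute each centroid as the mean of its assigned documents.
import random

def l2_normalize(v):
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm else list(v)

def kmeans(docs, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(docs, k)]
    assign = [0] * len(docs)
    for _ in range(iters):
        # Re-assign each document to the closest centroid (squared Euclidean).
        for i, d in enumerate(docs):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(d, centroids[c])),
            )
        # Update each centroid to the mean of its assigned documents.
        for c in range(k):
            members = [docs[i] for i in range(len(docs)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Toy data: two obvious groups of Euclidean-normalized 2-d "documents".
docs = [l2_normalize(v) for v in [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]]
print(kmeans(docs, 2))
```

Real document vectors would be high-dimensional tf-idf weights rather than 2-d toys, but the loop structure is the same.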

slide-15
SLIDE 15

- LDA assumes each document's words are generated by some topic's word distribution

Model 2: Latent Dirichlet Allocation

[Diagram: Document 22's words linked to Topic 12 (catalog, shopping, buy, Internet, checkout, cart) and Topic 5 (science, experiment, learning, ideas, practice, information)]

slide-16
SLIDE 16

- LDA assumes each document's words are generated by some topic's word distribution
- Paired with an inference mechanism (Gibbs sampling), learns per-document distributions over topics and per-topic distributions over words

Model 2: Latent Dirichlet Allocation

[Diagram: Document 22's words linked to Topic 12 (catalog, shopping, buy, Internet, checkout, cart) and Topic 5 (science, experiment, learning, ideas, practice, information)]

slide-17
SLIDE 17

Features: words, anchors, and tags

Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA

Features Models

slide-18
SLIDE 18

Feature Combination Feature Space Size Words Anchors Tags

Combining features

Words Tags Anchors

slide-19
SLIDE 19

Feature Combination Feature Space Size Words Anchors Tags Tags as Words

Combining features

Words Tags as Words Tags Anchors Tags & Anchors as Words Anchors as Words

slide-20
SLIDE 20

Feature Combination Feature Space Size Words Anchors Tags Tags as Words Tags as New Words

Combining features

Words Tags as Words Words Tags Tags Anchors Words Tags Anchors Words Anchors

slide-21
SLIDE 21

Feature Combination Feature Space Size Words Anchors Tags Tags as Words Tags as New Words

Combining features

Words Tags as Words Words Tags Tags Anchors Words Tags Anchors Words Anchors

Simple feature space modifi- cations for existing models

slide-22
SLIDE 22

Feature Combination Feature Space Size Words Anchors Tags Tags as Words Tags as New Words Words + Tags

Combining features

Words Tags as Words Words Tags Tags Anchors Words Anchors Words Tags Anchors Words Tags

slide-23
SLIDE 23

Feature Combination Feature Space Size Words Anchors Tags Tags as Words Tags as New Words Words + Tags

Combining features

Words Tags as Words Words Tags Tags Anchors Words Anchors Words Tags Anchors

- K-means: normalize feature input vectors independently
- LDA: multiple parallel sets of observations via MM-LDA
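The K-means side of this combination can be sketched directly: L2-normalize the word vector and the tag vector independently, then concatenate, so that neither channel dominates the other. (Function names here are illustrative, not from the paper.)

```python
# Independent normalization of two feature channels before concatenation.
def l2_normalize(v):
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm else list(v)

def combine_words_and_tags(word_vec, tag_vec):
    # Each channel gets unit length on its own, then the two are concatenated.
    return l2_normalize(word_vec) + l2_normalize(tag_vec)

doc = combine_words_and_tags([3.0, 4.0], [1.0, 0.0, 0.0])
print(doc)  # [0.6, 0.8, 1.0, 0.0, 0.0]
```

Normalizing the concatenated vector as a whole would instead let the longer channel (usually words) swamp the shorter one.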

slide-24
SLIDE 24

Experiments

Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA

Features Models

1. Combining words and tags in the VSM

slide-25
SLIDE 25

Experiments

Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA

2. Comparing models, at multiple levels of specificity

Features Models

slide-26
SLIDE 26

Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA

Experiments

3. Do words and tags complement or substitute for anchor text?

Features Models

slide-27
SLIDE 27

Experimental Setup

- Construct surrogate "gold standard" clustering using the Open Directory Project
- Reflects a (problematic) consensus clustering, with a known number of clusters

ODP Category   # Documents   Top Tags
Computers      5361          web css tools software programming
Health         434           parenting medicine healthcare medical
Reference      1325          education reference time research dictionary

slide-28
SLIDE 28

Experimental Setup

- Score predicted clusterings with ODP, but we are not trying to predict ODP
- Useful for relative system performance

ODP Category   # Documents   Top Tags
Computers      5361          web css tools software programming
Health         434           parenting medicine healthcare medical
Reference      1325          education reference time research dictionary

slide-29
SLIDE 29

Evaluation: Cluster F1

[Diagram: pages in clusters A, B, C with categories Reference and Health]

Intuition: balance pairwise precision (place only similar documents together) with pairwise recall (keep all similar documents together)

slide-30
SLIDE 30

Evaluation: Cluster F1

[Diagram: pages in clusters A, B, C with categories Reference and Health]

                   Same Label   Different Label
Same Cluster
Different Cluster

slide-31
SLIDE 31

Evaluation: Cluster F1

[Diagram: pages in clusters A, B, C with categories Reference and Health]

                   Same Label   Different Label
Same Cluster       5
Different Cluster

slide-32
SLIDE 32

Evaluation: Cluster F1

[Diagram: pages in clusters A, B, C with categories Reference and Health]

                   Same Label   Different Label
Same Cluster       5            3
Different Cluster

Cluster Precision: 5/8

slide-33
SLIDE 33

Evaluation: Cluster F1

[Diagram: pages in clusters A, B, C with categories Reference and Health]

                   Same Label   Different Label
Same Cluster       5            3
Different Cluster  8

Cluster Precision: 5/8
Cluster Recall: 5/13

slide-34
SLIDE 34

Evaluation: Cluster F1

[Diagram: pages in clusters A, B, C with categories Reference and Health]

                   Same Label   Different Label
Same Cluster       5            3
Different Cluster  8

Cluster Precision: 5/8
Cluster Recall: 5/13
Cluster F1: .476
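The pairwise counts behind cluster precision, recall, and F1 can be computed directly by comparing every pair of documents; a sketch (an O(n²) loop, fine at this scale; the toy labels are assumptions, not the slide's example):

```python
# Pairwise cluster precision/recall/F1: count document pairs that share a
# gold label, a predicted cluster, or both.
from itertools import combinations

def cluster_f1(gold, pred):
    same_both = same_pred_only = same_gold_only = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_pred = pred[i] == pred[j]
        if same_gold and same_pred:
            same_both += 1        # same label, same cluster
        elif same_pred:
            same_pred_only += 1   # different label, same cluster
        elif same_gold:
            same_gold_only += 1   # same label, different cluster
    p = same_both / (same_both + same_pred_only)
    r = same_both / (same_both + same_gold_only)
    return p, r, 2 * p * r / (p + r)

print(cluster_f1([0, 0, 0, 1, 1], [0, 0, 1, 1, 1]))  # (0.5, 0.5, 0.5)
```

Plugging in the slide's counts (5 pairs right together, 3 wrongly together, 8 wrongly apart) gives P = 5/8, R = 5/13, F1 ≈ .476 as shown.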

slide-35
SLIDE 35

Experiments

Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA

Features Models

1. Combining words and tags in the VSM

slide-36
SLIDE 36

Features     K-means
Words        .139
Tags         .219
Words+Tags   .225

Result: normalize words and tags independently in the Vector Space Model. Possible utility for other applications of the VSM.

slide-37
SLIDE 37

Features             K-means
Words                .139
Tags                 .219
Words+Tags           .225
Tags as Words (×1)   .158
Tags as Words (×2)   .176
Tags as New Words    .154

Result: normalize words and tags independently in the Vector Space Model. Possible utility for other applications of the VSM.

slide-38
SLIDE 38

Experiments

Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA

2. Comparing models, at multiple levels of specificity

Features Models

slide-39
SLIDE 39

Result: MM-LDA outperforms K-means on top-level ODP categories

Features     K-means   (MM-)LDA
Words        .139      .260
Tags         .219      .270
Words+Tags   .225      .307

slide-40
SLIDE 40

Tagging at multiple basic levels

People use tags to help find the same page later, often at a "natural" level of specificity.

Programming/Languages (1094 documents): Java, PHP, Python, C++, JavaScript, Perl, Lisp, Ruby, C

Society/Social Sciences (1590 documents): Issues, Religion & Spirituality, People, Politics, History, Law, Philosophy

slide-41
SLIDE 41

Tagging at multiple basic levels

People use tags to help find the same page later, often at a "natural" level of specificity.

Programming/Languages (1094 documents): Java, PHP, Python, C++, JavaScript, Perl, Lisp, Ruby, C

Society/Social Sciences (1590 documents): Issues, Religion & Spirituality, People, Politics, History, Law, Philosophy

"java" applies to 73% of Programming/Java pages, but "software" applies to only 21% of Top/Computer pages.

slide-42
SLIDE 42

Result: Sometimes, tags tell you more about cluster membership than words do

- Tags are very discriminating in subcategories
- K-means wins when the feature space is cleaner

                        Features     K-means   (MM-)LDA
Programming Languages   Words        .189      .288
                        Tags         .567      .463
                        Words+Tags   .556      .297
Social Sciences         Words        .196      .300
                        Tags         .307      .310
                        Words+Tags   .308      .302

slide-43
SLIDE 43

Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA

Experiments

3. Do words and tags complement or substitute for anchor text?

Features Models

slide-44
SLIDE 44

Result: Tags complement anchor text

Anchors can depress performance, but adding tags brings results back to within a small delta of Words+Tags.

Features             K-means   (MM-)LDA
Words                .139      .260
Words+Anchors        .128      .248
Words+Anchors+Tags   .224      .306

slide-45
SLIDE 45

Conclusions

- Tags add real value when high-level semantic information is needed
- Tags act differently than words and anchor text
- At the right level of specificity, tags describe pages better than anything else
- Treat tags and words as separate information channels to maximize utility

Thanks! Questions?

slide-46
SLIDE 46
slide-47
SLIDE 47

Backup material

slide-48
SLIDE 48

Result: Tags complement anchor text

Anchor text acts as annotations from another web author. It is noisier than words and tags, but can be usefully integrated into a joint model.

slide-49
SLIDE 49

Future directions

- More targeted graphical models: individual users with individual vocabularies; time series
- Direct evaluation in retrieval / browsing
- More types of annotated documents: product reviews, academic papers, blog posts

slide-50
SLIDE 50

Content age: ODP versus del.icio.us

57% of Tag Crawl data initially indexed by Google

slide-51
SLIDE 51

Clustering (flat, parametric)

- Input: number of clusters K; set of documents: <words, tags, anchors>
- Output: assignment of documents to clusters
- Evaluation: comparison to a gold standard

slide-52
SLIDE 52

Outline

- The tagged web
- Dataset and methodology
- Clustering with tags and words: K-Means in tag-augmented vector space; Multi-Multinomial LDA
- Experiments
- Discussion

slide-53
SLIDE 53

Outline

- The tagged web
- Dataset and methodology
- Clustering with tags and words: K-Means in tag-augmented vector space; Multi-Multinomial LDA
- Experiments
- Discussion

slide-54
SLIDE 54

Outline

- The tagged web
- Dataset and methodology
- Clustering with tags and words: K-Means in Tag-Augmented Vector Space; Multi-Multinomial LDA
- Experiments
- Discussion

slide-55
SLIDE 55

Automatic cluster evaluation

- Pick a slice of ODP with k subtrees
- Cluster relevant documents into k sets
- Compare inferred assignments to ODP labeling

slide-56
SLIDE 56

Automatic cluster evaluation

- Pick a slice of ODP with k subtrees
- Cluster relevant documents into k sets
- Compare inferred assignments to ODP labeling

Advantages: scalable, automatic, reflects "consensus" clustering
Drawbacks: may not translate to performance gains in the task; does not address choosing the best k

slide-57
SLIDE 57

F-measure of cluster quality

Count pairs of examples:

                   Same ODP Category   Different ODP Category
Same cluster       A                   B
Different cluster  C                   D

P = A / (A + B)
R = A / (A + C)
F1 = 2PR / (P + R)

slide-58
SLIDE 58

A tagged document

Tags curriculum education(2) homeschool imported learning science(4) shopping slinky teachers teaching tools

ODP Label: Top/Reference

Top/Reference/Education/K_through_12/Home_Schooling/Curriculum/Science

slide-59
SLIDE 59

MM-LDA implementation

- Collapsed Gibbs sampler with hard assignments: repeatedly samples a new z for each word; usually converges within several dozen passes; could be parallelized
- Runtime: 22 min (MM-LDA) versus 6 min (K-means) on 2000 documents
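The sampler described above can be sketched roughly as follows: words and tags are separate channels with their own topic-term counts, but both channels contribute to a shared per-document topic count. The hyperparameter values, function names, and toy corpus are assumptions for illustration, not the paper's settings.

```python
# Rough collapsed-Gibbs sketch for MM-LDA: each token in each channel gets a
# topic assignment; channels share the per-document topic counts.
import random

def mm_lda_gibbs(docs, num_topics, vocab_sizes, iters=50, alpha=0.1, beta=0.01, seed=0):
    # docs: list of documents; each document is a list of channels
    # (e.g. [words, tags]); each channel is a list of integer term ids.
    rng = random.Random(seed)
    n_channels = len(vocab_sizes)
    doc_topic = [[0] * num_topics for _ in docs]                       # shared across channels
    topic_term = [[[0] * vocab_sizes[c] for _ in range(num_topics)]
                  for c in range(n_channels)]                          # per channel
    topic_total = [[0] * num_topics for _ in range(n_channels)]
    z = []  # topic assignment per token, mirroring docs' structure
    for d, doc in enumerate(docs):
        z_doc = []
        for c, channel in enumerate(doc):
            z_chan = []
            for term in channel:
                t = rng.randrange(num_topics)
                z_chan.append(t)
                doc_topic[d][t] += 1
                topic_term[c][t][term] += 1
                topic_total[c][t] += 1
            z_doc.append(z_chan)
        z.append(z_doc)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for c, channel in enumerate(doc):
                for i, term in enumerate(channel):
                    t = z[d][c][i]
                    # Remove this token's counts, then resample its topic.
                    doc_topic[d][t] -= 1
                    topic_term[c][t][term] -= 1
                    topic_total[c][t] -= 1
                    weights = [
                        (doc_topic[d][k] + alpha)
                        * (topic_term[c][k][term] + beta)
                        / (topic_total[c][k] + beta * vocab_sizes[c])
                        for k in range(num_topics)
                    ]
                    t = rng.choices(range(num_topics), weights=weights)[0]
                    z[d][c][i] = t
                    doc_topic[d][t] += 1
                    topic_term[c][t][term] += 1
                    topic_total[c][t] += 1
    return doc_topic

# Two toy documents, each with a word channel and a tag channel.
docs = [[[0, 0, 1], [0]], [[2, 3, 3], [1]]]
doc_topic = mm_lda_gibbs(docs, num_topics=2, vocab_sizes=[4, 2])
print(doc_topic)
```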

slide-60
SLIDE 60

K-means generated clusters

slide-61
SLIDE 61

MM-LDA generated clusters

slide-62
SLIDE 62

K-means term weighting

slide-63
SLIDE 63

Impact

- Social bookmarking is big and getting bigger
- Tags hold the promise of a specific, relevant indexing vocabulary for the web: not quite full-text indexing, not quite controlled pre-coordinate indexing
- Tagging data improves web clustering performance, which promises better IR

How else will tagging impact IR?

slide-64
SLIDE 64

Scatter/Gather [Cutting et al 1992]

slide-65
SLIDE 65

Stanford tag crawl dataset

Heymann et al., 2008

slide-66
SLIDE 66

Stanford tag crawl dataset

slide-67
SLIDE 67

K-means [CS276]

- Assumes documents are real-valued vectors
- Clusters based on the centroid (aka the center of gravity or mean) of the points in a cluster c:

  μ(c) = (1 / |c|) Σ_{x ∈ c} x

- Reassignment of instances to clusters is based on distances to the current cluster centroids

slide-68
SLIDE 68

K-means example (K=2) [CS276]

Pick seeds → Reassign clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged!

slide-69
SLIDE 69

MM-LDA outperforms K-means

On top-level ODP categories

slide-70
SLIDE 70

Latent Dirichlet Allocation (LDA)

D – number of documents
N – number of words in a document
α – symmetric Dirichlet prior
θ – per-document topic multinomial
z_w – per-word topic assignment
w – word observation
β – per-topic word multinomial
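The generative story these variables describe can be written as a short runnable sketch (the toy topics and parameter values are assumptions): draw θ ~ Dirichlet(α) per document, then for each word draw a topic z ~ Multinomial(θ) and a word w ~ Multinomial(β[z]).

```python
# Generative sketch of LDA for a single document.
import random

def generate_doc(n_words, beta, alpha=0.5, seed=0):
    rng = random.Random(seed)
    num_topics = len(beta)
    # theta ~ Dirichlet(alpha): normalize independent Gamma(alpha, 1) draws.
    theta = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(theta)
    theta = [t / total for t in theta]
    words = []
    for _ in range(n_words):
        z = rng.choices(range(num_topics), weights=theta)[0]   # topic for this word
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]  # word from topic z
        words.append(w)
    return words

# Two toy topics over a 3-word vocabulary.
beta = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]
doc = generate_doc(10, beta)
print(doc)
```

Inference (e.g. Gibbs sampling) runs this story in reverse: given only the words, recover plausible θ and β.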

slide-71
SLIDE 71

MM-LDA Properties

- Natural extension of LDA
- Jointly models multiple types of observations; similar to Blei et al.'s GM-LDA for images with captions
- Words and tags counted independently, contribute jointly to the document topic model

slide-72
SLIDE 72
slide-73
SLIDE 73

A web document collection

Stanford Tag Crawl Dataset: One month of del.icio.us posts in May/June 2007

slide-74
SLIDE 74

Most web pages come with words

Words: welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals ...

slide-75
SLIDE 75

Words can be used to cluster

slide-76
SLIDE 76

Text surrounds links from other pages

Words: welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals ... Anchor Text: tools home science links click buy supplies experiments ...

slide-77
SLIDE 77

Social bookmarking websites add tags

Tags: curriculum education homeschool imported learning science shopping slinky teachers teaching tools Words: welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals ... Anchor Text: tools home science links click buy supplies experiments ...

slide-78
SLIDE 78

How do we use words, anchor text, and tags together to most improve clustering?

slide-79
SLIDE 79

How do we test if clustering improves?

slide-80
SLIDE 80

Many pages have a “gold standard” label

Tags: curriculum education homeschool imported learning science shopping slinky teachers teaching tools Words: welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals ... Anchor Text: tools home science links click buy supplies experiments ...

Reference/Education

slide-81
SLIDE 81

Open Directory Project

slide-82
SLIDE 82

Cluster evaluation

Reference Arts

slide-83
SLIDE 83

Cluster evaluation

[Diagram: clustered pages with ODP categories Reference and Arts]

Precision: 4 / 7 pairs; Recall: 4 / 12 pairs

slide-84
SLIDE 84

Cluster evaluation

[Diagram: clustered pages with ODP categories Reference and Arts]

Precision: 9 / 9 pairs; Recall: 9 / 12 pairs

slide-85
SLIDE 85

Outline

- The tagged web
- Dataset and methodology
- Algorithms for clustering with tags and words: K-Means in tag-augmented vector space; Multi-Multinomial LDA
- Results: tag and word normalization; clustering at varying levels of specificity; incorporating anchor text

slide-86
SLIDE 86

Multi-Multinomial LDA (MM-LDA)

Extends Latent Dirichlet Allocation: Words and tags (and anchors, etc.) are counted independently, contribute jointly to topic probabilities

slide-87
SLIDE 87

Multi-Multinomial LDA (MM-LDA)

Words Tags

Extends Latent Dirichlet Allocation: Words and tags (and anchors, etc.) are counted independently, contribute jointly to topic probabilities

slide-88
SLIDE 88

Multi-Multinomial LDA (MM-LDA)

Words Tags

[Plate diagram labels: which topic generates each observation; word/tag observation, from 1 to N; per-topic word/tag distribution; per-document topic distribution]

slide-89
SLIDE 89

The tagged web (Heymann, et al., WSDM 2008)

slide-90
SLIDE 90

Evaluation: Cluster F1

[Diagram: pages in clusters A, B, C with categories Reference and Arts]

                   Same Label   Different Label
Same Cluster       5            3
Different Cluster  8            12

Cluster Precision: 5/8
Cluster Recall: 5/13

slide-91
SLIDE 91

Goal: clustering for information retrieval

- Better user interfaces (e.g. Clusty, Vivisimo, Scatter/Gather, and friends)
- Collection clustering (e.g. Columbia Newsblaster, Google News)
- Improved language models for better retrieval (e.g. Liu and Croft 2004; Wei and Croft 2006)
- Better cluster-based retrieval (e.g. Salton 1971)

slide-92
SLIDE 92

Stanford tag crawl / ODP intersection

slide-93
SLIDE 93

K-means feature vectors

Words Tags as Words Words Tags Words Tags Tags

Words Tags Tags as Weighted Words Tags as New Words Tags+Words Strategy Feature Space Size

slide-94
SLIDE 94

Experiments

Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA

2. Extending LDA for multiple feature types

Features Models

slide-95
SLIDE 95

Result: Sometimes, tags tell you more about cluster membership than words do

- Tags are very discriminating in subcategories: "java" applies to 73% of Programming/Java pages, but "software" applies to only 21% of Computer pages
- K-means wins when the feature space is cleaner

                        Features     K-means   (MM-)LDA
All                     Words        .139      .260
                        Tags         .219      .270
                        Words+Tags   .225      .307
Programming Languages   Words        .189      .288
                        Tags         .567      .463
                        Words+Tags   .556      .297
Social Sciences         Words        .196      .300
                        Tags         .307      .310
                        Words+Tags   .308      .302