SLIDE 1 Clustering the Tagged Web
Daniel Ramage, Paul Heymann, Christopher D. Manning, Hector Garcia-Molina
Stanford University WSDM 2009
Images from del.icio.us, lbaumann.com, www.hometrainingtools.com
SLIDE 2
Web document text
SLIDE 3
Web document text
Words: information about catalog pricing changes in 2008 welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals
SLIDE 4
Web document text
Words: information about catalog pricing changes in 2008 welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals Anchor Text: home science tools hometrainingtools.com links click follow supplies training experiments other pages
SLIDE 5
Web document text
Words: information about catalog pricing changes in 2008 welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals Anchor Text: home science tools hometrainingtools.com links click follow supplies training experiments other pages Tags: science homeschool education shopping curriculum homeschooling experiments tools chemistry supplies
SLIDE 6
Why tags? – del.icio.us
SLIDE 7
Why tags? – del.icio.us
≈120,000 posts/day
12–75 million (≈10^7–10^8) unique URLs (versus 10^9–10^11 total URLs)
Disproportionately the web’s most useful URLs (and those URLs have many tags)
SLIDE 8 Using tags to understand the web
The web is large and growing: anything that helps
us understand high level structure is useful
Tags encode semantically meaningful labels Tags cover much of the web’s best content How can we use tags to provide high-level insight?
SLIDE 9 Web page clustering task
Given a collection of web
pages
SLIDE 10 Web page clustering task
A A A B B A C C
Given a collection of web
pages
Assign each page to a
cluster, maximizing similarity within clusters
SLIDE 11 Web page clustering task
A A A B B A C C
Given a collection of web
pages
Assign each page to a
cluster, maximizing similarity within clusters
Applications: improved
user interfaces, collection clustering, search result diversity, language-model based retrieval
SLIDE 12
Structure of this talk
Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
Features Models
SLIDE 13
Models: K-means and MM-LDA
Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
Features Models
SLIDE 14 K-means assumes the
standard Vector Space Model: documents are Euclidean normalized real-valued vectors
Algorithm: iteratively
Re-assign documents to closest cluster centroid Update cluster centroids from document assignments
Model 1: K-means clustering
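The iterate-reassign-update loop described above can be sketched as follows (a minimal NumPy illustration, not the paper's implementation; all names and parameters are made up):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means over real-valued document vectors X (n_docs x n_dims)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: re-assign each document to the closest cluster centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 2: update each centroid from its document assignments.
        new = np.array([X[assign == j].mean(axis=0) if (assign == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break  # assignments stable: converged
        centroids = new
    return assign, centroids
```

With two well-separated groups of points, the loop converges in a handful of iterations regardless of which seeds are picked.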
SLIDE 15 LDA assumes each
document’s words generated by some topic’s word distribution
Model 2: Latent Dirichlet Allocation
Document 22
Words: information about catalog pricing changes 2008 welcome looking hands-on science ideas try kitchen
Topic 12: catalog shopping buy Internet checkout cart
Topic 5: science experiment learning ideas practice information …
SLIDE 16 LDA assumes each
document’s words generated by some topic’s word distribution
Paired with an inference
mechanism (Gibbs sampling), learns per-document topic distributions and per-topic distributions over words
Model 2: Latent Dirichlet Allocation
Document 22
Words: information about catalog pricing changes 2008 welcome looking hands-on science ideas try kitchen
Topic 12: catalog shopping buy Internet checkout cart
Topic 5: science experiment learning ideas practice information …
SLIDE 17
Features: words, anchors, and tags
Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
Features Models
SLIDE 18 Feature Combination Feature Space Size Words Anchors Tags
Combining features
Words Tags Anchors
SLIDE 19 Feature Combination Feature Space Size Words Anchors Tags Tags as Words
Combining features
Words Tags as Words Tags Anchors Tags & Anchors as Words Anchors as Words
SLIDE 20 Feature Combination Feature Space Size Words Anchors Tags Tags as Words Tags as New Words
Combining features
Words Tags as Words Words Tags Tags Anchors Words Tags Anchors Words Anchors
SLIDE 21 Feature Combination Feature Space Size Words Anchors Tags Tags as Words Tags as New Words
Combining features
Words Tags as Words Words Tags Tags Anchors Words Tags Anchors Words Anchors
Simple feature space modifications for existing models
SLIDE 22 Feature Combination Feature Space Size Words Anchors Tags Tags as Words Tags as New Words Words + Tags
Combining features
Words Tags as Words Words Tags Tags Anchors Words Anchors Words Tags Anchors Words Tags
SLIDE 23 Feature Combination Feature Space Size Words Anchors Tags Tags as Words Tags as New Words Words + Tags
Combining features
Words Tags as Words Words Tags Tags Anchors Words Anchors Words Tags Anchors
K-means: normalize feature input vectors (Words, Tags) independently
LDA: multiple parallel sets of observations (Words, Tags)
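The K-means side of this combination — normalizing each feature channel to unit length before concatenating — might look like this (a hypothetical sketch; the vector contents are made up):

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit Euclidean length (leave zero vectors alone)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def words_plus_tags(word_vec, tag_vec):
    """Normalize each channel independently, then concatenate, so that
    words and tags contribute equally regardless of their raw counts."""
    return np.concatenate([l2_normalize(word_vec), l2_normalize(tag_vec)])
```

For example, `words_plus_tags(np.array([3., 4.]), np.array([0., 2.]))` yields `[0.6, 0.8, 0., 1.]`: each channel carries unit weight on its own.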
SLIDE 24 Experiments
Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
Features Models
words and tags in the VSM
SLIDE 25 Experiments
Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
models, at multiple levels of specificity
Features Models
SLIDE 26 Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
Experiments
complement or substitute for anchor text?
Features Models
SLIDE 27 Experimental Setup
Construct surrogate “gold standard” clustering
using Open Directory Project
Reflects a (problematic) consensus clustering, with
known number of clusters
ODP Category   # Documents   Top Tags
Computers      5361          web css tools software programming
Health         434           parenting medicine healthcare medical
Reference      1325          education reference time research dictionary
SLIDE 28 Experimental Setup
Score predicted clusterings with ODP, but not
trying to predict ODP
Useful for relative system performance
ODP Category   # Documents   Top Tags
Computers      5361          web css tools software programming
Health         434           parenting medicine healthcare medical
Reference      1325          education reference time research dictionary
SLIDE 29
Evaluation: Cluster F1
A A A B B A C C Reference Health
Intuition: balance pairwise precision (place only similar documents together) with pairwise recall (keep all similar documents together)
SLIDE 30
Evaluation: Cluster F1
A A A B B A C C Reference Health
                    Same Label   Different Label
Same Cluster
Different Cluster
SLIDE 31
Evaluation: Cluster F1
A A A B B A C C Reference Health

                    Same Label   Different Label
Same Cluster            5
Different Cluster
SLIDE 32
Evaluation: Cluster F1
A A A B B A C C Reference Health

                    Same Label   Different Label
Same Cluster            5              3
Different Cluster

Cluster Precision: 5/8
SLIDE 33
Evaluation: Cluster F1
A A A B B A C C Reference Health

                    Same Label   Different Label
Same Cluster            5              3
Different Cluster       8

Cluster Precision: 5/8 Cluster Recall: 5/13
SLIDE 34
Evaluation: Cluster F1
A A A B B A C C Reference Health

                    Same Label   Different Label
Same Cluster            5              3
Different Cluster       8

Cluster Precision: 5/8 Cluster Recall: 5/13 Cluster F1: .476
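The pairwise precision/recall computation behind Cluster F1 can be sketched as follows (a minimal illustration; the example labels below are made up, not the slide's eight documents):

```python
from itertools import combinations

def cluster_f1(labels, clusters):
    """Pairwise cluster precision, recall, and F1 against gold labels."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        if same_cluster and same_label:
            tp += 1   # similar documents correctly placed together
        elif same_cluster and not same_label:
            fp += 1   # dissimilar documents wrongly placed together
        elif same_label and not same_cluster:
            fn += 1   # similar documents wrongly separated
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For instance, `cluster_f1([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])` gives precision 0.5, recall 0.5, F1 0.5: of the 4 same-cluster pairs, 2 share a label, and of the 4 same-label pairs, 2 share a cluster.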
SLIDE 35 Experiments
Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
Features Models
words and tags in the VSM
SLIDE 36
Features      K-means
Words         .139
Tags          .219
Words+Tags    .225
Result: normalize words and tags independently in the Vector Space Model
Possible utility for other applications of the VSM
Words Words Tags Tags
SLIDE 37
Features             K-means
Words                .139
Tags                 .219
Words+Tags           .225
Tags as Words (×1)   .158
Tags as Words (×2)   .176
Tags as New Words    .154
Result: normalize words and tags independently in the Vector Space Model
Possible utility for other applications of the VSM
Words Words Tags Tags Words Tags Tags as Words Tags as Words
SLIDE 38 Experiments
Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
models, at multiple levels of specificity
Features Models
SLIDE 39 Result: MM-LDA outperforms K-means on top-level ODP categories
Features      K-means   (MM-)LDA
Words         .139      .260
Tags          .219      .270
Words+Tags    .225      .307
Words Words Tags Tags
SLIDE 40 Tagging at multiple basic levels
People use tags to help find the same page later, often at a “natural” level of specificity
Programming/Languages (1094 documents)
Java PHP Python C++ JavaScript Perl Lisp Ruby C
Society/Social Sciences (1590 documents)
Issues, Religion & Spirituality, People, Politics, History, Law, Philosophy
SLIDE 41 Tagging at multiple basic levels
People use tags to help find the same page later, often at a “natural” level of specificity
Programming/Languages (1094 documents)
Java PHP Python C++ JavaScript Perl Lisp Ruby C
Society/Social Sciences (1590 documents)
Issues, Religion & Spirituality, People, Politics, History, Law, Philosophy
“java” applies to 73% of Programming/Java pages, but “software” applies to only 21% of Top/Computer pages
SLIDE 42 Result: Sometimes, tags tell you more about cluster membership than words do
Tags are very discriminating in subcategories K-means wins when the feature space is cleaner
Features                K-means   (MM-)LDA
Programming Languages
  Words                 .189      .288
  Tags                  .567      .463
  Words+Tags            .556      .297
Social Sciences
  Words                 .196      .300
  Tags                  .307      .310
  Words+Tags            .308      .302
SLIDE 43 Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
Experiments
complement or substitute for anchor text?
Features Models
SLIDE 44
Result: Tags complement anchor text
Anchors alone can depress performance, but adding tags brings results to within a small delta of Words+Tags.
Features             K-means   (MM-)LDA
Words                .139      .260
Words+Anchors        .128      .248
Words+Anchors+Tags   .224      .306
SLIDE 45 Conclusions
Tags add real value when high-level semantic
information is needed
Tags act differently than words, anchor text At the right level of specificity, tags describe pages
better than anything else
Treat tags and words as separate information
channels to maximize utility Thanks! Questions?
SLIDE 46
SLIDE 47
Backup material
SLIDE 48 Result: Tags complement anchor text
Anchor text acts as annotations from another web author. Noisier than words and tags, but can be usefully integrated into a joint model.
SLIDE 49 Future directions
More targeted graphical models
Individual users with individual vocabularies Time series
Direct evaluation in retrieval / browsing More types of annotated documents
Product reviews; academic papers; blog posts
SLIDE 50
Content age: ODP versus del.icio.us
57% of Tag Crawl data initially indexed by Google
SLIDE 51 Clustering (flat, parametric)
Input
Number of clusters K Set of documents: <words,tags,anchors>
Output
Assignment of documents to clusters
Evaluation
Comparison to a gold standard
SLIDE 52 Outline
The tagged web Dataset and methodology Clustering with tags and words
K-Means in tag-augmented vector space Multi-Multinomial LDA
Experiments Discussion
SLIDE 53 Outline
The tagged web Dataset and methodology Clustering with tags and words
K-Means in tag-augmented vector space Multi-Multinomial LDA
Experiments Discussion
SLIDE 54 Outline
The tagged web Dataset and methodology Clustering with tags and words
K-Means in Tag-Augmented Vector Space Multi-Multinomial LDA
Experiments Discussion
SLIDE 55 Automatic cluster evaluation
Pick a slice of ODP with k subtrees Cluster relevant documents into k sets Compare inferred assignments to ODP labeling
SLIDE 56 Automatic cluster evaluation
Pick a slice of ODP with k subtrees Cluster relevant documents into k sets Compare inferred assignments to ODP labeling
Advantages
Scalable, automatic, reflects “consensus” clustering
Drawbacks
May not translate to performance gains in task Does not address choosing best k
SLIDE 57
F-measure of cluster quality
                    Same ODP Category   Different ODP Category
Same cluster               A                    B
Different cluster          C                    D

(cell values count pairs of examples)

P = A / (A + B)
R = A / (A + C)
F1 = 2PR / (P + R)
SLIDE 58 A tagged document
Tags curriculum education(2) homeschool imported learning science(4) shopping slinky teachers teaching tools
ODP Label: Top/Reference
Top/Reference/Education/K_through_12/Home_Schooling/Curriculum/Science
SLIDE 59 MM-LDA implementation
Collapsed Gibbs-sampler with hard assignments
Repeatedly samples a new z for each word
Usually converges within several dozen passes
Could be parallelized
Runtime:
22 min (MM-LDA) versus 6 min (K-means) on 2000 documents
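A collapsed Gibbs sampler for plain LDA, in the spirit of the implementation described above, might look like this (a simplified sketch; hyperparameters, sizes, and names are made up, and the paper's version uses hard assignments):

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, n_iters=20, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: lists of word ids."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # doc-topic counts
    n_kw = np.zeros((n_topics, vocab_size))  # topic-word counts
    n_k = np.zeros(n_topics)                 # topic totals
    z = [rng.integers(n_topics, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):           # initialize counts randomly
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(n_iters):                 # repeatedly resample z per word
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                  # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # new topic ∝ (doc-topic count) * (topic-word count), smoothed
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k                  # add the new assignment back
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    theta = n_dk + alpha                     # per-document topic distributions
    return theta / theta.sum(axis=1, keepdims=True)
```

Each pass touches every word once, which is why runtime scales with collection size and topic count.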
SLIDE 60
K-means generated clusters
SLIDE 61
MM-LDA generated clusters
SLIDE 62
K-means term weighting
SLIDE 63 Impact
Social bookmarking is big and getting bigger Tags hold promise of specific, relevant indexing
vocabulary for the web
Not quite full-text indexing Not quite controlled pre-coordinate indexing
Tagging data improves web clustering
performance, which promises better IR
How else will tagging impact IR?
SLIDE 64
Scatter/Gather [Cutting et al 1992]
SLIDE 65 Stanford tag crawl dataset
Heymann et al., 2008
SLIDE 66
Stanford tag crawl dataset
SLIDE 67 K-means [CS276]
Assumes documents are real-valued vectors
Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster c:

μ_c = (1/|c|) Σ_{x∈c} x

Reassignment of instances to clusters is based on distances to the current cluster centroids
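The centroid definition can be checked numerically (a toy example with made-up points):

```python
import numpy as np

# Hypothetical cluster c of three 2-D document vectors; the centroid
# mu_c is their coordinate-wise mean.
c = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
mu_c = c.mean(axis=0)
print(mu_c)  # [1. 1.]
```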
SLIDE 68 K-means example (K=2) [CS276]
Pick seeds Reassign clusters Compute centroids
x x
Reassign clusters
x x x x
Compute centroids
Reassign clusters Converged!
SLIDE 69
MM-LDA outperforms K-means
On top-level ODP categories
SLIDE 70
Latent Dirichlet Allocation (LDA)
D – number of documents
N – number of words in a document
α – symmetric Dirichlet prior
θ – per-document topic multinomial
z_w – per-word topic assignment
w – word observation
β – per-topic word multinomial
SLIDE 71 MM-LDA Properties
Natural extension of LDA Jointly models multiple types of observations
Similar to Blei et al.'s GM-LDA for images with captions
Words and tags counted independently, contribute
jointly to document topic model
SLIDE 72
SLIDE 73
A web document collection
Stanford Tag Crawl Dataset: One month of del.icio.us posts in May/June 2007
SLIDE 74
Most web pages come with words
Words: welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals ...
SLIDE 75
Words can be used to cluster
SLIDE 76
Text surrounds links from other pages
Words: welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals ... Anchor Text: tools home science links click buy supplies experiments ...
SLIDE 77
Social bookmarking websites add tags
Tags: curriculum education homeschool imported learning science shopping slinky teachers teaching tools Words: welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals ... Anchor Text: tools home science links click buy supplies experiments ...
SLIDE 78
How do we use words, anchor text, and tags together to most improve clustering?
SLIDE 79
How do we test if clustering improves?
SLIDE 80
Many pages have a “gold standard” label
Tags: curriculum education homeschool imported learning science shopping slinky teachers teaching tools Words: welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals ... Anchor Text: tools home science links click buy supplies experiments ...
Reference/Education
SLIDE 81
Open Directory Project
SLIDE 82
Cluster evaluation
Reference Arts
SLIDE 83
Cluster evaluation
Reference Arts Precision: 4/7 pairs Recall: 4/12 pairs
SLIDE 84
Cluster evaluation
Reference Arts Precision: 9 / 9 pairs Recall: 9 / 12 pairs
SLIDE 85 Outline
The tagged web Dataset and methodology Algorithms for clustering with tags and words
K-Means in tag-augmented vector space Multi-Multinomial LDA
Results
Tag and word normalization Clustering at varying levels of specificity Incorporating anchor text
SLIDE 86 Multi-Multinomial LDA (MM-LDA)
Extends Latent Dirichlet Allocation: Words and tags (and anchors, etc.) are counted independently, contribute jointly to topic probabilities
SLIDE 87 Multi-Multinomial LDA (MM-LDA)
Words Tags
Extends Latent Dirichlet Allocation: Words and tags (and anchors, etc.) are counted independently, contribute jointly to topic probabilities
SLIDE 88 Multi-Multinomial LDA (MM-LDA)
Words Tags
Which topic generates each Word/Tag (from 1 to N)
Per-topic Word/Tag distribution
Per-document topic distribution
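The shared per-document topic distribution can be illustrated with a toy generative sketch (all sizes, names, and hyperparameters below are made up; this is not the paper's code):

```python
import numpy as np

def mm_lda_generate(n_docs=3, n_topics=2, n_words=6, n_tags=4,
                    doc_len=10, tags_per_doc=3, alpha=0.5, seed=0):
    """Generative sketch of MM-LDA: words and tags are drawn from separate
    per-topic multinomials, but share one per-document topic distribution."""
    rng = np.random.default_rng(seed)
    phi_words = rng.dirichlet(np.full(n_words, 0.1), size=n_topics)  # per-topic word dists
    phi_tags = rng.dirichlet(np.full(n_tags, 0.1), size=n_topics)    # per-topic tag dists
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(np.full(n_topics, alpha))  # shared by both channels
        z_w = rng.choice(n_topics, size=doc_len, p=theta)       # topic per word
        words = [rng.choice(n_words, p=phi_words[k]) for k in z_w]
        z_t = rng.choice(n_topics, size=tags_per_doc, p=theta)  # topic per tag
        tags = [rng.choice(n_tags, p=phi_tags[k]) for k in z_t]
        docs.append((words, tags))
    return docs
```

The key design point is that `theta` is sampled once per document and reused for both channels, so words and tags contribute jointly to the inferred topic mixture.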
SLIDE 89
The tagged web (Heymann et al., WSDM 2008)
SLIDE 90
Evaluation: Cluster F1
A A A B B A C C Reference Arts

                    Same Label   Different Label
Same Cluster            5              3
Different Cluster       8             12

Cluster Precision: 5/8 Cluster Recall: 5/13
SLIDE 91 Goal: clustering for information retrieval
Better user interfaces
e.g. Clusty, Vivisimo, Scatter/Gather, and friends
Collection clustering
e.g. Columbia Newsblaster, Google News
Improved language models for better retrieval
e.g. Liu and Croft 2004; Wei and Croft 2006
Better cluster based-retrieval
e.g. Salton 1971
SLIDE 92
Stanford tag crawl / ODP intersection
SLIDE 93 K-means feature vectors
Strategy                 Feature Space
Tags as Weighted Words   Tags folded into the word vocabulary (with weighting)
Tags as New Words        Words plus Tags as a disjoint vocabulary
Tags+Words               Words and Tags as independently normalized vectors
SLIDE 94 Experiments
Words Tags Anchors Vector Space Model: K-means Generative Model: MM-LDA
for multiple feature types
Features Models
SLIDE 95 Result: Sometimes, tags tell you more about cluster membership than words do
Tags are very discriminating in subcategories
“java” applies to 73% of Programming/Java pages, but “software” applies to only 21% of Computer pages
K-means wins when the feature space is cleaner
Features                K-means   (MM-)LDA
All
  Words                 .139      .260
  Tags                  .219      .270
  Words+Tags            .225      .307
Programming Languages
  Words                 .189      .288
  Tags                  .567      .463
  Words+Tags            .556      .297
Social Sciences
  Words                 .196      .300
  Tags                  .307      .310
  Words+Tags            .308      .302