Evaluating Strategies for Similarity Search on the Web


Taher H. Haveliwala, Aristides Gionis, Dan Klein, Piotr Indyk. Similarity search: given a query Web page q, return Web pages that are similar to q.


  1. Evaluating Strategies for Similarity Search on the Web
     Taher H. Haveliwala, Aristides Gionis, Dan Klein, Piotr Indyk
     {taherh,gionis,klein}@cs.stanford.edu, indyk@theory.lcs.mit.edu

     Similarity search
     - Given a query Web page q, return Web pages that are "similar" to q.
     - Example pages shown: www.moneycentral.com, www.pathfinder.com/money, www.moneyworld.co.uk, www.money.com, www.etrade.com, www.moneyclub.com

     Similarity search
     - Two major issues:
       - Choose the strategy that best captures the notion of Web-page "similarity".
       - Scale up the chosen strategy to a repository of millions of pages.

     Related work
     - Finding Related Pages in the WWW [Dean, Henzinger WWW8 '99]
     - Automatic Resource Compilation ... [Chakrabarti et al. WWW7 '98]
     - Commercial search engines

  2. Model for document similarity
     - Represent each Web page as a bag of terms (content, anchor text, links, ...).
     - The similarity of two pages is given by the similarity of their respective bags (cosine, Jaccard).
     - The strategy for (page → bag) is the crucial step in the quality of sim().

     Model for document similarity
     - For pages a and b, with respective bags α and β, define
       $\mathrm{sim}(a, b) = \frac{|\alpha \cap \beta|}{|\alpha \cup \beta|}$

     Similarity search system
     - [Diagram. Preprocessing: Web → page representations (page → representation) → indexing → Sim Index. Query time: query processing against the index. A second version of the diagram labels the page → representation step "using strategy θ".]
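
The Jaccard definition on the slide above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes bags are multisets of terms, that the intersection takes the minimum count per term, and that the union takes the maximum. The function name `jaccard` and the sample bags are hypothetical.

```python
from collections import Counter


def jaccard(bag_a: Counter, bag_b: Counter) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| for two bags (multisets) of terms.

    Assumption: multiset intersection = per-term minimum count,
    multiset union = per-term maximum count.
    """
    intersection = sum((bag_a & bag_b).values())
    union = sum((bag_a | bag_b).values())
    return intersection / union if union else 0.0


# Hypothetical bags for two pages.
page_a = Counter({"music": 2, "great": 2, "click": 1})
page_b = Counter({"music": 1, "great": 1, "sports": 1})
print(jaccard(page_a, page_b))
```

Identical bags score 1.0 and bags with no shared terms score 0.0, which matches the intuition behind sim().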

  3. Similarity search system
     - [Diagram repeated: preprocessing (Web → page representations → indexing → Sim Index) and query-time processing.]

     Possible term choices
     - Example pages:
       - http://www.foobar.com/: "...click here for a great music page..." and "...click here for great sports page..."
       - http://www.music.com/: "MusicWorld", "Welcome", "Enter our site"
       - http://www.baz.com/: "...what I had for lunch..." and "...this music is great..."

     Content
     - Bag for www.music.com: music 1, world 1, welcome 1

     Links
     - Bag for www.music.com: www.foobar.com 1, www.baz.com 1

  4. Anchor windows
     - Same example pages (http://www.foobar.com/, http://www.music.com/, http://www.baz.com/).
     - Bag for www.music.com: music 2, great 2, click 1, ...

     Parameter space for bag generation
     - Space of parameters considered:
       - content vs. links vs. anchor windows
       - anchor window length
       - term weighting schemes
     - The choice of a particular assignment of parameters, θ, defines a similarity search strategy.

     (Strategy, query) → similarity ordering
     - Inputs:
       - θ ∈ Θ: strategy (i.e., parameter setting)
       - q ∈ Web: query page
     - Outputs:
       - τ: list of Web pages ordered by similarity to q using strategy θ
     - $\tau = T(\theta, q)$

     Similarity search system
     - [Diagram repeated, with the page → representation step labeled "using strategy θ".]
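
To make the link and anchor-window bags concrete, here is a rough Python sketch. Everything about the data layout is an assumption for illustration: the `Link` record, the token lists, and the anchor positions are hypothetical, and the slides do not say exactly which words form each anchor. The window counts `window` tokens on each side of the anchor text.

```python
from collections import Counter
from typing import List, NamedTuple


class Link(NamedTuple):
    """Hypothetical record for one hyperlink found during preprocessing."""
    source: str        # URL of the page that contains the link
    target: str        # URL the link points to
    tokens: List[str]  # tokenized text of the source page
    start: int         # index of the first anchor-text token
    end: int           # index one past the last anchor-text token


def link_bag(links, target):
    """Bag of in-linking page URLs for `target`."""
    return Counter(link.source for link in links if link.target == target)


def anchor_window_bag(links, target, window):
    """Bag of anchor-text terms plus `window` tokens on each side."""
    bag = Counter()
    for link in links:
        if link.target != target:
            continue
        lo = max(0, link.start - window)
        hi = min(len(link.tokens), link.end + window)
        bag.update(link.tokens[lo:hi])
    return bag


# Toy data loosely modeled on the slide example; anchor positions are assumed.
links = [
    Link("www.foobar.com", "www.music.com",
         ["click", "here", "for", "a", "great", "music", "page"], 4, 7),
    Link("www.baz.com", "www.music.com",
         ["this", "music", "is", "great"], 1, 2),
]
print(link_bag(links, "www.music.com"))
print(anchor_window_bag(links, "www.music.com", window=4))
```

With `window=0` only the anchor text itself contributes; larger windows pull in more of the surrounding context, which is the trade-off the anchor window length parameter controls.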

  5. Evaluating strategies
     - Goal: find the "best" θ_i ∈ Θ.
     - Develop a system to measure the quality of different parameter settings:
       - What do you choose as the ground truth for Web-page similarity?
       - How do you compare a particular strategy to this ground truth?

     Web directories (Yahoo!, ODP)
     - Hand-constructed hierarchical directories such as Yahoo! and the Open Directory Project (ODP) can be used as an external quality measure.
     - They do not directly provide ranked similarity listings.
     - They do contain many implicit similarity judgements.

     Directory → Similarity judgements
     - [Diagram: Open Directory → Computers → Hardware, Software, with pages www.hardware.com, www.software.com, www.programming.com, www.machine.com.]

     (Directory, query) → similarity ordering
     - [Diagram: relative to the query page, other pages fall into Same Class, Sibling Class, Cousin Class, or Unrelated.]
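
One way to turn a directory into the four judgement classes named above is sketched below. The slides only name the classes, so the precise definitions here are assumptions: a page is "same class" if it shares the query's category, "sibling" if the categories share a parent, "cousin" if they share a grandparent, and "unrelated" otherwise (including when they share nothing below the root). The category paths are hypothetical.

```python
# Lower rank = judged more similar to the query by the directory.
SAME, SIBLING, COUSIN, UNRELATED = 0, 1, 2, 3


def directory_rank(query_path, page_path):
    """Rank a page relative to the query from their category paths.

    Paths are tuples of category names from the top level down to the
    class that contains the page, e.g. ("Computers", "Hardware").
    """
    common = 0
    for q_part, p_part in zip(query_path, page_path):
        if q_part != p_part:
            break
        common += 1
    # How many levels above the deeper class the shared ancestor sits.
    steps_up = max(len(query_path), len(page_path)) - common
    if common == 0 or steps_up > 2:
        return UNRELATED
    return steps_up  # 0 = SAME, 1 = SIBLING, 2 = COUSIN


query = ("Computers", "Hardware")
print(directory_rank(query, ("Computers", "Hardware")))  # same class
print(directory_rank(query, ("Computers", "Software")))  # sibling class
print(directory_rank(query, ("Arts", "Music")))          # unrelated
```

Because many pages share a rank, the result is a weak ordering with ties rather than a full ranking.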

  6. Evaluating strategies
     1. Restrict attention during the evaluation phase to pages in the directory D.
     2. Compare the similarity ordering induced by parameter setting θ_i to the similarity ordering induced by the directory, over a test set of query pages.
     3. Choose the θ_i that agrees most closely with the judgements in D.

     (Directory, query) → similarity ordering
     - Inputs:
       - D: hierarchical directory
       - q ∈ D: query page
     - Outputs:
       - τ: list of pages of D partially ordered by similarity to q, using the ordering implicit in D
     - $\tau = T(D, q)$
     - The above is for evaluating similarity search, not performing it!

     Directory vs. Strategy
     - [Diagram: the ODP weak order (Same Class, Sibling Class, Cousin Class, Unrelated, relative to the query) next to the total order produced by strategy θ_i.]

     Comparing two orderings
     - Based on Kruskal-Goodman Γ.
     - Inputs:
       - τ_odp: strict weak ordering of pages (ODP)
       - τ_i: total ordering of pages according to θ_i
     - Output:
       - −1 ≤ Γ ≤ 1: measure of agreement
     - $\Gamma = 2 \times \Pr[\tau_{odp} \text{ and } \tau_i \text{ agree on ordering of } (u, v)] - 1$
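
The Γ formula above can be computed directly by counting pairs. The sketch below is an illustration under one stated assumption: pairs that the directory ordering ties (same rank) are skipped, since the weak order expresses no preference for them. The names `gamma` and `odp_rank`, and the toy pages, are hypothetical.

```python
from itertools import combinations


def gamma(odp_rank, strategy_order):
    """Gamma = 2 * Pr[orderings agree on (u, v)] - 1.

    odp_rank: dict mapping page -> directory rank (lower = more similar,
              ties allowed, i.e. a weak order).
    strategy_order: list of pages, most similar first (a total order).
    Assumption: pairs tied in the directory ordering are not counted.
    """
    position = {page: i for i, page in enumerate(strategy_order)}
    agree = comparable = 0
    for u, v in combinations(strategy_order, 2):
        if odp_rank[u] == odp_rank[v]:
            continue  # directory has no preference for this pair
        comparable += 1
        odp_prefers_u = odp_rank[u] < odp_rank[v]
        strategy_prefers_u = position[u] < position[v]
        if odp_prefers_u == strategy_prefers_u:
            agree += 1
    if comparable == 0:
        raise ValueError("directory ordering ties every pair")
    return 2 * agree / comparable - 1


# Toy example: ranks 0 = same class, 1 = sibling, 3 = unrelated.
odp_rank = {"a": 0, "b": 0, "c": 1, "d": 3}
print(gamma(odp_rank, ["a", "c", "b", "d"]))
```

A strategy that never contradicts the directory scores 1, one that always contradicts it scores −1, and a random ordering lands near 0.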

  7. Directory vs. Strategy
     - [Two diagrams comparing the ODP ordering with the ordering from strategy θ_i: one pair marked "Agreement", one marked "Disagreement!".]

     Evaluating strategies
     1. For each θ_i ∈ Θ:
        $\Gamma_{\theta_i} = \mathrm{Avg}_{q \in D}\,[\,\Gamma(T(D, q),\, T(\theta_i, q))\,]$
     2. Select strategy $\theta^* = \arg\max_{\theta_i} \Gamma_{\theta_i}$
     - Only assumes that higher agreement, on average, with ODP is a good thing.

     Example of two rankings with different Γ scores
     - Query page: www.aabga.org (American Association of Botanical Gardens and Arboreta)
     - Ranking with Γ = 0.5312:
       - Canadian Botanical Conservation Network (http://www.rbg.ca/cbcn)
       - The Royal Horticultural Society (www.rhs.org.uk)
       - The American Rhododendron Society (http://www.rhododendron.org)
       - Gardener's Supply Company (www.vg.com)
       - The New England Botanical Club (www.herbaria.harvard.edu/collections/nebc/nebc.html)
     - Ranking with Γ = 0.3096:
       - The Huntington Library, Art Collections, and Botanical Gardens (www.huntington.org)
       - The American Rhododendron Society (www.rhododendron.org)
       - American Chiropractic Association (www.amerchiro.org)
       - American Trakehner Association (horses) (www.americantrakehner.com)
       - American Subcontractors Association (www.asaonline.com)
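
The two-step selection rule above (average Γ per strategy, then take the argmax) might look like this in Python. The sketch assumes some function already returns Γ for a (strategy, query) pair; here that is a hypothetical lookup table, and the scores in it are invented for illustration and are not results from the talk.

```python
from statistics import mean


def select_strategy(strategies, queries, gamma_for):
    """Return (best strategy, average Gamma per strategy).

    gamma_for(strategy, query) is assumed to return the Gamma agreement
    between the directory ordering and the strategy's ordering for that query.
    """
    average = {
        s: mean(gamma_for(s, q) for q in queries)
        for s in strategies
    }
    best = max(average, key=average.get)
    return best, average


# Invented per-query scores, for illustration only.
toy_scores = {
    ("content", "q1"): 0.2, ("content", "q2"): 0.4,
    ("anchor-window-32", "q1"): 0.5, ("anchor-window-32", "q2"): 0.3,
}
best, average = select_strategy(
    ["content", "anchor-window-32"],
    ["q1", "q2"],
    lambda s, q: toy_scores[(s, q)],
)
print(best, average)
```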

  8. Experimental results
     - 42 million page subset of the Web from the Stanford WebBase.
     - The following results restrict attention to two colors: same class and sibling class.
     - D: 300 pairs of sibling clusters from ODP.

     Directory vs. Strategy
     - [Diagram repeated: the ODP weak order (Same Class, Sibling Class, Cousin Class, Unrelated) next to the total order produced by strategy θ_i.]

     Feature space: term selection
     - Content
     - Inlinks
     - Anchor-windows:
       - Basic: window size W ∈ {0, 4, 8, 16, 32}
       - Syntactic: averaged 3 words in both directions
       - Topical: averaged 21 words in both directions

     Γ scores
     - [Chart: Sibling-Γ (y-axis, 0.00 to 0.45) for each term-selection scheme; the plotted values are not recoverable from the extracted text.]

  9. Directory → Similarity judgements
     - [Diagram repeated: Computers → Hardware, Software, with pages www.hardware.com, www.software.com, www.programming.com, www.machine.com.]

     Orthogonality
     - [Chart: fraction of pairs that are orthogonal (y-axis, 0 to 1) for each term-selection scheme; the plotted values are not recoverable from the extracted text.]

     Feature space: term weighting
     - Distance weighting for anchor-window terms.
     - [Diagram: left window, anchor text, right window.]

     Composite schemes
     - [Chart: Sibling-Γ (y-axis, 0.426 to 0.440) for three schemes: Anchor-Window-32; Anchor-Window-32, Content; Anchor-Window-32, Content, Links. The plotted values are not recoverable from the extracted text.]
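
The orthogonality measure charted above can be illustrated with a short sketch. It assumes that two pages count as orthogonal when their bags share no terms, and that the fraction is taken over all unordered pairs in a sample; the slide does not say which pairs were sampled. The function name and sample bags are hypothetical.

```python
from collections import Counter
from itertools import combinations


def orthogonal_fraction(bags):
    """Fraction of unordered bag pairs that share no terms at all."""
    pairs = list(combinations(bags, 2))
    if not pairs:
        raise ValueError("need at least two bags")
    orthogonal = sum(1 for a, b in pairs if not (a.keys() & b.keys()))
    return orthogonal / len(pairs)


# Hypothetical bags for three pages.
bags = [
    Counter({"music": 1}),
    Counter({"music": 1, "great": 1}),
    Counter({"lunch": 1}),
]
print(orthogonal_fraction(bags))
```

A high fraction means most page pairs get similarity 0 under that representation, so the strategy cannot rank them against each other.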

  10. Weighting schemes
      - Frequency-based weighting schemes:
        - Inverse Document Frequency (IDF): attenuate weights for frequent terms.
        - Nonmonotonic Document Frequency (NMDF): attenuate weights for frequent and infrequent terms.

      Feature space: term weighting
      - [Chart: Sibling-Γ (y-axis, 0.38 to 0.46) for None vs. Distance weighting; the plotted values are not recoverable from the extracted text.]

      Term weighting (*DF)
      - [Chart: Sibling-Γ (y-axis, 0.42 to 0.48) for None, log, sqrt, and NMDF weighting; the plotted values are not recoverable from the extracted text.]

      Comparison of best and worst
      - [Chart: Sibling-Γ (y-axis, 0 to 1) for the worst setting vs. the best setting; the plotted values are not recoverable from the extracted text.]
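
The slides name the weighting schemes (log, sqrt, NMDF) without giving formulas, so the functions below are assumptions chosen only to show the qualitative behavior described: the two IDF variants shrink weights as document frequency grows, while the nonmonotonic variant peaks at a mid-range frequency and falls off on both sides. The parameters `peak_df` and `spread` are hypothetical.

```python
import math


def idf_log(df: float) -> float:
    """Assumed log-style IDF: weight shrinks slowly as df grows."""
    return 1.0 / (1.0 + math.log(df))


def idf_sqrt(df: float) -> float:
    """Assumed sqrt-style IDF: weight shrinks faster as df grows."""
    return 1.0 / math.sqrt(df)


def nmdf(df: float, peak_df: float = 1000.0, spread: float = 1.5) -> float:
    """Assumed nonmonotonic weight: a Gaussian bump in log(df).

    Terms much rarer or much more common than peak_df are attenuated.
    """
    z = (math.log(df) - math.log(peak_df)) / spread
    return math.exp(-0.5 * z * z)


for df in (1, 10, 1000, 1_000_000):
    print(df, idf_log(df), idf_sqrt(df), nmdf(df))
```

Each weight would multiply a term's count in the bag before similarity is computed.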
