
Evaluating Strategies for Similarity Search on the Web

Taher H. Haveliwala Aristides Gionis Dan Klein Piotr Indyk

{taherh,gionis,klein}@cs.stanford.edu indyk@theory.lcs.mit.edu

Similarity search

  • Given a query Web page q, return Web pages that are “similar” to q
  • Example pages: www.moneyclub.com, www.etrade.com, www.money.com, www.moneyworld.co.uk, www.pathfinder.com/money, www.moneycentral.com

Similarity search

  • Two major issues:
      ◦ Choose the strategy that best captures the notion of Web-page “similarity”
      ◦ Scale up the chosen strategy to repositories of millions of pages

Related work

  • Finding Related Pages in the WWW
      ◦ [Dean, Henzinger WWW8 ’99]
  • Automatic Resource Compilation ...
      ◦ [Chakrabarti et al. WWW7 ’98]
  • Commercial search engines

Model for document similarity

  • Represent each Web page as a bag of terms
      ◦ content, anchor text, links, ...
  • Similarity of two pages is given by the similarity of their respective bags
      ◦ cosine
      ◦ Jaccard

Model for document similarity

  • For pages a and b, with respective bags α and β, define

$$\mathrm{sim}(a, b) = \frac{|\alpha \cap \beta|}{|\alpha \cup \beta|}$$

  • Strategy for (page → bag) is the crucial step in quality of sim()
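A minimal sketch of this ratio, assuming each bag is a multiset of terms held in a Python `Counter`. The function name `sim` follows the slide; the sample bags are made up. For multisets, intersection takes the per-term minimum count and union the per-term maximum.

```python
from collections import Counter

def sim(alpha: Counter, beta: Counter) -> float:
    """Jaccard similarity |alpha ∩ beta| / |alpha ∪ beta| for bags (multisets)."""
    intersection = sum((alpha & beta).values())  # per-term minimum count
    union = sum((alpha | beta).values())         # per-term maximum count
    return intersection / union if union else 0.0

# Hypothetical bags for two pages (not taken from the slides)
a = Counter({"music": 2, "great": 2, "click": 1})
b = Counter({"music": 1, "great": 1, "sports": 1})
print(sim(a, b))
```

Two empty bags are given similarity 0.0 here; that convention is a choice of this sketch, not something the slides specify.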

Similarity search system

[Diagram. Preprocessing: Web page → representation, Page Representations, Indexing, Sim Index. Query-time: Query Processing.]

Similarity search system

[Diagram. Preprocessing, using strategy θ: Web page →θ representation, Page Representations, Indexing, Sim Index. Query-time: Query Processing.]



Possible term choices

[Figure: example pages http://www.foobar.com/, http://www.baz.com/, and http://www.music.com/. Text snippets shown: “...click here for a great music page...”, “...click here for great sports page...”, “...this music is great...”, “...what I had for lunch...”. Page text shown: “Enter our site MusicWorld”.]

Content

[Figure: the same example pages and snippets, with page text “Welcome MusicWorld”]

Bag for www.music.com: welcome (1), world (1), music (1)

Links

[Figure: the same example pages and snippets, with page text “Enter our site MusicWorld”]

Bag for www.music.com: www.baz.com (1), www.foobar.com (1)


Anchor windows

[Figure: the same example pages and snippets, with page text “Enter our site MusicWorld”]

Bag for www.music.com: music (2), great (2), click (1), ...
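One way to picture anchor-window bag generation is the sketch below. It assumes each linking page has already been tokenized and that the position of the anchor text is known; the `Link` record, the `anchor_window_bag` function, the anchor positions, and the `www.sports.example` URL are all hypothetical, chosen only to mirror the snippets on the slide.

```python
from collections import Counter
from typing import NamedTuple

class Link(NamedTuple):
    tokens: list   # tokenized text of the linking page
    start: int     # index of the first anchor-text token
    end: int       # index one past the last anchor-text token
    target: str    # URL the anchor points to

def anchor_window_bag(links, target, window):
    """Bag of terms within `window` tokens of every anchor pointing at `target`."""
    bag = Counter()
    for link in links:
        if link.target != target:
            continue
        lo = max(0, link.start - window)
        hi = min(len(link.tokens), link.end + window)
        bag.update(link.tokens[lo:hi])  # left window + anchor text + right window
    return bag

# Anchor positions below are assumptions made for illustration.
links = [
    Link("click here for a great music page".split(), 5, 7, "www.music.com"),
    Link("click here for great sports page".split(), 4, 6, "www.sports.example"),
    Link("this music is great".split(), 1, 2, "www.music.com"),
]
bag = anchor_window_bag(links, "www.music.com", window=2)
print(bag.most_common(3))
```

With `window=0` the bag reduces to the anchor text alone, which corresponds to the smallest window size in the parameter space.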

Parameter space for bag generation

  • Space of parameters considered:
      ◦ content vs. links vs. anchor windows
      ◦ anchor window length
      ◦ term weighting schemes
  • Choice of a particular assignment of parameters, θ, defines a similarity search strategy


(Strategy, query) → similarity ordering

  • Inputs:
      ◦ θ ∈ Θ: strategy (i.e., parameter setting)
      ◦ q ∈ Web: query page
  • Outputs:
      ◦ τ: list of Web pages ordered by similarity to q using strategy θ
  • τ = Τ(θ, q)
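A minimal sketch of the mapping from (strategy, query) to an ordering, assuming the strategy θ has already been applied so that it is represented by the bags it produced, and assuming Jaccard as the bag similarity. The names `jaccard` and `T` and the sample bags are illustrative, not taken from the slides.

```python
from collections import Counter

def jaccard(alpha, beta):
    """Bag Jaccard: sum of per-term minima over sum of per-term maxima."""
    union = sum((alpha | beta).values())
    return sum((alpha & beta).values()) / union if union else 0.0

def T(bags, q):
    """Return every page other than q, most similar to q first."""
    others = (p for p in bags if p != q)
    return sorted(others, key=lambda p: jaccard(bags[q], bags[p]), reverse=True)

# Hypothetical bags produced by some strategy
bags = {
    "q": Counter(music=2, great=1),
    "p1": Counter(music=1, great=1),
    "p2": Counter(sports=1, great=1),
    "p3": Counter(lunch=1),
}
tau = T(bags, "q")
print(tau)
```

Sorting every page is only workable for small collections; it is meant to make the definition concrete, not to suggest how a large repository would be queried.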

Evaluating strategies

  • Goal: find “best” θi ∈ Θ
  • Develop system to measure quality of different parameter settings
      ◦ What do you choose as the ground truth for Web-page similarity?
      ◦ How do you compare a particular strategy to this ground truth?

Web directories (Yahoo!, ODP)

  • Hand-constructed hierarchical directories such as Yahoo! and the Open Directory Project (ODP) can be used as an external quality measure
  • Do not directly provide ranked similarity listings
  • Do contain many implicit similarity judgements

Directory → Similarity judgements

[Figure: directory tree. Computers has subcategories Hardware and Software; example pages: www.hardware.com, www.software.com, www.machine.com, www.programming.com]

[Figure: Open Directory tree with legend: Query, Same Class, Sibling Class, Cousin Class, Unrelated]

(Directory, query) → similarity ordering


  • Inputs:
      ◦ D: hierarchical directory
      ◦ q ∈ D: query page
  • Outputs:
      ◦ τ: list of pages of D partially ordered by similarity to q, using the ordering implicit in D
  • τ = Τ(D, q)
  • The above is for evaluating similarity search, not performing it!
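The ordering implicit in a directory can be sketched as follows, under stated assumptions: each page is identified by its category path, "sibling" means the two categories share a parent, and "cousin" means they share a grandparent. These definitions, the function names, and the assignment of example pages to Hardware and Software are assumptions of this sketch; the slides only name the four classes.

```python
# Lower rank = more similar to the query; pages with equal rank are tied,
# which is what makes the directory ordering a weak (partial) order.
RANK = {"same class": 0, "sibling class": 1, "cousin class": 2, "unrelated": 3}

def relation(query_path, page_path):
    """Classify a page relative to the query from their category paths (tuples)."""
    if page_path == query_path:
        return "same class"
    if len(page_path) == len(query_path):
        if page_path[:-1] == query_path[:-1]:
            return "sibling class"   # categories share a parent
        if len(page_path) >= 2 and page_path[:-2] == query_path[:-2]:
            return "cousin class"    # categories share a grandparent
    return "unrelated"

def T_directory(directory, q):
    """Map every page other than q to its rank in the directory-induced order."""
    return {p: RANK[relation(directory[q], path)]
            for p, path in directory.items() if p != q}

# Page-to-category assignments are assumed for illustration.
directory = {
    "www.hardware.com": ("Computers", "Hardware"),
    "www.machine.com": ("Computers", "Hardware"),
    "www.software.com": ("Computers", "Software"),
    "www.programming.com": ("Computers", "Software"),
}
ranks = T_directory(directory, "www.hardware.com")
print(ranks)
```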

Evaluating strategies

1. Restrict attention during evaluation phase to pages in the directory D
2. Compare the similarity ordering induced by parameter setting θi to the similarity ordering induced by the directory, over a test set of query pages
3. Choose the θi that agrees most closely with the judgements in D

[Figure: Open Directory tree with legend: Query, Same Class, Sibling Class, Cousin Class, Unrelated]

Directory vs. Strategy

[Figure: ODP yields a weak order; strategy θi yields a total order]

Comparing two orderings

  • Based on Kruskal-Goodman Γ
  • Inputs
      ◦ τodp: strict weak ordering of pages (ODP)
      ◦ τi: total ordering of pages according to θi
  • Output
      ◦ -1 ≤ Γ ≤ 1: measure of agreement

$$\Gamma = 2 \times \Pr[\tau_{odp} \text{ and } \tau_i \text{ agree on ordering of } (u, v)] - 1$$
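A minimal sketch of this agreement measure, assuming the weak order is given as a rank per page (lower is more similar) and the total order as a list. Pairs that the weak order leaves tied are skipped, as is usual for the Goodman-Kruskal statistic; the function name and sample data are illustrative.

```python
from itertools import combinations

def gamma(odp_rank, strategy_order):
    """2 * (fraction of ODP-distinguished pairs ordered the same way) - 1."""
    agree = total = 0
    # combinations() yields (u, v) with u placed before v by the strategy.
    for u, v in combinations(strategy_order, 2):
        if odp_rank[u] == odp_rank[v]:
            continue  # the weak order expresses no preference for this pair
        total += 1
        if odp_rank[u] < odp_rank[v]:
            agree += 1
    return 2 * agree / total - 1 if total else 0.0

# Hypothetical data: a and b are tied in the directory ordering.
odp_rank = {"a": 0, "b": 0, "c": 1, "d": 2}
print(gamma(odp_rank, ["a", "c", "b", "d"]))
```

A strategy that never contradicts the directory scores 1, and one that reverses every distinguished pair scores -1.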


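Directory vs. Strategy

[Figure: a pair of pages on which the ordering from strategy θi and the ODP ordering disagree]

Directory vs. Strategy

[Figure: a pair of pages on which the ordering from strategy θi and the ODP ordering agree]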

Example of two rankings with different Γ scores

Query page: www.aabga.org (American Association of Botanical Gardens and Arboreta)

Ranking 1:

1. The Royal Horticultural Society (www.rhs.org.uk)
2. The New England Botanical Club (www.herbaria.harvard.edu/collections/nebc/nebc.html)
3. Gardener’s Supply Company (www.vg.com)
4. The American Rhododendron Society (http://www.rhododendron.org)
5. Canadian Botanical Conservation Network (http://www.rbg.ca/cbcn)

Ranking 2:

1. The American Rhododendron Society (www.rhododendron.org)
2. American Subcontractors Association (www.asaonline.com)
3. American Trakehner Association (horses) (www.americantrakehner.com)
4. American Chiropractic Association (www.amerchiro.org)
5. The Huntington Library, Art Collections, and Botanical Gardens (www.huntington.org)

Γ scores shown: Γ=0.5312, Γ=0.3096

Evaluating strategies

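1. For each θi ∈ Θ:

$$\Gamma_{\theta_i} = \operatorname{Avg}_{q \in D} \left[ \Gamma\big(\mathrm{T}(D, q), \mathrm{T}(\theta_i, q)\big) \right]$$

2. Select strategy

$$\theta^* = \arg\max_{\theta_i} \left[ \Gamma_{\theta_i} \right]$$

  • Only assumes that higher agreement, on average, with ODP is a good thing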
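The selection step can be sketched as an average followed by an argmax, assuming the per-query Γ values for each strategy have already been computed. The function name `select_strategy`, the strategy labels, and the scores below are made up for illustration and are not results from the talk.

```python
from statistics import mean

def select_strategy(gamma_by_strategy):
    """Average per-query gamma for each strategy, then pick the highest average."""
    averages = {theta: mean(scores) for theta, scores in gamma_by_strategy.items()}
    best = max(averages, key=averages.get)
    return best, averages

# Made-up per-query gamma values for two hypothetical strategies
scores = {
    "content": [0.2, 0.4],
    "anchor-window-32": [0.5, 0.3, 0.4],
}
best, averages = select_strategy(scores)
print(best, averages)
```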

Experimental results

  • 42 million page subset of the Web from the Stanford WebBase
  • Following results restrict attention to two colors: same class and sibling class
  • D: 300 pairs of sibling clusters from ODP


Feature space: term selection

  • Content
  • Inlinks
  • Anchor-windows
      ◦ Basic: window size W ∈ {0, 4, 8, 16, 32}
      ◦ Syntactic: averaged 3 words in both directions
      ◦ Topical: averaged 21 words in both directions

Γ scores

[Bar chart: Sibling-Γ (axis 0.00 to 0.45) for contents, links, w, w4, w8, w16, w32, syntactic, topical]


Orthogonality

[Bar chart: Fraction of Pairs that are Orthogonal (axis up to 1) for contents, links, w, w4, w8, w16, w32, syntactic, topical]


Composite schemes

[Bar chart: Sibling-Γ (axis 0.426 to 0.440) for three schemes: Anchor-Window-32; Anchor-Window-32, Content; Anchor-Window-32, Content, Links]

Feature space: term weighting

  • Distance weighting for anchor-window terms

[Figure: Left window, Anchor text, Right window]


Weighting schemes

[Bar chart: Sibling-Γ (axis 0.38 to 0.46) for None vs. Distance]

Feature space: term weighting

  • Frequency based weighting schemes
      ◦ Inverse Document Frequency (IDF)
          - attenuate weights for frequent terms
      ◦ Nonmonotonic Document Frequency (NMDF)
          - attenuate weights for frequent and infrequent terms
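The difference between the two shapes can be sketched as below. The logarithmic IDF form is the standard one; the NMDF function shown is only a hypothetical stand-in (a Gaussian bump in log document frequency), since the slides do not give the formula actually used. The parameters `peak_df` and `width` are assumptions of this sketch.

```python
import math

def idf_log(df, n_docs):
    """Monotonic IDF: the more documents contain a term, the lower its weight."""
    return math.log(n_docs / df)

def nmdf(df, peak_df, width=1.0):
    """Hypothetical nonmonotonic weight: Gaussian bump in log(df), peaking at peak_df.

    Terms much rarer or much more common than peak_df are both attenuated.
    """
    z = (math.log(df) - math.log(peak_df)) / width
    return math.exp(-0.5 * z * z)

n_docs = 1_000_000
for df in (1, 100, 10_000):
    print(df, round(idf_log(df, n_docs), 3), round(nmdf(df, peak_df=100, width=2.0), 3))
```

The monotonic scheme keeps rewarding rarity all the way down to terms seen once, while the nonmonotonic one treats very rare terms as no more informative than very common ones.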

Term weighting (*DF)

[Bar chart: Sibling-Γ (axis 0.42 to 0.48) for None, log, sqrt, NMDF]

Comparison of best and worst

[Bar chart: Sibling-Γ (axis 0.1 to 1) for Worst setting vs. Best setting]


In summary

  • The previous experiments allow us to choose the parameters θ* that most closely accord with the similarity judgements implicitly embodied in ODP
      ◦ term selection:
          - page content
          - size 32 anchor windows
      ◦ weighting schemes:
          - distance
          - NMDF

Scaling to large repositories

  • Goal: generate a similarity index that allows efficient runtime query processing, using strategy θ*
  • Dataset: 80M URLs from Stanford WebBase

Scalability: keyword search ≠ similarity search

  • For standard keyword search query, # of accesses to inverted index equals # of terms in query
  • The postings lists for most terms are of reasonable length

Typical keyword search

[Figure: inverted index with terms aardvark, ..., advice, ..., financial, ...]

  • Typical keyword search query: “financial advice”
  • Postings lists retrieved: DocId: 1,5,8,10,50 and DocId: 3,5,9,10,50,51
  • Inverted index lookup is manageable


Scalability: keyword search ≠ similarity search

  • For similarity search, # of accesses to inverted index equals # of terms in the query page’s (potentially large) bag
  • Many of these terms could have huge postings lists in the inverted index
      ◦ content words
      ◦ very wide anchor windows
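The cost gap can be made concrete with a toy inverted index. This is a sketch under assumptions: the index contents, the pairing of postings lists with terms, and the function name `lookup_cost` are invented for illustration.

```python
def lookup_cost(index, terms):
    """Return (# of index accesses, # of postings read) for one query."""
    accesses = len(terms)
    postings = sum(len(index.get(t, [])) for t in terms)
    return accesses, postings

# Toy inverted index: term -> postings list of document ids (all values assumed)
index = {
    "financial": [1, 5, 8, 10, 50],
    "advice": [3, 5, 9, 10, 50, 51],
    "association": list(range(0, 200, 2)),  # stands in for a long postings list
    "treasury": list(range(0, 300, 3)),     # stands in for a long postings list
}

keyword_query = ["financial", "advice"]
similarity_query = list(index)  # every term in the query page's bag

print(lookup_cost(index, keyword_query))
print(lookup_cost(index, similarity_query))
```

Even in this tiny example the similarity query reads far more postings than the keyword query; a real page bag would contain many more terms.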

Scalability

[Figure: inverted index with terms aardvark, ..., advice, association, financial, treasury, ...]

  • Typical similarity search query: “www.money.com”
  • Postings lists retrieved:
      ◦ DocId: 1,2,3,5,8,9,10,11,50,51,52,55,58
      ◦ DocId: 3,5,8,9,10,50,51,60,90,92,98,105
      ◦ DocId: 3, 10, 15, 25, 28, 32, 66, 95, 102, 115, 193, 200, 205, ...
      ◦ DocId: 3,8,9,10,55,58,85,99,105,110,125,130,150,155,158, ...
  • Inverted index lookup is not manageable

Scalability

  • Solution summary:
      ◦ Use a special kind of signature generation technique to represent bags with fixed-length signature vectors
      ◦ Similar signature vectors indicate similar bags, w.h.p.
  • [Broder et al. STOC ’98], [Indyk SODA ’99]
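The slide does not name the signature technique. Given the cited work, one plausible reading is min-wise hashing, so the sketch below assumes that and treats each bag as a set of terms. The helper names, the signature length `k`, and the use of BLAKE2b as the hash family are all choices of this sketch rather than details from the talk.

```python
import hashlib

def _hash(seed, term):
    """Deterministic 64-bit hash of a term under one of k seeded hash functions."""
    data = f"{seed}:{term}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def signature(terms, k=128):
    """Fixed-length min-hash signature of a non-empty collection of terms."""
    return [min(_hash(seed, t) for t in terms) for seed in range(k)]

def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions; estimates the Jaccard similarity of the sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical term sets for three pages
money_a = {"money", "finance", "stock", "funds", "market"}
money_b = {"money", "finance", "stock", "funds", "business"}
music = {"music", "audio", "player", "artist", "mp3"}

print(estimate_similarity(signature(money_a), signature(money_b)))
print(estimate_similarity(signature(money_a), signature(music)))
```

Each page is reduced to `k` integers regardless of bag size, so comparing two pages costs the same whether their bags hold ten terms or ten thousand.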

Sample results

  • Reuters MoneyNet, MutualFunds, The Money Page, MorningStar, Money Club, ETrade, Money, MoneyExtra, Money Magazine, MSN Money, MSN Money
  • Nullsoft Winamp, Gracenote (cddb), Launch.com, Listen.com, AudioGalaxy, Lycos Music, EMusic, CMJ: New Music First, EMusic, International Music Network, MP3.com


Sample bags

  • www.cdnow.com: music, cdnow, amazon, records, books
  • java.sun.com: java, jdk, technology, microsystems, api
  • www.mp3.com: music, audio, player, artist, napster
  • www.cnnfn.com: finance, business, cnn, cnnfn, stock
  • www.weather.com: weather, channel, forecasts, fbc, enter
  • moneycentral.msn.com: money, finance, msn, website, moneycentral

Top 5 words from each bag are shown

Future work

  • What if ODP pages aren’t representative of web pages in general?
  • Calculate several “best” parameter settings, based on certain page properties
      ◦ Calculate separate Γ scores for strategy over low indegree and high indegree pages
      ◦ Partition scores for other properties as well ...