Navigating the Web graph


Navigating the Web graph Workshop on Networks and Navigation Santa Fe Institute, August 2008 Filippo Menczer Informatics & Computer Science Indiana University, Bloomington Outline Topical locality: Content, link, and semantic topologies


slide-1
SLIDE 1

Navigating the Web graph

Workshop on Networks and Navigation Santa Fe Institute, August 2008

Filippo Menczer

Informatics & Computer Science Indiana University, Bloomington

slide-2
SLIDE 2

Outline

  • Topical locality: Content, link, and semantic topologies
  • Implications for growth models and navigation
  • Applications
    – Topical Web crawlers
    – Distributed collaborative peer search

slide-3
SLIDE 3

The Web as a text corpus

Pages close in word vector space tend to be related

Cluster hypothesis (van Rijsbergen 1979)
The WebCrawler (Pinkerton 1994)
The whole first generation of search engines

weapons mass destruction

p1 p2

slide-4
SLIDE 4

Enter the Web’s link structure

Broder & al. 2000

$$p(i) = \frac{\alpha}{N} + (1 - \alpha) \sum_{j:\, j \to i} \frac{p(j)}{|\{\ell : j \to \ell\}|}$$

Brin & Page 1998 Barabasi & Albert 1999
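
The PageRank formula on this slide can be iterated to a fixed point. The following is a minimal sketch, assuming a small in-memory graph; the function name `pagerank`, the toy graph, the value of α, and the uniform treatment of pages without out-links are illustrative assumptions, not part of the talk.

```python
def pagerank(links, alpha=0.15, iterations=50):
    """Iterate p(i) = alpha/N + (1 - alpha) * sum_{j -> i} p(j) / outdegree(j).

    `links` maps every page to the list of pages it links to
    (every link target must also appear as a key).
    """
    nodes = sorted(links)
    n = len(nodes)
    p = {i: 1.0 / n for i in nodes}
    for _ in range(iterations):
        nxt = {i: alpha / n for i in nodes}
        for j in nodes:
            out = links[j]
            # Assumption: a page with no out-links spreads its rank uniformly.
            targets = out if out else nodes
            share = (1 - alpha) * p[j] / len(targets)
            for i in targets:
                nxt[i] += share
        p = nxt
    return p


# Hypothetical four-page Web: "d" has no in-links, "c" collects the most.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph the page with the most in-links ("c") ends up with the highest score, and the page nobody links to ("d") keeps only the α/N term.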

slide-5
SLIDE 5

Three network topologies

Text Links

slide-6
SLIDE 6

Three network topologies

Text Links Meaning

slide-7
SLIDE 7

Connection between semantic topology (topicality or relevance) and link topology (hypertext)

G = Pr[rel(p)] ~ fraction of relevant pages (generality)
R = Pr[rel(p) | rel(q) AND link(q,p)]

Related nodes are “clustered” if R > G (modularity)

Necessary and sufficient condition for a random crawler to find pages related to start points

G = 5/15 C = 2 R = 3/6 = 2/4

The “link-cluster” conjecture

ICML 1997

slide-8
SLIDE 8
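
Link-cluster conjecture

  • Stationary hit rate for a random crawler:

$$\eta(t+1) = \eta(t) \cdot R + (1 - \eta(t)) \cdot G \geq \eta(t)$$

$$\eta(t) \xrightarrow{t \to \infty} \eta^{*} = \frac{G}{1 - (R - G)}$$

Conjecture: $$\eta^{*} > G \iff R > G$$

Value added: $$\frac{\eta^{*}}{G} - 1 = \frac{R - G}{1 - (R - G)}$$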
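
The hit-rate recurrence on this slide can be checked numerically. This is a small sketch: the function names are made up for illustration, and starting the crawler at η(0) = G is an assumption. The values G = 5/15 and R = 3/6 are taken from the toy example on the previous slide.

```python
def hit_rate_trajectory(G, R, steps=200):
    """Iterate eta(t+1) = eta(t) * R + (1 - eta(t)) * G, starting from eta(0) = G."""
    eta = G
    trajectory = [eta]
    for _ in range(steps):
        eta = eta * R + (1 - eta) * G
        trajectory.append(eta)
    return trajectory


def stationary_hit_rate(G, R):
    """Closed-form fixed point eta* = G / (1 - (R - G))."""
    return G / (1 - (R - G))


G, R = 5 / 15, 3 / 6
trajectory = hit_rate_trajectory(G, R)
eta_star = stationary_hit_rate(G, R)
value_added = eta_star / G - 1  # equals (R - G) / (1 - (R - G))
```

With these values the iteration settles at η* = 0.4, a 20% improvement over the generality G = 1/3, which is what the closed form predicts whenever R > G.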

slide-9
SLIDE 9

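Link-cluster conjecture

Pages that link to each other tend to be related
Preservation of semantics (meaning)
A.k.a. topic drift

$$L(q,\delta) \equiv \frac{\sum_{\{p:\, \mathrm{path}(q,p) \le \delta\}} \mathrm{path}(q,p)}{|\{p : \mathrm{path}(q,p) \le \delta\}|}$$

$$\frac{R(q,\delta)}{G(q)} \equiv \frac{\Pr[\mathrm{rel}(p) \mid \mathrm{rel}(q) \wedge \mathrm{path}(q,p) \le \delta]}{\Pr[\mathrm{rel}(p)]}$$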

JASIST 2004

slide-10
SLIDE 10

The “link-content” conjecture

Correlation of lexical and linkage topology
L(δ): average link distance
S(δ): average similarity to start (topic) page from pages up to distance δ
Correlation ρ(L,S) = –0.76

$$S(q,\delta) \equiv \frac{\sum_{\{p:\, \mathrm{path}(q,p) \le \delta\}} \mathrm{sim}(q,p)}{|\{p : \mathrm{path}(q,p) \le \delta\}|}$$
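
Both averages, L(q,δ) and S(q,δ), range over the pages within δ links of the start page, so a breadth-first search is enough to compute them. This is a minimal sketch on a hypothetical five-page graph; the function names, the toy similarities, and the choice to exclude q itself from the averages are assumptions made for illustration.

```python
from collections import deque


def link_distances(graph, q, delta):
    """Shortest link distance from q to every page within delta links (q excluded)."""
    dist = {q: 0}
    queue = deque([q])
    while queue:
        u = queue.popleft()
        if dist[u] == delta:
            continue  # do not expand beyond the distance limit
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    del dist[q]
    return dist


def avg_link_distance(graph, q, delta):
    """L(q, delta): mean link distance of the pages reached."""
    d = link_distances(graph, q, delta)
    return sum(d.values()) / len(d)


def avg_similarity(graph, sim, q, delta):
    """S(q, delta): mean similarity to q of the pages reached."""
    d = link_distances(graph, q, delta)
    return sum(sim[p] for p in d) / len(d)


# Hypothetical crawl neighborhood and similarities sim(q, p).
toy_graph = {"q": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": [], "d": []}
toy_sim = {"a": 0.8, "b": 0.6, "c": 0.3, "d": 0.1}
```

In the toy data, going from δ = 1 to δ = 2 raises the average link distance from 1 to 1.5 and lowers the average similarity from 0.7 to 0.45, the kind of negative L–S relationship the slide reports.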

slide-11
SLIDE 11

Heterogeneity of link-content correlation

$$S = c + (1 - c)\, e^{a L^{b}}$$

[Figure: fits by top-level domain: edu, net, gov, org, com]

  • signif. diff. a only (p<0.05)
  • signif. diff. a & b (p<0.05)
slide-12
SLIDE 12

Mapping the relationship between links, content, and semantic topologies

  • Given any pair of pages, need ‘similarity’ or ‘proximity’ metric for each topology:
    – Content: textual/lexical (cosine) similarity
    – Link: co-citation/bibliographic coupling
    – Semantic: relatedness inferred from manual classification
  • Data: Open Directory Project (dmoz.org)
    – ~ 1 M pages after cleanup
    – ~ 1.3*10^12 page pairs!

slide-13
SLIDE 13

Content similarity

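$$\sigma_c(p_1, p_2) = \frac{p_1 \cdot p_2}{\|p_1\| \cdot \|p_2\|}$$

[Figure: p1 and p2 as vectors over axes term i, term j, term k]

Link similarity

$$\sigma_l(p_1, p_2) = \frac{|U_{p_1} \cap U_{p_2}|}{|U_{p_1} \cup U_{p_2}|}$$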
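
Both pairwise measures on this slide are short to compute. This is a sketch under stated assumptions: pages are represented as term-weight dictionaries for the content measure, and U_p is taken to be a set of URLs in a page's link neighborhood. The function names and toy inputs are illustrative.

```python
import math


def content_similarity(p1, p2):
    """Cosine similarity between two term-weight dictionaries."""
    dot = sum(w * p2.get(term, 0.0) for term, w in p1.items())
    norm1 = math.sqrt(sum(w * w for w in p1.values()))
    norm2 = math.sqrt(sum(w * w for w in p2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty page is similar to nothing
    return dot / (norm1 * norm2)


def link_similarity(u1, u2):
    """Jaccard coefficient of two URL sets: |U1 & U2| / |U1 | U2|."""
    union = u1 | u2
    return len(u1 & u2) / len(union) if union else 0.0


# Hypothetical pages: they share one of two terms and two of four URLs.
sigma_c = content_similarity({"web": 1.0, "graph": 1.0}, {"web": 1.0, "crawler": 1.0})
sigma_l = link_similarity({"a", "b", "c"}, {"b", "c", "d"})
```

Both example values come out at 0.5. Both measures lie in [0, 1], which is what allows them to share axes in the later plots.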

slide-14
SLIDE 14

Semantic similarity

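  • Information-theoretic measure based on classification tree (Lin 1998)
  • Classic path distance in special case of balanced tree

$$\sigma_s(c_1, c_2) = \frac{2 \log \Pr[\mathrm{lca}(c_1, c_2)]}{\log \Pr[c_1] + \log \Pr[c_2]}$$

[Figure: classification tree with root "top", topics c1 and c2, and their lowest common ancestor lca]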
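
The tree-based measure can be illustrated on a tiny hand-made taxonomy. In this sketch Pr[c] is estimated as the fraction of pages classified at or below topic c. The taxonomy, page counts, and helper names are hypothetical, and returning 1.0 when both topics are the root is an added convention to avoid 0/0.

```python
import math

# Hypothetical taxonomy (child -> parent) and pages filed directly under each topic.
PARENT = {"Science": "top", "Arts": "top",
          "Physics": "Science", "Biology": "Science", "Movies": "Arts"}
PAGES = {"Physics": 2, "Biology": 2, "Movies": 4}


def ancestors(c):
    """Topic c followed by its ancestors up to the root."""
    chain = [c]
    while c in PARENT:
        c = PARENT[c]
        chain.append(c)
    return chain


def prob(c):
    """Fraction of all pages classified at or below topic c."""
    total = sum(PAGES.values())
    return sum(n for topic, n in PAGES.items() if c in ancestors(topic)) / total


def lca(c1, c2):
    """Lowest common ancestor of two topics."""
    above_c1 = set(ancestors(c1))
    return next(a for a in ancestors(c2) if a in above_c1)


def semantic_similarity(c1, c2):
    """sigma_s = 2 log Pr[lca] / (log Pr[c1] + log Pr[c2])."""
    denom = math.log(prob(c1)) + math.log(prob(c2))
    if denom == 0:
        return 1.0  # both topics are the root (convention assumed here)
    return 2 * math.log(prob(lca(c1, c2))) / denom
```

With these counts, Physics and Biology (siblings under Science) score 0.5, Physics and Movies (which meet only at the root) score 0, and any topic compared with itself scores 1.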

slide-15
SLIDE 15

Individual metric distributions

semantic content link

slide-16
SLIDE 16

Precision = |Retrieved & Relevant| / |Retrieved|
Recall = |Retrieved & Relevant| / |Relevant|
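
These two set ratios translate directly into code. A minimal sketch, with hypothetical page identifiers and a made-up function name:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall; an empty denominator yields 0.0."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four pages retrieved, three relevant, two in common.
precision, recall = precision_recall({"p1", "p2", "p3", "p4"}, {"p2", "p4", "p5"})
```

Here precision is 2/4 and recall is 2/3.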

slide-17
SLIDE 17

Precision = |Retrieved & Relevant| / |Retrieved|
Averaging semantic similarity:

$$P(s_c, s_l) = \frac{\sum_{\{p,q:\, \sigma_c = s_c,\, \sigma_l = s_l\}} \sigma_s(p,q)}{|\{p,q : \sigma_c = s_c,\, \sigma_l = s_l\}|}$$

Recall = |Retrieved & Relevant| / |Relevant|
Summing semantic similarity:

$$R(s_c, s_l) = \frac{\sum_{\{p,q:\, \sigma_c = s_c,\, \sigma_l = s_l\}} \sigma_s(p,q)}{\sum_{\{p,q\}} \sigma_s(p,q)}$$
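
The generalized precision and recall can be accumulated in one pass over page pairs. This sketch assumes each pair has already been reduced to a triple (s_c bin, s_l bin, σ_s). The binning itself, the function name, and the toy triples are assumptions for illustration.

```python
from collections import defaultdict


def semantic_precision_recall(pairs):
    """Per-(s_c, s_l) cell: mean sigma_s (precision) and share of total sigma_s (recall)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    total = 0.0
    for s_c, s_l, sigma_s in pairs:
        sums[(s_c, s_l)] += sigma_s
        counts[(s_c, s_l)] += 1
        total += sigma_s
    precision = {cell: sums[cell] / counts[cell] for cell in sums}
    recall = {cell: (sums[cell] / total if total else 0.0) for cell in sums}
    return precision, recall


# Hypothetical data: two pairs in a high-similarity cell, two in a low-similarity cell.
toy_pairs = [(0.5, 0.5, 1.0), (0.5, 0.5, 0.5), (0.0, 0.0, 0.25), (0.0, 0.0, 0.25)]
precision, recall = semantic_precision_recall(toy_pairs)
```

In the toy data the cell (0.5, 0.5) has precision 0.75 and recall 0.75, while the cell (0.0, 0.0) has precision 0.25 and recall 0.25. The recall values sum to 1 across cells by construction.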

slide-18
SLIDE 18

Science

σc σl

log Recall Precision

slide-19
SLIDE 19

Adult

σc σl

log Recall Precision

slide-20
SLIDE 20

News

σc σl

log Recall Precision

slide-21
SLIDE 21

All pairs

σc σl

log Recall Precision

slide-22
SLIDE 22

Outline

  • Topical locality: Content, link, and semantic topologies
  • Implications for growth models and navigation
  • Applications
    – Topical Web crawlers
    – Distributed collaborative peer search

slide-23
SLIDE 23

Link probability vs lexical distance

$$r = \frac{1}{\sigma_c} - 1$$

$$\Pr(\lambda \mid \rho) = \frac{|\{(p,q) : r = \rho \wedge \sigma_l > \lambda\}|}{|\{(p,q) : r = \rho\}|}$$

slide-24
SLIDE 24

Link probability vs lexical distance

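$$r = \frac{1}{\sigma_c} - 1$$

$$\Pr(\lambda \mid \rho) = \frac{|\{(p,q) : r = \rho \wedge \sigma_l > \lambda\}|}{|\{(p,q) : r = \rho\}|}$$

Phase transition (at ρ*)
Power law tail: $$\Pr(\lambda \mid \rho) \sim \rho^{-\alpha(\lambda)}$$

Proc. Natl. Acad. Sci. USA 99(22): 14014-14019, 2002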

slide-25
SLIDE 25

Local content-based growth model

  • Similar to preferential attachment (BA)
  • Use degree info (popularity/importance) only for nearby (similar/related) pages

$$\Pr(p_t \to p_{i<t}) = \begin{cases} \dfrac{k(i)}{mt} & \text{if } r(p_i, p_t) < \rho^{*} \\ c\,[r(p_i, p_t)]^{-\alpha} & \text{otherwise} \end{cases}$$
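
The piecewise rule reads naturally as a function: degree matters only below the critical lexical distance ρ*, and a power law in distance takes over beyond it. This is a sketch only; the function name and every parameter value in the example are hypothetical.

```python
def link_probability(k_i, r, t, m, rho_star, c, alpha):
    """Probability that new page p_t links to existing page p_i.

    k_i: degree of p_i; r: lexical distance r(p_i, p_t); t: current time step;
    m: links added per new page; rho_star, c, alpha: model constants.
    """
    if r < rho_star:
        return k_i / (m * t)      # nearby pages: preferential attachment
    return c * r ** (-alpha)      # distant pages: power-law decay in distance


# Hypothetical values: one nearby popular page, one distant page.
near = link_probability(k_i=6, r=0.5, t=100, m=3, rho_star=1.0, c=0.01, alpha=2.0)
far = link_probability(k_i=6, r=2.0, t=100, m=3, rho_star=1.0, c=0.01, alpha=2.0)
```

With these made-up numbers the nearby page is linked with probability 0.02 and the distant one with probability 0.0025, regardless of how popular the distant page is.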

    

slide-26
SLIDE 26

So, many models can predict degree distributions...

Which is “right”?
Need an independent observation (other than degree) to validate models
Distribution of content similarity across linked pairs

slide-27
SLIDE 27

None of these models is right!

slide-28
SLIDE 28

The mixture model

$$\Pr(i) \propto \psi \cdot \frac{1}{t} + (1 - \psi) \cdot \frac{k(i)}{mt}$$

degree-uniform mixture

[Figure: panels a, b, c, each showing nodes i1, i2, i3 and new node t]

slide-29
SLIDE 29

The mixture model

Bias choice by content similarity instead of uniform distribution

$$\Pr(i) \propto \psi \cdot \frac{1}{t} + (1 - \psi) \cdot \frac{k(i)}{mt}$$

degree-uniform mixture

[Figure: panels a, b, c, each showing nodes i1, i2, i3 and new node t]

slide-30
SLIDE 30

Degree-similarity mixture model

$$\Pr(i) \propto \psi \cdot \hat{\Pr}(i) + (1 - \psi) \cdot \frac{k(i)}{mt}$$

slide-31
SLIDE 31

Degree-similarity mixture model

$$\Pr(i) \propto \psi \cdot \hat{\Pr}(i) + (1 - \psi) \cdot \frac{k(i)}{mt}$$

$$\hat{\Pr}(i) \propto [r(i, t)]^{-\alpha}$$

ψ = 0.2, α = 1.7
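
The degree-similarity mixture can be turned into a normalized distribution over candidate targets. This sketch uses the ψ and α values quoted on the slide as defaults. The function name, the toy inputs, and the use of the total degree as a stand-in for the m·t normalization are assumptions.

```python
def mixture_probabilities(degrees, distances, psi=0.2, alpha=1.7):
    """Pr(i) for a new node t choosing among existing nodes.

    degrees[i] is k(i); distances[i] is the lexical distance r(i, t) > 0.
    """
    total_degree = sum(degrees)  # stands in for m * t (assumption)
    sim_weights = [r ** (-alpha) for r in distances]
    sim_total = sum(sim_weights)
    probs = [psi * w / sim_total + (1 - psi) * k / total_degree
             for w, k in zip(sim_weights, degrees)]
    norm = sum(probs)
    return [p / norm for p in probs]


# Hypothetical candidates: node 0 is lexically closest, node 2 has the highest degree.
probs = mixture_probabilities(degrees=[1, 1, 2], distances=[1.0, 2.0, 4.0])
```

Setting `psi=0.0` recovers pure preferential attachment, and `psi=1.0` gives a choice driven by content similarity alone, so the two extremes of the mixture are easy to compare on the same candidates.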

slide-32
SLIDE 32

Both mixture models get the degree distribution right…

slide-33
SLIDE 33

…but the degree-similarity mixture model predicts the similarity distribution better

  • Proc. Natl. Acad. Sci. USA 101: 5261-5265, 2004
slide-34
SLIDE 34

[Figure: PNAS data; axes labeled c and l, with ticks from 0.25 to 1]

Citation networks

15,785 articles published in PNAS between 1997 and 2002

slide-35
SLIDE 35

Citation networks

slide-36
SLIDE 36

Citation networks

slide-37
SLIDE 37

Open Questions

Understand distribution of content similarity across all pairs of pages
Growth model to explain co-evolution of both link topology and content similarity
The role of search engines

slide-38
SLIDE 38

Efficient crawling algorithms?

Theory: since the Web is a small-world network, or has a scale-free degree distribution, short paths exist between any two pages:

~ log N (Barabasi & Albert 1999)
~ log N / log log N (Bollobas 2001)

slide-39
SLIDE 39

Efficient crawling algorithms?

Theory: since the Web is a small-world network, or has a scale-free degree distribution, short paths exist between any two pages:

~ log N (Barabasi & Albert 1999)
~ log N / log log N (Bollobas 2001)

Practice: can’t find them!

  • Greedy algorithms based on location in geographical small-world networks: ~ poly(N) (Kleinberg 2000)
  • Greedy algorithms based on degree in power-law networks: ~ N (Adamic, Huberman & al. 2001)

slide-40
SLIDE 40

Exception # 1

  • Geographical networks (Kleinberg 2000)
    – Local links to all lattice neighbors
    – Long-range link probability distribution: power law Pr ~ r^(–α)
      • r: lattice (Manhattan) distance
      • α: constant clustering exponent

$$t \sim \log^{2} N \iff \alpha = D$$
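
The geographical model can be simulated on a small grid to see greedy navigation at work. This is a sketch under simplifying assumptions: a 2-D lattice (so D = 2), exactly one long-range link per node, and a forwarding rule that hands the message to whichever neighbor is closest to the target. The function names and the grid size are illustrative.

```python
import random


def manhattan(a, b):
    """Lattice (Manhattan) distance between two grid points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def build_long_range_links(n, alpha, rng):
    """Give each node of an n x n grid one long-range link with Pr ~ r^(-alpha)."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    links = {}
    for u in nodes:
        others = [v for v in nodes if v != u]
        weights = [manhattan(u, v) ** (-alpha) for v in others]
        links[u] = rng.choices(others, weights=weights, k=1)[0]
    return links


def greedy_route(source, target, n, links):
    """Forward to the neighbor (local or long-range) closest to the target; count hops."""
    current, steps = source, 0
    while current != target:
        x, y = current
        neighbors = [(x + dx, y + dy)
                     for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if 0 <= x + dx < n and 0 <= y + dy < n]
        neighbors.append(links[current])
        current = min(neighbors, key=lambda v: manhattan(v, target))
        steps += 1
    return steps


rng = random.Random(0)
links = build_long_range_links(n=8, alpha=2.0, rng=rng)
hops = greedy_route((0, 0), (7, 7), 8, links)
```

Because a local lattice link always reduces the distance by one, greedy routing is guaranteed to terminate, and long-range links can only shorten the trip. Repeating the experiment for different α and grid sizes is one way to explore the scaling claim on the slide.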

slide-41
SLIDE 41

Is the Web a geographical network?

local links
long-range links (power law tail)
Replace lattice distance by lexical distance

r = (1 / σc) – 1

slide-42
SLIDE 42

Exception # 2

  • Hierarchical networks (Kleinberg 2002, Watts & al. 2002)
    – Nodes are classified at the leaves of tree
    – Link probability distribution: exponential tail Pr ~ e^(–h)
      • h: tree distance (height of lowest common ancestor)

[Figure: tree with examples h=1 and h=2]

$$t \sim \log^{\varepsilon} N,\ \varepsilon \ge 1$$

slide-43
SLIDE 43

exponential tail

Is the Web a hierarchical network?

Replace tree distance by semantic distance

h = 1 – σs

top

lca c1 c2

slide-44
SLIDE 44

Take home message: the Web is a “friendly” place!

slide-45
SLIDE 45

Outline

  • Topical locality: Content, link, and semantic topologies
  • Implications for growth models and navigation
  • Applications
    – Topical Web crawlers
    – Distributed collaborative peer search

slide-46
SLIDE 46

Crawler applications

  • Universal Crawlers
    – Search engines!
  • Topical crawlers
    – Live search (e.g., myspiders.informatics.indiana.edu)
    – Topical search engines & portals
    – Business intelligence (find competitors/partners)
    – Distributed, collaborative search

slide-47
SLIDE 47

spears

[sic]

Topical crawlers

slide-48
SLIDE 48

Evaluating topical crawlers

  • Goal: build “better” crawlers to support applications
  • Build an unbiased evaluation framework
    – Define common tasks of measurable difficulty
    – Identify topics, relevant targets
    – Identify appropriate performance measures
      • Effectiveness: quality of crawler pages, order, etc.
      • Efficiency: separate CPU & memory of crawler algorithms from bandwidth & common utilities

Information Retrieval 2005

slide-49
SLIDE 49

Evaluating topical crawlers: Topics

  • Automate evaluation using edited directories
  • Different sources of relevance assessments

Keywords Description Targets

slide-50
SLIDE 50

Evaluating topical crawlers: Tasks

Start from seeds, find targets and/or pages similar to target descriptions

[Figure: seeds at link distance d=2 and d=3 from targets]

slide-51
SLIDE 51

Examples of crawling algorithms

  • Breadth-First
    – Visit links in order encountered
  • Best-First
    – Priority queue sorted by similarity
    – Variants:
      – explore top N at a time
      – tag tree context
      – hub scores
  • SharkSearch
    – Priority queue sorted by combination of similarity, anchor text, similarity of parent, etc.
  • InfoSpiders
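
The Best-First idea in the list above can be sketched with a heap as the priority queue. This is a toy illustration rather than any of the evaluated crawlers: the Web is a hypothetical in-memory dictionary, each link is scored by the similarity of the page it was found on, and seeds get the top priority by assumption.

```python
import heapq


def best_first_crawl(seeds, get_links, similarity, max_pages):
    """Visit pages in order of decreasing priority; returns the visit order."""
    frontier = [(-1.0, url) for url in seeds]  # heapq is a min-heap, so negate scores
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        page_score = similarity(url)  # similarity of the fetched page to the topic
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-page_score, link))
    return visited


# Hypothetical toy Web: page "b" is on topic, page "a" is not.
TOY_LINKS = {"seed": ["a", "b"], "a": ["a1"], "b": ["b1"], "a1": [], "b1": []}
TOY_SIM = {"seed": 0.5, "a": 0.2, "b": 0.9, "a1": 0.1, "b1": 0.8}

order = best_first_crawl(["seed"], TOY_LINKS.__getitem__, TOY_SIM.__getitem__, max_pages=4)
```

On this toy graph the crawler follows the link out of the on-topic page "b" before the link out of "a", whereas a breadth-first crawler would visit "a1" before "b1".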
slide-52
SLIDE 52

slide-53
SLIDE 53

Exploration vs. Exploitation

Pages crawled Avg target recall

slide-54
SLIDE 54

Co-citation: hub scores

Link score_hub = linear combination of link score and hub score

slide-55
SLIDE 55

Recall (159 ODP topics)

Split ODP URLs between seeds and targets
Add 10 best hubs to seeds for 94 topics

[Figure: average target recall@N (%) vs. N (pages crawled) for Breadth-First, Naive Best-First, DOM, and Hub-Seeker]

ECDL 2003

slide-56
SLIDE 56

InfoSpiders

adaptive distributed algorithm using an evolving population of learning agents

slide-57
SLIDE 57

InfoSpiders

adaptive distributed algorithm using an evolving population of learning agents

keyword vector neural net local frontier

slide-58
SLIDE 58

InfoSpiders

adaptive distributed algorithm using an evolving population of learning agents

keyword vector neural net local frontier

offspring
slide-59
SLIDE 59
slide-60
SLIDE 60
slide-61
SLIDE 61
slide-62
SLIDE 62

Foreach agent thread:
    Pick & follow link from local frontier
    Evaluate new links, merge frontier
    Adjust link estimator
    E := E + payoff - cost
    If E < 0:
        Die
    Elsif E > Selection_Threshold:
        Clone offspring
        Split energy with offspring
        Split frontier with offspring
        Mutate offspring

Evolutionary Local Selection Algorithm (ELSA)

selective query expansion match resource bias reinforcement learning
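
The energy bookkeeping in the agent loop (death below zero, reproduction above a threshold) can be sketched on its own. This is a heavily simplified illustration, not the InfoSpiders implementation: link evaluation, the learned link estimator, and mutation are left out, agents are plain dictionaries, the frontier is assumed to be pre-sorted so the first link is followed, and all names and numbers are hypothetical.

```python
def elsa_generation(agents, payoff, cost, threshold):
    """Run one pass of the agent loop and return the next population."""
    next_population = []
    for agent in agents:
        if not agent["frontier"]:
            continue  # assumption: an agent with nothing left to follow retires
        link = agent["frontier"].pop(0)           # pick & follow link
        agent["energy"] += payoff(link) - cost    # E := E + payoff - cost
        if agent["energy"] < 0:
            continue                              # die
        next_population.append(agent)
        if agent["energy"] > threshold:           # clone offspring
            half = len(agent["frontier"]) // 2
            child = {"energy": agent["energy"] / 2,          # split energy
                     "frontier": agent["frontier"][half:]}   # split frontier
            agent["energy"] /= 2
            agent["frontier"] = agent["frontier"][:half]
            next_population.append(child)         # mutation omitted in this sketch
    return next_population


# Hypothetical population: one agent finds a rewarding page, the other does not.
population = [{"energy": 1.0, "frontier": ["good", "x", "y"]},
              {"energy": 0.2, "frontier": ["bad"]}]
population = elsa_generation(
    population, payoff=lambda link: 2.0 if link == "good" else 0.0,
    cost=0.5, threshold=2.0)
```

After one pass the unlucky agent has died and the lucky one has split into two agents with half the energy and half the frontier each, so the population size tracks where relevant pages are being found.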

slide-63
SLIDE 63

Action selection

slide-64
SLIDE 64

Q-learning

Compare estimated relevance of visited document with estimated relevance of link followed from previous page

Teaching input: $$E(D) + \mu \max_{l \in D} \lambda_l$$
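
The teaching input combines the relevance of the document just visited with the discounted best estimate among its outgoing links. A small sketch of that computation and the resulting error signal follows; the function names, the value of μ, and the example numbers are assumptions.

```python
def teaching_input(doc_relevance, link_estimates, mu=0.5):
    """E(D) + mu * max over links l in D of lambda_l (0 if D has no links)."""
    return doc_relevance + mu * max(link_estimates, default=0.0)


def estimate_error(previous_link_estimate, doc_relevance, link_estimates, mu=0.5):
    """Difference between the teaching input and the estimate of the link that led to D."""
    return teaching_input(doc_relevance, link_estimates, mu) - previous_link_estimate


# Hypothetical visit: the followed link was estimated at 0.5, the page scores 0.4,
# and its best outgoing link is estimated at 0.6.
target = teaching_input(0.4, [0.1, 0.6, 0.3])
error = estimate_error(0.5, 0.4, [0.1, 0.6, 0.3])
```

A positive error means the followed link was underestimated, so the link estimator would be nudged upward for similar links.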

slide-65
SLIDE 65

Performance

ACM Trans. Internet Technology 2003

Pages crawled Avg target recall

slide-66
SLIDE 66
slide-67
SLIDE 67
slide-68
SLIDE 68

http://sixearch.org

slide-69
SLIDE 69

http://sixearch.org

slide-70
SLIDE 70

http://sixearch.org

slide-71
SLIDE 71

6S: Collaborative Peer Search

WWW2004 WWW2005 WTAS2005 P2PIR2006 bookmarks local storage WWW

Index Crawler

Peer

slide-72
SLIDE 72

6S: Collaborative Peer Search

Data mining & referral opportunities

WWW2004 WWW2005 WTAS2005 P2PIR2006 bookmarks local storage WWW

Index Crawler

Peer

slide-73
SLIDE 73

6S: Collaborative Peer Search

Data mining & referral opportunities

WWW2004 WWW2005 WTAS2005 P2PIR2006 bookmarks local storage WWW

Index Crawler

Peer

slide-74
SLIDE 74

Reinforcement Learning

slide-75
SLIDE 75

Query Routing

slide-76
SLIDE 76

Simulating 500 Users

ODP (dmoz.org)

Maguitman, Menczer et al.: Algorithmic computation and approximation of semantic similarity. WWW2005, WWWJ 2006

slide-77
SLIDE 77

Simulating 500 Users

ODP (dmoz.org)

Maguitman, Menczer et al.: Algorithmic computation and approximation of semantic similarity. WWW2005, WWWJ 2006

slide-78
SLIDE 78

Simulating 500 Users

ODP (dmoz.org)

Maguitman, Menczer et al.: Algorithmic computation and approximation of semantic similarity. WWW2005, WWWJ 2006

slide-79
SLIDE 79

P@10

slide-80
SLIDE 80

[Figure: Precision vs. Recall for 6S, Centralized Search Engine, and Google]

Distributed vs Centralized

slide-81
SLIDE 81

[Figure: average change (%) in Cluster Coefficient and Diameter vs. Queries per peer]

Small-world

slide-82
SLIDE 82

Semantic Similarity

slide-83
SLIDE 83

Semantic Similarity

Arts/Movies/Filmmaking
Business/Arts_and_Entertainment/Fashion
Business/E-Commerce/Developers
Business/Telecommunications/Call_Centers
Computers/Programming/Graphics
Health/Conditions_and_Diseases/Cancer
Health/Mental_Health/Grief,_Loss_and_Bereavement
Health/Professions/Midwifery
Health/Reproductive_Health/Birth_Control
Home/Family/Pregnancy
Shopping/Clothing/Accessories
Shopping/Clothing/Footwear
Shopping/Clothing/Uniforms
Shopping/Sports/Cycling
Shopping/Visual_Arts/Artist_Created_Prints
Society/Issues/Abortion
Society/People/Women
Sports/Cycling/Racing

slide-84
SLIDE 84

Ongoing Work

  • Improve coverage/diversity in query routing algorithm
  • Spam protection: trust/reputation subsystem
  • User study with 6S application
slide-85
SLIDE 85

User study

slide-86
SLIDE 86

Query network

slide-87
SLIDE 87

Result network

slide-88
SLIDE 88

http://sixearch.org

Questions?

slide-89
SLIDE 89

Thank you! Questions?

informatics.indiana.edu/fil

Research supported by NSF CAREER Award IIS-0348940