Web Dynamics
Part 3 – Searching the Dynamic Web
3.1 Crawling and recrawling policies 3.2 Accessing the Hidden Web
Summer Term 2009 Web Dynamics 3‐1
Why crawling is difficult
Huge size of the Web (billions of pages)
High dynamics of …
[Figure: crawler architecture. Labels: Web, DNS, text indexer, content, URLs. The crawler is initialized with seed URLs.]
[Figure: URL frontier]
F front queues (1, 2, …, F)
Biased front queue selector and back queue router
B back queues (1, 2, …, B), with a unique set of hosts on each
Back queue selector, driven by a heap with entries (back queue, next access time)
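The front queue / back queue / heap structure can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the design of any particular crawler. The class name `Frontier`, the parameters `num_front`, `num_back`, and `delay`, and the strict-priority selector standing in for the "biased" selector are all hypothetical. For simplicity, the heap is keyed by host rather than by back queue number.

```python
import heapq
from collections import deque
from urllib.parse import urlparse


class Frontier:
    """Toy URL frontier: prioritised front queues, per-host back queues, a heap."""

    def __init__(self, num_front=3, num_back=2, delay=2.0):
        self.front = [deque() for _ in range(num_front)]  # index 0 = highest priority
        self.num_back = num_back
        self.back = {}       # host -> deque of URLs; one host per back queue
        self.heap = []       # entries: (next access time, host)
        self.ready = {}      # host -> earliest time of the next polite access
        self.delay = delay   # assumed politeness gap between hits on one host

    def add(self, url, priority):
        self.front[priority].append(url)

    def _pull(self):
        # Front queue selector: strict priority order (a simplification of "biased").
        for queue in self.front:
            if queue:
                return queue.popleft()
        return None

    def _fill_back_queues(self):
        # Back queue router: keep up to num_back hosts active.
        while len(self.back) < self.num_back:
            url = self._pull()
            if url is None:
                return
            host = urlparse(url).netloc
            if host in self.back:
                self.back[host].append(url)
            else:
                self.back[host] = deque([url])
                heapq.heappush(self.heap, (self.ready.get(host, 0.0), host))

    def next_url(self, now):
        """Return (url, fetch_time) for the host with the earliest next access time."""
        self._fill_back_queues()
        if not self.heap:
            return None
        ready_at, host = heapq.heappop(self.heap)
        fetch_time = max(now, ready_at)
        url = self.back[host].popleft()
        self.ready[host] = fetch_time + self.delay
        if self.back[host]:
            heapq.heappush(self.heap, (self.ready[host], host))
        else:
            del self.back[host]
        return url, fetch_time
```

Because each host has at most one heap entry, two URLs of the same host are never handed out closer together than `delay`, regardless of their priority.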
Cho & Garcia‐Molina, TODS 2003
$\lim_{n \to \infty} P\left[\,|\hat{\lambda}_i - \lambda_i| < \varepsilon\,\right] = 1$ for any positive $\varepsilon$
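The limit statement says that an estimate of a change rate converges in probability to the true rate as the number of observations n grows. A small simulation can make this tangible. The estimator used here (number of observed changes divided by total observed time, for a Poisson change process) is an assumption chosen for illustration and not necessarily the estimator the slide refers to. The function name `estimate_rate` is hypothetical.

```python
import random


def estimate_rate(true_rate, n, seed=0):
    """Estimate a Poisson change rate from n simulated gaps between changes."""
    rng = random.Random(seed)
    total_time = sum(rng.expovariate(true_rate) for _ in range(n))
    return n / total_time


# The estimate should settle near the true rate (2.0) as n grows.
for n in (10, 1000, 100000):
    print(n, round(estimate_rate(2.0, n), 3))
```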
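Probability that a specific document is β‐current at time t after the last fetch (β is the grace period):
$1$ if $t < \beta$
$e^{-\lambda (t - \beta)}$ if $t > \beta$ (the probability decays exponentially with the delay)
Probability that a specific document is β‐current in the interval [0;I] between two fetches:
$\frac{1}{I}\left(\int_0^{\beta} 1\,dt + \int_{\beta}^{I} e^{-\lambda (t - \beta)}\,dt\right) = \frac{\beta}{I} + \frac{1 - e^{-\lambda (I - \beta)}}{\lambda I}$
Now average over all documents (assuming distribution w(λ) for change rates):
$\int_0^{\infty} w(\lambda)\left(\frac{\beta}{I} + \frac{1 - e^{-\lambda (I - \beta)}}{\lambda I}\right) d\lambda$
(see paper: 1/λ is Weibull‐distributed)
[Figure: timeline with two fetches, a time t between them, and the grace period from t‐β to t]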
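The two cases (probability 1 inside the grace period, exponential decay afterwards) can be averaged over one refetch interval in closed form. The sketch below assumes a Poisson change process with rate `lam` and fetches at times 0 and `interval`. The function names are hypothetical. The numeric routine is only a sanity check of the closed form.

```python
import math


def prob_beta_current_at(t, lam, beta):
    """P[document is beta-current] t time units after the last fetch."""
    return 1.0 if t <= beta else math.exp(-lam * (t - beta))


def avg_beta_current(lam, beta, interval):
    """Closed-form average of prob_beta_current_at over [0, interval]."""
    if interval <= beta:
        return 1.0
    tail = (1.0 - math.exp(-lam * (interval - beta))) / lam
    return (beta + tail) / interval


def avg_beta_current_numeric(lam, beta, interval, steps=100000):
    """Midpoint-rule average, used only to cross-check the closed form."""
    dt = interval / steps
    total = sum(prob_beta_current_at((k + 0.5) * dt, lam, beta) for k in range(steps))
    return total / steps
```

A longer grace period or a shorter refetch interval both raise the average, which matches the intuition behind the formula.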
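Freshness of the collection at time t: $F(t) = \frac{1}{N} \sum_{i=1}^{N} F(p_i; t)$
Average freshness of a page over time: $\bar{F}(p) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(p; t)\,dt$
Average freshness of the collection: $\bar{F} = \frac{1}{N} \sum_{i=1}^{N} \bar{F}(p_i)$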
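As a concrete reading of these definitions, the sketch below computes F(p;t) from a log of synchronisation and change times, averages it over a collection, and approximates the time average by sampling. It assumes F(p;t) is 1 when the local copy is up to date and 0 otherwise, and treats a page that was never fetched as not fresh. The log layout (sorted lists of times) and all function names are hypothetical.

```python
from bisect import bisect_right


def is_fresh(sync_times, change_times, t):
    """F(p;t): 1 if no change happened since the last sync at or before t, else 0."""
    i = bisect_right(sync_times, t)
    if i == 0:
        return 0  # never fetched yet: treated as not fresh (assumption)
    last_sync = sync_times[i - 1]
    # Fresh iff no change time falls in (last_sync, t].
    return 1 if bisect_right(change_times, last_sync) == bisect_right(change_times, t) else 0


def collection_freshness(pages, t):
    """F(t): fraction of pages that are fresh at time t."""
    return sum(is_fresh(s, c, t) for s, c in pages) / len(pages)


def time_average(pages, horizon, steps=10000):
    """Approximate the time average of F(t) over [0, horizon] by midpoint sampling."""
    dt = horizon / steps
    return sum(collection_freshness(pages, (k + 0.5) * dt) for k in range(steps)) / steps


# Two pages, both synced at times 0 and 10; the first one changes at time 4.
pages = [([0.0, 10.0], [4.0]), ([0.0, 10.0], [])]
print(collection_freshness(pages, 5.0), time_average(pages, 20.0))
```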
[Figure: freshness F(p;t) and age A(p;t) of a page; Cho & Garcia‐Molina, TODS 2003]
Cho & Garcia‐Molina, VLDB 2000
Figure legend:
grey area: time when the crawler is active
solid line: F(t)
dotted line: average of F(t)
Theorem: The average freshness is the same for both crawlers if their load is the same.
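$E[F(p;t)] = 0 \cdot (1 - e^{-\lambda t}) + 1 \cdot e^{-\lambda t} = e^{-\lambda t}$
where $1 - e^{-\lambda t}$ = P[p changed at time t]
$E[A(p;t)] = \int_0^t (t - s)\,\lambda e^{-\lambda s}\,ds = t - \frac{1 - e^{-\lambda t}}{\lambda}$
where $\lambda e^{-\lambda s}\,ds$ = P[first change of p after time s]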
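The expected freshness and expected age of a page, t time units after its last synchronisation, can be checked numerically. The sketch assumes a Poisson change process with rate `lam` and defines age as the time since the first change after the sync (0 if no change yet). The function names are hypothetical, and the numeric integral serves only as a cross-check.

```python
import math


def expected_freshness(lam, t):
    """E[F(p;t)]: probability that no change occurred within t after the sync."""
    return math.exp(-lam * t)


def expected_age(lam, t):
    """E[A(p;t)] in closed form: t - (1 - exp(-lam * t)) / lam."""
    return t - (1.0 - math.exp(-lam * t)) / lam


def expected_age_numeric(lam, t, steps=100000):
    """Midpoint-rule integral of (t - s) * lam * exp(-lam * s) for s in [0, t]."""
    ds = t / steps
    total = 0.0
    for k in range(steps):
        s = (k + 0.5) * ds
        total += (t - s) * lam * math.exp(-lam * s) * ds
    return total
```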
Cho & Garcia‐Molina, TODS 2003
[Figure: plots of E[F(p;t)] and E[A(p;t)]]
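$\bar{F}(p) = \lim_{t \to \infty} \frac{1}{t} \int_0^t E[F(p;t)]\,dt = \frac{1}{I} \int_0^I E[F(p;t)]\,dt = \frac{1 - e^{-\lambda I}}{\lambda I} = \frac{1 - e^{-\lambda / f}}{\lambda / f}$
where I is the interval between two synchronisations and f = 1/I is the refresh frequency
[Figure: plot of E[F(p;t)]]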
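To get a feel for the closed form, the sketch below evaluates the time-averaged freshness of a single page for several refresh frequencies. It assumes a Poisson change rate `lam` and a page refreshed `f` times per time unit at regular intervals. The function name `avg_freshness` is hypothetical.

```python
import math


def avg_freshness(lam, f):
    """Time-averaged freshness (1 - exp(-lam/f)) / (lam/f) of one page."""
    r = lam / f
    return (1.0 - math.exp(-r)) / r


# Doubling the refresh frequency helps less and less.
for f in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(f, round(avg_freshness(1.0, f), 3))
```

The diminishing returns visible in the printed values are what makes the allocation of a limited refresh budget across pages a non-trivial optimisation problem.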
Cho & Garcia‐Molina, TODS 2003
assume update here
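Maximize the average freshness $\bar{F} = \frac{1}{N} \sum_{i=1}^{N} \bar{F}(\lambda_i, f_i)$ with $\bar{F}(\lambda_i, f_i) = \frac{1 - e^{-\lambda_i / f_i}}{\lambda_i / f_i}$, subject to a fixed total refresh budget $\sum_{i=1}^{N} f_i$.
At the optimum: $\frac{\partial \bar{F}(\lambda_i, f_i)}{\partial f_i} = y$ for all $i$ (the same multiplier $y$ for every page).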
Example with two pages (change rates 9 and 1), total budget f1+f2=2:
fsolve({diff((1-exp(-9/f1))/(9/f1),f1)=y,
        diff((1-exp(-1/f2))/(1/f2),f2)=y,
        f1+f2=2},
       {f1,f2,y}, {f1=0...2,f2=0...2,y=0...100});
{f1 = 0.2358671910, f2 = 1.764132809, y = 0.1111111111}

Same pages, total budget f1+f2=10:
fsolve({diff((1-exp(-9/f1))/(9/f1),f1)=y,
        diff((1-exp(-1/f2))/(1/f2),f2)=y,
        f1+f2=10},
       {f1,f2,y}, {f1=0...10,f2=0...10,y=0...100});
{f1 = 6.885783095, f2 = 3.114216905, y = 0.04174104014}
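The same two-page allocation can be reproduced without a computer algebra system. The sketch below does not solve the derivative equations. Instead it maximises the summed freshness directly over f1 with a ternary search, which works because the objective is concave in f1. The function names `avg_freshness` and `best_split` are hypothetical, and the approach is a swapped-in technique, not the method used on the slide.

```python
import math


def avg_freshness(lam, f):
    """Time-averaged freshness of a page with change rate lam and refresh rate f."""
    if f <= 0.0:
        return 0.0  # a page that is never refreshed has freshness 0 in the limit
    r = lam / f
    return (1.0 - math.exp(-r)) / r


def best_split(lam1, lam2, budget, iters=200):
    """Split a refresh budget between two pages to maximise total freshness."""
    def total(f1):
        return avg_freshness(lam1, f1) + avg_freshness(lam2, budget - f1)

    lo, hi = 0.0, budget
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if total(m1) < total(m2):
            lo = m1   # maximum lies to the right of m1
        else:
            hi = m2   # maximum lies to the left of m2
    f1 = (lo + hi) / 2.0
    return f1, budget - f1


print(best_split(9.0, 1.0, 2.0))
print(best_split(9.0, 1.0, 10.0))
```

With a small budget almost all refreshes go to the slowly changing page, while with a large budget the fast-changing page receives the larger share.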
Cho & Garcia‐Molina, TODS 2003
Cho & Garcia‐Molina, TODS 2003
Cho & Garcia‐Molina, TODS 2003
Olston & Pandey, WWW 2008
Olston & Pandey, WWW 2008
HS: freshness‐based ("holistic")
FS: staleness‐based ("fragment")
‐S: estimation of D* over all snapshots
‐D: estimation of D* for each snapshot
Local schema: (make, model, price)
Local schema: (make, model, price, miles)
Local schema: (marke, modell, version, endpreis, …)
Integrated schema + mappers: (make, model, version, price, …)
Structured query interface: make=Honda, model=Civic, …
Keyword query interface + mapper: honda, civic, …
Automated domain selection, query: honda civic
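The role of the mappers between local schemas and the integrated schema can be illustrated with plain dictionaries. Everything beyond the attribute names listed above is an assumption: the source names (`site_a`, `site_b`, `site_c`), the correspondences (for example `endpreis` to `price`), the decision to leave `miles` unmapped, and the sample values are hypothetical.

```python
INTEGRATED = ("make", "model", "version", "price")

# Hypothetical mappers: local attribute -> integrated attribute (None = unmapped).
MAPPERS = {
    "site_a": {"make": "make", "model": "model", "price": "price"},
    "site_b": {"make": "make", "model": "model", "price": "price", "miles": None},
    "site_c": {"marke": "make", "modell": "model", "version": "version", "endpreis": "price"},
}


def to_integrated(source, record):
    """Translate a record from a local schema into the integrated schema."""
    mapping = MAPPERS[source]
    out = dict.fromkeys(INTEGRATED)  # attributes the source lacks stay None
    for local_name, value in record.items():
        target = mapping.get(local_name)
        if target is not None:
            out[target] = value
    return out


def rewrite_query(source, query):
    """Translate a structured query on the integrated schema into local names."""
    reverse = {g: l for l, g in MAPPERS[source].items() if g is not None}
    return {reverse[k]: v for k, v in query.items() if k in reverse}


print(rewrite_query("site_c", {"make": "Honda", "model": "Civic"}))
```

Query conditions on attributes that a source does not offer are silently dropped in this sketch. A real mediator would have to decide whether to skip such sources or filter their results afterwards.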
http://www.autoscout24.de/List.aspx?vis=1&make=9&model=18683&...
Number of possible values per form input: ~2000 combinations; 100 values; ~100,000 values; 11 values
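A form that submits via GET can be crawled by generating one URL per combination of input values, but the counts above show why blind enumeration does not scale. The sketch below generates candidate URLs lazily and counts the combinations. The base URL `http://example.com/List.aspx`, the input names, and the assumption that the four counts can be multiplied freely are all hypothetical.

```python
from itertools import product
from math import prod
from urllib.parse import urlencode


def candidate_urls(base_url, inputs):
    """Yield one GET URL per combination of input values (lazy, possibly huge)."""
    names = list(inputs)
    for combo in product(*(inputs[name] for name in names)):
        yield base_url + "?" + urlencode(dict(zip(names, combo)))


def num_combinations(inputs):
    """Number of URLs candidate_urls would generate."""
    return prod(len(values) for values in inputs.values())


small_form = {"make": [9, 10], "model": [18683]}
for url in candidate_urls("http://example.com/List.aspx", small_form):
    print(url)

# If the four value counts were combined freely (an assumption):
print(prod([2000, 100, 100000, 11]))
```

Under that assumption the product is on the order of 10^11 URLs, so a crawler has to select informative inputs and values instead of enumerating all of them.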
Madhavan et al., VLDB 2008
Madhavan et al., VLDB 2008
Main references:
Cho & Garcia‐Molina, TODS 28(4), 390‐426, 2003
Additional references:
2008
3(3), 256‐290, 2003
901‐924, 2006
Science 34(6), 917‐831, 2008
WWW Conference, 2005
SIGMOD Conference, 2006