
Web Dynamics Part 3: Searching the Dynamic Web



  1. Web Dynamics Part 3: Searching the Dynamic Web
     3.1 Crawling and recrawling policies
     3.2 Accessing the Hidden Web
     Summer Term 2010

  2. Why crawling is difficult
     • Huge size of the Web (billions of pages)
     • High dynamics of the Web (page creations, updates, deletions)
     • High diversity in the Web (page importance, quality, formats, conformance to standards)
     • Huge amount of noise, malicious content (spam), and duplicate content (e.g., Wikipedia copies)

  3. Requirements for a Crawler
     • Robustness: resilience to (malicious or unintended) crawler traps
     • Politeness: respect servers’ policies for accessing pages (which pages and how frequently)
     • Quality: focus on downloading “important” pages
     • Freshness: make sure that crawled snapshots correspond to the current versions of pages
     • Scalability: cope with growing load by adding machines and bandwidth
     • Efficiency: make efficient use of system resources
     • Extensibility: make it possible to add new features (data formats, protocols)
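
The politeness requirement is commonly handled through a host's robots.txt file, although the slides do not prescribe a mechanism. The following is a minimal sketch using Python's standard `urllib.robotparser`; the user agent name `examplebot`, the host, and the rules are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "Which" pages may be accessed.
allowed_public = parser.can_fetch("examplebot", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("examplebot", "http://www.example.com/private/data.html")

# "How frequently": requested delay in seconds between accesses, or None if unspecified.
delay = parser.crawl_delay("examplebot")
```

A crawler would consult `can_fetch` before every download and use `delay` as a lower bound on the gap between two requests to the same host.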

  4. Basic Crawler Architecture
     [Figure: crawler architecture. The URL frontier / queue is initialized with seed URLs. Labeled components: Web, DNS, Fetch, Parse content, Page filter, Text indexer, Link filter, Duplicate URL elimination.]
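
The loop behind such an architecture can be sketched in a few lines. This is an illustration only: `crawl`, `TOY_WEB`, and the callbacks are hypothetical names, an in-memory dictionary stands in for the Web, and DNS resolution, page filtering, and indexing are reduced to a single list append.

```python
from collections import deque


def crawl(seeds, fetch, extract_links, max_pages=100):
    """Breadth-first crawl: frontier -> fetch -> parse -> link dedup -> frontier."""
    frontier = deque(seeds)   # URL frontier, initialized with the seed URLs
    seen = set(seeds)         # duplicate URL elimination
    indexed = []              # stands in for the text indexer
    while frontier and len(indexed) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:      # fetch failed or page filtered out
            continue
        indexed.append(url)
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return indexed


# Toy "Web": each page is represented by the list of links it contains.
TOY_WEB = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["a"]}
order = crawl(["a"], fetch=TOY_WEB.get, extract_links=lambda page: page)
```

A plain FIFO queue is the simplest possible frontier; it ignores both priorities and per-host politeness.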

  5. Crawler Types
     • Snapshot crawler: get at most one snapshot of each page (important for archiving)
     • Batch-mode crawler: revisit known pages periodically (collection is fixed)
     • Steady crawler: continuously revisit known pages (collection is fixed)
     • Incremental crawler: continuously revisit known pages and increase crawl quality by finding new good pages

  6. Queue design for snapshot crawlers
     Goals:
     • Allow for different crawl priorities, but provide fairness
     • Keep the crawler busy while being polite
     [Figure: a prioritizer feeds front queues 1, 2, …, F; a biased front queue selector and back queue router fill back queues 1, 2, …, B (URLs from a single host on each); a heap with entries (back queue, next access time) drives the back queue selector.]
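
A toy version of this two-level queue design might look as follows. It is a sketch under simplifying assumptions, not the design from the lecture: the class name `Frontier`, the linear priority weights, and the fixed `delay` per host are all made up, and time is passed in explicitly instead of being read from a clock.

```python
import heapq
import random
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Toy URL frontier: prioritized front queues, one back queue per host."""

    def __init__(self, num_front=3, num_back=2, delay=2.0):
        self.front = [deque() for _ in range(num_front)]  # index 0 = highest priority
        self.num_back = num_back  # maximum number of hosts served at once
        self.back = {}            # host -> deque of URLs of that host
        self.heap = []            # entries (next access time, host)
        self.ready = {}           # host -> earliest time of the next polite access
        self.delay = delay        # assumed minimum gap between accesses to a host

    def add(self, url, priority):
        """Prioritizer: append the URL to the front queue of its priority class."""
        self.front[priority].append(url)

    def _refill(self, now):
        """Biased selector and router: fill free back queues from the front queues."""
        while len(self.back) < self.num_back:
            nonempty = [i for i, q in enumerate(self.front) if q]
            if not nonempty:
                return
            # Prefer high priorities, but give every non-empty queue a chance.
            weights = [len(self.front) - i for i in nonempty]
            chosen = random.choices(nonempty, weights=weights)[0]
            url = self.front[chosen].popleft()
            host = urlsplit(url).netloc
            if host not in self.back:
                self.back[host] = deque()
                heapq.heappush(self.heap, (self.ready.get(host, now), host))
            self.back[host].append(url)

    def next_url(self, now):
        """Return (fetch time, url) for the host whose turn comes first, or None."""
        self._refill(now)
        if not self.heap:
            return None
        ready_at, host = heapq.heappop(self.heap)
        url = self.back[host].popleft()
        fetch_time = max(now, ready_at)
        self.ready[host] = fetch_time + self.delay
        if self.back[host]:
            heapq.heappush(self.heap, (self.ready[host], host))
        else:
            del self.back[host]  # frees a back queue for another host
        return fetch_time, url
```

The weighted random choice gives the "biased but fair" selection, while the heap ensures that the same host is never scheduled twice within `delay` time units.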

  7. Modeling page changes over time
     Observation: page changes can be modeled by a Poisson process with change rate $\lambda$.
     Probability of at least one change until $t$:
     \[ 1 - e^{-\lambda t} \]
     For the time $t$ until the next change: expectation $E[t] = 1/\lambda$, variance $\mathrm{var}[t] = 1/\lambda^2$.
     Note: change rates differ per page (and may also vary over time).
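
A quick simulation can make the model concrete: in a Poisson process the time until the next change is exponentially distributed, so sampled intervals should have mean close to 1/λ, and the fraction of intervals no longer than t should be close to 1 − e^(−λt). The rate, sample size, and seed below are arbitrary choices for illustration.

```python
import math
import random


def simulate_change_intervals(lam, n, seed=0):
    """Draw n times-until-next-change for a page with change rate lam."""
    rng = random.Random(seed)
    return [rng.expovariate(lam) for _ in range(n)]


lam = 2.0   # assumed change rate: two changes per time unit
t = 0.5
intervals = simulate_change_intervals(lam, 20000)

empirical_mean = sum(intervals) / len(intervals)                       # compare with 1 / lam
empirical_prob = sum(1 for x in intervals if x <= t) / len(intervals)  # change until t
analytic_prob = 1.0 - math.exp(-lam * t)
```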

  8. Poisson processes on real data
     Cho & Garcia-Molina, TODS 2003:
     • Daily crawl of 720,000 pages from 270 sites over approx. 4.5 months
     • Seeds: popular pages from a large Web crawl

  9. Change rate distributions on the Web
     [Figure: change rate distributions; Cho & Garcia-Molina, TODS 2003]

  10. Sampling change rates
      Goal: determine $\lambda_i$ for a fixed page $i$.
      Simple estimator: $n_i$ accesses with frequency $f_i$, so $T_i = (n_i - 1)/f_i$. For $X_i$ monitored updates in time $T_i$, estimate
      \[ \hat\lambda_i := \frac{X_i}{T_i} \]
      Question: is this a good estimator?
      • Is it unbiased, i.e., $E[\hat\lambda_i] = \lambda_i$? No.
      • Is it consistent, i.e., $\lim_{n_i \to \infty} P[|\hat\lambda_i - \lambda_i| < \varepsilon] = 1$ for any positive $\varepsilon$? No.
      Better estimator: $n_i$ accesses with frequency $f_i$, where the page was not changed $Y_i$ times:
      \[ \hat\lambda_i := -f_i \cdot \log \frac{Y_i + 0.5}{n_i + 0.5} \]
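
The weakness of the simple estimator shows up when a page changes more often than it is accessed: each access can detect at most one change, so the estimate saturates near the access frequency. The sketch below plugs the expected number of detections under the Poisson model into both estimators instead of running a random experiment; the function names and the numbers are illustrative assumptions.

```python
import math


def simple_estimate(x, n, f):
    """X_i / T_i with T_i = (n_i - 1) / f_i."""
    return x * f / (n - 1)


def improved_estimate(y, n, f):
    """-f_i * log((Y_i + 0.5) / (n_i + 0.5)); Y_i = accesses without a detected change."""
    return -f * math.log((y + 0.5) / (n + 0.5))


def expected_detections(lam, n, f):
    """Expected number of accesses that see a change: n * (1 - exp(-lam / f))."""
    return n * (1.0 - math.exp(-lam / f))


lam, f, n = 3.0, 1.0, 1000            # the page changes 3x as often as it is accessed
x = expected_detections(lam, n, f)    # detected changes, at most one per access
simple = simple_estimate(x, n, f)     # bounded by f * n / (n - 1), far below lam
improved = improved_estimate(n - x, n, f)  # close to lam
```

Increasing `n` does not help the simple estimator here, which is why it is not consistent; the logarithm in the improved estimator undoes the saturation.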

  11. Crawling the dynamic Web
      Challenges:
      • How do we model the “up-to-dateness” of our index?
      • How frequently do we recrawl?
        – On average, update each of N pages once within I time units (average update frequency f = 1/I)
      • How frequently do we schedule per-page revisits?
        – Uniformly vs. depending on the change rates
      • In which order do we revisit pages?
        – Fixed order vs. random order (recrawl) vs. purely random

  12. Measures for recency of the index (1)
      Definition: the index is (α,β)-current at time t if the probability that a randomly chosen page was up-to-date β time units ago is at least α.
      Question to answer: how frequently do we need to recrawl to guarantee that the index is (95%, 1 week)-current?
      Answer: every 18 days for an 800-million-page sample [Brewington and Cybenko 2000]

  13. Brewington & Cybenko Model
      [Figure: time axis with marks 0, t − β, t, and I; labels: fetch, fetch, grace period.]
      Probability that a specific document is β-current in the interval [0; I]:
      \[ \frac{1}{I}\int_0^{\beta} 1\,dt + \frac{1}{I}\int_{\beta}^{I} e^{-\lambda(t-\beta)}\,dt = \frac{\beta}{I} + \frac{1 - e^{-\lambda(I-\beta)}}{\lambda I} \]
      • If t < β, the probability is 1.
      • If t > β, the probability decays exponentially with the delay.
      Now average over all documents (assuming distribution w(λ) for the change rates):
      \[ \alpha = \int_0^{\infty} w(\lambda)\left[\frac{\beta}{I} + \frac{1 - e^{-\lambda(I-\beta)}}{\lambda I}\right] d\lambda \]
      (see paper: 1/λ is Weibull-distributed)
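
The per-document probability can be checked numerically. The sketch below evaluates the closed form β/I + (1 − e^(−λ(I−β)))/(λI) and compares it with a midpoint-rule average of the piecewise probability (1 for t < β, exponential decay afterwards); the function names and the example numbers are assumptions for illustration and are not taken from the paper.

```python
import math


def beta_current_probability(lam, interval, beta):
    """Closed form beta/I + (1 - exp(-lam * (I - beta))) / (lam * I), for 0 <= beta <= I."""
    return beta / interval + (1.0 - math.exp(-lam * (interval - beta))) / (lam * interval)


def beta_current_numeric(lam, interval, beta, steps=100000):
    """Midpoint-rule average of the per-time probability over [0, I]."""
    h = interval / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += 1.0 if t < beta else math.exp(-lam * (t - beta))
    return total * h / interval


# Hypothetical numbers: a page changing on average every 20 days, recrawled every
# 18 days, with a grace period of 7 days.
closed = beta_current_probability(lam=1 / 20, interval=18.0, beta=7.0)
numeric = beta_current_numeric(lam=1 / 20, interval=18.0, beta=7.0)
```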

  14. Measures for recency of the index (2)
      • Freshness F(p;t) of a page p at time t: 1 if p is up-to-date at time t, 0 otherwise
      • Age A(p;t) of a page p at time t: time since the earliest update of p that is not reflected in the index (0 if p is up-to-date)
      • Freshness F(t) of the index at time t:
        \[ F(t) = \frac{1}{N}\sum_{i=1}^{N} F(p_i; t) \]
      • Average freshness of a page p:
        \[ \bar F(p) = \lim_{t\to\infty} \frac{1}{t}\int_0^t F(p;\tau)\,d\tau \]
      • Average freshness of the index:
        \[ \bar F = \frac{1}{N}\sum_{i=1}^{N} \bar F(p_i) \]
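
For a single page, freshness and age at a time t can be computed from the times at which the page changed and the times at which the crawler synced it. The sketch below assumes that age is measured from the earliest change not yet reflected in the index; `freshness_and_age` and the example timelines are hypothetical.

```python
import bisect


def freshness_and_age(t, sync_times, change_times):
    """Return (F(p;t), A(p;t)) for sorted lists of sync times and change times."""
    pos = bisect.bisect_right(sync_times, t)
    if pos == 0:
        raise ValueError("page was never synced at or before t")
    last_sync = sync_times[pos - 1]
    # First change strictly after the last sync; a change at the sync instant
    # itself is treated as already reflected in the index.
    k = bisect.bisect_right(change_times, last_sync)
    if k < len(change_times) and change_times[k] <= t:
        return 0, t - change_times[k]   # stale since that change
    return 1, 0                         # up-to-date, age 0


syncs = [0, 10]
changes = [3, 5, 12]
timeline = [freshness_and_age(t, syncs, changes) for t in (2, 6, 10, 13)]
```

At t = 6 the page has changed twice since the sync at 0, but the age is counted from the first of these changes, at time 3.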

  15. Example: freshness and age for page p
      [Figure: F(p;t) and A(p;t); Cho & Garcia-Molina, TODS 2003]

  16. Freshness and age for different crawlers
      [Figure: Cho & Garcia-Molina, VLDB 2000. Grey area: time when the crawler is active; solid line: F(t); dotted line: average of F(t).]
      Theorem: average freshness is the same for both crawlers if the load is the same.

  17. Expected freshness and age of a page
      Assume for page p:
      – p changes with rate $\lambda$
      – p is synced at time 0
      Then
      – Expected freshness of p at time $t \ge 0$:
        \[ E[F(p;t)] = 0 \cdot (1 - e^{-\lambda t}) + 1 \cdot e^{-\lambda t} = e^{-\lambda t} \]
        where $1 - e^{-\lambda t}$ = P[p changed by time t]
      – Expected age of p at time $t \ge 0$:
        \[ E[A(p;t)] = \int_0^t (t - s)\,\lambda e^{-\lambda s}\,ds = t\left(1 - \frac{1 - e^{-\lambda t}}{\lambda t}\right) \]
        where $e^{-\lambda s}$ = P[first change of p after time s], so $\lambda e^{-\lambda s}$ is the density of the time of the first change
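
Both closed forms are easy to evaluate, and the age formula can be cross-checked against its defining integral. A small sketch, with a midpoint rule standing in for the integral and an arbitrary rate and time as example values:

```python
import math


def expected_freshness(lam, t):
    """E[F(p;t)] = exp(-lam * t) for a page synced at time 0."""
    return math.exp(-lam * t)


def expected_age(lam, t):
    """E[A(p;t)] = t * (1 - (1 - exp(-lam * t)) / (lam * t)), written without the division by t."""
    return t - (1.0 - math.exp(-lam * t)) / lam


def expected_age_numeric(lam, t, steps=100000):
    """Midpoint-rule value of the integral of (t - s) * lam * exp(-lam * s) over [0, t]."""
    h = t / steps
    total = 0.0
    for k in range(steps):
        s = (k + 0.5) * h
        total += (t - s) * lam * math.exp(-lam * s)
    return total * h


lam, t = 0.5, 4.0   # assumed: one change every 2 time units, looked at 4 units after sync
fresh = expected_freshness(lam, t)
age_closed = expected_age(lam, t)
age_numeric = expected_age_numeric(lam, t)
```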

  18. Expected freshness and age over time
      [Figure: E[F(p;t)] and E[A(p;t)]; Cho & Garcia-Molina, TODS 2003]

  19. Which avg. freshness can we achieve?
      Assume that
      • All pages change at the same rate $\lambda$
      • All pages are synced every $I$ time units (at rate $f = 1/I$)
      • Pages are always synced in a fixed order
      [Figure: E[F(p;t)]]
      Theorem:
      \[ \bar F(p) = \lim_{t\to\infty}\frac{1}{t}\int_0^t E[F(p;\tau)]\,d\tau = \frac{1}{I}\int_0^I E[F(p;t)]\,dt = \frac{1 - e^{-\lambda I}}{\lambda I} = \frac{1 - e^{-\lambda/f}}{\lambda/f} = \bar F \]
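
As a sanity check, the closed form (1 − e^(−λ/f))/(λ/f) can be compared with a numerical average of e^(−λt) over one sync interval. The function names below are illustrative. For λ = f, i.e., a page that is synced as often as it changes, the formula gives 1 − 1/e ≈ 0.632.

```python
import math


def average_freshness(lam, f):
    """(1 - exp(-r)) / r with r = lam / f, for fixed-order syncing at rate f."""
    r = lam / f
    return (1.0 - math.exp(-r)) / r


def average_freshness_numeric(lam, f, steps=100000):
    """Midpoint-rule average of E[F(p;t)] = exp(-lam * t) over one interval I = 1 / f."""
    interval = 1.0 / f
    h = interval / steps
    total = sum(math.exp(-lam * (k + 0.5) * h) for k in range(steps))
    return total * h / interval


equal_rates = average_freshness(lam=1.0, f=1.0)   # 1 - 1/e
fast_sync = average_freshness(lam=1.0, f=10.0)    # syncing 10x as often as the page changes
```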

  20. Are other orders better?
      • Random order: update all pages once, but in random order (e.g., by recrawling)
      • Purely random order: pick the page to update at random
      [Figure: Cho & Garcia-Molina, TODS 2003]

  21. Non-uniform update frequencies
      Now
      • page $p_i$ changes with rate $\lambda_i$
      • page $p_i$ is updated at fixed interval $I_i$ ($= 1/f_i$)
      Question: how are $f_i$ and $\lambda_i$ related?
      The simple answer $f_i \propto \lambda_i$ is wrong!

  22. Simple example: two pages, one update
      [Figure: timelines of $p_1$ and $p_2$ over 1 day, with the assumed update position marked.]
      Assume
      • $p_1$ changes once per interval (= 9 times/day)
      • $p_2$ changes once per day
      • the change time is uniformly distributed within each interval
      Now estimate the expected benefit of updating $p_2$ in the middle of the day:
      • with prob. ½ the change occurs later ⇒ benefit 0
      • with prob. ½ the change occurs before ⇒ benefit ½ ($p_2$ stays fresh for the remaining half day)
      • Expected benefit: ½ · ½ = ¼
      Similar computation for $p_1$ (update in the middle of any interval):
      • Expected benefit: ½ · 1/18 = 1/36
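
The arithmetic of this example can be reproduced with exact fractions. The helper `expected_benefit` is a hypothetical name; it encodes only the reasoning of the slide: the update pays off with probability ½, and then the page is fresh for the remaining half of its change interval.

```python
from fractions import Fraction


def expected_benefit(change_interval):
    """Expected fresh time (in days) gained by one update in the middle of an interval."""
    half = Fraction(1, 2)
    p_change_before_update = half              # change time is uniform in the interval
    fresh_time_gained = half * change_interval  # rest of the interval after the update
    return p_change_before_update * fresh_time_gained


benefit_p2 = expected_benefit(Fraction(1))      # p2: one change per day
benefit_p1 = expected_benefit(Fraction(1, 9))   # p1: nine changes per day
```

The single update is thus nine times more valuable on the slowly changing page.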

  23. Two pages, more updates
      Rules of thumb:
      • When the sync frequency ($f_1 + f_2$) is much smaller than the change frequency ($\lambda_1 + \lambda_2$), don't sync quickly changing pages
      • Even for $f_1 + f_2 \approx \lambda_1 + \lambda_2$, uniform (5:5) is better than proportional (9:1)
      Can we prove this?
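
Both rules of thumb can be explored numerically. The sketch below assumes that each page is synced at regular intervals, so that its average freshness is (1 − e^(−λ/f))/(λ/f), and that a page that is never synced has freshness 0. The grid search `best_split` and the rates 9 and 1 per day are illustrative choices.

```python
import math


def page_freshness(lam, f):
    """Average freshness of a page with change rate lam synced at regular rate f."""
    if f == 0:
        return 0.0   # never synced: stale in the long run
    r = lam / f
    return (1.0 - math.exp(-r)) / r


def index_freshness(lams, freqs):
    """Mean freshness over all pages."""
    return sum(page_freshness(l, f) for l, f in zip(lams, freqs)) / len(lams)


def best_split(lam1, lam2, total, steps=100):
    """Grid search for the sync rate f1 of page 1 (f2 = total - f1) maximizing freshness."""
    def value(k):
        f1 = total * k / steps
        return index_freshness((lam1, lam2), (f1, total - f1))
    return total * max(range(steps + 1), key=value) / steps


scarce = best_split(9.0, 1.0, total=1.0)                   # 1 sync/day vs. 10 changes/day
uniform = index_freshness((9.0, 1.0), (5.0, 5.0))          # about 0.69
proportional = index_freshness((9.0, 1.0), (9.0, 1.0))     # about 0.63
```

With only one sync per day available, the search assigns nothing to the fast page (`scarce` is 0.0), and with ten syncs per day the uniform split still beats the proportional one.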

  24. Proof (1)
      Notation/Definition:
      • $F(\lambda_i, f_i)$: average freshness of $p_i$ when $p_i$ changes with rate $\lambda_i$ and is updated with rate $f_i$
      • average change rate $\bar\lambda = \frac{1}{N}\sum_i \lambda_i$
      • a function $f(x)$ is convex if
        \[ \frac{1}{n}\sum_{i=1}^{n} f(x_i) \ge f\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) \]
      $F(\lambda_i, f_i)$ is convex in $\lambda_i$, independent of the sync strategy.

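  25. Proof (2)
      With uniform update frequency ($f_i = f$):
      \[ \bar F_u = \frac{1}{N}\sum_i F(\lambda_i, f_i) = \frac{1}{N}\sum_i F(\lambda_i, f) \]
      With proportional update frequency:
      \[ \bar F_p = \frac{1}{N}\sum_i F(p_i) = \frac{1}{N}\sum_i F(\bar\lambda, f) = F(\bar\lambda, f) \]
      Here, $F(p_i) = F(\lambda_i, f_i) = F(\bar\lambda, f)$ because $F(p_i)$ depends only on $r = \lambda_i/f_i$, which equals $\bar\lambda/f$ for every page under the proportional policy.
      Then, by convexity:
      \[ \bar F_u = \frac{1}{N}\sum_i F(\lambda_i, f) \ge F\!\left(\frac{1}{N}\sum_i \lambda_i, f\right) = F(\bar\lambda, f) = \bar F_p \]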
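
The inequality can be spot-checked for a concrete collection of pages. The sketch assumes the fixed-order freshness function F(λ, f) = (1 − e^(−λ/f))/(λ/f); the list of change rates is made up, and the average sync rate is set equal to the average change rate purely for illustration.

```python
import math


def freshness(lam, f):
    """F(lam, f) = (1 - exp(-r)) / r with r = lam / f (fixed-order syncing assumed)."""
    r = lam / f
    return (1.0 - math.exp(-r)) / r


rates = [0.1, 0.5, 1.0, 2.0, 10.0]      # hypothetical change rates lam_i
mean_rate = sum(rates) / len(rates)     # average change rate
f = mean_rate                           # assumed average sync rate per page

# Uniform policy: every page is synced at rate f.
f_uniform = sum(freshness(lam, f) for lam in rates) / len(rates)

# Proportional policy: f_i = lam_i * f / mean_rate, so lam_i / f_i is the same for all pages.
f_proportional = sum(freshness(lam, lam * f / mean_rate) for lam in rates) / len(rates)

# Midpoint convexity of F in lam for fixed f, checked on neighbouring pairs of rates.
convex_on_samples = all(
    (freshness(a, f) + freshness(b, f)) / 2 >= freshness((a + b) / 2, f)
    for a, b in zip(rates, rates[1:])
)
```

A numeric check of this kind illustrates the statement for one input; it does not replace the convexity argument.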
