Summer Term 2010 Web Dynamics 3-1
Web Dynamics
Part 3 – Searching the Dynamic Web
3.1 Crawling and recrawling policies
3.2 Accessing the Hidden Web
Why crawling is difficult
- Huge size of the Web (billions of pages)
- High dynamics of the Web (page creations,
updates, deletions)
- High diversity in the Web (page importance,
quality, formats, conformance to standards)
- Huge amount of noise, malicious content
(spam), duplicate content (Wikipedia copies)
Requirements for a Crawler
- Robustness: resilience to (malicious or unintended)
crawler traps
- Politeness: respect servers’ policies for accessing pages
(which pages & how frequently)
- Quality: focus on downloading “important” pages
- Freshness: make sure that crawled snapshots correspond
to current version of pages
- Scalability: cope with growing load by adding machines &
bandwidth
- Efficiency: make efficient use of system resources
- Extensibility: possible to add new features (data formats,
protocols)
Basic Crawler Architecture
[Figure: basic crawler architecture. The URL frontier/queue, initialized with seed URLs, feeds a fetch module that resolves hosts via DNS and downloads pages from the Web; a parse module passes text content to the indexer, while extracted URLs run through a page filter, a link filter, and duplicate elimination before re-entering the frontier.]
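The pipeline in this figure can be condensed into a short, runnable sketch. The in-memory `WEB` dictionary and the `crawl` helper are illustrative stand-ins, not real DNS/fetch/parse components:

```python
from collections import deque

# Toy in-memory "Web": URL -> (text content, outgoing links).
WEB = {
    "http://a/": ("seed page", ["http://b/", "http://c/"]),
    "http://b/": ("page b", ["http://a/"]),
    "http://c/": ("page c", []),
}

def crawl(seeds):
    """Minimal snapshot crawl: URL frontier + duplicate elimination."""
    frontier = deque(seeds)          # URL frontier, initialized with seeds
    seen = set(seeds)                # duplicate URL elimination
    index = {}                       # stands in for the text indexer
    while frontier:
        url = frontier.popleft()
        if url not in WEB:           # fetch failed (DNS error, 404, ...)
            continue
        content, links = WEB[url]    # "fetch" + "parse"
        index[url] = content         # hand text content to the indexer
        for link in links:           # link filter + duplicate elimination
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index

index = crawl(["http://a/"])
```

A real crawler adds the robustness, politeness, and scalability requirements from the previous slide on top of this skeleton.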
Crawler Types
- Snapshot crawler: get at most one snapshot of
each page (important for archiving)
- Batch-mode crawler: revisit known pages
periodically (collection is fixed)
- Steady crawler: continuously revisit known pages
(collection is fixed)
- Incremental crawler: continuously revisit known
pages and increase crawl quality by finding new good pages
Queue design for snapshot crawlers
Goals:
- Allow for different crawl priorities, but provide fairness
- Keep crawler busy while being polite
[Figure: two-stage URL frontier. A prioritizer distributes incoming URLs over F front queues (one per priority level); a biased front queue selector feeds the back queue router, which maintains B back queues (URLs from a single host on each); the back queue selector picks the next URL using a heap with entries (back queue, next allowed access time).]
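The back-queue half of this design can be sketched in Python. This simplified version omits the prioritizer and front queues and keeps only the per-host back queues plus the heap of next allowed access times; the 2-second politeness delay is an arbitrary assumption:

```python
import heapq
from collections import deque
from urllib.parse import urlparse

class Frontier:
    """Back-queue part of a Mercator-style frontier: one queue per host,
    a heap of (next allowed access time, host) enforcing politeness.
    (The prioritizer / front queues of the figure are omitted.)"""
    DELAY = 2.0   # seconds between requests to one host (assumption)

    def __init__(self):
        self.back = {}    # host -> deque of URLs from that host
        self.heap = []    # entries: (next allowed access time, host)

    def add(self, url, now=0.0):
        host = urlparse(url).netloc
        if host not in self.back:
            self.back[host] = deque()
            heapq.heappush(self.heap, (now, host))
        self.back[host].append(url)

    def next_url(self, now):
        """Return (seconds to wait, url) for the next fetch, or None."""
        while self.heap:
            t, host = heapq.heappop(self.heap)
            queue = self.back[host]
            if not queue:
                del self.back[host]   # host exhausted, forget it
                continue
            url = queue.popleft()
            heapq.heappush(self.heap, (max(t, now) + self.DELAY, host))
            return max(0.0, t - now), url
        return None
```

`next_url` both enforces the per-host delay and keeps the crawler busy: while one host is "cooling down", URLs of other hosts are served.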
Modeling page changes over time
Observation: Page changes can be modeled by a Poisson process with change rate λ.
Probability for at least one change until time t: P[X(t) ≥ 1] = 1 − e^(−λt)
Time between two changes: expectation E[t] = 1/λ, variance var[t] = 1/λ²
Note: change rates differ per page (and may vary over time)
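Under this model the time until the first change is exponentially distributed, so the probability 1 − e^(−λt) can be checked with a few lines of simulation; the parameter values are arbitrary:

```python
import math, random

def p_change_by(t, lam):
    """P[at least one change until time t] for a Poisson process with rate lam."""
    return 1.0 - math.exp(-lam * t)

# Monte Carlo check: inter-change times are Exp(lam)-distributed, so the
# first change falls before t with probability 1 - e^(-lam * t).
random.seed(42)
lam, t, n = 2.0, 0.5, 200_000
empirical = sum(random.expovariate(lam) <= t for _ in range(n)) / n
```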
Poisson processes on real data
Cho & Garcia-Molina, TODS 2003:
- Daily crawl of 720,000 pages from 270 sites over approx 4.5
months
- Seeds: popular pages from large Web crawl
Change rate distributions on the Web
Cho & Garcia-Molina, TODS 2003
Sampling change rates
Goal: determine λᵢ for a fixed page i
Simple estimator: monitor the page with nᵢ accesses at frequency fᵢ, i.e., over total time Tᵢ = (nᵢ − 1)/fᵢ.
For Xᵢ monitored updates in time Tᵢ, estimate λ̂ᵢ := Xᵢ / Tᵢ
Question: is this a good estimator?
- Is it unbiased, i.e., E[λ̂ᵢ] = λᵢ? No.
- Is it consistent, i.e., lim_{nᵢ→∞} P[|λ̂ᵢ − λᵢ| < ε] = 1 for any positive ε? No.
(multiple changes between two consecutive accesses are detected at most once, so λᵢ is systematically underestimated)
Better estimator: with nᵢ accesses at frequency fᵢ, where the page was found unchanged Yᵢ times:
  λ̂ᵢ := −fᵢ · log((Yᵢ + 0.5) / (nᵢ + 0.5))
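A small simulation (under the Poisson-change assumption above; for simplicity the interval count nᵢ − 1 is used in place of nᵢ in the corrected estimator) illustrates the bias of the naive estimator and how the log-based estimator removes it:

```python
import math, random

def simulate_estimators(lam, f, n_acc, runs, seed=1):
    """Average the naive and the improved change-rate estimator over
    many monitoring runs of one page (Poisson changes at rate lam,
    n_acc accesses at frequency f, i.e. n_acc - 1 intervals)."""
    rng = random.Random(seed)
    m = n_acc - 1                     # number of monitoring intervals
    dt = 1.0 / f
    T = m * dt                        # total monitoring time
    naive = better = 0.0
    for _ in range(runs):
        # X = intervals in which at least one change was detected;
        # P[no change within one interval] = e^(-lam * dt)
        X = sum(rng.random() > math.exp(-lam * dt) for _ in range(m))
        Y = m - X                     # accesses finding the page unchanged
        naive += (X / T) / runs
        better += -f * math.log((Y + 0.5) / (m + 0.5)) / runs
    return naive, better

# True rate 1.0, one access per time unit: the naive estimator converges
# to f * (1 - e^(-lam/f)) ~ 0.63, the improved one stays close to 1.0.
naive, better = simulate_estimators(lam=1.0, f=1.0, n_acc=51, runs=2000)
```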
Crawling the dynamic Web
Challenges:
- How do we model the „up-to-dateness“ of our index?
- How frequently do we recrawl?
  – on average, update each of N pages once within I time units (average update frequency f = 1/I)
- How frequently do we schedule per-page revisits?
  – uniformly vs. depending on the change rates
- In which order do we revisit pages?
  – fixed order vs. random order vs. purely random
Measures for recency of the index (1)
Definition: The index is (α,β)-current at time t if the probability that a random page was up-to-date β time units ago is at least α.
Question to answer: How frequently do we need to recrawl to guarantee being (95%, 1 week)-current?
Answer: every 18 days, for an 800 million page sample [Brewington and Cybenko 2000]
Brewington&Cybenko Model
Probability that a specific document with change rate λ is β-current when observed at a (uniformly chosen) time in the fetch interval [0; I]:

  P(λ, β) = (1/I) ∫₀^β 1 dt + (1/I) ∫_β^I e^(−λ(t−β)) dt = β/I + (1 − e^(−λ(I−β))) / (λ I)

- if the time t since the last fetch is < β, the probability of being β-current is 1 (grace period)
- if t > β, the probability decays exponentially with the delay t − β

Now average over all documents, assuming a distribution w(λ) for the change rates (see paper: 1/λ is Weibull-distributed):

  α = ∫₀^∞ w(λ) · ( β/I + (1 − e^(−λ(I−β))) / (λ I) ) dλ

[Figure: timeline with fetches at times 0 and I; a grace period of length β precedes the observation time t.]
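The closed form above can be cross-checked by integrating the piecewise integrand numerically; the parameter values below are arbitrary:

```python
import math

def beta_current_prob(lam, beta, I, steps=100_000):
    """P[document is beta-current at a random time in [0, I]],
    by midpoint-rule integration of the piecewise integrand."""
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * I / steps
        total += 1.0 if t < beta else math.exp(-lam * (t - beta))
    return total / steps

def beta_current_closed(lam, beta, I):
    """Closed form: beta/I + (1 - e^(-lam (I - beta))) / (lam I)."""
    return beta / I + (1 - math.exp(-lam * (I - beta))) / (lam * I)
```

For β = 0 the expression collapses to (1 − e^(−λI))/(λI), the average-freshness formula that reappears later in this part.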
Measures for recency of the index (2)
- Freshness F(p;t) of a page p at time t:
  1 if p is up-to-date at time t, 0 otherwise
- Age A(p;t) of a page p at time t:
  time elapsed since the first update of p that is not reflected in the index (0 if p is up-to-date)
- Freshness F(t) of the index at time t:
  F(t) = (1/N) Σᵢ₌₁^N F(pᵢ; t)
- Average freshness F(p) of a page p:
  F(p) = lim_{t→∞} (1/t) ∫₀^t F(p; τ) dτ
- Average freshness of the index:
  F = (1/N) Σᵢ₌₁^N F(pᵢ)
Example: freshness and age for page p
[Figure: example evolution of F(p;t) and A(p;t) for a page p; Cho & Garcia-Molina, TODS 2003]
Freshness and age for different crawlers
[Figure: grey area: time when the crawler is active; solid line: F(t); dotted line: time average of F(t); Cho & Garcia-Molina, VLDB 2000]
Theorem: Average freshness is the same for both crawlers if load is the same
Expected freshness and age of a page
Assume for page p:
– p changes with rate λ
– p is synced at time 0
Then:
– Expected freshness of p at time t≥0:
  E[F(p;t)] = 1 · P[no change until t] + 0 · P[at least one change until t] = e^(−λt)
– Expected age of p at time t≥0:
  E[A(p;t)] = ∫₀^t (t − s) · λe^(−λs) ds = t − (1 − e^(−λt)) / λ
  (λe^(−λs) ds is the probability that the first change of p occurs at time s; t − s is the age accumulated since then)
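Both formulas can be verified by Monte Carlo simulation of the first change after a sync; λ and t below are arbitrary:

```python
import math, random

def mc_freshness_age(lam, t, runs=200_000, seed=7):
    """Monte Carlo estimates of E[F(p;t)] and E[A(p;t)] for a page
    synced at time 0 that changes at rate lam."""
    rng = random.Random(seed)
    fresh = age = 0.0
    for _ in range(runs):
        s = rng.expovariate(lam)   # time of first change after the sync
        if s > t:
            fresh += 1.0           # no change yet: F = 1, A = 0
        else:
            age += t - s           # age since the first unreflected change
    return fresh / runs, age / runs

lam, t = 1.0, 2.0
ef, ea = mc_freshness_age(lam, t)
# analytic: E[F] = e^(-lam t), E[A] = t - (1 - e^(-lam t)) / lam
```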
Expected freshness and age over time
[Figure: E[F(p;t)] and E[A(p;t)] plotted over time; Cho & Garcia-Molina, TODS 2003]
Which avg. freshness can we achieve?
Assume that
- all pages change at the same rate λ
- all pages are synced every I time units (i.e., at rate f = 1/I)
- pages are always synced in a fixed order
Theorem:
  F(p) = lim_{t→∞} (1/t) ∫₀^t E[F(p;τ)] dτ = (1/I) ∫₀^I E[F(p;t)] dt = (1 − e^(−λI)) / (λI) = (1 − e^(−λ/f)) / (λ/f) = F
[Figure: E[F(p;t)] decaying between syncs and resetting to 1 at each sync]
Are other orders better?
- Random order:
update all pages once, but in random order (e.g., by recrawling)
- Purely random order:
pick page to update at random
Cho & Garcia-Molina, TODS 2003
Non-uniform update frequencies
Now
- page pi changes with rate λi
- page pi is updated at fixed interval Ii(=1/fi)
Question: How are fᵢ and λᵢ related? The simple answer fᵢ ∝ λᵢ is wrong!
Simple example: two pages, one update
Assume
- p1 changes once in each of 9 equal intervals per day (= 9 times/day)
- p2 changes once per day
- the change time is uniformly distributed within each interval
Now estimate the expected benefit (freshness gained, in days) of updating p2 in the middle of the day:
- with prob. ½ the change occurs later ⇒ benefit 0
- with prob. ½ the change occurs before ⇒ benefit ½ (p2 stays fresh for the remaining half day)
- expected benefit: ½ · ½ = ¼
Similar computation for p1 (update in the middle of any of its intervals):
- expected benefit: ½ · 1/18 = 1/36
⇒ updating p2 is far more beneficial than updating p1
[Figure: one day divided into 9 change intervals for p1 and a single interval for p2; the update is assumed to happen in the middle of the day]
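The same expected-benefit computation can be replayed by simulation, where the benefit is the freshness time gained until the end of the page's change interval:

```python
import random

def expected_benefit(update_at, interval, runs=200_000, seed=3):
    """Expected freshness gained by updating at time update_at when the
    page's single change is uniform in [0, interval]: if the change
    already happened, the page stays fresh until the interval ends."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        if rng.uniform(0.0, interval) < update_at:
            total += interval - update_at
    return total / runs

# p2: one change per day, update in the middle of the day -> 1/4
b2 = expected_benefit(0.5, 1.0)
# p1: one change per 1/9 day, update mid-interval -> 1/36
b1 = expected_benefit(1.0 / 18.0, 1.0 / 9.0)
```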
Two pages, more updates
Rules of thumb:
- When sync frequency (f1+f2) much smaller than change frequency
(λ1+λ2), don‘t sync quickly changing pages
- Even for f1+f2≈λ1+λ2, uniform (5:5) better than proportional (9:1)
Can we prove this?
Proof (1)
Notation/Definition:
- F(λᵢ, fᵢ): average freshness of pᵢ when pᵢ changes with rate λᵢ and is updated with rate fᵢ
- average change rate: λ̄ = (1/N) Σᵢ₌₁^N λᵢ
- a function f(x) is convex if (1/n) Σᵢ₌₁^n f(xᵢ) ≥ f((1/n) Σᵢ₌₁^n xᵢ)
- F(λᵢ, fᵢ) is convex in λᵢ, independent of the sync strategy
Proof (2)
With uniform update frequency (fᵢ = f):
  F_u = (1/N) Σᵢ F(λᵢ, f)
With proportional update frequency (fᵢ ∝ λᵢ, so λᵢ/fᵢ = λ̄/f):
  F_p = (1/N) Σᵢ F(pᵢ) = (1/N) Σᵢ F(λᵢ, fᵢ) = F(λ̄, f)
  here, F(pᵢ) = F(λᵢ, fᵢ) = F(λ̄, f) because F(pᵢ) depends only on the ratio r = λ/f
Then, by convexity of F in λ:
  F_u = (1/N) Σᵢ F(λᵢ, f) ≥ F((1/N) Σᵢ λᵢ, f) = F(λ̄, f) = F_p
Optimization Problem
Given change rates λᵢ (i=1..N), find update frequencies fᵢ (i=1..N) that maximize the average freshness

  F = (1/N) Σᵢ₌₁^N F(λᵢ, fᵢ)

under the constraint

  Σᵢ₌₁^N fᵢ = N·f and fᵢ ≥ 0 (i=1..N)

Using Lagrange multipliers, this transforms to

  ∂F(λᵢ, fᵢ)/∂fᵢ = μ (the same constant for all pages i)

which can be solved numerically (for fixed revisit order).
Solving for the two-page example
Using Maple (with F(λ,f) = (1 − e^(−λ/f))/(λ/f) and the Lagrange condition dF/dfᵢ = y for both pages):

Case λ1=9, λ2=1, f1+f2=2:
  fsolve({diff((1-exp(-9/f1))/(9/f1),f1)=y, diff((1-exp(-1/f2))/(1/f2),f2)=y,
          f1+f2=2}, {f1,f2,y}, {f1=0..2, f2=0..2, y=0..100});
  {f1 = 0.2358671910, f2 = 1.764132809, y = 0.1111111111}

Case λ1=9, λ2=1, f1+f2=10:
  fsolve({diff((1-exp(-9/f1))/(9/f1),f1)=y, diff((1-exp(-1/f2))/(1/f2),f2)=y,
          f1+f2=10}, {f1,f2,y}, {f1=0..10, f2=0..10, y=0..100});
  {f1 = 6.885783095, f2 = 3.114216905, y = 0.04174104014}

May require grouping of pages with similar change rates to be scalable.
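The Maple computation can be reproduced without Maple: for two pages the constrained problem is one-dimensional, so a plain grid search over f1 suffices (a sketch, not the numeric method used in the paper):

```python
import math

def avg_freshness(lam, f):
    """F(lam, f) = (1 - e^(-lam/f)) / (lam/f); limit 0 as f -> 0."""
    if f <= 0.0:
        return 0.0
    r = lam / f
    return (1.0 - math.exp(-r)) / r

def best_split(lam1, lam2, f_total, steps=20_000):
    """Grid search for the f1 maximizing F(lam1, f1) + F(lam2, f_total - f1)."""
    best_f1, best_val = 0.0, -1.0
    for k in range(steps + 1):
        f1 = f_total * k / steps
        val = avg_freshness(lam1, f1) + avg_freshness(lam2, f_total - f1)
        if val > best_val:
            best_f1, best_val = f1, val
    return best_f1

f1 = best_split(9.0, 1.0, 2.0)    # Maple: f1 ~ 0.236
f1b = best_split(9.0, 1.0, 10.0)  # Maple: f1 ~ 6.886
```

The search confirms the counterintuitive result: with a tight sync budget, the quickly changing page p1 gets almost none of it.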
Sync freq. as function of change freq.
Cho & Garcia-Molina, TODS 2003
Predicted Freshness/Age
(assuming 1 billion pages, sync interval=1 month, reasonable distribution of change rates)
Cho & Garcia-Molina, TODS 2003
Extension: Page Weights
Goal: Provide high freshness / low age for „important“ pages (importance measured, e.g., by click rates, PageRank, …)
Solution: Consider weighted freshness/age (leads to a similar optimization problem):

  F = Σᵢ₌₁^N wᵢ · F(pᵢ) / Σᵢ₌₁^N wᵢ
Example with Page Weights
Cho & Garcia-Molina, TODS 2003
Crawling to Improve Result Quality
So far: every page change is considered equal ⇒ often too conservative (advertisements, date/time fields, dynamic links, … trigger recrawls without improving search results)
Goal: Update a page only for important changes
Metric 1: Result Quality
Query results on the index should be identical to query results on the live Web
(see Pandey/Olston, WWW05, for details)
Metric 2: Information Longevity
Assumption: content that „lasts for a while“ is more important than content that is „transient“ (such as ads)
Example: lifespan of (word-level) shingles on two pages
Olston & Pandey, WWW 2008
Quantifying Freshness / Staleness
Notation:
– S(p) = set of shingles of page p
– p_t = version of page p at time t
Freshness of p at time t (relative to index time t_p):
  F(p; t_p; t) = |S(p_{t_p}) ∩ S(p_t)| / |S(p_{t_p}) ∪ S(p_t)|  (Jaccard coefficient)
Staleness of p at time t (relative to index time t_p):
  D(p; t_p; t) = 1 − |S(p_{t_p}) ∩ S(p_t)| / |S(p_{t_p}) ∪ S(p_t)|
Optimizing Staleness
Assume staleness never decreases over time ⇒ estimate D(p; t_p; t) by a monotone function D*_p(t − t_p).
Theorem: At each point in time t, refresh exactly those pages that have U_p(t − t_p) ≥ T, where

  U_p(t) = t · D*_p(t) − ∫₀^t D*_p(x) dx

This yields optimal staleness among all schedules that perform the same number of refreshes.
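The utility U_p can be evaluated numerically for any monotone staleness estimate; here with the arbitrary example D*(t) = 1 − e^(−t), for which U(t) = 1 − (1 + t)·e^(−t) in closed form:

```python
import math

def utility(d_star, t, steps=10_000):
    """U_p(t) = t * D*(t) - integral_0^t D*(x) dx (midpoint rule)."""
    integral = sum(d_star((k + 0.5) * t / steps)
                   for k in range(steps)) * t / steps
    return t * d_star(t) - integral

# Example staleness estimate: D*(t) = 1 - e^(-t);
# then U(t) = 1 - (1 + t) * e^(-t) analytically.
d_star = lambda x: 1.0 - math.exp(-x)
```

U grows with the time since the last refresh, so thresholding U_p(t − t_p) ≥ T eventually refreshes every page whose staleness estimate keeps rising.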
Intuition for Utility Function
Olston & Pandey, WWW 2008
Experiment: Staleness
HS: freshness-based („holistic“)
FS: staleness-based („fragment“)
- S: estimation of D* over all snapshots
- D: estimation of D* for each snapshot
Web Dynamics
Part 3 – Searching the Dynamic Web
3.1 Crawling and recrawling policies 3.2 Accessing the Hidden Web
Introducing the Hidden/Deep/Invisible Web
Data only accessible through Web forms
(estimate: ~100 petabytes from approx. 100 million sources, compared to ~200 terabytes for the „surface Web“)
Diversity of „Hidden Web“-Sources
„Hidden Web“-sources differ in
- amount of provided data
- quality/authority of provided data
- freshness of provided data
- covered application domain(s)
- richness of interface (text box, selection,…)
How can such dynamic sources be included in a search engine‘s index?
Approach 1: Meta Search Engines
Example: three car-sales sources
- Local schema: (make, model, price)
- Local schema: (make, model, price, miles)
- Local schema: (marke, modell, version, endpreis, …)
Integrated schema + mappers: (make, model, version, price, …)
- Structured query interface: make=Honda, model=Civic, …
- Keyword query interface + mapper: honda, civic, …
- Automated domain selection for query: honda civic
Pros and Cons
Good: information is always up-to-date
But:
- involves major manual work (mapper definitions), does not scale well
- requires complex domain detection („java“, „jaguar“ are ambiguous across domains)
- cannot easily cope with interface changes
- automated processes are error-prone
Approach 2: Surfacing
- Generate „reasonable“ inputs for each form
- Index result pages (identified by their URL) like static Web pages, e.g.
  http://www.autoscout24.de/List.aspx?vis=1&make=9&model=18683&...
- Follow outgoing links
Selecting a few good inputs is important
⇒ 220 billion input combinations possible, but only 650,000 cars in the database!
(remember from part 1: the Web is infinite!)
[Figure: car search form whose individual inputs offer on the order of 2000, 100, 100,000, and 11 possible values]
Finding „good“ inputs: templates
Simplifying model: a Web form with inputs X1…Xn provides access to a database D.
Problem: Find query templates T ⊆ {X1…Xn} and input value sets Vᵢ for each input Xᵢ ∈ T such that instantiating T, i.e., submitting all input value combinations to the form (leaving the Xₖ ∉ T blank),
- 1. yields good coverage of D (difficult to measure directly: how can we measure this?)
- 2. is efficient, i.e., does not return too many duplicates
(finding value sets is easy for selects, difficult for text fields)
Template Informativeness
Fix a signature function S(p) for Web pages (e.g., word-level shingles).
Approach: Measure the informativeness of results for template T with input set G(T) (all possible inputs) as the fraction of distinct result signatures:

  I(T) = |{ S(p) : p generated by T with input g ∈ G(T) }| / |G(T)|

Definition: Template T is informative iff I(T) ≥ τ
In practice: ignore T where |G(T)| > 10,000; consider only a sample of G(T) (up to 200 inputs per template)
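Given the result signatures for a sample of inputs, I(T) is just the fraction of distinct signatures; a minimal sketch with made-up signature values:

```python
def informativeness(signatures):
    """I(T): fraction of distinct result signatures among the results
    obtained for the submitted inputs of template T."""
    if not signatures:
        return 0.0
    return len(set(signatures)) / len(signatures)

# Hypothetical signatures for a template with |G(T)| = 4 inputs; two
# inputs lead to the same (e.g., empty-result) page.
sigs = ["sig-a", "sig-b", "sig-empty", "sig-empty"]
tau = 0.8   # example threshold
```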
Finding Informative Templates: ISIT
informative := ∅;                 // informative templates found so far
candidates := {{X1}, …, {Xn}};    // all single-input templates
while (candidates ≠ ∅) {
  for each T ∈ candidates:
    if (T not informative) remove T from candidates;
  informative := informative ∪ candidates;
  c := ∅;                         // set of candidates for the next step
  for each T ∈ candidates, each input Xi ∉ T:
    c := c ∪ {T ∪ {Xi}};
  candidates := c;
}
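The same search, written as runnable Python; the informativeness oracle here is a toy stand-in for probing the actual form as described on the previous slide:

```python
def isit(inputs, is_informative):
    """Sketch of ISIT: level-wise search over templates (sets of form
    inputs), extending only templates found informative.
    is_informative: oracle deciding I(T) >= tau for a template T."""
    informative = []
    candidates = [frozenset([x]) for x in inputs]
    while candidates:
        # keep only the informative candidates of this level
        candidates = [t for t in candidates if is_informative(t)]
        informative.extend(candidates)
        # extend each survivor by one additional input
        nxt = set()
        for t in candidates:
            for x in inputs:
                if x not in t:
                    nxt.add(t | {x})
        candidates = sorted(nxt, key=sorted)  # deterministic order
    return informative

# Toy oracle: only {make}, {model}, {make, model} are informative.
good = [frozenset({"make"}), frozenset({"model"}), frozenset({"make", "model"})]
found = isit(["make", "model", "price"], lambda t: t in good)
```

Because {price} is pruned at the first level, no template containing it is ever probed, which is exactly how ISIT limits the number of form submissions.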
Informative Templates: Experiments
ISIT efficiency: 500,000 (randomly chosen) HTML forms
ISIT effectiveness: 12 selected forms
Madhavan et al., VLDB 2008
Handling Text Inputs
Problem: no (small) set of input values is available for free-text inputs
Solution: iterative sampling
- 1. Pick the most important terms from the page containing the text input (using tf*idf), check informativeness
- 2. To create the next set of input values, pick the most important terms from all result pages generated so far (excluding terms appearing on all pages)
- 3. Stop when the set of input values is stable
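The term-selection core of steps 1–2 can be sketched with a plain tf*idf scorer; the example result pages are made up:

```python
import math
from collections import Counter

def top_terms(pages, k=3):
    """Pick the k highest tf*idf terms over a set of result pages,
    excluding terms that appear on every page (as in step 2)."""
    n = len(pages)
    docs = [page.lower().split() for page in pages]
    df = Counter()                      # document frequency per term
    for words in docs:
        df.update(set(words))
    scores = Counter()
    for words in docs:
        for term, count in Counter(words).items():
            if df[term] == n:           # appears on all pages: skip
                continue
            scores[term] += count * math.log(n / df[term])
    return [t for t, _ in scores.most_common(k)]

pages = ["honda civic for sale", "honda accord for sale",
         "toyota corolla for sale"]
terms = top_terms(pages, k=5)
```

Terms like "for" and "sale" appear on every result page and are discarded, while source-specific terms survive as candidate input values for the next probing round.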
Experimental Results: Text Inputs
Madhavan et al., VLDB 2008
Current challenges for Crawling/Indexing
- Non-HTML content (Flash, Silverlight)
  ⇒ single big „object“ that encapsulates content and outgoing links
- Dynamic pages where a program in the browser loads content directly from the server (AJAX)
  ⇒ only one URL for the whole application, inaccessible for standard crawlers (which execute no JavaScript!)
- Highly dynamic data (social networks, Twitter)
- Non-textual data (images, videos, …)
References
Main references:
- P. Baldi, P. Frasconi, P. Smyth: Modeling the Internet and the Web, chapter 6
- G. Pant et al.: Crawling the Web, Web Dynamics book, 153—178, 2004
- J. Bar-Ilan: Search engine ability to cope with the changing Web, Web Dynamics book, 195—218, 2004
- J. Cho, H. Garcia-Molina: Effective page refresh policies for Web crawlers, ACM Trans. Database Syst. 28(4), 390—426, 2003
Additional references:
- C.D. Manning et al.: Introduction to Information Retrieval, Chapter 20, Cambridge University Press,
2008
- J. Cho, H. Garcia-Molina: Estimating frequency of change, ACM Transactions on Internet Technology
3(3), 256—290, 2003
- I. Hellsten et al.: Multiple presents: how search engines rewrite the past, New Media & Society 8(6),
901—924, 2006
- D. Lewandowski: A three-year study on the freshness of web search engine databases, Journal of Information Science 34(6), 817—831, 2008
- R. Baeza-Yates et al.: Crawling a country: better strategies than breadth-first for Web page ordering,
WWW Conference, 2005
- R. Baeza-Yates et al.: Web structure, dynamics and page quality, SPIRE conference, 2002
- J. Cho, H. Garcia-Molina: Parallel crawlers, WWW Conference, 2002
- B.E. Brewington, G. Cybenko: How dynamic is the Web? Computer Networks 33(1-6), 257—276, 2000
- S. Pandey, C. Olston: User-Centric Web Crawling, WWW Conference, 2005.
- C. Olston, S. Pandey: Recrawl Scheduling Based on Information Longevity, WWW Conference, 2008
- J. Madhavan et al.: Google‘s Deep-Web Crawl, VLDB Conference, 2008
- P.G. Ipeirotis et al.: To Search or To Crawl? Towards a Query Optimizer for Text-Centric Crawls,
SIGMOD Conference, 2006.