What's Really New on the Web? Identifying New Pages from a Series of Unstable Web Snapshots
Masashi Toyoda and Masaru Kitsuregawa
IIS, University of Tokyo

Web as a projection of the world
• The Web now reflects various events in the real and virtual world
• The evolution of past topics can be tracked by observing the Web
• Identifying and tracking new information is important for observing new trends
  – Sociology, marketing, and survey research
[Figure: example events and media: war, tsunami, sports, computer virus; online news, weblogs, BBS]

Observing Trends on the Web (1/2) • Recall (Internet Archive) [Patterson 2003] – # pages including query keywords

Observing Trends on the Web (2/2) • WebRelievo [Toyoda 2005] – Evolution of link structure

Periodic Crawling for Observing Trends on the Web
[Figure: a crawler collects the WWW into an archive at times T1, T2, ..., TN; the archived snapshots are compared over time]

Difficulties in Periodic Crawling (1/2)
• Stable crawls miss new information
  – Crawling a fixed set of pages [Fetterly et al 2003]
    ↑ Can identify changes in the pages
    ↓ Overlook new pages
  – Crawling all the pages in a fixed set of sites [Ntoulas et al 2004]
    ↑ Can identify new pages in these sites
    ↓ Overlook new sites
    ↓ Possible only on a small subset of sites
• Massive crawls are necessary for discovering new pages and new sites

Difficulties in Periodic Crawling (2/2)
• Massive crawls make snapshots unstable
  – Cannot crawl the whole of the Web
    • # of uncrawled pages overwhelms # of crawled pages even after crawling 1B pages [Eiron et al 2004]
  – Novelty of a page crawled for the first time remains uncertain
    • The page might have existed at the previous time
    • “Last-Modified” time guarantees only that the page is older than that time

Our Contribution
• Propose a novelty measure for estimating the certainty that a newly crawled page is really new
  – New pages can be extracted from a series of unstable snapshots
• Evaluate the precision, recall, and miss rate of the novelty measure
• Apply the novelty measure to our Web archive search engine

Basic Ideas
• The novelty of a page p is the certainty that p appeared between t-1 and t
  – p appears when it can first be crawled and indexed
  – p is new when it is pointed to only by new links
  – If only new pages and links point to p, p may also be novel
• The novelty measure can be defined recursively and can be calculated in a similar way to PageRank [Brin and Page 1998]
• Reverse of the decay measure [Bar-Yossef et al 2004]
  – p is decayed if p points to dead or decayed pages

Novelty Measure
• N(p): the novelty of page p (0 to 1)
  – 1: The highest certainty that p is novel
  – 0: The novelty of p is totally unknown (this does not mean p is old)
• Pages in a snapshot W(t) are classified into old pages O(t) and unknown pages U(t)
• Each page p in U(t) is assigned N(p)
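The classification above can be sketched with plain set operations. This is a minimal illustration, not the authors' implementation: it assumes O(t) is the set of pages in W(t) that also appear in any earlier snapshot, and that L2(t) (used on later slides) is the set of pages crawled at both t-1 and t. The function name and the list-of-sets layout are hypothetical.

```python
def classify_snapshot(snapshots, t):
    """Split the pages crawled at index t into L2(t), O(t) - L2(t), and U(t).

    snapshots: list of sets of URLs ordered by crawl time (hypothetical layout).
    """
    if t < 1:
        raise ValueError("need at least one earlier snapshot")
    current = snapshots[t]
    seen_before = set().union(*snapshots[:t])      # every page crawled before t
    old = current & seen_before                    # O(t), under the stated assumption
    unknown = current - old                        # U(t): crawled for the first time
    crawled_last_two = current & snapshots[t - 1]  # L2(t): crawled at t-1 and t
    return crawled_last_two, old - crawled_last_two, unknown


snapshots = [
    {"a", "b", "c"},       # W(t-2)
    {"a", "d"},            # W(t-1)
    {"a", "b", "d", "e"},  # W(t)
]
l2, old_rest, unknown = classify_snapshot(snapshots, 2)
```

In this toy input, "b" was crawled two snapshots ago but missed at t-1, so it lands in O(t) - L2(t), while "e" has never been seen and lands in U(t).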

Old and Unknown Pages
[Figure: crawled pages W(t-1) and W(t); W(t) is divided into old pages O(t) and unknown pages U(t), whose existence at t-1 is uncertain]

How to Define Novelty Measure
• If all in-links come from pages crawled the last 2 times, L2(t), then p is judged new: N(p) ≈ 1

How to Define Novelty Measure
• If some in-links come from O(t) - L2(t), the novelty of p is lower, e.g. N(p) ≈ 0.75

How to Define Novelty Measure
• If some in-links come from U(t), the novelty of p depends on the still unknown novelty of those pages: N(p) = ?

How to Define Novelty Measure
• Determine the novelty measure recursively: if N(q) ≈ 0.5 (50% new), then N(p) ≈ (3 + 0.5) / 4
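As a worked instance of the recursive step, the figure can be read as p having four in-links: three from pages in L2(t), each counted as 1, and one from q. This reading and the omission of any damping are assumptions made for illustration.

```latex
N(p) \approx \frac{1 + 1 + 1 + N(q)}{4} = \frac{3 + 0.5}{4} = 0.875
```

Under the same reading, replacing q with a page from O(t) - L2(t), counted as 0, gives (3 + 0) / 4 = 0.75, which matches the earlier case.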

Definition of Novelty Measure
• δ: damping factor
  – probability that there were links to p before t-1
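The formula itself is not preserved in this transcript. A sketch consistent with the examples on the preceding slides is to set N(q) = 1 for pages in L2(t), N(q) = 0 for pages in O(t) - L2(t), and to update each page p in U(t) as N(p) = (1 - δ) × (mean of N(q) over the in-links q of p), iterating in the PageRank style. That update rule, the function name, and the dictionary layout are assumptions, not the authors' definition.

```python
def novelty_scores(in_links, crawled_last_two, old_rest, unknown,
                   delta=0.1, iterations=10):
    """Iteratively estimate N(p) for every page in U(t).

    in_links: dict mapping a page to the list of pages linking to it at time t.
    Assumed update rule: N(p) = (1 - delta) * mean of N(q) over in-links q.
    """
    score = {q: 1.0 for q in crawled_last_two}  # pages in L2(t)
    score.update({q: 0.0 for q in old_rest})    # pages in O(t) - L2(t)
    score.update({p: 0.0 for p in unknown})     # U(t) starts as "unknown"
    diffs = []
    for _ in range(iterations):
        updated = dict(score)
        for p in unknown:
            sources = in_links.get(p, [])
            if sources:
                # Link sources outside the crawl are treated as 0 (assumption).
                mean = sum(score.get(q, 0.0) for q in sources) / len(sources)
                updated[p] = (1.0 - delta) * mean
        # Total change in this iteration, usable as a convergence check.
        diffs.append(sum(abs(updated[p] - score[p]) for p in unknown))
        score = updated
    return {p: score[p] for p in unknown}, diffs


# q is linked from one L2(t) page and one O(t) - L2(t) page;
# p is linked from three L2(t) pages and from q.
in_links = {"q": ["a", "x"], "p": ["a", "b", "c", "q"]}
scores, diffs = novelty_scores(
    in_links, {"a", "b", "c"}, {"x"}, {"p", "q"}, delta=0.0
)
```

With δ = 0 this reproduces the slide example: N(q) settles at 0.5 and N(p) at (3 + 0.5) / 4. A positive δ scales every value down, so the largest attainable score becomes 1 - δ.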

Experiments
• Data set
• Convergence of calculation
• Distribution of the novelty measure
• Precision and recall
• Miss rate

Data Set
• A massively crawled Japanese web archive
  – Until 2002: .jp only
  – From 2003: Japanese pages in any domain
Time   Period       Crawled pages   Links
1999   Jul to Aug   17M             120M
2000   Jun to Aug   17M             112M
2001   Oct          40M             331M
2002   Feb          45M             375M
2003   Feb          66M             1058M
2003   Jul          97M             1589M
2004   Jan          81M             3452M
2004   May          96M             4505M
Time             Jul 2003   Jan 2004   May 2004
|L2(t)|          49M        61M        46M
|O(t) - L2(t)|   23M        14M        20M
|U(t)|           25M        6M         30M
|W(t)|           97M        81M        96M

Convergence of Calculation
• 10 iterations are sufficient for 0 < δ
[Figure: total difference from the previous iteration (0 to 3,000,000) against number of iterations (1 to 20), for delta = 0, 0.1, and 0.2]

Distributions of the Novelty Measure
• Have 2 peaks, at 0 and at MAX
  – cf. Power-law of in-link distribution
• Depend on the fraction of L2(t) and U(t)
• Do not change drastically with delta except for the maximum value
[Figure: three histograms (2003-07, 2004-01, 2004-05) of number of pages per novelty measure bin (=0, <=0.1, ..., <=1.0), each for delta = 0.0, 0.1, and 0.2]

Precision and Recall
• Given threshold θ, p is judged to be novel when θ < N(p)
  – Precision: #(correctly judged) / #(judged to be novel)
  – Recall: #(correctly judged) / #(all novel pages)
• Use URLs including dates as a golden set
  – Assume that each page appeared at the date included in its URL
  – E.g. http://foo.com/2004/05
  – Patterns: YYYYMM, YYYY/MM, YYYY-MM
URL date                     Jul 2003         Jan 2004         May 2004
With old date (before t-1)   299,591 (33%)    87,878 (24%)     402,365 (33%)
With new date (t-1 to t)     593,317 (65%)    270,355 (74%)    776,360 (64%)
With future date (after t)   24,286 (2%)      7,679 (2%)       36,476 (3%)
Total                        917,194 (100%)   365,912 (100%)   1,215,201 (100%)
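The golden-set construction can be sketched as a regular-expression match on the URL followed by a comparison with the two crawl dates. This is only an illustration: the regular expression, the month-level comparison (a URL dated in the month of t-1 counts as new), and the function names are assumptions, not the rules used in the experiments.

```python
import re
from datetime import date

# Year-month patterns: YYYYMM, YYYY/MM, YYYY-MM, not embedded in a longer digit run.
DATE_IN_URL = re.compile(r"(?<!\d)((?:19|20)\d{2})[/-]?(0[1-9]|1[0-2])(?!\d)")


def url_month(url):
    """Return the first day of the month embedded in the URL, or None."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    return date(int(match.group(1)), int(match.group(2)), 1)


def label_url(url, previous_crawl, current_crawl):
    """Label a URL 'old', 'new', or 'future' relative to the crawls at t-1 and t."""
    month = url_month(url)
    if month is None:
        return None
    if month < previous_crawl.replace(day=1):
        return "old"      # dated before t-1
    if month > current_crawl.replace(day=1):
        return "future"   # dated after t
    return "new"          # dated between t-1 and t


t_prev, t_now = date(2004, 1, 15), date(2004, 5, 20)  # hypothetical crawl dates
labels = [
    label_url("http://foo.com/2004/05", t_prev, t_now),
    label_url("http://foo.com/news-2003-07.html", t_prev, t_now),
    label_url("http://foo.com/200412/index.html", t_prev, t_now),
    label_url("http://foo.com/about", t_prev, t_now),
]
```

URLs without a recognizable year-month are returned as None and stay outside the golden set.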

Precision and Recall (1/2)
• A positive threshold θ gives 80% to 90% precision in all snapshots
• Precision jumps from the baseline when θ becomes positive, then gradually increases
• Positive delta values give slightly better precision
[Figure: precision against the minimum novelty measure threshold (0 to 1) for 2003-07, 2004-01, and 2004-05, each for delta = 0.0, 0.1, and 0.2]

Precision and Recall (2/2)
• Recall drops according to the distribution of the novelty measure
• Positive delta values decrease the recall
[Figure: precision and recall against the minimum novelty measure threshold (0 to 1) for 2003-07, 2004-01, and 2004-05, each for delta = 0.0, 0.1, and 0.2]
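The precision and recall definitions can be written out as a small helper for sweeping the threshold. This is a sketch under stated assumptions: the function name, the dictionary of scores, and the toy values are hypothetical, and the golden set is taken to be the dated URLs whose date falls between t-1 and t.

```python
def precision_recall(novelty, truly_new, threshold):
    """Precision and recall when pages with threshold < N(p) are judged novel.

    novelty: dict mapping each golden-set page to its N(p).
    truly_new: set of golden-set pages dated between t-1 and t.
    """
    judged = {p for p, n in novelty.items() if n > threshold}
    correct = judged & truly_new
    precision = len(correct) / len(judged) if judged else 0.0
    recall = len(correct) / len(truly_new) if truly_new else 0.0
    return precision, recall


# Toy scores, not data from the experiments.
novelty = {"a": 0.9, "b": 0.6, "c": 0.0, "d": 0.3, "e": 0.0}
truly_new = {"a", "b", "e"}
p_zero, r_zero = precision_recall(novelty, truly_new, 0.0)
p_half, r_half = precision_recall(novelty, truly_new, 0.5)
```

In the toy data, raising the threshold from 0 to 0.5 drops the one wrongly judged page "d", so precision rises while recall stays the same; a new page scored 0, such as "e", can never be recalled with a non-negative threshold.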

Guideline for Selecting Parameters
• When higher precision is required
  – 0 < δ < 0.2
  – Higher threshold θ
• When higher recall is required
  – δ = 0
  – Small positive threshold θ

Miss Rate
• Fraction of pages misjudged to be novel
  – Use a set of old pages as a golden set
    • Last-Modified time < t-1
  – Check how many pages are assigned positive N values
Time       # old pages in U(t)   |U(t)|
Jul 2003   4.8M                  25M
Jan 2004   0.17M                 6M
May 2004   3.8M                  30M
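The miss rate check can be sketched in a few lines. The function name, the optional threshold argument, and the toy values are assumptions; the set of old pages is taken to be the pages in U(t) whose Last-Modified time is earlier than t-1.

```python
def miss_rate(novelty, old_pages, threshold=0.0):
    """Fraction of known-old pages in U(t) whose N(p) exceeds the threshold."""
    if not old_pages:
        return 0.0
    missed = sum(1 for p in old_pages if novelty.get(p, 0.0) > threshold)
    return missed / len(old_pages)


# Toy values, not data from the experiments.
novelty = {"u1": 0.0, "u2": 0.4, "u3": 0.0, "u4": 0.9}
old_pages = {"u1", "u2", "u3"}  # Last-Modified time earlier than t-1
rate = miss_rate(novelty, old_pages)
```

Here one of the three known-old pages received a positive score, so the rate is one third.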
