

  1. What's Really New on the Web? Identifying New Pages from a Series of Unstable Web Snapshots Masashi Toyoda and Masaru Kitsuregawa IIS, University of Tokyo

  2. Web as a projection of the world
  • The Web now reflects various events in the real and virtual worlds
  • The evolution of past topics can be tracked by observing the Web
  • Identifying and tracking new information is important for observing new trends
    – Sociology, marketing, and survey research
  • Examples: war → online news, tsunami → weblogs, sports → BBS, computer viruses

  3. Observing Trends on the Web (1/2) • Recall (Internet Archive) [Patterson 2003] – # pages including query keywords

  4. Observing Trends on the Web (2/2) • WebRelievo [Toyoda 2005] – Evolution of link structure

  5. Periodic Crawling for Observing Trends on the Web
  [Diagram: a crawler periodically takes snapshots of the WWW at times T1, T2, …, TN, stores each snapshot in an archive, and compares consecutive snapshots]

  6. Difficulties in Periodic Crawling (1/2)
  • Stable crawls miss new information
    – Crawling a fixed set of pages [Fetterly et al 2003]
      ↑ Can identify changes in the pages
      ↓ Overlooks new pages
    – Crawling all the pages in a fixed set of sites [Ntoulas et al 2004]
      ↑ Can identify new pages in these sites
      ↓ Overlooks new sites
      ↓ Possible only on a small subset of sites
  • Massive crawls are necessary for discovering new pages and new sites

  7. Difficulties in Periodic Crawling (2/2)
  • Massive crawls make snapshots unstable
    – Cannot crawl the whole of the Web
      • The number of uncrawled pages overwhelms the number of crawled pages even after crawling 1B pages [Eiron et al 2004]
    – The novelty of a page crawled for the first time remains uncertain
      • The page might have existed at the previous crawl time
      • The "Last-Modified" time guarantees only that the page is older than that time

  8. Our Contribution
  • Propose a novelty measure for estimating the certainty that a newly crawled page is really new
    – New pages can be extracted from a series of unstable snapshots
  • Evaluate the precision, recall, and miss rate of the novelty measure
  • Apply the novelty measure to our Web archive search engine

  9. Basic Ideas
  • The novelty of a page p is the certainty that p appeared between t-1 and t
    – p appears when it can first be crawled and indexed
    – p is new when it is pointed to only by new links
    – If only new pages and links point to p, p may also be novel
  • The novelty measure can be defined recursively and calculated in a similar way to PageRank [Brin and Page 1998]
  • It is the reverse of the decay measure [Bar-Yossef et al 2004]
    – p is decayed if p points to dead or decayed pages

  10. Novelty Measure
  • N(p): the novelty of page p (0 ≤ N(p) ≤ 1)
    – 1: the highest certainty that p is novel
    – 0: the novelty of p is totally unknown (it does not mean p is old)
  • Pages in a snapshot W(t) are classified into old pages O(t) and unknown pages U(t)
  • Each page p in U(t) is assigned N(p)

  11. Old and Unknown Pages
  [Diagram: the crawled pages W(t) are partitioned into old pages O(t), which were also crawled in W(t-1), and unknown pages U(t), which were not]

  12. How to Define Novelty Measure
  • If all in-links come from pages crawled both of the last 2 times (L2(t)): N(p) = 1
  [Diagram: every in-link of p at time t originates from L2(t), so p is certainly new]

  13. How to Define Novelty Measure
  • If some in-links come from O(t) - L2(t), those links contribute no novelty
  [Diagram: 3 of p's 4 in-links come from L2(t) and one from a page q in O(t) - L2(t), giving N(p) = 0.75]

  14. How to Define Novelty Measure
  • If some in-links come from U(t), their contribution is itself uncertain
  [Diagram: one of p's in-links comes from an unknown page q, so N(p) = ?]

  15. How to Define Novelty Measure
  • Determine the novelty measure recursively
  [Diagram: p has 4 in-links; 3 come from L2(t) and one from an unknown page q with N(q) = 0.5, giving N(p) = (3 + 0.5) / 4]
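The update rule illustrated in slides 12–15 can be sketched in Python (a minimal sketch with hypothetical names; the damping factor δ of slide 16 is ignored here):

```python
def novelty(in_links, L2, O_minus_L2, N):
    """One update of the novelty of a page from its in-links (delta = 0).

    Each in-link source q contributes:
      1     if q was crawled both of the last 2 times (q in L2(t)),
      0     if q is old but outside L2(t) (q in O(t) - L2(t)),
      N(q)  if q is itself unknown (q in U(t)).
    """
    total = 0.0
    for q in in_links:
        if q in L2:
            total += 1.0              # the link to the page is provably new
        elif q in O_minus_L2:
            total += 0.0              # the link may have existed before t-1
        else:
            total += N.get(q, 0.0)    # unknown source: use its own novelty
    return total / len(in_links)

# Slide 15's example: three in-links from L2(t) and one from an
# unknown page q with N(q) = 0.5, so N(p) = (3 + 0.5) / 4 = 0.875
n_p = novelty(["a", "b", "c", "q"], L2={"a", "b", "c"}, O_minus_L2=set(), N={"q": 0.5})
```

Averaging over all in-links means a single stale link cannot dominate the score when most links are demonstrably new.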

  16. Definition of Novelty Measure
  • δ: damping factor
    – the probability that there were links to p before t-1
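The formula on this slide did not survive extraction; from slides 12–15 and the description of δ, it presumably has the following form (a reconstruction, not the paper's exact notation):

```latex
N(p) = \frac{1-\delta}{|\mathrm{in}(p)|} \sum_{q \in \mathrm{in}(p)} w(q),
\qquad
w(q) =
\begin{cases}
1    & \text{if } q \in L_2(t) \\
0    & \text{if } q \in O(t) \setminus L_2(t) \\
N(q) & \text{if } q \in U(t)
\end{cases}
```

With δ = 0 this reproduces slide 15's example, N(p) = (3 + 0.5)/4; a positive δ uniformly discounts every page's score by the chance that its in-links already existed before t-1.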

  17. Experiments • Data set • Convergence of calculation • Distribution of the novelty measure • Precision and recall • Miss rate

  18. Data Set
  • A massively crawled Japanese web archive
    – up to 2002: .jp domain only
    – from 2003: Japanese pages in any domain

  Time          Crawled pages  Links
  Jul–Aug 1999  17M            120M
  Jun–Aug 2000  17M            112M
  Oct 2001      40M            331M
  Feb 2002      45M            375M
  Feb 2003      66M            1058M
  Jul 2003      97M            1589M
  Jan 2004      81M            3452M
  May 2004      96M            4505M

                   Jul 2003  Jan 2004  May 2004
  |L2(t)|          49M       61M       46M
  |O(t) - L2(t)|   23M       14M       20M
  |U(t)|           25M       6M        30M
  |W(t)|           97M       81M       96M

  19. Convergence of Calculation
  • 10 iterations are sufficient (for δ = 0, 0.1, 0.2)
  [Chart: total difference from the previous iteration vs. number of iterations (1–20), for δ = 0, 0.1, 0.2]
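The fixed-point iteration whose convergence this chart tracks might look like the following sketch (the graph representation and δ handling are illustrative assumptions, not the authors' implementation):

```python
def compute_novelty(in_links, L2, O_minus_L2, delta=0.0, iters=10):
    """Iteratively compute N(p) for every unknown page, PageRank-style.

    `in_links` maps each unknown page to the list of its in-link sources;
    sources outside L2(t) and O(t) - L2(t) are treated as unknown (U(t)).
    """
    N = {p: 0.0 for p in in_links}      # start from "novelty totally unknown"
    for _ in range(iters):              # ~10 iterations suffice per slide 19
        diff = 0.0                      # total change, as plotted on the chart
        for p, links in in_links.items():
            total = sum(1.0 if q in L2 else
                        0.0 if q in O_minus_L2 else
                        N.get(q, 0.0)
                        for q in links)
            new = (1.0 - delta) * total / len(links) if links else 0.0
            diff += abs(new - N[p])
            N[p] = new
        if diff < 1e-9:
            break
    return N

# Two unknown pages pointing at each other, each also linked from L2(t);
# both scores converge toward 1.0.
N = compute_novelty({"p": ["a", "q"], "q": ["b", "p"]},
                    L2={"a", "b"}, O_minus_L2=set())
```

Because scores only propagate between unknown pages, the dependency chains are short and the iteration settles much faster than PageRank on the full graph.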

  20. Distributions of the Novelty Measure
  • The distributions have 2 peaks, at 0 and at the maximum value
    – cf. the power-law in-link distribution
  • They depend on the fractions of L2(t) and U(t)
  • They do not change drastically with δ, except for the maximum value
  [Charts: number of pages per novelty-measure bucket (=0, ≤0.1, …, ≤1.0) for the 2003-07, 2004-01, and 2004-05 snapshots, each with δ = 0, 0.1, 0.2]

  21. Precision and Recall
  • Given a threshold θ, p is judged to be novel when θ < N(p)
    – Precision: #(correctly judged) / #(judged to be novel)
    – Recall: #(correctly judged) / #(all novel pages)
  • Use URLs including dates as a golden set
    – Assume that pages appeared at the time included in their URL
    – E.g. http://foo.com/2004/05
    – Patterns: YYYYMM, YYYY/MM, YYYY-MM

                               Jul 2003        Jan 2004        May 2004
  With old date (before t-1)   299,591 (33%)   87,878 (24%)    402,365 (33%)
  With new date (t-1 to t)     593,317 (65%)   270,355 (74%)   776,360 (64%)
  With future date (after t)   24,286 (2%)     7,679 (2%)      36,476 (3%)
  Total                        917,194 (100%)  365,912 (100%)  1,215,201 (100%)
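Building such a golden set amounts to extracting the date patterns from URLs and comparing them with the crawl interval; a sketch (the regex and helper names are assumptions, not the authors' code):

```python
import re

# Date patterns from slide 21: YYYYMM, YYYY/MM, YYYY-MM embedded in the URL.
DATE_RE = re.compile(r"(19|20)\d{2}[/-]?(0[1-9]|1[0-2])")

def url_date(url):
    """Return (year, month) for the first date-like pattern in `url`, else None."""
    m = DATE_RE.search(url)
    if not m:
        return None
    digits = re.sub(r"\D", "", m.group(0))
    return int(digits[:4]), int(digits[4:6])

def label(url, t_prev, t):
    """Classify a dated URL as 'old', 'new', or 'future' relative to [t-1, t]."""
    d = url_date(url)
    if d is None:
        return None         # undated URLs cannot join the golden set
    if d < t_prev:
        return "old"
    if d <= t:
        return "new"
    return "future"

# Judging slide 21's example URL against the May 2004 snapshot (t-1 = Jan 2004):
verdict = label("http://foo.com/2004/05/index.html", (2004, 1), (2004, 5))
# verdict == "new"
```

URLs labelled "new" are the positives for recall; pages judged novel by θ < N(p) whose URL date is also new count toward precision.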

  22. Precision and Recall (1/2)
  • A positive θ gives 80% to 90% precision in all snapshots
  • Precision jumps from the baseline when θ becomes positive, then gradually increases
  • Positive δ values give slightly better precision
  [Charts: precision vs. minimum novelty threshold θ (0–1) for the 2003-07, 2004-01, and 2004-05 snapshots, each with δ = 0, 0.1, 0.2]

  23. Precision and Recall (2/2)
  • Recall drops according to the distribution of the novelty measure
  • Positive δ values decrease the recall
  [Charts: precision and recall vs. minimum novelty threshold θ for the 2003-07, 2004-01, and 2004-05 snapshots, each with δ = 0, 0.1, 0.2]

  24. Guideline for Selecting Parameters
  • When higher precision is required
    – 0 < δ ≤ 0.2
    – Higher θ
  • When higher recall is required
    – δ = 0
    – Small positive θ

  25. Miss Rate
  • Fraction of pages mis-judged to be novel
    – Use a set of provably old pages as a golden set
      • Last-Modified time < t-1
    – Check how many of these pages are assigned positive N values

  Time      # old pages in U(t)  |U(t)|
  Jul 2003  4.8M                 25M
  Jan 2004  0.17M                6M
  May 2004  3.8M                 30M
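The metric itself is a simple fraction over the golden set; a sketch with an illustrative toy input (the per-page N values below are made up, not from the experiments):

```python
def miss_rate(N, old_pages):
    """Fraction of provably old pages (Last-Modified < t-1) that were
    nevertheless assigned a positive novelty score."""
    missed = sum(1 for p in old_pages if N.get(p, 0.0) > 0.0)
    return missed / len(old_pages)

# Toy example: three pages known to be old; one wrongly receives N > 0.
rate = miss_rate({"a": 0.0, "b": 0.6, "c": 0.0}, ["a", "b", "c"])
# rate == 1/3
```

Applied to the table above, e.g. the 4.8M old pages among the 25M unknown pages of Jul 2003 bound how often an old page can be mistaken for a new one.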
