Web Archiving, Dr. Marc Spaniol, Saarbrücken, May 27, 2010 - PowerPoint PPT Presentation


  1. Web Dynamics: Web Archiving. Dr. Marc Spaniol, Saarbrücken, May 27, 2010. Databases and Information Systems, Prof. Dr. G. Weikum. MPII-Sp-0510-1/77

  2. Agenda
     • Introduction
       - Indexing vs. archiving
       - Temporal coherence of Web archives
     • Aspects of Web archiving
       - Selection
       - Capturing
         ◦ Conceptual approaches
         ◦ Coherence aware archiving
         ◦ Quantifying (in-)coherence
       - Archiving
       - Hosting
     • Summary
     • References

  3. Indexing vs. Archiving
     • Indexing ⇒ “Taking a Photo”
       - Completeness
       - Access to content
       - Scalability (speed)
       - Efficiency
       - Freshness
     • Archiving ⇒ “Shooting a Movie”
       - Completeness
       - Access to content
       - Scalability (coverage)
       - Authenticity
       - Coherence
       - Durability

  4. The Challenge of Web Archiving
     • World Wide Web
       - A disorganized free-for-all
       - Very little metadata
       - Unpredictable additions, deletions, modifications
       - No (coordinated) preservation strategy
     • HTTP cannot ask for only new or modified contents
       - Timestamps have limited benefit
       - No list of pages that have been deleted, changed, and added
       - Each content must be requested, one at a time, by name
     • There is no “SELECT *” in HTTP
       - Crawlers can only GET one resource at a time, by name
       - HTTP cannot give a crawler a list of all URLs for the site
     ⇒ Undiscovered or hidden resources will not be captured or refreshed
     ⇒ “Strategy” required
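The point about timestamps can be made concrete with HTTP's per-resource validators. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, and it only builds the standard `If-Modified-Since` and `If-None-Match` request headers and interprets a `304 Not Modified` status. Even with these headers, a crawler still has to issue one request per resource, by name.

```python
from email.utils import formatdate


def conditional_headers(last_fetch_epoch=None, etag=None):
    """Build validator headers for re-requesting ONE named resource.

    last_fetch_epoch: seconds since the Unix epoch of the previous capture.
    etag: the ETag value the server sent with the previous capture.
    """
    headers = {}
    if last_fetch_epoch is not None:
        # HTTP dates use the fixed GMT format, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
        headers["If-Modified-Since"] = formatdate(last_fetch_epoch, usegmt=True)
    if etag is not None:
        headers["If-None-Match"] = etag
    return headers


def changed_since_last_fetch(status_code):
    """304 Not Modified means the archived copy is still current."""
    return status_code != 304
```

A server is free to ignore or misreport these validators, which is one reason the slide calls their benefit limited.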

  5. Temporal Coherence of Web Archives

  6. The Challenge of Archive Coherence
     • Crawler operations
       - Visit (pages)
         ◦ Extract (links from pages)
         ◦ Compare (versions of pages)
       - Follow (links)
     • Website operations
       - Modifications “inside” pages
         ◦ Content (text)
         ◦ Structure (links)
       - Modifications “inside” site
         ◦ Page creation
         ◦ Page deletion
     ⇒ Crawler and website operations take place in parallel ⇒ potentially incoherent
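The "Compare (versions of pages)" step can be sketched as a content-hash comparison. This is only an illustration under the assumption that two captures count as identical when their bytes are identical; the slides do not prescribe a particular hash, and the function names are hypothetical.

```python
import hashlib


def content_hash(body: bytes) -> str:
    """Return a hex digest that identifies the exact bytes of a capture."""
    return hashlib.sha256(body).hexdigest()


def has_changed(previous_body: bytes, current_body: bytes) -> bool:
    """True if two captures of the same page differ in any byte."""
    return content_hash(previous_body) != content_hash(current_body)
```

A byte-level comparison flags even irrelevant differences, such as an embedded timestamp, so a real crawler may need to normalize pages before hashing.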

  7. Potential Pitfalls in Web Archiving
     • Crawling takes a long (!) time
       - Politeness
       - Multiple seeds per crawl
       - Spam
     • Crawlers aren’t “really” smart
       - Highly sensitive to dynamics in CMS
       - Easily trapped if not configured exactly
       - Do not recognize patterns of “identical” contents
       ⇒ Pre-analysis of site(s) needed
     • Some examples of crawler behavior
       - Enjoy link generation from JavaScript, PHP, etc.
       - Tend to go shopping
       - Like time travelling in calendars
     ⇒ Crawling is simply “unpredictable”
     ⇒ Crawlers need “constant” monitoring
     [Side labels: Smart(er) Crawling Strategies; Archive in Danger!; Evaluation of Crawl Coherence]

  8. Aspects of Web Archiving

  9. Selection

  10. Selection of Seed(s) and Scope
     • Entry point / seed: Where the capturing process (crawl) starts. Top of the hypertext path that will be followed.
     • Scope: The extent of the area that will be included in the gathering, as defined by criteria applicable to each node.
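The roles of seed and scope can be illustrated with a breadth-first traversal over an in-memory link graph. This is a minimal sketch under stated assumptions: the `LINKS` graph, the `same_host` scope criterion, and the depth limit are all hypothetical and stand in for real page fetching and link extraction.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seed, links, in_scope, max_depth):
    """Breadth-first traversal starting at the seed.

    links: dict mapping a URL to the URLs it links to (stands in for fetching).
    in_scope: predicate applied to every candidate node (the scope criterion).
    max_depth: maximum number of link hops followed from the seed.
    """
    seen = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # too deep: do not follow links from this page
        for target in links.get(url, []):
            if target not in seen and in_scope(target):
                seen.add(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical site: the seed links to one in-scope page and one remote page.
LINKS = {
    "http://example.org/": ["http://example.org/a", "http://other.example/x"],
    "http://example.org/a": ["http://example.org/b"],
    "http://example.org/b": ["http://example.org/c"],
}


def same_host(url):
    """Scope criterion: stay on the seed's host."""
    return urlparse(url).netloc == "example.org"


captured = crawl("http://example.org/", LINKS, same_host, max_depth=2)
```

With these inputs the remote link is excluded by the scope criterion and `http://example.org/c` is excluded by the depth limit.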

  11. Completeness
     • Vertically: Number of relevant nodes found from entry point
     • Horizontally: Number of relevant entry points found within the designated perimeter

  12. Extensive Collection
     • Horizontal completeness is preferred to vertical completeness
     • Holistic, domain based, or topic-centric archiving

  13. Intensive Collection
     • Vertical completeness is preferred to horizontal completeness
     • Site-based archiving
     • Defines the high level target of a collection
     • Explicit exclusion to avoid duplicate content with other collections

  14. Capturing

  15. A Webmaster’s Omniscient View
     [Figure labels: Entry point / seed; Dynamic (MySQL: 1. Data1, 2. User.abc, 3. Fred.foo); Authenticated; Tagged: No robots; Orphaned; Deep; httpd (1. file1, 2. /dir/wwx, 3. Foo.html); Unknown/not visible]

  16. Web Server’s View of a Web Site
     [Figure labels: Entry point / seed; Require authentication; Generated on-the-fly (e.g. by CGI); Tagged: No robots; Unknown/not visible]

  17. A Crawler’s View of a Web Site
     [Figure labels: Entry point / seed; Crawled pages; Not crawled (protected); Not crawled (generated on-the-fly, e.g. by CGI); Not crawled (robots.txt or robots META tag); Not crawled (unadvertised & unlinked); Not crawled (remote link only); Not crawled (too deep); Remote web site]
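The "robots.txt" exclusion in the crawler's view can be sketched with Python's standard `urllib.robotparser`. The robots.txt content, the `archive-bot` user agent, and the URLs below are hypothetical; the sketch parses the rules from a string instead of fetching them, so it runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that excludes one directory for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string rather than over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("archive-bot", "http://example.org/index.html")
blocked = parser.can_fetch("archive-bot", "http://example.org/private/report.html")
```

This covers only the robots.txt case; a robots META tag sits inside the page itself and can only be honored after the page has been downloaded and parsed.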

  18. Web Information Systems
     [Figure labels: Dynamic Web sites; Hidden Web]
     • Each interaction with a Web information system can potentially generate a unique customized response
     ⇒ Document the context of this interaction, or pseudo-transaction

  19. Crawler-Server Collaboration
     • Open Archives Initiative (OAI) Protocol for Metadata Harvesting
     • Provided flat list (possibly hidden from the public)
     • RSS feeds
     • OAI server
       - Pushed by search engines
       - Yahoo content acquisition program, Google
     ⇒ The sitemap standard is intended to list the resources at a site
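How a crawler might consume such a resource list can be sketched with the standard library's XML parser. The sitemap content below is hypothetical; the namespace URI and the `urlset`, `url`, `loc`, and `lastmod` element names are those of the sitemaps.org protocol. This is a sketch of reading a single sitemap file, not of sitemap index files or compressed sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap with two entries, one without a lastmod value.
SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/</loc><lastmod>2010-05-27</lastmod></url>
  <url><loc>http://example.org/about.html</loc></url>
</urlset>"""


def list_resources(sitemap_xml: str):
    """Return (location, lastmod) pairs; lastmod is None when absent."""
    root = ET.fromstring(sitemap_xml)
    resources = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default=None, namespaces=NS)
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        resources.append((loc, lastmod))
    return resources


resources = list_resources(SITEMAP)
```

The list gives the crawler entry points it could not discover by following links, but it is only as complete and current as the site operator keeps it.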

  20. Server Side Archiving

  21. Transaction based Archiving

  22. Client Side Archiving

  23. Capturing Approaches Summary
     • Server Side Archiving
       Benefits:
       + Extremely comprehensive
       + Changes are fully traceable
       + Instantaneous snapshots
       + No network latency or limitations
       + Deep Web “compliant”
       Drawbacks:
       - Change monitoring may decrease server performance
       - Needs sophisticated set-up
       - Requires server access
     • Transaction based Archiving
       Benefits:
       + Comes for “free”
       + “Smart” coverage achieved by human interaction
       + Simple maintenance
       + No server collaboration required
       Drawbacks:
       - Unsystematic (requires constant traffic)
       - Data quality is potentially poor
       - Needs traffic monitoring
       - Privacy issues
       - Potential network latency or limitations
     • Client Side Archiving
       Benefits:
       + No server collaboration needed
       + Only crawler set-up required
       + Mostly automated process (daily/weekly/monthly)
       Drawbacks:
       - Changes might get lost
       - Sophisticated crawling strategy needed
       - Potential network latency or limitations
       - Computationally “expensive”

  24. Temporal Coherence
     • What does coherence mean?
       - “The action or fact of cleaving or sticking together”
       - “Harmonious connexion of the several parts, so that the whole ‘hangs together’”
       Oxford English Dictionary [http://dictionary.oed.com]
     • Temporal coherence in Web archiving:
       - Capturing Web sites as “authentically” as possible
       - Ensure an “as of time point x (or interval [x, y])” capture of a Web site
     ⇒ Periodic domain scope crawls of Web sites to obtain the best possible representation with respect to a time point / interval

  25. Assumptions and Notations
     • Basic Assumptions
       - Web site to be crawled consists of $n$ Web pages
       - Changes of Web pages occur per time unit and independent of each other
       - Change rates are assumed / given
       - Delay between downloads of pages is the same
       - Download time is neglected
     • Basic Notation
       - Crawl: $c$
       - Web pages: $p_1, \dots, p_n$
       - Change probability of page $p_i$: $\lambda_i$
       - Time of downloading page $p_i$: $t(p_i)$
       - Last modified value of page $p_i$: $\mu_i$
       - Content hash or etag of page $p_i$: $\theta(p_i)$
       - Crawl interval: $[t_s, t_e]$
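To see what these assumptions imply, the notation can be turned into a small calculation. The sketch below is an illustration, not the deck's own coherence measure: it assumes that $\lambda_i$ is the probability that page $p_i$ changes within one time unit, independently across time units and pages, so that a page downloaded at $t(p_i)$ stays unchanged until $t_e$ with probability $(1 - \lambda_i)^{t_e - t(p_i)}$. The function names are hypothetical.

```python
def prob_unchanged(change_prob, time_units):
    """Probability that one page does not change during `time_units` units,

    assuming an independent change with probability `change_prob` per unit.
    """
    return (1.0 - change_prob) ** time_units


def prob_all_unchanged(change_probs, download_times, t_end):
    """Probability that no page changes between its download and t_end.

    change_probs: per-page change probabilities (lambda_i).
    download_times: per-page download times t(p_i), in the same time units.
    Independence between pages lets the per-page probabilities multiply.
    """
    result = 1.0
    for lam, t in zip(change_probs, download_times):
        result *= prob_unchanged(lam, t_end - t)
    return result
```

For example, two pages with change probability 0.5 each, downloaded at times 0 and 1 with the crawl ending at time 2, give 0.25 × 0.5 = 0.125. The page downloaded first is exposed longest, which hints at why download order matters for coherence.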
