Introduction to Web Archiving (Marc Spaniol)


  1. Web Dynamics: Introduction to Web Archiving
     Marc Spaniol
     Saarbrücken, May 28, 2009
     Databases and Information Systems, Prof. Dr. G. Weikum
     MPII-Sp-0509-1/50

  2. Agenda
     • Motivation
       - Indexing vs. archiving
       - The challenge of Web archiving
       - Next generation Web archiving
     • Aspects of Web archiving
       - Web archiving tools
       - Selection
       - Capturing
       - Archiving
       - Hosting
     • Summary
     • References

  3. Indexing vs. Archiving
     • Indexing
       - Completeness
       - Access to content ⇒ “Taking a Photo”
       - Scalability (speed)
       - Efficiency
       - Freshness
     • Archiving
       - Completeness
       - Access to content ⇒ “Shooting a Movie”
       - Scalability (coverage)
       - Authenticity
       - Coherence
       - Durability
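
The photo/movie contrast can be sketched as two stores: one that overwrites each URL with its freshest copy, and one that keeps every timestamped version. This is only an illustration; the class names `Index` and `Archive`, the integer timestamps, and the assumption that versions are stored in increasing timestamp order are hypothetical and not taken from the slides.

```python
from bisect import bisect_right


class Index:
    """Keeps only the freshest copy of each URL (the "photo")."""

    def __init__(self):
        self.latest = {}  # url -> (timestamp, content)

    def store(self, url, timestamp, content):
        self.latest[url] = (timestamp, content)

    def get(self, url):
        return self.latest[url][1]


class Archive:
    """Keeps every captured version of each URL (the "movie")."""

    def __init__(self):
        self.versions = {}  # url -> list of (timestamp, content), oldest first

    def store(self, url, timestamp, content):
        # Assumption: captures arrive in increasing timestamp order.
        self.versions.setdefault(url, []).append((timestamp, content))

    def get(self, url, at):
        """Return the version that was current at time `at`, or None."""
        history = self.versions.get(url, [])
        timestamps = [t for t, _ in history]
        pos = bisect_right(timestamps, at)
        return history[pos - 1][1] if pos else None
```

With captures at times 1 and 5, the index can only answer with the second one, while the archive can still answer "what did this page say at time 3?".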

  4. The Challenge of Web Archiving
     • Digital library
       - Organized
       - Groomed content
       - Lots of metadata
       - Structured changes
       - Active preservation policies
     • World Wide Web
       - A disorganized free-for-all
       - Very little metadata
       - Unpredictable additions, deletions, modifications
       - No (coordinated) preservation strategy

  5. Goals of Web Archiving
     • Role of the Web
       - Providing information and services for seemingly all domains
       - Reflecting all types of events, opinions, and developments within society, science, politics, environment, business, etc.
       - Giving room for the articulation of a multitude of stakeholders
       ⇒ Archiving this quickly changing, multifaceted information space has become a relevant issue for cultural heritage
     • Web archiving imposes various challenges:
       [Figure: hidden Web, new types of content, inherent ephemeral character, change & evolution, social Web, preservation, ...]

  6. Next Generation Web Archiving
     Development of Web archiving technology for
       - High quality Web archives
       - Long-term archive usability
     ⇒ From Web page storage to “Living Web Archives”
     [Figure labels: Evolution, Living, Usage, Variety]

  7. Archive Fidelity
     Next generation Web archiving methods and tools:
     • Enhance archive fidelity and authenticity by
       - Capturing all types of content
       - Capturing the hidden Web
       - Detecting traps

  8. Advanced Filtering
     Next generation Web archiving methods and tools:
     • Enhance archive fidelity and authenticity
     • Provide advanced filtering features
       - Capture all types of content
       - Detect traps
       - Filter Web spam
       - Filter noise

  9. Archive Coherence
     Next generation Web archiving methods and tools:
     • Enhance archive fidelity and authenticity
     • Provide advanced filtering features
     • Improve archive coherence and integrity
       - Deal with issues of temporal Web construction
       - Identify, analyze and repair temporal gaps
       - Consistent Web archive federation

  10. Archive Interpretability
      Next generation Web archiving methods and tools:
      • Enhance archive fidelity and authenticity
      • Provide advanced filtering features
      • Improve archive coherence and integrity
      • Facilitate (long-term) archive interpretability
        - Dealing with terminology evolution
        - Handling semantic evolution
        - Preparing for evolution-aware access support

  11. Goals of Web Archiving Summarized
      • Archiving function α applied to website W produces a capture C_W of the website's resources and related metadata:
          $\alpha(W) \rightarrow C_W$
      • Restoration function ρ “unpacks” the capture C_W and reproduces the original site:
          $\rho(C_W) \rightarrow W$
      • Transformation function τ “unpacks” the capture C_W, converts the components to the modern-day equivalent, and reproduces the original site within a new environment:
          $\tau(C_W) \rightarrow W_{\Delta}$
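
The three functions above can be read as ordinary functions over a site. The following is a minimal sketch, assuming a website is modelled as a dict from path to page body; the function names `archive`, `restore`, and `transform`, the metadata fields, and the `convert` callback are hypothetical stand-ins for α, ρ, and τ rather than anything defined in the slides.

```python
def archive(site):
    """alpha(W) -> C_W: bundle the site's resources with related metadata."""
    return {
        "resources": dict(site),
        # Hypothetical metadata: just the length of each resource.
        "metadata": {path: {"length": len(body)} for path, body in site.items()},
    }


def restore(capture):
    """rho(C_W) -> W: unpack the capture and reproduce the original site."""
    return dict(capture["resources"])


def transform(capture, convert):
    """tau(C_W) -> W_Delta: unpack the capture and convert every component
    to its modern-day equivalent using the supplied `convert` function."""
    return {path: convert(body) for path, body in capture["resources"].items()}
```

In this toy model `restore(archive(site))` gives back the site unchanged, while `transform` yields a site with the same paths but migrated content, for example with an outdated tag replaced.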

  12. Aspects of Web Archiving

  13. Web Archiving Tools
      Legend:
      • OAIS: Open Archival Information System
      • SIP: Submission Information Package
      • AIP: Archival Information Package
      • DIP: Dissemination Information Package

  14. Selection of Seed(s) and Scope
      • Entry point / seed: Where the capturing process (crawl) starts. Top of the hypertext path that will be followed.
      • Scope: The extent of the area that will be included in the gathering, as defined by criteria applicable to each node.
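
Because scope is defined by criteria applied to each node, it can be sketched as a predicate that the crawler evaluates per URL. The rule below (same host as the seed, path under the seed's path, limited link depth) is one hypothetical choice of criteria; the function name `in_scope` and the `max_depth` parameter are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def in_scope(url, depth, seed, max_depth=3):
    """Hypothetical scope rule: same host as the seed, located under the
    seed's path, and at most `max_depth` links away from the seed."""
    s, u = urlsplit(seed), urlsplit(url)
    return (
        u.netloc == s.netloc
        and u.path.startswith(s.path)
        and depth <= max_depth
    )
```

With the seed `http://example.org/news/`, a page such as `http://example.org/news/2009/a.html` at depth 2 is in scope, while `http://example.org/about.html` or any page on another host is not.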

  15. Completeness
      • Vertically: Number of relevant nodes found from entry point.
      • Horizontally: Number of relevant entry points found within the designated perimeter.
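
Both notions count how much of a relevant set was found, so one helper can illustrate them. The slide speaks of numbers of nodes and entry points; normalising by the size of the relevant set is an assumption made here to get comparable ratios, and it presumes the full relevant set is known, which is rarely the case in a real crawl. The function name `coverage` is hypothetical.

```python
def coverage(found, relevant):
    """Fraction of the relevant items that were actually found."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("the relevant set must not be empty")
    return len(relevant & set(found)) / len(relevant)


# Vertical completeness: relevant nodes found from one entry point.
vertical = coverage(found={"/", "/a", "/b"}, relevant={"/", "/a", "/b", "/c"})

# Horizontal completeness: relevant entry points found within the perimeter.
horizontal = coverage(found={"site1.example"},
                      relevant={"site1.example", "site2.example"})
```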

  16. Extensive Collection
      • Horizontal completeness is preferred to vertical completeness
      • Holistic, domain based, or topic-centric archiving

  17. Intensive Collection
      • Vertical completeness is preferred to horizontal completeness
      • Site-based archiving
      • Defines the high level target of a collection
      • Explicit exclusion to avoid duplicate content with other collections

  18. The Challenge of Web Archiving
      • HTTP cannot ask for only new or modified content
        - Timestamps have limited benefit
        - No list of pages that have been deleted, changed, or added
        - Each resource must be requested, one at a time, by name
      • There is no “SELECT *” in HTTP
        - Crawlers can only GET one resource at a time, by name
        - HTTP cannot give a crawler a list of all URLs for the site
      ⇒ Undiscovered or hidden resources will not be captured or refreshed
      ⇒ “Strategy” required
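
The consequence of "one GET at a time, by name" can be shown with a breadth-first crawl over a simulated site. This is a minimal sketch with no network access: the `SITE` link graph, the `fetch` stand-in for an HTTP GET, and the `/orphan` page are all hypothetical.

```python
from collections import deque

# Hypothetical site: each path maps to the links found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": [],
    "/c": ["/"],
    "/orphan": ["/"],  # exists on the server, but no page links to it
}


def fetch(url):
    """Stand-in for a single HTTP GET: returns the links of one named page."""
    return SITE[url]


def crawl(seed):
    """Breadth-first crawl: only URLs reachable via links are ever requested."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in fetch(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Starting from `/`, the crawl requests `/`, `/a`, `/b`, and `/c`, but never `/orphan`: the crawler has no way to ask the server for a listing, so a resource that nothing links to stays undiscovered.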

  19. Server Side Archiving

  20. Server Side Archiving Revisited
      • Benefits
        + Extremely comprehensive
        + Changes are fully traceable (if budget permits)
        + Instantaneous snapshots possible
        + No network latency or limitations
        + Deep Web compliant
      • Drawbacks
        - Change monitoring may decrease server performance
        - Needs sophisticated set-up
        - Requires server access
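
One way to picture "changes are fully traceable" and "instantaneous snapshots" is a content store that logs every write on the server itself. The sketch below is an assumption about how such monitoring could look, not the set-up described in the lecture; the class `VersionedStore` and its methods are hypothetical.

```python
class VersionedStore:
    """Toy server-side content store that records every change it sees."""

    def __init__(self):
        self.current = {}  # path -> content currently served
        self.log = []      # (sequence number, path, old content, new content)

    def write(self, path, content):
        old = self.current.get(path)
        if old != content:  # only real changes are logged
            self.log.append((len(self.log) + 1, path, old, content))
            self.current[path] = content

    def delete(self, path):
        if path in self.current:
            old = self.current.pop(path)
            self.log.append((len(self.log) + 1, path, old, None))

    def snapshot(self):
        """Instantaneous copy of everything currently served."""
        return dict(self.current)
```

The extra bookkeeping on every write is also where the performance drawback comes from: the archive is complete because the server does additional work for each change.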

  21. Transaction based Archiving

  22. Transaction based Archiving Revisited
      • Benefits
        + Comes for “free”
        + “Smart” coverage achieved by human interaction
        + Simple maintenance
        + No server collaboration/manipulation required
      • Drawbacks
        - Unsystematic
        - Data quality is potentially poor
        - Needs traffic monitoring
        - Privacy issues
        - Potential network latency or limitations
        - Requires constant traffic
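
Transaction-based archiving can be sketched as a recorder that sees each response passing by and keeps it only when the body differs from the last capture of that URL. This is a simplified illustration; the class `TransactionArchive`, its `observe` method, and the use of SHA-256 digests for change detection are assumptions, not details from the slides.

```python
import hashlib


class TransactionArchive:
    """Records responses observed in live traffic, one capture per change."""

    def __init__(self):
        self.captures = {}  # url -> list of (timestamp, sha256 digest, body)

    def observe(self, timestamp, url, body):
        """Store the response if its body differs from the last capture.
        Returns True when a new version was recorded."""
        digest = hashlib.sha256(body).hexdigest()
        history = self.captures.setdefault(url, [])
        if history and history[-1][1] == digest:
            return False  # unchanged since the previous observation
        history.append((timestamp, digest, body))
        return True
```

The sketch also makes the drawbacks visible: a URL that no user requests never appears in `captures`, and a change that occurs between two requests is only noticed at the next request.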

  23. Client Side Archiving

  24. Client Side Archiving Revisited
      • Benefits
        + No server collaboration/manipulation needed
        + Only crawler set-up required
        + Mostly automated process (daily/weekly/monthly)
      • Drawbacks
        - Changes might get lost
        - Good data quality requires sophisticated crawling strategies
        - Potential network latency or limitations
        - Computationally “expensive”
      Next week’s lecture: “Data Quality in Web Archiving”
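
Why "changes might get lost" with periodic crawling can be illustrated by comparing the times at which a page changes with the times at which the crawler visits it. The sketch assumes a single page whose versions start at the sorted times in `change_times`; the function names and the abstract integer time units are hypothetical.

```python
from bisect import bisect_right


def captured_versions(change_times, crawl_times):
    """Indices of the page versions that at least one crawl has seen.
    Version i is live from change_times[i] until the next change."""
    seen = set()
    for t in crawl_times:
        pos = bisect_right(change_times, t)
        if pos:  # the page already existed at time t
            seen.add(pos - 1)
    return seen


def lost_versions(change_times, crawl_times):
    """Indices of the versions that no crawl captured."""
    every_version = set(range(len(change_times)))
    return every_version - captured_versions(change_times, crawl_times)
```

For changes at times 0, 2, 3, and 9 and crawls at times 1, 8, and 15, the version that was live only between times 2 and 3 is never captured. Crawling more often reduces such losses at the price of more requests.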
