Introduction to Web Archiving - Marc Spaniol

Web Dynamics: Introduction to Web Archiving
Marc Spaniol
Saarbrücken, May 28, 2009
Databases and Information Systems, Prof. Dr. G. Weikum (MPII-Sp-0509)

Agenda
• Motivation
  - Indexing vs. archiving
  - The challenge of Web archiving
  - Next generation Web archiving
• Aspects of Web archiving
  - Web archiving tools
  - Selection
  - Capturing
  - Archiving
  - Hosting
• Summary
• References

Indexing vs. Archiving
• Indexing ⇒ “Taking a Photo”
  - Completeness
  - Access to content
  - Scalability (speed)
  - Efficiency
  - Freshness
• Archiving ⇒ “Shooting a Movie”
  - Completeness
  - Access to content
  - Scalability (coverage)
  - Authenticity
  - Coherence
  - Durability

The Challenge of Web Archiving
• Digital library
  - Organized
  - Groomed content
  - Lots of metadata
  - Structured changes
  - Active preservation policies
• World Wide Web
  - A disorganized free-for-all
  - Very little metadata
  - Unpredictable additions, deletions, modifications
  - No (coordinated) preservation strategy

Goals of Web Archiving
• Role of the Web
  - Providing information and services for seemingly all domains
  - Reflecting all types of events, opinions, and developments within society, science, politics, environment, business, etc.
  - Giving room for articulation to a multitude of stakeholders
⇒ Archiving this quickly changing, multifaceted information space has become a relevant issue for cultural heritage
• Web archiving imposes various challenges:
  - Hidden Web
  - Inherent ephemeral character
  - New types of content
  - Change & Evolution
  - Social Web
  - Preservation
  - ...

Next Generation Web Archiving
Development of Web archiving technology for
  - High quality Web archives
  - Long-term archive usability
⇒ From Web page storage to “Living Web Archives”
[Figure labels: Evolution, Living, Usage, Variety]

Archive Fidelity
Next generation Web archiving methods and tools:
• Enhance archive fidelity and authenticity by
  - Capturing all types of content
  - Capturing the hidden Web
  - Detecting traps

Advanced Filtering
Next generation Web archiving methods and tools:
• Enhance archive fidelity and authenticity
• Provide advanced filtering features
  - Capture all types of content
  - Detect traps
  - Filter Web spam
  - Filter noise

Archive Coherence
Next generation Web archiving methods and tools:
• Enhance archive fidelity and authenticity
• Provide advanced filtering features
• Improve archive coherence and integrity
  - Deal with issues of temporal Web construction
  - Identify, analyze and repair temporal gaps
  - Consistent Web archive federation

Archive Interpretability
Next generation Web archiving methods and tools:
• Enhance archive fidelity and authenticity
• Provide advanced filtering features
• Improve archive coherence and integrity
• Facilitate (long-term) archive interpretability
  - Dealing with terminology evolution
  - Handling semantic evolution
  - Preparing for evolution-aware access support

Goals of Web Archiving Summarized
• Archiving function α applied to website W produces a capture C_W of the website’s resources and related metadata:
    α(W) → C_W
• Restoration function ρ “unpacks” the capture C_W and reproduces the original site:
    ρ(C_W) → W
• Transformation function τ “unpacks” the capture C_W, converts the components to their modern-day equivalents, and reproduces the original site within a new environment:
    τ(C_W) → W_Δ
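
The three functions α, ρ and τ can be read as plain function signatures. The following is a minimal sketch, not the lecture's implementation: it assumes a website is just a mapping from paths to bytes, and the names `Capture`, `archive`, `restore` and `transform` as well as the `convert` callback are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Assumption: a website W is modelled as a mapping from path to resource body.
Site = Dict[str, bytes]


@dataclass(frozen=True)
class Capture:
    """The capture C_W: the site's resources plus related metadata."""
    resources: Dict[str, bytes]
    metadata: Dict[str, str]


def archive(site: Site, captured_at: str) -> Capture:
    """alpha(W) -> C_W: copy the resources and record metadata about the capture."""
    return Capture(
        resources=dict(site),
        metadata={"captured_at": captured_at, "resource_count": str(len(site))},
    )


def restore(capture: Capture) -> Site:
    """rho(C_W) -> W: unpack the capture and reproduce the original site."""
    return dict(capture.resources)


def transform(capture: Capture, convert: Callable[[bytes], bytes]) -> Site:
    """tau(C_W): unpack the capture and convert every component for a new environment."""
    return {path: convert(body) for path, body in capture.resources.items()}


site = {"/index.html": b"<HTML>hi</HTML>"}
capture = archive(site, captured_at="2009-05-28")
restored = restore(capture)               # identical to the original site
migrated = transform(capture, bytes.lower)  # stand-in for a real format migration
```

Here `bytes.lower` only stands in for a real format conversion; the point is that restoration returns the site unchanged while transformation returns a converted variant.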

Aspects of Web Archiving

Web Archiving Tools
  - OAIS: Open Archival Information System
  - SIP: Submission Information Package
  - AIP: Archival Information Package
  - DIP: Dissemination Information Package

Selection of Seed(s) and Scope
• Entry point / seed: Where the capturing process (crawl) starts. Top of the hypertext path that will be followed.
• Scope: The extent of the area that will be included in the gathering, as defined by criteria applicable to each node.
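
Seed and scope can be illustrated with a breadth-first traversal over a toy link graph. This is a sketch under stated assumptions: the link graph is an in-memory dictionary rather than fetched pages, and `crawl`, `same_host` and the example URLs are hypothetical names chosen for illustration.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlsplit


def crawl(
    seeds: Iterable[str],
    links: Dict[str, List[str]],
    in_scope: Callable[[str], bool],
) -> List[str]:
    """Follow links breadth-first from the seeds, keeping only in-scope nodes."""
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen and in_scope(seed):
            seen.add(seed)
            queue.append(seed)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        # The scope criterion is applied to each node individually.
        for target in links.get(url, []):
            if target not in seen and in_scope(target):
                seen.add(target)
                queue.append(target)
    return visited


def same_host(host: str) -> Callable[[str], bool]:
    """One possible scope criterion: stay on a single host."""
    return lambda url: urlsplit(url).hostname == host


# Toy link graph standing in for real pages (assumption for illustration).
toy_links = {
    "http://example.org/": ["http://example.org/a", "http://other.example/x"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/"],
}
pages = crawl(["http://example.org/"], toy_links, same_host("example.org"))
```

The off-host link is never queued, so nothing reachable only through it is gathered; changing the scope predicate changes what the same seed yields.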

Completeness
• Vertically: Number of relevant nodes found from entry point.
• Horizontally: Number of relevant entry points found within the designated perimeter.
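
Both notions of completeness amount to counting how much of a relevant set was actually found. The sketch below assumes the relevant set is known in advance, which in practice it usually is not; the function names and the additional fraction are illustrative assumptions, since the slide only defines the counts.

```python
from typing import Set, Tuple


def completeness(found: Set[str], relevant: Set[str]) -> Tuple[int, float]:
    """Return the number of relevant items found and, as an added convenience,
    the fraction of the relevant set they represent."""
    hits = len(found & relevant)
    # Assumption: with nothing relevant to miss, coverage counts as complete.
    fraction = hits / len(relevant) if relevant else 1.0
    return hits, fraction


def vertical_completeness(captured_nodes: Set[str], relevant_nodes: Set[str]) -> Tuple[int, float]:
    """Relevant nodes found from a single entry point."""
    return completeness(captured_nodes, relevant_nodes)


def horizontal_completeness(found_seeds: Set[str], relevant_seeds: Set[str]) -> Tuple[int, float]:
    """Relevant entry points found within the designated perimeter."""
    return completeness(found_seeds, relevant_seeds)


nodes_hit = vertical_completeness({"/", "/a", "/x"}, {"/", "/a", "/b", "/c"})
seeds_hit = horizontal_completeness({"site1", "site2"}, {"site1", "site2", "site3"})
```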

Extensive Collection
• Horizontal completeness is preferred to vertical completeness
• Holistic, domain-based, or topic-centric archiving

Intensive Collection
• Vertical completeness is preferred to horizontal completeness
• Site-based archiving
• Defines the high-level target of a collection
• Explicit exclusion to avoid duplicate content with other collections

The Challenge of Web Archiving
• HTTP cannot ask for only new or modified content
  - Timestamps have limited benefit
  - No list of pages that have been deleted, changed, or added
  - Each resource must be requested, one at a time, by name
• There is no “SELECT *” in HTTP
  - Crawlers can only GET one resource at a time, by name
  - HTTP cannot give a crawler a list of all URLs for the site
⇒ Undiscovered or hidden resources will not be captured or refreshed
⇒ “Strategy” required
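
Because HTTP offers no list of added, deleted or changed pages, a crawler has to derive one itself, for example by comparing content digests of two crawls. The following is a minimal sketch of that idea, assuming both snapshots are already in memory as URL-to-bytes mappings; `ChangeSet` and `diff_snapshots` are hypothetical names.

```python
import hashlib
from typing import Dict, List, NamedTuple


class ChangeSet(NamedTuple):
    added: List[str]
    deleted: List[str]
    changed: List[str]


def digest(body: bytes) -> str:
    """Content fingerprint used instead of (unreliable) timestamps."""
    return hashlib.sha256(body).hexdigest()


def diff_snapshots(old: Dict[str, bytes], new: Dict[str, bytes]) -> ChangeSet:
    """Derive the change list that HTTP itself cannot provide."""
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    changed = sorted(
        url for url in old.keys() & new.keys()
        if digest(old[url]) != digest(new[url])
    )
    return ChangeSet(added, deleted, changed)


first_crawl = {"/": b"home v1", "/news": b"news v1", "/old": b"gone soon"}
second_crawl = {"/": b"home v1", "/news": b"news v2", "/new": b"fresh"}
changes = diff_snapshots(first_crawl, second_crawl)
```

Note that this only covers URLs the crawler already knows by name; a page that was never discovered appears in neither snapshot and therefore in no change list.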

Server Side Archiving

Server Side Archiving Revisited
• Benefits
  + Extremely comprehensive
  + Changes are fully traceable (if budget permits)
  + Instantaneous snapshots possible
  + No network latency or limitations
  + Deep Web compliant
• Drawbacks
  - Change monitoring may decrease server performance
  - Needs sophisticated set-up
  - Requires server access

Transaction based Archiving

Transaction based Archiving Revisited
• Benefits
  + Comes for “free”
  + “Smart” coverage achieved by human interaction
  + Simple maintenance
  + No server collaboration/manipulation required
• Drawbacks
  - Unsystematic
  - Data quality is potentially poor
  - Needs traffic monitoring
  - Privacy issues
  - Potential network latency or limitations
  - Requires constant traffic
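
The trade-off listed above can be made concrete with a small recorder that archives whatever responses pass by in observed traffic. This is only a sketch under stated assumptions: traffic is fed in as (URL, body) pairs instead of being captured from a network, and `TransactionArchive` and its methods are hypothetical names.

```python
import hashlib
from typing import Dict, List, Tuple


class TransactionArchive:
    """Stores each distinct response body observed for a URL, in order of arrival."""

    def __init__(self) -> None:
        # url -> list of (digest, body) pairs, oldest first
        self._versions: Dict[str, List[Tuple[str, bytes]]] = {}

    def observe(self, url: str, body: bytes) -> bool:
        """Record one request/response transaction.

        Returns True if a new version was stored, False if the body is
        identical to the most recently stored version for this URL.
        """
        fingerprint = hashlib.sha256(body).hexdigest()
        versions = self._versions.setdefault(url, [])
        if versions and versions[-1][0] == fingerprint:
            return False
        versions.append((fingerprint, body))
        return True

    def versions(self, url: str) -> List[bytes]:
        """All archived versions of a URL; empty if nobody ever requested it."""
        return [body for _, body in self._versions.get(url, [])]


archive_log = TransactionArchive()
archive_log.observe("/news", b"headline 1")
archive_log.observe("/news", b"headline 1")  # repeat visit, nothing new stored
archive_log.observe("/news", b"headline 2")  # changed content, stored as new version
never_visited = archive_log.versions("/imprint")
```

Coverage follows what users actually request, which is the “smart” part; a page nobody visits, or a change nobody happens to see, never reaches the archive.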

Client Side Archiving

Client Side Archiving Revisited
• Benefits
  + No server collaboration/manipulation needed
  + Only crawler set-up required
  + Mostly automated process (daily/weekly/monthly)
• Drawbacks
  - Changes might get lost
  - Good data quality requires sophisticated crawling strategies
  - Potential network latency or limitations
  - Computationally “expensive”
Next week’s lecture: “Data Quality in Web Archiving”
