Databases and Information Systems
- Prof. Dr. G. Weikum
Marc Spaniol MPII-Sp-0509-1/50 Introduction to Web Archiving
Web Dynamics
Introduction to Web Archiving
Marc Spaniol
Saarbrücken, May 28, 2009
Agenda
Motivation - Indexing vs.
Tagged: No robots
MySQL
httpd
[Figure: pages that are not crawled, with reasons: unadvertised & unlinked; too deep; protected; remote link only; generated on-the-fly, e.g. by CGI; robots.txt or robots META tag]
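The two robots-based exclusions named above can be illustrated with a minimal Python sketch using only the standard library. The robots.txt content, the user agent name "archive-bot", and the helper names `is_crawlable` and `robots_meta_directives` are assumptions made for illustration and do not come from the slides.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, given inline so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class _RobotsMetaParser(HTMLParser):
    """Collects the directives of <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_meta_directives(html: str) -> set:
    """Return the set of robots META directives found in an HTML page."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    print(is_crawlable(ROBOTS_TXT, "archive-bot", "http://www.sample.edu/private/a.html"))
    print(robots_meta_directives('<meta name="robots" content="noindex, nofollow">'))
```

A polite crawler would typically skip a page when `is_crawlable` returns False or when the page's META directives contain "noindex" or "noarchive"; which directives to honor is a policy decision of the archive.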
[Figure labels: Machine-readable, Human-readable, Complex Object, Apache Web Server, JHOVE metadata, MD-5 LS]
[Figure: metadata formats (Dublin Core metadata, METS, MPEG-21 DIDL, MARCXML metadata) with expressiveness labels: simple, more expressive, highly expressive]
http://www.sample.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg
Give me a list of all resources that have Dublin Core metadata, date from 9/15/2004 through today, and are of MIME type video/mpeg.
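As a minimal sketch of how such a request could be assembled, the snippet below builds the query string from its parts with Python's standard library. The argument names `verb`, `metadataPrefix`, `from` and `set` are the standard OAI-PMH request arguments; the helper name `list_identifiers_url` is an assumption for illustration, and the base URL is the sample one shown above.

```python
from urllib.parse import urlencode


def list_identifiers_url(base_url, metadata_prefix, from_date=None, set_spec=None):
    """Build an OAI-PMH ListIdentifiers request URL.

    from_date and set_spec are optional selective-harvesting arguments.
    """
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    # Keep ":" unescaped so hierarchical set names stay readable.
    return base_url + "?" + urlencode(params, safe=":")


if __name__ == "__main__":
    print(
        list_identifiers_url(
            "http://www.sample.edu/modoai",
            "oai_dc",
            from_date="2004-09-15",
            set_spec="mime:video:mpeg",
        )
    )
```

Omitting `from_date` and `set_spec` yields an unrestricted harvest of all identifiers available in the given metadata format.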
http://www.takeda.de/unternehmen/pdf/fantastisch/pdf8_17.pdf
encoded as an MPEG-21 DIDL:
<didl>
  <metadata source="jhove">...</metadata>
  <metadata source="file">...</metadata>
  <metadata source="essence">...</metadata>
  <metadata source="grep">...</metadata>
  ...
  <resource mimeType="application/pdf"
            identifier="http://www.takeda.de/unternehmen/pdf/fantastisch/pdf8_17.pdf"
            encoding="base64">
    SADLFJSALDJF...SLDKFJASLDJ
  </resource>
</didl>
Callouts: Jhove metadata, DC metadata, Checksum, Provenance
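The wrapping idea can be sketched in Python as follows. This is a simplified illustration and not a schema-valid MPEG-21 DIDL document: the element and attribute names mirror the abbreviated example above, while the helper name `wrap_resource`, the "md5" metadata source, and the sample payload are assumptions introduced here.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET


def wrap_resource(url, payload, mime_type, metadata):
    """Wrap a resource and its metadata in a simplified DIDL-like XML envelope.

    metadata maps a source name (e.g. "jhove") to that tool's output text.
    """
    didl = ET.Element("didl")
    for source, text in metadata.items():
        entry = ET.SubElement(didl, "metadata", source=source)
        entry.text = text
    # Hypothetical checksum entry: MD5 hex digest of the raw payload.
    checksum = ET.SubElement(didl, "metadata", source="md5")
    checksum.text = hashlib.md5(payload).hexdigest()
    # The resource itself travels base64-encoded inside the envelope.
    resource = ET.SubElement(
        didl, "resource", mimeType=mime_type, identifier=url, encoding="base64"
    )
    resource.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(didl, encoding="unicode")


if __name__ == "__main__":
    xml_text = wrap_resource(
        "http://www.takeda.de/unternehmen/pdf/fantastisch/pdf8_17.pdf",
        b"%PDF-1.4 placeholder bytes",  # stand-in for the real file content
        "application/pdf",
        {"jhove": "...", "file": "..."},
    )
    print(xml_text)
```

Carrying the checksum next to the encoded bytes lets a recipient verify the payload after decoding, which is the role the "Checksum" callout plays in the example above.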
[Figure: chart with axis ticks "Low" and "High"; items shown: LOCKSS, Browser cache, TTApache, iPROXY, Furl/Spurl, InfoMonitor, Filesystem backups, Web archives, Search engine caches, Hanzo:web]