Selective Web Archiving at the German National Library


  1. Tobias Steinke: Selective Web Archiving at the German National Library (21.04.2016)

  2. Digital Publications
     – Legal deposit for digital publications on carriers and net publications (since 2006)
     – On carriers (CD-ROMs, floppy disks, DVD-ROMs): multimedia, educational software, e-books, no games
     – E-journals, e-theses, e-books, digitized books
     – Music: CDs, digitized analogue carriers, files
     – Web pages
     – Access: at least in the reading rooms

  3. Web Harvesting
     – Member of the International Internet Preservation Consortium (IIPC)
     – Several event harvests (e.g. elections) with the European Archive
     – Selective workflow with German company oia since 2012
     – Additional broad crawl of the .de domain in 2014

  4. Web Harvesting: Workflow
     – Librarians at DNB use a special tool to select URLs, parameters and metadata of web sites
     – oia use their own crawler to harvest web pages, check the quality and store the data on their own servers
     – Metadata are automatically integrated into the DNB catalogue
     – Exclusive access in the DNB reading rooms via catalogue and full-text search
     – Interface for long-term preservation in the DNB archival system
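
The hand-over from selection to harvesting described on this slide could be pictured as a small record per web site. The following is only an illustrative sketch: the slides do not describe the DNB selection tool's data model, so the class `SelectionRecord`, its field names, and the JSON serialization are all hypothetical. The default of two crawls per year follows the "Status" slide.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SelectionRecord:
    """Hypothetical record for one selected web site (not the real DNB schema)."""
    seed_url: str                 # URL chosen during selection
    title: str                    # descriptive metadata for the catalogue
    crawls_per_year: int = 2      # default frequency stated in the slides
    max_depth: int = 3            # assumed crawler parameter, for illustration only
    subjects: list[str] = field(default_factory=list)


def to_harvest_order(record: SelectionRecord) -> str:
    """Serialize a record to JSON so that a harvesting service could consume it."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


example = SelectionRecord(
    seed_url="https://www.example.org/",
    title="Example cultural organization",
    subjects=["culture"],
)
order = to_harvest_order(example)
```

Keeping crawler parameters and catalogue metadata in one record mirrors the slide's point that both are captured at selection time.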

  5. [Workflow diagram] DNB: selection of web sites, access via catalogue, long-term archive. oia: harvesting, quality check, stored web pages. Exchanged between the two: URLs, archival packages (WARC).

  6. Web Harvesting: Status
     – Topic-related web sites (e.g. federal institutions, cultural organizations)
     – Default: sites are crawled twice a year
     – Event crawls (e.g. elections, sports events)
     – Co-operations planned for selections of web sites
     – Currently ca. 1,700 sites with ca. 8,300 crawls

  7. Crawling of news sites
     – Challenging: updated very often, links to articles disappear from the start page, pay-walls
     – Financial Times Deutschland (www.ftd.de) was closed down in 2013
       - List of links to all articles was provided
       - Complete crawl was archived and is accessible by full-text search
     – Workshop with German publishers in 2014
       - Articles are not deleted and links don't change
       - Advice to use Google sitemaps
       - Publishers skeptical about giving the crawler access behind pay-walls

  8. Crawling of news sites: Status
     – Test crawl of SPIEGEL ONLINE (www.spiegel.de)
     – Import of XML-based Google sitemap as source
     – Difficulties with crawler parameters
     – In discussion
       - How often?
       - Additional crawls of start page
       - Access to pages and articles behind pay-walls
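
Importing an XML sitemap as a crawl source, as mentioned on the last slide, can be sketched as follows. This is a minimal illustration under stated assumptions, not the crawler actually used by oia: the function name `seed_urls_from_sitemap` and the example URLs are made up, and only the standard sitemap namespace and the `<url>`/`<loc>` elements of the sitemaps.org protocol are relied on.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def seed_urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the article URLs listed in a sitemap, in document order."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        # Skip empty <loc> elements instead of handing blank seeds to a crawler.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Made-up sitemap for illustration; a real one would be fetched from the publisher.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.org/article-1.html</loc><lastmod>2016-04-01</lastmod></url>
  <url><loc> https://www.example.org/article-2.html </loc></url>
  <url><loc></loc></url>
</urlset>"""

seeds = seed_urls_from_sitemap(EXAMPLE_SITEMAP)
```

A sitemap-driven seed list addresses the problem noted earlier that article links soon vanish from the start page, since every article URL is listed explicitly.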
