Selective Web Archiving at the German National Library



SLIDE 1

| 8 | Selective Web Archiving | 21.04.2016 1

Selective Web Archiving at the German National Library

Tobias Steinke

SLIDE 2


Digital Publications

– Legal deposit for digital publications on carriers and net publications (since 2006)
– On carriers (CD-ROMs, floppy disks, DVD-ROMs): multimedia, educational software, e-books; no games
– E-journals, e-theses, e-books, digitized books
– Music: CDs, digitized analogue carriers, files
– Web pages
– Access: at least in the reading rooms

SLIDE 3


Web Harvesting

– Member of the International Internet Preservation Consortium (IIPC)
– Several event harvests (e.g. elections) with the European Archive
– Selective workflow with the German company oia since 2012
– Additional broad crawl of the .de domain in 2014

SLIDE 4

Web Harvesting: Workflow

– Librarians at DNB use a special tool to select URLs, parameters and metadata of web sites
– oia uses its own crawler to harvest web pages, checks the quality and stores the data on its own servers
– Metadata will be automatically integrated into the DNB catalogue
– Exclusive access in the DNB reading rooms via catalogue and full-text search
– Interface for long-term preservation in the DNB archival system


SLIDE 5


[Workflow diagram. DNB: Selection of web sites, Access via catalogue, Long-term archive. oia: Harvesting, Quality check, Stored web pages. Exchanged between them: URLs, Archival packages (WARC).]
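
The diagram names WARC as the format of the archival packages. As a rough illustration of what one record in such a package looks like, here is a minimal sketch that serializes a single WARC/1.0 "resource" record using only the Python standard library. The slides do not say which WARC version or record types oia produces, so the version, the record type, the function name `warc_resource_record` and the example URL are all assumptions made for illustration. A real crawler would typically write "response" records that also carry the HTTP headers.

```python
import uuid
from datetime import datetime, timezone


def warc_resource_record(target_uri, payload, content_type="text/html"):
    """Serialize one WARC/1.0 'resource' record as bytes (illustrative only)."""
    # WARC-Date is a UTC timestamp in the form YYYY-MM-DDThh:mm:ssZ.
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",                                    # version line
        "WARC-Type: resource",                         # assumed record type
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",  # unique ID per record
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",             # length of the block in bytes
    ]
    # Header lines end with CRLF; an empty line separates header and block.
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Every record is terminated by two CRLF pairs.
    return head + payload + b"\r\n\r\n"


# Hypothetical page content and URL, not taken from the slides.
record = warc_resource_record("http://www.example.org/", b"hello")
```

A WARC file is a concatenation of such records, which is why one package can hold a complete crawl of a site.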

SLIDE 6


Web Harvesting: Status

– Topic-related web sites (e.g. federal institutions, cultural organizations)
– Default: sites are crawled twice a year
– Event crawls (e.g. elections, sports events)
– Co-operations planned for the selection of web sites
– Currently ca. 1,700 sites with ca. 8,300 crawls

SLIDE 7


Crawling of news sites

– Challenging: updated very often, links to articles no longer on start page, pay-walls
– Financial Times Deutschland (www.ftd.de) was closed down in 2013

  • List of links to all articles was provided
  • Complete crawl was archived and is accessible by full text search

– Workshop with German publishers in 2014

  • Articles are not deleted and links don’t change
  • Advice to use Google sitemaps
  • Skeptical about giving crawler access behind pay-walls
SLIDE 8


Crawling of news sites: Status

– Test crawl of SPIEGEL ONLINE (www.spiegel.de)
– Import of XML-based Google sitemap as source
– Difficulties with crawler parameters
– In discussion

  • How often?
  • Additional crawls of start page
  • Access to pages and articles behind pay-walls
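
The sitemap import mentioned above can be sketched roughly as follows. This is a minimal example, assuming the sitemap follows the public sitemaps.org 0.9 schema (a `<urlset>` of `<url>` entries with `<loc>` and optional `<lastmod>`). The slides do not describe how the actual import was implemented, so the function name `seeds_from_sitemap` and the example.org URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol, version 0.9.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def seeds_from_sitemap(xml_text):
    """Return (url, lastmod) pairs from a sitemap <urlset> document.

    lastmod is None when the entry does not carry one.
    """
    root = ET.fromstring(xml_text)
    seeds = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = entry.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:  # skip entries without a usable URL
            seeds.append((loc, lastmod))
    return seeds


# Hypothetical sitemap content for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.org/article-1.html</loc><lastmod>2016-04-01</lastmod></url>
  <url><loc>http://www.example.org/article-2.html</loc></url>
</urlset>"""

seeds = seeds_from_sitemap(SAMPLE)
```

Using the sitemap as the seed list addresses the problem that article links disappear from the start page. Where present, the `lastmod` value could also inform the open question of how often to crawl.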