Digital Libraries & Archives



slide-1
SLIDE 1

Digital Libraries & Archives

Max Kemman

University of Luxembourg October 4, 2015

Doing Digital History: Introduction to Tools and Technology

slide-2
SLIDE 2

Recap from last time

Why would we want to write for the web? Can we write an HTML document?

slide-3
SLIDE 3

Today

  • Libraries & Archives
  • Turning the "analog signal" into a "digital signal"
  • Turning the "digital signal" into machine-readable data
  • Making the machine-readable data searchable
  • Current state of the art
  • A Digital Archive of Letters
  • Next time
slide-4
SLIDE 4

Libraries & Archives

What is a library? What is an archive?

slide-5
SLIDE 5

Aspects of an archive

  • Provenance
  • Respect des fonds
  • Respect de l'ordre
  • Context
  • Historical sensation?
slide-6
SLIDE 6

What is a digital library/archive?

Borgman, C. L. (1999). What are digital libraries? Competing visions. Information Processing and Management, 35(3), 227–243.

  • Content collected on behalf of users
  • Institution
  • Service

Is a digital library or archive more than a database?
slide-7
SLIDE 7

Reasons for digitising

Terras - Digitisation and Digital Resources in the Humanities

What are the 8 things Terras describes?

  • 1. Access
  • 2. Search
  • 3. Reinstate out of print materials
  • 4. Display material in inaccessible formats
  • 5. Enhancing of digital images
  • 6. Conserve fragile objects
  • 7. Integration into teaching materials
  • 8. Collection of geographically dispersed material
slide-8
SLIDE 8

Reasons for digitising

  • Enhancing of digital images: Google Art Project
  • Collection of geographically dispersed material: Europeana

slide-9
SLIDE 9

What is digitisation?

Terras describes 3 stages of digitisation. What are they?

  • Turning the "analog signal" into a "digital signal"
  • Turning the "digital signal" into machine-readable data
  • Making the machine-readable data searchable
slide-10
SLIDE 10

Turning the "analog signal" into a "digital signal"

Terras describes three forms of material:

  • Text
  • Sound and moving images
  • 3D objects
slide-11
SLIDE 11

Text

(all slides concerning digitisation of text kindly provided by eCodicology - Hannah Busch)

  • 1. Digital photography: Grazer Büchertisch, Wolfenbütteler Buchspiegel, multispectral photography
  • 2. Scan: flatbed scanner, overhead scanner

slide-12
SLIDE 12

Digital photography: Grazer Büchertisch

slide-13
SLIDE 13

Digital photography: Wolfenbütteler Buchspiegel

slide-14
SLIDE 14

Digital photography

slide-15
SLIDE 15

Digital photography: multispectral imaging

slide-16
SLIDE 16

Scan: flatbed scanner, overhead scanner

slide-17
SLIDE 17

Scan: automatic scanning

https://www.youtube.com/embed/cmhIJOqepVU

slide-18
SLIDE 18

Scanning a book without opening it

http://gizmodo.com/mit-invented-a-camera-that-can-read-closed-books-1786522492

slide-19
SLIDE 19

Requirements for digital images

  • Resolution in DPI (dots per inch): minimum of 300
  • RGB colour space
  • TIFF format
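
To get a feeling for what these requirements mean in practice, here is a minimal sketch that estimates the pixel dimensions and the uncompressed size of one scanned page. The page size (A4) and the colour depth (8 bits per RGB channel) are assumptions for the example, not requirements from the slides.

```python
MM_PER_INCH = 25.4

def pixel_dimensions(width_in, height_in, dpi=300):
    """Return (width_px, height_px) for a page scanned at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)

def uncompressed_size_bytes(width_px, height_px, channels=3, bits_per_channel=8):
    """Raw image size without compression (assumes 3 RGB channels of 8 bits each)."""
    return width_px * height_px * channels * bits_per_channel // 8

# Assumption: an A4 page of 210 mm x 297 mm, scanned at the minimum of 300 DPI.
w_px, h_px = pixel_dimensions(210 / MM_PER_INCH, 297 / MM_PER_INCH)
size = uncompressed_size_bytes(w_px, h_px)
print(w_px, h_px, round(size / 1_000_000, 1), "MB")  # 2480 3508 26.1 MB
```

At roughly 26 MB per uncompressed page, a book of a few hundred pages already takes several gigabytes, which is one reason storage is part of the cost of digitisation.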
slide-20
SLIDE 20

Audio and moving images

If you thought text was hard...

slide-21
SLIDE 21

Photos kindly provided by NISV - made by Marco Hofsté

slide-22
SLIDE 22

Photos kindly provided by NISV - made by Marco Hofsté

slide-23
SLIDE 23

Photos kindly provided by NISV - made by Marco Hofsté

slide-24
SLIDE 24

Photos kindly provided by NISV - made by Marco Hofsté

slide-25
SLIDE 25

Audio and moving images

After digitising the film, it still needs to be synchronised with the audio

slide-26
SLIDE 26

3D objects

Two characteristics of interest

  • Setting: tabletop, tripod, handheld
  • Light: laser, white
slide-27
SLIDE 27

Scent?

http://www.atlasobscura.com/articles/meet-the-woman-who-is-preserving-the-smell-of-history

slide-28
SLIDE 28

Turning the "digital signal" into machine-readable data

Re-keying vs OCR?

  • Re-keying: manual transcription

slide-29
SLIDE 29

Turning the "digital signal" into machine-readable data

Re-keying vs OCR?

  • Re-keying: manual transcription
  • OCR (Optical Character Recognition): computer interprets each letter

slide-30
SLIDE 30

Optical Character Recognition difficulties

  • OCR is not perfect
  • Letters change: s / ſ / f

slide-31
SLIDE 31

OCR difficulties

OCR quality depends on

  • Quality of the original document: letters and pages
  • Quality of the image

Not possible for handwritten material
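
One common way to put a number on OCR quality is the character error rate: the edit distance between a manual transcription and the OCR output, divided by the length of the transcription. The sketch below is an illustration only; the sample strings are hypothetical and the slides do not prescribe this measure.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute ca by cb
        previous = current
    return previous[-1]

def character_error_rate(reference, ocr_output):
    """Edit distance relative to the length of the reference transcription."""
    return levenshtein(reference, ocr_output) / len(reference)

def normalise_long_s(text):
    """Replace the historical long s (ſ) with a modern s."""
    return text.replace("ſ", "s")

# Hypothetical example: the OCR reads the long s as f and the U as V.
reference = "Congreſs of the United States"
ocr_output = "Congrefs of the Vnited States"
print(levenshtein(reference, ocr_output))
print(character_error_rate(reference, ocr_output))
print(normalise_long_s(reference))
```

Normalising the long s before indexing means a search for "Congress" also finds the historical spelling, but only if the OCR recognised the ſ correctly in the first place.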
slide-32
SLIDE 32

Handwritten material

(Monk project)

slide-33
SLIDE 33

Audio and visual material (simplified)

  • Speech to text
  • Keyframes
  • Edge detection
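
As a very simplified illustration of edge detection, the sketch below marks the places in a tiny greyscale image where brightness jumps sharply between neighbouring pixels. The image values and the threshold are made up for the example; real systems use more robust filters.

```python
def horizontal_edges(image, threshold=50):
    """Mark positions where brightness changes by more than `threshold`
    between a pixel and its right-hand neighbour."""
    edges = []
    for row in image:
        edges.append([1 if abs(row[x + 1] - row[x]) > threshold else 0
                      for x in range(len(row) - 1)])
    return edges

# Hypothetical 3x4 greyscale image (0 = black, 255 = white):
# a dark area on the left, a bright area on the right.
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 10, 200],
]
edges = horizontal_edges(image)
for row in edges:
    print(row)
```

Each output row is one element shorter than the input row, because every value describes the transition between two neighbouring pixels.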

slide-34
SLIDE 34

Making the machine-readable data searchable

Bush - As We May Think

  • Too much information out there
  • Compression for storage is not enough: we need to be able to consult it
  • Not just extraction, but selection

slide-35
SLIDE 35

Selecting material

Searching libraries and archives?

In non-digital archives & libraries, distinction between:

  • Data - the object
  • Metadata - the description of the object

Metadata is used to find the object

Indexing: data sorted alphabetically or numerically

slide-36
SLIDE 36

Index

  • Alphabetical list with pointers to location
  • Full-text search: the contents are used to find the object (data or metadata?)
  • Keyword search: term frequency-inverse document frequency
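
The ideas on this slide can be sketched in a few lines: an index maps each term to the documents that contain it, and tf-idf gives a term more weight when it is frequent in one document but rare in the collection. The documents and the simple whitespace tokeniser are assumptions for the example; search engines use more refined variants of the formula.

```python
import math
from collections import Counter, defaultdict

# Hypothetical mini-collection.
documents = {
    "letter-1": "the ship arrived in the harbour",
    "letter-2": "the harvest was poor this year",
    "letter-3": "the ship left the harbour at dawn",
}

def tokenize(text):
    """Very naive tokeniser: lower-case and split on whitespace."""
    return text.lower().split()

def build_index(docs):
    """Map every term to the set of documents it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def tf_idf(term, doc_id, docs, index):
    """Term frequency in one document times inverse document frequency."""
    tokens = tokenize(docs[doc_id])
    tf = Counter(tokens)[term] / len(tokens)
    idf = math.log(len(docs) / len(index[term])) if term in index else 0.0
    return tf * idf

index = build_index(documents)
print(sorted(index["ship"]))                           # full-text lookup
print(tf_idf("the", "letter-1", documents, index))     # occurs everywhere
print(tf_idf("harvest", "letter-2", documents, index)) # occurs in one letter
```

A word such as "the" appears in every document, so its weight is zero; "harvest" appears in only one letter and therefore says much more about what that letter is about.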

slide-37
SLIDE 37

Association of documents

  • Bush: human mind works by association
  • Memex: tying items together
  • Web: hyperlinks!
  • Keyword search: Google PageRank
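
The basic idea behind PageRank is that a page is important when important pages link to it. The sketch below follows the simplified textbook formulation with a damping factor; the four-page web is hypothetical, and this is not a description of how Google ranks pages in production.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to;
    every linked page must also appear as a key."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page divides its rank equally over the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical web of four pages.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page C ends up on top because three of the four pages link to it, while D, which nobody links to, ends up at the bottom.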

slide-38
SLIDE 38

Association of documents/objects

Linked Data / Semantic Web

https://www.youtube.com/embed/TJfrNo3Z-DU

Keyword search: Google Knowledge Graph
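
Linked Data describes the world as statements of the form subject, predicate, object ("triples"). A minimal sketch of that idea, using a plain Python list instead of a real triple store; the predicate names are invented for the example.

```python
# Illustrative triples; the predicate names are hypothetical.
triples = [
    ("Vannevar Bush", "wrote", "As We May Think"),
    ("Vannevar Bush", "proposed", "Memex"),
    ("As We May Think", "describes", "Memex"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

print(match(triples, subject="Vannevar Bush"))  # everything stated about Bush
print(match(triples, obj="Memex"))              # everything pointing to the Memex
```

Because every statement links two things, following the triples from one item to the next is a form of the association Bush had in mind.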

slide-39
SLIDE 39

Audiovisual material

  • Similarity search
  • Content search?

slide-40
SLIDE 40

Audiovisual material

Search in video?

slide-41
SLIDE 41

Audiovisual material

Search in video?

slide-42
SLIDE 42

Audiovisual material

Search in video?

slide-43
SLIDE 43

Audiovisual material

Search in video?

slide-44
SLIDE 44

Current state of the art

slide-45
SLIDE 45

Heritage digitised in Europe

  • About 10% digitised
  • In Europeana: 12% of digitised material
  • Estimated cost of digitising 100%: €100 billion

slide-46
SLIDE 46

Aspects of an archive

  • Provenance
  • Respect des fonds
  • Original order
  • Context
  • Historical sensation?

Does a digital archive reflect this?

  • Keyword search: no order, limited context
  • No authentic documents

slide-47
SLIDE 47

Search

  • Full-text search works, but is limited by imperfections of OCR
  • Audiovisual search is starting to get interesting

slide-48
SLIDE 48

Search

With these millions of objects, Terras states that simple access tools are not enough

Can we research the digital library or archive as a whole?

slide-49
SLIDE 49

A Digital Archive of Letters

During this course we will use a collection of letters

How are letters different from other texts (Dobson)?

Data & Metadata

  • Content of the letters
  • Sender
  • Receiver
  • Date
  • Location
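
The distinction between data (the content) and metadata (sender, receiver, date, location) can be made concrete as a small record. This is a minimal sketch; the field names follow the list above, and the two sample letters are invented.

```python
import datetime
from dataclasses import dataclass

@dataclass
class Letter:
    # Metadata: describes the letter
    sender: str
    receiver: str
    date: datetime.date
    location: str
    # Data: the letter itself
    content: str

# Hypothetical letters.
letters = [
    Letter("Anna", "Bert", datetime.date(1890, 5, 1), "Luxembourg", "Dear Bert, ..."),
    Letter("Bert", "Anna", datetime.date(1890, 5, 9), "Trier", "Dear Anna, ..."),
]

def letters_from(letters, sender):
    """Select letters using only the metadata, without reading the content."""
    return [letter for letter in letters if letter.sender == sender]

print(len(letters_from(letters, "Anna")))
```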
slide-50
SLIDE 50

A single letter

  • What is the letter about?
  • Why did the author write this letter?

slide-51
SLIDE 51

A set of letters

  • What are the letters about?
  • Are there differences between the letters?
  • Who are the senders and receivers?
  • Do we find a community?

slide-52
SLIDE 52

A whole lot of letters

  • What kind of subjects are covered in the collection?
  • Are there differences in time?
  • Who are the senders and receivers?
  • Do we find communities of people writing one another?
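
One simple way to look for communities is to treat every sender and receiver as a point in a network and to group everyone who is connected through some chain of letters. The sketch below uses this definition (connected components), which is only one of several possible definitions of a community; the names are hypothetical.

```python
from collections import defaultdict

def communities(correspondence):
    """Group people who are connected, directly or indirectly, by letters.
    `correspondence` is a list of (sender, receiver) pairs."""
    graph = defaultdict(set)
    for sender, receiver in correspondence:
        graph[sender].add(receiver)
        graph[receiver].add(sender)

    seen, groups = set(), []
    for person in graph:
        if person in seen:
            continue
        # Walk through the network starting from this person.
        group, to_visit = set(), [person]
        while to_visit:
            current = to_visit.pop()
            if current in group:
                continue
            group.add(current)
            to_visit.extend(graph[current] - group)
        seen |= group
        groups.append(group)
    return groups

# Hypothetical (sender, receiver) pairs taken from the metadata.
pairs = [("Anna", "Bert"), ("Bert", "Clara"), ("Dirk", "Eva")]
groups = communities(pairs)
print([sorted(group) for group in groups])
```

Note that this uses only the metadata: we can find who forms a group without reading a single letter.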

slide-53
SLIDE 53

Digital letters

  • To do such research with a computer, we need a lot of letters in digital form
  • As we just saw, digitisation is not trivial
  • Can we use digital-born letters?

slide-54
SLIDE 54
slide-55
SLIDE 55

A Republic of Emails

Some more background: https://en.wikipedia.org/wiki/Hillary_Clinton_email_controversy

  • Hillary Clinton used her own email server for government business
  • When this was discovered, she was made to disclose her email, and the government had to provide emails as part of a FOIA request
  • WikiLeaks then hosted the emails on their website: https://wikileaks.org/clinton-emails/
  • We have 30,322 emails & attachments, 50,547 pages, from the period 30 June 2010 to 12 August 2013
  • A total of 7,570 emails sent by Hillary Clinton (25%)
slide-56
SLIDE 56

Creating a database of emails

  • Let's try with one email: https://wikileaks.org/clinton-emails/emailid/2
  • Let's try another one: https://wikileaks.org/clinton-emails/emailid/123
  • What is an email? Is it the same as a letter?
  • Can we do this for 30,322 emails?
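
Like a letter, an email consists of metadata (the header fields) and data (the body). The sketch below splits the two with Python's standard `email` module. The message is invented for the example and is not taken from the collection; how the course's own scripts do this is not shown in the slides.

```python
from email import message_from_string

# Hypothetical email in the standard header/body format.
raw_email = """\
From: alice@example.org
To: bob@example.org
Date: Wed, 30 Jun 2010 09:15:00 -0400
Subject: Schedule

Can we move the call to 3pm?
"""

def split_email(raw):
    """Separate the metadata (header fields) from the content (body)."""
    message = message_from_string(raw)
    metadata = {field: message[field] for field in ("From", "To", "Date", "Subject")}
    content = message.get_payload()
    return metadata, content

metadata, content = split_email(raw_email)
print(metadata)
print(content)
```

Sender, receiver and date map directly onto the letter metadata; location has no obvious counterpart in an email header.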

slide-57
SLIDE 57

Creating a database of emails

  • We 'scraped' WikiLeaks automatically to get all the emails
  • Because of the size, we separated the content from the metadata and saved these in folders of up to 1,000 items

slide-58
SLIDE 58

Current state of the database

Folder  #items    Folder  #items    Folder  #items    Folder  #items
f-0        999    f-10     1,000    f-20     1,000    f-30       323
f-1      1,000    f-11     1,000    f-21     1,000
f-2      1,000    f-12       998    f-22     1,000
f-3      1,000    f-13       997    f-23     1,000
f-4      1,000    f-14       998    f-24     1,000
f-5      1,000    f-15     1,000    f-25       998
f-6      1,000    f-16     1,000    f-26     1,000
f-7      1,000    f-17     1,000    f-27     1,000
f-8      1,000    f-18     1,000    f-28       999
f-9      1,000    f-19       998    f-28       999

Is our database complete? Does it matter?
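
A quick way to approach the completeness question is to add up the folder counts and compare the total with the 30,322 emails mentioned earlier. The sketch below copies the counts from the table in reading order, including the folder label that appears twice.

```python
EXPECTED_TOTAL = 30_322  # number of emails & attachments stated earlier

# (folder, number of items) as listed in the table.
folder_counts = [
    ("f-0", 999), ("f-1", 1000), ("f-2", 1000), ("f-3", 1000), ("f-4", 1000),
    ("f-5", 1000), ("f-6", 1000), ("f-7", 1000), ("f-8", 1000), ("f-9", 1000),
    ("f-10", 1000), ("f-11", 1000), ("f-12", 998), ("f-13", 997), ("f-14", 998),
    ("f-15", 1000), ("f-16", 1000), ("f-17", 1000), ("f-18", 1000), ("f-19", 998),
    ("f-20", 1000), ("f-21", 1000), ("f-22", 1000), ("f-23", 1000), ("f-24", 1000),
    ("f-25", 998), ("f-26", 1000), ("f-27", 1000), ("f-28", 999),
    ("f-28", 999),  # label repeated, as in the table
    ("f-30", 323),
]

total = sum(count for _, count in folder_counts)
missing = EXPECTED_TOTAL - total
print("items in database:", total)
print("missing:", missing)
```

The total comes to 30,309, so 13 items are missing. Whether that matters depends on the research question: for a broad overview of subjects it probably does not, for a claim about one specific exchange it might.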

slide-59
SLIDE 59

For next time

11 October

Big Data

Reading: (see Moodle)

  • Wallach, H. (2014). Big Data, Machine Learning, and the Social Sciences: Fairness, Accountability, and Transparency. Medium.
  • Hitchcock, T. (2014). Big Data, Small Data and Meaning. Historyonics.