SLIDE 1 Digital Libraries & Archives
Max Kemman
University of Luxembourg October 4, 2015
Doing Digital History: Introduction to Tools and Technology
SLIDE 2
Recap from last time
Why would we want to write for the web?
Can we write an HTML document?
SLIDE 3 Today
- Libraries & Archives
- Turning the "analog signal" into a "digital signal"
- Turning the "digital signal" into machine-readable data
- Making the machine-readable data searchable
- Current state of the art
- A Digital Archive of Letters
- Next time
SLIDE 4
Libraries & Archives
What is a library?
What is an archive?
SLIDE 5 Aspects of an archive
Provenance
- Respect des fonds
- Respect de l'ordre (original order)
- Context
- Historical sensation?
SLIDE 6 What is a digital library/archive?
Borgman, C. L. (1999). What are digital libraries? Competing visions. Information Processing and Management, 35(3), 227–243.
Content collected on behalf of users
- Institution
- Service
- Is a digital library or archive more than a database?
SLIDE 7 Reasons for digitising
Terras - Digitisation and Digital Resources in the Humanities
What are the 8 things Terras describes?
- 1. Access
- 2. Search
- 3. Reinstate out of print materials
- 4. Display material in inaccessible formats
- 5. Enhancing of digital images
- 6. Conserve fragile objects
- 7. Integration into teaching materials
- 8. Collection of geographically dispersed material
SLIDE 8
Reasons for digitising
- Enhancing of digital images: Google Art Project
- Collection of geographically dispersed material: Europeana
SLIDE 9 What is digitisation?
Terras describes three stages of digitisation. What are they?
- Turning the "analog signal" into a "digital signal"
- Turning the "digital signal" into machine-readable data
- Making the machine-readable data searchable
SLIDE 10 Turning the "analog signal" into a "digital signal"
Terras describes three forms of material:
Text
- Sound and moving images
- 3D objects
SLIDE 11 Text
(all slides concerning digitisation of text kindly provided by eCodicology - Hannah Busch)
- Grazer Büchertisch
- Wolfenbütteler Buchspiegel
- Multispectral photography
- Flatbed scanner
- Overhead scanner
SLIDE 12
Digital photography: Grazer Büchertisch
SLIDE 13
Digital photography: Wolfenbütteler Buchspiegel
SLIDE 14
Digital photography
SLIDE 15
Digital photography: multispectral imaging
SLIDE 16
Scan: flatbed scanner, overhead scanner
SLIDE 17
Scan: automatic scanning
https://www.youtube.com/embed/cmhIJOqepVU
SLIDE 18 Scanning a book without opening it
http://gizmodo.com/mit-invented-a-camera-that-can-read-closed-books-1786522492
SLIDE 19 Requirements for digital images
Resolution in DPI (dots per inch): minimum of 300
- RGB colour space
- TIFF format
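To get a feel for what these requirements mean in practice, the sketch below estimates the pixel dimensions and uncompressed size of one scanned page. The A4 page size and 8 bits per colour channel are assumptions for illustration and do not come from the slides.

```python
def scan_size(width_in, height_in, dpi=300, channels=3, bytes_per_channel=1):
    """Return (width_px, height_px, size_bytes) for an uncompressed scan."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    size_bytes = width_px * height_px * channels * bytes_per_channel
    return width_px, height_px, size_bytes


# Assumed example: an A4 page (8.27 x 11.69 inches) at 300 DPI in RGB.
w, h, size = scan_size(8.27, 11.69)
print(w, "x", h, "pixels,", round(size / 1_000_000, 1), "MB uncompressed")
```

Under these assumptions a single page comes to roughly 26 MB, which is why storage quickly becomes a concern for large collections.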
SLIDE 20
Audio and moving images
If you thought text was hard...
SLIDE 21 Photos kindly provided by NISV - made by Marco Hofsté
SLIDE 22 Photos kindly provided by NISV - made by Marco Hofsté
SLIDE 23 Photos kindly provided by NISV - made by Marco Hofsté
SLIDE 24 Photos kindly provided by NISV - made by Marco Hofsté
SLIDE 25
Audio and moving images
After digitising the film, it needs to be synchronised with the audio
SLIDE 26 3D objects
Two characteristics of interest
Setting
- Tabletop
- Tripod
- Handheld
Light
- Laser
- White
SLIDE 27 Scent?
http://www.atlasobscura.com/articles/meet-the-woman-who-is-preserving-the-smell-of-history
SLIDE 28
Turning the "digital signal" into machine-readable data
Re-keying vs OCR? Re-keying: manual transcription
SLIDE 29
Turning the "digital signal" into machine-readable data
Re-keying vs OCR?
Re-keying: manual transcription
OCR (Optical Character Recognition): computer interprets each letter
SLIDE 30
Optical Character Recognition difficulties
OCR is not perfect
Letters change: s / ſ / f
SLIDE 31 OCR difficulties
OCR quality depends on
- Quality of the original document: letters and pages
- Quality of the image
Not possible for handwritten material
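One common way to put a number on OCR quality is the character error rate: the edit distance between the OCR output and a manual transcription, divided by the length of the transcription. The sketch below is a minimal illustration; the two sample strings are invented and mimic a long s (ſ) being misread as f.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text, reference):
    """Edit distance relative to the length of the reference transcription."""
    return edit_distance(ocr_text, reference) / len(reference)


reference = "the same sense"   # invented manual transcription
ocr_text = "the fame fenfe"    # invented OCR output with s read as f
cer = character_error_rate(ocr_text, reference)
print(f"character error rate: {cer:.2%}")
```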
SLIDE 32 Handwritten material
(Monk project)
SLIDE 33
Audio and visual material (simplified)
- Speech to text
- Keyframes
- Edge detection
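As a rough illustration of edge detection, the sketch below marks pixels where the brightness jumps sharply compared with the pixel to their left. The tiny greyscale image and the threshold are invented; real systems use more refined filters.

```python
def horizontal_edges(image, threshold=100):
    """Mark pixels whose brightness differs sharply from their left neighbour."""
    edges = []
    for row in image:
        edge_row = [0]  # the first pixel has no left neighbour
        for x in range(1, len(row)):
            jump = abs(row[x] - row[x - 1])
            edge_row.append(1 if jump >= threshold else 0)
        edges.append(edge_row)
    return edges


# Invented 2x4 greyscale image: dark on the left, bright on the right.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
edges = horizontal_edges(image)
print(edges)
```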
SLIDE 34
Making the machine-readable data searchable
Bush - As We May Think
- Too much information out there
- Compression for storage is not enough: need to be able to consult it
- Not just extraction, but selection
SLIDE 35 Selecting material
Searching libraries and archives?
In non-digital archives & libraries, distinction between:
Data - the object
- Metadata - the description of the object
- Metadata is used to find the object
Indexing: data sorted alphabetically or numerically
SLIDE 36
Index
Alphabetical list with pointers to location
Full-text search: the contents used to find the object: meta/data?
Keyword search: term frequency-inverse document frequency
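Term frequency-inverse document frequency (tf-idf) gives a word a high weight when it occurs often in one document but in few documents overall. The sketch below uses one common variant of the formula; the three sample documents are invented, and real search engines add tokenisation, smoothing and normalisation on top.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


documents = [
    "the archive holds letters",
    "the library holds books",
    "the letters mention the archive",
]
weights = tf_idf(documents)
for doc, w in zip(documents, weights):
    print(doc, "->", max(w, key=w.get))
```

A word such as "the", which occurs in every document, gets weight zero: it does not help to tell documents apart.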
SLIDE 37
Association of documents
Bush: human mind works by association
Memex: tying items together
Web: hyperlinks!
Keyword search: Google PageRank
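The idea behind PageRank is that a page is important when important pages link to it. The sketch below is a simplified version of the published algorithm, not Google's production system; the four-page link structure and the damping value of 0.85 are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the pages it links to; returns {page: score}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1 / n for page in pages}

    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Pass this page's score on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Invented example: pages A, B and D all link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```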
SLIDE 38 Association of documents/objects
Linked Data / Semantic Web
https://www.youtube.com/embed/TJfrNo3Z-DU
Keyword search: Google Knowledge Graph (example)
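Linked Data describes the world as statements of the form subject, predicate, object. The sketch below is a heavily simplified stand-in: it uses plain strings where real Linked Data would use URIs, and the `match` helper is hypothetical, not part of any standard library.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# A few statements as (subject, predicate, object).
triples = [
    ("Vannevar Bush", "wrote", "As We May Think"),
    ("As We May Think", "describes", "Memex"),
    ("Vannevar Bush", "type", "Person"),
]

# Everything we know about Vannevar Bush:
print(match(triples, subject="Vannevar Bush"))
# What describes the Memex?
print(match(triples, predicate="describes", obj="Memex"))
```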
SLIDE 39
Audiovisual material
Similarity search
Content search?
SLIDE 40
Audiovisual material
Search in video?
SLIDE 41
Audiovisual material
Search in video?
SLIDE 42
Audiovisual material
Search in video?
SLIDE 43
Audiovisual material
Search in video?
SLIDE 44
Current state of the art
SLIDE 45 Heritage digitized in Europe
About 10% digitized
In Europeana: 12%
Estimated cost of digitising 100%: €100 billion
SLIDE 46 Aspects of an archive
Provenance
- Respect des fonds
- Original order
- Context
- Historical sensation?
- Does a digital archive reflect this?
Keyword search: no order, limited context
No authentic documents
SLIDE 47
Search
Full-text search works, but limited by imperfections of OCR
Audiovisual search is starting to get interesting
SLIDE 48
Search
With these millions of objects, Terras states that simple access tools are not enough
Can we research the digital library or archive as a whole?
SLIDE 49 A Digital Archive of Letters
During this course we will use a collection of letters
How are letters different from other texts (Dobson)?
Data & Metadata
Content of the letters
- Sender
- Receiver
- Date
- Location
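The split between data and metadata can be sketched as a simple record: the content is the data, the other fields are the metadata used to find it. The field names and the three sample letters below are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Letter:
    content: str    # data: the text of the letter
    sender: str     # metadata from here on
    receiver: str
    sent_on: date
    location: str


def letters_from(letters, sender):
    """Use metadata (the sender) to select letters."""
    return [letter for letter in letters if letter.sender == sender]


# Invented sample letters.
letters = [
    Letter("Thank you for the books.", "Anna", "Ben", date(1890, 3, 1), "Luxembourg"),
    Letter("The weather has turned.", "Ben", "Anna", date(1890, 3, 9), "Trier"),
    Letter("I arrive on Monday.", "Anna", "Clara", date(1890, 4, 2), "Luxembourg"),
]

for letter in letters_from(letters, "Anna"):
    print(letter.sent_on, letter.receiver, letter.content)
```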
SLIDE 50
A single letter
What is the letter about?
Why did the author write this letter?
SLIDE 51
A set of letters
What are the letters about?
Are there differences between the letters?
Who are the senders and receivers?
Do we find a community?
SLIDE 52
A whole lot of letters
What kind of subjects are covered in the collection?
Are there differences in time?
Who are the senders and receivers?
Do we find communities of people writing one another?
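A first, very rough way to look for communities is to treat every sender-receiver pair as a link and group everyone who is connected, directly or indirectly. This is only a sketch: connected groups are a crude proxy for communities, and the names below are invented.

```python
from collections import defaultdict


def communities(pairs):
    """Group people into connected components of the correspondence network."""
    neighbours = defaultdict(set)
    for sender, receiver in pairs:
        neighbours[sender].add(receiver)
        neighbours[receiver].add(sender)

    seen = set()
    groups = []
    for person in neighbours:
        if person in seen:
            continue
        # Walk outwards from this person to collect everyone reachable.
        group, stack = set(), [person]
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbours[current] - group)
        seen |= group
        groups.append(group)
    return groups


# Invented (sender, receiver) pairs.
pairs = [("Anna", "Ben"), ("Ben", "Clara"), ("Dora", "Emil")]
groups = communities(pairs)
print(groups)
```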
SLIDE 53
Digital letters
To do such research with a computer, we need a lot of letters in digital form
As we just saw, digitisation is not trivial
Can we use digital-born letters?
SLIDE 54
SLIDE 55 A Republic of Emails
Some more background: https://en.wikipedia.org/wiki/Hillary_Clinton_email_controversy
- Hillary Clinton used her own email server for government business
- When this was discovered, she was made to disclose her emails, and the government had to provide them as part of a FOIA request
- WikiLeaks then hosted the emails on their website: https://wikileaks.org/clinton-emails/
- We have 30,322 emails & attachments, 50,547 pages, from the period 30 June 2010 to 12 August 2013
- A total of 7,570 emails sent by Hillary Clinton (25%)
SLIDE 56
Creating a database of emails
Let's try with one email: https://wikileaks.org/clinton-emails/emailid/2
Let's try another one: https://wikileaks.org/clinton-emails/emailid/123
What is an email? Is it the same as a letter?
Can we do this for 30,322 emails?
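An email has the same kind of metadata as a letter (sender, receiver, date) in its headers, followed by the content. The sketch below parses an invented message with Python's standard `email` module. It assumes the message is available in the standard header-and-body form, which may not hold for the pages on the WikiLeaks site.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Invented example message in standard header/body form.
raw = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Date: Wed, 30 Jun 2010 09:15:00 +0000\n"
    "Subject: Meeting\n"
    "\n"
    "Can we meet on Friday?\n"
)

message = message_from_string(raw)

# Metadata: taken from the headers.
metadata = {
    "sender": message["From"],
    "receiver": message["To"],
    "date": parsedate_to_datetime(message["Date"]).date(),
    "subject": message["Subject"],
}
# Data: the body of the message.
content = message.get_payload().strip()

print(metadata)
print(content)
```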
SLIDE 57
Creating a database of emails
We 'scraped' WikiLeaks automatically to get all the emails
Because of the size, we separated the content from the metadata and saved these in folders of 1,000 items:
SLIDE 58
Current state of the database
Folder  #items   Folder  #items   Folder  #items   Folder  #items
f-0       999    f-10    1,000    f-20    1,000    f-30      323
f-1     1,000    f-11    1,000    f-21    1,000
f-2     1,000    f-12      998    f-22    1,000
f-3     1,000    f-13      997    f-23    1,000
f-4     1,000    f-14      998    f-24    1,000
f-5     1,000    f-15    1,000    f-25      998
f-6     1,000    f-16    1,000    f-26    1,000
f-7     1,000    f-17    1,000    f-27    1,000
f-8     1,000    f-18    1,000    f-28      999
f-9     1,000    f-19      998    f-29      999
Is our database complete?
Does it matter?
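The completeness question can be checked by comparing the folder counts with what we would expect. The sketch below assumes that email IDs run from 1 to 30,322 and that each email goes into folder number ID // 1000. That scheme is a guess: it is consistent with f-0 holding 999 items and f-30 holding 323, but the actual scraping script is not shown here.

```python
TOTAL_EMAILS = 30_322  # total number of emails and attachments from the slides


def expected_counts(total, batch_size=1000):
    """Items per folder if IDs 1..total are grouped by ID // batch_size (assumed)."""
    counts = {}
    for email_id in range(1, total + 1):
        folder = email_id // batch_size
        counts[folder] = counts.get(folder, 0) + 1
    return counts


# Observed item counts per folder, in order f-0 to f-30, read off the folder table.
observed = (
    [999] + [1000] * 9                                             # f-0 .. f-9
    + [1000, 1000, 998, 997, 998, 1000, 1000, 1000, 1000, 998]     # f-10 .. f-19
    + [1000] * 5 + [998, 1000, 1000, 999, 999]                     # f-20 .. f-29 (last one taken to be f-29)
    + [323]                                                        # f-30
)

expected = expected_counts(TOTAL_EMAILS)
missing = {
    f"f-{n}": expected[n] - observed[n]
    for n in range(len(observed))
    if expected[n] != observed[n]
}

print(sum(observed), "of", TOTAL_EMAILS, "items present")
print("missing per folder:", missing)
```

Under these assumptions the folders hold 30,309 items, so 13 items are missing, spread over seven folders.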
SLIDE 59 For next time
11 October
Big Data
Reading: (see Moodle)
- Wallach, H. (2014). Big Data, Machine Learning, and the Social Sciences: Fairness, Accountability, and Transparency. Medium.
- Hitchcock, T. (2014). Big Data, Small Data and Meaning. Historyonics.