User Forum: T 1 OCR cataloguing project, Mark Bell and Katie Fox, 18 August 2016



SLIDE 1
SLIDE 2

User Forum – T 1 OCR cataloguing project

Mark Bell and Katie Fox
18 August 2016

SLIDE 3

What is T 1?

  • T 1 is a record series that contains correspondence of the Treasury Board and in-letters to the Treasury.
  • The series covers the years 1557-1946.

SLIDE 4

How to access T 1

  • Early material is accessible via calendars
  • 1920-1946: keyword searchable via Discovery
  • 1777-1920: you can access T 1 through the registry system (indexes in T 2 or T 108 and skeleton registers in T 3)
  • Information in the research guide: Treasury Board letters and papers 1557-1920
  • For 1910-1920 you can also keyword search on Discovery
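
The date ranges on this slide can be read as a simple lookup from a year to the available finding aids. The sketch below is only an illustration: the function name `access_routes` is hypothetical, and the slide does not say where "early material" ends, so the cut-off of 1777 for calendars is an assumption.

```python
def access_routes(year: int) -> list[str]:
    """Return the ways of finding T 1 material for a given year (illustrative only)."""
    if not 1557 <= year <= 1946:
        raise ValueError("T 1 covers the years 1557-1946")

    routes = []
    # Assumption: "early material" means anything before the registry system (pre-1777).
    if year < 1777:
        routes.append("calendars")
    # 1777-1920: registry system.
    if 1777 <= year <= 1920:
        routes.append("registry system (indexes in T 2 or T 108, skeleton registers in T 3)")
    # 1910-1920 and 1920-1946 are both keyword searchable on Discovery.
    if 1910 <= year <= 1946:
        routes.append("keyword search on Discovery")
    return routes


if __name__ == "__main__":
    for y in (1600, 1850, 1915, 1930):
        print(y, access_routes(y))
```

Note that 1910-1920 returns two routes, matching the overlap described in the bullets.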
SLIDE 5

The Supplementary Finding Aid

SLIDE 6

Automating the transcription

  • Researching automated text recognition products
  • T 1 was an ideal case study opportunity
  • Relatively small set of documents
  • Modern typeface
  • “Well worn” so challenging
  • Separate data items to treat differently

SLIDE 7

The recognition process

SLIDE 8

Correcting catalogue references

  • [Diagram] x 4 and x 1 give 5 outputs
  • Pick the majority/best outputs from Tesseract
  • Example: E10d65/h94) corrected to [10965/494]
  • Tagging: <catref> [10965/494] </catref> <datatype>C</datatype>
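
The "pick the majority/best" step and the tagging step on this slide can be sketched roughly as follows. This is not the project's actual code: the function names, the regular expression for a well-formed reference, and the rule that "best" means "matches that pattern" are all assumptions made for illustration. Only the strings `E10d65/h94)` and `[10965/494]` and the tag layout come from the slide; the other sample readings are invented, and the meaning of the datatype value `C` is not stated in the source.

```python
import re
from collections import Counter

# Assumed shape of a catalogue reference, based on the slide's single example.
CATREF_PATTERN = re.compile(r"^\[\d+/\d+\]$")


def pick_best(outputs):
    """Choose one reading from several OCR outputs of the same text.

    Readings that look like a well-formed reference are preferred (assumption);
    among those, the most frequent one wins.
    """
    cleaned = [o.strip() for o in outputs]
    valid = [o for o in cleaned if CATREF_PATTERN.match(o)]
    pool = valid or cleaned  # fall back to all readings if none is well formed
    return Counter(pool).most_common(1)[0][0]


def tag_catref(reference, datatype="C"):
    """Wrap a reference in the tag layout shown on the slide."""
    return f"<catref> {reference} </catref> <datatype>{datatype}</datatype>"


if __name__ == "__main__":
    # Five hypothetical OCR readings of one catalogue reference.
    readings = [
        "[10965/494]",
        "E10d65/h94)",
        "[10965/494]",
        "[10965/h94]",
        "[10965/494]",
    ]
    print(tag_catref(pick_best(readings)))
```

Voting across several runs helps because a single misread character in one output is unlikely to be repeated in the others.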

SLIDE 9

Comparison and Checking

SLIDE 10

Challenges

SLIDE 11

What next for OCR?

  • Undertake similar projects
  • Improve the QA and correction process
  • Learn from the QA and correction process
SLIDE 12

Any Questions?
