Compilation of a Large Ground-Truth Data Set Using Transkribus

slide-1
SLIDE 1

Compilation of a Large Ground-Truth Data Set Using Transkribus

Matthias Boenig & Kay-Michael Würzner

{boenig|wuerzner}@bbaw.de

Transkribus User Conference Vienna, 2nd November 2017

slide-2
SLIDE 2

2nd November 2017, Transkribus User Conference

Overview

Goal: Compilation of a large, homogeneous Ground Truth (GT) data set

  • Various heterogeneous sources
  • Annotation on the textual and/or structural level

Background: OCR-D initiative

  • a. Funding by the Deutsche Forschungsgemeinschaft

→ Improvement of OCR tools for historical prints (i.e. VD 16, VD 17, VD 18)

  • b. Coordination project
      • Identify to-dos, desiderata and improvement options
      • Development of a call for proposals
      • Merge (sub-)project results into a productive workflow

Procedure: Annotation with Transkribus

  • 1. Import images and existing text and/or structural information
  • 2. Harmonization and completion within Transkribus
slide-3
SLIDE 3

Overview

  • Various GT sources
  • Containing either text or structural annotations of varying quality
  • So far, ≈ 130 documents with ≈ 500 pages
  • A lot more to come!
slide-4
SLIDE 4

Workflows

Existing text | Existing structure
slide-5
SLIDE 5

Workflows

Existing text | Existing structure

  • Import images
slide-6
SLIDE 6

Workflows

slide-7
SLIDE 7

Workflows

slide-8
SLIDE 8

Workflows

Existing text | Existing structure

  • Import images
  • Run FineReader for initial layout version
  • Import Page XML
slide-9
SLIDE 9

Workflows

slide-10
SLIDE 10

Workflows

slide-11
SLIDE 11

Workflows

Existing text | Existing structure

  • Import images
  • Run FineReader for initial layout version
  • Import Page XML
  • Manually correct layout
  • Run external OCR for initial text version
slide-12
SLIDE 12

Workflows

slide-13
SLIDE 13

Workflows

Existing text | Existing structure

  • Import images
  • Run FineReader for initial layout version
  • Import Page XML
  • Manually correct layout
  • Run external OCR for initial text version
  • Copy and paste text region by region
slide-14
SLIDE 14

Workflows

slide-15
SLIDE 15

Workflows

Existing text | Existing structure

  • Import images
  • Run FineReader for initial layout version
  • Import Page XML
  • Manually correct layout
  • Run external OCR for initial text version
  • Copy and paste text region by region
  • Manually correct text
slide-16
SLIDE 16

Workflows

Existing text | Existing structure

  • Import images
  • Run FineReader for initial layout version
  • Import Page XML
  • Manually correct layout
  • Run external OCR for initial text version
  • Copy and paste text region by region
  • Manually correct text

Assessment:

  • Somewhat naïve approach
  • Alternative options: external Page XML creation, or intermediate export and (re-)import
  • Not very convenient
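
The "external Page XML creation" alternative could be sketched roughly as follows: existing paragraphs are written into the text regions of a Page XML file outside Transkribus, instead of being pasted by hand. This is only an illustration, not the procedure used in the project. The function name `fill_regions`, the sample page, and the one-paragraph-per-region pairing are assumptions, and the namespace is the 2013-07-15 PAGE version, which may differ from the one in a given export.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE content schema version 2013-07-15; adjust to your export.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
NS = {"pc": PAGE_NS}
ET.register_namespace("", PAGE_NS)  # serialize without an ns0: prefix

# Hypothetical minimal page with two empty text regions.
SAMPLE_PAGE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="page1.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1"><Coords points="10,10 500,10 500,200 10,200"/></TextRegion>
    <TextRegion id="r2"><Coords points="10,220 500,220 500,400 10,400"/></TextRegion>
  </Page>
</PcGts>"""


def fill_regions(page_xml: str, paragraphs: list[str]) -> str:
    """Write one existing paragraph into each TextRegion, in document order."""
    root = ET.fromstring(page_xml)
    regions = root.findall(".//pc:TextRegion", NS)
    if len(regions) != len(paragraphs):
        # Refuse to guess an alignment; a human has to resolve the mismatch.
        raise ValueError(
            f"{len(regions)} regions but {len(paragraphs)} paragraphs"
        )
    for region, text in zip(regions, paragraphs):
        # Region-level TextEquiv is a direct child of TextRegion.
        equiv = region.find("pc:TextEquiv", NS)
        if equiv is None:
            equiv = ET.SubElement(region, f"{{{PAGE_NS}}}TextEquiv")
        unicode_el = equiv.find("pc:Unicode", NS)
        if unicode_el is None:
            unicode_el = ET.SubElement(equiv, f"{{{PAGE_NS}}}Unicode")
        unicode_el.text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(fill_regions(SAMPLE_PAGE, ["First paragraph.", "Second paragraph."]))
```

The sketch assumes that the reading order of the regions matches the order of the paragraphs; where the layout was corrected manually, that order would still need to be checked before import.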
slide-17
SLIDE 17

Desiderata

  • Transkribus is a wonderful tool!
      • Support for polygonal regions
      • Multiple OCR options
      • Collaborative working environment with basic version control
      • TEI export
  • For GT creation, we would welcome
      • OCR application on specific regions, also for FineReader
      • Dedicated text import functionality (e.g. at paragraph level)
      • METS import that accounts for existing structural annotations and linked ALTO
      • Automatic support during manual post-correction
      • TEI import
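
To illustrate what "linked ALTO" means for a METS import, the following sketch lists the files that a METS document references in one file group. It is not a Transkribus feature. The function name `linked_files`, the sample document, and the group name `FULLTEXT` are assumptions (that `USE` value is common in library METS files, but a given data set may use another).

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical METS fragment with one image and one linked ALTO file.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                      xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="DEFAULT">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="images/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="FULLTEXT">
      <mets:file ID="ALTO_0001" MIMETYPE="text/xml">
        <mets:FLocat LOCTYPE="URL" xlink:href="alto/0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def linked_files(mets_xml: str, use: str = "FULLTEXT") -> dict[str, str]:
    """Map file IDs to their locations for all files in the given file group."""
    root = ET.fromstring(mets_xml)
    result: dict[str, str] = {}
    for group in root.findall(".//mets:fileGrp", NS):
        if group.get("USE") != use:
            continue
        for file_el in group.findall("mets:file", NS):
            location = file_el.find("mets:FLocat", NS)
            href = location.get(XLINK_HREF) if location is not None else None
            if href:
                result[file_el.get("ID")] = href
    return result


if __name__ == "__main__":
    print(linked_files(SAMPLE_METS))
```

An importer along the lines of the desideratum would additionally have to follow the structural map to pair each ALTO file with its page image; that step is left out here.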
slide-18
SLIDE 18

Collaboration

  • OCR-D GT Guidelines
      • Documentation of existing OCR-D GT
      • Instructions for GT creation
          • Already used within the OCR-D project
          • In the longer term, also intended for use in a broader context (community use)
      • Automatic validation of GT data
      • (Semi-)automatic conversion of existing GT data sets
      • Plans for setting up a GT repository for print publications and handwritten documents
  • Availability
      • View: https://kaskade.dwds.de/~matthias/ocr-d/
      • Sources: https://github.com/OCR-D/
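
As a rough idea of what automatic validation of GT data might check, the sketch below inspects the text regions of a Page XML file. It is not the OCR-D validator. The three rules (unique region IDs, a polygon with at least three points, non-empty region text), the function name `validate_page`, and the sample page are assumptions chosen for illustration, as is the 2013-07-15 PAGE namespace.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE content schema version 2013-07-15; adjust to your data.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
NS = {"pc": PAGE_NS}

# Hypothetical page: r1 is complete, r2 has a two-point outline and no text.
SAMPLE_GT = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="page1.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="10,10 500,10 500,200 10,200"/>
      <TextEquiv><Unicode>Some transcribed text</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2">
      <Coords points="0,0 1,1"/>
    </TextRegion>
  </Page>
</PcGts>"""


def validate_page(page_xml: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems: list[str] = []
    seen_ids: set[str] = set()
    root = ET.fromstring(page_xml)
    for region in root.findall(".//pc:TextRegion", NS):
        region_id = region.get("id", "<no id>")
        if region_id in seen_ids:
            problems.append(f"{region_id}: duplicate region id")
        seen_ids.add(region_id)
        # A polygon needs at least three "x,y" points.
        coords = region.find("pc:Coords", NS)
        points = coords.get("points", "").split() if coords is not None else []
        if len(points) < 3:
            problems.append(f"{region_id}: missing or degenerate polygon")
        # Region-level text lives in TextEquiv/Unicode.
        text = region.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
        if not text.strip():
            problems.append(f"{region_id}: empty text")
    return problems


if __name__ == "__main__":
    for problem in validate_page(SAMPLE_GT):
        print(problem)
```

Checks of this kind complement schema validation: a file can be valid against the PAGE schema and still lack text or carry unusable outlines.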

slide-19
SLIDE 19

Collaboration

  • Transkribus User Documentation: A proposal
      • Step 1: Change the documentation format from Wiki to DITA
          • XML-based documentation format
          • Topic-oriented internal and “external” structure (i.e. presentation)
          • Various automatically generated presentation modes
      • Step 2: Build and organize a documentation source repository (e.g. on GitHub)
      • Step 3: Involve the user community in the documentation process
          • Non-developer viewpoint
          • Recipes for frequent tasks
slide-20
SLIDE 20

Many thanks for your attention.