
Compilation of a Large Ground-Truth Data Set Using Transkribus


  1. Compilation of a Large Ground-Truth Data Set Using Transkribus
     Matthias Boenig & Kay-Michael Würzner, {boenig|wuerzner}@bbaw.de
     Transkribus User Conference, Vienna, 2nd November 2017

  2. Overview
     Goal: Compilation of a large, homogeneous Ground Truth (GT) data set
     - Various heterogeneous sources
     - Annotation on the textual and/or structural level
     Background: OCR-D initiative
     a. Funding by the Deutsche Forschungsgemeinschaft → Improvement of OCR tools for historical printings (i.e. VD 16, 17, 18)
     b. Coordination project
        - Identify to-dos, desiderata and improvement options
        - Development of a call for proposals
        - Merge (sub-)project results into a productive workflow
     Procedure: Annotation with Transkribus
     1. Import images and existing text and/or structural information
     2. Harmonization and completion within Transkribus

  3. Overview
     - Various GT sources
     - Containing either text or structural annotations of varying quality
     - By now, ≈ 130 documents with ≈ 500 pages
     - A lot more to come!

  4. Workflows: Existing text | Existing structure

  5. Workflows: Existing text | Existing structure; Import images

  6. Workflows

  7. Workflows

  8. Workflows: Existing text | Existing structure; Import images; Run FineReader for initial layout version | Import Page XML

  9. Workflows

  10. Workflows

  11. Workflows: Existing text | Existing structure; Import images; Run FineReader for initial layout version | Import Page XML; Manually correct layout | Run external OCR for initial text version

  12. Workflows

  13. Workflows: Existing text | Existing structure; Import images; Run FineReader for initial layout version | Import Page XML; Manually correct layout | Run external OCR for initial text version; Copy and paste text region by region

  14. Workflows

  15. Workflows: Existing text | Existing structure; Import images; Run FineReader for initial layout version | Import Page XML; Manually correct layout | Run external OCR for initial text version; Copy and paste text region by region; Manually correct text

  16. Workflows
      Existing text | Existing structure
      Import images
      → Run FineReader for initial layout version | Import Page XML
      → Manually correct layout | Run external OCR for initial text version
      → Copy and paste text region by region
      → Manually correct text
      - Somewhat naïve approach
      - External Page XML creation or intermediate export and (re-)import as alternative options
      - Not very convenient
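The "copy and paste text region by region" step could in principle be done outside Transkribus, which is what the slide calls external Page XML creation. The following is only a minimal sketch of that idea, not the authors' procedure: it assumes a Page XML file in the 2013-07-15 schema namespace, a hypothetical two-region sample page, and existing text that is already split into one paragraph per region in reading order. The helper name `fill_regions` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the PAGE 2013-07-15 schema (assumed here; other versions exist).
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"

# Hypothetical layout-only page: two text regions without any transcription.
SAMPLE_PAGE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="page_0001.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1"><Coords points="100,100 900,100 900,300 100,300"/></TextRegion>
    <TextRegion id="r2"><Coords points="100,400 900,400 900,900 100,900"/></TextRegion>
  </Page>
</PcGts>"""


def fill_regions(page_xml: str, paragraphs: list[str]) -> str:
    """Write one paragraph into each TextRegion, in document order."""
    ET.register_namespace("", PAGE_NS)  # keep the default namespace on output
    root = ET.fromstring(page_xml)
    ns = {"pc": PAGE_NS}
    regions = root.findall(".//pc:TextRegion", ns)
    if len(regions) != len(paragraphs):
        raise ValueError(
            f"{len(regions)} regions but {len(paragraphs)} paragraphs"
        )
    for region, text in zip(regions, paragraphs):
        # Reuse the region-level TextEquiv/Unicode elements if present,
        # otherwise append them (simplification: appended at the end).
        equiv = region.find("pc:TextEquiv", ns)
        if equiv is None:
            equiv = ET.SubElement(region, f"{{{PAGE_NS}}}TextEquiv")
        unicode_el = equiv.find("pc:Unicode", ns)
        if unicode_el is None:
            unicode_el = ET.SubElement(equiv, f"{{{PAGE_NS}}}Unicode")
        unicode_el.text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(fill_regions(SAMPLE_PAGE, ["Erster Absatz.", "Zweiter Absatz."]))
```

The sketch refuses to guess when the number of regions and paragraphs differ, since a silent misalignment would corrupt the ground truth. The result would still have to be re-imported and checked manually.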

  17. Desiderata
      - Transkribus is a wonderful tool!
        - Support for polygonal regions
        - Multiple OCR options
        - Collaborative working environment with basic version control
        - TEI export
      - For GT creation, we would welcome
        - OCR application on specific regions, also for FineReader
        - Dedicated text import functionalities (e.g. on paragraph level)
        - METS import which accounts for existing structural annotations and linked ALTO
        - Automatic support during manual post-correction
        - TEI import
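To illustrate what a paragraph-level text import from linked ALTO files might start from, here is a small sketch that collects the text of each ALTO `TextBlock`. It is an illustration under assumptions, not a Transkribus feature: the ALTO version 2 namespace is assumed (other ALTO versions use different namespaces), the sample document and the function name `alto_blocks` are hypothetical, and hyphenation markers are ignored.

```python
import xml.etree.ElementTree as ET

# ALTO version 2 namespace (assumption; adjust for other ALTO versions).
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"

# Hypothetical ALTO fragment with one text block of two lines.
SAMPLE_ALTO = f"""<alto xmlns="{ALTO_NS}">
  <Layout><Page ID="p1"><PrintSpace>
    <TextBlock ID="b1">
      <TextLine><String CONTENT="Erste"/><SP/><String CONTENT="Zeile"/></TextLine>
      <TextLine><String CONTENT="Zweite"/><SP/><String CONTENT="Zeile"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def alto_blocks(alto_xml: str) -> dict[str, str]:
    """Map each TextBlock ID to its text, one output line per TextLine."""
    root = ET.fromstring(alto_xml)
    ns = {"a": ALTO_NS}
    blocks = {}
    for block in root.findall(".//a:TextBlock", ns):
        lines = [
            # Words are stored in the CONTENT attribute of String elements.
            " ".join(s.get("CONTENT", "") for s in line.findall("a:String", ns))
            for line in block.findall("a:TextLine", ns)
        ]
        blocks[block.get("ID")] = "\n".join(lines)
    return blocks


if __name__ == "__main__":
    for block_id, text in alto_blocks(SAMPLE_ALTO).items():
        print(block_id, repr(text))
```

Keeping the block ID alongside the text leaves room to match blocks to regions later, for example via the structural links in a METS file.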

  18. Collaboration
      - OCR-D GT Guidelines
        - Documentation of existing OCR-D GT
        - Instructions for GT creation
          - Already used within the OCR-D project
          - In the future, also for use in a broader context (community use)
        - Automatic validation of GT data
        - (Semi-)automatic conversion of existing GT data sets
        - Plans for setting up a GT repository for print publications and handwritten documents
      - Availability
        View: https://kaskade.dwds.de/~matthias/ocr-d/
        Sources: https://github.com/OCR-D/
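As a rough idea of what automatic validation of GT data could look like at the file level, the sketch below runs a few sanity checks on a Page XML document. The checks (unique region ids, an outline with at least three points, a non-empty transcription) are illustrative assumptions and are not taken from the OCR-D GT Guidelines; the 2013-07-15 namespace, the sample pages, and the function name `validate_page` are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the PAGE 2013-07-15 schema (assumed here).
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
NS = {"pc": PAGE_NS}

# Hypothetical samples: one complete region, one incomplete region.
GOOD_PAGE = f"""<PcGts xmlns="{PAGE_NS}"><Page>
  <TextRegion id="r1"><Coords points="0,0 10,0 10,10 0,10"/>
    <TextEquiv><Unicode>Beispieltext</Unicode></TextEquiv></TextRegion>
</Page></PcGts>"""
BAD_PAGE = f"""<PcGts xmlns="{PAGE_NS}"><Page>
  <TextRegion id="r2"><Coords points="0,0 10,10"/></TextRegion>
</Page></PcGts>"""


def validate_page(page_xml: str) -> list[str]:
    """Return a list of human-readable problems; empty means no findings."""
    problems = []
    root = ET.fromstring(page_xml)
    regions = root.findall(".//pc:TextRegion", NS)
    if not regions:
        problems.append("page contains no TextRegion")
    seen_ids = set()
    for index, region in enumerate(regions, start=1):
        region_id = region.get("id")
        label = region_id or f"region #{index}"
        if not region_id:
            problems.append(f"{label}: missing id")
        elif region_id in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(region_id)
        # A polygonal outline needs at least three points.
        coords = region.find("pc:Coords", NS)
        points = coords.get("points", "").split() if coords is not None else []
        if len(points) < 3:
            problems.append(f"{label}: outline has fewer than 3 points")
        # Region-level transcription must not be empty.
        text = region.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
        if not text.strip():
            problems.append(f"{label}: empty transcription")
    return problems


if __name__ == "__main__":
    print(validate_page(GOOD_PAGE))
    print(validate_page(BAD_PAGE))
```

Returning a list of findings instead of raising on the first error makes it easier to report all problems of a page to the annotator at once.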

  19. Collaboration
      - Transkribus User Documentation: A proposal
        Step 1: Change the documentation format from Wiki to DITA
          - XML-based documentation format
          - Topic-oriented internal and "external" structure (i.e. presentation)
          - Various automatically generated presentation modes
        Step 2: Build and organize a documentation source repository (e.g. on GitHub)
        Step 3: Involve the user community in the documentation process
          - Non-developer viewpoint
          - Recipes for frequent tasks

  20. Many thanks for your attention.
