AN OPEN-SOURCE FRAMEWORK FOR INTEGRATING MULTI-SOURCE LAYOUT AND TEXT RECOGNITION TOOLS INTO SCALABLE OCR WORKFLOWS

SLIDE 1

AN OPEN-SOURCE FRAMEWORK FOR INTEGRATING MULTI-SOURCE LAYOUT AND TEXT RECOGNITION TOOLS INTO SCALABLE OCR WORKFLOWS

KAY-MICHAEL WÜRZNER, KONSTANTIN BAIERER
Bibliotheca Baltica 2018, Rostock, 2018-10-05

SLIDE 2

OVERVIEW

  • 1. Why OCR-D
  • 2. The OCR-D initiative
  • 3. Architecture
  • 4. State of OCR-D tools
  • 5. Scalability
  • 6. Open source
SLIDE 3

USERS WANT TEXT DATA

SLIDE 4

MASSIVE AMOUNTS

SLIDE 5

STRUCTURED

SLIDE 6

EASILY ACCESSIBLE

SLIDE 7

LIBRARIES PROVIDE TEXT DATA

SLIDE 8

LARGE AMOUNTS

SLIDE 9

UNSTRUCTURED

SLIDE 10

HARD TO ACCESS

SLIDE 11

WHY THE DISCREPANCY?

SLIDE 12

UNDERSPECIFIED REQUIREMENTS ON OCR BY FUNDERS AND USERS

SLIDE 13

OCR OF HISTORICAL DATA OF LITTLE ECONOMIC INTEREST
LITTLE COMPETITION

SLIDE 14

ACADEMIC SOLUTIONS NON-SUSTAINABLE
INFLEXIBLE WORKFLOWS

SLIDE 15

THE OCR-D INITIATIVE

  • DFG-organized expert workshop (2014): "Verfahren zur Verbesserung von OCR-Ergebnissen" (Methods for improving OCR results)
  • Result: A concerted effort to improve OCR is seen as necessary.

SLIDE 16

OCR-D COORDINATION PROJECT

SLIDE 17

PHASE 1: EXPLORING THE DOMAIN (2015-2017)

  • Surveyed the (open-source) ecosystem around OCR and OLR
  • Identified tasks
  • Prepared a call for proposals for the DFG

SLIDE 18

PHASE 2: MODULE PROJECT STAGE (2018-2019)

SLIDE 19

PHASE 3: GOING PRODUCTIVE (2018-2020)

  • Integrate with existing digitization workflow software, e.g. Kitodo
  • Make OCR-D-developed software uniformly deployable
  • Advise the DFG on OCR requirements for the "Praxisrichtlinien" (practice guidelines)

SLIDE 20

NOT IN THIS TALK

(BUT IN OCR-D)

  • GROUND TRUTH FOR OCR
  • OCR ENGINE TRAINING
  • OCR RESEARCH DATA REPOSITORY
  • WORKFLOW COMPOSITION AND PROVENANCE

SLIDE 21

ARCHITECTURE

SLIDE 22

"MULTI SOURCE"

  • Existing tools by OCR-D partners (tesseract, PoCoTo, LAREX...)
  • New developments within OCR-D (font identification, post-correction...)
  • Existing tools outside OCR-D (ocropus, kraken, ScanTailor, OLENA...)

SLIDE 23

MODULAR VS. MONOLITHIC

SLIDE 24

SPECIFICATION + IMPLEMENTATION

SLIDE 25

OCR-D/SPEC

  • METS + PAGE-XML (+ ALTO)
  • STRUCTURED TOOL DESCRIPTIONS
  • COMMAND LINE INTERFACE
  • HTTP INTERFACE
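
To make the role of METS as the container format a little more concrete, here is a minimal sketch that lists the files of each METS file group using only the Python standard library. The sample document, the file group names (`IMAGES`, `PAGE-XML`) and the paths are made up for illustration; the sketch does not show the conventions defined in OCR-D/spec.

```python
import xml.etree.ElementTree as ET

# Standard METS and XLink namespace URIs.
NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical, heavily reduced METS document (illustration only).
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="IMAGES">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="images/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="PAGE-XML">
      <mets:file ID="PAGE_0001" MIMETYPE="application/xml">
        <mets:FLocat LOCTYPE="URL" xlink:href="page/0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def files_by_group(mets_xml):
    """Map each file group's USE attribute to the list of file locations in it."""
    root = ET.fromstring(mets_xml)
    groups = {}
    for file_grp in root.findall(".//mets:fileGrp", NS):
        locations = [
            flocat.get(XLINK_HREF)
            for flocat in file_grp.findall("mets:file/mets:FLocat", NS)
        ]
        groups[file_grp.get("USE")] = locations
    return groups


if __name__ == "__main__":
    for use, locations in files_by_group(SAMPLE_METS).items():
        print(use, locations)
```

In a layout like this, one tool can read the image file group and another can add a result file group without either needing to know the other's internals.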

SLIDE 26

ACTIONABLE DOCUMENTATION

SLIDE 27

OCR-D/CORE

  • VALIDATION AND HELPER FUNCTIONS
  • PYTHON LIBRARY
  • SHELL LIBRARY
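
As an illustration of what validation against a structured tool description could look like, the following sketch checks user-supplied parameters against a small, made-up description. The field names (`executable`, `parameters`, `type`, `default`), the tool name `example-binarize` and the function `validate_parameters` are assumptions for this example only; they are not the OCR-D schema or the OCR-D/core API.

```python
import json

# Hypothetical tool description (illustration only, not the OCR-D schema).
TOOL_DESCRIPTION = json.loads("""
{
  "executable": "example-binarize",
  "parameters": {
    "threshold": {"type": "number", "default": 0.5},
    "method": {"type": "string", "default": "otsu"}
  }
}
""")

# Python types accepted for each declared parameter type.
TYPE_CHECKS = {"number": (int, float), "string": (str,)}


def validate_parameters(description, given):
    """Return (resolved, errors): parameters with defaults applied, plus problems found."""
    declared = description["parameters"]
    errors = [f"unknown parameter: {name}" for name in given if name not in declared]
    resolved = {}
    for name, spec in declared.items():
        value = given.get(name, spec["default"])
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, TYPE_CHECKS[spec["type"]]):
            errors.append(f"{name}: expected {spec['type']}, got {type(value).__name__}")
        else:
            resolved[name] = value
    return resolved, errors


if __name__ == "__main__":
    print(validate_parameters(TOOL_DESCRIPTION, {"threshold": 0.3}))
    print(validate_parameters(TOOL_DESCRIPTION, {"threshold": "high", "dpi": 300}))
```

Because the description is plain data, the same file could drive validation, command line help and documentation.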

SLIDE 28

WHY PYTHON?

  • Python is widely used in computer vision and machine learning (keras, pytorch...)
  • Wrapping existing tools with minimal friction (ocropus, kraken...)
  • Bindings for low-level APIs (opencv, tesserocr...)

SLIDE 29

WHY SHELL?

  • Lowest common denominator
  • Wrap arbitrary command line tools
  • Process callout possible in every framework/workflow engine/programming environment
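
The "process callout" point can be sketched in a few lines: any environment that can start a process can drive a command line tool. The helper `run_tool` and the stand-in command below are hypothetical; the stand-in merely upper-cases its input so that the example runs without any OCR software installed.

```python
import subprocess
import sys


def run_tool(argv, stdin_text=""):
    """Run a command line tool, feed it text on stdin, and return its stdout."""
    completed = subprocess.run(
        argv,
        input=stdin_text,
        capture_output=True,
        text=True,
        check=False,
    )
    if completed.returncode != 0:
        raise RuntimeError(
            f"{argv[0]} failed with exit code {completed.returncode}: "
            f"{completed.stderr.strip()}"
        )
    return completed.stdout


# Stand-in for an arbitrary external tool (illustration only).
UPPERCASE_TOOL = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

if __name__ == "__main__":
    print(run_tool(UPPERCASE_TOOL, "ocr-d"))
```

The wrapper only relies on exit codes and standard streams, which is what makes the command line a common denominator across languages and workflow engines.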

SLIDE 30

STATE OF THE OCR-D TOOLSET

SLIDE 31

PREPROCESSING

Tool         | Developer                | Functionality                                 | Wrapper
anyOCR       | DFKI Kaiserslautern      | binarization, cropping, deskewing, dewarping  | (python)
OLENA        | OCR-D                    | binarization                                  | shell
tesseract    | UB Mannheim, ASV Leipzig | binarization                                  | python
OCRopus      | OCR-D                    | binarization                                  | python
kraken       | OCR-D                    | binarization                                  | python
ImageMagick  | OCR-D                    | binarization, conversion                      | shell

SLIDE 32

LAYOUT RECOGNITION

Tool        | Developer                | Functionality                                                      | Wrapper
anyOCR      | DFKI Kaiserslautern      | block+line segmentation, block classification, document analysis  | (python)
LAREX       | Uni Würzburg             | block+line segmentation, block classification                     | (shell)
OCRopus     | OCR-D                    | line segmentation                                                  | python
kraken      | OCR-D                    | line segmentation                                                  | python
tesseract   | UB Mannheim, ASV Leipzig | block+line segmentation                                            | python
dh_segment  | OCR-D                    | block+line segmentation                                            | (shell)

SLIDE 33

TEXT RECOGNITION

Tool       | Developer                | Functionality    | Wrapper
OCRopus    | OCR-D                    | text recognition | python
kraken     | OCR-D                    | text recognition | python
tesseract  | UB Mannheim, ASV Leipzig | text recognition | python
calamari   | OCR-D                    | text recognition | (python)
ocrad      | OCR-D                    | text recognition | (shell)
SLIDE 34

POSTPROCESSING

Tool           | Developer   | Functionality   | Wrapper
corASV         | ASV Leipzig | post correction | (python)
PoCoTo         | CIS München | post correction | python
keraslm        | ASV Leipzig | post correction | python
ocrevalUAtion  | OCR-D       | evaluation      | (shell)
SLIDE 35

YOUR TOOL?

SLIDE 36

SCALABILITY

SLIDE 37

<IMPRESSIVE NUMBER HERE>

SLIDE 38

GEARED TOWARDS REAL DIGITIZATION SCENARIOS

  • Cooperation with Kitodo and commercial providers
  • Frequent reality checks against current practice ("Pilotbibliotheken", i.e. pilot libraries)

SLIDE 39

MODULARITY + UNIFORM INTERFACES ⇒ ADAPTIVE WORKFLOWS

(Instantiation and composition up to users)
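
A minimal sketch of the idea, assuming every step accepts and returns the same kind of document record: once the interface is uniform, a workflow is just an ordered list of steps that users can rearrange. The step functions below are placeholders that do no real image processing, and all names are hypothetical.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Step = Callable[[Document], Document]


def binarize(doc: Document) -> Document:
    # Placeholder: only records that binarization has happened.
    return {**doc, "binarized": True}


def segment_lines(doc: Document) -> Document:
    # Placeholder: pretends to find two text lines.
    if not doc.get("binarized"):
        raise ValueError("line segmentation expects a binarized image")
    return {**doc, "lines": ["line-1", "line-2"]}


def recognize_text(doc: Document) -> Document:
    # Placeholder: emits one dummy text result per line.
    return {**doc, "text": [f"<text of {line}>" for line in doc["lines"]]}


def run_workflow(steps: List[Step], doc: Document) -> Document:
    """Apply each step in order; every step sees the previous step's output."""
    for step in steps:
        doc = step(doc)
    return doc


if __name__ == "__main__":
    workflow = [binarize, segment_lines, recognize_text]
    print(run_workflow(workflow, {"image": "0001.tif"}))
```

Swapping one segmentation or recognition step for another then means changing one entry in the list, provided the replacement honors the same interface.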

SLIDE 40

OPEN SOURCE IS MORE THAN "OPEN SOURCE"

SLIDE 41

STEP 1: GET FUNDED!

SLIDE 42

STEP 2: DEVELOP!

SLIDE 43

STEP 3: PUBLISH CODE!

SLIDE 44

SUSTAINABILITY AND REUSE!

SLIDE 45

BEST PRACTICES

  • Transparency from day one
  • Unit tests
  • Unified Continuous Integration
  • Semantic versioning
  • Docker base image
  • Releases to GitHub, PyPI, DockerHub
  • Test assets

SLIDE 46

COMMUNITY

  • Issues
  • Pull requests
  • Code review
  • Support chat

SLIDE 47

OCR-D/DOCS

  • DEVELOPER DOCUMENTATION
  • "COOKBOOK" DOCUMENTATION
  • USER GUIDE DOCUMENTATION

SLIDE 48

THANK YOU

  • ocr-d.de
  • ocr-d.github.io
  • ocr-d.github.io/docs
  • github.com/OCR-D
  • gitter.im/OCR-D/Lobby
