AN OPEN-SOURCE FRAMEWORK FOR INTEGRATING MULTI-SOURCE LAYOUT AND TEXT RECOGNITION TOOLS INTO SCALABLE OCR WORKFLOWS

  1. AN OPEN-SOURCE FRAMEWORK FOR INTEGRATING MULTI-SOURCE LAYOUT AND TEXT RECOGNITION TOOLS INTO SCALABLE OCR WORKFLOWS
     KAY-MICHAEL WÜRZNER, KONSTANTIN BAIERER
     Bibliotheca Baltica 2018, Rostock, 2018-10-05

  2. OVERVIEW
     1. Why OCR-D
     2. The OCR-D initiative
     3. Architecture
     4. State of OCR-D tools
     5. Scalability
     6. Open source

  3. USERS WANT TEXT DATA

  4. MASSIVE AMOUNTS

  5. STRUCTURED

  6. EASILY ACCESSIBLE

  7. LIBRARIES PROVIDE TEXT DATA

  8. LARGE AMOUNTS

  9. UNSTRUCTURED

  10. HARD TO ACCESS

  11. WHY THE DISCREPANCY?

  12. UNDERSPECIFIED REQUIREMENTS ON OCR BY FUNDERS AND USERS

  13. OCR OF HISTORICAL DATA: OF LITTLE ECONOMIC INTEREST, LITTLE COMPETITION

  14. ACADEMIC SOLUTIONS: NON-SUSTAINABLE, INFLEXIBLE WORKFLOWS

  15. THE OCR-D INITIATIVE
      - DFG-organized expert workshop (2014): "Verfahren zur Verbesserung von OCR-Ergebnissen" (Methods for Improving OCR Results)
      - Result: a concerted effort to improve OCR is seen as necessary.

  16. OCR-D COORDINATION PROJECT

  17. PHASE 1: EXPLORING THE DOMAIN (2015-2017)
      - Surveyed the (open-source) ecosystem around OCR and OLR
      - Identified tasks
      - Prepared a call for proposals for the DFG

  18. PHASE 2: MODULE PROJECT STAGE (2018-2019)

  19. PHASE 3: GOING PRODUCTIVE (2018-2020)
      - Integrate with existing digitization workflow software, e.g. Kitodo
      - Make OCR-D-developed software uniformly deployable
      - Advise DFG on OCR requirements for "Praxisrichtlinien" (practice guidelines)

  20. NOT IN THIS TALK (BUT IN OCR-D)
      - GROUND TRUTH FOR OCR
      - OCR ENGINE TRAINING
      - OCR RESEARCH DATA REPOSITORY
      - WORKFLOW COMPOSITION AND PROVENANCE

  21. ARCHITECTURE

  22. "MULTI SOURCE"
      - existing tools by OCR-D partners (tesseract, PoCoTo, LAREX...)
      - new developments within OCR-D (font identification, post-correction...)
      - existing tools outside OCR-D (ocropus, kraken, ScanTailor, OLENA...)

  23. MODULAR vs. MONOLITHIC

  24. SPECIFICATION + IMPLEMENTATION

  25. OCR-D/SPEC
      - METS + PAGE-XML (+ ALTO)
      - STRUCTURED TOOL DESCRIPTIONS
      - COMMAND LINE INTERFACE
      - HTTP INTERFACE
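
To make the role of METS as the container format more concrete, here is a minimal sketch that lists the files referenced by each file group of a METS document, using only the Python standard library. The sample document, the group names (`IMAGES`, `OCR-RESULTS`) and the file paths are invented for illustration; the talk does not show them, and the helper `list_file_groups` is hypothetical rather than part of any OCR-D package.

```python
import xml.etree.ElementTree as ET

# Standard METS and XLink namespace URIs.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical minimal METS document; group names and paths are illustrative.
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="IMAGES">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="images/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-RESULTS">
      <mets:file ID="OCR_0001" MIMETYPE="application/xml">
        <mets:FLocat LOCTYPE="URL" xlink:href="ocr/0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def list_file_groups(mets_xml):
    """Map each fileGrp's USE attribute to the list of file locations in it."""
    root = ET.fromstring(mets_xml.strip())
    href_attr = "{%s}href" % NS["xlink"]  # namespaced attribute name
    groups = {}
    for grp in root.findall("mets:fileSec/mets:fileGrp", NS):
        locations = grp.findall("mets:file/mets:FLocat", NS)
        groups[grp.get("USE")] = [loc.get(href_attr) for loc in locations]
    return groups


for use, hrefs in list_file_groups(SAMPLE_METS).items():
    print(use, hrefs)
```

In a layout like this, a tool would read its input from one file group and register its results in another, which is one plausible way for independent tools to share a single document description.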

  26. ACTIONABLE DOCUMENTATION

  27. OCR-D/CORE
      - VALIDATION AND HELPER FUNCTIONS
      - PYTHON LIBRARY
      - SHELL LIBRARY
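
As an illustration of what validating a structured tool description could involve, the sketch below checks a JSON descriptor against a small set of required fields. The field names (`executable`, `description`, `steps`, `parameters`) and the function `validate_tool_description` are assumptions made for this example; they are not taken from the OCR-D specification or from the OCR-D/core library.

```python
import json

# Hypothetical schema: field name -> expected Python type after JSON decoding.
REQUIRED_FIELDS = {
    "executable": str,
    "description": str,
    "steps": list,
    "parameters": dict,
}


def validate_tool_description(raw_json):
    """Return a list of problems found; an empty list means the descriptor passed."""
    try:
        tool = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return ["not valid JSON: %s" % exc.msg]
    if not isinstance(tool, dict):
        return ["top-level value must be an object"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in tool:
            problems.append("missing field: %s" % field)
        elif not isinstance(tool[field], expected_type):
            problems.append(
                "field %s must be of type %s" % (field, expected_type.__name__)
            )
    return problems


# Illustrative descriptor for an imaginary binarization tool.
example = json.dumps({
    "executable": "my-binarizer",
    "description": "Binarize page images",
    "steps": ["preprocessing"],
    "parameters": {},
})
print(validate_tool_description(example))  # []
```

Returning a list of problems instead of raising on the first one lets a caller report everything wrong with a descriptor in a single pass.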

  28. WHY PYTHON?
      - Python widely used in computer vision and machine learning (keras, pytorch...)
      - Wrapping existing tools with minimal friction (ocropus, kraken ...)
      - Bindings for low-level APIs (opencv, tesserocr ...)

  29. WHY SHELL?
      - Lowest common denominator
      - Wrap arbitrary command line tools
      - Process callout possible in every framework/workflow engine/programming environment
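
The "process callout" point can be sketched in a few lines: any environment that can start a process can drive a command line tool. The example below wraps an external command with Python's standard `subprocess` module. The helper `run_tool` is hypothetical, and a tiny Python one-liner stands in for a real OCR tool so that the example runs without anything installed.

```python
import subprocess
import sys


def run_tool(command, timeout=60):
    """Run an external command and return its standard output as text.

    Raises subprocess.CalledProcessError if the command exits with a
    nonzero status, so failures are not silently ignored.
    """
    completed = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode bytes to str
        check=True,           # raise on nonzero exit status
        timeout=timeout,
    )
    return completed.stdout


# Stand-in for a real command line tool: upper-cases its first argument.
stand_in = [
    sys.executable, "-c",
    "import sys; print(sys.argv[1].upper())",
    "page-0001",
]
print(run_tool(stand_in).strip())  # PAGE-0001
```

Passing the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.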

  30. STATE OF THE OCR-D TOOLSET

  31. PREPROCESSING
      | Tool        | Developer               | Functionality                                  | Wrapper  |
      |-------------|-------------------------|------------------------------------------------|----------|
      | anyOCR      | DFKI Kaiserslautern     | binarization, cropping, deskewing, dewarping   | (python) |
      | OLENA       | OCR-D                   | binarization                                   | shell    |
      | tesseract   | UB Mannheim, ASV Leipzig | binarization                                  | python   |
      | OCRopus     | OCR-D                   | binarization                                   | python   |
      | kraken      | OCR-D                   | binarization                                   | python   |
      | ImageMagick | OCR-D                   | binarization, conversion                       | shell    |

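  32. LAYOUT RECOGNITION
      | Tool       | Developer                | Functionality                                                      | Wrapper  |
      |------------|--------------------------|--------------------------------------------------------------------|----------|
      | anyOCR     | DFKI Kaiserslautern      | block+line segmentation, block classification, document analysis   | (python) |
      | LAREX      | Uni Würzburg             | block+line segmentation, block classification                      | (shell)  |
      | OCRopus    | OCR-D                    | line segmentation                                                  | python   |
      | kraken     | OCR-D                    | line segmentation                                                  | python   |
      | tesseract  | UB Mannheim, ASV Leipzig | block+line segmentation                                            | python   |
      | dh_segment | OCR-D                    | block+line segmentation                                            | (shell)  |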

  33. TEXT RECOGNITION
      | Tool      | Developer                | Functionality    | Wrapper  |
      |-----------|--------------------------|------------------|----------|
      | OCRopus   | OCR-D                    | text recognition | python   |
      | kraken    | OCR-D                    | text recognition | python   |
      | tesseract | UB Mannheim, ASV Leipzig | text recognition | python   |
      | calamari  | OCR-D                    | text recognition | (python) |
      | ocrad     | OCR-D                    | text recognition | (shell)  |

  34. POSTPROCESSING
      | Tool          | Developer   | Functionality   | Wrapper  |
      |---------------|-------------|-----------------|----------|
      | corASV        | ASV Leipzig | post correction | (python) |
      | PoCoTo        | CIS München | post correction | python   |
      | keraslm       | ASV Leipzig | post correction | python   |
      | ocrevalUAtion | OCR-D       | evaluation      | (shell)  |
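
To give the "evaluation" entry some substance, here is a minimal sketch of one common OCR quality measure, the character error rate (CER): the edit distance between the recognized text and a ground-truth transcription, divided by the length of the ground truth. This is a generic textbook formulation and not the implementation used by ocrevalUAtion or any other tool listed above; the function names are made up for the example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# A typical confusion in historical prints: long s (ſ) recognized as f.
print(character_error_rate("Wiſſenſchaft", "Wiffenfchaft"))  # 0.25
```

Three substituted characters in a twelve-character reference give a CER of 0.25; a perfect recognition result gives 0.0.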

  35. YOUR TOOL?

  36. SCALABILITY

  37. <IMPRESSIVE NUMBER HERE>

  38. GEARED TOWARDS REAL DIGITIZATION SCENARIOS
      - Cooperation with Kitodo and commercial providers
      - Frequent reality checks against current practice ("Pilotbibliotheken", pilot libraries)

  39. MODULARITY + UNIFORM INTERFACES ⇒ ADAPTIVE WORKFLOWS
      (Instantiation and composition up to users)
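
The idea that uniform interfaces allow users to compose their own workflows can be sketched with plain functions. In the toy example below, every step takes a workspace, the name of an input group and the name of an output group, so steps can be chained in any order that makes sense. The step functions, the group names and the fixed threshold are all invented for illustration and do not reflect actual OCR-D processors.

```python
def binarize(workspace, src, dst):
    """Toy binarization: grayscale pixels (0-255) darker than 128 become 1."""
    result = dict(workspace)
    result[dst] = [[1 if px < 128 else 0 for px in row] for row in workspace[src]]
    return result


def find_text_rows(workspace, src, dst):
    """Toy line detection: indices of rows that contain any foreground pixel."""
    result = dict(workspace)
    result[dst] = [i for i, row in enumerate(workspace[src]) if any(row)]
    return result


def run_workflow(workspace, steps):
    """Apply (step, input_group, output_group) triples in order."""
    for step, src, dst in steps:
        workspace = step(workspace, src, dst)
    return workspace


# A 4x3 "page" of grayscale values; rows 1 and 3 contain dark pixels.
page = {"IMG": [[255, 250, 240], [20, 200, 30], [255, 255, 255], [10, 10, 10]]}

# The workflow is plain data, so users can reorder or swap steps freely.
workflow = [
    (binarize, "IMG", "BIN"),
    (find_text_rows, "BIN", "LINES"),
]
print(run_workflow(page, workflow)["LINES"])  # [1, 3]
```

Because each step only adds a new group and leaves its input untouched, intermediate results stay available for inspection or for an alternative downstream step.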

  40. OPEN SOURCE IS MORE THAN "OPEN SOURCE"

  41. STEP 1: GET FUNDED!

  42. STEP 2: DEVELOP!

  43. STEP 3: PUBLISH CODE!

  44. SUSTAINABILITY AND REUSE!

  45. BEST PRACTICES
      - Transparency from day one
      - Unit tests
      - Unified test assets
      - Continuous Integration
      - Semantic versioning
      - Docker base image
      - Releases to GitHub, PyPI, DockerHub
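
Semantic versioning is what lets independently released modules state which versions of each other they can work with. The sketch below shows the basic rule for `MAJOR.MINOR.PATCH` versions: same major version, and at least the required minor and patch level. It deliberately ignores pre-release tags, build metadata and the special status of `0.x` versions, and the function names are made up for the example.

```python
def parse_version(text):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    parts = text.split(".")
    if len(parts) != 3 or not all(part.isdecimal() for part in parts):
        raise ValueError("expected MAJOR.MINOR.PATCH, got %r" % text)
    return tuple(int(part) for part in parts)


def is_compatible(installed, required):
    """True if `installed` has the same major version as `required`
    and is not older than it."""
    have = parse_version(installed)
    need = parse_version(required)
    return have[0] == need[0] and have >= need


print(is_compatible("1.4.2", "1.3.0"))  # True: newer minor, same major
print(is_compatible("2.0.0", "1.9.9"))  # False: major version changed
```

Comparing tuples of integers rather than raw strings avoids the classic mistake of ranking "1.10.0" below "1.9.0".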

  46. COMMUNITY
      - Issues
      - Pull requests
      - Code review
      - Support chat

  47. OCR-D/DOCS
      - DEVELOPER DOCUMENTATION
      - "COOKBOOK"
      - USER GUIDE
      - DOCUMENTATION DOCUMENTATION

  48. THANK YOU
      - ocr-d.de
      - ocr-d.github.io
      - ocr-d.github.io/docs
      - github.com/OCR-D
      - gitter.im/OCR-D/Lobby
