OCR Errors by Michael Barz

Motivation
• In general: How to get information out of noisy input?
  – Dealing with noisy input (scan/fax/e-mail…) in written form
• Approach: Combination of diverse NLP tools in one pipeline
  – Optical Character Recognition (OCR)
  – Sentence Boundary Detection
  – Tokenization
  – Part-of-Speech Tagging
• Efficient evaluation method for OCR results (from pipeline)
  – Dynamic programming approaches → mathematical description
  – Error identification (where does the error come from?)
• Techniques to improve pipeline (avoid errors)
  – Table spotting

Pipeline
Noisy Input → Optical Character Recognition (OCR) → Sentence Boundary Detection → Tokenization → Part-of-Speech Tagging → Result

Noisy Input [Figure: sample scans, panels "Clean" and "Noisy"]

Noisy Input
• Generating noisy input to test pipeline
  – Printed digital writing
  – Scanned directly for clean input
  – Repeated copies combined with fax → noisy input

Optical Character Recognition
• “Conversion of the scanned input image from bitmap format to encoded text”
• Possible Errors (impact on later stages)
  – Punctuation errors
  – Substitution errors
  – Space deletion
• Tools: gocr, Tesseract

Sentence Boundary Detection
• “break the input text into sentence-sized units, one per line”
• Usage of syntactic (and semantic) information
• Tool: MXTERMINATOR

Tokenization
• “breaks it into individual tokens which are delimited by whitespace”
  – Tokens: words, punctuation symbols
• Tool: Penn Treebank tokenizer
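To make the effect of OCR errors on this stage concrete, here is a minimal sketch of a tokenizer in Python. It is not the Penn Treebank tokenizer named on the slide; the function name `tokenize` and the regular expression are assumptions chosen for illustration only.

```python
import re

def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens.

    Illustrative stand-in only: the real Penn Treebank tokenizer applies
    many more rules (contractions, quotes, brackets, ...).
    """
    # \w+ matches a run of word characters; [^\w\s] matches one
    # punctuation symbol that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", sentence)

clean = tokenize("The cat sat.")
noisy = tokenize("Thecat sat.")   # hypothetical OCR space deletion

# Whitespace-delimited output, as described on the slide.
print(" ".join(clean))
print(" ".join(noisy))
```

A single deleted space merges two words into one token, so every later stage that works on tokens sees a shorter and partly wrong sequence.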

Part-of-Speech Tagging
• Assigns each token a tag describing its part of speech
• Tool: MXPOST

Sample Result [Figure: pipeline output, labeled "Result"]

Why evaluation?
• Errors occur
  – Propagate through stages of pipeline
  – Different types (as mentioned at OCR)
• What impact do errors have?

Performance Evaluation
• Dynamic programming approach
• Levenshtein distance for each stage (adjusted)
• Compare part-of-speech tags after
• Try to backtrack where errors arise and which impact they have
[Diagram: Character-Distance → Token-Distance → Sentence-Distance]

Performance Evaluation
Levenshtein distance matrix for "Tier" (rows) and "Tor" (columns):
      ε  T  o  r
  ε   0  1  2  3
  T   1  0  1  2
  i   2  1  1  2
  e   3  2  2  2
  r   4  3  3  2
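The matrix above can be reproduced with the standard dynamic programming recurrence for Levenshtein distance. The following Python sketch assumes unit costs for insertion, deletion and substitution; the function name `levenshtein_table` is a hypothetical helper, not something taken from the slides or the cited papers.

```python
def levenshtein_table(source, target):
    """Return the full DP table; table[i][j] is the distance between
    the first i characters of source and the first j of target."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i          # delete i characters
    for j in range(cols):
        table[0][j] = j          # insert j characters
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,                  # deletion
                table[i][j - 1] + 1,                  # insertion
                table[i - 1][j - 1] + substitution,   # match or substitution
            )
    return table

table = levenshtein_table("Tier", "Tor")
for row in table:
    print(row)
print("distance:", table[-1][-1])   # bottom-right cell of the matrix
```

The bottom-right cell is the distance between the two complete strings; here it is 2 (substitute "i" with "o", delete "e").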

Performance Evaluation
• Extension: Substitution of more than one character at a time

Performance Evaluation
Token-Distance (dist2)
• Costs for inserting, deleting or substituting a token are defined as
  – dist1(ε, t)
  – dist1(s, ε)
  – Distance between substituted substrings
Sentence-Distance (dist3)
• Costs for inserting, deleting or substituting a sentence are defined as
  – dist2(ε, t)
  – dist2(s, ε)
  – Distance between substituted tokens
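The layered cost definitions can be sketched as one generic edit distance whose costs are supplied by the level below. This is a simplified illustration in Python, assuming unit costs at the character level and one-to-one substitutions only (the multi-item substitution extension is left out); the helper name `edit_distance` is hypothetical, while `dist1`, `dist2` and `dist3` follow the slide's naming.

```python
def edit_distance(a, b, ins_cost, del_cost, sub_cost):
    """Generic DP edit distance over two sequences with pluggable costs."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + del_cost(a[i - 1])
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + ins_cost(b[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + del_cost(a[i - 1]),
                d[i][j - 1] + ins_cost(b[j - 1]),
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
            )
    return d[-1][-1]

def dist1(s, t):
    """Character distance between two strings (unit costs, an assumption)."""
    return edit_distance(s, t, lambda c: 1, lambda c: 1,
                         lambda x, y: 0 if x == y else 1)

def dist2(s_tokens, t_tokens):
    """Token distance: token costs are dist1(ε, t), dist1(s, ε), dist1(s, t)."""
    return edit_distance(s_tokens, t_tokens,
                         lambda t: dist1("", t), lambda s: dist1(s, ""), dist1)

def dist3(s_sents, t_sents):
    """Sentence distance: sentence costs are dist2(ε, t), dist2(s, ε), dist2(s, t)."""
    return edit_distance(s_sents, t_sents,
                         lambda t: dist2([], t), lambda s: dist2(s, []), dist2)

print(dist1("Tier", "Tor"))
print(dist2(["Tier", "ist"], ["Tor", "ist"]))
print(dist3([["a", "b"]], [["a", "b"], ["c"]]))
```

Because each level reuses the distance of the level below as its cost function, an error measured at the sentence level can be traced back to the tokens and characters that caused it.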

Evaluation 2005

Improve pipeline
• Tables are not sentences → the pipeline won’t work well on them
• Exclude tables from processing → we need an algorithm to find and spot all tables

Table Spotting

Evaluation 2008

Error identification

QUESTIONS?

Sources:
• “Performance Evaluation for Text Processing of Noisy Inputs” (Daniel Lopresti, 2005)
• “Optical Character Recognition Errors and Their Effects on Natural Language Processing” (Daniel Lopresti, 2009)

THANK YOU FOR YOUR ATTENTION!
