OCR Errors by Michael Barz


  1. OCR Errors by Michael Barz

  2. Motivation • In general: How to get information out of noisy input? – Dealing with noisy input (scan/fax/e-mail, …) in written form • Approach: Combination of diverse NLP tools in one pipeline – Optical Character Recognition (OCR) – Sentence Boundary Detection – Tokenization – Part-of-Speech Tagging • Efficient evaluation method for OCR results (from pipeline) – Dynamic programming approaches → mathematical description – Error identification (where does the error come from?) • Techniques to improve pipeline (avoid errors) – Table spotting

  3. Pipeline: Noisy Input → Optical Character Recognition (OCR) → Sentence Boundary Detection → Tokenization → Part-of-Speech Tagging → Result

  4. Noisy Input [pipeline diagram]

  5. Noisy Input [figure panels: Clean, Noisy]

  6. Noisy Input • Generating noisy input to test pipeline – Printed digital writing – Scanned directly for clean input – Repeated copies combined with fax → noisy input

  7. Optical Character Recognition [pipeline diagram]

  8. Optical Character Recognition • “Conversion of the scanned input image from bitmap format to encoded text” • Possible Errors (impact on later stages) – Punctuation errors – Substitution errors – Space deletion • Tools: gocr, Tesseract
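The three error types on slide 8 can be illustrated with a small sketch. The sample sentence and the corrupted variants below are hand-made assumptions for illustration, not output from gocr or Tesseract, and the two counting helpers are deliberately crude stand-ins for the later pipeline stages.

```python
# Hand-made illustration of the three OCR error types named on slide 8.
# The strings are invented examples, not real OCR output.
clean = "The cat sat. It slept."

examples = {
    "punctuation error": "The cat sat, It slept.",   # "." misread as ","
    "substitution error": "The cat sat. lt slept.",  # "I" misread as "l"
    "space deletion": "The catsat. It slept.",       # space between words lost
}


def whitespace_tokens(text):
    """Crude stand-in for tokenization: split on whitespace only."""
    return text.split()


def period_count(text):
    """Crude stand-in for sentence boundary detection: count periods."""
    return text.count(".")


for name, noisy in examples.items():
    print(name, len(whitespace_tokens(noisy)), period_count(noisy))
```

Even in this toy setting, a punctuation error changes the number of apparent sentence ends and a space deletion changes the number of tokens, which is one way such errors can affect later stages.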

  9. Sentence Boundary Detection [pipeline diagram]

  10. Sentence Boundary Detection • “break the input text into sentence-sized units, one per line” • Usage of syntactic (and semantic) information • Tool: MXTERMINATOR

  11. Tokenization [pipeline diagram]

  12. Tokenization • “breaks it into individual tokens which are delimited by whitespace” – Tokens: words, punctuation symbols • Tool: Penn Treebank tokenizer
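A minimal sketch of the idea on slide 12, assuming a simple regular expression is good enough for illustration. This is not the Penn Treebank tokenizer, which handles contractions, quotes and other cases differently; the function name `simple_tokenize` is hypothetical.

```python
import re


def simple_tokenize(sentence):
    """Split a sentence into word tokens and single punctuation symbols.

    Simplified stand-in, not the Penn Treebank tokenizer.
    """
    return re.findall(r"\w+|[^\w\s]", sentence)


tokens = simple_tokenize("Hello, world!")
# Emit the tokens delimited by whitespace, as described on the slide.
print(" ".join(tokens))
```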

  13. Part-of-Speech Tagging [pipeline diagram]

  14. Part-of-Speech Tagging • Assigns meta information to tokens according to their part of speech • Tool: MXPOST

  15. Sample Result [figure: Result]

  16. Why evaluation? • Errors occur – Propagate through stages of pipeline – Different types (as mentioned at OCR) • What impact do errors have?

  17. Performance Evaluation • Dynamic programming approach • Levenshtein distance for each stage (adjusted) • Compare part-of-speech tags after • Try to backtrack where errors arise and which impact they have [diagram labels: Character-Distance, Token-Distance, Sentence-Distance]

  18. Performance Evaluation

         ε  T  o  r
      ε  0  1  2  3
      T  1  0  1  2
      i  2  1  1  2
      e  3  2  2  2
      r  4  3  3  2
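The matrix on slide 18 matches a standard Levenshtein table for the strings "Tier" (rows) and "Tor" (columns) with unit costs. A minimal sketch that reproduces it, assuming plain unit costs for insertion, deletion and substitution (the "adjusted" variant mentioned on slide 17 is not modelled here; the function name is hypothetical):

```python
def levenshtein_table(source, target):
    """Build the full dynamic-programming table for the Levenshtein distance.

    table[i][j] is the distance between source[:i] and target[:j].
    Assumes unit cost for every insertion, deletion and substitution.
    """
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete i characters to reach the empty string
    for j in range(cols):
        table[0][j] = j  # insert j characters starting from the empty string
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,                 # deletion
                table[i][j - 1] + 1,                 # insertion
                table[i - 1][j - 1] + substitution,  # match or substitution
            )
    return table


table = levenshtein_table("Tier", "Tor")
for row in table:
    print(row)
```

The bottom-right cell holds the distance between the two full strings, which is 2 for this pair.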

  19. Performance Evaluation • Extension: Substitution of more than one character

  20. Performance Evaluation
      Token-Distance (dist2) • Costs for inserting, deleting or substituting a token are defined as – dist1(ε, t) – dist1(s, ε) – Distance between substituted substrings
      Sentence-Distance (dist3) • Costs for inserting, deleting or substituting a sentence are defined as – dist2(ε, t) – dist2(s, ε) – Distance between substituted tokens
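The hierarchy on slide 20 can be sketched by reusing one edit-distance routine with different cost functions. The sketch below is a simplification under stated assumptions: dist1 uses unit character costs, only one-for-one substitutions are considered (the multi-item substitution from slide 19 is not modelled), and the helper name `edit_distance` is hypothetical.

```python
def edit_distance(source, target, ins_cost, del_cost, sub_cost):
    """Generic edit distance over two sequences with pluggable costs."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + del_cost(source[i - 1])
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + ins_cost(target[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + del_cost(source[i - 1]),
                d[i][j - 1] + ins_cost(target[j - 1]),
                d[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return d[-1][-1]


def dist1(s, t):
    """Character distance between two strings (unit costs assumed)."""
    return edit_distance(
        s, t,
        ins_cost=lambda c: 1,
        del_cost=lambda c: 1,
        sub_cost=lambda a, b: 0 if a == b else 1,
    )


def dist2(s_tokens, t_tokens):
    """Token distance: token costs are character distances, e.g. dist1("", t)."""
    return edit_distance(
        s_tokens, t_tokens,
        ins_cost=lambda t: dist1("", t),
        del_cost=lambda s: dist1(s, ""),
        sub_cost=dist1,
    )


def dist3(s_sentences, t_sentences):
    """Sentence distance: sentence costs are token distances, e.g. dist2([], t)."""
    return edit_distance(
        s_sentences, t_sentences,
        ins_cost=lambda t: dist2([], t),
        del_cost=lambda s: dist2(s, []),
        sub_cost=dist2,
    )


print(dist2(["the", "cat"], ["the", "cot"]))
print(dist3([["the", "cat"]], [["the", "cot"], ["ok"]]))
```

Here a sentence is a list of tokens and a document is a list of sentences, so an error at the character level propagates upward as a nonzero cost at the token and sentence levels.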

  21. Evaluation 2005

  22. Improve pipeline • Tables are not sentences → pipeline won’t work well • Disregard tables → we need an algorithm to find and spot all tables

  23. Table Spotting

  24. Table Spotting

  25. Table Spotting

  26. Evaluation 2008

  27. Error identification

  28. QUESTIONS?

  29. Sources: “Performance Evaluation for Text Processing of Noisy Inputs” (Daniel Lopresti, 2005); “Optical Character Recognition Errors and Their Effects on Natural Language Processing” (Daniel Lopresti, 2009). THANK YOU FOR YOUR ATTENTION!
