SLIDE 1

OCR Errors

by Michael Barz

SLIDE 2

Motivation

  • In general: How to get information out of noisy input?

– Dealing with noisy input (scan/fax/e-mail…) in written form

  • Approach: Combination of diverse NLP tools in one pipeline

– Optical Character Recognition (OCR)
– Sentence Boundary Detection
– Tokenization
– Part-of-Speech Tagging

  • Efficient evaluation method for OCR results (from pipeline)

– Dynamic programming approaches → mathematical description
– Error identification (where does the error come from?)

  • Techniques to improve pipeline (avoid errors)

– Table spotting

SLIDE 3

Pipeline

Noisy Input → Optical Character Recognition (OCR) → Sentence Boundary Detection → Tokenization → Part-of-Speech Tagging → Result
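The pipeline can be sketched as a chain of stage functions. This is a minimal illustration, not the original setup: the real stages are external tools (an OCR engine, MXTERMINATOR, the Penn Treebank tokenizer, MXPOST), and every function body here is a hypothetical placeholder.

```python
import re

def ocr(image):
    # Placeholder: a real OCR engine (e.g. Tesseract) turns a bitmap
    # into encoded text; here we assume the "image" is already text.
    return image

def sentence_boundary_detection(text):
    # Placeholder: split on sentence-final punctuation, one sentence per unit.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Placeholder: whitespace split; the real pipeline uses the Penn
    # Treebank tokenizer, which also separates punctuation.
    return sentence.split()

def pos_tag(tokens):
    # Placeholder: tag every token UNK; the real pipeline uses MXPOST.
    return [(tok, "UNK") for tok in tokens]

def pipeline(noisy_input):
    # Each stage consumes the previous stage's output, so any error
    # introduced early propagates to the final result.
    text = ocr(noisy_input)
    sentences = sentence_boundary_detection(text)
    return [pos_tag(tokenize(s)) for s in sentences]

print(pipeline("The scan is noisy. It still parses."))
```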

SLIDE 4

Noisy Input

[Noisy Input] → Optical Character Recognition (OCR) → Sentence Boundary Detection → Tokenization → Part-of-Speech Tagging → Result

SLIDE 5

Noisy Input

[Image comparison: clean scan vs. noisy scan]

SLIDE 6

Noisy Input

  • Generating noisy input to test pipeline

– Printed digital writing
– Scanned directly for clean input
– Repeated copies combined with fax → noisy input

SLIDE 7

Optical Character Recognition

Noisy Input Optical Character Recognition (OCR) Sentence Boundary Detection Tokenization Part-of-Speech Tagging Result

SLIDE 8

Optical Character Recognition

  • “Conversion of the scanned input image from bitmap format to encoded text”

  • Possible Errors (impact on later stages)

– Punctuation errors
– Substitution errors
– Space deletion

  • Tools: gocr, Tesseract
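These error types can be illustrated by corrupting one clean sentence by hand. The examples below are made up for illustration, not taken from the evaluation data.

```python
clean = "The quick, brown fox."

# Punctuation error: ',' misread as '.', which can trigger a spurious
# sentence boundary in the next stage.
punct_error = clean.replace(",", ".")

# Substitution error: visually similar glyphs confused, e.g. 'o' -> '0',
# so the token no longer matches any dictionary word.
subst_error = clean.replace("o", "0")

# Space deletion: two tokens merge into one, breaking tokenization.
space_deletion = clean.replace(", ", ",")

for variant in (punct_error, subst_error, space_deletion):
    print(variant)
```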
SLIDE 9

Sentence Boundary Detection

Noisy Input Optical Character Recognition (OCR) Sentence Boundary Detection Tokenization Part-of-Speech Tagging Result

SLIDE 10

Sentence Boundary Detection

  • “break the input text into sentence-sized units, one per line”

  • Usage of syntactic (and semantic) information
  • Tool: MXTERMINATOR
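MXTERMINATOR is a trained maximum-entropy model; as a rough stand-in, a rule-based splitter shows the task's input/output shape. The rule used here (a boundary only after `.`, `!` or `?` followed by an uppercase word) is an assumption for the sketch, and already skips some abbreviation traps like “i.e.”.

```python
import re

def split_sentences(text):
    # Rule-based stand-in for a sentence boundary detector: split after
    # sentence-final punctuation, but only when an uppercase word follows.
    return [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text) if s]

print(split_sentences("The scan was noisy, i.e. hard to read. It was processed anyway."))
```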
SLIDE 11

Tokenization

Noisy Input Optical Character Recognition (OCR) Sentence Boundary Detection Tokenization Part-of-Speech Tagging Result

SLIDE 12

Tokenization

  • “breaks it into individual tokens which are delimited by whitespace”

– Tokens: words, punctuation symbols

  • Tool: Penn Treebank tokenizer
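A heavily simplified tokenizer in the spirit of (but not equivalent to) the Penn Treebank tokenizer can be sketched with one regular expression; the real tool additionally splits contractions and normalizes quotes.

```python
import re

def tokenize(sentence):
    # Words become one token each; every punctuation symbol becomes
    # its own token, matching the slide's definition of "tokens:
    # words, punctuation symbols".
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenize("Noisy scans, faxes arrive daily."))
```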
SLIDE 13

Part-of-Speech Tagging

Noisy Input Optical Character Recognition (OCR) Sentence Boundary Detection Tokenization Part-of-Speech Tagging Result

SLIDE 14

Part-of-Speech Tagging

  • Assigns meta information to tokens according to their part of speech

  • Tool: MXPOST
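MXPOST is a trained maximum-entropy tagger; this toy lookup tagger only illustrates the output format (token, Penn Treebank tag). The tiny lexicon and the NN fallback are assumptions for the sketch, not part of the original pipeline.

```python
# Hypothetical miniature lexicon mapping lowercased tokens to
# Penn Treebank tags.
LEXICON = {"the": "DT", "dog": "NN", "barks": "VBZ", ".": "."}

def pos_tag(tokens):
    # Attach a tag to every token; unknown tokens fall back to NN,
    # a common default guess for taggers.
    return [(tok, LEXICON.get(tok.lower(), "NN")) for tok in tokens]

print(pos_tag(["The", "dog", "barks", "."]))
```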
SLIDE 15

Sample Result

[Image: sample pipeline result]

SLIDE 16

Why evaluation?

  • Errors occur

– Propagate through the stages of the pipeline
– Different types (as mentioned for OCR)

  • Which impact do errors have?
SLIDE 17

Performance Evaluation

Character-Distance · Token-Distance · Sentence-Distance

  • Dynamic programming approach
  • Levenshtein distance for each stage (adjusted)
  • Compare part-of-speech tags afterwards
  • Try to backtrack where errors arise and which impact they have

SLIDE 18

Performance Evaluation

Example: dist1(“Tier”, “Tür”)

      ε  T  ü  r
  ε   0  1  2  3
  T   1  0  1  2
  i   2  1  1  2
  e   3  2  2  2
  r   4  3  3  2
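The distance table is an instance of the standard Levenshtein recurrence, sketched below. The example pair “Tier”/“Tür” is an assumption matching the row labels of the table, not something stated explicitly in the slide text.

```python
def levenshtein(source, target):
    # d[i][j] = minimum number of single-character insertions, deletions
    # and substitutions turning source[:i] into target[:j].
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of source[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n]

print(levenshtein("Tier", "Tür"))  # -> 2
```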

SLIDE 19

Performance Evaluation

  • Extension: substitution of more than one character
SLIDE 20

Performance Evaluation

Token-Distance (dist2)

  • Costs for inserting, deleting, or substituting a token are defined as

– dist1(ε, t)
– dist1(s, ε)
– dist1(s, t), the distance between the substituted substrings

Sentence-Distance (dist3)

  • Costs for inserting, deleting, or substituting a sentence are defined as

– dist2(ε, t)
– dist2(s, ε)
– dist2(s, t), the distance between the substituted tokens
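Following these definitions, dist2 can be sketched as token-level edit distance whose edit costs are character distances (dist1); dist3 would repeat the same pattern one level up, with dist2 as the substitution cost. A minimal sketch:

```python
def dist1(s, t):
    # Character-level Levenshtein distance (dynamic programming).
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + sub)
    return d[m][n]

def dist2(src_tokens, tgt_tokens):
    # Token-level distance: the same recurrence, but insertion costs
    # dist1(ε, t), deletion costs dist1(s, ε), and substitution costs
    # dist1(s, t), so replacing a nearly identical token is cheaper
    # than replacing a completely different one.
    m, n = len(src_tokens), len(tgt_tokens)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + dist1(src_tokens[i - 1], "")
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + dist1("", tgt_tokens[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + dist1(src_tokens[i - 1], ""),
                          d[i][j - 1] + dist1("", tgt_tokens[j - 1]),
                          d[i - 1][j - 1] + dist1(src_tokens[i - 1], tgt_tokens[j - 1]))
    return d[m][n]

print(dist2(["noisy", "scan"], ["noisy", "scam"]))  # -> 1
```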

SLIDE 21

Evaluation 2005

SLIDE 22

Improve pipeline

  • Tables are not sentences → the pipeline won’t work well on them

  • Don’t regard tables → we need an algorithm to find and spot all tables

SLIDE 23

Table Spotting

SLIDE 24

Table Spotting

SLIDE 25

Table Spotting

SLIDE 26

Evaluation 2008

SLIDE 27

Error identification

SLIDE 28

QUESTIONS?

SLIDE 29

THANK YOU FOR YOUR ATTENTION!

Sources:

  • “Performance Evaluation for Text Processing of Noisy Inputs” (Daniel Lopresti, 2005)
  • “Optical Character Recognition Errors and Their Effects on Natural Language Processing” (Daniel Lopresti, 2009)