The Same is Not The Same: Postcorrection of Alphabet Confusion Errors in Mixed-Alphabet OCR Recognition



SLIDE 1

The Same is Not The Same

Postcorrection of Alphabet Confusion Errors in Mixed-Alphabet OCR Recognition

by Jonas Hempel

SLIDE 2

Bulgarian-German to English

In English: Ivan plowed the field. 'opa' is the German word for 'grandfather'.

SLIDE 3

Alphabet Similarities (1)

  • Latin-Cyrillic transition table
  • Upper font is Times New Roman
  • Lower font is Universum

Table taken from the paper.
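
To make the idea of a transition table concrete, here is a minimal Python sketch. It is not the table from the paper, which is font-specific and not reproduced here. It uses only a small, illustrative subset of lowercase Latin letters that have well-known Cyrillic look-alikes. The names `LATIN_TO_CYRILLIC`, `to_cyrillic` and `to_latin` are hypothetical and chosen for this example.

```python
# Illustrative subset of lowercase Latin letters with Cyrillic look-alikes.
# Assumption: this is NOT the paper's transition table, which depends on the font.
LATIN_TO_CYRILLIC = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    "x": "\u0445",  # CYRILLIC SMALL LETTER HA
    "y": "\u0443",  # CYRILLIC SMALL LETTER U
}
CYRILLIC_TO_LATIN = {cyr: lat for lat, cyr in LATIN_TO_CYRILLIC.items()}


def _convert(token, table):
    """Map every character through `table`; return None if any character is missing."""
    if all(ch in table for ch in token):
        return "".join(table[ch] for ch in token)
    return None


def to_cyrillic(token):
    """Return the Cyrillic look-alike spelling of a Latin token, or None."""
    return _convert(token, LATIN_TO_CYRILLIC)


def to_latin(token):
    """Return the Latin look-alike spelling of a Cyrillic token, or None."""
    return _convert(token, CYRILLIC_TO_LATIN)


if __name__ == "__main__":
    latin = "opa"
    cyrillic = to_cyrillic(latin)
    # The two strings look alike when rendered but consist of different code points.
    print(latin == cyrillic, [hex(ord(ch)) for ch in cyrillic])
```

A token such as 'opa' can be converted in both directions because every one of its letters has a look-alike in the other alphabet, which is what makes this kind of token ambiguous for an OCR engine.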

SLIDE 4

Alphabet Similarities (2)

  • Latin-Greek transition table
  • Upper font is Times New Roman
  • Lower font is Verdana Cursive

Table taken from the paper.

SLIDE 5

Training and Test corpora

  • Sophia-Munich corpus
  • Bulgarian EC corpus
  • Greek-Latin corpus
SLIDE 6

Algorithm

  • Levenshtein distance d0(w_i, v)
  • Normalized similarity value s(v, w_i)
  • Collocation frequency value f(v, w_{i-1}, w_{i+1})

→ score(v) = α·s(v, w_i) + (1-α)·f(v, w_{i-1}, w_{i+1})

  • α: balance parameter
  • τ: threshold parameter
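
The scoring step can be sketched in Python as follows. The slide does not define the normalization, the collocation frequency or the role of τ, so this sketch makes three assumptions: the similarity is taken as 1 minus the edit distance divided by the longer word length, the frequency is the candidate's share of neighbouring bigram counts among all candidates, and τ is a minimum score a candidate must reach to replace the OCR token. The function names, the `bigram_counts` dictionary and the sample data are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(v, w):
    """Assumed normalization: 1 - distance / longer length, a value in [0, 1]."""
    longest = max(len(v), len(w))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(v, w) / longest


def context_count(v, left, right, bigram_counts):
    """Assumed collocation evidence: counts of (left, v) plus (v, right)."""
    return bigram_counts.get((left, v), 0) + bigram_counts.get((v, right), 0)


def best_candidate(w, left, right, candidates, bigram_counts, alpha=0.5, tau=0.5):
    """Return the highest-scoring candidate, or w itself if no score reaches tau."""
    counts = {v: context_count(v, left, right, bigram_counts) for v in candidates}
    total = sum(counts.values())
    best, best_score = w, None
    for v in candidates:
        f = counts[v] / total if total else 0.0
        # score(v) = alpha * s(v, w_i) + (1 - alpha) * f(v, w_{i-1}, w_{i+1})
        score = alpha * similarity(v, w) + (1 - alpha) * f
        if best_score is None or score > best_score:
            best, best_score = v, score
    if best_score is not None and best_score >= tau:
        return best
    return w


if __name__ == "__main__":
    # Hypothetical bigram counts, invented for illustration only.
    bigrams = {("the", "field"): 8, ("field", "was"): 4,
               ("the", "fiend"): 2, ("filed", "was"): 1}
    print(best_candidate("fie1d", "the", "was",
                         ["field", "fiend", "filed"], bigrams))
```

With α close to 1 the choice is driven almost entirely by string similarity, while α close to 0 lets the neighbouring words decide. Under the assumption made here, a higher τ makes the corrector more conservative, leaving the OCR token unchanged more often.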
SLIDE 7

Evaluation Results (1)

  • Bulgarian Sophia-Munich and Bulgarian EC corpora
  • Error rates for plain OCR recognition and after postcorrection
  • Training (Tr) and Test (Te) data
  • ac-error: alphabet confusion error

Table taken from the paper.

SLIDE 8

Evaluation Results (2)

  • Greek newspaper corpus
  • Cursive Times (Ti) and cursive Verdana (Vd) font
  • ac-error: alphabet confusion error

Table taken from the paper.

SLIDE 9