Retrieving OCR Text: A Survey of Current Approaches




SLIDE 1

Retrieving OCR Text: A Survey of Current Approaches
Information Retrieval Lab, Illinois Institute of Technology

  • S. Beitzel
  • E. Jensen
  • D. Grossman

{steve,ej,grossman}@ir.iit.edu

SLIDE 2

Overview

  • Models for OCR Text
  • Processing OCR Text for Categorization
  • Auto-correction of OCR Errors
SLIDE 3

Models for OCR Text

  • Mittendorf, Schauble, and Sheridan (1995, 1996)
    – Incorporate probabilities of typical OCR errors
  • Harding, Croft, Weir (1997)

    – Addition of character-based n-grams to the model.
    – Ex: Environment
      • _en env nvi vir iro ron onm nme men ent (3-grams)
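
The 3-gram example above can be sketched in a few lines. This is only an illustration: the function name `char_ngrams` and the choice of a single leading underscore as padding are assumptions based on the "_en" shown on the slide, not a description of how Harding, Croft, and Weir built their index.

```python
def char_ngrams(term, n=3, pad="_"):
    """Return the overlapping character n-grams of a term.

    A single pad character is prepended so that the first n-gram marks
    the start of the word (an assumption taken from the slide's "_en").
    """
    padded = pad + term.lower()
    # Slide a window of width n across the padded term, one character at a time.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


clean = char_ngrams("Environment")
noisy = char_ngrams("Env1ronment")  # hypothetical OCR error: "i" read as "1"

print(clean)
print(sorted(set(clean) & set(noisy)))
```

With this sketch, the misrecognized "Env1ronment" still shares seven of its ten 3-grams with the clean term, which is the intuition behind adding n-grams to a retrieval model for OCR text.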
SLIDE 4

Auto-Correction of OCR Errors

  • Liu (1991)
    – Classify each type of error
    – Use dictionary lookup to identify candidate terms
  • Taghva, Borsack, and Condit (1994)
    – Clustering to group misspellings with their correctly spelled terms
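
A minimal sketch of dictionary lookup for candidate terms follows. It is not Liu's method: the slide does not say which error classes were used or how candidates were ranked, so the Levenshtein distance, the three coarse error labels in `classify_error`, the `max_distance` threshold, and the toy `DICTIONARY` are all assumptions chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def classify_error(ocr_term, candidate):
    """Assign a coarse, hypothetical error label based on length difference."""
    if ocr_term == candidate:
        return "none"
    if len(ocr_term) == len(candidate):
        return "substitution"
    # A longer OCR term suggests a spurious character; a shorter one, a dropped character.
    return "insertion" if len(ocr_term) > len(candidate) else "deletion"


def candidates(ocr_term, dictionary, max_distance=1):
    """Return (term, distance, error label) triples within max_distance, closest first."""
    found = []
    for word in dictionary:
        distance = edit_distance(ocr_term, word)
        if distance <= max_distance:
            found.append((word, distance, classify_error(ocr_term, word)))
    return sorted(found, key=lambda item: (item[1], item[0]))


# Toy dictionary, invented for this example.
DICTIONARY = ["retrieval", "retrieve", "survey", "model", "modem"]

print(candidates("retrieva1", DICTIONARY))
print(candidates("modei", DICTIONARY))
```

The second call returns two equally close candidates ("model" and "modem"), which shows why lookup alone only proposes candidates; choosing among them needs extra evidence such as error-type likelihoods or context. The same distance function could also serve as the similarity measure for clustering misspellings around a correctly spelled term.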

SLIDE 5

OCR Text for Categorization

  • Hoch (1994)
    – Applying a categorizer to OCR text showed degraded performance on OCR data.
  • Junker and Hoch (1997)
    – N-grams were also shown to give some improvement [Junk97].
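
To illustrate why n-gram features can help a categorizer on OCR text, the sketch below scores a noisy document against two category samples, once with whole-word features and once with character 3-grams. Everything here is hypothetical: the category names, the training strings, the OCR errors, and the overlap score are invented for the example and do not reproduce the experiments of Hoch or of Junker and Hoch.

```python
from collections import Counter


def word_profile(text):
    """Count whole-word features."""
    return Counter(text.lower().split())


def trigram_profile(text):
    """Count character 3-grams per word, with a leading pad on each word."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word
        counts.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return counts


def scores(text, training, featurize):
    """Score each category by the feature counts it shares with the text."""
    doc = featurize(text)
    # Counter & Counter keeps the minimum count of each shared feature.
    return {category: sum((doc & featurize(sample)).values())
            for category, sample in training.items()}


# Toy training samples and a toy OCR-damaged document, invented for this example.
TRAINING = {
    "finance": "invoice payment account balance",
    "travel": "flight hotel booking itinerary",
}
OCR_TEXT = "inv0ice paymcnt acc0unt"

print(scores(OCR_TEXT, TRAINING, word_profile))
print(scores(OCR_TEXT, TRAINING, trigram_profile))
```

On this toy input, whole-word matching scores 0 for both categories because every word contains an OCR error, while 3-gram matching scores 9 for "finance" and 0 for "travel".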

SLIDE 6

Summary

  • Models exist for OCR retrieval
  • N-grams have been shown to have some success
  • No large standard test collection of OCR data; small collections exist with some early TREC data.