OCR for CJK
Mark Ravina CEAL Technology Forum 2018
I am an OCR end-user, not an OCR developer
I am a passionate open-source advocate
Commercial software needs to be worth the cost
Can we put high-quality CJK OCR on every computer in
Options for Japanese OCR
Software            Cost        Quality
Google Drive        Free        Poor
Adobe Acrobat       Expensive   Adequate
ABBYY FineReader    $199        Very good
eTypist             ¥19,800     Very good
Tesseract           Free        Very good
“Tesseract is an open-source OCR engine that was developed at HP between 1984 and 1994. Like a supernova, it appeared from nowhere for the 1995 UNLV Annual Test of OCR Accuracy, shone brightly with its results . . . and then vanished back under the same cloak . . . for the first time, details of the architecture and algorithms can be revealed.”
Ray Smith, Google (2007)
Tesseract currently supports:
Challenges to deploying Tesseract compared to commercial OCR
users
and pre-processing
Command line interface
Replacements for command line interface
Example: tesseract sample.tiff output -l jpn pdf
/Volumes/Transcend/Documents_prime/Research papers/Meiji_petitions_research/v5/tiffs/combined
Replacements for command line
layouts and languages?
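One possible replacement for typing the command by hand is a small wrapper script that takes the language and layout as parameters. The sketch below is only an illustration: the function names are hypothetical, and it assumes Tesseract 4 or later, where the page segmentation mode is passed as --psm (mode 5 is a single block of vertically aligned text) and jpn_vert is the vertical Japanese model. Passing the arguments as a list means a long path with spaces, like the one above, needs no shell quoting.

```python
import subprocess


def build_tesseract_command(image, output_base, lang="jpn", psm=None, fmt="pdf"):
    """Return the argument list for one Tesseract run (hypothetical helper)."""
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if psm is not None:
        # Assumption: Tesseract 4+ syntax for page segmentation mode.
        cmd += ["--psm", str(psm)]
    # The output config (for example "pdf") comes last.
    cmd.append(fmt)
    return cmd


def run_tesseract(image, output_base, **options):
    """Run Tesseract on one image; requires tesseract on the PATH."""
    cmd = build_tesseract_command(image, output_base, **options)
    # A list of arguments avoids shell quoting problems with spaces in paths.
    return subprocess.run(cmd, check=True)
```

For a vertical Japanese page, a call such as run_tesseract("sample.tiff", "output", lang="jpn_vert", psm=5) would be one thing to try.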
PDF to TIFF conversion
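The slides do not say which tool does the conversion. As one hedged possibility, the sketch below assumes Poppler's pdftoppm is installed and uses its -tiff and -r (resolution in DPI) options; the helper names and the 300 DPI default are illustrative choices, not a recommendation from the talk.

```python
import subprocess


def build_pdftoppm_command(pdf_path, output_prefix, dpi=300):
    """Return the argument list for converting a PDF to TIFF pages."""
    # Assumption: Poppler's pdftoppm; it writes one numbered image per page.
    return ["pdftoppm", "-tiff", "-r", str(dpi), str(pdf_path), str(output_prefix)]


def pdf_to_tiff(pdf_path, output_prefix, dpi=300):
    """Run the conversion; requires pdftoppm on the PATH."""
    cmd = build_pdftoppm_command(pdf_path, output_prefix, dpi)
    return subprocess.run(cmd, check=True)
```

The resulting TIFF files can then be passed to Tesseract one page at a time.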
Text segmentation problems: shift from vertical to horizontal
OCR will often put this header text in the middle of body text
Similar problems for page numbers . . .
OCR will often try to interpret shadows as text, with confusing results
Half line glosses make OCR crazy
Improving page segmentation and image processing
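One common image-processing step is to binarize the scan before OCR so that faint gray regions, such as page shadows, are more likely to be pushed to white. The sketch below is pure Python and uses Otsu's method to pick a global threshold. This is a textbook technique chosen for illustration, not something the slides prescribe, and the function names are hypothetical. A global threshold will not remove every shadow, and it does nothing for misplaced headers, page numbers, or half-line glosses.

```python
def otsu_threshold(pixels):
    """Pick a global threshold for 8-bit grayscale values using Otsu's method."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    weight_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0         # sum of their gray values
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_bg += hist[t]
        sum_bg += t * hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor).
        var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels, threshold):
    """Map values above the threshold to white (255), the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]
```

In practice the pixel list would come from an image library, and the binarized page would be saved back to TIFF before running OCR.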