OCR for CJK, Mark Ravina, CEAL Technology Forum 2018

SLIDE 1

OCR for CJK

Mark Ravina CEAL Technology Forum 2018

SLIDE 2

I am an OCR end-user, not an OCR developer
I am a passionate open-source advocate
Commercial software needs to be worth the cost

SLIDE 3

Can we put high-quality CJK OCR on every computer in a library for little or no cost?

SLIDE 4

Options for Japanese OCR

Software            Cost        Quality
Google Drive        Free        Poor
Adobe Acrobat       Expensive   Adequate
ABBYY FineReader    $199        Very good
eTypist             ¥19,800     Very good
Tesseract           Free        Very good

SLIDE 5

“Tesseract is an open-source OCR engine that was developed at HP between 1984 and 1994. Like a supernova, it appeared from nowhere for the 1995 UNLV Annual Test of OCR Accuracy, shone brightly with its results . . .

SLIDE 6

. . . and then vanished back under the same cloak of secrecy under which it had been developed. Now for the first time, details of the architecture and algorithms can be revealed.” Ray Smith, Google (2007)

SLIDE 7

Tesseract currently supports:

Japanese
Korean
Chinese simplified
Chinese traditional
Vietnamese
Uighur
Lao
Khmer

SLIDE 8

Challenges to deploying Tesseract compared to commercial OCR

Command line interface
Documentation is incomprehensible to most end-users
Reads TIFF but not PDF
Lacks strong page segmentation, image cleaning, and pre-processing
Lacks specialized features: e.g., furigana extraction

SLIDE 9

Command line interface

Command structure is simple . . .

SLIDE 10

Replacements for command line

Command structure is simple . . .

Example: tesseract sample.tiff output -l jpn pdf

But many users are intimidated by ANY command line interface

SLIDE 11

SLIDE 12

Replacements for command line

Command structure is simple . . .

Example: tesseract sample.tiff output -l jpn pdf

But many users are intimidated by ANY command line interface
File locations can be cumbersome and confusing:

/Volumes/Transcend/Documents_prime/Research papers/Meiji_petitions_research/v5/tiffs/combined

SLIDE 13

Replacements for command line

Write a small script for end users
Automator on Mac OS X
Third-party options for Windows
Drag and drop
Add options or multiple icons for different layouts and languages?
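
The "small script for end users" idea could look something like the following Python sketch. It is an illustration, not the presenter's script: the choice of Python, the function name build_command, and the defaults of Japanese with PDF output are all assumptions. It wraps the example command from the earlier slide so that files handed to it as arguments (for instance by a drag-and-drop action) are processed without the user typing a command.

```python
import subprocess
import sys
from pathlib import Path


def build_command(image_path, lang="jpn", output_format="pdf"):
    """Build the tesseract argument list for one image file.

    build_command is a hypothetical helper name; lang and output_format
    defaults are assumptions chosen to match the slide's example.
    """
    image = Path(image_path)
    # Tesseract adds the output extension itself, so pass the output
    # base name without a suffix, next to the input file.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang, output_format]


def main(paths):
    for path in paths:
        # Passing a list of arguments (no shell) keeps spaces in folder
        # names, such as "Research papers", from splitting the path.
        subprocess.run(build_command(path), check=True)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved under a name such as ocr_jpn.py (hypothetical), it would be run as: python ocr_jpn.py page1.tiff page2.tiff. Copies with a different lang default could serve as the separate icons for other languages.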

SLIDE 14

Challenges to deploying Tesseract compared to commercial OCR

Command line interface
Documentation is incomprehensible to most end-users
Reads TIFF but not PDF
Lacks strong page segmentation, image cleaning, and pre-processing
Lacks specialized features: e.g., furigana extraction

SLIDE 15

PDF to TIFF conversion

Can be included in Automator on Mac OS X
Free third-party websites
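
For sites that prefer a local, scriptable conversion, one possibility is sketched below. It assumes the pdftoppm tool from Poppler is installed, which the slides do not mention; the function name and the 300 dpi resolution are likewise assumptions made for illustration.

```python
import subprocess
import sys
from pathlib import Path


def build_pdf_to_tiff_command(pdf_path, dpi=300):
    """Build a pdftoppm argument list that writes one TIFF per page.

    Assumes Poppler's pdftoppm is on the PATH; dpi=300 is an assumed
    scanning resolution, not a value taken from the slides.
    """
    pdf = Path(pdf_path)
    # pdftoppm appends a page number and an image extension to this prefix.
    prefix = pdf.with_suffix("")
    return ["pdftoppm", "-tiff", "-r", str(dpi), str(pdf), str(prefix)]


if __name__ == "__main__":
    for path in sys.argv[1:]:
        subprocess.run(build_pdf_to_tiff_command(path), check=True)
```

The resulting per-page TIFF files can then be passed to Tesseract one at a time.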

SLIDE 16

Challenges to deploying Tesseract compared to commercial OCR

Command line interface
Documentation is incomprehensible to most end-users
Reads TIFF but not PDF
Lacks strong page segmentation, image cleaning, and pre-processing
Lacks specialized features: e.g., furigana extraction

SLIDE 17

SLIDE 18

SLIDE 19

SLIDE 20

Text segmentation problems: shift from vertical to horizontal

SLIDE 21

OCR will often put this header text in the middle of body text
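
One setting an end user can experiment with for layout problems like these is Tesseract's page segmentation mode, passed as --psm in Tesseract 4 and later. The sketch below is an illustration under stated assumptions, not something the slides prescribe: the function name is hypothetical, and mode 5 (a single uniform block of vertically aligned text) is only a plausible starting point for vertical Japanese pages. It does not by itself stop a running header from being merged into body text; cropping the header out of the image beforehand may still be needed.

```python
def build_segmented_command(image_path, output_base, lang="jpn", psm=5):
    """Build a tesseract argument list with an explicit page segmentation mode.

    Hypothetical helper. psm=5 is an assumed default for vertical text;
    Tesseract accepts modes 0 through 13.
    """
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    # Options come after the output base name and before the config name.
    return ["tesseract", str(image_path), str(output_base),
            "-l", lang, "--psm", str(psm), "pdf"]
```

Trying a few modes on one representative page and comparing the output is usually quicker than reasoning about which mode ought to work.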

SLIDE 22