OCR for CJK, Mark Ravina, CEAL Technology Forum 2018



SLIDE 1

OCR for CJK

Mark Ravina CEAL Technology Forum 2018

SLIDE 2
  • I am an OCR end-user, not an OCR developer
  • I am a passionate open-source advocate
  • Commercial software needs to be worth the cost
SLIDE 3

Can we put high-quality CJK OCR on every computer in a library for little or no cost?

SLIDE 4

Options for Japanese OCR

Software            Cost        Quality
Google Drive        Free        Poor
Adobe Acrobat       Expensive   Adequate
ABBYY FineReader    $199        Very good
eTypist             ¥19,800     Very good
Tesseract           Free        Very good

SLIDE 5

“Tesseract is an open-source OCR engine that was developed at HP between 1984 and 1994. Like a supernova, it appeared from nowhere for the 1995 UNLV Annual Test of OCR Accuracy, shone brightly with its results . . .

SLIDE 6

. . . and then vanished back under the same cloak of secrecy under which it had been developed. Now for the first time, details of the architecture and algorithms can be revealed.”

Ray Smith, Google (2007)

SLIDE 7

Tesseract currently supports:

  • Japanese
  • Korean
  • Chinese simplified
  • Chinese traditional
  • Vietnamese
  • Uighur
  • Lao
  • Khmer
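
Each language on this slide is selected on the command line with a short code passed to Tesseract's -l option. The sketch below maps the slide's language names to the codes used for Tesseract's trained data files. The dictionary and the helper name lang_argument are illustrative, not part of the talk, and each language's trained data has to be installed before its code will work.

```python
# Language names from the slide mapped to Tesseract language codes.
# The mapping and helper are illustrative; install the matching
# trained data files before using a code.
TESSERACT_LANG_CODES = {
    "Japanese": "jpn",
    "Korean": "kor",
    "Chinese simplified": "chi_sim",
    "Chinese traditional": "chi_tra",
    "Vietnamese": "vie",
    "Uighur": "uig",
    "Lao": "lao",
    "Khmer": "khm",
}


def lang_argument(*languages):
    """Build the value for tesseract's -l flag; '+' combines languages."""
    return "+".join(TESSERACT_LANG_CODES[name] for name in languages)
```

For a document that mixes two languages, lang_argument("Japanese", "Korean") gives a single value that can follow -l.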

SLIDE 8

Challenges to deploying Tesseract compared to commercial OCR

  • Command line interface
  • Documentation is incomprehensible to most end-users
  • Reads TIFF but not PDF
  • Lacks strong page segmentation, image cleaning, and pre-processing
  • Lacks specialized features: e.g., furigana extraction
SLIDE 9

Command line interface

  • Command structure is simple . . .
SLIDE 10

Replacements for command line

  • Command structure is simple . . .

Example: tesseract sample.tiff output -l jpn pdf

  • But many users are intimidated by ANY command line interface

SLIDE 11
SLIDE 12

Replacements for command line

  • Command structure is simple . . .

Example: tesseract sample.tiff output -l jpn pdf

  • But many users are intimidated by ANY command line interface
  • File locations can be cumbersome and confusing:

/Volumes/Transcend/Documents_prime/Research papers/Meiji_petitions_research/v5/tiffs/combined

SLIDE 13

Replacements for command line

  • Write a small script for end users
  • Automator on Mac OS X
  • Third-party options for Windows
  • Drag and drop
  • Add options or multiple icons for different layouts and languages?
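
One possible shape for such a small script is sketched below in Python. It assumes the tesseract program is on the PATH and the Japanese trained data is installed. The function names build_command and ocr_files are hypothetical. The command is passed as a list rather than a shell string, so long paths containing spaces need no quoting.

```python
import subprocess
import sys
from pathlib import Path


def build_command(image_path, lang="jpn", make_pdf=True):
    """Return the tesseract command for one image as an argument list."""
    image = Path(image_path)
    # Tesseract adds .txt or .pdf to the output base name itself.
    output_base = image.with_suffix("")
    command = ["tesseract", str(image), str(output_base), "-l", lang]
    if make_pdf:
        command.append("pdf")  # request a searchable PDF instead of plain text
    return command


def ocr_files(paths, lang="jpn"):
    """Run tesseract on each file; raises if tesseract reports an error."""
    for path in paths:
        subprocess.run(build_command(path, lang), check=True)


if __name__ == "__main__":
    # Files dropped onto a launcher typically arrive as command-line arguments.
    ocr_files(sys.argv[1:])
```

A launcher such as an Automator action could call this script with the dropped files as arguments. Separate launchers for other languages or layouts would only need to change the lang value.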

SLIDE 14

Challenges to deploying Tesseract compared to commercial OCR

  • Command line interface
  • Documentation is incomprehensible to most end-users
  • Reads TIFF but not PDF
  • Lacks strong page segmentation, image cleaning, and pre-processing
  • Lacks specialized features: e.g., furigana extraction
SLIDE 15

PDF to TIFF conversion

  • Can be included in Automator on Mac OS X
  • Free third-party websites
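
The conversion can also be scripted locally. The sketch below uses Ghostscript, a tool the slides do not mention, and assumes the gs program is installed. The function names and the 300 dpi default are illustrative choices. Ghostscript writes one greyscale TIFF per PDF page.

```python
import subprocess
from pathlib import Path


def pdf_to_tiff_command(pdf_path, dpi=300):
    """Return a Ghostscript command that writes one TIFF per PDF page."""
    pdf = Path(pdf_path)
    # Ghostscript replaces %03d with the page number: scan_001.tiff, ...
    output_pattern = f"{pdf.with_suffix('')}_%03d.tiff"
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=tiffgray",           # 8-bit greyscale TIFF output
        f"-r{dpi}",                    # rendering resolution in dpi
        f"-sOutputFile={output_pattern}",
        str(pdf),
    ]


def pdf_to_tiff(pdf_path, dpi=300):
    """Run the conversion; raises if Ghostscript reports an error."""
    subprocess.run(pdf_to_tiff_command(pdf_path, dpi), check=True)
```

The resulting page images can then be passed to Tesseract one at a time.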
SLIDE 16

Challenges to deploying Tesseract compared to commercial OCR

  • Command line interface
  • Documentation is incomprehensible to most end-users
  • Reads TIFF but not PDF
  • Lacks strong page segmentation, image cleaning, and pre-processing
  • Lacks specialized features: e.g., furigana extraction
SLIDE 17
SLIDE 18
SLIDE 19
SLIDE 20

Text segmentation problems: shift from vertical to horizontal

SLIDE 21

OCR will often put this header text in the middle of body text

SLIDE 22

Similar problems for page numbers . . .

SLIDE 23

OCR will often try to interpret shadows as text, with confusing results

SLIDE 24

Half line glosses make OCR crazy

SLIDE 25
SLIDE 26

Improving page segmentation and image processing

  • Use OS X bundled tools
      • Contrast
      • Crop
      • Greyscale
  • Pre-process with open source tools
      • ImageMagick (noise and skew)
      • OpenCV
  • Don’t use Tesseract on dirty or complex images
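
As a rough illustration of the ImageMagick step, the sketch below builds a command that converts a scan to greyscale, stretches contrast, straightens skew, and reduces speckle noise. It assumes ImageMagick 6, where the program is called convert; version 7 installations may use magick instead. The function name and the 40% deskew threshold are illustrative starting points, not settings from the talk.

```python
import subprocess


def clean_image_command(src, dst, deskew_threshold="40%"):
    """Return an ImageMagick command that cleans a scan before OCR."""
    return [
        "convert", str(src),
        "-colorspace", "Gray",         # drop colour information
        "-normalize",                  # stretch contrast
        "-deskew", deskew_threshold,   # straighten a slightly rotated page
        "-despeckle",                  # reduce isolated noise pixels
        "+repage",                     # reset the canvas after deskewing
        str(dst),
    ]


def clean_image(src, dst):
    """Run the clean-up; raises if ImageMagick reports an error."""
    subprocess.run(clean_image_command(src, dst), check=True)
```

Pages with shadows, marginal glosses, or mixed vertical and horizontal text will likely still need manual cropping before OCR.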