OCR for CJK. Mark Ravina, CEAL Technology Forum 2018


  1. OCR for CJK. Mark Ravina, CEAL Technology Forum 2018

  2. • I am an OCR end-user, not an OCR developer • I am a passionate open-source advocate • Commercial software needs to be worth the cost

  3. Can we put high-quality CJK OCR on every computer in a library for little or no cost?

  4. Options for Japanese OCR (cost, quality) • Google Drive: free, poor • Adobe Acrobat: expensive, adequate • ABBYY FineReader: $199, very good • eTypist: ¥19,800, very good • Tesseract: free, very good

  5. “Tesseract is an open-source OCR engine that was developed at HP between 1984 and 1994. Like a supernova, it appeared from nowhere for the 1995 UNLV Annual Test of OCR Accuracy, shone brightly with its results . . .

  6. . . . . and then vanished back under the same cloak of secrecy under which it had been developed. Now for the first time, details of the architecture and algorithms can be revealed.” Ray Smith, Google (2007)

  7. Tesseract currently supports: • Japanese • Korean • Chinese simplified • Chinese traditional • Vietnamese • Uighur • Lao • Khmer

  8. Challenges to deploying Tesseract compared to commercial OCR • Command line interface • Documentation is incomprehensible to most end users • Reads TIFF but not PDF • Lacks strong page segmentation, image cleaning, and pre-processing • Lacks specialized features: e.g., furigana extraction

  9. Command line interface • Command structure is simple . . .

  10. Replacements for command line • Command structure is simple . . . • Example: tesseract sample.tiff output -l jpn pdf • But many users are intimidated by ANY command line interface

  11. Replacements for command line • Command structure is simple . . . • Example: tesseract sample.tiff output -l jpn pdf • But many users are intimidated by ANY command line interface • File locations can be cumbersome and confusing: /Volumes/Transcend/Documents_prime/Research papers/Meiji_petitions_research/v5/tiffs/combined

  12. Replacements for command line • Write a small script for end users • Automator on Mac OS X • Third-party options for Windows • Drag and drop • Add options or multiple icons for different layouts and languages?
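
The "small script for end users" idea can be sketched in Python. This is a minimal sketch, not the presenter's script: the function names `build_command` and `ocr_files` and the `PRESETS` table are hypothetical, and `jpn_vert` assumes a vertical-text Japanese language pack is installed alongside `jpn`. The command itself follows the slide's example (`tesseract sample.tiff output -l jpn pdf`).

```python
import subprocess
from pathlib import Path

# Hypothetical presets, one per desktop icon. "jpn" comes from the slide's
# example; "jpn_vert" is an assumption about the installed language packs.
PRESETS = {
    "japanese": "jpn",
    "japanese_vertical": "jpn_vert",
}


def build_command(image_path, lang="jpn"):
    """Return the tesseract command for one image as an argument list."""
    image = Path(image_path)
    # Tesseract appends ".pdf" itself, so pass the output name without a suffix.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_files(paths, preset="japanese"):
    """Run tesseract on each dropped file and return the paths that failed."""
    failed = []
    for path in paths:
        result = subprocess.run(build_command(path, PRESETS[preset]))
        if result.returncode != 0:
            failed.append(path)
    return failed
```

Because the command is passed as an argument list rather than a single string, folder names containing spaces (such as "Research papers" in the path on the previous slide) need no quoting. A drag-and-drop wrapper such as an Automator action could hand the dropped file paths to `ocr_files`, with one icon per preset.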

  13. Challenges to deploying Tesseract compared to commercial OCR • Command line interface • Documentation is incomprehensible to most end users • Reads TIFF but not PDF • Lacks strong page segmentation, image cleaning, and pre-processing • Lacks specialized features: e.g., furigana extraction

  14. PDF to TIFF conversion • Can be included in Automator on Mac OS X • Free third-party websites
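
The slide names Automator and third-party websites for this step. As an alternative for scripted use, the conversion can also be sketched with ImageMagick, which the deck mentions later for pre-processing. This is a sketch under assumptions: the function name `pdf_to_tiff_command` is hypothetical, 300 dpi is an assumed scan resolution, ImageMagick needs Ghostscript installed to read PDFs, and ImageMagick 7 installations may use `magick` in place of `convert`.

```python
from pathlib import Path


def pdf_to_tiff_command(pdf_path, out_dir, dpi=300):
    """Return an ImageMagick command that writes one TIFF per PDF page."""
    pdf = Path(pdf_path)
    # ImageMagick expands %03d to the page index: 000, 001, 002, ...
    pattern = Path(out_dir) / f"{pdf.stem}-%03d.tiff"
    # -density must come before the input file so the PDF is rasterized
    # at the requested resolution.
    return ["convert", "-density", str(dpi), str(pdf), str(pattern)]
```

The returned list can be passed to `subprocess.run`, and each resulting TIFF can then be given to Tesseract.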

  15. Challenges to deploying Tesseract compared to commercial OCR • Command line interface • Documentation is incomprehensible to most end users • Reads TIFF but not PDF • Lacks strong page segmentation, image cleaning, and pre-processing • Lacks specialized features: e.g., furigana extraction

  16. Text segmentation problems: shift from vertical to horizontal

  17. OCR will often put this header text in the middle of body text

  18. Similar problems for page numbers . . .

  19. OCR will often try to interpret shadows as text, with confusing results

  20. Half line glosses make OCR crazy

  21. Improving page segmentation and image processing • Use OS X bundled tools: contrast, crop, greyscale • Pre-process with open source tools: ImageMagick (noise and skew), OpenCV • Don’t use Tesseract on dirty or complex images
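
The ImageMagick clean-up steps listed on this slide (crop, greyscale, noise, skew, contrast) could be chained in one command. The following is a minimal sketch, not the presenter's workflow: the function name `cleanup_command` is hypothetical, the 40% deskew threshold and the crop geometry in the comment are illustrative values, and ImageMagick 7 installations may use `magick` in place of `convert`.

```python
def cleanup_command(src, dst, crop=None):
    """Return an ImageMagick command that cleans a scan before OCR."""
    cmd = ["convert", src]
    if crop:
        # Geometry such as "2000x3000+100+150" (width x height + x + y offset).
        # +repage resets the canvas after cropping.
        cmd += ["-crop", crop, "+repage"]
    cmd += [
        "-colorspace", "Gray",  # greyscale
        "-deskew", "40%",       # straighten a skewed scan
        "-despeckle",           # reduce speckle noise
        "-normalize",           # stretch contrast
        dst,
    ]
    return cmd
```

Cropping to the body text before OCR is one way to keep running headers, page numbers, and edge shadows out of the output, which are the segmentation problems shown on the preceding slides.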
