Multilingual and Noisy Data Challenges in Large-Scale Book Scanning


slide-1
SLIDE 1

Noisy text in Google Books OCR in Google Research Challenges

Multilingual and Noisy Data Challenges in Large-Scale Book Scanning

Ashok C. Popat

Staff Research Scientist, Google, Inc.

September 17, 2011

Ashok C. Popat September 17, 2011 1 / 73

slide-2
SLIDE 2

Outline

  1. Noisy text in Google Books
  2. OCR in Google Research
  3. Challenges

slide-3
SLIDE 3

Noisy text in Google Books

Joint work with Matt Casey, David Petrou, Viresh Ratnakar, others

slide-4
SLIDE 4

Noisy text in Google Books

  • OCR layer in digitized books can be “noisy”
  • Causes include
      • Mis-specification of language(s) when running OCR
      • OCR’ing unsupported scripts or languages
      • Mis- or un-detected rotated pages
      • Scanning defects: smudges, tears, etc.
      • Errors in page-layout analysis (OCR’ing math or pictures)
      • Normal OCR errors

slide-5
SLIDE 5

Examples

  • Good or bad?
  • Bad. Chinese original; apparent mis-specification of language when OCR’ing

slide-7
SLIDE 7

Examples (continued)

  • Good or bad?
  • Fairly good. Notice specialized terms and abbreviations.

slide-9
SLIDE 9

Examples (continued)

  • Good or bad?
  • Bad. Vertical text apparently detected as left-to-right in OCR.

slide-11
SLIDE 11

Examples (continued)

  • Good or bad?
  • Fairly bad: light text in image.

slide-13
SLIDE 13

Examples (continued)

  • Good or bad?
  • Fairly good, but in an obscure language.

slide-15
SLIDE 15

Summary of approach

  • Model string as a locally-stationary Markov source
  • Fit a mixture of language-specific, sequentially predictive character-level N-gram models at each position in the text
  • In estimating mixing weights, impose a requirement that they be spatially coherent
  • Include low-order all-language “background LM” in the mixture
  • Use the estimated mixing weights, observed language-specific log-likelihoods, and expected per-language likelihoods to derive a score at each position
  • Estimated mixing weights can be used to estimate the distribution over languages
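The mixture idea in the summary of approach can be illustrated with a toy sketch. It is not the system described in the talk: add-one-smoothed character trigram models stand in for the PPM-C models, the training strings are made-up snippets, the alphabet size of 256 is an assumption, and the spatial-coherence constraint on the weights is omitted. Plain EM is used to estimate one set of mixing weights for a whole passage.

```python
from collections import Counter


class CharNgram:
    """Add-one-smoothed character n-gram model (a stand-in for PPM-C; order >= 2)."""

    def __init__(self, text, order=3, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed size of the symbol inventory
        self.context_counts = Counter()
        self.ngram_counts = Counter()
        padded = " " * (order - 1) + text
        for i in range(order - 1, len(padded)):
            context = padded[i - order + 1:i]
            self.context_counts[context] += 1
            self.ngram_counts[context + padded[i]] += 1

    def prob(self, history, ch):
        # Use the last (order - 1) characters of the history as the context.
        context = (" " * (self.order - 1) + history)[-(self.order - 1):]
        return (self.ngram_counts[context + ch] + 1) / (
            self.context_counts[context] + self.alphabet_size)


class Background:
    """Uniform low-order model shared by all languages (the "background LM")."""

    def __init__(self, alphabet_size=256):
        self.p = 1.0 / alphabet_size

    def prob(self, history, ch):
        return self.p


def mixing_weights(text, models, iterations=25):
    """Estimate mixture weights over the models for one passage using EM."""
    names = list(models)
    probs = [[models[n].prob(text[:i], text[i]) for n in names]
             for i in range(len(text))]
    weights = [1.0 / len(names)] * len(names)
    for _ in range(iterations):
        totals = [0.0] * len(names)
        for row in probs:
            joint = [w * p for w, p in zip(weights, row)]
            z = sum(joint)
            for k, value in enumerate(joint):
                totals[k] += value / z  # responsibility of model k for this character
        weights = [t / len(probs) for t in totals]
    return dict(zip(names, weights))


# Hypothetical toy training data; real models would use large corpora.
english = "the quick brown fox jumps over the lazy dog and the cat sat on the mat " * 5
german = "der schnelle braune fuchs springt über den faulen hund und die katze sitzt " * 5
models = {"en": CharNgram(english), "de": CharNgram(german), "bg": Background()}
weights = mixing_weights("the cat and the dog sat on the mat", models)
```

In this sketch a passage that no language model explains well ends up with most of its weight on the background model, which is one way to read a "garbage" signal from the weights.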

slide-21
SLIDE 21

Language coverage

  • Approximately 50 languages trained, in mixed case and lower case
  • Training
      • Train seed models on curated reference translations, using 9-fold cross-validation to determine model order
      • Use seed models to filter parallel training corpora
      • Train full models on filtered data
      • Prune to various memory sizes on the basis of context count thresholds
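Pruning by context count threshold can be sketched as follows. The table layout and the function names `count_ngrams` and `prune` are assumptions made for illustration; the talk does not describe the actual data structures.

```python
from collections import Counter


def count_ngrams(text, max_order):
    """Map each context of 0..max_order-1 characters to a Counter of next characters."""
    table = {}
    for i, ch in enumerate(text):
        for k in range(max_order):
            if i - k < 0:
                break
            context = text[i - k:i]
            table.setdefault(context, Counter())[ch] += 1
    return table


def prune(table, min_context_count):
    """Drop contexts seen fewer than min_context_count times; keep the empty context."""
    return {context: followers for context, followers in table.items()
            if context == "" or sum(followers.values()) >= min_context_count}


table = count_ngrams("abracadabra", max_order=3)
pruned = prune(table, min_context_count=2)
```

Raising the threshold gives a smaller model, which corresponds to the aggressive, moderate, and conservative pruning levels mentioned in the talk.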

slide-22
SLIDE 22

An English PPM-C Model’s Idea of ‘Typical’ (N = 8)

  • Aggressive pruning
  • Moderate pruning
  • Conservative pruning

slide-25
SLIDE 25

Mystery Language ‘G’ (N = 8)

  • Aggressive pruning
  • Moderate pruning
  • Conservative pruning

slide-28
SLIDE 28

Mystery Language ‘J’ (N = 8)

  • Aggressive pruning
  • Moderate pruning
  • Conservative pruning

slide-31
SLIDE 31

Evaluation for application to e-books

  • Data: text fragment 5-tuples sampled from a large number of books
  • Ranking task: sort text samples in decreasing order of “presentability”
  • Calibration task: pick worst sample that is acceptable, if any, from each tuple
  • Compute Kendall’s tau for each set using human- and machine-generated rankings
  • Average value of Kendall’s tau is taken as a measure of how well the garbage detector matches human judgments
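The per-tuple agreement measure can be sketched as below. The talk does not say which variant of Kendall’s tau was used; this sketch assumes tau-a with no tied ranks, and the rankings shown are invented.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical 5-tuples: rank given to each of five text samples.
human_rankings = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
machine_rankings = [[1, 3, 2, 4, 5], [5, 4, 3, 2, 1]]

taus = [kendall_tau(h, m) for h, m in zip(human_rankings, machine_rankings)]
mean_tau = sum(taus) / len(taus)
```

Here the first tuple differs by one swapped pair (tau = 0.8) and the second is fully reversed (tau = -1.0); the average over tuples is the summary figure.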

slide-32
SLIDE 32

Noisy text in Google Books OCR in Google Research Challenges

Evaluation interface: ranking phase

Ashok C. Popat September 17, 2011 16 / 73

slide-33
SLIDE 33

Evaluation interface: ranking phase

slide-37
SLIDE 37

Evaluation interface: calibration phase

slide-38
SLIDE 38

Methods compared against

  • OCR-confidence output, using several versions of a commercial engine
  • HEUR-T, HEUR-K: Heuristics by Taghva ’01; Kulp ’07
  • Dictionaries: Extract vocabulary files from Web data. Use the most frequent N terms, where N ranges from 1K to 1M
      • Hard Dictionary (HDL, HDM): Penalize passage by C1 for each OOV term; penalize by C2 for each punctuation mark or symbol tokenized as a singleton
      • Soft Dictionary (SDL, SDM): For each term in a passage, find edit distance to the closest dictionary word (or C2 for a punctuation mark or symbol tokenized as a singleton). Penalty for the passage is the total edit distance divided by the passage length in Unicode code points
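The two dictionary baselines can be sketched roughly as follows. The values of C1 and C2, the tokenization, the lowercasing, and the use of summed token lengths as the passage length are all assumptions; the talk gives only the outline of the penalties.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def is_symbol(token):
    """True for a punctuation mark or symbol tokenized as a singleton."""
    return len(token) == 1 and not token.isalnum()


def hard_penalty(tokens, vocab, c1=1.0, c2=0.5):
    """C1 per out-of-vocabulary term, C2 per singleton symbol (C1, C2 are made up)."""
    total = 0.0
    for token in tokens:
        if is_symbol(token):
            total += c2
        elif token.lower() not in vocab:
            total += c1
    return total


def soft_penalty(tokens, vocab, c2=0.5):
    """Total edit distance to the closest dictionary word, per code point."""
    total = 0.0
    for token in tokens:
        if is_symbol(token):
            total += c2
        else:
            total += min(edit_distance(token.lower(), word) for word in vocab)
    return total / sum(len(token) for token in tokens)


vocab = {"the", "cat", "sat"}
tokens = ["the", "c@t", "sat", "#"]
hard = hard_penalty(tokens, vocab)
soft = soft_penalty(tokens, vocab)
```

The soft variant gives partial credit to near misses such as "c@t", which may explain why it tracks human judgments more closely than the hard variant in the comparison tables.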

slide-39
SLIDE 39

Comparison among methods

Table: All considered languages (approx. 30)

condition     mean τ   N      s²      95% CI
intra-rater   0.790    522    0.050   (0.770, 0.809)
inter-rater   0.668    3056   0.087   (0.658, 0.679)
OCR-conf      0.263    2610   0.216   (0.245, 0.280)
HEUR-T        0.339    2610   0.146   (0.325, 0.354)
HEUR-K        0.381    2610   0.149   (0.367, 0.396)
SEQ           0.600    2610   0.090   (0.589, 0.612)
SPA           0.665    2610   0.086   (0.654, 0.676)
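The reported intervals appear consistent with a normal approximation, mean ± 1.96·sqrt(s²/N), although the talk does not state how they were computed. A quick check on the SPA row, under that assumption:

```python
import math


def normal_ci(mean, variance, n, z=1.96):
    """Normal-approximation confidence interval for a sample mean."""
    half_width = z * math.sqrt(variance / n)
    return mean - half_width, mean + half_width


# SPA row of the all-languages table: mean tau 0.665, N = 2610, s^2 = 0.086.
low, high = normal_ci(0.665, 0.086, 2610)
```

The result agrees with the tabulated (0.654, 0.676) to within the rounding of the published inputs.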

slide-40
SLIDE 40

Comparison among methods (continued)

Table: Eleven intersection languages

condition     mean τ   N      s²      95% CI
intra-rater   0.803    291    0.052   (0.777, 0.829)
inter-rater   0.665    1895   0.093   (0.651, 0.679)
OCR-conf      0.251    1455   0.239   (0.226, 0.276)
HEUR-T        0.375    1455   0.135   (0.356, 0.394)
HEUR-K        0.428    1455   0.141   (0.408, 0.447)
HDM1M         0.516    1455   0.111   (0.499, 0.533)
SDM50K        0.586    1455   0.106   (0.570, 0.603)
SEQ           0.607    1455   0.094   (0.592, 0.623)
SPA           0.670    1455   0.087   (0.655, 0.686)

slide-41
SLIDE 41

Application to e-book readers

  • For a given paragraph in an e-book, is it better to render the text or swap in the image?

slide-43
SLIDE 43

Application to mobile device OCR

  • Can we select only the good OCR text from a given image region?
  • Viterbi search:
      • Two states: garbage and clean
      • Scores computed as described, plus transition costs
      • Transitions discounted based on image distance between symbols
  • About 30 languages enabled; language not set in advance
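A two-state Viterbi search of this kind might look like the sketch below. The score values, the `base_cost` and `decay` parameters, and the specific discount formula (switching states gets cheaper as the gap between symbols grows) are assumptions for illustration, not details from the talk.

```python
def viterbi_clean_garbage(scores, distances, base_cost=2.0, decay=0.1):
    """Label each symbol as clean or garbage.

    scores[i] is a (clean_score, garbage_score) pair of log-domain scores;
    distances[i] is the image distance between symbols i and i + 1.
    """
    states = ("clean", "garbage")
    best = list(scores[0])
    back = []
    for i in range(1, len(scores)):
        # Assumed discount: a larger gap between symbols lowers the switch cost.
        switch_cost = base_cost / (1.0 + decay * distances[i - 1])
        new_best, pointers = [], []
        for s in range(2):
            candidates = [best[p] - (switch_cost if p != s else 0.0)
                          for p in range(2)]
            p_star = max(range(2), key=lambda p: candidates[p])
            new_best.append(candidates[p_star] + scores[i][s])
            pointers.append(p_star)
        best = new_best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(2), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [states[s] for s in path]


# Hypothetical scores: one weakly garbage-looking symbol, then a garbage run.
scores = [(-1, -3), (-1, -3), (-2.5, -2), (-1, -3), (-4, -1), (-4, -1), (-4, -1)]
labels = viterbi_clean_garbage(scores, distances=[1] * 6)
```

The transition cost keeps the isolated weak outlier labeled clean, while the sustained run at the end is worth the cost of one switch to garbage.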

slide-44
SLIDE 44

Example 1

slide-45
SLIDE 45

OCR Engine A

slide-46
SLIDE 46

OCR Engine B

slide-47
SLIDE 47

Example 2

slide-48
SLIDE 48

OCR Engine A

slide-49
SLIDE 49

OCR Engine B

slide-50
SLIDE 50

Example 3

slide-51
SLIDE 51

OCR Engine A

slide-52
SLIDE 52

OCR Engine B

slide-53
SLIDE 53

Example 4

slide-54
SLIDE 54

OCR Engine A

slide-55
SLIDE 55

OCR Engine B

slide-56
SLIDE 56

Summary

  • Pan-lingual detector of noisy text
  • Spatial and sequential versions
  • Works well for most of the approx. 30 languages considered
  • Works well relative to several plausible alternatives
  • Application in books and beyond

slide-57
SLIDE 57

...which brings us to OCR

  • Joint work with Eugene Ie, Mike Jahr, Dmitriy Genzel, Franz Och, Andrew Senior, Nemanja Spasojevic, Frank Tang, Remco Teunen, others

slide-58
SLIDE 58

OCR in Google Research

  • Organize the world’s information and make it universally accessible and useful
  • OCR still unavailable for some important languages
  • Take advantage of latest technologies
  • Massive amounts of data available
  • Goal: Best-in-the-world OCR for all scripts and languages

slide-59
SLIDE 59

A non-trivial task...

slide-65
SLIDE 65

Approach

  • Entirely from scratch
  • Multiple models and features
  • MERT-optimized log-linear combination
  • Latest algorithms, e.g., from speech
  • Data-driven based on massive amounts of data
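A log-linear combination scores each recognition hypothesis as a weighted sum of log-domain feature values. The sketch below is illustrative only: the feature names ("optical", "lm"), the values, and the weights are invented, and the MERT step that would tune the weights against an error metric on a dev set is not shown.

```python
def loglinear_score(features, weights):
    """Weighted sum of log-domain feature values for one hypothesis."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(hypotheses, weights):
    """Return the hypothesis with the highest combined score."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))


# Hypothetical hypotheses for one word image.
hypotheses = [
    {"text": "cat", "features": {"optical": -2.0, "lm": -1.0}},
    {"text": "c@t", "features": {"optical": -1.5, "lm": -6.0}},
]

with_lm = best_hypothesis(hypotheses, {"optical": 1.0, "lm": 0.5})["text"]
optical_only = best_hypothesis(hypotheses, {"optical": 1.0, "lm": 0.0})["text"]
```

With the language-model feature switched off, the optically closer but implausible string wins, which shows why the relative weights need tuning.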

slide-70
SLIDE 70

Early results on book images

  • Image
  • Transcription (Ref = human annotator; Rec = Google Research)

slide-81
SLIDE 81

Good progress so far...

slide-82
SLIDE 82

...but by no means done

slide-83
SLIDE 83

Example: Thai

slide-85
SLIDE 85

Bootstrapping a basic Thai-capable system

  • Steps
      1. Download 25 MB of Thai text from Wikisource
      2. Generate synthetic training data from text
      3. Split data into training and dev set
      4. Train LM from training set
      5. Train optical models from training set
      6. Tune system on dev set (MERT)
      7. Run on images from Google Books
  • Entire process: ~12 hours!
  • Crippled system: small LM, small optical models, few fonts, no real dev set, no ground-truth test set

slide-86
SLIDE 86

Current topics of interest

  • Synthetic training data
  • Unsupervised / discriminative training
  • Discriminative feature extraction
  • More languages

slide-87
SLIDE 87

Challenges in Google Books

slide-88
SLIDE 88

Joint work with Dar-Shyang Lee, Jeff Breidenbach, Stavan Parikh, Viresh Ratnakar, Ray Smith, Ranjith Unnikrishnan, others

slide-89
SLIDE 89

Challenges: Multiple scripts/languages on a page

slide-90
SLIDE 90

Challenges: Per-word script and language variation

slide-91
SLIDE 91

Challenges: Geometric and graylevel distortions

slide-92
SLIDE 92

Other challenges

  • Multiple languages in same or similar script
      • Arabic-Farsi, Marathi-Hindi-Nepali
      • Bad initial OCR can become a virtue
  • The same language in multiple scripts
      • Chinese, Japanese, Azerbaijani, Mongolian, Punjabi, Hindi, Serbian, Pali
  • Archaic and reformed orthographies
      • Fraktur, Imperial Russian, 18th-century English
  • Dark matter: what scripts and languages are actually present?

slide-96
SLIDE 96

More challenge examples

slide-106
SLIDE 106

Thank You

slide-107
SLIDE 107

Bibliography I

[Kopec and Chou, 1994] Kopec, G. E. and Chou, P. A. (1994). Document image decoding using Markov source models. IEEE Trans. Pattern Anal. Mach. Intell., 16(6):602–617.

[Popat, 2009] Popat, A. C. (2009). A panlingual anomalous text detector. In Proceedings of the 9th ACM Symposium on Document Engineering, DocEng ’09, pages 201–204, New York, NY, USA. ACM.