Towards Searchable Indexes for Handwritten Documents


slide-1
SLIDE 1

Douglas J. Kennard and William A. Barrett BYU Computer Science Department

Towards Searchable Indexes for Handwritten Documents

Family History Technology Workshop (2006)

slide-2
SLIDE 2

Goal: Ability to “search” handwritten documents

Transcriptions are created manually:

  • Time-consuming
  • Costly
slide-3
SLIDE 3

“Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)

Difficulties in Automatic Handwriting Recognition

slide-4
SLIDE 4

“Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)

inconsistent spacing

Difficulties in Automatic Handwriting Recognition

slide-5
SLIDE 5

“Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)

Ascenders/descenders touching other lines of text

Difficulties in Automatic Handwriting Recognition

slide-6
SLIDE 6

“Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)

No space between words, space within a single word

Difficulties in Automatic Handwriting Recognition

slide-7
SLIDE 7

“Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)

Same letter shaped differently

Difficulties in Automatic Handwriting Recognition

slide-8
SLIDE 8

“Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)

Different letters shaped similarly (n, m, r, ...)

Difficulties in Automatic Handwriting Recognition

slide-9
SLIDE 9

Other Problems:

  • Undulating / curved lines
  • Poor penmanship
  • Digitization artifacts / lens distortion
  • Faded ink
  • Smears, blobs, uneven background
  • Deteriorated pages
  • Bleed-through / shine-through

Conclusion: Handwriting Recognition is Hard!

Difficulties in Automatic Handwriting Recognition

slide-10
SLIDE 10

A Small Sampling of HR Approaches:

Dynamic Programming

  • Split words into segments
  • Use DP to match letters to the segments

Hidden Markov Models

  • Hidden states representing “letters of a possible interpretation”
  • Probability of state transitions producing the observed features

Human Reading Models

  • Top-down and Bottom-up combined
  • We can't fully segment without some recognition, and can't fully recognize without segmentation.

Holistic (word-level) Features

  • Avoid segmenting words

(See references in syllabus)
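To make the dynamic-programming bullet above concrete, here is a minimal sketch, not any particular published system. It assumes a word image has already been over-segmented into strokes with known widths, and it scores a candidate word by how well each letter's expected width matches the segments assigned to it. The names `match_letters_to_segments`, `seg_widths`, `letter_width` and `max_span` are hypothetical; a real recognizer would use a learned shape cost rather than widths.

```python
def match_letters_to_segments(word, seg_widths, letter_width, max_span=3):
    """Cheapest assignment of consecutive segments to each letter of `word`.

    Returns (cost, spans) where spans[i] = (start, end) is the half-open
    range of segment indices given to letter i. Toy cost: |width mismatch|.
    """
    n, m = len(word), len(seg_widths)
    INF = float("inf")
    # best[i][j]: min cost of explaining the first j segments with the first i letters
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # letter i may consume the last k segments, k = 1..max_span
            for k in range(1, min(max_span, j) + 1):
                prev = best[i - 1][j - k]
                if prev == INF:
                    continue
                width = sum(seg_widths[j - k:j])
                cost = prev + abs(width - letter_width[word[i - 1]])
                if cost < best[i][j]:
                    best[i][j] = cost
                    back[i][j] = k
    if best[n][m] == INF:
        return INF, []
    # walk the back-pointers to recover which segments each letter got
    spans, j = [], m
    for i in range(n, 0, -1):
        k = back[i][j]
        spans.append((j - k, j))
        j -= k
    return best[n][m], spans[::-1]


# Hypothetical expected widths, in stroke units
WIDTHS = {"i": 1, "n": 2, "m": 3}
cost_im, spans_im = match_letters_to_segments("im", [1, 1, 1, 1], WIDTHS)
cost_in, spans_in = match_letters_to_segments("in", [1, 1, 1, 1], WIDTHS)
```

With four unit-width strokes, "im" fits with zero cost while "in" does not, which is the kind of ranking a DP matcher produces across a lexicon.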

slide-11
SLIDE 11

Perfect Transcriptions Aren't Necessary

Work done by researchers in France:

  • Automatic “annotation”
  • Made Available Online
  • Users correct errors as they find them
slide-12
SLIDE 12

Handwriting Recognition is Still Hard!

  • _i_e: five, live, time, dime, jive, hive, ...
  • _on_: bone, gone, pony, ...

What are these words? (recognition / transcription)

slide-13
SLIDE 13

Handwriting Recognition is Still Hard!

_i_e _on_

Find the word “lime”

(We don't need a transcription, just a “search” for probable matches.)
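The difference between transcribing and searching can be illustrated with a small sketch. It assumes a partial reading is written with `_` for each unrecognized letter, as in `_i_e`, and that a word list is available. The function names and the tiny `LEXICON` are hypothetical, chosen only for this example.

```python
import re


def pattern_to_regex(pattern):
    """Turn a partial reading such as '_i_e' into a regex; '_' = any one letter."""
    body = "".join("[a-z]" if c == "_" else re.escape(c) for c in pattern)
    return re.compile(body)


def candidates(pattern, lexicon):
    """Transcription view: every lexicon word consistent with the partial reading."""
    regex = pattern_to_regex(pattern)
    return [w for w in lexicon if regex.fullmatch(w)]


def is_probable_match(query, pattern):
    """Search view: could this word image be the query word?"""
    return pattern_to_regex(pattern).fullmatch(query) is not None


LEXICON = ["five", "live", "time", "dime", "jive", "hive", "lime",
           "bone", "gone", "pony"]

ambiguous = candidates("_i_e", LEXICON)    # many possible transcriptions
hit = is_probable_match("lime", "_i_e")    # search only asks a yes/no question
miss = is_probable_match("lime", "_on_")
```

Transcription has to choose one word among many candidates, whereas a search for “lime” only needs to flag the word image as a probable match.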

slide-14
SLIDE 14
slide-15
SLIDE 15

  • Excellent Penmanship
  • Relatively “Clean” Images
  • 100 Pages of Training

slide-16
SLIDE 16

Our Recent Work

Improve Input to HR or Search Systems:

  • Improve Text Line Segmentation
  • Mark Ambiguities
slide-17
SLIDE 17

Line Segmentation – Simple Profile Method

slide-18
SLIDE 18

Line Segmentation – Simple Profile Method
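For readers unfamiliar with profile-based line segmentation, a minimal sketch follows. It assumes the usual formulation: sum the ink in each pixel row of a bitonal image and cut wherever the sum drops below a small value. The function name and the `min_ink` parameter are hypothetical, and this is the generic textbook approach rather than the exact procedure pictured on the slides.

```python
def split_lines_by_profile(image, min_ink=1):
    """Split a bitonal page into text lines using a horizontal projection profile.

    image: list of rows, each a list of 0 (background) / 1 (ink).
    Returns a list of (top, bottom) row ranges, bottom exclusive.
    """
    profile = [sum(row) for row in image]  # ink pixels per row
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # entering a text line
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # leaving a text line
            start = None
    if start is not None:                  # line runs to the bottom edge
        lines.append((start, len(profile)))
    return lines


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
line_ranges = split_lines_by_profile(page)
```

A profile like this only finds a gap when some pixel row between lines is empty, so it breaks down when ascenders and descenders touch neighbouring lines or when lines undulate.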

slide-19
SLIDE 19
  • Preprocess
  • Find Locations of Text Lines
  • Split / Merge Text Lines
  • Output Text Line Images

Our Text Line Separation Method

slide-20
SLIDE 20

Preprocessing: Background Removal

slide-21
SLIDE 21

Preprocessing: Deskew Page

slide-22
SLIDE 22

Preprocessing: Choose Threshold

Otsu's Method: Threshold too low
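For reference, Otsu's method picks the gray level that maximizes the between-class variance of the two resulting pixel groups. The sketch below is a plain-Python version of that standard definition, assuming 8-bit gray values in a flat list; `otsu_threshold` is a hypothetical name, not the authors' code.

```python
def otsu_threshold(pixels, levels=256):
    """Return t such that splitting into {p <= t} and {p > t} maximizes
    the between-class variance (Otsu's criterion)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0    # pixel count in the dark class so far
    sum0 = 0  # sum of gray values in the dark class so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # bright class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        score = w0 * w1 * (mean0 - mean1) ** 2  # proportional to between-class variance
        if score > best_score:
            best_score, best_t = score, t
    return best_t


# Toy input: dark ink around 10, bright paper around 200
toy_pixels = [10] * 50 + [200] * 50
t = otsu_threshold(toy_pixels)
```

On clean two-peak histograms this works well. On faded or uneven documents the histogram is rarely that clean, which is consistent with the slide's note that Otsu's threshold came out too low here.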

slide-23
SLIDE 23

Good Threshold

Preprocessing: Choose Threshold

slide-24
SLIDE 24

Threshold too high

Preprocessing: Choose Threshold

slide-25
SLIDE 25

Preprocessing: Choose Threshold

(Plot: number of connected components vs. threshold value)
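The plot relates the number of connected components to the threshold value. How the threshold is picked from that curve is not stated on the slide, so the sketch below only shows how such a curve could be computed. It assumes a grayscale image as a list of rows, treats pixels darker than the threshold as ink, and uses 4-connectivity; all names are hypothetical.

```python
def count_components(gray, threshold):
    """Count 4-connected groups of ink pixels (gray value < threshold)."""
    h, w = len(gray), len(gray[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if gray[y][x] >= threshold or seen[y][x]:
                continue
            count += 1                      # found a new component; flood-fill it
            seen[y][x] = True
            stack = [(y, x)]
            while stack:
                cy, cx = stack.pop()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and gray[ny][nx] < threshold):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return count


def components_per_threshold(gray, thresholds):
    """Data for a '# connected components vs. threshold' curve."""
    return [(t, count_components(gray, t)) for t in thresholds]


toy = [
    [20, 250, 30],
    [250, 120, 250],
    [40, 250, 50],
]
curve = components_per_threshold(toy, [10, 100, 200, 255])
```

In the toy image the count rises as faint strokes appear, then collapses to a single blob once the background itself falls below the threshold.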

slide-26
SLIDE 26

Preprocessing: Remove Rule Lines

slide-27
SLIDE 27

Find Lines of Text

Panels: Transition Count Map; Bitonal (Black / White)
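The slides do not define the transition count map, so the following is only one plausible reading, labeled as an assumption. For each pixel it counts the black/white transitions along its row inside a small horizontal window. Text regions produce many transitions and blank regions produce none. The `half_window` parameter and the function name are hypothetical.

```python
def transition_count_map(bitonal, half_window=2):
    """ASSUMED definition: out[y][x] = number of 0/1 changes between
    horizontally adjacent pixels within +/- half_window columns of x."""
    h, w = len(bitonal), len(bitonal[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        row = bitonal[y]
        # trans[x] is 1 when pixel x differs from pixel x + 1
        trans = [1 if row[x] != row[x + 1] else 0 for x in range(w - 1)]
        for x in range(w):
            lo = max(0, x - half_window)
            hi = min(w - 1, x + half_window)
            out[y][x] = sum(trans[lo:hi])  # pairs (lo,lo+1) .. (hi-1,hi)
    return out


toy_rows = [
    [0, 1, 0, 1, 0, 0, 0, 0],  # strokes on the left, blank on the right
    [0, 0, 0, 0, 0, 0, 0, 0],  # blank row
]
tc_map = transition_count_map(toy_rows)
```

Thresholding such a map and removing small components would then leave blobs that roughly follow the text lines.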

slide-28
SLIDE 28

Find Lines of Text

slide-29
SLIDE 29

Find Lines of Text

slide-30
SLIDE 30

Find Lines of Text

slide-31
SLIDE 31

Find Lines of Text

Panels: Transition Count Map; Bitonal (Black / White)

slide-32
SLIDE 32

Find Lines of Text

Panels: Thresholded Transition Count Map; Bitonal (Black / White)

slide-33
SLIDE 33

Find Lines of Text

Panels: “Cleaned-Up” Transition Count Map (small components removed); Bitonal (Black / White)

slide-34
SLIDE 34

Split Lines of Text

slide-35
SLIDE 35

Split Lines of Text

slide-36
SLIDE 36

“Min-Cut / Max-Flow” Graph Cut used iteratively to split lines

Split Lines of Text

slide-37
SLIDE 37

Merge Spurious Lines of Text

slide-38
SLIDE 38

Output Line Images

  • Expand component region
  • Ignore outside of expanded region
  • Anything touching another line component considered ambiguous (within angle constraint)

slide-39
SLIDE 39

Output Line Images

Panels: Grayscale Output Image; Output Mask Image

slide-40
SLIDE 40

Motivation for Ambiguous Component Information

(Figure label: “? crossing”)

slide-41
SLIDE 41

Planned Future Work

Reduce amount of manual training:

  • Train interactively instead of transcribing (many words get used over and over)

slide-42
SLIDE 42

Planned Future Work

Reduce amount of manual training:

  • Train interactively instead of transcribing (many words get used over and over)

Example (from 36 pages of an Overland Trails diary):

  • “and” = 311 times
  • “the” = 286 times
  • 6,212 words total
  • 860 distinct words
  • 86% of the total words are redundant!
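The 86% figure follows from the two counts on the slide: (6,212 − 860) / 6,212 ≈ 0.86. A small sketch of the same computation for any word list is below; the `redundancy` helper and the toy sentence are hypothetical.

```python
from collections import Counter


def redundancy(words):
    """Return (total, distinct, fraction of tokens that repeat an earlier word)."""
    counts = Counter(words)
    total = sum(counts.values())
    distinct = len(counts)
    return total, distinct, (total - distinct) / total


# Check against the numbers quoted on the slide
slide_total, slide_distinct = 6212, 860
slide_fraction = (slide_total - slide_distinct) / slide_total

# Toy example
toy_total, toy_distinct, toy_fraction = redundancy("and the and of the and".split())
```

Every repeated word is one a user would not have to label again if training were done interactively per distinct word rather than by full transcription.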

slide-43
SLIDE 43

Planned Future Work

Reduce amount of manual training:

  • Train interactively instead of transcribing (many words get used over and over)
  • Sub-word matching (letters and combinations of letters)
  • Existing methods for generating artificial training data

slide-44
SLIDE 44

Conclusions

  • Current technology permits searching handwritten documents (at least for good-quality, large collections).
  • It won't work perfectly, but it is still very useful: much better than nothing at all!
  • Current and future work will reduce the amount of training needed and improve accuracy by providing better input to the systems.

slide-45
SLIDE 45

Questions