Towards Searchable Indexes for Handwritten Documents

SLIDE 1

Douglas J. Kennard and William A. Barrett BYU Computer Science Department

Towards Searchable Indexes for Handwritten Documents

Family History Technology Workshop (2006)

SLIDE 2

Goal: Ability to “search” handwritten documents

Transcriptions are created manually:

Time-consuming
Costly

SLIDE 3

“Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)

Difficulties in Automatic Handwriting Recognition

SLIDE 4

“Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)

inconsistent spacing

Difficulties in Automatic Handwriting Recognition

SLIDE 5

“Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)

Ascenders/descenders touching other lines of text

Difficulties in Automatic Handwriting Recognition

SLIDE 6

“Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)

No space between words, space within a single word

Difficulties in Automatic Handwriting Recognition

SLIDE 7

“Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)

Same letter shaped differently

Difficulties in Automatic Handwriting Recognition

SLIDE 8

“Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)

Different letters shaped similarly (n, m, r, ...)

Difficulties in Automatic Handwriting Recognition

SLIDE 9

Conclusion: Handwriting Recognition is Hard!

Difficulties in Automatic Handwriting Recognition

SLIDE 10

A Small Sampling of HR Approaches:

Dynamic Programming

Split words into segments
Use DP to match letters to the segments

Hidden Markov Models

Hidden states representing “letters of a possible interpretation”
Probability of state transitions producing the observed features

Human Reading Models

Top-down and Bottom-up combined
We can't fully segment without some recognition, can't fully recognize without segmentation.

Holistic (word-level) Features

Avoid segmenting words

(See references in syllabus)
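To make the dynamic-programming idea above concrete, here is a minimal sketch that assigns each letter of a candidate word to a run of consecutive segments and returns the cheapest assignment. Everything in it is an assumption for illustration: the names `match_word`, `stroke_cost`, and `EXPECTED_STROKES`, the toy feature (number of vertical strokes per letter), and the limit of three segments per letter. It is not the method of any system cited in the talk.

```python
# Toy assumption: each segment is one vertical stroke, and each letter
# is expected to consist of a fixed number of strokes.
EXPECTED_STROKES = {"i": 1, "n": 2, "u": 2, "m": 3}

def stroke_cost(letter, span):
    """Hypothetical cost: mismatch between expected and observed stroke count."""
    return abs(EXPECTED_STROKES[letter] - len(span))

def match_word(letters, segments, cost, max_span=3):
    """Minimum cost of assigning every letter to 1..max_span consecutive segments."""
    n, m = len(letters), len(segments)
    inf = float("inf")
    # best[i][j]: cheapest way to explain the first j segments with the first i letters
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(max(0, j - max_span), j):
                if best[i - 1][k] < inf:
                    c = best[i - 1][k] + cost(letters[i - 1], segments[k:j])
                    if c < best[i][j]:
                        best[i][j] = c
    return best[n][m]

strokes = ["|"] * 6  # six vertical strokes, e.g. a run of minims
for word in ("min", "in", "mum"):
    print(word, match_word(word, strokes, stroke_cost))
```

With this toy cost, "min" explains the six strokes exactly, while "in" and "mum" do not, which mirrors the slide's point that letters such as n and m are easy to confuse.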

SLIDE 11

Perfect Transcriptions Aren't Necessary

Work done by researchers in France:

Automatic “annotation”
Made Available Online
Users correct errors as they find them

SLIDE 12

Handwriting Recognition is Still Hard!

_i_e: five, live, time, dime, jive, hive, ...

_on_: bone, gone, pony, ...

What are these words? (recognition / transcription)

SLIDE 13

Handwriting Recognition is Still Hard!

_i_e _on_

Find the word “lime”

(We don't need a transcription, just a “search” for probable matches.)
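The difference between transcribing and searching can be sketched in a few lines. Assuming a partially read word is written as a pattern in which `_` marks an unreadable letter, the hypothetical helper `probable_matches` returns every lexicon word consistent with the pattern. The lexicon is the word list from the previous slide plus the query word "lime", which is an addition made here for the example.

```python
import re

def probable_matches(pattern, lexicon):
    """Return lexicon words consistent with a pattern where '_' is an unreadable letter."""
    regex = re.compile(
        "^" + "".join("." if ch == "_" else re.escape(ch) for ch in pattern) + "$"
    )
    return [word for word in lexicon if regex.match(word)]

lexicon = ["five", "live", "time", "dime", "jive", "hive", "lime",
           "bone", "gone", "pony"]

# Transcription would have to pick one of these; search only needs the set.
print(probable_matches("_i_e", lexicon))
# A query for "lime" succeeds if the word is among the probable matches.
print("lime" in probable_matches("_i_e", lexicon))
```

A real system would rank candidates by image similarity rather than use exact wildcards; the sketch only shows why a set of probable matches is enough for search.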

SLIDE 14

SLIDE 15

Excellent Penmanship
Relatively “Clean” Images
100 Pages of Training

SLIDE 16

Our Recent Work

Improve Input to HR or Search Systems:

Improve Text Line Segmentation
Mark Ambiguities

SLIDE 17

Line Segmentation – Simple Profile Method

SLIDE 18

Line Segmentation – Simple Profile Method
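The slides give only the name of the simple profile method. Assuming it refers to the usual horizontal projection profile (count the ink pixels in each row and split at empty rows), a minimal sketch looks like this. The function name `find_text_lines` and the `min_ink` parameter are hypothetical.

```python
def find_text_lines(image, min_ink=1):
    """image: rows of 0 (background) / 1 (ink). Returns inclusive (top, bottom) row ranges."""
    profile = [sum(row) for row in image]  # ink pixels per row
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                       # entering a text line
        elif ink < min_ink and start is not None:
            lines.append((start, y - 1))    # leaving a text line
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines

page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(find_text_lines(page))  # [(1, 2), (4, 4)]
```

This approach depends on empty rows between lines, so it breaks down when ascenders and descenders touch neighboring lines or when lines are skewed or curved.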

SLIDE 19

Preprocess
Find Locations of Text Lines
Split / Merge Text Lines
Output Text Line Images

Our Text Line Separation Method

SLIDE 20

Preprocessing: Background Removal

SLIDE 21

Preprocessing: Deskew Page

SLIDE 22

Preprocessing: Choose Threshold

Otsu's Method: Threshold too low
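For reference, Otsu's method picks the gray level that maximizes the between-class variance of the two resulting pixel classes. The sketch below is a plain-Python version of that standard formulation, not the authors' code; the function name and the convention that pixels at or below the threshold form the dark class are assumptions.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t maximizing between-class variance (class 0: value <= t)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    w0 = 0      # number of pixels in class 0 so far
    sum0 = 0    # sum of gray values in class 0 so far
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Toy image: dark ink around 10, light paper around 200.
t = otsu_threshold([10] * 50 + [200] * 50)
print(10 <= t < 200)
```

Because the criterion depends only on the histogram, faded ink or an uneven background can push the result away from the value a person would choose.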

SLIDE 23

Good Threshold

Preprocessing: Choose Threshold

SLIDE 24

Threshold too high

Preprocessing: Choose Threshold

SLIDE 25

[Plot: number of connected components vs. threshold value]

Preprocessing: Choose Threshold
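A curve of connected-component counts against threshold values can be computed as sketched below. The slides do not say how the threshold is chosen from the curve, so the sketch stops at producing the curve. The function names, the use of 8-connectivity, and the convention that ink is darker than the threshold are assumptions.

```python
def count_components(gray, threshold):
    """Count 8-connected components of pixels darker than threshold."""
    h, w = len(gray), len(gray[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if gray[y][x] < threshold and not seen[y][x]:
                count += 1
                seen[y][x] = True
                stack = [(y, x)]
                while stack:  # flood fill one component
                    cy, cx = stack.pop()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and not seen[ny][nx]
                                    and gray[ny][nx] < threshold):
                                seen[ny][nx] = True
                                stack.append((ny, nx))
    return count

def component_curve(gray, thresholds):
    """Pairs of (threshold, number of components) for plotting."""
    return [(t, count_components(gray, t)) for t in thresholds]

gray = [
    [20, 250, 250, 30],
    [250, 250, 250, 250],
    [40, 250, 120, 250],
]
print(component_curve(gray, [10, 50, 130, 255]))
# [(10, 0), (50, 3), (130, 4), (255, 1)]
```

In the toy image the count rises as fainter ink appears and then collapses once the background is swallowed into a single blob, which is the kind of behavior such a plot makes visible.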

SLIDE 26

Preprocessing: Remove Rule Lines

SLIDE 27

Find Lines of Text

Transition Count Map
Bitonal (Black / White)

SLIDE 28

Find Lines of Text

SLIDE 29

Find Lines of Text

SLIDE 30

Find Lines of Text

SLIDE 31

Find Lines of Text

Transition Count Map
Bitonal (Black / White)

SLIDE 32

Find Lines of Text

Thresholded Transition Count Map
Bitonal (Black / White)
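The slides name a transition count map without defining it. One plausible reading, assumed here, is that each pixel receives the number of black/white transitions found along its row inside a small horizontal window, so that regions containing handwriting score high and blank paper or solid blobs score low. The function names, the window shape, and the `half_width` and `min_count` parameters are all hypothetical.

```python
def transition_count_map(bitonal, half_width=2):
    """For each pixel, count row-wise 0/1 transitions within +/- half_width pixels."""
    h, w = len(bitonal), len(bitonal[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        row = bitonal[y]
        # trans[x] is 1 when pixel x differs from pixel x + 1
        trans = [1 if row[x] != row[x + 1] else 0 for x in range(w - 1)]
        for x in range(w):
            lo = max(0, x - half_width)
            hi = min(w - 1, x + half_width)
            out[y][x] = sum(trans[lo:hi])  # transitions between pixels lo..hi
    return out

def threshold_map(count_map, min_count):
    """Keep only pixels whose transition count reaches min_count."""
    return [[1 if v >= min_count else 0 for v in row] for row in count_map]

bitonal = [
    [0, 1, 0, 1, 0, 1],  # busy row, like a line of handwriting
    [0, 0, 0, 0, 0, 0],  # blank row
]
tcm = transition_count_map(bitonal)
print(tcm)                    # [[2, 3, 4, 4, 3, 2], [0, 0, 0, 0, 0, 0]]
print(threshold_map(tcm, 3))  # [[0, 1, 1, 1, 1, 0], [0, 0, 0, 0, 0, 0]]
```

Removing small connected components from the thresholded map would then give a cleaned-up map in which the remaining blobs mark candidate text lines.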

SLIDE 33

Find Lines of Text

“Cleaned-Up” Transition Count Map

(small components removed)

Bitonal (Black / White)

SLIDE 34

Split Lines of Text

SLIDE 35

Split Lines of Text

SLIDE 36

“Min-Cut / Max-Flow” Graph Cut used iteratively to split lines

Split Lines of Text
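The talk says only that a min-cut / max-flow graph cut is applied iteratively; it does not describe how the graph is built. The sketch below shows the core operation on a toy graph: a chain of ink pixels joining an upper and a lower text line, with capacities standing in for connection strength (for example stroke thickness). The Edmonds-Karp routine is a standard algorithm; the graph, the capacities, and the names are assumptions.

```python
from collections import deque

def min_cut(capacity, source, sink):
    """Edmonds-Karp max flow. capacity: {u: {v: cap}}. Returns (flow, source_side)."""
    res = {}  # residual capacities, including reverse edges
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            res.setdefault(u, {})[v] = res.get(u, {}).get(v, 0) + c
            res.setdefault(v, {}).setdefault(u, 0)
    res.setdefault(source, {})
    res.setdefault(sink, {})
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:  # BFS for a shortest augmenting path
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            # Nodes still reachable from the source form its side of the cut.
            return flow, set(parent)
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        flow += bottleneck

# A descender from the upper line touches the lower line through a thin link b-c.
edges = [("top", "a", 5), ("a", "b", 4), ("b", "c", 1), ("c", "bottom", 5)]
capacity = {}
for u, v, c in edges:
    capacity.setdefault(u, {})[v] = c
    capacity.setdefault(v, {})[u] = c

flow, upper_side = min_cut(capacity, "top", "bottom")
print(flow, sorted(upper_side))  # 1 ['a', 'b', 'top']
```

The cut falls on the weakest link, so pixels a and b stay with the upper line and c goes to the lower one. Repeating such a cut on each merged region is one way the iterative splitting could work.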

SLIDE 37

Merge Spurious Lines of Text

SLIDE 38

Expand component region
Ignore outside of expanded region
Anything touching another line component considered ambiguous (within angle constraint)

Output Line Images

SLIDE 39

Output Line Images

Grayscale Output Image
Output Mask Image

SLIDE 40

[Figure label: “? crossing”]

Motivation for Ambiguous Component Information

SLIDE 41

Planned Future Work

Reduce amount of manual training:

Train interactively instead of transcribing

(many words get used over and over)

SLIDE 42

Reduce amount of manual training:

Train interactively instead of transcribing

(many words get used over and over)

Example: (from 36 pages of an Overland Trails diary)
“and” = 311 times
“the” = 286 times
6,212 words total
860 distinct words
86% of the total words are redundant!

Planned Future Work
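The 86% figure follows directly from the counts on the slide: 860 distinct words out of 6,212 means only about 14% of word occurrences are new. The short sketch below reproduces the arithmetic and shows how the same ratio could be computed from any transcript; the helper `redundancy` and the sample sentence are illustrative assumptions.

```python
from collections import Counter

def redundancy(words):
    """Fraction of word occurrences that repeat an already-seen word."""
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    return 1 - len(counts) / total

# Figures from the slide (36 pages of an Overland Trails diary).
total_words = 6212
distinct_words = 860
print(f"{1 - distinct_words / total_words:.1%}")  # 86.2%

# Same computation on a toy transcript: 8 words, 5 distinct.
print(redundancy("the cat and the dog and the bird".split()))  # 0.375
```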

SLIDE 43

Planned Future Work

Reduce amount of manual training:

Train interactively instead of transcribing

(many words get used over and over)

Sub-word matching (letters and combinations of letters)

Existing methods for generating artificial training data

SLIDE 44

Conclusions

Current technology permits searching handwritten documents (at least for good-quality, large collections).
It won't work perfectly.
Still very useful: much better than nothing at all!
Current and future work will reduce the amount of training needed and improve accuracy by providing better input to the systems.

SLIDE 45

Other Problems (difficulties in automatic handwriting recognition):

Undulating / curved lines
Poor penmanship
Digitization artifacts / lens distortion
Faded ink
Smears, blobs, uneven background
Deteriorated pages
Bleed-through / shine-through

Questions