SLIDE 1
Douglas J. Kennard and William A. Barrett
BYU Computer Science Department
Towards Searchable Indexes for Handwritten Documents
Family History Technology Workshop (2006)
SLIDE 2
Goal: Ability to “search” handwritten documents
Transcriptions are created manually:
SLIDE 3 “Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)
Difficulties in Automatic Handwriting Recognition
SLIDE 4 “Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)
Inconsistent spacing
Difficulties in Automatic Handwriting Recognition
SLIDE 5 “Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)
Ascenders/Descenders touching other lines of text
Difficulties in Automatic Handwriting Recognition
SLIDE 6 “Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)
No space between words, space within a single word
Difficulties in Automatic Handwriting Recognition
SLIDE 7 “Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)
Same letter shaped differently
Difficulties in Automatic Handwriting Recognition
SLIDE 8 “Trails of Hope: Overland Diaries and Letters, 1846-1869” (BYU Library online collection)
Different letters shaped similarly (n, m, r, ...)
Difficulties in Automatic Handwriting Recognition
SLIDE 9
Other Problems:
- Undulating / curved lines
- Poor penmanship
- Digitization artifacts / lens distortion
- Faded ink
- Smears, blobs, uneven background
- Deteriorated pages
- Bleed-through / shine-through
Conclusion: Handwriting Recognition is Hard!
Difficulties in Automatic Handwriting Recognition
SLIDE 10 A Small Sampling of HR Approaches:
Dynamic Programming
- Split words into segments
- Use DP to match letters to the segments
Hidden Markov Models
- Hidden states representing “letters of a possible interpretation”
- Probability of state transitions producing the observed features
Human Reading Models
- Top-down and Bottom-up combined
- We can't fully segment without some recognition, and we
can't fully recognize without segmentation.
Holistic (word-level) Features
(See references in syllabus)
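As a rough illustration of the Hidden Markov Model bullet, the sketch below decodes the most probable letter sequence with the standard Viterbi algorithm. Everything specific in it is an assumption made for illustration: the three letters, the "number of downstrokes" feature, and all probabilities are invented, and real systems use far richer features and trained models.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for the observations."""
    # column[s] = (log-probability of the best path ending in s, previous state)
    columns = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
        for s in states
    }]
    for obs in observations[1:]:
        prev_col = columns[-1]
        column = {}
        for s in states:
            # Best predecessor for state s at this step.
            score, prev = max(
                (prev_col[p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        columns.append(column)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Toy model (all numbers are made up): the observed feature is the number of
# downstrokes seen in a segment, and the hidden state is the letter written.
STATES = ["i", "n", "m"]
START = {s: 1 / 3 for s in STATES}
TRANS = {p: {s: 1 / 3 for s in STATES} for p in STATES}
EMIT = {
    "i": {1: 0.80, 2: 0.15, 3: 0.05},
    "n": {1: 0.10, 2: 0.70, 3: 0.20},
    "m": {1: 0.05, 2: 0.25, 3: 0.70},
}

decoded = viterbi([3, 1, 2], STATES, START, TRANS, EMIT)
print("".join(decoded))  # min
```

With uniform transitions the decoder simply follows the emission probabilities; a trained transition table (for example, letter bigram statistics) is what would let context override an ambiguous shape.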
SLIDE 11 Perfect Transcriptions Aren't Necessary
Work done by researchers in France:
- Automatic “annotation”
- Made Available Online
- Users correct errors as they find them
SLIDE 12
Handwriting Recognition is Still Hard!
_i_e: five, live, time, dime, jive, hive, ...
_on_: bone, gone, pony, ...
What are these words? (recognition / transcription)
SLIDE 13
Handwriting Recognition is Still Hard!
_i_e _on_
Find the word “lime”
(We don't need a transcription, just a “search” for probable matches.)
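The difference between transcribing and searching can be sketched with a toy lexicon. Here a partially read word is written with `_` for each unreadable letter; this notation, the helper names, and the small lexicon are assumptions for illustration, not part of any system described in the slides. Transcription must pick one word out of many candidates, while search only has to ask whether the query is compatible with what was read.

```python
import re

def pattern_to_regex(pattern, unknown="_"):
    """Turn a partial reading such as '_i_e' into a regex; each unknown is one letter."""
    parts = ["[a-z]" if ch == unknown else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts))

def candidates(pattern, lexicon):
    """Transcription view: every lexicon word compatible with the partial reading."""
    rx = pattern_to_regex(pattern)
    return [word for word in lexicon if rx.fullmatch(word)]

def is_probable_match(query, pattern):
    """Search view: is the query word compatible with the partial reading?"""
    return pattern_to_regex(pattern).fullmatch(query) is not None

LEXICON = ["five", "live", "time", "dime", "jive", "hive", "lime",
           "bone", "gone", "pony"]

print(candidates("_i_e", LEXICON))        # seven candidates: ambiguous to transcribe
print(is_probable_match("lime", "_i_e"))  # True: worth showing to the searcher
print(is_probable_match("lime", "_on_"))  # False
```

A real search system would rank word images by a similarity score rather than return a yes/no answer, but the asymmetry is the same.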
SLIDE 14
SLIDE 15
- Excellent Penmanship
- Relatively “Clean” Images
- 100 Pages of Training
SLIDE 16 Our Recent Work
Improve Input to HR or Search Systems:
- Improve Text Line Segmentation
- Mark Ambiguities
SLIDE 17
Line Segmentation – Simple Profile Method
SLIDE 18
Line Segmentation – Simple Profile Method
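A minimal sketch of a simple profile method, assuming the page is already bitonal (1 = ink, 0 = background) and stored as a list of rows. The function names and the `min_ink` parameter are illustrative choices. Rows whose ink count reaches `min_ink` are grouped into consecutive runs, and each run is reported as one text line.

```python
def row_profile(image):
    """Count ink pixels (1s) in each row of a bitonal image."""
    return [sum(row) for row in image]

def find_text_lines(image, min_ink=1):
    """Return inclusive (top, bottom) row ranges whose ink count is >= min_ink."""
    lines, start = [], None
    for y, count in enumerate(row_profile(image)):
        if count >= min_ink and start is None:
            start = y                      # a text line begins
        elif count < min_ink and start is not None:
            lines.append((start, y - 1))   # the text line ended on the previous row
            start = None
    if start is not None:                  # a line that runs to the bottom edge
        lines.append((start, len(image) - 1))
    return lines

PAGE = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
]
print(find_text_lines(PAGE))  # [(1, 2), (4, 4)]
```

This only works when blank rows separate the lines. When ascenders or descenders touch the neighboring line, or the lines undulate, the profile has no clean gaps and adjacent lines merge into one range.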
SLIDE 19
- Preprocess
- Find Locations of Text Lines
- Split / Merge Text Lines
- Output Text Line Images
Our Text Line Separation Method
SLIDE 20
Preprocessing: Background Removal
SLIDE 21
Preprocessing: Deskew Page
SLIDE 22 Preprocessing: Choose Threshold
Otsu's Method: Threshold too low
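For reference, Otsu's method picks the gray level that maximizes the between-class variance of the two resulting pixel groups. The sketch below is a plain-Python version over a flat list of 8-bit gray values, with pixels at or below the threshold treated as one class (ink, assuming dark ink on a light page). The sample pixel values are invented.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t maximizing between-class variance.

    Pixels with value <= t form class 0; the rest form class 1.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count in class 0
    sum0 = 0  # sum of gray values in class 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # class 0 still empty
        w1 = total - w0
        if w1 == 0:
            break             # class 1 empty from here on
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant factor dropped).
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Two clusters of gray values: dark ink near 10 and light paper near 205.
SAMPLE = [10] * 5 + [12] * 5 + [200] * 5 + [210] * 5
t = otsu_threshold(SAMPLE)
print(t, [p <= t for p in (12, 200)])
```

On faded or unevenly lit pages the histogram is not cleanly bimodal, which is one way a globally chosen threshold can end up too low or too high.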
SLIDE 23 Good Threshold
Preprocessing: Choose Threshold
SLIDE 24 Threshold too high
Preprocessing: Choose Threshold
SLIDE 25
[Plot: # Connected Components vs. Threshold Value]
Preprocessing: Choose Threshold
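A curve of connected-component count against threshold value can be computed as sketched here. The slides do not say how a threshold is selected from that curve, so this code only produces the curve. The 8-connectivity, the "ink is at or below the threshold" convention, and the tiny sample image are assumptions.

```python
from collections import deque

def count_components(gray, threshold):
    """Count 8-connected groups of ink pixels (gray value <= threshold)."""
    h, w = len(gray), len(gray[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if gray[y][x] > threshold or seen[y][x]:
                continue
            count += 1                      # found a new component; flood it
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny in range(max(cy - 1, 0), min(cy + 2, h)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, w)):
                        if not seen[ny][nx] and gray[ny][nx] <= threshold:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

def component_curve(gray, thresholds):
    """Return (threshold, component count) pairs for plotting."""
    return [(t, count_components(gray, t)) for t in thresholds]

GRAY = [
    [ 20, 250, 250, 250,  30],
    [250, 250, 180, 250, 250],
    [ 40, 250, 250, 250,  25],
]
print(component_curve(GRAY, [10, 50, 200, 255]))
# [(10, 0), (50, 4), (200, 5), (255, 1)]
```

In this toy image a very low threshold finds no ink at all, and a very high one merges everything into a single blob.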
SLIDE 26
Preprocessing: Remove Rule Lines
SLIDE 27
Find Lines of Text
Transition Count Map
Bitonal (Black / White)
SLIDE 28
Find Lines of Text
SLIDE 29
Find Lines of Text
SLIDE 30
Find Lines of Text
SLIDE 31
Find Lines of Text
Transition Count Map
Bitonal (Black / White)
SLIDE 32
Find Lines of Text
Thresholded Transition Count Map
Bitonal (Black / White)
SLIDE 33
Find Lines of Text
“Cleaned-Up” Transition Count Map
(small components removed)
Bitonal (Black / White)
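The slides do not define the transition count map precisely. The sketch below assumes one plausible reading: each cell holds the number of black/white changes between horizontally adjacent pixels inside a window centered on that pixel. The `half_window` and `min_transitions` parameters are hypothetical. Under this reading, handwriting (many stroke crossings) scores high, while solid blobs and blank paper score zero.

```python
def transition_count_map(bitonal, half_window=2):
    """Count black/white changes within [x - half_window, x + half_window] on each row."""
    h, w = len(bitonal), len(bitonal[0])
    tcm = [[0] * w for _ in range(h)]
    for y in range(h):
        row = bitonal[y]
        # changes[x] is 1 when pixel x differs from pixel x + 1.
        changes = [int(row[x] != row[x + 1]) for x in range(w - 1)]
        for x in range(w):
            lo = max(x - half_window, 0)
            hi = min(x + half_window, w - 1)
            tcm[y][x] = sum(changes[lo:hi])  # pixel pairs (lo, lo+1) .. (hi-1, hi)
    return tcm

def threshold_map(tcm, min_transitions):
    """Keep only cells with at least min_transitions transitions."""
    return [[int(v >= min_transitions) for v in row] for row in tcm]

BITONAL = [
    [0, 0, 0, 0, 0, 0],  # blank paper
    [1, 0, 1, 0, 1, 0],  # stroke-like pattern
    [1, 1, 1, 1, 1, 1],  # solid blob or rule line
]
tcm = transition_count_map(BITONAL)
print(tcm[1])                    # [2, 3, 4, 4, 3, 2]
print(threshold_map(tcm, 3)[1])  # [0, 1, 1, 1, 1, 0]
```

The "cleaned-up" step (removing small components) would follow the thresholding; it is not shown here.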
SLIDE 34
Split Lines of Text
SLIDE 35
Split Lines of Text
SLIDE 36
“Min-Cut / Max-Flow” Graph Cut used iteratively to split lines
Split Lines of Text
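To make the min-cut / max-flow idea concrete, here is a small Edmonds-Karp sketch that returns the flow value and the set of nodes left on the source side of a minimum cut. How the graph is built from the page image and how the cut is applied iteratively are not given in the slides. The toy graph below, with an "upper" line, a "lower" line, and weak links standing in for thin touching strokes, is purely hypothetical.

```python
from collections import deque

def min_cut(capacity, source, sink):
    """Edmonds-Karp max-flow; returns (flow value, nodes on the source side of a min cut).

    capacity maps node -> {neighbor: capacity} for directed edges.
    """
    # Residual graph, including zero-capacity reverse edges.
    residual = {}
    for u, edges in capacity.items():
        residual.setdefault(u, {})
        for v, c in edges.items():
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)

    def reachable():
        """Breadth-first search over edges with remaining capacity."""
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    flow = 0
    while True:
        parents = reachable()
        if sink not in parents:
            # No augmenting path left: reachable nodes form the source side.
            return flow, set(parents)
        # Find the bottleneck along the path, then push that much flow.
        bottleneck, v = float("inf"), sink
        while parents[v] is not None:
            bottleneck = min(bottleneck, residual[parents[v]][v])
            v = parents[v]
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

# Strong links inside each line, weak links (capacity 1) where strokes touch.
GRAPH = {
    "upper": {"a": 5, "b": 4},
    "a": {"c": 1},
    "b": {"c": 1, "d": 1},
    "c": {"lower": 5},
    "d": {"lower": 5},
}
value, upper_side = min_cut(GRAPH, "upper", "lower")
print(value, sorted(upper_side))  # 3 ['a', 'b', 'upper']
```

The cut severs exactly the three weak links, which is the behavior one wants when separating two text lines joined by thin strokes.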
SLIDE 37
Merge Spurious Lines of Text
SLIDE 38
- Expand component region
- Ignore outside of expanded region
- Anything touching another line component is
considered ambiguous (within angle constraint)
Output Line Images
SLIDE 39
Output Line Images
Grayscale Output Image
Output Mask Image
SLIDE 40
Motivation for Ambiguous component information
(figure annotation: “? crossing”)
SLIDE 41 Planned Future Work
Reduce amount of manual training:
- Train interactively instead of transcribing
(many words get used over and over)
SLIDE 42 Reduce amount of manual training:
- Train interactively instead of transcribing
(many words get used over and over)
Example: (from 36 pages of an Overland Trails diary)
“and” = 311 times
“the” = 286 times
6,212 words total
860 distinct words
86% of the total words are redundant!
Planned Future Work
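The 86% figure follows directly from the two counts: 1 - 860 / 6,212 ≈ 0.86. The sketch below reproduces that arithmetic and shows how the same statistic could be computed from any tokenized transcription. The `redundancy` helper and the lowercase normalization are illustrative assumptions.

```python
from collections import Counter

def redundancy(words):
    """Return (total words, distinct words, fraction that repeat an earlier word)."""
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    distinct = len(counts)
    return total, distinct, 1 - distinct / total

# Counts reported for the 36 diary pages.
total, distinct = 6212, 860
print(round(100 * (1 - distinct / total)))  # 86

# The same statistic on a toy sentence.
print(redundancy("the cat and the dog and the bird".split()))  # (8, 5, 0.375)
```

Each distinct word only needs to be labeled once, which is why interactive training could require far less effort than transcribing every word.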
SLIDE 43 Planned Future Work
Reduce amount of manual training:
- Train interactively instead of transcribing
(many words get used over and over)
- Sub-word matching (letters and
combinations of letters)
- Existing methods for generating artificial
training data
SLIDE 44
Conclusions
- Current technology permits searching handwritten documents (at least for good-quality, large collections).
- It won't work perfectly, but it is still very useful: much better than nothing at all!
- Current and future work will reduce the amount of training needed and improve accuracy by providing better input to the systems.
SLIDE 45
Questions