Flexible Computer Assisted Transcription of Historical Documents - PowerPoint PPT Presentation


SLIDE 1

Flexible Computer Assisted Transcription of Historical Documents Through Subword Spotting

Brian Davis, Robert Clawson and William Barrett

SLIDE 2

What if…?

Effective crowdsourced transcription of documents via

  • Smartphone users
  • Only a few minutes at a time

http://telecoms.com/463552/global-smartphone-market-q4-2015-peak-smartphone-approaches/

SLIDE 6

Computer Assisted Transcription (CAT)

Why not do it all manually? Why not do it automatically?

SLIDE 7

Prefix Based CAT

The user makes a correction to the automatic transcription, approving all previous content. The recognition algorithm then makes a new prediction for the remaining text.

Image of Toselli et al.’s online demo.

Requires sequential text.

  • A. Toselli, V. Romero, M. Pastor, and E. Vidal, “Multimodal interactive transcription of text images,” Pattern Recognition, vol. 43, no. 5, pp. 1814–1825, 2010.
  • N. Serrano, A. Gimenez, J. Civera, A. Sanchis, and A. Juan, “Interactive handwriting recognition with limited user effort,” IJDAR, vol. 17, no. 1, pp. 47–59, 2014.
SLIDE 8

CAT Through Word Spotting

Find words that look the same and label them the same. Zagoris et al (2015) use a relevance feedback loop to learn from every correct match the user selects.

  • K. Zagoris, I. Pratikakis, and B. Gatos, “A framework for efficient transcription of historical documents using keyword spotting,” in Proc. HIP. ACM, 2015.
SLIDE 9

CAT Through Word Spotting

Find words that look the same and label them the same. Robert Clawson’s Intelligent Indexing (2014) relies on user filtering of matches.

  • R. Clawson, “Intelligent indexing: A semi-automated, trainable system for field labeling,” Master’s thesis, Brigham Young University, 2014. [Online]. Available: scholarsarchive.byu.edu/etd/5307/
SLIDE 10

CAT Through User Supervised OCR

Neudecker and Tzadok (2010) run OCR, then present low-scoring characters to the user for correction.

  • C. Neudecker and A. Tzadok, “User collaboration for improving access to historical texts,” Liber Quarterly, vol. 20, no. 1, pp. 119–128, 2010.
SLIDE 11

Strengths of Prior CAT Systems

OCR & word spotting:

  • As long as words/letters can be segmented, they will work with any document

OCR:

  • Simple user tasks (no typing, very fast)
  • Very parallelizable

Word spotting:

  • Potential high payoff for little user effort (few taps, many words transcribed)

SLIDE 12

Weaknesses of Prior CAT Systems

Prefix based:

  • Only works on sentence structured writing.
  • Limited lexicon size (e.g. hard time with names).

Word spotting:

  • Often words don’t repeat frequently or at all (e.g. names).

OCR:

  • Letter segmentation improbable for handwritten text.
SLIDE 13

A Solution

Spot character n-grams (bigrams and trigrams), then reconstruct words from them.

SLIDE 14

The “Sweet Spot”

Bigrams/trigrams occur with great frequency
+ Subword spotting is still reasonably accurate
= High pay-off for spotting effort

Additionally, able to use larger lexicon, including more names.

http://machinedesign.com/archive/building-better-bat
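The frequency claim above can be sketched in a few lines. The word list here is a toy stand-in (an assumption; the real lexicon has on the order of 108,000 words), but the point carries over: a handful of character bigrams account for a large share of occurrences.

```python
from collections import Counter

# Toy word list; an assumed stand-in for the real ~108,000-word lexicon.
words = ["the", "there", "then", "anthony", "michael", "washington"]

def bigram_counts(words):
    """Count character bigrams across a word list. The most frequent
    bigrams are the best spotting targets: each spotted occurrence
    constrains many different words."""
    counts = Counter()
    for w in words:
        for i in range(len(w) - 1):
            counts[w[i:i + 2]] += 1
    return counts

print(bigram_counts(words).most_common(3))  # 'th' and 'he' dominate even this tiny list
```

Running the same count over a full lexicon is what motivates spotting only the 100 most common bigrams.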

SLIDE 15

N-gram Spotting and Word Completion

_ _ c h a e l

SLIDE 16

N-gram Spotting and Word Completion

M i c h a e l

SLIDE 17

N-gram Spotting and Word Completion

_ _ _ h o _ _

SLIDE 18

N-gram Spotting and Word Completion

A n _ h o _ _

SLIDE 19

N-gram Spotting and Word Completion

A n t h o _ _

SLIDE 20

N-gram Spotting and Word Completion

A n t h o n y

Computers are much better at this than we are!

A n _ h o _ _ => [anchors, anchovy, anthony, anthoni]

SLIDE 21

N-gram Spotting and Word Completion

Regular expressions make this easy. The spotted n-grams are parsed into a regular expression, which is then used to look up candidate words in the lexicon.

SLIDE 22

Overview of Proposed CAT System

Complicated system, simple UI

SLIDE 36

Mock-up of User Tasks


SLIDE 38

Justification: Simulation of Proposed CAT System

Simulation setup:

  • George Washington corpus
  • 100 most common bigrams
  • Simulated 50% recall* for bigram spotting
  • Simulated an uncertain number of unspotted characters per word
  • A word counted as “transcribed” when 10 or fewer possible transcriptions remained
  • Lexicon of ~108,000 words and ~7,000 names

*Based on preliminary results in subword spotting.
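The core of such a simulation can be sketched as follows. The lexicon, bigram set, and recall value here are toy assumptions standing in for the setup above; the mechanics (probabilistic spotting, regex lookup, candidate counting) are the same.

```python
import random
import re

# Toy stand-ins (assumptions): the actual simulation used a ~108,000-word
# lexicon plus ~7,000 names and the 100 most common bigrams.
lexicon = ["anthony", "anchors", "anchovy", "michael", "washington"]
common_bigrams = {"th", "an", "on", "ho", "ch"}

def simulate_spotting(word, recall, rng):
    """Each occurrence of a common bigram in the word is 'spotted' with
    probability `recall`; characters never covered by a spotted bigram
    stay unknown ('.')."""
    known = [False] * len(word)
    for i in range(len(word) - 1):
        if word[i:i + 2] in common_bigrams and rng.random() < recall:
            known[i] = known[i + 1] = True
    return "".join(c if k else "." for c, k in zip(word, known))

def candidate_count(pattern, lexicon):
    """A word counts as 'transcribed' once few enough lexicon entries
    (10 in the simulation above) match the partial pattern."""
    return sum(1 for w in lexicon if re.fullmatch(pattern, w))

rng = random.Random(0)
for word in lexicon:
    pattern = simulate_spotting(word, recall=0.5, rng=rng)
    print(word, pattern, candidate_count(pattern, lexicon))
```

Repeating this over many trials gives the expected fraction of words resolved by bigram spotting alone.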

SLIDE 39

Possible Bonuses

N-gram spotting verification may be reasonably completed by non-native speakers of a language. Small user tasks may be easy to gamify.

SLIDE 40

Questions?

SLIDE 41

Limitations and Weaknesses

  • Dependent on word segmentation.
  • May require manual transcription of the first few pages of a corpus as training data.
  • Requires manual transcription to “finish” out-of-vocabulary, malformed, and infrequent or otherwise unfavorable words.
  • Poor spotting will burden human users with too much rejecting (or yield low recall).
  • If recognition/spotting scoring of word images does not prune effectively, the feasible lexicon size may be limited.

SLIDE 42

Subword N-gram Spotting

Preliminary results show 64% mAP for bigrams and 72% mAP for trigrams on George Washington dataset.* Better results should come with a specialized method.

*Using an adaptation of J. Almazan, A. Gordo, A. Fornes, and E. Valveny, “Word spotting and recognition with embedded attributes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 12, pp. 2552–2566, 2014.
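For reference, mAP is mean average precision over spotting queries. A minimal sketch of one common formulation (precision averaged at each relevant hit in the ranked result list, divided by the number of relevant items retrieved):

```python
def average_precision(ranked_relevance):
    """AP for one query: average of precision@k taken at each rank k
    where the retrieved item is relevant (1 = relevant, 0 = not)."""
    hits, precision_sum = 0, 0.0
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k
    return precision_sum / hits if hits else 0.0

def mean_ap(per_query_relevance):
    """Mean of AP over all spotting queries."""
    return sum(average_precision(q) for q in per_query_relevance) / len(per_query_relevance)

# Two toy queries; 1 marks a retrieved window that truly contains the n-gram.
print(mean_ap([[1, 0, 1, 0], [0, 1, 1]]))  # ≈ 0.708
```

Note that AP definitions vary (some divide by the total number of relevant items in the collection rather than those retrieved); the cited work's exact protocol should be consulted for comparisons.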