February 19-20, 2004, IEICE-PRMU, George Nagy

Classifiers that improve with use

George Nagy, DocLab, Rensselaer Polytechnic Institute


Argument

In-house training sets are never large enough, and never representative enough. We must therefore augment them with samples from actual (real-time, real-world) OCR operation. We present some methods to this end.



Outline

Non-representative training sets
Supervised learning (continuing classifier education)
"Unsupervised" adaptation: self-corrective, decision-directed, auto-label
Symbolic Indirect Correlation (SIC) (new)
Style-constrained classification
Weakly-constrained data distributions (new)
Linguistic context
Recommendations


Representation

[Figure: feature space with two features x1 and x2; class samples (O and X); equiprobability contours; decision boundary]


How representative is the training set?

[Figure: training versus test distributions for five cases: (1) representative, (2) adaptable (long fields), (3) discrete styles, (4) continuous styles (short fields), (5) weakly constrained]


Traditional open-loop OCR System

[Diagram: training set (patterns and labels) → parameter estimation → classifier parameters; operational data (bitmaps) → CLASSIFIER → transcript; meta-parameters (e.g. regularization, estimators); correction and reject entry handle rejects]


Some classifiers

Nearest neighbor
Gaussian quadratic
Linear Bayes
Multilayer neural network
Support vector machine
Simple perceptron


Supervised learning

[Diagram: Generic OCR system that makes use of post-processed rejects and errors. Training set → parameter estimation → classifier parameters; operational data (bitmaps) → CLASSIFIER → transcript; keyboarded labels of rejects and errors are fed back to parameter estimation; meta-parameters; correction, reject entry]


Adaptation (DHS: “Decision directed approximation”)

[Diagram: training set → parameter estimation → classifier parameters; operational data (bitmaps) → CLASSIFIER → transcript; classifier-assigned labels are fed back to parameter estimation; meta-parameters; correction, reject entry]

Field estimation, singlet classification


Self-corrective recognition (1966)

[Diagram: source document → scanner → feature extractor → categorizer; accepted characters drive a reference generator that updates the initial references with new references; rejected characters are set aside]
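The 1966 loop can be sketched in a few lines of Python. This is a toy reconstruction, not the original system: references are prototype vectors, acceptance is a nearest-versus-runner-up margin test, and accepted samples replace each reference by their mean (the distance, threshold, and update rule are all assumptions):

```python
import math

def self_corrective_pass(samples, references, labels, reject_threshold=0.5):
    # One pass of a self-corrective loop, as a toy sketch (not the 1966 system):
    # classify each sample against the current references; confidently accepted
    # samples are averaged into new references; ambiguous samples are rejected.
    accepted = {c: [] for c in labels}
    rejects = []
    for x in samples:
        ranked = sorted((math.dist(x, r), c) for r, c in zip(references, labels))
        (best, best_class), (runner_up, _) = ranked[0], ranked[1]
        if runner_up - best > reject_threshold:
            accepted[best_class].append(x)
        else:
            rejects.append(x)
    new_references = []
    for r, c in zip(references, labels):
        pts = accepted[c]
        # replace the reference by the mean of its accepted samples, if any
        new_references.append(tuple(sum(v) / len(pts) for v in zip(*pts)) if pts else r)
    return new_references, rejects
```

Iterating such passes lets the references drift toward the test font, which is the mechanism behind the error reductions on the next slide.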


Results: self-corrective recognition (Shelton & Nagy 1966)

Training set: 9 fonts, 500 characters/font, U/C
Test set: 12 fonts, 1500 characters/font, U/C
96 n-tuple features, ternary reference vectors
Initial error and reject rates: 3.5% / 15.2%
After self-correction: 0.7% / 3.7%


Decision-directed adaptation

aka self-corrective recognition, auto-label adaptation, semi-supervised learning, …

[Figure: an omnifont classifier adapted to a single font; example field: 1 1 1 7 7 7 7 1]


Results - Baird & Nagy (DR&R 1994)

100 fonts, 80 symbols each, from Baird's defect model (6,400,000 characters)

Size (pt)   Error reduction   % fonts improved   Best    Worst
6           ×1.4              100                ×4      ×1.0
10          ×2.5              93                 ×11     ×0.8
12          ×4.4              98                 ×34     ×0.9
16          ×7.2              98                 ×141    ×0.8


Results: adapting both means and variances

(Harsha Veeramachaneni 2003) NIST hand-printed digit classes, with 50 "Hitachi features"

                       % Error
Train     Test    Before   Adapt means   Adapt variances
SD3       SD3     1.1      0.7           0.6
          SD7     5.0      2.6           2.2
SD7       SD3     1.7      0.9           0.8
          SD7     2.4      1.6           1.7
SD3+SD7   SD3     0.9      0.6           0.6
          SD7     3.2      1.9           1.8
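The idea behind these results can be sketched for one-dimensional Gaussian classes. This is an illustration of decision-directed re-estimation, not Veeramachaneni's exact procedure; the classes, data, and update rule are invented for the example:

```python
import math

def gauss_logpdf(x, mu, var):
    # log density of a 1-D Gaussian
    return -0.5 * ((x - mu) ** 2 / var + math.log(2 * math.pi * var))

def adapt_means_and_variances(test_data, means, variances):
    # Label the test data with the current classifier, then re-estimate each
    # class's mean and variance from the samples assigned to it.
    assigned = {c: [] for c in means}
    for x in test_data:
        label = max(means, key=lambda c: gauss_logpdf(x, means[c], variances[c]))
        assigned[label].append(x)
    new_means, new_vars = dict(means), dict(variances)
    for c, pts in assigned.items():
        if len(pts) > 1:
            m = sum(pts) / len(pts)
            new_means[c] = m
            new_vars[c] = sum((p - m) ** 2 for p in pts) / (len(pts) - 1)
    return new_means, new_vars
```

When the test distribution is shifted relative to the training set (as with SD3 versus SD7), the self-labeled re-estimates track the shift, which is why adapting the means, and then also the variances, lowers the error.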


InkLink

Adnan El-Nasan (2003)

Constrained localized polygram matching: one unknown word against many reference words, using a lexicon of legal words. The reference set does not include most of the lexicon words!

On-line handwriting recognition


From electronic ink to feature string


Feature matching


(tX8j5XnNeEWXwBXEeNnWwSsXXTwSsXnTRwSsnTBnNewXsXXeNnWwSsX!tWwSsXTwSs)(#$%)(nNewBnNewSsXEeNnWwSNeEj65) (XLXsEeNnWwSsXTwSsnTwBnNeXBTewB)(XWBETnWwSsnTXBnNewS)(EXNnWBsXt)(nNeXEsSwWnNeEBsnTwSXETnWwSsNewBnNeBsXF!tFwSs)(LwS)

Polygram feature match. Unknown query word: "founding"; a reference word: "amendment".


(tX8j5XnNeEWXwBXEeNnWwSsXXTwSsXnTRwSsnTBnNewXsXXeNnWwSsX!tWwSsXTwSs)(#$%)(nNewBnNewSsXEeNnWwSNeEj65) (XLXsEeNnWwSsXTwSsnTwBnNeXBTewB)(XWBETnWwSsnTXBnNewS)(EXNnWBsXt)(nNeXEsSwWnNeEBsnTwSXETnWwSsNewBnNeBsXF!tFwSs)(LwS)

Query hypothesized as “contract”: poor match


(tX8j5XnNeEWXwBXEeNnWwSsXXTwSsXnTRwSsnTBnNewXsXXeNnWwSsX!tWwSsXTwSs)(#$%)(nNewBnNewSsXEeNnWwSNeEj65) (XLXsEeNnWwSsXTwSsnTwBnNeXBTewB)(XWBETnWwSsnTXBnNewS)(EXNnWBsXt)(nNeXEsSwWnNeEBsnTwSXETnWwSsNewBnNeBsXF!tFwSs)(LwS)

Query hypothesized as “founding”: good match


Localized Viterbi trellis search

[Figure: trellis with the reference letters "a m e n d m e n t" and "c o n t r a c t" along one axis]
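The slide shows only the trellis itself. As a stand-in for the localized trellis search, here is a Smith-Waterman-style local alignment in Python; the scoring weights are assumptions for illustration, not the talk's actual scoring:

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    # Dynamic-programming local alignment: find the score of the
    # best-matching local region of the two strings. Cells are floored
    # at zero so an alignment can start anywhere ("localized").
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

The same recurrence applies to feature strings like those above; characters here merely keep the example short.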


InkLink classification algorithm

1. The expected location where the unknown matches each reference word is pre-computed.
2. The feature matches of the unknown against the reference words are found by string matching.
3. The hypothesis that corresponds best to the expected length and location of the matches is chosen.
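Step 3 can be sketched as follows. The squared-error score, the words, and the match lengths are illustrative assumptions; the talk does not specify the exact comparison:

```python
def choose_hypothesis(observed, predicted):
    # Pick the lexicon hypothesis whose predicted match lengths against the
    # reference words agree best with the observed ones (lower score = better).
    def score(expected):
        return sum((observed[ref] - length) ** 2 for ref, length in expected.items())
    return min(predicted, key=lambda hyp: score(predicted[hyp]))

# Toy numbers: match lengths of the unknown word against two reference words.
observed = {"amendment": 4, "contract": 2}
predicted = {
    "founding": {"amendment": 4, "contract": 2},  # expected if the unknown is "founding"
    "contract": {"amendment": 2, "contract": 8},  # expected if the unknown is "contract"
}
```

With these numbers the hypothesis "founding" matches the observations exactly, mirroring the good-match / poor-match contrast on the preceding slides.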


Our most/least favorite writers


Comparison with external system

(four writers we like) 100-word lexicons


Self-corrective recognition (1966)

[Diagram: source document → scanner → feature extractor → categorizer; accepted characters drive a reference generator that updates the initial references with new references; rejected characters are set aside]


Auto-label adaptation

Results of adaptation ("auto-label") in InkLink

Error rate dropped from 28% to 7%. As good with 100 reference words as with 500 reference words without adaptation.

[Plot: error rate (%) versus iteration number, over five iterations]

Outline

Non-representative training sets
Supervised learning (continuing classifier education)
"Unsupervised" adaptation: self-corrective, decision-directed, auto-label
Symbolic Indirect Correlation (SIC)
Style-constrained classification
Weakly-constrained data distributions
Linguistic context
Recommendations


Symbolic Indirect Correlation (SIC)

[Figure: lexical graphs for ~PERIOD, ~EVER, ~PEOPLE~, ~PERPLEX~, and ~LEVER~; the signal graph of the reference string "~PERIOD~EVER~PEOPLE~"; and the match graphs obtained by comparing each lexicon word to the signal]


Signal graph of lever compared to reference signal graph

[Figure: the signal graph of "~LEVER~" compared to the reference signal graph of "~PERIOD~EVER~PEOPLE~"]


Matching Match Graphs

The best-matching subgraphs preserve edge crossings, i.e., we look for order-isomorphic subgraphs.
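The crossing-preservation test can be made concrete. In this sketch (a simplification; finding the best common subgraph is harder than checking one candidate pair) each match graph is a list of edges over integer positions, and two graphs are order-isomorphic when corresponding edge pairs interleave the same way:

```python
def relation(e, f):
    # Classify how two edges (a, b) with a < b interleave:
    # "disjoint", "nested", or "crossing".
    (a, b), (c, d) = sorted([e, f])
    if b <= c:
        return "disjoint"
    if d <= b:
        return "nested"
    return "crossing"

def order_isomorphic(edges1, edges2):
    # Two equally long edge lists are order-isomorphic when every pair of
    # corresponding edges has the same interleaving relation, i.e. edge
    # crossings are preserved.
    if len(edges1) != len(edges2):
        return False
    for i in range(len(edges1)):
        for j in range(i + 1, len(edges1)):
            if relation(edges1[i], edges1[j]) != relation(edges2[i], edges2[j]):
                return False
    return True
```

A search over candidate subgraphs would then keep the largest pair of subgraphs passing this test.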

[Figure: the best-matching, order-isomorphic subgraphs of the match graphs of "~LEVER~" and of the reference string "~PERIOD~EVER~PEOPLE~"]


Signal and lexical graphs (handwriting)


Results: SIC (simulation only!)

Weighted noise model:

Probability of wrong match:   0.8   1.0   0.6   0.4
Reference string size:        750   600   500   390
% correct recognition:         96   100   100   100

1000 most common words of the Brown Corpus. Adaptation: we can add recognized words to the reference string, as in InkLink.


Outline

Non-representative training sets
Supervised learning (continuing classifier education)
"Unsupervised" adaptation: self-corrective, decision-directed, auto-label
Symbolic Indirect Correlation (SIC)
Style-constrained classification
Weakly-constrained data distributions
Linguistic context
Recommendations


Style consistency: Field estimation, field classification

[Diagram: field-labeled training set → parameter estimation → classifier field parameters; batched test-set fields → style-constrained CLASSIFIER → transcript; meta-parameters; correction, reject entry]


Single-class and multi-class style

SINGLE CLASS STYLE MULTI-CLASS STYLE

Source 1: 29/05/1925   25/07/1922
Source 2: 15/05/1990   05/05/1925
Source 3: 21/06/1943   02/06/1943
Source 4: 05/29/1945   02/25/1942

Styles are induced in a collection of documents by multiple sources*.

* fonts, printers, scanners, writers, speakers, microphones, ...


Multi-mode parametric classifier for single-class style constraints (example field: 5 5 5 6 6 6 6)


Field classifier for strong style constraints: "implicit font recognition" (example fields: 524 524 524, 653 653 653 653)
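A minimal sketch of field classification under strong style constraints, assuming one-dimensional Gaussian features and invented style parameters: label the whole field under each candidate style and keep the style with the highest field likelihood, which is exactly "implicit font recognition":

```python
import math

def gauss_logpdf(x, mu, sigma):
    # log density of a 1-D Gaussian
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def field_classify(field, style_means, sigma=0.5):
    # style_means: style -> {class: mean}. For each style, label every
    # character under that style's class-conditional densities; keep the
    # style whose labeling has the highest total log-likelihood.
    best = None
    for style, class_means in style_means.items():
        labels, loglik = [], 0.0
        for x in field:
            c = max(class_means, key=lambda cls: gauss_logpdf(x, class_means[cls], sigma))
            labels.append(c)
            loglik += gauss_logpdf(x, class_means[c], sigma)
        if best is None or loglik > best[0]:
            best = (loglik, style, labels)
    return best[1], best[2]
```

In the test below, the feature value 2.0 is, by itself, exactly ambiguous between style A's "6" and style B's "5"; the rest of the field resolves the ambiguity in favor of style B.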


Field-trained classification versus style-constrained classification

Training set for field classification: 0000, 0001, 0010, …, 9998, 9999 (10^4 classes)
Training set for style classification: 00, 01, 02, …, 98, 99 (10^2 classes)
Field length = 4. Classifier parameters for longer field lengths are computed from the pair parameters (because jointly Gaussian variables are completely defined by their means and covariances).
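The covariance remark can be made concrete. In this sketch (1-D features per character; all names and numbers are illustrative) the covariance matrix of a length-L field is assembled entirely from pair statistics:

```python
def field_covariance(labels, var, cross):
    # Assemble the covariance matrix of a length-L field from pair parameters.
    # var[c] is the within-class variance; cross[(c1, c2)] is the style-induced
    # covariance between characters of classes c1 and c2 in the same field.
    # This is why training on pairs (10^2 classes) suffices for fields of any
    # length in the Gaussian model.
    L = len(labels)
    cov = [[0.0] * L for _ in range(L)]
    for i in range(L):
        for j in range(L):
            if i == j:
                cov[i][j] = var[labels[i]]
            else:
                pair = (labels[i], labels[j])
                # look the pair up in either order, since covariance is symmetric
                cov[i][j] = cross.get(pair, cross.get((pair[1], pair[0])))
    return cov
```

The resulting matrix is symmetric by construction, and the field mean is simply the concatenation of the per-class means.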


Results: style-constrained classification - short fields (Harsha V.)

Continuous style-constrained classifier, trained on ~17,000 characters and tested on ~17,000 characters; 25 top principal components of "Hitachi" blurred directional features.

Field error rate (%):

                 L = 2                   L = 5
Test data   w/o style   with style   w/o style   with style
SD3         1.4         1.3          3.0         2.5
SD7         2.7         2.4          5.3         4.5


Outline

Non-representative training sets
Supervised learning (continuing classifier education)
"Unsupervised" adaptation: self-corrective, decision-directed, auto-label
Symbolic Indirect Correlation (SIC)
Style-constrained classification
Weakly-constrained data distributions
Linguistic context
Recommendations


Weakly-constrained data

[Figure: training and test distributions; 3 classes, 4 multi-class styles]

given p(x), find p(y), where y=g(x)


Are weak constraints enough?

[Figure: training sets and a test field containing the digits 4, 9, 5, 6]


Outline

Non-representative training sets
Supervised learning (continuing classifier education)
"Unsupervised" adaptation: self-corrective, decision-directed, auto-label
Symbolic Indirect Correlation (SIC)
Style-constrained classification
Weakly-constrained data distributions
Linguistic context
Recommendations


Language context: decoding a substitution cipher

Cluster the bitmaps: 1 2 5 …
Cipher text: 1 2 . 2 . 2 . . 2 5 2 . . 5 2 . 5
DECODER with LANGUAGE MODEL: n-gram frequencies, lexicon, … (Nagy & Casey, 1966; Nagy & Seth, 1987)
Decoded: 1 → a, 2 → n, 5 → e ("an unknown sentence")
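The lexicon half of the decoder can be sketched as a backtracking search. This is a simplified illustration only: the cited systems also exploited n-gram frequencies, which are omitted here, and the cluster IDs and lexicon are invented for the example:

```python
def decode_cipher(cipher_words, lexicon):
    # Find an injective cluster-ID-to-letter mapping under which every cipher
    # word (a tuple of cluster IDs) spells a lexicon word, by backtracking.
    def extend(mapping, cipher_word, word):
        trial, used = dict(mapping), set(mapping.values())
        for cid, ch in zip(cipher_word, word):
            if cid in trial:
                if trial[cid] != ch:
                    return None          # contradicts an earlier assignment
            elif ch in used:
                return None              # mapping must stay one-to-one
            else:
                trial[cid] = ch
                used.add(ch)
        return trial

    def search(i, mapping):
        if i == len(cipher_words):
            return mapping
        for word in lexicon:
            if len(word) == len(cipher_words[i]):
                trial = extend(mapping, cipher_words[i], word)
                if trial is not None:
                    found = search(i + 1, trial)
                    if found is not None:
                        return found
        return None

    return search(0, {})
```

On the slide's example, the lexicon constraint alone pins down the mapping: "an unknown sentence" shares cluster IDs across words, so inconsistent hypotheses are pruned.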


Language context [with Tin Ho 2000]


Text printed with Spitz glyphs


Decoded text

chapter I 2 LOOMINGS Call me Ishmael. Some years ago – never mind how long precisely – having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have ...

chapter i _ bee_inds _all me ishmaels some years ago__never mind how long precisely __having little or no money in my purses and nothing particular to interest me on shores i thought i would sail about a little and see the watery part of the worlds it is a way i have ...


Style context versus language context

Two digits in an isogenous field: … 5 6 …, with feature vectors x_i, x_j and class labels C_i, C_j.

Style context: P(x_i, x_j | C_i=5, C_j=6) ≠ P(x_i | C_i=5) P(x_j | C_j=6)
Language context: P(C_i=5, C_j=6) ≠ P(C_i=5) P(C_j=6)


Recommendations for OCR systems that improve with use

Never let the machine rest: design it so that it puts every coffee-break to good use.
Don't throw away edits (corrected labels): use them.
Classify style-consistent fields, not characters: adapt on long fields, exploit multi-class style in short fields.
Use order rather than position.
Let the machine guess: lazy decisions.
Make use of all possible context: language, shape, layout, and function.

Every document can be read by the right reader.


The End

I am grateful to Hitachi CRL for technical and financial support since 1992. Special thanks to

  • Drs. Fujisawa, Liu, Sako, and Shima, and to
  • Messrs. Koga, Marukawa, and Mine.

I also learned much from my former and present students Jung, Sarkar, Veeramachaneni, and El-Nasan. But any misconceptions in this presentation are entirely my own!

Thank you!


Non-supervised classification is always semi-supervised classification: only the supervisory constraints are hidden!

Clustering: cardinality, diameter, population, or sequence of clusters.
ML / EM: form of distribution, sigma, number of mixture components.
Trees, neural networks: validation set.


InkLink classification algorithm

P(l_ij | c_i) = P(c_i | l_ij) P(l_ij) / P(c_i)

c* = arg max_c P(l_ij , l̂_ij | M_ij = 1)

where l_ij is the observed feature match length, l̂_ij the predicted feature match length, and M_ij the match between unknown i and reference j.

Four classifiers, based on different feature sets, are combined by Borda count.
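The Borda count combination rule itself is a few lines of Python. This is a generic sketch of the voting rule, not the four InkLink feature-set classifiers, whose rankings are invented here:

```python
def borda_combine(rankings):
    # Combine several classifiers' rankings of the same hypotheses by Borda
    # count: in each ranking of n hypotheses, the one at position p earns
    # (n - 1 - p) points; the hypothesis with the highest total wins.
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] = scores.get(candidate, 0) + (n - 1 - position)
    return max(scores, key=lambda c: scores[c])
```

Because it uses rank order rather than raw scores, the rule needs no calibration across the classifiers being combined.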


A letter from George Washington

Feature-string matching: "complete" vs. "are to be made to the aid de camp."


CONFERENCE REPORT ON ICDAR '07

George Nagy, Professor Emeritus (??), Rensselaer Polytechnic Institute

OUTLINE: VENUE, INVITED SPEAKERS, AWARDS, SESSION TOPICS, NOTABLE CONTRIBUTIONS, LARGE-SCALE PROJECTS, OUTLOOK

(Slides from my closing talk at ICDAR 1997 in Ulm, Germany)


INVITED SPEAKERS (all second generation)

  • Prof. Tin Kam Ho: "Web-wide Voting Network for Weak Classifiers." Exploratory data analysis at Lucent Bell Labs.
  • Dr. Tapas Kanungo: "Multimedia Document Devastation Models." DIA at IBM Almaden Research Center.
  • Dr. Rainer Hoch: "High-speed Context Switches." Just moved from SAP to the Mannheim Berufsakademie.
  • Dr. Abdelwahab Zramdini: "Web Typography." Top-secret research at a bank in Geneva.
  • Prof. Omid Kia: "Processing Compressed Multimedia Documents." Founded a company, then disappeared!


III. Devices

Storage: Docupin, DocuPinCushion (DPC); 750 MB/DP, 18 GB/DPC; not addressable.

SONY: "... Event announcement: 'Memory Stick Xmas Magic.' Introducing various fun and money-saving campaigns!! ... Latest-news magazine: Memory Stick Update. ..."
SANDISK: "SANDISK UNVEILS THE HIGHEST CAPACITY MEMORY STICK PRO STORAGE CARD IN THE WORLD—2 GIGABYTES."


E-Scap

Cholesteric polymer-dispersed hybrid substrate; 10 microns thick (100 sheets per cm); ten-megacell array provides 300 dpi resolution; 10^6 read-write cycles; ecologically benign (silicon-, not carbon-, based); no color so far.


Radiotablet

Lambertian high-contrast reflective display; satellite cyberband channel; pentameric (soft, 4 mm thick elastomere) vary-keyboard; 100 MB flash-cache; Docupin port (USB).

Tablet PCs: "Toshiba's Convertible Tablet PC paves the way ..."


VI. Oral Documents

AUDIO DOCUMENTS: speech-to-print, print-to-speech. The convergence of speech and document image processing began in the 1990's with the adoption of HMMs in OCR. So far, primarily for entertainment (music).

(almost everything is primarily for entertainment!)

Some natural sounds (bird calls, frogs, dolphins); oral history (eye-witness accounts); personal note-taking (key-chain recorders); instructional audios ("prompter").

???


Handwriting

Now used mainly for self-communication: letters have already been replaced by email, and forms are being replaced by e-forms. Children learn to type in pre-school! Therefore personal electronic-ink recognition is becoming more important than scanned handwriting recognition.


Other questions?