February 19-20, 2004 IEICE-PRMU George Nagy
Classifiers that improve with use
George Nagy DocLab Rensselaer Polytechnic Institute
Argument
In-house training sets are never large enough, and never representative enough. We must therefore augment them with samples from actual (real-time, real-world) OCR operation. We present some methods to this end.
Outline
Non-representative training sets
Supervised learning (continuing classifier education)
"Unsupervised" adaptation: self-corrective, decision-directed, auto-label
Symbolic Indirect Correlation (SIC) (new ***)
Style-constrained classification
Weakly-constrained data distributions (new ***)
Linguistic context
Recommendations
Representation
[Figure: feature space plot of two features (x1, x2) showing samples of two classes ("O" and "X"), their equiprobability contours, and the decision boundary between them]
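The feature-space picture above can be illustrated with a minimal sketch. The class means, feature values, and the nearest-mean rule below are assumptions for illustration, not the talk's classifier; for equal spherical equiprobability contours, the nearest-mean rule yields a linear decision boundary (the perpendicular bisector of the segment joining the class means).

```python
import math

# Hypothetical class means in a two-feature space (x1, x2); values are
# assumptions chosen only to illustrate the geometry.
MEAN_O = (1.0, 1.0)  # class "O"
MEAN_X = (4.0, 3.0)  # class "X"

def classify(x1, x2):
    """Nearest-mean rule: assign the sample to the closer class mean.
    The induced decision boundary is a straight line in the (x1, x2) plane."""
    d_o = math.hypot(x1 - MEAN_O[0], x2 - MEAN_O[1])
    d_x = math.hypot(x1 - MEAN_X[0], x2 - MEAN_X[1])
    return "O" if d_o < d_x else "X"

# Samples near each mean fall on that mean's side of the boundary.
labels = [classify(0.5, 1.2), classify(4.2, 2.8)]
```

A sample exactly equidistant from both means lies on the decision boundary itself; here ties are broken in favor of "X".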
How representative is the training set?
[Diagram comparing training and test distributions under five conditions: (1) representative, (2) adaptable (long fields), (3) discrete styles, (4) continuous styles (short fields), (5) weakly constrained]
Traditional open-loop OCR System
[Diagram: a labeled training set (patterns and labels) feeds parameter estimation, governed by meta-parameters (e.g. regularization, estimators), to produce the classifier parameters; operational data (bitmaps) enters the CLASSIFIER, which emits a transcript; rejects are routed to correction and manual entry. The operational data never flows back into training.]
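The open-loop flow can be sketched as follows. The one-dimensional "features", per-class means, and distance-margin reject rule are simplifying assumptions standing in for real OCR bitmaps and parameters; the point is only that parameters are estimated once from the fixed training set and operational patterns never feed back.

```python
# Open-loop OCR sketch: train once on a fixed labeled set, then classify
# operational patterns with no further adaptation.

def estimate_parameters(training_set):
    """Estimate per-class means from (feature, label) pairs.
    Here a 'pattern' is a single feature value (an assumption)."""
    sums, counts = {}, {}
    for feature, label in training_set:
        sums[label] = sums.get(label, 0.0) + feature
        counts[label] = counts.get(label, 0) + 1
    return {lab: sums[lab] / counts[lab] for lab in sums}

def classify(feature, params, reject_margin=0.5):
    """Nearest class mean; reject (return None) when the two closest
    classes are nearly tied, routing the pattern to manual correction."""
    ranked = sorted(params, key=lambda lab: abs(feature - params[lab]))
    if len(ranked) > 1:
        d0 = abs(feature - params[ranked[0]])
        d1 = abs(feature - params[ranked[1]])
        if d1 - d0 < reject_margin:
            return None  # reject entry
    return ranked[0]

# Parameters come only from the in-house training set...
params = estimate_parameters([(1.0, "a"), (1.2, "a"), (5.0, "b"), (5.4, "b")])
# ...and operational data is classified without updating them.
transcript = [classify(f, params) for f in [1.1, 5.2, 3.1]]
```

The talk's argument is that this loop should be closed: confidently labeled operational samples could augment the training set, which this open-loop pipeline by design never does.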