Handwriting Recognition Technology in the Newton's Second Generation "Print Recognizer" (The One That Worked)


SLIDE 1

WWNC 2004

Handwriting Recognition Technology in the Newton's Second Generation “Print Recognizer” (The One That Worked)

Larry Yaeger
Professor of Informatics, Indiana University
Distinguished Scientist, Apple Computer
World Wide Newton Conference, September 4-5, 2004

SLIDE 2

Handwriting Recognition Team

Core Team
• Bill Stafford (ATG)
• Les Vogel (Contractor)
• Dick Lyon (ATG)
• Brandyn Webb (Contractor)
• Larry Yaeger (ATG)

Other Contributors
Giulia Pagallo, Ernie Beernink, Michael Kaplan, Josh Gold, Boris Aleksandrovsky, Dan Azuma, George Mills, Chris Hamlin, Stuart Crawford, Gene Ciccarelli, Kara Hayes, Rus Maxham

Testers
Emmanuel Euren, Denny Mahdik, Ron Dotson, Julie Wilson, Glen Raphael, Polina Fukshansky

SLIDE 3

Recognizer History

• ’92 ATG “Rosetta” project demos well at Stewart Alsop’s “Demo ’92” (blows socks off Nathan Myhrvold’s MS demo) and WWDC
• ’93 Head of ATG suggests abandoning handwriting recognition for interactive TV project
• ’93-’94 Rosetta nearly ships in “PenLite” pen-based Mac product
• Jan ’94 Port to Newton started
• ’94 Brief interest in Rosetta for abortive “Nautilus” Mac product
• … testing with tethered Newtons, much accuracy improvement…
• 18 Nov ’94 Provided handful of untethered Newtons for testing
• 1 Feb ’95 Beta 1 build (Merry Xmas!)
• ’95 Rosetta ships as “Print Recognizer” in Newton (120?)
• ’95 Rosetta widely acknowledged as world’s first usable handwriting recognizer

SLIDE 4

Recognizer History

• 13 Nov ’95 John Markoff writes about Rosetta in NY Times
• Nov or Dec ’95 receive cease-and-desist demand for use of "Rosetta" name (Mac-based SmallTalk platform)
• Jan ’96 team picks “Mondello” codename, “Neuropen” product name
• ’96 Short-lived “Hollywood” pen-based Mac project
• Mar ’97 cursive almost working
• 18 Mar ’97 ATG laid off
• May ’00 “Inkwell” for Mac OS 9 declares beta
• May ’00 Marketing declares “no new features on 9”, OS X work begins
• Jul ’02 Inkwell for Mac OS X declares GM (10.2 / Jaguar)
• Sep ’03 Inkwell APIs and additional languages declare GM (10.3 / Panther)
• Apr ’04 Motion announced with gestural interface, including tablet and in-air ink-on-demand

SLIDE 5

Recognizer Overview

• Powerful state-of-the-art technology
  • Neural network character classifier
  • Maximum-likelihood search over letter segmentation, letter class, word, and word segmentation hypotheses
  • Flexible, loosely applied language model with very broad coverage
• Now part of “Inkwell” in Mac OS X
• Also provides gesture recognition
  • System
  • Application (Motion)

SLIDE 6

Recognition Block Diagram

[Block diagram]
(x,y) points & pen-lifts → Tentative Segmentation → character segmentation hypotheses → Neural Network Classifier → character class hypotheses → Beam Search With Context → word probabilities

SLIDE 7

Character Segmentation

• Which strokes comprise which characters?
• Constraints
  • All strokes must be used
  • No strokes may be used twice
• Efficient pre-segmentation
  • Avoid trying all possible permutations
  • Based on order, overlap, crossings, aspect ratio…
• Integrated with recognition
  • Forward & reverse “delays” implement implicit graph of hypotheses

SLIDE 8

Neural Network Character Classifier

• Inherently data-driven
• Learn from examples
• Non-linear decision boundaries
• Effective generalization

SLIDE 9

Context Is Essential

• Humans achieve 90% accuracy on characters in isolation (our database)
  • Word accuracy would then be ~60% (0.9^5)
• Variety of context models are possible
  • N-grams
  • Variable (Memory) Length Markov Model
  • Word lists
  • Regular expression graphs
• "Out of dictionary" writing also required
  • "xyzzy", unix pathnames, technical/medical terms, etc.
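
As a rough check on the word-accuracy figure on this slide, here is a minimal sketch. It assumes character errors are independent and that a typical word has five letters; both are assumptions, since the slide gives only the 90% character figure and the ~60% word figure. The function name is hypothetical.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that every character in a word is correct,
    assuming errors are independent from character to character."""
    return char_accuracy ** word_length


# 90% per-character accuracy over an assumed five-letter word
print(round(word_accuracy(0.9, 5), 2))  # 0.59
```

Longer words fare worse under the same assumption, which is why the slide argues that context is essential.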

SLIDE 10

Recognition Technology

[Block diagram]
(x,y) points & pen-lifts → Tentative Segmentation → character segmentation hypotheses → Neural Network Classifier → character class hypotheses → Beam Search With Context → word probabilities

Example character class hypotheses (one row per segment):

 a   b   c   d   e   f   g   …   l   …
.1  .0  .7  .0  .1  .0  .0   …  .0   …  .1  …
.0  .1  .0  .7  .0  .1  .0   …  .1   …  .0  …
.0  .0  .0  .0  .0  .0  .0   …  1.   …  .0  …

SLIDE 11

Character Segmentation

[Figure: table of seven tentative segments (Segment Number 1-7) with rows for Segment Ink, Stroke Count, Forward Delay, and Reverse Delay]

Transition rule: i → j is legal iff $FD_i + RD_j = j - i$

SLIDE 12

Network Design

• Variety of architectures tried
• Single hidden layer, fully-connected
• Multi-hidden layer, with receptive fields
• Shared weights (LeCun)
• Parallel classifiers combined at output layer
• Representation as important as architecture
• Anti-aliased images
• Baseline-driven with ascenders and descenders
• Stroke features

SLIDE 13

Network Architectures

[Figure: three network architectures, each with output classes a … z, A … Z, 0 … 9, ! … ~]

SLIDE 14

Neural Network Classifier

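[Figure: classifier network diagram]
Outputs: a … z, A … Z, 0 … 9, ! … ~, £
Inputs: Stroke Count, Aspect Ratio, Image, Stroke Feature
Layer sizes shown: 14 x 14, 1 x 1, 5 x 1, 20 x 9, 72 x 1, 104 x 1, 95 x 1, 112 x 1, 2 x 7, 7 x 2, 7 x 7, 1 x 9, 9 x 1, 5 x 5
Parenthesized labels shown: (8x8), (8x7;1,7), (7x8;7,1), (8x6;1,8), (6x8;8,1), (10x10), (6x14), (14x6)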

SLIDE 15

Normalizing Output Error

• Normalize “pressure towards zero”
• Based on recognition of the fact that most training signals are zero
• Training vector for letter "x"

  a … w  x  y  z  A … Z  0 … 9  ! … ~
  0 … 0  1  0  0  0 … 0  0 … 0  0 … 0

• Forces net to attempt to make unambiguous classifications
• Makes it difficult to obtain meaningful 2nd and 3rd choice probabilities

SLIDE 16

Normalized Output Error

• We reduce the BP error for non-target classes relative to the target class
• By a factor that "normalizes" the non-target error relative to the target error
• Based on the number of non-target vs. target classes
• For non-target output nodes

  $e' = A\,e$, where $A = 1 / [\,d\,(N_{outputs} - 1)\,]$

• Allocates network resources to modeling of low-probability regime

SLIDE 17

Normalized Output Error

• Converges to MMSE estimate of $f(P(class|input), A)$
• We derived that function:

  $\langle \hat{e}^2 \rangle = p\,(1 - y)^2 + A\,(1 - p)\,y^2$, where $p = P(class|input)$ and $y$ = output unit activation

• Output y for particular class is then:

  $y = p / (A - A p + p)$

• Inverting for p:

  $p = y A / (y A - y + 1)$
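
The formulas on the "Normalized Output Error" slides can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the 95 output classes are an assumption taken from the "95 x 1" layer on the classifier diagram.

```python
def error_factor(d: float, n_outputs: int) -> float:
    """A = 1 / (d * (N_outputs - 1)), the factor applied to non-target errors."""
    return 1.0 / (d * (n_outputs - 1))


def scaled_error(error: float, is_target: bool, a: float) -> float:
    """Backprop error for one output node: e' = A * e for non-target nodes."""
    return error if is_target else a * error


def output_from_probability(p: float, a: float) -> float:
    """y = p / (A - A p + p): the activation the net converges to."""
    return p / (a - a * p + p)


def probability_from_output(y: float, a: float) -> float:
    """p = y A / (y A - y + 1): recover P(class|input) from the activation."""
    return y * a / (y * a - y + 1.0)


a = error_factor(d=0.1, n_outputs=95)  # assumed 95 classes
y = output_from_probability(0.05, a)   # a low-probability class
p = probability_from_output(y, a)      # maps back to 0.05 (up to rounding)
```

With d = 0.1 and 95 outputs, A comes out near 0.106, consistent with the A = 0.11 quoted for the trained net. Because A < 1, a class with p = 0.05 produces an activation well above 0.05, which is the sense in which the net devotes more resources to the low-probability regime.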

SLIDE 18

Normalized Output Error

Empirical p vs. y histogram for a net trained with A=0.11 (d=0.1), with corresponding theoretical curve

SLIDE 19

Normalized Output Error

[Figure: bar chart of Error (%) by NormOutErr setting]

NormOutErr   Character Error (%)   Word Error (%)
0.0          9.5                   31.6
0.8          12.4                  22.7

SLIDE 20

Stroke Warping

• Produce random variations in stroke data during training
• Small changes in skew, rotation, x and y linear and quadratic scaling
• Consistent with stylistic variations
• Improves generalization by effectively adding extra data samples
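
A minimal sketch of stroke warping as described on this slide, assuming strokes are lists of (x, y) points normalized to roughly the unit square. The function name, the parameter ranges, and the order in which the transforms are applied are all assumptions; the slides name only the kinds of variation.

```python
import math
import random


def warp_stroke(points, rng, max_skew=0.1, max_rotation=0.1,
                max_linear=0.1, max_quadratic=0.1):
    """Return a randomly warped copy of a stroke given as [(x, y), ...]."""
    skew = rng.uniform(-max_skew, max_skew)
    angle = rng.uniform(-max_rotation, max_rotation)  # radians
    sx = 1.0 + rng.uniform(-max_linear, max_linear)   # x linear scale
    sy = 1.0 + rng.uniform(-max_linear, max_linear)   # y linear scale
    qx = rng.uniform(-max_quadratic, max_quadratic)   # x quadratic term
    qy = rng.uniform(-max_quadratic, max_quadratic)   # y quadratic term
    cos_a, sin_a = math.cos(angle), math.sin(angle)

    warped = []
    for x, y in points:
        # Linear plus quadratic scaling along each axis
        x1 = sx * x + qx * x * x
        y1 = sy * y + qy * y * y
        # Skew: shift x in proportion to y
        x2 = x1 + skew * y1
        # Rotation about the origin
        warped.append((cos_a * x2 - sin_a * y1, sin_a * x2 + cos_a * y1))
    return warped


stroke = [(0.0, 0.0), (0.5, 0.4), (1.0, 1.0)]
variant = warp_stroke(stroke, random.Random(42))
```

Drawing fresh parameters each time a sample is presented yields a slightly different variant per epoch, which is how warping acts like extra training data. In practice all strokes of one character would presumably share the same parameters.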

SLIDE 21

Stroke Warping

[Figure: stroke warping examples. Panels: Original, X Quadratic, X Skew, Y Linear, Rotation, X Linear]

SLIDE 22

Class Frequency Balancing

• Skip and repeat patterns
• Instead of dividing by the class priors
• Eliminates noisy estimate of low freq. classes
• Eliminates need for renormalization
• Forces net to better model low freq. classes
• Compute normalized frequency, relative to average frequency

  $F_i = S_i / \bar{S}$, where $\bar{S} = \frac{1}{C} \sum_{i=1}^{C} S_i$

SLIDE 23

Class Frequency Balancing

• Compute repetition factor

  $R_i = (a / F_i)^b$

• Where a (0.2 to 0.8) controls amount of skipping vs. repeating
• And b (0.5 to 0.9) controls amount of balancing
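
The class-balancing formulas can be sketched as follows. The function names and sample counts are hypothetical, and the way a fractional repetition factor becomes an actual number of presentations (integer part plus a random extra) is an assumption; the slides give only the formulas for F_i and R_i.

```python
import random


def normalized_frequencies(sample_counts):
    """F_i = S_i / S_bar, where S_bar is the mean sample count per class."""
    mean = sum(sample_counts.values()) / len(sample_counts)
    return {cls: count / mean for cls, count in sample_counts.items()}


def repetition_factors(sample_counts, a=0.5, b=0.7):
    """R_i = (a / F_i) ** b for each class."""
    freqs = normalized_frequencies(sample_counts)
    return {cls: (a / f) ** b for cls, f in freqs.items()}


def presentations(r, rng):
    """How often to present one pattern this epoch (assumed scheme):
    the integer part of r, plus one more with probability equal to
    the fractional part. r < 1 means the pattern is sometimes skipped."""
    whole = int(r)
    return whole + (1 if rng.random() < r - whole else 0)


counts = {"e": 900, "q": 50, "z": 50}   # made-up class sample counts
factors = repetition_factors(counts)    # "e" below 1, "q" and "z" above 1
rng = random.Random(0)
times = {cls: presentations(r, rng) for cls, r in factors.items()}
```

With these made-up counts the common class "e" gets a factor below 1 and is mostly skipped, while the rare classes get factors above 2 and are repeated.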

SLIDE 24

Stroke-Count Frequency Balancing

• Compute frequencies for stroke-counts in each class
• Modulate repetition factors by stroke-count sub-class frequencies

  $R_{ij} = R_i \, [\,(S_i / J) / S_{ij}\,]^b$
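
A small sketch of the stroke-count modulation. It assumes J is the number of stroke-count sub-classes observed for class i, so that S_i / J is the average sub-class size; the slide does not define J, and the function name and counts are hypothetical.

```python
def stroke_count_repetition(r_i, subclass_counts, b=0.7):
    """R_ij = R_i * ((S_i / J) / S_ij) ** b for each stroke-count j.

    subclass_counts maps stroke-count j -> S_ij, the number of samples
    of this class written with j strokes.
    """
    s_i = sum(subclass_counts.values())  # S_i: all samples of the class
    j_total = len(subclass_counts)       # J: assumed number of sub-classes
    return {
        j: r_i * ((s_i / j_total) / s_ij) ** b
        for j, s_ij in subclass_counts.items()
    }


# A class written mostly with one stroke, occasionally with two
factors = stroke_count_repetition(1.0, {1: 80, 2: 20})
```

Under this reading, the rarer two-stroke variant is repeated more often than the common one-stroke variant, and equal sub-class sizes leave R_i unchanged.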

SLIDE 25

Adding Noise to Stroke-Count

• Small percentage of samples use randomly selected stroke-count (as input to the net)
• Improves generalization by reducing bias towards observed stroke-counts
• Even improves accuracy on data drawn from training set

SLIDE 26

Negative Training

• Inherent ambiguities force segmentation code to generate false segmentations
• Ink can be interpreted in various ways...
• "dog", "clog", "cbg", "%g"
• Train network to compute low probabilities for false segmentations

SLIDE 27

Negative Training

• Modulate negative training two ways…
  • Negative error factor (0.2 to 0.5)
  • Like A in normalized output error
  • Negative training probability (0.05 to 0.3)
  • Also speeds training
• Too much negative training
  • Suppresses net outputs for characters that look like elements of multi-stroke characters (I, 1, l, |, o, O, 0)
• Slight reduction in character accuracy, large gain in word accuracy

SLIDE 28

Error Emphasis

• Probabilistically skip training for correctly classified patterns
• Never skip incorrectly classified patterns
• Just one form of error emphasis
  • Can reduce learning rate/error for correctly classified patterns
  • And/or increase learning rate/error for incorrectly classified patterns
  • Maintain pool of samples from which correctly classified patterns are flushed each epoch

SLIDE 29

Training Probabilities and Error Factors

               Prob. of Usage           Error Factor
Segment Type   Correct    Incorrect     Target Class   Other Classes
POS            0.5        1.0           1.0            0.1
NEG            0.18                     NA             0.3

SLIDE 30

Annealing & Scheduling

• Start with large learning rate, then decay
• When training set's total squared error increases
• Start with high error emphasis, then decay
• Start with minimal negative training, then increase
• Mostly for pragmatic reasons

SLIDE 31

Training Schedule

Phase   Epochs   Learning Rate   Correct Train Prob   Negative Train Prob
1       25       1.0 - 0.5       0.1                  0.05
2       25       0.5 - 0.1       0.25                 0.1
3       50       0.1 - 0.01      0.5                  0.18
4       30       0.01 - 0.001    1.0                  0.3
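
One way to read this schedule together with the error-emphasis and negative-training slides is as a per-sample decision about whether to train. The sketch below is an interpretation, not the original training code: the `Phase` record and `should_train` function are hypothetical names, and only the numbers come from the table.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    epochs: int
    lr_start: float             # learning rate at the start of the phase
    lr_end: float               # learning rate at the end of the phase
    correct_train_prob: float   # chance of training on a correctly classified sample
    negative_train_prob: float  # chance of training on a false segmentation


SCHEDULE = [
    Phase(25, 1.0, 0.5, 0.1, 0.05),
    Phase(25, 0.5, 0.1, 0.25, 0.1),
    Phase(50, 0.1, 0.01, 0.5, 0.18),
    Phase(30, 0.01, 0.001, 1.0, 0.3),
]


def should_train(phase, is_negative, is_correct, rng):
    """Decide whether to run a backward pass for one sample."""
    if is_negative:
        # Negative (false-segmentation) samples are used only occasionally.
        return rng.random() < phase.negative_train_prob
    if not is_correct:
        # Incorrectly classified patterns are never skipped.
        return True
    # Correctly classified patterns are skipped probabilistically.
    return rng.random() < phase.correct_train_prob


rng = random.Random(0)
train_now = should_train(SCHEDULE[0], is_negative=False, is_correct=True, rng=rng)
```

Early phases skip most correctly classified samples (high error emphasis) and use few negatives; the last phase trains on every positive sample and uses the most negatives, matching the "Annealing & Scheduling" bullets.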

SLIDE 32

Quantized Weights

• Forward/classification pass requires less precision than backward/learning pass
• Use one-byte weights for classification
  • Saves both space and time
  • ±3.4 (-8 to +8 with 1/16 steps)
• Use three-byte weights for learning
  • ±3.20
• First Newton version
  • ~200KB ROM (~85KB for weights)
  • ~5KB-100KB RAM
  • ~3.8 char/second
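
Reading "±3.4" as a sign bit, 3 integer bits, and 4 fractional bits (and "±3.20" as the same layout with 20 fractional bits) gives one-byte and three-byte fixed-point weights. The sketch below illustrates that reading; the function names are hypothetical, and two's-complement saturation is an assumption, which makes the largest positive one-byte value 127/16 rather than exactly +8.

```python
def quantize_weight(w: float, frac_bits: int = 4, total_bits: int = 8) -> int:
    """Round w to a signed fixed-point integer, saturating at the range limits."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, round(w * scale)))


def dequantize_weight(q: int, frac_bits: int = 4) -> float:
    """Convert the stored integer back to a real-valued weight."""
    return q / (1 << frac_bits)


# One-byte classification weight: steps of 1/16
q8 = quantize_weight(0.3)           # 5
w8 = dequantize_weight(q8)          # 0.3125

# Three-byte learning weight: steps of 1/2**20
q24 = quantize_weight(0.3, frac_bits=20, total_bits=24)
w24 = dequantize_weight(q24, frac_bits=20)
```

The coarse one-byte form is enough for the forward pass, while the finer three-byte form lets small weight updates accumulate during learning instead of rounding away.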

SLIDE 33

Quantized Weights

SLIDE 34

User Adaptation

• Neural net classifier based on an inherently learning technology
• Learning not used in Newton due to memory constraints
• Learning not (yet) used in Mac OS X due to limited human resources
• Can reduce error rates by factor of 2 to 5, yet user-independent “walk-up” performance is maintained!

SLIDE 35

User Adaptation

• User training scenario
  • 15-20 min. of data entry
  • Less for problem characters alone
  • Possibly < 1 min. network learning
  • One-shot learning may suffice (single epoch)
  • May learn during data entry
  • Maximum of a few minutes (~10-12 epochs)
• Learn on the fly
  • Can continuously adapt
  • Need system hooks
  • Choosing what to train on is key system issue

SLIDE 36

User Adaptation

[Figure: bar chart. Character Error Rate (%), Alphanumeric Test Set (Not in Any Training Set), by Writer (1, 2, 3), for User-Independent, User-Specific, and User-Adapted networks. Bar labels: 13.9, 24.8, 5.1, 4.7, 17.8, 6.1, 6.3, 21.3, 11, 10.7, (45)]

SLIDE 37

Integration with Character Segmentation

• Search takes place over segmentation hypotheses (as well as character hypotheses)
• Stroke recombinations are presented in regular, predictable order
• Forward and reverse "delay" parameters suffice to indicate legal time-step transitions

SLIDE 38

Integration with Word Segmentation

• Search also takes place over word segmentation hypotheses
• Word-space becomes an optional segment/character
• Weighted by probability ("SpaceProb") derived from statistical model of gap sizes and stroke centroid spacing
• Non-space hypotheses are weighted by 1-SpaceProb

SLIDE 39

Word Segmentation Statistical Model

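$P_{WordBreak} = \Gamma_{WordGap} / (\Gamma_{StrokeGap} + \Gamma_{WordGap})$

[Figure: distributions of Samples vs. Gap Size for Word Break and Stroke (Non-Word) Break, with $\Gamma_{StrokeGap}$ and $\Gamma_{WordGap}$ marked]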
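
The word-break formula can be sketched as a ratio of two density values at the observed gap size. The Gaussian densities and their parameters below are stand-in assumptions used only to make the example runnable; the slide shows empirical distributions of gap sizes, and the function names are hypothetical.

```python
import math


def gaussian_density(x, mean, std):
    """Stand-in for an empirical gap-size distribution (assumption)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def word_break_probability(gamma_stroke_gap, gamma_word_gap):
    """P_WordBreak = G_WordGap / (G_StrokeGap + G_WordGap)."""
    total = gamma_stroke_gap + gamma_word_gap
    if total == 0.0:
        return 0.5  # no evidence either way
    return gamma_word_gap / total


def space_prob(gap):
    # Made-up parameters: stroke gaps cluster near 1 unit, word gaps near 6.
    g_stroke = gaussian_density(gap, mean=1.0, std=1.0)
    g_word = gaussian_density(gap, mean=6.0, std=2.0)
    return word_break_probability(g_stroke, g_word)


narrow = space_prob(1.0)   # small gap: word break unlikely
wide = space_prob(6.0)     # large gap: word break likely
```

In the search, a space hypothesis would be weighted by this probability and the competing non-space hypothesis by one minus it, as described on the word-segmentation slide.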

SLIDE 40

Recognition Ambiguity

SLIDE 41

Geometric Context

[Figure: “if” from User vs. Table (user data scaled and offset to minimize error magnitude); error vector of eight differences]

SLIDE 42

Language Model

• Dictionaries
• Word lists
• Regular expression grammars
• BiGrammars - combinations of dictionaries
• Probabilistically weighted
• Flexible starts, stops, and transitions

SLIDE 43

Regular Expression Grammars

• Telephone numbers example:

  dig = [0123456789]
  digm01 = [23456789]
  acodenums = (digm01 [01] dig)
  acode = { ("1-"? acodenums "-"):40 , ("1"? "(" acodenums ")"):60 }
  phone = (acode? digm01 dig dig "-" dig dig dig dig)
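
For readers more familiar with conventional regular expressions, the telephone grammar can be approximated with Python's `re` module. This is only a rough translation: the `:40` and `:60` weights, which appear to bias the two area-code forms, have no regex counterpart and are dropped, and the `is_phone` helper is a hypothetical name.

```python
import re

DIG = r"[0-9]"
DIGM01 = r"[2-9]"                    # digits other than 0 and 1
ACODENUMS = DIGM01 + r"[01]" + DIG   # area code: middle digit is 0 or 1
# Either "1-"? acodenums "-"   or   "1"? "(" acodenums ")"
ACODE = r"(?:(?:1-)?" + ACODENUMS + r"-|1?\(" + ACODENUMS + r"\))"
PHONE = re.compile(
    r"(?:" + ACODE + r")?" + DIGM01 + DIG + DIG + r"-" + DIG + r"{4}"
)


def is_phone(text: str) -> bool:
    """True if the whole string matches the phone grammar."""
    return PHONE.fullmatch(text) is not None


examples = ["555-1234", "1-415-555-1234", "(408)555-1234", "155-1234"]
results = {s: is_phone(s) for s in examples}
```

The first three examples match; the last does not, because an exchange may not begin with 0 or 1. In the recognizer the grammar constrains the search rather than validating finished strings, so this translation shows only which strings the grammar admits.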

SLIDE 44

BiGrammars

• Limited context telephone example:

  BiGrammar2 Phone
  [phone.lang 1. 1. 1.]

SLIDE 45

BiGrammars

• (Fairly) general context example:

  BiGrammar2 FairlyGeneral
  (.8
    (.6
      [WordList.dict .5 .8 1. EndPunct.lang .2]
      [User.dict .5 .8 1. EndPunct.lang .2]
    )
    (.4
      [Phone.lang .5 .8 1. EndPunct.lang .2]
      [Date.lang .5 .8 1. EndPunct.lang .2]
    )
  )
  (.2
    [OpenPunct.lang 1. 0. .5
      (.6
        WordList.dict .5
        User.dict .5
      )
      (.4
        Phone.lang .5
        Date.lang .5
      )
    ]
  )
  [EndPunct.lang 0. .9 .5 EndPunct.lang .1]

SLIDE 46

Old Newton Writing Example

when Year-old Arabian retire tipped off the Christmas wrapping, No square with delights Santa brought the Attacking hit too dathe would Problem was, Joe talked Bobbie. His doll stones at the really in its army Antiques I machine gun and hand decades At its side. But it says things like 3 "Want togo shopping" The Pro has claimed responsibility that's Bobbie Liberation Organization. Make up of more than 50 Concerned parents 3 Machinist 5 and oth er activists 5 the Pro claims to hsve crop if Housed switched the voice boxes on 300 hit, Joe and Bobbie foils across the United States this holiday Season we have operations All over the country" said one pro member 5 who wished to remain autonomous. "Our goal is to cereal and correct Thu problem of exposed stereo in editorials toys."

SLIDE 47

Mondello Writing Example

When 7-year-old Zachariah Zelin ripped off the Christmas wrapping, he squealed with delight. Santa brought the talking G.I. Joe doll he wanted. Problem was, Joe talked like Barbie. His doll stands at the ready in its Armyfatigues, machine gun and hand grenades at its side. But it says things like, "Want to go shopping?" The BLO has claimed responsibility. That's Barbie Liberation Organization. Made up of more than 50 concerned parents, feminists and other activists, the BLO claims to have surreptitiously switched the voice boxes on 300 G.I. Joe and Barbie dolls across the United States this holiday season. "We have operatives all over the country," said one BLO member, who wished to remain anonymous. "Our goal is to reveal and Correct the problem of gender-based stereotyping in children's toys."

SLIDE 48

Apple-Newton Handwriting Recognition

The Power +o he your 6est