
Fully Convolutional Networks for Handwriting Recognition

Felipe Petroski Such*, Dheeraj Peri*, Frank Brockler†, Paul Hutkowski†, Raymond Ptucha*
*Rochester Institute of Technology, †Kodak Alaris

Such et al., ICFHR'18

Background

  • Offline handwriting recognition continues to be a difficult problem because of the virtually infinite ways the same information can be written.
  • Convolutional Neural Networks (CNNs) have been applied to handwriting recognition with good success.
  • Recurrent Neural Networks (RNNs) are useful for arbitrary-length sequences, and Connectionist Temporal Classification (CTC) works well as a post-correction step.

[Example image: a handwritten letter reading "I am truly touched by your kind contribution to my birthday presents & grateful for your good wishes. — Winston Churchill". Note: some believe this letter is a forgery.]


Workflow – Word Extraction

Document segmentation → block segmentation:
  • SegNet or similar labels each pixel by type and can grow regions to orthogonal boundaries.
  • A Modified XY Tree or similar suggests rectilinear splits.
  • Both are used together to define paragraph, sentence, and word blocks.


Workflow – Word Recognition

  • Preprocessing – fix skew, rotation, contrast
  • Prediction – CNNs, HMMs, and LSTMs used together
  • Post-processing – train & test: CTC; test only: language model


Proposed Method

  • Character classification without the need for:
    – Preprocessing – no deskewing
    – A predefined lexicon of words – works on surnames, phone numbers, and street addresses
    – Post-processing – no RNN or CTC needed
  • Utilizes Fully Convolutional Networks (FCNs) to translate arbitrary sequence lengths.
    – FCNs are faster to train than RNNs and more robust
    – CTC can still be used, but we found it hard to converge
  • A single architecture works on arbitrary words as well as words from a lexicon.


High Level

  • Vocabulary CNN – predicts a word label for common words such as 'his', 'her', 'the'. If confidence > g, then done!
  • Length CNN – predicts the number of symbols N, then resamples the block to 32 × 16N.
  • Symbol FCN – predicts 2N+1 symbols, where each symbol is separated by a blank space.
  • Language Model (optional) – when the block is known to come from a lexicon of words, use vocabulary matching by minimizing character error rate.
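The four stages above can be sketched as a simple dispatch. This is a minimal illustration, not the authors' code: every name here (vocab_cnn, length_cnn, symbol_fcn, resample, strip_blanks, the gate g) is a placeholder assumption.

```python
# Hedged sketch of the four-stage pipeline; all callables are placeholders.
def recognize(block, vocab_cnn, length_cnn, symbol_fcn,
              resample, strip_blanks, g=0.9, lexicon_match=None):
    # 1) Vocabulary CNN: word label for common words ('his', 'her', 'the').
    word, conf = vocab_cnn(block)
    if conf > g:                                  # confidence above gate -> done
        return word
    # 2) Length CNN: predict N symbols, resample the block to 32 x 16N.
    n = length_cnn(block)
    canonical = resample(block, 32, 16 * n)
    # 3) Symbol FCN: 2N+1 outputs with blanks separating the N symbols.
    word = strip_blanks(symbol_fcn(canonical))
    # 4) Optional language model: vocabulary matching by minimizing CER.
    return lexicon_match(word) if lexicon_match else word
```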


Vocabulary and Length CNNs

C(64,3,3)-C(64,3,3)-C(64,3,3)-P(2)-C(128,3,3)-C(128,3,3)-C(256,3,3)-P(2)-C(256,3,3)-C(512,3,3)-C(512,3,3)-P(2)-C(256,4,16)-FC(V)-SoftMax

where C(D,H,W) stands for a convolution with filter dimensions H×W and depth D. Each convolutional layer is followed by a batch norm and ReLU layer. P(2) represents a 2×2 pooling layer with stride 2.

For vocabulary, V ≈ 1000. For length, V = 32 (but it can be any value, or a regression).
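A quick sanity check on the stack above (not the authors' code): assuming the 3×3 convolutions are "same"-padded, which matches the unchanged H × W values in the slide's diagram, only the P(2) layers shrink the 32 × 128 input, and three of them yield exactly the 4 × 16 map consumed by the final C(256,4,16).

```python
# Trace spatial dimensions through the stack; only P(2) changes them
# (the C(D,3,3) layers are assumed "same"-padded, per the diagram).
def trace(h, w, layers):
    for layer in layers:
        if layer == "P(2)":
            h, w = h // 2, w // 2      # 2x2 pooling, stride 2
    return h, w

stack = ["C(64,3,3)", "C(64,3,3)", "C(64,3,3)", "P(2)",
         "C(128,3,3)", "C(128,3,3)", "C(256,3,3)", "P(2)",
         "C(256,3,3)", "C(512,3,3)", "C(512,3,3)", "P(2)"]
print(trace(32, 128, stack))           # -> (4, 16), the input to C(256,4,16)
```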


Symbol FCN

[Architecture diagram, two paths over the 32 × 16N canonical block:
  • Context path – Conv 3×3 + pool stages, then FC and ReLU ×2, producing a 1024-d context vector.
  • Symbol detail path – Conv 3×3 + pool ×3, Conv 3×3 + pool ×2, Conv 3×3 + pool ×2, then Conv 4×4 with 1×2 pad.
The context vector is tiled, added to the detail features, then passed through ReLU, FC(Ns), and ReLU to give the 2N+1 predictions.]

Symbol FCN (padding details)

Conv 4×4×512 (1024 filters) with 1×2 pad over the 2N-wide, 4-high activation map, giving a 3 × (2N+1) × 1024 output:
  • The vertical pad gives forgiveness for up/down shifts – think of it as three estimates for each prediction.
  • The horizontal pad gives 2N+1 outputs.
This is followed by FullyConv 3×1×1024 (Ns filters) with 1×2 pad and a softmax.

[Diagram, shown for N=1 and N=3: for N input symbols, a pad of 2 on the left/right widens the 2N-wide activation map so that a conv filter of width 4 produces 2N+1 predicted symbols.]
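The two padding bullets are just the standard convolution output-size formula, out = (in + 2·pad − k)/stride + 1: a width-4 filter with horizontal pad 2 turns a 2N-wide map into 2N+1 outputs, and vertical pad 1 turns height 4 into the three row-estimates.

```python
# Standard convolution output-size arithmetic behind the padding bullets.
def conv_out(size, k, pad, stride=1):
    return (size + 2 * pad - k) // stride + 1

for n in (1, 3, 4, 9):                                # N values shown on the slides
    assert conv_out(2 * n, k=4, pad=2) == 2 * n + 1   # horizontal: 2N+1 outputs
assert conv_out(4, k=4, pad=1) == 3                   # vertical: three estimates
```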


Symbol FCN (output layer)

  • Each of the 2N+1 predictions is a linear combination of a 3×1024 activation map.
  • Softmax over the Ns symbols.
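The output layer can be sketched in a few lines of numpy. This is a hedged illustration under assumed shapes, not the authors' code: the FullyConv 3×1×1024 filter bank makes each of the 2N+1 predictions a linear combination of a 3 × 1024 slice of the activation map, followed by a softmax over the Ns symbols.

```python
import numpy as np

def symbol_head(acts, filt):
    # acts: (3, 2N+1, 1024) activation map; filt: (Ns, 3, 1024) filter bank.
    # Each output column is a linear combination of one 3 x 1024 slice.
    logits = np.einsum("hwc,shc->ws", acts, filt)      # -> (2N+1, Ns)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)            # softmax over Ns symbols

rng = np.random.default_rng(0)
n, n_symbols = 3, 123                                  # Ns = 123 in the paper
probs = symbol_head(rng.standard_normal((3, 2 * n + 1, 1024)),
                    rng.standard_normal((n_symbols, 3, 1024)) * 0.01)
assert probs.shape == (2 * n + 1, n_symbols)           # one softmax per prediction
```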


Edit-Distance Walkthrough

Animated over several slides: computing the character error between the predicted word "time" (columns) and the comparison word "tymme" (rows). Each cell holds the edit cost e_{i,j}:

e_{i,j} = min( e_{i-1,j} + 1,  e_{i,j-1} + 1,  sub )

sub = e_{i-1,j-1}        if pred char = comp char
sub = e_{i-1,j-1} + 1    otherwise

Completed table (standard form, with a 0 border):

        t   i   m   e
    0   1   2   3   4
t   1   0   1   2   3
y   2   1   1   2   3
m   3   2   2   1   2
m   4   3   3   2   2
e   5   4   4   3   2

The animation fills the table one cell at a time: a match passes the previous (diagonal) error along, while a miss adds +1 for the cheapest of an insert, a delete, or a replace ("Miss! +1 to insert i", "Miss, +1 to delete y", "Miss, +1 to replace y with i", ...). The final cost in the bottom-right cell is the edit distance: here, 2.
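The recurrence in the walkthrough is ordinary Levenshtein distance, written out here in plain Python as an illustration (not the authors' code):

```python
# Levenshtein edit distance: rows index the comparison word, columns the
# predicted word; insert, delete, and replace each cost 1.
def edit_distance(pred, comp):
    e = [[0] * (len(pred) + 1) for _ in range(len(comp) + 1)]
    for j in range(len(pred) + 1):
        e[0][j] = j                     # top border: j edits from empty
    for i in range(len(comp) + 1):
        e[i][0] = i                     # left border: i edits from empty
    for i in range(1, len(comp) + 1):
        for j in range(1, len(pred) + 1):
            # A match passes the diagonal error along; a miss adds +1.
            sub = e[i-1][j-1] if pred[j-1] == comp[i-1] else e[i-1][j-1] + 1
            e[i][j] = min(e[i-1][j] + 1, e[i][j-1] + 1, sub)
    return e[-1][-1]

print(edit_distance("time", "tymme"))   # -> 2, the final cost in the walkthrough
```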


Datasets

  • IAM English handwritten dataset
    – 115,320 English words, mostly cursive, by 500 authors.
    – Comes with train, validation, and test splits.
  • RIMES French handwritten dataset
    – 60,000 French words by over 1,000 authors.
    – Uses the ICDAR 2011 release and splits.
  • NIST Handprinted Forms and Characters database
    – 810,000 characters by 3,600 authors.


IAM Results

[Comparison table of prior methods: LSTM w/ CTC, HMMs with MLP, HMM, and CNN w/ RNN approaches. The CNN w/ RNN baselines use pre- and post-processing and a fixed symbol lexicon of only the upper- and lower-case Latin alphabet. ✩ (our work): Vocabulary CNN of 1100 words; Symbol CNN uses Ns = 123 symbols.]


RIMES Results

[Comparison table of prior methods: LSTM w/ CTC, HMM, and CNN w/ RNN. The CNN baseline uses pre- and post-processing and a fixed symbol lexicon of only the upper- and lower-case Latin alphabet. ✩ (our work): Vocabulary CNN of 800 words; Symbol CNN uses Ns = 123 symbols.]


NIST Results

92.4% accuracy on a subset of 12,000 word blocks (English, French, and special characters) generated from the NIST dataset.


Attention Modeling

[Diagram: M variants of the input pixels (variant 1 … variant M) each pass through a CNN (CNN1a … CNN2); the resulting descriptors are concatenated and combined using learned attention weights s, and an FC_classify layer with SoftMax produces the 2N+1 symbol predictions.]

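The attention step on the Attention Modeling slide can be sketched as follows. This is an assumption, not the authors' code: the slide names attention weights s over the M variant descriptors, but the exact combination rule is not recoverable, so a softmax-weighted sum is shown as one common realization.

```python
import numpy as np

def attend(descriptors, scores):
    # descriptors: (M, D), one per input-pixel variant; scores: (M,) raw scores.
    s = np.exp(scores - scores.max())
    s = s / s.sum()                    # attention weights s, sum to 1
    return s @ descriptors             # weighted sum -> fused (D,) descriptor

fused = attend(np.ones((3, 8)), np.array([0.2, 1.0, -0.5]))
assert fused.shape == (8,)
```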

Conclusions

  • Introduced an offline handwriting recognition architecture that works with either arbitrary characters or a fixed lexicon.
  • The Vocabulary CNN quickly solves simple words.
  • The Length CNN forms a canonical word suitable for input into the Symbol CNN.
  • The Symbol CNN is an FCN that is indifferent to canonical word length.
  • Despite using a large character lexicon (123 symbols) and being able to predict arbitrary words such as surnames and phone numbers, it generates competitive CER and WER.


Thank you!!

Ray Ptucha rwpeec@rit.edu

https://www.rit.edu/mil