Dropout improves Recurrent Neural Networks for Handwriting Recognition


SLIDE 1

1/22

Dropout improves Recurrent Neural Networks for Handwriting Recognition

Vu Pham, Théodore Bluche, Christopher Kermorvant, Jérôme Louradour

tb@a2ia.com, jl@a2ia.com

SLIDE 2

2/22

Outline

1. RNN for Handwritten Text Line Recognition
   - Offline Handwritten Text Recognition
   - Recurrent Neural Networks (RNN)

2. Dropout for RNN

3. Experiments
   - Improvement of RNN
   - Improvement of the complete recognition system

SLIDE 3

3/22 RNN for Handwritten Text Line Recognition

Outline

1. RNN for Handwritten Text Line Recognition
   - Offline Handwritten Text Recognition
   - Recurrent Neural Networks (RNN)

2. Dropout for RNN

3. Experiments
   - Improvement of RNN
   - Improvement of the complete recognition system

SLIDE 4

4/22 RNN for Handwritten Text Line Recognition

Offline Handwritten Text Recognition

[Example handwritten document image, whose transcription reads:] “Dear Charlize. You are cordially invited to the grand opening of my new art gallery intitled «The new era of Media Music and paintings» on July 17th 2012. P.S: UR presence is obligatory due to your great help of launching my career.”

Line segmentation in the front-end
“Temporal Classification”: variable-length 1D or 2D input → 1D target sequence (different length)

SLIDE 5

5/22 RNN for Handwritten Text Line Recognition

Modeling: Recurrent Neural Networks (RNN)

State-of-the-art in Handwritten Text Recognition

Task: Image (2D sequence) → 1D sequence of characters

[Architecture diagram: input image, divided into 2x2 blocks → MDLSTM (2 features) → convolutional (2x4 input, 6 features) → sum & tanh → MDLSTM (10 features) → convolutional (2x4 input, 20 features) → sum & tanh → MDLSTM (50 features) → fully-connected (N features) → sum, collapse → N-way softmax → CTC. Example output: per-frame character probabilities for the line “It was a splendid interpretation”.]

1. RNN Network Architecture (Graves & Schmidhuber, 2008)
   - Multi-Directional layers of LSTM (“Long Short-Term Memory”) units: 2D recurrence in 4 possible directions
   - Convolutions: parameterized subsampling layers
   - Collapse layer: from 2D to 1D (output ∼ log P); see the sketch below
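As an illustration of the collapse step named above, here is a minimal sketch in plain NumPy (hypothetical sizes, not the authors' implementation): the 2D maps of per-character scores are summed over the vertical axis, so that each horizontal position keeps a single score vector, which the N-way softmax then turns into the per-frame distribution fed to CTC.

import numpy as np

def collapse(score_maps):
    # Collapse layer: (height, width, n_classes) 2D maps -> (width, n_classes) sequence.
    return score_maps.sum(axis=0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: a 12 x 160 map of N = 80 per-character scores (including blank).
frame_probs = softmax(collapse(np.random.randn(12, 160, 80)))   # shape (160, 80)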

SLIDE 6

5/22 RNN for Handwritten Text Line Recognition

Modeling: Recurrent Neural Networks (RNN)

State-of-the-art in Handwritten Text Recognition

Task: Image (2D sequence) → 1D sequence of characters

[Architecture diagram repeated from the previous slide, with the example output “It was a splendid interpretation”.]

1. RNN Network Architecture (Graves & Schmidhuber, 2008)
   - Multi-Directional layers of LSTM (“Long Short-Term Memory”) units: 2D recurrence in 4 possible directions
   - Convolutions: parameterized subsampling layers
   - Collapse layer: from 2D to 1D (output ∼ log P)

2. CTC Training (“Connectionist Temporal Classification”)
   - The network can output all possible symbols, plus a blank output
   - Minimization of the Negative Log-Likelihood (NLL): − log P(Y|X)

SLIDE 7

6/22 RNN for Handwritten Text Line Recognition

Modeling: Recurrent Neural Networks (RNN)

State-of-the-art in Handwritten Text Recognition

The recurrent neurons are Long Short-Term Memory (LSTM) units

[Diagram: LSTM cell with input gate, output gate and two forget gates; the four multi-directional layers combine, at position (i, j), the input layer with the hidden states of the neighbouring positions (i±1, j) and (i, j±1), one scanning direction per layer.]
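To make the gating explicit, here is a minimal sketch of a single (1D) LSTM step in NumPy; it is only illustrative: the actual network uses multi-directional 2D LSTM layers, where each cell has one forget gate per spatial dimension rather than the single one shown here.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W, U, b hold the stacked parameters of the input, forget, output gates and the cell candidate.
    z = W @ x + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    c = f * c_prev + i * np.tanh(g)                # new cell state
    h = o * np.tanh(c)                             # hidden output
    return h, c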

SLIDE 8

7/22 RNN for Handwritten Text Line Recognition

Loss function: Connectionist Temporal Classification (CTC)

Deal with several possible alignments between two 1D sequences

[Diagram: the target sequence “TEA” (U = 3 symbols) is expanded with blanks ∅ into a sequence of length U' = 2U+1 and aligned against the T ≥ U frame-wise RNN outputs; the CTC loss is − log P(Y|X).]

U = 3: number of target symbols; T: number of RNN outputs (∝ image width)
Basic decoding strategy (without lexicon nor language model): [∅ …] T … [∅ …] E … [∅ …] A … [∅ …] → “TEA”

SLIDE 9

7/22 RNN for Handwritten Text Line Recognition

Loss function: Connectionist Temporal Classification (CTC)

Deal with several possible alignments between two 1D sequences

[Diagram: the target sequence “TEE” is expanded with blanks ∅ into a sequence of length U' = 2U+1 and aligned against the T ≥ U+1 frame-wise RNN outputs (a blank is mandatory between the two E's); the CTC loss is − log P(Y|X).]

U = 3: number of target symbols; T: number of RNN outputs (∝ image width)
Basic decoding strategy (without lexicon nor language model): [∅ …] T … [∅ …] E … ∅ … E … [∅ …] → “TEE”
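A minimal sketch of this basic decoding strategy (best-path decoding), assuming the blank ∅ has index 0; illustrative only, not the authors' decoder:

import numpy as np

BLANK = 0   # assumption: index of the blank symbol

def best_path_decode(frame_probs, alphabet):
    # frame_probs: (T, n_labels) per-frame probabilities from the softmax.
    best = frame_probs.argmax(axis=1)            # most likely label at each timestep
    decoded, prev = [], BLANK
    for label in best:
        if label != prev and label != BLANK:     # collapse repeats, then drop blanks
            decoded.append(alphabet[label])
        prev = label
    return "".join(decoded)

# e.g. a best path ∅ ∅ T ∅ E E ∅ E ∅ collapses to "TEE".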

SLIDE 10

8/22 RNN for Handwritten Text Line Recognition

Optimization: Stochastic Gradient Descent

Simple and efficient

No mathematical guarantee (no chance of converging to the true global minimum)
But popular with deep networks: works well in practice! (finds “good” local minima)

for (input, target) in Oracle() do
    output    = RNN.Forward(input)
    outGrad   = CTC_NLL.Gradient(output, target)
    paramGrad = RNN.BackwardGradient(input, ..., outGrad)
    RNN.Update(paramGrad)
end for
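For illustration, a minimal PyTorch rendering of the loop above (an assumption: the original system used the authors' own MDLSTM code, not PyTorch; a 1D bidirectional LSTM stands in for the optical model, and all sizes are made up):

import torch
import torch.nn as nn

T, BATCH, N_FEAT, N_CLASSES, TARGET_LEN = 160, 4, 32, 80, 20   # hypothetical sizes

class TinyOpticalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(N_FEAT, 50, bidirectional=True)
        self.fc = nn.Linear(100, N_CLASSES)          # class 0 is the CTC blank

    def forward(self, x):                            # x: (T, batch, N_FEAT)
        h, _ = self.lstm(x)
        return self.fc(h).log_softmax(dim=-1)        # (T, batch, N_CLASSES)

def oracle(n_batches=10):
    # Stand-in for the data oracle: random inputs and random target sequences.
    for _ in range(n_batches):
        yield torch.randn(T, BATCH, N_FEAT), torch.randint(1, N_CLASSES, (BATCH, TARGET_LEN))

model, ctc_nll = TinyOpticalModel(), nn.CTCLoss(blank=0)        # -log P(Y|X)
sgd = torch.optim.SGD(model.parameters(), lr=1e-3)

for inputs, targets in oracle():
    log_probs = model(inputs)                                    # RNN.Forward
    loss = ctc_nll(log_probs, targets,
                   torch.full((BATCH,), T, dtype=torch.long),
                   torch.full((BATCH,), TARGET_LEN, dtype=torch.long))
    sgd.zero_grad()
    loss.backward()                                              # gradient of the CTC NLL
    sgd.step()                                                   # RNN.Update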

SLIDE 11

9/22 Dropout for RNN

Outline

1. RNN for Handwritten Text Line Recognition
   - Offline Handwritten Text Recognition
   - Recurrent Neural Networks (RNN)

2. Dropout for RNN

3. Experiments
   - Improvement of RNN
   - Improvement of the complete recognition system

SLIDE 12

10/22 Dropout for RNN

Dropout

General Principle [Krizhevsky & Hinton, 2012]

Training:

Randomly set intermediate activities (*) to 0 with probability p (typically p = 0.5)
(*) neuron outputs, usually in [−1, 1], [0, 1] or [0, ∞)
∼ Sampling from 2^N different architectures that share weights

Decoding:

All intermediate activities are scaled by 1 − p
∼ Geometric mean of the outputs from 2^N models

Featured in award-winning convolutional networks (ImageNet)
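A minimal sketch of the rule above in plain NumPy (function and variable names are illustrative, not the authors' code):

import numpy as np

def dropout_forward(activations, p=0.5, training=True, rng=np.random):
    # Training: drop each intermediate activity with probability p.
    if training:
        keep_mask = rng.uniform(size=activations.shape) >= p
        return activations * keep_mask
    # Decoding: no sampling; scale all activities by (1 - p) so the next layer
    # sees the same expected input as during training.
    return activations * (1.0 - p)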

SLIDE 13

11/22 Dropout for RNN

Dropout

Dropout with recurrent layer

Recurrent connections are kept untouched
Dropout can be implemented as a separate layer (outputs identical to inputs, except at dropped locations); see the sketch below
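A sketch of this placement, reusing the dropout_forward() helper from the previous slide (illustrative; lstm_step is any ordinary, unmodified LSTM cell): dropout is a separate layer applied to the sequence of LSTM outputs feeding the next layer, while the recurrence itself never sees dropped values.

import numpy as np

def lstm_then_dropout(lstm_step, inputs, h0, c0, p=0.5, training=True):
    # lstm_step(x_t, h, c) -> (h, c); the recurrent connections are untouched.
    h, c, outputs = h0, c0, []
    for x_t in inputs:
        h, c = lstm_step(x_t, h, c)
        outputs.append(h)
    # Dropout only on the feed-forward path towards the next layer.
    return dropout_forward(np.stack(outputs), p=p, training=training)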

SLIDE 14

12/22 Dropout for RNN

Dropout

Overview of the full network

[Full network diagram: input image (2x2 blocks) → MDLSTM (2 features) → Dropout → convolutional (2x4 input, 2x4 stride, 6 features) → sum & tanh → MDLSTM (10 features) → Dropout → convolutional (2x4 input, 2x4 stride, 20 features) → sum & tanh → MDLSTM (50 features) → Dropout → fully-connected (N features) → sum, collapse → N-way softmax → CTC.]

Dropout is placed after the recurrent LSTM layers and before the feed-forward layers (convolutional and linear layers)

SLIDE 15

13/22 Experiments

Outline

1. RNN for Handwritten Text Line Recognition
   - Offline Handwritten Text Recognition
   - Recurrent Neural Networks (RNN)

2. Dropout for RNN

3. Experiments
   - Improvement of RNN
   - Improvement of the complete recognition system

SLIDE 16

14/22 Experiments

Databases and performance assessment

Training subsets:

Database   Language   # different characters   # labelled lines   # characters (in lines)
IAM        English                 78                  9,462                 338,904
Rimes      French                 114                 11,065                 429,099
OpenHaRT   Arabic                 154                 91,811               2,267,450

Training: minimizing the Negative Log-Likelihood (NLL) with CTC alignments.
Decoding: pick the best label at each timestep, remove duplicates, then blanks.
Evaluation: Character Error Rate (%) on a separate dataset, and its reduction with vs. without dropout. Training convergence time is also interesting, but not critical.
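For reference, a sketch of the Character Error Rate used for evaluation (standard edit distance; illustrative, not the actual scoring tool):

def character_error_rate(reference, hypothesis):
    # CER = (substitutions + insertions + deletions) / length of the reference.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        cur = [i]
        for j, h in enumerate(hypothesis, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1] / max(len(reference), 1)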

SLIDE 17

15/22 Experiments

Results: Dropout on the topmost LSTM layer

∼ Dropout on the high-level features used in logistic regression
Error rate reduction when varying the number of hidden units in the topmost layer

SLIDE 18

16/22 Experiments

Results: Dropout on all LSTM layers

Use the good recipe whenever possible!
Number of hidden units tuned (on the validation dataset) to reach the best performance

SLIDE 19

17/22 Experiments

Results analysis: Dropout acts as Regularization

[Convergence curves: CER (%) and WER (%) vs. number of updates (in millions), for 50, 100 and 200 topmost hidden units, with and without dropout, on training and validation data.]

Less overfitting: the gap between training and validation loss is smaller
Training with dropout is slower: there is a trade-off between accuracy and training speed
(However, decoding speed is the same for a given neural architecture!)

SLIDE 20

18/22 Experiments

Results analysis: Dropout acts as Regularization

[Histograms of the classification weights in [−1.5, 1.5], baseline vs. dropout, for IAM (English), Rimes (French) and OpenHaRT (Arabic).]

Outgoing weights are smaller: L1 and L2 norms are greatly reduced
Better than L1/L2 weight decay (and also simple to implement)

Data-driven approach: no need to tune λ ∈ [0, +∞) to control the bias-variance tradeoff.
Only one hyper-parameter, p ∈ [0, 1), which is less sensitive. NB: p = 0.5 works well!

On the other hand, tanh activations (in [−1, 1]) are sharper: more “helpful” features are learned by “preventing co-adaptation” (Hinton et al., 2012)

SLIDE 21

19/22 Experiments

Integration in a complete recognition system

Performance improves when language constraints (vocabulary, LM) are added.
Decoding in a hybrid RNN/HMM framework, with scaled likelihoods: p(x|y) ∝ p(y|x) / p(y)

HMM: one state for each label including blank, with self-loop and outgoing transition
Lexicon: each word is the sequence of its character HMMs with optional blanks in between

Language Model: word n-grams
The goal is to find the optimal word sequence Ŵ:

Ŵ = argmax_W p(W|X) = argmax_W p(X|W) p(W)    (1)
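A sketch of the posterior-to-likelihood conversion implied by the hybrid formula above (illustrative; the prior scaling factor k is an assumption, a common knob in hybrid systems, and the actual decoder, lexicon and LM are not shown):

import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors, k=1.0):
    # Per frame and label: log p(x|y) + const = log p(y|x) - k * log p(y).
    # log_posteriors: (T, n_labels) RNN outputs (log-softmax over labels incl. blank).
    # log_priors: (n_labels,) log label frequencies estimated on the training data.
    return log_posteriors - k * log_priors

# These scores serve as HMM emission (log-)likelihoods; the decoder then searches
# for the word sequence W maximizing p(X|W) p(W) under the lexicon and n-gram LM.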

SLIDE 22

20/22 Experiments

Results in a complete system:

Word Error Rate of Full Systems (Optical Model + Lexicon/Language Model):

[Bar chart: Word Error Rate (%) on the Rimes, IAM and OpenHaRT evaluation sets, comparing the best published system, the system without dropout, and the system with dropout.]

Database   Language   # words   # words in vocabulary   % OOV   LM       Perplexity
Rimes      French       5,639          12k               2.6%   4-gram        18
IAM        English     25,920          50k               3.7%   3-gram       329
OpenHaRT   Arabic      47,837          95k               6.8%   3-gram      1162

SLIDE 23

21/22 Conclusion

Conclusions and future work

Dropout acts as a regularizer: outgoing weights tend to be lower
Dropout improves the accuracy of offline text recognition with RNNs: about 10-20% (relative) improvement in CER and WER
Training convergence with dropout is longer: roughly twice as slow

SLIDE 24

22/22

Thank you for your attention!

Questions and comments are welcome. tb@a2ia.com, jl@a2ia.com
