A New Training Pipeline for an Improved Neural Transducer

SLIDE 1

A New Training Pipeline for an Improved Neural Transducer

Albert Zeyer1,2, André Merboldt1, Ralf Schlüter1,2, Hermann Ney1,2

1Lehrstuhl Informatik 6: Human Language Technology and Pattern Recognition

Department of Mathematics, Computer Science and Natural Sciences RWTH Aachen University

2AppTek

SLIDE 2

Introduction & motivation:

  • Model which allows for time-synchronous decoding

Generalization, extension, and comparison of:

  • RNN transducer (RNN-T) and Recurrent Neural Aligner (RNA) models
  • CTC, RNA and RNN-T label topologies
    – Explicit blank label, or separate emit sigmoid

Training criteria:

  • Full sum (FS) over all possible alignments (standard RNN-T loss)
  • Frame-wise cross-entropy (CE)
    – Allows more powerful models

Setup:

  • End-to-end on subword BPE units
  • Experiments on Switchboard 300h
  • All code & setups published: github.com/rwth-i6/returnn-experiments

2 of 14 Zeyer & Merboldt & Schlüter & Ney: A new training pipeline for an improved neural transducer, Lehrstuhl Informatik 6: Human Language Technology and Pattern Recognition, RWTH Aachen University

SLIDE 3

Label topologies

  • Generalize over label topologies to compare them.
  • Input seq. x_1^T, output/target (non-blank) seq. y_1^N, alignment α_1^U.
  • α_u ∈ Σ′ := {b} ∪ Σ, y_n ∈ Σ. T input seq. len, N output seq. len, U alignment seq. len.
  • Blank label b can be part of the labels, or a separate emit sigmoid.
  • Labels Σ are 1k BPE subword units in this work.

(Lattice diagrams on the original slide show the allowed alignment paths for the target "dog" over the time axis t and label axis n, one per topology.)

CTC [Graves et al., 2006]:

  • Time-sync., label repetition,
  • U = T, ∆t ≡ 1, t_u = u, ∆n(α_u) = 1_{α_u ≠ b ∧ α_u ≠ α_{u−1}}.

RNA [Sak et al., 2017] or monotonic RNN-T [Tripathi et al., 2019]:

  • Time-sync., no label rep.,
  • U = T, ∆t ≡ 1, t_u = u, ∆n(α) = 1_{α ≠ b}.

RNN-T [Graves, 2012]:

  • Only blank forwards in time (no label rep.),
  • U = N + T, ∆t(α) = 1_{α = b}, ∆n(α) = 1_{α ≠ b}.
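The three topologies above differ only in how one alignment label advances the time index t (∆t) and the output-label index n (∆n). A minimal sketch making this concrete (the label strings and blank symbol are illustrative, not from the released setups):

```python
BLANK = "<b>"

def deltas(topology, label, prev_label):
    """Return (dt, dn): how one alignment label advances the time
    index t and the output-label index n under each topology."""
    if topology == "ctc":
        # Time-sync. with label repetition: time always advances; a new
        # output label is counted only when the alignment label is
        # non-blank and differs from the previous alignment label.
        return 1, int(label != BLANK and label != prev_label)
    if topology == "rna":
        # Time-sync., no label repetition: time always advances;
        # every non-blank alignment label is a new output label.
        return 1, int(label != BLANK)
    if topology == "rnnt":
        # Only blank advances time; every non-blank label emits.
        return int(label == BLANK), int(label != BLANK)
    raise ValueError(f"unknown topology: {topology}")

def collapse(topology, alignment):
    """Map an alignment alpha_1^U to its non-blank output sequence y_1^N."""
    out, prev = [], None
    for a in alignment:
        _, dn = deltas(topology, a, prev)
        if dn:
            out.append(a)
        prev = a
    return out
```

For example, under CTC the alignment `["d", "d", "<b>", "o", "g", "g"]` collapses to `["d", "o", "g"]`; note that for CTC and RNA the alignment length is U = T, while for RNN-T it is U = N + T.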

SLIDE 4

Generalize RNN-T and RNA: original RNN transducer (RNN-T) model [Graves, 2012]

(Architecture diagram: inputs x_1 ... x_T feed the Encoder; the previous label y_{n−1} feeds the Prediction Network; both feed the Joint Network, which produces Output_u.)

SLIDE 5

Generalize RNN-T and RNA: RNN-T unrolled over alignment frames u

(Architecture diagram: Encoder_t and the Prediction Network feed the Joint Network, which produces Output_u.)

SLIDE 6

Generalize RNN-T and RNA: monotonic RNN-T [Tripathi et al., 2019] (time-synchronous)

(Architecture diagram: Encoder_t and the Prediction Network feed the Joint Network, which produces Output_t.)

SLIDE 7

Generalize RNN-T and RNA: recurrent neural aligner (RNA) model [Sak et al., 2017]

(Architecture diagram: Encoder_t → RNN_t → Output_t.)

SLIDE 8

Generalize RNN-T and RNA: generalized (time-sync.) transducer model

(Architecture diagram: Encoder_t → SlowRNN_n → FastRNN_t → Output_t.)

SLIDE 9

Transducer model

(Diagram: generalized model unrolled over alignment frames u: Encoder_t, SlowRNN_n, FastRNN_u, Output_u.)

  • Unrolled generalized model over alignment frames u.
  • Input x_1^{T′} to the Encoder; output α_u per frame. Bold output is a non-blank label.
  • Dependencies are optional.
  • Encoder_1^T(x_1^{T′})
    – BLSTM
    – potentially downsampled
  • SlowRNN
    – LSTM or FFNN
    – per (non-blank) label n
    – like a language model (LM)
  • FastRNN
    – LSTM+FFNN or FFNN
    – per alignment frame u (per time t if time-sync.)
  • Output_u ≡ α_u ∈ Σ ∪ {b}
    – blank or new (non-blank) label

SLIDE 10

Transducer model

p(y_1^N | x_1^{T′}) := Σ_{α_1^U : (T, y_1^N)} Π_{u=1}^U p(α_u | α_1^{u−1}, h_1^T)

Explicit blank label:

p(α_u | ...) := softmax_{Σ′}( Readout(s_u^fast, s_{n_u}^slow) )

Separate emit sigmoid:

p(α_u = b | ...) := σ( Readout_b(s_u^fast, s_{n_u}^slow) ),
p(α_u ≠ b | ...) = σ( −Readout_b(s_u^fast, s_{n_u}^slow) ),
q(α_u | ...) := softmax_Σ( Readout_y(s_u^fast, s_{n_u}^slow) ),  α_u ∈ Σ,
p(α_u | ...) := p(α_u ≠ b | ...) · q(α_u | ...),  α_u ∈ Σ,

with

h_1^T := Encoder(x_1^{T′}),
s_u^fast := FastRNN(s_{u−1}^fast, s_{n_u}^slow, α_{u−1}, h_{t_u}),
s_{n_u}^slow := SlowRNN(s_{n_u−1}^slow, α_{u′−1}, h_{t_{u′}}),
u′ := min{k | k ≤ u, n_k = n_u}.
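The separate-emit-sigmoid factorization above splits the blank decision from the label distribution, with σ(r) + σ(−r) = 1 guaranteeing that the total probability mass still sums to one. A minimal numerical sketch (the readout values are just example numbers, not model outputs):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def output_distribution(readout_b, readout_y):
    """Separate emit sigmoid: p(alpha_u = b) = sigmoid(r_b); the
    remaining mass sigmoid(-r_b) = 1 - sigmoid(r_b) is distributed
    over the non-blank labels Sigma by the softmax q."""
    p_blank = sigmoid(readout_b)               # p(alpha_u = b | ...)
    q = softmax(readout_y)                     # q(alpha_u | ...), alpha_u in Sigma
    p_labels = [(1.0 - p_blank) * qi for qi in q]
    return p_blank, p_labels                   # total mass sums to 1
```

By construction `p_blank + sum(p_labels) == 1`, so this is a proper distribution over Σ′ = {b} ∪ Σ even though blank and labels come from two separate readouts.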

SLIDE 11

Training

Full-sum (FS) loss:

L_FS := −log p(y_1^N | x_1^T) = −log Σ_{α_1^U : (T, y_1^N)} p(α_1^U | x_1^T)

  • To be able to calculate this efficiently:
    – zeroth- or first-order dependency on α (but no restriction on y)

Frame-wise cross-entropy (CE) loss:

L_CE := −log p(α_1^U | x_1^T)

(−) Needs an alignment α_1^U. We use a fixed alignment.
(+) Much faster calculation (twice as fast training)
(+) Faster and more stable convergence
(+) Chunked training
    – Even faster because of less zero padding
    – Makes use of all training data (no filtering by seq. length)
    – Very effective regularization (16.3% → 14.7% WER)
(+) Other common methods: label smoothing, focal loss, ...
(+) No restriction on order ⇒ allows more powerful models, required for our extended generalized model
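Chunked training works because the frame-wise CE loss decomposes over frames once the alignment is fixed, so any fixed-size window of a sequence is a valid training example. A minimal sketch (chunk size and step are illustrative values, not the paper's configuration):

```python
def make_chunks(features, alignment, chunk_size=64, chunk_step=32):
    """Cut one sequence (encoder frames plus its fixed frame-wise
    alignment) into overlapping fixed-size chunks. Equal-sized chunks
    batch with almost no zero padding, and no sequence has to be
    filtered out for being too long."""
    assert len(features) == len(alignment)
    chunks = []
    start = 0
    while start < len(features):
        chunks.append((features[start:start + chunk_size],
                       alignment[start:start + chunk_size]))
        if start + chunk_size >= len(features):
            break  # last chunk already reaches the sequence end
        start += chunk_step
    return chunks
```

With `chunk_step < chunk_size` the chunks overlap, so every frame is seen at least once; the overlap also acts as the regularizer the slide refers to, since each frame is trained in several different contexts.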

SLIDE 12

Training pipeline

  1. Start from scratch with FS training. Train 25 epochs.
  2. Calculate alignment.
  3. Start from scratch with the extended model and CE training. Train 50 epochs.
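Step 2 ("calculate alignment") can be sketched as a Viterbi forced alignment under the CTC topology: given per-frame label posteriors from the FS-trained model and the target sequence, dynamic programming finds the single best alignment. This is a toy pure-Python stand-in (the actual pipeline computes alignments with the trained model inside RETURNN):

```python
NEG_INF = float("-inf")

def viterbi_ctc_align(log_probs, targets, blank=0):
    """Best CTC alignment: log_probs is a T x |Sigma'| table of
    per-frame log posteriors, targets the non-blank label sequence.
    Returns the argmax alignment as a length-T label sequence."""
    # Extended target: blank between and around the labels.
    z = [blank]
    for y in targets:
        z += [y, blank]
    S, T = len(z), len(log_probs)
    dp = [[NEG_INF] * S for _ in range(T)]
    bp = [[0] * S for _ in range(T)]
    dp[0][0] = log_probs[0][z[0]]
    if S > 1:
        dp[0][1] = log_probs[0][z[1]]
    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, previous state, or skip a
            # blank when the two neighbouring labels differ.
            cands = [(dp[t - 1][s], s)]
            if s >= 1:
                cands.append((dp[t - 1][s - 1], s - 1))
            if s >= 2 and z[s] != blank and z[s] != z[s - 2]:
                cands.append((dp[t - 1][s - 2], s - 2))
            best, arg = max(cands)
            dp[t][s] = best + log_probs[t][z[s]]
            bp[t][s] = arg
    # The path may end in the final blank or the final label.
    s = max(range(S - 2, S), key=lambda i: dp[T - 1][i]) if S > 1 else 0
    path = [s]
    for t in range(T - 1, 0, -1):
        s = bp[t][s]
        path.append(s)
    path.reverse()
    return [z[s] for s in path]
```

The resulting fixed alignment is then the training target for the frame-wise CE stage (step 3).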

SLIDE 13

Decoding

Decision rule:

x_1^{T′} → argmax_{N, y_1^N} p(y_1^N | x_1^{T′}) ≈ argmax_{U, α_1^U} Π_{u=1}^U p(α_u | α_1^{u−1}, x_1^{T′})

  • Beam search decoding, fixed beam size (12 hyps.)
  • Hypotheses: partially finished sequences α_1^u
  • Pruning based on scores p(α_1^u | x_1^T)
  • Time-synchronous (in case U = T), or synchronous over the axis {1, . . . , U}
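The synchronous beam search above can be sketched as follows (a toy single-sequence version; the `step_fn` interface is an assumption for illustration, the real decoder runs batched inside RETURNN):

```python
def beam_search(step_fn, num_frames, beam_size=12):
    """Synchronous beam search over the alignment axis:
    step_fn(t, prefix) returns a dict label -> log p(alpha_t | prefix, x).
    Keeps the `beam_size` best alignment prefixes per frame."""
    beam = [((), 0.0)]  # (alignment prefix alpha_1^u, log score)
    for t in range(num_frames):
        expanded = []
        for prefix, score in beam:
            log_probs = step_fn(t, prefix)  # distribution over Sigma'
            for label, lp in log_probs.items():
                expanded.append((prefix + (label,), score + lp))
        # Prune to the best hypotheses by score p(alpha_1^u | x).
        beam = sorted(expanded, key=lambda h: h[1], reverse=True)[:beam_size]
    return beam
```

With a time-synchronous topology (U = T) every hypothesis in the beam covers the same number of encoder frames at each step, which is what makes the score-based pruning directly comparable.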

SLIDE 14

Experiments: ablations and variations

  • Switchboard 300h, WER on Hub5'00
  • Transducer baselines B1 and B2
    – Without external LM
    – Full label (α & y) context, FastRNN/SlowRNN are LSTMs
    – Separate emit sigmoid (no explicit blank label)

  Variant                                                      WER[%]
                                                               B1     B2
  Baseline                                                     14.7   14.5
  No chunked training                                          16.3   15.7
  SlowRNN always updated (not slow)                            14.8   14.8
  No SlowRNN (exactly RNA)                                     14.8   14.7
  No encoder feedback to SlowRNN (like RNN-T)                  14.9   14.7
  + No FastRNN α label feedback (like RNN-T)                   14.9   14.5
  + No FastRNN, just Joint Network (exactly RNN-T)             15.2   15.1
  No separate emit sigmoid (explicit blank) (like RNN-T/RNA)   14.9   14.9

SLIDE 15

Experiments: varying sequence lengths

  • Switchboard 300h, WER on RT03S (less overfitting)
  • No external LM, beam size 12
  • Transducer with CTC label topology
  • Compared to a standard global soft attention-based encoder-decoder model
  • Concatenating every C consecutive seqs. (only in recognition)

  C     Seq. length [secs]           WER[%]
        mean±std       min-max       Attention  Transducer
  1       2.71± 2.38   0.17- 34.59   16.5       16.0
  2       7.83± 4.96   0.42- 63.70   16.8       16.1
  4      17.74± 9.25   0.42- 91.62   17.9       16.3
  10     45.57±19.93   0.74-126.52   29.3       16.7
  20     86.37±39.58   0.74-194.10   51.9       17.1
  30    122.10±57.28   1.17-297.15   65.1       18.1
  100   290.58±35.50   8.54-309.14   94.8       18.2

SLIDE 16

Experiments: overall performance & literature comparison

  Work                       Label        #Ep  LM   WER[%]
                             type / #               Hub5'00            Hub5'01  RT03
                                                    SWB   CH    Σ      Σ        Σ
  [Zeineldeen et al., 2020]  Phone 4.5k   13   yes  9.6   18.5  14.0   14.1
  [Zeyer et al., 2019]       BPE 1k       33   no   10.1  20.6  15.4   14.7
  [Nguyen et al., 2019]      4k           50        8.8   17.2  13.0
  [Karita et al., 2019]      2k           100  yes  9.0   18.1  13.6
  [Wang et al., 2020]        BPE Ph 500   150       7.9   16.1                  14.5
  [Tüske et al., 2020]       BPE 600      250  no   7.6   14.6
  [Park et al., 2019]        WPM 1k       760       7.2   14.6
  Ours: Attention            BPE 1k       25   no   9.2   21.1  15.2   14.2     17.6
                                          50        8.7   19.3  14.0   13.3     16.6
  Ours: Transducer           BPE 1k       25   no   9.4   18.7  14.1   14.1     16.7
                                          50        8.7   18.3  13.5   13.3     15.6

SLIDE 17

Conclusion

  • Generalization, extension, and comparison of the RNN-T and RNA models
    – Generalized model performs better (15.1% → 14.5% WER)
  • Generalization and comparison of the CTC, RNA and RNN-T label topologies
    – No label loop seems better (like RNA / RNN-T)
  • Frame-wise CE has many advantages:
    – Requirement for our generalized extended model
    – Faster training and faster & more stable convergence
    – Chunked training (15.7% → 14.5% WER) (even faster)
  • Comparison of the transducer to the attention-based encoder-decoder model:
    – Transducer is better than attention (14.0% → 13.5% WER) (in contrast to literature results)
    – Transducer generalizes to long sequences
  • Side observation from the literature review:
    – The total number of training epochs correlates strongly with WER.
    – The relevant measure is the amount of computation needed to reach a given WER.

SLIDE 18

Thank you for your attention

Any questions?

SLIDE 19

Appendix: Training time

  • Switchboard 300h
  • Loss: full-sum (FS) or frame-wise cross entropy (CE)

– CE training is without chunking

  • Loss implementation: pure TensorFlow (TF), or CUDA
  • Training time on single GTX 1080 Ti GPU

– Measures whole training, not just loss calculation

  Model      Label      Loss  Loss   # params  Time / epoch
             topology         impl.  [M]       [min]
  Transd.    RNA        FS    TF     147       306
             CTC        FS    TF               326
             RNN-T      FS    TF               333
             RNN-T      FS    CUDA             219
             CTC        CE    TF               160
  Attention  −                       162       138

SLIDE 20

Appendix: FS vs. CE

  • Switchboard 300h, transducer model without external LM
  • full-sum (FS) training vs. frame-wise cross entropy (CE)

– trained from scratch (random init.) in both cases – CE uses fixed external alignment – CE uses chunking

  Label     Training   WER[%]
  topology  criterion  Hub5'00            Hub5'01
                       SWB   CH    Σ      Σ
  RNA       FS         11.5  23.4  17.5   16.5
  RNA       CE         10.1  20.4  15.2   14.8
  CTC       FS         15.0  24.6  19.8   20.1
  CTC       CE         10.5  20.6  15.6   15.3
  RNN-T     FS         11.6  22.3  17.0   16.4

SLIDE 21

Appendix: Comparing alignments

  • On Switchboard 300h, WER on Hub 5’00
  • CE-trained transducer B1 and B2 models
  • Alignment model specifies the model used to calculate the alignment

  Alignment model                      WER[%]
                                       B1     B2
  CTC-align 4l                         14.7   14.3
  CTC-align 6l                         14.7   14.5
  CTC-align 6l with prior (non-peaky)  15.4   14.9
  CTC-align 6l, less training          14.6   14.6
  Att.-based enc.-dec. + CTC-align     14.4   14.2
  Transducer-align                     14.2   14.1

SLIDE 22

Appendix: Ablations and variations, extended

  • On Switchboard 300h, WER on Hub5’00, transducer baselines B1 and B2

  Variant                                    WER[%]
                                             B1     B2
  Baseline                                   14.7   14.5
  No chunked training                        16.3   15.7
  No switchout                               15.0   14.5
  SlowRNN always updated (not slow)          14.8   14.8
  No SlowRNN                                 14.8   14.7
  No attention                               14.5   *
  FastRNN dim 128 → 512                      14.3   14.5
  No encoder feedback to SlowRNN             14.9   14.7
  + No FastRNN label feedback (like RNN-T)   14.9   14.5
  + No FastRNN (exactly RNN-T)               15.2   15.1
  No separate blank sigmoid                  14.9   14.9

SLIDE 23

Appendix: Comparing imported model parameters

  • On Switchboard 300h, WER on Hub5’00.
  • Transducer baseline models B1 and B2 with CTC topology, without external LM.
  • Trained with CE using a fixed external alignment (CTC-align 6l).

  Imported model params  WER[%]
                         B1     B2
  None                   14.7   14.5
  CTC as encoder         15.4   15.5
  Attention encoder      14.2   13.9
  Transducer (itself)    13.7   13.6

SLIDE 24

Appendix: Comparing beam sizes, attention vs. transducer

  • On Switchboard 300h, WER on RT03S.
  • Comparing attention-based encoder-decoder model vs. transducer model.

– Both without external LM. – Transducer uses the CTC-label topology.

  • Optionally recombine (merge) hypotheses in the beam corresponding to same word sequence

(after collapsing repetitions, removing blank, and BPE merging).

  Model       Merge  WER[%] by beam size
                     1     2     4     8     12    24    32    64
  Attention   no     17.9  17.0  16.7  16.6  16.6  16.5  16.6  16.5
  Transducer  no     16.8  16.4  16.2  16.2  16.2  16.2  16.2  16.2
  Transducer  yes    16.8  16.3  16.0  15.9  15.9  15.9  15.9  16.0
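The merge step described above maps each alignment hypothesis to its word sequence before recombination. A minimal sketch (the "@@" continuation marker is an assumption about the BPE style; the blank symbol string is illustrative):

```python
def to_word_sequence(alignment, blank="<b>"):
    """Map an alignment hypothesis to its word sequence: collapse
    repetitions (CTC topology), drop blanks, then undo the BPE split.
    Beam hypotheses that map to the same word sequence can then be
    merged (recombined), keeping a single combined score."""
    # Collapse repetitions and remove blanks.
    labels, prev = [], None
    for a in alignment:
        if a != blank and a != prev:
            labels.append(a)  # new non-blank label
        prev = a
    # BPE merge: units ending in "@@" continue the current word.
    words, current = [], ""
    for unit in labels:
        if unit.endswith("@@"):
            current += unit[:-2]
        else:
            words.append(current + unit)
            current = ""
    return words
```

For example, the alignment `["he@@", "he@@", "<b>", "llo", "<b>", "wor@@", "ld"]` maps to `["hello", "world"]`; any two beam entries producing the same word list are candidates for recombination.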

SLIDE 25

References

  • Graves, A. (2012). Sequence transduction with recurrent neural networks. Preprint arXiv:1211.3711.
  • Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, pages 369–376. ACM.
  • Karita, S., Chen, N., Hayashi, T., Hori, T., Inaguma, H., Jiang, Z., Someki, M., Soplin, N. E. Y., Yamamoto, R., Wang, X., et al. (2019). A comparative study on Transformer vs RNN in speech applications. In ASRU.
  • Nguyen, T.-S., Stueker, S., Niehues, J., and Waibel, A. (2019). Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation. Preprint arXiv:1910.13296.

SLIDE 26

References (cont.)

  • Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., and Le, Q. V. (2019). SpecAugment: A simple data augmentation method for automatic speech recognition. In Proc. Interspeech 2019, pages 2613–2617.
  • Sak, H., Shannon, M., Rao, K., and Beaufays, F. (2017). Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping. In Proc. of Interspeech.
  • Tripathi, A., Lu, H., Sak, H., and Soltau, H. (2019). Monotonic recurrent neural network transducer and decoding strategies. In ASRU, pages 944–948.
  • Tüske, Z., Saon, G., Audhkhasi, K., and Kingsbury, B. (2020). Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard-300. Preprint arXiv:2001.07263.
  • Wang, W., Zhou, Y., Xiong, C., and Socher, R. (2020).

SLIDE 27

References (cont.)

  • (Wang et al., 2020, cont.) An investigation of phone-based subword units for end-to-end speech recognition. Preprint arXiv:2004.04290.
  • Zeineldeen, M., Zeyer, A., Schlüter, R., and Ney, H. (2020). Layer-normalized LSTM for hybrid-HMM and end-to-end ASR. In ICASSP, Barcelona, Spain.
  • Zeyer, A., Bahar, P., Irie, K., Schlüter, R., and Ney, H. (2019). A comparison of Transformer and LSTM encoder decoder models for ASR. In ASRU, pages 8–15, Sentosa, Singapore.
