

  1. A New Training Pipeline for an Improved Neural Transducer

  Albert Zeyer 1,2, André Merboldt 1, Ralf Schlüter 1,2, Hermann Ney 1,2
  1 Lehrstuhl Informatik 6: Human Language Technology and Pattern Recognition, Department of Mathematics, Computer Science and Natural Sciences, RWTH Aachen University
  2 AppTek

  2. Introduction

  Motivation:
  • A model which allows for time-synchronous decoding
  • Generalization and extension of, and comparison between:
    – RNN transducer (RNN-T) and Recurrent Neural Aligner (RNA) models
    – CTC, RNA, and RNN-T label topologies (explicit blank label, or separate emit sigmoid)

  Training criteria:
  • Full sum (FS) over all possible alignments (the standard RNN-T loss)
  • Frame-wise cross-entropy (CE), which allows more powerful models

  Setup:
  • End-to-end on subword BPE units
  • Experiments on Switchboard 300h
  • All code & setups published: github.com/rwth-i6/returnn-experiments

  3. Label topologies

  • We generalize over label topologies in order to compare them.
  • Input seq. $x_1^T$, output/target (non-blank) seq. $y_1^N$, alignment $\alpha_1^U$, with $\alpha_u \in \Sigma' := \{\langle b\rangle\} \cup \Sigma$ and $y_n \in \Sigma$. $T$ is the input seq. length, $N$ the output seq. length, $U$ the alignment seq. length.
  • The blank label $\langle b\rangle$ can be part of the labels, or modeled via a separate emit sigmoid.
  • Labels $\Sigma$ are 1k BPE subword units in this work.

  [Figure: example alignments of the same label sequence under the three topologies]

  • CTC [Graves et al., 2006]: time-sync., label repetition allowed; $U = T$, $\Delta t \equiv 1$, $t_u = u$, $\Delta n(\alpha_u) = 1_{\alpha_u \neq \langle b\rangle \,\wedge\, \alpha_u \neq \alpha_{u-1}}$.
  • RNN-T [Graves, 2012]: only blank moves forward in time (no label rep.); $U = N + T$, $\Delta t(\alpha) = 1_{\alpha = \langle b\rangle}$, $\Delta n(\alpha) = 1_{\alpha \neq \langle b\rangle}$.
  • RNA [Sak et al., 2017] or monotonic RNN-T [Tripathi et al., 2019]: time-sync., no label rep.; $U = T$, $\Delta t \equiv 1$, $t_u = u$, $\Delta n(\alpha) = 1_{\alpha \neq \langle b\rangle}$.
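  To make the topologies concrete, here is a minimal Python sketch (illustrative only, not taken from the published setups) of how an alignment collapses back to the (non-blank) label sequence under each topology:

      # Hypothetical helpers for illustration; "<b>" stands for the blank label.
      BLANK = "<b>"

      def collapse_ctc(alignment):
          """CTC: collapse label repetitions, then remove blanks (U = T)."""
          out, prev = [], None
          for a in alignment:
              if a != BLANK and a != prev:
                  out.append(a)
              prev = a
          return out

      def collapse_blank_only(alignment):
          """RNN-T / RNA / monotonic RNN-T: every non-blank frame emits a label."""
          return [a for a in alignment if a != BLANK]

      assert collapse_ctc(["h", "h", BLANK, "e", "e", BLANK, "y"]) == ["h", "e", "y"]
      assert collapse_blank_only(["h", BLANK, "e", BLANK, BLANK, "y"]) == ["h", "e", "y"]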

  4. Generalize RNN-T and RNA

  [Figure: original RNN transducer (RNN-T) model [Graves, 2012] – the Encoder (over $x_1 \dots x_T$) and the Prediction Network (over $y_{n-1}$) feed a Joint Network, which produces Output$_u$.]

  5. Generalize RNN-T and RNA

  [Figure: RNN-T unrolled over alignment frames $u$ – Encoder$_t$ and the Prediction Network feed the Joint Network, producing Output$_u$.]

  6. Generalize RNN-T and RNA

  [Figure: monotonic RNN-T [Tripathi et al., 2019] (time-synchronous) – Encoder$_t$ and the Prediction Network feed the Joint Network, producing Output$_t$.]

  7. Generalize RNN-T and RNA

  [Figure: recurrent neural aligner (RNA) model [Sak et al., 2017] – Encoder$_t$ feeds an RNN$_t$, which produces Output$_t$.]

  8. Generalize RNN-T and RNA

  [Figure: our generalized (time-sync.) transducer model – Encoder$_t$ feeds FastRNN$_t$ (updated per frame $t$) and SlowRNN$_n$ (updated per label $n$); FastRNN produces Output$_t$.]

  9. Transducer: Model

  [Figure: the generalized model unrolled over alignment frames $u$; Output$_u$ per frame is either blank or a new (non-blank) label (bold in the figure).]

  • Encoder: $h_1^T = \text{Encoder}(x_1^{T'})$ – BLSTM, potentially downsampled.
  • SlowRNN: LSTM or FFNN, updated per (non-blank) label $n$ – acts like a language model (LM).
  • FastRNN: LSTM+FFNN or FFNN, updated per alignment frame $u$ (per time frame $t$ if time-sync.).
  • Output$_u \equiv \alpha_u \in \Sigma \cup \{\langle b\rangle\}$.
  • Dependencies are optional.

  10. Transducer: Model

  $p(y_1^N \mid x_1^{T'}) := \sum_{\alpha_1^U : (T, y_1^N)} \prod_{u=1}^{U} p(\alpha_u \mid \alpha_1^{u-1}, h_1^T)$,

  $h_1^T := \text{Encoder}(x_1^{T'})$,
  $s_u^{\text{fast}} := \text{FastRNN}(s_{u-1}^{\text{fast}}, s_{n_u}^{\text{slow}}, \alpha_{u-1}, h_{t_u})$,
  $s_{n_u}^{\text{slow}} := \text{SlowRNN}(s_{n_u - 1}^{\text{slow}}, \alpha_{u'-1}, h_{t_{u'}})$, with $u' := \min\{k \mid k \leq u,\, n_k = n_u\}$.

  Explicit blank label:
  $p(\alpha_u \mid \ldots) := \text{softmax}_{\Sigma'}(\text{Readout}(s_u^{\text{fast}}, s_{n_u}^{\text{slow}}))$

  or separate emit sigmoid:
  $p(\alpha_u = \langle b\rangle \mid \ldots) := \sigma(\text{Readout}_b(s_u^{\text{fast}}, s_{n_u}^{\text{slow}}))$,
  $p(\alpha_u \neq \langle b\rangle \mid \ldots) = \sigma(-\text{Readout}_b(s_u^{\text{fast}}, s_{n_u}^{\text{slow}}))$,
  $q(\alpha_u \mid \ldots) := \text{softmax}_{\Sigma}(\text{Readout}_y(s_u^{\text{fast}}, s_{n_u}^{\text{slow}}))$ for $\alpha_u \in \Sigma$,
  $p(\alpha_u \mid \ldots) := p(\alpha_u \neq \langle b\rangle \mid \ldots) \cdot q(\alpha_u \mid \ldots)$ for $\alpha_u \in \Sigma$.
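  The separate-emit-sigmoid factorization maps directly to code. Below is an illustrative PyTorch sketch of a single step; all names, dimensions, and the module structure are assumptions for illustration, not the published RETURNN setup:

      import torch
      import torch.nn as nn

      class FastStep(nn.Module):
          def __init__(self, enc_dim=512, slow_dim=512, fast_dim=512, num_labels=1000):
              super().__init__()
              # FastRNN: updated on every alignment frame u.
              self.fast = nn.LSTMCell(enc_dim + slow_dim + num_labels, fast_dim)
              self.readout_b = nn.Linear(fast_dim + slow_dim, 1)           # blank logit
              self.readout_y = nn.Linear(fast_dim + slow_dim, num_labels)  # label logits

          def forward(self, h_t, slow_out, prev_label_onehot, fast_state):
              # s_u^fast := FastRNN(s_{u-1}^fast, s_{n_u}^slow, alpha_{u-1}, h_{t_u})
              fast_in = torch.cat([h_t, slow_out, prev_label_onehot], dim=-1)
              fast_h, fast_c = self.fast(fast_in, fast_state)
              joint = torch.cat([fast_h, slow_out], dim=-1)
              # p(alpha_u = <b>) = sigmoid(Readout_b); p(alpha_u) = (1 - p(<b>)) * q(alpha_u).
              p_blank = torch.sigmoid(self.readout_b(joint))
              q_label = torch.softmax(self.readout_y(joint), dim=-1)
              p_label = (1.0 - p_blank) * q_label
              return p_blank, p_label, (fast_h, fast_c)

      # The SlowRNN state (slow_out) would be advanced only when a non-blank
      # label is emitted; that control flow is omitted here.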

  11. Training

  Full-sum (FS) loss:
  $L_{\text{FS}} := -\log p(y_1^N \mid x_1^T) = -\log \sum_{\alpha_1^U : (T, y_1^N)} p(\alpha_1^U \mid x_1^T)$

  • To be able to calculate it efficiently: zeroth- or first-order dependency on $\alpha$ (but no restriction on $y$).

  Frame-wise cross-entropy (CE) loss:
  $L_{\text{CE}} := -\log p(\alpha_1^U \mid x_1^T)$

  (−) Needs an alignment $\alpha_1^U$; we use a fixed alignment.
  (+) Much faster calculation (twice as fast training)
  (+) Faster and more stable convergence
  (+) Chunked training:
    – even faster because of less zero padding
    – makes use of all training data (no filtering by seq. length)
    – very effective regularization (16.3% → 14.7% WER)
  (+) Other common methods become applicable: label smoothing, focal loss, ...
  (+) No restriction on the order of dependency on $\alpha$ ⇒ allows more powerful models, as required for our extended generalized model
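  As a minimal sketch (assumed tensor shapes, not the actual training code), the frame-wise CE loss against a fixed alignment is just a per-frame negative log-likelihood:

      import torch
      import torch.nn.functional as F

      def frame_wise_ce(log_probs: torch.Tensor, alignment: torch.Tensor) -> torch.Tensor:
          # log_probs: (U, |Sigma'|) per-frame log-probabilities from the model;
          # alignment: (U,) fixed alignment label indices (blank included).
          # L_CE = -log p(alpha_1^U | x_1^T) = -sum_u log p(alpha_u | ...)
          return F.nll_loss(log_probs, alignment, reduction="sum")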

  12. Training pipeline

  1. Start from scratch with FS training. Train 25 epochs.
  2. Calculate an alignment with the FS-trained model.
  3. Start from scratch with the extended model and CE training on that alignment. Train 50 epochs.
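  Schematically, the pipeline looks as follows (all function names here are hypothetical, for illustration only):

      # Stage 1: full-sum training of the simpler (first-order) model.
      model_fs = train(make_model(extended=False), loss="full_sum", epochs=25)
      # Stage 2: extract a fixed alignment from the FS-trained model.
      alignment = compute_alignment(model_fs, train_data)
      # Stage 3: frame-wise CE training of the extended model, from scratch.
      model_ce = train(make_model(extended=True), loss="frame_ce",
                       targets=alignment, epochs=50)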

  13. Decoding

  Decision rule:
  $x_1^{T'} \mapsto \arg\max_{N, y_1^N} p(y_1^N \mid x_1^{T'}) \approx \arg\max_{U, \alpha_1^U} \prod_{u=1}^{U} p(\alpha_u \mid \alpha_1^{u-1}, x_1^{T'})$

  • Beam search decoding, fixed beam size (12 hyps.)
  • Hypotheses: partially finished sequences $\alpha_1^u$
  • Pruning based on the scores $p(\alpha_1^u \mid x_1^T)$
  • Time-synchronous (in case $U = T$), or synchronous over the axis $\{1, \ldots, U\}$
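  A simplified Python sketch of this alignment-synchronous beam search; `step(hyp, u)` is an assumed interface returning (label, log-probability) pairs for extending hypothesis `hyp` at alignment frame `u` (blank included):

      def beam_search(step, num_frames, beam_size=12):
          beams = [((), 0.0)]  # (partial alignment alpha_1^u, log score)
          for u in range(num_frames):  # time-synchronous when U = T
              expanded = []
              for hyp, score in beams:
                  for label, logp in step(hyp, u):
                      expanded.append((hyp + (label,), score + logp))
              # Prune to the fixed beam size, based on the scores p(alpha_1^u | x_1^T).
              beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_size]
          return max(beams, key=lambda b: b[1])  # best full alignment and its score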

  14. Experiments: Ablations and variations

  • Switchboard 300h, WER on Hub5'00
  • Transducer baselines B1 and B2:
    – without external LM
    – full label ($\alpha$ & $y$) context, FastRNN/SlowRNN are LSTMs
    – separate emit sigmoid (no explicit blank label)

  Variant                                                      WER[%] B1   WER[%] B2
  Baseline                                                       14.7        14.5
  No chunked training                                            16.3        15.7
  SlowRNN always updated (not slow)                              14.8        14.8
  No SlowRNN (exactly RNA)                                       14.8        14.7
  No encoder feedback to SlowRNN (like RNN-T)                    14.9        14.7
   + no FastRNN $\alpha$ label feedback (like RNN-T)             14.9        14.5
   + no FastRNN, just Joint Network (exactly RNN-T)              15.2        15.1
  No separate emit sigmoid (explicit blank) (like RNN-T/RNA)     14.9        14.9
