A New Training Pipeline for an Improved Neural Transducer

SLIDE 1

A New Training Pipeline for an Improved Neural Transducer

Albert Zeyer1,2, André Merboldt1, Ralf Schlüter1,2, Hermann Ney1,2

1Lehrstuhl Informatik 6: Human Language Technology and Pattern Recognition

Department of Mathematics, Computer Science and Natural Sciences RWTH Aachen University

2AppTek

SLIDE 2

Introduction & motivation:

  • Model which allows for time-synchronous decoding

Generalization, extension, and comparison of:

  • RNN transducer (RNN-T) and Recurrent Neural Aligner (RNA) models
  • CTC, RNA and RNN-T label topologies
    – Explicit blank label, or separate emit sigmoid

Training criteria:

  • Full sum (FS) over all possible alignments (standard RNN-T loss)
  • Frame-wise cross-entropy (CE)
    – Allows more powerful models

Setup:

  • End-to-end on subword BPE units
  • Experiments on Switchboard 300h
  • All code & setups published: github.com/rwth-i6/returnn-experiments

2 of 14 Zeyer & Merboldt & Schlüter & Ney: A new training pipeline for an improved neural transducer, Lehrstuhl Informatik 6: Human Language Technology and Pattern Recognition, RWTH Aachen University

SLIDE 3

Label topologies

  • Generalize over label topologies to compare them.
  • Input seq. x_1^T, output/target (non-blank) seq. y_1^N, alignment α_1^U.
  • α_u ∈ Σ′ := {b} ∪ Σ, y_n ∈ Σ. T input seq. len, N output seq. len, U alignment seq. len.
  • Blank label b can be part of the labels, or a separate emit sigmoid.
  • Labels Σ are 1k BPE subword units in this work.

(Lattice diagrams on the original slide show the allowed alignment paths for the target "dog" over the time axis t and label axis n, one per topology.)

CTC [Graves et al., 2006]:

  • Time-sync., label repetition,
  • U = T, ∆t ≡ 1, t_u = u, ∆n(α_u) = 1_{α_u ≠ b ∧ α_u ≠ α_{u−1}}.

RNA [Sak et al., 2017] or monotonic RNN-T [Tripathi et al., 2019]:

  • Time-sync., no label rep.,
  • U = T, ∆t ≡ 1, t_u = u, ∆n(α) = 1_{α ≠ b}.

RNN-T [Graves, 2012]:

  • Only blank forwards in time (no label rep.),
  • U = N + T, ∆t(α) = 1_{α = b}, ∆n(α) = 1_{α ≠ b}.
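The three topologies above differ only in how one alignment label advances the time index t (∆t) and the output-label index n (∆n). A minimal sketch making this concrete (the label strings and blank symbol are illustrative, not from the released setups):

```python
BLANK = "<b>"

def deltas(topology, label, prev_label):
    """Return (dt, dn): how one alignment label advances the time
    index t and the output-label index n under each topology."""
    if topology == "ctc":
        # Time-sync. with label repetition: time always advances; a new
        # output label is counted only when the alignment label is
        # non-blank and differs from the previous alignment label.
        return 1, int(label != BLANK and label != prev_label)
    if topology == "rna":
        # Time-sync., no label repetition: time always advances;
        # every non-blank alignment label is a new output label.
        return 1, int(label != BLANK)
    if topology == "rnnt":
        # Only blank advances time; every non-blank label emits.
        return int(label == BLANK), int(label != BLANK)
    raise ValueError(f"unknown topology: {topology}")

def collapse(topology, alignment):
    """Map an alignment alpha_1^U to its non-blank output sequence y_1^N."""
    out, prev = [], None
    for a in alignment:
        _, dn = deltas(topology, a, prev)
        if dn:
            out.append(a)
        prev = a
    return out
```

For example, under CTC the alignment `["d", "d", "<b>", "o", "g", "g"]` collapses to `["d", "o", "g"]`; note that for CTC and RNA the alignment length is U = T, while for RNN-T it is U = N + T.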

SLIDE 4

Generalize RNN-T and RNA: original RNN transducer (RNN-T) model [Graves, 2012]

(Architecture diagram: inputs x_1 ... x_T feed the Encoder; the previous label y_{n−1} feeds the Prediction Network; both feed the Joint Network, which produces Output_u.)

SLIDE 5

Generalize RNN-T and RNA: RNN-T unrolled over alignment frames u

(Architecture diagram: Encoder_t and the Prediction Network feed the Joint Network, which produces Output_u.)

SLIDE 6

Generalize RNN-T and RNA: monotonic RNN-T [Tripathi et al., 2019] (time-synchronous)

(Architecture diagram: Encoder_t and the Prediction Network feed the Joint Network, which produces Output_t.)

SLIDE 7

Generalize RNN-T and RNA: recurrent neural aligner (RNA) model [Sak et al., 2017]

(Architecture diagram: Encoder_t → RNN_t → Output_t.)

SLIDE 8

Generalize RNN-T and RNA: generalized (time-sync.) transducer model

(Architecture diagram: Encoder_t → SlowRNN_n → FastRNN_t → Output_t.)

SLIDE 9

Transducer model

(Diagram: generalized model unrolled over alignment frames u: Encoder_t, SlowRNN_n, FastRNN_u, Output_u.)

  • Unrolled generalized model over alignment frames u.
  • Input x_1^{T′} to the Encoder; output α_u per frame. Bold output is a non-blank label.
  • Dependencies are optional.
  • Encoder_1^T(x_1^{T′})
    – BLSTM
    – potentially downsampled
  • SlowRNN
    – LSTM or FFNN
    – per (non-blank) label n
    – like a language model (LM)
  • FastRNN
    – LSTM+FFNN or FFNN
    – per alignment frame u (per time t if time-sync.)
  • Output_u ≡ α_u ∈ Σ ∪ {b}
    – blank or new (non-blank) label

SLIDE 10

Transducer model

p(y_1^N | x_1^{T′}) := Σ_{α_1^U : (T, y_1^N)} Π_{u=1}^U p(α_u | α_1^{u−1}, h_1^T)

Explicit blank label:

p(α_u | ...) := softmax_{Σ′}( Readout(s_u^fast, s_{n_u}^slow) )

Separate emit sigmoid:

p(α_u = b | ...) := σ( Readout_b(s_u^fast, s_{n_u}^slow) ),
p(α_u ≠ b | ...) = σ( −Readout_b(s_u^fast, s_{n_u}^slow) ),
q(α_u | ...) := softmax_Σ( Readout_y(s_u^fast, s_{n_u}^slow) ),  α_u ∈ Σ,
p(α_u | ...) := p(α_u ≠ b | ...) · q(α_u | ...),  α_u ∈ Σ,

with

h_1^T := Encoder(x_1^{T′}),
s_u^fast := FastRNN(s_{u−1}^fast, s_{n_u}^slow, α_{u−1}, h_{t_u}),
s_{n_u}^slow := SlowRNN(s_{n_u−1}^slow, α_{u′−1}, h_{t_{u′}}),
u′ := min{k | k ≤ u, n_k = n_u}.
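The separate-emit-sigmoid factorization above splits the blank decision from the label distribution, with σ(r) + σ(−r) = 1 guaranteeing that the total probability mass still sums to one. A minimal numerical sketch (the readout values are just example numbers, not model outputs):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def output_distribution(readout_b, readout_y):
    """Separate emit sigmoid: p(alpha_u = b) = sigmoid(r_b); the
    remaining mass sigmoid(-r_b) = 1 - sigmoid(r_b) is distributed
    over the non-blank labels Sigma by the softmax q."""
    p_blank = sigmoid(readout_b)               # p(alpha_u = b | ...)
    q = softmax(readout_y)                     # q(alpha_u | ...), alpha_u in Sigma
    p_labels = [(1.0 - p_blank) * qi for qi in q]
    return p_blank, p_labels                   # total mass sums to 1
```

By construction `p_blank + sum(p_labels) == 1`, so this is a proper distribution over Σ′ = {b} ∪ Σ even though blank and labels come from two separate readouts.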

SLIDE 11

Training

Full-sum (FS) loss:

L_FS := −log p(y_1^N | x_1^T) = −log Σ_{α_1^U : (T, y_1^N)} p(α_1^U | x_1^T)

  • To be able to calculate this efficiently:
    – zeroth- or first-order dependency on α (but no restriction on y)

Frame-wise cross-entropy (CE) loss:

L_CE := −log p(α_1^U | x_1^T)

(−) Needs an alignment α_1^U. We use a fixed alignment.
(+) Much faster calculation (twice as fast training)
(+) Faster and more stable convergence
(+) Chunked training
    – Even faster because of less zero padding
    – Makes use of all training data (no filtering by seq. length)
    – Very effective regularization (16.3% → 14.7% WER)
(+) Other common methods: label smoothing, focal loss, ...
(+) No restriction on order ⇒ allows more powerful models, required for our extended generalized model
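Chunked training works because the frame-wise CE loss decomposes over frames once the alignment is fixed, so any fixed-size window of a sequence is a valid training example. A minimal sketch (chunk size and step are illustrative values, not the paper's configuration):

```python
def make_chunks(features, alignment, chunk_size=64, chunk_step=32):
    """Cut one sequence (encoder frames plus its fixed frame-wise
    alignment) into overlapping fixed-size chunks. Equal-sized chunks
    batch with almost no zero padding, and no sequence has to be
    filtered out for being too long."""
    assert len(features) == len(alignment)
    chunks = []
    start = 0
    while start < len(features):
        chunks.append((features[start:start + chunk_size],
                       alignment[start:start + chunk_size]))
        if start + chunk_size >= len(features):
            break  # last chunk already reaches the sequence end
        start += chunk_step
    return chunks
```

With `chunk_step < chunk_size` the chunks overlap, so every frame is seen at least once; the overlap also acts as the regularizer the slide refers to, since each frame is trained in several different contexts.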

SLIDE 12

Training pipeline

  1. Start from scratch with FS training. Train 25 epochs.
  2. Calculate alignment.
  3. Start from scratch with the extended model and CE training. Train 50 epochs.
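Step 2 ("calculate alignment") can be sketched as a Viterbi forced alignment under the CTC topology: given per-frame label posteriors from the FS-trained model and the target sequence, dynamic programming finds the single best alignment. This is a toy pure-Python stand-in (the actual pipeline computes alignments with the trained model inside RETURNN):

```python
NEG_INF = float("-inf")

def viterbi_ctc_align(log_probs, targets, blank=0):
    """Best CTC alignment: log_probs is a T x |Sigma'| table of
    per-frame log posteriors, targets the non-blank label sequence.
    Returns the argmax alignment as a length-T label sequence."""
    # Extended target: blank between and around the labels.
    z = [blank]
    for y in targets:
        z += [y, blank]
    S, T = len(z), len(log_probs)
    dp = [[NEG_INF] * S for _ in range(T)]
    bp = [[0] * S for _ in range(T)]
    dp[0][0] = log_probs[0][z[0]]
    if S > 1:
        dp[0][1] = log_probs[0][z[1]]
    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, previous state, or skip a
            # blank when the two neighbouring labels differ.
            cands = [(dp[t - 1][s], s)]
            if s >= 1:
                cands.append((dp[t - 1][s - 1], s - 1))
            if s >= 2 and z[s] != blank and z[s] != z[s - 2]:
                cands.append((dp[t - 1][s - 2], s - 2))
            best, arg = max(cands)
            dp[t][s] = best + log_probs[t][z[s]]
            bp[t][s] = arg
    # The path may end in the final blank or the final label.
    s = max(range(S - 2, S), key=lambda i: dp[T - 1][i]) if S > 1 else 0
    path = [s]
    for t in range(T - 1, 0, -1):
        s = bp[t][s]
        path.append(s)
    path.reverse()
    return [z[s] for s in path]
```

The resulting fixed alignment is then the training target for the frame-wise CE stage (step 3).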

SLIDE 13

Decoding

Decision rule:

x_1^{T′} → argmax_{N, y_1^N} p(y_1^N | x_1^{T′}) ≈ argmax_{U, α_1^U} Π_{u=1}^U p(α_u | α_1^{u−1}, x_1^{T′})

  • Beam search decoding, fixed beam size (12 hyps.)
  • Hypotheses: partially finished sequences α_1^u
  • Pruning based on scores p(α_1^u | x_1^T)
  • Time-synchronous (in case U = T), or synchronous over the axis {1, . . . , U}
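The synchronous beam search above can be sketched as follows (a toy single-sequence version; the `step_fn` interface is an assumption for illustration, the real decoder runs batched inside RETURNN):

```python
def beam_search(step_fn, num_frames, beam_size=12):
    """Synchronous beam search over the alignment axis:
    step_fn(t, prefix) returns a dict label -> log p(alpha_t | prefix, x).
    Keeps the `beam_size` best alignment prefixes per frame."""
    beam = [((), 0.0)]  # (alignment prefix alpha_1^u, log score)
    for t in range(num_frames):
        expanded = []
        for prefix, score in beam:
            log_probs = step_fn(t, prefix)  # distribution over Sigma'
            for label, lp in log_probs.items():
                expanded.append((prefix + (label,), score + lp))
        # Prune to the best hypotheses by score p(alpha_1^u | x).
        beam = sorted(expanded, key=lambda h: h[1], reverse=True)[:beam_size]
    return beam
```

With a time-synchronous topology (U = T) every hypothesis in the beam covers the same number of encoder frames at each step, which is what makes the score-based pruning directly comparable.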

SLIDE 14

Experiments: ablations and variations

  • Switchboard 300h, WER on Hub5'00
  • Transducer baselines B1 and B2
    – Without external LM
    – Full label (α & y) context, FastRNN/SlowRNN are LSTMs
    – Separate emit sigmoid (no explicit blank label)

  Variant                                                      WER[%]
                                                               B1     B2
  Baseline                                                     14.7   14.5
  No chunked training                                          16.3   15.7
  SlowRNN always updated (not slow)                            14.8   14.8
  No SlowRNN (exactly RNA)                                     14.8   14.7
  No encoder feedback to SlowRNN (like RNN-T)                  14.9   14.7
  + No FastRNN α label feedback (like RNN-T)                   14.9   14.5
  + No FastRNN, just Joint Network (exactly RNN-T)             15.2   15.1
  No separate emit sigmoid (explicit blank) (like RNN-T/RNA)   14.9   14.9

SLIDE 15

Experiments: varying sequence lengths

  • Switchboard 300h, WER on RT03S (less overfitting)
  • No external LM, beam size 12
  • Transducer with CTC label topology
  • Compared to a standard global soft attention-based encoder-decoder model
  • Concatenating every C consecutive seqs. (only in recognition)

  C     Seq. length [secs]           WER[%]
        mean±std       min-max       Attention  Transducer
  1       2.71± 2.38   0.17- 34.59   16.5       16.0
  2       7.83± 4.96   0.42- 63.70   16.8       16.1
  4      17.74± 9.25   0.42- 91.62   17.9       16.3
  10     45.57±19.93   0.74-126.52   29.3       16.7
  20     86.37±39.58   0.74-194.10   51.9       17.1
  30    122.10±57.28   1.17-297.15   65.1       18.1
  100   290.58±35.50   8.54-309.14   94.8       18.2

SLIDE 16

Experiments: overall performance & literature comparison

  Work                       Label        #Ep  LM   WER[%]
                             type / #               Hub5'00            Hub5'01  RT03
                                                    SWB   CH    Σ      Σ        Σ
  [Zeineldeen et al., 2020]  Phone 4.5k   13   yes  9.6   18.5  14.0   14.1
  [Zeyer et al., 2019]       BPE 1k       33   no   10.1  20.6  15.4   14.7
  [Nguyen et al., 2019]      4k           50        8.8   17.2  13.0
  [Karita et al., 2019]      2k           100  yes  9.0   18.1  13.6
  [Wang et al., 2020]        BPE Ph 500   150       7.9   16.1                  14.5
  [Tüske et al., 2020]       BPE 600      250  no   7.6   14.6
  [Park et al., 2019]        WPM 1k       760       7.2   14.6
  Ours: Attention            BPE 1k       25   no   9.2   21.1  15.2   14.2     17.6
                                          50        8.7   19.3  14.0   13.3     16.6
  Ours: Transducer           BPE 1k       25   no   9.4   18.7  14.1   14.1     16.7
                                          50        8.7   18.3  13.5   13.3     15.6

SLIDE 17

Conclusion

  • Generalization, extension, and comparison of the RNN-T and RNA models
    – Generalized model performs better (15.1% → 14.5% WER)
  • Generalization and comparison of the CTC, RNA and RNN-T label topologies
    – No label loop seems better (like RNA / RNN-T)
  • Frame-wise CE has many advantages:
    – Requirement for our generalized extended model
    – Faster training and faster & more stable convergence
    – Chunked training (15.7% → 14.5% WER) (even faster)
  • Comparison of the transducer to the attention-based encoder-decoder model:
    – Transducer is better than attention (14.0% → 13.5% WER) (in contrast to literature results)
    – Transducer generalizes to long sequences
  • Side observation from the literature review:
    – The total number of training epochs correlates strongly with WER.
    – The relevant measure is the amount of computation needed to reach a given WER.

SLIDE 18

Thank you for your attention

Any questions?

SLIDE 19

Appendix: Training time

  • Switchboard 300h
  • Loss: full-sum (FS) or frame-wise cross entropy (CE)

– CE training is without chunking

  • Loss implementation: pure TensorFlow (TF), or CUDA
  • Training time on single GTX 1080 Ti GPU

– Measures whole training, not just loss calculation

  Model      Label      Loss  Loss   # params  Time / epoch
             topology         impl.  [M]       [min]
  Transd.    RNA        FS    TF     147       306
             CTC        FS    TF               326
             RNN-T      FS    TF               333
             RNN-T      FS    CUDA             219
             CTC        CE    TF               160
  Attention  −                       162       138

SLIDE 20

Appendix: FS vs. CE

  • Switchboard 300h, transducer model without external LM
  • full-sum (FS) training vs. frame-wise cross entropy (CE)

– trained from scratch (random init.) in both cases – CE uses fixed external alignment – CE uses chunking

  Label     Training   WER[%]
  topology  criterion  Hub5'00            Hub5'01
                       SWB   CH    Σ      Σ
  RNA       FS         11.5  23.4  17.5   16.5
  RNA       CE         10.1  20.4  15.2   14.8
  CTC       FS         15.0  24.6  19.8   20.1
  CTC       CE         10.5  20.6  15.6   15.3
  RNN-T     FS         11.6  22.3  17.0   16.4

SLIDE 21

Appendix: Comparing alignments

  • On Switchboard 300h, WER on Hub 5’00
  • CE-trained transducer B1 and B2 models
  • Alignment model specifies the model used to calculate the alignment

  Alignment model                      WER[%]
                                       B1     B2
  CTC-align 4l                         14.7   14.3
  CTC-align 6l                         14.7   14.5
  CTC-align 6l with prior (non-peaky)  15.4   14.9
  CTC-align 6l, less training          14.6   14.6
  Att.-based enc.-dec. + CTC-align     14.4   14.2
  Transducer-align                     14.2   14.1

SLIDE 22

Appendix: Ablations and variations, extended

  • On Switchboard 300h, WER on Hub5’00, transducer baselines B1 and B2

  Variant                                    WER[%]
                                             B1     B2
  Baseline                                   14.7   14.5
  No chunked training                        16.3   15.7
  No switchout                               15.0   14.5
  SlowRNN always updated (not slow)          14.8   14.8
  No SlowRNN                                 14.8   14.7
  No attention                               14.5   *
  FastRNN dim 128 → 512                      14.3   14.5
  No encoder feedback to SlowRNN             14.9   14.7
  + No FastRNN label feedback (like RNN-T)   14.9   14.5
  + No FastRNN (exactly RNN-T)               15.2   15.1
  No separate blank sigmoid                  14.9   14.9

SLIDE 23

Appendix: Comparing imported model parameters

  • On Switchboard 300h, WER on Hub5’00.
  • Transducer baseline models B1 and B2 with CTC topology, without external LM.
  • Trained with CE using a fixed external alignment (CTC-align 6l).

  Imported model params  WER[%]
                         B1     B2
  None                   14.7   14.5
  CTC as encoder         15.4   15.5
  Attention encoder      14.2   13.9
  Transducer (itself)    13.7   13.6

SLIDE 24

Appendix: Comparing beam sizes, attention vs. transducer

  • On Switchboard 300h, WER on RT03S.
  • Comparing attention-based encoder-decoder model vs. transducer model.

– Both without external LM. – Transducer uses the CTC-label topology.

  • Optionally recombine (merge) hypotheses in the beam corresponding to same word sequence

(after collapsing repetitions, removing blank, and BPE merging).

  Model       Merge  WER[%] by beam size
                     1     2     4     8     12    24    32    64
  Attention   no     17.9  17.0  16.7  16.6  16.6  16.5  16.6  16.5
  Transducer  no     16.8  16.4  16.2  16.2  16.2  16.2  16.2  16.2
  Transducer  yes    16.8  16.3  16.0  15.9  15.9  15.9  15.9  16.0
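The merge step described above maps each alignment hypothesis to its word sequence before recombination. A minimal sketch (the "@@" continuation marker is an assumption about the BPE style; the blank symbol string is illustrative):

```python
def to_word_sequence(alignment, blank="<b>"):
    """Map an alignment hypothesis to its word sequence: collapse
    repetitions (CTC topology), drop blanks, then undo the BPE split.
    Beam hypotheses that map to the same word sequence can then be
    merged (recombined), keeping a single combined score."""
    # Collapse repetitions and remove blanks.
    labels, prev = [], None
    for a in alignment:
        if a != blank and a != prev:
            labels.append(a)  # new non-blank label
        prev = a
    # BPE merge: units ending in "@@" continue the current word.
    words, current = [], ""
    for unit in labels:
        if unit.endswith("@@"):
            current += unit[:-2]
        else:
            words.append(current + unit)
            current = ""
    return words
```

For example, the alignment `["he@@", "he@@", "<b>", "llo", "<b>", "wor@@", "ld"]` maps to `["hello", "world"]`; any two beam entries producing the same word list are candidates for recombination.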

SLIDE 25

References

  • Graves, A. (2012). Sequence transduction with recurrent neural networks. Preprint arXiv:1211.3711.
  • Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, pages 369–376. ACM.
  • Karita, S., Chen, N., Hayashi, T., Hori, T., Inaguma, H., Jiang, Z., Someki, M., Soplin, N. E. Y., Yamamoto, R., Wang, X., et al. (2019). A comparative study on Transformer vs RNN in speech applications. In ASRU.
  • Nguyen, T.-S., Stueker, S., Niehues, J., and Waibel, A. (2019). Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation. Preprint arXiv:1910.13296.

SLIDE 26

References (cont.)

  • Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., and Le, Q. V. (2019). SpecAugment: A simple data augmentation method for automatic speech recognition. In Proc. Interspeech 2019, pages 2613–2617.
  • Sak, H., Shannon, M., Rao, K., and Beaufays, F. (2017). Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping. In Proc. of Interspeech.
  • Tripathi, A., Lu, H., Sak, H., and Soltau, H. (2019). Monotonic recurrent neural network transducer and decoding strategies. In ASRU, pages 944–948.
  • Tüske, Z., Saon, G., Audhkhasi, K., and Kingsbury, B. (2020). Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard-300. Preprint arXiv:2001.07263.
  • Wang, W., Zhou, Y., Xiong, C., and Socher, R. (2020).

SLIDE 27

References (cont.)

  • (Wang et al., 2020, cont.) An investigation of phone-based subword units for end-to-end speech recognition. Preprint arXiv:2004.04290.
  • Zeineldeen, M., Zeyer, A., Schlüter, R., and Ney, H. (2020). Layer-normalized LSTM for hybrid-HMM and end-to-end ASR. In ICASSP, Barcelona, Spain.
  • Zeyer, A., Bahar, P., Irie, K., Schlüter, R., and Ney, H. (2019). A comparison of Transformer and LSTM encoder decoder models for ASR. In ASRU, pages 8–15, Sentosa, Singapore.
