
Multilingual and low-resource ASR Lecture 18 CS 753 Instructor: Preethi Jyothi

Recall Hybrid DNN-HMM Systems
• Instead of GMMs, use scaled DNN posteriors as the HMM observation probabilities
• DNN trained using triphone labels derived from a forced alignment (“Viterbi”) step
[Figure: DNN taking a fixed window of 5 speech frames (39 features per frame) as input and producing triphone state labels (DNN posteriors) as output]
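The two ingredients on this slide, the fixed input window and the scaled posteriors, can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's implementation: the function names `splice_frames` and `scaled_log_likelihoods`, the edge padding at utterance boundaries, and the toy numbers are all assumptions. The scaling follows Bayes' rule, p(x|s) ∝ p(s|x) / p(s), where p(s) is the state prior (typically estimated from the forced alignment).

```python
import numpy as np

def splice_frames(feats, context=2):
    """Stack each frame with `context` neighbours on both sides.

    feats: (T, D) array, e.g. D = 39. With context=2 the result is a
    5-frame window of size 5 * D per frame. Edge frames are padded by
    repetition (an assumption; other padding schemes are possible).
    """
    num_frames = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    window = 2 * context + 1
    return np.hstack([padded[i:i + num_frames] for i in range(window)])

def scaled_log_likelihoods(log_posteriors, state_priors, floor=1e-10):
    """Turn DNN log-posteriors log p(s|x_t) into scaled log-likelihoods.

    p(x_t|s) = p(s|x_t) p(x_t) / p(s); p(x_t) does not depend on the
    state, so it is dropped and only the prior is divided out.
    """
    log_priors = np.log(np.maximum(state_priors, floor))
    return log_posteriors - log_priors  # broadcasts over frames

# Toy example: 10 frames of 39-dim features, 3 HMM states.
feats = np.arange(10 * 39, dtype=float).reshape(10, 39)
dnn_input = splice_frames(feats)          # shape (10, 195)

posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])
priors = np.array([0.5, 0.3, 0.2])
scores = scaled_log_likelihoods(np.log(posteriors), priors)
```

The resulting `scores` would take the place of GMM log-likelihoods during HMM decoding.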

Multilingual Training (Hybrid DNN/HMM System)
[Figure: stacked RBMs trained on PL → DNN finetuned on CZ → DNN finetuned on DE → DNN finetuned on PT → DNN finetuned on PL]
Image/Table from Ghoshal et al., “Multilingual training of deep neural networks”, ICASSP, 2013.

Multilingual Training (Hybrid DNN/HMM System)

Mono- and multilingual results

Language  Vocab  PPL  ML-GMM WER(%)  DNN WER(%)  Multilingual DNN languages           WER(%)
CZ        29K    823  18.5           15.8        -                                    -
DE        36K    115  13.9           11.2        CZ → DE                              9.4
FR        16K    341  25.8           22.6        CZ → DE → FR                         22.6
SP        17K    134  26.3           22.3        CZ → DE → FR → SP                    21.2
PT        52K    184  24.1           19.1        CZ → DE → FR → SP → PT               18.9
RU        24K    634  32.5           27.5        CZ → DE → FR → SP → PT → RU          26.3
PL        29K    705  20.0           17.4        CZ → DE → FR → SP → PT → RU → PL     15.9

Different training language schedules

Languages                        Dev WER(%)  Eval WER(%)
RU                               27.5        24.3
CZ → RU                          27.5        24.6
CZ → DE → FR → SP → RU           26.6        23.8
CZ → DE → FR → SP → PT → RU      26.3        23.6

Image/Table from Ghoshal et al., “Multilingual training of deep neural networks”, ICASSP, 2013.

Shared hidden layers + Language-specific softmax layers
[Figure 1: Architecture of the shared-hidden-layer multilingual DNN. Input layer: a window of acoustic feature frames (training or testing samples from Lang 1 to Lang 4) → many shared hidden layers (feature transformation) → one softmax layer per language (Language 1 to Language 4 senones)]
Huang et al., “Cross-language knowledge transfer using multilingual DNNs with shared hidden layers”, ICASSP 2013.

Shared hidden layers + Language-specific softmax layers
• Hidden layers are shared across languages; treated as a universal feature transformation
• Each language has its own softmax layer to estimate posterior probabilities of tied triphone states specific to each language
[Figure 1: Architecture of the shared-hidden-layer multilingual DNN, with an input window of acoustic feature frames, many shared hidden layers, and separate senone outputs for Language 1 to Language 4]
Huang et al., “Cross-language knowledge transfer using multilingual DNNs with shared hidden layers”, ICASSP 2013.
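The shared-trunk, per-language-head layout described above can be sketched as a forward pass in NumPy. This is an illustrative sketch only: the class name `SharedHiddenLayerDNN`, the sigmoid hidden units, the layer sizes, and the random weights are assumptions, and no training code is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class SharedHiddenLayerDNN:
    """Hidden layers shared by all languages, one softmax layer per language."""

    def __init__(self, input_dim, hidden_dim, num_hidden, senones_per_lang):
        dims = [input_dim] + [hidden_dim] * num_hidden
        # Shared hidden layers: the "universal feature transformation".
        self.hidden = [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
                       for a, b in zip(dims[:-1], dims[1:])]
        # One output layer per language, sized to that language's senone set.
        self.heads = {lang: (rng.normal(0.0, 0.1, (hidden_dim, n)), np.zeros(n))
                      for lang, n in senones_per_lang.items()}

    def features(self, x):
        """Output of the top shared hidden layer (language independent)."""
        h = x
        for weights, bias in self.hidden:
            h = 1.0 / (1.0 + np.exp(-(h @ weights + bias)))
        return h

    def posteriors(self, x, lang):
        """Senone posteriors from the softmax layer of the given language."""
        weights, bias = self.heads[lang]
        return softmax(self.features(x) @ weights + bias)

# Hypothetical sizes: 195-dim input window, 3 hidden layers, toy senone counts.
model = SharedHiddenLayerDNN(195, 64, 3, {"FRA": 50, "ENU": 40})
frames = rng.normal(size=(4, 195))
enu_post = model.posteriors(frames, "ENU")   # shape (4, 40)
fra_post = model.posteriors(frames, "FRA")   # shape (4, 50)
```

During multilingual training, a frame from a given language would update the shared layers and only that language's head; adding a new language amounts to attaching a new entry in `heads`.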

Shared hidden layers + Language-specific softmax layers

Hidden layers are transferable

                                   WER (%)
Baseline (9-hr ENU)                30.9
FRA HLs + Train All Layers         30.6
FRA HLs + Train Softmax Layer      27.3
SHL-MDNN + Train Softmax Layer     25.3

Training strategy based on target language data

ENU training data (# hours)              3      9      36
Baseline DNN (no transfer)               38.9   30.9   23.0
SHL-MDNN + Train Softmax Layer           28.0   25.3   22.4
SHL-MDNN + Train All Layers              33.4   28.9   21.6
Best Case Relative WER Reduction (%)     28.0   18.1   6.1

Cross-lingual transfer (measured in relative CER reduction, %)

CHN Training Set (hrs)         3      9      36     139
Baseline - CHN only            45.1   40.3   31.7   29.0
SHL-MDNN Model Transfer        35.6   33.9   28.4   26.6
Relative CER Reduction (%)     21.1   15.9   10.4   8.3

Huang et al., “Cross-language knowledge transfer using multilingual DNNs with shared hidden layers”, ICASSP 2013.
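The relative-reduction rows in these tables follow from the standard formula 100 × (baseline − system) / baseline. The short check below recomputes them from the error rates on the slide. It reads the “SHL-MDNN + Train Softmax Layer” row as 28.0, 25.3, 22.4 for 3, 9 and 36 hours, which is the ordering consistent with the best-case row; the variable and function names are illustrative.

```python
def relative_reduction(baseline, system):
    """Relative error-rate reduction in percent."""
    return 100.0 * (baseline - system) / baseline

# WER (%) by hours of ENU training data, taken from the slide.
baseline_wer = {3: 38.9, 9: 30.9, 36: 23.0}
softmax_only = {3: 28.0, 9: 25.3, 36: 22.4}
all_layers = {3: 33.4, 9: 28.9, 36: 21.6}

# Best case = the better of the two transfer strategies at each data size.
best_case = {
    hours: round(relative_reduction(
        baseline_wer[hours], min(softmax_only[hours], all_layers[hours])), 1)
    for hours in baseline_wer
}

# CER (%) by hours of CHN training data, taken from the slide.
chn_baseline = [45.1, 40.3, 31.7, 29.0]
chn_transfer = [35.6, 33.9, 28.4, 26.6]
cer_reduction = [round(relative_reduction(b, s), 1)
                 for b, s in zip(chn_baseline, chn_transfer)]

print(best_case)      # {3: 28.0, 9: 18.1, 36: 6.1}
print(cer_reduction)  # [21.1, 15.9, 10.4, 8.3]
```

Note that training only the softmax layer wins with 3 and 9 hours of target data, while updating all layers wins at 36 hours, and that the relative gain from transfer shrinks as target-language data grows.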

Recall Tandem DNN-HMM Systems
• Neural network outputs are used as “features” to train HMM-GMM models
• Use a low-dimensional bottleneck layer and extract features from its activations
[Figure: network with an input layer, a bottleneck layer and an output layer]
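Bottleneck feature extraction can be sketched as a forward pass that stops at the narrow layer. The sketch below is illustrative only: the layer sizes, the tanh nonlinearity, the random weights, and the name `bottleneck_features` are assumptions, and appending the bottleneck output to the original acoustic features is shown as one common tandem setup rather than a prescribed recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 195-dim input window, 40-dim bottleneck, 100 output classes.
LAYER_DIMS = [195, 512, 40, 512, 100]
BOTTLENECK_LAYER = 1  # index of the layer whose output is the bottleneck

layers = [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
          for a, b in zip(LAYER_DIMS[:-1], LAYER_DIMS[1:])]

def bottleneck_features(x):
    """Run the network only up to the bottleneck layer and return its output.

    The layers above the bottleneck (including the output layer) are needed
    to train the network but are discarded at feature-extraction time.
    """
    h = x
    for index, (weights, bias) in enumerate(layers):
        h = np.tanh(h @ weights + bias)
        if index == BOTTLENECK_LAYER:
            return h
    raise ValueError("BOTTLENECK_LAYER is outside the network")

spliced = rng.normal(size=(6, 195))   # 6 frames of spliced input
acoustic = rng.normal(size=(6, 39))   # matching 39-dim acoustic features
bn = bottleneck_features(spliced)     # shape (6, 40)

# Tandem features for the HMM-GMM system: acoustic features plus bottleneck.
tandem = np.hstack([acoustic, bn])    # shape (6, 79)
```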

Multilingual Training (Tandem System)
[Figure: language-independent hidden layers → bottleneck layer → separate softmax layers for language 1, language 2, ..., language N]
Vesely et al., “The language-independent bottleneck features”, SLT, 2012.

Multilingual Training (Tandem System)

Monolingual/multilingual BN feature-based results

System           Czech  English  German  Portuguese  Spanish  Russian  Turkish  Vietnamese
HMM              22.6   16.8     26.6    27.0        23.0     33.5     32.0     27.3
1-Softmax        20.3   16.1     25.9    27.2        24.2     33.4     31.3     26.9
mono-BN          19.7   15.9     25.5    27.2        23.2     32.5     30.4     23.4
1-Softmax(IPA)   19.4   15.5     24.8    25.6        23.2     32.5     30.3     25.9
8-Softmax        19.3   14.7     24.0    25.2        22.6     31.5     29.4     24.3

Vesely et al., “The language-independent bottleneck features”, SLT, 2012.

Multilingual Training (Tandem System)

Cross-lingual WERs

             Baselines                       ANN output
Language     PLP-HLDA (II.)  Mono-BN (III.)  5-Softmax, lang-pooled (d)
Czech        22.6            19.7            19.2
English      16.8            15.9            14.7
German       26.6            25.5            24.5
Portuguese   27.0            27.2            26.0
Spanish      23.0            23.2            23.0
Russian      33.5            32.5            32.3
Turkish      32.0            30.4            30.7
Vietnamese   27.3            23.4            26.8

Vesely et al., “The language-independent bottleneck features”, SLT, 2012.

Cross- and Multilingual Bottleneck features
[Figure: multilingual MLP with language-specific outputs for GER (DE), ENU (EN) and FRA (FR)]
Tuske et al., “Investigation on cross- and multilingual MLP features”, ICASSP, 2013

Cross- and Multilingual Bottleneck features
• Features from three languages are merged and presented as input to the model
• Language-specific softmax layers
• Bottleneck layer which is shared across languages

Cross- and Multilingual Bottleneck features

Target and cross-lingual BN features
WER [%]; values in round brackets are the relative improvement (%) over the MFCC baseline.

                        MFCC+BN, BN trained on
Test language  MFCC     GER            ENU            FRA
GER            29.97    28.37 (5.3)    29.63 (1.1)    30.38 (-1.4)
ENU            21.69    21.31 (1.8)    18.85 (13.1)   22.63 (-4.3)
FRA            37.78    37.76 (0.1)    38.72 (-2.5)   33.95 (10.1)

Multilingual BN features using mismatched data
WER [%] for MFCC+BN, with the languages used to train the bottleneck; relative improvement (%) over MFCC in round brackets.

Test language GER (MFCC 29.97)
  ENU+FRA        27.06 (9.7)
  GER+FRA        26.89 (10.3)
  GER+ENU        26.90 (10.2)
  GER+ENU+FRA    27.50 (8.2)

Test language ENU (MFCC 21.69)
  GER+FRA        20.29 (6.5)
  ENU+FRA        18.21 (16.0)
  ENU+GER        17.99 (17.1)
  GER+ENU+FRA    17.89 (17.5)

Test language FRA (MFCC 37.78)
  GER+ENU        35.88 (5.0)
  FRA+GER        33.52 (11.3)
  FRA+ENU        33.45 (11.5)
  GER+ENU+FRA    33.61 (11.0)

e2e multilingual models

Multilingual ASR with an e2e Model
• Use attention-based encoder-decoder models
• Decoder outputs one character per time-step
• For multilingual models, use union over character sets
[Figure: Listen, Attend and Spell model. The Listener encodes the inputs x_1, ..., x_T into h = (h_1, ..., h_U); the Speller attends over h (contexts c_1, c_2, ...; states s_1, s_2, ...) and emits characters y_2, y_3, ..., ⟨eos⟩, starting from ⟨sos⟩]
[Figure: one sample sentence in the native script of each language: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, Urdu]
Image from: Chan et al., Listen, Attend and Spell: A NN for LVCSR, ICASSP 2016
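Building the decoder's output vocabulary as a union over character sets can be sketched as follows. This is a minimal illustration: the function name `build_joint_vocab`, the special tokens, and the toy transcripts are hypothetical, and "character" here means a single Unicode code point, which is a simplifying assumption for scripts that use combining signs.

```python
def build_joint_vocab(transcripts_by_lang, specials=("<sos>", "<eos>")):
    """Map every character seen in any language to one shared output index."""
    chars = set()
    for transcripts in transcripts_by_lang.values():
        for text in transcripts:
            chars.update(text)  # union over all languages' character sets
    symbols = list(specials) + sorted(chars)
    return {symbol: index for index, symbol in enumerate(symbols)}

# Toy transcripts (hypothetical data, two of the nine languages).
toy_transcripts = {
    "Hindi": ["यह एक दिन है"],
    "Bengali": ["আজ দিন"],
}
vocab = build_joint_vocab(toy_transcripts)

def encode(text, vocab):
    """Turn a transcript into decoder targets, terminated by <eos>."""
    return [vocab[ch] for ch in text] + [vocab["<eos>"]]

targets = encode("আজ দিন", vocab)
```

Because the scripts involved barely overlap, the joint softmax is roughly the sum of the per-language character inventories, and a single decoder can emit text in any of the languages.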

Multilingual ASR with an e2e Model

Language-specific vs. Multilingual models

Language        Language-specific  Joint   Joint + MTL
Bengali         19.1               16.8    16.5
Gujarati        26.0               18.0    18.2
Hindi           16.5               14.4    14.4
Kannada         35.4               34.5    34.6
Malayalam       44.0               36.9    36.7
Marathi         28.8               27.6    27.2
Tamil           13.3               10.7    10.6
Telugu          37.4               22.5    22.7
Urdu            29.5               26.8    26.7
Weighted Avg.   29.05              22.93   22.91

LAS models conditioned on language ID

Language        Joint   Dec     Enc     Enc + Dec
Bengali         16.8    16.9    16.5    16.5
Gujarati        18.0    17.7    17.2    17.3
Hindi           14.4    14.6    14.5    14.4
Kannada         34.5    30.1    29.4    29.2
Malayalam       36.9    35.5    34.8    34.3
Marathi         27.6    24.0    22.8    23.1
Tamil           10.7    10.4    10.3    10.4
Telugu          22.5    22.5    21.9    21.5
Urdu            26.8    25.7    24.2    24.5
Weighted Avg.   22.93   22.03   21.37   21.32

Image from: Chan et al., Listen, Attend and Spell: A NN for LVCSR, ICASSP 2016
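The slide does not say how the language ID is fed to the model. One simple possibility, sketched here purely as an assumption, is to append a one-hot language vector to every input vector of the conditioned component (encoder frames for "Enc", decoder inputs for "Dec"). The names `lang_one_hot` and `condition_on_language` are hypothetical.

```python
import numpy as np

LANGUAGES = ["Bengali", "Gujarati", "Hindi", "Kannada", "Malayalam",
             "Marathi", "Tamil", "Telugu", "Urdu"]

def lang_one_hot(lang):
    """One-hot vector identifying the utterance's language."""
    vector = np.zeros(len(LANGUAGES))
    vector[LANGUAGES.index(lang)] = 1.0
    return vector

def condition_on_language(inputs, lang):
    """Append the language one-hot to every row of `inputs`.

    inputs: (T, D) array of encoder frames or decoder input embeddings.
    Returns a (T, D + number of languages) array.
    """
    num_steps = inputs.shape[0]
    lang_block = np.tile(lang_one_hot(lang), (num_steps, 1))
    return np.hstack([inputs, lang_block])

frames = np.zeros((7, 80))  # toy encoder input: 7 frames of 80-dim features
conditioned = condition_on_language(frames, "Tamil")  # shape (7, 89)
```

A learned language embedding could replace the one-hot vector; either way the model no longer has to infer the language from the audio alone.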

Hybrid End-to-end Multilingual ASR Watanabe et al., “e2e architecture for joint language identification and ASR”, ASRU, 2017
