

SLIDE 1

CTC-CRF: CRF-based single-stage acoustic modeling with CTC topology

Hongyu Xiang, Zhijian Ou Speech Processing and Machine Intelligence (SPMI) Lab Tsinghua University

http://oa.ee.tsinghua.edu.cn/ouzhijian/


SLIDE 2

Content

  • 1. Introduction

 Related work

  • 2. CTC-CRF
  • 3. Experiments
  • 4. Conclusions
SLIDE 3

Introduction

  • ASR state-of-the-art: DNNs of various network architectures


  • Conventionally, training is multi-stage:

 Monophone → alignment & triphone tree building → triphone → alignment → DNN-HMM

  • ASR is a discriminative problem

 For acoustic observations 𝒚 ≜ (y_1, ⋯, y_U), find the most likely labels 𝒎 ≜ (m_1, ⋯, m_M)

[Figure: GMM-HMM and DNN-HMM models over acoustic features 𝒚 and labels 𝒎, with the example transcription "Nice to meet you."]

SLIDE 4
  • End-to-end system:

 Eliminates GMM-HMM pre-training and tree building, and can be trained from scratch (flat-start or single-stage).

  • In a stricter sense:

 Removes the need for a pronunciation lexicon and, even further, trains the acoustic and language models jointly rather than separately

 Data-hungry

Motivation

We are interested in advancing single-stage acoustic models, which use a separate language model (LM) with or without a pronunciation lexicon.

 Text corpora for language modeling are cheaply available.  Data-efficient

SLIDE 5

Related work (SS-LF-MMI / EE-LF-MMI)

  • Single-Stage (SS) Lattice-Free Maximum-Mutual-Information (LF-MMI)

 10 - 25% relative WER reduction on 80-h WSJ, 300-h Switchboard and 2000-h

Fisher+Switchboard datasets, compared to CTC, Seq2Seq, RNN-T.

 Cast as MMI-based discriminative training of an HMM (a generative model), with pseudo state-likelihoods calculated by the bottom DNN and fixed state-transition probabilities.

 2-state HMM topology  Includes a silence label

Hadian, et al., “Flat-start single-stage discriminatively trained HMM-based models for ASR”, T-ASLP 2018.

CTC-CRF

 Cast as a CRF;  CTC topology;  No silence label.

SLIDE 6

Related work

  • 1. How to obtain q(𝒎 | 𝒚)
  • 2. How to handle alignment, since M ≠ U


ASR is a discriminative problem

 For acoustic observations 𝒚 ≜ (y_1, ⋯, y_U), find the most likely labels 𝒎 ≜ (m_1, ⋯, m_M)

SLIDE 7

Related work


 Explicitly by a state sequence 𝝆 ≜ (ρ_1, ⋯, ρ_U) in HMM, CTC, RNN-T, or implicitly in Seq2Seq
 State topology: determines a mapping ℬ that maps 𝝆 to a unique 𝒎

q(𝒎 | 𝒚) = Σ_{𝝆 ∈ ℬ⁻¹(𝒎)} q(𝝆 | 𝒚)

Graves, et al., “Connectionist Temporal Classification: Labelling unsegmented sequence data with RNNs”, ICML 2006.

CTC topology: a mapping ℬ that maps 𝝆 to 𝒎 by

  • 1. removing all repetitive symbols between the blank symbols.
  • 2. removing all blank symbols.

ℬ(−DD−−BB−U−) = DBU, where − denotes the blank symbol <blk>

How to handle alignment, since 𝑀 ≠ 𝑈

 Admits the smallest number of units in the state inventory, by adding only one <blk> to the label inventory.  Avoids ad-hoc silence insertions when estimating the denominator LM of labels.
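The mapping ℬ above (merge repeats, then drop blanks) is easy to sketch, and ℬ⁻¹(𝒎) can be enumerated brute-force for tiny examples. A minimal illustration, not the paper's implementation:

```python
from itertools import product

def ctc_collapse(path, blank="-"):
    """CTC mapping B: remove repeated symbols, then remove blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:  # rule 1 (repeats) + rule 2 (blanks)
            out.append(s)
        prev = s
    return "".join(out)

def inverse_paths(label, U, alphabet=("-", "D", "B", "U")):
    """Brute-force B^{-1}(m): all length-U paths that collapse to m."""
    return ["".join(p) for p in product(alphabet, repeat=U)
            if ctc_collapse(p) == label]

# The slide's example: B(-DD--BB-U-) = DBU
assert ctc_collapse("-DD--BB-U-") == "DBU"
```

For "DBU" with U = 4 there are seven such paths (one inserted blank or one doubled letter), which is the set summed over in q(𝒎 | 𝒚) = Σ_{𝝆∈ℬ⁻¹(𝒎)} q(𝝆 | 𝒚).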

SLIDE 8

Related work

 Directed graphical model / locally normalized

  • DNN-HMM: model q(𝝆, 𝒚) as an HMM; it can be discriminatively trained, e.g. by max_𝜾 q_𝜾(𝒎 | 𝒚)

 Undirected graphical model / globally normalized

[Figure: graphical models for DNN-HMM (states ρ_{u−1}, ρ_u, ρ_{u+1} with observations y_{u−1}, y_u, y_{u+1}), Seq2Seq (labels m_{j−1}, m_j, m_{j+1} conditioned on 𝒚), and CRF (states ρ_{u−1}, ρ_u, ρ_{u+1} with 𝒚)]

How to obtain q(𝒎 | 𝒚)

  • Seq2Seq: directly model q(𝒎 | 𝒚) = ∏_{j=1}^{M} q(m_j | m_1, ⋯, m_{j−1}, 𝒚)

  • CTC: directly model q(𝝆 | 𝒚) = ∏_{u=1}^{U} q(ρ_u | 𝒚)

  • CRF: q(𝝆 | 𝒚) ∝ exp φ(𝝆, 𝒚)

MMI training of GMM-HMMs is equivalent to CML training of CRFs (using 0/1/2-order features in the potential definition).

Heigold, et al., “Equivalence of generative and log-linear models”, T-ASLP 2011.

SLIDE 9

Related work (summary)


Model       | State topology | Training objective | Locally/globally normalized
Regular HMM | HMM            | q(𝒚 | 𝒎)           | Local
Regular CTC | CTC            | q(𝒎 | 𝒚)           | Local
SS-LF-MMI   | HMM            | q(𝒎 | 𝒚)           | Local
CTC-CRF     | CTC            | q(𝒎 | 𝒚)           | Global
Seq2Seq     | (implicit)     | q(𝒎 | 𝒚)           | Local

  • To the best of our knowledge, this paper represents the first exploration of CRFs with CTC topology.

SLIDE 10

Content

  • 1. Introduction

 Related work

  • 2. CTC-CRF
  • 3. Experiments
  • 4. Conclusions
SLIDE 11

CTC vs CTC-CRF


Both CTC and CTC-CRF use q(𝒎 | 𝒚) = Σ_{𝝆 ∈ ℬ⁻¹(𝒎)} q(𝝆 | 𝒚), with the CTC topology ℬ.

CTC (state-independence assumption):

q(𝝆 | 𝒚; 𝜾) = ∏_{u=1}^{U} q(ρ_u | 𝒚)

∂ log q(𝒎 | 𝒚; 𝜾) / ∂𝜾 = E_{q(𝝆 | 𝒎, 𝒚; 𝜾)} [ ∂ log q(𝝆 | 𝒚; 𝜾) / ∂𝜾 ]

CTC-CRF (globally normalized):

q(𝝆 | 𝒚; 𝜾) = exp φ(𝝆, 𝒚; 𝜾) / Σ_{𝝆′} exp φ(𝝆′, 𝒚; 𝜾)

φ(𝝆, 𝒚; 𝜾) = Σ_{u=1}^{U} log q(ρ_u | 𝒚) + log q_LM(ℬ(𝝆))

The node potential is given by the NN; the edge potential by an n-gram denominator LM of labels, as in LF-MMI.

∂ log q(𝒎 | 𝒚; 𝜾) / ∂𝜾 = E_{q(𝝆 | 𝒎, 𝒚; 𝜾)} [ ∂φ(𝝆, 𝒚; 𝜾) / ∂𝜾 ] − E_{q(𝝆′ | 𝒚; 𝜾)} [ ∂φ(𝝆′, 𝒚; 𝜾) / ∂𝜾 ]
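The globally normalized CTC-CRF loss, −log of the numerator (paths collapsing to 𝒎) over the denominator (all paths), can be sanity-checked on a tiny example by brute-force path enumeration. This is a hypothetical toy sketch; a real implementation computes both terms with forward-backward dynamic programming rather than enumeration, and the toy LM dictionary stands in for the n-gram denominator LM:

```python
import itertools
import math
import numpy as np

def ctc_collapse(path, blank=0):
    """CTC mapping B over integer ids: merge repeats, drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return tuple(out)

def ctc_crf_loss(node_logits, label, lm_logprob, blank=0):
    """Brute-force CTC-CRF negative log-likelihood (toy-sized only).

    node_logits: (U, V) per-frame log node potentials from the NN
    label:       target label sequence m (tuple of non-blank ids)
    lm_logprob:  dict mapping label sequences to log q_LM(m),
                 standing in for the n-gram denominator LM term
    """
    U, V = node_logits.shape
    num, den = [], []
    for rho in itertools.product(range(V), repeat=U):
        # potential phi(rho, y) = sum_u log q(rho_u|y) + log q_LM(B(rho))
        phi = sum(node_logits[u, s] for u, s in enumerate(rho)) \
              + lm_logprob.get(ctc_collapse(rho, blank), -1e30)
        den.append(phi)
        if ctc_collapse(rho, blank) == label:
            num.append(phi)
    lse = lambda xs: max(xs) + math.log(sum(math.exp(x - max(xs)) for x in xs))
    return -(lse(num) - lse(den))
```

Since the numerator paths are a subset of the denominator paths, the loss is always nonnegative, mirroring the two-expectation gradient above (numerator expectation minus denominator expectation).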

SLIDE 12

SS-LF-MMI vs CTC-CRF


Aspect         | SS-LF-MMI                                      | CTC-CRF
State topology | HMM topology with two states                   | CTC topology
Silence label  | Uses silence labels; they are randomly inserted when estimating the denominator LM | No silence labels; <blk> absorbs silence, so no need to insert silence labels into transcripts
Decoding       | No spikes                                      | Posterior is dominated by <blk> and non-blank symbols occur in spikes, allowing decoding speedup by skipping blanks
Implementation | Modifies utterance lengths to one of 30 lengths; uses leaky HMM | No length modification; no leaky HMM
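The decoding speedup comes from the spiky posteriors: frames dominated by <blk> carry little information for the search. A hypothetical frame-pruning sketch; the threshold and the exact rule are assumptions for illustration, not from the slides:

```python
import numpy as np

def drop_blank_frames(log_posteriors, blank=0, threshold=0.95):
    """Discard frames whose <blk> posterior exceeds `threshold`
    before running the decoder over the remaining frames.

    log_posteriors: (U, V) per-frame log posteriors
    Returns the kept rows and their original frame indices."""
    post = np.exp(log_posteriors)
    keep = post[:, blank] < threshold
    return log_posteriors[keep], np.flatnonzero(keep)
```

With spiky CTC-topology posteriors, most frames are blank-dominated, so the decoder only visits a small fraction of the original frames.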

SLIDE 13

Content

  • 1. Introduction

 Related work

  • 2. CTC-CRF
  • 3. Experiments
  • 4. Conclusions
SLIDE 14

Experiments

  • We conduct our experiments on three benchmark datasets:

 WSJ 80 hours  Switchboard 300 hours  Librispeech 1000 hours

  • Acoustic model: 6-layer BLSTM with 320 hidden dimensions, 13M parameters
  • Adam optimizer with an initial learning rate of 0.001, decreased to 0.0001 when the cross-validation loss stops decreasing
  • Implemented with PyTorch.
  • Objective function (the CTC objective is added to help convergence):

𝒥_CTC-CRF + β · 𝒥_CTC

  • Decoding score function (word-based language model, WFST-based decoding):

log q(𝒎 | 𝒚) + γ log q_LM(𝒎)


SLIDE 15

Experiments (comparison with CTC, phone-based)

WSJ:
Model   | Unit       | LM     | SP | dev93  | eval92
CTC     | Mono-phone | 4-gram | N  | 10.81% | 7.02%
CTC-CRF | Mono-phone | 4-gram | N  | 6.24%  | 3.90%

Switchboard:
Model   | Unit       | LM     | SP | SW    | CH
CTC     | Mono-phone | 4-gram | N  | 12.9% | 23.6%
CTC-CRF | Mono-phone | 4-gram | N  | 11.0% | 21.0%

Librispeech:
Model   | Unit       | LM     | SP | Dev Clean | Dev Other | Test Clean | Test Other
CTC     | Mono-phone | 4-gram | N  | 4.64%     | 13.23%    | 5.06%      | 13.68%
CTC-CRF | Mono-phone | 4-gram | N  | 3.87%     | 10.28%    | 4.09%      | 10.65%

Relative WER reductions of CTC-CRF over CTC: 44.4% (WSJ eval92); 14.7% (SW) and 11% (CH); 19.1% (Test Clean) and 22.1% (Test Other).

SP: speed perturbation for 3-fold data augmentation.
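The percentages scattered on this slide are relative WER reductions, and they can be recomputed directly from the WER tables:

```python
def rel_wer_reduction(baseline, improved):
    """Relative WER reduction, in percent, of `improved` over `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# WSJ eval92: CTC 7.02% -> CTC-CRF 3.90%
print(round(rel_wer_reduction(7.02, 3.90), 1))   # 44.4
# Librispeech Test Other: CTC 13.68% -> CTC-CRF 10.65%
print(round(rel_wer_reduction(13.68, 10.65), 1)) # 22.1
```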

SLIDE 16

Experiments (comparison with SS-LF-MMI, phone-based)

WSJ:
Model     | Unit       | LM     | SP | dev93 | eval92
SS-LF-MMI | Mono-phone | 4-gram | Y  | 6.3%  | 3.1%
SS-LF-MMI | Bi-phone   | 4-gram | Y  | 6.0%  | 3.0%
CTC-CRF   | Mono-phone | 4-gram | Y  | 6.23% | 3.79%

Switchboard:
Model     | Unit       | LM     | SP | SW    | CH
SS-LF-MMI | Mono-phone | 4-gram | Y  | 11.0% | 20.7%
SS-LF-MMI | Bi-phone   | 4-gram | Y  | 9.8%  | 19.3%
CTC-CRF   | Mono-phone | 4-gram | Y  | 10.3% | 19.7%
Seq2Seq   | Subword    | LSTM   | N  | 11.8% | 25.7%

Librispeech:
Model   | Unit       | LM     | SP | Dev Clean | Dev Other | Test Clean | Test Other
LF-MMI  | Tri-phone  | 4-gram | Y  | …         | …         | 4.28%      | …
CTC-CRF | Mono-phone | 4-gram | N  | 3.87%     | 10.28%    | 4.09%      | 10.65%
Seq2Seq | Subword    | 4-gram | N  | 4.79%     | 13.1%     | 4.82%      | 15.30%

Relative WER reductions of CTC-CRF: 4.4% (Librispeech Test Clean, over tri-phone LF-MMI); 6.4% (SW) and 4.8% (CH), over mono-phone SS-LF-MMI.

Zeyer, Irie, Schlüter, and Ney, “Improved training of end-to-end attention models for speech recognition”, Interspeech 2018.

SLIDE 17

Experiments (comparison with SS-LF-MMI, phone-based)

Switchboard (after the camera-ready version):
Model     | Unit               | LM     | SP | SW    | CH
SS-LF-MMI | Mono-phone         | 4-gram | Y  | 11.0% | 20.7%
SS-LF-MMI | Bi-phone           | 4-gram | Y  | 9.8%  | 19.3%
CTC-CRF   | Mono-phone         | 4-gram | Y  | 10.3% | 19.7%
CTC-CRF   | Clustered bi-phone | 4-gram | Y  | 9.8%  | 19.0%
Seq2Seq   | Subword            | LSTM   | N  | 11.8% | 25.7%

Bi-phones are clustered from 1213 down to 311 according to frequencies, giving about 5% (SW) and 4% (CH) relative reduction over mono-phone CTC-CRF.

Zeyer, Irie, Schlüter, and Ney, “Improved training of end-to-end attention models for speech recognition”, Interspeech 2018.

SLIDE 18

WFST representation of CTC topology

[Figure: EESEN T.fst vs. corrected T.fst]

Miao, et al., “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding ”, ASRU 2015.

Using the corrected T.fst performs slightly better; the decoding graph is smaller, and decoding is faster.

SLIDE 19

Content

  • 1. Introduction

 Related work

  • 2. CTC-CRF
  • 3. Experiments
  • 4. Conclusions
SLIDE 20

Conclusions

  • We propose a framework for single-stage acoustic modeling based on CRFs with CTC topology (CTC-CRF).

  • CTC-CRFs achieve strong results on WSJ, Switchboard and Librispeech datasets.

 CTC can be significantly improved by CTC-CRF;  CTC-CRF significantly outperforms attention-based Seq2Seq;  CTC-CRF outperforms SS-LF-MMI in both the mono-phone and mono-char cases (except on WSJ);

 Conceptually simple, and avoids some ad-hoc operations in SS-LF-MMI (randomly inserting silence labels when estimating the denominator LM, length modification, leaky HMM).

  • We are going to release the CRF-based ASR Toolkit (CAT) for reproducing this work.


SLIDE 21


Thanks for your attention!

Hongyu Xiang, Zhijian Ou Speech Processing and Machine Intelligence (SPMI) Lab Tsinghua University

http://oa.ee.tsinghua.edu.cn/ouzhijian/