Hallucinating system outputs for discriminative language modeling


  1. Hallucinating system outputs for discriminative language modeling
     Brian Roark, Center for Spoken Language Understanding, OHSU
     Joint work with D. Bikel, C. Callison-Burch, Y. Cao, A. Çelebi, E. Dikici, N. Glenn, K. Hall, E. Hasler, D. Karakos, S. Khudanpur, P. Koehn, M. Lehr, A. Lopez, M. Post, E. Prud'hommeaux, D. Riley, K. Sagae, H. Sak, M. Saraçlar, I. Shafran, P. Xu
     Symposium on Machine Learning in Speech and Language Processing (MLSLP), Portland

  2. Project overview
     • NSF-funded project and recent JHU summer workshop team
     • General topic: discriminative language modeling for ASR and MT
       – Learning language models with discriminative objectives
     • Specific topic: learning models from text only
       – Enabling use of much more training data; adaptation scenarios
     • Have made some progress with ASR models (topic today)
       – Less progress on improving MT (even fully supervised)
     • Talk includes a few other observations about DLM in general

  3. Motivation
     • Generative language models built from monolingual corpora are task-agnostic
       – But tasks differ in the kinds of ambiguities that arise
     • Supervised discriminative language modeling needs paired input:output sequences
       – Limited data vs. vast amounts of monolingual text used in generative models
     • Semi-supervised discriminative language modeling would have large benefits
       – Optimize models for task-specific objectives
       – Applicable to arbitrary amounts of monolingual text in the target language
     • How would this work? Here's one method:
       – Use baseline models to discover confusable sequences for the observed target
       – Learn to discriminate between the observed sequence and its confusables
     • Similar to Contrastive Estimation, but with observed output rather than input

  4. Prior work
     • Some prior research on ASR simulation for modeling
       – Work on small-vocabulary tasks
         ∗ Jyothi and Fosler-Lussier (2010) used phone confusion WFSTs for generating confusion sets for training
         ∗ Kurata et al. (2009; 2011) used phone confusions to perform "pseudo-ASR" and train discriminative language models
       – Tan et al. (2010) used machine translation approaches to simulate ASR, though without system gains
     • Zhifei Li has also applied similar techniques for MT modeling (Li et al., COLING 2010; EMNLP 2011)

  5. Discriminative language modeling
     • Supervised training of language models
       – Training data (x, y), x ∈ X (inputs), y ∈ Y (outputs)
       – e.g., x is input speech, y is the output reference transcript
     • Run the system on training inputs, update the model
       – Commonly a linear model, with n-gram features and others
       – Learn parameterizations using perceptron-like or global conditional likelihood methods
       – Use n-best or lattice output (Roark et al., 2004; 2007); or update directly on the decoding graph WFST (Kuo et al., 2007)
     • Run to some stopping criterion; regularize the final model

  6. Acoustic confusions in speech recognition
     • Given a text string from the NY Times:
       He has not hesitated to use his country's daunting problems as a kind of threat ...
     • Acoustically confusable alternatives for three fragments:

       country's     problems    kind of threat
       ---------     --------    --------------
       countries     problem     time threats
       country       proms       kinds thread
       countries'                kinda threads
       trees                     spread
       conferees                 read
       conference                fred
       company
       copy

  7. Synthesizing confusions in ASR
     [Figure: four-panel diagram (A)–(D) illustrating the confusion synthesis pipeline, with (B) composed (◦) with (C) to yield (D)]

  8. Open questions
     • Various ways to approach this simulation, many open questions
       – What is the best unit of confusion for simulation?
       – How might simulation output diverge from system output?
       – How to make simulation output "look like" system output?
       – What kind of data can be used to train simulation models?
     • Experimented with some answers to these questions
       – Confusion models based on phones, syllables, words, phrases
       – Sampling to get n-best lists with particular characteristics
       – Training confusion models without the use of the reference

  9. Three papers
     Going to highlight results from three papers in this talk:
     • Sagae et al. Hallucinated n-best lists for discriminative language modeling. In Proceedings of ICASSP 2012.
       – Controlled experiments with three methods of hallucination
     • Çelebi et al. Semi-supervised discriminative language modeling for Turkish ASR. In Proceedings of ICASSP 2012.
       – Experiments in Turkish with many other confusion model alternatives
       – Also sampling from simulated output to match the WER distribution
     • Xu, Khudanpur and Roark. Phrasal cohort based unsupervised discriminative language modeling. In Proceedings of Interspeech 2012.
       – Unsupervised methods for deriving the confusion model

  10. Sagae et al. (ICASSP, 2012)
     • Simulating ASR errors (pseudo-ASR) on an English CTS task, then training a discriminative LM (DLM) for n-best reranking
     • Running controlled experiments under several conditions:
       – Three different methods of training data "hallucination"
       – Different-sized training corpora
     • Comparing WER reductions from real vs. hallucinated n-best lists
     • Standard methods for training the linear model
       – Simple features: unigrams, bigrams and trigrams
       – Using the averaged perceptron algorithm

  11. Perceptron algorithm
     • On-line learning approach, i.e.,
       – Consider each example in the training set in turn
       – Use the current model to produce output for the example
       – Update the model based on the example, move on to the next one
     • For structured learning problems (parsing, tagging, transcription)
       – Given a set of input utterances and reference output sequences
       – Typically trying to learn parameters for features in a linear model
       – Need some kind of regularization (typically averaging)
     • Learning a language model
       – Consider each input utterance in the training set in turn
       – Use the current model to produce output (a transcription) for the example
       – Update feature parameters based on the example, move on to the next one (see the sketch below)
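A minimal sketch of this loop for n-best reranking with n-gram features, in the spirit of the setup above. It is illustrative only: the data layout (lists of (hypothesis, error-count) pairs) and the function names are assumptions, not the authors' code.

```python
from collections import defaultdict

def ngram_features(words, n_max=3):
    """Count unigram, bigram, and trigram features of one hypothesis."""
    feats = defaultdict(float)
    padded = ["<s>"] + list(words) + ["</s>"]
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            feats[tuple(padded[i:i + n])] += 1.0
    return feats

def score(w, words):
    """Linear model score: dot product of weights and n-gram features."""
    return sum(w.get(f, 0.0) * v for f, v in ngram_features(words).items())

def train_averaged_perceptron(nbest_lists, epochs=5):
    """nbest_lists: list of n-best lists, each a list of
    (hypothesis_words, word_error_count) pairs.
    Returns averaged feature weights for reranking."""
    w = defaultdict(float)       # current weights
    w_sum = defaultdict(float)   # running sum of weights, for averaging
    t = 0
    for _ in range(epochs):
        for nbest in nbest_lists:
            # Oracle: the candidate with fewest errors vs. the reference.
            oracle = min(nbest, key=lambda h: h[1])[0]
            # The current model's 1-best candidate.
            best = max(nbest, key=lambda h: score(w, h[0]))[0]
            if best != oracle:
                # Promote oracle features, demote the mistaken 1-best's.
                for f, v in ngram_features(oracle).items():
                    w[f] += v
                for f, v in ngram_features(best).items():
                    w[f] -= v
            for f, v in w.items():
                w_sum[f] += v
            t += 1
    # Averaging serves as the regularization mentioned above.
    return {f: v / t for f, v in w_sum.items()}
```

Real systems typically also include the baseline recognizer score as one more feature; that detail is omitted here.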

  12. Hallucination methods
     • Three methods of hallucination being compared:
       – FST phone-based confusion model
         ∗ Phone confusion model encoded as a pair language model
       – Machine translation system: reference to ASR 1-best
         ∗ Build a parallel corpus and let Moses loose
       – Word-based phrasal cohorts model
         ∗ Direct induction of anchored phrasal alternatives
     • Learn hallucination models by aligning ASR output and reference
       – Given new reference text, hallucinate a confusion set

  13. FST phone-based confusion model
     • Let S be a string; L a pronunciation lexicon; and G an n-gram language model
     • Learn a phone confusion model X
     • Create a lattice of confusions: S ◦ L ◦ X ◦ L⁻¹ ◦ G
       – (Prune this very large composition in a couple of ways)
     • Experimented with various methods to derive X
       – Best method was to train a pair language model, encoded as a transducer
         ∗ First, align the 1-best phone sequence with the reference phone sequence, e.g., escape to skater: e:ǫ s:s k:k a:a p:t ǫ:r
       – Treat each symbol pair a:b as a token, train a language model
       – Convert the resulting LM automaton into a transducer by splitting tokens (see the alignment sketch below)
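As an illustration of the pair-token step, a minimal sketch that edit-aligns a 1-best phone string to the reference and emits a:b tokens, with ǫ marking insertions and deletions. The traceback's tie-breaking is chosen to reproduce the slide's alignment; training the n-gram LM over these tokens and splitting it back into the transducer X (e.g., with an FST toolkit) is elided.

```python
def align_pairs(hyp, ref, eps="ǫ"):
    """Edit-distance alignment of hypothesis and reference phone
    sequences, returned as 'hyp:ref' pair tokens (ǫ = epsilon)."""
    m, n = len(hyp), len(ref)
    # Standard Levenshtein DP table of edit costs.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i-1][j-1] + (hyp[i-1] != ref[j-1]),  # sub/match
                          d[i-1][j] + 1,                          # deletion
                          d[i][j-1] + 1)                          # insertion
    # Trace back to recover pair tokens (insertion-first tie-breaking).
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        if j > 0 and d[i][j] == d[i][j-1] + 1:
            pairs.append(f"{eps}:{ref[j-1]}"); j -= 1
        elif i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (hyp[i-1] != ref[j-1]):
            pairs.append(f"{hyp[i-1]}:{ref[j-1]}"); i -= 1; j -= 1
        else:
            pairs.append(f"{hyp[i-1]}:{eps}"); i -= 1
    return pairs[::-1]

# The slide's example, with letters standing in for phones:
print(align_pairs(list("eskap"), list("skatr")))
# -> ['e:ǫ', 's:s', 'k:k', 'a:a', 'p:t', 'ǫ:r']
```

Each pair token then becomes a "word" in the n-gram model; splitting a:b back into input a, output b turns the LM acceptor into the confusion transducer X.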

  14. Machine translation system (using Moses)

  15. Phrasal cohorts
     • Levenshtein word-level alignment between the reference and each candidate
     • Find cohorts that share pivots (or anchors) on either side of the variation
     • Build a phrase table, weighted by relative frequency (see the sketch after the example)

       reference   <s> What kind of company is it </s>

                   candidate                              cohort member   weight
       1st-best    <s> What kind of company that </s>     company that    2/4 = 0.5
       2nd-best    <s> What kind of campaign that </s>    campaign that   1/4 = 0.25
       3rd-best    <s> What kind of company is it </s>    company is it   1/4 = 0.25
       4th-best    <s> Well kind of company that </s>
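For illustration, a simplified sketch of cohort extraction: it uses difflib's matching blocks as the pivot alignment (a stand-in for the Levenshtein alignment above) and weights each alternative by its relative frequency in the n-best list. Unlike the slide, it only tabulates spans where a candidate actually diverges from the reference; all names are illustrative.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher

def cohort_phrase_table(reference, nbest):
    """Map each varying reference span (the slot between shared pivot
    words) to its observed alternatives, weighted by relative frequency."""
    table = defaultdict(Counter)
    for hyp in nbest:
        sm = SequenceMatcher(None, reference, hyp, autojunk=False)
        blocks = sm.get_matching_blocks()  # ends with a zero-length sentinel
        # Each gap between consecutive matching (pivot) blocks is one slot.
        for (a1, b1, n1), (a2, b2, _) in zip(blocks, blocks[1:]):
            ref_span = tuple(reference[a1 + n1:a2])  # slot in the reference
            alt_span = tuple(hyp[b1 + n1:b2])        # candidate's alternative
            if ref_span or alt_span:
                table[ref_span][alt_span] += 1
    size = len(nbest)
    return {slot: {alt: cnt / size for alt, cnt in alts.items()}
            for slot, alts in table.items()}

ref = "<s> What kind of company is it </s>".split()
nbest = [h.split() for h in [
    "<s> What kind of company that </s>",
    "<s> What kind of campaign that </s>",
    "<s> What kind of company is it </s>",
    "<s> Well kind of company that </s>",
]]
for slot, alts in cohort_phrase_table(ref, nbest).items():
    print(" ".join(slot), "->", alts)
# e.g. the slot "is it" maps to {("that",): 2/4 = 0.5}, as on the slide.
```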

  16. Experimental setup
     • Task and baseline ASR specs:

       ASR software     IBM Attila
       Training data    2000 hours (25M words) English conversational telephone speech (CTS); 11000 conversations from Fisher, 3500 from Switchboard
       Dev & test data  NIST RT04 Fall dev and eval sets: 38K and 37K words respectively, ~2 hours each
       Acoustic models  41 phones; 3-state left-to-right HMM topology for phones; 4K clustered quinphone states; 150K Gaussians; linear discriminant transform; semi-tied covariance transform
       Features         13-coefficient perceptual linear prediction vectors with speaker-specific VTLN
       Baseline LM      4-gram language model over a 50K-word vocabulary, estimated by interpolating the transcripts and similar data extracted from the web

     • Complicated double cross-validation method: 20 folds of 100 hours each
       – Needed to train confusion models as well as the discriminative LM
     • 100-best list output from the ASR system or from the simulation
     • Varied the amount of DLM training data; compared with supervised DLM

  17. Development set results
     [Figure: dev set WER bar chart comparing the ASR 1-best baseline, real n-best DLM, and Phone/MT/Cohorts hallucinated DLMs trained on 2, 4, and 8 folds; WER axis spans 21–23%]

  18. Evaluation set results
     [Figure: eval set WER bar chart comparing the ASR 1-best baseline, real n-best DLM, and Phone/MT/Cohorts hallucinated DLMs at 8 folds; WER axis spans 24–26%]

  19. Discussion
     • All three methods yield models that improve on the baseline
       – About half the gain of wholly supervised DLM methods
     • Phrasal cohorts slightly better than the others
       – Not significantly different
     • Some take-away impressions/speculations
       – Hybrid/joint methods (e.g., phone and phrase) worth looking at
       – None of the approaches (incl. supervised) make as much use of the extra data as we would hope
