Hallucinating system outputs for discriminative language modeling - PowerPoint PPT Presentation


SLIDE 1

Hallucinating system outputs for discriminative language modeling

Brian Roark

Center for Spoken Language Understanding, OHSU

Joint work with D. Bikel, C. Callison-Burch, Y. Cao, A. Çelebi, E. Dikici, N. Glenn,
K. Hall, E. Hasler, D. Karakos, S. Khudanpur, P. Koehn, M. Lehr,
A. Lopez, M. Post, E. Prud’hommeaux, D. Riley, K. Sagae, H. Sak,
M. Saraçlar, I. Shafran, P. Xu

Symposium on Machine Learning in Speech and Language Processing (MLSLP), Portland

SLIDE 2

Project overview

  • NSF funded project and recent JHU summer workshop team
  • General topic: discriminative language modeling for ASR and MT

– Learning language models with discriminative objectives

  • Specific topic: learning models from text only

– Enabling use of much more training data; adaptation scenarios

  • Have made some progress with ASR models (topic today)

– Less progress on improving MT (even fully supervised)

  • Talk includes a few other observations about DLM in general

SLIDE 3

Motivation

  • Generative language models built from monolingual corpora are task agnostic

– But tasks differ in the kinds of ambiguities that arise

  • Supervised discriminative language modeling needs paired input:output sequences

– Limited data vs. vast amounts of monolingual text used in generative models

  • Semi-supervised discriminative language modeling would have large benefits

– Optimize models for task specific objectives
– Applicable to arbitrary amounts of monolingual text in target language

  • How would this work? Here’s one method:

– Use baseline models to discover confusable sequences for observed target
– Learn to discriminate between observed sequence and confusables

  • Similar to Contrastive Estimation but with observed output rather than input

SLIDE 4

Prior work

  • Some prior research on ASR simulation for modeling

– Work on small vocabulary tasks
  ∗ Jyothi and Fosler-Lussier (2010) used phone confusion WFSTs for generating confusion sets for training
  ∗ Kurata et al. (2009; 2011) used phone confusions to perform “pseudo-ASR” and train discriminative language models
– Tan et al. (2010) used machine translation approaches to simulate ASR, though without system gains

  • Zhifei Li has also applied similar techniques for MT modeling

(Li et al. COLING 2010; EMNLP 2011)

SLIDE 5

Discriminative language modeling

  • Supervised training of language models

– Training data (x, y), x ∈ X (inputs), y ∈ Y (outputs)
– e.g., x input speech, y output reference transcript

  • Run system on training inputs, update model

– Commonly a linear model, n-gram features and others
– Learn parameterizations using perceptron-like or global conditional likelihood methods
– Use n-best or lattice output (Roark et al., 2004; 2007); or update directly on decoding graph WFST (Kuo et al., 2007)

  • Run to some stopping criterion; regularize final model

SLIDE 6

Acoustic confusions in speech recognition

  • Given a text string from the NY Times:

He has not hesitated to use his country’s daunting problems as a kind of threat

... country’s problems ... kind of threat

country’s → countries, country, countries’, trees, conferees, conference, company, copy
problems → problem, proms
kind of → time, kinds, kinda
threat → threats, thread, threads, spread, read, fred

SLIDE 7

Synthesizing confusions in ASR

[Figure: four-panel diagram (A)-(D) illustrating the synthesis of confusions in ASR.]

SLIDE 8

Open questions

  • Various ways to approach this simulation, many open questions

– What is the best unit of confusion for simulation?
– How might simulation output diverge from system output?
– How to make simulation output “look like” system output?
– What kind of data can be used to train simulation models?

  • Experimented with some answers to these questions

– Confusion models based on phones, syllables, words, phrases
– Sampling to get n-best lists with particular characteristics
– Training confusion models without the use of the reference

SLIDE 9

Three papers

Going to highlight results from three papers in this talk

  • Sagae et al. Hallucinated n-best lists for discriminative language modeling. In Proceedings of ICASSP 2012.

– Controlled experiments with three methods for hallucination

  • Çelebi et al. Semi-supervised discriminative language modeling for Turkish ASR. In Proceedings of ICASSP 2012.

– Experiments in Turkish with many other confusion model alternatives
– Also sampling from simulated output to match WER distribution

  • Xu, Khudanpur and Roark. Phrasal cohort based unsupervised discriminative language modeling. In Proceedings of Interspeech 2012.

– Unsupervised methods for deriving confusion model

SLIDE 10

Sagae et al. (ICASSP, 2012)

  • Simulating ASR errors or pseudo-ASR on English CTS task;

then training a discriminative LM (DLM) for n-best reranking

  • Running controlled experimentation under several conditions:

– Three different methods of training data “hallucination”
– Different sized training corpora

  • Comparing WER reductions from real vs. hallucinated n-best lists
  • Standard methods for training linear model

– Simple features: unigrams, bigrams and trigrams
– Using averaged perceptron algorithm

SLIDE 11

Perceptron algorithm

  • On-line learning approach, i.e.,

– Consider each example in training set in turn
– Use the current model to produce output for example
– Update model based on example, move on to the next one

  • For structured learning problems (parsing, tagging, transcription)

– Given a set of input utterances and reference output sequences
– Typically trying to learn parameters for features in a linear model
– Need some kind of regularization (typically averaging)

  • Learning a language model

– Consider each input utterance in training set in turn
– Use the current model to produce output for example (transcription)
– Update feature parameters based on example, move on to the next one
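The loop above can be sketched in a few lines. This is a minimal illustration, not the papers' implementation: the function names are mine, the features are the unigram/bigram/trigram counts mentioned later, and the "oracle" is simply the lowest-edit-distance hypothesis in the n-best list.

```python
from collections import defaultdict

def edit_distance(a, b):
    """Word-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def ngram_feats(words, n_max=3):
    """Unigram, bigram and trigram counts for one hypothesis."""
    feats = defaultdict(int)
    toks = ["<s>"] + words + ["</s>"]
    for n in range(1, n_max + 1):
        for i in range(len(toks) - n + 1):
            feats[tuple(toks[i:i + n])] += 1
    return feats

def score(w, feats):
    return sum(w.get(f, 0.0) * c for f, c in feats.items())

def train_dlm(data, epochs=3):
    """data: list of (nbest, reference) pairs, nbest a list of word lists.
    Perceptron: rerank with current weights, update toward the oracle
    (lowest-WER) hypothesis, and average the weights for regularization."""
    w, w_sum, t = defaultdict(float), defaultdict(float), 0
    for _ in range(epochs):
        for nbest, ref in data:
            t += 1
            best = max(nbest, key=lambda h: score(w, ngram_feats(h)))
            oracle = min(nbest, key=lambda h: edit_distance(h, ref))
            if best != oracle:                      # model was wrong: update
                for f, c in ngram_feats(oracle).items():
                    w[f] += c
                for f, c in ngram_feats(best).items():
                    w[f] -= c
            for f, v in w.items():                  # accumulate for averaging
                w_sum[f] += v
    return {f: v / t for f, v in w_sum.items()}
```

After training, reranking is just `max(nbest, key=lambda h: score(w, ngram_feats(h)))` with the averaged weights.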

SLIDE 12

Hallucination methods

  • Three methods for hallucinating being compared:

– FST phone-based confusion model
  ∗ Phone confusion model encoded as a pair language model
– Machine translation system: reference to ASR 1-best
  ∗ Build a parallel corpus and let Moses loose
– Word-based phrasal cohorts model
  ∗ Direct induction of anchored phrasal alternatives

  • Learn hallucination models by aligning ASR output and reference

– Given new reference text, hallucinate confusion set

SLIDE 13

FST phone-based confusion model

  • Let S be a string; L a pronunciation lexicon; and G an n-gram language model
  • Learn a phone confusion model X
  • Create lattice of confusions: S ◦ L ◦ X ◦ L⁻¹ ◦ G

– (Prune this very large composition in a couple of ways)
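A toy, dictionary-based analogue of that composition may help fix ideas. The real pipeline composes WFSTs at the phone level with pruning; here `lex`, `conf` and `lm_score` are hypothetical stand-ins for L, X and G, and confusions are applied one word at a time.

```python
from itertools import product

def confusion_candidates(sentence, lex, conf, lm_score, beam=5):
    """Toy analogue of S ◦ L ◦ X ◦ L⁻¹ ◦ G: map words to phone strings (L),
    swap in confusable phone strings (X), map phone strings back to words
    (inverted L), and rank the resulting word sequences with an LM score (G).
    lex: word -> phone string; conf: phone string -> confusable phone strings."""
    inv_lex = {}
    for word, phones in lex.items():
        inv_lex.setdefault(phones, []).append(word)
    per_word = []
    for word in sentence.split():
        phones = lex[word]
        alts = {phones, *conf.get(phones, [])}            # apply X
        per_word.append(sorted({v for p in alts for v in inv_lex.get(p, [])}))
    ranked = sorted((" ".join(seq) for seq in product(*per_word)),
                    key=lm_score, reverse=True)
    return ranked[:beam]
```

Where the real system prunes the very large composition, this sketch simply truncates the ranked list at `beam`.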

  • Experimented with various methods to derive X

– Best method was to train a pair language model, encoded as a transducer
  ∗ First, align the 1-best phone sequence with the reference phone sequence: e.g., escape to skater: e:ǫ s:s k:k a:a p:t ǫ:r
– Treat each symbol pair a:b as a token, train a language model
– Convert the resulting LM automaton into a transducer by splitting tokens
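The alignment step can be sketched as a standard Levenshtein traceback that emits the pair tokens; here `eps` stands in for ǫ, and the tie-breaking order picks one of several equally minimal alignments.

```python
def align_pairs(hyp, ref, eps="eps"):
    """Levenshtein-align two phone sequences and emit the 'h:r' pair
    tokens (with eps on insertions/deletions) used to train a pair LM."""
    m, n = len(hyp), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i
    for j in range(1, n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,               # deletion
                           dp[i][j - 1] + 1,               # insertion
                           dp[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]))
    pairs, i, j = [], m, n
    while i or j:                                          # trace back
        if j and dp[i][j] == dp[i][j - 1] + 1:
            pairs.append(f"{eps}:{ref[j - 1]}"); j -= 1
        elif i and j and dp[i][j] == dp[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            pairs.append(f"{hyp[i - 1]}:{ref[j - 1]}"); i -= 1; j -= 1
        else:
            pairs.append(f"{hyp[i - 1]}:{eps}"); i -= 1
    return pairs[::-1]
```

Treating letters as phones, `align_pairs(list("eskap"), list("skatr"))` reproduces the slide's alignment: `e:eps s:s k:k a:a p:t eps:r`.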

SLIDE 14

Machine translation system (using Moses)

SLIDE 15

Phrasal cohorts

  • Levenshtein word-level alignment between reference and each candidate
  • Find cohorts that share pivots (or anchors) on either side of variation
  • Build phrase table, weighted by relative frequency

reference  <s> What kind of company is it </s>
1st-best   <s> What kind of company that </s>
2nd-best   <s> What kind of campaign that </s>
3rd-best   <s> What kind of company is it </s>
4th-best   <s> Well kind of company that </s>

cohort members    weights
company that      2/4 = 0.5
campaign that     1/4 = 0.25
company is it     1/4 = 0.25
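A simplified sketch of cohort extraction that reproduces the example above: it takes the anchors to be the words shared by every hypothesis (found via pairwise LCS), whereas the paper's method is built on Levenshtein alignments and handles the weighting more carefully.

```python
from collections import Counter
from functools import reduce

def lcs(a, b):
    """Longest common subsequence of two word lists (DP + traceback)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

def split_at_anchors(hyp, anchors):
    """Split hyp into the spans between anchor words (greedy match)."""
    spans, cur, k = [], [], 0
    for w in hyp:
        if k < len(anchors) and w == anchors[k]:
            spans.append(tuple(cur)); cur = []; k += 1
        else:
            cur.append(w)
    spans.append(tuple(cur))
    return spans

def phrasal_cohorts(nbest):
    """One cohort per variation site between anchors shared by all
    hypotheses, weighted by relative frequency over the n-best list."""
    hyps = [["<s>"] + h + ["</s>"] for h in nbest]
    anchors = reduce(lcs, hyps)
    sites = [Counter() for _ in range(len(anchors) + 1)]
    for h in hyps:
        for site, span in zip(sites, split_at_anchors(h, anchors)):
            site[" ".join(span)] += 1
    n = len(hyps)
    return [{phrase: c / n for phrase, c in site.items()} for site in sites]
```

On the 4-best list above, the site anchored by "of" and </s> comes out as {company that: 0.5, campaign that: 0.25, company is it: 0.25}, matching the slide; the What/Well variation forms a separate cohort.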

SLIDE 16

Experimental setup

  • Task and baseline ASR specs:

ASR software     IBM Attila
Training data    2000 hours (25M words) English conversational telephone speech (CTS): 11000 conversations from Fisher, 3500 from Switchboard
Dev & test data  NIST RT04 Fall dev and eval sets: 38K words and 37K words respectively, ~2 hours each
Acoustic models  41 phones, 3-state left-to-right HMM topology for phones, 4K clustered quinphone states, 150K Gaussians, linear discriminant transform, semi-tied covariance transform
Features         13-coefficient perceptual linear prediction vectors with speaker-specific VTLN
Baseline LM      4-gram language model over a 50K word vocabulary, estimated by interpolating the transcripts with similar data extracted from the web

  • Complicated double cross-validation method: 20 folds of 100 hours each

– Needed to train confusion models as well as discriminative LM

  • 100-best list output from ASR system or from simulation
  • Varied amount of DLM training data; compared with supervised DLM

SLIDE 17

Development set results

[Figure: Dev set WER for DLMs trained on 2, 4 and 8 folds of data: real n-best vs. hallucinated lists (Phone, MT, Cohorts), against the ASR 1-best baseline; y-axis WER 21-23%.]

SLIDE 18

Evaluation set results

[Figure: Eval set WER (8 folds) for real n-best vs. hallucinated lists (Phone, MT, Cohorts), against the ASR 1-best baseline; y-axis WER 24-26%.]

SLIDE 19

Discussion

  • All three methods yield models that improve on the baseline

– About half the gain of wholly supervised DLM methods

  • Phrasal cohorts slightly better than the others

– Not significantly different

  • Some take-away impressions/speculations

– Hybrid/joint methods (e.g., phone and phrase) worth looking at
– None of the approaches (incl. supervised) make as much use of the extra data as we would hope

SLIDE 20

Çelebi et al. (ICASSP, 2012)

  • Experiments on Turkish BN task
  • Focus on phone, syllable, morph, and word based confusions
  • Some different choices in setup and experimental evaluation

– Used a WER-sensitive (passive-aggressive) style update in perceptron
– Explored sampling methods for producing n-best list output
– Used different phone confusion methods (e.g., Bhattacharyya)
– Different methods for language modeling in producing confusion sets (using reference transcripts, ASR output or no language model)
– Combined supervised and hallucinated n-best lists

  • Larger sub-word units (syllable, morph) performed best

SLIDE 21

Experiments with different units of confusion model

  • WER (%) on held-out for different CMs and LMs
  • Held-out baseline: 22.9%; real (ASR) 50-best DLM: 22.1%

CMs               GEN-LM  ASR-LM  NO-LM
phone-based (bh)  22.8    22.7    N/A
phone-based (ed)  22.6    22.7    N/A
syllable-based    22.5    22.4    22.6
morph-based       22.6    22.4    22.5
word-based        22.6    22.5    22.7

SLIDE 22

Sampling strategies

  • k-best output of simulation differs from ASR k-best:

– Top-50 uses the best scoring output from hallucination
– ASRdist-50 samples a 50-best list from a 1000-best list:

[Figure: WER distribution across the n-best list. Legend: Top-50 (22.5%), ASRdist-50 (22.4%), real (ASR) 50-best (22.1%).]
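ASRdist-style sampling can be sketched as per-bin quota sampling against a target WER histogram. The bin edges, target probabilities and function name below are illustrative assumptions, not the paper's exact procedure.

```python
import random
from bisect import bisect_left

def sample_matching_wer(hyps, wers, bin_edges, target_probs, k=50, seed=0):
    """Down-sample a hallucinated 1000-best list to k hypotheses whose
    WER histogram approximates target_probs (one probability per bin),
    mimicking the WER distribution seen in real ASR n-best lists."""
    rng = random.Random(seed)
    bins = [[] for _ in range(len(bin_edges) + 1)]
    for h, w in zip(hyps, wers):
        bins[bisect_left(bin_edges, w)].append(h)        # bucket by WER
    sample = []
    for b, p in zip(bins, target_probs):
        quota = min(len(b), round(k * p))                # per-bin quota
        sample.extend(rng.sample(b, quota))
    # top up if rounding or sparse bins left the list short
    leftover = [h for b in bins for h in b if h not in sample]
    sample.extend(rng.sample(leftover, max(0, min(k - len(sample), len(leftover)))))
    return sample[:k]
```

The target histogram would be estimated from real ASR n-best lists; the sketch just enforces it bin by bin.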

SLIDE 23

Combination methods

  • Eval set results for combining real and simulated n-best lists
  • Eval set baseline: 22.4%

Training data (first half / second half)   WER (%)
Real / Real                                21.5
Simulated / Simulated                      21.8
Real / Simulated                           21.6

SLIDE 24

Xu, Khudanpur and Roark (Interspeech, 2012)

  • Pursues a phrasal cohort method for simulating ASR
  • Trains the phrase-based model both with and without reference

– Supervised: ASR candidates align w/reference to derive cohorts
– Unsupervised: pairwise alignments of the 5 best

  • Similar to the cohort extraction algorithm discussed earlier

– Except unweighted and symmetric; extract k-best using LM


  • Examined results with very different recognizer characteristics

SLIDE 25

English CTS Results

  • Same experimental setup as previous English results (RT04)
  • Cohort model trained with and without access to reference

                                     Dev WER   Eval WER
ASR 1-best                           22.8
Supervised DLM                       21.7
Cohort-based DLM; with transcripts   22.2
Cohort-based DLM; w/out transcripts  22.3

SLIDE 26

English CTS Results

  • Same experimental setup as previous English results (RT04)
  • Cohort model trained with and without access to reference

                                     Dev WER   Eval WER
ASR 1-best                           22.8      25.7
Supervised DLM                       21.7      24.8
Cohort-based DLM; with transcripts   22.2      25.2
Cohort-based DLM; w/out transcripts  22.3      25.3

  • Nearly the same results without transcripts

SLIDE 27

Further work in Puyang Xu’s thesis

  • Explored the impact of different baseline ASR systems on utility

– More gain relative to supervised DLM with larger data systems
– Less gain relative to supervised DLM with discriminative AMs

  • System quality can also impact cohort modeling methods

– Less gain without the reference for poor quality systems (otherwise very similar results with and without, across the board)

  • As with other DLM approaches, strong gains with moderate amounts of data, but just modest improvements with more data

SLIDE 28

Overall Summary

  • Confusion set generation (hallucination) yields useful DLM training data

– Have shown system improvements for LVCSR in multiple languages

  • Methods using larger units (morphs; words) show important benefits

– Different models (phone/phrase) will likely be quite complementary

  • Simulation outputs often show mismatches with real outputs in different ways

– E.g., WER distribution over list; helps to control for such mismatch

  • Can train confusion models on unlabeled speech

– In many common scenarios, resulting models nearly as good
– Ultimate utility of methods can vary with amount and type of AM training

  • Like supervised DLM, more work needed on using larger data sets

SLIDE 29

Some directions to follow

  • Combined methods of confusion set generation
  • Combined supervised and semi-supervised DLM training
  • Application to areas like OCR, different confusion modeling

– Text normalization, machine translation

  • General improvements to discriminative language modeling

– Methods do not leverage large data sets as much as they might
– Better feature representations; randomization
– Other perceptron-like direct loss minimization algorithms

SLIDE 30

1-best Hallucination of Jabberwocky

’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

“Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!”

SLIDE 31

1-best Hallucination of Jabberwocky

’Twas brillig, and the slithy toves
    was braille it and sly see toes
Did gyre and gimble in the wabe:
    ire and kimble in that way
All mimsy were the borogoves,
    aw limbs were borough goes
And the mome raths outgrabe.
    and um rath out great

“Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!”

SLIDE 32

1-best Hallucination of Jabberwocky

’Twas brillig, and the slithy toves
    was braille it and sly see toes
Did gyre and gimble in the wabe:
    ire and kimble in that way
All mimsy were the borogoves,
    aw limbs were borough goes
And the mome raths outgrabe.
    and um rath out great

“Beware the Jabberwock, my son!
    be way the japanese bur[ning]- walk my son
The jaws that bite, the claws that catch!
    the jaws that by the clause that catch
Beware the Jubjub bird, and shun
    be way the jew but jew but bird and one
The frumious Bandersnatch!”
    the roomie a band other snatch
