ELEN E6884/COMS 86884 Speech Recognition Lecture 11
Michael Picheny, Ellen Eide, Stanley F. Chen
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
{picheny,eeide,stanchen}@us.ibm.com
17 November 2005
■ Lab 3 handed back today
■ Lab 4 due Sunday midnight
■ next week is Thanksgiving
■ select paper(s) for final presentation by the Monday after
■ aiming for field trip to IBM between Tuesday (12/13) and Thursday
■ same model form used: HMM w/ GMM output distributions
■ (please ask questions during class!)
■ fundamental equation of speech recognition:
ω* = arg maxω P(ω | x) = arg maxω P(ω) P(x | ω)
■ P(ω) — probability distribution over word sequences ω (the language model)
■ P(x | ω) — probability of acoustic feature vectors x given word sequence ω (the acoustic model)
■ helps us disambiguate acoustically ambiguous utterances
■ estimates relative frequency of word sequences ω = w1 · · · wl
■ for very restricted, small-vocabulary domains
■ for large-vocabulary, unrestricted-domain ASR
■ Markov assumption: identity of next word depends only on the last n − 1 words:
P(ω) = ∏_{i=1}^{l} P(wi | w1 · · · wi−1) ≈ ∏_{i=1}^{l} P(wi | wi−n+1 · · · wi−1)
■ maximum likelihood estimation
■ smoothing
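■ a minimal Python sketch (not from the slides) of both steps for a bigram model; the toy corpus and the interpolation weight λ = 0.8 are made up for illustration:

```python
from collections import defaultdict

# Toy training corpus; real models are trained on millions of words.
corpus = [["the", "dog", "ate"], ["the", "dog", "ran"], ["a", "cat", "ran"]]

unigram = defaultdict(int)   # count(w)
bigram = defaultdict(int)    # count(prev, w)
history = defaultdict(int)   # count(prev) as a bigram history
total = 0
for sent in corpus:
    words = ["<s>"] + sent + ["</s>"]
    for w in words[1:]:
        unigram[w] += 1
        total += 1
    for prev, w in zip(words, words[1:]):
        bigram[(prev, w)] += 1
        history[prev] += 1

def p_ml(w, prev):
    """Maximum likelihood estimate P(w | prev); zero for unseen bigrams."""
    return bigram[(prev, w)] / history[prev] if history[prev] else 0.0

def p_interp(w, prev, lam=0.8):
    """Smoothing by linear interpolation with the unigram distribution."""
    return lam * p_ml(w, prev) + (1 - lam) * unigram[w] / total

print(p_ml("ate", "dog"))      # 0.5: seen bigram
print(p_ml("cat", "dog"))      # 0.0: ML assigns zero to unseen bigrams
print(p_interp("cat", "dog"))  # > 0: smoothing removes the zero
```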
■ n-gram models are robust
■ n-gram models are easy to build
■ n-gram models are scalable
■ n-gram models are great!
■ not great at modeling short-distance dependencies
■ do not generalize well
■ collecting more data can’t fix this
■ not great at modeling medium-distance dependencies
■ Fabio example
■ random generation of sentences with model P(ω = w1 · · · wl)
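■ how such samples can be drawn, as a small Python sketch (mine; the bigram probabilities are invented): sample each word from P(wi | wi−1) until the end-of-sentence token appears:

```python
import random

# Hand-made bigram distributions P(next | prev); in practice these
# would come from a trained n-gram model.
bigram_probs = {
    "<s>": {"the": 0.7, "a": 0.3},
    "the": {"dog": 0.6, "cat": 0.4},
    "a":   {"dog": 0.5, "cat": 0.5},
    "dog": {"ate": 0.5, "ran": 0.5},
    "cat": {"ran": 1.0},
    "ate": {"</s>": 1.0},
    "ran": {"</s>": 1.0},
}

def generate():
    """Sample a sentence left to right from the bigram model."""
    word, sent = "<s>", []
    while True:
        dist = bigram_probs[word]
        word = random.choices(list(dist), weights=list(dist.values()))[0]
        if word == "</s>":
            return " ".join(sent)
        sent.append(word)

print(generate())  # e.g., "the dog ran"
```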
AND WITH WHOM IT MATTERS AND IN THE SHORT -HYPHEN TERM AT THE UNIVERSITY OF MICHIGAN IN A GENERALLY QUIET SESSION THE STUDIO EXECUTIVES LAW REVIEW WILL FOCUS ON INTERNATIONAL UNION OF THE STOCK MARKET HOW FEDERAL LEGISLATION
THE LOS ANGELES THE TRADE PUBLICATION SOME FORTY %PERCENT OF CASES ALLEGING GREEN PREPARING FORMS NORTH AMERICAN FREE TRADE AGREEMENT (LEFT-PAREN NAFTA
A MORGAN STANLEY CAPITAL INTERNATIONAL PERSPECTIVE ,COMMA GENEVA
THE NEW YORK MISSILE FILINGS OF BUYERS
■ real sentences tend to “make sense” and be coherent
■ n-gram models don’t seem to know this
■ not great at modeling long-distance dependencies
■ see previous examples
■ in real life, consecutive sentences tend to be on the same topic
■ n-gram models generate each sentence independently
■ not great at modeling short-distance dependencies
■ not great at modeling medium-distance dependencies
■ not great at modeling long-distance dependencies
■ basically, n-gram models are just a dumb idea
■ Unit I: techniques for restricted domains
■ Unit II: techniques for unrestricted domains
■ Unit III: maximum entropy models
■ Unit IV: other directions in language modeling
■ Unit V: an apology to n-gram models
■ we need relevant data to build statistical models
■ ask co-workers to think up what they might say
■ Wizard of Oz data collection
■ get a simple system up fast
■ main issue: data sparsity/lack of generalization
■ point: (handcrafted) grammars were kind of good at this
■ can we combine the robustness of n-gram models with the generalization ability of grammars?
■ use grammars to generate artificial n-gram model training data
■ e.g., identify cities in real training data
■ replace city instances with other legal cities/places (according to the grammar)
■ also, for date and time expressions, airline names, etc.
■ instead of constructing n-gram models on words, build n-gram models on word classes
■ e.g., replace cities and dates in training set with special tokens
■ build n-gram model on new data, e.g., P([DATE] | [CITY] ON)
■ express grammars as weighted FST's
[WFST for a toy [CITY] grammar: state 1 → final state 2 (weight 1) via [CITY]:AUSTIN/0.1 or [CITY]:BOSTON/0.3; state 1 → state 3 via [CITY]:NEW/1; state 3 → state 2 via <epsilon>:YORK/0.4 or <epsilon>:JERSEY/0.2]
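■ a small Python sketch (mine) of how such a model scores a word: the class n-gram predicts the token [CITY], and the grammar's membership weights expand it to a word; the class-bigram probability is invented, while the membership weights mirror the toy FST above:

```python
# Class n-gram scoring: P(word | history) = P(class | history) * P(word | class).
class_bigram = {("TO", "[CITY]"): 0.2}                   # invented
membership = {"[CITY]": {"AUSTIN": 0.1, "BOSTON": 0.3}}  # from the toy FST

def p_word(word, prev, cls="[CITY]"):
    """Probability of a word via its class and class membership weight."""
    return class_bigram[(prev, cls)] * membership[cls][word]

print(p_word("AUSTIN", "TO"))  # 0.2 * 0.1 = 0.02
print(p_word("BOSTON", "TO"))  # 0.2 * 0.3 = 0.06
```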
■ possible implementation
■ grammars embedded within embedded grammars?
■ addresses sparse data issues in n-gram models
■ can handle longer-distance dependencies since whole phrases are treated as single units
■ what about modeling whole-sentence grammaticality?
■ many apps involve computer-human dialogue
■ directed dialogue
■ undirected or mixed initiative dialogue
■ switching LM’s based on context
■ e.g., directed dialogue
■ boost probabilities of entities mentioned before in dialogue?
■ e.g., ASR errors put user in dialogue state they can’t get out of
■ e.g., you ask: IS THIS FLIGHT OK?
■ if we activate specialized grammars/LM’s for different situations . . .
■ can an ASR system detect when it’s wrong?
■ want to reject ASR hypotheses with low confidence
■ how to tell when you have low confidence?
■ better: posterior probability
■ to calculate a reasonably accurate posterior, need to sum over many competing hypotheses, e.g., the paths in a lattice
■ issue: language model weight or acoustic model weight?
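■ a Python sketch (mine) of the usual approximation: normalize weighted total scores over an N-best list; the hypotheses, scores, and LM weight of 12 are all invented:

```python
import math

# Hypothetical N-best list: (words, acoustic log-prob, LM log-prob).
nbest = [
    ("CALL HOME", -120.0, -8.0),
    ("CALL ROME", -121.5, -9.5),
    ("ALL HOME",  -123.0, -9.0),
]

def posteriors(nbest, lm_weight=12.0):
    """Approximate P(hypothesis | x) by normalizing over the N-best list."""
    scores = [am + lm_weight * lm for _, am, lm in nbest]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

for (words, _, _), p in zip(nbest, posteriors(nbest)):
    print(f"{p:.4f}  {words}")
```

■ the weighting question matters here: scaling all log scores down (e.g., dividing by the LM weight rather than multiplying the LM score) flattens the posteriors, which changes the confidence estimates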
■ accurate rejection essential for usable dialogue systems
■ posterior probabilities are more or less state-of-the-art
■ if you think you’re wrong, can you use this information to recover?
■ Unit I: techniques for restricted domains
■ Unit II: techniques for unrestricted domains
■ Unit III: maximum entropy models
■ Unit IV: other directions in language modeling
■ Unit V: an apology to n-gram models
■ short-distance dependencies
■ medium-distance dependencies
■ long-distance dependencies
■ linear interpolation revisited
■ word n-gram models do not generalize well
■ in embedded grammars, some words/phrases are members of the same (hand-specified) class
■ embedded grammars
■ class n-gram model
■ with vocab sizes of 50,000+, don’t want to do this by hand
■ maybe we can do this using statistical methods?
■ possible algorithm (Schütze, 1992)
■ maximum likelihood!
■ naturally groups words occurring in similar contexts
■ directly optimizes an objective function we care about
■ basic algorithm
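■ one possible reading of such an algorithm, as a toy Python sketch (mine, not the exact lecture algorithm): exchange-style reassignment, moving each word to whichever class most improves the class-bigram likelihood:

```python
from collections import defaultdict
import math

# Toy exchange-style class induction: the model is
# P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i).
corpus = "the dog ran the cat ran a dog ate a cat ate".split()
vocab = sorted(set(corpus))
K = 2
cls = {w: i % K for i, w in enumerate(vocab)}  # arbitrary initial classes

def log_likelihood(cls):
    """Class-bigram log-likelihood of the corpus under assignment cls."""
    cc = defaultdict(int)     # count(c_{i-1}, c_i)
    c1 = defaultdict(int)     # count(c_{i-1})
    wc = defaultdict(int)     # count(w)
    csize = defaultdict(int)  # tokens per class
    for w in corpus:
        wc[w] += 1
        csize[cls[w]] += 1
    for prev, w in zip(corpus, corpus[1:]):
        cc[(cls[prev], cls[w])] += 1
        c1[cls[prev]] += 1
    ll = 0.0
    for prev, w in zip(corpus, corpus[1:]):
        p = (cc[(cls[prev], cls[w])] / c1[cls[prev]]) * (wc[w] / csize[cls[w]])
        ll += math.log(p)
    return ll

for _ in range(5):  # a few exchange passes
    for w in vocab:
        cls[w] = max(range(K), key=lambda c: log_likelihood({**cls, w: c}))
print(cls)  # words in similar contexts tend to end up in the same class
```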
■ example induced classes (one class per line):
THE TONIGHT’S SARAJEVO’S JUPITER’S PLATO’S CHILDHOOD’S GRAVITY’S EVOLUTION’S
OF AS
BODES AUGURS BODED AUGURED
HAVE HAVEN’T WHO’VE
DOLLARS BARRELS BUSHELS DOLLARS’ KILOLITERS
HIS SADDAM’S MOZART’S CHRIST’S LENIN’S NAPOLEON’S JESUS’ ARISTOTLE’S DUMMY’S APARTHEID’S FEMINISM’S
ROSE FELL DROPPED GAINED JUMPED CLIMBED SLIPPED TOTALED EASED PLUNGED SOARED SURGED TOTALING AVERAGED RALLIED TUMBLED SLID SANK SLUMPED REBOUNDED PLUMMETED TOTALLED DIPPED FIRMED RETREATED TOTALLING LEAPED SHRANK SKIDDED ROCKETED SAGGED LEAPT ZOOMED SPURTED NOSEDIVED
■ e.g., class trigram model
■ outperforms word n-gram models with small training sets
■ on larger training sets, word n-gram models win (<1% absolute WER difference)
■ can we combine the two?
■ e.g., in smoothing, combining a higher-order n-gram model with a lower-order one
■ linear interpolation
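■ a Python sketch (mine) of linear interpolation, with the weight λ fit on held-out data by a few EM iterations; the stand-in models and data are toy:

```python
def interpolate(p1, p2, lam, w, h):
    """P(w | h) = lam * P1(w | h) + (1 - lam) * P2(w | h)."""
    return lam * p1(w, h) + (1 - lam) * p2(w, h)

def estimate_lambda(p1, p2, heldout, iters=20):
    """EM for the interpolation weight on held-out (history, word) pairs."""
    lam = 0.5
    for _ in range(iters):
        post = 0.0
        for h, w in heldout:
            a = lam * p1(w, h)
            b = (1 - lam) * p2(w, h)
            post += a / (a + b)  # posterior that component 1 generated w
        lam = post / len(heldout)
    return lam

# Toy stand-ins: component 1 is sharp on "the", component 2 is flat.
p1 = lambda w, h: 0.5 if w == "the" else 0.01
p2 = lambda w, h: 0.1
heldout = [(None, "the"), (None, "dog"), (None, "the"), (None, "cat")]
print(estimate_lambda(p1, p2, heldout))  # weight favoring the sharper model
```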
■ linear interpolation — a hammer for combining models
■ small gain over either model alone (<1% absolute WER)
■ state-of-the-art single-domain language model for large training sets
■ conceivably, λ can be history-dependent
■ not well-suited to one-pass decoding
■ smaller than word n-gram models
■ easy to add new words to vocabulary
■ two-pass decoding
[word lattice with alternatives THE/THIS/THUD, DIG/DOG/DOGGY, ATE/EIGHT, MAY/MY]
■ N-best list rescoring?
■ lattice may contain exponential number of paths
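■ a sketch (mine) of N-best list rescoring: strip the old LM contribution from each first-pass score, add the new LM’s score, and re-rank; the hypotheses, scores, and weight are invented:

```python
def rescore(nbest, new_lm_logprob, lm_weight=12.0):
    """nbest: list of (words, total_score, old_lm_logprob) tuples."""
    rescored = []
    for words, total, old_lm in nbest:
        acoustic = total - lm_weight * old_lm          # remove old LM score
        new_total = acoustic + lm_weight * new_lm_logprob(words)
        rescored.append((new_total, words))
    return max(rescored)  # highest-scoring hypothesis wins

nbest = [("THE DOG ATE", -216.0, -8.0), ("THE DOG EIGHT", -214.0, -7.5)]
better_lm = lambda words: -6.0 if words.endswith("ATE") else -12.0
print(rescore(nbest, better_lm))  # the new LM flips the ranking
```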
■ can we just put new LM scores directly on lattice arcs?
[lattice diagrams: originally, state 1 → 2 via THE or THIS and state 2 → 3 via DIG or DOG, so a single arc score cannot reflect both possible histories; after expansion, THE and THIS lead to distinct states and each DIG/DOG arc sees a unique previous word]
■ is there an easy way of doing this?
■ expressing class n-gram model as a WFSA
■ what if WFSA corresponding to LM is too big?
■ have lattice ALM containing LM scores from first pass
■ pretend this is the full language model FSA, do regular decoding
[word lattice with alternatives THE/THIS/THUD, DIG/DOG/DOGGY, ATE/EIGHT, MAY/MY]
■ short-distance dependencies
■ medium-distance dependencies
■ long-distance dependencies
■ linear interpolation revisited
■ n-gram models predict identity of next word . . .
■ important words for prediction may occur in many positions
[parse tree diagram]
■ important words for prediction may occur in many positions
[parse tree diagram]
■ n-gram model predicts saw using words on top
■ maybe we shouldn’t condition on words a fixed number of words back?
■ each constituent has a headword
[parse tree diagram with the headword of each constituent marked]
■ predict next word based on preceding exposed headwords
■ structured language model (Chelba and Jelinek, 2000)
■ come up with grammar rules
■ come up with probabilistic parameterization
■ can extract rules and train probabilities using a treebank
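■ a toy Python sketch (mine) of the treebank step: count rule occurrences and normalize by left-hand side to get maximum likelihood rule probabilities; the one-tree treebank is made up:

```python
from collections import Counter

# Trees are (label, children...) tuples; a leaf child is a plain string.
treebank = [
    ("S", ("NP", ("DT", "the"), ("NN", "dog")),
          ("VP", ("VBD", "ate"))),
]

rule_counts = Counter()
lhs_counts = Counter()

def extract(tree):
    """Count one rule per internal node, then recurse into children."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return  # preterminal -> word; lexical rules could be counted too
    rhs = tuple(c[0] for c in children)
    rule_counts[(label, rhs)] += 1
    lhs_counts[label] += 1
    for c in children:
        extract(c)

for t in treebank:
    extract(t)

# ML rule probability: count(A -> beta) / count(A)
for (lhs, rhs), c in rule_counts.items():
    print(f"P({lhs} -> {' '.join(rhs)}) = {c / lhs_counts[lhs]}")
```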
■ decoding
■ not yet implemented in one-pass decoder
■ evaluated via lattice rescoring
■ um, -cough-, kind of
■ issue: training is expensive
■ lattice rescoring
■ well, can we get gains of both?
■ grammatical language models not yet ready for prime time
■ if you have an exotic LM and need publishable results . . .
■ short-distance dependencies
■ medium-distance dependencies
■ long-distance dependencies
■ linear interpolation revisited
■ observation: words in previous sentences are more likely to occur again in the current one
■ the current formulation of language models, P(ω = w1 · · · wl), treats each sentence independently
■ language model adaptation
■ how to boost probabilities of recently-occurring words?
■ idea: combine static n-gram model with an n-gram model built on recent history (a cache), e.g.
Pcache(wi | wi−2wi−1, w_{i−500}^{i−1}) = λ Pstatic(wi | wi−2wi−1) + (1 − λ) P_{w_{i−500}^{i−1}}(wi | wi−1)
where P_{w_{i−500}^{i−1}} is estimated from the last 500 words
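■ a Python sketch (mine) of the cache idea, using a unigram cache for simplicity; the static model is a stub and λ = 0.9 is invented:

```python
from collections import deque, Counter

class CacheLM:
    """Interpolate a static LM with counts from the last `window` words."""
    def __init__(self, static_prob, window=500, lam=0.9):
        self.static_prob = static_prob     # stands in for P_static(w | history)
        self.cache = deque(maxlen=window)  # recent words
        self.counts = Counter()
        self.lam = lam

    def prob(self, w, history):
        p_cache = self.counts[w] / max(len(self.cache), 1)
        return self.lam * self.static_prob(w, history) + (1 - self.lam) * p_cache

    def observe(self, w):
        if len(self.cache) == self.cache.maxlen:
            self.counts[self.cache[0]] -= 1  # evicted by the append below
        self.cache.append(w)
        self.counts[w] += 1

lm = CacheLM(static_prob=lambda w, h: 0.001)
for w in "SHARES OF IBM ROSE AS IBM POSTED GAINS".split():
    lm.observe(w)
print(lm.prob("IBM", None))  # boosted: IBM occurred twice recently
```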
■ can we improve on cache language models?
■ try to automatically induce which words trigger which other words
■ combining triggers and a static language model:
P(wi | wi−2wi−1, w_{i−500}^{i−1}) = λ Pstatic(wi | wi−2wi−1) + (1 − λ) P_{w_{i−500}^{i−1}}(wi | wi−1)
■ use maximum entropy models (Lau et al., 1993)
■ observations: there are groups of words that are all mutual triggers of one another
■ ⇒ topic language models
■ assign a topic (or topics) to each document in training corpus
■ for each topic, build a topic-specific language model
■ when decoding, guess the current topic and use the corresponding topic LM
■ assigning topics to documents
■ guessing the current topic
■ topic LM’s may be sparse ⇒ smooth, e.g., by mixing all T topic models with a general model:
P(wi | wi−2wi−1) = λ0 Pgeneral(wi | wi−2wi−1) + Σ_{t=1}^{T} λt Pt(wi | wi−2wi−1)
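■ a Python sketch (mine) of such a mixture; the component models and weights are stubs:

```python
def topic_mixture(w, h, p_general, topic_models, weights, general_weight=0.5):
    """Mix topic-specific LMs with a general LM; weights sum to one."""
    p = general_weight * p_general(w, h)
    for p_t, lam_t in zip(topic_models, weights):
        p += (1 - general_weight) * lam_t * p_t(w, h)
    return p

# Stub models: the finance model is sharp on STOCKS, the others flat.
p_general = lambda w, h: 0.001
p_finance = lambda w, h: 0.01 if w == "STOCKS" else 0.0005
p_sports  = lambda w, h: 0.0001
print(topic_mixture("STOCKS", None, p_general, [p_finance, p_sports], [0.9, 0.1]))
```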
■ um, -cough-, kind of
■ cache models
■ trigger models
■ topic models
■ large PP gains, but small WER gains
■ increases system complexity for ASR
■ basically, unclear whether worth the effort
■ short-distance dependencies
■ medium-distance dependencies
■ long-distance dependencies
■ short-distance dependencies
■ medium-distance dependencies
■ long-distance dependencies
■ linear interpolation revisited
■ if short, medium, and long-distance modeling all achieve ∼1% absolute gain, shouldn’t the gains be cumulative?
■ “A Bit of Progress in Language Modeling” (Goodman, 2001)
■ intuitively, it’s clear that humans use short, medium, and long-distance dependencies
■ should get gains from modeling each type of dependency
■ and yet, linear interpolation failed to yield cumulative gains
■ say, unigram cache model:
Pcache(wi | wi−2wi−1, w_{i−500}^{i−1}) = λ Pstatic(wi | wi−2wi−1) + (1 − λ) P_{w_{i−500}^{i−1}}(wi)
■ compute Pcache(FRIEDMAN | KENTUCKY FRIED) when FRIEDMAN occurred recently, say P_{w_{i−500}^{i−1}}(FRIEDMAN) = 0.1
■ observation: with 1 − λ = 0.1, Pcache(FRIEDMAN | wi−2wi−1, w_{i−500}^{i−1}) ≥ 0.01 for any preceding words wi−2wi−1, even KENTUCKY FRIED
■ linear interpolation is like an OR
■ doesn’t seem like correct behavior in this case
■ is there a way of combining things that acts like an AND?
■ Unit I: techniques for restricted domains
■ Unit II: techniques for unrestricted domains
■ Unit III: maximum entropy models
■ Unit IV: other directions in language modeling
■ Unit V: an apology to n-gram models
■ old way
■ new way (Jaynes, 1957)
■ for a model or probability distribution P(x) . . .
■ the entropy H(P) (in bits) of P(·) is
H(P) = − Σx P(x) log2 P(x)
■ deterministic distribution has zero bits of entropy
■ uniform distribution over N elements has log2 N bits of entropy
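■ a quick Python check (mine) of these two extremes:

```python
import math

def entropy(dist):
    """H(P) = -sum_x P(x) log2 P(x), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(entropy([1.0, 0.0, 0.0]))  # deterministic: 0 bits
print(entropy([1/6] * 6))        # uniform over 6: log2(6) ~ 2.585 bits
```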
■ with no constraints, the maximum entropy distribution is the uniform distribution
■ information theoretic interpretation
■ entropy ⇔ uniformness ⇔ least assumptions
■ maximum entropy model given some constraints . . .
■ before we roll it, what distribution would we guess?
■ rolled the die 20 times, and don’t remember everything, but . . .
■ fi(x) are called features
■ choose distribution P(x) such that for each feature fi(x), the expected value of fi under P matches its average over the training data
■ rolled the die 20 times, and don’t remember everything, but . . .
■ the maximum entropy distribution satisfies whatever constraints we remember and is otherwise as uniform as possible
■ as it turns out, maximum entropy models have the following form:
P(x) = (1/Z) ∏i αi^{fi(x)}, where Z normalizes the distribution
■ also called exponential models or log-linear models
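■ a Python sketch (mine) evaluating a model of this form over die outcomes; the features and weights are invented and do not reproduce the slide’s example:

```python
# Log-linear model over die faces: P(x) = (1/Z) * prod_i alpha_i^{f_i(x)}.
outcomes = [1, 2, 3, 4, 5, 6]
features = [lambda x: 1 if x == 1 else 0,      # f_0: face is 1
            lambda x: 1 if x % 2 == 0 else 0]  # f_1: face is even
alphas = [1.75, 2.0]

def unnorm(x):
    """Unnormalized score: product of alpha_i raised to f_i(x)."""
    score = 1.0
    for f, a in zip(features, alphas):
        score *= a ** f(x)
    return score

Z = sum(unnorm(x) for x in outcomes)  # normalization constant
probs = {x: unnorm(x) / Z for x in outcomes}
print(probs)  # face 1 and even faces get boosted probability
```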
■ e.g., for the die example, a model of the form P(x) = (1/Z) ∏i αi^{fi(x)} with
■ α0 = 0.1, α1 = 1.75, α2 = 2
■ as it turns out, maximum entropy models are also maximum likelihood: the maximum entropy distribution is also the distribution of the form P(x) = (1/Z) ∏i αi^{fi(x)} that maximizes the likelihood of the training data
■ likelihood of training data is convex function of the αi
■ motivation for using exponential/log-linear models
■ maximum likelihood perspective is more useful in practice: choose the αi in P(x) = (1/Z) ∏i αi^{fi(x)} to maximize training-data likelihood
■ however, still use the name maximum entropy because it sounds better
■ unigram cache model:
Pcache(wi | wi−2wi−1, w_{i−500}^{i−1}) = λ Pstatic(wi | wi−2wi−1) + (1 − λ) P_{w_{i−500}^{i−1}}(wi)
■ compute Pcache(FRIEDMAN | KENTUCKY FRIED) when FRIEDMAN occurred recently, say P_{w_{i−500}^{i−1}}(FRIEDMAN) = 0.1
■ observation: with 1 − λ = 0.1, Pcache(FRIEDMAN | wi−2wi−1, w_{i−500}^{i−1}) ≥ 0.01 for any preceding words wi−2wi−1, even KENTUCKY FRIED
■ combine through multiplication rather than addition, e.g.
P(wi | wi−2wi−1, w_{i−500}^{i−1}) = (1/Z) Pstatic(wi | wi−2wi−1) × ∏i αi^{fi(wi, w_{i−500}^{i−1})}
■ e.g., a feature f1(wi, w_{i−500}^{i−1}) = 1 if wi occurred in the last 500 words (0 otherwise)
■ where α1 ≈ 10
■ this gets the AND behavior we want
■ can combine in individual features rather than whole models
■ can combine disparate sources of information
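■ a numeric sketch (mine) contrasting the two combination rules on the FRIEDMAN example; the numbers are illustrative:

```python
# AND behavior: multiplying in a trigger factor leaves an impossible
# word impossible, while linear interpolation (OR) does not.
p_static = 1e-9       # P(FRIEDMAN | KENTUCKY FRIED): essentially zero
recent_boost = 10.0   # alpha for "word occurred recently"
cache_prob = 0.1

linear = 0.9 * p_static + 0.1 * cache_prob
multiplicative = p_static * recent_boost  # before renormalization

print(f"interpolated: {linear}")          # ~0.01 despite impossible context
print(f"multiplied:   {multiplicative}")  # still essentially zero
```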
■ 40M words of WSJ training data
■ trained maximum entropy model with . . .
■ 30% reduction in PP
■ training time: 200 computer-days
■ training updates
■ normalization — making probs sum to 1
■ each is appropriate in different situations
■ e.g., when combining models trained on different domains
■ together, they comprise a very powerful tool set for model combination
■ maximum entropy models still too slow for prime time
■ Unit I: techniques for restricted domains
■ Unit II: techniques for unrestricted domains
■ Unit III: maximum entropy models
■ Unit IV: other directions in language modeling
■ Unit V: an apology to n-gram models
■ blah, blah, blah
■ Unit I: techniques for restricted domains
■ Unit II: techniques for unrestricted domains
■ Unit III: maximum entropy models
■ Unit IV: other directions in language modeling
■ Unit V: an apology to n-gram models
■ I didn’t mean what I said about you
■ you know I was kidding when I said you are great to poop on
■ technology
■ users cannot distinguish WER differences of a few percent
■ research developments in language modeling
■ e.g., government evaluations: Switchboard, Broadcast News
■ recent advances
■ modeling medium-to-long-distance dependencies
■ LM gains pale in comparison to acoustic modeling gains
■ n-gram models are just really easy to build
■ doing well involves combining many sources of information
■ evidence that LM’s will help more when WER’s are lower
■ week 12: applications of ASR
■ week 13: final presentations
■ week 14: going to Disneyland