
Advanced Language Modeling (EECS 6870: Speech Recognition, Lecture 10) - PowerPoint PPT Presentation



EECS 6870: Speech Recognition
Lecture 10: Advanced Language Modeling
Bhuvana Ramabhadran, Michael Picheny, Stanley F. Chen
IBM T.J. Watson Research Center, Yorktown Heights, New York, USA
{bhuvana,picheny,stanchen}@us.ibm.com
17 November 2009

Administrivia
- Lab 4 due Thursday, 11:59pm.
- Lab 3 handed back next week.
  - Answers: /user1/faculty/stanchen/e6870/lab3_ans/
- Main feedback from last lecture:
  - Pace a little fast; derivations were "heavy".

Where Are We?
1. Introduction
2. Techniques for Restricted Domains
3. Techniques for Unrestricted Domains
4. Maximum Entropy Models
5. Other Directions in Language Modeling
6. An Apology

Review: Language Modeling
- The Fundamental Equation of Speech Recognition:
  class(x) = arg max_ω P(ω | x) = arg max_ω P(ω) P(x | ω)
- P(ω = w_1 ··· w_l) models frequencies of word sequences w_1 ··· w_l.
- Helps disambiguate acoustically ambiguous utterances,
  e.g., THIS IS HOUR ROOM FOUR A FOR OUR . PERIOD
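As a rough illustration of how the fundamental equation is applied (not code from the lecture), here is a minimal sketch that picks the hypothesis maximizing P(ω) P(x | ω) in log space; the hypothesis list and the two scoring callbacks are assumed inputs.

    def pick_hypothesis(hypotheses, lm_logprob, acoustic_logprob, lm_weight=1.0):
        """Return the word sequence omega maximizing P(omega) * P(x | omega).

        hypotheses       -- candidate word sequences (e.g. tuples of words)
        lm_logprob       -- assumed callback: omega -> log P(omega)
        acoustic_logprob -- assumed callback: omega -> log P(x | omega)
        lm_weight        -- LM scale factor commonly applied in practice
        """
        return max(hypotheses,
                   key=lambda omega: lm_weight * lm_logprob(omega) + acoustic_logprob(omega))

The language model term P(ω) is what lets the decoder prefer the sensible reading of acoustically confusable words such as OUR/HOUR and FOR/FOUR.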

Review: Language Modeling
- Small vocabulary, restricted domains:
  - Write grammar; convert to finite-state acceptor.
  - Or possibly n-gram models.
- Large vocabulary, unrestricted domains:
  - N-gram models all the way.

Review: N-Gram Models
  P(ω = w_1 ··· w_l)
    = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ··· P(w_l | w_1 ··· w_{l-1})
    = ∏_{i=1}^{l} P(w_i | w_1 ··· w_{i-1})
- Markov assumption: identity of next word depends only on last n-1 words, say n=3:
  P(w_i | w_1 ··· w_{i-1}) ≈ P(w_i | w_{i-2} w_{i-1})

Review: N-Gram Models
- Maximum likelihood estimation (a short counting sketch follows this block):
  P_MLE(w_i | w_{i-2} w_{i-1}) = count(w_{i-2} w_{i-1} w_i) / Σ_{w_i} count(w_{i-2} w_{i-1} w_i)
                               = count(w_{i-2} w_{i-1} w_i) / count(w_{i-2} w_{i-1})
- Smoothing: better estimation in sparse data situations.

Spam, Spam, Spam, Spam, and Spam
- N-gram models are robust.
  - Assigns nonzero probs to all word sequences.
  - Handles unrestricted domains.
- N-gram models are easy to build.
  - Can train on plain unannotated text.
  - No iteration required over training corpus.
- N-gram models are scalable.
  - Can build models on billions of words of text, fast.
  - Can use larger n with more data.
- N-gram models are great! Or are they?
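A minimal sketch of the MLE trigram estimate above, assuming sentences arrive as lists of words; the padding symbols and helper name are illustrative, and a real system would smooth rather than return 0 for unseen events.

    from collections import defaultdict

    def train_trigram_mle(sentences):
        """P_MLE(w_i | w_{i-2} w_{i-1}) = count(w_{i-2} w_{i-1} w_i) / count(w_{i-2} w_{i-1})."""
        tri = defaultdict(int)   # counts of (w_{i-2}, w_{i-1}, w_i)
        big = defaultdict(int)   # counts of (w_{i-2}, w_{i-1})
        for words in sentences:
            padded = ["<s>", "<s>"] + words + ["</s>"]
            for i in range(2, len(padded)):
                hist = (padded[i - 2], padded[i - 1])
                big[hist] += 1
                tri[hist + (padded[i],)] += 1
        return lambda w, hist: tri[hist + (w,)] / big[hist] if big[hist] else 0.0

    p = train_trigram_mle([["LET'S", "EAT", "STEAK", "ON", "TUESDAY"]])
    print(p("STEAK", ("LET'S", "EAT")))     # 1.0
    print(p("SIRLOIN", ("LET'S", "EAT")))   # 0.0 -- unseen, hence the need for smoothing

Note that a single counting pass over plain unannotated text suffices; no iteration over the training corpus is required, which is the "easy to build" point above.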

The Dark Side of N-Gram Models
- In fact, n-gram models are deeply flawed.
- Let us count the ways.

What About Short-Distance Dependencies?
- Poor generalization.
  - Training data contains the sentence: LET'S EAT STEAK ON TUESDAY
  - Test data contains the sentence: LET'S EAT SIRLOIN ON THURSDAY
  - The occurrence of STEAK ON TUESDAY doesn't affect the estimate of P(THURSDAY | SIRLOIN ON).
- Collecting more data won't fix this (Brown et al., 1992):
  - 350MW of training data ⇒ 15% of test trigrams unseen.

Medium-Distance Dependencies?
- "Medium-distance" ⇔ within sentence.
- Fabio example:
  FABIO , WHO WAS NEXT IN LINE , ASKED IF THE TELLER SPOKE ...
- Trigram model: P(ASKED | IN LINE)

Medium-Distance Dependencies?
- Random generation of sentences with P(ω = w_1 ··· w_l):
  - Roll a K-sided die where ...
  - Each side s_ω corresponds to a word sequence ω ...
  - And the probability of landing on side s_ω is P(ω).
- Reveals what word sequences the model thinks are likely.
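The die-rolling picture above can be made concrete as a small sampler that draws a sentence word by word from P(w_i | w_{i-2} w_{i-1}); a minimal sketch, assuming a conditional-distribution lookup of the hypothetical form shown (not code from the course):

    import random

    def sample_sentence(next_word_dist, max_len=30):
        """Generate one sentence from a trigram model.

        next_word_dist -- assumed callback: (w_{i-2}, w_{i-1}) -> dict mapping
                          candidate next words to P(w | w_{i-2} w_{i-1});
                          "<s>" pads the start and "</s>" ends the sentence.
        """
        w2, w1, words = "<s>", "<s>", []
        while len(words) < max_len:
            dist = next_word_dist(w2, w1)
            w = random.choices(list(dist), weights=list(dist.values()))[0]
            if w == "</s>":
                break
            words.append(w)
            w2, w1 = w1, w
        return " ".join(words)

Sampling like this from a real trigram model is what produces the WSJ examples shown next.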

Trigram Model, 20M Words of WSJ (sample generated output)
  AND WITH WHOM IT MATTERS AND IN THE SHORT - HYPHEN TERM
  AT THE UNIVERSITY OF MICHIGAN IN A GENERALLY QUIET SESSION
  THE STUDIO EXECUTIVES LAW REVIEW WILL FOCUS ON INTERNATIONAL UNION OF THE STOCK MARKET
  HOW FEDERAL LEGISLATION " DOUBLE - QUOTE SPENDING
  THE LOS ANGELES THE TRADE PUBLICATION
  SOME FORTY % PERCENT OF CASES ALLEGING GREEN PREPARING FORMS
  NORTH AMERICAN FREE TRADE AGREEMENT ( LEFT - PAREN NAFTA ) RIGHT - PAREN , COMMA WOULD MAKE STOCKS
  A MORGAN STANLEY CAPITAL INTERNATIONAL PERSPECTIVE , COMMA GENEVA
  " DOUBLE - QUOTE THEY WILL STANDARD ENFORCEMENT
  THE NEW YORK MISSILE FILINGS OF BUYERS

Medium-Distance Dependencies?
- Real sentences tend to "make sense" and be coherent.
  - Don't end/start abruptly.
  - Have matching quotes.
  - Are about a single subject.
  - Some are even grammatical.
- Why can't n-gram models model this stuff?

Long-Distance Dependencies?
- "Long-distance" ⇔ between sentences.
  - See previous examples.
- In real life, adjacent sentences tend to be on the same topic.
  - Referring to same entities, e.g., Clinton.
  - In a similar style, e.g., formal vs. conversational.
- Why can't n-gram models model this stuff?
  - P(ω = w_1 ··· w_l) = frequency of w_1 ··· w_l as a sentence?

Recap: Shortcomings of N-Gram Models
- Not great at modeling short-distance dependencies.
- Not great at modeling medium-distance dependencies.
- Not great at modeling long-distance dependencies.
- Basically, n-gram models are just a dumb idea.
  - They are an insult to language modeling researchers.
  - Are great for me to poop on.
  - N-gram models, ... you're fired!

Where Are We?
1. Introduction
2. Techniques for Restricted Domains
3. Techniques for Unrestricted Domains
4. Maximum Entropy Models
5. Other Directions in Language Modeling
6. An Apology

Where Are We?
2. Techniques for Restricted Domains
  - Embedded Grammars
  - Using Dialogue State
  - Confidence and Rejection

Improving Short-Distance Modeling
- Issue: data sparsity/lack of generalization.
  I WANT TO FLY FROM BOSTON TO ALBUQUERQUE
  I WANT TO FLY FROM AUSTIN TO JUNEAU
- Point: (handcrafted) grammars are good for this:
  [sentence] → I WANT TO FLY FROM [city] TO [city]
  [city] → AUSTIN | BOSTON | JUNEAU | ...
- Can we combine the robustness of n-gram models with the generalization ability of grammars?

Combining N-Gram Models with Grammars
- Replace cities and dates, say, in the training set with class tokens:
  I WANT TO FLY TO [CITY] ON [DATE]
- Build an n-gram model on the new data, e.g., P([DATE] | [CITY] ON).
  - Instead of an n-gram model over words, we have an n-gram model over words and classes.
- To model the probability of a class expanding to a particular token sequence, use a WFSM:
  [Diagram: a small transducer expanding [CITY], with arcs such as [CITY]:AUSTIN/0.1,
  [CITY]:BOSTON/0.3, and [CITY]:NEW/1 followed by <epsilon>:YORK/0.4 or <epsilon>:JERSEY/0.2]
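A minimal sketch of the class-substitution step described above; the membership lists and function name are illustrative assumptions, and multi-word members (e.g. NEW YORK) would in practice be matched by the class grammar/WFSM rather than a word-by-word lookup.

    CLASSES = {
        "[CITY]": {"AUSTIN", "BOSTON", "JUNEAU", "ALBUQUERQUE"},   # toy membership lists
        "[DATE]": {"TUESDAY", "THURSDAY"},
    }

    def substitute_classes(words):
        """Replace each word with its class token if it belongs to a class."""
        return [next((c for c, members in CLASSES.items() if w in members), w)
                for w in words]

    print(substitute_classes("I WANT TO FLY TO BOSTON ON TUESDAY".split()))
    # ['I', 'WANT', 'TO', 'FLY', 'TO', '[CITY]', 'ON', '[DATE]']

The ordinary n-gram trainer is then run on the rewritten corpus, yielding estimates such as P([DATE] | [CITY] ON).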

The Model
- Given a word sequence w_1 ··· w_l.
- Substitute in classes to get a class/word sequence C = c_1 ··· c_{l'}, e.g.,
  I WANT TO FLY TO [CITY] ON [DATE]
  P(w_1 ··· w_l) = Σ_C ∏_{i=1}^{l'+1} P(c_i | c_{i-2} c_{i-1}) × P(words(c_i) | c_i)
- Sum over all possible ways to substitute in classes? E.g., treat MAY as verb or date?
  - Viterbi approximation (a scoring sketch follows this block).

Implementing Embedded Grammars
- Need the final LM as a WFSA.
  - Convert the word/class n-gram model to a WFSM.
  - Compose with a transducer expanding each class to its corresponding WFSM.
- Static or on-the-fly composition?
  - What if the city grammar contains 100,000 cities?

Recap: Embedded Grammars
- Improves modeling of short-distance dependencies.
- Improves modeling of medium-distance dependencies, e.g.,
  I WANT TO FLY TO WHITE PLAINS AIRPORT IN FIRST CLASS
  I WANT TO FLY TO [CITY] IN FIRST CLASS
- More robust than grammars alone.

Where Are We?
2. Techniques for Restricted Domains
  - Embedded Grammars
  - Using Dialogue State
  - Confidence and Rejection
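A minimal sketch of scoring one class assignment under the model above (the Viterbi-style approximation: a single substitution rather than the sum over all of them). The two probability lookups are assumed callbacks; the decomposition mirrors the formula on the slide.

    def class_lm_logprob(tokens, expansions, class_ngram_logprob, expansion_logprob):
        """Sum over positions of log P(c_i | c_{i-2} c_{i-1}) + log P(words(c_i) | c_i).

        tokens              -- one class/word sequence, e.g.
                               ["I", "WANT", "TO", "FLY", "TO", "[CITY]", "ON", "[DATE]"]
        expansions          -- parallel list: the underlying words each token covers
        class_ngram_logprob -- assumed callback: (c_i, (c_{i-2}, c_{i-1})) -> log prob
        expansion_logprob   -- assumed callback: (words, class_token) -> log prob
        """
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        total = 0.0
        for i in range(2, len(padded)):
            c = padded[i]
            total += class_ngram_logprob(c, (padded[i - 2], padded[i - 1]))
            if c.startswith("["):              # class token: add log P(words(c_i) | c_i)
                total += expansion_logprob(expansions[i - 2], c)
        return total

In the WFSA implementation, the same two factors appear as the word/class n-gram acceptor composed with the per-class expansion transducers.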
