
Natural Language Processing

Language Modeling I

Dan Klein – UC Berkeley

A Speech Example

The Noisy‐Channel Model

  • We want to predict a sentence given acoustics:
    w* = argmax_w P(w | a)
  • The noisy‐channel approach:
    argmax_w P(w | a) = argmax_w P(a | w) P(w)
  • Acoustic model P(a | w): HMMs over word positions with mixtures of Gaussians as emissions
  • Language model P(w): distributions over sequences of words (sentences)
  • Channel picture: a source emits a sentence w with probability P(w), the channel P(a | w) produces the observed acoustics a, and the decoder searches for the best w
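To make the decision rule concrete, here is a minimal sketch of noisy‐channel decoding in Python: pick the candidate transcription that maximizes log P(a | w) + log P(w). The candidate sentences and scores below are made up for illustration; a real system would get them from an acoustic model and a trained language model.

```python
# Minimal sketch of noisy-channel decoding: choose the transcription w that
# maximizes log P(a | w) + log P(w). Candidates and scores are hypothetical.

def decode(candidates, acoustic_logprob, lm_logprob):
    # argmax_w  log P(a | w) + log P(w)
    return max(candidates, key=lambda w: acoustic_logprob(w) + lm_logprob(w))

# Two acoustically confusable candidates with made-up scores.
acoustic = {"recognize speech": -12.0, "wreck a nice beach": -11.5}  # log P(a | w)
language = {"recognize speech": -8.0, "wreck a nice beach": -15.0}   # log P(w)

print(decode(acoustic, acoustic.get, language.get))
# -> "recognize speech": the language model overrides the slightly better acoustic score
```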

ASR Components

Acoustic Confusions

the station signs are in deep in english          ‐14732
the stations signs are in deep in english         ‐14735
the station signs are in deep into english        ‐14739
the station 's signs are in deep in english       ‐14740
the station signs are in deep in the english      ‐14741
the station signs are indeed in english           ‐14757
the station 's signs are indeed in english        ‐14760
the station signs are indians in english          ‐14790
the station signs are indian in english           ‐14799
the stations signs are indians in english         ‐14807
the stations signs are indians and english        ‐14815

Language Models

  • A language model is a distribution over sequences of words (sentences)
  • What’s w? (closed vs. open vocabulary)
  • What’s n? (must sum to one over all lengths)
  • Can have rich structure or be linguistically naive
  • Why language models?
  • Usually the point is to assign high weights to plausible sentences (cf. acoustic confusions)
  • This is not the same as modeling grammaticality


Translation: Codebreaking?

“Also knowing nothing official about, but having guessed and inferred considerable about, the powerful new mechanized methods in cryptography—methods which I believe succeed even when one does not know what language has been coded—one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’ ”

Warren Weaver (1947)

  • The same noisy‐channel picture, now for translation: a source emits an English sentence e with probability P(e), the channel P(f | e) produces the observed foreign sentence f, and the decoder recovers the best e:
    argmax_e P(e | f) = argmax_e P(f | e) P(e)
  • Language model: P(e); translation model: P(f | e)

MT System Components

Other Noisy Channel Models?

  • We’re not doing this only for ASR (and MT)
  • Grammar / spelling correction
  • Handwriting recognition, OCR
  • Document summarization
  • Dialog generation
  • Linguistic decipherment

N‐Gram Models

  • Use chain rule to generate words left‐to‐right
  • Can’t condition on the entire left context
  • N‐gram models make a Markov assumption

P(??? | Turn to page 134 and look at the picture of the)
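A minimal sketch of what the Markov assumption buys us computationally: score a sentence left‐to‐right, but condition each word only on the previous n − 1 words. The cond_prob function is an assumed stand‐in for whatever n‐gram estimator is in use.

```python
import math

# Chain rule with a Markov assumption: each word is conditioned only on the
# previous n-1 words rather than the entire left context.
def sentence_logprob(words, cond_prob, n=3):
    total = 0.0
    for i, w in enumerate(words):
        context = tuple(words[max(0, i - (n - 1)):i])  # truncated history
        total += math.log(cond_prob(w, context))       # log P(w_i | previous n-1 words)
    return total
```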

Empirical N‐Grams

  • How do we know P(w | history)?
  • Use statistics from data (examples using Google N‐Grams)
  • E.g. what is P(door | the)?
  • This is the maximum likelihood estimate: P(door | the) = count(the door) / count(the *)

Training counts:
    198015222   the first
    194623024   the same
    168504105   the following
    158562063   the world
    …
     14112454   the door
  23135851162   the *

P(door | the) = 14112454 / 23135851162 ≈ 0.0006
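The same estimate in code, as a minimal sketch: collect bigram and context counts and divide. The toy corpus below stands in for the Google N‐Gram data.

```python
from collections import Counter

# Maximum likelihood bigram estimate: P_ML(w | prev) = count(prev, w) / count(prev, *)
def mle_bigram(tokens):
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    return lambda w, prev: bigrams[(prev, w)] / contexts[prev]

tokens = "the door opened and the door closed and the world watched".split()
p = mle_bigram(tokens)
print(p("door", "the"))  # 2/3: "the" appears 3 times as a context, "the door" twice
```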


Increasing N‐Gram Order

  • Higher orders capture more dependencies

Bigram counts:
    198015222   the first
    194623024   the same
    168504105   the following
    158562063   the world
    …
     14112454   the door
  23135851162   the *

Trigram counts:
     197302   close the window
     191125   close the door
     152500   close the gap
     116451   close the thread
      87298   close the deal
    3785230   close the *

Bigram model:  P(door | the) = 0.0006
Trigram model: P(door | close the) = 0.05

Increasing N‐Gram Order

4‐gram counts:
     3380   please close the door
     1601   please close the window
     1164   please close the new
     1159   please close the gate
     …
        0   please close the first
    13951   please close the *

Please close the first door on the left.

Sparsity

  • Problems with n‐gram models:
  • New words (open vocabulary)
  • Synaptitute
  • 132,701.03
  • multidisciplinarization
  • Old words in new contexts
  • Aside: Zipf’s Law
  • Types (words) vs. tokens (word occurrences)
  • Broadly: most word types are rare ones
  • Specifically:
  • Rank word types by token frequency
  • Frequency inversely proportional to rank
  • Not special to language: randomly generated character strings have this property (try it!)

  • This law qualitatively (but rarely quantitatively) informs NLP

[Plot: “Fraction Seen” vs. “Number of Words” (up to 1,000,000), for unigrams and bigrams]
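Taking up the “try it!” above, here is a quick sketch that generates random character strings and checks the rank–frequency relationship; the alphabet and corpus size are arbitrary choices.

```python
import random
from collections import Counter

# Generate random text over a tiny alphabet (space ends a "word") and look at
# the rank-frequency pattern of the resulting word types.
random.seed(0)
text = "".join(random.choice("abcd ") for _ in range(200000))
counts = Counter(w for w in text.split() if w)

ranked = sorted(counts.values(), reverse=True)
for rank in (1, 2, 5, 10, 20, 50, 100):
    if rank <= len(ranked):
        freq = ranked[rank - 1]
        print(f"rank {rank:3d}  freq {freq:6d}  freq*rank {freq * rank:7d}")
# Frequency falls steeply with rank while freq*rank varies far less:
# the same qualitative Zipf pattern seen in real word counts.
```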

N‐Gram Estimation

Smoothing

  • We often want to make estimates from sparse statistics:
  • Smoothing flattens spiky distributions so they generalize better:
  • Very important all over NLP, but easy to do badly

P(w | denied the), raw counts (7 total):
  3    allegations
  2    reports
  1    claims
  1    request

P(w | denied the), smoothed (7 total):
  2.5  allegations
  1.5  reports
  0.5  claims
  0.5  request
  2    other

[Histogram: smoothing shaves mass off the seen words (allegations, reports, claims, request) and spreads it over unseen ones (charges, motion, benefits, …)]
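The smoothed numbers on the slide can be reproduced by shaving a constant off each observed count and pooling the shaved mass for unseen events; here is a minimal sketch of that computation (the discount value 0.5 is taken from the slide’s example).

```python
# Minimal sketch reproducing the slide's numbers: shave a constant d off each
# observed count of "denied the ___" and pool the shaved mass as "other".
counts = {"allegations": 3, "reports": 2, "claims": 1, "request": 1}
d = 0.5
total = sum(counts.values())                      # 7 observations

shaved = {w: c - d for w, c in counts.items()}    # 2.5, 1.5, 0.5, 0.5
other_mass = d * len(counts)                      # 2.0 reserved for unseen words

probs = {w: c / total for w, c in shaved.items()}
probs["<other>"] = other_mass / total
print(probs, sum(probs.values()))                 # distribution still sums to 1.0
```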


Likelihood and Perplexity

  • How do we measure LM “goodness”?
  • Shannon’s game: predict the next word

When I eat pizza, I wipe off the _________

  • Formally: define test set (log) likelihood
    log P(X | θ) = Σ_{w in X} log P(w | history, θ)
  • Perplexity: “average per‐word branching factor” (not per‐step)
    perp(X, θ) = exp( − log P(X | θ) / |X| )

Predicted distribution P(w | … wipe off the):
  grease  0.5
  sauce   0.4
  dust    0.05
  …
  mice    0.0001
  …
  the     1e-100

Training counts:
   3516   wipe off the excess
   1034   wipe off the dust
    547   wipe off the sweat
    518   wipe off the mouthpiece
    …
    120   wipe off the grease
      0   wipe off the sauce
      0   wipe off the mice
  28048   wipe off the *
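Here is a minimal sketch of that computation: sum log probabilities over the test tokens, then exponentiate the negative average. The logprob function is an assumed stand‐in for a trained LM.

```python
import math

# Minimal sketch of test-set log likelihood and perplexity. `logprob` is an
# assumed stand-in for a trained LM returning log P(w | history).
def perplexity(test_sentences, logprob):
    total_logprob, num_words = 0.0, 0
    for sentence in test_sentences:
        for i, w in enumerate(sentence):
            total_logprob += logprob(w, sentence[:i])  # log P(w_i | w_1 .. w_{i-1})
            num_words += 1
    return math.exp(-total_logprob / num_words)        # average branching factor
```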

Train, Held‐Out, Test

  • Want to maximize likelihood on test, not training data
  • Empirical n‐grams won’t generalize well
  • Models derived from counts / sufficient statistics require generalization parameters to be tuned on held‐out data to simulate test generalization
  • Set hyperparameters to maximize the likelihood of the held‐out data (usually with grid search or EM)

Training data: counts / parameters come from here
Held‐out data: hyperparameters are tuned here
Test data: evaluate here

Measuring Model Quality (Speech)

  • We really want better ASR (or whatever), not better perplexities
  • For speech, we care about word error rate (WER)
  • Common issue: intrinsic measures like perplexity are easier to use, but extrinsic ones are more credible

Correct answer:     Andy saw a part of the movie
Recognizer output:  And he saw apart of the movie

WER = (insertions + deletions + substitutions) / (length of true sentence) = 4/7 ≈ 57%
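A minimal sketch of the WER computation: word‐level edit distance between reference and hypothesis, divided by the reference length. The slide’s example is used as a check.

```python
# Word error rate via edit distance over word sequences:
# WER = (insertions + deletions + substitutions) / length of the reference.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution / match
    return dist[len(ref)][len(hyp)] / len(ref)

print(wer("andy saw a part of the movie", "and he saw apart of the movie"))
# -> 4/7 ≈ 0.57, matching the slide
```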

Idea 1: Interpolation

Please close the first door on the left.

4‐gram counts (specific but sparse):
     3380   please close the door
     1601   please close the window
     1164   please close the new
     1159   please close the gate
     …
        0   please close the first
    13951   please close the *
  P(first | please close the) = 0.0

3‐gram counts:
     197302   close the window
     191125   close the door
     152500   close the gap
     116451   close the thread
     …
       8662   close the first
    3785230   close the *
  P(first | close the) = 0.002

2‐gram counts (dense but general):
    198015222   the first
    194623024   the same
    168504105   the following
    158562063   the world
    …
  23135851162   the *
  P(first | the) = 0.009

(Linear) Interpolation

  • Simplest way to mix different orders: linear interpolation
    P_interp(w | prev2 prev1) = λ3 · P_ML(w | prev2 prev1) + λ2 · P_ML(w | prev1) + λ1 · P_ML(w),  with λ1 + λ2 + λ3 = 1
  • How to choose lambdas?
  • Should lambda depend on the counts of the histories?
  • Choosing weights: either grid search or EM using held‐out data
  • Better methods have interpolation weights connected to context counts, so you smooth more when you know less
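A minimal sketch of both ideas together, assuming p1, p2, p3 are unigram, bigram, and trigram MLE estimators built from training counts: the interpolated probability, plus a grid search over lambdas on a held‐out list of (context, word) pairs.

```python
import itertools, math

def interpolated(w, context, lambdas, p1, p2, p3):
    """Linearly interpolate unigram, bigram, and trigram estimates."""
    l1, l2, l3 = lambdas
    return (l3 * p3(w, context[-2:]) +
            l2 * p2(w, context[-1:]) +
            l1 * p1(w))

def pick_lambdas(held_out, p1, p2, p3, step=0.1):
    """Grid search: choose lambdas maximizing held-out log likelihood."""
    best, best_ll = None, float("-inf")
    grid = [i * step for i in range(int(1 / step) + 1)]
    for l1, l2 in itertools.product(grid, repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:
            continue
        lambdas = (l1, l2, max(l3, 0.0))
        # floor inside log so zero-probability events don't crash the search
        ll = sum(math.log(max(interpolated(w, ctx, lambdas, p1, p2, p3), 1e-12))
                 for ctx, w in held_out)
        if ll > best_ll:
            best, best_ll = lambdas, ll
    return best
```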

Idea 2: Discounting

  • Observation: N‐grams occur more in training data than they will later

Empirical bigram counts (Church and Gale, 1991):
  Count in 22M words    Future c* (next 22M)
          1                     0.45
          2                     1.25
          3                     2.24
          4                     3.23
          5                     4.21


Absolute Discounting

  • Absolute discounting
  • Reduce numerator counts by a constant d (e.g. 0.75)
  • Maybe have a special discount for small counts
  • Redistribute the “shaved” mass to a model of new events
  • Example formulation (standard interpolated form):
    P_AD(w | prev) = max(c(prev, w) − d, 0) / c(prev) + α(prev) · P_backoff(w)
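A minimal sketch of that formulation for a bigram model, with the shaved mass redistributed through a unigram backoff; the value d = 0.75 and the unigram backoff are assumptions matching the bullets above.

```python
from collections import Counter

# Minimal sketch of absolute discounting for a bigram model: subtract d from
# every observed bigram count and give the shaved mass to a unigram backoff.
def absolute_discount_bigram(tokens, d=0.75):
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter(tokens[:-1])
    unigram = Counter(tokens)
    num_tokens = sum(unigram.values())

    def prob(w, prev):
        total = context_totals[prev]
        if total == 0:                              # unseen context: unigram only
            return unigram[w] / num_tokens
        seen_types = sum(1 for (p, _) in bigrams if p == prev)
        alpha = d * seen_types / total              # mass shaved off this context
        discounted = max(bigrams[(prev, w)] - d, 0) / total
        return discounted + alpha * (unigram[w] / num_tokens)

    return prob
```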

Idea 3: Fertility

  • Shannon game: “There was an unexpected _____”
  • “delay”?
  • “Francisco”?
  • Context fertility: number of distinct context types that a word occurs in

  • What is the fertility of “delay”?
  • What is the fertility of “Francisco”?
  • Which is more likely in an arbitrary new context?

Kneser‐Ney Smoothing

  • Kneser‐Ney smoothing combines two ideas
  • Discount and reallocate like absolute discounting
  • In the backoff model, word probabilities are proportional to context fertility, not frequency:
    P_KN(w) ∝ |{w′ : c(w′, w) > 0}|
  • Theory and practice
  • Practice: KN smoothing has been repeatedly proven both effective and efficient
  • Theory: KN smoothing as approximate inference in a hierarchical Pitman‐Yor process [Teh, 2006]

Kneser‐Ney Details*

  • All orders recursively discount and back‐off:
    P_k(w | prev_{k−1}) = max(c′(prev_{k−1} w) − d, 0) / Σ_{w′} c′(prev_{k−1} w′) + α(prev_{k−1}) · P_{k−1}(w | prev_{k−2})
  • Alpha is computed to make the probability normalize (see if you can figure out an expression).
  • For the highest order, c′ is the token count of the n‐gram. For all others it is the context fertility of the n‐gram:
    c′(x) = c(x) for the highest order;  c′(x) = |{u : c(u x) > 0}| otherwise
  • The unigram base case does not need to discount.
  • Variants are possible (e.g. different d for low counts)
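A minimal sketch of the bigram case, assuming an in‐memory token list: absolute discounting at the top order, with the lower‐order distribution proportional to context fertility rather than frequency.

```python
from collections import Counter, defaultdict

# Minimal sketch of interpolated Kneser-Ney for bigrams: absolute discounting
# at the top order, with the lower-order model proportional to context
# fertility |{w': c(w', w) > 0}| rather than to raw unigram frequency.
def kneser_ney_bigram(tokens, d=0.75):
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter(tokens[:-1])
    left_contexts = defaultdict(set)    # word -> set of contexts it follows
    followers = defaultdict(set)        # context -> set of words that follow it
    for prev, w in bigrams:
        left_contexts[w].add(prev)
        followers[prev].add(w)
    num_bigram_types = len(bigrams)

    def prob(w, prev):
        p_cont = len(left_contexts[w]) / num_bigram_types   # fertility unigram
        total = context_totals[prev]
        if total == 0:                                       # unseen context
            return p_cont
        discounted = max(bigrams[(prev, w)] - d, 0) / total
        alpha = d * len(followers[prev]) / total             # shaved mass
        return discounted + alpha * p_cont

    return prob
```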

Idea 4: Big Data

There’s no data like more data.

What Actually Works?

  • Trigrams and beyond:
  • Unigrams, bigrams generally useless
  • Trigrams much better
  • 4‐, 5‐grams and more are really useful in MT, but gains are more limited for speech
  • Discounting
  • Absolute discounting, Good‐Turing, held‐out estimation, Witten‐Bell, etc…
  • Context counting
  • Kneser‐Ney construction of lower‐order models
  • See [Chen+Goodman] reading for tons of graphs…

[Graph from Joshua Goodman]


Data >> Method?

  • Having more data is better…
  • … but so is using a better estimator
  • Another issue: N > 3 has huge costs in speech recognizers

[Plot: entropy vs. n‐gram order (1 to 20) for Katz and Kneser‐Ney smoothing, trained on 100,000, 1,000,000, 10,000,000, and all words]

Tons of Data?

[Brants et al, 2007]

What about…

Unknown Words?

  • What about totally unseen words?
  • Most LM applications are closed vocabulary
  • ASR systems will only propose words that are in their pronunciation dictionary
  • MT systems will only propose words that are in their phrase tables (modulo special models for numbers, etc.)
  • In principle, one can build open vocabulary LMs
  • E.g. models over character strings rather than words
  • Back‐off needs to go down into a “generate new word” model
  • Typically if you need this, a high‐order character model is almost as good

Grammar?

  • The N‐Gram assumption hurts one’s inner linguist!
  • Many linguistic arguments that language isn’t regular
  • Long‐distance effects: “The computer which I had just put into the machine room on the fifth floor ___.”
  • Recursive structure
  • Answers
  • N‐grams only model local correlations, but they get them all
  • As N increases, they catch even more correlations
  • N‐gram models scale much more easily than structured LMs
  • Not convinced?
  • Can build LMs out of our grammar models (later in the course)
  • Take any generative model with words at the bottom and marginalize out the other variables

What Gets Captured?

  • Bigram model:
  • [texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen]
  • [outside, new, car, parking, lot, of, the, agreement, reached]
  • [this, would, be, a, record, november]
  • PCFG model:
  • [This, quarter, ‘s, surprisingly, independent, attack, paid, off, the, risk, involving, IRS, leaders, and, transportation, prices, .]
  • [It, could, be, announced, sometime, .]
  • [Mr., Toseland, believes, the, average, defense, economy, is, drafted, from, slightly, more, than, 12, stocks, .]


Scaling Up?

  • There’s a lot of training data out there…

… next class we’ll talk about how to make it fit.

Other Techniques?

  • Lots of other techniques
  • Maximum entropy LMs (soon)
  • Neural network LMs (soon)
  • Syntactic / grammar‐structured LMs (much later)