Language models Chapter 3 in Martin/Jurafsky Language model as a - PDF document

10/22/19 Language models Chapter 3 in Martin/Jurafsky

Language model as a generative model
• Choose a random bigram (<s>, w) according to its probability
• Now choose a random bigram (w, x) according to its probability
• And so on until we choose </s>
• Then string the words together

<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>

I want to eat Chinese food
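The sampling procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the toy corpus and the names `train_bigrams` and `generate` are assumptions made up for the example.

```python
import random
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count bigrams, padding each sentence with <s> and </s>."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts

def generate(counts, rng, max_len=50):
    """Sample bigrams (prev, w) in proportion to their counts until </s>."""
    prev, words = "<s>", []
    for _ in range(max_len):
        candidates = list(counts[prev].keys())
        weights = list(counts[prev].values())
        word = rng.choices(candidates, weights=weights, k=1)[0]
        if word == "</s>":
            break
        words.append(word)
        prev = word
    # String the sampled words together.
    return " ".join(words)

# Hypothetical toy corpus, chosen only to echo the slide's example.
toy_corpus = [
    "I want to eat Chinese food",
    "I want to eat lunch",
    "I want Chinese food",
]
counts = train_bigrams(toy_corpus)
sentence = generate(counts, random.Random(0))
print(sentence)
```

Sampling in proportion to the counts of bigrams that share a first word is the same as sampling from the MLE conditional distribution P(w | prev).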

Approximating Shakespeare

1-gram
– To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
– Hill he late speaks; or! a more to leg less first you enter
2-gram
– Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
– What means, sir. I confess she? then all sorts, he is trim, captain.
3-gram
– Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, ’tis done.
– This shall forbid it should be branded, if renown made it empty.
4-gram
– King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv’d in;
– It cannot be but so.

Figure 4.3 Eight sentences randomly generated from four N-grams computed from Shakespeare’s works.

Shakespeare as a corpus
• N = 884,647 tokens, V = 29,066
• Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
  – So 99.96% of the possible bigrams were never seen (have zero entries in the table)
• Quadrigrams worse: What's coming out looks like Shakespeare because it is Shakespeare
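The sparsity figures quoted for the Shakespeare corpus can be checked with a short calculation. The variable names below are illustrative only; the numbers are the ones given on the slide.

```python
V = 29_066                  # vocabulary size from the slide
seen_bigram_types = 300_000 # bigram types observed in Shakespeare

possible = V ** 2           # every ordered pair of vocabulary words
unseen_fraction = 1 - seen_bigram_types / possible

print(f"{possible:,} possible bigrams")      # 844,832,356 possible bigrams
print(f"{unseen_fraction:.2%} never seen")   # 99.96% never seen
```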

The Wall Street Journal is not Shakespeare

1-gram
– Months the my and issue of year foreign new exchange’s september were recession exchange new endorsed a acquire to six executives
2-gram
– Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one point five percent of U. S. E. has already old M. X. corporation of living on information such as more frequently fishing to keep her
3-gram
– They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions

Figure 4.4 Three sentences randomly generated from three N-gram models computed from …

The perils of overfitting
• N-grams only work well for word prediction if the test corpus looks like the training corpus
  – In real life, it often doesn’t
  – We need to train robust models that generalize!
  – Zeros get in the way of generalization
    • Things that don’t ever occur in the training set
    • But occur in the test set

Zeros
• Training set:
  … denied the allegations
  … denied the reports
  … denied the claims
  … denied the request
• Test set:
  … denied the offer
  … denied the loan

P(“offer” | denied the) = 0

Zero probability bigrams
• Bigrams with zero probability
  – mean that we will assign 0 probability to the test set!
• And hence we cannot compute perplexity (can’t divide by 0)!
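The effect of a single zero can be reproduced with the slide's own training and test phrases. The sketch below uses an unsmoothed MLE estimate of P(w | denied the); the helper names `p_mle`, `log_prob`, and `perplexity` are assumptions for illustration.

```python
import math
from collections import Counter

train = ["denied the allegations", "denied the reports",
         "denied the claims", "denied the request"]
test = ["denied the offer", "denied the loan"]

# Count which words follow the context "denied the" in training.
context_counts = Counter(s.split()[-1] for s in train)
total = sum(context_counts.values())

def p_mle(word):
    """Unsmoothed MLE estimate of P(word | denied the)."""
    return context_counts[word] / total

def log_prob(sentences):
    """Sum of log probabilities; -inf as soon as one event has P = 0."""
    result = 0.0
    for s in sentences:
        p = p_mle(s.split()[-1])
        result += math.log(p) if p > 0 else -math.inf
    return result

def perplexity(sentences):
    """exp of the negative average log probability; infinite if any P = 0."""
    lp = log_prob(sentences)
    return math.inf if lp == -math.inf else math.exp(-lp / len(sentences))

print(p_mle("reports"))   # 0.25
print(p_mle("offer"))     # 0.0
print(log_prob(test))     # -inf
print(perplexity(test))   # inf
```

One unseen event is enough to drive the probability of the whole test set to zero, which is why perplexity cannot be computed without smoothing.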

The intuition of smoothing
• When we have sparse statistics: the counts for P(w | denied the) cover only allegations, reports, claims, request
• Steal probability mass to generalize better: some mass moves to unseen words such as outcome, attack, man

Add-one estimation
• Also called Laplace smoothing
• Pretend we saw each word one more time than we did
• Just add one to all the counts!
• MLE estimate:

$$P_{\text{MLE}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$

• Add-1 estimate:

$$P_{\text{Add-1}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$$
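As a minimal sketch of the add-1 estimate, the code below smooths bigram counts taken from the "denied the" phrases used earlier in the deck. The vocabulary (which includes the unseen words "offer" and "loan") and the function name `add_one_prob` are assumptions made for the example.

```python
from collections import Counter

train = ["denied the allegations", "denied the reports",
         "denied the claims", "denied the request"]
# Assumed closed vocabulary; "offer" and "loan" never occur in training.
vocab = {"denied", "the", "allegations", "reports",
         "claims", "request", "offer", "loan"}

bigram_counts = Counter()
context_counts = Counter()   # c(w_{i-1}): how often a word starts a bigram
for sentence in train:
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def add_one_prob(prev, word):
    """(c(prev, word) + 1) / (c(prev) + V), with V the vocabulary size."""
    V = len(vocab)
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + V)

print(add_one_prob("the", "allegations"))  # (1 + 1) / (4 + 8)
print(add_one_prob("the", "offer"))        # (0 + 1) / (4 + 8), no longer zero
print(sum(add_one_prob("the", w) for w in vocab))  # sums to 1 up to rounding
```

Adding V to the denominator is what keeps the smoothed estimates a proper distribution: each of the V possible next words received one extra count.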

Berkeley Restaurant Corpus: Laplace smoothed bigram counts

Laplace-smoothed bigrams

Reconstituted counts compared with raw bigram counts

Add-1 estimation is a blunt instrument
• So add-1 isn’t used for N-grams:
  – We’ll see better methods
• But add-1 is used to smooth other NLP models
  – For text classification
  – In domains where the number of zeros isn’t so huge.

Backoff and Interpolation
• Sometimes it helps to use less context
  – Condition on less context for contexts you haven’t learned much about
• Interpolation:
  – mix unigram, bigram, trigram

Linear Interpolation
• Simple interpolation

$$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n), \qquad \sum_i \lambda_i = 1$$

• Lambdas can also be made conditional on context
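Simple linear interpolation is just a weighted sum of the three estimates. The sketch below assumes the trigram, bigram, and unigram probabilities have already been computed; the function name `interpolate` and the example numbers are hypothetical.

```python
def interpolate(p_tri, p_bi, p_uni, lambdas):
    """Mix trigram, bigram and unigram estimates with weights summing to 1."""
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("lambdas must sum to 1")
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# An unseen trigram (P = 0) still receives probability from the
# lower-order models; the numbers here are invented for illustration.
p_hat = interpolate(p_tri=0.0, p_bi=0.1, p_uni=0.01, lambdas=(0.6, 0.3, 0.1))
print(p_hat)  # roughly 0.031
```

Because the weights sum to 1 and each component is a probability distribution, the mixture is one as well.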

How to set the lambdas?
• Use a hold-out corpus: the data is split into Training Data, Held-Out Data, and Test Data
• Choose λs to maximize the probability of held-out data:
  – Fix the N-gram probabilities (on the training data)
  – Then search for λs that give largest probability to held-out set:

$$\log P(w_1 \ldots w_n \mid M(\lambda_1 \ldots \lambda_k)) = \sum_i \log P_{M(\lambda_1 \ldots \lambda_k)}(w_i \mid w_{i-1})$$

Unknown words: Open versus closed vocabulary tasks
• If we know all the words in advance
  – Vocabulary is fixed
  – Closed vocabulary task
• Often we don’t know this
  – Out Of Vocabulary = OOV words
  – Open vocabulary task
• Instead: create an unknown word token <UNK>
  – Training of <UNK> probabilities
    • Create a fixed lexicon L of size V
    • At text normalization phase, any training word not in L changed to <UNK>
    • Now we train its probabilities like a normal word
  – At decoding time
    • If text input: Use UNK probabilities for any word not in training
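Returning to the question of how to set the lambdas, the held-out procedure can be sketched as a grid search. This is a simplified illustration under stated assumptions: it interpolates only a bigram and a unigram model with a single weight `lam`, uses invented toy data, and the names `count_ngrams` and `held_out_log_prob` are hypothetical.

```python
import math
from collections import Counter

def count_ngrams(sentences):
    """Collect unigram, bigram and context counts from training data."""
    unigrams, bigrams, contexts = Counter(), Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens[1:])          # <s> is never predicted
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            contexts[prev] += 1
    return unigrams, bigrams, contexts

def held_out_log_prob(lam, model, held_out):
    """Sum of log P(w_i | w_{i-1}) under lam * bigram + (1 - lam) * unigram."""
    unigrams, bigrams, contexts = model
    total_tokens = sum(unigrams.values())
    log_prob = 0.0
    for s in held_out:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            p_bi = bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
            p_uni = unigrams[word] / total_tokens
            p = lam * p_bi + (1 - lam) * p_uni
            if p == 0.0:
                return -math.inf
            log_prob += math.log(p)
    return log_prob

# Toy data, invented for the example.
train = ["I want to eat Chinese food", "I want to eat lunch", "I want Chinese food"]
held_out = ["I want lunch", "I want to eat food"]

model = count_ngrams(train)          # N-gram probabilities are fixed here
grid = [i / 10 for i in range(1, 10)]
best_lam = max(grid, key=lambda lam: held_out_log_prob(lam, model, held_out))
print(best_lam, held_out_log_prob(best_lam, model, held_out))
```

The counts come only from the training data; the held-out sentences are used solely to compare candidate weights. A grid search is one simple choice of search method, not necessarily the one used in practice.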

Web-scale N-gram datasets
• How to deal with, e.g., Google N-gram corpus
• Pruning
  – Only store N-grams with count > threshold.
• Efficiency
  – Efficient data structures like tries
  – Bloom filters: approximate language models
  – Store words as indexes, not strings
  – Quantize probabilities (4-8 bits instead of 8-byte float)

Advanced language modeling
• Discriminative models:
  – choose n-gram weights to improve a task, not to fit the training set
• Caching models
  – Recently used words are more likely to appear

$$P_{\text{CACHE}}(w \mid \text{history}) = \lambda P(w_i \mid w_{i-2} w_{i-1}) + (1 - \lambda) \frac{c(w \in \text{history})}{|\text{history}|}$$

  – These perform very poorly for speech recognition (why?)
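The caching model formula can be illustrated with a small function. This is a sketch under assumptions: the trigram probability is passed in as a plain number, the empty-history fallback is a choice made here rather than something the slides specify, and the name `cache_prob` and the example values are hypothetical.

```python
from collections import Counter

def cache_prob(word, history, p_ngram, lam):
    """lam * P(w | context) + (1 - lam) * c(w in history) / |history|."""
    if not history:
        # Assumption: with no history yet, fall back to the N-gram estimate.
        return p_ngram
    cache_estimate = Counter(history)[word] / len(history)
    return lam * p_ngram + (1 - lam) * cache_estimate

# Invented example: "bank" appears 2 times in a 7-word history.
history = "the bank raised rates and the bank".split()
p = cache_prob("bank", history, p_ngram=0.001, lam=0.9)
print(p)  # 0.9 * 0.001 + 0.1 * 2/7, roughly 0.0295
```

A word that has just been used gets a large boost relative to its N-gram probability, which is the behaviour the slide describes.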
