

SLIDE 1

Language Modeling

CSE392 - Spring 2019 Special Topic in CS

SLIDE 2

Task

  • Language Modeling (auto-complete)
  • How? Probabilistic Modeling
    ○ ML: Logistic Regression
    ○ Probability Theory

SLIDE 3

Language Modeling

  • Assigning a probability to sequences of words.

Version 1: Compute P(w1, w2, w3, w4, w5) = P(W), the probability of a sequence of words.

SLIDE 4

Language Modeling

  • Assigning a probability to sequences of words.

Version 1: Compute P(w1, w2, w3, w4, w5) = P(W), the probability of a sequence of words.

Version 2: Compute P(w5 | w1, w2, w3, w4) = P(wn | w1, w2, …, wn-1), the probability of the next word given its history.

SLIDE 5

Language Modeling

Version 1: Compute P(w1, w2, w3, w4, w5) = P(W), the probability of a sequence of words.

  P(He ate the cake with the fork) = ?

Version 2: Compute P(w5 | w1, w2, w3, w4) = P(wn | w1, w2, …, wn-1), the probability of the next word given its history.

  P(fork | He ate the cake with the) = ?

SLIDE 6

Language Modeling

Version 1: Compute P(w1, w2, w3, w4, w5) = P(W), the probability of a sequence of words.

  P(He ate the cake with the fork) = ?

Version 2: Compute P(w5 | w1, w2, w3, w4) = P(wn | w1, w2, …, wn-1), the probability of the next word given its history.

  P(fork | He ate the cake with the) = ?

Applications:

  • Auto-complete: What word is next?
  • Machine Translation: Which translation is most likely?
  • Spell Correction: Which word is most likely given an error?
  • Speech Recognition: What did they just say? (“eyes aw of an”)

(example from Jurafsky, 2017)

SLIDE 7

Language Modeling

Version 1: Compute P(w1, w2, w3, w4, w5) = P(W), the probability of a sequence of words.

  P(He ate the cake with the fork) = ?

Version 2: Compute P(w5 | w1, w2, w3, w4) = P(wn | w1, w2, …, wn-1), the probability of the next word given its history.

  P(fork | He ate the cake with the) = ?

SLIDE 8

Simple Solution

Version 1: Compute P(w1, w2, w3, w4, w5) = P(W), the probability of a sequence of words.

  P(He ate the cake with the fork) = count(He ate the cake with the fork) / count(* * * * * * *)

SLIDE 9

Simple Solution: The Maximum Likelihood Estimate

Version 1: Compute P(w1, w2, w3, w4, w5) = P(W), the probability of a sequence of words.

  P(He ate the cake with the fork) = count(He ate the cake with the fork) / count(* * * * * * *)

  where count(* * * * * * *) is the total number of observed 7-grams.
SLIDE 10

Simple Solution: The Maximum Likelihood Estimate

Version 1: Compute P(w1, w2, w3, w4, w5) = P(W), the probability of a sequence of words.

  P(He ate the cake with the fork) = count(He ate the cake with the fork) / count(* * * * * * *)

Version 2:

  P(fork | He ate the cake with the) = count(He ate the cake with the fork) / count(He ate the cake with the)

SLIDE 11

Simple Solution: The Maximum Likelihood Estimate

Version 1: Compute P(w1, w2, w3, w4, w5) = P(W), the probability of a sequence of words.

  P(He ate the cake with the fork) = count(He ate the cake with the fork) / count(* * * * * * *)

Version 2:

  P(fork | He ate the cake with the) = count(He ate the cake with the fork) / count(He ate the cake with the)

Problem: even the Web isn’t large enough to enable good estimates of most phrases.
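As a minimal sketch of why this breaks down, the code below estimates the whole-sequence probability by counting exact 7-grams; unless the precise phrase occurred in training, the estimate is zero. The toy corpus here is a made-up assumption, not the course's data.

```python
# Naive whole-sequence MLE: count(exact 7-gram) / count(all observed 7-grams)
from collections import Counter

corpus = [  # hypothetical tokenized training sentences
    ["he", "ate", "the", "cake", "with", "the", "fork"],
    ["she", "ate", "the", "soup", "with", "a", "spoon"],
]

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

seven_grams = Counter(g for sent in corpus for g in ngrams(sent, 7))
total = sum(seven_grams.values())

query = ("he", "ate", "the", "cake", "with", "the", "fork")
print(seven_grams[query] / total if total else 0.0)
# Any 7-gram not seen verbatim in training gets probability 0.
```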

SLIDE 12

Problem: even the Web isn’t large enough to enable good estimates of most phrases.

Solution: Estimate from shorter sequences, using more sophisticated probability theory.

SLIDE 13

Problem: even the Web isn’t large enough to enable good estimates of most phrases.

Solution: Estimate from shorter sequences, using more sophisticated probability theory.

  P(B|A) = P(B, A) / P(A)  ⇔  P(A)P(B|A) = P(B, A) = P(A, B)

Example from (Jurafsky, 2017)

SLIDE 14

Problem: even the Web isn’t large enough to enable good estimates of most phrases.

Solution: Estimate from shorter sequences, using more sophisticated probability theory.

  P(B|A) = P(B, A) / P(A)  ⇔  P(A)P(B|A) = P(B, A) = P(A, B)

  P(A, B, C) = P(A)P(B|A)P(C|A, B)

Example from (Jurafsky, 2017)

SLIDE 15

Problem: even the Web isn’t large enough to enable good estimates of most phrases.

Solution: Estimate from shorter sequences, using more sophisticated probability theory.

  P(B|A) = P(B, A) / P(A)  ⇔  P(A)P(B|A) = P(B, A) = P(A, B)

  P(A, B, C) = P(A)P(B|A)P(C|A, B)

The Chain Rule:

  P(X1, X2, …, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)…P(Xn|X1, …, Xn-1)

Example from (Jurafsky, 2017)

SLIDE 16

Problem: even the Web isn’t large enough to enable good estimates of most phrases.

Solution: Estimate from shorter sequences, using more sophisticated probability theory.

  P(B|A) = P(B, A) / P(A)  ⇔  P(A)P(B|A) = P(B, A) = P(A, B)

  P(A, B, C) = P(A)P(B|A)P(C|A, B)

The Chain Rule:

  P(X1, X2, …, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)…P(Xn|X1, …, Xn-1)

SLIDE 17

Problem: even the Web isn’t large enough to enable good estimates of most phrases.

Solution: Estimate from shorter sequences, using more sophisticated probability theory.

  P(B|A) = P(B, A) / P(A)  ⇔  P(A)P(B|A) = P(B, A) = P(A, B)

  P(A, B, C) = P(A)P(B|A)P(C|A, B)

The Chain Rule:

  P(X1, X2, …, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)…P(Xn|X1, …, Xn-1)

Markov Assumption:

SLIDE 18

Problem: even the Web isn’t large enough to enable good estimates of most phrases.

Solution: Estimate from shorter sequences, using more sophisticated probability theory.

  P(B|A) = P(B, A) / P(A)  ⇔  P(A)P(B|A) = P(B, A) = P(A, B)

  P(A, B, C) = P(A)P(B|A)P(C|A, B)

The Chain Rule:

  P(X1, X2, …, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)…P(Xn|X1, …, Xn-1)

Markov Assumption:

  P(Xn | X1, …, Xn-1) ≈ P(Xn | Xn-k, …, Xn-1), where k < n
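To make the approximation concrete, here is the chain-rule expansion and its bigram (k = 1) Markov approximation for a short example sentence; this is a standard worked expansion, not text taken from the slide.

```latex
\begin{aligned}
P(\text{He ate the cake})
 &= P(\text{He})\,P(\text{ate}\mid\text{He})\,P(\text{the}\mid\text{He ate})\,P(\text{cake}\mid\text{He ate the}) \\
 &\approx P(\text{He})\,P(\text{ate}\mid\text{He})\,P(\text{the}\mid\text{ate})\,P(\text{cake}\mid\text{the})
 \qquad (k = 1)
\end{aligned}
```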

SLIDE 19

Problem: even the Web isn’t large enough to enable good estimates of most phrases.

Solution: Estimate from shorter sequences, using more sophisticated probability theory.

  P(B|A) = P(B, A) / P(A)  ⇔  P(A)P(B|A) = P(B, A) = P(A, B)

  P(A, B, C) = P(A)P(B|A)P(C|A, B)

The Chain Rule:

  P(X1, X2, …, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)…P(Xn|X1, …, Xn-1)

Markov Assumption:

  P(Xn | X1, …, Xn-1) ≈ P(Xn | Xn-k, …, Xn-1), where k < n

What about Logistic Regression? Y = next word; P(Y|X) = P(Xn | X1, X2, X3, ...). Not a terrible option, but X1 through Xn-1 would be modeled as independent dimensions. Let’s revisit this later.

SLIDE 20

Unigram Model: k = 0

Problem: even the Web isn’t large enough to enable good estimates of most phrases.

  P(B|A) = P(B, A) / P(A)  ⇔  P(A)P(B|A) = P(B, A) = P(A, B)

  P(A, B, C) = P(A)P(B|A)P(C|A, B)

The Chain Rule:

  P(X1, X2, …, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)…P(Xn|X1, …, Xn-1)

Markov Assumption:

  P(Xn | X1, …, Xn-1) ≈ P(Xn | Xn-k, …, Xn-1), where k < n

With k = 0, the unigram model ignores all history: P(Xn | X1, …, Xn-1) ≈ P(Xn).

SLIDE 21

Bigram Model: k = 1

Problem: even the Web isn’t large enough to enable good estimates of most phrases.

Example from (Jurafsky, 2017)

Markov Assumption:

P(Xn | X1, …, Xn-1) ≈ P(Xn | Xn-k, …, Xn-1), where k < n

Example generated sentence:

  outside, new, car, parking, lot, of, the, agreement, reached

P(X1 = “outside”, X2 = “new”, X3 = “car”, …) ≈ P(X1 = “outside”) * P(X2 = “new” | X1 = “outside”) * P(X3 = “car” | X2 = “new”) * …

SLIDE 22

Language Modeling

Building a model (or system / API) that can answer the following:

a sequence of natural language

Language Model

How common is this sequence? What is the next word in the sequence?

SLIDE 23

Language Modeling

Building a model (or system / API) that can answer the following:

a sequence of natural language

Language Model

How common is this sequence? What is the next word in the sequence?

How to build?

SLIDE 24

Language Modeling

Building a model (or system / API) that can answer the following:

a sequence of natural language

Language Model

How common is this sequence? What is the next word in the sequence?

Training Corpus

training (fit, learn)

How to build?

SLIDE 25

Language Modeling

Building a model (or system / API) that can answer the following:

a sequence of natural language

Language Model

How common is this sequence? What is the next word in the sequence?

Training Corpus

training (fit, learn)

SLIDE 26

Language Modeling

Building a model (or system / API) that can answer the following:

a sequence of natural language

Language Model

How common is this sequence? What is the next word in the sequence?

Training Corpus

training (fit, learn)

first word \ second word

Bigram Counts

Example from (Jurafsky, 2017)

SLIDE 27

Language Modeling

Building a model (or system / API) that can answer the following:

a sequence of natural language

Language Model

How common is this sequence? What is the next word in the sequence?

Training Corpus

training (fit, learn)

first word \ second word

Bigram Counts

Example from (Jurafsky, 2017)

SLIDE 28

Language Modeling

Building a model (or system / API) that can answer the following:

a sequence of natural language

Language Model

How common is this sequence? What is the next word in the sequence?

Training Corpus

training (fit, learn)

first word \ second word

Bigram Counts

Example from (Jurafsky, 2017)

Bigram model: Need to estimate: P(Xi | Xi-1) = count(Xi-1 Xi) / count(Xi-1)
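A minimal sketch of estimating these bigram probabilities from counts (the tiny corpus below is an illustrative assumption, not the course's data; the slide's actual counts are from the Jurafsky, 2017 example):

```python
# MLE bigram probabilities: P(Xi | Xi-1) = count(Xi-1 Xi) / count(Xi-1)
from collections import Counter

sentences = [["i", "want", "chinese", "food"],
             ["i", "want", "to", "eat"]]  # hypothetical toy corpus

unigram_counts = Counter()
bigram_counts = Counter()
for sent in sentences:
    unigram_counts.update(sent)
    bigram_counts.update(zip(sent, sent[1:]))

def p_bigram(curr, prev):
    """MLE estimate of P(curr | prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, curr)] / unigram_counts[prev]

print(p_bigram("want", "i"))        # 2/2 = 1.0 in this toy corpus
print(p_bigram("chinese", "want"))  # 1/2 = 0.5
```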

SLIDE 29

Language Modeling

Building a model (or system / API) that can answer the following:

a sequence of natural language

Language Model

How common is this sequence? What is the next word in the sequence?

Training Corpus

training (fit, learn)

first word(Xi-1) \ second word (Xi)

P(Xi | Xi-1)

Example from (Jurafsky, 2017)

Bigram model: Need to estimate: P(Xi | Xi-1) = count(Xi-1 Xi) / count(Xi-1)

SLIDE 30

Language Modeling

Building a model (or system / API) that can answer the following:

a sequence of natural language

Language Model

How common is this sequence? What is the next word in the sequence?

Training Corpus

training (fit, learn)

first word(Xi-1) \ second word (Xi)

P(Xi | Xi-1)

Example from (Jurafsky, 2017)

Bigram model: Need to estimate: P(Xi | Xi-1) = count(Xi-1 Xi) / count(Xi-1)

SLIDE 31

Language Modeling

Building a model (or system / API) that can answer the following:

a sequence of natural language

Language Model

How common is this sequence? What is the next word in the sequence?

Training Corpus

training (fit, learn)

first word(Xi-1) \ second word (Xi)

P(Xi | Xi-1)

Example from (Jurafsky, 2017)

Bigram model: Need to estimate: P(Xi | Xi-1) = count(Xi-1 Xi) / count(Xi-1)

SLIDE 32

Language Modeling

Building a model (or system / API) that can answer the following:

a sequence of natural language

Trained Language Model

How common is this sequence? What is the next word in the sequence?

Training Corpus

training (fit, learn)

SLIDE 33

Language Modeling

Building a model (or system / API) that can answer the following:

food

Trained Language Model

How common is this sequence? What is the next word in the sequence?

Training Corpus

training (fit, learn)

SLIDE 34

Language Modeling

Building a model (or system / API) that can answer the following:

a sequence of natural language

Trained Language Model

How common is this sequence? What is the next word in the sequence?

Training Corpus

training (fit, learn)

Test?

SLIDE 35

Language Modeling

Building a model (or system / API) that can answer the following:

a sequence of natural language

Trained Language Model

How common is this sequence? What is the next word in the sequence?

Training Corpus

Test:

Feed the model X1...Xi-1 and see how well it predicts Xi.

Test Corpus

Perplexity

SLIDE 36

Language Modeling

Building a model (or system / API) that can answer the following:

a sequence of natural language

Trained Language Model

How common is this sequence? What is the next word in the sequence?

Training Corpus

Test:

Feed the model X1...Xi-1 and see how well it predicts Xi.

Test Corpus

Perplexity

SLIDE 37

Evaluation

a sequence of natural language

Trained Language Model

What is the next word in the sequence?

Test Corpus

Perplexity

SLIDE 38

Evaluation

a sequence of natural language

Trained Language Model

What is the next word in the sequence?

Test Corpus

Perplexity

Apply Chain Rule:

SLIDE 39

Evaluation

a sequence of natural language

Trained Language Model

What is the next word in the sequence?

Test Corpus

Perplexity

Apply Chain Rule; thus, PP for bigrams:
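The perplexity formulas themselves appear to have been images on this slide, so they are not in the extracted text. The standard definitions (as in Jurafsky, 2017) are:

```latex
\begin{aligned}
\mathrm{PP}(W) &= P(w_1 w_2 \ldots w_N)^{-1/N}
  = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} \\
 &= \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})}}
   \qquad \text{(chain rule)} \\
 &\approx \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
   \qquad \text{(bigram model)}
\end{aligned}
```

Lower perplexity means the model assigns higher probability to the held-out test corpus.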

SLIDE 40

Coding Example: Modeling Tweets from POS data

  1. Count unigrams, bigrams, and trigrams
  2. Train probabilities for unigram, bigram, and trigram models (over the training set)
  3. Generate language: use the trigram model when there is good evidence (high counts), backing off to the bigram or even unigram model otherwise (see the sketch below)
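A rough sketch of those three steps (this is not the course's actual tweet/POS notebook; the corpus, the "good evidence" threshold, and the sampling scheme are illustrative assumptions):

```python
# 1) count n-grams, 2) turn counts into probabilities, 3) generate with backoff
import random
from collections import Counter

corpus = [["<s>", "he", "ate", "the", "cake", "</s>"],
          ["<s>", "she", "ate", "the", "soup", "</s>"]]  # toy stand-in corpus

# 1. Count unigrams, bigrams, and trigrams.
uni, bi, tri = Counter(), Counter(), Counter()
for sent in corpus:
    uni.update(sent)
    bi.update(zip(sent, sent[1:]))
    tri.update(zip(sent, sent[1:], sent[2:]))

# 2. "Training" here just means forming conditional distributions from counts.
def next_word_dist(history):
    """Candidate next words with counts: trigram if the context is well
    attested, else back off to bigram, else unigram."""
    w1, w2 = history[-2], history[-1]
    if bi[(w1, w2)] >= 2:  # "good evidence" threshold (assumption)
        cands = {w3: c for (a, b, w3), c in tri.items() if (a, b) == (w1, w2)}
        if cands:
            return cands
    cands = {b: c for (a, b), c in bi.items() if a == w2}
    return cands if cands else dict(uni)

# 3. Generate language by sampling proportionally to the counts.
def generate(max_len=10):
    sent = ["<s>", random.choice([s[1] for s in corpus])]
    while sent[-1] != "</s>" and len(sent) < max_len:
        words, counts = zip(*next_word_dist(sent).items())
        sent.append(random.choices(words, weights=counts)[0])
    return " ".join(sent)

print(generate())
```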

SLIDE 41

Practical Considerations:

  • Use log probabilities to keep numbers reasonable and save computation (addition rather than multiplication); see the sketch after this list.
  • Out-of-vocabulary (OOV) words: choose a minimum frequency and mark rarer words as <OOV>.
  • Sentence start and end markers: <s> this is a sentence </s>
  • Alternative to backoff: interpolation.

Coding Example: Modeling Tweets from POS data
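A minimal sketch of the log-probability and OOV points above (illustrative only; the training data, the MIN_FREQ threshold, and the use of add-one smoothing from the later slides are assumptions, not the course's actual code):

```python
# Score a sentence as a sum of log probabilities (addition instead of
# multiplication), with <s>/</s> markers and an <OOV> token.
import math
from collections import Counter

train = [["<s>", "he", "ate", "the", "cake", "</s>"],
         ["<s>", "she", "ate", "the", "soup", "</s>"]]  # toy training data

MIN_FREQ = 1  # words seen fewer times than this become <OOV>
raw = Counter(w for s in train for w in s)
vocab = {w for w, c in raw.items() if c >= MIN_FREQ} | {"<OOV>"}

def norm(w):
    return w if w in vocab else "<OOV>"

uni, bi = Counter(), Counter()
for s in train:
    s = [norm(w) for w in s]
    uni.update(s)
    bi.update(zip(s, s[1:]))

V = len(vocab)

def log_p_sentence(words):
    """Sum of log P(w_i | w_{i-1}) with add-one smoothing, avoiding the
    underflow a long product of small probabilities would cause."""
    seq = ["<s>"] + [norm(w) for w in words] + ["</s>"]
    return sum(math.log((bi[(a, b)] + 1) / (uni[a] + V))
               for a, b in zip(seq, seq[1:]))

print(log_p_sentence(["he", "ate", "the", "soup"]))
print(log_p_sentence(["he", "ate", "the", "zebra"]))  # "zebra" -> <OOV>
```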

SLIDE 42

Zeros and Smoothing

first word(Xi-1) \ second word (Xi)

P(Xi | Xi-1)

Example from (Jurafsky, 2017)

SLIDE 43

Zeros and Smoothing

Laplace (“Add one”) smoothing: add 1 to all counts

first word \ second word

Bigram Counts

SLIDE 44

Zeros and Smoothing

Laplace (“Add one”) smoothing: add 1 to all counts

first word \ second word

Bigram Counts

SLIDE 45

Unsmoothed probs

first word(Xi-1) \ second word (Xi)

P(Xi | Xi-1)

Example from (Jurafsky, 2017)

SLIDE 46

Smoothed

first word(Xi-1) \ second word (Xi)

P(Xi | Xi-1)

Example from (Jurafsky, 2017)

Add-one smoothed estimate: P(Xi | Xi-1) = (count(Xi-1 Xi) + 1) / (count(Xi-1) + V), where V is the vocabulary size.
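A minimal sketch of add-one smoothing (the counts and vocabulary size below are made-up placeholders, not the values from the slide's table):

```python
# Laplace ("add-one") smoothing:
# P(Xi | Xi-1) = (count(Xi-1 Xi) + 1) / (count(Xi-1) + V)
from collections import Counter

bigram_counts = Counter({("i", "want"): 9, ("want", "chinese"): 3})
unigram_counts = Counter({"i": 20, "want": 10, "chinese": 4})
V = 1000  # vocabulary size (placeholder)

def p_unsmoothed(curr, prev):
    c = unigram_counts[prev]
    return bigram_counts[(prev, curr)] / c if c else 0.0

def p_add_one(curr, prev):
    return (bigram_counts[(prev, curr)] + 1) / (unigram_counts[prev] + V)

print(p_unsmoothed("want", "i"), p_add_one("want", "i"))
# An unseen bigram now gets a small nonzero probability instead of 0:
print(p_unsmoothed("chinese", "i"), p_add_one("chinese", "i"))
```

Note how smoothing shrinks the probabilities of seen bigrams (the added V in the denominator) while giving unseen bigrams a small share; that is the generalization the next slide illustrates.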

SLIDE 47

Why Smoothing? Generalizes

Original vs. With Smoothing

(Example from Jurafsky / Originally Dan Klein)

SLIDE 48

Why Smoothing? Generalizes

Add-one is blunt: it can lead to very large changes in the estimates. Better smoothing methods:

  • Good-Turing
  • Kneser-Ney

These are outside the scope of this course because we will eventually cover even stronger, deep-learning-based models.

SLIDE 49

Language Modeling Summary

  • Two versions of assigning probability to sequence of words
  • Applications
  • The Chain Rule and the Markov Assumption
  • Training a unigram, bigram, trigram model based on counts
  • Evaluation: Perplexity
  • Zeros, Low Counts, and Generalizability
  • Add-one smoothing