Language Modeling (CSE354, Spring 2020) - PowerPoint PPT Presentation

  1. Language Modeling CSE354 - Spring 2020

  2. Task ● Language Modeling (i.e. auto-complete) ● How? Probabilistic Modeling: ○ Probability Theory ○ Logistic Regression ○ Sequence Modeling

  3. Language Modeling -- assigning a probability to sequences of words. Version 1: Compute P(w1, w2, w3, w4, w5) = P(W): probability of a sequence of words

  4. Language Modeling -- assigning a probability to sequences of words. Version 1: Compute P(w1, w2, w3, w4, w5) = P(W): probability of a sequence of words. Version 2: Compute P(w5 | w1, w2, w3, w4) = P(wn | w1, w2, …, wn-1): probability of the next word given its history

  5. Language Modeling. Version 1: Compute P(w1, w2, w3, w4, w5) = P(W): probability of a sequence of words. P(He ate the cake with the fork) = ? Version 2: Compute P(w5 | w1, w2, w3, w4) = P(wn | w1, w2, …, wn-1): probability of the next word given its history. P(fork | He ate the cake with the) = ?
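Below is a minimal sketch of these two query types as code; the interface and names are my own, not from the slides. It also shows that Version 2 follows from Version 1 by the definition of conditional probability.

```python
from abc import ABC, abstractmethod

class LanguageModel(ABC):
    """Hypothetical interface for the two queries above (names are illustrative)."""

    @abstractmethod
    def sequence_prob(self, words: list[str]) -> float:
        """Version 1: P(w1, w2, ..., wn), the probability of a whole sequence."""

    def next_word_prob(self, word: str, history: list[str]) -> float:
        """Version 2: P(wn | w1, ..., wn-1). By the definition of conditional
        probability this is P(w1..wn) / P(w1..wn-1)."""
        return self.sequence_prob(history + [word]) / self.sequence_prob(history)
```

So any model that can score whole sequences (Version 1) can also answer next-word queries (Version 2).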

  6. Language Modeling. Version 1: Compute P(w1, w2, w3, w4, w5) = P(W): probability of a sequence of words. P(He ate the cake with the fork) = ? Version 2: Compute P(w5 | w1, w2, w3, w4) = P(wn | w1, w2, …, wn-1): probability of the next word given its history. P(fork | He ate the cake with the) = ? Applications: ● Auto-complete: What word is next? ● Machine Translation: Which translation is most likely? ● Spell Correction: Which word is most likely given the error? ● Speech Recognition: What did they just say? ("eyes aw of an") (example from Jurafsky, 2017)

  8. Simple Solution. Version 1: Compute P(w1, w2, w3, w4, w5) = P(W): probability of a sequence of words. P(He ate the cake with the fork) = count(He ate the cake with the fork) / count(* * * * * * *)

  9. Simple Solution: The Maximum Likelihood Estimate. Version 1: Compute P(w1, w2, w3, w4, w5) = P(W): probability of a sequence of words. P(He ate the cake with the fork) = count(He ate the cake with the fork) / count(* * * * * * *), where the denominator is the total number of observed 7-grams

  10. Simple Solution: The Maximum Likelihood Estimate. Version 1: Compute P(w1, w2, w3, w4, w5) = P(W): probability of a sequence of words. P(He ate the cake with the fork) = count(He ate the cake with the fork) / count(* * * * * * *). P(fork | He ate the cake with the) = count(He ate the cake with the fork) / count(He ate the cake with the *)

  11. Simple Solution: The Maximum Likelihood Estimate. Version 1: Compute P(w1, w2, w3, w4, w5) = P(W): probability of a sequence of words. P(He ate the cake with the fork) = count(He ate the cake with the fork) / count(* * * * * * *). P(fork | He ate the cake with the) = count(He ate the cake with the fork) / count(He ate the cake with the *). Problem: even the Web isn’t large enough to enable good estimates of most phrases.
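As a rough sketch of this maximum likelihood estimate by direct counting (the corpus format and helper names below are my own assumptions), note how it also makes the problem above concrete: almost every 7-gram has a count of zero, no matter how large the corpus.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-word windows in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mle_sequence_prob(corpus_sentences, sequence):
    """Version 1 MLE: count(sequence) / count(* * ... *), over all observed n-grams."""
    n = len(sequence)
    counts = Counter(g for sent in corpus_sentences for g in ngrams(sent, n))
    total = sum(counts.values())
    return counts[tuple(sequence)] / total if total else 0.0

def mle_next_word_prob(corpus_sentences, history, word):
    """Version 2 MLE: count(history word) / count(history *)."""
    n = len(history) + 1
    counts = Counter(g for sent in corpus_sentences for g in ngrams(sent, n))
    numerator = counts[tuple(history) + (word,)]
    denominator = sum(c for g, c in counts.items() if g[:-1] == tuple(history))
    return numerator / denominator if denominator else 0.0
```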

  12. Problem: even the Web isn’t large enough to enable good estimates of most phrases. Solution: Estimate from shorter sequences, use more sophisticated probability theory.

  13. Problem: even the Web isn’t large enough to enable good estimates of most phrases. Solution: Estimate from shorter sequences, use more sophisticated probability theory. P(B|A) = P(B, A) / P(A) ⇔ P(A)P(B|A) = P(B,A) = P(A,B) Example from (Jurafsky, 2017)

  14. Problem: even the Web isn’t large enough to enable good estimates of most phrases. Solution: Estimate from shorter sequences, use more sophisticated probability theory. P(B|A) = P(B, A) / P(A) ⇔ P(A)P(B|A) = P(B,A) = P(A,B) P(A, B, C) = P(A)P(B|A)P(C| A, B) Example from (Jurafsky, 2017)

  15. Problem: even the Web isn’t large enough to enable good estimates of most phrases. Solution: Estimate from shorter sequences, use more sophisticated probability theory. P(B|A) = P(B, A) / P(A) ⇔ P(A)P(B|A) = P(B,A) = P(A,B) P(A, B, C) = P(A)P(B|A)P(C| A, B) The Chain Rule: P(X1, X2,…, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)...P(Xn|X1, ..., Xn-1) Example from (Jurafsky, 2017)

  16. Problem: even the Web isn’t large enough to enable good estimates of most phrases. Solution: Estimate from shorter sequences, use more sophisticated probability theory. P(B|A) = P(B, A) / P(A) ⇔ P(A)P(B|A) = P(B,A) = P(A,B) P(A, B, C) = P(A)P(B|A)P(C| A, B) The Chain Rule: P(X1, X2,…, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)...P(Xn|X1, ..., Xn-1)
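As a worked instance of the chain rule on a shortened version of the running example (my own expansion of the formula above):

```latex
P(\text{He ate the cake}) =
  P(\text{He}) \, P(\text{ate} \mid \text{He}) \,
  P(\text{the} \mid \text{He ate}) \, P(\text{cake} \mid \text{He ate the})
```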

  17. Markov Assumption: Problem: even the Web isn’t large enough to enable good estimates of most phrases. Solution: Estimate from shorter sequences, use more sophisticated probability theory. P(B|A) = P(B, A) / P(A) ⇔ P(A)P(B|A) = P(B,A) = P(A,B) P(A, B, C) = P(A)P(B|A)P(C| A, B) The Chain Rule: P(X1, X2,…, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)...P(Xn|X1, ..., Xn-1)

  18. Markov Assumption: P(Xn | X1, …, Xn-1) ≈ P(Xn | Xn-k, …, Xn-1), where k < n. Problem: even the Web isn’t large enough to enable good estimates of most phrases. Solution: Estimate from shorter sequences, use more sophisticated probability theory. P(B|A) = P(B, A) / P(A) ⇔ P(A)P(B|A) = P(B,A) = P(A,B) P(A, B, C) = P(A)P(B|A)P(C|A, B) The Chain Rule: P(X1, X2, …, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)...P(Xn|X1, ..., Xn-1)

  19. Markov Assumption: P(Xn | X1, …, Xn-1) ≈ P(Xn | Xn-k, …, Xn-1), where k < n. What about Logistic Regression? Y = next word: P(Y|X) = P(Xn | Xn-1, Xn-2, Xn-3, ...). Not a terrible option, but Xn-1 through Xn-k would be modeled as independent dimensions. Let’s revisit later. Problem: even the Web isn’t large enough to enable good estimates of most phrases. Solution: Estimate from shorter sequences, use more sophisticated probability theory. P(B|A) = P(B, A) / P(A) ⇔ P(A)P(B|A) = P(B,A) = P(A,B) P(A, B, C) = P(A)P(B|A)P(C|A, B) The Chain Rule: P(X1, X2, …, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)...P(Xn|X1, ..., Xn-1)
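Below is a hedged sketch of the logistic-regression option mentioned on this slide (the feature scheme, toy data, and use of scikit-learn are my own assumptions, not the course's): multinomial logistic regression predicting the next word from one-hot features of the previous k words. Each history position becomes its own independent feature dimension, which is exactly the limitation noted above.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def make_examples(sentences, k=2):
    """Turn each position into (features of the previous k words, next word)."""
    X, y = [], []
    for sent in sentences:
        padded = ["<s>"] * k + sent
        for i in range(k, len(padded)):
            # each previous-word position is its own independent feature
            X.append({f"w-{j}={padded[i - j]}": 1 for j in range(1, k + 1)})
            y.append(padded[i])
    return X, y

# toy corpus, just to make the sketch runnable
sentences = [["he", "ate", "the", "cake"], ["he", "ate", "the", "pie"]]
X, y = make_examples(sentences, k=2)
vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

# estimated P(Y = next word | previous two words were "ate the")
probs = clf.predict_proba(vec.transform([{"w-1=the": 1, "w-2=ate": 1}]))[0]
print(dict(zip(clf.classes_, probs)))
```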

  20. Markov Assumption: P(Xn | X1, …, Xn-1) ≈ P(Xn | Xn-k, …, Xn-1), where k < n. Unigram Model: k = 0 (each word is modeled independently of its history). Problem: even the Web isn’t large enough to enable good estimates of most phrases. P(B|A) = P(B, A) / P(A) ⇔ P(A)P(B|A) = P(B,A) = P(A,B) P(A, B, C) = P(A)P(B|A)P(C|A, B) The Chain Rule: P(X1, X2, …, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)...P(Xn|X1, ..., Xn-1)

  21. Markov Assumption: P(Xn | X1, …, Xn-1) ≈ P(Xn | Xn-k, …, Xn-1), where k < n. Bigram Model: k = 1. Example generated sentence: outside, new, car, parking, lot, of, the, agreement, reached. P(X1 = "outside", X2 = "new", X3 = "car", ...) ≈ P(X1 = "outside") * P(X2 = "new" | X1 = "outside") * P(X3 = "car" | X2 = "new") * ... Example from (Jurafsky, 2017)
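Below is a small sketch of sampling a sentence word by word from a bigram model, in the spirit of the generated example above; the probability table is a toy I made up for illustration (real values come from corpus counts, as the following slides show).

```python
import random

def generate_from_bigrams(bigram_probs, start="<s>", max_len=10):
    """Sample a sequence where each next word depends only on the previous word (k = 1)."""
    word, output = start, []
    for _ in range(max_len):
        next_dist = bigram_probs.get(word)
        if not next_dist:
            break
        words, weights = zip(*next_dist.items())
        word = random.choices(words, weights=weights)[0]
        if word == "</s>":
            break
        output.append(word)
    return output

# toy P(second word | first word) table, invented for illustration only
bigram_probs = {
    "<s>": {"outside": 0.6, "the": 0.4},
    "outside": {"new": 1.0},
    "new": {"car": 0.7, "agreement": 0.3},
    "car": {"parking": 1.0},
    "parking": {"lot": 1.0},
    "lot": {"of": 1.0},
    "of": {"the": 1.0},
    "the": {"agreement": 0.5, "car": 0.5},
    "agreement": {"reached": 1.0},
    "reached": {"</s>": 1.0},
}
print(generate_from_bigrams(bigram_probs))
```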

  22. Language Modeling. Building a model (or system / API) that, given a sequence of natural language, can answer the following: How common is this sequence? What is the next word in the sequence?

  23. Language Modeling. Building a model (or system / API) that, given a sequence of natural language, can answer the following: How common is this sequence? What is the next word in the sequence? How to build?

  24. Language Modeling. Building a model (or system / API) that, given a sequence of natural language, can answer the following: How common is this sequence? What is the next word in the sequence? How to build? Train (fit, learn) the Language Model on a Training Corpus.

  25. Language Modeling. Building a model (or system / API) that, given a sequence of natural language, can answer the following: How common is this sequence? What is the next word in the sequence? Train (fit, learn) the Language Model on a Training Corpus.

  26. Language Modeling. [Table: Bigram Counts -- rows: first word, columns: second word; example from (Jurafsky, 2017)] Building a model (or system / API) that, given a sequence of natural language, can answer the following: How common is this sequence? What is the next word in the sequence? Train (fit, learn) the Language Model on a Training Corpus.

  28. Language Modeling. [Table: Bigram Counts -- rows: first word, columns: second word; example from (Jurafsky, 2017)] Building a model (or system / API) that, given a sequence of natural language, can answer the following: How common is this sequence? What is the next word in the sequence? Train (fit, learn) the Language Model on a Training Corpus. Bigram model: need to estimate P(Xi | Xi-1) = count(Xi-1 Xi) / count(Xi-1)

  29. Language Modeling. [Table: P(Xi | Xi-1) -- rows: first word (xi-1), columns: second word (xi); example from (Jurafsky, 2017)] Building a model (or system / API) that, given a sequence of natural language, can answer the following: How common is this sequence? What is the next word in the sequence? Train (fit, learn) the Language Model on a Training Corpus. Bigram model: need to estimate P(Xi | Xi-1) = count(Xi-1 Xi) / count(Xi-1)
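A minimal sketch of the estimate on this slide, P(Xi | Xi-1) = count(Xi-1 Xi) / count(Xi-1), computed from a training corpus; the tokenization and the <s>/</s> boundary markers are simplifying assumptions of mine.

```python
from collections import Counter

def train_bigram_model(sentences):
    """MLE bigram estimates: P(xi | xi-1) = count(xi-1 xi) / count(xi-1)."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigram_counts.update(tokens[:-1])               # count(xi-1)
        bigram_counts.update(zip(tokens, tokens[1:]))    # count(xi-1 xi)
    return {(prev, cur): c / unigram_counts[prev]
            for (prev, cur), c in bigram_counts.items()}

corpus = [["he", "ate", "the", "cake"],
          ["he", "ate", "the", "cake", "with", "the", "fork"]]
probs = train_bigram_model(corpus)
print(probs[("the", "cake")])   # count(the cake) / count(the) = 2 / 3
```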
