Probability & Language Modeling (II): Classification & the Noisy Channel Model. CMSC 473/673, UMBC, September 11th, 2017. Some slides adapted from 3SLP, Jason Eisner. Recap from last time: "Three people have been fatally shot," p(…


  1. n = 3 His employer , if silver was regulated according to the temporary and occasional event . What goods could bear the expense of defending themselves , than in the value of different sorts of goods , and placed at a much greater , there have been the effects of self-deception , this attention , but a very important ones , and which , having become of less than they ever were in this agreement for keeping up the business of weighing . After food , clothes , and a few months longer credit than is wanted , there must be sufficient to keep by him , are of such colonies to surmount . They facilitated the acquisition of the empire , both from the rents of land and labour of those pedantic pieces of silver which he can afford to take from the duty upon every quarter which they have a more equable distribution of employment .

  2. n = 4 To buy in one market , in order to have it ; but the 8th of George III . The tendency of some of the great lords , gradually encouraged their villains to make upon the prices of corn , cattle , poultry , etc . Though it may , perhaps , in the mean time , that part of the governments of New England , the market , trade cannot always be transported to so great a number of seamen , not inferior to those of other European nations from any direct trade to America . The farmer makes his profit by parting with it . But the government of that country below what it is in itself necessarily slow , uncertain , liable to be interrupted by the weather .

  3.–4. 0s Are Not Your (Language Model's) Friend p(item) ∝ count(item), so count(item) = 0 → p(item) = 0. A probability of 0 means the item is impossible, and 0s annihilate: x*y*z*0 = 0. But language is creative: new words keep appearing, and existing words appear in new contexts. How much do you trust your data?

  5.–8. Add-λ estimation (Laplace smoothing, Lidstone smoothing) Pretend we saw each word λ more times than we did: add λ to all the counts.
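The smoothed estimate itself did not survive extraction; for a unigram model over a vocabulary of V types and N training tokens, the usual add-λ (Lidstone) estimate, and its bigram analogue, are:

```latex
P_{\text{add-}\lambda}(w) \;=\; \frac{\mathrm{count}(w) + \lambda}{N + \lambda V},
\qquad
P_{\text{add-}\lambda}(w_i \mid w_{i-1}) \;=\; \frac{\mathrm{count}(w_{i-1}, w_i) + \lambda}{\mathrm{count}(w_{i-1}) + \lambda V}.
```

With λ = 1 this is Laplace (add-1) smoothing.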

  9.–12. Add-1 N-Grams (Unigrams) Example sentence: "The film got a great opening and the film went on to become a hit ." (16 tokens, 14 word types)

    Word (Type)   Raw Count   Norm. Prob.   Add-1 Count   Add-1 Prob.
    The           1           1/16          2             2/30 = 1/15
    film          2           2/16 = 1/8    3             3/30 = 1/10
    got           1           1/16          2             1/15
    a             2           1/8           3             1/10
    great         1           1/16          2             1/15
    opening       1           1/16          2             1/15
    and           1           1/16          2             1/15
    the           1           1/16          2             1/15
    went          1           1/16          2             1/15
    on            1           1/16          2             1/15
    to            1           1/16          2             1/15
    become        1           1/16          2             1/15
    hit           1           1/16          2             1/15
    .             1           1/16          2             1/15

    Raw normalizer: 16. Add-1 normalizer: 16 + 14*1 = 30.
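A minimal sketch (not from the slides) that reproduces the add-1 numbers above; for this sentence N = 16, V = 14, and the smoothed normalizer is 30:

```python
from collections import Counter

def add_lambda_unigram_probs(tokens, lam=1.0):
    """Add-lambda (Lidstone) smoothed unigram probabilities.
    lam=1.0 gives Laplace (add-1) smoothing."""
    counts = Counter(tokens)
    total = len(tokens)                # N = 16 for the example sentence
    vocab_size = len(counts)           # V = 14 word types
    denom = total + lam * vocab_size   # 16 + 1*14 = 30
    # A vocabulary word that happened not to occur would get lam / denom.
    return {w: (c + lam) / denom for w, c in counts.items()}

sentence = "The film got a great opening and the film went on to become a hit .".split()
probs = add_lambda_unigram_probs(sentence, lam=1.0)
print(probs["film"])   # 3/30 = 0.1
print(probs["great"])  # 2/30 ~ 0.0667, i.e. 1/15
```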

  13.–15. Backoff and Interpolation Sometimes it helps to use less context: condition on less context for contexts you haven't learned much about. Backoff: use the trigram if you have good evidence; otherwise the bigram; otherwise the unigram. Interpolation: mix (average) the unigram, bigram, and trigram estimates.

  16.–17. Linear Interpolation Simple interpolation: p(y | x) = λ p₂(y | x) + (1 − λ) p₁(y), with 0 ≤ λ ≤ 1. Conditioning the weights on context: p(z | x, y) = λ₃(x, y) p₃(z | x, y) + λ₂(y) p₂(z | y) + λ₁ p₁(z).
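A minimal sketch of the simple (two-way) case, assuming MLE bigram and unigram components built from raw counts; the names and toy sentence are illustrative, not from the slides:

```python
from collections import Counter

def interpolated_bigram_prob(w_prev, w, unigrams, bigrams, total_tokens, lam=0.7):
    """p(w | w_prev) = lam * p2(w | w_prev) + (1 - lam) * p1(w),
    mixing the MLE bigram and unigram estimates."""
    p1 = unigrams[w] / total_tokens
    p2 = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
    return lam * p2 + (1 - lam) * p1

tokens = "the film got a great opening and the film went on to become a hit .".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(interpolated_bigram_prob("the", "film", unigrams, bigrams, len(tokens)))
# 0.7 * (2/2) + 0.3 * (2/16) = 0.7375
```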

  18. Backoff: trust your statistics, up to a point.

  19.–20. Discounted Backoff: trust your statistics, up to a point; the estimate uses a discount constant and a context-dependent normalization constant.
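The formula on the slide did not survive extraction; one standard form of discounted backoff for bigrams, with d the discount constant and α(x) the context-dependent normalization constant chosen so the distribution sums to one (an assumption about which exact variant the slide showed), is:

```latex
P_{\text{bo}}(y \mid x) =
\begin{cases}
\dfrac{\max\big(\mathrm{count}(x, y) - d,\ 0\big)}{\mathrm{count}(x)} & \text{if } \mathrm{count}(x, y) > 0,\\[2ex]
\alpha(x)\, P(y) & \text{otherwise.}
\end{cases}
```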

  21. Setting Hyperparameters Use a development corpus: split the corpus into Training Data, Dev Data, and Test Data. Choose the λs to maximize the probability of the dev data: fix the N-gram probabilities/counts (on the training data), then search for the λs that give the largest probability to the held-out set.
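A minimal sketch of that search, assuming a toy train/dev split and the two-way interpolation from before; the add-1 unigram inside is only there so unseen dev words don't zero out the product, and all names are illustrative:

```python
import math
from collections import Counter

def interp_prob(x, y, uni, bi, n_tokens, lam):
    # lam * MLE bigram + (1 - lam) * add-1 unigram
    p1 = (uni[y] + 1) / (n_tokens + len(uni) + 1)
    p2 = bi[(x, y)] / uni[x] if uni[x] else 0.0
    return lam * p2 + (1 - lam) * p1

train = "the film got a great opening and the film went on to become a hit .".split()
dev = "the film became a hit .".split()

uni, bi, n = Counter(train), Counter(zip(train, train[1:])), len(train)
dev_bigrams = list(zip(dev, dev[1:]))

def dev_log_likelihood(lam):
    return sum(math.log(interp_prob(x, y, uni, bi, n, lam)) for x, y in dev_bigrams)

# Grid search: keep the lambda that assigns the dev data the highest probability.
best_lam = max((l / 10 for l in range(10)), key=dev_log_likelihood)
print(best_lam, dev_log_likelihood(best_lam))
```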

  22. Implementation: Unknown Words Create an unknown word token <UNK>. Training: 1. Create a fixed lexicon L of size V. 2. Change any word not in L to <UNK>. 3. Train the LM as normal. Evaluation: use the <UNK> probabilities for any word not in the training lexicon.
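A minimal sketch of that recipe (the frequency-based lexicon choice and the helper names are assumptions, not from the slides):

```python
from collections import Counter

UNK = "<UNK>"

def build_lexicon(train_tokens, vocab_size):
    """Fixed lexicon L of size V: here, the most frequent training word types."""
    return {w for w, _ in Counter(train_tokens).most_common(vocab_size)}

def apply_unk(tokens, lexicon):
    """Map every word outside the lexicon to <UNK>."""
    return [w if w in lexicon else UNK for w in tokens]

train_tokens = "the film got a great opening and the film went on to become a hit .".split()
L = build_lexicon(train_tokens, vocab_size=10)
train_unked = apply_unk(train_tokens, L)   # train the LM on this as normal
test_unked = apply_unk("the film was a great hit !".split(), L)
# At evaluation time, unseen words like "was" and "!" reuse <UNK>'s probabilities.
```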

  23. Other Kinds of Smoothing Interpolated (modified) Kneser-Ney. Idea: how "productive" is a context, i.e., how many different word types v appear in a context x, y? Good-Turing: partition words into classes of occurrence and smooth the class statistics; properties of classes are likely to predict properties of other classes. Witten-Bell. Idea: every observed type was at some point novel, so give an MLE prediction for a novel type occurring.
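As one concrete piece of this (the standard formulation, not shown on the slide): Kneser-Ney replaces the raw unigram with a continuation probability that counts how many distinct contexts a word appears in,

```latex
P_{\text{continuation}}(w) \;=\;
\frac{\big|\{\, w' : \mathrm{count}(w', w) > 0 \,\}\big|}
     {\big|\{\, (w', w'') : \mathrm{count}(w', w'') > 0 \,\}\big|}.
```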

  24. Bayes Rule → NLP Applications The pieces: the posterior probability, the prior probability, the likelihood, and the marginal likelihood (probability).
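The equation itself did not survive extraction; Bayes' rule with the terms labeled as on the slide, written here in the classification setting the next slides use, is:

```latex
\underbrace{P(\text{class} \mid \text{data})}_{\text{posterior probability}}
\;=\;
\frac{\overbrace{P(\text{class})}^{\text{prior probability}}\;
      \overbrace{P(\text{data} \mid \text{class})}^{\text{likelihood}}}
     {\underbrace{P(\text{data})}_{\text{marginal likelihood}}}
```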

  25.–26. Text Classification Assigning subject categories, topics, or genres; e.g., age/gender identification, language identification, sentiment analysis, spam detection, authorship identification, … Input: a document and a fixed set of classes C = {c₁, c₂, …, c_J}. Output: a predicted class c ∈ C.

  27. Text Classification: Hand-coded Rules? Rules based on combinations of words or other features, e.g. spam: black-list-address OR ("dollars" AND "have been selected"). Accuracy can be high if the rules are carefully refined by an expert, but building and maintaining these rules is expensive.

  28.–30. Text Classification: Supervised Machine Learning Input: a document d; a fixed set of classes C = {c₁, c₂, …, c_J}; a training set of m hand-labeled documents (d₁, c₁), …, (d_m, c_m). Output: a learned classifier γ: d → c. Example methods: Naïve Bayes, logistic regression, support-vector machines, k-nearest neighbors, …
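A minimal sketch of one of these, a multinomial Naïve Bayes classifier with add-λ smoothing (the toy data and helper names are illustrative, not from the slides):

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: parallel list of class names."""
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)
        vocab.update(tokens)
    return priors, word_counts, vocab

def classify(tokens, priors, word_counts, vocab, lam=1.0):
    """Return argmax_c [ log P(c) + sum_w log P(w | c) ] with add-lambda smoothing."""
    def score(c):
        total = sum(word_counts[c].values())
        s = math.log(priors[c])
        for w in tokens:
            if w in vocab:  # ignore words never seen in training
                s += math.log((word_counts[c][w] + lam) / (total + lam * len(vocab)))
        return s
    return max(priors, key=score)

docs = [["great", "film"], ["boring", "film"], ["great", "opening"]]
labels = ["pos", "neg", "pos"]
priors, wc, vocab = train_naive_bayes(docs, labels)
print(classify(["great", "film"], priors, wc, vocab))  # "pos"
```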

  31.–32. Probabilistic Text Classification The class is unobserved; the data (the document) is what we observe. The model combines the prior probability of the class with the class-based likelihood of the observed data, normalized by the observation likelihood (averaged over all classes).
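The formula these labels annotate did not survive extraction; the standard rule they describe is:

```latex
\hat{c} \;=\; \arg\max_{c \in C} P(c \mid d)
        \;=\; \arg\max_{c \in C}
        \frac{\overbrace{P(c)}^{\text{prior}}\;\overbrace{P(d \mid c)}^{\text{class-based likelihood}}}
             {\underbrace{\textstyle\sum_{c'} P(c')\,P(d \mid c')}_{\text{observation likelihood, averaged over all classes}}}.
```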

  33.–37. Noisy Channel Model What I want to tell you ("sports") passes through a noisy channel and becomes what you actually see ("The Os lost again…"). Decode: recover a hypothesized intent ("sad stories", "sports", …). Rerank: reweight the hypotheses according to what's likely ("sports").

  38.–39. Noisy Channel Applications: machine translation, speech-to-text, spelling correction, text normalization, part-of-speech tagging, morphological analysis, … The pieces: decode the possible (clean) output from the observed (noisy) text, using a translation/channel model (the observation likelihood) together with a (clean) language model.
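A minimal sketch of this pipeline for spelling correction; the candidate list, probabilities, and edit-penalty channel model are made-up stand-ins, not anything from the slides:

```python
import math

# Language model P(X) over clean candidates (hypothetical numbers).
lm = {"sports": 0.5, "spots": 0.3, "ports": 0.2}

def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def channel_logprob(noisy, clean, per_edit_penalty=-2.0):
    """Crude channel model: log P(noisy | clean) shrinks with edit distance."""
    return per_edit_penalty * edit_distance(noisy, clean)

def decode(noisy):
    """Pick the clean X maximizing log P(noisy | X) + log P(X)."""
    return max(lm, key=lambda x: channel_logprob(noisy, x) + math.log(lm[x]))

print(decode("sporst"))  # "sports": the channel ties sports/spots at distance 2; the LM breaks the tie
```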

  40. Language Model Use any of the language-modeling algorithms we've learned: unigram, bigram, trigram; add-λ, interpolation, backoff. (Later: MaxEnt, RNNs, hierarchical Bayesian LMs, …)

  41.–43. Noisy Channel The decoding objective: the normalizing denominator is constant with respect to X, so it drops out of the argmax.
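The equations on these slides did not survive extraction; the standard noisy-channel decoding derivation they describe, with X the clean source and Y the noisy observation, is:

```latex
\hat{X} \;=\; \arg\max_{X} P(X \mid Y)
        \;=\; \arg\max_{X} \frac{P(Y \mid X)\, P(X)}{P(Y)}
        \;=\; \arg\max_{X} \underbrace{P(Y \mid X)}_{\text{channel model}}\;\underbrace{P(X)}_{\text{language model}},
```

since P(Y) is constant with respect to X.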
