  1. Classification & The Noisy Channel Model CMSC 473/673 UMBC Some slides adapted from 3SLP

  2. Outline: Recap: language modeling; Classification; Why incorporate uncertainty; Posterior decoding; Noisy channel decoding; Evaluation

  3. Chain Rule + Backoff (Markov assumption) = n-grams
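Spelled out (this derivation is mine, not copied from the slide), the chain rule factors the joint probability exactly, and the Markov/backoff assumption then truncates each history to the previous n-1 words:

```latex
% Chain rule (exact), then the order-(n-1) Markov approximation
\begin{align*}
p(w_1, \dots, w_m) &= \prod_{i=1}^{m} p(w_i \mid w_1, \dots, w_{i-1}) \\
                   &\approx \prod_{i=1}^{m} p(w_i \mid w_{i-n+1}, \dots, w_{i-1})
\end{align*}
```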

  4. N-Gram Terminology: how to (efficiently) compute p(Colorless green ideas sleep furiously)?
     Size n = 1: commonly called a unigram, history (Markov order) 0, e.g. p(furiously)
     Size n = 2: bigram, history 1, e.g. p(furiously | sleep)
     Size n = 3: trigram (3-gram), history 2, e.g. p(furiously | ideas sleep)
     Size n = 4: 4-gram, history 3, e.g. p(furiously | green ideas sleep)
     Size n: n-gram, history n-1, p(w_i | w_{i-n+1} … w_{i-1})

  5. Language Models & Smoothing: maximum likelihood (MLE): simple counting; Laplace smoothing, add-λ; interpolation models; discounted backoff; interpolated (modified) Kneser-Ney; Good-Turing; Witten-Bell
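As a minimal sketch of the first two items (MLE counting and add-λ smoothing for a bigram model), assuming tokenized training sentences; the function and variable names below are illustrative, not from the course materials:

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Collect unigram and bigram counts from tokenized sentences (the MLE statistics)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<BOS>"] + sent + ["<EOS>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w, prev, unigrams, bigrams, vocab_size, lam=0.0):
    """p(w | prev): lam = 0 gives the MLE estimate, lam > 0 gives add-lambda smoothing."""
    return (bigrams[(prev, w)] + lam) / (unigrams[prev] + lam * vocab_size)

# Example: p(furiously | sleep) under add-1 (Laplace) smoothing
sents = [["colorless", "green", "ideas", "sleep", "furiously"]]
uni, bi = train_bigram_counts(sents)
print(bigram_prob("furiously", "sleep", uni, bi, vocab_size=len(uni), lam=1.0))
```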

  6. Evaluation Framework: What is "correct"? What is working "well"? Training Data: acquire primary statistics for learning model parameters. Dev Data: fine-tune any secondary (hyper)parameters. Test Data: perform final evaluation. DO NOT ITERATE ON THE TEST DATA.

  7. Setting Hyperparameters: use a development corpus (Training Data / Dev Data / Test Data). Choose λs to maximize the probability of the dev data: fix the n-gram probabilities/counts (on the training data), then search for the λs that give the largest probability to the held-out set.
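A rough sketch of that search, assuming a simple two-way interpolation lam ∗ p_bigram + (1 - lam) ∗ p_unigram whose component estimates were already fixed on the training data; the grid search and function names are my own illustration:

```python
import math

def interp_logprob(dev_bigrams, p_bigram, p_unigram, lam):
    """Log-probability of the dev data under lam * p_bigram + (1 - lam) * p_unigram."""
    total = 0.0
    for prev, w in dev_bigrams:
        total += math.log(lam * p_bigram(w, prev) + (1 - lam) * p_unigram(w))
    return total

def choose_lambda(dev_bigrams, p_bigram, p_unigram, grid=None):
    """Pick the lambda that gives the held-out (dev) set the largest probability,
    keeping the n-gram estimates themselves fixed."""
    grid = grid or [i / 10 for i in range(1, 10)]
    return max(grid, key=lambda lam: interp_logprob(dev_bigrams, p_bigram, p_unigram, lam))
```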

  8. Evaluating Language Models: What is "correct"? What is working "well"? Extrinsic: evaluate the LM in a downstream task, e.g. test an MT, ASR, etc. system and see which LM does better (this propagates & conflates errors). Intrinsic: treat the LM as its own downstream task and use perplexity (from information theory).

  9. Perplexity: lower is better (lower perplexity --> less surprised). perplexity = exp( -(1/N) ∗ Σ_{j=1}^{N} log q(x_j | h_j) ), where h_j is the n-gram history (the preceding n-1 items).
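A small sketch of that formula, assuming we already have a conditional model q(x | h) and the evaluation tokens paired with their n-gram histories (names are illustrative):

```python
import math

def perplexity(tokens_with_history, q):
    """exp of the average negative log-probability:
    perplexity = exp(-(1/N) * sum_j log q(x_j | h_j))."""
    log_probs = [math.log(q(x, h)) for x, h in tokens_with_history]
    return math.exp(-sum(log_probs) / len(log_probs))
```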

  10. Implementation: Unknown Words. Create an unknown word token <UNK>. Training: 1. Create a fixed lexicon L of size V. 2. Change any word not in L to <UNK>. 3. Train the LM as normal. Evaluation: use the <UNK> probabilities for any word not seen in training.
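A minimal sketch of that recipe, with the lexicon built from training counts (function names are mine):

```python
from collections import Counter

def build_lexicon(training_sentences, vocab_size):
    """Keep the vocab_size most frequent training words; everything else will map to <UNK>."""
    counts = Counter(w for sent in training_sentences for w in sent)
    return {w for w, _ in counts.most_common(vocab_size)}

def replace_unknowns(sentence, lexicon):
    """Map any word outside the fixed lexicon to the <UNK> token."""
    return [w if w in lexicon else "<UNK>" for w in sentence]
```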

  11. Implementation: EOS Padding. Create an end-of-sentence ("chunk") token <EOS>; don't estimate p(<BOS> | <EOS>). Training & Evaluation: 1. Identify the "chunks" that are relevant (sentences, paragraphs, documents). 2. Append the <EOS> token to the end of each chunk. 3. Train or evaluate the LM as normal. (Post 25)
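And a correspondingly small sketch of the padding step (again just an illustration):

```python
def pad_chunks(chunks):
    """Append <EOS> to each relevant chunk (sentence, paragraph, or document)
    before training or evaluating the LM."""
    return [chunk + ["<EOS>"] for chunk in chunks]
```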

  12. Outline: Recap: language modeling; Classification; Why incorporate uncertainty; Posterior decoding; Noisy channel decoding; Evaluation

  13. Probabilistic Classification. Directly model the posterior, q(Y | Z) = h(Y; Z): a discriminatively trained classifier. Model the posterior with Bayes rule, q(Y | Z) ∝ q(Z | Y) ∗ q(Y): a generatively trained classifier.

  14. Generative Training: Two Different Philosophical Frameworks. Bayes rule: q(Y | Z) = q(Z | Y) ∗ q(Y) / q(Z), where q(Y | Z) is the posterior probability, q(Z | Y) the likelihood, q(Y) the prior probability, and q(Z) the marginal likelihood (probability). Two decoding strategies: Posterior Classification/Decoding (maximum a posteriori) and Noisy Channel Model Decoding; there are others too (CMSC 478/678).
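Writing out the two decoding rules the slide contrasts, in the slide's q(Y | Z) notation: maximum a posteriori (posterior) decoding and noisy channel decoding pick the same class, because the marginal q(z) in Bayes rule does not depend on the class y:

```latex
\hat{y} \;=\; \arg\max_{y} \; q(y \mid z)
        \;=\; \arg\max_{y} \; \frac{q(z \mid y)\, q(y)}{q(z)}
        \;=\; \arg\max_{y} \; q(z \mid y)\, q(y)
```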

  15. Outline: Recap: language modeling; Classification; Why incorporate uncertainty; Posterior decoding; Noisy channel decoding; Evaluation

  16. Classification. Candidate labels: POLITICS, TERRORISM, SPORTS, TECH, HEALTH, FINANCE. Document: "Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. …"

  17. Classification. Candidate labels: POLITICS, TERRORISM, SPORTS, TECH, HEALTH, FINANCE. Document: "Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. …"

  18. Classification. Candidate labels: POLITICS, TERRORISM, SPORTS, TECH, HEALTH, FINANCE. Document: "Electronic alerts have been used to assist the authorities in moments of chaos and potential danger: after the Boston bombing in 2013, when the Boston suspects were still at large, and last month in Los Angeles, during an active shooter scare at the airport. …"

  19. Classification. Candidate labels: POLITICS, TERRORISM, SPORTS, TECH, HEALTH, FINANCE. Document: "Electronic alerts have been used to assist the authorities in moments of chaos and potential danger: after the Boston bombing in 2013, when the Boston suspects were still at large, and last month in Los Angeles, during an active shooter scare at the airport. …"

  20. Classify with Uncertainty Use probabilities

  21. Classify with Uncertainty: Use probabilities* (*There are non-probabilistic ways to handle uncertainty… but probabilities sure are handy!)

  22. Classification. Label probabilities: POLITICS .05, TERRORISM .48, SPORTS .0001, TECH .39, HEALTH .0001, FINANCE .0002. Document: "Electronic alerts have been used to assist the authorities in moments of chaos and potential danger: after the Boston bombing in 2013, when the Boston suspects were still at large, and last month in Los Angeles, during an active shooter scare at the airport. …"

  23. Text Classification: assigning subject categories, topics, or genres; spam detection; age/gender identification; language identification; sentiment analysis; authorship identification; …

  24. Text Classification (subject categories/topics/genres, spam detection, age/gender ID, language ID, sentiment analysis, authorship ID, …). Input: a document, and a fixed set of classes C = {c_1, c_2, …, c_J}. Output: a predicted class c from C.

  25. Text Classification (subject categories/topics/genres, spam detection, age/gender ID, language ID, sentiment analysis, authorship ID, …). Input: a document (a linguistic "blob"), and a fixed set of classes C = {c_1, c_2, …, c_J}. Output: a predicted class c from C.

  26. Text Classification: Hand-coded Rules? (subject categories/topics/genres, spam detection, age/gender ID, language ID, sentiment analysis, authorship ID, …). Rules based on combinations of words or other features, e.g. spam: black-list-address OR ("dollars" AND "have been selected"). Accuracy can be high if the rules are carefully refined by an expert, but building and maintaining these rules is expensive. Can humans faithfully assign uncertainty?
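Just to make the structure of such a rule concrete, here is the slide's spam example written as a tiny predicate (the black-list address is hypothetical):

```python
BLACKLIST = {"spammer@example.com"}  # hypothetical black-list of sender addresses

def looks_like_spam(sender, text):
    """spam if: black-listed address OR ("dollars" AND "have been selected")."""
    return sender in BLACKLIST or ("dollars" in text and "have been selected" in text)
```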

  27. Text Classification: Supervised Machine Learning (subject categories/topics/genres, spam detection, age/gender ID, language ID, sentiment analysis, authorship ID, …). Input: a document d; a fixed set of classes C = {c_1, c_2, …, c_J}; a training set of m hand-labeled documents (d_1, c_1), …, (d_m, c_m). Output: a learned classifier γ that maps documents to classes.

  28. Text Classification: Supervised Machine Learning (subject categories/topics/genres, spam detection, age/gender ID, language ID, sentiment analysis, authorship ID, …). Input: a document d; a fixed set of classes C = {c_1, c_2, …, c_J}; a training set of m hand-labeled documents (d_1, c_1), …, (d_m, c_m). Output: a learned classifier γ that maps documents to classes. Example classifiers: Naïve Bayes, logistic regression, support-vector machines, k-Nearest Neighbors, …

  29. Text Classification: Supervised Machine Learning (subject categories/topics/genres, spam detection, age/gender ID, language ID, sentiment analysis, authorship ID, …). Input: a document d; a fixed set of classes C = {c_1, c_2, …, c_J}; a training set of m hand-labeled documents (d_1, c_1), …, (d_m, c_m). Output: a learned classifier γ that maps documents to classes. Example classifiers: Naïve Bayes, logistic regression, support-vector machines, k-Nearest Neighbors, …
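As one concrete instance of a learned classifier γ from that list, here is a minimal, illustrative training routine for a Naïve Bayes model (not taken from the course): it collects class priors and add-λ smoothed per-class word likelihoods from the hand-labeled documents.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs, lam=1.0):
    """labeled_docs: iterable of (tokens, class_label) pairs.
    Returns add-lambda smoothed log q(class) and log q(word | class) tables."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_counts.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total_words = sum(word_counts[c].values())
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + lam) / (total_words + lam * len(vocab)))
            for w in vocab
        }
    return log_prior, log_likelihood, vocab
```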

  30. Outline: Recap: language modeling; Classification; Why incorporate uncertainty; Posterior decoding; Noisy channel decoding; Evaluation

  31. Generative Training: Two Different Philosophical Frameworks. Bayes rule: q(Y | Z) = q(Z | Y) ∗ q(Y) / q(Z), where q(Y | Z) is the posterior probability, q(Z | Y) the likelihood, q(Y) the prior probability, and q(Z) the marginal likelihood (probability). Two decoding strategies: Posterior Classification/Decoding (maximum a posteriori) and Noisy Channel Model Decoding.

  32. Probabilistic Text Classification (subject categories/topics/genres, spam detection, age/gender ID, language ID, sentiment analysis, authorship ID, …). q(Y | Z) = q(Z | Y) ∗ q(Y) / q(Z), where Y is the class and Z is the observed data.

  33. Probabilistic Text Classification (subject categories/topics/genres, spam detection, age/gender ID, language ID, sentiment analysis, authorship ID, …). q(Y | Z) = q(Z | Y) ∗ q(Y) / q(Z): q(Z | Y) is the class-based likelihood (a language model), q(Y) the prior probability of the class, q(Z) the observation likelihood (averaged over all classes), Y the class, and Z the observed data.

  34. Probabilistic Text Classification (subject categories/topics/genres, spam detection, age/gender ID, language ID, sentiment analysis, authorship ID, …). q(Y | Z) = q(Z | Y) ∗ q(Y) / q(Z): q(Z | Y) is the class-based likelihood (a language model), q(Y) the prior probability of the class, q(Z) the observation likelihood (averaged over all classes), Y the class, and Z the observed data.
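Tying slides 32-34 to the earlier Naïve Bayes sketch: if q(Z | Y) is a class-conditional unigram language model and q(Y) the class prior, decoding just picks the class with the largest log prior plus log likelihood; q(Z) is dropped because it is the same for every class. This is an illustrative sketch, not the course's reference implementation.

```python
def classify(tokens, log_prior, log_likelihood, vocab):
    """Noisy-channel / posterior decoding:
    argmax_c [ log q(c) + sum_w log q(w | c) ]  (q(Z) is constant in c, so it is dropped)."""
    def score(c):
        # Words outside the training vocabulary are skipped here;
        # a fuller version would map them to <UNK> as on slide 10.
        return log_prior[c] + sum(log_likelihood[c][w] for w in tokens if w in vocab)
    return max(log_prior, key=score)

# Illustrative usage with the hypothetical train_naive_bayes sketch above:
# log_prior, log_likelihood, vocab = train_naive_bayes(labeled_docs)
# predicted = classify(tokenized_new_document, log_prior, log_likelihood, vocab)
```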
