

SLIDE 1

Classification & The Noisy Channel Model

CMSC 473/673 UMBC

Some slides adapted from 3SLP

SLIDE 2

Outline

Recap: language modeling Classification Why incorporate uncertainty Posterior decoding Noisy channel decoding Evaluation

SLIDE 3

Chain Rule + Backoff (Markov assumption) = n-grams

SLIDE 4

N-Gram Terminology

n   Commonly called    History size (Markov order)   Example
1   unigram            0                             p(furiously)
2   bigram             1                             p(furiously | sleep)
3   trigram (3-gram)   2                             p(furiously | ideas sleep)
4   4-gram             3                             p(furiously | green ideas sleep)
n   n-gram             n−1                           p(w_i | w_{i−n+1} … w_{i−1})

How do we (efficiently) compute p(Colorless green ideas sleep furiously)?
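To make the chain-rule-plus-Markov idea concrete, here is a minimal sketch (assuming a toy corpus and unsmoothed MLE counts; the corpus and function names are illustrative, not from the slides) that scores a sentence with a bigram model:

```python
from collections import Counter

# Toy training corpus; in practice these counts come from a large corpus.
corpus = [["colorless", "green", "ideas", "sleep", "furiously"],
          ["green", "ideas", "sleep", "furiously"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))

def bigram_sentence_prob(sentence):
    """p(w1 ... wn) ~ p(w1) * prod_i p(w_i | w_{i-1}): chain rule + first-order Markov assumption."""
    total = sum(unigram_counts.values())
    p = unigram_counts[sentence[0]] / total                      # p(w1), unsmoothed MLE
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigram_counts[(prev, cur)] / unigram_counts[prev]   # MLE of p(cur | prev)
    return p

print(bigram_sentence_prob(["colorless", "green", "ideas", "sleep", "furiously"]))
```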

SLIDE 5

Language Models & Smoothing

  • Maximum likelihood (MLE): simple counting
  • Laplace smoothing, add-λ
  • Interpolation models
  • Discounted backoff
  • Interpolated (modified) Kneser-Ney
  • Good-Turing
  • Witten-Bell
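As one concrete instance from this list, a hedged sketch of add-λ smoothing for bigram probabilities (λ = 1 recovers Laplace smoothing; the count dictionaries and vocabulary size are assumed inputs, not defined by the slides):

```python
def add_lambda_bigram_prob(cur, prev, bigram_counts, unigram_counts, vocab_size, lam=0.1):
    """Add-lambda estimate: (c(prev, cur) + lam) / (c(prev) + lam * V).

    lam = 1 gives Laplace smoothing; lam -> 0 approaches the unsmoothed MLE."""
    return (bigram_counts.get((prev, cur), 0) + lam) / \
           (unigram_counts.get(prev, 0) + lam * vocab_size)
```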

SLIDE 6

Evaluation Framework

What is “correct”? What is working “well”?

  • Training data: acquire primary statistics for learning model parameters
  • Dev data: fine-tune any secondary (hyper)parameters
  • Test data: perform final evaluation

DO NOT ITERATE ON THE TEST DATA

SLIDE 7

Setting Hyperparameters

Use a development corpus. Choose the λs to maximize the probability of the dev data:

  • Fix the n-gram probabilities/counts (on the training data)
  • Search for the λs that give the largest probability to the held-out set

(Data splits: training data | dev data | test data)
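A sketch of this recipe under simple assumptions: a two-way bigram/unigram interpolation and a coarse grid search, where the probability functions are assumed to be fit on the training data and held fixed.

```python
import math

def dev_logprob(dev_sents, lam, p_bigram, p_unigram):
    # Dev-set log-probability of the interpolated model
    # p(w | h) = lam * p_bigram(w | h) + (1 - lam) * p_unigram(w).
    total = 0.0
    for sent in dev_sents:
        for prev, cur in zip(sent, sent[1:]):
            total += math.log(lam * p_bigram(cur, prev) + (1 - lam) * p_unigram(cur))
    return total

def tune_lambda(dev_sents, p_bigram, p_unigram, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    # Counts stay fixed (estimated on training data); only lambda is searched on dev data.
    return max(grid, key=lambda lam: dev_logprob(dev_sents, lam, p_bigram, p_unigram))
```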

SLIDE 8

Evaluating Language Models

What is “correct”? What is working “well”?

  • Extrinsic: evaluate the LM in a downstream task. Test an MT, ASR, etc. system and see which LM does better; this propagates & conflates errors.
  • Intrinsic: treat the LM as its own downstream task, using perplexity (from information theory).

SLIDE 9

Perplexity

Lower is better: lower perplexity means the model is less surprised by the data.

perplexity = exp( −(1/N) · Σ_{j=1..N} log q(x_j | h_j) )

where h_j is the n-gram history (the previous n−1 items).
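A small sketch of this formula, assuming `q(word, history)` returns a (smoothed, nonzero) probability and that the token stream is already padded; the signature is illustrative.

```python
import math

def perplexity(tokens, q, order=2):
    """exp( -(1/N) * sum_j log q(x_j | h_j) ), with h_j the previous order-1 tokens."""
    n = order - 1
    log_sum, N = 0.0, 0
    for j in range(n, len(tokens)):
        history = tuple(tokens[j - n:j])      # the n-gram history (n-1 items)
        log_sum += math.log(q(tokens[j], history))
        N += 1
    return math.exp(-log_sum / N)
```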

SLIDE 10

Implementation: Unknown words

Create an unknown word token <UNK>

Training:

  • 1. Create a fixed lexicon L of size V
  • 2. Change any word not in L to <UNK>
  • 3. Train LM as normal

Evaluation:

Use the <UNK> probabilities for any word not in the training lexicon L
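A hedged sketch of this recipe (the lexicon-size cutoff and function names are assumptions for illustration):

```python
from collections import Counter

UNK = "<UNK>"

def build_lexicon(train_sents, max_size=10000):
    # Step 1: a fixed lexicon L of size at most V, from training counts.
    counts = Counter(w for sent in train_sents for w in sent)
    return {w for w, _ in counts.most_common(max_size)}

def unkify(sent, lexicon):
    # Step 2 (training) and evaluation: any word outside L becomes <UNK>.
    return [w if w in lexicon else UNK for w in sent]
```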

SLIDE 11

Implementation: EOS Padding

Create an end-of-sentence (“chunk”) token <EOS>. Don’t estimate p(<BOS> | <EOS>).

Training & Evaluation:

  • 1. Identify the “chunks” that are relevant (sentences, paragraphs, documents)
  • 2. Append the <EOS> token to the end of the chunk
  • 3. Train or evaluate the LM as normal
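A minimal sketch of the padding step (the <BOS>/<EOS> symbols follow the slide; the helper itself is illustrative):

```python
BOS, EOS = "<BOS>", "<EOS>"

def pad_chunk(chunk_tokens, order=2):
    # Pad with order-1 <BOS> symbols so the first word has a full history,
    # and append one <EOS> so the model learns where chunks end.
    # Chunks are scored independently, so p(<BOS> | <EOS>) is never estimated.
    return [BOS] * (order - 1) + chunk_tokens + [EOS]

print(pad_chunk(["colorless", "green", "ideas", "sleep", "furiously"]))
```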


SLIDE 12

Outline

Recap: language modeling Classification Why incorporate uncertainty Posterior decoding Noisy channel decoding Evaluation

SLIDE 13

Probabilistic Classification

Discriminatively trained classifier: directly model the posterior
    q(Y | Z) = h(Y; Z)

Generatively trained classifier: model the posterior with Bayes rule
    q(Y | Z) ∝ q(Z | Y) · q(Y)

SLIDE 14

Generative Training: Two Different Philosophical Frameworks

q(Y | Z) = q(Z | Y) · q(Y) / q(Z)

  • q(Y | Z): posterior probability
  • q(Z | Y): likelihood
  • q(Y): prior probability
  • q(Z): marginal likelihood (probability)

Framework 1: Posterior classification/decoding (maximum a posteriori)
Framework 2: Noisy channel model decoding

there are others too (CMSC 478/678)

SLIDE 15

Outline

Recap: language modeling Classification Why incorporate uncertainty Posterior decoding Noisy channel decoding Evaluation

SLIDE 16

Classification

POLITICS TERRORISM SPORTS TECH HEALTH FINANCE …

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

SLIDE 17

Classification

POLITICS TERRORISM SPORTS TECH HEALTH FINANCE …

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

SLIDE 18

Classification

POLITICS TERRORISM SPORTS TECH HEALTH FINANCE …

Electronic alerts have been used to assist the authorities in moments of chaos and potential danger: after the Boston bombing in 2013, when the Boston suspects were still at large, and last month in Los Angeles, during an active shooter scare at the airport.

SLIDE 19

Classification

POLITICS TERRORISM SPORTS TECH HEALTH FINANCE …

Electronic alerts have been used to assist the authorities in moments of chaos and potential danger: after the Boston bombing in 2013, when the Boston suspects were still at large, and last month in Los Angeles, during an active shooter scare at the airport.

SLIDE 20

Classify with Uncertainty

Use probabilities

SLIDE 21

Classify with Uncertainty

Use probabilities*

*There are non-probabilistic ways to handle uncertainty… but probabilities sure are handy!

SLIDE 22

Classification

POLITICS .05 TERRORISM .48 SPORTS .0001 TECH .39 HEALTH .0001 FINANCE .0002 …

Electronic alerts have been used to assist the authorities in moments of chaos and potential danger: after the Boston bombing in 2013, when the Boston suspects were still at large, and last month in Los Angeles, during an active shooter scare at the airport.

SLIDE 23

Text Classification

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

SLIDE 24

Text Classification

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Input:

  • a document
  • a fixed set of classes C = {c1, c2, …, cJ}

Output: a predicted class c from C

SLIDE 25

Text Classification

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Input:

  • a document (really: any linguistic blob)
  • a fixed set of classes C = {c1, c2, …, cJ}

Output: a predicted class c from C

SLIDE 26

Text Classification: Hand-coded Rules?

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Rules based on combinations of words or other features

spam: black-list-address OR (“dollars” AND “have been selected”)

Accuracy can be high, if the rules are carefully refined by an expert.

But building and maintaining these rules is expensive. And can humans faithfully assign uncertainty?
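As a toy illustration of the hand-coded rule above (the blacklist address and the example message are invented):

```python
BLACKLIST = {"spammer@example.com"}   # hypothetical black-listed addresses

def is_spam(sender, text):
    # spam if: black-listed address OR ("dollars" AND "have been selected")
    return sender in BLACKLIST or ("dollars" in text and "have been selected" in text)

print(is_spam("friend@example.edu", "you have been selected to receive dollars"))  # True
```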

SLIDE 27

Text Classification: Supervised Machine Learning

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Input:

  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1, c1), …, (dm, cm)

Output:

a learned classifier γ that maps documents to classes

SLIDE 28

Text Classification: Supervised Machine Learning

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Input:

  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1, c1), …, (dm, cm)

Output:

a learned classifier γ that maps documents to classes

Naïve Bayes Logistic regression Support-vector machines k-Nearest Neighbors …

SLIDE 29

Text Classification: Supervised Machine Learning

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Naïve Bayes Logistic regression Support-vector machines k-Nearest Neighbors …

Input:

  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1, c1), …, (dm, cm)

Output:

a learned classifier γ that maps documents to classes

SLIDE 30

Outline

Recap: language modeling Classification Why incorporate uncertainty Posterior decoding Noisy channel decoding Evaluation

SLIDE 31

Generative Training: Two Different Philosophical Frameworks

q(Y | Z) = q(Z | Y) · q(Y) / q(Z)

  • q(Y | Z): posterior probability
  • q(Z | Y): likelihood
  • q(Y): prior probability
  • q(Z): marginal likelihood (probability)

Framework 1: Posterior classification/decoding (maximum a posteriori)
Framework 2: Noisy channel model decoding

SLIDE 32

Probabilistic Text Classification

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

q(Y | Z) = q(Z | Y) · q(Y) / q(Z)

Y: class
Z: observed data

SLIDE 33

Probabilistic Text Classification

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

q(Y | Z) = q(Z | Y) · q(Y) / q(Z)

Y: class
Z: observed data
q(Z | Y): class-based likelihood (language model)
q(Y): prior probability of class
q(Z): observation likelihood (averaged over all classes)

SLIDE 34

Probabilistic Text Classification

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

q(Y | Z) = q(Z | Y) · q(Y) / q(Z)

Y: class
Z: observed data
q(Z | Y): class-based likelihood (language model)
q(Y): prior probability of class
q(Z): observation likelihood (averaged over all classes)
SLIDE 35

Classification with Bayes Rule

argmax_Y q(Y | Z)

SLIDE 36

Classification with Bayes Rule

argmax_Y q(Z | Y) · q(Y) / q(Z)

SLIDE 37

Classification with Bayes Rule

argmax_Y q(Z | Y) · q(Y) / q(Z)

q(Z) is constant with respect to Y

SLIDE 38

Classification with Bayes Rule

argmax_Y q(Z | Y) · q(Y)

SLIDE 39

SLIDE 40

SLIDE 41

Classification with Bayes Rule

argmax_Y log q(Z | Y) + log q(Y)

SLIDE 42

Classification (labels) with Bayes Rule

argmax_Y log q(Z | Y) + log q(Y)

log q(Z | Y): how well does the text Z represent the label Y?
log q(Y): how likely is the label Y overall?

For “simple” or “flat” labels (a sketch of this loop follows):
  • iterate through the labels
  • evaluate the score for each label, keeping only the best (or n best)
  • return the best (or n best) label and score
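A hedged sketch of that loop, assuming per-class language models that return log q(Z | Y) and a prior q(Y); both are inputs here, not defined by the slides.

```python
import math

def classify(tokens, class_loglik, class_prior):
    """argmax_Y [ log q(Z | Y) + log q(Y) ] over a flat label set.

    class_loglik[y](tokens) is assumed to return log q(Z | y), e.g. from an
    n-gram LM trained only on class-y documents; class_prior[y] is q(y)."""
    best_label, best_score = None, float("-inf")
    for y, loglik in class_loglik.items():
        score = loglik(tokens) + math.log(class_prior[y])
        if score > best_score:
            best_label, best_score = y, score
    return best_label, best_score
```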

SLIDE 43

Classification/Decoding with Bayes Rule

argmax_Y log q(Z | Y) + log q(Y)

log q(Z | Y): how well does the text (complex input) Z represent the text (complex output) Y?
log q(Y): how likely is the text (complex output) Y overall?

  • iterate through the candidate outputs
  • evaluate the score for each, keeping only the best (or n best)
  • return the best (or n best) output and score

If Y is a string (or some other complex structure), this can be complicated

SLIDE 44

Outline

Recap: language modeling Classification Why incorporate uncertainty Posterior decoding Noisy channel decoding Evaluation

SLIDE 45

Generative Training: Two Different Philosophical Frameworks

q(Y | Z) = q(Z | Y) · q(Y) / q(Z)

  • q(Y | Z): posterior probability
  • q(Z | Y): likelihood
  • q(Y): prior probability
  • q(Z): marginal likelihood (probability)

Framework 1: Posterior classification/decoding (maximum a posteriori)
Framework 2: Noisy channel model decoding

SLIDE 46

Noisy Channel Model

SLIDE 47

Noisy Channel Model

what I want to tell you: “sports”

SLIDE 48

Noisy Channel Model

what I want to tell you: “sports”
what you actually see: “The Os lost again…”

SLIDE 49

Noisy Channel Model

what I want to tell you: “sports”
what you actually see: “The Os lost again…”
Decode (hypothesized intent): “sad stories”, “sports”

SLIDE 50

Noisy Channel Model

what I want to tell you: “sports”
what you actually see: “The Os lost again…”
Decode (hypothesized intent): “sad stories”, “sports”
Rerank (reweight according to what’s likely): “sports”

SLIDE 51

Noisy Channel

Machine translation Speech-to-text Spelling correction Text normalization Part-of-speech tagging Morphological analysis Image captioning …

q(Y | Z) = q(Z | Y) · q(Y) / q(Z)

Y: possible (clean) output
Z: observed (noisy) text
q(Z | Y): translation / decode model
q(Y): (clean) language model
q(Z): observation (noisy) likelihood
SLIDE 52

Noisy Channel

Machine translation Speech-to-text Spelling correction Text normalization Part-of-speech tagging Morphological analysis Image captioning …

q(Y | Z) = q(Z | Y) · q(Y) / q(Z)

Y: possible (clean) output
Z: observed (noisy) text
q(Z | Y): translation / decode model
q(Y): (clean) language model
q(Z): observation (noisy) likelihood
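A hedged sketch of noisy channel decoding for one of the listed applications, spelling correction; the candidate set, channel scores, and language model scores below are invented stand-ins, not part of the slides.

```python
def noisy_channel_decode(observed, candidates, channel_logprob, lm_logprob):
    # argmax_Y [ log q(Z | Y) + log q(Y) ]:
    # channel_logprob plays the translation/decode model q(Z | Y),
    # lm_logprob plays the clean language model q(Y).
    return max(candidates, key=lambda y: channel_logprob(observed, y) + lm_logprob(y))

# Toy usage with made-up log-probabilities: "the" is the likeliest clean source of "teh".
candidates = ["the", "ten", "teh"]
channel = lambda z, y: {"the": -1.0, "ten": -3.0, "teh": -0.5}[y]   # log q(z="teh" | y)
lm = lambda y: {"the": -2.0, "ten": -6.0, "teh": -12.0}[y]          # log q(y)
print(noisy_channel_decode("teh", candidates, channel, lm))          # -> "the"
```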

SLIDE 53

Language Model

Use any of the language modeling algorithms we’ve learned:
  • unigram, bigram, trigram
  • add-λ, interpolation, backoff
  • (later: maxent LMs, RNNs, hierarchical Bayesian LMs, …)

SLIDE 54

Probabilistic Classification

𝑞 𝑌 𝑍) ∝ 𝑞 𝑍 𝑌) ∗ 𝑞(𝑌)

𝑞 𝑌 𝑍) = ℎ(𝑌; 𝑍)

Discriminatively trained classifier Generatively trained classifier

Directly model the posterior Model the posterior with Bayes rule

Posterior Classification/Decoding maximum a posteriori Noisy Channel Model Decoding

SLIDE 55

Outline

Recap: language modeling Classification Why incorporate uncertainty Posterior decoding Noisy channel decoding Evaluation

SLIDE 56

Evaluation: the 2-by-2 contingency table

                              Actually correct     Actually incorrect
Selected / guessed
Not selected / not guessed

SLIDE 57

Evaluation: the 2-by-2 contingency table

                              Actually correct     Actually incorrect
Selected / guessed
Not selected / not guessed

SLIDE 58

Evaluation: the 2-by-2 contingency table

                              Actually correct     Actually incorrect
Selected / guessed            True Positive (TP)
Not selected / not guessed

SLIDE 59

Evaluation: the 2-by-2 contingency table

                              Actually correct     Actually incorrect
Selected / guessed            True Positive (TP)   False Positive (FP)
Not selected / not guessed

SLIDE 60

Evaluation: the 2-by-2 contingency table

                              Actually correct     Actually incorrect
Selected / guessed            True Positive (TP)   False Positive (FP)
Not selected / not guessed    False Negative (FN)

SLIDE 61

Evaluation: the 2-by-2 contingency table

                              Actually correct     Actually incorrect
Selected / guessed            True Positive (TP)   False Positive (FP)
Not selected / not guessed    False Negative (FN)  True Negative (TN)

SLIDE 62

Accuracy, Precision, and Recall

Accuracy: % of items correct

                              Actually correct     Actually incorrect
Selected / guessed            True Positive (TP)   False Positive (FP)
Not selected / not guessed    False Negative (FN)  True Negative (TN)

accuracy = (TP + TN) / (TP + FP + FN + TN)

SLIDE 63

Accuracy, Precision, and Recall

Accuracy: % of items correct Precision: % of selected items that are correct

                              Actually correct     Actually incorrect
Selected / guessed            True Positive (TP)   False Positive (FP)
Not selected / not guessed    False Negative (FN)  True Negative (TN)

precision = TP / (TP + FP)
accuracy = (TP + TN) / (TP + FP + FN + TN)

SLIDE 64

Accuracy, Precision, and Recall

Accuracy: % of items correct Precision: % of selected items that are correct Recall: % of correct items that are selected

                              Actually correct     Actually incorrect
Selected / guessed            True Positive (TP)   False Positive (FP)
Not selected / not guessed    False Negative (FN)  True Negative (TN)

precision = TP / (TP + FP)
recall = TP / (TP + FN)
accuracy = (TP + TN) / (TP + FP + FN + TN)
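These three definitions in a small sketch (arguments named after the table cells):

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)   # % of items correct

def precision(tp, fp):
    return tp / (tp + fp)                    # % of selected items that are correct

def recall(tp, fn):
    return tp / (tp + fn)                    # % of correct items that are selected
```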

SLIDE 65

A combined measure: F

Weighted (harmonic) average of precision & recall:

F = 1 / ( α · (1/P) + (1 − α) · (1/R) )

SLIDE 66

A combined measure: F

Weighted (harmonic) average of precision & recall:

F = 1 / ( α · (1/P) + (1 − α) · (1/R) ) = (1 + β²) · P · R / (β² · P + R), with β² = (1 − α)/α

algebra (not important)

SLIDE 67

A combined measure: F

Weighted (harmonic) average of precision & recall.

F_β = (1 + β²) · P · R / (β² · P + R)

Balanced F1 measure (β = 1):

F1 = 2 · P · R / (P + R)
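And the combined measure, following the formula above (β = 1 gives the balanced F1):

```python
def f_beta(p, r, beta=1.0):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives F1 = 2PR / (P + R).
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(f_beta(0.5, 0.9))   # balanced F1 of P = 0.5, R = 0.9: about 0.64
```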

SLIDE 68

Micro- vs. Macro-Averaging

If we have more than one class, how do we combine multiple performance measures into one quantity?

  • Macroaveraging: compute performance for each class, then average.
  • Microaveraging: collect decisions for all classes, compute the contingency table, evaluate.

(Sec. 15.2.4)
SLIDE 69

Micro- vs. Macro-Averaging: Example

Class 1:
                    Truth: yes    Truth: no
  Classifier: yes       10            10
  Classifier: no        10           970

Class 2:
                    Truth: yes    Truth: no
  Classifier: yes       90            10
  Classifier: no        10           890

Micro-averaged (pooled) table:
                    Truth: yes    Truth: no
  Classifier: yes      100            20
  Classifier: no        20          1860

(Sec. 15.2.4)

Macroaveraged precision: (0.5 + 0.9) / 2 = 0.7
Microaveraged precision: 100 / 120 ≈ 0.83
The microaveraged score is dominated by the score on the common classes.
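A sketch reproducing the macro vs. micro computation from the tables above, using the per-class (TP, FP) counts:

```python
def macro_precision(per_class):
    # Average the per-class precisions: every class counts equally.
    return sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)

def micro_precision(per_class):
    # Pool the counts first: frequent classes dominate the result.
    tp_total = sum(tp for tp, _ in per_class)
    fp_total = sum(fp for _, fp in per_class)
    return tp_total / (tp_total + fp_total)

per_class = [(10, 10), (90, 10)]       # (TP, FP) for class 1 and class 2
print(macro_precision(per_class))      # (0.5 + 0.9) / 2 = 0.7
print(micro_precision(per_class))      # 100 / 120 ≈ 0.83
```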

SLIDE 70

Outline

Recap: language modeling Classification Why incorporate uncertainty Posterior decoding Noisy channel decoding Evaluation