Classification & The Noisy Channel Model
CMSC 473/673 UMBC September 13th, 2017
Some slides adapted from 3SLP
Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today.
N-Gram Terminology
 n   commonly called     history size (Markov order)   example
 1   unigram             0                             p(furiously)
 2   bigram              1                             p(furiously | sleep)
 3   trigram (3-gram)    2                             p(furiously | ideas sleep)
 4   4-gram              3                             p(furiously | green ideas sleep)
 n   n-gram              n-1                           p(w_i | w_{i-n+1} ... w_{i-1})

How to (efficiently) compute p(Colorless green ideas sleep furiously)?
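The chain-rule decomposition can be sketched as follows; the `trigram_prob` callback and the constant 0.1 are illustrative assumptions, not a trained model:

```python
# Sketch: decompose a sentence probability into trigram factors.
def sentence_prob(words, trigram_prob):
    """p(w_1..w_n) as a product of p(w_i | w_{i-2}, w_{i-1}) with padding."""
    padded = ["<BOS>", "<BOS>"] + words + ["<EOS>"]
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= trigram_prob(padded[i - 2], padded[i - 1], padded[i])
    return prob

# toy model: every trigram gets probability 0.1 (6 factors -> 0.1^6 = 1e-6)
p = sentence_prob("Colorless green ideas sleep furiously".split(),
                  lambda u, v, w: 0.1)
```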
Count-Based N-Grams (Unigrams)
Count-Based N-Grams (Trigrams)
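A minimal count-based (MLE) trigram estimator sketch, assuming whitespace-tokenized sentences; the helper name `train_trigram_mle` is mine, not from the slides:

```python
from collections import Counter

def train_trigram_mle(sentences):
    """Count-based (MLE) trigrams: p(w | u, v) = count(u, v, w) / count(u, v)."""
    tri, bi = Counter(), Counter()
    for sent in sentences:
        toks = ["<BOS>", "<BOS>"] + sent.split() + ["<EOS>"]
        for i in range(2, len(toks)):
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
            bi[(toks[i - 2], toks[i - 1])] += 1
    # unseen contexts get probability 0.0 (this is what smoothing later fixes)
    return lambda u, v, w: tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

p = train_trigram_mle(["the cat sat", "the cat ran"])
# p(sat | the cat) = 1/2; p(the | <BOS> <BOS>) = 2/2 = 1
```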
Add-λ estimation
Laplace smoothing (λ = 1), Lidstone smoothing (general λ): pretend we saw each word λ more times than we did, i.e., add λ to all the counts and renormalize.
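A sketch of add-λ estimation for unigrams; function and argument names are mine:

```python
from collections import Counter

def add_lambda_unigram(tokens, lam, vocab):
    """Add-lambda (Lidstone) smoothing: p(w) = (count(w) + lam) / (N + lam*|V|)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (counts[w] + lam) / (total + lam * len(vocab)) for w in vocab}

probs = add_lambda_unigram(["a", "a", "b"], lam=1.0, vocab={"a", "b", "c"})
# p(a) = (2+1)/(3+3) = 0.5; unseen "c" gets (0+1)/6 instead of zero
```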
Linear Interpolation
Simple interpolation --> Averaging
q(z | y) = μ·q₂(z | y) + (1 − μ)·q₁(z),   0 ≤ μ ≤ 1

where q₁ is the simpler (lower-order) model.
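Linear interpolation can be sketched directly from the formula; the bigram/unigram values below are made-up stand-ins for trained models:

```python
def interpolate(q2, q1, mu):
    """q(z | y) = mu * q2(z | y) + (1 - mu) * q1(z), with 0 <= mu <= 1."""
    assert 0.0 <= mu <= 1.0
    return lambda z, y: mu * q2(z, y) + (1 - mu) * q1(z)

bigram = lambda z, y: 0.4    # hypothetical p(z | y)
unigram = lambda z: 0.1      # hypothetical simpler model p(z)
q = interpolate(bigram, unigram, mu=0.75)
# 0.75 * 0.4 + 0.25 * 0.1 = 0.325
```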
Discounted Backoff
Trust your statistics, up to a point
p(z | y) = (count(y, z) − β) / count(y)   if count(y, z) > 0
p(z | y) = α(y) · p(z)                    otherwise

β: discount constant; α(y): context-dependent normalization constant; p(z): simpler (lower-order) model.
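A toy discounted-backoff bigram sketch under the equation's assumptions (a fixed discount β, with the freed mass renormalized over unseen continuations); the implementation details are my own illustration:

```python
from collections import Counter

def backoff_bigram(tokens, beta=0.5):
    """Discounted backoff sketch: subtract beta from seen bigram counts and
    give the freed mass to the unigram model via a context-dependent alpha(y)."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    N = len(tokens)

    def p_uni(z):
        return uni[z] / N

    def p(z, y):
        if bi[(y, z)] > 0:
            return (bi[(y, z)] - beta) / uni[y]
        seen = [w for w in uni if bi[(y, w)] > 0]
        alpha = beta * len(seen) / uni[y]       # mass freed by discounting
        unseen_mass = sum(p_uni(w) for w in uni if bi[(y, w)] == 0)
        return alpha * p_uni(z) / unseen_mass if unseen_mass else 0.0

    return p

p = backoff_bigram(["a", "b", "a", "c"], beta=0.5)
# probabilities over all continuations of "a" still sum to 1
```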
Evaluation Framework
What is “correct?” What is working “well?”
Training Data: acquire primary statistics for learning model parameters
Dev Data: fine-tune any secondary (hyper)parameters
Test Data: perform final evaluation
DO NOT ITERATE ON THE TEST DATA
Setting Hyperparameters
Use a development corpus. Choose the λs to maximize the probability of the dev data: fix the n-gram probabilities/counts (on the training data), then search for the λs that give the largest probability to the held-out set.
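A grid-search sketch for the interpolation weight on held-out data; the probability tables and dev set below are toy assumptions:

```python
import math

def dev_log_prob(mu, dev_bigrams, p2, p1):
    """Log-probability of held-out bigrams under interpolated model."""
    total = 0.0
    for (y, z) in dev_bigrams:
        q = mu * p2.get((y, z), 0.0) + (1 - mu) * p1.get(z, 0.0)
        total += math.log(q)
    return total

p2 = {("green", "ideas"): 0.2}   # hypothetical trained bigram probs
p1 = {"ideas": 0.05}             # hypothetical trained unigram probs
dev = [("green", "ideas")]

# search mu in {0.0, 0.1, ..., 1.0} for the best held-out probability
best_mu = max((m / 10 for m in range(11)),
              key=lambda m: dev_log_prob(m, dev, p2, p1))
```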
Evaluating Language Models
What is “correct?” What is working “well?”
Extrinsic: evaluate the LM in a downstream task. Test an MT, ASR, etc. system and see which LM does better (but this propagates & conflates errors).
Intrinsic: treat the LM as its own downstream task. Use perplexity (from information theory).
Perplexity
Perplexity of w₁ ... w_N:  PP = p(w₁ ... w_N)^(−1/N), where each word is conditioned on its n-gram history (n−1 items).
Lower is better: lower perplexity --> the model is less surprised.
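A sketch of computing perplexity from per-word log-probabilities (helper name is mine):

```python
import math

def perplexity(log_probs):
    """PP = exp(-(1/N) * sum of log p(w_i | history)); lower is better."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# a uniform model over a 10-word vocabulary assigns p = 0.1 to every word,
# so its perplexity is exactly 10 regardless of sequence length
pp = perplexity([math.log(0.1)] * 20)
```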
Maximum Likelihood Estimates
Maximizes the likelihood of the training set.
Do different corpora look the same?
For large data: can actually do reasonably well.
q(item) ∝ count(item)
Implementation: Unknown words
Create an unknown word token <UNK>.
Training: fix a vocabulary, replace out-of-vocabulary words in the training data with <UNK>, and estimate its probabilities like any normal word.
Evaluation: use the <UNK> probabilities for any word not seen in training.
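The <UNK> replacement step can be sketched as follows; the frequency cutoff `min_count` is one common (but not the only) way to fix the vocabulary:

```python
from collections import Counter

def apply_unk(sentences, min_count=2):
    """Replace rare training words (count < min_count) with <UNK>;
    the same mapping sends any out-of-vocabulary word to <UNK> at eval time."""
    counts = Counter(w for s in sentences for w in s.split())
    vocab = {w for w, c in counts.items() if c >= min_count}

    def mapped(sent):
        return [w if w in vocab else "<UNK>" for w in sent.split()]

    return vocab, mapped

vocab, mapped = apply_unk(["the cat sat", "the dog sat"])
# "the" and "sat" occur twice and stay; "cat"/"dog" are rare -> <UNK>
```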
<BOS>/<EOS> Padding
p(Colorless green ideas sleep furiously) = p(Colorless | <BOS> <BOS>) * p(green | <BOS> Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep) * p(<EOS> | sleep furiously)
Consistent notation: Pad the left with <BOS> (beginning of sentence) symbols Fully proper distribution: Pad the right with a single <EOS> symbol
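The padding convention above can be sketched for the trigram case:

```python
def pad_for_trigrams(words):
    """Left-pad with n-1 = 2 <BOS> symbols for consistent notation;
    right-pad with a single <EOS> so the model is a proper distribution."""
    return ["<BOS>", "<BOS>"] + words + ["<EOS>"]

padded = pad_for_trigrams("Colorless green ideas sleep furiously".split())
trigrams = [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]
# first factor: p(Colorless | <BOS> <BOS>); last: p(<EOS> | sleep furiously)
```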
Implementation: EOS Padding
Create an end-of-sentence (“chunk”) token <EOS>; chunks can be sentences, paragraphs, or documents.
Don’t estimate p(<BOS> | <EOS>).
Training & Evaluation: add <EOS> to the end of each chunk.
Other Kinds of Smoothing
Interpolated (modified) Kneser-Ney
Idea: how “productive” is a context? Count how many different word types v appear in a context x, y.
Good-Turing
Partition words into classes of occurrence Smooth class statistics Properties of classes are likely to predict properties of other classes
Witten-Bell
Idea: Every observed type was at some point novel Give MLE prediction for novel type occurring
Bayes Rule → NLP Applications
p(X | Y) = p(Y | X) · p(X) / p(Y)
posterior probability = likelihood × prior probability / marginal likelihood (probability of the data)
Two Different Philosophical Frameworks
p(X | Y) = p(Y | X) · p(X) / p(Y)
(posterior probability = likelihood × prior probability / marginal likelihood)

1. Posterior classification/decoding: maximum a posteriori
2. Noisy channel model decoding
there are others too (CMSC 478/678)
Classification
POLITICS TERRORISM SPORTS TECH HEALTH FINANCE …
Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.
Classification
POLITICS TERRORISM SPORTS TECH HEALTH FINANCE …
Electronic alerts have been used to assist the authorities in moments of chaos and potential danger: after the Boston bombing in 2013, when the Boston suspects were still at large, and last month in Los Angeles, during an active shooter scare at the airport.
Classify with Uncertainty
Use probabilities*
*There are non-probabilistic ways to handle uncertainty… but probabilities sure are handy!
Classification
POLITICS .05 TERRORISM .48 SPORTS .0001 TECH .39 HEALTH .0001 FINANCE .0002 …
Electronic alerts have been used to assist the authorities in moments of chaos and potential danger: after the Boston bombing in 2013, when the Boston suspects were still at large, and last month in Los Angeles, during an active shooter scare at the airport.
Text Classification
Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …
Input:
a document (more generally, any “linguistic blob”)
a fixed set of classes C = {c1, c2, ..., cJ}
Output: a predicted class c from C
Text Classification: Hand-coded Rules?
Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …
Rules based on combinations of words or other features
spam: black-list-address OR (“dollars” AND “have been selected”)
Accuracy can be high if the rules are carefully refined by an expert, but building and maintaining these rules is expensive. And can humans faithfully assign uncertainty?
Text Classification: Supervised Machine Learning
Input:
a document d a fixed set of classes C = {c1, c2,…, cJ} A training set of m hand-labeled documents (d1,c1),....,(dm,cm)
Output:
a learned classifier γ that maps documents to classes
Naïve Bayes Logistic regression Support-vector machines k-Nearest Neighbors …
Probabilistic Text Classification
p(class | data) ∝ p(data | class) × p(class)
class-based likelihood (language model) × prior probability of class
Two Different Philosophical Frameworks
p(X | Y) = p(Y | X) · p(X) / p(Y)
(posterior probability = likelihood × prior probability / marginal likelihood)

1. Posterior classification/decoding: maximum a posteriori
2. Noisy channel model decoding
there are others too (CMSC 478/678)
Noisy Channel Model
what I want to tell you: “sports”
what you actually see: “The Os lost again…”
Decode a hypothesized intent: “sad stories”? “sports”?
Rerank: reweight according to what’s likely --> “sports”
Noisy Channel
Machine translation Speech-to-text Spelling correction Text normalization Part-of-speech tagging Morphological analysis Image captioning …
best hypothesis = argmax over possible (clean) text y of  p(noisy text | y) × p(y)
translation/decode model × (clean) language model
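A toy spelling-correction sketch of noisy channel decoding; every probability below is a made-up illustration, not an estimated model:

```python
# p(observed | intended): hypothetical channel/decode model
channel = {
    ("teh", "the"): 0.8,
    ("teh", "ten"): 0.01,
}
# hypothetical clean language model p(y)
lm = {"the": 0.05, "ten": 0.001}

def decode(noisy, candidates):
    """Pick the clean hypothesis y maximizing p(noisy | y) * p(y)."""
    return max(candidates,
               key=lambda y: channel.get((noisy, y), 0.0) * lm.get(y, 0.0))

best = decode("teh", ["the", "ten"])
```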
Language Model
Use any of the language modeling algorithms we’ve learned Unigram, bigram, trigram Add-λ, interpolation, backoff (Later: Maxent, RNNs, hierarchical Bayesian LMs, …)
Two Different Philosophical Frameworks
p(X | Y) = p(Y | X) · p(X) / p(Y)
(posterior probability = likelihood × prior probability / marginal likelihood)

1. Posterior classification/decoding: maximum a posteriori
2. Noisy channel model decoding
there are others too (CMSC 478/678)
Classification with Bayes Rule

X* = argmax_X p(X | Y) = argmax_X [ p(Y | X) p(X) / p(Y) ] = argmax_X p(Y | X) p(X)

The marginal p(Y) is the same for every class, i.e., constant with respect to X, so it can be dropped from the argmax.
Evaluation: the 2-by-2 contingency table

                           Actually Correct      Actually Incorrect
Selected/guessed           True Positive (TP)    False Positive (FP)
Not selected/not guessed   False Negative (FN)   True Negative (TN)
Accuracy, Precision, and Recall
Accuracy: % of items correct = (TP + TN) / (TP + FP + FN + TN)
Precision: % of selected items that are correct = TP / (TP + FP)
Recall: % of correct items that are selected = TP / (TP + FN)

                           Actually Correct      Actually Incorrect
Selected/guessed           True Positive (TP)    False Positive (FP)
Not selected/not guessed   False Negative (FN)   True Negative (TN)
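The three measures can be sketched directly from the contingency-table counts (the toy counts below are illustrative):

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    return tp / (tp + fp)   # % of selected items that are correct

def recall(tp, fn):
    return tp / (tp + fn)   # % of correct items that are selected

# toy contingency table: 10 TP, 10 FP, 10 FN, 970 TN
tp, fp, fn, tn = 10, 10, 10, 970
# precision = recall = 10/20 = 0.5, accuracy = 980/1000 = 0.98
```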
A combined measure: F
Weighted (harmonic) average of Precision & Recall:
F_β = (1 + β²) · P · R / (β² · P + R)
Balanced F1 measure (β = 1): F1 = 2PR / (P + R)
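As a one-function sketch of the F measure:

```python
def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives balanced F1."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

f1 = f_beta(0.5, 0.5)   # 2PR/(P+R) = 0.5 when P = R = 0.5
```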
Micro- vs. Macro-Averaging
If we have more than one class, how do we combine multiple performance measures into one quantity? Macroaveraging: Compute performance for each class, then average. Microaveraging: Collect decisions for all classes, compute contingency table, evaluate.
Micro- vs. Macro-Averaging: Example
Class 1:
                    Truth: yes   Truth: no
  Classifier: yes        10          10
  Classifier: no         10         970

Class 2:
                    Truth: yes   Truth: no
  Classifier: yes        90          10
  Classifier: no         10         890

Micro-average (pooled) table:
                    Truth: yes   Truth: no
  Classifier: yes       100          20
  Classifier: no         20        1860

Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 ≈ .83
The microaveraged score is dominated by the score on common classes.
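The two averages from the example can be checked in a few lines:

```python
def precision(tp, fp):
    return tp / (tp + fp)

# the two per-class tables from the example above
class1 = dict(tp=10, fp=10)   # precision 0.5
class2 = dict(tp=90, fp=10)   # precision 0.9

# macro: average the per-class scores; micro: pool the counts first
macro = (precision(**class1) + precision(**class2)) / 2
micro = precision(class1["tp"] + class2["tp"], class1["fp"] + class2["fp"])
# macro = 0.7, micro = 100/120: micro is dominated by the common class
```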
Language Modeling as Naïve Bayes Classifier
p(class | data) ∝ p(data | class) × p(class)
posterior probability ∝ class-based likelihood (language model) × prior probability of class
(posterior classification/decoding: maximum a posteriori; cf. noisy channel model decoding)
The Bag of Words Representation
Bag of Words Representation
seen 2, sweet 1, whimsical 1, recommend 1, happy 1, ...
Language Modeling as Naïve Bayes Classifier

Start with Bayes Rule:
  X* = argmax_X p(X | Y1, ..., Yn) = argmax_X p(Y1, ..., Yn | X) p(X)
Adopt a naïve bag of words representation of the features Yi.
Assume position doesn’t matter.
Assume the feature probabilities are independent given the class X:
  p(Y1, ..., Yn | X) = ∏_i p(Yi | X)
Multinomial Naïve Bayes: Learning

From the training corpus, extract the Vocabulary.

Calculate the P(cj) terms: for each cj in C,
  docsj = all docs with class = cj
  P(cj) = |docsj| / (total # of docs)

Calculate the P(wk | cj) terms:
  Textj = a single document containing all of docsj
  For each word wk in Vocabulary: nk = # of occurrences of wk in Textj
  P(wk | cj) = nk / (# of tokens in Textj)    <- the class LM
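The learning recipe above can be sketched end to end. I add add-1 smoothing to the class LM (a common choice the recipe leaves unspecified), and the helper names are mine:

```python
from collections import Counter, defaultdict
import math

def train_nb(docs):
    """Multinomial Naive Bayes: docs is a list of (text, class) pairs."""
    vocab = {w for text, _ in docs for w in text.split()}
    class_docs = defaultdict(list)
    for text, c in docs:
        class_docs[c].append(text)
    log_prior, log_lik = {}, {}
    for c, texts in class_docs.items():
        log_prior[c] = math.log(len(texts) / len(docs))     # P(cj)
        nk = Counter(w for t in texts for w in t.split())   # counts in Textj
        n = sum(nk.values())
        # class LM with add-1 smoothing: P(wk | cj) = (nk + 1) / (n + |V|)
        log_lik[c] = {w: math.log((nk[w] + 1) / (n + len(vocab))) for w in vocab}
    return log_prior, log_lik, vocab

def classify(text, log_prior, log_lik, vocab):
    def score(c):
        return log_prior[c] + sum(log_lik[c][w] for w in text.split() if w in vocab)
    return max(log_prior, key=score)

lp, ll, V = train_nb([("good fun film", "pos"), ("bad boring film", "neg")])
label = classify("fun fun film", lp, ll, V)
```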
Naïve Bayes and Language Modeling
Naïve Bayes classifiers can use any sort of feature. But if, as in the previous slides, we use only word features, and we use all of the words in the text (not a subset), then Naïve Bayes has an important similarity to language modeling.
Naïve Bayes as a Language Model (Sec. 13.2.1)

Which class assigns the higher probability to s = “I love this fun film”?

                  I       love     this    fun      film
Positive model    0.1     0.1      0.01    0.05     0.1
Negative model    0.2     0.001    0.01    0.005    0.1

P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 ≈ 5e-7
P(s | neg) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 ≈ 1e-9

5e-7 ≈ P(s | pos) > P(s | neg) ≈ 1e-9
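The example's arithmetic can be reproduced directly from the per-word class-conditional probabilities on the slide:

```python
# per-word class-conditional probabilities from the slide
pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

def sent_prob(words, model):
    """Naive Bayes class LM: product of per-word probabilities."""
    p = 1.0
    for w in words:
        p *= model[w]
    return p

s = "I love this fun film".split()
p_pos = sent_prob(s, pos)   # about 5e-7
p_neg = sent_prob(s, neg)   # about 1e-9
```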
Brill and Banko (2001): with enough data, the choice of classifier may not matter.
Summary: Naïve Bayes is Not So Naïve
Very fast, low storage requirements
Robust to irrelevant features
Very good in domains with many equally important features
Optimal if the independence assumptions hold
A dependable baseline for text classification (but often not the best)
But: Naïve Bayes Isn’t Without Issue
Should we model the posterior in one go? Are the features really uncorrelated? Are plain counts always appropriate? Are there “better” (automated, more principled) ways of handling missing/noisy data?