Introduction to Machine Learning: Classification and The Noisy Channel Model
CMSC 473/673 UMBC
Some slides adapted from 3SLP
Outline:
Classification
Why incorporate uncertainty
Classification with Bayes Rule
Example: Email Classifier
Discriminatively trained classifier: directly model the posterior.
Generatively trained classifier: model the posterior with Bayes rule.
Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.
Electronic alerts have been used to assist the authorities in moments of chaos and potential danger: after the Boston bombing in 2013, when the Boston suspects were still at large, and last month in Los Angeles, during an active shooter scare at the airport.
Source: http://www.nytimes.com/2016/09/20/nyregion/cellphone-alerts-used-in-search-of-manhattan-bombing-suspect.html
Use probabilities*
*There are non-probabilistic ways to handle uncertainty… but probabilities sure are handy!
For the alert text above, a probabilistic classifier might output:
POLITICS .05  TERRORISM .48  SPORTS .0001  TECH .39  HEALTH .0001  FINANCE .0002  …
Text classification tasks: assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis; …
Input:
a document
a fixed set of classes C = {c1, c2, …, cJ}
Output: a predicted class c from C
More generally, the input need not be a document but any "linguistic blob":
Input:
a linguistic blob (e.g., a document)
a fixed set of classes C = {c1, c2, …, cJ}
Output: a predicted class c from C
Hand-coded rules: rules based on combinations of words or other features, e.g.
spam: black-list-address OR ("dollars" AND "have been selected")
Accuracy can be high if the rules are carefully refined by an expert, but building and maintaining these rules is expensive. Can humans faithfully assign uncertainty?
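The slide's example rule can be sketched as code. This is a minimal illustration, not a real spam filter; the blacklist address is a made-up placeholder.

```python
# A minimal sketch of a hand-coded rule classifier, following the
# slide's example rule. BLACKLIST is a hypothetical stand-in for a
# real blacklist of sender addresses.
BLACKLIST = {"winner@prizes.example"}

def is_spam(sender: str, body: str) -> bool:
    # spam: black-list-address OR ("dollars" AND "have been selected")
    return (sender in BLACKLIST
            or ("dollars" in body and "have been selected" in body))

print(is_spam("friend@umbc.edu", "Lunch tomorrow?"))                    # False
print(is_spam("a@b.example", "You have been selected to win dollars"))  # True
```

Note the brittleness: a single rephrasing of "have been selected" evades the rule, which is exactly why maintaining such rule sets is expensive.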
Input:
a document d
a fixed set of classes C = {c1, c2, …, cJ}
a training set of m hand-labeled documents (d1,c1), …, (dm,cm)
Output:
a learned classifier γ that maps documents to classes
Example classifiers: Naïve Bayes, logistic regression, support-vector machines, k-nearest neighbors, …
If y ∈ {0,1} (or y ∈ {True, False}), then it is a binary classification task.
If y ∈ {0, 1, …, K−1} (for finite K), then it is a multi-class classification task.
Q: What are some examples?

Single- vs. multi-label:
Given input x, predict multiple discrete labels y = (y1, …, yL).
If multiple yi are predicted, then it is a multi-label classification task. Each yi could be binary or multi-class.
Classification with Bayes rule:

p(class | data) = p(data | class) · p(class) / p(data)

p(data | class): class-based likelihood (a language model); how well does text X represent label Y?
p(class): prior probability of the class; how likely is label Y overall?
p(data): constant with respect to Y.
For "simple" or "flat" labels:
* iterate through labels
* evaluate the score for each label, keeping only the best (n best)
* return the best (or n best) label and score
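The flat-label decoding loop above can be sketched in Python. The priors and per-class unigram likelihoods below are toy numbers invented for illustration; a real system would estimate them from labeled data.

```python
from math import log

# A minimal sketch of flat-label decoding with Bayes rule.
# PRIOR is p(label); LIKELIHOOD holds toy unigram language models
# p(word | label). All numbers are illustrative assumptions.
PRIOR = {"sports": 0.3, "terrorism": 0.2, "tech": 0.5}
LIKELIHOOD = {
    "sports":    {"game": 0.5,  "bomb": 0.01, "code": 0.1},
    "terrorism": {"game": 0.05, "bomb": 0.6,  "code": 0.05},
    "tech":      {"game": 0.2,  "bomb": 0.02, "code": 0.5},
}

def classify(words):
    best_label, best_score = None, float("-inf")
    for label in PRIOR:                      # iterate through labels
        # score = log p(label) + sum_w log p(w | label)
        score = log(PRIOR[label]) + sum(log(LIKELIHOOD[label][w]) for w in words)
        if score > best_score:               # keep only the best
            best_label, best_score = label, score
    return best_label, best_score            # return the best label and score

print(classify(["game", "game"])[0])  # sports
```

Working in log space avoids underflow when documents are long, and the normalizing constant p(data) is dropped because it is the same for every label.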
When the label is itself text: how well does text (complex input) X represent text (complex output) Y? How likely is text (complex output) Y overall?
* iterate through labels
* evaluate the score for each label, keeping only the best (n best)
* return the best (or n best) label and score
If Y is a string (or some complex structure), this iteration can be complicated.
The noisy channel:
what I want to tell you: "sports"
what you actually see: "The Os lost again…"
Decode, hypothesized intent: "sad stories", "sports"
Rerank, reweight according to what's likely: "sports"
Noisy channel applications: machine translation, speech-to-text, spelling correction, text normalization, part-of-speech tagging, morphological analysis, image captioning, …
Score each possible (clean) output Y for the observed (noisy) text X:

p(X | Y) · p(Y)

p(X | Y): translation/decode model
p(Y): (clean) language model
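One of the applications above, spelling correction, gives a compact noisy-channel sketch. The language-model and channel probabilities below are made-up numbers for illustration, not estimates from any corpus.

```python
# A minimal noisy-channel sketch for spelling correction:
# p(Y) is a (clean) language model over candidate words, and
# p(X | Y) is a channel/decode model for observing the typo X
# given intended word Y. All numbers are illustrative assumptions.
LM = {"the": 0.07, "then": 0.01, "than": 0.008}   # p(Y)
CHANNEL = {                                        # p(X="teh" | Y)
    "the": 0.05,    # transposing adjacent letters is a common typo
    "then": 0.001,
    "than": 0.001,
}

def correct(observed: str, candidates):
    # Y* = argmax_Y p(X | Y) * p(Y)   (p(X) is constant in Y)
    return max(candidates, key=lambda y: CHANNEL[y] * LM[y])

print(correct("teh", ["the", "then", "than"]))  # the
```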
Discriminatively trained classifier: directly model the posterior (discriminative training, e.g., maxent models: we'll cover these soon).
Generatively trained classifier: model the posterior with Bayes rule (noisy channel model decoding).
Q: What type of classification problem is this?
A: multi-class (single label) classification.
Q: Why is p(Y | X) what we want to model?
Q: To classify a document, do we need to find the normalizing constant?
Q: If we can compute p(Y | X) up to a constant, how do we find the predicted label?
"Won't you please donate?"
Which class should this email get: Primary? Social? Forums?
Training: for each Class, get a bunch of Class documents D_Class and learn a new language model p_Class on just D_Class (e.g., a Primary model from Primary emails, and so on).
Two options for the class-based likelihoods:
(1) A separate model p_Class(…) for each Class, e.g., record separate trigram counts for Primary vs. Social vs. Forums vs. Spam documents.
OR
(2) Joint tables p(Class, …), e.g., record how often each trigram occurs with each class.
Q: Are these two conceptually the same?
Q: How might the option you choose influence implementation (or vice versa)?
Q: Will one approach always be better than the other?
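The two bookkeeping options can be sketched side by side. The two-document corpus below is a toy example invented for illustration.

```python
from collections import Counter, defaultdict

# A minimal sketch of the two options for storing trigram counts.
docs = [
    ("Primary", "won t you please donate".split()),
    ("Spam",    "you have been selected".split()),
]

def trigrams(tokens):
    return zip(tokens, tokens[1:], tokens[2:])

# Option 1: a separate count table per class (one LM per class)
per_class = defaultdict(Counter)
# Option 2: one joint table keyed by (class, trigram)
joint = Counter()

for label, tokens in docs:
    for tri in trigrams(tokens):
        per_class[label][tri] += 1
        joint[(label, tri)] += 1

# The two store the same information, organized differently:
assert per_class["Spam"][("have", "been", "selected")] == \
       joint[("Spam", ("have", "been", "selected"))] == 1
```

This suggests one answer to the questions above: the options are conceptually equivalent, but the per-class layout makes it easy to train classes independently, while the joint table keeps everything in one structure.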
We also need the prior probability of each class, e.g. p(Primary).
Q: What's an easy way to estimate it?
Q: Could we use our smoothing techniques?
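One answer to both questions in a short sketch: relative frequency is the easy estimate, and add-k smoothing (as used for n-gram models) also applies. The label list is toy data for illustration.

```python
from collections import Counter

# A minimal sketch of estimating class priors from labeled emails.
labels = ["Primary", "Primary", "Spam", "Primary", "Social"]
classes = ["Primary", "Social", "Forums", "Spam"]

counts = Counter(labels)
n = len(labels)

# Maximum-likelihood (relative frequency) estimate
p_mle = {c: counts[c] / n for c in classes}

# Add-k smoothed estimate, so unseen classes (e.g., Forums) get mass
k = 0.5
p_addk = {c: (counts[c] + k) / (n + k * len(classes)) for c in classes}

print(p_mle["Primary"])   # 0.6
print(p_addk["Forums"])   # 0.5 / 7.0, about 0.071
```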
All your data is split into Training Data, Dev Data, and Test Data.
Training: learn model parameters from the training set.
Dev: set hyperparameters; evaluate the learned model on dev with each hyperparameter setting.
Test: perform the final evaluation on test, using the hyperparameters that performed best on dev, retraining the model with those hyperparameters.
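The workflow above can be sketched as a loop. The `train` and `evaluate` functions and the hyperparameter grid here are hypothetical placeholders, not a real API; `evaluate` is rigged so accuracy peaks at k = 1.0 purely to make the example runnable.

```python
# A minimal sketch of the train/dev/test workflow with one
# hyperparameter k (e.g., a smoothing weight). All pieces are
# illustrative stand-ins.
def train(train_data, k):
    return {"k": k}                          # "model" remembers its setting

def evaluate(model, data):
    return 1.0 - abs(model["k"] - 1.0)       # pretend accuracy peaks at k=1.0

train_data, dev_data, test_data = ["..."], ["..."], ["..."]

best_k, best_acc = None, float("-inf")
for k in [0.1, 0.5, 1.0, 2.0]:               # candidate hyperparameters
    model = train(train_data, k)             # learn parameters on train
    acc = evaluate(model, dev_data)          # evaluate on dev
    if acc > best_acc:
        best_k, best_acc = k, acc

final_model = train(train_data, best_k)      # retrain with the best setting
print(evaluate(final_model, test_data))      # final evaluation on test
```

The key discipline is that test data is touched exactly once, after all hyperparameter choices have been made on dev.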
[Diagram: across the classes/choices, the set of Correct labels is compared with the set of Guessed labels.]
                          Actually Correct      Actually Incorrect
Selected/guessed          True Positive (TP)    False Positive (FP)
Not selected/not guessed  False Negative (FN)   True Negative (TN)

Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision P = TP / (TP + FP)
Recall R = TP / (TP + FN)
F1 = 2PR / (P + R)   (algebra: not important)
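The formulas above can be computed directly from a contingency table; the counts here are the Class 1 table from the next slide.

```python
# A minimal sketch computing accuracy, precision, recall, and F1
# from contingency-table counts.
tp, fp, fn, tn = 10, 10, 10, 970

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy)   # 0.98
print(precision)  # 0.5
print(recall)     # 0.5
print(f1)         # 0.5
```

Note how accuracy (0.98) looks excellent while precision and recall (0.5) do not: with many true negatives, accuracy is a misleading summary.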
If we have more than one class, how do we combine multiple performance measures into one quantity?
Macroaveraging: compute performance for each class, then average.
Microaveraging: collect decisions for all classes, compute the contingency table, evaluate.
Class 1:              Truth: yes   Truth: no
Classifier: yes           10           10
Classifier: no            10          970

Class 2:              Truth: yes   Truth: no
Classifier: yes           90           10
Classifier: no            10          890

Micro-average table:  Truth: yes   Truth: no
Classifier: yes          100           20
Classifier: no            20         1860
Macroaveraged precision: (0.5 + 0.9)/2 = 0.7. Microaveraged precision: 100/120 ≈ 0.83. The microaveraged score is dominated by performance on the common classes.
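The two averages can be reproduced from the per-class (TP, FP) counts in the tables above:

```python
# A minimal sketch of macro- vs. micro-averaged precision using the
# two class tables above: (tp, fp) per class.
per_class = [(10, 10), (90, 10)]   # Class 1, Class 2

# Macroaverage: compute precision per class, then average
precisions = [tp / (tp + fp) for tp, fp in per_class]
macro = sum(precisions) / len(precisions)

# Microaverage: pool the counts into one table, then compute precision
tp_total = sum(tp for tp, _ in per_class)    # 100
fp_total = sum(fp for _, fp in per_class)    # 20
micro = tp_total / (tp_total + fp_total)

print(macro)            # 0.7
print(round(micro, 2))  # 0.83
```

Because Class 2 contributes far more positive decisions, it dominates the pooled counts; that is exactly why the microaverage (0.83) sits much closer to Class 2's precision (0.9) than the macroaverage (0.7) does.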