Maxent Models (II)
CMSC 473/673 UMBC September 20th, 2017
Announcements: Assignment 1
Due 11:59 PM, Saturday 9/23 (~3.5 days away)
Use the submit utility with:
- class id: cs473_ferraro
- assignment id: a1
We must be able to run it on GL!
Common pitfall #1: forgetting files
Common pitfall #2: incorrect paths to files
Common pitfall #3: 3rd-party libraries
Announcements: Question 6
$$p^{(n)}(x_i \mid x_{i-n+1:i-1}) = \lambda_n f^{(n)}(x_{i-n+1:i}) + (1 - \lambda_n)\, p^{(n-1)}(x_i \mid x_{i-n+2:i-1})$$

$$p^{(n)}(x_i \mid x_{i-n+1:i-1}) = \lambda_{n,n} f^{(n)}(x_{i-n+1:i}) + \lambda_{n,n-1} f^{(n-1)}(x_{i-n+2:i}) + \cdots + \lambda_{n,0} f^{(0)}(\cdot),$$

$$\text{where } \lambda_{n,0} = 1 - \sum_{m=0}^{n-1} \lambda_{n,n-m}.$$
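As a concrete illustration, here is a minimal Python sketch of the first (two-weight) interpolation above; the table names `freq` and `lam` are assumptions, standing in for the relative-frequency estimates f^(n) and the weights λ_n:

```python
# A minimal sketch of the interpolated estimate above (assumption: `freq`
# maps each order n to a dict from n-gram tuples to relative-frequency
# estimates f^(n), and `lam` maps each order n to its weight lambda_n).
def interp_prob(word, history, freq, lam, n):
    if n == 0:
        return freq[0][()]              # order-0 base case, e.g. uniform
    context = tuple(history[-(n - 1):]) if n > 1 else ()
    # weight the order-n estimate against the recursively smoothed backoff
    return (lam[n] * freq[n].get(context + (word,), 0.0)
            + (1 - lam[n]) * interp_prob(word, history, freq, lam, n - 1))
```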
Announcements: Course Project
Official handout will be out Friday 9/22
Until then, focus on assignment 1
Teams of 1-3
Mixed undergrad/grad is encouraged but not required
Some novel aspect is needed
Ex 1: reimplement an existing technique and apply it to a new domain
Ex 2: reimplement an existing technique and apply it to a new (human) language
Ex 3: explore a novel technique on an existing problem
Recap from last time…
Classify or Decode with Bayes Rule
p(Y | X): how well does text Y represent label X?
p(X): how likely is label X overall?
For “simple” or “flat” labels (a sketch follows below):
- iterate through labels
- evaluate the score for each label, keeping only the best (n best)
- return the best (or n best) label and score
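A minimal Python sketch of that recipe, where `likelihood` and `prior` are assumed stand-ins for p(Y | X) and p(X):

```python
import math

# Iterate through labels, score each with Bayes' rule in log space,
# and keep the best label and its score.
def classify(y, labels, likelihood, prior):
    best_label, best_score = None, -math.inf
    for x in labels:
        score = math.log(likelihood(y, x)) + math.log(prior(x))
        if score > best_score:
            best_label, best_score = x, score
    return best_label, best_score
```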
Classification Evaluation: the 2-by-2 contingency table
                          Actually Correct      Actually Incorrect
Selected/guessed          True Positive (TP)    False Positive (FP)
Not selected/not guessed  False Negative (FN)   True Negative (TN)
[Figure: for each class/choice, the overlap between the correct items and the guessed items]
Classification Evaluation: Accuracy, Precision, and Recall
Accuracy: % of items correct
Precision: % of selected items that are correct
Recall: % of correct items that are selected
(see the sketch below)
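These definitions translate directly into code; a small sketch using the 2-by-2 table's four counts:

```python
# Computing the three metrics directly from the contingency-table counts.
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)   # % of items correct

def precision(tp, fp):
    return tp / (tp + fp)                    # % of selected that are correct

def recall(tp, fn):
    return tp / (tp + fn)                    # % of correct that are selected
```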
Language Modeling as Naïve Bayes Classifier
Adopt a naïve bag-of-words representation of Y_i:
- assume position doesn't matter
- assume the feature probabilities are independent given the class X
(sketched below)
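Under those assumptions the class score factors into a prior plus per-word terms; a sketch, where the lookup tables `log_prior` and `log_word_prob` are assumptions:

```python
# Bag-of-words Naive Bayes score for one class x: position is ignored
# and word probabilities multiply (a sum in log space).
def nb_log_score(words, x, log_prior, log_word_prob):
    return log_prior[x] + sum(log_word_prob[x][w] for w in words)
```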
Naïve Bayes Summary

Potential Advantages
- Very fast, low storage requirements
- Robust to irrelevant features
- Very good in domains with many equally important features
- Optimal if the independence assumptions hold
- Dependable baseline for text classification (but often not the best)

Potential Issues
- Model the posterior in one go?
- Are the features really uncorrelated?
- Are plain counts always appropriate?
- Are there “better” (automated, more principled) ways of handling missing/noisy data?
Relevant for classification… and language modeling
Maximum Entropy Models

a more general language model:
$$\arg\max_X p(Y \mid X)\, p(X)$$

classify in one go:
$$\arg\max_X p(X \mid Y)$$
ATTACK
Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.
We need to score the different combinations.
Document Classification
Score and Combine Our Possibilities

score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…
scorek(department, ATTACK)

COMBINE → posterior probability of ATTACK
(are all of these uncorrelated?)
Q: What are the score and combine functions for Naïve Bayes?
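One natural answer: for Naïve Bayes each score is plausibly a (log) probability and combine is a product, i.e., a sum in log space:

$$\mathrm{score}_i(w_i, x) = \log p(w_i \mid x), \qquad \mathrm{combine} = \log p(x) + \sum_i \mathrm{score}_i(w_i, x)$$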
Scoring Our Possibilities

score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…

Learn these scores… but how? What do we learn, and what do we declare?
Maxent Modeling

f(x) = exp(x)
exp gives a positive, unnormalized probability:

score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…
Maxent Modeling

Learn the scores (but we'll declare what combinations should be looked at):

weight1 * applies1(fatally shot, ATTACK)
weight2 * applies2(seriously wounded, ATTACK)
weight3 * applies3(Shining Path, ATTACK)
…
Q: What if none of our features apply?
Guiding Principle for Log-Linear Models
“[The log-linear estimate] is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.”
Edwin T. Jaynes, 1957
exp(θ · f): if no features apply, then f = 0 and exp(θ · 0) = 1
Easier-to-write form

θ1 * f1(fatally shot, ATTACK)
θ2 * f2(seriously wounded, ATTACK)
θ3 * f3(Shining Path, ATTACK)
…

K weights, K features: a K-dimensional weight vector and a K-dimensional feature vector, combined with a dot product (see the sketch below):

$$\theta \cdot f(x, y) = \sum_{k=1}^{K} \theta_k f_k(x, y)$$
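A NumPy sketch of that dot product; the particular weights and feature values are made up for illustration:

```python
import numpy as np

theta = np.array([1.5, -0.3, 0.8])   # K-dimensional weight vector (K = 3)
f_xy = np.array([1.0, 1.0, 0.0])     # K-dimensional feature vector f(x, y)
score = theta @ f_xy                 # theta . f = sum_k theta_k * f_k
print(score)                         # 1.2
```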
Log-Linear Models

The feature function(s) f are also called sufficient statistics or “strength” function(s).
The feature weights θ are also called natural parameters or distribution parameters.
How do we normalize?

weight1 * applies1(fatally shot, X)
weight2 * applies2(seriously wounded, X)
weight3 * applies3(Shining Path, X)
…

(scored for each label X = x)
Normalization for Classification

$$p(x \mid y) \propto \exp(\theta \cdot f(x, y)), \qquad p(x \mid y) = \frac{\exp(\theta \cdot f(x, y))}{\sum_{x'} \exp(\theta \cdot f(x', y))}$$

classify doc y with label x in one go
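A sketch of that normalization over labels; the max-subtraction is a standard numerical-stability trick, added here:

```python
import numpy as np

# Given an array of scores theta . f(x, y), one entry per label x,
# exponentiate and divide by the sum over labels to get p(x | y).
def maxent_probs(scores):
    unnorm = np.exp(scores - np.max(scores))  # stability: shift before exp
    return unnorm / unnorm.sum()

print(maxent_probs(np.array([1.2, 0.3, -0.5])))  # probabilities sum to 1
```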
Normalization for Language Model

general class-based (X) language model of doc y
Can be significantly harder in the general case
Simplifying assumption: maxent n-grams!
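Under the n-gram assumption, each conditional is normalized over the vocabulary alone; for a maxent bigram, for instance (a sketch of the assumption, with V the vocabulary):

$$p(y_j \mid y_{j-1}) = \frac{\exp(\theta \cdot f(y_j, y_{j-1}))}{\sum_{y' \in V} \exp(\theta \cdot f(y', y_{j-1}))}$$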
https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/
https://goo.gl/B23Rxo (Lesson 1)
Connections to Other Techniques

Log-Linear Models are known…
- as statistical regression: (multinomial) logistic regression, softmax regression
- as based in information theory: Maximum Entropy models (MaxEnt)
- as a form of Generalized Linear Models
- viewed as discriminative Naïve Bayes
- as very shallow (sigmoidal) neural nets (to be cool today :)
Learning θ

Objective F(θ): how well the probabilistic model explains the given observations
How will we optimize F(θ)?
Calculus
[Figure: a curve F(θ) plotted against θ, with its maximum at θ*; F'(θ) is the derivative of F with respect to θ]
Example

F(x) = -(x - 2)^2
Differentiate: F'(x) = -2x + 4
Solve F'(x) = 0: x = 2
Common Derivative Rules
[Figure: following the derivative of F(θ) uphill from a start θ0 through iterates θ1, θ2, θ3, with values y0…y3 and gradients g0…g2, toward the maximum θ*]

What if you can't find the roots? Follow the derivative:

Set t = 0
Pick a starting value θt
Until converged:
- get the value yt = F(θt)
- get the gradient gt = F'(θt)
- take a step from θt along gt to get θt+1; set t = t + 1

(a runnable sketch follows below)
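A runnable Python sketch of this loop on the earlier example F(θ) = -(θ - 2)^2, so F'(θ) = -2θ + 4; the step size `rho` and the convergence tolerance are assumptions:

```python
# Follow the derivative uphill until the steps become negligible.
def gradient_ascent(grad, theta, rho=0.1, tol=1e-8):
    while True:
        g = grad(theta)
        if abs(rho * g) < tol:      # converged: the step is tiny
            return theta
        theta += rho * g            # move uphill along the derivative

print(gradient_ascent(lambda th: -2 * th + 4, theta=0.0))  # -> ~2.0
```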
Gradient = multi-variable derivative

For F(θ) with a K-dimensional input θ, the gradient ∇F(θ) takes a K-dimensional input to a K-dimensional output.
Gradient Ascent
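In its general multi-variable form, the update is the standard gradient ascent step (with ρt an assumed per-iteration step size):

$$\theta^{(t+1)} = \theta^{(t)} + \rho_t \, \nabla F(\theta^{(t)})$$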