Maxent and Neural Language Models (part 1)
CMSC 473/673 UMBC
Some slides adapted from 3SLP
Outline
Maximum Entropy models
- Defining the model
- Defining the objective
- Learning: Optimizing the objective
- Math: gradient derivation
Neural (language) models
Terminology
These models go by many names:
- Log-linear models (common NLP term)
- (Multinomial) logistic regression / softmax regression (as statistical regression)
- Maximum Entropy models (MaxEnt) (based in information theory)
- Generalized linear models (a form of statistical regression)
- Discriminative Naïve Bayes
- Very shallow (sigmoidal) neural nets (to be cool today :) )
Maxent Models are Flexible
Maxent models can be used generatively (as featureful n-gram language models) or discriminatively (as classifiers).
Maxent Models as Featureful n-gram Language Models
generatively trained: learn to model (class-specific) language

p(x_i | y, x_{i-n+1:i-1}) = maxent(y, x_{i-n+1:i-1}, x_i)

p(Colorless green ideas sleep furiously | Label) =
  p(Colorless | Label, <BOS>) * … * p(<EOS> | Label, furiously)

Model each n-gram term with a maxent model
Maxent Models for Classification: Discriminatively or Generatively Trained

Discriminatively trained classifier: directly model the posterior
  p(y | x) = maxent(x; y)

Generatively trained classifier with a maxent-based language model: model the posterior with Bayes rule
  p(y | x) ∝ p(x | y) · p(y)
  p(x | y) = maxent(x; y)
Maximum Entropy (Log-linear) Models for Discriminatively Trained Classifiers

discriminatively trained: classify in one go

p(y | x) = maxent(x, y)

(we'll start with this one)
Discriminative ML Classification in 30 Seconds

Pick characteristics of the data x and label y that are meaningful
- Denoted by a general vector of K features f(x, y) = (f_1(x, y), …, f_K(x, y))
- E.g., POSITIVE-sentiment tweets may be more likely to contain the word "happy"

Q: What are the features in a Naïve Bayes classifier?
Core Aspects to Maxent Classifier p(y|x)

We need to define:
- features f(x, y) that are meaningful;
- weights θ saying how important each feature is; and
- a way to combine these into a probability.
Discriminative Document Classification

Label: ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

Extracted clues include words like "shot", highlighted in the document. These extractions are all features that have fired (likely have some significance).
We need to score the different extracted clues.
Score and Combine Our Clues

score1(x, ATTACK), score2(x, ATTACK), score3(x, ATTACK), …, scorek(x, ATTACK)

COMBINE → posterior probability of ATTACK
Scoring Our Clues

Label: ATTACK — score1(x, ATTACK), score2(x, ATTACK), score3(x, ATTACK), …
(ignore the feature indexing for now)

Learn these scores… but how? What do we optimize?
https://www.csee.umbc.edu/courses/undergraduate/473/f19/loglin-tutorial/
Lesson 1
Maxent Modeling

What function… is never less than 0? f(x) = exp(x)

Learn the scores (but we'll declare what combinations should be looked at):
weight1 * applies1(x, ATTACK), weight2 * applies2(x, ATTACK), weight3 * applies3(x, ATTACK), …

K different weights… for K different features… multiplied and then summed:
θ · f(x, ATTACK)
Maxent Modeling

p(ATTACK | x) = (1/Z) · exp(θ · f(x, ATTACK))

Q: How do we define Z? (it must normalize over each label y)
Normalization for Classification

p(y | x) ∝ exp(θ · f(x, y))        classify doc x with label y in one go

Z = Σ_{y'} exp(θ · f(x, y'))

Long form: p(y | x) ∝ exp(weight1 * applies1(x, y) + weight2 * applies2(x, y) + weight3 * applies3(x, y) + …)
Core Aspects to Maxent Classifier p(y|x)

We need features f(x, y) that are meaningful, weights θ saying how important each feature is, and a way to combine them:

p(y | x) = exp(θ · f(x, y)) / Σ_{y'} exp(θ · f(x, y'))
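This formula can be sketched directly in Python. A minimal sketch: the feature template and weights below are hypothetical unigram indicators, not from the slides.

```python
import math

def maxent_prob(theta, feats, x, labels, y):
    """p(y | x) = exp(theta . f(x, y)) / sum over y' of exp(theta . f(x, y'))."""
    def score(label):
        return sum(theta.get(k, 0.0) * v for k, v in feats(x, label).items())
    scores = {label: score(label) for label in labels}
    m = max(scores.values())                 # subtract the max for numerical stability
    z = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[y] - m) / z

# Hypothetical template: one binary indicator per (word, label) pair.
def feats(x, label):
    return {(w, label): 1.0 for w in x.split()}

theta = {("happy", "POSITIVE"): 2.0, ("happy", "NEGATIVE"): -1.0}
p = maxent_prob(theta, feats, "so happy today", ["POSITIVE", "NEGATIVE"], "POSITIVE")
# only the "happy" features have nonzero weight, so p = exp(2) / (exp(2) + exp(-1))
```

Subtracting the maximum score before exponentiating does not change the ratio but prevents overflow.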
Outline
Maximum Entropy models
- Defining the model
- Defining the objective
- Learning: Optimizing the objective
- Math: gradient derivation
Neural (language) models
Features
in conditional models

Defining Appropriate Features in a Maxent Model

Feature functions help extract useful features (characteristics) of the data.
They turn data into numbers.
Features that are not 0 are said to have fired.
Generally templated. Often binary-valued (0 or 1), but can be real-valued.
Templated Features

Define a feature fclue(x, label) for each clue you want to consider.
The feature fclue fires if the clue applies to/can be found in the (x, label) pair.
A clue is often a target phrase (an n-gram) and a label.

Q: For a classifier p(label | x), are clues that depend only on x useful?
p(ATTACK | x) ∝ exp(weight1 * applies1(x, ATTACK) + weight2 * applies2(x, ATTACK) + weight3 * applies3(x, ATTACK) + …)

Maxent Modeling: Templated Binary Feature Functions

applies_{target,type}(x, ATTACK) = 1 if target occurs in x and type == ATTACK; 0 otherwise

(binary-valued)
Example of a Templated Binary Feature Function

applies_{target,type}(x, ATTACK) = 1 if target occurs in x and type == ATTACK; 0 otherwise

applies_{hurt,ATTACK}(x, ATTACK) = 1 if "hurt" occurs in x and ATTACK == ATTACK; 0 otherwise

Q: What does this function check?

Q: If there are V vocab types and L label types:
1. How many features are defined if unigram targets are used? A1: VL
2. How many features are defined if bigram targets are used? A2: V²L
3. How many features are defined if unigram and bigram targets are used? A3: (V + V²)L
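The unigram template above can be written as a small Python function (the function and key names are hypothetical, for illustration):

```python
def unigram_features(x, label):
    """Instantiate applies_{target,type}: one binary feature per (word, label)
    pair; a feature fires (value 1) when the word occurs in x and type == label."""
    return {("unigram", w, label): 1 for w in set(x.split())}

fired = unigram_features("five people were hurt", "ATTACK")
# the (hurt, ATTACK) instantiation of the template fires on this document
```

The template defines V·L features in principle, but in practice only the fired ones need to be stored.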
More on Feature Functions

Binary, templated:
  applies_{target,type}(x, ATTACK) = 1 if target occurs in x and type == ATTACK; 0 otherwise

Non-templated, real-valued:
  applies(x, ATTACK) = log p(x | ATTACK)

Templated, real-valued:
  applies_{target,type}(x, ATTACK) = log p(x | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type)
Understanding Conditioning

p(y | x) ∝ count(x)        Q: Is this a good model?

p(y | x) ∝ exp(θ · f(x))        Q: Is this a good model?
https://www.csee.umbc.edu/courses/undergraduate/473/f19/loglin-tutorial/
Lesson 11
Earlier, I said maxent models can act as featureful n-gram language models of text x:

p(x_i | y, x_{i-n+1:i-1}) = maxent(y, x_{i-n+1:i-1}, x_i)

p(Colorless green ideas sleep furiously | Label) =
  p(Colorless | Label, <BOS>) * … * p(<EOS> | Label, furiously)

Model each n-gram term with a maxent model. Q: What would this look like?
Language Model with Maxent n-grams

p_θ(x | y) = ∏_{i=1}^{N} maxent(y, x_{i-n+1:i-1}, x_i)        (per n-gram, with label y)

           = ∏_{i=1}^{N} exp(θ · f(y, x_{i-n+1:i-1}, x_i)) / Σ_{x'} exp(θ · f(y, x_{i-n+1:i-1}, x'))

           = ∏_{i=1}^{N} exp(θ · f(y, x_{i-n+1:i-1}, x_i)) / Z(y, x_{i-n+1:i-1})

Q: Why is this Z a function of the context?
Outline
Maximum Entropy models
- Defining the model
- Defining the objective
- Learning: Optimizing the objective
- Math: gradient derivation
Neural (language) models
Primary Objective: Likelihood

A probabilistic model should assign high likelihood to the training data it observes.
Objective = Full Likelihood? (in LM)

∏_i p_θ(x_i | h_i) ∝ ∏_i exp(θ · f(x_i, h_i))

(assume h_i has whatever context and n-gram history necessary)

These values can have very small magnitude → underflow. Differentiating this product could be a pain.
Logarithms

- (0, 1] → (-∞, 0]
- Products → sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b)
- Inverse of exp: log(exp(x)) = x
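A quick sketch of why this matters: multiplying many small probabilities underflows a double-precision float, while summing their logs stays finite. The probability values here are toy numbers assumed for illustration.

```python
import math

probs = [1e-250] * 3            # three tiny n-gram probabilities
product = 1.0
for p in probs:
    product *= p                # 1e-750 underflows to exactly 0.0
log_sum = sum(math.log(p) for p in probs)   # products -> sums: stays finite
```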
Log-Likelihood (n-gram LM)

log ∏_i p_θ(x_i | h_i) = Σ_i log p_θ(x_i | h_i)        (products → sums)
                       = Σ_i [θ · f(x_i, h_i) − log Z(h_i)]        (log inverts exp)
                       = F(θ)

Sums are more stable than products over a wide range of (negative) numbers, and differentiating this becomes nicer (even though Z depends on θ).
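F(θ) can be computed directly from the scores, with a log-sum-exp for log Z(h_i). A minimal sketch; the bigram-indicator feature function and toy data are assumptions, not from the slides.

```python
import math

def logsumexp(xs):
    """log(sum of exp(x)) computed stably by factoring out the max."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_likelihood(theta, data, vocab, feats):
    """F(theta) = sum over i of [ theta . f(x_i, h_i) - log Z(h_i) ]."""
    total = 0.0
    for x, h in data:
        def score(w):
            return sum(theta.get(k, 0.0) * v for k, v in feats(w, h).items())
        total += score(x) - logsumexp([score(w) for w in vocab])
    return total

# Hypothetical features: an indicator on the (previous word, current word) pair.
feats = lambda w, h: {(h[-1], w): 1.0}
theta = {("the", "cat"): 1.0}
F = log_likelihood(theta, data=[("cat", ("the",))], vocab=["cat", "dog"], feats=feats)
# F = 1 - log(e^1 + e^0), negative as a log-probability must be
```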
Outline
Maximum Entropy classifiers
- Defining the model
- Defining the objective
- Learning: Optimizing the objective
- Math: gradient derivation
Neural (language) models
How will we optimize F(θ)?

Calculus
(figure: the objective F(θ) plotted against θ, with its derivative F'(θ) and the maximizer θ*)

What if you can't find the roots? Follow the derivative.

Set t = 0
Pick a starting value θt
Until converged:
  compute the derivative gt = F'(θt)
  step uphill: θt+1 = θt + γ·gt

(figure: iterates θ0, θ1, θ2, θ3 with derivatives g0, g1, g2 climbing toward θ*)
Remember: Common Derivative Rules
Gradient = multi-variable derivative: K-dimensional input, K-dimensional output.
Gradient Ascent
(figure: as before, but now θ and the gradients are K-dimensional vectors)

What if you can't find the roots? Follow the gradient.

Set t = 0
Pick a starting value θt
Until converged:
  compute the gradient gt = ∇F(θt)
  step uphill: θt+1 = θt + γ·gt
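The loop above, sketched for a one-dimensional toy objective (the objective, step size, and iteration count are assumptions for illustration):

```python
def gradient_ascent(grad, theta0, lr=0.1, iters=200):
    """Follow the derivative uphill: theta <- theta + lr * F'(theta)."""
    theta = theta0
    for _ in range(iters):
        theta = theta + lr * grad(theta)
    return theta

# Toy objective F(theta) = -(theta - 3)^2, maximized at theta* = 3,
# with derivative F'(theta) = -2 (theta - 3).
theta_star = gradient_ascent(lambda t: -2.0 * (t - 3.0), theta0=0.0)
```

Each step shrinks the distance to θ* by a constant factor here, so the iterates converge geometrically.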
Outline
Maximum Entropy classifiers
- Defining the model
- Defining the objective
- Learning: Optimizing the objective
- Math: gradient derivation
Neural (language) models
Reminder: Expectation of a Random Variable

E[X] = Σ_x x · p(x)

Fair die (number of pieces of candy):
1/6 · 1 + 1/6 · 2 + 1/6 · 3 + 1/6 · 4 + 1/6 · 5 + 1/6 · 6 = 3.5

Expectations Depend on a Probability Distribution

Biased die:
1/2 · 1 + 1/10 · 2 + 1/10 · 3 + 1/10 · 4 + 1/10 · 5 + 1/10 · 6 = 2.5
Log-Likelihood (n-gram LM)

log ∏_i p_θ(x_i | h_i) = Σ_i log p_θ(x_i | h_i) = Σ_i [θ · f(x_i, h_i) − log Z(h_i)] = F(θ)
Log-Likelihood Gradient

Each component k is the difference between:

the total value of feature f_k in the training data
  Σ_i f_k(x_i, h_i)

and

the total value the current model p_θ thinks feature f_k should have
  Σ_i E_{x' ~ p_θ}[f_k(x', h_i)]
https://www.csee.umbc.edu/courses/undergraduate/473/f19/loglin-tutorial/
Lesson 6
Log-Likelihood Gradient Derivation for LM p(x_i | h_i)

∇θ F(θ) = ∇θ Σ_i [θ · f(x_i, h_i) − log Z(h_i)]

where Z(h_i) = Σ_{x'} exp(θ · f(x', h_i))

Use the (calculus) chain rule: ∇θ log Z(h_i) = ∇θ Z(h_i) / Z(h_i)
(Z(h_i) is a scalar; ∇θ Z(h_i) is a vector of functions)

= Σ_i f(x_i, h_i) − Σ_i Σ_{x'} [exp(θ · f(x', h_i)) / Z(h_i)] · f(x', h_i)

= Σ_i f(x_i, h_i) − Σ_i Σ_{x'} p(x' | h_i) · f(x', h_i)

Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?
Gradient Optimization for LM p(x_i | h_i)

F(θ) = Σ_i [θ · f(x_i, h_i) − log Z(h_i)]

∂F/∂θ_k = Σ_i f_k(x_i, h_i) − Σ_i Σ_{x'} f_k(x', h_i) · p(x' | h_i)

Set t = 0; pick a starting value θt; until converged, step along this gradient.
Gradient Optimization for Classifier p(y | x)

F(θ) = θ · f(x, y) − log Z(x)

∂F/∂θ_k = f_k(x, y) − Σ_{y'} f_k(x, y') · p(y' | x)

Set t = 0; pick a starting value θt; until converged, step along this gradient.
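The classifier gradient (observed feature values minus their expectation under p(y' | x)) can be sketched directly. The unigram-indicator feature function here is a hypothetical example, as in the earlier sketches.

```python
import math

def grad_log_likelihood(theta, x, y, labels, feats):
    """dF/dtheta_k = f_k(x, y) - sum over y' of f_k(x, y') p(y' | x)."""
    def score(lab):
        return sum(theta.get(k, 0.0) * v for k, v in feats(x, lab).items())
    scores = {lab: score(lab) for lab in labels}
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    post = {lab: math.exp(scores[lab] - m) / z for lab in labels}

    grad = dict(feats(x, y))                 # observed feature values
    for lab in labels:                       # minus expected feature values
        for k, v in feats(x, lab).items():
            grad[k] = grad.get(k, 0.0) - post[lab] * v
    return grad

feats = lambda x, label: {(w, label): 1.0 for w in x.split()}
g = grad_log_likelihood({}, "happy", "POSITIVE", ["POSITIVE", "NEGATIVE"], feats)
# with theta = 0 the posterior is uniform: each fired feature gets 1 - 1/2 or 0 - 1/2
```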
Preventing Extreme Values

Naïve Bayes: extreme values are 0 probabilities.
  p(v | w) ∝ count(w, v)  →  p(v | w) ∝ count(w, v) + λ        (smoothing)

Log-linear models: extreme values are large θ values.
  F(θ) = Σ_i log p_θ(v_i | w_i)  →  F(θ) = Σ_i log p_θ(v_i | w_i) − R(θ)        (regularization)
(Squared) L2 Regularization
https://www.csee.umbc.edu/courses/undergraduate/473/f19/loglin-tutorial/
Lesson 8
Outline
Maximum Entropy classifiers
- Defining the model
- Defining the objective
- Learning: Optimizing the objective
- Math: gradient derivation
Neural (language) models
Revisiting the SNAP Function

softmax

p(v | w) ∝ exp(θ · f(w, v))

softmax_i(a) = exp(a_i) / Σ_j exp(a_j)
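A numerically stable softmax subtracts max(a) before exponentiating; the shift cancels in the ratio, so the result is unchanged:

```python
import math

def softmax(a):
    """softmax_i(a) = exp(a_i) / sum over j of exp(a_j), computed stably."""
    m = max(a)
    exps = [math.exp(x - m) for x in a]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax([1000.0, 1001.0, 1002.0])   # a naive exp(1000.0) would overflow
```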
N-gram Language Models

predict the next word given some context… (w_{i-3}, w_{i-2}, w_{i-1} → w_i)
compute beliefs about what is likely…

p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) ∝ count(x_{i-3}, x_{i-2}, x_{i-1}, x_i)
Maxent Language Models

predict the next word given some context… compute beliefs about what is likely…

p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) = softmax(θ · f(x_{i-3}, x_{i-2}, x_{i-1}, x_i))
Neural Language Models

predict the next word given some context… compute beliefs about what is likely…

p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) = softmax(θ · f(x_{i-3}, x_{i-2}, x_{i-1}, x_i))
Can we learn the feature function(s)?

p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) = softmax(θ_{x_i} · f(x_{i-3}, x_{i-2}, x_{i-1}))
Can we learn the feature function(s) for just the context? Can we learn word-specific weights (by type)?

Create/use "distributed representations" e_{i-3}, e_{i-2}, e_{i-1}; combine these representations with a matrix-vector product, C = f(…); score the result against word-specific weights θ_{w_i}.

"A Neural Probabilistic Language Model," Bengio et al. (2003)
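A toy forward pass in the spirit of Bengio et al.'s model. The sizes, random initialization, and omission of the direct input-to-output connections are assumptions of this sketch, not details from the paper.

```python
import math
import random

def nplm_forward(context_ids, E, W, theta):
    """Look up embeddings for the context words, combine them with a tanh
    layer c = f(concat), then softmax over word-specific weights theta_w . c."""
    concat = [v for i in context_ids for v in E[i]]            # distributed reps
    c = [math.tanh(sum(w * x for w, x in zip(row, concat))) for row in W]
    scores = [sum(t * h for t, h in zip(theta_w, c)) for theta_w in theta]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]                               # p(x_i | context)

random.seed(0)
V, d, hidden, n = 5, 4, 3, 3     # toy vocab size, embedding dim, hidden dim, context length
E = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(V)]
W = [[random.gauss(0, 0.1) for _ in range(n * d)] for _ in range(hidden)]
theta = [[random.gauss(0, 0.1) for _ in range(hidden)] for _ in range(V)]
probs = nplm_forward([0, 1, 2], E, W, theta)   # a distribution over all V words
```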
Baselines

LM Name               N-gram   Params        Test Ppl.
Interpolation         3
Kneser-Ney backoff    3
Kneser-Ney backoff    5
Class-based backoff   3        500 classes   312
Class-based backoff   5        500 classes   312

NPLM

N-gram   Word Vector Dim.   Hidden Dim.   Mix with non-neural LM   Test Ppl.
5        60                 50            No                       268
5        60                 50            Yes                      257
5        30                 100           No                       276
5        30                 100           Yes                      252

"we were not able to see signs of over-fitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs)" (Sect. 4.2)
Outline
Maximum Entropy classifiers
- Defining the model
- Defining the objective
- Learning: Optimizing the objective
- Math: gradient derivation
Neural (language) models