Naïve Bayes, Maxent and Neural Models
CMSC 473/673 UMBC
Some slides adapted from 3SLP
Outline
Recap: classification (MAP vs. noisy channel) & evaluation; Naïve Bayes (NB) classification (terminology: bag-of-words; the “naïve” assumption; training & performance; NB as a language model); Maximum Entropy classifiers (defining the model; defining the objective; learning: optimizing the objective; math: gradient derivation); Neural (language) models
Probabilistic Classification
Discriminatively trained classifier: directly model the posterior.
Generatively trained classifier: model the posterior with Bayes' rule.
Two framings: posterior classification/decoding (maximum a posteriori), and noisy channel model decoding.
Posterior Decoding: Probabilistic Text Classification
Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …
p(class | data) ∝ p(data | class) × p(class): a class-based likelihood (language model) times the prior probability of the class.
Noisy Channel Model
What I want to tell you: “sports”. What you actually see: “The Os lost again…”. Decode: hypothesized intents (“sad stories”, “sports”). Rerank: reweight according to what’s likely, giving “sports”.
Noisy Channel
Machine translation Speech-to-text Spelling correction Text normalization Part-of-speech tagging Morphological analysis Image captioning …
(noisy) text → possible (clean) text: score each candidate with a translation/decode model times a (clean) language model.
Use Logarithms
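A minimal sketch of noisy channel decoding in log space; the classes, priors, and likelihood values here are invented for illustration, not from the slides:

```python
import math

# Noisy channel decoding: choose the intent that maximizes
# p(intent) * p(observed text | intent), working with log
# probabilities so small products do not underflow.
priors = {"sports": 0.3, "sad stories": 0.1}            # p(intent), toy numbers
likelihood = {                                           # p(text | intent), toy numbers
    ("The Os lost again...", "sports"): 0.01,
    ("The Os lost again...", "sad stories"): 0.002,
}

def decode(text):
    # argmax over intents of log p(intent) + log p(text | intent)
    return max(priors,
               key=lambda c: math.log(priors[c]) + math.log(likelihood[(text, c)]))

print(decode("The Os lost again..."))  # "sports"
```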
Accuracy, Precision, and Recall
Accuracy: % of items correct Precision: % of selected items that are correct Recall: % of correct items that are selected
                        Actually Correct      Actually Incorrect
Selected/guessed        True Positive (TP)    False Positive (FP)
Not selected/guessed    False Negative (FN)   True Negative (TN)
A combined measure: F
Weighted harmonic mean of precision & recall: F_β = (1 + β²)·P·R / (β²·P + R). The balanced F1 measure sets β = 1.
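A small runnable version of these definitions; the confusion counts in the example are hypothetical:

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall, and F_beta from confusion counts."""
    precision = tp / (tp + fp)          # % of selected items that are correct
    recall = tp / (tp + fn)             # % of correct items that are selected
    f_beta = ((1 + beta**2) * precision * recall
              / (beta**2 * precision + recall))
    return precision, recall, f_beta

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f1 = precision_recall_f(tp=8, fp=2, fn=4)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")   # P=0.80 R=0.67 F1=0.73
```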
Outline
Recap: classification (MAP vs. noisy channel) & evaluation; Naïve Bayes (NB) classification (terminology: bag-of-words; the “naïve” assumption; training & performance; NB as a language model); Maximum Entropy classifiers (defining the model; defining the objective; learning: optimizing the objective; math: gradient derivation); Neural (language) models
The Bag of Words Representation
Word counts: seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …
Naïve Bayes Classifier
Start with Bayes' rule: p(label | text) ∝ p(text | label) p(label).
Q: Are we doing discriminative training?
A: No: generative training.
Naïve Bayes Classifier
Adopt the naïve bag-of-words representation: label each word (token) Yi.
Assume position doesn’t matter.
Assume the feature probabilities are independent given the class X.
Multinomial Naïve Bayes: Learning
From the training corpus, extract the Vocabulary.
Calculate the P(cj) terms: for each cj in C, docsj = all docs with class = cj, and P(cj) = |docsj| / (# of docs).
Brill and Banko (2001): with enough data, the classifier may not matter.
Calculate the P(wk | cj) terms: Textj = a single doc containing all of docsj. For each word wk in the Vocabulary, nk = # of occurrences of wk in Textj, and P(wk | cj) = (nk + 1) / (n + |Vocabulary|), with add-1 smoothing and n the total token count of Textj.
Each p(wk | cj) is a class-conditional unigram LM.
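A minimal sketch of this training procedure with add-1 smoothing; the three-document corpus is invented for illustration:

```python
from collections import Counter
import math

# Toy labeled corpus (invented).
docs = [("pos", "I love this fun film"),
        ("pos", "love this film"),
        ("neg", "I hate this film")]

label_counts = Counter(c for c, _ in docs)
priors = {c: n / len(docs) for c, n in label_counts.items()}   # P(c_j)

text_j = {c: [] for c in label_counts}          # Text_j: concat all docs_j
for c, d in docs:
    text_j[c].extend(d.lower().split())

vocab = {w for words in text_j.values() for w in words}
counts = {c: Counter(words) for c, words in text_j.items()}

def log_p_word(w, c):
    # P(w_k | c_j) = (n_k + 1) / (n + |V|), add-1 smoothed
    return math.log((counts[c][w] + 1) / (len(text_j[c]) + len(vocab)))

print(priors["pos"], math.exp(log_p_word("love", "pos")))  # 0.666..., 3/14
```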
Naïve Bayes and Language Modeling
Naïve Bayes classifiers can use any sort of feature. But if, as in the previous slides, we use only word features, and we use all of the words in the text (not a subset), then naïve Bayes has an important similarity to language modeling.
Naïve Bayes as a Language Model (SLP Sec. 13.2.1)
Which class assigns the higher probability to s = “I love this fun film”?
Positive model: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1
Negative model: P(I) = 0.2, P(love) = 0.001, P(this) = 0.01, P(fun) = 0.005, P(film) = 0.1
P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 ≈ 5e-7
P(s | neg) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 ≈ 1e-9
5e-7 ≈ P(s|pos) > P(s|neg) ≈ 1e-9
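The same comparison is easy to reproduce in code, summing log probabilities rather than multiplying raw ones; the numbers are the ones from the slide:

```python
import math

# Per-class unigram probabilities from the slide.
pos = {"i": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"i": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

s = "I love this fun film".lower().split()
log_pos = sum(math.log(pos[w]) for w in s)   # log of ~5e-7
log_neg = sum(math.log(neg[w]) for w in s)   # log of ~1e-9
print(math.exp(log_pos), math.exp(log_neg))  # ~5e-07 > ~1e-09: "positive" wins
```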
Summary: Naïve Bayes is Not So Naïve
Very Fast, low storage requirements Robust to Irrelevant Features Very good in domains with many equally important features Optimal if the independence assumptions hold Dependable baseline for text classification (but often not the best)
But: Naïve Bayes Isn’t Without Issue
Model the posterior in one go? Are the features really uncorrelated? Are plain counts always appropriate? Are there “better” ways of handling missing/noisy data? (automated, more principled)
Outline
Recap: classification (MAP vs. noisy channel) & evaluation; Naïve Bayes (NB) classification (terminology: bag-of-words; the “naïve” assumption; training & performance; NB as a language model); Maximum Entropy classifiers (defining the model; defining the objective; learning: optimizing the objective; math: gradient derivation); Neural (language) models
Connections to Other Techniques
Log-Linear Models
(Multinomial) logistic regression / softmax regression (viewed as statistical regression)
Maximum Entropy models (MaxEnt) (based in information theory)
Generalized Linear Models (a form of statistical regression)
Discriminative Naïve Bayes
Very shallow (sigmoidal) neural nets (to be cool today :) )
Maxent Models for Classification: Discriminatively or Generatively Trained
Discriminatively trained classifier: directly model the posterior.
Generatively trained classifier: model the posterior with Bayes' rule.
Maximum Entropy (Log-linear) Models
Discriminatively trained: classify in one go.
Or generatively trained: learn to model language.
Document Classification
Label: ATTACK
“Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
Certain spans in the document provide evidence for the label: fatally shot, seriously wounded, Shining Path, …
We need to score the different combinations.
Score and Combine Our Possibilities
score1(fatally shot, ATTACK), score2(seriously wounded, ATTACK), score3(Shining Path, ATTACK), …, scorek(department, ATTACK)
COMBINE these scores into the posterior probability of ATTACK. (Are all of these uncorrelated?)
Q: What are the score and combine functions for Naïve Bayes?
Scoring Our Possibilities
score1(fatally shot, ATTACK), score2(seriously wounded, ATTACK), score3(Shining Path, ATTACK), …
Try it: https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ (https://goo.gl/BQCdH9), Lesson 1
Maxent Modeling
What function is never less than 0? f(x) = exp(x).
Learn the scores (but we’ll declare what combinations should be looked at): replace each scorei with weighti × occursi, then exponentiate the sum:
exp( weight1 × occurs1(fatally shot, ATTACK) + weight2 × occurs2(seriously wounded, ATTACK) + weight3 × occurs3(Shining Path, ATTACK) + … )
Maxent Modeling: Feature Functions
Feature functions help extract useful features (characteristics) of the data. They are generally templated, and often binary-valued (0 or 1), but can be real-valued. Examples:
Templated, binary: f = 1 if target == fatally shot and type == ATTACK, else 0
Templated, real-valued: log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type)
Non-templated, real-valued: log p(fatally shot | ATTACK)
Non-templated, count-valued: count(fatally shot, ATTACK)
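A sketch of the templated, binary case. The template instantiates one indicator per (phrase, label) pair; the helper name make_phrase_feature and the substring test are illustrative choices, not slide-specified:

```python
# One indicator feature per (phrase, label) pair, instantiated from a template.
def make_phrase_feature(phrase, label):
    def feature(doc, y):
        # 1 if the target phrase occurs in the doc AND the label matches.
        return 1 if phrase in doc and y == label else 0
    return feature

f1 = make_phrase_feature("fatally shot", "ATTACK")
doc = "Three people have been fatally shot ..."
print(f1(doc, "ATTACK"), f1(doc, "KIDNAP"))  # 1 0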
Maxent Modeling: Normalization for Classification
p(x | y) = (1/Z) exp(θ ⋅ f(x, y)): classify doc y with label x in one go, i.e. exp( weight1 × applies1(fatally shot, ATTACK) + weight2 × applies2(seriously wounded, ATTACK) + weight3 × applies3(Shining Path, ATTACK) + … ) divided by Z.
Q: How do we define Z? A: Sum the same exp score over every possible label x′: Z = Σ_x′ exp(θ ⋅ f(x′, y)).
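A minimal sketch of this classifier, following p(x | y) = exp(θ ⋅ f(x, y)) / Z with binary phrase-occurrence features; the label set, phrases, and weights are invented:

```python
import math

LABELS = ["ATTACK", "KIDNAP"]
PHRASES = ["fatally shot", "seriously wounded", "Shining Path"]
theta = {("fatally shot", "ATTACK"): 1.5,       # invented weights
         ("seriously wounded", "ATTACK"): 1.0,
         ("Shining Path", "ATTACK"): 2.0}

def score(x, y):
    # theta . f(x, y) with binary phrase-occurrence features
    return sum(theta.get((p, x), 0.0) for p in PHRASES if p in y)

def posterior(y):
    # Z sums exp(score) over every label x'
    exp_scores = {x: math.exp(score(x, y)) for x in LABELS}
    Z = sum(exp_scores.values())
    return {x: s / Z for x, s in exp_scores.items()}

doc = "Three people have been fatally shot ... a Shining Path attack ..."
print(posterior(doc))   # ATTACK gets most of the probability mass (~0.97)
```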
Normalization for a Language Model
A general class-based (x) language model of doc y.
Computing Z can be significantly harder in the general case: the normalizer now sums over all possible documents y′.
Simplifying assumption: maxent n-grams! Normalize over one word at a time, given its n-gram context.
Understanding Conditioning
Is this a good language model? (no)
Is this a good posterior classifier? (no)
Try it: https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ (https://goo.gl/BQCdH9), Lesson 11
Outline
Recap: classification (MAP vs. noisy channel) & evaluation; Naïve Bayes (NB) classification (terminology: bag-of-words; the “naïve” assumption; training & performance; NB as a language model); Maximum Entropy classifiers (defining the model; defining the objective; learning: optimizing the objective; math: gradient derivation); Neural (language) models
Objective: learn the parameters of the probabilistic model that best explain the data (given observations).
Objective = Full Likelihood?
These values can have very small magnitude ➔ underflow. Differentiating this product could be a pain.
Logarithms
(0, 1] ➔ (-∞, 0]. Products ➔ sums: log(ab) = log(a) + log(b); log(a/b) = log(a) − log(b). Inverse of exp: log(exp(x)) = x.
Log-Likelihood
Wide range of (negative) numbers, and sums are more stable than products. Differentiating this becomes nicer (even though Z depends on θ). Call the log-likelihood F(θ).
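A quick demonstration of why we work in log space; the probabilities are arbitrary small values chosen to trigger underflow:

```python
import math

# Multiplying many probabilities underflows to 0.0 in floating point,
# while the equivalent sum of logs stays perfectly representable.
probs = [1e-5] * 100        # 1e-5^100 = 1e-500, far below double precision

product = 1.0
for p in probs:
    product *= p
print(product)                                # 0.0 (underflow)

log_total = sum(math.log(p) for p in probs)
print(log_total)                              # -1151.29..., no underflow
```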
Outline
Recap: classification (MAP vs. noisy channel) & evaluation; Naïve Bayes (NB) classification (terminology: bag-of-words; the “naïve” assumption; training & performance; NB as a language model); Maximum Entropy classifiers (defining the model; defining the objective; learning: optimizing the objective; math: gradient derivation); Neural (language) models
How will we optimize F(θ)?
Calculus
[Figure: F(θ) plotted against θ, with its maximum at θ*; F′(θ) is the derivative of F with respect to θ.]
Example
F(x) = −(x − 2)². Differentiate: F′(x) = −2x + 4. Solve F′(x) = 0: x = 2.
Common Derivative Rules
What if you can’t find the roots? Follow the derivative.
[Figure: gradient ascent on F(θ). From a starting point θ0 with value y0 = F(θ0), each derivative g_t = F′(θt) points uphill; following it gives θ1, θ2, θ3 with values y1, y2, y3 climbing toward θ*.]
Set t = 0
Pick a starting value θt
Until converged:
  get the derivative gt = F′(θt)
  take a step θt+1 = θt + η·gt (for some step size η)
  set t = t + 1
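A runnable version of this loop on the earlier example F(x) = −(x − 2)², assuming a fixed step size η = 0.1 (a choice of ours, not the slides'):

```python
# Gradient ascent on F(theta) = -(theta - 2)^2, with F'(theta) = -2*theta + 4.
def F_prime(theta):
    return -2 * theta + 4

theta, eta = 0.0, 0.1              # starting value theta_0, step size eta
for t in range(100):
    g = F_prime(theta)             # g_t = F'(theta_t)
    theta = theta + eta * g        # theta_{t+1} = theta_t + eta * g_t
    if abs(g) < 1e-6:              # converged: derivative is ~0
        break
print(theta)                       # ~2.0, the maximizer theta*
```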
Gradient = multi-variable derivative: for F: ℝ^K → ℝ, the gradient ∇F maps a K-dimensional input to a K-dimensional output.
Gradient Ascent
Outline
Recap: classification (MAP vs. noisy channel) & evaluation; Naïve Bayes (NB) classification (terminology: bag-of-words; the “naïve” assumption; training & performance; NB as a language model); Maximum Entropy classifiers (defining the model; defining the objective; learning: optimizing the objective; math: gradient derivation); Neural (language) models
Expectations
Number of pieces of candy, outcomes 1–6.
Under a uniform distribution: 1/6·1 + 1/6·2 + 1/6·3 + 1/6·4 + 1/6·5 + 1/6·6 = 3.5
Under a skewed distribution (P(1) = 1/2, P(2) = … = P(6) = 1/10): 1/2·1 + 1/10·2 + 1/10·3 + 1/10·4 + 1/10·5 + 1/10·6 = 2.5
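The same two expectations, computed directly from the distributions on the slide:

```python
# Expected value E[X] = sum over outcomes of P(outcome) * outcome.
outcomes = [1, 2, 3, 4, 5, 6]

uniform = {o: 1/6 for o in outcomes}
skewed = {1: 1/2, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/10}

def expectation(p):
    return sum(prob * o for o, prob in p.items())

print(expectation(uniform), expectation(skewed))  # 3.5 2.5
```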
Log-Likelihood
Wide range of (negative) numbers; sums are more stable. Differentiating this becomes nicer (even though Z depends on θ). Call this objective F(θ).
Log-Likelihood Gradient
Each component k is the difference between:
the total value of feature fk in the training data,
and
the total value the current model pθ expects for feature fk (summing fk(x′, yi) over every possible label x′ for each training doc yi).
Try it: https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ (https://goo.gl/BQCdH9), Lesson 6
Log-Likelihood Gradient Derivation
Use the (calculus) chain rule: ∂/∂θ log g(θ) = (1 / g(θ)) ⋅ ∂g/∂θ.
Here the scalar is p(x′ | yi) and the vector is the vector of feature functions f(x′, yi).
Log-Likelihood Gradient Derivation
Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?
Gradient Optimization
Set t = 0. Pick a starting value θt. Until converged, update each component k using
∂F/∂θk = Σ_i fk(xi, yi) − Σ_i Σ_x′ fk(x′, yi) p(x′ | yi)
(observed feature totals minus the model's expected feature totals).
Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?
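A sketch of computing this gradient as observed-minus-expected feature counts. The feature template (one binary feature per word/label pair), label set, and toy data are stand-ins for a real setup:

```python
import math

LABELS = ["ATTACK", "KIDNAP"]

def feats(x, y):
    # one binary feature per (word, label) pair
    return {(w, x): 1.0 for w in y.split()}

def posterior(theta, y):
    # p(x | y) = exp(theta . f(x, y)) / Z
    scores = {x: sum(theta.get(k, 0.0) * v for k, v in feats(x, y).items())
              for x in LABELS}
    Z = sum(math.exp(s) for s in scores.values())
    return {x: math.exp(s) / Z for x, s in scores.items()}

def gradient(theta, data):
    grad = {}
    for x_i, y_i in data:
        for k, v in feats(x_i, y_i).items():       # observed totals
            grad[k] = grad.get(k, 0.0) + v
        p = posterior(theta, y_i)
        for x_prime in LABELS:                     # minus expected totals
            for k, v in feats(x_prime, y_i).items():
                grad[k] = grad.get(k, 0.0) - v * p[x_prime]
    return grad

data = [("ATTACK", "fatally shot"), ("KIDNAP", "taken hostage")]
print(gradient({}, data)[("fatally", "ATTACK")])   # 0.5: observed 1, expected 0.5
```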
Preventing Extreme Values
Naïve Bayes: extreme values are 0 probabilities.
Log-linear models: extreme values are large θ values.
The fix in both cases: regularization.
(Squared) L2 Regularization
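In its standard form (not slide-specific), squared L2 regularization subtracts a penalty from the log-likelihood, with λ controlling its strength; the gradient picks up a corresponding −λθk term:

```latex
F_{\text{reg}}(\theta) = F(\theta) - \frac{\lambda}{2}\,\lVert \theta \rVert_2^2
\qquad
\frac{\partial F_{\text{reg}}}{\partial \theta_k}
  = \frac{\partial F}{\partial \theta_k} - \lambda\,\theta_k
```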
Try it: https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ (https://goo.gl/BQCdH9), Lesson 8
Outline
Recap: classification (MAP vs. noisy channel) & evaluation; Naïve Bayes (NB) classification (terminology: bag-of-words; the “naïve” assumption; training & performance; NB as a language model); Maximum Entropy classifiers (defining the model; defining the objective; learning: optimizing the objective; math: gradient derivation); Neural (language) models
Revisiting the SNAP Function
softmax
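A standard numerically stable softmax: subtracting max(z) before exponentiating leaves the result unchanged (it cancels in the ratio) but avoids overflow:

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]   # shift by max for stability
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1.0, 2.0, 3.0]))    # [0.090..., 0.244..., 0.665...]
print(sum(softmax([1000, 1001])))  # 1.0 even with huge scores
```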
N-gram Language Models
Predict the next word wi given some context (wi-3, wi-2, wi-1), computing beliefs about what is likely:
p(wi | wi-3, wi-2, wi-1) ∝ count(wi-3, wi-2, wi-1, wi)
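A count-based version of this estimate on a toy corpus (invented, and unsmoothed):

```python
from collections import Counter

# p(w_i | three-word context) = count(context, w_i) / count(context).
corpus = "the dog ran home the dog ran away the dog sat down".split()

four_grams = Counter(tuple(corpus[i:i+4]) for i in range(len(corpus) - 3))
contexts = Counter(tuple(corpus[i:i+3]) for i in range(len(corpus) - 3))

def p_next(context, w):
    return four_grams[context + (w,)] / contexts[context]

print(p_next(("the", "dog", "ran"), "home"))  # 0.5: "ran home" once, "ran away" once
```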
Maxent Language Models
Predict the next word wi given some context (wi-3, wi-2, wi-1), computing beliefs about what is likely:
p(wi | wi-3, wi-2, wi-1) ∝ softmax(θ ⋅ f(wi-3, wi-2, wi-1, wi))
Neural Language Models
Predict the next word wi given some context (wi-3, wi-2, wi-1), computing beliefs about what is likely:
p(wi | wi-3, wi-2, wi-1) ∝ softmax(θ ⋅ f(wi-3, wi-2, wi-1, wi))
Can we learn the feature function(s)?
p(wi | wi-3, wi-2, wi-1) ∝ softmax(θwi ⋅ f(wi-3, wi-2, wi-1))
Can we learn the feature function(s) for just the context? Can we learn word-specific weights (by type)?
Create/use “distributed representations” of the context words: ei-3, ei-2, ei-1 (each an embedding ew, looked up via a matrix-vector product). Combine these representations, C = f(ei-3, ei-2, ei-1), and score C against the per-word output weights θwi.
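A toy forward pass in the spirit of this architecture (look up context embeddings, concatenate, apply a tanh hidden layer, softmax over the vocabulary); all sizes and random weights are illustrative, and this omits details of Bengio et al.'s exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, n_ctx = 10, 4, 8, 3          # vocab size, embed dim, hidden dim, context len

E = rng.normal(size=(V, d))           # word embeddings e_w
W = rng.normal(size=(h, n_ctx * d))   # hidden-layer weights
U = rng.normal(size=(V, h))           # per-word output weights theta_w

def next_word_probs(context_ids):
    x = np.concatenate([E[i] for i in context_ids])  # e_{i-3}, e_{i-2}, e_{i-1}
    C = np.tanh(W @ x)                               # combine: C = f(Wx)
    scores = U @ C                                   # theta_w . C for every word w
    scores -= scores.max()                           # stable softmax
    p = np.exp(scores)
    return p / p.sum()

p = next_word_probs([1, 5, 2])
print(p.shape, p.sum())               # (10,) 1.0
```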
“A Neural Probabilistic Language Model,” Bengio et al. (2003)

Baselines:
LM Name               N-gram   Params.       Test Ppl.
Interpolation         3
Kneser-Ney backoff    3
Kneser-Ney backoff    5
Class-based backoff   3        500 classes   312
Class-based backoff   5        500 classes   312

NPLM:
N-gram   Word Vector Dim.   Hidden Dim.   Mix with non-neural LM   Test Ppl.
5        60                 50            No                       268
5        60                 50            Yes                      257
5        30                 100           No                       276
5        30                 100           Yes                      252

“we were not able to see signs of over-fitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs)” (Sect. 4.2)