Maxent Models (III), & Neural Language Models
CMSC 473/673 UMBC September 25th, 2017
Some slides adapted from 3SLP
Recap from last time
Maximum Entropy Models
a more general language model, classifying in one go:
argmax_Y p(X | Y) · p(Y)   vs.   argmax_Y p(Y | X)
Feature function(s) = sufficient statistics = "strength" function(s)
Feature weights = natural parameters = distribution parameters
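As a rough illustration of this correspondence (the features, weights, and labels below are made up for the example, not taken from the slides), a feature function f(x, y) returns a vector of "strength" values for an (input, label) pair, and θ holds one weight per feature:

```python
import numpy as np

def f(x, y):
    """Hypothetical feature function for a tiny two-class sentiment example."""
    words = x.split()
    return np.array([
        1.0 if ("great" in words and y == "pos") else 0.0,   # indicator feature
        1.0 if ("awful" in words and y == "neg") else 0.0,   # indicator feature
        float(len(words)) if y == "pos" else 0.0,            # a real-valued feature
    ])

theta = np.array([1.2, 0.9, -0.05])   # feature weights / natural parameters

# Unnormalized log-score of a (text, label) pair: theta · f(x, y)
print(theta @ f("a great movie", "pos"))   # 1.2*1 + 0.9*0 + (-0.05)*3 = 1.05
```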
F(ΞΈ) ΞΈ Fβ(ΞΈ)
derivative
ΞΈ*
Set t = 0 Pick a starting value ΞΈt Until converged:
ΞΈ0 y0 ΞΈ1 y1 ΞΈ2 y2 y3 ΞΈ3 g0 g1 g2
F(ΞΈ) ΞΈ Fβ(ΞΈ)
derivative
ΞΈ*
Set t = 0
Pick a starting value ΞΈt
Until converged:
ΞΈ0 y0 ΞΈ1 y1 ΞΈ2 y2 y3 ΞΈ3 g0 g1 g2
[Diagram: maxent models can be viewed as statistical regression, as a form of (…), as based in information theory, and as a way to be cool today :)]
https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/
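A minimal sketch of the gradient-ascent loop from the slide above ("set t = 0, pick a starting value, repeat until converged"), assuming we are handed a function that returns the gradient F'(θ); the fixed step size and the convergence test are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def gradient_ascent(grad_F, theta0, step=0.1, tol=1e-6, max_iters=1000):
    """Maximize F(theta) by repeatedly stepping in the direction of F'(theta)."""
    theta = np.asarray(theta0, dtype=float)   # t = 0, starting value theta_t
    for t in range(max_iters):
        g = grad_F(theta)                     # g_t = F'(theta_t)
        theta_next = theta + step * g         # move uphill
        if np.linalg.norm(theta_next - theta) < tol:   # "until converged"
            return theta_next
        theta = theta_next
    return theta

# Toy example: F(theta) = -(theta - 3)^2 has gradient -2 * (theta - 3), optimum theta* = 3.
print(gradient_ascent(lambda th: -2.0 * (th - 3.0), theta0=[0.0]))
```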
Why work with the log of the likelihood?
Products → sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b)
The likelihood's factors can have very small magnitude → underflow; the log maps them to a wide range of (negative) numbers, and sums are more stable
Log is the inverse of exp: log(exp(x)) = x, and p(z | y) ∝ exp(θ · f(y, z))
Differentiating the product could be a pain; differentiating the log becomes nicer (even though Z depends on θ)
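A small numeric illustration of the points above (the scores are made up): computing log p(z | y) = θ · f(y, z) − log Z entirely in log space avoids overflow, by factoring the largest score out of log Z.

```python
import numpy as np

# Unnormalized log-scores theta · f(y, z) for three candidate labels z (made-up values).
scores = np.array([800.0, 799.0, 795.0])   # np.exp(800.0) would overflow to inf

# log Z = log sum_z' exp(score_z'), computed stably by pulling out the max score.
m = scores.max()
log_Z = m + np.log(np.sum(np.exp(scores - m)))

log_probs = scores - log_Z                 # log p(z | y), all safely representable
print(log_probs)
print(np.exp(log_probs).sum())             # the probabilities still sum to 1
```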
Expected value example: number of pieces of candy (outcomes 1 through 6)
Uniform: 1/6 · 1 + 1/6 · 2 + 1/6 · 3 + 1/6 · 4 + 1/6 · 5 + 1/6 · 6 = 3.5
Skewed: 1/2 · 1 + 1/10 · 2 + 1/10 · 3 + 1/10 · 4 + 1/10 · 5 + 1/10 · 6 = 2.5
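The same two expectations, written as a dot product between the outcome values and their probabilities:

```python
import numpy as np

values  = np.array([1, 2, 3, 4, 5, 6])                      # pieces of candy
uniform = np.full(6, 1 / 6)
skewed  = np.array([1/2, 1/10, 1/10, 1/10, 1/10, 1/10])

print(values @ uniform)   # 3.5
print(values @ skewed)    # 2.5
```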
Each component k of the gradient is the difference between:
the total value of feature f_k in the training data
and
the total value the current model p_θ expects feature f_k to have
"moment matching"
https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/
(remember: Z depends on θ)
Use the (calculus) chain rule:
d/dθ log g(h(θ)) = (1 / g(h(θ))) · (dg/dh) · (dh/dθ)
The gradient of the objective F, component by component:
∂F/∂θ_k = Σ_i f_k(y_i, z_i) − Σ_i Σ_{z′} f_k(y_i, z′) · p(z′ | y_i)
(p(z′ | y_i) is a scalar; f(y_i, z′) is a vector of feature functions)
Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?
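A sketch of that gradient for a generic log-linear classifier, under the assumption of dense feature vectors; the helper names (feature_fn, labels, data) are made up for this example. The first term accumulates the observed feature totals from the training pairs (y_i, z_i); the second subtracts the totals the current model p_θ expects.

```python
import numpy as np

def log_linear_probs(theta, y, labels, feature_fn):
    """p(z | y) ∝ exp(theta · f(y, z)), normalized over the candidate labels."""
    scores = np.array([theta @ feature_fn(y, z) for z in labels])
    scores -= scores.max()                       # stability: shift before exponentiating
    probs = np.exp(scores)
    return probs / probs.sum()

def log_likelihood_gradient(theta, data, labels, feature_fn):
    """Component k: sum_i f_k(y_i, z_i) - sum_i sum_z' f_k(y_i, z') p(z' | y_i)."""
    grad = np.zeros_like(theta)
    for y_i, z_i in data:                        # data: list of (input, label) pairs
        grad += feature_fn(y_i, z_i)             # observed feature totals
        probs = log_linear_probs(theta, y_i, labels, feature_fn)
        for p, z_prime in zip(probs, labels):
            grad -= p * feature_fn(y_i, z_prime) # expected feature totals under p_theta
    return grad
```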
Naïve Bayes: extreme values are 0 probabilities
Log-linear models: extreme values are large θ values
regularization
https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/
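One common concrete choice of regularizer, assumed here rather than stated on the slide, is an L2 penalty on θ: it pushes back against the extreme weight values mentioned above and only adds a simple term to the objective and its gradient.

```python
import numpy as np

def l2_regularized_objective(log_likelihood, theta, lam=0.1):
    """F(theta) = log-likelihood - (lam / 2) * ||theta||^2."""
    return log_likelihood - 0.5 * lam * float(np.dot(theta, theta))

def l2_regularized_gradient(grad_log_likelihood, theta, lam=0.1):
    """The penalty subtracts lam * theta from every component of the gradient."""
    return grad_log_likelihood - lam * theta
```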
[Diagram: Naïve Bayes vs. Maxent/Logistic Regression as graphical models over observed features and the label/class]
(in one dimension)
Is this a good language model? (no)
Is this a good posterior classifier? (no)
https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/
[Diagram, revisited: maxent models can be viewed as statistical regression, as a form of (…), as based in information theory, and as a way to be cool today :)]
softmax
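For reference, softmax turns a vector of scores into a probability distribution, softmax(s)_k = exp(s_k) / Σ_j exp(s_j); a numerically stable implementation subtracts the maximum score first (a standard trick, sketched below).

```python
import numpy as np

def softmax(scores):
    """softmax(s)_k = exp(s_k) / sum_j exp(s_j), computed stably."""
    shifted = scores - np.max(scores)   # largest shifted score is 0, so exp never overflows
    exps = np.exp(shifted)
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # a probability distribution over 3 outcomes
```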
predict the next word given some context…  wi-3  wi-2  wi-1  →  wi
compute beliefs about what is likely…
p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) ∝ count(x_{i-3}, x_{i-2}, x_{i-1}, x_i)
p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) ∝ softmax(θ · f(x_{i-3}, x_{i-2}, x_{i-1}, x_i))
can we learn the feature function(s)?
p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) ∝ softmax(θ_{x_i} · f(x_{i-3}, x_{i-2}, x_{i-1}))
can we learn the feature function(s) for just the context? can we learn word-specific weights (by type)?
create/use "distributed representations"…  e_{i-3}  e_{i-2}  e_{i-1}  (and e_w for each word type w)
combine these representations…  C = f(e_{i-3}, e_{i-2}, e_{i-1}), e.g. a matrix-vector product
score each candidate next word with its word-specific weights θ_{w_i}
p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) ∝ softmax(θ_{x_i} · f(x_{i-3}, x_{i-2}, x_{i-1}))
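A minimal sketch of the forward pass the slides build up, in the spirit of a Bengio-style neural probabilistic language model: look up distributed representations for the three context words, combine them (here by concatenation, a matrix-vector product, and a tanh nonlinearity, which are illustrative choices), score every vocabulary word with its word-specific weights θ_w, and normalize with softmax. All sizes, initializations, and names below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, n_context = 10_000, 30, 100, 3          # vocab size, embedding dim, hidden dim, context length

E = rng.normal(size=(V, d)) * 0.01               # distributed representations e_w, one row per word type
W = rng.normal(size=(h, n_context * d)) * 0.01   # combines context embeddings (matrix-vector product)
Theta = rng.normal(size=(V, h)) * 0.01           # word-specific output weights theta_w, one row per word type

def next_word_probs(context_ids):
    """p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) = softmax over theta_w · f(context) for every word w."""
    e = np.concatenate([E[j] for j in context_ids])   # e_{i-3}, e_{i-2}, e_{i-1} stacked
    hidden = np.tanh(W @ e)                           # C = f(e_{i-3}, e_{i-2}, e_{i-1})
    scores = Theta @ hidden                           # theta_w · f(...) for every candidate word
    scores -= scores.max()                            # stable softmax
    probs = np.exp(scores)
    return probs / probs.sum()

probs = next_word_probs([17, 42, 7])             # arbitrary ids for w_{i-3}, w_{i-2}, w_{i-1}
print(probs.shape, probs.sum())                  # (10000,) 1.0
```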
Baselines
LM Name               N-gram  Params.       Test Ppl.
Interpolation         3
Kneser-Ney backoff    3
Kneser-Ney backoff    5
Class-based backoff   3       500 classes   312
Class-based backoff   5       500 classes   312
NPLM
N-gram  Word Vector Dim.  Hidden Dim.  Mix with non-neural LM  Ppl.
5       60                50           No                      268
5       60                50           Yes                     257
5       30                100          No                      276
5       30                100          Yes                     252
"we were not able to see signs of over-fitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs)" (Sect. 4.2)