Language models
Chapter 3 in Martin/Jurafsky
N-gram models using the Markov assumption
P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i \mid w_{i-k} \ldots w_{i-1})
- In other words, we approximate each component in the product as
P(w_i \mid w_1 \ldots w_{i-1}) \approx P(w_i \mid w_{i-k} \ldots w_{i-1})
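To make the approximation concrete, here is a minimal sketch of a bigram model (k = 1): it estimates each P(w_i | w_{i-1}) from raw counts and multiplies the estimates to score a sentence. The toy corpus, padding symbols, and function names are illustrative assumptions, not anything from the slides.

from collections import Counter

def train_bigram(corpus):
    # corpus: a list of tokenized sentences; pad each with boundary markers
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    # maximum-likelihood estimate: P(w | prev) = count(prev, w) / count(prev)
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(p, sent):
    # Markov approximation: P(w_1 ... w_n) ~= prod_i P(w_i | w_{i-1})
    tokens = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for prev, w in zip(tokens[:-1], tokens[1:]):
        prob *= p(prev, w)
    return prob

corpus = [["i", "like", "cheese"], ["i", "like", "pizza"]]
p = train_bigram(corpus)
print(sentence_prob(p, ["i", "like", "pizza"]))  # 0.5 on this toy corpus

An unsmoothed model like this assigns probability zero to any unseen bigram, which is why practical models add smoothing.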
… https://books.google.com/ngrams
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
– Assign higher probability to “real” or “frequently observed” sentences than “ungrammatical” or “rarely observed” sentences?
– A test set is an unseen dataset that is different from our training set, totally unused.
– An evaluation metric tells us how well our model does on the test set.
– How well can we predict the next word?
– Unigrams are terrible at this game. (Why?)
– A better model is one which assigns a higher probability to the word that actually occurs.
I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
….
fried rice 0.0001
….
and 1e-100
The best language model is one that best predicts an unseen test set.
Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
Chain rule:
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}
Minimizing perplexity is the same as maximizing probability.
For bigrams:
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
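As a sketch of how the perplexity above could actually be computed for a bigram model, the snippet below works in log space to avoid numerical underflow; the probability function p(prev, w) is assumed to come from a smoothed model, since a single zero probability makes perplexity infinite.

import math

def perplexity(p, test_sentences):
    # PP(W) = P(w_1 ... w_N) ** (-1/N), accumulated in log space
    log_prob, n = 0.0, 0
    for sent in test_sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(tokens[:-1], tokens[1:]):
            log_prob += math.log2(p(prev, w))  # assumes p(prev, w) > 0
            n += 1
    return 2 ** (-log_prob / n)

# usage with some hypothetical smoothed bigram model `p` and held-out data:
# print(perplexity(p, held_out_sentences))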
Example: what is the perplexity of a sentence consisting of random digits, if the model assigns P = 1/10 to each digit?
PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-1/N} = \left(\tfrac{1}{10}\right)^{-1} = 10
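A throwaway numerical check of the digit example: assign P = 1/10 to each of N random digits, and the perplexity comes out to 10 regardless of N.

import math

N = 50                               # any sentence length gives the same answer
log_prob = N * math.log2(1 / 10)     # log2 P(w_1 ... w_N) with P = 1/10 per digit
pp = 2 ** (-log_prob / N)            # PP(W) = P(w_1 ... w_N) ** (-1/N)
print(pp)                            # 10.0 (up to floating-point rounding)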
Suppose you have to guess an unknown number (say, one of 1024 equally likely possibilities) using the fewest number of questions. Each yes/no question gives you at most one bit of information about the unknown number: you start out needing log2(1024) = 10 bits, and after the first question you only need log2(512) = 9 bits.
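A tiny check of the arithmetic in the guessing game, assuming it starts with 1024 equally likely candidates (implied by log2(512) = 9 after the first question): every yes/no answer that halves the candidate set removes exactly one bit.

import math

candidates = 1024
for question in range(3):
    # bits still needed to pin down the unknown number
    print(question, candidates, math.log2(candidates))
    candidates //= 2                 # each answer halves the remaining candidates
# prints: 0 1024 10.0 / 1 512 9.0 / 2 256 8.0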
If two events x and y are independent, we would like the information they carry to be additive. Let's check:
I(x, y) = -\log P(x, y) = -\log P(x)P(y) = -\log P(x) - \log P(y) = I(x) + I(y)
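A quick numeric check of the additivity claim, with made-up probabilities for two independent events:

import math

p_x, p_y = 0.25, 0.5                        # independent events x and y
i_joint = -math.log2(p_x * p_y)             # I(x, y) = -log2 P(x)P(y)
i_sum = -math.log2(p_x) - math.log2(p_y)    # I(x) + I(y)
print(i_joint, i_sum)                       # 3.0 3.0 -- the information adds up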
The entropy measures the average information content of a random variable X:
H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x)
(The log can, in principle, be computed in any base; base 2 gives the answer in bits.)
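A minimal sketch of the entropy formula, applied to an illustrative coin-flip distribution (the probabilities are made up for the example):

import math

def entropy(dist):
    # H(X) = -sum_x p(x) * log2 p(x); skip zero-probability outcomes
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

print(entropy({"heads": 0.5, "tails": 0.5}))   # 1.0 bit: a fair coin
print(entropy({"heads": 0.9, "tails": 0.1}))   # ~0.47 bits: a biased coin is more predictable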
The entropy of a sequence of words:
H(w_1, w_2, \ldots, w_n) = -\sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)
Dividing by the length n gives the per-word entropy:
\frac{1}{n} H(w_1, w_2, \ldots, w_n) = -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)
For a language L producing arbitrarily long sequences, we define the entropy rate (we could also think of this as the per-word entropy of the language):
H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \ldots, w_n) = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log p(w_1, \ldots, w_n)
Under suitable assumptions we can take a single sequence that is long enough, instead of summing over all possible sequences:
H(L) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_1 w_2 \ldots w_n)
The cross-entropy of a model P on a sequence of words W is then approximated by:
H(W) = -\frac{1}{N} \log P(w_1 w_2 \ldots w_N)
and perplexity is related to cross-entropy by:
PP(W) = 2^{H(W)}
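To tie the last two formulas together, here is a sketch with illustrative per-word probabilities: the cross-entropy of the model on the sequence is the average negative log probability per word, and raising 2 to that value recovers the perplexity.

import math

def cross_entropy(word_log_probs):
    # H(W) = -(1/N) log2 P(w_1 ... w_N), with the log already split per word
    return -sum(word_log_probs) / len(word_log_probs)

probs = [0.25, 0.1, 0.05, 0.2]       # made-up per-word probabilities from some model P
h = cross_entropy([math.log2(p) for p in probs])
print(h)                             # ~2.99 bits per word
print(2 ** h)                        # PP(W) = 2 ** H(W), ~7.95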