Lecture 4: Language Model Evaluation and Advanced Methods
Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net. Course webpage: http://kwchang.net/teaching/NLP16
6501 Natural Language Processing
This lecture:
- Kneser-Ney smoothing
Absolute Discounting:

P_AbsoluteDiscounting(w_i | w_{i-1}) = max(c(w_{i-1}, w_i) - d, 0) / c(w_{i-1}) + λ(w_{i-1}) P(w_i)

The first term is the discounted bigram, P(w_i) is the unigram, and λ(w_{i-1}) is the interpolation weight.
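Absolute discounting with interpolation can be sketched in a few lines of Python (the function names, toy corpus, and d = 0.75 are mine, not the lecture's):

```python
from collections import Counter

def absolute_discount(w_prev, w, bigrams, unigrams, d=0.75):
    """P(w | w_prev) = discounted bigram + lambda(w_prev) * unigram."""
    total = sum(unigrams.values())
    c_prev = unigrams[w_prev]
    if c_prev == 0:                      # unseen history: fall back to the unigram
        return unigrams[w] / total
    discounted = max(bigrams[(w_prev, w)] - d, 0) / c_prev
    # interpolation weight: the mass removed by discounting, one d per seen bigram type
    n_types = sum(1 for (p, _) in bigrams if p == w_prev)
    lam = d * n_types / c_prev
    return discounted + lam * unigrams[w] / total

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
mass = sum(absolute_discount("the", w, bigrams, unigrams) for w in unigrams)
```

Summing P(w | "the") over the vocabulary gives 1, confirming that the discounted mass is exactly redistributed by the interpolation weight.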
Consider "Francisco" vs. "glasses": "Francisco" is frequent, but it almost always follows "San", whereas "glasses" appears after many different words. A unigram backoff overrates "Francisco"; we should instead ask how many contexts a word completes.
P_CONTINUATION(w) ∝ |{w_{i-1} : c(w_{i-1}, w) > 0}|
P_CONTINUATION(w) = |{w_{i-1} : c(w_{i-1}, w) > 0}| / |{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0}|
P_CONTINUATION(w) = |{w_{i-1} : c(w_{i-1}, w) > 0}| / Σ_{w'} |{w'_{i-1} : c(w'_{i-1}, w') > 0}|
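The continuation probability counts how many distinct left contexts a word completes, normalized by the number of bigram types. A sketch (the toy corpus and names are mine), echoing the "Francisco vs. glasses" intuition:

```python
from collections import Counter

def continuation_prob(w, bigrams):
    """P_CONTINUATION(w): distinct bigram types ending in w, over all bigram types."""
    num = len({prev for (prev, nxt) in bigrams if nxt == w})
    return num / len(bigrams)

corpus = "san francisco is big and my glasses are new and her glasses broke".split()
bigrams = Counter(zip(corpus, corpus[1:]))
p_francisco = continuation_prob("francisco", bigrams)  # follows only "san"
p_glasses = continuation_prob("glasses", bigrams)      # follows "my" and "her"
```

Even though "francisco" and "glasses" both occur, "glasses" gets a higher continuation probability because it follows more distinct words.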
P_KN(w_i | w_{i-1}) = max(c(w_{i-1}, w_i) - d, 0) / c(w_{i-1}) + λ(w_{i-1}) P_CONTINUATION(w_i)

where λ(w_{i-1}) = (d / c(w_{i-1})) |{w : c(w_{i-1}, w) > 0}|
In general (recursive formulation):

P_KN(w_i | w_{i-n+1}^{i-1}) = max(c_KN(w_{i-n+1}^{i}) - d, 0) / c_KN(w_{i-n+1}^{i-1}) + λ(w_{i-n+1}^{i-1}) P_KN(w_i | w_{i-n+2}^{i-1})

where c_KN is the ordinary count at the highest order and the continuation count at lower orders.
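Putting the pieces together for the bigram case, interpolated Kneser-Ney can be sketched as below (names and toy corpus are mine; for higher orders the recursion replaces raw counts with continuation counts at lower levels):

```python
from collections import Counter

def kneser_ney(w_prev, w, bigrams, d=0.75):
    """Interpolated Kneser-Ney bigram probability:
    max(c(w_prev, w) - d, 0) / c(w_prev) + lambda(w_prev) * P_CONTINUATION(w)."""
    p_cont = len({p for (p, n) in bigrams if n == w}) / len(bigrams)
    c_prev = sum(c for (p, _), c in bigrams.items() if p == w_prev)
    if c_prev == 0:                      # unseen history: pure continuation prob
        return p_cont
    n_types = sum(1 for (p, _) in bigrams if p == w_prev)
    lam = d * n_types / c_prev           # mass freed by discounting
    return max(bigrams[(w_prev, w)] - d, 0) / c_prev + lam * p_cont

corpus = "a b a b a c".split()
bigrams = Counter(zip(corpus, corpus[1:]))
mass = sum(kneser_ney("a", w, bigrams) for w in "abc")
```

As with absolute discounting, the probabilities over the vocabulary sum to 1.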
https://en.wikipedia.org/wiki/Trie
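The link above points at tries, a natural structure for storing n-gram counts: n-grams with a shared prefix share nodes, and every prefix count comes for free. A minimal sketch (the class and method names are my own):

```python
class CountTrie:
    """Trie over word sequences; each node stores the count of its prefix."""
    def __init__(self):
        self.children = {}
        self.count = 0

    def add(self, ngram):
        node = self
        for word in ngram:
            node = node.children.setdefault(word, CountTrie())
            node.count += 1      # every prefix of the n-gram is counted

    def count_of(self, ngram):
        node = self
        for word in ngram:
            if word not in node.children:
                return 0
            node = node.children[word]
        return node.count

trie = CountTrie()
words = "the cat sat on the mat".split()
for i in range(len(words) - 2):
    trie.add(words[i:i + 3])     # insert every trigram
```

Looking up `count_of(("the", "cat"))` then answers bigram-prefix queries without storing bigrams separately.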
600.465 - Intro to NLP - J. Eisner
There are more principled smoothing methods, too. We'll look next at log-linear models, which are a good and popular general technique.
Score(x, y) = Σ_k θ_k f_k(x, y)

- k ranges over all features.
- f_k(x, y): whether (x, y) has feature k (0 or 1), or how many times it fires (≥ 0), or how strongly it fires (a real number).
- θ_k: the weight of the k-th feature. To be learned ...
Feature functions f_k(x_{i-2}, x_{i-1}, x_i) for Score(x_{i-2}, x_{i-1}, x_i) can be:
- # times "x_{i-2}" appears in the training corpus.
- 1, if "x_i" is an unseen word; 0, otherwise.
- 1, if "x_{i-2} x_{i-1}" = "a red"; 0, otherwise.
- 1, if "x_{i-1}" belongs to the "color" category; 0, otherwise.
For example, feature functions f_k("see", "a", w) for Score("see", "a", w), where w is the predicted word, can be:
- # times "see" appears in the training corpus.
- 1, if "w" is an unseen word; 0, otherwise.
- 1, if "see a" = "a red"; 0, otherwise.
- 1, if "a" belongs to the "color" category; 0, otherwise.
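Features of this kind are just functions of the history and the candidate word, and the score is their weighted sum. A sketch with made-up feature names, weights, and counts (none of these numbers come from the lecture):

```python
COLORS = {"red", "green", "blue", "yellow"}

def features(history, word, corpus_counts):
    """Feature vector as a dict of feature name -> value."""
    return {
        "count_in_corpus": corpus_counts.get(word, 0),   # how many times it fires
        "is_unseen": 1.0 if word not in corpus_counts else 0.0,
        # last history word + candidate form the bigram "a red"
        "bigram_is_a_red": 1.0 if (history[-1], word) == ("a", "red") else 0.0,
        "word_is_color": 1.0 if word in COLORS else 0.0,
    }

def score(history, word, theta, corpus_counts):
    """Score(x, y) = sum_k theta_k * f_k(x, y)."""
    return sum(theta.get(k, 0.0) * v
               for k, v in features(history, word, corpus_counts).items())

theta = {"count_in_corpus": 0.1, "is_unseen": -1.0,
         "bigram_is_a_red": 0.5, "word_is_color": 0.3}
counts = {"see": 4, "a": 5, "red": 2, "glasses": 3}
s = score(("see", "a"), "red", theta, counts)   # 0.1*2 + 0.5 + 0.3 = 1.0
```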
p(y | x) = exp(Score(x, y)) / Z(x)

where we choose Z(x) = Σ_{y'} exp(Score(x, y')) to ensure that the distribution sums to one. exp(Score(x, y)) by itself is only an unnormalized probability (at least it's positive!); thus dividing by Z(x) yields a proper probability.
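The normalization can be sketched directly; the vocabulary and scoring function below are toy stand-ins:

```python
import math

def p_given_x(x, y, vocab, score):
    """p(y | x) = exp(Score(x, y)) / Z(x), with Z(x) = sum over y' of exp(Score(x, y'))."""
    z = sum(math.exp(score(x, y2)) for y2 in vocab)
    return math.exp(score(x, y)) / z

vocab = ["red", "green", "dog"]
score = lambda x, y: 1.0 if y in {"red", "green"} else 0.0   # toy scorer
probs = {y: p_given_x("a", y, vocab, score) for y in vocab}
```

Whatever real values the scores take, exponentiating keeps them positive and dividing by Z(x) makes them sum to one.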
This version is "discriminative training": to learn to predict y from x, maximize p(y|x).

Whereas in "generative models", we learn to model x, too, by maximizing p(x, y).
p(y | x) ∝ e^{Score(x, y)}, i.e., p(y | x) = e^{Score(x, y)} / Z(x)
What is P(shoes | blue)?

The training data contains: "red glasses; yellow glasses; green glasses; blue glasses; red shoes; yellow shoes; green shoes". "blue shoes" never occurs, so an n-gram model gives it a zero count, but features that generalize across colors and nouns let a log-linear model assign it non-zero probability.
Representing input words as vectors:
- Example 1: one-hot vector: each component of the vector represents one word, e.g., [0, 0, 1, 0, 0].
- Example 2: word embeddings.
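Both representations can be sketched with NumPy (the vocabulary and embedding dimension are illustrative):

```python
import numpy as np

vocab = ["a", "red", "glasses", "blue", "shoes"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Example 1: a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# Example 2: a word embedding is a learned dense vector; here randomly initialized.
emb = np.random.randn(len(vocab), 3)          # 3-dimensional embeddings

def embedding(word):
    return emb[word_to_id[word]]              # same as one_hot(word) @ emb
```

Note that looking up a row of the embedding matrix is exactly multiplying the one-hot vector by that matrix.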
A feed-forward neural language model:
- Use learned matrices to project the input vectors.
- Concatenate the projected vectors.
- Apply a non-linear function, e.g., h = tanh(Wx + b).
- Obtain p(y | x) by performing a softmax over the output layer.
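The four steps can be sketched as a trigram model's forward pass; all sizes and the random weights below are placeholders (a real model learns E, W, b, U by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 5, 3, 4                     # vocab size, embedding dim, hidden dim
E = rng.normal(size=(V, d))           # learned projection of input words
W = rng.normal(size=(2 * d, h))
b = rng.normal(size=h)
U = rng.normal(size=(h, V))           # output layer

def forward(w1, w2):
    """p(. | w1, w2): project, concatenate, tanh, softmax."""
    x = np.concatenate([E[w1], E[w2]])    # concatenate projected vectors
    hid = np.tanh(x @ W + b)              # non-linear function
    logits = hid @ U
    e = np.exp(logits - logits.max())     # softmax, numerically stabilized
    return e / e.sum()

p = forward(0, 1)                         # a distribution over the 5 words
```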
6501 Natural Language Processing
28 6501 Natural Language Processing
29 6501 Natural Language Processing
30 6501 Natural Language Processing
31 6501 Natural Language Processing
32
p(h | <S>, <S>) * p(o | <S>, h) * p(r | h, o) * p(s | o, r) ... = 1/8 * 1/8 * 1/8 * 1/16 ...

log2(1/8 * 1/8 * 1/8 * 1/16 ...) = log2 1/8 + log2 1/8 + log2 1/8 + log2 1/16 ... = (-3) + (-3) + (-3) + (-4) + ...
Average? Use the geometric average:

(1/8 * 1/8 * 1/8 * 1/16)^{1/4} = 2^{-13/4} = 1/2^{3.25} ≈ 1/9.5
- Want this to be small (equivalent to wanting good compression!)
- The lower limit is called entropy: obtained in principle as the cross-entropy of the true model, measured on an infinite amount of data.
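The arithmetic above is easy to verify directly; the four probabilities are the slide's example values:

```python
import math

probs = [1/8, 1/8, 1/8, 1/16]
log_probs = [math.log2(p) for p in probs]     # [-3, -3, -3, -4]
avg_log = sum(log_probs) / len(log_probs)     # -3.25
geo_avg = 2 ** avg_log                        # 1 / 2**3.25, about 1/9.5
perplexity = 1 / geo_avg                      # 2**3.25, about 9.51
```

The perplexity is exactly the reciprocal of the geometric-average probability.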
Cross-entropy: H = -(1/N) log2 m(w_1 ... w_N)

Perplexity: 2^H = m(w_1 ... w_N)^{-1/N}