CS498JH: Introduction to NLP (Fall 2012)
http://cs.illinois.edu/class/cs498jh
Julia Hockenmaier
juliahmr@illinois.edu, 3324 Siebel Center, Office Hours: Wednesday 12:15-1:15pm
Lecture 8: Sequence labeling with discriminative models
Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 .
Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB IBM_NNP ‘s_POS board_NN as_IN a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._.
Task: assign POS tags to words
Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 .
[NP Pierre Vinken] , [NP 61 years] old , will join [NP IBM] ‘s [NP board] as [NP a nonexecutive director] [NP Nov. 29] .
Task: identify all non-recursive NP chunks
We define three new tags:
– B-NP: beginning of a noun phrase chunk
– I-NP: inside of a noun phrase chunk
– O: outside of a noun phrase chunk
[NP Pierre Vinken] , [NP 61 years] old , will join [NP IBM] ‘s [NP board] as [NP a nonexecutive director] [NP Nov. 29] .
Pierre_B-NP Vinken_I-NP ,_O 61_B-NP years_I-NP old_O ,_O will_O join_O IBM_B-NP ‘s_O board_B-NP as_O a_B-NP nonexecutive_I-NP director_I-NP Nov._B-NP 29_I-NP ._O
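The mapping from bracketed chunks to B-NP/I-NP/O tags can be made concrete with a short sketch; the helper below and its span-based input format are illustrative, not part of the lecture materials.

```python
def chunks_to_bio(tokens, np_spans):
    """Convert non-recursive NP chunks, given as (start, end) token spans
    (end exclusive), into per-token B-NP / I-NP / O tags."""
    tags = ["O"] * len(tokens)
    for start, end in np_spans:
        tags[start] = "B-NP"
        for i in range(start + 1, end):
            tags[i] = "I-NP"
    return list(zip(tokens, tags))

# [NP Pierre Vinken] , [NP 61 years] old  ->  Pierre_B-NP Vinken_I-NP ,_O 61_B-NP years_I-NP old_O
print(chunks_to_bio(["Pierre", "Vinken", ",", "61", "years", "old"], [(0, 2), (3, 5)]))
```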
Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 .
[NP Pierre Vinken] , [NP 61 years] old , [VP will join] [NP IBM] ‘s [NP board] [PP as] [NP a nonexecutive director] [NP Nov. 29] .
Task: identify all non-recursive NP, verb (“VP”) and preposition (“PP”) chunks
We define several new tags:
– B-NP, B-VP, B-PP: beginning of an NP, “VP”, “PP” chunk
– I-NP, I-VP, I-PP: inside of an NP, “VP”, “PP” chunk
– O: outside of any chunk
[NP Pierre Vinken] , [NP 61 years] old , [VP will join] [NP IBM] ‘s [NP board] [PP as] [NP a nonexecutive director] [NP Nov. 29] .
Pierre_B-NP Vinken_I-NP ,_O 61_B-NP years_I-NP old_O ,_O will_B-VP join_I-VP IBM_B-NP ‘s_O board_B-NP as_B-PP a_B-NP nonexecutive_I-NP director_I-NP Nov._B-NP 29_I-NP ._O
Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 .
[PERS Pierre Vinken] , 61 years old , will join [ORG IBM] ‘s board as a nonexecutive director [DATE Nov. 29] .
Task: identify all mentions of named entities (people, organizations, locations, dates)
We define many new tags:
– B-PERS, B-DATE, …: beginning of a mention of a person/date...
– I-PERS, I-DATE, …: inside of a mention of a person/date...
– O: outside of any mention of a named entity
[PERS Pierre Vinken] , 61 years old , will join [ORG IBM] ‘s board as a nonexecutive director [DATE Nov. 29] .
Pierre_B-PERS Vinken_I-PERS ,_O 61_O years_O old_O ,_O will_O join_O IBM_B-ORG ‘s_O board_O as_O a_O nonexecutive_O director_O Nov._B-DATE 29_I-DATE ._O
Input: a sequence of tokens/words:
Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 .
Output: a sequence of labeled tokens/words:
POS-tagging: Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB IBM_NNP ‘s_POS board_NN as_IN a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._.
Named Entity Recognition: Pierre_B-PERS Vinken_I-PERS ,_O 61_O years_O old_O ,_O will_O join_O IBM_B-ORG ‘s_O board_O as_O a_O nonexecutive_O director_O Nov._B-DATE 29_I-DATE ._O
Graphical models are a notation for probability models.
– Nodes represent distributions over random variables: P(X)
– Arrows represent dependencies: P(Y) P(X | Y), or P(Y) P(Z) P(X | Y, Z)
– Shaded nodes represent observed variables, white nodes represent hidden variables: P(Y) P(X | Y) with Y hidden and X observed
[The accompanying node-and-arrow diagrams are not reproduced here.]
HMMs are generative models of the observed input string w.
They ‘generate’ w with P(t, w) = ∏i P(ti | ti-1) P(wi | ti).
We know w, but need to find t.
[Graphical model: a chain of hidden tags t1 → t2 → t3 → t4, with an arrow from each ti to the observed word wi.]
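To make the factorization concrete, here is a minimal sketch that scores a tag sequence under a bigram HMM; the probability tables are made-up toy values, not estimates from any corpus.

```python
def hmm_joint_prob(tags, words, trans, emit, start="<s>"):
    """P(t, w) = prod_i P(t_i | t_{i-1}) * P(w_i | t_i) for a bigram HMM."""
    p, prev = 1.0, start
    for t, w in zip(tags, words):
        p *= trans.get((prev, t), 0.0) * emit.get((t, w), 0.0)
        prev = t
    return p

# Toy transition and emission tables (made-up numbers):
trans = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.8}
emit = {("DT", "the"): 0.4, ("NN", "board"): 0.001}
print(hmm_joint_prob(["DT", "NN"], ["the", "board"], trans, emit))
```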
Sequence labeling: Given an input sequence w = w1...wn, predict the best (most likely) label sequence t = t1…tn.
Generative models use Bayes Rule:
argmaxt P(t | w) = argmaxt P(t, w) / P(w) = argmaxt P(t, w) = argmaxt P(t) P(w | t)
Discriminative (conditional) models model P(t | w) directly:
argmaxt P(t | w)
We’re usually not really interested in P(w | t).
– w is given. We don’t need to predict it!
Why not model what we’re actually interested in: P(t | w)?
Modeling P(w | t) well is quite difficult:
– Prefixes (capital letters) or suffixes are good predictors for certain classes of t (proper nouns, adverbs, …)
– But these features may not be independent (e.g. they are overlapping)
– These features may also help us deal with unknown words
Modeling P(t | w) should be easier:
– Now we can incorporate arbitrary features of the word, because we don’t need to predict w anymore
MEMMs are conditional models of the labels t given the observed input string w. They model P(t | w) = ∏i P(ti | wi, ti-1).
[NB: We also use dynamic programming for learning and labeling]
[Graphical model: each tag ti depends on the previous tag ti-1 and on the observed word wi.]
Classification: Predict a class (label) c for an input x.
Probabilistic classification:
– Model the probability P(c | x)
  P(c | x) is a probability if 0 < P(ci | x) < 1 and ∑i P(ci | x) = 1
– Predict the class that has the highest probability
Define a set of feature functions fi(x) over the input:
– Binary feature functions:
  ffirst-letter-capitalized(Urbana) = 1
  ffirst-letter-capitalized(computer) = 0
– Integer (or real-valued) feature functions:
  fnumber-of-vowels(Urbana) = 3
Because each class might care only about certain features (e.g. capitalization for proper nouns), redefine the feature functions as fi(x, c) to take the class label into account:
  ffirst-letter-capitalized(Urbana, NNP) = 1
  ffirst-letter-capitalized(Urbana, VB) = 0
=> We turn each feature fi on or off depending on c
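A small sketch of such feature functions in code; the function names mirror the slide’s features, while the class-conditioned variant and its target_class argument are illustrative.

```python
def f_first_letter_capitalized(x):
    """Binary feature: 1 if the word starts with a capital letter."""
    return 1 if x[:1].isupper() else 0

def f_number_of_vowels(x):
    """Integer feature: number of vowels in the word."""
    return sum(1 for ch in x.lower() if ch in "aeiou")

def f_first_letter_capitalized_c(x, c, target_class="NNP"):
    """Class-conditioned version: the feature only fires for one class label."""
    return f_first_letter_capitalized(x) if c == target_class else 0

print(f_first_letter_capitalized("Urbana"))           # 1
print(f_first_letter_capitalized("computer"))         # 0
print(f_number_of_vowels("Urbana"))                   # 3
print(f_first_letter_capitalized_c("Urbana", "NNP"))  # 1
print(f_first_letter_capitalized_c("Urbana", "VB"))   # 0
```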
– We also associate a real-valued weight wi (λi) with each feature fi
– Now we have a score for predicting class c for input x: score(x, c) = ∑i wi fi(x, c)
– This score could be negative, so we exponentiate it: score(x, c) = exp(∑i wi fi(x, c)) = e∑i wi fi(x, c)
– We normalize this score to define a probability:
  P(c | x) = e∑i wi fi(x, c) / ∑c' e∑i wi fi(x, c') = e∑i wi fi(x, c) / Z
– Learning = finding the best weights wi
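A minimal sketch of this normalized score as code; the classes, feature functions, and weights passed in are placeholders.

```python
import math

def maxent_prob(x, classes, features, weights):
    """P(c|x) = exp(sum_i w_i f_i(x,c)) / sum_c' exp(sum_i w_i f_i(x,c'))."""
    scores = {c: math.exp(sum(w * f(x, c) for w, f in zip(weights, features)))
              for c in classes}
    Z = sum(scores.values())  # normalization constant
    return {c: s / Z for c, s in scores.items()}

# One illustrative feature that fires only for class "NNP" on capitalized words:
features = [lambda x, c: 1 if (x[:1].isupper() and c == "NNP") else 0]
print(maxent_prob("Urbana", ["NNP", "VB"], features, weights=[2.0]))
```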
We use conditional maximum likelihood estimation (and standard convex optimization algorithms) to find w.
Conditional MLE: find the w that assigns the highest probability to all observed training items (xi, ci):
ŵ = argmaxw ∏i P(ci | xi, w)
  = argmaxw ∑i log P(ci | xi, w)
  = argmaxw ∑i log [ exp(∑j wj fj(xi, ci)) / ∑c exp(∑j wj fj(xi, c)) ]
We also refer to these models as exponential models because we exponentiate the weights and features (e∑i wi fi(x,c)).
We also refer to them as loglinear models because the log probability is a linear function:
log P(c | x, w) = log [ exp(∑j wj fj(x, c)) / Z ] = ∑j wj fj(x, c) − log(Z)
Statisticians refer to them as multinomial logistic regression models.
MEMMs use a MaxEnt classifier for each P(ti |wi, ti-1):
P(ti | wi, ti-1) = exp(∑j wj fj(wi, ti-1, ti)) / ∑k exp(∑j wj fj(wi, ti-1, tk))
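A sketch of this local classifier, here paired with a simple greedy left-to-right decoder rather than the dynamic-programming (Viterbi) decoding mentioned earlier; the tag set, feature functions, and weights are placeholders.

```python
import math

def memm_local_probs(w_i, t_prev, tagset, features, weights):
    """P(t_i | w_i, t_{i-1}), a MaxEnt classifier over the tag set."""
    scores = {t: math.exp(sum(w * f(w_i, t_prev, t) for w, f in zip(weights, features)))
              for t in tagset}
    Z = sum(scores.values())
    return {t: s / Z for t, s in scores.items()}

def greedy_tag(words, tagset, features, weights, start="<s>"):
    """Label left to right, committing to the locally most likely tag."""
    tags, prev = [], start
    for w_i in words:
        probs = memm_local_probs(w_i, prev, tagset, features, weights)
        prev = max(probs, key=probs.get)
        tags.append(prev)
    return tags
```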
Entropy measures uncertainty; it is highest for uniform distributions:
H(P) = −∑x P(x) log2 P(x)        H(P(Y | x)) = −∑y P(y | x) log2 P(y | x)
We also refer to these models as Maximum Entropy (MaxEnt) models because conditional MLE finds the most uniform distribution (subject to the constraints that the expected counts equal the observed counts in the training data).
The default value for all weights wi is zero.
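A small sketch that computes entropy and illustrates that the uniform distribution scores highest.

```python
import math

def entropy(dist):
    """H(P) = -sum_x P(x) log2 P(x), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: uniform, maximal uncertainty
print(entropy([0.7, 0.1, 0.1, 0.1]))      # about 1.36 bits: less uncertain
```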
Chain CRFs are also conditional models of the labels t given the observed input string w, but instead of one classifier for each P(ti | wi, ti-1) they learn global distributions P(t | w).
[Graphical model: a linear chain over the tags t1…t4 together with the observed words w1…w4.]
Sequence labeling tasks:
– POS tagging
– NP chunking
– Shallow Parsing
– Named Entity Recognition
Discriminative models:
– Maximum Entropy classifiers
– MEMMs
In probabilistic classification, we use P(y | x) to predict a class y for input x.
If we want to do binary classification, i.e. Y = {true, false}, then P(y=true | x) + P(y=false | x) = 1.
We choose y=true if P(y=true | x) > P(y=false | x), i.e. if P(y=true | x) > 1 − P(y=true | x).
I.e. we choose y=true if the odds P(y=true | x) / (1 − P(y=true | x)) > 1, or equivalently if P(y=true | x) > 0.5.
For a probability p, logit(p) is the natural logarithm of the odds:
logit(p) = ln( p / (1 − p) )
Note that −∞ < logit(p) < ∞:
limp→0 logit(p) = −∞        logit(0.5) = 0        limp→1 logit(p) = +∞
Probabilistic classification: predict the probability P(c | x).
– P(c | x) is a probability if 0 < P(ci | x) < 1 and ∑i P(ci | x) = 1
Linear regression: y = wx. Predict a real-valued outcome y for input x using weights w.
– Difficult to force y to be a probability
Logistic regression: logit(P(c | x)) = wx
– Possible since −∞ < logit(P) < ∞
logit(P(y | x)) = wx
ln( P(y | x) / (1 − P(y | x)) ) = wx
P(y | x) / (1 − P(y | x)) = ewx
P(y | x) = ewx (1 − P(y | x))
P(y | x) = ewx − ewx P(y | x)
P(y | x) + ewx P(y | x) = ewx
P(y | x) (1 + ewx) = ewx
P(y | x) = ewx / (1 + ewx) = [ewx / (1 + ewx)] · [e−wx / e−wx] = 1 / (1 + e−wx)
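A short numerical check of this derivation: the logistic (sigmoid) function inverts the logit, so applying logit to P(y | x) recovers wx; the score 0.7 below is an arbitrary example value.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

z = 0.7                 # wx, an arbitrary example score
p = sigmoid(z)          # P(y|x) = e^{wx} / (1 + e^{wx})
print(p, logit(p))      # logit(P(y|x)) recovers wx = 0.7
```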
P(y | x) depends on w; for the complementary class we fix w¬y = 0:
P(¬y | x) = 1 / (1 + ewx) = e0·x / (1 + ewx)
The two probabilities sum to one:
P(y | x) + P(¬y | x) = ewx / (1 + ewx) + 1 / (1 + ewx) = 1
Generalizing from a Bernoulli distribution (Y = {0, 1}) to a categorical distribution (Y = {c1,…,cn}):
P(c1 | x) + ... + P(cn | x) = ew1x / ∑i ewix + ... + ewnx / ∑i ewix = 1
Setting all wi = 0 yields a uniform distribution:
e0·x / ∑i e0·x + ... + e0·x / ∑i e0·x = 1/n + ... + 1/n = 1
Recall: uniform distributions have maximal entropy.
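A sketch of the same point in code: with all weight vectors set to zero, the class probabilities come out uniform; the input vector and the number of classes are arbitrary.

```python
import math

def softmax_probs(x, weight_vectors):
    """P(c_i|x) = exp(w_i . x) / sum_k exp(w_k . x)."""
    scores = [math.exp(sum(w * xj for w, xj in zip(wv, x))) for wv in weight_vectors]
    Z = sum(scores)
    return [s / Z for s in scores]

x = [1.0, 2.0]
zero_weights = [[0.0, 0.0] for _ in range(4)]  # n = 4 classes, all weights zero
print(softmax_probs(x, zero_weights))          # [0.25, 0.25, 0.25, 0.25]
```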
We need to maximize the conditional likelihood:
ŵ = argmaxw ∑i log [ exp(∑j wj fj(xi, yi)) / ∑c exp(∑j wj fj(xi, c)) ] = argmaxw L(w)
We set the first derivative dL/dw to 0:
dL/dw = ∑i f(xi, yi)   [empirical counts]   −   ∑i ∑j f(xi, yj) P(yj | xi)   [expected counts]   = 0
We need to find P(y | x) such that the expected counts equal the empirical (observed) counts.
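A sketch of this gradient for a single feature; the training data, the class set, and cond_prob (the model’s current P(y | x)) are placeholders.

```python
def gradient_one_feature(data, classes, f, cond_prob):
    """dL/dw_f = sum_i f(x_i, y_i)              (empirical count)
               - sum_i sum_y f(x_i, y) P(y|x_i) (expected count under the model)."""
    empirical = sum(f(x, y) for x, y in data)
    expected = sum(cond_prob(y, x) * f(x, y) for x, _ in data for y in classes)
    return empirical - expected
```

At the optimum this difference is zero: the model’s expected feature counts match the observed ones.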
Problem: If there is a feature fi that perfectly predicts some class cj, its weight will go to ∞, and the other weights don’t matter.
Solution: We need to penalize large weights. Instead of the MLE (w* = argmaxw P(y | x, w)), predict w* = argmaxw P(y | x, w) P(w) (the Maximum A Posteriori estimate).
Assume P(w) is a Gaussian (normal) distribution with mean μ = 0 and (fixed) variance σ2:
ŵ = argmaxw ∏i P(yi | xi, w) P(w)
  = argmaxw ∑i log P(yi | xi, w) + log P(w)
  = argmaxw ∑i log P(yi | xi, w)   [as before]   −   ∑j wj2 / (2σj2)   [easy to deal with]
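A sketch of the resulting objective: the Gaussian prior amounts to subtracting an L2 penalty from the conditional log-likelihood; the sigma value is a placeholder, and a single shared sigma is assumed rather than per-weight variances.

```python
def regularized_objective(cond_log_likelihood, weights, sigma=1.0):
    """sum_i log P(y_i|x_i, w)  -  sum_j w_j^2 / (2 * sigma^2)."""
    penalty = sum(w * w for w in weights) / (2 * sigma ** 2)
    return cond_log_likelihood - penalty
```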