Lecture 8: Sequence labeling with discriminative models


SLIDE 1

CS498JH: Introduction to NLP (Fall 2012)

http://cs.illinois.edu/class/cs498jh

Julia Hockenmaier

juliahmr@illinois.edu 3324 Siebel Center Office Hours: Wednesday, 12:15-1:15pm

Lecture 8: Sequence labeling with discriminative models

SLIDE 2

Sequence labeling

SLIDE 3

POS tagging

Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 .

Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB IBM_NNP ‘s_POS board_NN as_IN a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._.

Task: assign POS tags to words

SLIDE 4

Noun phrase (NP) chunking

Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 .

[NP Pierre Vinken] , [NP 61 years] old , will join [NP IBM] ‘s [NP board] as [NP a nonexecutive director] [NP Nov. 29] .

Task: identify all non-recursive NP chunks

SLIDE 5

The BIO encoding

We define three new tags:
– B-NP: beginning of a noun phrase chunk
– I-NP: inside of a noun phrase chunk
– O: outside of a noun phrase chunk


[NP Pierre Vinken] , [NP 61 years] old , will join [NP IBM] ‘s [NP board] as [NP a nonexecutive director] [NP Nov. 29] .

Pierre_B-NP Vinken_I-NP ,_O 61_B-NP years_I-NP old_O ,_O will_O join_O IBM_B-NP ‘s_O board_B-NP as_O a_B-NP nonexecutive_I-NP director_I-NP Nov._B-NP 29_I-NP ._O
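To make the encoding concrete, here is a minimal Python sketch (not from the lecture; the token list and (start, end) chunk representation are assumptions made for illustration) that converts NP chunk spans into BIO tags:

```python
# Minimal sketch: convert NP chunk spans into BIO tags.
# The list-of-(start, end) chunk representation is an illustrative assumption.

def chunks_to_bio(tokens, chunks):
    """tokens: list of words; chunks: list of (start, end) index pairs, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in chunks:
        tags[start] = "B-NP"
        for i in range(start + 1, end):
            tags[i] = "I-NP"
    return list(zip(tokens, tags))

tokens = ["Pierre", "Vinken", ",", "61", "years", "old"]
chunks = [(0, 2), (3, 5)]          # [NP Pierre Vinken], [NP 61 years]
print(chunks_to_bio(tokens, chunks))
# [('Pierre', 'B-NP'), ('Vinken', 'I-NP'), (',', 'O'), ('61', 'B-NP'), ('years', 'I-NP'), ('old', 'O')]
```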

SLIDE 6

Shallow parsing

Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 .

[NP Pierre Vinken] , [NP 61 years] old , [VP will join] [NP IBM] ‘s [NP board] [PP as] [NP a nonexecutive director] [NP Nov. 29] .

Task: identify all non-recursive NP, verb (“VP”) and preposition (“PP”) chunks

SLIDE 7

The BIO encoding for shallow parsing

We define several new tags:
– B-NP, B-VP, B-PP: beginning of an NP, “VP”, “PP” chunk
– I-NP, I-VP, I-PP: inside of an NP, “VP”, “PP” chunk
– O: outside of any chunk


Pierre_B-NP Vinken_I-NP ,_O 61_B-NP years_I-NP old_O ,_O will_B-VP join_I-VP IBM_B-NP ‘s_O board_B-NP as_B-PP a_B-NP nonexecutive_I-NP director_I-NP Nov._B-NP 29_I-NP ._O

[NP Pierre Vinken] , [NP 61 years] old , [VP will join] [NP IBM] ‘s [NP board] [PP as] [NP a nonexecutive director] [NP Nov. 29] .

SLIDE 8

Named Entity Recognition

Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 .

[PERS Pierre Vinken] , 61 years old , will join [ORG IBM] ‘s board as a nonexecutive director [DATE Nov. 29] .

Task: identify all mentions of named entities (people, organizations, locations, dates)

SLIDE 9

The BIO encoding for NER

We define many new tags:
– B-PERS, B-DATE, …: beginning of a mention of a person/date...
– I-PERS, I-DATE, …: inside of a mention of a person/date...
– O: outside of any mention of a named entity


[PERS Pierre Vinken] , 61 years old , will join [ORG IBM] ‘s board as a nonexecutive director [DATE Nov. 29] .

Pierre_B-PERS Vinken_I-PERS ,_O 61_O years_O old_O ,_O will_O join_O IBM_B-ORG ‘s_O board_O as_O a_O nonexecutive_O director_O Nov._B-DATE 29_I-DATE ._O

SLIDE 10

Many NLP tasks are sequence labeling tasks

Input: a sequence of tokens/words:

Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 .

Output: a sequence of labeled tokens/words:

POS-tagging:
Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB IBM_NNP ‘s_POS board_NN as_IN a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._.

Named Entity Recognition:
Pierre_B-PERS Vinken_I-PERS ,_O 61_O years_O old_O ,_O will_O join_O IBM_B-ORG ‘s_O board_O as_O a_O nonexecutive_O director_O Nov._B-DATE 29_I-DATE ._O


SLIDE 11

Graphical models for sequence labeling

SLIDE 12

Graphical models

Graphical models are a notation for probability models.
– Nodes represent distributions over random variables: a single node X stands for P(X).
– Arrows represent dependencies: a node X with an incoming arrow from Y stands for P(Y) P(X | Y); with incoming arrows from Y and Z it stands for P(Y) P(Z) P(X | Y, Z).
– Shaded nodes represent observed variables; white nodes represent hidden variables: a shaded X with an arrow from a white Y stands for P(Y) P(X | Y) with Y hidden and X observed.

SLIDE 13

HMMs as graphical models

HMMs are generative models of the observed input string w: they ‘generate’ w together with the tags t, with P(t, w) = ∏_i P(t_i | t_{i-1}) P(w_i | t_i). We know w, but need to find t.

[Diagram: a chain t1 → t2 → t3 → t4, with each t_i emitting w_i]
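To illustrate this factorization, here is a minimal sketch (not from the slides; the tiny transition and emission tables are made-up numbers) of computing the joint probability of a tag sequence and a word sequence under an HMM:

```python
# Sketch: joint probability P(t, w) = prod_i P(t_i | t_{i-1}) P(w_i | t_i) under an HMM.
# The tiny transition/emission tables below are illustrative assumptions, not real estimates.

transition = {("<s>", "NNP"): 0.4, ("NNP", "VBZ"): 0.3}        # P(t_i | t_{i-1})
emission = {("NNP", "Vinken"): 0.001, ("VBZ", "joins"): 0.01}  # P(w_i | t_i)

def hmm_joint_prob(tags, words):
    prob, prev = 1.0, "<s>"
    for t, w in zip(tags, words):
        prob *= transition.get((prev, t), 0.0) * emission.get((t, w), 0.0)
        prev = t
    return prob

print(hmm_joint_prob(["NNP", "VBZ"], ["Vinken", "joins"]))  # 0.4*0.001 * 0.3*0.01 = 1.2e-06
```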

SLIDE 14

Models for sequence labeling

Sequence labeling: Given an input sequence w = w1...wn, predict the best (most likely) label sequence t = t1…tn.

Generative models use Bayes’ Rule; discriminative (conditional) models model P(t | w) directly:


Generative (via Bayes’ Rule):
  argmax_t P(t | w) = argmax_t P(t, w) / P(w) = argmax_t P(t, w) = argmax_t P(t) P(w | t)

Discriminative (conditional):
  argmax_t P(t | w), modeled directly

SLIDE 15

Advantages of discriminative models

We’re usually not really interested in P(w | t):
– w is given. We don’t need to predict it!
Why not model what we’re actually interested in, P(t | w)?

Modeling P(w | t) well is quite difficult:
– Prefixes (capital letters) or suffixes are good predictors for certain classes of t (proper nouns, adverbs, …)
– But these features may not be independent (e.g. they overlap)
– These features may also help us deal with unknown words

Modeling P(t | w) should be easier:
– Now we can incorporate arbitrary features of the word, because we don’t need to predict w anymore.


SLIDE 16

Maximum Entropy Markov Models

MEMMs are conditional models of the labels t given the observed input string w. They model P(t | w) = ∏_i P(t_i | w_i, t_{i-1}).

[NB: We also use dynamic programming for learning and labeling]

[Diagram: nodes t1 t2 t3 t4 and w1 w2 w3 w4; each t_i depends on w_i and on t_{i-1}]

SLIDE 17

Probabilistic classification

Classification: Predict a class (label) c for an input x.

Probabilistic classification:
– Model the probability P(c | x). P(c | x) is a probability if 0 ≤ P(c_i | x) ≤ 1 and ∑_i P(c_i | x) = 1.
– Predict the class that has the highest probability.


SLIDE 18

Representing features

Define a set of feature functions f_i(x) over the input:
– Binary feature functions:
  f_first-letter-capitalized(Urbana) = 1
  f_first-letter-capitalized(computer) = 0
– Integer (or real-valued) feature functions:
  f_number-of-vowels(Urbana) = 3

Because each class might care only about certain features (e.g. capitalization for proper nouns), we redefine the feature functions f_i(x, c) to take the class label into account:
  f_first-letter-capitalized(Urbana, NNP) = 1
  f_first-letter-capitalized(Urbana, VB) = 0
⇒ We turn each feature f_i on or off depending on c.
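A minimal sketch of such feature functions in code (the feature names and the (x, c) interface are illustrative assumptions, not the lecture's notation):

```python
# Sketch: binary and class-conditioned feature functions for a MaxEnt classifier.

def f_first_letter_capitalized(x):
    return 1 if x[0].isupper() else 0

def f_number_of_vowels(x):
    return sum(1 for ch in x.lower() if ch in "aeiou")

# Class-conditioned version: the feature only fires for a particular label c.
def f_first_letter_capitalized_c(x, c):
    return 1 if c == "NNP" and x[0].isupper() else 0

print(f_first_letter_capitalized("Urbana"))           # 1
print(f_first_letter_capitalized("computer"))         # 0
print(f_number_of_vowels("Urbana"))                   # 3
print(f_first_letter_capitalized_c("Urbana", "NNP"))  # 1
print(f_first_letter_capitalized_c("Urbana", "VB"))   # 0
```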


SLIDE 19

From features to probabilities

– We also associate a real-valued weight w_i (λ_i) with each feature f_i.
– Now we have a score for predicting class c for input x: score(x, c) = ∑_i w_i f_i(x, c)
– This score could be negative, so we exponentiate it: score(x, c) = exp(∑_i w_i f_i(x, c)) = e^{∑_i w_i f_i(x, c)}
– We normalize this score to define a probability:

  P(c | x) = e^{∑_i w_i f_i(x, c)} / ∑_{c′} e^{∑_i w_i f_i(x, c′)} = e^{∑_i w_i f_i(x, c)} / Z

– Learning = finding the best weights w_i.
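A small numeric sketch of this normalization (the feature template and weights below are made up for illustration):

```python
# Sketch: P(c | x) = exp(sum_i w_i f_i(x, c)) / Z, where Z sums over all classes.
import math

def maxent_probs(weights, features, classes, x):
    """features(x, c) returns a dict {feature_name: value}; weights maps names to floats."""
    scores = {c: sum(weights.get(name, 0.0) * val
                     for name, val in features(x, c).items())
              for c in classes}
    z = sum(math.exp(s) for s in scores.values())        # normalization constant Z
    return {c: math.exp(s) / z for c, s in scores.items()}

def features(x, c):
    return {f"cap&{c}": 1.0 if x[0].isupper() else 0.0}

weights = {"cap&NNP": 2.0, "cap&VB": -1.0}
print(maxent_probs(weights, features, ["NNP", "VB"], "Urbana"))
# NNP gets probability e^2 / (e^2 + e^-1) ≈ 0.95
```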

SLIDE 20

Learning: finding w

We use conditional maximum likelihood estimation (and standard convex optimization algorithms) to find w. Conditional MLE: find the w that assigns the highest probability to all observed outputs c_i given the inputs x_i:

ŵ = argmax_w ∏_i P(c_i | x_i, w)
  = argmax_w ∑_i log P(c_i | x_i, w)
  = argmax_w ∑_i log [ e^{∑_j w_j f_j(x_i, c_i)} / ∑_c e^{∑_j w_j f_j(x_i, c)} ]
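A sketch of this conditional log-likelihood objective on toy data (the feature template, weights, and data are made up; a real system would maximize this with a convex optimizer):

```python
# Sketch: conditional log-likelihood sum_i log P(c_i | x_i, w) for a toy MaxEnt model.
import math

def log_likelihood(weights, data, classes, features):
    total = 0.0
    for x, c in data:
        scores = {k: sum(weights.get(n, 0.0) * v for n, v in features(x, k).items())
                  for k in classes}
        log_z = math.log(sum(math.exp(s) for s in scores.values()))
        total += scores[c] - log_z                  # log P(c | x, w)
    return total

def features(x, c):
    return {f"cap&{c}": 1.0 if x[0].isupper() else 0.0}

data = [("Urbana", "NNP"), ("computer", "NN")]
print(log_likelihood({"cap&NNP": 1.5}, data, ["NNP", "NN"], features))
```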


SLIDE 21

Some terminology

We also refer to these models as exponential models, because we exponentiate the weights and features (e^{∑ w f(x,c)}). We also refer to them as loglinear models, because the log probability is a linear function of the features. Statisticians refer to them as multinomial logistic regression models.

log P(c | x, w) = log [ e^{∑_j w_j f_j(x, c)} / Z ] = ∑_j w_j f_j(x, c) − log(Z)


SLIDE 22

Maximum Entropy Markov Models

MEMMs use a MaxEnt classifier for each P(t_i | w_i, t_{i-1}):

[Diagram: t_{i-1} and w_i both feed into t_i]

P(t_i | w_i, t_{i-1}) = e^{∑_j w_j f_j(w_i, t_{i-1}, t_i)} / Z
                      = e^{∑_j w_j f_j(w_i, t_{i-1}, t_i)} / ∑_{t_k} e^{∑_j w_j f_j(w_i, t_{i-1}, t_k)}
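A sketch of how an MEMM chains these local classifiers together, shown with greedy left-to-right decoding for brevity (the slide's note about dynamic programming corresponds to running Viterbi over the same local distributions); the tag set, feature templates, and weights are illustrative assumptions:

```python
# Sketch: greedy left-to-right MEMM decoding using a local MaxEnt classifier
# P(t_i | w_i, t_{i-1}). (Viterbi over the same local distributions gives the exact argmax.)
import math

TAGS = ["NNP", "NN", "O"]

def features(w, t_prev, t):
    return {f"word={w}&tag={t}": 1.0, f"prev={t_prev}&tag={t}": 1.0,
            f"cap&tag={t}": 1.0 if w[0].isupper() else 0.0}

def local_prob(weights, w, t_prev):
    scores = {t: sum(weights.get(n, 0.0) * v for n, v in features(w, t_prev, t).items())
              for t in TAGS}
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(s) / z for t, s in scores.items()}

def greedy_decode(weights, words):
    tags, t_prev = [], "<s>"
    for w in words:
        probs = local_prob(weights, w, t_prev)
        t_prev = max(probs, key=probs.get)      # pick the locally most probable tag
        tags.append(t_prev)
    return tags

weights = {"cap&tag=NNP": 2.0, "word=board&tag=NN": 2.0}
print(greedy_decode(weights, ["Pierre", "Vinken", "board"]))  # ['NNP', 'NNP', 'NN']
```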

SLIDE 23

Terminology II: Maximum Entropy

Entropy measures uncertainty; it is highest for uniform distributions. We also refer to these models as Maximum Entropy (MaxEnt) models, because conditional MLE finds the most uniform distribution (subject to the constraints that the expected feature counts equal the observed counts in the training data). The default value for all weights w_i is zero.

H(P) = −∑_x P(x) log₂ P(x)          H(P(y | x)) = −∑_y P(y | x) log₂ P(y | x)
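A small numeric check (not from the slides) that uniform distributions have the highest entropy:

```python
# Sketch: entropy H(P) = -sum_x P(x) log2 P(x); uniform distributions maximize it.
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 (uniform over 4 outcomes: maximal)
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ≈ 1.36 (less uniform, lower entropy)
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 (no uncertainty)
```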


SLIDE 24

Chain Conditional Random Fields

Chain CRFs are also conditional models of the labels t given the observed input string w, but instead of one classifier for each P(t_i | w_i, t_{i-1}), they learn a global distribution P(t | w).

[Diagram: a chain over t1 t2 t3 t4, with each t_i connected to its word w_i]

SLIDE 25

Today’s key concepts

Sequence labeling tasks:

– POS tagging
– NP chunking
– Shallow parsing
– Named Entity Recognition

Discriminative models:

– Maximum Entropy classifiers
– MEMMs


SLIDE 26

Supplementary material: Why Maximum Entropy?


SLIDE 27

Probabilistic classification

In probabilistic classification, we use P(y | x) to predict a class y for input x. If we want to do binary classification, i.e. Y = {true, false}, then P(y=true | x) + P(y=false | x) = 1.

We choose y=true if P(y=true | x) > P(y=false | x), i.e. if P(y=true | x) > 1 − P(y=true | x).

Equivalently, we choose y=true if the odds ratio of P(y=true | x) is greater than one:

  P(y=true | x) / (1 − P(y=true | x)) > 1,   i.e. P(y=true | x) > 0.5

SLIDE 28

The logit function

For a probability p, logit(p) is the natural logarithm of the odds ratio of p:

  logit(p) = ln( p / (1 − p) )

Note that −∞ < logit(p) < ∞:

  lim_{p→0} logit(p) = −∞        logit(0.5) = 0        lim_{p→1} logit(p) = +∞

SLIDE 29

Predicting probabilities with logistic regression

Probabilistic classification: predict the probability P(c | x). P(c | x) is a probability if 0 ≤ P(c_i | x) ≤ 1 and ∑_i P(c_i | x) = 1.

Linear regression: y = wx. Predict a real-valued outcome y for input x using weights w. It is difficult to force y to be a probability.

Logistic regression: logit(P(c | x)) = wx. This is possible since −∞ < logit(P) < ∞.

SLIDE 30

  logit(P(y | x)) = wx
  ln( P(y | x) / (1 − P(y | x)) ) = wx
  P(y | x) / (1 − P(y | x)) = e^{wx}
  P(y | x) = e^{wx} (1 − P(y | x))
  P(y | x) = e^{wx} − e^{wx} P(y | x)
  P(y | x) + e^{wx} P(y | x) = e^{wx}
  P(y | x) (1 + e^{wx}) = e^{wx}

  P(y | x) = e^{wx} / (1 + e^{wx}) = 1 / (1 + e^{−wx})
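A tiny numeric check of the resulting sigmoid (a sketch; the score wx is a made-up value), showing that it inverts the logit:

```python
# Sketch: the logistic (sigmoid) function P(y | x) = 1 / (1 + e^{-wx}) inverts the logit.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

wx = 1.2                       # some made-up score w·x
p = sigmoid(wx)
print(p)                       # ≈ 0.769
print(logit(p))                # ≈ 1.2 — recovers wx, since logit and sigmoid are inverses
```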

SLIDE 31

What about P(¬y | x)?

P(y | x) depends on w; the weight for the negative class is simply zero (w_{¬y} = 0):

  P(¬y | x) = 1 / (1 + e^{wx}) = e^0 / (1 + e^{wx}) = e^{0·x} / (1 + e^{wx})

The two probabilities sum to one:

  P(y | x) + P(¬y | x) = e^{wx} / (1 + e^{wx}) + 1 / (1 + e^{wx}) = 1

SLIDE 32

From binary to multiclass classification

Generalizing from a Bernoulli distribution (Y = {0, 1}) to a categorical distribution (Y = {c1, …, cn}):

  P(c1 | x) + … + P(cn | x) = e^{w1·x} / ∑_i e^{wi·x} + … + e^{wn·x} / ∑_i e^{wi·x} = 1

Setting all w_i = 0 yields a uniform distribution:

  e^{0·x} / ∑_i e^{0·x} + … + e^{0·x} / ∑_i e^{0·x} = 1/n + … + 1/n = 1

Recall: uniform distributions have maximal entropy.

SLIDE 33

Supplementary material: Learning the weights


SLIDE 34

Analytically...

We need to maximize the conditional likelihood:

  ŵ = argmax_w ∑_i log [ e^{∑_j w_j f_j(x_i, c_i)} / ∑_c e^{∑_j w_j f_j(x_i, c)} ] = argmax_w L(w)

We need to set the first derivative dL/dw to 0:

  dL/dw = ∑_i f(x_i, y_i)  −  ∑_i ∑_j f(x_i, y_j) P(y_j | x_i) = 0
          [empirical counts]   [expected counts]

That is, we need to find P(y | x) such that the expected counts equal the empirical (observed) counts.
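A sketch of this gradient (per feature: empirical count minus expected count) for a toy MaxEnt model; the data and feature template are made up, and a real implementation would hand this gradient to a convex optimizer such as L-BFGS:

```python
# Sketch: gradient of the conditional log-likelihood for a MaxEnt model,
# dL/dw_j = (empirical count of f_j) - (expected count of f_j under the model).
import math
from collections import defaultdict

CLASSES = ["NNP", "NN"]

def features(x, c):
    return {f"cap&{c}": 1.0 if x[0].isupper() else 0.0, f"bias&{c}": 1.0}

def probs(weights, x):
    scores = {c: sum(weights.get(n, 0.0) * v for n, v in features(x, c).items())
              for c in CLASSES}
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

def gradient(weights, data):
    grad = defaultdict(float)
    for x, c in data:
        for n, v in features(x, c).items():      # empirical counts
            grad[n] += v
        p = probs(weights, x)
        for k in CLASSES:                        # expected counts under the model
            for n, v in features(x, k).items():
                grad[n] -= p[k] * v
    return dict(grad)

data = [("Urbana", "NNP"), ("computer", "NN")]
print(gradient({}, data))   # at w = 0 the model is uniform, so the gradient is nonzero
```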


SLIDE 35

Regularization

Problem: If there is a feature f_i that perfectly predicts some class c_j, its weight will go to ∞, and the other weights don’t matter.

Solution: We need to penalize large weights. Instead of the MLE (w* = argmax_w P(y | x, w)), predict w* = argmax_w P(y | x, w) P(w) (the Maximum A Posteriori estimate).


SLIDE 36

Modeling P(w): Gaussian prior

Assume P(w) is a Gaussian (normal) distribution with mean μ = 0 and (fixed) variance σ²:

  ŵ = argmax_w ∏_i P(y_i | x_i, w) P(w)
    = argmax_w ∑_i log P(y_i | x_i, w) + log P(w)
    = argmax_w ∑_i log P(y_i | x_i, w)  −  ∑_j w_j² / (2σ_j²)
                [as before]                [easy to deal with]
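A sketch of the resulting regularized (MAP) objective: the Gaussian prior becomes an L2 penalty on the weights. The toy log-likelihood value and σ below are illustrative:

```python
# Sketch: MAP objective with a Gaussian prior = log-likelihood minus an L2 penalty,
#   sum_i log P(y_i | x_i, w) - sum_j w_j^2 / (2 * sigma^2)
def map_objective(log_likelihood, weights, sigma=1.0):
    l2_penalty = sum(w * w for w in weights.values()) / (2.0 * sigma ** 2)
    return log_likelihood - l2_penalty

weights = {"cap&NNP": 2.0, "suffix=-ly&RB": -0.5}
print(map_objective(-3.2, weights, sigma=1.0))   # -3.2 - (4 + 0.25)/2 = -5.325
```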
