

Slide 1

Lecture 13: Structured Prediction

Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net. Course webpage: http://kwchang.net/teaching/NLP16


Slide 2

Quiz 2

• Lectures 9-13
  • Lecture 12: before page 44
  • Lecture 13: before page 33
• Key points:
  • HMM model
  • Three basic problems
  • Sequential tagging


Slide 3

Three basic problems for HMMs

• Likelihood of the input ("How likely does the sentence 'I love cat' occur?"):
  • Forward algorithm (sketched below)
• Decoding (tagging) the input ("What are the POS tags of 'I love cat'?"):
  • Viterbi algorithm
• Estimation (learning) ("How do we learn the model?"):
  • Find the best model parameters
  • Case 1: supervised (tags are annotated): maximum likelihood estimation (MLE)
  • Case 2: unsupervised (only unannotated text): forward-backward algorithm
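To make the first problem concrete, here is a minimal sketch of the forward algorithm on a toy HMM. The dictionary-based model (`init`, `trans`, `emit`) and its numbers are illustrative assumptions, not from the slides.

```python
# A minimal sketch of the forward algorithm (problem 1), assuming an HMM
# given as plain dictionaries; names and probabilities are illustrative.

def forward(words, tags, init, trans, emit):
    """Return P(words) = the sum over all tag sequences of P(words, tags)."""
    # alpha[t] = P(w_1..w_i, t_i = t), filled in left to right
    alpha = {t: init[t] * emit[t].get(words[0], 0.0) for t in tags}
    for w in words[1:]:
        alpha = {
            t: emit[t].get(w, 0.0) * sum(alpha[s] * trans[s][t] for s in tags)
            for t in tags
        }
    return sum(alpha.values())

# Toy model: two tags, three words
tags = ["N", "V"]
init = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"I": 0.4, "cat": 0.6}, "V": {"love": 1.0}}
print(forward(["I", "love", "cat"], tags, init, trans, emit))
```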

Slide 4

Supervised Learning Setting

• Assume we have annotated examples (estimation by counting is sketched below):


The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

Tag set: DT, JJ, NN, VBD, ... (the output of a POS tagger over the word sequence)
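As a sketch of the supervised case (MLE, from slide 3), transition and emission probabilities can be estimated by normalized counting over annotated examples like the one above. The function name and data layout here are assumptions for illustration.

```python
# A hedged sketch of supervised MLE for an HMM: with annotated word/tag
# pairs, transition and emission probabilities are just normalized counts.
from collections import Counter, defaultdict

def mle_estimate(tagged_sentences):
    trans_counts = defaultdict(Counter)  # counts of tag -> next tag
    emit_counts = defaultdict(Counter)   # counts of tag -> word
    for sentence in tagged_sentences:
        prev = "<s>"                     # start-of-sentence pseudo-tag
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag
    normalize = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    trans = {t: normalize(c) for t, c in trans_counts.items()}
    emit = {t: normalize(c) for t, c in emit_counts.items()}
    return trans, emit

data = [[("The", "DT"), ("grand", "JJ"), ("jury", "NN"), ("commented", "VBD")]]
trans, emit = mle_estimate(data)
print(trans["<s>"])  # {'DT': 1.0}
print(emit["NN"])    # {'jury': 1.0}
```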

Slide 5

Sequence tagging problems

• Many problems in NLP (ML) have data with tag sequences
• Brainstorm: name other sequential tagging problems


Slide 6

OCR example


Slide 7

Noun phrase (NP) chunking

• Task: identify all non-recursive NP chunks


Slide 8

The BIO encoding

• Define three new tags
  • B-NP: beginning of a noun phrase chunk
  • I-NP: inside of a noun phrase chunk
  • O: outside of a noun phrase chunk
• POS tagging with a restricted tagset?
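A small sketch of how chunk spans map onto BIO tags, assuming chunks arrive as (start, end) token spans; the function name and span format are illustrative assumptions.

```python
# A minimal sketch of the BIO encoding over NP chunk spans.

def to_bio(n_tokens, np_spans):
    """Convert NP spans [(start, end), ...] (end exclusive) to BIO tags."""
    tags = ["O"] * n_tokens
    for start, end in np_spans:
        tags[start] = "B-NP"                 # chunk-initial token
        for i in range(start + 1, end):
            tags[i] = "I-NP"                 # chunk-internal tokens
    return tags

# "The grand jury commented on a number of other topics ."
# NPs: "The grand jury", "a number", "other topics"
print(to_bio(11, [(0, 3), (5, 7), (8, 10)]))
# ['B-NP', 'I-NP', 'I-NP', 'O', 'O', 'B-NP', 'I-NP', 'O', 'B-NP', 'I-NP', 'O']
```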


Slide 9

Shallow parsing

• Task: identify all non-recursive NP, verb ("VP"), and preposition ("PP") chunks


Slide 10

BIO Encoding for Shallow Parsing

• Define new tags
  • B-NP, B-VP, B-PP: beginning of an "NP", "VP", or "PP" chunk
  • I-NP, I-VP, I-PP: inside of an "NP", "VP", or "PP" chunk
  • O: outside of any chunk


• POS tagging with a restricted tagset?

Slide 11

Named Entity Recognition

• Task: identify all mentions of named entities (people, organizations, locations, dates)


Slide 12

BIO Encoding for NER

• Define many new tags
  • B-PERS, B-DATE, ...: beginning of a mention of a person/date...
  • I-PERS, I-DATE, ...: inside of a mention of a person/date...
  • O: outside of any mention of a named entity


Slide 13

Sequence tagging

• Many NLP tasks are sequence tagging tasks
  • Input: a sequence of tokens/words
  • Output: a sequence of corresponding labels
  • E.g., POS tags, BIO encoding for NER
• Solution: find the most probable label sequence $\mathbf{t}$ for the given word sequence $\mathbf{w}$:
  • $\mathbf{t}^* = \arg\max_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w})$


Slide 14

Sequential tagging vs. independent prediction

Sequence labeling
  • $\mathbf{t}^* = \arg\max_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w})$
  • $\mathbf{t}$ is a vector/matrix

Independent classifier
  • $y = \arg\max_{y} P(y \mid \mathbf{x})$
  • $y$ is a single label

(figure: graphical models contrasting chained tag-word pairs $t_i, w_i$ with independent pairs $y_i, x_i$)

Slide 15

Sequential tagging vs. independent prediction

Sequence labeling
  • $\mathbf{t}^* = \arg\max_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w})$
  • $\mathbf{t}$ is a vector/matrix
  • Dependencies between both $(\mathbf{t}, \mathbf{w})$ and $(t_i, t_j)$
  • Structured output
  • Difficult to solve the inference problem

Independent classifiers
  • $y = \arg\max_{y} P(y \mid \mathbf{x})$
  • $y$ is a single label
  • Dependency only within $(y, \mathbf{x})$
  • Independent output
  • Easy to solve the inference problem


Slide 16

Recap: Viterbi Decoding


Induction:
$\delta_i(q) = P(w_i \mid t_i = q) \cdot \max_{q'} \delta_{i-1}(q') \cdot P(t_i = q \mid t_{i-1} = q')$

Slide 17

Recap: Viterbi algorithm

• Store the score of the best tag sequence for $w_1 \dots w_i$ that ends in tag $t_j$ in $T[j][i]$
  • $T[j][i] = \max P(w_1 \dots w_i, t_1 \dots, t_i = t_j)$
• Recursively compute $T[j][i]$ from the entries in the previous column $T[j][i-1]$:
  • $T[j][i] = P(w_i \mid t_j) \cdot \max_k T[k][i-1] \cdot P(t_j \mid t_k)$
    • $\max_k T[k][i-1]$: the best $(i-1)$-tag sequence ending in tag $t_k$
    • $P(w_i \mid t_j)$: generating the current observation
    • $P(t_j \mid t_k)$: transition from the previous best ending tag
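Here is a minimal sketch of the recursion above on the same toy dictionary-based HMM assumed in the earlier forward-algorithm sketch; for readability it indexes cells by tag rather than by the $T[j][i]$ matrix layout.

```python
# A minimal sketch of Viterbi decoding matching the recursion above:
# T[j][i] = P(w_i | t_j) * max_k T[k][i-1] * P(t_j | t_k).

def viterbi(words, tags, init, trans, emit):
    # best[t] = score of the best tag sequence for w_1..w_i ending in tag t
    best = {t: init[t] * emit[t].get(words[0], 0.0) for t in tags}
    back = []  # back[i][t] = best previous tag, for recovering the sequence
    for w in words[1:]:
        pointers, new_best = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: best[s] * trans[s][t])
            pointers[t] = prev
            new_best[t] = emit[t].get(w, 0.0) * best[prev] * trans[prev][t]
        back.append(pointers)
        best = new_best
    # Follow back-pointers from the best final tag
    tag = max(tags, key=lambda t: best[t])
    seq = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        seq.append(tag)
    return list(reversed(seq))

tags = ["N", "V"]
init = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"I": 0.4, "cat": 0.6}, "V": {"love": 1.0}}
print(viterbi(["I", "love", "cat"], tags, init, trans, emit))  # ['N', 'V', 'N']
```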


Slide 18

Two modeling perspectives

• Generative models
  • Model the joint probability of labels and words
  • $\mathbf{t}^* = \arg\max_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w}) = \arg\max_{\mathbf{t}} \frac{P(\mathbf{t}, \mathbf{w})}{P(\mathbf{w})} = \arg\max_{\mathbf{t}} P(\mathbf{t}, \mathbf{w})$
• Discriminative models
  • Directly model the conditional probability of labels given the words
  • $\mathbf{t}^* = \arg\max_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w})$, often modeled by a softmax function

Slide 19

Generative vs. discriminative models

• Binary classification as an example


(figure: the same binary-classification data from the generative model's view and from the discriminative model's view)

Slide 20

Generative vs. discriminative models

Generative
  • Models the joint distribution
  • Full probabilistic specification for all the random variables
  • Dependence assumptions have to be specified for $P(\mathbf{w} \mid \mathbf{t})$ and $P(\mathbf{t})$
  • Can be used in unsupervised learning

Discriminative
  • Models the conditional distribution
  • Only explains the target variable
  • Arbitrary features can be incorporated for modeling $P(\mathbf{t} \mid \mathbf{w})$
  • Needs labeled data, suitable for (semi-)supervised learning


Slide 21

Independent Classifiers

• $P(\mathbf{t} \mid \mathbf{w}) = \prod_i P(t_i \mid w_i)$
• ~95% accuracy (token-wise)

(figure: graphical model with independent tag-word pairs $t_1 w_1, \dots, t_4 w_4$)

Slide 22

Maximum entropy Markov models

• MEMMs are discriminative models of the labels $\mathbf{t}$ given the observed input sequence $\mathbf{w}$:
  • $P(\mathbf{t} \mid \mathbf{w}) = \prod_i P(t_i \mid w_i, t_{i-1})$


Slide 23

Design features

• Emission-like features
  • Binary feature functions:
    • f_first-letter-capitalized-NNP(China) = 1
    • f_first-letter-capitalized-VB(know) = 0
  • Integer (or real-valued) feature functions:
    • f_number-of-vowels-NNP(China) = 2
• Transition-like features
  • Binary feature functions:
    • f_first-letter-capitalized-VB-NNP(China) = 1
• Not necessarily independent features!

(figure: transition VB → NNP over the words "know" and "China")

Slide 24

Parameterization of $P(t_i \mid w_i, t_{i-1})$

• Associate a real-valued weight $\lambda_k$ with each specific type of feature function
  • e.g., $\lambda_k$ for f_first-letter-capitalized-NNP(w)
• Define a scoring function $f(t_i, t_{i-1}, w_i) = \sum_k \lambda_k f_k(t_i, t_{i-1}, w_i)$
• Naturally, $P(t_i \mid w_i, t_{i-1}) \propto \exp f(t_i, t_{i-1}, w_i)$ (see the sketch below)
  • Recall the basic definition of probability:
    • $P(x) > 0$
    • $\sum_x P(x) = 1$
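A sketch of this local softmax: score each candidate tag with $f(t_i, t_{i-1}, w_i) = \sum_k \lambda_k f_k(t_i, t_{i-1}, w_i)$, exponentiate, and normalize over the tagset. The feature templates and weights here are toy assumptions echoing slide 23.

```python
# A hedged sketch of the local softmax P(t_i | w_i, t_{i-1}) defined above.
import math

def features(tag, prev_tag, word):
    # Emission-like and transition-like binary features, as on slide 23
    return {
        f"first-letter-capitalized-{tag}": float(word[0].isupper()),
        f"transition-{prev_tag}-{tag}": 1.0,
    }

def local_prob(tag, prev_tag, word, tags, weights):
    def score(t):  # f(t, t_{i-1}, w_i) = sum_k lambda_k * f_k(...)
        return sum(weights.get(name, 0.0) * value
                   for name, value in features(t, prev_tag, word).items())
    z = sum(math.exp(score(t)) for t in tags)  # local normalizer over tags
    return math.exp(score(tag)) / z

tags = ["NNP", "VB"]
weights = {"first-letter-capitalized-NNP": 2.0, "transition-VB-NNP": 0.5}
print(local_prob("NNP", "VB", "China", tags, weights))  # close to 1
```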


Slide 25

Parameterization of MEMMs

• It is a log-linear model:
  • $\log P(\mathbf{t} \mid \mathbf{w}) = \sum_i f(t_i, t_{i-1}, w_i) - C(\boldsymbol{\lambda})$, where $\boldsymbol{\lambda}$ are the parameters and $C(\boldsymbol{\lambda})$ is a constant related only to $\boldsymbol{\lambda}$
• The Viterbi algorithm can be used to decode the most probable label sequence based solely on $\sum_i f(t_i, t_{i-1}, w_i)$

$P(\mathbf{t} \mid \mathbf{w}) = \prod_i P(t_i \mid w_i, t_{i-1}) = \prod_i \frac{\exp f(t_i, t_{i-1}, w_i)}{\sum_{t'} \exp f(t', t_{i-1}, w_i)} = \frac{\prod_i \exp f(t_i, t_{i-1}, w_i)}{\prod_i \sum_{t'} \exp f(t', t_{i-1}, w_i)}$


Slide 26

Parameter estimation (Intuition)

• The maximum likelihood estimator can be used in a similar way as in HMMs:
  • $\boldsymbol{\lambda}^* = \arg\max_{\boldsymbol{\lambda}} \sum_{(\mathbf{t}, \mathbf{w})} \log P(\mathbf{t} \mid \mathbf{w}) = \arg\max_{\boldsymbol{\lambda}} \sum_{(\mathbf{t}, \mathbf{w})} \left[ \sum_i f(t_i, t_{i-1}, w_i) - C(\boldsymbol{\lambda}) \right]$
  • Decompose the training data into such $(t_i, t_{i-1}, w_i)$ units


Slide 27

Parameter estimation (Intuition)

• Essentially, we train local classifiers, using previously assigned tags as features


Slide 28

More about MEMMs

• Emission features can go across multiple observations:
  • $f(t_i, t_{i-1}, w_i) \triangleq \sum_k \lambda_k f_k(t_i, t_{i-1}, \mathbf{w})$
  • Especially useful for shallow parsing and NER tasks


Slide 29

Label bias problem

• Consider the following tag sequences as the training data:
  Thomas/B-PER Jefferson/I-PER
  Thomas/B-LOC Hall/I-LOC

(figure: state-transition diagram with states other, B-LOC, E-LOC, B-PER, E-PER)

Slide 30

Label bias problem

• Training data: Thomas/B-PER Jefferson/I-PER and Thomas/B-LOC Hall/I-LOC
• MEMM (locally normalized):
  P(B-PER | Thomas, other) = 1/2
  P(B-LOC | Thomas, other) = 1/2
  P(I-PER | Jefferson, B-PER) = 1
  P(I-LOC | Jefferson, B-LOC) = 1
• Because each local distribution must sum to one, the wrong path B-LOC, I-LOC for "Thomas Jefferson" gets probability 1/2 · 1, exactly as high as the correct B-PER, I-PER path: local normalization cannot penalize the bad start. Should globally normalize!

(figure: state-transition diagram with states other, B-LOC, E-LOC, B-PER, E-PER)

Slide 31

Conditional Random Field

• Model global dependency; score the entire sequence directly
• $P(\mathbf{t} \mid \mathbf{w}) \propto \exp S(\mathbf{t}, \mathbf{w})$, i.e., $P(\mathbf{t} \mid \mathbf{w}) = \frac{\exp S(\mathbf{t}, \mathbf{w})}{\sum_{\mathbf{t}'} \exp S(\mathbf{t}', \mathbf{w})}$

(figure: graphical model with chained labels $t_1, \dots, t_4$ and words $w_1, \dots, w_4$)

Slide 32

Conditional Random Field

• $S(\mathbf{t}, \mathbf{w}) = \sum_i \left( \sum_k \lambda_k f_k(t_i, \mathbf{w}) + \sum_l \gamma_l g_l(t_i, t_{i-1}, \mathbf{w}) \right)$
• $P(\mathbf{t} \mid \mathbf{w}) \propto \exp S(\mathbf{t}, \mathbf{w}) = \prod_i \exp \left( \sum_k \lambda_k f_k(t_i, \mathbf{w}) + \sum_l \gamma_l g_l(t_i, t_{i-1}, \mathbf{w}) \right)$
• Node features $f(t_i, \mathbf{w})$; edge features $g(t_i, t_{i-1}, \mathbf{w})$

(figure: graphical model with chained labels $t_1, \dots, t_4$ and words $w_1, \dots, w_4$)
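A sketch of the CRF's globally normalized probability. To keep it short, the partition function enumerates every tag sequence, which is exponential; real implementations compute it with a forward-style dynamic program. The feature names are illustrative stand-ins for the node features $f$ and edge features $g$.

```python
# A hedged sketch of the CRF's global score and normalization.
import itertools
import math

def score(tag_seq, words, weights):
    s, prev = 0.0, "<s>"
    for tag, word in zip(tag_seq, words):
        s += weights.get(f"node-{tag}-{word}", 0.0)  # node feature f(t_i, w)
        s += weights.get(f"edge-{prev}-{tag}", 0.0)  # edge feature g(t_i, t_{i-1})
        prev = tag
    return s

def crf_prob(tag_seq, words, tags, weights):
    z = sum(math.exp(score(t, words, weights))       # global normalizer
            for t in itertools.product(tags, repeat=len(words)))
    return math.exp(score(tag_seq, words, weights)) / z

tags = ["N", "V"]
weights = {"node-V-love": 1.0, "edge-N-V": 0.5, "edge-<s>-N": 0.3}
print(crf_prob(("N", "V", "N"), ["I", "love", "cat"], tags, weights))
```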


Slide 33

Design features

• Emission-like features
  • Binary feature functions:
    • f_first-letter-capitalized-NNP(China) = 1
    • f_first-letter-capitalized-VB(know) = 0
  • Integer (or real-valued) feature functions:
    • f_number-of-vowels-NNP(China) = 2
• Transition-like features
  • Binary feature functions:
    • f_first-letter-capitalized-VB-NNP(China) = 1
• Not necessarily independent features!

(figure: transition VB → NNP over the words "know" and "China")

Slide 34

General Idea

• We want the score of the correct answer $S(\mathbf{t}^*, \mathbf{w})$ to be higher than all others:
  $S(\mathbf{t}^*, \mathbf{w}) > S(\mathbf{t}', \mathbf{w}) \quad \forall \mathbf{t}' \in \mathcal{T},\ \mathbf{t}' \neq \mathbf{t}^*$
• Different levels of mistakes:
  $S(\mathbf{t}^*, \mathbf{w}) \geq S(\mathbf{t}', \mathbf{w}) + \Delta(\mathbf{t}', \mathbf{t}^*) \quad \forall \mathbf{t}' \in \mathcal{T}$
• Several ML models can be used:
  • Structured Perceptron
  • Structured SVM
  • Learning to Search


Slide 35

Log-linear model

• $P(\mathbf{t} \mid \mathbf{w}) \propto \exp S(\mathbf{t}, \mathbf{w})$
• $S(\mathbf{t}, \mathbf{w}) = \sum_i \left( \sum_k \lambda_k f_k(t_i, \mathbf{w}) + \sum_l \gamma_l g_l(t_i, t_{i-1}, \mathbf{w}) \right)$
  $= \sum_k \lambda_k \left( \sum_i f_k(t_i, \mathbf{w}) \right) + \sum_l \gamma_l \left( \sum_i g_l(t_i, t_{i-1}, \mathbf{w}) \right)$
  $= (\lambda_1, \lambda_2, \dots, \gamma_1, \gamma_2, \dots) \cdot \left( \sum_i f_1(t_i, \mathbf{w}),\ \sum_i f_2(t_i, \mathbf{w}),\ \dots,\ \sum_i g_1(t_i, t_{i-1}, \mathbf{w}),\ \sum_i g_2(t_i, t_{i-1}, \mathbf{w}),\ \dots \right)$
• Essentially, we aggregate transition and emission patterns as features: $S(\mathbf{t}, \mathbf{w}) = \boldsymbol{\theta} \cdot F(\mathbf{t}, \mathbf{w})$ (see the sketch below)
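A sketch of this aggregation: summing the local node and edge patterns over positions yields one global feature vector $F(\mathbf{t}, \mathbf{w})$, so the sequence score collapses to a single dot product. The feature naming scheme is an assumption for illustration.

```python
# A small sketch of aggregating local patterns into one global feature
# vector F(t, w), so the score is a single dot product theta . F(t, w).
from collections import Counter

def global_features(tag_seq, words):
    F, prev = Counter(), "<s>"
    for tag, word in zip(tag_seq, words):
        F[f"node-{tag}-{word}"] += 1  # aggregated emission pattern
        F[f"edge-{prev}-{tag}"] += 1  # aggregated transition pattern
        prev = tag
    return F

def score(theta, F):
    return sum(theta.get(name, 0.0) * count for name, count in F.items())

F = global_features(("N", "V", "N"), ["I", "love", "cat"])
theta = {"node-V-love": 1.0, "edge-N-V": 0.5}
print(F)                # counts of aggregated patterns
print(score(theta, F))  # 1.5
```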

Slide 36

MEMM vs. CRF

• The score function can be the same:
  $S(\mathbf{t}, \mathbf{w}) = \sum_i \left( \sum_k \lambda_k f_k(t_i, \mathbf{w}) + \sum_l \gamma_l g_l(t_i, t_{i-1}, \mathbf{w}) \right) = \sum_i f(t_i, t_{i-1}, w_i)$
• MEMM (locally normalized):
  $P(\mathbf{t} \mid \mathbf{w}) = \prod_i P(t_i \mid w_i, t_{i-1}) = \frac{\prod_i \exp f(t_i, t_{i-1}, w_i)}{\prod_i \sum_{t'} \exp f(t', t_{i-1}, w_i)}$
• CRF (globally normalized):
  $P(\mathbf{t} \mid \mathbf{w}) = \frac{\exp S(\mathbf{t}, \mathbf{w})}{\sum_{\mathbf{t}'} \exp S(\mathbf{t}', \mathbf{w})} = \frac{\prod_i \exp f(t_i, t_{i-1}, w_i)}{\sum_{\mathbf{t}'} \prod_i \exp f(t'_i, t'_{i-1}, w_i)}$
• As in the previous slide, we can rearrange the summations

Slide 37

HMM vs. MEMM vs. CRF


(figure: graphical models side by side; the HMM models the joint P(X, Y), while the MEMM and CRF model the conditional P(Y|X))

Slide 38

Structured Prediction: beyond sequence tagging

Task                   | Input                         | Output
Part-of-speech tagging | They operate ships and banks. | Pronoun Verb Noun And Noun
Dependency parsing     | They operate ships and banks. | a dependency tree over "Root They operate ships and banks ."
Segmentation           | (image)                       | (image)

Assign values to a set of interdependent output variables.

Slide 39

Inference

• Find the best scoring output given the model: $\arg\max_{\mathbf{y}} S(\mathbf{y}, \mathbf{x})$
• The output space is usually exponentially large
• Inference algorithms:
  • Specific: e.g., Viterbi (linear chain)
  • General: integer linear programming (ILP)
  • Approximate inference algorithms: e.g., belief propagation, dual decomposition

Slide 40

Learning Structured Models

(figure: training loop alternating between solving inferences and updating the model with (stochastic) gradient updates)

Slide 41

Example: Structured Perceptron

• Goal: we want the score of the correct answer $S(\mathbf{y}^*, \mathbf{x}; \boldsymbol{\theta})$ to be higher than all others:
  $S(\mathbf{y}^*, \mathbf{x}; \boldsymbol{\theta}) > S(\mathbf{y}', \mathbf{x}; \boldsymbol{\theta}) \quad \forall \mathbf{y}' \in \mathcal{Y},\ \mathbf{y}' \neq \mathbf{y}^*$
• Let $S(\mathbf{y}, \mathbf{x}; \boldsymbol{\theta}) = \boldsymbol{\theta} \cdot F(\mathbf{y}, \mathbf{x})$
• Given training data $\{(\mathbf{y}_j, \mathbf{x}_j)\},\ j = 1 \dots N$
• Loop until convergence (see the sketch below):
  • For $j = 1 \dots N$:
    • Let $\mathbf{y}' = \arg\max_{\mathbf{y}} \boldsymbol{\theta} \cdot F(\mathbf{y}, \mathbf{x}_j)$
    • If $\mathbf{y}' \neq \mathbf{y}_j$: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \eta \left( F(\mathbf{y}_j, \mathbf{x}_j) - F(\mathbf{y}', \mathbf{x}_j) \right)$
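A runnable sketch of the loop above, with the aggregated features from the slide 35 sketch inlined. The brute-force argmax enumerates all tag sequences for brevity; in practice the argmax is computed with Viterbi decoding. Names and the toy data are illustrative assumptions.

```python
# A hedged sketch of the structured perceptron update loop above.
import itertools
from collections import Counter

def global_features(tag_seq, words):
    # Aggregated emission/transition patterns, as in the slide 35 sketch
    F, prev = Counter(), "<s>"
    for tag, word in zip(tag_seq, words):
        F[f"node-{tag}-{word}"] += 1
        F[f"edge-{prev}-{tag}"] += 1
        prev = tag
    return F

def predict(words, tags, theta):
    # Brute-force argmax over all tag sequences (Viterbi in practice)
    return max(itertools.product(tags, repeat=len(words)),
               key=lambda t: sum(theta[k] * v
                                 for k, v in global_features(t, words).items()))

def train(data, tags, epochs=5, eta=1.0):
    theta = Counter()  # all weights start at zero
    for _ in range(epochs):
        for words, gold in data:  # data: [(words, gold_tags)]
            pred = predict(words, tags, theta)
            if pred != gold:      # mistake-driven update
                gold_F = global_features(gold, words)
                pred_F = global_features(pred, words)
                for k in set(gold_F) | set(pred_F):
                    theta[k] += eta * (gold_F[k] - pred_F[k])
    return theta

data = [(["I", "love", "cat"], ("N", "V", "N"))]
theta = train(data, ["N", "V"])
print(predict(["I", "love", "cat"], ["N", "V"], theta))  # ('N', 'V', 'N')
```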
