Lecture 13: Structured Prediction
Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
Quiz 2
v Lectures 9-13
  v Lecture 12: before page 44
  v Lecture 13: before page 33
v Key points:
  v HMM model
  v Three basic problems
  v Sequential tagging
Three basic problems for HMMs
v Likelihood of the input:
  v Forward algorithm
  v E.g., how likely is it that the sentence "I love cat" occurs?
v Decoding (tagging) the input:
  v Viterbi algorithm
  v E.g., what are the POS tags of "I love cat"?
v Estimation (learning): how to learn the model?
  v Find the best model parameters
  v Case 1: supervised -- tags are annotated
    v Maximum likelihood estimation (MLE)
  v Case 2: unsupervised -- only unannotated text
    v Forward-backward algorithm
Supervised Learning Setting
v Assume we have annotated examples
  v Tag set: DT, JJ, NN, VBD, ...
  v POS tagger output: The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
Sequence tagging problems
v Many problems in NLP (ML) have data with tag sequences
v Brainstorm: name other sequential tagging problems
OCR example
Noun phrase (NP) chunking
v Task: identify all non-recursive NP chunks
The BIO encoding
v Define three new tags
  v B-NP: beginning of a noun phrase chunk
  v I-NP: inside of a noun phrase chunk
  v O: outside of a noun phrase chunk
v POS tagging with a restricted tagset? (A BIO conversion sketch follows.)
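The following is a minimal sketch (function and variable names are hypothetical, not from the lecture) of how gold NP chunk spans can be turned into one BIO tag per token:

```python
def spans_to_bio(tokens, chunks, label="NP"):
    """Convert chunk spans [(start, end), ...], end exclusive,
    into one BIO tag per token; tokens outside any chunk get O."""
    tags = ["O"] * len(tokens)
    for start, end in chunks:
        tags[start] = "B-" + label          # beginning of a chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # inside of a chunk
    return tags

tokens = "The grand jury commented on a number of other topics .".split()
chunks = [(0, 3), (5, 7), (8, 10)]  # "The grand jury", "a number", "other topics"
print(spans_to_bio(tokens, chunks))
# ['B-NP', 'I-NP', 'I-NP', 'O', 'O', 'B-NP', 'I-NP', 'O', 'B-NP', 'I-NP', 'O']
```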
Shallow parsing
v Task: identify all non-recursive NP, verb ("VP"), and preposition ("PP") chunks
BIO Encoding for Shallow Parsing
v Define new tags
  v B-NP, B-VP, B-PP: beginning of an "NP", "VP", "PP" chunk
  v I-NP, I-VP, I-PP: inside of an "NP", "VP", "PP" chunk
  v O: outside of any chunk
v POS tagging with a restricted tagset?
Named Entity Recognition
v Task: identify all mentions of named entities (people, organizations, locations, dates)
BIO Encoding for NER
v Define many new tags
  v B-PERS, B-DATE, ...: beginning of a mention of a person/date...
  v I-PERS, I-DATE, ...: inside of a mention of a person/date...
  v O: outside of any mention of a named entity
Sequence tagging
v Many NLP tasks are sequence tagging tasks
  v Input: a sequence of tokens/words
  v Output: a sequence of corresponding labels
  v E.g., POS tags, BIO encoding for NER
v Solution: find the most probable label sequence for the given word sequence (a naive version is sketched below)
  v 𝒕* = argmax_𝒕 P(𝒕 | 𝒘)
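To make the argmax concrete, here is a deliberately naive decoder (names hypothetical) that enumerates every possible tag sequence; the Viterbi recap below replaces this exponential search with dynamic programming:

```python
from itertools import product

def brute_force_decode(words, tagset, score):
    """Return the tag sequence maximizing score(tags, words), where
    score is any function proportional to P(tags | words).
    Enumerates |tagset|^n sequences -- for illustration only."""
    return max(product(tagset, repeat=len(words)),
               key=lambda tags: score(tags, words))
```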
Sequential tagging vs. independent prediction
v Sequence labeling:
  v 𝒕* = argmax_𝒕 P(𝒕 | 𝒘)
  v 𝒕 is a vector/matrix
v Independent classifier:
  v y = argmax_y P(y | x)
  v y is a single label
(figure: a chain of tags t_i, t_j over words w_i, w_j vs. isolated labels y_i, y_j over inputs x_i, x_j)
Sequential tagging vs. independent prediction
v Sequence labeling:
  v 𝒕* = argmax_𝒕 P(𝒕 | 𝒘); 𝒕 is a vector/matrix
  v Dependency both between (𝒕, 𝒘) and between (t_i, t_j)
  v Structured output
  v Difficult to solve the inference problem
v Independent classifiers:
  v y = argmax_y P(y | x); y is a single label
  v Dependency only within (y, x)
  v Independent output
  v Easy to solve the inference problem
Recap: Viterbi Decoding
v Induction: δ_k(q) = P(w_k | t_k = q) · max_{q'} [δ_{k-1}(q') · P(t_k = q | t_{k-1} = q')]
Recap: Viterbi algorithm
v Store the probability of the best tag sequence for w_1 ... w_i that ends in tag t_j in T[j][i]
  v T[j][i] = max P(w_1 ... w_i, t_1 ..., t_i = t_j)
v Recursively compute T[j][i] from the entries in the previous column T[j][i-1]:
  v T[j][i] = P(w_i | t_j) · max_k [T[k][i-1] · P(t_j | t_k)]
  v P(w_i | t_j): generating the current observation
  v max_k T[k][i-1]: the best tag sequence over the first i-1 words
  v P(t_j | t_k): transition from the previous best ending tag
(An implementation sketch follows.)
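Below is a sketch of this recurrence in code, assuming `init`, `trans`, and `emit` are HMM probability tables estimated elsewhere (the names are hypothetical):

```python
def viterbi(words, tags, init, trans, emit):
    """T[t][i]: probability of the best tag sequence for words[0..i]
    ending in tag t, via T[t][i] = emit * max_k T[k][i-1] * trans."""
    n = len(words)
    T = {t: [0.0] * n for t in tags}
    back = {t: [None] * n for t in tags}
    for t in tags:                       # initialization (i = 0)
        T[t][0] = init[t] * emit[t].get(words[0], 0.0)
    for i in range(1, n):                # induction, left to right
        for t in tags:
            k = max(tags, key=lambda p: T[p][i - 1] * trans[p][t])
            T[t][i] = emit[t].get(words[i], 0.0) * T[k][i - 1] * trans[k][t]
            back[t][i] = k               # remember the best predecessor
    best = max(tags, key=lambda t: T[t][n - 1])
    seq = [best]                         # follow backpointers right to left
    for i in range(n - 1, 0, -1):
        seq.append(back[seq[-1]][i])
    return list(reversed(seq))
```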
Two modeling perspectives
v Generative models
  v Model the joint probability of labels and words
  v 𝒕* = argmax_𝒕 P(𝒕 | 𝒘) = argmax_𝒕 P(𝒘, 𝒕) / P(𝒘) = argmax_𝒕 P(𝒕, 𝒘)
v Discriminative models
  v Directly model the conditional probability of labels given the words
  v 𝒕* = argmax_𝒕 P(𝒕 | 𝒘), often modeled by a softmax function
Generative vs. discriminative models
v Binary classification as an example
(figure: the generative model's view vs. the discriminative model's view)
Generative vs. discriminative models
v Generative:
  v Joint distribution
  v Full probabilistic specification for all the random variables
  v Dependence assumptions have to be specified for P(𝒘 | 𝒕) and P(𝒕)
  v Can be used in unsupervised learning
v Discriminative:
  v Conditional distribution
  v Only explains the target variable
  v Arbitrary features can be incorporated for modeling P(𝒕 | 𝒘)
  v Needs labeled data; suitable for (semi-)supervised learning
Independent Classifiers
v P(𝒕 | 𝒘) = ∏_i P(t_i | w_i)
v ~95% accuracy (token-wise)
(figure: tags t_1 ... t_4, each predicted independently from its word w_1 ... w_4; a baseline sketch follows)
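As one concrete instance of such an independent classifier, here is a minimal most-frequent-tag baseline (hypothetical names; a real per-token classifier would use richer features), which tags every token in isolation:

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_tokens):
    """Estimate argmax_t P(t | w) per word type from (word, tag) pairs;
    each token is then tagged alone, ignoring neighboring tags."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

model = train_most_frequent_tag(
    [("the", "DT"), ("jury", "NN"), ("the", "DT"), ("number", "NN")])
print([model.get(w, "NN") for w in ["the", "jury"]])  # ['DT', 'NN']
```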
Maximum entropy Markov models
v MEMMs are discriminative models of the labels 𝒕 given the observed input sequence 𝒘
v P(𝒕 | 𝒘) = ∏_i P(t_i | w_i, t_{i-1})
Design features
v Example: know/VB China/NNP
v Emission-like features
  v Binary feature functions
    v f_first-letter-capitalized-NNP(China) = 1
    v f_first-letter-capitalized-VB(know) = 0
  v Integer (or real-valued) feature functions
    v f_number-of-vowels-NNP(China) = 2
v Transition-like features
  v Binary feature functions
    v f_first-letter-capitalized-VB-NNP(China) = 1
v Not necessarily independent features!
Parameterization of P(t_i | w_i, t_{i-1})
v Associate a real-valued weight λ_k with each specific type of feature function
  v e.g., λ_k for f_first-letter-capitalized-NNP(w)
v Define a scoring function f(t_i, t_{i-1}, w_i) = ∑_k λ_k f_k(t_i, t_{i-1}, w_i)
v Naturally, P(t_i | w_i, t_{i-1}) ∝ exp f(t_i, t_{i-1}, w_i)
v Recall the basic definition of probability
  v P(x) > 0
  v ∑_x p(x) = 1
(A sketch of this local softmax follows.)
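Here is a minimal sketch of the locally normalized distribution, with hypothetical feature functions and weights:

```python
import math

def local_prob(t, t_prev, w, feats, lam, tagset):
    """P(t | w, t_prev) = exp f(t, t_prev, w) / Z, where
    f(t, t_prev, w) = sum_k lam[k] * feats[k](t, t_prev, w)
    and Z normalizes over the tag set (local normalization)."""
    def f(tag):
        return sum(l * fk(tag, t_prev, w) for l, fk in zip(lam, feats))
    Z = sum(math.exp(f(tag)) for tag in tagset)
    return math.exp(f(t)) / Z

# One emission-like binary feature: a capitalized word tagged NNP.
feats = [lambda t, tp, w: 1.0 if t == "NNP" and w[0].isupper() else 0.0]
print(local_prob("NNP", "VB", "China", feats, [2.0], ["NNP", "VB", "DT"]))
# ~0.787: the NNP feature fires, so NNP gets most of the local mass
```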
Parameterization of MEMMs
v P(𝒕 | 𝒘) = ∏_i P(t_i | w_i, t_{i-1}) = ∏_i [exp f(t_i, t_{i-1}, w_i) / ∑_t exp f(t, t_{i-1}, w_i)] = [∏_i exp f(t_i, t_{i-1}, w_i)] / [∏_i ∑_t exp f(t, t_{i-1}, w_i)]
v It is a log-linear model (𝝀: the parameters)
  v log p(𝒕 | 𝒘) = ∑_i f(t_i, t_{i-1}, w_i) − C(𝝀), where C(𝝀) is a constant only related to 𝝀
v The Viterbi algorithm can be used to decode the most probable label sequence solely based on ∑_i f(t_i, t_{i-1}, w_i)
Parameter estimation (Intuition)
v Maximum likelihood estimation can be used in a similar way as in HMMs
  v 𝝀* = argmax_𝝀 ∑_{(𝒕,𝒘)} log P(𝒕 | 𝒘) = argmax_𝝀 ∑_{(𝒕,𝒘)} [∑_i f(t_i, t_{i-1}, w_i) − C(𝝀)]
v Decompose the training data into such (t_i, t_{i-1}, w_i) units
Parameter estimation (Intuition)
v Essentially, we are training local classifiers, using the previously assigned (gold) tags as features (as sketched below)
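A sketch of that decomposition (names hypothetical): each token becomes one multiclass training example whose features include the gold previous tag.

```python
def memm_training_examples(tagged_sentences):
    """Turn tag sequences into local examples: predict t_i from
    (w_i, t_{i-1}), using the gold previous tag at training time."""
    examples = []
    for sent in tagged_sentences:            # sent: [(word, tag), ...]
        prev = "<START>"
        for word, tag in sent:
            examples.append(((word, prev), tag))
            prev = tag                       # gold tag, not a prediction
    return examples

print(memm_training_examples([[("Thomas", "B-PER"), ("Jefferson", "I-PER")]]))
# [(('Thomas', '<START>'), 'B-PER'), (('Jefferson', 'B-PER'), 'I-PER')]
```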
More about MEMMs
v Emission features can go across multiple observations
  v f(t_i, t_{i-1}, w_i) ≜ ∑_k λ_k f_k(t_i, t_{i-1}, 𝒘)
v Especially useful for shallow parsing and NER tasks
Label bias problem
v Consider the following tag sequences as the training data:
  v Thomas/B-PER Jefferson/I-PER
  v Thomas/B-LOC Hall/I-LOC
(figure: state diagram with states other, B-PER, I-PER, B-LOC, I-LOC; other → B-PER → I-PER and other → B-LOC → I-LOC)
Label bias problem
v Training data: Thomas/B-PER Jefferson/I-PER and Thomas/B-LOC Hall/I-LOC
v MEMM (locally normalized) estimates:
  v P(B-PER | Thomas, other) = 1/2, P(B-LOC | Thomas, other) = 1/2
  v P(I-PER | Jefferson, B-PER) = 1, P(I-LOC | Jefferson, B-LOC) = 1
v B-LOC has only one outgoing transition, so it keeps all of its probability mass regardless of the observed word (worked out below)
v Should globally normalize!
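Working the slide's numbers through for the test input "Thomas Jefferson" makes the tie explicit:

```python
# Path other -> B-PER -> I-PER:
#   P(B-PER | Thomas, other) * P(I-PER | Jefferson, B-PER) = 1/2 * 1
p_per = 0.5 * 1.0
# Path other -> B-LOC -> I-LOC:
#   P(B-LOC | Thomas, other) * P(I-LOC | Jefferson, B-LOC) = 1/2 * 1
# (B-LOC's only successor is I-LOC, so local normalization forces this
#  probability to 1 no matter what the word is.)
p_loc = 0.5 * 1.0
print(p_per, p_loc)  # 0.5 0.5 -- the MEMM cannot prefer the PER reading
```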
Conditional Random Field
v Model global dependency
v P(𝒕 | 𝒘) ∝ exp S(𝒕, 𝒘), i.e., P(𝒕 | 𝒘) = exp S(𝒕, 𝒘) / ∑_{𝒕'} exp S(𝒕', 𝒘)
v Score the entire sequence directly
(figure: a chain t_1 t_2 t_3 t_4 over w_1 w_2 w_3 w_4, scored as a whole)
Conditional Random Field
v S(𝒕, 𝒘) = ∑_j [∑_k λ_k f_k(t_j, 𝒘) + ∑_l γ_l g_l(t_j, t_{j-1}, 𝒘)]
v P(𝒕 | 𝒘) ∝ exp S(𝒕, 𝒘) = ∏_j exp(∑_k λ_k f_k(t_j, 𝒘) + ∑_l γ_l g_l(t_j, t_{j-1}, 𝒘))
v Node feature: f(t_j, 𝒘); edge feature: g(t_j, t_{j-1}, 𝒘)
(figure: chain t_1 ... t_4 over w_1 ... w_4 with node and edge features; a normalization sketch follows)
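The following is a brute-force sketch of the globally normalized distribution (names hypothetical; in practice the partition sum Z is computed with the forward algorithm in O(n · |tagset|²), not by enumeration):

```python
import math
from itertools import product

def crf_prob(tags, words, tagset, score):
    """Global normalization: P(t | w) = exp S(t, w) / Z(w), with
    Z(w) = sum over *all* tag sequences t' of exp S(t', w).
    Enumerates |tagset|^n sequences -- for illustration only."""
    Z = sum(math.exp(score(t, words))
            for t in product(tagset, repeat=len(words)))
    return math.exp(score(tags, words)) / Z
```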