
Lecture 13: Structured Prediction. Kai-Wei Chang, CS @ University of Virginia.



  1. Lecture 13: Structured Prediction
     Kai-Wei Chang, CS @ University of Virginia (kw@kwchang.net)
     Course webpage: http://kwchang.net/teaching/NLP16

  2. Quiz 2
     - Covers Lectures 9-13 (Lecture 12: before page 44; Lecture 13: before page 33)
     - Key points:
       - HMM model
       - Three basic problems
       - Sequential tagging

  3. Three basic problems for HMMs
     - Likelihood of the input: forward algorithm (sketched below)
       (e.g., how likely the sentence "I love cat" is)
     - Decoding (tagging) the input: Viterbi algorithm
       (e.g., the POS tags of "I love cat")
     - Estimation (learning): how to learn the model? Find the best model parameters
       - Case 1: supervised (tags are annotated): maximum likelihood estimation (MLE)
       - Case 2: unsupervised (only unannotated text): forward-backward algorithm
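
For reference, the forward algorithm mentioned above can be sketched in a few lines of Python. This is a minimal sketch; the tag set and probability tables are made-up toy values, not parameters from the lecture.

```python
# Minimal forward-algorithm sketch for P(sentence) under an HMM.
# The tag set and probability tables below are toy values for illustration.
tags = ["PRP", "VB", "NN"]
start = {"PRP": 0.6, "VB": 0.1, "NN": 0.3}                 # P(t_1)
trans = {"PRP": {"PRP": 0.1, "VB": 0.7, "NN": 0.2},        # P(t_i | t_{i-1})
         "VB":  {"PRP": 0.2, "VB": 0.1, "NN": 0.7},
         "NN":  {"PRP": 0.3, "VB": 0.4, "NN": 0.3}}
emit = {"PRP": {"I": 0.8, "love": 0.0, "cat": 0.0},        # P(w_i | t_i)
        "VB":  {"I": 0.0, "love": 0.9, "cat": 0.1},
        "NN":  {"I": 0.1, "love": 0.1, "cat": 0.8}}

def forward(words):
    """Return P(words) = sum over all tag sequences of P(words, tags)."""
    alpha = {t: start[t] * emit[t][words[0]] for t in tags}
    for w in words[1:]:
        alpha = {t: emit[t][w] * sum(alpha[s] * trans[s][t] for s in tags)
                 for t in tags}
    return sum(alpha.values())

print(forward(["I", "love", "cat"]))
```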

  4. Supervised learning setting
     - Assume we have annotated examples for a POS tagger
     - Tag set: DT, JJ, NN, VBD, ...
     - Example: The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

  5. Sequence tagging problems
     - Many problems in NLP (and ML) have data with tag sequences
     - Brainstorm: name other sequential tagging problems

  6. OCR example (figure)

  7. Noun phrase (NP) chunking
     - Task: identify all non-recursive NP chunks

  8. The BIO encoding
     - Define three new tags:
       - B-NP: beginning of a noun phrase chunk
       - I-NP: inside of a noun phrase chunk
       - O: outside of a noun phrase chunk
     - POS tagging with a restricted tag set? (a conversion sketch follows below)
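
As a concrete illustration of the BIO scheme, here is a minimal sketch that converts chunk spans into BIO tags. The sentence is the one from slide 4; the NP spans chosen for it are one plausible chunking used for illustration, not a gold annotation from the lecture.

```python
# Convert NP chunk spans (start, end) over a token list into BIO tags.
def spans_to_bio(tokens, spans, label="NP"):
    tags = ["O"] * len(tokens)
    for start, end in spans:                 # end is exclusive
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = ["The", "grand", "jury", "commented", "on", "a", "number",
          "of", "other", "topics", "."]
np_spans = [(0, 3), (5, 7), (8, 10)]   # "The grand jury", "a number", "other topics"
print(list(zip(tokens, spans_to_bio(tokens, np_spans))))
```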

  9. Shallow parsing
     - Task: identify all non-recursive NP, verb ("VP"), and preposition ("PP") chunks

  10. BIO encoding for shallow parsing
      - Define new tags:
        - B-NP, B-VP, B-PP: beginning of an NP, VP, or PP chunk
        - I-NP, I-VP, I-PP: inside of an NP, VP, or PP chunk
        - O: outside of any chunk
      - POS tagging with a restricted tag set?

  11. Named entity recognition (NER)
      - Task: identify all mentions of named entities (people, organizations, locations, dates)

  12. BIO encoding for NER
      - Define many new tags:
        - B-PERS, B-DATE, ...: beginning of a mention of a person/date/...
        - I-PERS, I-DATE, ...: inside of a mention of a person/date/...
        - O: outside of any mention of a named entity
      - (A sketch for reading mentions back off the BIO tags follows below.)
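
Going the other way, entity mentions can be recovered from a BIO-tagged sequence. A minimal sketch; the tagged toy sequence at the bottom is invented for illustration.

```python
# Recover (label, start, end) mentions from a BIO tag sequence.
def bio_to_spans(tags):
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or tag == "O":
            if start is not None:                    # close the open mention
                spans.append((label, start, i))
                start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:  # tolerate I- without a B-
            start, label = i, tag[2:]
    if start is not None:
        spans.append((label, start, len(tags)))
    return spans

tags = ["B-PERS", "I-PERS", "O", "B-DATE", "O"]
print(bio_to_spans(tags))   # [('PERS', 0, 2), ('DATE', 3, 4)]
```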

  13. Sequence tagging
      - Many NLP tasks are sequence tagging tasks
        - Input: a sequence of tokens/words
        - Output: a sequence of corresponding labels (e.g., POS tags, BIO encoding for NER)
      - Solution: find the most probable label sequence t for the given word sequence w:
        t* = argmax_t P(t | w)

  14. Sequential tagging vs. independent prediction
      - Sequence labeling: t* = argmax_t P(t | w); t is a vector/matrix of labels
      - Independent classifier: y = argmax_y P(y | x); y is a single label
      - (Figure: chained tags t_i, t_j over words w_i, w_j vs. independent labels y_i, y_j over inputs x_i, x_j)

  15. Sequential tagging vs. independent prediction (continued)
      - Sequence labeling: t* = argmax_t P(t | w)
        - t is a vector/matrix
        - Dependency between both (t, w) and (t_i, t_j)
        - Structured output
        - Difficult to solve the inference problem
      - Independent classifiers: y = argmax_y P(y | x)
        - y is a single label
        - Dependency only within (y, x)
        - Independent output
        - Easy to solve the inference problem

  16. Recap: Viterbi decoding
      - Induction: δ_i(q) = P(w_i | t_i = q) · max_{q'} δ_{i-1}(q') · P(t_i = q | t_{i-1} = q')

  17. Recap: Viterbi algorithm
      - Store the best tag sequence for w_1 ... w_i that ends in t_j in T[j][i]:
        T[j][i] = max P(w_1 ... w_i, t_1 ..., t_i = t_j)
      - Recursively compute T[j][i] from the entries in the previous column T[j][i-1]:
        T[j][i] = P(w_i | t_j) · max_k T[k][i-1] · P(t_j | t_k)
        - P(w_i | t_j): generating the current observation
        - P(t_j | t_k): transition from the previous best ending tag
        - T[k][i-1]: the best tag sequence for the first i-1 words
      - (A runnable sketch of this recurrence follows below.)
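
A minimal, runnable version of this recurrence, with back-pointers to recover the best tag sequence. The toy tag set and probability tables are illustrative values only, not parameters from the lecture.

```python
# Viterbi decoding: T[j][i] = P(w_i | t_j) * max_k T[k][i-1] * P(t_j | t_k).
def viterbi(words, tags, start, trans, emit):
    V = [{t: start[t] * emit[t][words[0]] for t in tags}]   # column for w_1
    back = [{}]
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda s: V[-1][s] * trans[s][t])
            col[t] = emit[t][w] * V[-1][best_prev] * trans[best_prev][t]
            ptr[t] = best_prev
        V.append(col)
        back.append(ptr)
    # Follow back-pointers from the best final tag.
    best = max(tags, key=lambda t: V[-1][t])
    path = [best]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return list(reversed(path))

tags = ["PRP", "VB", "NN"]
start = {"PRP": 0.6, "VB": 0.1, "NN": 0.3}
trans = {"PRP": {"PRP": 0.1, "VB": 0.7, "NN": 0.2},
         "VB":  {"PRP": 0.2, "VB": 0.1, "NN": 0.7},
         "NN":  {"PRP": 0.3, "VB": 0.4, "NN": 0.3}}
emit = {"PRP": {"I": 0.8, "love": 0.0, "cat": 0.0},
        "VB":  {"I": 0.0, "love": 0.9, "cat": 0.1},
        "NN":  {"I": 0.1, "love": 0.1, "cat": 0.8}}
print(viterbi(["I", "love", "cat"], tags, start, trans, emit))  # ['PRP', 'VB', 'NN']
```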

  18. Two modeling perspectives
      - Generative models: model the joint probability of labels and words
        t* = argmax_t P(t | w) = argmax_t P(t, w) / P(w) = argmax_t P(t, w)
      - Discriminative models: directly model the conditional probability of labels given the words
        t* = argmax_t P(t | w), often modeled by a softmax function

  19. Generative vs. discriminative models
      - Binary classification as an example
      - (Figures: the generative model's view vs. the discriminative model's view)

  20. Generative vs. discriminative models
      - Generative:
        - Joint distribution
        - Full probabilistic specification for all the random variables
        - Dependence assumptions have to be specified for P(w | t) and P(t)
        - Can be used in unsupervised learning
      - Discriminative:
        - Conditional distribution
        - Only explains the target variable
        - Arbitrary features can be incorporated for modeling P(t | w)
        - Needs labeled data, suitable for (semi-)supervised learning

  21. Independent classifiers
      - P(t | w) = ∏_i P(t_i | w_i)
      - ~95% accuracy (token-wise)
      - (Figure: each tag t_i depends only on its own word w_i)

  22. Maximum entropy Markov models (MEMMs)
      - MEMMs are discriminative models of the labels t given the observed input sequence w
      - P(t | w) = ∏_i P(t_i | w_i, t_{i-1})

  23. Design features (example: "know/VB China/NNP")
      - Emission-like features
        - Binary feature functions, e.g.,
          f_first-letter-capitalized-NNP(China) = 1
          f_first-letter-capitalized-VB(know) = 0
        - Integer (or real-valued) feature functions, e.g.,
          f_number-of-vowels-NNP(China) = 2
      - Transition-like features
        - Binary feature functions, e.g.,
          f_first-letter-capitalized-VB-NNP(China) = 1
      - Not necessarily independent features!
      - (A code sketch of such feature functions follows below.)
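
These feature functions can be written as ordinary functions of (current tag, previous tag, word). The definitions below are a sketch of the idea, not the lecture's exact feature set.

```python
# Emission-like and transition-like feature functions for a MEMM/CRF.
# Each takes (current tag, previous tag, word) and returns a number.
def f_cap_NNP(t, t_prev, w):
    """Binary: current tag is NNP and the word is capitalized."""
    return 1.0 if t == "NNP" and w[:1].isupper() else 0.0

def f_vowels_NNP(t, t_prev, w):
    """Integer-valued: number of vowels in the word, fired only for NNP."""
    return float(sum(c in "aeiouAEIOU" for c in w)) if t == "NNP" else 0.0

def f_cap_VB_NNP(t, t_prev, w):
    """Binary transition-like: capitalized word tagged NNP right after a VB."""
    return 1.0 if t == "NNP" and t_prev == "VB" and w[:1].isupper() else 0.0

print(f_cap_NNP("NNP", "VB", "China"))     # 1.0
print(f_vowels_NNP("NNP", "VB", "China"))  # 2.0
print(f_cap_VB_NNP("NNP", "VB", "China"))  # 1.0
```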

  24. Parameterization of P(t_i | w_i, t_{i-1})
      - Associate a real-valued weight λ_k with each specific type of feature function,
        e.g., λ_k for f_first-letter-capitalized-NNP(w)
      - Define a scoring function f(t_i, t_{i-1}, w_i) = Σ_k λ_k f_k(t_i, t_{i-1}, w_i)
      - Naturally, P(t_i | w_i, t_{i-1}) ∝ exp f(t_i, t_{i-1}, w_i)
      - Recall the basic definition of probability: P(x) > 0 and Σ_x P(x) = 1
      - (A softmax sketch follows below.)
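
Putting the pieces together, the local conditional P(t_i | w_i, t_{i-1}) is a softmax over the scores f(t_i, t_{i-1}, w_i). A minimal sketch with invented feature functions and weights, assuming a tiny tag set:

```python
# P(t_i | w_i, t_{i-1}) ∝ exp f(t_i, t_{i-1}, w_i), with
# f(t, t_prev, w) = sum_k lambda_k * f_k(t, t_prev, w).
# Feature functions and weights below are illustrative, not from the lecture.
import math

TAGS = ["NNP", "VB", "NN"]

def features(t, t_prev, w):
    """Return the feature vector (f_1, ..., f_K) for one position."""
    return [
        1.0 if t == "NNP" and w[:1].isupper() else 0.0,   # emission-like
        1.0 if t == "VB" and w.endswith("ow") else 0.0,   # emission-like
        1.0 if t_prev == "VB" and t == "NNP" else 0.0,    # transition-like
    ]

weights = [1.5, 0.8, 1.0]   # one lambda_k per feature, chosen arbitrarily

def score(t, t_prev, w):
    return sum(l * f for l, f in zip(weights, features(t, t_prev, w)))

def local_prob(t, t_prev, w):
    """Softmax over all candidate tags t' for this position."""
    z = sum(math.exp(score(tp, t_prev, w)) for tp in TAGS)
    return math.exp(score(t, t_prev, w)) / z

print(local_prob("NNP", "VB", "China"))   # ~0.86: both NNP features fire
```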

  25. Parameterization of MEMMs
      - P(t | w) = ∏_i P(t_i | w_i, t_{i-1})
                 = ∏_i [ exp f(t_i, t_{i-1}, w_i) / Σ_t exp f(t, t_{i-1}, w_i) ]
      - It is a log-linear model (λ: parameters):
        log P(t | w) = Σ_i f(t_i, t_{i-1}, w_i) − C(λ), where C(λ) collects the normalization terms
      - The Viterbi algorithm can be used to decode the most probable label sequence
        based solely on Σ_i f(t_i, t_{i-1}, w_i)

  26. Parameter estimation (intuition)
      - The maximum likelihood estimator can be used in a similar way as in HMMs:
        λ* = argmax_λ Σ_{(t,w)} log P(t | w)
           = argmax_λ Σ_{(t,w)} [ Σ_i f(t_i, t_{i-1}, w_i) − C(λ) ]
      - Decompose the training data into such (t_i, t_{i-1}, w_i) units

  27. Parameter estimation (intuition)
      - Essentially, training local classifiers using previously assigned tags as features
        (a sketch with an off-the-shelf classifier follows below)
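
One way to make this concrete: treat each position as a classification example whose features include the word and the previous gold tag, then fit a multinomial logistic regression. The sketch below assumes scikit-learn is available and uses a tiny invented training set; greedy decoding is used at test time for brevity (Viterbi over the local scores is also possible, as noted on slide 25).

```python
# MEMM training as local classification: each token is one example whose
# features include the word and the previous (gold) tag.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

train = [
    (["I", "love", "cats"], ["PRP", "VBP", "NNS"]),
    (["Cats", "love", "me"], ["NNS", "VBP", "PRP"]),
]

def token_features(words, i, prev_tag):
    w = words[i]
    return {"word=" + w.lower(): 1, "cap": int(w[:1].isupper()),
            "prev_tag=" + prev_tag: 1}

X, y = [], []
for words, tags in train:
    prev = "<START>"
    for i, tag in enumerate(tags):
        X.append(token_features(words, i, prev))
        y.append(tag)
        prev = tag                      # gold previous tag at training time

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

# Greedy tagging at test time: feed back the model's own previous prediction.
words, prev, pred = ["Dogs", "love", "me"], "<START>", []
for i in range(len(words)):
    tag = clf.predict(vec.transform([token_features(words, i, prev)]))[0]
    pred.append(tag)
    prev = tag
print(list(zip(words, pred)))
```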

  28. More about MEMMs
      - Emission features can go across multiple observations:
        f(t_i, t_{i-1}, w_i) ≜ Σ_k λ_k f_k(t_i, t_{i-1}, w)
      - Especially useful for shallow parsing and NER tasks

  29. Label bias problem
      - Consider the following tag sequences as the training data:
        Thomas/B-PER Jefferson/I-PER
        Thomas/B-LOC Hall/I-LOC
      - (Figure: state-transition diagram over the states other, B-PER, E-PER, B-LOC, E-LOC)

  30. Label bias problem
      - Training data: Thomas/B-PER Jefferson/I-PER and Thomas/B-LOC Hall/I-LOC
      - MEMM (locally normalized):
        P(B-PER | Thomas, other) = 1/2, P(B-LOC | Thomas, other) = 1/2
        P(I-PER | Jefferson, B-PER) = 1, P(I-LOC | Jefferson, B-LOC) = 1
      - Should globally normalize!
      - (See the worked illustration below.)
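
To see the problem concretely, suppose the test input is "Thomas Hall" and the learned local distributions stay deterministic out of the B-PER and B-LOC states (an illustrative assumption; the exact values depend on the features). Because B-PER is only ever followed by I-PER and B-LOC only by I-LOC, each state passes on all of its probability mass regardless of the observed word:

P(B-PER, I-PER | Thomas Hall) = P(B-PER | Thomas, other) · P(I-PER | Hall, B-PER) = 1/2 · 1 = 1/2
P(B-LOC, I-LOC | Thomas Hall) = P(B-LOC | Thomas, other) · P(I-LOC | Hall, B-LOC) = 1/2 · 1 = 1/2

The word "Hall", which favors the LOC reading, cannot break the tie, because every state's outgoing distribution is locally normalized to sum to 1. A globally normalized model (the CRF on the next slides) lets the evidence from "Hall" influence the score of the whole sequence.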

  31. Conditional random field (CRF)
      - Model global dependency; score the entire sequence directly:
        P(t | w) ∝ exp S(t, w) = exp S(t, w) / Σ_{t'} exp S(t', w)
      - (Figure: an undirected chain over tags t_1 ... t_4 connected to words w_1 ... w_4)

  32. Conditional random field (CRF)
      - S(t, w) = Σ_i [ Σ_k λ_k f_k(t_i, w) + Σ_l γ_l g_l(t_i, t_{i-1}, w) ]
      - P(t | w) ∝ exp S(t, w) = ∏_i exp( Σ_k λ_k f_k(t_i, w) + Σ_l γ_l g_l(t_i, t_{i-1}, w) )
      - Node feature: f(t_i, w); edge feature: g(t_i, t_{i-1}, w)
      - (A scoring sketch follows below.)
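
A scoring sketch of this model: node and edge feature functions, a sequence score S(t, w), and a brute-force normalizer over all candidate tag sequences. The feature functions and weights are invented for illustration, and real implementations compute the normalizer with a forward-style dynamic program rather than enumeration.

```python
# CRF scoring sketch: P(t | w) = exp S(t, w) / sum_{t'} exp S(t', w),
# with S built from node features f_k(t_i, w) and edge features g_l(t_i, t_{i-1}, w).
import itertools
import math

TAGS = ["B-NP", "I-NP", "O"]

def node_score(t, i, words):
    """sum_k lambda_k * f_k(t_i, w): two toy node features."""
    w = words[i]
    return (1.0 * (t == "B-NP" and w.lower() in {"the", "a"})
            + 0.5 * (t == "O" and w == "."))

def edge_score(t, t_prev):
    """sum_l gamma_l * g_l(t_i, t_{i-1}, w): two toy edge features."""
    return (0.8 * (t_prev == "B-NP" and t == "I-NP")
            - 2.0 * (t_prev == "O" and t == "I-NP"))   # discourage I- right after O

def seq_score(tags, words):
    s = sum(node_score(t, i, words) for i, t in enumerate(tags))
    s += sum(edge_score(tags[i], tags[i - 1]) for i in range(1, len(tags)))
    return s

def prob(tags, words):
    """Globally normalized probability of one tag sequence (brute force)."""
    z = sum(math.exp(seq_score(cand, words))
            for cand in itertools.product(TAGS, repeat=len(words)))
    return math.exp(seq_score(tags, words)) / z

words = ["the", "grand", "jury", "."]
print(prob(("B-NP", "I-NP", "I-NP", "O"), words))
```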

  33. Design features (repeat of slide 23): emission-like and transition-like feature functions over "know/VB China/NNP"; not necessarily independent features
