

Slide 1

Lecture 13: Structured Prediction

Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net. Course webpage: http://kwchang.net/teaching/NLP16


Slide 2

Quiz 2

• Lectures 9-13
  • Lecture 12: before page 44
  • Lecture 13: before page 33
• Key points:
  • HMM model
  • Three basic problems
  • Sequential tagging


Slide 3

Three basic problems for HMMs

• Likelihood of the input ("How likely does the sentence 'I love cat' occur?"):
  • Forward algorithm (sketched below)
• Decoding (tagging) the input ("What are the POS tags of 'I love cat'?"):
  • Viterbi algorithm
• Estimation (learning) ("How do we learn the model?"):
  • Find the best model parameters
  • Case 1: supervised (tags are annotated): maximum likelihood estimation (MLE)
  • Case 2: unsupervised (only unannotated text): forward-backward algorithm
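To make the first problem concrete, here is a minimal sketch of the forward algorithm on a toy HMM. The dictionary-based model (`init`, `trans`, `emit`) and its numbers are illustrative assumptions, not from the slides.

```python
# A minimal sketch of the forward algorithm (problem 1), assuming an HMM
# given as plain dictionaries; names and probabilities are illustrative.

def forward(words, tags, init, trans, emit):
    """Return P(words) = the sum over all tag sequences of P(words, tags)."""
    # alpha[t] = P(w_1..w_i, t_i = t), filled in left to right
    alpha = {t: init[t] * emit[t].get(words[0], 0.0) for t in tags}
    for w in words[1:]:
        alpha = {
            t: emit[t].get(w, 0.0) * sum(alpha[s] * trans[s][t] for s in tags)
            for t in tags
        }
    return sum(alpha.values())

# Toy model: two tags, three words
tags = ["N", "V"]
init = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"I": 0.4, "cat": 0.6}, "V": {"love": 1.0}}
print(forward(["I", "love", "cat"], tags, init, trans, emit))
```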

Slide 4

Supervised Learning Setting

• Assume we have annotated examples (estimation by counting is sketched below):


The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

Tag set: DT, JJ, NN, VBD, ... (the output of a POS tagger over the word sequence)
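As a sketch of the supervised case (MLE, from slide 3), transition and emission probabilities can be estimated by normalized counting over annotated examples like the one above. The function name and data layout here are assumptions for illustration.

```python
# A hedged sketch of supervised MLE for an HMM: with annotated word/tag
# pairs, transition and emission probabilities are just normalized counts.
from collections import Counter, defaultdict

def mle_estimate(tagged_sentences):
    trans_counts = defaultdict(Counter)  # counts of tag -> next tag
    emit_counts = defaultdict(Counter)   # counts of tag -> word
    for sentence in tagged_sentences:
        prev = "<s>"                     # start-of-sentence pseudo-tag
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag
    normalize = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    trans = {t: normalize(c) for t, c in trans_counts.items()}
    emit = {t: normalize(c) for t, c in emit_counts.items()}
    return trans, emit

data = [[("The", "DT"), ("grand", "JJ"), ("jury", "NN"), ("commented", "VBD")]]
trans, emit = mle_estimate(data)
print(trans["<s>"])  # {'DT': 1.0}
print(emit["NN"])    # {'jury': 1.0}
```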

Slide 5

Sequence tagging problems

• Many problems in NLP (ML) have data with tag sequences
• Brainstorm: name other sequential tagging problems


Slide 6

OCR example


Slide 7

Noun phrase (NP) chunking

• Task: identify all non-recursive NP chunks


Slide 8

The BIO encoding

• Define three new tags
  • B-NP: beginning of a noun phrase chunk
  • I-NP: inside of a noun phrase chunk
  • O: outside of a noun phrase chunk
• POS tagging with a restricted tagset?
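A small sketch of how chunk spans map onto BIO tags, assuming chunks arrive as (start, end) token spans; the function name and span format are illustrative assumptions.

```python
# A minimal sketch of the BIO encoding over NP chunk spans.

def to_bio(n_tokens, np_spans):
    """Convert NP spans [(start, end), ...] (end exclusive) to BIO tags."""
    tags = ["O"] * n_tokens
    for start, end in np_spans:
        tags[start] = "B-NP"                 # chunk-initial token
        for i in range(start + 1, end):
            tags[i] = "I-NP"                 # chunk-internal tokens
    return tags

# "The grand jury commented on a number of other topics ."
# NPs: "The grand jury", "a number", "other topics"
print(to_bio(11, [(0, 3), (5, 7), (8, 10)]))
# ['B-NP', 'I-NP', 'I-NP', 'O', 'O', 'B-NP', 'I-NP', 'O', 'B-NP', 'I-NP', 'O']
```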


Slide 9

Shallow parsing

• Task: identify all non-recursive NP, verb ("VP"), and preposition ("PP") chunks


Slide 10

BIO Encoding for Shallow Parsing

• Define new tags
  • B-NP, B-VP, B-PP: beginning of an "NP", "VP", or "PP" chunk
  • I-NP, I-VP, I-PP: inside of an "NP", "VP", or "PP" chunk
  • O: outside of any chunk


• POS tagging with a restricted tagset?

Slide 11

Named Entity Recognition

• Task: identify all mentions of named entities (people, organizations, locations, dates)


Slide 12

BIO Encoding for NER

• Define many new tags
  • B-PERS, B-DATE, ...: beginning of a mention of a person/date...
  • I-PERS, I-DATE, ...: inside of a mention of a person/date...
  • O: outside of any mention of a named entity


Slide 13

Sequence tagging

• Many NLP tasks are sequence tagging tasks
  • Input: a sequence of tokens/words
  • Output: a sequence of corresponding labels
  • E.g., POS tags, BIO encoding for NER
• Solution: find the most probable label sequence $\mathbf{t}$ for the given word sequence $\mathbf{w}$:
  • $\mathbf{t}^* = \arg\max_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w})$


Slide 14

Sequential tagging vs. independent prediction

Sequence labeling
  • $\mathbf{t}^* = \arg\max_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w})$
  • $\mathbf{t}$ is a vector/matrix

Independent classifier
  • $y = \arg\max_{y} P(y \mid \mathbf{x})$
  • $y$ is a single label

(figure: graphical models contrasting chained tag-word pairs $t_i, w_i$ with independent pairs $y_i, x_i$)

Slide 15

Sequential tagging vs. independent prediction

Sequence labeling
  • $\mathbf{t}^* = \arg\max_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w})$
  • $\mathbf{t}$ is a vector/matrix
  • Dependencies between both $(\mathbf{t}, \mathbf{w})$ and $(t_i, t_j)$
  • Structured output
  • Difficult to solve the inference problem

Independent classifiers
  • $y = \arg\max_{y} P(y \mid \mathbf{x})$
  • $y$ is a single label
  • Dependency only within $(y, \mathbf{x})$
  • Independent output
  • Easy to solve the inference problem


Slide 16

Recap: Viterbi Decoding


Induction:
$\delta_i(q) = P(w_i \mid t_i = q) \cdot \max_{q'} \delta_{i-1}(q') \cdot P(t_i = q \mid t_{i-1} = q')$

Slide 17

Recap: Viterbi algorithm

• Store the score of the best tag sequence for $w_1 \dots w_i$ that ends in tag $t_j$ in $T[j][i]$
  • $T[j][i] = \max P(w_1 \dots w_i, t_1 \dots, t_i = t_j)$
• Recursively compute $T[j][i]$ from the entries in the previous column $T[j][i-1]$:
  • $T[j][i] = P(w_i \mid t_j) \cdot \max_k T[k][i-1] \cdot P(t_j \mid t_k)$
    • $\max_k T[k][i-1]$: the best $(i-1)$-tag sequence ending in tag $t_k$
    • $P(w_i \mid t_j)$: generating the current observation
    • $P(t_j \mid t_k)$: transition from the previous best ending tag
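Here is a minimal sketch of the recursion above on the same toy dictionary-based HMM assumed in the earlier forward-algorithm sketch; for readability it indexes cells by tag rather than by the $T[j][i]$ matrix layout.

```python
# A minimal sketch of Viterbi decoding matching the recursion above:
# T[j][i] = P(w_i | t_j) * max_k T[k][i-1] * P(t_j | t_k).

def viterbi(words, tags, init, trans, emit):
    # best[t] = score of the best tag sequence for w_1..w_i ending in tag t
    best = {t: init[t] * emit[t].get(words[0], 0.0) for t in tags}
    back = []  # back[i][t] = best previous tag, for recovering the sequence
    for w in words[1:]:
        pointers, new_best = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: best[s] * trans[s][t])
            pointers[t] = prev
            new_best[t] = emit[t].get(w, 0.0) * best[prev] * trans[prev][t]
        back.append(pointers)
        best = new_best
    # Follow back-pointers from the best final tag
    tag = max(tags, key=lambda t: best[t])
    seq = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        seq.append(tag)
    return list(reversed(seq))

tags = ["N", "V"]
init = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"I": 0.4, "cat": 0.6}, "V": {"love": 1.0}}
print(viterbi(["I", "love", "cat"], tags, init, trans, emit))  # ['N', 'V', 'N']
```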


Slide 18

Two modeling perspectives

• Generative models
  • Model the joint probability of labels and words
  • $\mathbf{t}^* = \arg\max_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w}) = \arg\max_{\mathbf{t}} \frac{P(\mathbf{t}, \mathbf{w})}{P(\mathbf{w})} = \arg\max_{\mathbf{t}} P(\mathbf{t}, \mathbf{w})$
• Discriminative models
  • Directly model the conditional probability of labels given the words
  • $\mathbf{t}^* = \arg\max_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w})$, often modeled by a softmax function

Slide 19

Generative vs. discriminative models

• Binary classification as an example


(figure: the same binary-classification data from the generative model's view and from the discriminative model's view)

Slide 20

Generative vs. discriminative models

Generative
  • Models the joint distribution
  • Full probabilistic specification for all the random variables
  • Dependence assumptions have to be specified for $P(\mathbf{w} \mid \mathbf{t})$ and $P(\mathbf{t})$
  • Can be used in unsupervised learning

Discriminative
  • Models the conditional distribution
  • Only explains the target variable
  • Arbitrary features can be incorporated for modeling $P(\mathbf{t} \mid \mathbf{w})$
  • Needs labeled data, suitable for (semi-)supervised learning


Slide 21

Independent Classifiers

• $P(\mathbf{t} \mid \mathbf{w}) = \prod_i P(t_i \mid w_i)$
• ~95% accuracy (token-wise)

(figure: graphical model with independent tag-word pairs $t_1 w_1, \dots, t_4 w_4$)

Slide 22

Maximum entropy Markov models

• MEMMs are discriminative models of the labels $\mathbf{t}$ given the observed input sequence $\mathbf{w}$:
  • $P(\mathbf{t} \mid \mathbf{w}) = \prod_i P(t_i \mid w_i, t_{i-1})$


Slide 23

Design features

• Emission-like features
  • Binary feature functions:
    • f_first-letter-capitalized-NNP(China) = 1
    • f_first-letter-capitalized-VB(know) = 0
  • Integer (or real-valued) feature functions:
    • f_number-of-vowels-NNP(China) = 2
• Transition-like features
  • Binary feature functions:
    • f_first-letter-capitalized-VB-NNP(China) = 1
• Not necessarily independent features!

(figure: transition VB → NNP over the words "know" and "China")

Slide 24

Parameterization of $P(t_i \mid w_i, t_{i-1})$

• Associate a real-valued weight $\lambda_k$ with each specific type of feature function
  • e.g., $\lambda_k$ for f_first-letter-capitalized-NNP(w)
• Define a scoring function $f(t_i, t_{i-1}, w_i) = \sum_k \lambda_k f_k(t_i, t_{i-1}, w_i)$
• Naturally, $P(t_i \mid w_i, t_{i-1}) \propto \exp f(t_i, t_{i-1}, w_i)$ (see the sketch below)
  • Recall the basic definition of probability:
    • $P(x) > 0$
    • $\sum_x P(x) = 1$
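A sketch of this local softmax: score each candidate tag with $f(t_i, t_{i-1}, w_i) = \sum_k \lambda_k f_k(t_i, t_{i-1}, w_i)$, exponentiate, and normalize over the tagset. The feature templates and weights here are toy assumptions echoing slide 23.

```python
# A hedged sketch of the local softmax P(t_i | w_i, t_{i-1}) defined above.
import math

def features(tag, prev_tag, word):
    # Emission-like and transition-like binary features, as on slide 23
    return {
        f"first-letter-capitalized-{tag}": float(word[0].isupper()),
        f"transition-{prev_tag}-{tag}": 1.0,
    }

def local_prob(tag, prev_tag, word, tags, weights):
    def score(t):  # f(t, t_{i-1}, w_i) = sum_k lambda_k * f_k(...)
        return sum(weights.get(name, 0.0) * value
                   for name, value in features(t, prev_tag, word).items())
    z = sum(math.exp(score(t)) for t in tags)  # local normalizer over tags
    return math.exp(score(tag)) / z

tags = ["NNP", "VB"]
weights = {"first-letter-capitalized-NNP": 2.0, "transition-VB-NNP": 0.5}
print(local_prob("NNP", "VB", "China", tags, weights))  # close to 1
```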


Slide 25

Parameterization of MEMMs

• It is a log-linear model:
  • $\log P(\mathbf{t} \mid \mathbf{w}) = \sum_i f(t_i, t_{i-1}, w_i) - C(\boldsymbol{\lambda})$, where $\boldsymbol{\lambda}$ are the parameters and $C(\boldsymbol{\lambda})$ is a constant related only to $\boldsymbol{\lambda}$
• The Viterbi algorithm can be used to decode the most probable label sequence based solely on $\sum_i f(t_i, t_{i-1}, w_i)$

$P(\mathbf{t} \mid \mathbf{w}) = \prod_i P(t_i \mid w_i, t_{i-1}) = \prod_i \frac{\exp f(t_i, t_{i-1}, w_i)}{\sum_{t'} \exp f(t', t_{i-1}, w_i)} = \frac{\prod_i \exp f(t_i, t_{i-1}, w_i)}{\prod_i \sum_{t'} \exp f(t', t_{i-1}, w_i)}$


Slide 26

Parameter estimation (Intuition)

• The maximum likelihood estimator can be used in a similar way as in HMMs:
  • $\boldsymbol{\lambda}^* = \arg\max_{\boldsymbol{\lambda}} \sum_{(\mathbf{t}, \mathbf{w})} \log P(\mathbf{t} \mid \mathbf{w}) = \arg\max_{\boldsymbol{\lambda}} \sum_{(\mathbf{t}, \mathbf{w})} \left[ \sum_i f(t_i, t_{i-1}, w_i) - C(\boldsymbol{\lambda}) \right]$
  • Decompose the training data into such $(t_i, t_{i-1}, w_i)$ units


Slide 27

Parameter estimation (Intuition)

• Essentially, we train local classifiers, using previously assigned tags as features


Slide 28

More about MEMMs

• Emission features can go across multiple observations:
  • $f(t_i, t_{i-1}, w_i) \triangleq \sum_k \lambda_k f_k(t_i, t_{i-1}, \mathbf{w})$
  • Especially useful for shallow parsing and NER tasks


Slide 29

Label bias problem

• Consider the following tag sequences as the training data:
  Thomas/B-PER Jefferson/I-PER
  Thomas/B-LOC Hall/I-LOC

(figure: state-transition diagram with states other, B-LOC, E-LOC, B-PER, E-PER)

Slide 30

Label bias problem

• Training data: Thomas/B-PER Jefferson/I-PER and Thomas/B-LOC Hall/I-LOC
• MEMM (locally normalized):
  P(B-PER | Thomas, other) = 1/2
  P(B-LOC | Thomas, other) = 1/2
  P(I-PER | Jefferson, B-PER) = 1
  P(I-LOC | Jefferson, B-LOC) = 1
• Because each local distribution must sum to one, the wrong path B-LOC, I-LOC for "Thomas Jefferson" gets probability 1/2 · 1, exactly as high as the correct B-PER, I-PER path: local normalization cannot penalize the bad start. Should globally normalize!

(figure: state-transition diagram with states other, B-LOC, E-LOC, B-PER, E-PER)

Slide 31

Conditional Random Field

• Model global dependency; score the entire sequence directly
• $P(\mathbf{t} \mid \mathbf{w}) \propto \exp S(\mathbf{t}, \mathbf{w})$, i.e., $P(\mathbf{t} \mid \mathbf{w}) = \frac{\exp S(\mathbf{t}, \mathbf{w})}{\sum_{\mathbf{t}'} \exp S(\mathbf{t}', \mathbf{w})}$

(figure: graphical model with chained labels $t_1, \dots, t_4$ and words $w_1, \dots, w_4$)

Slide 32

Conditional Random Field

• $S(\mathbf{t}, \mathbf{w}) = \sum_i \left( \sum_k \lambda_k f_k(t_i, \mathbf{w}) + \sum_l \gamma_l g_l(t_i, t_{i-1}, \mathbf{w}) \right)$
• $P(\mathbf{t} \mid \mathbf{w}) \propto \exp S(\mathbf{t}, \mathbf{w}) = \prod_i \exp \left( \sum_k \lambda_k f_k(t_i, \mathbf{w}) + \sum_l \gamma_l g_l(t_i, t_{i-1}, \mathbf{w}) \right)$
• Node features $f(t_i, \mathbf{w})$; edge features $g(t_i, t_{i-1}, \mathbf{w})$

(figure: graphical model with chained labels $t_1, \dots, t_4$ and words $w_1, \dots, w_4$)
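A sketch of the CRF's globally normalized probability. To keep it short, the partition function enumerates every tag sequence, which is exponential; real implementations compute it with a forward-style dynamic program. The feature names are illustrative stand-ins for the node features $f$ and edge features $g$.

```python
# A hedged sketch of the CRF's global score and normalization.
import itertools
import math

def score(tag_seq, words, weights):
    s, prev = 0.0, "<s>"
    for tag, word in zip(tag_seq, words):
        s += weights.get(f"node-{tag}-{word}", 0.0)  # node feature f(t_i, w)
        s += weights.get(f"edge-{prev}-{tag}", 0.0)  # edge feature g(t_i, t_{i-1})
        prev = tag
    return s

def crf_prob(tag_seq, words, tags, weights):
    z = sum(math.exp(score(t, words, weights))       # global normalizer
            for t in itertools.product(tags, repeat=len(words)))
    return math.exp(score(tag_seq, words, weights)) / z

tags = ["N", "V"]
weights = {"node-V-love": 1.0, "edge-N-V": 0.5, "edge-<s>-N": 0.3}
print(crf_prob(("N", "V", "N"), ["I", "love", "cat"], tags, weights))
```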


Slide 33

Design features

• Emission-like features
  • Binary feature functions:
    • f_first-letter-capitalized-NNP(China) = 1
    • f_first-letter-capitalized-VB(know) = 0
  • Integer (or real-valued) feature functions:
    • f_number-of-vowels-NNP(China) = 2
• Transition-like features
  • Binary feature functions:
    • f_first-letter-capitalized-VB-NNP(China) = 1
• Not necessarily independent features!

(figure: transition VB → NNP over the words "know" and "China")

Slide 34

General Idea

• We want the score of the correct answer $S(\mathbf{t}^*, \mathbf{w})$ to be higher than all others:
  $S(\mathbf{t}^*, \mathbf{w}) > S(\mathbf{t}', \mathbf{w}) \quad \forall \mathbf{t}' \in \mathcal{T},\ \mathbf{t}' \neq \mathbf{t}^*$
• Different levels of mistakes:
  $S(\mathbf{t}^*, \mathbf{w}) \geq S(\mathbf{t}', \mathbf{w}) + \Delta(\mathbf{t}', \mathbf{t}^*) \quad \forall \mathbf{t}' \in \mathcal{T}$
• Several ML models can be used:
  • Structured Perceptron
  • Structured SVM
  • Learning to Search


Slide 35

Log-linear model

• $P(\mathbf{t} \mid \mathbf{w}) \propto \exp S(\mathbf{t}, \mathbf{w})$
• $S(\mathbf{t}, \mathbf{w}) = \sum_i \left( \sum_k \lambda_k f_k(t_i, \mathbf{w}) + \sum_l \gamma_l g_l(t_i, t_{i-1}, \mathbf{w}) \right)$
  $= \sum_k \lambda_k \left( \sum_i f_k(t_i, \mathbf{w}) \right) + \sum_l \gamma_l \left( \sum_i g_l(t_i, t_{i-1}, \mathbf{w}) \right)$
  $= (\lambda_1, \lambda_2, \dots, \gamma_1, \gamma_2, \dots) \cdot \left( \sum_i f_1(t_i, \mathbf{w}),\ \sum_i f_2(t_i, \mathbf{w}),\ \dots,\ \sum_i g_1(t_i, t_{i-1}, \mathbf{w}),\ \sum_i g_2(t_i, t_{i-1}, \mathbf{w}),\ \dots \right)$
• Essentially, we aggregate transition and emission patterns as features: $S(\mathbf{t}, \mathbf{w}) = \boldsymbol{\theta} \cdot F(\mathbf{t}, \mathbf{w})$ (see the sketch below)
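A sketch of this aggregation: summing the local node and edge patterns over positions yields one global feature vector $F(\mathbf{t}, \mathbf{w})$, so the sequence score collapses to a single dot product. The feature naming scheme is an assumption for illustration.

```python
# A small sketch of aggregating local patterns into one global feature
# vector F(t, w), so the score is a single dot product theta . F(t, w).
from collections import Counter

def global_features(tag_seq, words):
    F, prev = Counter(), "<s>"
    for tag, word in zip(tag_seq, words):
        F[f"node-{tag}-{word}"] += 1  # aggregated emission pattern
        F[f"edge-{prev}-{tag}"] += 1  # aggregated transition pattern
        prev = tag
    return F

def score(theta, F):
    return sum(theta.get(name, 0.0) * count for name, count in F.items())

F = global_features(("N", "V", "N"), ["I", "love", "cat"])
theta = {"node-V-love": 1.0, "edge-N-V": 0.5}
print(F)                # counts of aggregated patterns
print(score(theta, F))  # 1.5
```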

Slide 36

MEMM vs. CRF

• The score function can be the same:
  $S(\mathbf{t}, \mathbf{w}) = \sum_i \left( \sum_k \lambda_k f_k(t_i, \mathbf{w}) + \sum_l \gamma_l g_l(t_i, t_{i-1}, \mathbf{w}) \right) = \sum_i f(t_i, t_{i-1}, w_i)$
• MEMM (locally normalized):
  $P(\mathbf{t} \mid \mathbf{w}) = \prod_i P(t_i \mid w_i, t_{i-1}) = \frac{\prod_i \exp f(t_i, t_{i-1}, w_i)}{\prod_i \sum_{t'} \exp f(t', t_{i-1}, w_i)}$
• CRF (globally normalized):
  $P(\mathbf{t} \mid \mathbf{w}) = \frac{\exp S(\mathbf{t}, \mathbf{w})}{\sum_{\mathbf{t}'} \exp S(\mathbf{t}', \mathbf{w})} = \frac{\prod_i \exp f(t_i, t_{i-1}, w_i)}{\sum_{\mathbf{t}'} \prod_i \exp f(t'_i, t'_{i-1}, w_i)}$
• As in the previous slide, we can rearrange the summations

Slide 37

HMM vs. MEMM vs. CRF


(figure: graphical models side by side; the HMM models the joint P(X, Y), while the MEMM and CRF model the conditional P(Y|X))

Slide 38

Structured Prediction: beyond sequence tagging

Task                   | Input                         | Output
Part-of-speech tagging | They operate ships and banks. | Pronoun Verb Noun And Noun
Dependency parsing     | They operate ships and banks. | a dependency tree over "Root They operate ships and banks ."
Segmentation           | (image)                       | (image)

Assign values to a set of interdependent output variables.

Slide 39

Inference

• Find the best scoring output given the model: $\arg\max_{\mathbf{y}} S(\mathbf{y}, \mathbf{x})$
• The output space is usually exponentially large
• Inference algorithms:
  • Specific: e.g., Viterbi (linear chain)
  • General: integer linear programming (ILP)
  • Approximate inference algorithms: e.g., belief propagation, dual decomposition

Slide 40

Learning Structured Models

(figure: training loop alternating between solving inferences and updating the model with (stochastic) gradient updates)

Slide 41

Example: Structured Perceptron

• Goal: we want the score of the correct answer $S(\mathbf{y}^*, \mathbf{x}; \boldsymbol{\theta})$ to be higher than all others:
  $S(\mathbf{y}^*, \mathbf{x}; \boldsymbol{\theta}) > S(\mathbf{y}', \mathbf{x}; \boldsymbol{\theta}) \quad \forall \mathbf{y}' \in \mathcal{Y},\ \mathbf{y}' \neq \mathbf{y}^*$
• Let $S(\mathbf{y}, \mathbf{x}; \boldsymbol{\theta}) = \boldsymbol{\theta} \cdot F(\mathbf{y}, \mathbf{x})$
• Given training data $\{(\mathbf{y}_j, \mathbf{x}_j)\},\ j = 1 \dots N$
• Loop until convergence (see the sketch below):
  • For $j = 1 \dots N$:
    • Let $\mathbf{y}' = \arg\max_{\mathbf{y}} \boldsymbol{\theta} \cdot F(\mathbf{y}, \mathbf{x}_j)$
    • If $\mathbf{y}' \neq \mathbf{y}_j$: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \eta \left( F(\mathbf{y}_j, \mathbf{x}_j) - F(\mathbf{y}', \mathbf{x}_j) \right)$
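A runnable sketch of the loop above, with the aggregated features from the slide 35 sketch inlined. The brute-force argmax enumerates all tag sequences for brevity; in practice the argmax is computed with Viterbi decoding. Names and the toy data are illustrative assumptions.

```python
# A hedged sketch of the structured perceptron update loop above.
import itertools
from collections import Counter

def global_features(tag_seq, words):
    # Aggregated emission/transition patterns, as in the slide 35 sketch
    F, prev = Counter(), "<s>"
    for tag, word in zip(tag_seq, words):
        F[f"node-{tag}-{word}"] += 1
        F[f"edge-{prev}-{tag}"] += 1
        prev = tag
    return F

def predict(words, tags, theta):
    # Brute-force argmax over all tag sequences (Viterbi in practice)
    return max(itertools.product(tags, repeat=len(words)),
               key=lambda t: sum(theta[k] * v
                                 for k, v in global_features(t, words).items()))

def train(data, tags, epochs=5, eta=1.0):
    theta = Counter()  # all weights start at zero
    for _ in range(epochs):
        for words, gold in data:  # data: [(words, gold_tags)]
            pred = predict(words, tags, theta)
            if pred != gold:      # mistake-driven update
                gold_F = global_features(gold, words)
                pred_F = global_features(pred, words)
                for k in set(gold_F) | set(pred_F):
                    theta[k] += eta * (gold_F[k] - pred_F[k])
    return theta

data = [(["I", "love", "cat"], ("N", "V", "N"))]
theta = train(data, ["N", "V"])
print(predict(["I", "love", "cat"], ["N", "V"], theta))  # ('N', 'V', 'N')
```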
