Lecture 8: Sequence labeling with discriminative models


SLIDE 1

CS498JH: Introduction to NLP (Fall 2012)

http://cs.illinois.edu/class/cs498jh

Julia Hockenmaier

juliahmr@illinois.edu 3324 Siebel Center Office Hours: Wednesday, 12:15-1:15pm

Lecture 8: Sequence labeling with discriminative models

SLIDE 2

Sequence labeling

SLIDE 3

POS tagging

Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 .

Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB IBM_NNP ‘s_POS board_NN as_IN a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._.

Task: assign POS tags to words

SLIDE 4

Noun phrase (NP) chunking

Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 .

[NP Pierre Vinken] , [NP 61 years] old , will join [NP IBM] ‘s [NP board] as [NP a nonexecutive director] [NP Nov. 29] .

Task: identify all non-recursive NP chunks

SLIDE 5

The BIO encoding

We define three new tags:
– B-NP: beginning of a noun phrase chunk
– I-NP: inside of a noun phrase chunk
– O: outside of a noun phrase chunk


[NP Pierre Vinken] , [NP 61 years] old , will join [NP IBM] ‘s [NP board] as [NP a nonexecutive director] [NP Nov. 29] .

Pierre_B-NP Vinken_I-NP ,_O 61_B-NP years_I-NP old_O ,_O will_O join_O IBM_B-NP ‘s_O board_B-NP as_O a_B-NP nonexecutive_I-NP director_I-NP Nov._B-NP 29_I-NP ._O
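To make the encoding concrete, here is a minimal Python sketch (not from the lecture; the token list and (start, end) chunk representation are assumptions made for illustration) that converts NP chunk spans into BIO tags:

```python
# Minimal sketch: convert NP chunk spans into BIO tags.
# The list-of-(start, end) chunk representation is an illustrative assumption.

def chunks_to_bio(tokens, chunks):
    """tokens: list of words; chunks: list of (start, end) index pairs, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in chunks:
        tags[start] = "B-NP"
        for i in range(start + 1, end):
            tags[i] = "I-NP"
    return list(zip(tokens, tags))

tokens = ["Pierre", "Vinken", ",", "61", "years", "old"]
chunks = [(0, 2), (3, 5)]          # [NP Pierre Vinken], [NP 61 years]
print(chunks_to_bio(tokens, chunks))
# [('Pierre', 'B-NP'), ('Vinken', 'I-NP'), (',', 'O'), ('61', 'B-NP'), ('years', 'I-NP'), ('old', 'O')]
```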

SLIDE 6

Shallow parsing

Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 .

[NP Pierre Vinken] , [NP 61 years] old , [VP will join] [NP IBM] ‘s [NP board] [PP as] [NP a nonexecutive director] [NP Nov. 29] .

Task: identify all non-recursive NP, verb (“VP”) and preposition (“PP”) chunks

SLIDE 7

The BIO encoding for shallow parsing

We define several new tags:
– B-NP, B-VP, B-PP: beginning of an NP, “VP”, “PP” chunk
– I-NP, I-VP, I-PP: inside of an NP, “VP”, “PP” chunk
– O: outside of any chunk


Pierre_B-NP Vinken_I-NP ,_O 61_B-NP years_I-NP old_O ,_O will_B-VP join_I-VP IBM_B-NP ‘s_O board_B-NP as_B-PP a_B-NP nonexecutive_I-NP director_I-NP Nov._B-NP 29_I-NP ._O

[NP Pierre Vinken] , [NP 61 years] old , [VP will join] [NP IBM] ‘s [NP board] [PP as] [NP a nonexecutive director] [NP Nov. 29] .

SLIDE 8

Named Entity Recognition

Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 .

[PERS Pierre Vinken] , 61 years old , will join [ORG IBM] ‘s board as a nonexecutive director [DATE Nov. 29] .

Task: identify all mentions of named entities (people, organizations, locations, dates)

SLIDE 9

The BIO encoding for NER

We define many new tags:
– B-PERS, B-DATE, …: beginning of a mention of a person/date...
– I-PERS, I-DATE, …: inside of a mention of a person/date...
– O: outside of any mention of a named entity


[PERS Pierre Vinken] , 61 years old , will join [ORG IBM] ‘s board as a nonexecutive director [DATE Nov. 29] .

Pierre_B-PERS Vinken_I-PERS ,_O 61_O years_O old_O ,_O will_O join_O IBM_B-ORG ‘s_O board_O as_O a_O nonexecutive_O director_O Nov._B-DATE 29_I-DATE ._O

SLIDE 10

Many NLP tasks are sequence labeling tasks

Input: a sequence of tokens/words:

Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 .

Output: a sequence of labeled tokens/words:

POS-tagging:
Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB IBM_NNP ‘s_POS board_NN as_IN a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._.

Named Entity Recognition:
Pierre_B-PERS Vinken_I-PERS ,_O 61_O years_O old_O ,_O will_O join_O IBM_B-ORG ‘s_O board_O as_O a_O nonexecutive_O director_O Nov._B-DATE 29_I-DATE ._O


SLIDE 11

Graphical models for sequence labeling

SLIDE 12

Graphical models

Graphical models are a notation for probability models.
– Nodes represent distributions over random variables: a single node X stands for P(X).
– Arrows represent dependencies: a node X with an incoming arrow from Y stands for P(Y) P(X | Y); with incoming arrows from Y and Z it stands for P(Y) P(Z) P(X | Y, Z).
– Shaded nodes represent observed variables; white nodes represent hidden variables: a shaded X with an arrow from a white Y stands for P(Y) P(X | Y) with Y hidden and X observed.

SLIDE 13

HMMs as graphical models

HMMs are generative models of the observed input string w: they ‘generate’ w together with the tags t, with P(t, w) = ∏_i P(t_i | t_{i-1}) P(w_i | t_i). We know w, but need to find t.

[Diagram: a chain t1 → t2 → t3 → t4, with each t_i emitting w_i]
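To illustrate this factorization, here is a minimal sketch (not from the slides; the tiny transition and emission tables are made-up numbers) of computing the joint probability of a tag sequence and a word sequence under an HMM:

```python
# Sketch: joint probability P(t, w) = prod_i P(t_i | t_{i-1}) P(w_i | t_i) under an HMM.
# The tiny transition/emission tables below are illustrative assumptions, not real estimates.

transition = {("<s>", "NNP"): 0.4, ("NNP", "VBZ"): 0.3}        # P(t_i | t_{i-1})
emission = {("NNP", "Vinken"): 0.001, ("VBZ", "joins"): 0.01}  # P(w_i | t_i)

def hmm_joint_prob(tags, words):
    prob, prev = 1.0, "<s>"
    for t, w in zip(tags, words):
        prob *= transition.get((prev, t), 0.0) * emission.get((t, w), 0.0)
        prev = t
    return prob

print(hmm_joint_prob(["NNP", "VBZ"], ["Vinken", "joins"]))  # 0.4*0.001 * 0.3*0.01 = 1.2e-06
```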

SLIDE 14

Models for sequence labeling

Sequence labeling: Given an input sequence w = w1...wn, predict the best (most likely) label sequence t = t1…tn.

Generative models use Bayes’ Rule; discriminative (conditional) models model P(t | w) directly:


Generative (via Bayes’ Rule):
  argmax_t P(t | w) = argmax_t P(t, w) / P(w) = argmax_t P(t, w) = argmax_t P(t) P(w | t)

Discriminative (conditional):
  argmax_t P(t | w), modeled directly

SLIDE 15

Advantages of discriminative models

We’re usually not really interested in P(w | t):
– w is given. We don’t need to predict it!
Why not model what we’re actually interested in, P(t | w)?

Modeling P(w | t) well is quite difficult:
– Prefixes (capital letters) or suffixes are good predictors for certain classes of t (proper nouns, adverbs, …)
– But these features may not be independent (e.g. they overlap)
– These features may also help us deal with unknown words

Modeling P(t | w) should be easier:
– Now we can incorporate arbitrary features of the word, because we don’t need to predict w anymore.


SLIDE 16

Maximum Entropy Markov Models

MEMMs are conditional models of the labels t given the observed input string w. They model P(t | w) = ∏_i P(t_i | w_i, t_{i-1}).

[NB: We also use dynamic programming for learning and labeling]

[Diagram: nodes t1 t2 t3 t4 and w1 w2 w3 w4; each t_i depends on w_i and on t_{i-1}]

SLIDE 17

Probabilistic classification

Classification: Predict a class (label) c for an input x.

Probabilistic classification:
– Model the probability P(c | x). P(c | x) is a probability if 0 ≤ P(c_i | x) ≤ 1 and ∑_i P(c_i | x) = 1.
– Predict the class that has the highest probability.


SLIDE 18

Representing features

Define a set of feature functions f_i(x) over the input:
– Binary feature functions:
  f_first-letter-capitalized(Urbana) = 1
  f_first-letter-capitalized(computer) = 0
– Integer (or real-valued) feature functions:
  f_number-of-vowels(Urbana) = 3

Because each class might care only about certain features (e.g. capitalization for proper nouns), we redefine the feature functions f_i(x, c) to take the class label into account:
  f_first-letter-capitalized(Urbana, NNP) = 1
  f_first-letter-capitalized(Urbana, VB) = 0
⇒ We turn each feature f_i on or off depending on c.
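A minimal sketch of such feature functions in code (the feature names and the (x, c) interface are illustrative assumptions, not the lecture's notation):

```python
# Sketch: binary and class-conditioned feature functions for a MaxEnt classifier.

def f_first_letter_capitalized(x):
    return 1 if x[0].isupper() else 0

def f_number_of_vowels(x):
    return sum(1 for ch in x.lower() if ch in "aeiou")

# Class-conditioned version: the feature only fires for a particular label c.
def f_first_letter_capitalized_c(x, c):
    return 1 if c == "NNP" and x[0].isupper() else 0

print(f_first_letter_capitalized("Urbana"))           # 1
print(f_first_letter_capitalized("computer"))         # 0
print(f_number_of_vowels("Urbana"))                   # 3
print(f_first_letter_capitalized_c("Urbana", "NNP"))  # 1
print(f_first_letter_capitalized_c("Urbana", "VB"))   # 0
```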


SLIDE 19

From features to probabilities

– We also associate a real-valued weight w_i (λ_i) with each feature f_i.
– Now we have a score for predicting class c for input x: score(x, c) = ∑_i w_i f_i(x, c)
– This score could be negative, so we exponentiate it: score(x, c) = exp(∑_i w_i f_i(x, c)) = e^{∑_i w_i f_i(x, c)}
– We normalize this score to define a probability:

  P(c | x) = e^{∑_i w_i f_i(x, c)} / ∑_{c′} e^{∑_i w_i f_i(x, c′)} = e^{∑_i w_i f_i(x, c)} / Z

– Learning = finding the best weights w_i.
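A small numeric sketch of this normalization (the feature template and weights below are made up for illustration):

```python
# Sketch: P(c | x) = exp(sum_i w_i f_i(x, c)) / Z, where Z sums over all classes.
import math

def maxent_probs(weights, features, classes, x):
    """features(x, c) returns a dict {feature_name: value}; weights maps names to floats."""
    scores = {c: sum(weights.get(name, 0.0) * val
                     for name, val in features(x, c).items())
              for c in classes}
    z = sum(math.exp(s) for s in scores.values())        # normalization constant Z
    return {c: math.exp(s) / z for c, s in scores.items()}

def features(x, c):
    return {f"cap&{c}": 1.0 if x[0].isupper() else 0.0}

weights = {"cap&NNP": 2.0, "cap&VB": -1.0}
print(maxent_probs(weights, features, ["NNP", "VB"], "Urbana"))
# NNP gets probability e^2 / (e^2 + e^-1) ≈ 0.95
```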

SLIDE 20

Learning: finding w

We use conditional maximum likelihood estimation (and standard convex optimization algorithms) to find w. Conditional MLE: find the w that assigns the highest probability to all observed outputs c_i given the inputs x_i:

ŵ = argmax_w ∏_i P(c_i | x_i, w)
  = argmax_w ∑_i log P(c_i | x_i, w)
  = argmax_w ∑_i log [ e^{∑_j w_j f_j(x_i, c_i)} / ∑_c e^{∑_j w_j f_j(x_i, c)} ]
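A sketch of this conditional log-likelihood objective on toy data (the feature template, weights, and data are made up; a real system would maximize this with a convex optimizer):

```python
# Sketch: conditional log-likelihood sum_i log P(c_i | x_i, w) for a toy MaxEnt model.
import math

def log_likelihood(weights, data, classes, features):
    total = 0.0
    for x, c in data:
        scores = {k: sum(weights.get(n, 0.0) * v for n, v in features(x, k).items())
                  for k in classes}
        log_z = math.log(sum(math.exp(s) for s in scores.values()))
        total += scores[c] - log_z                  # log P(c | x, w)
    return total

def features(x, c):
    return {f"cap&{c}": 1.0 if x[0].isupper() else 0.0}

data = [("Urbana", "NNP"), ("computer", "NN")]
print(log_likelihood({"cap&NNP": 1.5}, data, ["NNP", "NN"], features))
```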


SLIDE 21

Some terminology

We also refer to these models as exponential models, because we exponentiate the weights and features (e^{∑ w f(x,c)}). We also refer to them as loglinear models, because the log probability is a linear function of the features. Statisticians refer to them as multinomial logistic regression models.

log P(c | x, w) = log [ e^{∑_j w_j f_j(x, c)} / Z ] = ∑_j w_j f_j(x, c) − log(Z)


SLIDE 22

Maximum Entropy Markov Models

MEMMs use a MaxEnt classifier for each P(t_i | w_i, t_{i-1}):

[Diagram: t_{i-1} and w_i both feed into t_i]

P(t_i | w_i, t_{i-1}) = e^{∑_j w_j f_j(w_i, t_{i-1}, t_i)} / Z
                      = e^{∑_j w_j f_j(w_i, t_{i-1}, t_i)} / ∑_{t_k} e^{∑_j w_j f_j(w_i, t_{i-1}, t_k)}
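A sketch of how an MEMM chains these local classifiers together, shown with greedy left-to-right decoding for brevity (the slide's note about dynamic programming corresponds to running Viterbi over the same local distributions); the tag set, feature templates, and weights are illustrative assumptions:

```python
# Sketch: greedy left-to-right MEMM decoding using a local MaxEnt classifier
# P(t_i | w_i, t_{i-1}). (Viterbi over the same local distributions gives the exact argmax.)
import math

TAGS = ["NNP", "NN", "O"]

def features(w, t_prev, t):
    return {f"word={w}&tag={t}": 1.0, f"prev={t_prev}&tag={t}": 1.0,
            f"cap&tag={t}": 1.0 if w[0].isupper() else 0.0}

def local_prob(weights, w, t_prev):
    scores = {t: sum(weights.get(n, 0.0) * v for n, v in features(w, t_prev, t).items())
              for t in TAGS}
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(s) / z for t, s in scores.items()}

def greedy_decode(weights, words):
    tags, t_prev = [], "<s>"
    for w in words:
        probs = local_prob(weights, w, t_prev)
        t_prev = max(probs, key=probs.get)      # pick the locally most probable tag
        tags.append(t_prev)
    return tags

weights = {"cap&tag=NNP": 2.0, "word=board&tag=NN": 2.0}
print(greedy_decode(weights, ["Pierre", "Vinken", "board"]))  # ['NNP', 'NNP', 'NN']
```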

SLIDE 23

Terminology II: Maximum Entropy

Entropy measures uncertainty; it is highest for uniform distributions. We also refer to these models as Maximum Entropy (MaxEnt) models, because conditional MLE finds the most uniform distribution (subject to the constraints that the expected feature counts equal the observed counts in the training data). The default value for all weights w_i is zero.

H(P) = −∑_x P(x) log₂ P(x)          H(P(y | x)) = −∑_y P(y | x) log₂ P(y | x)
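A small numeric check (not from the slides) that uniform distributions have the highest entropy:

```python
# Sketch: entropy H(P) = -sum_x P(x) log2 P(x); uniform distributions maximize it.
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 (uniform over 4 outcomes: maximal)
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ≈ 1.36 (less uniform, lower entropy)
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 (no uncertainty)
```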


SLIDE 24

Chain Conditional Random Fields

Chain CRFs are also conditional models of the labels t given the observed input string w, but instead of one classifier for each P(t_i | w_i, t_{i-1}), they learn a global distribution P(t | w).

[Diagram: a chain over t1 t2 t3 t4, with each t_i connected to its word w_i]

SLIDE 25

Today’s key concepts

Sequence labeling tasks:

– POS tagging
– NP chunking
– Shallow parsing
– Named Entity Recognition

Discriminative models:

– Maximum Entropy classifiers
– MEMMs


SLIDE 26

Supplementary material: Why Maximum Entropy?


SLIDE 27

Probabilistic classification

In probabilistic classification, we use P(y | x) to predict a class y for input x. If we want to do binary classification, i.e. Y = {true, false}, then P(y=true | x) + P(y=false | x) = 1.

We choose y=true if P(y=true | x) > P(y=false | x), i.e. if P(y=true | x) > 1 − P(y=true | x).

Equivalently, we choose y=true if the odds ratio of P(y=true | x) is greater than one:

  P(y=true | x) / (1 − P(y=true | x)) > 1,   i.e. P(y=true | x) > 0.5

SLIDE 28

The logit function

For a probability p, logit(p) is the natural logarithm of the odds ratio of p:

  logit(p) = ln( p / (1 − p) )

Note that −∞ < logit(p) < ∞:

  lim_{p→0} logit(p) = −∞        logit(0.5) = 0        lim_{p→1} logit(p) = +∞

SLIDE 29

Predicting probabilities with logistic regression

Probabilistic classification: predict the probability P(c | x). P(c | x) is a probability if 0 ≤ P(c_i | x) ≤ 1 and ∑_i P(c_i | x) = 1.

Linear regression: y = wx. Predict a real-valued outcome y for input x using weights w. It is difficult to force y to be a probability.

Logistic regression: logit(P(c | x)) = wx. This is possible since −∞ < logit(P) < ∞.

SLIDE 30

  logit(P(y | x)) = wx
  ln( P(y | x) / (1 − P(y | x)) ) = wx
  P(y | x) / (1 − P(y | x)) = e^{wx}
  P(y | x) = e^{wx} (1 − P(y | x))
  P(y | x) = e^{wx} − e^{wx} P(y | x)
  P(y | x) + e^{wx} P(y | x) = e^{wx}
  P(y | x) (1 + e^{wx}) = e^{wx}

  P(y | x) = e^{wx} / (1 + e^{wx}) = 1 / (1 + e^{−wx})
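A tiny numeric check of the resulting sigmoid (a sketch; the score wx is a made-up value), showing that it inverts the logit:

```python
# Sketch: the logistic (sigmoid) function P(y | x) = 1 / (1 + e^{-wx}) inverts the logit.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

wx = 1.2                       # some made-up score w·x
p = sigmoid(wx)
print(p)                       # ≈ 0.769
print(logit(p))                # ≈ 1.2 — recovers wx, since logit and sigmoid are inverses
```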

SLIDE 31

What about P(¬y | x)?

P(y | x) depends on w; the weight for the negative class is simply zero (w_{¬y} = 0):

  P(¬y | x) = 1 / (1 + e^{wx}) = e^0 / (1 + e^{wx}) = e^{0·x} / (1 + e^{wx})

The two probabilities sum to one:

  P(y | x) + P(¬y | x) = e^{wx} / (1 + e^{wx}) + 1 / (1 + e^{wx}) = 1

SLIDE 32

From binary to multiclass classification

Generalizing from a Bernoulli distribution (Y = {0, 1}) to a categorical distribution (Y = {c1, …, cn}):

  P(c1 | x) + … + P(cn | x) = e^{w1·x} / ∑_i e^{wi·x} + … + e^{wn·x} / ∑_i e^{wi·x} = 1

Setting all w_i = 0 yields a uniform distribution:

  e^{0·x} / ∑_i e^{0·x} + … + e^{0·x} / ∑_i e^{0·x} = 1/n + … + 1/n = 1

Recall: uniform distributions have maximal entropy.

SLIDE 33

Supplementary material: Learning the weights


SLIDE 34

Analytically...

We need to maximize the conditional likelihood:

  ŵ = argmax_w ∑_i log [ e^{∑_j w_j f_j(x_i, c_i)} / ∑_c e^{∑_j w_j f_j(x_i, c)} ] = argmax_w L(w)

We need to set the first derivative dL/dw to 0:

  dL/dw = ∑_i f(x_i, y_i)  −  ∑_i ∑_j f(x_i, y_j) P(y_j | x_i) = 0
          [empirical counts]   [expected counts]

That is, we need to find P(y | x) such that the expected counts equal the empirical (observed) counts.
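A sketch of this gradient (per feature: empirical count minus expected count) for a toy MaxEnt model; the data and feature template are made up, and a real implementation would hand this gradient to a convex optimizer such as L-BFGS:

```python
# Sketch: gradient of the conditional log-likelihood for a MaxEnt model,
# dL/dw_j = (empirical count of f_j) - (expected count of f_j under the model).
import math
from collections import defaultdict

CLASSES = ["NNP", "NN"]

def features(x, c):
    return {f"cap&{c}": 1.0 if x[0].isupper() else 0.0, f"bias&{c}": 1.0}

def probs(weights, x):
    scores = {c: sum(weights.get(n, 0.0) * v for n, v in features(x, c).items())
              for c in CLASSES}
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

def gradient(weights, data):
    grad = defaultdict(float)
    for x, c in data:
        for n, v in features(x, c).items():      # empirical counts
            grad[n] += v
        p = probs(weights, x)
        for k in CLASSES:                        # expected counts under the model
            for n, v in features(x, k).items():
                grad[n] -= p[k] * v
    return dict(grad)

data = [("Urbana", "NNP"), ("computer", "NN")]
print(gradient({}, data))   # at w = 0 the model is uniform, so the gradient is nonzero
```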


SLIDE 35

Regularization

Problem: If there is a feature f_i that perfectly predicts some class c_j, its weight will go to ∞, and the other weights don’t matter.

Solution: We need to penalize large weights. Instead of the MLE (w* = argmax_w P(y | x, w)), predict w* = argmax_w P(y | x, w) P(w) (the Maximum A Posteriori estimate).


SLIDE 36

Modeling P(w): Gaussian prior

Assume P(w) is a Gaussian (normal) distribution with mean μ = 0 and (fixed) variance σ²:

  ŵ = argmax_w ∏_i P(y_i | x_i, w) P(w)
    = argmax_w ∑_i log P(y_i | x_i, w) + log P(w)
    = argmax_w ∑_i log P(y_i | x_i, w)  −  ∑_j w_j² / (2σ_j²)
                [as before]                [easy to deal with]
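A sketch of the resulting regularized (MAP) objective: the Gaussian prior becomes an L2 penalty on the weights. The toy log-likelihood value and σ below are illustrative:

```python
# Sketch: MAP objective with a Gaussian prior = log-likelihood minus an L2 penalty,
#   sum_i log P(y_i | x_i, w) - sum_j w_j^2 / (2 * sigma^2)
def map_objective(log_likelihood, weights, sigma=1.0):
    l2_penalty = sum(w * w for w in weights.values()) / (2.0 * sigma ** 2)
    return log_likelihood - l2_penalty

weights = {"cap&NNP": 2.0, "suffix=-ly&RB": -0.5}
print(map_objective(-3.2, weights, sigma=1.0))   # -3.2 - (4 + 0.25)/2 = -5.325
```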
