Background Sequence labeling MEMMs - ? HMMs you know, right? - PowerPoint PPT Presentation

Background

Sequence labeling • MEMMs - ? • HMMs – you know, right? • Structured perceptron – also this? • linear-chain CRFs - ?

Sequence labeling • Imagine labeling a sequence of symbols in order to …. – do NER (finding named entities in text) – labels are: entity types – symbols are: words

IE with Hidden Markov Models Given a sequence of observations: Yesterday Pedro Domingos spoke this example sentence. and a trained HMM: person name location name background � � Find the most likely state sequence: (Viterbi) arg max P ( s , o ) � s Yesterday Pedro Domingos spoke this example sentence. Any words said to be generated by the designated “person name” state extract as a person name: Person name: Pedro Domingos

What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words. S t -1 S t S t+1 identity of word … ends in “ -ski ” is capitalized is part of a noun phrase is in a list of city names is “ Wisniewski ” is under node X in WordNet … is in bold font part of   ends in … noun phrase “ -ski ” O O t O t +1 t -1 Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …

What is a symbol? S t -1 S t S t+1 identity of word … ends in “ -ski ” is capitalized is part of a noun phrase is in a list of city names is “ Wisniewski ” is under node X in WordNet … is in bold font part of   ends in … noun phrase “ -ski ” O O t O t +1 t -1 Idea: replace generative model in HMM with a maxent model, where state depends on observations Pr( s t x | ) ... = t

What is a symbol? S t -1 S t S t+1 identity of word … ends in “ -ski ” is capitalized is part of a noun phrase is in a list of city names is “ Wisniewski ” is under node X in WordNet … is in bold font part of   ends in … noun phrase “ -ski ” O O t O t +1 t -1 Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state Pr( s | x , s ) ... 1 = t t t , −

What is a symbol? S t -1 S t+1 identity of word S t … ends in “ -ski ” is capitalized is part of a noun phrase is in a list of city names is “ Wisniewski ” is under node X in WordNet … is in bold font part of   ends in is indented noun phrase “ -ski ” is in hyperlink anchor O O t O t +1 t -1 … Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state history Pr( s | x , s s ...) ... = t t t 1 , t 2 , − −

Ratnaparkhi ’ s MXPOST POS tagger from late 90’s • Sequential learning problem: predict POS tags of words. • Uses MaxEnt model described above. • Rich feature set. • To smooth, discard features occurring < 10 times.

Conditional Markov Models (CMMs) aka MEMMs aka Maxent Taggers vs HMMS S t-1 S t S t+1 ... Pr( s , o ) Pr( s | s ) Pr( o | s ) ∏ = i i 1 i i 1 − − i O t-1 O t O t+1 S t-1 S t S t+1 ... Pr( s | o ) Pr( s | s , o ) ∏ = i i 1 i 1 − − i O t-1 O t O t+1

Graphical comparison among   HMMs, MEMMs and CRFs HMM MEMM CRF

Stacking and Searn William W. Cohen

Stacked Sequential Learning William W. Cohen Vitor Carvalho Language Technology Institute Center for Automated Learning and Discovery Carnegie Mellon University Carnegie Mellon University

Outline • Motivation: – MEMMs don’t work on segmentation tasks • New method: – Stacked sequential MaxEnt – Stacked sequential YFL • Results • More results... • Conclusions

However, in celebration of the locale, I will present this results in the style of Sir Walter Scott (1771-1832), author of “Ivanhoe” and other classics. In that pleasant district of merry Pennsylvania which is watered by   the river Mon, there extended since ancient times a large computer science department. Such being our chief scene, the date of our story refers to a period towards the middle of the year 2003 ....

Chapter 1, in which a graduate student (Vitor) discovers a bug in his advisor’s code that he cannot fix The problem : identifying reply and signature sections of email messages. The method : classify each line as reply, signature, or other.

Chapter 1, in which a graduate student discovers a bug in his advisor’s code that he cannot fix The problem : identifying reply and signature sections of email messages. The method : classify each line as reply, signature, or other. The warmup: classify each line is signature or nonsignature, using learning methods from Minorthird, and dataset of 600+ messages The results: from [CEAS-2004, Carvalho & Cohen]....

Chapter 1, in which a graduate student discovers a bug in his advisor’s code that he cannot fix But... Minorthird’s version of MEMMs has an accuracy of less than 70% (guessing majority class gives accuracy around 10%!)

Flashback: In which we recall the invention and re-invention of sequential classification with recurrent sliding windows, ..., MaxEnt Markov Models (MEMM) • From data, learn probabilistic classifier using previous label Y i-1 as Pr(y i |y i-1 ,x i ) a feature (or conditioned on Yi-1) – MaxEnt model reply reply sig Y i-1 Y i Y i+1 • To classify a Pr(Y i | Y i-1 , f 1 (X i ), f 2 (X i ),...)=... sequence x 1 ,x 2 ,... search for the best features of X i y 1 ,y 2 ,... – Viterbi X i-1 X i X i+1 – beam search

Flashback: In which we recall the invention and re-invention of sequential classification with recurrent sliding windows, ..., MaxEnt Markov Models (MEMM) ... and also praise their many virtues relative to CRFs • MEMMs are easy to implement Pr(Y i | Y i-1 ,Y i-2 ,...,f 1 (X i ),f 2 (X i ),...)=... • MEMMs train quickly – no probabilistic inference in the Y i-1 Y i Y i+1 inner loop of learning • You can use any old classifier (even if it’s not probabilistic) X i-1 X i X i+1 • MEMMs scale well with number of classes and length of history

The flashback ends and we return again to our document analysis task , on which the elegant MEMM method fails miserably for reasons unknown MEMMs have an accuracy of less than 70% on this problem – but why ?

Chapter 2, in which, in the fullness of time, the mystery is investigated... predicted true ...and it transpires that often the classifier predicts a signature block that is much longer than is correct false positive predictions ...as if the MEMM “gets stuck” predicting the sig label.

Chapter 2, in which, in the fullness of time, the mystery is investigated... ...and it transpires that Pr(Y i = sig |Y i-1 = sig ) = 1- ε as estimated from the data, giving the previous label a very high weight. reply reply sig Y i-1 Y i Y i+1 X i-1 X i X i+1

Chapter 2, in which, in the fullness of time, the mystery is investigated... 40 • We added “sequence noise” by randomly switching around 31.83 10% of the lines: this 30 – lowers the weight for the previous-label feature error rate – improves performance for 20 MEMMs – degrades performance for CRFs • Adding noise in this case 10 however is a loathsome bit 3.47 of hackery. 2.18 1.85 1.17 0 MaxEnt MEMM MEMM+noise CRF CRF+noise

Chapter 2, in which, in the fullness of time, the mystery is investigated... • Label bias problem CRFs can represent some distributions that MEMMs cannot [Lafferty et al 2000 ]: rib-rob MaxEnt – e.g., the “rib-rob” problem – this doesn’t explain why MaxEnt >> MEMMs CRFs • Observation bias problem: MEMMs can overweight MEMMs “observation” features [Klein and Manning 2002] : – here we observe the opposite: the history features are overweighted

Chapter 2, in which, in the fullness of time, the mystery is investigated...and an explanation is proposed. • From data, learn probabilistic classifier using previous label Y i-1 as Pr(y i |y i-1 ,x i ) a feature (or conditioned on Yi-1) – MaxEnt model reply reply sig Y i-1 Y i Y i+1 • To classify a sequence x 1 ,x 2 ,... search for the best y 1 ,y 2 ,... – Viterbi X i-1 X i X i+1 – beam search

Chapter 2, in which, in the fullness of time, the mystery is investigated...and an explanation is proposed. • From data, learn Learning data is noise-free, including values for Pr(y i |y i-1 ,x i ) Y i-1 – MaxEnt model • To classify a Classification data values for Y i-1 are noisy since sequence x 1 ,x 2 ,... they come from predictions search for the best i.e., the history values used at learning time are a y 1 ,y 2 ,... poor approximation of the values seen in – Viterbi classification – beam search

Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem • From data, learn While learning, replace the true value for Y i-1 with an approximation of the predicted value of Y i-1 Pr(y i |y i-1 ,x i ) – MaxEnt model • To classify a sequence x 1 ,x 2 ,... To approximate the value predicted by MEMMs, use the value predicted by non-sequential MaxEnt search for the best find approximate Y’s with a in a cross-validation experiment. y 1 ,y 2 ,... MaxEnt-learned hypothesis, and then apply the sequential After Wolpert [1992] we call this stacked MaxEnt. – Viterbi model to that – beam search

Background Sequence labeling MEMMs - ? HMMs you know, right? - PowerPoint PPT Presentation

Background Sequence labeling MEMMs - ? HMMs you know, right? Structured perceptron also this? linear-chain CRFs - ? Sequence labeling Imagine labeling a sequence of symbols in order to . do NER (finding

Structured Perceptron CMSC 470 Marine Carpuat POS tagging Sequence labeling with the perceptron

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

POS tagging CMSC 723 / LING 723 / INST 725 Marine Carpuat POS tagging Sequence labeling with

EMNLP | 2020 SeqMix: Augmenting Active Sequence Labeling via Sequence Mixup Rongzhi Zhang, Yue

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Connectionist Temporal Classification 1 Sequence-to-sequence

Sequence Labeling Markov Models Many information extraction tasks can be formulated as

Conditional Random Fields Dietrich Klakow Overview Sequence Labeling Bayesian Networks

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

Requirements of the Final Rule for Restaurant Menu Labeling Loretta Carey Food Labeling and

Definitions in the Final Rule for Restaurant Menu Labeling Loretta Carey Food Labeling and

Fall Seminar Seed Sampling & Labeling Larry Nees Seed Administrator Office of INDIANA

Hub Labeling Algorithms Andrew V. Goldberg Amazon.com A.V. Goldberg Hub Labeling 6/2/2016 1 /

Zero-shot Sequence Labeling: Transferring Knowledge from Sentences to Tokens Marek Rei

Sequence Labeling & Syntax CMSC 470 Marine Carpuat Recap: We know how to perform POS

Transition from direct to sequential 2p-decay in theory and in experiment. NUSTAR meeting, March

Using Financial Data in Macroeconomic Models Markus Brunnermeier, Darius Palia, and Chris Sims

Scene Representation How does one describe the objects in a Scene Graphs 3D scene? Scene

Adam Garnsworthy ARIEL Principal Scientist and TRIUMF Research Scientist FRIB Decay Station

Ma# Spangler, University of Nebraska 6/1/17 6/1/17 6/1/17

A review of global cement industry trends INSIGHTS FROM THE GLOBAL CEMENT REPORT, 12 TH EDITION

T C S AB BL LE E O OF F ON NT TE EN NT TS Chapter 5

VISUAL ENCODING & DESIGN PROCESS Miriah Meyer School of Computing University of Utah 2

Background Sequence labeling MEMMs - ? HMMs you know, right? - PowerPoint PPT Presentation

Background Sequence labeling MEMMs - ? HMMs you know, right? Structured perceptron also this? linear-chain CRFs - ? Sequence labeling Imagine labeling a sequence of symbols in order to . do NER (finding

Structured Perceptron CMSC 470 Marine Carpuat POS tagging Sequence labeling with the perceptron

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

POS tagging CMSC 723 / LING 723 / INST 725 Marine Carpuat POS tagging Sequence labeling with

EMNLP | 2020 SeqMix: Augmenting Active Sequence Labeling via Sequence Mixup Rongzhi Zhang, Yue

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Connectionist Temporal Classification 1 Sequence-to-sequence

Sequence Labeling Markov Models Many information extraction tasks can be formulated as

Conditional Random Fields Dietrich Klakow Overview Sequence Labeling Bayesian Networks

SEQUENCE ANALYSIS The term &quot; sequence analysis &quot; in biology implies subjecting a DNA or

Requirements of the Final Rule for Restaurant Menu Labeling Loretta Carey Food Labeling and

Definitions in the Final Rule for Restaurant Menu Labeling Loretta Carey Food Labeling and

Fall Seminar Seed Sampling &amp; Labeling Larry Nees Seed Administrator Office of INDIANA

Hub Labeling Algorithms Andrew V. Goldberg Amazon.com A.V. Goldberg Hub Labeling 6/2/2016 1 /

Zero-shot Sequence Labeling: Transferring Knowledge from Sentences to Tokens Marek Rei

Sequence Labeling &amp; Syntax CMSC 470 Marine Carpuat Recap: We know how to perform POS

Transition from direct to sequential 2p-decay in theory and in experiment. NUSTAR meeting, March

Using Financial Data in Macroeconomic Models Markus Brunnermeier, Darius Palia, and Chris Sims

Scene Representation How does one describe the objects in a Scene Graphs 3D scene? Scene

Adam Garnsworthy ARIEL Principal Scientist and TRIUMF Research Scientist FRIB Decay Station

Ma# Spangler, University of Nebraska 6/1/17 6/1/17 6/1/17

A review of global cement industry trends INSIGHTS FROM THE GLOBAL CEMENT REPORT, 12 TH EDITION

T C S AB BL LE E O OF F ON NT TE EN NT TS Chapter 5

VISUAL ENCODING &amp; DESIGN PROCESS Miriah Meyer School of Computing University of Utah 2

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

Fall Seminar Seed Sampling & Labeling Larry Nees Seed Administrator Office of INDIANA

Sequence Labeling & Syntax CMSC 470 Marine Carpuat Recap: We know how to perform POS

VISUAL ENCODING & DESIGN PROCESS Miriah Meyer School of Computing University of Utah 2