Text Classification and Sequence Labeling Graham Neubig Text - PowerPoint PPT Presentation

CMU CS11-737: Multilingual NLP Text Classification and Sequence Labeling Graham Neubig

桃と梨が好き Text Classification • Given an input text X , predict an output label y Topic Classification food food politics politics I like peaches and pears I like peaches and herb music music ... ... Language Identification English English Japanese Japanese I like peaches and pears German German ... ... Sentiment Analysis (sentence/document-level) positive positive neutral neutral I like peaches and pears I hate peaches and pears negative negative ... and many many more!

Sequence Labeling • Given an input text X , predict an output label sequence Y of equal length! Part of Speech Tagging Lemmatization He saw two birds He saw two birds PRON VERB NUM NOUN see two bird he Morphological Tagging He saw two birds PronType=prs Tense=past, NumType=card Number=plur VerbForm=fin ... and more!

Span Labeling • Given an input text X , predict an output spans and labels Y. Named Entity Recognition Graham Neubig is teaching at Carnegie Mellon University PER ORG Syntactic Chunking Graham Neubig is teaching at Carnegie Mellon University NP VP NP Semantic Role Labeling Graham Neubig is teaching at Carnegie Mellon University Actor Predicate Location ... and more!

Span Labeling as Sequence Labeling • Predict B eginning, I n, and O ut tags for each word in a span Graham Neubig is teaching at Carnegie Mellon University PER ORG Graham Neubig is teaching at Carnegie Mellon University B-PER I-PER O O O B-ORG I-ORG I-ORG

外国⼈参政権 Text Segmentation • Given an input text X , split it into segmented text Y. Tokenization A well-conceived "thought exercise." A well - conceived " thought exercise . " Word Segmentation 外国⼈参政権外国⼈参政権 foreign people voting rights foreign carrot government Morphological Segmentation Köpekler Köpek ler Köpekle r dog_paddle Tense=Aorist dog Number=Plural • Rule-based, or span labeling models

Modeling for Sequence Labeling/Classification

How do we Make Predictions? • Given an input text X • Extract features H • Predict labels Y Text Classification Sequence Labeling I like peaches I like peaches Feature Extractor Feature Extractor Predict Predict Predict Predict positive PRON VERB NOUN

A Simple Extractor: Bag of Words (BOW) I like peaches lookup lookup lookup Label Probs + + = Predict

<latexit sha1_base64="EzGUA2UljByz+/VvpPYifZTvI=">ACDHicbVC7TsMwFHXKq5RXgJHFoiAxVQlCAgakChbGIhFaqQmR4zqtW8eJbAepivIDLPwKCwMgVj6Ajb/BTNAy5GudHzOvfK9J0gYlcqyvo3KwuLS8kp1tba2vrG5ZW7v3Mk4FZg4OGax6ARIEkY5cRVjHQSQVAUMNIORlcTv/1AhKQxv1XjhHgR6nMaUoyUlnzIPEpvIBuKBDOyH0mfZrnmSvTyB/C4j3Mc9+sWw2rAJwndknqoETLN7/cXozTiHCFGZKya1uJ8jIkFMWM5DU3lSRBeIT6pKspRxGRXlZck8NDrfRgGAtdXMFC/T2RoUjKcRTozgipgZz1JuJ/XjdV4ZmXUZ6kinA8/ShMGVQxnEQDe1QrNhYE4QF1btCPEA6GKUDrOkQ7NmT54lz3Dhv2Dcn9eZlmUYV7IF9cARscAqa4Bq0gAMweATP4BW8GU/Gi/FufExbK0Y5swv+wPj8AXdWnAs=</latexit> <latexit sha1_base64="EzGUA2UljByz+/VvpPYifZTvI=">ACDHicbVC7TsMwFHXKq5RXgJHFoiAxVQlCAgakChbGIhFaqQmR4zqtW8eJbAepivIDLPwKCwMgVj6Ajb/BTNAy5GudHzOvfK9J0gYlcqyvo3KwuLS8kp1tba2vrG5ZW7v3Mk4FZg4OGax6ARIEkY5cRVjHQSQVAUMNIORlcTv/1AhKQxv1XjhHgR6nMaUoyUlnzIPEpvIBuKBDOyH0mfZrnmSvTyB/C4j3Mc9+sWw2rAJwndknqoETLN7/cXozTiHCFGZKya1uJ8jIkFMWM5DU3lSRBeIT6pKspRxGRXlZck8NDrfRgGAtdXMFC/T2RoUjKcRTozgipgZz1JuJ/XjdV4ZmXUZ6kinA8/ShMGVQxnEQDe1QrNhYE4QF1btCPEA6GKUDrOkQ7NmT54lz3Dhv2Dcn9eZlmUYV7IF9cARscAqa4Bq0gAMweATP4BW8GU/Gi/FufExbK0Y5swv+wPj8AXdWnAs=</latexit> <latexit sha1_base64="EzGUA2UljByz+/VvpPYifZTvI=">ACDHicbVC7TsMwFHXKq5RXgJHFoiAxVQlCAgakChbGIhFaqQmR4zqtW8eJbAepivIDLPwKCwMgVj6Ajb/BTNAy5GudHzOvfK9J0gYlcqyvo3KwuLS8kp1tba2vrG5ZW7v3Mk4FZg4OGax6ARIEkY5cRVjHQSQVAUMNIORlcTv/1AhKQxv1XjhHgR6nMaUoyUlnzIPEpvIBuKBDOyH0mfZrnmSvTyB/C4j3Mc9+sWw2rAJwndknqoETLN7/cXozTiHCFGZKya1uJ8jIkFMWM5DU3lSRBeIT6pKspRxGRXlZck8NDrfRgGAtdXMFC/T2RoUjKcRTozgipgZz1JuJ/XjdV4ZmXUZ6kinA8/ShMGVQxnEQDe1QrNhYE4QF1btCPEA6GKUDrOkQ7NmT54lz3Dhv2Dcn9eZlmUYV7IF9cARscAqa4Bq0gAMweATP4BW8GU/Gi/FufExbK0Y5swv+wPj8AXdWnAs=</latexit> A Simple Predictor: Linear Transform+Softmax p = softmax( W * h + b ) • Softmax converts arbitrary scores into probabilities -3.2 0.002 -2.9 e s i 0.003 1.0 p i = s= p= 0.329 P j e s j 2.2 0.444 0.6 0.090 … …

Problem: Language is not a Bag of Words! I don’t love pears There’s nothing I don’t love about pears

Better Featurizers • Bag of n-grams • Syntax-based features (e.g. subject-object pairs) • Neural networks • Recurrent neural networks • Convolutional networks • Self attention

What is a Neural Net?: Computation Graphs

“Neural” Nets Original Motivation: Neurons in the Brain Current Conception: Computation Graphs X f ( x 1 , x 2 , x 3 ) = x i i f ( M , v ) = Mv f ( U , V ) = UV f ( u , v ) = u · v f ( u ) = u > A b x c Image credit: Wikipedia

expression: y = x > Ax + b · x + c graph: A node is a {tensor, matrix, vector, scalar} value x

expression: An edge represents a function argument. y = x > Ax + b · x + c A node with an incoming edge is a function of that edge’s tail node. graph: A node knows how to compute its value and the value of its derivative w.r.t each argument (edge) ∂ F times a derivative of an arbitrary input . ∂ f ( u ) ✓ ∂ F f ( u ) = u > ◆ > ∂ f ( u ) ∂ F ∂ f ( u ) = ∂ f ( u ) ∂ u x

expression: y = x > Ax + b · x + c graph: Functions can be nullary, unary, binary, … n -ary. Often they are unary or binary. f ( U , V ) = UV f ( u ) = u > A x

expression: y = x > Ax + b · x + c graph: f ( M , v ) = Mv f ( U , V ) = UV f ( u ) = u > A x Computation graphs are generally directed and acyclic

expression: y = x > Ax + b · x + c graph: f ( x , A ) = x > Ax f ( M , v ) = Mv f ( U , V ) = UV A x f ( u ) = u > A ∂ f ( x , A ) = ( A > + A ) x ∂ x ∂ f ( x , A ) = xx > x ∂ A

expression: y = x > Ax + b · x + c graph: X f ( x 1 , x 2 , x 3 ) = x i i f ( M , v ) = Mv f ( U , V ) = UV f ( u , v ) = u · v f ( u ) = u > A b x c

expression: y = x > Ax + b · x + c graph: X f ( x 1 , x 2 , x 3 ) = x i i y f ( M , v ) = Mv f ( U , V ) = UV f ( u , v ) = u · v f ( u ) = u > A b x c variable names are just labelings of nodes.

Algorithms (1) • Graph construction • Forward propagation • In topological order, compute the value of the node given its inputs

Forward Propagation graph: X f ( x 1 , x 2 , x 3 ) = x i i f ( M , v ) = Mv f ( U , V ) = UV f ( u , v ) = u · v f ( u ) = u > A b x c

Forward Propagation graph: X f ( x 1 , x 2 , x 3 ) = x i i f ( M , v ) = Mv f ( U , V ) = UV f ( u , v ) = u · v f ( u ) = u > A x > b x c

Forward Propagation graph: X f ( x 1 , x 2 , x 3 ) = x i i f ( M , v ) = Mv f ( U , V ) = UV x > A f ( u , v ) = u · v f ( u ) = u > A x > b x c

Forward Propagation graph: X f ( x 1 , x 2 , x 3 ) = x i i f ( M , v ) = Mv f ( U , V ) = UV x > A f ( u , v ) = u · v f ( u ) = u > A x > b · x b x c

Forward Propagation graph: X f ( x 1 , x 2 , x 3 ) = x i i f ( M , v ) = Mv x > Ax f ( U , V ) = UV x > A f ( u , v ) = u · v f ( u ) = u > A x > b · x b x c

Forward Propagation graph: X f ( x 1 , x 2 , x 3 ) = x i i x > Ax + b · x + c f ( M , v ) = Mv x > Ax f ( U , V ) = UV x > A f ( u , v ) = u · v f ( u ) = u > A x > b · x b x c

Algorithms (2) • Back-propagation: • Process examples in reverse topological order • Calculate the derivatives of the parameters with respect to the final value (This is usually a “loss function”, a value we want to minimize) • Parameter update: • Move the parameters in the direction of this derivative W -= α * dl/dW

Back Propagation graph: X f ( x 1 , x 2 , x 3 ) = x i i f ( M , v ) = Mv f ( U , V ) = UV f ( u , v ) = u · v f ( u ) = u > A b x c

Neural Network Frameworks Examples in this class

Basic Process in (Dynamic) Neural Network Frameworks • Create a model • For each example • create a graph that represents the computation you want • calculate the result of that computation • if training, perform back propagation and update

Recurrent Neural Networks

Long-distance Dependencies in Language • Agreement in number, gender, etc. He does not have very much confidence in himself . She does not have very much confidence in herself . • Selectional preference The reign has lasted as long as the life of the queen . The rain has lasted as long as the life of the clouds .

Recurrent Neural Networks (Elman 1990) • Tools to “remember” information Feed-forward NN Recurrent NN context context lookup lookup transform transform

Unrolling in Time • What does featurizing a sequence look like? I like these pears RNN RNN RNN RNN

Text Classification and Sequence Labeling Graham Neubig Text - PowerPoint PPT Presentation

CMU CS11-737: Multilingual NLP Text Classification and Sequence Labeling Graham Neubig Text Classification Given an input text X , predict an output label y Topic Classification food food politics politics I like

Structured Perceptron CMSC 470 Marine Carpuat POS tagging Sequence labeling with the perceptron

10 slides that always work Simple text boxes (I) Sample text Sample text Sample text

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

POS tagging CMSC 723 / LING 723 / INST 725 Marine Carpuat POS tagging Sequence labeling with

Sequence to Sequence models: Connectionist Temporal Classification 1 Sequence-to-sequence

Background Sequence labeling MEMMs - ? HMMs you know, right? Structured

Introduction to Classification and Sequence Labeling Grzegorz Chrupa la Spoken Language

EMNLP | 2020 SeqMix: Augmenting Active Sequence Labeling via Sequence Mixup Rongzhi Zhang, Yue

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Connectionist Temporal Classification 5 March 2018 1

CONTENT TITLE Insert Subtitle Here Enter Text Here Enter Text Here Enter Text Here

Sequence Labeling Markov Models Many information extraction tasks can be formulated as

Conditional Random Fields Dietrich Klakow Overview Sequence Labeling Bayesian Networks

Web Information Retrieval Lecture 14 Text classification Sec. 13.1 Text Classification

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

CS133 Computational Geometry Voronoi Diagram Delaunay Triangulation 5/17/2018 1 Nearest

= = = f f BOB BOB meaning vectors of words not does like not like = Alice Bob Alice

The language of visual attributes Kristen Grauman Facebook AI Research University of Texas at

University of British Columbia CPSC 111, Intro to Computation 2009W2: Jan-Apr 2010 Tamara

OPEN SOURCE DESIGN Bernard Tyers / @twitter: bernardtyers / ei8fdb@ei8fdb.org

Word Order & Sentence Structure M&R 143153 ENG240Y Old English / Mon 18 Oct 2010

12. Dynamic Programming Memoization, Optimal Substructure, Overlapping Sub-Problems,

HOW MANY POTATOES ARE IN A MESH? Marc van Kreveld Maarten Lffler Pach Jnos 1 HOW MANY

Text Classification and Sequence Labeling Graham Neubig Text - PowerPoint PPT Presentation

CMU CS11-737: Multilingual NLP Text Classification and Sequence Labeling Graham Neubig Text Classification Given an input text X , predict an output label y Topic Classification food food politics politics I like

Structured Perceptron CMSC 470 Marine Carpuat POS tagging Sequence labeling with the perceptron

10 slides that always work Simple text boxes (I) Sample text Sample text Sample text

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

POS tagging CMSC 723 / LING 723 / INST 725 Marine Carpuat POS tagging Sequence labeling with

Sequence to Sequence models: Connectionist Temporal Classification 1 Sequence-to-sequence

Background Sequence labeling MEMMs - ? HMMs you know, right? Structured

Introduction to Classification and Sequence Labeling Grzegorz Chrupa la Spoken Language

EMNLP | 2020 SeqMix: Augmenting Active Sequence Labeling via Sequence Mixup Rongzhi Zhang, Yue

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Connectionist Temporal Classification 5 March 2018 1

CONTENT TITLE Insert Subtitle Here Enter Text Here Enter Text Here Enter Text Here

Sequence Labeling Markov Models Many information extraction tasks can be formulated as

Conditional Random Fields Dietrich Klakow Overview Sequence Labeling Bayesian Networks

Web Information Retrieval Lecture 14 Text classification Sec. 13.1 Text Classification

SEQUENCE ANALYSIS The term &quot; sequence analysis &quot; in biology implies subjecting a DNA or

CS133 Computational Geometry Voronoi Diagram Delaunay Triangulation 5/17/2018 1 Nearest

= = = f f BOB BOB meaning vectors of words not does like not like = Alice Bob Alice

The language of visual attributes Kristen Grauman Facebook AI Research University of Texas at

University of British Columbia CPSC 111, Intro to Computation 2009W2: Jan-Apr 2010 Tamara

OPEN SOURCE DESIGN Bernard Tyers / @twitter: bernardtyers / ei8fdb@ei8fdb.org

Word Order &amp; Sentence Structure M&amp;R 143153 ENG240Y Old English / Mon 18 Oct 2010

12. Dynamic Programming Memoization, Optimal Substructure, Overlapping Sub-Problems,

HOW MANY POTATOES ARE IN A MESH? Marc van Kreveld Maarten Lffler Pach Jnos 1 HOW MANY

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

Word Order & Sentence Structure M&R 143153 ENG240Y Old English / Mon 18 Oct 2010