CMU CS11-737: Multilingual NLP
Text Classification and Sequence Labeling
Graham Neubig
Text Classification
Given an input text X, predict an output label y
Language Identification: 桃と梨が好き (Japanese: "I like peaches and pears") → English / Japanese / German / ...
Topic Classification: I like peaches and pears → food / politics / music / ...
Sentiment Analysis (sentence/document-level): I like peaches and pears → positive; I hate peaches and pears → negative (labels: positive / neutral / negative)
... and many many more!
Part-of-Speech Tagging: He/PRON saw/VERB two/NUM birds/NOUN
Morphological Tagging: He/PronType=Prs saw/Tense=Past,VerbForm=Fin two/NumType=Card birds/Number=Plur
Lemmatization: He/he saw/see two/two birds/bird
... and more!
Named Entity Recognition: [Graham Neubig]PER is teaching at [Carnegie Mellon University]ORG
Syntactic Chunking: [Graham Neubig]NP [is teaching]VP at [Carnegie Mellon University]NP
Semantic Role Labeling: [Graham Neubig]Actor [is teaching]Predicate at [Carnegie Mellon University]Location
... and more!

Span labeling can be reduced to sequence labeling with BIO tags:
Graham/B-PER Neubig/I-PER is/O teaching/O at/O Carnegie/B-ORG Mellon/I-ORG University/I-ORG
Tokenization: A well-conceived "thought exercise." → A well - conceived " thought exercise . "

Word Segmentation (e.g. Japanese): 外国人参政権 ("voting rights for foreigners")
  外国 人 参政 権 → foreign / people / voting / rights (correct)
  外国 人参 政権 → foreign / carrot / government (wrong)

Morphological Segmentation (e.g. Turkish): Köpekler
  Köpek + ler → dog + Number=Plural (correct)
  Köpekle + r → dog_paddle + Tense=Aorist (wrong)
Text Classification: I like peaches → Feature Extractor → Predict → positive
Sequence Labeling: I like peaches → Feature Extractor → Predict (once per word) → I/PRON like/VERB peaches/NOUN
A simple feature extractor, bag of words:
I like peaches → lookup(I) + lookup(like) + lookup(peaches) = h

Predict label probabilities from the summed representation:
s = W * h + b              e.g. s = [1.0, 2.2, 0.6, ...]
p = softmax( W * h + b )   e.g. p = [0.002, 0.003, 0.329, 0.444, 0.090, ...]
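This bag-of-words classifier can be sketched in a few lines of plain Python. This is a minimal illustration, not a trained model: the vocabulary, embedding size, label set, and random initialization are all placeholder assumptions.

```python
import math, random

random.seed(0)
vocab = ["I", "like", "peaches"]
labels = ["positive", "neutral", "negative"]
dim = 4  # embedding size (arbitrary)

# lookup table: one embedding vector per word
emb = {w: [random.uniform(-1, 1) for _ in range(dim)] for w in vocab}
# linear layer for s = W h + b: one row of W per label
W = [[random.uniform(-1, 1) for _ in range(dim)] for _ in labels]
b = [0.0 for _ in labels]

def softmax(s):
    m = max(s)                      # subtract max for numerical stability
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [v / z for v in e]

def classify(words):
    # bag of words: sum the embedding of each word into h
    h = [0.0] * dim
    for w in words:
        for i, v in enumerate(emb[w]):
            h[i] += v
    # s = W h + b, then p = softmax(s)
    s = [sum(W[j][i] * h[i] for i in range(dim)) + b[j] for j in range(len(labels))]
    return softmax(s)

p = classify(["I", "like", "peaches"])
print(dict(zip(labels, p)))
```

With random weights the predicted distribution is meaningless, but the shape of the computation (lookup, sum, linear layer, softmax) is exactly the one on the slide.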
where softmax(s)_j = e^{s_j} / Σ_{j'} e^{s_{j'}}

Original Motivation: Neurons in the Brain
Image credit: Wikipedia
Current Conception: Computation Graphs

expression: y = x^T A x + b · x + c

A node is a {tensor, matrix, vector, scalar} value, e.g. x, A, b, c.
An edge represents a function argument: a node with an incoming edge is a function of that edge's tail node, e.g.
  f(u) = u^T
  f(U, V) = UV
  f(M, v) = Mv
  f(u, v) = u · v
  f(x1, x2, x3) = Σ_i x_i
A node knows how to compute its value, and the value of its derivative w.r.t. each argument (edge) times the derivative of an arbitrary input F:
  (∂f(u)/∂u) (∂F/∂f(u)),  e.g. for f(u) = u^T this is just (∂F/∂f(u))^T
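These definitions can be made concrete with a toy scalar node class. This is not any particular toolkit's API; every name here is made up for illustration. Each node stores its value, its argument edges, and the local derivative w.r.t. each argument, which backward() multiplies by the incoming derivative dF/d(node) per the chain rule.

```python
class Node:
    """A scalar node in a computation graph (toy sketch, not a real toolkit)."""
    def __init__(self, value, args=(), local_grads=()):
        self.value = value               # the node's computed value
        self.args = args                 # tail nodes of incoming edges (function arguments)
        self.local_grads = local_grads   # d(this node)/d(arg) for each argument
        self.grad = 0.0                  # dF/d(this node), filled in by backward()

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other), (other.value, self.value))

def backward(out):
    # visit in reverse topological order, accumulating dF/d(node) into each node
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for a in n.args:
                visit(a)
            order.append(n)
    visit(out)
    out.grad = 1.0
    for n in reversed(order):
        for a, g in zip(n.args, n.local_grads):
            a.grad += n.grad * g         # chain rule: dF/da += dF/dn * dn/da

# y = a*x*x + b*x + c, a scalar analogue of x^T A x + b.x + c
x, a, b, c = Node(3.0), Node(2.0), Node(-1.0), Node(5.0)
y = a * x * x + b * x + c
backward(y)
print(y.value, x.grad)   # y = 2*9 - 3 + 5 = 20, dy/dx = 2*a*x + b = 11
```

Note that x feeds several nodes, so its derivative is accumulated with +=, the same accumulation rule the real frameworks use.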
Functions can be nullary, unary, binary, … n-ary; often they are unary or binary.
Computation graphs are generally directed and acyclic.
For example, the node f(x, A) = x^T A x knows its local derivatives:
  ∂f(x, A)/∂A = x x^T
  ∂f(x, A)/∂x = (A^T + A) x
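Local derivatives like these can always be sanity-checked numerically. A quick finite-difference check of ∂(x^T A x)/∂x = (A^T + A)x in plain Python, with an arbitrary small A and x chosen just for the check:

```python
def quad(x, A):
    # f(x, A) = x^T A x
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

A = [[1.0, 2.0], [3.0, 4.0]]
x = [0.5, -1.5]

# analytic gradient: (A^T + A) x
grad = [sum((A[j][i] + A[i][j]) * x[j] for j in range(2)) for i in range(2)]

# central finite-difference gradient
eps = 1e-6
num = []
for i in range(2):
    xp = list(x); xp[i] += eps
    xm = list(x); xm[i] -= eps
    num.append((quad(xp, A) - quad(xm, A)) / (2 * eps))

assert all(abs(g - n_) < 1e-5 for g, n_ in zip(grad, num))
print(grad)
```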
Assembling these nodes gives the full graph for y = x^T A x + b · x + c; variable names like y are just labelings of nodes.
Algorithm: Forward Propagation
Process the nodes in topological order, computing the value of each node given its inputs.
For y = x^T A x + b · x + c, forward propagation builds up the intermediate values step by step:
  x^T → x^T A → x^T A x,  b · x,  then  x^T A x + b · x + c = y
Algorithm: Back-Propagation
Compute the derivative of each node with respect to the final value (this is usually a "loss function", a value we want to minimize).
Then update each parameter using its derivative: W -= α * dl/dW

Examples in this class
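Putting the pieces together, the update rule W -= α * dl/dW can be demonstrated by minimizing y = x^T A x + b · x + c over x with plain gradient descent. This is a toy sketch: A is chosen positive definite so a minimum exists, and the step size alpha is arbitrary.

```python
# minimize y = x^T A x + b.x + c over x, using dy/dx = (A^T + A) x + b
A = [[2.0, 0.0], [0.0, 1.0]]   # positive definite, so y has a unique minimum
b = [1.0, -1.0]
c = 3.0
x = [2.0, 2.0]                  # initial guess
alpha = 0.1                     # learning rate

def y(x):
    quad = sum(x[i] * A[i][j] * x[j] for i in range(2) for j in range(2))
    return quad + sum(b[i] * x[i] for i in range(2)) + c

losses = [y(x)]
for _ in range(50):
    # analytic gradient from the local-derivative rule above
    grad = [sum((A[j][i] + A[i][j]) * x[j] for j in range(2)) + b[i] for i in range(2)]
    x = [x[i] - alpha * grad[i] for i in range(2)]   # W -= alpha * dl/dW
    losses.append(y(x))

print(x, losses[-1])
```

The loss decreases monotonically toward the analytic minimizer x* = -(A + A^T)^{-1} b = (-0.25, 0.5).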
He does not have very much confidence in himself.
She does not have very much confidence in herself.
The reign has lasted as long as the life of the queen.
The rain has lasted as long as the life of the clouds.
Feed-forward NN: lookup → transform → context
Recurrent NN: lookup → transform → context, where the context is fed back into the next time step
Sentence classification with an RNN:
I like these pears → RNN → RNN → RNN → RNN → predict → prediction
(read the whole sentence, then predict once at the end)

Sequence labeling with an RNN:
I like these pears → RNN at each word → predict at each step → prediction 1 … prediction 4

Training: prediction i is compared against label i to give loss i; the per-step losses are summed into a total loss.
The unrolled network is still a computation graph, so we can run backprop as usual, with the loss aggregated across all time steps; this is called "back-propagation through time" (BPTT).
Parameters are shared! Derivatives are accumulated.
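A minimal sketch of this training setup in plain Python. The dimensions, gold tags, and random initialization are placeholder assumptions, and the tagger is untrained; the point is only that the same parameters are used at every step and the per-step losses sum into one total loss.

```python
import math, random
random.seed(0)

dim, n_tags = 3, 4                 # hidden size and tag set size (arbitrary)
sentence = ["I", "like", "these", "pears"]
gold = [0, 1, 2, 3]                # gold tag indices, e.g. PRON VERB DET NOUN

emb = {w: [random.uniform(-1, 1) for _ in range(dim)] for w in sentence}
# the SAME parameters are reused at every time step
W_h = [[random.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(dim)]
W_o = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n_tags)]

def rnn_step(x, h):
    inp = x + h                    # concatenate input and previous hidden state
    return [math.tanh(sum(W_h[i][j] * inp[j] for j in range(2 * dim)))
            for i in range(dim)]

def softmax(s):
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [v / z for v in e]

h = [0.0] * dim
step_losses = []
for w, t in zip(sentence, gold):
    h = rnn_step(emb[w], h)        # recurrent update carries context forward
    p = softmax([sum(W_o[k][i] * h[i] for i in range(dim)) for k in range(n_tags)])
    step_losses.append(-math.log(p[t]))   # cross-entropy loss at this step

total_loss = sum(step_losses)      # one scalar; backprop through it = BPTT
print(step_losses, total_loss)
```

Because total_loss is a single node of the unrolled graph, back-propagating through it would accumulate each shared parameter's derivative across all four time steps.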
Bidirectional RNN for tagging: run one RNN left-to-right and another right-to-left over the sentence, concatenate the two hidden states at each word, and predict each tag with a softmax:
I like these pears → concat → softmax → PRON VERB DET NOUN
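The bidirectional version runs the same kind of recurrence in both directions and concatenates the states. A sketch in the same plain-Python style, with all sizes and initializations again arbitrary:

```python
import math, random
random.seed(1)

dim = 3
sentence = ["I", "like", "these", "pears"]
emb = {w: [random.uniform(-1, 1) for _ in range(dim)] for w in sentence}
W_f = [[random.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(dim)]  # forward RNN
W_b = [[random.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(dim)]  # backward RNN

def step(W, x, h):
    inp = x + h
    return [math.tanh(sum(W[i][j] * inp[j] for j in range(2 * dim)))
            for i in range(dim)]

# left-to-right pass
h, fwd = [0.0] * dim, []
for w in sentence:
    h = step(W_f, emb[w], h)
    fwd.append(h)

# right-to-left pass
h, bwd = [0.0] * dim, []
for w in reversed(sentence):
    h = step(W_b, emb[w], h)
    bwd.append(h)
bwd.reverse()                      # align with left-to-right word order

# each word's feature vector: concatenation of the two directions
feats = [f + b for f, b in zip(fwd, bwd)]
print(len(feats), len(feats[0]))
```

Each word now has a 2*dim feature vector that summarizes context on both sides, which is what the per-word softmax layer consumes.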
LTI Language Identification Corpus: http://www.cs.cmu.edu/~ralf/langid.html
langid.py: https://github.com/saffsd/langid.py
Automatic Language Identification in Texts: A Survey: https://arxiv.org/pdf/1804.08186.pdf
MLDoc: A Corpus for Multilingual Document Classification in Eight Languages: https://github.com/facebookresearch/MLDoc
Cross-lingual Natural Language Inference (XNLI) corpus: https://cims.nyu.edu/~sbowman/xnli/
Cross-lingual Sentiment Classification: available from https://github.com/ccsasuke/adan
PAWS-X: Paraphrase Adversaries from Word Scrambling, Cross-lingual Version: https://github.com/google-research-datasets/paws/tree/master/pawsx
Universal Dependencies: treebanks with morphological features for 90 languages; shared "morphology" tag sets make things consistent across languages.
https://universaldependencies.org/
CoNLL 2002/2003 Language Independent Named Entity Recognition ("gold standard" annotation):
https://www.clips.uantwerpen.be/conll2002/ner/
https://www.clips.uantwerpen.be/conll2003/ner/
WikiAnn: Entity Recognition/Linking in 282 Languages ("silver standard", automatically created): https://www.aclweb.org/anthology/P17-1178/
Available from: https://github.com/google-research/xtreme
Benchmarks covering many sequence labeling/classification tasks:
XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization: https://github.com/google-research/xtreme
XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation: https://microsoft.github.io/XGLUE/
Discussion: pick a treebank for one language you know, and one language that you do not know and that is very different (https://universaldependencies.org/), and think about what you can tell from them. How do the annotations differ between the languages? Does one language have richer morphology? What else can you see?