SLIDE 1

CMU CS11-737: Multilingual NLP

Text Classification and Sequence Labeling

Graham Neubig

SLIDE 2

Text Classification

  • Given an input text X, predict an output label y

Examples:

  • Topic Classification: "I like peaches and pears" → food / politics / music / ...
  • Language Identification: "桃と梨が好き" ("I like peaches and pears") → English / Japanese / German / ...
  • Sentiment Analysis (sentence/document-level): "I like peaches and pears" → positive / neutral / negative; "I hate peaches and pears" → positive / neutral / negative
  • ... and many many more!

SLIDE 3

Sequence Labeling

  • Given an input text X, predict an output label sequence Y of equal length!

Examples, for the input "He saw two birds":

  • Part of Speech Tagging: He/PRON saw/VERB two/NUM birds/NOUN
  • Morphological Tagging: He/PronType=Prs saw/Tense=Past,VerbForm=Fin two/NumType=Card birds/Number=Plur
  • Lemmatization: he see two bird
  • ... and more!

SLIDE 4

Span Labeling

  • Given an input text X, predict output spans and their labels Y

Examples, for "Graham Neubig is teaching at Carnegie Mellon University":

  • Named Entity Recognition: "Graham Neubig" → PER, "Carnegie Mellon University" → ORG
  • Syntactic Chunking: "Graham Neubig" → NP, "is teaching" → VP, "Carnegie Mellon University" → NP
  • Semantic Role Labeling: "Graham Neubig" → Actor, "is teaching" → Predicate, "at Carnegie Mellon University" → Location
  • ... and more!

SLIDE 5

Span Labeling as Sequence Labeling

  • Predict Beginning (B), In (I), and Out (O) tags for each word: B and I mark words inside a labeled span, O marks words outside any span

Graham Neubig is teaching at Carnegie Mellon University
(PER: "Graham Neubig"; ORG: "Carnegie Mellon University")

Graham/B-PER Neubig/I-PER is/O teaching/O at/O Carnegie/B-ORG Mellon/I-ORG University/I-ORG
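Since the BIO tags follow mechanically from the spans, the conversion is easy to script. A minimal Python sketch (the helper name and span format are illustrative, not from the lecture):

```python
# Convert labeled spans to BIO tags (a sketch; span format is (start, end, label),
# with end exclusive over token indices).
def spans_to_bio(tokens, spans):
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"           # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"           # remaining tokens of the span
    return tags

tokens = "Graham Neubig is teaching at Carnegie Mellon University".split()
spans = [(0, 2, "PER"), (5, 8, "ORG")]
print(list(zip(tokens, spans_to_bio(tokens, spans))))
# [('Graham', 'B-PER'), ('Neubig', 'I-PER'), ('is', 'O'), ..., ('University', 'I-ORG')]
```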

SLIDE 6

Text Segmentation

  • Given an input text X, split it into segmented text Y.

  • Tokenization:
      A well-conceived "thought exercise."  →  A well - conceived " thought exercise . "

  • Word Segmentation is ambiguous; e.g. Japanese 外国人参政権 ("foreigners' voting rights"):
      外国 人 参政 権  (foreign / people / voting / rights)
      外国 人参 政権  (foreign / carrot / government)

  • Morphological Segmentation is also ambiguous; e.g. Turkish Köpekler:
      Köpek + ler  (dog, Number=Plural)
      Köpekle + r  (dog_paddle, Tense=Aorist)

  • Models: rule-based, or span labeling models

SLIDE 7

Modeling for Sequence Labeling/Classification

SLIDE 8

How do we Make Predictions?

  • Given an input text X
  • Extract features H
  • Predict labels Y

Text Classification:  I like peaches → Feature Extractor → Predict → positive

Sequence Labeling:  I like peaches → Feature Extractor → Predict per word → PRON VERB NOUN

SLIDE 9

A Simple Extractor: Bag of Words (BOW)

I like peaches → lookup + lookup + lookup → sum = h → Predict → Label Probs

Each word's vector is looked up, and the vectors are summed into a single feature vector h, which is fed to the predictor.
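Concretely, the extractor is just an embedding lookup followed by a sum. A minimal numpy sketch (the vocabulary and dimensions are made up for illustration):

```python
# Bag-of-words featurizer sketch: look up each word's embedding and sum them.
import numpy as np

vocab = {"<unk>": 0, "I": 1, "like": 2, "peaches": 3}
np.random.seed(0)
E = np.random.randn(len(vocab), 8)  # embedding matrix: |V| x d

def bow_features(words):
    ids = [vocab.get(w, vocab["<unk>"]) for w in words]
    return E[ids].sum(axis=0)  # sum of word vectors -> one d-dimensional vector h

h = bow_features("I like peaches".split())
print(h.shape)  # (8,)
```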

SLIDE 10

A Simple Predictor: Linear Transform + Softmax

p = softmax(W h + b)

  • Softmax converts arbitrary scores into probabilities:

      p_i = e^{s_i} / Σ_j e^{s_j}

  • Example from the slide: a score vector s (with entries such as 3.2, 2.9, 1.0, 2.2, 0.6, …) maps to a probability vector p = (0.002, 0.003, 0.329, 0.444, 0.090, …), with the largest score receiving the largest probability
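The formula is two lines of numpy; this sketch (illustrative, not the course's code) uses the standard max-subtraction trick for numerical stability:

```python
# Numerically stable softmax: subtracting max(s) cancels in the ratio,
# so the result is unchanged but exp() cannot overflow.
import numpy as np

def softmax(s):
    z = np.exp(s - np.max(s))
    return z / z.sum()

s = np.array([3.2, 2.9, 1.0, 2.2, 0.6])
print(softmax(s))        # largest score -> largest probability
print(softmax(s).sum())  # 1.0
```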
SLIDE 11

Problem: Language is not a Bag of Words!

"There's nothing I don't love about pears."
"I don't love pears."

Nearly the same bag of words, but opposite meanings: word order matters.

SLIDE 12

Better Featurizers

  • Bag of n-grams (see the sketch below)
  • Syntax-based features (e.g. subject-object pairs)
  • Neural networks:
    • Recurrent neural networks
    • Convolutional networks
    • Self attention
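To see why n-grams help with the word-order problem above, note that a bag of bigrams keeps "don't love" as a single feature. A small illustrative sketch:

```python
# Bag-of-n-grams feature extraction sketch (hypothetical, for illustration).
from collections import Counter

def ngram_features(words, n_max=2):
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            feats[" ".join(words[i:i + n])] += 1
    return feats

print(ngram_features("I don't love pears".split()))
# unigrams plus bigrams like "don't love", which capture some local word order
```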
SLIDE 13

What is a Neural Net?: Computation Graphs

SLIDE 14

“Neural” Nets

Original Motivation: Neurons in the Brain

Image credit: Wikipedia

Current Conception: Computation Graphs

[Graph: the running example y = xᵀAx + b·x + c, built from value nodes x, A, b, c and function nodes f(u) = uᵀ, f(U, V) = UV, f(M, v) = Mv, f(u, v) = u·v, f(x₁, x₂, x₃) = Σᵢ xᵢ]

SLIDE 15

expression: y = xᵀAx + b·x + c
graph: x

A node is a {tensor, matrix, vector, scalar} value.

SLIDE 16

expression: y = xᵀAx + b·x + c
graph: x feeding the node f(u) = uᵀ

An edge represents a function argument. A node with an incoming edge is a function of that edge's tail node. A node knows how to compute its value, and the value of its derivative w.r.t. each argument (edge) times a derivative of an arbitrary input ∂F/∂f(u):

    (∂f(u)/∂u) · (∂F/∂f(u)) = (∂F/∂f(u))ᵀ

SLIDE 17

expression: y = xᵀAx + b·x + c
graph: x feeding f(u) = uᵀ; a second value node A feeding f(U, V) = UV

Functions can be nullary, unary, binary, … n-ary. Often they are unary or binary.

SLIDE 18

expression: y = xᵀAx + b·x + c
graph: x, A; function nodes f(u) = uᵀ, f(U, V) = UV, f(M, v) = Mv

Computation graphs are generally directed and acyclic.

SLIDE 19

expression: y = xᵀAx + b·x + c
graph: x, A; function nodes f(u) = uᵀ, f(U, V) = UV, f(M, v) = Mv

The composition of these nodes computes f(x, A) = xᵀAx, with derivatives

    ∂f(x, A)/∂A = xxᵀ        ∂f(x, A)/∂x = (Aᵀ + A)x
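These closed-form derivatives are easy to sanity-check numerically; here is a small sketch using central finite differences (illustrative, not from the lecture):

```python
# Numerically verify df/dx = (A^T + A)x for f(x, A) = x^T A x.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

f = lambda x: x @ A @ x            # x^T A x
analytic = (A.T + A) @ x           # the slide's closed-form gradient

eps = 1e-6
numeric = np.array([
    (f(x + eps * np.eye(3)[i]) - f(x - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```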

SLIDE 20

expression: y = xᵀAx + b·x + c
graph: x, A, b, c; function nodes f(u) = uᵀ, f(U, V) = UV, f(M, v) = Mv, f(u, v) = u·v, f(x₁, x₂, x₃) = Σᵢ xᵢ

SLIDE 21

expression: y = xᵀAx + b·x + c
graph: the complete graph, with the sum node's output labeled y

Variable names are just labelings of nodes.

SLIDE 22

Algorithms (1)

  • Graph construction
  • Forward propagation:
    • In topological order, compute the value of each node given its inputs (see the sketch below)
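A toy sketch of this algorithm on the running example y = xᵀAx + b·x + c (the Node class is hypothetical, not any real framework's API):

```python
# Forward propagation over a tiny computation graph in topological order.
import numpy as np

class Node:
    def __init__(self, fn=None, parents=(), value=None):
        self.fn, self.parents, self.value = fn, parents, value

    def forward(self):
        if self.fn is not None:  # interior node: compute from parent values
            self.value = self.fn(*[p.value for p in self.parents])
        return self.value

rng = np.random.default_rng(0)
x = Node(value=rng.standard_normal(3))
A = Node(value=rng.standard_normal((3, 3)))
b = Node(value=rng.standard_normal(3))
c = Node(value=rng.standard_normal())

xAx = Node(lambda x, A: x @ A @ x, (x, A))       # x^T A x
bx = Node(lambda b, x: b @ x, (b, x))            # b . x
y = Node(lambda s1, s2, s3: s1 + s2 + s3, (xAx, bx, c))

for node in [x, A, b, c, xAx, bx, y]:            # topological order
    node.forward()
print(y.value)
```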

SLIDES 23-30

Forward Propagation

[Animation: the graph for y = xᵀAx + b·x + c is evaluated in topological order, each node computing its value from its inputs. The intermediate values appear in sequence: xᵀ, then xᵀA, then b·x, then xᵀAx, and finally xᵀAx + b·x + c.]

SLIDE 31

Algorithms (2)

  • Back-propagation:
    • Process examples in reverse topological order
    • Calculate the derivatives of the parameters with respect to the final value (this is usually a "loss function", a value we want to minimize)
  • Parameter update:
    • Move the parameters in the direction of this derivative: W -= α * dl/dW (see the sketch below)
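Autograd frameworks automate exactly these steps. A hedged PyTorch sketch on the running example, treating y as the value to differentiate:

```python
# Back-propagation and a gradient-descent step with PyTorch autograd.
import torch

torch.manual_seed(0)
x = torch.randn(3)
A = torch.randn(3, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
c = torch.randn((), requires_grad=True)

y = x @ A @ x + b @ x + c  # forward propagation builds the graph
y.backward()               # back-propagation in reverse topological order

alpha = 0.1
with torch.no_grad():      # parameter update: W -= alpha * dl/dW
    A -= alpha * A.grad
    b -= alpha * b.grad
    c -= alpha * c.grad
```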

SLIDE 32

Back Propagation

[The same graph for y = xᵀAx + b·x + c, now traversed in reverse topological order to accumulate derivatives at each node.]

SLIDE 33

Neural Network Frameworks

Examples in this class

SLIDE 34

Basic Process in (Dynamic) Neural Network Frameworks

  • Create a model
  • For each example:
    • create a graph that represents the computation you want
    • calculate the result of that computation
    • if training, perform back propagation and update (see the loop sketch below)
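Put together, the per-example process looks roughly like this PyTorch sketch (the model, sizes, and data are illustrative placeholders, not the course's code):

```python
import torch
import torch.nn as nn

# Create a model: a toy BOW-style classifier
model = nn.Sequential(nn.EmbeddingBag(1000, 64), nn.Linear(64, 3))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

data = [(torch.tensor([[1, 5, 42]]), torch.tensor([0]))]  # one toy (word ids, label) pair

for word_ids, label in data:
    logits = model(word_ids)       # create the graph for this example's computation
    loss = loss_fn(logits, label)  # calculate the result of that computation
    optimizer.zero_grad()
    loss.backward()                # if training: back propagation...
    optimizer.step()               # ...and parameter update
```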

SLIDE 35

Recurrent Neural Networks

SLIDE 36

Long-distance Dependencies in Language

  • Agreement in number, gender, etc.:
      "He does not have very much confidence in himself."
      "She does not have very much confidence in herself."
  • Selectional preference:
      "The reign has lasted as long as the life of the queen."
      "The rain has lasted as long as the life of the clouds."

SLIDE 37

Recurrent Neural Networks

(Elman 1990)

Feed-forward NN: lookup → transform → context
Recurrent NN: lookup → transform → context, with the context fed back into the next step's transform

  • Tools to "remember" information
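One common parameterization of this recurrence (the slide does not pin down the exact form) is h_t = tanh(W_x x_t + W_h h_{t-1} + b); a minimal numpy sketch:

```python
# Elman-style RNN step: h_t = tanh(W_x x_t + W_h h_{t-1} + b).
import numpy as np

d_in, d_hid = 8, 16
rng = np.random.default_rng(0)
W_x = rng.standard_normal((d_hid, d_in)) * 0.1
W_h = rng.standard_normal((d_hid, d_hid)) * 0.1
b = np.zeros(d_hid)

def rnn_step(x_t, h_prev):
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(d_hid)
for x_t in rng.standard_normal((4, d_in)):  # e.g. the four words of "I like these pears"
    h = rnn_step(x_t, h)  # h carries ("remembers") information across time steps
print(h.shape)  # (16,)
```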
SLIDE 38

Unrolling in Time

  • What does featurizing a sequence look like?

I like these pears → RNN → RNN → RNN → RNN (one step per word, each passing its state to the next)

SLIDE 39

Representing Sentences

I like these pears → RNN → RNN → RNN → RNN → predict → prediction

  • Text Classification
  • Conditioned Generation
  • Retrieval
SLIDE 40

Representing Words

I like these pears → RNN → RNN → RNN → RNN, with a label predicted at every position

  • Sequence Labeling
  • Language Modeling
  • Calculating Representations for Parsing, etc.
SLIDE 41

Training RNNs

I like these pears → RNN → RNN → RNN → RNN; predictions 1-4 are compared with labels 1-4 to give losses 1-4, which are summed into the total loss

SLIDE 42

RNN Training

  • The unrolled graph is a well-formed (DAG) computation graph, so we can run backprop
  • Parameters are tied across time; derivatives are aggregated across all time steps
  • This is historically called "backpropagation through time" (BPTT)

SLIDE 43

Parameter Tying

[Same unrolled training graph as before: predictions 1-4 vs. labels 1-4 give losses 1-4, summed into the total loss.]

Parameters are shared! Derivatives are accumulated.

SLIDE 44

Bi-RNNs

  • A simple extension: run the RNN in both directions

I like these pears → a forward RNN and a backward RNN run over the sentence; at each position their states are concatenated and fed to a softmax, giving PRON VERB DET NOUN
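One compact way to realize this is PyTorch's built-in bidirectional LSTM; the sizes and word ids below are illustrative placeholders:

```python
# Bi-RNN tagger sketch: concatenated forward/backward states -> per-position tag scores.
import torch
import torch.nn as nn

vocab_size, n_tags, d_emb, d_hid = 1000, 17, 64, 128
emb = nn.Embedding(vocab_size, d_emb)
birnn = nn.LSTM(d_emb, d_hid, batch_first=True, bidirectional=True)
out = nn.Linear(2 * d_hid, n_tags)  # concat of forward+backward states

word_ids = torch.tensor([[4, 8, 15, 16]])  # "I like these pears" (toy ids)
h, _ = birnn(emb(word_ids))                # (1, 4, 2*d_hid)
tag_logits = out(h)                        # (1, 4, n_tags); softmax per position
print(tag_logits.argmax(-1))               # predicted tag id at each position
```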

SLIDE 45

Multilingual Labeling/Classification Data and Models

SLIDE 46

Language Identification

  • Benchmark on 1152 languages from a variety of free sources

LTI Language Identification Corpus http://www.cs.cmu.edu/~ralf/langid.html langid.py https://github.com/saffsd/langid.py

  • Off-the-shelf language ID system for 90+ languages

https://arxiv.org/pdf/1804.08186.pdf Automatic Language Identification in Texts: A Survey

SLIDE 47

Text Classification

  • Very broad field, many different datasets

MLDoc: A Corpus for Multilingual Document Classification in Eight Languages: https://github.com/facebookresearch/MLDoc
  • Topic classification, eight languages

Cross-lingual Natural Language Inference (XNLI) corpus: https://cims.nyu.edu/~sbowman/xnli/
  • Textual entailment prediction (sentence pair classification)

Cross-lingual Sentiment Classification: https://github.com/ccsasuke/adan
  • Chinese-English cross-lingual sentiment dataset

PAWS-X: Paraphrase Adversaries from Word Scrambling, Cross-lingual Version: https://github.com/google-research-datasets/paws/tree/master/pawsx
  • Paraphrase detection (sentence pair classification)
SLIDE 48

Part of Speech / Morphological Tagging

Universal Dependencies: https://universaldependencies.org/

  • Part of the Universal Dependencies treebank
  • Contains parts of speech and morphological features for 90 languages
  • Standardized "Universal POS" and "Universal Morphology" tag sets make things consistent
  • Several pre-trained models on these datasets:
    • Udify: https://github.com/Hyperparticle/udify
    • Stanza: https://stanfordnlp.github.io/stanza/

SLIDE 49

Named Entity Recognition

CoNLL 2002/2003 Language Independent Named Entity Recognition ("gold standard"):
https://www.clips.uantwerpen.be/conll2002/ner/ and https://www.clips.uantwerpen.be/conll2003/ner/
  • English, German, Spanish, Dutch human annotated data

WikiAnn Entity Recognition/Linking in 282 Languages ("silver standard"): https://www.aclweb.org/anthology/P17-1178/
  • Data automatically extracted from Wikipedia using inter-page links
  • Available from: https://github.com/google-research/xtreme

SLIDE 50

Composite Benchmarks

  • Benchmarks that aggregate many different sequence labeling/classification tasks

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization: https://github.com/google-research/xtreme
  • 10 different tasks, 40 different languages

XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation: https://microsoft.github.io/XGLUE/
  • 11 tasks over 19 languages (including generation)
SLIDE 51

Discussion Exercise

SLIDE 52

Universal Dependencies Comparison

  • Download data from the Universal Dependencies Treebank (https://universaldependencies.org/) for one language you know, and one language that you do not know and that is very different
  • Look at the part-of-speech and morphological tags, and think about what you can tell from them
  • For example: is the word order different between the languages? Does one language have richer morphology? What else can you see?

SLIDE 53

Thank You!