(Even More) Language Modeling: Multi-Task Learning, and Building Blocks of Transformers
CMSC 473/673
Frank Ferraro
Outline
- Multi-Task Learning
- The Attention Mechanism
- Transformer Language Models as General Language Encoders
Multi-class Classification
Given input 𝑦, predict discrete label 𝑧
If 𝑧 ∈ {0, 1} (or 𝑧 ∈ {True, False}), then it is a binary classification task. If 𝑧 ∈ {0, 1, …, 𝐿 − 1} (for finite 𝐿), then it is a multi-class classification task.
Multi-label Classification
(Diagram: single-output vs. multi-output prediction)
Given input 𝑦, predict multiple discrete labels 𝑧 = (𝑧_1, …, 𝑧_𝑀).
If multiple 𝑧_𝑚 are predicted, then it is a multi-label classification task. Each 𝑧_𝑚 could be binary or multi-class.
Remember from Deck 5
Multi-Label vs. Multi-Task
- These can be considered the same thing, but often they're different
- "Task": a thing of interest to predict
- Multi-label classification often involves multiple labels for the same task
  – E.g., sentiment (a tweet could be both "HAPPY" and "EXCITED")
- Multi-task learning is for different "tasks," e.g.,
  – Task 1: Category of document (SPORTS, FINANCE, etc.)
  – Task 2: Sentiment of document
  – Task 3: Part-of-speech per token
  – Task 4: Syntactic parsing
  – …
Multi-Task Learning
Single-Task Learning: train a system to "do one thing" (make predictions for one task):
x → h → y
If you have multiple (T) tasks, then train multiple systems, with different encoders and different decoders:
x → h_1 → y_1,  x → h_2 → y_2,  …,  x → h_T → y_T
Multi-Task Learning: train a single system to "do multiple things" (make predictions for T different tasks):
x → h → (y_1, y_2, …, y_T)
Key idea/assumption: if the tasks are somehow related, can we leverage an ability to do task i well into an ability to do task j well?
Example: could features/embeddings useful for language modeling (task i) also be useful for part-of-speech tagging (task j)?
In the multi-task setup, the same encoder learns good, general features/embeddings, and the different decoders learn how to use those representations for each task.
General Multi-Task Training Procedure
Given:
- T different corpora 𝐷_1, …, 𝐷_T, one per task, where 𝐷_t = {(𝑦_1^t, 𝑧_1^t), …, (𝑦_{N_t}^t, 𝑧_{N_t}^t)}
- An encoder 𝐹 and T different decoders 𝐸_1, …, 𝐸_T
- These have weights (parameters) you need to learn
Until converged or done:
- 1. Select the next task t
- 2. Randomly sample an instance (𝑦_j^t, 𝑧_j^t) from 𝐷_t
- 3. Train the encoder 𝐹 and decoder 𝐸_t on (𝑦_j^t, 𝑧_j^t)
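A minimal PyTorch sketch of this loop, with a toy shared encoder, per-task linear decoders, and made-up stand-in corpora (all dimensions and task names are illustrative, not taken from any paper discussed here):

```python
# Minimal sketch: multi-task training with one shared encoder F and
# per-task decoders E_t (hypothetical tasks, sizes, and random data).
import random
import torch
import torch.nn as nn

feat_dim, hidden_dim = 64, 128
num_labels = {"pos": 17, "sentiment": 3}              # illustrative tasks

encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())   # F: shared
decoders = nn.ModuleDict({t: nn.Linear(hidden_dim, k)                 # E_t: per task
                          for t, k in num_labels.items()})
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoders.parameters()))

# D_t: stand-in corpora of (y, z) pairs -- random tensors here just so it runs.
corpora = {t: [(torch.randn(feat_dim), torch.randint(k, (1,)).item())
               for _ in range(100)] for t, k in num_labels.items()}

for step in range(1000):                 # "until converged or done"
    t = random.choice(list(corpora))     # 1. select the next task t
    y, z = random.choice(corpora[t])     # 2. randomly sample an instance from D_t
    logits = decoders[t](encoder(y))     # 3. train encoder F and decoder E_t
    loss = loss_fn(logits.unsqueeze(0), torch.tensor([z]))
    opt.zero_grad(); loss.backward(); opt.step()
```

The key design point the sketch illustrates: the encoder's parameters receive gradients from every task, while each decoder is updated only on its own task's examples.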
WARNING: Multi-task learning did not begin in 2008
Two Well-Known Instances of Multi-Task Learning in NLP
- Collobert and Weston (2008, ICML)
- BERT (Devlin et al., 2019, NAACL)
We'll return to BERT later.
Collobert and Weston (2008, ICML)
Core task: Semantic Role Labeling
Present a unified architecture for doing five other, related NLP tasks:
- Part-of-Speech Tagging
- Chunking
- Named Entity Recognition
- Language Modeling
- Prediction of Semantic Relatedness
Semantic Role Labeling (SRL)
- For each predicate (e.g., verb):
  1. find its arguments (e.g., NPs)
  2. determine their semantic roles
John drove Mary from Austin to Dallas in his Toyota Prius. The hammer broke the window.
– agent: Actor of an action
– patient: Entity affected by the action
– source: Origin of the affected entity
– destination: Destination of the affected entity
– instrument: Tool used in performing the action
– beneficiary: Entity for whom the action is performed
Slide thanks to Ray Mooney (modified)
Slide courtesy Jason Eisner, with mild edits
Remember from Deck 4
Uses of Semantic Roles
- Find the answer to a user’s question
– "Who" questions usually want Agents
– "What" questions usually want Patients
– "How" and "with what" questions usually want Instruments
– "Where" questions frequently want Sources/Destinations
– "For whom" questions usually want Beneficiaries
– "To whom" questions usually want Destinations
- Generate text
– Many languages have specific syntactic constructions that must or should be used for specific semantic roles.
- Word sense disambiguation, using selectional restrictions
– The bat ate the bug. (what kind of bat? what kind of bug?)
- Agents (particularly of “eat”) should be animate – animal bat, not baseball bat
- Patients of “eat” should be edible – animal bug, not software bug
– John fired the secretary. John fired the rifle.
  Patients of fire_1 are different than patients of fire_2
Slide thanks to Ray Mooney (modified)
Slide courtesy Jason Eisner, with mild edits
Remember from Deck 4
Part of Speech Tagging
British/Noun  Left/Verb  Waffles/Noun  on/Prep  Falkland/Noun  Islands/Noun
(Diagram: per-token sequence prediction, x_0 → h_0 → y_0 through x_5 → h_5 → y_5.)
(The tag sequence is probably not right!)
Part-of-speech tagging: assign a part-of-speech tag to every word in a sentence
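To see this in action, here is a small sketch using NLTK's off-the-shelf tagger; the tagger and Penn Treebank tag set are NLTK's defaults, not the model from Collobert & Weston, and the example output is only indicative:

```python
# Off-the-shelf POS tagging with NLTK (requires nltk plus the
# 'punkt' and 'averaged_perceptron_tagger' data packages).
import nltk

tokens = nltk.word_tokenize("British Left Waffles on Falkland Islands")
print(nltk.pos_tag(tokens))
# e.g., [('British', 'JJ'), ('Left', 'NNP'), ('Waffles', 'NNP'), ('on', 'IN'), ...]
# (the tagger may or may not resolve the headline's ambiguity the way you expect)
```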
Syntactic Parsing (One Option)
(parse from the Berkeley parser: https://parser.kitaev.io/)
(parse is probably not right!)
Part-of-speech tagging: assign a part-of-speech tag to every word in a sentence Syntactic parsing: produce an analysis of a sentence according to some grammatical rules
Chunking: A Shallow Syntactic Parse
- (Variant 1) For every token, predict whether it's in a noun phrase (NP) or not
  British Left Waffles on Falkland Islands → ✓ ✓ ✓ ✓ ✗ ✗
  Treat this as a sequence prediction problem (see the chunker sketch below)
- (Variant 2) For every token, predict the type of grammatical phrase it should be part of
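Below is a minimal sketch of shallow NP chunking using NLTK's regular-expression chunker over hand-assigned POS tags; the grammar and tags are illustrative and this is not Collobert & Weston's learned chunker:

```python
# A tiny NP chunker (shallow parse) using NLTK's regexp chunker.
import nltk

sentence = [("British", "JJ"), ("Left", "NNP"), ("Waffles", "NNP"),
            ("on", "IN"), ("Falkland", "NNP"), ("Islands", "NNPS")]
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"   # an NP: optional determiner, adjectives, then nouns
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(sentence))
# Tokens grouped under NP nodes are "in a noun phrase"; the rest are left flat.
```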
Collobert & Weston Language Modeling
Our approach so far: predict a word given some previous words (V-class classification over the vocabulary):
q(x_1, …, x_n) = ∏_j q(x_j | x_{<j})
Their approach: predict* whether x_j is the correct word, based on its context (binary classification):
q(z = 1 | d = (x_{j−N}, …, x_{j−1}, x_{j+1}, …, x_{j+N}), x_j)
*They actually use a ranking loss, but it's close enough to what's described here
Collobert & Weston Language Modeling (Example)
Sentence: British Left Waffles on Falkland Islands
Word: "Waffles"
Predict*:
q(z = 1 | d = (Left, on), x_j = Waffles)
q(z = 0 | d = (Left, on), x_j = Hats)   (or any word but "Waffles")
*They actually use a ranking loss, but it's close enough to what's described here
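Here is a rough PyTorch sketch of that idea, scoring a candidate word in a fixed context window and training with a pairwise ranking (hinge) loss in the spirit of the footnote; the vocabulary, window size, and network sizes are all made up for illustration:

```python
# Sketch: score whether a word fits its context window, trained with a
# pairwise ranking loss (the correct word should outscore a corrupted word).
import random
import torch
import torch.nn as nn

vocab = ["British", "Left", "Waffles", "on", "Falkland", "Islands", "Hats"]
word2id = {w: i for i, w in enumerate(vocab)}
emb_dim, window = 16, 1                      # illustrative sizes

emb = nn.Embedding(len(vocab), emb_dim)
scorer = nn.Sequential(nn.Linear((2 * window + 1) * emb_dim, 32),
                       nn.ReLU(), nn.Linear(32, 1))

def score(context, word):
    ids = torch.tensor([word2id[w] for w in context[:window] + [word] + context[window:]])
    return scorer(emb(ids).flatten()).squeeze()   # higher = "word fits this context"

opt = torch.optim.Adam(list(emb.parameters()) + list(scorer.parameters()))
sentence = ["British", "Left", "Waffles", "on", "Falkland", "Islands"]
for _ in range(100):
    j = random.randrange(window, len(sentence) - window)
    context = sentence[j - window:j] + sentence[j + 1:j + 1 + window]
    corrupt = random.choice(vocab)               # any word (e.g., "Hats")
    # hinge ranking loss: s(correct) should beat s(corrupt) by a margin of 1
    loss = torch.clamp(1 - score(context, sentence[j]) + score(context, corrupt), min=0)
    opt.zero_grad(); loss.backward(); opt.step()
```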
Prediction of Semantic Relatedness
Are two words "semantically related"?
- Synonym: different word, same meaning
- Is-a relationships
  – X hypernym Y: X is a (sub)type of Y (car hypernym "motor vehicle")
  – X hyponym Y: X is a (super)type of Y (car hyponym sedan)
- Part/whole relationships
  – X meronym Y: X is a part of Y (window meronym car)
  – X holonym Y: X is the whole, with Y as a part (car holonym window)
- (and others)
WordNet
Knowledge graph containing concept relations.
(Example graph: "sandwich" with hyponyms "hamburger," "hero," and "gyro.")
- hypernym: specific to general (a hamburger is-a sandwich)
- hyponym: general to specific
- Other relationships too:
  – meronymy, holonymy (part of whole, whole of part)
  – troponymy (describing the manner of an event)
  – entailment (what else must happen in an event)
WordNet Knows About Hamburgers
hamburger → sandwich → snack food → dish → nutriment → food → substance → matter → physical entity → entity
(specific → general)
Browsing WordNet
http://wordnetweb.princeton.edu/perl/webwn
Each result is a synset (synonym set); you can get the relationships for each synset.
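These relations can also be explored programmatically; a small sketch using NLTK's WordNet interface (assumes nltk is installed and the WordNet data has been downloaded):

```python
# Browse WordNet programmatically with NLTK
# (requires: pip install nltk, then nltk.download("wordnet") once).
from nltk.corpus import wordnet as wn

for syn in wn.synsets("hamburger"):            # each result is a synset
    print(syn.name(), "-", syn.definition())

burger = wn.synsets("hamburger")[0]
print("hypernyms:", burger.hypernyms())        # more general concepts
print("one hypernym path:", [s.name() for s in burger.hypernym_paths()[0]])

car = wn.synset("car.n.01")
print("meronyms (parts of a car):", car.part_meronyms())
print("hyponyms (kinds of car):", car.hyponyms()[:5])
```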
Results (Error Rate: Lower is Better)
(Figure: error rates across word embedding sizes.)
Outline
- Multi-Task Learning
- The Attention Mechanism
- Transformer Language Models as General Language Encoders
Attention
A mechanism for signaling where in the input to focus ("attend to") when producing some output.
Each attention mechanism results in a probability distribution over the input. There are many ways of computing this.
Example: Translation
The cat is on the chair. → Le chat est sur la chaise.
A variable number of input words must map to a variable number of output words.
(Diagram: an encoder reads the input tokens x_0, …, x_n into hidden states h_0, …, h_n; a decoder produces output tokens y_0, …, y_m. Which parts of the input should it focus on when producing each output word?)
Attention answers this: when producing each output word, it places a probability-weighted focus on the input words (e.g., "chat" attends mostly to "cat").
Vaswani et al. (NeurIPS, 2017)
"Attention Is All You Need": Take-Aways
- 1. Formulation of attention as a query-key-value triple
- 2. "Transformer" model that uses self-attention
- 3. Demonstration that the transformer can outperform sequence-to-sequence recurrent models (but at a large computational cost!)
"Attention Is All You Need" Description of Attention
"An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key."
These query, key, value, and output items will be task dependent.
Translation example:
The cat is on the chair. → Le ◊
- query: the input & the current translation so far
- key: English words ([The], [cat], …, [bandage], …)
- value: French words ([Le], [chat], …, [fromage], …)
- output: the next translated word (here, "chat")
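As a concrete illustration of that description, here is a minimal PyTorch sketch of scaled dot-product attention, the compatibility function used in the paper; the tensor sizes are illustrative:

```python
# Scaled dot-product attention: output = softmax(Q K^T / sqrt(d_k)) V
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    scores = Q @ K.T / math.sqrt(K.shape[-1])   # compatibility of each query with each key
    weights = F.softmax(scores, dim=-1)         # a probability distribution over the input
    return weights @ V, weights                 # weighted sum of the values

# Illustrative sizes: 4 queries attending over 6 key-value pairs.
Q, K, V = torch.randn(4, 8), torch.randn(6, 8), torch.randn(6, 16)
out, weights = attention(Q, K, V)
print(out.shape, weights.shape)   # torch.Size([4, 16]) torch.Size([4, 6])
```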
Outline
- Multi-Task Learning
- The Attention Mechanism
- Transformer Language Models as General Language Encoders
Two Well-Known (Recent) Instances of Learning from Language Models
- GPT-2 (Radford et al., 2019)
- BERT (Devlin et al., 2019, NAACL)
GPT-2 & BERT (and others) In Practice
- Use TensorFlow code
- Use PyTorch code
→ The Hugging Face transformers package is very popular
GPT-2 Take-Away
Language models can provide an effective way of learning embeddings that are useful for downstream tasks.
- Auto-regressive model that uses a transformer cell:
  q(x_1, …, x_n) = ∏_j q(x_j | x_1, …, x_{j−1})
https://openai.com/blog/gpt-2-1-5b-release/
https://github.com/openai/gpt-2
GPT-2 Model & Representation
(Diagram: input tokens x_0 = BOS, x_1 = British, …, x_N = Islands feed hidden states h_0, …, h_N, computed with transformer cells; each h_j is used to predict the next token y_j, e.g., y_0 = British, y_1 = Left, ….)
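A small sketch of using a pre-trained GPT-2 through the Hugging Face transformers package to get next-token probabilities and generate a continuation; the model name and prompt are just examples:

```python
# Autoregressive next-token prediction with a pre-trained GPT-2
# (requires: pip install transformers torch).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tok("British Left Waffles on Falkland", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits              # (1, seq_len, vocab_size)
probs = logits[0, -1].softmax(dim=-1)       # q(x_j | x_1, ..., x_{j-1}) for the next token
top = probs.topk(5)
print([(tok.decode(int(i)), round(p.item(), 3)) for p, i in zip(top.values, top.indices)])

# Or simply generate a continuation:
print(tok.decode(model.generate(ids, max_new_tokens=10, do_sample=False)[0]))
```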
BERT Take-Aways
- 1. Demonstration of a bidirectional transformer for language understanding
- 2. Clean separation of "pre-training" and "fine-tuning" tasks
- 3. Clear demonstration that language model "pre-training" can yield useful embeddings
Pre-training vs. Fine-tuning
Pre-training: learning an encoder to produce effective embeddings through "general" training objectives that are end-task agnostic. BERT uses two:
- 1. Next-Sentence Prediction [NSP]: given two sentences t_1 and t_2, predict whether t_2 follows t_1 in "natural" text
- 2. Masked Language Modeling [MLM]: given a sentence t = x_1 … x_n, mask out (remove) a word x_j and predict what that word should be:
  "The cat chased the mouse" → "The cat [MASK] the mouse"; predict q(x | The cat [MASK] the mouse)
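A quick sketch of MLM in action with a pre-trained BERT via the Hugging Face fill-mask pipeline; bert-base-uncased is one common choice of model:

```python
# Masked language modeling with a pre-trained BERT
# (requires: pip install transformers torch).
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The cat [MASK] the mouse."):
    print(round(pred["score"], 3), pred["token_str"])   # top candidate fillers for [MASK]
```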
Fine-Tuning: learning task-specific decoders using the embeddings produced from the pre-training, e.g.,
- RTE
- Question answering
- <Your task here>
(Fig. 1 of Devlin et al.: pre-training, then fine-tuning for each downstream task.)
BERT Representation
- 1. A special [CLS] token should precede the entire input to BERT
- 2. Every sentence should be followed by a special [SEP] token
- 3. The input must be tokenized in a special way
- 4. Segment & position embeddings must be provided
(A tokenizer sketch illustrating these points follows below.)
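The transformers tokenizer handles points 1-3 and produces the segment ids needed for point 4; a brief sketch, using bert-base-uncased as an example:

```python
# BERT input construction with the transformers tokenizer
# (requires: pip install transformers).
from transformers import BertTokenizerFast

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tok("British Left Waffles", "on Falkland Islands")   # a two-"sentence" input

print(tok.convert_ids_to_tokens(enc["input_ids"]))
# e.g., ['[CLS]', 'british', 'left', ..., '[SEP]', 'on', ..., '[SEP]']
# (exact word pieces depend on the vocabulary)
print(enc["token_type_ids"])   # segment ids: 0s for the first sentence, 1s for the second
# Position embeddings are added inside the model based on token order.
```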
(Diagram: input tokens x_0 = [CLS], x_1 = British, x_2 = Left, …, x_{N+1} = [SEP] feed hidden states h_0, …, h_{N+1} and outputs y_0, …, y_{N+1}; the [CLS] token precedes the entire input, and the sentence is followed by [SEP].)
BERT Representation (Even More)
(Fig. 2 of Devlin et al., illustrating all four points: [CLS] precedes the entire input, each sentence is followed by [SEP], the input is tokenized into word pieces, and segment & position embeddings are provided for every token.)
Transformer Language Model Take-Aways
- 1. Clean separation of "pre-training" and "fine-tuning" tasks
- 2. Clear demonstration that language model "pre-training" can yield useful embeddings
GPT-2 & BERT (and others) In Practice
- Use TensorFlow code
- Use PyTorch code
→ The Hugging Face transformers package is very popular (Fig. 2, Wolf et al., 2020: https://arxiv.org/pdf/1910.03771.pdf)
Five Broad Categories of Neural Networks
- Single Input, Single Output
- Single Input, Multiple Outputs
- Multiple Inputs, Single Output
- Multiple Inputs, Multiple Outputs ("sequence prediction": no time delay)
- Multiple Inputs, Multiple Outputs ("sequence-to-sequence": with time delay)