

slide-1
SLIDE 1

(Even More) Language Modeling: Multi-Task Learning, and Building Blocks of Transformers

CMSC 473/673 Frank Ferraro

slide-2
SLIDE 2

Outline

Multi-Task Learning
The Attention Mechanism
Transformer Language Models as General Language Encoders

slide-3
SLIDE 3

Multi-class Classification

Given input 𝑦, predict discrete label 𝑧

If 𝑧 ∈ {0,1} (or 𝑧 ∈ {True, False}), then it is a binary classification task. If 𝑧 ∈ {0, 1, … , 𝐿 − 1} (for finite 𝐿), then it is a multi-class classification task.

Multi-label Classification

Single output vs. multi-output

Given input 𝑦, predict multiple discrete labels 𝑧 = (𝑧1, … , 𝑧𝑀)

If multiple 𝑧𝑚 are predicted, then it is a multi-label classification task. Each 𝑧𝑚 could be binary or multi-class. (Remember from Deck 5.)
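A minimal PyTorch sketch (not from the slides; all layer names and sizes are invented) contrasting the two setups: a single softmax head picks exactly one of L classes, while a multi-label head scores each of the M labels independently.

```python
import torch
import torch.nn as nn

hidden = torch.randn(1, 64)                            # a hypothetical 64-dim encoding of the input y

# Multi-class: one task, exactly one of L mutually exclusive labels.
L = 5
multiclass_head = nn.Linear(64, L)
class_probs = multiclass_head(hidden).softmax(dim=-1)  # probabilities sum to 1 over the L classes

# Multi-label: M separate decisions for the same input (e.g., HAPPY and EXCITED can both be true).
M = 3
multilabel_head = nn.Linear(64, M)
label_probs = torch.sigmoid(multilabel_head(hidden))   # each z_m is scored independently in [0, 1]
```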

slide-4
SLIDE 4

Multi-Label vs. Multi-Task

  • These can be considered the same thing, but often they're different

  • “Task”: a thing of interest to predict
slide-5
SLIDE 5

Multi-Label vs. Multi-Task

  • These can be considered the same thing, but often they're different
  • "Task": a thing of interest to predict
  • Multi-label classification often involves multiple labels for the same task
– E.g., sentiment (a tweet could be both "HAPPY" and "EXCITED")

slide-6
SLIDE 6

Multi-Label vs. Multi-Task

  • These can be considered the same thing, but often they're different
  • "Task": a thing of interest to predict
  • Multi-label classification often involves multiple labels for the same task
– E.g., sentiment (a tweet could be both "HAPPY" and "EXCITED")
  • Multi-task learning is for different "tasks," e.g.,
– Task 1: Category of document (SPORTS, FINANCE, etc.)
– Task 2: Sentiment of document
– Task 3: Part-of-speech per token
– Task 4: Syntactic parsing
– …

slide-7
SLIDE 7

Multi-Task Learning

Single-Task Learning: Train a system to "do one thing" (make predictions for one task)

x → h → y

slide-8
SLIDE 8

Multi-Task Learning

Single-Task Learning: Train a system to "do one thing" (make predictions for one task)

x → h → y. If you have multiple (T) tasks, then train multiple systems: x → h1 → y1, x → h2 → y2, …, x → hT → yT

slide-9
SLIDE 9

Multi-Task Learning

Single-Task Learning: Train a system to "do one thing" (make predictions for one task)

x → h → y. If you have multiple (T) tasks, then train multiple systems: x → h1 → y1, x → h2 → y2, …, x → hT → yT (different encoders, different decoders)

slide-10
SLIDE 10

Multi-Task Learning

Single-Task Learning: Train a system to "do one thing" (make predictions for one task). Multi-Task Learning: Train a system to "do multiple things" (make predictions for T different tasks)

x → h → y

Key idea/assumption: if the tasks are somehow related, can we leverage an ability to do task i well into an ability to do task j well?

slide-11
SLIDE 11

Multi-Task Learning

Single-Task Learning: Train a system to "do one thing" (make predictions for one task). Multi-Task Learning: Train a system to "do multiple things" (make predictions for T different tasks)

x → h → y

Key idea/assumption: if the tasks are somehow related, can we leverage an ability to do task i well into an ability to do task j well?

Example: could features/embeddings useful for language modeling (task i) also be useful for part-of-speech tagging (task j)?

slide-12
SLIDE 12

Multi-Task Learning

Single-Task Learning: Train a system to "do one thing" (make predictions for one task). Multi-Task Learning: Train a system to "do multiple things" (make predictions for T different tasks)

Single-task: x → h → y. Multi-task: x → h → (y1, y2, …, yT)

Key idea/assumption: if the tasks are somehow related, can we leverage an ability to do task i well into an ability to do task j well?

slide-13
SLIDE 13

Multi-Task Learning

Single-Task Learning: Train a system to "do one thing" (make predictions for one task). Multi-Task Learning: Train a system to "do multiple things" (make predictions for T different tasks)

Single-task: x → h → y. Multi-task: x → h → (y1, y2, …, yT)

slide-14
SLIDE 14

Multi-Task Learning

Single-Task Learning: Train a system to "do one thing" (make predictions for one task). Multi-Task Learning: Train a system to "do multiple things" (make predictions for T different tasks)

Single-task: x → h → y. Multi-task: x → h → (y1, y2, …, yT): the same encoder learns good, general features/embeddings

slide-15
SLIDE 15

Multi-Task Learning

Single-Task Learning: Train a system to "do one thing" (make predictions for one task). Multi-Task Learning: Train a system to "do multiple things" (make predictions for T different tasks)

Single-task: x → h → y. Multi-task: x → h → (y1, y2, …, yT): the same encoder learns good, general features/embeddings, while different decoders learn how to use those reps. for each task
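Below is a minimal PyTorch sketch of this picture, under assumptions not in the slides (toy dimensions, plain linear layers): one shared encoder produces h, and a separate decoder head per task maps h to that task's output.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, task_output_dims):
        super().__init__()
        # Shared encoder: learns good, general features/embeddings h from x.
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # One decoder per task: learns how to use h for that task's labels.
        self.decoders = nn.ModuleList(
            nn.Linear(hidden_dim, out_dim) for out_dim in task_output_dims
        )

    def forward(self, x, task_id):
        h = self.encoder(x)                   # the same h, whatever the task
        return self.decoders[task_id](h)      # task-specific prediction y_t

# e.g., task 0 = document category (5 classes), task 1 = sentiment (2), task 2 = POS tags (40)
model = MultiTaskModel(input_dim=100, hidden_dim=64, task_output_dims=[5, 2, 40])
category_logits = model(torch.randn(8, 100), task_id=0)
```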
slide-16
SLIDE 16

General Multi-Task Training Procedure

Given:

T different corpora D_1, …, D_T for the tasks, where D_t = {(y_1^t, z_1^t), …, (y_{O_t}^t, z_{O_t}^t)}

Encoder F and T different decoders E_1, …, E_T

These have weights (parameters) you need to learn

slide-17
SLIDE 17

General Multi-Task Training Procedure

Given:

T different corpora D_1, …, D_T for the tasks, where D_t = {(y_1^t, z_1^t), …, (y_{O_t}^t, z_{O_t}^t)}

Encoder F and T different decoders E_1, …, E_T

Until converged or done:

  • 1. Select the next task t
  • 2. Randomly sample an instance (y_j^t, z_j^t) from D_t
  • 3. Train the encoder F and decoder E_t on (y_j^t, z_j^t)
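A minimal sketch of this loop, assuming a model with the shared-encoder/per-task-decoder interface sketched earlier, plus hypothetical in-memory corpora and per-task loss functions:

```python
import random
import torch

def multitask_train(model, corpora, loss_fns, steps=10_000, lr=1e-3):
    """corpora[t] is a list of (y, z) pairs for task t; loss_fns[t] scores task t's predictions."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    T = len(corpora)
    for step in range(steps):                    # "until converged or done"
        t = step % T                             # 1. select the next task t (round-robin here)
        y, z = random.choice(corpora[t])         # 2. randomly sample an instance (y_j^t, z_j^t) from D_t
        optimizer.zero_grad()
        prediction = model(y, task_id=t)         # shared encoder F + task-specific decoder E_t
        loss = loss_fns[t](prediction, z)        # 3. train F and E_t on (y_j^t, z_j^t)
        loss.backward()
        optimizer.step()
```

Round-robin task selection is just one choice here; sampling tasks in proportion to corpus size is another common option.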

slide-18
SLIDE 18

WARNING: Multi-task learning did not begin in 2008

slide-19
SLIDE 19

Two Well-Known Instances of Multi-Task Learning in NLP

Collobert and Weston (2008, ICML); BERT (Devlin et al., 2019, NAACL)

slide-20
SLIDE 20

Two Well-Known Instances of Multi-Task Learning in NLP

Collobert and Weston (2008, ICML); BERT (Devlin et al., 2019, NAACL)

We’ll return to this

slide-21
SLIDE 21

Collobert and Weston (2008, ICML)

Core task: Semantic Role Labeling

Present a unified architecture for doing five other, related NLP tasks:

  • Part-of-Speech Tagging
  • Chunking
  • Named Entity Recognition
  • Language Modeling
  • Prediction of Semantic Relatedness
slide-22
SLIDE 22

Collobert and Weston (2008, ICML)

Core task: Semantic Role Labeling

Present a unified architecture for doing five other, related NLP tasks:

  • Part-of-Speech Tagging
  • Chunking
  • Named Entity Recognition
  • Language Modeling
  • Prediction of Semantic Relatedness
slide-23
SLIDE 23

Semantic Role Labeling (SRL)

  • For each predicate (e.g., verb)
  • 1. find its arguments (e.g., NPs)
  • 2. determine their semantic roles

John drove Mary from Austin to Dallas in his Toyota Prius. The hammer broke the window.

– agent: Actor of an action
– patient: Entity affected by the action
– source: Origin of the affected entity
– destination: Destination of the affected entity
– instrument: Tool used in performing action
– beneficiary: Entity for whom action is performed

Slide thanks to Ray Mooney (modified)

Slide courtesy Jason Eisner, with mild edits

Remember from Deck 4

slide-24
SLIDE 24

Uses of Semantic Roles

  • Find the answer to a user’s question

– "Who" questions usually want Agents
– "What" questions usually want Patients
– "How" and "with what" questions usually want Instruments
– "Where" questions frequently want Sources/Destinations
– "For whom" questions usually want Beneficiaries
– "To whom" questions usually want Destinations

  • Generate text

– Many languages have specific syntactic constructions that must or should be used for specific semantic roles.

  • Word sense disambiguation, using selectional restrictions

– The bat ate the bug. (what kind of bat? what kind of bug?)

  • Agents (particularly of “eat”) should be animate – animal bat, not baseball bat
  • Patients of “eat” should be edible – animal bug, not software bug

– John fired the secretary. John fired the rifle.

Patients of fire1 are different than patients of fire2.

Slide thanks to Ray Mooney (modified)

Slide courtesy Jason Eisner, with mild edits

Remember from Deck 4

slide-25
SLIDE 25

Collobert and Weston (2008, ICML)

Core task: Semantic Role Labeling

Present a unified architecture for doing five other, related NLP tasks:

  • Part-of-Speech Tagging
  • Chunking
  • Named Entity Recognition
  • Language Modeling
  • Prediction of Semantic Relatedness
slide-26
SLIDE 26

Collobert and Weston (2008, ICML)

Core task: Semantic Role Labeling

Present a unified architecture for doing five other, related NLP tasks:

  • Part-of-Speech Tagging
  • Chunking
  • Named Entity Recognition
  • Language Modeling
  • Prediction of Semantic Relatedness
slide-27
SLIDE 27

Collobert and Weston (2008, ICML)

Core task: Semantic Role Labeling

Present a unified architecture for doing five other, related NLP tasks:

  • Part-of-Speech Tagging
  • Chunking
  • Named Entity Recognition
  • Language Modeling
  • Prediction of Semantic Relatedness
slide-28
SLIDE 28

Part of Speech Tagging

British Left Waffles on Falkland Islands

Noun Verb Noun Prep Noun Noun

(Diagram: per-token predictions x0 → h0 → y0, x1 → h1 → y1, …, x5 → h5 → y5) (sequence is probably not right!)

slide-29
SLIDE 29

Part-of-speech tagging: assign a part-of-speech tag to every word in a sentence

slide-30
SLIDE 30

Syntactic Parsing (One Option)

(parse from the Berkeley parser: https://parser.kitaev.io/)

(parse is probably not right!)

slide-31
SLIDE 31

Part-of-speech tagging: assign a part-of-speech tag to every word in a sentence Syntactic parsing: produce an analysis of a sentence according to some grammatical rules

slide-32
SLIDE 32

Part-of-speech tagging: assign a part-of-speech tag to every word in a sentence Syntactic parsing: produce an analysis of a sentence according to some grammatical rules

Chunking: A Shallow Syntactic Parse

slide-33
SLIDE 33

Chunking: A Shallow Syntactic Parse

  • (Variant 1) For every token, predict whether it's in a noun-phrase (NP) or not

slide-34
SLIDE 34

Chunking: A Shallow Syntactic Parse

  • (Variant 1) For every token, predict whether it's in a noun-phrase (NP) or not

British Left Waffles on Falkland Islands: ✓ ✓ ✓ ✓ X X. Treat this as a sequence prediction problem.

slide-35
SLIDE 35

Chunking: A Shallow Syntactic Parse

  • (Variant 1) For every token, predict whether it's in a noun-phrase (NP) or not
  • (Variant 2) For every token, predict the type of grammatical phrase it should be part of

slide-36
SLIDE 36

Collobert and Weston (2008, ICML)

Core task: Semantic Role Labeling

Present a unified architecture for doing five other, related NLP tasks:

  • Part-of-Speech Tagging
  • Chunking
  • Named Entity Recognition
  • Language Modeling
  • Prediction of Semantic Relatedness
slide-37
SLIDE 37

Collobert & Weston Language Modeling

Our approach so far: predict a word given some previous words (V-class classification):

q(x_1, …, x_O) = ∏_j q(x_j | x_{<j})

Their approach: predict* whether x_j is the correct word, based on its context (binary classification):

q(z = 1 | d = (x_{j−N}, …, x_{j−1}, x_{j+1}, …, x_{j+N}), x_j)

*They actually use a ranking loss, but it’s close enough to what’s described here

slide-38
SLIDE 38

Collobert & Weston Language Modeling (Example)

Sentence: British Left Waffles on Falkland Islands
Word: "Waffles"
Predict*:
q(z = 1 | d = (Left, on), x_j = Waffles)
q(z = 0 | d = (Left, on), x_j = Hats)   ← any word but "Waffles"

*They actually use a ranking loss, but it's close enough to what's described here
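A rough sketch of this idea (not the authors' code; it uses the binary approximation described on the slide rather than their ranking loss, and every module name and size is invented): embed a fixed-size window, score whether the center word fits, and build negative examples by corrupting the center word.

```python
import torch
import torch.nn as nn

class WindowScorer(nn.Module):
    def __init__(self, vocab_size, dim=50, window=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        # Scores the concatenation of the context words and the center word.
        self.score = nn.Sequential(nn.Linear((2 * window + 1) * dim, 64),
                                   nn.Tanh(),
                                   nn.Linear(64, 1))

    def forward(self, window_ids):                    # window_ids: (batch, 2*window + 1)
        e = self.emb(window_ids).flatten(start_dim=1)
        return self.score(e).squeeze(-1)              # logit for q(z = 1 | d, x_j)

def corrupt_center(window_ids, vocab_size, center=2):
    """Negative example: swap the center word for a random one (e.g., Waffles -> Hats)."""
    corrupted = window_ids.clone()
    corrupted[:, center] = torch.randint(vocab_size, (window_ids.size(0),))
    return corrupted
```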

slide-39
SLIDE 39

Collobert and Weston (2008, ICML)

Core task: Semantic Role Labeling

Present a unified architecture for doing five other, related NLP tasks:

  • Part-of-Speech Tagging
  • Chunking
  • Named Entity Recognition
  • Language Modeling
  • Prediction of Semantic Relatedness
slide-40
SLIDE 40

Prediction of Semantic Relatedness

Are two words “semantically related?”

  • Synonym: different word, same meaning
  • Is-a relationships
  • Part/whole relationships
  • (and others)
slide-41
SLIDE 41

Prediction of Semantic Relatedness

Are two words “semantically related?”

  • Synonym: different word, same meaning
  • Is-a relationships

– X hypernym Y: X is a (sub)type of Y

  • car hypernym “motor vehicle”

– X hyponym Y: X is a (super)type of Y

  • car hyponym sedan
  • Part/whole relationships
  • (and others)
slide-42
SLIDE 42

Prediction of Semantic Relatedness

Are two words “semantically related?”

  • Synonym: different word, same meaning
  • Is-a relationships

– X hypernym Y: X is a (sub)type of Y

  • car hypernym “motor vehicle”

– X hyponym Y: X is a (super)type of Y

  • car hyponym sedan
  • Part/whole relationships

– X meronym Y: X is a part of Y

  • window meronym car

– X holonym Y: X is the whole, with Y as a part

  • car holonym window
  • (and others)
slide-43
SLIDE 43

WordNet

Knowledge graph containing concept relations

hamburger, sandwich, hero, gyro

slide-44
SLIDE 44

WordNet

Knowledge graph containing concept relations

hamburger, sandwich, hero, gyro. Hypernym: specific to general (a hamburger is-a sandwich)

slide-45
SLIDE 45

WordNet

Knowledge graph containing concept relations

hamburger, sandwich, hero, gyro. Hyponym: general to specific (a hamburger is-a sandwich)

slide-46
SLIDE 46

WordNet

Knowledge graph containing concept relations

hamburger, sandwich, hero, gyro. Other relationships too:

  • meronymy, holonymy

(part of whole, whole of part)

  • troponymy

(describing manner of an event)

  • entailment

(what else must happen in an event)

slide-47
SLIDE 47

WordNet Knows About Hamburgers

hamburger → sandwich → snack food → dish → nutriment → food → substance → matter → physical entity → entity

(specific → general)

slide-48
SLIDE 48

Browsing WordNet

http://wordnetweb.princeton.edu/perl/webwn

Each of these is a synset (synonym set)

slide-49
SLIDE 49

Browsing WordNet

http://wordnetweb.princeton.edu/perl/webwn

Get the relationships for each synset
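These synsets and relations can also be queried in code; a small example using NLTK's WordNet interface (an assumption of this sketch, not something shown in the deck; it requires `pip install nltk` and `nltk.download("wordnet")`):

```python
from nltk.corpus import wordnet as wn

burger = wn.synsets("hamburger")[0]            # a synset (synonym set) for "hamburger"
print(burger.definition())

print(burger.hypernyms())                      # specific -> general (a sandwich-like synset)
print(burger.hypernyms()[0].hyponyms())        # general -> specific (other kinds of sandwich)
print(wn.synsets("car")[0].part_meronyms())    # parts of a car (e.g., a window)
```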

slide-50
SLIDE 50

Results (Error Rate: Lower is Better)

(Figure: results by word embedding size)

slide-51
SLIDE 51

Results (Error Rate: Lower is Better)

(Figure: results by word embedding size)

slide-52
SLIDE 52

Results (Error Rate: Lower is Better)

(Figure: results by word embedding size)

slide-53
SLIDE 53

Outline

Multi-Task Learning
The Attention Mechanism
Transformer Language Models as General Language Encoders

slide-54
SLIDE 54

Attention

A mechanism for signaling where in the input to focus ("attend to") when producing some output.

Each attention mechanism results in a probability distribution over the input. There are many ways of computing this.

slide-55
SLIDE 55

Example: Translation

The cat is on the chair. Le chat est sur la chaise.

variable # of input words, variable # of output words
slide-56
SLIDE 56

Example: Translation

The cat is on the chair. Le chat est sur la chaise.

(Encoder-decoder diagram: inputs x0 … x2, hidden states h0 … h2 …, outputs y0 … y3)

slide-57
SLIDE 57

Attention

The cat is on the chair. ↔ Le chat est sur la chaise.

(Which input words should we focus on when producing each output word?)

slide-58
SLIDE 58

Example: Translation

The cat is on the chair. Le chat est sur la chaise.

(Encoder-decoder diagram: inputs x0 … x2, hidden states h0 … h2 …, outputs y0 … y3)

slide-59
SLIDE 59

Vaswani et al. (NeurIPS, 2017)

slide-60
SLIDE 60

“Attention Is All You Need”: Take-Aways

  • 1. Formulation of attention as a query-key-value triple
  • 2. "Transformer" model that uses self-attention
  • 3. Demonstration that the transformer can outperform sequence-to-sequence recurrent models (but at a large computational cost!)

slide-61
SLIDE 61

"Attention Is All You Need": Description of Attention

"An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key."

(These query, key, value, and output items will be task dependent.)

slide-62
SLIDE 62

"Attention Is All You Need": Description of Attention

"An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key."

slide-63
SLIDE 63

"Attention Is All You Need": Description of Attention

"An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key."
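A minimal sketch of this definition, using the scaled dot product as the compatibility function (the choice made in the paper); the shapes below are hypothetical:

```python
import math
import torch

def attention(query, keys, values):
    """query: (d,), keys: (n, d), values: (n, d_v) -> output: (d_v,)"""
    scores = keys @ query / math.sqrt(query.size(-1))  # compatibility of the query with each key
    weights = torch.softmax(scores, dim=-1)            # a probability distribution over the n inputs
    return weights @ values                            # output = weighted sum of the values

q = torch.randn(16)
K, V = torch.randn(6, 16), torch.randn(6, 32)          # e.g., 6 input words
output = attention(q, K, V)                            # a 32-dim vector
```

The Transformer stacks many such weighted sums (multi-head self-attention), but the core operation is the one above.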

slide-64
SLIDE 64

"Attention Is All You Need": Description of Attention

"An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key."

The cat is on the chair.
Translation so far: Le ◊
Keys: [The] [cat] … [bandage] …
Values: [Le] [chat] … [fromage] …
Next translated word: chat

query: input & current translation
key: English words
value: French words
output: next translated word
slide-65
SLIDE 65

"Attention Is All You Need": Description of Attention

"An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key."

The cat is on the chair.
Translation so far: Le ◊
Keys: [The] [cat] … [bandage] …
Values: [Le] [chat] … [fromage] …
Next translated word: chat

slide-66
SLIDE 66

"Attention Is All You Need": Description of Attention

"An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key."

The cat is on the chair.
Translation so far: Le ◊
Keys: [The] [cat] … [bandage] …
Values: [Le] [chat] … [fromage] …
Next translated word: chat

slide-67
SLIDE 67

Outline

Multi-Task Learning
The Attention Mechanism
Transformer Language Models as General Language Encoders

slide-68
SLIDE 68

Two Well-Known Instances of Multi-Task Learning in NLP

Collobert and Weston (2008, ICML); BERT (Devlin et al., 2019, NAACL)

slide-69
SLIDE 69

Two Well-Known (Recent) Instances of Learning from Language Models

GPT-2 (Radford et al., 2018); BERT (Devlin et al., 2019, NAACL)

slide-70
SLIDE 70

GPT-2 & BERT (and others) In Practice

  • Use TensorFlow code
  • Use PyTorch code

→ The Hugging Face transformers package is very popular

slide-71
SLIDE 71

GPT-2 Take-Away

Language models can provide an effective way of learning embeddings that are useful for downstream tasks

  • Auto-regressive model that uses a transformer cell

q(x_1, …, x_O) = ∏_j q(x_j | x_1, …, x_{j−1})

https://openai.com/blog/gpt-2-1-5b-release/ https://github.com/openai/gpt-2
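A short usage sketch with the Hugging Face transformers package (discussed later in the deck): load the released "gpt2" checkpoint and read off q(x_j | x_1, …, x_{j−1}) for the next word. The sentence is from the slides; the rest is illustrative.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer("British Left Waffles on Falkland", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                        # (1, sequence length, vocabulary size)

next_word_probs = logits[0, -1].softmax(dim=-1)       # q(x_j | x_1, ..., x_{j-1})
top = next_word_probs.topk(5)
print([tokenizer.decode(i) for i in top.indices])     # five most likely next tokens
```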

slide-72
SLIDE 72

GPT-2 Model & Representation

(Diagram: inputs x0 … xN = BOS, British, …, Islands → hidden states h0 … hN → predictions y0 … yN = British, Left, …, [CLS]; computed with transformer cells)

slide-73
SLIDE 73

BERT Take-Aways

  • 1. Demonstration of bidirectional transformer for language understanding
  • 2. Clean separation of "pre-training" and "fine-tuning" tasks
  • 3. Clear demonstration that language model "pre-training" can yield useful embeddings

slide-74
SLIDE 74

Pretraining vs. Fine-tuning

Pre-training: Learning an encoder to produce effective embeddings through "general" training objectives that are end-task agnostic

slide-75
SLIDE 75

Pretraining vs. Fine-tuning

Pre-training: Learning an encoder to produce effective embeddings through "general" training objectives that are end-task agnostic

  • 1. Next-sentence prediction [NSP]
  • 2. Masked Language Modeling [MLM]

Pre-training: NSP
  • Given two sentences t_1 and t_2, predict whether t_2 follows t_1 in "natural" text

slide-76
SLIDE 76

Pretraining vs. Fine-tuning

Pre-training: Learning an encoder to produce effective embeddings through "general" training objectives that are end-task agnostic

  • 1. Next-sentence prediction [NSP]
  • 2. Masked Language Modeling [MLM]

Pre-training: MLM
  • Given a sentence t = x_1 … x_O, mask out (remove) a word x_j and predict what that word should be

"The cat chased the mouse" ➔ "The cat [MASK] the mouse"

q(x | "The cat [MASK] the mouse")
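A small illustration of the MLM objective with the Hugging Face transformers package (an assumption of this sketch; the package itself comes up later in the deck), using the slide's example sentence:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("The cat [MASK] the mouse", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                        # (1, sequence length, vocabulary size)

mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
probs = logits[0, mask_pos].softmax(dim=-1)                # q(x | "The cat [MASK] the mouse")
print(tokenizer.decode([probs.argmax().item()]))           # e.g., "chased" or another plausible verb
```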

slide-77
SLIDE 77

Pretraining vs. Fine-tuning

Pre-training: Learning an encoder to produce effective embeddings through "general" training objectives that are end-task agnostic

  • 1. Next-sentence prediction [NSP]
  • 2. Masked Language Modeling [MLM]

  • Fig. 1
slide-78
SLIDE 78

Pretraining vs. Fine-tuning

Pre-training: Learning an encoder to produce effective embeddings through "general" training objectives that are end-task agnostic

  • 1. Next-sentence prediction [NSP]
  • 2. Masked Language Modeling [MLM]

Fine-Tuning: Learning task-specific decoders using the embeddings produced from the pre-training, e.g.,

  • RTE
  • Question-answering
  • <Your task here>
slide-79
SLIDE 79

Pretraining vs. Fine-tuning

Fine-Tuning: Learning task-specific decoders using the embeddings produced from the pre-training, e.g.,

  • RTE
  • Question-answering
  • <Your task here>
  • Fig. 1
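A rough sketch of fine-tuning under assumptions not in the slides: a pre-trained BERT encoder from Hugging Face transformers, a single linear layer as the task-specific decoder, and a hypothetical two-class task standing in for RTE, QA, or your task.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")        # pre-trained with NSP + MLM
decoder = torch.nn.Linear(encoder.config.hidden_size, 2)        # task-specific head (2 classes here)
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(decoder.parameters()), lr=2e-5
)

def fine_tune_step(sentence, label):
    batch = tokenizer(sentence, return_tensors="pt")
    h_cls = encoder(**batch).last_hidden_state[:, 0]            # embedding of the [CLS] token
    loss = torch.nn.functional.cross_entropy(decoder(h_cls), torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```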
slide-80
SLIDE 80

Pre-training then Fine-tuning

  • Fig. 1
slide-81
SLIDE 81

BERT Representation

  • 1. A special [CLS] token should precede the entire input to BERT
  • 2. Every sentence should be followed by a special [SEP] token
  • 3. The input must be tokenized in a special way
  • 4. Segment & position embeddings must be provided
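A short example of how requirements 1-4 look in practice, using a Hugging Face BERT tokenizer as a stand-in (the exact WordPieces it produces are not guaranteed to match the comment below):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tokenizer("British Left Waffles", "on Falkland Islands", return_tensors="pt")

print(tokenizer.convert_ids_to_tokens(enc.input_ids[0]))
# roughly: ['[CLS]', 'british', 'left', ..., '[SEP]', 'on', ..., 'islands', '[SEP]']
print(enc.token_type_ids[0])   # segment ids: 0 for the first sentence, 1 for the second
# Position embeddings are added inside the model, from each token's index in the sequence.
```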

slide-82
SLIDE 82

BERT Representation

(Diagram: inputs x0 … xN+1 = [CLS], British, Left, …, [SEP] → hidden states h0 … hN+1 → outputs y0 … yN+1 = [CLS], British, Left, …, [SEP])

slide-83
SLIDE 83

BERT Representation

(Diagram: inputs x0 … xN+1 = [CLS], British, Left, …, [SEP] → hidden states h0 … hN+1 → outputs y0 … yN+1 = [CLS], British, Left, …, [SEP])

  • 1. A special [CLS] token should precede the entire input to BERT

slide-84
SLIDE 84

BERT Representation

(Diagram: inputs x0 … xN+1 = [CLS], British, Left, …, [SEP] → hidden states h0 … hN+1 → outputs y0 … yN+1 = [CLS], British, Left, …, [SEP])

  • 2. Every sentence should be followed by a special [SEP] token

slide-85
SLIDE 85

BERT Representation (Even More)

  • Fig. 2
slide-86
SLIDE 86

BERT Representation (Even More)

  • Fig. 2
  • 1. A special [CLS] token should precede the entire input to BERT

slide-87
SLIDE 87

BERT Representation (Even More)

  • Fig. 2
  • 2. Every sentence should be followed by a special [SEP] token

slide-88
SLIDE 88

BERT Representation (Even More)

  • Fig. 2
  • 3. The input must be tokenized in a special way

slide-89
SLIDE 89

BERT Representation (Even More)

  • Fig. 2
  • 4. Segment & position embeddings must be provided

slide-90
SLIDE 90

Transformer Language Model Take-Aways

  • 1. Clean separation of "pre-training" and "fine-tuning" tasks
  • 2. Clear demonstration that language model "pre-training" can yield useful embeddings

slide-91
SLIDE 91

GPT-2 & BERT (and others) In Practice

  • Use TensorFlow code
  • Use PyTorch code

→ The Hugging Face transformers package is very popular

  • Fig. 2 (Wolf et al., 2020: https://arxiv.org/pdf/1910.03771.pdf)
slide-92
SLIDE 92

GPT-2 & BERT (and others) In Practice

  • Use TensorFlow code
  • Use PyTorch code

→ The Hugging Face transformers package is very popular

  • Fig. 2 (Wolf et al., 2020:

https://arxiv.org/pdf/1910.03771.pdf)

slide-93
SLIDE 93

Five Broad Categories of Neural Networks

  • Single Input, Single Output
  • Single Input, Multiple Outputs
  • Multiple Inputs, Single Output
  • Multiple Inputs, Multiple Outputs ("sequence prediction": no time delay)
  • Multiple Inputs, Multiple Outputs ("sequence-to-sequence": with time delay)


slide-94
SLIDE 94

Five Broad Categories of Neural Networks

  • Single Input, Single Output
  • Single Input, Multiple Outputs
  • Multiple Inputs, Single Output
  • Multiple Inputs, Multiple Outputs ("sequence prediction": no time delay)
  • Multiple Inputs, Multiple Outputs ("sequence-to-sequence": with time delay)


slide-95
SLIDE 95

Five Broad Categories of Neural Networks

  • Single Input, Single Output
  • Single Input, Multiple Outputs
  • Multiple Inputs, Single Output
  • Multiple Inputs, Multiple Outputs ("sequence prediction": no time delay)
  • Multiple Inputs, Multiple Outputs ("sequence-to-sequence": with time delay)


slide-96
SLIDE 96

Outline

Multi-Task Learning
The Attention Mechanism
Transformer Language Models as General Language Encoders