

slide-1
SLIDE 1

Transformers
 Pre-trained Language Models

LING572 Advanced Statistical Methods for NLP March 10, 2020

1

slide-2
SLIDE 2

Announcements

  • Thanks for being here!
  • Please be active on Zoom chat! That’s the only form of interaction; I won’t be able to tell what’s sticking and what’s not without the physical classroom and its visual cues.
  • HW7: excellent. 94 avg, no major comments.
  • HW9: will post this afternoon. Deep Averaging Network for text classification; you will implement: linear layer, L2 regularization, early stopping.
  • Office hours today: https://washington.zoom.us/my/shanest

2

slide-3
SLIDE 3

Outline

  • Transformer Architecture
  • Transfer learning and pre-training
  • History / main idea
  • In NLP: ELMo, BERT, …

3

slide-4
SLIDE 4

Transformer Architecture

4

slide-5
SLIDE 5

5

Paper link (but see Annotated and Illustrated Transformer)

slide-6
SLIDE 6

Full Model

6

encoder decoder

slide-7
SLIDE 7

Transformer Block

7

slide-8
SLIDE 8

Transformer Block

7

Single layer, applied to each position

slide-9
SLIDE 9

Transformer Block

7

What’s this?

Single layer, applied to each position
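The “single layer, applied to each position” is the position-wise feed-forward sub-layer of the block. Below is a minimal PyTorch-style sketch (my own illustration, not from the slides; the dimensions 512/2048 follow the original Transformer paper) showing that the same two-layer MLP is applied independently to the vector at every position:

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Same weights at every position: FFN(x) = W2 * relu(W1 x + b1) + b2."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); the linear layers act on the last
        # dimension, i.e. each position is transformed independently.
        return self.w2(torch.relu(self.w1(x)))
```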

slide-10
SLIDE 10

Scaled Dot-Product Attention

  • Recall:



 
 
 


  • Putting it together: 


(keys/values in matrices)


  • Stacking multiple queries:


(and scaling)

8

$\mathrm{Attention}(q, K, V) = \sum_j \frac{e^{q \cdot k_j}}{\sum_i e^{q \cdot k_i}} \, v_j$

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$

slide-11
SLIDE 11

Scaled Dot-Product Attention

  • Recall:



 
 
 


  • Putting it together: 


(keys/values in matrices)


  • Stacking multiple queries:


(and scaling)

8

$\alpha_j = q \cdot k_j, \quad e_j = e^{\alpha_j} / \sum_{j'} e^{\alpha_{j'}}, \quad c = \sum_j e_j v_j$

$\mathrm{Attention}(q, K, V) = \sum_j \frac{e^{q \cdot k_j}}{\sum_i e^{q \cdot k_i}} \, v_j$

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$
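A minimal NumPy sketch (my own illustration, not from the slides) of the scaled dot-product attention formula above; Q, K, V are illustrative arrays of shape (num_queries, d_k), (num_keys, d_k), (num_keys, d_v):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (num_queries, num_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # (num_queries, d_v)

# Example: 3 queries, 5 keys/values, d_k = d_v = 4
Q, K, V = np.random.randn(3, 4), np.random.randn(5, 4), np.random.randn(5, 4)
print(scaled_dot_product_attention(Q, K, V).shape)      # (3, 4)
```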


slide-13
SLIDE 13

Why multiple queries?

9

slide-14
SLIDE 14

Why multiple queries?

  • seq2seq: single decoder token attends to all encoder states

9

slide-15
SLIDE 15

Why multiple queries?

  • seq2seq: single decoder token attends to all encoder states
  • Transformer: self-attention
  • Every (token) position attends to every other position [including self!]
  • Caveat: in the encoder, and only by default
  • Mask in decoder to attend only to previous positions
  • This masking technique is applied in some Transformer-based LMs
  • So the vector at each position is a query

9

slide-16
SLIDE 16

Multi-headed Attention

  • So far: a single attention mechanism.
  • Could be a bottleneck: need to pay attention to different vectors for different reasons.
  • Multi-headed: several attention mechanisms in parallel.

10
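A rough sketch (my own, not the authors’ code) of multi-head attention: each head projects Q, K, V into a smaller subspace, attends independently, and the head outputs are concatenated and projected back.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, as on the previous slide."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Project X into per-head subspaces, attend in each head independently,
    then concatenate the head outputs and apply the output projection."""
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = [attention(Q[:, h*d_head:(h+1)*d_head],
                       K[:, h*d_head:(h+1)*d_head],
                       V[:, h*d_head:(h+1)*d_head]) for h in range(num_heads)]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy self-attention: 6 positions, d_model=8, 2 heads, random projections
X = np.random.randn(6, 8)
W = [np.random.randn(8, 8) for _ in range(4)]
print(multi_head_attention(X, *W, num_heads=2).shape)   # (6, 8)
```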


slide-19
SLIDE 19

Representing Order

11

slide-20
SLIDE 20

Representing Order

  • No notion of order in the Transformer. Order is represented via positional encodings.

11

slide-21
SLIDE 21

Representing Order

  • No notion of order in the Transformer. Order is represented via positional encodings.

11

source

slide-22
SLIDE 22

Representing Order

  • No notion of order in the Transformer. Order is represented via positional encodings.
  • Usually fixed, though they can be learned.

11

source

slide-23
SLIDE 23

Representing Order

  • No notion of order in the Transformer. Order is represented via positional encodings.
  • Usually fixed, though they can be learned.
  • Learned encodings give no significant improvement and generalize less well.

11

source
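A small sketch (my own illustration; it assumes the sinusoidal scheme of the original Transformer paper) of the fixed positional encodings: each position gets a vector of sines and cosines at geometrically spaced frequencies, which is added to the token embedding.

```python
import numpy as np

def sinusoidal_positional_encodings(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe   # added to the (max_len, d_model) token embeddings

print(sinusoidal_positional_encodings(50, 16).shape)   # (50, 16)
```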

slide-24
SLIDE 24

Initial WMT Results

12

slide-25
SLIDE 25

Initial WMT Results

12

More later on why this is important

slide-26
SLIDE 26

Attention Visualization: Coreference?

13

source

slide-27
SLIDE 27

Transformer: Summary

  • Entirely feed-forward
  • Therefore massively parallelizable
  • RNNs are inherently sequential, a parallelization bottleneck
  • (Self-)attention everywhere
  • Long-term dependencies:
  • LSTM: has to maintain representation of early item
  • Transformer: very short “path-lengths”

14

slide-28
SLIDE 28

Transfer Learning and Pre-training

15

slide-29
SLIDE 29

NLP’s “ImageNet Moment”

16

link

slide-30
SLIDE 30

What is ImageNet?

17

CVPR ‘09

slide-31
SLIDE 31

Why is ImageNet Important?

18

link

slide-32
SLIDE 32

Why is ImageNet Important?

  • 1. Deep learning
  • 2. Transfer learning

18

link

slide-33
SLIDE 33

ILSVRC results

19 source

slide-34
SLIDE 34

ILSVRC results

19 source

AlexNet (CNN)

slide-35
SLIDE 35

Transfer Learning

“We use features extracted from the OverFeat network as a generic image representation to tackle the diverse range of recognition tasks of object image classification, scene recognition, fine grained recognition, attribute detection and image retrieval applied to a diverse set of datasets. We selected these tasks and datasets as they gradually move further away from the original task and data the OverFeat network was trained to solve [cf. ImageNet]. Astonishingly, we report consistent superior results compared to the highly tuned state-of-the-art systems in all the visual classification tasks on various datasets.”

20

slide-36
SLIDE 36

Standard Supervised Learning

21

Task 1 inputs Task 1 outputs

slide-37
SLIDE 37

Standard Supervised Learning

21

Task 1 inputs Task 1 outputs Task 2 inputs Task 2 outputs

slide-38
SLIDE 38

Standard Supervised Learning

21

Task 1 inputs Task 1 outputs Task 2 inputs Task 2 outputs Task 3 inputs Task 3 outputs

slide-39
SLIDE 39

Standard Supervised Learning

21

Task 1 inputs Task 1 outputs Task 2 inputs Task 2 outputs Task 3 inputs Task 3 outputs Task 4 inputs Task 4 outputs

slide-40
SLIDE 40

Standard Learning

  • New task = new model
  • Expensive!
  • Training time
  • Storage space
  • Data availability
  • Can be impossible in low-data regimes

22

slide-41
SLIDE 41

Transfer Learning

23

“pre-training” task inputs “pre-training” task outputs

slide-42
SLIDE 42

Transfer Learning

23

“pre-training” task outputs

slide-43
SLIDE 43

Transfer Learning

23

“pre-training” task outputs Task 1 inputs

slide-44
SLIDE 44

Transfer Learning

23

Task 1 inputs

slide-45
SLIDE 45

Transfer Learning

23

Task 1 inputs Task 1 outputs

slide-46
SLIDE 46

Transfer Learning

23

Task 1 outputs

slide-47
SLIDE 47

Transfer Learning

23

Task 1 outputs Task 2 inputs

slide-48
SLIDE 48

Transfer Learning

23

Task 1 outputs Task 2 inputs Task 2 outputs

slide-49
SLIDE 49

Transfer Learning

23

Task 1 outputs Task 2 outputs

slide-50
SLIDE 50

Transfer Learning

23

Task 1 outputs Task 2 outputs Task 3 inputs

slide-51
SLIDE 51

Transfer Learning

23

Task 1 outputs Task 2 outputs Task 3 outputs Task 3 inputs

slide-52
SLIDE 52

Transfer Learning

23

Task 1 outputs Task 2 outputs Task 3 outputs

slide-53
SLIDE 53

Transfer Learning

23

Task 1 outputs Task 2 outputs Task 3 outputs

Pre-trained model, either:
  • General feature extractor
  • Fine-tuned on tasks
slide-54
SLIDE 54

Example: Scene Parsing

24

slide-55
SLIDE 55

Example: Scene Parsing

25

CVPR ’17 paper

slide-56
SLIDE 56

Example: Scene Parsing

25

CVPR ’17 paper

Pre-trained ResNet

slide-57
SLIDE 57

Transfer Learning in NLP

26

slide-58
SLIDE 58

Where to transfer from?

27

slide-59
SLIDE 59

Where to transfer from?

  • Goal: find a linguistic task that will build general-purpose / transferable

representations

27

slide-60
SLIDE 60

Where to transfer from?

  • Goal: find a linguistic task that will build general-purpose / transferable

representations

  • Possibilities:

27

slide-61
SLIDE 61

Where to transfer from?

  • Goal: find a linguistic task that will build general-purpose / transferable

representations

  • Possibilities:
  • Constituency or dependency parsing

27

slide-62
SLIDE 62

Where to transfer from?

  • Goal: find a linguistic task that will build general-purpose / transferable

representations

  • Possibilities:
  • Constituency or dependency parsing
  • Semantic parsing

27

slide-63
SLIDE 63

Where to transfer from?

  • Goal: find a linguistic task that will build general-purpose / transferable

representations

  • Possibilities:
  • Constituency or dependency parsing
  • Semantic parsing
  • Machine translation

27

slide-64
SLIDE 64

Where to transfer from?

  • Goal: find a linguistic task that will build general-purpose / transferable

representations

  • Possibilities:
  • Constituency or dependency parsing
  • Semantic parsing
  • Machine translation
  • QA

27


slide-66
SLIDE 66

Where to transfer from?

  • Goal: find a linguistic task that will build general-purpose / transferable representations
  • Possibilities:
    • Constituency or dependency parsing
    • Semantic parsing
    • Machine translation
    • QA
  • Scalability issue: all require expensive annotation

27

slide-67
SLIDE 67

Language Modeling

28

slide-68
SLIDE 68

Language Modeling

  • Recent innovation: use language modeling (a.k.a. next word prediction)
  • And variants thereof

28

slide-69
SLIDE 69

Language Modeling

  • Recent innovation: use language modeling (a.k.a. next word prediction)
  • And variants thereof
  • Linguistic knowledge:
  • The students were happy because ____ …
  • The student was happy because ____ …

28

slide-70
SLIDE 70

Language Modeling

  • Recent innovation: use language modeling (a.k.a. next word prediction)
  • And variants thereof
  • Linguistic knowledge:
  • The students were happy because ____ …
  • The student was happy because ____ …
  • World knowledge:
  • The POTUS gave a speech after missiles were fired by _____
  • The Seattle Sounders are so-named because Seattle lies on the Puget _____

28

slide-71
SLIDE 71

Language Modeling is “Unsupervised”

  • An example of “unsupervised” or “semi-supervised” learning
  • NB: I think that “un-annotated” is a better term. Formally, the learning is supervised, but the labels come directly from the data, not from an annotator.
  • E.g.: “Today is the first day of 575.”
    • (<s>, Today)
    • (<s> Today, is)
    • (<s> Today is, the)
    • (<s> Today is the, first)

29
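A tiny sketch (my own illustration, not from the slides) of how these (context, next-word) training pairs come straight out of raw text, with no annotator required:

```python
def lm_training_pairs(sentence):
    """Turn a raw sentence into (context, next_word) pairs for next-word prediction."""
    tokens = ["<s>"] + sentence.split()
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

for context, target in lm_training_pairs("Today is the first day of 575."):
    print(context, "->", target)
# (('<s>',), 'Today'), (('<s>', 'Today'), 'is'), (('<s>', 'Today', 'is'), 'the'), ...
```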

slide-72
SLIDE 72

Data for LM is cheap

30


slide-74
SLIDE 74

Data for LM is cheap

30

Text

slide-75
SLIDE 75

Text is abundant

  • News sites (e.g. Google 1B)
  • Wikipedia (e.g. WikiText103)
  • Reddit
  • ….
  • General web crawling:
  • https://commoncrawl.org/

31

slide-76
SLIDE 76

The Revolution will not be [Annotated]

32

https://twitter.com/rgblong/status/916062474545319938?lang=en

Yann LeCun

slide-77
SLIDE 77

ULMFiT

33

Universal Language Model Fine-tuning for Text Classification (ACL ’18)

slide-78
SLIDE 78

ULMFiT

34

slide-79
SLIDE 79

ULMFiT

35

slide-80
SLIDE 80

Deep Contextualized Word Representations


Peters et al. (2018)

36

slide-81
SLIDE 81

Deep Contextualized Word Representations


Peters et al. (2018)

  • NAACL 2018 Best Paper Award

36

slide-82
SLIDE 82

Deep Contextualized Word Representations


Peters et al. (2018)

  • NAACL 2018 Best Paper Award
  • Embeddings from Language Models (ELMo)
  • [aka the OG NLP Muppet]

36

slide-83
SLIDE 83

Deep Contextualized Word Representations


Peters et al. (2018)

  • Comparison to GloVe:

37

Source / Nearest Neighbors:

GloVe:
  play → playing, game, games, played, players, plays, player, Play, football, multiplayer

biLM:
  “Chico Ruiz made a spectacular play on Alusik’s grounder…” → “Kieffer, the only junior in the group, was commended for his ability to hit in the clutch, as well as his all-round excellent play.”
  “Olivia De Havilland signed to do a Broadway play for Garson…” → “…they were actors who had been handed fat roles in a successful play, and had talent enough to fill the roles competently, with nice understatement.”

slide-84
SLIDE 84

Deep Contextualized Word Representations


Peters et al. (2018)

  • Used in place of other embeddings on multiple tasks:

38

SQuAD = Stanford Question Answering Dataset
SNLI = Stanford Natural Language Inference Corpus
SST-5 = Stanford Sentiment Treebank

figure: Matthew Peters

slide-85
SLIDE 85

BERT: Bidirectional Encoder Representations from Transformers

Devlin et al., NAACL 2019

39

slide-86
SLIDE 86

Overview

  • Encoder Representations from Transformers:
    • Bidirectional: …?
    • BiLSTM (ELMo): left-to-right and right-to-left
    • Self-attention: every token can see every other
  • How do you treat the encoder as an LM (as computing $P(w_t \mid w_{t-1}, w_{t-2}, \ldots, w_1)$)?
    • Don’t: modify the task

40

slide-87
SLIDE 87

Masked Language Modeling

  • Language modeling: next word prediction
  • Masked Language Modeling (a.k.a. the cloze task): fill in the blank
    • Nancy Pelosi sent the articles of ____ to the Senate.
    • Seattle ____ some snow, so UW was delayed due to ____ roads.
  • I.e. $P(w_t \mid w_{t+k}, w_{t+(k-1)}, \ldots, w_{t+1}, w_{t-1}, \ldots, w_{t-(m-1)}, w_{t-m})$
    • (very similar to CBOW: continuous bag of words from word2vec)
  • Auxiliary training task: next sentence prediction.
    • Given sentences A and B, binary classification: did B follow A in the corpus or not?

41

slide-88
SLIDE 88

Schematically

42

slide-89
SLIDE 89

Some details

43

slide-90
SLIDE 90

Some details

  • BASE model:
  • 12 Transformer Blocks
  • Hidden vector size: 768
  • Attention heads / layer: 12
  • Total parameters: 110M

43

slide-91
SLIDE 91

Some details

  • BASE model:
  • 12 Transformer Blocks
  • Hidden vector size: 768
  • Attention heads / layer: 12
  • Total parameters: 110M
  • LARGE model:
  • 24 Transformer Blocks
  • Hidden vector size: 1024
  • Attention heads / layer: 16
  • Total parameters: 340M

43

slide-92
SLIDE 92

Input Representation

44

slide-93
SLIDE 93

Input Representation

44

  • [CLS], [SEP]: special tokens
slide-94
SLIDE 94

Input Representation

44

  • [CLS], [SEP]: special tokens
  • Segment: is this a token from sentence A or B?
slide-95
SLIDE 95

Input Representation

44

  • [CLS], [SEP]: special tokens
  • Segment: is this a token from sentence A or B?
  • Position embeddings: provide position in sequence (learned, not fixed, in this case)
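A minimal sketch (my own illustration with toy dimensions and illustrative token ids; not the authors’ code) of how BERT’s input vectors are formed: the wordpiece embedding, segment (A/B) embedding, and position embedding for each token are simply summed.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes for illustration; BERT-BASE uses vocab=30522, max_len=512, d_model=768
vocab_size, max_len, d_model = 1000, 32, 16

token_emb = rng.normal(size=(vocab_size, d_model))    # wordpiece embeddings
segment_emb = rng.normal(size=(2, d_model))           # sentence A vs. sentence B
position_emb = rng.normal(size=(max_len, d_model))    # learned position embeddings

# Hypothetical ids for: [CLS] tok tok [SEP] tok tok [SEP]
token_ids   = np.array([101, 7, 8, 102, 9, 10, 102])
segment_ids = np.array([0,   0, 0, 0,   1, 1,  1])
positions   = np.arange(len(token_ids))

# The input vector at each position is the sum of the three embeddings
inputs = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(inputs.shape)   # (7, 16): one summed vector per wordpiece
```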

slide-97
SLIDE 97

WordPiece Embeddings

  • Another solution to the OOV problem, from the NMT context (see Wu et al. 2016)
  • Main idea:
    • Fix vocabulary size |V| in advance [for BERT: 30k]
    • Choose |V| wordpieces (subwords) such that the total number of wordpieces in the corpus is minimized
    • Frequent words aren’t split, but rarer ones are (see the tokenizer sketch below)
  • NB: this is a small issue when you transfer to / evaluate on pre-existing tagging datasets with their own vocabularies.

45
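For illustration, a quick way to see WordPiece in action is through the HuggingFace Transformers tokenizer (a library mentioned at the end of this lecture; the splits shown in the comments are indicative, not guaranteed):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("The students were happy"))
# frequent words stay whole, e.g. ['the', 'students', 'were', 'happy']
print(tokenizer.tokenize("unaffable electrocardiogram"))
# rare words split into '##'-prefixed pieces, e.g. ['una', '##ffa', '##ble', ...]
```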

slide-98
SLIDE 98

Training Details

  • BooksCorpus (800M words) + Wikipedia (2.5B)
  • Masking the input text. 15% of all tokens are chosen. Then:
  • 80% of the time: replaced by designated ‘[MASK]’ token
  • 10% of the time: replaced by random token
  • 10% of the time: unchanged
  • Loss is cross-entropy of the prediction at the masked positions.
  • Max seq length: 512 tokens (final 10%; 128 for first 90%)
  • 1M training steps, batch size 256 = 4 days on 4 or 16 TPUs

46
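A minimal sketch of the 80/10/10 masking recipe above (my own illustration, not the reference implementation); `vocab` is a hypothetical list of token strings:

```python
import random

def mask_for_mlm(tokens, vocab, select_prob=0.15):
    """Choose ~15% of tokens as prediction targets; of those, 80% become
    [MASK], 10% become a random token, 10% are left unchanged."""
    corrupted = list(tokens)
    targets = {}                          # position -> original token
    for i, tok in enumerate(tokens):
        if random.random() < select_prob:
            targets[i] = tok
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)
            # else: keep the original token unchanged
    return corrupted, targets             # loss = cross-entropy at target positions
```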

slide-99
SLIDE 99

Initial Results

47

slide-100
SLIDE 100

Ablations

48

  • Not a given (depth doesn’t help ELMo); possibly a difference between fine-tuning vs. feature extraction
  • Many more variations to explore
slide-101
SLIDE 101

Major Application

49

https://www.blog.google/products/search/search-language-understanding-bert/

slide-102
SLIDE 102

Major Application

50

slide-103
SLIDE 103

Pre-trained Neural Models Everywhere

51

General Language Understanding Evaluation (GLUE) / SuperGLUE

slide-104
SLIDE 104

Note on the costs of LMs

52

slide-105
SLIDE 105

Note on the costs of LMs

  • Currently something of an ‘arms race’ between e.g. Google, Facebook, OpenAI, MS, Baidu

52

slide-106
SLIDE 106

Note on the costs of LMs

  • Currently something of an ‘arms race’ between e.g. Google, Facebook, OpenAI, MS, Baidu

  • Hugely expensive
  • Carbon emissions
  • Monetarily
  • Inequitable access

52


slide-110
SLIDE 110

Note on the costs of LMs

  • Currently something of an ‘arms race’ between e.g. Google, Facebook, OpenAI, MS, Baidu
  • Hugely expensive
    • Carbon emissions
    • Monetarily
    • Inequitable access
  • A role for interpretability/analysis:
    • Bigger is better, but which factors really matter?

52

slide-111
SLIDE 111

Sidebar: Word Embeddings

53

slide-112
SLIDE 112

Sidebar: Word Embeddings

  • Aren’t word embeddings like word2vec and GloVe examples of transfer learning?
  • Yes: get linguistic representations from raw text to use in downstream tasks
  • No: not to be used as general-purpose representations

53

slide-113
SLIDE 113

Sidebar: Word Embeddings

54

slide-114
SLIDE 114

Sidebar: Word Embeddings

  • One distinction:
  • Global representations:
  • word2vec, GloVe: one vector for each word type (e.g. ‘play’)
  • Contextual representations (from LMs):
  • Representation of word in context, not independently

54

slide-115
SLIDE 115

Sidebar: Word Embeddings

  • One distinction:
  • Global representations:
  • word2vec, GloVe: one vector for each word type (e.g. ‘play’)
  • Contextual representations (from LMs):
  • Representation of word in context, not independently
  • Another:
  • Shallow (global) vs. Deep (contextual) pre-training

54

slide-116
SLIDE 116

Global Embeddings: Models

55

slide-117
SLIDE 117

Global Embeddings: Models

55

Mikolov et al 2013a (the OG word2vec paper)

slide-118
SLIDE 118

Shallow vs Deep Pre-training

56

[Figure: shallow vs. deep pre-training — shallow: raw tokens → global embedding → model for task; deep: raw tokens → contextual embedding (pre-trained) → model for task]

slide-119
SLIDE 119

State of the Field

  • Manning 2017: “The BiLSTM Hegemony”
  • Right now: “The pre-trained Transformer Hegemony”
  • By default: fine-tune a large pre-trained Transformer on the task you care about
    • Will often yield the best results
    • Beware: often not significantly better than very simple baselines (SVM, etc.)
  • Very useful library to quickly use these models: HuggingFace Transformers (see the sketch below)
    • https://huggingface.co/transformers/
  • Variants of BERT differ on: hyper-parameters, architectural choices, pre-training tasks, …

57
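As referenced above, a minimal sketch of loading a pre-trained BERT with the HuggingFace Transformers library and extracting contextual representations (fine-tuning would add a task-specific head and backpropagate through the whole model):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The students were happy because class ended early.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per wordpiece: (batch, seq_len, 768 for BERT-BASE)
print(outputs.last_hidden_state.shape)
```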