Linguistic Knowledge and Transferability of Contextual Representations (PowerPoint presentation)

SLIDE 1

Linguistic Knowledge and Transferability of Contextual Representations

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, Noah A. Smith

UWNLP
NAACL 2019, June 3, 2019

SLIDE 2

Contextual Word Representations Are Extraordinarily Effective

  • Contextual word representations (from contextualizers like ELMo or BERT) work well on many NLP tasks.
  • But why do they work so well?
  • Better understanding enables principled enhancement.
  • This work studies a few questions about their generalizability and transferability.

[McCann et al., 2017; Peters et al., 2018a; Devlin et al., 2019, inter alia]

SLIDE 3

(1) Probing Contextual Representations

Question: Is the information necessary for a variety of core NLP tasks linearly recoverable from contextual word representations?

Answer: Yes, to a great extent! Tasks with lower performance may require fine-grained linguistic knowledge.

SLIDE 4

(2) How Does Transferability Vary?

Question: How does transferability vary across contextualizer layers?

Answer: In LSTM-based contextualizers, the first layer is the most transferable; in transformer-based contextualizers, the middle layers are.
SLIDE 5

(3) Why Does Transferability Vary?

Question: Why does transferability vary across contextualizer layers?

Answer: It depends on pretraining task-specificity: the more specialized a layer becomes to the pretraining task, the less transferable it is.

SLIDE 6

(4) Alternative Pretraining Objectives

Question: How does language model pretraining compare to alternatives?

Answer: Even with 1 million tokens of pretraining data, language model pretraining yields the most transferable representations. But transferring between related tasks does help.

SLIDE 7

Probing Models

[Shi et al., 2016; Adi et al., 2017]

SLIDE 12

Pairwise Probing

[Belinkov, 2018; Blevins et al., 2018; Tenney et al., 2019]

SLIDE 20
Probing Model Setup

  • Contextualizer weights are always frozen.
  • Results are from the highest-performing contextualizer layer.
  • We use a linear probing model (a minimal sketch follows below).
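To make the setup concrete, here is a minimal sketch of a linear probe trained on frozen contextual word representations, using PyTorch. It assumes per-token representations have already been extracted from one frozen contextualizer layer; the names `rep_dim`, `num_labels`, and `train_batches` are placeholders, not names from the paper's codebase.

```python
import torch
import torch.nn as nn


class LinearProbe(nn.Module):
    """A single affine map from frozen contextual representations to task labels."""

    def __init__(self, rep_dim: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(rep_dim, num_labels)

    def forward(self, reps: torch.Tensor) -> torch.Tensor:
        # reps: (num_tokens, rep_dim), precomputed from a frozen contextualizer layer.
        return self.classifier(reps)


def train_probe(probe: nn.Module, train_batches, epochs: int = 10, lr: float = 1e-3):
    # Only the probe's parameters are updated; the contextualizer never is.
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for reps, labels in train_batches:
            optimizer.zero_grad()
            loss_fn(probe(reps), labels).backward()
            optimizer.step()
    return probe
```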

SLIDE 21

Contextualizers Analyzed

SLIDE 22

Contextualizers Analyzed

ELMo: Bidirectional language model (BiLM) pretraining on the 1B Word Benchmark

  • 2-layer LSTM (ELMo original)
  • 4-layer LSTM (ELMo 4-layer)
  • 6-layer Transformer (ELMo transformer)

[Peters et al., 2018a,b]

SLIDE 23

Contextualizers Analyzed

ELMo: Bidirectional language model (BiLM) pretraining on the 1B Word Benchmark

  • 2-layer LSTM (ELMo original)
  • 4-layer LSTM (ELMo 4-layer)
  • 6-layer Transformer (ELMo transformer)

OpenAI Transformer: Left-to-right language model pretraining on uncased BookCorpus

  • 12-layer transformer

[Peters et al., 2018a,b; Radford et al., 2018]

SLIDE 24

Contextualizers Analyzed

ELMo: Bidirectional language model (BiLM) pretraining on the 1B Word Benchmark

  • 2-layer LSTM (ELMo original)
  • 4-layer LSTM (ELMo 4-layer)
  • 6-layer Transformer (ELMo transformer)

OpenAI Transformer: Left-to-right language model pretraining on uncased BookCorpus

  • 12-layer transformer

BERT (cased): Masked language model pretraining on BookCorpus + Wikipedia

  • 12-layer transformer (BERT base)
  • 24-layer transformer (BERT large)

[Peters et al., 2018a,b; Radford et al., 2018; Devlin et al., 2019]
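The paper extracted per-layer features with its own pipeline (e.g., ELMo via AllenNLP). As a hedged illustration of the general recipe rather than the authors' code, here is how one might pull frozen per-layer token representations from a BERT model with the Hugging Face transformers library; the model name and sentence are just examples.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Example only: any of the contextualizers listed above could stand in here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
model.eval()  # frozen: the contextualizer is never updated or fine-tuned

inputs = tokenizer("The city decided to treat its guests like royalty.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple with one tensor per layer
# (index 0 is the embedding layer), each of shape (batch, seq_len, hidden_dim).
# A probe is trained on one layer's outputs, which stay fixed.
per_layer_reps = outputs.hidden_states
```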

SLIDE 25

(1) Probing Contextual Representations

Question: Is the information necessary for a variety of core NLP tasks linearly recoverable from contextual word representations?

Answer: Yes, to a great extent! Tasks with lower performance may require fine-grained linguistic knowledge.

SLIDE 26

Examined 17 Diverse Probing Tasks

  • Part-of-Speech Tagging
  • CCG Supertagging
  • Semantic Tagging
  • Preposition Supersense Disambiguation
  • Event Factuality
  • Syntactic Constituency Ancestor Tagging
  • Syntactic Chunking
  • Named Entity Recognition
  • Grammatical Error Detection
  • Conjunct Identification
  • Syntactic Dependency Arc Prediction
  • Syntactic Dependency Arc Classification
  • Semantic Dependency Arc Prediction
  • Semantic Dependency Arc Classification
  • Coreference Arc Prediction
SLIDE 27

Linear Probing Models Rival Task-Specific Architectures

SLIDE 28

CCG Supertagging

[Bar chart: CCG supertagging accuracy (axis 25 to 100) for GloVe, ELMo (original), ELMo (transformer), OpenAI Transformer, BERT (large), and SOTA]

SLIDE 29

CCG Supertagging

[Bar chart: CCG supertagging accuracy; GloVe baseline: 71.58]

SLIDE 30

CCG Supertagging

[Bar chart: CCG supertagging accuracy for GloVe and the contextualizers; values shown include 71.58, 82.69, 92.68, 93.31, and 94.28]

SLIDE 31

CCG Supertagging

[Bar chart: CCG supertagging accuracy for GloVe, the contextualizers, and SOTA; values shown include 71.58, 82.69, 92.68, 93.31, 94.28, and 94.7 (SOTA)]

SLIDE 32

Event Factuality

[Bar chart: event factuality, Pearson correlation (r) x 100, for GloVe, ELMo (original), ELMo (transformer), OpenAI Transformer, BERT (large), and SOTA]

SLIDE 33

Event Factuality

[Bar chart: event factuality, Pearson correlation (r) x 100; values shown include 49.70, 70.88, 73.20, 74.03, 76.25, and 77.10]

SLIDE 34

But Linear Probing Models Underperform on Some Tasks

  • Tasks on which a linear model over contextual word representations performs poorly may require more fine-grained linguistic knowledge.
  • In these cases, task-specific contextualization leads to especially large gains, as in the sketch below. See the paper for more details.
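As a rough illustration of what task-specific contextualization can look like (an assumption-laden sketch, not the paper's exact architecture), one can train a small BiLSTM on top of the frozen representations before the linear classifier:

```python
import torch
import torch.nn as nn


class ContextualizedProbe(nn.Module):
    """Probe with task-trained contextualization: a small BiLSTM runs over
    the frozen contextualizer outputs, and only the BiLSTM and classifier
    are trained on the target task."""

    def __init__(self, rep_dim: int, hidden_dim: int, num_labels: int):
        super().__init__()
        self.lstm = nn.LSTM(rep_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, reps: torch.Tensor) -> torch.Tensor:
        # reps: (batch, seq_len, rep_dim) precomputed frozen features.
        contextualized, _ = self.lstm(reps)
        return self.classifier(contextualized)
```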

SLIDE 35

Named Entity Recognition

[Bar chart: named entity recognition F1 (axis 25 to 100) for GloVe, ELMo (original), ELMo (transformer), OpenAI Transformer, BERT (large), and SOTA]

SLIDE 36

Named Entity Recognition

[Bar chart: named entity recognition F1; GloVe baseline: 53.22]

SLIDE 37

Named Entity Recognition

[Bar chart: named entity recognition F1 for GloVe and the contextualizers; values shown include 53.22, 58.14, 81.21, 82.85, and 84.44]

SLIDE 38

Named Entity Recognition

[Bar chart: named entity recognition F1 for GloVe, the contextualizers, and SOTA; values shown include 53.22, 58.14, 81.21, 82.85, 84.44, and 91.38 (SOTA)]

SLIDE 39

(2) How Does Transferability Vary?

Question: How does transferability vary across contextualizer layers?

Answer: In LSTM-based contextualizers, the first layer is the most transferable; in transformer-based contextualizers, the middle layers are.
SLIDE 40

Layerwise Patterns in Transferability

SLIDE 41

Layerwise Patterns in Transferability

LSTM-based Contextualizers

[Heatmaps: layerwise transferability across tasks for ELMo (original) and ELMo (4-layer); x-axis: tasks]
SLIDE 43

Layerwise Patterns in Transferability

[Heatmaps: layerwise transferability across tasks; LSTM-based contextualizers (ELMo original, ELMo 4-layer) shown alongside transformer-based contextualizers; x-axis: tasks]
SLIDE 44

Layerwise Patterns in Transferability

[Heatmaps: layerwise transferability across tasks for all six contextualizers: ELMo (original), ELMo (4-layer), ELMo (transformer), OpenAI Transformer, BERT (base, cased), BERT (large, cased); x-axis: tasks]
SLIDE 45

(3) Why Does Transferability Vary?

Question: Why does transferability vary across contextualizer layers?

Answer: It depends on pretraining task-specificity: the more specialized a layer becomes to the pretraining task, the less transferable it is.

SLIDE 46

Layerwise Patterns Dictated by Perplexity

LSTM-based: ELMo (original)

[Bar chart: BiLM perplexity at each layer (Layer 0, Layer 1, Layer 2); values shown: 235, 920, 7026]

Outputs of higher LSTM layers are better for language modeling (have lower perplexity)
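One way to read these numbers: attach a linear next-word probe to each frozen layer's outputs and measure its perplexity on held-out text. A minimal sketch under those assumptions (not the paper's exact procedure; full-batch training for brevity, with placeholder tensor names):

```python
import math

import torch
import torch.nn as nn


def lm_probe_perplexity(train_reps, train_next_ids, eval_reps, eval_next_ids,
                        vocab_size, epochs=20, lr=1e-3):
    """Train a linear softmax over the vocabulary to predict the next word
    from one frozen layer's representations; report held-out perplexity.
    train_reps/eval_reps: (num_tokens, rep_dim); *_next_ids: (num_tokens,)."""
    probe = nn.Linear(train_reps.size(1), vocab_size)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss_fn(probe(train_reps), train_next_ids).backward()
        optimizer.step()
    with torch.no_grad():
        nll = loss_fn(probe(eval_reps), eval_next_ids).item()
    return math.exp(nll)  # perplexity = exp(mean negative log-likelihood)
```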

SLIDE 47

Layerwise Patterns Dictated by Perplexity

LSTM-based: ELMo (4-layer)

[Bar chart: BiLM perplexity at each layer (Layer 0 through Layer 4); values shown: 195, 1013, 2363, 2398, 4204]

Outputs of higher LSTM layers are better for language modeling (have lower perplexity)

SLIDE 48

Layerwise Patterns Dictated by Perplexity

Transformer-based: ELMo (6-layer)

[Bar chart: BiLM perplexity at each layer (Layer 0 through Layer 6); values shown: 91, 523, 314, 374, 448, 295, 546]

SLIDE 49

(4) Alternative Pretraining Objectives

Question: How does language model pretraining compare to alternatives?

Answer: Even with 1 million tokens of pretraining data, language model pretraining yields the most transferable representations. But transferring between related tasks does help.

SLIDE 50

Investigating Alternatives to Language Model Pretraining

  • How does language modeling as a pretraining objective compare to explicitly supervised tasks?
  • Pretrain an ELMo (original)-architecture contextualizer on the Penn Treebank with a variety of different objectives.
  • Evaluate how well the resultant representations transfer to target (held-out) tasks.

SLIDE 51

Average Across Target Tasks

[Bar chart: accuracy (axis 10 to 100) by pretraining task: GloVe, Randomly Initialized, Chunking, Semantic Dependency Classification, CCG, Syntactic Dependency Classification, BiLM, BiLM (1B Benchmark)]

SLIDE 52

Average Across Target Tasks

[Bar chart: average accuracy across target tasks; GloVe baseline: 60.55]

SLIDE 53

Average Across Target Tasks

[Bar chart: average accuracy across target tasks; GloVe 60.55, Randomly Initialized 54.42]

SLIDE 54

Average Across Target Tasks

[Bar chart: average accuracy across target tasks by pretraining task; GloVe 60.55, Randomly Initialized 54.42, and values of 63.60, 63.96, 64.67, 66.06, and 66.53 across the PTB pretraining tasks (Chunking, Semantic Dependency Classification, CCG, Syntactic Dependency Classification, BiLM)]

SLIDE 55

Average Across Target Tasks

[Bar chart: as above, with a BiLM pretrained on the 1B Word Benchmark added at 79.05]

See Wang et al. (ACL 2019) "How to Get Past Sesame Street: Sentence-Level Pretraining Beyond Language Modeling" for more tasks + multitasking.

SLIDE 56

Target Task: Syntactic Dependency Classification (EWT)

[Bar chart: accuracy by pretraining task (GloVe, Randomly Initialized, Chunking, Semantic Dependency Classification, CCG, Syntactic Dependency Classification, BiLM); values shown include 70.62, 72.74, 86.86, 87.57, 87.75, 88.11, and 90.98]

Pretraining on related tasks is better than BiLM

SLIDE 57

Target Task: Syntactic Dependency Classification (EWT)

[Bar chart: as above, with a BiLM pretrained on the 1B Word Benchmark added at 93.01]

But a BiLM trained on more data is even better.

SLIDE 58

PTB-trained BiLM vs. ELMo

[Bar chart: accuracy by layer (Layer 0, Layer 1, Layer 2) for a BiLM trained on PTB vs. a BiLM trained on the 1B Word Benchmark; values shown include 64.40, 65.82, 65.91, 66.53, 77.72, and 79.05]

Also found by Saphra and Lopez (2019), check out poster 1402 on Wednesday!

SLIDE 59

Some Related Work at NAACL

  • Wed. June 5, 10:30–12:00. ML & Syntax, Hyatt Exhibit Hall:
    Understanding Learning Dynamics of Language Models with SVCCA. Naomi Saphra and Adam Lopez.
    Structural Supervision Improves Learning of Non-Local Grammatical Dependencies. Ethan Wilcox et al.
    Analysis Methods in Neural Language Processing: A Survey. Yonatan Belinkov and James Glass.
  • Wed. June 5, 16:15–16:30. Machine Learning, Nicollet B/C:
    A Structural Probe for Finding Syntax in Word Representations. John Hewitt and Christopher D. Manning.


Online at: bit.ly/cwr-analysis-related

SLIDE 60

Takeaways

  • Features from pretrained contextualizers are sufficient for high performance on a broad set of tasks.
  • Tasks with lower performance might require fine-grained linguistic knowledge.
  • Layerwise patterns in transferability exist, dictated by how task-specific each layer is.
  • Even on PTB-size data, BiLM pretraining yields the most general representations.
  • Pretraining on related tasks helps.
  • More data helps even more!

Code: http://nelsonliu.me/papers/contextual-repr-analysis

SLIDE 61

Takeaways

Code: http://nelsonliu.me/papers/contextual-repr-analysis

Thanks! Questions?

SLIDE 62

Bonus Slides

SLIDE 63

Probing Task Examples

SLIDE 64

Part-of-Speech Tagging

SLIDE 65

CCG Supertagging

SLIDE 66

Syntactic Constituency Ancestor Tagging

Parent Grandparent Great-Grandparent

SLIDE 67

Semantic Tagging

  • Semantic tags abstract over redundant POS distinctions and disambiguate useful cases within POS tags.
  • (1) Sarah bought herself a book
  • (2) Sarah herself bought a book
  • "herself" has the same POS tag (personal pronoun) in both sentences but a different semantic function: a reflexive function in (1) and an emphasizing function in (2).
SLIDE 68

Preposition Supersense Disambiguation

  • Classify a preposition's lexical semantic contribution (function), or the semantic role / relation it mediates (role).
  • Specialized kind of word sense disambiguation.
SLIDE 69

Preposition Supersense Disambiguation

SLIDE 70

Event Factuality

  • Label predicates with the factuality of events they describe.

Event "leave" did not happen. Event "leaving" happened.

SLIDE 71

Syntactic Chunking

SLIDE 72

Named Entity Recognition

SLIDE 73

Grammatical Error Detection

SLIDE 74

Conjunct Identification

  • And the city decided to treat its guests more like [royalty] or [rock stars] than factory owners.
SLIDE 75

Two Types of Pairwise Relations

  • Arc prediction tasks: Given two random tokens, identify whether a relation exists between them.
  • Arc classification tasks: Given two tokens that are known to be related, identify what the relation is. (Both probe types are sketched below.)
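A minimal sketch of a pairwise probe under a simplified, assumed formulation (score the concatenation of the two frozen token representations with a linear layer; not necessarily the paper's exact parameterization):

```python
import torch
import torch.nn as nn


class PairwiseProbe(nn.Module):
    """Pairwise probing model over frozen token representations.
    Arc prediction: num_labels = 2 (related vs. not related).
    Arc classification: num_labels = number of relation types."""

    def __init__(self, rep_dim: int, num_labels: int):
        super().__init__()
        self.scorer = nn.Linear(2 * rep_dim, num_labels)

    def forward(self, first_reps: torch.Tensor, second_reps: torch.Tensor):
        # first_reps, second_reps: (num_pairs, rep_dim) frozen representations
        # of the two tokens in each candidate pair.
        pair = torch.cat([first_reps, second_reps], dim=-1)
        return self.scorer(pair)
```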

SLIDE 76

Syntactic Dependency Arc Prediction

[Example: two input tokens; label: True, a relation exists between them]

slide-77
SLIDE 77

Syntactic Dependency Arc Prediction

Input Tokens Label: True, there exists a relation

slide-78
SLIDE 78

Syntactic Dependency Arc Prediction

[Example: two input tokens; label: False, no relation exists between them]

SLIDE 79

Syntactic Dependency Arc Classification

SLIDE 80

Syntactic Dependency Arc Classification

SLIDE 81

Syntactic Dependency Arc Classification

SLIDE 82

Syntactic Dependency Arc Classification

SLIDE 83

Semantic Dependencies

SLIDE 84

Coreference Relations

SLIDE 85

Setting Up Alternative Pretraining Objectives

SLIDE 86

Language Model Pretraining

SLIDE 88

Chunking Pretraining

SLIDE 90

Flexible Paradigm, Use Any Task!

SLIDE 91

SLIDE 92
