Linguistic Knowledge and Transferability of Contextual Representations
Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, Noah A. Smith
UWNLP
NAACL 2019, June 3, 2019
Contextual word representations (from pretrained contextualizers like ELMo or BERT) work well on many NLP tasks. This talk examines their generalizability and transferability.
[McCann et al., 2017; Peters et al., 2018a; Devlin et al., 2019, inter alia]
Question: Is the information necessary for a variety of core NLP tasks linearly recoverable from contextual word representations?
Answer: Yes, to a great extent! Tasks with lower performance may require fine-grained linguistic knowledge.
Question: How does transferability vary across contextualizer layers?
Answer: The first layer in LSTMs is the most transferable; middle layers for transformers.
Question: Why does transferability vary across contextualizer layers?
Answer: It depends on pretraining task-specificity!
Question: How does language model pretraining compare to alternatives?
Answer: Even with 1 million tokens, language model pretraining yields the most transferable representations. But transferring between related tasks does help.
We use probing models: linear models trained to predict linguistic properties from frozen contextual representations [Shi et al., 2016; Adi et al., 2017].
[Belinkov, 2018; Blevins et al., 2018; Tenney et al., 2019]
Contextualizers compared:
- ELMo: bidirectional language model (BiLM) pretraining. Variants: 2-layer LSTM (ELMo original), 4-layer LSTM (ELMo 4-layer), 6-layer transformer (ELMo transformer).
- OpenAI Transformer: left-to-right language model pretraining on the uncased BookCorpus; 12-layer transformer.
- BERT (cased): masked language model pretraining on BookCorpus + Wikipedia; 12-layer transformer (BERT base) and 24-layer transformer (BERT large).

[Peters et al., 2018a,b; Radford et al., 2018; Devlin et al., 2019]
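Probing these models means taking each layer's frozen outputs; ELMo additionally combines layers with a learned "scalar mix": softmax-normalized per-layer weights scaled by a factor gamma (Peters et al., 2018a). A minimal numpy sketch of that mixing operation follows; the layer count and dimensions are illustrative, not the models' real shapes.

```python
import numpy as np

def scalar_mix(layer_outputs, weights, gamma=1.0):
    """ELMo-style scalar mix: softmax-normalized layer weights, scaled by gamma."""
    s = np.exp(weights - np.max(weights))  # numerically stable softmax
    s /= s.sum()
    # weighted sum over the layer axis: (num_layers, seq_len, dim) -> (seq_len, dim)
    return gamma * np.tensordot(s, layer_outputs, axes=1)

# three toy "layers" of a 4-token, 8-dim sequence, filled with 0, 1, 2
layers = np.stack([np.full((4, 8), float(l)) for l in range(3)])
mixed = scalar_mix(layers, np.zeros(3))  # equal weights -> plain average
```

With equal weights the mix reduces to the layer average, which makes the operation easy to sanity-check before learning the weights on a downstream task.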
Question: Is the information necessary for a variety of core NLP tasks linearly recoverable from contextual word representations?
Answer: Yes, to a great extent! Tasks with lower performance may require fine-grained linguistic knowledge.
[Figure: the probing task suite, including part-of-speech tagging, preposition supersense disambiguation, constituency ancestor tagging, chunking, named entity recognition, grammatical error detection, conjunct identification, event factuality prediction, and syntactic / semantic dependency arc prediction and arc classification]
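The probing setup behind these tasks can be sketched end to end: freeze the contextualizer and train a linear model on its per-token vectors. In this hypothetical sketch, randomly generated vectors with a label-dependent shift stand in for real contextual representations, and the probe is fit in closed form with ridge regression; the function names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_tokens(n, dim=16, n_labels=3):
    # synthetic stand-ins for per-token contextual representations:
    # each label shifts the vector along one coordinate
    labels = rng.integers(0, n_labels, size=n)
    reps = rng.normal(size=(n, dim))
    reps[np.arange(n), labels] += 3.0
    return reps, labels

def fit_linear_probe(reps, labels, n_labels=3, ridge=1e-3):
    # one-hot targets; closed-form ridge regression with a bias column
    X = np.hstack([reps, np.ones((len(reps), 1))])
    Y = np.eye(n_labels)[labels]
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)

def probe_accuracy(W, reps, labels):
    X = np.hstack([reps, np.ones((len(reps), 1))])
    return ((X @ W).argmax(axis=1) == labels).mean()

X_train, y_train = make_tokens(2000)
X_test, y_test = make_tokens(500)
W = fit_linear_probe(X_train, y_train)
acc = probe_accuracy(W, X_test, y_test)
```

If a simple linear model recovers the labels from frozen vectors, the information is "linearly recoverable", which is exactly what the probing accuracies on the following charts measure.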
[Bar chart: linear probing accuracy on one task for GloVe, ELMo (original), ELMo (transformer), OpenAI Transformer, and BERT (large), with SOTA for reference. GloVe: 71.58; the contextualizers: 82.69, 92.68, 93.31, 94.28; SOTA: 94.7]
[Bar chart: Pearson correlation (r × 100) on one task for the same models. GloVe: 49.70; the contextualizers: 70.88, 73.20, 74.03, 76.25; SOTA: 77.10]
Tasks where the linear probe performs poorly may require more fine-grained linguistic knowledge. Some tasks show especially large gains from contextualization; see the paper for more details.
[Bar chart: F1 on one task for the same models. GloVe: 53.22; the contextualizers: 58.14, 81.21, 82.85, 84.44; SOTA: 91.38]
Question: How does transferability vary across contextualizer layers?
Answer: The first layer in LSTMs is the most transferable; middle layers for transformers.
[Heatmaps: per-layer probing performance across tasks for ELMo (original), ELMo (4-layer), ELMo (transformer), OpenAI Transformer, BERT (base, cased), and BERT (large, cased)]
Question: Why does transferability vary across contextualizer layers?
Answer: It depends on pretraining task-specificity!
[Bar charts: perplexity of a language-model softmax trained on each layer's outputs. ELMo (original), layers 0–2: 7026, 920, 235; ELMo (4-layer), layers 0–4: 4204, 2398, 2363, 1013, 195; ELMo (transformer), layers 0–6: 546, 523, 448, 374, 314, 295, 91 (transformer per-layer mapping as extracted; uncertain)]
Outputs of higher LSTM layers are better for language modeling (they have lower perplexity).
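Perplexity, the metric in the charts above, is the exponentiated average negative log-likelihood the model assigns to the observed tokens. A toy illustration of the metric itself (not of the paper's per-layer experiment):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the true tokens)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# a model that is clueless over a 10,000-word vocabulary: perplexity ~ 10000
uniform = perplexity([1 / 10000] * 5)
# a model that assigns probability 0.5 to each true token: perplexity ~ 2
sharp = perplexity([0.5] * 5)
```

Lower perplexity means the layer's outputs retain more of the information the language-modeling objective needs, which is why it serves here as a proxy for how LM-task-specific each layer is.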
Question: How does language model pretraining compare to alternatives?
Answer: Even with 1 million tokens, language model pretraining yields the most transferable representations. But transferring between related tasks does help.
Setup: pretrain contextualizers on the Penn Treebank with a variety of different objectives, then measure how well the resulting representations transfer to target (held-out) tasks.
[Bar chart: target-task accuracy by pretraining method. GloVe: 60.55; randomly initialized: 54.42; models pretrained on PTB with chunking, semantic dependency classification, CCG, syntactic dependency classification, and BiLM objectives: 63.60–66.53; BiLM trained on the 1B Word Benchmark: 79.05]
See Wang et al. (ACL 2019) "How to Get Past Sesame Street: Sentence-Level Pretraining Beyond Language Modeling" for more tasks + multitasking.
Pretraining on related tasks is better than the BiLM:
[Bar chart: accuracy on a target task where related-task pretraining wins. GloVe: 70.62; randomly initialized: 72.74; PTB pretraining objectives: 86.86, 87.57, 87.75, 88.11, 90.98; BiLM trained on the 1B Word Benchmark: 93.01]
But, BiLM on more data is even better.
[Bar chart: target-task accuracy from each BiLM layer (0–2). BiLM trained on PTB: 64.40, 65.82, 66.53; BiLM trained on the 1B Word Benchmark: 65.91, 77.72, 79.05 (grouping inferred from the extracted chart)]
Also found by Saphra and Lopez (2019), check out poster 1402 on Wednesday!
Related work:
- Understanding Learning Dynamics of Language Models with SVCCA. Naomi Saphra and Adam Lopez.
- Structural Supervision Improves Learning of Non-Local Grammatical Dependencies. Ethan Wilcox et al.
- Analysis Methods in Neural Language Processing: A Survey. Yonatan Belinkov and James Glass.
- A Structural Probe for Finding Syntax in Word Representations. John Hewitt and Christopher D. Manning.
Online at: bit.ly/cwr-analysis-related
Takeaways:
- Contextual word representations support high performance on a broad set of tasks.
- Tasks with lower probing performance may require more fine-grained linguistic knowledge.
- Transferability varies across layers with how task-specific each layer is.
- Language model pretraining yields the most general representations.
Paper: http://nelsonliu.me/papers/contextual-repr-analysis
Code:
Thanks! Questions?
- Constituency ancestor tagging: predict the label of a word's parent, grandparent, or great-grandparent in the phrase-structure tree.
- Semantic tagging: tags abstract over part-of-speech categories and disambiguate useful cases within POS tags.
- Preposition supersense disambiguation: predict the preposition's lexical contribution (function), or the semantic role / relation it mediates (role).
- Event factuality: did the described event happen? E.g., event "leave" did not happen; event "leaving" happened.
- Arc prediction: given two tokens, predict whether a relation exists between them.
- Arc classification: given two tokens known to be related, identify what the relation is.
[Diagram: arc prediction examples, with pairs of input tokens labeled "True, there exists a relation" or "False, there does not exist a relation"]
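The pairwise arc tasks can be probed with a simple construction: concatenate the two tokens' vectors and train a linear binary classifier on top. This hypothetical sketch uses synthetic vectors whose "arcs" depend linearly on the inputs; all function names here are made up for illustration, and a least-squares fit stands in for the trained probe.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_pairs(n, dim=8):
    # synthetic stand-ins for two tokens' contextual representations;
    # in this toy setup an "arc" exists exactly when a[0] + b[0] > 0
    a = rng.normal(size=(n, dim))
    b = rng.normal(size=(n, dim))
    labels = ((a[:, 0] + b[:, 0]) > 0).astype(int)
    return np.hstack([a, b]), labels  # pair feature = concatenation

def fit_pair_probe(X, y, ridge=1e-3):
    Xb = np.hstack([X, np.ones((len(X), 1))])  # bias column
    targets = 2.0 * y - 1.0                    # regress onto +/-1
    return np.linalg.solve(Xb.T @ Xb + ridge * np.eye(Xb.shape[1]),
                           Xb.T @ targets)

def pair_accuracy(w, X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (((Xb @ w) > 0).astype(int) == y).mean()

X_tr, y_tr = make_pairs(4000)
X_te, y_te = make_pairs(1000)
w = fit_pair_probe(X_tr, y_tr)
acc = pair_accuracy(w, X_te, y_te)
```

Arc classification uses the same pair features but a multi-class output over relation labels instead of a binary "arc exists" decision.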