Linguistic Knowledge and Transferability of Contextual Representations (PowerPoint presentation)

SLIDE 1

Linguistic Knowledge and Transferability of Contextual Representations

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, Noah A. Smith

UWNLP
NAACL 2019, June 3, 2019

SLIDE 2

Contextual Word Representations Are Extraordinarily Effective

  • Contextual word representations (from contextualizers like ELMo or BERT) work well on many NLP tasks.
  • But why do they work so well?
  • Better understanding enables principled enhancement.
  • This work studies a few questions about their generalizability and transferability.

[McCann et al., 2017; Peters et al., 2018a; Devlin et al., 2019, inter alia]

SLIDE 3

(1) Probing Contextual Representations

Question: Is the information necessary for a variety of core NLP tasks linearly recoverable from contextual word representations?

Answer: Yes, to a great extent! Tasks with lower performance may require fine-grained linguistic knowledge.

SLIDE 4

(2) How Does Transferability Vary?

Question: How does transferability vary across contextualizer layers?

Answer: In LSTM-based contextualizers, the first layer is the most transferable; in transformer-based contextualizers, the middle layers are.
SLIDE 5

(3) Why Does Transferability Vary?

Question: Why does transferability vary across contextualizer layers?

Answer: It depends on pretraining task-specificity: the more specialized a layer becomes to the pretraining task, the less transferable it is.

SLIDE 6

(4) Alternative Pretraining Objectives

Question: How does language model pretraining compare to alternatives?

Answer: Even with 1 million tokens of pretraining data, language model pretraining yields the most transferable representations. But transferring between related tasks does help.

SLIDE 7

Probing Models

[Shi et al., 2016; Adi et al., 2017]

SLIDE 12

Pairwise Probing

[Belinkov, 2018; Blevins et al., 2018; Tenney et al., 2019]

SLIDE 20
Probing Model Setup

  • Contextualizer weights are always frozen.
  • Results are from the highest-performing contextualizer layer.
  • We use a linear probing model (a minimal sketch follows below).
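To make the setup concrete, here is a minimal sketch of a linear probe trained on frozen contextual word representations, using PyTorch. It assumes per-token representations have already been extracted from one frozen contextualizer layer; the names `rep_dim`, `num_labels`, and `train_batches` are placeholders, not names from the paper's codebase.

```python
import torch
import torch.nn as nn


class LinearProbe(nn.Module):
    """A single affine map from frozen contextual representations to task labels."""

    def __init__(self, rep_dim: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(rep_dim, num_labels)

    def forward(self, reps: torch.Tensor) -> torch.Tensor:
        # reps: (num_tokens, rep_dim), precomputed from a frozen contextualizer layer.
        return self.classifier(reps)


def train_probe(probe: nn.Module, train_batches, epochs: int = 10, lr: float = 1e-3):
    # Only the probe's parameters are updated; the contextualizer never is.
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for reps, labels in train_batches:
            optimizer.zero_grad()
            loss_fn(probe(reps), labels).backward()
            optimizer.step()
    return probe
```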

SLIDE 21

Contextualizers Analyzed

SLIDE 22

Contextualizers Analyzed

ELMo: Bidirectional language model (BiLM) pretraining on the 1B Word Benchmark

  • 2-layer LSTM (ELMo original)
  • 4-layer LSTM (ELMo 4-layer)
  • 6-layer Transformer (ELMo transformer)

[Peters et al., 2018a,b]

SLIDE 23

Contextualizers Analyzed

ELMo: Bidirectional language model (BiLM) pretraining on the 1B Word Benchmark

  • 2-layer LSTM (ELMo original)
  • 4-layer LSTM (ELMo 4-layer)
  • 6-layer Transformer (ELMo transformer)

OpenAI Transformer: Left-to-right language model pretraining on uncased BookCorpus

  • 12-layer transformer

[Peters et al., 2018a,b; Radford et al., 2018]

SLIDE 24

Contextualizers Analyzed

ELMo: Bidirectional language model (BiLM) pretraining on the 1B Word Benchmark

  • 2-layer LSTM (ELMo original)
  • 4-layer LSTM (ELMo 4-layer)
  • 6-layer Transformer (ELMo transformer)

OpenAI Transformer: Left-to-right language model pretraining on uncased BookCorpus

  • 12-layer transformer

BERT (cased): Masked language model pretraining on BookCorpus + Wikipedia

  • 12-layer transformer (BERT base)
  • 24-layer transformer (BERT large)

[Peters et al., 2018a,b; Radford et al., 2018; Devlin et al., 2019]
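The paper extracted per-layer features with its own pipeline (e.g., ELMo via AllenNLP). As a hedged illustration of the general recipe rather than the authors' code, here is how one might pull frozen per-layer token representations from a BERT model with the Hugging Face transformers library; the model name and sentence are just examples.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Example only: any of the contextualizers listed above could stand in here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
model.eval()  # frozen: the contextualizer is never updated or fine-tuned

inputs = tokenizer("The city decided to treat its guests like royalty.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple with one tensor per layer
# (index 0 is the embedding layer), each of shape (batch, seq_len, hidden_dim).
# A probe is trained on one layer's outputs, which stay fixed.
per_layer_reps = outputs.hidden_states
```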

SLIDE 25

(1) Probing Contextual Representations

Question: Is the information necessary for a variety of core NLP tasks linearly recoverable from contextual word representations?

Answer: Yes, to a great extent! Tasks with lower performance may require fine-grained linguistic knowledge.

SLIDE 26

Examined 17 Diverse Probing Tasks

  • Part-of-Speech Tagging
  • CCG Supertagging
  • Semantic Tagging
  • Preposition Supersense Disambiguation
  • Event Factuality
  • Syntactic Constituency Ancestor Tagging
  • Syntactic Chunking
  • Named Entity Recognition
  • Grammatical Error Detection
  • Conjunct Identification
  • Syntactic Dependency Arc Prediction
  • Syntactic Dependency Arc Classification
  • Semantic Dependency Arc Prediction
  • Semantic Dependency Arc Classification
  • Coreference Arc Prediction
SLIDE 27

Linear Probing Models Rival Task-Specific Architectures

SLIDE 28

CCG Supertagging

[Bar chart: CCG supertagging accuracy (axis 25 to 100) for GloVe, ELMo (original), ELMo (transformer), OpenAI Transformer, BERT (large), and SOTA]

SLIDE 29

CCG Supertagging

[Bar chart: CCG supertagging accuracy; GloVe baseline: 71.58]

SLIDE 30

CCG Supertagging

[Bar chart: CCG supertagging accuracy for GloVe and the contextualizers; values shown include 71.58, 82.69, 92.68, 93.31, and 94.28]

SLIDE 31

CCG Supertagging

[Bar chart: CCG supertagging accuracy for GloVe, the contextualizers, and SOTA; values shown include 71.58, 82.69, 92.68, 93.31, 94.28, and 94.7 (SOTA)]

SLIDE 32

Event Factuality

[Bar chart: event factuality, Pearson correlation (r) x 100, for GloVe, ELMo (original), ELMo (transformer), OpenAI Transformer, BERT (large), and SOTA]

SLIDE 33

Event Factuality

[Bar chart: event factuality, Pearson correlation (r) x 100; values shown include 49.70, 70.88, 73.20, 74.03, 76.25, and 77.10]

SLIDE 34

But Linear Probing Models Underperform on Some Tasks

  • Tasks on which a linear model over contextual word representations performs poorly may require more fine-grained linguistic knowledge.
  • In these cases, task-specific contextualization leads to especially large gains, as in the sketch below. See the paper for more details.
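As a rough illustration of what task-specific contextualization can look like (an assumption-laden sketch, not the paper's exact architecture), one can train a small BiLSTM on top of the frozen representations before the linear classifier:

```python
import torch
import torch.nn as nn


class ContextualizedProbe(nn.Module):
    """Probe with task-trained contextualization: a small BiLSTM runs over
    the frozen contextualizer outputs, and only the BiLSTM and classifier
    are trained on the target task."""

    def __init__(self, rep_dim: int, hidden_dim: int, num_labels: int):
        super().__init__()
        self.lstm = nn.LSTM(rep_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, reps: torch.Tensor) -> torch.Tensor:
        # reps: (batch, seq_len, rep_dim) precomputed frozen features.
        contextualized, _ = self.lstm(reps)
        return self.classifier(contextualized)
```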

SLIDE 35

Named Entity Recognition

[Bar chart: named entity recognition F1 (axis 25 to 100) for GloVe, ELMo (original), ELMo (transformer), OpenAI Transformer, BERT (large), and SOTA]

SLIDE 36

Named Entity Recognition

[Bar chart: named entity recognition F1; GloVe baseline: 53.22]

SLIDE 37

Named Entity Recognition

[Bar chart: named entity recognition F1 for GloVe and the contextualizers; values shown include 53.22, 58.14, 81.21, 82.85, and 84.44]

SLIDE 38

Named Entity Recognition

[Bar chart: named entity recognition F1 for GloVe, the contextualizers, and SOTA; values shown include 53.22, 58.14, 81.21, 82.85, 84.44, and 91.38 (SOTA)]

SLIDE 39

(2) How Does Transferability Vary?

Question: How does transferability vary across contextualizer layers?

Answer: In LSTM-based contextualizers, the first layer is the most transferable; in transformer-based contextualizers, the middle layers are.
SLIDE 40

Layerwise Patterns in Transferability

SLIDE 41

Layerwise Patterns in Transferability

LSTM-based Contextualizers

[Heatmaps: layerwise transferability across tasks for ELMo (original) and ELMo (4-layer); x-axis: tasks]
SLIDE 43

Layerwise Patterns in Transferability

[Heatmaps: layerwise transferability across tasks; LSTM-based contextualizers (ELMo original, ELMo 4-layer) shown alongside transformer-based contextualizers; x-axis: tasks]
SLIDE 44

Layerwise Patterns in Transferability

[Heatmaps: layerwise transferability across tasks for all six contextualizers: ELMo (original), ELMo (4-layer), ELMo (transformer), OpenAI Transformer, BERT (base, cased), BERT (large, cased); x-axis: tasks]
SLIDE 45

(3) Why Does Transferability Vary?

Question: Why does transferability vary across contextualizer layers?

Answer: It depends on pretraining task-specificity: the more specialized a layer becomes to the pretraining task, the less transferable it is.

SLIDE 46

Layerwise Patterns Dictated by Perplexity

LSTM-based: ELMo (original)

[Bar chart: BiLM perplexity at each layer (Layer 0, Layer 1, Layer 2); values shown: 235, 920, 7026]

Outputs of higher LSTM layers are better for language modeling (have lower perplexity)
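One way to read these numbers: attach a linear next-word probe to each frozen layer's outputs and measure its perplexity on held-out text. A minimal sketch under those assumptions (not the paper's exact procedure; full-batch training for brevity, with placeholder tensor names):

```python
import math

import torch
import torch.nn as nn


def lm_probe_perplexity(train_reps, train_next_ids, eval_reps, eval_next_ids,
                        vocab_size, epochs=20, lr=1e-3):
    """Train a linear softmax over the vocabulary to predict the next word
    from one frozen layer's representations; report held-out perplexity.
    train_reps/eval_reps: (num_tokens, rep_dim); *_next_ids: (num_tokens,)."""
    probe = nn.Linear(train_reps.size(1), vocab_size)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss_fn(probe(train_reps), train_next_ids).backward()
        optimizer.step()
    with torch.no_grad():
        nll = loss_fn(probe(eval_reps), eval_next_ids).item()
    return math.exp(nll)  # perplexity = exp(mean negative log-likelihood)
```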

SLIDE 47

Layerwise Patterns Dictated by Perplexity

LSTM-based: ELMo (4-layer)

[Bar chart: BiLM perplexity at each layer (Layer 0 through Layer 4); values shown: 195, 1013, 2363, 2398, 4204]

Outputs of higher LSTM layers are better for language modeling (have lower perplexity)

SLIDE 48

Layerwise Patterns Dictated by Perplexity

Transformer-based: ELMo (6-layer)

[Bar chart: BiLM perplexity at each layer (Layer 0 through Layer 6); values shown: 91, 523, 314, 374, 448, 295, 546]

SLIDE 49

(4) Alternative Pretraining Objectives

Question: How does language model pretraining compare to alternatives?

Answer: Even with 1 million tokens of pretraining data, language model pretraining yields the most transferable representations. But transferring between related tasks does help.

SLIDE 50

Investigating Alternatives to Language Model Pretraining

  • How does language modeling as a pretraining objective compare to explicitly supervised tasks?
  • Pretrain an ELMo (original)-architecture contextualizer on the Penn Treebank with a variety of different objectives.
  • Evaluate how well the resultant representations transfer to target (held-out) tasks.

SLIDE 51

Average Across Target Tasks

[Bar chart: accuracy (axis 10 to 100) by pretraining task: GloVe, Randomly Initialized, Chunking, Semantic Dependency Classification, CCG, Syntactic Dependency Classification, BiLM, BiLM (1B Benchmark)]

SLIDE 52

Average Across Target Tasks

[Bar chart: average accuracy across target tasks; GloVe baseline: 60.55]

SLIDE 53

Average Across Target Tasks

[Bar chart: average accuracy across target tasks; GloVe 60.55, Randomly Initialized 54.42]

SLIDE 54

Average Across Target Tasks

[Bar chart: average accuracy across target tasks by pretraining task; GloVe 60.55, Randomly Initialized 54.42, and values of 63.60, 63.96, 64.67, 66.06, and 66.53 across the PTB pretraining tasks (Chunking, Semantic Dependency Classification, CCG, Syntactic Dependency Classification, BiLM)]

SLIDE 55

Average Across Target Tasks

[Bar chart: as above, with a BiLM pretrained on the 1B Word Benchmark added at 79.05]

See Wang et al. (ACL 2019) "How to Get Past Sesame Street: Sentence-Level Pretraining Beyond Language Modeling" for more tasks + multitasking.

SLIDE 56

Target Task: Syntactic Dependency Classification (EWT)

[Bar chart: accuracy by pretraining task (GloVe, Randomly Initialized, Chunking, Semantic Dependency Classification, CCG, Syntactic Dependency Classification, BiLM); values shown include 70.62, 72.74, 86.86, 87.57, 87.75, 88.11, and 90.98]

Pretraining on related tasks is better than BiLM

SLIDE 57

Target Task: Syntactic Dependency Classification (EWT)

[Bar chart: as above, with a BiLM pretrained on the 1B Word Benchmark added at 93.01]

But a BiLM trained on more data is even better.

SLIDE 58

PTB-trained BiLM vs. ELMo

[Bar chart: accuracy by layer (Layer 0, Layer 1, Layer 2) for a BiLM trained on PTB vs. a BiLM trained on the 1B Word Benchmark; values shown include 64.40, 65.82, 65.91, 66.53, 77.72, and 79.05]

Also found by Saphra and Lopez (2019), check out poster 1402 on Wednesday!

SLIDE 59

Some Related Work at NAACL

  • Wed. June 5, 10:30–12:00. ML & Syntax, Hyatt Exhibit Hall:
    Understanding Learning Dynamics of Language Models with SVCCA. Naomi Saphra and Adam Lopez.
    Structural Supervision Improves Learning of Non-Local Grammatical Dependencies. Ethan Wilcox et al.
    Analysis Methods in Neural Language Processing: A Survey. Yonatan Belinkov and James Glass.
  • Wed. June 5, 16:15–16:30. Machine Learning, Nicollet B/C:
    A Structural Probe for Finding Syntax in Word Representations. John Hewitt and Christopher D. Manning.


Online at: bit.ly/cwr-analysis-related

SLIDE 60

Takeaways

  • Features from pretrained contextualizers are sufficient for high performance on a broad set of tasks.
  • Tasks with lower performance might require fine-grained linguistic knowledge.
  • Layerwise patterns in transferability exist, dictated by how task-specific each layer is.
  • Even on PTB-size data, BiLM pretraining yields the most general representations.
  • Pretraining on related tasks helps.
  • More data helps even more!

Code: http://nelsonliu.me/papers/contextual-repr-analysis

SLIDE 61

Takeaways

Code: http://nelsonliu.me/papers/contextual-repr-analysis

Thanks! Questions?

SLIDE 62

Bonus Slides

SLIDE 63

Probing Task Examples

SLIDE 64

Part-of-Speech Tagging

SLIDE 65

CCG Supertagging

SLIDE 66

Syntactic Constituency Ancestor Tagging

Parent Grandparent Great-Grandparent

SLIDE 67

Semantic Tagging

  • Semantic tags abstract over redundant POS distinctions and disambiguate useful cases within POS tags.
  • (1) Sarah bought herself a book
  • (2) Sarah herself bought a book
  • "herself" has the same POS tag (personal pronoun) in both sentences but a different semantic function: a reflexive function in (1) and an emphasizing function in (2).
SLIDE 68

Preposition Supersense Disambiguation

  • Classify a preposition's lexical semantic contribution (function), or the semantic role / relation it mediates (role).
  • Specialized kind of word sense disambiguation.
SLIDE 69

Preposition Supersense Disambiguation

SLIDE 70

Event Factuality

  • Label predicates with the factuality of events they describe.

Event "leave" did not happen. Event "leaving" happened.

SLIDE 71

Syntactic Chunking

SLIDE 72

Named Entity Recognition

SLIDE 73

Grammatical Error Detection

SLIDE 74

Conjunct Identification

  • And the city decided to treat its guests more like [royalty] or [rock stars] than factory owners.
SLIDE 75

Two Types of Pairwise Relations

  • Arc prediction tasks: Given two random tokens, identify whether a relation exists between them.
  • Arc classification tasks: Given two tokens that are known to be related, identify what the relation is. (Both probe types are sketched below.)
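A minimal sketch of a pairwise probe under a simplified, assumed formulation (score the concatenation of the two frozen token representations with a linear layer; not necessarily the paper's exact parameterization):

```python
import torch
import torch.nn as nn


class PairwiseProbe(nn.Module):
    """Pairwise probing model over frozen token representations.
    Arc prediction: num_labels = 2 (related vs. not related).
    Arc classification: num_labels = number of relation types."""

    def __init__(self, rep_dim: int, num_labels: int):
        super().__init__()
        self.scorer = nn.Linear(2 * rep_dim, num_labels)

    def forward(self, first_reps: torch.Tensor, second_reps: torch.Tensor):
        # first_reps, second_reps: (num_pairs, rep_dim) frozen representations
        # of the two tokens in each candidate pair.
        pair = torch.cat([first_reps, second_reps], dim=-1)
        return self.scorer(pair)
```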

SLIDE 76

Syntactic Dependency Arc Prediction

[Example: two input tokens; label: True, a relation exists between them]

slide-77
SLIDE 77

Syntactic Dependency Arc Prediction

Input Tokens Label: True, there exists a relation

slide-78
SLIDE 78

Syntactic Dependency Arc Prediction

[Example: two input tokens; label: False, no relation exists between them]

SLIDE 79

Syntactic Dependency Arc Classification

SLIDE 80

Syntactic Dependency Arc Classification

SLIDE 81

Syntactic Dependency Arc Classification

SLIDE 82

Syntactic Dependency Arc Classification

SLIDE 83

Semantic Dependencies

SLIDE 84

Coreference Relations

SLIDE 85

Setting Up Alternative Pretraining Objectives

SLIDE 86

Language Model Pretraining

SLIDE 88

Chunking Pretraining

SLIDE 90

Flexible Paradigm, Use Any Task!

SLIDE 91

SLIDE 92
