Inoculation by Fine-Tuning: A Method for Analyzing Challenge Datasets
Nelson F. Liu, Roy Schwartz, Noah A. Smith
UWNLP
NAACL 2019, June 4, 2019
Two Key Ingredients of NLP Systems
[Diagram: Training Dataset and Model Architecture feed into the NLP System]
Why Might NLP Systems Fail?
Dataset Weaknesses
Model Weaknesses
Challenge Datasets Break Models
NLP Systems Are Brittle
Inoculation by Fine-Tuning
Inoculation
Inoculate Models to Better Understand Why They Fail
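The procedure named in the slide titles can be sketched in code. This is a minimal sketch, assuming inoculation means taking a model already trained on the original dataset, fine-tuning it on a small sample of challenge examples, and then re-evaluating on both the original and the challenge test sets. The names `inoculate`, `fine_tune`, `evaluate`, and the default `sizes` are hypothetical and are not taken from the talk.

```python
import copy
from typing import Any, Callable, Dict, List, Sequence


def inoculate(
    model: Any,
    challenge_train: Sequence[Any],
    original_test: Sequence[Any],
    challenge_test: Sequence[Any],
    fine_tune: Callable[[Any, Sequence[Any]], Any],  # hypothetical: returns the tuned model
    evaluate: Callable[[Any, Sequence[Any]], float],  # hypothetical: returns a score
    sizes: Sequence[int] = (0, 10, 100),  # assumed sample sizes, not from the talk
) -> List[Dict[str, float]]:
    """Fine-tune on n challenge examples and score both test sets for each n."""
    results = []
    for n in sizes:
        # Always restart from the model trained on the original data,
        # so each point on the curve is independent of the others.
        candidate = copy.deepcopy(model)
        if n > 0:
            candidate = fine_tune(candidate, challenge_train[:n])
        results.append({
            "n_challenge_examples": n,
            "original_score": evaluate(candidate, original_test),
            "challenge_score": evaluate(candidate, challenge_test),
        })
    return results
```

Passing the training and evaluation routines in as callables keeps the sketch independent of any particular model architecture; `n = 0` records the scores before inoculation.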
Three Clear Outcomes of Interest
[Diagram: Challenge Evaluation, Inoculation, Outcome]
(1) Dataset Weakness
(2) Model Weakness
(3) Predictive Artifacts / Other
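One way to make the three outcomes concrete is a small decision rule over scores measured before and after inoculation. This is a sketch under stated assumptions: it follows the reading in the accompanying paper, where a drop on the original test set signals predictive artifacts or another problem, an improvement on the challenge set with the original score intact signals a dataset weakness, and no improvement on the challenge set signals a model weakness. The function name and the `tolerance` threshold are hypothetical.

```python
def classify_outcome(
    original_before: float,
    original_after: float,
    challenge_before: float,
    challenge_after: float,
    tolerance: float = 0.01,  # assumed noise threshold, not from the talk
) -> str:
    """Map before/after scores to one of the three outcome labels."""
    # Outcome 3: inoculation hurt performance on the original test set.
    if original_after < original_before - tolerance:
        return "Predictive Artifacts / Other"
    # Outcome 1: challenge performance improved, original performance held.
    if challenge_after - challenge_before > tolerance:
        return "Dataset Weakness"
    # Outcome 2: the model did not adapt to the challenge data.
    return "Model Weakness"


print(classify_outcome(0.80, 0.80, 0.50, 0.79))  # Dataset Weakness
print(classify_outcome(0.80, 0.80, 0.50, 0.505))  # Model Weakness
print(classify_outcome(0.80, 0.70, 0.50, 0.75))  # Predictive Artifacts / Other
```

The scores in the usage lines are made-up illustrations, not results from the talk. A partial improvement on the challenge set is counted as a dataset weakness here, which is a simplification.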
Case Studies
Natural language inference (NLI) models
Reading comprehension models
Natural Language Inference (NLI)
Premise: "I have done what you asked."
Hypothesis: "I have disobeyed your orders."
Labels: Entailment, Contradiction, Neutral
[Dagan et al., 2004]. Example from MultiNLI [Williams et al., 2018]
Two NLI Challenge Datasets
[Naik and Ravichander et al., 2018]
Original Example
Premise: "I have done what you asked."
Hypothesis: "I have disobeyed your orders."
Word Overlap Challenge Dataset
Premise: "I have done what you asked."
Hypothesis: "I have disobeyed your orders and true is true."
Spelling Errors Challenge Dataset
Premise: "I have done what you asked."
Hypothesis: "I have disobeyed your ordets."
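The two perturbations shown in the examples can be reproduced with a few lines of string handling. This is a minimal sketch that only recreates the slide's own examples; the way the challenge datasets were actually built may differ, and the function names `add_word_overlap_distractor` and `substitute_char` are hypothetical.

```python
def add_word_overlap_distractor(hypothesis: str, tautology: str = "and true is true") -> str:
    """Append a tautology to the hypothesis, before its final period if it has one."""
    stripped = hypothesis.rstrip()
    if stripped.endswith("."):
        return f"{stripped[:-1]} {tautology}."
    return f"{stripped} {tautology}"


def substitute_char(text: str, index: int, replacement: str) -> str:
    """Replace the character at `index` to simulate a spelling error."""
    if not 0 <= index < len(text):
        raise IndexError("index is outside the text")
    return text[:index] + replacement + text[index + 1:]


hypothesis = "I have disobeyed your orders."
print(add_word_overlap_distractor(hypothesis))
# I have disobeyed your orders and true is true.
print(substitute_char(hypothesis, hypothesis.rindex("r"), "t"))
# I have disobeyed your ordets.
```

In both cases the premise and the gold label stay unchanged, so any change in the model's prediction is caused by the perturbation alone.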
Small Perturbations Break NLI Models
[Figure: performance change (absolute) on the Word Overlap and Spelling Errors challenge datasets]
Inoculating NLI models
[Figure: inoculation results for Word Overlap and Spelling Errors]
Word Overlap: Dataset Weakness
Spelling Errors: Model Weakness
More Examples in the Paper!
[Figure: inoculation results labeled Dataset Weakness (two panels), Model Weakness (two panels), and Predictive Artifacts / Other (one panel)]
SQuAD [Rajpurkar et al., 2016]
Question: "The number of new Huguenot colonists declined after what year?"
Passage: "The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined…"
Correct Answer: "1700"
Example from Robin Jia
Adversarial SQuAD [Jia and Liang, 2017]
Question: "The number of new Huguenot colonists declined after what year?"
Passage: "The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined. The number of [...] 1675."
Correct Answer: "1700"
Example from Robin Jia
Small Perturbations Break SQuAD Models
[Figure: performance change (absolute) on Adversarial SQuAD]
Inoculating SQuAD models
Predictive Artifacts / Other
Takeaways
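Inoculation helps us better understand why our models fail.
Challenge datasets are not all alike: they stress models in different ways.
Outcomes: Dataset Weakness, Model Weakness, Predictive Artifacts / Other
Inoculation can clarify model results when transferring to other datasets.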
Thank You! Questions?
Limitations of Inoculation by Fine-Tuning
challenge dataset.
challenge} datasets and models.
Inoculating Multiple SQuAD Reading Comprehension Models
Inoculating Multiple NLI Models Against Word Overlap Adversary
Inoculating Multiple NLI Models Against Spelling Errors