slide-1
SLIDE 1

Adversarial NLI: A New Benchmark for Natural Language Understanding

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, Douwe Kiela UNC Chapel Hill & Facebook AI Research

1

slide-2
SLIDE 2

Development of AI has been driven by benchmarks and datasets. Computer Vision: ImageNet (Russakovsky et al. 2015). NLP: SQuAD (Rajpurkar et al. 2016), GLUE (Wang et al. 2018).

2

slide-3
SLIDE 3

[Chart: ImageNet top-5 error rate by model and year — XRCE (2011): 26, AlexNet (2012): 16.4, ZF (2013): 11.7, VGG (2014): 7.3, GoogLeNet (2014): 6.7, ResNet (2015): 3.6, GoogLeNet-v4 (2016): 3.1, SENet (2017): 2.3. Human error rate: 5.1. Models surpassed human performance within about 3 years.]

3

slide-4
SLIDE 4

[Chart: SQuAD exact-match score by model and year — Match-LSTM Ptr (2016): 64.74, BiDAF (2016): 67.97, BiDAF+SelfAtt (2017): 72.14, BiDAF+SelfAtt+ELMo (2018): 78.58, BERT (2018): 85.08, XLNet (2019): 89.9. Human: 86.8. Models surpassed human performance within about 2 years.]

4

slide-5
SLIDE 5

[Chart: GLUE score by model and year — BiLSTM+Attn+ELMo (2018): 70, BERT (2018): 80.5, RoBERTa (2019): 88.1, T5 (2019): 90.3. Human: 87.1. Models surpassed human performance within about 1 year.]

5

slide-6
SLIDE 6
  • Word2Vec
  • GloVe
  • ELMo
  • GPT-1
  • BERT
  • RoBERTa
  • GPT-2

Superhuman performance achieved

Model vs. Human on Static Benchmarks

Human won Human still won

6

slide-9
SLIDE 9

Are current NLU models genuinely as good as their high performance on static benchmarks suggests?

  • Word2Vec
  • GloVe
  • ELMo
  • GPT-1
  • BERT
  • RoBERTa
  • GPT-2

  • T5
  • GPT-3

Superhuman performance achieved

Model vs. Human on Static Benchmarks

Human won Human still won ……

9

slide-11
SLIDE 11

Overestimated NLU Ability

Adversary for reading comprehension (Jia and Liang, 2017) Adversary for natural language inference (Nie et al., 2018)

  • Annotation artifacts (Gururangan et al., 2018; Poliak et al., 2018)
  • Breaking NLI with lexical inference (Glockner et al., 2018)
  • Pathologies of neural models (Feng et al., 2018)
  • Modeling task or annotator? (Geva et al., 2019)
  • Right for the wrong reason (McCoy et al., 2019)
  • …

The state-of-the-art models learn to exploit spurious statistical patterns and are vulnerable to adversaries.

11
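The hypothesis-only findings above (Gururangan et al., 2018; Poliak et al., 2018) are easy to demonstrate: a classifier that never sees the premise can still beat chance by latching onto give-away words in the hypothesis. A minimal sketch of such a baseline (the toy data and cue words below are illustrative, not taken from those papers):

```python
from collections import Counter, defaultdict

def train_hypothesis_only(examples):
    """Count how often each hypothesis token co-occurs with each label."""
    token_label_counts = defaultdict(Counter)
    label_counts = Counter()
    for hypothesis, label in examples:
        label_counts[label] += 1
        for token in set(hypothesis.lower().split()):
            token_label_counts[token][label] += 1
    return token_label_counts, label_counts

def predict_hypothesis_only(hypothesis, token_label_counts, label_counts):
    """Score labels by summing per-token label counts; the premise is never used."""
    scores = Counter()
    for token in set(hypothesis.lower().split()):
        for label, count in token_label_counts[token].items():
            scores[label] += count
    if not scores:
        return label_counts.most_common(1)[0][0]
    return scores.most_common(1)[0][0]

# Toy data mimicking known artifacts: negation words correlate with
# contradiction, vague quantifiers and modifiers with neutral.
train = [
    ("a man is sleeping", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("the man is not walking", "contradiction"),
    ("some people are very happy", "neutral"),
    ("the tall man is walking", "neutral"),
    ("a man is walking", "entailment"),
    ("a person is outside", "entailment"),
]
model = train_hypothesis_only(train)
print(predict_hypothesis_only("the dog is not barking", *model))  # → contradiction
```

On real SNLI data, baselines of this kind score well above the 33% chance level without ever reading the premise, which is exactly the spurious-pattern problem ANLI targets.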

slide-12
SLIDE 12

Performance is Overestimated

Model brittleness can be exposed by researchers and even non-experts. Despite high benchmark scores, general NLU is still far from achieved. How can we address both the rapid saturation of benchmarks and the robustness issues?

12

slide-13
SLIDE 13

13

slide-14
SLIDE 14

HAMLET

Human-And-Model-in-the-Loop Enabled Training

In NLI terminology, the context serves as the premise.

14
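One HAMLET round can be sketched as code (the names and stubs below are illustrative, not from the paper; in the real setup the annotator is a crowdworker, verification is done by other humans, and the model is retrained between rounds):

```python
import random

def hamlet_round(model, contexts, write_hypothesis, verify, target_size):
    """Collect examples that fool the current model: an annotator writes a
    hypothesis for a target label, the model-in-the-loop predicts, and
    verified model-fooling examples are kept for the next round's dataset."""
    collected = []
    while len(collected) < target_size:
        context = random.choice(contexts)
        target = random.choice(["entailment", "neutral", "contradiction"])
        hypothesis = write_hypothesis(context, target)   # human in the loop
        prediction = model(context, hypothesis)          # model in the loop
        if prediction != target and verify(context, hypothesis, target):
            collected.append((context, hypothesis, target))
    return collected

# Stub components to show the control flow: a model that always predicts
# "entailment", a scripted annotator, and a verifier that accepts everything.
always_entail = lambda context, hypothesis: "entailment"
annotator = lambda context, label: f"a hypothesis aiming for {label}"
accept_all = lambda context, hypothesis, label: True
examples = hamlet_round(always_entail, ["some premise"], annotator, accept_all, 3)
```

Every collected example is, by construction, one the current model gets wrong, which is why each round targets the weaknesses of a stronger model than the last.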

slide-20
SLIDE 20

Related work: Adversarial & Human-in-the-Loop

20

slide-23
SLIDE 23

Adversarial NLI (ANLI)

Analogy: white-hat hackers finding vulnerabilities in models, which we then patch for the next round.

Three rounds of data collection.

  • Round 1 (A1)

Model: BERT (Trained on SNLI+MNLI) Domain: Wikipedia

  • Round 2 (A2)

Model: RoBERTa ensemble (Trained on SNLI+MNLI+FEVER+A1) Domain: Wikipedia

  • Round 3 (A3)

Model: RoBERTa ensemble (Trained on SNLI+MNLI+FEVER+A1+A2) Domains: Wikipedia, News, Fiction, Spoken, WikiHow, RTE5

Dataset         Genre     Contexts   Train / Dev / Test
A1              Wiki      2,080      16,946 / 1,000 / 1,000
A2              Wiki      2,694      45,460 / 1,000 / 1,000
A3              Various   6,002      100,459 / 1,200 / 1,200
  (Wiki subset)           1,000      19,920 / 200 / 200
ANLI            Various   10,776     162,865 / 3,200 / 3,200

23

SNLI: 570K MNLI: 433K ANLI: 163K

  • Adversarially collected
  • More data-efficient in training
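The split sizes in the table above are internally consistent and make the data-efficiency point concrete; a quick sanity check:

```python
# Train / dev / test sizes per ANLI round, copied from the table above.
splits = {
    "A1": (16_946, 1_000, 1_000),
    "A2": (45_460, 1_000, 1_000),
    "A3": (100_459, 1_200, 1_200),
}
train_total = sum(train for train, _, _ in splits.values())
dev_total = sum(dev for _, dev, _ in splits.values())
print(train_total, dev_total)  # 162865 3200

# ANLI's training set (~163K) is under a fifth of SNLI+MNLI (~570K + ~433K).
assert train_total < (570_000 + 433_000) / 5
```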
slide-25
SLIDE 25

Collection Statistics

[Chart: Model error rate (%) during collection — A1: 29.68, A2: 16.59, A3: 14.79 (Wiki) / 17.47 (all domains).]

[Chart: Median time (sec.) per example during collection — A1: 125.2, A2: 189.1, A3: 189.6 (Wiki) / 157 (all domains).]

The error rate halved over the three rounds, but room for improvement on NLI still exists.

25
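The "error rate halved" takeaway can be checked directly against the per-round numbers (using the Wiki-context series, the one available for all three rounds, and reading 14.79 as the A3 Wiki figure as the chart legend suggests):

```python
# Model error rate (%) during collection on Wiki contexts, per round.
error_rate = {"A1": 29.68, "A2": 16.59, "A3": 14.79}

# From round 1 to round 3 the rate more than halved...
assert error_rate["A3"] < error_rate["A1"] / 2
# ...yet annotators still fooled the model on roughly 1 in 7 attempts,
# so the task is far from saturated.
assert error_rate["A3"] > 10
```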

slide-26
SLIDE 26

Findings

The base model (the backend model used during collection) achieves low accuracy on the examples collected against it.

26

slide-32
SLIDE 32

[Chart: RoBERTa accuracy on A1, A2, and A3 as training data is accumulated (S = SNLI, M = MNLI, F = FEVER): S+M, S+M+F, S+M+F+A1, S+M+F+A1+A2, S+M+F+A1+A2+A3, plus the chance baseline.]

Rounds become increasingly more difficult.

32

slide-33
SLIDE 33

[Chart: RoBERTa accuracy on A1, A2, and A3 as training data is accumulated (S = SNLI, M = MNLI, F = FEVER): S+M, S+M+F, S+M+F+A1, S+M+F+A1+A2, S+M+F+A1+A2+A3, plus the chance baseline.]

Training on more rounds improves robustness.

33

slide-35
SLIDE 35

[Chart: accuracy on A1, A2, and A3 for RoBERTa (All Data) vs. XLNet (All Data) vs. BERT (All Data), plus the chance baseline.]

Different models have different weaknesses.

35

slide-38
SLIDE 38

[Chart: RoBERTa accuracy on A1, A2, A3, SNLI, MNLI-m, and MNLI-mm with different training data: SNLI+MNLI (~900K examples), ANLI only (162K examples), and the chance baseline.]

A model trained only on SNLI and MNLI (statically collected) performs poorly on ANLI, but a model trained only on ANLI (adversarially collected) is reasonably good at SNLI and MNLI, even though ANLI is less than 1/5 the size of SNLI+MNLI.

38

slide-39
SLIDE 39

[Chart: RoBERTa accuracy on A1, A2, A3, SNLI, MNLI-m, and MNLI-mm with different training data: SNLI+MNLI (~900K examples), ANLI only (162K examples), ANLI+SNLI+MNLI, and the chance baseline.]

A model trained only on SNLI and MNLI (statically collected) performs poorly on ANLI; a model trained only on ANLI (adversarially collected) is reasonably good at SNLI and MNLI; and combining them helps on both.

39

slide-40
SLIDE 40

NLI Stress Test

Model             SNLI-Hard   AT (m/mm)     NR     LN (m/mm)     NG (m/mm)     WO (m/mm)     SE (m/mm)
Previous models   72.7        14.4 / 10.2   28.8   58.7 / 59.4   48.8 / 46.6   50.0 / 50.2   58.3 / 59.4
BERT (All)        82.3        75.0 / 72.9   65.8   84.2 / 84.6   64.9 / 64.4   61.6 / 60.6   78.3 / 78.3
XLNet (All)       83.5        88.2 / 87.1   85.4   87.5 / 87.5   59.9 / 60.0   68.7 / 66.1   84.3 / 84.4
RoBERTa (S+M+F)   84.5        81.6 / 77.2   62.1   88.0 / 88.5   61.9 / 61.9   67.9 / 66.2   86.2 / 86.5
RoBERTa (All)     84.7        85.9 / 82.1   80.6   88.4 / 88.5   62.2 / 61.9   67.4 / 65.6   86.3 / 86.7

Table 4: Model performance on SNLI-Hard and the NLI Stress Tests (tuned on their respective dev sets). All = S+M+F+ANLI; AT = Antonym; NR = Numerical Reasoning; LN = Length; NG = Negation; WO = Word Overlap; SE = Spell Error. "Previous models" refers to the Naik et al. (2018) implementation of InferSent (Conneau et al., 2017) for the Stress Tests, and to the Gururangan et al. (2018) implementation of DIIN (Gong et al., 2018) for SNLI-Hard.

Training on ANLI is useful for the Antonym, Numerical Reasoning, and Negation tests.

40
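The takeaway can be quantified from Table 4 by differencing the two RoBERTa rows, which isolates the effect of adding ANLI to the training data (matched-section scores shown):

```python
# Matched-section stress-test scores for RoBERTa, from Table 4.
without_anli = {"AT": 81.6, "NR": 62.1, "LN": 88.0, "NG": 61.9, "WO": 67.9, "SE": 86.2}
with_anli    = {"AT": 85.9, "NR": 80.6, "LN": 88.4, "NG": 62.2, "WO": 67.4, "SE": 86.3}
gains = {test: round(with_anli[test] - without_anli[test], 1) for test in without_anli}
print(gains)  # Numerical Reasoning gains the most: +18.5
```

On this comparison the Antonym (+4.3) and especially Numerical Reasoning (+18.5) gains stand out; the other categories move by half a point or less.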

slide-41
SLIDE 41

Analysis

What kind of vulnerabilities do annotators find?

41

The types of inference in the data changed across rounds, and so did the model weaknesses.

slide-42
SLIDE 42

Examples

Premise: Kota Ramakrishna Karanth (born May 1, 1894) was an Indian lawyer and politician who served as the Minister of Land Revenue for the Madras Presidency from March 1, 1946 to March 23, 1947. He was the elder brother of noted Kannada novelist K. Shivarama Karanth.
Hypothesis: Kota Ramakrishna Karanth has a brother who was a novelist and a politician.
Reason: Although Kota Ramakrishna Karanth's brother is a novelist, we do not know if the brother is also a politician.
Model prediction: Entailment
Human label: Neutral
Linguistic annotation: Standard (Conjunction), Reasoning (Plausibility, Likely), Tricky (Syntactic)

slide-43
SLIDE 43

Discussion

  • HAMLET is model-agnostic (different backend models can be ensembled).
  • It can easily be applied to any classification task.

What is underexplored?
  • How to extend the framework to generation tasks.
  • The cost and time trade-off between adversarial and static data collection.

43

slide-44
SLIDE 44

Summary

  • NLU is far from solved;
  • HAMLET (Human-And-Model-in-the-Loop Enabled Training);
  • We applied it to NLI and collected ANLI;
  • The procedure can yield progressively more difficult, iterative benchmarks.

"… all of our models smaller than GPT-3 perform at almost exactly random chance on ANLI, even in the few-shot setting (∼33%), whereas GPT-3 itself shows signs of life on Round 3." GPT-3 performance on ANLI (A1/A2/A3): 36.8 / 34.0 / 40.2.

Ideally, in the limit, HAMLET can help converge towards "real NLU". Adversarial collection and training help improve robustness.

44

slide-45
SLIDE 45

Thank you

Demo: https://adversarialnli.com/ GitHub: https://github.com/facebookresearch/anli/

45