

SLIDE 1

Stress Test Evaluation for Natural Language Inference

Aakanksha Naik*, Abhilasha Ravichander*, Norman Sadeh, Carolyn Rose, Graham Neubig

SLIDE 2

Natural Language Inference
(a.k.a. Recognizing Textual Entailment)

Premise: Stimpy was a little cat who believed he could fly
Hypothesis: Stimpy could fly

(Fyodorov, 2000; Condoravdi, 2003; Bos and Markert, 2005; Dagan et al., 2006; MacCartney and Manning, 2009)

SLIDE 3

Natural Language Inference
(a.k.a. Recognizing Textual Entailment)

Given a premise, determine whether a hypothesis is True (entailment), False (contradiction), or Undecided (neutral)

Premise: Stimpy was a little cat who believed he could fly
Hypothesis: Stimpy could fly

(Fyodorov, 2000; Condoravdi, 2003; Bos and Markert, 2005; Dagan et al., 2006; MacCartney and Manning, 2009)

SLIDE 4

Natural Language Inference
(a.k.a. Recognizing Textual Entailment)

Benchmark task for Natural Language Understanding

  • Prevalent View: To perform well at NLI, models must
    ○ learn good sentence representations: “handle nearly the full complexity of compositional semantics” (Williams et al., 2018)
    ○ reason over “difficult” phenomena like lexical entailment, quantification, coreference, tense, belief, modality, and lexical and syntactic ambiguity (Dagan et al., 2009; MacCartney and Manning, 2009; Marelli et al., 2014; Williams et al., 2018)

SLIDE 5

Natural Language Inference
(a.k.a. Recognizing Textual Entailment)

MultiNLI

  • Text from 10 genres!
  • Covers written & spoken English
  • Longer, more complex sentences
  • Variety of linguistic phenomena
  • Sentence-encoder SOTA: 74.5% (Nie and Bansal, 2017)*

* All results shown for the matched set. Refer to the paper for further details.

SLIDE 6

Motivation

Neural networks can solve nearly ¾ of the examples from the challenging MultiNLI dataset!

SLIDE 7

Motivation

Neural networks can solve nearly ¾ of the examples from the challenging MultiNLI dataset! But more difficult cases occur rarely and are masked in traditional evaluation, giving an optimistic estimate of model performance.

SLIDE 8

Motivation

Neural networks can solve nearly ¾ of the examples from the challenging MultiNLI dataset! But more difficult cases occur rarely and are masked in traditional evaluation, giving an optimistic estimate of model performance.

We want to figure out whether our systems have the ability to make real inferential decisions, and if so, to what extent.

SLIDE 9

What are Stress Tests?

Stress Testing: testing a system beyond normal operational capacity to confirm that intended specifications are being met, and to identify weaknesses, if any.

For NLI: building large-scale diagnostic datasets to exercise models on their weaknesses and better understand their capabilities.

SLIDE 10

Why Stress Tests?

  • Reward system ability to reason about the task instead of encouraging reliance on misleading correlations in datasets
  • “Sanity checking” for NLP models
  • Analyze strengths and weaknesses of various models
  • Fine-grained, phenomenon-by-phenomenon evaluation scheme

SLIDE 11

Weaknesses of SOTA NLI Models

  • To construct stress tests, we must first identify “bugs” (potential weaknesses)
  • Analyzed errors of Nie & Bansal (2017) (best-performing single model)

SLIDE 12

Word Overlap (29%)

Premise: And, could it not result in a decline in Postal Service volumes across–the–board?
Hypothesis: There may not be a decline in Postal Service volumes across–the–board.

Gold: Neutral → Predicted: Entailment

SLIDE 13

Negation (13%)

Premise: Enthusiasm for Disney’s Broadway production of The Lion King dwindles.
Hypothesis: The Broadway production of The Lion King is no longer enthusiastically attended.

Gold: Entailment → Predicted: Contradiction

SLIDE 14

Length Mismatch (3%)

Premise: So you know well a lot of the stuff you hear coming from South Africa now and from West Africa that’s considered world music because it’s not particularly using certain types of folk styles.
Hypothesis: They rely too heavily on the types of folk styles.

Gold: Contradiction → Predicted: Neutral

SLIDE 15

Numerical Reasoning (3%)

Premise: Deborah Pryce said Ohio Legal Services in Columbus will receive a $200,000 federal grant toward an online legal self-help center.
Hypothesis: A $900,000 federal grant will be received by Missouri Legal Services, said Deborah Pryce.

Gold: Contradiction → Predicted: Entailment

SLIDE 16

Antonymy (5%)

Premise: “Have her show it,” said Thorn.
Hypothesis: Thorn told her to hide it.

Gold: Contradiction → Predicted: Entailment

SLIDE 17

Grammaticality (3%)

Premise: So if there are something interesting or something worried, please give me a call at any time.
Hypothesis: The person is open to take a call anytime.

Gold: Contradiction → Predicted: Neutral

SLIDE 18

Real World Knowledge (12%)

Premise: It was still night.
Hypothesis: The sun hadn’t risen yet, for the moon was shining daringly in the sky.

Gold: Entailment → Predicted: Neutral

SLIDE 19

Ambiguity (6%)

Premise: Outside the cathedral you will find a statue of John Knox with Bible in hand.
Hypothesis: John Knox was someone who read the Bible.

Gold: Entailment → Predicted: Neutral

SLIDE 20

Unknown (26%)

Premise: We’re going to try something different this morning, said Jon.
Hypothesis: Jon decided to try a new approach.

Gold: Entailment → Predicted: Contradiction

SLIDE 21

Constructing Stress Tests

Competence Tests
  • Evaluate model ability to reason about quantities and understand antonyms
  • Target error categories: antonymy, numerical reasoning
  • Construction framework: heuristic rules, external knowledge sources

Distraction Tests
  • Evaluate model robustness to shallow distractions
  • Target error categories: word overlap, negation, length mismatch
  • Construction framework: label-preserving perturbations using propositional logic

Noise Tests
  • Evaluate model robustness to noise in data
  • Target error category: grammaticality
  • Construction framework: random perturbation

SLIDE 22

Constructing Competence Tests: Antonymy

P: I saw a big house

  1. POS tagging → NOUNS: house; ADJECTIVES: big
  2. Word sense disambiguation → word senses
  3. WordNet lookup → antonyms: big ↔ small
  4. Replace the word with its antonym in the sentence

H: I saw a small house

SLIDE 23

Constructing Competence Tests: Antonymy

P: I saw a big house

  1. POS tagging → NOUNS: house; ADJECTIVES: big
  2. Word sense disambiguation → word senses
  3. WordNet lookup → antonyms: big ↔ small
  4. Replace the word with its antonym in the sentence

H: I saw a small house

Final Entailment Pair
Premise: I saw a big house.
Hypothesis: I saw a small house.
Label: Contradiction
Size: 3.2k pairs
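The substitution step can be sketched as follows. This is a minimal illustration: the toy antonym dictionary stands in for the WordNet lookup, and the real pipeline first restricts candidate words via POS tagging and word sense disambiguation.

```python
# Minimal sketch of the antonym-substitution step. TOY_ANTONYMS is a
# stand-in for a WordNet antonym lookup; the actual pipeline filters
# candidates with POS tagging and word sense disambiguation first.
TOY_ANTONYMS = {"big": "small", "hot": "cold", "open": "closed"}

def make_antonym_pair(premise):
    """Swap the first word that has a known antonym; the original and
    perturbed sentences then form a contradiction pair."""
    words = premise.split()
    for i, word in enumerate(words):
        antonym = TOY_ANTONYMS.get(word.lower())
        if antonym is not None:
            hypothesis = " ".join(words[:i] + [antonym] + words[i + 1:])
            return {"premise": premise,
                    "hypothesis": hypothesis,
                    "label": "contradiction"}
    return None  # no substitutable word found

pair = make_antonym_pair("I saw a big house")
# pair["hypothesis"] == "I saw a small house"
```
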

SLIDE 24

Constructing Competence Tests: Numerical Reasoning

Source: AQuA-RAT (word problem, options, answer, rationale)

Preprocessing:
  1. Discard “complex” problems
  2. Split problems into sentences
  3. Discard sentences without numbers and named entities

P: Tim had 750 bags of cement
Randomly choose one quantity: 750 bags
Use heuristics to generate hypotheses:
  • ENT H: Tim had more than 550 bags of cement
  • CONT H: Tim had 350 bags of cement

Flip the ENT pair to get a NEU pair:
P: Tim had more than 550 bags of cement
H: Tim had 750 bags of cement
L: neutral

SLIDE 25

Constructing Competence Tests: Numerical Reasoning

Source: AQuA-RAT (word problem, options, answer, rationale)

Preprocessing:
  1. Discard “complex” problems
  2. Split problems into sentences
  3. Discard sentences without numbers and named entities

P: Tim had 750 bags of cement
Randomly choose one quantity: 750 bags
Use heuristics to generate hypotheses:
  • ENT H: Tim had more than 550 bags of cement
  • CONT H: Tim had 350 bags of cement

Flip the ENT pair to get a NEU pair:
P: Tim had more than 550 bags of cement
H: Tim had 750 bags of cement
L: neutral

Final Entailment Pairs
Premise: Tim had 750 bags of cement. Hypothesis: Tim had more than 550 bags of cement. Label: Entailment
Premise: Tim had 750 bags of cement. Hypothesis: Tim had 350 bags of cement. Label: Contradiction
Size: 7.5k pairs
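The hypothesis-generation heuristics can be sketched roughly as below. The specific offsets (−200, −400) and the “more than” phrasing are illustrative assumptions chosen to reproduce the slide’s example, not the exact rules used to build the 7.5k-pair test set.

```python
# Rough sketch of the numerical-reasoning pair generation. The offsets
# and phrasing are illustrative assumptions only.
import re

def make_number_pairs(premise):
    """From a sentence containing a number, build entailment,
    contradiction, and (by flipping) neutral premise/hypothesis pairs."""
    match = re.search(r"\d+", premise)
    if match is None:
        return []
    n = int(match.group())
    ent_hyp = premise.replace(match.group(), f"more than {n - 200}", 1)
    cont_hyp = premise.replace(match.group(), str(n - 400), 1)
    return [
        {"premise": premise, "hypothesis": ent_hyp, "label": "entailment"},
        {"premise": premise, "hypothesis": cont_hyp, "label": "contradiction"},
        # Flipping the entailment pair yields a neutral pair: "more than
        # 550 bags" does not pin down the exact count.
        {"premise": ent_hyp, "hypothesis": premise, "label": "neutral"},
    ]

pairs = make_number_pairs("Tim had 750 bags of cement")
```
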

SLIDE 26

Constructing Distraction Tests

Logic framework for distraction test construction: append a tautology to either the premise or the hypothesis.

  • Entailment: (P ⇒ H) ⇒ ((P ∧ True) ⇒ H)
  • Contradiction: (P ⇒ ¬H) ⇒ ((P ∧ True) ⇒ ¬H)
  • Neutral: appending True still keeps P and H neutral
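The label-preservation argument can be checked mechanically. The sketch below (an illustration, not part of the construction pipeline) enumerates truth assignments over two propositional atoms and confirms that conjoining True to P leaves entailment status unchanged.

```python
# Tiny sanity check of the logic framework: P and (P ∧ True) are
# equivalent in every model, so appending a tautology cannot change
# whether P entails H.
from itertools import product

def entails(f, g):
    """f entails g iff g holds in every assignment where f holds."""
    return all(g(p, q) for p, q in product([True, False], repeat=2)
               if f(p, q))

P = lambda p, q: p          # premise: atom p
H = lambda p, q: q          # hypothesis: atom q
P_taut = lambda p, q: P(p, q) and True

# Entailment status is identical with or without the tautology.
assert entails(P, H) == entails(P_taut, H)
assert entails(P, P) == entails(P_taut, P)
```
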

SLIDE 27

Constructing Distraction Tests (Cont.)

Word Overlap
  • Tautology: “true is true”
  • Append to hypothesis
  • Reduces word overlap

Negation
  • Tautology: “false is not true”
  • Append to hypothesis
  • Introduces strong negation

Length Mismatch
  • Tautology: “(true is true) × 5”
  • Append to premise
  • Adds irrelevant information to the premise

SLIDE 28

Constructing Distraction Tests (Cont.)

Word Overlap
  • Tautology: “true is true”
  • Append to hypothesis
  • Reduces word overlap

Negation
  • Tautology: “false is not true”
  • Append to hypothesis
  • Introduces strong negation

Length Mismatch
  • Tautology: “(true is true) × 5”
  • Append to premise
  • Adds irrelevant information to the premise

Final Entailment Pair (Word Overlap)
Premise: Possibly no other country has had such a turbulent history.
Hypothesis: The country’s history has been turbulent and true is true.
Label: Entailment

Final Entailment Pair (Negation)
Premise: Possibly no other country has had such a turbulent history.
Hypothesis: The country’s history has been turbulent and false is not true.
Label: Entailment

Final Entailment Pair (Length Mismatch)
Premise: Possibly no other country has had such a turbulent history and true is true and true is true and true is true and true is true and true is true.
Hypothesis: The country’s history has been turbulent.
Label: Entailment
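The three perturbations amount to simple string appends; a minimal sketch (the function names are mine):

```python
# Minimal sketch of the three distraction perturbations. Each is
# label-preserving because only a tautology is appended.
TAUTOLOGY = "true is true"

def word_overlap(premise, hypothesis, label):
    """Append 'and true is true' to the hypothesis, reducing overlap."""
    return premise, f"{hypothesis} and {TAUTOLOGY}", label

def negation(premise, hypothesis, label):
    """Append 'and false is not true', introducing a strong negation."""
    return premise, f"{hypothesis} and false is not true", label

def length_mismatch(premise, hypothesis, label):
    """Pad the premise with five copies of the tautology."""
    padding = " and ".join([TAUTOLOGY] * 5)
    return f"{premise} and {padding}", hypothesis, label

example = ("Possibly no other country has had such a turbulent history",
           "The country's history has been turbulent",
           "entailment")
```
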

SLIDE 29

Constructing Noise Tests

S: I saw the movie

  1. Randomly pick a word: “the”
  2. Perturb it:
     • Adjacent character swap: “teh”
     • Keyboard character swap: “yhe”
  3. Replace the original word:
     S: I saw teh movie / S: I saw yhe movie
  4. Replace the original sentence in the pair → NOISY PAIR

SLIDE 30

Constructing Noise Tests

S: I saw the movie

  1. Randomly pick a word: “the”
  2. Perturb it:
     • Adjacent character swap: “teh”
     • Keyboard character swap: “yhe”
  3. Replace the original word:
     S: I saw teh movie / S: I saw yhe movie
  4. Replace the original sentence in the pair → NOISY PAIR

Final Entailment Pair
Premise: I saw yhe movie.
Hypothesis: I thought of going to see a movie.
Label: Neutral
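The character-level perturbations can be sketched as below. The keyboard-neighbor map is a toy assumption covering only the example’s characters, and the perturbed positions are fixed here for clarity, whereas the construction picks them at random.

```python
# Sketch of the noise perturbations. KEYBOARD_NEIGHBORS is a toy
# stand-in for a full keyboard-adjacency map.
KEYBOARD_NEIGHBORS = {"t": "y", "h": "g", "e": "w"}

def adjacent_swap(word):
    """Swap the last two characters: 'the' -> 'teh'."""
    return word[:-2] + word[-1] + word[-2]

def keyboard_swap(word):
    """Replace the first character with a keyboard neighbor: 'the' -> 'yhe'."""
    return KEYBOARD_NEIGHBORS.get(word[0], word[0]) + word[1:]

def perturb(sentence, word_idx, swap):
    """Apply one perturbation to the word at word_idx and rebuild the
    sentence; the pair's gold label is kept unchanged."""
    words = sentence.split()
    words[word_idx] = swap(words[word_idx])
    return " ".join(words)
```

Applying either swap to “the” in the sentence above yields the noisy premises shown on the slide.
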

SLIDE 31

Experimental Setup: Models

SOTA
  • Nie & Bansal, 2017: Shortcut-Stacked BiLSTMs
  • Chen et al., 2017: Shortcut-Stacked BiLSTMs + Char CNN
  • Balazs et al., 2017: BiLSTMs with Inner Attention
  • Conneau et al., 2017: BiLSTMs with Max Pooling
SLIDE 32

Results on Competence Tests

[Chart: model accuracies on the competence tests]
Random Baseline Performance (33%)
SLIDE 33

Results on Competence Tests

[Chart: reported values of 59%–62%, 35%, and 56%]
Random Baseline Performance (33%)
SLIDE 34

Results on Competence Tests

[Chart: reported values of 43%, 53%, 35%, and 42%]
Random Baseline Performance (33%)

SLIDE 35

Performance Analysis on Competence Tests

ANTONYMY
  • All models overpredict entailment relations: 86.4% of all errors are Contradiction → Entailment predictions!
  • Some success on “easy” antonym pairs (seen before in training data)
  • Antonym pairs recognized by the weakest model occur nearly twice as often in training data as pairs recognized by stronger models

NUMERICAL REASONING
  • All models overpredict entailment relations: 78.3% of all errors are Neutral → Entailment or Contradiction → Entailment predictions
  • Models rely on lexical cues due to artifacts in training data
  • No quantitative reasoning performed
SLIDE 36

Results on Distraction Tests

[Chart: model accuracies on the distraction tests]
Random Baseline Performance (33%)

SLIDE 37

Results on Distraction Tests

[Chart: reported values of 15%, 27%, 18%, and 20%]
Random Baseline Performance (33%)

SLIDE 38

Results on Distraction Tests

[Chart: reported values of 21%–22%, 24%, and 35%]
Random Baseline Performance (33%)

SLIDE 39

Results on Distraction Tests

[Chart: reported values of 10%, 23%, 12%, and 26%]
Random Baseline Performance (33%)

SLIDE 40

Performance Analysis on Distraction Tests

High proportion of false neutral errors: these account for 67.1% of errors on the word overlap and length mismatch tests (tests with lexical similarity), compared to 35.05% of errors on MultiNLI Dev.

Why? Crowdworkers are shown a sentence and asked to construct entailed, neutral, and contradictory hypotheses. They construct hypotheses with inherent biases, yielding biased datasets (Bar-Haim et al., 2006; Parent et al., 2010; Wang et al., 2012; Yih et al., 2013; Gururangan et al., 2018; Poliak et al., 2018). Neural models predict entailment for high word overlap and neutral for low word overlap, regardless of semantic meaning.

P: Stimpy the cat believed he could fly
H: Stimpy thought he could fly

SLIDE 41

Results on Noise Tests

[Chart: model accuracies on the noise tests]
Random Baseline Performance (33%)

SLIDE 42

Results on Noise Tests

[Chart: reported values of 5%, 5%, 12%, and 23%]
Random Baseline Performance (33%)

SLIDE 43

Conclusion

Neural networks will take shortcuts! They latch onto misleading correlations in datasets.

SLIDE 44

Conclusion

  • Must ensure that our models are learning reasonable things. Good performance on the dev set != good performance on the task.
  • Avoid attributing good model performance to doing the kinds of reasoning we imagine the task needs.
  • Calls for diagnostic datasets - Stress Tests!
SLIDE 45

Conclusion

All stress tests, code, and the leaderboard are now available: https://abhilasharavichander.github.io/NLI_StressTest/. Use the data to show your models are actually learning!

SLIDE 46

Visit our website: https://abhilasharavichander.github.io/NLI_StressTest/
For questions, contact: anaik@cs.cmu.edu, aravicha@cs.cmu.edu

THANK YOU!