Stress Test Evaluation for Natural Language Inference
Aakanksha Naik*, Abhilasha Ravichander*, Norman Sadeh, Carolyn Rose, Graham Neubig
1
2
Natural Language Inference (a.k.a. Recognizing Textual Entailment)
Premise: Stimpy was a little cat who believed he could fly
Hypothesis: Stimpy could fly
Given a premise, determine whether a hypothesis is True (entailment), False (contradiction), or Undecided (neutral)
(Fyodorov, 2000; Condoravdi, 2003; Bos and Markert, 2005; Dagan et al., 2006; MacCartney and Manning, 2009)
4
Natural Language Inference (a.k.a. Recognizing Textual Entailment)
Benchmark task for Natural Language Understanding:
○ learn good sentence representations: “handle nearly the full complexity of compositional semantics” (Williams et al., 2018)
○ reason over “difficult” phenomena like lexical entailment, quantification, coreference, tense, belief, modality, and lexical and syntactic ambiguity (Dagan et al., 2009; MacCartney and Manning, 2009; Marelli et al., 2014; Williams et al., 2018)
5
Natural Language Inference (a.k.a. Recognizing Textual Entailment)
MultiNLI
[Table: reported model accuracies on MultiNLI]
* All results shown for the matched dev set; refer to the paper for further details.
6
Neural networks can solve nearly ¾ of the examples in the challenging MultiNLI dataset!
But more difficult cases occur rarely and are masked in traditional evaluation, yielding an optimistic estimate of model performance.
We want to figure out whether our systems can make real inferential decisions, and if so, to what extent.
Our approach: build large-scale diagnostic datasets that exercise models on their weaknesses and help us better understand their capabilities.
Stress Testing: testing a system beyond normal operational capacity to confirm that intended specifications are met and to identify weaknesses, if any.
9
Stress Testing for NLI
10
○ Reward a system’s ability to reason about the task instead of encouraging reliance on misleading correlations in datasets
○ “Sanity checking” for NLP models
○ Analyze strengths and weaknesses of various models
○ Fine-grained, phenomenon-by-phenomenon evaluation scheme
11
Premise: And, could it not result in a decline in Postal Service volumes across-the-board?
Hypothesis: There may not be a decline in Postal Service volumes across-the-board.
12
Premise: Enthusiasm for Disney’s Broadway production of The Lion King dwindles.
Hypothesis: The Broadway production of The Lion King is no longer enthusiastically attended.
13
Premise: So you know well a lot of the stuff you hear coming from South Africa now and from West Africa that’s considered world music because it’s not particularly using certain types of folk styles.
Hypothesis: They rely too heavily on the types of folk styles.
14
Premise: Deborah Pryce said Ohio Legal Services in Columbus will receive a $200,000 federal grant toward an online legal self-help center.
Hypothesis: A $900,000 federal grant will be received by Missouri Legal Services, said Deborah Pryce.
15
Premise: “Have her show it,” said Thorn
Hypothesis: Thorn told her to hide it.
16
Premise: So if there are something interesting or something worried, please give me a call at any time.
Hypothesis: The person is open to take a call anytime.
17
Premise: It was still night.
Hypothesis: The sun hadn’t risen yet, for the moon was shining daringly in the sky.
18
Premise: Outside the cathedral you will find a statue of John Knox with Bible in hand.
Hypothesis: John Knox was someone who read the Bible.
19
Premise: We’re going to try something different this morning, said Jon.
Hypothesis: Jon decided to try a new approach.
20
Competence Tests
○ Evaluate model ability to reason about quantities and understand antonyms
○ Target error categories: antonymy, numerical reasoning
○ Construction framework: heuristic rules, external knowledge sources
Distraction Tests
○ Evaluate model robustness to shallow distractions
○ Target error categories: word overlap, negation, length mismatch
○ Construction framework: label-preserving perturbations using propositional logic
Noise Tests
○ Evaluate model robustness to noise in data
○ Target error categories: grammaticality
○ Construction framework: random perturbation
21
P: I saw a big house
Pipeline: POS tagging (nouns: house; adjectives: big) → word sense disambiguation → WordNet antonym lookup (big ↔ small) → replace the word with its antonym in the sentence
H: I saw a small house
23
Final Entailment Pair
Premise: I saw a big house.
Hypothesis: I saw a small house.
Label: Contradiction
Size: 3.2k pairs
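The antonym substitution step above can be sketched in a few lines. This is a minimal sketch, assuming a tiny hand-coded antonym table as a stand-in for WordNet; the actual pipeline also runs POS tagging and word sense disambiguation to choose the target word, and `make_antonym_pair` is an illustrative name.

```python
# Minimal sketch of the antonymy test construction. ANTONYMS is a tiny
# illustrative table standing in for WordNet antonym lookup.
ANTONYMS = {"big": "small", "little": "large", "hot": "cold"}

def make_antonym_pair(sentence):
    """Return a (premise, hypothesis, label) contradiction pair, or None."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in ANTONYMS:
            # replace the word with its antonym to build the hypothesis
            hyp = tokens[:i] + [ANTONYMS[tok.lower()]] + tokens[i + 1:]
            return sentence, " ".join(hyp), "contradiction"
    return None  # no replaceable word found

print(make_antonym_pair("I saw a big house"))
# → ('I saw a big house', 'I saw a small house', 'contradiction')
```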
AQuA-RAT: word problems (question, options, answer, rationale)
Preprocessing: split problems into sentences; discard sentences without numbers and named entities
P: Tim had 750 bags of cement
Randomly choose one quantity: 750 bags
Use heuristics to generate hypotheses:
Ent H: Tim had more than 550 bags
Cont H: Tim had 350 bags of cement
Flip an ENT pair (or a CONT pair) to obtain a NEU pair:
P: Tim had more than 550 bags of cement | H: Tim had 750 bags of cement | L: neutral
25
Final Entailment Pairs
Premise: Tim had 750 bags of cement.
Hypothesis: Tim had more than 550 bags of cement.
Label: Entailment
Premise: Tim had 750 bags of cement.
Hypothesis: Tim had 350 bags of cement.
Label: Contradiction
Size: 7.5k pairs
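The heuristic generation above can be sketched as follows. This is a minimal sketch using one deterministic rule per label; the offsets of 200 and 400 are illustrative only (the actual pipeline samples bounds and applies several heuristics), and `make_quantity_pairs` is a hypothetical name.

```python
import re

# Sketch of the numerical reasoning heuristics: pick one quantity from a
# sentence and rewrite it to produce entailed, contradictory, and (by
# flipping) neutral hypotheses.
def make_quantity_pairs(premise):
    """Generate (premise, hypothesis, label) NLI pairs from one quantity."""
    match = re.search(r"\d+", premise)
    if not match:
        return []
    n = int(match.group())
    ent_h = premise.replace(str(n), f"more than {n - 200}", 1)   # 750 -> "more than 550"
    cont_h = premise.replace(str(n), str(n - 400), 1)            # 750 -> 350
    return [
        (premise, ent_h, "entailment"),
        (premise, cont_h, "contradiction"),
        # flipping an entailment pair yields a neutral pair
        (ent_h, premise, "neutral"),
    ]

for p, h, label in make_quantity_pairs("Tim had 750 bags of cement"):
    print(f"P: {p} | H: {h} | L: {label}")
```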
Logic Framework for Distraction Test Construction: appending a tautology to either the premise or the hypothesis preserves the label. For a premise P and hypothesis H:
Entailment: (P ⇒ H) ⇒ ((P ∧ True) ⇒ H)
Contradiction: (P ⇏ H) ⇒ ((P ∧ True) ⇏ H)
Neutral: appending True still keeps P and H neutral
26
Word Overlap: append the tautology “and true is true” to the hypothesis
Negation: append the tautology “and false is not true” to the hypothesis (introduces a negation word)
Length Mismatch: append the tautology “and true is true” five times, adding length but no information to the premise
28
Final Entailment Pair (Word Overlap)
Premise: Possibly no other country has had such a turbulent history and true is true.
Hypothesis: The country’s history has been turbulent.
Label: Entailment

Final Entailment Pair (Negation)
Premise: Possibly no other country has had such a turbulent history and false is not true.
Hypothesis: The country’s history has been turbulent.
Label: Entailment

Final Entailment Pair (Length Mismatch)
Premise: Possibly no other country has had such a turbulent history and true is true and true is true and true is true and true is true and true is true.
Hypothesis: The country’s history has been turbulent.
Label: Entailment
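These perturbations reduce to simple string operations. A minimal sketch follows; per the logic framework, appending the tautology to either sentence preserves the label, and here word overlap and negation perturb the hypothesis while length mismatch pads the premise (function names are illustrative).

```python
# Minimal sketch of the three label-preserving distraction perturbations.
def _append(sentence, tautology):
    # strip the final period, attach the tautology, restore the period
    return sentence.rstrip(". ") + " " + tautology + "."

def word_overlap(premise, hypothesis):
    # tautology inflates premise-hypothesis word overlap
    return premise, _append(hypothesis, "and true is true")

def negation(premise, hypothesis):
    # tautology introduces a spurious negation word ("not")
    return premise, _append(hypothesis, "and false is not true")

def length_mismatch(premise, hypothesis):
    # repeat the tautology five times to pad the premise
    return _append(premise, " ".join(["and true is true"] * 5)), hypothesis

p = "Possibly no other country has had such a turbulent history."
h = "The country's history has been turbulent."
print(word_overlap(p, h)[1])
# → The country's history has been turbulent and true is true.
```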
S: I saw the movie
Randomly pick a word: “the”
Generate candidates: adjacent character swap → “teh”; keyboard character swap → “yhe”
Replace the word in the sentence: S: I saw teh movie / S: I saw yhe movie
Replace the perturbed sentence in the original pair → NOISY PAIR
30
Final Entailment Pair
Premise: I saw yhe movie.
Hypothesis: I thought of going to see a movie.
Label: Neutral
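The character-level perturbations can be sketched as below, assuming a small illustrative QWERTY neighbor table (the full pipeline uses a complete keyboard layout and picks the word and position at random).

```python
import random

# Minimal sketch of the noise perturbations. QWERTY_NEIGHBORS is an
# illustrative subset, not a full keyboard layout.
QWERTY_NEIGHBORS = {"t": "rfgy", "h": "gjnb", "e": "wsdr"}

def adjacent_swap(word, i):
    """Swap characters i and i+1: 'the', i=1 -> 'teh'."""
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def keyboard_swap(word, i, rng=random):
    """Replace character i with a QWERTY neighbor: 'the', i=0 -> e.g. 'yhe'."""
    neighbors = QWERTY_NEIGHBORS.get(word[i], word[i])
    return word[:i] + rng.choice(neighbors) + word[i + 1:]

words = "I saw the movie".split()
words[2] = adjacent_swap(words[2], 1)   # the -> teh
print(" ".join(words))
# → I saw teh movie
```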
[Charts: stress test performance of SOTA models]
31
[Charts: model accuracy on the antonymy and numerical reasoning competence tests vs. random baseline performance (33%); contradiction vs. entailment predictions on antonymy; numerical reasoning better recognized by the stronger model]
[Charts: model accuracy on the distraction and noise tests vs. random baseline performance (33%)]
High proportion of false neutral errors: they account for 67.1% of errors on the word overlap and length mismatch tests (tests with high lexical similarity), compared to 35.05% of errors on the MultiNLI dev set.
Why? Crowdworkers, shown a sentence and asked to construct entailed, neutral, and contradictory hypotheses, write hypotheses with inherent biases, yielding biased datasets (Bar-Haim et al., 2006; Parent et al., 2010; Wang et al., 2012; Yih et al., 2013; Gururangan et al., 2018; Poliak et al., 2018).
Neural models predict entailment for high word overlap and neutral for low word overlap, regardless of semantic meaning.
P: Stimpy the cat believed he could fly
H: Stimpy thought he could fly
41
[Charts: model accuracy vs. random baseline performance (33%)]
Neural networks will take shortcuts! They latch onto misleading correlations in datasets.
43
Good performance on the dev set != good performance on the task.
44
Models may not be doing the reasoning we imagine the task needs.
45
All stress tests, code, and a leaderboard are now available: https://abhilasharavichander.github.io/NLI_StressTest/. Use the data to show your models are actually learning!
For questions, contact: anaik@cs.cmu.edu, aravicha@cs.cmu.edu
46