slide-1
SLIDE 1

Adversarial NLI: A New Benchmark for Natural Language Understanding

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, Douwe Kiela UNC Chapel Hill & Facebook AI Research

1

slide-2
SLIDE 2

Development of AI has been driven by benchmarks and datasets. Computer Vision: ImageNet (Russakovsky et al. 2015). NLP: SQuAD (Rajpurkar et al. 2016), GLUE (Wang et al. 2018).

2

slide-3
SLIDE 3

[Chart: ImageNet top-5 error rate by model and year — XRCE (2011): 26, AlexNet (2012): 16.4, ZF (2013): 11.7, VGG (2014): 7.3, GoogLeNet (2014): 6.7, ResNet (2015): 3.6, GoogLeNet-v4 (2016): 3.1, SENet (2017): 2.3. Human error rate: 5.1. Models surpassed human performance within about 3 years.]

3

slide-4
SLIDE 4

[Chart: SQuAD exact-match score by model and year — Match-LSTM Ptr (2016): 64.74, BiDAF (2016): 67.97, BiDAF+SelfAtt (2017): 72.14, BiDAF+SelfAtt+ELMo (2018): 78.58, BERT (2018): 85.08, XLNet (2019): 89.9. Human: 86.8. Models surpassed human performance within about 2 years.]

4

slide-5
SLIDE 5

[Chart: GLUE score by model and year — BiLSTM+Attn+ELMo (2018): 70, BERT (2018): 80.5, RoBERTa (2019): 88.1, T5 (2019): 90.3. Human: 87.1. Models surpassed human performance within about 1 year.]

5

slide-6
SLIDE 6
  • Word2Vec
  • GloVe
  • ELMo
  • GPT-1
  • BERT
  • RoBERTa
  • GPT-2

Superhuman performance achieved

Model vs. Human on Static Benchmarks

Human won Human still won

6

slide-9
SLIDE 9

Are current NLU models genuinely as good as their high performance on static benchmarks suggests?

  • Word2Vec
  • GloVe
  • ELMo
  • GPT-1
  • BERT
  • RoBERTa
  • GPT-2

  • T5
  • GPT-3

Superhuman performance achieved

Model vs. Human on Static Benchmarks

Human won Human still won ……

9

slide-11
SLIDE 11

Overestimated NLU Ability

Adversary for reading comprehension (Jia and Liang, 2017) Adversary for natural language inference (Nie et al., 2018)

  • Annotation artifacts (Gururangan et al., 2018; Poliak et al., 2018)
  • Breaking NLI with lexical inference (Glockner et al., 2018)
  • Pathologies of neural models (Feng et al., 2018)
  • Modeling task or annotator? (Geva et al., 2019)
  • Right for the wrong reason (McCoy et al., 2019)
  • …

The state-of-the-art models learn to exploit spurious statistical patterns and are vulnerable to adversaries.

11
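The hypothesis-only findings above (Gururangan et al., 2018; Poliak et al., 2018) are easy to demonstrate: a classifier that never sees the premise can still beat chance by latching onto give-away words in the hypothesis. A minimal sketch of such a baseline (the toy data and cue words below are illustrative, not taken from those papers):

```python
from collections import Counter, defaultdict

def train_hypothesis_only(examples):
    """Count how often each hypothesis token co-occurs with each label."""
    token_label_counts = defaultdict(Counter)
    label_counts = Counter()
    for hypothesis, label in examples:
        label_counts[label] += 1
        for token in set(hypothesis.lower().split()):
            token_label_counts[token][label] += 1
    return token_label_counts, label_counts

def predict_hypothesis_only(hypothesis, token_label_counts, label_counts):
    """Score labels by summing per-token label counts; the premise is never used."""
    scores = Counter()
    for token in set(hypothesis.lower().split()):
        for label, count in token_label_counts[token].items():
            scores[label] += count
    if not scores:
        return label_counts.most_common(1)[0][0]
    return scores.most_common(1)[0][0]

# Toy data mimicking known artifacts: negation words correlate with
# contradiction, vague quantifiers and modifiers with neutral.
train = [
    ("a man is sleeping", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("the man is not walking", "contradiction"),
    ("some people are very happy", "neutral"),
    ("the tall man is walking", "neutral"),
    ("a man is walking", "entailment"),
    ("a person is outside", "entailment"),
]
model = train_hypothesis_only(train)
print(predict_hypothesis_only("the dog is not barking", *model))  # → contradiction
```

On real SNLI data, baselines of this kind score well above the 33% chance level without ever reading the premise, which is exactly the spurious-pattern problem ANLI targets.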

slide-12
SLIDE 12

Performance is Overestimated

Model brittleness can be exposed by researchers and even non-experts. Despite high benchmark scores, general NLU is still far from achieved. How can we address both the rapid saturation of benchmarks and the robustness issues?

12

slide-13
SLIDE 13

13

slide-14
SLIDE 14

HAMLET

Human-And-Model-in-the-Loop Enabled Training

In NLI terminology, the context serves as the premise.

14
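One HAMLET round can be sketched as code (the names and stubs below are illustrative, not from the paper; in the real setup the annotator is a crowdworker, verification is done by other humans, and the model is retrained between rounds):

```python
import random

def hamlet_round(model, contexts, write_hypothesis, verify, target_size):
    """Collect examples that fool the current model: an annotator writes a
    hypothesis for a target label, the model-in-the-loop predicts, and
    verified model-fooling examples are kept for the next round's dataset."""
    collected = []
    while len(collected) < target_size:
        context = random.choice(contexts)
        target = random.choice(["entailment", "neutral", "contradiction"])
        hypothesis = write_hypothesis(context, target)   # human in the loop
        prediction = model(context, hypothesis)          # model in the loop
        if prediction != target and verify(context, hypothesis, target):
            collected.append((context, hypothesis, target))
    return collected

# Stub components to show the control flow: a model that always predicts
# "entailment", a scripted annotator, and a verifier that accepts everything.
always_entail = lambda context, hypothesis: "entailment"
annotator = lambda context, label: f"a hypothesis aiming for {label}"
accept_all = lambda context, hypothesis, label: True
examples = hamlet_round(always_entail, ["some premise"], annotator, accept_all, 3)
```

Every collected example is, by construction, one the current model gets wrong, which is why each round targets the weaknesses of a stronger model than the last.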

slide-20
SLIDE 20

Related work: Adversarial & Human-in-the-Loop

20

slide-23
SLIDE 23

Adversarial NLI (ANLI)

Analogy: white-hat hackers finding vulnerabilities in models, which we then patch for the next round.

Three rounds of data collection.

  • Round 1 (A1)

Model: BERT (Trained on SNLI+MNLI) Domain: Wikipedia

  • Round 2 (A2)

Model: RoBERTa ensemble (Trained on SNLI+MNLI+FEVER+A1) Domain: Wikipedia

  • Round 3 (A3)

Model: RoBERTa ensemble (Trained on SNLI+MNLI+FEVER+A1+A2) Domains: Wikipedia, News, Fiction, Spoken, WikiHow, RTE5

Dataset         Genre     Contexts   Train / Dev / Test
A1              Wiki      2,080      16,946 / 1,000 / 1,000
A2              Wiki      2,694      45,460 / 1,000 / 1,000
A3              Various   6,002      100,459 / 1,200 / 1,200
  (Wiki subset)           1,000      19,920 / 200 / 200
ANLI            Various   10,776     162,865 / 3,200 / 3,200

23

SNLI: 570K MNLI: 433K ANLI: 163K

  • Adversarially collected
  • More data-efficient in training
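The split sizes in the table above are internally consistent and make the data-efficiency point concrete; a quick sanity check:

```python
# Train / dev / test sizes per ANLI round, copied from the table above.
splits = {
    "A1": (16_946, 1_000, 1_000),
    "A2": (45_460, 1_000, 1_000),
    "A3": (100_459, 1_200, 1_200),
}
train_total = sum(train for train, _, _ in splits.values())
dev_total = sum(dev for _, dev, _ in splits.values())
print(train_total, dev_total)  # 162865 3200

# ANLI's training set (~163K) is under a fifth of SNLI+MNLI (~570K + ~433K).
assert train_total < (570_000 + 433_000) / 5
```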
slide-25
SLIDE 25

Collection Statistics

[Chart: Model error rate (%) during collection — A1: 29.68, A2: 16.59, A3: 14.79 (Wiki) / 17.47 (all domains).]

[Chart: Median time (sec.) per example during collection — A1: 125.2, A2: 189.1, A3: 189.6 (Wiki) / 157 (all domains).]

The error rate halved over the three rounds, but room for improvement on NLI still exists.

25
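The "error rate halved" takeaway can be checked directly against the per-round numbers (using the Wiki-context series, the one available for all three rounds, and reading 14.79 as the A3 Wiki figure as the chart legend suggests):

```python
# Model error rate (%) during collection on Wiki contexts, per round.
error_rate = {"A1": 29.68, "A2": 16.59, "A3": 14.79}

# From round 1 to round 3 the rate more than halved...
assert error_rate["A3"] < error_rate["A1"] / 2
# ...yet annotators still fooled the model on roughly 1 in 7 attempts,
# so the task is far from saturated.
assert error_rate["A3"] > 10
```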

slide-26
SLIDE 26

Findings

The base model (the backend model used during collection) achieves low accuracy on the examples collected against it.

26

slide-32
SLIDE 32

[Chart: RoBERTa accuracy on A1, A2, and A3 as training data is accumulated (S = SNLI, M = MNLI, F = FEVER): S+M, S+M+F, S+M+F+A1, S+M+F+A1+A2, S+M+F+A1+A2+A3, plus the chance baseline.]

Rounds become increasingly more difficult.

32

slide-33
SLIDE 33

[Chart: RoBERTa accuracy on A1, A2, and A3 as training data is accumulated (S = SNLI, M = MNLI, F = FEVER): S+M, S+M+F, S+M+F+A1, S+M+F+A1+A2, S+M+F+A1+A2+A3, plus the chance baseline.]

Training on more rounds improves robustness.

33

slide-35
SLIDE 35

[Chart: accuracy on A1, A2, and A3 for RoBERTa (All Data) vs. XLNet (All Data) vs. BERT (All Data), plus the chance baseline.]

Different models have different weaknesses.

35

slide-38
SLIDE 38

[Chart: RoBERTa accuracy on A1, A2, A3, SNLI, MNLI-m, and MNLI-mm with different training data: SNLI+MNLI (~900K examples), ANLI only (162K examples), and the chance baseline.]

A model trained only on SNLI and MNLI (statically collected) performs poorly on ANLI, but a model trained only on ANLI (adversarially collected) is reasonably good at SNLI and MNLI, even though ANLI is less than 1/5 the size of SNLI+MNLI.

38

slide-39
SLIDE 39

[Chart: RoBERTa accuracy on A1, A2, A3, SNLI, MNLI-m, and MNLI-mm with different training data: SNLI+MNLI (~900K examples), ANLI only (162K examples), ANLI+SNLI+MNLI, and the chance baseline.]

A model trained only on SNLI and MNLI (statically collected) performs poorly on ANLI; a model trained only on ANLI (adversarially collected) is reasonably good at SNLI and MNLI; and combining them helps on both.

39

slide-40
SLIDE 40

NLI Stress Test

Model             SNLI-Hard   AT (m/mm)     NR     LN (m/mm)     NG (m/mm)     WO (m/mm)     SE (m/mm)
Previous models   72.7        14.4 / 10.2   28.8   58.7 / 59.4   48.8 / 46.6   50.0 / 50.2   58.3 / 59.4
BERT (All)        82.3        75.0 / 72.9   65.8   84.2 / 84.6   64.9 / 64.4   61.6 / 60.6   78.3 / 78.3
XLNet (All)       83.5        88.2 / 87.1   85.4   87.5 / 87.5   59.9 / 60.0   68.7 / 66.1   84.3 / 84.4
RoBERTa (S+M+F)   84.5        81.6 / 77.2   62.1   88.0 / 88.5   61.9 / 61.9   67.9 / 66.2   86.2 / 86.5
RoBERTa (All)     84.7        85.9 / 82.1   80.6   88.4 / 88.5   62.2 / 61.9   67.4 / 65.6   86.3 / 86.7

Table 4: Model performance on SNLI-Hard and the NLI Stress Tests (tuned on their respective dev sets). All = S+M+F+ANLI; AT = Antonym; NR = Numerical Reasoning; LN = Length; NG = Negation; WO = Word Overlap; SE = Spell Error. "Previous models" refers to the Naik et al. (2018) implementation of InferSent (Conneau et al., 2017) for the Stress Tests, and to the Gururangan et al. (2018) implementation of DIIN (Gong et al., 2018) for SNLI-Hard.

Training on ANLI is useful for the Antonym, Numerical Reasoning, and Negation tests.

40
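The takeaway can be quantified from Table 4 by differencing the two RoBERTa rows, which isolates the effect of adding ANLI to the training data (matched-section scores shown):

```python
# Matched-section stress-test scores for RoBERTa, from Table 4.
without_anli = {"AT": 81.6, "NR": 62.1, "LN": 88.0, "NG": 61.9, "WO": 67.9, "SE": 86.2}
with_anli    = {"AT": 85.9, "NR": 80.6, "LN": 88.4, "NG": 62.2, "WO": 67.4, "SE": 86.3}
gains = {test: round(with_anli[test] - without_anli[test], 1) for test in without_anli}
print(gains)  # Numerical Reasoning gains the most: +18.5
```

On this comparison the Antonym (+4.3) and especially Numerical Reasoning (+18.5) gains stand out; the other categories move by half a point or less.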

slide-41
SLIDE 41

Analysis

What kind of vulnerabilities do annotators find?

41

The types of inference in the data changed across rounds, and so did the model weaknesses.

slide-42
SLIDE 42

Examples

Premise: Kota Ramakrishna Karanth (born May 1, 1894) was an Indian lawyer and politician who served as the Minister of Land Revenue for the Madras Presidency from March 1, 1946 to March 23, 1947. He was the elder brother of noted Kannada novelist K. Shivarama Karanth.
Hypothesis: Kota Ramakrishna Karanth has a brother who was a novelist and a politician.
Reason: Although Kota Ramakrishna Karanth's brother is a novelist, we do not know if the brother is also a politician.
Model prediction: Entailment
Human label: Neutral
Linguistic annotation: Standard (Conjunction), Reasoning (Plausibility, Likely), Tricky (Syntactic)

slide-43
SLIDE 43

Discussion

  • HAMLET is model-agnostic (different backend models can be ensembled).
  • It can easily be applied to any classification task.

What is underexplored?
  • How to extend the framework to generation tasks.
  • The cost and time trade-off between adversarial and static data collection.

43

slide-44
SLIDE 44

Summary

  • NLU is far from solved;
  • HAMLET (Human-And-Model-in-the-Loop Enabled Training);
  • We applied it to NLI and collected ANLI;
  • The procedure can yield progressively more difficult, iterative benchmarks.

"… all of our models smaller than GPT-3 perform at almost exactly random chance on ANLI, even in the few-shot setting (∼33%), whereas GPT-3 itself shows signs of life on Round 3." GPT-3 performance on ANLI (A1/A2/A3): 36.8 / 34.0 / 40.2.

Ideally, in the limit, HAMLET can help converge towards "real NLU". Adversarial collection and training help improve robustness.

44

slide-45
SLIDE 45

Thank you

Demo: https://adversarialnli.com/ GitHub: https://github.com/facebookresearch/anli/

45