Questioning Question Answering Answers Sameer Singh University of - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Questioning Question Answering Answers

Sameer Singh

University of California, Irvine

slide-2
SLIDE 2

Questioning Question Answering Answers

Sameer Singh

University of California, Irvine

slide-3
SLIDE 3

QA Systems are really good!

Is there a moustache in the picture? > Yes
What is the moustache made of? > Banana

Visual7W [Zhu et al 2016]

slide-4
SLIDE 4

QA Systems are really good!

The biggest city on the river Rhine is Cologne, Germany with a population of more than 1,050,000 people. It is the second-longest river in Central and Western Europe (after the Danube), at about 1,230 km (760 mi)

How long is the Rhine? > 1230km

Is it doing the right thing?

BiDAF [Seo et al 2017]

slide-5
SLIDE 5

We know that they are not

Jia and Liang, EMNLP 2017 Mudrakarta et al ACL 2018

slide-6
SLIDE 6

Overstability!

What is the moustache made of? > Banana
What are the eyes made of? > Bananas
What is? > Banana
What? > Banana
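One way to probe for this kind of overstability is to truncate the question word by word and check whether the answer ever changes. The sketch below is only an illustration: `toy_vqa_model` is a hypothetical stand-in for a real VQA system, and `truncation_probe` and `is_overstable` are names invented here, not part of any tool from the talk.

```python
from typing import Callable, List, Tuple


def toy_vqa_model(question: str) -> str:
    """Hypothetical stand-in for a VQA model that ignores most of the question."""
    return "banana" if question.lower().startswith("what") else "yes"


def truncation_probe(
    predict: Callable[[str], str], question: str
) -> List[Tuple[str, str]]:
    """Return (truncated question, prediction) pairs for every prefix."""
    words = question.rstrip("?").split()
    results = []
    # Go from the full question down to its first word only.
    for n in range(len(words), 0, -1):
        prefix = " ".join(words[:n]) + "?"
        results.append((prefix, predict(prefix)))
    return results


def is_overstable(predict: Callable[[str], str], question: str) -> bool:
    """True if every prefix, down to one word, yields the original answer."""
    original = predict(question)
    probes = truncation_probe(predict, question)
    return all(answer == original for _, answer in probes)
```

Calling `is_overstable(toy_vqa_model, "What is the moustache made of?")` reports that the toy model keeps its answer even for the one-word question "What?", which mirrors the behavior shown on this slide.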

slide-7
SLIDE 7

Oversensitivity to phrasing!

What type of road sign is shown? > Do not Enter. > STOP. What type of road sign is shown?

slide-8
SLIDE 8

Oversensitivity to unimportant typos!

The biggest city on the river Rhine is Cologne, Germany with a population of more than 1,050,000 people. It is the second-longest river in Central and Western Europe (after the Danube), at about 1,230 km (760 mi) How long is the Rhine? > More than 1,050,000 > 1230km How long is the Rhine?

slide-9
SLIDE 9

QA Systems are brittle

  • Our goals are to provide automated tools
      • For both oversensitivity and overstability
      • Can we figure these out automatically, with minimal human time?
      • Can we try to rationalize/explain predictions? Analyze the mistakes?
  • Hopefully, they help with design choices for:
      • Data gathering and annotations
      • Model structure and training
      • Evaluation pipelines
slide-10
SLIDE 10

Being Model-Agnostic…

Ignore the internal structure

X1 > 0.5 X2 > 0.5

f(x)

Practically easy: not tied to PyTorch, TensorFlow, etc.
Not restricted to differentiable modules
Study models that you don’t have access to!
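A minimal sketch of what model-agnostic probing can look like: the analysis code only ever calls `f(x)` and never inspects gradients or internals. The two-level decision tree below is a hypothetical example loosely modeled on the X1 > 0.5, X2 > 0.5 diagram; `feature_effects` is a name invented for this illustration.

```python
from typing import Callable, List, Sequence


def tree_model(x: Sequence[float]) -> int:
    """Hypothetical black box: a small decision tree, so no gradients exist."""
    if x[0] > 0.5:
        return 1 if x[1] > 0.5 else 0
    return 0


def feature_effects(
    f: Callable[[Sequence[float]], int],
    x: Sequence[float],
    replacement: float = 0.0,
) -> List[bool]:
    """For each feature, report whether overwriting it changes f's output."""
    base = f(x)
    effects = []
    for i in range(len(x)):
        probe = list(x)
        probe[i] = replacement  # perturb one feature, keep the rest fixed
        effects.append(f(probe) != base)
    return effects
```

Because `feature_effects` takes any callable, the same probe would work for a remote API or a model from another framework, as long as it can be wrapped in a function from inputs to predictions.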

slide-11
SLIDE 11

Talk Overview

LIME: Linear Explanations
Anchors: Sufficient Conditions
SEARS: Detecting Oversensitivity
Explaining Predictions

slide-12
SLIDE 12

Talk Overview

LIME: Linear Explanations
Anchors: Sufficient Conditions
SEARS: Detecting Oversensitivity
Explaining Predictions

slide-13
SLIDE 13

Being Local…

“Global” explanation is too complicated

slide-14
SLIDE 14

Being Local…

“Global” explanation is too complicated

slide-15
SLIDE 15

Being Local…

“Global” explanation is too complicated
Describe the locally-accurate behavior, using interpretable representations

slide-16
SLIDE 16

Talk Overview

LIME: Linear Explanations
Anchors: Sufficient Conditions
SEARS: Detecting Oversensitivity
Explaining Predictions

KDD 2016

slide-17
SLIDE 17

LIME: Sparse, Linear Explanations

Identify the important words, and present their relative importance
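The idea can be sketched as follows: randomly drop words, query the black box on each perturbed text, weight samples by how close they stay to the original, and fit a weighted linear model whose coefficients serve as word importances. This is a simplified illustration in pure Python, not the reference implementation: distance is taken as the fraction of words removed, a small ridge term replaces the sparse feature-selection step, and `toy_classifier`, `lime_text`, and `solve_linear_system` are hypothetical names.

```python
import math
import random
from typing import Callable, List, Tuple


def toy_classifier(text: str) -> float:
    """Hypothetical black box returning a class probability for a text."""
    tokens = text.split()
    return 0.2 + 0.5 * ("Christianity" in tokens) + 0.2 * ("religion" in tokens)


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def lime_text(
    predict: Callable[[str], float],
    text: str,
    num_samples: int = 500,
    kernel_width: float = 0.75,
    ridge: float = 1e-3,
    seed: int = 0,
) -> List[Tuple[str, float]]:
    """Return (word, weight) pairs sorted by absolute weight, largest first."""
    rng = random.Random(seed)
    words = text.split()
    d = len(words)
    rows, targets, weights = [], [], []
    for s in range(num_samples):
        # The first sample is the unperturbed text; the rest drop random words.
        mask = [1] * d if s == 0 else [rng.randint(0, 1) for _ in range(d)]
        perturbed = " ".join(w for w, keep in zip(words, mask) if keep)
        distance = 1.0 - sum(mask) / d  # assumption: fraction of words removed
        rows.append([1.0] + [float(v) for v in mask])  # leading 1.0 = intercept
        targets.append(predict(perturbed))
        weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))
    k = d + 1
    # Weighted normal equations: (X^T W X + ridge * I) beta = X^T W y.
    gram = [
        [sum(w * r[i] * r[j] for w, r in zip(weights, rows)) for j in range(k)]
        for i in range(k)
    ]
    for i in range(1, k):
        gram[i][i] += ridge  # the intercept is left unpenalized
    rhs = [
        sum(w * r[i] * y for w, r, y in zip(weights, rows, targets))
        for i in range(k)
    ]
    coef = solve_linear_system(gram, rhs)
    return sorted(zip(words, coef[1:]), key=lambda pair: -abs(pair[1]))
```

Keeping only the first few entries of the returned list gives a sparse explanation; on the toy classifier, the top-ranked word for "I think Christianity is the one true religion" is "Christianity".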

slide-18
SLIDE 18

What an explanation looks like

Why did this happen?

From: Keith Richards
Subject: Christianity is the answer
NNTP-Posting-Host: x.x.com

I think Christianity is the one true religion. If you’d like to know more, send me a note

slide-19
SLIDE 19

LIME on VisualQA

What type of road sign is shown? > STOP. LIME What type of road sign is shown?

slide-20
SLIDE 20

LIME on SQuAD

The biggest city on the river Rhine is Cologne, Germany with a population of more than 1,050,000 people. It is the second-longest river in Central and Western Europe (after the Danube), at about 1,230 km (760 mi)

What is the longest river in Central and Western Europe? > the Danube

BiDAF [Seo et al 2017]

LIME What is the longest river in Central and Western Europe?

slide-21
SLIDE 21

LIME on SQuAD

The biggest city on the river Rhine is Cologne, Germany with a population of more than 1,050,000 people. It is the second-longest river in Central and Western Europe (after the Danube), at about 1,230 km (760 mi) What is the second longest river in Central and Western Europe?

BiDAF [Seo et al 2017]

LIME What is the second longest river in Central and Western Europe? the Danube

slide-22
SLIDE 22

Limitations of LIME

Gain understanding of local behavior, but very little generalization…
Unless they run it, the users have little idea of what the answer will be

The biggest city on the river Rhine is Cologne, Germany with a population of more than 1,050,000 people. It is the second-longest river in Central and Western Europe (after the Danube), at about 1,230 km (760 mi) Which is the second longest river in Germany’s part of Europe?

slide-23
SLIDE 23

Talk Overview

LIME: Linear Explanations
Anchors: Sufficient Conditions
SEARS: Detecting Oversensitivity
Explaining Predictions

AAAI 2018

slide-24
SLIDE 24

Anchors: Sufficient Conditions

Identify the conditions under which the classifier has the same prediction
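A sufficient condition can be checked empirically: hold the anchor words fixed, resample everything else, and measure how often the prediction stays the same. The sketch below estimates that precision for one candidate anchor. It is an assumption-laden simplification: the actual method also searches over candidates and uses a more careful perturbation distribution, while here replacement words are drawn uniformly from a small hypothetical vocabulary. `toy_vqa_model` and `anchor_precision` are invented names.

```python
import random
from typing import Callable, Sequence, Set


def toy_vqa_model(question: str) -> str:
    """Hypothetical model whose answer depends only on the first word."""
    return "STOP" if question.split()[:1] == ["What"] else "Do not Enter"


def anchor_precision(
    predict: Callable[[str], str],
    words: Sequence[str],
    anchor_positions: Set[int],
    vocabulary: Sequence[str],
    num_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Estimate P(same prediction | anchor words kept, other words resampled)."""
    rng = random.Random(seed)
    target = predict(" ".join(words))
    same = 0
    for _ in range(num_samples):
        sample = [
            w if i in anchor_positions else rng.choice(vocabulary)
            for i, w in enumerate(words)
        ]
        if predict(" ".join(sample)) == target:
            same += 1
    return same / num_samples
```

For the toy model, anchoring position 0 ("What") gives a precision of 1.0, while an empty anchor gives a much lower estimate, which is the sense in which "What" is sufficient for the prediction.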

slide-25
SLIDE 25

Anchors on VisualQA

What type of road sign is shown? > STOP.

Anchor: if the question starts with “What” (and is similarly structured), the prediction will be STOP

96.8%

slide-26
SLIDE 26

Anchors on Visual QA

Anchor

slide-27
SLIDE 27

Anchors on Visual QA

Anchor

slide-28
SLIDE 28

Anchors on SQuAD

The biggest city on the river Rhine is Cologne, Germany with a population of more than 1,050,000 people. It is the second-longest river in Central and Western Europe (after the Danube), at about 1,230 km (760 mi) What is the longest river in Central and Western Europe? the Danube What is the longest river in Central and Western Europe? What is the longest river in Central and Western Europe?

96.5%

slide-29
SLIDE 29

Anchors on SQuAD

The biggest city on the river Rhine is Cologne, Germany with a population of more than 1,050,000 people. It is the second-longest river in Central and Western Europe (after the Danube), at about 1,230 km (760 mi) What is the second longest river in Central and Western Europe? the Danube What is the second longest river in Central and Western Europe? What is the second longest river in Central and Western Europe?

slide-30
SLIDE 30

User study on VisualQA

Show humans predictions + explanations
Ask them to predict what the model will do in new instances (only if confident)

Anchor: “longest river” → Danube

Anchor Which is second longest river? , , “I don’t know”

Which is the longest river ?

No explanations

Danube Danube Rhine

LIME

Which is the longest river ?

Danube

slide-31
SLIDE 31

Summary of VisualQA Results

[Figure: three bar charts comparing No Explanations, LIME, and Anchor. Panels, in order: how often users predict, how often they are correct, time per prediction. Values as extracted, per panel: 35.3, 62.85, 29.6; 64.95, 66.9, 95.95; 16.3, 9.85, 4.55.]

Users are more precise and quicker with anchors

slide-32
SLIDE 32

Anchors: Tools for Overstability

What about Over-sensitivity?

slide-33
SLIDE 33

Talk Overview

LIME: Linear Explanations
Anchors: Sufficient Conditions
SEARS: Detecting Oversensitivity
Explaining Predictions

ACL 2018

slide-34
SLIDE 34

Oversensitivity: Adversarial Examples

Find closest example with different prediction
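For text, "closest" can be approximated by edit distance in words. The following sketch searches all one-word substitutions and returns the first variant whose prediction differs. The substitution table and `toy_vqa_model` are hypothetical, and a real search would need a notion of semantic similarity rather than a hand-written table.

```python
from typing import Callable, Dict, List, Optional


def toy_vqa_model(question: str) -> str:
    """Hypothetical model that is oversensitive to the first word."""
    return "STOP" if question.startswith("What") else "Do not Enter"


def closest_flip(
    predict: Callable[[str], str],
    question: str,
    substitutions: Dict[str, List[str]],
) -> Optional[str]:
    """Return a one-word-edit variant with a different prediction, or None."""
    original = predict(question)
    words = question.split()
    for i, word in enumerate(words):
        for alternative in substitutions.get(word, []):
            candidate = " ".join(words[:i] + [alternative] + words[i + 1:])
            if predict(candidate) != original:
                return candidate
    return None
```

With `{"What": ["Which"], "type": ["kind"]}` as the table, the search on "What type of road sign is shown?" returns "Which type of road sign is shown?", since that single edit already flips the toy model.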

slide-35
SLIDE 35

Oversensitivity in images

Adversaries are indistinguishable to humans… But unlikely in the real world (except for attacks)

“panda” (57.7% confidence) → “gibbon” (99.3% confidence)

slide-36
SLIDE 36

What about text?

What type of road sign is shown? > STOP. What type of road sign is shown?

Perceptible by humans, unlikely in real world

What type of road sign is sho wn?

slide-37
SLIDE 37

What about text?

What type of road sign is shown? > STOP. What type of road sign is shown?

A single word changes too much!

slide-38
SLIDE 38

Semantics matter

What type of road sign is shown? > Do not Enter. > STOP. What type of road sign is shown?

Bug, and likely in the real world

slide-39
SLIDE 39

Semantics matter

The biggest city on the river Rhine is Cologne, Germany with a population of more than 1,050,000 people. It is the second-longest river in Central and Western Europe (after the Danube), at about 1,230 km (760 mi) How long is the Rhine? > More than 1,050,000 > 1230km How long is the Rhine?

Not all changes are the same: meaning should be same

slide-40
SLIDE 40

Characterize via Rules

Find rule that generates many adversaries

slide-41
SLIDE 41

Characterizing via Rules

What type of road sign is shown? > Do not Enter. > STOP. What type of road sign is shown?

  • flips 3.9% of examples

Rule: What NOUN → Which NOUN
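A rule's impact can be measured by applying it to every question in a dataset and counting how often the prediction changes. The sketch below approximates "What NOUN → Which NOUN" with a regular expression over a small hand-picked noun list, since real part-of-speech tagging is out of scope here; the noun list, `toy_vqa_model`, and `rule_flip_rate` are assumptions for illustration, and the flip rate is computed over all questions, which may differ from how the figure on the slide was computed.

```python
import re
from typing import Callable, Sequence


def toy_vqa_model(question: str) -> str:
    """Hypothetical model that is oversensitive to the first word."""
    return "STOP" if question.startswith("What") else "Do not Enter"


def rule_flip_rate(
    predict: Callable[[str], str],
    questions: Sequence[str],
    pattern: str,
    replacement: str,
) -> float:
    """Fraction of all questions whose prediction changes under the rule."""
    if not questions:
        return 0.0
    flipped = 0
    for question in questions:
        rewritten = re.sub(pattern, replacement, question)
        # Count a flip only if the rule applied and the prediction changed.
        if rewritten != question and predict(rewritten) != predict(question):
            flipped += 1
    return flipped / len(questions)


# Crude stand-in for "What NOUN": "What" followed by one of a few nouns.
WHAT_NOUN = r"^What (type|color|kind)\b"
WHICH_NOUN = r"Which \1"
```

On a four-question toy set where the rule applies to two questions, `rule_flip_rate(toy_vqa_model, questions, WHAT_NOUN, WHICH_NOUN)` returns 0.5; a single rule thus summarizes many individual adversaries at once.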

slide-42
SLIDE 42

Characterizing via Rules

The biggest city on the river Rhine is Cologne, Germany with a population of more than 1,050,000 people. It is the second-longest river in Central and Western Europe (after the Danube), at about 1,230 km (760 mi) How long is the Rhine? > More than 1,050,000 > 1230km How long is the Rhine?

  • flips 3% of examples

Rule: ? → ??

slide-43
SLIDE 43

SEARS: Adversarial Rules

Rules are global and actionable, more interesting than individual adversaries

slide-44
SLIDE 44

SEARS Examples: VisualQA

Visual7W-Telling [Zhu et al 2016]

slide-45
SLIDE 45

SEARS Examples: SQuAD

BiDAF [Seo et al 2017]

slide-46
SLIDE 46

VQA User Study: Detecting adversaries

[Figure: bar chart comparing Human, SEA, and Human + SEA. Values as extracted: 33.6, 36, 45.]

SEAs find adversaries as often as humans!
SEAs + Humans better than humans!

slide-47
SLIDE 47

VQA User study: Can experts find bugs?

[Figure: two bar charts on Visual QA. % predictions flipped: Experts 3, SEARs 14.2. Time (minutes): Finding Rules 16.9, Evaluating SEARs 10.1.]

SEARs are much better than expert-produced rules
Evaluating is much easier than finding them

Closing the loop brings it down to 1.4%
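One plausible reading of "closing the loop" is data augmentation: apply the accepted rules to the training questions, keep the original answers, and retrain on the enlarged set. The sketch below covers only the augmentation step, under that assumption; retraining is model-specific and omitted. `augment_with_rules` and the example rule are hypothetical.

```python
import re
from typing import List, Tuple


def augment_with_rules(
    examples: List[Tuple[str, str]],
    rules: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Add rule-rewritten copies of each (question, answer) pair.

    The answer is kept unchanged, on the assumption that the rules
    preserve meaning.
    """
    augmented = list(examples)
    seen = {question for question, _ in examples}
    for question, answer in examples:
        for pattern, replacement in rules:
            rewritten = re.sub(pattern, replacement, question)
            if rewritten not in seen:  # skip unchanged or duplicate questions
                seen.add(rewritten)
                augmented.append((rewritten, answer))
    return augmented
```

For instance, with the single rule `(r"^What\b", "Which")`, the pair ("What color is the tray?", "pink") gains a copy ("Which color is the tray?", "pink"), while questions the rule does not match are left as they are.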

slide-48
SLIDE 48

Talk Overview

LIME: Linear Explanations
Anchors: Sufficient Conditions
SEARS: Detecting Oversensitivity
Explaining Predictions

slide-49
SLIDE 49

Why such tools can be useful

  • Annotations and Task Definitions
      • SQuAD 2.0: unanswerable questions
      • VisualQA 2.0: questions with different answers
  • Evaluation
      • Create a robust test set
      • Include explanations/bugs as qualitative evaluation
  • End to End QA may not be sufficient
      • Salesforce’s NLP Decathlon
      • ELMo Representation: learn across domains, and fine-tune!
slide-50
SLIDE 50

Thanks!

sameer@uci.edu
sameersingh.org

Work with Marco T. Ribeiro and Carlos Guestrin, University of Washington

Questioning Question Answering Answers

@sameer_

Work with Matt Gardner and me as part of The Allen Institute for Artificial Intelligence in Irvine, CA
All levels: pre-docs, PhD interns, postdocs, and research scientists!