Errudite: Scalable, Reproducible, and Testable Error Analysis




slide-1
SLIDE 1

1

Errudite: Scalable, Reproducible, and Testable Error Analysis

Tongshuang (Sherry) Wu @tongshuangwu University of Washington Marco Tulio Ribeiro Microsoft Research Jeffrey Heer @jeffrey_heer Daniel S. Weld @dsweld University of Washington

slide-2
SLIDE 2

2

Motivation & Contributions

slide-3
SLIDE 3
  • 3

Error analysis is important for…

Uncovering bugs

Improving the state of the art

Safeguarding deployments

slide-4
SLIDE 4
  • 4

Where We Are

We performed an error analysis on a sample of 100 questions

Fader et al. ACL’13

We sample 100 incorrect predictions and try to find common error categories.

Chen et al. ACL’16

We randomly select 50 incorrect questions and categorize them into 6 classes.

Wadhwa et al. ACL’18

slide-5
SLIDE 5
  • 5

Where We Are

We performed an error analysis on a sample of 100 questions

Fader et al. ACL’13

We sample 100 incorrect predictions and try to find common error categories.

Chen et al. ACL’16

We randomly select 50 incorrect questions and categorize them into 6 classes.

Wadhwa et al. ACL’18

“We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”

slide-6
SLIDE 6
  • 6

Where We Are

“We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”

Biased conclusion due to… Subjectively defined hypotheses + Small samples + Focus exclusively on errors + No test on true cause

slide-7
SLIDE 7
  • 7

Where We Are & Our Contribution

“We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”

Biased conclusion due to… Subjectively defined hypotheses + Small samples + Focus exclusively on errors + No test on true cause

Principles & Errudite: Precise & reproducible hypotheses + Scale up to the entire dev set + Cover errors & correct instances + Test via counterfactual analysis

slide-8
SLIDE 8

[Figure: Errudite interface, panels A to F]

  • 8
slide-9
SLIDE 9

[Figure: Errudite interface, panels A to F]

  • 9

Video demo: https://tinyurl.com/errudite-video

slide-10
SLIDE 10

10

Core Design

Precise & Reproducible Domain Specific Language

slide-11
SLIDE 11
  • 11

Precise DSL (Domain Specific Language)

[Figure: Errudite interface, panels A to F]

DSL = Attribute Extractor + Target + Operators

B: Extract Instance Attribute

E: Instance Groups Filter: length(q) > 20
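As a rough illustration of how an attribute extractor, a target, and an operator combine into a filter such as length(q) > 20, here is a minimal Python sketch. It is not Errudite's implementation: the Instance class, the whitespace tokenization inside length, the filter_group helper, and the second (made-up) question are all assumptions for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    """Hypothetical container for one QA instance (not Errudite's data model)."""
    question: str
    context: str
    groundtruth: str
    prediction: str


def length(text: str) -> int:
    """Attribute extractor: token count, assuming whitespace tokenization."""
    return len(text.split())


def filter_group(instances: List[Instance],
                 predicate: Callable[[Instance], bool]) -> List[Instance]:
    """Keep the instances for which the predicate holds."""
    return [inst for inst in instances if predicate(inst)]


instances = [
    Instance("Who created the 2005 theme for Doctor Who?",
             "...", "Murray Gold", "John Debney"),
    Instance("In which year did the composer who wrote the original theme "
             "for the long running television series first begin working "
             "for the broadcaster?",
             "...", "unknown", "unknown"),  # made-up long question
]

# length(q) > 20: extractor = length, target = question, operator = ">"
long_questions = filter_group(instances, lambda inst: length(inst.question) > 20)
print([length(inst.question) for inst in instances], len(long_questions))
```

Because the group is defined by a predicate rather than by hand labels, anyone can re-run the same filter on the whole dataset and obtain the same group.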

slide-12
SLIDE 12

Biased conclusion due to… Subjectively defined hypotheses + Small samples + Focus exclusively on errors + No test on true cause

Errudite: Precise & reproducible hypotheses + Scale up to the entire dev set + Cover errors & correct instances + Test via counterfactual analysis

  • 12

“We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”

Too ambiguous to reproduce

Biased conclusion due to… Subjectively defined hypotheses

slide-13
SLIDE 13
  • 13

User study: What is imprecise answer boundaries?

D1: Off by at most 2 tokens both on the left and right

exact_match(p(m)) == 0 and abs(answer_offset(p(m),"left")) <= 2 and abs(answer_offset(p(m),"right")) <= 2

D2: No exact match, but high overlap

exact_match(p(m)) == 0 and f1(p(m)) > 0.7

“The model is making predictions with missing or additional words…?”
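To make the gap between the two user definitions concrete, the sketch below evaluates an offset-based and an overlap-based definition on one toy prediction. This is an illustration under assumptions, not Errudite code: spans are (start, end) token indices with an exclusive end, answer_offset is assumed to be the signed difference between predicted and groundtruth boundaries, and the sentence and spans are invented.

```python
from collections import Counter


def exact_match(pred, gold):
    """1 if the token lists are identical, else 0."""
    return int(pred == gold)


def f1(pred, gold):
    """Token-level F1 between two token lists."""
    shared = sum((Counter(pred) & Counter(gold)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(pred)
    recall = shared / len(gold)
    return 2 * precision * recall / (precision + recall)


def answer_offset(pred_span, gold_span, side):
    """Signed token offset of a predicted boundary (assumed semantics)."""
    index = 0 if side == "left" else 1
    return pred_span[index] - gold_span[index]


def offset_based(context, pred_span, gold_span):
    """No exact match, and off by at most 2 tokens on both sides."""
    pred, gold = context[slice(*pred_span)], context[slice(*gold_span)]
    return (exact_match(pred, gold) == 0
            and abs(answer_offset(pred_span, gold_span, "left")) <= 2
            and abs(answer_offset(pred_span, gold_span, "right")) <= 2)


def overlap_based(context, pred_span, gold_span):
    """No exact match, but F1 above 0.7."""
    pred, gold = context[slice(*pred_span)], context[slice(*gold_span)]
    return exact_match(pred, gold) == 0 and f1(pred, gold) > 0.7


context = "the quick brown fox jumps over the lazy dog".split()  # toy context
gold_span = (3, 4)  # ["fox"]
pred_span = (2, 5)  # ["brown", "fox", "jumps"]
print(offset_based(context, pred_span, gold_span),
      overlap_based(context, pred_span, gold_span))
```

With a one-token groundtruth, a prediction that adds one token on each side is within the offset limit but has an F1 of only 0.5, so the two definitions disagree on this instance.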

slide-14
SLIDE 14
  • 14

User study: What is imprecise answer boundaries?

D1: Off by at most 2 tokens both on the left and right

exact_match(p(m)) == 0 and abs(answer_offset(p(m),"left")) <= 2 and abs(answer_offset(p(m),"right")) <= 2

D2: No exact match, but high overlap

exact_match(p(m)) == 0 and f1(p(m)) > 0.7

“The model is making predictions with missing or additional words…?”

slide-15
SLIDE 15

…the polynomial time hierarchy collapses. …believed that the polynomial hierarchy does..

prediction groundtruth

  • 15

User study: What is imprecise answer boundaries?

Off by at most 2 tokens both on the left and right

D2 D1 No exact match, but high overlap

D1 D2

slide-16
SLIDE 16

Biased conclusion due to… Subjectively defined hypotheses + Small samples + Focus exclusively on errors + No test on true cause

Errudite: Precise & reproducible hypotheses + Scale up to the entire dev set + Cover errors & correct instances + Test via counterfactual analysis

Errudite: Precise & reproducible hypotheses

  • 16

“We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”

Quantify instances with a domain specific language

Biased conclusion due to… Subjectively defined hypotheses

slide-17
SLIDE 17

17

Design & Use Scenario

Examine the distractor hypothesis

On BiDAF (Seo et al., 2016), with SQuAD (10570 instances; Rajpurkar et al., 2016)

Independently tested by 4 (out of 10) participants in the user study

slide-18
SLIDE 18
  • 18

Scenario: distractor hypothesis

…John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured sampled from the 1963 original.

Who created the 2005 theme for Doctor Who?

Common belief: BiDAF…

Matches entity types Knows to find a PERSON Finds the exact answer spans Distracted by other PERSON spans

slide-19
SLIDE 19

Biased conclusion due to… Subjectively defined hypotheses + Small samples + Focus exclusively on errors + No test on true cause

Errudite: Precise & reproducible hypotheses + Scale up to the entire dev set + Cover errors & correct instances + Test via counterfactual analysis

  • 19

“We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”

Biased conclusion due to… Small samples

100 << 2000+ errors in total

slide-20
SLIDE 20
  • 20

“We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”

Biased conclusion due to… Subjectively defined hypotheses + Small samples + Focus exclusively on errors + No test on true cause

Errudite: Precise & reproducible hypotheses + Scale up to the entire dev set + Cover errors & correct instances + Test via counterfactual analysis

Biased conclusion due to… Small samples

Errudite: Scale up to the entire dev set

slide-21
SLIDE 21

C D

  • 21

Build distractor groups with DSL

ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0

1 2 3 4 5

C

slide-22
SLIDE 22
  • 22

Build distractor groups with DSL

ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0

1 2 3 4 5

is_entity

“The groundtruth is an ENTity.”

ENT(Murray Gold) == PERSON

slide-23
SLIDE 23
  • 23

Build distractor groups with DSL

ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0

1 2 3 4 5

is_entity has_distractor

“There are more tokens matching the ground truth entity type (ENT(g)) in the whole context than in the groundtruth.”

count(PERSON : Murray Gold, John Debney, Ron Grainer) == 3

count(PERSON : Murray Gold) == 1

slide-24
SLIDE 24
  • 24

Build distractor groups with DSL

ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0

1 2 3 4 5

is_entity has_distractor correct_type

“The model prediction ENTity type matches the groundtruth ENTity type.”

ENT(John Debney) == PERSON

slide-25
SLIDE 25
  • 25

Build distractor groups with DSL

ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0

1 2 3 4 5

is_entity has_distractor correct_type is_distracted

“The model prediction is incorrect.”

slide-26
SLIDE 26
  • 26

Build distractor groups with DSL

ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0

1 2 3 4 5

is_entity has_distractor correct_type is_distracted

Correct Incorrect

5.7% of all BiDAF errors: The distractor hypothesis seems correct!
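The nested group construction can be sketched in Python as a chain of filters. This is only an approximation of the DSL expression: the ENTITY_TYPES lookup stands in for a real named-entity recognizer, f1 == 0 is approximated by "no shared tokens", and the three instances (built on an abridged version of the Doctor Who passage) are invented for the example.

```python
# Toy stand-in for a named-entity recognizer (assumption for this sketch).
ENTITY_TYPES = {
    "John Debney": "PERSON",
    "Ron Grainer": "PERSON",
    "Murray Gold": "PERSON",
}


def ent(text):
    """Entity type of an answer string, or "" if it is not a known entity."""
    return ENTITY_TYPES.get(text, "")


def count_ent(text, ent_type):
    """Number of known entities of the given type mentioned in text."""
    return sum(1 for name, kind in ENTITY_TYPES.items()
               if kind == ent_type and name in text)


def f1_is_zero(pred, gold):
    """True when prediction and groundtruth share no token."""
    return not set(pred.split()) & set(gold.split())


CONTEXT = ("John Debney created a new arrangement of Ron Grainer's original "
           "theme for Doctor Who in 1996. For the return of the series in "
           "2005, Murray Gold provided a new arrangement.")

# c = context, g = groundtruth, p = model prediction (toy values)
instances = [
    {"c": CONTEXT, "g": "Murray Gold", "p": "John Debney"},
    {"c": CONTEXT, "g": "Murray Gold", "p": "Murray Gold"},
    {"c": CONTEXT, "g": "2005", "p": "1996"},
]

is_entity = [i for i in instances if ent(i["g"]) != ""]
has_distractor = [i for i in is_entity
                  if count_ent(i["c"], ent(i["g"])) > count_ent(i["g"], ent(i["g"]))]
correct_type = [i for i in has_distractor if ent(i["g"]) == ent(i["p"])]
is_distracted = [i for i in correct_type if f1_is_zero(i["p"], i["g"])]

print(len(is_entity), len(has_distractor), len(correct_type), len(is_distracted))
```

Each list is a subset of the previous one, so the group sizes can be compared directly, and only the last filter looks at whether the prediction was wrong.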

slide-27
SLIDE 27
  • 27

“We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”

Biased conclusion due to… Subjectively defined hypotheses + Small samples + Focus exclusively on errors + No test on true cause

Errudite: Precise & reproducible hypotheses + Scale up to the entire dev set + Cover errors & correct instances + Test via counterfactual analysis

Biased conclusion due to… Focus exclusively on errors

Wrongly prioritize groups that are well handled on average.

slide-28
SLIDE 28

Biased conclusion due to… Subjectively defined hypotheses + Small samples + Focus exclusively on errors + No test on true cause

Errudite: Precise & reproducible hypotheses + Scale up to the entire dev set + Cover errors & correct instances + Test via counterfactual analysis

  • 28

“We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”

Wrongly prioritize groups that are well handled on average.

Biased conclusion due to… Focus exclusively on errors

Errudite: Cover errors & correct instances

slide-29
SLIDE 29
  • 29

Build distractor groups with DSL

ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0

1 2 3 4 5

is_entity has_distractor correct_type is_distracted all_instance

Correct Incorrect

88% EM > 68% EM: BiDAF performs better on instances that have distractors and a matched entity type than it does overall. Reject / revise the hypothesis!
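The comparison behind this slide, a group's share of all errors versus the group's own exact-match rate, can be sketched as follows. The counts are made up to keep the arithmetic easy to follow; they are not the BiDAF numbers, and the field names in_group and correct are hypothetical.

```python
def em_rate(instances):
    """Fraction of instances answered with an exact match."""
    return sum(inst["correct"] for inst in instances) / len(instances)


def error_share(group, everything):
    """Fraction of all errors that fall inside the group."""
    group_errors = sum(not inst["correct"] for inst in group)
    total_errors = sum(not inst["correct"] for inst in everything)
    return group_errors / total_errors


# Toy data: 10 instances, 5 of them in the group under study.
all_instances = (
    [{"in_group": True, "correct": True}] * 4
    + [{"in_group": True, "correct": False}] * 1
    + [{"in_group": False, "correct": True}] * 2
    + [{"in_group": False, "correct": False}] * 3
)
group = [inst for inst in all_instances if inst["in_group"]]

print("share of all errors in group:", error_share(group, all_instances))
print("group EM:", em_rate(group), "overall EM:", em_rate(all_instances))
```

Looking only at errors shows that the group contains a quarter of them; adding the correct instances shows that the group's exact-match rate (0.8) is above the overall rate (0.6), which is the pattern that leads to rejecting or revising a hypothesis.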

slide-30
SLIDE 30
  • 30

“We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”

Biased conclusion due to… Subjectively defined hypotheses + Small samples + Focus exclusively on errors + No test on true cause

Errudite: Precise & reproducible hypotheses + Scale up to the entire dev set + Cover errors & correct instances + Test via counterfactual analysis

Biased conclusion due to… Small samples + Focus exclusively on errors

Errudite: Scale up to the entire dev set + Cover errors & correct instances

slide-31
SLIDE 31
  • 31

…John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured sampled from the 1963 original.

Who created the 2005 theme for Doctor Who?

is_distracted

Distractor entity? Multi-sentence reasoning?

HAS distractor prediction != IS WRONG due to distractor prediction

slide-32
SLIDE 32

Biased conclusion due to… Subjectively defined hypotheses + Small samples + Focus exclusively on errors + No test on true cause

Errudite: Precise & reproducible hypotheses + Scale up to the entire dev set + Cover errors & correct instances + Test via counterfactual analysis

  • 32

“We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”

HAS distractor prediction != IS WRONG due to distractor prediction

Biased conclusion due to… Subjectively defined hypotheses + Small samples + Focus exclusively on errors + No test on true cause

Errudite: Precise & reproducible hypotheses + Scale up to the entire dev set + Cover errors & correct instances + Test via counterfactual analysis

Biased conclusion due to… No test on true cause

slide-33
SLIDE 33

A B C D E F

  • 33

Are the 192 instances really wrong because of the distractor? Would BiDAF work perfectly if we remove the distractors? Answer what-if questions with counterfactual analysis!

F

slide-34
SLIDE 34
  • 34

Counterfactual Analysis with Rewrite Rules

rewrite(target, from → to)

Rewrite the target part of an instance by replacing from with to

slide-35
SLIDE 35
  • 35

Counterfactual Analysis with Rewrite Rules

Would BiDAF work perfectly if we remove the distractors?

rewrite(target, from → to)

Rewrite the target part of an instance by replacing from with to

slide-36
SLIDE 36
  • 36

Counterfactual Analysis with Rewrite Rules

Would BiDAF work perfectly if we remove the distractors?

rewrite(c, from → to)

Rewrite the context part of an instance by replacing from with to

slide-37
SLIDE 37
  • 37

Counterfactual Analysis with Rewrite Rules

Would BiDAF work perfectly if we remove the distractors?

rewrite(c, string(p(m)) → to)

Rewrite the context part of an instance by replacing the model-predicted distractor string with to

slide-38
SLIDE 38
  • 38

Counterfactual Analysis with Rewrite Rules

Would BiDAF work perfectly if we remove the distractors?

rewrite(c, string(p(m)) → "#")

Rewrite the context part of an instance by replacing the model-predicted distractor string with a placeholder token “#”

slide-39
SLIDE 39
  • 39

Counterfactual Analysis with Rewrite Rules

Would BiDAF work perfectly if we remove the distractors?

rewrite(c, string(p(m)) → "#")

Q: Who created the 2005 theme for Doctor Who?

C: …[John Debney → #] created a new arrangement of Ron Grainer’s … Murray Gold provided a new arrangement…

Incorrect → Incorrect
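A rewrite rule of this kind amounts to a string substitution on one field of an instance. The sketch below is an assumption-laden illustration rather than Errudite's implementation: the instance is a plain dict with hypothetical keys q, c, g, and p, the context is an abridged version of the passage, and every occurrence of the predicted string is replaced.

```python
def rewrite(instance, target, from_, to):
    """Return a copy of the instance with from_ replaced by to in one field."""
    rewritten = dict(instance)
    rewritten[target] = instance[target].replace(from_, to)
    return rewritten


instance = {
    "q": "Who created the 2005 theme for Doctor Who?",
    "c": ("John Debney created a new arrangement of Ron Grainer's original "
          "theme for Doctor Who in 1996. For the return of the series in "
          "2005, Murray Gold provided a new arrangement."),
    "g": "Murray Gold",   # groundtruth
    "p": "John Debney",   # original (incorrect) model prediction
}

# rewrite(c, string(p(m)) -> "#"): mask the predicted distractor in the context.
counterfactual = rewrite(instance, "c", instance["p"], "#")
print(counterfactual["c"])
```

The rewritten instance would then be passed back to the model; comparing the new prediction with the original one is what turns the "what if" question into a measurable test.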

slide-40
SLIDE 40

Another distractor is still confusing the model!

  • 40

Counterfactual Analysis with Rewrite Rules

Would BiDAF work perfectly if we remove the distractors?

rewrite(c, string(p(m)) → "#")

Incorrect → Incorrect

slide-41
SLIDE 41
  • 41

Counterfactual Analysis with Rewrite Rules

p(m) for the 192 rewritten is_distracted instances are…

rewrite(c, string(p(m)) → "#")

Incorrect → Incorrect: Another distractor is still confusing the model!

slide-42
SLIDE 42
  • 42

Counterfactual Analysis with Rewrite Rules

p(m) for the 192 rewritten is_distracted instances are…

rewrite(c, string(p(m)) → "#")

29%, Incorrect → Incorrect: Another distractor is still confusing the model!

48%, Incorrect → Correct: The distractor was fooling the model!

23%, Unchanged: Other factors are at play! Example: age of 18, 10.5% # from 18 to 24…
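A breakdown like the one on this slide could be tallied by re-running the model on each rewritten instance and bucketing the new prediction. The sketch below assumes a hypothetical toy_model (it simply returns the first listed person found in the context) in place of BiDAF, and the two instances are invented; only the bucketing logic is the point.

```python
from collections import Counter

PEOPLE = ["John Debney", "Ron Grainer", "Murray Gold"]  # toy entity list


def toy_model(context, question):
    """Hypothetical stand-in for a QA model: earliest listed person in context."""
    found = [(context.find(name), name) for name in PEOPLE if name in context]
    return min(found)[1] if found else ""


def outcome(instance, model):
    """Mask the original prediction in the context and bucket the new one."""
    rewritten = instance["c"].replace(instance["p"], "#")
    new_pred = model(rewritten, instance["q"])
    if new_pred == instance["g"]:
        return "incorrect -> correct"
    if new_pred == instance["p"]:
        return "unchanged"
    return "incorrect -> incorrect"


QUESTION = "Who created the 2005 theme for Doctor Who?"
is_distracted = [
    {"q": QUESTION, "g": "Murray Gold", "p": "John Debney",
     "c": ("John Debney created a new arrangement of Ron Grainer's original "
           "theme in 1996. In 2005, Murray Gold provided a new arrangement.")},
    {"q": QUESTION, "g": "Murray Gold", "p": "John Debney",
     "c": ("John Debney arranged the theme in 1996. "
           "Murray Gold provided a new arrangement in 2005.")},
]

tally = Counter(outcome(inst, toy_model) for inst in is_distracted)
for label, count in tally.items():
    print(label, count / len(is_distracted))
```

In the first toy instance a second distractor takes over after masking, while in the second the model recovers the groundtruth, mirroring two of the three outcomes reported for the rewritten group.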

slide-43
SLIDE 43
  • 43

“We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”

Biased conclusion due to… Subjectively defined hypotheses + Small samples + Focus exclusively on errors + No test on true cause

Errudite: Precise & reproducible hypotheses + Scale up to the entire dev set + Cover errors & correct instances + Test via counterfactual analysis

Biased conclusion due to… No test on true cause

Errudite: Test via counterfactual analysis

slide-44
SLIDE 44
  • 44

Deliveries: Precise + Reproducible + Re-applicable

Groups: is_entity, has_distractor, correct_type, is_distracted, all_instance

Attribute: ENT(g)

Rewrite rule: rewrite(c, string(p(m)) → "#")

slide-45
SLIDE 45
  • 45

Deliveries: Precise + Reproducible + Re-applicable

Groups + Attribute + Rewrite rule, applied to…

BiDAF is … not particularly bad at distractors. Seeming distractor errors can be due to other factors.

slide-46
SLIDE 46
  • 46

Deliveries: Precise + Reproducible + Re-applicable

Groups + Attribute + Rewrite rule, re-applied to…

Other datasets & other models… ? at handling distractors.

slide-47
SLIDE 47

47

User Study

10 participants = NLP graduate students + QA engineers from industry

Examine BiDAF (Seo et al., 2016) on SQuAD (Rajpurkar et al., 2016)

One-hour session: Replicate prior error analysis + Freely explore the model

slide-48
SLIDE 48
  • 48

User Feedback: Did they like Errudite?

Enhanced their error analysis experience.

Systematically scaled up the analysis.

Precise and inspiring more confidence.

Much faster exploration.

slide-49
SLIDE 49

49

Biased conclusion due to… Subjectively defined hypotheses + Small samples + Focus exclusively on errors + No test on true cause

Principles & Errudite: Precise & reproducible hypotheses + Scale up to the entire dev set + Cover errors & correct instances + Test via counterfactual analysis

slide-50
SLIDE 50

50

Errudite improves error analysis… Precise, Reproducible, Scalable, Testable

Better error analysis will then… Uncover bugs, Improve the state of the art, Safeguard deployments

slide-51
SLIDE 51

51

A B C D E F

slide-52
SLIDE 52

A B C D E F

D

52

slide-53
SLIDE 53

A B C D E F

(b)

A

53

slide-54
SLIDE 54
  • 54

Future Work

More sophisticated suggestions

Collaborative error analysis

Model comparison

Extend to more domains

slide-55
SLIDE 55

55

Thank you!

https://tinyurl.com/errudite-video Come talk to me if you want a (small) live demo!!

slide-56
SLIDE 56

56

Backup

slide-57
SLIDE 57

  • 57

Qualitative Description: The model is bad on long questions

Refer to: questions with more than N tokens?

slide-58
SLIDE 58
  • 58

Qualitative Description: The model is bad on long questions

Refer to: questions with more than 20 tokens

slide-59
SLIDE 59
  • 59

Precise DSL (Domain Specific Language)

Qualitative Description: The model is bad on long questions (questions with more than 20 tokens)

Translate with DSL

Quantitative Description: length(q) > 20

slide-60
SLIDE 60
  • 60

Precise DSL (Domain Specific Language)

Qualitative Description: The model is bad on long questions (questions with more than 20 tokens)

Quantitative Description: length(q) > 20

Attribute Extractor: length, question_type, answer_type

Target: question, context, groundtruth, prediction (model), token, sentence

Operators: >, !=, in, has_any

slide-61
SLIDE 61
  • 61

DSL (Domain Specific Language) Attribute Extractor

Basic attributes: length

General-purpose linguistic features: LEMMA, POS, ENT

Standard prediction performance metrics: f1, accuracy

Between-target relations: overlap(t1, t2)

Domain-specific attributes: answer_type, question_type

slide-62
SLIDE 62
  • 62

DSL (Domain Specific Language) Attribute Extractor

…John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured sampled from the 1963 original.

Who created the 2005 theme for Doctor Who?

slide-63
SLIDE 63
  • 63

Suggestions via programming-by-demonstration

…John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured sampled from the 1963 original.

Who created the 2005 theme for Doctor Who?

starts_with(p(m), pattern="NNP")

starts_with(p(m), pattern="PERSON")

answer_type(g) == answer_type(p(m))

exact_match(m) == 0

is_correct_sent(m) == False

overlap(q, sentence(p(m))) > overlap(q, sentence(g))
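One of the suggested attributes compares how much the question overlaps with the sentence containing the prediction versus the sentence containing the groundtruth. The sketch below shows one way such a comparison might be computed. The definitions are assumptions: overlap is taken to be the number of shared lowercase word tokens, sentence_of splits on sentence-final punctuation, and the context is an abridged version of the passage; Errudite's own definitions may differ.

```python
import re


def tokens(text):
    """Lowercased word tokens as a set (assumed tokenization)."""
    return set(re.findall(r"\w+", text.lower()))


def overlap(t1, t2):
    """Number of shared tokens between two texts (assumed definition)."""
    return len(tokens(t1) & tokens(t2))


def sentence_of(context, answer):
    """First sentence of the context that contains the answer string."""
    for sent in re.split(r"(?<=[.!?])\s+", context):
        if answer in sent:
            return sent
    return ""


question = "Who created the 2005 theme for Doctor Who?"
context = ("John Debney created a new arrangement of Ron Grainer's original "
           "theme for Doctor Who in 1996. For the return of the series in "
           "2005, Murray Gold provided a new arrangement.")
prediction, groundtruth = "John Debney", "Murray Gold"

pred_overlap = overlap(question, sentence_of(context, prediction))
gold_overlap = overlap(question, sentence_of(context, groundtruth))
print(pred_overlap, gold_overlap, pred_overlap > gold_overlap)
```

On this example the question shares more words with the distractor's sentence than with the groundtruth's sentence, which is the kind of pattern this suggested filter is meant to surface.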
slide-64
SLIDE 64
  • 64

Suggestions via programming-by-demonstration

“Who” → “What person”: What person created the 2005 theme for Doctor Who?

slide-65
SLIDE 65
  • 65

Check attribute distribution ENT(g) in groups

is_entity is_distracted

Correct Incorrect

slide-66
SLIDE 66
  • 66

Check attribute distribution ENT(g) in groups

is_entity is_distracted

Correct Incorrect

slide-67
SLIDE 67
  • 67

Check attribute distribution ENT(g) in groups

is_entity is_distracted

Correct Incorrect

slide-68
SLIDE 68
  • 68

Check attribute distribution ENT(g) in groups

is_entity is_distracted

Correct Incorrect

slide-69
SLIDE 69

69

User Study

10 participants = NLP graduate students + QA engineers from industry

Examine BiDAF (Seo et al., 2016) on SQuAD (Rajpurkar et al., 2016)

One-hour session: Replicate prior error analysis + Freely explore the model

slide-70
SLIDE 70
  • 70

Study #1 Replication: Errudite flexible enough?

Read BiDAF error analysis: 50 errors, hand-labeled into 6 classes

Rate closeness: Recreated groups == originals?

semantic

Recreate 4 classes with Errudite on the entire dataset

slide-71
SLIDE 71
  • 71

Users were able to express their intended groups well with Errudite.

slide-72
SLIDE 72
  • 72

Study #1 Replication: Easy group ≠ reproducible!

How many errors are covered by user-built Imprecise Error Boundary? Groups with low inter-agreement! (13.8%, 45.8%)

How close does the approximation match the paper definition? Most confident, an easy group

[Charts: Closeness (1 to 5) by Group; Error Coverage (0% to 50%) by Group; the Boundary group is labeled]

slide-73
SLIDE 73
  • 73

Study #1 Replication: Easy group ≠ reproducible!

D1: Off by at most 2 tokens both on the left and right (Coverage = 22.1%)

exact_match(m) == 0 and abs(answer_offset(p(m),"left")) <= 2 and abs(answer_offset(p(m),"right")) <= 2

D2: No exact match, but high overlap (Coverage = 13.8%)

exact_match(m) == 0 and f1(m) > 0.7

slide-74
SLIDE 74
  • 74

Study #1 Replication: Easy group ≠ reproducible!

D1: Off by at most 2 tokens both on the left and right (Coverage = 22.1%)

exact_match(m) == 0 and abs(answer_offset(p(m),"left")) <= 2 and abs(answer_offset(p(m),"right")) <= 2

D2: No exact match, but high overlap (Coverage = 13.8%)

exact_match(m) == 0 and f1(m) > 0.7

slide-75
SLIDE 75
  • 75

Study #1 Replication: Easy group ≠ reproducible!

D1: Off by at most 2 tokens both on the left and right (Coverage = 22.1%)

D2: No exact match, but high overlap (Coverage = 13.8%)

[Diagram: instances covered by D1, by D2, or by both, with example answer spans]

…commercial, scientific, and cultural growth…

…from Karakorum in Mongolia to Khanbaliq…

…the polynomial time hierarchy collapses. …believed that the polynomial hierarchy does..

slide-76
SLIDE 76
  • 76

Users were able to express their intended groups well with Errudite.

Ambiguous manual labels prevent consistent replication, even when users think they have replicated a group!

slide-77
SLIDE 77
  • 77

“We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”

Biased conclusion due to… Subjectively defined group + Small samples + Focus exclusively on errors + No test on true cause

Errudite: Precise & reproducible grouping + Scale up to the entire dev set + Cover errors & correct instances + Test via counterfactual analysis

Biased conclusion due to… Subjectively defined group

Errudite: Precise & reproducible grouping

slide-78
SLIDE 78
  • 78

Study #2 Exploration: Errudite Useful Enough?

Freely explore BiDAF with Errudite, think aloud

Describe their observations / insights on BiDAF

Rate insights on importance, confidence, relative easiness

slide-79
SLIDE 79
  • 79

Study #2 Exploration: Errudite Useful Enough?

Users reported μ = 2.1, σ = 0.94 findings: Confirmed prior hypotheses, Extended previous knowledge, Rejected prior hypotheses

Users thought their insights are…

[Chart: Score (1 to 5) by Quality: Importance, Fidelity, Easiness]

Users learned more about the model (μ = 3.9, σ = 0.94).

slide-80
SLIDE 80
  • 80

User Feedback: Did they like Errudite?

Enhanced their error analysis experience.

Systematically scaled up the analysis.

Precise and inspiring more confidence.

Much faster exploration.