Errudite: Scalable, Reproducible, and Testable Error Analysis

Errudite: Scalable, Reproducible, and Testable Error Analysis. Tongshuang (Sherry) Wu (@tongshuangwu), University of Washington; Marco Tulio Ribeiro, Microsoft Research; Jeffrey Heer (@jeffrey_heer) and Daniel S. Weld (@dsweld), University of Washington.


  1. Errudite: Scalable, Reproducible, and Testable Error Analysis. Tongshuang (Sherry) Wu (@tongshuangwu), University of Washington; Marco Tulio Ribeiro, Microsoft Research; Jeffrey Heer (@jeffrey_heer) and Daniel S. Weld (@dsweld), University of Washington.

  2. Motivation & Contributions

  3. Error analysis is important for: uncovering bugs; improving the state of the art; safeguarding deployments.

  4. Where We Are. Fader et al., ACL’13: “We performed an error analysis on a sample of 100 questions.” Chen et al., ACL’16: “We randomly select 50 incorrect questions and categorize them into 6 classes.” Wadhwa et al., ACL’18: “We sample 100 incorrect predictions and try to find common error categories.”

  5. Where We Are. The same three quotes (Fader et al., ACL’13; Chen et al., ACL’16; Wadhwa et al., ACL’18), summarized as the common practice: “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”

  6. Where We Are. “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.” Biased conclusion due to: subjectively defined hypotheses + small samples + focus exclusively on errors + no test on true cause.

  7. Where We Are & Our Contribution. “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.” Biased conclusion due to… → Principles & Errudite: Subjectively defined hypotheses → Precise & reproducible hypotheses; Small samples → Scale up to the entire dev set; Focus exclusively on errors → Cover errors & correct instances; No test on true cause → Test via counterfactual analysis.

  8. [Figure: screenshot with panels labeled A through F]

  9. [Figure: screenshot with panels labeled A through F] Video demo: https://tinyurl.com/errudite-video

  10. Core Design: Precise & Reproducible Domain Specific Language

  11. Precise DSL (Domain Specific Language). DSL = Target + Attribute Extractor + Operators, e.g. length(q) > 20. [Figure: instances A through F; Extract an instance attribute, then Filter into instance groups]

  12. Biased conclusion due to… vs. Errudite (comparison repeated from slide 7). Highlighted problem: subjectively defined hypotheses are too ambiguous to reproduce.

  13. User study: What are “imprecise answer boundaries”? “The model is making predictions with missing or additional words…?” D1, no exact match, but high overlap: exact_match(p(m)) == 0 and f1(p(m)) > 0.7. D2, off by at most 2 tokens both on the left and right: exact_match(p(m)) == 0 and abs(answer_offset(p(m),"left")) <= 2 and abs(answer_offset(p(m),"right")) <= 2.

  14. User study: What are “imprecise answer boundaries”? (Definitions D1 and D2 repeated from slide 13.)

  15. User study: What are “imprecise answer boundaries”? D1: no exact match, but high overlap. D2: off by at most 2 tokens both on the left and right. [Figure: D1 vs. D2 on example spans, with groundtruth and prediction marked: “…the polynomial time hierarchy collapses.” and “…believed that the polynomial hierarchy does..”]

  16. Biased conclusion due to… vs. Errudite (comparison repeated from slide 7). Highlighted solution: precise & reproducible hypotheses; quantify instances with a domain specific language.

  17. Design & Use Scenario. Examine the distractor hypothesis on BiDAF (Seo et al., 2016), with SQuAD (10570 instances; Rajpurkar et al., 2016). Independently tested by 4 (out of 10) participants in the user study.

  18. Scenario: distractor hypothesis. Question: Who created the 2005 theme for Doctor Who? Context: “… John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured sampled from the 1963 original.” Common belief: BiDAF… Matches entity types: knows to find a PERSON. Finds the exact answer spans: distracted by other PERSON spans.

  19. Biased conclusion due to… vs. Errudite (comparison repeated from slide 7). Highlighted problem: small samples; 100 << 2000+ errors in total.

  20. Biased conclusion due to… vs. Errudite (comparison repeated from slide 7). Highlighted solution: scale up to the entire dev set.

  21. Build distractor groups with DSL: ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0

  22. Build distractor groups with DSL. is_entity: ENT(g) != "". “The groundtruth (g) is an entity (ENT).” Example: ENT(Murray Gold) == PERSON.

  23. Build distractor groups with DSL. has_distractor: and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))). “There are more tokens matching the groundtruth entity type (ENT(g)) in the whole context (c) than in the groundtruth (g).” Example: count(PERSON: Murray Gold, John Debney, Ron Grainer) == 3 in the context vs. count(PERSON: Murray Gold) == 1 in the groundtruth.

  24. Build distractor groups with DSL. correct_type: and ENT(g) == ENT(p(m)). “The model (m) prediction (p) entity type matches the groundtruth (g) entity type.” Example: ENT(John Debney) == PERSON.

  25. Build distractor groups with DSL. is_distracted: and f1(m) == 0. “The model prediction is incorrect.”

  26. Build distractor groups with DSL. is_entity: ENT(g) != ""; has_distractor: and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))); correct_type: and ENT(g) == ENT(p(m)); is_distracted: and f1(m) == 0. [Figure: correct vs. incorrect instances per group] 5.7% of all BiDAF errors: the distractor hypothesis seems correct!

  27. Biased conclusion due to… vs. Errudite (comparison repeated from slide 7). Highlighted problem: focusing exclusively on errors wrongly prioritizes groups that are well-handled on average.

  28. Biased conclusion due to… vs. Errudite (comparison repeated from slide 7). Highlighted solution: cover errors & correct instances.

  29. Build distractor groups with DSL. Groups: all_instance; is_entity: ENT(g) != ""; has_distractor: and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))); correct_type: and ENT(g) == ENT(p(m)); is_distracted: and f1(m) == 0. [Figure: correct vs. incorrect instances per group] 88% EM > 68% EM: BiDAF performs better when there are distractors and the entity type is matched than it does overall. Reject / revise the hypothesis!
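
Note: the distractor analysis in slides 21 to 29 can be mimicked in plain Python. The sketch below is not Errudite's DSL or API; the `Instance` class, the hand-written `entities` dictionary (standing in for an NER tagger), and the two made-up instances after the Doctor Who example are all assumptions for illustration. It counts entity spans rather than tokens, which is a simplification of count(token(...)).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Instance:
    """Toy QA instance; `entities` is a hand-written stand-in for an NER tagger."""
    question: str
    entities: dict    # entity span found in the context -> entity type
    groundtruth: str  # g
    prediction: str   # p(m)


def ent(inst, span):
    """ENT(span): the entity type of a span, or "" if it is not a known entity."""
    return inst.entities.get(span, "")


def f1(inst):
    """Token-level F1 between the prediction and the groundtruth."""
    p = inst.prediction.lower().split()
    g = inst.groundtruth.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def exact_match(inst):
    return float(inst.prediction.lower().split() == inst.groundtruth.lower().split())


# The four nested groups from the slides, each adding one condition.
def is_entity(inst):
    return ent(inst, inst.groundtruth) != ""


def has_distractor(inst):
    target = ent(inst, inst.groundtruth)
    # Simplification: count entity spans of the groundtruth type, not tokens.
    in_context = sum(1 for t in inst.entities.values() if t == target)
    in_groundtruth = 1 if target else 0
    return is_entity(inst) and in_context > in_groundtruth


def correct_type(inst):
    return has_distractor(inst) and ent(inst, inst.groundtruth) == ent(inst, inst.prediction)


def is_distracted(inst):
    return correct_type(inst) and f1(inst) == 0


def group_em(instances, predicate=lambda inst: True):
    """Mean exact match over the instances in a group (None if the group is empty)."""
    group = [inst for inst in instances if predicate(inst)]
    if not group:
        return None
    return sum(exact_match(inst) for inst in group) / len(group)


toy = [
    # From slide 18; the wrong prediction is the one used as the slide's example.
    Instance("Who created the 2005 theme for Doctor Who?",
             {"Murray Gold": "PERSON", "John Debney": "PERSON", "Ron Grainer": "PERSON"},
             "Murray Gold", "John Debney"),
    # Made-up instances, for illustration only.
    Instance("Who wrote the report?", {"Alice": "PERSON", "Bob": "PERSON"}, "Alice", "Alice"),
    Instance("Which city hosted the event?", {"Paris": "GPE"}, "Paris", "Paris"),
]

print("overall EM:", group_em(toy))
print("correct_type EM:", group_em(toy, correct_type))
print("is_distracted size:", sum(is_distracted(inst) for inst in toy))
```

Comparing `group_em(toy, correct_type)` with `group_em(toy)` mirrors the check on slide 29: a group is only a weak spot if its accuracy is lower than the overall accuracy, not merely because it contains some errors.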
