
SQuAD: 100,000+ Questions for Machine Comprehension of Text (PowerPoint PPT Presentation)



  1. SQuAD: 100,000+ Questions for Machine Comprehension of Text
     Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang. Published in EMNLP 2016.
     Presented by Jiaming Shen, April 17, 2018.

  2. SQuAD = Stanford Question Answering Dataset
     Online challenge: https://rajpurkar.github.io/SQuAD-explorer/

  3. Overall contribution
     • A benchmark dataset with:
       • Proper difficulty
       • A principled curation process
       • Detailed data analysis

  4. Outline
     • What are the QA datasets prior to SQuAD?
     • What does SQuAD look like?
     • How is SQuAD created?
     • What are the properties of SQuAD?
     • How well can we do on SQuAD?

  5. What are the QA datasets prior to SQuAD?

  6. Related Datasets
     • Type I: Complex reading comprehension datasets
     • Type II: Open-domain QA datasets
     • Type III: Cloze datasets

  7. Type I: Complex Reading Comprehension Datasets
     • Require commonsense knowledge, so they are very challenging.
     • Dataset sizes are too small.

  8. Type II: Open-domain QA Datasets
     • Open-domain QA: answer a question from a large collection of documents.
     • WikiQA: only sentence selection.
     • TREC-QA: free-form answers -> hard to evaluate.

  9. Type III: Cloze Datasets
     • Automatically generated -> large scale.
     • Their limitations are described in the ACL 2016 Best Paper.

  10. What does SQuAD look like?

  11. SQuAD Dataset Format
      [Figure: an example passage with one QA pair]

  12. SQuAD Dataset Format
      • One passage can have multiple question-answer pairs.
      • In total, 100,000+ QA pairs from 23,215 passages.
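For reference, a minimal sketch of the released SQuAD v1.1 JSON layout, shown here as a Python literal. The field names match the distributed files; the article title, passage text, QA id, and QA pair below are invented for illustration.

    squad_sketch = {
        "version": "1.1",
        "data": [
            {
                "title": "Example_Article",  # one Wikipedia article
                "paragraphs": [
                    {
                        "context": "The example passage text goes here ...",
                        "qas": [  # one passage, possibly many QA pairs
                            {
                                "id": "example-qa-0001",  # hypothetical id
                                "question": "What goes here?",
                                "answers": [
                                    {
                                        "text": "passage text",
                                        "answer_start": 12,  # character offset in context
                                    }
                                ],
                            }
                        ],
                    }
                ],
            }
        ],
    }

Dev/test questions carry multiple answers per question (see Step 3 below).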

  13. How is SQuAD created?

  14. SQuAD Dataset Collection
      • Consists of three steps:
        • Step 1: Passage curation
        • Step 2: Question-answer collection
        • Step 3: Additional answers collection

  15. Step 1: Passage Curation
      • Select the top 10,000 articles of English Wikipedia by Wikipedia's internal PageRank scores.
      • Randomly sample 536 articles out of those 10,000.
      • Extract passages longer than 500 characters from all 536 articles -> 23,215 paragraphs.
      • Train/dev/test datasets are split at the article level.
      • Train/dev sets are released; the test set is held out.
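A rough sketch of this curation step under stated assumptions: `top_articles` is a hypothetical mapping from article title to plain text, and paragraphs are naively split on blank lines (how the real pipeline segmented paragraphs is not specified here).

    import random

    def curate_passages(top_articles, n_articles=536, min_chars=500, seed=0):
        """Sample articles, then keep paragraphs longer than min_chars characters."""
        rng = random.Random(seed)
        sampled = rng.sample(sorted(top_articles), n_articles)
        passages = []
        for title in sampled:
            for para in top_articles[title].split("\n\n"):  # naive paragraph split
                if len(para) > min_chars:
                    passages.append((title, para))
        return passages  # the paper reports 23,215 paragraphs after this filter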

  16. Step 2: Question-Answer Collection
      • Uses crowdsourcing.
      • Crowdworkers must have a 97%+ HIT acceptance rate, more than 1,000 completed HITs, and be located in the US or Canada.
      • Workers spend 4 minutes on each paragraph, asking up to 5 questions and highlighting each answer in the text.

  17. Step 2: Question-Answer Collection

  18. Step 3: Additional Answers Collection
      • For each question in the dev/test datasets, collect at least two additional answers.
      • Why do this?
        • To make evaluation more robust.
        • To assess human performance.

  19. What are the properties of SQuAD?

  20. Data Analysis
      • Diversity in answers
      • Reasoning for answering questions
      • Syntactic divergence

  21. Diversity in Answers
      • 67.4% of answers are non-entities, and many answers are not even noun phrases -> can be challenging.

  22. Reasoning for answering questions

  23. Syntactic divergence
      • Syntactic divergence is the minimum, over all possible anchors (word-lemma pairs shared by the question and the answer sentence), of the edit distance between the unlexicalized dependency paths on the question side and the sentence side.
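A small sketch of this computation, assuming the unlexicalized dependency paths have already been extracted as lists of edge labels: `q_paths` maps each anchor to its dependency path in the question, and `s_paths` maps the same anchors to their paths in the answer sentence (both names are hypothetical).

    def edit_distance(a, b):
        """Levenshtein distance between two sequences of dependency labels."""
        dp = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, y in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                         dp[j - 1] + 1,    # insertion
                                         prev + (x != y))  # substitution
        return dp[-1]

    def syntactic_divergence(q_paths, s_paths):
        """Minimum path edit distance over all anchors shared by both sides."""
        shared = q_paths.keys() & s_paths.keys()
        return min(edit_distance(q_paths[a], s_paths[a]) for a in shared)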

  24. Syntactic divergence
      • Histogram of syntactic divergence [figure]

  25. How well can we do on SQuAD?

  26. “Baseline” method
      • Candidate answer generation: use a constituency parser.
      • Feature extraction.
      • Train a logistic regression model.
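As a sketch of the first bullet: given a constituency parse of a passage sentence (here an nltk.Tree built from a toy bracketing; the actual parser used in the paper is not reproduced), every constituent becomes a candidate answer span. Features for each candidate would then be scored by the logistic regression model.

    from nltk.tree import Tree

    def candidate_spans(parse):
        """Yield the text of every distinct constituent in a parse tree."""
        seen = set()
        for subtree in parse.subtrees():  # pre-order over all constituents
            span = " ".join(subtree.leaves())
            if span not in seen:
                seen.add(span)
                yield span

    toy = Tree.fromstring("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
    print(list(candidate_spans(toy)))
    # ['the cat sat', 'the cat', 'the', 'cat', 'sat']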

  27. What the features do:
      • Help to pick the correct sentence
      • Resolve lexical variations
      • Resolve syntactic variations

  28. Evaluation
      • After ignoring punctuation and articles, use the following two metrics:
        • Exact Match (EM)
        • Macro-averaged F1 score: take the maximum F1 over all of a question's ground-truth answers, then average over questions.
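A sketch of both metrics following the normalization the slide describes (lowercasing plus ignoring punctuation, articles, and extra whitespace); this mirrors the official SQuAD evaluation script's behavior, though the code below is a reconstruction, not the script itself.

    import re
    import string
    from collections import Counter

    def normalize_answer(s):
        """Lowercase, strip punctuation and articles, collapse whitespace."""
        s = s.lower()
        s = "".join(ch for ch in s if ch not in set(string.punctuation))
        s = re.sub(r"\b(a|an|the)\b", " ", s)
        return " ".join(s.split())

    def exact_match(prediction, ground_truths):
        """1.0 if the prediction matches any ground-truth answer exactly."""
        return max(float(normalize_answer(prediction) == normalize_answer(gt))
                   for gt in ground_truths)

    def f1_score(prediction, ground_truths):
        """Token-level F1, maximized over the ground-truth answers."""
        best = 0.0
        pred = normalize_answer(prediction).split()
        for gt in ground_truths:
            gold = normalize_answer(gt).split()
            overlap = sum((Counter(pred) & Counter(gold)).values())
            if overlap == 0:
                continue
            p, r = overlap / len(pred), overlap / len(gold)
            best = max(best, 2 * p * r / (p + r))
        return best

Per-question scores from both functions are then averaged over all questions, which is what "macro-averaged" refers to.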

  29. Experiment results
      • Overall results. For reference, human performance on the SQuAD v1.1 test set: EM 82.304, F1 91.221.

  30. Experiment results
      • Performance stratified by answer type

  31. Experiment results
      • Performance stratified by syntactic divergence

  32. Experiment results
      • Performance with feature ablations

  33. Summary
      • SQuAD is a machine-reading-style QA dataset.
      • SQuAD consists of 100,000+ QA pairs.
      • SQuAD is constructed via crowdsourcing.
      • SQuAD drives the field forward.

  34. Thanks! Q & A
