
SLIDE 1

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang Published in EMNLP 2016 Presented by Jiaming Shen April 17, 2018

SLIDE 2

SQuAD = Stanford Question Answering Dataset


Online challenge: https://rajpurkar.github.io/SQuAD-explorer/

SLIDE 3

Overall contribution

  • A benchmark dataset with:
  • Proper difficulty
  • Principled curation process
  • Detailed data analysis

SLIDE 4

Outline

  • What are the QA datasets prior to SQuAD?
  • What does SQuAD look like?
  • How is SQuAD created?
  • What are the properties of SQuAD?
  • How well can we do on SQuAD?

SLIDE 5

What are the QA datasets prior to SQuAD?

SLIDE 6

Related Datasets


  • Type I: Complex reading comprehension datasets
  • Type II: Open-domain QA datasets
  • Type III: Cloze datasets

SLIDE 7

Type I: Complex Reading Comprehension Datasets

  • Require commonsense knowledge -> very challenging
  • Dataset sizes are too small

SLIDE 8

Type II: Open-domain QA Datasets

  • Open-domain QA: answer a question from a large collection of documents.
  • WikiQA: only sentence selection
  • TREC-QA: free-form answers -> hard to evaluate

SLIDE 9

Type III: Cloze Datasets


  • Automatically generated -> large scale
  • Limitations are described in ACL 2016 Best Paper.
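Cloze-style examples are built automatically by blanking out an entity in a sentence, which is what makes them cheap to produce at large scale. A minimal sketch of the idea (the sentences, entity, and `@placeholder` token here are illustrative, not drawn from any actual cloze corpus):

```python
def make_cloze(context, summary_sentence, entity):
    """Build a toy cloze-style QA example (in the CNN/Daily Mail spirit):
    blank out an entity in a summary sentence; the model must
    recover it from the accompanying context passage."""
    assert entity in summary_sentence
    return {
        "context": context,
        "question": summary_sentence.replace(entity, "@placeholder"),
        "answer": entity,
    }

ex = make_cloze(
    context="Marie Curie won the Nobel Prize in Physics in 1903.",
    summary_sentence="Marie Curie won a Nobel Prize.",
    entity="Marie Curie",
)
print(ex["question"])  # -> @placeholder won a Nobel Prize.
```

Because no human writes the question, scale is essentially free, but the questions are constrained to entity-recovery, one of the limitations discussed in the ACL 2016 Best Paper.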
SLIDE 10

What does SQuAD look like?

SLIDE 11

SQuAD Dataset Format

Example figure: a passage and one QA pair.

SLIDE 12

SQuAD Dataset Format

  • One passage can have multiple question-answer pairs.
  • In total, 100,000+ QA pairs from 23,215 passages.
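The released files use a nested JSON layout. A minimal illustrative record, assuming the standard v1.1 fields (`data` -> `paragraphs` -> `context`/`qas`, with each answer given as its text plus a character offset; the title, id, and text below are made up for illustration):

```python
# One article -> its paragraphs -> a context string plus a list of QA pairs.
record = {
    "version": "1.1",
    "data": [{
        "title": "Super_Bowl_50",
        "paragraphs": [{
            "context": "Super Bowl 50 was an American football game ...",
            "qas": [{
                "id": "example-question-1",
                "question": "What kind of game was Super Bowl 50?",
                "answers": [{"text": "American football game",
                             "answer_start": 21}],
            }],
        }],
    }],
}

# An answer span is recovered by slicing the context at answer_start.
para = record["data"][0]["paragraphs"][0]
ans = para["qas"][0]["answers"][0]
span = para["context"][ans["answer_start"]:ans["answer_start"] + len(ans["text"])]
print(span)  # -> American football game
```

Storing answers as character offsets into the context is what makes SQuAD an extractive (span-selection) task rather than free-form generation.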

SLIDE 13

How is SQuAD created?

SLIDE 14

SQuAD Dataset Collection

  • Consists of three steps:
  • Step 1: Passage curation
  • Step 2: Question-answer collection
  • Step 3: Additional answers collection

SLIDE 15

Step 1: Passage Curation

  • Select the top 10,000 articles of English Wikipedia based on Wikipedia’s internal PageRank scores.
  • Randomly sample 536 articles out of the 10,000.
  • Extract passages longer than 500 characters from all 536 articles -> 23,215 paragraphs.
  • Train/dev/test sets are split at the article level.
  • Train/dev sets are released; the test set is held out.
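An article-level split keeps every passage of an article in the same split, so the model never sees dev/test articles during training. A sketch of the idea (the `article` field name and split fractions are assumptions for illustration):

```python
import random

def split_by_article(passages, train_frac=0.8, dev_frac=0.1, seed=0):
    """Split passages into train/dev/test at the *article* level, so all
    passages from one article land in the same split."""
    articles = sorted({p["article"] for p in passages})
    random.Random(seed).shuffle(articles)
    n_train = int(len(articles) * train_frac)
    n_dev = int(len(articles) * dev_frac)
    split_of = {}
    for i, a in enumerate(articles):
        split_of[a] = ("train" if i < n_train
                       else "dev" if i < n_train + n_dev
                       else "test")
    return {name: [p for p in passages if split_of[p["article"]] == name]
            for name in ("train", "dev", "test")}

# Toy corpus: 20 passages drawn from 5 articles.
passages = [{"article": f"A{i % 5}", "text": f"passage {i}"} for i in range(20)]
splits = split_by_article(passages)
```

Splitting at the passage level instead would leak near-duplicate contexts across splits, inflating test scores.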

SLIDE 16

Step 2: Question-Answer Collection

  • Collected via crowdsourcing:
  • Crowd-workers must have a 97%+ HIT acceptance rate, more than 1,000 completed HITs, and be located in the US or Canada.
  • Workers spend 4 minutes on each paragraph, asking up to 5 questions and highlighting each answer in the text.

SLIDE 17

Step 2: Question-Answer Collection

SLIDE 18

Step 3: Additional Answers Collection

  • For each question in the dev/test sets, collect at least two additional answers.
  • Why do this?
  • Makes evaluation more robust.
  • Allows assessing human performance.

SLIDE 19

What are the properties of SQuAD?

SLIDE 20

Data Analysis

  • Diversity in answers
  • Reasoning for answering questions
  • Syntactic divergence

SLIDE 21

Diversity in Answers

  • 67.4% of answers are non-entities, and many answers are not even noun phrases -> can be challenging.

SLIDE 22

Reasoning for answering questions

SLIDE 23

Syntactic divergence

  • Syntactic divergence is the minimum edit distance between unlexicalized dependency paths, taken over all possible anchors (word-lemma pairs shared by the question and the answer sentence).
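The measure above reduces to a Levenshtein distance over sequences of dependency-edge labels, minimized across anchors. A minimal sketch, assuming the dependency paths have already been extracted as lists of edge labels (the labels and anchor paths below are made up for illustration):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: lists of
    unlexicalized dependency-edge labels along a path)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def syntactic_divergence(anchor_paths):
    """Minimum path edit distance over all candidate anchors, where each
    anchor contributes a (question_path, sentence_path) pair."""
    return min(edit_distance(q, s) for q, s in anchor_paths)

# Two hypothetical anchors with their question-side and sentence-side paths.
paths = [(["nsubj", "ROOT", "dobj"], ["nsubj", "ROOT", "prep", "pobj"]),
         (["ROOT", "dobj"], ["ROOT", "dobj"])]
print(syntactic_divergence(paths))  # -> 0 (the second anchor's paths match)
```

A divergence of 0 means the question closely paraphrases the answer sentence's syntax; larger values indicate the question is syntactically further from the evidence.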

SLIDE 24

Syntactic divergence

  • Histogram of syntactic divergence

SLIDE 25

How well can we do on SQuAD?

SLIDE 26

“Baseline” method

  • Candidate answer generation: use a constituency parser.
  • Feature extraction
  • Train a logistic regression model
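The pipeline above can be sketched as "score each candidate span with a linear model over features, pick the argmax". This toy version skips the constituency parser and uses only two hand-rolled features with hand-set weights; the paper's actual baseline uses a much richer lexicalized and dependency-tree feature set with weights learned by logistic regression:

```python
def features(question, sentence, span):
    """A few toy features in the spirit of the baseline's feature extraction."""
    q = set(question.lower().split())
    s = set(sentence.lower().split())
    return {
        "sent_overlap": len(q & s),   # helps pick the correct sentence
        "span_len": len(span.split()),
    }

def score(feats, weights):
    """Linear score, standing in for the trained logistic regression model."""
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

# Hypothetical hand-set weights (a real model would learn these).
weights = {"sent_overlap": 1.0, "span_len": -0.1}

question = "What city hosted the 1900 Olympics?"
candidates = [("Paris hosted the 1900 Olympics .", "Paris"),
              ("London hosted them in 1908 .", "London")]
best = max(candidates, key=lambda c: score(features(question, *c), weights))
print(best[1])  # -> Paris
```

Even this crude lexical-overlap feature picks the right sentence here; the paper's extra features exist to also resolve lexical and syntactic variation within the sentence.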

SLIDE 27


  • Help to pick the correct sentence
  • Resolve lexical variations
  • Resolve syntactic variations

SLIDE 28

Evaluation

  • After ignoring punctuation and articles, two metrics are used:
  • Exact Match
  • Macro-averaged F1 score: maximum F1 over all of the ground-truth answers
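A re-implementation sketch of the two metrics, mirroring (but not identical to) the official evaluation script: normalize both strings, then take exact match or token-level F1, maximized over all ground-truth answers:

```python
import re
import string
from collections import Counter

def normalize(s):
    """Lowercase, strip punctuation and the articles a/an/the, squeeze spaces."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred, golds):
    """1.0 if the prediction matches any ground-truth answer after normalization."""
    return max(float(normalize(pred) == normalize(g)) for g in golds)

def f1(pred, golds):
    """Token-level F1, maximized over all ground-truth answers."""
    def f1_single(p, g):
        pt, gt = normalize(p).split(), normalize(g).split()
        common = sum((Counter(pt) & Counter(gt)).values())
        if common == 0:
            return 0.0
        prec, rec = common / len(pt), common / len(gt)
        return 2 * prec * rec / (prec + rec)
    return max(f1_single(pred, g) for g in golds)

golds = ["the French Republic", "French Republic"]
print(exact_match("French Republic.", golds))       # -> 1.0
print(round(f1("Republic of France", golds), 2))    # -> 0.4
```

Taking the maximum over the multiple collected answers is exactly why Step 3 (additional answers) makes the evaluation more robust.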

SLIDE 29

Experiment results


  • Overall results

For SQuAD v1.1 test, human performance is EM 82.304, F1 91.221.

SLIDE 30

Experiment results


  • Performance stratified by answer type
SLIDE 31

Experiment results


  • Performance stratified by syntactic divergence
SLIDE 32

Experiment results


  • Performance with feature ablations
SLIDE 33

Summary

  • SQuAD is a machine reading style QA dataset.
  • SQuAD consists of 100,000+ QA pairs.
  • SQuAD is constructed based on crowdsourcing.
  • SQuAD drives the field forward.

SLIDE 34

Thanks Q & A
