SLIDE 1 How will we know when machines can read?
Matt Gardner, with many collaborators
MRQA workshop, November 4, 2019
SLIDE 2
Look mom, I can read like a human!
SLIDE 3
Look mom, I can read like a human!
SLIDE 4
But...
SLIDE 5
So what’s the right evaluation?
SLIDE 6 Building the right test
- What format should the test be?
- What should be on the test?
- How do we evaluate the test?
SLIDE 7
Test format
SLIDE 8
What is reading?
Postulate: an entity understands a passage of text if it is able to answer arbitrary questions about that text.
SLIDE 9
Why is QA the right format?
It has issues, but really, what other choice is there? We don’t have a formalism for this.
SLIDE 10
What kind of QA?
SLIDE 11
What about multiple choice, or NLI?
SLIDE 12 What about multiple choice, or NLI?
Both have the same problems:
1. Distractors have biases
2. Low-entropy output space
3. Machines (and people!) use different models for this
SLIDE 13
Bottom line
I propose standardizing on SQuAD-style inputs, arbitrary (evaluable) outputs
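To make the proposal concrete, here is a minimal sketch of what a standardized instance might look like; the field names are illustrative, not a spec from the talk:

```python
# A hypothetical instance in the proposed format: the input is always a
# SQuAD-style (passage, question) pair, while the output space is open-ended,
# as long as the answer type can be evaluated automatically.
instance = {
    "passage": "The Broncos kicked two field goals before halftime ...",
    "question": "How many points did the Broncos score before halftime?",
    # Could be a span, a number, a date, a set of spans, free-form text, ...
    "answer": {"type": "number", "value": 6},
}
```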
SLIDE 14
Test content
SLIDE 15 I really meant arbitrary
- The test won’t be convincing unless it has all kinds of questions, about every aspect of reading you can think of.
- So what are those aspects?
SLIDE 16
Sentence-level linguistic structure
SLIDE 17 Sentence-level linguistic structure
But SQuAD just scratches the surface:
- Many other kinds of local structure
- Need to test coherence more broadly
SLIDE 18 DROP:
Discrete Reasoning Over Paragraphs
NAACL 2019
SLIDE 19 DROP:
Discrete Reasoning Over Paragraphs
NAACL 2019
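The DROP examples shown on these slides are not reproduced here. As a rough, hypothetical illustration of what “discrete reasoning” means: the answer to a DROP-style question is often computed from numbers in the passage rather than copied out of it.

```python
import re

passage = ("The Broncos kicked a 33-yard field goal in the first quarter "
           "and a 45-yard field goal in the third.")
question = "How many total yards of field goals did the Broncos kick?"

# The answer (78) appears nowhere in the passage as a span; a model has to
# find the relevant numbers and do arithmetic over them.
yards = [int(n) for n in re.findall(r"(\d+)-yard", passage)]
answer = sum(yards)  # 78
```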
SLIDE 20 Discourse structure
- Tracking entities across a discourse
- Understanding discourse connectives and discourse coherence
- ...
SLIDE 21 Quoref:
Question-based coreference resolution
EMNLP 2019
SLIDE 22 Quoref:
Question-based coreference resolution
EMNLP 2019
SLIDE 23 Quoref:
Question-based coreference resolution
EMNLP 2019
SLIDE 24 Implicative meaning
- What do the propositions in the text imply about other propositions I might see in other text?
- E.g., “Bill loves Mary”, “Mary gets sick” → “Bill is sad”
- Where do these implications come from?
SLIDE 25 ROPES:
Reasoning Over Paragraph Effects in Situations
MRQA 2019
SLIDE 26
ROPES:
Reasoning Over Paragraph Effects in Situations
SLIDE 27
ROPES:
Reasoning Over Paragraph Effects in Situations
SLIDE 28 Time
- Temporal ordering of events
- Duration of events
- Which things are events in the first place?
SLIDE 29 Grounding
- Common sense
- Factual knowledge
- More broadly: a speaker is trying to communicate world state, and in a person it induces a mental model of that world state. We need to figure out ways to probe these mental models.
SLIDE 30
Grounding
SLIDE 31
Grounding
SLIDE 32 Many, many, many more…
- Pragmatics, factuality
- Coordination, distributive vs. non-distributive
- Deixis
- Aspectual verbs
- Bridging and other elided elements
- Negation and quantifier scoping
- Distribution of quantifiers
- Preposition senses
- Noun compounds
- ...
SLIDE 33
Test evaluation
SLIDE 34 How do we evaluate generative QA?
- This is a serious problem that severely limits our test
- No solution yet, but we’re working on it
- See Anthony’s talk for more detail
MRQA 2019 Best paper
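To see why this is hard, here is a minimal sketch of token-overlap F1, roughly the SQuAD-style metric most datasets fall back on; it rewards lexical overlap rather than meaning, which is exactly the problem for generated answers:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1, roughly the SQuAD-style metric (before its
    punctuation/article normalization)."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    common = Counter(pred) & Counter(ref)  # per-token min counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Correct paraphrases score poorly, incorrect lexical matches score well:
token_f1("six points", "6 points")               # 0.5, despite being right
token_f1("the Broncos lost", "the Broncos won")  # 0.67, despite being wrong
```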
SLIDE 35 What about reasoning shortcuts?
- It’s easy to write questions that don’t test what you think they’re testing
- See our MRQA paper for more on how to combat this
SLIDE 36 What about generalization?
- There is a growing realization that the traditional supervised learning paradigm is broken in high-level, large-dataset NLP - we’re fitting artifacts
- The test should include not just hidden test data, but hidden test data from a different distribution than the training data
- MRQA has the right idea here
- That is, we should explicitly make test sets without training sets (as long as they are close enough to the training data that generalization should be possible) - see the sketch below
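A tiny sketch of the protocol this slide argues for, with placeholder dataset names:

```python
# Placeholder names: the point is that test sets have no corresponding
# training sets, so fitting annotation artifacts of the training data
# cannot help at test time.
train_sets = {"dataset_a", "dataset_b", "dataset_c"}
test_sets = {"dataset_x", "dataset_y"}  # hidden, out-of-distribution
assert not train_sets & test_sets
```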
SLIDE 37
A beginning, and a call for help
SLIDE 38 An Open Reading Benchmark
MRQA 2019
- Evaluate one model on all of these questions at the same time (sketched below)
- Standardized (SQuAD-like) input, arbitrary output
- Will grow over time, as more datasets are built
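A rough sketch of what “one model, many datasets” evaluation could look like; the interface below is hypothetical, not ORB’s actual API:

```python
def evaluate_benchmark(model, datasets):
    """Score one model on every dataset in the benchmark."""
    scores = {}
    for name, instances in datasets.items():
        correct = 0
        for ex in instances:
            # Every dataset shares the SQuAD-like input signature ...
            pred = model.answer(passage=ex["passage"], question=ex["question"])
            # ... but supplies its own checker for its own answer type
            # (span F1, numeric match, set match, ...).
            correct += ex["score"](pred)
        scores[name] = correct / len(instances)
    return scores
```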
SLIDE 39 An Open Reading Benchmark
MRQA 2019
SLIDE 40 An Open Reading Benchmark
- Making a good test is a bigger problem than any one group can solve
- We need to work together to make this happen
- We will add any good dataset that matches the input format
SLIDE 41 To conclude
- Current reading comprehension benchmarks are insufficient to convince a reasonable researcher that machines can read
- There are a lot of things that need to be tested before we will be convinced
- We need to work together to make a sufficient test - there’s too much for anyone to do on their own
Thanks!
We’re hiring!