SLIDE 1

How will we know when machines can read?

Matt Gardner, with many collaborators MRQA workshop, November 4, 2019

SLIDE 2

Look mom, I can read like a human!

SLIDE 4

But...

SLIDE 5

So what’s the right evaluation?

SLIDE 6

Building the right test

  • What format should the test be?
  • What should be on the test?
  • How do we evaluate the test?

MRQA 2019

SLIDE 7

Test format

SLIDE 8

What is reading?

Postulate: an entity understands a passage of text if it is able to answer arbitrary questions about that text.

SLIDE 9

Why is QA the right format?

It has issues, but really, what other choice is there? We don’t have a formalism for this.

SLIDE 10

What kind of QA?

SLIDE 11

What about multiple choice, or NLI?

SLIDE 12

What about multiple choice, or NLI?

Both have the same problems:

  • Distractors have biases
  • Low-entropy output space
  • Machines (and people!) use different models for this
SLIDE 13

Bottom line

I propose standardizing on SQuAD-style inputs, arbitrary (evaluable) outputs
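As a sketch of what that standardization might look like (the field names below are illustrative, not a published schema): every example shares the same passage/question input, while the answer type is free to vary per example.

```python
# Hypothetical examples in a shared input format with heterogeneous outputs.
examples = [
    {   # span answer, SQuAD-style
        "passage": "Marie Curie won the Nobel Prize in Physics in 1903.",
        "question": "Who won the Nobel Prize in Physics in 1903?",
        "answer": {"type": "span", "value": "Marie Curie"},
    },
    {   # number answer, DROP-style
        "passage": "The team scored 21 points in the first half and 13 in the second.",
        "question": "How many points did the team score in total?",
        "answer": {"type": "number", "value": 34},
    },
]

# A model evaluated on such a benchmark consumes the same two input
# fields regardless of which answer type the example carries.
for ex in examples:
    assert {"passage", "question", "answer"} <= set(ex)
```

The point of the shared input contract is that adding a new dataset never requires changing how models are fed, only how their outputs are scored.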

SLIDE 14

Test content

SLIDE 15

I really meant arbitrary

  • The test won’t be convincing unless it has all kinds of questions, about every aspect of reading you can think of.
  • So what are those aspects?
SLIDE 16

Sentence-level linguistic structure

SLIDE 17

Sentence-level linguistic structure

But SQuAD just scratches the surface:

  • Many other kinds of local structure
  • Need to test coherence more broadly
SLIDE 18

DROP:

Discrete Reasoning Over Paragraphs

NAACL 2019
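To make "discrete reasoning" concrete, here is a constructed toy example in the spirit of DROP (not taken from the dataset; passage, question, and helper function are all illustrative): answering requires extracting numbers from the passage and combining them with arithmetic, rather than copying a span.

```python
import re

passage = ("The Bears scored 21 points in the first half "
           "and 13 points in the second half.")
question = ("How many more points did the Bears score "
            "in the first half than in the second?")

def answer_difference(passage: str) -> int:
    # Pull out the numeric mentions, then combine them arithmetically:
    # the answer "8" never appears as a span in the passage.
    nums = [int(n) for n in re.findall(r"\d+", passage)]
    return max(nums) - min(nums)

print(answer_difference(passage))  # prints 8
```

Because the correct answer is not a substring of the passage, span-extraction models cannot solve such questions by selection alone.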


SLIDE 20

Discourse structure

  • Tracking entities across a discourse
  • Understanding discourse connectives and discourse coherence
  • ...
SLIDE 21

Quoref:

Question-based coreference resolution

EMNLP 2019


SLIDE 24

Implicative meaning

  • What do the propositions in the text imply about other propositions I might see in other text?
  • E.g., “Bill loves Mary”, “Mary gets sick” → “Bill is sad”
  • Where do these implications come from?
SLIDE 25

ROPES:

Reasoning Over Paragraph Effects in Situations

MRQA 2019


SLIDE 28

Time

  • Temporal ordering of events
  • Duration of events
  • Which things are events in the first place?
SLIDE 29

Grounding

  • Common sense
  • Factual knowledge
  • More broadly: the speaker is trying to communicate a world state, and in a person it induces a mental model of that world state. We need to figure out ways to probe these mental models.

SLIDE 32

Many, many, many more…

  • Pragmatics, factuality
  • Coordination, distributive vs. non-distributive
  • Deixis
  • Aspectual verbs
  • Bridging and other elided elements
  • Negation and quantifier scoping
  • Distribution of quantifiers
  • Preposition senses
  • Noun compounds
  • ...
SLIDE 33

Test evaluation

SLIDE 34

How do we evaluate generative QA?

  • This is a serious problem that severely limits our test
  • No solution yet, but we’re working on it
  • See Anthony’s talk for more detail

MRQA 2019 Best paper
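One way to see the difficulty: the token-overlap F1 used by SQuAD-style evaluation is the common stopgap for short answers, but it rewards lexical overlap rather than meaning, which breaks down for longer generated answers. A minimal sketch of that metric:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the Eiffel Tower", "Eiffel Tower")` gives 0.8, yet a paraphrase with no shared tokens scores 0.0 even when it is a perfectly correct answer.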

SLIDE 35

What about reasoning shortcuts?

  • It’s easy to write questions that don’t test what you think they’re testing
  • See our MRQA paper for more on how to combat this
SLIDE 36

What about generalization?

  • There is growing realization that the traditional supervised learning paradigm is broken in high-level, large-dataset NLP: we’re fitting artifacts
  • The test should include not just hidden test data, but hidden test data from a different distribution than the training data
  • MRQA has the right idea here
  • That is, we should explicitly make test sets without training sets (as long as they are close enough to training that it should be possible to generalize)

SLIDE 37

A beginning, and a call for help

SLIDE 38

An Open Reading Benchmark

Ananth

MRQA 2019

  • Evaluate one model on all of these questions at the same time
  • Standardized (SQuAD-like) input, arbitrary output
  • Will grow over time, as more datasets are built

SLIDE 40

An Open Reading Benchmark

  • Making a good test is a bigger problem than any one group can solve
  • We need to work together to make this happen
  • We will add any good dataset that matches the input format
SLIDE 41

To conclude

  • Current reading comprehension benchmarks are insufficient to convince a reasonable researcher that machines can read
  • There are a lot of things that need to be tested before we will be convinced
  • We need to work together to make a sufficient test - there’s too much for anyone to do on their own

Thanks!

Ananth

We’re hiring!