SLIDE 1 How will we know when machines can read?
Matt Gardner, with many collaborators
MRQA workshop, November 4, 2019
SLIDE 2
Look mom, I can read like a human!
SLIDE 3
Look mom, I can read like a human!
SLIDE 4
But...
SLIDE 5
So what’s the right evaluation?
SLIDE 6 Building the right test
- What format should the test be?
- What should be on the test?
- How do we evaluate the test?
SLIDE 7
Test format
SLIDE 8
What is reading?
Postulate: an entity understands a passage of text if it is able to answer arbitrary questions about that text.
SLIDE 9
Why is QA the right format?
It has issues, but really, what other choice is there? We don’t have a formalism for this.
SLIDE 10
What kind of QA?
SLIDE 11
What about multiple choice, or NLI?
SLIDE 12 What about multiple choice, or NLI?
Both have the same problems:
1. Distractors have biases
2. Low-entropy output space
3. Machines (and people!) use different models for this
SLIDE 13
Bottom line
I propose standardizing on SQuAD-style inputs, arbitrary (evaluable) outputs
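To make the proposal concrete, here is a minimal sketch of what a standardized instance might look like; the field names are illustrative, not a spec from the talk:

```python
# A hypothetical instance in the proposed format: the input is always a
# SQuAD-style (passage, question) pair, while the output space is open-ended,
# as long as the answer type can be evaluated automatically.
instance = {
    "passage": "The Broncos kicked two field goals before halftime ...",
    "question": "How many points did the Broncos score before halftime?",
    # Could be a span, a number, a date, a set of spans, free-form text, ...
    "answer": {"type": "number", "value": 6},
}
```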
SLIDE 14
Test content
SLIDE 15 I really meant arbitrary
- The test won’t be convincing unless it has all kinds of questions, about every aspect of reading you can think of.
- So what are those aspects?
SLIDE 16
Sentence-level linguistic structure
SLIDE 17 Sentence-level linguistic structure
But SQuAD just scratches the surface:
- Many other kinds of local structure
- Need to test coherence more broadly
SLIDE 18 DROP:
Discrete Reasoning Over Paragraphs
NAACL 2019
SLIDE 19 DROP:
Discrete Reasoning Over Paragraphs
NAACL 2019
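The DROP examples shown on these slides are not reproduced here. As a rough, hypothetical illustration of what “discrete reasoning” means: the answer to a DROP-style question is often computed from numbers in the passage rather than copied out of it.

```python
import re

passage = ("The Broncos kicked a 33-yard field goal in the first quarter "
           "and a 45-yard field goal in the third.")
question = "How many total yards of field goals did the Broncos kick?"

# The answer (78) appears nowhere in the passage as a span; a model has to
# find the relevant numbers and do arithmetic over them.
yards = [int(n) for n in re.findall(r"(\d+)-yard", passage)]
answer = sum(yards)  # 78
```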
SLIDE 20 Discourse structure
- Tracking entities across a discourse
- Understanding discourse connectives and discourse coherence
- ...
SLIDE 21 Quoref:
Question-based coreference resolution
EMNLP 2019
SLIDE 22 Quoref:
Question-based coreference resolution
EMNLP 2019
SLIDE 23 Quoref:
Question-based coreference resolution
EMNLP 2019
SLIDE 24 Implicative meaning
- What do the propositions in the text imply about other propositions I might see in other text?
- E.g., “Bill loves Mary”, “Mary gets sick” → “Bill is sad”
- Where do these implications come from?
SLIDE 25 ROPES:
Reasoning Over Paragraph Effects in Situations
MRQA 2019
SLIDE 26
ROPES:
Reasoning Over Paragraph Effects in Situations
SLIDE 27
ROPES:
Reasoning Over Paragraph Effects in Situations
SLIDE 28 Time
- Temporal ordering of events
- Duration of events
- Which things are events in the first place?
SLIDE 29 Grounding
- Common sense
- Factual knowledge
- More broadly: a speaker is trying to communicate world state, and in a person it induces a mental model of that world state. We need to figure out ways to probe these mental models.
SLIDE 30
Grounding
SLIDE 31
Grounding
SLIDE 32 Many, many, many more…
- Pragmatics, factuality
- Coordination, distributive vs. non-distributive
- Deixis
- Aspectual verbs
- Bridging and other elided elements
- Negation and quantifier scoping
- Distribution of quantifiers
- Preposition senses
- Noun compounds
- ...
SLIDE 33
Test evaluation
SLIDE 34 How do we evaluate generative QA?
- This is a serious problem that severely limits our test
- No solution yet, but we’re working on it
- See Anthony’s talk for more detail
MRQA 2019 Best paper
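To see why this is hard, here is a minimal sketch of token-overlap F1, roughly the SQuAD-style metric most datasets fall back on; it rewards lexical overlap rather than meaning, which is exactly the problem for generated answers:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1, roughly the SQuAD-style metric (before its
    punctuation/article normalization)."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    common = Counter(pred) & Counter(ref)  # per-token min counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Correct paraphrases score poorly, incorrect lexical matches score well:
token_f1("six points", "6 points")               # 0.5, despite being right
token_f1("the Broncos lost", "the Broncos won")  # 0.67, despite being wrong
```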
SLIDE 35 What about reasoning shortcuts?
- It’s easy to write questions that don’t test what you think they’re testing
- See our MRQA paper for more on how to combat this
SLIDE 36 What about generalization?
- There is a growing realization that the traditional supervised learning paradigm is broken in high-level, large-dataset NLP - we’re fitting artifacts
- The test should include not just hidden test data, but hidden test data from a different distribution than the training data
- MRQA has the right idea here
- That is, we should explicitly make test sets without training sets (as long as they are close enough to the training data that generalization should be possible) - see the sketch below
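A tiny sketch of the protocol this slide argues for, with placeholder dataset names:

```python
# Placeholder names: the point is that test sets have no corresponding
# training sets, so fitting annotation artifacts of the training data
# cannot help at test time.
train_sets = {"dataset_a", "dataset_b", "dataset_c"}
test_sets = {"dataset_x", "dataset_y"}  # hidden, out-of-distribution
assert not train_sets & test_sets
```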
SLIDE 37
A beginning, and a call for help
SLIDE 38 An Open Reading Benchmark
MRQA 2019
- Evaluate one model on all of these questions at the same time (sketched below)
- Standardized (SQuAD-like) input, arbitrary output
- Will grow over time, as more datasets are built
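A rough sketch of what “one model, many datasets” evaluation could look like; the interface below is hypothetical, not ORB’s actual API:

```python
def evaluate_benchmark(model, datasets):
    """Score one model on every dataset in the benchmark."""
    scores = {}
    for name, instances in datasets.items():
        correct = 0
        for ex in instances:
            # Every dataset shares the SQuAD-like input signature ...
            pred = model.answer(passage=ex["passage"], question=ex["question"])
            # ... but supplies its own checker for its own answer type
            # (span F1, numeric match, set match, ...).
            correct += ex["score"](pred)
        scores[name] = correct / len(instances)
    return scores
```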
SLIDE 39 An Open Reading Benchmark
MRQA 2019
SLIDE 40 An Open Reading Benchmark
- Making a good test is a bigger problem than any one group can solve
- We need to work together to make this happen
- We will add any good dataset that matches the input format
SLIDE 41 To conclude
- Current reading comprehension benchmarks are insufficient to convince a reasonable researcher that machines can read
- There are a lot of things that need to be tested before we will be convinced
- We need to work together to make a sufficient test - there’s too much for anyone to do on their own
Thanks!
We’re hiring!