

SLIDE 1

PIQA: Reasoning about Physical Commonsense in Natural Language

Shailesh M Pandey

Bisk, Yonatan et al. “PIQA: Reasoning about Physical Commonsense in Natural Language.” ArXiv abs/1911.11641 (2019)

SLIDE 2

Outline

1. Motivation
2. Dataset
  2.1. Collection
  2.2. Cleaning
  2.3. Statistics
3. Experiments
  3.1. Results
4. Analysis
  4.1. Quantitative
  4.2. Qualitative
5. Critique

SLIDE 3

Motivation

  • Modeling physical common sense knowledge is essential to true AI-completeness.
  • Can AI systems learn to reliably answer physical common sense questions without experiencing the physical world?

○ Physical common sense properties are rarely reported directly in text.

  • There is no extensive evaluation of SOTA models on questions that require physical common sense knowledge.

SLIDE 4

Dataset

  • Task: given a question and two possible answers, choose the more appropriate answer.
  • Question: indicates a post-condition (goal).
  • Answer: a procedure for accomplishing the goal (solution).
SLIDE 5

Dataset - Collection

  • Qualification HIT for annotators:

○ Identify well-formed (goal, solution) pairs >80% of the time.

  • Annotators were given a prompt derived from instructables.com:

○ Prompts are drawn from six categories: costume, outside, craft, home, food, and workshop.
○ The prompt reminds annotators of less prototypical uses of everyday objects.

  • Annotators were asked to construct two component tasks:

○ Articulate the goal and solution.
○ Perturb the solution subtly to make it invalid.

SLIDE 6

Dataset - Cleaning

  • Removed examples with low annotator agreement.

○ Correct examples that require expert knowledge were also removed.

  • Used AFLite to perform systematic data bias reduction:

○ Fine-tuned BERT-Large on 5k examples.
○ Computed the corresponding embeddings of the remaining instances.
○ Used an ensemble of linear classifiers (trained on random subsets) to determine whether the embeddings are strong indicators of the correct answer.
○ Discarded instances whose embeddings are highly indicative of the target label.
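The filtering loop described above can be sketched in a few lines. This is a toy reconstruction rather than the paper's implementation: the embeddings are stand-in feature vectors (not BERT-Large [CLS] features), the "linear classifiers" are least-squares fits, and the function name and parameters are ours.

```python
import numpy as np

def aflite(embeddings, labels, n_rounds=64, subset_frac=0.5, threshold=0.75, seed=0):
    """Discard instances whose label is too predictable from the embedding
    alone, using an ensemble of linear classifiers trained on random subsets."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    X = np.hstack([embeddings, np.ones((n, 1))])  # append a bias feature
    y = np.asarray(labels, dtype=float)
    hits = np.zeros(n)   # times a classifier predicted this instance's label
    seen = np.zeros(n)   # times this instance was held out
    for _ in range(n_rounds):
        train = rng.random(n) < subset_frac
        if train.all() or not train.any():
            continue
        # least-squares linear classifier; predict label 1 when the score > 0.5
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        pred = (X[~train] @ w) > 0.5
        seen[~train] += 1
        hits[~train] += pred == (y[~train] > 0.5)
    predictability = np.divide(hits, seen, out=np.zeros(n), where=seen > 0)
    return predictability <= threshold, predictability  # keep-mask, scores
```

Instances whose label is blatantly encoded in a feature get predictability near 1.0 and are filtered out, while instances whose labels are not linearly recoverable score lower on average and are more likely to survive.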

SLIDE 7

AFLite (Adversarial Filtering Lite)

[Figure: a toy pool of eight instances with bit-string features 000–111; instances 000–011 carry label 1.]

Sakaguchi, Keisuke, et al. 2020. “WinoGrande: An Adversarial Winograd Schema Challenge at Scale.” In AAAI.

SLIDE 8

AFLite (Adversarial Filtering Lite)

[Animation frame: random subsets of the toy instances are drawn; each subset trains one linear classifier, and the instances left out of a subset are held out for prediction.]


SLIDE 9

AFLite (Adversarial Filtering Lite)

[Animation frame: each trained classifier predicts the labels of its held-out instances, and the predictions are recorded per instance (e.g., 001 → [1], 100 → [0]).]


SLIDE 10

AFLite (Adversarial Filtering Lite)

[Animation frame: further rounds add predictions to more instances (e.g., 000 → [1], 011 → [0], 100 → [0]).]


SLIDE 11

AFLite (Adversarial Filtering Lite)

[Animation frame: after more rounds, each instance has accumulated a list of ensemble predictions (e.g., 000 → [1, 0], 100 → [0, 0]).]


SLIDE 12

AFLite (Adversarial Filtering Lite)

[Animation frame: each instance's predictability score is the fraction of ensemble predictions that match its true label (e.g., 000: [1, 0] → 0.5; 001: [1] → 1.0; 011: [0] → 0.0; 100: [0, 0] → 1.0).]


SLIDE 13

AFLite (Adversarial Filtering Lite)

Threshold - 0.75

[Animation frame: instances whose predictability score exceeds the 0.75 threshold (here 001 and 100) are discarded from the dataset.]


SLIDE 14

Examples

SLIDE 15

Dataset - Statistics

  • Number of QA pairs:

○ Training: >16k
○ Development: ~2k
○ Testing: ~3k

  • Average number of words:

○ Goal: 7.8
○ Correct solution: 21.3
○ Incorrect solution: 21.3

SLIDE 16

Dataset - Statistics

  • Nearly identical length distributions for correct and incorrect solutions.
  • At least 85% overlap between the words used in correct and incorrect solutions.
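The overlap statistic can be made concrete with a small sketch. The slide does not spell out the exact measure, so this is one plausible reading (our function name and example strings, not the paper's): the fraction of the correct-solution vocabulary that also appears somewhere in the incorrect solutions.

```python
def vocab_overlap(correct_sols, incorrect_sols):
    """Fraction of the correct-solution vocabulary that also occurs
    in the incorrect solutions (one reading of the >=85% figure)."""
    correct_vocab = {w for s in correct_sols for w in s.lower().split()}
    incorrect_vocab = {w for s in incorrect_sols for w in s.lower().split()}
    return len(correct_vocab & incorrect_vocab) / len(correct_vocab)
```

A high overlap means models cannot rely on surface vocabulary cues to separate correct from incorrect solutions.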
SLIDE 17

Experiments

  • For each choice, the model is given the goal, the solution, and a [CLS] token.
  • The hidden state corresponding to [CLS] is extracted.
  • A linear transformation is applied to each hidden state, followed by a softmax over the two options.
  • Trained using a cross-entropy loss.
  • Examples are truncated at 150 tokens, which affects ~1% of the data.
  • Human performance was calculated by a majority vote on the development set.
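The scoring head described above can be sketched with numpy. This is a toy reconstruction, not the paper's code: the [CLS] hidden states here are random stand-ins for encoder outputs, and the function names and weight shapes are ours.

```python
import numpy as np

def score_choices(cls_states, w, b=0.0):
    """Map each choice's [CLS] hidden state to a scalar logit via a shared
    linear layer, then softmax over the two options."""
    logits = cls_states @ w + b              # one scalar logit per choice
    logits -= logits.max()                   # numerical stability
    return np.exp(logits) / np.exp(logits).sum()

def cross_entropy(probs, gold):
    # training loss: negative log-probability of the correct choice
    return -np.log(probs[gold])
```

At inference the model simply picks the choice with the higher probability.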

SLIDE 18

Quantitative Analysis

  • Two solution choices that differ by a single edited phrase must test common sense understanding of that phrase.
  • ~60% of the data involves a 1-2 word edit distance between the solutions.
  • Dataset difficulty generally increases with the edit distance between the solution pairs.
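The edit distance used to bucket solution pairs can be sketched as a word-level Levenshtein distance. The paper does not publish its exact tokenization, so this is an assumed whitespace-split version with example sentences of our own:

```python
def word_edit_distance(sol_a, sol_b):
    """Word-level Levenshtein distance between two solution strings."""
    a, b = sol_a.split(), sol_b.split()
    prev = list(range(len(b) + 1))          # distances from the empty prefix
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete wa
                           cur[j - 1] + 1,              # insert wb
                           prev[j - 1] + (wa != wb)))   # substitute / match
        prev = cur
    return prev[-1]
```

A pair differing only in one substituted noun has distance 1, which is the regime the analysis says covers most of the dataset.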

SLIDE 19

Quantitative Analysis

  • RoBERTa struggles to understand certain flexible relations:

○ ‘before’, ‘after’, ‘top’, and ‘bottom’.

  • It performs worse than average on solutions differing in ‘water’, even after ~300 training examples.
  • It performs much better on certain nouns, such as ‘spoon’.

SLIDE 20

Quantitative Analysis

  • ‘water’ is prevalent but highly versatile:

○ It is substituted with a variety of different household items.

  • ‘spoon’ has fewer common replacements, which suggests that RoBERTa understands these simpler affordances.

SLIDE 21

Qualitative Analysis

  • RoBERTa can distinguish prototypical correct solutions from clearly ridiculous trick solutions.
  • It struggles with subtle relations and non-prototypical situations.
SLIDE 22

Critique

  • Tackles a crucial ‘grounding’ problem:

○ A benchmark for testing the physical understanding of new models.
○ An evaluation of the physical common sense of SOTA models, which, unsurprisingly, do not perform very well.

  • A good effort at creating an unbiased dataset:

○ No ‘annotate for a smart robot’ instruction to the workers.
○ Thorough cleaning of the dataset via agreement scores and AFLite.

  • A reasonably good analysis of RoBERTa’s performance on the dataset.
SLIDE 23

Critique

  • An intelligent model will perform well on this benchmark, but is the converse true?

○ What if we pre-train RoBERTa on text from ‘instructables.com’?

  • Should we expect models trained only on text to have physical understanding?

○ How would a text-trained model know that squeezing and then releasing a bottle creates suction?
○ Should the focus have been on ‘grounded’ models instead, e.g. VQA models?

  • Is the dataset easy because there are only two choices?
  • The paper omits a few important dataset statistics:

○ What is the distribution of words in incorrect solutions? Is it similar to that of the correct solutions?
○ How many examples were actually removed during cleaning?

  • Is a majority vote a good indicator of human performance?

○ What is the average score of a single person?
○ Should the dataset include questions where the majority vote gets it wrong?
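The last point can be illustrated with a toy sketch on synthetic votes (not PIQA data; the function name and example numbers are ours): majority voting can score higher than the average individual annotator, so it likely overstates single-human performance.

```python
from collections import Counter

def majority_vs_individual(votes, gold):
    """Compare majority-vote accuracy against the average accuracy of
    individual annotators over the same questions."""
    maj_correct = 0
    indiv_correct = indiv_total = 0
    for vs, g in zip(votes, gold):
        top = Counter(vs).most_common(1)[0][0]   # most frequent vote
        maj_correct += (top == g)
        indiv_correct += sum(v == g for v in vs)
        indiv_total += len(vs)
    return maj_correct / len(gold), indiv_correct / indiv_total
```

With three annotators who each make occasional errors on different questions, the majority vote can be perfect while the average individual is not.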

SLIDE 24

Questions?