
PIQA: Reasoning about Physical Commonsense in Natural Language



  1. PIQA: Reasoning about Physical Commonsense in Natural Language. Presented by Shailesh M Pandey. Bisk, Yonatan, et al. “PIQA: Reasoning about Physical Commonsense in Natural Language.” arXiv abs/1911.11641 (2019).

  2. Outline 1. Motivation 2. Dataset 2.1. Collection 2.2. Cleaning 2.3. Statistics 3. Experiments 3.1. Results 4. Analysis 4.1. Quantitative 4.2. Qualitative 5. Critique

  3. Motivation ● Modeling physical common sense knowledge is essential for true AI-completeness. ● Can AI systems learn to reliably answer physical common sense questions without experiencing the physical world? ○ Physical common sense properties are rarely reported directly in text. ● There has been no extensive evaluation of SOTA models on questions that require physical common sense knowledge.

  4. Dataset ● Task: given a question and two possible answers, choose the more appropriate answer. ● Question: states a post-condition (the goal). ● Answer: a procedure for accomplishing the goal (the solution).
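  A minimal sketch of what one instance looks like, assuming JSONL-style fields named goal, sol1, and sol2 with a separate 0/1 label for the correct solution; the field names and the example text are illustrative assumptions, not quoted from the paper or the released files.

    # One PIQA-style instance: a goal and two candidate solutions.
    # Field names (goal, sol1, sol2) and the label convention are assumptions.
    import json

    example = {
        "goal": "Separate an egg yolk from the white using a water bottle.",
        "sol1": "Squeeze the bottle, press it against the yolk, and release it "
                "to create suction that lifts the yolk.",
        "sol2": "Press the bottle against the yolk and keep squeezing it "
                "to create suction that lifts the yolk.",
    }
    label = 0  # index of the correct solution (sol1 here)

    line = json.dumps(example)          # one instance per JSONL line
    print(json.loads(line)["goal"], "->", example[f"sol{label + 1}"])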

  5. Dataset - Collection ● Qualification HIT for annotators ○ Annotators had to identify well-formed (goal, solution) pairs more than 80% of the time. ● Annotators were provided with a prompt derived from instructables.com ○ Prompts are drawn from six categories - costume, outside, craft, home, food, and workshop. ○ Prompts remind annotators of less prototypical uses of everyday objects. ● Annotators were asked to construct two component tasks ○ Articulate the goal and a solution. ○ Perturb the solution subtly to make it invalid.

  6. Dataset - Cleaning ● Removed examples with low agreement ○ Correct examples that require expert knowledge are also removed. ● Used AFLite to perform systematic data bias reduction ○ Fine-tuned BERT-Large on 5k examples. ○ Computed the corresponding embeddings of the remaining instances. ○ Used an ensemble of linear classifiers (trained on random subsets) to determine whether the embeddings are strong indicators of the correct answer. ○ Discarded instances whose embeddings are highly indicative of the target label.

  7.-13. AFLite (Adversarial Filtering Lite) - toy walkthrough over several slides. Eight instances with 3-bit embeddings (000-111) and binary labels. Linear classifiers are repeatedly trained on random subsets of the instances, and their predictions on the held-out instances are collected per instance. Each instance's predictability score is the fraction of held-out predictions that match its label (e.g., an instance with label 1 and collected predictions [1, 0] scores 0.5, while one with label 0 and predictions [0, 0] scores 1.0). Instances whose score exceeds the threshold (0.75 in the walkthrough) are discarded. Sakaguchi et al. 2020. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. In AAAI.
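  A minimal sketch of the filtering loop illustrated above, assuming precomputed instance embeddings and labels as NumPy arrays (e.g., embeddings from the fine-tuned BERT-Large mentioned on the cleaning slide); the subset sizes, number of rounds, and the 0.75 threshold are illustrative, and the full algorithm's iterative batch removal is omitted for brevity.

    # AFLite-style predictability filtering (one-shot sketch, not the exact PIQA settings).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def aflite_keep_mask(embeddings, labels, n_rounds=64, train_frac=0.5, threshold=0.75, seed=0):
        rng = np.random.default_rng(seed)
        n = len(labels)
        correct = np.zeros(n)   # held-out predictions that matched the label
        counted = np.zeros(n)   # how often each instance was held out
        for _ in range(n_rounds):
            train = rng.random(n) < train_frac                # random training subset
            held_out = ~train
            clf = LogisticRegression(max_iter=1000).fit(embeddings[train], labels[train])
            preds = clf.predict(embeddings[held_out])
            correct[held_out] += preds == labels[held_out]
            counted[held_out] += 1
        # Predictability score: fraction of held-out predictions that were right.
        score = np.divide(correct, counted, out=np.zeros(n), where=counted > 0)
        return score < threshold, score                        # keep instances that are not "too easy"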

  14. Examples

  15. Dataset - Statistics ● Number of QA pairs ○ Training: >16k ○ Development: ~2k ○ Testing: ~3k ● Average number of words ○ Goal: 7.8 ○ Correct solution: 21.3 ○ Incorrect solution: 21.3

  16. Dataset - Statistics ● Nearly identical distribution of correct and incorrect solution lengths. ● At least 85% overlap between the words used in correct and incorrect solutions.
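  A minimal sketch of how the solution-overlap statistic above could be computed, assuming per-pair overlap over lowercased word sets; the exact tokenization and aggregation used by the authors are assumptions here.

    # Word overlap between a correct and an incorrect solution (Jaccard over word sets).
    def solution_overlap(correct: str, incorrect: str) -> float:
        a, b = set(correct.lower().split()), set(incorrect.lower().split())
        return len(a & b) / max(len(a | b), 1)

    pairs = [("Squeeze the bottle and release it to create suction.",
              "Squeeze the bottle and keep pushing to create suction.")]
    print(sum(solution_overlap(c, i) for c, i in pairs) / len(pairs))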

  17. Experiments ● For each choice, the model is given the goal, the solution, and a [CLS] token. ● The hidden state corresponding to [CLS] is extracted. ● A linear transformation is applied to each hidden state, followed by a softmax over the two options. ● Trained with a cross-entropy loss. ● Examples are truncated at 150 tokens, which affects ~1% of the data. ● Human performance was computed by a majority vote on the development set.
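  A minimal sketch of the scoring setup described on this slide, assuming the HuggingFace transformers library; the checkpoint name, example text, and training details are illustrative, not the authors' exact code.

    # Score each (goal, solution) pair with the [CLS] hidden state and a shared linear head.
    import torch
    import torch.nn as nn
    from transformers import RobertaModel, RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
    encoder = RobertaModel.from_pretrained("roberta-large")
    score_head = nn.Linear(encoder.config.hidden_size, 1)    # shared across the two choices

    def choice_logits(goal, solutions, max_length=150):
        # Encode the goal with each candidate solution; truncate at 150 tokens as on the slide.
        enc = tokenizer([goal] * len(solutions), solutions, truncation=True,
                        max_length=max_length, padding=True, return_tensors="pt")
        cls = encoder(**enc).last_hidden_state[:, 0, :]       # leading <s> token acts as [CLS]
        return score_head(cls).squeeze(-1)                    # one scalar score per choice

    # Softmax over the two choices is implicit in the cross-entropy loss.
    logits = choice_logits("Separate an egg yolk from the white using a water bottle.",
                           ["Squeeze the bottle, press it against the yolk, and release it.",
                            "Press the bottle against the yolk and keep squeezing it."])
    loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))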

  18. Quantitative Analysis ● Two solution choices that differ only in a single edited phrase test the common sense understanding of that phrase. ● ~60% of the data involves an edit distance of 1-2 words between solutions. ● Dataset complexity generally increases with the edit distance between the solution pairs.
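  A minimal sketch of the word-level edit distance used to bucket solution pairs above; whitespace tokenization and lowercasing are assumptions, not the paper's exact procedure.

    # Levenshtein distance computed over words rather than characters.
    def word_edit_distance(sol_a: str, sol_b: str) -> int:
        a, b = sol_a.lower().split(), sol_b.lower().split()
        prev = list(range(len(b) + 1))
        for i, wa in enumerate(a, 1):
            curr = [i]
            for j, wb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                  # deletion
                                curr[j - 1] + 1,              # insertion
                                prev[j - 1] + (wa != wb)))    # substitution
            prev = curr
        return prev[-1]

    print(word_edit_distance("add a little olive oil", "add a little vinegar"))  # 2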

  19. Quantitative Analysis ● RoBERTa struggles to understand certain flexible relations. ○ ‘before’, ‘after’, ‘top’, and ‘bottom’ ● It performs worse than average on solution pairs differing in ‘water’, even after ~300 training examples. ● It performs much better on certain nouns, such as ‘spoon’.

  20. Quantitative Analysis ● ‘water’ is prevalent but highly versatile. ○ It is substituted with a variety of different household items. ● ‘spoon’ has fewer common replacements, which indicates that RoBERTa understands these simpler affordances.

  21. Qualitative Analysis ● RoBERTa distinguishes prototypical correct solutions from clearly ridiculous trick solutions. ● Struggles with subtle relations and non-prototypical situations.

  22. Critique ● The paper tries to advance a crucial ‘grounding’ problem. ○ A benchmark for testing the physical understanding of new models. ○ An evaluation of the physical common sense of SOTA models - unsurprisingly, these models do not perform very well. ● A good effort at creating an unbiased dataset. ○ No ‘annotate for a smart robot’ instruction to the workers. ○ Thorough cleaning of the dataset - agreement scores and AFLite. ● Reasonably good analysis of RoBERTa's performance on the dataset.

  23. Critique ● An intelligent model will perform well on this benchmark, but is the converse true? ○ What if we pre-train RoBERTa on text from ‘instructables.com’? ● Should we expect models trained only on text to have physical understanding? ○ How would a text-trained model know that squeezing and then releasing a bottle creates suction? ○ Should the focus have been on ‘grounded’ models, e.g. VQA models? ● Is the dataset easy because there are only two choices? ● The paper does not report a few important dataset statistics. ○ What is the distribution of words in incorrect solutions? Is it similar to that of the correct solutions? ○ How many examples were actually removed during cleaning? ● Is a majority vote a good indicator of human performance? ○ What is the average score of a single person? ○ Should the dataset include questions where the majority vote gets it wrong?

  24. Questions?
