PIQA: Reasoning about Physical Commonsense in Natural Language
Shailesh M Pandey
Bisk, Yonatan et al. “PIQA: Reasoning about Physical Commonsense in Natural Language.” ArXiv abs/1911.11641 (2019)
Outline
1. Motivation
2. Dataset
   2.1. Collection
   2.2. Cleaning
   2.3. Statistics
3. Experiments
   3.1. Results
4. Analysis
   4.1. Quantitative
   4.2. Qualitative
5. Critique
1. Motivation
○ Physical common sense knowledge is essential to true AI-completeness.
○ Can models learn to answer physical common sense questions without experiencing the physical world?
○ Physical common sense properties are rarely reported directly in text.
○ Most existing benchmarks target abstract or social, rather than physical, common sense knowledge.
○ The task: given a goal and two candidate solutions, choose the appropriate answer.
2. Dataset
2.1. Collection
○ Qualification task: workers had to identify well-formed (goal, solution) pairs more than 80% of the time.
○ Goal prompts are drawn from instructables.com.
○ Drawn from six categories (e.g., costume).
○ This reminds annotators of less prototypical uses of everyday objects.
○ Annotation involves two component tasks:
   ○ Articulate the goal and its solution.
   ○ Perturb the solution subtly to make it invalid.
2.2. Cleaning
○ Correct examples that require expert knowledge are removed.
○ AFLite (adversarial filtering; sketched in code after the figure below):
   ○ Used 5k examples to fine-tune BERT-Large.
   ○ Computed the corresponding embeddings of the remaining instances.
   ○ Used an ensemble of linear classifiers (trained on random subsets) to determine whether the embeddings are strong indicators of the correct answer.
   ○ Discarded instances whose embeddings are highly indicative of the target label.
[Figure: step-by-step walkthrough of AFLite on toy binary "embeddings": linear classifiers trained on random subsets predict the held-out instances, each instance is scored by the fraction of times it is predicted correctly, and instances scoring above the threshold (0.75) are discarded.]
Sakaguchi, K., et al. 2020. WinoGrande: An adversarial Winograd schema challenge at scale. In AAAI.
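Sketched below is a minimal, single-pass version of this AFLite-style filtering loop, assuming `embeddings` (one vector per instance from the fine-tuned BERT) and binary `labels` are already computed; the names and hyperparameters are illustrative, and the published algorithm filters iteratively rather than in one pass.

```python
# Minimal single-pass sketch of AFLite-style filtering (Sakaguchi et al. 2020).
# Assumes `embeddings` (n x d, from the fine-tuned BERT) and binary `labels`.
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite_keep_mask(embeddings, labels, n_classifiers=64,
                     train_frac=0.5, threshold=0.75, seed=0):
    rng = np.random.default_rng(seed)
    n = len(labels)
    correct = np.zeros(n)              # times predicted correctly when held out
    counted = np.zeros(n)              # times held out
    for _ in range(n_classifiers):
        train = rng.random(n) < train_frac           # random training subset
        clf = LogisticRegression(max_iter=1000)
        clf.fit(embeddings[train], labels[train])
        held = ~train
        correct[held] += clf.predict(embeddings[held]) == labels[held]
        counted[held] += 1
    score = np.divide(correct, counted,
                      out=np.zeros(n), where=counted > 0)  # predictability
    return score <= threshold          # keep only hard-to-predict instances

# usage: dataset = [ex for ex, k in zip(dataset, aflite_keep_mask(E, y)) if k]
```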
2.3. Statistics
○ Split sizes: training > 16k, development ~2k, testing ~3k (see the snippet below).
○ Average length in words: goal 7.8, correct solution 21.3, incorrect solution 21.3 (matched, so length is not a cue).
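For reference, the released splits can be inspected directly; this assumes the `datasets` library and the public `piqa` dataset card on the Hugging Face Hub (where the test labels are hidden):

```python
# Inspect PIQA's splits and fields (goal, sol1, sol2, label).
from datasets import load_dataset

piqa = load_dataset("piqa")        # splits: train / validation / test
for split, ds in piqa.items():
    print(split, len(ds))          # roughly 16k / 2k / 3k

ex = piqa["train"][0]
print("goal:", ex["goal"])
print("sol1:", ex["sol1"])
print("sol2:", ex["sol2"], "| gold:", ex["label"])
```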
3. Experiments
○ Each input concatenates the goal, a candidate solution, and the [CLS] token.
○ A linear layer over the final [CLS] hidden state produces a logit per candidate, followed by a softmax over the two solutions (see the sketch below).
○ Truncation to the maximum sequence length affects ~1% of the data.
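A minimal sketch of this scoring setup with the `transformers` library is shown below; it is illustrative rather than the authors' exact code (in the paper the whole encoder is fine-tuned, while the linear head here is freshly initialized):

```python
# Encode "[CLS] goal [SEP] solution [SEP]" for each candidate, project the
# final [CLS] hidden state to one logit per candidate, softmax over the two.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-large-uncased")
enc = AutoModel.from_pretrained("bert-large-uncased")
head = torch.nn.Linear(enc.config.hidden_size, 1)   # one logit per candidate

@torch.no_grad()
def choose(goal: str, sol1: str, sol2: str) -> torch.Tensor:
    batch = tok([goal, goal], [sol1, sol2],
                return_tensors="pt", padding=True, truncation=True)
    cls = enc(**batch).last_hidden_state[:, 0]      # (2, hidden) [CLS] states
    logits = head(cls).squeeze(-1)                  # (2,)
    return torch.softmax(logits, dim=-1)            # P(sol1), P(sol2)
```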
3.1. Results
○ Human performance is estimated via majority vote on the development set.
4. Analysis
4.1. Quantitative
○ Solution pairs differing by only a single phrase must test the common sense understanding of that phrase.
○ Computed the edit distance between solutions.
○ Compared model accuracy with the edit distance between the solution pairs (see the sketch below).
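A sketch of how this analysis could be reproduced; `val_set` (examples with sol1/sol2/label fields) and `predictions` (the model's chosen label per example) are placeholders:

```python
# Bucket validation pairs by word-level edit distance between the two
# solutions and report per-bucket accuracy.
from collections import defaultdict

def edit_distance(a, b):
    # standard Levenshtein DP with a rolling row
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            cur = min(dp[j] + 1,           # deletion
                      dp[j - 1] + 1,       # insertion
                      prev + (wa != wb))   # substitution (free if equal)
            prev, dp[j] = dp[j], cur
    return dp[-1]

buckets = defaultdict(lambda: [0, 0])      # distance -> [correct, total]
for ex, pred in zip(val_set, predictions):
    d = edit_distance(ex["sol1"].split(), ex["sol2"].split())
    bucket = buckets[min(d, 10)]           # clip the long tail at 10+
    bucket[0] += int(pred == ex["label"])
    bucket[1] += 1

for d in sorted(buckets):
    correct, total = buckets[d]
    print(f"edit distance {d:>2}: acc {correct / total:.3f} (n={total})")
```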
4.2. Qualitative
○ RoBERTa struggles with flexible relations:
   ○ e.g., 'before', 'after', 'top', and 'bottom'.
○ It still confuses solution pairs differing in 'water', even after ~300 training examples.
○ Probed simple objects such as 'spoon' (approximated in the sketch below):
   ○ Substituted with a variety of different household items.
   ○ RoBERTa prefers the plausible substitutions, which indicates it understands these simple affordances.
○ It can still be fooled by trick solutions.
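This probe can be approximated without the fine-tuned model. The sketch below uses masked-LM scores from off-the-shelf RoBERTa as a stand-in for the paper's fine-tuned scorer; the template, item list, and single-token assumption are all illustrative:

```python
# Score household items in a fixed solution slot with RoBERTa's MLM head.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

@torch.no_grad()
def slot_logprob(template: str, item: str) -> float:
    # log-probability of `item` in the masked slot (assumes single-token item)
    batch = tok(template.format(tok.mask_token), return_tensors="pt")
    pos = (batch.input_ids[0] == tok.mask_token_id).nonzero().item()
    logits = mlm(**batch).logits[0, pos]
    item_id = tok(" " + item, add_special_tokens=False).input_ids[0]
    return torch.log_softmax(logits, dim=-1)[item_id].item()

template = "Scoop the flour into the bowl with a {}."
for item in ["spoon", "cup", "fork", "sock", "pillow"]:
    print(f"{item:>7}: {slot_logprob(template, item):.2f}")
```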
5. Critique
○ A benchmark for testing the physical understanding of new models.
○ An evaluation of the physical common sense of SOTA models; unsurprisingly, these models don't perform very well.
○ No 'annotate for a smart robot' style instruction to the workers.
○ Careful cleaning of the dataset: agreement scores and AFLite.
○ … is the converse true?
○ What if we pre-train RoBERTa on text from ‘instructables.com’?
○ How would a text-trained model know that squeezing and then releasing a bottle creates suction?
○ Should the focus have been on 'grounded' models instead, e.g. VQA models?
○ What is the distribution of words in incorrect solutions? Is it similar to that of the correct solutions?
○ How many examples were actually removed during cleaning?
○ What is the average score of a single person?
○ Should the dataset include questions where the majority vote gets it wrong?