
PIQA: Reasoning about Physical Commonsense in Natural Language



  1. PIQA: Reasoning about Physical Commonsense in Natural Language. Presented by Shailesh M Pandey. Bisk, Yonatan, et al. “PIQA: Reasoning about Physical Commonsense in Natural Language.” arXiv abs/1911.11641 (2019).

  2. Outline 1. Motivation 2. Dataset 2.1. Collection 2.2. Cleaning 2.3. Statistics 3. Experiments 3.1. Results 4. Analysis 4.1. Quantitative 4.2. Qualitative 5. Critique

  3. Motivation ● Modeling physical common sense knowledge is essential for true AI-completeness. ● Can AI systems learn to reliably answer physical common sense questions without experiencing the physical world? ○ Physical common sense properties are rarely reported directly in text. ● There has been no extensive evaluation of SOTA models on questions that require physical common sense knowledge.

  4. Dataset ● Task: given a question and two possible answers, choose the more appropriate answer. ● Question: states a post-condition (the goal). ● Answer: a procedure for accomplishing the goal (the solution).
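  A minimal sketch of what one instance looks like, assuming JSONL-style fields named goal, sol1, and sol2 with a separate 0/1 label for the correct solution; the field names and the example text are illustrative assumptions, not quoted from the paper or the released files.

    # One PIQA-style instance: a goal and two candidate solutions.
    # Field names (goal, sol1, sol2) and the label convention are assumptions.
    import json

    example = {
        "goal": "Separate an egg yolk from the white using a water bottle.",
        "sol1": "Squeeze the bottle, press it against the yolk, and release it "
                "to create suction that lifts the yolk.",
        "sol2": "Press the bottle against the yolk and keep squeezing it "
                "to create suction that lifts the yolk.",
    }
    label = 0  # index of the correct solution (sol1 here)

    line = json.dumps(example)          # one instance per JSONL line
    print(json.loads(line)["goal"], "->", example[f"sol{label + 1}"])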

  5. Dataset - Collection ● Qualification HIT for annotators ○ Annotators had to identify well-formed (goal, solution) pairs more than 80% of the time. ● Annotators were provided with a prompt derived from instructables.com ○ Prompts are drawn from six categories - costume, outside, craft, home, food, and workshop. ○ Prompts remind annotators of less prototypical uses of everyday objects. ● Annotators were asked to construct two component tasks ○ Articulate the goal and a solution. ○ Perturb the solution subtly to make it invalid.

  6. Dataset - Cleaning ● Removed examples with low agreement ○ Correct examples that require expert knowledge are also removed. ● Used AFLite to perform systematic data bias reduction ○ Fine-tuned BERT-Large on 5k examples. ○ Computed the corresponding embeddings of the remaining instances. ○ Used an ensemble of linear classifiers (trained on random subsets) to determine whether the embeddings are strong indicators of the correct answer. ○ Discarded instances whose embeddings are highly indicative of the target label.

  7.-13. AFLite (Adversarial Filtering Lite) - toy walkthrough over several slides. Eight instances with 3-bit embeddings (000-111) and binary labels. Linear classifiers are repeatedly trained on random subsets of the instances, and their predictions on the held-out instances are collected per instance. Each instance's predictability score is the fraction of held-out predictions that match its label (e.g., an instance with label 1 and collected predictions [1, 0] scores 0.5, while one with label 0 and predictions [0, 0] scores 1.0). Instances whose score exceeds the threshold (0.75 in the walkthrough) are discarded. Sakaguchi et al. 2020. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. In AAAI.
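  A minimal sketch of the filtering loop illustrated above, assuming precomputed instance embeddings and labels as NumPy arrays (e.g., embeddings from the fine-tuned BERT-Large mentioned on the cleaning slide); the subset sizes, number of rounds, and the 0.75 threshold are illustrative, and the full algorithm's iterative batch removal is omitted for brevity.

    # AFLite-style predictability filtering (one-shot sketch, not the exact PIQA settings).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def aflite_keep_mask(embeddings, labels, n_rounds=64, train_frac=0.5, threshold=0.75, seed=0):
        rng = np.random.default_rng(seed)
        n = len(labels)
        correct = np.zeros(n)   # held-out predictions that matched the label
        counted = np.zeros(n)   # how often each instance was held out
        for _ in range(n_rounds):
            train = rng.random(n) < train_frac                # random training subset
            held_out = ~train
            clf = LogisticRegression(max_iter=1000).fit(embeddings[train], labels[train])
            preds = clf.predict(embeddings[held_out])
            correct[held_out] += preds == labels[held_out]
            counted[held_out] += 1
        # Predictability score: fraction of held-out predictions that were right.
        score = np.divide(correct, counted, out=np.zeros(n), where=counted > 0)
        return score < threshold, score                        # keep instances that are not "too easy"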

  14. Examples

  15. Dataset - Statistics ● Number of QA pairs ○ Training: >16k ○ Development: ~2k ○ Testing: ~3k ● Average number of words ○ Goal: 7.8 ○ Correct solution: 21.3 ○ Incorrect solution: 21.3

  16. Dataset - Statistics ● Nearly identical distribution of correct and incorrect solution lengths. ● At least 85% overlap between the words used in correct and incorrect solutions.
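  A minimal sketch of how the solution-overlap statistic above could be computed, assuming per-pair overlap over lowercased word sets; the exact tokenization and aggregation used by the authors are assumptions here.

    # Word overlap between a correct and an incorrect solution (Jaccard over word sets).
    def solution_overlap(correct: str, incorrect: str) -> float:
        a, b = set(correct.lower().split()), set(incorrect.lower().split())
        return len(a & b) / max(len(a | b), 1)

    pairs = [("Squeeze the bottle and release it to create suction.",
              "Squeeze the bottle and keep pushing to create suction.")]
    print(sum(solution_overlap(c, i) for c, i in pairs) / len(pairs))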

  17. Experiments ● For each choice, the model is given the goal, the solution, and a [CLS] token. ● The hidden state corresponding to [CLS] is extracted. ● A linear transformation is applied to each hidden state, followed by a softmax over the two options. ● Trained with a cross-entropy loss. ● Examples are truncated at 150 tokens, which affects ~1% of the data. ● Human performance was computed by a majority vote on the development set.
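  A minimal sketch of the scoring setup described on this slide, assuming the HuggingFace transformers library; the checkpoint name, example text, and training details are illustrative, not the authors' exact code.

    # Score each (goal, solution) pair with the [CLS] hidden state and a shared linear head.
    import torch
    import torch.nn as nn
    from transformers import RobertaModel, RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
    encoder = RobertaModel.from_pretrained("roberta-large")
    score_head = nn.Linear(encoder.config.hidden_size, 1)    # shared across the two choices

    def choice_logits(goal, solutions, max_length=150):
        # Encode the goal with each candidate solution; truncate at 150 tokens as on the slide.
        enc = tokenizer([goal] * len(solutions), solutions, truncation=True,
                        max_length=max_length, padding=True, return_tensors="pt")
        cls = encoder(**enc).last_hidden_state[:, 0, :]       # leading <s> token acts as [CLS]
        return score_head(cls).squeeze(-1)                    # one scalar score per choice

    # Softmax over the two choices is implicit in the cross-entropy loss.
    logits = choice_logits("Separate an egg yolk from the white using a water bottle.",
                           ["Squeeze the bottle, press it against the yolk, and release it.",
                            "Press the bottle against the yolk and keep squeezing it."])
    loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))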

  18. Quantitative Analysis ● Two solution choices that differ only in a single edited phrase test the common sense understanding of that phrase. ● ~60% of the data involves an edit distance of 1-2 words between solutions. ● Dataset complexity generally increases with the edit distance between the solution pairs.
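  A minimal sketch of the word-level edit distance used to bucket solution pairs above; whitespace tokenization and lowercasing are assumptions, not the paper's exact procedure.

    # Levenshtein distance computed over words rather than characters.
    def word_edit_distance(sol_a: str, sol_b: str) -> int:
        a, b = sol_a.lower().split(), sol_b.lower().split()
        prev = list(range(len(b) + 1))
        for i, wa in enumerate(a, 1):
            curr = [i]
            for j, wb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                  # deletion
                                curr[j - 1] + 1,              # insertion
                                prev[j - 1] + (wa != wb)))    # substitution
            prev = curr
        return prev[-1]

    print(word_edit_distance("add a little olive oil", "add a little vinegar"))  # 2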

  19. Quantitative Analysis ● RoBERTa struggles to understand certain flexible relations. ○ ‘before’, ‘after’, ‘top’, and ‘bottom’ ● It performs worse than average on solution pairs differing in ‘water’, even after ~300 training examples. ● It performs much better on certain nouns, such as ‘spoon’.

  20. Quantitative Analysis ● ‘water’ is prevalent but highly versatile. ○ It is substituted with a variety of different household items. ● ‘spoon’ has fewer common replacements, which indicates that RoBERTa understands these simpler affordances.

  21. Qualitative Analysis ● RoBERTa distinguishes prototypical correct solutions from clearly ridiculous trick solutions. ● Struggles with subtle relations and non-prototypical situations.

  22. Critique ● The paper tries to advance a crucial ‘grounding’ problem. ○ A benchmark for testing the physical understanding of new models. ○ An evaluation of the physical common sense of SOTA models - unsurprisingly, these models do not perform very well. ● A good effort at creating an unbiased dataset. ○ No ‘annotate for a smart robot’ instruction to the workers. ○ Thorough cleaning of the dataset - agreement scores and AFLite. ● Reasonably good analysis of RoBERTa's performance on the dataset.

  23. Critique ● An intelligent model will perform well on this benchmark, but is the converse true? ○ What if we pre-train RoBERTa on text from ‘instructables.com’? ● Should we expect models trained only on text to have physical understanding? ○ How would a text-trained model know that squeezing and then releasing a bottle creates suction? ○ Should the focus have been on ‘grounded’ models, e.g. VQA models? ● Is the dataset easy because there are only two choices? ● The paper does not report a few important dataset statistics. ○ What is the distribution of words in incorrect solutions? Is it similar to that of the correct solutions? ○ How many examples were actually removed during cleaning? ● Is a majority vote a good indicator of human performance? ○ What is the average score of a single person? ○ Should the dataset include questions where the majority vote gets it wrong?

  24. Questions?
