Commonsense benchmarks
Or how to measure that your model is actually doing some commonsense reasoning
How do you know that a model is doing commonsense reasoning?
https://leaderboard.allenai.org/
Abductive reasoning
Visual commonsense reasoning
Alex spilt food all over the floor and it made a huge mess.
What will Alex want to do next?
run around in the mess (less likely)
mop up the mess (more likely)
[Figure: inferences for the event "PersonX spills ___ all over the floor", grouped into causes, effects, and stative.
Relations: X wanted to, X needed to, X is seen as, has effect on X, X will want, X will feel.
Example inferences: no intent, drink too much, clumsy, careless, embarrassed, upset, clean it up, get a broom, slip on the spill, fall over, gets dirty.]
            Small scale          Large scale
Creation    Expert-curated       Crowdsourced/automatic
Coverage    Limited coverage     Large coverage
Training    Dev/test only        Training/dev/test
Budget      Expert time costs    Crowdsourcing costs
Winograd Schema Challenge (WSC)
273 examples
The city councilmen refused the demonstrators a permit because they advocated violence. Who is “they”? (a) The city councilmen (b) The demonstrators
The city councilmen refused the demonstrators a permit because they feared violence. Who is “they”? (a) The city councilmen (b) The demonstrators
Choice of Plausible Alternatives (COPA)
500 dev, 500 test
I hung up the phone. What was the cause of this? (a) The caller said goodbye to me. (b) The caller identified himself to me.
The toddler became cranky. What happened as a result? (a) Her mother put her down for a nap. (b) Her mother fixed her hair into pigtails.
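Benchmarks like these are usually scored as multiple-choice accuracy. The sketch below is only an illustration, not an official evaluation script: `ChoiceExample`, `accuracy`, `chance_accuracy`, and `longest_answer_score` are hypothetical names, and the gold labels are filled in for illustration because the slide does not mark them. It scores each candidate answer, picks the highest-scoring one, and compares the result with the chance level.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChoiceExample:
    premise: str
    question: str
    choices: List[str]
    label: int  # index of the gold choice (assumed here, not marked on the slide)


def accuracy(examples: List[ChoiceExample],
             score: Callable[[ChoiceExample, str], float]) -> float:
    """Fraction of examples where the top-scoring choice is the gold one."""
    correct = 0
    for ex in examples:
        scores = [score(ex, choice) for choice in ex.choices]
        prediction = scores.index(max(scores))
        correct += int(prediction == ex.label)
    return correct / len(examples)


def chance_accuracy(examples: List[ChoiceExample]) -> float:
    """Expected accuracy of guessing uniformly at random."""
    return sum(1.0 / len(ex.choices) for ex in examples) / len(examples)


def longest_answer_score(ex: ChoiceExample, choice: str) -> float:
    """Hypothetical stand-in for a model: prefers the longer answer."""
    return float(len(choice))


copa = [
    ChoiceExample(
        premise="I hung up the phone.",
        question="What was the cause of this?",
        choices=["The caller said goodbye to me.",
                 "The caller identified himself to me."],
        label=0,
    ),
    ChoiceExample(
        premise="The toddler became cranky.",
        question="What happened as a result?",
        choices=["Her mother put her down for a nap.",
                 "Her mother fixed her hair into pigtails."],
        label=0,
    ),
]

baseline = accuracy(copa, longest_answer_score)
chance = chance_accuracy(copa)
```

On these two items the length heuristic picks the longer option both times and gets both wrong. In practice `score` would be replaced by a real model's plausibility score for each answer.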
Goal: negative answers have to be plausible but unlikely
Context and Question (WHAT HAPPENS NEXT): Alex spilt food all over the floor and it made a huge mess. What will Alex want to do next?
Free Text Response (handwritten ✔ and ✘ answers):
✔ mop up
✔ give up and order take out
✘ leave the mess
✘ run around in the mess
Problem: handwritten unlikely answers are too easy to detect
incorrect answers
2018, Zellers et al. 2018, etc.
large pretrained LMs (BERT, GPT, etc.)
Social IQa, CommonsenseQA: Modified answer collection
HellaSwag & AF-lite: Adversarial filtering of artifacts
Original Question (WHAT HAPPENS NEXT)
Alex spilt food all over the floor and it made a huge mess. What will Alex want to do next?
✔ mop up
✔ give up and order take out
WHAT HAPPENED BEFORE
What did Alex need to do before this?
✔ have slippery hands
✔ get ready to eat
Question-Switching Answer (correct for the other question, reused as incorrect for the original question)
✘ have slippery hands
✘ get ready to eat
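The question-switching idea can be sketched in a few lines. This is a minimal illustration under assumptions, not the dataset's actual pipeline: `question_switching_negatives` and `build_item` are hypothetical helpers, and the input is assumed to be a mapping from each question about one context to its handwritten correct answers.

```python
import random
from typing import Dict, List, Tuple


def question_switching_negatives(
    answers_by_question: Dict[str, List[str]], target: str
) -> List[str]:
    """Collect correct answers to the other questions about the same context."""
    correct = set(answers_by_question[target])
    negatives: List[str] = []
    for question, answers in answers_by_question.items():
        if question == target:
            continue
        # Skip anything that also happens to be correct for the target question.
        negatives.extend(a for a in answers if a not in correct)
    return negatives


def build_item(
    answers_by_question: Dict[str, List[str]], target: str, rng: random.Random
) -> Tuple[List[str], int]:
    """Return (shuffled choices, index of the correct choice) for `target`."""
    gold = rng.choice(answers_by_question[target])
    choices = [gold] + question_switching_negatives(answers_by_question, target)
    rng.shuffle(choices)
    return choices, choices.index(gold)


# Context: "Alex spilt food all over the floor and it made a huge mess."
alex = {
    "What will Alex want to do next?": ["mop up", "give up and order take out"],
    "What did Alex need to do before this?": ["have slippery hands", "get ready to eat"],
}
target_question = "What will Alex want to do next?"
negatives = question_switching_negatives(alex, target_question)
choices, label = build_item(alex, target_question, random.Random(0))
```

Because the negatives were written as correct answers to another question, they share the style of correct answers, which is the point of the technique.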
[Figure: effect size when comparing to correct answers, for Arousal, Dominance, and Valence. Series: Handwritten Incorrect, Question Switching. Axis from 0.00 to 0.45, annotated "more stylistically different from correct" and "more stylistically similar".]
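The slide does not say which effect size statistic is plotted. As a hedged illustration, the sketch below assumes Cohen's d with a pooled standard deviation. The `cohens_d` helper is hypothetical and the scores are invented toy numbers, not data from the talk.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Absolute Cohen's d between two samples, using the pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("both samples are constant; effect size is undefined")
    return abs(mean(a) - mean(b)) / sqrt(pooled_var)


# Invented per-answer scores on one stylistic dimension (e.g. valence).
correct_answers = [0.62, 0.58, 0.65, 0.60, 0.61]
handwritten_incorrect = [0.41, 0.48, 0.39, 0.45, 0.44]
question_switching = [0.60, 0.57, 0.66, 0.59, 0.63]

d_handwritten = cohens_d(correct_answers, handwritten_incorrect)
d_switching = cohens_d(correct_answers, question_switching)
```

A smaller value means the incorrect answers are stylistically closer to the correct ones on that dimension.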
Talmor et al. 2019
Goal: remove examples with exploitable artifacts or spurious correlations
easiest to predict by a linear classifier (e.g., logistic)
HellaSwag (Zellers et al., 2019), AF-lite (Le Bras et al., 2019)
[Figure: unfiltered examples separated into “easy” examples and robust examples]
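A simplified sketch of this kind of filtering loop follows. It is inspired by the description above and is not the HellaSwag or AF-lite implementation. The assumptions are binary labels, precomputed feature vectors, a small hand-rolled logistic regression, and hypothetical parameters `n_splits`, `test_frac`, `k`, and `tau`. Each round trains the linear classifier on random splits, scores every example by how often it is predicted correctly when held out, and drops the most predictable ones.

```python
import math
import random
from typing import List, Sequence, Tuple

Example = Tuple[List[float], int]  # (feature vector, binary label)


def train_logistic(data: Sequence[Example], epochs: int = 100, lr: float = 0.5):
    """Fit logistic regression with plain batch gradient descent."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            z = b + sum(wj * xj for wj, xj in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))
            for j, xj in enumerate(x):
                grad_w[j] += (p - y) * xj
            grad_b += p - y
        w = [wj - lr * g / len(data) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def predict(model, x: List[float]) -> int:
    w, b = model
    return int(b + sum(wj * xj for wj, xj in zip(w, x)) > 0.0)


def adversarial_filter(data: Sequence[Example], rng: random.Random,
                       n_splits: int = 40, test_frac: float = 0.3,
                       k: int = 10, tau: float = 0.75,
                       max_rounds: int = 10) -> List[Example]:
    """Iteratively drop the examples a linear classifier predicts most easily."""
    kept = list(data)
    for _round in range(max_rounds):
        if len(kept) < 4:
            break
        hits, trials = [0] * len(kept), [0] * len(kept)
        n_test = max(1, int(test_frac * len(kept)))
        for _split in range(n_splits):
            idx = list(range(len(kept)))
            rng.shuffle(idx)
            test_idx, train_idx = idx[:n_test], idx[n_test:]
            model = train_logistic([kept[i] for i in train_idx])
            for i in test_idx:
                x, y = kept[i]
                trials[i] += 1
                hits[i] += int(predict(model, x) == y)
        # Predictability: how often an example is right when it is held out.
        score = [h / t if t else 0.0 for h, t in zip(hits, trials)]
        easy = sorted((i for i in range(len(kept)) if score[i] >= tau),
                      key=lambda i: -score[i])[:k]
        drop = set(easy)
        kept = [ex for i, ex in enumerate(kept) if i not in drop]
        if len(easy) < k:  # few predictable examples left: stop filtering
            break
    return kept


# Toy data: the single feature leaks the label in the first 20 examples
# (an artifact) and carries no information in the last 20.
leaky = [([1.0 if y else -1.0], y) for y in [0, 1] * 10]
uninformative = [([0.0], y) for y in [0, 1] * 10]
robust = adversarial_filter(leaky + uninformative, random.Random(0))
```

In this toy setup the examples whose feature gives away the label are the ones the filter is designed to remove. With real data the features would come from a pretrained model rather than being hand-built.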
Performance of models on the WikiHow portion of HellaSwag (Zellers et al., 2019) with different AF settings and different training models
Adversarial filtering removes examples with spurious correlations => Task becomes harder
[Figure: accuracy (axis from 0% to 90%) for Humans, BERT-large, BERT-base, GPT, and Random]
>20% gap to improve on
Although Aubrey was older and stronger, they lost to Alex in arm wrestling.
How would Alex feel as a result? (a) ashamed (b) boastful (c) they need to practice more
Note: how Aubrey would feel, not Alex
Need more robust, person-centric reasoning
Remy gave Skylar, the concierge, her account so that she could check into the hotel.
What will Remy want to do next? (a) lose her credit card (b) arrive at a hotel (c) get the key from Skylar
Note: what Remy did before
Need better notion of causes vs. effects
Benchmarks: Social IQa, Physical IQa, Abductive NLI, HellaSwag, VCR, WinoGrande, CommonsenseQA, JHU Ordinal Commonsense, MCTaco, SWAG, Naïve Psychology, COPA, WSC, ROC story, MultiRC, CosmosQA, ReCORD
Categories: Social commonsense, Physical commonsense, Temporal commonsense, Commonsense reading comprehension
Thanks! Questions?