slide-1
SLIDE 1

Commonsense benchmarks

Or how to measure that your model is actually doing some commonsense reasoning

slide-5
SLIDE 5

How do you know that a model is doing commonsense reasoning?

Unsupervised:

  • Observe behavior,
  • Probe representations,
  • etc.

Benchmarks: knowledge-specific tests (with or without training data)

QA format: easy to evaluate (e.g., accuracy)
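
As a minimal sketch of why the QA format is easy to evaluate: with multiple-choice items, scoring reduces to comparing predicted answer indices against gold indices. The names `accuracy` and `random_baseline` and the toy labels below are illustrative assumptions, not any benchmark's official scorer.

```python
def accuracy(predictions, gold):
    """Fraction of questions whose predicted answer index matches the gold index."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty set of questions")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def random_baseline(num_choices):
    """Expected accuracy of uniform guessing over num_choices answer options."""
    return 1.0 / num_choices


# Hypothetical model outputs for four 3-way multiple-choice questions.
gold = [1, 0, 2, 1]
predictions = [1, 0, 1, 1]
print(accuracy(predictions, gold))  # 0.75
print(random_baseline(3))
```

Reporting the random baseline next to model accuracy makes it clear how much of a score is above chance.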

slide-8
SLIDE 8

Step 1: Determine type of reasoning

https://leaderboard.allenai.org/

  • Abductive reasoning
  • Visual commonsense reasoning

slide-12
SLIDE 12

Reasoning about Social Situations

Alex spilt food all over the floor and it made a huge mess.

What will Alex want to do next?

  • run around in the mess (less likely)
  • mop up the mess (more likely)

slide-13
SLIDE 13

[Figure: ATOMIC inferences for the event “PersonX spills ___ all over the floor”, grouped into causes, effects, and stative]

Inference types: X wanted to, X needed to, X is seen as, has effect on X, X will want, X will feel

Example inferences: no intent, drink too much, clumsy, careless, embarrassed, upset, clean it up, get a broom, slip on the spill, fall over, gets dirty

Knowledge tested in SOCIAL IQA: ATOMIC

slide-15
SLIDE 15

Step 2: Choosing a benchmark size

| | Small scale | Large scale |
|---|---|---|
| Creation | Expert-curated | Crowdsourced/automatic |
| Coverage | Limited coverage | Large coverage |
| Training | Dev/test only | Training/dev/test |
| Budget | Expert time costs | Crowdsourcing costs |

Winograd Schema Challenge (WSC), Choice of Plausible Alternatives (COPA)

slide-16
SLIDE 16

Small commonsense benchmarks

The city councilmen refused the demonstrators a permit because they advocated violence.
Who is “they”? (a) The city councilmen (b) The demonstrators

The city councilmen refused the demonstrators a permit because they feared violence.
Who is “they”? (a) The city councilmen (b) The demonstrators

Winograd Schema Challenge (WSC): 273 examples

Choice of Plausible Alternatives (COPA): 500 dev, 500 test

slide-17
SLIDE 17

Small commonsense benchmarks

I hung up the phone. What was the cause of this?
(a) The caller said goodbye to me. (b) The caller identified himself to me.

The toddler became cranky. What happened as a result?
(a) Her mother put her down for a nap. (b) Her mother fixed her hair into pigtails.

Winograd Schema Challenge (WSC): 273 examples

Choice of Plausible Alternatives (COPA): 500 dev, 500 test

slide-18
SLIDE 18

Step 2: Choosing a QA benchmark size

| | Small scale | Large scale |
|---|---|---|
| Creation | Expert-curated | Crowdsourced/automatic |
| Coverage | Limited coverage | Large coverage |
| Training | Dev/test only | Training/dev/test |
| Budget | Expert time costs | Crowdsourcing costs |

Challenge: how to collect positive/negative answers?

slide-22
SLIDE 22

Challenge of collecting unlikely answers

Goal: negative answers have to be plausible but unlikely

  • Automatic matching?
  • Random negative sampling won’t work: the sampled answers are too topically different
  • “Smart” negative sampling isn’t effective either
  • Need a better solution… maybe we can ask crowd workers?
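
To make the random-sampling bullet concrete, here is a rough sketch of uniform negative sampling, assuming negatives are drawn from the correct answers of other questions. The function name `sample_random_negatives` and the second and third questions in the toy pool are made up for illustration; only the Alex question comes from the slides.

```python
import random


def sample_random_negatives(answers_by_question, question, k, seed=0):
    """Draw k negatives uniformly from the correct answers of *other* questions."""
    rng = random.Random(seed)
    pool = [
        answer
        for other, answers in answers_by_question.items()
        if other != question
        for answer in answers
    ]
    return rng.sample(pool, k)


# Hypothetical mini-pool of questions with their correct answers.
answers_by_question = {
    "What will Alex want to do next?": ["mop up the mess"],
    "Why did Sam buy a ticket?": ["to see the concert"],
    "How would Kai feel after the exam?": ["relieved"],
}
negatives = sample_random_negatives(
    answers_by_question, "What will Alex want to do next?", k=2
)
print(negatives)
```

Negatives drawn this way share no topic with the spilled-food context, so a model can reject them without any commonsense reasoning.
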
slide-26
SLIDE 26

Collecting answers from crowdworkers

Context and Question:

Alex spilt food all over the floor and it made a huge mess. What will Alex want to do next? [WHAT HAPPENS NEXT]

Free Text Response (handwritten ✔ and ✘ answers):

  ✔ mop up
  ✔ give up and order take out
  ✘ leave the mess
  ✘ run around in the mess

Problem: handwritten unlikely answers are too easy to detect

slide-29
SLIDE 29

Problem: annotation artifacts

  • Models can exploit artifacts in handwritten incorrect answers
  • Exaggerations, off-topic, overly emotional, etc.
  • See Schwartz et al. 2017, Gururangan et al. 2018, Zellers et al. 2018, etc.
  • Seemingly “super-human” performance by large pretrained LMs (BERT, GPT, etc.)

slide-33
SLIDE 33

How to make unlikely answers robust to annotation artifacts?

SOCIAL IQA, COMMONSENSEQA: Modified answer collection

HellaSwag & AF-lite: Adversarial filtering of artifacts

slide-37
SLIDE 37

Question-Switching Answers (SOCIAL IQA)

Context: Alex spilt food all over the floor and it made a huge mess.

Original Question: What will Alex want to do next? [WHAT HAPPENS NEXT]

  ✔ mop up
  ✔ give up and order take out
  ✘ have slippery hands
  ✘ get ready to eat

Question-Switching Answer: What did Alex need to do before this? [WHAT HAPPENED BEFORE]

  ✔ have slippery hands
  ✔ get ready to eat

The correct answers to the switched question serve as the incorrect answers to the original question.
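
The question-switching idea can be sketched as follows, assuming each context comes with several questions and their crowdsourced correct answers. The `QAItem` class and the `question_switching` function are hypothetical names for illustration, not the SOCIAL IQA collection pipeline itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QAItem:
    context: str
    question: str
    correct: List[str]
    incorrect: List[str] = field(default_factory=list)


def question_switching(context, answers_by_question):
    """For each question, reuse the correct answers of the other questions
    about the same context as its incorrect answers."""
    items = []
    for question, correct in answers_by_question.items():
        switched = [
            answer
            for other, answers in answers_by_question.items()
            if other != question
            for answer in answers
            if answer not in correct
        ]
        items.append(QAItem(context, question, list(correct), switched))
    return items


items = question_switching(
    "Alex spilt food all over the floor and it made a huge mess.",
    {
        "What will Alex want to do next?": ["mop up", "give up and order take out"],
        "What did Alex need to do before this?": ["have slippery hands", "get ready to eat"],
    },
)
for item in items:
    print(item.question, "->", item.incorrect)
```

Because the switched answers were written as correct answers about the same context, they stay on topic and carry the same writing style as the true answers.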

slide-40
SLIDE 40

Comparing incorrect/correct answers’ styles

[Figure: bar chart of effect size (0.00 to 0.45) when comparing to correct answers, along arousal, dominance, and valence, for handwritten incorrect answers and question-switching answers. Larger effect sizes are more stylistically different from correct answers; smaller ones are more stylistically similar.]

Question-switching answers are more stylistically similar to correct answers
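
The slide does not say which effect-size statistic is plotted. As one possible reading, the sketch below assumes Cohen's d with a pooled standard deviation; the function name `cohens_d` and the toy numbers are illustrative only and are not taken from the talk.

```python
import math
import statistics


def cohens_d(sample_a, sample_b):
    """Cohen's d: difference of means divided by the pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    if pooled_sd == 0:
        raise ValueError("effect size is undefined when both samples have zero variance")
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_sd


# Toy scores, not real lexicon values.
group_a = [2, 4, 6]
group_b = [1, 3, 5]
print(cohens_d(group_a, group_b))  # 0.5
```

Under this assumption, the same function would be applied to per-answer arousal, dominance, or valence scores of an incorrect-answer set versus the correct answers; a value closer to zero means the two sets are harder to tell apart by style.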

slide-41
SLIDE 41

COMMONSENSEQA: pivot on knowledge graphs

Talmor et al. 2019

slide-51
SLIDE 51

Adversarial Filtering (lite)

Goal: remove examples with exploitable artifacts or spurious correlations

  • Use pre-trained representations
  • Iteratively remove data that’s easiest to predict by a linear classifier (e.g., logistic regression)
  • Robust examples remain

HellaSwag (Zellers et al., 2019), AF-lite (Le Bras et al., 2019)

[Figure: unfiltered examples are split into “easy” examples, which are removed, and robust examples, which remain]
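
The loop described in the bullets can be sketched roughly as follows. This is a simplified illustration in the spirit of AF-lite, not the authors' implementation: a small pure-Python logistic regression stands in for the linear classifier, the feature vectors stand in for pre-trained representations, and the parameter names (`rounds`, `tau`, `cutoff`, `min_size`) and their defaults are assumptions.

```python
import math
import random


def train_logistic(features, labels, epochs=200, lr=0.5):
    """Fit logistic regression with batch gradient descent; returns (weights, bias)."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
            error = 1.0 / (1.0 + math.exp(-z)) - y
            for j in range(dim):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(model, x):
    weights, bias = model
    return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


def af_lite(features, labels, rounds=20, train_frac=0.8, tau=0.75,
            cutoff=2, min_size=4, seed=0):
    """Return the indices of examples that survive filtering."""
    rng = random.Random(seed)
    keep = list(range(len(features)))
    while len(keep) > max(min_size, 1):
        correct = {i: 0 for i in keep}
        seen = {i: 0 for i in keep}
        for _ in range(rounds):
            # Random train/held-out split; train a fresh linear classifier.
            shuffled = keep[:]
            rng.shuffle(shuffled)
            split = max(1, int(train_frac * len(shuffled)))
            train, held_out = shuffled[:split], shuffled[split:]
            model = train_logistic([features[i] for i in train],
                                   [labels[i] for i in train])
            for i in held_out:
                seen[i] += 1
                correct[i] += int(predict(model, features[i]) == labels[i])
        # Predictability score: share of held-out predictions that were right.
        scores = {i: correct[i] / seen[i] for i in keep if seen[i] > 0}
        easy = sorted((i for i in scores if scores[i] >= tau),
                      key=lambda i: -scores[i])[:cutoff]
        keep = [i for i in keep if i not in easy]
        if len(easy) < cutoff:  # too few easy examples left: stop filtering
            break
    return keep


# Toy data where a single feature gives the label away (an "artifact").
toy_features = [[1.0]] * 5 + [[-1.0]] * 5
toy_labels = [1] * 5 + [0] * 5
print(af_lite(toy_features, toy_labels))
```

Each pass removes at most `cutoff` of the most predictable examples, so the dataset shrinks gradually and the scores are recomputed on what remains.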

slide-53
SLIDE 53

Performance of models on the WikiHow portion of HellaSwag (Zellers et al., 2019) with different AF settings and different training models

Adversarial filtering removes examples with spurious correlations => Task becomes harder

slide-55
SLIDE 55

Model performance on SOCIAL IQA

[Figure: bar chart of accuracy (0% to 90%) on SOCIAL IQA for humans, BERT-large, BERT-base, GPT, and random guessing]

>20% gap to improve on

slide-59
SLIDE 59

Challenging SOCIAL IQA examples for BERT-large

Although Aubrey was older and stronger, they lost to Alex in arm wrestling.

How would Alex feel as a result?

  • ashamed
  • boastful
  • they need to practice more

Note: how Aubrey would feel, not Alex

Remy gave Skylar, the concierge, her account so that she could check into the hotel.

What will Remy want to do next?

  • lose her credit card
  • arrive at a hotel
  • get the key from Skylar

Note: what Remy did before

Need more robust, person-centric reasoning

Need better notion of causes vs. effects

slide-65
SLIDE 65

Commonsense benchmarks

Categories: social commonsense, physical commonsense, temporal commonsense, commonsense reading comprehension

Benchmarks: Social IQa, Physical IQa, Abductive NLI, HellaSwag, VCR, WinoGrande, CommonsenseQA, JHU Ordinal Commonsense, MCTaco, SWAG, Naïve Psychology, COPA, WSC, ROC story, MultiRC, CosmosQA, ReCORD

Thanks! Questions?