
Breaking NLI Systems with Sentences that Require Simple Lexical Inferences

Max Glockner¹, Vered Shwartz² and Yoav Goldberg²

¹TU Darmstadt  ²Bar-Ilan University

July 18, 2018

SNLI [Bowman et al., 2015]

A large-scale dataset for NLI (Natural Language Inference; Recognizing Textual Entailment [Dagan et al., 2013]). Premises are image captions; hypotheses were written by crowdsourcing workers:

Premise

Street performer is doing his act for kids

Hypotheses

  • 1. A person performing for children on the street ⇒ ENTAILMENT
  • 2. A juggler entertaining a group of children on the street ⇒ NEUTRAL
  • 3. A magician performing for an audience in a nightclub ⇒ CONTRADICTION

Event co-reference assumption: premise and hypothesis are assumed to describe the same event, which is why a differing scene (nightclub vs. street) counts as a contradiction.

Max Glockner, Vered Shwartz and Yoav Goldberg · Breaking NLI Systems with Sentences that Require Simple Lexical Inferences
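The three-way labeling scheme can be sketched in a few lines; the dataclass and field names below are illustrative, not SNLI's release format.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class NLIExample:
    """A single NLI item: a premise, a hypothesis, and a 3-way label."""
    premise: str
    hypothesis: str
    label: str  # "entailment" | "neutral" | "contradiction"

premise = "Street performer is doing his act for kids"
examples = [
    NLIExample(premise, "A person performing for children on the street", "entailment"),
    NLIExample(premise, "A juggler entertaining a group of children on the street", "neutral"),
    NLIExample(premise, "A magician performing for an audience in a nightclub", "contradiction"),
]

# One premise can pair with hypotheses of all three labels.
label_counts = Counter(ex.label for ex in examples)
```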

Neural NLI Models

End-to-end, either sentence-encoding or attention-based.

[Architecture diagram: Premise and Hypothesis → Premise Encoder / Hypothesis Encoder → Extract Features → Label Classifier; the attention-based variant adds an Attention component between the two encoders.]

Lexical knowledge: only from pre-trained word embeddings, as opposed to using resources like WordNet.

SOTA exceeds human performance...¹

¹ [Gururangan et al., 2018, Poliak et al., 2018]: by learning “easy clues”
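The sentence-encoding variant can be sketched as follows, assuming mean-pooled word embeddings as the encoder and the common [u; v; |u − v|; u ⊙ v] feature combination. The toy vocabulary, random (untrained) weights, and helper names are illustrative; this is not any of the evaluated models.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = {w: i for i, w in enumerate(
    "a the man woman is holding saxophone guitar performing street".split())}
EMB_DIM, N_CLASSES = 8, 3  # classes: entailment / neutral / contradiction
EMB = rng.normal(size=(len(VOCAB), EMB_DIM))  # stand-in for pre-trained embeddings

def encode(sentence: str) -> np.ndarray:
    """Sentence encoder: here simply mean-pooled word embeddings."""
    ids = [VOCAB[w] for w in sentence.lower().split() if w in VOCAB]
    return EMB[ids].mean(axis=0)

def extract_features(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """A common NLI feature combination: [u; v; |u - v|; u * v]."""
    return np.concatenate([u, v, np.abs(u - v), u * v])

W = rng.normal(size=(4 * EMB_DIM, N_CLASSES))  # untrained linear label classifier

def classify(premise: str, hypothesis: str) -> np.ndarray:
    """Return a probability distribution over the three labels."""
    logits = extract_features(encode(premise), encode(hypothesis)) @ W
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()  # softmax

probs = classify("the man is holding a saxophone", "the man is holding a guitar")
```

A trained model would learn EMB and W from SNLI; the point of the sketch is only the encode → extract features → classify pipeline from the diagram.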

Do neural NLI models implicitly learn lexical semantic relations?

New Test Set

We constructed a new test set to answer this question.

Premise: sentences from the SNLI training set.

Hypothesis: created by replacing a single term w in the premise with a related term w′, such that w′ is in the SNLI vocabulary and in the pre-trained embeddings. Labels were assigned by crowdsourcing (mostly contradictions!).

Contradiction: The man is holding a saxophone → The man is holding an electric guitar

Entailment: A little girl is very sad → A little girl is very unhappy

Neutral: A couple drinking wine → A couple drinking champagne
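The single-word substitution procedure can be sketched as follows. The RELATED lexicon below is a hand-made stand-in; the actual replacement pairs came from lexical resources and the labels from crowd workers.

```python
# Toy relation lexicon: word -> (replacement, label). Entries are illustrative.
RELATED = {
    "saxophone": ("guitar", "contradiction"),  # co-hyponyms (instruments)
    "sad": ("unhappy", "entailment"),          # synonyms
    "wine": ("champagne", "neutral"),          # hyponym replaces hypernym
}

def make_examples(premise: str):
    """Yield (premise, hypothesis, label) by swapping a single known term."""
    words = premise.split()
    for i, w in enumerate(words):
        if w in RELATED:
            replacement, label = RELATED[w]
            hypothesis = " ".join(words[:i] + [replacement] + words[i + 1:])
            yield premise, hypothesis, label

generated = list(make_examples("A little girl is very sad"))
```

A real pipeline would also repair surface details the swap breaks (e.g. "a" vs. "an" before the replacement), which is glossed over here.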

Evaluation Setting

3 representative models:

  • Residual-Stacked-Encoder [Nie and Bansal, 2017]
  • ESIM (Enhanced Sequential Inference Model) [Chen et al., 2017]
  • Decomposable Attention [Parikh et al., 2016]

Train on the SNLI training set, test on the original & new test sets.

In the paper: enhancing with additional existing datasets.

Results

Can neural NLI models recognize lexical inferences? Accuracy (%):

  • Decomposable Attention: 84.7 on the SNLI test set, 51.9 on the new test set
  • ESIM: 87.9 on the SNLI test set, 65.6 on the new test set
  • Residual-Stacked-Encoder: 86.0 on the SNLI test set, 62.2 on the new test set

Dramatic drop in performance across models.

Sanity Check

Performance of WordNet-informed models on the new test set (accuracy, %):

  • Best neural model: 65.6
  • KIM [Chen et al., 2018]: 83.5
  • WordNet baseline: 85.8

The test set is solvable using WordNet.
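A rough sketch of how such a WordNet-informed baseline can work: look up the lexical relation between the replaced pair and map it to a label. The relation table and the exact rule set below are illustrative assumptions (a toy dictionary stands in for WordNet), not the paper's implementation.

```python
# Toy relation table standing in for WordNet lookups (illustrative only).
RELATION = {
    ("sad", "unhappy"): "synonym",
    ("saxophone", "guitar"): "co-hyponym",  # both musical instruments
    ("wine", "champagne"): "hyponym",       # champagne is a kind of wine
    ("guitar", "instrument"): "hypernym",   # instrument generalizes guitar
}

def baseline_label(word: str, replacement: str) -> str:
    """Map the lexical relation of (word, replacement) to an NLI label."""
    rel = RELATION.get((word, replacement), "unknown")
    if rel in ("synonym", "hypernym"):    # replacement is implied by the word
        return "entailment"
    if rel in ("antonym", "co-hyponym"):  # mutually exclusive alternatives
        return "contradiction"
    return "neutral"                      # e.g. hyponym: more specific, unverifiable

label = baseline_label("saxophone", "guitar")
```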

What do neural NLI models learn with respect to lexical semantic relations?

Analysis 1: Word Similarity

Models err on contradicting word pairs with similar embeddings:

A man starts his day in India → A man starts his day in Malaysia

Especially for fixed word embeddings. Decomposable Attention accuracy by cosine similarity of the (word, replacement) pair:

  • 0.5–0.6: 46.2
  • 0.6–0.7: 42.3
  • 0.7–0.8: 37.5
  • 0.8–0.9: 29.7
  • 0.9–1.0: 20.2
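The similarity measure binned in this analysis is plain cosine similarity between the embedding vectors of the replaced pair. The 4-dimensional vectors below are made up for illustration; real analyses use pre-trained embeddings, which place related country names close together.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings for two country names (illustrative values).
india = np.array([0.9, 0.1, 0.4, 0.2])
malaysia = np.array([0.8, 0.2, 0.5, 0.1])
sim = cosine_similarity(india, malaysia)  # high similarity despite contradiction
```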

Analysis 2: Frequency in Training

Tuning embeddings may associate specific (word, replacement) pairs with a label, e.g. (man, woman) → contradiction. Accuracy increases with frequency in the training set. ESIM accuracy by frequency of the (word, replacement) pair in contradiction training examples:

  • 0: 40.2
  • 1–4: 70.6
  • 5–9: 91.4
  • 10–49: 92.1
  • 50–99: 97.5
  • 100+: 98.5
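Counting how often a (word, replacement) pair occurs in contradiction training examples can be sketched by aligning premise and hypothesis tokens and keeping pairs that differ in exactly one position. The toy examples below are illustrative; real counts come from SNLI.

```python
from collections import Counter

# Toy contradiction-labeled training pairs (illustrative only).
contradiction_pairs = [
    ("A man is walking", "A woman is walking"),
    ("A man is sleeping", "A woman is sleeping"),
    ("He plays a saxophone", "He plays a guitar"),
]

def substituted_pair(premise: str, hypothesis: str):
    """Return the single (word, replacement) pair if the two sentences
    differ in exactly one token position, else None."""
    p, h = premise.split(), hypothesis.split()
    if len(p) != len(h):
        return None
    diffs = [(a, b) for a, b in zip(p, h) if a != b]
    return diffs[0] if len(diffs) == 1 else None

pair_counts = Counter(
    pair for prem, hyp in contradiction_pairs
    if (pair := substituted_pair(prem, hyp)) is not None
)
```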

Recap

  • New NLI test set that evaluates systems’ ability to make inferences that require very simple lexical knowledge
  • SOTA systems perform poorly on the test set
  • Systems are limited in their generalization ability
  • May be used as a complementary test set to assess the lexical inference abilities of NLI systems

Thank you!

References

[Bowman et al., 2015] Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642. Association for Computational Linguistics.

[Chen et al., 2018] Chen, Q., Zhu, X., Ling, Z.-H., Inkpen, D., and Wei, S. (2018). Neural natural language inference models enhanced with external knowledge. In The 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia.

[Chen et al., 2017] Chen, Q., Zhu, X., Ling, Z.-H., Wei, S., Jiang, H., and Inkpen, D. (2017). Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668, Vancouver, Canada. Association for Computational Linguistics.

[Dagan et al., 2013] Dagan, I., Roth, D., Sammons, M., and Zanzotto, F. M. (2013). Recognizing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies, 6(4):1–220.

[Gururangan et al., 2018] Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S. R., and Smith, N. A. (2018). Annotation artifacts in natural language inference data. In The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), New Orleans, Louisiana.

[Nie and Bansal, 2017] Nie, Y. and Bansal, M. (2017). Shortcut-stacked sentence encoders for multi-domain inference. arXiv preprint arXiv:1708.02312.

[Parikh et al., 2016] Parikh, A., Täckström, O., Das, D., and Uszkoreit, J. (2016). A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255, Austin, Texas. Association for Computational Linguistics.

[Poliak et al., 2018] Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., and Van Durme, B. (2018). Hypothesis only baselines in natural language inference. In Joint Conference on Lexical and Computational Semantics (*SEM).