SLIDE 1

Split and Rephrase: Better Evaluation and a Stronger Baseline

Roee Aharoni and Yoav Goldberg

NLP Lab, Bar Ilan University, Israel

ACL 2018

SLIDES 2-8

Motivation

  • Processing long, complex sentences is hard!
  • Children, people with reading disabilities, L2 learners…
  • Sentence-level NLP systems:
    • Dependency parsers (McDonald & Nivre, 2011)
    • Neural machine translation (Koehn & Knowles, 2017)
  • Can we automatically break a complex sentence into several simple ones while preserving its meaning?

SLIDE 9

The Split and Rephrase Task

SLIDES 10-16

The Split and Rephrase Task

  • Narayan, Gardent, Cohen & Shimorina, EMNLP 2017
  • Dataset, evaluation method, baseline models
  • Task definition: complex sentence -> several simple sentences with the same meaning
  • Requires (a) identifying independent semantic units and (b) rephrasing those units as single sentences

Complex: Alan Bean joined NASA in 1963 where he became a member of the Apollo 12 mission along with Alfred Worden as back up pilot and David Scott as commander .

Simple: Alan Bean was selected by Nasa in 1963 . Alan Bean served as a crew member of Apollo 12 . Alfred Worden was the backup pilot of Apollo 12 . Apollo 12 was commanded by David Scott .

SLIDE 17

This Work

SLIDES 18-20

This Work

  • We show that simple neural models seem to perform very well on the original benchmark due to memorization of the training set
  • We propose a more challenging data split for the task to discourage memorization
  • We perform automatic evaluation and error analysis on the new benchmark, showing that the task is still far from being solved

SLIDE 21

WebSplit Dataset Construction (Narayan et al. 2017)

SLIDES 22-27

WebSplit Dataset Construction (Narayan et al. 2017)

Simple RDF triples (facts from DBpedia):
  <Alan_Bean | nationality | United_States>
  <Alan_Bean | mission | Apollo_12>
  <Alan_Bean | NASA selection | 1963>

Simple sentences (one per triple):
  Alan Bean is a US national.
  Alan Bean was on the crew of Apollo 12.
  Alan Bean was hired by NASA in 1963.

Sets of RDF triples:
  <Alan_Bean | nationality | United_States, Alan_Bean | mission | Apollo_12, Alan_Bean | NASA selection | 1963>

Complex sentences (one per triple set):
  Alan Bean, born in the United States, was selected by NASA in 1963 and served as a crew member of Apollo 12.

Complex and simple sentences are matched via their RDFs, yielding ~1M examples.

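As a rough illustration of the "matching via RDFs" step described above (not the authors' actual pipeline, and with shortened stand-in strings rather than real WebSplit entries), each complex sentence can be paired with the simple sentences that realize its triple set:

```python
# Toy stand-ins: one simple sentence per RDF triple, keyed by a short
# triple label (hypothetical names, for illustration only).
simple_by_triple = {
    "nationality": "Alan Bean is a US national .",
    "mission": "Alan Bean was on the crew of Apollo 12 .",
    "selection": "Alan Bean was hired by NASA in 1963 .",
}
# A complex sentence keyed by the set of triples it expresses.
complex_sentences = {
    frozenset({"nationality", "mission", "selection"}):
        "Alan Bean, born in the United States, was selected by NASA "
        "in 1963 and served as a crew member of Apollo 12.",
}

def build_pairs(complex_sentences, simple_by_triple):
    """Pair each complex sentence with the simple sentences realizing
    its triple set - one (complex -> simple sentences) example."""
    pairs = []
    for triples, complex_sent in complex_sentences.items():
        simples = [simple_by_triple[t] for t in sorted(triples)]
        pairs.append((complex_sent, simples))
    return pairs

pairs = build_pairs(complex_sentences, simple_by_triple)
```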
SLIDE 28

Preliminary Experiments

SLIDES 29-33

Preliminary Experiments

  • ~1M training examples
  • “Vanilla” LSTM seq2seq with attention
  • Shared vocabulary between the encoder and the decoder
  • Simple sentences predicted as a single sequence
  • Evaluated using single-sentence, multi-reference BLEU as in Narayan et al. 2017

[Diagram: encoder-decoder mapping a complex sentence to simple sentences 1-3]

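To make the multi-reference evaluation concrete, here is a simplified sketch of multi-reference n-gram precision with a brevity penalty (clipped unigram/bigram precision only; the actual evaluation uses full BLEU, so treat this as illustrative, not the paper's metric):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def multi_ref_bleu(hypothesis, references, max_n=2):
    """Simplified multi-reference BLEU: n-gram counts are clipped by the
    maximum count over all references, combined with a brevity penalty
    against the closest-length reference."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        max_ref = Counter()
        for ref in refs:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        precisions.append(clipped / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))

score = multi_ref_bleu(
    "Alan Bean was selected by NASA in 1963 .",
    ["Alan Bean was selected by Nasa in 1963 .",
     "Alan Bean was hired by NASA in 1963 ."],
)
```

With several references, every n-gram of this hypothesis is covered by at least one reference, which is exactly why high scores alone can hide the failure modes discussed later.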
SLIDE 34

Preliminary Results

SLIDES 35-37

Preliminary Results

  • Our simple seq2seq baseline outperforms all but one of the baselines from Narayan et al. 2017
  • Their best baselines used the RDF structures as additional information
  • Does the simple seq2seq model really perform so well?

[Chart: BLEU (20-80) for seq2seq (ours), hybrid-seq2seq, multi-seq2seq, split-multi, and split-seq2seq; ours uses text only, their best use text + RDFs]

SLIDE 38

BLEU can be Misleading

SLIDES 39-42

BLEU can be Misleading

  • In spite of the high BLEU scores, our neural models suffer from:
  • Missing facts - appeared in the input but not in the output
  • Unsupported facts - appeared in the output but not in the input
  • Repeated facts - appeared several times in the output
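The three error categories above reduce to simple set comparisons between the facts expressed in the input and in the output. A minimal sketch, with hypothetical fact labels standing in for extracted RDF triples:

```python
def fact_errors(input_facts, output_facts):
    """Categorize prediction errors against the input facts:
    missing = in input but not output, unsupported = in output but
    not input, repeated = emitted more than once in the output."""
    missing = set(input_facts) - set(output_facts)
    unsupported = set(output_facts) - set(input_facts)
    repeated = {f for f in output_facts if output_facts.count(f) > 1}
    return missing, unsupported, repeated

# Hypothetical fact labels for one example.
inp = ["bean-crew-apollo12", "worden-pilot-apollo12", "scott-commander-apollo12"]
out = ["bean-crew-apollo12", "bean-crew-apollo12", "bean-nationality-us"]
missing, unsupported, repeated = fact_errors(inp, out)
```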
SLIDE 43

A Closer Look

SLIDES 44-49

A Closer Look

  • Visualizing the attention weights, we find an unexpected pattern
  • The network mainly attends to a single token instead of spreading the attention
  • This token was usually part of the first mentioned entity
  • Consistent across different input examples

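One way to quantify the "attends to a single token" pattern (a sketch, not the authors' analysis code) is to measure, per decoder step, how much attention mass falls on the single most-attended source position:

```python
def attention_concentration(attn_matrix):
    """For each decoder step (row), the fraction of attention mass on
    the single most-attended source token; values near 1.0 mean the
    attention is not spread across the input."""
    return [max(row) / sum(row) for row in attn_matrix]

# Hypothetical attention weights: 3 decoder steps x 4 source tokens,
# nearly all mass piling onto source position 0 (the first entity).
attn = [
    [0.91, 0.03, 0.03, 0.03],
    [0.88, 0.05, 0.04, 0.03],
    [0.93, 0.02, 0.03, 0.02],
]
conc = attention_concentration(attn)
```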
SLIDE 50

Testing for Over-Memorization

SLIDES 51-54

Testing for Over-Memorization

  • At this stage we suspect that the network heavily memorizes entity-fact pairs
  • We test this by feeding it inputs consisting of repeated entities alone
  • The network indeed generates facts it memorized about those specific entities

SLIDE 55

Searching for the Cause: Dataset Artifacts

SLIDES 56-59

Searching for the Cause: Dataset Artifacts

  • The original dataset included overlap between the training/development/test sets
  • On the complex-sentence (source) side, there is no overlap
  • On the other hand, most of the simple (target) sentences did overlap (~90%)
  • This makes memorization very effective - “leakage” from train on the target side

[Diagram: train/dev/test complex sentences (source) are disjoint, while train/dev/test simple sentences (target) overlap]

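Checking for this kind of target-side leakage is a one-line set computation. A minimal sketch with hypothetical miniature train/dev sets (the real WebSplit files are far larger):

```python
# Hypothetical miniature splits, for illustration only.
train_simple = {
    "Alan Bean was hired by NASA in 1963 .",
    "Alan Bean was on the crew of Apollo 12 .",
    "Alan Bean is a US national .",
}
dev_simple = {
    "Alan Bean was hired by NASA in 1963 .",   # leaked from train
    "Alfred Worden was the backup pilot of Apollo 12 .",
}

def target_side_leakage(train, dev):
    """Fraction of unique dev target sentences that also appear in train."""
    return len(dev & train) / len(dev)

leak = target_side_leakage(train_simple, dev_simple)
```

On the original WebSplit split this number comes out around 0.9; on the split proposed next it is essentially zero.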
SLIDE 60

New Data Split

SLIDES 61-65

New Data Split

  • To remedy this, we construct a new data split using the RDF information:
  • Ensuring that all RDF relation types appear in the training set (enables generalization)
  • Ensuring that no RDF triple (fact) appears in two different sets (reduces memorization)
  • The resulting dataset has no overlapping simple sentences
  • It has more unknown symbols in dev/test - we need better models!

                                           Original Split   New Split
  unique dev simple sentences in train     90.9%            0.09%
  unique test simple sentences in train    89.8%            0%
  % dev vocabulary in train                97.2%            63%
  % test vocabulary in train               96.3%            61.7%

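The no-shared-fact constraint can be sketched as a greedy assignment of whole triple sets to one side of the split (illustrative only: the actual construction additionally guarantees that every relation type appears in training, which this sketch omits; example names are made up):

```python
import random

# Toy examples: each pairs a complex sentence id with the frozenset of
# RDF triples (facts) it expresses. Names are illustrative.
examples = [
    {"complex": "c1", "triples": frozenset({"t1", "t2"})},
    {"complex": "c2", "triples": frozenset({"t1"})},
    {"complex": "c3", "triples": frozenset({"t3"})},
    {"complex": "c4", "triples": frozenset({"t3", "t4"})},
]

def split_by_triples(examples, dev_fraction=0.25, seed=0):
    """Greedy RDF-aware split: once a triple is committed to a side,
    every later example sharing that triple goes to the same side,
    so no fact leaks across the split boundary."""
    rng = random.Random(seed)
    train, dev = [], []
    train_triples, dev_triples = set(), set()
    for ex in sorted(examples, key=lambda e: -len(e["triples"])):
        t = ex["triples"]
        if t & train_triples:
            side = "train"
        elif t & dev_triples:
            side = "dev"
        else:
            side = "dev" if rng.random() < dev_fraction else "train"
        if side == "train":
            train.append(ex); train_triples |= t
        else:
            dev.append(ex); dev_triples |= t
    return train, dev, train_triples, dev_triples

train, dev, tt, dt = split_by_triples(examples)
```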
SLIDE 66

Copy Mechanism

SLIDES 67-70

Copy Mechanism

  • To help with the increase in unknown words in the harder split, we incorporate a copy mechanism
  • Gu et al. 2016, See et al. 2017, Merity et al. 2017
  • Uses a “copy switch” - a feed-forward NN component with a sigmoid-activated scalar output
  • Controls the interpolation of the softmax probabilities and the copy probabilities over the input tokens in each decoder step

[Diagram: output = copy switch x attention weights (copy) + (1 - copy switch) x softmax]

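The interpolation in the diagram above can be sketched for a single decoder step as follows (a minimal numeric illustration, not the authors' implementation; the vectors and vocabulary ids are made up):

```python
def copy_interpolate(p_vocab, attention, src_token_ids, switch):
    """One decoder step of a copy mixture:
    p(w) = switch * p_copy(w) + (1 - switch) * p_vocab(w),
    where p_copy scatters the attention weights onto the vocabulary
    ids of the source tokens."""
    p_copy = [0.0] * len(p_vocab)
    for weight, tok_id in zip(attention, src_token_ids):
        p_copy[tok_id] += weight   # sum attention mass per vocab id
    return [switch * c + (1.0 - switch) * v for c, v in zip(p_copy, p_vocab)]

p_vocab = [0.1, 0.2, 0.3, 0.2, 0.1, 0.1]  # decoder softmax over the vocab
attention = [0.7, 0.2, 0.1]               # attention over 3 source tokens
src_token_ids = [4, 4, 1]                 # vocab ids of the source tokens
switch = 0.6                              # output of the sigmoid copy switch
p_final = copy_interpolate(p_vocab, attention, src_token_ids, switch)
```

Because both components are proper distributions, the interpolated output still sums to one, and rare source tokens (even out-of-vocabulary ones mapped to source positions) receive probability mass through the attention term.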
SLIDE 71

Results - New Split

SLIDES 72-74

Results - New Split

  • Baseline seq2seq models completely break (BLEU < 7) on the new split
  • The copy mechanism helps to generalize
  • Scores are much lower than on the original benchmark - memorization was crucial for the high BLEU

[Chart: BLEU (22.5-90) for seq2seq and seq2seq+copy on the original split vs. the new split]

SLIDE 75

Copying and Attention

SLIDE 76

Copying and Attention

[Figure: attention heatmaps, no-copy vs. with-copy]

The copy-enhanced models spread the attention across the input tokens while improving results

SLIDE 77

Error Analysis

SLIDES 78-80

Error Analysis

  • On the original split the models did very well (due to memorization), with up to 91% correct simple sentences
  • On the new benchmark the best model got only up to 20% correct simple sentences
  • The task is much more challenging than previously demonstrated

[Chart: % correct, repeated, missing, and unsupported simple sentences (12.5-50 scale) on the original vs. new split]

SLIDE 81

Conclusions

SLIDES 82-85

Conclusions

  • Simple neural models seem to perform well due to memorization
  • We propose a more challenging data split for the task to discourage this
  • A similar update was proposed by Narayan et al. in parallel to our work (WebSplit v1.0)
  • We perform automatic evaluation and error analysis on the new benchmarks, showing that the task is still far from being solved

SLIDE 86

More Broadly

SLIDES 87-94

More Broadly

  • Creating datasets is hard!
  • Think how models can “cheat”
  • Create a challenging evaluation environment to capture generalization
  • Look for leakage of train to dev/test
  • Numbers can be misleading!
  • Look at the data
  • Look at the model
  • Error analysis
SLIDE 95

Thank You!

Link to code and data is available in the paper :)