Evaluation Benchmarks and Learning Criteria for Discourse-Aware Sentence Representations - PowerPoint PPT Presentation

Mingda Chen, joint work with Zewei Chu and Kevin Gimpel


SLIDE 1

Evaluation Benchmarks and Learning Criteria for Discourse-Aware Sentence Representations

Mingda Chen
Joint work with Zewei Chu and Kevin Gimpel

SLIDE 2

Prior work on evaluation benchmarks

  • Focus on capabilities of representations for stand-alone sentences
  • Sentiment analysis
  • Linguistic properties, e.g. verb tense prediction
  • What about the broader context (i.e. discourse) for a sentence?

SLIDE 3

Our contributions

  • An evaluation suite for evaluating discourse knowledge encoded in sentence representations.
  • Benchmark and compare several pretrained sentence representations.
  • Novel learning criteria for capturing discourse structures.

SLIDE 4

Discourse Evaluation (DiscoEval)

  • Focus on evaluating the role of a sentence in its discourse context.
  • 7 task groups, covering multiple domains (e.g. Wikipedia, stories, dialogues, and scientific literature).
  • Probing tasks. Pretrained embeddings are kept fixed and we only use simple classifiers.
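The probing protocol above can be sketched in a few lines: the pretrained embeddings are treated as frozen feature vectors, and only a simple (here, linear softmax) classifier is trained on top. This is a minimal illustration on toy data; the helper names and the toy clusters are hypothetical, not from the paper.

```python
import numpy as np

def train_linear_probe(X, y, n_classes, lr=0.5, epochs=200):
    """Train a softmax classifier on frozen features X.
    The encoder that produced X is never updated (probing setup)."""
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                       # one-hot targets
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)          # softmax probabilities
        G = (P - Y) / len(X)                       # cross-entropy gradient
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def probe_accuracy(W, b, X, y):
    return float(((X @ W + b).argmax(axis=1) == y).mean())

# Toy stand-in for frozen sentence embeddings: two separable clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 8)), rng.normal(2, 1, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
W, b = train_linear_probe(X, y, n_classes=2)
print(probe_accuracy(W, b, X, y))
```

Because the encoder stays fixed, any accuracy the probe reaches reflects information already present in the embeddings, which is the point of the DiscoEval setup.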

SLIDE 5

Discourse Evaluation (DiscoEval)

  • In general, we follow SentEval and use the following input for tasks involving pairs of sentences:

  [x1, x2, x1 ⊙ x2, |x1 - x2|]

  i.e. the concatenation of the two sentence embeddings x1 and x2, their elementwise product, and their absolute difference.
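Concretely, the pair input above concatenates the two embeddings with their elementwise product and absolute difference, the standard SentEval construction for sentence-pair tasks. A small sketch (the function name is illustrative):

```python
import numpy as np

def pair_features(x1, x2):
    """SentEval-style input for sentence-pair tasks:
    [x1; x2; x1 * x2; |x1 - x2|], concatenated along the feature axis."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.concatenate([x1, x2, x1 * x2, np.abs(x1 - x2)])

f = pair_features([1.0, -2.0], [3.0, 0.5])
print(f.tolist())   # [1.0, -2.0, 3.0, 0.5, 3.0, -1.0, 2.0, 2.5]
```

Note the feature vector is 4x the embedding dimension, so the linear classifier on top can weight the symmetric (product, difference) and asymmetric (raw) views separately.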
SLIDE 9

What is a discourse?

  • A discourse is a coherent, structured group of sentences that acts as a fundamental type of structure in natural language.

SLIDE 10

What is a discourse?

  • Linearly-structured, e.g. sentence ordering.
  • The timing of introducing entities.
  • Tree-structured, e.g. RST discourse tree.

  • 1. The European Community's consumer price index rose a provisional 0.6% in September from August
  • 2. and was up 5.3% from September 1988,
  • 3. according to Eurostat, the EC's statistical agency.

[Figure: RST discourse tree over EDUs 1-3, with relations NN-Comparison and NS-Attribution.] "N" represents "nucleus", containing basic information for the relation. "S" represents "satellite", containing additional information about the nucleus.

SLIDE 11

Discourse Relations

  • Two human-annotated datasets: Penn Discourse Treebank (PDTB) and RST Discourse Treebank (RST-DT).
  • PDTB provides discourse markers for adjacent sentences, whereas RST-DT offers document-level discourse trees.

SLIDE 12

Discourse Relations – PDTB

  • Use a pair of sentences to predict discourse relations.
  • We focus on predicting implicit relations (PDTB-I) and explicit relations (PDTB-E).

PDTB-E example:
  1. In any case, the brokerage firms are clearly moving faster to create new ads than they did in the fall of 1987.
  2. But it remains to be seen whether their ads will be any more effective.
  Label: Comparison.Contrast

PDTB-I example:
  1. "A lot of investor confidence comes from the fact that they can speak to us," he says.
  2. [so] "To maintain that dialogue is absolutely crucial."
  Label: Contingency.Cause

SLIDE 13

Discourse Relations – RST-DT

  • Text is segmented into basic units, elementary discourse units (EDUs), upon which a discourse tree is built recursively.
  • We use 18 fine-grained relations.

  • 1. The European Community's consumer price index rose a provisional 0.6% in September from August
  • 2. and was up 5.3% from September 1988,
  • 3. according to Eurostat, the EC's statistical agency.

[Figure: RST discourse tree over EDUs 1-3, with relations NN-Comparison and NS-Attribution.]


SLIDE 15

Discourse Relations – RST-DT

  • We first encode EDUs into vectors, then use the averaged vectors of a subtree's EDUs as the representation of that subtree.
  • The target prediction is the label of nodes in discourse trees.
  • We use a linear classifier and the input is

  [x_left, x_right, x_left ⊙ x_right, |x_left - x_right|]

  where x_left and x_right are the representations of the left and right subtrees.
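The subtree representation described above can be sketched with toy 2-d EDU vectors; the helper names and toy encodings are illustrative, not from the paper.

```python
import numpy as np

def subtree_vector(edu_vecs, span):
    """Represent a subtree by averaging the vectors of the EDUs it spans."""
    return np.mean([edu_vecs[i] for i in span], axis=0)

def node_features(edu_vecs, left_span, right_span):
    """Input to the linear classifier for one tree node, per the slide:
    [x_left; x_right; x_left * x_right; |x_left - x_right|]."""
    xl = subtree_vector(edu_vecs, left_span)
    xr = subtree_vector(edu_vecs, right_span)
    return np.concatenate([xl, xr, xl * xr, np.abs(xl - xr)])

# Toy 2-d "EDU encodings" for three EDUs.
edu_vecs = {1: np.array([1.0, 0.0]),
            2: np.array([0.0, 2.0]),
            3: np.array([2.0, 2.0])}

# Node joining EDU 1 (left child) with the subtree spanning EDUs 2-3 (right child).
feats = node_features(edu_vecs, left_span=[1], right_span=[2, 3])
print(feats.tolist())   # [1.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 2.0]
```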
SLIDE 17

Sentence Position (SP)

  • Probe the knowledge of a linearly-structured discourse.
  • Data source: Wikipedia articles, the ROC Stories corpus, and arXiv papers.
  • We take five consecutive sentences from a corpus, randomly move one of these five sentences to the first position, and ask models to predict the true position of the first sentence in the modified sequence.

  • She was excited thinking she must have lost weight.
  • Bonnie hated trying on clothes.
  • She picked up a pair of size 12 jeans from the display.
  • When she tried them on they were too big!
  • Then she realized they were actually size 14s, not 12s.

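The SP example construction above can be sketched as follows; the function name and toy sentences are illustrative, not from the released code.

```python
import random

def make_sp_example(sentences, rng):
    """Take 5 consecutive sentences, move a randomly chosen one to the
    front, and use its original index (0-4) as the label to predict."""
    assert len(sentences) == 5
    label = rng.randrange(5)
    moved = [sentences[label]] + [s for i, s in enumerate(sentences) if i != label]
    return moved, label

rng = random.Random(0)
sents = ["s1", "s2", "s3", "s4", "s5"]
modified, label = make_sp_example(sents, rng)
print(modified[0] == sents[label])   # True: the first sentence came from position `label`
```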
SLIDE 19

Discourse Coherence (DC)

  • Binary prediction: determine whether a sequence of 6 sentences forms a coherent paragraph.
  • Data source: Ubuntu IRC Channel and Wikipedia.
  • We start with a coherent sequence of six sentences, then randomly replace one of the sentences (chosen uniformly among positions 2-5) with a sentence from another discourse.
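The DC data construction above can be sketched as follows; the function name, the 50/50 corruption rate, and the toy inputs are assumptions for illustration, not taken from the released code.

```python
import random

def make_dc_example(paragraph, other_sentences, rng, corrupt_prob=0.5):
    """With probability corrupt_prob, replace one sentence (uniformly among
    positions 2-5, i.e. 0-based indices 1-4) with a sentence drawn from
    another discourse. Label 1 = coherent, 0 = corrupted."""
    assert len(paragraph) == 6
    if rng.random() >= corrupt_prob:
        return list(paragraph), 1
    corrupted = list(paragraph)
    corrupted[rng.randrange(1, 5)] = rng.choice(other_sentences)
    return corrupted, 0

rng = random.Random(42)
para = [f"p{i}" for i in range(6)]
seq, label = make_dc_example(para, ["intruder"], rng)
print(len(seq), label in (0, 1))   # 6 True
```

Keeping the first and last sentences intact means a model cannot solve the task from the paragraph boundaries alone; it has to judge coherence in the middle.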

SLIDE 20

Discourse Coherence (DC)

  • An example from the Wikipedia domain

1. The Broadway production took place on May 1, 1947, at the Ethel Barrymore Theatre.
2. The Metropolitan Opera presented it once, on July 31, 1965.
3. After years on the job, Ramsay has found himself one of the division's few real experts.
4. Despite his attempts to get her attention for sufficient time to ask his question, Lucy is occupied with interminable conversations on the telephone.
5. Between her calls, when Lucy leaves the room, Ben even takes the risk of trying to cut the telephone cord, though his attempt is unsuccessful.
6. Not wanting to miss his train, Ben leaves without asking Lucy for her hand in marriage.


SLIDE 22

Discourse Coherence (DC)

  • Solving this task is non-trivial as it may require the ability to perform inference across multiple sentences.

SLIDE 23

Experiments

  • We benchmark the following pretrained models on DiscoEval:
  • Skip-thought
  • DisSent
  • BERT
  • InferSent
  • ELMo

(Some of these models are trained to encode neighboring sentence information.)

SLIDE 24

Experiments – Benchmark pretrained models on DiscoEval

[Chart data, reconstructed as a table]
Model         PDTB-E  PDTB-I  RST-DT  SP    DC
Skip-thought  39.6    38.7    59.7    44.6  54.9
InferSent     37.1    38.0    53.2    43.6  56.3
DisSent       41.6    39.9    57.8    44.9  54.8
ELMo          41.5    41.5    58.8    46.4  59.4
BERT-Large    44.1    43.8    58.8    49.9  60.5

SLIDE 25

Experiments – Benchmark pretrained models on DiscoEval

  • BERT-Large performs best on most of the tasks.

SLIDE 26

Experiments – Benchmark pretrained models on DiscoEval

  • Skip-thought performs best on RST-DT.

SLIDE 27

Experiments – Benchmark pretrained models on DiscoEval

  • InferSent performs much worse than other pretrained embeddings that are trained with information about neighboring sentences.

SLIDE 28

Experiments – Per-Layer analysis based on BERT

SLIDE 29

Experiments – Per-Layer analysis based on BERT

[Figures: per-layer results on SentEval and DiscoEval]

SLIDE 30

Experiments – Per-Layer analysis

Average of the layer number for the best layers in SentEval and DiscoEval:

            ELMo  BERT-Base
SentEval    0.8   5.0
DiscoEval   1.3   8.9

  • Assumption: deeper layers → higher-level structures. This aligns with the information needed to solve the discourse tasks.

SLIDE 31

Human Evaluation

  • Humans still outperform BERT-Large by a large margin.

            Sentence Position  Discourse Coherence
Human       77.3               87.0
BERT-Large  49.9               60.5

Per-domain accuracy:
            Sentence Position        Discourse Coherence
            Wiki   arXiv   ROC       Wiki   Ubuntu
Human       84.0   76.0    94.0      98.0   74.0
BERT-Large  43.0   56.0    50.9      64.9   56.1

SLIDE 32

Learning Criteria

  • General idea: make use of document structures.
  • Document structures are related to discourse comprehension, showing how the information units unfold.
  • Naturally annotated data from structured document collections, e.g. Wikipedia.

SLIDE 33

Learning Criteria

  • Nesting Level (NL)
  • Section and Document Title (SDT)
  • Sentence and Paragraph Position (SPP)

SLIDE 34

Learning Criteria

  • Our models are built upon Skip-thought. All are trained with Neighboring Sentence Prediction (NSP).
  • Models are trained to reconstruct bag-of-words representations of target sequences in NSP and SDT.

SLIDE 35

Experiments – Benchmark proposed learning objectives on DiscoEval

[Chart data, reconstructed as a table]
Model      PDTB-E  PDTB-I  RST-DT  SP    DC
Baseline   36.9    38.0    57.0    44.1  61.2
+SDT       37.0    37.7    56.2    43.9  60.0
+SPP       37.1    37.7    57.1    45.6  60.8
+NL        37.2    37.8    56.4    44.7  61.2
+SPP+NL    37.9    39.3    56.7    45.7  60.9
+SDT+SPP   37.3    36.9    56.2    44.4  60.5

SLIDE 36

Experiments – Benchmark proposed learning objectives on DiscoEval

  • +SPP+NL gives the strongest performance compared to other combinations.

SLIDE 37

Experiments – Benchmark proposed learning objectives on DiscoEval

  • Simply adding all the losses is not optimal as some of them could be contradictory.

SLIDE 38

Conclusion

  • We introduce DiscoEval for evaluating discourse knowledge encoded in pretrained sentence representations; it comprises 7 task groups and covers multiple domains.
  • We also introduce a set of multi-task losses that make use of document structures for learning discourse-aware sentence representations.
  • Human evaluations show that humans still outperform BERT-Large by a large margin.

SLIDE 39

DiscoEval is available at https://github.com/ZeweiChu/DiscoEval

Thanks!