Building a Discourse-annotated Dutch Text Corpus (Nynke van der Vliet)



SLIDE 1

Building a Discourse-annotated Dutch Text Corpus

Nynke van der Vliet, Ildikó Berzlánovich, Gosse Bouma, Markus Egg, Gisela Redeker

Beyond Semantics DGfS Workshop, Göttingen, February 23-25, 2011

SLIDE 2

Overview

  • Introduction: MTO Project
  • Corpus design
      • Text selection
      • Segmentation
      • Annotation
          • Discourse structure
          • Genre structure
          • Discourse connectives
          • Lexical cohesion
  • Preliminary results and future work

SLIDE 3

Introduction

Modeling Textual Organization (MTO) Program

Build a Dutch text corpus, annotated for discourse structure, genre structure, lexical cohesion, coreference, and discourse connectives

Project Goals:

  • Investigate the genre-dependent interaction between discourse structure and lexical cohesion (Project 1, Ildikó Berzlánovich)
  • Investigate the mechanisms that establish coherence in text and develop algorithms for discourse parsing (Project 2, Nynke van der Vliet)
  • http://www.let.rug.nl/mto/

SLIDE 4

Corpus design

  • Provide a reliably annotated “gold standard” resource covering a range of genres
  • Core corpus: 80 texts (length: 190-400 words)
  • expository texts:
      • 20 encyclopedia texts
      • 20 popular scientific news texts
  • persuasive texts:
      • 20 fundraising letters
      • 20 advertisements

SLIDE 5

Text Selection (1)

Preparation: selection of text material, stripping off ‘text-external’ elements

  • Exclude pictures and picture captions
  • Exclude genre-specific elements that are not related to rhetorical choices

SLIDE 6

Text Selection (2): Example

SLIDE 7

Segmentation (1)

Texts are segmented into elementary discourse units (EDUs); "][" marks an EDU boundary.

  • EDU ~ simple sentence
    Each donation is valuable!

  • EDU ~ finite clause
    You can build with us by donating, ][ but you can also build with us literally.

  • EDU ~ fragment functioning as complete utterance
    Nice gadgets.

  • EDU ~ non-restrictive relative clause
    This gap is caused by one of the moons of Saturn, Mimas, ][ which disturbs the rings.

SLIDE 8

Segmentation (2)

  • EDU ~ embedded discourse unit
    However during the night, [ which can last for months on Mercury, ] the temperature decreases to about -185 degrees Celsius.

  • EDU ~ coordinated VP
    At a young age a cataract in her eye was diagnosed ][ and treated.

  • EDU ~ elliptical clause
    The planet turns around its axis in 58.6 days ][ and around the sun in 88.0 days.
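The boundary notation used in the segmentation examples can be read mechanically. The sketch below is only an illustration of that notation: `split_edus` is a hypothetical helper, not part of the project's segmenter, and it handles the "][" boundary marker only (embedded units written as "[ ... ]" are left untouched).

```python
def split_edus(marked: str, boundary: str = "][") -> list:
    """Split a boundary-marked sentence into a list of EDU strings."""
    # Hypothetical helper: only the "][" marker from the slides is handled.
    return [part.strip() for part in marked.split(boundary) if part.strip()]


example = "At a young age a cataract in her eye was diagnosed ][ and treated."
edus = split_edus(example)
for number, edu in enumerate(edus, start=1):
    print(number, edu)
```

A sentence without a boundary marker comes back as a single EDU, which matches the "simple sentence" case.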

SLIDE 9

Discourse Structure (1)

Rhetorical Structure Theory (RST) (Mann and Thompson, 1988)

Full hierarchical text structure

Extended Classic RST (30 relations)

Semantic and pragmatic relations

Non-binary trees

SLIDE 10

Discourse Structure (2)

(37) P.S. The enclosed cards are a thank you gift for reading my letter about the malaria epidemic in Africa. (38) Help us now in our fight against malaria (39) by donating today (40) - within an hour more than 120 children will die needlessly from this deadly disease. (41) Give generously (42) and do this today, (43) so that we can help more children (44) before it is too late.

SLIDE 11

Discourse Structure (3)

Multi-satellite construction (non-binary tree)

[Figure: RST tree over segments 1-4]
  1-4
    1 (Motivation): P.S. The enclosed cards are a thank you gift for reading my letter about the malaria epidemic in Africa.
    2-3
      2: Help us now in our fight against malaria
      3 (Means): by donating today
    4 (Justify): - within an hour more than 120 children will die needlessly from this deadly disease.

Restriction to binary trees yields implausible analyses

[Figure: two alternative binary trees over the same segments]
  Alternative 1: 1-4 = 1 (Motivation) + 2-4; 2-4 = 2-3 + 4 (Justify); 2-3 = 2 + 3 (Means)
  Alternative 2: 1-4 = 1-3 + 4 (Justify); 1-3 = 1 (Motivation) + 2-3; 2-3 = 2 + 3 (Means)
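One way to see why the multi-satellite analysis needs a non-binary tree is to store it in a data structure that allows several satellites per nucleus. The sketch below is a minimal illustration, not the project's annotation format: `RSTNode`, `leaf` and `max_branching` are hypothetical names, multinuclear relations are not modeled, and the nuclearity (segment 2 as nucleus of 2-3, span 2-3 as nucleus of 1-4) is an assumption read off the figure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RSTNode:
    """A span of EDUs with one nucleus and any number of labeled satellites."""
    first: int
    last: int
    nucleus: Optional["RSTNode"] = None
    satellites: List[Tuple[str, "RSTNode"]] = field(default_factory=list)


def leaf(i: int) -> RSTNode:
    """A single EDU, i.e. a span without internal structure."""
    return RSTNode(i, i)


def max_branching(node: RSTNode) -> int:
    """Largest number of direct children found anywhere in the tree."""
    children = [satellite for _, satellite in node.satellites]
    if node.nucleus is not None:
        children.append(node.nucleus)
    return max([len(children)] + [max_branching(c) for c in children])


# Assumed nuclearity: segment 2 heads span 2-3, and span 2-3 heads span 1-4.
means = RSTNode(2, 3, nucleus=leaf(2), satellites=[("Means", leaf(3))])
root = RSTNode(1, 4, nucleus=means,
               satellites=[("Motivation", leaf(1)), ("Justify", leaf(4))])

print(max_branching(root))  # a value above 2 means the tree is not binary
```

In this representation both Motivation and Justify attach directly to the span 2-3, so neither has to be nested inside the other as the binary alternatives require.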

SLIDE 12

Genre structure (1)

Move analysis (Upton and Cohen, 2009)

  • Moves = functional components in the text
  • Each genre has a particular set of move types
  • The moves create a linear, non-hierarchical partition of the text

SLIDE 13

Genre structure (2)

Encyclopedia texts

  • Name
  • Define
  • Describe

Fundraising letters

  • Get attention
  • Introduce the cause
  • Establish credentials of organization
  • Solicit response
  • Offer incentive
  • Reference insert
  • Express gratitude
  • Conclude with pleasantries

SLIDE 14

Genre structure (3)

SLIDE 15

Genre structure (4)

Mapping the move structure onto RST structure

SLIDE 16

Discourse Connectives (1)

Why annotate connectives?

  • At least at intra-sentential level (but probably also across sentences), connectives should be valuable cues to coherence relations.
  • Frequencies of connectives may differ between genres and thus provide a cue for genre classification.
  • Genre information may help the parser by biasing the disambiguation of multifunctional connectives, e.g., toward a semantic meaning for expository texts and a pragmatic one for persuasive texts.

SLIDE 17

Discourse Connectives (2)

(16) With the help of research much has already been achieved. (17) But to protect you and others from the consequences of diabetes (18) more research is needed. (19) That is why we keep asking for your support.
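The idea that connective frequencies may serve as a genre cue can be illustrated with a normalized count. This is a rough sketch under stated assumptions: `connective_rates` is a hypothetical helper, the connective list is an arbitrary English example rather than the project's Dutch inventory, and only single-word connectives are matched (a multiword item such as "that is why" would need phrase matching).

```python
import re
from collections import Counter


def connective_rates(text, connectives):
    """Occurrences per 1,000 tokens for each (lowercase, single-word) connective."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return {c: 0.0 for c in connectives}
    counts = Counter(token for token in tokens if token in connectives)
    return {c: 1000 * counts[c] / len(tokens) for c in connectives}


# English translation of the fundraising example, without segment numbers.
sample = ("With the help of research much has already been achieved. "
          "But to protect you and others from the consequences of diabetes "
          "more research is needed. That is why we keep asking for your support.")
rates = connective_rates(sample, ["but", "because"])
print(rates)
```

Rates computed per genre over the whole corpus could then be compared directly or passed to a classifier as features.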

SLIDE 18

Lexical cohesion (1)

  • Lexical cohesive items build up graph structures in the text
  • For each lexical item, lexical links to items in preceding and following EDUs are identified

SLIDE 19

EDU5 [After the forming of the sun and the solar system, our star began its long existence as a so-called dwarf star]
EDU6 [In the dwarf phase of its life, the energy that the sun gives off is generated in its core through the fusion of hydrogen into helium.]
EDU7 [The sun is about five billion years ]
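The graph structure that lexical cohesive items build up can be sketched as an undirected graph whose nodes are (EDU number, lexical item) pairs. The items below are taken from the example, but the specific links are hypothetical and chosen only for illustration, not the project's actual annotation. `build_graph` and `edu_link_counts` are assumed helper names.

```python
from collections import defaultdict

# Hypothetical links between lexical items, each item given as (EDU, item).
links = [
    ((5, "sun"), (6, "sun")),
    ((6, "sun"), (7, "sun")),
    ((5, "dwarf star"), (6, "dwarf phase")),
    ((5, "star"), (6, "sun")),
]


def build_graph(item_links):
    """Undirected adjacency sets over (EDU, item) nodes."""
    graph = defaultdict(set)
    for a, b in item_links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def edu_link_counts(item_links):
    """Number of lexical links between each unordered pair of EDUs."""
    counts = defaultdict(int)
    for (edu_a, _), (edu_b, _) in item_links:
        counts[tuple(sorted((edu_a, edu_b)))] += 1
    return dict(counts)


graph = build_graph(links)
print(edu_link_counts(links))
```

Aggregating links per EDU pair gives a cohesion measure that can be set against the RST relations holding between the same EDUs.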

SLIDE 20

Annotations (1)

Segmentation

  • Detailed manual with rules and examples
  • Reliability: 25% of the material, K = 0.98 (fundraising letters and encyclopedia)

Coherence analysis (RST)

  • Relation definitions as published on the RST website
  • Consensus procedure: each final analysis is based on at least two independent first versions and intensive team discussion (Berzlánovich, Redeker, van der Vliet)
  • Reliability: K = 0.88 for the spans, 0.82 for nuclearity and 0.57 for labeling

SLIDE 21

Annotations (2)

Genre structure (move analysis)

  • Detailed manual
  • Final analysis by consensus between two coders (Berzlánovich, Redeker)
  • Reliability: K will be calculated on 20% of the corpus (4 texts per genre)

Lexical cohesion

  • Detailed manual, training of the coders
  • Final analysis by consensus between two coders (Berzlánovich with Rensema/Wagenaar)
  • Reliability: K will be calculated on 20% of the corpus (4 texts per genre)
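The reliability figures are reported as K. Assuming K denotes Cohen's kappa for two coders (the slides do not say which kappa variant was used), the statistic corrects observed agreement for the agreement expected by chance. The sketch below uses invented toy labels (B = boundary, N = no boundary), not project data, and `cohens_kappa` is an illustrative helper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each coder's own label distribution.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data: the coders disagree on one of eight candidate boundaries.
coder_1 = ["B", "N", "N", "B", "N", "N", "B", "N"]
coder_2 = ["B", "N", "N", "B", "N", "N", "N", "N"]
kappa = cohens_kappa(coder_1, coder_2)
print(round(kappa, 3))
```

Because chance agreement is subtracted, raw agreement of 7 out of 8 yields a kappa well below 0.875 when one label dominates.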

SLIDE 22

Coherence, Cohesion, and Genre

Preliminary results on genre-sensitivity of coherence and cohesion
(comparing encyclopedia texts and fundraising letters)

  • Genre difference in coherence
      • Presentational relations are much more frequent in persuasive texts than in expository texts.
  • Genre difference in cohesion
      • Different discourse connectives in expository and persuasive texts.
      • Systematic semantic relations are more frequent in expository texts than in persuasive texts.
  • Genre-specific interaction of coherence and cohesion
      • Coherence and cohesion are closely aligned in expository texts, but not in persuasive texts.

SLIDE 23

Future plans

Automatic discourse parsing

  • automatic segmentation (basic program already achieves good precision (0.72) and recall (0.75))
  • determine the validity of genre, connectives, coreference and lexical cohesion relations as cues for the recognition of RST relations (using machine learning)
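Precision and recall for automatic segmentation are commonly computed over predicted versus gold EDU boundary positions. The sketch below assumes that setup, which the slides do not spell out. The boundary sets are toy values, and the F1 score is a derived figure that the slides do not report.

```python
def precision_recall(gold, predicted):
    """Precision and recall of predicted boundary positions against gold ones."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy boundary positions (token offsets), invented for illustration.
gold_boundaries = {3, 7, 12, 18}
predicted_boundaries = {3, 7, 10, 18, 21}
print(precision_recall(gold_boundaries, predicted_boundaries))

# F1 implied by the reported precision (0.72) and recall (0.75).
print(round(f1(0.72, 0.75), 3))
```

With precision and recall this close together, the harmonic mean falls between them, at roughly 0.73.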

SLIDE 24

Thank you