Building a Discourse-annotated Dutch Text Corpus
Nynke van der Vliet, Ildikó Berzlánovich, Gosse Bouma, Markus Egg, Gisela Redeker
Beyond Semantics, DGfS Workshop, Göttingen, February 23-25, 2011
Overview
- Introduction MTO Project
- Corpus design
  Text selection
  Segmentation
  Annotation
    Discourse structure
    Genre structure
    Discourse connectives
    Lexical cohesion
- Preliminary results and future work
Introduction
Modeling Textual Organization (MTO) Program
Build a Dutch text corpus, annotated for discourse structure, genre structure, lexical cohesion, coreference, and discourse connectives
Project Goals:
- Investigate the genre-dependent interaction between discourse structure and lexical cohesion (Project 1, Ildikó Berzlánovich)
- Investigate the mechanisms that establish coherence in text and develop algorithms for discourse parsing (Project 2, Nynke van der Vliet)
- http://www.let.rug.nl/mto/
Corpus design
- Provide a reliably annotated “gold standard” resource covering a range of genres
- Core corpus: 80 texts (length: 190 - 400 words)
- expository texts:
  20 encyclopedia texts
  20 popular scientific news texts
- persuasive texts:
  20 fundraising letters
  20 advertisements
Text Selection (1)
Preparation: selection of text material, stripping off ‘text-external’ elements
- Exclude pictures and picture captions
- Exclude genre-specific elements that are not related to rhetorical choices
Text Selection (2) Example
Segmentation (1)
EDU ~ simple sentence
Each donation is valuable!
EDU ~ finite clause
You can build with us by donating, ][ but you can also build with us literally.
EDU ~ fragment functioning as complete utterance
Nice gadgets.
EDU ~ non-restrictive relative clause
This gap is caused by one of the moons of Saturn, Mimas, ][ which disturbs the rings.
Segmentation (2)
EDU ~ embedded discourse unit
However during the night, [ which can last for months on Mercury, ] the temperature decreases to about -185 degrees Celsius.
EDU ~ coordinated VP
At a young age a cataract in her eye was diagnosed ][ and treated.
EDU ~ elliptical clause
The planet turns around its axis in 58.6 days ][ and around the sun in 88.0 days.
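The bracket notation used in the segmentation examples can be read mechanically. The following is a minimal sketch, assuming that "][" marks the boundary between two adjacent EDUs and that "[ ... ]" encloses an EDU embedded in its host EDU; the function name `split_edus` is hypothetical and is not part of any MTO tooling. It handles only one level of embedding.

```python
import re

def split_edus(marked: str) -> list:
    """Split a sentence written in the slide notation into EDU strings."""
    edus = []
    # "][" separates two adjacent EDUs.
    for chunk in marked.split("]["):
        # "[ ... ]" encloses an EDU embedded inside its host EDU.
        embedded = re.findall(r"\[([^\[\]]*)\]", chunk)
        host = re.sub(r"\[[^\[\]]*\]", " ", chunk)
        # The host EDU is what remains once the embedded unit is lifted out.
        edus.append(" ".join(host.split()))
        edus.extend(" ".join(e.split()) for e in embedded)
    return edus

print(split_edus("At a young age a cataract in her eye was diagnosed ][ and treated."))
```

For the embedded example, the host clause is returned first and the relative clause second, so the interrupted matrix clause stays one unit.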
Discourse Structure (1)
Rhetorical Structure Theory (RST) (Mann and Thompson, 1988)
Full hierarchical text structure
Extended Classic RST (30 relations)
Semantic and pragmatic relations
Non-binary trees
Discourse Structure (2)
(37) P.S. The enclosed cards are a thank you gift for reading my letter about the malaria epidemic in Africa. (38) Help us now in our fight against malaria (39) by donating today (40) - within an hour more than 120 children will die needlessly from this deadly disease. (41) Give generously (42) and do this today, (43) so that we can help more children (44) before it is too late.
Discourse Structure (3)
Multi-satellite construction (non-binary tree)
1-4
  1 [Motivation] P.S. The enclosed cards are a thank you gift for reading my letter about the malaria epidemic in Africa.
  2-3
    2 Help us now in our fight against malaria
    3 [Means] by donating today
  4 [Justify] - within an hour more than 120 children will die needlessly from this deadly disease.
Restriction to binary trees yields implausible analyses
Binary alternative 1:
1-4
  1 [Motivation] P.S. The enclosed cards are a thank you gift for reading my letter about the malaria epidemic in Africa.
  2-4
    2-3
      2 Help us now in our fight against malaria
      3 [Means] by donating today
    4 [Justify] - within an hour more than 120 children will die needlessly from this deadly disease.
Binary alternative 2:
1-4
  1-3
    1 [Motivation] P.S. The enclosed cards are a thank you gift for reading my letter about the malaria epidemic in Africa.
    2-3
      2 Help us now in our fight against malaria
      3 [Means] by donating today
  4 [Justify] - within an hour more than 120 children will die needlessly from this deadly disease.
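The difference between the multi-satellite analysis and a binary one can be made concrete with a small tree structure. This is a minimal sketch, not the project's data format: the `Node` class and its field names are hypothetical, a node's `relation` is the relation it holds as a satellite (`None` for a nucleus or the root), and the EDU numbers 1-4 follow the diagram.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Node:
    span: tuple                      # (first EDU, last EDU) covered by this node
    relation: str | None = None      # relation held as a satellite; None for a nucleus
    children: list = field(default_factory=list)

    def satellites(self) -> list:
        """Direct children that attach to this span as satellites."""
        return [c for c in self.children if c.relation is not None]

    def is_binary(self) -> bool:
        """True if no node in the subtree has more than two children."""
        return len(self.children) <= 2 and all(c.is_binary() for c in self.children)

# Nucleus 2-3: EDU 3 is a Means satellite of EDU 2.
core = Node((2, 3), children=[Node((2, 2)), Node((3, 3), "Means")])

# Multi-satellite analysis: Motivation and Justify both attach directly to 2-3.
multi = Node((1, 4), children=[Node((1, 1), "Motivation"), core, Node((4, 4), "Justify")])

# One binary alternative: Justify attaches first, Motivation attaches to 2-4.
binary = Node((1, 4), children=[
    Node((1, 1), "Motivation"),
    Node((2, 4), children=[core, Node((4, 4), "Justify")]),
])

print(multi.is_binary(), binary.is_binary())
```

In the binary version one satellite necessarily scopes over the other, which is the ordering decision the non-binary tree avoids.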
Genre structure (1)
Move analysis (Upton and Cohen, 2009)
- Moves = functional components in the text
- Each genre has a particular set of move types
- The moves create a linear, non-hierarchical partition of the text
Genre structure (2)
Encyclopedia texts
- Name
- Define
- Describe
Fundraising letters
- Get attention
- Introduce the cause
- Establish credentials of organization
- Solicit response
- Offer incentive
- Reference insert
- Express gratitude
- Conclude with pleasantries
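Because moves form a linear, non-hierarchical partition, a move annotation can be checked for gaps and overlaps with a single pass. A minimal sketch, assuming moves are stored as (move type, first EDU, last EDU) triples with 1-based inclusive EDU numbers; the function name and the EDU ranges in the example are hypothetical, only the move type names come from the slides.

```python
def is_linear_partition(moves, n_edus):
    """Check that the moves cover EDUs 1..n_edus in order, without gaps or overlaps."""
    expected_start = 1
    for _move_type, first, last in moves:
        # Each move must start right after the previous one ends.
        if first != expected_start or last < first:
            return False
        expected_start = last + 1
    # The last move must end at the last EDU of the text.
    return expected_start == n_edus + 1

# Hypothetical annotation of a 10-EDU encyclopedia text.
encyclopedia = [("Name", 1, 1), ("Define", 2, 3), ("Describe", 4, 10)]
print(is_linear_partition(encyclopedia, 10))
```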
Genre structure (3)
Mapping the move structure onto RST structure
Genre structure (4)
Discourse Connectives (1)
Why annotate connectives?
- At least at intra-sentential level (but probably also across sentences), connectives should be valuable cues to coherence relations.
- Frequencies of connectives may differ between genres and thus provide a cue for genre classification.
- Genre information may help the parser by biasing the disambiguation of multifunctional connectives, e.g., toward a semantic meaning for expository texts and a pragmatic one for persuasive texts.
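The idea that connective frequencies differ between genres presupposes a per-text connective count. A minimal sketch of such a count follows; the connective list is a hypothetical English stand-in (the corpus itself is Dutch), the function name is an assumption, and matching is plain whole-word string matching with no disambiguation of multifunctional connectives.

```python
from collections import Counter

# Hypothetical English stand-ins; a real list would contain Dutch connectives.
CONNECTIVES = ("but", "because", "so that", "that is why")

def connective_counts(edus, connectives=CONNECTIVES):
    """Count whole-word occurrences of each connective across a list of EDUs."""
    counts = Counter()
    for edu in edus:
        # Lowercase, strip commas and periods, and pad with spaces for word matching.
        words = edu.lower().replace(",", " ").replace(".", " ").split()
        text = " " + " ".join(words) + " "
        for connective in connectives:
            counts[connective] += text.count(" " + connective + " ")
    return +counts  # unary plus drops connectives with a zero count

edus = [
    "With the help of research much has already been achieved.",
    "But to protect you and others from the consequences of diabetes",
    "more research is needed.",
    "That is why we keep asking for your support.",
]
print(connective_counts(edus))
```

Counts like these, normalized by text length, could then be compared across the expository and persuasive texts.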
Discourse Connectives (2)
(16) With the help of research much has already been achieved. (17) But to protect you and others from the consequences of diabetes (18) more research is needed. (19) That is why we keep asking for your support.
Lexical cohesion (1)
- Lexical cohesive items build up graph structures in the text
- For each lexical item, lexical links to items in preceding and following EDUs are identified
EDU5 [After the forming of the sun and the solar system, our star began its long existence as a so-called dwarf star]
EDU6 [In the dwarf phase of its life, the energy that the sun gives off is generated in its core through the fusion of hydrogen into helium.]
EDU7 [The sun is about five billion years]
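The graph built up by lexical links can be sketched as an undirected graph whose nodes are (EDU number, lexical item) pairs. This is only an illustration: the three links below are assumed for the example above and are not taken from the project's annotation, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative links between lexical items in EDU5-EDU7 (assumed, not annotated data).
links = [
    ((5, "sun"), (6, "sun")),
    ((5, "dwarf star"), (6, "dwarf phase")),
    ((6, "sun"), (7, "sun")),
]

def build_graph(links):
    """Build an undirected adjacency map over (EDU, item) nodes."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def linked_edus(graph, edu):
    """Return the sorted EDU numbers that share at least one lexical link with `edu`."""
    return sorted({
        other
        for (e, _item), neighbours in graph.items() if e == edu
        for (other, _other_item) in neighbours if other != edu
    })

graph = build_graph(links)
print(linked_edus(graph, 6))
```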
Annotations (1)
Segmentation
- Detailed manual with rules and examples
- Reliability: 25% of the material, K = 0.98 (fundraising letters and encyclopedia)
Coherence analysis (RST)
- Relation definitions as published on the RST website
- Consensus procedure: each final analysis is based on at least two independent first versions and intensive team discussion (Berzlánovich, Redeker, van der Vliet)
- Reliability: K = 0.88 for the spans, 0.82 for nuclearity and 0.57 for labeling
Annotations (2)
Genre structure (move analysis)
- Detailed manual
- Final analysis by consensus among two coders (Berzlánovich, Redeker)
- Reliability: K will be calculated on 20% of the corpus (4 texts per genre)
Lexical cohesion
- Detailed manual, training of the coders
- Final analysis by consensus among two coders (Berzlánovich with Rensema/Wagenaar)
- Reliability: K will be calculated on 20% of the corpus (4 texts per genre)
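The slides report reliability as K without naming the statistic. Assuming it is Cohen's kappa for two coders, the computation can be sketched as below; the function name and the toy boundary labels ("B" for boundary, "N" for no boundary) are hypothetical and are not project data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two coders, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each coder's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    # Undefined (division by zero) when both coders use one identical label throughout.
    return (observed - expected) / (1 - expected)

coder_a = ["B", "N", "N", "B", "N", "N", "N", "B"]
coder_b = ["B", "N", "N", "N", "N", "N", "N", "B"]
print(round(cohen_kappa(coder_a, coder_b), 3))
```

In the toy example the coders agree on 7 of 8 items, yet kappa is well below 0.875 because much of that agreement is expected by chance.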
Coherence, Cohesion, and Genre
Preliminary results on genre-sensitivity of coherence and cohesion
(comparing encyclopedia texts and fundraising letters)
Genre difference in coherence
- Presentational relations are much more frequent in persuasive texts than in expository texts.
Genre difference in cohesion
- Different discourse connectives in expository and persuasive texts.
- Systematic semantic relations are more frequent in expository texts than in persuasive texts.
Genre-specific interaction of coherence and cohesion
- Coherence and cohesion are closely aligned in expository texts, but not in persuasive texts.
Future plans
Automatic discourse parsing
- automatic segmentation (basic program already achieves good precision (0.72) and recall (0.75))
- determine the validity of genre, connectives, coreference and lexical cohesion relations as cues for the recognition of RST relations (using machine learning)
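Precision and recall for automatic segmentation can be computed by comparing predicted EDU boundaries with the manually annotated ones. A minimal sketch, assuming boundaries are represented as token offsets; the function names and the toy offsets are hypothetical, and the slides do not say how the reported scores were computed.

```python
def boundary_scores(predicted, gold):
    """Precision and recall of predicted EDU boundaries against gold boundaries."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: 3 of 4 predicted boundaries are correct, 3 of 5 gold boundaries are found.
print(boundary_scores({3, 7, 12, 15}, {3, 7, 10, 15, 20}))

# Harmonic mean of the precision and recall reported on the slide.
print(round(f1(0.72, 0.75), 2))
```

The harmonic mean of the reported values works out to roughly 0.73; that figure is derived here and is not reported in the slides.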
Thank you