A Coreference Corpus and Resolution System for Dutch

SLIDE 1

A Coreference Corpus and Resolution System for Dutch

Iris Hendrickx, Gosse Bouma, Frederik Coppens, Walter Daelemans, Véronique Hoste, Geert Kloosterman, Anne-Marie Mineur, Joeri Van Der Vloet, Jean-Luc Verschelde

Marrakech, LREC 2008

SLIDE 2

COREA project: Coreference Resolution for Extracting Answers

URL: http://www.cnts.ua.ac.be/~iris/corea.html

Team:

  • University of Antwerp: Walter Daelemans, Iris Hendrickx, Véronique Hoste
  • University of Groningen: Gosse Bouma, Anne-Marie Mineur, Geert Kloosterman
  • Language & Computing N.V.: Jean-Luc Verschelde, Frederik Coppens, Joeri Van Der Vloet

SLIDE 3

Overview of the talk

  • Corea project
  • Corpus and annotation
  • Coreference resolution module
  • Evaluation
      – Effect on Question Answering
      – Effect on Information Extraction

SLIDE 4

Application-oriented approach

Many Natural Language Processing applications, such as Information Extraction and Automatic Summarization, require accurate identification of coreference relations between noun phrases.

Example:

Gas station collapses

Gas station Hoezaar next to highway A58 collapsed on Monday afternoon. The building came down after being hit by a truck with a flat tyre.

SLIDE 5

COREA Goals

  • Annotation guideline manual for Dutch
  • Annotated evaluation corpus of 100k words
  • Coreference resolution tool
  • Integration and evaluation of the tool in NLP applications: Information Extraction and Question Answering

SLIDE 6

Annotation

  • Coreference is restricted to names, pronouns, and noun phrases (NPs).
  • 200K words
  • Different text genres: newspaper, spoken language, medical domain; Dutch and Flemish

  • Different types of coreference relations

SLIDE 7

Types of Coreference

  • Identity (IDENT)

Xavier Malisse qualified for the semi-finals at Wimbledon. The Flemish tennis player will play against an unknown opponent.

  • Quantification (BOUND)

Everybody did what they could.

  • Superset – Subset (BRIDGE)

200 people died in that plane crash. Forty-six are buried here in this cemetery.

  • Predicative relations (PRED)

Michel Beuter is a writer.

  • Special cases: negation, modality, time dependency

SLIDE 8

Corpus statistics

Corpus      DCOI     CGN   MedEnc    Knack
#docs        105     264      497      267
#tokens   35,166  33,048  135,828  122,960
#ident     2,888   3,334    4,910    9,179
#bridge      310     649    1,772      n/a
#pred        180     199      289      n/a
#bound        34      15       19       43

SLIDE 9

Inter-annotator Agreement

Experiment: 2 annotators, 29 documents, approximately 500 relations

Relation   F-score
Ident      76%
Bridge     33%
Pred       56%
Bound       0%
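The slides do not say how the agreement F-score was computed. As a rough sketch, assuming each annotation is an (anaphor, antecedent) pair and that only exact matches count, one annotator can be treated as the reference and the other scored against it. The function name and the example pairs below are invented for illustration.

```python
def agreement_f_score(relations_a, relations_b):
    """F-score of annotator B's relations measured against annotator A's.

    Assumption: a relation is a hashable (anaphor, antecedent) pair and
    two relations agree only when they are identical.
    """
    set_a, set_b = set(relations_a), set(relations_b)
    if not set_a or not set_b:
        return 0.0
    overlap = len(set_a & set_b)
    precision = overlap / len(set_b)  # share of B's relations also made by A
    recall = overlap / len(set_a)     # share of A's relations also made by B
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example annotations, not taken from the corpus.
annotator_a = [("he", "Malisse"), ("the player", "Malisse"), ("it", "the match")]
annotator_b = [("he", "Malisse"), ("the player", "Malisse"),
               ("it", "Wimbledon"), ("they", "the fans")]

score = agreement_f_score(annotator_a, annotator_b)
print(round(score, 3))
```

Under this definition the score is symmetric: swapping the two annotators exchanges precision and recall but leaves the F-score unchanged.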

SLIDE 10

Visualization

SLIDE 11

Coreference resolution as classification task

Supervised Machine Learning approach

  • Identify the NPs in the text,
  • Link every NP to the previous NPs,
  • Step one: classify each pair as coreferential or not
  • Step two: make coreference chain of positive pairs
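The two steps above can be sketched as follows. This is a minimal illustration, not the COREA module itself: `toy_classifier` is a hypothetical word-overlap rule standing in for the trained classifier, the chain-building step uses a simple union-find merge (one possible way to group positive pairs), and the mention "the gas station" is invented for the example.

```python
STOPWORDS = {"the", "a", "an"}


def mention_pairs(mentions):
    """Pair every mention index with the index of each preceding mention."""
    return [(i, j) for j in range(len(mentions)) for i in range(j)]


def toy_classifier(antecedent, anaphor):
    """Hypothetical stand-in for a trained pairwise classifier:
    positive if the two mentions share a non-stopword token."""
    words_a = set(antecedent.lower().split()) - STOPWORDS
    words_b = set(anaphor.lower().split()) - STOPWORDS
    return bool(words_a & words_b)


def build_chains(mentions, classify):
    """Step one: classify each pair. Step two: merge positive pairs into chains."""
    parent = list(range(len(mentions)))

    def find(x):
        # Follow parent links to the chain representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in mention_pairs(mentions):
        if classify(mentions[i], mentions[j]):
            parent[find(j)] = find(i)

    chains = {}
    for idx, mention in enumerate(mentions):
        chains.setdefault(find(idx), []).append(mention)
    # Mentions without any coreferent partner are not reported as chains.
    return [chain for chain in chains.values() if len(chain) > 1]


mentions = ["Gas station Hoezaar", "highway A58", "the gas station", "a truck"]
print(build_chains(mentions, toy_classifier))
```

Linking "The building" back to the gas station, as in the earlier news example, is exactly the kind of case a surface-overlap rule misses and a learned classifier is meant to handle.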

SLIDE 12

Effect on Question Answering

Evaluation with the Dutch QA system Joost

The Fact Extractor extracts answers to frequent questions off-line, based on manually developed patterns:

  • Who was born when?
  • Which city is the capital of which country?

Example

Fact type: What number of inhabitants for Location?
Sentence: The village has 10.000 inhabitants
→ resolve the antecedent of "the village" to extract the fact
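A small sketch of why the antecedent matters for this fact type. Everything here is an assumption made for illustration: the regular expression is a hypothetical English pattern (Joost's real patterns are not shown on the slides), the antecedent lookup is a plain dictionary standing in for the coreference module, and the place name "Dorpstad" is invented.

```python
import re

# Hypothetical pattern for the fact type "number of inhabitants for Location".
INHABITANTS = re.compile(r"^(?P<location>.+?) has (?P<count>[\d.]+) inhabitants")


def extract_fact(sentence, antecedents):
    """Return (location, inhabitants), or None if the pattern does not match.

    `antecedents` maps a lower-cased anaphoric NP to its resolved antecedent;
    it stands in for the output of a coreference resolver.
    """
    match = INHABITANTS.match(sentence)
    if match is None:
        return None
    subject = match.group("location")
    # Replace the anaphor by its antecedent when one is known.
    location = antecedents.get(subject.lower(), subject)
    # "10.000" uses the Dutch thousands separator.
    count = int(match.group("count").replace(".", ""))
    return (location, count)


sentence = "The village has 10.000 inhabitants"
print(extract_fact(sentence, {}))                           # subject left unresolved
print(extract_fact(sentence, {"the village": "Dorpstad"}))  # invented antecedent
```

Without resolution the extracted fact is attached to "The village" and cannot answer a question about a named location; with the antecedent filled in, the same pattern yields a usable fact.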

SLIDE 13

Effect on Question Answering

Coreference information (rule-based) in the Fact Extractor

More facts are extracted: from 93K to 145K

How many questions are answered correctly?

Variant   Accuracy
without   65.0%
with      70.0%

Table 1: Percentage of correctly answered questions on the QA@CLEF 2005 test set.

SLIDE 14

Effect on Information Extraction

The Relation Finder predicts medical semantic relations. It is based on the Spectrum Medical Encyclopedia, annotated with medical concepts and the relations between them. (The corpus was developed in the IMIX Rolaquad project.)

  • Medical concepts: con_disease, con_person, con_treatment
  • Relations: rel_is_symptom_of, is_cause_of, rel_treats

SLIDE 15

Relation Finder

  • Core: Maximum Entropy Modeling algorithm
  • Trained on 2000 encyclopedia entries
  • Tested on two test sets of 50 and 500 different entries
  • Evaluated with and without coreference information as predicted by our module

SLIDE 16

Effect on Information Extraction

Results with and without coreference information:

Test set      Without   With
Small (50)    53.03     53.51
Big (500)     59.15     59.60

Table 2: F-scores of the Relation Finder.

SLIDE 17

Conclusions

  • Current results show a marginal but positive effect
  • More work is needed to refine our approach

SLIDE 18

Future Plans

  • Groningen: improving the coreference resolution module in the QA system Joost
  • Antwerp: DEASO project: multi-document summarization

SLIDE 19

Thanks for your attention.
