The LIA Summarization Systems at DUC 2007


SLIDE 1

The LIA Summarization Systems at DUC 2007

florian.boudin@univ-avignon.fr

Laboratoire Informatique d’Avignon, France

co-authors: Frédéric Béchet, Marc El-Bèze, Benoit Favre, Laurent Gillard and Juan-Manuel Torres-Moreno

SLIDE 2

Outline

  • Main task

– Using a fusion process?
– Results
– Discussion

  • Update task

– Cosine maximization-minimization approach
– Novelty boosting
– Results
– Discussion

SLIDE 3

Main Task

SLIDE 4

How it works

  • Use of several different summarizers as sentence selection components

SLIDE 5

Using a fusion process?

  • Successful in other domains

– Classification
– Speaker Recognition

  • Robustness

– Small training dataset

  • Reliability

– Smoothing system performance variations

SLIDE 6

More summarizers

  • 5 systems in 2006, 7 systems in 2007

– (S1) MMR+LSA (2006 & 2007)
– (S2) Neo-Cortex (2006 & 2007)
– (S3) n-term with variable length insertion (2006 & 2007)
– (S4) LNU*LTC (2007)
– (S5) Okapi similarity (2007)
– (S6) Prosit similarity (2007)
– (S7) Compactness score (2006 & 2007)
– (S8) Passage retrieval (2006)

SLIDE 7

Fusion strategy

  • Combining each system's output

– Ranked sentence lists

  • Building a sentence graph

– Sentences weighted according to their ranks and scores

  • Output summary

– The best path in the graph
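
The deck describes the fusion as a weighted sentence graph whose best path is extracted with the AT&T FSM toolkit. The snippet below is only a rough, assumption-laden sketch of the underlying idea: it sums rank- and score-derived weights across systems and fills the summary greedily under a word limit; the graph/best-path machinery of the real system is not reproduced, and the `word_limit` value is purely illustrative.

```python
# Rough sketch of rank/score fusion (not the actual FSM best-path search).
from collections import defaultdict

def fuse(ranked_lists, word_limit=250):
    """ranked_lists: one ranked list of (sentence, score) pairs per summarizer."""
    weights = defaultdict(float)
    for ranking in ranked_lists:
        for rank, (sentence, score) in enumerate(ranking, start=1):
            # a sentence proposed early and with a high score accumulates more weight
            weights[sentence] += score / rank
    summary, length = [], 0
    for sentence, _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        n = len(sentence.split())
        if length + n > word_limit:
            break
        summary.append(sentence)
        length += n
    return summary
```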

SLIDE 8

Post-processing

  • Person name rewriting
  • Acronym rewriting
  • Redundancy removal

– Word overlap (see the sketch after this list)

  • Fusion, a second pass

– New sentence lengths, redundancy and rewriting are backpropagated
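
A minimal sketch of word-overlap redundancy removal; the overlap measure and the 0.5 cut-off below are assumptions for illustration, not values given in the deck.

```python
def word_overlap(a, b):
    """Share of the shorter sentence's words that also occur in the other sentence."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, min(len(wa), len(wb)))

def remove_redundant(sentences, max_overlap=0.5):
    # keep a sentence only if it does not overlap too much with already kept ones
    kept = []
    for s in sentences:
        if all(word_overlap(s, k) < max_overlap for k in kept):
            kept.append(s)
    return kept
```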

SLIDE 9

Results

Comparison between 2006 and 2007

SLIDE 10

Automatic evaluation

[Chart: automatic evaluation scores for the 7 individual systems, without fusion and with fusion]

SLIDE 11

Manual evaluation (1)

DUC 2006: our score 2.933 (mean 2.61, standard deviation 0.462)
DUC 2007: our score 2.78 (mean 2.542, standard deviation 0.288)

SLIDE 12

Manual evaluation (2)

  • Linguistic quality scores of our submission in 2006 and 2007
  • Unchanged linguistic processing module
  • Small difference between the two evaluations

SLIDE 13

Fusion - Conclusions

  • Outperforms the best individual system
  • Prevents overfitting
  • Toolkits available (we use the AT&T FSM toolkit)
  • Flexible
  • Parameter tuning using a development corpus

SLIDE 14

Update Task

SLIDE 15

Principle

  • Based on a very simple user-focused Multi-Document Summarizer (MDS)

– Similarity with the topic

  • Added features:

– Cross-summary redundancy removal

  • Cosine maximization-minimization

– Novelty boosting

  • Topic enrichment

SLIDE 16

How it works

SLIDE 17

A simple user-oriented MDS

  • Documents segmented into sentences
  • Sentences filtered and stemmed
  • Each sentence is scored in relation to the topic (see the sketch after this list)

– Cosine similarity
– tf.idf weights

  • Drawbacks

– Summaries do not inform the reader of new facts

  • Cross-summary redundancy removal techniques
  • Novelty boosting
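
A minimal sketch of the baseline scoring, assuming plain tf.idf vectors and cosine similarity against the topic; the real system's filtering, stemming and weighting details may differ. Sorting the returned pairs gives the candidate sentences for the summary.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """One sparse tf.idf vector (dict) per text, idf computed over these texts."""
    tokenized = [t.lower().split() for t in texts]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(tokenized)
    return [{w: c * math.log(n / df[w]) for w, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    num = sum(x * v.get(w, 0.0) for w, x in u.items())
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def score_sentences(sentences, topic):
    # score each sentence by its cosine similarity with the topic
    vecs = tfidf_vectors(sentences + [topic])
    topic_vec = vecs[-1]
    return [(cosine(v, topic_vec), s) for v, s in zip(vecs, sentences)]
```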

SLIDE 18

Two-step cosine maximization-minimization (1)

  • Improved sentence scoring method

– Cross-summary redundancy removal
– Maximize similarity(sentence | topic), minimize similarity(sentence | earlier summaries)
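
One plausible reading of the maximization-minimization score, reusing the cosine() helper from the previous sketch. How the two terms are actually combined in the system may differ; the product below is only an illustration.

```python
def maxmin_score(sentence_vec, topic_vec, early_summary_vecs):
    relevance = cosine(sentence_vec, topic_vec)                  # term to maximize
    redundancy = max((cosine(sentence_vec, v) for v in early_summary_vecs),
                     default=0.0)                                # term to minimize
    return relevance * (1.0 - redundancy)
```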

SLIDE 19

Two-step cosine maximization-minimization (2)

  • Limits

– All sentences are scored in relation to the same topic

  • Selected sentences are syntactically related

– Forces irrelevant sentences into the summary

  • Propose a novelty boosting technique

SLIDE 20

Novelty boosting

  • Point the summary toward the major novelty of the cluster

– Novelty in comparison to earlier clusters
– Extraction of high-weighted term lists

  • Topic enrichment using the unique terms

[Diagram: high-weighted terms from the earlier clusters' bags of words are used to boost/enrich the topic's bag of words]
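
A hedged sketch of the topic enrichment: high-weighted terms are extracted per cluster, and the terms of the current cluster that never rank high in the earlier clusters are appended to the topic. The raw-frequency weighting and the list size `k` below are assumptions, not values from the deck.

```python
from collections import Counter

def top_terms(cluster_sentences, k=50):
    """k most frequent terms of a cluster (a stand-in for 'high-weighted' terms)."""
    counts = Counter(w for s in cluster_sentences for w in s.lower().split())
    return {t for t, _ in counts.most_common(k)}

def enrich_topic(topic, current_cluster, earlier_clusters, k=50):
    seen_before = set()
    for cluster in earlier_clusters:
        seen_before |= top_terms(cluster, k)
    unique = top_terms(current_cluster, k) - seen_before
    # unique terms are appended to the topic so that novel content gets boosted
    return topic + " " + " ".join(sorted(unique))
```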

SLIDE 21

Example (Novelty boosting for cluster C summary)

[Diagram: high-weighted terms are extracted from clusters A, B and C; the terms unique to cluster C are added to the topic and fed to the summarization engine]

SLIDE 22

Summary construction (1)

  • Arranging the highest-scored sentences
  • No special order within the summary
  • Limit of 100 words
  • High probability of a truncated last sentence
  • Propose a better last sentence selection method

SLIDE 23

Summary construction (2)

Last sentence selection method:

– If remaining word count > 5

  • The after-last sentence is preferred if

– Its length is one third shorter
– Its score is greater than a threshold
  » Threshold obtained empirically

  • Otherwise the sentence is truncated

– Else produce a non-optimally sized summary
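
One possible reading of the rule above, written out as code. The interpretation of "after-last" (the next-ranked sentence), the reference point for "one third shorter", and the score threshold are all assumptions for illustration.

```python
def choose_last_sentence(last, after_last, remaining_words, score_threshold=0.3):
    """last / after_last: (sentence, score) pairs; remaining_words: word budget left."""
    if remaining_words <= 5:
        return None  # accept a slightly shorter, non-optimally sized summary
    sent, _ = last
    alt, alt_score = after_last
    # prefer the next-ranked sentence when it is a third shorter and still relevant
    if len(alt.split()) <= (2 / 3) * len(sent.split()) and alt_score > score_threshold:
        return alt
    # otherwise truncate the current candidate to the remaining word budget
    return " ".join(sent.split()[:remaining_words])
```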

SLIDE 24

Post-processing (1)

  • Within-summary redundancy removal

– Cosine similarity with threshold
– Threshold obtained empirically (~0.4)

  • Sentence Rewriting techniques

– Person name rewriting

  • Vice President Al Gore …
  • … Al Gore …

SLIDE 25

Post-processing (2)

  • Sentence Rewriting techniques

– Acronym rewriting

  • Massachusetts Institute of Technology …
  • … MIT …

– Removal of link words and “say” clauses

  • Moreover, the president is ...
  • ... said the judge.

– Punctuation cleanup
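
A toy sketch of one rewriting heuristic suggested by the examples above: keep the long form only for the first mention and shorten the later ones. Real person-name and acronym detection in the system is certainly more involved; the helper below is just an illustration.

```python
def shorten_later_mentions(text, long_form, short_form):
    """Keep the first occurrence of long_form, replace later ones with short_form."""
    first = text.find(long_form)
    if first == -1:
        return text
    cut = first + len(long_form)
    return text[:cut] + text[cut:].replace(long_form, short_form)

# e.g. shorten_later_mentions(summary, "Vice President Al Gore", "Al Gore")
#      shorten_later_mentions(summary, "Massachusetts Institute of Technology", "MIT")
```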

SLIDE 26

Experiments (1)

Automatic evaluations (ROUGE-2 and SU4) in relation to the number of extracted terms

  • Novelty boosting introduces “noise”
  • Enhances the readability

SLIDE 27

Experiments (2)

Automatic evaluations (ROUGE-2 and SU4) for each cluster of documents (A~10, B~8 and C~7 articles)

  • Enhances system stability and reliability

  • Non-optimal enrichment
  • Slight decrease with cluster B

SLIDE 28

Results at DUC 2007

SLIDE 29

Results (1)

The correlation between automatic evaluations (ROUGE-2 and SU4) and responsiveness scores

  • Responsiveness score = 2.633
  • Mean = 2.32
  • Standard deviation = 0.35
  • Poor sentence rewriting

SLIDE 30

Results (2)

Automatic evaluations (Basic Elements) for each system at DUC 2007

  • BE score = 0.0546
  • Mean = 0.0409
  • Standard deviation = 0.0139

SLIDE 31

Conclusion

  • Very simple approach
  • Summary quality enhanced across time
  • Novelty boosting
  • Helps prevent within-summary redundancy
  • Introduces “noise”
  • Language independent

SLIDE 32

What’s next?

  • Enhance the cross-summary redundancy removal process

– Change granularity

  • Consider previous sentences instead of summaries

  • Dynamic novelty boosting
  • Improve sentence rewriting techniques

SLIDE 33

Thank You!

Florian.boudin@univ-avignon.fr

co-authors: Frédéric Béchet, Marc El-Bèze, Benoit Favre, Laurent Gillard and Juan-Manuel Torres-Moreno

This work was partially supported by the Laboratoire de chimie organique de synthèse, FUNDP (Facultés Universitaires Notre-Dame de la Paix), Namur, Belgium