User-focused Multi-document Summarization with Paragraph Clustering


SLIDE 1

User-focused Multi-document Summarization with Paragraph Clustering and Sentence-type Filtering

Yohei Seki†, Koji Eguchi†,††, and Noriko Kando†,††

† The Graduate University for Advanced Studies
†† National Institute of Informatics

NTCIR Workshop 4 Meeting, June 2, 2004

SLIDE 2

Talk Outline

1. Objective: User-focused Summarization
2. Analysis: Compare Paragraph Clustering-based Summarization Strategies
3. Proposal: Responsiveness Improvement with Sentence-type Filtering for each Cluster
4. Conclusions

SLIDE 3

Objective: User-focused Summarization

  • Two goals

1. User-focused interactive summarization for topical requirements
  • Approach: Paragraph Clustering-based Summarization
2. To produce knowledge-focused summaries (evaluate with question-answering responsiveness)
  • Approach: Sentence-type Filtering

SLIDE 4

Viewpoint (= Topic + Summary Type)-Specified Summarization

[Figure: diagram labeled "Topics + Summary Types", "Document Sets", "Extract Sentences", "Different Summaries By Different Information Needs", "User A (Opinion!)", "User B (Knowledge!)", and "Does not Match Information Needs" (×)]

SLIDE 5

Multi-Document Summarization with Document Clustering

"Document clustering techniques" partition a set of objects into clusters

  • Closely associated documents tend to be relevant to the same request [cluster hypothesis]
  • Extract one or two representative elements (sentences) from each cluster to produce summaries
  • Topical Requirements: Select sentences from clusters in order of each cluster's similarity to the queries

SLIDE 6

Talk Outline

1. Objective: User-focused Summarization
2. Analysis: Compare Paragraph Clustering-based Summarization Strategies
3. Proposal: Responsiveness Improvement with Sentence-type Filtering for each Cluster
4. Conclusions

SLIDE 7

Comparison: Paragraph Clustering-based Summarization Strategies

  • Six clustering options

1. Cluster units
2. Features and Cluster Similarities
3. Clustering algorithm
4. Cluster size
5. Sentence extraction clues
6. Queries

SLIDE 8

1. Cluster Units: Paragraph

Related Work: Clustering for Summarization

  • Stein et al. (1999): Cluster source documents by single document summaries
  • M. Moens (2000): Cluster source documents by paragraph units
  • Boros et al. (2001): Cluster source documents by sentence units

Our approach (interactive summarization)

  • Sentence features were too sparse to make feature vectors
  • Document sizes were too small compared to summary sizes

⇒ Cluster source documents by paragraph units

SLIDE 9

2. Feature and Cluster Distance

Vector-length normalization does not work well for short documents (paragraphs in this research).

1. Feature vector
  • Normalized term frequency vs unnormalized (raw) term frequency
2. Cluster distance measure
  • Euclidean vs cosine

             TF, Euclidean   1 - cos θ   Normalized TF, Euclidean
Coverage     0.358           0.37        0.317
Precision    0.522           0.398       0.429

Unnormalized TF with Euclidean distance performed significantly better.
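The interaction between length normalization and the distance measure can be illustrated with a minimal sketch. It assumes that "normalized term frequency" means scaling each vector to unit Euclidean length; the toy vectors and the helper names (`l2_normalize`, `euclidean`, `cosine_distance`) are hypothetical and are not taken from the system described in the talk.

```python
import math

def l2_normalize(vec):
    """Scale a term-frequency vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cos(theta) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

# Hypothetical raw term counts for a short and a longer paragraph
short_par = [1, 0, 1, 0]
long_par = [4, 0, 4, 1]

raw_euclid = euclidean(short_par, long_par)
norm_euclid = euclidean(l2_normalize(short_par), l2_normalize(long_par))
raw_cos = cosine_distance(short_par, long_par)
norm_cos = cosine_distance(l2_normalize(short_par), l2_normalize(long_par))
```

In this sketch the cosine distance is the same before and after normalization, while the Euclidean distance shrinks once paragraph length is removed. That is one way to see why the two feature settings only differ under the Euclidean measure.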

SLIDE 10

3. Cluster Algorithm: Ward's Method

Compare three agglomerative clustering methods: complete-link, group-average, and Ward's method

             Complete Link   Group Average   Ward's method
Coverage     0.358           0.314           0.364
Precision    0.522           0.499           0.518

Summaries produced with "Ward's method" performed significantly better than those produced with the "group average method".
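The three agglomerative methods compared above differ only in how the cost of merging two clusters is defined. The following is a minimal sketch under that textbook definition, not the implementation used in the talk; the function names, the toy vectors, and the naive search over all cluster pairs are assumptions chosen for readability.

```python
from itertools import combinations

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def merge_cost(a, b, method):
    """Dissimilarity between clusters a and b (lists of vectors)."""
    if method == "complete":
        # Distance between the two farthest members
        return max(sq_dist(p, q) for p in a for q in b) ** 0.5
    if method == "average":
        # Mean distance over all cross-cluster pairs
        return sum(sq_dist(p, q) ** 0.5 for p in a for q in b) / (len(a) * len(b))
    if method == "ward":
        # Increase in within-cluster sum of squares caused by the merge
        return len(a) * len(b) / (len(a) + len(b)) * sq_dist(centroid(a), centroid(b))
    raise ValueError(f"unknown method: {method}")

def agglomerate(vectors, k, method):
    """Merge the cheapest pair of clusters until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        def cost(pair):
            a = [vectors[m] for m in clusters[pair[0]]]
            b = [vectors[m] for m in clusters[pair[1]]]
            return merge_cost(a, b, method)
        i, j = min(combinations(range(len(clusters)), 2), key=cost)
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return [sorted(c) for c in clusters]

# Hypothetical raw term-frequency vectors for five paragraphs
paragraphs = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
results = {m: agglomerate(paragraphs, 3, m) for m in ("complete", "average", "ward")}
```

On this well-separated toy input all three methods return the same grouping of paragraph indices. On real paragraph vectors the merge order, and therefore the clusters, can differ between methods.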

SLIDE 11

4. Cluster Size

Change cluster size according to number of sentences extracted

Cluster # for long summaries     ×1       ×1.5     ×2
Cluster # for short summaries    ×1.5     ×2       ×2.5
Coverage                         0.364    0.357    0.353
Precision                        0.518    0.543    0.565

Small cluster size performs better, but the improvement is not significant.
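One possible reading of the multipliers above is that the number of clusters is set to a multiple of the number of sentences to extract. The sketch below follows that reading; the function name `cluster_count`, the rounding up, and the cap at the number of paragraphs are assumptions for illustration.

```python
import math

def cluster_count(num_sentences, multiplier, num_paragraphs):
    """Number of clusters for a summary of `num_sentences` sentences.

    Rounds up (assumption) and never exceeds the number of paragraphs,
    since each cluster needs at least one paragraph.
    """
    k = math.ceil(num_sentences * multiplier)
    return max(1, min(k, num_paragraphs))

# Hypothetical settings: a 10-sentence long summary and a 5-sentence short summary
long_k = cluster_count(10, 1.5, 100)
short_k = cluster_count(5, 2.5, 100)
capped_k = cluster_count(10, 2, 12)
```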

SLIDE 12

5. Sentence Extraction Clues

Compare summarization with three sentence extraction clues:

Title             Yes      Yes      No       Yes
Term Frequency    Yes      Yes      Yes      No
Position          No       Yes      No       No
Coverage          0.339    0.322    0.338    0.315
Precision         0.614    0.66     0.613    0.623

Position weighting did not work well.
Title weighting effect was not clear.
Term Frequency performed well.
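Switching extraction clues on and off, as in the comparison above, can be sketched as a scoring function with one flag per clue. The weighting scheme here (average term frequency, a count of title-term matches, and a linear position bonus) is entirely hypothetical; the talk does not specify how the clues were combined.

```python
from collections import Counter

def score_sentence(tokens, index, total, term_freq, title_terms,
                   use_title=True, use_tf=True, use_position=False):
    """Score one tokenized sentence with the enabled clues (hypothetical weights)."""
    score = 0.0
    if use_tf:
        # Average collection frequency of the sentence's terms
        score += sum(term_freq[t] for t in tokens) / max(len(tokens), 1)
    if use_title:
        # Number of sentence terms that also occur in the title
        score += sum(1 for t in tokens if t in title_terms)
    if use_position:
        # Earlier sentences receive a larger bonus
        score += 1.0 - index / total
    return score

def rank_sentences(sentences, title_terms, **clues):
    """Return sentence indices ordered from highest to lowest score."""
    term_freq = Counter(t for s in sentences for t in s)
    scores = [score_sentence(s, i, len(sentences), term_freq, title_terms, **clues)
              for i, s in enumerate(sentences)]
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

# Hypothetical tokenized sentences and title terms
sentences = [["storm", "storm", "damage", "reported"],
             ["quake", "hits", "city"],
             ["storm", "damage", "was", "severe"]]
title_terms = {"quake", "city"}

tf_only = rank_sentences(sentences, title_terms, use_title=False)
title_and_tf = rank_sentences(sentences, title_terms)
```

With these toy inputs the title clue promotes the second sentence to the top, while term frequency alone favors the first. This shows how each clue can change which sentence is extracted.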

SLIDE 13

6. Queries

Compare cluster ordering using Queries and cluster ordering using Total Frequencies

Cluster Ordering Similarity    to Queries    to Total Frequencies
Coverage                       0.364         0.337
Precision                      0.518         0.45

With queries, coverage improved by 0.02 to 0.03.
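The two cluster orderings compared above can be sketched as ranking clusters by similarity to a reference vector, which is either the query or the total term frequencies of the document set. Cosine similarity and the names `cosine` and `order_clusters` are assumptions made for this sketch; the talk does not state which similarity function was used.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def order_clusters(clusters, reference):
    """Return cluster indices sorted by decreasing similarity to `reference`."""
    return sorted(range(len(clusters)),
                  key=lambda i: cosine(clusters[i], reference),
                  reverse=True)

# Hypothetical term frequencies for three paragraph clusters
clusters = [Counter({"election": 5, "vote": 4, "party": 6}),
            Counter({"shuttle": 3, "launch": 2}),
            Counter({"nasa": 1, "shuttle": 1})]

query = Counter(["shuttle", "launch"])
total = sum(clusters, Counter())  # total frequencies over all clusters

by_query = order_clusters(clusters, query)
by_total = order_clusters(clusters, total)
```

Ordering by the query puts the clusters that mention its terms first, whereas ordering by total frequencies favors the largest topic regardless of what the user asked for.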

SLIDE 14

Talk Outline

1. Objective: User-focused Summarization
2. Analysis: Compare Paragraph Clustering-based Summarization Strategies
3. Proposal: Responsiveness Improvement with Sentence-type Filtering for each Cluster
4. Conclusions

SLIDE 15

Five Sentence-types to Improve User's Requirements

We annotate five sentence-types automatically.

Two Topical Types

  • Main Description
  • Elaboration

Three Functional Types

  • Background
  • Opinion
  • Prospective

SLIDE 16

Sentence-type Filtering with Paragraph Clustering-based Summarization

1. The most heavily weighted sentence in each cluster was extracted.
2. For the second and third most heavily weighted sentences in each cluster, the sentence-type information was checked.
   A) The sentence type was compared with that of the most heavily weighted sentence in the same cluster to check for redundancy.
   B) If the sentence type was not redundant, we extracted the sentence to produce summaries.
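The filtering steps above can be sketched for a single cluster as follows. This is a minimal illustration that assumes "redundant" means "has the same sentence type as the most heavily weighted sentence"; the tuple layout and the name `filter_cluster` are hypothetical.

```python
def filter_cluster(sentences):
    """Select sentences from one cluster.

    `sentences` is a list of (weight, sentence_type, text) tuples.
    The top-weighted sentence is always kept; the second and third are
    kept only if their type differs from the top sentence's type
    (assumed reading of the redundancy check).
    """
    ranked = sorted(sentences, key=lambda s: s[0], reverse=True)
    if not ranked:
        return []
    top = ranked[0]
    selected = [top]
    for candidate in ranked[1:3]:
        if candidate[1] != top[1]:
            selected.append(candidate)
    return [text for _, _, text in selected]

# Hypothetical cluster with weights and automatically annotated sentence types
cluster = [(0.9, "Main Description", "A"),
           (0.7, "Main Description", "B"),
           (0.5, "Prospective", "C"),
           (0.4, "Opinion", "D")]

summary_sentences = filter_cluster(cluster)
```

Here sentence "B" is dropped because it repeats the type of the top sentence, "C" is kept because it adds a "Prospective" sentence, and "D" is never considered because only the second and third sentences are checked.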

SLIDE 17

Analysis: Which sentence-type improved the responsiveness to Questions?

                                               Responsiveness
ID: L/S    Topic                               Base     Filtering    Type
31: L      Fossil in Ethiopia                  0.2      0.3          Prospective
41: S      Nakata movement                     0.273    0.364        Prospective
45: L      Company subsidiary move             0.214    0.286        Prospective
51: S      Neutron                             0.444    0.556        Prospective
56: L      Mistake in Entrance Examination     0.545    0.636        Prospective
57: S      Space Shuttle                       0.38     0.385        Prospective
63: L      Ancient tomb                        0.364    0.455        Opinion

"Prospective"-type sentences improved responsiveness for event topics that described forecasts about the future.

SLIDE 18

Talk Outline

1. Objective: User-focused Summarization
2. Analysis: Compare Paragraph Clustering-based Summarization Strategies
3. Proposal: Responsiveness Improvement with Sentence-type Filtering for each Cluster
4. Conclusions

SLIDE 19

Conclusions

For NTCIR-4 TSC3, we focused on multi-document summarization from two different aspects:

1. Paragraph Clustering Techniques for Topical Information Requirements
  • Compared several parameters
  • Ward's method, unnormalized TF, and Euclidean distance performed best
  • A cluster size of ×1 to ×1.5 the number of sentences performed best
2. Sentence-type Filtering to Improve the Responsiveness to Questions
  • Extracting the most important sentence and a "Prospective"-type sentence from each cluster improved responsiveness for several topics

SLIDE 20

Thank you for your attention!