A Contextual Query Expansion Approach by Term Clustering for Robust Text Summarization


SLIDE 1

A Contextual Query Expansion Approach by Term Clustering for Robust Text Summarization
Massih Amini and Nicolas Usunier
April 26th, 2007

Université Pierre et Marie Curie (Paris 6)

Laboratoire d’Informatique de Paris 6

SLIDE 2

LIP6 summarizer

[System diagram: Documents → Preprocessings → vocabulary → term clusters G1, ..., Gi, ...; Topic θ (title Tθ, question Qθ) → Alignment → Sentence features → Combination → Postprocessing]

SLIDE 3

Term clustering

[Same system diagram, with the Term clustering step highlighted]

SLIDE 4

Term clustering

Hypotheses:

  • Words occurring in the same context with the same frequency are topically related (context ≡ document),
  • Each term is generated by a mixture density,
  • Each term of the vocabulary V belongs to one and only one term cluster → to each term wi we associate a class-indicator vector ti = {thi}h.

$$p(\vec{w} \mid \Theta) = \sum_{k=1}^{K} \pi_k \; p(\vec{w} \mid c_k, \theta_k)$$

and

$$\forall w_i \in V,\ \forall k: \quad y_i = k \iff t_{ki} = 1 \ \text{and} \ t_{hi} = 0 \ \ \forall h \neq k$$

SLIDE 5

Term clustering (2)

Each vocabulary term w is represented as a bag-of-documents vector:

$$\vec{w} = \big( tf(w, d_i) \big)_{i \in \{1, \dots, n\}}$$

Term clustering is performed using the CEM algorithm.
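As a minimal sketch (not the authors' code), such bag-of-documents vectors can be built from raw term frequencies; `docs` is an assumed list of already preprocessed token lists:

```python
from collections import Counter

def term_document_vectors(docs):
    """Represent each vocabulary term w as the vector (tf(w, d_1), ..., tf(w, d_n))."""
    counts = [Counter(tokens) for tokens in docs]            # tf(w, d_i) for each document d_i
    vocabulary = sorted({w for c in counts for w in c})      # the vocabulary V
    return {w: [c[w] for c in counts] for w in vocabulary}

# toy usage on three tiny "documents"
vectors = term_document_vectors([["water", "supply", "water"],
                                 ["water", "shortage"],
                                 ["supply", "projects"]])
print(vectors["water"])   # -> [2, 1, 0]
```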

SLIDE 6

Term clustering (3): CEM algorithm

  • Input: an initial partition C(0) is chosen at random and the class-conditional probabilities are estimated on the corresponding classes.
  • Repeat until convergence of the complete-data log-likelihood:
    • E-step: estimate the posterior class probability that each term wj belongs to Ck,
    • C-step: assign each term to the cluster with maximal posterior probability according to the previous step,
    • M-step: estimate the new mixture parameters which maximize the complete-data log-likelihood.
  • Output: term clusters.
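The slide does not fix the class-conditional model, so as a hedged sketch we assume a multinomial mixture over the bag-of-documents vectors; the E/C/M steps follow the loop above:

```python
import numpy as np

def cem(X, K, n_iter=100, seed=0, eps=1e-10):
    """Classification EM for a multinomial mixture.
    X: (|V|, n) array of tf(w, d_i) counts, one row per vocabulary term."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=X.shape[0])                  # random initial partition C(0)
    for _ in range(n_iter):
        # (M-step) estimate the mixture parameters on the current partition
        pi = np.array([(labels == k).mean() for k in range(K)])             # cluster priors
        theta = np.array([X[labels == k].sum(axis=0) + eps for k in range(K)])
        theta /= theta.sum(axis=1, keepdims=True)                           # P(d_i | cluster k)
        # (E-step) posterior class log-probability of each term, up to a constant
        log_post = np.log(pi + eps) + X @ np.log(theta).T
        # (C-step) assign each term to the cluster with maximal posterior
        new_labels = log_post.argmax(axis=1)
        if np.array_equal(new_labels, labels):                 # partition (and likelihood) stable
            break
        labels = new_labels
    return labels

# e.g. clusters = cem(np.array(list(vectors.values()), dtype=float), K=50)
```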

SLIDE 7

Term clustering (4): examples

  • D0714, term cluster containing Napster: digital trade act format drives allowed illegally net napster search stored alleged released musical electronic internet signed intended idea billions distribution exchange mp3 music songs tool
  • D0728, term cluster containing Interferon: depression interferon antiviral protein drug ribavirin combination people hepatitis liver disease treatment called doctors cancer epidemic flu fever schering plough corp
  • D0705, term cluster containing Basque and separatism: basque people separatist armed region spain separatism eta independence police france batasuna nationalists herri bilbao killed

SLIDE 8

Sentence alignment

[Same system diagram, with the sentence Alignment step highlighted]

SLIDE 9

Sentence alignment

Aim: Remove non-informative sentences from each topic (those unlikely to contain the answer to the topic question).

Hypothesis: Sentences containing the answer to the topic question are those which have the maximal semantic similarity with the question.

Tool: Marcu's alignment algorithm (Marcu 99).

SLIDE 10

Sentence alignment: the algorithm (2)

  • Input: the topic question and a document
  • Repeat until the similarity between the question and the remaining set of sentences decreases:
    • Remove the sentence from the current set whose removal maximizes the similarity between the question and the rest of the sentences
  • Output: the set of candidate sentences

$$Sim(S, Q) = \frac{\sum_{w \in S \cap Q} c(w, S)\, c(w, Q)}{\sqrt{\sum_{w \in S} c(w, S)^2}\ \sqrt{\sum_{w \in Q} c(w, Q)^2}}, \qquad c(w, Z) = tf(w, Z) \times \log df(w)$$
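A rough sketch of the greedy removal loop with the similarity above, assuming plain term counts for c(w, ·) (i.e. without the log df(w) factor) to keep it short:

```python
import math
from collections import Counter

def sim(tokens, question_tokens):
    """Cosine-style similarity between a set of sentences (flattened) and the question."""
    s, q = Counter(tokens), Counter(question_tokens)
    num = sum(s[w] * q[w] for w in s.keys() & q.keys())
    den = math.sqrt(sum(v * v for v in s.values())) * math.sqrt(sum(v * v for v in q.values()))
    return num / den if den else 0.0

def align(sentences, question):
    """Greedily drop the sentence whose removal most improves similarity to the question."""
    current = list(sentences)                                 # each sentence is a token list
    best = sim([w for s in current for w in s], question)
    while len(current) > 1:
        scored = [(sim([w for s in current[:i] + current[i + 1:] for w in s], question), i)
                  for i in range(len(current))]
        top, i = max(scored)
        if top < best:                                        # similarity starts to decrease: stop
            break
        best, current = top, current[:i] + current[i + 1:]
    return current                                            # candidate sentences
```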

SLIDE 11

Sentence alignment: the behavior (3)

SLIDE 12

Sentence alignment: filtered word distribution (4)

SLIDE 13

Remaining sentences in some documents of topic D0708

Question D0708: What countries are having chronic potable water shortages and why?

Document: XIE19970212.0042

Before: Tadesse said 18 water supply projects are underway at various stages, adding that one of such projects involved the sinking of 25 wells at Akaki, about 20 kilometers from Addis Ababa, which will supply 75,000 cubic meters of water daily to the capital city. Currently, the authority supplies only 60 percent of the city's potable water demand. According to a report here today, the announcement was made by Tadesse Kebede, general manager of the authority. The Addis Ababa Regional Water and Sewerage Authority announced that the shortage of potable water in the capital city of Ethiopia will be solved in the last quarter of this year.

After: The Addis Ababa Regional Water and Sewerage Authority announced that the shortage of potable water in the capital city of Ethiopia will be solved in the last quarter of this year. Tadesse said 18 water supply projects are underway at various stages, adding that one of such projects involved the sinking of 25 wells at Akaki, about 20 kilometers from Addis Ababa, which will supply 75,000 cubic meters of water daily to the capital city.

SLIDE 14

Sentence features and combination

[Same system diagram, with the Sentence features and Combination steps highlighted]

SLIDE 15

Sentence features

From the topic title Tθ and question Qθ we derived 3 queries:

  • q1 = question keywords,
  • q2 = question keywords expanded with their word clusters,
  • q3 = title keywords expanded with their word clusters.

Features: [table of per-sentence features shown on the slide]
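To make the expansion step concrete, a small sketch (hypothetical helper, not the authors' code): each keyword pulls in the whole term cluster it belongs to:

```python
def expand(keywords, clusters):
    """Expand a keyword set with the term clusters that contain at least one of its keywords."""
    expanded = set(keywords)
    for cluster in clusters:                     # clusters: iterable of term sets (e.g. CEM output)
        if set(keywords) & cluster:
            expanded |= cluster
    return expanded

question_keywords = {"potable", "water", "shortages"}
clusters = [{"water", "potable", "wells", "supply"}, {"napster", "mp3", "music"}]
q1 = question_keywords                           # q1 = question keywords
q2 = expand(question_keywords, clusters)         # q2 = keywords expanded with their word clusters
print(sorted(q2))   # ['potable', 'shortages', 'supply', 'water', 'wells']
```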

SLIDE 16

Combination: why?

Spearman rank-order correlation:

Object       1    2    ...   n
Rank Sys1    r1   r2   ...   rn
Rank Sys2    s1   s2   ...   sn

$$\mathrm{Spearman}(Sys_1, Sys_2) = \frac{\mathrm{Cov}(r, s)}{\sigma_r \, \sigma_s} = 1 - \frac{6 \sum_{i=1}^{n} (r_i - s_i)^2}{n(n^2 - 1)}$$
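For instance, the formula above on two toy rankings (plain Python, no ties assumed):

```python
def spearman(r, s):
    """Spearman rank-order correlation between two rankings of the same n objects."""
    n = len(r)
    return 1 - 6 * sum((ri - si) ** 2 for ri, si in zip(r, s)) / (n * (n ** 2 - 1))

# ranks given by two systems (features) to the same five sentences
print(spearman([1, 2, 3, 4, 5], [1, 3, 2, 5, 4]))   # 0.8: the two rankings largely agree
```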

SLIDE 17

Combination by learning

We have developed a learning-based ranking model for extractive summarization:

Amini M.-R., Tombros A., Usunier N., Lalmas M. Learning-Based Summarization of XML Documents. Journal of Information Retrieval (2007), to appear.

For learning we need a training set in which a class label is available for each sentence of each topic.

We constructed a training set by labeling the sentences with the highest ROUGE-2 average F-measure as relevant to the summary.

This strategy sounds good, but it did not work.
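To illustrate the labeling step, a simplified stand-in for ROUGE-2 (plain bigram-overlap F-measure, not the official scorer) that marks the top-scoring sentences of a topic as relevant:

```python
from collections import Counter

def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))

def rouge2_f(sentence, references):
    """Bigram-overlap F-measure of one candidate sentence against reference summaries."""
    cand = bigrams(sentence)
    ref = Counter()
    for summary in references:
        ref += bigrams(summary)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def label_topic(sentences, references, m):
    """Label the m highest-scoring sentences as relevant (1), the rest as irrelevant (0)."""
    ranked = sorted(sentences, key=lambda s: rouge2_f(s, references), reverse=True)
    top = {tuple(s) for s in ranked[:m]}
    return [(s, int(tuple(s) in top)) for s in sentences]
```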

SLIDE 18

Handcrafted weighting

We also tried to fuse the ranked lists obtained from each feature using the weighted Borda-fuse algorithm (Aslam & Montague, 2001).

We determined the combination weights for which we obtained the best ROUGE-2 average F-measure on DUC 2006.

This strategy did not work either.
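For reference, a minimal sketch of the weighted Borda-fuse rule tried above (hypothetical sentence ids; the weights would be those tuned on DUC 2006): each ranked list awards points proportional to its weight and to the position of the sentence.

```python
from collections import defaultdict

def weighted_borda_fuse(ranked_lists, weights):
    """Fuse feature-specific ranked lists: list i gives weights[i] * (n - rank) points."""
    scores = defaultdict(float)
    for ranking, w in zip(ranked_lists, weights):
        n = len(ranking)
        for rank, sent_id in enumerate(ranking):
            scores[sent_id] += w * (n - rank)
    return sorted(scores, key=scores.get, reverse=True)

# three feature-specific rankings of the same four sentences
fused = weighted_borda_fuse([["s1", "s2", "s3", "s4"],
                             ["s2", "s1", "s4", "s3"],
                             ["s1", "s3", "s2", "s4"]],
                            weights=[0.5, 0.3, 0.2])
print(fused)   # ['s1', 's2', 's3', 's4']
```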

SLIDE 19

Results

Average F-measure of ROUGE-2

SLIDE 20

Results (2)

Average F-measure of ROUGE-SU4

SLIDE 21

Conclusion

  • Query expansion by term clustering may help to solve complex NLP problems in a simple way,
  • The combination of features showed promising results,
  • It would be worthwhile to build proper training sets (for example, reference models obtained by manually extracting sentences for summaries).

SLIDE 22

Thank you