

SLIDE 1

Chinese Text Summarization Using A Trainable Summarizer and Latent Semantic Analysis

Jen-Yuan Yeh¹, Hao-Ren Ke², and Wei-Pang Yang¹

¹Department of Computer & Information Science, National Chiao-Tung University, Taiwan, R.O.C.
²Digital Library & Information Section of Library, National Chiao-Tung University, Taiwan, R.O.C.

SLIDE 2

Outline

  • Introduction and related works
  • Modified Corpus-based approach
  • LSA-based Text Relationship Map approach
  • Evaluations
  • Conclusions

SLIDE 3

Outline

  • Introduction and related works
  • Modified Corpus-based approach
  • LSA-based Text Relationship Map approach
  • Evaluations
  • Conclusions

SLIDE 4

Text summarization

The process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks) [Mani99].

[Figure: documents are turned into summaries at a given compression ratio through three phases: analysis, transformation, and synthesis.]

SLIDE 5

Corpus-based Approach:

A Trainable Document Summarizer [Kupiec95]

Each sentence s is scored by the probability that it belongs to the summary S, given its feature values f1, …, fk (assuming feature independence):

$$P(s \in S \mid f_1, f_2, \ldots, f_k) = \frac{P(s \in S)\,\prod_{j=1}^{k} P(f_j \mid s \in S)}{\prod_{j=1}^{k} P(f_j)}$$

[Figure: system architecture. Training phase: a feature extractor and a labeler convert the training corpus (source documents and their summaries) into labeled feature vectors, from which a learning algorithm derives rules. Test phase: rule application on the test corpus produces the machine-generated summary.]
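The formula above can be realized as a naive Bayes classifier over discretized sentence features. Below is a minimal Python sketch (not the authors' implementation; function names and the smoothing constant are illustrative), estimating each probability by counting over a labeled corpus:

```python
from collections import defaultdict

def train(corpus):
    """corpus: list of (features, in_summary) pairs, one per sentence,
    where `features` is a tuple of discretized feature values."""
    n = len(corpus)
    n_pos = sum(1 for _, lab in corpus if lab)
    prior = n_pos / n                                   # P(s in S)
    p_f = defaultdict(lambda: defaultdict(float))       # P(f_j = v)
    p_f_s = defaultdict(lambda: defaultdict(float))     # P(f_j = v | s in S)
    for feats, lab in corpus:
        for j, v in enumerate(feats):
            p_f[j][v] += 1 / n
            if lab:
                p_f_s[j][v] += 1 / n_pos
    return prior, p_f, p_f_s

def score(feats, prior, p_f, p_f_s, eps=1e-6):
    """P(s in S | f_1, ..., f_k) under the feature-independence assumption."""
    num, den = prior, 1.0
    for j, v in enumerate(feats):
        num *= p_f_s[j].get(v, eps)
        den *= p_f[j].get(v, eps)
    return num / den
```

Sentences would then be ranked by this score and the top ones kept according to the compression ratio.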

SLIDE 6

Text Relationship Map (T.R.M.) Approach:

Automated Text Structure and Summarization [Salton97]

[Figure: an example text relationship map; each node Pi is labeled with its bushiness (number of links), e.g. P1: 6, P2: 2, P3: 7, P4: 3, P5: 7, P6: 6, P7: 5, P8: 9, P9: 8, P10: 3, P11: 2.]

Three heuristic methods:

  • Global bushy path
  • Depth-first path
  • Segmented bushy path

Each node is represented as Pi = (k1, k2, …, kn); Pi and Pj are said to be connected when their vector similarity is greater than the threshold.

$$\mathrm{Sim}(P_i, P_j) = \frac{P_i \cdot P_j}{\lVert P_i \rVert\,\lVert P_j \rVert}$$
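A minimal sketch of map construction and the global bushy path (illustrative names; the threshold value is arbitrary, and cosine similarity follows the formula above):

```python
import numpy as np

def build_map(vectors, threshold=0.1):
    """Link every pair of nodes whose cosine similarity exceeds the threshold."""
    unit = [v / np.linalg.norm(v) for v in vectors]
    n = len(unit)
    return [[j for j in range(n) if j != i and unit[i] @ unit[j] > threshold]
            for i in range(n)]

def global_bushy_path(links, k):
    """Select the k bushiest nodes (most links) and return them in text order."""
    ranked = sorted(range(len(links)), key=lambda i: len(links[i]), reverse=True)
    return sorted(ranked[:k])
```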

SLIDE 7

Outline

  • Introduction and related works
  • Modified Corpus-based approach
  • LSA-based Text Relationship Map approach
  • Evaluations
  • Conclusions

SLIDE 8

Modified Corpus-based Approach

We use a "score function" to measure the significance of a sentence, whereas the original approach computes the probability that a sentence will be included in the summary.

$$\mathit{Score}_{Overall}(s) = w_1 \cdot \mathit{Score}_{f_1}(s) + w_2 \cdot \mathit{Score}_{f_2}(s) - w_3 \cdot \mathit{Score}_{f_3}(s) + w_4 \cdot \mathit{Score}_{f_4}(s) + w_5 \cdot \mathit{Score}_{f_5}(s)$$

where f1 represents "Position", f2 represents "Positive Keyword", f3 represents "Negative Keyword", f4 represents "Resemblance to the Title", f5 represents "Centrality", and wi indicates the importance of each feature.
( ) ( ) ( ) ( )

∏ ∏

= =

∈ ∈ = ∈

k j j k j j k

f P S s P S s f P f f f S s P

1 1 2 1

| ,..., , |
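A direct transcription of the score function (a sketch; `feature_fns` stands for the five per-feature scorers, which are hypothetical names here):

```python
def overall_score(s, weights, feature_fns):
    """Score_Overall(s) = w1*Score_f1(s) + w2*Score_f2(s) - w3*Score_f3(s)
                          + w4*Score_f4(s) + w5*Score_f5(s)."""
    signs = (+1, +1, -1, +1, +1)          # f3 (Negative Keyword) is subtracted
    return sum(sg * w * f(s) for sg, w, f in zip(signs, weights, feature_fns))
```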

SLIDE 9

f1: Position

For a sentence s, this feature-score is obtained as

$$\mathit{Score}_{f_1}(s) = P(s \in S \mid \mathit{Position}_i) \times \frac{\text{Average rank of } \mathit{Position}_i}{5}$$

where s comes from Position_i; the average rank is a five-level rank from 1 to 5, used to emphasize the significance of positions.
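As a worked form of the formula (hypothetical names; both factors would be estimated from the training corpus):

```python
def position_score(p_in_summary_given_position, avg_rank_of_position):
    """Score_f1(s) = P(s in S | Position_i) * (average rank of Position_i) / 5,
    where the rank lies on the five-level scale 1..5."""
    return p_in_summary_given_position * avg_rank_of_position / 5.0
```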

SLIDE 10

Word Aggregation for f2, f3, f4, and f5

Use word co-occurrence to reshape the word unit. Assume A, B, C, D, E are keywords and E is composed of B and C in order; if WC(B, C) > threshold, then replace B and C with E [Kowalski97]:

$$WC(B, C) = \frac{freq_E}{freq_B \times freq_C}$$

Example: in the sequence ABCD, when WC(B, C) > threshold, B and C are merged, giving AED; e.g. 個人 (personal) + 電腦 (computer) → 個人電腦 (personal computer).
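A minimal sketch of this aggregation over a token sequence (illustrative names; `freq` is assumed to map both single and merged units to corpus frequencies, and the threshold is arbitrary):

```python
def aggregate(tokens, freq, threshold=0.001):
    """Merge adjacent keywords B, C into E = B + C whenever
    WC(B, C) = freq[E] / (freq[B] * freq[C]) exceeds the threshold."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            merged = tokens[i] + tokens[i + 1]
            wc = freq.get(merged, 0) / (freq.get(tokens[i], 1) * freq.get(tokens[i + 1], 1))
            if wc > threshold:
                out.append(merged)      # e.g. 個人 + 電腦 -> 個人電腦
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out
```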

SLIDE 11

f2: Positive Keyword

For a sentence s, assume s contains Keyword1, Keyword2, …, Keywordn; this feature-score is obtained as

$$\mathit{Score}_{f_2}(s) = \sum_{k=1}^{n} c_k \cdot P(s \in S \mid \mathit{Keyword}_k)$$

where c_k is the number of occurrences of Keyword_k in s.

SLIDE 12

f3: Negative Keyword

For a sentence s, assume s contains Keyword1, Keyword2, …, Keywordn; this feature-score is obtained as

$$\mathit{Score}_{f_3}(s) = \sum_{k=1}^{n} c_k \cdot P(s \notin S \mid \mathit{Keyword}_k)$$

where c_k is the number of occurrences of Keyword_k in s.

SLIDE 13

f4: Resemblance to the Title

For a sentence s, this feature-score is obtained as

$$\mathit{Score}_{f_4}(s) = \frac{\lvert \text{Keywords in } s \cap \text{Keywords in Title} \rvert}{\lvert \text{Keywords in } s \cup \text{Keywords in Title} \rvert}$$

SLIDE 14

f5: Centrality

For a sentence s, this feature-score is obtained as

$$\mathit{Score}_{f_5}(s) = \frac{\lvert \text{Keywords in } s \cap \text{Keywords in other sentences} \rvert}{\lvert \text{Keywords in } s \cup \text{Keywords in other sentences} \rvert}$$
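The four keyword features translate directly into code. A sketch (illustrative names; the per-keyword probabilities are assumed to come from the training corpus, and f3 uses the complementary probability P(s ∉ S | Keyword_k)):

```python
def keyword_score(kw_counts, p_given_kw):
    """Score_f2 / Score_f3: sum over keywords in s of c_k * P(. | Keyword_k),
    where kw_counts maps each keyword in s to its count c_k."""
    return sum(c * p_given_kw.get(kw, 0.0) for kw, c in kw_counts.items())

def jaccard(a, b):
    """Score_f4 (vs. title keywords) and Score_f5 (vs. keywords of the other
    sentences): |A intersect B| / |A union B| over keyword sets."""
    return len(a & b) / len(a | b) if a | b else 0.0
```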

SLIDE 15

Train the Score Function by the Genetic Algorithm

This helps find a suitable combination of feature weights. Represent a genome as (w1, w2, w3, w4, w5) and run the genetic algorithm (GA) to determine the value of each wi.

Fitness: the average recall obtained with the genome when applied to the training corpus.
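A minimal GA sketch matching this setup (genome (w1, …, w5); fitness = average recall on the training corpus). The population size, selection, crossover, and mutation settings are illustrative, not taken from the paper:

```python
import random

def train_weights(fitness, pop_size=20, generations=100):
    """Evolve genomes (w1, ..., w5); `fitness` maps a weight tuple to the
    average recall obtained on the training corpus."""
    pop = [tuple(random.random() for _ in range(5)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]              # keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, 5)            # one-point crossover
            child = list(a[:cut] + b[cut:])
            if random.random() < 0.1:               # occasional mutation
                child[random.randrange(5)] = random.random()
            children.append(tuple(child))
        pop = parents + children
    return max(pop, key=fitness)
```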

SLIDE 16

Outline

  • Introduction and related works
  • Modified Corpus-based approach
  • LSA-based Text Relationship Map approach
  • Evaluations
  • Conclusions

SLIDE 17

LSA-based T.R.M. Approach

Combine T.R.M. [Salton97] with the semantic representations derived by LSA to promote summarization to the semantic level.

[Figure: system pipeline. Preprocessing: sentence identification, word segmentation, and keyword-frequency calculation. Latent semantic analysis: word-by-sentence matrix construction, singular value decomposition, dimension reduction, semantic matrix reconstruction, and semantic model analysis, yielding semantic sentence/word representations. Summarization based on the text relationship map [Salton97]: sentence relationship analysis, semantically related sentence links, text relationship map construction, global bushy path construction, and sentence selection, yielding the document summary.]

SLIDE 18

Semantic Representations

Represent a document D as a word-by-sentence matrix A and apply SVD to A to derive the latent semantic structures of D.

Each entry a_ij is a log-entropy weight [Bellegarda96]:

$$a_{ij} = L_{ij} \cdot G_i, \qquad L_{ij} = \log\!\left(1 + \frac{c_{ij}}{n_j}\right), \qquad G_i = 1 - E_i, \qquad E_i = -\frac{1}{\log N}\sum_{j=1}^{N} \frac{f_{ij}}{f_i}\,\log\frac{f_{ij}}{f_i}$$

where c_ij is the frequency of W_i in S_j, n_j is the number of words in S_j, and E_i is the normalized entropy of W_i (with f_ij the frequency of W_i in S_j and f_i the total frequency of W_i in the document).

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{M1} & a_{M2} & \cdots & a_{MN} \end{bmatrix}$$

with rows indexed by words W_1, …, W_M and columns by sentences S_1, …, S_N. Sentence keywords: nouns and verbs.
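A sketch of the weighting in NumPy (assuming, per the notation above, f_ij = c_ij and f_i = Σ_j f_ij, and that no word row or sentence column is all zero; names illustrative):

```python
import numpy as np

def word_by_sentence_matrix(c):
    """c[i, j] = c_ij, the frequency of word W_i in sentence S_j."""
    c = np.asarray(c, dtype=float)
    n_j = c.sum(axis=0)                            # words per sentence
    L = np.log1p(c / n_j)                          # L_ij = log(1 + c_ij / n_j)
    f_i = c.sum(axis=1, keepdims=True)             # total frequency of W_i
    p = np.where(c > 0, c / f_i, 1.0)              # f_ij / f_i (p=1 -> log p = 0)
    E = -(p * np.log(p)).sum(axis=1) / np.log(c.shape[1])   # normalized entropy
    return L * (1.0 - E)[:, None]                  # a_ij = L_ij * G_i, G_i = 1 - E_i
```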

SLIDE 19

Example of How LSA Works [Landauer98]

The word-by-document matrix A for the example (rows: human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors; columns: documents c1–c5, m1–m4):

$$A = \begin{bmatrix}
1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 2 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1
\end{bmatrix} = U S V^T$$

SVD yields the singular values (3.34, 2.54, 2.35, 1.64, 1.50, 1.31, 0.85, 0.56, 0.36). With dimension reduction = 2, only the two largest are kept, giving the rank-2 reconstruction A' = U'S'V'^T; the rows of the reduced decomposition serve as semantic word representations and the columns as semantic sentence representations.
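The decomposition and truncation can be reproduced with a standard SVD routine; a sketch with k = 2 as in the example:

```python
import numpy as np

def lsa(A, k=2):
    """A = U S V^T; keep the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    word_vecs = U[:, :k] * s[:k]              # semantic word representations
    sent_vecs = Vt[:k].T * s[:k]              # semantic sentence representations
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]  # rank-k reconstruction A'
    return word_vecs, sent_vecs, A_k
```

For the matrix above, this keeps the two largest singular values, 3.34 and 2.54.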

SLIDE 20

Summary Generation

[Figure: the example text relationship map over nodes P1–P11.]

Node representation:
  • [Salton97]: Pi = (k1, k2, …, kn)
  • Ours: the semantic sentence representation derived by LSA

Similarity:

$$\mathrm{Sim}(P_i, P_j) = \frac{P_i \cdot P_j}{\lVert P_i \rVert\,\lVert P_j \rVert}$$

Summary generation:
  • Global bushy path [Salton97]

Compared to [Salton97]:
  • Ours: semantic sentence representations.
  • [Salton97]: keyword vector representations.

A problem of T.R.M. [Salton97] is the lack of the type or the context of a link.

SLIDE 21

Outline

  • Introduction and related works
  • Modified Corpus-based approach
  • LSA-based Text Relationship Map approach
  • Evaluations
  • Conclusions

SLIDE 22

Data Corpus

  • 100 articles about politics collected from New Taiwan Weekly.

Document statistics:

| | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 |
| --- | --- | --- | --- | --- | --- |
| Documents per collection | 20 | 20 | 20 | 20 | 20 |
| Sentences per document | 27.5 | 24.8 | 26.7 | 31.5 | 26.4 |
| Sentences per manual summary | 8.8 | 8.0 | 8.5 | 9.8 | 8.4 |
| Manual compression ratio per document | 32% | 32% | 32% | 31% | 32% |

SLIDE 23

Evaluation Method

We use recall and precision to judge the coverage between manual and machine-generated summaries:

$$\mathit{Recall} = \frac{\lvert T \cap S \rvert}{\lvert T \rvert} \quad\text{and}\quad \mathit{Precision} = \frac{\lvert T \cap S \rvert}{\lvert S \rvert}$$

where T is the manual summary of D and S is the machine-generated summary of D.
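A direct transcription (a sketch; summaries are treated as non-empty sets of sentence identifiers):

```python
def recall_precision(manual, generated):
    """T: manual summary of D; S: machine-generated summary of D."""
    t, s = set(manual), set(generated)
    overlap = len(t & s)
    return overlap / len(t), overlap / len(s)   # (Recall, Precision)
```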

SLIDE 24

Modified Corpus-based Approach:

Effects of f1, f2, f3, f4, and f5

Recall, per feature, at a compression ratio of about 30%. Each cell lists the two recall values from the slide: our approach (red) first, the original approach (blue) second:

| Recall | f1 | f2 | f3 | f4 | f5 |
| --- | --- | --- | --- | --- | --- |
| Set 1 | 0.4788 / 0.4415 | 0.4839 / 0.4647 | 0.1936 / 0.1982 | 0.4370 / 0.4274 | 0.4798 / 0.4511 |
| Set 2 | 0.4924 / 0.4639 | 0.3865 / 0.3648 | 0.2771 / 0.2972 | 0.4487 / 0.4217 | 0.3980 / 0.3944 |
| Set 3 | 0.4844 / 0.4648 | 0.4190 / 0.4381 | 0.2000 / 0.2301 | 0.3716 / 0.3644 | 0.5217 / 0.4723 |
| Set 4 | 0.4955 / 0.4796 | 0.5030 / 0.4912 | 0.1826 / 0.1746 | 0.4628 / 0.4557 | 0.4967 / 0.4777 |
| Set 5 | 0.4632 / 0.4286 | 0.5410 / 0.5399 | 0.1800 / 0.1739 | 0.3895 / 0.3817 | 0.5229 / 0.5024 |
| Average | 0.4829 / 0.4557 | 0.4667 / 0.4597 | 0.2067 / 0.2148 | 0.4219 / 0.4102 | 0.4838 / 0.4596 |

f1: Position, f2: Positive Keyword, f3: Negative Keyword, f4: Resemblance to the Title, f5: Centrality

SLIDE 25

Modified Corpus-based Approach:

Effects of the Score Function

Recall at a compression ratio of about 30%, for two score functions.

Score function(s) = f1 + f2 + f3 + f4 + f5:

| Recall | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Original approach | 0.2746 | 0.3700 | 0.2769 | 0.2633 | 0.2419 | 0.2853 |
| Our approach | 0.2684 | 0.3772 | 0.2841 | 0.2574 | 0.2478 | 0.2870 |
| Improvement | -2.3% | 1.9% | 2.6% | -2.2% | 2.4% | 0.6% |

Score function(s) = f1 + f2 + f4 + f5:

| Recall | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Original approach | 0.4647 | 0.3799 | 0.4191 | 0.5142 | 0.5149 | 0.4586 |
| Our approach | 0.4906 | 0.4028 | 0.4491 | 0.5348 | 0.5410 | 0.4837 |
| Improvement | 5.6% | 6.0% | 4.7% | 4.0% | 5.1% | 5.5% |

f1: Position, f2: Positive Keyword, f3: Negative Keyword, f4: Resemblance to the Title, f5: Centrality

SLIDE 26

Modified Corpus-based Approach:

Learned Feature Weights

Weights learned by the GA (100 generations), one combination per training/test split; score function(s) = f1 + f2 + f4 + f5:

| | f1 | f2 | f4 | f5 | Training (Recall) | Test (Recall) |
| --- | --- | --- | --- | --- | --- | --- |
| Combination 1 | 0.926 | 0.013 | 0.359 | 0.002 | 0.7841 | 0.5556 |
| Combination 2 | 0.867 | 0.013 | 0.689 | 0.011 | 0.7875 | 0.4790 |
| Combination 3 | 0.996 | 0.013 | 0.401 | 0.025 | 0.7674 | 0.4604 |
| Combination 4 | 0.981 | 0.021 | 0.527 | 0.004 | 0.7782 | 0.5376 |
| Combination 5 | 0.875 | 0.012 | 0.581 | 0.022 | 0.7746 | 0.5655 |

Recall (compression ratio about 30%):

| Recall | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Our approach | 0.4906 | 0.4028 | 0.4491 | 0.5348 | 0.5410 | 0.4837 |
| Our approach + GA | 0.5556 | 0.4790 | 0.4604 | 0.5376 | 0.5655 | 0.5196 |
| Improvement | 13.2% | 18.9% | 2.5% | 0.5% | 4.5% | 7.4% |

f1: Position, f2: Positive Keyword, f3: Negative Keyword, f4: Resemblance to the Title, f5: Centrality

SLIDE 27

LSA-based T.R.M. Approach:

Compared with Keyword-based T.R.M.[Salton97]

Recall (compression ratio about 30%):

| Recall | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Keyword-based T.R.M. | 0.3425 | 0.3817 | 0.4469 | 0.4276 | 0.4201 | 0.4038 |
| LSA-based T.R.M. | 0.4616 | 0.4005 | 0.4567 | 0.4657 | 0.4943 | 0.4558 |
| Improvement | 34.8% | 4.5% | 2.2% | 9.6% | 17.7% | 12.9% |

| | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Dimension reduction ratio | 0.65 | 0.45 | 0.8 | 0.65 | 0.65 | 0.64 |

SLIDE 28

Summary of the Above Approaches

Recall (compression ratio about 30%):

| Recall | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Old Corpus-based | 0.4647 | 0.3799 | 0.4191 | 0.5142 | 0.5149 | 0.4586 |
| Our Modified Corpus-based | 0.4906 | 0.4028 | 0.4491 | 0.5348 | 0.5410 | 0.4837 |
| Our Modified Corpus-based + GA | 0.5556 | 0.4790 | 0.4604 | 0.5376 | 0.5655 | 0.5196 |
| Keyword T.R.M. | 0.3425 | 0.3817 | 0.4469 | 0.4276 | 0.4201 | 0.4038 |
| Our LSA-based T.R.M. | 0.4616 | 0.4005 | 0.4567 | 0.4657 | 0.4943 | 0.4558 |

SLIDE 29

Outline

  • Introduction and related works
  • Modified Corpus-based approach
  • LSA-based Text Relationship Map approach
  • Evaluations
  • Conclusions

SLIDE 30

Conclusions

Modified Corpus-based Approach

  • Employ ranked positions to emphasize the significance of sentence positions.
  • Reshape the word unit to achieve a more accurate calculation of keyword importance.
  • Train the score function by the genetic algorithm to find a suitable combination of feature weights.

LSA-based T.R.M. Approach

  • Employ LSA to derive semantic representations of a document.
  • Combine T.R.M. [Salton97] and semantic representations to promote summarization from the keyword level to the semantic level.

When the compression ratio was 30%, the two approaches achieved average recalls of 52.0% and 45.5%, respectively.
