A DOCUMENT SUMMARIZER FOR NOVICES, REX RUBIN


Mar 19, 2023


SLIDE 1

A DOCUMENT SUMMARIZER FOR NOVICES

REX RUBIN

SLIDE 2

WHY A DOCUMENT SUMMARIZER?

- Getting into a field of research is:
  - Daunting with the amount of information presented
  - Difficult to discern what is important and what isn't
- How a summarizer will help:
  - Present the most relevant information and remove the excess

SLIDE 3

EXTRACTION VS ABSTRACTION

Extraction[1] Pulls sentences straight from the input Does not make its own sentences Abstraction[1] Creates sentences by joining several together Works better for several documents at once

SLIDE 4

TEXTRANK

- Extraction based [2]
- Creates a web of sentences
- This web is used as an input for PageRank
- PageRank will rank the sentences [3]
- Gives the summary as the output
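The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the program used in the study: the function names (`split_sentences`, `similarity`, `pagerank`, `summarize`), the naive sentence splitter, the damping factor of 0.85, and the fixed 50 iterations are all assumptions made for the example. Sentence similarity is approximated here by shared unique words, normalised by the log of each sentence's length.

```python
import math
import re


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def tokenize(sentence):
    """Lowercased set of word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def similarity(a, b):
    """Shared words, normalised by the log of each sentence's unique-word count."""
    wa, wb = tokenize(a), tokenize(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by simple fixed-count iteration over the sentence web."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, k=2):
    """Return the k highest-ranked sentences, kept in their original order."""
    sentences = split_sentences(text)
    n = len(sentences)
    # The "web of sentences": an edge weight for every pair of distinct sentences.
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(weights)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


example = (
    "Firewalls filter network traffic. "
    "A firewall blocks malicious network traffic from attackers. "
    "Attackers target weak network passwords. "
    "I had toast for breakfast."
)
print(summarize(example, k=2))
```

A sentence that shares no words with the rest of the document gets no incoming edges, so it stays at the baseline score and is the first to be dropped from the summary.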

SLIDE 5

HOW TO IMPROVE THIS MODEL?

It is important to note that the glossary should contain terms relevant to the original document.
Because of the way TextRank works, the glossary will allow similar sentences to connect and score higher.
This will help by giving more informative sentences.
It is important to know that more informative does not mean easier to read.
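One possible reading of this idea is sketched below. The slides do not show how the glossary was wired into TextRank, so everything here is an assumption made for illustration: glossary entries are treated as extra nodes in the sentence web, they are never returned in the summary, and the names `rank` and `summarize_with_glossary` are hypothetical.

```python
import math
import re


def words(sentence):
    """Lowercased set of word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def similarity(a, b):
    """Shared words, normalised by the log of each sentence's unique-word count."""
    wa, wb = words(a), words(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def rank(nodes, damping=0.85, iterations=50):
    """Score every node of the sentence web with weighted PageRank."""
    n = len(nodes)
    w = [[similarity(nodes[i], nodes[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out[j] * scores[j] for j in range(n) if out[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize_with_glossary(sentences, glossary, k=2):
    """ASSUMPTION: glossary entries join the graph but are never output."""
    nodes = list(sentences) + list(glossary)
    scores = rank(nodes)
    # Only indices of original document sentences are eligible for the summary.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


document = [
    "Phishing emails trick users into revealing passwords.",
    "The meeting starts at noon on Friday.",
    "Lunch will be served after the meeting on Friday.",
]
glossary = ["Phishing is an attack that uses fraudulent emails to steal passwords."]

print(rank(document)[0])             # score of the phishing sentence alone
print(rank(document + glossary)[0])  # score once the glossary entry is a node
```

In this toy input the phishing sentence shares no words with the other two, so without the glossary it has no edges. The glossary entry gives it a neighbour, which raises its score. That matches the slide's claim that on-topic sentences connect and score higher, and it also shows why the glossary terms need to be relevant to the document.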

SLIDE 6

MY TEXTRANK MODIFICATION

SLIDE 7

RESEARCH QUESTION

Will including a glossary of related terms in the original document bring about more informative sentences?

SLIDE 8

HYPOTHESIS

Having a glossary included in the original document will bring out more informative sentences in the final summary.

SLIDE 9

EXPERIMENT OVERVIEW

- Two experimental groups:
  - Control Group (Y)
  - Test Group (X)
- Have the groups take a test on the original document

SLIDE 10

MY SUMMARY

My summary was made using a document focused on cybersecurity, and the glossary was filled with similar cybersecurity terms.

SLIDE 11

PARTICIPANTS

- Participants:
  - Union College students aged 18-22
  - Mixed group of CS students and non-CS students
- 2 Groups:
  - Control (Y) read the summary that was made through the original TextRank program
  - Test (X) read the summary that was made through my modified TextRank program

SLIDE 12

TEST GIVEN TO PARTICIPANTS

The test given to participants was based on the main points of the original document.
Why the main points? The main points should be in the summary.
Question types:
  - 3 Multiple Choice
  - 3 Open Answer

SLIDE 13

AVERAGE SCORES OF QUESTIONS

Multiple Choice: 3 4 6
Open Answer: 1 2 5
Data on the left is Y and the right is X

              Y      X
Question 1    0.94   0.94
Question 2    0.06   0.19
Question 3    0.22   0.33
Question 4    0.89   0.44
Question 5    0.39   0.5
Question 6    0.56   0.89
Total Score   3.06   3.22

SLIDE 14

AVERAGE SCORES OF QUESTIONS OUTLIERS REMOVED

              Y             X
Question 1    1             1
Question 2    0.0714286     0.1875
Question 3    (not shown)   0.375
Question 4    1             0.5
Question 5    0.5           0.5625
Question 6    0.428571      1
Total Score   3             3.625

SLIDE 15

DIFFERENCES IN RESULTS X-Y

Question 1    (not shown)
Question 2    0.13
Question 3    0.11
Question 4    0.45
Question 5    0.11
Question 6    0.33
Total Score   0.16
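The differences on this slide can be reproduced from the averages reported on slide 13. The short calculation below is a check of that arithmetic, not part of the original analysis. It assumes the first series of seven values on slide 13 is the control group (Y) and the second is the test group (X), which follows the "left is Y, right is X" note and agrees with the Question 4 and Question 6 averages on slide 18. The variable names are illustrative.

```python
# Averages as reported on slide 13 (assumed order: Y series first, then X).
questions = [
    "Question 1", "Question 2", "Question 3",
    "Question 4", "Question 5", "Question 6", "Total Score",
]
y_avg = [0.94, 0.06, 0.22, 0.89, 0.39, 0.56, 3.06]  # control group (Y)
x_avg = [0.94, 0.19, 0.33, 0.44, 0.50, 0.89, 3.22]  # test group (X)

# Signed difference X - Y for each question, rounded to two decimals.
differences = {
    q: round(x - y, 2) for q, x, y in zip(questions, x_avg, y_avg)
}

for question, diff in differences.items():
    print(f"{question}: {diff:+.2f}")
```

Computed as a signed X-Y value, Question 4 comes out negative: the control group scored higher there, and the chart shows only the magnitude, 0.45. Question 1 has a difference of zero, which would explain why no bar label appears for it.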

SLIDE 16

DIFFERENCES X-Y OUTLIERS REMOVED

Question 1    (not shown)
Question 2    0.1160714
Question 3    0.375
Question 4    (not shown)
Question 5    0.0625
Question 6    0.571429
Total Score   0.625

SLIDE 17

WAS MY HYPOTHESIS CORRECT?

With these results, I can say my hypothesis is incorrect

SLIDE 18

SOMETHING ELSE?

Question 4
X Average: 0.44   Y Average: 0.89   Difference Y-X: 0.45

Question 6
X Average: 0.89   Y Average: 0.56   Difference X-Y: 0.33

Differences in 4 and 6 were significant

SLIDE 19

CITATIONS

[1] Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document summarizer. ACM SIGIR conference on Research and development in information retrieval, (15):68–73, 1995.
[2] Rada Mihalcea and Paul Tarau. TextRank: Bringing order into texts. 2011.