A DOCUMENT SUMMARIZER FOR NOVICES - REX RUBIN - PowerPoint PPT Presentation



slide-1
SLIDE 1

A DOCUMENT SUMMARIZER FOR NOVICES

REX RUBIN

slide-2
SLIDE 2

WHY A DOCUMENT SUMMARIZER?

  • Getting into a field of research is:
      • Daunting with the amount of information presented
      • Difficult to discern what is important and what isn’t
  • How a summarizer will help:
      • Present the most relevant information and remove the excess

slide-3
SLIDE 3

EXTRACTION VS ABSTRACTION

Extraction[1] Pulls sentences straight from the input Does not make its own sentences Abstraction[1] Creates sentences by joining several together Works better for several documents at once

slide-4
SLIDE 4

TEXTRANK

  • Extraction based [2]
  • Creates a web of sentences
  • This web is used as an input for PageRank
  • PageRank will rank the sentences [3]
  • Gives the summary as the output
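The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program used in the study: the naive sentence splitter, the function names (`split_sentences`, `similarity`, `textrank`, `summarize`), the damping factor of 0.85, and the fixed iteration count are all choices made for the example. The similarity measure follows the word-overlap formula from the TextRank paper [2], applied here to sets of lowercase words.

```python
import math
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    """Lowercase word set; no stemming or stop-word removal in this sketch."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a, b):
    """Shared words divided by the sum of the log sentence lengths."""
    wa, wb = words(a), words(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # skip very short sentences; also avoids a zero denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence web; returns one score per sentence."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(text, k=2):
    """Return the k highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Sentences that share words with many other sentences collect score from their neighbours, while a sentence with no shared words keeps only the baseline `1 - damping`.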

slide-5
SLIDE 5

HOW TO IMPROVE THIS MODEL?

  • The glossary should contain terms that are relevant to the original document
  • Because of the way TextRank works, the glossary will allow similar sentences to connect and score higher
  • This will help by giving more informative sentences
  • More informative does not mean easier to read
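A small sketch of why a glossary could raise the score of on-topic sentences. The actual modification is not described in the text of these slides, so everything here is an assumption: glossary entries are treated as extra nodes appended to the document, and summed word overlap is used as a simple stand-in for the PageRank score rather than the full algorithm. The function names and example sentences are hypothetical.

```python
import re


def terms(text):
    """Lowercase word set; no stop-word removal in this sketch."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def connection_strength(nodes):
    """Summed word overlap of each node with every other node.

    A stand-in for the PageRank score, not the full algorithm.
    """
    bags = [terms(n) for n in nodes]
    return [sum(len(bags[i] & bags[j]) for j in range(len(nodes)) if j != i)
            for i in range(len(nodes))]


def with_glossary(sentences, glossary):
    """Append glossary entries as extra nodes (assumption), then report
    strengths for the original sentences only."""
    strengths = connection_strength(sentences + glossary)
    return strengths[:len(sentences)]


# Hypothetical example data, not taken from the study.
sentences = [
    "A firewall blocks unwanted traffic.",
    "The office closes at noon on Fridays.",
]
glossary = ["firewall: a device that filters network traffic"]

before = connection_strength(sentences)
after = with_glossary(sentences, glossary)
print(before, after)
```

In this toy case only the sentence that shares words with the glossary entry gains connections, which is the effect the slide describes. It says nothing about whether the resulting summary is easier to read.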

slide-6
SLIDE 6

MY TEXTRANK MODIFICATION

slide-7
SLIDE 7

RESEARCH QUESTION

Will including a glossary of related terms in the original document bring about more informative sentences?

slide-8
SLIDE 8

HYPOTHESIS

Having a glossary included in the original document will bring out more informative sentences in the final summary

slide-9
SLIDE 9

EXPERIMENT OVERVIEW

  • Two experimental groups:
      • Control Group (Y)
      • Test Group (X)
  • Have the groups take a test on the original document

slide-10
SLIDE 10

MY SUMMARY

My summary was made using a document focused on cybersecurity, and the glossary was filled with related cybersecurity terms

slide-11
SLIDE 11

PARTICIPANTS

  • Participants:
      • Union College students aged 18-22
      • Mixed group of CS students and non-CS students
  • 2 Groups:
      • Control (Y) read the summary that was made through the original TextRank program
      • Test (X) read the summary that was made through my modified TextRank program

slide-12
SLIDE 12

TEST GIVEN TO PARTICIPANTS

  • The test given to participants was based on the main points of the original document
  • Why the main points? The main points should be in the summary
  • Question types:
      • 3 Multiple Choice
      • 3 Open Answer

slide-13
SLIDE 13

AVERAGE SCORES OF QUESTIONS

Multiple Choice: Questions 3, 4, 6
Open Answer: Questions 1, 2, 5

Average score per question (Y = control group, X = test group):

              Y       X
Question 1    0.94    0.94
Question 2    0.06    0.19
Question 3    0.22    0.33
Question 4    0.89    0.44
Question 5    0.39    0.5
Question 6    0.56    0.89
Total Score   3.06    3.22

slide-14
SLIDE 14

AVERAGE SCORES OF QUESTIONS OUTLIERS REMOVED

Average score per question with outliers removed (Y = control group, X = test group):

              Y            X
Question 1    1            1
Question 2    0.0714286    0.1875
Question 3    0            0.375
Question 4    1            0.5
Question 5    0.5          0.5625
Question 6    0.428571     1
Total Score   3            3.625

slide-15
SLIDE 15

DIFFERENCES IN RESULTS X-Y

              X-Y
Question 1    0
Question 2    0.13
Question 3    0.11
Question 4    -0.45
Question 5    0.11
Question 6    0.33
Total Score   0.16
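The differences on this slide are the test group (X) average minus the control group (Y) average for each question. A short sketch that recomputes them from the averages reported on the "AVERAGE SCORES OF QUESTIONS" slide; the variable and function names are illustrative, and the total uses the reported totals rather than re-summing the per-question values.

```python
# Average scores as reported on the "AVERAGE SCORES OF QUESTIONS" slide.
labels = ["Question 1", "Question 2", "Question 3",
          "Question 4", "Question 5", "Question 6", "Total Score"]
y_control = [0.94, 0.06, 0.22, 0.89, 0.39, 0.56, 3.06]
x_test = [0.94, 0.19, 0.33, 0.44, 0.50, 0.89, 3.22]


def differences(x, y, digits=2):
    """Element-wise X - Y, rounded to the precision shown on the slide."""
    return [round(a - b, digits) for a, b in zip(x, y)]


diffs = dict(zip(labels, differences(x_test, y_control)))
for label, d in diffs.items():
    print(f"{label}: {d:+.2f}")
```

A positive value means the test group scored higher on that question; a negative value means the control group did.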

slide-16
SLIDE 16

DIFFERENCES X-Y OUTLIERS REMOVED

              X-Y
Question 1    0
Question 2    0.1160714
Question 3    0.375
Question 4    -0.5
Question 5    0.0625
Question 6    0.571429
Total Score   0.625

slide-17
SLIDE 17

WAS MY HYPOTHESIS CORRECT?

With these results, I can say my hypothesis is incorrect

slide-18
SLIDE 18

SOMETHING ELSE?

Question 4
  • X Average: 0.44
  • Y Average: 0.89
  • Difference Y-X: 0.45

Question 6
  • X Average: 0.89
  • Y Average: 0.56
  • Difference X-Y: 0.33

Differences in Questions 4 and 6 were significant

slide-19
SLIDE 19

CITATIONS

[1] Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document summarizer. ACM SIGIR conference on Research and development in information retrieval, (15):68–73, 1995.

[2] Rada Mihalcea and Paul Tarau. TextRank: Bringing order into texts. 2011.

[3] Mario Kubek and Herwig Unger. Topic detection based on the PageRank’s clustering property. 2011.