a document summarizer for novices



  1. A DOCUMENT SUMMARIZER FOR NOVICES REX RUBIN

  2. WHY A DOCUMENT SUMMARIZER?  Getting into a field of research is: daunting, given the amount of information presented; difficult, because it is hard to discern what is important and what isn't. How a summarizer will help: present the most relevant information and remove the excess.

  3. EXTRACTION VS ABSTRACTION  Extraction [1]: pulls sentences straight from the input; does not make its own sentences. Abstraction [1]: creates sentences by joining several together; works better for several documents at once.

  4. TEXTRANK  Extraction based [2]. Creates a web of sentences. This web is used as an input for PageRank. PageRank will rank the sentences [3]. Gives the summary as the output.
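The pipeline on this slide (sentence web, PageRank, summary) can be sketched roughly as follows. This is a minimal illustration, not the program used in the study: the naive sentence splitter, the word-overlap similarity, the damping factor of 0.85, and all function names are assumptions chosen for the example.

```python
import math
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased set of distinct words in the sentence.
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(words_a, words_b):
    # Shared words, normalized by the log of each sentence's length.
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def rank_sentences(sentences, damping=0.85, iterations=50):
    # Build the "web": a weighted graph with one node per sentence.
    words = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [
        [similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    # Weighted PageRank by simple power iteration.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, num_sentences=2):
    sentences = split_sentences(text)
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Return the best-ranked sentences in their original order.
    return [sentences[i] for i in sorted(top[:num_sentences])]
```

Calling `summarize(text, 2)` returns the two highest-ranked sentences in document order. A sentence that shares no words with any other sentence receives only the baseline score `1 - damping`, so it is the least likely to appear in the summary.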

  5. HOW TO IMPROVE THIS MODEL?  Include a glossary of related terms in the original document. The glossary should contain terms relevant to the original document. Because of the way TextRank works, the glossary allows similar sentences to connect and score higher. This helps by surfacing more informative sentences. Note that more informative does not mean easier to read.
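One way to see why a glossary could let similar sentences connect and score higher is to compare each sentence's total edge weight in the sentence web before and after glossary entries are appended. The sketch below is only an illustration under assumptions: the slides do not describe how the modification was implemented, and the example sentences, the glossary entry, the word-overlap similarity, and the helper names are all hypothetical. Total edge weight is used here as a rough proxy for the PageRank score.

```python
import math
import re


def tokenize(sentence):
    # Lowercased set of distinct words in the sentence.
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a, b):
    # Shared words, normalized by the log of each sentence's length.
    wa, wb = tokenize(a), tokenize(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def connection_strength(sentences):
    # For each sentence, sum the weights of its edges to every other sentence.
    return [
        sum(similarity(s, t) for j, t in enumerate(sentences) if j != i)
        for i, s in enumerate(sentences)
    ]


# Hypothetical document and glossary, for illustration only.
document = [
    "A firewall filters traffic entering the network.",
    "The office printer is out of paper.",
]
glossary = [
    "Firewall: a system that filters network traffic.",
]

before = connection_strength(document)
# Rank with the glossary included, then look only at the original sentences.
after = connection_strength(document + glossary)[: len(document)]
```

In this toy case the firewall sentence gains edge weight from the glossary entry, while the unrelated printer sentence is unchanged, which is the effect the slide describes.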

  6. MY TEXTRANK MODIFICATION

  7. RESEARCH QUESTION  Will including a glossary of related terms in the original document bring about more informative sentences?

  8. HYPOTHESIS  Having a glossary included in the original document will bring out more informative sentences in the final summary

  9. EXPERIMENT OVERVIEW  Two experimental groups: Control Group (Y) and Test Group (X). Both groups take a test on the original document.

  10. MY SUMMARY  My summary was made using a document focused on cybersecurity, and the glossary was filled with related cybersecurity terms.

  11. PARTICIPANTS  Participants: Union College students aged 18-22; a mixed group of CS students and non-CS students. 2 groups: Control (Y) read the summary made by the original TextRank program; Test (X) read the summary made by my modified TextRank program.

  12. TEST GIVEN TO PARTICIPANTS  The test given to participants was based on the main points of the original document. Why the main points? The main points should be in the summary. Question types: 3 multiple choice, 3 open answer.

  13. AVERAGE SCORES OF QUESTIONS  [Bar chart: average score for Question 1 through Question 6 and Total Score. In each pair, the data on the left is Y and the right is X. Multiple choice: Questions 3, 4, 6. Open answer: Questions 1, 2, 5. The two Total Score bars are labeled 3.22 and 3.06.]

  14. AVERAGE SCORES OF QUESTIONS, OUTLIERS REMOVED  [Bar chart: average score for Question 1 through Question 6 and Total Score with outliers removed. The two Total Score bars are labeled 3.625 and 3.]

  15. DIFFERENCES IN RESULTS X-Y  [Bar chart: difference X-Y for Question 1 through Question 6 and Total Score. The labeled differences range from -0.45 to 0.33.]
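The X-Y comparison amounts to subtracting the control group's average score from the test group's average score for each question. A small sketch of that arithmetic follows. The per-participant scores are made up for illustration and are not the study's data, and the function name is hypothetical.

```python
from statistics import mean

# Hypothetical per-participant scores (1 = correct, 0 = incorrect).
# These are NOT the study's data; they only illustrate the arithmetic.
scores_y = {"Question 1": [1, 0, 1, 1], "Question 2": [0, 0, 1, 0]}
scores_x = {"Question 1": [1, 1, 1, 1], "Question 2": [0, 1, 0, 0]}


def mean_differences(x, y):
    # Average score of group X minus average score of group Y, per question.
    return {question: mean(x[question]) - mean(y[question]) for question in x}


differences = mean_differences(scores_x, scores_y)
```

A positive value means the test group (X) scored higher on that question, and a negative value means the control group (Y) did.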

  16. DIFFERENCES X-Y, OUTLIERS REMOVED  [Bar chart: difference X-Y for Question 1 through Question 6 and Total Score with outliers removed. The labeled differences range from -0.5 to 0.625.]

  17. WAS MY HYPOTHESIS CORRECT? With these results, I can say my hypothesis is incorrect

  18. SOMETHING ELSE?  Differences in 4 and 6 were significant. [Two bar charts. Question 4: X Average 0.44, Y Average 0.89, Difference Y-X 0.45. Question 6: X Average 0.89, Y Average 0.56, Difference X-Y 0.33.]

  19. CITATIONS  [1] Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document summarizer. ACM SIGIR Conference on Research and Development in Information Retrieval, pages 68-73, 1995. [2] Rada Mihalcea and Paul Tarau. TextRank: Bringing order into texts. 2011. [3] Mario Kubek and Herwig Unger. Topic detection based on the PageRank's clustering property. 2011.
