Stylometry Using Adjacent Word Graphs
Leon Maurer
Math 76


The plan
1. Take works, chop them up, and make graphs out of them
2. Perform HITS on the graphs and find hub vectors
3. Do Principal Component Analysis on the vectors
4. Squint at results
5. ???
6. Profit!!!

Making the Graphs
Words are vertices
Directed edges from one word to the next
If the edge already exists, add one to its weight
Restart at punctuation
Example: "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness..."
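The graph-building rules above can be sketched in a few lines of Python. This is only an illustration, not the program used for the talk: the function name `build_word_graph`, the lowercasing, and the exact set of characters treated as punctuation are all assumptions.

```python
import re
from collections import defaultdict

def build_word_graph(text):
    """Return {word: {next_word: weight}} for adjacent words in text."""
    graph = defaultdict(lambda: defaultdict(int))
    # Assumption: these characters count as punctuation. Splitting here
    # means no edge ever crosses a punctuation mark ("restart").
    for segment in re.split(r'[.,;:!?()"]+', text.lower()):
        words = re.findall(r"[a-z']+", segment)
        # One directed edge per adjacent pair; repeats add to the weight.
        for current, following in zip(words, words[1:]):
            graph[current][following] += 1
    return graph

sample = ("It was the best of times, it was the worst of times, "
          "it was the age of wisdom, it was the age of foolishness...")
graph = build_word_graph(sample)
```

On the sample sentence, the edge from "it" to "was" ends up with weight 4, and there is no edge from "times" to "it" because the comma restarts the chain.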

HITS
Review of the HITS algorithm:
1. Start with $\vec{h}_0 = (1, 1, 1, \ldots)$
2. $\vec{a}_{t+1} = A^T \vec{h}_t$
3. $\vec{h}_{t+1} = A \vec{a}_{t+1}$
4. Repeat until $\vec{h}_{t+1} \approx \vec{h}_t$ when normalized
Where $A$ is the adjacency matrix, $\vec{h}$ is the hub vector, and $\vec{a}$ is the authority vector.
For these graphs, $\vec{h}$ converged quickly. Typically $\vec{h}_3 \cdot \vec{h}_2 > .99$
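A minimal sketch of the iteration above, working directly on a weighted graph stored as `{source: {target: weight}}` rather than on an explicit matrix. The function names, the tolerance `tol`, and the choice to normalize the hub vector at every step (instead of only when comparing) are assumptions made for illustration; normalizing each step does not change the direction the vectors converge to.

```python
import math

def normalize(vec):
    """Scale a {node: value} vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return vec if norm == 0 else {k: v / norm for k, v in vec.items()}

def hits(graph, tol=1e-6, max_iter=100):
    """Return (hub, authority) score dicts for a weighted directed graph."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    hub = normalize({n: 1.0 for n in nodes})  # h_0 = (1, 1, 1, ...)
    auth = dict.fromkeys(nodes, 0.0)
    for _ in range(max_iter):
        # a_{t+1} = A^T h_t
        auth = dict.fromkeys(nodes, 0.0)
        for src, targets in graph.items():
            for dst, weight in targets.items():
                auth[dst] += weight * hub[src]
        # h_{t+1} = A a_{t+1}
        new_hub = dict.fromkeys(nodes, 0.0)
        for src, targets in graph.items():
            for dst, weight in targets.items():
                new_hub[src] += weight * auth[dst]
        new_hub = normalize(new_hub)
        # Stop once successive normalized hub vectors are nearly parallel.
        converged = sum(new_hub[n] * hub[n] for n in nodes) > 1 - tol
        hub = new_hub
        if converged:
            break
    return hub, normalize(auth)

# Tiny hypothetical graph: "a" points to "b" and "c", "d" points to "b".
hub, auth = hits({"a": {"b": 1, "c": 1}, "d": {"b": 1}})
```

In the toy graph, "a" gets the larger hub score because it points to both authorities, and "b" gets the larger authority score because both hubs point to it.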

Typical $\vec{h}$s
of       .6452  .6039  .7557  .6147  .5735
in       .5161  .5286  .3995  .4389  .5259
and      .2968  .3497  .2463  .3352  .2506
to       .2326  .2333  .2118  .2980  .2504
on       .1613  .2110  .1932  .2220  .2254
at       .1592  .0886  .0555  .1581  .2296
with     .1181  .0916  .0775  .1460  .1921
for      .0570  .1626  .1198  .1136  .1483
from     .0883  .0927  .1189  .1413  .1234
by       .1385  .0846  .1031  .1047  .0828
was      .0622  .0966  .0646  .1450  .1150
through  .1360  .0533  .0423  .0563  .0536

Typical $\vec{a}$s
the      .9543  .8981  .9231  .9117  .8628
a        .1657  .3598  .3016  .2895  .3176
his      .0055  .0627  .0367  .1643  .2719
it       .0636  .1138  .0813  .0786  .1299
that     .0649  .0555  .0335  .0489  .0484
this     .0380  .0450  .0545  .0534  .0376
be       .0840  .0178  .0175  .0540  .0347
them     .0179  .0572  .0686  .0365  .0259
one      .0678  .0197  .0399  .0280  .0425
all      .0338  .0665  .0444  .0252  .0265
her      .0000  .0254  .0181  .0318  .1113
their    .0148  .0415  .0450  .0301  .0382

Principal Component Analysis
Simply taking the dot product of the $\vec{h}$ vectors doesn't reveal much about authorship: the dot products are all $\approx .95$. So it's time to do PCA.
Each $\vec{h}$ has thousands of entries, which is too big
Cut all $\vec{h}$s down to the $\approx 30$ words with the highest average values
The sum of the top 2 or 3 eigenvalues is often about half of the total, so 2 or 3 dimensions should provide an OK representation
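The truncation and projection steps above might look roughly like the following. This is a sketch under assumptions, not the code behind the slides: hub vectors are taken to be `{word: score}` dicts, a word missing from a chunk counts as 0, and the principal components are found by power iteration with deflation (a stand-in chosen to keep the example dependency-free; a library eigensolver would do the same job). All function names are hypothetical.

```python
import math
import random

def top_words(hub_vectors, k=30):
    """Pick the k words with the highest average hub score across chunks."""
    totals = {}
    for vec in hub_vectors:
        for word, score in vec.items():
            totals[word] = totals.get(word, 0.0) + score
    return sorted(totals, key=totals.get, reverse=True)[:k]

def to_rows(hub_vectors, words):
    """One row per chunk, one column per selected word (missing word = 0)."""
    return [[vec.get(w, 0.0) for w in words] for vec in hub_vectors]

def mat_vec(matrix, v):
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]

def pca(rows, n_components=2, iters=500, seed=0):
    """Return (projected rows, eigenvalues) for the top principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    eigenvalues, components = [], []
    for _ in range(n_components):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):  # power iteration toward the top eigenvector
            w = mat_vec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
        eigenvalues.append(lam)
        components.append(v)
        # Deflate: remove the component just found before finding the next.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    projected = [[sum(a * b for a, b in zip(c, comp)) for comp in components]
                 for c in centered]
    return projected, eigenvalues

# Toy data lying on a single line, so one component captures everything.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
projected, eigenvalues = pca(rows, n_components=1)
```

With real hub vectors, the intended use would be `pca(to_rows(hub_vectors, top_words(hub_vectors)), n_components=2)`, then plotting the projected rows colored by author. Comparing the returned eigenvalues against the trace of the covariance matrix gives the "about half of the total" check mentioned above.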

Twain vs. Dickens
Red dots are from Innocents Abroad. Blue dots are from A Tale of Two Cities.
Works chosen because they are quite different: if this method works, it will work here
8 chunks of 4000-6000 words from each book
[Figure: scatter plots of the red and blue dots]

Eliot vs. Gaskell
Red dots are from Middlemarch. Blue dots are from North and South.
Works chosen because they are similar (I am told): both were written by women in Victorian England and have some themes in common
Chunks are again 4000-6000 words
[Figure: scatter plots of the red and blue dots]

Darwin vs. Spencer
Red dots are from The Descent of Man. Blue dots are from Essays on Education and Kindred Subjects.
Wanted to test some non-fiction works.
Chunks are somewhat larger
[Figure: scatter plots of the red and blue dots]

Closing Thoughts
The method shows some promise. What might improve it?
To some extent, bigger chunks are better. I could do whole books at once if I had more RAM.
Program can probably be tweaked for some more speed.
Is it a good thing that a few words have very high scores? If not, we could re-weight the edges non-linearly.
Make use of both $\vec{a}$ and $\vec{h}$
Remove the squint step and do clustering in higher dimensions instead.
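As one possible reading of the edge re-weighting idea, the sketch below passes every edge weight through a non-linear function before running HITS. The slides do not say which function to use; `math.log1p` (that is, log(1 + w)) is purely an illustrative assumption, as is the function name `reweight`.

```python
import math

def reweight(graph, transform=math.log1p):
    """Return a copy of {source: {target: weight}} with transformed weights."""
    return {src: {dst: transform(weight) for dst, weight in targets.items()}
            for src, targets in graph.items()}

# Hypothetical counts: a very common pair next to a rare one.
damped = reweight({"of": {"the": 400, "wisdom": 1}})
```

A compressive function like this shrinks the gap between very frequent and rare word pairs, which would presumably reduce how much a few words dominate the scores.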

