Grieve 2007: Quantitative Authorship Attribution: An Evaluation of Techniques

  1. Grieve 2007: Quantitative Authorship Attribution: An Evaluation of Techniques

     Zarah Weiß
     November 18th, 2015

  2. Outline

     ◮ Introduction
     ◮ Textual Measurements
     ◮ Length Measures
     ◮ Vocabulary Richness Measures
     ◮ Frequency Measures
     ◮ The Algorithm
     ◮ The Corpus
     ◮ Experiment & Results
     ◮ Experiment
     ◮ Results
     ◮ Combination of Techniques
     ◮ Conclusion
     ◮ References
     ◮ Discussion

  3. Introduction: Quantitative Authorship Attribution

     ◮ Determine the author of a text from a set of possible authors
     ◮ Based on a corpus of the author set
     ◮ Based on textual measures (features)
     ◮ The attribution algorithm compares the anonymous text with known author data
     ◮ Mendenhall (1887) on Shakespeare plays

  4. Introduction: Grieve 2007

     ◮ Overview of the 39 most common features for authorship attribution
     ◮ First comprehensive evaluation of the feature set
     ◮ Uses an identical data set for all features
     ◮ Uses an identical attribution algorithm for all features
     ◮ Proposes a more accurate approach that combines promising features

  8. Textual Measurements: Length Measures

     Table: Length measures evaluated in Grieve 2007.

     | | Word-Length | Sentence-Length |
     |---|---|---|
     | Average length | $\frac{\#\,\text{digits} + \#\,\text{graphemes}}{\#\,\text{"words"}}$ | $\frac{\#\,\text{"words" or } \#\,\text{characters}}{\#\,\text{sentences}}$ |
     | Distribution (rel. freq.) | $\frac{\#\,\text{"words" of length } n}{\#\,\text{"words"}}$ | $\frac{\#\,\text{sentences of length } n}{\#\,\text{sentences}}$ |

     ◮ For $n = 1, \dots, N$ (for varying $N$)
     ◮ For the sentence frequency distribution in characters, $n$ is a range, e.g. 1 to 10 characters
     ◮ With sentence length being measured in
        1. # "words"
        2. # characters
     ◮ length("Chris drank an espresso .") = ?
        1. 4 (dot is neither grapheme nor digit)
        2. 25 (again, no dot)
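
The word-length column of the table can be sketched in a few lines of Python. This is only an illustration, not the implementation used in Grieve 2007: the whitespace tokenizer and the function names are assumptions, and a "word" is taken to be any token containing at least one letter or digit.

```python
from collections import Counter


def tokenize(text):
    """Split on whitespace; keep only tokens with at least one letter or digit.

    Assumption for this sketch: punctuation-only tokens such as "." are not "words".
    """
    return [tok for tok in text.split() if any(ch.isalnum() for ch in tok)]


def word_length(token):
    """Length of a "word" as # digits + # graphemes; punctuation is ignored."""
    return sum(ch.isalnum() for ch in token)


def average_word_length(text):
    """(# digits + # graphemes) / # "words"."""
    words = tokenize(text)
    return sum(word_length(w) for w in words) / len(words)


def word_length_distribution(text, max_n):
    """Relative frequency of "words" of length n, for n = 1..max_n."""
    words = tokenize(text)
    counts = Counter(word_length(w) for w in words)
    return {n: counts[n] / len(words) for n in range(1, max_n + 1)}


sentence = "Chris drank an espresso ."
print(len(tokenize(sentence)))                 # 4 "words"; the dot is not counted
print(average_word_length(sentence))           # 5.0
print(word_length_distribution(sentence, 8))
```

Varying `max_n` corresponds to the varying $N$ on the slide; each choice yields a different feature vector for the same text.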

  10. Textual Measurements: Vocabulary Richness Measures

      Unrestricted type-"word" ratio: $\frac{\#\,\text{types}}{\#\,\text{"words"}}$

      ◮ Issue? Sensitive to text length!

      Type Token Ratio variations:

      ◮ Guiraud's R: $\frac{\#\,\text{types}}{\sqrt{\#\,\text{"words"}}}$
      ◮ Herdan's C: $\frac{\log(\#\,\text{types})}{\log(\#\,\text{"words"})}$
      ◮ Dugast's k: $\frac{\log(\#\,\text{types})}{\log(\log(\#\,\text{"words"}))}$
      ◮ Tuldava's LN: $\frac{1 - (\#\,\text{types})^2}{(\#\,\text{types})^2 \times \log(\#\,\text{"words"})}$
      ◮ Restricted type-"word" ratio: $\frac{\#\,\text{types in first } n \text{ "words"}}{\#\,\text{first } n \text{ "words"}}$, with $n$ being # "words" in the shortest writing sample
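
As a rough illustration of the ratios on this slide, the sketch below computes them from a token list. The function names and the toy sentence are assumptions made for the example; natural logarithms are used, and the input needs at least three tokens so that every logarithm is defined and non-zero.

```python
import math


def richness_measures(tokens):
    """Type-token ratio and the variants listed above (natural logs assumed)."""
    n_words = len(tokens)
    n_types = len(set(tokens))
    return {
        "ttr": n_types / n_words,
        "guiraud_r": n_types / math.sqrt(n_words),
        "herdan_c": math.log(n_types) / math.log(n_words),
        "dugast_k": math.log(n_types) / math.log(math.log(n_words)),
        "tuldava_ln": (1 - n_types**2) / (n_types**2 * math.log(n_words)),
    }


def restricted_ttr(tokens, n):
    """Type-"word" ratio over the first n tokens only."""
    head = tokens[:n]
    return len(set(head)) / len(head)


tokens = "the cat sat on the mat and the dog sat too".split()
print(richness_measures(tokens)["ttr"])        # 8 types / 11 "words"
print(richness_measures(tokens * 2)["ttr"])    # same 8 types, 22 "words"
print(restricted_ttr(tokens, 5))               # 0.8
```

Repeating the same text halves the unrestricted ratio although the vocabulary is unchanged, which is the text-length sensitivity noted above. Fixing $n$ to the length of the shortest sample, as in the restricted ratio, is one way to keep samples comparable.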

  11. Textual Measurements: Vocabulary Richness Measures

      Type Token Ratio variations:

      ◮ Sichel's S and Michéa's M: $\frac{\#\,\text{types occurring 2 times}}{\#\,\text{tokens}}$
      ◮ Honoré's H: $\frac{100 \times \log(\#\,\text{"words"})}{1 - \frac{\#\,\text{types occurring 1 time}}{\#\,\text{types}}}$
      ◮ Yule's K and Simpson's D: $10^4 \times \frac{\sum_i i^2 \times \#\,\text{types occurring } i \text{ times} \; - \; \#\,\text{"words"}}{(\#\,\text{"words"})^2}$

      Other lexical diversity measures:

      ◮ Entropy: $-100 \times \sum_v p_v \times \log(p_v)$, with $p_v$ = relative frequency of the $v$-th most frequent type
      ◮ W: $(\#\,\text{"words"})^{(\#\,\text{types})^{-a}}$, with some constant $a$

      For evaluation of LD measures, see McCarthy & Jarvis (2007, 2010)!
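
The measures on this slide all derive from the frequency spectrum, i.e. how many types occur exactly $i$ times. The following sketch covers Honoré's H, Yule's K, the entropy measure and W; it is an illustration under stated assumptions rather than Grieve's code. In particular, the default `a=0.17` is a placeholder value chosen for the example (the slide only says "some constant a"), natural logarithms are used, and all names are hypothetical.

```python
import math
from collections import Counter


def spectrum_measures(tokens, a=0.17):
    """Frequency-spectrum measures; `a` is a placeholder constant for W."""
    n_words = len(tokens)
    type_freqs = Counter(tokens)               # type -> number of occurrences
    n_types = len(type_freqs)
    spectrum = Counter(type_freqs.values())    # i -> # types occurring i times
    hapax = spectrum[1]                        # types occurring exactly once

    # Honoré's H is undefined when every type occurs once (division by zero).
    honore_h = (
        100 * math.log(n_words) / (1 - hapax / n_types)
        if hapax < n_types
        else float("inf")
    )
    yule_k = 1e4 * (
        sum(i * i * v_i for i, v_i in spectrum.items()) - n_words
    ) / n_words**2
    entropy = -100 * sum(
        (f / n_words) * math.log(f / n_words) for f in type_freqs.values()
    )
    w = n_words ** (n_types ** -a)
    return {"honore_h": honore_h, "yule_k": yule_k, "entropy": entropy, "w": w}


tokens = "the cat sat on the mat and the dog sat too".split()
for name, value in spectrum_measures(tokens).items():
    print(f"{name}: {value:.3f}")
```

For the toy sentence, six types occur once, one occurs twice and one occurs three times, so the sum in Yule's K is $1 \cdot 6 + 4 \cdot 1 + 9 \cdot 1 = 19$ over 11 "words".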

  12. Textual Measurements: Grapheme Frequency

      Simple grapheme profile¹: $\frac{\#\,\text{instances of grapheme } i}{\#\,\text{graphemes}}$

      ◮ For each $i \in$ set(alphabet)

      Single-position grapheme profile: $\frac{\#\,\text{instances of grapheme } i \text{ in position } p}{\#\,\text{"words" containing position } p}$

      ◮ For each $i \in$ set(alphabet)
      ◮ For varying positions $p$ within a "word" (first, second, ..., last grapheme)

      ¹ All profiles are frequency distributions, i.e. one profile per text!
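
Both grapheme profiles can be sketched as follows. This is an illustrative reading of the two formulas, assuming a lowercase ASCII alphabet and pre-tokenized "words" without punctuation; the function names and the 0-based or negative `position` convention are choices made for this example only.

```python
import string
from collections import Counter


def simple_grapheme_profile(words, alphabet=string.ascii_lowercase):
    """Relative frequency of each grapheme over all graphemes in the text."""
    graphemes = [ch for w in words for ch in w.lower() if ch in alphabet]
    counts = Counter(graphemes)
    return {g: counts[g] / len(graphemes) for g in alphabet}


def single_position_profile(words, position, alphabet=string.ascii_lowercase):
    """Relative frequency of each grapheme at one position within a "word".

    `position` is a 0-based index from the start, or negative from the end
    (-1 = last grapheme). Only "words" long enough to contain it are counted.
    """
    min_len = position + 1 if position >= 0 else -position
    eligible = [w.lower() for w in words if len(w) >= min_len]
    if not eligible:
        return {g: 0.0 for g in alphabet}
    counts = Counter(w[position] for w in eligible)
    return {g: counts[g] / len(eligible) for g in alphabet}


words = ["Chris", "drank", "an", "espresso"]
print(simple_grapheme_profile(words)["s"])         # 4 of 20 graphemes
print(single_position_profile(words, 0)["c"])      # first grapheme
print(single_position_profile(words, -1)["o"])     # last grapheme
```

Each call returns one frequency distribution for the whole text, in line with the footnote that there is one profile per text.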
