Deep Bayes Factor Scoring for Authorship Verification
Benedikt Boenninghoff, Dorothea Kolossa, Julian Rupp, Robert M. Nickel
PAN@CLEF 2020
Authorship verification (AV) tasks at PAN 2020 to 2022¹ (Kestemont, Manjavacas, et al. 2020)
Task: Given two documents, determine if they were written by the same person
- PAN 2020: Closed-set / cross-fandom verification
  - A large training dataset is provided by the PAN organizers (Bischoff, Deckers, et al. 2020)
  - Test set represents a subset of the authors/fandoms found in the training data
- PAN 2021: Open-set verification
  - Test set now only contains "unseen" authors/fandoms
  - Training dataset is identical to year one
- PAN 2022: Role of judges at court
¹ https://pan.webis.de/clef20/pan20-web/author-identification.html
Text preprocessing strategies: Preparing train/dev sets, topic masking, and data augmentation
- Splitting the dataset into a train and a dev set² (small: 90% / 10%, large: 95% / 5%)
- Removing all documents in the train set which also appear in the dev set
- Tokenizing (train/dev sets)³ and counting words/characters (train set)
- Reducing the vocabulary sizes⁴: mapping all rare token/character types to a special unknown symbol
- Re-sampling the train-set pairs in every epoch (Boenninghoff, Hessler, et al. 2019)
- Keeping all dev set pairs fixed!
Resulting set sizes: train set small ≈ 83,400 docs (≈ 41,700 pairs per epoch), large ≈ 466,900 docs (≈ 233,450 pairs per epoch); dev set small ≈ 5,200 pairs, large ≈ 13,671 pairs; test set 14,311 pairs
² Dataset available at https://zenodo.org/record/3724096#.X2itQ3UzbQ8
³ spaCy tokenizer: https://spacy.io/
⁴ Similar to text distortion algorithm 1 proposed in (Stamatatos 2017)
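The vocabulary-reduction step above (mapping rare token types to an unknown symbol) can be sketched in a few lines. The `min_count` threshold, function name, and toy token list are illustrative assumptions, not the authors' settings.

```python
from collections import Counter

def reduce_vocabulary(train_tokens, min_count=5, unk="<UNK>"):
    """Keep only token types seen at least min_count times in the train set;
    map everything rarer to a single unknown symbol."""
    counts = Counter(train_tokens)
    vocab = {t for t, c in counts.items() if c >= min_count}
    masked = [t if t in vocab else unk for t in train_tokens]
    return masked, vocab

# Toy example: "Rey" is rare, so it is masked; frequent words survive.
tokens = ["the"] * 5 + ["Rey"] + ["says"] * 5
masked, vocab = reduce_vocabulary(tokens, min_count=5)
```

The same counting pass over the train set also yields the character vocabulary; dev-set text is masked with the train-set vocabulary only, so no dev statistics leak into training.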
Improved re-sampling of document pairs⁵
- Problem: During training, our model repeatedly sees the same SA pairs
- [Zipf plot (original PAN re-sampling), total number of occurrences vs. frequency rank of pairs: 166,926 SA/SF pairs, 433,373 SA/DF pairs, 9,064 DA/SF pairs, 2,711,869 DA/DF pairs]
- Modify the re-sampling of pairs w.r.t. authorship and topical category (SA vs. DA, SA/SF vs. SA/DF, DA/SF vs. DA/DF)

Algorithm 1: Re-sampling pairs
 1: while authors with documents are available do
 2:   for all authors do
 3:     if r1 ∼ U[0, 1] < 1/2 then
 4:       if r2 ∼ U[0, 1] < 1/2 then
 5:         Try to sample an SA/SF pair
 6:       else
 7:         Try to sample an SA/DF pair
 8:     else
 9:       Try to sample a document for DA pairs
10:     Delete author from list if all documents are sampled
11: while two documents are available do
12:   if r3 ∼ U[0, 1] < 1/2 then
13:     Try to sample a DA/SF pair
14:   else
15:     Try to sample a DA/DF pair

- [Zipf plot (modified re-sampling): 192,124 SA/SF pairs, 345,329 SA/DF pairs, 1,808,475 DA/SF pairs, 1,869,407 DA/DF pairs]
⁵ SA: same author, DA: different authors, SF: same fandom, DF: different fandoms
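Algorithm 1 can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the function name, the `(author, fandom)` document tuples, and the fallback behaviour when no candidate with the preferred fandom relation exists are all assumptions.

```python
import random

def resample_pairs(author_docs, seed=0):
    """Sketch of the modified re-sampling: coin flips r1/r2/r3 balance
    same-author (SA) vs. different-author (DA) pairs and same-fandom (SF)
    vs. different-fandom (DF) pairs. Each doc is an (author, fandom) tuple."""
    rng = random.Random(seed)
    pairs, da_pool = [], []
    pools = {a: list(docs) for a, docs in author_docs.items()}
    while any(pools.values()):
        for a in list(pools):
            docs = pools[a]
            if not docs:
                del pools[a]                         # author exhausted
                continue
            d1 = docs.pop()
            if rng.random() < 0.5 and docs:          # r1: try an SA pair
                want_sf = rng.random() < 0.5         # r2: SF vs. DF
                cands = [d for d in docs if (d[1] == d1[1]) == want_sf] or docs
                d2 = cands[0]
                docs.remove(d2)
                pairs.append(("SA", d1, d2))
            else:
                da_pool.append(d1)                   # document feeds DA pairs
    while len(da_pool) >= 2:
        d1 = da_pool.pop()
        want_sf = rng.random() < 0.5                 # r3: DA/SF vs. DA/DF
        cands = [d for d in da_pool
                 if d[0] != d1[0] and (d[1] == d1[1]) == want_sf]
        cands = cands or [d for d in da_pool if d[0] != d1[0]]
        if not cands:
            break                                    # only same-author docs left
        d2 = cands[0]
        da_pool.remove(d2)
        pairs.append(("DA", d1, d2))
    return pairs
```

Because the re-sampling is re-run each epoch with fresh coin flips, the model sees a different balanced mix of SA/SF, SA/DF, DA/SF and DA/DF pairs every time, which is what flattens the Zipf plot above.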
Text preprocessing strategies: (Overlapping) sliding windows with contextual prefixes
- Construct a sentence-like unit consisting of tokens that are grammatically linked
- window_length = hop_length + overlapping_length + 1
- Example document: ' Yes , Master Luke , ' Rey says , a little surprised . ' How did you know ? ' ' You 're very skilled . Not just skilled . Not just natural talent , but practiced skill .
- Resulting windows (each carries its fandom as a contextual prefix; rare words are masked with <UNK>, and the final window is zero-padded with <ZP>):
  <Star Wars> ' Yes , Master Luke , ' <UNK> says , a little surprised .
  <Star Wars> , a little surprised . ' How did you know ? ' ' You
  <Star Wars> know ? ' ' You 're very <UNK> . Not just <UNK> . Not
  <Star Wars> Not just <UNK> . Not just natural <UNK> , but <UNK> skill . <ZP>
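The windowing above can be sketched as follows, under one consistent off-by-one convention for the window formula; the function name, parameter defaults, and the exact stopping rule for the last window are illustrative assumptions.

```python
def sliding_windows(tokens, prefix, hop_length=5, overlap_length=4, pad="<ZP>"):
    """Cut a token sequence into overlapping windows, prepend a contextual
    prefix (the fandom tag) to each, and zero-pad the final window.
    window_length = hop_length + overlap_length + 1, so consecutive windows
    (whose starts are hop_length + 1 tokens apart) share overlap_length tokens."""
    window_length = hop_length + overlap_length + 1
    windows = []
    for start in range(0, max(len(tokens) - overlap_length, 1), hop_length + 1):
        w = tokens[start:start + window_length]
        w += [pad] * (window_length - len(w))   # pad the final, short window
        windows.append([prefix] + w)
    return windows

# Toy example with 12 single-character "tokens".
ws = sliding_windows(list("abcdefghijkl"), "<Star Wars>",
                     hop_length=5, overlap_length=4)
```

Every window thus has the same fixed length (prefix + window_length tokens), which keeps the batched RNN inputs rectangular.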
Hierarchical document encoding⁶ (Boenninghoff, Nickel, et al. 2019)
- Word level: each token (pretrained word embedding plus a character representation, starting from the <Star Wars> prefix) is fed through a recurrent layer RNN_w→s; the resulting states are combined via attention weights α1, ..., αW into a sentence embedding
- Sentence level: the sentence embeddings are fed through a second recurrent layer RNN_s→d; its states are combined via attention weights β1, ..., βS into a single document embedding
⁶ Pretrained word embeddings taken from https://fasttext.cc
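The attention pooling used at both levels of the hierarchy (the α's over word states, the β's over sentence states) can be illustrated with a minimal NumPy sketch; the query vector `v` stands in for the learned attention parameters, and all shapes are arbitrary toy values.

```python
import numpy as np

def attention_pool(states, v):
    """Score each RNN state against a query vector, softmax-normalize the
    scores into attention weights, and return the weighted sum of states."""
    scores = states @ v
    weights = np.exp(scores - scores.max())   # stable softmax
    weights /= weights.sum()
    return weights @ states, weights

rng = np.random.default_rng(0)
word_states = rng.normal(size=(7, 16))        # 7 word-level RNN states
sent_emb, alphas = attention_pool(word_states, rng.normal(size=16))
```

Applying the same pooling again over a stack of sentence embeddings yields the document embedding y used by the scoring stage.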
Deep Bayes factor scoring
- Define two hypotheses:
  Hs: the two documents were written by the same person
  Hd: the two documents were written by two different persons
- Two-covariance model (Cumani, Brummer, et al. 2013):
  y = x + ϵ, with x ∼ N(µ, B⁻¹) and ϵ ∼ N(0, W⁻¹),
  where y is the document embedding, x encodes the author's writing style, and ϵ is a noise term
- Verification score:
  Pr(Hs|y1, y2) = Pr(Hs) p(y1, y2|Hs) / (Pr(Hs) p(y1, y2|Hs) + Pr(Hd) p(y1, y2|Hd))
                ≈ p(y1, y2|Hs) / (p(y1, y2|Hs) + p(y1, y2|Hd))
- [Entropy curves during training: log det B⁻¹ and log det W⁻¹ over ~40,000 update steps]
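Under the two-covariance model the score has a closed form: stacking (y1, y2) gives a joint Gaussian whose cross-covariance block is B⁻¹ under Hs (shared writing-style variable x) and zero under Hd. A minimal NumPy sketch, with toy dimensions and parameter values as assumptions:

```python
import numpy as np

def gauss_logpdf(y, mu, cov):
    """Log density of a multivariate Gaussian."""
    d = y - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def log_bayes_factor(y1, y2, mu, B_inv, W_inv):
    """log B = log p(y1, y2 | Hs) - log p(y1, y2 | Hd) for the
    two-covariance model y = x + eps, x ~ N(mu, B_inv), eps ~ N(0, W_inv)."""
    tot = B_inv + W_inv
    y = np.concatenate([y1, y2])
    m = np.concatenate([mu, mu])
    cov_s = np.block([[tot, B_inv], [B_inv, tot]])   # Hs: documents covary via x
    zero = np.zeros_like(tot)
    cov_d = np.block([[tot, zero], [zero, tot]])     # Hd: independent x's
    return gauss_logpdf(y, m, cov_s) - gauss_logpdf(y, m, cov_d)

def posterior_same(log_B):
    """Pr(Hs | y1, y2) under equal priors, matching the approximation above."""
    return 1.0 / (1.0 + np.exp(-log_B))

mu = np.zeros(2)
B_inv, W_inv = np.eye(2), 0.5 * np.eye(2)
same = log_bayes_factor(mu, mu, mu, B_inv, W_inv)        # near-identical docs
diff = log_bayes_factor(mu + 3, mu - 3, mu, B_inv, W_inv)  # distant docs
```

Identical embeddings yield a positive log Bayes factor (posterior above 0.5), distant ones a negative factor, which is exactly the ≷ 0.5 decision rule used later.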
Combine binary cross-entropy and contrastive loss (Hu, Lu, and Tan 2014)
- Pipeline: Document 1 / Document 2 → text preprocessing → hierarchical document encoding → embeddings y1 / y2
- Contrastive loss: after training, embeddings of documents written by the same author satisfy d(y1, y2) < τs, while embeddings of documents written by different authors satisfy d(y1, y2) > τd, regardless of whether the documents share a fandom
- Deep Bayes factor scoring: log B = log p(y1, y2|Hs) − log p(y1, y2|Hd)
- Binary cross-entropy on the posterior, with decision rule Pr(Hs|y1, y2) ≷ 0.5 (same author vs. different authors)
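A minimal sketch of a two-threshold contrastive loss of the kind described above: same-author pairs are pulled within τs, different-author pairs pushed beyond τd. The threshold values and the squared-hinge form are illustrative assumptions, not the exact training objective.

```python
import numpy as np

def contrastive_loss(y1, y2, same_author, tau_s=1.0, tau_d=3.0):
    """Penalize same-author embeddings farther apart than tau_s and
    different-author embeddings closer together than tau_d."""
    d = np.linalg.norm(y1 - y2)
    if same_author:
        return max(0.0, d - tau_s) ** 2   # SA pair too far apart
    return max(0.0, tau_d - d) ** 2       # DA pair too close together

a = np.zeros(4)
b = np.full(4, 2.0)   # Euclidean distance to a is 4.0
```

In training this term is combined with the binary cross-entropy on Pr(Hs|y1, y2), so the embedding geometry and the probabilistic score are optimized jointly.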
Evaluation results⁷
- Rows 1-2: early-bird scores for the small dataset ⇒ the model seems to generalize to the test set
- Rows 3-4: best single runs for the small/large datasets (at this step we introduced the contextual prefixes)
- Rows 5-6: ensembles that take the averaged vote of three independently trained "single" models
- Rows 7-8: results for the ensembles on the test set (including non-answers)
- Row 9: model 6/8 without defining non-answers

  #  model       train set  evaluation  AUC    c@1    f_05_u  F1     overall
  1  early-bird  small      dev set     0.964  0.919  0.916   0.932  0.933
  2  early-bird  small      test set    0.923  0.861  0.857   0.891  0.883
  3  single      small      dev set     0.975  0.943  0.921   0.951  0.948
  4  single      large      dev set     0.983  0.950  0.944   0.954  0.958
  5  ensemble    small      dev set     0.977  0.942  0.938   0.946  0.951
  6  ensemble    large      dev set     0.985  0.955  0.940   0.959  0.960
  7  ensemble    small      test set    0.940  0.889  0.853   0.906  0.897
  8  ensemble    large      test set    0.969  0.928  0.907   0.936  0.935
  9  ensemble    large      test set    0.969  0.912  0.917   0.920  0.930

⁷ Colours represent the same models/runs
Final ranking of the submitted approaches⁸
⁸ https://pan.webis.de/clef20/pan20-web/author-identification.html
Looking forward to the PAN 2021 open-set AV challenge
- Simply splitting authors/fandoms into two disjoint groups
  - Train set: 136,068 pairs, re-sampled in every epoch
  - Dev set: 13,228 pairs
- New, challenging dev set:
  - It contains only "unseen" authors/fandoms
  - Cross-fandom orthogonality: only SA/DF and DA/SF pairs
- Number of authors: 142,605 (train), 29,543 (dev); number of fandoms: 1,120 (train), 412 (dev)
- First results (without non-answers and contextual prefixes):

  #  vocab size    vocab size  hop_length  train word  AUC    c@1    f_05_u  F1     overall
     (characters)  (words)                 embeddings
  1  150           15,000      25          YES         0.962  0.898  0.902   0.897  0.915
  2  150           5,000       25          YES         0.969  0.907  0.909   0.906  0.923
  3  150           50,000      25          YES         0.947  0.855  0.893   0.841  0.884
  4  150           15,000      30          YES         0.961  0.896  0.903   0.894  0.913
  5  750           15,000      25          YES         0.964  0.902  0.902   0.901  0.917
  6  150           15,000      25          NO          0.962  0.896  0.905   0.894  0.914
  7  150           5,000       25          NO          0.961  0.895  0.902   0.893  0.912
Conclusion and future work
Conclusion:
- AV models strongly depend on topical information (Kestemont, Manjavacas, et al. 2020)
- Outstanding results achievable with traditional stylometric features (Weerasinghe and Greenstadt 2020)
- Surprisingly, BERT/Transformer-based models still do not outperform “traditional models” in this field
- But very promising results in cross-domain authorship attribution (Barlas and Stamatatos 2020)
Future work:
- Analysis of errors, contextual prefixes, re-sampling strategies, topic masking
- Rethinking our handling of non-answers (e.g. Monte-Carlo dropout) on a calibration set
- Transfer Learning: Incorporating contextualized word representations (e.g. ELMo, BERT)
- Incorporating “compensation techniques” to deal with topical information
- Domain suppression (e.g. domain-adversarial training) (Bischoff, Deckers, et al. 2020)
- Domain adaptation (e.g. optimal transport) (Courty, Flamary, et al. 2017)
Acknowledgement: Big thanks to the PAN 2020 AV team for organizing the shared task!
11 / 11
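The Monte-Carlo dropout idea for non-answers mentioned above could be screened with an uncertainty rule like the following sketch. Everything here is hypothetical for illustration: `predict_once` stands in for a forward pass with dropout kept active at test time, and the sample count and variance threshold would have to be tuned on a calibration set.

```python
import random
import statistics

def mc_dropout_decision(predict_once, n_samples=20, var_threshold=0.01, seed=0):
    """Run a stochastic predictor several times; if the sampled scores
    disagree too much, abstain by returning the non-answer score 0.5."""
    rng = random.Random(seed)
    samples = [predict_once(rng) for _ in range(n_samples)]
    if statistics.pvariance(samples) > var_threshold:
        return 0.5  # abstain: predictive uncertainty is too high
    return statistics.fmean(samples)
```

A confident predictor (near-identical samples) keeps its averaged score, while a predictor whose dropout samples scatter widely is mapped to a non-answer.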
References I
Georgios Barlas and Efstathios Stamatatos. “Cross-Domain Authorship Attribution Using Pre-trained Language Models”. In: Artificial Intelligence Applications and Innovations. Ed. by Ilias Maglogiannis, Lazaros Iliadis, and Elias Pimenidis. Springer International Publishing, 2020, pp. 255–266.
Sebastian Bischoff, Niklas Deckers, Marcel Schliebs, Ben Thies, Matthias Hagen, Efstathios Stamatatos, Benno Stein, and Martin Potthast. “The Importance of Suppressing Domain Style in Authorship Analysis”. In: CoRR abs/2005.14714 (2020).
Benedikt Boenninghoff, Steffen Hessler, Dorothea Kolossa, and Robert M. Nickel. “Explainable Authorship Verification in Social Media via Attention-based Similarity Learning”. In: 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, December 9-12, 2019. IEEE, 2019, pp. 36–45.
Benedikt Boenninghoff, Robert M. Nickel, Steffen Zeiler, and Dorothea Kolossa. “Similarity Learning for Authorship Verification in Social Media”. In: Proc. ICASSP. 2019, pp. 2457–2461.
N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. “Optimal Transport for Domain Adaptation”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.9 (2017), pp. 1853–1865.
References II
Sandro Cumani, Niko Brummer, Lukáš Burget, Pietro Laface, Oldřich Plchot, and Vasileios Vasilakakis. “Pairwise Discriminative Speaker Verification in the I-Vector Space”. In: IEEE Transactions on Audio, Speech, and Language Processing 21.6 (2013), pp. 1217–1227.
J. Hu, J. Lu, and Y. P. Tan. “Discriminative Deep Metric Learning for Face Verification in the Wild”. In: Proc. CVPR. 2014, pp. 1875–1882.
Mike Kestemont, Enrique Manjavacas, Ilia Markov, Janek Bevendorff, Matti Wiegmann, Efstathios Stamatatos, Martin Potthast, and Benno Stein. “Overview of the Cross-Domain Authorship Verification Task at PAN 2020”. In: CLEF 2020 Labs and Workshops, Notebook Papers. Ed. by Linda Cappellato, Carsten Eickhoff, Nicola Ferro, and Aurélie Névéol. CEUR-WS.org, 2020.
Efstathios Stamatatos. “Authorship Attribution Using Text Distortion”. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Valencia,