Deep Bayes Factor Scoring for Authorship Verification

  1. Deep Bayes Factor Scoring for Authorship Verification
     Benedikt Boenninghoff, Dorothea Kolossa, Julian Rupp, Robert M. Nickel
     PAN@CLEF2020

  2. Authorship verification (AV) tasks at PAN 2020 to 2022 [1] (Kestemont, Manjavacas, et al. 2020)
     Task: Given two documents, determine whether they were written by the same person (a toy interface sketch follows this slide).
     • PAN 2020: Closed-set / cross-fandom verification
       • A large training dataset is provided by the PAN organizers (Bischoff, Deckers, et al. 2020)
       • The test set represents a subset of the authors/fandoms found in the training data
     • PAN 2021: Open-set verification
       • The test set now contains only "unseen" authors/fandoms
       • The training dataset is identical to year one
     • PAN 2022: Role of judges at court
     [1] https://pan.webis.de/clef20/pan20-web/author-identification.html
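To make the task interface concrete, here is a minimal sketch of a pairwise verifier in Python. The character 4-gram cosine similarity used as the scorer and the 0.5 decision threshold are naive placeholder assumptions; the deep Bayes factor model of this talk would supply the actual score.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text: str, n: int = 4) -> Counter:
    # Character n-gram profile: a classic stylometric representation for AV
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def verify(doc1: str, doc2: str, threshold: float = 0.5) -> tuple[float, bool]:
    """Return a same-author score in [0, 1] and the binary decision.

    Cosine similarity of character 4-gram profiles is only a naive
    stand-in scorer; the deep Bayes factor model would take its place.
    """
    a, b = char_ngrams(doc1), char_ngrams(doc2)
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    score = dot / norm if norm else 0.0
    return score, score >= threshold

score, same = verify("The sun rose over the hills.",
                     "The sun set behind the hills.")
print(f"score={score:.2f}, same-author={same}")
```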

  3. Text preprocessing strategies: Preparing train/dev sets
     • Splitting the dataset into a train and a dev set [2]
     • Removing all documents from the train set that also appear in the dev set
     • Tokenizing (train/dev sets) [3] and counting words/characters (train set); see the sketch after this slide
     • Reducing the vocabulary sizes [4]: mapping all rare token/character types to a special unknown symbol
     • Re-sampling the train-set pairs in every epoch (Boenninghoff, Hessler, et al. 2019)
     • Keeping all dev-set pairs fixed!

                  Train set               Dev set
     small:       90% (~83,400 docs)      10% (~5,200 pairs)
     large:       95% (~466,900 docs)     5% (~13,671 pairs)
     Test set (shared): 14,311 pairs

     [2] Dataset available at https://zenodo.org/record/3724096#.X2itQ3UzbQ8
     [3] spaCy tokenizer: https://spacy.io/
     [4] Similar to text distortion algorithm 1 proposed in (Stamatatos 2017)
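A minimal sketch of these preparation steps, assuming the published PAN 2020 JSONL layout in which each line carries a "pair" field with the two texts; the field name, the blank spaCy pipeline, and the sequential split are illustrative assumptions, not the authors' exact implementation.

```python
import json
from collections import Counter

import spacy

# Rule-based tokenizer only; no trained pipeline needs to be downloaded.
nlp = spacy.blank("en")

def load_pairs(path):
    # One JSON object per line; the "pair" field holding the two texts
    # follows the published PAN 2020 format (an assumption here).
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def split_and_clean(pairs, dev_fraction=0.10):
    # 90/10 split as on the slide (small dataset); 95/5 for the large one.
    cut = int(len(pairs) * (1.0 - dev_fraction))
    train, dev = pairs[:cut], pairs[cut:]
    # Remove every train document that also appears in the dev set.
    dev_docs = {doc for p in dev for doc in p["pair"]}
    train = [p for p in train if not any(doc in dev_docs for doc in p["pair"])]
    return train, dev

def count_types(train_pairs):
    # Word and character frequencies are computed on the train set only.
    token_counts, char_counts = Counter(), Counter()
    for p in train_pairs:
        for doc in p["pair"]:
            token_counts.update(tok.text for tok in nlp(doc))
            char_counts.update(doc)
    return token_counts, char_counts
```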

  4. Text preprocessing strategies: Topic Masking
     • Topic masking is realized by the vocabulary reduction step of the previous slide: all rare token/character types are mapped to a special unknown symbol [4], so frequent, style-bearing types survive while topical content is masked; a sketch follows this slide
     [4] Similar to text distortion algorithm 1 proposed in (Stamatatos 2017)
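A minimal sketch of such topic masking over tokens, assuming a frequency-ranked vocabulary; the symbol name "<UNK>" and the cutoff of 5,000 types are illustrative assumptions, not values from the talk.

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(token_counts: Counter, vocab_size: int = 5000) -> set:
    # Keep only the most frequent token types; everything else is "rare".
    return {tok for tok, _ in token_counts.most_common(vocab_size)}

def mask_topic(tokens: list, vocab: set) -> list:
    # Map every rare token to the unknown symbol. Frequent types (mostly
    # function words) survive, so topical content is masked -- in the
    # spirit of the text distortion idea of (Stamatatos 2017).
    return [tok if tok in vocab else UNK for tok in tokens]

# Toy example: topical nouns collapse to <UNK>, style markers remain.
vocab = {"i", "went", "to", "the", "and", "a", "."}
print(mask_topic(["i", "went", "to", "the", "opera", "."], vocab))
# ['i', 'went', 'to', 'the', '<UNK>', '.']
```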

  5. Text preprocessing strategies: Data augmentation
     • Re-sampling the train-set pairs in every epoch (Boenninghoff, Hessler, et al. 2019), while all dev-set pairs stay fixed; a sketch follows this slide
     Epoch 1 (train set 1):   small: ~41,700 pairs    large: ~233,450 pairs
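A minimal sketch of per-epoch pair re-sampling, assuming documents grouped by author; the 1:1 ratio of same-author to different-author pairs and the per-epoch seeding scheme are illustrative assumptions.

```python
import random

def resample_pairs(docs_by_author: dict, rng: random.Random) -> list:
    """Draw a fresh set of same-author and different-author pairs.

    Called once per epoch for the train set; the dev pairs stay fixed.
    The 1:1 same/different ratio is an assumption, not from the slide.
    """
    pairs = []
    for author, docs in docs_by_author.items():
        if len(docs) < 2:
            continue
        d1, d2 = rng.sample(docs, 2)
        pairs.append((d1, d2, 1))  # same-author pair, label 1
        other = rng.choice([a for a in docs_by_author if a != author])
        pairs.append((d1, rng.choice(docs_by_author[other]), 0))  # label 0
    rng.shuffle(pairs)
    return pairs

# Per-epoch usage: a new seed yields new training pairs each epoch.
docs_by_author = {"a1": ["t1", "t2"], "a2": ["t3"], "a3": ["t4", "t5"]}
for epoch in range(3):
    epoch_pairs = resample_pairs(docs_by_author, random.Random(epoch))
    print(epoch, epoch_pairs)
```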
