SLIDE 1

Uncovering Plagiarism, Authorship, and Social Software Misuse PAN 2011

[pan.webis.de]

SLIDE 2

The PAN Team

Teresa Holfeld, Andreas Eiselt, Martin Potthast, Alberto Barrón-Cedeño, Efstathios Stamatatos, Moshe Koppel, Patrick Juola, Shlomo Argamon, Paolo Rosso, Benno Stein

Bauhaus-Universität Weimar: Martin Potthast, Benno Stein, Andreas Eiselt, Teresa Holfeld
Universidad Politécnica de Valencia: Alberto Barrón-Cedeño, Paolo Rosso
University of the Aegean: Efstathios Stamatatos
Bar-Ilan University: Moshe Koppel
Illinois Institute of Technology: Shlomo Argamon
Duquesne University: Patrick Juola

SLIDE 3

PAN Overview

PAN

❑ Plagiarism Detection

– External Detection
– Intrinsic Detection

❑ Authorship Identification

– Authorship Verification
– Authorship Attribution

❑ Vandalism Detection

© www.webis.de

SLIDE 6

PAN Overview

Mission & Tasks

❑ Plagiarism Detection

– text plagiarism within and across languages, multimedia plagiarism
– text reuse, paraphrasing, information flow, meme tracking
– near-duplicates, high-similarity search, fingerprinting, hash-based search

❑ Authorship Identification

– models for authorship verification, authorship attribution, and writing style
– models to capture personal traits and sentiment
– text forensics, ghostwriting, intrinsic plagiarism detection

❑ Social Software Misuse

– serial sharing, lobbyism, spam
– trolling, stalking, Wikipedia vandalism
– social trust, anonymity and de-anonymization

SLIDE 8

Plagiarism Detection

SLIDE 10

Plagiarism Detection

Plagiarism is the practice of claiming, or implying, original authorship of someone else’s written or creative work, in whole or in part, into one’s own without adequate acknowledgment.

[Wikipedia: Plagiarism, 2009]

SLIDE 11

. . . better technology nowadays ;–)

SLIDE 13

Authorship Identification

SLIDE 15

Authorship Identification

Sub-task: Authorship Attribution

Given a text of uncertain authorship and texts from a set of candidate authors, the task is to map the uncertain text onto the true author among the candidates.

[Figure: the text of unknown authorship (A?) alongside the candidate authors A1, A2, ..., A12, ...]
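
The attribution task stated on this slide can be illustrated with a minimal sketch. It assumes character trigram counts as the style representation and cosine similarity as the comparison; neither choice comes from the slides, and the function names (`char_ngrams`, `cosine`, `attribute`) are hypothetical. It is one simple baseline, not the method of any PAN participant.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(p, q):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm_p = sqrt(sum(c * c for c in p.values()))
    norm_q = sqrt(sum(c * c for c in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def attribute(unknown_text, candidates, n=3):
    """Map the unknown text onto the most similar candidate author.

    `candidates` maps an author label to a list of that author's known texts
    (an assumed input format). Returns the best label and all scores.
    """
    unknown = char_ngrams(unknown_text, n)
    scores = {}
    for author, texts in candidates.items():
        profile = Counter()
        for known in texts:
            profile.update(char_ngrams(known, n))  # one pooled profile per author
        scores[author] = cosine(unknown, profile)
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    candidates = {
        "A1": ["the cat sat on the mat and the cat slept"],
        "A2": ["quantum flux zigzag vortex jazz"],
    }
    best, scores = attribute("the cat on the mat", candidates)
    print(best, scores)
```

Note that this closed-set formulation always returns one of the candidates, even when none of them wrote the text.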

SLIDE 17

Authorship Identification

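Sub-task: Authorship Verification

Given a text of uncertain authorship and text from a specific author, the task is to determine whether the given text has been written by that author.

[Figure: the text of unknown authorship (A?) is compared with the text of author A3; the outcome is either = (same author) or ≠ (different author).]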

The problem can be considered as a one-class classification problem.
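
The one-class view can be sketched as follows: the decision boundary is derived only from the known author's own texts, with no examples of other authors. The sketch assumes character trigram profiles, cosine similarity, and a threshold set from leave-one-out similarities among the known texts; these choices, the `slack` parameter, and the function names are illustrative assumptions, not taken from the slides.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(p, q):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm_p = sqrt(sum(c * c for c in p.values()))
    norm_q = sqrt(sum(c * c for c in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def verify(unknown_text, known_texts, n=3, slack=0.9):
    """Decide whether the unknown text fits the author of `known_texts`.

    Only positive examples are used. The threshold is `slack` times the
    lowest similarity that any known text has to the author's other texts.
    Returns (accepted, score, threshold).
    """
    if len(known_texts) < 2:
        raise ValueError("need at least two known texts to set a threshold")
    profiles = [char_ngrams(t, n) for t in known_texts]

    # Leave-one-out: how similar is each known text to the rest of the author's texts?
    leave_one_out = []
    for i, held_out in enumerate(profiles):
        rest = Counter()
        for j, other in enumerate(profiles):
            if j != i:
                rest.update(other)
        leave_one_out.append(cosine(held_out, rest))
    threshold = slack * min(leave_one_out)

    # Score the unknown text against the pooled profile of all known texts.
    full = Counter()
    for profile in profiles:
        full.update(profile)
    score = cosine(char_ngrams(unknown_text, n), full)
    return score >= threshold, score, threshold


if __name__ == "__main__":
    known = [
        "the cat sat on the mat",
        "the cat ate the rat on the mat",
        "a cat and a rat sat on the mat",
    ]
    print(verify("the rat sat on the cat", known))
    print(verify("zzzz qqqq xxxx", known))
```

With so few and such short known texts the threshold is unreliable; the sketch only shows where the one-class decision comes from.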

SLIDE 18

Vandalism Detection

SLIDE 19

Vandalism Detection in Amsterdam

SLIDE 21

Vandalism Detection in Wikipedia

SLIDE 22

Vandalism Detection in Wikipedia

Example: special chars, spacing

SLIDE 24

Vandalism Detection in Wikipedia

Example: misguided helping

SLIDE 26

Vandalism Detection in Wikipedia

Example: wrong facts, opinionated, nonsense

SLIDE 29

More about PAN

SLIDE 30

More about PAN

History [pan.webis.de]

2007 2008 2009 2010 2011

SLIDE 31

More about PAN

Key Figures 2011

                          2009        2010                    2011
Task(s)                   plagiarism  plagiarism  vandalism   plagiarism  authorship  vandalism
Corpus size               5 GB        3.4 GB      8.2 GB      4.6 GB      3 MB        8.4 GB
Corpus size (cases)       94 000      68 000      32 000      61 000      4 100       64 000
Languages                 3           3           1           3           1           3
Registrations             21          38          15          30          31          18
Countries                 17          24          11          21          23          14
Run submissions           14          18          9           11          13          3
Notebook submissions      11          17          5           11          8           3
Followers (mailing list)  78          151                     181

Sponsorship by Research. Media coverage on German and Spanish television, among others.

SLIDE 33

More about PAN

Program 2011

Today
16:30 Poster Session

Wednesday
10:30 Vandalism Detection
11:00 Authorship Identification
14:30 Keynote: Linguists’ Achievements and Analysis Challenges (María Teresa Turell and Malcolm Coulthard)
15:10 Panel Discussion

Thursday
9:00 Plagiarism Detection
11:30 Reports from the Labs

SLIDE 34

Quo Vadis PAN?

SLIDE 35

Quo Vadis PAN?

Ideas for Future Editions

❑ Hide plagiarism cases in a really large corpus such as ClueWeb.

❑ Provide a unified experimentation platform for all participants.

➜ Simplify participation.
➜ Equalize implementation / hardware issues.

❑ Add “semantic” challenges.

➜ Distinguish improper text reuse from correct citations.
➜ Find “excuse” citations.

❑ Scale up evaluation corpora for authorship identification.

➜ Different genres, languages, and time periods.
➜ Focus on specific task variants.

❑ Compile significantly more training data for vandalism detection.

SLIDE 36

Thank you!

Visit us at pan.webis.de. Mail us at pan@webis.de.