SLIDE 1

PAN 2010

Uncovering Plagiarism, Authorship, and Social Software Misuse

Webis @ Bauhaus-Universität Weimar
NLEL @ Universidad Politécnica de Valencia
University of the Aegean
Bar-Ilan University
http://pan.webis.de

SLIDE 2

Who we are...

Benno Stein
Paolo Rosso
Efstathios Stamatatos
Moshe Koppel
Martin Potthast
Alberto Barrón-Cedeño
Andreas Eiselt
Teresa Holfeld

SLIDE 7

PAN Overview

Mission & Tasks

❑ Plagiarism Detection

– text plagiarism within and across languages, multimedia plagiarism
– text reuse, paraphrasing, information flow, meme tracking
– near-duplicates, high-similarity search, fingerprinting, hash-based search

❑ Authorship Identification

– models for authorship verification, authorship attribution, and writing style
– models to capture personal traits and sentiment
– text forensics, ghostwriting, intrinsic plagiarism detection

❑ Social Software Misuse

– serial sharing, lobbyism, spam
– trolling, stalking, Wikipedia vandalism
– social trust, anonymity and de-anonymization
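
The plagiarism detection topics above mention fingerprinting and hash-based search. As a rough illustration only, the sketch below hashes word 3-grams into a set and compares two texts by set overlap. The function names, the 3-gram size, and the use of a truncated SHA-1 are assumptions made for this example; this is not a method described in the slides.

```python
import hashlib


def fingerprint(text: str, n: int = 3) -> set:
    """Hash every word n-gram (shingle) of the text into a set of integers."""
    words = text.lower().split()
    shingles = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    # Keep only the first 32 bits of each digest to get a compact fingerprint.
    return {int(hashlib.sha1(s.encode("utf-8")).hexdigest()[:8], 16) for s in shingles}


def jaccard(a: set, b: set) -> float:
    """Share of hashed shingles that two fingerprints have in common."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog"
    suspect = "the quick brown fox leaps over the lazy dog"
    print(jaccard(fingerprint(source), fingerprint(suspect)))
```

A high overlap only flags a candidate pair for closer inspection; paraphrased or translated reuse would need other techniques.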

© www.webis.de

SLIDE 10

Plagiarism is the practice of claiming, or implying, original authorship of someone else’s written or creative work, in whole or in part, into one’s own without adequate acknowledgment.

[Wikipedia: Plagiarism, 2009]

SLIDE 12

... better technology nowadays ;–)

SLIDE 15

Plagiarism Detection

Research Questions

❑ Is plagiarism a problem with respect to education?
❑ Is there a misunderstanding with respect to an evolving cultural technique?
❑ Can plagiarism be detected by humans?
❑ Can plagiarism be detected by machines?
❑ Should automatic plagiarism detection algorithms become standard?

For several reasons we should say “text reuse” rather than “plagiarism”.

SLIDE 16

Vandalism Detection

SLIDE 17

Vandalism Detection in Padua

SLIDE 30

Vandalism Detection in Wikipedia

The Machine Learning Perspective

Every edit on Wikipedia has to be double-checked for integrity, even if it affects just one character.
The task is to discriminate between regular edits and vandalism edits.
Machine learning unfolds its full power in such discrimination tasks.

➜ Feature engineering plays an outstanding role.
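
To make the feature engineering point concrete, here is a minimal sketch that turns one edit (old revision, new revision) into a few numeric features a classifier could consume. The feature set, the function names, and the tiny word list are hypothetical choices for illustration; they are not the features used by the task organizers or participants.

```python
import difflib
import re

# Hypothetical, tiny word list for illustration only.
VULGAR_WORDS = {"stupid", "idiot", "sucks"}


def inserted_text(old: str, new: str) -> str:
    """Return the whitespace-separated tokens that the edit added."""
    old_tokens, new_tokens = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(new_tokens[j1:j2])
    return " ".join(added)


def edit_features(old: str, new: str) -> dict:
    """Compute a few simple features of the text inserted by an edit."""
    added = inserted_text(old, new)
    letters = [c for c in added if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    # Length of every run of one repeated character, e.g. "!!!!!" gives 5.
    runs = [len(m.group(0)) for m in re.finditer(r"(.)\1*", added)]
    words = re.findall(r"[a-z']+", added.lower())
    return {
        "size_delta": len(new) - len(old),
        "upper_ratio": upper_ratio,
        "longest_char_run": max(runs, default=0),
        "vulgar_words": sum(w in VULGAR_WORDS for w in words),
    }


if __name__ == "__main__":
    old_rev = "Padua is a city in northern Italy."
    new_rev = "Padua is a STUPID city in northern Italy!!!!!"
    print(edit_features(old_rev, new_rev))
```

Feature vectors like these, labeled as regular or vandalism, would then be passed to any standard binary classifier; the choice of learner is left open here.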

SLIDE 31

PAN Overview Cont’d

Facts and Stats

❑ Previous workshops at SIGIR’07 and ECAI’08; previous PAN plagiarism detection competition at SEPLN’09.
❑ Sponsorship by [logo] Research (2009, 2010).
❑ Media coverage on [logo] (2009, 2010), among others.

                           2009         2010         2010
                           plagiarism   plagiarism   vandalism
Corpus size (GB)           5            3.4          8.2
Corpus size (cases)        94 000       68 000       32 000
Registrations              21           38           15
Countries                  17           24           11
Run submissions            14           18           9
Notebook submissions       11           17           5
Followers (mailing list)   78           151 (2010 overall)

SLIDE 32

PAN Overview Cont’d

Program Sessions

❑ Wednesday, 9:00. PAN Task 1 - Plagiarism Detection
❑ Wednesday, 11:00. PAN Task 2 - Wikipedia Vandalism Detection
❑ Wednesday, 18:00. Poster Session
❑ Thursday, 9:00. Reports from the Labs

Web

❑ http://pan.webis.de
❑ pan@webis.de
