05.07.2018 Pipeline for TR extraction Milad Alshomary
A Pipeline for Scalable Text Reuse Analysis
Milad Alshomary 05.07.2018
Bauhaus Universität
Motivation
Plagiarism detection
METER project (Measuring Text Reuse)
Quality Flaws
Scientific community
A Pipeline for Scalable Text Reuse Extraction
TR Pipeline (D1, D2)
➔ Input: Two datasets
➔ Output: Text reuse cases
Pipeline stages: Text Preprocessing → Candidate Elimination → Text Alignment
Text Preprocessing:
➔ Content extraction
➔ Chunking
➔ Feature extraction
Candidate Elimination:
➔ Pairwise scan
➔ Text reuse heuristics
Text Alignment:
➔ Detailed scan of text reuse
➔ Picapica framework
Keys for scaling-up:
➔ Cluster computing
➔ Heuristics-based candidate elimination algorithms
For a candidacy function we proposed the following methods:
Paragraph embedding (semantic + structure)
candidacy(d11, d21) → [0, 1], for documents d11, d12, …, d1n in D1 and d21, d22, …, d2n in D2
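To make the signature candidacy(d11, d21) → [0, 1] concrete, here is a minimal sketch. It assumes each document is already represented by a paragraph embedding (a plain list of floats) and uses rescaled cosine similarity as the score; the slides do not say which score was used, so both the scoring rule and the `threshold` parameter are illustrative assumptions.

```python
import math

def candidacy(u, v):
    """Score two paragraph embeddings in [0, 1] (assumed: rescaled cosine similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 0.0  # an all-zero embedding carries no evidence of reuse
    cosine = dot / norm  # in [-1, 1]
    return min(1.0, max(0.0, (cosine + 1.0) / 2.0))  # rescale and clamp rounding error

def candidate_pairs(D1, D2, threshold=0.9):
    """Pairwise scan: keep only the (i, j) index pairs worth sending to text alignment."""
    return [
        (i, j)
        for i, u in enumerate(D1)
        for j, v in enumerate(D2)
        if candidacy(u, v) >= threshold  # threshold is a hypothetical cut-off
    ]
```

Only the pairs returned by `candidate_pairs` would go on to the expensive text alignment stage, which is the point of candidate elimination.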
Generate TR sample from Wikipedia:
➔ Wikipedia → sample 1k documents → Wikipedia document sample
➔ Text alignment using Picapica framework → TR cases → TR sample
Evaluation of “candidacy” function (TR sample):
… according to the proposed “candidacy”.
Thresholds of [1, 101, …, 100k]
… documents that have TR.
[Figure: thresholds T1, T2, T3 with labels r1, r2, p1, p2]
Semantic hashing function:
… hashes.
… exact binary hash.
… documents that intersect in one hash at least.
[Figure: inverted index over D1 and D2 with hashes 001001, 011001, 001000 and the pair 011001, 011001]
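The inverted-index lookup can be sketched as follows. The hash strings are the ones shown on the slide, but their assignment to documents, the document ids, and the extra hash 111111 are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(doc_hashes):
    """Map each binary hash to the set of document ids that produced it."""
    index = defaultdict(set)
    for doc_id, hashes in doc_hashes.items():
        for h in hashes:
            index[h].add(doc_id)
    return index

def candidate_pairs(d1_hashes, d2_hashes):
    """Return (d1, d2) pairs whose hash sets intersect in at least one hash."""
    index = build_inverted_index(d2_hashes)
    pairs = set()
    for d1_id, hashes in d1_hashes.items():
        for h in hashes:
            for d2_id in index.get(h, ()):  # exact match on the binary hash
                pairs.add((d1_id, d2_id))
    return pairs

# Hypothetical assignment of hashes to documents.
D1 = {"d11": ["001001", "011001"], "d12": ["001000"]}
D2 = {"d21": ["011001"], "d22": ["111111"]}
print(candidate_pairs(D1, D2))  # {('d11', 'd21')}
```

Each lookup costs time proportional to the number of hashes per document, so the full pairwise scan over D1 × D2 is avoided.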
Proposed semantic hashing methods:
➔ Random projection (data independent)
➔ Variational Deep Semantic Hashing, VDSH (data dependent)
[Figure: documents di and dj mapped to hashes (001, 100 and 001); VDSH: learning, then transform to a hash such as 011001]
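For the data-independent method, a minimal sketch of random projection hashing (sign of the projection onto random hyperplanes) is given below. The embedding dimension, the Gaussian hyperplanes, and the seed are assumptions, not settings taken from the slides. VDSH is not sketched because it requires a trained model.

```python
import random

def make_hyperplanes(dim, bits, seed=0):
    """Draw one random Gaussian hyperplane per output bit (seed is an arbitrary choice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def random_projection_hash(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return "".join(
        "1" if sum(w * x for w, x in zip(plane, vector)) >= 0.0 else "0"
        for plane in hyperplanes
    )

planes = make_hyperplanes(dim=4, bits=8)
v = [0.9, 0.1, 0.0, 0.3]          # hypothetical paragraph embedding
h1 = random_projection_hash(v, planes)
h2 = random_projection_hash([2 * x for x in v], planes)  # same direction, longer vector
assert h1 == h2  # the hash depends only on direction, not on length
```

Vectors pointing in similar directions tend to fall on the same side of most hyperplanes, so they tend to share a hash. More bits make collisions rarer.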
Hashing methods evaluation:
… evaluation.
… proposed hashing function.
[Figure: TR sample with example hashes 101 001 111 101 101 101 110 000 100]
Example: Precision = 2/3, Recall = 1.0

Random projection
bits   precision     recall
…
8      3.1 × 10^-4   0.8741
…
16     9.9 × 10^-4   0.324

VDSH
bits   precision     recall
…
8      2.8 × 10^-4   0.88
…
16     4.5 × 10^-3   0.73

○ Reduces the computations needed by 3
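The precision and recall figures above can be computed as in the sketch below, assuming the evaluation compares the set of documents retrieved through a shared hash with the set of documents that truly have text reuse according to the TR sample. The toy ids are hypothetical and chosen only to reproduce the Precision = 2/3, Recall = 1.0 example.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved candidate set against the true reuse set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical toy data: three documents share the query's hash,
# two of them are true reuse cases, and no true case is missed.
retrieved = {"a", "b", "c"}
relevant = {"a", "b"}
print(precision_recall(retrieved, relevant))  # (0.6666666666666666, 1.0)
```

In the tables, recall drops as the number of bits grows because longer hashes match less often, while precision rises for the same reason.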
Application on Wikipedia
TR Pipeline (Wikipedia, Wikipedia) → 100 million text reuse cases
360k Wikipedia articles
What kinds of text reuse occur in Wikipedia?
(1) Two texts describe the same topic.
(2) Two texts describe two different topics that share similar characteristics.
Text reuse types: Structure Text Reuse, Content Text Reuse
Vertical relation, Horizontal relation
Application on Wikipedia and Common Crawl
WWW → extracted web content → 10% random sample → Web Sample (… pages)
… contains fewer than 10 web pages
[Figure: number of websites by number of web pages]
TR Pipeline (Web Sample, Wikipedia)
Monthly revenue estimation:
… manually checked the existence of advertisements.
Revenue estimation:
Website            Monthly revenue   Percentage of reuse   Monthly Wikipedia value
pdxretro.com       $195              0.012                 $2.5
seqrchquarry.com   $8,850            0.096                 $850
asiatees.com       $36,000           0.017                 $613
…                  …                 …                     …
Total                                                      $1.2 million
The rough estimate of the monthly revenue of Wikipedia content.
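The rows of the per-website table are consistent with multiplying a site's monthly revenue by its share of reused content. The sketch below assumes that rule; the function name is hypothetical. Results differ slightly from the table ($2.5, $850, $613), presumably because the displayed shares are rounded.

```python
def wikipedia_value(monthly_revenue, reuse_share):
    """Assumed rule: part of a site's monthly revenue attributable to reused Wikipedia text."""
    return monthly_revenue * reuse_share

rows = [
    ("pdxretro.com", 195, 0.012),
    ("seqrchquarry.com", 8850, 0.096),
    ("asiatees.com", 36000, 0.017),
]
for site, revenue, share in rows:
    print(f"{site}: ${wikipedia_value(revenue, share):,.2f}")
# pdxretro.com: $2.34
# seqrchquarry.com: $849.60
# asiatees.com: $612.00
```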
Revenue estimation:
Reused Wikipedia page   Average page views   Average CPM   Average monthly revenue
Nuclear renaissance     645                  $2.8          $1.806
Second Chechen War      34655                $2.8          $97
Enumerated powers       12858                $2.8          $36
…                       …                    …             …
Total                                                      $900k
Average page views: extracted from Wikipedia API.
Average CPM: estimated from marketing reports.
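The per-page figures follow from the standard CPM definition (cost per 1,000 impressions), treating each page view as one impression. A short sketch with a hypothetical function name:

```python
def monthly_revenue_from_cpm(page_views, cpm):
    """Revenue = (page views / 1000) * CPM, treating one page view as one impression."""
    return page_views / 1000 * cpm

pages = [
    ("Nuclear renaissance", 645),
    ("Second Chechen War", 34655),
    ("Enumerated powers", 12858),
]
for title, views in pages:
    print(f"{title}: ${monthly_revenue_from_cpm(views, 2.8):.3f}")
# Nuclear renaissance: $1.806
# Second Chechen War: $97.034
# Enumerated powers: $36.002
```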
Monthly revenue:
Per Web sample   Number of reusing web pages   Revenue (per webpage)
59 million       15k                           $900k
590 million      150k                          $9 million
Conclusion
Pipeline: Text Preprocessing → Candidate Elimination → Text Alignment
Text reuse types: Structure Text Reuse, Content Text Reuse
Monthly revenue estimates:
Per website (all websites)   $1.2 million
Per website (highly reuse)   $15k
Per webpage                  $900k

Future Work
TR Pipeline (Wikipedia, ?)
Text Preprocessing → Candidate Elimination → Text Alignment
… TR between Wikipedia and the scientific community.
… subtask.
… Reuse cases.
… monthly revenue generated by Wikipedia content.
Wiki paragraphs → extract stop words → stopwords → generate n-grams → stopword n-grams → filter n-grams → filtered stopword n-grams
Top 50 frequent stopwords: the, of, and, a, in, to, is, was, it, for, with, he, be, on, i, that, by, at, you, 's, are, not, his, this, from, but, had, which, she, they, or, an, were, we, their, been, has, have, will, would, her, there, can, all, as, if, who, what, said
Filter n-grams:
… increases false positive.
… stopwords from C
… belonging to C is less than n-2
binary count vector
… which a specific n-gram happened in a paragraph.
count vector
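The stopword n-gram features can be sketched as follows, using the stopword list as it appears on the slide. The tokenizer, the choice n=3, and the function names are assumptions. The filtering rules are truncated on the slide and are therefore not implemented.

```python
import re

# Stopword list copied from the slide.
STOPWORDS = {
    "the", "of", "and", "a", "in", "to", "is", "was", "it", "for", "with", "he",
    "be", "on", "i", "that", "by", "at", "you", "'s", "are", "not", "his", "this",
    "from", "but", "had", "which", "she", "they", "or", "an", "were", "we", "their",
    "been", "has", "have", "will", "would", "her", "there", "can", "all", "as", "if",
    "who", "what", "said",
}

def stopword_ngrams(text, n=3):
    """Keep only stopwords, in order, then slide a window of n over that sequence."""
    tokens = re.findall(r"'s|[a-z]+", text.lower())  # simple assumed tokenizer
    seq = [t for t in tokens if t in STOPWORDS]
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def binary_vector(ngrams, vocabulary):
    """1 if the n-gram occurs in the paragraph, else 0, in vocabulary order."""
    present = set(ngrams)
    return [1 if g in present else 0 for g in vocabulary]

grams = stopword_ngrams("The cat sat on the mat and it was happy with the result.")
print(grams[:2])  # [('the', 'on', 'the'), ('on', 'the', 'and')]
```

Because only function words are kept, two paragraphs with the same sentence structure but different content words produce the same n-grams.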
VDSH usage
[Figure: precision and recall for thresholds from 1 to 1000 in steps of 5; one panel for documents from the sample with at most 10 aligned documents, one panel for documents with more than 10 aligned documents]
… t_percent_reused < 0.5) => content reuse, otherwise structure reuse
                                 Structure reuse   Content reuse
Sample1                          100%              58%
Sample2 (Text1 or Text2 > 200)   100%              73%