A Pipeline for Scalable Text Reuse Analysis, Milad Alshomary


slide-1
SLIDE 1

05.07.2018 Pipeline for TR extraction Milad Alshomary

A Pipeline for Scalable Text Reuse Analysis

Milad Alshomary 05.07.2018

Bauhaus Universität


slide-2
SLIDE 2

Overview

  • Motivation
  • A Pipeline for Scalable Text Reuse Extraction
  • Application on Wikipedia
  • Application on Wikipedia and Common Crawl
  • Conclusion
slide-3
SLIDE 3

Motivation

Text Reuse (TR)

  • Quoting
  • Verbatim
  • Paraphrasing
  • Translation
  • Summarization
slide-4
SLIDE 4

TR Detection Applications

  • METER project (Measuring Text Reuse)
  • Plagiarism detection

slide-7
SLIDE 7

Wikipedia vs The World

  • Digital Encyclopedia
  • Collaborative environment
  • Giant public source of information
  • Free to use

slide-11
SLIDE 11

Wikipedia vs The World

Quality Flaws

Scientific community

slide-12
SLIDE 12

Wikipedia vs The World

  • Web pages = Wikipedia text + advertisements

slide-13
SLIDE 13

Research Questions

➔ What kinds of text reuse occur within Wikipedia?
➔ How much of the web is a copy of Wikipedia content?
➔ How much revenue does this content generate?

slide-16
SLIDE 16

A Pipeline for Scalable Text Reuse Extraction

slide-17
SLIDE 17

Text Reuse Pipeline

Diagram: datasets D1 and D2 feed the TR Pipeline

➔ Input: Two datasets
➔ Output: Text reuse cases

slide-21
SLIDE 21

Text Reuse Pipeline

Text Preprocessing → Candidate Elimination → Text Alignment

Text Preprocessing:
➔ Content extraction
➔ Chunking
➔ Feature extraction

Candidate Elimination:
➔ Pairwise scan
➔ Text Reuse heuristics

Text Alignment:
➔ Detailed scan of text reuse
➔ Picapica framework

slide-22
SLIDE 22

Candidate Elimination

Text Preprocessing → Candidate Elimination → Text Alignment

Keys for scaling up:
➔ Cluster computing
➔ Heuristics-based candidate elimination algorithms

slide-24
SLIDE 24

Candidate Elimination

For a candidacy function we proposed the following methods:

  • Cosine similarity of TF-IDF (semantic)
  • Paragraph embedding (semantic)
  • Stopwords N-grams (structure)
  • Weighted average of Stopwords N-grams and Paragraph embedding (semantic + structure)

Diagram: candidacy(d11, d21) → [0, 1], applied to pairs of documents d11, d12, ..., d1n from D1 and d21, d22, ..., d2n from D2
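As a rough illustration of a candidacy function that maps a pair of chunks to a score in [0, 1], the sketch below computes cosine similarity over TF-IDF vectors and combines two scores by a weighted average. The function names, the smoothed IDF formula, the whitespace tokenisation and the weight `alpha` are all assumptions made for this example; the slides do not specify them.

```python
import math
from collections import Counter


def tfidf_vectors(chunks):
    """Return one sparse TF-IDF vector (a dict) per tokenised chunk."""
    n = len(chunks)
    # Document frequency: in how many chunks each term occurs.
    df = Counter(term for chunk in chunks for term in set(chunk))
    vectors = []
    for chunk in chunks:
        tf = Counter(chunk)
        # Smoothed IDF (an assumption) keeps every weight strictly positive.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_candidacy(structure_score, semantic_score, alpha=0.5):
    """Weighted average of a structure score and a semantic score (alpha is hypothetical)."""
    return alpha * structure_score + (1.0 - alpha) * semantic_score


# Toy chunks, tokenised by whitespace.
chunks = [
    "the pipeline detects text reuse in wikipedia".split(),
    "the pipeline detects text reuse on the web".split(),
    "bananas are rich in potassium".split(),
]
vecs = tfidf_vectors(chunks)
score_similar = cosine(vecs[0], vecs[1])
score_different = cosine(vecs[0], vecs[2])
```

With non-negative TF-IDF weights the cosine already lies in [0, 1], so it can be thresholded or ranked directly as a candidacy score.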

slide-27
SLIDE 27

Candidate Elimination

Generate TR Sample from Wikipedia:

  • Sample 1k documents from Wikipedia
  • Using Picapica framework to find TR cases

Diagram: Wikipedia → (sample 1k documents) → Document Sample → (text alignment using Picapica framework) → TR sample

TR sample:

  • 232 documents
  • ~90% have < 10 alignments (TR cases)

slide-28
SLIDE 28

Candidate Elimination

Evaluation of “candidacy” function:

  • For each document in the TR sample:
  • Sort all Wikipedia articles according to the proposed “candidacy” function.
  • Precision/Recall at thresholds of [1, 101, ..., 100k]
  • A True Positive (TP) is a pair of documents that have TR.

Diagram: TR sample; ranked list with thresholds T1, T2, T3

slide-29
SLIDE 29

Candidate Elimination

Evaluation of “candidacy” function:

Diagram labels: TR sample; thresholds T1, T2, T3; r1, r2; p1, p2
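The evaluation above can be sketched as precision and recall at rank cutoffs. This is a minimal illustration that assumes the thresholds are top-k cutoffs on the list sorted by candidacy score; `precision_recall_at` and the toy document identifiers are hypothetical names, not part of the original pipeline.

```python
def precision_recall_at(ranked_ids, relevant_ids, cutoffs):
    """Precision and recall of the top-k ranked documents for each cutoff k."""
    relevant = set(relevant_ids)
    results = {}
    for k in cutoffs:
        top_k = ranked_ids[:k]
        # A true positive is a retrieved document that really shares text reuse.
        tp = sum(1 for doc_id in top_k if doc_id in relevant)
        precision = tp / len(top_k) if top_k else 0.0
        recall = tp / len(relevant) if relevant else 0.0
        results[k] = (precision, recall)
    return results


# Toy ranking for one query document; "a", "c" and "f" are its true TR partners.
ranked = ["a", "b", "c", "d", "e", "f"]
relevant = {"a", "c", "f"}
pr = precision_recall_at(ranked, relevant, cutoffs=[1, 3, 6])
```

Raising the cutoff can only keep or increase recall, while precision typically drops, which is the trade-off the threshold sweep is meant to expose.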

slide-30
SLIDE 30

Candidate Elimination

Semantic hashing function:

  • Hashes documents into binary hashes.
  • Similar documents get similar or identical binary hashes.

Diagram: two similar documents both hash to 011001

slide-31
SLIDE 31

Candidate Elimination

Semantic hashing function:

  • Hashing all documents.
  • Inverted index.
  • Hash document’s chunks.
  • Apply candidacy function only on documents that share at least one hash.

Diagram: inverted index with hashes 001001, 011001, 001000; D1 and D2 both map to 011001
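The hash-based filtering step can be sketched with a plain dictionary as the inverted index. This is an illustrative sketch only: the assignment of hashes to documents below is made up, and `build_inverted_index` and `candidate_pairs` are hypothetical helper names.

```python
from collections import defaultdict


def build_inverted_index(doc_hashes):
    """Map each binary hash to the set of documents that have a chunk with that hash."""
    index = defaultdict(set)
    for doc_id, hashes in doc_hashes.items():
        for h in hashes:
            index[h].add(doc_id)
    return index


def candidate_pairs(index):
    """Return all document pairs that share at least one hash."""
    pairs = set()
    for docs in index.values():
        ordered = sorted(docs)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


# Toy data: each document is represented by the hashes of its chunks.
doc_hashes = {
    "D1": {"001001", "011001"},
    "D2": {"011001", "001000"},
    "D3": {"111111"},
}
index = build_inverted_index(doc_hashes)
pairs = candidate_pairs(index)
```

Only the pairs returned here would be passed on to the more expensive candidacy function; every other pair is skipped.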

slide-36
SLIDE 36

Candidate Elimination

Proposed semantic hashing methods:

  • Random Projection (data independent)
  • Variational Deep Semantic Hashing (data dependent)

Diagram: di → 001, 100; dj → 001
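Random projection hashing is commonly implemented by taking the sign of the dot product with random hyperplanes. The sketch below shows that generic technique under stated assumptions (Gaussian hyperplanes, a fixed seed, dense toy vectors); it is not taken from the presentation, and the function names are hypothetical.

```python
import random


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random hyperplanes with Gaussian components (seeded for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def random_projection_hash(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(dim=4, bits=8, seed=42)
v = [0.9, 0.1, 0.0, 0.3]
h_v = random_projection_hash(v, planes)
h_scaled = random_projection_hash([2 * x for x in v], planes)  # same direction
h_neg = random_projection_hash([-x for x in v], planes)        # opposite direction
```

Because only the sign of each projection is kept, vectors pointing in the same direction receive the same hash, and vectors separated by a small angle differ in few bits. The hyperplanes do not depend on the data, which is what "data independent" refers to.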

slide-38
SLIDE 38

Candidate Elimination

Proposed semantic hashing methods:

  • Random Projection (data independent)
  • Variational Deep Semantic Hashing (data dependent)

Diagram: Learning → VDSH; Transform: VDSH → 011001

slide-41
SLIDE 41

Candidate Elimination

Hashing methods evaluation:

  • Using same TR sample for evaluation.
  • Hashing all documents using the proposed hashing function.
  • Compute precision and recall.

Diagram: TR sample documents with hashes 101, 001, 111, 101, 101, 101, 110, 000, 100

Precision = 2/3, Recall = 1.0

slide-42
SLIDE 42

Candidate Elimination

Hashing methods evaluation

  • Using same TR sample for evaluation.
  • Hashing all documents using the proposed hashing function.
  • Compute precision and recall.

Random projection
bits    precision      recall
….
8       3.1 x 10^-4    0.8741
….
16      9.9 x 10^-4    0.324

VDSH
bits    precision      recall
….
8       2.8 x 10^-4    0.88
….
16      4.5 x 10^-3    0.73

slide-43
SLIDE 43

Candidate Elimination

Hashing methods evaluation

VDSH
bits    precision      recall
8       2.8 x 10^-4    0.88
16      4.5 x 10^-3    0.73

  • Retains 73% of the recall
  • By experiment:
      ○ Reduces the computations needed by 3 orders of magnitude

slide-44
SLIDE 44

Application on Wikipedia

slide-45
SLIDE 45

Text Reuse In Wikipedia

➔ What kinds of text reuse occur within Wikipedia?
➔ How much of the web is a copy of Wikipedia content?
➔ How much revenue does this content generate?

slide-46
SLIDE 46

Text Reuse In Wikipedia

Diagram: Wikipedia articles (both inputs) → TR Pipeline → 100 million text reuse cases; 360k Wikipedia articles

slide-47
SLIDE 47

Text Reuse In Wikipedia

What kinds of text reuse occur in Wikipedia?

  • Reasons behind text reuse:
    (1) Two texts describe the same topic.
    (2) Two texts describe two different topics that share similar characteristics.

slide-48
SLIDE 48

Text Reuse In Wikipedia

What kinds of text reuse occur in Wikipedia?

  • Reasons behind text reuse:
    (1) Two texts describe the same topic.

Diagram: Text Reuse → Structure Text Reuse, Content Text Reuse

slide-50
SLIDE 50

Text Reuse In Wikipedia

What kinds of text reuse occur in Wikipedia?

  • Reasons behind text reuse:
    (2) Two texts describe two different topics that share similar characteristics.

Diagram: Text Reuse → Structure Text Reuse, Content Text Reuse

slide-52
SLIDE 52

Text Reuse In Wikipedia

  • Vertical alignment → Content TR
  • Horizontal alignment → Structure TR

Diagram labels: Vertical relation, Horizontal relation

slide-55
SLIDE 55

Application on Wikipedia and Common Crawl

slide-56
SLIDE 56

Wikipedia vs The Web

➔ What kinds of text reuse occur within Wikipedia?
➔ How much of the web is a copy of Wikipedia content?
➔ How much revenue does this content generate?

slide-58
SLIDE 58

Wikipedia vs The Web

WWW
  • Crawling

Extracted web content
  • Content extraction
  • Keeping only English pages

Web Sample
  • 10% random sample
  • 59 million web pages.
  • 1.4 million websites.
  • 70% of these websites contain fewer than 10 web pages.

Figure axes: Number of web pages, Number of websites

slide-59
SLIDE 59

Wikipedia vs The Web

Diagram: Web Sample and Wikipedia → TR Pipeline

  • 1.6 million text reuse cases.
  • 15k pages reuse Wikipedia text.
  • 4.8k websites reuse Wikipedia text.

slide-60
SLIDE 60

Wikipedia vs The Web

Monthly revenue estimation:

  • Rough estimate of ad revenue
  • Based on CPM (cost per mille, i.e. per 1,000 impressions)
  • Sampled 100 web pages and manually checked for the presence of advertisements.

slide-65
SLIDE 65

Wikipedia vs The Web

Revenue estimation:

  • Per website (all websites)
  • Per website (highly reusing)
  • Per Wikipedia web page

website             Monthly revenue   Percentage of reuse   Monthly Wikipedia value
pdxretro.com        $195              0.012                 $2.5
seqrchquarry.com    $8,850            0.096                 $850
asiatees.com        $36,000           0.017                 $613
….                  …..               …..                   ….
Total                                                       $1.2 million

The total is the rough estimate of monthly revenue of Wikipedia content.
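The last column of the table appears to be roughly the monthly revenue multiplied by the reuse fraction. The sketch below works under that assumption, which the slides do not state explicitly; `wikipedia_value` is a hypothetical helper. The printed fractions look rounded, so the products only approximate the table (849.6 against $850, 612.0 against $613).

```python
def wikipedia_value(monthly_revenue_usd, reuse_fraction):
    """Share of a site's monthly revenue attributed to reused Wikipedia text (assumed formula)."""
    return monthly_revenue_usd * reuse_fraction


# Rows copied from the table: (website, monthly revenue in USD, reuse fraction).
rows = [
    ("seqrchquarry.com", 8850, 0.096),
    ("asiatees.com", 36000, 0.017),
]
values = {site: wikipedia_value(revenue, fraction) for site, revenue, fraction in rows}
```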

slide-66
SLIDE 66

Wikipedia vs The Web

Revenue estimation:

  • Per website (all websites)
  • Per website (highly reusing)
  • Per Wikipedia web page

Per website (highly reusing):

  • Percentage of pages reusing Wikipedia >= 0.5
  • 87 websites.
  • Estimated monthly revenue: $15k

slide-67
SLIDE 67

Wikipedia vs The Web

Revenue estimation:

  • Per website (all websites)
  • Per website (highly reusing)
  • Per Wikipedia web page

Reused Wikipedia page   Average page views   Average CPM   Average monthly revenue
Nuclear renaissance     645                  $2.8          $1.806
Second Chechen War      34655                $2.8          $97
Enumerated powers       12858                $2.8          $36
….                      …..                  …..           ….
Total                                                      $900k

Annotation on the table: Extracted from Wikipedia API
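The rows of this table are consistent with the usual CPM arithmetic, revenue = page views / 1000 × CPM. The sketch below reproduces the three listed rows; `monthly_ad_revenue` is a hypothetical helper name, and one ad impression per page view is an assumption.

```python
def monthly_ad_revenue(page_views, cpm_usd):
    """Estimated revenue in USD, assuming one ad impression per page view."""
    return page_views / 1000 * cpm_usd


# Rows copied from the table: (page title, average page views), at an average CPM of $2.8.
pages = [
    ("Nuclear renaissance", 645),
    ("Second Chechen War", 34655),
    ("Enumerated powers", 12858),
]
revenue = {title: monthly_ad_revenue(views, 2.8) for title, views in pages}
```

For example, 645 page views at a CPM of $2.8 gives 0.645 × 2.8 = $1.806, matching the first row.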

slide-68
SLIDE 68

Wikipedia vs The Web

Revenue estimation per Wikipedia web page (same table as on the previous slide).

Annotation on the table: Estimated from marketing reports

slide-71
SLIDE 71

Wikipedia vs The Web

Monthly revenue:

Per Web sample   Number of reusing web pages   Revenue (per webpage)
59 million       15k                           $900k
590 million      150k                          $9 million

slide-73
SLIDE 73

Conclusion

slide-74
SLIDE 74

Summary

  • Pipeline for TR Extraction: Text Preprocessing → Candidate Elimination → Text Alignment
  • Text Reuse in Wikipedia: Structure Text Reuse, Content Text Reuse
  • Text Reuse between Wikipedia and the Web:

Per website (all websites)   Per website (highly reusing)   Per web page
$1.2 million                 $15k                           $900k

slide-78
SLIDE 78

Future Work

  • Using the pipeline to extract and analyze TR between Wikipedia and the scientific community.
  • Experiments on the Text Alignment subtask.
  • Further analysis of the extracted Text Reuse cases.
  • A more accurate estimate of the monthly revenue generated by Wikipedia content.

Diagram: Wikipedia and "?" → TR Pipeline (Text Preprocessing → Candidate Elimination → Text Alignment)

slide-79
SLIDE 79

Backup Slides

slide-80
SLIDE 80

  • Candidate Elimination functions:
slide-81
SLIDE 81

  • Stopwords N-grams procedure:

Wiki paragraphs → (extract stop words) → stopwords → (generate n-grams) → stopword n-grams → (filter n-grams) → filtered stopword n-grams → binary count vector

Top 50 frequent stopwords: the, of, and, a, in, to, is, was, it, for, with, he, be, on, i, that, by, at, you, 's, are, not, his, this, from, but, had, which, she, they, or, an, were, we, their, been, has, have, will, would, her, there, can, all, as, if, who, what, said

Filter n-grams:

  • Let C = {the, of, and, a, in, to, ’s} be the stopwords that increase false positives.
  • An n-gram X is accepted if:
      ○ It doesn’t contain more than n-1 stopwords from C
      ○ The maximal sequence of stopwords belonging to C is less than n-2

Binary count vector:

  • The binary count vector ignores how often a specific n-gram occurs in a paragraph.
  • We apply the scoring function on the binary count vector.
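The procedure above can be sketched as follows. This is a minimal illustration, not the original implementation: the helper names, the whitespace tokenisation, and the use of a Python set as the binary count vector are assumptions. The two acceptance conditions are implemented as worded on the slide.

```python
# Stopword list copied from the slide; C is the subset that increases false positives.
STOPWORDS = {
    "the", "of", "and", "a", "in", "to", "is", "was", "it", "for", "with", "he",
    "be", "on", "i", "that", "by", "at", "you", "'s", "are", "not", "his", "this",
    "from", "but", "had", "which", "she", "they", "or", "an", "were", "we", "their",
    "been", "has", "have", "will", "would", "her", "there", "can", "all", "as", "if",
    "who", "what", "said",
}
C = {"the", "of", "and", "a", "in", "to", "'s"}


def stopword_sequence(tokens):
    """Keep only the stopwords of a paragraph, in their original order."""
    return [t for t in (tok.lower() for tok in tokens) if t in STOPWORDS]


def ngrams(sequence, n):
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def max_run_in_c(gram):
    """Length of the longest run of consecutive stopwords that belong to C."""
    best = run = 0
    for word in gram:
        run = run + 1 if word in C else 0
        best = max(best, run)
    return best


def is_accepted(gram):
    """Acceptance test as worded on the slide, for an n-gram of length n."""
    n = len(gram)
    members_of_c = sum(1 for word in gram if word in C)
    return members_of_c <= n - 1 and max_run_in_c(gram) < n - 2


def binary_ngram_vector(tokens, n):
    """Set of accepted stopword n-grams; set membership ignores frequency."""
    return {g for g in ngrams(stopword_sequence(tokens), n) if is_accepted(g)}


tokens = "He said that it was the best of times and that she would be there".split()
vector = binary_ngram_vector(tokens, n=4)
```

Representing the binary count vector as a set keeps presence information only, so an n-gram that occurs several times in a paragraph counts once.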

slide-82
SLIDE 82

  • VDSH explained:

VDSH USAGE

slide-83
SLIDE 83

  • Performance of candidacy functions on different thresholds:

Figure: precision and recall plots for two groups: documents from the sample with <= 10 aligned documents, and documents from the sample with > 10 aligned documents. Thresholds range from 1 to 1000 in steps of 5.

slide-85
SLIDE 85

  • Candidate Elimination procedure over the cluster:
slide-86
SLIDE 86

  • Hash based Candidate Elimination procedure over the cluster:
slide-88
SLIDE 88

  • Heuristics:
  • H1: ne_sim ∈ (0.5, 1.0] AND 10grams_sim > 0.5 AND (s_percent_reused < 0.5 OR t_percent_reused < 0.5) => content reuse, otherwise structure reuse
  • 6700 content reuse cases only
  • Validation on two random samples of size 100:

                                  Structure reuse   Content reuse
Sample1                           100%              58%
Sample2 (Text1 or Text2 > 200)    100%              73%
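Heuristic H1 translates directly into a boolean rule. The sketch below is illustrative: the parameter names follow the slide (with `ngram10_sim` standing in for `10grams_sim`, which is not a valid Python identifier), the slides do not define how these features are computed, and `classify_reuse` is a hypothetical function name.

```python
def classify_reuse(ne_sim, ngram10_sim, s_percent_reused, t_percent_reused):
    """Apply heuristic H1: return "content" if all conditions hold, else "structure"."""
    is_content = (
        0.5 < ne_sim <= 1.0                 # ne_sim in the half-open interval (0.5, 1.0]
        and ngram10_sim > 0.5               # 10grams_sim on the slide
        and (s_percent_reused < 0.5 or t_percent_reused < 0.5)
    )
    return "content" if is_content else "structure"


label_a = classify_reuse(0.8, 0.7, 0.2, 0.9)  # all conditions hold
label_b = classify_reuse(0.5, 0.7, 0.2, 0.9)  # ne_sim is not strictly above 0.5
label_c = classify_reuse(0.8, 0.7, 0.6, 0.9)  # both texts are at least half reused
```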