Slide 1

Introduction Taxonomy of Algorithms Algorithms Evaluation Corpus Summary

New Issues in Near-duplicate Detection

Martin Potthast and Benno Stein Bauhaus University Weimar Web Technology and Information Systems

GfKl’07 Mar. 7th, 2007 Stein/Potthast

Slide 2–3

Motivation

About 30% of the Web is redundant.

[Fetterly 03, Broder 06]

Content redundancy occurs in various forms:

❑ Mirrors
❑ Crawl artifacts, such as the same text with a different date or a different advertisement, available through multiple URLs
❑ Versions created for different delivery mechanisms (HTML, PDF, etc.)
❑ Annotated and unannotated copies of the same document
❑ Policies and procedures for the same purpose in different legislatures
❑ “Boilerplate” text such as license agreements or disclaimers
❑ Shared context such as summaries of other material or lists of links
❑ Syndicated news articles delivered in different venues
❑ Revisions and versions
❑ Reuse and republication of text (legitimate and otherwise)

[Zobel 06]

Nearly exact copies and modified copies with high similarity. ➜ Near-duplicate documents.

Slide 4

Motivation

Contributions of near-duplicate detection to real-world tasks:

❑ Index size reduction
❑ Search result cleaning
❑ Web crawl prioritization
❑ Plagiarism analysis

Our contributions to near-duplicate detection:

❑ Classification of near-duplicate detection algorithms
❑ Presentation of a new tailored corpus for evaluation
❑ Comparison of current algorithms (including previously unconsidered hashing technologies)

Slide 5

Formalization

Consider a set of documents D. Given a query document dq, find all documents Dq ⊂ D with a high similarity to dq.

➜ Naive approach: compare dq with each d ∈ D. In detail: construct document models for D and dq, obtaining D and dq, and employ a similarity function ϕ : D × D → [0, 1].
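The naive approach can be sketched directly; the bag-of-words document model, the 0.7 threshold, and the toy collection are illustrative assumptions, not the talk's parameters:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Similarity function phi on bag-of-words document models, in [0, 1]."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Naive approach: compare dq with each d in D (quadratic over all queries).
D = ["the quick brown fox", "an entirely unrelated text", "the quick brown dog"]
dq = "the quick brown fox"
Dq = [d for d in D if cosine_similarity(d, dq) > 0.7]
```

The quadratic cost of this all-pairs comparison is exactly what fingerprinting avoids.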

❑ Near-duplicate detection algorithms rely on purposefully constructed document models, called fingerprints.
❑ A fingerprint is a set of k natural numbers, computed on the basis of document extracts.
❑ Two documents are considered duplicates if their fingerprints share at least kd, kd < k, numbers.
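In code, the duplicate test over fingerprints is a set intersection; the example fingerprints and kd = 2 are made up for illustration:

```python
def is_near_duplicate(fp_a, fp_b, k_d=2):
    """Fingerprints are sets of k natural numbers; two documents count as
    near-duplicates if the sets share at least k_d (k_d < k) numbers."""
    return len(fp_a & fp_b) >= k_d

fp_a = {351427, 125497, 908812, 441023}
fp_b = {351427, 125497, 773310, 662901}   # shares two numbers with fp_a
is_near_duplicate(fp_a, fp_b)  # → True
```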

Slide 6

Taxonomy of Fingerprinting Algorithms

Fingerprinting methods split into chunking-based approaches and similarity hashing.

Chunking: k chunks are selected from a document d and hashed:

  chunks c1, c2 of d ➜ hashcodes p1 = h(c1), p2 = h(c2) ➜ fingerprint Fd = {p1, p2}, e.g. {351427, 125497}

Chunks are also called n-grams or shingles.
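A minimal chunking-based fingerprint following this pipeline (chunks c_i ➜ hashcodes p_i = h(c_i) ➜ F_d); word trigrams and the keep-the-k-smallest selection heuristic are assumptions, not a specific published scheme:

```python
import hashlib

def chunk_fingerprint(text, n=3, k=4):
    """Select chunks (word n-grams / shingles) from d, hash each chunk,
    and return the fingerprint F_d as a set of at most k numbers."""
    words = text.lower().split()
    chunks = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    codes = {int(hashlib.sha1(c.encode()).hexdigest(), 16) % 10**6 for c in chunks}
    return set(sorted(codes)[:k])  # selection heuristic: k smallest hashcodes
```

Because identical chunks always hash to identical codes, overlapping text yields overlapping fingerprints.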

Slide 7

Taxonomy of Fingerprinting Algorithms

Fingerprinting methods split into chunking-based approaches and similarity hashing. Chunking-based algorithms differ in how chunks are selected:

❑ All chunks: n-gram model
❑ Collection-specific: rare chunks; SPEX, I-Match
❑ (Pseudo-)random: random or every n-th chunk; shingling
❑ Synchronized: prefix anchors, hashed breakpoints
❑ Local: winnowing
❑ Cascading: super- and megashingling

Chunking: k chunks are selected from a document d. Selection heuristics:

❑ all
❑ based on knowledge about D
❑ intelligent random choices

Slide 8

Taxonomy of Fingerprinting Algorithms


Similarity hashing: k particular hash functions hϕ : D → U, U ⊂ ℕ, with the property

  hϕ(d) = hϕ(dq) ⇒ ϕ(d, dq) ≥ 1 − ε,  with d ∈ D, 0 < ε ≪ 1,

are used to generate k hashcodes for a document d.

Slide 9

Taxonomy of Fingerprinting Algorithms

Similarity hashing splits into knowledge-based approaches (fuzzy-fingerprinting) and randomized approaches (locality-sensitive hashing).

Hash function construction: domain knowledge vs. randomization.

Slide 10

Taxonomy of Fingerprinting Algorithms


For the algorithms in the upper box, fingerprints have to share more than one number, kd > 1, for the documents to be recognized as duplicates. For the algorithms in the lower box, fingerprints need to share only one number, kd = 1.

Slide 11

(Cascaded) Chunking

(Super-)Shingling (SSh)

[Broder 97]

n-gram vector space model ➜ hash value computation ➜ fingerprint (random choice):

  (w1 w2 w3), (w2 w3 w4), (w3 w4 w5), ..., (wm-2 wm-1 wm) ➜ (12354, 43586, 59634, 15695, ...) ➜ {12354, 15695, ..., 55476}

Slide 12

(Cascaded) Chunking

(Super-)Shingling (SSh)

[Broder 97]

Cascade (shingling ➜ super-shingling): the shingling fingerprint {12354, 15695, ..., 55476} is written as a string representation, "12354 15695 ... 55476", to which shingling is applied again.
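The cascade can be sketched as follows; the hash function, chunk sizes, and keep-the-k-smallest selection are illustrative assumptions, not Broder's exact parameters:

```python
import hashlib

def h(s):
    """Toy hash function mapping a string to a natural number."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % 10**5

def shingle(tokens, n=3, k=4):
    """Hash all n-grams of the token sequence, keep the k smallest codes."""
    grams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return sorted(h(g) for g in grams)[:k]

def super_shingle(text):
    """Cascade: shingle the document, write the fingerprint as a string
    representation ("12354 15695 ..."), then shingle that string again."""
    first = shingle(text.lower().split())              # shingling
    return shingle([str(c) for c in first], n=2, k=2)  # super-shingling
```

The second stage compresses the fingerprint further, which is why super-shingled documents must match on very few numbers to be flagged.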

Slide 13

Similarity Hashing

Fuzzy-Fingerprinting (FF)

[Stein 05]

A priori probabilities of prefix classes in the BNC, and distribution of prefix classes in the sample ➜ normalization and difference computation ➜ fuzzification ➜ fingerprint, e.g. {213235632, 157234594}

All words having the same prefix belong to the same prefix class.
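A heavily simplified sketch of this pipeline. The prefix classes, the a-priori probabilities (stand-ins, not actual BNC values), the fuzzification intervals, and the single output code (the real method computes k codes with different fuzzification schemes) are all illustrative assumptions:

```python
import hashlib

# Stand-in a-priori probabilities of prefix classes ("o" = all others);
# the real method uses probabilities estimated from the BNC.
EXPECTED = {"a": 0.11, "b": 0.05, "c": 0.09, "t": 0.16, "o": 0.59}

def fuzzy_fingerprint(text):
    """Prefix-class distribution -> normalization and difference
    computation -> fuzzification -> hashcode."""
    words = text.lower().split()
    counts = dict.fromkeys(EXPECTED, 0)
    for w in words:
        counts[w[0] if w[0] in EXPECTED else "o"] += 1
    # normalization and difference against the a-priori distribution
    dev = [counts[p] / len(words) - EXPECTED[p] for p in EXPECTED]
    # fuzzification: map each deviation onto three coarse values
    fuzzy = tuple(0 if d < -0.05 else 2 if d > 0.05 else 1 for d in dev)
    return int(hashlib.md5(repr(fuzzy).encode()).hexdigest(), 16) % 10**9
```

Because deviations are coarsened before hashing, small edits that barely shift the prefix-class distribution tend to leave the hashcode unchanged.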

Slide 14

Similarity Hashing

Locality-Sensitive Hashing (LSH)

[Indyk and Motwani 98, Datar et. al. 04]

Vector space with the sample document d and random vectors a1, ..., ar ➜ dot product computation aiT · d, mapped onto the real number line ➜ fingerprint, e.g. {213235632}

The results of the r dot products are summed.
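A sketch in the spirit of the slide: r random vectors, dot products ai · d, summed and discretized into a hashcode. The Gaussian random vectors, bucket width, and fixed seed are assumptions; real LSH uses a whole family of such functions:

```python
import random

def lsh_hashcode(d, r=8, width=0.5, seed=7):
    """Dot products of document vector d with r random vectors a_1..a_r;
    the r results are summed and the sum is discretized into a bucket."""
    rng = random.Random(seed)   # fixed seed: same hash functions for every document
    total = 0.0
    for _ in range(r):
        a = [rng.gauss(0.0, 1.0) for _ in d]       # random vector a_i
        total += sum(x * y for x, y in zip(a, d))  # a_i . d
    return int(total // width)  # bucket on the real number line
```

Nearby vectors produce nearby sums and therefore tend to fall into the same bucket; the bucket index serves as the hashcode.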

Slide 15

Evaluation Corpus

Wikipedia snapshot including all revisions. Existing standard corpora (TREC, Reuters) are not suited for large-scale evaluations of near-duplicate detection algorithms. Wikipedia is a rich resource of versioned and revisioned documents. Benchmark data:

❑ approx. 6 million pages (documents)
❑ approx. 80 million revisions
❑ an XML file of approx. 1 TB
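At roughly 1 TB, the dump cannot be parsed in one piece; a streaming sketch using Python's standard library. The MediaWiki export namespace URI and element names are assumptions about the dump schema and vary between dump versions:

```python
import xml.etree.ElementTree as ET

# Namespace of the MediaWiki export schema (an assumption; check your dump).
NS = "{http://www.mediawiki.org/xml/export-0.3/}"

def iter_revisions(path):
    """Stream (page_title, revision_text) pairs from a Wikipedia XML dump
    without loading the whole file into memory."""
    title = None
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == NS + "title":
            title = elem.text
        elif elem.tag == NS + "revision":
            node = elem.find(NS + "text")
            yield title, (node.text or "") if node is not None else ""
            elem.clear()  # release the processed revision subtree
```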

Slide 16

Evaluation Corpus

Experiments on the Wikipedia revision corpus:

❑ the first revision of each Wikipedia page plays the role of dq
❑ dq was compared with each of its revisions
❑ dq was compared with its immediately succeeding page

Reference: vector space model with tf weighting and cosine similarity.

[Figure: percentage of similarities (log scale, 0.0001–1) per similarity interval (0.2–1.0) for the Wikipedia and Reuters corpora.]

Precision and recall were recorded for similarity thresholds ranging from 0 to 1.
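The recorded measures can be sketched as follows; the pairs of (fingerprint-based similarity, vector-space reference similarity) are hypothetical data, not the study's measurements:

```python
def precision_recall(pairs, threshold):
    """pairs: (detected_similarity, reference_cosine_similarity) per document
    pair. Retrieved = the detected similarity clears the threshold;
    relevant = the vector-space reference does."""
    retrieved = [p for p in pairs if p[0] >= threshold]
    relevant  = [p for p in pairs if p[1] >= threshold]
    hits      = [p for p in retrieved if p[1] >= threshold]
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    recall    = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall

# hypothetical (fingerprint similarity, cosine reference) pairs
pairs = [(0.9, 0.95), (0.8, 0.4), (0.3, 0.85), (0.1, 0.05)]
precision_recall(pairs, 0.7)  # → (0.5, 0.5)
```

Sweeping the threshold from 0 to 1 traces out the precision and recall curves shown on the following slides.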

Slide 17

Evaluation Results

[Figure: recall over similarity (0.2–1.0) on the Wikipedia revision corpus for FF, LSH, SSh, Sh, and HBC.]

Slide 18

Evaluation Results

[Figure: precision over similarity (0.2–1.0) on the Wikipedia revision corpus for FF, LSH, SSh, Sh, and HBC.]

Slide 19

Summary

Near-duplicate detection accuracy:

❑ FF outperforms the other algorithms in terms of recall.
❑ No algorithm outperforms another in terms of precision.
❑ LSH performs poorly in both respects.

Wikipedia Revision Collection:

❑ May become a new standard for high-similarity evaluations.
❑ Allows for evaluations at Web scale.

Conclusions:

➜ Similarity hashing is a promising technology for near-duplicate detection.
➜ There is still room for improvement.
➜ Chunking strategies are susceptible to versioned documents.

Slide 20

Thank you!
