Introduction Taxonomy of Algorithms Algorithms Evaluation Corpus Summary
New Issues in Near-duplicate Detection
Martin Potthast and Benno Stein Bauhaus University Weimar Web Technology and Information Systems
GfKl’07 Mar. 7th, 2007 Stein/Potthast
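[Figure: fingerprint construction. A document d is split into chunks c1, c2; each chunk is mapped to a hashcode (125497, 351427); the set of hashcodes forms the fingerprint {351427, 125497}. The diagram also carries the label σ.]
p1 = h(c1), p2 = h(c2), Fd = { p1, p2 }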
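The construction Fd = { h(c) : c is a chunk of d } can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method from the slides: the chunking strategy (overlapping word 3-grams), the hash function (MD5 truncated to 32 bits), and the helper names chunks, fingerprint, and share_hashcode are all hypothetical choices made for this example.

```python
import hashlib


def chunks(text, n=3):
    """Split a document into overlapping word n-grams.

    Assumption: word n-grams are just one possible chunking strategy.
    """
    words = text.lower().split()
    if len(words) < n:
        # Short documents yield a single chunk (or none if empty).
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def h(chunk):
    """Map a chunk to an integer hashcode.

    Assumption: MD5 truncated to 32 bits stands in for the hash function h.
    """
    digest = hashlib.md5(chunk.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def fingerprint(text, n=3):
    """Build the fingerprint F_d as the set of hashcodes of all chunks of d."""
    return {h(c) for c in chunks(text, n)}


def share_hashcode(fp_a, fp_b):
    """Return True if two fingerprints have at least one hashcode in common.

    Assumption: a shared hashcode is used here as a simple candidate test.
    """
    return not fp_a.isdisjoint(fp_b)


if __name__ == "__main__":
    d1 = "the quick brown fox jumps over the lazy dog"
    d2 = "the quick brown fox leaps over the lazy dog"
    print(sorted(fingerprint(d1)))
    print(share_hashcode(fingerprint(d1), fingerprint(d2)))
```

Because the fingerprint is a set, identical chunks collapse to one hashcode, and two documents that share a chunk necessarily share the corresponding hashcode.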
[Figure: hashcode sets {213235632, 157234594} and {213235632}, with the label T.]
[Figure: Percentage of Similarities (axis ticks 0.0001, 0.001, 0.01, 0.1) plotted over Similarity Intervals (axis ticks 0.4, 0.6, 0.8) for the Wikipedia and Reuters corpora.]