
Near-Duplicate News Articles / Near-Duplicate Detection - PDF document



  1. Duplicate Detection and Minhashing, Similarity Computing
UCSB 290N, 2013, Tao Yang. Some of the slides are from the textbook [CMS] and from Rajaraman/Ullman.
Table of Contents
• Motivation
• Shingling for duplicate comparison
• Minhashing
• LSH

Applications of Duplicate Detection and Similarity Computing
• Duplicate and near-duplicate documents occur in many situations
   – Copies, versions, plagiarism, spam, mirror sites
   – Over 30% of the web pages in a large crawl are exact or near duplicates of pages in the other 70%
• Duplicates consume significant resources during crawling, indexing, and search
   – Little value to most users
• Similar query suggestions
• Advertisement: coalition and spam detection

Duplicate Detection
• Exact duplicate detection is relatively easy
   – Content fingerprints: MD5, cyclic redundancy check (CRC)
• Checksum techniques
   – A checksum is a value that is computed based on the content of the document, e.g., the sum of the bytes in the document file
   – It is possible for files with different text to have the same checksum

  2. Near-Duplicate News Articles (figure)

Near-Duplicate Detection
• A more challenging task
   – Are web pages with the same text content but different advertising or format near-duplicates?
• Near-duplication: approximate match
   – Compute syntactic similarity with an edit-distance measure
   – Use a similarity threshold to detect near-duplicates
   – E.g., similarity > 80% => documents are "near duplicates"
   – Not transitive, though sometimes used transitively
• See also: SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Near-Duplicate Detection: Search vs. Discovery
• Search: find near-duplicates of a document D
   – O(N) comparisons required
• Discovery: find all pairs of near-duplicate documents in the collection (all-pair document comparison)
   – O(N^2) comparisons
• IR techniques are effective for the search scenario
• For discovery, other techniques are used to generate compact representations

Two Techniques for Computing Similarity
1. Shingling: convert documents, emails, etc., to fingerprint sets (the set of strings of length k that appear in the document).
2. Minhashing: convert large sets to short signatures (short integer vectors that represent the sets and reflect their similarity), while preserving similarity.

  3. Computing Similarity with Shingles
• Shingles (word k-grams) [Brin95, Brod98]
   – "a rose is a rose is a rose" => a_rose_is_a, rose_is_a_rose, is_a_rose_is
• Similarity measure between two docs (= sets of shingles)
   – Size_of_Intersection / Size_of_Union (the Jaccard measure); see the code sketch below

Example: Jaccard Similarity
• The Jaccard similarity of two sets is the size of their intersection divided by the size of their union.
   – Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
   – Example: 3 shingles in the intersection, 8 in the union => Jaccard similarity = 3/8

Fingerprint Generation Process for Web Documents (figure)
Fingerprint Example for Web Documents (figure)
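
As a concrete illustration of the shingling and Jaccard definitions above, here is a minimal Python sketch. The function names, the choice of word-level 4-grams, and the lowercasing are assumptions for illustration, not part of the slides.

```python
# Minimal sketch of word k-gram shingling and Jaccard similarity.
# Names (shingles, jaccard) and parameters (k=4) are illustrative choices.

def shingles(text, k=4):
    """Return the set of word k-grams ("shingles") in a document."""
    words = text.lower().split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(s1, s2):
    """Jaccard similarity: |intersection| / |union| of two shingle sets."""
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)

doc = "a rose is a rose is a rose"
print(shingles(doc))   # {'a_rose_is_a', 'rose_is_a_rose', 'is_a_rose_is'}
print(jaccard(shingles(doc), shingles("a rose is a rose")))   # 2/3
```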

  4. Approximated Representation with Sketching
• Computing the exact set intersection of shingles between all pairs of documents is expensive
   – Approximate using a subset of shingles (called sketch vectors)
   – Create a sketch vector using minhashing (see the code sketch below)
• For doc d, sketch_d[i] is computed as follows:
   – Let f map all shingles in the universe to 0..2^m
   – Let p_i be a specific random permutation on 0..2^m
   – Pick MIN p_i(f(s)) over all shingles s in document d
• Documents that share more than t (say 80%) of their sketch vector elements are considered similar

Computing Sketch[i] for Doc1 (figure)
• Start with 64-bit shingles on the number line 0..2^64, permute with p_i, and pick the minimum value as Sketch[i]
• Then test whether Doc1.Sketch[i] = Doc2.Sketch[i]

Shingling with Minhashing
• Given two documents d1, d2, let S1 and S2 be their shingle sets
• Resemblance = |S1 ∩ S2| / |S1 ∪ S2|
• Let Alpha = min(p(S1)) and Beta = min(p(S2)) for a random permutation p
   – Probability(Alpha = Beta) = Resemblance
   – Estimate this by sampling, e.g., testing equality for 200 random permutations p1, p2, ..., p200
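
A rough sketch of the sketch-vector computation described above. The slides use random permutations p_i over 0..2^64; here each p_i is approximated by an affine map x -> (a*x + b) mod 2^64 with odd a (a bijection on that range), and MD5 stands in for the fingerprint map f. These substitutions, and all names, are assumptions for illustration.

```python
import hashlib
import random

M = 2 ** 64

def f(shingle):
    """Map a shingle to a 64-bit integer fingerprint."""
    return int.from_bytes(hashlib.md5(shingle.encode()).digest()[:8], "big")

def make_permutations(n, seed=0):
    """n pseudo-random 'permutations', each given by coefficients (a, b)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, M, 2), rng.randrange(M)) for _ in range(n)]

def sketch(shingle_set, perms):
    """sketch[i] = MIN over shingles s of p_i(f(s))."""
    return [min((a * f(s) + b) % M for s in shingle_set) for (a, b) in perms]

def resemblance_estimate(sk1, sk2):
    """Fraction of positions where the sketches agree ~ Jaccard resemblance."""
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)

perms = make_permutations(200)        # e.g., 200 samples, as on the slide
s1 = {"a_rose_is_a", "rose_is_a_rose", "is_a_rose_is"}
s2 = {"a_rose_is_a", "rose_is_a_rose"}
print(resemblance_estimate(sketch(s1, perms), sketch(s2, perms)))   # ~2/3
```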

  5. Proof with Boolean Matrices
• Rows = elements of the universal set; columns = sets
• 1 in row e and column S if and only if e is a member of S
• Column similarity is the Jaccard similarity of the sets of rows in which the columns have a 1
• The typical matrix is sparse
• Example: two columns with a 1 in both for 2 rows, and a 1 in at least one for 5 rows, have Sim_J(C1, C2) = 2/5 = 0.4

Key Observation
• For columns Ci, Cj, there are four types of rows:
   – Type A: Ci = 1, Cj = 1
   – Type B: Ci = 1, Cj = 0
   – Type C: Ci = 0, Cj = 1
   – Type D: Ci = 0, Cj = 0
• Overload notation: A = # of rows of type A, and likewise for B, C, D
• Claim: Sim_J(Ci, Cj) = A / (A + B + C)

Minhashing Property
• Imagine the rows permuted randomly
• "Hash" function h(C) = the index of the first row (in the permuted order) with a 1 in column C
• The probability (over all permutations of the rows) that h(Ci) = h(Cj) is the same as Sim_J(Ci, Cj): P[h(Ci) = h(Cj)] = Sim_J(Ci, Cj)
   – Why? Both equal A / (A + B + C). Look down the permuted columns Ci and Cj until we see a 1: if it is a type-A row, then h(Ci) = h(Cj); if it is a type-B or type-C row, then not (see the code sketch below)
• Use several (e.g., 100) independent hash functions to create a signature
• The similarity of signatures is the fraction of the hash functions in which they agree
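
The claim P[h(Ci) = h(Cj)] = Sim_J(Ci, Cj) can be checked empirically on a tiny Boolean matrix. The two columns below are made up for illustration, chosen so that their Jaccard similarity is 2/5 = 0.4, the value shown on the slide.

```python
import random

rows = ["e1", "e2", "e3", "e4", "e5", "e6"]
C1 = {"e2", "e3", "e5"}          # rows where column C1 has a 1
C2 = {"e1", "e3", "e5", "e6"}    # rows where column C2 has a 1

def minhash(column, order):
    """h(C) = index of the first row (in the permuted order) with a 1 in column C."""
    for i, row in enumerate(order):
        if row in column:
            return i

trials = 100_000
agree = 0
for _ in range(trials):
    order = random.sample(rows, len(rows))   # a random permutation of the rows
    agree += minhash(C1, order) == minhash(C2, order)

print(agree / trials)                        # ~ A / (A + B + C) = 2/5 = 0.4
print(len(C1 & C2) / len(C1 | C2))           # exact Jaccard similarity: 0.4
```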

  6. All-Pair Comparison Is Expensive
• We want to compare objects, finding those pairs that are sufficiently similar
• Comparing the signatures of all pairs of objects is quadratic in the number of objects
   – Example: 10^6 objects implies about 5 * 10^11 comparisons
   – At 1 microsecond per comparison: about 6 days (see the check below)

Locality-Sensitive Hashing
• General idea: use a function f(x, y) that tells whether or not x and y are a candidate pair: a pair of elements whose similarity must be evaluated
• Map a document to many buckets of signatures
• Make elements of the same bucket candidate pairs

The Big Picture (figure)
• Document => the set of strings of length k that appear in the document => signatures: short integer vectors that represent the sets and reflect their similarity => locality-sensitive hashing => candidate pairs: those pairs of signatures that we need to test for similarity
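
A quick check of the arithmetic above (illustrative only):

```python
n = 10 ** 6
pairs = n * (n - 1) // 2          # ~5 * 10^11 pairwise comparisons
seconds = pairs * 1e-6            # at 1 microsecond per comparison
print(pairs, seconds / 86400)     # 499999500000 comparisons, ~5.8 days
```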

  7. Another View of LSH: Signature Agreement with Bands
• Produce one short signature per document and split it into b bands of r rows each
• For each band, check agreement: documents whose signatures agree in a band are mapped into the same bucket

Signature Generation and Bucket Comparison (matrix M)
• Create b bands for each document
   – If the signatures of docs X and Y agree in some band => a candidate pair
   – Use r minhash values (r rows) for each band (see the code sketch below)
• Tune b and r to catch most similar pairs, but few non-similar pairs
• Bucket example (figure): docs 2 and 6 are probably identical; docs 6 and 7 are surely different
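
A minimal sketch of the banding step described above. The function name and the use of the tuple of band values as the bucket key are assumptions for illustration; a real implementation would typically hash each band to a bucket id.

```python
from collections import defaultdict

def lsh_candidate_pairs(signatures, b, r):
    """signatures: dict doc_id -> list of b*r minhash values."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])   # the r values of this band
            buckets[key].append(doc_id)                 # same band values -> same bucket
        for docs in buckets.values():                   # everything sharing a bucket
            for i in range(len(docs)):                  # becomes a candidate pair
                for j in range(i + 1, len(docs)):
                    candidates.add((docs[i], docs[j]))
    return candidates

sigs = {"d1": [3, 7, 2, 9], "d2": [3, 7, 5, 9], "d3": [8, 1, 4, 6]}
print(lsh_candidate_pairs(sigs, b=2, r=2))   # {('d1', 'd2')}: they agree in band 0
```

Candidate pairs are then checked against their full signatures (or shingle sets) to filter out false positives.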

  8. Analysis of LSH
• Probability the minhash signatures of C1, C2 agree in one row: s
• Probability C1, C2 are identical in one particular band: s^r
• Probability C1, C2 do not agree in at least one row of a band: 1 - s^r
• Probability C1, C2 do not agree in any band: (1 - s^r)^b
   – This is the false negative probability
• Probability C1, C2 agree in at least one band: 1 - (1 - s^r)^b
   – This is the probability that we find such a pair

Example
• Suppose C1, C2 are 80% similar, and choose 20 bands of 5 integers per band (the threshold for two similar documents)
• Probability C1, C2 are identical in one particular band: (0.8)^5 = 0.328
• Probability C1, C2 are not similar in any of the 20 bands: (1 - 0.328)^20 = .00035
   – i.e., about 1/3000th of the 80%-similar column pairs are false negatives (see the code sketch below)

Analysis of LSH, What We Want (figure)
• Probability of sharing a bucket = 1 if similarity s > threshold t; no chance if s < t

What One Band Gives You (figure)
• Remember: the probability of equal hash values equals the similarity s
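
The example's numbers can be reproduced directly from the formulas above (illustrative sketch):

```python
# Two 80%-similar columns with b = 20 bands of r = 5 rows each.
s, r, b = 0.8, 5, 20
p_band = s ** r               # 0.8^5 ~ 0.328: identical in one particular band
p_miss = (1 - p_band) ** b    # ~3.6e-4: caught by no band (a false negative)
p_found = 1 - p_miss          # ~0.9996: the pair becomes a candidate
print(p_band, p_miss, p_found)
```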

  9. Example: b = 20, r = 5
• Probability 1 - (1 - s^r)^b that a similar pair shares at least one bucket (see the code sketch below):
   s      1 - (1 - s^r)^b
   .2     .006
   .3     .047
   .4     .186
   .5     .470
   .6     .802
   .7     .975
   .8     .9996

What b Bands of r Rows Gives You (figure)
• The probability of sharing a bucket, 1 - (1 - s^r)^b, is an S-curve in the similarity s that rises steeply near the threshold t ~ (1/b)^(1/r)
• A pair is caught when all rows of some band are equal, and missed when some row of every band is unequal

LSH Summary
• Get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures
   – Check that candidate pairs really do have similar signatures
• LSH involves a tradeoff
   – Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives
   – Example: if we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up
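
A short sketch reproducing the table above and the approximate threshold t ~ (1/b)^(1/r) (illustrative only):

```python
b, r = 20, 5
for s in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    p = 1 - (1 - s ** r) ** b                 # probability of sharing at least one bucket
    print(f"s = {s:.1f}   P = {p:.4f}")
print("threshold t ~", (1 / b) ** (1 / r))    # ~0.55: where the S-curve rises steeply
```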
