Finding Similar Items



SLIDE 1

How similar are these?

SLIDE 2

What’s the Problem?

  • Finding similar items with respect to some distance metric
  • Two variants of the problem:
– Offline: extract all similar pairs of objects from a large collection
– Online: is this object similar to something I’ve seen before?

SLIDE 3

Application: Plagiarism Detection

SLIDE 4

Application: Content‐based Search

SLIDE 5

Other Applications

  • Near‐duplicate detection of webpages

– Mirror pages
– Similar news articles

  • Recommender systems

– Find users with similar tastes in movies, etc.
– Find products with similar customer sets

  • Sequence/tree alignment

– Find similar DNA or protein sequences

SLIDE 6

Finding similar items ‐ Three Components

  • How to quantify similarity?
– Distance measures
  • Euclidean distance: based on locations of points in space, e.g., the L_r norm
  • Non-Euclidean distance: based on properties of points, e.g., Jaccard, cosine, edit distance
  • Compute representation
– Shingling, tf.idf, etc.
  • Space- and time-efficient algorithms
– Transformation: Minhash
– All-pair comparison: Locality sensitive hashing

SLIDE 7

(Axioms of) Distance Metrics

d is a distance measure between points x and y if it satisfies:

  • 1. Non-negativity: d(x, y) ≥ 0
  • 2. Identity: d(x, y) = 0 if and only if x = y
  • 3. Symmetry: d(x, y) = d(y, x)
  • 4. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)

SLIDE 8

Problem: Finding Similar Documents

  • Given N text documents, find pairs that are “near duplicates”
– Find similarity between a pair
– For large N, it can be very compute intensive
  • Can we avoid all-pair comparisons?

[Figure: N×N matrix over docs 1 to N; entry (i, j) gives the degree of similarity between doc i and doc j]
SLIDE 9

Comparing two documents …

  • Naïve methods
– Feature: treat each document as a set/bag of words
– Distance: Jaccard distance (or cosine distance or Hamming distance)

SLIDE 10

Distance: Jaccard

  • Given two sets A, B
  • Jaccard similarity: Sim(A, B) = |A ∩ B| / |A ∪ B|
  • Jaccard distance: d(A, B) = 1 - Sim(A, B)
  • E.g., A = {I, like, CS5344}; B = {CS5344, is, not, for, me}: |A ∩ B| = 1 and |A ∪ B| = 7, so d(A, B) = 1 - 1/7 = 6/7
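A minimal Python sketch of these two measures (function names are mine, not from the slides); it reproduces the example above:

```python
def jaccard_similarity(a, b):
    """Sim(A, B) = |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """d(A, B) = 1 - Sim(A, B)."""
    return 1 - jaccard_similarity(a, b)

A = {"I", "like", "CS5344"}
B = {"CS5344", "is", "not", "for", "me"}
print(jaccard_distance(A, B))  # 0.8571... = 6/7
```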

SLIDE 11

Comparing two documents …

  • Naïve methods
– Feature: treat each document as a set/bag of words
– Distance: Jaccard distance or cosine distance or Hamming distance
– Textual rather than semantic
  • Documents with many common words are more similar, even if the text appears in a different order
– What good is this?
  • Fast filtering before the slower refinement

SLIDE 12

Shingling: Account for ordering of words …

  • Instead of treating each word independently, we can consider a sequence of k words
– More effective in terms of accuracy
  • A k-shingle (or k-gram) for a document D is a sequence of k tokens that appear in D
– Tokens can be characters or words or some feature/object depending on the application
  • E.g., k = 3 characters; D = “This is a test” gives rise to the following set of 3-shingles: S(D) = {Thi, his, is_, s_i, _is, s_a, _a_, a_t, _te, tes, est} (the shingle is_ occurs twice but appears once in the set)
  • E.g., k = 3 words; we have S(D) = {{This is a}, {is a test}}


NOTE: Assume “characters” in our discussion
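A small sketch of both shingling variants (helper names are mine; spaces are written as '_' as in the example above):

```python
def char_shingles(doc, k):
    """The set of k-character shingles of doc, with spaces shown as '_'."""
    s = doc.replace(" ", "_")
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def word_shingles(doc, k):
    """The set of k-word shingles of doc, each one a tuple of k tokens."""
    words = doc.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

print(sorted(char_shingles("This is a test", 3)))
# 11 distinct 3-shingles ('is_' occurs twice but is kept once in the set)
print(word_shingles("This is a test", 3))
# {('This', 'is', 'a'), ('is', 'a', 'test')}
```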

SLIDE 13

Shingles and Similarity

  • Documents that are similar should have many shingles in common
  • What if two documents differ by a word?
– Affects only the k-shingles within distance k from the word
  • What if we reorder paragraphs?
– Affects only the 2k shingles that cross paragraph boundaries
  • Example: k = 3
– The dog which chased the cat vs. The dog that chased the cat
– The only 3-shingles replaced are g_w, _wh, whi, hic, ich, ch_, and h_c

SLIDE 14

Shingles

  • We can represent document D as a set of its k-shingles
– Distance metric: Jaccard distance
  • What’s the effect of the value of k (in terms of characters)?

– Recommended values of k

  • 5 for small documents
  • 10 for large documents

SLIDE 15

Shingles

  • How about space overhead?
– Each character can be represented as a byte (integer)
– A k-shingle requires k bytes (integers)
  • Can compress by hashing a k-shingle to, say, 4 bytes
– D is now a set of 4-byte hash values of its k-shingles
– False positives may occur in matching
  • What’s the advantage?
– Tradeoff between ability to differentiate vs. space
  • It is better to hash 10-shingles to, say, 4 bytes than to use 4-shingles!
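One way to realize the compression in Python; CRC32 is my choice here purely for illustration, since it yields a 32-bit (4-byte) value:

```python
import zlib

def hashed_shingles(shingles):
    """Map each k-shingle to a 4-byte (32-bit) hash value.
    Distinct shingles may collide, so matches can be false positives."""
    return {zlib.crc32(s.encode("utf-8")) for s in shingles}
```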

SLIDE 16

So far …

  • Represent a document as a set of k-shingles or its hash values
  • Use Jaccard distance to compare two documents
  • Can we do better?
– Parallelism vs. sampling
  • This scheme works but …
– What if the set of hash values (or k-shingles) is too large to fit in memory?
– Or the number of documents is too large?


Idea: Find a way to hash a document to a single (small-size) value, and similar documents to the same value!

SLIDE 17

Minhash

  • Seminal algorithm for near-duplicate detection of webpages
– Used by AltaVista
– Documents (HTML pages) represented by shingles (n-grams)
– Jaccard similarity: dups are pairs with high similarity

SLIDE 18

MinHash – Key Idea

  • Hash the set of document shingles (big in terms of space requirement) into a signature (relatively small size)
  • Instead of comparing shingles, we compare signatures
– ESSENTIAL: similarities of signatures and similarities of shingles MUST BE related!!
– Not every hashing function is applicable!
– Need one that satisfies the following:
  • if Sim(D1, D2) is high, then with high prob. h(D1) = h(D2)
  • if Sim(D1, D2) is low, then with high prob. h(D1) ≠ h(D2)
– It is possible to have false positives, and false negatives!
  • Minhashing turns out to be one such function for Jaccard similarity

SLIDE 19

Preliminaries: Representation & Jaccard Measure

  • Sets:
– A = {e1, e3, e7}
– B = {e3, e5, e7}
  • Can be equivalently expressed as matrices:


Let:
M11 = # rows where both elements are 1
M01 = # rows where A = 0, B = 1
M10 = # rows where A = 1, B = 0
M00 = # rows where both elements are 0

SLIDE 20

Computing Minhash

  • Start with the matrix representation of the set
  • Randomly permute the rows of the matrix
  • The minhash (which is the signature) is the first row (in the permuted order) with a “1”
  • Example:


[Figure: input matrix (rows 1-7, columns A and B) next to the permuted matrix; reading down the permuted order, the first row with a 1 in each column gives h(A) = 4 and h(B) = 3]
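A sketch of this permutation-based definition, applied to the sets A = {e1, e3, e7} and B = {e3, e5, e7} from the earlier slide (function name and seed are mine; rows are 0-indexed, so value r stands for element e(r+1)):

```python
import random

def minhash_by_permutation(matrix, rng):
    """matrix[r][c] = 1 iff element r belongs to set c. Permute the rows
    and return, per column, the first row (in permuted order) with a 1."""
    order = list(range(len(matrix)))
    rng.shuffle(order)                      # random row permutation
    n_cols = len(matrix[0])
    sig = [None] * n_cols
    for r in order:                         # scan rows in permuted order
        for c in range(n_cols):
            if sig[c] is None and matrix[r][c] == 1:
                sig[c] = r                  # first 1 seen for this column
        if None not in sig:
            break
    return sig

# Rows e1..e7; columns A = {e1, e3, e7} and B = {e3, e5, e7}
M = [[1, 0], [0, 0], [1, 1], [0, 0], [0, 1], [0, 0], [1, 1]]
print(minhash_by_permutation(M, random.Random(0)))
```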

SLIDE 21

Minhash and Jaccard

Classify each row as type M11, M10, M01, or M00. Rows of type M00 contain no 1 in either column, so the minhash of both columns is decided by the first row, in permuted order, that is not of type M00. That row is of type M11 with probability M11 / (M11 + M10 + M01), so Pr[h(A) = h(B)] is exactly the Jaccard similarity of A and B.
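A quick Monte Carlo check of this claim, reusing minhash_by_permutation from the sketch above. For A = {e1, e3, e7} and B = {e3, e5, e7} we have M11 = 2 and M10 = M01 = 1, so Sim(A, B) = 2/4 = 0.5:

```python
import random

M = [[1, 0], [0, 0], [1, 1], [0, 0], [0, 1], [0, 0], [1, 1]]
rng = random.Random(42)
trials = 100_000
agree = 0
for _ in range(trials):
    sig = minhash_by_permutation(M, rng)  # one shared permutation per trial
    agree += sig[0] == sig[1]             # do the two minhash values match?
print(agree / trials)                     # ~0.5 = Sim(A, B)
```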

SLIDE 22

MinHash – False positive/negative

  • Instead of comparing sets, we now compare only 1 hash value!
  • False positives?
– False positives can easily be dealt with by doing an additional layer of checking (treat minhash as a filtering mechanism)
  • False negatives?
  • High error rate! Can we do better?

SLIDE 23

Using multiple minhash signatures


Comparison between two sets (original matrix) becomes comparison between two columns of minhash values (signature matrix)

[Figure: input matrix → permutations → minhash signatures → similarities]

The similarity between signatures of two columns is given by the fraction of hash functions in which they agree

SLIDE 24

Implementation of MinHash Computation

  • Permutations are expensive
– Incur space and random-access costs (if data cannot fit into memory)
  • Interpret the hash value as the permutation
  • Only need to keep track of the minimum hash values

SLIDE 25

Implementation of minhash (By example)


h(x) = x mod 5 + 1;  g(x) = (2x + 1) mod 5 + 1

Row x   h(x)   g(x)
1       2      4
2       3      1
3       4      3
4       5      5
5       1      2

SLIDE 26

Implementation of minhash (By example)


Initialization: set signatures to ∞. Apply all hash functions on each row:

  • If the column value (of the source matrix) is 1, keep the minimum value
  • Otherwise, do nothing

h(x) = x mod 5 + 1;  g(x) = (2x + 1) mod 5 + 1

Row             Sig1 (h, g)   Sig2 (h, g)
init            (∞, ∞)        (∞, ∞)
1: h=2, g=4     (2, 4)        (∞, ∞)
2: h=3, g=1     (2, 4)        (3, 1)
3: h=4, g=3     (2, 3)        (3, 1)
4: h=5, g=5     (2, 3)        (3, 1)
5: h=1, g=2     (2, 3)        (1, 1)
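The same computation in code. The slides do not show the source matrix explicitly; I assume column 1 has 1s in rows {1, 3} and column 2 in rows {2, 5}, which is consistent with the trace above:

```python
INF = float("inf")

def h(x):
    return x % 5 + 1

def g(x):
    return (2 * x + 1) % 5 + 1

columns = [{1, 3}, {2, 5}]         # assumed rows holding a 1, per column
hash_funcs = [h, g]

# sig[i][c]: minimum of hash function i over column c, initialized to infinity
sig = [[INF] * len(columns) for _ in hash_funcs]
for row in range(1, 6):            # a single pass over the rows
    for i, f in enumerate(hash_funcs):
        for c, ones in enumerate(columns):
            if row in ones:        # column value is 1: keep the minimum
                sig[i][c] = min(sig[i][c], f(row))

print(sig)  # [[2, 1], [3, 1]], i.e., Sig1 = (2, 3) and Sig2 = (1, 1)
```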

SLIDE 27

So far …

  • Represent a document as a set of hash values (of its k-shingles)
  • Transform the set of k-shingles into a set of minhash signatures
  • Use Jaccard to compare two documents by comparing their signatures
  • Is this method (i.e., transforming sets to signatures) necessarily “better”??

SLIDE 28

The BIG Picture (All‐pair comparison)

[Figure: the big picture. A document such as “This course is about big data analytics” is shingled into a set of strings of length k; minhashing turns that set into a signature (which can capture similarity); locality sensitive hashing maps signatures into buckets, and signatures falling into the same bucket are “similar”. Another document goes through the same shingling pipeline.]

SLIDE 29

Find all near‐duplicates among N documents

  • Naïve solution
– For each document, compare with the other N-1 documents
  • Takes N-1 comparisons per document
  • Can optimize using filter-and-refine mechanisms
– Requires N(N-1)/2 comparisons in total
– For large N, still takes ages …
  • E.g., N = 10^7: we have ~10^14 comparisons; if each comparison takes 1 μs, we need 10^8 sec (~3 years!)

SLIDE 30

Locality Sensitive Hashing (LSH)

  • Suppose we have N documents
  • For each document, we can derive, say, k minhash signatures

minhash   D1  D2  D3  D4  …  DN
1          3   3   2   2  …   3
2          7   7   5   5  …   7
…
k-1        2   2   2   2  …   2
k          1   1   1   1  …   2

SLIDE 31

Idea of hashing

[Figure: the same signature matrix; a hash function F is applied to each column, mapping documents into buckets]

SLIDE 32

LSH

  • Documents that fall into the same bucket are likely to be similar
– More false positives (as a result of LSH) are possible
  • Why? Because of collisions (subject to the hash function and number of buckets)
  • How to deal with this? Refinement step
– More false negatives? Of course!
  • Finding all pairs within a bucket is computationally cheaper
– Declare all pairs within a bucket to be “matching”, OR
– Perform pair-wise comparisons for those documents that fall into the same bucket
  • Much smaller than pair-wise over all documents

SLIDE 33

LSH

  • With only 1 hash function on one entire column of the signature, we are likely to have many false negatives
  • Key idea: apply the hash function on the column multiple times, each on a partition of the column
– Similar columns will be hashed to the same bucket (with high probability)
– Candidate pairs are those that hash at least once to the same bucket

SLIDE 34

LSH ‐ Intuition


Two columns of 12 minhash values each, from likely-similar documents:
C1 = 3 6 9 4 2 8 4 8 1 1 4 2
C2 = 3 5 9 4 2 8 4 7 1 1 4 2

Hash on all 12 minhash values: not similar. Partition the 12 minhash values into two sets of 6: still not similar. Partition them into four sets of 3: potentially similar (≥ 50% similar), since the sets 4 2 8 and 1 4 2 match.

SLIDE 35

n Sets of k Minhash Signatures

  • For each document, compute n sets of k minhash values
  • For each set, concatenate the k minhash values together


  • Within each set:
– Hash on the concatenated values (this has the “same” effect as ensuring all columns have the same values)
– All documents with the same hash value will be bucketed together
– Output all pairs within each bucket
  • Candidate pairs are those that have the same bucket for ≥ 1 set
  • De-dup pairs (a sketch of this procedure follows)
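A sketch of the bucketing procedure just described (identifiers are mine). For simplicity, the tuple of k concatenated minhash values serves directly as the bucket key, which plays the role of hashing the concatenation:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, n, k):
    """signatures: doc id -> list of n*k minhash values.
    Returns the de-duplicated set of candidate pairs."""
    candidates = set()
    for s in range(n):                            # one pass per set
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            key = tuple(sig[s * k:(s + 1) * k])   # k concatenated values
            buckets[key].append(doc)              # same key -> same bucket
        for docs in buckets.values():             # output pairs per bucket
            candidates.update(combinations(sorted(docs), 2))
    return candidates

# The two columns from the intuition slide: n = 4 sets of k = 3 values
sigs = {"C1": [3, 6, 9, 4, 2, 8, 4, 8, 1, 1, 4, 2],
        "C2": [3, 5, 9, 4, 2, 8, 4, 7, 1, 1, 4, 2]}
print(lsh_candidate_pairs(sigs, n=4, k=3))  # {('C1', 'C2')}
```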
slide-36
SLIDE 36

n sets of k minhash signatures


[Figure: buckets for two different sets. In each set, documents 1 and 3 fall into the same bucket (potentially similar), while documents 5 and 7 do not (potentially dissimilar). Because the pair (D1, D3) shares a bucket in both sets: 1. duplicate pairs arise, and 2. D1 and D3 are similar over at least 2k signatures.]

The number of buckets is set to be as large as possible to minimize collisions. Candidate column pairs are those that hash to the same bucket for at least 1 set.

SLIDE 37

Example

  • Given 100,000 columns (documents)
  • Let there be 100 signatures
  • Let n = 20 and k = 5
  • Goal: Find pairs that are at least 80% similar

NOTE: If each signature is represented as a 4-byte integer value, we need only 100 × 4 × 100,000 bytes = 40 MB of memory!

SLIDE 38

Example

  • Suppose C1 and C2 are 80% similar
  • Probability that C1 and C2 are identical in one set = (0.8)^5 = 0.328
  • Probability that C1 and C2 are not identical in any of the 20 sets = (1 - 0.328)^20 = 0.00035
– About 0.00035 of the 80%-similar column pairs are false negatives
– We would find 99.965% of the pairs of truly similar documents

SLIDE 39

Example

  • Suppose C1 and C2 are 30% similar
  • Probability that C1 and C2 are identical in one set = (0.3)^5 = 0.00243
  • Probability that C1 and C2 are identical in at least 1 of the 20 sets ≤ 20 × 0.00243 = 0.0486
– About 4.86% of pairs with similarity 30% end up becoming candidate pairs: false positives

SLIDE 40

Need to tune LSH

  • Choose n and k to balance between false positives and false negatives
  • What if we use only 15 sets instead of 20?
– Lower number of false positives
– Higher number of false negatives

SLIDE 41

Ideally, ….

[Figure: probability of hashing to the same bucket vs. similarity s of two sets. Ideally this is a step function: 0 below a similarity threshold and 1 above it. With n = 1, k = 1 the probability is simply s (a diagonal line), giving false negatives above the threshold and false positives below it.]

SLIDE 42

n sets, k signatures/set

  • C1 and C2 have similarity s
  • For any one set (k rows)
– Prob. that all k rows are equal = s^k
– Prob. that some rows are not the same = 1 - s^k
  • Prob. that no set is identical = (1 - s^k)^n
  • Prob. that at least 1 set is identical = 1 - (1 - s^k)^n
  • Tune n and k to minimize false negatives and false positives
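These formulas in a few lines of Python; the loop approximately reproduces the n = 20, k = 5 table two slides below:

```python
def prob_candidate(s, n, k):
    """Probability that at least 1 of the n sets has all k rows equal:
    1 - (1 - s^k)^n."""
    return 1 - (1 - s ** k) ** n

for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"s = {s:.1f}: {prob_candidate(s, n=20, k=5):.4f}")
# s = 0.2 gives 0.0064, s = 0.5 gives 0.4700, s = 0.8 gives 0.9996
```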

SLIDE 43

The S‐curve for arbitrary n and k (> 1)

[Figure: the S-curve of the probability of hashing to the same bucket vs. similarity s of two sets. The threshold is at t ≈ (1/n)^(1/k); with n = 16 and k = 4, t = 1/2.]

SLIDE 44

Example: n = 20, k = 5

s     1 - (1 - s^k)^n
0.2   0.006
0.3   0.047
0.4   0.186
0.5   0.470
0.6   0.802
0.7   0.975
0.8   0.9996

SLIDE 45

So far, …

  • Tune to minimize false positives and false negatives
  • Need to check that candidate pairs have similar signatures
  • Need further refinement to find really similar documents
  • Jaccard distance is useful also for other applications, e.g., customer/item purchase histories

SLIDE 46

Generalizing LSH: Multiple Hash Functions

  • So far, we have assumed only one hash function
– h(x) = h(y) implies “h says x and y are equal”
  • We could instead use a family of hash functions and pick any of them

SLIDE 47

Locality‐Sensitive (LS) Families

  • Consider a space S of points with a distance measure d


  • A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:
– If d(x, y) ≤ d1, then the prob. over all h in H that h(x) = h(y) is at least p1
– If d(x, y) ≥ d2, then the prob. over all h in H that h(x) = h(y) is at most p2

SLIDE 48

Example


  • Let S = sets, d = Jaccard distance
  • Minhashing gives a (d1, d2, p1, p2)-sensitive family for any d1 < d2
– E.g., H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d
– If distance ≤ 1/3 (i.e., similarity ≥ 2/3), then the prob. that the minhash values agree is ≥ 2/3
  • Recall: Pr(h(x) = h(y)) = 1 - d(x, y)
  • No guarantees about the fraction of false positives
– The theory leaves unknown what happens to pairs that are at a distance between d1 and d2

SLIDE 49

Amplifying a LS‐family: AND Construction of Hash Functions


  • Given family H, construct family H’ consisting of r functions from H
  • For h = [h1,…,hr] in H’, h(x) = h(y) if and only if hi(x) = hi(y) for all i
  • This has the same effect as the k concatenated signatures within one set
– x and y are considered a candidate pair if every one of the r rows says that x and y are equal
  • Theorem: If H is (d1, d2, p1, p2)-sensitive, then H’ is (d1, d2, p1^r, p2^r)-sensitive
– That is, for any p, if p is the probability that a member of H will declare (x, y) to be a candidate pair, then the probability that a member of H’ will so declare is p^r

SLIDE 50

Amplifying a LS‐family: OR Construction of Hash Functions


  • Given family H, construct family H’ consisting of b functions from H
  • For h = [h1,…,hb] in H’, h(x) = h(y) if and only if hi(x) = hi(y) for at least one i
  • Mirrors the effect of combining several sets: x and y become a candidate pair if any set makes them a candidate pair
  • Theorem: If H is (d1, d2, p1, p2)-sensitive, then H’ is (d1, d2, 1-(1-p1)^b, 1-(1-p2)^b)-sensitive
– That is, for any p, if p is the probability that a member of H will declare (x, y) to be a candidate pair, then (1-p) is the probability that it will not so declare
– (1-p)^b is the probability that none of h1,…,hb will declare x and y a candidate pair
– 1 - (1-p)^b is the probability that at least one hi will declare (x, y) a candidate pair, and therefore that H’ will declare (x, y) a candidate pair

SLIDE 51

Effect of AND and OR Constructions

  • AND makes all probabilities shrink, but by choosing r correctly, we can make the lower probability approach 0 while the higher does not
  • OR makes all probabilities grow, but by choosing b correctly, we can make the upper probability approach 1 while the lower does not

AND: (p1, p2) → ((p1)^r, (p2)^r);  OR: (p1, p2) → (1-(1-p1)^b, 1-(1-p2)^b)

SLIDE 52

Composing Constructions: AND‐OR Composition

  • r-way AND construction followed by b-way OR construction
– Exactly what we did with minhashing
  • Take points x and y s.t. Pr[h(x) = h(y)] = p
  • H will make (x, y) a candidate pair with probability p
  • The construction makes (x, y) a candidate pair with probability 1 - (1 - p^r)^b

– The S‐Curve!

SLIDE 53

Example

  • Take H and construct H’ by the AND construction with r = 4. Then, from H’, construct H’’ by the OR construction with b = 4
  • E.g., transform a (0.2, 0.8, 0.8, 0.2)-sensitive family into a (0.2, 0.8, 0.8785, 0.0064)-sensitive family
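A small sketch verifying this example (function names are mine):

```python
def and_construction(p1, p2, r):
    """AND: all r functions must agree, so each probability becomes p^r."""
    return p1 ** r, p2 ** r

def or_construction(p1, p2, b):
    """OR: at least one of b functions agrees; p becomes 1 - (1-p)^b."""
    return 1 - (1 - p1) ** b, 1 - (1 - p2) ** b

# AND with r = 4, then OR with b = 4, on a (0.2, 0.8, 0.8, 0.2)-family
p1, p2 = and_construction(0.8, 0.2, r=4)  # (0.4096, 0.0016)
p1, p2 = or_construction(p1, p2, b=4)
print(round(p1, 4), round(p2, 4))         # 0.8785 0.0064
```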

SLIDE 55

Composing Constructions: OR‐AND Composition

  • b-way OR construction followed by r-way AND construction
  • Transforms probability p into (1 - (1-p)^b)^r

SLIDE 56

Example

  • Take H and construct H’ by the OR construction with b = 4. Then, from H’, construct H’’ by the AND construction with r = 4
  • E.g., transform a (0.2, 0.8, 0.8, 0.2)-sensitive family into a (0.2, 0.8, 0.9936, 0.1215)-sensitive family

SLIDE 57

p     p^8 (all 8 must match)     1 - (1-p)^8 (at least one match)
.2    0.000                      0.832
.3    0.000                      0.942
.4    0.000                      0.983
.5    0.003                      0.996
.6    0.016                      0.999
.7    0.057                      0.999
.8    0.167                      0.999
.9    0.387                      0.999

AND-OR: all 4 in each group must match, and it is enough to have one such group. OR-AND: at least one of the 4 in each group must match, and all 4 groups must match.

SLIDE 58

Cascading Constructions

  • Apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction
  • Transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9999996, .0008715)-sensitive family

  • Note this family uses 256 (= 4 × 4 × 4 × 4) of the original hash functions
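A sketch checking the cascade numerically (function names are mine):

```python
def and_or(p, r, b):
    """AND-OR composition: p -> 1 - (1 - p^r)^b."""
    return 1 - (1 - p ** r) ** b

def or_and(p, b, r):
    """OR-AND composition: p -> (1 - (1 - p)^b)^r."""
    return (1 - (1 - p) ** b) ** r

# (4,4) OR-AND followed by (4,4) AND-OR on the (.2, .8, .8, .2) family
for p in (0.8, 0.2):
    print(and_or(or_and(p, b=4, r=4), r=4, b=4))
# ~0.9999996 for p1 = 0.8 and ~0.0008715 for p2 = 0.2
```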

SLIDE 59

So far …

  • Pick any two distances x < y
  • Start with a (x, y, (1-x), (1-y))-sensitive family
  • Apply constructions to produce a (x, y, p, q)-sensitive family, where p is almost 1 and q is almost 0
  • The closer to 0 and 1 we get, the more hash functions must be used
  • What about other distances? Euclidean distance? Cosine distance?

SLIDE 60

Conclusion

  • Many applications require finding similar items
  • 3 key components
– Feature representation, distance measure, efficient algorithms
  • Focus on finding similar documents at various levels
– Shingles, minhashing, and LSH
  • LSH for other distance measures (cosine and Euclidean distance)
