Locality-Sensitive Hashing

Locality-Sensitive Hashing
Anil Maheshwari
anil@scs.carleton.ca
School of Computer Science, Carleton University, Canada

Outline
1. Introduction
2. Similarity of Documents
3. LSH
4. Metric Spaces
5. Sensitive Function Family
6. AND-OR Family
7. Fingerprints
8. References

Objectives
How to efficiently find:
1. Similar documents among a collection of documents
2. Similar web-pages among web-pages
3. Similar fingerprints among a database of fingerprints
4. Similar sets among a collection of sets
5. Similar images from a database of images
6. Similar vectors in higher dimensions

Similarity of Documents
Problem Definition
Input: A collection of web-pages.
Output: Report near-duplicate web-pages.
k-shingles
Any sequence of k consecutive words that appears in the document.
Text document = "What is the likely date that the regular classes may resume in Ontario"
2-shingles: What is, is the, the likely, ..., in Ontario
3-shingles: What is the, is the likely, ..., resume in Ontario
In practice: 9-shingles for English text and 5-shingles for e-mails.
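The shingling step can be sketched in a few lines of Python. This is a minimal illustration, assuming that words are separated by whitespace and that a shingle is the space-joined run of k words; the function name shingles is hypothetical and not part of the slides.

```python
def shingles(text, k):
    """Return the set of k-shingles (runs of k consecutive words) of text."""
    words = text.split()
    # A document with n words has n - k + 1 windows of k consecutive words.
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


TEXT = ("What is the likely date that the regular classes "
        "may resume in Ontario")

two_shingles = shingles(TEXT, 2)
three_shingles = shingles(TEXT, 3)
print(sorted(two_shingles))
```

Because the result is a set, a shingle that occurs several times in the document is stored only once.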

Similarity between sets
Text Document D → Set S
1. Form all the k-shingles of D.
2. S is the collection of all k-shingles of D.
Jaccard Similarity
For a pair of sets S and T, the Jaccard similarity is defined as $\mathrm{SIM}(S, T) = \frac{|S \cap T|}{|S \cup T|}$.
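As a small illustration of the definition, the following Python sketch computes the Jaccard similarity of two sets. The function name jaccard and the convention of returning 0.0 for two empty sets are assumptions made here, not something the slides specify.

```python
def jaccard(s, t):
    """Jaccard similarity |S ∩ T| / |S ∪ T| of two Python sets."""
    union = s | t
    if not union:
        # Assumed convention: two empty sets get similarity 0.0.
        return 0.0
    return len(s & t) / len(union)


# Hypothetical example sets: two common elements out of four in total.
example = jaccard({"a", "b", "c"}, {"b", "c", "d"})
print(example)
```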

Problem: Find Similar Sets
New Problem
Given a constant 0 ≤ s ≤ 1 and a collection of sets S, find the pairs of sets in S with Jaccard similarity ≥ s.

Characteristic Matrix Representation of Sets
U = {Cruise, Ski, Resorts, Safari, Stay@Home}
S = {S1, S2, S3, S4}, where each Si ⊆ U
e.g. S1 = {Cruise, Safari} and S2 = {Resorts}
Characteristic matrix for S:
            S1  S2  S3  S4
Cruise       1   0   0   1
Ski          0   0   1   0
Resorts      0   1   0   1
Safari       1   0   1   1
Stay@Home    0   0   1   0
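The matrix can be built mechanically from the sets. The sketch below assumes the sets are given as Python sets keyed by name; S3 and S4 are read off the columns of the matrix on the slide, and the function name characteristic_matrix is hypothetical.

```python
UNIVERSE = ["Cruise", "Ski", "Resorts", "Safari", "Stay@Home"]

SETS = {
    "S1": {"Cruise", "Safari"},
    "S2": {"Resorts"},
    "S3": {"Ski", "Safari", "Stay@Home"},   # read off the S3 column
    "S4": {"Cruise", "Resorts", "Safari"},  # read off the S4 column
}


def characteristic_matrix(universe, sets):
    """One row per element of the universe, one column per set (0/1 entries)."""
    names = list(sets)
    return [[1 if element in sets[name] else 0 for name in names]
            for element in universe]


matrix = characteristic_matrix(UNIVERSE, SETS)
for element, row in zip(UNIVERSE, matrix):
    print(f"{element:<10}", *row)
```

For real shingle universes this matrix is extremely sparse, so it is normally kept implicit (as the sets themselves) rather than materialised.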

MinHash Signatures
Original row order:
                S1  S2  S3  S4
0 Cruise         1   0   0   1
1 Ski            0   0   1   0
2 Resorts        0   1   0   1
3 Safari         1   0   1   1
4 Stay@Home      0   0   1   0
Permute rows π: 01234 → 40312 (original row i moves to position π(i)).
Permuted row order:
                S1  S2  S3  S4
0 Ski            0   0   1   0
1 Safari         1   0   1   1
2 Stay@Home      0   0   1   0
3 Resorts        0   1   0   1
4 Cruise         1   0   0   1
The minhash value h(S) is the index of the first row, in permuted order, in which the column of S contains a 1.
MinHash signatures for π: h(S1) = 1, h(S2) = 3, h(S3) = 0, and h(S4) = 1
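The computation on this slide can be reproduced with a short Python sketch. It assumes each set is stored as the set of original row indices in which its column has a 1, and that the permutation is stored as a list PERM with PERM[i] equal to the new position of original row i; both names are illustrative.

```python
# Columns of the characteristic matrix as sets of original row indices
# (0 Cruise, 1 Ski, 2 Resorts, 3 Safari, 4 Stay@Home).
COLUMNS = {"S1": {0, 3}, "S2": {2}, "S3": {1, 3, 4}, "S4": {0, 2, 3}}

# PERM[i] is the position of original row i after permuting: 01234 -> 40312.
PERM = [4, 0, 3, 1, 2]


def minhash(rows, perm):
    """Position of the first row, in permuted order, in which the set has a 1."""
    return min(perm[i] for i in rows)


signature = {name: minhash(rows, PERM) for name, rows in COLUMNS.items()}
print(signature)
```

Taking the minimum over the permuted positions avoids physically reordering the matrix, which matters when the universe is large.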

Key Observation
Lemma
For any two sets $S_i$ and $S_j$ in a collection of sets S where the elements are drawn from the universe U, the probability, over a uniformly random permutation of the rows, that the minhash value $h(S_i)$ equals $h(S_j)$ is equal to the Jaccard similarity of $S_i$ and $S_j$, i.e.,
$\Pr[h(S_i) = h(S_j)] = \mathrm{SIM}(S_i, S_j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}$.
                S1  S2  S3  S4
0 Ski            0   0   1   0
1 Safari         1   0   1   1
2 Stay@Home      0   0   1   0
3 Resorts        0   1   0   1
4 Cruise         1   0   0   1
$\Pr[h(S_1) = h(S_4)] = \mathrm{SIM}(S_1, S_4) = \frac{|S_1 \cap S_4|}{|S_1 \cup S_4|} = \frac{2}{3}$
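For a universe this small the lemma can be checked exhaustively. The sketch below enumerates all 5! = 120 row permutations and counts those in which S1 and S4 receive the same minhash value; it assumes the same row-index encoding of the sets as in the characteristic matrix (0 Cruise, 1 Ski, 2 Resorts, 3 Safari, 4 Stay@Home), and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

S1 = {0, 3}        # Cruise, Safari
S4 = {0, 2, 3}     # Cruise, Resorts, Safari


def minhash(rows, perm):
    """Smallest permuted position among the rows that belong to the set."""
    return min(perm[i] for i in rows)


def collision_probability(a, b, n_rows):
    """Exact Pr[h(a) = h(b)] over all permutations of n_rows rows."""
    perms = list(permutations(range(n_rows)))
    hits = sum(1 for p in perms if minhash(a, p) == minhash(b, p))
    return Fraction(hits, len(perms))


probability = collision_probability(S1, S4, 5)
similarity = Fraction(len(S1 & S4), len(S1 | S4))
print(probability, similarity)
```

The two values agree because the first row of the union in permuted order is equally likely to be any of its elements, and the minhash values coincide exactly when that row lies in the intersection.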

MinHash Signature Matrix
MinHash signature matrix for |S| = 11 sets with 12 hash functions (one row per hash function):
 S1  S2  S3  S4  S5  S6  S7  S8  S9 S10 S11
  2   2   1   0   0   1   3   2   5   0   3
  1   3   2   0   2   2   1   4   2   1   2
  3   0   3   0   4   3   2   0   0   4   2
  0   4   3   1   5   3   3   2   3   5   4
  2   1   1   0   4   1   2   1   4   2   5
  4   2   1   0   5   2   3   2   3   5   4
  2   4   3   0   5   3   3   4   4   5   3
  0   2   4   1   3   4   3   2   2   2   4
  0   2   1   0   5   1   1   1   1   5   1
  0   5   1   0   2   1   3   2   1   5   4
  1   3   1   0   5   2   3   3   6   3   2
  0   5   2   1   5   1   2   2   6   5   4

LSH for MinHash
Partitioning of a signature matrix into b = 4 bands of r = 3 rows each.
Band   S1  S2  S3  S4  S5  S6  S7  S8  S9 S10 S11
I       2   2   1   0   0   1   3   2   5   0   3
I       1   3   2   0   2   2   1   4   2   1   2
I       3   0   3   0   4   3   2   0   0   4   2
II      0   4   3   1   5   3   3   2   3   5   4
II      2   1   1   0   4   1   2   1   4   2   5
II      4   2   1   0   5   2   3   2   3   5   4
III     2   4   3   0   5   3   3   4   4   5   3
III     0   2   4   1   3   4   3   2   2   2   4
III     0   2   1   0   5   1   1   1   1   5   1
IV      0   5   1   0   2   1   3   2   1   5   4
IV      1   3   1   0   5   2   3   3   6   3   2
IV      0   5   2   1   5   1   2   2   6   5   4
Band III: {S3, S6, S11} are hashed into the same bucket, and so are {S8, S9}.
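The banding step can be sketched as follows on the signature matrix from the slide. As a simplifying assumption, each bucket is keyed by the exact tuple of the r signature values in the band, so two sets share a bucket only if they agree in every row of that band; a real implementation would typically hash the tuple into a fixed number of buckets and could see extra collisions. The function name lsh_candidates is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Signature matrix from the slide: 12 rows (hash functions), columns S1..S11.
SIGNATURES = [
    [2, 2, 1, 0, 0, 1, 3, 2, 5, 0, 3],
    [1, 3, 2, 0, 2, 2, 1, 4, 2, 1, 2],
    [3, 0, 3, 0, 4, 3, 2, 0, 0, 4, 2],
    [0, 4, 3, 1, 5, 3, 3, 2, 3, 5, 4],
    [2, 1, 1, 0, 4, 1, 2, 1, 4, 2, 5],
    [4, 2, 1, 0, 5, 2, 3, 2, 3, 5, 4],
    [2, 4, 3, 0, 5, 3, 3, 4, 4, 5, 3],
    [0, 2, 4, 1, 3, 4, 3, 2, 2, 2, 4],
    [0, 2, 1, 0, 5, 1, 1, 1, 1, 5, 1],
    [0, 5, 1, 0, 2, 1, 3, 2, 1, 5, 4],
    [1, 3, 1, 0, 5, 2, 3, 3, 6, 3, 2],
    [0, 5, 2, 1, 5, 1, 2, 2, 6, 5, 4],
]


def lsh_candidates(signatures, b, r):
    """Return (candidate pairs, list of bucket maps, one per band)."""
    assert len(signatures) == b * r, "need exactly b * r signature rows"
    n_sets = len(signatures[0])
    candidates = set()
    buckets_per_band = []
    for band in range(b):
        rows = signatures[band * r:(band + 1) * r]
        buckets = defaultdict(list)
        for col in range(n_sets):
            key = tuple(row[col] for row in rows)  # the band's sub-signature
            buckets[key].append(col + 1)           # 1-based index, as in S1..S11
        buckets_per_band.append(buckets)
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    return candidates, buckets_per_band


candidates, buckets_per_band = lsh_candidates(SIGNATURES, b=4, r=3)
print(sorted(candidates))
```

In band III (index 2 in the list) the bucket for (3, 4, 1) holds S3, S6 and S11 and the bucket for (4, 2, 1) holds S8 and S9, matching the slide.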

Probability of finding similar sets
Lemma
Let s > 0 be the Jaccard similarity of two sets. The probability that the minhash signature matrix agrees in all the rows of at least one of the bands for these two sets is $f(s) = 1 - (1 - s^r)^b$.
The banded signature matrix from the previous slide (b = 4 bands of r = 3 rows each) illustrates the setting.

Proof
Claim: Pr(the signatures of $S_i$ and $S_j$ agree in all rows of at least one band, given that their Jaccard similarity is $s$) $= f(s) = 1 - (1 - s^r)^b$.
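The transcript gives the claim without the argument. A standard derivation, sketched here under the assumption that the n = br minhash functions are chosen independently (so that agreements in different rows are independent events), runs as follows.

```latex
\begin{align*}
\Pr[\text{signatures agree in one fixed row}] &= s
  && \text{(minhash lemma)} \\
\Pr[\text{signatures agree in all } r \text{ rows of a fixed band}] &= s^r \\
\Pr[\text{signatures differ in at least one row of a fixed band}] &= 1 - s^r \\
\Pr[\text{every one of the } b \text{ bands has a differing row}] &= (1 - s^r)^b \\
\Pr[\text{at least one band agrees in all its rows}] &= 1 - (1 - s^r)^b = f(s)
\end{align*}
```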

Understanding f(s)
$f(s) = 1 - (1 - s^r)^b$ for different values of s, b, and r (values truncated to four decimal places):
(b, r)                      (4, 3)   (16, 4)  (20, 5)  (25, 5)  (100, 10)
s = 0.2                     0.0316   0.0252   0.0063   0.0079   0.0000
s = 0.4                     0.2324   0.3396   0.1860   0.2268   0.0104
s = 0.5                     0.4138   0.6439   0.4700   0.5478   0.0930
s = 0.6                     0.6221   0.8914   0.8019   0.8678   0.4547
s = 0.8                     0.9432   0.9997   0.9996   0.9999   0.9999
s = 1.0                     1.0      1.0      1.0      1.0      1.0
Threshold t = (1/b)^(1/r)   0.6299   0.5      0.5492   0.5253   0.6309
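The entries of the table can be recomputed with a few lines of Python. This is a minimal sketch; the function names f and threshold are chosen here for illustration. Note that the format specifier below rounds to four decimals, so the last digit can differ by one from the truncated values in the table.

```python
def f(s, b, r):
    """Probability that two sets with Jaccard similarity s become candidates."""
    return 1.0 - (1.0 - s ** r) ** b


def threshold(b, r):
    """Approximate similarity at which the S-curve rises most steeply."""
    return (1.0 / b) ** (1.0 / r)


CONFIGS = [(4, 3), (16, 4), (20, 5), (25, 5), (100, 10)]

for s in (0.2, 0.4, 0.5, 0.6, 0.8, 1.0):
    print(f"s = {s:.1f}", *(f"{f(s, b, r):.4f}" for b, r in CONFIGS))
print("t      ", *(f"{threshold(b, r):.4f}" for b, r in CONFIGS))
```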

S-curve
[Figure: plot of $f(s) = 1 - (1 - s^r)^b$ against s, with both axes running from 0 to 1, for (r, b) = (3, 4), (4, 16), (5, 20), (5, 25), and (10, 100).]

Comments on S-Curve
1. For what values of s is $f''(s) = 0$? At $s = \left(\frac{r-1}{br-1}\right)^{1/r}$.
2. For values of $br \gg 1$, $s \approx \left(\frac{1}{b}\right)^{1/r}$.
3. The steepest slope occurs at $s \approx (1/b)^{1/r}$.
4. If the Jaccard similarity s of the two sets is above the threshold $t = \left(\frac{1}{b}\right)^{1/r}$, the probability that they will be found potentially similar is very high.
5. Consider the row for s = 0.8 in the table of f(s) values: f(0.8) is close to 1 for every (b, r) pair listed, since 0.8 > t in each case.
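The inflection point quoted in the first comment follows from differentiating f twice. The derivation below is a sketch that assumes r ≥ 2 and 0 < s < 1, so that the common factor in front of the bracket is nonzero.

```latex
\begin{align*}
f(s)   &= 1 - (1 - s^r)^b \\
f'(s)  &= b r \, s^{r-1} (1 - s^r)^{b-1} \\
f''(s) &= b r \, s^{r-2} (1 - s^r)^{b-2}
          \left[ (r-1)(1 - s^r) - (b-1)\, r \, s^r \right] \\
f''(s) = 0
       &\iff (r-1) = s^r \left[ (r-1) + (b-1) r \right] = s^r (br - 1) \\
       &\iff s = \left( \frac{r-1}{br-1} \right)^{1/r}
\end{align*}
```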

Computational Summary
Input: Collection of m text documents of size D
k-shingles: Size = kD
Characteristic matrix of size |U| × m, where U is the universe of all possible k-shingles
Signature matrix of size n × m using n permutations
⌈n/r⌉ bands each consisting of r rows
Hash maps from bands to buckets
Output: All pairs of documents that are in the same bucket corresponding to a band
Check whether the pairs correspond to similar documents!
With the right choice of threshold, Pr(the pair is similar) → 1
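The whole pipeline can be sketched end to end in Python. This sketch departs from the slides in several labelled ways: instead of explicit row permutations it simulates them with random hash functions of the form (a·x + c) mod p, a common substitute that is an assumption here; shingles are mapped to integers with zlib.crc32; buckets are keyed by the exact band tuple; and every document is assumed to contain at least k words. All function names and default parameters are hypothetical.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus


def shingle_ids(text, k):
    """Set of integer ids, one per k-word shingle (assumes >= k words)."""
    words = text.split()
    return {zlib.crc32(" ".join(words[i:i + k]).encode("utf-8"))
            for i in range(len(words) - k + 1)}


def signature(ids, hash_params):
    """One minhash value per hash function (a, c), simulating a permutation."""
    return [min((a * x + c) % PRIME for x in ids) for a, c in hash_params]


def jaccard(s, t):
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def similar_documents(docs, k=2, b=4, r=3, s_min=0.8, seed=0):
    """Return verified pairs (i, j), i < j, with Jaccard similarity >= s_min."""
    rng = random.Random(seed)
    hash_params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                   for _ in range(b * r)]
    sets = [shingle_ids(doc, k) for doc in docs]
    sigs = [signature(ids, hash_params) for ids in sets]

    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(sigs):
            buckets[tuple(sig[band * r:(band + 1) * r])].append(doc_id)
        for members in buckets.values():
            candidates.update(combinations(members, 2))

    # Final check from the summary: keep only pairs that really are similar.
    return {(i, j) for i, j in candidates
            if jaccard(sets[i], sets[j]) >= s_min}


DOCS = [
    "What is the likely date that the regular classes may resume in Ontario",
    "What is the likely date that the regular classes may resume in Ontario",
    "Report near duplicate web-pages among a large collection of web-pages",
]
result = similar_documents(DOCS)
print(result)
```

Identical documents have identical signatures and therefore collide in every band, while the verification step removes any candidate pair whose true Jaccard similarity falls below s_min.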

