 
We know: $\Pr[h_\pi(C_1) = h_\pi(C_2)] = sim(C_1, C_2)$
- Now generalize to multiple hash functions
- The similarity of two signatures is the fraction of the hash functions in which they agree
- Thus, the expected similarity of two signatures equals the Jaccard similarity of the columns or sets that the signatures represent
- The longer the signatures, the smaller the expected error
3/2/2020
Permutation π | Input matrix (Shingles x Documents)
2 4 3 | 1 0 1 0
3 2 4 | 1 0 0 1
7 1 7 | 0 1 0 1
6 3 2 | 0 1 0 1
1 6 6 | 0 1 0 1
5 7 1 | 1 0 1 0
4 5 5 | 1 0 1 0

Signature matrix M
2 3 7 2
2 4 1 4
7 3 7 3

Similarities:
Pair     1-3   2-4   1-2   3-4
Col/Col  0.75  0.75  0     0
Sig/Sig  0.34  0.67  0     0
- Permuting rows even once is prohibitive
- Row hashing!
  - Pick K = 100 hash functions $h_i$
  - Ordering under $h_i$ gives a random permutation of rows!
- One-pass implementation
  - For each column c and hash function $h_i$, keep a "slot" M(i, c) for the min-hash value
  - Initialize all M(i, c) = ∞
  - Scan rows looking for 1s
    - Suppose row j has 1 in column c
    - Then for each $h_i$: if $h_i(j) < M(i, c)$, then $M(i, c) \leftarrow h_i(j)$
- How to pick a random hash function h(x)? Universal hashing:
  $h_{a,b}(x) = ((a \cdot x + b) \bmod p) \bmod N$, where
  - a, b are random integers
  - p is a prime number (p > N)
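The universal hashing formula above can be sketched as follows. This is only an illustration: the row count `N`, the prime `P`, the fixed seed, and the helper name `make_universal_hash` are assumptions made for the example, not values from the slides.

```python
import random

def make_universal_hash(n_buckets, prime, rng):
    """Return h(x) = ((a*x + b) mod prime) mod n_buckets with random a, b."""
    a = rng.randrange(1, prime)  # a != 0 so that x actually influences the hash
    b = rng.randrange(0, prime)
    return lambda x: ((a * x + b) % prime) % n_buckets

rng = random.Random(42)  # fixed seed, only to make the sketch reproducible
N = 1000                 # number of rows (hypothetical)
P = 1009                 # a prime larger than N
hash_funcs = [make_universal_hash(N, P, rng) for _ in range(100)]  # K = 100

# Each function maps a row index to a pseudo-random position in 0..N-1.
print([h(7) for h in hash_funcs[:3]])
```

Unlike a true permutation, such a hash function can map two rows to the same value; the sketch accepts that as the price of not permuting the rows.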
for each row r do begin
  for each hash function h_i do
    compute h_i(r);
  for each column c
    if c has 1 in row r
      for each hash function h_i do
        if h_i(r) < M(i, c) then
          M(i, c) := h_i(r);
end;

Important: you hash r only once per hash function, not once per 1 in row r.
h(x) = x mod 5
g(x) = (2x + 1) mod 5

Row  C1  C2
1    1   0
2    0   1
3    1   1
4    1   0
5    0   1

Signature matrix M after each row:

            M(i, C1)  M(i, C2)
(initial)   ∞         ∞
            ∞         ∞
h(1) = 1    1         ∞
g(1) = 3    3         ∞
h(2) = 2    1         2
g(2) = 0    3         0
h(3) = 3    1         2
g(3) = 2    2         0
h(4) = 4    1         2
g(4) = 4    2         0
h(5) = 0    1         0
g(5) = 1    2         0
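The one-pass procedure and the worked example can be checked with a short sketch. The function name `minhash_signatures` and the list-of-rows input layout are assumptions for illustration; the hash functions h and g and the two columns are the ones from the example.

```python
import math

def minhash_signatures(rows, hash_funcs, n_cols):
    """rows: iterable of (row_id, bits) where bits[c] is 0 or 1 for column c."""
    M = [[math.inf] * n_cols for _ in hash_funcs]  # every slot starts at infinity
    for r, bits in rows:
        hashed = [h(r) for h in hash_funcs]  # hash r once per hash function
        for c, bit in enumerate(bits):
            if bit == 1:
                for i, value in enumerate(hashed):
                    if value < M[i][c]:
                        M[i][c] = value
    return M

rows = [(1, [1, 0]), (2, [0, 1]), (3, [1, 1]), (4, [1, 0]), (5, [0, 1])]
h = lambda x: x % 5
g = lambda x: (2 * x + 1) % 5

signature = minhash_signatures(rows, [h, g], n_cols=2)
print(signature)  # [[1, 0], [2, 0]], the final state of the trace
```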
[Figure: pipeline] Document -> The set of strings of length k that appear in the document -> Signatures: short integer vectors that represent the sets, and reflect their similarity -> Locality-Sensitive Hashing -> Candidate pairs: those pairs of signatures that we need to test for similarity

Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
- Goal: Find documents with Jaccard similarity at least s (for some similarity threshold, e.g., s = 0.8)
- LSH, general idea: Use a hash function that tells whether x and y are a candidate pair: a pair of elements whose similarity must be evaluated
- For Min-Hash matrices:
  - Hash columns of signature matrix M to many buckets
  - Each pair of documents that hashes into the same bucket is a candidate pair
- Pick a similarity threshold s (0 < s < 1)
- Columns x and y of M are a candidate pair if their signatures agree on at least fraction s of their rows: M(i, x) = M(i, y) for at least a fraction s of the values of i
  - We expect documents x and y to have the same (Jaccard) similarity as their signatures
- Big idea: Hash columns of signature matrix M several times
- Arrange that (only) similar columns are likely to hash to the same bucket, with high probability
- Candidate pairs are those that hash to the same bucket
[Figure: signature matrix M divided into b bands of r rows per band; one column is one signature]
- Divide matrix M into b bands of r rows
- For each band, hash its portion of each column to a hash table with k buckets
  - Make k as large as possible
- Candidate column pairs are those that hash to the same bucket for ≥ 1 band
- Tune b and r to catch most similar pairs, but few non-similar pairs
[Figure: matrix M with b bands of r rows; each band of each column is hashed to buckets]
- Columns 2 and 6 are probably identical (candidate pair)
- Columns 6 and 7 are surely different
- There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band
- Hereafter, we assume that "same bucket" means "identical in that band"
- Assumption needed only to simplify analysis, not for correctness of algorithm
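A minimal sketch of the banding step, under the simplifying assumption just stated: each band's slice of a column is used directly as a dictionary key, so "same bucket" literally means "identical in that band". The function name `lsh_candidate_pairs` and the small matrix are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(M, b, r):
    """M: signature matrix as a list of b*r rows, each row a list over columns."""
    n_cols = len(M[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(n_cols):
            # The band's r values for column c act as the bucket key.
            key = tuple(M[i][c] for i in range(band * r, (band + 1) * r))
            buckets[key].append(c)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates

# Hypothetical 4-row signature matrix over 3 columns, split into b=2 bands of r=2 rows.
M = [[1, 1, 5],
     [2, 2, 6],
     [3, 9, 7],
     [4, 8, 8]]
print(lsh_candidate_pairs(M, b=2, r=2))  # {(0, 1)}: columns 0 and 1 agree in band 0
```

A real implementation would hash each band's tuple into one of k buckets rather than keep the tuples themselves; that can only add candidate pairs, never remove them.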
Assume the following case:
- Suppose 100,000 columns of M (100k docs)
- Signatures of 100 integers (rows)
- Therefore, signatures take 40MB
- Goal: Find pairs of documents that are at least s = 0.8 similar
- Choose b = 20 bands of r = 5 integers/band
- Find pairs of ≥ s = 0.8 similarity, set b = 20, r = 5
- Assume: $sim(C_1, C_2) = 0.8$
  - Since $sim(C_1, C_2) \ge s$, we want $C_1, C_2$ to be a candidate pair: we want them to hash to at least 1 common bucket (at least one band is identical)
- Probability $C_1, C_2$ identical in one particular band: $(0.8)^5 = 0.328$
- Probability $C_1, C_2$ are not identical in any of the 20 bands: $(1 - 0.328)^{20} = 0.00035$
  - i.e., about 1/3000th of the 80%-similar column pairs are false negatives (we miss them)
  - We would find 99.965% of the pairs of truly similar documents
- Find pairs of ≥ s = 0.8 similarity, set b = 20, r = 5
- Assume: $sim(C_1, C_2) = 0.3$
  - Since $sim(C_1, C_2) < s$, we want $C_1, C_2$ to hash to NO common buckets (all bands should be different)
- Probability $C_1, C_2$ identical in one particular band: $(0.3)^5 = 0.00243$
- Probability $C_1, C_2$ identical in at least 1 of 20 bands: $1 - (1 - 0.00243)^{20} = 0.0474$
  - In other words, approximately 4.74% of the pairs of docs with similarity 0.3 end up becoming candidate pairs
  - They are false positives: we will have to examine them (they are candidate pairs), but then it will turn out their similarity is below threshold s
- Pick:
  - The number of Min-Hashes (rows of M)
  - The number of bands b, and
  - The number of rows r per band
  to balance false positives/negatives
- Example: If we had only 10 bands of 10 rows, the number of false positives would go down, but the number of false negatives would go up
[Figure: the ideal case. x-axis: similarity t = sim(C1, C2) of two sets; y-axis: probability of sharing a bucket. A step function at the similarity threshold s: no chance if t < s, probability = 1 if t > s. Say "yes" if you are below the line.]
[Figure: x-axis: similarity t = sim(C1, C2) of two sets; y-axis: probability of sharing a bucket. Remember: probability of equal hash-values = similarity.]
[Figure: x-axis: similarity t = sim(C1, C2) of two sets; y-axis: probability of sharing a bucket, with threshold s marked. Regions labeled "False negatives" and "False positives". Say "yes" if you are below the line.]
- Say columns $C_1$ and $C_2$ have similarity t
- Pick any band (r rows)
  - Prob. that all rows in band equal = $t^r$
  - Prob. that some row in band unequal = $1 - t^r$
- Prob. that no band identical = $(1 - t^r)^b$
- Prob. that at least 1 band identical = $1 - (1 - t^r)^b$
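[Figure: S-curve. x-axis: similarity t = sim(C1, C2) of two sets; y-axis: probability of sharing a bucket.]
$1 - (1 - t^r)^b$, read from the inside out:
- $t^r$: all rows of a band are equal
- $1 - t^r$: some row of a band unequal
- $(1 - t^r)^b$: no bands identical
- $1 - (1 - t^r)^b$: at least one band identical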
- Similarity threshold s
- Prob. that at least 1 band is identical (the values below correspond to r = 5, b = 20):

s     $1 - (1 - s^r)^b$
0.2   0.006
0.3   0.047
0.4   0.186
0.5   0.470
0.6   0.802
0.7   0.975
0.8   0.9996
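The table can be reproduced with a few lines. The function name `candidate_probability` is an assumption for illustration; the formula is the one derived above, evaluated at r = 5 and b = 20.

```python
def candidate_probability(t, r, b):
    """Probability that two columns with similarity t agree in at least one of b bands of r rows."""
    return 1.0 - (1.0 - t ** r) ** b

for t in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"{t:.1f}  {candidate_probability(t, r=5, b=20):.4f}")
```

The same function gives the false-negative rate at similarity 0.8 as `1 - candidate_probability(0.8, 5, 20)`, which is roughly 0.00035.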
- Picking r and b to get the best S-curve
  - 50 hash-functions (r = 5, b = 10)
[Figure: S-curve. x-axis: Similarity; y-axis: Prob. sharing a bucket. Blue area: False Negative rate. Green area: False Positive rate.]
- Tune M, b, r to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures
- Check in main memory that candidate pairs really do have similar signatures
- Optional: In another pass through data, check that the remaining candidate pairs really represent similar documents
- Shingling: Convert documents to set representation
  - We used hashing to assign each shingle an ID
- Min-Hashing: Convert large sets to short signatures, while preserving similarity
  - We used similarity preserving hashing to generate signatures with property $\Pr[h_\pi(C_1) = h_\pi(C_2)] = sim(C_1, C_2)$
  - We used hashing to get around generating random permutations
- Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
  - We used hashing to find candidate pairs of similarity ≥ s
- Task: Given a large number (N in the millions or billions) of documents, find "near duplicates"
- Problem:
  - Too many documents to compare all pairs
- Solution: Hash documents so that similar documents hash into the same bucket
  - Documents in the same bucket are then candidate pairs whose similarity is then evaluated
[Figure: pipeline] Document -> The set of strings of length k that appear in the document -> Signatures: short integer vectors that represent the sets, and reflect their similarity -> Locality-sensitive Hashing -> Candidate pairs: those pairs of signatures that we need to test for similarity
- A k-shingle (or k-gram) is a sequence of k tokens that appears in the document
  - Example: k = 2; $D_1$ = abcab. Set of 2-shingles: $C_1 = S(D_1)$ = {ab, bc, ca}
- Represent a doc by a set of hash values of its k-shingles
- A natural similarity measure is then the Jaccard similarity: $sim(D_1, D_2) = |C_1 \cap C_2| / |C_1 \cup C_2|$
  - Similarity of two documents is the Jaccard similarity of their shingles
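A small sketch of shingling and Jaccard similarity. It treats single characters as tokens and keeps the shingles as strings instead of hashing them to IDs; both are simplifying assumptions, as are the function names and the second document `"abcd"`.

```python
def shingles(doc, k):
    """Set of all length-k character substrings of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def jaccard_similarity(a, b):
    if not a and not b:
        return 1.0  # convention assumed here for two empty sets
    return len(a & b) / len(a | b)

c1 = shingles("abcab", 2)  # {'ab', 'bc', 'ca'}, as in the example above
c2 = shingles("abcd", 2)   # {'ab', 'bc', 'cd'}, a made-up second document
print(jaccard_similarity(c1, c2))  # 2 shared shingles out of 4 distinct ones: 0.5
```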
- Min-Hashing: Convert large sets into short signatures, while preserving similarity: $\Pr[h(C_1) = h(C_2)] = sim(D_1, D_2)$

Permutation π | Input matrix (Shingles x Documents)
2 4 3 | 1 0 1 0
3 2 4 | 1 0 0 1
7 1 7 | 0 1 0 1
6 3 2 | 0 1 0 1
1 6 6 | 0 1 0 1
5 7 1 | 1 0 1 0
4 5 5 | 1 0 1 0

Signature matrix M
2 3 7 2
2 4 1 4
7 3 7 3

Similarities of columns and signatures (approx.) match!
Pair     1-3   2-4   1-2   3-4
Col/Col  0.75  0.75  0     0
Sig/Sig  0.34  0.67  0     0
- Hash columns of the signature matrix M: Similar columns likely hash to same bucket
  - Divide matrix M into b bands of r rows (M = b·r)
  - Candidate column pairs are those that hash to the same bucket for ≥ 1 band
[Figure: matrix M split into b bands of r rows and hashed to buckets; next to it, the S-curve of prob. of sharing ≥ 1 bucket vs. similarity, with threshold s]
[Figure: pipeline] Points -> Signatures: short integer signatures that reflect point similarity -> Locality-sensitive Hashing -> Candidate pairs: those pairs of signatures that we need to test for similarity
- First step: Design a locality sensitive hash function (for a given distance metric)
- Second step: Apply the "Bands" technique
- The S-curve is where the "magic" happens
[Figure, left: probability of equal hash-values vs. similarity t of two sets. Remember: probability of equal hash-values = similarity. This is what 1 hash-code gives you: $\Pr[h_\pi(C_1) = h_\pi(C_2)] = sim(D_1, D_2)$]
[Figure, right: probability of sharing ≥ 1 bucket vs. similarity t of two sets, with threshold s. No chance if t < s; probability = 1 if t > s. This is what we want!]
How to get a step-function? By choosing r and b!
- Remember: b bands, r rows/band
- Let $sim(C_1, C_2) = s$. What's the prob. that at least 1 band is equal?
- Pick some band (r rows)
  - Prob. that elements in a single row of columns $C_1$ and $C_2$ are equal = $s$
  - Prob. that all rows in a band are equal = $s^r$
  - Prob. that some row in a band is not equal = $1 - s^r$
- Prob. that all bands are not equal = $(1 - s^r)^b$
- Prob. that at least 1 band is equal = $1 - (1 - s^r)^b$

$P(C_1, C_2 \text{ is a candidate pair}) = 1 - (1 - s^r)^b$
- Picking r and b to get the best S-curve
  - 50 hash-functions (r = 5, b = 10)
[Figure: S-curve. x-axis: Similarity, s; y-axis: Prob. sharing a bucket.]
Given a fixed threshold s, we want to choose r and b such that P(Candidate pair) has a "step" right around s.
[Figure: four plots of Prob(Candidate pair) vs. Similarity, for r = 1..10, b = 1; r = 1, b = 1..10; r = 5, b = 1..50; r = 10, b = 1..50]
prob = $1 - (1 - t^r)^b$
[Figure: pipeline] Signatures: short vectors that represent the sets, and reflect their similarity -> Locality-sensitive Hashing -> Candidate pairs: those pairs of signatures that we need to test for similarity
- We have used LSH to find similar documents
  - More generally, we found similar columns in large sparse matrices with high Jaccard similarity
- Can we use LSH for other distance measures?
  - e.g., Euclidean distances, Cosine distance
  - Let's generalize what we've learned!
- d() is a distance measure if it is a function from pairs of points x, y to real numbers such that:
  - $d(x, y) \ge 0$
  - $d(x, y) = 0$ iff $x = y$
  - $d(x, y) = d(y, x)$
  - $d(x, y) \le d(x, z) + d(z, y)$ (triangle inequality)
- Jaccard distance for sets = 1 minus Jaccard similarity
- Cosine distance for vectors = angle between the vectors
- Euclidean distances:
  - L2 norm: d(x,y) = square root of the sum of the squares of the differences between x and y in each dimension
    - The most common notion of "distance"
  - L1 norm: sum of the absolute differences in each dimension
    - Manhattan distance = distance if you travel along coordinates only
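The distance measures listed above, written out as a minimal sketch. Function names are illustrative, vectors are plain tuples of numbers, and the cosine distance assumes neither vector is all zeros.

```python
import math

def jaccard_distance(a, b):
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)

def cosine_distance(x, y):
    """Angle between x and y in radians (0..pi); undefined for zero vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    norms = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    cos = max(-1.0, min(1.0, dot / norms))  # clamp against rounding error
    return math.acos(cos)

def l2_distance(x, y):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

def l1_distance(x, y):
    return sum(abs(p - q) for p, q in zip(x, y))

print(l2_distance((0, 0), (3, 4)))                # 5.0
print(l1_distance((0, 0), (3, 4)))                # 7
print(cosine_distance((1, 0), (0, 1)))            # pi/2 for orthogonal vectors
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))     # 1 - 2/4 = 0.5
```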
- $d(x,y) \ge 0$ because $|x \cap y| \le |x \cup y|$
  - Thus, similarity ≤ 1 and distance = 1 - similarity ≥ 0
- $d(x,x) = 0$ because $x \cap x = x \cup x$
  - And if $x \ne y$, then $|x \cap y|$ is strictly less than $|x \cup y|$, so sim(x,y) < 1; thus d(x,y) > 0
- $d(x,y) = d(y,x)$ because union and intersection are symmetric
- $d(x,y) \le d(x,z) + d(z,y)$ is trickier:
  $\left(1 - \frac{|x \cap z|}{|x \cup z|}\right) + \left(1 - \frac{|y \cap z|}{|y \cup z|}\right) \ge 1 - \frac{|x \cap y|}{|x \cup y|}$
$d(x,z) + d(z,y) \ge d(x,y)$:
$\left(1 - \frac{|x \cap z|}{|x \cup z|}\right) + \left(1 - \frac{|y \cap z|}{|y \cup z|}\right) \ge 1 - \frac{|x \cap y|}{|x \cup y|}$
- Remember: $|a \cap b| / |a \cup b|$ = probability that minhash(a) = minhash(b)
- Thus, $1 - |a \cap b| / |a \cup b|$ = probability that minhash(a) ≠ minhash(b)
- Need to show: prob[minhash(x) ≠ minhash(y)] ≤ prob[minhash(x) ≠ minhash(z)] + prob[minhash(z) ≠ minhash(y)]
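- Whenever minhash(x) ≠ minhash(y), at least one of minhash(x) ≠ minhash(z) and minhash(z) ≠ minhash(y) must be true (otherwise minhash(x) = minhash(z) = minhash(y))
- So the event minhash(x) ≠ minhash(y) is contained in the union of the other two events, and its probability is at most the sum of their probabilities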
- For Min-Hashing signatures, we got a Min-Hash function for each permutation of rows
- A "hash function" is any function that allows us to say whether two elements are "equal"
  - Shorthand: h(x) = h(y) means "h says x and y are equal"
- A family of hash functions is any set of hash functions from which we can pick one at random efficiently
  - Example: The set of Min-Hash functions generated from permutations of rows
- Suppose we have a space S of points with a distance measure d(x,y) (critical assumption)
- A family H of hash functions is said to be $(d_1, d_2, p_1, p_2)$-sensitive if for any x and y in S:
  1. If $d(x, y) \le d_1$, then the probability over all $h \in H$ that h(x) = h(y) is at least $p_1$
  2. If $d(x, y) \ge d_2$, then the probability over all $h \in H$ that h(x) = h(y) is at most $p_2$

With a LS family we can do LSH!
[Figure: Pr[h(x) = h(y)] vs. distance d(x,y), with a distance threshold t and the points $d_1$, $d_2$, $p_1$, $p_2$ marked. Small distance: high probability. Large distance: low probability of hashing to the same value.]
- Let:
  - S = space of all sets,
  - d = Jaccard distance,
  - H is the family of Min-Hash functions for all permutations of rows
- Then for any hash function $h \in H$: Pr[h(x) = h(y)] = 1 - d(x, y)
  - Simply restates the theorem about Min-Hashing in terms of distances rather than similarities
- Claim: Min-hash H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d.
  - If distance ≤ 1/3 (so similarity ≥ 2/3), then the probability that Min-Hash values agree is ≥ 2/3
- For Jaccard similarity, Min-Hashing gives a $(d_1, d_2, (1 - d_1), (1 - d_2))$-sensitive family for any $d_1 < d_2$
- Can we reproduce the "S-curve" effect we saw before for any LS family?
[Figure: S-curve of prob. of sharing a bucket vs. similarity t]
- The "bands" technique we learned for signature matrices carries over to this more general setting
- Can do LSH with any $(d_1, d_2, p_1, p_2)$-sensitive family!
- Two constructions:
  - AND construction like "rows in a band"
  - OR construction like "many bands"
- Given family H, construct family H' consisting of r functions from H
- For $h = [h_1, \ldots, h_r]$ in H', we say h(x) = h(y) if and only if $h_i(x) = h_i(y)$ for all i, $1 \le i \le r$
  - Note this corresponds to creating a band of size r
- Theorem: If H is $(d_1, d_2, p_1, p_2)$-sensitive, then H' is $(d_1, d_2, (p_1)^r, (p_2)^r)$-sensitive
- Proof: Use the fact that the $h_i$'s are independent
- $(p_2)^r$: lowers probability for large distances (Good)
- $(p_1)^r$: also lowers probability for small distances (Bad)
- Independence of hash functions (HFs) really means that the prob. of two HFs saying "yes" is the product of each saying "yes"
  - But two particular hash functions could be highly correlated
    - For example, in Min-Hash if their permutations agree in the first one million entries
  - However, the probabilities in the definition of a LSH-family are over all possible members of H, H' (i.e., average case and not the worst case)
- Given family H, construct family H' consisting of b functions from H
- For $h = [h_1, \ldots, h_b]$ in H', h(x) = h(y) if and only if $h_i(x) = h_i(y)$ for at least 1 i
- Theorem: If H is $(d_1, d_2, p_1, p_2)$-sensitive, then H' is $(d_1, d_2, 1 - (1 - p_1)^b, 1 - (1 - p_2)^b)$-sensitive
- Proof: Use the fact that the $h_i$'s are independent
- $1 - (1 - p_2)^b$: raises probability for large distances (Bad)
- $1 - (1 - p_1)^b$: raises probability for small distances (Good)
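The two constructions can be sketched as higher-order functions that turn a list of hash functions into an "are x and y equal?" predicate, together with the probability transforms stated in the two theorems. All names here are hypothetical, and the two toy hash functions are not drawn from a real locality-sensitive family; they only exercise the logic.

```python
def and_construction(hs):
    """x and y are 'equal' only if every component hash function agrees."""
    return lambda x, y: all(h(x) == h(y) for h in hs)

def or_construction(hs):
    """x and y are 'equal' if at least one component hash function agrees."""
    return lambda x, y: any(h(x) == h(y) for h in hs)

def and_sensitivity(p, r):
    return p ** r                  # p -> p^r

def or_sensitivity(p, b):
    return 1.0 - (1.0 - p) ** b    # p -> 1 - (1 - p)^b

h1 = lambda v: v % 2  # toy hash functions for illustration only
h2 = lambda v: v % 3
same_and = and_construction([h1, h2])
same_or = or_construction([h1, h2])

print(same_and(1, 7), same_and(1, 3))  # True False
print(same_or(1, 3), same_or(1, 2))    # True False
```

The sensitivity formulas rely on the component functions being chosen independently from H, as the proofs above require.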
- AND makes all probs. shrink, but by choosing r correctly, we can make the lower prob. approach 0 while the higher does not
- OR makes all probs. grow, but by choosing b correctly, we can make the upper prob. approach 1 while the lower does not
[Figure: two plots of prob. sharing a bucket vs. similarity of a pair of items. AND: r = 1..10, b = 1. OR: r = 1, b = 1..10.]
- By choosing b and r correctly, we can make the lower probability approach 0 while the higher approaches 1
- As for the signature matrix, we can use the AND construction followed by the OR construction
  - Or vice-versa
  - Or any sequence of ANDs and ORs alternating
- r-way AND followed by b-way OR construction
  - Exactly what we did with Min-Hashing
    - AND: If bands match in all r values, hash to same bucket
    - OR: Cols that have ≥ 1 common bucket -> Candidate
- Take points x and y s.t. Pr[h(x) = h(y)] = s
  - H will make (x,y) a candidate pair with prob. s
- Construction makes (x,y) a candidate pair with probability $1 - (1 - s^r)^b$: the S-curve!
  - Example: Take H and construct H' by the AND construction with r = 4. Then, from H', construct H'' by the OR construction with b = 4
s     $p = 1 - (1 - s^4)^4$
.2    .0064
.3    .0320
.4    .0985
.5    .2275
.6    .4260
.7    .6666
.8    .8785
.9    .9860

r = 4, b = 4 transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .8785, .0064)-sensitive family.
- Picking r and b to get desired performance
  - 50 hash-functions (r = 5, b = 10)
[Figure: S-curve of Prob(Candidate pair) vs. Similarity s, with threshold s marked]
- Blue area X: False Negative rate. These are pairs with sim > s, but an X fraction of them won't share a band and so will never become candidates. This means we will never consider these pairs for (slow/exact) similarity calculation!
- Green area Y: False Positive rate. These are pairs with sim < s that we will still consider as candidates. This is not too bad: we will consider them for (slow/exact) similarity computation and discard them.
- Picking r and b to get desired performance
  - 50 hash-functions (r · b = 50)
[Figure: S-curves of Prob(Candidate pair) vs. Similarity s, with threshold s marked, for r = 2, b = 25; r = 5, b = 10; r = 10, b = 5]
- Apply a b-way OR construction followed by an r-way AND construction
- Transforms similarity s (probability p) into $(1 - (1 - s)^b)^r$
  - The same S-curve, mirrored horizontally and vertically
- Example: Take H and construct H' by the OR construction with b = 4. Then, from H', construct H'' by the AND construction with r = 4
s     $p = (1 - (1 - s)^4)^4$
.1    .0140
.2    .1215
.3    .3334
.4    .5740
.5    .7725
.6    .9015
.7    .9680
.8    .9936

[Figure: Prob(Candidate pair) vs. Similarity s]

The example transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9936, .1215)-sensitive family
- Example: Apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction
- Transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9999996, .0008715)-sensitive family
  - Note this family uses 256 (= 4·4·4·4) of the original hash functions
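The numbers in the AND-OR, OR-AND and cascaded examples can be recomputed with a few lines. The function names are chosen for illustration; the formulas are the ones given above.

```python
def and_or(s, r, b):
    """r-way AND followed by b-way OR: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** r) ** b

def or_and(s, r, b):
    """b-way OR followed by r-way AND: (1 - (1 - s)^b)^r."""
    return (1.0 - (1.0 - s) ** b) ** r

def or_and_then_and_or(s):
    """(4,4) OR-AND followed by (4,4) AND-OR, as in the cascaded example."""
    return and_or(or_and(s, r=4, b=4), r=4, b=4)

print(f"{and_or(0.8, 4, 4):.4f} {and_or(0.2, 4, 4):.4f}")  # AND-OR: p1, p2
print(f"{or_and(0.8, 4, 4):.4f} {or_and(0.2, 4, 4):.4f}")  # OR-AND: p1, p2
p1 = or_and_then_and_or(0.8)
p2 = or_and_then_and_or(0.2)
print(f"{p1:.7f} {p2:.7f}")                                # cascade: p1, p2
```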
- For each AND-OR S-curve $1 - (1 - s^r)^b$, there is a threshold t, for which $1 - (1 - t^r)^b = t$
- Above t, high probabilities are increased; below t, low probabilities are decreased
- You improve the sensitivity as long as the low probability is less than t, and the high probability is greater than t
  - Iterate as you like.
- Similar observation for the OR-AND type of S-curve: $(1 - (1 - s)^b)^r$
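One way to locate the threshold t numerically is bisection on $g(t) = 1 - (1 - t^r)^b - t$. This is only a sketch: the helper name `fixed_point`, the search interval and the tolerance are assumptions, and it presumes g changes sign from negative to positive on the interval, which holds for the r = 4, b = 4 example.

```python
def and_or(s, r, b):
    return 1.0 - (1.0 - s ** r) ** b

def fixed_point(r, b, lo=0.01, hi=0.99, tol=1e-12):
    """Bisection for the interior t with and_or(t, r, b) == t."""
    g = lambda t: and_or(t, r, b) - t
    if not (g(lo) < 0 < g(hi)):
        raise ValueError("no sign change on [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid   # curve still below the diagonal: threshold is to the right
        else:
            hi = mid
    return (lo + hi) / 2

t = fixed_point(r=4, b=4)
print(t)  # lies between 0.7 and 0.8, consistent with the r = 4, b = 4 table
```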
[Figure: S-curve of Prob(Candidate pair) vs. s, crossing the diagonal at threshold t. Above t: probability is raised. Below t: probability is lowered.]
- Pick any two distances $d_1 < d_2$
- Start with a $(d_1, d_2, (1 - d_1), (1 - d_2))$-sensitive family
- Apply constructions to amplify it into a $(d_1, d_2, p_1, p_2)$-sensitive family, where $p_1$ is almost 1 and $p_2$ is almost 0
- The closer to 0 and 1 we get, the more hash functions must be used!
- LSH methods for other distance metrics:
  - Cosine distance: Random hyperplanes
  - Euclidean distance: Project on lines
[Figure: pipeline] Points -> Signatures: short integer signatures that reflect their similarity -> Locality-sensitive Hashing -> Candidate pairs: those pairs of signatures that we need to test for similarity
- First step: Design a $(d_1, d_2, p_1, p_2)$-sensitive family of hash functions (for that particular distance metric). Depends on the distance function used
- Second step: Amplify the family using AND and OR constructions
[Figure: two instances of the pipeline Data -> Signatures: short integer signatures that reflect their similarity -> Locality-sensitive Hashing -> Candidate pairs: those pairs of signatures that we need to test for similarity]
- Documents (0/1 matrix) -> MinHash -> integer signatures -> "Bands" technique -> Candidate pairs
- Data points (0/1 matrix) -> Random Hyperplanes -> signatures of +1/-1 values -> "Bands" technique -> Candidate pairs
- Cosine distance = angle between vectors from the origin to the points in question
  $d(A, B) = \theta = \arccos\left(\frac{A \cdot B}{\|A\| \cdot \|B\|}\right)$
  - Has range $0 \ldots \pi$ (equivalently 0...180°)
  - Can divide $\theta$ by $\pi$ to have distance in range 0…1
- Cosine similarity = 1 - d(A,B)
  - But often defined as cosine sim: $\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}$
    - Has range -1…1 for general vectors
    - Range 0..1 for non-negative vectors (angles up to 90°)
[Figure: vectors A and B from the origin with the angle between them; the projection of A onto B has length $A \cdot B / \|B\|$]
- For cosine distance, there is a technique called Random Hyperplanes
  - Technique similar to Min-Hashing
- Random Hyperplanes method is a $(d_1, d_2, (1 - d_1/\pi), (1 - d_2/\pi))$-sensitive family for any $d_1$ and $d_2$
- Reminder: $(d_1, d_2, p_1, p_2)$-sensitive
  1. If $d(x,y) \le d_1$, then prob. that h(x) = h(y) is at least $p_1$
  2. If $d(x,y) \ge d_2$, then prob. that h(x) = h(y) is at most $p_2$
- Each vector v determines a hash function $h_v$ with two buckets
  - $h_v(x) = +1$ if $v \cdot x \ge 0$; $= -1$ if $v \cdot x < 0$
- LS-family H = set of all functions derived from any vector
- Claim: For points x and y, $\Pr[h(x) = h(y)] = 1 - d(x,y)/\pi$
[Figure: vectors x and y with angle θ between them, and two random vectors v and v'.] Look in the plane of x and y.
- Hyperplane normal to v'. Here h(x) ≠ h(y)
- Hyperplane normal to v. Here h(x) = h(y)
- Note: what is important is that the hyperplane is outside the angle, not that the vector is inside.
So: Prob[Red case] = $\theta / \pi$
So: $P[h(x) = h(y)] = 1 - \theta/\pi = 1 - d(x,y)/\pi$
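The claim can be checked empirically with a small simulation. It draws each random vector v with independent Gaussian components, a common way to obtain a uniformly random direction; that choice, the function names, the seed and the sample size are assumptions of this sketch rather than details from the slides.

```python
import math
import random

def random_hyperplane_hash(dim, rng):
    """h_v(x) = +1 if v.x >= 0 else -1, for a random direction v."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda x: 1 if sum(vi * xi for vi, xi in zip(v, x)) >= 0 else -1

def estimate_collision_probability(x, y, n_hashes, rng):
    hashes = [random_hyperplane_hash(len(x), rng) for _ in range(n_hashes)]
    agree = sum(1 for h in hashes if h(x) == h(y))
    return agree / n_hashes

rng = random.Random(0)             # fixed seed for reproducibility
x, y = (1.0, 0.0), (0.0, 1.0)      # orthogonal vectors: theta = pi/2
theta = math.acos(0.0)
estimate = estimate_collision_probability(x, y, 20000, rng)
expected = 1.0 - theta / math.pi   # 0.5 for orthogonal vectors
print(estimate, expected)
```

The estimate should land close to the predicted value, with sampling noise that shrinks as the number of hash functions grows.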