High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction
Graph data: PageRank, SimRank, Network Analysis, Spam Detection
Infinite data: Filtering data streams, Web advertising, Queries on streams
Machine learning: SVM, Decision Trees, Perceptron, kNN
Apps: Recommender systems, Association Rules, Duplicate document detection
Given a query image patch, find similar images
Collect billions of images. Determine a feature vector for each image (4k dim). Given a query image Q, find its nearest neighbors FAST.
(Figure: Hamming distance between the bit feature vectors of query image Q and image B gives Similarity(Q, B).)
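As a small illustration (not from the slides), similarity between two bit feature vectors can be measured as the fraction of agreeing positions:

```python
def hamming_similarity(q: list, b: list) -> float:
    """Fraction of positions where two equal-length bit vectors agree."""
    assert len(q) == len(b)
    return sum(qi == bi for qi, bi in zip(q, b)) / len(q)

# e.g. hamming_similarity([1, 0, 1, 1], [1, 1, 1, 0]) -> 0.5
```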
Many problems can be expressed as
finding “similar” sets:
- Find near-neighbors in high-dimensional space
Examples:
- Pages with similar words
- For duplicate detection, classification by topic
- Customers who purchased similar products
- Products with similar customer sets
- Images with similar features
- Image completion
- Recommendations and search
Given: High dimensional data points x1, x2, …
- For example: Image is a long vector of pixel colors
And some distance function d(x1, x2)
- which quantifies the “distance” between x1 and x2
Goal: Find all pairs of data points (xi, xj) that are within distance threshold: d(xi, xj) ≤ s
Note: Naïve solution would take O(N²), where N is the number of data points
MAGIC: This can be done in O(N)!! How??
LSH is really a family of related techniques
In general, one throws items into buckets using several different “hash functions”
You examine only those pairs of items that share a bucket for at least one of these hashings
Upside: Designed correctly, only a small fraction of pairs are ever examined
Downside: There are false negatives – pairs of similar items that never even get considered
Suppose we need to find near-duplicate documents among N = 1 million documents
- Naïvely, we would have to compute pairwise similarities for every pair of docs
- N(N−1)/2 ≈ 5·10^11 comparisons
- At 10^5 secs/day and 10^6 comparisons/sec, it would take 5 days
- For N = 10 million, it takes more than a year…
Similarly, given a dataset of 10 million images, quickly find the ones most similar to a query image Q
1. Shingling: Convert a document into a set representation (Boolean vector)
2. Min-Hashing: Convert large sets to short signatures, while preserving similarity
3. Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
- Candidate pairs!
(Pipeline: Document → Shingling → the set of strings of length k that appear in the document → Min-Hashing → signatures: short integer vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity.)
Step 1: Shingling: Convert a document into a set
(Document → the set of strings of length k that appear in the document)
A k-shingle (or k-gram) for a document is a
sequence of k tokens that appears in the doc
- Tokens can be characters, words or something else,
depending on the application
- Assume tokens = characters for examples
To compress long shingles, we can hash them to
(say) 4 bytes
Represent a document by the set of hash
values of its k-shingles
Example: k=2; document D1= abcab
Set of 2-shingles: S(D1) = {ab, bc, ca} Hash the shingles: h(D1) = {1, 5, 7}
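A minimal shingling sketch in Python (the function name and the use of MD5 for the 4-byte hash are illustrative assumptions, not from the lecture):

```python
import hashlib

def shingle_set(doc: str, k: int = 8) -> set:
    """Set of 4-byte hash values of all character k-shingles of doc."""
    shingles = {doc[i:i + k] for i in range(len(doc) - k + 1)}
    # Compress each shingle to a 32-bit (4-byte) integer, as the slide suggests.
    return {int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")
            for s in shingles}

# Slide example with k = 2 and D1 = "abcab": three distinct shingles {ab, bc, ca}.
print(len(shingle_set("abcab", k=2)))   # -> 3
```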
k = 8, 9, or 10 is often used in practice
Benefits of shingles:
- Documents that are intuitively similar will have
many shingles in common
- Changing a word only affects k-shingles within
distance k-1 from the word
Document D1 is a set of its k-shingles C1 = S(D1)
A natural similarity measure is the
Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
Jaccard distance: d(C1, C2) = 1 − |C1 ∩ C2| / |C1 ∪ C2|
Example: 3 in intersection, 8 in union → Jaccard similarity = 3/8
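A direct translation of the definition into Python (a sketch, reusing the hypothetical shingle_set above):

```python
def jaccard(c1: set, c2: set) -> float:
    """Jaccard similarity |C1 ∩ C2| / |C1 ∪ C2| of two shingle sets."""
    if not c1 and not c2:
        return 1.0
    return len(c1 & c2) / len(c1 | c2)

# e.g. 3 shingles in the intersection and 8 in the union -> 3/8 = 0.375
```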
Encode sets using 0/1 (bit, Boolean) vectors
Rows = elements (shingles) Columns = sets (documents)
- 1 in row e and column s if and only if e is a member of s
- Column similarity is the Jaccard
similarity of the corresponding sets (rows with value 1)
- Typical matrix is sparse!
Each document is a column:
- Example: sim(C1 ,C2) = ?
- Size of intersection = 3; size of union = 6,
Jaccard similarity (not distance) = 3/6
- d(C1,C2) = 1 – (Jaccard similarity) = 3/6
(Example: a 0/1 matrix with shingles as rows and documents as columns.)
We don’t really construct the matrix; just imagine it exists
So far:
- Documents → Sets of shingles
- Represent sets as boolean vectors in a matrix
Next goal: Find similar columns while
computing small signatures
- Similarity of columns == similarity of signatures
Warnings:
- Comparing all pairs takes too much time: Job for LSH
- These methods can produce false negatives, and even false
positives (if the optional check is not made)
Step 2: Min-Hashing: Convert large sets to short signatures, while preserving similarity
(Document → the set of strings of length k that appear in the document → signatures: short integer vectors that represent the sets, and reflect their similarity)
Key idea: “hash” each column C to a small
signature h(C), such that:
- sim(C1, C2) is the same as the “similarity” of
signatures h(C1) and h(C2)
Goal: Find a hash function h(·) such that:
- If sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
- If sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)
Idea: Hash docs into buckets. Expect that
“most” pairs of near duplicate docs hash into the same bucket!
Goal: Find a hash function h(·) such that:
- if sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
- if sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)
Clearly, the hash function depends on
the similarity metric:
- Not all similarity metrics have a suitable
hash function
There is a suitable hash function for
the Jaccard similarity: It is called Min-Hashing
Permute the rows of the Boolean matrix
- Thought experiment – not real
Define the minhash function for a permutation π:
- hπ(C) = the number of the first (in the permuted order) row in which column C has 1
- hπ(C) = minπ π(C)
Apply, to all columns, several randomly chosen permutations to create a signature for each column
Result is a signature matrix: columns = sets, rows = minhash values, in order for that column
(Example: a shingles × documents input matrix, three random row permutations, and the resulting signature matrix M; e.g. h2(3) = 1 – under permutation 2, the first row in permuted order with a 1 in column 3 is row 1.)
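A sketch of the permutation “thought experiment” in Python (illustrative; the real one-pass implementation appears a few slides later):

```python
import random

def minhash_signature(columns, n_rows, n_perms=3, seed=0):
    """columns: list of sets of row indices that contain a 1.
    Returns the signature matrix: one row per permutation, one entry per column."""
    rng = random.Random(seed)
    signature = []
    for _ in range(n_perms):
        perm = list(range(n_rows))
        rng.shuffle(perm)          # perm[r] = position of row r in the permuted order
        # minhash value of a column = smallest permuted position among its 1-rows
        signature.append([min(perm[r] for r in col) for col in columns])
    return signature
```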
Students sometimes ask whether the minhash
value should be the original number of the row, or the number in the permuted order (as we did in our example).
Answer: it doesn’t matter
- You only need to be consistent, and assure that
two columns get the same value if and only if their first 1’s in the permuted order are in the same row
Choose a random permutation π
Claim: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
Why?
- Let X be a doc (set of shingles), z ∈ X is a shingle
- Then: Pr[π(z) = min(π(X))] = 1/|X|
- It is equally likely that any z ∈ X is mapped to the min element
- Let y be s.t. π(y) = min(π(C1 ∪ C2))
- Then either: π(y) = min(π(C1)) if y ∈ C1, or π(y) = min(π(C2)) if y ∈ C2
- So the prob. that both are true is the prob. y ∈ C1 ∩ C2
- Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)
(One of the two columns had to have a 1 at position y.)
Given cols C1 and C2, rows are classified as:
- Type A: C1 = 1, C2 = 1
- Type B: C1 = 1, C2 = 0
- Type C: C1 = 0, C2 = 1
- Type D: C1 = 0, C2 = 0
- Define: a = # rows of type A, etc.
Note: sim(C1, C2) = a/(a + b + c)
Then: Pr[h(C1) = h(C2)] = sim(C1, C2)
- Look down the permuted cols C1 and C2 until we see a 1
- If it’s a type-A row, then h(C1) = h(C2); if a type-B or type-C row, then not
We know: Pr[h(C1) = h(C2)] = sim(C1, C2) Now generalize to multiple hash functions The similarity of two signatures is the
fraction of the hash functions in which they agree
Thus, the expected similarity of two
signatures equals the Jaccard similarity of the columns or sets that the signatures represent.
- And the longer the signatures, the smaller will be
the expected error.
Similarities for column pairs 1-3, 2-4, 1-2, 3-4:
- Col/Col: 0.75, 0.75, 0, 0
- Sig/Sig: 0.34, 0.67, 0, 0
(Figure: the input shingles × documents matrix, a permutation, and the resulting signature matrix M.)
Permuting rows even once is prohibitive → Row hashing!
- Pick K = 100 hash functions hi
- Ordering under hi gives a random permutation of rows!
One-pass implementation
- For each column c and hash-func. hi keep a “slot” M(i, c) for the min-hash value
- Initialize all M(i, c) = ∞
- Scan rows looking for 1s
- Suppose row j has 1 in column c
- Then for each hi:
- If hi(j) < M(i, c), then M(i, c) ← hi(j)
How to pick a random hash function h(x)?
Universal hashing: h_{a,b}(x) = ((a·x + b) mod p) mod N
where: a, b … random integers; p … prime number (p > N)
for each row r do begin
  for each hash function hi do
    compute hi(r);
  for each column c
    if c has 1 in row r
      for each hash function hi do
        if hi(r) < M(i, c) then M(i, c) := hi(r);
end;

Important: this way you hash r only once per hash function, not once per 1 in row r.
(Worked example: rows 1–5, two columns C1 and C2, and two hash functions h(x) = x mod 5 and g(x) = (2x+1) mod 5. Starting from M(i, c) = ∞ and scanning the rows, the slots M(i, C1) and M(i, C2) are updated until they hold the final signature matrix M.)
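A compact Python sketch of the one-pass row-hashing implementation (function names and the choice of prime are illustrative assumptions):

```python
import random

def minhash_signatures(columns, n_rows, k=100, seed=0):
    """One-pass min-hashing. columns: list of sets of row indices containing a 1."""
    p = 2_147_483_647                    # a prime larger than n_rows (assumption)
    rng = random.Random(seed)
    funcs = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(k)]
    M = [[float("inf")] * len(columns) for _ in range(k)]    # slots M(i, c)

    for r in range(n_rows):                                  # scan rows looking for 1s
        h = [((a * r + b) % p) % n_rows for a, b in funcs]   # hash r once per function
        for c, col in enumerate(columns):
            if r in col:                                     # row r has a 1 in column c
                for i in range(k):
                    if h[i] < M[i][c]:
                        M[i][c] = h[i]
    return M
```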
Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
(Document → shingle set → signatures: short integer vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity)
Goal: Find documents with Jaccard similarity at
least s (for some similarity threshold, e.g., s=0.8)
LSH – General idea: Use a hash function that tells whether x and y are a candidate pair: a pair of elements whose similarity must be evaluated
For Min-Hash matrices:
- Hash columns of signature matrix M to many buckets
- Each pair of documents that hashes into the
same bucket is a candidate pair
Pick a similarity threshold s (0 < s < 1) Columns x and y of M are a candidate pair if
their signatures agree on at least fraction s of their rows: M (i, x) = M (i, y) for at least frac. s values of i
- We expect documents x and y to have the same
(Jaccard) similarity as their signatures
Big idea: Hash columns of
signature matrix M several times
Arrange that (only) similar columns are
likely to hash to the same bucket, with high probability
Candidate pairs are those that hash to the
same bucket
(Figure: the signature matrix M, with r rows per band and b bands; each column is one signature.)
Divide matrix M into b bands of r rows For each band, hash its portion of each
column to a hash table with k buckets
- Make k as large as possible
Candidate column pairs are those that hash
to the same bucket for ≥ 1 band
Tune b and r to catch most similar pairs,
but few non-similar pairs
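A sketch of the banding technique over a signature matrix (helper names and the use of Python tuples as bucket keys are illustrative assumptions):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(M, b, r):
    """M: signature matrix as a list of b*r rows; returns candidate column pairs."""
    n_cols = len(M[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(n_cols):
            # this band's portion of column c determines its bucket
            piece = tuple(M[band * r + i][c] for i in range(r))
            buckets[piece].append(c)
        for cols in buckets.values():        # columns sharing a bucket in >= 1 band
            for x, y in combinations(cols, 2):
                candidates.add((min(x, y), max(x, y)))
    return candidates
```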
Columns 2 and 6 are probably identical (candidate pair) Columns 6 and 7 are surely different.
There are enough buckets that columns are
unlikely to hash to the same bucket unless they are identical in a particular band
Hereafter, we assume that “same bucket”
means “identical in that band”
Assumption needed only to simplify analysis,
not for correctness of algorithm
Assume the following case:
Suppose 100,000 columns of M (100k docs) Signatures of 100 integers (rows) Therefore, signatures take 40MB Goal: Find pairs of documents that
are at least s = 0.8 similar
Choose b = 20 bands of r = 5 integers/band
Find pairs of s = 0.8 similarity; set b = 20, r = 5
Assume: sim(C1, C2) = 0.8
- Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: We want them to hash to at least 1 common bucket (at least one band is identical)
Probability C1, C2 identical in one particular band: (0.8)^5 = 0.328
Probability C1, C2 are not identical in any of the 20 bands: (1 − 0.328)^20 = 0.00035
- i.e., about 1/3000th of the 80%-similar column pairs are false negatives (we miss them)
- We would find 99.965% of the pairs of truly similar documents
Find pairs of s = 0.8 similarity; set b = 20, r = 5
Assume: sim(C1, C2) = 0.3
- Since sim(C1, C2) < s, we want C1, C2 to hash to NO common buckets (all bands should be different)
Probability C1, C2 identical in one particular band: (0.3)^5 = 0.00243
Probability C1, C2 identical in at least 1 of 20 bands: 1 − (1 − 0.00243)^20 = 0.0474
- In other words, approximately 4.74% of pairs of docs with similarity 0.3 end up becoming candidate pairs
- They are false positives since we will have to examine them (they are candidate pairs) but then it will turn out their similarity is below threshold s
Pick:
- The number of Min-Hashes (rows of M)
- The number of bands b, and
- The number of rows r per band
to balance false positives/negatives
Example: If we had only 10 bands of 10
rows, the number of false positives would go down, but the number of false negatives would go up
(Figure: probability of sharing a bucket vs. similarity t = sim(C1, C2), with similarity threshold s. The ideal is a step function: no chance if t < s, probability = 1 if t > s – say “yes” if you are below the line.)
Remember: Probability of equal hash-values = similarity
(Figure: probability of sharing a bucket vs. similarity t = sim(C1, C2) for a single hash function.)
(Figure: the actual probability of sharing a bucket vs. similarity t, with threshold s; the areas on the wrong side of the threshold give the false positives and false negatives – say “yes” if you are below the line.)
Say columns C1 and C2 have similarity t. Pick any band (r rows):
- Prob. that all rows in band are equal = t^r
- Prob. that some row in band is unequal = 1 − t^r
Prob. that no band is identical = (1 − t^r)^b
Prob. that at least 1 band is identical = 1 − (1 − t^r)^b
Reading the formula 1 − (1 − t^r)^b:
- t^r: all rows of a band are equal
- 1 − t^r: some row of a band is unequal
- (1 − t^r)^b: no band is identical
- 1 − (1 − t^r)^b: at least one band is identical
(Figure: probability of sharing a bucket vs. similarity t = sim(C1, C2), with similarity threshold s.)
Prob. that at least 1 band is identical:
s     1 − (1 − s^r)^b
0.2   0.006
0.3   0.047
0.4   0.186
0.5   0.470
0.6   0.802
0.7   0.975
0.8   0.9996
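A quick sketch that evaluates 1 − (1 − s^r)^b; with the r = 5, b = 20 values of the running example it matches the table above:

```python
def p_candidate(s: float, r: int, b: int) -> float:
    """Probability that two columns of similarity s become a candidate pair."""
    return 1.0 - (1.0 - s**r) ** b

for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"{s:.1f}  {p_candidate(s, r=5, b=20):.4f}")
# 0.2 -> 0.0064, 0.3 -> 0.0474, ..., 0.8 -> 0.9996
```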
Picking r and b to get the best S-curve
- 50 hash-functions (r = 5, b = 10)
(Plot: prob. of sharing a bucket vs. similarity; blue area = false negative rate, green area = false positive rate.)
Tune M, b, r to get almost all pairs with
similar signatures, but eliminate most pairs that do not have similar signatures
Check in main memory that candidate pairs
really do have similar signatures
Optional: In another pass through data,
check that the remaining candidate pairs really represent similar documents
Shingling: Convert documents to set representation
- We used hashing to assign each shingle an ID
Min-Hashing: Convert large sets to short signatures,
while preserving similarity
- We used similarity preserving hashing to generate
signatures with property Pr[h(C1) = h(C2)] = sim(C1, C2)
- We used hashing to get around generating random
permutations
Locality-Sensitive Hashing: Focus on pairs of
signatures likely to be from similar documents
- We used hashing to find candidate pairs of similarity ≥ s
Task: Given a large number (N in the millions or
billions) of documents, find “near duplicates”
Problem:
- Too many documents to compare all pairs
Solution: Hash documents so that similar
documents hash into the same bucket
- Documents in the same bucket are then
candidate pairs whose similarity is then evaluated
(Pipeline recap: Document → the set of strings of length k that appear in the document → signatures: short integer vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity.)
A k-shingle (or k-gram) is a sequence of k
tokens that appears in the document
- Example: k=2; D1 = abcab
Set of 2-shingles: C1 = S(D1) = {ab, bc, ca}
Represent a doc by a set of hash values of its
k-shingles
A natural similarity measure is then the Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
- Similarity of two documents is the Jaccard similarity of their shingle sets
Min-Hashing: Convert large sets into short signatures,
while preserving similarity: Pr[h(C1) = h(C2)] = sim(D1, D2)
Similarities of columns and signatures (approx.) match!
- Column pairs 1-3, 2-4, 1-2, 3-4: Col/Col = 0.75, 0.75, 0, 0; Sig/Sig = 0.34, 0.67, 0, 0
(Figure: the input shingles × documents matrix, a permutation, and the resulting signature matrix M.)
Hash columns of the signature matrix M:
Similar columns likely hash to same bucket
- Divide matrix M into b bands of r rows (# rows of M = b·r)
- Candidate column pairs are those that hash
to the same bucket for ≥ 1 band
(Figure: matrix M split into b bands of r rows, hashed into buckets; the resulting probability of sharing ≥ 1 bucket vs. similarity is an S-curve with threshold s.)
(Pipeline: Points → signatures: short integer signatures that reflect point similarity → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity.)
Design a locality sensitive hash function (for a given distance metric)
Apply the “Bands” technique
The S-curve is where the “magic” happens
(Figure: probability of sharing ≥ 1 bucket vs. similarity t of two sets, with threshold s.)
Remember: Probability of equal hash-values = similarity
- This is what 1 hash-code gives you: Pr[h(C1) = h(C2)] = sim(D1, D2)
- This is what we want: no chance if t < s, probability = 1 if t > s
How to get a step-function? By choosing r and b!
Remember: b bands, r rows/band. Let sim(C1, C2) = s
What’s the prob. that at least 1 band is equal?
Pick some band (r rows)
- Prob. that elements in a single row of columns C1 and C2 are equal = s
- Prob. that all rows in a band are equal = s^r
- Prob. that some row in a band is not equal = 1 − s^r
Prob. that all bands are not equal = (1 − s^r)^b
Prob. that at least 1 band is equal = 1 − (1 − s^r)^b
P(C1, C2 is a candidate pair) = 1 − (1 − s^r)^b
Picking r and b to get the best S-curve
- 50 hash-functions (r = 5, b = 10)
(Plot: prob. of sharing a bucket vs. similarity s.)
(Plots of Prob(Candidate pair) = 1 − (1 − t^r)^b vs. similarity, for r = 1..10 with b = 1; r = 1 with b = 1..10; r = 5 with b = 1..50; r = 10 with b = 1..50.)
Given a fixed threshold s, we want to choose r and b such that P(Candidate pair) has a “step” right around s.
(Pipeline: signatures: short vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity.)
We have used LSH to find similar documents
- More generally, we found similar columns in large
sparse matrices with high Jaccard similarity
Can we use LSH for other distance measures?
- e.g., Euclidean distances, Cosine distance
- Let’s generalize what we’ve learned!
d(·) is a distance measure if it is a function from pairs of points x, y to real numbers such that:
- d(x, y) ≥ 0
- d(x, y) = 0 iff x = y
- d(x, y) = d(y, x)
- d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)
Jaccard distance for sets = 1 minus Jaccard similarity Cosine distance for vectors = angle between the vectors Euclidean distances:
- L2 norm: d(x,y) = square root of the sum of the squares of the
differences between x and y in each dimension
- The most common notion of “distance”
- L1 norm: sum of the differences in each dimension
- Manhattan distance = distance if you travel along coordinates only
d(x, y) ≥ 0 because |x ∩ y| ≤ |x ∪ y|
- Thus, similarity ≤ 1 and distance = 1 − similarity ≥ 0
d(x, x) = 0 because x ∩ x = x ∪ x. And if x ≠ y, then |x ∩ y| is strictly less than |x ∪ y|, so sim(x, y) < 1; thus d(x, y) > 0
d(x, y) = d(y, x) because union and intersection are symmetric
d(x, y) ≤ d(x, z) + d(z, y) is trickier – we need to show:
1 − |x ∩ z|/|x ∪ z| + 1 − |y ∩ z|/|y ∪ z| ≥ 1 − |x ∩ y|/|x ∪ y|
To show: 1 − |x ∩ z|/|x ∪ z| + 1 − |y ∩ z|/|y ∪ z| ≥ 1 − |x ∩ y|/|x ∪ y|
Remember: |a ∩ b| / |a ∪ b| = probability that minhash(a) = minhash(b)
Thus, 1 − |a ∩ b| / |a ∪ b| = probability that minhash(a) ≠ minhash(b)
Need to show: Pr[minhash(x) ≠ minhash(y)] ≤ Pr[minhash(x) ≠ minhash(z)] + Pr[minhash(z) ≠ minhash(y)]
i.e., d(x, y) ≤ d(x, z) + d(z, y)
Whenever minhash(x) ≠ minhash(y), at least one of minhash(x) ≠ minhash(z) and minhash(z) ≠ minhash(y) must be true:
minhash(x) ≠ minhash(z) OR minhash(z) ≠ minhash(y) follows from minhash(x) ≠ minhash(y)
For Min-Hashing signatures, we got a Min-Hash
function for each permutation of rows
A “hash function” is any function that allows us
to say whether two elements are “equal”
- Shorthand: h(x) = h(y) means “h says x and y are equal”
A family of hash functions is any set of hash
functions from which we can pick one at random efficiently
- Example: The set of Min-Hash functions generated
from permutations of rows
Suppose we have a space S of points with a distance measure d(x,y)
A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:
- 1. If d(x, y) < d1, then the probability over all h ∈ H that h(x) = h(y) is at least p1
- 2. If d(x, y) > d2, then the probability over all h ∈ H that h(x) = h(y) is at most p2
With a LS Family we can do LSH!
Critical assumption:
(Figure: Pr[h(x) = h(y)] vs. distance d(x, y), with a distance threshold t – small distance (< d1): probability at least p1 of hashing to the same value; large distance (> d2): probability at most p2.)
Let:
- S = space of all sets,
- d = Jaccard distance,
- H is family of Min-Hash functions for all
permutations of rows
Then for any hash function h ∈ H:
Pr[h(x) = h(y)] = 1 − d(x, y)
- Simply restates theorem about Min-Hashing
in terms of distances rather than similarities
Claim: Min-hash H is a (1/3, 2/3, 2/3, 1/3)-
sensitive family for S and d.
For Jaccard similarity, Min-Hashing gives a
(d1,d2,(1-d1),(1-d2))-sensitive family for any d1<d2
If distance < 1/3 (so similarity > 2/3), then the probability that the Min-Hash values agree is > 2/3
Can we reproduce the
“S-curve” effect we saw before for any LS family?
The “bands” technique we learned for signature
matrices carries over to this more general setting
Can do LSH with any (d1, d2, p1, p2)-sensitive
family!
Two constructions:
- AND construction like “rows in a band”
- OR construction like “many bands”
(Figure: prob. of sharing a bucket vs. similarity t for the AND construction over h1, …, hr.)
AND lowers the probability for large distances (Good), but also lowers the probability for small distances (Bad)
Given family H, construct family H’ consisting of r functions from H
For h = [h1, …, hr] in H’, we say h(x) = h(y) if and only if hi(x) = hi(y) for all i
- Note this corresponds to creating a band of size r
Theorem: If H is (d1, d2, p1, p2)-sensitive, then H’ is (d1, d2, (p1)^r, (p2)^r)-sensitive
Proof: Use the fact that the hi’s are independent
Proof: Use the fact that hi ’s are independent
Independence of hash functions (HFs) really
means that the prob. of two HFs saying “yes” is the product of each saying “yes”
- But two particular hash functions could be highly
correlated
- For example, in Min-Hash if their permutations agree in
the first one million entries
- However, the probabilities in definition of a
LSH-family are over all possible members of H, H’ (i.e., average case and not the worst case)
OR raises the probability for small distances (Good), but also raises the probability for large distances (Bad)
Given family H, construct family H’ consisting of b functions from H
For h = [h1, …, hb] in H’, h(x) = h(y) if and only if hi(x) = hi(y) for at least 1 i
Theorem: If H is (d1, d2, p1, p2)-sensitive, then H’ is (d1, d2, 1 − (1 − p1)^b, 1 − (1 − p2)^b)-sensitive
Proof: Use the fact that the hi’s are independent
AND makes all probs. shrink, but by choosing r
correctly, we can make the lower prob. approach 0 while the higher does not
OR makes all probs. grow, but by choosing b correctly,
we can make the upper prob. approach 1 while the lower does not
(Plots of prob. of sharing a bucket vs. similarity of a pair of items: AND with r = 1..10, b = 1; OR with r = 1, b = 1..10.)
By choosing b and r correctly, we can make
the lower probability approach 0 while the higher approaches 1
As for the signature matrix, we can use the
AND construction followed by the OR construction
- Or vice-versa
- Or any sequence of AND’s and OR’s alternating
r-way AND followed by b-way OR construction
- Exactly what we did with Min-Hashing
- AND: If bands match in all r values hash to same bucket
- OR: Cols that have ≥ 1 common bucket → Candidate
Take points x and y s.t. Pr[h(x) = h(y)] = s
- H will make (x, y) a candidate pair with prob. s
Construction makes (x, y) a candidate pair with probability 1 − (1 − s^r)^b – the S-curve!
- Example: Take H and construct H’ by the AND
construction with r = 4. Then, from H’, construct H’’ by the OR construction with b = 4
s     p = 1 − (1 − s^4)^4
.2    .0064
.3    .0320
.4    .0985
.5    .2275
.6    .4260
.7    .6666
.8    .8785
.9    .9860
r = 4, b = 4 transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.8785,.0064)-sensitive family.
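A tiny numeric sketch of the r-way AND followed by b-way OR amplification; with r = b = 4 it reproduces the numbers above:

```python
def and_or(p: float, r: int, b: int) -> float:
    """Probability after an r-way AND followed by a b-way OR construction."""
    return 1.0 - (1.0 - p**r) ** b

# Amplifying a (.2, .8, .8, .2)-sensitive family with r = 4, b = 4:
print(and_or(0.8, 4, 4))   # ~0.8785 -> new p1
print(and_or(0.2, 4, 4))   # ~0.0064 -> new p2
```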
Picking r and b to get desired performance
- 50 hash-functions (r = 5, b = 10)
(Plot: Prob(Candidate pair) vs. similarity s, with threshold s. Blue area X: false negative rate – pairs with sim > s that never share a band, so they never become candidates and we never consider them for the (slow/exact) similarity calculation. Green area Y: false positive rate – pairs with sim < s that we do consider as candidates; this is not too bad, we compute their (slow/exact) similarity and discard them.)
Picking r and b to get desired performance
- 50 hash-functions (r · b = 50)
(Plot: Prob(Candidate pair) vs. similarity s, with threshold s, for r = 2, b = 25; r = 5, b = 10; r = 10, b = 5.)
Apply a b-way OR construction followed by
an r-way AND construction
Transforms similarity s (probability p) into (1 − (1 − s)^b)^r
- The same S-curve, mirrored horizontally and
vertically
Example: Take H and construct H’ by the OR
construction with b = 4. Then, from H’, construct H’’ by the AND construction with r = 4
s     p = (1 − (1 − s)^4)^4
.1    .0140
.2    .1215
.3    .3334
.4    .5740
.5    .7725
.6    .9015
.7    .9680
.8    .9936
(Plot: Prob(Candidate pair) vs. similarity s for the (4, 4) OR-AND construction.)
Example: Apply the (4,4) OR-AND construction
followed by the (4,4) AND-OR construction
Transforms a (.2, .8, .8, .2)-sensitive family into
a (.2, .8, .9999996, .0008715)-sensitive family
- Note this family uses 256 (= 4·4·4·4) of the original hash functions
For each AND-OR S-curve 1 − (1 − s^r)^b, there is a threshold t for which 1 − (1 − t^r)^b = t
Above t, high probabilities are increased; below t, low probabilities are decreased
You improve the sensitivity as long as the low probability is less than t and the high probability is greater than t
- Iterate as you like
Similar observation for the OR-AND type of S-curve: (1 − (1 − s)^b)^r
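An illustrative way (not from the slides) to locate that fixed point t numerically is bisection on 1 − (1 − t^r)^b − t, which is negative below t and positive above it:

```python
def s_curve_threshold(r: int, b: int, iters: int = 60) -> float:
    """Fixed point t in (0, 1) with 1 - (1 - t**r)**b == t, found by bisection."""
    f = lambda t: 1.0 - (1.0 - t**r) ** b - t   # negative below the fixed point
    lo, hi = 1e-9, 1.0 - 1e-9
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid            # still below the fixed point
        else:
            hi = mid
    return (lo + hi) / 2

print(s_curve_threshold(r=5, b=20))   # threshold of the running r = 5, b = 20 example
```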
(Figure: Prob(Candidate pair) vs. s with the threshold t marked; probabilities below t are lowered, probabilities above t are raised.)
Pick any two distances d1 < d2 Start with a (d1, d2, (1- d1), (1- d2))-sensitive
family
Apply constructions to amplify it into a (d1, d2, p1, p2)-sensitive family, where p1 is almost 1 and p2 is almost 0
The closer to 0 and 1 we get, the more
hash functions must be used!
LSH methods for other distance metrics:
- Cosine distance: Random hyperplanes
- Euclidean distance: Project on lines
(Generalized pipeline: Points → signatures: short integer signatures that reflect their similarity → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity. Design a (d1, d2, p1, p2)-sensitive family of hash functions for that particular distance metric – this step depends on the distance function used – then amplify the family using AND and OR constructions.)
(Summary diagram: Documents → MinHash integer signatures → “bands” technique → candidate pairs; Data points → Random Hyperplanes ±1 signatures → “bands” technique → candidate pairs.)
Cosine distance = angle between the vectors from the origin to the points in question:
d(A, B) = θ = arccos(A·B / (‖A‖·‖B‖))
- Has range 0…π (equivalently 0…180°)
- Can divide by π to have distance in range 0…1
Cosine similarity = 1 − d(A, B)
- But often defined as cosine sim: cos(θ) = A·B / (‖A‖·‖B‖)
- Has range −1…1 for general vectors
- Range 0…1 for non-negative vectors (angles up to 90°)
(Figure: vectors A and B with angle θ and the projection A·B/‖B‖.)
For cosine distance, there is a technique called Random Hyperplanes
- Technique similar to Min-Hashing
Random Hyperplanes method is a (d1, d2, (1 − d1/π), (1 − d2/π))-sensitive family for any d1 and d2
Reminder: (d1, d2, p1, p2)-sensitive
1. If d(x, y) < d1, then prob. that h(x) = h(y) is at least p1
2. If d(x, y) > d2, then prob. that h(x) = h(y) is at most p2
Each vector v determines a hash function hv with two buckets:
hv(x) = +1 if v·x ≥ 0; hv(x) = −1 if v·x < 0
LS-family H = set of all functions derived from any vector
Claim: For points x and y, Pr[h(x) = h(y)] = 1 − d(x, y)/π
(Figure: look in the plane of x and y, with angle θ between them. A hyperplane normal to v lying outside the angle gives h(x) = h(y); a hyperplane normal to v’ lying inside the angle gives h(x) ≠ h(y). What matters is where the hyperplane falls, not where the normal vector points.)
So: Prob[hyperplane falls inside the angle, i.e., h(x) ≠ h(y)] = θ/π
So: P[h(x) = h(y)] = 1 − θ/π = 1 − d(x, y)/π
Pick some number of random vectors, and
hash your data for each vector
The result is a signature (sketch) of
+1’s and –1’s for each data point
Can be used for LSH like we used the
Min-Hash signatures for Jaccard distance
Amplify using AND/OR constructions
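A minimal random-hyperplanes sketch in Python (helper names are illustrative; it already uses the ±1-component trick discussed on the next slide):

```python
import random

def hyperplane_sketch(x, n_bits=64, seed=0):
    """Signature of +1/-1 values, one per random hyperplane (same seed for all points)."""
    rng = random.Random(seed)
    sketch = []
    for _ in range(n_bits):
        v = [rng.choice((-1, 1)) for _ in x]            # random +/-1 normal vector
        dot = sum(vi * xi for vi, xi in zip(v, x))
        sketch.append(1 if dot >= 0 else -1)
    return sketch

# Fraction of agreeing sketch positions estimates 1 - theta/pi for two points.
```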
Expensive to pick a random vector in M
dimensions for large M
- Would have to generate M random numbers
A more efficient approach
- It suffices to consider only vectors v
consisting of +1 and –1 components
- Why? Assuming the data is random, vectors of +1/−1 components cover the space evenly (and do not bias in any way)
Idea: Hash functions correspond to lines Partition the line into buckets of size a Hash each point to the bucket containing its
projection onto the line
- An element of the “Signature” is a bucket id for
that given projection line
Nearby points are always close;
distant points are rarely in same bucket
“Lucky” case:
- Points that are close
hash in the same bucket
- Distant points end up in
different buckets
Two “unlucky” cases:
- Top: unlucky
quantization
- Bottom: unlucky
projection
(Figures: points projected onto a randomly chosen line partitioned into buckets of width a.
- If d << a, the chance the points land in the same bucket is at least 1 − d/a.
- If d >> a, the projected distance is d·cos θ, so θ must be close to 90° for there to be any chance the points go to the same bucket.)
If points are at distance d < a/2, the prob. they are in the same bucket is ≥ 1 − d/a ≥ ½
If points are at distance d > 2a apart, then they can be in the same bucket only if d·cos θ ≤ a
- cos θ ≤ ½
- 60° < θ < 90°, i.e., at most 1/3 probability
Yields a (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions for any a
Amplify using AND-OR cascades
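A sketch of the project-onto-a-random-line hash (the Gaussian direction and helper names are illustrative assumptions):

```python
import math
import random

def line_bucket(x, a=1.0, seed=0):
    """Bucket id of x's projection onto one random line, with buckets of width a."""
    rng = random.Random(seed)                     # same seed -> same line for all points
    v = [rng.gauss(0, 1) for _ in x]              # random direction
    norm = math.sqrt(sum(vi * vi for vi in v))
    proj = sum(vi * xi for vi, xi in zip(v, x)) / norm   # signed length of the projection
    return math.floor(proj / a)                   # bucket containing the projection

# One such bucket id per random line forms an element of the point's signature.
```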
(Summary diagram: Documents → MinHash integer signatures → “bands” technique → candidate pairs; Data points → Random Hyperplanes ±1 signatures → “bands” technique → candidate pairs. In general: design a (d1, d2, p1, p2)-sensitive family of hash functions for the particular distance metric, then amplify the family using AND and OR constructions.)
The property Pr[h(C1) = h(C2)] = sim(C1, C2) of the hash function h is the essential part of LSH; without it we can’t do anything
LS-hash functions transform data to signatures so that the bands technique (AND, OR constructions) can then be applied