High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction
Graph data: PageRank, SimRank, Network Analysis, Spam Detection
Infinite data: Filtering data streams, Web advertising, Queries on streams
Machine learning: SVM, Decision Trees, Perceptron, kNN
Apps: Recommender systems, Association Rules, Duplicate document detection
Given a query image patch, find similar images
Collect billions of images. Determine a feature vector for each image (4k dim). Given a query image Q, find its nearest neighbors FAST.
(Figure: Hamming distance between the bit feature vectors of query image Q and image B gives Similarity(Q, B).)
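As a small illustration (not from the slides), similarity between two bit feature vectors can be measured as the fraction of agreeing positions:

```python
def hamming_similarity(q: list, b: list) -> float:
    """Fraction of positions where two equal-length bit vectors agree."""
    assert len(q) == len(b)
    return sum(qi == bi for qi, bi in zip(q, b)) / len(q)

# e.g. hamming_similarity([1, 0, 1, 1], [1, 1, 1, 0]) -> 0.5
```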
Many problems can be expressed as
finding “similar” sets:
- Find near-neighbors in high-dimensional space
Examples:
- Pages with similar words
- For duplicate detection, classification by topic
- Customers who purchased similar products
- Products with similar customer sets
- Images with similar features
- Image completion
- Recommendations and search
Given: High dimensional data points x1, x2, …
- For example: Image is a long vector of pixel colors
And some distance function d(x1, x2)
- which quantifies the “distance” between x1 and x2
Goal: Find all pairs of data points (xi, xj) that are within distance threshold: d(xi, xj) ≤ s
Note: Naïve solution would take O(N²), where N is the number of data points
MAGIC: This can be done in O(N)!! How??
LSH is really a family of related techniques
In general, one throws items into buckets using several different “hash functions”
You examine only those pairs of items that share a bucket for at least one of these hashings
Upside: Designed correctly, only a small fraction of pairs are ever examined
Downside: There are false negatives – pairs of similar items that never even get considered
Suppose we need to find near-duplicate documents among N = 1 million documents
- Naïvely, we would have to compute pairwise similarities for every pair of docs
- N(N−1)/2 ≈ 5·10^11 comparisons
- At 10^5 secs/day and 10^6 comparisons/sec, it would take 5 days
- For N = 10 million, it takes more than a year…
Similarly, given a dataset of 10 million images, quickly find the ones most similar to a query image Q
1. Shingling: Convert a document into a set representation (Boolean vector)
2. Min-Hashing: Convert large sets to short signatures, while preserving similarity
3. Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
- Candidate pairs!
(Pipeline: Document → Shingling → the set of strings of length k that appear in the document → Min-Hashing → signatures: short integer vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity.)
Step 1: Shingling: Convert a document into a set
(Document → the set of strings of length k that appear in the document)
A k-shingle (or k-gram) for a document is a
sequence of k tokens that appears in the doc
- Tokens can be characters, words or something else,
depending on the application
- Assume tokens = characters for examples
To compress long shingles, we can hash them to
(say) 4 bytes
Represent a document by the set of hash
values of its k-shingles
Example: k=2; document D1= abcab
Set of 2-shingles: S(D1) = {ab, bc, ca} Hash the shingles: h(D1) = {1, 5, 7}
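A minimal shingling sketch in Python (the function name and the use of MD5 for the 4-byte hash are illustrative assumptions, not from the lecture):

```python
import hashlib

def shingle_set(doc: str, k: int = 8) -> set:
    """Set of 4-byte hash values of all character k-shingles of doc."""
    shingles = {doc[i:i + k] for i in range(len(doc) - k + 1)}
    # Compress each shingle to a 32-bit (4-byte) integer, as the slide suggests.
    return {int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")
            for s in shingles}

# Slide example with k = 2 and D1 = "abcab": three distinct shingles {ab, bc, ca}.
print(len(shingle_set("abcab", k=2)))   # -> 3
```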
k = 8, 9, or 10 is often used in practice
Benefits of shingles:
- Documents that are intuitively similar will have
many shingles in common
- Changing a word only affects k-shingles within
distance k-1 from the word
Document D1 is a set of its k-shingles C1 = S(D1)
A natural similarity measure is the
Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
Jaccard distance: d(C1, C2) = 1 − |C1 ∩ C2| / |C1 ∪ C2|
Example: 3 in intersection, 8 in union → Jaccard similarity = 3/8
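A direct translation of the definition into Python (a sketch, reusing the hypothetical shingle_set above):

```python
def jaccard(c1: set, c2: set) -> float:
    """Jaccard similarity |C1 ∩ C2| / |C1 ∪ C2| of two shingle sets."""
    if not c1 and not c2:
        return 1.0
    return len(c1 & c2) / len(c1 | c2)

# e.g. 3 shingles in the intersection and 8 in the union -> 3/8 = 0.375
```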
Encode sets using 0/1 (bit, Boolean) vectors
Rows = elements (shingles) Columns = sets (documents)
- 1 in row e and column s if and only if e is a member of s
- Column similarity is the Jaccard
similarity of the corresponding sets (rows with value 1)
- Typical matrix is sparse!
Each document is a column:
- Example: sim(C1 ,C2) = ?
- Size of intersection = 3; size of union = 6,
Jaccard similarity (not distance) = 3/6
- d(C1,C2) = 1 – (Jaccard similarity) = 3/6
(Example: a 0/1 matrix with shingles as rows and documents as columns.)
We don’t really construct the matrix; just imagine it exists
So far:
- Documents → Sets of shingles
- Represent sets as boolean vectors in a matrix
Next goal: Find similar columns while
computing small signatures
- Similarity of columns == similarity of signatures
Warnings:
- Comparing all pairs takes too much time: Job for LSH
- These methods can produce false negatives, and even false
positives (if the optional check is not made)
Step 2: Min-Hashing: Convert large sets to short signatures, while preserving similarity
(Document → the set of strings of length k that appear in the document → signatures: short integer vectors that represent the sets, and reflect their similarity)
Key idea: “hash” each column C to a small
signature h(C), such that:
- sim(C1, C2) is the same as the “similarity” of
signatures h(C1) and h(C2)
Goal: Find a hash function h(·) such that:
- If sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
- If sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)
Idea: Hash docs into buckets. Expect that
“most” pairs of near duplicate docs hash into the same bucket!
Goal: Find a hash function h(·) such that:
- if sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
- if sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)
Clearly, the hash function depends on
the similarity metric:
- Not all similarity metrics have a suitable
hash function
There is a suitable hash function for
the Jaccard similarity: It is called Min-Hashing
Permute the rows of the Boolean matrix
- Thought experiment – not real
Define the minhash function for a permutation π:
- hπ(C) = the number of the first (in the permuted order) row in which column C has 1
- hπ(C) = minπ π(C)
Apply, to all columns, several randomly chosen permutations to create a signature for each column
Result is a signature matrix: columns = sets, rows = minhash values, in order for that column
(Example: a shingles × documents input matrix, three random row permutations, and the resulting signature matrix M; e.g. h2(3) = 1 – under permutation 2, the first row in permuted order with a 1 in column 3 is row 1.)
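A sketch of the permutation “thought experiment” in Python (illustrative; the real one-pass implementation appears a few slides later):

```python
import random

def minhash_signature(columns, n_rows, n_perms=3, seed=0):
    """columns: list of sets of row indices that contain a 1.
    Returns the signature matrix: one row per permutation, one entry per column."""
    rng = random.Random(seed)
    signature = []
    for _ in range(n_perms):
        perm = list(range(n_rows))
        rng.shuffle(perm)          # perm[r] = position of row r in the permuted order
        # minhash value of a column = smallest permuted position among its 1-rows
        signature.append([min(perm[r] for r in col) for col in columns])
    return signature
```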
Students sometimes ask whether the minhash
value should be the original number of the row, or the number in the permuted order (as we did in our example).
Answer: it doesn’t matter
- You only need to be consistent, and assure that
two columns get the same value if and only if their first 1’s in the permuted order are in the same row
Choose a random permutation π
Claim: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
Why?
- Let X be a doc (set of shingles), z ∈ X is a shingle
- Then: Pr[π(z) = min(π(X))] = 1/|X|
- It is equally likely that any z ∈ X is mapped to the min element
- Let y be s.t. π(y) = min(π(C1 ∪ C2))
- Then either: π(y) = min(π(C1)) if y ∈ C1, or π(y) = min(π(C2)) if y ∈ C2
- So the prob. that both are true is the prob. y ∈ C1 ∩ C2
- Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)
(One of the two columns had to have a 1 at position y.)
Given cols C1 and C2, rows are classified as:
- Type A: C1 = 1, C2 = 1
- Type B: C1 = 1, C2 = 0
- Type C: C1 = 0, C2 = 1
- Type D: C1 = 0, C2 = 0
- Define: a = # rows of type A, etc.
Note: sim(C1, C2) = a/(a + b + c)
Then: Pr[h(C1) = h(C2)] = sim(C1, C2)
- Look down the permuted cols C1 and C2 until we see a 1
- If it’s a type-A row, then h(C1) = h(C2); if a type-B or type-C row, then not
We know: Pr[h(C1) = h(C2)] = sim(C1, C2) Now generalize to multiple hash functions The similarity of two signatures is the
fraction of the hash functions in which they agree
Thus, the expected similarity of two
signatures equals the Jaccard similarity of the columns or sets that the signatures represent.
- And the longer the signatures, the smaller will be
the expected error.
Similarities for column pairs 1-3, 2-4, 1-2, 3-4:
- Col/Col: 0.75, 0.75, 0, 0
- Sig/Sig: 0.34, 0.67, 0, 0
(Figure: the input shingles × documents matrix, a permutation, and the resulting signature matrix M.)
Permuting rows even once is prohibitive → Row hashing!
- Pick K = 100 hash functions hi
- Ordering under hi gives a random permutation of rows!
One-pass implementation
- For each column c and hash-func. hi keep a “slot” M(i, c) for the min-hash value
- Initialize all M(i, c) = ∞
- Scan rows looking for 1s
- Suppose row j has 1 in column c
- Then for each hi:
- If hi(j) < M(i, c), then M(i, c) ← hi(j)
How to pick a random hash function h(x)?
Universal hashing: h_{a,b}(x) = ((a·x + b) mod p) mod N
where: a, b … random integers; p … prime number (p > N)
for each row r do begin
  for each hash function hi do
    compute hi(r);
  for each column c
    if c has 1 in row r
      for each hash function hi do
        if hi(r) < M(i, c) then M(i, c) := hi(r);
end;

Important: this way you hash r only once per hash function, not once per 1 in row r.
(Worked example: rows 1–5, two columns C1 and C2, and two hash functions h(x) = x mod 5 and g(x) = (2x+1) mod 5. Starting from M(i, c) = ∞ and scanning the rows, the slots M(i, C1) and M(i, C2) are updated until they hold the final signature matrix M.)
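A compact Python sketch of the one-pass row-hashing implementation (function names and the choice of prime are illustrative assumptions):

```python
import random

def minhash_signatures(columns, n_rows, k=100, seed=0):
    """One-pass min-hashing. columns: list of sets of row indices containing a 1."""
    p = 2_147_483_647                    # a prime larger than n_rows (assumption)
    rng = random.Random(seed)
    funcs = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(k)]
    M = [[float("inf")] * len(columns) for _ in range(k)]    # slots M(i, c)

    for r in range(n_rows):                                  # scan rows looking for 1s
        h = [((a * r + b) % p) % n_rows for a, b in funcs]   # hash r once per function
        for c, col in enumerate(columns):
            if r in col:                                     # row r has a 1 in column c
                for i in range(k):
                    if h[i] < M[i][c]:
                        M[i][c] = h[i]
    return M
```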
Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
(Document → shingle set → signatures: short integer vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity)
Goal: Find documents with Jaccard similarity at
least s (for some similarity threshold, e.g., s=0.8)
LSH – General idea: Use a hash function that tells whether x and y are a candidate pair: a pair of elements whose similarity must be evaluated
For Min-Hash matrices:
- Hash columns of signature matrix M to many buckets
- Each pair of documents that hashes into the
same bucket is a candidate pair
Pick a similarity threshold s (0 < s < 1) Columns x and y of M are a candidate pair if
their signatures agree on at least fraction s of their rows: M (i, x) = M (i, y) for at least frac. s values of i
- We expect documents x and y to have the same
(Jaccard) similarity as their signatures
Big idea: Hash columns of
signature matrix M several times
Arrange that (only) similar columns are
likely to hash to the same bucket, with high probability
Candidate pairs are those that hash to the
same bucket
(Figure: the signature matrix M, with r rows per band and b bands; each column is one signature.)
Divide matrix M into b bands of r rows For each band, hash its portion of each
column to a hash table with k buckets
- Make k as large as possible
Candidate column pairs are those that hash
to the same bucket for ≥ 1 band
Tune b and r to catch most similar pairs,
but few non-similar pairs
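A sketch of the banding technique over a signature matrix (helper names and the use of Python tuples as bucket keys are illustrative assumptions):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(M, b, r):
    """M: signature matrix as a list of b*r rows; returns candidate column pairs."""
    n_cols = len(M[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(n_cols):
            # this band's portion of column c determines its bucket
            piece = tuple(M[band * r + i][c] for i in range(r))
            buckets[piece].append(c)
        for cols in buckets.values():        # columns sharing a bucket in >= 1 band
            for x, y in combinations(cols, 2):
                candidates.add((min(x, y), max(x, y)))
    return candidates
```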
Columns 2 and 6 are probably identical (candidate pair) Columns 6 and 7 are surely different.
There are enough buckets that columns are
unlikely to hash to the same bucket unless they are identical in a particular band
Hereafter, we assume that “same bucket”
means “identical in that band”
Assumption needed only to simplify analysis,
not for correctness of algorithm
Assume the following case:
Suppose 100,000 columns of M (100k docs) Signatures of 100 integers (rows) Therefore, signatures take 40MB Goal: Find pairs of documents that
are at least s = 0.8 similar
Choose b = 20 bands of r = 5 integers/band
Find pairs of s = 0.8 similarity; set b = 20, r = 5
Assume: sim(C1, C2) = 0.8
- Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: We want them to hash to at least 1 common bucket (at least one band is identical)
Probability C1, C2 identical in one particular band: (0.8)^5 = 0.328
Probability C1, C2 are not identical in any of the 20 bands: (1 − 0.328)^20 = 0.00035
- i.e., about 1/3000th of the 80%-similar column pairs are false negatives (we miss them)
- We would find 99.965% of the pairs of truly similar documents
Find pairs of s = 0.8 similarity; set b = 20, r = 5
Assume: sim(C1, C2) = 0.3
- Since sim(C1, C2) < s, we want C1, C2 to hash to NO common buckets (all bands should be different)
Probability C1, C2 identical in one particular band: (0.3)^5 = 0.00243
Probability C1, C2 identical in at least 1 of 20 bands: 1 − (1 − 0.00243)^20 = 0.0474
- In other words, approximately 4.74% of pairs of docs with similarity 0.3 end up becoming candidate pairs
- They are false positives since we will have to examine them (they are candidate pairs) but then it will turn out their similarity is below threshold s
Pick:
- The number of Min-Hashes (rows of M)
- The number of bands b, and
- The number of rows r per band
to balance false positives/negatives
Example: If we had only 10 bands of 10
rows, the number of false positives would go down, but the number of false negatives would go up
(Figure: probability of sharing a bucket vs. similarity t = sim(C1, C2), with similarity threshold s. The ideal is a step function: no chance if t < s, probability = 1 if t > s – say “yes” if you are below the line.)
Remember: Probability of equal hash-values = similarity
(Figure: probability of sharing a bucket vs. similarity t = sim(C1, C2) for a single hash function.)
(Figure: the actual probability of sharing a bucket vs. similarity t, with threshold s; the areas on the wrong side of the threshold give the false positives and false negatives – say “yes” if you are below the line.)
Say columns C1 and C2 have similarity t. Pick any band (r rows):
- Prob. that all rows in band are equal = t^r
- Prob. that some row in band is unequal = 1 − t^r
Prob. that no band is identical = (1 − t^r)^b
Prob. that at least 1 band is identical = 1 − (1 − t^r)^b
Reading the formula 1 − (1 − t^r)^b:
- t^r: all rows of a band are equal
- 1 − t^r: some row of a band is unequal
- (1 − t^r)^b: no band is identical
- 1 − (1 − t^r)^b: at least one band is identical
(Figure: probability of sharing a bucket vs. similarity t = sim(C1, C2), with similarity threshold s.)
Prob. that at least 1 band is identical:
s     1 − (1 − s^r)^b
0.2   0.006
0.3   0.047
0.4   0.186
0.5   0.470
0.6   0.802
0.7   0.975
0.8   0.9996
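A quick sketch that evaluates 1 − (1 − s^r)^b; with the r = 5, b = 20 values of the running example it matches the table above:

```python
def p_candidate(s: float, r: int, b: int) -> float:
    """Probability that two columns of similarity s become a candidate pair."""
    return 1.0 - (1.0 - s**r) ** b

for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"{s:.1f}  {p_candidate(s, r=5, b=20):.4f}")
# 0.2 -> 0.0064, 0.3 -> 0.0474, ..., 0.8 -> 0.9996
```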
Picking r and b to get the best S-curve
- 50 hash-functions (r = 5, b = 10)
(Plot: prob. of sharing a bucket vs. similarity; blue area = false negative rate, green area = false positive rate.)
Tune M, b, r to get almost all pairs with
similar signatures, but eliminate most pairs that do not have similar signatures
Check in main memory that candidate pairs
really do have similar signatures
Optional: In another pass through data,
check that the remaining candidate pairs really represent similar documents
Shingling: Convert documents to set representation
- We used hashing to assign each shingle an ID
Min-Hashing: Convert large sets to short signatures,
while preserving similarity
- We used similarity preserving hashing to generate
signatures with property Pr[h(C1) = h(C2)] = sim(C1, C2)
- We used hashing to get around generating random
permutations
Locality-Sensitive Hashing: Focus on pairs of
signatures likely to be from similar documents
- We used hashing to find candidate pairs of similarity ≥ s
Task: Given a large number (N in the millions or
billions) of documents, find “near duplicates”
Problem:
- Too many documents to compare all pairs
Solution: Hash documents so that similar
documents hash into the same bucket
- Documents in the same bucket are then
candidate pairs whose similarity is then evaluated
(Pipeline recap: Document → the set of strings of length k that appear in the document → signatures: short integer vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity.)
A k-shingle (or k-gram) is a sequence of k
tokens that appears in the document
- Example: k=2; D1 = abcab
Set of 2-shingles: C1 = S(D1) = {ab, bc, ca}
Represent a doc by a set of hash values of its
k-shingles
A natural similarity measure is then the Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
- Similarity of two documents is the Jaccard similarity of their shingle sets
Min-Hashing: Convert large sets into short signatures,
while preserving similarity: Pr[h(C1) = h(C2)] = sim(D1, D2)
Similarities of columns and signatures (approx.) match!
- Column pairs 1-3, 2-4, 1-2, 3-4: Col/Col = 0.75, 0.75, 0, 0; Sig/Sig = 0.34, 0.67, 0, 0
(Figure: the input shingles × documents matrix, a permutation, and the resulting signature matrix M.)
Hash columns of the signature matrix M:
Similar columns likely hash to same bucket
- Divide matrix M into b bands of r rows (# rows of M = b·r)
- Candidate column pairs are those that hash
to the same bucket for ≥ 1 band
(Figure: matrix M split into b bands of r rows, hashed into buckets; the resulting probability of sharing ≥ 1 bucket vs. similarity is an S-curve with threshold s.)
(Pipeline: Points → signatures: short integer signatures that reflect point similarity → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity.)
Design a locality sensitive hash function (for a given distance metric)
Apply the “Bands” technique
The S-curve is where the “magic” happens
(Figure: probability of sharing ≥ 1 bucket vs. similarity t of two sets, with threshold s.)
Remember: Probability of equal hash-values = similarity
- This is what 1 hash-code gives you: Pr[h(C1) = h(C2)] = sim(D1, D2)
- This is what we want: no chance if t < s, probability = 1 if t > s
How to get a step-function? By choosing r and b!
Remember: b bands, r rows/band. Let sim(C1, C2) = s
What’s the prob. that at least 1 band is equal?
Pick some band (r rows)
- Prob. that elements in a single row of columns C1 and C2 are equal = s
- Prob. that all rows in a band are equal = s^r
- Prob. that some row in a band is not equal = 1 − s^r
Prob. that all bands are not equal = (1 − s^r)^b
Prob. that at least 1 band is equal = 1 − (1 − s^r)^b
P(C1, C2 is a candidate pair) = 1 − (1 − s^r)^b
Picking r and b to get the best S-curve
- 50 hash-functions (r = 5, b = 10)
(Plot: prob. of sharing a bucket vs. similarity s.)
(Plots of Prob(Candidate pair) = 1 − (1 − t^r)^b vs. similarity, for r = 1..10 with b = 1; r = 1 with b = 1..10; r = 5 with b = 1..50; r = 10 with b = 1..50.)
Given a fixed threshold s, we want to choose r and b such that P(Candidate pair) has a “step” right around s.
(Pipeline: signatures: short vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity.)
We have used LSH to find similar documents
- More generally, we found similar columns in large
sparse matrices with high Jaccard similarity
Can we use LSH for other distance measures?
- e.g., Euclidean distances, Cosine distance
- Let’s generalize what we’ve learned!
d(·) is a distance measure if it is a function from pairs of points x, y to real numbers such that:
- d(x, y) ≥ 0
- d(x, y) = 0 iff x = y
- d(x, y) = d(y, x)
- d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)
Jaccard distance for sets = 1 minus Jaccard similarity Cosine distance for vectors = angle between the vectors Euclidean distances:
- L2 norm: d(x,y) = square root of the sum of the squares of the
differences between x and y in each dimension
- The most common notion of “distance”
- L1 norm: sum of the differences in each dimension
- Manhattan distance = distance if you travel along coordinates only
d(x, y) ≥ 0 because |x ∩ y| ≤ |x ∪ y|
- Thus, similarity ≤ 1 and distance = 1 − similarity ≥ 0
d(x, x) = 0 because x ∩ x = x ∪ x. And if x ≠ y, then |x ∩ y| is strictly less than |x ∪ y|, so sim(x, y) < 1; thus d(x, y) > 0
d(x, y) = d(y, x) because union and intersection are symmetric
d(x, y) ≤ d(x, z) + d(z, y) is trickier – we need to show:
1 − |x ∩ z|/|x ∪ z| + 1 − |y ∩ z|/|y ∪ z| ≥ 1 − |x ∩ y|/|x ∪ y|
To show: 1 − |x ∩ z|/|x ∪ z| + 1 − |y ∩ z|/|y ∪ z| ≥ 1 − |x ∩ y|/|x ∪ y|
Remember: |a ∩ b| / |a ∪ b| = probability that minhash(a) = minhash(b)
Thus, 1 − |a ∩ b| / |a ∪ b| = probability that minhash(a) ≠ minhash(b)
Need to show: Pr[minhash(x) ≠ minhash(y)] ≤ Pr[minhash(x) ≠ minhash(z)] + Pr[minhash(z) ≠ minhash(y)]
i.e., d(x, y) ≤ d(x, z) + d(z, y)
Whenever minhash(x) ≠ minhash(y), at least one of minhash(x) ≠ minhash(z) and minhash(z) ≠ minhash(y) must be true:
minhash(x) ≠ minhash(z) OR minhash(z) ≠ minhash(y) follows from minhash(x) ≠ minhash(y)
For Min-Hashing signatures, we got a Min-Hash
function for each permutation of rows
A “hash function” is any function that allows us
to say whether two elements are “equal”
- Shorthand: h(x) = h(y) means “h says x and y are equal”
A family of hash functions is any set of hash
functions from which we can pick one at random efficiently
- Example: The set of Min-Hash functions generated
from permutations of rows
Suppose we have a space S of points with a distance measure d(x,y)
A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:
- 1. If d(x, y) < d1, then the probability over all h ∈ H that h(x) = h(y) is at least p1
- 2. If d(x, y) > d2, then the probability over all h ∈ H that h(x) = h(y) is at most p2
With a LS Family we can do LSH!
Critical assumption:
(Figure: Pr[h(x) = h(y)] vs. distance d(x, y), with a distance threshold t – small distance (< d1): probability at least p1 of hashing to the same value; large distance (> d2): probability at most p2.)
Let:
- S = space of all sets,
- d = Jaccard distance,
- H is family of Min-Hash functions for all
permutations of rows
Then for any hash function h ∈ H:
Pr[h(x) = h(y)] = 1 − d(x, y)
- Simply restates theorem about Min-Hashing
in terms of distances rather than similarities
Claim: Min-hash H is a (1/3, 2/3, 2/3, 1/3)-
sensitive family for S and d.
For Jaccard similarity, Min-Hashing gives a
(d1,d2,(1-d1),(1-d2))-sensitive family for any d1<d2
If distance < 1/3 (so similarity > 2/3), then the probability that the Min-Hash values agree is > 2/3
Can we reproduce the
“S-curve” effect we saw before for any LS family?
The “bands” technique we learned for signature
matrices carries over to this more general setting
Can do LSH with any (d1, d2, p1, p2)-sensitive
family!
Two constructions:
- AND construction like “rows in a band”
- OR construction like “many bands”
(Figure: prob. of sharing a bucket vs. similarity t for the AND construction over h1, …, hr.)
AND lowers the probability for large distances (Good), but also lowers the probability for small distances (Bad)
Given family H, construct family H’ consisting of r functions from H
For h = [h1, …, hr] in H’, we say h(x) = h(y) if and only if hi(x) = hi(y) for all i
- Note this corresponds to creating a band of size r
Theorem: If H is (d1, d2, p1, p2)-sensitive, then H’ is (d1, d2, (p1)^r, (p2)^r)-sensitive
Proof: Use the fact that the hi’s are independent
Proof: Use the fact that hi ’s are independent
Independence of hash functions (HFs) really
means that the prob. of two HFs saying “yes” is the product of each saying “yes”
- But two particular hash functions could be highly
correlated
- For example, in Min-Hash if their permutations agree in
the first one million entries
- However, the probabilities in definition of a
LSH-family are over all possible members of H, H’ (i.e., average case and not the worst case)
OR raises the probability for small distances (Good), but also raises the probability for large distances (Bad)
Given family H, construct family H’ consisting of b functions from H
For h = [h1, …, hb] in H’, h(x) = h(y) if and only if hi(x) = hi(y) for at least 1 i
Theorem: If H is (d1, d2, p1, p2)-sensitive, then H’ is (d1, d2, 1 − (1 − p1)^b, 1 − (1 − p2)^b)-sensitive
Proof: Use the fact that the hi’s are independent
AND makes all probs. shrink, but by choosing r
correctly, we can make the lower prob. approach 0 while the higher does not
OR makes all probs. grow, but by choosing b correctly,
we can make the upper prob. approach 1 while the lower does not
(Plots of prob. of sharing a bucket vs. similarity of a pair of items: AND with r = 1..10, b = 1; OR with r = 1, b = 1..10.)
By choosing b and r correctly, we can make
the lower probability approach 0 while the higher approaches 1
As for the signature matrix, we can use the
AND construction followed by the OR construction
- Or vice-versa
- Or any sequence of AND’s and OR’s alternating
r-way AND followed by b-way OR construction
- Exactly what we did with Min-Hashing
- AND: If bands match in all r values hash to same bucket
- OR: Cols that have ≥ 1 common bucket → Candidate
Take points x and y s.t. Pr[h(x) = h(y)] = s
- H will make (x, y) a candidate pair with prob. s
Construction makes (x, y) a candidate pair with probability 1 − (1 − s^r)^b – the S-curve!
- Example: Take H and construct H’ by the AND
construction with r = 4. Then, from H’, construct H’’ by the OR construction with b = 4
s     p = 1 − (1 − s^4)^4
.2    .0064
.3    .0320
.4    .0985
.5    .2275
.6    .4260
.7    .6666
.8    .8785
.9    .9860
r = 4, b = 4 transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.8785,.0064)-sensitive family.
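A tiny numeric sketch of the r-way AND followed by b-way OR amplification; with r = b = 4 it reproduces the numbers above:

```python
def and_or(p: float, r: int, b: int) -> float:
    """Probability after an r-way AND followed by a b-way OR construction."""
    return 1.0 - (1.0 - p**r) ** b

# Amplifying a (.2, .8, .8, .2)-sensitive family with r = 4, b = 4:
print(and_or(0.8, 4, 4))   # ~0.8785 -> new p1
print(and_or(0.2, 4, 4))   # ~0.0064 -> new p2
```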
Picking r and b to get desired performance
- 50 hash-functions (r = 5, b = 10)
(Plot: Prob(Candidate pair) vs. similarity s, with threshold s. Blue area X: false negative rate – pairs with sim > s that never share a band, so they never become candidates and we never consider them for the (slow/exact) similarity calculation. Green area Y: false positive rate – pairs with sim < s that we do consider as candidates; this is not too bad, we compute their (slow/exact) similarity and discard them.)
Picking r and b to get desired performance
- 50 hash-functions (r · b = 50)
(Plot: Prob(Candidate pair) vs. similarity s, with threshold s, for r = 2, b = 25; r = 5, b = 10; r = 10, b = 5.)
Apply a b-way OR construction followed by
an r-way AND construction
Transforms similarity s (probability p) into (1 − (1 − s)^b)^r
- The same S-curve, mirrored horizontally and
vertically
Example: Take H and construct H’ by the OR
construction with b = 4. Then, from H’, construct H’’ by the AND construction with r = 4
s     p = (1 − (1 − s)^4)^4
.1    .0140
.2    .1215
.3    .3334
.4    .5740
.5    .7725
.6    .9015
.7    .9680
.8    .9936
(Plot: Prob(Candidate pair) vs. similarity s for the (4, 4) OR-AND construction.)
Example: Apply the (4,4) OR-AND construction
followed by the (4,4) AND-OR construction
Transforms a (.2, .8, .8, .2)-sensitive family into
a (.2, .8, .9999996, .0008715)-sensitive family
- Note this family uses 256 (= 4·4·4·4) of the original hash functions
For each AND-OR S-curve 1 − (1 − s^r)^b, there is a threshold t for which 1 − (1 − t^r)^b = t
Above t, high probabilities are increased; below t, low probabilities are decreased
You improve the sensitivity as long as the low probability is less than t and the high probability is greater than t
- Iterate as you like
Similar observation for the OR-AND type of S-curve: (1 − (1 − s)^b)^r
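An illustrative way (not from the slides) to locate that fixed point t numerically is bisection on 1 − (1 − t^r)^b − t, which is negative below t and positive above it:

```python
def s_curve_threshold(r: int, b: int, iters: int = 60) -> float:
    """Fixed point t in (0, 1) with 1 - (1 - t**r)**b == t, found by bisection."""
    f = lambda t: 1.0 - (1.0 - t**r) ** b - t   # negative below the fixed point
    lo, hi = 1e-9, 1.0 - 1e-9
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid            # still below the fixed point
        else:
            hi = mid
    return (lo + hi) / 2

print(s_curve_threshold(r=5, b=20))   # threshold of the running r = 5, b = 20 example
```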
(Figure: Prob(Candidate pair) vs. s with the threshold t marked; probabilities below t are lowered, probabilities above t are raised.)
Pick any two distances d1 < d2 Start with a (d1, d2, (1- d1), (1- d2))-sensitive
family
Apply constructions to amplify it into a (d1, d2, p1, p2)-sensitive family, where p1 is almost 1 and p2 is almost 0
The closer to 0 and 1 we get, the more
hash functions must be used!
LSH methods for other distance metrics:
- Cosine distance: Random hyperplanes
- Euclidean distance: Project on lines
(Generalized pipeline: Points → signatures: short integer signatures that reflect their similarity → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity. Design a (d1, d2, p1, p2)-sensitive family of hash functions for that particular distance metric – this step depends on the distance function used – then amplify the family using AND and OR constructions.)
(Summary diagram: Documents → MinHash integer signatures → “bands” technique → candidate pairs; Data points → Random Hyperplanes ±1 signatures → “bands” technique → candidate pairs.)
Cosine distance = angle between the vectors from the origin to the points in question:
d(A, B) = θ = arccos(A·B / (‖A‖·‖B‖))
- Has range 0…π (equivalently 0…180°)
- Can divide by π to have distance in range 0…1
Cosine similarity = 1 − d(A, B)
- But often defined as cosine sim: cos(θ) = A·B / (‖A‖·‖B‖)
- Has range −1…1 for general vectors
- Range 0…1 for non-negative vectors (angles up to 90°)
(Figure: vectors A and B with angle θ and the projection A·B/‖B‖.)
For cosine distance, there is a technique called Random Hyperplanes
- Technique similar to Min-Hashing
Random Hyperplanes method is a (d1, d2, (1 − d1/π), (1 − d2/π))-sensitive family for any d1 and d2
Reminder: (d1, d2, p1, p2)-sensitive
1. If d(x, y) < d1, then prob. that h(x) = h(y) is at least p1
2. If d(x, y) > d2, then prob. that h(x) = h(y) is at most p2
Each vector v determines a hash function hv with two buckets:
hv(x) = +1 if v·x ≥ 0; hv(x) = −1 if v·x < 0
LS-family H = set of all functions derived from any vector
Claim: For points x and y, Pr[h(x) = h(y)] = 1 − d(x, y)/π
(Figure: look in the plane of x and y, with angle θ between them. A hyperplane normal to v lying outside the angle gives h(x) = h(y); a hyperplane normal to v’ lying inside the angle gives h(x) ≠ h(y). What matters is where the hyperplane falls, not where the normal vector points.)
So: Prob[hyperplane falls inside the angle, i.e., h(x) ≠ h(y)] = θ/π
So: P[h(x) = h(y)] = 1 − θ/π = 1 − d(x, y)/π
Pick some number of random vectors, and
hash your data for each vector
The result is a signature (sketch) of
+1’s and –1’s for each data point
Can be used for LSH like we used the
Min-Hash signatures for Jaccard distance
Amplify using AND/OR constructions
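A minimal random-hyperplanes sketch in Python (helper names are illustrative; it already uses the ±1-component trick discussed on the next slide):

```python
import random

def hyperplane_sketch(x, n_bits=64, seed=0):
    """Signature of +1/-1 values, one per random hyperplane (same seed for all points)."""
    rng = random.Random(seed)
    sketch = []
    for _ in range(n_bits):
        v = [rng.choice((-1, 1)) for _ in x]            # random +/-1 normal vector
        dot = sum(vi * xi for vi, xi in zip(v, x))
        sketch.append(1 if dot >= 0 else -1)
    return sketch

# Fraction of agreeing sketch positions estimates 1 - theta/pi for two points.
```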
Expensive to pick a random vector in M
dimensions for large M
- Would have to generate M random numbers
A more efficient approach
- It suffices to consider only vectors v
consisting of +1 and –1 components
- Why? Assuming the data is random, vectors of +1/−1 components cover the space evenly (and do not bias in any way)
Idea: Hash functions correspond to lines Partition the line into buckets of size a Hash each point to the bucket containing its
projection onto the line
- An element of the “Signature” is a bucket id for
that given projection line
Nearby points are always close;
distant points are rarely in same bucket
“Lucky” case:
- Points that are close
hash in the same bucket
- Distant points end up in
different buckets
Two “unlucky” cases:
- Top: unlucky
quantization
- Bottom: unlucky
projection
(Figures: points projected onto a randomly chosen line partitioned into buckets of width a.
- If d << a, the chance the points land in the same bucket is at least 1 − d/a.
- If d >> a, the projected distance is d·cos θ, so θ must be close to 90° for there to be any chance the points go to the same bucket.)
If points are at distance d < a/2, the prob. they are in the same bucket is ≥ 1 − d/a ≥ ½
If points are at distance d > 2a apart, then they can be in the same bucket only if d·cos θ ≤ a
- cos θ ≤ ½
- 60° < θ < 90°, i.e., at most 1/3 probability
Yields a (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions for any a
Amplify using AND-OR cascades
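A sketch of the project-onto-a-random-line hash (the Gaussian direction and helper names are illustrative assumptions):

```python
import math
import random

def line_bucket(x, a=1.0, seed=0):
    """Bucket id of x's projection onto one random line, with buckets of width a."""
    rng = random.Random(seed)                     # same seed -> same line for all points
    v = [rng.gauss(0, 1) for _ in x]              # random direction
    norm = math.sqrt(sum(vi * vi for vi in v))
    proj = sum(vi * xi for vi, xi in zip(v, x)) / norm   # signed length of the projection
    return math.floor(proj / a)                   # bucket containing the projection

# One such bucket id per random line forms an element of the point's signature.
```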
(Summary diagram: Documents → MinHash integer signatures → “bands” technique → candidate pairs; Data points → Random Hyperplanes ±1 signatures → “bands” technique → candidate pairs. In general: design a (d1, d2, p1, p2)-sensitive family of hash functions for the particular distance metric, then amplify the family using AND and OR constructions.)
The property Pr[h(C1) = h(C2)] = sim(C1, C2) of the hash function h is the essential part of LSH; without it we can’t do anything
LS-hash functions transform data to signatures so that the bands technique (AND, OR constructions) can then be applied