Locality-Sensitive Hashing: Finding Similar Items (PowerPoint presentation transcript)


SLIDE 1
SLIDE 2

High dim. data: Locality-sensitive hashing, Clustering, Dimensionality reduction

Graph data: PageRank, SimRank, Network Analysis, Spam Detection

Infinite data: Filtering data streams, Web advertising, Queries on streams

Machine learning: SVM, Decision Trees, Perceptron, kNN

Apps: Recommender systems, Association Rules, Duplicate document detection

3/2/2020 2

SLIDE 3

Given a query image patch, find similar images

SLIDE 4

 Collect billions of images
 Determine a feature vector for each image (4k dim)
 Given a query Q, find its nearest neighbors FAST

[Figure: Hamming distance — the bit feature vectors of image Q and image B are compared position by position; Similarity(Q, B) is measured by the fraction of agreeing bits.]
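A minimal sketch of the idea behind the figure (the function names and the short 8-bit vectors are illustrative, not from the deck):

```python
def hamming_distance(q, b):
    """Number of positions where two equal-length bit vectors differ."""
    assert len(q) == len(b)
    return sum(x != y for x, y in zip(q, b))

def similarity(q, b):
    """Fraction of agreeing positions: 1 - normalized Hamming distance."""
    return 1 - hamming_distance(q, b) / len(q)

q = [1, 0, 1, 1, 0, 1, 0, 1]   # query image feature bits
b = [1, 0, 1, 0, 0, 1, 1, 1]   # database image feature bits
print(hamming_distance(q, b))  # 2
print(similarity(q, b))        # 0.75
```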

SLIDE 5

SLIDE 6

 Many problems can be expressed as finding "similar" sets:

  • Find near-neighbors in high-dimensional space

 Examples:

  • Pages with similar words
  • For duplicate detection, classification by topic
  • Customers who purchased similar products
  • Products with similar customer sets
  • Images with similar features
  • Image completion
  • Recommendations and search

SLIDE 7

 Given: High-dimensional data points x₁, x₂, …

  • For example: an image is a long vector of pixel colors

 And some distance function d(x₁, x₂)

  • which quantifies the "distance" between x₁ and x₂

 Goal: Find all pairs of data points (xᵢ, xⱼ) that are within a distance threshold: d(xᵢ, xⱼ) ≤ s

 Note: The naïve solution would take O(N²), where N is the number of data points

 MAGIC: This can be done in O(N)!! How??

SLIDE 8

 LSH is really a family of related techniques

 In general, one throws items into buckets using several different "hash functions"

 You examine only those pairs of items that share a bucket for at least one of these hashings

 Upside: Designed correctly, only a small fraction of pairs are ever examined

 Downside: There are false negatives – pairs of similar items that never even get considered

SLIDE 9
SLIDE 10

 Suppose we need to find near-duplicate documents among N = 1 million documents

  • Naïvely, we would have to compute pairwise similarities for every pair of docs
  • N(N − 1)/2 ≈ 5·10¹¹ comparisons
  • At 10⁵ secs/day and 10⁶ comparisons/sec, it would take 5 days
  • For N = 10 million, it takes more than a year…

 Similarly, if you have a dataset of 10M images, quickly find the ones most similar to a query image Q
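The slide's back-of-the-envelope arithmetic can be checked directly, using the comparison rate and seconds-per-day it assumes (the function name is illustrative):

```python
def naive_days(n, cmp_per_sec=1e6, secs_per_day=1e5):
    """Days needed to compare all n*(n-1)/2 document pairs at the
    slide's assumed rates (1e6 comparisons/sec, 1e5 secs/day)."""
    pairs = n * (n - 1) / 2
    return pairs / cmp_per_sec / secs_per_day

print(round(naive_days(1_000_000)))   # 5   (days)
print(round(naive_days(10_000_000)))  # 500 (days -- more than a year)
```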

SLIDE 11
  • 1. Shingling: Convert a document into a set representation (Boolean vector)
  • 2. Min-Hashing: Convert large sets to short signatures, while preserving similarity
  • 3. Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
  • Candidate pairs!

SLIDE 12

Document → Shingling → The set of strings of length k that appear in the document → Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

SLIDE 13

Step 1: Shingling: Convert a document into a set

Document → Shingling → The set of strings of length k that appear in the document

SLIDE 14

Step 1: Shingling: Convert a document into a set

 A k-shingle (or k-gram) for a document is a sequence of k tokens that appears in the doc

  • Tokens can be characters, words, or something else, depending on the application
  • Assume tokens = characters for the examples

 To compress long shingles, we can hash them to (say) 4 bytes

 Represent a document by the set of hash values of its k-shingles

SLIDE 15

 Example: k = 2; document D1 = abcab

Set of 2-shingles: S(D1) = {ab, bc, ca}
Hash the shingles: h(D1) = {1, 5, 7}

 k = 8, 9, or 10 is often used in practice

 Benefits of shingles:

  • Documents that are intuitively similar will have many shingles in common
  • Changing a word only affects the k-shingles within distance k − 1 from the word
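A minimal sketch of character-level shingling, reproducing the slide's k = 2 example (the function name is illustrative):

```python
def shingles(doc, k):
    """Set of all character k-shingles (k-grams) of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

# Slide example: k = 2, D1 = "abcab"
print(sorted(shingles("abcab", 2)))  # ['ab', 'bc', 'ca']

# Optionally compress long shingles to (say) 4-byte integers by hashing:
hashed = {hash(s) & 0xFFFFFFFF for s in shingles("abcab", 2)}
```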

SLIDE 16

 Document D1 is a set of its k-shingles C1 = S(D1)

 A natural similarity measure is the Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|

Jaccard distance: d(C1, C2) = 1 − |C1 ∩ C2| / |C1 ∪ C2|

3 in intersection. 8 in union. Jaccard similarity = 3/8
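The Jaccard measures translate directly into set operations; a small sketch matching the figure's 3-in-intersection, 8-in-union example (names and the concrete sets are illustrative):

```python
def jaccard_sim(c1, c2):
    """|C1 intersect C2| / |C1 union C2|"""
    return len(c1 & c2) / len(c1 | c2)

def jaccard_dist(c1, c2):
    """Jaccard distance = 1 - Jaccard similarity."""
    return 1 - jaccard_sim(c1, c2)

a = {1, 2, 3, 4, 5, 6}
b = {4, 5, 6, 7, 8}          # 3 in intersection, 8 in union
print(jaccard_sim(a, b))     # 0.375  (= 3/8)
```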

SLIDE 17

Encode sets using 0/1 (bit, Boolean) vectors

 Rows = elements (shingles)
 Columns = sets (documents)

  • 1 in row e and column s if and only if e is a member of s
  • Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
  • Typical matrix is sparse!

 Each document is a column:

  • Example: sim(C1, C2) = ?
  • Size of intersection = 3; size of union = 6; Jaccard similarity (not distance) = 3/6
  • d(C1, C2) = 1 − (Jaccard similarity) = 3/6

[Figure: sparse 0/1 matrix with shingles as rows and documents as columns]

We don't really construct the matrix; just imagine it exists

SLIDE 18

 So far:

  • Documents → Sets of shingles
  • Represent sets as Boolean vectors in a matrix

 Next goal: Find similar columns while computing small signatures

  • Similarity of columns == similarity of signatures

 Warnings:

  • Comparing all pairs takes too much time: Job for LSH
  • These methods can produce false negatives, and even false positives (if the optional check is not made)

SLIDE 19

Step 2: Min-Hashing: Convert large sets to short signatures, while preserving similarity

Document → Shingling → The set of strings of length k that appear in the document → Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their similarity

SLIDE 20

 Key idea: "hash" each column C to a small signature h(C), such that:

  • sim(C1, C2) is the same as the "similarity" of signatures h(C1) and h(C2)

 Goal: Find a hash function h(·) such that:

  • If sim(C1, C2) is high, then with high prob. h(C1) = h(C2)
  • If sim(C1, C2) is low, then with high prob. h(C1) ≠ h(C2)

 Idea: Hash docs into buckets. Expect that "most" pairs of near-duplicate docs hash into the same bucket!

SLIDE 21

 Goal: Find a hash function h(·) such that:

  • if sim(C1, C2) is high, then with high prob. h(C1) = h(C2)
  • if sim(C1, C2) is low, then with high prob. h(C1) ≠ h(C2)

 Clearly, the hash function depends on the similarity metric:

  • Not all similarity metrics have a suitable hash function

 There is a suitable hash function for the Jaccard similarity: it is called Min-Hashing

SLIDE 22

 Permute the rows of the Boolean matrix with a random permutation π

  • Thought experiment – not real

 Define the minhash function for this permutation, hπ(C) = the number of the first row (in the permuted order) in which column C has a 1:

hπ(C) = min π(C)

 Apply, to all columns, several randomly chosen permutations π to create a signature for each column

 Result is a signature matrix: columns = sets, rows = minhash values, in order for that column
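The permutation-based definition can be sketched as follows. This is the thought-experiment version with explicit permutations (later slides replace it with row hashing); the function name and the set-of-row-indices column encoding are illustrative:

```python
import random

def minhash_signature(columns, num_perms, num_rows, seed=0):
    """Signature matrix via explicit random permutations.

    columns: list of sets, each holding the row indices where that
    column of the Boolean matrix has a 1."""
    rng = random.Random(seed)
    sig = []
    for _ in range(num_perms):
        perm = list(range(num_rows))
        rng.shuffle(perm)            # perm[r] = rank of row r in permuted order
        # minhash value = smallest permuted rank among the column's 1-rows
        sig.append([min(perm[r] for r in col) for col in columns])
    return sig

# Identical columns always get identical signatures:
cols = [{0, 2, 3}, {0, 2, 3}, {1, 4}]
sig = minhash_signature(cols, num_perms=5, num_rows=5)
print(all(row[0] == row[1] for row in sig))  # True
```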

SLIDE 23

[Figure: Worked example — an input matrix (Shingles × Documents), three random row permutations, and the resulting signature matrix M. For each permutation and column, the signature entry is the position of the first row (in the permuted order) in which the column has a 1; e.g., h2(3) = 1 means permutation 2 gives column 3 the minhash value 1.]

SLIDE 24

 Students sometimes ask whether the minhash value should be the original number of the row, or the number in the permuted order (as we did in our example)

 Answer: it doesn't matter

  • You only need to be consistent, and ensure that two columns get the same value if and only if their first 1's in the permuted order are in the same row

SLIDE 25

 Choose a random permutation π

 Claim: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)

 Why?

  • Let X be a doc (set of shingles), and let z ∈ X be a shingle
  • Then: Pr[π(z) = min(π(X))] = 1/|X|
  • It is equally likely that any z ∈ X is mapped to the min element
  • Let y be such that π(y) = min(π(C1 ∪ C2))
  • Then either: π(y) = min(π(C1)) if y ∈ C1, or π(y) = min(π(C2)) if y ∈ C2
  • One of the two columns had to have a 1 at position y
  • So the prob. that both are true is the prob. that y ∈ C1 ∩ C2
  • Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)

SLIDE 26

 Given cols C1 and C2, rows are classified as:

       C1  C2
  A:    1   1
  B:    1   0
  C:    0   1
  D:    0   0

  • Define: a = # rows of type A, etc.

 Note: sim(C1, C2) = a/(a + b + c)

 Then: Pr[h(C1) = h(C2)] = sim(C1, C2)

  • Look down the permuted cols C1 and C2 until we see a 1
  • If it's a type-A row, then h(C1) = h(C2); if a type-B or type-C row, then not

Jure Leskovec, Stanford CS246: Mining Massive Datasets

SLIDE 27

 We know: Pr[h(C1) = h(C2)] = sim(C1, C2)

 Now generalize to multiple hash functions

 The similarity of two signatures is the fraction of the hash functions in which they agree

 Thus, the expected similarity of two signatures equals the Jaccard similarity of the columns or sets that the signatures represent

  • And the longer the signatures, the smaller will be the expected error

SLIDE 28

[Figure: Worked example — an input matrix (Shingles × Documents), a permutation π, and the resulting signature matrix M.]

Similarities:   1-3    2-4    1-2    3-4
  Col/Col      0.75   0.75    0      0
  Sig/Sig      0.34   0.67    0      0

SLIDE 29

 Permuting the rows even once is prohibitive

 Row hashing!

  • Pick K = 100 hash functions hi
  • Ordering under hi gives a random permutation of the rows!

 One-pass implementation

  • For each column c and hash function hi, keep a "slot" M(i, c) for the min-hash value
  • Initialize all M(i, c) = ∞
  • Scan rows looking for 1s
  • Suppose row j has 1 in column c
  • Then for each hi: if hi(j) < M(i, c), then M(i, c) ← hi(j)

How to pick a random hash function h(x)? Universal hashing: ha,b(x) = ((a·x + b) mod p) mod N, where a, b are random integers and p is a prime number (p > N)

SLIDE 30

for each row r do begin
  for each hash function hi do
    compute hi(r);
  for each column c
    if c has 1 in row r
      for each hash function hi do
        if hi(r) < M(i, c) then M(i, c) := hi(r);
end;

Important: this way you hash r only once per hash function, not once per 1 in row r.

SLIDE 31

Example with two hash functions, h(x) = x mod 5 and g(x) = (2x + 1) mod 5:

Row   C1  C2
 1     1   0
 2     0   1
 3     1   1
 4     1   0
 5     0   1

Scanning rows 1..5 and updating the slots M(i, C1), M(i, C2):

Row 1: h(1) = 1, g(1) = 3  →  h-slots (1, ∞), g-slots (3, ∞)
Row 2: h(2) = 2, g(2) = 0  →  h-slots (1, 2), g-slots (3, 0)
Row 3: h(3) = 3, g(3) = 2  →  h-slots (1, 2), g-slots (2, 0)
Row 4: h(4) = 4, g(4) = 4  →  h-slots (1, 2), g-slots (2, 0)
Row 5: h(5) = 0, g(5) = 1  →  h-slots (1, 0), g-slots (2, 0)

Final signature matrix M: Sig(C1) = (1, 2), Sig(C2) = (0, 0)
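The one-pass algorithm and this worked example can be sketched as follows (the function name and the column-as-set-of-row-indices encoding are illustrative):

```python
from math import inf

def minhash_sig(columns, hash_funcs, row_range):
    """One-pass Min-Hashing: M[i][c] = min of h_i(r) over the rows r in
    which column c has a 1 (the algorithm from the previous slide)."""
    M = [[inf] * len(columns) for _ in hash_funcs]
    for r in row_range:
        hs = [h(r) for h in hash_funcs]   # hash r once per function, not once per 1
        for c, col in enumerate(columns):
            if r in col:                   # column c has a 1 in row r
                for i, hr in enumerate(hs):
                    if hr < M[i][c]:
                        M[i][c] = hr
    return M

# Worked example: rows 1..5, C1 has 1s in rows {1, 3, 4}, C2 in {2, 3, 5}
cols = [{1, 3, 4}, {2, 3, 5}]
M = minhash_sig(cols, [lambda x: x % 5, lambda x: (2 * x + 1) % 5], range(1, 6))
print(M)  # [[1, 0], [2, 0]]  ->  Sig(C1) = (1, 2), Sig(C2) = (0, 0)
```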

SLIDE 32

Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents

Document → Shingling → The set of strings of length k that appear in the document → Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

SLIDE 33

 Goal: Find documents with Jaccard similarity at least s (for some similarity threshold, e.g., s = 0.8)

 LSH – General idea: Use a hash function that tells whether x and y are a candidate pair: a pair of elements whose similarity must be evaluated

 For Min-Hash matrices:

  • Hash columns of the signature matrix M to many buckets
  • Each pair of documents that hashes into the same bucket is a candidate pair

SLIDE 34

 Pick a similarity threshold s (0 < s < 1)

 Columns x and y of M are a candidate pair if their signatures agree on at least a fraction s of their rows: M(i, x) = M(i, y) for at least a fraction s of the values of i

  • We expect documents x and y to have the same (Jaccard) similarity as their signatures

SLIDE 35

 Big idea: Hash columns of the signature matrix M several times

 Arrange that (only) similar columns are likely to hash to the same bucket, with high probability

 Candidate pairs are those that hash to the same bucket

SLIDE 36

[Figure: Signature matrix M divided into b bands of r rows each; one column is one signature.]

SLIDE 37

 Divide matrix M into b bands of r rows

 For each band, hash its portion of each column to a hash table with k buckets

  • Make k as large as possible

 Candidate column pairs are those that hash to the same bucket for ≥ 1 band

 Tune b and r to catch most similar pairs, but few non-similar pairs
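A minimal sketch of this banding step (the function name is illustrative; using the raw band tuple as the bucket key matches the "same bucket = identical in that band" simplification made on a later slide):

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(sig, b, r):
    """LSH banding: split each signature column into b bands of r rows,
    bucket columns by each band's values, and return the pairs of
    column indices that share >= 1 band bucket.

    sig: signature matrix as a list of columns, each a list of b*r ints."""
    assert all(len(col) == b * r for col in sig)
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c, col in enumerate(sig):
            key = tuple(col[band * r:(band + 1) * r])  # this band's portion
            buckets[key].append(c)
        for cols_in_bucket in buckets.values():
            candidates.update(combinations(cols_in_bucket, 2))
    return candidates

sig = [[1, 2, 3, 4], [1, 2, 9, 9], [7, 8, 3, 4]]
print(sorted(candidate_pairs(sig, b=2, r=2)))  # [(0, 1), (0, 2)]
```

Columns 0 and 1 collide in band 0, columns 0 and 2 in band 1; columns 1 and 2 share no band and are never examined.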

SLIDE 38

[Figure: Matrix M split into b bands of r rows; each band's column-portions are hashed into buckets. Columns 2 and 6 are probably identical (candidate pair); columns 6 and 7 are surely different.]

SLIDE 39

 There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band

 Hereafter, we assume that "same bucket" means "identical in that band"

 Assumption needed only to simplify the analysis, not for correctness of the algorithm

SLIDE 40

Assume the following case:

 Suppose 100,000 columns of M (100k docs)
 Signatures of 100 integers (rows)
 Therefore, the signatures take 40 MB
 Goal: Find pairs of documents that are at least s = 0.8 similar
 Choose b = 20 bands of r = 5 integers/band

SLIDE 41

 Find pairs of ≥ s = 0.8 similarity; set b = 20, r = 5

 Assume: sim(C1, C2) = 0.8

  • Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: we want them to hash to at least 1 common bucket (at least one band is identical)

 Probability C1, C2 are identical in one particular band: (0.8)⁵ = 0.328

 Probability C1, C2 are not identical in any of the 20 bands: (1 − 0.328)²⁰ = 0.00035

  • i.e., about 1/3000th of the 80%-similar column pairs are false negatives (we miss them)
  • We would find 99.965% of the pairs of truly similar documents

SLIDE 42

 Find pairs of ≥ s = 0.8 similarity; set b = 20, r = 5

 Assume: sim(C1, C2) = 0.3

  • Since sim(C1, C2) < s, we want C1, C2 to hash to NO common buckets (all bands should be different)

 Probability C1, C2 are identical in one particular band: (0.3)⁵ = 0.00243

 Probability C1, C2 are identical in at least 1 of the 20 bands: 1 − (1 − 0.00243)²⁰ = 0.0474

  • In other words, approximately 4.74% of the pairs of docs with similarity 0.3 end up becoming candidate pairs
  • They are false positives, since we will have to examine them (they are candidate pairs) but then it will turn out their similarity is below the threshold s

SLIDE 43

 Pick:

  • The number of Min-Hashes (rows of M)
  • The number of bands b, and
  • The number of rows r per band

to balance false positives/negatives

 Example: If we had only 10 bands of 10 rows, the number of false positives would go down, but the number of false negatives would go up

SLIDE 44

[Figure: The ideal case — probability of sharing a bucket as a function of the similarity t = sim(C1, C2) of two sets, for a similarity threshold s: no chance if t < s, probability 1 if t > s. Say "yes" if you are below the line.]

SLIDE 45

[Figure: With a single Min-Hash function, the probability of sharing a bucket grows linearly with the similarity t = sim(C1, C2). Remember: probability of equal hash-values = similarity.]

SLIDE 46

[Figure: Probability of sharing a bucket vs. similarity t = sim(C1, C2) of two sets; the threshold s separates the false-positive region from the false-negative region. Say "yes" if you are below the line.]

SLIDE 47

 Say columns C1 and C2 have similarity t

 Pick any band (r rows)

  • Prob. that all rows in the band are equal = tʳ
  • Prob. that some row in the band is unequal = 1 − tʳ

 Prob. that no band is identical = (1 − tʳ)ᵇ

 Prob. that at least 1 band is identical = 1 − (1 − tʳ)ᵇ

SLIDE 48

tʳ                  All rows of a band are equal
1 − tʳ              Some row of a band is unequal
(1 − tʳ)ᵇ           No band is identical
1 − (1 − tʳ)ᵇ       At least one band is identical

[Figure: The resulting S-curve — probability of sharing a bucket vs. similarity t = sim(C1, C2) of two sets.]

SLIDE 49

 Similarity threshold s

 Prob. that at least 1 band is identical, 1 − (1 − sʳ)ᵇ, for b = 20, r = 5:

   s     1 − (1 − sʳ)ᵇ
  0.2       0.006
  0.3       0.047
  0.4       0.186
  0.5       0.470
  0.6       0.802
  0.7       0.975
  0.8       0.9996
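The table's values come straight from the S-curve formula; a quick check (function name illustrative):

```python
def p_candidate(t, r, b):
    """Probability that two columns with similarity t become a candidate
    pair under b bands of r rows: 1 - (1 - t^r)^b."""
    return 1 - (1 - t ** r) ** b

# Two of the table's entries for b = 20 bands of r = 5 rows:
print(round(p_candidate(0.3, 5, 20), 3))  # 0.047
print(round(p_candidate(0.8, 5, 20), 4))  # 0.9996
```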

SLIDE 50

 Picking r and b to get the best S-curve

  • 50 hash-functions (r = 5, b = 10)

[Figure: S-curve of Prob. sharing a bucket vs. similarity; the blue area is the false negative rate and the green area is the false positive rate.]
SLIDE 51

 Tune M, b, r to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures

 Check in main memory that candidate pairs really do have similar signatures

 Optional: In another pass through the data, check that the remaining candidate pairs really represent similar documents

SLIDE 52

 Shingling: Convert documents to set representation

  • We used hashing to assign each shingle an ID

 Min-Hashing: Convert large sets to short signatures, while preserving similarity

  • We used similarity-preserving hashing to generate signatures with the property Pr[h(C1) = h(C2)] = sim(C1, C2)
  • We used hashing to get around generating random permutations

 Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents

  • We used hashing to find candidate pairs of similarity ≥ s

SLIDE 53
SLIDE 54

 Task: Given a large number (N in the millions or billions) of documents, find "near duplicates"

 Problem:

  • Too many documents to compare all pairs

 Solution: Hash documents so that similar documents hash into the same bucket

  • Documents in the same bucket are then candidate pairs, whose similarity is then evaluated

SLIDE 55

Document → Shingling → The set of strings of length k that appear in the document → Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

SLIDE 56

 A k-shingle (or k-gram) is a sequence of k tokens that appears in the document

  • Example: k = 2; D1 = abcab; set of 2-shingles: C1 = S(D1) = {ab, bc, ca}

 Represent a doc by the set of hash values of its k-shingles

 A natural similarity measure is then the Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|

  • Similarity of two documents is the Jaccard similarity of their shingle sets

SLIDE 57

 Min-Hashing: Convert large sets into short signatures, while preserving similarity: Pr[h(C1) = h(C2)] = sim(D1, D2)

[Figure: Input matrix (Shingles × Documents), a permutation π, and the resulting signature matrix M. The similarities of columns and of signatures (approximately) match: Col/Col 0.75, 0.75, 0, 0 and Sig/Sig 0.34, 0.67, 0, 0 for the pairs 1-3, 2-4, 1-2, 3-4.]

SLIDE 58

 Hash columns of the signature matrix M: similar columns likely hash to the same bucket

  • Divide matrix M into b bands of r rows (so M has b·r rows)
  • Candidate column pairs are those that hash to the same bucket for ≥ 1 band

[Figure: Matrix M split into b bands of r rows, hashed to buckets; probability of sharing ≥ 1 bucket vs. similarity, with threshold s.]

SLIDE 59

Points → Signatures: short integer signatures that reflect point similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

Design a locality-sensitive hash function (for a given distance metric)

Apply the "Bands" technique

SLIDE 60

 The S-curve is where the "magic" happens

[Figure: Probability of sharing ≥ 1 bucket vs. similarity t of two sets. One hash-code gives the straight line Pr[h(C1) = h(C2)] = sim(D1, D2) (remember: probability of equal hash-values = similarity). What we want is a step function at the threshold s: no chance if t < s, probability 1 if t > s. How to get a step function? By choosing r and b!]

SLIDE 61

 Remember: b bands, r rows/band

 Let sim(C1, C2) = s. What's the prob. that at least 1 band is equal?

 Pick some band (r rows)

  • Prob. that the elements in a single row of columns C1 and C2 are equal = s
  • Prob. that all rows in a band are equal = sʳ
  • Prob. that some row in a band is not equal = 1 − sʳ

 Prob. that all bands are not equal = (1 − sʳ)ᵇ

 Prob. that at least 1 band is equal = 1 − (1 − sʳ)ᵇ

P(C1, C2 is a candidate pair) = 1 − (1 − sʳ)ᵇ

SLIDE 62

 Picking r and b to get the best S-curve

  • 50 hash-functions (r = 5, b = 10)

[Figure: Prob. sharing a bucket vs. similarity s for r = 5, b = 10.]
SLIDE 63

[Figure: Four panels of Prob(Candidate pair) = 1 − (1 − tʳ)ᵇ vs. similarity t: r = 1..10 with b = 1; r = 1 with b = 1..10; r = 5 with b = 1..50; r = 10 with b = 1..50.]

Given a fixed threshold s, we want to choose r and b such that P(Candidate pair) has a "step" right around s.

SLIDE 64

Signatures: short vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

SLIDE 65

 We have used LSH to find similar documents

  • More generally, we found similar columns in large sparse matrices with high Jaccard similarity

 Can we use LSH for other distance measures?

  • e.g., Euclidean distance, Cosine distance
  • Let's generalize what we've learned!

SLIDE 66

 d(·) is a distance measure if it is a function from pairs of points x, y to real numbers such that:

  • d(x, y) ≥ 0
  • d(x, y) = 0 iff x = y
  • d(x, y) = d(y, x)
  • d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)

 Jaccard distance for sets = 1 minus Jaccard similarity

 Cosine distance for vectors = angle between the vectors

 Euclidean distances:

  • L2 norm: d(x, y) = square root of the sum of the squares of the differences between x and y in each dimension
  • The most common notion of "distance"
  • L1 norm: sum of the (absolute) differences in each dimension
  • Manhattan distance = distance if you travel along coordinates only

SLIDE 67

 d(x, y) ≥ 0 because |x ∩ y| ≤ |x ∪ y|

  • Thus, similarity ≤ 1 and distance = 1 − similarity ≥ 0

 d(x, x) = 0 because x ∩ x = x ∪ x

 And if x ≠ y, then |x ∩ y| is strictly less than |x ∪ y|, so sim(x, y) < 1; thus d(x, y) > 0

 d(x, y) = d(y, x) because union and intersection are symmetric

 d(x, y) ≤ d(x, z) + d(z, y) is trickier; we need to show:

1 − |x ∩ z|/|x ∪ z| + 1 − |y ∩ z|/|y ∪ z| ≥ 1 − |x ∩ y|/|x ∪ y|

SLIDE 68

To show: 1 − |x ∩ z|/|x ∪ z| + 1 − |y ∩ z|/|y ∪ z| ≥ 1 − |x ∩ y|/|x ∪ y|

 Remember: |a ∩ b|/|a ∪ b| = probability that minhash(a) = minhash(b)

 Thus, 1 − |a ∩ b|/|a ∪ b| = probability that minhash(a) ≠ minhash(b)

 Need to show: Pr[minhash(x) ≠ minhash(y)] ≤ Pr[minhash(x) ≠ minhash(z)] + Pr[minhash(z) ≠ minhash(y)]

(the left side is d(x, y); the right side is d(x, z) + d(z, y))

SLIDE 69

 Whenever minhash(x) ≠ minhash(y), at least one of minhash(x) ≠ minhash(z) and minhash(z) ≠ minhash(y) must be true

SLIDE 70

 For Min-Hashing signatures, we got a Min-Hash function for each permutation of rows

 A "hash function" is any function that allows us to say whether two elements are "equal"

  • Shorthand: h(x) = h(y) means "h says x and y are equal"

 A family of hash functions is any set of hash functions from which we can pick one at random efficiently

  • Example: The set of Min-Hash functions generated from permutations of rows

SLIDE 71

Suppose we have a space S of points with a distance measure d(x, y)

A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:

  • 1. If d(x, y) < d1, then the probability over all h ∈ H that h(x) = h(y) is at least p1
  • 2. If d(x, y) > d2, then the probability over all h ∈ H that h(x) = h(y) is at most p2

With an LS family we can do LSH!

Critical assumption

SLIDE 72

[Figure: Pr[h(x) = h(y)] vs. distance d(x, y): small distance means high probability (at least p1 for distances below d1) of hashing to the same value; large distance means low probability (at most p2 for distances above d2); a distance threshold t lies between d1 and d2.]

SLIDE 73

 Let:

  • S = space of all sets,
  • d = Jaccard distance,
  • H = family of Min-Hash functions for all permutations of rows

 Then for any hash function h ∈ H:

Pr[h(x) = h(y)] = 1 − d(x, y)

  • This simply restates the theorem about Min-Hashing in terms of distances rather than similarities

SLIDE 74

 Claim: Min-Hash H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d

  • If distance < 1/3 (so similarity ≥ 2/3), then the probability that the Min-Hash values agree is > 2/3

 For Jaccard similarity, Min-Hashing gives a (d1, d2, (1 − d1), (1 − d2))-sensitive family for any d1 < d2

SLIDE 75

 Can we reproduce the "S-curve" effect we saw before for any LS family?

 The "bands" technique we learned for signature matrices carries over to this more general setting

 We can do LSH with any (d1, d2, p1, p2)-sensitive family!

 Two constructions:

  • AND construction, like "rows in a band"
  • OR construction, like "many bands"

SLIDE 76
SLIDE 77

 Given family H, construct family H' consisting of r functions from H

 For h = [h1, …, hr] in H', we say h(x) = h(y) if and only if hi(x) = hi(y) for all i, 1 ≤ i ≤ r

  • Note this corresponds to creating a band of size r

 Theorem: If H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, (p1)ʳ, (p2)ʳ)-sensitive

 Proof: Use the fact that the hi's are independent

The AND construction lowers the probability for large distances (good), but also lowers the probability for small distances (bad)
SLIDE 78

 Independence of hash functions (HFs) really means that the prob. of two HFs saying "yes" is the product of each saying "yes"

  • But two particular hash functions could be highly correlated
  • For example, in Min-Hash, if their permutations agree in the first one million entries
  • However, the probabilities in the definition of an LSH family are over all possible members of H, H' (i.e., average case and not the worst case)

SLIDE 79

 Given family H, construct family H' consisting of b functions from H

 For h = [h1, …, hb] in H', h(x) = h(y) if and only if hi(x) = hi(y) for at least 1 i

 Theorem: If H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, 1 − (1 − p1)ᵇ, 1 − (1 − p2)ᵇ)-sensitive

 Proof: Use the fact that the hi's are independent

The OR construction raises the probability for small distances (good), but also raises the probability for large distances (bad)
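Both constructions, and their composition into the banding S-curve, are one-line probability transforms; a sketch reproducing the deck's r = 4, b = 4 example (function names are illustrative):

```python
def and_construct(p, r):
    """AND of r independent hash functions: collision prob p -> p^r."""
    return p ** r

def or_construct(p, b):
    """OR of b independent hash functions: p -> 1 - (1 - p)^b."""
    return 1 - (1 - p) ** b

def and_or(p, r, b):
    """r-way AND followed by b-way OR: the banding S-curve 1-(1-p^r)^b."""
    return or_construct(and_construct(p, r), b)

# r = 4, b = 4 turns a (.2, .8, .8, .2)-sensitive family into
# (approximately) a (.2, .8, .8785, .0064)-sensitive one:
print(round(and_or(0.8, 4, 4), 4))  # 0.8785
print(round(and_or(0.2, 4, 4), 4))  # 0.0064
```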

SLIDE 80

 AND makes all probs. shrink, but by choosing r correctly, we can make the lower prob. approach 0 while the higher does not

 OR makes all probs. grow, but by choosing b correctly, we can make the upper prob. approach 1 while the lower does not

[Figure: Prob. sharing a bucket vs. similarity of a pair of items, for the AND construction (r = 1..10, b = 1) and the OR construction (r = 1, b = 1..10).]

SLIDE 81

 By choosing b and r correctly, we can make the lower probability approach 0 while the higher approaches 1

 As for the signature matrix, we can use the AND construction followed by the OR construction

  • Or vice-versa
  • Or any sequence of ANDs and ORs, alternating

SLIDE 82

 r-way AND followed by b-way OR construction

  • Exactly what we did with Min-Hashing
  • AND: If bands match in all r values, hash to the same bucket
  • OR: Columns that have ≥ 1 common bucket → candidate

 Take points x and y s.t. Pr[h(x) = h(y)] = s

  • H will make (x, y) a candidate pair with prob. s

 The construction makes (x, y) a candidate pair with probability 1 − (1 − sʳ)ᵇ: the S-curve!

  • Example: Take H and construct H' by the AND construction with r = 4. Then, from H', construct H'' by the OR construction with b = 4

SLIDE 83

  s     p = 1 − (1 − s⁴)⁴
 .2     .0064
 .3     .0320
 .4     .0985
 .5     .2275
 .6     .4260
 .7     .6666
 .8     .8785
 .9     .9860

r = 4, b = 4 transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .8785, .0064)-sensitive family.

SLIDE 84
SLIDE 85

 Picking r and b to get desired performance

  • 50 hash-functions (r = 5, b = 10)

[Figure: Prob(Candidate pair) vs. similarity s, with the threshold s marked.]

Blue area X: False Negative rate. These are pairs with sim > s, but the X fraction won't share a band and will never become candidates. This means we will never consider these pairs for the (slow/exact) similarity calculation!

Green area Y: False Positive rate. These are pairs with sim < s, but we will consider them as candidates. This is not too bad: we will consider them for the (slow/exact) similarity computation and then discard them.

SLIDE 86

 Picking r and b to get desired performance

  • 50 hash-functions (r · b = 50)

[Figure: Prob(Candidate pair) vs. similarity s for r = 2, b = 25; r = 5, b = 10; and r = 10, b = 5, with the threshold s marked.]

SLIDE 87

 Apply a b-way OR construction followed by an r-way AND construction

 Transforms similarity s (probability p) into (1 − (1 − s)ᵇ)ʳ

  • The same S-curve, mirrored horizontally and vertically

 Example: Take H and construct H' by the OR construction with b = 4. Then, from H', construct H'' by the AND construction with r = 4

SLIDE 88

  s     p = (1 − (1 − s)⁴)⁴
 .1     .0140
 .2     .1215
 .3     .3334
 .4     .5740
 .5     .7725
 .6     .9015
 .7     .9680
 .8     .9936

The example transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9936, .1215)-sensitive family

[Figure: Prob(Candidate pair) vs. similarity s for the OR-AND construction.]

SLIDE 89

 Example: Apply the (4, 4) OR-AND construction followed by the (4, 4) AND-OR construction

 Transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9999996, .0008715)-sensitive family

  • Note this family uses 256 (= 4·4·4·4) of the original hash functions

SLIDE 90

 For each AND-OR S-curve 1 − (1 − sʳ)ᵇ, there is a threshold t for which 1 − (1 − tʳ)ᵇ = t

 Above t, high probabilities are increased; below t, low probabilities are decreased

 You improve the sensitivity as long as the low probability is less than t and the high probability is greater than t

  • Iterate as you like

 A similar observation holds for the OR-AND type of S-curve: (1 − (1 − s)ᵇ)ʳ
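The fixed point can be found numerically; a sketch (the bisection assumes r > 1 so the curve starts below the diagonal, and the (1/b)^(1/r) rule of thumb is an approximation I'm adding for comparison, not a formula from this deck):

```python
def s_curve_threshold(r, b, iters=200):
    """Nontrivial fixed point t of 1 - (1 - t^r)^b = t, by bisection on
    (0, 1). Assumes r > 1; roughly approximated by (1/b)**(1/r)."""
    f = lambda t: 1 - (1 - t ** r) ** b - t
    lo, hi = 1e-6, 1 - 1e-6   # f(lo) < 0 and f(hi) > 0 for r > 1
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(s_curve_threshold(5, 10), 2))  # 0.62
print(round((1 / 10) ** (1 / 5), 2))       # 0.63  (rule-of-thumb estimate)
```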

SLIDE 91

[Figure: Prob(Candidate pair) vs. s with the fixed-point threshold t marked: below t the probability is lowered, above t it is raised.]

SLIDE 92

 Pick any two distances d1 < d2  Start with a (d1, d2, (1- d1), (1- d2))-sensitive

family

 Apply constructions to amplify it into a

(d1, d2, p1, p2)-sensitive family, where p1 is almost 1 and p2 is almost 0

 The closer to 0 and 1 we get, the more

hash functions must be used!


slide-93
SLIDE 93
slide-94
SLIDE 94

 LSH methods for other distance metrics:

  • Cosine distance: Random hyperplanes
  • Euclidean distance: Project on lines


Points → Signatures: short integer signatures that reflect the points' similarity → Locality-sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity
  • Design a (d1, d2, p1, p2)-sensitive family of hash functions (for that particular distance metric)
  • Amplify the family using AND and OR constructions

Depends on the distance function used

slide-95
SLIDE 95


[Diagram: Documents → MinHash signatures → "Bands" technique → Candidate pairs; Data points → Random-Hyperplane sketches of +1/-1 entries → "Bands" technique → Candidate pairs]

slide-96
SLIDE 96

 Cosine distance = angle between vectors

from the origin to the points in question: d(A, B) = θ = arccos(A·B / (‖A‖·‖B‖))

  • Has range 0…π (equivalently 0…180°)
  • Can divide θ by π to have distance in range 0…1

 Cosine similarity = 1 - d(A,B)

  • But often defined as cosine sim: cos(θ) = A·B / (‖A‖·‖B‖)
  • Has range -1…1 for general vectors
  • Range 0…1 for non-negative vectors (angles up to 90°)
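As a sanity check, the normalized cosine distance can be sketched in a few lines of plain Python (no particular library assumed):

```python
import math

def cosine_distance(a, b):
    """Angle between a and b divided by pi, so the distance lies in 0...1."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    cos_sim = max(-1.0, min(1.0, dot / (na * nb)))  # clamp rounding noise
    return math.acos(cos_sim) / math.pi
```

Perpendicular vectors give 0.5, parallel vectors 0, and opposite vectors 1.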

slide-97
SLIDE 97

 For cosine distance, there is a technique

called Random Hyperplanes

  • Technique similar to Min-Hashing

 Random Hyperplanes method is a

(d1, d2, (1-d1/π), (1-d2/π))-sensitive family for

any d1 and d2

 Reminder: (d1, d2, p1, p2)-sensitive

1. If d(x,y) < d1, then prob. that h(x) = h(y) is at least p1 2. If d(x,y) > d2, then prob. that h(x) = h(y) is at most p2


slide-98
SLIDE 98

 Each vector v determines a hash function hv

with two buckets

 hv(x) = +1 if v·x ≥ 0; = -1 if v·x < 0
 LS-family H = set of all functions derived

from any vector

 Claim: For points x and y,

Pr[h(x) = h(y)] = 1 - d(x,y) / π

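The claim Pr[h(x) = h(y)] = 1 - d(x,y)/π is easy to check by simulation; a sketch using Gaussian random normal vectors (the example points and trial count are mine):

```python
import math
import random

def h(v, x):
    """Hyperplane hash: which side of the hyperplane normal to v is x on?"""
    return 1 if sum(vi * xi for vi, xi in zip(v, x)) >= 0 else -1

random.seed(0)
x = [1.0, 0.0]
y = [math.cos(math.pi / 3), math.sin(math.pi / 3)]   # 60 degrees from x

trials = 100_000
agree = sum(
    h(v, x) == h(v, y)
    for v in ([random.gauss(0, 1), random.gauss(0, 1)] for _ in range(trials))
)
est = agree / trials                   # empirical Pr[h(x) = h(y)]
exact = 1 - (math.pi / 3) / math.pi    # = 2/3 by the claim
```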

slide-99
SLIDE 99


[Figure: the plane of x and y, with angle θ between them. A hyperplane normal to v falls outside the angle: here h(x) = h(y). A hyperplane normal to v' cuts the angle: here h(x) ≠ h(y). Note: what is important is that the hyperplane is outside the angle, not that the vector is inside.]

slide-100
SLIDE 100


So: Prob[Red case, i.e., h(x) ≠ h(y)] = θ / π

So: P[h(x)=h(y)] = 1 - θ/π = 1 - d(x,y)/π

slide-101
SLIDE 101

 Pick some number of random vectors, and

hash your data for each vector

 The result is a signature (sketch) of

+1’s and –1’s for each data point

 Can be used for LSH like we used the

Min-Hash signatures for Jaccard distance

 Amplify using AND/OR constructions


slide-102
SLIDE 102

 Expensive to pick a random vector in M

dimensions for large M

  • Would have to generate M random numbers

 A more efficient approach

  • It suffices to consider only vectors v

consisting of +1 and –1 components

  • Why? Assuming data is random, vectors of +/-1 cover

the entire space evenly (and do not bias in any way)

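Putting the last two slides together, here is a sketch of building +1/-1 sketches from hyperplanes whose components are themselves just +/-1 (all names and example vectors are illustrative):

```python
import random

def sketch(x, planes):
    """Signature: one +1/-1 entry per random hyperplane."""
    return [1 if sum(v * xi for v, xi in zip(p, x)) >= 0 else -1 for p in planes]

random.seed(1)
dim, n_planes = 8, 16
# +/-1 components suffice; no need to draw dim Gaussian values per vector
planes = [[random.choice([-1, 1]) for _ in range(dim)] for _ in range(n_planes)]

x = [0.9, 0.1, 0.4, 0.8, 0.0, 0.5, 0.2, 0.7]
sx = sketch(x, planes)
```

Sketches are scale-invariant (the hash depends only on the sign of the dot product), and the fraction of agreeing positions between two sketches estimates 1 - θ/π.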

slide-103
SLIDE 103

 Idea: Hash functions correspond to lines
 Partition the line into buckets of size a
 Hash each point to the bucket containing its

projection onto the line

  • An element of the “Signature” is a bucket id for

that given projection line

 Nearby points are always close;

distant points are rarely in same bucket

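A sketch of one such hash function (the projection line, bucket width, and points are made-up examples):

```python
import math
import random

def line_bucket(x, line, a):
    """Bucket id of x's projection onto a unit-length line, buckets of width a."""
    proj = sum(u * xi for u, xi in zip(line, x))
    return math.floor(proj / a)

random.seed(2)
v = [random.gauss(0, 1) for _ in range(3)]
norm = math.sqrt(sum(c * c for c in v))
line = [c / norm for c in v]          # random unit vector = random line
a = 1.0

p = [0.0, 0.0, 0.0]
q = [0.05, 0.0, 0.0]                  # distance 0.05 << a from p
```

Since the projections of p and q differ by at most ||p - q||, two points closer than a can land at most one bucket apart.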

slide-104
SLIDE 104

 “Lucky” case:

  • Points that are close

hash in the same bucket

  • Distant points end up in

different buckets

 Two “unlucky” cases:

  • Top: unlucky

quantization

  • Bottom: unlucky

projection


[Figure: points projected onto a line that is partitioned into buckets of size a]

slide-105
SLIDE 105



slide-106
SLIDE 106

[Figure: two points at distance d projected onto a randomly chosen line with bucket width a.] If d << a, then the chance the points are in the same bucket is at least 1 - d/a.

3/2/2020 109

slide-107
SLIDE 107

[Figure: two points at distance d and a randomly chosen line at angle θ; the projected distance is d cos θ, bucket width a.] If d >> a, θ must be close to 90° for there to be any chance the points go to the same bucket.

slide-108
SLIDE 108

 If points are distance d ≤ a/2, prob.

they are in same bucket ≥ 1 - d/a ≥ ½

 If points are distance d > 2a apart, then they

can be in the same bucket only if d cos θ ≤ a

  • cos θ ≤ ½
  • 60° < θ < 90°, i.e., at most 1/3 probability

 Yields a (a/2, 2a, 1/2, 1/3)-sensitive family of

hash functions for any a

 Amplify using AND-OR cascades

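The 1/3 bound for d = 2a can be checked by simulation: pick a random line direction and a random offset for the bucket grid, and count how often both projections land in the same bucket (the planar model with a uniform angle and grid offset is my reading of the slide):

```python
import math
import random

random.seed(3)
a, d = 1.0, 2.0                          # two points exactly 2a apart
trials = 200_000
same = 0
for _ in range(trials):
    theta = random.uniform(0, math.pi)   # random line direction
    off = random.uniform(0, a)           # random placement of the bucket grid
    p0 = off                             # projection of the first point
    p1 = d * math.cos(theta) + off       # projection of the second point
    if math.floor(p0 / a) == math.floor(p1 / a):
        same += 1
prob = same / trials
```

The estimate comes out well under 1/3, since the worst-case bound counts every angle with cos θ ≤ ½ as a collision.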

slide-109
SLIDE 109


[Diagram: Data → MinHash signatures or Random-Hyperplane sketches of +1/-1 entries → "Bands" technique → Candidate pairs. Design a (d1, d2, p1, p2)-sensitive family of hash functions (for that particular distance metric), then amplify the family using AND and OR constructions]

slide-110
SLIDE 110

 The property P(h(C1)=h(C2)) = sim(C1,C2) of

hash function h is the essential part of LSH; without it we can’t do anything

 LS-hash functions transform data to

signatures so that the bands technique (AND, OR constructions) can then be applied
