Finding Similar Items



SLIDE 1

How similar are these?

SLIDE 2

What’s the Problem?

  • Finding similar items with respect to some distance metric
  • Two variants of the problem:
– Offline: extract all similar pairs of objects from a large collection
– Online: is this object similar to something I’ve seen before?

SLIDE 3

Application: Plagiarism Detection

SLIDE 4

Application: Content‐based Search

SLIDE 5

Other Applications

  • Near‐duplicate detection of webpages

– Mirror pages
– Similar news articles

  • Recommender systems

– Find users with similar tastes in movies, etc.
– Find products with similar customer sets

  • Sequence/tree alignment

– Find similar DNA or protein sequences

SLIDE 6

Finding similar items ‐ Three Components

  • How to quantify similarity?
– Distance measures
  • Euclidean distance: based on locations of points in space, e.g., the L_r norm
  • Non-Euclidean distance: based on properties of points, e.g., Jaccard, cosine, edit distance
  • Compute representation
– Shingling, tf.idf, etc.
  • Space- and time-efficient algorithms
– Transformation: Minhash
– All-pair comparison: Locality sensitive hashing

SLIDE 7

(Axioms of) Distance Metrics

d is a distance measure between points x and y if it satisfies:

  • 1. Non-negativity: d(x, y) ≥ 0
  • 2. Identity: d(x, y) = 0 if and only if x = y
  • 3. Symmetry: d(x, y) = d(y, x)
  • 4. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)

SLIDE 8

Problem: Finding Similar Documents

  • Given N text documents, find pairs that are “near duplicates”
– Find similarity between a pair
– For large N, it can be very compute intensive
  • Can we avoid all-pair comparisons?

[Figure: N×N matrix over docs 1 to N; entry (i, j) gives the degree of similarity between doc i and doc j]
SLIDE 9

Comparing two documents …

  • Naïve methods
– Feature: treat each document as a set/bag of words
– Distance: Jaccard distance (or cosine distance or Hamming distance)

SLIDE 10

Distance: Jaccard

  • Given two sets A, B
  • Jaccard similarity: Sim(A, B) = |A ∩ B| / |A ∪ B|
  • Jaccard distance: d(A, B) = 1 - Sim(A, B)
  • E.g., A = {I, like, CS5344}; B = {CS5344, is, not, for, me}: |A ∩ B| = 1 and |A ∪ B| = 7, so d(A, B) = 1 - 1/7 = 6/7
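A minimal Python sketch of these two measures (function names are mine, not from the slides); it reproduces the example above:

```python
def jaccard_similarity(a, b):
    """Sim(A, B) = |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """d(A, B) = 1 - Sim(A, B)."""
    return 1 - jaccard_similarity(a, b)

A = {"I", "like", "CS5344"}
B = {"CS5344", "is", "not", "for", "me"}
print(jaccard_distance(A, B))  # 0.8571... = 6/7
```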

SLIDE 11

Comparing two documents …

  • Naïve methods
– Feature: treat each document as a set/bag of words
– Distance: Jaccard distance or cosine distance or Hamming distance
– Textual rather than semantic
  • Documents with many common words are more similar, even if the text appears in a different order
– What good is this?
  • Fast filtering before the slower refinement

SLIDE 12

Shingling: Account for ordering of words …

  • Instead of treating each word independently, we can consider a sequence of k words
– More effective in terms of accuracy
  • A k-shingle (or k-gram) for a document D is a sequence of k tokens that appear in D
– Tokens can be characters or words or some feature/object depending on the application
  • E.g., k = 3 characters; D = “This is a test” gives rise to the following set of 3-shingles: S(D) = {Thi, his, is_, s_i, _is, s_a, _a_, a_t, _te, tes, est} (the shingle is_ occurs twice but appears once in the set)
  • E.g., k = 3 words; we have S(D) = {{This is a}, {is a test}}


NOTE: Assume “characters” in our discussion
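A small sketch of both shingling variants (helper names are mine; spaces are written as '_' as in the example above):

```python
def char_shingles(doc, k):
    """The set of k-character shingles of doc, with spaces shown as '_'."""
    s = doc.replace(" ", "_")
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def word_shingles(doc, k):
    """The set of k-word shingles of doc, each one a tuple of k tokens."""
    words = doc.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

print(sorted(char_shingles("This is a test", 3)))
# 11 distinct 3-shingles ('is_' occurs twice but is kept once in the set)
print(word_shingles("This is a test", 3))
# {('This', 'is', 'a'), ('is', 'a', 'test')}
```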

SLIDE 13

Shingles and Similarity

  • Documents that are similar should have many shingles in common
  • What if two documents differ by a word?
– Affects only the k-shingles within distance k from the word
  • What if we reorder paragraphs?
– Affects only the 2k shingles that cross paragraph boundaries
  • Example: k = 3
– The dog which chased the cat vs. The dog that chased the cat
– The only 3-shingles replaced are g_w, _wh, whi, hic, ich, ch_, and h_c

SLIDE 14

Shingles

  • We can represent document D as a set of its k-shingles
– Distance metric: Jaccard distance
  • What’s the effect of the value of k (in terms of characters)?

– Recommended values of k

  • 5 for small documents
  • 10 for large documents

SLIDE 15

Shingles

  • How about space overhead?
– Each character can be represented as a byte (integer)
– A k-shingle requires k bytes (integers)
  • Can compress by hashing a k-shingle to, say, 4 bytes
– D is now a set of 4-byte hash values of its k-shingles
– False positives may occur in matching
  • What’s the advantage?
– Tradeoff between ability to differentiate vs. space
  • It is better to hash 10-shingles to, say, 4 bytes than to use 4-shingles!
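One way to realize the compression in Python; CRC32 is my choice here purely for illustration, since it yields a 32-bit (4-byte) value:

```python
import zlib

def hashed_shingles(shingles):
    """Map each k-shingle to a 4-byte (32-bit) hash value.
    Distinct shingles may collide, so matches can be false positives."""
    return {zlib.crc32(s.encode("utf-8")) for s in shingles}
```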

SLIDE 16

So far …

  • Represent a document as a set of k-shingles or its hash values
  • Use Jaccard distance to compare two documents
  • Can we do better?
– Parallelism vs. sampling
  • This scheme works but …
– What if the set of hash values (or k-shingles) is too large to fit in memory?
– Or the number of documents is too large?


Idea: Find a way to hash a document to a single (small-size) value, and similar documents to the same value!

SLIDE 17

Minhash

  • Seminal algorithm for near-duplicate detection of webpages
– Used by AltaVista
– Documents (HTML pages) represented by shingles (n-grams)
– Jaccard similarity: dups are pairs with high similarity

SLIDE 18

MinHash – Key Idea

  • Hash the set of document shingles (big in terms of space requirement) into a signature (relatively small size)
  • Instead of comparing shingles, we compare signatures
– ESSENTIAL: similarities of signatures and similarities of shingles MUST BE related!!
– Not every hashing function is applicable!
– Need one that satisfies the following:
  • if Sim(D1, D2) is high, then with high prob. h(D1) = h(D2)
  • if Sim(D1, D2) is low, then with high prob. h(D1) ≠ h(D2)
– It is possible to have false positives, and false negatives!
  • Minhashing turns out to be one such function for Jaccard similarity

SLIDE 19

Preliminaries: Representation & Jaccard Measure

  • Sets:
– A = {e1, e3, e7}
– B = {e3, e5, e7}
  • Can be equivalently expressed as matrices:


Let:
M11 = # rows where both elements are 1
M01 = # rows where A = 0, B = 1
M10 = # rows where A = 1, B = 0
M00 = # rows where both elements are 0

SLIDE 20

Computing Minhash

  • Start with the matrix representation of the set
  • Randomly permute the rows of the matrix
  • The minhash (which is the signature) is the first row (in the permuted order) with a “1”
  • Example:


[Figure: input matrix (rows 1-7, columns A and B) next to the permuted matrix; reading down the permuted order, the first row with a 1 in each column gives h(A) = 4 and h(B) = 3]
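A sketch of this permutation-based definition, applied to the sets A = {e1, e3, e7} and B = {e3, e5, e7} from the earlier slide (function name and seed are mine; rows are 0-indexed, so value r stands for element e(r+1)):

```python
import random

def minhash_by_permutation(matrix, rng):
    """matrix[r][c] = 1 iff element r belongs to set c. Permute the rows
    and return, per column, the first row (in permuted order) with a 1."""
    order = list(range(len(matrix)))
    rng.shuffle(order)                      # random row permutation
    n_cols = len(matrix[0])
    sig = [None] * n_cols
    for r in order:                         # scan rows in permuted order
        for c in range(n_cols):
            if sig[c] is None and matrix[r][c] == 1:
                sig[c] = r                  # first 1 seen for this column
        if None not in sig:
            break
    return sig

# Rows e1..e7; columns A = {e1, e3, e7} and B = {e3, e5, e7}
M = [[1, 0], [0, 0], [1, 1], [0, 0], [0, 1], [0, 0], [1, 1]]
print(minhash_by_permutation(M, random.Random(0)))
```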

SLIDE 21

Minhash and Jaccard

Classify each row as type M11, M10, M01, or M00. Rows of type M00 contain no 1 in either column, so the minhash of both columns is decided by the first row, in permuted order, that is not of type M00. That row is of type M11 with probability M11 / (M11 + M10 + M01), so Pr[h(A) = h(B)] is exactly the Jaccard similarity of A and B.
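A quick Monte Carlo check of this claim, reusing minhash_by_permutation from the sketch above. For A = {e1, e3, e7} and B = {e3, e5, e7} we have M11 = 2 and M10 = M01 = 1, so Sim(A, B) = 2/4 = 0.5:

```python
import random

M = [[1, 0], [0, 0], [1, 1], [0, 0], [0, 1], [0, 0], [1, 1]]
rng = random.Random(42)
trials = 100_000
agree = 0
for _ in range(trials):
    sig = minhash_by_permutation(M, rng)  # one shared permutation per trial
    agree += sig[0] == sig[1]             # do the two minhash values match?
print(agree / trials)                     # ~0.5 = Sim(A, B)
```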

SLIDE 22

MinHash – False positive/negative

  • Instead of comparing sets, we now compare only 1 hash value!
  • False positives?
– False positives can easily be dealt with by doing an additional layer of checking (treat minhash as a filtering mechanism)
  • False negatives?
  • High error rate! Can we do better?

SLIDE 23

Using multiple minhash signatures


Comparison between two sets (original matrix) becomes comparison between two columns of minhash values (signature matrix)

[Figure: input matrix → permutations → minhash signatures → similarities]

The similarity between signatures of two columns is given by the fraction of hash functions in which they agree

SLIDE 24

Implementation of MinHash Computation

  • Permutations are expensive
– Incur space and random-access costs (if data cannot fit into memory)
  • Interpret the hash value as the permutation
  • Only need to keep track of the minimum hash values

SLIDE 25

Implementation of minhash (By example)


h(x) = x mod 5 + 1;  g(x) = (2x + 1) mod 5 + 1

Row x   h(x)   g(x)
1       2      4
2       3      1
3       4      3
4       5      5
5       1      2

SLIDE 26

Implementation of minhash (By example)


Initialization: set signatures to ∞. Apply all hash functions on each row:

  • If the column value (of the source matrix) is 1, keep the minimum value
  • Otherwise, do nothing

h(x) = x mod 5 + 1;  g(x) = (2x + 1) mod 5 + 1

Row             Sig1 (h, g)   Sig2 (h, g)
init            (∞, ∞)        (∞, ∞)
1: h=2, g=4     (2, 4)        (∞, ∞)
2: h=3, g=1     (2, 4)        (3, 1)
3: h=4, g=3     (2, 3)        (3, 1)
4: h=5, g=5     (2, 3)        (3, 1)
5: h=1, g=2     (2, 3)        (1, 1)
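The same computation in code. The slides do not show the source matrix explicitly; I assume column 1 has 1s in rows {1, 3} and column 2 in rows {2, 5}, which is consistent with the trace above:

```python
INF = float("inf")

def h(x):
    return x % 5 + 1

def g(x):
    return (2 * x + 1) % 5 + 1

columns = [{1, 3}, {2, 5}]         # assumed rows holding a 1, per column
hash_funcs = [h, g]

# sig[i][c]: minimum of hash function i over column c, initialized to infinity
sig = [[INF] * len(columns) for _ in hash_funcs]
for row in range(1, 6):            # a single pass over the rows
    for i, f in enumerate(hash_funcs):
        for c, ones in enumerate(columns):
            if row in ones:        # column value is 1: keep the minimum
                sig[i][c] = min(sig[i][c], f(row))

print(sig)  # [[2, 1], [3, 1]], i.e., Sig1 = (2, 3) and Sig2 = (1, 1)
```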

SLIDE 27

So far …

  • Represent a document as a set of hash values (of its k-shingles)
  • Transform the set of k-shingles into a set of minhash signatures
  • Use Jaccard to compare two documents by comparing their signatures
  • Is this method (i.e., transforming sets to signatures) necessarily “better”??

SLIDE 28

The BIG Picture (All‐pair comparison)

[Figure: the big picture. A document such as “This course is about big data analytics” is shingled into a set of strings of length k; minhashing turns that set into a signature (which can capture similarity); locality sensitive hashing maps signatures into buckets, and signatures falling into the same bucket are “similar”. Another document goes through the same shingling pipeline.]

SLIDE 29

Find all near‐duplicates among N documents

  • Naïve solution
– For each document, compare with the other N-1 documents
  • Takes N-1 comparisons per document
  • Can optimize using filter-and-refine mechanisms
– Requires N(N-1)/2 comparisons in total
– For large N, still takes ages …
  • E.g., N = 10^7: we have ~10^14 comparisons; if each comparison takes 1 μs, we need 10^8 sec (~3 years!)

SLIDE 30

Locality Sensitive Hashing (LSH)

  • Suppose we have N documents
  • For each document, we can derive, say, k minhash signatures

minhash   D1  D2  D3  D4  …  DN
1          3   3   2   2  …   3
2          7   7   5   5  …   7
…
k-1        2   2   2   2  …   2
k          1   1   1   1  …   2

SLIDE 31

Idea of hashing

[Figure: the same signature matrix; a hash function F is applied to each column, mapping documents into buckets]

SLIDE 32

LSH

  • Documents that fall into the same bucket are likely to be similar
– More false positives (as a result of LSH) are possible
  • Why? Because of collisions (subject to the hash function and number of buckets)
  • How to deal with this? Refinement step
– More false negatives? Of course!
  • Finding all pairs within a bucket is computationally cheaper
– Declare all pairs within a bucket to be “matching”, OR
– Perform pair-wise comparisons for those documents that fall into the same bucket
  • Much smaller than pair-wise over all documents

SLIDE 33

LSH

  • With only 1 hash function on one entire column of the signature, we are likely to have many false negatives
  • Key idea: apply the hash function on the column multiple times, each on a partition of the column
– Similar columns will be hashed to the same bucket (with high probability)
– Candidate pairs are those that hash at least once to the same bucket

SLIDE 34

LSH ‐ Intuition


Two columns of 12 minhash values each, from likely-similar documents:
C1 = 3 6 9 4 2 8 4 8 1 1 4 2
C2 = 3 5 9 4 2 8 4 7 1 1 4 2

Hash on all 12 minhash values: not similar. Partition the 12 minhash values into two sets of 6: still not similar. Partition them into four sets of 3: potentially similar (≥ 50% similar), since the sets 4 2 8 and 1 4 2 match.

SLIDE 35

n Sets of k Minhash Signatures

  • For each document, compute n sets of k minhash values
  • For each set, concatenate the k minhash values together


  • Within each set:
– Hash on the concatenated values (this has the “same” effect as ensuring all columns have the same values)
– All documents with the same hash value will be bucketed together
– Output all pairs within each bucket
  • Candidate pairs are those that have the same bucket for ≥ 1 set
  • De-dup pairs (a sketch of this procedure follows)
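A sketch of the bucketing procedure just described (identifiers are mine). For simplicity, the tuple of k concatenated minhash values serves directly as the bucket key, which plays the role of hashing the concatenation:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, n, k):
    """signatures: doc id -> list of n*k minhash values.
    Returns the de-duplicated set of candidate pairs."""
    candidates = set()
    for s in range(n):                            # one pass per set
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            key = tuple(sig[s * k:(s + 1) * k])   # k concatenated values
            buckets[key].append(doc)              # same key -> same bucket
        for docs in buckets.values():             # output pairs per bucket
            candidates.update(combinations(sorted(docs), 2))
    return candidates

# The two columns from the intuition slide: n = 4 sets of k = 3 values
sigs = {"C1": [3, 6, 9, 4, 2, 8, 4, 8, 1, 1, 4, 2],
        "C2": [3, 5, 9, 4, 2, 8, 4, 7, 1, 1, 4, 2]}
print(lsh_candidate_pairs(sigs, n=4, k=3))  # {('C1', 'C2')}
```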
slide-36
SLIDE 36

n sets of k minhash signatures


[Figure: buckets for two different sets. In each set, documents 1 and 3 fall into the same bucket (potentially similar), while documents 5 and 7 do not (potentially dissimilar). Because the pair (D1, D3) shares a bucket in both sets: 1. duplicate pairs arise, and 2. D1 and D3 are similar over at least 2k signatures.]

The number of buckets is set to be as large as possible to minimize collisions. Candidate column pairs are those that hash to the same bucket for at least 1 set.

SLIDE 37

Example

  • Given 100,000 columns (documents)
  • Let there be 100 signatures
  • Let n = 20 and k = 5
  • Goal: Find pairs that are at least 80% similar

NOTE: If each signature is represented as a 4-byte integer value, we need only 100 × 4 × 100,000 bytes = 40 MB of memory!

SLIDE 38

Example

  • Suppose C1 and C2 are 80% similar
  • Probability that C1 and C2 are identical in one set = (0.8)^5 = 0.328
  • Probability that C1 and C2 are not identical in any of the 20 sets = (1 - 0.328)^20 = 0.00035
– About 0.00035 of the 80%-similar column pairs are false negatives
– We would find 99.965% of the pairs of truly similar documents

SLIDE 39

Example

  • Suppose C1 and C2 are 30% similar
  • Probability that C1 and C2 are identical in one set = (0.3)^5 = 0.00243
  • Probability that C1 and C2 are identical in at least 1 of the 20 sets ≤ 20 × 0.00243 = 0.0486
– About 4.86% of pairs with similarity 30% end up becoming candidate pairs: false positives

SLIDE 40

Need to tune LSH

  • Choose n and k to balance between false positives and false negatives
  • What if we use only 15 sets instead of 20?
– Lower number of false positives
– Higher number of false negatives

SLIDE 41

Ideally, ….

[Figure: probability of hashing to the same bucket vs. similarity s of two sets. Ideally this is a step function: 0 below a similarity threshold and 1 above it. With n = 1, k = 1 the probability is simply s (a diagonal line), giving false negatives above the threshold and false positives below it.]

SLIDE 42

n sets, k signatures/set

  • C1 and C2 have similarity s
  • For any one set (k rows)
– Prob. that all k rows are equal = s^k
– Prob. that some rows are not the same = 1 - s^k
  • Prob. that no set is identical = (1 - s^k)^n
  • Prob. that at least 1 set is identical = 1 - (1 - s^k)^n
  • Tune n and k to minimize false negatives and false positives
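These formulas in a few lines of Python; the loop approximately reproduces the n = 20, k = 5 table two slides below:

```python
def prob_candidate(s, n, k):
    """Probability that at least 1 of the n sets has all k rows equal:
    1 - (1 - s^k)^n."""
    return 1 - (1 - s ** k) ** n

for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"s = {s:.1f}: {prob_candidate(s, n=20, k=5):.4f}")
# s = 0.2 gives 0.0064, s = 0.5 gives 0.4700, s = 0.8 gives 0.9996
```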

SLIDE 43

The S‐curve for arbitrary n and k (> 1)

[Figure: the S-curve of the probability of hashing to the same bucket vs. similarity s of two sets. The threshold is at t ≈ (1/n)^(1/k); with n = 16 and k = 4, t = 1/2.]

SLIDE 44

Example: n = 20, k = 5

s     1 - (1 - s^k)^n
0.2   0.006
0.3   0.047
0.4   0.186
0.5   0.470
0.6   0.802
0.7   0.975
0.8   0.9996

SLIDE 45

So far, …

  • Tune to minimize false positives and false negatives
  • Need to check that candidate pairs have similar signatures
  • Need further refinement to find really similar documents
  • Jaccard distance is useful also for other applications, e.g., customer/item purchase histories

SLIDE 46

Generalizing LSH: Multiple Hash Functions

  • So far, we have assumed only one hash function
– h(x) = h(y) implies “h says x and y are equal”
  • We could instead use a family of hash functions and pick any of them

SLIDE 47

Locality‐Sensitive (LS) Families

  • Consider a space S of points with a distance measure d


  • A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:
– If d(x, y) ≤ d1, then the prob. over all h in H that h(x) = h(y) is at least p1
– If d(x, y) ≥ d2, then the prob. over all h in H that h(x) = h(y) is at most p2

SLIDE 48

Example


  • Let S = sets, d = Jaccard distance
  • Minhashing gives a (d1, d2, p1, p2)-sensitive family for any d1 < d2
– E.g., H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d
– If distance ≤ 1/3 (i.e., similarity ≥ 2/3), then the prob. that the minhash values agree is ≥ 2/3
  • Recall: Pr(h(x) = h(y)) = 1 - d(x, y)
  • No guarantees about the fraction of false positives
– The theory leaves unknown what happens to pairs that are at a distance between d1 and d2

SLIDE 49

Amplifying a LS‐family: AND Construction of Hash Functions


  • Given family H, construct family H’ consisting of r functions from H
  • For h = [h1,…,hr] in H’, h(x) = h(y) if and only if hi(x) = hi(y) for all i
  • This has the same effect as the k concatenated signatures within one set
– x and y are considered a candidate pair if every one of the r rows says that x and y are equal
  • Theorem: If H is (d1, d2, p1, p2)-sensitive, then H’ is (d1, d2, p1^r, p2^r)-sensitive
– That is, for any p, if p is the probability that a member of H will declare (x, y) to be a candidate pair, then the probability that a member of H’ will so declare is p^r

SLIDE 50

Amplifying a LS‐family: OR Construction of Hash Functions


  • Given family H, construct family H’ consisting of b functions from H
  • For h = [h1,…,hb] in H’, h(x) = h(y) if and only if hi(x) = hi(y) for at least one i
  • Mirrors the effect of combining several sets: x and y become a candidate pair if any set makes them a candidate pair
  • Theorem: If H is (d1, d2, p1, p2)-sensitive, then H’ is (d1, d2, 1-(1-p1)^b, 1-(1-p2)^b)-sensitive
– That is, for any p, if p is the probability that a member of H will declare (x, y) to be a candidate pair, then (1-p) is the probability that it will not so declare
– (1-p)^b is the probability that none of h1,…,hb will declare x and y a candidate pair
– 1 - (1-p)^b is the probability that at least one hi will declare (x, y) a candidate pair, and therefore that H’ will declare (x, y) a candidate pair

SLIDE 51

Effect of AND and OR Constructions

  • AND makes all probabilities shrink, but by choosing r correctly, we can make the lower probability approach 0 while the higher does not
  • OR makes all probabilities grow, but by choosing b correctly, we can make the upper probability approach 1 while the lower does not

AND: (p1, p2) → ((p1)^r, (p2)^r);  OR: (p1, p2) → (1-(1-p1)^b, 1-(1-p2)^b)

SLIDE 52

Composing Constructions: AND‐OR Composition

  • r-way AND construction followed by b-way OR construction
– Exactly what we did with minhashing
  • Take points x and y s.t. Pr[h(x) = h(y)] = p
  • H will make (x, y) a candidate pair with probability p
  • The construction makes (x, y) a candidate pair with probability 1 - (1 - p^r)^b

– The S‐Curve!

SLIDE 53

Example

  • Take H and construct H’ by the AND construction with r = 4. Then, from H’, construct H’’ by the OR construction with b = 4
  • E.g., transform a (0.2, 0.8, 0.8, 0.2)-sensitive family into a (0.2, 0.8, 0.8785, 0.0064)-sensitive family
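A small sketch verifying this example (function names are mine):

```python
def and_construction(p1, p2, r):
    """AND: all r functions must agree, so each probability becomes p^r."""
    return p1 ** r, p2 ** r

def or_construction(p1, p2, b):
    """OR: at least one of b functions agrees; p becomes 1 - (1-p)^b."""
    return 1 - (1 - p1) ** b, 1 - (1 - p2) ** b

# AND with r = 4, then OR with b = 4, on a (0.2, 0.8, 0.8, 0.2)-family
p1, p2 = and_construction(0.8, 0.2, r=4)  # (0.4096, 0.0016)
p1, p2 = or_construction(p1, p2, b=4)
print(round(p1, 4), round(p2, 4))         # 0.8785 0.0064
```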

SLIDE 55

Composing Constructions: OR‐AND Composition

  • b-way OR construction followed by r-way AND construction
  • Transforms probability p into (1 - (1-p)^b)^r

SLIDE 56

Example

  • Take H and construct H’ by the OR construction with b = 4. Then, from H’, construct H’’ by the AND construction with r = 4
  • E.g., transform a (0.2, 0.8, 0.8, 0.2)-sensitive family into a (0.2, 0.8, 0.9936, 0.1215)-sensitive family

SLIDE 57

p     p^8 (all 8 must match)     1 - (1-p)^8 (at least one match)
.2    0.000                      0.832
.3    0.000                      0.942
.4    0.000                      0.983
.5    0.003                      0.996
.6    0.016                      0.999
.7    0.057                      0.999
.8    0.167                      0.999
.9    0.387                      0.999

AND-OR: all 4 in each group must match, and it is enough to have one such group. OR-AND: at least one of the 4 in each group must match, and all 4 groups must match.

SLIDE 58

Cascading Constructions

  • Apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction
  • Transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9999996, .0008715)-sensitive family

  • Note this family uses 256 (= 4 × 4 × 4 × 4) of the original hash functions
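A sketch checking the cascade numerically (function names are mine):

```python
def and_or(p, r, b):
    """AND-OR composition: p -> 1 - (1 - p^r)^b."""
    return 1 - (1 - p ** r) ** b

def or_and(p, b, r):
    """OR-AND composition: p -> (1 - (1 - p)^b)^r."""
    return (1 - (1 - p) ** b) ** r

# (4,4) OR-AND followed by (4,4) AND-OR on the (.2, .8, .8, .2) family
for p in (0.8, 0.2):
    print(and_or(or_and(p, b=4, r=4), r=4, b=4))
# ~0.9999996 for p1 = 0.8 and ~0.0008715 for p2 = 0.2
```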

SLIDE 59

So far …

  • Pick any two distances x < y
  • Start with a (x, y, (1-x), (1-y))-sensitive family
  • Apply constructions to produce a (x, y, p, q)-sensitive family, where p is almost 1 and q is almost 0
  • The closer to 0 and 1 we get, the more hash functions must be used
  • What about other distances? Euclidean distance? Cosine distance?

SLIDE 60

Conclusion

  • Many applications require finding similar items
  • 3 key components
– Feature representation, distance measure, efficient algorithms
  • Focus on finding similar documents at various levels
– Shingles, minhashing, and LSH
  • LSH for other distance measures (cosine and Euclidean distance)
