SLIDE 1

Slide credits: Mining of Massive Datasets, by Jure Leskovec, Anand Rajaraman, and Jeff Ullman

Stanford University

http://www.mmds.org

Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

SLIDE 2

 Many problems can be expressed as finding "similar" objects:

  • Find near(est) neighbors

 Example applications:

  • Pages with similar words: duplicate detection, clustering by topic
  • Customers who purchased similar products: kNN classification, collaborative filtering
  • Images with similar features: image recommendation
  • Record linkage (deduplication)

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

SLIDE 3

10 nearest neighbors from a collection of 2 million images

[Hays and Efros, SIGGRAPH 2007]

SLIDE 4

 Given: (high-dimensional) data points x1, x2, …

  • For example: an image is a vector of pixel colors

    [1 2 1 0 2 1 0 1 0]

 And some distance function d(x1, x2)

  • which quantifies the "distance" between x1 and x2

 Goal: Find all pairs of data points (xi, xj) that are within some distance threshold: d(xi, xj) ≤ s

SLIDE 5

 Given: (high-dimensional) data points x1, x2, …

  • For example: an image is a vector of pixel colors

    [1 2 1 0 2 1 0 1 0]

 And some distance function d(x1, x2)

  • which quantifies the "distance" between x1 and x2

 Goal: Find all pairs of data points (xi, xj) that are within some distance threshold: d(xi, xj) ≤ s

 A naïve solution would take O(N²), where N is the number of data points

SLIDE 6

 Hash objects to buckets such that objects that are similar hash to the same bucket

 Only compare candidates within each bucket

 Benefit: instead of O(N²) comparisons, we need O(N) to find similar documents

 Hash functions depend on the similarity function

SLIDE 7

 Goal: Given a large number (N in the millions or billions) of documents, find "near-duplicate" pairs

 Applications:

  • Mirror websites, or approximate mirrors
  • Similar news articles at many news sites

 Problem:

  • Too many documents to compare all pairs

SLIDE 8

 Shingling: Convert documents to sets

 Simple approaches:

  • Document = set of words appearing in the document
  • Document = set of "important" words

SLIDE 9

 Need to account for the ordering of words!

  • Document = set of shingles

 A k-shingle (or k-gram) is a sequence of k consecutive tokens that appears in the document

  • Tokens can be characters or words

 Example:

  • k = 2
  • D1 = abcab
  • Set of 2-shingles: S(D1) = {ab, bc, ca}
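The character-level shingling above can be sketched in a few lines; the function name `shingles` is my own, and the example reproduces the slide's D1 with k = 2:

```python
def shingles(doc: str, k: int) -> set:
    """Set of k-shingles: all length-k substrings (character tokens) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

# The slide's example: D1 = abcab, k = 2
s_d1 = shingles("abcab", 2)
```

Note that the result is a set, so the repeated shingle "ab" appears only once.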

SLIDE 10

 A natural similarity measure is the Jaccard similarity:

 sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|

SLIDE 11

 Encode sets as 0/1 (bit, boolean) vectors

  • One dimension per element of the universal set

 Interpret set intersection as bitwise AND, and set union as bitwise OR

 Example: C1 = 10111; C2 = 10011

  • Size of intersection = 3; size of union = 4
  • Jaccard similarity = 3/4
  • Distance: d(C1, C2) = 1 − (Jaccard similarity) = 1/4
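A minimal sketch of Jaccard similarity on bit-vector columns, using the slide's AND/OR interpretation (the function name is my own):

```python
def jaccard_bits(c1: str, c2: str) -> float:
    """Jaccard similarity of two equal-length 0/1 vectors given as bit strings.
    Intersection corresponds to bitwise AND, union to bitwise OR."""
    inter = sum(a == "1" and b == "1" for a, b in zip(c1, c2))
    union = sum(a == "1" or b == "1" for a, b in zip(c1, c2))
    return inter / union

# The slide's example: C1 = 10111, C2 = 10011
sim = jaccard_bits("10111", "10011")
dist = 1 - sim
```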

SLIDE 12

 Rows = elements (shingles)

 Columns = sets (documents)

  • 1 in row e and column s if and only if e is a member of s
  • Column similarity is the Jaccard similarity of the corresponding sets
  • A typical matrix is sparse!

 Example: sim(C1, C2) = ?

[Figure: boolean input matrix (shingles × documents)]

SLIDE 13

 Rows = elements (shingles)

 Columns = sets (documents)

  • 1 in row e and column s if and only if e is a member of s
  • Column similarity is the Jaccard similarity of the corresponding sets
  • A typical matrix is sparse!

 Example: sim(C1, C2) = ?

  • Size of intersection = 3; size of union = 6; Jaccard similarity = 3/6
  • d(C1, C2) = 1 − (Jaccard similarity) = 3/6

[Figure: boolean input matrix (shingles × documents)]

SLIDE 14

 Suppose we need to find near-duplicate documents among N = 1 million documents

 Naïvely, we would have to compute pairwise Jaccard similarities for every pair of docs

  • N(N − 1)/2 ≈ 5·10¹¹ comparisons
  • At 10⁵ secs/day and 10⁶ comparisons/sec, it would take 5 days

 For N = 10 million, it takes more than a year…
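The slide's arithmetic can be checked directly:

```python
N = 1_000_000
pairs = N * (N - 1) // 2     # all-pairs comparisons, approximately 5 * 10**11
secs = pairs / 10**6         # at 10**6 comparisons per second
days = secs / 10**5          # using the slide's rounding of ~10**5 seconds per day
```

Scaling N by 10 scales the pair count by roughly 100, which is where "more than a year" for 10 million documents comes from.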

SLIDE 15

 Key Idea: “hash” each column C to a small

signature h(C):

  • (1) h(C) is small enough that the signature fits in RAM
  • (2) sim(C1, C2) is the same as the “similarity” of

signatures h(C1) and h(C2)

 Locality sensitive hashing:

  • If sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
  • If sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

 Expect that “most” pairs of near duplicate docs

hash into the same bucket!


SLIDE 16

[Figure: boolean input matrix (shingles × documents)]

SLIDE 17

[Figure: the input matrix (shingles × documents) with a random permutation π = (3 4 7 2 6 1 5) of its rows, and the resulting signature matrix M. For one column, the 2nd element of the permutation is the first to map to a 1; for another, the 4th element of the permutation is the first to map to a 1.]

SLIDE 18

 Imagine the rows of the boolean matrix permuted under a random permutation π

 Define a "hash" function hπ(C) = the index of the first (in the permuted order π) row in which column C has value 1:

 hπ(C) = min π(C)

 Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature for each column
SLIDE 19

 Permuting the rows even once is prohibitive

 Row hashing!

  • Pick K hash functions h_i
  • Ordering under h_i gives a random row permutation!

 How to pick a random hash function h(x)? Universal hashing:

 h_{a,b}(x) = ((a·x + b) mod p) mod N

 where a, b are random integers and p is a prime larger than N
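A sketch of drawing members of this family with Python's `random` module; the fixed Mersenne prime is my choice, and note that reducing mod N after the prime reduction only approximates a true permutation, since collisions are possible:

```python
import random

def make_hash(n_rows: int, p: int = 2_147_483_647, rng=random):
    """One member h_{a,b}(x) = ((a*x + b) mod p) mod n_rows of a universal
    family; p is a prime assumed larger than n_rows, a and b are random."""
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % n_rows

random.seed(1)
h1, h2 = make_hash(7), make_hash(7)   # two pseudo-permutations of 7 rows
```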

SLIDE 20

 One-pass implementation

  • For each column C, initialize all hash values: sig(C)[i] = ∞ for each hash function i
  • For each row r:
  • If there is a 1 in column C, update the hash value of column C if the row's position in the current permutation is smaller than the current value:
  • If h_i(r) < sig(C)[i], then sig(C)[i] ← h_i(r)
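The one-pass scheme above can be sketched as follows; the matrix layout and the two toy hash functions, (x + 1) mod 5 and (3x + 1) mod 5, are illustrative choices of mine, not from the slides:

```python
def minhash(matrix, hash_funcs):
    """One-pass MinHash.  matrix[r][c] is the 0/1 entry for row (shingle) r
    and column (document) c.  sig[i][c] starts at infinity and is lowered
    whenever row r has a 1 in column c and h_i(r) is smaller."""
    n_cols = len(matrix[0])
    sig = [[float("inf")] * n_cols for _ in hash_funcs]
    for r, row in enumerate(matrix):
        hs = [h(r) for h in hash_funcs]   # row r's position in each permutation
        for c, bit in enumerate(row):
            if bit:
                for i, hv in enumerate(hs):
                    if hv < sig[i][c]:
                        sig[i][c] = hv
    return sig

# Toy 5x4 input matrix (rows = shingles, columns = documents)
M = [[1, 0, 0, 1],
     [0, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 1],
     [0, 0, 1, 0]]
sig = minhash(M, [lambda x: (x + 1) % 5, lambda x: (3 * x + 1) % 5])
```

The algorithm reads each row once, so the cost is proportional to the number of 1s in the matrix times the number of hash functions.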

SLIDE 21
SLIDE 22

[Figure: the input matrix (shingles × documents) with three row permutations, (3 4 7 2 6 1 5), (4 5 1 6 7 3 2), and (5 7 6 3 1 2 4), and the resulting 3 × 4 signature matrix M.]

SLIDE 23

 One bit matching (given a permutation π):

 Pr[hπ(C1) = hπ(C2)] = ?

[Figure: signature matrix M]

SLIDE 24

 One bit matching (given a permutation π):

 Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)

[Figure: signature matrix M]

SLIDE 25

 Given columns C1 and C2, rows may be classified as:

        C1  C2
    A    1   1
    B    1   0
    C    0   1
    D    0   0

  • a = # rows of type A, etc.

 Note: sim(C1, C2) = a/(a + b + c)

 Then: Pr[h(C1) = h(C2)] = sim(C1, C2)

  • Look down columns C1 and C2 (in permuted order) until we see a 1
  • If it's a type-A row, then h(C1) = h(C2); if a type-B or type-C row, then not
SLIDE 26

 One given bit matching:

 Pr[h(C1) = h(C2)] = sim(C1, C2)

 The expected similarity of the signatures (defined as the fraction of matching values):

 sim(h(C1), h(C2)) = ?

[Figure: signature matrix M]

SLIDE 27

 One given bit matching:

 Pr[h(C1) = h(C2)] = sim(C1, C2)

 The expected similarity of the signatures (defined as the fraction of matching values):

 sim(h(C1), h(C2)) = sim(C1, C2)

[Figure: signature matrix M]

SLIDE 28

 Similarities:

            1-3    2-4    1-2    3-4
  Col/Col   0.75   0.75   0      0
  Sig/Sig   0.67   1.00   0      0

[Figure: the input matrix (shingles × documents), three row permutations, and the resulting signature matrix M.]

SLIDE 29

 Key Idea: “hash” each column C to a small

signature h(C):

  • (1) h(C) is small enough that the signature fits in RAM
  • (2) sim(C1, C2) is the same as the “similarity” of

signatures h(C1) and h(C2)

 Locality sensitive hashing:

  • If sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
  • If sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

 Expect that “most” pairs of near duplicate docs

hash into the same bucket!


SLIDE 30

[Figure: the ideal case. Probability of sharing a bucket vs. similarity t = sim(C1, C2) of two sets: no chance if t < s, probability 1 if t > s, where s is the similarity threshold.]
SLIDE 31

[Figure: probability of matching 1 bit vs. similarity t = sim(C1, C2) of two sets.]
SLIDE 32

 One given bit matching:
 Pr[h(C1) = h(C2)] = sim(C1, C2)

 Similarity of signatures:
 sim(h(C1), h(C2)) = sim(C1, C2)

 All K bits matching (AND):
 Pr[all h(C1) = h(C2)] = ?

[Figure: signature matrix M]

SLIDE 33

 One given bit matching:
 Pr[h(C1) = h(C2)] = sim(C1, C2)

 Similarity of signatures:
 sim(h(C1), h(C2)) = sim(C1, C2)

 All K bits matching (AND):
 Pr[all h(C1) = h(C2)] = sim(C1, C2)^K

[Figure: signature matrix M]

SLIDE 34

 One given bit matching:
 Pr[h(C1) = h(C2)] = sim(C1, C2)

 Similarity of signatures:
 sim(h(C1), h(C2)) = sim(C1, C2)

 All K bits matching (AND):
 Pr[all h(C1) = h(C2)] = sim(C1, C2)^K

 Any bit matching (OR):
 Pr[any h(C1) = h(C2)] = ?

[Figure: signature matrix M]

SLIDE 35

 One given bit matching:
 Pr[h(C1) = h(C2)] = sim(C1, C2)

 Similarity of signatures:
 sim(h(C1), h(C2)) = sim(C1, C2)

 All K bits matching (AND):
 Pr[all h(C1) = h(C2)] = sim(C1, C2)^K

 Any bit matching (OR):
 Pr[any h(C1) = h(C2)] = 1 − (1 − sim(C1, C2))^K

[Figure: signature matrix M]
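The AND and OR formulas above can be checked numerically; a minimal sketch with function names of my own:

```python
def p_and(sim: float, K: int) -> float:
    """Pr[all K independent MinHash values match] = sim**K."""
    return sim ** K

def p_or(sim: float, K: int) -> float:
    """Pr[at least one of K independent values matches] = 1 - (1 - sim)**K."""
    return 1 - (1 - sim) ** K

and_08 = p_and(0.8, 5)   # AND drives probabilities down
or_08 = p_or(0.8, 5)     # OR drives probabilities up
```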

SLIDE 36

[Figure: signature matrix M divided into b bands of r rows each; one band of one column forms a mini-signature.]

SLIDE 37

 Divide matrix M into b bands of r rows each

 Candidate column pairs are those that hash to the same values in at least one band

 Tune b and r to catch most similar pairs, but few non-similar pairs
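A minimal sketch of the banding step, assuming the signature matrix is a list of rows (one per hash function) with one column per document; the names are my own:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(sig, b, r):
    """LSH banding: sig has b*r rows of hash values.  Columns with identical
    values in all r rows of some band land in the same bucket for that band
    and become a candidate pair."""
    n_cols = len(sig[0])
    cands = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(n_cols):
            key = tuple(sig[band * r + i][c] for i in range(r))
            buckets[key].append(c)
        for cols in buckets.values():
            cands.update(combinations(cols, 2))
    return cands

# Toy signature matrix: 4 rows (b = 2 bands of r = 2 rows), 3 documents.
sig = [[1, 1, 2],
       [2, 2, 3],
       [5, 7, 7],
       [6, 8, 8]]
pairs = candidate_pairs(sig, b=2, r=2)
```

Here columns 0 and 1 agree in the first band, and columns 1 and 2 agree in the second, so both pairs become candidates even though no pair agrees everywhere.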
SLIDE 38

 Columns C1 and C2 have similarity t

 Prob. that one given band (r rows) matches = ?

 Prob. that any band matches = ?
SLIDE 39

 Columns C1 and C2 have similarity t

 Prob. that one given band (r rows) matches = t^r

 Prob. that any band matches = 1 − (1 − t^r)^b
SLIDE 40

 Probability of sharing a bucket: 1 − (1 − t^r)^b

 The threshold (the similarity at which the S-curve rises steepest) is approximately s ≈ (1/b)^(1/r)

[Figure: probability of sharing a bucket vs. similarity t = sim(C1, C2) of two sets.]
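The S-curve and the threshold approximation can be evaluated directly; b = 20 and r = 5 are illustrative parameter choices of mine:

```python
def p_share_bucket(t: float, b: int, r: int) -> float:
    """S-curve: probability that two columns of similarity t become
    candidates under banding, 1 - (1 - t**r)**b."""
    return 1 - (1 - t ** r) ** b

high = p_share_bucket(0.8, b=20, r=5)   # similar pair: almost surely caught
low = p_share_bucket(0.2, b=20, r=5)    # dissimilar pair: rarely a candidate
threshold = (1 / 20) ** (1 / 5)         # s ~ (1/b)**(1/r), about 0.55 here
```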

SLIDE 41

 Tradeoff between false positives and false negatives

 Example: 50 hash functions (r = 5, b = 10)

[Figure: probability of sharing a bucket vs. similarity. Blue area: false negative rate; green area: false positive rate.]
SLIDE 42

SLIDE 43

 Given a (d1, d2, p1, p2)-sensitive family F

 AND-construction:

  • AND of r members of F
  • (d1, d2, p1^r, p2^r)-sensitive
  • Mirrors the effect of r rows in a single band

 OR-construction:

  • OR of b members of F
  • (d1, d2, 1 − (1 − p1)^b, 1 − (1 − p2)^b)-sensitive
  • Mirrors the effect of combining multiple bands

 Combine the two: select r and b to push p1 toward 1 and p2 toward 0
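The two constructions can be sketched as transformations on the sensitivity tuple; the base family (0.2, 0.6, 0.8, 0.4) is a hypothetical example of mine, not from the slides:

```python
def and_construction(fam, r):
    """AND of r members: (d1, d2, p1, p2) -> (d1, d2, p1**r, p2**r)."""
    d1, d2, p1, p2 = fam
    return (d1, d2, p1 ** r, p2 ** r)

def or_construction(fam, b):
    """OR of b members: (d1, d2, p1, p2) -> (d1, d2, 1-(1-p1)**b, 1-(1-p2)**b)."""
    d1, d2, p1, p2 = fam
    return (d1, d2, 1 - (1 - p1) ** b, 1 - (1 - p2) ** b)

base = (0.2, 0.6, 0.8, 0.4)   # hypothetical (d1,d2,p1,p2)-sensitive family
amplified = or_construction(and_construction(base, 4), 4)
```

AND followed by OR widens the gap: here p1 rises from 0.8 toward 1 while p2 falls from 0.4 toward 0, at the cost of using 16 hash functions instead of one.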

SLIDE 44

 Hash function: a d-dimensional bit vector is hashed to its i-th bit value (one function per dimension)

 The family is (d1, d2, 1 − d1/d, 1 − d2/d)-sensitive for Hamming distance

 Amplify with AND/OR constructions

SLIDE 45

 Distance function: d(x, y) = θ, the angle between x and y

 A randomly chosen vector v_f defines a hash function f

 Given two vectors x and y, f(x) = f(y) iff v_f·x and v_f·y have the same sign

 The family is (d1, d2, 1 − d1/180, 1 − d2/180)-sensitive
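A sketch of this random-hyperplane family, using the relation above (Pr[f(x) = f(y)] = 1 − θ/180) to estimate an angle; the vector pair and sample count are my own choices, and the estimate is only approximate:

```python
import random

def hyperplane_hash(dim, rng):
    """One hash function: the sign of the dot product with a random vector v_f."""
    v = [rng.gauss(0, 1) for _ in range(dim)]
    return lambda x: sum(vi * xi for vi, xi in zip(v, x)) >= 0

rng = random.Random(42)
hashes = [hyperplane_hash(2, rng) for _ in range(4000)]
x, y = [1.0, 0.0], [1.0, 1.0]                 # vectors 45 degrees apart
agree = sum(h(x) == h(y) for h in hashes) / len(hashes)
est_angle = 180 * (1 - agree)                 # invert Pr[f(x) = f(y)] = 1 - theta/180
```

With enough random hyperplanes, the fraction of agreeing hashes concentrates near 1 − 45/180 = 0.75, so the estimated angle lands near 45 degrees.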

SLIDE 46

 A randomly chosen line is divided into segments (buckets) of length a

 A point is hashed to the bucket in which its projection onto the line lies

 The family is (d1, d2, p1, p2)-sensitive for suitable d1, d2, p1, p2