slide-1
SLIDE 1

Slide credits: Mining of Massive Datasets, by Jure Leskovec, Anand Rajaraman, Jeff Ullman

Stanford University

http://www.mmds.org

Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

slide-2
SLIDE 2

 Many problems can be expressed as finding “similar” objects:

  • Find near(est)-neighbors

 Examples:

  • Pages with similar words
  • For duplicate detection, classification by topic
  • Customers who purchased similar products
  • Products with similar customer sets
  • Images with similar features
  • Users who visited similar websites
  • Record linkage (deduplication)
  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org


slide-3
SLIDE 3

10 nearest neighbors from a collection of 2 million images


[Hays and Efros, SIGGRAPH 2007]

slide-4
SLIDE 4

 Given: (high-dimensional) data points x1, x2, …

  • For example: an image is a vector of pixel colors

    1 2 1
    0 2 1   →  [1 2 1 0 2 1 0 1 0]
    0 1 0

 And some distance function d(x1, x2)

  • which quantifies the “distance” between x1 and x2

 Goal: Find all pairs of data points (xi, xj) that are within some distance threshold: d(xi, xj) ≤ s

 Note: the naïve solution would take O(N^2), where N is the number of data points


slide-5
SLIDE 5

 Hash objects into buckets such that similar objects hash to the same bucket
 Only compare candidates within each bucket
 Benefit: instead of O(N^2) comparisons, we need only O(N) work to find similar documents
 The hash functions depend on the similarity function used


slide-6
SLIDE 6

 Goal: Given a large number (N in the millions or billions) of documents, find “near duplicate” pairs

 Applications:

  • Mirror websites, or approximate mirrors
  • Similar news articles at many news sites

 Problems:

  • Documents are so large or so many that they cannot fit in main memory
  • Too many documents to compare all pairs

slide-7
SLIDE 7

 Shingling: Convert documents to sets
 Simple approaches:

  • Document = set of words appearing in document
  • Document = set of “important” words

 Need to account for ordering of words!

  • Document = set of shingles

slide-8
SLIDE 8

 A k-shingle (or k-gram) for a document is a sequence of k consecutive tokens that appears in the document; the document is represented by its set of k-shingles

  • Tokens can be characters, words, …

 Example:

  • k=2;
  • D1 = abcab
  • Set of 2-shingles: S(D1) = {ab, bc, ca}
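The shingling step above can be sketched in a few lines of Python (an illustrative sketch, not code from the slides):

```python
def shingles(doc: str, k: int) -> set:
    """Set of k-shingles (character k-grams) appearing in doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

# The slide's example: D1 = abcab, k = 2
print(shingles("abcab", 2))  # the set {'ab', 'bc', 'ca'}
```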

slide-9
SLIDE 9

 Document Di is represented by the set of its k-shingles: Ci = S(Di)
 A natural similarity measure is the Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|

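Jaccard similarity over shingle sets is equally short (illustrative sketch; the second document is made up for the example):

```python
def jaccard(c1: set, c2: set) -> float:
    """Jaccard similarity |C1 ∩ C2| / |C1 ∪ C2| (0 for two empty sets)."""
    union = c1 | c2
    return len(c1 & c2) / len(union) if union else 0.0

s1 = {"ab", "bc", "ca"}      # 2-shingles of "abcab"
s2 = {"ab", "bc", "cd"}      # 2-shingles of "abcd" (hypothetical second doc)
print(jaccard(s1, s2))       # → 0.5  (intersection 2, union 4)
```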

slide-10
SLIDE 10

 Rows = elements (shingles)
 Columns = sets (documents)

  • 1 in row e and column s if and only if e is a member of s
  • Column similarity is the Jaccard similarity of the corresponding sets
  • The typical matrix is sparse!

 Example: sim(C1, C2) = ?

[Example boolean matrix (shingles × documents); only the 1 entries survived extraction]

slide-11
SLIDE 11

 Rows = elements (shingles)
 Columns = sets (documents)

  • 1 in row e and column s if and only if e is a member of s
  • Column similarity is the Jaccard similarity of the corresponding sets
  • The typical matrix is sparse!

 Example: sim(C1, C2) = ?

  • Size of intersection = 3; size of union = 6, so the Jaccard similarity is 3/6
  • d(C1, C2) = 1 − (Jaccard similarity) = 3/6


slide-12
SLIDE 12

 Suppose we need to find near-duplicate documents among N = 1 million documents

 Naïvely, we would have to compute pairwise Jaccard similarities for every pair of docs

  • N(N − 1)/2 ≈ 5·10^11 comparisons
  • At 10^5 secs/day and 10^6 comparisons/sec, it would take 5 days

 For N = 10 million, it takes more than a year…

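The arithmetic on this slide can be checked directly (a quick sanity check, not part of the original deck):

```python
# Sanity check for N = 1 million documents
N = 10**6
pairs = N * (N - 1) // 2     # number of pairwise comparisons
print(pairs)                 # 499999500000, i.e. about 5e11

secs = pairs / 10**6         # at 10^6 comparisons/sec
days = secs / 10**5          # at 10^5 secs/day
print(round(days, 1))        # → 5.0
```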

slide-13
SLIDE 13

 Key Idea: “hash” each column C to a small signature h(C):

  • (1) h(C) is small enough that the signature fits in RAM
  • (2) sim(C1, C2) is the same as the “similarity” of signatures h(C1) and h(C2)

 Locality sensitive hashing:

  • If sim(C1, C2) is high, then with high prob. h(C1) = h(C2)
  • If sim(C1, C2) is low, then with high prob. h(C1) ≠ h(C2)

 Expect that “most” pairs of near-duplicate docs hash into the same bucket!


slide-14
SLIDE 14

[Figure: Min-hashing example. Rows of the input matrix (shingles × documents) are permuted, e.g. by the permutation 3 4 7 2 6 1 5; for one column the 2nd element of the permutation is the first to map to a 1, for another the 4th element is the first to map to a 1, and these first-1 positions form the signature matrix M.]

slide-15
SLIDE 15

 Imagine the rows of the boolean matrix permuted under a random permutation π

 Define a “hash” function hπ(C) = the index of the first (in the permuted order π) row in which column C has value 1: hπ(C) = min_π π(C)

 Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature for each column

slide-16
SLIDE 16

 Permuting rows even once is prohibitive
 Row hashing!

  • Pick K hash functions ki
  • Ordering under ki gives a random row permutation!

How to pick a random hash function h(x)?
Universal hashing:
  ha,b(x) = ((a·x + b) mod p) mod N
where:
  a, b … random integers
  p … a prime number (p > N)
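Universal hashing as defined above can be sketched as (illustrative; the prime 2^31 − 1 is an arbitrary choice):

```python
import random

def make_universal_hash(p: int, n: int):
    """h_{a,b}(x) = ((a*x + b) mod p) mod n, with random a, b and prime p > n."""
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % n

h = make_universal_hash(p=2_147_483_647, n=100)   # 2^31 - 1 is prime
print(all(0 <= h(x) < 100 for x in range(1000)))  # → True
```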

slide-17
SLIDE 17

 One-pass implementation

  • For each column C and hash function ki, keep a “slot” sig(C)[i] for the min-hash value
  • Initialize all sig(C)[i] = ∞
  • For each row j:
    • If there is a 1 in row j of column C, and ki(j) < sig(C)[i], then sig(C)[i] ← ki(j)
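The one-pass algorithm above can be sketched as (a minimal illustration; the tiny matrix and the identity “hash” are made-up examples):

```python
INF = float("inf")

def minhash_signatures(matrix, hash_funcs):
    """One-pass MinHash over a boolean matrix (rows = shingles, cols = docs).
    sig[i][c] = min of hash_funcs[i](r) over rows r where matrix[r][c] == 1."""
    n_docs = len(matrix[0])
    sig = [[INF] * n_docs for _ in hash_funcs]
    for r, row in enumerate(matrix):
        hashes = [k(r) for k in hash_funcs]   # each k_i(r) computed once per row
        for c, bit in enumerate(row):
            if bit:                           # a 1 in column c
                for i, hv in enumerate(hashes):
                    if hv < sig[i][c]:        # keep the smallest value seen
                        sig[i][c] = hv
    return sig

# Tiny example with the identity "hash" (i.e., the natural row order)
m = [[1, 0],
     [0, 1],
     [1, 1]]
print(minhash_signatures(m, [lambda r: r]))   # → [[0, 1]]
```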

slide-18
SLIDE 18
slide-19
SLIDE 19

 Given a random permutation π
 What is Pr[hπ(C1) = hπ(C2)]?


slide-20
SLIDE 20

 Given a random permutation π
 Claim: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
 Why?

  • Look at the first row (in the permuted order) that is not 0 in both columns; among such rows, each is equally likely to come first, and both columns have a 1 in it with probability |C1 ∩ C2| / |C1 ∪ C2|
  • If both are 1 in that row, then hπ(C1) = hπ(C2)
  • Hence Pr[hπ(C1) = hπ(C2)] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)

slide-21
SLIDE 21


 The similarity of two signatures is the fraction of the min-hash values in which they agree
 We know: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
 Because of the Min-Hash property, the expected similarity of two signatures equals the Jaccard similarity of the columns

[Signature matrix M:
  2 1 2 1
  2 1 4 1
  1 2 1 2]

slide-22
SLIDE 22

Similarities:
            1-3    2-4    1-2    3-4
  Col/Col   0.75   0.75   0      0
  Sig/Sig   0.67   1.00   0      0

[Figure: input matrix (shingles × documents), the permutations used (e.g. 3 4 7 2 6 1 5), and the resulting signature matrix M:
  2 1 2 1
  2 1 4 1
  1 2 1 2]

slide-23
SLIDE 23

Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents

Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

slide-24
SLIDE 24


 The probability that two columns hash to the same value under one permutation is their similarity

  • Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)

 The similarity of two signatures is the fraction of the hash functions on which they agree

  • Sim[h(C1), h(C2)] = sim(C1, C2) (in expectation)
slide-25
SLIDE 25

 Key Idea: “hash” each column C to a small signature h(C):

  • (1) h(C) is small enough that the signature fits in RAM
  • (2) sim(C1, C2) is the same as the “similarity” of signatures h(C1) and h(C2)

 Locality sensitive hashing:

  • If sim(C1, C2) is high, then with high prob. h(C1) = h(C2)
  • If sim(C1, C2) is low, then with high prob. h(C1) ≠ h(C2)

 Expect that “most” pairs of near-duplicate docs hash into the same bucket!


slide-26
SLIDE 26


 The probability that two columns hash to the same value under one permutation is their similarity

  • Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)

 The similarity of two signatures is the fraction of the hash functions on which they agree

  • Sim[h(C1), h(C2)] = sim(C1, C2)

 The probability that two columns hash to identical signatures (all K values the same) is

  • Pr[h(C1) = h(C2)] = sim(C1, C2)^K

 The probability that two columns agree on at least one value is

  • Pr[any h(C1) = h(C2)] = 1 − (1 − sim(C1, C2))^K

slide-27
SLIDE 27


 MinHash signatures serve as a good compression
 But we still need to compare all pairs of signatures

slide-28
SLIDE 28

[Figure: the ideal case. x-axis: similarity t = sim(C1, C2) of two sets; y-axis: probability of sharing a bucket. With a similarity threshold s: no chance if t < s, probability = 1 if t > s.]
slide-29
SLIDE 29
[Figure: probability of sharing a bucket as a function of the similarity t = sim(C1, C2) of two sets. Remember: the probability of equal min-hash values equals the similarity.]

slide-30
SLIDE 30

slide-31
SLIDE 31

slide-32
SLIDE 32
[Figure: the signature matrix M divided into b bands of r rows each; one band’s portion of one column is one mini-signature.]

slide-33
SLIDE 33

 Divide matrix M into b bands of r rows
 For each band, hash its portion of each column to a hash table with k buckets
 Candidate column pairs are those that hash to the same bucket for ≥ 1 band
 Tune b and k to catch most similar pairs, but few non-similar pairs

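The banding procedure above can be sketched as (illustrative; Python dictionaries stand in for the k-bucket hash tables):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(sig, b, r):
    """Band the signature matrix sig (b*r rows, one column per document) and
    return the column pairs that agree on all r rows of at least one band."""
    n_docs = len(sig[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)           # hash table for this band
        for c in range(n_docs):
            key = tuple(sig[band * r + i][c] for i in range(r))
            buckets[key].append(c)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates

sig = [[1, 1, 2],   # band 0
       [2, 2, 3],
       [5, 7, 7],   # band 1
       [6, 8, 8]]
print(lsh_candidates(sig, b=2, r=2))   # the set {(0, 1), (1, 2)}
```

Columns 0 and 1 collide in band 0, columns 1 and 2 in band 1; one shared band is enough to become a candidate pair.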
slide-34
SLIDE 34

 Columns C1 and C2 have similarity t
 Pick any band (r rows)

  • Prob. that all rows in the band are equal = ?
  • Prob. that some row in the band is unequal = ?

 Prob. that no band is identical = ?
 Prob. that at least 1 band is identical = ?

slide-35
SLIDE 35

 Columns C1 and C2 have similarity t
 Pick any band (r rows)

  • Prob. that all rows in the band are equal = t^r
  • Prob. that some row in the band is unequal = 1 − t^r

 Prob. that no band is identical = (1 − t^r)^b
 Prob. that at least 1 band is identical = 1 − (1 − t^r)^b

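These formulas can be wrapped in two small helpers (illustrative sketch):

```python
def prob_candidate(t: float, b: int, r: int) -> float:
    """Probability that two columns of similarity t share a bucket in
    at least one band: 1 - (1 - t^r)^b."""
    return 1 - (1 - t ** r) ** b

def approx_threshold(b: int, r: int) -> float:
    """Similarity at the steepest point of the S-curve: s ~ (1/b)^(1/r)."""
    return (1 / b) ** (1 / r)

# With b = 20, r = 5 the curve jumps between t = 0.3 and t = 0.8
print(round(prob_candidate(0.3, 20, 5), 4))   # ≈ 0.0475
print(round(prob_candidate(0.8, 20, 5), 4))   # ≈ 0.9996
print(round(approx_threshold(20, 5), 2))      # ≈ 0.55
```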
slide-36
SLIDE 36

t^r … prob. that all rows of a band are equal
1 − t^r … prob. that some row of a band is unequal
(1 − t^r)^b … prob. that no bands are identical
1 − (1 − t^r)^b … prob. that at least one band is identical

The similarity threshold (where the S-curve rises steepest) is s ≈ (1/b)^(1/r)

[Figure: S-curve of the probability of sharing a bucket vs. the similarity t = sim(C1, C2) of two sets]

slide-37
SLIDE 37

 Picking r and b to get the best S-curve

  • 50 hash-functions (r=5, b=10)

[Figure: the S-curve for r = 5, b = 10; x-axis: similarity, y-axis: prob. of sharing a bucket. Blue area: false negative rate; green area: false positive rate.]
slide-38
SLIDE 38

 Pick:

  • The number of Min-Hashes (rows of M)
  • The number of bands b, and
  • The number of rows r per band

to balance false positives/negatives

slide-39
SLIDE 39

Assume the following case:

 Suppose 100,000 columns (100k docs)
 Signatures of 100 integers (rows)
 Choose b = 20 bands of r = 5 integers/band
 Goal: Find pairs of documents that are at least s = 0.8 similar


slide-40
SLIDE 40

 Find pairs with ≥ s = 0.8 similarity; set b = 20, r = 5
 Assume: sim(C1, C2) = 0.8
 Probability that C1, C2 are identical in one particular band: (0.8)^5 = 0.328
 Probability that C1, C2 are not identical in any of the 20 bands: (1 − 0.328)^20 = 0.00035

  • i.e., about 1/3000th of the 80%-similar pairs are false negatives


slide-41
SLIDE 41

 Find pairs with ≥ s = 0.8 similarity; set b = 20, r = 5
 Assume: sim(C1, C2) = 0.3
 Probability that C1, C2 are identical in one particular band: (0.3)^5 = 0.00243
 Probability that C1, C2 are identical in at least 1 of the 20 bands: 1 − (1 − 0.00243)^20 = 0.0474

  • i.e., about 4.74% of the 0.3-similarity pairs are false positives

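Both worked examples (this slide and the previous one) can be reproduced numerically (a quick check, not part of the deck):

```python
# Reproducing the two worked examples (b = 20 bands, r = 5 rows/band)
b, r = 20, 5

# sim(C1, C2) = 0.8: chance of being MISSED (false negative)
fn = (1 - 0.8 ** r) ** b
print(fn)    # ≈ 3.6e-4 (the slides quote 0.00035), roughly 1/3000

# sim(C1, C2) = 0.3: chance of becoming a candidate anyway (false positive)
fp = 1 - (1 - 0.3 ** r) ** b
print(fp)    # ≈ 0.0475 (the slides quote 0.0474), about 4.7%
```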

slide-42
SLIDE 42

slide-43
SLIDE 43

 Given a (d1,d2,p1,p2)-sensitive family F  AND-construction

  • AND of r members of F
  • (d1,d2,p1^r, p2^r)-sensitive
  • Mirrors the effect of r rows in a single band

 OR-construction

  • OR of b members of F
  • (d1,d2, 1-(1-p1)^b, 1-(1-p2)^b)-sensitive
  • Mirrors the effect of combining multiple bands

 Select r and b to increase p1 and decrease p2

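The AND/OR constructions can be expressed as transformations of the family's parameters (illustrative; the starting family's numbers are hypothetical):

```python
def and_construct(fam, r):
    """AND of r members of a (d1, d2, p1, p2)-sensitive family:
    yields a (d1, d2, p1**r, p2**r)-sensitive family."""
    d1, d2, p1, p2 = fam
    return (d1, d2, p1 ** r, p2 ** r)

def or_construct(fam, b):
    """OR of b members: yields (d1, d2, 1-(1-p1)**b, 1-(1-p2)**b)."""
    d1, d2, p1, p2 = fam
    return (d1, d2, 1 - (1 - p1) ** b, 1 - (1 - p2) ** b)

# Banding = OR over b bands of AND over r rows.
# Hypothetical starting family: (0.2, 0.6, 0.8, 0.4)-sensitive.
fam = (0.2, 0.6, 0.8, 0.4)
d1, d2, p1, p2 = or_construct(and_construct(fam, 5), 20)
print(round(p1, 3), round(p2, 3))   # p1 is pushed toward 1, p2 toward 0
```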

slide-44
SLIDE 44

 Hashing family for Hamming distance: fi(x) = the i-th bit of the d-dimensional vector x
 This family is (d1, d2, 1 − d1/d, 1 − d2/d)-sensitive
 Amplify it with AND/OR constructions


slide-45
SLIDE 45

 Distance: d(x, y) = θ, the angle between vectors x and y
 Pick a randomly chosen vector vf; given two vectors x and y, f(x) = f(y) iff vf·x and vf·y have the same sign
 This family is (d1, d2, 1 − d1/180, 1 − d2/180)-sensitive

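The random-hyperplane family above can be sketched as a single hash function (illustrative; using a Gaussian random vector):

```python
import random

def random_hyperplane_hash(dim: int, seed: int = 0):
    """One random-hyperplane hash for angular distance: pick a random
    vector v; f(x) is the sign of v·x, so f(x) == f(y) iff v·x and v·y
    have the same sign."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(dim)]
    return lambda x: sum(vi * xi for vi, xi in zip(v, x)) >= 0

f = random_hyperplane_hash(3, seed=1)
x = [1.0, 2.0, 3.0]
print(f(x) == f([2.0, 4.0, 6.0]))     # → True  (same direction, angle 0)
print(f(x) == f([-1.0, -2.0, -3.0]))  # → False (opposite direction, angle 180)
```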

slide-46
SLIDE 46

 Pick a randomly chosen line, divided into segments (“buckets”) of length a
 A point is hashed to the bucket in which its projection onto the line lies
 This family is (d1, d2, p1, p2)-sensitive for suitable p1 > p2 (for Euclidean distance, one can show it is (a/2, 2a, 1/2, 1/3)-sensitive)


slide-47
SLIDE 47

 Record linkage: determine if pairs of data

records describe the same entity

  • I.e., find record pairs that are co-referent
  • Entities: usually people (or organizations or…)
  • Data records: names, addresses, job titles, birth

dates, …

 Main applications:

  • Joining two heterogeneous relations
  • Removing duplicates from a single relation
slide-48
SLIDE 48

 Deterministic linkage

  • Test equality of normalized versions of the records
  • Hand-coded rules for an “acceptable match”, e.g. “same SSNs, or same zipcode, birthdate, and Soundex code for last name”
  • Difficult to tune, can be expensive to test

 Probabilistic linkage

  • Fellegi and Sunter, “A Theory for Record Linkage”

slide-49
SLIDE 49

 Two sets to link: A and B
 A × B = {(a, b) : a in A, b in B}

M = matched pairs, U = unmatched pairs

 A comparison vector g(a, b) contains “comparison features” (e.g., “last names are the same”, “birthdates are in the same year”, …)

  • g(a, b) = ⟨g1(a, b), …, gK(a, b)⟩
slide-50
SLIDE 50

 Three actions on (a,b):

  • A1: treat (a,b) as a match
  • A2: treat (a,b) as uncertain
  • A3: treat (a,b) as a non-match

 Assume a distribution D over A x B:

  • m(g) = PrD( g(a,b) | (a,b) in M )
  • u(g) = PrD( g(a,b) | (a,b) in U )

 Learn attribute weights and thresholds to minimize false negatives and false positives

slide-51
SLIDE 51

 Efficiency issues:

  • How do we avoid looking at |A| * |B| pairs?

 Blocking

  • Or-blocking: only select record pairs that match on at least one variable
  • Sorting
  • Clustering
  • Locality sensitive hashing

slide-52
SLIDE 52

 Modeling and training:

  • How do we estimate m(g), u(g) ?

 Making decisions with the model:

  • How do we set the thresholds m and l?

 Feature engineering:

  • What should the comparison space G be?
  • Distance metrics for text fields
  • Normalizing/parsing text fields

 Efficiency issues:

  • How do we avoid looking at |A| * |B| pairs?