Information near-duplicates
Minimum hashing; Locality Sensitive Hashing
Web Search
Information near-duplicates
Corpus duplicates
Usually, a corpus has many different topics discussed across different documents. Organizing a corpus
[Figure: document matrix with N documents and dimensionality D > 30,000; pipeline: Documents, MinHash, LSH]
Duplicate detection, min-hash, sim-hash
$\mathrm{Jaccard}(C_i, C_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|}$
Example: $\mathrm{Jaccard}(C_1, C_2) = \frac{3}{6}$
With regions A, B, C, D: $\mathrm{Jaccard}(C_i, C_j) = \frac{A}{A + B + C}$
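As a minimal sketch of the Jaccard definition, assuming each document is represented as a Python set of shingles (the function name `jaccard` and the toy sets are illustrative and not taken from the slides):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

# Toy sets chosen so that 3 elements are shared out of 6 in the union.
doc1 = {"a", "b", "c", "d"}
doc2 = {"b", "c", "d", "e", "f"}
print(jaccard(doc1, doc2))  # 0.5
```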
[Figure: Doc A → Shingle set A → Sketch A; Doc B → Shingle set B → Sketch B; Jaccard computed between the two sketches]
Start with 64-bit f(shingles).
Permute on the number line.
Pick the min value.
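The three steps above can be sketched as follows. This is only an illustration under stated assumptions: 4-word shingles, BLAKE2b as the 64-bit fingerprint f, and the affine map x → (a·x + b) mod 2^64 with odd a standing in for a permutation of the number line (it is a bijection on 64-bit values, but not a uniformly random permutation). All function names are hypothetical.

```python
import hashlib
import random

MASK64 = (1 << 64) - 1

def shingles(text: str, w: int = 4) -> set:
    """Set of w-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def fingerprint64(shingle: str) -> int:
    """64-bit fingerprint f(shingle); BLAKE2b is an arbitrary choice."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def make_permutation(rng: random.Random):
    """Affine bijection on 64-bit integers (odd multiplier a)."""
    a = rng.getrandbits(64) | 1
    b = rng.getrandbits(64)
    return lambda x: (a * x + b) & MASK64

def minhash(text: str, perm, w: int = 4) -> int:
    """Minimum permuted fingerprint over all shingles of the text."""
    return min(perm(fingerprint64(s)) for s in shingles(text, w))

perm = make_permutation(random.Random(0))
print(sorted(shingles("a rose is a rose is a rose")))
print(minhash("a rose is a rose is a rose", perm))
```

Texts shorter than w words produce no shingles, so `minhash` would raise a `ValueError` for them; a real pipeline would need to handle that case.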
a, b: random integers; p: prime number (p > N)
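The slide's hash formula did not survive extraction. A common choice that uses exactly these parameters is h(x) = ((a·x + b) mod p) mod N, and the sketch below assumes that form. Each (a, b) pair simulates one permutation of the N rows, and the signature keeps the minimum hash value per function. The helper names are hypothetical.

```python
import random

def is_prime(n: int) -> bool:
    """Trial-division primality test (fine for small n)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def next_prime_above(n: int) -> int:
    """Smallest prime strictly greater than n."""
    c = n + 1
    while not is_prime(c):
        c += 1
    return c

def make_hashes(num_hashes: int, n: int, seed: int = 0):
    """Return (p, [(a, b), ...]) for h(x) = ((a*x + b) mod p) mod n."""
    rng = random.Random(seed)
    p = next_prime_above(n)
    params = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(num_hashes)]
    return p, params

def signature(rows, p, params, n):
    """MinHash signature of a set of row indices in range(n)."""
    return [min(((a * x + b) % p) % n for x in rows) for a, b in params]

def estimate_similarity(sig1, sig2) -> float:
    """Fraction of signature positions on which the two documents agree."""
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)

N = 1000
p, params = make_hashes(200, N)
s1 = signature({1, 5, 9, 42, 77}, p, params, N)
s2 = signature({1, 5, 9, 42, 500}, p, params, N)
# The exact Jaccard similarity is 4/6; the printed value is only an estimate.
print(estimate_similarity(s1, s2))
```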
[Figure: input matrix (shingles × documents), permutations p, and the resulting signature matrix M (signatures × documents); Jaccard similarity compared between the original columns and the signatures]
Each signature value is the minimum shingle of a given hash permutation.
DocX shingles     hashA()   hashB()   hashC()   hashD()   …
a rose is a           103     19032     09743     98432
rose is a rose       1098      3456     89032     98743
                     4539      6578     89327     21309
                      243      2435     93285     29873
                     8876      7746      9832     98321
                     2486      9823     30984     30282
…
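As a worked example on the table above, the MinHash signature of DocX is the column-wise minimum of the hashed shingles. The sketch uses only the six rows shown; the table is elided ("…"), so the true minima over the full document may differ.

```python
# Hash values from the table: one row per shingle, one column per hash function
# (hashA, hashB, hashC, hashD). Only the rows shown on the slide are included.
hashed = [
    [103, 19032, 9743, 98432],
    [1098, 3456, 89032, 98743],
    [4539, 6578, 89327, 21309],
    [243, 2435, 93285, 29873],
    [8876, 7746, 9832, 98321],
    [2486, 9823, 30984, 30282],
]

# The signature keeps the smallest value produced by each hash function.
signature_docx = [min(column) for column in zip(*hashed)]
print(signature_docx)  # [103, 2435, 9743, 21309]
```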
[Figure: document matrix with N documents and dimensionality 30,000]
[Figure: radii R and cR]
MinHash has this property.
[Figure: probabilities p1 and p2 at distances r and cr]
MinHash satisfies these conditions.
[Figure: probability of finding correct neighbours as a function of ||a-b||, comparing the ideal curve (p1 = 1 and p2 = 0) with real curves]
[Figure: bit codes (e.g. 101, 100) obtained from L projections]
True nearest neighbours:
Original vector → k-bit hash code → L hash tables
Each table has $2^k$ buckets, with $N/2^k$ instances per bucket.
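The table layout above can be sketched as a small index. The slides do not say which hash produces the k bits, so this example assumes random-hyperplane (sign) bits, one set of k hyperplanes per table. The class name `BitLSH` and its methods are hypothetical.

```python
import random
from collections import defaultdict

class BitLSH:
    """L hash tables, each keyed by a k-bit code (at most 2**k buckets per table)."""

    def __init__(self, dim: int, k: int, L: int, seed: int = 0):
        rng = random.Random(seed)
        # planes[t][j] is the j-th random hyperplane of table t (assumed hash family).
        self.planes = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]
                       for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def code(self, t: int, v) -> int:
        """k-bit code of vector v in table t: one sign bit per hyperplane."""
        bits = 0
        for plane in self.planes[t]:
            dot = sum(p * x for p, x in zip(plane, v))
            bits = (bits << 1) | (1 if dot >= 0 else 0)
        return bits

    def add(self, key, v) -> None:
        """Insert the vector into one bucket of each of the L tables."""
        for t, table in enumerate(self.tables):
            table[self.code(t, v)].append(key)

    def candidates(self, v) -> set:
        """Union of the buckets the query falls into, over all L tables."""
        found = set()
        for t, table in enumerate(self.tables):
            found.update(table.get(self.code(t, v), []))
        return found

index = BitLSH(dim=3, k=4, L=5)
index.add("x", [1.0, 2.0, -1.0])
index.add("y", [-3.0, 0.5, 2.0])
# A positively scaled copy of "x" has the same sign bits, so "x" is always returned.
print(index.candidates([2.0, 4.0, -2.0]))
```

Larger k gives fewer instances per bucket (fewer false candidates), while larger L gives more chances for a true neighbour to share a bucket with the query.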
$N / 2^k$
$P(h(a) = h(b)) = \mathrm{sim}(a, b)$
[Figure: $1 - (1 - s^k)^L$ as a function of $\mathrm{sim}(a, b)$]
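The expression 1 - (1 - s^k)^L can be evaluated directly to see how k and L shape the curve. In this sketch s is the similarity of a pair, k the number of hash values per table and L the number of tables. The function name and the sample values of s are illustrative.

```python
def candidate_probability(s: float, k: int, L: int) -> float:
    """Probability that a pair with similarity s shares a bucket in at least one of L tables."""
    return 1.0 - (1.0 - s ** k) ** L

# Small worked case: s = 0.5, k = 2, L = 2 gives 1 - (1 - 0.25)^2 = 0.4375.
print(candidate_probability(0.5, k=2, L=2))  # 0.4375

# The curve for a larger configuration, at a few similarity values.
for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, k=5, L=20), 3))
```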
[Figure: Prob(Candidate pair) versus Similarity, in four panels: k = 1..10, L = 1; k = 1, L = 1..10; k = 5, L = 1..50; k = 10, L = 1..50]
Given a fixed threshold s, we want to choose r and b (here k and L, respectively) such that P(Candidate pair) has a “step” right around s.
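One way to place the step, sketched under stated assumptions: the step of 1 - (1 - s^k)^L sits roughly at (1/L)^(1/k), and the total number of hash functions k·L is fixed in advance. The search below tries every factorisation of that budget and keeps the one whose approximate step is closest to the target. The function names and the budget of 100 are illustrative.

```python
def approximate_step(k: int, L: int) -> float:
    """Approximate similarity at which 1 - (1 - s^k)^L rises steeply."""
    return (1.0 / L) ** (1.0 / k)

def choose_k_L(s: float, num_hashes: int):
    """Pick (k, L) with k * L == num_hashes whose step is closest to s."""
    best = None
    for k in range(1, num_hashes + 1):
        if num_hashes % k:
            continue  # k must divide the budget so that L is an integer
        L = num_hashes // k
        gap = abs(approximate_step(k, L) - s)
        if best is None or gap < best[0]:
            best = (gap, k, L)
    return best[1], best[2]

print(choose_k_L(0.8, 100))  # (10, 10)
```

With 100 hash functions and a target of 0.8, the closest step is about 0.794 at k = 10, L = 10.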
MinHash LSH
Massive Datasets”, Cambridge University Press, 2011.
approximate nearest neighbor in high dimensions”. Communications