Efficient Algorithms for Streaming Datasets with Near-Duplicates
Theory and Applications of Hashing May 4, 2017
Qin Zhang, Indiana University Bloomington
Based on work with: Djamal Belazzougui (CERIST), Di Chen (HKUST), Jiecao Chen (IUB), Haoyu Zhang (IUB)
1 7 9 1 7 3 2
E.g., what is the number
[Figure: linear mapping (sometimes embeds a hash function) → sketching vector → recover]
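As an illustration of the linear-sketching idea on this slide, here is a minimal sketch in Python. The slide does not say which sketch is meant, so the choice of a random ±1 matrix with a second-moment estimate (in the style of the AMS sketch) is an assumption, as are the names `make_sign_matrix`, `sketch`, `update` and `estimate_f2`. For simplicity the signs are fully random rather than drawn from a limited-independence hash family.

```python
import random

def make_sign_matrix(rows, dim, seed=0):
    """Random +/-1 matrix A; the sketch of a vector x is A @ x (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(dim)] for _ in range(rows)]

def sketch(A, x):
    """Apply the linear mapping to a full frequency vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def update(sk, A, item, delta=1):
    """Streaming update: add delta copies of `item` without storing x."""
    for k, row in enumerate(A):
        sk[k] += row[item] * delta

def estimate_f2(sk):
    """Recover an estimate of sum_i x_i^2 from the sketching vector."""
    return sum(v * v for v in sk) / len(sk)

# Process the example stream one item at a time.
stream = [1, 7, 9, 1, 7, 3, 2]
A = make_sign_matrix(rows=64, dim=10, seed=42)
sk = [0] * 64
for item in stream:
    update(sk, A, item)
```

Because the mapping is linear, the sketch of a sum of two vectors is the sum of their sketches, which is what makes such sketches mergeable across streams.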
[Figure: the stream 1 7 9 1 7 3 2 and RAM]
Key problem in data cleaning / integration. Has been studied for 40+ years in DB, also in AI, NT.
(Useful in: traffic monitoring, query optimization, ...)
$O(1/\epsilon^2)$ non-empty cells $\mathcal{C}$
$w(C) = 1/w(G_C)$, where $G_C$ is the (only) group intersecting $C$, and $w(G_C)$ is the number of cells $G_C$ intersects
$\eta \cdot \sum_{C \in \mathcal{C}} w(C)$, where $z$ is the number of non-empty cells in $G$
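To see why the weights are defined this way, note that the cells a group intersects each carry weight one over the number of such cells, so every group contributes total weight 1. The following is a minimal sketch, assuming axis-aligned grid cells of side `side` and that each cell meets only one group (as stated above); the helper names `cell_of` and `cell_weights` are hypothetical. It computes the weights for all non-empty cells, without the sampling and rescaling step.

```python
from collections import defaultdict

def cell_of(point, side):
    """Grid cell containing `point`, for cells of width `side` (assumed layout)."""
    return tuple(int(c // side) for c in point)

def cell_weights(groups, side):
    """Return {cell: w(C)} with w(C) = 1 / (number of cells its group intersects)."""
    cell_group = {}                 # cell -> index of the only group intersecting it
    group_cells = defaultdict(set)  # group index -> set of cells it intersects
    for g, pts in enumerate(groups):
        for p in pts:
            c = cell_of(p, side)
            if cell_group.setdefault(c, g) != g:
                raise ValueError("a cell intersects more than one group")
            group_cells[g].add(c)
    return {c: 1.0 / len(group_cells[g]) for c, g in cell_group.items()}

# Two groups: the first spans two cells, the second sits in one cell.
groups = [[(0.1, 0.1), (0.9, 0.2), (1.1, 0.1)], [(5.0, 5.0)]]
weights = cell_weights(groups, side=1.0)
total = sum(weights.values())  # equals the number of groups
```

Summing the weights over every non-empty cell gives the number of groups exactly; an estimator built on a sample of cells would rescale a partial sum of the same weights.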
Store one point of each non-empty neighboring cell; these points are used to compute the weight of the sampled cell.
– we say a group G is adjacent to a hash bucket B if ∃p, q ∈ S s.t. p ∈ G, h(q) = B and d(p, q) ≤ α.
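The adjacency definition can be checked directly by brute force. The following is a minimal sketch, assuming the group is given as a list of points from S and that the hash function `h` and distance `d` are passed in as callables; the function name `is_adjacent` and the 1-D example are illustrative only.

```python
def is_adjacent(group, bucket, points, h, d, alpha):
    """True if some p in the group and some q in S satisfy h(q) == bucket and d(p, q) <= alpha."""
    return any(h(q) == bucket and d(p, q) <= alpha
               for p in group for q in points)

# 1-D toy example: buckets are unit intervals, distance is |p - q|.
S = [0.9, 1.05, 3.0]
G = [0.9]
h = lambda q: int(q // 1)
d = lambda p, q: abs(p - q)
near = is_adjacent(G, 1, S, h, d, alpha=0.2)  # q = 1.05 falls in bucket 1, close to 0.9
far = is_adjacent(G, 3, S, h, d, alpha=0.2)   # q = 3.0 falls in bucket 3, too far from 0.9
```

A group is therefore adjacent to its own buckets and to any bucket holding a point within distance α of the group.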
2).
$(\dots/\pi,\ 1 - \beta/\pi)$-sensitive and $O(1)$-concentrated;
Dataset: 4,000,000 images from ImageNet.
Experiments run on a desktop PC with 8GB of RAM and a 4-core 3.40GHz Intel i7 CPU.
I500k100x5d means the dataset consists of
– 500k images,
– each with 100 near-duplicates on average,
– mapped into points in 5-dimensional Euclidean space (feature space).
Baseline (greedy algorithm): $\Theta(n)$ space
Sketch (our algorithm): $\tilde{O}(1/\epsilon^2)$ space
CellCount (streaming comparison): $\tilde{O}(1/\epsilon^2)$ space
Parameterized with a random string $r \in \{0, 1\}^{6n}$, and maps $f : s \in \{0, 1\}^n \to s' \in \{0, 1\}^{3n}$. Two counters $i$ and $j$ are both initialized to 1. For $j = 1, 2, \ldots$ steps:
[Figure: strings s, s′ and r, with pointer i into s, pointer j into s′, and index 2j − 1 + s[i] into r]
If $\mathrm{ed}(s, t) = k$, then $k/2 \le \mathrm{HAM}(f(s), f(t)) \le O(k^2)$ with probability 0.99.
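The individual steps of the mapping are not spelled out on the slide. A minimal sketch, assuming the usual CGK random walk (copy the current bit of s to the output, then advance the input pointer by the random bit at position 2j − 1 + s[i] of r), might look as follows. The function names, the 0-based indexing, and the choice of 0 as the padding symbol once s is exhausted are assumptions made for illustration.

```python
import random

def cgk_embed(s, r, pad=0):
    """Map a bit list s of length n to a bit list of length 3n, driven by r of length 6n."""
    n = len(s)
    if len(r) != 6 * n:
        raise ValueError("r must have length 6n")
    out = []
    i = 0  # 0-based pointer into s (the slide uses 1-based i)
    for j in range(3 * n):  # 0-based output position
        if i < n:
            out.append(s[i])
            # 1-based index 2j - 1 + s[i] becomes 2*j + s[i] with 0-based j.
            i += r[2 * j + s[i]]
        else:
            out.append(pad)  # assumed padding after s is consumed
    return out

def hamming(a, b):
    """Number of positions where two equal-length lists differ."""
    return sum(x != y for x, y in zip(a, b))

# Embed two strings at edit distance 1 with the same random string r.
rng = random.Random(1)
s = [1, 0, 1, 0, 1, 1, 0, 1]
t = [0, 1, 0, 1, 1, 0, 1, 1]  # s with its first bit deleted and a 1 appended
r = [rng.randint(0, 1) for _ in range(6 * len(s))]
dist = hamming(cgk_embed(s, r), cgk_embed(t, r))
```

Both strings must be embedded with the same r; the distance bound quoted above is probabilistic, so a single run like this one only illustrates the mechanics.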
[Figure: CGK applied to s and t with the same random string r, giving s′ and t′; pointer p into s, pointer q into t, shared output position j, and indices 2j − 1 + s[p] and 2j − 1 + t[q] into r]
[Figure: pointers p and q moving along strings s and t]
$k/2 \le \mathrm{HAM}(f(s), f(t)) \le O(k^2)$ with probability 0.99
UniProt project
titles and abstracts from 270 medical journals
individuals obtained from 1000 genomes project
[Figure panels: UNIREF, vary K; UNIREF, vary n; GEN50kS, vary K; GEN50kS, vary n]
[Figure panels: UNIREF, vary K; GEN50kS, vary K; GEN50kS, vary n; UNIREF, vary n]
[Figure panels: GEN20kS, vary K/n; GEN20kL, vary K/n; GEN320kS, vary K/n]
Application: distributed similarity join
[Figure: machines with RAM and CPU]
[Figure: example binary strings s and t (1 1 0 0 1 1 0 0 and 1 0 1 0 1 0 1 1)]
j∈[ρ] Aj
Note: the size of sk(OPT) is only $O(k \log n)$.
[Figure: CGK applied to s and t, giving s′ and t′; pointer p into s, pointer q into t, and output position j]
j∈[ρ] Aj
CGK embedding, we say a pair (u, v) is an anchor if s[u] = t[v] and (u, v) ∈ I.
Claim: With probability $1 - 1/n^2$, there is an optimal alignment going through all anchors.
Proof idea: We focus on a “greedy” optimal matching O. Suppose, on the contrary, that O does not pass through an anchor (u, v). Then we can find a matching M in the left neighborhood of (u, v) which may “mislead” a random walk; that is, with non-trivial probability the random walk will “follow” M and consequently miss (u, v).