L2AP: Fast Cosine Similarity Search With Prefix L-2 Norm Bounds
David C. Anastasiu and George Karypis University of Minnesota, Minneapolis, MN, USA April 3, 2014
1 / 27
All-Pairs Similarity Search (APSS)
Goal: For each object in a set,
4 = 0.12,
j+1, y) ≥ t, ∀y > x then
j , y) ≥ t then
j , y′ j ) < t
nnz n nnz m
◮ L2AP achieved significant speedups over exact baselines.
◮ BayesLSH-Lite approximate pruning cannot significantly
◮ Improved index, residual, and positional filtering via ℓ2-norm
◮ Introduced pscore filtering, which is able to prune many
◮ Strengthened other bounds, e.g. dpscore, detailed in the
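The idea behind ℓ2-norm filtering can be illustrated with a small sketch. It is not the authors' implementation, and the slides' own definitions did not survive extraction, so it relies only on the Cauchy-Schwarz inequality: for unit-length vectors, the dot product of two prefixes is at most the product of the prefix norms. The function names (`prefix_l2_bound`, `can_prune`) and the convention that the prefix is the first `j` features are assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale v to unit l2 norm, so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def prefix_l2_bound(x, y, j):
    """Upper bound on dot(x[:j], y[:j]) by Cauchy-Schwarz (hypothetical helper)."""
    return math.sqrt(dot(x[:j], x[:j])) * math.sqrt(dot(y[:j], y[:j]))


def can_prune(x, y, j, t):
    """Return True if the pair provably has cosine similarity below t.

    The suffix contribution is computed exactly; the prefix contribution
    is replaced by its l2-norm upper bound.
    """
    suffix = dot(x[j:], y[j:])
    return suffix + prefix_l2_bound(x, y, j) < t


# Toy vectors (made up for this example), normalized to unit length.
x = normalize([0.1, 0.2, 0.0, 1.0])
y = normalize([1.0, 0.1, 0.5, 0.0])

print(can_prune(x, y, 2, 0.5))
```

Because the bound never underestimates the prefix dot product, a pair rejected by `can_prune` cannot reach the threshold, so the filter stays exact while skipping the full dot-product computation.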
[Figure: index size vs. t on RCV1, WikiLinks, WikiWords500k, and OrkutLinks; methods: AllPairs, MMJoin, L2AP, L2APb3.]
[Figure: number of dot-products (log-scaled) and total time (s) vs. t on RCV1 and WikiLinks; series: no dp, dp1, dp2, dp3, dp4, dp5, dp6, dp7, dp8.]
[Figure: total time (s), log-scaled, vs. t on RCV1, WikiLinks, WikiWords500k, TwitterLinks, WikiWords100k, and OrkutLinks; methods: LSH+BayesLSH-Lite, AllPairs+BayesLSH-Lite, LSH+BayesLSH, AllPairs+BayesLSH, AllPairs, L2AP*, L2AP.]
[Figure: total time (s), log-scaled, vs. t on RCV1, WikiLinks, WikiWords500k, TwitterLinks, WikiWords100k, and OrkutLinks; legend as extracted: AllPairs, AP.]
◮ L2AP+BayesLSH-Lite - replace candidate verification with
◮ L2AP-approx - replace only l2cv bound pruning with