SLIDE 1

L2AP: Fast Cosine Similarity Search With Prefix L-2 Norm Bounds

David C. Anastasiu and George Karypis
University of Minnesota, Minneapolis, MN, USA
April 3, 2014


SLIDE 2

All-Pairs Similarity Search (APSS)

Goal

◮ For each object in a set, find all other set objects with a similarity value of at least t (its neighbors)

Applications

◮ Near-duplicate Document Detection
◮ Clustering
◮ Query Refinement
◮ Collaborative Filtering
◮ Semi-supervised Learning
◮ Information Retrieval

SLIDE 3

Outline

  • 1. Problem Description
  • 2. Solution Framework
  • 3. Index Construction
  • 4. Candidate Generation
  • 5. Candidate Verification
  • 6. Experimental Evaluation
      6.1 Efficiency Testing
      6.2 Effectiveness Testing
  • 7. Conclusion

SLIDE 4

Problem Description

◮ D, sparse matrix of size n × m
◮ x, row vector for row x in D
◮ rows unit-length normalized: x ← x / ||x||, so that ||x|| = 1
◮ sim(x, y) = cos(x, y) = xyᵀ / (||x|| × ||y||) = xyᵀ = Σ_{j=1}^{m} x_j × y_j
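As a sanity check on these identities, here is a small plain-Python sketch; the vector values are illustrative and not from the slides:

```python
import math

def normalize(x):
    """Scale x to unit length, so cos(x, y) reduces to the dot product x . y."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """cos(x, y) = x . y / (||x|| * ||y||)."""
    nx = math.sqrt(dot(x, x))
    ny = math.sqrt(dot(y, y))
    return dot(x, y) / (nx * ny)

x = [3.0, 0.0, 4.0]
y = [1.0, 2.0, 2.0]
# After normalization, the plain dot product equals the cosine similarity,
# which is why APSS on normalized rows is just thresholded dot products.
assert abs(dot(normalize(x), normalize(y)) - cosine(x, y)) < 1e-12
```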

SLIDE 5

Problem Description

◮ Naïve solution: compute similarity of each object with all others, keep results ≥ t.
◮ Equivalent to sparse matrix-matrix multiplication, followed by a filter operation: APSS ∼ (DDᵀ)≥t

for each row x = 1, . . . , n do
    for each row y = 1, . . . , n do
        if x ≠ y and sim(x, y) ≥ t then
            Add {x, y, sim(x, y)} to result

[Figure: D × Dᵀ = DDᵀ]
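The quadratic loop above can be sketched directly; the dict-of-features row representation and the function name are illustrative assumptions, not from the talk:

```python
def naive_apss(D, t):
    """Naive all-pairs similarity search over unit-normalized sparse rows.

    D: list of rows, each a dict {feature: value}, assumed unit-normalized.
    Returns all pairs (x, y, sim) with x < y and sim >= t.
    """
    def dot(u, v):
        # Iterate over the smaller row; only shared features contribute.
        if len(u) > len(v):
            u, v = v, u
        return sum(w * v[j] for j, w in u.items() if j in v)

    result = []
    for x in range(len(D)):
        for y in range(x + 1, len(D)):  # commutativity: check each pair once
            s = dot(D[x], D[y])
            if s >= t:
                result.append((x, y, s))
    return result
```

For example, `naive_apss([{0: 1.0}, {0: 0.6, 1: 0.8}, {1: 1.0}], 0.5)` keeps the two pairs that share enough weight and drops the orthogonal one.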

SLIDE 6

Problem Description

◮ Main idea: Use the similarity threshold t and theoretical bounds to prune the search space

a = 0.12,     , 0.37, 0.22,     , 0.47, 0.75, 0.13
b =     , 0.50, 0.65,     , 0.05, 0.35, 0.45,
c = 0.96, 0.28, 0.01,     ,     ,     ,     ,

A[b] = 0.0000
A[c] = 0.0000
t = 0.5
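The slide's running example can be reproduced exactly; blanks in the vectors are treated as zeros:

```python
# Vectors from the slide's example, t = 0.5.
a = [0.12, 0.00, 0.37, 0.22, 0.00, 0.47, 0.75, 0.13]
b = [0.00, 0.50, 0.65, 0.00, 0.05, 0.35, 0.45, 0.00]
c = [0.96, 0.28, 0.01, 0.00, 0.00, 0.00, 0.00, 0.00]

def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

A = {"b": dot(a, b), "c": dot(a, c)}
# Full accumulators match the final values shown on the slides:
assert abs(A["b"] - 0.7425) < 1e-10
assert abs(A["c"] - 0.1189) < 1e-10
# At t = 0.5, b is a neighbor of a; c falls far short and can be pruned early.
```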

SLIDES 7-14

Problem Description (animation continued)

◮ The same example, stepped through feature by feature: as a's features are matched against b and c, the accumulators grow (A[c] reaches 0.1152 and later 0.1189; A[b] reaches 0.2405, then 0.4050, and finally 0.7425). With t = 0.5, the unpromising candidate c can be pruned early, while b survives as a neighbor.

SLIDE 15

Extensions to the naïve approach

◮ Leverage sparsity in D. Build an inverted index.
◮ Leverage commutativity of cos(x, y). (Sarawagi and Kirpal, 2004)
◮ Build a partial index. (Chaudhuri et al., 2006)

SLIDE 16

AllPairs Framework

AllPairs:
for each row x = 1, . . . , n do
    Find similarity candidates for x using current inverted index (candidate generation)
    Complete similarity computation and prune unpromising candidates (candidate verification)
    Index enough of x to ensure all valid similarity pairs are discovered (index construction)
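A minimal, prune-free instance of this three-phase loop can be sketched as follows; it shows the framework's shape only, not L2AP's bounds, and all names and the full-index choice are illustrative:

```python
def allpairs_simple(D, t):
    """Skeleton of the AllPairs loop with a full inverted index and no pruning.

    D: list of rows, each a dict {feature: value}, assumed unit-normalized.
    Each row is matched against the index built from earlier rows, so every
    pair is examined exactly once.
    """
    index = {}   # feature j -> list of (row id y, y_j) postings
    result = []
    for x, row in enumerate(D):
        # Candidate generation: accumulate dot products via the index.
        A = {}
        for j, xj in row.items():
            for y, yj in index.get(j, []):
                A[y] = A.get(y, 0.0) + xj * yj
        # Candidate verification: the dot products are complete; just filter.
        for y, s in A.items():
            if s >= t:
                result.append((y, x, s))
        # Index construction: here we index every non-zero feature of x.
        for j, xj in row.items():
            index.setdefault(j, []).append((x, xj))
    return result
```

L2AP's contribution is in making each of the three phases cheaper: indexing fewer features, admitting fewer candidates, and verifying them with early exits.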

SLIDE 17

What we index

◮ We index x′′, the suffix of x.
  x′′_p = ⟨0, . . . , 0, x_p, . . . , x_m⟩ is the suffix of x starting at feature p.
◮ x′ is the un-indexed prefix of x.
  x′_p = ⟨x_1, . . . , x_{p−1}, 0, . . . , 0⟩ is x's prefix ending at p − 1.

a     = 0.12,     , 0.37, 0.22,     , 0.47, 0.75, 0.13
a′′_4 =     ,     ,     , 0.22,     , 0.47, 0.75, 0.13
a′_4  = 0.12,     , 0.37,     ,     ,     ,     ,

◮ x = x′ + x′′
◮ xyᵀ = Σ_{j=1}^{p−1} x_j × y_j + Σ_{j=p}^{m} x_j × y_j = x′_p yᵀ + x′′_p yᵀ
◮ cos(x, y) ≤ ||x|| × ||y|| (Cauchy–Schwarz inequality)
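The prefix/suffix decomposition of the dot product can be checked with a short sketch (1-based position p as on the slide; the helper names are illustrative):

```python
def prefix(x, p):
    """x'_p: x's features before position p (1-based), zeros elsewhere."""
    return [v if j < p - 1 else 0.0 for j, v in enumerate(x)]

def suffix(x, p):
    """x''_p: x's features from position p onward, zeros elsewhere."""
    return [v if j >= p - 1 else 0.0 for j, v in enumerate(x)]

def dot(x, y):
    return sum(u * v for u, v in zip(x, y))

# The slide's example vectors (blanks treated as zeros), split at p = 4.
a = [0.12, 0.00, 0.37, 0.22, 0.00, 0.47, 0.75, 0.13]
y = [0.00, 0.50, 0.65, 0.00, 0.05, 0.35, 0.45, 0.00]
p = 4
# The two partial dot products always sum to the full one: x.y = x'_p.y + x''_p.y
assert abs(dot(prefix(a, p), y) + dot(suffix(a, p), y) - dot(a, y)) < 1e-12
```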

SLIDE 18

Index Construction

◮ Add a minimum number of non-zero features j of x to the inverted index lists I_j (index filtering).

for each column j = 1, . . . , m s.t. x_j > 0 do
    if sim(x′_{j+1}, y) ≥ t, ∀y > x then
        I_j ← I_j ∪ {(x, x_j)}

SLIDE 19

Index Construction

◮ By the Cauchy–Schwarz inequality, cos(x′_j, y) ≤ ||x′_j|| × ||y|| = ||x′_j||, since ||y|| = 1 (bound b3).
◮ We store ||x′_j|| along with x_j in the index to use for later pruning.

a        = 0.12,     , 0.37, 0.22,     , 0.47, 0.75, 0.13
||a_j|| = 0.12, 0.12, 0.39, 0.45, 0.45, 0.65, 0.99, 1.00
(||a_j|| is the norm of a's prefix up to and including feature j)
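The running prefix norms on this slide can be reproduced directly; the listed value at position j is the norm of a's prefix up to and including feature j:

```python
import math

# The slide's vector a, with blanks treated as zeros.
a = [0.12, 0.00, 0.37, 0.22, 0.00, 0.47, 0.75, 0.13]

norms = []
acc = 0.0
for v in a:
    acc += v * v
    norms.append(math.sqrt(acc))  # norm of the prefix through this feature

# Rounded to two decimals, these match the slide's ||a_j|| row.
assert [round(n, 2) for n in norms] == [0.12, 0.12, 0.39, 0.45, 0.45, 0.65, 0.99, 1.0]
```

Since a is unit-normalized, the final prefix norm is 1.00; each intermediate value upper-bounds the cosine contribution any vector can still get from that prefix.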

SLIDE 20

Index Construction

◮ Let w = ⟨max_z z_1, . . . , max_z z_m⟩, the vector of max column values in D. We can estimate sim(x′_j, y) ≤ sim(x′_j, w).
◮ Leverage an order of D's rows. Order rows in decreasing max row value (||z||_∞) order. Let ŵ = ⟨min(x_1, max_z z_1), . . . , min(x_m, max_z z_m)⟩. Then sim(x′_j, y) ≤ sim(x′_j, ŵ), since the y's we seek follow x in the row order (bound b1, Bayardo et al., 2007).
◮ We use the minimum of the two bounds, min(b1, b3).
◮ We store ps[x] ← min(sim(x′_j, ŵ), ||x′_j||) to use in later pruning.

SLIDE 21

Candidate generation

◮ Traverse the inverted index lists I_j corresponding to non-zero features j of x and keep track of a partial dot product (A[y]) for the candidates encountered.

for each column j = m, . . . , 1 s.t. x_j > 0 do
    for each (y, y_j) ∈ I_j do
        if A[y] > 0 or sim(x′_j, y) ≥ t then
            A[y] ← A[y] + x_j × y_j
            A[y] ← 0 if A[y] + sim(x′_j, y′_j) < t

◮ Note that we are accumulating the suffix dot product, sim(x′′, y).
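This accumulation loop can be sketched with the rs4-style admission test from the next slide (new candidates are admitted only while ||x′_j|| ≥ t); the data layout and helper names are assumptions for illustration, not the paper's implementation:

```python
def generate_candidates(x_row, index, prefix_norm, t):
    """Sketch of bounded candidate generation for one query row.

    x_row: dict {feature j: x_j}; index: dict {j: list of (y, y_j) postings};
    prefix_norm(j): ||x'_j||, the norm of the query's prefix before feature j.
    Columns are scanned in decreasing feature order; once ||x'_j|| < t, no new
    candidate can still reach t, so only existing accumulators keep growing.
    """
    A = {}
    for j in sorted(x_row, reverse=True):
        admit_new = prefix_norm(j) >= t   # cos(x'_j, y) <= ||x'_j|| (bound rs4)
        for y, yj in index.get(j, []):
            if y in A:
                A[y] += x_row[j] * yj
            elif admit_new:
                A[y] = x_row[j] * yj
    return A

# Hypothetical postings: row 7 appears at columns 2 and 0, row 8 only at 0.
index = {2: [(7, 0.5)], 0: [(7, 0.9), (8, 1.0)]}
A = generate_candidates({0: 0.6, 2: 0.8}, index,
                        prefix_norm=lambda j: {0: 0.0, 2: 0.6}[j], t=0.5)
# Row 8 is never admitted: it first appears at column 0, where ||x'_0|| < t.
```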

SLIDE 22

Candidate generation

◮ Leverage t to prune potential candidates (residual filtering).

for each column j = m, . . . , 1 s.t. x_j > 0 do
    for each (y, y_j) ∈ I_j do
        if A[y] > 0 or sim(x′_j, y) ≥ t then
            A[y] ← A[y] + x_j × y_j
            A[y] ← 0 if A[y] + sim(x′_j, y′_j) < t

◮ Accumulate only if A[y] > 0 or ||x′_j|| ≥ t, since cos(x′_j, y) ≤ ||x′_j|| (bound rs4).
◮ Once the ℓ2 norm of x′_j falls below t, we ignore potential candidates y if A[y] = 0.

SLIDE 23

Candidate generation

◮ Given w defined as before, sim(x′, y) ≤ sim(x′, w) (bound rs1, Bayardo et al., 2007). We pre-compute rs1 = xwᵀ, and roll back the computation as we process each inverted index column j. We stop accumulating new candidates once rs1 < t.
◮ Candidates can only be those vectors with lower ids. We can improve rs1 by using max column values of processed columns instead, w = ⟨max_{z<x} z_1, . . . , max_{z<x} z_m⟩, thus sim(x′, y) ≤ sim(x′, w) (bound rs3).
◮ We use the best of both bounds, min(rs3, rs4), during residual filtering.

SLIDE 24

Candidate generation

◮ Leverage t at common features to prune actual candidates (positional filtering).

for each column j = m, . . . , 1 s.t. x_j > 0 do
    for each (y, y_j) ∈ I_j do
        if A[y] > 0 or sim(x′_j, y) ≥ t then
            A[y] ← A[y] + x_j × y_j
            A[y] ← 0 if A[y] + sim(x′_j, y′_j) < t

◮ We estimate sim(x′_j, y′_j) ≤ ||x′_j|| × ||y′_j|| (||y′_j|| is stored in the index), to prune some of the candidates (bound l2cg).
◮ We store ||x′_j|| for forward index features to use in future pruning.

SLIDE 25

Candidate verification

◮ We use the forward index to finish computing the dot products for the encountered candidates, vectors y with A[y] > 0.

for each y s.t. A[y] > 0 do
    next y if A[y] + sim(x, y′) < t
    for each column j s.t. y_j > 0 ∧ y_j ∉ I_j ∧ x_j > 0 do
        A[y] ← A[y] + x_j × y_j
        next y if A[y] + sim(x′_j, y′_j) < t
    Add {x, y, A[y]} to result if A[y] ≥ t
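A sketch of this verification loop, with the pscore filtering from the next slide folded in; it assumes the forward index stores each candidate's un-indexed prefix y′, and all names are illustrative:

```python
def verify_candidates(x_row, A, forward, ps, t):
    """Sketch of candidate verification with pscore filtering.

    x_row: dict {j: x_j} for the query row; A[y]: the partial dot product
    sim(x, y'') accumulated during candidate generation; forward[y]: the
    un-indexed prefix y' of candidate y as a dict {j: y_j}; ps[y]: the
    pscore bound, so sim(x, y') <= ps[y].  Returns pairs (y, sim) >= t.
    """
    result = []
    for y, partial in A.items():
        if partial <= 0:
            continue
        if partial + ps[y] < t:      # pscore filtering: y cannot reach t
            continue
        # Finish the dot product over the un-indexed prefix of y.
        s = partial + sum(x_row.get(j, 0.0) * yj
                          for j, yj in forward[y].items())
        if s >= t:
            result.append((y, s))
    return result

# Hypothetical candidates: 5 survives pscore filtering and verifies; 6 is pruned.
matches = verify_candidates(x_row={0: 0.6, 1: 0.8},
                            A={5: 0.3, 6: 0.2},
                            forward={5: {0: 0.5}, 6: {1: 0.1}},
                            ps={5: 0.4, 6: 0.05}, t=0.5)
```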

SLIDE 26

Candidate verification

◮ Leverage t and the stored pscore ps[y], obtained when indexing y, to prune candidates (pscore filtering).
◮ Note that, after candidate generation, A[y] = sim(x, y′′). ps[y] is an estimate of sim(z, y′), ∀z > y, including x, i.e., sim(x, y′) ≤ ps[y].
◮ Prune if A[y] + ps[y] < t.

for each y s.t. A[y] > 0 do
    next y if A[y] + sim(x, y′) < t
    for each column j s.t. y_j > 0 ∧ y_j ∉ I_j ∧ x_j > 0 do
        A[y] ← A[y] + x_j × y_j
        next y if A[y] + sim(x′_j, y′_j) < t
    Add {x, y, A[y]} to result if A[y] ≥ t

SLIDE 27

Candidate verification

◮ While computing the final dot product, we estimate sim(x, y′_j) ≤ ||x′_j|| × ||y′_j|| and use it to prune additional candidates (bound l2cv).
◮ Additional pruning obtained via dpscore and minsize filtering is described in the paper.

for each y s.t. A[y] > 0 do
    next y if A[y] + sim(x, y′) < t
    for each column j s.t. y_j > 0 ∧ y_j ∉ I_j ∧ x_j > 0 do
        A[y] ← A[y] + x_j × y_j
        next y if A[y] + sim(x′_j, y′_j) < t
    Add {x, y, A[y]} to result if A[y] ≥ t

SLIDE 28

Datasets

Dataset          n          m          nnz     nnz/n   nnz/m
RCV1             804,414    43,001     61e6    76      1417
WikiWords500k    494,244    343,622    197e6   399     574
WikiWords100k    100,528    339,944    79e6    787     233
TwitterLinks     146,170    143,469    200e6   1370    1395
WikiLinks        1,815,914  1,648,879  44e6    24      27
OrkutLinks       3,072,626  3,072,441  223e6   73      73

◮ RCV1: standard corpus of > 800,000 newswire stories.
◮ WikiWords500k: Wikipedia articles, min length 200.
◮ WikiWords100k: Wikipedia articles, min length 500.
◮ TwitterLinks: follow relationships of Twitter users that follow min 1,000 other users.
◮ WikiLinks: directed graph of hyperlinks between Wikipedia articles.
◮ OrkutLinks: Orkut friendship network.

SLIDE 29

Baseline approaches

◮ IdxJoin builds an inverted index and uses it to find sim(x, y), ∀y < x, without pruning.
◮ AllPairs uses max vector w in similarity estimates (bounds b1 and rs1).
◮ MMJoin (Lee et al., 2010) enhances AllPairs by adding length filtering and a tighter minsize bound. Length filtering estimates sim(x′_j, y) ≤ ½||x′_j||² + ½||y||² = ½||x′_j||² + ½, which is not as tight as our ℓ2 norm estimate, sim(x′_j, y) ≤ ||x′_j||, especially for low t values.
◮ AllPairs+BayesLSH-Lite and LSH+BayesLSH-Lite are variants of BayesLSH that take as input the candidate set generated by AllPairs and LSH, respectively.
◮ Source code for all methods, including L2AP and L2AP-approx, available at http://cs.umn.edu/~dragos/l2ap.
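The claim that MMJoin's length-filtering estimate is never tighter than the ℓ2-norm estimate follows from (u − 1)² ≥ 0, since ½u² + ½ − u = ½(u − 1)²; a quick numeric check:

```python
# For u = ||x'_j|| in [0, 1], MMJoin's length-filtering estimate
# (u**2 + 1) / 2 always dominates the l2-norm estimate u, because
# (u**2 + 1)/2 - u = (u - 1)**2 / 2 >= 0.  The gap is largest for small u,
# which is why the l2 bound prunes much more aggressively at low thresholds.
for u in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    mmjoin_bound = 0.5 * u * u + 0.5
    assert mmjoin_bound >= u
# e.g. at u = 0.1 the two bounds are 0.505 vs 0.1.
```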

SLIDE 30

Comparison with exact baselines

[Figure: total time (s), log-scaled, vs. similarity threshold t for IdxJoin, AllPairs, MMJoin, and L2AP on RCV1, WikiWords500k, TwitterLinks, and WikiLinks]

SLIDE 31

Comparison with approximate baselines

[Figure: total time (s), log-scaled, vs. similarity threshold t for LSH+BayesLSH-Lite, AllPairs+BayesLSH-Lite, L2AP+BayesLSH-Lite, L2AP-approx, and L2AP on RCV1, WikiWords500k, TwitterLinks, and WikiLinks]

SLIDE 32

ℓ2-norm effectiveness for index reduction

[Figure: index size vs. similarity threshold t for AllPairs, MMJoin, and L2AP on RCV1, WikiLinks, WikiWords500k, and TwitterLinks]

SLIDE 33

Residual filtering effectiveness

[Figure: number of candidates vs. similarity threshold t for AllPairs, MMJoin, and the min(rs3,rs4), rs3, and rs4 variants on RCV1, WikiWords500k, TwitterLinks, and WikiLinks]

SLIDE 34

Conclusion

Lessons learned from L2AP

◮ Filtering is efficient
    ◮ L2AP achieved significant speedups over exact baselines.
    ◮ BayesLSH-Lite approximate pruning cannot significantly improve over L2AP.
◮ Filtering is effective
    ◮ Improved index, residual, and positional filtering via ℓ2-norm bounds.
    ◮ Introduced pscore filtering, which is able to prune many generated candidates.
    ◮ Strengthened other bounds, e.g. dpscore, detailed in the paper.

SLIDE 35

Thank You

◮ Questions?

Acknowledgements: This work was supported in part by the NSF (IOS-0820730, IIS-0905220, OCI-1048018, CNS-1162405, and IIS-1247632), the Digital Technology Center at the University of Minnesota, and the Minnesota Supercomputing Institute.

SLIDE 36

pscore effectiveness for candidate pruning

[Figure: number of non-pruned candidates (log-scaled) and total time (s) vs. similarity threshold t for AllPairs, AllPairs+pscore, L2AP, and L2AP+pscore on WikiWords100k and WikiLinks]

SLIDE 37

ℓ2-norm only effectiveness for the pscore bound

[Figure: index size vs. similarity threshold t for AllPairs, MMJoin, L2AP, and L2APb3 on RCV1, WikiLinks, WikiWords500k, and OrkutLinks]

SLIDE 38

Effectiveness of new ℓ2-norm filtering

[Figure: total time (s) vs. similarity threshold t for no l2 pruning, l2cg, and l2cg+l2cv on OrkutLinks, WikiLinks, WikiWords100k, and TwitterLinks]

SLIDE 39

dpscore bounds effectiveness for positional filtering

[Figure: number of dot products (log-scaled) and total time (s) vs. similarity threshold t for no dp and variants dp1 through dp8 on RCV1 and WikiLinks]

SLIDE 40

Comparison with BayesLSH

[Figure: total time (s), log-scaled, vs. similarity threshold t for LSH+BayesLSH-Lite, AllPairs+BayesLSH-Lite, LSH+BayesLSH, AllPairs+BayesLSH, AllPairs, L2AP*, and L2AP on RCV1, WikiLinks, WikiWords500k, TwitterLinks, WikiWords100k, and OrkutLinks]

SLIDE 41

Comparison of AllPairs implementations

[Figure: total time (s), log-scaled, vs. similarity threshold t for AllPairs and AP on RCV1, WikiLinks, WikiWords500k, TwitterLinks, WikiWords100k, and OrkutLinks]

SLIDE 42

Approximate extensions

◮ BayesLSH-Lite (Satuluri and Parthasarathy, 2012) finds the probability that sim(x, y) > t, conditional on observed LSH hash matches, after checking h hashes.
◮ We created two approximate APSS methods by combining BayesLSH-Lite with L2AP:
    ◮ L2AP+BayesLSH-Lite: replace candidate verification with BayesLSH-Lite
    ◮ L2AP-approx: replace only l2cv bound pruning with BayesLSH-Lite