L2AP: Fast Cosine Similarity Search With Prefix L-2 Norm Bounds

  1. L2AP: Fast Cosine Similarity Search With Prefix L-2 Norm Bounds. David C. Anastasiu and George Karypis, University of Minnesota, Minneapolis, MN, USA. April 3, 2014.

  2. All-Pairs Similarity Search (APSS)
     Goal
     - For each object in a set, find all other set objects with a similarity value of at least t (its neighbors).
     Applications
     - Near-duplicate document detection
     - Clustering
     - Query refinement
     - Collaborative filtering
     - Semi-supervised learning
     - Information retrieval

  3. Outline
     1. Problem description
     2. Solution framework
     3. Index construction
     4. Candidate generation
     5. Candidate verification
     6. Experimental evaluation
        6.1 Efficiency testing
        6.2 Effectiveness testing
     7. Conclusion

  4. Problem Description
     - D: a sparse matrix of size n × m
     - x: the row vector for row x in D
     - Rows are unit-length normalized, x ← x / ||x||, so ||x|| = 1
     - sim(x, y) = cos(x, y) = xy^T / (||x|| × ||y||) = xy^T = Σ_{j=1}^{m} x_j × y_j
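The identity above is easy to sanity-check in code. Below is a minimal sketch, not from the paper: sparse vectors are dicts mapping feature to value, and the helper names are illustrative. The vectors are the a and b of the running example on the following slides, which are already unit length.

```python
import math

def normalize(x):
    """Scale a sparse vector (dict of feature -> value) to unit L2 length."""
    norm = math.sqrt(sum(v * v for v in x.values()))
    return {j: v / norm for j, v in x.items()}

def dot(x, y):
    """Sparse dot product; for unit-length vectors this equals cos(x, y)."""
    if len(x) > len(y):
        x, y = y, x  # iterate over the shorter vector
    return sum(v * y[j] for j, v in x.items() if j in y)

# a and b from the running example on the next slides (1-based features).
a = normalize({1: 0.12, 3: 0.37, 4: 0.22, 6: 0.47, 7: 0.75, 8: 0.13})
b = normalize({2: 0.50, 3: 0.65, 5: 0.05, 6: 0.35, 7: 0.45})
print(dot(a, b))  # ~0.7425: the shared features are j = 3, 6, 7
```

The printed value, approximately 0.7425, is exactly the final accumulated value A[b] in the animation a few slides below.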

  5. Problem Description
     - Naïve solution: compute the similarity of each object with all others, keep results ≥ t.
     - Equivalent to sparse matrix-matrix multiplication, followed by a filter operation: APSS ∼ (DD^T) ≥ t.

       for each row x = 1, ..., n do
         for each row y = 1, ..., n do
           if x ≠ y and sim(x, y) ≥ t then
             add {x, y, sim(x, y)} to result

     [Figure: the product D × D^T yielding the similarity matrix DD^T.]
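Transcribed directly, the naive method is a double loop. A sketch under the same dict representation, exploiting the symmetry of cosine so each pair is scored once:

```python
def dot(x, y):
    """Sparse dot product of two dict-backed vectors."""
    return sum(v * y.get(j, 0.0) for j, v in x.items())

def naive_apss(rows, t):
    """O(n^2) all-pairs search over unit-normalized sparse rows."""
    result = []
    for x in range(len(rows)):
        for y in range(x + 1, len(rows)):  # cos(x, y) = cos(y, x), so y > x suffices
            s = dot(rows[x], rows[y])
            if s >= t:
                result.append((x, y, s))
    return result
```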

  6.–14. Problem Description (one slide, shown as a sequence of animation builds)
     - Main idea: use the similarity threshold t and theoretical bounds to prune the search space.

       a = ⟨0.12, _, 0.37, 0.22, _, 0.47, 0.75, 0.13⟩
       b = ⟨_, 0.50, 0.65, _, 0.05, 0.35, 0.45, _⟩
       c = ⟨0.96, 0.28, 0.01, _, _, _, _, _⟩
       t = 0.5

     - Across the builds, the accumulated dot products of a with b and c grow as shared features are processed:
       A[c]: 0.0000 → 0.1152 (j = 1) → 0.1189 (j = 3)
       A[b]: 0.0000 → 0.2405 (j = 3) → 0.4050 (j = 6) → 0.7425 (j = 7)

  15. Extensions to the naïve approach
     - Leverage sparsity in D: build an inverted index. (Sarawagi and Kirpal, 2004)
     - Leverage the commutativity of cos(x, y). (Sarawagi and Kirpal, 2004)
     - Build a partial index. (Chaudhuri et al., 2006)
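A full inverted index over sparse rows takes only a few lines. A sketch with illustrative names (the partial-index variants on the following slides index only part of each row):

```python
from collections import defaultdict

def build_full_index(rows):
    """Map each column j to the list of (row id, value) postings with x_j > 0."""
    index = defaultdict(list)
    for rid, x in enumerate(rows):
        for j, v in x.items():
            index[j].append((rid, v))
    return index
```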

  16. AllPairs Framework

     AllPairs:
     for each row x = 1, ..., n do
       find similarity candidates for x using the current inverted index (candidate generation)
       complete the similarity computations and prune unpromising candidates (candidate verification)
       index enough of x to ensure all valid similarity pairs are discovered (index construction)
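Read as code, the framework is a single pass that interleaves the three phases, so each row is matched only against the partial index of the rows before it. The skeleton below is a sketch; the three helpers are hypothetical stand-ins whose bodies are sketched after the corresponding slides, and their exact signatures are illustrative:

```python
from collections import defaultdict

def all_pairs(rows, t):
    index = defaultdict(list)  # partial inverted index, grows as rows are processed
    result = []
    for rid, x in enumerate(rows):
        A = generate_candidates(x, index, t)              # partial dot products A[y]
        result.extend(verify_candidates(rid, x, A, rows, t))
        index_row(rid, x, index, t)                       # index just enough of x
    return result
```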

  17. What we index
     - We index x″, the suffix of x. x″_p = ⟨0, ..., 0, x_p, ..., x_m⟩ is the suffix of x starting at feature p.
     - x′ is the un-indexed prefix of x. x′_p = ⟨x_1, ..., x_{p−1}, 0, ..., 0⟩ is x's prefix ending at p − 1.

       a    = ⟨0.12, _, 0.37, 0.22, _, 0.47, 0.75, 0.13⟩
       a″_4 = ⟨_, _, _, 0.22, _, 0.47, 0.75, 0.13⟩
       a′_4 = ⟨0.12, _, 0.37, _, _, _, _, _⟩

     - x = x′ + x″, so xy^T = Σ_{j=1}^{p−1} x_j × y_j + Σ_{j=p}^{m} x_j × y_j = x′_p y^T + x″_p y^T
     - cos(x, y) ≤ ||x|| × ||y|| (Cauchy–Schwarz inequality)
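The split and the resulting dot-product decomposition can be checked mechanically. A sketch using the slide's a and b (1-based features, dict representation as before):

```python
def split(x, p):
    """Split a sparse vector into prefix x'_p (features < p) and suffix x''_p."""
    prefix = {j: v for j, v in x.items() if j < p}
    suffix = {j: v for j, v in x.items() if j >= p}
    return prefix, suffix

def dot(x, y):
    return sum(v * y.get(j, 0.0) for j, v in x.items())

a = {1: 0.12, 3: 0.37, 4: 0.22, 6: 0.47, 7: 0.75, 8: 0.13}
b = {2: 0.50, 3: 0.65, 5: 0.05, 6: 0.35, 7: 0.45}
ap, asuf = split(a, 4)  # a'_4 and a''_4 from the slide
assert abs(dot(a, b) - (dot(ap, b) + dot(asuf, b))) < 1e-12
```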

  18. Index Construction
     - Add a minimum number of non-zero features j of x to the inverted index lists I_j (index filtering).

       for each column j = 1, ..., m s.t. x_j > 0 do
         if sim(x′_{j+1}, y) ≥ t is possible for any y > x then
           I_j ← I_j ∪ {(x, x_j)}
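A sketch of this loop using bound b3 alone (the following slides combine it with b1 and take the minimum): feature j is indexed once the L2 norm of the prefix through j reaches t, because by Cauchy–Schwarz a prefix with norm below t can never reach similarity t on its own. Storing the prefix norm with each posting anticipates the next slide.

```python
import math

def index_row(rid, x, index, t):
    """Index filtering with bound b3 only (a simplification of the slide)."""
    sq = 0.0  # squared L2 norm of the prefix x'_j (features before j)
    for j in sorted(x):
        prefix_norm = math.sqrt(sq)   # ||x'_j||, stored for later pruning
        sq += x[j] * x[j]
        if math.sqrt(sq) >= t:        # sim(x'_{j+1}, y) <= ||x'_{j+1}|| could reach t
            index[j].append((rid, x[j], prefix_norm))
```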

  19. Index Construction
     - By the Cauchy–Schwarz inequality, cos(x′_j, y) ≤ ||x′_j|| × ||y|| = ||x′_j||, since ||y|| = 1 (bound b3).
     - We store ||x′_j|| along with x_j in the index to use for later pruning.

       a       = ⟨0.12, _, 0.37, 0.22, _, 0.47, 0.75, 0.13⟩
       ||a_j|| = ⟨0.12, 0.12, 0.39, 0.45, 0.45, 0.65, 0.99, 1.00⟩
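The running prefix norms for a are reproducible in a few lines (a sketch; rounded to two decimals they match the slide):

```python
import math

def prefix_norms(x, m):
    """norms[j-1] = L2 norm of features 1..j of the (1-based) sparse vector x."""
    norms, sq = [], 0.0
    for j in range(1, m + 1):
        sq += x.get(j, 0.0) ** 2
        norms.append(math.sqrt(sq))
    return norms

a = {1: 0.12, 3: 0.37, 4: 0.22, 6: 0.47, 7: 0.75, 8: 0.13}
print([round(v, 2) for v in prefix_norms(a, 8)])
# [0.12, 0.12, 0.39, 0.45, 0.45, 0.65, 0.99, 1.0]
```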

  20. Index Construction
     - Let w = ⟨max_z z_1, ..., max_z z_m⟩, the vector of maximum column values in D. We can estimate sim(x′_j, y) ≤ sim(x′_j, w).
     - Leverage an ordering of D's rows: order rows in decreasing maximum row value (||z||_∞), and let ŵ = ⟨min(x_1, max_z z_1), ..., min(x_m, max_z z_m)⟩. Then sim(x′_j, y) ≤ sim(x′_j, ŵ), since the y's we seek follow x in the row order (bound b1, Bayardo et al., 2007).
     - We use the minimum of the two bounds, min(b1, b3).
     - We store ps[x] ← min(sim(x′_j, ŵ), ||x′_j||) to use in later pruning.

  21. Candidate generation
     - Traverse the inverted index lists I_j corresponding to non-zero features j of x, and keep track of a partial dot product (A[y]) for the candidates encountered.

       for each column j = m, ..., 1 s.t. x_j > 0 do
         for each (y, y_j) ∈ I_j do
           if A[y] > 0 or sim(x′_j, y) ≥ t then
             A[y] ← A[y] + x_j × y_j
           A[y] ← 0 if A[y] + sim(x′_j, y′_j) < t

     - Note that we are accumulating the suffix dot product, sim(x″, y).

  22. Candidate generation
     - Leverage t to prune potential candidates (residual filtering), using the loop from slide 21.
     - Accumulate only if A[y] > 0 or ||x′_j|| ≥ t, since cos(x′_j, y) ≤ ||x′_j|| (bound rs4).
     - Once the ℓ2 norm of x′_j falls below t, we ignore potential candidates y with A[y] = 0.

  23. Candidate generation
     - Given w as defined before, sim(x′, y) ≤ sim(x′, w) (bound rs1, Bayardo et al., 2007). We pre-compute rs1 = xw^T and roll the computation back as we process each inverted index column j; we stop accumulating new candidates once rs1 < t.
     - Candidates can only be vectors with lower ids, so we can tighten rs1 by taking the maximum column values over the already-processed rows instead: w̃ = ⟨max_{z<x} z_1, ..., max_{z<x} z_m⟩, giving sim(x′, y) ≤ sim(x′, w̃) (bound rs3).
     - We use the better of the two bounds, min(rs3, rs4), during residual filtering.

  24. Candidate generation
     - Leverage t at common features to prune actual candidates (positional filtering), again in the loop from slide 21.
     - We estimate sim(x′_j, y′_j) ≤ ||x′_j|| × ||y′_j|| (||y′_j|| is stored in the index) to prune some of the candidates (bound l2cg).
     - We store ||x′_j|| for forward index features to use in future pruning.
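Combining the residual and positional filters, candidate generation might be sketched as below. This is one reading of the pseudocode, not the authors' code: index[j] holds (y, y_j, ||y′_j||) postings as built in the indexing sketch, prefix_norm[j] is ||x′_j|| for the probing row x, and residual filtering is shown with bound rs4 only (L2AP uses min(rs3, rs4)).

```python
def generate_candidates(x, index, prefix_norm, t):
    """Accumulate suffix dot products A[y] over shared indexed features,
    with residual filtering (rs4) and positional filtering (l2cg)."""
    A, pruned = {}, set()
    for j in sorted(x, reverse=True):                 # columns j = m, ..., 1
        for (y, y_j, y_pnorm) in index.get(j, []):
            if y in pruned:
                continue                              # provably below t, skip for good
            if y in A or prefix_norm[j] >= t:         # rs4: admit new candidates only while ||x'_j|| >= t
                A[y] = A.get(y, 0.0) + x[j] * y_j
                if A[y] + prefix_norm[j] * y_pnorm < t:   # l2cg: Cauchy-Schwarz on both prefixes
                    del A[y]
                    pruned.add(y)
    return A
```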

  25. Candidate verification
     - We use the forward index to finish computing the dot products for the encountered candidates, i.e. vectors y with A[y] > 0.

       for each y s.t. A[y] > 0 do
         next y if A[y] + sim(x, y′) < t
         for each column j s.t. y_j > 0 ∧ (y, y_j) ∉ I_j ∧ x_j > 0 do
           A[y] ← A[y] + x_j × y_j
           next y if A[y] + sim(x′_j, y′_j) < t
         add {x, y, A[y]} to result if A[y] ≥ t
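A matching verification sketch under the same representation. This is illustrative, not L2AP's exact bookkeeping: boundary[y] marks y's first indexed feature, so features before it live only in the forward index, and prefix_norm_of[y] is ||y′|| at that boundary; the inner per-feature pruning step of the slide is omitted for brevity.

```python
def verify_candidates(xid, x, A, rows, boundary, prefix_norm_of, t):
    """Finish each partial dot product over the un-indexed prefix of y."""
    result = []
    for y, partial in A.items():
        if partial + prefix_norm_of[y] < t:   # sim(x, y') <= ||x|| * ||y'|| = ||y'||
            continue                          # cannot reach t, prune the candidate
        s = partial
        for j, v in rows[y].items():
            if j < boundary[y] and j in x:    # un-indexed prefix features shared with x
                s += x[j] * v
        if s >= t:
            result.append((xid, y, s))
    return result
```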
