

  1. Algorithm Engineering for High-Dimensional Similarity Search Problems. Martin Aumüller, IT University of Copenhagen

  2. Roadmap
  01 Similarity Search in High Dimensions: Setup/Experimental Approach
  02 Survey of state-of-the-art Nearest Neighbor Search algorithms
  03 Similarity Search on the GPU, in external memory, and in distributed settings

  3. 1. Similarity Search in High Dimensions: Setup/Experimental Approach

  4. k-Nearest Neighbor Problem
  • Preprocessing: build a data structure for a set S of n data points
  • Task: given a query point q, return the k closest points to q in S

  5. Nearest neighbor search on words
  • GloVe: learning algorithm to find vector representations for words
  • GloVe.twitter dataset: 1.2M words, vectors trained from 2B tweets, 100 dimensions
  • Semantically similar words: nearest neighbor search on vectors
  https://nlp.stanford.edu/projects/glove/
  Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.

  6. GloVe Examples
  • “sicily”: sardinia, tuscany, dubrovnik, liguria, naples
  • “algorithm”: algorithms, optimization, approximation, iterative, computation
  • “engineering”: engineer, accounting, research, science, development
  $ grep -n "sicily" glove.twitter.27B.100d.txt
  118340:sicily -0.43731 -1.1003 0.93183 0.13311 0.17207 …

  7. Basic Setup
  • Data is described by high-dimensional feature vectors
  • Exact similarity search is difficult in high dimensions: data structures and algorithms suffer an exponential dependence on the dimensionality, in time, space, or both

  8. Why is exact NN difficult?
  • Choose n random points from N(0, 1/d)^d, for large d
  • Choose a random query point
  • Nearest and furthest neighbor are then basically at the same distance, roughly √2 ± 1/√d
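The concentration claim above can be checked with a short simulation (a minimal sketch; dimension, sample size, and seed are arbitrary choices, not from the talk):

```python
import math
import random

random.seed(0)
d, n = 1000, 200

# n random points and one query, each coordinate ~ N(0, 1/d)
def rand_point():
    return [random.gauss(0, 1 / math.sqrt(d)) for _ in range(d)]

points = [rand_point() for _ in range(n)]
query = rand_point()

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

ds = sorted(dist(query, p) for p in points)
# All distances cluster around sqrt(2); nearest and furthest are very close.
print(ds[0], ds[-1], ds[-1] / ds[0])
```

With d this large, the ratio between furthest and nearest distance is close to 1, which is exactly what makes pruning-based exact search ineffective.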

  9. Performance on GloVe

  10. Difficulty measure for queries
  • Given a query q and the distances r_1, …, r_k to its k nearest neighbors, define
    LID(q) = −( (1/k) · Σ_{i=1}^{k} ln(r_i / r_k) )^{−1}
  Based on the concept of local intrinsic dimensionality [Houle, 2013] and its MLE estimator [Amsaleg et al., 2015]
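The MLE estimator is a one-liner given the sorted neighbor distances. Below is a sketch using a synthetic distance profile chosen so that the true intrinsic dimensionality is known (the profile is a made-up test case, not GloVe data):

```python
import math

def lid_estimate(dists):
    """MLE of local intrinsic dimensionality from the sorted distances
    r_1 <= ... <= r_k of a query to its k nearest neighbors:
    LID(q) = -((1/k) * sum_i ln(r_i / r_k))^(-1)."""
    k, r_k = len(dists), dists[-1]
    return -1.0 / (sum(math.log(r / r_k) for r in dists) / k)

# Distances growing like r_i = (i/k)^(1/m) have intrinsic dimensionality m.
k, m = 1000, 5.0
dists = [(i / k) ** (1 / m) for i in range(1, k + 1)]
print(lid_estimate(dists))  # close to 5.0
```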

  11. LID Distribution

  12. Results (GloVe, 10-NN, 1.2M points): easy, middle, and difficult query sets. http://ann-benchmarks.com/sisap19/faiss-ivf.html

  13. 2. State-of-the-Art Nearest Neighbor Search

  14. General Pipeline: the index generates candidates; brute-force search on the candidates

  15. Brute-force search
  GloVe: 1.2M points, inner product as distance measure; vectors p_1, …, p_n, 400 bytes each
  • 100 ms per scan
  • 4.2 GB/s throughput
  • CPU-bound
  Automatically SIMD-vectorized with clang -O3: https://godbolt.org/z/TJX68s
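Stripped of all vectorization, such a scan is just an argmax over inner products; this sketch shows the shape of the loop the compiler vectorizes (toy data, not the GloVe vectors):

```python
def brute_force_knn(query, points, k):
    """Scan every point, score by inner product (larger = more similar),
    return the indices of the k best-scoring points."""
    scores = [sum(q * p for q, p in zip(query, pt)) for pt in points]
    return sorted(range(len(points)), key=lambda i: -scores[i])[:k]

points = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(brute_force_knn([1.0, 0.1], points, 2))  # [0, 2]
```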

  16. Manual vectorization (256-bit registers)
  https://gist.github.com/maumueller/720d0f71664bef694bd56b2aeff80b17
  Parallel multiply of x and y, parallel add into a result register, horizontal sum and cast to float
  • 25 ms per query
  • 16 GB/s (16.5 GB/s single-thread max on my laptop)
  • Memory-bound

  17. Brute-force on bit vectors
  • Another popular distance measure is Hamming distance: the number of positions in which two bit strings differ
  • Bits can be nicely packed into 64-bit words; the Hamming distance of two words is just the bitcount of their XOR
  • 1.3 ms per query (128 bits)
  • 6 GB/s throughput
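The XOR-and-bitcount idea can be sketched directly on Python integers (the packing into 64-bit words and the hardware popcount instruction are abstracted away):

```python
def hamming(a, b):
    """Hamming distance of two bit strings packed into ints:
    the bitcount of their XOR."""
    return bin(a ^ b).count("1")

x = 0b1011100101
y = 0b0101110101
print(hamming(x, y))  # 4
```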

  18. Sketching to avoid distance computations [Christiani, 2019]
  • Distance computations on bit vectors are faster than Euclidean distance/inner product
  • Their number can be reduced by storing compact sketch representations, e.g. SimHash [Charikar, 2002] or 1-bit MinHash [König-Li, 2010]
  • Sketch representation: q ↦ 1011100101, x ↦ 0101110101
  • Easy to analyze: a sum of Bernoulli trials with Pr(X_i = 1) = f(dist(q, x))
  • Can the distance computation be avoided? At least t collisions? Yes: compute dist(q, x). No: skip.
  • Set t such that, with probability at least 1 − δ, we do not disregard a point that could be among the nearest neighbors.
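A SimHash-style version of this filter can be sketched as follows (dimension, sketch length, threshold, and the near/far test points are arbitrary assumptions; the threshold check mirrors the yes/no branch above):

```python
import random

random.seed(1)
d, bits = 32, 64
# SimHash: one random hyperplane per sketch bit.
planes = [[random.gauss(0, 1) for _ in range(d)] for _ in range(bits)]

def simhash(v):
    sig = 0
    for plane in planes:
        sig = (sig << 1) | int(sum(a * b for a, b in zip(plane, v)) >= 0)
    return sig

def collisions(sig_a, sig_b):
    """Number of sketch bits on which two points agree."""
    return bits - bin(sig_a ^ sig_b).count("1")

q = [random.gauss(0, 1) for _ in range(d)]
close = [x + random.gauss(0, 0.01) for x in q]  # tiny perturbation of q
far = [random.gauss(0, 1) for _ in range(d)]    # unrelated point

t = 48  # only pay for an exact distance computation above this threshold
for cand in (close, far):
    c = collisions(simhash(q), simhash(cand))
    print(c, "compute distance" if c >= t else "skip")
```

A nearby point agrees on almost all sketch bits, while an unrelated point agrees on roughly half of them, so a threshold between the two regimes filters out most exact distance computations.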

  19. General Pipeline: the index generates candidates; brute-force search on the candidates

  20. PUFFINN: Parameterless and Universally Fast FInding of Nearest Neighbors [A., Christiani, Pagh, Vesterli, 2019] https://github.com/puffinn/puffinn (Credit: Richard Bartz)

  21. How does it work? Locality-Sensitive Hashing (LSH) [Indyk-Motwani, 1998]
  • h(p) = h_1(p) ∘ h_2(p) ∘ h_3(p) ∈ {0,1}^3
  • A family ℋ of hash functions is locality-sensitive if the collision probability of two points decreases with their distance to each other.
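The concatenation h = h_1 ∘ h_2 ∘ h_3 can be sketched with random hyperplanes for the angular case (dimension and hash family are assumptions here; note that rescaling a point preserves its angle and hence its code):

```python
import random

random.seed(2)
d = 16

def make_hash(K):
    """K concatenated random-hyperplane hash functions:
    h(p) = h_1(p) . h_2(p) ... h_K(p) in {0,1}^K."""
    planes = [[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]
    def h(p):
        return tuple(int(sum(a * b for a, b in zip(pl, p)) >= 0) for pl in planes)
    return h

h = make_hash(3)
p = [random.gauss(0, 1) for _ in range(d)]
near = [1.05 * x for x in p]  # same direction as p, so the same code
print(h(p), h(near))
```

Concatenating K functions makes collisions rarer but more meaningful: only points that agree on all K bits land in the same bucket.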

  22. Solving ๐‘™ -NN using LSH (with failure prob. ๐œ€ ) Dataset ๐‘‡ โ„Ž 4 โ„Ž 5 โ„Ž 2 โ„Ž 3 โ„Ž ๐‘€ โ€ฆ Termination : If 1 โˆ’ ๐‘ž ๐‘˜ โ‰ค ๐œ€ , report current top- ๐‘™ . Not terminated? Decrease ๐ฟ ! probability of the current ๐‘™ -th nearest neighbor to collide. 22

  23. The Data Structure
  • Theoretical: LSH Forest, where each repetition is a trie built from the LSH hash values [Bawa et al., 2005]
  • Practical: store the indices of the data set points sorted by hash code; “traverse the trie” by binary search; use a lookup table for the first levels
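The practical variant's "trie traversal by binary search" can be sketched on sorted hash-code strings (the 4-bit codes below are hypothetical):

```python
from bisect import bisect_left, bisect_right

# Hash codes of the data points, kept sorted.
codes = sorted(["0010", "0100", "0101", "0110", "1011", "1100"])

def prefix_range(prefix):
    """All codes starting with `prefix`, found by binary search; this is
    the set of points under the corresponding trie node. The sentinel
    works because every code character is below '\\xff'."""
    lo = bisect_left(codes, prefix)
    hi = bisect_right(codes, prefix + "\xff")
    return codes[lo:hi]

print(prefix_range("01"))  # ['0100', '0101', '0110']
```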

  24. Overall System Design

  25. Running time (GloVe 100d, 1.2M, 10-NN)

  26. A difficult (?) data set in ℝ^{3d}
  • n data points: x_1 = (0^d, y_1, z_1), …, x_{n−1} = (0^d, y_{n−1}, z_{n−1}), x_n = (v, w, 0^d)
  • m query points: q_1 = (v, 0^d, r_1), …, q_m = (v, 0^d, r_m)
  • with y_i, z_i, v, w, r_i ∼ N(0, 1/(2d))^d

  27. Running time (“Difficult”, 1M, 10-NN)

  28. Graph-based Similarity Search

  29. Building a Small World Graph

  30. Refining a Small World Graph
  Goal: keep the out-degree as small as possible (while maintaining a “large-enough” in-degree)!
  HNSW/ONNG: [Malkov et al., 2020], [Iwasaki et al., 2018]
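Queries on such graphs are answered by a greedy walk towards the query; a toy one-dimensional sketch (graph, coordinates, and query are made up) shows the basic loop:

```python
# Greedy best-first search on a neighborhood graph: a minimal sketch of how
# search on small-world graphs works; real systems keep a candidate beam.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
coords = {0: 0.0, 1: 2.0, 2: 3.0, 3: 5.0, 4: 8.0}

def greedy_search(start, query):
    """From `start`, repeatedly move to the neighbor closest to the query;
    stop at a local minimum of the distance."""
    cur = start
    while True:
        best = min(graph[cur], key=lambda v: abs(coords[v] - query))
        if abs(coords[best] - query) >= abs(coords[cur] - query):
            return cur
        cur = best

print(greedy_search(0, 7.5))  # 4
```

Out-degree controls the work per step, while enough in-degree keeps every point reachable: exactly the trade-off the refinement step above targets.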

  31. Running time (GloVe 100d, 1.2M, 10-NN)

  32. Open Problems in Nearest Neighbor Search
  • Data-dependent LSH with guarantees?
  • Theoretically sound small-world graphs?
  • Good multi-core implementations? [Malkov et al., 2020]
  • Alternative ways of sketching data?

  33. 3. Similarity Search on the GPU, in External Memory, and in Distributed Settings

  34. Nearest Neighbors on the GPU: FAISS [Johnson et al., 2017]
  https://github.com/facebookresearch/faiss
  • GPU setting: the data structure is held in GPU memory; queries come in batches of, say, 10,000 at a time
  • Results: http://ann-benchmarks.com/sift-128-euclidean_10_euclidean-batch.html

  35. FAISS/2
  • Data structure: run k-means with a large number of centroids; each data point is associated with its closest centroid
  • Query: find the L closest centroids; return the k closest points among the points associated with these centroids
  https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
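The query path can be sketched in a few lines (toy one-dimensional data; the centroid values stand in for a trained k-means result):

```python
# Minimal sketch of an inverted-file (IVF) style query as described above.
centroids = [0.0, 5.0, 10.0]
points = [0.2, 0.9, 4.8, 5.3, 9.7, 10.4]

# Inverted lists: each point is assigned to its closest centroid.
lists = {c: [] for c in range(len(centroids))}
for i, p in enumerate(points):
    lists[min(range(len(centroids)), key=lambda c: abs(centroids[c] - p))].append(i)

def ivf_query(q, L, k):
    """Scan only the inverted lists of the L closest centroids,
    then return the k closest points found there."""
    probe = sorted(range(len(centroids)), key=lambda c: abs(centroids[c] - q))[:L]
    cand = [i for c in probe for i in lists[c]]
    return sorted(cand, key=lambda i: abs(points[i] - q))[:k]

print(ivf_query(5.0, 1, 2))  # [2, 3]
```

Probing more centroids (larger L) trades speed for recall, which is the main tuning knob of this kind of index.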

  36. Nearest Neighbors on the GPU: GGNN [Groh et al., 2019]

  37. Nearest Neighbors in External Memory [Subramanya et al., 2019]
  • RAM: compressed vectors x̃_1, …, x̃_n from product quantization (32 bytes per vector)
  • SSD: original vectors x_1, …, x_n (~400 bytes per vector)
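The split amounts to "shortlist on compressed vectors in RAM, re-rank on the originals from SSD". A toy sketch, with plain rounding standing in for product quantization:

```python
# Two-tier search: cheap, lossy distances first, exact distances on a shortlist.
exact = [[0.12, 0.88], [0.49, 0.51], [0.91, 0.08]]
compressed = [[round(x, 1) for x in v] for v in exact]  # lossy, small

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def search(q, shortlist_size, k):
    # 1) shortlist by compressed distance (RAM-resident)
    short = sorted(range(len(compressed)), key=lambda i: dist(q, compressed[i]))
    short = short[:shortlist_size]
    # 2) re-rank the shortlist by exact distance (fetched from "SSD")
    return sorted(short, key=lambda i: dist(q, exact[i]))[:k]

print(search([0.5, 0.5], 2, 1))  # [1]
```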

  38. Distributed Setting: Similarity Join
  • Problem: given sets R and S of size n and a similarity threshold λ, compute R ⋈_λ S = { (x, y) ∈ R × S ∣ sim(x, y) ≥ λ }
  • Similarity measures: Jaccard similarity, cosine similarity
  • Naive: O(n²) distance computations
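The naive quadratic baseline is easy to state; a sketch with Jaccard similarity on small made-up sets:

```python
def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

def similarity_join(R, S, lam):
    """Naive similarity join: O(n^2) pairwise comparisons."""
    return [(i, j) for i, x in enumerate(R) for j, y in enumerate(S)
            if jaccard(x, y) >= lam]

R = [{"a", "b", "c"}, {"x", "y"}]
S = [{"a", "b", "d"}, {"y", "z"}]
print(similarity_join(R, S, 0.3))  # [(0, 0), (1, 1)]
```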

  39. Scalability! But at what COST? [McSherry et al., 2015]
  Map-Reduce-based similarity join on a Hadoop cluster (12 nodes, 24 hyperthreads per node) [Fier et al., 2018] versus a single core on a Xeon E5-2630v2 (2.60 GHz) [Mann et al., 2016]

  40. Solved almost-optimally in the MPC model [Hu et al., 2019]
  • Hash join using LSH: for x ∈ R emit (x, h_i(x)), for y ∈ S emit (y, h_i(y)); join on the hash value and emit (x, y) if the similarity is at least λ
  • O(n²) local work for the distance computations!
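The bucket-then-verify shape of this join can be sketched locally (the hash function, similarity measure, and data here are toy assumptions, and only a single hash repetition is shown):

```python
from collections import defaultdict

def lsh_join(R, S, h, lam, sim):
    """Bucket both sides by an LSH value, then verify only the pairs that
    share a bucket: a local sketch of the distributed hash-join shape."""
    buckets = defaultdict(lambda: ([], []))
    for x in R:
        buckets[h(x)][0].append(x)
    for y in S:
        buckets[h(y)][1].append(y)
    out = set()
    for xs, ys in buckets.values():
        for x in xs:
            for y in ys:
                if sim(x, y) >= lam:
                    out.add((x, y))
    return out

# Toy instance: 1-d points, bucket by integer cell, similarity = -distance.
R, S = [0.1, 0.9, 5.2], [0.2, 5.3, 9.9]
pairs = lsh_join(R, S, h=lambda p: int(p), lam=-0.5,
                 sim=lambda a, b: -abs(a - b))
print(sorted(pairs))  # [(0.1, 0.2), (5.2, 5.3)]
```

In the worst case all points fall into one bucket, which is where the O(n²) local verification work in the slide comes from.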

  41. Another approach: DANNY [A., Ceccarello, Pagh, 2020]
  In preparation, https://github.com/cecca/danny
  • LSH + sketching; Cartesian product and candidate verification locally; emit/collect
  • Implemented in Rust using timely dataflow: https://github.com/TimelyDataflow/timely-dataflow

  42. Results

  43. Roadmap
  01 Similarity Search in High Dimensions: Setup/Experimental Approach
  02 Survey of state-of-the-art Nearest Neighbor Search algorithms
  03 Similarity Search on the GPU, in external memory, and in distributed settings
