Algorithm Engineering for High-Dimensional Similarity Search Problems
Martin Aumüller, IT University of Copenhagen
Roadmap
01 Similarity Search in High Dimensions: Setup/Experimental Approach
02 Survey of state-of-the-art Nearest Neighbor Search algorithms
03 Similarity Search on the GPU, in external memory, and in distributed settings
100 dimensions
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/
$ grep -n "sicily" glove.twitter.27B.100d.txt
118340:sicily -0.43731 -1.1003 0.93183 0.13311 0.17207 …
“sicily”
“algorithm”
“engineering”
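The grep output above suggests that each line of the GloVe text file holds a word followed by its vector components, separated by spaces. A minimal sketch of reading one such line, assuming exactly that layout (the helper name `parse_glove_line` is hypothetical, and the example line carries only the five values visible on the slide):

```python
def parse_glove_line(line):
    """Split one line of a GloVe text file into (word, vector)."""
    parts = line.rstrip("\n").split(" ")
    return parts[0], [float(v) for v in parts[1:]]

# Only the first five values shown on the slide; real lines of the
# 100-dimensional file carry 100 values.
word, vec = parse_glove_line("sicily -0.43731 -1.1003 0.93183 0.13311 0.17207")
```

Note that the `118340:` prefix in the grep output is the line number added by `grep -n`, not part of the file.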
feature vectors
high dimensions
dimensionality
$d$, for large $d$
basically at the same distance
$2 \pm 1/\sqrt{d}$
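The concentration effect can be observed empirically. The sketch below is an illustration under an assumed model that is not stated on the slide: points with independent Gaussian coordinates of variance 1/d. It measures the mean and spread of all pairwise distances for a low and a high dimension.

```python
import math
import random

def distance_spread(d, n_points=50, seed=0):
    """Return (mean, standard deviation) of all pairwise Euclidean distances
    between n_points random Gaussian vectors in d dimensions."""
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(d)  # assumed scaling: each coordinate ~ N(0, 1/d)
    pts = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    mean = sum(dists) / len(dists)
    var = sum((x - mean) ** 2 for x in dists) / len(dists)
    return mean, math.sqrt(var)

low_mean, low_std = distance_spread(d=2)
high_mean, high_std = distance_spread(d=500)
```

Under this model the spread for d = 500 should be far smaller than for d = 2, which is what makes "nearest" hard to distinguish from "typical" in high dimensions.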
Given the distances $r_1, \ldots, r_k$ to the $k$ nearest neighbors, define

$$\mathrm{LID} = -\left(\frac{1}{k}\sum_{i=1}^{k} \ln\frac{r_i}{r_k}\right)^{-1}$$
Based on the concept of local intrinsic dimensionality (LID) [Houle, 2013] and its MLE estimator [Amsaleg et al., 2015]
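As a small illustration of the MLE estimator of [Amsaleg et al., 2015], the sketch below computes the estimate from a list of distances to the k nearest neighbors. The function name `lid_mle` is hypothetical; the code assumes all distances are positive and not all equal, since otherwise the logarithm or the final division is undefined.

```python
import math

def lid_mle(distances):
    """MLE of the local intrinsic dimensionality from the distances of a
    query to its k nearest neighbors."""
    r = sorted(distances)
    k = len(r)
    r_k = r[-1]  # distance to the k-th (farthest) neighbor
    mean_log_ratio = sum(math.log(r_i / r_k) for r_i in r) / k
    return -1.0 / mean_log_ratio

estimate = lid_mle([0.25, 0.5, 1.0])
```

For the distances 0.25, 0.5, 1.0 the mean log ratio is -ln 2, so the estimate is 1/ln 2, roughly 1.44. Distances that crowd close to the k-th distance produce a large estimate, which corresponds to a difficult query.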
Query sets: Easy, Middle, Difficult. Plots: http://ann-benchmarks.com/sisap19/faiss-ivf.html
Data points $p_1, \ldots, p_n$, 400 bytes each
GloVe: 1.2 M points, inner product as distance measure
Automatically SIMD-vectorized with clang -O3: https://godbolt.org/z/TJX68s
$x$, $y$ → parallel multiply → parallel add to result register → horizontal sum and cast to float
thread max on my laptop
https://gist.github.com/maumueller/720d0f71664bef694bd56b2aeff80b17
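For reference, the baseline being tuned here is a plain linear scan. The following is a minimal pure-Python sketch of that baseline (not the vectorized C++ code linked above); the names `inner_product` and `linear_scan` are chosen for illustration only.

```python
import heapq

def inner_product(x, y):
    """Multiply component-wise and sum, as the SIMD kernel does in parallel."""
    return sum(a * b for a, b in zip(x, y))

def linear_scan(data, query, k):
    """Return the indices of the k points with the largest inner product
    with the query, best first."""
    return heapq.nlargest(
        k, range(len(data)), key=lambda i: inner_product(data[i], query)
    )

data = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
top2 = linear_scan(data, [1.0, 0.0], k=2)
```

Every query touches every vector once, so the running time is governed by memory bandwidth and the cost of a single inner product.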
vectors faster than Euclidean distance/inner product
storing compact sketch representations
Can distance computation be avoided? [Christiani, 2019]
Sketch representation: $q$ → 1011100101, $x$ → 0101110101
SimHash [Charikar, 2002], 1-BitMinHash [König-Li, 2010]
At least $\tau$ collisions? Yes: compute dist($q$, $x$). No: skip.
Set $\tau$ such that with probability at least $1 - \varepsilon$ we don't disregard a point that could be among the NN.
Easy to analyze: sum of Bernoulli trials with $\Pr(X = 1) = f(\mathrm{dist}(q, x))$
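The filtering step can be sketched as follows, assuming SimHash-style sketches built from random hyperplanes and a collision threshold called `tau` here. All names, the sketch length of 64 bits, and the threshold of 48 are illustrative assumptions, not values from the talk.

```python
import random

def simhash_sketch(vec, hyperplanes):
    """One bit per random hyperplane: 1 if vec lies on its non-negative side."""
    return [
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    ]

def collisions(a, b):
    """Number of positions in which two sketches agree."""
    return sum(1 for x, y in zip(a, b) if x == y)

def passes_filter(sketch_q, sketch_x, tau):
    """Only points with at least tau agreeing bits get a real distance computation."""
    return collisions(sketch_q, sketch_x) >= tau

rng = random.Random(1)
dim, bits = 10, 64
hyperplanes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
q = [rng.gauss(0, 1) for _ in range(dim)]
near = [v + 0.01 * rng.gauss(0, 1) for v in q]   # tiny perturbation of q
far = [-v for v in q]                            # opposite direction

sketch_q = simhash_sketch(q, hyperplanes)
sketch_near = simhash_sketch(near, hyperplanes)
sketch_far = simhash_sketch(far, hyperplanes)
```

Comparing two 64-bit sketches costs a handful of bit operations when the bits are packed into a machine word, which is why the filter is cheaper than a full distance computation.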
PUFFINN: Parameterless and Universally Fast FInding of Nearest Neighbors
[A., Christiani, Pagh, Vesterli, 2019] https://github.com/puffinn/puffinn Credit: Richard Bartz
Locality-Sensitive Hashing (LSH) [Indyk-Motwani, 1998]
$h(p) = h_1(p) \circ h_2(p) \circ h_3(p) \in \{0,1\}^3$
A family $\mathcal{H}$ of hash functions is locality-sensitive if the collision probability of two points decreases with their distance from each other.
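A minimal sketch of one LSH table, assuming random-hyperplane hashing as the locality-sensitive family and a concatenation of three bits as in the example above. The helper names and the toy points are hypothetical.

```python
import random
from collections import defaultdict

def make_hash(dim, k, rng):
    """Concatenate k random-hyperplane bits into one hash code in {0,1}^k."""
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]

    def h(p):
        return tuple(
            int(sum(a * b for a, b in zip(plane, p)) >= 0) for plane in planes
        )

    return h

def build_table(points, h):
    """Map every hash code to the indices of the points that received it."""
    table = defaultdict(list)
    for idx, p in enumerate(points):
        table[h(p)].append(idx)
    return table

rng = random.Random(7)
h = make_hash(dim=4, k=3, rng=rng)
points = [
    [1.0, 0.2, 0.0, 0.1],
    [0.9, 0.25, 0.05, 0.1],
    [-1.0, -0.2, 0.0, -0.1],   # exact opposite of points[0]
]
table = build_table(points, h)
# The query is a positive multiple of points[0], so it shares its hash code.
candidates = table[h([2.0, 0.4, 0.0, 0.2])]
```

Only the points in the query's bucket are candidates. A single table misses near neighbors with noticeable probability, which is why LSH schemes use several independent repetitions.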
Hash functions $h_2, h_3, h_4, h_5, \ldots, h_L$
Dataset $S$
Termination: If $(1 - p)^j \le \delta$, report current top-$k$.
$p$: probability of the current $k$-th nearest neighbor to collide.
Not terminated? Decrease $K$!
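Reading the termination rule as "stop once (1 - p)^j is at most delta", where p is the collision probability of the current k-th nearest neighbor and j the number of repetitions inspected, the check can be sketched as below. This reading and the function names are assumptions made for illustration.

```python
import math

def may_terminate(p, j, delta):
    """True once the probability of having missed a point with collision
    probability p in all j repetitions is at most delta."""
    return (1.0 - p) ** j <= delta

def repetitions_needed(p, delta):
    """Smallest j with (1 - p)**j <= delta, for 0 < p < 1 and 0 < delta < 1."""
    return math.ceil(math.log(delta) / math.log(1.0 - p))

j_min = repetitions_needed(p=0.5, delta=0.1)
```

With p = 0.5 and delta = 0.1 this gives four repetitions, since 0.5^4 = 0.0625 is the first power below 0.1.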
Theoretical
Trie built from LSH hash values [Bawa et al., 2005]
Practical
sorted by hash code
search
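One way to get trie-like behaviour from a sorted array is binary search on a hash-code prefix: all codes sharing a prefix form a contiguous range, and shortening the prefix widens that range. The sketch below assumes hash codes stored as equal-length strings of "0" and "1"; it is an illustration, not PUFFINN's actual data layout.

```python
import bisect

def prefix_range(sorted_codes, prefix):
    """Return (lo, hi) such that sorted_codes[lo:hi] are exactly the codes
    that start with prefix."""
    lo = bisect.bisect_left(sorted_codes, prefix)
    # "2" sorts after both "0" and "1", so prefix + "2" is an upper bound
    # for every code that extends the prefix.
    hi = bisect.bisect_left(sorted_codes, prefix + "2")
    return lo, hi

codes = sorted(["1011", "0010", "1101", "0111", "1000"])
long_prefix = prefix_range(codes, "10")
short_prefix = prefix_range(codes, "1")
```

Here the prefix "10" selects two codes and the shorter prefix "1" selects three, mirroring how a shorter hash length yields more candidates.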
$n$ data points:
$x_1 = (0^d, y_1, z_1)$
⋮
$x_{n-1} = (0^d, y_{n-1}, z_{n-1})$
$x_n = (v, w, 0^d)$
$m$ query points:
$q_1 = (v, 0^d, Q_1)$
⋮
$q_m = (v, 0^d, Q_m)$
$y_i, z_i, v, w, Q_i \sim \mathcal{N}^d(0, 1/(2d))$
Goal: Keep out-degree as small as possible (while maintaining “large-enough” in-degree)!
HNSW/ONNG: [Malkov et al., 2020], [Iwasaki et al., 2018]
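To see why out-degree matters, consider a generic greedy walk over a neighborhood graph: every visited node costs one distance computation per outgoing edge. The sketch below is a simplified single-layer search written for illustration; it is not the HNSW or ONNG algorithm, and the graph is a hand-made toy example.

```python
import math

def greedy_search(graph, points, query, start):
    """Move to the neighbor closest to the query until no neighbor improves.
    Returns the index of the final node and the number of distance computations."""
    current = start
    current_dist = math.dist(points[current], query)
    computations = 1
    improved = True
    while improved:
        improved = False
        for nb in graph[current]:          # out-edges of the node being expanded
            d = math.dist(points[nb], query)
            computations += 1
            if d < current_dist:
                current, current_dist, improved = nb, d, True
    return current, computations

points = [[0.0], [1.0], [2.0], [3.0]]
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # a simple path graph
found, cost = greedy_search(graph, points, query=[2.9], start=0)
```

A node that nobody points to can never be reached by such a walk, which is the reason for the in-degree condition in the goal above.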
[Johnson et al., 2017] https://github.com/facebookresearch/faiss
centroids
closest centroid
points associated with these centroids
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
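The inverted-file idea (centroids, closest centroid, points associated with the probed centroids) can be sketched as follows. This is a toy illustration, not the FAISS API: the centroids are given by hand rather than learned with k-means, and the names `build_ivf`, `ivf_query`, and `n_probe` are assumptions.

```python
import math

def build_ivf(points, centroids):
    """Assign every point index to the inverted list of its closest centroid."""
    lists = [[] for _ in centroids]
    for idx, p in enumerate(points):
        c = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        lists[c].append(idx)
    return lists

def ivf_query(points, centroids, lists, query, n_probe=1):
    """Scan only the lists of the n_probe centroids closest to the query.
    Assumes the probed lists are not all empty."""
    order = sorted(range(len(centroids)), key=lambda j: math.dist(query, centroids[j]))
    candidates = [idx for c in order[:n_probe] for idx in lists[c]]
    return min(candidates, key=lambda idx: math.dist(points[idx], query))

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids = [[0.0, 0.5], [10.0, 10.5]]
lists = build_ivf(points, centroids)
nearest = ivf_query(points, centroids, lists, query=[9.0, 10.0])
```

Raising `n_probe` trades speed for recall: more lists are scanned, so fewer true neighbors are missed.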
[Groh et al., 2019]
[Subramanya et al., 2019]
RAM: compressed vectors $\hat{x}_1, \ldots, \hat{x}_n$ (32 bytes per vector), obtained by Product Quantization
SSD: original vectors $x_1, \ldots, x_n$ (~400 bytes per vector)
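A minimal sketch of product quantization, assuming the codebooks are already available (in practice they are typically learned with k-means per block; the tiny hand-written codebooks here are purely illustrative). Each block of the vector is replaced by the index of its nearest codeword.

```python
import math

def pq_encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest codeword."""
    sub_len = len(vec) // len(codebooks)
    code = []
    for b, book in enumerate(codebooks):
        sub = vec[b * sub_len:(b + 1) * sub_len]
        code.append(min(range(len(book)), key=lambda c: math.dist(sub, book[c])))
    return code

def pq_decode(code, codebooks):
    """Approximate reconstruction: concatenate the selected codewords."""
    return [x for b, c in enumerate(code) for x in codebooks[b][c]]

# Two blocks of length 2, two codewords per block (hypothetical values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
code = pq_encode([0.9, 1.2, 0.1, 0.8], codebooks)
approx = pq_decode(code, codebooks)
```

With 32 blocks and 256 codewords per block, each index fits in one byte, which would give the 32 bytes per vector mentioned above.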
$R \bowtie_\lambda S = \{(x, y) \in R \times S \mid \mathrm{sim}(x, y) \ge \lambda\}$
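The join definition translates directly into a brute-force procedure. The sketch below uses Jaccard similarity on sets as an assumed similarity measure; any other function with the same signature could be passed in, and the input sets are assumed to be non-empty.

```python
def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def similarity_join(R, S, threshold, sim=jaccard):
    """All pairs (x, y) from R x S with sim(x, y) >= threshold, by brute force."""
    return [(x, y) for x in R for y in S if sim(x, y) >= threshold]

R = [frozenset({1, 2, 3})]
S = [frozenset({1, 2, 3, 4}), frozenset({7, 8})]
pairs = similarity_join(R, S, threshold=0.7)
```

The brute-force version evaluates the similarity for every pair, which is the quadratic cost that the approaches discussed next try to avoid.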
Single Core on Xeon E5-2630v2 (2.60 GHz) Hadoop cluster (12 nodes, 24 HT per node) [Fier et al., 2018]
Scalability! But at what COST? [McSherry et al., 2015]
[Mann et al., 2016]
[Hu et al., 2019]
$R$, $S$ → Hash using LSH: $(x, h_i(x))$, $(y, h_i(y))$ → Join on hash: $(x, y, h_i(x))$ → Similarity at least $\lambda$? → Emit $(x, y)$
$O(n^2)$ local work for distance computations!
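The pipeline of hashing, joining on the hash value, and verifying can be sketched on a single machine as follows. The choice of MinHash with Jaccard similarity, the seeded pseudo-random ordering, and all function names are assumptions made for illustration; the distributed version would run the same steps as dataflow stages.

```python
import random
from collections import defaultdict

def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def minhash(seed):
    """One MinHash function: the element that comes first in a seeded
    pseudo-random order of all elements."""
    def h(s):
        return min(s, key=lambda e: random.Random(f"{seed}-{e}").random())
    return h

def lsh_join(R, S, threshold, hash_functions, sim=jaccard):
    """Join R and S on each hash function, then verify every candidate pair."""
    result = set()
    for h in hash_functions:
        buckets = defaultdict(list)
        for x in R:                          # hash using LSH: (x, h(x))
            buckets[h(x)].append(x)
        for y in S:                          # join on the hash value
            for x in buckets.get(h(y), []):
                if sim(x, y) >= threshold:   # similarity at least threshold?
                    result.add((x, y))       # emit
    return result

R = [frozenset({1, 2, 3})]
S = [frozenset({1, 2, 3}), frozenset({8, 9})]
pairs = lsh_join(R, S, threshold=0.9, hash_functions=[minhash(s) for s in range(4)])
```

If many points share one hash value, the verification loop inside that bucket is quadratic in the bucket size, which is the local work the slide warns about.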
[A., Ceccarello, Pagh, 2020] In preparation, https://github.com/cecca/danny
$R$, $S$ → Cartesian product → LSH + sketching, candidate verification locally → Emit/Collect
Implementation in Rust using timely dataflow: https://github.com/TimelyDataflow/timely-dataflow
01 Similarity Search in High Dimensions: Setup/Experimental Approach
02 Survey of state-of-the-art Nearest Neighbor Search algorithms
03 Similarity Search on the GPU, in external memory, and in distributed settings
https://arxiv.org/abs/1807.05614 for an open access version.
and [Christiani, 2019]
$L$ repetitions (rows), $K$ hash functions per repetition (columns)
Pick $K$ hash functions in repetition $i$ using universal hash functions in each column.
$K \cdot m$ independent hash functions from the LSH family, $m \ll L$.
Analysis using Cantelli's inequality → requires different stopping criteria (factor 2 slowdown)