

SLIDE 1

Algorithm Engineering for High-Dimensional Similarity Search Problems

Martin Aumüller, IT University of Copenhagen

SLIDE 2

Roadmap

Similarity Search in High-Dimensions: Setup/Experimental Approach

01

Survey of state-of-the-art Nearest Neighbor Search algorithms

02

Similarity Search on the GPU, in external memory, and in distributed settings

03

SLIDE 3

  • 1. Similarity Search in High-Dimensions: Setup/Experimental Approach

SLIDE 4

k-Nearest Neighbor Problem

  • Preprocessing: Build a data structure for a set S of n data points
  • Task: Given a query point q, return the k closest points to q in S
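To make the task concrete, here is a minimal brute-force baseline in C++ (an illustrative sketch, not code from the talk): scan all n points and keep the k closest in a max-heap. The index structures surveyed later exist to beat exactly this loop.

#include <algorithm>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Squared Euclidean distance between two d-dimensional points.
float dist2(const std::vector<float>& a, const std::vector<float>& b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        float diff = a[i] - b[i];
        s += diff * diff;
    }
    return s;
}

// Return the indices of the k points of S closest to the query q.
std::vector<std::size_t> knn_bruteforce(const std::vector<std::vector<float>>& S,
                                        const std::vector<float>& q, std::size_t k) {
    // Max-heap on (distance, index); the root is the worst of the current top-k.
    std::priority_queue<std::pair<float, std::size_t>> heap;
    for (std::size_t i = 0; i < S.size(); ++i) {
        float d = dist2(S[i], q);
        if (heap.size() < k) heap.push({d, i});
        else if (d < heap.top().first) { heap.pop(); heap.push({d, i}); }
    }
    std::vector<std::size_t> res;
    while (!heap.empty()) { res.push_back(heap.top().second); heap.pop(); }
    std::reverse(res.begin(), res.end());  // closest first
    return res;
}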

SLIDE 5

Nearest neighbor search on words

  • GloVe: learning algorithm to find vector representations for words
  • GloVe.twitter dataset: 1.2M words, vectors trained from 2B tweets, 100 dimensions

  • Semantically similar words: nearest neighbor search on vectors

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/

SLIDE 6

$ grep -n "sicily" glove.twitter.27B.100d.txt
118340:sicily -0.43731 -1.1003 0.93183 0.13311 0.17207 …

GloVe Examples

“sicily”
  • sardinia
  • tuscany
  • dubrovnik
  • liguria
  • naples

“algorithm”
  • algorithms
  • optimization
  • approximation
  • iterative
  • computation

“engineering”
  • engineer
  • accounting
  • research
  • science
  • development

SLIDE 7

Basic Setup

  • Data is described by high-dimensional feature vectors
  • Exact similarity search is difficult in high dimensions
  • data structures and algorithms suffer an exponential dependence on dimensionality, in time, space, or both

SLIDE 8

Why is Exact NN difficult?

  • Choose n random points from N(0, 1/d)^d, for large d
  • Choose a random query point
  • nearest and furthest neighbor end up at basically the same distance: all pairwise squared distances concentrate around 2 ± 1/√d
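A quick experiment makes the concentration visible (an illustrative sketch; parameters are arbitrary). For d in the hundreds, the nearest and furthest squared distances from the query already agree to within a few percent.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int n = 1000, d = 1000;
    std::mt19937 rng(42);
    // N(0, 1/d): standard deviation 1/sqrt(d), so E[squared distance] = 2.
    std::normal_distribution<double> coord(0.0, 1.0 / std::sqrt(double(d)));
    auto sample = [&] {
        std::vector<double> v(d);
        for (double& x : v) x = coord(rng);
        return v;
    };

    std::vector<std::vector<double>> points(n);
    for (auto& p : points) p = sample();
    const std::vector<double> q = sample();

    double dmin = 1e300, dmax = 0.0;
    for (const auto& p : points) {
        double s = 0.0;
        for (int i = 0; i < d; ++i) { double t = p[i] - q[i]; s += t * t; }
        dmin = std::min(dmin, s);
        dmax = std::max(dmax, s);
    }
    // Both values land in a band of width O(1/sqrt(d)) around 2.
    std::printf("min = %.4f, max = %.4f\n", dmin, dmax);
}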

SLIDE 9

Performance

  • on GloVe

SLIDE 10

Difficulty measure for queries

  • Given query q and distances r_1, …, r_k to its k nearest neighbors, define

D(q) = ( −(1/k) · Σ_{i=1}^{k} ln(r_i / r_k) )^{−1}

Based on the concept of local intrinsic dimensionality [Houle, 2013] and its MLE estimator [Amsaleg et al., 2015]
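In code, the estimator is tiny (a sketch; assumes the distances are sorted ascending with r_1 < r_k):

#include <cmath>
#include <cstddef>
#include <vector>

// MLE estimate of local intrinsic dimensionality [Amsaleg et al., 2015]:
// D(q) = ( -(1/k) * sum_{i=1..k} ln(r_i / r_k) )^(-1),
// where r_1 <= ... <= r_k are the query's k nearest-neighbor distances.
double lid_estimate(const std::vector<double>& r) {
    const std::size_t k = r.size();
    double sum = 0.0;
    for (std::size_t i = 0; i < k; ++i)
        sum += std::log(r[i] / r[k - 1]);
    return -1.0 * double(k) / sum;  // sum < 0, so the estimate is positive
}

The larger D(q), the less the neighbor distances differ from each other, i.e., the more difficult the query.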

SLIDE 11

LID Distribution

SLIDE 12

Results (GloVe, 10-NN, 1.2M points)

http://ann-benchmarks.com/sisap19/faiss-ivf.html
[Plots: Easy / Middle / Difficult query workloads]

SLIDE 13

  • 2. STATE-OF-THE-ART NEAREST NEIGHBOR SEARCH

SLIDE 14

General Pipeline

Index generates candidates → Brute-force search on candidates

SLIDE 15

Brute-force search

Scan over points p_1, …, p_n. GloVe: 1.2M points, inner product as distance measure, 400 bytes per vector. Automatically SIMD-vectorized with clang -O3: https://godbolt.org/z/TJX68s

  • 100ms per scan
  • 4.2 GB/s throughput
  • CPU-bound

SLIDE 16

Manual vectorization (256 bit registers)

๐‘ฆ ๐‘ง โ€ฆ โ€ฆ Parallel multiply Parallel add to result register Horizontal sum and cast to float

  • 25 ms per query
  • 16 GB/s
  • 16.5 GB/s single-thread max on my laptop
  • Memory-bound

https://gist.github.com/maumueller/720d0f71664bef694bd56b2aeff80b17
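The gist above contains the version used for the talk; the following is a minimal stand-alone sketch of the same idea (an assumption-laden example: x86-64 with AVX, d divisible by 8, compile with -mavx):

#include <immintrin.h>

// Inner product with 256-bit registers: parallel multiply, parallel add
// into a result register, then a horizontal sum, as sketched on the slide.
float dot_avx(const float* x, const float* y, int d) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < d; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);               // 8 floats of x
        __m256 vy = _mm256_loadu_ps(y + i);               // 8 floats of y
        acc = _mm256_add_ps(acc, _mm256_mul_ps(vx, vy));  // multiply + add
    }
    // Horizontal sum of the 8 lanes.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s = _mm_add_ps(lo, hi);  // 4 partial sums
    s = _mm_hadd_ps(s, s);          // 2 partial sums
    s = _mm_hadd_ps(s, s);          // total in lane 0
    return _mm_cvtss_f32(s);
}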

SLIDE 17

Brute-force on bit vectors

  • Another popular distance measure is Hamming distance
  • Number of positions in which two bit strings differ
  • Can be nicely packed into 64-bit words
  • Hamming distance of two words is just bitcount of the XOR
  • 1.3 ms per query (128 bits)
  • 6 GB/s throughput
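In C++ this is essentially a one-liner per word (a sketch using the GCC/Clang popcount builtin; C++20's std::popcount works equally well):

#include <cstdint>

// Hamming distance of two 128-bit vectors packed into 64-bit words:
// XOR marks the differing positions, popcount counts them.
int hamming128(const std::uint64_t a[2], const std::uint64_t b[2]) {
    return __builtin_popcountll(a[0] ^ b[0]) + __builtin_popcountll(a[1] ^ b[1]);
}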

SLIDE 18

Sketching to avoid distance computations

  • Distance computations on bit vectors are faster than Euclidean distance/inner product
  • Their number can be reduced by storing compact sketch representations

Sketch representations: SimHash [Charikar, 2002], 1-Bit MinHash [König-Li, 2010]. Before computing dist(q, x), compare the sketches of q and x: at least t collisions? Yes → compute dist(q, x); no → skip x. Set t such that with probability at least 1 − δ we don't disregard a point that could be among the NN. Easy to analyze: the number of collisions is a sum of Bernoulli trials with Pr(X_i = 1) = f(dist(q, x)). Can the distance computation be avoided entirely? [Christiani, 2019]

SLIDE 19

General Pipeline

Index generates candidates → Brute-force search on candidates

SLIDE 20

PUFFINN: PARAMETERLESS AND UNIVERSALLY FAST FINDING OF NEAREST NEIGHBORS

[A., Christiani, Pagh, Vesterli, 2019] https://github.com/puffinn/puffinn Credit: Richard Bartz

SLIDE 21

How does it work?

Locality-Sensitive Hashing (LSH) [Indyk-Motwani, 1998]

h(p) = h_1(p) ∘ h_2(p) ∘ h_3(p) ∈ {0,1}^3

A family ℋ of hash functions is locality-sensitive if the collision probability of two points decreases with their distance to each other.
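As a concrete example, a minimal SimHash family [Charikar, 2002] (an illustrative sketch; assumes K ≤ 32): each h_i tests the side of a random hyperplane, and K such bits are concatenated into the code h(p) ∈ {0,1}^K.

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// SimHash: h_i(p) = [ <a_i, p> >= 0 ] for a random Gaussian vector a_i.
// The smaller the angle between two points, the likelier each bit agrees.
struct SimHash {
    std::vector<std::vector<float>> planes;  // K random hyperplanes

    SimHash(int K, int d, std::uint32_t seed)
        : planes(K, std::vector<float>(d)) {
        std::mt19937 rng(seed);
        std::normal_distribution<float> g(0.0f, 1.0f);
        for (auto& a : planes)
            for (float& v : a) v = g(rng);
    }

    // Concatenated code h(p) = h_1(p) ∘ ... ∘ h_K(p), returned as a bucket index.
    std::uint32_t operator()(const std::vector<float>& p) const {
        std::uint32_t code = 0;
        for (const auto& a : planes) {
            float dot = 0.0f;
            for (std::size_t j = 0; j < p.size(); ++j) dot += a[j] * p[j];
            code = (code << 1) | (dot >= 0.0f ? 1u : 0u);
        }
        return code;
    }
};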

SLIDE 22

Solving k-NN using LSH (with failure prob. δ)

[Figure: L repetitions h_1, h_2, …, h_L built over dataset S]

Termination: If (1 − p)^j ≤ δ after inspecting j repetitions, report the current top-k, where p is the probability of the current k-th nearest neighbor to collide. Not terminated? Decrease the code length K!
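A sketch of this adaptive query loop (my reading of the slide, not PUFFINN's actual code; inspect_repetition, top_k_dist, collision_prob, and decrease_depth are hypothetical names):

#include <cmath>

// Inspect repetitions one by one; stop as soon as the current top-k is
// correct with probability at least 1 - delta, otherwise fall back to a
// smaller code length K (shorter codes collide more often).
template <typename Index>
void query_adaptive(Index& index, int L, double delta) {
    while (true) {
        for (int j = 1; j <= L; ++j) {
            index.inspect_repetition(j);  // merge candidates into current top-k
            double p = index.collision_prob(index.top_k_dist());
            if (std::pow(1.0 - p, j) <= delta)
                return;  // failure probability small enough: report top-k
        }
        index.decrease_depth();  // not terminated: decrease K and continue
    }
}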

SLIDE 23

The Data Structure

Theoretical

  • LSH Forest: Each repetition is a trie built from LSH hash values [Bawa et al., 2005]

Practical

  • Store indices of data set points sorted by hash code
  • “Traversing the trie” by binary search
  • use a lookup table for the first levels
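A sketch of the practical variant (hypothetical code, one repetition; assumes sorted entries and 1 ≤ depth ≤ total_bits): after sorting (hash code, point id) pairs once, all points sharing a given code prefix form a contiguous range found by two binary searches, and decreasing the depth only widens the range.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Entry = std::pair<std::uint32_t, std::uint32_t>;  // (hash code, point id)

// Range of points whose code agrees with `code` on the top `depth` bits.
std::pair<std::size_t, std::size_t>
prefix_range(const std::vector<Entry>& sorted, std::uint32_t code,
             int total_bits, int depth) {
    const int shift = total_bits - depth;
    const std::uint32_t lo = (code >> shift) << shift;  // smallest matching code
    const std::uint32_t hi = lo | ((std::uint32_t(1) << shift) - 1);  // largest
    auto first = std::lower_bound(sorted.begin(), sorted.end(), Entry{lo, 0});
    auto last = std::upper_bound(sorted.begin(), sorted.end(), Entry{hi, UINT32_MAX});
    return {std::size_t(first - sorted.begin()), std::size_t(last - sorted.begin())};
}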

SLIDE 24

Overall System Design

SLIDE 25

Running time (GloVe 100d, 1.2M, 10-NN)

SLIDE 26

A difficult (?) data set in ℝ^{3d}

n data points:
x_1 = (0^d, y_1, z_1), …, x_{n−1} = (0^d, y_{n−1}, z_{n−1}), x_n = (v, w, 0^d)

m query points:
q_1 = (v, 0^d, r_1), …, q_m = (v, 0^d, r_m)

with y_i, z_i, v, w, r_j ∼ N_d(0, 1/(2d))

SLIDE 27

Running time (“Difficult”, 1M, 10-NN)

SLIDE 28

Graph-based Similarity Search

SLIDE 29

Building a Small World Graph

SLIDE 30

Refining a Small World Graph

Goal: Keep out-degree as small as possible (while maintaining “large-enough” in-degree)!

HNSW/ONNG: [Malkov et al., 2020], [Iwasaki et al., 2018]

SLIDE 31

Running time (GloVe 100d, 1.2M, 10-NN)

SLIDE 32

Open Problems: Nearest Neighbor Search

  • Data-dependent LSH with guarantees?
  • Theoretically sound small-world graphs?
  • Multi-core implementations
  • Good? [Malkov et al., 2020]
  • Alternative ways of sketching data?

SLIDE 33

  • 3. Similarity Search on the GPU, in External Memory, and in Distributed Settings

SLIDE 34

Nearest Neighbors on the GPU: FAISS

[Johnson et al., 2017] https://github.com/facebookresearch/faiss

  • GPU setting
  • Data structure is held in GPU memory
  • Queries arrive in batches of, say, 10,000 queries at a time
  • Results:
  • http://ann-benchmarks.com/sift-128-euclidean_10_euclidean-batch.html

SLIDE 35

FAISS/2

  • Data structure
  • Run k-means with a large number of centroids
  • Each data point is associated with its closest centroid
  • Query
  • Find the L closest centroids
  • Return the k closest points found among the points associated with these centroids

https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
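A compact sketch of this inverted-file idea (illustrative only, not FAISS's actual API; assumes L ≤ number of centroids and k ≤ number of gathered candidates):

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct IVFIndex {
    std::vector<std::vector<float>> centroids;    // from k-means
    std::vector<std::vector<std::size_t>> lists;  // lists[c]: points closest to centroid c
    std::vector<std::vector<float>> data;         // original vectors

    static float dist2(const std::vector<float>& a, const std::vector<float>& b) {
        float s = 0.0f;
        for (std::size_t i = 0; i < a.size(); ++i) {
            float t = a[i] - b[i];
            s += t * t;
        }
        return s;
    }

    // Probe the lists of the L closest centroids, then brute-force the candidates.
    std::vector<std::size_t> query(const std::vector<float>& q,
                                   std::size_t L, std::size_t k) const {
        std::vector<std::pair<float, std::size_t>> order;
        for (std::size_t c = 0; c < centroids.size(); ++c)
            order.push_back({dist2(centroids[c], q), c});
        std::partial_sort(order.begin(), order.begin() + L, order.end());

        std::vector<std::size_t> cand;
        for (std::size_t i = 0; i < L; ++i)
            for (std::size_t id : lists[order[i].second]) cand.push_back(id);

        std::partial_sort(cand.begin(), cand.begin() + k, cand.end(),
                          [&](std::size_t a, std::size_t b) {
                              return dist2(data[a], q) < dist2(data[b], q);
                          });
        cand.resize(k);
        return cand;
    }
};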

SLIDE 36

Nearest Neighbors on the GPU: GGNN

[Groh et al., 2019]

SLIDE 37

Nearest Neighbors in External Memory

[Subramanya et al., 2019]

RAM: compressed vectors x̃_1, …, x̃_n (32 bytes per vector, via Product Quantization)
SSD: original vectors x_1, …, x_n (~400 bytes per vector)

SLIDE 38

Distributed Setting: Similarity Join

  • Problem
  • given sets R and S of size n,
  • and similarity threshold λ, compute

R ⋈_λ S = { (x, y) ∈ R × S | sim(x, y) ≥ λ }

  • Similarity measures
  • Jaccard similarity
  • Cosine similarity
  • Naive: O(n²) distance computations

SLIDE 39

Map-Reduce-based Similarity Join

[Plot: single core on a Xeon E5-2630v2 (2.60 GHz) vs. Hadoop cluster (12 nodes, 24 hyperthreads per node); Fier et al., 2018]

Scalability! But at what COST? [McSherry et al., 2015]

[Mann et al., 2016]

SLIDE 40

Solved almost-optimally in the MPC model

[Hu et al., 2019]

Hash R and S using LSH: map x ∈ R to (x, h_i(x)) and y ∈ S to (y, h_i(y)). Join on the hash value to get candidate pairs (x, y, h_i(x)). Similarity at least λ? Emit (x, y). O(n²) local work for distance computations!
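In centralized pseudocode, one round of this pipeline looks roughly as follows (a sketch of the idea only, not Hu et al.'s MPC algorithm; lsh and sim are assumed function arguments):

#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Record = std::vector<int>;
using Pair = std::pair<std::size_t, std::size_t>;

// "Map": tag every record with its hash; "join on hash": group by hash value;
// verify: compute the similarity only inside buckets. A skewed bucket can
// still cost O(n^2) local distance computations.
std::vector<Pair> lsh_join_round(const std::vector<Record>& R,
                                 const std::vector<Record>& S, double lambda,
                                 std::uint32_t (*lsh)(const Record&),
                                 double (*sim)(const Record&, const Record&)) {
    std::map<std::uint32_t, std::pair<std::vector<std::size_t>,
                                      std::vector<std::size_t>>> buckets;
    for (std::size_t i = 0; i < R.size(); ++i) buckets[lsh(R[i])].first.push_back(i);
    for (std::size_t j = 0; j < S.size(); ++j) buckets[lsh(S[j])].second.push_back(j);

    std::vector<Pair> out;
    for (const auto& kv : buckets)
        for (std::size_t i : kv.second.first)
            for (std::size_t j : kv.second.second)
                if (sim(R[i], S[j]) >= lambda) out.push_back({i, j});
    return out;
}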

SLIDE 41

Another approach: DANNY

[A., Ceccarello, Pagh, 2020] In preparation, https://github.com/cecca/danny

Partition the Cartesian product of R and S; locally: LSH + sketching, candidate verification; emit/collect. Implementation in Rust using timely dataflow: https://github.com/TimelyDataflow/timely-dataflow

SLIDE 42

Results

SLIDE 43

Roadmap

Similarity Search in High-Dimensions: Setup/Experimental Approach

01

Survey of state-of-the-art Nearest Neighbor Search algorithms

02

Similarity Search on the GPU, in external memory, and in distributed settings

03

SLIDE 44

References

  • [Amsaleg, Chelly, Furon, Girard, Houle, Kawarabayashi, Nett, 2015]: Estimating local intrinsic dimensionality. KDD 2015.
  • [A., Bernhardsson, Faithfull, 2020]: ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. Inf. Syst. 87 (2020), see https://arxiv.org/abs/1807.05614 for an open access version.

  • [A., Ceccarello, 2019]: The role of local intrinsic dimensionality in benchmarking nearest neighbor search. In: SISAP 2019, see http://ann-benchmarks.com/sisap19/.
  • [A., Christiani, Pagh, Vesterli, 2019]: PUFFINN: parameterless and universally fast finding of nearest neighbors. In: ESA 2019, see https://github.com/puffinn/puffinn
  • [Christiani, 2019]: Fast locality-sensitive hashing frameworks for approximate near neighbor search. SISAP 2019.
  • [Fier, Augsten, Bouros, Leser, Freytag, 2018]: Set similarity joins on MapReduce: An experimental survey. VLDB 2018.
  • [Groh, Ruppert, Wieschollek, Lensch, 2019]: GGNN: Graph-based GPU nearest neighbor search https://arxiv.org/abs/1912.01059
  • [Houle, 2013]: Dimensionality, discriminability, density and distance distributions. ICDMW 2013.
  • [Hu, Yi, Tao, 2019]: Output-optimal massively parallel algorithms for similarity joins. ACM Transactions on Database Systems, 2019.
  • [Iwasaki, Miyazaki, 2018]: Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-dimensional Data, https://arxiv.org/abs/1810.07355
  • [Indyk, Motwani, 1998]: Approximate Nearest Neighbors: Towards removing the curse of dimensionality, STOC 1998.
  • [Johnson, Douze, Jegou, 2017]: Billion-scale similarity search with GPUs, https://arxiv.org/abs/1702.08734.
  • [Malkov, Yashunin, 2020]: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE TPAMI 2020.
  • [Mann, Augsten, Bouros, 2016]: An empirical evaluation of set similarity join techniques. VLDB 2016.
  • [McSherry, Isard, Murray, 2015]: Scalability! But at what COST? USENIX HotOS 2015.
  • [Subramanya, Devvrit, Kadekodi, Krishnaswamy, Simhadri, 2019]: DiskANN: Fast accurate billion-point nearest neighbor search on a single node. NeurIPS 2019

SLIDE 45

Extra slides

SLIDE 46

PUFFINN: Fast Hash Function Evaluation

  • Main bottleneck: computation of hash values
  • Adapt the “pooling” technique of [Dahlgaard et al., 2017] and [Christiani, 2019]

๐‘› ๐ฟ

Pick ๐ฟ hash functions in repetition ๐‘˜ using universal hash functions in each column. ๐ฟ โ‹… ๐‘› independent hash functions from LSH family, ๐‘› โ‰ช ๐‘€. Analysis using Cantelliโ€™s inequality โ†’ Requires different stopping criteria (factor 2 slowdown)
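A sketch of the selection step (hypothetical code; uses multiply-shift hashing as the universal family and assumes m is a power of two):

#include <cstdint>
#include <vector>

// Hash-function pooling: instead of K * L independent LSH functions, store
// only K pools of m functions each (m << L) and select pseudo-randomly.
struct HashPool {
    std::vector<std::vector<std::uint32_t>> pool;  // pool[c]: m LSH function ids
    std::vector<std::uint64_t> mult;               // random odd multiplier per column
    int log2m;                                     // m = 2^log2m, with log2m >= 1

    // Which pooled function does column c use in repetition j?
    // Multiply-shift universal hashing: (a * j mod 2^64) >> (64 - log2m).
    std::uint32_t select(std::uint32_t c, std::uint64_t j) const {
        std::uint64_t idx = (mult[c] * j) >> (64 - log2m);
        return pool[c][idx];
    }
};

The price named on the slide: selections are no longer independent across repetitions, so the adaptive stopping rule has to be more conservative (Cantelli's inequality, factor-2 slowdown).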
