

SLIDE 1

Algorithm Engineering for High-Dimensional Similarity Search Problems

Martin Aumüller, IT University of Copenhagen

SLIDE 2

Roadmap

Similarity Search in High-Dimensions: Setup/Experimental Approach

01

Survey of state-of-the-art Nearest Neighbor Search algorithms

02

Similarity Search on the GPU, in external memory, and in distributed settings

03

SLIDE 3

  • 1. Similarity Search in High-Dimensions: Setup/Experimental Approach

SLIDE 4

k-Nearest Neighbor Problem

  • Preprocessing: Build a data structure for a set S of n data points
  • Task: Given a query point q, return the k closest points to q in S
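To make the task concrete, here is a minimal brute-force baseline in C++ (an illustrative sketch, not code from the talk): scan all n points and keep the k closest in a max-heap. The index structures surveyed later exist to beat exactly this loop.

#include <algorithm>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Squared Euclidean distance between two d-dimensional points.
float dist2(const std::vector<float>& a, const std::vector<float>& b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        float diff = a[i] - b[i];
        s += diff * diff;
    }
    return s;
}

// Return the indices of the k points of S closest to the query q.
std::vector<std::size_t> knn_bruteforce(const std::vector<std::vector<float>>& S,
                                        const std::vector<float>& q, std::size_t k) {
    // Max-heap on (distance, index); the root is the worst of the current top-k.
    std::priority_queue<std::pair<float, std::size_t>> heap;
    for (std::size_t i = 0; i < S.size(); ++i) {
        float d = dist2(S[i], q);
        if (heap.size() < k) heap.push({d, i});
        else if (d < heap.top().first) { heap.pop(); heap.push({d, i}); }
    }
    std::vector<std::size_t> res;
    while (!heap.empty()) { res.push_back(heap.top().second); heap.pop(); }
    std::reverse(res.begin(), res.end());  // closest first
    return res;
}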

SLIDE 5

Nearest neighbor search on words

  • GloVe: learning algorithm to find vector representations for words
  • GloVe.twitter dataset: 1.2M words, vectors trained from 2B tweets, 100 dimensions

  • Semantically similar words: nearest neighbor search on vectors

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/

SLIDE 6

$ grep -n "sicily" glove.twitter.27B.100d.txt
118340:sicily -0.43731 -1.1003 0.93183 0.13311 0.17207 …

GloVe Examples

“sicily”
  • sardinia
  • tuscany
  • dubrovnik
  • liguria
  • naples

“algorithm”
  • algorithms
  • optimization
  • approximation
  • iterative
  • computation

“engineering”
  • engineer
  • accounting
  • research
  • science
  • development

SLIDE 7

Basic Setup

  • Data is described by high-dimensional feature vectors
  • Exact similarity search is difficult in high dimensions
  • data structures and algorithms suffer an exponential dependence on dimensionality, in time, space, or both

SLIDE 8

Why is Exact NN difficult?

  • Choose n random points from N(0, 1/d)^d, for large d
  • Choose a random query point
  • nearest and furthest neighbor end up at basically the same distance: all pairwise squared distances concentrate around 2 ± 1/√d
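A quick experiment makes the concentration visible (an illustrative sketch; parameters are arbitrary). For d in the hundreds, the nearest and furthest squared distances from the query already agree to within a few percent.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int n = 1000, d = 1000;
    std::mt19937 rng(42);
    // N(0, 1/d): standard deviation 1/sqrt(d), so E[squared distance] = 2.
    std::normal_distribution<double> coord(0.0, 1.0 / std::sqrt(double(d)));
    auto sample = [&] {
        std::vector<double> v(d);
        for (double& x : v) x = coord(rng);
        return v;
    };

    std::vector<std::vector<double>> points(n);
    for (auto& p : points) p = sample();
    const std::vector<double> q = sample();

    double dmin = 1e300, dmax = 0.0;
    for (const auto& p : points) {
        double s = 0.0;
        for (int i = 0; i < d; ++i) { double t = p[i] - q[i]; s += t * t; }
        dmin = std::min(dmin, s);
        dmax = std::max(dmax, s);
    }
    // Both values land in a band of width O(1/sqrt(d)) around 2.
    std::printf("min = %.4f, max = %.4f\n", dmin, dmax);
}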

SLIDE 9

Performance

  • on GloVe

SLIDE 10

Difficulty measure for queries

  • Given query q and distances r_1, …, r_k to its k nearest neighbors, define

D(q) = ( −(1/k) · Σ_{i=1}^{k} ln(r_i / r_k) )^{−1}

Based on the concept of local intrinsic dimensionality [Houle, 2013] and its MLE estimator [Amsaleg et al., 2015]
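In code, the estimator is tiny (a sketch; assumes the distances are sorted ascending with r_1 < r_k):

#include <cmath>
#include <cstddef>
#include <vector>

// MLE estimate of local intrinsic dimensionality [Amsaleg et al., 2015]:
// D(q) = ( -(1/k) * sum_{i=1..k} ln(r_i / r_k) )^(-1),
// where r_1 <= ... <= r_k are the query's k nearest-neighbor distances.
double lid_estimate(const std::vector<double>& r) {
    const std::size_t k = r.size();
    double sum = 0.0;
    for (std::size_t i = 0; i < k; ++i)
        sum += std::log(r[i] / r[k - 1]);
    return -1.0 * double(k) / sum;  // sum < 0, so the estimate is positive
}

The larger D(q), the less the neighbor distances differ from each other, i.e., the more difficult the query.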

SLIDE 11

LID Distribution

SLIDE 12

Results (GloVe, 10-NN, 1.2M points)

http://ann-benchmarks.com/sisap19/faiss-ivf.html
[Plots: Easy / Middle / Difficult query workloads]

SLIDE 13

  • 2. STATE-OF-THE-ART NEAREST NEIGHBOR SEARCH

SLIDE 14

General Pipeline

Index generates candidates → Brute-force search on candidates

SLIDE 15

Brute-force search

Scan over points p_1, …, p_n. GloVe: 1.2M points, inner product as distance measure, 400 bytes per vector. Automatically SIMD-vectorized with clang -O3: https://godbolt.org/z/TJX68s

  • 100ms per scan
  • 4.2 GB/s throughput
  • CPU-bound

SLIDE 16

Manual vectorization (256 bit registers)

๐‘ฆ ๐‘ง โ€ฆ โ€ฆ Parallel multiply Parallel add to result register Horizontal sum and cast to float

  • 25 ms per query
  • 16 GB/s
  • 16.5 GB/s single-thread max on my laptop
  • Memory-bound

https://gist.github.com/maumueller/720d0f71664bef694bd56b2aeff80b17
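The gist above contains the version used for the talk; the following is a minimal stand-alone sketch of the same idea (an assumption-laden example: x86-64 with AVX, d divisible by 8, compile with -mavx):

#include <immintrin.h>

// Inner product with 256-bit registers: parallel multiply, parallel add
// into a result register, then a horizontal sum, as sketched on the slide.
float dot_avx(const float* x, const float* y, int d) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < d; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);               // 8 floats of x
        __m256 vy = _mm256_loadu_ps(y + i);               // 8 floats of y
        acc = _mm256_add_ps(acc, _mm256_mul_ps(vx, vy));  // multiply + add
    }
    // Horizontal sum of the 8 lanes.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s = _mm_add_ps(lo, hi);  // 4 partial sums
    s = _mm_hadd_ps(s, s);          // 2 partial sums
    s = _mm_hadd_ps(s, s);          // total in lane 0
    return _mm_cvtss_f32(s);
}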

SLIDE 17

Brute-force on bit vectors

  • Another popular distance measure is Hamming distance
  • Number of positions in which two bit strings differ
  • Can be nicely packed into 64-bit words
  • Hamming distance of two words is just bitcount of the XOR
  • 1.3 ms per query (128 bits)
  • 6 GB/s throughput
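In C++ this is essentially a one-liner per word (a sketch using the GCC/Clang popcount builtin; C++20's std::popcount works equally well):

#include <cstdint>

// Hamming distance of two 128-bit vectors packed into 64-bit words:
// XOR marks the differing positions, popcount counts them.
int hamming128(const std::uint64_t a[2], const std::uint64_t b[2]) {
    return __builtin_popcountll(a[0] ^ b[0]) + __builtin_popcountll(a[1] ^ b[1]);
}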

SLIDE 18

Sketching to avoid distance computations

  • Distance computations on bit vectors are faster than Euclidean distance/inner product
  • Their number can be reduced by storing compact sketch representations

Sketch representations: SimHash [Charikar, 2002], 1-Bit MinHash [König-Li, 2010]. Before computing dist(q, x), compare the sketches of q and x: at least t collisions? Yes → compute dist(q, x); no → skip x. Set t such that with probability at least 1 − δ we don't disregard a point that could be among the NN. Easy to analyze: the number of collisions is a sum of Bernoulli trials with Pr(X_i = 1) = f(dist(q, x)). Can the distance computation be avoided entirely? [Christiani, 2019]

SLIDE 19

General Pipeline

Index generates candidates → Brute-force search on candidates

SLIDE 20

PUFFINN: PARAMETERLESS AND UNIVERSALLY FAST FINDING OF NEAREST NEIGHBORS

[A., Christiani, Pagh, Vesterli, 2019] https://github.com/puffinn/puffinn Credit: Richard Bartz

SLIDE 21

How does it work?

Locality-Sensitive Hashing (LSH) [Indyk-Motwani, 1998]

h(p) = h_1(p) ∘ h_2(p) ∘ h_3(p) ∈ {0,1}^3

A family ℋ of hash functions is locality-sensitive if the collision probability of two points decreases with their distance to each other.
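As a concrete example, a minimal SimHash family [Charikar, 2002] (an illustrative sketch; assumes K ≤ 32): each h_i tests the side of a random hyperplane, and K such bits are concatenated into the code h(p) ∈ {0,1}^K.

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// SimHash: h_i(p) = [ <a_i, p> >= 0 ] for a random Gaussian vector a_i.
// The smaller the angle between two points, the likelier each bit agrees.
struct SimHash {
    std::vector<std::vector<float>> planes;  // K random hyperplanes

    SimHash(int K, int d, std::uint32_t seed)
        : planes(K, std::vector<float>(d)) {
        std::mt19937 rng(seed);
        std::normal_distribution<float> g(0.0f, 1.0f);
        for (auto& a : planes)
            for (float& v : a) v = g(rng);
    }

    // Concatenated code h(p) = h_1(p) ∘ ... ∘ h_K(p), returned as a bucket index.
    std::uint32_t operator()(const std::vector<float>& p) const {
        std::uint32_t code = 0;
        for (const auto& a : planes) {
            float dot = 0.0f;
            for (std::size_t j = 0; j < p.size(); ++j) dot += a[j] * p[j];
            code = (code << 1) | (dot >= 0.0f ? 1u : 0u);
        }
        return code;
    }
};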

SLIDE 22

Solving k-NN using LSH (with failure prob. δ)

[Figure: L repetitions h_1, h_2, …, h_L built over dataset S]

Termination: If (1 − p)^j ≤ δ after inspecting j repetitions, report the current top-k, where p is the probability of the current k-th nearest neighbor to collide. Not terminated? Decrease the code length K!
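A sketch of this adaptive query loop (my reading of the slide, not PUFFINN's actual code; inspect_repetition, top_k_dist, collision_prob, and decrease_depth are hypothetical names):

#include <cmath>

// Inspect repetitions one by one; stop as soon as the current top-k is
// correct with probability at least 1 - delta, otherwise fall back to a
// smaller code length K (shorter codes collide more often).
template <typename Index>
void query_adaptive(Index& index, int L, double delta) {
    while (true) {
        for (int j = 1; j <= L; ++j) {
            index.inspect_repetition(j);  // merge candidates into current top-k
            double p = index.collision_prob(index.top_k_dist());
            if (std::pow(1.0 - p, j) <= delta)
                return;  // failure probability small enough: report top-k
        }
        index.decrease_depth();  // not terminated: decrease K and continue
    }
}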

SLIDE 23

The Data Structure

Theoretical

  • LSH Forest: Each repetition is a trie built from LSH hash values [Bawa et al., 2005]

Practical

  • Store indices of data set points sorted by hash code
  • “Traversing the trie” by binary search
  • use a lookup table for the first levels
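A sketch of the practical variant (hypothetical code, one repetition; assumes sorted entries and 1 ≤ depth ≤ total_bits): after sorting (hash code, point id) pairs once, all points sharing a given code prefix form a contiguous range found by two binary searches, and decreasing the depth only widens the range.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Entry = std::pair<std::uint32_t, std::uint32_t>;  // (hash code, point id)

// Range of points whose code agrees with `code` on the top `depth` bits.
std::pair<std::size_t, std::size_t>
prefix_range(const std::vector<Entry>& sorted, std::uint32_t code,
             int total_bits, int depth) {
    const int shift = total_bits - depth;
    const std::uint32_t lo = (code >> shift) << shift;  // smallest matching code
    const std::uint32_t hi = lo | ((std::uint32_t(1) << shift) - 1);  // largest
    auto first = std::lower_bound(sorted.begin(), sorted.end(), Entry{lo, 0});
    auto last = std::upper_bound(sorted.begin(), sorted.end(), Entry{hi, UINT32_MAX});
    return {std::size_t(first - sorted.begin()), std::size_t(last - sorted.begin())};
}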

SLIDE 24

Overall System Design

SLIDE 25

Running time (GloVe 100d, 1.2M, 10-NN)

SLIDE 26

A difficult (?) data set in ℝ^{3d}

n data points:
x_1 = (0^d, y_1, z_1), …, x_{n−1} = (0^d, y_{n−1}, z_{n−1}), x_n = (v, w, 0^d)

m query points:
q_1 = (v, 0^d, r_1), …, q_m = (v, 0^d, r_m)

with y_i, z_i, v, w, r_j ∼ N_d(0, 1/(2d))

SLIDE 27

Running time (“Difficult”, 1M, 10-NN)

SLIDE 28

Graph-based Similarity Search

SLIDE 29

Building a Small World Graph

SLIDE 30

Refining a Small World Graph

Goal: Keep out-degree as small as possible (while maintaining “large-enough” in-degree)!

HNSW/ONNG: [Malkov et al., 2020], [Iwasaki et al., 2018]

SLIDE 31

Running time (GloVe 100d, 1.2M, 10-NN)

SLIDE 32

Open Problems: Nearest Neighbor Search

  • Data-dependent LSH with guarantees?
  • Theoretically sound small-world graphs?
  • Multi-core implementations
  • Good? [Malkov et al., 2020]
  • Alternative ways of sketching data?

SLIDE 33

  • 3. Similarity Search on the GPU, in External Memory, and in Distributed Settings

SLIDE 34

Nearest Neighbors on the GPU: FAISS

[Johnson et al., 2017] https://github.com/facebookresearch/faiss

  • GPU setting
  • Data structure is held in GPU memory
  • Queries arrive in batches of, say, 10,000 queries at a time
  • Results:
  • http://ann-benchmarks.com/sift-128-euclidean_10_euclidean-batch.html

SLIDE 35

FAISS/2

  • Data structure
  • Run k-means with a large number of centroids
  • Each data point is associated with its closest centroid
  • Query
  • Find the L closest centroids
  • Return the k closest points found among the points associated with these centroids

https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
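A compact sketch of this inverted-file idea (illustrative only, not FAISS's actual API; assumes L ≤ number of centroids and k ≤ number of gathered candidates):

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct IVFIndex {
    std::vector<std::vector<float>> centroids;    // from k-means
    std::vector<std::vector<std::size_t>> lists;  // lists[c]: points closest to centroid c
    std::vector<std::vector<float>> data;         // original vectors

    static float dist2(const std::vector<float>& a, const std::vector<float>& b) {
        float s = 0.0f;
        for (std::size_t i = 0; i < a.size(); ++i) {
            float t = a[i] - b[i];
            s += t * t;
        }
        return s;
    }

    // Probe the lists of the L closest centroids, then brute-force the candidates.
    std::vector<std::size_t> query(const std::vector<float>& q,
                                   std::size_t L, std::size_t k) const {
        std::vector<std::pair<float, std::size_t>> order;
        for (std::size_t c = 0; c < centroids.size(); ++c)
            order.push_back({dist2(centroids[c], q), c});
        std::partial_sort(order.begin(), order.begin() + L, order.end());

        std::vector<std::size_t> cand;
        for (std::size_t i = 0; i < L; ++i)
            for (std::size_t id : lists[order[i].second]) cand.push_back(id);

        std::partial_sort(cand.begin(), cand.begin() + k, cand.end(),
                          [&](std::size_t a, std::size_t b) {
                              return dist2(data[a], q) < dist2(data[b], q);
                          });
        cand.resize(k);
        return cand;
    }
};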

SLIDE 36

Nearest Neighbors on the GPU: GGNN

[Groh et al., 2019]

SLIDE 37

Nearest Neighbors in External Memory

[Subramanya et al., 2019]

RAM: compressed vectors x̃_1, …, x̃_n (32 bytes per vector, via Product Quantization)
SSD: original vectors x_1, …, x_n (~400 bytes per vector)

SLIDE 38

Distributed Setting: Similarity Join

  • Problem
  • given sets R and S of size n,
  • and similarity threshold λ, compute

R ⋈_λ S = { (x, y) ∈ R × S | sim(x, y) ≥ λ }

  • Similarity measures
  • Jaccard similarity
  • Cosine similarity
  • Naive: O(n²) distance computations

SLIDE 39

Map-Reduce-based Similarity Join

[Plot: single core on a Xeon E5-2630v2 (2.60 GHz) vs. Hadoop cluster (12 nodes, 24 hyperthreads per node); Fier et al., 2018]

Scalability! But at what COST? [McSherry et al., 2015]

[Mann et al., 2016]

SLIDE 40

Solved almost-optimally in the MPC model

[Hu et al., 2019]

Hash R and S using LSH: map x ∈ R to (x, h_i(x)) and y ∈ S to (y, h_i(y)). Join on the hash value to get candidate pairs (x, y, h_i(x)). Similarity at least λ? Emit (x, y). O(n²) local work for distance computations!
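In centralized pseudocode, one round of this pipeline looks roughly as follows (a sketch of the idea only, not Hu et al.'s MPC algorithm; lsh and sim are assumed function arguments):

#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Record = std::vector<int>;
using Pair = std::pair<std::size_t, std::size_t>;

// "Map": tag every record with its hash; "join on hash": group by hash value;
// verify: compute the similarity only inside buckets. A skewed bucket can
// still cost O(n^2) local distance computations.
std::vector<Pair> lsh_join_round(const std::vector<Record>& R,
                                 const std::vector<Record>& S, double lambda,
                                 std::uint32_t (*lsh)(const Record&),
                                 double (*sim)(const Record&, const Record&)) {
    std::map<std::uint32_t, std::pair<std::vector<std::size_t>,
                                      std::vector<std::size_t>>> buckets;
    for (std::size_t i = 0; i < R.size(); ++i) buckets[lsh(R[i])].first.push_back(i);
    for (std::size_t j = 0; j < S.size(); ++j) buckets[lsh(S[j])].second.push_back(j);

    std::vector<Pair> out;
    for (const auto& kv : buckets)
        for (std::size_t i : kv.second.first)
            for (std::size_t j : kv.second.second)
                if (sim(R[i], S[j]) >= lambda) out.push_back({i, j});
    return out;
}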

SLIDE 41

Another approach: DANNY

[A., Ceccarello, Pagh, 2020] In preparation, https://github.com/cecca/danny

Partition the Cartesian product of R and S; locally: LSH + sketching, candidate verification; emit/collect. Implementation in Rust using timely dataflow: https://github.com/TimelyDataflow/timely-dataflow

SLIDE 42

Results

SLIDE 43

Roadmap

Similarity Search in High-Dimensions: Setup/Experimental Approach

01

Survey of state-of-the-art Nearest Neighbor Search algorithms

02

Similarity Search on the GPU, in external memory, and in distributed settings

03

SLIDE 44

References

  • [Amsaleg, Chelly, Furon, Girard, Houle, Kawarabayashi, Nett, 2015]: Estimating local intrinsic dimensionality. KDD 2015.
  • [A., Bernhardsson, Faithfull, 2020]: ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. Inf. Syst. 87 (2020), see https://arxiv.org/abs/1807.05614 for an open access version.

  • [A., Ceccarello, 2019]: The role of local intrinsic dimensionality in benchmarking nearest neighbor search. In: SISAP 2019, see http://ann-benchmarks.com/sisap19/.
  • [A., Christiani, Pagh, Vesterli, 2019]: PUFFINN: parameterless and universally fast finding of nearest neighbors. In: ESA 2019, see https://github.com/puffinn/puffinn
  • [Christiani, 2019]: Fast locality-sensitive hashing frameworks for approximate near neighbor search. SISAP 2019.
  • [Fier, Augsten, Bouros, Leser, Freytag, 2018]: Set similarity joins on MapReduce: An experimental survey. VLDB 2018.
  • [Groh, Ruppert, Wieschollek, Lensch, 2019]: GGNN: Graph-based GPU nearest neighbor search https://arxiv.org/abs/1912.01059
  • [Houle, 2013]: Dimensionality, discriminability, density and distance distributions. ICDMW 2013.
  • [Hu, Yi, Tao, 2019]: Output-optimal massively parallel algorithms for similarity joins. ACM Transactions on Database Systems, 2019.
  • [Iwasaki, Miyazaki, 2018]: Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-dimensional Data, https://arxiv.org/abs/1810.07355
  • [Indyk, Motwani, 1998]: Approximate Nearest Neighbors: Towards removing the curse of dimensionality, STOC 1998.
  • [Johnson, Douze, Jegou, 2017]: Billion-scale similarity search with GPUs, https://arxiv.org/abs/1702.08734.
  • [Malkov, Yashunin, 2020]: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE TPAMI 2020.
  • [Mann, Augsten, Bouros, 2016]: An empirical evaluation of set similarity join techniques. VLDB 2016.
  • [McSherry, Isard, Murray, 2015]: Scalability! But at what COST? USENIX HotOS 2015.
  • [Subramanya, Devvrit, Kadekodi, Krishnaswamy, Simhadri, 2019]: DiskANN: Fast accurate billion-point nearest neighbor search on a single node. NeurIPS 2019

SLIDE 45

Extra slides

SLIDE 46

PUFFINN: Fast Hash Function Evaluation

  • Main bottleneck: computation of hash values
  • Adapt the “pooling” technique of [Dahlgaard et al., 2017] and [Christiani, 2019]

๐‘› ๐ฟ

Pick ๐ฟ hash functions in repetition ๐‘˜ using universal hash functions in each column. ๐ฟ โ‹… ๐‘› independent hash functions from LSH family, ๐‘› โ‰ช ๐‘€. Analysis using Cantelliโ€™s inequality โ†’ Requires different stopping criteria (factor 2 slowdown)
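A sketch of the selection step (hypothetical code; uses multiply-shift hashing as the universal family and assumes m is a power of two):

#include <cstdint>
#include <vector>

// Hash-function pooling: instead of K * L independent LSH functions, store
// only K pools of m functions each (m << L) and select pseudo-randomly.
struct HashPool {
    std::vector<std::vector<std::uint32_t>> pool;  // pool[c]: m LSH function ids
    std::vector<std::uint64_t> mult;               // random odd multiplier per column
    int log2m;                                     // m = 2^log2m, with log2m >= 1

    // Which pooled function does column c use in repetition j?
    // Multiply-shift universal hashing: (a * j mod 2^64) >> (64 - log2m).
    std::uint32_t select(std::uint32_t c, std::uint64_t j) const {
        std::uint64_t idx = (mult[c] * j) >> (64 - log2m);
        return pool[c][idx];
    }
};

The price named on the slide: selections are no longer independent across repetitions, so the adaptive stopping rule has to be more conservative (Cantelli's inequality, factor-2 slowdown).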
