Needle in a Haystack: Searching for Approximate k-Nearest Neighbours in High-Dimensional Data (PowerPoint presentation)



SLIDE 1

Needle in a Haystack: Searching for Approximate k- Nearest Neighbours in High-Dimensional Data

Liang Wang*, Ville Hyvönen, Teemu Pitkänen, Sotiris Tasoulis, Teemu Roos, and Jukka Corander University of Cambridge*, UK University of Helsinki, Finland

SLIDE 2

Not Only Tall, But Also Very Fat

  • Data grow in both volume and dimensionality, driven by advances in technology and modelling techniques.

○ Advances in measuring and monitoring tools.
○ Advances in computation and storage technologies.
○ DNA, stock markets, language models: inherently high-dimensional.

  • Why do high-dimensional data matter?

○ It is hard to tell at the outset which information matters.
○ Save everything and leave the problem for later, or for someone else.

SLIDE 3

Searching Needle(s) in a Haystack

  • Searching is among the most important operations.

○ E.g., computer vision, pattern recognition, natural language processing, online recommenders, etc.

  • Searching is difficult in high-dimensional data. Why?

○ “Under rather general conditions, given a query point, the distance between the nearest and farthest points does not increase as fast as dimensionality.”
○ k-NN quickly becomes unstable in high-dimensional spaces.

  • K. Beyer et al., “When is ‘nearest neighbor’ meaningful?”, Database Theory - ICDT’99, Springer, 1999.
SLIDE 4

Key Technique - Approximation

  • Approximate the original data set with another one of lower dimensionality by “tolerating some error”, i.e., dimensionality reduction (e.g., SVD, random projection, etc.).
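The dimensionality-reduction idea can be sketched with a random projection, one of the techniques the deck is built on (a minimal illustration, not from the slides; all array sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 512))   # 1000 points in 512 dimensions

# Project onto a much lower-dimensional subspace with a random matrix.
# By the Johnson-Lindenstrauss lemma, pairwise distances are roughly
# preserved, i.e., we "tolerate some error" in exchange for compactness.
d_low = 32
R = rng.standard_normal((512, d_low)) / np.sqrt(d_low)
X_low = X @ R

print(X_low.shape)  # (1000, 32)
```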

SLIDE 5

Key Technique - Approximation

  • Approximate the exact search results with “roughly” good ones, especially useful for time-constrained applications.

[Figure: five points A, B, C, D, and E in the plane; A, C, and D lie closest to B, with E slightly farther away.]

For example, B’s 3-nearest neighbours are A, C, and D. Instead of returning the exact result, we can return A, C, and E if our application can tolerate a certain level of error. By doing so, we usually gain a significant improvement in search efficiency.
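The slide’s example can be stated as a recall computation, the usual way such error is measured (an illustrative sketch; the point labels follow the figure):

```python
# B's true 3 nearest neighbours versus what the approximate index returned.
exact = {"A", "C", "D"}
approx = {"A", "C", "E"}

# Recall = fraction of the true neighbours that were actually found.
recall = len(exact & approx) / len(exact)
print(recall)  # 2/3 ~ 0.667: one tolerated mistake out of three
```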

SLIDE 6
Random Projection

  • Essentially, it is all about clustering: similar points should be grouped together, i.e., in a cluster.

SLIDE 7

Classic Random-Projection Tree

At every step, the problem space is divided in half and each half is solved separately: a typical divide-and-conquer technique. The split point can be the mean, the median, or a more sophisticated statistic. Each leaf node is a cluster of points that are close to each other.
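A minimal sketch of such a random-projection tree in Python (assumed details: Gaussian random directions, a median split, plain dicts for nodes; not the authors’ actual implementation):

```python
import numpy as np

def build_rp_tree(X, indices, leaf_size, rng):
    """Recursively split the point set by random projections (a sketch)."""
    if len(indices) <= leaf_size:
        return {"leaf": indices}                    # a cluster of nearby points
    direction = rng.standard_normal(X.shape[1])     # random projection direction
    proj = X[indices] @ direction                   # project the points onto it
    split = np.median(proj)                         # split point: here, the median
    left, right = indices[proj <= split], indices[proj > split]
    return {"direction": direction, "split": split,
            "left": build_rp_tree(X, left, leaf_size, rng),
            "right": build_rp_tree(X, right, leaf_size, rng)}

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 16))                  # 100 points in 16 dimensions
tree = build_rp_tree(X, np.arange(100), leaf_size=10, rng=rng)
```

Querying descends the tree by projecting the query onto each node's direction, ending in one leaf cluster.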

SLIDE 8

Issues of Classic RP-Tree

In general, the accuracy is not very high, even for data sets of medium dimensionality. The accuracy suffers from two kinds of misclassification, illustrated by the pairs B and D, and A and C. Index building has only limited parallelism, so it is not very efficient in practice. The index is large because high-dimensional vectors are stored in the intermediate nodes.

SLIDE 9
MRPT - Improve Accuracy

  • Increase either the leaf size or the number of trees, but which is better?

Combine the leaf clusters found by multiple trees.
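One way to combine leaf clusters, sketched under the assumption that each tree contributes the candidate set of the leaf its query descent ends in, after which an exact k-NN search runs over the small combined set (the index values are made up):

```python
# Hypothetical candidate sets: the leaf cluster a query reached in each of
# three independent random-projection trees (indices into the data set).
leaf_from_tree = [
    {3, 7, 12, 40},     # tree 1
    {7, 12, 55, 60},    # tree 2
    {3, 12, 60, 81},    # tree 3
]

# Union the clusters; an exact search over this set is cheap because it
# is far smaller than the full data set.
candidates = set().union(*leaf_from_tree)
print(sorted(candidates))  # [3, 7, 12, 40, 55, 60, 81]
```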

SLIDE 10

MRPT - Improve Index Size

  • We do not need to store the actual vector at each node.
  • Instead, we can use a random seed to regenerate it on the fly.

In a leaf cluster, only the indices of the vectors in the original data set are stored.
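The seed trick can be illustrated as follows (`node_direction` is a hypothetical helper, not the authors’ code):

```python
import numpy as np

dim = 128

def node_direction(seed, dim):
    """Regenerate a node's random direction from its stored integer seed."""
    return np.random.default_rng(seed).standard_normal(dim)

# The same seed always reproduces the same vector, so storing one integer
# per node replaces storing a full high-dimensional vector.
v1 = node_direction(42, dim)
v2 = node_direction(42, dim)
assert np.array_equal(v1, v2)
```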

SLIDE 11
MRPT - Improve Efficiency

  • The current algorithm can be parallelised to some extent, especially when moving towards the leaves.
  • Can we do better, by maximising the parallelism?

SLIDE 12

MRPT - Improve Efficiency

In the classic tree, the blue dotted lines are critical boundaries: the computations in the child branches cannot proceed until the computation in the parent node has finished. In MRPT there is no critical boundary: all the projections can be done in a single matrix multiplication, so the parallelism can be maximised.
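The “one matrix multiplication” idea can be sketched like this (sizes are illustrative; in the real index the matrix would hold the random directions actually used by the trees):

```python
import numpy as np

rng = np.random.default_rng(0)
n_trees, depth, dim = 10, 8, 128
query = rng.standard_normal(dim)

# Stack every random direction of every tree into one matrix, so that all
# the projections needed to route the query through all trees come out of
# a single matrix-vector product instead of n_trees * depth sequential
# dot products gated by parent-before-child dependencies.
R = rng.standard_normal((n_trees * depth, dim))
all_projections = R @ query

print(all_projections.shape)  # (80,)
```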

SLIDE 13

Almost Done, Let’s Conclude

  • High-dimensional data sets are common in practical applications. Efficient and accurate searching is difficult.
  • MRPT is a compact data structure which provides approximate k-NN search for high-dimensional big data sets.
  • MRPT optimises the index size, the search accuracy, the search efficiency, and the parallelism of the index-building process.

SLIDE 14

Thank you. Questions?

SLIDE 15

MRPT - Improve Accuracy

  • Increase either the leaf size or the number of trees, but which is better?
SLIDE 16

Finally, A Concrete Application of MRPT