
Technical Perspective: Finding a Good Neighbor, Near and Fast
by Bernard Chazelle

You haven’t read it yet, but you can already tell this article is going to be one long jumble of words, numbers, and punctuation marks. Indeed, but look at it differently, as a text classifier would, and you will see a single point in high dimension, with word frequencies acting as coordinates. Or take the background on your flat panel display: a million colorful pixels teaming up to make quite a striking picture. Yes, but also one single point in $10^6$-dimensional space, that is, if you think of each pixel’s RGB intensity as a separate coordinate. In fact, you don’t need to look hard to find complex, heterogeneous data encoded as clouds of points in high dimension. They routinely surface in applications as diverse as medical imaging, bioinformatics, astrophysics, and finance. Why? One word: geometry.

Ever since Euclid pondered what he could do with his compass, geometry has proven a treasure trove for countless computational problems. Unfortunately, high dimension comes at a price: the end of space partitioning as we know it. Chop up a square with two bisecting slices and you get four congruent squares. Now chop up a 100-dimensional cube in the same manner and you get $2^{100}$ little cubes: some Lego set! High dimension provides too many places to hide for searching to have any hope.

Just as dimensionality can be a curse (in Richard Bellman’s words), so it can be a blessing for all to enjoy. For one thing, a multitude of random variables cavorting together tend to produce sharply concentrated measures: for example, most of the action on a high-dimensional sphere occurs near the equator, and any function defined over it that does not vary too abruptly is in fact nearly constant. For another blessing of dimensionality, consider Wigner’s celebrated semicircle law: the spectral distribution of a large random matrix (an otherwise perplexing object) is described by a single, lowly circle. Sharp measure concentrations and easy spectral predictions are the foodstuffs on which science feasts.

But what about the curse? It can be vanquished. Sometimes. Consider the problem of storing a set $S$ of $n$ points in $\mathbb{R}^d$ (for very large $d$) in a data structure, so that, given any point $q$, the nearest $p \in S$ (in the Euclidean sense) can be found in a snap. Trying out all the points of $S$ is a solution, a slow one. Another is to build the Voronoi diagram of $S$. This partitions $\mathbb{R}^d$ into regions with the same answers, so that handling a query $q$ means identifying its relevant region. Unfortunately, any solution with the word “partition” in it is likely to raise the specter of the dreaded curse, and indeed this one lives up to that expectation. Unless your hard drive exceeds in bytes the number of particles in the universe, this “precompute and look up” method is doomed.

What if we instead lower our sights a little and settle for an approximate solution, say a point $p \in S$ whose distance to $q$ is at most $c = 1 + \varepsilon$ times the smallest one? Luckily, in many applications (for example, data analysis, lossy compression, information retrieval, machine learning), the data is imprecise to begin with, so erring by a small factor of $c > 1$ does not cause much harm. And if it does, there is always the option (often useful in practice) to find the exact nearest neighbor by enumerating all points in the vicinity of the query: something the methods discussed below will allow us to do.

The pleasant surprise is that one can tolerate an arbitrarily small error and still break the curse. Indeed, a zippy query time of $O(d \log n)$ can be achieved with an amount of storage roughly $n^{O(\varepsilon^{-2})}$. No curse there. Only one catch: a relative error of, say, 10% requires a prohibitive amount of storage. So, while theoretically attractive, this solution and its variants have left practitioners unimpressed.

Enter Alexandr Andoni and Piotr Indyk [1], with a new solution that should appeal to theoretical and applied types alike. It is fast and economical, with software publicly available for slightly earlier incarnations of the method. The starting point is the classical idea of locality-sensitive hashing (LSH). The bane of classical hashing is collision: too many keys hashing to the same spot can ruin a programmer’s day. LSH turns this weakness into a strength by hashing high-dimensional points into bins on a line in such a way that only nearby points collide. What better way to meet your neighbors than to bump into them? Andoni and Indyk modify LSH in critical ways to make neighbor searching more effective. For one thing, they hash down to spaces of logarithmic dimension, as opposed to single lines. They introduce a clever way of cutting up the hashing image space, all at a safe distance from the curse’s reach. They also add bells and whistles from coding theory to make the algorithm more practical.

Idealized data structures often undergo cosmetic surgery on their way to industrial-strength implementations; such an evolution is likely in this latest form of LSH. But there is no need to wait for this. Should you need to find neighbors in very high dimension, one of the current LSH algorithms might be just the solution for you.

Reference
1. Andoni, A. and Indyk, P. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the 47th Annual IEEE Symposium on the Foundations of Computer Science (FOCS’06).

Biography
Bernard Chazelle (chazelle@cs.princeton.edu) is a professor of computer science at Princeton University, Princeton, NJ.

COMMUNICATIONS OF THE ACM, January 2008/Vol. 51, No. 1, p. 115
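To make the classical LSH idea described above a little more concrete (hashing high-dimensional points into bins on a line so that nearby points tend to collide), here is a minimal Python sketch. It is not Andoni and Indyk's algorithm, which hashes to spaces of logarithmic dimension and cuts up the image space differently. It only illustrates the simpler line-based flavor, assuming random Gaussian projections cut into bins of fixed width. The names `make_line_hash`, `LineLSHIndex`, and `brute_force_nearest`, and the parameters `bin_width`, `num_tables`, and `hashes_per_table`, are hypothetical choices for illustration and would need tuning on real data.

```python
import math
import random
from collections import defaultdict


def make_line_hash(dim, bin_width, rng):
    """Build one hash function: project onto a random line, then cut the line into bins.

    Assumption: Gaussian direction and uniform offset, in the spirit of
    random-projection LSH for Euclidean distance.
    """
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    offset = rng.uniform(0.0, bin_width)

    def h(point):
        projection = sum(a * x for a, x in zip(direction, point))
        return math.floor((projection + offset) / bin_width)

    return h


class LineLSHIndex:
    """Several hash tables, each keyed by a tuple of line-bin hashes (illustrative only)."""

    def __init__(self, dim, num_tables=10, hashes_per_table=4, bin_width=4.0, seed=0):
        rng = random.Random(seed)
        self.points = []
        self.tables = []
        for _ in range(num_tables):
            funcs = [make_line_hash(dim, bin_width, rng) for _ in range(hashes_per_table)]
            self.tables.append((funcs, defaultdict(list)))

    @staticmethod
    def _key(funcs, point):
        return tuple(h(point) for h in funcs)

    def add(self, point):
        index = len(self.points)
        self.points.append(point)
        for funcs, buckets in self.tables:
            buckets[self._key(funcs, point)].append(index)

    def query(self, q):
        """Return the index of the closest colliding point, or None if nothing collides."""
        candidates = set()
        for funcs, buckets in self.tables:
            candidates.update(buckets.get(self._key(funcs, q), ()))
        if not candidates:
            return None
        # Enumerate only the points that bumped into q, and pick the nearest of them.
        return min(candidates, key=lambda i: math.dist(self.points[i], q))


def brute_force_nearest(points, q):
    """The slow baseline: try out all the points of S."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], q))
```

In this sketch a query inspects only the points that share a bucket with `q` in at least one table, so the answer is approximate and may be `None` when nothing collides. Wider bins or more tables raise the chance of finding a true near neighbor at the cost of examining more candidates.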

