SLIDE 1

Data-dependent Hashing for Nearest Neighbor Search

Alex Andoni (Columbia University)

Based on joint work with: Piotr Indyk, Huy Nguyen, Ilya Razenshteyn

SLIDE 2

Nearest Neighbor Search (NNS)

 Preprocess: a set 𝑃 of 𝑛 points
 Query: given a query point π‘ž, report a point π‘βˆ— ∈ 𝑃 with the smallest distance to π‘ž

π‘Ÿ π‘žβˆ—

2
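A minimal linear-scan baseline for the problem as stated (an illustrative sketch, not from the slides; it assumes points live in Euclidean space as rows of a NumPy array):

```python
import numpy as np

def nearest_neighbor(P, q):
    """Exact NNS by linear scan: O(n*d) work per query.

    P: (n, d) array of data points, q: (d,) query point.
    Returns the index of the closest point and its distance to q.
    """
    dists = np.linalg.norm(P - q, axis=1)   # distance from q to every point
    i = int(np.argmin(dists))
    return i, float(dists[i])

# toy usage
P = np.random.randn(1000, 64)
q = np.random.randn(64)
idx, dist = nearest_neighbor(P, q)
```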

SLIDE 3

Motivation

 Generic setup:

 Points model objects (e.g. images)
 Distance models (dis)similarity measure

 Application areas:

 machine learning: k-NN rule
 speech/image/video/music recognition, vector quantization, bioinformatics, etc…

 Distances:

 Hamming, Euclidean, edit distance, earthmover distance, etc…

 Core primitive: closest pair, clustering, etc…


SLIDE 4

Curse of Dimensionality

 All exact algorithms degrade rapidly with the dimension 𝑑

Algorithm | Query time | Space
Full indexing | 𝑑 β‹… log^{𝑂(1)} 𝑛 | 𝑛^{𝑂(𝑑)} (Voronoi diagram size)
No indexing (linear scan) | 𝑂(𝑛 β‹… 𝑑) | 𝑂(𝑛 β‹… 𝑑)

SLIDE 5

Approximate NNS

 𝑐-approximate π‘Ÿ-near neighbor: given a query point π‘ž, report a point 𝑝′ ∈ 𝑃 s.t. ‖𝑝′ βˆ’ π‘žβ€– ≀ π‘π‘Ÿ
 as long as there is some point within distance π‘Ÿ
 Practice: use for exact NNS
 Filtering: gives a set of candidates (hopefully small)

𝑠 π‘Ÿ π‘žβˆ— π‘žβ€² 𝑑𝑠

𝑑𝑠

5
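The "filtering" bullet can be read as follows: some approximate structure hands back a small candidate set, and only those candidates are checked exactly. A hedged sketch (the `candidates` argument stands in for whatever hashing scheme produced them):

```python
import numpy as np

def near_neighbor_from_candidates(P, q, candidates, r, c):
    """Solve the c-approximate r-near neighbor problem on a candidate set.

    P: (n, d) data array, q: (d,) query, candidates: indices from a filter.
    Returns any index whose point lies within distance c*r of q, or None.
    """
    for i in candidates:
        if np.linalg.norm(P[i] - q) <= c * r:
            return i        # a valid answer for the (c, r) problem
    return None             # allowed outcome when no point is within distance r
```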

SLIDE 6

NNS algorithms


Exponential dependence on dimension

 [Arya-Mount’93], [Clarkson’94], [Arya-Mount-Netanyahu-Silverman-Wu’98], [Kleinberg’97], [Har-Peled’02], [Arya-Fonseca-Mount’11], …

Linear/poly dependence on dimension

 [Kushilevitz-Ostrovsky-Rabani’98], [Indyk-Motwani’98], [Indyk’98, β€˜01], [Gionis-Indyk-Motwani’99], [Charikar’02], [Datar-Immorlica-Indyk-Mirrokni’04], [Chakrabarti-Regev’04], [Panigrahy’06], [Ailon-Chazelle’06], [A.-Indyk’06], [A.-Indyk-Nguyen-Razenshteyn’14], [A.-Razenshteyn’15], [Pagh’16], [Laarhoven’16], …

SLIDE 7

Locality-Sensitive Hashing

Random hash function β„Ž on ℝ^𝑑 satisfying:

 for close pair: when β€–π‘ž βˆ’ 𝑝‖ ≀ π‘Ÿ, Pr[β„Ž(π‘ž) = β„Ž(𝑝)] = 𝑃1 is β€œnot-so-small”
 for far pair: when β€–π‘ž βˆ’ 𝑝′‖ > π‘π‘Ÿ, Pr[β„Ž(π‘ž) = β„Ž(𝑝′)] = 𝑃2 is β€œsmall”

Use several hash tables: 𝑛^𝜌 of them, where 𝜌 = log(1/𝑃1) / log(1/𝑃2)   [Indyk-Motwani’98]

(figure: collision probability Pr[β„Ž(π‘ž) = β„Ž(𝑝)] as a function of β€–π‘ž βˆ’ 𝑝‖, equal to 𝑃1 at distance π‘Ÿ and 𝑃2 at distance π‘π‘Ÿ)

SLIDE 8

LSH Algorithms

Metric | Space | Time | Exponent | 𝒄 = 𝟐 | Reference
Hamming space | 𝑛^{1+𝜌} | 𝑛^𝜌 | 𝜌 = 1/𝑐 | 𝜌 = 1/2 | [IM’98]
 | | | 𝜌 β‰₯ 1/𝑐 | | [MNP’06, OWZ’11]
Euclidean space | 𝑛^{1+𝜌} | 𝑛^𝜌 | 𝜌 = 1/𝑐 | 𝜌 = 1/2 | [IM’98, DIIM’04]
 | | | 𝜌 β‰ˆ 1/𝑐² | 𝜌 = 1/4 | [AI’06]
 | | | 𝜌 β‰₯ 1/𝑐² | | [MNP’06, OWZ’11]
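To put the exponents in perspective (hypothetical numbers, not from the talk): for 𝑛 = 10⁹ Euclidean points and 𝑐 = 2, the [AI’06] row gives roughly

```python
n, rho = 10**9, 0.25        # Euclidean LSH exponent for c = 2 [AI'06]
query_cost = n ** rho       # ~ 178 probes per query, ignoring poly(d) and polylog(n) factors
space      = n ** (1 + rho) # ~ 1.8e11, versus n*d for a plain linear scan
```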

SLIDE 9

LSH is tight… what’s next?


But are we really done with basic NNS algorithms?

Lower bounds (cell probe)

[A.-Indyk-Patrascu’06, Panigrahy-Talwar-Wieder’08,β€˜10, Kapralov-Panigrahy’12]

Space-time trade-offs

[Panigrahy’06, A.-Indyk’06]

Datasets with additional structure

[Clarkson’99, Karger-Ruhl’02, Krauthgamer-Lee’04, Beygelzimer-Kakade-Langford’06, Indyk-Naor’07, Dasgupta-Sinha’13, Abdullah-A.-Krauthgamer-Kannan’14,…]

SLIDE 10

Beyond Locality Sensitive Hashing

Metric | Space | Time | Exponent | 𝒄 = 𝟐 | Reference
Hamming space | 𝑛^{1+𝜌} | 𝑛^𝜌 | 𝜌 = 1/𝑐 | 𝜌 = 1/2 | [IM’98] (LSH)
 | | | 𝜌 β‰₯ 1/𝑐 | | [MNP’06, OWZ’11]
 | 𝑛^{1+𝜌} | 𝑛^𝜌 | complicated | 𝜌 = 1/2 βˆ’ πœ€ | [AINR’14]
 | | | 𝜌 β‰ˆ 1/(2𝑐 βˆ’ 1) | 𝜌 = 1/3 | [AR’15]
Euclidean space | 𝑛^{1+𝜌} | 𝑛^𝜌 | 𝜌 β‰ˆ 1/𝑐² | 𝜌 = 1/4 | [AI’06] (LSH)
 | | | 𝜌 β‰₯ 1/𝑐² | | [MNP’06, OWZ’11]
 | 𝑛^{1+𝜌} | 𝑛^𝜌 | complicated | 𝜌 = 1/4 βˆ’ πœ€ | [AINR’14]
 | | | 𝜌 β‰ˆ 1/(2𝑐² βˆ’ 1) | 𝜌 = 1/7 | [AR’15]

SLIDE 11

New approach?

 A random hash function, chosen after seeing the given dataset

 Efficiently computable

Data-dependent hashing

SLIDE 12

Construction of hash function

 Two components:
 Nice geometric structure (this is the part with better LSH)
 Reduction to such structure (this is the data-dependent part)
 Like a (weak) β€œregularity lemma” for a set of points

SLIDE 13

Nice geometric structure: average-case

 Think: random dataset on a sphere
 vectors perpendicular to each other
 s.t. random points are at distance β‰ˆ π‘π‘Ÿ
 Lemma: 𝜌 = 1/(2𝑐² βˆ’ 1)
 via Cap Carving

(figure: sphere of radius π‘π‘Ÿ/√2, so that random points are at pairwise distance β‰ˆ π‘π‘Ÿ)
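A toy rendering of the cap-carving idea on the unit sphere (a simplified sketch under the assumption that points are unit vectors; the threshold `eta` and the number of caps `T` are tuning knobs, and this is not the exact [AR’15] construction):

```python
import numpy as np

rng = np.random.default_rng(1)

def cap_carving_hash(points, T, eta):
    """Assign each unit vector to the first random spherical cap that contains it.

    Caps are {x : <g_t, x> >= eta} for random Gaussian directions g_1..g_T.
    Close points on the sphere are likelier to land in the same cap than
    near-orthogonal ("random instance") points, which is the LSH property
    the average-case analysis exploits.
    """
    G = rng.standard_normal((T, points.shape[1]))
    buckets = []
    for p in points:
        hits = np.flatnonzero(G @ p >= eta)
        buckets.append(int(hits[0]) if hits.size else T)   # T = overflow bucket
    return np.array(buckets)
```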

SLIDE 14

Reduction to nice structure

 Idea: iteratively decrease the radius of the minimum enclosing ball
 Algorithm:
 find dense clusters
 with smaller radius
 large fraction of points
 recurse on dense clusters
 apply cap carving on the rest
 recurse on each β€œcap”
 e.g., dense clusters might reappear
 Why is each part ok?
 the rest has no dense clusters, so it behaves like a β€œrandom dataset” with radius = 100π‘π‘Ÿ
 the dense clusters have smaller radius (e.g. 99π‘π‘Ÿ instead of 100π‘π‘Ÿ): even better!

(*picture not to scale & dimension)
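The preprocessing loop above can be summarized as a recursion skeleton. This is a hedged paraphrase in which `find_dense_clusters` and `cap_carve` are placeholders for the two subroutines the slides describe only by their guarantees, and the 0.9 shrink factor is purely illustrative:

```python
def build_tree(points, radius, find_dense_clusters, cap_carve, min_size=1):
    """Recursive decomposition sketch: dense clusters vs. cap-carved remainder.

    find_dense_clusters(points, radius) -> (clusters, rest): each cluster has
        a noticeably smaller radius and a large (n^(1-delta)) share of points.
    cap_carve(points) -> list of caps: spherical-LSH partition of the rest.
    """
    if len(points) <= min_size:
        return {"leaf": points}
    clusters, rest = find_dense_clusters(points, radius)
    return {
        # recurse on every dense cluster with a strictly smaller enclosing ball
        "clusters": [build_tree(c, 0.9 * radius, find_dense_clusters, cap_carve, min_size)
                     for c in clusters],
        # cap-carve the pseudo-random remainder; dense clusters may reappear
        # inside a cap, so recurse there as well
        "caps": [build_tree(cap, radius, find_dense_clusters, cap_carve, min_size)
                 for cap in cap_carve(rest)],
    }
```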
SLIDE 15

Hash function


 Described by a tree (like a hash table)

(figure: the decomposition tree; root ball of radius 100π‘π‘Ÿ; *picture not to scale & dimension)

SLIDE 16

Dense clusters

 Current dataset: radius 𝑅
 A dense cluster:
 contains 𝑛^{1βˆ’π›Ώ} points
 smaller radius: (1 βˆ’ Ξ©(πœ€Β²)) 𝑅
 After we remove all clusters:
 for any point on the surface, there are at most 𝑛^{1βˆ’π›Ώ} points within distance (√2 βˆ’ πœ€) 𝑅
 the other points are essentially orthogonal!
 When applying Cap Carving with parameters (𝑃1, 𝑃2):
 empirical number of far points colliding with the query: 𝑛𝑃2 + 𝑛^{1βˆ’π›Ώ}
 as long as 𝑛𝑃2 ≫ 𝑛^{1βˆ’π›Ώ}, the β€œimpurity” doesn’t matter!

(𝛿 and πœ€ are trade-off parameters)
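A quick sanity check of the last condition with hypothetical numbers (not from the talk): with 𝑛 = 10⁢, 𝛿 = 0.2, and a cap-carving round whose far-collision probability is 𝑃2 = 𝑛^{βˆ’0.1}, the leftover "impure" points are dominated by the far points that collide anyway:

```python
n, delta = 10**6, 0.2
P2 = n ** -0.1                       # hypothetical collision prob. for far points
impurity       = n ** (1 - delta)    # leftover non-orthogonal points: n^0.8 ~ 6.3e4
far_collisions = n * P2              # far points colliding with the query: n^0.9 ~ 2.5e5
# far_collisions grows as n^0.9 >> n^0.8 = impurity, so the impurity does not change the analysis
```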

SLIDE 17

Tree recap

 During query:
 recurse in all clusters
 just in one bucket in CapCarving
 Will look in >1 leaf! How much branching?
 Claim: at most (𝑛^𝛿 + 1)^{𝑂(1/πœ€Β²)}
 each time we branch:
 at most 𝑛^𝛿 clusters (+1)
 a cluster reduces the radius by Ξ©(πœ€Β²)
 cluster-depth is at most 100/Ξ©(πœ€Β²)
 Progress in 2 ways:
 clusters reduce the radius
 CapCarving nodes reduce the # of far points (empirical 𝑃2)
 A tree succeeds with probability β‰₯ 𝑛^{βˆ’1/(2𝑐²−1) βˆ’ π‘œ(1)}

(𝛿 trade-off)

SLIDE 18

Beyond β€œBeyond LSH”


 Practice: often optimize partition to your dataset

 PCA-tree, spectral hashing, etc. [S91, McN01, VKD09, WTF08, …]
 no guarantees (performance or correctness)

 Theory: assume special structure in the dataset

 low intrinsic dimension [KR’02, KL’04, BKL’06, IN’07, DS’13, …]
 structure + noise [Abdullah-A.-Krauthgamer-Kannan’14]

Data-dependent hashing helps even when the dataset has no a priori structure!

SLIDE 19

Data-dependent hashing wrap-up


 Dynamicity?

 Dynamization techniques [Overmars-van Leeuwen’81]

 Better bounds?

 For dimension 𝑑 = 𝑂(log 𝑛), one can get a better 𝜌! [Laarhoven’16]
 For 𝑑 > log^{1+𝛿} 𝑛: our 𝜌 is optimal even for data-dependent hashing! [A.-Razenshteyn’??]
 in the right formalization (to rule out the Voronoi diagram):
 description complexity of the hash function is 𝑛^{1βˆ’Ξ©(1)}

 Practical variant [A.-Indyk-Laarhoven-Razenshteyn-Schmidt’15]
 NNS for β„“βˆž
 [Indyk’98] gets approximation 𝑂(log log 𝑑) (polynomial space, sublinear query time)
 Cf., β„“βˆž has no non-trivial sketch!
 Some matching lower bounds in the relevant model [ACP’08, KP’12]

 Can be thought of as data-dependent hashing

 NNS for any norm (e.g., matrix norms, EMD)?