SLIDE 1

New directions in approximate nearest neighbors for the angular distance

Thijs Laarhoven

mail@thijs.com http://www.thijs.com/

Proximity Workshop, College Park (MD), USA

(January 13, 2016)

SLIDES 2–16

Nearest neighbor searching (figure sequence)

  • Data set
  • Target
  • Nearest neighbor
  • Nearest neighbor (ℓ1-norm)
  • Nearest neighbor (angular distance)
  • Nearest neighbor (ℓ2-norm)
  • Distance guarantee r
  • Approximate nearest neighbor
  • Approximation factor c > 1 (distances r and c · r)
  • Example: precompute Voronoi cells
  • Given a target...
  • ...quickly find the right cell
  • Works well in low dimensions
SLIDES 17–21

Nearest neighbor searching

Problem setting

  • High dimensions d
  • Large data set of size n = 2^Ω(d/log d)

◮ Smaller n? ⇒ Use a Johnson–Lindenstrauss transform to reduce d

  • Assumption: the data set lies on the unit sphere

◮ Angular NNS in R^d is equivalent to Euclidean NNS on the sphere
◮ Reduction from Euclidean NNS in R^d to Euclidean NNS on the sphere [AR'15]

  • "Random" case: c · r = √2

◮ Random unit vectors are usually approximately orthogonal

  • Goal: query time O(n^ρ) with ρ < 1
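The sphere assumption can be illustrated directly: after normalizing, squared Euclidean distance and angle determine each other via ‖x − y‖² = 2 − 2 cos θ, so angular NNS and Euclidean NNS on the sphere rank candidates identically. A small self-contained check (toy vectors chosen arbitrarily):

```python
import math

def normalize(x):
    n = math.sqrt(sum(v * v for v in x))
    return [v / n for v in x]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def angle(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding

x = normalize([3.0, 1.0, 2.0])
y = normalize([1.0, 4.0, 0.0])
# On the unit sphere: ||x - y||^2 = 2 - 2*cos(theta)
assert abs(euclidean(x, y) ** 2 - (2 - 2 * math.cos(angle(x, y)))) < 1e-9
```

Since the map θ ↦ √(2 − 2 cos θ) is increasing on [0, π], minimizing one distance minimizes the other.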
SLIDES 22–27

Nearest neighbor searching

"Random" instances (figure: a nearest neighbor at distance r, all other points at distance c · r = √2)
SLIDES 28–31

Locality-sensitive hashing

Overview

  • Idea: Use nice partitions of the space

◮ Nearby vectors are often in the same region
◮ Distant vectors are unlikely to be in the same region

  • Precomputation: Store hash tables of vectors per region

◮ For each region, store the contained vectors from the data set
◮ Rerandomization: use many partitions to increase the success probability

  • Query: Check hash tables for collisions

◮ Compute the target's region for each hash table
◮ Check the corresponding buckets for potential nearest neighbors
◮ Reduces the search space before doing a linear search
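The precompute/query pipeline above can be sketched in a few lines. This is a toy illustration, not the paper's construction: the table count, the parameter k, and the choice of hyperplane hashes are arbitrary example values.

```python
import random
from collections import defaultdict

random.seed(1)
d, n_tables, k = 16, 10, 8

def rand_vec():
    return [random.gauss(0, 1) for _ in range(d)]

# One hash function per table: k random hyperplanes -> a k-bit region label
hyperplanes = [[rand_vec() for _ in range(k)] for _ in range(n_tables)]

def region(x, planes):
    return tuple(int(sum(a * b for a, b in zip(x, h)) >= 0) for h in planes)

data = [rand_vec() for _ in range(200)]
tables = [defaultdict(list) for _ in range(n_tables)]
for idx, p in enumerate(data):                 # precomputation: fill buckets
    for t, planes in enumerate(hyperplanes):
        tables[t][region(p, planes)].append(idx)

def query_candidates(q):
    # Collect everything colliding with q in any table, then (not shown)
    # do a linear search over this reduced candidate set
    cands = set()
    for t, planes in enumerate(hyperplanes):
        cands.update(tables[t].get(region(q, planes), []))
    return cands

q = [v + random.gauss(0, 0.05) for v in data[0]]  # near-duplicate of data[0]
candidates = query_candidates(q)                   # typically contains index 0
```

The rerandomization step is visible here: a single table misses nearby points with constant probability, so many independent tables are kept.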

SLIDES 32–45

Hyperplane LSH [Charikar, STOC'02]

Figure sequence: random point → opposite point → two Voronoi cells → another pair of points → another hyperplane → defines partition → preprocessing → query → collisions → failure → rerandomization → collisions → success

SLIDES 46–49

Hyperplane LSH

Overview

  • 2 regions induced by each hyperplane
  • Simple: one hyperplane corresponds to one inner product
  • Fast: k hyperplanes give 2^k regions

For "random" settings, query time O(n^ρ) with ρ = (√2 / (π ln 2)) · (1/c) · (1 + o_{d,c}(1)).

Efficient, but suboptimal: ρ ∝ 1/c² is achievable.
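The exponent can be checked numerically from Charikar's collision probability Pr[h(x) = h(y)] = 1 − θ/π, using the standard LSH exponent ρ = ln(1/p₁)/ln(1/p₂). In the random case the far points sit at 90° (p₂ = 1/2) and the planted neighbor at angle θ₁ with cos θ₁ = 1 − 1/c²:

```python
import math

def collision_prob(theta):
    # Pr[sign(<a,x>) == sign(<a,y>)] over a random hyperplane normal a,
    # for vectors at angle theta
    return 1 - theta / math.pi

def rho_hyperplane(c):
    theta1 = math.acos(1 - 1 / c**2)   # angle of the planted nearest neighbor
    p1 = collision_prob(theta1)
    p2 = collision_prob(math.pi / 2)   # random points: 90 degrees, p2 = 1/2
    return math.log(1 / p1) / math.log(1 / p2)

# For large c this approaches sqrt(2)/(pi * ln 2) * (1/c)
c = 50.0
approx = math.sqrt(2) / (math.pi * math.log(2)) / c
assert abs(rho_hyperplane(c) - approx) / approx < 0.05
```

This makes the suboptimality concrete: ρ decays like 1/c here, while the schemes below reach ρ ∝ 1/c².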

SLIDES 50–54

Cross-Polytope LSH [Terasawa–Tanaka, WADS'07] [Andoni et al., NIPS'15]

Figure sequence: vertices of the cross-polytope (simplex) → random rotation → Voronoi regions → defines partition

SLIDES 55–57

Cross-Polytope LSH

Overview

  • 2d regions in d dimensions
  • Advantage: the regions have the same size and are more symmetric

For "random" settings, query time O(n^ρ) with ρ = 1/(2c² − 1) · (1 + o_d(1)).

  • Essentially optimal for large c and n = 2^{o(d)} [Dub'10, AR'15]
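A minimal sketch of the hash computation: rotate, then snap to the nearest of the 2d cross-polytope vertices ±e_i, i.e. the coordinate of largest absolute value. As a simplification, a random Gaussian matrix stands in for the uniformly random rotation the scheme prescribes (practical implementations use fast pseudo-rotations).

```python
import random

random.seed(7)
d = 8

# Simplification: Gaussian matrix instead of a uniformly random rotation
R = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

def cp_hash(x):
    y = [sum(r * v for r, v in zip(row, x)) for row in R]
    i = max(range(d), key=lambda j: abs(y[j]))  # nearest vertex +/- e_i
    return (i, y[i] >= 0)

# Every hash value names one of the 2d cross-polytope vertices
hashes = {cp_hash([random.gauss(0, 1) for _ in range(d)]) for _ in range(100)}
assert len(hashes) <= 2 * d
```

Decoding costs one matrix-vector product plus an argmax, which is the "reasonably efficient decoding" noted in the comparison below.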
SLIDES 58–61

Spherical/Voronoi LSH [Andoni et al., SODA'14] [Andoni–Razenshteyn, STOC'15]

Figure sequence: random points → Voronoi cells → defines partition

SLIDES 62–65

Spherical/Voronoi LSH

Overview

2^{O(√d)} points in d dimensions

  • More points improves performance
  • More points makes decoding slower

For "random" settings, query time O(n^ρ) with ρ = 1/(2c² − 1) · (1 + o_d(1)).

Essentially optimal for large c and n = 2^{o(d)}.
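The decoding step can be sketched as nearest-codeword search over m random unit vectors (here m is a tiny arbitrary example value; the scheme uses 2^{O(√d)} points, which is exactly why this linear scan over the codebook becomes the bottleneck):

```python
import math
import random

random.seed(3)
d, m = 12, 32   # toy sizes; the scheme takes m = 2^{O(sqrt(d))}

def rand_unit():
    v = [random.gauss(0, 1) for _ in range(d)]
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]

code = [rand_unit() for _ in range(m)]

def voronoi_hash(x):
    # Decode to the nearest code point: m inner products per hash
    # evaluation, so decoding slows down as the codebook grows
    return max(range(m), key=lambda i: sum(a * b for a, b in zip(code[i], x)))

h = voronoi_hash(rand_unit())
assert 0 <= h < m
```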

SLIDES 66–69

LSH overview

  • Hyperplane LSH: 2 Voronoi cells

◮ Efficient decoding
◮ Suboptimal for large d, c

  • Cross-Polytope LSH: 2d Voronoi cells

◮ Reasonably efficient decoding
◮ Optimal for large c and n = 2^{o(d)}

  • Spherical/Voronoi LSH: 2^{O(√d)} Voronoi cells

◮ Slow decoding
◮ Optimal for large c and n = 2^{o(d)}

  • 1. Can we use even more Voronoi cells?
  • 2. Can decoding be made faster?
  • 3. What about n = 2^{Ω(d)}?
SLIDES 70–80

Structured filters

Figure sequence: partition the dimensions into blocks → random subcodes → construct the concatenated code → normalize (only for the example) → construct Voronoi cells → defines partition → ...with efficient decoding

SLIDES 81–83

Structured filters

Techniques

  • Idea 1: Increase the number of regions to 2^{Θ(d)}

◮ The number of hash tables increases to 2^{Θ(d)} – fine for n = 2^{Θ(d)}
◮ Decoding cost potentially too large

  • Idea 2: Use structured codes for random regions

◮ Spherical/Voronoi LSH with dependent random points
◮ A concatenated code of log d low-dimensional spherical codes
◮ Allows for efficient list-decoding

  • Idea 3: Replace partitions with filters

◮ Relaxation: filters need not partition the space
◮ Simplified analysis
◮ Might not be needed to achieve the improvement
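The structural point of Idea 2 can be sketched as follows. This is a toy version: the block count, subcode size, and plain nearest-codeword decoding per block are simplifications of the actual construction, which uses list-decoding. The payoff shown is the cost split: decoding a concatenated code takes blocks · t inner products instead of t^blocks for an unstructured codebook of the same total size.

```python
import math
import random

random.seed(5)
d, blocks = 16, 4        # d coordinates split into `blocks` blocks
b = d // blocks          # block dimension
t = 8                    # codewords per block; full code has t**blocks regions

def rand_unit(dim):
    v = [random.gauss(0, 1) for _ in range(dim)]
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]

# One small random subcode per block; the concatenated code is their product
subcodes = [[rand_unit(b) for _ in range(t)] for _ in range(blocks)]

def decode(x):
    # Decode each block independently: blocks * t inner products in total,
    # versus t**blocks for an unstructured code with as many regions
    label = []
    for i in range(blocks):
        chunk = x[i * b:(i + 1) * b]
        label.append(max(range(t),
                         key=lambda j: sum(a * c for a, c in
                                           zip(subcodes[i][j], chunk))))
    return tuple(label)

x = rand_unit(d)
assert len(decode(x)) == blocks    # one subcode index per block
```

With t^blocks = 2^{Θ(d)} regions but only blocks · t decoding work, the region count of Idea 1 becomes affordable.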

SLIDES 84–86

Structured filters

Results

For random sparse settings (n = 2^{o(d)}), query time O(n^ρ) with ρ = 1/(2c² − 1) · (1 + o_d(1)).

For random dense settings (n = 2^{κd} with small κ), we obtain ρ = (1 − κ)/(2c² − 1) · (1 + o_{d,κ}(1)).

For random dense settings (n = 2^{κd} with large κ), we obtain ρ = −(1/(2κ)) · log(1 − 1/(2c² − 1)) · (1 + o_d(1)).
SLIDES 87–88

Asymmetric nearest neighbors

Previous results: symmetric NNS

  • Query time: O(n^ρ)
  • Update time: O(n^ρ)
  • Preprocessing time: O(n^{1+ρ})
  • Space complexity: O(n^{1+ρ})

Can we get a tradeoff between these costs?

SLIDES 89–95

Asymmetric nearest neighbors

Figure sequence: Voronoi regions → spherical cap → cap height α → smaller α ⇒ larger caps, more work → larger α ⇒ smaller caps, less work → α_q > α_u ⇒ faster queries, slower updates → α_q < α_u ⇒ slower queries, faster updates

SLIDES 96–98

Asymmetric nearest neighbors

Results

Query time O(n^{ρ_q}), update time O(n^{ρ_u}), preprocessing time O(n^{1+ρ_u}), space complexity O(n^{1+ρ_u}).

General expressions:

  Minimize space (α_q/α_u = cos θ):    ρ_q = (2c² − 1)/c⁴,   ρ_u = 0
  Balance costs  (α_q/α_u = 1):        ρ_q = 1/(2c² − 1),    ρ_u = 1/(2c² − 1)
  Minimize time  (α_q/α_u = 1/cos θ):  ρ_q = 0,              ρ_u = (2c² − 1)/(c² − 1)²

Small c = 1 + ε:

  Minimize space:  ρ_q = 1 − 4ε² + O(ε³),   ρ_u = 0
  Balance costs:   ρ_q = 1 − 4ε + O(ε²),    ρ_u = 1 − 4ε + O(ε²)
  Minimize time:   ρ_q = 0,                 ρ_u = 1/(4ε²) + O(1/ε)

Large c → ∞:

  Minimize space:  ρ_q = 2/c² + O(1/c⁴),    ρ_u = 0
  Balance costs:   ρ_q = 1/(2c²) + O(1/c⁴), ρ_u = 1/(2c²) + O(1/c⁴)
  Minimize time:   ρ_q = 0,                 ρ_u = 2/c² + O(1/c⁴)
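A quick numeric sanity check of the general expressions, plugging in arbitrary values of c: the balanced point is cheapest on both exponents at once, and the space-optimal regime matches its stated 2/c² behavior for large c.

```python
def rho_balance(c):
    return 1.0 / (2 * c * c - 1)

def rho_min_space(c):   # rho_q when rho_u = 0 (space-optimal regime)
    return (2 * c * c - 1) / c**4

def rho_min_time(c):    # rho_u when rho_q = 0 (time-optimal regime)
    return (2 * c * c - 1) / (c * c - 1)**2

c = 2.0
# The extremes pay for what they save: minimizing space costs query time,
# minimizing time costs update time, relative to the balanced point
assert rho_min_space(c) > rho_balance(c)
assert rho_min_time(c) > rho_balance(c)

# Large-c asymptotics: rho_q in the space-optimal regime is ~ 2/c^2
c = 10.0
assert abs(rho_min_space(c) - 2 / c**2) < 1e-2
```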

SLIDE 99

Asymmetric nearest neighbors

Tradeoffs (figure)

SLIDES 100–102

Conclusions

Main result: allow using more regions via list-decodable codes

  • For n = 2^{o(d)}, a non-asymptotic improvement
  • For n = 2^{Θ(d)}, an asymptotic improvement
  • Corollary: lower bounds for n = 2^{o(d)} do not hold for n = 2^{Θ(d)}
  • Improved tradeoffs between query and update complexities

Open problems

  • Is the tradeoff for n = 2^{o(d)} optimal?
  • Are there lower bounds for n = 2^{Θ(d)}?
  • Do similar ideas apply to other norms?
  • Practicality?

Questions?