

SLIDE 1


Curse of Dimensionality in Pivot-based Indexes

Ilya Volnyansky, Vladimir Pestov

Department of Mathematics and Statistics, University of Ottawa, Ottawa, Ontario, Canada

SISAP 2009, Prague, 29/09/2009

SLIDE 2


Outline

1. Overview: The Setting for Similarity Search; Previous Work
2. Our Work: Framework; Concentration of Measure; Statistical Learning Theory; Asymptotic Bounds
3. Discussion

SLIDE 3


Similarity Workloads

Universe Ω: a metric space with metric ρ.
Dataset X ⊂ Ω, always finite, with the same metric ρ.
A range query: given q ∈ Ω and r > 0, find {x ∈ X | ρ(x, q) < r}.
For analysis purposes, we add:
- a measure μ on Ω;
- treat X as an i.i.d. sample ∼ μ of size n.

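As a baseline, a range query can always be answered by a linear scan over X, at a cost of n distance computations. A minimal sketch (the Hamming-distance workload below is our own illustration, not from the slides):

    def range_query(data, rho, q, r):
        """Brute-force range query: one distance computation per point of X."""
        return [x for x in data if rho(x, q) < r]

    # Illustration on the Hamming cube: binary strings with Hamming distance.
    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))

    X = ["0000", "0011", "1111", "0101"]
    print(range_query(X, hamming, q="0001", r=2))  # -> ['0000', '0011', '0101']

The point of an index is to answer the same query while computing far fewer than n distances.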
SLIDE 5


Curse of dimensionality conjecture

All indexing schemes suffer from the curse of dimensionality (conjecture): if d = ω(log n) and d = n^o(1), any sequence of indexes built on a sequence of datasets X_d ⊂ Σ^d allowing similarity search in time polynomial in d must use n^ω(1) space.

(Handbook of Discrete and Computational Geometry)

The Hamming cube Σ^d of dimension d: the set of all binary sequences of length d.

SLIDE 7


Fixed dimension

Examples of previous work: let n, the size of X, vary, but keep the space (Ω, ρ, μ) fixed. This is the usual "asymptotic" analysis in the CS sense. It does not investigate the curse of dimensionality.

SLIDE 8


Fixed n

Let the dimension, and hence (Ω, ρ, μ), vary, but keep the size n of X the same, e.g. [Weber 98], [Chávez 01]. Too small a sample size n makes it easier to index spaces of high dimension d. When both d and n vary, the math is more challenging.

SLIDE 9


Points to keep in mind

- Distinction between X and Ω.
- Both d and n grow.
- Need to make assumptions about the sequence of Ω's (?)
- Need to make assumptions about the indexes.

SLIDE 10


Gameplan

1. Pick an index type to analyze.
2. Pick a cost model.
3. The sequence of Ω's exhibits concentration of measure; the "intrinsic dimension" grows.
4. Statistical Learning Theory: linking properties of the Ω's to properties of the X's.
5. Conclusion: if all conditions are met, the Curse of Dimensionality will take place.

SLIDE 11


Main Result

From a sequence of metric spaces with measure (Ω_d, ρ_d, μ_d), d = 1, 2, 3, …, take i.i.d. samples (datasets) X_d ∼ μ_d. Assume:
- (Ω_d, ρ_d, μ_d) display concentration of measure;
- the VC dimension of closed balls in (Ω_d, ρ_d) is O(d);
- we build a pivot index using k pivots, where k = o(n_d/d);
- the sample size n_d satisfies d = ω(log n_d) and d = n_d^o(1);
- we perform queries of radius equal to the nearest-neighbour (NN) distance.

Then: for arbitrarily small ε, η > 0 there exists D such that for all d ⩾ D, the probability that at least half the queries on the dataset X_d take less than (1 − ε)n_d time is less than η.

SLIDE 13


Pivot indexing scheme

Build an index:
1. Pick pivots {p_1, …, p_k} from X.
2. Calculate the n × k array of distances ρ(x, p_i), 1 ⩽ i ⩽ k, x ∈ X.

Perform a query, given q and r:
1. Compute ρ_k(q, x) := sup_{1⩽i⩽k} |ρ(q, p_i) − ρ(x, p_i)|.
2. Since ρ(q, x) ⩾ ρ_k(q, x), there is no need to compute ρ(q, x) if ρ_k(q, x) > r.
3. Compute ρ(q, x) otherwise.

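A runnable sketch of this scheme in Python (random pivot choice is our assumption; the slides leave the selection rule open):

    import random

    class PivotIndex:
        """Pivot-based index: k pivots and an n-by-k table of precomputed distances."""

        def __init__(self, data, rho, k, seed=0):
            self.data = list(data)
            self.rho = rho
            self.pivots = random.Random(seed).sample(self.data, k)  # step 1
            # step 2: n x k table of distances rho(x, p_i), built once
            self.table = [[rho(x, p) for p in self.pivots] for x in self.data]

        def range_query(self, q, r):
            q_row = [self.rho(q, p) for p in self.pivots]  # k distance computations
            hits, survivors = [], 0
            for x, x_row in zip(self.data, self.table):
                # rho_k(q, x) = max_i |rho(q, p_i) - rho(x, p_i)|, a lower bound on rho(q, x)
                rho_k = max(abs(qd - xd) for qd, xd in zip(q_row, x_row))
                if rho_k > r:
                    continue                    # discard x without computing rho(q, x)
                survivors += 1
                if self.rho(q, x) < r:          # full distance only for survivors
                    hits.append(x)
            return hits, survivors

The `survivors` count makes the cost model of the next slide directly observable: a query costs k + survivors distance computations.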
SLIDE 15


The cost model

Only one operation is charged: computing a distance ρ(q, x). Computing ρ_k(q, x) from the table therefore costs k. Let C_{q,r,p_1,…,p_k} denote the set of discarded points in X: {x ∈ X | ρ_k(q, x) > r}. Let n = |X|. Total cost of a query: k + n − |C_{q,r,p_1,…,p_k}|.

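A worked instance with assumed numbers (ours, for scale): with n = 10^6 points, k = 100 pivots, and 90% of the points discarded (|C| = 9 × 10^5),

cost = k + n − |C| = 100 + 10^6 − 9 × 10^5 = 100100 ≈ n/10.

Conversely, if almost nothing is discarded (|C| ≈ 0), the cost degenerates to k + n, worse than a plain linear scan.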
SLIDE 17


Concentration of Measure

A function f : Ω → ℝ is 1-Lipschitz if |f(ω_1) − f(ω_2)| ⩽ ρ(ω_1, ω_2) for all ω_1, ω_2 ∈ Ω.

Examples: f(x) = x, f(x) = x/2, f(x) = √(x² + 1).

Its median is a number M such that μ{ω | f(ω) ⩽ M} ⩾ 1/2 and μ{ω | f(ω) ⩾ M} ⩾ 1/2.

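A quick check (ours, not from the slides) that the last example is 1-Lipschitz, via a derivative bound:

|f′(x)| = |x| / √(x² + 1) < 1 for all x, hence |f(x_1) − f(x_2)| ⩽ |x_1 − x_2|.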
SLIDE 19


Concentration of Measure

A sequence of spaces (Ω_d, ρ_d, μ_d), d = 1, 2, …, exhibits (normal) concentration of measure if there are C, c > 0 such that for every 1-Lipschitz function f : Ω_d → ℝ with median M:

∀ε > 0, μ_d{ω | |f(ω) − M| > ε} < C e^(−cε²d)

Examples: the spheres S^d in ℝ^(d+1), the balls B^d, the Hamming cubes Σ^d.

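A Monte-Carlo sketch (ours; assumes uniform points on S^d and the 1-Lipschitz coordinate function f(ω) = ω_1) showing the mass concentrate near the median as d grows:

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (3, 10, 100, 1000):
        g = rng.standard_normal((100_000, d + 1))
        pts = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on S^d
        f = pts[:, 0]                # first coordinate, a 1-Lipschitz function
        med = np.median(f)
        # fraction of the sphere farther than eps = 0.2 from the median of f
        print(d, np.mean(np.abs(f - med) > 0.2))

The printed fraction drops rapidly with d, consistent with the C e^(−cε²d) bound.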
SLIDE 21


The concentration functions of various spheres

[Figure: concentration functions of spheres, actual curves for d = 3, 10, 20.]

SLIDE 22


The concentration functions of various spheres

[Figure: concentration functions of spheres for d = 3, 10, 20, actual curves vs. estimated (Gaussian) bounds.]

SLIDE 23


The concentration of measure in spheres

We can replace f : Ω → ℝ by f : Ω → ℝ^N. Suppose f : S^d → ℝ², with d = 10, 20, 50, 100.

SLIDE 24

The concentration of measure in spheres

[Figure: scatter plots of 2-dimensional projections of S^d for d = 10, 20, 50, 100.]

SLIDE 25


Distribution of distances of projected spheres

[Figure: histograms of distances between points of S^d after projection to ℝ² (Proj = 2). The mean projected distance shrinks with d: d = 10, mean ≈ 0.564; d = 20, mean ≈ 0.397; d = 50, mean ≈ 0.244; d = 100, mean ≈ 0.18.]

SLIDE 26


Distribution of distances of spheres

[Figure: histograms of distances between points of S^d with no dimension reduction (Proj = d). The mean distance is essentially constant: d = 10, mean ≈ 1.393; d = 20, mean ≈ 1.413; d = 50, mean ≈ 1.41; d = 100, mean ≈ 1.412.]

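A Monte-Carlo sketch (ours; uniform points on S^d, Euclidean distance) reproducing the pattern of the last two figures:

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (10, 20, 50, 100):
        g = rng.standard_normal((20_000, d + 1))
        pts = g / np.linalg.norm(g, axis=1, keepdims=True)    # uniform on S^d
        x, y = pts[:10_000], pts[10_000:]
        full = np.linalg.norm(x - y, axis=1)                  # distances in R^(d+1)
        proj = np.linalg.norm(x[:, :2] - y[:, :2], axis=1)    # after projecting to R^2
        print(d, round(full.mean(), 3), round(proj.mean(), 3))

The full distances stay concentrated near √2 ≈ 1.414 for every d, while the projected distances collapse toward 0 as d grows.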
SLIDE 27


Connection to indexing

Observe that ρ(⋅, p) : Ω → ℝ, ω ↦ ρ(ω, p), is a 1-Lipschitz function, as the triangle inequality

ρ(ω_1, p) ⩽ ρ(ω_1, ω_2) + ρ(ω_2, p) and ρ(ω_2, p) ⩽ ρ(ω_2, ω_1) + ρ(ω_1, p)

leads to

ρ(ω_1, p) − ρ(ω_2, p) ⩽ ρ(ω_1, ω_2) and ρ(ω_2, p) − ρ(ω_1, p) ⩽ ρ(ω_2, ω_1),

and hence |ρ(ω_1, p) − ρ(ω_2, p)| ⩽ ρ(ω_1, ω_2).

SLIDE 28


Connection to indexing

ρ(⋅, p) is a 1-Lipschitz function, and hence so is ρ_k(q, ⋅), a supremum of 1-Lipschitz functions. Recall 𝒞_{q,r,p_1,…,p_k} = {ω ∈ Ω | ρ_k(q, ω) > r}; compare to C_{q,r,p_1,…,p_k} = {x ∈ X | ρ_k(q, x) > r}. If concentration of measure is present, it follows that μ_d(𝒞_{q,r,p_1,…,p_k}) < C e^(−cr²d): as d grows with r fixed, almost no point of Ω can be discarded by the pivots. We want to know about |C_{q,r,p_1,…,p_k}|, the number of dataset points actually discarded.

SLIDE 31


Glivenko-Cantelli and the generalization

Let X be an i.i.d. sample of size n from (ℝ, μ) (any* probability measure). If we let μ_n(A) := |X ∩ A| / n, then

sup_{A ∈ 𝒜} |μ_n(A) − μ(A)| → 0 in probability,

where 𝒜 = {(a, b] | a, b ∈ ℝ}. This is known as the Glivenko–Cantelli theorem.

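A numerical illustration (ours; μ is assumed standard normal, and we compute the Kolmogorov–Smirnov statistic over half-lines (−∞, b], which controls the supremum over intervals (a, b] up to a factor of 2):

    import numpy as np
    from math import erf, sqrt

    def normal_cdf(t):
        return 0.5 * (1.0 + erf(t / sqrt(2.0)))

    rng = np.random.default_rng(0)
    for n in (100, 1_000, 10_000, 100_000):
        x = np.sort(rng.standard_normal(n))
        F = np.array([normal_cdf(v) for v in x])   # true CDF at sample points
        hi = np.arange(1, n + 1) / n               # ECDF just after each jump
        lo = np.arange(0, n) / n                   # ECDF just before each jump
        ks = max(np.abs(hi - F).max(), np.abs(lo - F).max())
        print(n, round(ks, 4))

The supremum shrinks roughly like n^(−1/2), in line with the theorem.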
SLIDE 32


Generalization of Glivenko-Cantelli

Let X be an i.i.d. sample of size n from (Ω, μ). If we let 𝒜 be a collection of subsets with the "finite Vapnik–Chervonenkis (VC) dimension Δ" property, then

sup_{A ∈ 𝒜} |μ_n(A) − μ(A)| → 0 in probability.

Furthermore, we know the rate of convergence: exp(−Δε²n).

SLIDE 34


Examples of Spaces with bounds on VC

- The VC dimension of half-spaces in ℝ^d is d + 1.
- The VC dimension of all open (or closed) balls in ℝ^d, {x ∈ ℝ^d | ‖x − v‖ < r}, is also d + 1.
- Axis-aligned rectangular parallelepipeds in ℝ^d, [a_1, b_1] × [a_2, b_2] × … × [a_d, b_d], have VC dimension 2d.

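A toy check (ours) of the last item for d = 1, where the parallelepipeds are intervals [a, b] and the claimed VC dimension is 2·1 = 2:

    def interval_shatters(points):
        """Do intervals [a, b] realize every subset of `points`?"""
        grid = [min(points) - 1] + sorted(points) + [max(points) + 1]
        realized = {tuple(a <= p <= b for p in points) for a in grid for b in grid}
        return len(realized) == 2 ** len(points)

    print(interval_shatters([0.0, 1.0]))       # True: two points are shattered
    print(interval_shatters([0.0, 1.0, 2.0]))  # False: {0, 2} without 1 is impossible

No three points on the line can be shattered, since an interval containing the two outer points must contain the middle one.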
SLIDE 35


Bounds on k-fold Intersections of Spherical Shells

Below, Δ denotes the VC dimension of 𝒞:
- For (ℝ^d, L2): Δ ⩽ k(8d + 12) ln(6k).
- For (ℝ^d, L∞): Δ ⩽ k(16d + 4) ln(6k).
- For (Σ^d, ρ): Δ ⩽ k(8d + 8 log₂ d + 4) ln(6k).

SLIDE 36


Main Result

From a sequence of metric spaces with measure (Ω_d, ρ_d, μ_d), d = 1, 2, 3, …, take i.i.d. samples (datasets) X_d ∼ μ_d. Assume:
- (Ω_d, ρ_d, μ_d) display concentration of measure;
- the VC dimension of closed balls in (Ω_d, ρ_d) is O(d);
- we build a pivot index using k pivots, where k = o(n_d/d);
- the sample size n_d satisfies d = ω(log n_d) and d = n_d^o(1);
- we perform queries of radius equal to the nearest-neighbour (NN) distance.

Then: for arbitrarily small ε, η > 0 there exists D such that for all d ⩾ D, the probability that at least half the queries on the dataset X_d take less than (1 − ε)n_d time is less than η.

SLIDE 38


Discussion

1. Rigorous, linear bounds.
2. Independent of the choice of pivots.
3. Somewhat artificial situation of growth in d and n.
