SLIDE 1

Analysis of hierarchical metric-tree indexing schemes for similarity search in high-dimensional datasets

Vladimir Pestov

vpest283@uottawa.ca http://aix1.uottawa.ca/~vpest283

Department of Mathematics and Statistics, University of Ottawa

Metric tree indexing schemes for similarity search. Vladimir Pestov, University of Ottawa. AofA 2008, Maresias, SP, Brazil.

SLIDE 10

General setting

Workload: W = (Ω, X, 𝒬), where:

Ω is the domain,
X ⊂ Ω is a finite subset (the dataset, or instance), and
𝒬 ⊆ 2^Ω is the set of queries.

Answering a query Q ∈ 𝒬 is listing all x ∈ X ∩ Q.

A (dis)similarity measure s: Ω × Ω → R, e.g. a metric, or a pseudometric.

A range similarity query centred at ω ∈ Ω:

Q = {x ∈ Ω: s(ω, x) < ε}
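A minimal sketch of answering a range query with no index at all (linear scan), which is the baseline any indexing scheme has to beat. The function name `range_query` and the toy workload (integers with s(a, b) = |a − b|) are assumptions made for illustration only.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")

def range_query(X: Iterable[T], s: Callable[[T, T], float], omega: T, eps: float) -> List[T]:
    """Answer the range query Q = {x : s(omega, x) < eps} by listing X ∩ Q."""
    return [x for x in X if s(omega, x) < eps]

# Hypothetical toy workload: integers with s(a, b) = |a - b|.
X = [1, 4, 7, 10, 15]
result = range_query(X, lambda a, b: abs(a - b), omega=8, eps=3)
print(result)  # [7, 10]
```

The scan evaluates s once per datapoint, so its cost is |X| distance computations per query.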

SLIDE 12

Similarity workloads

Range queries: W = (Ω, d, X, {B_ε(x)}), where B_ε(x) is the open ball of radius ε centred at x.

k-nearest neighbours (k-NN) query centred at x* ∈ Ω, where k ∈ N.
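A k-NN query can likewise be sketched as a linear scan. The name `knn_query` and the toy data are illustrative assumptions; ties are broken by position in the dataset.

```python
import heapq

def knn_query(X, d, x_star, k):
    """Return the k points of X closest to x_star under d, nearest first."""
    return heapq.nsmallest(k, X, key=lambda x: d(x_star, x))

# Hypothetical toy data: integers with d(a, b) = |a - b|.
X = [1, 4, 7, 10, 15]
nearest = knn_query(X, lambda a, b: abs(a - b), x_star=8, k=2)
print(nearest)  # [7, 10]
```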

SLIDE 17

Example

Ω = strings of length m = 10 from the alphabet Σ of 20 standard amino acids: Ω = Σ^10.

X = all peptide fragments of length 10 in the SwissProt database (as of 19-Oct-2002). |X| = 23,817,598.

Similarity measure given by the most common scoring matrix in sequence comparison, BLOSUM62, by

$$s(a, b) = \sum_{i=1}^{m} s(a_i, b_i)$$ (the ungapped score).

Converted into a quasi-metric d(a, b) = s(a, a) − s(a, b), generating the same set of queries (range and k-NN).

(joint with A. Stojmirović)
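A small sketch of the score-to-quasi-metric conversion. The three-letter score table below is a hypothetical toy, used only to keep the example short; a real run would load the full 20×20 BLOSUM62 matrix. The function names are assumptions as well.

```python
# Hypothetical toy substitution scores on a 3-letter alphabet (illustrative
# values only; a real run would load the full 20x20 BLOSUM62 matrix).
SCORES = {
    ("A", "A"): 4, ("R", "R"): 5, ("W", "W"): 11,
    ("A", "R"): -1, ("A", "W"): -3, ("R", "W"): -3,
}

def pair_score(x, y):
    """Symmetric lookup of the score of two letters."""
    return SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]

def ungapped_score(a, b):
    """s(a, b) = sum over positions i of s(a_i, b_i), for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("ungapped score needs strings of equal length")
    return sum(pair_score(x, y) for x, y in zip(a, b))

def quasi_metric(a, b):
    """d(a, b) = s(a, a) - s(a, b); in general d(a, b) != d(b, a)."""
    return ungapped_score(a, a) - ungapped_score(a, b)

d_ab = quasi_metric("AW", "RW")   # (4 + 11) - (-1 + 11) = 5
d_ba = quasi_metric("RW", "AW")   # (5 + 11) - (-1 + 11) = 6
```

For a fixed centre a, the condition d(a, b) < ε is the same as s(a, b) > s(a, a) − ε, which is why range and k-NN queries are unchanged by the conversion. The asymmetry (5 versus 6 here) is why d is only a quasi-metric.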

SLIDE 22

Inner vs outer

Inner workload if X = Ω,
Outer workload if |X| ≪ |Ω|.

Fragment example: outer,

|X|/|Ω| = 23,817,598/20^10 ≈ 0.0000023

Most points ω ∈ Ω have NN x ∈ X within ε = 25 (high biological relevance).

[Figure: µ(N_ε(X)) plotted against distance (axis range 5 to 35), with curves for the quasi-metric and the metric.]
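The density figure for the fragment example can be checked with a few lines of arithmetic; the variable names are illustrative.

```python
# Density of the fragment dataset inside the domain of all length-10 strings
# over a 20-letter alphabet (figures taken from the slide).
dataset_size = 23_817_598
domain_size = 20 ** 10          # |Ω| = |Σ|^m = 10,240,000,000,000

ratio = dataset_size / domain_size
print(f"{ratio:.7f}")           # prints 0.0000023
```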

SLIDE 28

Hierarchical tree index structures

A sequence of refining partitions of the domain. Space O(n).

To process a range query B_ε(ω), we traverse the tree all the way down to the leaf level.

What happens in each node?

SLIDE 34

Pruning

  • If B_ε(ω) ∩ B = ∅, the sub-tree descending from the node B can be pruned, that is, if it can be certified that

ω ∉ B_ε = {x ∈ Ω: d(x, B) < ε}.

  • Otherwise the search branches out.

[Figure: two panels showing regions A and B, the query point ω, and ε-neighbourhoods.]

How to “certify” that B_ε(ω) ∩ B = ∅?

SLIDE 41

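Decision functions

Let f : Ω → R be a 1-Lipschitz function,

|f(x) − f(y)| ≤ d(x, y) ∀x, y ∈ Ω,

such that f ↾ B ≤ 0. Then f ↾ B_ε < ε: every x ∈ B_ε has some y ∈ B with d(x, y) < ε, so f(x) ≤ f(y) + d(x, y) < ε.

[Figure: graph of f over B and its ε-neighbourhood, with points x, y and the value f(x).]

That is, f(ω) ≥ ε is a certificate that B_ε(ω) ∩ B = ∅.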
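One concrete decision function, sketched under the assumption that the node B is a closed Euclidean ball: f(x) = d(x, c) − r. This particular f, the helper names, and the sample points are illustrative choices, not taken from the slides.

```python
import math

def ball_decision_function(centre, radius):
    """Return f(x) = d(x, centre) - radius for the Euclidean metric d.

    f is 1-Lipschitz by the triangle inequality, and f <= 0 on the closed
    ball B of the given centre and radius.
    """
    return lambda x: math.dist(x, centre) - radius

def certifies_disjoint(f, omega, eps):
    """True when f(omega) >= eps, i.e. the query ball B_eps(omega) misses B."""
    return f(omega) >= eps

f = ball_decision_function(centre=(0.0, 0.0), radius=1.0)
far = certifies_disjoint(f, omega=(3.0, 0.0), eps=1.5)    # f(omega) = 2.0
near = certifies_disjoint(f, omega=(2.0, 0.0), eps=1.5)   # f(omega) = 1.0
```

Here `far` is a valid certificate, while for `near` no certificate is produced and the search has to descend into B.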

SLIDE 42

Metric trees

A metric tree for a metric similarity workload (Ω, ρ, X):

  • a binary rooted tree T,
  • a collection of partially defined 1-Lipschitz functions f_t : B_t → R for every inner node t (decision functions),
  • a collection of bins B_t ⊆ Ω for every leaf node t, containing pointers to elements X ∩ B_t,

such that

  • B_root(T) = Ω,
  • ∀ inner node t and child nodes t−, t+, B_t ⊆ B_{t−} ∪ B_{t+}.

When processing a range query B_ε(ω),

t− [t+] is accessed ⇐⇒ f_t(ω) < ε [resp. f_t(ω) > −ε].
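A minimal sketch of one metric tree of this kind, assuming decision functions of the form f_t(x) = d(x, pivot) − radius with the median distance as radius (a vantage-point style split). The class and function names, the leaf size, and the random pivot choice are all assumptions; the slides do not fix a particular construction.

```python
import math
import random
import statistics

class Node:
    """Inner node: pivot, radius, minus, plus. Leaf node: points (the bin)."""
    def __init__(self, pivot=None, radius=0.0, minus=None, plus=None, points=None):
        self.pivot, self.radius = pivot, radius
        self.minus, self.plus, self.points = minus, plus, points

def build(points, d, rng, leaf_size=4):
    """Split on f_t(x) = d(x, pivot) - radius: t- holds f_t <= 0, t+ holds f_t > 0."""
    if len(points) <= leaf_size:
        return Node(points=list(points))
    pivot = rng.choice(points)
    radius = statistics.median([d(pivot, x) for x in points])
    minus = [x for x in points if d(pivot, x) <= radius]
    plus = [x for x in points if d(pivot, x) > radius]
    if not plus:                      # degenerate split: keep everything in one bin
        return Node(points=list(points))
    return Node(pivot, radius, build(minus, d, rng, leaf_size), build(plus, d, rng, leaf_size))

def range_search(node, d, omega, eps, out):
    """Collect all stored x with d(omega, x) < eps, pruning with f_t(omega)."""
    if node.points is not None:
        out.extend(x for x in node.points if d(omega, x) < eps)
        return out
    f = d(omega, node.pivot) - node.radius
    if f < eps:                       # t- is accessed  <=>  f_t(omega) < eps
        range_search(node.minus, d, omega, eps, out)
    if f > -eps:                      # t+ is accessed  <=>  f_t(omega) > -eps
        range_search(node.plus, d, omega, eps, out)
    return out

rng = random.Random(0)
X = [(rng.random(), rng.random()) for _ in range(200)]
tree = build(X, math.dist, rng)
omega, eps = (0.5, 0.5), 0.1
found = sorted(range_search(tree, math.dist, omega, eps, []))
expected = sorted(x for x in X if math.dist(omega, x) < eps)
```

Comparing `found` with the linear-scan answer `expected` is a quick sanity check that the two access rules never prune a sub-tree containing an answer.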

SLIDE 46

What happens in practice?

The best indexing schemes for exact similarity search in high-dimensional outer datasets are often (not always!) outperformed by linear scan.

∗ ∗ ∗

The emphasis has shifted towards approximate similarity search: given ε > 0 and ω ∈ Ω, return a point that is [with high probability] at a distance < (1 + ε)d_NN(ω) from ω.

SLIDE 54

The curse of dimensionality conjecture

Conjecture. Let X ⊆ {0, 1}^d be a dataset with n points, where the Hamming cube is equipped with the Hamming (ℓ1) distance:

d(x, y) = ♯{i: x_i ≠ y_i}.

Suppose d = n^{o(1)}, but d = ω(log n). Any data structure for exact nearest neighbour search in X, with d^{O(1)} query time, must use n^{ω(1)} space.

∗ ∗ ∗

The cell probe model: Ω(d/ log n) lower bound (Barkol–Rabani, 2000).
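For reference, the Hamming distance in the conjecture counts the coordinates in which two points differ. A short sketch, with exact nearest neighbour search done by linear scan (function names and toy data are illustrative):

```python
def hamming_distance(x, y):
    """d(x, y) = number of coordinates i with x_i != y_i."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return sum(xi != yi for xi, yi in zip(x, y))

def nearest_neighbour(X, omega):
    """Exact nearest neighbour by linear scan: O(n * d) time, no index."""
    return min(X, key=lambda x: hamming_distance(omega, x))

X = [(0, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 1)]
nn = nearest_neighbour(X, (1, 0, 0, 0))   # first of the two points at distance 1
```

The scan sits at one end of the trade-off in the conjecture: linear space, but query time proportional to n rather than polynomial in d.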

SLIDE 58

Concentration of measure

The phenomenon of concentration of measure on high-dimensional structures (“Geometric LLN”): for a typical “high-dimensional” structure Ω, if A is a subset containing at least half of all points, then the measure of the ε-neighbourhood A_ε of A is overwhelmingly close to 1 already for small ε > 0.

[Figure: a set A containing at least half of all points and its ε-neighbourhood A_ε inside Ω; α(Ω, ε) bounds µ(Ω \ A_ε) from above.]

SLIDE 63

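Concentration function

Let Ω = (Ω, d, µ) be a metric space with measure. The concentration function of Ω:

$$\alpha(\varepsilon) = \begin{cases} \dfrac{1}{2}, & \text{if } \varepsilon = 0, \\[2mm] 1 - \min\left\{ \mu_\sharp(A_\varepsilon) : A \subseteq \Omega,\ \mu_\sharp(A) \geq \dfrac{1}{2} \right\}, & \text{if } \varepsilon > 0. \end{cases}$$

For Ω = Σ^n, the Hamming cube (normalized distance + uniform measure):

$$\alpha_{\Sigma^n}(\varepsilon) \leq e^{-2\varepsilon^2 n}.$$

Gaussian estimates are typical (Euclidean spheres S^n, cubes I^n, ...)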

SLIDE 64

Example: the Hamming cube

[Figure: concentration function α(Σ^101, ε) versus the Chernoff bound, n = 101.]
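A comparison of this kind can be sketched numerically. The code below evaluates 1 − µ(A_ε) for one specific half-measure set, the Hamming ball A of points with at most (n − 1)/2 ones, and compares it with e^{−2ε²n}. Hamming balls are the natural candidate for the extremal set (Harper's isoperimetric theorem), but the code only evaluates this one set rather than a minimum over all A; the function names are assumptions.

```python
import math

def half_ball_tail(n, eps):
    """1 - mu(A_eps) for the half-measure Hamming ball A in {0,1}^n (n odd).

    A = {x : weight(x) <= (n - 1) / 2}, mu is the uniform measure, the
    distance is Hamming distance divided by n, and A_eps = {x : d(x, A) < eps}.
    """
    if n % 2 != 1 or eps <= 0:
        raise ValueError("need odd n and eps > 0")
    half = (n - 1) // 2
    # d(x, A) = max(0, weight(x) - half) / n, so x lies outside A_eps
    # exactly when weight(x) >= half + eps * n.
    k_min = half + math.ceil(eps * n)
    return sum(math.comb(n, k) for k in range(k_min, n + 1)) / 2 ** n

def chernoff_bound(n, eps):
    """The Gaussian-type upper bound exp(-2 * eps^2 * n)."""
    return math.exp(-2 * eps ** 2 * n)

n = 101
table = [(eps, half_ball_tail(n, eps), chernoff_bound(n, eps)) for eps in (0.05, 0.1, 0.2)]
for eps, tail, bound in table:
    print(f"eps={eps:.2f}  tail={tail:.3e}  bound={bound:.3e}")
```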

SLIDE 67

Effects of concentration on branching

[Figure: node C split into A and B, with the query point ω, ε-neighbourhoods, and two regions each of measure < α(C, ε).]

For all query points ω ∈ C except a set of measure ≤ 2α(C, ε), the search algorithm branches out at the node C.

SLIDE 68

Search radius

ε_NN(ω) is a 1-Lipschitz function, so concentrates near the median value, ε_M;

ε_M → E_{µ⊗µ} d(x, y) = O(1).

Example: 1000 pts ∼ [0, 1]^10, the ℓ2-ε_NN:

ε_M = 0.69419, E d(x, y) = 1.2765.
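An experiment of this kind can be sketched as a small simulation. The assumptions here go beyond the slide: points and queries are taken to be uniform and independent on the cube, and the sample sizes and seed are arbitrary, so the output should be of the same order as the quoted figures rather than identical to them.

```python
import math
import random
import statistics

def simulate(n=1000, dim=10, queries=200, pairs=20000, seed=0):
    """Estimate the median NN distance and the mean pairwise distance.

    Assumption (not stated on the slide): points and queries are drawn
    uniformly and independently from the unit cube [0, 1]^dim.
    """
    rng = random.Random(seed)

    def sample():
        return tuple(rng.random() for _ in range(dim))

    X = [sample() for _ in range(n)]
    # Distance from each random query point to its nearest datapoint.
    nn_dists = [min(math.dist(q, x) for x in X)
                for q in (sample() for _ in range(queries))]
    # Monte Carlo estimate of E d(x, y) over independent pairs.
    pair_dists = [math.dist(sample(), sample()) for _ in range(pairs)]
    return statistics.median(nn_dists), statistics.fmean(pair_dists)

eps_median, mean_dist = simulate()
print(f"median NN distance ~ {eps_median:.3f}, mean pairwise distance ~ {mean_dist:.3f}")
```

Even with only 10 dimensions, the median search radius is a sizeable fraction of the typical distance between two points, which is what makes pruning hard.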

SLIDE 76

A naive average O(n) lower bound

Suppose datapoints are distributed according to µ ∈ P(Ω)... ...as well as query points.

A balanced metric tree of depth O(log n), with O(n) bins of roughly equal size (µ-measure).

In 1/2 the cases, ε_NN ≥ ε_M = O(1), the median NN distance.

For every element A of the level t partition,

$$\alpha(A, \varepsilon_M) \leq 2\mu(A)^{-1}\,\alpha(\Omega, \varepsilon_M/2) = O(2^t)\, e^{-O(1)\varepsilon_M^2 d}.$$

Branching at every node occurs for all ω except a set of measure at most

$$\sharp(\text{nodes}) \times 2 \sup_A \alpha(A, \varepsilon) = O(n^2)\, e^{-O(1) d} = o(1),$$

because d = ω(log n), so e^{−O(1)d} is superpolynomially small in n.
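The final step uses only that d grows faster than log n. A short worked version, writing c > 0 for the unspecified constant hidden in the exponent (the name c and the choice C = 3/c are illustrative assumptions):

```latex
Since $d = \omega(\log n)$, for every fixed $C > 0$ we have $d \geq C \ln n$
for all sufficiently large $n$. Hence
\[
  n^2 e^{-c d} \;\leq\; n^2 e^{-c C \ln n} \;=\; n^{2 - cC},
\]
and taking $C = 3/c$ gives $n^2 e^{-c d} \leq n^{-1} \to 0$ as $n \to \infty$.
```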

SLIDE 81

What’s wrong?

A dataset X is modeled by a sequence of i.i.d. r.v. X_i ∼ µ.

Implicit assumption: the empirical measure µ_n(A) = |X ∩ A|/n ≈ µ(A).

But the scheme is chosen after seeing an instance X!

[Figure: plot on the unit square, both axes from 0 to 1.]

How much can be said of concentration in (Ω, µ_n)?
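A toy illustration of why choosing sets after seeing the instance breaks the implicit assumption. The setting (uniform measure on [0, 1], the two sets, the seed) is hypothetical and chosen only to make the effect extreme.

```python
import random

rng = random.Random(0)
n = 1000
X = [rng.random() for _ in range(n)]   # i.i.d. sample, uniform measure mu on [0, 1]

def empirical_measure(indicator, sample):
    """mu_n(A) = |X ∩ A| / n for the set A given by its indicator function."""
    return sum(1 for x in sample if indicator(x)) / len(sample)

def fixed_set(x):
    """A = [0, 0.3), chosen before seeing X; mu(A) = 0.3."""
    return x < 0.3

sample_points = frozenset(X)

def avoiding_set(x):
    """A' = [0, 1] minus the sample, chosen after seeing X; mu(A') = 1."""
    return x not in sample_points

mu_n_fixed = empirical_measure(fixed_set, X)        # close to 0.3
mu_n_avoiding = empirical_measure(avoiding_set, X)  # exactly 0.0
```

For the fixed set the empirical and true measures agree up to sampling error; for the data-dependent set they differ by 1. Controlling this gap uniformly over a whole class of sets is the role of VC-type bounds.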

SLIDE 85

VC dimension

Let 𝒜 be a family of subsets of Ω (a concept class).

B ⊆ Ω is shattered by 𝒜 if for each C ⊆ B there is A ∈ 𝒜 such that

A ∩ B = C.

[Figure: sets A, B and C inside Ω, with A ∩ B = C.]

The Vapnik–Chervonenkis dimension VC-dim(𝒜) of 𝒜 is the largest cardinality of a set B ⊆ Ω shattered by 𝒜.
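For small finite examples the definition can be checked by brute force. The sketch below assumes a finite domain and a non-empty family; the function names and the interval example are illustrative.

```python
from itertools import combinations

def is_shattered(B, family):
    """B is shattered if the traces A ∩ B, A in the family, give all subsets of B."""
    B = frozenset(B)
    traces = {frozenset(A) & B for A in family}
    return len(traces) == 2 ** len(B)

def vc_dimension(omega, family):
    """Largest size of a shattered subset of the finite domain omega.

    Assumes a non-empty family. The search can stop at the first size with
    no shattered set, because every subset of a shattered set is shattered.
    """
    dim = 0
    for size in range(1, len(omega) + 1):
        if any(is_shattered(B, family) for B in combinations(omega, size)):
            dim = size
        else:
            break
    return dim

# Hypothetical example: all intervals {i, ..., j-1} in {0, ..., 4}, plus the empty set.
omega = list(range(5))
intervals = [set(range(i, j)) for i in range(6) for j in range(i, 6)]
print(vc_dimension(omega, intervals))  # 2
```

Intervals shatter any two points but no three points a < b < c, since no interval contains a and c while missing b.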

SLIDE 90

Statistical learning bounds

Let 𝒜 ⊆ 2^Ω be a concept class of finite VC dimension, d. Then for all ε, δ > 0 and every probability measure µ on Ω, if n datapoints in X are drawn randomly and independently according to µ, then with confidence 1 − δ

$$\forall A \in \mathcal{A}, \quad \left| \mu(A) - \frac{|X \cap A|}{n} \right| < \varepsilon,$$

provided n is large enough:

$$n \geq \frac{128}{\varepsilon^2} \left( d \log\left( \frac{2e^2}{\varepsilon} \log\frac{2e}{\varepsilon} \right) + \log\frac{8}{\delta} \right).$$
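To get a feel for the sample sizes involved, the condition can be evaluated numerically. This sketch reads the bound as n ≥ (128/ε²)(d log((2e²/ε) log(2e/ε)) + log(8/δ)) and assumes natural logarithms; both the reading of the formula and the function name are assumptions.

```python
import math

def vc_sample_size(d, eps, delta):
    """Smallest integer n satisfying the sample-size condition as read above.

    Assumptions: natural logarithms, VC dimension d >= 1, 0 < eps < 1, 0 < delta < 1.
    """
    inner = (2 * math.e ** 2 / eps) * math.log(2 * math.e / eps)
    return math.ceil(128 / eps ** 2 * (d * math.log(inner) + math.log(8 / delta)))

n_needed = vc_sample_size(d=10, eps=0.1, delta=0.05)
print(n_needed)
```

The dependence on the VC dimension is linear and the dependence on 1/ε is roughly quadratic, so the bound is mainly of qualitative use.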

SLIDE 94

Bin access lemma

Let δ > 0, and let γ be a collection of subsets A ⊆ Ω of measure µ(A) ≤ α(δ) ≤ 1/4 each, satisfying µ(∪γ) ≥ 1/2.

Then the 2δ-neighbourhood of every point ω ∈ Ω, apart from a set of measure at most $\frac{1}{2}\alpha(\delta)^{1/2}$, meets at least $\left\lceil \frac{1}{2}\alpha(\delta)^{-1/2} \right\rceil$ elements of γ.

∗ ∗ ∗

If we can now guarantee that the bins are not too large, we get a lower bound on the number of bin accesses.
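A worked instance of the lemma, with the illustrative value α(δ) = 10^{-4} (an assumption chosen for round numbers, reading the two quantities in the lemma as ½α(δ)^{1/2} and ⌈½α(δ)^{-1/2}⌉):

```latex
\[
  \tfrac{1}{2}\,\alpha(\delta)^{1/2} = \tfrac{1}{2}\cdot 10^{-2} = 0.005,
  \qquad
  \left\lceil \tfrac{1}{2}\,\alpha(\delta)^{-1/2} \right\rceil
    = \left\lceil \tfrac{1}{2}\cdot 100 \right\rceil = 50 .
\]
```

So for all query points outside a set of measure at most 0.005, the 2δ-neighbourhood meets at least 50 elements of γ.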

SLIDE 99

Bin complexity estimates

Let F be a class of 1-Lipschitz functions used for constructing a metric tree of a particular type. Let A be the concept class of all solution sets to inequalities

f ≥ a, f ∈ F, a ∈ R.

Suppose

p = VC-dim(A) < ∞

(pseudodimension of F in the sense of Vapnik). Denote by B the class of all bins of all possible metric trees of depth ≤ h built using F. Then

$$ \text{VC-dim}(B) \le 2hp \log(hp) = O(hp \log(hp)). $$
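The bound on VC-dim(B) is easy to tabulate for concrete values of h and p. The sketch below is illustrative only: the slide does not state the base of the logarithm, so base 2 is an assumption, and the function name `bin_vc_bound` is hypothetical.

```python
import math


def bin_vc_bound(depth, pseudodim):
    """Upper bound 2*h*p*log(h*p) on the VC dimension of the class of bins.

    Assumption: log is taken to base 2 (the base is not given in the slides).
    """
    if depth < 1 or pseudodim < 1:
        raise ValueError("need depth >= 1 and pseudodim >= 1")
    hp = depth * pseudodim
    return 2 * hp * math.log2(hp)


if __name__ == "__main__":
    # The bound exceeds h*p only by a logarithmic factor.
    for h, p in ((4, 2), (16, 8), (64, 32)):
        print(h, p, bin_vc_bound(h, p), bin_vc_bound(h, p) / (h * p))
```

When both h and p are polynomial in d, the bound stays polynomial in d, which is what the lower-bound argument needs.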

slide-104
SLIDE 104

Rigorous lower bounds

Theorem. Let F be a class of 1-Lipschitz functions on $\{0,1\}^d$ such that the VC dimension of the class of sets given by inequalities f ≥ a is poly(d).

With probability approaching 1, every metric tree indexing scheme for a random sample X of $\{0,1\}^d$ containing n points, where $d = n^{o(1)}$ and $d = \omega(\log n)$, will have worst-case performance $d^{\omega(1)}$.

⊳ We can suppose that every bin contains poly(d) datapoints and that the tree depth is poly(d). The VC dimension of the class of all possible bins is then poly(d) = o(n). Taking $\varepsilon = n^{-1/2+\gamma}$, the learning estimates show that the measure of each bin of the scheme is $O(n^{-1/2+\gamma})$, so by the bin access lemma there will be $\Omega(n^{1/4-\gamma})$ bin accesses, which is $d^{\omega(1)}$ because $d = n^{o(1)}$. ⊲


slide-105
SLIDE 105

Example: vp-tree

The vp-tree (Yianilos) uses decision functions of the form

$$ f_t(\omega) = \tfrac{1}{2}\bigl(\rho(x_{t_+}, \omega) - \rho(x_{t_-}, \omega)\bigr), $$

where t± are the two children of t and $x_{t_\pm}$ are the vantage points for the node t.

If Ω = R^d, the VC dimension is d + 1.
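The decision function above can be sketched in a few lines of code, together with a numerical check that it is 1-Lipschitz and the pruning rule that this property allows. This is an illustration under assumptions, not the slides' implementation: points are tuples in Euclidean space with ρ = `math.dist`, and the names `vp_decision`, `sides_to_visit` and `max_lipschitz_ratio` are hypothetical. The slide does not say which sign of f_t corresponds to which child, so the code only reports which side of the split {f < 0} / {f ≥ 0} a query ball can reach.

```python
import math
import random


def vp_decision(x_plus, x_minus, omega, rho=math.dist):
    """f_t(omega) = (1/2) * (rho(x_plus, omega) - rho(x_minus, omega))."""
    return 0.5 * (rho(x_plus, omega) - rho(x_minus, omega))


def sides_to_visit(f_q, radius):
    """Sides of the split that a ball of the given radius around q can meet.

    Because f_t is 1-Lipschitz, every point within `radius` of q has a value
    of f_t in the interval [f_q - radius, f_q + radius].
    """
    sides = []
    if f_q - radius < 0:
        sides.append("negative")
    if f_q + radius >= 0:
        sides.append("nonnegative")
    return sides


def max_lipschitz_ratio(trials=1000, dim=3, seed=0):
    """Largest observed |f(u) - f(v)| / rho(u, v) over random Euclidean points."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x_plus, x_minus, u, v = (
            tuple(rng.uniform(-1.0, 1.0) for _ in range(dim)) for _ in range(4)
        )
        gap = math.dist(u, v)
        if gap > 0.0:
            diff = abs(vp_decision(x_plus, x_minus, u) - vp_decision(x_plus, x_minus, v))
            worst = max(worst, diff / gap)
    return worst


if __name__ == "__main__":
    f_q = vp_decision((0.0, 0.0), (4.0, 0.0), (3.0, 0.0))
    print(f_q, sides_to_visit(f_q, 0.5), sides_to_visit(f_q, 2.0))
    print(max_lipschitz_ratio())
```

A query with a small radius far from the decision boundary visits one side only; once the radius exceeds |f_t(q)|, both sides must be searched.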


slide-106
SLIDE 106

Example: M-tree

The M-tree (Ciaccia, Patella, Zezula) employs decision functions

$$ f_t(\omega) = \rho(x_t, \omega) - \sup_{\tau \in B_t} \rho(x_t, \tau), $$

where B_t is a block corresponding to the node t, x_t is a datapoint chosen for each node t, and the suprema on the r.h.s. are precomputed and stored.

If Ω = R^d, the VC dimension is d + 1; for Ω = $\{0,1\}^d$, it is O(d).
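A minimal sketch of how such a decision function can be used at query time is given below. It assumes a finite block B_t (so the supremum is a maximum), Euclidean points as tuples with ρ = `math.dist`, and hypothetical names `covering_radius`, `m_decision` and `can_skip_block`. The pruning test relies only on the triangle inequality: if f_t(q) exceeds the query radius, no point of B_t lies within that radius of q.

```python
import math


def covering_radius(x_t, block, rho=math.dist):
    """Precomputed sup over tau in B_t of rho(x_t, tau), for a finite block."""
    return max(rho(x_t, tau) for tau in block)


def m_decision(x_t, radius_t, omega, rho=math.dist):
    """f_t(omega) = rho(x_t, omega) - covering radius of the block B_t."""
    return rho(x_t, omega) - radius_t


def can_skip_block(x_t, radius_t, query, query_radius, rho=math.dist):
    """True if the ball of radius query_radius around query cannot meet B_t.

    For tau in B_t: rho(query, tau) >= rho(x_t, query) - rho(x_t, tau)
                                    >= f_t(query) > query_radius.
    """
    return m_decision(x_t, radius_t, query, rho) > query_radius


if __name__ == "__main__":
    x_t = (0.0, 0.0)
    block = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
    r_t = covering_radius(x_t, block)
    print(r_t)                                         # 2.0
    print(can_skip_block(x_t, r_t, (5.0, 0.0), 1.0))   # True
    print(can_skip_block(x_t, r_t, (2.5, 0.0), 1.0))   # False
```

Storing the covering radius with the node means a single distance computation ρ(x_t, q) decides whether the whole block can be discarded.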
