
SLIDE 1

The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Yasushi Sakurai (NTT Cyber Space Laboratories)
Masatoshi Yoshikawa (Nara Institute of Science and Technology)
Shunsuke Uemura (Nara Institute of Science and Technology)
Haruhiko Kojima (NTT Cyber Solutions Laboratories)

SLIDE 2

Introduction

■ Demand

– High-performance multimedia database systems
– Content-based retrieval with high speed and accuracy

■ Multimedia databases

– Large size
– Various features, high-dimensional data

■ More efficient spatial indices for high-dimensional data

SLIDE 3

Our Approach

■ The VA-File and the SR-tree are excellent search methods for high-dimensional data.

■ Comparing the two motivated the concept of the A-tree.

– No comparison of the two had been reported.
– We performed experiments using various data sets.

■ Approximation tree (A-tree)

– Relative approximation: MBRs and data objects are approximated based on their parent MBR.
– About 77% reduction in the number of page accesses compared with the VA-File and the SR-tree

SLIDE 4

Related Work (1)

■ R-tree family

– Tree structure using MBRs (Minimum Bounding Rectangles) and/or MBSs (Minimum Bounding Spheres)
– SR-tree:

  Structured by both MBRs and MBSs
  Outperforms SS-tree and R*-tree for 16-dimensional data

[Figure: R-tree example with rectangles R1 to R8 held in non-leaf and leaf nodes]

SLIDE 5

Related Work (2)

■ VA-File (Vector Approximation File)

– Uses an approximation file and a vector file

1. Divide the entire data space into cells.
2. Approximate the vector data by using the cells, then create the approximation file.
3. Select candidate vectors by scanning the approximation file.
4. Access the candidate vectors in the vector file.

– Better than X-tree and R*-tree beyond dimensionality of 6

[Figure: data space divided into cells labeled 00, 01, 10, 11 along each axis; vector data (0.6, 0.8, 0.9, 0.1) and its approximation (10, 11, 11, 00)]
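The four VA-File steps can be sketched as follows. This is a minimal illustration, not the VA-File authors' implementation: it assumes coordinates normalized to [0, 1), equal-width cells, Euclidean distance, and a 1-nearest-neighbor query. The function names (`approximate`, `lower_bound`, `nearest`) are hypothetical.

```python
import math

def approximate(vector, bits):
    """Steps 1-2: map each coordinate in [0, 1) to its cell index (2**bits cells per axis)."""
    cells = 2 ** bits
    return tuple(min(int(x * cells), cells - 1) for x in vector)

def lower_bound(query, approx, bits):
    """Smallest possible Euclidean distance from the query to any point in the cell."""
    width = 1.0 / 2 ** bits
    total = 0.0
    for q, c in zip(query, approx):
        lo, hi = c * width, (c + 1) * width
        gap = max(lo - q, 0.0, q - hi)
        total += gap * gap
    return math.sqrt(total)

def nearest(query, vectors, bits=2):
    """Return (index of nearest vector, number of vector-file accesses)."""
    approx_file = [approximate(v, bits) for v in vectors]      # approximation file
    # Step 3: scan the approximation file and order candidates by lower bound.
    order = sorted(range(len(vectors)),
                   key=lambda i: lower_bound(query, approx_file[i], bits))
    best, best_dist, accessed = None, math.inf, 0
    for i in order:
        if lower_bound(query, approx_file[i], bits) >= best_dist:
            break                      # no remaining cell can contain a closer vector
        accessed += 1                  # Step 4: access the real vector
        d = math.dist(query, vectors[i])
        if d < best_dist:
            best, best_dist = i, d
    return best, accessed

vectors = [(0.6, 0.8, 0.9, 0.1), (0.1, 0.1, 0.1, 0.1), (0.9, 0.9, 0.9, 0.9)]
print(approximate(vectors[0], 2))      # cell indices 2, 3, 3, 0, i.e. codes 10, 11, 11, 00
print(nearest((0.62, 0.8, 0.88, 0.12), vectors))
```

With 2 bits per dimension, the first vector maps to the same codes as in the slide's figure, and only one of the three vectors has to be read from the vector file for this query.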

SLIDE 6

Experimental Results and Analysis

-- Properties of the SR-tree --

■ Structure suitable for non-uniformly distributed data

– Structure changes according to data distribution.

■ Large entry size for high-dimensional spaces

– Large entries → small fanout → many node accesses

■ Changing node size and fanout

– Larger node size does NOT lead to low IO cost.
– Larger fanout always contributes to the reduction in node accesses.

■ MBS contribution

– The contribution of MBSs in node pruning is small in high-dimensional spaces.

SLIDE 7

Experimental Results and Analysis

-- Properties of the VA-File --

■ Data skew degrades search performance.

– Absolute approximation: the approximation is independent of data distribution.
– Effective for uniformly distributed data
– Unsuitable for non-uniformly distributed data

  A large amount of dense data tends to be approximated by the same value.
  Absolute approximation leads to large approximation errors.

SLIDE 8

The A-tree (Approximation tree)

■ Ideas from the SR-tree and VA-File comparison:

– Tree structure

  Tree structures are suitable for non-uniformly distributed data.

– Relative approximation

  MBRs and data objects are approximated based on their parent bounding rectangle.
  Small approximation error
  Small entry size and large fanout → low IO cost

– Partial usage of MBSs in high-dimensional searches

  MBSs are not stored in the A-tree.
  The centroid of data objects in a subtree is used only for update.

SLIDE 9

Virtual Bounding Rectangle (VBR)

■ C approximates a rectangle B.
■ C is calculated from rectangles A and B.
■ Search using VBRs guarantees the same result as that of MBRs.

[Figure: Rectangle A with corners (4, 4), (4, 20), (28, 20), (28, 4); Rectangle B with corners (11, 11), (11, 15), (21, 15), (21, 11); VBR C with corners (10, 10), (10, 16), (22, 16), (22, 10)]

SLIDE 10

Subspace Code

■ A subspace code represents a VBR.
■ The edge of child MBR B is quantized in relation to the edge of parent MBR A.
■ The edge of B is approximated as a pair of 8-ary codes (1, 2) or binary codes (001, 010).

[Figure: i-th dimensional coordinate axis; the edge of rectangle A spans 3 to 19 and is divided into numbered cells; the edge of rectangle B spans 6 to 8]
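A minimal sketch of the per-dimension quantization, assuming equal-width cells numbered from 0 and a conservative upper code (the last cell that still covers the child's upper edge). The function name `edge_code` and this rounding convention are assumptions for illustration; the slides only give the resulting codes.

```python
import math

def edge_code(parent_lo, parent_hi, child_lo, child_hi, bits=3):
    """Quantize the child edge [child_lo, child_hi] relative to the parent edge.

    Returns a pair (lower code, upper code), each in 0 .. 2**bits - 1.
    Assumes parent_hi > parent_lo and that the child lies inside the parent.
    """
    cells = 2 ** bits
    width = (parent_hi - parent_lo) / cells
    # Lower code: the cell that contains the child's lower end.
    lo = min(int((child_lo - parent_lo) / width), cells - 1)
    # Upper code: the first cell whose right boundary reaches the child's upper end.
    hi = math.ceil((child_hi - parent_lo) / width) - 1
    hi = min(max(hi, lo), cells - 1)
    return lo, hi

lo, hi = edge_code(3, 19, 6, 8)          # edge of A: 3..19, edge of B: 6..8
print((lo, hi), format(lo, "03b"), format(hi, "03b"))
```

With the values read from the slide's figure (edge of A from 3 to 19, edge of B from 6 to 8), this gives the 8-ary pair (1, 2), i.e. binary 001 and 010.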

SLIDE 11

Subspace Code

■ C is the VBR of B in A.
■ C is represented by the subspace codes:

S = (010, 011, 101, 101)

[Figure: Rectangle A, Rectangle B, and VBR C, labeled with the codes 010, 011, 101, 101]
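As a check on the example, the VBR can be reconstructed from the parent rectangle and the code alone. The sketch below assumes the code is ordered as (lower x, lower y, upper x, upper y) and that rectangle A is the one with corners (4, 4) and (28, 20) from the VBR slide; both are readings of the figures, not statements from the slides. `decode_vbr` is a hypothetical name.

```python
def decode_vbr(parent, codes, bits=3):
    """Rebuild a VBR from its parent rectangle and subspace code.

    parent: ((low_1, ..., low_d), (high_1, ..., high_d))
    codes:  (lower codes for each dimension, then upper codes for each dimension)
    """
    lows, highs = parent
    dims = len(lows)
    cells = 2 ** bits
    vbr_lo, vbr_hi = [], []
    for d in range(dims):
        width = (highs[d] - lows[d]) / cells
        vbr_lo.append(lows[d] + codes[d] * width)                # left side of lower cell
        vbr_hi.append(lows[d] + (codes[dims + d] + 1) * width)   # right side of upper cell
    return tuple(vbr_lo), tuple(vbr_hi)

rect_a = ((4, 4), (28, 20))
s = (0b010, 0b011, 0b101, 0b101)
print(decode_vbr(rect_a, s))
```

Under these assumptions the decoded rectangle has corners (10, 10) and (22, 16), which matches VBR C in the earlier figure and encloses rectangle B.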

SLIDE 12

The A-tree Structure

■ Relative approximation:

– MBRs and data objects in child nodes are approximated based on parent MBR.

■ Configuration

– One node contains partial information of rectangles in two consecutive generations.

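[Figure: example A-tree and the corresponding rectangles. The tree nodes hold M1 to M4, SC(V1) to SC(V4), SC(C1), SC(C2), CD1 to CD4, and data objects P1, P2; the spatial view shows R (entire space), MBRs M1 to M4, VBRs V1 to V4, C1, C2, and P1, P2]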

SLIDE 21

The A-tree Structure

P1 and P2: data objects
M1 -- M4: MBRs
SC(V1) -- SC(V4): subspace codes of VBRs for the MBRs
SC(C1) and SC(C2): subspace codes of VBRs for the data objects
CD1 -- CD4: centroids of the data objects in the subtrees

[Figure: example A-tree and the corresponding rectangles in R (entire space)]

SLIDE 22

The A-tree Structure

■ Data nodes
■ Index nodes

– leaf nodes
– intermediate nodes
– root node

[Figure: example A-tree divided into index nodes and data nodes]

SLIDE 23

The A-tree Structure

■ Data node

– data objects
– pointers to the data description records

[Figure: example A-tree with the data node labeled]

SLIDE 24

The A-tree Structure

■ Leaf node

– an MBR
– a pointer to the data node
– subspace codes of VBRs

[Figure: example A-tree with the leaf nodes labeled]

SLIDE 25

The A-tree Structure

■ Intermediate node

– an MBR
– a list of entries, each containing:

  a pointer to the child node
  the subspace code of a VBR
  the centroid of data objects in the subtree
  the number of the data objects

[Figure: example A-tree with the intermediate nodes labeled]

SLIDE 26

The A-tree Structure

■ Root node:

– a list of entries, each containing:

  a pointer to the child node
  the subspace code of a VBR
  the centroid of data objects in the subtree
  the number of the data objects

[Figure: example A-tree with the root node labeled]
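The node contents listed on the structure slides can be summarized as plain record types. This is only an in-memory sketch: the class and field names are hypothetical, and the on-disk page layout is not described in the slides.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

Point = Tuple[float, ...]
Rect = Tuple[Point, Point]          # (lows, highs)
Code = Tuple[int, ...]              # subspace code of one VBR

@dataclass
class DataNode:
    objects: List[Point]            # data objects
    record_ptrs: List[int]          # pointers to the data description records

@dataclass
class LeafNode:
    mbr: Rect                       # MBR enclosing the data objects
    data_node: DataNode             # pointer to the data node
    codes: List[Code]               # subspace codes of VBRs, one per data object

@dataclass
class Entry:
    child: Union["IndexNode", LeafNode]   # pointer to the child node
    code: Code                            # subspace code of the child's VBR
    centroid: Point                       # centroid of data objects in the subtree
    count: int                            # number of the data objects

@dataclass
class IndexNode:
    entries: List[Entry]
    mbr: Optional[Rect] = None      # intermediate node: its MBR; root node: None

data = DataNode(objects=[(1.0, 2.0), (3.0, 4.0)], record_ptrs=[10, 11])
leaf = LeafNode(mbr=((1.0, 2.0), (3.0, 4.0)), data_node=data,
                codes=[(0, 0, 0, 0), (7, 7, 7, 7)])
root = IndexNode(entries=[Entry(child=leaf, code=(0, 0, 7, 7),
                                centroid=(2.0, 3.0), count=2)])
```

The root has no MBR of its own, in line with the search slides, where the entire space takes that role.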

SLIDE 27

Search Algorithm

■ Basic ideas:

– VBRs are calculated from the parent MBR and the subspace codes.
– Exception: the entire space is used in the root node.
– The algorithm uses the calculated VBRs for pruning.

[Figure: example A-tree with the root node labeled, and the corresponding rectangles in R (entire space)]

SLIDE 31

Search Algorithm

■ Calculate V1 and V2 from R, SC(V1) and SC(V2)
■ Calculate V3 and V4 from M1, SC(V3) and SC(V4)
■ Calculate C1 and C2 from M3, SC(C1) and SC(C2)
■ Access P1

[Figure: example A-tree and the corresponding rectangles in R (entire space), with the query point]
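The walk-through above can be sketched as a best-first nearest-neighbor search that prunes with VBRs decoded at each level. The slides do not specify the traversal order, so the priority queue, the dict-based node layout (leaf and data node merged for brevity), and the names `encode`, `decode`, `mindist`, `make_leaf`, and `nearest` are all assumptions for illustration.

```python
import heapq
import math

def encode(parent, rect, bits=3):
    """Subspace code of rect relative to parent: lower codes, then upper codes."""
    cells = 2 ** bits
    lo_codes, hi_codes = [], []
    for a, b, lo, hi in zip(parent[0], parent[1], rect[0], rect[1]):
        width = (b - a) / cells                  # assumes a non-degenerate parent edge
        lc = min(int((lo - a) / width), cells - 1)
        hc = min(max(math.ceil((hi - a) / width) - 1, lc), cells - 1)
        lo_codes.append(lc)
        hi_codes.append(hc)
    return tuple(lo_codes + hi_codes)

def decode(parent, code, bits=3):
    """VBR (lows, highs) rebuilt from the parent rectangle and a subspace code."""
    cells, dims = 2 ** bits, len(parent[0])
    lows, highs = [], []
    for d in range(dims):
        a, width = parent[0][d], (parent[1][d] - parent[0][d]) / cells
        lows.append(a + code[d] * width)
        highs.append(a + (code[dims + d] + 1) * width)
    return tuple(lows), tuple(highs)

def mindist(p, rect):
    """Minimum Euclidean distance from point p to a rectangle."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, rect[0], rect[1])))

def make_leaf(mbr, objects, bits=3):
    codes = [encode(mbr, (o, o), bits) for o in objects]
    return {"mbr": mbr, "objects": objects, "codes": codes}

def nearest(root, space, query, bits=3):
    """Return (nearest object, its distance, number of data objects accessed)."""
    heap, tie, accesses = [(0.0, 0, "node", root)], 1, 0
    while heap:
        dist, _, kind, item = heapq.heappop(heap)
        if kind == "exact":
            return item, dist, accesses          # nothing left can be closer
        if kind == "object":
            accesses += 1                        # read the real data object
            heapq.heappush(heap, (math.dist(query, item), tie, "exact", item))
            tie += 1
            continue
        ref = item["mbr"] if item["mbr"] is not None else space   # root uses entire space
        if "objects" in item:
            children = [(c, o, "object") for c, o in zip(item["codes"], item["objects"])]
        else:
            children = [(c, n, "node") for c, n in item["entries"]]
        for code, child, child_kind in children:
            vbr = decode(ref, code, bits)        # VBR from parent rectangle + code
            heapq.heappush(heap, (mindist(query, vbr), tie, child_kind, child))
            tie += 1
    return None, math.inf, accesses

space = ((0, 0), (32, 32))
leaf_a = make_leaf(((4, 4), (12, 12)), [(4, 4), (12, 12), (6, 10)])
leaf_b = make_leaf(((20, 20), (28, 30)), [(20, 20), (28, 30)])
root = {"mbr": None, "entries": [(encode(space, n["mbr"]), n) for n in (leaf_a, leaf_b)]}
print(nearest(root, space, (7, 9)))
```

Because every VBR encloses the rectangle or point it approximates, the distance to a VBR never exceeds the true distance, so pruning on VBRs cannot discard the true nearest neighbor. In this toy tree, only one data object is read before the answer is confirmed.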

SLIDE 32

Update Algorithm

■ Basic idea:

– Based on the update algorithm of the SR-tree, but the subspace codes also need to be updated

[Figure: example A-tree with a new data object P3 and its subspace code SC(C3)]

SLIDE 35

Code Calculation

■ If the parent MBR does not change, calculate the subspace code only for the inserted data object.

■ If the parent MBR changes, recalculate all subspace codes.

[Figure: parent MBR, its VBRs, and the inserted point in each of the two cases]
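The two cases can be sketched for a single leaf as follows. For brevity this sketch codes each data object as one cell index per dimension and keeps the objects and codes in plain lists; those choices and the names `encode_point` and `insert_point` are assumptions, not details from the slides.

```python
def encode_point(mbr, point, bits=3):
    """Cell index of the point in each dimension, relative to the parent MBR."""
    cells = 2 ** bits
    code = []
    for lo, hi, x in zip(mbr[0], mbr[1], point):
        width = (hi - lo) / cells            # assumes a non-degenerate MBR edge
        code.append(min(int((x - lo) / width), cells - 1))
    return tuple(code)

def insert_point(mbr, points, codes, new_point, bits=3):
    """Insert new_point; return (new MBR, new code list, whether all codes were recalculated)."""
    new_mbr = (tuple(min(l, x) for l, x in zip(mbr[0], new_point)),
               tuple(max(h, x) for h, x in zip(mbr[1], new_point)))
    if new_mbr == mbr:
        # Parent MBR unchanged: only the inserted object needs a code.
        return new_mbr, codes + [encode_point(mbr, new_point, bits)], False
    # Parent MBR changed: every code is relative to the old MBR, so recalculate all.
    all_points = points + [new_point]
    return new_mbr, [encode_point(new_mbr, p, bits) for p in all_points], True

mbr = ((0, 0), (8, 8))
points = [(0, 0), (8, 8)]
codes = [encode_point(mbr, p) for p in points]
print(insert_point(mbr, points, codes, (4, 2)))    # MBR unchanged, one new code
print(insert_point(mbr, points, codes, (16, 2)))   # MBR grows, all codes recalculated
```

The second call shows why a changed parent MBR is the expensive case: the code of the existing point (8, 8) changes even though the point itself did not move.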

SLIDE 38

Update Algorithm

■ Update data node and leaf node

– Insert a new data object P3
– Update M3
– If M3 does not change, calculate SC(C3).
– If M3 changes, calculate SC(C1), SC(C2) and SC(C3).

[Figure: example A-tree with the new data object P3 and its subspace code SC(C3)]

SLIDE 39

Update Algorithm

■ Update intermediate node

– If M3 changes, update M1.
– If M3 changes but M1 does not change, calculate SC(V3).
– If M1 changes, calculate SC(V3) and SC(V4).
– Calculate CD3

[Figure: example A-tree with the new data object P3 and its subspace code SC(C3)]

SLIDE 40

Update Algorithm

■ Update root node

– If M1 changes, calculate SC(V1)
– Calculate CD1

[Figure: example A-tree with the new data object P3 and its subspace code SC(C3)]

SLIDE 41

Performance Test

■ Data sets: real data set (hue histogram image data), uniformly distributed data set, cluster data set
■ Data size: 100,000
■ Dimension: varies from 4 to 64
■ Page size: 8 KB
■ 20-nearest neighbor queries
■ Evaluation is based on the average for 1,000 insertion or query points.
■ CPU: 296 MHz
■ Code length:

– The code length that gave the best performance was chosen.
– A-tree: code length varies from 4 to 12.
– VA-File: code length varies from 4 to 8 according to [18].

SLIDE 42

Search Performance

■ The A-tree gives significantly superior performance!
■ 77% reduction in the number of page accesses for 64-dimensional real data

■ Relative approximation

– Small entry size and large fanout → low IO cost

[Figure panels: Real data; Uniformly distributed data]

SLIDE 43

Influence of Code Length

■ Approximation error ε: error of the distance between p and Vi during a search

$$\varepsilon = \frac{1}{S} \sum_{i=1}^{S} \left( 1 - \frac{\lVert \vec{p}, V_i \rVert}{\lVert \vec{p}, M_i \rVert} \right) \times 100$$

p: query point, S: the number of visited VBRs, Vi: visited VBRs, Mi: the MBRs corresponding to Vi

■ Optimum code length depends on dimensionality and data distribution

SLIDE 44

VA-File/A-tree Comparison

■ VA-File (absolute approximation)

– approximated using the entire space → edge length 2^(-l)

■ A-tree (relative approximation)

– approximated using the parent MBR → smaller VBR size, fewer object accesses

[Figure panels: Edge length of VBRs/cells; Number of data object accesses]
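A small worked example of the difference, assuming a data space normalized to the unit cube and l bits of code per dimension. The parent edge length 0.1 is an arbitrary illustrative value, not a measured one, and the function names are hypothetical.

```python
import math

def va_file_cell_edge(code_bits):
    """Absolute approximation: the unit-length axis is cut into 2**l cells."""
    return 2.0 ** -code_bits

def a_tree_cell_edge(parent_edge, code_bits):
    """Relative approximation: the parent MBR's edge is cut into 2**l cells."""
    return parent_edge * 2.0 ** -code_bits

print(va_file_cell_edge(4))          # 0.0625
print(a_tree_cell_edge(0.1, 4))      # 16 cells across an edge of length 0.1
```

With the same 4-bit code, a parent MBR covering a tenth of the axis yields cells ten times finer than the VA-File's, which is the sense in which relative approximation gives smaller VBRs.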

SLIDE 45

CPU-time

■ CPU-time for real data

– Similar to the SR-tree and outperforms the VA-File

■ VA-File

– Calculates the approximated position coordinate for all objects

■ A-tree

– Reducing node accesses leads to low CPU cost.

SLIDE 46

Insertion and Storage Cost

■ Increase in the insertion cost is modest.
■ About 20% less storage cost for 64-dimensional data

(1) VBRs need only small storage volumes.
(2) The number of index nodes is extremely small.

[Figure panels: Insertion cost; Storage cost]

SLIDE 47

Conclusions

■ The A-tree offers excellent search performance for high-dimensional data

– Relative approximation

  MBRs and data objects in child nodes are approximated based on the parent MBR.

– About 77% reduction in the number of page accesses compared with the VA-File and the SR-tree

■ Future work

– Cost model for finding the optimum code length

SLIDE 48

Contribution of MBSs for Pruning

■ SR-tree contains both MBRs and MBSs but: