SLIDE 1
"Similarity Query Processing using Disk Arrays"
Apostolos N. Papadopoulos & Yannis Manolopoulos
Department of Informatics, Aristotle University
Thessaloniki, Greece
ACM SIGMOD Conference, Seattle, June 1998
SLIDE 2
Outline
Introduction (Disk Arrays & Similarity Queries)
Assumptions & Problem Definition
Similarity Search Algorithms
Performance Evaluation
Concluding Remarks & Future Work
SLIDE 3
Introduction
Disk Arrays
Powerful storage media of increasing importance.
Handle a large number of requests.
I/O parallelism (both interquery & intraquery).
Fault tolerant (e.g. RAID levels 1, 3, 5).
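SLIDE 4
Introduction
Disk Arrays (cont.)
[Figure: disk array architecture. Several disks, each with its own controller (cont.), are attached to a SCSI bus, which connects through DMA to memory and the CPU.]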
SLIDE 5
Introduction
Similarity Queries
We focus on the vector model, where an object is represented by a set of attributes, composing a vector in a multi-dimensional space.
There are two fundamental types of similarity queries that can be applied in the vector model:
Range Query, where the user specifies a shape and asks for all the objects falling inside the corresponding region,
Nearest-Neighbor Query, where the user gives an object and requests the k nearest objects.
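The two query types can be illustrated with a brute-force sketch over a small list of vectors. This is only an illustration, not part of the talk: it assumes Euclidean distance and a spherical range (the slide allows any shape), and the names `range_query` and `knn_query` are hypothetical.

```python
import heapq
import math

def range_query(points, center, radius):
    """Return all points inside the sphere (center, radius).
    Assumption: the query shape is a sphere under Euclidean distance."""
    return [p for p in points if math.dist(p, center) <= radius]

def knn_query(points, query, k):
    """Return the k points closest to the query, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

# Four 2-d objects in the vector model.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
in_range = range_query(points, (0.0, 0.0), 1.5)
nearest = knn_query(points, (0.0, 0.0), 2)
```

A linear scan like this touches every object; the algorithms on the following slides use an index to avoid that.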
SLIDE 6
Assumptions
A set of n-d points is stored in an R*-tree.
The R*-tree is partitioned in the disk array by means of the Proximity Index heuristic.
The partitioning is performed node-wise, i.e. after a split, the new page is assigned to a disk.
SLIDE 7
Problem Definition
Given a set of objects (n-d points), a query object P_q, and an integer number k, determine an efficient plan to access the parallel R*-tree, in order to report the k nearest neighbors of P_q, trying to:
(i) maximize parallelism,
(ii) access as few nodes as possible, and
(iii) reduce query response time.
SLIDE 8
Similarity Search Algorithms
[Figure: query point P_q and two MBRs, R1 and R2, with the distances D_min, D_mm and D_max marked.]
D_min(P_q, R_x): the minimum distance between a point and an MBR.
D_mm(P_q, R_x): the distance within which the existence of at least one point is ensured.
D_max(P_q, R_x): the maximum distance between a point and an MBR.
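The three distances can be computed directly from the MBR corners. The sketch below assumes Euclidean distance and MBRs given as a pair of corner tuples (`low`, `high`); `d_mm` follows the usual MINMAXDIST definition (nearer face on one axis, farther face on all others, minimized over the axes). Function names are illustrative.

```python
import math

def d_min(p, low, high):
    """Smallest possible distance from point p to the MBR [low, high]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, low, high)))

def d_max(p, low, high):
    """Distance from p to the farthest corner of the MBR."""
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(p, low, high)))

def d_mm(p, low, high):
    """MINMAXDIST: a radius within which at least one object of a
    tight MBR must lie."""
    # Squared distance to the farther / nearer face along each axis.
    far = [max(abs(x - l), abs(x - h)) ** 2 for x, l, h in zip(p, low, high)]
    near = [min(abs(x - l), abs(x - h)) ** 2 for x, l, h in zip(p, low, high)]
    total_far = sum(far)
    # Use the nearer face on axis i and the farther face on every other axis.
    return math.sqrt(min(total_far - far[i] + near[i] for i in range(len(p))))

p, low, high = (0.0, 0.0), (1.0, 1.0), (3.0, 2.0)
distances = (d_min(p, low, high), d_mm(p, low, high), d_max(p, low, high))
```

For any point and MBR the three values satisfy D_min <= D_mm <= D_max, which is why D_min acts as the optimistic bound and D_mm as the pessimistic one.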
SLIDE 9
Similarity Search Algorithms
BBSS (Branch-and-Bound Similarity Search)
Proposed by Roussopoulos et al., for answering NN queries in R-trees.
It is a branch-and-bound algorithm, and in each step, a new node is accessed according to the distance between the query point and the node MBR.
The distance from the query point to an MBR can be either the D_min (optimistic), or the D_mm (pessimistic). Experiments have demonstrated that using D_min is more efficient.
Limitation: intraquery parallelism cannot be exploited, since a single node is accessed each time.
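The branch-and-bound idea can be sketched on a tiny in-memory tree. This is a simplified illustration, not the algorithm of Roussopoulos et al. as published: it orders and prunes branches by D_min only, and the `Node` class and `bbss` function are hypothetical names.

```python
import heapq
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    low: tuple
    high: tuple
    children: list = field(default_factory=list)  # internal node: child Nodes
    points: list = field(default_factory=list)    # leaf node: data points

def d_min(p, node):
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, node.low, node.high)))

def bbss(root, query, k):
    best = []      # max-heap of the k best so far, stored as (-distance, point)
    accessed = 0   # nodes are visited strictly one at a time

    def visit(node):
        nonlocal accessed
        accessed += 1
        if not node.children:                      # leaf: update the k best
            for pt in node.points:
                d = math.dist(pt, query)
                if len(best) < k:
                    heapq.heappush(best, (-d, pt))
                elif d < -best[0][0]:
                    heapq.heapreplace(best, (-d, pt))
            return
        # Visit children in increasing D_min order (the optimistic metric).
        for child in sorted(node.children, key=lambda c: d_min(query, c)):
            if len(best) == k and d_min(query, child) > -best[0][0]:
                break                              # the rest are even farther
            visit(child)

    visit(root)
    return [pt for _, pt in sorted(best, reverse=True)], accessed

leaf_a = Node((0, 0), (1, 1), points=[(0.0, 0.0), (1.0, 1.0)])
leaf_b = Node((4, 4), (5, 5), points=[(4.0, 4.0), (5.0, 5.0)])
leaf_c = Node((8, 8), (9, 9), points=[(8.0, 8.0), (9.0, 9.0)])
root = Node((0, 0), (9, 9), children=[leaf_c, leaf_b, leaf_a])
neighbors, accessed = bbss(root, (0.2, 0.2), 2)
```

Because each recursive call touches exactly one node, the requests reach the disks sequentially, which is the limitation the slide points out.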
SLIDE 10
Similarity Search Algorithms
FPSS (Full-Parallel Similarity Search)
Operates in a greedy philosophy, trying to access in parallel as many nodes as possible.
If a node MBR is intersected by the current query hypersphere, then the node is accessed, otherwise it is rejected.
The algorithm first determines a threshold distance D_thres and then descends the R*-tree, fetching the nodes from the corresponding disks.
Limitation: a large number of nodes is accessed, leading to performance degradation.
SLIDE 11
Similarity Search Algorithms
FPSS (cont.)
[Figure: query point P and three MBRs R1, R2 and R3, labeled with the numbers 10, 5 and 10.]
Let k = 5. The circle determined by P and D_max(P, R2) guarantees the existence of 5 points. FPSS fetches ALL pages that intersect the circle (i.e. R1, R2 and R3). The process is applied to all R*-tree levels.
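One level of this greedy step can be sketched as follows. The threshold rule used here is an assumption inferred from the example (take the smallest D_max whose MBRs together hold at least k objects); the coordinates, the extra far-away MBR `R4` and the function name `fpss_level` are made up for illustration.

```python
import math

def d_min(p, low, high):
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, low, high)))

def d_max(p, low, high):
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(p, low, high)))

def fpss_level(query, entries, k):
    """entries: list of (name, low, high, object_count) for one tree level.
    Returns the threshold distance and the names of all MBRs to fetch."""
    covered = 0
    threshold = math.inf
    # Assumption: grow the circle over MBRs in D_max order until k objects
    # are guaranteed to lie inside it.
    for name, low, high, count in sorted(
            entries, key=lambda e: d_max(query, e[1], e[2])):
        covered += count
        if covered >= k:
            threshold = d_max(query, low, high)
            break
    # Greedy step: every MBR that intersects the circle is fetched.
    fetch = [name for name, low, high, _ in entries
             if d_min(query, low, high) <= threshold]
    return threshold, fetch

entries = [
    ("R1", (-3.0, -1.0), (-1.5, 1.0), 10),
    ("R2", (1.0, 0.0), (2.0, 1.0), 5),
    ("R3", (0.0, 2.0), (1.0, 4.0), 10),
    ("R4", (10.0, 10.0), (11.0, 11.0), 10),  # hypothetical far MBR, rejected
]
threshold, fetch = fpss_level((0.0, 0.0), entries, 5)
```

With k = 5 the threshold is D_max(P, R2), and R1, R2 and R3 are all requested at once, matching the behaviour described on the slide.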
SLIDE 12
Similarity Search Algorithms
CRSS (Candidate Reduction Similarity Search)
Candidate Reduction Criterion: Given a query point P_q, a threshold distance D_th and a set of MBRs R = {R_1, ..., R_m}, then for an R_x ∈ R:
If D_th < D_min(P_q, R_x), then R_x is rejected.
If D_th ≥ D_mm(P_q, R_x), then R_x is set active.
If D_th ≥ D_min(P_q, R_x) and D_th < D_mm(P_q, R_x), then R_x is saved for possible future reference.
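The criterion amounts to a three-way classification of each MBR. The sketch below takes the two distances as precomputed numbers; the comparison operators in the second and third rules are partly lost in the slide text, so the non-strict form (D_th >= ...) used here is an assumption chosen to make the three cases exhaustive. The function name `classify` is illustrative.

```python
def classify(d_th, d_min_x, d_mm_x):
    """Apply the candidate reduction criterion to one MBR R_x.
    d_min_x and d_mm_x are D_min(P_q, R_x) and D_mm(P_q, R_x)."""
    if d_th < d_min_x:
        return "rejected"    # the MBR lies entirely outside the threshold
    if d_th >= d_mm_x:
        return "active"      # the MBR certainly has an object within D_th
    return "candidate"       # intersected, but no guarantee: keep for later

# Threshold 5.0 against three MBRs given as (D_min, D_mm) pairs.
labels = [classify(5.0, d_min_x, d_mm_x)
          for d_min_x, d_mm_x in [(6.0, 8.0), (2.0, 4.0), (3.0, 7.0)]]
```

Only the active MBRs are fetched immediately; candidates are revisited if the k best distances found so far do not rule them out.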
SLIDE 13
Similarity Search Algorithms
CRSS (cont.)
[Figure: query point P and three MBRs R1, R2 and R3, labeled with the numbers 10, 5 and 10.]
MBRs R1 and R2 will become active, and the corresponding pages will be fetched, whereas MBR R3 will be saved as a candidate for future reference, since D_th > D_min(P, R3) and D_th < D_mm(P, R3).
SLIDE 14
Similarity Search Algorithms
CRSS (cont.)
The CRSS algorithm operates in four modes:
1. The algorithm operates in ADAPTIVE mode until the leaf-level is reached for the first time. Distance D_th is adapted.
2. Every time the leaf-level is reached, the algorithm passes to UPDATE mode. The k best distances are (possibly) updated.
3. The NORMAL mode refers to cases where the algorithm operates in an intermediate tree-level, but after the ADAPTIVE mode.
4. The TERMINATE mode signals that there are no candidate nodes left, and the k NNs have been determined.
SLIDE 15
Similarity Search Algorithms
CRSS (cont.)
Important Optimizations:
Let ND denote the number of disks, and AN the number of active nodes. If AN > ND, then only ND pages will be fetched. Thanks to the efficiency of the Proximity Index scheme, we anticipate that these nodes are assigned to different disks. The remaining AN - ND nodes are saved as candidates.
In the ADAPTIVE mode it is important that the active nodes contain at least k objects. This guarantees that when the leaf-level is reached for the first time, k distances are available. (In each node, a special field gives the number of objects located under the corresponding subtree.)
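The first optimization can be sketched as a simple split of the active set. The slide does not say which ND nodes are chosen, so ordering by D_min is an assumption of this sketch, as are the function name `split_active` and the sample values.

```python
def split_active(active, num_disks):
    """active: list of (node_id, d_min) pairs for the currently active nodes.
    Returns (nodes to fetch now, nodes demoted to candidates)."""
    # Assumption: prefer the nodes closest to the query point.
    ordered = sorted(active, key=lambda node: node[1])
    return ordered[:num_disks], ordered[num_disks:]

# AN = 5 active nodes, ND = 3 disks: 3 pages fetched, 2 saved as candidates.
active = [("n1", 2.5), ("n2", 0.4), ("n3", 1.7), ("n4", 3.1), ("n5", 0.9)]
fetch_now, saved = split_active(active, 3)
```

Capping each step at ND requests keeps roughly one request per disk, provided the declustering has spread neighboring nodes over different disks.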
SLIDE 16
Similarity Search Algorithms
Definition:
1. A similarity search algorithm is called strict-optimal if exactly k objects are inspected when answering a k-NN query.
2. A similarity search algorithm is called weak-optimal if the minimum number of pages is retrieved when answering a k-NN query.
Observation: Algorithms BBSS, FPSS and CRSS are neither strict-optimal nor weak-optimal.
SLIDE 17
Similarity Search Algorithms
(cont.)
We assume a hypothetical weak-optimal algorithm (WOPTSS). Let the distance D_k from the query point P_q to its k-th nearest neighbor be known in advance. Then, WOPTSS will retrieve the pages that intersect the hypersphere with center P_q and radius D_k. The number of pages retrieved by this algorithm serves as a lower bound for any similarity search algorithm.
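The lower bound can be illustrated for a single tree level. The sketch below is an assumption-laden simplification: it obtains D_k by brute force (standing in for "known in advance"), looks only at leaf pages, and uses the hypothetical name `woptss_pages`.

```python
import math

def d_min(p, low, high):
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, low, high)))

def woptss_pages(query, pages, k):
    """pages: list of (low, high, points) leaf pages.
    Returns D_k and the indices of the pages WOPTSS would retrieve."""
    # D_k is assumed known in advance; here it is computed by brute force.
    distances = sorted(math.dist(pt, query)
                       for _, _, pts in pages for pt in pts)
    d_k = distances[k - 1]
    # Retrieve exactly the pages whose MBR intersects the hypersphere.
    hit = [i for i, (low, high, _) in enumerate(pages)
           if d_min(query, low, high) <= d_k]
    return d_k, hit

pages = [
    ((0.0, 0.0), (1.0, 1.0), [(0.0, 0.0), (1.0, 1.0)]),
    ((2.0, 0.0), (3.0, 1.0), [(2.0, 0.0), (3.0, 1.0)]),
    ((8.0, 8.0), (9.0, 9.0), [(8.0, 8.0), (9.0, 9.0)]),
]
d_k, hit = woptss_pages((1.5, 0.5), pages, 2)
```

Any correct k-NN algorithm has to read at least these pages, since each of them could contain an object closer than D_k.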
SLIDE 18
Performance Evaluation
The simulation model is depicted below.
[Figure: simulation model. Disks with a queue of pending disk requests, an I/O bus with a queue of pending bus requests, DMA, RAM and CPU.]
SLIDE 19
Performance Evaluation (cont.)
[Figure: Number of Accessed Nodes (normalized to WOPTSS) vs. Nearest Neighbors Requested (1 - 700). Set: Gaussian, Population: 80000, Disks: 10, Dimensions: 10. Curves: BBSS, CRSS, WOPTSS.]
SLIDE 20
Performance Evaluation (cont.)
[Figure: Mean Response Time (sec) vs. Queries per second (0.1 - 20). Set: California, Population: 62173, Disks: 10, NNs: 100, Dimensions: 2. Curves: BBSS, FPSS, CRSS, WOPTSS.]
SLIDE 21
Performance Evaluation (cont.)
[Figure: Normalized Mean Response Time vs. Number of Disks (1 - 30). Set: Gaussian, Population: 50000, Dimensions: 5, NNs: 10. Curves: BBSS, CRSS, WOPTSS.]
SLIDE 22
Performance Evaluation (cont.)
[Figure: Normalized Mean Response Time vs. Nearest Neighbors (1 - 100). Set: Uniform, Population: 80000, Disks: 10, Dimensions: 5. Curves: BBSS, CRSS, WOPTSS.]
SLIDE 23
Concluding Remarks
We presented four algorithms for the problem of similarity query processing in a disk array environment.
CRSS gives the best performance with respect to query response time.
Future Work
Perform an analysis estimating the response time of a query,
similarity queries in shadow disks and RAID levels,
similarity search algorithms for multi-processor environments,
access methods.