"Similarity Query Processing using Disk Arrays"

"Similarity Query Processing using Disk Arrays". Apostolos N. Papadopoulos & Yannis Manolopoulos, Department of Informatics, Aristotle University, Thessaloniki, Greece. ACM SIGMOD Conference,


SLIDE 1 "Similarity Query Processing using Disk Arrays"
Apostolos N. Papadopoulos & Yannis Manolopoulos
Department of Informatics, Aristotle University, Thessaloniki, Greece
ACM SIGMOD Conference, Seattle, June 1998
SLIDE 2 Outline
  • Introduction (Disk Arrays & Similarity Queries)
  • Assumptions & Problem Definition
  • Similarity Search Algorithms
  • Performance Evaluation
  • Concluding Remarks & Future Work
SLIDE 3 Introduction
  • Disk Arrays
      • Powerful storage media of increasing importance.
      • Can handle a large number of requests.
      • Exploit I/O parallelism (both interquery & intraquery).
      • Fault tolerant (e.g. RAID levels 1, 3, 5).
SLIDE 4 Introduction
  • Disk Arrays (cont.)
[Figure: disk array architecture; labels: controllers (cont.), DMA, MEMORY, CPU, SCSI BUS]
SLIDE 5 Introduction
  • Similarity Queries
We focus on the vector model, where an object is represented by a set of attributes, composing a vector in a multi-dimensional space. There are two fundamental types of similarity queries that can be applied in the vector model:
      • Range Query, where the user specifies a shape and asks for all the objects falling inside the corresponding region,
      • Nearest-Neighbor Query, where the user gives an object and requests the k nearest objects.
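The two query types can be illustrated with a brute-force sketch over an in-memory list of points. This is only an illustration, not the slides' method: it assumes Euclidean distance and a spherical range, and the function names `range_query` and `knn_query` are hypothetical.

```python
import heapq
import math

def range_query(points, center, radius):
    """Return every point within `radius` of `center` (a spherical range)."""
    return [p for p in points if math.dist(p, center) <= radius]

def knn_query(points, query, k):
    """Return the k points closest to `query`, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

# Four 2-d points used as a toy data set.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
in_range = range_query(points, (0.0, 0.0), 1.5)
nearest = knn_query(points, (1.9, 1.9), 2)
print(in_range)
print(nearest)
```

Both functions scan every point, which is exactly the cost an index such as an R-tree is meant to avoid.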
SLIDE 6 Assumptions
  • The set of n-d points is organized in an R*-tree.
  • The R*-tree is partitioned in the disk array by means of the Proximity Index heuristic.
  • The partitioning is performed node-wise, i.e. after a split, the new page is assigned to a disk.
SLIDE 7 Problem Definition
Given a set of objects (n-d points), a query object $P_q$, and an integer number $k$, determine an efficient plan to access the parallel R*-tree, in order to report the $k$ nearest neighbors of $P_q$, trying to: (i) maximize parallelism, (ii) access as few nodes as possible, and (iii) reduce query response time.
SLIDE 8 Similarity Search Algorithms
  • Distances
[Figure: the distances $D_{min}$, $D_{mm}$ and $D_{max}$ from a query point $P_q$ to the MBRs R1 and R2]
      • $D_{min}(P_q, R_x)$: the minimum distance between a point and an MBR.
      • $D_{mm}(P_q, R_x)$: a distance that ensures the existence of at least one point of $R_x$ within it.
      • $D_{max}(P_q, R_x)$: the maximum distance between a point and an MBR.
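A minimal sketch of the three distances for axis-aligned MBRs, assuming Euclidean distance. The slides do not give formulas; $D_{mm}$ is computed here with the MINMAXDIST formula of Roussopoulos et al., which is an assumption about what the slide intends. An MBR is passed as its lower and upper corners `lo` and `hi`.

```python
import math

def d_min(p, lo, hi):
    """Smallest possible distance from point p to the MBR [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, lo, hi)))

def d_max(p, lo, hi):
    """Distance from p to the farthest corner of the MBR."""
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(p, lo, hi)))

def d_mm(p, lo, hi):
    """MINMAXDIST: at least one data point of the MBR lies within this distance."""
    # Per dimension: the nearer face coordinate and the farther face coordinate.
    near = [l if x <= (l + h) / 2 else h for x, l, h in zip(p, lo, hi)]
    far = [l if x >= (l + h) / 2 else h for x, l, h in zip(p, lo, hi)]
    far_sq = [(x - f) ** 2 for x, f in zip(p, far)]
    total_far = sum(far_sq)
    # For each dimension, take the near face there and the far faces elsewhere.
    return math.sqrt(min(total_far - fs + (x - n) ** 2
                         for x, n, fs in zip(p, near, far_sq)))

p, lo, hi = (0.0, 0.0), (1.0, 1.0), (2.0, 3.0)
print(d_min(p, lo, hi), d_mm(p, lo, hi), d_max(p, lo, hi))
```

For any point and MBR the three values are ordered as $D_{min} \le D_{mm} \le D_{max}$, which is what makes them usable as optimistic and pessimistic bounds.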
SLIDE 9 Similarity Search Algorithms
  • BBSS
      • Proposed by Roussopoulos et al. for answering NN queries in R-trees.
      • It is a branch-and-bound algorithm, and in each step a new node is accessed according to the distance between the query point and the node MBR.
      • The distance of the query point to an MBR can be either $D_{min}$ (optimistic) or $D_{mm}$ (pessimistic). Experiments have demonstrated that using $D_{min}$ is more efficient.
Limitation: intraquery parallelism cannot be exploited, since a single node is accessed each time.
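The branch-and-bound idea can be sketched as a depth-first search that visits child MBRs in increasing $D_{min}$ order and prunes those farther than the current k-th best distance. This is a simplified illustration, not the exact algorithm of Roussopoulos et al.; the nested-dict tree layout and the name `bbss` are assumptions made for the example.

```python
import heapq
import math

def d_min(p, mbr):
    """Smallest possible distance from point p to the MBR (lo, hi)."""
    lo, hi = mbr
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, lo, hi)))

def bbss(node, query, k, best=None, visited=None):
    """Return (k nearest as sorted (distance, point) pairs, names of visited nodes)."""
    best = [] if best is None else best            # max-heap via negated distances
    visited = [] if visited is None else visited
    visited.append(node["name"])                   # one node is accessed per step
    if "points" in node:                           # leaf: update the k best distances
        for pt in node["points"]:
            heapq.heappush(best, (-math.dist(pt, query), pt))
            if len(best) > k:
                heapq.heappop(best)                # drop the current farthest
    else:                                          # internal node: nearest MBR first
        for child in sorted(node["children"],
                            key=lambda c: d_min(query, c["mbr"])):
            kth = -best[0][0] if len(best) == k else math.inf
            if d_min(query, child["mbr"]) <= kth:  # prune MBRs beyond the k-th best
                bbss(child, query, k, best, visited)
    return sorted((-neg, pt) for neg, pt in best), visited

leaf_a = {"name": "A", "mbr": ((0.0, 0.0), (1.0, 1.0)),
          "points": [(0.0, 0.0), (1.0, 1.0)]}
leaf_b = {"name": "B", "mbr": ((4.0, 4.0), (5.0, 5.0)),
          "points": [(4.0, 4.0), (5.0, 5.0)]}
root = {"name": "root", "mbr": ((0.0, 0.0), (5.0, 5.0)),
        "children": [leaf_b, leaf_a]}

result, visited = bbss(root, (0.2, 0.2), 1)
print(result, visited)
```

In this toy tree the search for one neighbor reads the root and leaf A only; leaf B is pruned because its $D_{min}$ exceeds the best distance already found.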
SLIDE 10 Similarity Search Algorithms
  • FPSS
      • It operates with a greedy philosophy, trying to access in parallel as many nodes as possible.
      • If a node MBR is intersected by the current query hypersphere, then the node is accessed; otherwise it is rejected.
      • The algorithm first determines a threshold distance $D_{thres}$ and then descends the R*-tree, fetching the nodes from the corresponding disks.
Limitation: a large number of nodes is accessed, leading to performance degradation.
SLIDE 11 Similarity Search Algorithms
  • FPSS (cont.)
[Figure: query point P and the MBRs R1, R2 and R3, annotated with the numbers 10, 5 and 10]
Let $k = 5$. The circle determined by $P$ and $D_{max}(P, R_2)$ guarantees the existence of 5 points. FPSS fetches ALL pages that intersect the circle (i.e. $R_1$, $R_2$ and $R_3$). The process is applied to all R*-tree levels.
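One level of this process can be sketched as follows. The slides do not state how the threshold is computed in general, so the rule used here is an assumption modeled on the example: rank the MBRs by $D_{max}$, accumulate their object counts until at least $k$ objects are covered, and use the last $D_{max}$ as the radius. The coordinates, the extra entry R4 and the name `fpss_level` are hypothetical.

```python
import math

def d_min(p, lo, hi):
    """Smallest possible distance from point p to the MBR [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, lo, hi)))

def d_max(p, lo, hi):
    """Distance from p to the farthest corner of the MBR."""
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(p, lo, hi)))

def fpss_level(query, entries, k):
    """entries: (name, lo, hi, object_count). Returns (threshold, names to fetch)."""
    ranked = sorted(entries, key=lambda e: d_max(query, e[1], e[2]))
    covered, threshold = 0, math.inf
    for name, lo, hi, count in ranked:
        covered += count
        if covered >= k:                 # this radius is guaranteed to hold k objects
            threshold = d_max(query, lo, hi)
            break
    # Fetch every MBR intersected by the query sphere of that radius.
    fetched = [e[0] for e in entries if d_min(query, e[1], e[2]) <= threshold]
    return threshold, fetched

entries = [
    ("R1", (-3.0, 1.0), (-1.0, 4.0), 10),
    ("R2", (1.0, -1.0), (3.0, 1.0), 5),
    ("R3", (2.0, 2.0), (6.0, 6.0), 10),
    ("R4", (10.0, 10.0), (12.0, 12.0), 10),   # hypothetical far-away MBR
]
threshold, fetched = fpss_level((0.0, 0.0), entries, 5)
print(threshold, fetched)
```

With these made-up coordinates the threshold is $D_{max}(P, R_2)$ and R1, R2 and R3 are all fetched in one parallel step, while the distant R4 is rejected.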
SLIDE 12 Similarity Search Algorithms
  • CRSS
Candidate Reduction Criterion: Given a query point $P_q$, a threshold distance $D_{th}$ and a set of MBRs $R = \{R_1, \ldots, R_m\}$, then for an $R_x \in R$:
      • if $D_{th} < D_{min}(P_q, R_x)$, then $R_x$ is rejected.
      • if $D_{th} \ge D_{mm}(P_q, R_x)$, then $R_x$ is set active.
      • if $D_{th} \ge D_{min}(P_q, R_x)$ and $D_{th} < D_{mm}(P_q, R_x)$, then $R_x$ is saved for possible future reference.
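The criterion is a three-way classification, sketched below on precomputed distances. The function name `classify` and the numeric distances are hypothetical; only the three rules come from the slide.

```python
def classify(d_th, d_min, d_mm):
    """Apply the candidate reduction criterion to one MBR."""
    if d_th < d_min:
        return "rejected"     # the query sphere does not reach the MBR
    if d_th >= d_mm:
        return "active"       # the sphere surely contains a point of the MBR
    return "candidate"        # intersected, but no guarantee: keep for later

# Hypothetical (d_min, d_mm) pairs for four MBRs and a threshold of 3.0.
distances = {"R1": (1.0, 2.0), "R2": (1.5, 3.0), "R3": (2.5, 4.0), "R4": (5.0, 6.0)}
d_th = 3.0
labels = {name: classify(d_th, lo, mm) for name, (lo, mm) in distances.items()}
print(labels)
```

Because $D_{min} \le D_{mm}$ always holds, the three cases are mutually exclusive and cover every MBR.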
SLIDE 13 Similarity Search Algorithms
  • CRSS (cont.)
[Figure: query point P and the MBRs R1, R2 and R3, annotated with the numbers 10, 5 and 10]
MBRs $R_1$ and $R_2$ will become active, and the corresponding pages will be fetched, whereas MBR $R_3$ will be saved as a candidate for future reference, since $D_{th} > D_{min}(P, R_3)$ and $D_{th} < D_{mm}(P, R_3)$.
SLIDE 14 Similarity Search Algorithms
  • CRSS (cont.)
The CRSS algorithm operates in four modes:
1. The algorithm operates in ADAPTIVE mode until the leaf level is reached for the first time. Distance $D_{th}$ is adapted.
2. Every time the leaf level is reached, the algorithm passes to UPDATE mode. The $k$ best distances are (possibly) updated.
3. The NORMAL mode refers to cases where the algorithm operates in an intermediate tree level, but after the ADAPTIVE mode.
4. The TERMINATE mode signals that there are no candidate nodes left, and the $k$ NNs have been determined.
SLIDE 15 Similarity Search Algorithms
  • CRSS (cont.)
Important Optimizations:
      • Let $ND$ denote the number of disks, and $AN$ the number of active nodes. If $AN > ND$, then only $ND$ pages will be fetched. Thanks to the efficiency of the Proximity Index scheme, we anticipate that these nodes are assigned to different disks. The remaining $AN - ND$ nodes are saved as candidates.
      • During the ADAPTIVE mode it is important that the active nodes contain $\ge k$ objects. This guarantees that when the leaf level is reached for the first time, $\ge k$ distances are available. (In each node, a special field gives the number of objects located under the corresponding subtree.)
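Both optimizations can be sketched with two small helpers. The slides do not say which $ND$ active nodes are kept when $AN > ND$; keeping the ones with the smallest $D_{min}$ is an assumption made here, and the names `limit_active` and `covers_k` are hypothetical.

```python
def limit_active(active, num_disks):
    """active: (d_min, node_name, object_count). Keep at most num_disks nodes.

    Returns (nodes to fetch now, nodes saved as candidates).
    """
    ordered = sorted(active)                     # nearest MBRs first (assumption)
    return ordered[:num_disks], ordered[num_disks:]

def covers_k(nodes, k):
    """True if the subtrees under `nodes` hold at least k objects in total."""
    return sum(count for _, _, count in nodes) >= k

active = [(2.0, "N3", 40), (0.5, "N1", 25), (1.0, "N2", 30), (3.0, "N4", 50)]
fetch, saved = limit_active(active, 3)
print([name for _, name, _ in fetch], [name for _, name, _ in saved])
print(covers_k(fetch, 90))
```

The per-node object count plays the role of the special field mentioned on the slide: it lets the search check the $\ge k$ condition without descending into the subtrees.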
SLIDE 16 Similarity Search Algorithms
  • OPTIMAL
Definition:
1. A similarity search algorithm is called strict optimal if exactly $k$ objects are inspected when answering a $k$-NN query.
2. A similarity search algorithm is called weak optimal if the minimum number of pages is retrieved when answering a $k$-NN query.
Observation: Algorithms BBSS, FPSS and CRSS are neither strict optimal nor weak optimal.
SLIDE 17 Similarity Search Algorithms
  • OPTIMAL (cont.)
We assume a hypothetical weak optimal algorithm (WOPTSS). Let the distance $D_k$ from the query point $P_q$ to its $k$-th nearest neighbor be known in advance. Then, WOPTSS will retrieve the pages that intersect the hypersphere with center $P_q$ and radius $D_k$. The number of pages retrieved by this algorithm serves as a lower bound for any similarity search algorithm.
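The lower bound can be computed offline for a given query, as in this sketch. It is simplified to leaf pages only (the intersected internal nodes would be counted the same way), it finds $D_k$ by brute force, and the page layout and the name `woptss_pages` are assumptions.

```python
import math

def d_min(p, lo, hi):
    """Smallest possible distance from point p to the MBR [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, lo, hi)))

def woptss_pages(query, pages, k):
    """pages: name -> list of points. Returns (D_k, pages intersecting the sphere)."""
    all_points = [pt for pts in pages.values() for pt in pts]
    d_k = sorted(math.dist(pt, query) for pt in all_points)[k - 1]
    hit = []
    for name, pts in pages.items():
        lo = tuple(min(c) for c in zip(*pts))    # MBR derived from the page's points
        hi = tuple(max(c) for c in zip(*pts))
        if d_min(query, lo, hi) <= d_k:
            hit.append(name)
    return d_k, hit

pages = {
    "A": [(0.0, 0.0), (1.0, 1.0)],
    "B": [(2.0, 0.0), (3.0, 1.0)],
    "C": [(8.0, 8.0), (9.0, 9.0)],
}
d_k, hit = woptss_pages((0.0, 0.0), pages, 2)
print(d_k, hit)
```

A real algorithm does not know $D_k$ before it starts, which is why WOPTSS serves only as a yardstick in the evaluation.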
SLIDE 18 Performance Evaluation
The simulation model is depicted below.
[Figure: simulation model; labels: new queries, CPU, RAM, DMA, I/O bus, pending bus requests, pending disk requests]
SLIDE 19 Performance Evaluation (cont.)
[Figure: Number of Accessed Nodes (normalized to WOPTSS) vs. Nearest Neighbors Requested (1 - 700). Set: Gaussian, Population: 80000, Disks: 10, Dimensions: 10. Curves: BBSS, CRSS, WOPTSS]
SLIDE 20 Performance Evaluation (cont.)
[Figure: Mean Response Time (sec) vs. Queries per second (0.1 - 20). Set: California, Population: 62173, Disks: 10, NNs: 100, Dimensions: 2. Curves: BBSS, FPSS, CRSS, WOPTSS]
SLIDE 21 Performance Evaluation (cont.)
[Figure: Normalized Mean Response Time vs. Number of Disks (1 - 30). Set: Gaussian, Population: 50000, Dimensions: 5, NNs: 10. Curves: BBSS, CRSS, WOPTSS]
SLIDE 22 Performance Evaluation (cont.)
[Figure: Normalized Mean Response Time vs. Nearest Neighbors (1 - 100). Set: Uniform, Population: 80000, Disks: 10, Dimensions: 5. Curves: BBSS, CRSS, WOPTSS]
SLIDE 23 Concluding Remarks
  • We presented four algorithms for the problem of similarity query processing in a disk array environment.
  • CRSS gives the best performance with respect to query response time.
Future Work
  • Perform an analysis estimating the response time of a query,
  • Study of similarity queries in shadow disks and other RAID levels,
  • Similarity search algorithms for multi-processor environments,
  • Use other access methods.