


SLIDE 1

Christos Doulkeridis, AUEB 1

Peer-to-Peer Similarity Search in Metric Spaces

Christos Doulkeridis, Akrivi Vlachou, Yannis Kotidis, Michalis Vazirgiannis

http://www.db-net.aueb.gr/cdoulk/ cdoulk@aueb.gr Department of Informatics Athens University of Economics and Business (AUEB) Athens, Greece

SLIDE 2

Motivation

  • Similarity search in metric spaces
  • Objects are represented in a high-dimensional feature space
  • Complex distance functions (e.g. text, multimedia)
  • Goal: share the computational load over a set of computers (Peer-to-Peer)

  • DBISP2P’07 session on P2P similarity search
  • Existing work

– Centralized settings
– Structured P2P systems (not preserving peer autonomy)

SLIDE 3

Outline

1. Preliminaries

a. Metric spaces
b. iDistance

2. SIMPEER

a. Construction
b. Range query processing
c. k-NN query processing

3. Experimental results
4. Conclusions & further work

SLIDE 4

Metric Space

  • Metric space M=(D,d)

– d(p,q) = d(q,p) (symmetry)
– d(p,q) > 0 for q≠p, and d(p,p) = 0 (non-negativity)
– d(p,q) ≤ d(p,o) + d(o,q) (triangle inequality)

  • Similarity queries

– Range queries: R(q,r) = { u ∈ D | d(q,u) < r }
– k-NN queries: NNk(q)
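The two query types can be illustrated with a brute-force scan. This is only a minimal sketch: the Euclidean metric, the function names and the sample points are assumptions made for illustration, and any distance function that satisfies the three axioms above could be passed in instead.

```python
import heapq
import math
from typing import Callable, Iterable, List, Sequence

Point = Sequence[float]
Metric = Callable[[Point, Point], float]


def euclidean(p: Point, q: Point) -> float:
    """One possible metric d (assumed here); it satisfies the three axioms."""
    return math.dist(p, q)


def range_query(data: Iterable[Point], q: Point, r: float, d: Metric = euclidean) -> List[Point]:
    """R(q, r): all objects u with d(q, u) < r, as written on the slide."""
    return [u for u in data if d(q, u) < r]


def knn_query(data: Iterable[Point], q: Point, k: int, d: Metric = euclidean) -> List[Point]:
    """NNk(q): the k objects closest to q (ties broken arbitrarily)."""
    return heapq.nsmallest(k, data, key=lambda u: d(q, u))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
in_range = range_query(points, (0.0, 0.0), 2.0)
nearest = knn_query(points, (0.0, 0.0), 2)
```

Both calls compute the distance to every object, which is exactly the cost that indexing and distribution try to avoid.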

SLIDE 5

iDistance – Indexing the Distance

  • Space partitioning into n clusters
  • Reference points Ki
  • Each cluster mapped to an interval
  • Each object x mapped to a 1-d value
  • Values indexed in a B+-Tree
  • Query R(q,r)

– If the query intersects a cluster, scan the interval of that cluster
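A toy version of the scheme may help. The sketch below assumes the commonly used iDistance key i × c + d(Ki, x), where c is a constant larger than any cluster radius; the slide itself does not show the mapping, so treat the formula, the class name `IDistance` and the default value of `c` as assumptions. A sorted Python list searched with `bisect` stands in for the B+-Tree.

```python
import bisect
import math


class IDistance:
    """Toy iDistance: a sorted list of keys stands in for the B+-Tree."""

    def __init__(self, refs, points, c=1000.0):
        # c is an assumed constant larger than any cluster radius,
        # so cluster i owns the key interval [i*c, (i+1)*c).
        self.refs, self.c = refs, c
        self.radius = [0.0] * len(refs)
        entries = []
        for x in points:
            # Assign x to its nearest reference point K_i.
            i = min(range(len(refs)), key=lambda j: math.dist(refs[j], x))
            dist = math.dist(refs[i], x)
            self.radius[i] = max(self.radius[i], dist)
            entries.append((i * c + dist, x))
        entries.sort()
        self.keys = [k for k, _ in entries]
        self.points = [p for _, p in entries]

    def range_query(self, q, r):
        result = []
        for i, K in enumerate(self.refs):
            dq = math.dist(K, q)
            if dq - r > self.radius[i]:
                continue  # the query ball does not intersect cluster i
            lo = i * self.c + max(0.0, dq - r)
            hi = i * self.c + min(self.radius[i], dq + r)
            start = bisect.bisect_left(self.keys, lo)
            stop = bisect.bisect_right(self.keys, hi)
            # Candidates from the interval scan are verified with the real distance.
            result += [p for p in self.points[start:stop] if math.dist(p, q) < r]
        return result


refs = [(0.0, 0.0), (10.0, 10.0)]
data = [(1.0, 1.0), (2.0, 0.0), (9.0, 9.0), (10.0, 12.0)]
index = IDistance(refs, data)
hits = index.range_query((9.5, 9.5), 1.0)
```

The interval bounds follow from the triangle inequality: an object x of cluster i can only satisfy d(x,q) < r if d(Ki,x) lies between d(Ki,q) - r and d(Ki,q) + r.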

SLIDE 6

SIMPEER 3-level Clustering Scheme

1. Each peer

  • clusters its own data
  • indexes local points using iDistance

2. Each super-peer

  • receives its peers’ cluster descriptions
  • computes the hyper-clusters using our extension of iDistance

3. Super-peers

  • exchange hyper-clusters
  • build a set of routing clusters

Super-peer architecture

SLIDE 7

iDistance Extension

  • Map clusters, not points
  • Index the furthest point of each cluster only!
  • Each cluster Cj mapped to a 1-d value
  • Values indexed in a B+-Tree
  • Query R(q,r)

– Search region [ d(Oi,q) - r, r’i ]

Figure: hyper-cluster with centre Oi and radius r’i, enclosing clusters C1, C2, C3
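The search region can be sketched for a single hyper-cluster. The key used here, d(Oi,Kj) + rj (the distance from the hyper-cluster centre to the furthest possible point of cluster Cj), is an assumption consistent with "index the furthest point", since the slide's own mapping formula did not survive extraction. The function names are hypothetical, and a sorted list again replaces the B+-Tree.

```python
import bisect
import math


def build_hyper_cluster(O, clusters):
    """clusters: list of (K_j, r_j). Returns sorted keys, clusters in key order, and r'."""
    keyed = sorted((math.dist(O, K) + r, (K, r)) for K, r in clusters)
    keys = [k for k, _ in keyed]
    ordered = [c for _, c in keyed]
    return keys, ordered, keys[-1]  # r' is the largest key, i.e. the hyper-cluster radius


def candidate_clusters(O, keys, ordered, q, r):
    """Clusters that may intersect R(q, r): those with keys in [d(O,q) - r, r']."""
    start = bisect.bisect_left(keys, math.dist(O, q) - r)
    # The exact ball-intersection test removes false positives left by the scan.
    return [(K, rj) for K, rj in ordered[start:] if math.dist(K, q) <= rj + r]


O = (0.0, 0.0)
clusters = [((1.0, 0.0), 0.5), ((0.0, 4.0), 1.0), ((6.0, 0.0), 1.0)]
keys, ordered, r_prime = build_hyper_cluster(O, clusters)
hits = candidate_clusters(O, keys, ordered, (5.0, 0.0), 0.5)
```

Skipping keys below d(Oi,q) - r is safe: by the triangle inequality, a cluster whose furthest point is closer to Oi than d(Oi,q) - r cannot reach the query ball.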

SLIDE 8

Peer Query Processing

  • Data organization

– Clustering / Space partitioning
– iDistance

  • LCp sent to super-peer

– LCp= { Ci: (Ki,ri) }

  • Range query processing

– Scan intervals of B+-tree

Figure: peer data space with LCp = {C1(K1,r1), C2(K2,r2), C3(K3,r3)}, mapped to the leaf nodes of the B+-tree
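As a rough illustration of what a peer sends to its super-peer, the sketch below derives LCp from points that already carry cluster labels. Using the centroid as Ki is an assumption that only applies to vector data, and the function name `cluster_descriptions` is hypothetical; the slides do not prescribe a clustering algorithm.

```python
import math
from collections import defaultdict


def cluster_descriptions(points, labels):
    """Return LCp as {cluster id: (K_i, r_i)} from points and their cluster labels."""
    groups = defaultdict(list)
    for p, label in zip(points, labels):
        groups[label].append(p)
    lc = {}
    for label, members in groups.items():
        dim = len(members[0])
        # Assumption: K_i is the centroid, which only makes sense for vector data.
        K = tuple(sum(m[a] for m in members) / len(members) for a in range(dim))
        # r_i is the distance from K_i to the furthest member of the cluster.
        r = max(math.dist(K, m) for m in members)
        lc[label] = (K, r)
    return lc


lc_p = cluster_descriptions([(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)], [0, 0, 1])
```

Only the (Ki, ri) pairs leave the peer, so the description stays small regardless of how many objects the peer stores.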

SLIDE 9

Super-Peer Query Processing

  • Super-peer

– Creates hyper-clusters based on peer clusters
– Indexes the furthest point of each peer’s cluster

  • Range query processing

– Find peers to forward the query
– Peer selection mechanism

Figure: super-peer data space with LHCsp = {HC1(O1,r’1), HC2(O2,r’2), HC3(O3,r’3)}
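Peer selection can be sketched as a ball-intersection test between the query and each peer's cluster descriptions. The linear scan, the peer identifiers and the function name `select_peers` are assumptions for illustration; in SIMPEER the candidates would come from the index over the hyper-clusters rather than from a scan.

```python
import math


def select_peers(peer_clusters, q, r):
    """peer_clusters: {peer id: [(K_i, r_i), ...]}. Returns the peers worth contacting."""
    return sorted(
        peer
        for peer, clusters in peer_clusters.items()
        # A peer qualifies if at least one of its clusters intersects the query ball.
        if any(math.dist(K, q) <= r_i + r for K, r_i in clusters)
    )


peer_clusters = {
    "p1": [((0.0, 0.0), 1.0)],
    "p2": [((5.0, 0.0), 1.0), ((9.0, 9.0), 2.0)],
    "p3": [((20.0, 20.0), 3.0)],
}
targets = select_peers(peer_clusters, (3.5, 0.0), 1.0)
```

A peer that passes the test may still return nothing, because an intersecting cluster need not contain an object inside the query ball.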

SLIDE 10

Routing Indices

  • Super-peers broadcast hyper-clusters
  • Recipient super-peers

– Treat hyper-clusters similarly to peer clusters

  • Build routing clusters RCi

– Used to determine the neighbouring super-peer to forward the query
– Super-peer selection mechanism

SLIDE 11

k-NN Query Processing

  • Convert k-NN query to range query R(q,r)

– Use estimated range r
– Based on (peer) cluster information at a super-peer: local estimation
– Based on hyper-cluster information at a super-peer: global estimation
– No communication required for estimation!

  • Maximum 2 round-trips required!

– If fewer than k objects are retrieved, a second round-trip cannot be avoided
– Super-peer computes an upper bound for r, based on its peers’ data

  • Goal: make a good estimation, such that

– First round-trip is enough (overestimate r)
– r is sufficient, but not too large (do not overestimate r too much)
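The two-round-trip behaviour can be sketched as follows. The callback `run_range_query` stands in for the network-wide range query, and both radii are passed in as plain numbers; these names and the local test data are assumptions, not part of SIMPEER.

```python
import math


def knn_via_range(run_range_query, q, k, r_estimate, r_upper_bound):
    """Answer NNk(q) with at most two range queries (round-trips)."""
    results = run_range_query(q, r_estimate)         # first round-trip
    round_trips = 1
    if len(results) < k:
        results = run_range_query(q, r_upper_bound)  # second round-trip, now unavoidable
        round_trips = 2
    results.sort(key=lambda u: math.dist(q, u))
    return results[:k], round_trips


data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]


def local_range(q, r):
    # Stand-in for the distributed range query R(q, r).
    return [u for u in data if math.dist(q, u) < r]


answer, trips = knn_via_range(local_range, (0.0, 0.0), 3, r_estimate=1.5, r_upper_bound=6.0)
```

Here the estimate 1.5 retrieves only two objects, so the upper bound is used in a second pass; an estimate of 2.5 would have finished in one round-trip.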

SLIDE 12

Histogram Construction

  • Distribution of distances

– F(r) = Pr {d(q,p) ≤ r}

  • Expected number of objects retrieved by R(q,r)

– #objs(R(q,r)) = n × F(r)

  • Assumption

– “high” homogeneity of viewpoints inside a cluster [Ciaccia, PODS’98]
– Approximate Fq with a sampled distance distribution F

Figure: distance histogram of Clusteri over bins rB, 2rB, ..., srB, storing Fi(rB), Fi(2rB), ..., Fi(srB); e.g. Fi(2rB) is the frequency of distances d ≤ 2rB
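A small sketch of such a histogram, under stated assumptions: F is approximated from all pairwise distances inside a cluster (a real system would more likely sample), and F(r) is read from the last bin edge that does not exceed r. The function names are hypothetical.

```python
import itertools
import math


def distance_histogram(points, r_b, s):
    """Return [F(rB), F(2rB), ..., F(s*rB)], the cumulative share of pairwise distances."""
    dists = [math.dist(p, u) for p, u in itertools.combinations(points, 2)]
    return [sum(d <= j * r_b for d in dists) / len(dists) for j in range(1, s + 1)]


def F(hist, r_b, r):
    """Read F(r) from the last bin edge j*rB that does not exceed r (0 below rB)."""
    j = min(int(r // r_b), len(hist))
    return hist[j - 1] if j >= 1 else 0.0


def expected_objects(n, hist, r_b, r):
    """#objs(R(q,r)) = n x F(r)."""
    return n * F(hist, r_b, r)


cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
hist = distance_histogram(cluster, r_b=1.0, s=3)
expected = expected_objects(len(cluster), hist, 1.0, 2.0)
```

For these four collinear points the six pairwise distances are 1, 1, 1, 2, 2 and 3, so half of them fall within rB and all of them within 3rB.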

SLIDE 13

Local Estimation (LE)

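Figure: cluster Ci with centre Ki and radius ri, and a query R(q,r) lying either inside the cluster or partially overlapping it

  • Condition d(Ki,q) + r ≤ ri: estimated #objects = ni × Fi(r)
  • Condition d(Ki,q) + r > ri: estimated #objects = ni × Fi(r’), where r’ = (ri + r - d(Ki,q)) / 2

Binary search on [0,srB] to find the smallest r for which the estimated number of objects ≥ k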
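The binary search can be sketched as below. Several points are assumptions rather than statements from the slides: r’ is read as (ri + r - d(Ki,q)) / 2, the per-cluster estimates are summed over all clusters known to the super-peer, each histogram is a list of cumulative values at rB, 2rB, ..., srB, and the search stops at a tolerance `tol`.

```python
import math


def F(hist, r_b, r):
    """Cumulative histogram lookup; returns 0 for radii below rB (including negative ones)."""
    j = min(int(r // r_b), len(hist))
    return hist[j - 1] if j >= 1 else 0.0


def estimated_objects(clusters, q, r, r_b):
    """clusters: list of (K_i, r_i, n_i, hist_i). Sum of the per-cluster estimates."""
    total = 0.0
    for K, r_i, n_i, hist in clusters:
        d = math.dist(K, q)
        if d + r <= r_i:
            total += n_i * F(hist, r_b, r)        # query ball inside the cluster
        else:
            r_prime = (r_i + r - d) / 2           # assumed reading of r'
            total += n_i * F(hist, r_b, r_prime)  # contributes 0 when r' < rB
    return total


def estimate_radius(clusters, q, k, r_b, s, tol=1e-6):
    """Smallest r in [0, s*rB] whose estimate reaches k, up to the tolerance tol."""
    lo, hi = 0.0, s * r_b
    if estimated_objects(clusters, q, hi, r_b) < k:
        return hi  # even the largest radius is not expected to yield k objects
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if estimated_objects(clusters, q, mid, r_b) >= k:
            hi = mid
        else:
            lo = mid
    return hi


clusters = [((0.0, 0.0), 3.0, 100, [0.1, 0.4, 1.0])]
r_est = estimate_radius(clusters, (0.0, 0.0), k=30, r_b=1.0, s=3)
```

The search is valid because the estimate never decreases as r grows: both r and r’ increase with r, and F is non-decreasing.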

SLIDE 14

Global Estimation (GE)

  • Hyper-clusters enhanced with 2 histograms: (hci) and (hdi)

– hci: number of clusters intersecting the query (nci), from the distance distribution of clusters within a hyper-cluster
– hdi: number of data objects contained in the intersection (ndi)

  • Superimpose cluster histograms, by keeping the minimum value of each bin
  • Also keep the minimum cardinality of all clusters

Estimated #objects: nci(r) × ndi(r)
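The superimposition step can be sketched directly. How nci(r) and ndi(r) are then derived from the two histograms is not spelled out on the slide, so the last lines below use an assumed reading (ndi as minimum cardinality times the superimposed bin value, nci given as a plain count) purely to show the final product.

```python
def superimpose(histograms):
    """Bin-wise minimum of equally sized cluster histograms."""
    return [min(bins) for bins in zip(*histograms)]


def global_estimate(nc_r, nd_r):
    """Estimated #objects = nci(r) x ndi(r)."""
    return nc_r * nd_r


cluster_histograms = [[0.1, 0.4, 1.0], [0.2, 0.3, 0.9]]
cardinalities = [120, 80]

hd = superimpose(cluster_histograms)  # minimum value of each bin
n_min = min(cardinalities)            # minimum cardinality of all clusters

# Assumed reading: objects per intersecting cluster at the second bin,
# multiplied by an assumed count of 3 intersecting clusters.
nd_r = n_min * hd[1]
estimate = global_estimate(3, nd_r)
```

Keeping the minimum per bin and the minimum cardinality makes the per-cluster count conservative, so the derived radius tends to be overestimated rather than underestimated.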

SLIDE 15

Experimental Setup

  • GT-ITM topology generator (4K-16K peers)
  • #Super-peers={200,400}
  • DEGsp=4-7
  • DEGp=20-60
  • kp=10
  • Synthetic {uniform,clustered} datasets

– 8-32d, 3M-12M objects

  • Real datasets

– VEC: 1M 45-dim vectors of color image features
– CovType: 581K 54-dim instances of forest Covertype data

SLIDE 16

Construction Cost

  • Mainly depends on super-peer topology
  • One-time cost!
  • Approx. 1.5MB per super-peer

Figure: total construction cost (MB) vs. DEGsp (4 to 7), for Nsp=200 and Nsp=400

SLIDE 17

Range Queries – Response Time

  • (Nsp=200, Np=2000, n=1M, d=16)
  • Increases only slightly with cardinality
  • Higher response time in clustered dataset
  • Most results come from the same network paths, causing delays

Figure: response time (sec) vs. cardinality (x10^6); series: Uniform, k=120; Uniform, k=60; Clustered, k=120; Clustered, k=60

Network transfer rate 4KB/sec

SLIDE 18

Range Queries – Success Ratio

  • Clustered dataset (Nsp=200, Np=2000, n=1M)
  • Success ratio = how many of the contacted peers (super-peers) returned results

Figure: success ratio vs. query selectivity (x10^-5); series: SP, d=8; SP, d=32; P, d=8; P, d=32

SLIDE 19

k-NN Queries – Overestimation(%)

  • VEC dataset (Nsp=200, Np=2000)

– LE better (initially)
– GE becomes better with increasing ksp

Figure: overestimation (%) for LE/RE and GE/RE, with k in {50, 100} and ksp in {5, 10, 15}

SLIDE 20

Conclusions & Further Work

  • SIMPEER

– A metric-based framework for P2P similarity search
– Utilizes a three-level clustering scheme

  • Support for range and k-NN query processing
  • Distributed statistics
  • Further work

– Extension for non-vector-based data representations
– Devise an approach that deals with uniform data distributions in a better way

SLIDE 21

Thank you for your attention!

More info:

http://www.db-net.aueb.gr/cdoulk/ cdoulk@aueb.gr