Peer-to-Peer Similarity Search in Metric Spaces

Peer-to-Peer Similarity Search in Metric Spaces Christos Doulkeridis, Akrivi Vlachou, Yannis Kotidis, Michalis Vazirgiannis http://www.db-net.aueb.gr/cdoulk/ cdoulk@aueb.gr Department of Informatics Athens University of Economics and Business


  1. Peer-to-Peer Similarity Search in Metric Spaces Christos Doulkeridis, Akrivi Vlachou, Yannis Kotidis, Michalis Vazirgiannis http://www.db-net.aueb.gr/cdoulk/ cdoulk@aueb.gr Department of Informatics Athens University of Economics and Business (AUEB) Athens, Greece Christos Doulkeridis, AUEB 1

  2. Motivation
     • Similarity search in metric spaces
     • Objects are represented in a high-dimensional feature space
     • Complex distance functions (e.g. text, multimedia)
     • Goal: share the computational load over a set of computers (peer-to-peer)
     • DBISP2P’07 session on P2P similarity search
     • Existing work
       – Centralized settings
       – Structured P2P systems (not preserving peer autonomy)

  3. Outline
     1. Preliminaries
        a. Metric spaces
        b. iDistance
     2. SIMPEER
        a. Construction
        b. Range query processing
        c. k-NN query processing
     3. Experimental results
     4. Conclusions & further work

  4. Metric Space
     • Metric space M = (D, d)
       – d(p,q) = d(q,p) (symmetry)
       – d(p,q) > 0 for q ≠ p, and d(p,p) = 0 (non-negativity)
       – d(p,q) ≤ d(p,o) + d(o,q) (triangle inequality)
     • Similarity queries
       – Range queries: R(q,r) = { u ∈ D | d(q,u) < r }
       – k-NN queries: NN_k(q)
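The two query types above can be illustrated with a brute-force baseline. This is only a sketch, assuming points are numeric tuples and the metric is Euclidean; the function names `range_query` and `knn_query` are illustrative and do not come from the slides.

```python
import heapq
import math


def euclidean(p, q):
    """Euclidean distance, one example of a metric d."""
    return math.dist(p, q)


def range_query(data, q, r, d=euclidean):
    """R(q, r): every object u with d(q, u) < r, as defined on the slide."""
    return [u for u in data if d(q, u) < r]


def knn_query(data, q, k, d=euclidean):
    """NN_k(q): the k objects closest to q (ties broken by input order)."""
    return heapq.nsmallest(k, data, key=lambda u: d(q, u))
```

Both functions scan every object, which is exactly the cost that the indexing and routing structures on the following slides try to avoid.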

  5. iDistance – Indexing the Distance
     • Space partitioning into n clusters
     • Reference points K_i
     • Each cluster mapped to an interval
     • Each object x mapped to a 1-d value
     • Values indexed in a B+-tree
     • Query R(q,r)
       – If a query intersects with a cluster
       – Scan the interval
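A minimal sketch of the iDistance idea follows. It assumes the usual iDistance key, `i * c + d(x, K_i)`, where `c` is a constant larger than any cluster radius; the slide's own mapping formula did not survive extraction, so treat this key as an assumption. A sorted list with `bisect` stands in for the B+-tree, and the class name `IDistanceIndex` is hypothetical.

```python
import bisect
import math


class IDistanceIndex:
    def __init__(self, points, refs, c):
        # Assumption: c exceeds every cluster radius, so intervals never overlap.
        self.refs, self.c = refs, c
        self.radii = [0.0] * len(refs)
        entries = []
        for x in points:
            # Assign x to its nearest reference point K_i.
            i = min(range(len(refs)), key=lambda j: math.dist(x, refs[j]))
            dist = math.dist(x, refs[i])
            self.radii[i] = max(self.radii[i], dist)
            entries.append((i * c + dist, x))  # 1-d key of object x
        entries.sort()  # sorted list plays the role of the B+-tree leaves
        self.keys = [key for key, _ in entries]
        self.vals = [x for _, x in entries]

    def range_query(self, q, r):
        out = []
        for i, K in enumerate(self.refs):
            dq = math.dist(q, K)
            if dq - r > self.radii[i]:
                continue  # query sphere does not intersect cluster i
            # Scan only the part of cluster i's interval the query can reach.
            lo = i * self.c + max(0.0, dq - r)
            hi = i * self.c + min(self.radii[i], dq + r)
            a = bisect.bisect_left(self.keys, lo)
            b = bisect.bisect_right(self.keys, hi)
            # Candidates are filtered with the real distance.
            out.extend(x for x in self.vals[a:b] if math.dist(q, x) < r)
        return out
```

The interval scan only produces candidates; the triangle inequality guarantees that no qualifying object lies outside the scanned key range, and the final distance check removes false positives.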

  6. SIMPEER 3-level Clustering Scheme
     1. Each peer
        • clusters its own data
        • indexes local points using iDistance
     2. Each super-peer
        • receives its peers’ cluster descriptions
        • computes the hyper-clusters using our extension of iDistance
     3. Super-peers
        • exchange hyper-clusters
        • build a set of routing clusters
     [Figure: super-peer architecture]

  7. iDistance Extension
     • Map clusters, not points
     • Index the furthest point of each cluster only!
     • Each cluster C_j mapped to a 1-d value
     • Values indexed in a B+-tree
     • Query R(q,r)
       – Search region [ d(O_i,q) - r, r’_i ]
     [Figure: hyper-cluster (O_i, r’_i) containing clusters C_1, C_2, C_3 and the query R(q,r)]
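The cluster-level mapping can be sketched as below. The key used here, `i * c + d(O_i, K_j) + r_j` (the distance of the furthest point of cluster C_j from the hyper-cluster centre O_i), is an assumption consistent with "index the furthest point of each cluster only"; the slide's formula itself is missing. Assigning each cluster to its nearest hyper-cluster centre is also a simplification, and all function names are hypothetical.

```python
import bisect
import math


def build_cluster_index(hyper_centres, clusters, c):
    """clusters: list of (peer_id, K_j, r_j). Returns (keys, vals, hyper_radii)."""
    radii = [0.0] * len(hyper_centres)
    entries = []
    for peer, K, r in clusters:
        # Simplification: attach each cluster to its nearest hyper-cluster centre.
        i = min(range(len(hyper_centres)),
                key=lambda h: math.dist(K, hyper_centres[h]))
        far = math.dist(hyper_centres[i], K) + r  # furthest point of the cluster
        radii[i] = max(radii[i], far)             # hyper-cluster radius r'_i
        entries.append((i * c + far, peer, K, r))
    entries.sort()
    return [e[0] for e in entries], [e[1:] for e in entries], radii


def peers_for_range_query(hyper_centres, index, c, q, r):
    """Peers holding at least one cluster that intersects R(q, r)."""
    keys, vals, radii = index
    peers = set()
    for i, O in enumerate(hyper_centres):
        lo = math.dist(O, q) - r
        if lo > radii[i]:
            continue  # query does not reach hyper-cluster i
        # Search region [d(O_i, q) - r, r'_i] inside hyper-cluster i's interval.
        a = bisect.bisect_left(keys, i * c + max(0.0, lo))
        b = bisect.bisect_right(keys, i * c + radii[i])
        for peer, K, rj in vals[a:b]:
            if math.dist(K, q) <= rj + r:  # exact sphere-intersection check
                peers.add(peer)
    return peers
```

Because only the furthest point is indexed, the lower end of the search region prunes clusters that lie entirely closer to O_i than the query can reach, while everything up to r’_i must still be checked.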

  8. Peer Query Processing
     • Data organization
       – Clustering / Space partitioning
       – iDistance
     • LC_p sent to super-peer
       – LC_p = { C_i : (K_i, r_i) }
       – Example: LC_p = { C_1(K_1,r_1), C_2(K_2,r_2), C_3(K_3,r_3) }
     • Range query processing
       – Scan intervals of B+-tree
     [Figure: peer data space with clusters (K_1,r_1), (K_2,r_2), (K_3,r_3) and the query R(q,r); leaf nodes of the B+-tree]

  9. Super-Peer Query Processing
     • Super-peer
       – Creates hyper-clusters based on peer clusters
       – Indexes the furthest point of each peer’s cluster
       – Example: LHC_sp = { HC_1(O_1,r’_1), HC_2(O_2,r’_2), HC_3(O_3,r’_3) }
     • Range query processing
       – Find peers to forward the query
       – Peer selection mechanism
     [Figure: super-peer data space with hyper-clusters (O_1,r’_1), (O_2,r’_2), (O_3,r’_3) and the query R(q,r)]

  10. Routing Indices
      • Super-peers broadcast hyper-clusters
      • Recipient super-peers
        – Treat hyper-clusters similarly to peer clusters
      • Build routing clusters RC_i
        – Used to determine the neighbouring super-peer to forward the query
        – Super-peer selection mechanism

  11. k-NN Query Processing
      • Convert k-NN query to range query R(q,r)
        – Use estimated range r
        – Based on (peer) cluster information at a super-peer: local estimation
        – Based on hyper-cluster information at a super-peer: global estimation
        – No communication required for estimation!
      • Maximum 2 round-trips required!
        – If fewer than k objects are retrieved, a second round-trip cannot be avoided
        – Super-peer computes an upper bound for r, based on its peers’ data
      • Goal: make a good estimation, such that
        – First round-trip is enough (overestimate r)
        – r is sufficient, but not too large (do not overestimate r too much)
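The two-round-trip behaviour can be sketched on a single in-memory data set. This is a simplification under stated assumptions: the network is replaced by a local list, `r_estimate` and `r_upper` are supplied by the caller, and `r_upper` is assumed to be large enough to cover at least k objects. The function name `knn_via_range` is illustrative.

```python
import heapq
import math


def knn_via_range(data, q, k, r_estimate, r_upper):
    """Answer a k-NN query through range queries; returns (neighbours, rounds)."""
    def range_query(r):
        return [u for u in data if math.dist(q, u) < r]

    rounds = 1
    hits = range_query(r_estimate)   # first round-trip with the estimated radius
    if len(hits) < k:
        rounds = 2
        hits = range_query(r_upper)  # second round-trip with the upper bound
    # Keep only the k closest of the retrieved objects.
    return heapq.nsmallest(k, hits, key=lambda u: math.dist(q, u)), rounds
```

An underestimated radius costs a full second round, while an overestimated one only returns surplus objects that are discarded at the end, which is why the slide prefers a slight overestimate.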

  12. Histogram Construction
      • Distribution of distances
        – F(r) = Pr{ d(q,p) ≤ r }
      • Expected number of retrieved objects by R(q,r)
        – #objs(R(q,r)) = n × F(r)
      • Assumption
        – “high” homogeneity of viewpoints inside a cluster [Ciaccia, PODS’98]
        – Approximate F_q with a sampled distance distribution F
      [Figure: histogram of cluster i with bins r_B, 2r_B, ..., sr_B holding F_i(r_B), F_i(2r_B), ..., F_i(sr_B); e.g. F_i(2r_B) is the frequency of distances d ≤ 2r_B]
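A cumulative histogram of this kind might be built as follows. The sketch assumes the distance distribution is taken over all pairwise distances inside the cluster (the slides only say "sampled"), uses `s` bins of width `r_b`, and reads F(r) as a step function that rounds r down to a bin edge. All names are illustrative.

```python
import itertools
import math


def build_histogram(points, r_b, s):
    """Return [F(r_b), F(2*r_b), ..., F(s*r_b)] from all pairwise distances."""
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    return [sum(d <= (j + 1) * r_b for d in dists) / len(dists)
            for j in range(s)]


def F(hist, r_b, r):
    """Step-wise F(r): value of the last bin whose upper edge is <= r."""
    j = min(int(r / r_b), len(hist))
    return hist[j - 1] if j > 0 else 0.0


def expected_objects(n, hist, r_b, r):
    """#objs(R(q, r)) = n x F(r)."""
    return n * F(hist, r_b, r)
```

Rounding r down makes the estimate conservative; rounding up instead would bias the estimate upwards, so the choice here is a design decision rather than something the slides specify.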

  13. Local Estimation (LE)
      • Case 1: query inside cluster C_i
        – Condition: d(K_i,q) + r ≤ r_i
        – Estimated #objects: n_i × F_i(r)
      • Case 2: query partially overlaps cluster C_i
        – Condition: d(K_i,q) + r > r_i
        – Estimated #objects: n_i × F_i(r’), where r’ = (r_i + r - d(K_i,q)) / 2
      • Binary search on [0, sr_B] to find the smallest r for which the estimated number of objects ≥ k
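The two cases and the binary search can be sketched together. Assumptions made here: each cluster is described by a tuple `(K_i, r_i, n_i, hist_i)` where `hist_i` is a cumulative histogram with bin width `r_b`; r’ is read as half the overlap width, (r_i + r - d(K_i,q)) / 2; and the search runs over multiples of `r_b`, since the histogram cannot resolve anything finer. The function names are hypothetical.

```python
import math


def F(hist, r_b, r):
    """Step-wise cumulative distance distribution read from a histogram."""
    j = min(int(r / r_b), len(hist))
    return hist[j - 1] if j > 0 else 0.0


def estimate_objects(clusters, r_b, q, r):
    """Estimated number of objects retrieved by R(q, r) over all clusters."""
    total = 0.0
    for K, r_i, n_i, hist in clusters:
        d = math.dist(K, q)
        if d + r <= r_i:        # case 1: query lies inside the cluster
            total += n_i * F(hist, r_b, r)
        elif d < r_i + r:       # case 2: partial overlap, use r'
            total += n_i * F(hist, r_b, (r_i + r - d) / 2)
    return total


def estimate_knn_radius(clusters, r_b, s, q, k):
    """Smallest multiple of r_b in (0, s*r_b] whose estimate reaches k."""
    if estimate_objects(clusters, r_b, q, s * r_b) < k:
        return None  # even the largest radius is not expected to yield k
    lo, hi = 1, s
    while lo < hi:
        mid = (lo + hi) // 2
        if estimate_objects(clusters, r_b, q, mid * r_b) >= k:
            hi = mid
        else:
            lo = mid + 1
    return lo * r_b
```

Binary search is valid because the estimate never decreases as r grows: both F_i(r) and F_i(r’) are non-decreasing in r, and the two cases agree at the boundary d(K_i,q) + r = r_i.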

  14. Global Estimation (GE)
      • Hyper-clusters enhanced with 2 histograms: (hc_i) and (hd_i)
        – (hc_i): number of clusters intersecting the query (nc_i)
          • Distance distribution of clusters within a hyper-cluster
        – (hd_i): number of data objects contained in the intersection (nd_i)
          • Superimpose cluster histograms, by keeping the minimum value of each bin
          • Also keep the minimum cardinality of all clusters
      • Estimated #objects: nc_i(r) × nd_i(r)
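Only the superimposition step is sketched here, since the slides do not spell out how nc_i(r) and nd_i(r) are read from the two histograms. The sketch assumes all cluster histograms share the same bin width and bin count; the function names are illustrative.

```python
def superimpose(histograms):
    """Combine cluster histograms by keeping the minimum value of each bin."""
    return [min(bins) for bins in zip(*histograms)]


def min_cardinality(cardinalities):
    """Smallest number of objects held by any cluster of the hyper-cluster."""
    return min(cardinalities)
```

Taking the minimum per bin and the minimum cardinality gives a pessimistic per-cluster object count, so the product with the number of intersecting clusters is less likely to overshoot.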

  15. Experimental Setup
      • GT-ITM topology generator (4K-16K peers)
      • #Super-peers = {200, 400}
      • DEG_sp = 4-7
      • DEG_p = 20-60
      • k_p = 10
      • Synthetic {uniform, clustered} datasets
        – 8-32d, 3M-12M objects
      • Real datasets
        – VEC: 1M 45-dim vectors of color image features
        – CovType: 581K 54-dim instances of forest Covertype data

  16. Construction Cost
      • Mainly depends on super-peer topology
      • One-time cost!
      • Approx. 1.5MB per super-peer
      [Figure: total construction cost (MB) vs. DEGsp (4-7), for Nsp=200 and Nsp=400]

  17. Range Queries – Response Time
      • (N_sp=200, N_p=2000, n=1M, d=16)
      • Increases only slightly with cardinality
      • Higher response time in clustered dataset
      • Most results come from the same network paths, causing delays
      [Figure: response time (sec) vs. cardinality (x10^6), for Uniform and Clustered datasets with k=60 and k=120; network transfer rate 4KB/sec]

  18. Range Queries – Success Ratio
      • Clustered dataset (N_sp=200, N_p=2000, n=1M)
      • Success ratio = how many of the contacted peers (super-peers) returned results
      [Figure: success ratio vs. query selectivity (x10^-5), series SP and P for d=8 and d=32]

  19. k-NN Queries – Overestimation (%)
      • VEC dataset (N_sp=200, N_p=2000)
        – LE better (initially)
        – GE becomes better with increasing k_sp
      [Figure: overestimation (%) of LE/RE and GE/RE for k=50 and k=100, with ksp=5, 10, 15]

  20. Conclusions & Further Work
      • SIMPEER
        – A metric-based framework for P2P similarity search
        – Utilizes a three-level clustering scheme
      • Support for range and k-NN query processing
      • Distributed statistics
      • Further work
        – Extension for non-vector-based data representations
        – Devise an approach that deals with uniform data distributions in a better way

  21. Thank you for your attention! More info: http://www.db-net.aueb.gr/cdoulk/ cdoulk@aueb.gr
