Keyword Searching in Hypercubic Manifolds, Yu-En Lu (PowerPoint PPT Presentation)





SLIDE 1

Keyword Searching in Hypercubic Manifolds

Yu-En Lu, Steven Hand, Pietro Lio

University of Cambridge Computer Laboratory

SLIDE 2

Motivation

Unstructured P2P networks such as Gnutella evaluate complex queries by flooding the network, while nothing can be guaranteed.

Distributed Hash Tables evaluate simple queries due to hashing, whilst a guarantee is provided (at least theoretically).

What if we could cluster similar objects in similar regions of the network via hashing alone?

  • No preprocessing needed
  • No global knowledge required, only the hash
  • Plug & play on top of current DHT designs

SLIDE 3

[Figure: Qube positioned alongside related DHT systems such as PHT, Mercury, and P-Grid]

Types of Queries

  • Exact Query: K1 ∧ K2 ∧ K3, e.g. “Harry Potter V.mpg”
  • Range Query: K1 ∧ K2 ∧ K3 ∧ (K4 ≥ 128), e.g. “Harry Potter V.mpg AND bit-rate > 128kbps”
  • Partial Match Query: K1 ∧ (K2 ∨ K3 ∨ K4), e.g. “Harry Potter [III,IV].mpg”
  • Flawed Query: Ki+1 ∧ Kj−20, e.g. “Hary Porter.mpg”
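The four query types above can be sketched over objects modelled as keyword sets. This is a hypothetical illustration, not the paper's code; the function names and the Hamming-style tolerance for flawed queries are assumptions.

```python
# Illustrative sketch: the slide's query types over keyword-set objects.

def exact(obj, keys):
    # Exact query: K1 AND K2 AND K3 -- every keyword must be present.
    return all(k in obj for k in keys)

def partial(obj, required, alternatives):
    # Partial match: K1 AND (K2 OR K3 OR K4).
    return all(k in obj for k in required) and any(k in obj for k in alternatives)

def flawed(obj, keys, tolerance=1):
    # Flawed query ("Hary Porter"): accept if at most `tolerance`
    # query keywords are missing from the object.
    missing = sum(1 for k in keys if k not in obj)
    return missing <= tolerance

movie = {"harry", "potter", "v", "mpg"}
print(exact(movie, {"harry", "potter"}))        # True
print(partial(movie, {"harry"}, {"iii", "v"}))  # True
print(flawed(movie, {"hary", "potter"}))        # True: one misspelt keyword tolerated
```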

SLIDE 4

Possible answers to the query

[Figure: a projection of the object feature space, with the query point near “Harry” and “Movie” objects and far from “French Cuisine”]

SLIDE 5

A Qube View of the Mappings: Features - Overlay – Nodes

[Figure: vertices of the abstract graph mapped onto nodes of the network topology]

Each object is represented as a bit-string, where 1 denotes that it contains a keyword and 0 that it does not. Each bit-string is then hashed onto the P2P name space. The nodes in the network choose positions in the P2P space randomly and link with each other in some overlay topology; in our case, a hypercube is used.

[Figure examples: “Harry Potter”, “Bon Jovi”]
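The hypercube overlay can be sketched as follows: each node picks a random b-bit ID and links to the b IDs that differ from it in exactly one bit. The parameter name `b` follows the slides; everything else is an assumed minimal illustration.

```python
# Minimal sketch of the hypercube topology: a node's neighbours are the
# b IDs obtained by flipping each bit of its own ID in turn.
import random

b = 4  # hypercube dimensionality = node ID length in bits

def random_node_id():
    # Nodes choose positions in the P2P space at random.
    return random.getrandbits(b)

def neighbours(node_id):
    # XOR with each single-bit mask enumerates the hypercube neighbours.
    return [node_id ^ (1 << i) for i in range(b)]

node = 0b0101
print([format(n, "04b") for n in neighbours(node)])
# ['0100', '0111', '0001', '1101']
```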

SLIDE 6

Design Principles and Fundamental Trade-offs

Latency vs. Message Complexity

  • Low message complexity usually means low latency; not entirely true for DHT systems.

Fairness vs. Performance

  • Sending everything to a handful of ultra-peers is fast and simple.
  • Having things spread across the network means a fairer system (and perhaps better availability).

Storage vs. Synchronisation Complexity

  • The most popular queries may be processed by querying one random node, thanks to generous replication/caching.
  • For some applications, such as a distributed inverted index, frequent synchronisation is costly.

SLIDE 7

Hashing and Network Topology

Keywords such as “Harry Potter”, “music”, “movie”, “7” are summarised as a bit-string, e.g. 1 1 0 0 1 1.

Summary hash h : {0, 1}* → {0, 1}^b

  • Non-expansive: d(h(x), h(y)) ≤ d(x, y), so similar objects are located in manifolds of the hypercube.
  • Fair partitioning: |h−1(u)| = |h−1(v)| for any two points u, v of the name space.
  • Keyword edges link nodes at one-word distance.

[Figure examples: “Harry Potter music”, “Harry Potter movie”, “Harry Potter”, “Harry Potter movie 7”]
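One possible way to realise such a summary hash (an assumed construction, not necessarily the paper's exact one) is to let each keyword deterministically set one of b bits, Bloom-filter style: objects whose keyword sets differ in k words then have hashes within Hamming distance k, which is exactly the non-expansive property.

```python
# Sketch of a non-expansive summary hash h : {0,1}* -> {0,1}^b.
# Each keyword sets one stable bit position, so changing one keyword
# changes at most one bit of the hash.
import hashlib

b = 16  # hash length in bits (hypercube dimensionality)

def summary_hash(keywords):
    h = 0
    for kw in keywords:
        # Derive a stable bit position for this keyword.
        pos = int.from_bytes(hashlib.sha1(kw.encode()).digest()[:4], "big") % b
        h |= 1 << pos
    return h

def hamming(u, v):
    return bin(u ^ v).count("1")

x = summary_hash({"harry", "potter", "music"})
y = summary_hash({"harry", "potter", "movie"})
# The keyword sets differ in two words (music vs. movie), so the hashes
# differ in at most 2 bits and stay close on the hypercube.
print(hamming(x, y) <= 2)  # True
```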

SLIDE 8

Query Processing

[Figure: a 4-bit hypercube with node IDs 0000–1111; objects such as “Harry Potter”, “Harry Potter Music”, and “Harry Potter VII” sit on nearby vertices, and a query is resolved by probing that neighbourhood]
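The query-processing idea in the figure can be sketched as probing the nodes within a small Hamming radius of the query's hash, since matching objects cluster in that neighbourhood of the hypercube. The enumeration below is an assumed illustration of that idea, not the paper's routing algorithm.

```python
# Sketch: enumerate the node IDs within a given Hamming radius of the
# query's summary hash; these are the nodes a query would probe.
from itertools import combinations

b = 4  # hypercube dimensionality

def hamming_ball(centre, radius):
    ids = [centre]
    for r in range(1, radius + 1):
        for bits in combinations(range(b), r):
            flipped = centre
            for i in bits:
                flipped ^= 1 << i  # flip each chosen bit
            ids.append(flipped)
    return ids

query_hash = 0b0011  # e.g. hash of "Harry Potter"
print(len(hamming_ball(query_hash, 1)))  # 5 nodes: the centre plus 4 neighbours
```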

SLIDE 9

Experimental Setup

  • A hypercube DHT is instantiated where end-to-end distance is drawn from the King dataset*, which contains latency measurements of a set of DNS servers.
  • Surrogates for a logical ID are chosen based on a Plaxton-style post-fix matching scheme.
  • Nodes choose DHT IDs randomly; no network proximity is used, to expose worst-case performance and the trade-off with dimensionality and caching.
  • A sample of FreeDB**, a free online CD album database containing 20 million songs, is used to reflect actual objects in the real world.
  • Gnutella query traces served as our query samples.

* http://pdos.csail.mit.edu/p2psim/kingdata/ ** http://www.freedb.org
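Plaxton-style post-fix matching, as used above to pick a surrogate, routes to the live node whose ID shares the longest common suffix with the target logical ID. The helper names below are hypothetical; only the matching rule itself comes from the slide.

```python
# Sketch of Plaxton-style post-fix (suffix) matching for surrogate choice.

def suffix_match_len(a, b):
    # Count matching digits from the right-hand end of both IDs.
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def surrogate(logical_id, live_nodes):
    # The surrogate is the live node sharing the longest suffix.
    return max(live_nodes, key=lambda nid: suffix_match_len(logical_id, nid))

nodes = ["0110", "1011", "1101"]
print(surrogate("0011", nodes))  # "1011": shares the 3-digit suffix "011"
```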

SLIDE 10

Retrieval Cost

* Recall rate = percentage of relevant objects found. ** Legend tuples denote (b, n), where b is the dimensionality and n is the size of the network.

SLIDE 11

Query Latency in Wide Area

Selection of b controls the degree of clustering

SLIDE 12

Network Performance

* This result uses the query “bon jovi” as an example; FreeDB contains 3242 distinct, related songs for it.

SLIDE 13

Conclusion

  • Qube spreads objects across the network by their similarity: better fairness and availability, zero preprocessing, little synchronisation needed.
  • By tuning the parameter b, one may choose the degree of the performance/fairness trade-off.
  • We are further investigating lower-latency schemes to trim probing cost and decouple query accuracy from network size.

SLIDE 14

Future Work

  • Large-scale simulation (>100K nodes with a realistic network latency generator)
  • Flash-crowd query model and replication/caching
  • Distributed proximity searches, such as kNN under the Euclidean metric

SLIDE 15

Thank you!

Yu-En.Lu@cl.cam.ac.uk