SLIDE 1

A Peer-to-Peer Inverted Index Implementation for Word-based Content Search

Nuno Lopes University of Minho October 2003

SLIDE 2

P2P System Characterization

  • Scalable up to millions of nodes
  • Highly dynamic node membership
  • Reduced node uptime: 1 hour on average
  • No centralized authority

© 2003 Nuno Lopes, SDDI 2003

SLIDE 3

1st Generation of P2P Systems: File-Sharing Oriented

  • Napster

Centralized search with P2P file download

⇒ Single point of failure

  • Gnutella

Broadcast-based search

⇒ Network overload

SLIDE 4

Searching Model

  • Local model

Each peer searches its own local content

Examples: Gnutella, Pedone’02

  • Global model

Information is placed on a global (distributed) shared index

SLIDE 5

2nd Generation of P2P Systems: Distributed Hash Table (DHT) Based

  • Examples: Chord, Pastry, others...
  • Simple hash table operations on (key,value) pairs
  • Efficient routing: O(log N) hops to reach any peer
  • Scalable state information: O(log N) routing entries per peer
  • But... only exact-key lookups, no content search
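
As a rough illustration of the (key, value) interface listed above, the following minimal sketch places peers on a hash ring and stores each key at its successor peer, in the spirit of Chord. It is a single-process toy under stated assumptions: the names `ToyDHT`, `put`, `get`, and `responsible_peer` are hypothetical, and a real DHT locates the successor in O(log N) hops through per-peer routing tables rather than through one local sorted list. The last two lines show why such a table cannot search on its own: only the exact key finds the value.

```python
import bisect
import hashlib


def ring_hash(text: str) -> int:
    """Map a string to a position on the identifier ring (160-bit SHA-1)."""
    return int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)


class ToyDHT:
    """Single-process stand-in for a DHT: each key lives on its successor peer."""

    def __init__(self, peer_names):
        # Sorted ring positions, plus the local storage of each peer.
        self.ring = sorted((ring_hash(name), name) for name in peer_names)
        self.storage = {name: {} for name in peer_names}

    def responsible_peer(self, key: str) -> str:
        # Successor rule: first peer at or after the key's position, wrapping around.
        positions = [pos for pos, _ in self.ring]
        i = bisect.bisect_left(positions, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

    def put(self, key: str, value) -> None:
        self.storage[self.responsible_peer(key)][key] = value

    def get(self, key: str):
        return self.storage[self.responsible_peer(key)].get(key)


dht = ToyDHT([f"peer-{i}" for i in range(8)])
dht.put("block-42", b"payload")
print(dht.get("block-42"))   # exact-key lookup succeeds: b'payload'
print(dht.get("block-4"))    # no prefix or keyword search: None
```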

SLIDE 6

Inverted Index Description

  • Association: word → set of document locations
  • The document location set is highly dynamic
  • Set sizes (documents per word) follow a Zipf distribution

[Figure: number of documents per word. x-axis: Words (ticks 5000 to 35000); y-axis: # Documents (ticks 100 to 700)]
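
For reference, Zipf's law is commonly written as below. Here $r$ is the rank of a word when words are sorted by decreasing number of documents, $f(r)$ is that number of documents, and $C$ and $s$ are fitted constants, with $s$ close to 1 in the classic form. These symbols are introduced for illustration only; the slides do not report fitted values.

```latex
f(r) = \frac{C}{r^{s}}
```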

SLIDE 7

Inverted Index API

  • INSERT(word, reference)
  • REMOVE(word, reference)
  • HAS_REF(word, reference): bool
  • GET_REF(word): reference
  • NEXT_REF(word, reference): reference
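
The five operations above can be sketched as a small in-memory class. This is only an illustration, not the distributed implementation described in the slides. It assumes that GET_REF returns the first reference of a word and that NEXT_REF returns the smallest reference greater than the one given, so that the two together form an iterator; the slides do not spell out these semantics. The class name and the sorted-list representation are hypothetical.

```python
import bisect
from collections import defaultdict


class InvertedIndex:
    """In-memory sketch: each word maps to a sorted list of references."""

    def __init__(self):
        self._refs = defaultdict(list)

    def insert(self, word, reference):
        refs = self._refs[word]
        i = bisect.bisect_left(refs, reference)
        if i == len(refs) or refs[i] != reference:   # keep set semantics
            refs.insert(i, reference)

    def remove(self, word, reference):
        refs = self._refs.get(word, [])
        i = bisect.bisect_left(refs, reference)
        if i < len(refs) and refs[i] == reference:
            del refs[i]

    def has_ref(self, word, reference):
        refs = self._refs.get(word, [])
        i = bisect.bisect_left(refs, reference)
        return i < len(refs) and refs[i] == reference

    def get_ref(self, word):
        """Assumed meaning: first reference for the word, or None."""
        refs = self._refs.get(word, [])
        return refs[0] if refs else None

    def next_ref(self, word, reference):
        """Assumed meaning: smallest reference greater than the given one, or None."""
        refs = self._refs.get(word, [])
        i = bisect.bisect_right(refs, reference)
        return refs[i] if i < len(refs) else None


index = InvertedIndex()
for doc in ["doc3", "doc1", "doc2"]:
    index.insert("peer", doc)

# Iterate over the whole set with GET_REF followed by repeated NEXT_REF.
ref = index.get_ref("peer")
found = []
while ref is not None:
    found.append(ref)
    ref = index.next_ref("peer", ref)
print(found)  # ['doc1', 'doc2', 'doc3']
```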

SLIDE 8

Inverted Index Implementation

The index is split into constant-size blocks, accessed through two layers:

  • DHT as base platform for block-oriented storage

⇒ Unsuitable as a stand-alone implementation

  • B+ tree for block management

Responsible for implementing the set associated with each word
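
To make the layering concrete, here is a deliberately simplified sketch. `BlockStore` stands in for the DHT layer (a plain dictionary that counts one message per block read or write), and `WordSet` keeps the reference set of one word in a two-level tree: one root block of separators pointing at fixed-capacity leaf blocks that split on overflow. This is not a full B+ tree: the root here grows without bound, whereas a real B+ tree would split it and add levels. All names, the block layout, and `BLOCK_SIZE` are assumptions made for illustration.

```python
import bisect

BLOCK_SIZE = 4  # hypothetical capacity: max references per leaf block


class BlockStore:
    """Stand-in for the DHT layer: blocks addressed by id."""

    def __init__(self):
        self.blocks = {}
        self.messages = 0  # one message per block read or write

    def get(self, block_id):
        self.messages += 1
        return self.blocks.get(block_id)

    def put(self, block_id, block):
        self.messages += 1
        self.blocks[block_id] = block


class WordSet:
    """Two-level tree (root + leaves) holding the reference set of one word."""

    def __init__(self, store, word):
        self.store, self.word = store, word
        self.root_id = f"{word}/root"
        if store.get(self.root_id) is None:
            store.put(f"{word}/leaf0", {"refs": []})
            store.put(self.root_id, {"seps": [], "leaves": [f"{word}/leaf0"], "next": 1})

    def _find_leaf(self, root, ref):
        # seps[i] is the smallest reference stored in leaves[i + 1].
        return bisect.bisect_right(root["seps"], ref)

    def insert(self, ref):
        root = self.store.get(self.root_id)
        pos = self._find_leaf(root, ref)
        leaf_id = root["leaves"][pos]
        leaf = self.store.get(leaf_id)
        if ref in leaf["refs"]:
            return
        bisect.insort(leaf["refs"], ref)
        if len(leaf["refs"]) > BLOCK_SIZE:           # overflow: split the leaf
            mid = len(leaf["refs"]) // 2
            new_id = f"{self.word}/leaf{root['next']}"
            root["next"] += 1
            self.store.put(new_id, {"refs": leaf["refs"][mid:]})
            root["seps"].insert(pos, leaf["refs"][mid])
            root["leaves"].insert(pos + 1, new_id)
            leaf["refs"] = leaf["refs"][:mid]
            self.store.put(self.root_id, root)
        self.store.put(leaf_id, leaf)

    def contains(self, ref):
        root = self.store.get(self.root_id)
        leaf = self.store.get(root["leaves"][self._find_leaf(root, ref)])
        return ref in leaf["refs"]

    def all_refs(self):
        root = self.store.get(self.root_id)
        return [r for leaf_id in root["leaves"] for r in self.store.get(leaf_id)["refs"]]


store = BlockStore()
refs = WordSet(store, "peer")
for n in [5, 1, 9, 3, 7, 2, 8]:
    refs.insert(f"doc{n}")
print(refs.all_refs())
print(len(store.blocks), "blocks,", store.messages, "messages")
```

Every operation starts by reading the root block of the word, so in this layout the root of a popular word is read far more often than any single leaf.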

SLIDE 9

Current Simulation Settings

  • Only the B+ tree layer is simulated
  • Peers store a single block each
  • Messages have an atomic cost
  • A single client issues index operations to the system
  • Data consists of 1000 small documents with 36499 unique words

SLIDE 10

Initial Simulation Results

  • B+ trees make the storage load uniform across peers
  • However... root blocks for popular words have high network load

[Figure: access rate per block, without cache. x-axis: Blocks (ticks 10000 to 60000); y-axis: Access rate (ticks 100 to 800)]

SLIDE 11

Caching Mechanism

  • Clients have a high probability of requesting the same blocks for popular words
  • Caching of (non-leaf) blocks reduces the number of accesses
  • To avoid stale copies, leaf blocks are never cached
  • Higher-level blocks are less likely to be modified and therefore less likely to become stale
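
The caching policy in the bullets above can be sketched as a client-side cache that admits only non-leaf blocks. The sketch uses LRU replacement and makes several assumptions: `fetch` is a hypothetical callable standing in for a network read, the caller is assumed to know whether a block is a leaf (for example from its level in the tree), and no invalidation of cached blocks is modelled.

```python
from collections import OrderedDict


class BlockCache:
    """Client-side LRU cache that only admits non-leaf blocks."""

    def __init__(self, fetch, capacity):
        self.fetch = fetch            # hypothetical callable: block_id -> block
        self.capacity = capacity
        self.entries = OrderedDict()
        self.remote_reads = 0

    def get(self, block_id, is_leaf):
        if not is_leaf and block_id in self.entries:
            self.entries.move_to_end(block_id)    # mark as most recently used
            return self.entries[block_id]
        self.remote_reads += 1
        block = self.fetch(block_id)
        if not is_leaf:                           # leaf blocks are never cached
            self.entries[block_id] = block
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return block


remote = {"peer/root": {"seps": ["doc5"]}, "peer/leaf0": {"refs": ["doc1"]}}
cache = BlockCache(remote.__getitem__, capacity=2)
for _ in range(10):
    cache.get("peer/root", is_leaf=False)
    cache.get("peer/leaf0", is_leaf=True)
print(cache.remote_reads)  # 11: one root read plus ten leaf reads
```

Without the cache the same loop would issue 20 remote reads, 10 of them to the root block.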

SLIDE 12

Simulation Results (Using Cache)

  • The use of a cache mechanism (LRU) distributes the network load more evenly across peers
  • Access rates were reduced by a factor of 10

[Figure: access rate per block, with cache. x-axis: Blocks (ticks 10000 to 60000); y-axis: Access rate (ticks 10 to 60)]

SLIDE 13

Open Questions

  • Measurement of the DHT as a stand-alone implementation of the inverted index
  • Analysis of the block caching mechanism to determine the best cache size for different numbers of peers in the system
  • Implementation of a multiple-blocks-per-peer association for studying effective peer load
  • Implementation and load measurement of the AND and OR search operators
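
As a point of reference for the last item, AND and OR over an inverted index reduce to set intersection and union of the per-word document sets. The sketch below shows only that logic on a local dictionary; the function names and sample data are hypothetical. In the distributed setting each set is spread over blocks held by different peers, and the cost of fetching and combining them is precisely what remains to be measured.

```python
def search_and(index, words):
    """Documents containing every word (set intersection)."""
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()


def search_or(index, words):
    """Documents containing at least one word (set union)."""
    return set().union(*(index.get(w, set()) for w in words))


index = {
    "peer": {"doc1", "doc2", "doc3"},
    "index": {"doc2", "doc3", "doc4"},
    "cache": {"doc3"},
}
print(sorted(search_and(index, ["peer", "index"])))   # ['doc2', 'doc3']
print(sorted(search_or(index, ["peer", "cache"])))    # ['doc1', 'doc2', 'doc3']
```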
