SLIDE 1
A Peer-to-Peer Inverted Index Implementation for Word-based Content Search
Nuno Lopes
University of Minho
October 2003
SLIDE 2
P2P System Characterization
- Scalable up to millions of nodes
- Highly dynamic node membership
- Reduced node uptime: 1 hour on average
- No centralized authority
© 2003 Nuno Lopes, SDDI 2003
SLIDE 3
1st Generation of P2P Systems: File Sharing Oriented
- Napster
Centralized search with P2P file download
⇒ Single point of failure
- Gnutella
Broadcast-based search
⇒ Network overload
SLIDE 4
Searching Model
- Local model
Individual peer search
Examples: Gnutella, Pedone’02
- Global model
Information is placed on a global (distributed) shared index
SLIDE 5
2nd Generation of P2P Systems: Distributed Hash Table (DHT) Based
- Examples: Chord, Pastry, others...
- Simple hash table operations on (key,value) pairs
- Efficient routing: O(log N) hops to reach any peer
- Scalable state information: O(log N) routing entries per peer
- But... only exact-key lookup, no content search
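The limitation in the last bullet can be illustrated with a minimal single-process sketch. It is not the Chord or Pastry protocol: the O(log N) routing is left out, and the class `ToyDHT`, the peer names and the stored values are all hypothetical. Each key is simply placed on the first peer whose identifier follows the key's hash on a ring.

```python
import hashlib
from bisect import bisect_left


def key_id(key: str, bits: int = 32) -> int:
    """Hash a string onto a circular identifier space of 2**bits points."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


class ToyDHT:
    """Stand-in for a DHT: a key lives on the first peer whose identifier
    is >= the key identifier, wrapping around the ring (assumed rule)."""

    def __init__(self, peer_names):
        self.ring = sorted((key_id(name), name) for name in peer_names)
        self.store = {name: {} for name in peer_names}

    def responsible_peer(self, key: str) -> str:
        ids = [pid for pid, _ in self.ring]
        pos = bisect_left(ids, key_id(key)) % len(self.ring)
        return self.ring[pos][1]

    def put(self, key, value):
        self.store[self.responsible_peer(key)][key] = value

    def get(self, key):
        return self.store[self.responsible_peer(key)].get(key)


dht = ToyDHT([f"peer-{i}" for i in range(8)])
dht.put("report.txt", "peer-3:/shared/report.txt")

exact = dht.get("report.txt")   # the exact key is found
partial = dht.get("report")     # a word taken from the key is not a key itself
```

Here `exact` holds the stored value while `partial` is `None`: a hash table answers only for the complete key, which is why a word-based search needs an index built on top of it.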
SLIDE 6
Inverted Index Description
- Association: word → set of document locations
- Document Location Set is highly dynamic
- Number of documents per word follows a Zipf distribution
[Figure: number of documents per word. x-axis: Words, y-axis: # Documents]
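As a small illustration of the word → location set association, the sketch below builds an inverted index from a toy corpus and ranks words by the size of their location set. The corpus and the location strings are invented for the example and are far too small to exhibit a real Zipf curve; they only show that a few words end up in most documents.

```python
from collections import defaultdict

# Hypothetical toy corpus: document location -> text.
docs = {
    "peer1:/a.txt": "the index maps the word to documents",
    "peer2:/b.txt": "the peer stores the block",
    "peer3:/c.txt": "a word index",
}

# word -> set of document locations containing that word
index = defaultdict(set)
for location, text in docs.items():
    for word in text.split():
        index[word].add(location)

# Words ranked by decreasing size of their document location set
# (ties broken alphabetically).
ranked = sorted(index.items(), key=lambda item: (-len(item[1]), item[0]))
for word, locations in ranked[:3]:
    print(word, len(locations))
```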
SLIDE 7
Inverted Index API
- INSERT(word, reference)
- REMOVE(word, reference)
- HAS_REF(word, reference): bool
- GET_REF(word): reference
- NEXT_REF(word, reference): reference
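The five operations can be sketched on a single machine as follows. The slide gives only the signatures, so the semantics here are assumptions: references for a word are kept in sorted order, GET_REF returns the first one, NEXT_REF returns the one following the given reference, and both return `None` when nothing is left. The class name `InvertedIndex` is hypothetical.

```python
from bisect import bisect_left, bisect_right, insort


class InvertedIndex:
    """In-memory sketch of the API; each word maps to a sorted list."""

    def __init__(self):
        self._refs = {}   # word -> sorted list of references

    def insert(self, word, reference):
        refs = self._refs.setdefault(word, [])
        if not self.has_ref(word, reference):
            insort(refs, reference)          # keep the list sorted

    def remove(self, word, reference):
        refs = self._refs.get(word, [])
        pos = bisect_left(refs, reference)
        if pos < len(refs) and refs[pos] == reference:
            del refs[pos]
            if not refs:                     # drop words with no references
                del self._refs[word]

    def has_ref(self, word, reference):
        refs = self._refs.get(word, [])
        pos = bisect_left(refs, reference)
        return pos < len(refs) and refs[pos] == reference

    def get_ref(self, word):
        refs = self._refs.get(word, [])
        return refs[0] if refs else None     # assumed: first reference

    def next_ref(self, word, reference):
        refs = self._refs.get(word, [])
        pos = bisect_right(refs, reference)  # first entry after `reference`
        return refs[pos] if pos < len(refs) else None


idx = InvertedIndex()
for ref in ["peer2:/b.txt", "peer1:/a.txt", "peer3:/c.txt"]:
    idx.insert("index", ref)
idx.remove("index", "peer2:/b.txt")

# Enumerate a word's references with GET_REF followed by NEXT_REF.
found = []
ref = idx.get_ref("index")
while ref is not None:
    found.append(ref)
    ref = idx.next_ref("index", ref)
```

Under these assumptions a client can walk a location set one reference at a time, which suits large sets that should not be transferred in a single message.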
SLIDE 8
Inverted Index Implementation
The index is split into constant-size blocks, accessed through 2 layers:
- DHT as base platform for block-oriented storage
⇒ Unsuitable as a stand-alone implementation
- B+ tree for block management
Implements the document location set for each word
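The two layers can be sketched as below, under strong simplifying assumptions: a plain dictionary stands in for the DHT's block storage, and the B+ tree is reduced to two levels (one root block per word pointing at leaf blocks, with the root never split). `BLOCK_SIZE`, the block key format and the function names are all hypothetical.

```python
from bisect import bisect_right, insort
from itertools import count

BLOCK_SIZE = 4    # assumed maximum number of references per leaf block
dht = {}          # stand-in for DHT storage: block key -> block contents
_ids = count()    # source of unique leaf block numbers


def _new_leaf(word, refs):
    key = f"{word}#leaf{next(_ids)}"
    dht[key] = {"leaf": True, "refs": refs}
    return key


def insert(word, reference):
    root_key = f"{word}#root"
    if root_key not in dht:
        dht[root_key] = {"leaf": False, "seps": [],
                         "children": [_new_leaf(word, [])]}
    root = dht[root_key]
    # seps[i] is the smallest reference stored in children[i + 1].
    pos = bisect_right(root["seps"], reference)
    leaf = dht[root["children"][pos]]
    if reference in leaf["refs"]:
        return
    insort(leaf["refs"], reference)
    if len(leaf["refs"]) > BLOCK_SIZE:       # overflow: split the leaf block
        mid = len(leaf["refs"]) // 2
        right = leaf["refs"][mid:]
        del leaf["refs"][mid:]
        root["seps"].insert(pos, right[0])
        root["children"].insert(pos + 1, _new_leaf(word, right))


def lookup(word):
    root = dht.get(f"{word}#root")
    if root is None:
        return []
    refs = []
    for child_key in root["children"]:       # leaves are already in order
        refs.extend(dht[child_key]["refs"])
    return refs


for i in range(10):
    insert("index", f"doc{i:02d}")
all_refs = lookup("index")
leaf_sizes = [len(b["refs"]) for b in dht.values() if b["leaf"]]
```

Every leaf stays at or below `BLOCK_SIZE` entries, so a popular word spreads over many blocks instead of growing one unbounded value under a single DHT key. In a real deployment each block read or write would be a DHT get or put rather than an in-place mutation.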
SLIDE 9
Current Simulation Settings
- Only the B+ tree layer is simulated
- Peers store a single block each
- Messages have an atomic cost
- A single client issues index operations to the system
- Data consists of 1000 small documents with 36499 unique words
SLIDE 10
Initial Simulation Results
- B+ trees make the storage load uniform across peers
- However... root blocks for popular words carry a high network load
[Figure: access rate per block. x-axis: Blocks, y-axis: Access rate]
SLIDE 11
Caching Mechanism
- Clients have a high probability of requesting the same blocks for popular words
- Caching of (non-leaf) blocks reduces the number of accesses
- In order to avoid stale copies, leaf blocks are never cached
- Higher-level blocks are less likely to be modified and therefore to become stale
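A minimal sketch of such a client-side cache follows, assuming each block carries a `leaf` flag and that eviction is least-recently-used. The class `BlockCache`, its capacity and the two example blocks are hypothetical and not taken from the simulation.

```python
from collections import OrderedDict


class BlockCache:
    """Client-side LRU cache that stores only non-leaf blocks."""

    def __init__(self, fetch, capacity=2):
        self.fetch = fetch              # block key -> block (one remote access)
        self.capacity = capacity
        self.blocks = OrderedDict()     # cached blocks, oldest first
        self.remote_accesses = 0

    def get(self, key):
        if key in self.blocks:
            self.blocks.move_to_end(key)          # mark as most recently used
            return self.blocks[key]
        self.remote_accesses += 1
        block = self.fetch(key)
        if not block["leaf"]:                     # leaf blocks are never cached
            self.blocks[key] = block
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)   # evict least recently used
        return block


# Hypothetical remote blocks for one popular word.
remote = {
    "the#root": {"leaf": False, "children": ["the#leaf0"]},
    "the#leaf0": {"leaf": True, "refs": ["peer1:/a.txt"]},
}

cache = BlockCache(remote.__getitem__)
for _ in range(5):
    root = cache.get("the#root")
    leaf = cache.get(root["children"][0])
```

Five lookups would cost ten remote accesses without a cache. Here the root is fetched once and the leaf every time, so `cache.remote_accesses` ends at 6, and the saving falls on the root block, which is where the load concentrates.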
SLIDE 12
Simulation Results (Using Cache)
- The use of a cache mechanism (LRU) distributes the network load more evenly across peers
- Access rates were reduced by a factor of 10
[Figure: access rate per block with caching. x-axis: Blocks, y-axis: Access rate]
SLIDE 13
Open Questions
- Measurement of the DHT as a stand-alone implementation of the inverted index
- Analysis of the block caching mechanism to determine the best cache size for different numbers of peers in the system
- Implementation of a multiple-blocks-per-peer association to study effective peer load
- Implementation and load measurement of the AND and OR search operators