a peer to peer inverted index implementation for word
play

A Peer-to-Peer Inverted Index Implementation for Word-based Content - PowerPoint PPT Presentation

A Peer-to-Peer Inverted Index Implementation for Word-based Content Search Nuno Lopes University of Minho October 2003 P2P System Characterization Scalable up to millions of nodes Highly dynamic node membership Reduced node uptime:


  1. A Peer-to-Peer Inverted Index Implementation for Word-based Content Search Nuno Lopes University of Minho October 2003

  2. P2P System Characterization • Scalable up to millions of nodes • Highly dynamic node membership • Reduced node uptime: 1 hour on average • No centralized authority � 2003 Nuno Lopes c 1 SDDI 2003

  3. 1st Generation of P2P Systems File Sharing Oriented • Napster Centralized search with p2p file download ⇒ Single point-of-failure • Gnutella Broadcast based search ⇒ Network overloaded � 2003 Nuno Lopes c 2 SDDI 2003

  4. Searching Model • Local model Individual peer search Examples: Gnutella, Pedone’02 • Global model Information is placed on a global (distributed) shared index � 2003 Nuno Lopes c 3 SDDI 2003

  5. 2nd Generation of P2P Systems Distributed Hash Table (DHT) Based • Examples: Chord, Pastry, others... • Simple hash table operations on ( key , value ) pairs • Efficient routing: O (log N ) hops for any peer • Scalable state information: O (log N ) routing entries per peer • But... incapable of searching � 2003 Nuno Lopes c 4 SDDI 2003

  6. Inverted Index Description • Association word �→ { document location } SET • Document Location Set is highly dynamic • Follows Zipf distribution 700 600 500 # Documents 400 300 200 100 0 0 5000 10000 15000 20000 25000 30000 35000 Words � 2003 Nuno Lopes c 5 SDDI 2003

  7. Inverted Index API • INSERT( word , reference ) • REMOVE( word , reference ) • HAS REF( word , reference ): bool • GET REF( word ): reference • NEXT REF( word , reference ): reference � 2003 Nuno Lopes c 6 SDDI 2003

  8. Inverted Index Implementation Index is splited in constant size blocks, accessed through 2 layers: • DHT as base platform for block-oriented storage ⇒ Unsuitable as a stand-alone implementation • B+ tree for block management Responsible for the set implementation to each word � 2003 Nuno Lopes c 7 SDDI 2003

  9. Current Simulation Settings • Only the B+ tree layer is simulated • Peers store a single block each • Messages have an atomic cost • Single client requests index operations on the system • Data consists on 1000 small documents with 36499 unique words � 2003 Nuno Lopes c 8 SDDI 2003

  10. Initial Simulation Results • B+ trees make the storage load uniform across peers • However... root blocks for popular words have high network load 800 700 600 500 Access rate 400 300 200 100 0 0 10000 20000 30000 40000 50000 60000 Blocks � 2003 Nuno Lopes c 9 SDDI 2003

  11. Caching Mechanism • Clients have high probability of requesting the same blocks for popular words • Caching of (non-leaf) blocks reduces the number of accesses • In order to avoid stale copies, leaf blocks are never cached • Higher level blocks are less probable to become modified and therefore stale � 2003 Nuno Lopes c 10 SDDI 2003

  12. Simulation Results (Using Cache) • The use of a cache mechanism (LRU) distributes more evenly the network load on peers • Access rates were reduced by a factor of 10 60 50 40 Access rate 30 20 10 0 0 10000 20000 30000 40000 50000 60000 Blocks � 2003 Nuno Lopes c 11 SDDI 2003

  13. Open Questions • Measurement of DHT as stand-alone implementation of inverted index • Analysis of the block caching mechanism to determine the best cache size for different numbers of peers on the system • Implementation of multiple blocks to peer association for studying effective peer load • AND and OR search operators implementation and load measurement � 2003 Nuno Lopes c 12 SDDI 2003

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend