Toward a Multi-tier Index for Information Retrieval System Madhav - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Toward a Multi-tier Index for Information Retrieval System

Madhav Ram FJ37459

slide-2
SLIDE 2

Introduction

  • IR systems are mainly developed to help manage the huge body of literature that has been produced
  • IR systems provide users with easy access to information; their main functions are the representation, storage, organization, and access of information items
  • The performance of the information retrieval process decreases drastically as the amount of information stored in the system increases

slide-3
SLIDE 3

Related work

  • Previous work in this field follows mainly two directions
  • The first is sequential processing: one processor at a time is used to construct the inverted file index used for information retrieval
  • The second is parallel processing, which uses multiple processors to construct the inverted file index
  • An inverted index is a sorted list of keywords, with each keyword having links to the documents that contain it

slide-4
SLIDE 4

Sequential Processing

  • The inverted index consists of two files: the first, which contains the list of keywords, is called the dictionary file, and the second, which contains the document links, is called the posting file
  • The data structure used here is a sorted array searched with binary search
  • A sorted array is easy to implement and reasonably fast to search

Fig: Implementation of an inverted index using a sorted array.
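
The dictionary and posting files described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation from the paper: the names build_index and lookup, the in-memory lists standing in for the two files, and the sample documents are all assumptions.

```python
from bisect import bisect_left

def build_index(docs):
    """Return (dictionary, postings): a sorted keyword list and parallel posting lists."""
    table = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            table.setdefault(word, []).append(doc_id)
    dictionary = sorted(table)                          # stands in for the dictionary file
    postings = [sorted(table[w]) for w in dictionary]   # stands in for the posting file
    return dictionary, postings

def lookup(dictionary, postings, keyword):
    """Binary-search the sorted dictionary, then follow the link into the postings."""
    i = bisect_left(dictionary, keyword)
    if i < len(dictionary) and dictionary[i] == keyword:
        return postings[i]
    return []

# Hypothetical sample documents, for illustration only.
docs = {1: "index tier search", 2: "search update", 3: "tier index merge"}
dictionary, postings = build_index(docs)
```

Because the dictionary is kept sorted, each lookup costs a logarithmic number of comparisons, but inserting a new keyword means shifting array entries, which is one reason updates are expensive.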

slide-5
SLIDE 5

Sequential Processing (contd..)

  • A system called Glimpse was developed using the block addressing idea to speed up construction of the inverted file
  • The main advantage of block addressing is that the inverted file shrinks to an overhead of only 5% of the original text size
  • Partial indexing is the approach of dividing the original text files into smaller buckets that fit into main memory

Fig: Partial indexing technique, merging the partial indexes in a binary fashion
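
The partial indexing idea can be sketched as follows, assuming each bucket is a short list of (doc_id, text) pairs that fits in memory. The function names, the bucket_size parameter, and the sample data are hypothetical; a real system would write each partial index to disk rather than keep it in a dictionary.

```python
def index_bucket(bucket):
    """Build an in-memory partial index (keyword -> sorted doc ids) for one bucket."""
    index = {}
    for doc_id, text in bucket:
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

def merge_two(a, b):
    """Merge two partial indexes by uniting the posting lists of each keyword."""
    merged = {w: list(ids) for w, ids in a.items()}
    for word, ids in b.items():
        merged[word] = sorted(set(merged.get(word, [])) | set(ids))
    return merged

def build_by_partial_indexing(docs, bucket_size):
    """Split docs into buckets, index each one, then merge the indexes pairwise."""
    buckets = [docs[i:i + bucket_size] for i in range(0, len(docs), bucket_size)]
    level = [index_bucket(b) for b in buckets]
    while len(level) > 1:
        # Binary merging: combine neighbours two at a time, level by level.
        nxt = [merge_two(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # an odd index out is carried to the next level
        level = nxt
    return level[0] if level else {}

# Hypothetical sample collection, for illustration only.
docs = [(1, "index tier"), (2, "search index"), (3, "tier merge"), (4, "merge"), (5, "index")]
index = build_by_partial_indexing(docs, bucket_size=2)
```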

slide-6
SLIDE 6

Parallel Processing

  • In the bulk-synchronous parallel model of computing, parallelism is tackled using two approaches: the Local Index approach and the Global Index approach
  • With local indexing, an inverted list index is constructed in each processor by considering only the documents stored on that processor
  • With global indexing, the whole collection of documents is used to produce a single inverted list index, which is identical to the sequential ones
  • Three distributed algorithms are used to build global inverted files for very large text collections: the Local Buffer and Local List algorithm (LL), the Local Buffer and Remote List algorithm (LR), and the Remote Buffer and Remote List algorithm (RR)
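
The difference between the two approaches can be illustrated with a small single-process sketch, in which lists stand in for processors. It does not reproduce the LL, LR, or RR algorithms. The round-robin assignment of documents to processors and all function names are assumptions made for the example.

```python
def build_index(docs):
    """Sequential inverted index: keyword -> sorted doc ids."""
    index = {}
    for doc_id, text in docs:
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

def local_indexes(docs, n_procs):
    """Local approach: each processor indexes only the documents it stores."""
    shards = [docs[p::n_procs] for p in range(n_procs)]   # assumed round-robin placement
    return [build_index(shard) for shard in shards]

def query_local(indexes, keyword):
    """A query must visit every local index and unite the partial answers."""
    hits = set()
    for index in indexes:
        hits.update(index.get(keyword, []))
    return sorted(hits)

def global_index(docs):
    """Global approach: one index over the whole collection, same as the sequential one."""
    return build_index(docs)

# Hypothetical sample collection, for illustration only.
docs = [(1, "index tier"), (2, "search index"), (3, "tier merge"), (4, "merge index")]
local = local_indexes(docs, n_procs=2)
whole = global_index(docs)
```

In this sketch the local indexes answer a query only after all of them are consulted, whereas the global index already holds the complete posting list for each keyword.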

slide-7
SLIDE 7

Approach

  • There are two methods to enhance IR systems: the first is to use special-purpose hardware, and the second is to use the Multi-Tier index algorithm
  • The second is discussed in this paper and is based on the use of new algorithms
  • The most common indexing technique is the inverted file index, which represents data as indexed data
  • The main disadvantage of the inverted file is that updating the index is expensive
  • The factors that affect the indexing process are the construction, searching, and updating time of the inverted file index

slide-8
SLIDE 8

Approach

  • The inverted file index constructed by the developed algorithms consists of two associated files: the first is the dictionary and the second is the postings file
  • The main benefits of the multi-tier design are a faster search process for any query and easier updating
  • The first step in the search process is looking up the first-tier directory to identify the first letter of the query and determine the name of the second-tier file in which to perform the search
  • The second step is searching in the second tier
  • The third step is the updating process: an inverted file index is created for the updated files and remerged
  • Finally, the posting file is updated

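
The search and update steps above can be sketched as a two-level lookup. The slides do not give the file layout, so everything concrete here is an assumption: the class name MultiTierIndex, the tier2_<letter>.idx naming scheme, and the use of in-memory dictionaries in place of files on disk.

```python
class MultiTierIndex:
    def __init__(self):
        # First tier: first letter -> name of a second-tier file (hypothetical naming).
        self.first_tier = {}
        # Second tier: file name -> {keyword: sorted doc ids}; dicts stand in for files.
        self.second_tier = {}

    def _file_for(self, keyword):
        letter = keyword[0]
        name = self.first_tier.setdefault(letter, f"tier2_{letter}.idx")
        return self.second_tier.setdefault(name, {})

    def search(self, keyword):
        keyword = keyword.lower()
        if not keyword:
            return []
        name = self.first_tier.get(keyword[0])            # step 1: first-tier directory
        if name is None:
            return []
        return self.second_tier[name].get(keyword, [])    # step 2: second-tier search

    def update(self, docs):
        """Index the new documents, remerge them, and return the files that were touched."""
        fresh = {}
        for doc_id, text in docs:
            for word in set(text.lower().split()):
                fresh.setdefault(word, set()).add(doc_id)
        touched = set()
        for word, ids in fresh.items():
            bucket = self._file_for(word)
            bucket[word] = sorted(set(bucket.get(word, [])) | ids)
            touched.add(self.first_tier[word[0]])
        return touched

# Hypothetical sample documents, for illustration only.
index = MultiTierIndex()
index.update([(1, "index tier"), (2, "search index")])
touched = index.update([(3, "tier merge")])
```

In this sketch an update rewrites only the second-tier files whose first letter occurs in the new documents, which is one plausible reading of why updating is cheaper than remerging a single large index.
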
slide-9
SLIDE 9

Experimental Results

  • All the datasets used for this research are synthetic
  • For the synthetic datasets, a random function generator is used to create the words of the text documents
  • The partial indices concept is used for constructing the inverted file index
  • The implementation is in Visual Basic and is run on two different hardware systems: a PII 333 MHz with 64 MB RAM, and a 2.8 GHz Dell server with 1 GB RAM
  • Updating performance is measured for different update file sizes on different sizes of inverted file
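
A synthetic text document of a given size could be generated roughly as follows. The original experiments used Visual Basic and do not describe the generator, so the word-length range, the lowercase alphabet, and the function names here are assumptions chosen only for illustration.

```python
import random
import string

def random_word(rng, min_len=3, max_len=8):
    """Draw one random lowercase word; the length bounds are assumed, not from the source."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def synthetic_document(target_bytes, seed=0):
    """Generate space-separated random words until the text reaches about target_bytes."""
    rng = random.Random(seed)   # seeded so that a run can be repeated
    words, size = [], 0
    while size < target_bytes:
        word = random_word(rng)
        words.append(word)
        size += len(word) + 1   # +1 accounts for the separating space
    return " ".join(words)

doc = synthetic_document(1024)   # roughly a 1 KB update file
```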

slide-10
SLIDE 10

Experimental Results : PII 333 MHz

Figure a: Updating time by 1 KB file size using Partial and Multi-Tier inverted file

[Chart "Updating by 1KB using PII 333": Updating Time versus Inverted File Index Size (1K, 512K, 2M, 8M); series: Partial, Multi-Tier]

Figure b: Updating time by 1 MB file size using Partial and Multi-Tier inverted file

[Chart "Updating by 1MB using PII 333": Updating Time versus Inverted File Index Size (1K, 512K, 2M, 8M); series: Partial, Multi-Tier]

slide-11
SLIDE 11

Experimental Results: PII 333 MHz

Figure c: Updating time by 2 MB file size using Partial and Multi-Tier inverted file

[Chart "Updating by 2MB using PII 333": Updating Time versus Inverted File Index Size (1K, 512K, 2M, 8M); series: Partial, Multi-Tier]

slide-12
SLIDE 12

Experimental Results : Dell Server 2.8 GHz

Figure d: Updating time by 1 KB file size using Partial and Multi-Tier inverted file

[Chart "Updating by 1KB using 2.8 GHz": Updating Time versus Inverted File Index Size (1K, 512K, 2M, 8M); series: Partial, Multi-Tier]

Figure e: Updating time by 1 MB file size using Partial and Multi-Tier inverted file

[Chart "Updating by 1MB using 2.8 GHz": Updating Time versus Inverted File Index Size (1K, 512K, 2M, 8M); series: Partial, Multi-Tier]

slide-13
SLIDE 13

Experimental Results : Dell Server 2.8 GHz

Figure f: Updating time by 2 MB file size using Partial and Multi-Tier inverted file

[Chart "Updating by 2MB using 2.8 GHz": Updating Time versus Inverted File Index Size (1K, 512K, 2M, 8M); series: Partial, Multi-Tier]

slide-14
SLIDE 14

Conclusion

  • The Multi-Tier indexing technique has superior performance to the partial index technique
  • The updating process performs better with a Multi-Tier index than with a partial index
  • This indicates that updating can be performed for both large and small file sizes with predictable performance

slide-15
SLIDE 15

Thank You!