

Index Structures for Information Filtering Under the Vector Space Model. Tak W. Yan and Hector Garcia-Molina, Department of Computer Science, Stanford University, Stanford, CA 94305


Index Structures for Information Filtering Under the Vector Space Model*

Tak W. Yan and Hector Garcia-Molina
Department of Computer Science, Stanford University, Stanford, CA 94305

Abstract

With the ever increasing volumes of electronic information generation, users of information systems are facing an information overload. It is desirable to support information filtering as a complement to traditional retrieval mechanism. The number of users, and thus profiles (representing users' long-term interests), handled by an information filtering system is potentially huge, and the system has to process a constant stream of incoming information in a timely fashion. The efficiency of the filtering process is thus an important issue.

In this paper, we study what data structures and algorithms can be used to efficiently perform large-scale information filtering under the vector space model, a retrieval model established as being effective. We apply the idea of the standard inverted index to index user profiles. We devise an alternative to the standard inverted index, in which we, instead of indexing every term in a profile, select only the significant ones to index. We evaluate their performance and show that the indexing methods require orders of magnitude fewer I/Os to process a document than when no index is used. We also show that the proposed alternative performs better in terms of I/O and CPU processing time in many cases.

[Figure 1: Information Filtering Server(s). Diagram labels: Users, Information Sources, Information Filtering Server(s).]

1 Introduction

Information is increasingly available in electronic form. The number and size of full text document databases are rapidly increasing. Users of such database systems are facing an information overload; it is becoming difficult for users to rely solely on traditional retrospective search and retrieval mechanisms to keep themselves apprised of new documents that are relevant to their interest. As a complement to conventional search mechanism, information systems can provide an information filtering mechanism, through which a user subscribes profiles, or queries that are continuously evaluated, to represent his long-term interests, and then passively receives information filtered by the system according to the profiles.

Research in information filtering has received a lot of attention lately. However, previous work has focused on the effectiveness (precision and recall) of the filtering, and little has been done to address the efficiency (performance) aspect of the problem. We believe that information filtering is going to be used on a large scale and hence the efficiency issue must be addressed. In this paper, we present data structure and algorithms to support information filtering.

Wide area information retrieval is now a reality; large-scale world-wide information filtering is also foreseeable. Consider a population of users and a number of information sources in a networked information filtering environment. The filtering can be done either at the information sources, at the user sites, or at an intermediate information filtering server (Figure 1). Relying solely on user filtering is expensive since network bandwidth is wasted to transmit irrelevant information and a lot of wasteful local processing is done. Relying on filtering at the sources themselves is also expensive since users need to replicate their profiles at all possible sources. The information filtering server is a good compromise. It collects information from a set of sources and routes it to interested users. Of course, there can be multiple information filtering servers on the network, each servicing a different set (maybe overlapping) of users and information sources.

In this paper, we focus on one information filtering server and consider what data structure and algorithms

* This research was sponsored by the Advanced Research Projects Agency (ARPA) of the Department of Defense under Grant No. MDA972-92-J-1029 with the Corporation for National Research Initiatives (CNRI). The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsement, either expressed or implied, of ARPA, the U.S. Government, or CNRI.
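As a rough illustration of the abstract's idea of applying a standard inverted index to user profiles, the following Python sketch may help. It is not the authors' implementation: the function names, the example profiles, the use of cosine similarity as the matching measure, and the threshold value are all assumptions made for illustration. The point it tries to show is that an incoming document only touches profiles that share at least one term with it, rather than being compared against every profile.

```python
import math
from collections import defaultdict

def normalize(vec):
    """Scale a term -> weight dict to unit length (empty if all-zero)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def build_index(profiles):
    """Inverted index: term -> list of (profile_id, normalized weight)."""
    index = defaultdict(list)
    for pid, vec in profiles.items():
        for term, weight in normalize(vec).items():
            index[term].append((pid, weight))
    return index

def match(index, doc, threshold):
    """Return sorted ids of profiles whose cosine similarity to doc
    is at least threshold (threshold is a hypothetical parameter)."""
    scores = defaultdict(float)
    for term, dw in normalize(doc).items():
        # Only postings for terms present in the document are visited.
        for pid, pw in index.get(term, ()):
            scores[pid] += dw * pw
    return sorted(pid for pid, s in scores.items() if s >= threshold)

# Hypothetical profiles and document, for illustration only.
profiles = {
    "alice": {"index": 1.0, "filtering": 1.0},
    "bob": {"network": 1.0, "bandwidth": 1.0},
}
index = build_index(profiles)
doc = {"index": 2.0, "filtering": 1.0, "server": 1.0}
matched = match(index, doc, threshold=0.5)
```

In this sketch the document shares no terms with the "bob" profile, so that profile's postings are never read. The alternative mentioned in the abstract, indexing only the significant terms of each profile, would change what `build_index` stores; it is not attempted here because this excerpt does not describe how significant terms are selected.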
