Index Structures for Information Filtering Under the Vector Space Model

Tak W. Yan and Hector Garcia-Molina
Department of Computer Science
Stanford University
Stanford, CA 94305


Abstract

With the ever increasing volumes of electronic information generation, users of information systems are facing an information overload. It is desirable to support information filtering as a complement to traditional retrieval mechanisms. The number of users, and thus profiles (representing users' long-term interests), handled by an information filtering system is potentially huge, and the system has to process a constant stream of incoming information in a timely fashion. The efficiency of the filtering process is thus an important issue. In this paper, we study what data structures and algorithms can be used to efficiently perform large-scale information filtering under the vector space model, a retrieval model established as being effective. We apply the idea of the standard inverted index to index user profiles. We devise an alternative to the standard inverted index, in which we, instead of indexing every term in a profile, select only the significant ones to index. We evaluate their performance and show that the indexing methods require orders of magnitude fewer I/Os to process a document than when no index is used. We also show that the proposed alternative performs better in terms of I/O and CPU processing time in many cases.

1 Introduction

Information is increasingly available in electronic form. The number and size of full-text document databases are rapidly increasing. Users of such database systems are facing an information overload; it is becoming difficult for users to rely solely on traditional retrospective search and retrieval mechanisms to keep themselves apprised of new documents that are relevant to their interest. As a complement to conventional search mechanisms, information
* This research was sponsored by the Advanced Research Projects Agency (ARPA) of the Department of Defense under Grant No. MDA972-92-J-1029 with the Corporation for National Research Initiatives (CNRI). The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsement, either expressed or implied, of ARPA, the U.S. Government, or CNRI.

systems can provide an information filtering mechanism, through which a user subscribes profiles, or queries that are continuously evaluated, to represent his long-term interests, and then passively receives information filtered by the system according to the profiles.

[Figure 1: Information Filtering Server(s) -- users, information sources, and one or more filtering servers in between]

Research in information filtering has received a lot of attention lately. However, previous work has focused on the effectiveness (precision and recall) of the filtering, and little has been done to address the efficiency (performance) aspect of the problem. We believe that information filtering is going to be used on a large scale and hence the efficiency issue must be addressed. In this paper, we present data structures and algorithms to support information filtering.

Wide-area information retrieval is now a reality; large-scale world-wide information filtering is also foreseeable. Consider a population of users and a number of information sources in a networked information filtering environment. The filtering can be done either at the information sources, at the user sites, or at an intermediate information filtering server (Figure 1). Relying solely on user filtering is expensive, since network bandwidth is wasted to transmit irrelevant information and a lot of wasteful local processing is done. Relying on filtering at the sources themselves is also expensive, since users need to replicate their profiles at all possible sources. The information filtering server is a good compromise. It collects information from a set of sources and routes it to interested users. Of course, there can be multiple information filtering servers on the network, each servicing a different (maybe overlapping) set of users and information sources.

In this paper, we focus on one information filtering server and consider what data structures and algorithms
it can employ to speed up the filtering process. This is important because, firstly, the number of users and profiles a server has to handle is potentially huge. Secondly, as the rate of information generation is high, a filtering server will have to process a large number of new documents every day, especially if the server collects information from a number of sources. Thirdly, it is important to deliver relevant information to users in a timely fashion for such a service to be useful. In summary, information filtering servers will have to handle huge numbers of profiles and process a constant stream of incoming documents in a timely fashion. Thus, to develop efficient processing methods for a single filtering server can be seen as the first but important step in achieving efficient filtering on a global scale.

To further motivate the need for efficient information filtering methods, let us look at a popular information source today -- Netnews. The study [11] reports that, as of January 1993, the total Netnews readership world-wide is estimated to be 1.9 million. The estimates for the average traffic are 49.5 MB and 19,210 messages per day (counting cross-posted messages only once). If we consider a Netnews filtering server that serves a small fraction (say 5%) of this user population, and each user has, say, five profiles, the server will have to handle hundreds of thousands of profiles. To match this large number of profiles against a daily influx of tens of thousands of documents in a timely fashion, it is apparent that efficient data structures and algorithms are needed. Furthermore, keep in mind that these Netnews numbers are for a single information source today. In the future, one would expect many more sources with even higher volumes.

Netnews does support a rudimentary filtering mechanism by categorizing articles into newsgroups and allowing users to subscribe to newsgroups of interest. However, a finer granularity of information-need matching, by means of information retrieval techniques, will cater much better to individual interests. Research in information retrieval has given rise to many retrieval models, notably the boolean model, the vector space model, and the probabilistic model, that are applicable to information filtering [1]. Reference [17] presents data structures and algorithms for information filtering under the boolean model. In this paper, we consider the vector space model (VSM), which is widely recognized as an effective retrieval model. It uses a natural language interface, which makes it easy to use. A well-known technique, called relevance feedback, provides an easy way to improve retrieval effectiveness. Some of the ideas in the VSM have been implemented in the WAIS system [8]. The popularity of WAIS demonstrates the appeal of the VSM. Our methods are thus for documents and profiles represented in the VSM.

Our algorithms make use of an inverted index to speed up the filtering process. Inverted indexes have been used by information retrieval systems to facilitate traditional retrospective search, namely by building an index of documents. In this paper, we investigate how the idea of an inverted index can be used to speed up profile processing. Specifically, we propose to use an inverted index of profiles.[1] In the information retrieval scenario, a user query is matched against a document index. Here, an incoming document is matched against a profile index. We investigate what modifications need to be made, and what alternatives are feasible.

[1] Other retrieval methods (e.g., signature files [4]) can also be used to speed up filtering (e.g., building a signature file of the profiles). In this paper, we focus on inversion-based methods. Further work would need to be done to compare the performance of signature-based and inversion-based methods for information filtering.

Incidentally, we have implemented two experimental filtering servers at Stanford to disseminate Netnews articles and computer science technical reports. The reader is encouraged to try out these services. For instructions on how to use these services, send an electronic mail message to either elib@db.stanford.edu (for technical reports) or netnews@db.stanford.edu (for Netnews) with the word "help" in the message body. Instructions will be returned automatically. The current versions of these servers are not efficient (they use the Brute Force method described later on). However, as more users subscribe to our servers, there is an obvious need for an efficient implementation, and this motivated the work reported in this paper.

The rest of the paper is organized as follows. In Section 2, we give a brief summary of the VSM, as applied to information filtering. In Section 3, we present three methods to process profiles. Details of the analysis and simulations used to evaluate the performance of the methods are described in Section 4. The results of the evaluation are presented in Section 5. Section 6 is a survey of related work, and Section 7 is for the conclusion.

2 Information Filtering Under VSM

In this section, we give a brief summary of the VSM as used in information filtering. The purpose of this is to explain some terminology and assumptions necessary for the exposition of our algorithms in Section 3. For an in-depth introduction to the VSM and information filtering, the reader is referred to [12] and [1] respectively.

2.1 Document and Profile Vector

In the VSM, we identify documents by terms. A document D is represented as a vector of dimension m, where m is the total number of terms used to identify content. Each term is given a weight that signifies its statistical importance. We write D = (w_1, ..., w_m), where w_i is the weight assigned to the i-th term (and is 0 for terms not present in D). To compute the vector representation of a document, we first collect the individual words occurring in the document. Words that belong to the stop list, which is a list of high-frequency words with low content-discriminating power, are deleted. Then a stemming routine is used to reduce each remaining word to word-stem form. For each remaining word stem (a term), a weight is assigned in an attempt to represent how "important" that term is. One common way to compute the weight of a term is to multiply the term
frequency (tf) factor with the inverse document frequency (idf) factor. The tf factor is proportional to the frequency of the term within the document. The idf factor corresponds to the content-discriminating power of the term: a term that appears rarely in documents (e.g., "queue") has a high idf, while a term that occurs in a large number of documents (e.g., "system") has a low idf.[2] (See Section 4.1.1 for examples of formulas used to calculate these factors.)

[2] In an information filtering setting, the number of new incoming documents processed at one time is small, so the inverse document frequencies within the batch may not be the most reliable. Instead, we may extract the idfs from a pre-existing reference corpus of text, as is done in [5].

As profiles in the VSM are expressed in natural language, we can represent profiles just like documents. A profile P appears as P = (u_1, ..., u_m). Sometimes we follow the convention of writing a document or profile vector as a vector of (term, weight) pairs; those terms not listed have weights equal to 0. Thus, a profile P with p non-zero weighted terms can be written as P = ((y_1, u_1), ..., (y_p, u_p)). For instance, in the profile P = (("queue", 0.93), ("system", 0.37)), the term "queue" has a weight of 0.93, "system" has 0.37, and all other terms have a zero weight. The weights again describe the "importance" of each term.

2.2 Similarity Measure

We can measure the degree of similarity between a document-profile pair based on the weights of the corresponding matching terms. The cosine measure has been used for this purpose; given a document D = (w_1, ..., w_m) and a profile P = (u_1, ..., u_m), the cosine similarity measure is:

    sim(D, P) = (D · P) / (||D|| ||P||) = (sum_{i=1}^{m} w_i u_i) / (sqrt(sum_{i=1}^{m} w_i^2) sqrt(sum_{i=1}^{m} u_i^2)).

In this paper we assume that the document and profile vectors are normalized by their lengths; thus the above simplifies to:

    sim(D, P) = D · P = sum_{i=1}^{m} w_i u_i.

2.3 Relevance Threshold

In an information retrieval setting, a query is run against a database of documents, and the relevant documents are returned to the user, ranked by their scores, i.e., the similarities between the query and the documents. In an information filtering setting, a profile is compared with a single document or a small number of documents. It is undesirable to filter documents based on the ranks among a small batch of documents. In [5], a fixed number of top-ranked documents is returned over a certain period of time. This is only possible if the period is long enough to allow a significant number of documents to be collected to make the ranking meaningful; and in doing so, the timeliness of the documents is sacrificed. Also, the filtering effectiveness (precision and recall) depends on the particular set of documents received during a period. If all documents are relevant, then some will be missed (low recall). If few documents are relevant, then some documents delivered will be irrelevant (low precision). Reference [5] indeed reports such drawbacks.

An alternative, as suggested in [5], is to allow the user to specify some kind of absolute relevance threshold -- documents above the threshold are considered relevant, and those below are not. With this strategy, instantaneous processing of documents is possible (i.e., a document can be processed one at a time, as soon as it is received). Also, the precision and recall of the filtering are independent of when it is performed. Such a relevance threshold can also be used in conventional information retrieval; [13] describes such an experiment. We sum up this discussion with the following definition.

Definition 1: Given a profile P and a relevance threshold θ, a document D is relevant to P if sim(D, P) > θ. □
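The similarity measure and the relevance test of Definition 1 can be sketched over sparse (term → weight) dictionaries; this is a minimal illustration, and the names `sim`, `normalize`, and `is_relevant` are ours, not the paper's. The document below is made up for the example; the profile is the "queue"/"system" profile from Section 2.1.

```python
import math

def sim(doc, profile):
    """Cosine similarity for length-normalized sparse vectors (Section 2.2):
    sim(D, P) = D . P = sum of w_i * u_i over the matching terms."""
    # Iterate over the smaller vector; terms absent from either side contribute 0.
    if len(profile) < len(doc):
        doc, profile = profile, doc
    return sum(w * profile.get(term, 0.0) for term, w in doc.items())

def normalize(vec):
    """Scale a sparse vector to unit length, as the paper assumes."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def is_relevant(doc, profile, theta):
    """Definition 1: D is relevant to P if sim(D, P) > theta."""
    return sim(doc, profile) > theta

# Profile from Section 2.1; the document weights are illustrative only.
P = {"queue": 0.93, "system": 0.37}
D = normalize({"queue": 0.8, "process": 0.5, "system": 0.3})
score = sim(D, P)
```

Only terms present in both vectors contribute to the sum, which is what makes the inverted-index methods of Section 3 possible.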
2.4 Relevance Feedback

Relevance feedback is a technique used to improve the effectiveness of retrieval. This technique can be applied to information filtering as well. Our methods will work regardless of whether relevance feedback is used. We did perform a set of experiments to evaluate the methods under relevance feedback; due to space limitations, we present the results in [18].

3 Data Structures and Algorithms

In this section we describe three methods that match a document against a number of profiles and determine the profiles to which the document is relevant. We assume that a document is processed one at a time, as soon as it arrives. Our methods can easily be extended to handle the case when a number of documents is batched together for processing, but we do not address this here.

In two of the methods, we make use of an inverted index. In an index, for each term x, we collect profiles that contain it to form an inverted list.[3] The mapping from terms to the location of their inverted lists on disk is implemented as a hash table, called the directory. We assume that the inverted lists are stored on disk while the directory fits in main memory.

[3] As detailed later, we may collect all or some of the profiles that contain a term to form its inverted list.

Our focus in this paper is on efficient VSM filtering algorithms. The issue of how to efficiently update profiles in the data structures is not addressed. We assume that such updates are batched and are periodically installed. However, in the evaluation of our indexing methods, we do consider two options for storing inverted lists on disk. One option is to pack all the lists into contiguous blocks, and the other
is to store each list individually in an integral number of blocks. While handling updates in the first option requires reading and writing all the lists, it is much easier in the second option. On the other hand, the storage space requirement for the first option is higher. In our evaluation we examine the trade-off involved.

3.1 Brute Force (BF) Method

If we store profiles sequentially on disk without any index structures, then all profiles must be evaluated when a new document is received. This is the Brute Force (BF) method. When a document arrives, we first compute its vector representation as described in Section 2. Then we examine each profile in turn. For each (term, weight) pair (x, u) in a profile, we find x's weight w in the document vector and calculate the product w · u. The sum of such products is the cosine similarity measure. The document is relevant to a profile if the cosine measure is greater than the relevance threshold associated with the profile. We store a profile on disk as a variable-length record with these fields: the profile identifier; the length, i.e., the number of terms in the profile; the (term, weight) pairs; and finally the relevance threshold.
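As an in-memory sketch of the BF scan (the sequential disk reads of the variable-length records are elided, and the function name is ours), using three small profiles of the same shape as the paper's running example:

```python
def brute_force_filter(document, profiles):
    """Brute Force (BF) method: evaluate every profile against the document.
    `document` maps term -> weight; `profiles` maps profile id -> (vector, threshold),
    mirroring the on-disk record of (term, weight) pairs plus a relevance threshold."""
    relevant = []
    for pid, (vector, threshold) in profiles.items():
        # Sum of products w * u over the profile's (term, weight) pairs.
        score = sum(document.get(term, 0.0) * u for term, u in vector.items())
        if score > threshold:
            relevant.append(pid)
    return relevant

profiles = {
    "P1": ({"a": 0.46, "b": 0.14, "c": 0.17, "d": 0.62, "e": 0.59}, 0.25),
    "P2": ({"a": 0.95, "b": 0.30}, 0.20),
    "P3": ({"c": 0.14, "e": 0.49, "f": 0.17, "g": 0.42, "h": 0.11,
            "i": 0.10, "j": 0.72}, 0.25),
}
D = {"b": 0.15, "d": 0.32, "f": 0.21, "h": 0.14, "j": 0.90}
matches = brute_force_filter(D, profiles)
```

Every profile is touched on every document, regardless of how many terms it shares with the document; the indexing methods below avoid exactly this cost.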
3.2 Profile Indexing (PI) Method

To reduce the number of profiles that must be examined, we build an inverted index of profiles. We call this the Profile Indexing (PI) method. For each term x, we collect all the profiles that contain it to form its inverted list. The list is made up of postings; each contains the identifier of a profile involving x and the weight of x in it. Thus, a profile with p terms will be found in p postings, each in a different list. When processing a document D, we only need to examine those profiles in the inverted lists of the terms that are in D.

To match a document against these profiles, we need two (main-memory) arrays, THRESHOLD and SCORE. (This method and the next use more main memory than the BF method.) The number of entries in each array is equal to the number of profiles the system handles. Each profile has an entry in each array: the THRESHOLD entry stores the relevance threshold, and the SCORE entry is used to keep the score of the profile. When a document D arrives, we initialize the SCORE array to all 0's. For each term x with weight w in the document, we use the directory to retrieve x's inverted list. Then we process each profile P in the list. That is, if the weight of x in P is u, we increment SCORE[P] by the product w · u. After all document terms are processed, a profile whose SCORE entry is greater than the THRESHOLD entry matches the document.
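The PI matching loop can be sketched as follows; this is an in-memory sketch (the directory hash table, disk-resident lists, and block I/O are elided), and the names `build_index` and `pi_filter` are ours, not the paper's.

```python
from collections import defaultdict

def build_index(profiles):
    """Build an inverted index of profiles: for each term x, a list of
    postings (profile identifier, weight of x in that profile)."""
    index = defaultdict(list)
    for pid, (vector, _theta) in profiles.items():
        for term, u in vector.items():
            index[term].append((pid, u))
    return index

def pi_filter(document, index, thresholds):
    """PI method: only profiles found in the inverted lists of the document's
    terms are touched; `score` plays the role of the in-memory SCORE array."""
    score = defaultdict(float)            # SCORE array, conceptually all 0's
    for term, w in document.items():
        for pid, u in index.get(term, []):
            score[pid] += w * u           # increment SCORE[P] by w * u
    relevant = [pid for pid, s in score.items() if s > thresholds[pid]]
    return dict(score), relevant

profiles = {
    "P1": ({"a": 0.46, "b": 0.14, "c": 0.17, "d": 0.62, "e": 0.59}, 0.25),
    "P2": ({"a": 0.95, "b": 0.30}, 0.20),
    "P3": ({"c": 0.14, "e": 0.49, "f": 0.17, "g": 0.42, "h": 0.11,
            "i": 0.10, "j": 0.72}, 0.25),
}
thresholds = {pid: theta for pid, (_vec, theta) in profiles.items()}
index = build_index(profiles)
D = {"b": 0.15, "d": 0.32, "f": 0.21, "h": 0.14, "j": 0.90}
scores, relevant = pi_filter(D, index, thresholds)
```

On the running example that follows (Figure 2), this reproduces the final SCORE values 0.2194, 0.0450, and 0.6991, with only P3 clearing its threshold.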
To illustrate, consider three profiles: P1 = ((a, 0.46), (b, 0.14), (c, 0.17), (d, 0.62), (e, 0.59)), P2 = ((a, 0.95), (b, 0.30)), and P3 = ((c, 0.14), (e, 0.49), (f, 0.17), (g, 0.42), (h, 0.11), (i, 0.10), (j, 0.72)), with relevance thresholds of θ1 = 0.25, θ2 = 0.20, and θ3 = 0.25 respectively.

[Figure 2: Data Structures for Profile Indexing -- the directory and the THRESHOLD and SCORE arrays in main memory, and the inverted lists for terms a through j on disk]

The inverted index for these profiles is shown in the right-hand side of Figure 2. For example, the a list contains the postings for P1 and P2. The 0.46 value in the first entry in this list is the weight of a in P1. Now suppose this document arrives:

    D = ((b, 0.15), (d, 0.32), (f, 0.21), (h, 0.14), (j, 0.90)).

To process this document, first we read the b list, and increment the SCORE entries of P1 and P2 by 0.15 × 0.14 = 0.021 and 0.15 × 0.30 = 0.045 respectively. The lists of d, f, h, and j are processed similarly. The final values of the SCORE array are as shown in the figure. This document is relevant to P3.

Notice the PI method is almost symmetrical to the method used in information retrieval to match a query against a database of documents with an index of documents, with the roles of documents and queries (profiles) reversed. The difference is that the THRESHOLD array is not used; instead, after the computation of similarities, the SCORE array is sorted to find the rank of the documents.

3.3 Selective Profile Indexing (SPI) Method

In the PI method, we index a profile by all its terms. In this subsection we investigate an alternative in which we only select a number of terms for indexing.

Consider the term b in P1 in our running example. Suppose a document arrives and it does not contain the terms a, c, d, or e. The maximum score P1 could have against this document is 0.14 (if b's weight in the document is the highest possible, 1.0), which is less than the threshold specified. At a threshold of 0.25, the term b is insignificant in that it alone cannot produce enough score for a document to be relevant. Thus, we may choose not to index the profile with the term b -- a document that contains only b and no other terms in the profile will not be relevant anyway. However, a document that contains b and another term in the profile may be relevant; so we need to duplicate (b, 0.14) in the postings of the other terms in their respective lists. (Since
we assume that the inverted lists are stored on disk, it is better to duplicate the pair than to store it elsewhere and keep a pointer in the postings to reference it (extra I/Os will be needed to look it up). If the entire index fits in main memory, it is better to use the pointer option. See comments in Section 7.)

Similarly, consider the subvector ((h, 0.11), (i, 0.10)) in P3. Suppose a document arrives that does not have the other terms in P3. Then an upper bound on the similarity between P3 and this document is 0.11 + 0.10 = 0.21 (we can actually find a tighter upper bound, by a theorem proved below). Again, with a threshold of 0.25, the subvector is insignificant. In this case, we may choose not to post the profile in the inverted lists of h and i, and duplicate the pairs in the postings of the other terms in the profile. These observations lead us to this definition.

Definition 2: Given a profile vector P = ((y_1, u_1), ..., (y_p, u_p)), a subvector P_s = ((y_{i_1}, u_{i_1}), ..., (y_{i_s}, u_{i_s})), 1 ≤ i_1 < ... < i_s ≤ p, is insignificant at a threshold of θ if for any document D, sim(D, P_s) ≤ θ. □

Given a profile like P3, there may be several insignificant subvectors (e.g., ((h, 0.11), (i, 0.10)) is one, ((c, 0.14), (i, 0.10)) is another). Which subvector should we use to reduce the number of index postings? One idea is to use the subvector that contains the most low-idf terms. Low-idf terms occur more frequently in documents; thus, by not posting these terms we expect to save the most lookup work.

Definition 3: Given a profile vector P = ((y_1, u_1), ..., (y_p, u_p)), a subvector P_s = ((y_{i_1}, u_{i_1}), ..., (y_{i_s}, u_{i_s})), 1 ≤ i_1 < ... < i_s ≤ p, is most insignificant at a threshold of θ if it has the largest number of lowest-idf terms among the insignificant subvectors at a threshold of θ. □

Assuming idfs are distinct, a profile vector has a unique most insignificant subvector at a given threshold. We need a way of checking whether a subvector is the most insignificant subvector, and this requires the ability to compute the maximum possible similarity between a profile subvector and any document vector. Intuitively, we can see that the similarity between a profile subvector and any unit document vector is highest when the document vector is "in the same direction" as the profile subvector. And if that happens, the similarity is given by the magnitude of the profile subvector. This is formally stated and proved as follows.

Theorem 1: For any P and any D with ||D|| = 1, we have sim(D, P) ≤ ||P||.

Proof: This follows easily from the Cauchy-Schwarz Inequality [6]: sim(D, P) = D · P ≤ |D · P| ≤ ||D|| ||P|| = ||P||. □

To find the most insignificant subvector of a profile vector, we can sort the terms by idf and include as many terms as possible. For example, consider P3 again. We assume that the term weights are directly proportional to the idfs (which is true if the tf components are the same). As

    ||((c, 0.14), (h, 0.11), (i, 0.10))|| = 0.2042 ≤ 0.25, and
    ||((f, 0.17), (c, 0.14), (h, 0.11), (i, 0.10))|| = 0.2657 > 0.25,

((c, 0.14), (h, 0.11), (i, 0.10)) is the most insignificant subvector of P3 at a threshold of 0.25.

[Figure 3: Data Structures for the SPI Method -- postings now carry the insignificant (term, weight) pairs, replicated in the lists of the significant terms]

This also shows that Theorem 1 is stronger than the naive way of finding an upper bound by simply adding the weights, as we have done earlier.
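The greedy construction just described -- sort terms by idf, ascending, and grow the subvector while, by Theorem 1, its norm stays within the threshold -- can be sketched as follows. The function name is ours; when no idf map is supplied we fall back on the running example's assumption that weights are proportional to idfs.

```python
import math

def most_insignificant_subvector(vector, theta, idf=None):
    """Greedily build the most insignificant subvector (Definition 3).
    By Theorem 1, a subvector P_s is insignificant at threshold theta
    exactly when ||P_s|| <= theta, so we add terms in order of increasing
    idf for as long as that norm bound still holds."""
    if idf is None:
        idf = vector                       # assume weight proportional to idf
    sum_sq = 0.0
    chosen = {}
    for term in sorted(vector, key=lambda t: idf[t]):
        u = vector[term]
        if math.sqrt(sum_sq + u * u) > theta:
            break                          # next-lowest-idf term no longer fits
        sum_sq += u * u
        chosen[term] = u
    return chosen

# P3 from the running example, at a threshold of 0.25.
P3 = {"c": 0.14, "e": 0.49, "f": 0.17, "g": 0.42, "h": 0.11, "i": 0.10, "j": 0.72}
insignificant = most_insignificant_subvector(P3, 0.25)
```

For P3 this selects exactly ((c, 0.14), (h, 0.11), (i, 0.10)), matching the norms 0.2042 and 0.2657 computed above.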
  • r
eac h prole, w e nd the most insigni can t sub v ector at the threshold sp ecied. The prole is then p
  • sted
in the in v erted lists
  • f
the signican t (relativ e to the most insigni can t sub v ector) terms. In eac h p
  • sting,
w e include the insignica n t terms and their w eigh ts; i.e., they are replicated in the lists
  • f
all the signican t terms. This is called the Sele ctive Pr
  • le
Indexing (SPI) metho d. Eac h p
  • sting
con tains the prole iden tier, the w eigh t
  • f
the term indexed, the n um b er
  • f
insignica n t pairs, and the pairs
  • f
insigni can t terms and w eigh ts. P
  • stings
in the same list are stored sequen tially in blo c ks. W e also require the THRESHOLD and SCORE arra ys as in the PI metho d. When a do cumen t comes along, w e construct its v ector represen tation. Next w e initiali ze the SCORE arra y to all 0's. Then w e index the directory to retriev e the in v erted lists
  • f
eac h term. Supp
  • se
w e are pro cessing the term x with w eigh t w in the do cumen t. F
  • r
eac h prole P in the x list, supp
  • se
the w eigh t
  • f
x in P is u, and the insigni can t pairs are (y i 1 ; u i 1 ), ..., (y i s ; u i s ). W e examine P 's SCORE en try . There are t w
  • cases:
if the SCORE en try is zero, w e rst add the pro duct w
  • u.
Then w e lo
  • k
up eac h term y i j in the do cumen t v ector. Supp
  • se
its w eigh t in the do cumen t is w i j . W e add the pro duct w i j
  • u
i j to the SCORE en try . In the second case, the SCORE en try is not zero, meaning that w e ha v e already added the con tribution
  • f
the insignic an t terms in some earlier computation. Th us w e
  • nly
add the pro duct w
  • u.
After all do cumen t terms ha v e b een pro cessed, a prole matc hes the do cumen t if its SCORE en try is greater than the THRESHOLD en try .
slide-6
SLIDE 6 Figure 3 sho ws the index for
  • ur
running example. F
  • r
instance, supp
  • se
w e are pro cessing the rst pair (b, 0.15) from the do cumen t v ector. The list
  • f
b has
  • nly
  • ne
p
  • st-
ing, that
  • f
P 2 . W e add the pro duct 0.15
  • 0.30
= 0.045 to P 2 's SCORE en try . As there is no insignicant sub v ector, w e are done with this p
  • sting
and also with the b list. Next w e pro cess the pair (d, 0.32). Only P 1 's p
  • sting
is in the d list. First w e add the pro duct 0.32
  • 0.62
= 0.1984 to SCORE[P 1 ]. Then w e pro cess the insigni can t sub v ector ((b, 0.14), (c, 0.17)). T
  • do
this, w e lo
  • k
up the term b in the do cumen t v ector, getting a w eigh t
  • f
0.15. Th us w e incremen t SCORE[P 1 ] b y the pro duct 0.15
  • 0.14
= 0.021. Next, w e lo
  • k
up c, whic h is not in the do cumen t v ector. W e are no w done with this list. The
  • ther
pairs are pro- cessed similarly . The nal v alues for SCORE are as sho wn in the gure. 4 P erformance Ev aluation 4.1 Mo dels W e use analysis and sim ulations to ev aluate the p erfor- mance
  • f
the metho ds. T
  • allo
w exibili t y in
  • ur
p erfor- mance ev aluation, w e use syn thetic do cumen t and prole mo dels. T
  • mak
e them realistic, w e base
  • ur
mo dels
  • n
prop erties
  • f
a database
  • f
Netnews (text) articles receiv ed b y
  • ur
Departmen t's Netnews host during the p erio d
  • f
April 22 to April 29, 1993. A total
  • f
212,972 articles w ere collected, making up a 550MB database. Belo w w e describ e
  • ur
4.1.1 Document Model

The following steps were carried out to study the occurrence frequency of terms in the database. First, a lexical analysis screened out all non-alphabetical characters from the documents (i.e., articles). Then a stemming routine (Porter's algorithm [10]) was run to reduce the remaining words to word-stem form. Each stem thus obtained is a term. Next we measured the occurrence frequency of each term in the database, obtaining the plot shown in Figure 4 (note the log/log scale). The straight line in the graph was derived by curve fitting using [16]. We can see the database does demonstrate Zipfian characteristics [19]. The x-intercept (i.e., the size of the term vocabulary, which we denote by v) is found to be 521,915. Also, the average number of words per document (denoted by d) is found to be 323.

Hence, we adopt the following probabilistic document model, which is similar to the one in [15]. The terms in a document come from a vocabulary V of size v. Each term is uniquely represented by an integer x, 1 ≤ x ≤ v. The probability that any term appears is described by the probability distribution Z. We rank the terms in non-increasing order of frequencies, i.e., for all x, y with 1 ≤ x < y ≤ v, we have Z(x) ≥ Z(y); for convenience, we use the rank to identify the terms.

[Figure 4: Term Rank vs. Frequency Graph for Netnews Database (log/log plot of number of occurrences against term rank)]

We assume the frequency distribution follows Zipf's Law, i.e.,

    Z(x) = (1/x) / (Σ_{y=1}^{v} 1/y).

A document has d term occurrences and is generated by a sequence of d independent and identically distributed trials; each trial produces one term from V according to the distribution Z. The most frequent s terms form the stop list; stop-listed terms are deleted from a document before its vector representation is computed. We chose s to be 100 in the evaluation.
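The generation procedure can be sketched as follows (an illustrative sketch, not the C simulator of Section 4.3; v, d, and s here are scaled-down toy values, not the base-case parameters):

```python
import random

v, d, s = 1000, 50, 10   # scaled-down vocabulary size, doc length, stop-list end

# Z(x) = (1/x) / sum_{y=1}^{v} 1/y  -- Zipf's Law over ranks 1..v
harmonic = sum(1.0 / y for y in range(1, v + 1))
Z = [(1.0 / x) / harmonic for x in range(1, v + 1)]

def generate_document(rng):
    """d i.i.d. trials from Z, then drop the s most frequent (stop-listed) terms."""
    terms = rng.choices(range(1, v + 1), weights=Z, k=d)
    return [t for t in terms if t > s]

doc = generate_document(random.Random(42))
```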
Finally, the vector representations of the documents are computed as described in Section 2. The exact formulas used to compute the weight of a term x_i are from [13], and have been empirically found to be effective:

    tf_i = 0.5 + 0.5 · f_i / max_j f_j,   and   idf_i = log(1 / fraction of documents with x_i),

where f_i is the frequency of the term x_i in the document. We analytically compute the fraction in idf as the probability that x_i appears in a document.
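A direct transcription of these weighting formulas (a sketch; the hypothetical `term_weights` helper takes the document-frequency fractions as an input where the paper computes them analytically, and combines tf and idf multiplicatively as in the standard vector space weighting):

```python
import math
from collections import Counter

def term_weights(doc_terms, doc_fraction):
    """tf_i = 0.5 + 0.5 * f_i / max_j f_j;  idf_i = log(1 / fraction of docs with x_i).
    doc_fraction maps each term to the fraction of documents containing it."""
    freq = Counter(doc_terms)
    max_f = max(freq.values())
    return {t: (0.5 + 0.5 * f / max_f) * math.log(1.0 / doc_fraction[t])
            for t, f in freq.items()}

# Term 1 occurs twice (tf = 1.0) and appears in 10% of documents (idf = log 10);
# term 2 occurs once (tf = 0.75) and appears in half the documents (idf = log 2).
w = term_weights([1, 1, 2], {1: 0.1, 2: 0.5})
```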
4.1.2 Profile Model

Looking at our database, we find that a large fraction of the terms in the vocabulary occur very infrequently. Those terms are mostly from misspellings, typos, or self-invented words. We do not expect these terms to appear in profiles, which represent long-term interests. We model this by assuming that profile terms are chosen from the set Q = {s+1, ..., q}, called the queried vocabulary, out of the vocabulary V = {1, ..., v}; q < v. (Recall that we are identifying terms by their ranks.) A base value of 50,000 is chosen for q, covering more than 97% of the total occurrences of terms in the Netnews database.

We assume that each term in Q is equally likely to be chosen for a profile. This uniform distribution is justified as queries tend to use a mix of frequent and relatively infrequent words [15]. Also, terms rarely occur more than once in a profile [12]; thus we assume that a profile is a set of p terms chosen randomly without replacement from the queried vocabulary Q. The number of profiles in the system is n. To simplify the study of the effect of profile size (p) on performance, we assume all profiles have the same length, i.e., p is fixed for all profiles.

Some of these assumptions may not be valid when relevance feedback is used. In [18], we modify our profile model in the evaluation of the methods under relevance feedback.
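Under these assumptions, generating a profile reduces to sampling without replacement from Q (a sketch using the base-case values of s, q, and p from Table 1):

```python
import random

s, q, p = 100, 50_000, 5   # stop-list end, queried-vocabulary end, profile length

def generate_profile(rng):
    """A profile: p distinct terms drawn uniformly from Q = {s+1, ..., q}."""
    return set(rng.sample(range(s + 1, q + 1), p))

profile = generate_profile(random.Random(7))
```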
4.1.3 Choice of Relevance Threshold

It is hard to model the relevance threshold distribution. For a user, a suitable relevance threshold for his profile depends on the individual profile terms (their idfs), the degree of correlation among the terms, the amount of relevant, as well as irrelevant, information in the incoming stream, and his desired level of precision and recall (is it crucial to receive all possibly relevant documents, or is it more desirable to receive those that are likely to be relevant?). Instead of deriving a complicated model of the relevance threshold, we assume the relevance threshold is fixed for all profiles. This allows us to study its impact on the methods clearly.

A reasonable base case value was found by the following procedure. First a random document was generated. Then a profile was created to contain a number of overlapping terms, randomly selected from the document. The similarity between the document and the profile was computed. The procedure was repeated a large number of times. For a base case profile length of 5, we found that a profile with 4 or more matching terms has an average similarity of about 0.2. Thus we use this as the base value of the relevance threshold for our evaluation. Of course, this is not saying that the relevance threshold simply translates to the number of matching terms. We are merely settling on a reasonable starting point for our evaluation. In Section 5.5, we vary the threshold over the entire range of possible values from 0 to 1 and examine its effect on the performance.
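The calibration procedure above can be sketched as follows (an illustrative sketch, not the exact procedure: the vocabulary size, document length, and the placeholder term weights are all assumed here purely for illustration):

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def trial(rng, overlap=4, plen=5):
    """One repetition: a random document vector, and a profile sharing
    `overlap` of its `plen` terms (all sizes and weights are placeholders)."""
    doc = {t: rng.random() for t in rng.sample(range(1000), 20)}
    shared = rng.sample(list(doc), overlap)
    extra = list(range(1000, 1000 + plen - overlap))   # terms not in the doc
    prof = {t: 1.0 for t in shared + extra}
    return cosine(prof, doc)

# Repeat many times and average, as in the calibration procedure.
avg = sum(trial(random.Random(i)) for i in range(200)) / 200
```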
    Parameter   Base Value   Description
    v           521,915      size of vocabulary
    d           323          # term occurrences per document
    s           100          end of stop list
    q           50,000       end of queried vocabulary
    n           300,000      # profiles
    p           5            # terms per profile
    –           0.2          relevance threshold
    i           4            # bytes for profile identifier
    l           2            # bytes for profile length
    t           4            # bytes to represent a term
    f           4            # bytes for a floating point number
    b           512          # bytes in a disk block

Table 1: Summary of Parameters Used in Performance Evaluation

Table 1 summarizes the parameters used in the models, together with some parameters that specify the sizes of various fields in the data structures, and the disk block size. Keep in mind that the base values shown are simply starting points for our evaluation. We explore different sets of values in our experiments; Section 5 shows some of the results.
4.2 Metrics

We compare the methods with respect to their space and time requirements. For the space requirement, we look at how much disk space each structure takes. (Although the main memory space requirements of the methods differ, we assume they all fit in main memory.) We study two ways of storing the inverted lists in the indexing methods: the first is to pack all lists contiguously into sequential blocks, leaving no disk space in between lists; the second is to store each list in an integral number of blocks, allowing easy list expansions. By comparing the space requirements for these two options, we can see the amount of internal fragmentation the second option produces.
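The two allocation policies differ only in where the rounding up to whole blocks happens, as the following sketch illustrates (the list sizes are hypothetical; the 512-byte block size is parameter b from Table 1):

```python
import math

BLOCK = 512   # bytes per disk block (parameter b)

def contiguous_blocks(list_sizes):
    """Lists packed back to back: only the total length is rounded up."""
    return math.ceil(sum(list_sizes) / BLOCK)

def fragmented_blocks(list_sizes):
    """Each list rounded up separately to a whole number of blocks."""
    return sum(math.ceil(size / BLOCK) for size in list_sizes)

sizes = [100, 700, 300]                 # hypothetical inverted-list sizes in bytes
contiguous = contiguous_blocks(sizes)   # ceil(1100/512) = 3 blocks
fragmented = fragmented_blocks(sizes)   # 1 + 2 + 1 = 4 blocks (fragmentation)
```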
For the time requirement, in an I/O-bound system the critical measure is the number of I/Os to process a document; in a CPU-bound system (including the case when a large portion of the data structures can be cached in main memory), the amount of computation is the critical component. Hence, we look at both aspects in our comparison. For the CPU computation, we count the number of floating-point multiplications each method requires to process a document. The number of multiplications is one of the major computation costs in processing a document, so we believe it is a good measure of CPU cost. In summary, we look at these metrics:

• the expected total disk space required in number of blocks (with contiguous allocation and fragmented allocation for the indexing methods),
• the expected number of disk reads needed to match a document, and
• the expected number of floating point multiplications performed to process a document.
4.3 Analysis and Simulations

Except those for the SPI method, the results in Section 5 were obtained by deriving analytical solutions and then numerically evaluating the expressions. Due to space limitations, we present the analysis in [18].

Simulations were conducted to obtain the results for the SPI method. We also constructed simulations to validate the analysis; the simulation results did match the analytical ones. We wrote our simulation program in C. The program first generates n profiles according to the profile model, and then computes the size of the index structures needed to store the profiles. Next the simulation program generates a document according to the document model and counts the number of disk reads and multiplications needed to match it against the n profiles. For each scenario we have tested, the program is run enough times (with different random number generator seeds) to make sure that the results are within 5% of the true values, with a 90% level of confidence.
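One reading of this stopping rule, sketched in Python (not the C program itself; it assumes the normal approximation with z ≈ 1.645 for a 90% confidence level, and a Gaussian stand-in for the simulated quantity):

```python
import math
import random

def run_until_confident(sample, rel_err=0.05, z=1.645, min_runs=30):
    """Repeat sample() until the 90% CI half-width is within rel_err of the mean."""
    xs = []
    while True:
        xs.append(sample())
        n = len(xs)
        if n < min_runs:
            continue
        mean = sum(xs) / n
        var = sum((x - mean) ** 2 for x in xs) / (n - 1)   # sample variance
        if z * math.sqrt(var / n) <= rel_err * abs(mean):
            return mean, n

# Stand-in for one simulation run (e.g. disk reads for one random document).
rng = random.Random(1)
mean, runs = run_until_confident(lambda: rng.gauss(100.0, 10.0))
```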
5 Results

5.1 Base Case Results

The results for the base case are given in Table 2. In the case when the inverted lists of the indexing methods are packed contiguously, the total space requirements for the three methods are roughly comparable. PI is better than the BF method, since the threshold values are stored in main memory. The SPI method requires more space than PI, because some (term, weight) pairs are duplicated in a number of lists in the index. When the inverted lists are not packed, but are stored individually in an integral number of blocks, internal fragmentation leads to an increase in total space requirement of about 68% for SPI to 113% for PI. The split-list strategy allows for easier updates, but we have to pay the price of higher storage costs.

For the number of disk reads performed per document, we see orders of magnitude improvement of the indexing methods over the BF method. The SPI method is best, due to the fact that certain frequent terms in a profile are not indexed. For this same reason, the number of multiplications for SPI is lower than that for BF and PI (the latter two perform the same number of multiplications; see the analysis in [18]).

    Method   Contiguous Size (Blocks)   Fragmented Size (Blocks)   Disk Reads   Multiplications
    BF       29,297                     –                          29,297       4,314
    PI       23,438                     49,900                     144          4,314
    SPI      29,630                     49,804                     127          3,434

Table 2: Results for the Base Case

In what follows, we describe several sensitivity studies in which we vary the parameter values.

5.2 Size of Queried Vocabulary
The first parameter that we exercise is q, which controls the size of the queried vocabulary. Figures 5 to 7 show the results. In Figure 5, when the contiguous-list strategy is used, the total space requirement for each method is insensitive to q. However, when the split-list strategy is used for the indexing methods, their space requirement does vary with q. The fluctuations in the graph for SPI can be explained as follows. When q is 20,000, each inverted list occupies 2 blocks. As q increases, the number of lists increases, and so the total size increases. At the same time, the number of postings in a list decreases, since they are distributed over a larger number of lists. At some point (around q = 30,000), the lists begin to shrink to 1 block, and this explains the drop in total size. Thereafter, the total space requirement increases linearly with q, as each list fits in 1 block. The same reasoning can be applied to the fluctuations in the graph for PI.

Figure 6 shows the results for the number of blocks read per document. The number of blocks read for the BF method is constantly equal to its total space requirement, and thus its graph is omitted to show the variations in the other methods better.

[Figure 5: Total Size vs. Queried Vocabulary End q]

[Figure 6: Disk I/Os Per Document vs. Queried Vocabulary End q]

[Figure 7: Multiplications Per Document vs. Queried Vocabulary End q]

[Figure 8: Total Size vs. Profile Length p]
[Figure 9: Disk I/Os Per Document vs. Profile Length p]

The sharp drop in the number of I/Os required corresponds to the shrinking of the list length (from 2 blocks to 1 block). Thereafter, the number of I/Os increases, as the number of lists read per document increases (due to the increase in the queried vocabulary size). The rise is more prominent in PI than in SPI.

For the number of multiplications per document (Figure 7), SPI is better throughout than the other methods.
The trend is downward for all methods, as more infrequent terms appear in profiles.

5.3 Profile Length

The next parameter that we vary is the profile length. Figures 8 and 9 show the results. For contiguous allocation, we see that the total space requirement grows with p for all methods (Figure 8). For fragmented allocation, with a small p, the inverted lists each fit in one block, so the size remains constant at the queried-vocabulary size. With larger p, the lists grow in length, so the total space requirement grows also. The SPI method grows at a faster rate than the PI method.

[Figure 10: Total Size vs. Number of Profiles n]

[Figure 11: Disk I/Os Per Document vs. Number of Profiles n]
The number of disk I/Os required by the SPI method initially decreases as p is increased from 1 (Figure 9). This is because it becomes more likely that a profile includes infrequent terms and is thus indexed by those terms. With the longer lists at larger p (greater than 7), its performance deteriorates and then stabilizes. On the other hand, for the number of multiplications, SPI is always better than the other two methods (graphs omitted).

5.4 Number of Profiles

We vary the number of profiles from 100,000 to 800,000. For the total space requirement (results shown in Figure 10), we obtain a graph similar to that for p. For contiguous allocation, the space requirement grows linearly with n. For fragmented allocation, the space required is at first constant and then increases: each inverted list fits in 1 block at the beginning, but as n increases, 2 blocks are needed to hold a list. The lists grow at a faster rate in the SPI method initially, but PI soon catches up with it.

Figure 11 shows the results for the number of disk I/Os required per document. Those for the BF method are omitted. We see there is a range of n values where SPI requires more I/Os per document; this happens when an SPI inverted list grows faster than a PI list. When the list length becomes the same in both methods, SPI again becomes better than PI. In terms of the number of multiplications per document, all methods scale proportionally to the number of profiles, with the SPI method always better than the other two methods. Due to space considerations, we omit the graphs here.
5.5 Relevance Threshold

The next parameter that we vary is the relevance threshold. Although it may not make sense to have a threshold value of 0 or 1, we study the entire range of possible values to confirm our intuition about the SPI method. The other methods are insensitive to the relevance threshold. As the threshold increases, we expect a more substantial portion of a profile to be insignificant and to be duplicated in the lists of significant terms in SPI. Thus the total index size increases; but as the threshold increases further, the insignificant portion is posted in fewer lists (the number of significant terms decreases). Thus a certain maximum would be reached somewhere in the range. This is indeed the case for our results shown in Figure 12.

Although the total size increases and then decreases with an increasing threshold, the number of I/Os is always decreasing (Figure 13), because profiles are indexed in fewer lists of lower frequency terms. Similarly, the number of multiplications decreases also (Figure 14). The relative performance of SPI against the other two does not vary much with different threshold values. For the space requirement, SPI almost always requires more space than the other two, except when the threshold is close to 1. For the time requirement, it is always no worse than the other methods.
5.6 Document Size

The size of documents only affects the two time-requirement metrics. The performance of the methods with respect to both metrics scales proportionally to the document size, with no change in relative performance. Due to space limitations, we omit the results.

6 Related Work
References [2, 5, 9] investigate the effectiveness of different retrieval models applied to information filtering. In [17], we study what index structures can be used to speed up information filtering under the boolean model. The PI and SPI methods presented in this paper can be seen as generalizations of the Counting and Key methods in [17].

Terry et al. [14] propose the notion of continuous queries in relational databases. Users issue continuous queries, which are rewritten into incremental queries and run periodically. Their work concentrates on relational databases, while ours is concerned with the dissemination of unstructured data (documents) using information retrieval techniques.

[Figure 12: Total Size vs. Relevance Threshold]

[Figure 13: Disk I/Os Per Document vs. Relevance Threshold]

[Figure 14: Multiplications Per Document vs. Relevance Threshold]
Related to the idea of a profile index is that of the "segment tree" presented in [3]. There, Danzig et al. present a distributed indexing scheme as a way to provide efficient retrospective search of a large number of retrieval systems. Special sites, called index brokers, maintain indexes of remote retrieval systems. They subscribe "generator queries" that keep them informed of changes in these systems. The segment tree is proposed to index numerical generator queries over Library of Congress numbers (e.g., all new items in the range QA76 to QA77). Index structures for general profiles are not addressed.
7 Conclusion

In this paper, we study what data structures and algorithms can be used to facilitate large-scale information filtering under the VSM. We apply the idea of the standard inverted index to index user profiles (we call this the PI method) and show that only slight modifications are needed to use the index to speed up filtering. We devise an alternative to the standard inverted index, called the SPI method: instead of indexing every term in a profile, we select only the significant ones to index. We evaluate their performance, together with the BF method, which uses no profile index.

In summary, we see that the three methods require approximately the same disk space when inverted lists are packed into contiguous blocks. When lists are stored individually in an integral number of blocks, the indexing methods require more disk space than the BF method. On the other hand, when we compare the time requirement, the BF method is the clear loser: the indexing methods require orders of magnitude fewer I/Os to match a document. Between the PI and SPI methods, SPI is always better in terms of CPU processing. It can also improve the number of I/Os required in many cases, depending mainly on the profile length and the number of profiles.

Although in those cases where SPI wins the difference may appear small, we should remember that the results shown are for processing a single document. An information server will be doing this matching day in and day out, and the difference will be magnified. Another observation is that, as SPI is always the best in CPU processing, when main memory is large enough to hold the entire index, SPI is the clear choice. In that case, instead of duplicating insignificant terms in the lists of indexed terms, we can just use a pointer to reference the insignificant terms, stored separately.
Acknowledgements

Thanks to Ben Kao and Anthony Tomasic for helpful discussions on this paper, and to the anonymous referees for their comments.

References

[1] BELKIN, N.J., and CROFT, W.B. Information filtering and information retrieval: two sides of the same coin? Communications of the ACM 35, 12 (Dec. 1992), 29-38.
[2] CROFT, W.B. The University of Massachusetts TIPSTER project. SIGIR Forum 26, 2 (Fall 1992), 29-33.
[3] DANZIG, P., AHN, J., NOLL, J., and OBRACZKA, K. Distributed indexing: a scalable mechanism for distributed information retrieval. In Proc. ACM SIGIR Conference (Chicago, Oct. 1991), pp. 220-229.
[4] FALOUTSOS, C. Access methods for text. ACM Computing Surveys 17, 1 (Mar. 1985), 49-74.
[5] FOLTZ, P.W., and DUMAIS, S.T. Personalized information delivery: an analysis of information filtering methods. Communications of the ACM 35, 12 (Dec. 1992), 29-38.
[6] FRIEDBERG, S.H., INSEL, A.J., and SPENCE, L.E. Linear Algebra, Prentice Hall, Englewood Cliffs, New Jersey, 1989.
[7] HORTON, M. How to read the network news. UNIX Documentation, AT&T Bell Laboratories.
[8] KAHLE, B., and MEDLAR, A. An information system for corporate users: wide area information servers. Connexions: The Interoperability Report 5, 11 (Nov. 1991), 2-9.
[9] POLLOCK, S. A rule-based message filtering system. ACM Transactions on Office Information Systems 6, 3 (July 1988), 232-54.
[10] PORTER, M.F. An algorithm for suffix stripping. Program 14, 3 (1980), 130-7.
[11] REID, B. USENET Readership Summary Report for January 1993. USENET newsgroup news.lists (February 8, 1993).
[12] SALTON, G. Automatic Text Processing, Addison-Wesley, Reading, Massachusetts, 1989.
[13] SALTON, G. Global text matching for information retrieval. Science 253 (Aug. 1991), 1012-5.
[14] TERRY, D., GOLDBERG, D., NICHOLS, D., and OKI, B. Continuous queries over append-only databases. In Proc. ACM SIGMOD Conference (San Diego, May 1992), pp. 321-30.
[15] TOMASIC, A., and GARCIA-MOLINA, H. Performance of inverted indices in distributed text document retrieval systems. In Proc. Parallel and Distributed Information Systems Conference (San Diego, Jan. 1993), pp. 8-17.
[16] WOLFRAM, S. Mathematica, Addison-Wesley, Redwood City, California, 1991.
[17] YAN, T.W., and GARCIA-MOLINA, H. Index structures for selective dissemination of information. Technical Report STAN-CS-92-1454, Stanford University, 1992.
[18] YAN, T.W., and GARCIA-MOLINA, H. Index structures for information filtering under the vector space model. Technical Report STAN-CS-93-1494, Stanford University, 1993.
[19] ZIPF, G.K. Human Behavior and the Principle of Least Effort, Addison-Wesley Press, Cambridge, Massachusetts, 1949.