Navigation Retrieval with Site Anchor Text



  1. Navigation Retrieval with Site Anchor Text
     Hideki KAWAI, Kenji TATEISHI and Toshikazu FUKUSHIMA
     NEC Internet Systems Research Labs.

  2. Introduction
     - Navigation Retrieval Task in NTCIR-4 WEB (task B)
       - Searching for one or more "representative Web pages."
       - Relevancy and representativeness of a document are both important.
     - Motivation
       - Verify the efficiency of referential information
       - Retrieval system which indexes only site anchor text
       - Two advantages:
         - The index size is very small.
         - A user can retrieve uncrawled documents as well as crawled documents.

  3. Site Anchor Text
     - Anchor text of links from external Web sites
     - Site anchor text of page a: Anchor(d, a) + Anchor(e, a) + Anchor(f, a)
     - [Figure: link diagram of pages a to f across the sites www.a.com, www.b.com and www.c.com]
     - Summarizing content and popularity of the Web site
     - We can calculate relevancy and representativeness.
     - Note: We defined "external Web sites" simply as sites whose domain name is different from that of the target page.
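
The definition of site anchor text above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `site_anchor_text`, the `(source_url, target_url, anchor_text)` tuple layout and the `.example` host names are hypothetical, and "external" is approximated by comparing host names, following the note on the slide.

```python
from collections import defaultdict
from urllib.parse import urlparse


def site_anchor_text(links):
    """Group anchor text by target URL, keeping only links from external sites.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples.
    A link counts as external when the source and target host names differ
    (an approximation of the slide's "different domain name" rule).
    """
    anchors = defaultdict(list)
    for source_url, target_url, text in links:
        source_host = urlparse(source_url).hostname
        target_host = urlparse(target_url).hostname
        if source_host != target_host:
            anchors[target_url].append(text)
    return dict(anchors)


# Hypothetical link data: three external links and one internal link.
links = [
    ("http://ext1.example/d.html", "http://target.example/", "iPod"),
    ("http://ext2.example/e.html", "http://target.example/", "iPod shop"),
    ("http://ext2.example/f.html", "http://target.example/", "Apple iPod"),
    ("http://target.example/b.html", "http://target.example/", "home"),
]
result = site_anchor_text(links)
```

The internal link with the anchor text "home" is dropped, so only the three external anchor texts remain for the target page.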

  4. Retrieval Method
     - Step 1: Parse the query and search for pages
     - Step 2: Determine the score of each page
     - Step 3: Sort pages by score
     - Score of page $p$: $\mathrm{Score}(p) = \mathrm{Rep}(p) \times \mathrm{Rel}(p, q)$
       - $\mathrm{Rep}(p)$: representativeness of page $p$, derived from link structure
       - $\mathrm{Rel}(p, q)$: relevancy of page $p$ and query $q$, based on two kinds of measures, reference consistency and specificity of word combination
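
Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `rank_pages`, the page identifiers and the toy score values are all hypothetical, and Step 1 (query parsing and candidate search) is assumed to have already produced the candidate list.

```python
def rank_pages(candidates, query, rep, rel):
    """Sort candidates by Score(p) = Rep(p) * Rel(p, q), best first.

    `rep` is a callable page -> representativeness,
    `rel` is a callable (page, query) -> relevancy.
    """
    scored = [(p, rep(p) * rel(p, query)) for p in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Toy values, chosen only to show the ordering.
toy_rep = {"x": 303, "y": 11, "z": 1001}
toy_rel = {"x": 0.5, "y": 0.9, "z": 0.1}

ranking = rank_pages(
    ["x", "y", "z"],
    "ipod",
    rep=lambda p: toy_rep[p],
    rel=lambda p, q: toy_rel[p],
)
```

Because the score is a product, a page needs both a reasonable representativeness and a reasonable relevancy to rank highly: page z has the largest Rep but still ranks below page x.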

  5. Representativeness of page p
     - Derived from link structure: $\mathrm{Rep}(p) = C \times T$
     - $C$: Citation frequency from external Web sites
     - $T$: Likelihood of top page, determined by the following heuristics:
       - ($H_1$) Does the URL of the page consist of only a domain name?
       - ($H_2$) Does the file name of the URL contain such a string as "index" or "default"?
       - ($H_3$) Does the URL end with a slash "/"?
     - $T = w_1 \times \delta_1 + w_2 \times \delta_2 + w_3 \times \delta_3 + w_4$, where $\delta_i = 1$ if $H_i$ is true and $\delta_i = 0$ if $H_i$ is false
     - $(w_1, w_2, w_3, w_4) = (1000, 100, 10, 1)$
     - e.g. http://www.c.com/abc/index.html: $\mathrm{Rep}(a) = 3 \times 101 = 303$
     - [Figure: link diagram of pages a to f across the sites www.a.com, www.b.com and www.c.com]
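
The heuristics and weights above translate directly into code. The sketch below reproduces the slide's example value of 303 under some assumptions that the slide does not spell out: H1 is taken to hold when the URL path is empty or "/", H2 is checked case-insensitively on the last path segment, and query strings are ignored. The function names are hypothetical.

```python
from urllib.parse import urlparse

WEIGHTS = (1000, 100, 10, 1)  # (w1, w2, w3, w4) as given on the slide


def top_page_likelihood(url):
    """T = w1*d1 + w2*d2 + w3*d3 + w4, with d_i = 1 when heuristic H_i holds."""
    path = urlparse(url).path
    file_name = path.rsplit("/", 1)[-1].lower()
    h1 = path in ("", "/")  # H1 (assumed): nothing but the domain name
    h2 = "index" in file_name or "default" in file_name  # H2
    h3 = url.endswith("/")  # H3: URL ends with a slash
    w1, w2, w3, w4 = WEIGHTS
    return w1 * h1 + w2 * h2 + w3 * h3 + w4


def representativeness(url, citations):
    """Rep(p) = C * T, where C is the citation count from external sites."""
    return citations * top_page_likelihood(url)


rep_a = representativeness("http://www.c.com/abc/index.html", citations=3)
```

For http://www.c.com/abc/index.html only H2 holds, so T = 100 + 1 = 101 and, with three external citations, Rep = 3 × 101 = 303, matching the slide.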

  6. Relevancy of page p and query q
     - Main concept: Effective use of limited information to determine the relevancy
     - Reference consistency
       - How consistently is the page referred to by external Web sites?
       - (How sharply does the site focus on a topic?)
     - Specificity of word combination
       - How specifically are pages identified by a given word combination?

  7. Reference consistency
     - Which page is relevant for the query "i-pod"?
     - [Figure: site anchor text of two pages, x and y; words shown include blog, Clie, iPod, MBA, Matsui, Apple, LaVie and NEC]
     - $\mathrm{Rel}(p, q) = \sum_{t \in q} kw_t \times \left( \dfrac{f_t}{N_{sa}} \right)^2$
       - $f_t$: Frequency of word $t$ in the site anchor text for page $p$
       - $N_{sa}$: Amount of site anchor text for page $p$
       - $kw_t$: Weight of the word in query $q$, $kw_i = 2^{-(n_q - i)}$
     - In this case, $\mathrm{Rel}(x, \text{"iPod"}) < \mathrm{Rel}(y, \text{"iPod"})$.
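
A minimal sketch of the reference consistency measure follows. It assumes that the "amount of site anchor text" is counted in words and that every query word has weight 1.0 unless a weight is passed in; both are assumptions, as is the function name. The two word lists are made-up data, not the contents of the slide's figure.

```python
from collections import Counter


def reference_consistency(anchor_words, query_words, weights=None):
    """Rel(p, q) = sum over t in q of kw_t * (f_t / N_sa) ** 2.

    `anchor_words` is the list of words in the site anchor text of page p.
    N_sa is taken to be the total word count (an assumption).
    """
    n_sa = len(anchor_words)
    if n_sa == 0:
        return 0.0
    counts = Counter(anchor_words)
    weights = weights or {}
    return sum(
        weights.get(t, 1.0) * (counts[t] / n_sa) ** 2
        for t in query_words
    )


# Page x is referred to with many unrelated words, page y mostly with "ipod".
words_x = ["blog", "clie", "ipod", "mba"]
words_y = ["ipod", "ipod", "ipod", "apple"]

rel_x = reference_consistency(words_x, ["ipod"])
rel_y = reference_consistency(words_y, ["ipod"])
```

Squaring the ratio rewards pages whose external references agree on the query word: here rel_x is (1/4)² = 0.0625 and rel_y is (3/4)² = 0.5625.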

  8. Specificity of word combination
     - How specifically are pages identified by a given word combination?
     - $\mathrm{Rel}(p, q) = \log \dfrac{N}{\tau}$, where $\tau \in D(p, q)$ is the number of pages that contain the keyword group included in both page $p$ and query $q$
     - [Figure: diagram of the words $t_1$, $t_2$ and $t_3$]
     - If $D(t_1, t_2, t_3) < D(t_1, t_2) < D(t_1, t_3) < D(t_2, t_3)$ and $i \in D(t_1, t_2, t_3)$, $j \in D(t_1, t_2)$, $k \in D(t_1, t_3)$, $l \in D(t_2, t_3)$, then $\mathrm{Rel}(i, q) > \mathrm{Rel}(j, q) > \mathrm{Rel}(k, q) > \mathrm{Rel}(l, q)$.
     - Note: The traditional TF-IDF schema tends to be biased toward words with high specificity ($t_2$ and $t_3$), so $\mathrm{Rel}(l, q) > \mathrm{Rel}(j, q)$ or $\mathrm{Rel}(k, q)$ in this case.
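
The specificity measure can be illustrated with a small sketch. Several points are assumptions rather than statements from the slide: N is taken to be the total number of indexed pages, a page "contains" a keyword group when it contains every word of the group, and a page sharing no word with the query scores 0. The function name and the seven toy pages are hypothetical.

```python
import math


def specificity(page_id, query_words, page_words):
    """Rel(p, q) = log(N / tau).

    `page_words` maps page id -> set of words in its site anchor text.
    tau is the number of pages containing every word that page p shares
    with the query; N is the total number of pages (an assumption).
    """
    shared = set(query_words) & page_words[page_id]
    if not shared:
        return 0.0
    tau = sum(1 for words in page_words.values() if shared <= words)
    return math.log(len(page_words) / tau)


query = ["t1", "t2", "t3"]
page_words = {
    "i": {"t1", "t2", "t3"},
    "j": {"t1", "t2"},
    "k": {"t1", "t3"},
    "k2": {"t1", "t3"},
    "l": {"t2", "t3"},
    "l2": {"t2", "t3"},
    "l3": {"t2", "t3"},
}
rel = {p: specificity(p, query, page_words) for p in ("i", "j", "k", "l")}
```

In this toy index the word groups are matched by 1, 2, 3 and 4 pages respectively, so the scores follow the ordering stated on the slide: Rel(i, q) > Rel(j, q) > Rel(k, q) > Rel(l, q).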

  9. Evaluation
     - Document collection: 100GB NW100G-01
     - Total size of site anchor text: 94MB
     - Evaluation scales: WRR (and DCG)
       - "relevant", "partially relevant", "irrelevant"
     - Compared the following 4 systems:

     | ID  | Index                      | Relevancy calculation           |
     |-----|----------------------------|---------------------------------|
     | OKA | Full text of crawled pages | OKAPI                           |
     | ANC | Full text of crawled pages | High weight to anchor text      |
     | SAR | Site anchor text only      | Reference consistency           |
     | SAS | Site anchor text only      | Specificity of word combination |

  10. Result and discussion (1/4)
     - Site anchor text retrieval (SAR and SAS) has great advantages over simple full text retrieval (OKA).
     - [Figure: bar chart of WRR (series wrr.1-0 and wrr.1-1) for OKA-TT, ANC-TT, SAR-TT, SAS-TT, OKA-DS, ANC-DS, SAR-DS and SAS-DS. Callout: Site anchor text retrieval (SAR and SAS) outperformed the simple full text retrieval (OKA).]
     - ※ TT: <TITLE> / DS: <DESC> for Topic Part

  11. Result and discussion (2/4)
     - Some important information in anchor text can be lost when site anchor text is extracted.
       - e.g. http://abc.jp/~usr1/ and http://abc.jp/~usr2/ are dealt with as the same site.
     - [Figure: bar chart of WRR (series wrr.1-0 and wrr.1-1) for OKA-TT, ANC-TT, SAR-TT, SAS-TT, OKA-DS, ANC-DS, SAR-DS and SAS-DS. Callout: Anchor weighted full text retrieval (ANC) was better than site anchor text retrieval (SAR and SAS).]
     - ※ TT: <TITLE> / DS: <DESC> for Topic Part

  12. Result and discussion (3/4)
     - Despite a very small index, SAR and SAS were comparable with ANC (up to 88% on WRR).
     - In particular, the accuracy ratio tends to be higher in data series that give a score only for the "relevant" pages.
     - Site anchor text can pinpoint highly relevant documents.

     | Measure | SAR/ANC | SAS/ANC |
     |---------|---------|---------|
     | dcg.3-0 | 0.84    | 0.81    |
     | dcg.3-2 | 0.75    | 0.71    |
     | dcg.3-3 | 0.72    | 0.68    |
     | wrr.1-0 | 0.88    | 0.76    |
     | wrr.1-1 | 0.84    | 0.71    |

     ※ Topic Part is <TITLE>

  13. Result and discussion (4/4)
     - Some uncrawled pages are "relevant", and relevancy for the uncrawled pages can be determined based on reference information.
     - [Figure: bar chart of WRR (series wrr.1-0 and wrr.1-1) per system, WRR for crawled documents only. Callout: The gap of WRR value increased between SAR and ANC (or SAS and ANC).]

  14. Conclusion and Future work
     - Site anchor text retrieval system ...
       - Has a very small index size (one-thousandth of the original document set)
       - Outperforms simple full-text retrieval.
       - Is comparable with anchor text weighted full-text retrieval (up to 88% accuracy).
       - Tends to pinpoint highly relevant pages.
       - Can retrieve uncrawled pages as well as crawled pages based on only referential information.
     - In future work ...
       - Integrate site anchor text retrieval and traditional retrieval system
       - Address the problem of Web site boundaries
