1
Navigation Retrieval with Site Anchor Text
Hideki KAWAI, Kenji TATEISHI and Toshikazu FUKUSHIMA
NEC Internet Systems Research Labs.
2
Introduction
Navigation Retrieval Task in NTCIR-4 WEB (task B)
Searching for one or more "representative Web pages."
Relevancy and representativeness of a document are both important.
Verify the efficiency of referential information
Retrieval system which indexes only site anchor text
Two advantages:
・The index size is very small.
・A user can retrieve uncrawled documents as well as crawled documents.
3
Anchor(d,a) + Anchor(e,a) + Anchor(f,a)
[Figure: pages a, b, c, d, e and f on the Web sites www.a.com, www.b.com and www.c.com]
Summarizing content and popularity of the Web site
We can calculate relevancy and representativeness.
Note: We defined "external Web sites" simply as sites whose domain name is different from that of the target page.
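The aggregation on this slide can be sketched in Python. This is a minimal illustration, not the authors' implementation. It reads the slide as saying that the site anchor text of page a is the collection of anchor text from links on external sites (here d, e and f). The link tuples, the helper names `site_of` and `build_site_anchor_text`, and the use of the URL host name as the site identifier are assumptions made for the example.

```python
from collections import defaultdict
from urllib.parse import urlparse


def site_of(url):
    """Return the host name, used here as a simple stand-in for the Web site."""
    return urlparse(url).netloc.lower()


def build_site_anchor_text(links):
    """Collect anchor text per target page, keeping only links from external sites.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples.
    """
    site_anchor_text = defaultdict(list)
    for source, target, anchor in links:
        if site_of(source) != site_of(target):  # external reference only
            site_anchor_text[target].append(anchor)
    return dict(site_anchor_text)


# Hypothetical links: three external pages and one internal page point at page a.
links = [
    ("http://www.b.com/d.html", "http://www.a.com/", "iPod"),
    ("http://www.b.com/e.html", "http://www.a.com/", "Apple iPod"),
    ("http://www.c.com/f.html", "http://www.a.com/", "iPod"),
    ("http://www.a.com/b.html", "http://www.a.com/", "Home"),
]
index = build_site_anchor_text(links)
print(index["http://www.a.com/"])
```

Because the index is keyed by link target, a page receives an entry even if it was never crawled itself, which is how uncrawled documents can become retrievable.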
4
Representativeness of page p
derived from link structure
Relevancy of page p and query q
based on two kinds of measures, reference consistency and specificity of word combination
Step 1: Parse the query and search for pages
Step 2: Determine the score of each page
Step 3: Sort pages by score
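The three steps can be sketched as a small Python function. This is only an illustration under stated assumptions: the slide does not say how representativeness and relevancy are combined into one score, so the scoring function is passed in by the caller, and `toy_score` (a plain count of query words) is a hypothetical placeholder rather than the authors' measure. The function name `retrieve` and the sample data are likewise invented for the example.

```python
def retrieve(query, site_anchor_text, score):
    """Return pages ranked for `query` using a caller-supplied score function."""
    # Step 1: parse the query and search for pages whose site anchor text
    # contains at least one query word.
    terms = query.lower().split()
    candidates = []
    for page, anchors in site_anchor_text.items():
        words = " ".join(anchors).lower().split()
        if any(term in words for term in terms):
            candidates.append(page)
    # Step 2: determine the score of each page.
    scored = [(score(page, terms), page) for page in candidates]
    # Step 3: sort pages by score, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page for _, page in scored]


# Hypothetical site anchor text index.
site_anchor_text = {
    "http://www.a.com/": ["iPod", "Apple iPod", "iPod"],
    "http://www.b.com/": ["blog", "iPod"],
    "http://www.c.com/": ["NEC"],
}


def toy_score(page, terms):
    """Placeholder score: how often the query words occur in the anchor text."""
    words = " ".join(site_anchor_text[page]).lower().split()
    return sum(words.count(term) for term in terms)


print(retrieve("iPod", site_anchor_text, toy_score))
```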
5
[Figure: pages a, b, c, d, e and f on the Web sites www.a.com, www.b.com and www.c.com]
C: Citation frequency from external Web sites
T: Likelihood of top page determined by the following heuristics:
$$T = w_1 \times \delta_1 + w_2 \times \delta_2 + w_3 \times \delta_3 + w_4, \qquad \delta_i = \begin{cases} 1 & \text{if } H_i \text{ is true} \\ 0 & \text{if } H_i \text{ is false} \end{cases}, \qquad (w_1, w_2, w_3, w_4) = (1000, 100, 10, 1)$$
(H1) Does the URL [...]
(H2) Does the file name of the URL contain such a string as "index" or "default"?
(H3) Does the URL end with a slash "/"?
e.g. http://www.c.com/abc/index.html
$$\mathrm{Rep}(a) = 3 \times 101 = 303$$
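The worked example can be reproduced with a short Python sketch. It assumes that representativeness is the product of C and T, which is how the example 3 × 101 = 303 reads, and that δ is 0 when a heuristic does not hold. Because the wording of H1 is cut off on the slide, H1 is not implemented here and is passed in as a flag by the caller. The function names are hypothetical.

```python
from urllib.parse import urlparse

# (w1, w2, w3, w4) as given on the slide.
WEIGHTS = (1000, 100, 10, 1)


def top_page_likelihood(url, h1=False):
    """Compute T from heuristics H2 and H3; H1 must be supplied by the caller."""
    path = urlparse(url).path
    filename = path.rsplit("/", 1)[-1].lower()
    h2 = "index" in filename or "default" in filename  # H2: index/default file
    h3 = url.endswith("/")                             # H3: trailing slash
    w1, w2, w3, w4 = WEIGHTS
    return w1 * int(h1) + w2 * int(h2) + w3 * int(h3) + w4


def representativeness(citations, url, h1=False):
    """Assumed form Rep = C * T, inferred from the example on the slide."""
    return citations * top_page_likelihood(url, h1)


url = "http://www.c.com/abc/index.html"
print(top_page_likelihood(url))      # 101: only H2 holds, plus w4
print(representativeness(3, url))    # 303
```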
6
How consistently is the page referred to by external Web sites?
(How sharply does the site focus on a topic?)
How specifically are pages identified by the given word combination?
Main concept:
Effective use of limited information to determine the relevancy
7
[Figure: pages x and y and the anchor words pointing to them; the words shown include iPod, blog, Clie, MBA, Matsui, LaVie, Apple and NEC]
$$\mathrm{Rel}(p, q) = \sum_{t \in q} kw_t \times \frac{f_t^2}{N_{sa}}$$
$f_t$: Frequency of word $t$ in the site anchor text for page $p$
$N_{sa}$: Amount of site anchor text for page $p$
$kw_t$: Weight of the word in query $q$
$$kw_{q_i} = 2^{(n-i)}$$
In this case...
$$\mathrm{Rel}(x, \text{"iPod"}) < \mathrm{Rel}(y, \text{"iPod"})$$
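A small Python sketch may make the reference-consistency measure concrete. It rests on several assumptions that the slide does not confirm: the score is read as the sum over query words of kw_t × f_t² / N_sa (the placement of the exponent 2 on f_t is a reading of the damaged formula), N_sa is taken to be the number of words in the site anchor text, and the weight of the i-th of n query words is taken to be 2^(n-i). The anchor words for pages x and y are hypothetical, loosely modeled on the words visible in the figure.

```python
def keyword_weights(terms):
    """Assumed weights: the i-th of n query words gets 2 ** (n - i)."""
    n = len(terms)
    return {term: 2 ** (n - i) for i, term in enumerate(terms, start=1)}


def reference_consistency(anchor_words, terms):
    """Assumed form: sum over query words of kw_t * f_t ** 2 / N_sa."""
    n_sa = len(anchor_words)  # assumption: amount of anchor text = word count
    if n_sa == 0:
        return 0.0
    kw = keyword_weights(terms)
    return sum(kw[t] * anchor_words.count(t) ** 2 / n_sa for t in terms)


# Hypothetical site anchor text: x is referred to with many unrelated words,
# y is referred to almost only with the query word.
x_words = ["iPod", "iPod", "blog", "blog", "Clie", "MBA", "Matsui", "LaVie"]
y_words = ["iPod", "iPod", "Apple"]

rel_x = reference_consistency(x_words, ["iPod"])
rel_y = reference_consistency(y_words, ["iPod"])
print(rel_x, rel_y, rel_x < rel_y)
```

Under these assumptions both pages have the same frequency of "iPod", but y scores higher because the word makes up a larger share of its site anchor text.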
8
How specifically are pages identified by the given word combination?
[Figure: diagram labeled t1, t2 and t3]
If
$$D(t_1, t_2, t_3) < D(t_1, t_2) < D(t_1, t_3) < D(t_2, t_3)$$
and
$$i \in D(t_1, t_2, t_3),\quad j \in D(t_1, t_2),\quad k \in D(t_1, t_3),\quad l \in D(t_2, t_3),$$
then
$$\mathrm{Rel}(i, q) > \mathrm{Rel}(j, q) > \mathrm{Rel}(k, q) > \mathrm{Rel}(l, q).$$
Note: Traditional TF schema tends to be biased toward words with high specificity (t2 and t3), so Rel(l,q) > Rel(j,q) or Rel(k,q) in this case.
$D(\ldots)$ [arguments garbled; surviving symbols: $\tau$, $\in$, $p$, $q$]: Number [...] keyword gr[...] query q
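The ordering stated on this slide can be illustrated with a toy Python sketch. The slide gives only the ordering, not a scoring formula, so everything specific here is an assumption: D is taken to be the number of pages whose site anchor text matches exactly a given combination of query words, and the score 1 / D is a hypothetical choice that merely reproduces the ordering. The page names and word sets are invented.

```python
from collections import Counter


def matched_terms(page_words, terms):
    """Return the combination of query words found in a page's anchor text."""
    return frozenset(t for t in terms if t in page_words)


def specificity_scores(pages, terms):
    """Hypothetical score 1 / D, where D counts pages matching the same combination."""
    combos = {page: matched_terms(words, terms) for page, words in pages.items()}
    d = Counter(combo for combo in combos.values() if combo)
    return {page: 1.0 / d[combo] for page, combo in combos.items() if combo}


terms = ["t1", "t2", "t3"]
# Invented data with D(t1,t2,t3)=1 < D(t1,t2)=2 < D(t1,t3)=3 < D(t2,t3)=4.
pages = {
    "i": {"t1", "t2", "t3"},
    "j1": {"t1", "t2"}, "j2": {"t1", "t2"},
    "k1": {"t1", "t3"}, "k2": {"t1", "t3"}, "k3": {"t1", "t3"},
    "l1": {"t2", "t3"}, "l2": {"t2", "t3"}, "l3": {"t2", "t3"}, "l4": {"t2", "t3"},
}
scores = specificity_scores(pages, terms)
print(scores["i"], scores["j1"], scores["l1"])
```

With this data the rarer a word combination is, the higher the pages matching it are ranked, regardless of how specific the individual words are.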
9
Evaluation scales: WRR (and DCG)
"relevant", "partially relevant", "irrelevant"
ID    Index                         Relevancy calculation
OKA   Full text of crawled pages    OKAPI
ANC   Full text of crawled pages    High weight to anchor text
SAR   Site anchor text only         Reference consistency
SAS   Site anchor text only         Specificity of word combination
10
Site anchor text retrieval (SAR and SAS) outperformed the simple full text retrieval (OKA)
[Figure: WRR (series wrr.1-0 and wrr.1-1, axis from 0.1 to 0.6) for OKA-TT, ANC-TT, SAR-TT, SAS-TT, OKA-DS, ANC-DS, SAR-DS and SAS-DS]
Site anchor text retrieval (SAR and SAS) has great advantages over simple full text retrieval (OKA).
※ TT: <TITLE> / DS: <DESC> for T[...]
11
Anchor weighted full text retrieval (ANC) was better than site anchor text retrieval (SAR and SAS)
[Figure: WRR (series wrr.1-0 and wrr.1-1, axis from 0.1 to 0.6) for OKA-TT, ANC-TT, SAR-TT, SAS-TT, OKA-DS, ANC-DS, SAR-DS and SAS-DS]
Some important information in anchor text can be lost when site anchor text is extracted.
[...] with as the same site.
※ TT: <TITLE> / DS: <DESC> for T[...]
12
Despite a very small index, SAR and SAS were comparable with ANC (up to 88% on WRR)
In particular, the accuracy ratio tends to be higher in data series that give a score only for the "relevant" pages.
Site anchor text can pinpoint highly relevant documents.
           SAR/ANC   SAS/ANC
dcg.3-0    0.84      0.81
dcg.3-2    0.75      0.71
dcg.3-3    0.72      0.68
wrr.1-0    0.88      0.76
wrr.1-1    0.84      0.71
※ T[...]t is <TITLE>
13
WRR for crawled documents only
[Figure: WRR (axis from 0.1 to 0.7) for the OKA, ANC, SAR and SAS runs and the @OKA, @ANC, @SAR and @SAS runs]
The gap in WRR between SAR and ANC (or SAS and ANC) increased [...] crawled documents.
Some uncrawled pages are "relevant", and relevancy for the uncrawled pages can be determined based on reference information.
14
Has very small index size (one-thousandth of [...])
Outperforms simple full-text retrieval.
Is comparable with anchor text weighted full-text retrieval (up to 88% accuracy).
Tends to pinpoint highly relevant pages.
Can retrieve uncrawled pages as well as crawled pages based only on referential information.
Integrate site anchor text retrieval and traditional retrieval system
Address the problem of Web site boundaries