Navigation Retrieval with Site Anchor Text - PowerPoint PPT Presentation



SLIDE 1

Navigation Retrieval with Site Anchor Text

Hideki KAWAI, Kenji TATEISHI and Toshikazu FUKUSHIMA
NEC Internet Systems Research Labs.

SLIDE 2

Introduction

Navigation Retrieval Task in NTCIR-4 WEB (task B)

  • Searching for one or more "representative Web pages."
  • Relevancy and representativeness of a document are both important.

Motivation

  • Verify the efficiency of referential information
  • Retrieval system which indexes only site anchor text

Two advantages:

  • The index size is very small.
  • A user can retrieve uncrawled documents as well as crawled documents.

SLIDE 3

Site Anchor Text

Anchor text of links from external Web sites

[Figure: pages a, b, c, d, e and f on the sites www.a.com, www.b.com and www.c.com, labelled Anchor(d,a) + Anchor(e,a) + Anchor(f,a)]

  • Summarizes the content and popularity of the Web site
  • We can calculate relevancy and representativeness.

Note: We defined "external Web sites" simply as sites whose domain name is different from that of the target page.
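The definition of external Web sites in the note can be sketched in Python. This is a minimal illustration, not the authors' implementation: the `(source_url, target_url, anchor_text)` tuple format, the helper names `host` and `site_anchor_text`, and the example URLs and anchor strings are all assumptions made for this example. The host name of the URL is used as the "domain name".

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host(url):
    """Return the lower-cased host name of a URL (used here as the domain name)."""
    return (urlsplit(url).hostname or "").lower()


def site_anchor_text(links):
    """Group anchor text by target page, keeping only links from external sites.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples
    (a hypothetical input format chosen for this sketch).
    """
    collected = defaultdict(list)
    for source, target, text in links:
        if host(source) != host(target):  # external: the domain names differ
            collected[target].append(text)
    return dict(collected)


# Invented example data: three external links and one internal link to page a.
links = [
    ("http://www.b.com/d.html", "http://www.a.com/", "Site A"),
    ("http://www.c.com/e.html", "http://www.a.com/", "A home page"),
    ("http://www.c.com/f.html", "http://www.a.com/", "Site A"),
    ("http://www.a.com/b.html", "http://www.a.com/", "Back to top"),  # internal, ignored
]
anchors = site_anchor_text(links)
print(anchors["http://www.a.com/"])  # ['Site A', 'A home page', 'Site A']
```

Only the three links whose source host differs from the target host contribute to the site anchor text; the internal "Back to top" link is dropped.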

SLIDE 4

Retrieval Method

Score of page p:

$$\mathrm{Score}(p) = \mathrm{Rep}(p) \times \mathrm{Rel}(p, q)$$

  • Rep(p): Representativeness of page p, derived from link structure
  • Rel(p, q): Relevancy of page p and query q, based on two kinds of measures, reference consistency and specificity of word combination

Step 1: Parse the query and search for pages
Step 2: Determine the score of each page
Step 3: Sort pages by score
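The three steps can be sketched as a small Python function. This is only an outline under stated assumptions: the inverted index structure, the whitespace tokenization of the query, and the `rep` and `rel` callables are hypothetical stand-ins, and all numbers in the toy data are invented.

```python
def rank_pages(query, index, rep, rel):
    """Rank candidate pages by Score(p) = Rep(p) * Rel(p, q).

    index: dict mapping a word to the set of page ids whose site anchor
           text contains it (hypothetical structure).
    rep:   callable page -> Rep(p)
    rel:   callable (page, query_words) -> Rel(p, q)
    """
    # Step 1: parse the query and search for pages.
    query_words = query.lower().split()
    candidates = set()
    for word in query_words:
        candidates |= index.get(word, set())

    # Step 2: determine the score of each page.
    scored = [(page, rep(page) * rel(page, query_words)) for page in candidates]

    # Step 3: sort pages by score (highest first; ties broken by page id).
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy data: every value below is invented for illustration.
index = {"ipod": {"x", "y"}, "apple": {"y"}}
rep_values = {"x": 10.0, "y": 2.0}
rel_values = {"x": 0.1, "y": 0.9}

ranking = rank_pages(
    "iPod",
    index,
    rep=lambda page: rep_values[page],
    rel=lambda page, words: rel_values[page],
)
print([page for page, _ in ranking])  # ['y', 'x']
```

Because the score is a product, a page with high representativeness but low relevancy (x here) can rank below a less cited but more relevant page (y).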

SLIDE 5

Representativeness of page p

Derived from link structure

[Figure: pages a, b, c, d, e and f on the sites www.a.com, www.b.com and www.c.com]

$$\mathrm{Rep}(p) = C \times T$$

C: Citation frequency from external Web sites
T: Likelihood of top page, determined by the following heuristics:

$$T = w_1 \times \delta_1 + w_2 \times \delta_2 + w_3 \times \delta_3 + w_4, \qquad \delta_i = \begin{cases} 1 & \text{if } H_i \text{ is true} \\ 0 & \text{if } H_i \text{ is false} \end{cases}, \qquad (w_1, w_2, w_3, w_4) = (1000, 100, 10, 1)$$

(H1) Does the URL of the page consist of only the domain name?
(H2) Does the file name of the URL contain a string such as "index" or "default"?
(H3) Does the URL end with a slash "/"?

Example URL: http://www.c.com/abc/index.html

e.g.

$$\mathrm{Rep}(a) = 3 \times 101 = 303$$
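The heuristics and the worked example (3 × 101 = 303) can be checked with a short Python sketch. The function names and the exact way each heuristic is tested against a URL are assumptions made for illustration; in particular, (H1) is read here as "the path is empty or a single slash and there is no query string".

```python
from urllib.parse import urlsplit

WEIGHTS = (1000, 100, 10, 1)  # (w1, w2, w3, w4) as given on the slide


def top_page_likelihood(url):
    """Compute T = w1*d1 + w2*d2 + w3*d3 + w4 from the three URL heuristics."""
    parts = urlsplit(url)
    path = parts.path
    file_name = path.rsplit("/", 1)[-1].lower()

    h1 = path in ("", "/") and not parts.query           # (H1) only the domain name
    h2 = "index" in file_name or "default" in file_name  # (H2) "index" or "default"
    h3 = url.endswith("/")                               # (H3) ends with a slash

    w1, w2, w3, w4 = WEIGHTS
    return w1 * h1 + w2 * h2 + w3 * h3 + w4  # booleans act as 0 or 1


def representativeness(citations, url):
    """Rep(p) = C * T, where C is the citation count from external Web sites."""
    return citations * top_page_likelihood(url)


print(top_page_likelihood("http://www.c.com/abc/index.html"))    # 101
print(representativeness(3, "http://www.c.com/abc/index.html"))  # 303
```

For the example URL only (H2) holds, so T = 100 + 1 = 101, and three external citations give Rep = 303, matching the value on the slide.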

SLIDE 6

Relevancy of page p and query q

Reference consistency

  • How consistently is the page referred to by external Web sites? (How sharply does the site focus on a topic?)

Specificity of word combination

  • How specifically are pages identified by a given word combination?

Main concept:

  • Effective use of limited information to determine the relevancy

SLIDE 7

Reference consistency

Which is relevant for the query "i-pod"?

[Figure: two pages, x and y, with the words of their site anchor text; words shown: iPod, blog, Clie, MBA, Matsui, LaVie, Apple, NEC]

$$\mathrm{Rel}(p, q) = \sum_{t \in q} kw_t \times \left( \frac{f_t}{N_{sa}} \right)^2$$

f_t: Frequency of word t in the site anchor text for page p
N_sa: Amount of site anchor text for page p
kw_t: Weight of the word t in query q

kw_i: [weight formula illegible in the extracted text; surviving symbols: q, i, n, 2]

In this case:

$$\mathrm{Rel}(x, \text{"iPod"}) < \mathrm{Rel}(y, \text{"iPod"})$$
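A small Python sketch may make the idea of reference consistency concrete. It assumes the formula reads Rel(p, q) = Σ kw_t × (f_t / N_sa)², summed over the query words, that the "amount" N_sa is the total number of words in the site anchor text, and that every query word has weight 1 unless a weight is supplied. These are assumptions, as are the function name and the anchor lists, which are invented and do not reproduce the slide's figure.

```python
from collections import Counter


def reference_consistency(anchor_texts, query_words, weights=None):
    """Sum over query words of kw_t * (f_t / N_sa) ** 2.

    anchor_texts: list of anchor strings pointing at the page from external sites.
    weights:      optional dict word -> kw_t (assumed to default to 1.0).
    """
    words = [w.lower() for text in anchor_texts for w in text.split()]
    n_sa = len(words)  # assumption: "amount" of site anchor text = word count
    if n_sa == 0:
        return 0.0
    freq = Counter(words)
    weights = weights or {}
    return sum(
        weights.get(t.lower(), 1.0) * (freq[t.lower()] / n_sa) ** 2
        for t in query_words
    )


# Invented anchors: page x is referred to with many unrelated words,
# page y almost only with "iPod".
anchors_x = ["iPod", "iPod", "blog", "blog", "Clie", "MBA"]
anchors_y = ["iPod", "iPod", "Apple"]

rel_x = reference_consistency(anchors_x, ["iPod"])
rel_y = reference_consistency(anchors_y, ["iPod"])
print(rel_x < rel_y)  # True
```

Both pages receive "iPod" twice, but the word makes up 2/6 of the anchor words for x and 2/3 for y, so the page that is referred to more consistently scores higher.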

SLIDE 8

Specificity of word combination

How specifically are pages identified by a given word combination?

$$\mathrm{Rel}(p, q) = \log \frac{N}{D(\tau \in p, q)}$$

D(τ ∈ p, q): Number of pages that contain the keyword group included in both page p and query q

[Figure: diagram of the words t1, t2 and t3]

If

$$D(t_1, t_2, t_3) < D(t_1, t_2) < D(t_1, t_3) < D(t_2, t_3)$$

and

$$i \in D(t_1, t_2, t_3), \quad j \in D(t_1, t_2), \quad k \in D(t_1, t_3), \quad l \in D(t_2, t_3),$$

then

$$\mathrm{Rel}(i, q) > \mathrm{Rel}(j, q) > \mathrm{Rel}(k, q) > \mathrm{Rel}(l, q).$$

Note: The traditional TF-IDF scheme tends to be biased toward words with high specificity (t2 and t3), so Rel(l, q) > Rel(j, q) or Rel(k, q) in this case.
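The ordering in the example can be reproduced with a short Python sketch. It assumes that N is the total number of indexed pages (the slide does not define N), that each page is represented by the set of words in its site anchor text, and that a page sharing no word with the query scores 0. The function name and the toy collection are invented for illustration.

```python
import math


def specificity(page_words, query_words, pages):
    """Rel(p, q) = log(N / D), where D counts pages containing the keyword group.

    pages: dict page_id -> set of words in that page's site anchor text.
    The keyword group is the set of words shared by page p and query q.
    """
    group = set(page_words) & set(query_words)
    if not group:
        return 0.0  # assumption: no shared word means no relevancy
    d = sum(1 for words in pages.values() if group <= words)
    if d == 0:
        return 0.0  # page p itself is not in the collection
    return math.log(len(pages) / d)


# Toy collection built so that D(t1,t2,t3)=1 < D(t1,t2)=2 < D(t1,t3)=3 < D(t2,t3)=4.
pages = {
    "i": {"t1", "t2", "t3"},
    "j": {"t1", "t2"},
    "k": {"t1", "t3"},
    "k2": {"t1", "t3"},
    "l": {"t2", "t3"},
    "l2": {"t2", "t3"},
    "l3": {"t2", "t3"},
}
query = {"t1", "t2", "t3"}
rel = {p: specificity(words, query, pages) for p, words in pages.items()}
print(rel["i"] > rel["j"] > rel["k"] > rel["l"])  # True
```

The score depends on how rare the whole combination is rather than on the rarity of each word taken separately, which is the contrast with TF-IDF drawn in the note.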

SLIDE 9

Evaluation

Document collection: 100GB NW100G-01
Total size of site anchor text: 94MB
Evaluation scales: WRR (and DCG); "relevant", "partially relevant", "irrelevant"

The following 4 systems were compared:

ID    Index                        Relevancy calculation
OKA   Full text of crawled pages   OKAPI
ANC   Full text of crawled pages   High weight to anchor text
SAR   Site anchor text only        Reference consistency
SAS   Site anchor text only        Specificity of word combination

SLIDE 10

Result and discussion (1/4)

[Figure: bar chart of WRR (series wrr.1-0 and wrr.1-1, axis from 0.1 to 0.6) for OKA-TT, ANC-TT, SAR-TT, SAS-TT, OKA-DS, ANC-DS, SAR-DS and SAS-DS]

Site anchor text retrieval (SAR and SAS) has great advantages over simple full text retrieval (OKA).

※ TT: <TITLE> / DS: <DESC> for Topic Part

Site anchor text retrieval (SAR and SAS) outperformed the simple full text retrieval (OKA)

SLIDE 11

Result and discussion (2/4)

[Figure: bar chart of WRR (series wrr.1-0 and wrr.1-1, axis from 0.1 to 0.6) for OKA-TT, ANC-TT, SAR-TT, SAS-TT, OKA-DS, ANC-DS, SAR-DS and SAS-DS]

Some important information in anchor text can be lost when site anchor text is extracted.

  • e.g. http://abc.jp/~usr1/ and http://abc.jp/~usr2/ are dealt with as the same site.

Anchor weighted full text retrieval (ANC) was better than site anchor text retrieval (SAR and SAS)

※ TT: <TITLE> / DS: <DESC> for Topic Part

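The loss described here follows directly from a domain-based definition of a site, as the following Python sketch illustrates. The helper names `site_key` and `is_external` are hypothetical, and using the URL host name as the site key is an assumption about how "domain name" is compared.

```python
from urllib.parse import urlsplit


def site_key(url):
    """Identify a site by its host name only (the domain-based definition)."""
    return (urlsplit(url).hostname or "").lower()


def is_external(source_url, target_url):
    """A link is external when source and target have different site keys."""
    return site_key(source_url) != site_key(target_url)


# A link between two user directories on the same host.
print(is_external("http://abc.jp/~usr1/", "http://abc.jp/~usr2/"))  # False
```

Since both URLs share the host abc.jp, a link from one user's pages to the other's counts as internal, and its anchor text never enters the site anchor text index.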
SLIDE 12

Result and discussion (3/4)

Despite a very small index, SAR and SAS were comparable with ANC (up to 88% on WRR).

  • In particular, the accuracy ratio tends to be higher in data series that give a score only for the "relevant" pages.

Site anchor text can pinpoint highly relevant documents.

Series    SAR/ANC   SAS/ANC
dcg.3-0   0.84      0.81
dcg.3-2   0.75      0.71
dcg.3-3   0.72      0.68
wrr.1-0   0.88      0.76
wrr.1-1   0.84      0.71

※ Topic Part is <TITLE>

SLIDE 13

Result and discussion (4/4)

[Figure: bar chart of WRR (series wrr.1-0 and wrr.1-1, axis from 0.1 to 0.7) for OKA-TT, ANC-TT, SAR-TT, SAS-TT, @OKA-TT, @ANC-TT, @SAR-TT and @SAS-TT; annotation: "WRR for crawled documents only"]

Some uncrawled pages are "relevant", and relevancy for the uncrawled pages can be determined based on reference information.

The gap in WRR value between SAR and ANC (or SAS and ANC) increased for crawled documents.

SLIDE 14

Conclusion and Future work

Site anchor text retrieval system ...

  • Has a very small index size (one-thousandth of the original document set)
  • Outperforms simple full-text retrieval.
  • Is comparable with anchor text weighted full-text retrieval (up to 88% accuracy).
  • Tends to pinpoint highly relevant pages.
  • Can retrieve uncrawled pages as well as crawled pages based on only referential information.

In future work ...

  • Integrate site anchor text retrieval and traditional retrieval systems
  • Address the problem of Web site boundaries