Revisiting Document Length Hypotheses: NTCIR-4 CLIR and Patent Experiments at Patolis (PowerPoint PPT Presentation)

SLIDE 1

Revisiting Document Length Hypotheses

NTCIR-4 CLIR and Patent Experiments at Patolis

4 June 2004
Sumio FUJITA, PATOLIS Corporation

SLIDE 2

Introduction

  • Is patent search different from traditional document retrieval tasks?
  • If the answer is yes,
    – How different?
    – And why different?
  • A comparative study of the CLIR J-J task and the Patent main task may lead us to the answers.
  • Emphasis on document length hypotheses
SLIDE 3

Why emphasis on document length?

  • Because, depending on the retrieval method, the average number of passages in the documents retrieved at the NTCIR-4 Patent task differs considerably!
    – PLLS2 (TF*IDF): 72
    – PLLS6 (KL-Dir): 46
  • Effectiveness in NTCIR-4 CLIR J-J (MAP):
    – TF*IDF: 0.3801 (PLLS-J-J-T-03)
    – KL-Dir: 0.3145
  • Effectiveness in NTCIR-4 Patent (MAP):
    – KL-Dir: 0.2408 (PLLS6)
    – TF*IDF: 0.1703
  • Do different document length hypotheses suit different tasks?
SLIDE 4

System description

  • PLLS evaluation experiment system
  • Based on the Lemur toolkit 2.0.1 [Ogilvie et al. 02] for indexing
  • PostgreSQL integration for handling bibliographic information
  • Distributed search against the patent full-text collection, partitioned by publication year
  • Simulated centralized search as the baseline
SLIDE 5

System description

  • Indexing language:
    – ChaSen version 2.2.9 as the Japanese morphological analyzer, with the IPADIC dictionary version 2.5.1
  • Retrieval models:
    – TF*IDF with BM25 TF
    – KL-divergence of probabilistic language models with Dirichlet prior smoothing [Zhai et al. 01]
  • Rocchio feedback for TF*IDF, and the Markov chain query update method for the KL-divergence retrieval model [Lafferty et al. 01]

SLIDE 6

Language modeling for IR

p(d \mid q) \propto p(d)\, p(q \mid d)

  • The retrieval version of a naïve Bayes classifier

\log\bigl(p(d)\, p(q \mid d)\bigr) = \log p(d) + \sum_i \log p(q_i \mid d)

which, up to a query-dependent constant, ranks documents by

\sum_{w \in V} p(w \mid q)\, \log p(w \mid d)

the negative cross entropy between the query language model and a document language model.
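As a concrete illustration, here is a minimal Python sketch of this scoring rule (not the actual PLLS/Lemur implementation; the probability dictionaries are hypothetical):

```python
import math

def lm_score(query_lm, doc_lm, doc_prior=1.0):
    """Rank d by log p(d) + sum_w p(w|q) * log p(w|d).

    query_lm:  dict mapping word -> p(w|q)
    doc_lm:    dict mapping word -> smoothed p(w|d); must be non-zero
               for every query word, which is why smoothing is needed
    doc_prior: p(d); a constant unless a document prior is used
    """
    neg_cross_entropy = sum(p_wq * math.log(doc_lm[w])
                            for w, p_wq in query_lm.items())
    return math.log(doc_prior) + neg_cross_entropy
```

With a uniform prior the first term is constant, so ranking reduces to the negative cross entropy between the two language models.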

SLIDE 7

Smoothing methods

  • Jelinek-Mercer method:

    p_\lambda(w \mid d) = (1 - \lambda)\, p_{ml}(w \mid d) + \lambda\, p(w \mid C)

    where p_{ml}(w \mid d) = freq(w,d) / |d|. The background probability is not divided by the document length.

  • Dirichlet-prior method:

    p_\mu(w \mid d) = \frac{freq(w,d) + \mu\, p(w \mid C)}{|d| + \mu}

    Here the background probability is divided by the document length (through the |d| + \mu denominator).
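A minimal Python sketch of the two estimators (hypothetical count dictionaries, not the Lemur implementation; the defaults echo the λ=0.45 and µ=1000 settings reported on slide 19):

```python
def p_jm(w, doc_tf, doc_len, p_coll, lam=0.45):
    """Jelinek-Mercer: (1 - lambda) * p_ml(w|d) + lambda * p(w|C).
    The background term is NOT divided by the document length."""
    p_ml = doc_tf.get(w, 0) / doc_len
    return (1 - lam) * p_ml + lam * p_coll[w]


def p_dirichlet(w, doc_tf, doc_len, p_coll, mu=1000):
    """Dirichlet prior: (freq(w,d) + mu * p(w|C)) / (|d| + mu).
    The background term IS divided by (|d| + mu), so its influence
    shrinks as documents get longer."""
    return (doc_tf.get(w, 0) + mu * p_coll[w]) / (doc_len + mu)
```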

SLIDE 8

Document dependent priors

  • Document length is a good choice in TREC experiments, since it is predictive of relevance on the TREC test sets [Miller et al. 99][Singhal et al. 96].
  • Hyperlink information in Web search
  • What are good priors in patent search?
    – An IPC prior? (a sketch of how a prior enters the score follows this list)
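Any such prior enters the ranking through the log p(d) term of slide 6. A hedged sketch follows; the length-proportional prior is just one possible choice, not necessarily the one used in these experiments:

```python
def doc_length_prior(doc_len, avg_doc_len):
    """One possible document prior p(d): proportional to document
    length, as explored for TREC-style collections. Link counts or
    an IPC-based prior could be plugged in the same way; the
    normalizing constant cancels in ranking."""
    return doc_len / avg_doc_len

# score(d) = log p(d) + sum_w p(w|q) * log p(w|d),
# i.e. pass doc_length_prior(...) as doc_prior to lm_score above.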

SLIDE 9

Document length hypotheses

  • Why are long documents longer than short ones?
  • The "Scope hypothesis" considers a long document as a concatenation of a number of unrelated short documents.
  • The "Verbosity hypothesis" assumes that a long document covers the same scope as a short document but uses more words. [Robertson et al. 94]

SLIDE 10

Scope hypothesis

(NTCIR-3 CLIR-J-J)

[Figure: P(Bin|Rela), P(Bin|Relb), P(Bin|BM25TF_Ret), and P(Bin|Dir_Ret) plotted against the median document length in each bin, with linear trend lines.]
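Curves like these can be produced by a binning analysis in the style of [Singhal et al. 96]; a minimal sketch, with the bin count and inputs as assumptions:

```python
def p_bin(doc_lens, target_ids, all_ids, n_bins=10):
    """Sort the collection by document length, cut it into
    equal-size bins, and return P(Bin | target): the fraction of
    the target set (relevant or retrieved docs) in each bin."""
    ranked = sorted(all_ids, key=lambda d: doc_lens[d])
    bin_size = len(ranked) // n_bins
    target = set(target_ids)
    return [sum(1 for d in ranked[i * bin_size:(i + 1) * bin_size]
                if d in target) / len(target)
            for i in range(n_bins)]
```

Comparing the relevant-document curve with each method's retrieved-document curve shows whether that method over- or under-favours long documents.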

SLIDE 11

Verbosity hypothesis

(NTCIR-3 Patent)

[Figure: P(Bin|Rela), P(Bin|Relb), P(Bin|BM25TF_Ret), and P(Bin|Dir_Ret) plotted against the median document length in each bin, with linear trend lines.]

SLIDE 12

Verbosity hypothesis

(NTCIR-3 Patent)

[Figure: P(Cbin|Rela) and P(Cbin|Relb) plotted against the median number of claims in each bin, with a logarithmic trend line for P(Cbin|Relb).]

SLIDE 13

Average document length is increasing year by year

[Figure: average document length (AvgDocLen) by year, 1993-2010.]

Extrapolation suggests that it may be as long as 4,500 words/doc in the year 2010, twice as long as in 1993!

SLIDE 14

The average number of unique terms per document is increasing as well

[Figure: average number of unique terms per document (UniqTerm) by year, 1993-2010.]

Extrapolation suggests that it may be as many as 560 unique terms/doc in the year 2010, 140% of the 1993 figure!
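Both projections amount to a least-squares line fitted to the yearly averages; a sketch with made-up per-year values, chosen only to mimic the reported trend:

```python
import numpy as np

years = np.array([1993, 1994, 1995, 1996, 1997, 1998, 1999])
avg_len = np.array([2250, 2380, 2510, 2640, 2770, 2900, 3030])  # hypothetical

slope, intercept = np.polyfit(years, avg_len, 1)  # least-squares line
print(f"projected 2010 average: {slope * 2010 + intercept:.0f} words/doc")
```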

SLIDE 15

Are long patent documents simply verbose?

  • Presumably verbose in view of subject topic coverage / topical relevance?
  • How about in view of "invalidation"?
  • Why are patent documents getting longer every year?
  • Longer patent documents are stronger because of their document characteristics:
    – They can broaden the extension of the rights covered by the claims.
    – They need to cover and describe the growing complexity of technological domains.

SLIDE 16

Average document length of relevant and non-relevant documents

                              NTCIR-3 CLIR   NTCIR-3 Patent   NTCIR-4 Patent
A docs (relevant)             315 (167%)     3164 (109%)      3137 (127%)
AB docs (partially relevant)  290 (153%)     3075 (106%)      2946 (119%)
ABCD docs (pooled)            232 (123%)     3123 (107%)      3321 (134%)
All docs (in the collection)  189 (100%)     2906 (100%)      2478 (100%)

Document length clearly affects relevance in NTCIR-3 CLIR, barely affects it in NTCIR-3 Patent, and fairly affects it in NTCIR-4 Patent.

SLIDE 17

Verbose but strong?

(NTCIR-4 Patent)

[Figure: P(Bin|Relb), P(Bin|Pool), P(Bin|PLLS6), and P(Bin|BM25TF_BEST) plotted against the median document length in each bin, with linear trend lines.]

SLIDE 18

CLIR experiments

  • Title-only or Description-only runs: simple TF*IDF with PFB
  • Title-and-Description runs: fusion of the Title run and the Description run
  • Post-submission: KL-divergence runs (Dirichlet smoothing, KL-Dir) with/without document length priors

w(d,t) = \frac{(k_1 + 1)\, freq(d,t)}{k_1 \bigl( (1 - b) + b \, \frac{dl}{avdl} \bigr) + freq(d,t)} \cdot \log \frac{N}{df(t)}

where
  freq(d,t): number of occurrences of term t in document d
  df(t): number of documents in which t appears
  N: total number of documents in the collection
  dl, avdl: document length and average document length
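Transcribed into Python, the weighting reads as follows (a sketch of the standard BM25-TF formulation, not the exact Lemur code; the parameter defaults are illustrative):

```python
import math

def bm25_tfidf_weight(freq_dt, df_t, N, dl, avdl, k1=1.0, b=0.5):
    """w(d,t): BM25 term-frequency component times an idf factor.

    freq_dt: occurrences of term t in document d
    df_t:    number of documents containing t
    N:       total number of documents in the collection
    dl/avdl: document length / average document length
    b:       strength of document length penalization (0 = none)
    """
    tf = ((k1 + 1) * freq_dt) / (k1 * ((1 - b) + b * dl / avdl) + freq_dt)
    return tf * math.log(N / df_t)
```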

SLIDE 19

CLIR runs for J-J SLIR

Run                      AP-Rigid  RP-Rigid  AP-Relax  RP-Relax
PLLS-J-J-TD-01           0.3915    0.4100    0.4870    0.4975
PLLS-J-J-TD-02           0.3913    0.4098    0.4878    0.4986
PLLS-J-J-T-03            0.3801    0.3922    0.4711    0.4783
PLLS-J-J-D-04            0.3804    0.3978    0.4838    0.4931
JMSmooth λ=0.45 TITLE    0.2696    0.3025    0.3756    0.4077
JMSmooth λ=0.55 DESC     0.2683    0.3110    0.3703    0.4146
DirSmooth µ=1000 TITLE   0.3145    0.3445    0.3990    0.4313
DirSmooth µ=2000 DESC    0.3006    0.3311    0.3907    0.4226

The KL-JM and KL-Dir runs perform poorly.

SLIDE 20

CLIR J-J with doc length priors

  • PLLS-J-J-T-03 (TF*IDF): 0.3801
  • Dirichlet: 0.3145
  • Dirichlet with a document length prior: 0.2908
  • Simple penalization or promotion by document length does not help.
  • More work is needed on document length normalization in language-modeling IR.

SLIDE 21

Patent main task experiments

  • Invalidation search by claim-to-document matching (the claim to be invalidated serves as the query)
  • Indexing range: full-text vs. selected-fields indexing
  • KL-Dir vs. TF*IDF
  • Distributed vs. centralized retrieval strategy

SLIDE 22

Indexing range: full-text vs. selected-fields indexing

  • Full-text indexing is much better (statistically significant at p=0.05) than selected-fields (Abstract+Claims) indexing.
    – KL-Dir, selected fields (PLLS3): 0.1548
    – KL-Dir, full text (PLLS6): 0.2408
SLIDE 23

KL-Dir vs TF*IDF

  • TF*IDF, selected fields (PLLS1): 0.1734
  • KL-Dir, selected fields (PLLS3): 0.1548
  • But with the additional topic set:
    – TF*IDF, selected fields (PLLS1): 0.0499
    – KL-Dir, selected fields (PLLS3): 0.0557
  • No big difference (not statistically significant)!
SLIDE 24

Distributed retrieval vs centralized retrieval

                   KL-Dir   TF*IDF
Distributed base   0.2408   0.1703
Distributed BEST   0.2488   0.2516
Centralized base   0.2274   0.1712
Centralized BEST   0.2508   0.2625

No statistically significant difference between KL-Dir and TF*IDF

Centralized search is not necessarily a must!

SLIDE 25

Patent with doc length penalization

  • TF*IDF BEST (centralized): 0.2625
  • Best when b = 0.9-1.0 (see the sketch after this list)
    – Document length penalization helps!
    – NTCIR-4 CLIR J-J: 0.35-0.5
    – Usually 0.2-0.3 when document length is controlled
    – Theoretically 0.0 when document length is uniform
  • Best when k1 is about 0.9
    – NTCIR-4 CLIR J-J: 1-1.2
  • Better when query TF is constant
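To see why b in the 0.9-1.0 range penalizes long documents, the following sketch applies the slide-18 weighting to the same term statistics in a short and a long document (all numbers are made up):

```python
import math

def w(freq, df, N, dl, avdl, k1=0.9, b=0.5):
    tf = ((k1 + 1) * freq) / (k1 * ((1 - b) + b * dl / avdl) + freq)
    return tf * math.log(N / df)

# Documents at half and double the average length, same term counts.
for b in (0.0, 0.5, 1.0):
    short = w(5, 100, 100_000, dl=1250, avdl=2500, b=b)
    long_ = w(5, 100, 100_000, dl=5000, avdl=2500, b=b)
    print(f"b={b}: short={short:.2f}  long={long_:.2f}")
# At b=0 both documents score alike; at b=1.0 the long document is
# clearly penalized, matching the best setting found for patents.
```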
SLIDE 26

Conclusions

  • Guided by the different document length hypotheses underlying the retrieval tasks, different retrieval methods were examined with various parameters.
  • In newspaper search, BM25 TF, which tends to retrieve longer documents, outperforms the KL-Dir method, while there is no big difference in patent retrieval.
  • Simple penalization or promotion by document length, i.e. cosine normalization or document length priors, does not help.