Effective Topic Distillation with Key Resource Pre-selection



SLIDE 1

Effective Topic Distillation with Key Resource Pre-selection

Yiqun Liu, Min Zhang and Shaoping Ma

State Key Lab of Intelligent Tech. & Sys. Tsinghua University, Beijing, 100084 liuyiqun03@mails.tsinghua.edu.cn (2004/10/19)

SLIDE 2

For AIRS presentation 04/10/19

Outline

  • Why Key Resource Pre-selection?
  • Possibilities of selecting key resources
  • How to select key resources?
  • Experiments
  • Conclusion
SLIDE 3

Why key resource selection? (1)

  • The amount of web pages

Internet, 2002:

– Surface Web: 167 TB

– Deep Web: 91,850 TB

– Number of surface web pages: 20 billion

– Number of deep web pages: 130 billion

According to "How Much Information", 2003. http://www.sims.berkeley.edu/how-much-info-2003.

SLIDE 4

Why key resource selection? (2)

  • Index sizes of web search engines

[Chart: billions of textual documents indexed, by engine. GG = Google, ATW = AllTheWeb, INK = Inktomi, TMA = Teoma, AV = AltaVista. Annotation: "Less than 1/6"]

According to a report by the Search Engine Watch website, September 2, 2003.

SLIDE 5

Why key resource selection? (3)

  • Not all pages can be indexed by web IR tools
  • Many indexed pages aren’t key resources
  • TD is difficult
  • Key resource selection

SLIDE 6

Definitions of TD and key resource

  • Key Resource (Key Resource Page)

– High-quality web pages for a particular topic

  • Offering credible information/service for this topic
  • Introducing other useful web pages for this topic

– Key resources are only a small part of relevant pages

  • Topic Distillation (TD)

– To find key resources for certain topics

– A major task for web search (it covers over 70% of web search queries)

SLIDE 7

Outline

  • Selecting key resources is useful for TD
  • Possibilities of selecting key resources

– Is there any difference between ordinary pages and key resource pages?

  • How to select key resources?
  • Experiments
  • Conclusion

SLIDE 8

Non-content features of key resources

  • Key resources vs. ordinary pages (non-content features)

– Commonly used features

  • In-degree, URL-type, Doc-length

– Features involving a site’s self-link analysis

  • In-site out-link number, in-site out-link anchor text rate

  • Two data sets to compare the differences

– Key resource page training set

  • Built from the TREC 11 TD task relevance judgments (qrels)

– Ordinary page set: .GOV (over 1.2M web pages from the .GOV domain)

SLIDE 9

In-degree

  • Key resource pages have more in-links
SLIDE 10

URL-type

  • Key resource pages tend to be non-FILE type
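
The slides do not define the URL types. As a rough sketch, assuming the four-way ROOT / SUBROOT / PATH / FILE scheme often used in TREC web track work, a URL could be typed as below. The function name `url_type`, the `INDEX_NAMES` set, and the "last segment contains a dot" test for file names are illustrative assumptions, not taken from the talk.

```python
from urllib.parse import urlparse

# File names treated as a directory's default page (an assumption).
INDEX_NAMES = {"index.html", "index.htm"}


def url_type(url):
    """Classify a URL as ROOT, SUBROOT, PATH, or FILE (assumed scheme)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # A trailing index page is not counted as a file name.
    if segments and segments[-1].lower() in INDEX_NAMES:
        segments = segments[:-1]
    # Heuristic: a last segment containing a dot is treated as a file name.
    if segments and "." in segments[-1]:
        return "FILE"
    if not segments:
        return "ROOT"
    if len(segments) == 1:
        return "SUBROOT"
    return "PATH"
```

Under this scheme, "non-FILE" covers the ROOT, SUBROOT, and PATH cases, for example `url_type("http://www.example.gov/about/")`.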
SLIDE 11

Doc-length

[Chart: document length distribution, Training Set vs. .GOV Corpus; x-axis bins from <200 to >30000, y-axis from 0.00% to 18.00%]

  • Key resources don’t have too few words
SLIDE 12

In-site out-link analysis

  • Definition

[Diagram: Site A containing pages P1 and P2, with links labeled 1, 2, 3]

  • Features

– In-site out-link number

– In-site out-link anchor text rate

$$\text{rate} = \frac{\text{WordCount}(\text{in-site out-link anchor text})}{\text{WordCount}(\text{web page full text})}$$
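
A minimal sketch of how the two features on this slide might be computed for one page: the number of in-site out-links, and the ratio of words inside their anchors to words in the full page text. It assumes that "in-site" means the link target has the same host name as the page and that words are whitespace-separated tokens. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class InSiteLinkParser(HTMLParser):
    """Counts page words, in-site out-links, and their anchor words."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        # Assumption: a "site" is identified by the host name of the page URL.
        self.site = urlparse(page_url).netloc.lower()
        self.link_count = 0      # in-site out-link number
        self.anchor_words = 0    # words inside in-site out-link anchors
        self.total_words = 0     # words in the full page text
        self._in_site_anchor = False
        self._skip = False       # True inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True
        elif tag == "a":
            href = dict(attrs).get("href")
            in_site = False
            if href:
                # Resolve relative links against the page URL.
                target = urlparse(urljoin(self.page_url, href))
                in_site = target.netloc.lower() == self.site
            if in_site:
                self.link_count += 1
            self._in_site_anchor = in_site

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False
        elif tag == "a":
            self._in_site_anchor = False

    def handle_data(self, data):
        if self._skip:
            return
        words = len(data.split())
        self.total_words += words
        if self._in_site_anchor:
            self.anchor_words += words


def in_site_features(page_url, html):
    """Return (in-site out-link number, in-site out-link anchor text rate)."""
    parser = InSiteLinkParser(page_url)
    parser.feed(html)
    parser.close()
    if parser.total_words == 0:
        return parser.link_count, 0.0
    return parser.link_count, parser.anchor_words / parser.total_words
```

Note that this sketch also counts links back to the same page (for example `#top`) as in-site out-links; whether the original work did so is not stated on the slides.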

SLIDE 13

In-site out-link analysis

  • Key resource pages have more in-site out-links and longer in-site out-link anchor texts

[Charts: in-site out-link anchor text rate; in-site out-link anchor number]

SLIDE 14

Outline

  • Selecting key resources is useful for TD
  • Possibilities of selecting key resources
  • How to select key resources?

– Construction of a key resource decision tree

  • Experiments
  • Conclusion

SLIDE 15

Construction of a key resource decision tree

  • Why a decision tree?

– The most effective and efficient classifier when the number of features is small

  • 5 non-content features

– It provides a metric for estimating the quality of these features, in the form of

  • Information gain (ID3)
  • Information gain ratio (C4.5)
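
For reference, the two feature-quality metrics named above can be sketched for a discrete feature as follows. This is the textbook definition of information gain (ID3) and gain ratio (C4.5), not the authors' code. Continuous features such as in-degree would first need to be discretized or split on a threshold, which is omitted here.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())


def information_gain(values, labels):
    """Entropy of the labels minus their entropy after splitting on values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain normalized by the entropy of the split itself."""
    split_info = entropy(values)
    if split_info == 0:
        return 0.0  # the feature takes a single value, so it cannot split
    return information_gain(values, labels) / split_info
```

A feature that separates key resource pages from ordinary pages perfectly gets the maximum gain, while a feature unrelated to the labels gets a gain of zero.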
SLIDE 16

Construction of a key resource decision tree

[Figure: key resource decision tree. Annotation: 68.53% of .GOV]

SLIDE 17

Outline

  • Selecting key resources is useful for TD
  • Possibilities of selecting key resources
  • How to select key resources?
  • Experiments

– Is this key resource selection process effective?

– Does TD perform better on the key resource result set?

  • Conclusion
SLIDE 18

Is this key resource selection process effective?

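  • The key resource selection algorithm is effective

[Chart annotations: 70%, 20%. The conclusion slide reports more than 70% key resource coverage using less than 20% of the .GOV pages]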

SLIDE 19

Does TD perform better on the key resource result set?
  • Test set:

– From the TREC 2003 TD task

– 50 topics and the corresponding relevance judgments (qrels)

  • Evaluation metrics:

– Precision at 10 documents (P@10)

– R-precision (precision at R documents, where R is the number of relevant documents for the topic)

  • Weighting

– BM2500 ranking, default parameters
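
The two evaluation metrics can be sketched as below for a single topic, given a ranked list of document IDs and the set of relevant IDs from the qrels. The function names and the document IDs in the usage note are illustrative only.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0  # no relevant documents are known for this topic
    hits = sum(1 for doc in ranked[:r] if doc in relevant)
    return hits / r
```

For example, with 11 ranked documents `d1` to `d11` and relevant set `{"d1", "d3", "d11"}`, P@10 counts `d1` and `d3` in the top 10, and R-precision looks only at the top 3. Per-topic values would then be averaged over the 50 topics.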

SLIDE 20

Does TD perform better on the key resource result set?
  • Text retrieval on different data sets

[Chart legend: G = .GOV corpus, K = key resource set, F = full text, A = anchor text, T = TREC 2003 best run]

[Chart annotations: 76%, 83%; 24.89% .GOV data]

SLIDE 21

Conclusion

  • Key resource pre-selection is needed for TD

– Finding high quality pages independent of a given user request

  • A new type of non-content feature

– In-site out-link analysis

  • An algorithm that uses a decision tree to find key resources
  • Key resource page set:

– Uses less than 20% of the .GOV pages

– Covers more than 70% of the key resource information

– Performs better than the whole page set (a 76% improvement in P@10)

SLIDE 22

Feel free to contact me: liuyiqun03@mails.tsinghua.edu.cn

Thank you! Questions and comments?