

slide-1
SLIDE 1

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

Masaki Eto
Gakushuin Women’s College, Tokyo, Japan
masaki.eto@gakushuin.ac.jp
BIRNDL 2016

slide-2
SLIDE 2

Outline of this presentation

  • 1. Background
      Co-citation and network model
      Outline of co-citation network searching
  • 2. Research question
      Satellite documents
  • 3. Proposed Retrieval Method
      Specifying satellite documents
      Incorporating satellite documents
      Ranking documents in the network
  • 4. Experiment
      Evaluating the proposed method

slide-3
SLIDE 3

Co-citation Network

Network model

  • Co-citation = a linkage between a pair of documents concurrently cited by a third document
  • Node = cited document
  • Edge = co-citation linkage
  • Weight = number of co-citing documents

[Figure: co-citation network with nodes a to f connected by weighted edges]
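The network model above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function name `build_cocitation_network` and the citing documents `p1` to `p3` are hypothetical, and each citing document is assumed to be available as a plain list of the documents it cites.

```python
from collections import Counter
from itertools import combinations

def build_cocitation_network(reference_lists):
    """Map each co-cited pair (a frozenset of two nodes) to its edge weight."""
    weights = Counter()
    for refs in reference_lists.values():
        # Every unordered pair cited by the same document is one co-citation.
        # set() ensures a document cited twice is counted once per citing document.
        for x, y in combinations(sorted(set(refs)), 2):
            weights[frozenset((x, y))] += 1
    return weights

# Hypothetical citing documents and the documents they cite.
reference_lists = {
    "p1": ["a", "b", "c"],
    "p2": ["a", "b"],
    "p3": ["b", "c", "c"],
}
network = build_cocitation_network(reference_lists)
```

Here the edge between `a` and `b` gets weight 2 because two citing documents (`p1` and `p2`) cite both.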

slide-4
SLIDE 4

Outline of Co-citation Network Searching

  • 1. User inputs a seed document
  • 2. System creates a network and ranks the documents in the network
  • 3. System outputs ranked documents

[Figure: search system holding a co-citation network of the seed and nodes a to f with weighted edges]

slide-5
SLIDE 5

Outline of this presentation

  • 1. Background
      Co-citation and network model
      Similar document search
  • 2. Research question
      Satellite documents
  • 3. Proposed Retrieval Method
      Specifying satellite documents
      Incorporating satellite documents
      Ranking documents in the network
  • 4. Experiment
      Evaluating the proposed method

slide-6
SLIDE 6

Enlarging the Co-citation Networks so as to Include New Relevant Documents

Research question: Do satellite documents have relevant linkages to the seed that are not identified by co-citation linkages?

  • Doc. B is connected to the seed by a co-citation linkage
  • Satellite documents of B (e.g. Doc. X) are specified via a full-text search using the title words of B (word-based linkage)
  • The satellite documents are incorporated into the network
slide-7
SLIDE 7

Outline of this presentation

  • 1. Background
      Co-citation and network model
      Similar document search
  • 2. Research question
      Satellite documents
  • 3. Proposed Retrieval Method
      Specifying satellite documents
      Incorporating satellite documents
      Ranking documents in the network
  • 4. Experiment
      Evaluating the proposed method

slide-8
SLIDE 8

Specifying Satellite Documents

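  • Host documents are sources for specifying satellite documents
  • Each host document is one hop from the seed
  • The title words of a host document (e.g. b) are used as the query of a full-text search
  • Ranking: tf-idf (Indri search engine by the Lemur Project)
  • The top-ranked N documents (e.g. N = 10) are the satellite documents of b

[Figure: seed with one-hop host documents a to f; the title words of host b are sent to a full-text search that returns the satellite documents of b]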
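The title-word search can be sketched as follows. The slides use the Indri search engine; the code below does not call Indri and instead uses a small hand-written tf-idf scorer (raw term frequency times log(N/df)) as a stand-in, so the exact scores are an assumption. The function name `specify_satellites` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def specify_satellites(host_title, corpus, n=10):
    """Rank corpus documents against the host's title words; return the top-n ids."""
    docs = {doc_id: Counter(tokenize(text)) for doc_id, text in corpus.items()}
    num_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())
    query = set(tokenize(host_title))
    scores = {}
    for doc_id, counts in docs.items():
        score = sum(
            counts[w] * math.log(num_docs / df[w])
            for w in query
            if counts[w] > 0
        )
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:n]

# Hypothetical full texts keyed by document id.
corpus = {
    "x1": "random walk with restart on citation graphs",
    "x2": "protein folding dynamics",
    "x3": "citation graphs and citation counts for paper search",
}
satellites = specify_satellites("Citation graphs", corpus, n=10)
```

In practice the host document itself would probably need to be excluded from the corpus before the search; the slides do not say how this is handled.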
slide-9
SLIDE 9

Problem of Satellite Documents

  • Not all co-citation linkages are relevant
  • Relevant host yields a lot of relevant satellite documents
  • Irrelevant host yields a lot of irrelevant satellite documents
  • Solution: checking the appropriateness of host documents

[Figure: seed with host documents a to f, some relevant and some irrelevant]

slide-10
SLIDE 10

Checking the Appropriateness of Host Documents (optional process)

“Co-citation in the same paragraph has strong relationship” (Eto 2013, Gipp & Beel 2009)

  • Co-citation contexts are analyzed by parsing the citing document:
      ---[A]-[B]---      A (Seed) and B are cited in the same paragraph
      ----------[C]-     A and C are cited in different paragraphs
  • Doc. B is selected as host
  • Doc. C is not selected as host
  • Full-text searches then specify satellite documents (e.g. Doc. X) for the selected host
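The paragraph-level check might look like the sketch below. It assumes the citing document has already been parsed into one list of cited documents per paragraph; the parsing step, the function name `select_hosts`, and the data layout are all hypothetical.

```python
def select_hosts(citing_paragraphs, seed, candidates):
    """Return the candidates cited in at least one paragraph that also cites the seed."""
    hosts = set()
    for cited_in_paragraph in citing_paragraphs:
        cited = set(cited_in_paragraph)
        if seed in cited:
            # Anything sharing this paragraph with the seed passes the check.
            hosts |= cited & set(candidates)
    return hosts - {seed}

# One citing document: A and B share a paragraph, C is cited elsewhere.
paragraphs = [["A", "B"], ["C"]]
hosts = select_hosts(paragraphs, seed="A", candidates={"B", "C"})
```

With several citing documents, the function can be called once per document and the results combined with a set union.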

slide-11
SLIDE 11

Outline of this presentation

  • 1. Background
      Co-citation and network model
      Similar document search
  • 2. Research question
      Satellite documents
  • 3. Proposed Retrieval Method
      Specifying satellite documents
      Incorporating satellite documents
      Ranking documents in the network
  • 4. Experiment
      Evaluating the proposed method

slide-12
SLIDE 12

Incorporating Satellite Documents

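Each satellite document of a host (e.g. T1, T2, T3, e, f for host b) is either “New” or already “Existing” in the initial co-citation network.

  • Existing (e.g. e, f): added weight or new edge
  • New (e.g. T1, T2, T3): new node and new edge

[Figure: initial co-citation network around the seed (nodes a to f) before and after incorporating the satellite documents of b; incorporated linkages are labeled weight = 1]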
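One way to read this step is sketched below. The slide text does not state which two nodes a satellite linkage connects, so the sketch assumes an edge between the host and each satellite with an added weight of 1; both are assumptions, as are the function name `incorporate_satellites` and the adjacency-dictionary layout.

```python
def incorporate_satellites(network, host, satellites, added_weight=1):
    """Return a copy of the network with host-satellite linkages added."""
    enlarged = {node: dict(neighbors) for node, neighbors in network.items()}
    enlarged.setdefault(host, {})
    for sat in satellites:
        if sat == host:
            continue
        # "New" satellites become new nodes; "Existing" ones are reused.
        enlarged.setdefault(sat, {})
        # Add weight to an existing edge, or create the edge if it is missing.
        enlarged[host][sat] = enlarged[host].get(sat, 0) + added_weight
        enlarged[sat][host] = enlarged[sat].get(host, 0) + added_weight
    return enlarged

# Hypothetical initial network as {node: {neighbor: weight}}.
network = {
    "seed": {"b": 3, "f": 1},
    "b": {"seed": 3, "e": 3},
    "e": {"b": 3},
    "f": {"seed": 1},
}
enlarged = incorporate_satellites(network, host="b", satellites=["T1", "e", "f"])
```

In this toy run `T1` is a new node with a new edge, `e` gains weight on its existing edge to `b`, and `f` gets a new edge to `b`. The original `network` is left unchanged so the baseline and the enlarged network can be ranked side by side.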

slide-13
SLIDE 13

Outline of this presentation

  • 1. Background
      Co-citation and network model
      Similar document search
  • 2. Research question
      Satellite documents
  • 3. Proposed Retrieval Method
      Specifying satellite documents
      Incorporating satellite documents
      Ranking documents in the network
  • 4. Experiment
      Evaluating the proposed method

slide-14
SLIDE 14

Ranking Documents in the Network by the RWR (Random Walk with Restart) Algorithm (Tong, 2008)

Simple random walk

  • The walker starts at the seed and proceeds to the connected documents based on transition probabilities calculated by weights of edges
  • Example: two edges with weights 12 and 3 (15 = 12 + 3) give transition probabilities 0.8 (= 12/15) and 0.2 (= 3/15)

[Figure: network of the seed and nodes a to g with weighted edges; the walker starts at the seed]
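The 0.8 and 0.2 in the example are each edge weight divided by the total weight leaving the current node. A minimal sketch, where the neighbor names `x` and `y` are placeholders because the figure's node labels are not recoverable from the transcript:

```python
def transition_probabilities(edge_weights):
    """Normalize the weights of the edges leaving one node into probabilities."""
    total = sum(edge_weights.values())
    return {node: weight / total for node, weight in edge_weights.items()}

# Edge weights 12 and 3, as in the slide's example (15 = 12 + 3).
probs = transition_probabilities({"x": 12, "y": 3})
```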

slide-15
SLIDE 15

RWR: What is ‘Restart’?

  • The walker returns to the seed document with the probability r at every step
  • r ≈ parameter of the penalty for distance from the seed (if r is high, documents near the seed have high document scores)
  • Example with r = 0.1: Proceed (0.72 or 0.18) OR Return (0.1)

[Figure: network of the seed and nodes a to g; the walker either proceeds along an edge or returns to the seed]
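The 0.72 and 0.18 on this slide are consistent with scaling the simple random walk probabilities (0.8 and 0.2) by 1 − r. As a worked check, assuming those two transition probabilities and r = 0.1:

```latex
\begin{align*}
P(\text{return to seed}) &= r = 0.1 \\
P(\text{proceed along the weight-12 edge}) &= (1 - r) \times \tfrac{12}{15} = 0.9 \times 0.8 = 0.72 \\
P(\text{proceed along the weight-3 edge}) &= (1 - r) \times \tfrac{3}{15} = 0.9 \times 0.2 = 0.18
\end{align*}
```

The three probabilities sum to 1, as they must for a single step of the walker.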

slide-16
SLIDE 16

RWR: How are document scores calculated?

  • The position of the walker at Step (t) can be estimated by the transition probabilities
  • When t is low, the position probability is unstable. As t increases, the position probability may converge

[Figure: position probabilities of the walker spreading from the seed (start) over nodes a to g during the first steps, with a return probability of 0.1 at each node]

slide-17
SLIDE 17

RWR: How are documents ranked?

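  • Step (∞): the position probabilities have converged
  • Converged position probability = Document score
  • Documents are ranked (1st to 6th) by document score

[Figure: network nodes labeled with converged position probabilities (38.78%, 27.37%, 10.08%, 5.38%, 12.11%, 3.62%, 2.66%)]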
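The score calculation and ranking can be sketched as a simple iteration: start with all probability on the seed, repeatedly apply one "proceed or return" step, and stop when the position probabilities no longer change. This is an illustrative sketch rather than the implementation used in the study. The toy network, the function names, the stopping tolerance, the handling of nodes without edges, and the exclusion of the seed from the ranking are all assumptions.

```python
def random_walk_with_restart(network, seed, r=0.1, tol=1e-12, max_steps=10_000):
    """Return the (approximately) converged position probability of every node."""
    position = {node: 0.0 for node in network}
    position[seed] = 1.0  # the walker starts at the seed
    for _ in range(max_steps):
        updated = {node: 0.0 for node in network}
        updated[seed] += r  # restart: return to the seed with probability r
        for node, prob in position.items():
            total = sum(network[node].values())
            if total == 0:
                # Assumption: a walker with nowhere to go returns to the seed.
                updated[seed] += (1 - r) * prob
                continue
            for neighbor, weight in network[node].items():
                updated[neighbor] += (1 - r) * prob * weight / total
        change = max(abs(updated[n] - position[n]) for n in network)
        position = updated
        if change < tol:
            break
    return position

def rank_documents(scores, seed):
    """Sort non-seed documents by descending score (ties broken by name)."""
    return sorted((n for n in scores if n != seed), key=lambda n: (-scores[n], n))

# Hypothetical network as {node: {neighbor: weight}}; not the one in the figure.
network = {
    "seed": {"a": 12, "b": 3},
    "a": {"seed": 12, "c": 1},
    "b": {"seed": 3},
    "c": {"a": 1},
}
scores = random_walk_with_restart(network, seed="seed", r=0.1)
ranking = rank_documents(scores, seed="seed")
```

Raising `r` concentrates more probability on the seed and its direct neighbors, which matches the slide's description of r as a penalty for distance from the seed.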

slide-18
SLIDE 18

Outline of this presentation

  • 1. Background
      Co-citation and network model
      Similar document search
  • 2. Research question
      Satellite documents
  • 3. Proposed Retrieval Method
      Specifying satellite documents
      Incorporating satellite documents
      Ranking documents in the network
  • 4. Experiment
      Evaluating the proposed method

slide-19
SLIDE 19

Information Retrieval Experiment

Retrieval Methods

  • Baseline (initial co-citation network): network created by taking up to two hops from the seed
  • Proposed Method (all): all one-hop documents from the seed are host documents
  • Proposed Method (context): host documents are selected by co-citation context

Test Collection

  • 152,000 documents (XML) (PubMed Central dataset)
  • Each document has MeSH descriptors
  • 100 seed documents

Evaluation metric

  • nDCG@K (K = 5, 10, 50, 100)

slide-20
SLIDE 20

Search Run

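  • 1. Input a seed document (collection: 152,000 documents)
  • 2. Create an initial co-citation network (Baseline)
  • 3. Incorporate satellite documents into the network (Proposed methods: All, Context)
  • 4. Ranked results by RWR are compared

[Figure: initial co-citation network of the seed and nodes a to f, and the same network after incorporating satellite documents]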
slide-21
SLIDE 21

Relevance Assessment

  • Relevance scores were estimated based on similarity between the seed document and each of the top K ranked retrieved documents
  • Similarity: Jaccard coefficient based on MeSH descriptors
  • Search performance: nDCG@K (K = 5, 10, 50, 100)

Jaccard coefficient    Relevance score
>= 0.3                 3
>= 0.2                 2
>= 0.1                 1

[Figure: ranked list (1st, 2nd, 3rd, 4th, ...) with example relevance scores such as 3 and 1]
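The assessment pipeline can be sketched as below. The thresholds come from the slide; everything else is an assumption made for illustration: a score of 0 below 0.1, a common nDCG form with the grade as gain and a log2 rank discount, and an ideal ordering computed from the retrieved list itself. The slides do not specify which nDCG variant was used.

```python
import math

def jaccard(a, b):
    """Jaccard coefficient between two collections of MeSH descriptors."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def relevance_score(coefficient):
    """Map a Jaccard coefficient to a graded relevance score."""
    if coefficient >= 0.3:
        return 3
    if coefficient >= 0.2:
        return 2
    if coefficient >= 0.1:
        return 1
    return 0  # assumption: below 0.1 counts as not relevant

def dcg(grades):
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(grades, start=1))

def ndcg_at_k(grades, k):
    """nDCG@K for a ranked list of relevance grades."""
    ideal = dcg(sorted(grades, reverse=True)[:k])
    return dcg(grades[:k]) / ideal if ideal > 0 else 0.0

# Hypothetical MeSH descriptors for a seed and two retrieved documents, in rank order.
seed_mesh = {"Humans", "Neoplasms", "Mutation"}
retrieved_mesh = [
    {"Humans", "Neoplasms"},
    {"Humans", "Mice", "Rats", "Liver"},
]
grades = [relevance_score(jaccard(seed_mesh, m)) for m in retrieved_mesh]
score = ndcg_at_k(grades, k=5)
```

For the toy data the grades are 3 and 1, already in ideal order, so nDCG@5 is 1.0; swapping the two documents would lower it.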

slide-22
SLIDE 22

Results (averaged over 100 seed documents)

               Proposed N = 10      Proposed N = 100
K    Baseline  all      context    all      context
5    .226      .226     .232*      .224     .234**
10   .223      .221     .227**     .226     .230**
50   .188      .191*    .189**     .197**   .191
100  .174      .181**   .177*      .188**   .180**

* P < .05, ** P < .01

  • Proposed methods tended to outperform the baseline
  • The maximum scores at each K are the results of Proposed with N = 100
  • The scores of Proposed (context) are higher than those of the baseline method in all cases
  • The checking process had a stable and positive impact on improving the search performance

slide-23
SLIDE 23

Conclusion

  • This study proposed a technique to enlarge co-citation networks by incorporating satellite documents in scientific paper searches
  • Retrieval methods using the proposed technique tended to outperform the baseline method, which was based on the initial co-citation network

slide-24
SLIDE 24

Acknowledgments

This work was supported by JSPS KAKENHI Grant Number JP26730163


slide-25
SLIDE 25

Q and A

Thank you!
