Learning Links in MeSH Co-occurrence Network: Preliminary Results



SLIDE 1

Learning Links in MeSH Co-occurrence Network: Preliminary Results

Andrej Kastrin (1), Thomas C. Rindflesch (2) and Dimitar Hristovski (3)
andrej.kastrin@gmail.com, dimitar.hristovski@gmail.com

(1) Faculty of Information Studies, Novo mesto, Slovenia
(2) Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD, USA
(3) Institute of Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia

MIE 2014, Istanbul, Turkey

slide-2
SLIDE 2

Literature-Based Discovery

  • Find implicit relations between entities.
  • Propose implicit relations as potential scientific hypotheses.
  • Swanson’s XYZ model:
      • Relations XY and YZ are known.
      • The implicit relation XZ is a (putative) new discovery.

[Figure: diagram of the concepts X, Y, and Z]

SLIDE 3

Swanson’s Example

  • Blood viscosity was found to co-occur with Raynaud’s disease.
  • Fish oil reduces blood viscosity.
  • Fish oil was proposed as a new treatment for Raynaud’s disease.

[Figure: X = Fish oil, Y = High blood viscosity, Z = Raynaud’s disease]
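
Swanson’s XYZ model can be sketched as a search over known relations. The following is a minimal illustration, assuming relations are stored as unordered concept pairs; the `known` set and the function name `implicit_links` are hypothetical, and only the fish oil example itself comes from the slides.

```python
from itertools import combinations

def implicit_links(known):
    """Return {(x, z): {y, ...}} for pairs that share a y but have no direct link."""
    # Build an undirected adjacency map from the known relations.
    neighbors = {}
    for a, b in known:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    candidates = {}
    for y, linked in neighbors.items():
        # Any two concepts linked to the same y form a candidate X-Z pair.
        for x, z in combinations(sorted(linked), 2):
            if z not in neighbors[x]:
                candidates.setdefault((x, z), set()).add(y)
    return candidates

# Known relations XY and YZ from Swanson's example.
known = {
    ("Fish oil", "High blood viscosity"),
    ("High blood viscosity", "Raynaud's disease"),
}
result = implicit_links(known)
print(result)
```

The returned mapping keeps the intermediate concepts, so each proposed XZ hypothesis can be traced back to the Y terms that support it.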

SLIDE 4

Literature-Based Discovery as Link Prediction Problem

  • We can model the biomedical literature as a network of biomedical concepts.
  • Link prediction refers to the prediction of future links between concepts that are not directly connected in the current snapshot of a network.

[Figure: diagram of the concepts X, Y, and Z]

SLIDE 5

MEDLINE/PubMed

www.ncbi.nlm.nih.gov/pubmed

SLIDE 6

Medical Subject Headings (MeSH)

  • MeSH is the source of nodes for our network.
  • MeSH is a comprehensive controlled vocabulary for indexing in the life sciences.
  • The 2013 version of MeSH contains 26 853 descriptors.
  • Every article in MEDLINE/PubMed is indexed with about 10-15 descriptors.
  • Some descriptors are marked with an asterisk (*), indicating the article’s major topic.

SLIDE 7

MeSH Terms as Used to Describe a Paper

PMID- 20091016
TI  - Chi-square-based scoring function for...
AB  - OBJECTIVES: Text categorization has been used...
MH  - Access to Information
MH  - Algorithms
MH  - Artificial Intelligence
MH  - Bayes Theorem
MH  - *Chi-Square Distribution
MH  - Data Collection
MH  - Data Interpretation, Statistical
MH  - *Data Mining
MH  - Humans
MH  - *MEDLINE
MH  - Medical Informatics
MH  - *Natural Language Processing
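
The MH lines of a record such as the one above are the raw material for co-occurrence edges. The sketch below shows one way to extract descriptors and enumerate the pairs that co-occur in a single record. It assumes the MEDLINE tagged text format, in which each field starts with a tag padded to four characters followed by `- `. The function names are hypothetical, and subheading qualifiers after `/` are simply dropped, which is a simplification rather than the authors' stated procedure.

```python
from itertools import combinations

# Abbreviated record in MEDLINE tagged format (a subset of the fields above).
RECORD = """\
PMID- 20091016
MH  - Algorithms
MH  - *Data Mining
MH  - Humans
MH  - *MEDLINE
"""

def mesh_terms(record):
    """Extract (descriptor, is_major) tuples from the MH lines of one record."""
    terms = []
    for line in record.splitlines():
        if line.startswith("MH  - "):
            heading = line[6:].strip()
            # Keep only the descriptor; qualifiers after "/" are ignored here.
            descriptor = heading.split("/")[0]
            # Only a star on the descriptor itself is treated as "major topic".
            is_major = descriptor.startswith("*")
            terms.append((descriptor.lstrip("*"), is_major))
    return terms

def cooccurrence_pairs(record):
    """All unordered descriptor pairs that co-occur in one record."""
    names = sorted({name for name, _ in mesh_terms(record)})
    return list(combinations(names, 2))

terms = mesh_terms(RECORD)
pairs = cooccurrence_pairs(RECORD)
print(terms)
print(len(pairs), "pairs")
```

A record with k descriptors yields k(k-1)/2 pairs, which is why a corpus indexed with 10-15 descriptors per article produces millions of edges.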

SLIDE 8

Methods

  • We have a training network G[t1, t2], which contains the interactions among nodes that take place in the time interval [t1, t2].
  • We have a test network G[t3, t4], which contains the interactions among nodes that take place in the time interval [t3, t4].
  • Learning (prediction) task: provide a list of edges that are present in the test network but absent from the training network.

[Figure: training network and test network, each over the nodes A to H]
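
The prediction target can be made concrete with a small sketch. It assumes co-occurrences are available as hypothetical `(year, node, node)` records; the function name `build_network` and the toy data are illustrative only.

```python
def build_network(records, start, end):
    """Return the set of undirected edges observed in the years start..end (inclusive)."""
    edges = set()
    for year, u, v in records:
        if start <= year <= end:
            edges.add(frozenset((u, v)))
    return edges

# Hypothetical (year, node, node) co-occurrence records.
records = [
    (2003, "A", "B"), (2005, "B", "C"), (2007, "C", "D"),
    (2008, "A", "B"), (2009, "A", "C"), (2012, "B", "D"),
]

train = build_network(records, 2003, 2007)  # G[t1, t2]
test = build_network(records, 2008, 2012)   # G[t3, t4]

# Prediction target: edges present in the test network but absent from training.
new_links = test - train
print(sorted(tuple(sorted(edge)) for edge in new_links))
```

Storing each edge as a `frozenset` makes (u, v) and (v, u) the same edge, so the set difference directly yields the new links.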

SLIDE 9

Data Collection

  • We constructed two networks:
      • Training network [2003-2007]
      • Test network [2008-2012]
  • Networks were post-processed to remove non-informative edges.
  • We applied a χ² test of independence to each co-occurrence pair to obtain a statistic that indicates whether the pair occurs together more often than expected by chance.
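
As a rough illustration of the filtering step, the χ² statistic for one descriptor pair can be computed from a 2×2 contingency table of document counts. The slides do not state a significance threshold or whether a continuity correction was used. The sketch below therefore assumes the uncorrected Pearson statistic and the conventional critical value 3.841 (1 degree of freedom, α = 0.05), and all function and parameter names are hypothetical.

```python
def chi_square(n_ab, n_a, n_b, n_docs):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    o11 = n_ab                        # documents indexed with both a and b
    o12 = n_a - n_ab                  # a without b
    o21 = n_b - n_ab                  # b without a
    o22 = n_docs - n_a - n_b + n_ab   # neither a nor b
    numerator = n_docs * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    return numerator / denominator if denominator else 0.0

# Assumed threshold: chi-square critical value for 1 degree of freedom, alpha = 0.05.
CRITICAL_005 = 3.841

def keep_edge(n_ab, n_a, n_b, n_docs):
    """Keep a pair only if it co-occurs more often than expected and significantly so."""
    more_than_expected = n_ab * n_docs > n_a * n_b
    return more_than_expected and chi_square(n_ab, n_a, n_b, n_docs) > CRITICAL_005

# Toy counts: 1000 documents, each descriptor appears in 100 of them.
strong = chi_square(50, 100, 100, 1000)   # pair co-occurs in 50 documents
chance = chi_square(10, 100, 100, 1000)   # pair co-occurs in 10 documents (as expected)
print(round(strong, 2), chance)
```

With independent descriptors the expected co-occurrence count here is 100 × 100 / 1000 = 10, so the second pair has a statistic of zero and would be removed.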

SLIDE 10

Similarity Measures for Link Prediction

  • For each node pair (u, v) we calculate a similarity score s(u, v).
  • The score s(u, v) indicates the likelihood of link formation between nodes u and v.
  • We used two similarity measures:
      • Jaccard coefficient

        $$s_{uv} = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}$$

        where Γ(u) is the set of neighbors of u.

      • Adamic-Adar coefficient

        $$s_{uv} = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{\log |\Gamma(z)|}$$
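
Both measures are straightforward to compute from neighbor sets. The following is a minimal sketch, assuming the network is held as a dictionary mapping each node to its set of neighbors. The toy graph is hypothetical, and the natural logarithm is an assumption, since the slides do not state the base.

```python
import math

def jaccard(adj, u, v):
    """|Γ(u) ∩ Γ(v)| / |Γ(u) ∪ Γ(v)|, or 0.0 if both neighbor sets are empty."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def adamic_adar(adj, u, v):
    """Sum of 1 / log|Γ(z)| over the common neighbors z of u and v."""
    # A common neighbor of two distinct nodes has degree >= 2, so log|Γ(z)| > 0.
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])

# Hypothetical undirected toy graph (each edge is listed in both directions).
adj = {
    "u": {"a", "b", "c"},
    "v": {"b", "c", "d"},
    "a": {"u"},
    "b": {"u", "v"},
    "c": {"u", "v", "d"},
    "d": {"v", "c"},
}

print(jaccard(adj, "u", "v"))      # 2 common neighbors out of 4 in the union
print(adamic_adar(adj, "u", "v"))  # 1/log(2) + 1/log(3)
```

Adamic-Adar down-weights common neighbors with many links, so a shared high-degree hub contributes less to the score than a shared low-degree concept.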

SLIDE 11

Jaccard Coefficient

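[Figure: example network containing the nodes u and v]

$$s_{uv} = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|} = \frac{4}{9} = 0.44$$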

SLIDE 12

Adamic–Adar Coefficient

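[Figure: example network containing the nodes u and v and the nodes z1, z2, z3, z4]

$$s_{uv} = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{\log |\Gamma(z)|} = \frac{1}{\log 7} + \cdots + \frac{1}{\log 4} = 7.60$$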

SLIDE 13

Performance Assessment

  • A major challenge is the huge number of possible node pairs.
  • We use a bootstrap resampling approach:
      • We draw a random sample of 1000 nodes and create the corresponding training and test networks.
      • We compute a link prediction score s(u, v) for each node pair that is not associated with any interaction before time t3.
      • We assign the class label “positive” to a node pair if the link occurs in the test network and “negative” otherwise.
      • We repeat this procedure 100 times.
  • Using the class labels and similarity scores, we construct an ROC curve.
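
The evaluation loop above can be sketched as follows. This is an illustration under several assumptions, not the authors' implementation: the training network is restricted to the sampled nodes (induced subgraph), the Jaccard coefficient is used as the score, the AUC is computed by pairwise comparison with ties counted as one half, and the AUC values are averaged over repetitions (the slides report an averaged ROC curve instead). All names and the toy data are hypothetical.

```python
import random
from itertools import combinations

def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_repetition(train_adj, test_edges, sample):
    """Score every sampled pair that is unlinked in training; label it by the test network."""
    # Restrict the training network to the sampled nodes (induced subgraph).
    sub = {u: train_adj[u] & sample for u in sample}
    scores, labels = [], []
    for u, v in combinations(sorted(sample), 2):
        if v in sub[u]:
            continue  # already linked in the training period
        scores.append(jaccard(sub, u, v))
        labels.append(frozenset((u, v)) in test_edges)
    return roc_auc(scores, labels)

def mean_auc(train_adj, test_edges, sample_size, repetitions, seed=0):
    """Average AUC over repeated random node samples."""
    rng = random.Random(seed)
    nodes = sorted(train_adj)
    aucs = [
        one_repetition(train_adj, test_edges, set(rng.sample(nodes, sample_size)))
        for _ in range(repetitions)
    ]
    return sum(aucs) / len(aucs)

# Hypothetical toy data: training edges A-B, A-C, B-C, C-D, D-E.
train_adj = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E"}, "E": {"D"},
}
test_edges = {frozenset(("A", "D")), frozenset(("A", "E"))}

# The slides use 1000-node samples and 100 repetitions; the toy graph has only 5 nodes.
auc = mean_auc(train_adj, test_edges, sample_size=5, repetitions=3)
print(round(auc, 3))
```

The pairwise AUC is quadratic in the number of scored pairs, which is acceptable for a sketch. A rank-based formulation would be preferable for samples of 1000 nodes.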

SLIDE 14

Results

Topological Characteristics of the MeSH Networks

Parameter                  Train        Test
Nodes                      24 225       25 570
Edges                      4 897 380    5 615 965
Edges (reduced)            3 328 288    3 810 535
Density                    0.01         0.01
Mean degree                274.78       298.05
Average path length        2.23         2.20
Clustering coefficient     0.27         0.26
Small-worldness index      21.57        20.70
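
The density and mean degree rows can be cross-checked against the node and reduced edge counts. The sketch below assumes the standard formulas for an undirected simple graph, mean degree 2m/n and density 2m/(n(n-1)). That the reported values were computed from the reduced edge counts is an inference from the numbers, not something the slides state.

```python
def mean_degree(n_nodes, n_edges):
    """Average number of neighbors per node in an undirected graph."""
    return 2 * n_edges / n_nodes

def density(n_nodes, n_edges):
    """Fraction of all possible undirected edges that are present."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))

# (nodes, reduced edges) taken from the table above.
networks = {"train": (24225, 3328288), "test": (25570, 3810535)}

for name, (n, m) in networks.items():
    print(name, round(mean_degree(n, m), 2), round(density(n, m), 2))
```

Under these assumptions the computed values round to the figures in the table for both networks.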

SLIDE 15

Similarity Score Distribution

[Figure: density plot of the Jaccard coefficient by class. x-axis: Jaccard coefficient; y-axis: Density]

SLIDE 16

Prediction Performance

[Figure: ROC curves for the two measures. x-axis: False positive rate; y-axis: Average true positive rate]

  • Jaccard: AUC = 0.78
  • Adamic-Adar: AUC = 0.82

AUC (Area under the ROC curve): 0.90 – 1.00 = excellent, 0.80 – 0.90 = good, 0.70 – 0.80 = fair, 0.60 – 0.70 = poor, 0.50 – 0.60 = fail

SLIDE 17

Example

  • Training network: 1991 – 1995
  • Test network: 1996

1|Case-Control Studies|Rats, Inbred Strains|4867
2|Follow-Up Studies|Binding Sites|4512
3|Blotting, Western|Combined Modality Therapy|4271
4|Indicators and Reagents|Age Factors|4138
5|France|Disease Models, Animal|3991
6|Prognosis|Chickens|3955
7|Water|Prognosis|3901
8|Questionnaires|Microscopy, Electron|3895
9|Great Britain|Disease Models, Animal|3833
10|Signal Transduction|Retrospective Studies|3748
...
1135416|Prostatic Neoplasms|I-kappa B Proteins|261

SLIDE 18

Example

SLIDE 19

Future Work

  • Explore the role of node and edge attributes in prediction performance.

  • Extend the study to semantic relations instead of co-occurrences.
  • Assess prediction performance on a large-scale network.
  • Develop network filtering methods.
  • Develop a web application for real-time computing.
