Learning Links in MeSH Co-occurrence Network: Preliminary Results


  1. Learning Links in MeSH Co-occurrence Network: Preliminary Results
     Andrej Kastrin (1), Thomas C. Rindflesch (2) and Dimitar Hristovski (3)
     andrej.kastrin@gmail.com, dimitar.hristovski@gmail.com
     (1) Faculty of Information Studies, Novo mesto, Slovenia
     (2) Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD, USA
     (3) Institute of Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
     MIE 2014, Istanbul, Turkey

  2. Literature-Based Discovery
     • Find implicit relations between entities.
     • Propose implicit relations as potential scientific hypotheses.
     • Swanson’s XYZ model:
       • Relations XY and YZ are known.
       • Implicit relation XZ is a (putative) new discovery.

  3. Swanson’s Example
     • Blood viscosity was found to co-occur with Raynaud’s disease.
     • Fish oil reduces blood viscosity.
     • Fish oil was proposed as a new treatment for Raynaud’s disease.
     (Figure: X = fish oil, Y = high blood viscosity, Z = Raynaud’s disease.)
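The XYZ pattern above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the authors' method: the function name `implicit_links` and the toy input are hypothetical, and relations are treated as undirected pairs.

```python
from collections import defaultdict


def implicit_links(known_pairs):
    """Return candidate (X, Z) pairs that share an intermediate Y
    but are not directly linked. Hypothetical helper for illustration."""
    neighbors = defaultdict(set)
    for a, b in known_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    candidates = defaultdict(set)
    for y, linked in list(neighbors.items()):
        for x in linked:
            for z in linked:
                # x < z keeps each unordered pair once; skip known links.
                if x < z and z not in neighbors[x]:
                    candidates[(x, z)].add(y)
    return dict(candidates)


# Toy input taken from Swanson's example on the slide.
known = [
    ("Fish oil", "Blood viscosity"),
    ("Blood viscosity", "Raynaud's disease"),
]
hypotheses = implicit_links(known)
print(hypotheses)
```

Each returned key is a putative XZ relation, and its value lists the intermediate Y concepts that support it.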

  4. Literature-Based Discovery as Link Prediction Problem
     • We can model biomedical literature as a network of biomedical concepts.
     • Link prediction refers to the prediction of future links between concepts that are not directly connected in the current snapshot of a network.

  5. MEDLINE/PubMed
     www.ncbi.nlm.nih.gov/pubmed

  6. Medical Subject Headings (MeSH)
     • MeSH is the source of nodes for our network.
     • MeSH is a comprehensive controlled vocabulary for indexing in the life sciences.
     • The 2013 version of MeSH contains 26 853 descriptors.
     • Every article in MEDLINE/PubMed is indexed with about 10-15 descriptors.
     • Some descriptors are marked with an asterisk (*), indicating the article’s major topic.

  7. MeSH Terms as Used to Describe a Paper
     PMID- 20091016
     TI  - Chi-square-based scoring function for...
     AB  - OBJECTIVES: Text categorization has been used...
     MH  - Access to Information
     MH  - Algorithms
     MH  - Artificial Intelligence
     MH  - Bayes Theorem
     MH  - *Chi-Square Distribution
     MH  - Data Collection
     MH  - Data Interpretation, Statistical
     MH  - *Data Mining
     MH  - Humans
     MH  - *MEDLINE
     MH  - Medical Informatics
     MH  - *Natural Language Processing
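As a rough illustration of how a record like the one above could yield co-occurrence edges, the following Python sketch extracts the `MH` lines and pairs up the descriptors. The helper names and the abbreviated `RECORD` string are hypothetical, and the parsing is an assumption: it strips the major-topic asterisk and ignores other details of the real MEDLINE format.

```python
from itertools import combinations

# Abbreviated record based on the slide; not the full MEDLINE entry.
RECORD = """\
PMID- 20091016
TI  - Chi-square-based scoring function for...
MH  - Algorithms
MH  - *Chi-Square Distribution
MH  - *Data Mining
MH  - *MEDLINE
"""


def mesh_terms(record):
    """Collect MeSH descriptors from 'MH' lines, dropping the '*' marker."""
    terms = []
    for line in record.splitlines():
        # Split at the first hyphen, which separates the tag from the value.
        tag, sep, value = line.partition("-")
        if sep and tag.strip() == "MH":
            terms.append(value.strip().lstrip("*"))
    return terms


def cooccurrence_pairs(terms):
    """Return every unordered pair of distinct descriptors in one article."""
    return list(combinations(sorted(set(terms)), 2))


terms = mesh_terms(RECORD)
pairs = cooccurrence_pairs(terms)
print(terms)
print(len(pairs), "co-occurrence pairs")
```

Counting such pairs over all articles in a time window would give the edge list of a co-occurrence network.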

  8. Methods
     • We have a training network $G[t_1, t_2]$ which contains interactions among nodes that take place in the time interval $[t_1, t_2]$.
     • We have a test network $G[t_3, t_4]$ which contains interactions among nodes that take place in the time interval $[t_3, t_4]$.
     • Learning (prediction) task: provide a list of edges that are present in the test network, but absent in the training network.
     (Figure: training network and test network over nodes A–H.)
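The prediction target described above amounts to a set difference between the two edge lists. The sketch below is a minimal illustration, assuming undirected edges stored as node pairs; the function names and the toy edges are hypothetical and are not taken from the slide's figure.

```python
def canonical(edges):
    """Normalize undirected edges so (a, b) and (b, a) compare equal."""
    return {tuple(sorted(edge)) for edge in edges}


def new_links(train_edges, test_edges):
    """Edges present in the test network but absent from the training network."""
    return canonical(test_edges) - canonical(train_edges)


# Hypothetical toy networks, for illustration only.
train_edges = [("A", "B"), ("B", "C"), ("C", "D")]
test_edges = [("B", "A"), ("A", "C"), ("C", "D"), ("D", "E")]

targets = new_links(train_edges, test_edges)
print(sorted(targets))
```

A link predictor is then judged by how highly it ranks these target pairs among all pairs that are unconnected in the training network.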

  9. Data Collection
     • We constructed two networks:
       • Training network [2003-2007]
       • Test network [2008-2012]
     • Networks were post-processed to remove non-informative edges.
     • We applied a χ² test of independence to each co-occurrence pair to obtain a statistic indicating whether the pair occurs together more often than expected by chance.
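One common way to compute such a statistic is Pearson's χ² on a 2×2 contingency table of article counts. The sketch below assumes that setup; the slide does not say which variant, correction, or cut-off was used, so the function name, the toy counts, and the 3.84 threshold (the 5% critical value for one degree of freedom) are illustrative assumptions.

```python
def chi_square_2x2(n_uv, n_u, n_v, n_total):
    """Pearson chi-square (no continuity correction) for a descriptor pair.

    n_uv: articles indexed with both descriptors
    n_u, n_v: articles indexed with each descriptor
    n_total: all articles in the collection
    """
    a = n_uv                        # both u and v
    b = n_u - n_uv                  # u without v
    c = n_v - n_uv                  # v without u
    d = n_total - n_u - n_v + n_uv  # neither
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n_total * (a * d - b * c) ** 2 / denominator


# Hypothetical counts: 10 joint articles is exactly the chance expectation
# (100 * 100 / 1000), while 50 joint articles is far above it.
independent = chi_square_2x2(10, 100, 100, 1000)
associated = chi_square_2x2(50, 100, 100, 1000)
print(independent, round(associated, 2))

CRITICAL_5_PERCENT = 3.84  # assumed cut-off, not stated in the slides
print(associated > CRITICAL_5_PERCENT)
```

Because the statistic is symmetric, a pair that co-occurs less often than chance also scores high; a filter aimed at "more often than by chance" would additionally need to check that the observed count exceeds the expected one.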

  10. Similarity Measures for Link Prediction
      • For each node pair $(u, v)$ we calculate a similarity score $s(u, v)$.
      • Score $s(u, v)$ gives the likelihood of link formation between nodes $u$ and $v$.
      • We used two similarity measures:
        • Jaccard coefficient
          $$s_{uv} = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}$$
          where $\Gamma(u)$ is the set of neighbors of $u$
        • Adamic-Adar coefficient
          $$s_{uv} = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{\log |\Gamma(z)|}$$
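Both measures can be computed directly from neighbor sets. The following Python sketch is a minimal illustration, assuming the network is stored as a dictionary mapping each node to its set of neighbors; the toy graph is hypothetical, and the natural logarithm is an assumption, since the slides do not state the base.

```python
import math


def jaccard(adj, u, v):
    """Shared neighbors divided by all neighbors of u or v."""
    union = adj[u] | adj[v]
    if not union:
        return 0.0
    return len(adj[u] & adj[v]) / len(union)


def adamic_adar(adj, u, v, log=math.log):
    """Sum of 1 / log(degree) over common neighbors of u and v.

    Low-degree common neighbors contribute more than hubs.
    """
    return sum(
        1.0 / log(len(adj[z]))
        for z in adj[u] & adj[v]
        if len(adj[z]) > 1  # guard against log(1) = 0
    )


# Hypothetical undirected toy graph as neighbor sets.
adj = {
    "u": {"a", "b", "c"},
    "v": {"b", "c", "d"},
    "a": {"u"},
    "b": {"u", "v"},
    "c": {"u", "v", "d"},
    "d": {"v", "c"},
}

print(jaccard(adj, "u", "v"))      # common {b, c}, union {a, b, c, d}
print(adamic_adar(adj, "u", "v"))  # 1/log(2) + 1/log(3)
```

Passing `log=math.log10` switches the base, which rescales every Adamic-Adar score by the same constant factor and so leaves the ranking of node pairs unchanged.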

  11. Jaccard Coefficient
      $$s_{uv} = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|} = \frac{4}{9} = 0.44$$
      (Figure: example network around nodes $u$ and $v$.)

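  12. Adamic–Adar Coefficient
      $$s_{uv} = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{\log |\Gamma(z)|} = \frac{1}{\log 7} + \cdots + \frac{1}{\log 4} = 7.60$$
      (Figure: nodes $u$ and $v$ with common neighbors $z_1, \dots, z_4$.)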

  13. Performance Assessment
      • A major challenge is the huge number of possible node pairs.
      • We use a bootstrap resampling approach:
        • We draw a random sample of 1000 nodes and create appropriate training and test networks.
        • We compute a link prediction score $s(u, v)$ for each node pair that is not associated with any interaction before time $t_3$.
        • We assign the class label “positive” to this node pair if the link occurs in the test network and “negative” otherwise.
        • We repeat this procedure 100 times.
      • Using the class labels and similarity scores we constructed an ROC curve.
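Given class labels and similarity scores for one sample, the area under the ROC curve can be obtained from ranks alone. The sketch below is one standard way to do that (the rank-sum, or Mann-Whitney, formulation with average ranks for ties); it is not necessarily how the authors computed it, and the function name and toy data are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC = probability that a random positive outscores a random negative."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)

    # Assign 1-based ranks, averaging over runs of tied scores.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    positive_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (positive_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Hypothetical labels (1 = link appears in the test network) and scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

In a resampling setup like the one described, this value would be computed once per sample and then averaged over the repetitions.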

  14. Results
      Topological Characteristics of the MeSH Networks

      Parameter                 Train        Test
      Nodes                     24 225       25 570
      Edges                     4 897 380    5 615 965
      Edges (reduced)           3 328 288    3 810 535
      Density                   0.01         0.01
      Mean degree               274.78       298.05
      Average path length       2.23         2.20
      Clustering coefficient    0.27         0.26
      Small-worldness index     21.57        20.70
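The density and mean-degree rows can be cross-checked against the node and edge counts. The short sketch below assumes the usual definitions for a simple undirected graph (mean degree 2E/N, density 2E/(N(N-1))) and assumes these rows were computed on the reduced edge sets; neither assumption is stated in the slides, but the reported values are consistent with both.

```python
def mean_degree(n_nodes, n_edges):
    """Average number of neighbors per node in an undirected graph."""
    return 2 * n_edges / n_nodes


def density(n_nodes, n_edges):
    """Fraction of all possible undirected edges that are present."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# Node counts and reduced edge counts from the table.
networks = {
    "Train": (24_225, 3_328_288),
    "Test": (25_570, 3_810_535),
}

for name, (n_nodes, n_edges) in networks.items():
    print(
        name,
        round(mean_degree(n_nodes, n_edges), 2),
        round(density(n_nodes, n_edges), 2),
    )
```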

  15. Similarity Score Distribution
      [Figure: density of the Jaccard coefficient by class (0 vs. 1); x-axis: Jaccard coefficient (ticks 0 to 3000), y-axis: density (ticks 0.000 to 0.010).]

  16. Prediction Performance
      [Figure: ROC curves (average true positive rate vs. false positive rate) in two panels: Jaccard, AUC = 0.78; Adamic–Adar, AUC = 0.82.]
      AUC (area under the ROC curve): 0.90 – 1.00 = excellent, 0.80 – 0.90 = good, 0.70 – 0.80 = fair, 0.60 – 0.70 = poor, 0.50 – 0.60 = fail

  17. Example
      • Training network: 1991 – 1995
      • Test network: 1996

      1|Case-Control Studies|Rats, Inbred Strains|4867
      2|Follow-Up Studies|Binding Sites|4512
      3|Blotting, Western|Combined Modality Therapy|4271
      4|Indicators and Reagents|Age Factors|4138
      5|France|Disease Models, Animal|3991
      6|Prognosis|Chickens|3955
      7|Water|Prognosis|3901
      8|Questionnaires|Microscopy, Electron|3895
      9|Great Britain|Disease Models, Animal|3833
      10|Signal Transduction|Retrospective Studies|3748
      ...
      1135416|Prostatic Neoplasms|I-kappa B Proteins|261

  18. Example

  19. Future Work
      • Explore the role of node and edge attributes in prediction performance.
      • Extend the study to semantic relations instead of co-occurrences.
      • Assess prediction performance on a large-scale network.
      • Develop network filtering methods.
      • Develop a web application for real-time computing.
