Automated Phenotypic Networks for the Integration of Heterogeneous Databases



slide-1
SLIDE 1

Automated Phenotypic Networks for the Integration of Heterogeneous Databases

Yves A. Lussier (1,2), Xiaoyan Wang (1)

(1) Dept of Biomedical Informatics; (2) Dept of Medicine, Columbia University

slide-2
SLIDE 2

HDG HDG

Preview of Take Home Points

  • Exponential growth of heterogeneous DBs

– difficult for humans to review and recall

  • Complexity of Phenotypes

– span scales of biology; differing granularity of description leads to compositional variants and ambiguity

  • Beyond Ontologies: Computational Networks of Phenotypes

– map knowledge of genomic databases in reusable representations

slide-3
SLIDE 3

Outline

  • Challenge
  • Introduction:

– Data representation vs Schema
– Curation vs Automation
– Direct Maps vs Phenotypic Networks (PN)

  • Hypotheses
  • Methods
  • Results
  • Conclusions

slide-4
SLIDE 4

Challenges

  • Heterogeneous data representation

– Structural differences
– Differences in naming conventions & standards across fields
– Semantic differences
– Context differences

  • Variable Database Schema

slide-5
SLIDE 5

Examples of Interoperability

  • Based on Schema

– Requires compatible indexes
– Supports unrelated schema
– Mork P, Halevy A, Tarczy-Hornoch P. A model for data integration systems of biomedical data applied to online genetic databases. Proc AMIA Symp 2001:473-7.

  • Based on Data Representation

– Can map unrelated data dictionaries
– Requires compatible schema

slide-6
SLIDE 6

Interoperability

Based on Data Representation

– Manual Curation (e.g., UMLS, NCI Metathesaurus)

  • Rate-limiting for data sets using current terminologies

– delayed and incomplete synchronization

  • High throughput unattainable for uncoordinated data sets

– Computational Curation / Automation (e.g., automated indexing)

slide-7
SLIDE 7

Introduction

Interoperability based on Manual Curation

– Rate-limiting for data sets using current terminologies

  • delayed and incomplete synchronization

– High throughput unattainable for uncoordinated data sets

slide-8
SLIDE 8

Manual Indexing / Curation

[Figure: manual indexing/curation across terminologies. Labels: UMLS, MeSH, UMDA, GO, OMIM, SNOMED, ……; domains: biomedical literature, other subdomains, anatomy, genome annotations, genetic knowledge base, clinical repositories; years: 2003, 1993, 1998; counts: 250; 208,454; 9,032; 16,946; 357,000; 14,280]

slide-9
SLIDE 9

Introduction: Automated Indexing

  • Automated Indexing

– Direct maps between two unrelated data dictionaries
– No use of networks of relationships
– Rare studies in clinical genetics and molecular biology
– Lexical matching

  • Sperzel WD et al. Biomedical database interconnectivity: An experiment linking MIM, GENBANK, and META-1 via MEDLINE. Proc Annu Symp Comput Appl Med Care 1991:190-193.

– Lexical and semantic matching

  • Bodenreider O. Pac Symp Biocomputing 2004
  • Sarkar IN, Lussier YA et al. Linking biomedical information and knowledge resources: GO and UMLS. Pac Symp Biocomputing 2003;8:427-50.

slide-10
SLIDE 10

Semantic Information Model of SNOMED: compositional, multiaxial, multi-hierarchic

Axes: T M F C D P G L

  • H. pylori associated haemorrhagic Gastric Ulcer =

(4) D5-32220 Gastric (1) Ulcer (2) with haemorrhage (3)
G-C002 associated with (5)
L-13551 H. pylori (6)

[Figure: the numbered components placed on the axes]

slide-11
SLIDE 11

SNOMED Information Model: Representational variant

Axes: T M F C D P G L

  • H. pylori associated haemorrhagic Gastric Ulcer =

(7) DE-16016 H. pylori (6) associated Gastric (1) Ulcer (2)
with (5)
M-37000 haemorrhage (3)

[Figure: the numbered components placed on the axes]

slide-12
SLIDE 12

Outline

  • Challenge
  • Introduction: Phenotypic Networks (PN)
  • Hypotheses
  • Methods:

– Curation vs Automated mappings
– Direct maps vs network-based maps

  • Results
  • Conclusions

slide-13
SLIDE 13

Hypothesis

Proof-of-Concept Study:

Automated Networks of Phenotypes can increase recall and precision of queries across two heterogeneous databases sharing no cross-indexes.

slide-14
SLIDE 14

Method

  • Automated terminology networks

– Databases
– Computational network of phenotypes
– Incremental lexico-semantic techniques

  • Lexical method
  • Semantic constraints
  • Multi-strategy / incremental exploitation of the network

– Network’s pathways
– Accuracy measurements

  • Evaluation

– Gold standard

slide-15
SLIDE 15

Method: databases

Target databases

  • Human Disease Genes Database (HDG)

Jimenez-Sanchez G, Childs B, Valle D. Human disease genes. Nature 2001;409:853-5.

– Manually compiled database classifying disease genes and their products according to function
– 921 disease genes are documented in the database

  • SNOMED Clinical Terms (clinical medicine)

– Concept-based clinical terminology
– Version used: July 2002; 333,325 concepts

slide-16
SLIDE 16

Method: databases

Intermediating databases/terminologies

  • Online Mendelian Inheritance in Man (OMIM)

– 14,280 entries (loci and diseases)

  • Unified Medical Language System (UMLS)

– 871,584 concepts (version 2002AB)

  • SNOMED 3.5

– 208,454 concepts (SNOMED International, version 3.5, 1998)

slide-17
SLIDE 17

Method: Manual Curation

[Figure: network of HDG*, OMIM, UMLS, SNOMED 3.5, and SNOMED CT joined by manual curation links; figure label: 921]

* Jimenez-Sanchez G, Childs B, Valle D. Human disease genes. Nature 2001;409:853-5.

slide-18
SLIDE 18

Method: Manual Curation

[Figure: network of HDG, OMIM, UMLS, SNOMED 3.5, and SNOMED CT joined by manual curation links; figure label: 250]

slide-19
SLIDE 19

Method: Manual Curation

[Figure: network of HDG, OMIM, UMLS, SNOMED 3.5, and SNOMED CT joined by manual curation links; figure label: 208,454]

slide-20
SLIDE 20

Method: Manual Curation

[Figure: same manual curation network as on the previous slide; figure label: 208,454]

slide-21
SLIDE 21

Method: Manual Curation

[Figure: network of HDG, OMIM, UMLS, SNOMED 3.5, and SNOMED CT joined by manual curation links; figure labels: 514, 47, 37, 37]

slide-22
SLIDE 22

Method: Automated Terminology Network (ATN)

[Figure: network of HDG, OMIM, UMLS, SNOMED 3.5, and SNOMED CT; legend: manual curation, automatic mapping]

slide-23
SLIDE 23

Method: Paths derived from the network

Path name   Intermediating terminologies (#)   Complete path
P1          3                                  HDG = OMIM = UMLS = SNOMED 3.5 = SNOMED-CT
P2          0                                  HDG → SNOMED-CT
P3          1                                  HDG → UMLS → SNOMED-CT
P4          1                                  HDG → OMIM (Disease) → SNOMED-CT
P5          1                                  HDG → OMIM (Title) → SNOMED-CT
P6          2                                  HDG → UMLS → OMIM → SNOMED-CT
P7          2                                  HDG → OMIM → UMLS → SNOMED-CT

A = B: Manual curation / mapping of terms via a common index between databases A and B.
A → B: Automated mapping / lexico-semantic mapping of terms between databases A and B.
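
The path table on this slide can be encoded directly as data. The sketch below is illustrative only: the names `PATHS`, `LINK_TYPE`, `intermediates`, and `is_direct` are hypothetical, and it assumes, following the legend, that P1 uses only manually curated links while P2 to P7 use automated links. It recomputes the number of intermediating terminologies for each path.

```python
# Illustrative encoding of the path table; all names here are hypothetical.
MANUAL, AUTOMATED = "manual", "automated"

# Each path is the ordered list of terminologies from source to target.
PATHS = {
    "P1": ["HDG", "OMIM", "UMLS", "SNOMED 3.5", "SNOMED-CT"],
    "P2": ["HDG", "SNOMED-CT"],
    "P3": ["HDG", "UMLS", "SNOMED-CT"],
    "P4": ["HDG", "OMIM (Disease)", "SNOMED-CT"],
    "P5": ["HDG", "OMIM (Title)", "SNOMED-CT"],
    "P6": ["HDG", "UMLS", "OMIM", "SNOMED-CT"],
    "P7": ["HDG", "OMIM", "UMLS", "SNOMED-CT"],
}

# Assumption: P1 is the manually curated route; every other path is automated.
LINK_TYPE = {name: (MANUAL if name == "P1" else AUTOMATED) for name in PATHS}


def intermediates(path):
    """Number of terminologies strictly between the source and the target."""
    return len(path) - 2


def is_direct(path):
    """A direct map has no intermediating terminology."""
    return intermediates(path) == 0


counts = {name: intermediates(path) for name, path in PATHS.items()}
direct_paths = [name for name, path in PATHS.items() if is_direct(path)]
```

Under these assumptions, `direct_paths` contains only P2, the direct automated map that the network-based paths are compared against.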

slide-24
SLIDE 24

Method: Automated Terminology Network (ATN)

[Figure: ATN diagram (HDG, OMIM, UMLS, SNOMED 3.5, SNOMED CT; legend: manual curation, automatic mapping) with path P1 highlighted]

slide-25
SLIDE 25

Method: Automated Terminology Network (ATN)

[Figure: ATN diagram (HDG, OMIM, UMLS, SNOMED 3.5, SNOMED CT; legend: manual curation, automatic mapping) with path P2 highlighted]

slide-26
SLIDE 26

Method: Automated Terminology Network (ATN)

[Figure: ATN diagram (HDG, OMIM, UMLS, SNOMED 3.5, SNOMED CT; legend: manual curation, automatic mapping) with path P3 highlighted]

slide-27
SLIDE 27

Method: ATN

[Figure: ATN diagram (HDG, OMIM, UMLS, SNOMED 3.5, SNOMED CT; legend: manual curation, automatic mapping) with path P5 highlighted]

slide-28
SLIDE 28

Method: Multistrategy ATN

[Figure: ATN diagram (HDG, OMIM, UMLS, SNOMED 3.5, SNOMED CT; legend: manual curation, automatic mapping)]

slide-29
SLIDE 29

Method: Lexico-Semantic techniques

Lexical Method: NORM

  • Punctuation removed
  • Stop & duplicate words removed
  • Conversion to base form
  • Words sorted in alphabetical order

Example: Darier’s disease → Darier disease
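
The normalization steps listed on this slide can be sketched as follows. This is a simplified stand-in, not the NORM program itself: the stop-word list, the `to_base_form` helper, and the lowercasing step are assumptions made for illustration, and a real implementation would use a lexicon for base forms.

```python
import re

# Hypothetical stop-word list; the real tool ships its own.
STOP_WORDS = {"of", "and", "with", "for", "the", "to", "in", "on", "by"}


def to_base_form(word: str) -> str:
    """Crude stand-in for lexicon-based uninflection: strip a plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(term: str) -> str:
    """Return a normalized key so that lexical variants compare equal."""
    text = term.lower()                        # assumption: case is ignored
    text = re.sub(r"[’']s\b", "", text)        # drop possessive endings
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove punctuation
    words = [to_base_form(w) for w in text.split() if w not in STOP_WORDS]
    return " ".join(sorted(set(words)))        # de-duplicate, sort alphabetically


same_key = normalize("Darier’s disease") == normalize("Darier disease")
```

Two terms are then treated as a lexical match when their normalized keys are equal, which is how the possessive and non-possessive forms of the Darier example end up mapped together.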

slide-30
SLIDE 30

Method: how it works

[Figure: worked example tracing HEMOPHILIA A through the network; column labels: HDG, OMIM, UMLS, SNOMED-CT, SNOMED3.5]

HDG 306700: HEMOPHILIA A
OMIM 306700: HEMOPHILIA A; HEMOPHILIA, CLASSIC; HEMA; COAGULATION FACTOR VIIIC, PROCOAGULANT COMPONENT; COAGULATION FACTOR VIII; F8
C0272322: AHG Deficiency; AHG deficiency disease
C0358603: Intermediate factor VIII|3|…
16872008: Hereditary factor VIII deficiency disease
319871002: Factor VIII fraction products
16872008: AHG Deficiency; AHG deficiency disease
319871002: Factor VIII fraction products

slide-31
SLIDE 31

Method: how it works

[Figure: same HEMOPHILIA A worked example as on the previous slide (HDG 306700, OMIM 306700, UMLS C0272322 and C0358603, SNOMED 16872008 and 319871002)]

slide-32
SLIDE 32

Method: how it works

[Figure: the HEMOPHILIA A example reduced to the maps retained from the network; column labels: HDG, OMIM, UMLS, SNOMED-CT, SNOMED3.5]

HDG 306700: HEMOPHILIA A
OMIM 306700: HEMOPHILIA A; HEMOPHILIA, CLASSIC; HEMA
C0272322: AHG Deficiency; AHG deficiency disease
16872008: Hereditary factor VIII deficiency disease
16872008: AHG Deficiency; AHG deficiency disease
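
Traversing a path amounts to chaining one mapping table per hop. The sketch below shows that idea under stated assumptions: `follow_path` and the three toy tables are hypothetical, and although the identifiers are taken from the HEMOPHILIA A example, the specific links between them are assumed here for illustration rather than taken from the study's data.

```python
from typing import Dict, List, Set

# One hop of a path: source identifier -> set of target identifiers.
Mapping = Dict[str, Set[str]]


def follow_path(start: str, hops: List[Mapping]) -> Set[str]:
    """Chain the hop tables, carrying every candidate forward at each step."""
    frontier = {start}
    for hop in hops:
        frontier = {target for key in frontier for target in hop.get(key, set())}
    return frontier


# Toy tables; the links are illustrative assumptions, not curated data.
hdg_to_omim: Mapping = {"HEMOPHILIA A": {"306700"}}
omim_to_umls: Mapping = {"306700": {"C0272322"}}
umls_to_snomed_ct: Mapping = {"C0272322": {"16872008"}}

result = follow_path("HEMOPHILIA A", [hdg_to_omim, omim_to_umls, umls_to_snomed_ct])
unmapped = follow_path("UNKNOWN TERM", [hdg_to_omim, omim_to_umls, umls_to_snomed_ct])
```

A term that drops out at any hop yields an empty set, which is one way a longer path can lose recall even when each individual hop is accurate.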

slide-33
SLIDE 33

Method: evaluation

  • Gold Standard

– 3 independent curators
– Agreement on 514 HDG-SNOMED maps

  • Quantitative analysis

– Recall: TP / (TP + FN)
– Precision: TP / (TP + FP)

TP = true positive, FN = false negative, FP = false positive

  • Qualitative analysis

– Ambiguity
– Redundancy
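
The recall and precision formulas above can be computed by comparing a set of predicted maps with the gold standard. This is a minimal sketch: the function name `recall_precision` and the toy pairs (`h1`, `s1`, and so on) are hypothetical placeholders, not data from the study.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (source term id, target concept id)


def recall_precision(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float]:
    """Recall = TP / (TP + FN); Precision = TP / (TP + FP)."""
    tp = len(predicted & gold)   # maps found and present in the gold standard
    fn = len(gold - predicted)   # gold-standard maps that were missed
    fp = len(predicted - gold)   # maps found but absent from the gold standard
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Toy data with made-up identifiers.
gold = {("h1", "s1"), ("h2", "s2"), ("h3", "s3")}
predicted = {("h1", "s1"), ("h2", "s2"), ("h2", "s9")}

recall, precision = recall_precision(predicted, gold)
```

In this toy case two of three gold maps are recovered and two of three predictions are correct, so both measures come out at two thirds.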

slide-34
SLIDE 34

Outline

  • Challenge
  • Introduction: Phenotypic Networks (PN)
  • Hypotheses
  • Methods
  • Results

– Accuracy of direct maps vs pathways
– Accuracy of manual vs automated curation
– Accuracy of multi-strategy method

  • Conclusions

slide-35
SLIDE 35

Result: Quantitative analysis

[Figure: Recall (%) and Precision (%) for manual curation, ATN mapping, direct automated path, and multi-strategy]

slide-36
SLIDE 36

Precision vs. recall of each of the linking paths in the ATN

[Figure: Precision (%) vs. Recall (%) for paths P1 to P7, plotted separately for CoM and ClM]

CoM: Concept-Based Mapping; ClM: Class-Based Mapping

slide-37
SLIDE 37

Result: Qualitative Analysis: P1

Ambiguity in HDG: 18%

HDG: Meningioma, NF2 related, sporadic, Schwannoma, sporadic (101000)

SNOMED: Neurofibromatosis, type 2 (92503002); Intracranial meningioma (302820008)

slide-38
SLIDE 38

Result: Qualitative Analysis: P1

Redundancy in SNOMED: 15%

HDG: Apert syndrome (101200)

SNOMED: Apert's syndrome (63661009); Acrocephalosyndactyly (268262006)

slide-39
SLIDE 39

Conclusions

  • Automated mapping traversing a network of terminologies can significantly improve recall (a six-fold increase in this study) over that of manual indexes, with minimal impact on precision.
  • Direct automated mapping (non-network) performed significantly worse than any other method.
  • Incremental and class-based methods, not investigated in this study, were shown in a previous study to increase precision.
  • Automated terminology networks may allow for high-throughput linkages between disparate biomedical databases.

slide-40
SLIDE 40

Limitations

  • Small gold standard (GS)
  • Compositional mapping has not been addressed with these methods

slide-41
SLIDE 41

Future Directions

  • Support compositional mapping
  • Predict the accuracy of terminological pathways in large-scale networks

slide-42
SLIDE 42

Acknowledgments

  • Trainees: Michael Cantor, Aylit Schultz, Hui Nar Quek
  • Staff: Jianrong Li
  • National Institute of Allergy and Infectious Diseases (NIAID)
  • New York State Office of Science, Technology, and Academic Research (NYSTAR)-sponsored Center for Advanced Technology at Columbia University
  • Office of Advanced Telemedicine (OAT) of the Health Resources and Services Administration (HRSA)
  • Virginia Commonwealth University's Medical Informatics and Technology Applications Consortium, a National Aeronautics and Space Administration (NASA) Commercial Space Center

slide-43
SLIDE 43


Thank you!

Questions?