
SLIDE 1

S100 Martínez-Romero, M., O’Connor, M. J., Shankar, R., Panahiazar, M., Willrett, D., Egyedi, A. L., Gevaert, O., Graybeal, J., Musen, M. A. Stanford University

Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations

SLIDE 2

What is metadata?

2 AMIA 2017 | amia.org

  • Data that describe data
  • Crucial for:
      • Finding experimental datasets online
      • Understanding how the experiments were performed
      • Reusing the data to perform new analyses

SLIDE 3

SLIDE 4

age; Age; AGE; `Age; age (after birth); age (in years); age (y); age (year); age (years); Age (years); Age (Years); age (yr); age (yr-old); age (yrs); Age (yrs); age [y]; age [year]; age [years]; age in years; age of patient; Age of patient; age of subjects; age(years); Age(years); Age(yrs.); Age, year; age, years; age, yrs; age.year; age_years

Poor metadata

SLIDE 5

An analysis of metadata from NCBI’s BioSample

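  • 73% of “Boolean” values are not valid Booleans
      • e.g., nonsmoker, former-smoker
  • 26% of “integer” values are not valid integers
      • e.g., JM52, UVPgt59.4, pig
  • 68% of values that should be ontology terms are not valid ontology terms
      • e.g., presumed normal, wild_type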

Gonçalves, R. S. et al. (2017). Metadata in the BioSample Online Repository are Impaired by Numerous Anomalies. SemSci 2017 Workshop, co-located with ISWC 2017. Vienna, Austria.
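
The kind of type check behind figures like these can be illustrated with a minimal sketch. It is not the method used in the cited analysis: the accepted Boolean literals and the helper names (`is_boolean`, `is_integer`, `invalid_fraction`) are assumptions made for illustration, and the sample values are the ones shown on this slide plus two valid values.

```python
def is_boolean(value: str) -> bool:
    """Return True if the value is an accepted Boolean literal.

    Assumption: only "true" and "false" (case-insensitive) count as Boolean.
    """
    return value.strip().lower() in {"true", "false"}


def is_integer(value: str) -> bool:
    """Return True if the value parses as a base-10 integer."""
    try:
        int(value.strip())
        return True
    except ValueError:
        return False


def invalid_fraction(values, validator):
    """Fraction of values rejected by the validator (0.0 for an empty list)."""
    if not values:
        return 0.0
    invalid = sum(1 for value in values if not validator(value))
    return invalid / len(values)


# Sample values from the slide, plus one valid value per type.
boolean_samples = ["nonsmoker", "former-smoker", "true"]
integer_samples = ["JM52", "UVPgt59.4", "pig", "42"]

print(invalid_fraction(boolean_samples, is_boolean))
print(invalid_fraction(integer_samples, is_integer))
```

A real analysis would also need to decide how to treat empty values and alternative spellings such as "yes" and "no", which this sketch rejects.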

Poor metadata

SLIDE 6

Metadata authoring is hard

SLIDE 7

  • A computational platform for metadata management
  • Goal: Overcome the impediments to creating high-quality metadata

[Figure label: Metadata template]

SLIDE 8

[Figure: template design and metadata authoring workflow]
  • Steps: DESIGN TEMPLATE, FILL IN METADATA, SUBMIT METADATA
  • Components: Template Designer, Metadata Editor, Metadata Repository
  • Users: template authors (e.g., standards committees), metadata authors (e.g., scientists)
  • Other labels: template, metadata, LINCS, Public Databases
  • Template URL: https://cedar.metadatacenter.org/templates/edit/https://repo.metadatacenter.org/templates/ab105771-564e-42a1-9be4-5a63891…
  • Instance URL: https://cedar.metadatacenter.org/instances/edit/https://repo.metadatacenter.org/template-instances/d4f1059e-8e27-4166-902f-…
  • Example field values: A sample study; Acute stress disorder; Stanford University; John Doe; Longitudinal

SLIDE 9

We developed a metadata recommendation system

[Figure: same workflow diagram as on slide 8]

SLIDE 10

Metadata recommendation system

[Figure: the Metadata Recommender connected to the Metadata Editor and the Metadata Repository]
  • Numbered steps (1 to 3): store metadata, analyze existing metadata, generate suggestions
  • Instance URL: https://cedar.metadatacenter.org/instances/edit/https://repo.metadatacenter.org/template-instances/d4f1059e-8e27-4166-902f-…
  • Example field values: A sample study; Acute stress disorder; Stanford University; John Doe; Longitudinal
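
To make the "analyze existing metadata, generate suggestions" loop concrete, here is a minimal sketch under stated assumptions. It is not the CEDAR implementation: it ranks candidate values for one field by how often they co-occur, in stored records, with the values the author has already entered. The function name `suggest`, the record layout, and the sample records are hypothetical.

```python
from collections import Counter


def suggest(instances, target_field, context, top_n=3):
    """Rank values of target_field by co-occurrence with the filled-in context.

    instances: list of dicts, one per stored metadata record (assumed layout).
    context: dict of fields the author has already filled in.
    """
    counts = Counter()
    for record in instances:
        value = record.get(target_field)
        if value is None:
            continue
        # Keep only records that agree with every already-filled field.
        if all(record.get(field) == filled for field, filled in context.items()):
            counts[value] += 1
    return [value for value, _ in counts.most_common(top_n)]


# Hypothetical stored records.
stored = [
    {"disease": "asthma", "tissue": "lung"},
    {"disease": "asthma", "tissue": "lung"},
    {"disease": "lung cancer", "tissue": "lung"},
    {"disease": "lymphoma", "tissue": "lymph node"},
]

print(suggest(stored, "disease", {"tissue": "lung"}))
```

With an empty context the ranking reduces to plain value frequency, so any gain from context depends on fields being correlated.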

SLIDE 11

Filling in a CEDAR template

SLIDE 12

SLIDE 13

SLIDE 14

SLIDE 15

SLIDE 16

Evaluation workflow

[Figure: evaluation workflow diagram]
  • Steps: (1) Preprocessing and Ingestion, (2) Semantic annotation, (3) Training, (4) Testing & Analysis
  • Inputs and artifacts: Gene Expression metadata; CEDAR BioSample template; BioSample template instances (≈35K); annotated BioSample template instances (≈35K); CEDAR Metadata Repository; Metadata Recommender; evaluation results
  • Each set of instances is split into a training dataset (80%) and a test dataset (20%)
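
The 80%/20% split in the workflow can be sketched as follows. This is an illustration only: the slides do not say how instances were assigned to each set, so the random shuffle, the fixed seed, and the function name `split_instances` are assumptions.

```python
import random


def split_instances(instances, train_fraction=0.8, seed=0):
    """Shuffle instances and split them into (training, test) lists.

    Assumption: a uniformly random split with a fixed seed for repeatability.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in identifiers for template instances.
train, test = split_instances(range(100))
print(len(train), len(test))
```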

SLIDE 17

Evaluation workflow

[Figure: evaluation workflow diagram, repeated from slide 16]

SLIDE 18

Evaluation workflow

[Figure: evaluation workflow diagram, repeated from slide 16]

SLIDE 19

Evaluation workflow

[Figure: evaluation workflow diagram, repeated from slide 16]

  • For “disease”, “sex”, and “tissue”
  • Top 3 suggestions

SLIDE 20

Testing & Analysis

Compared suggested vs. expected metadata

Measure: Reciprocal Rank (RR). Appropriate for judging systems that return a ranking of suggestions when there is only one relevant result

$\mathrm{RR} = \frac{1}{K}$, where $K$ is the position of the expected result in the ranking of suggestions

SLIDE 21

How is the RR calculated?

Expected      Suggested                                                 K   Reciprocal Rank (RR)
asthma        1) asthma  2) lung cancer  3) respiratory disease         1   1/1
lymphoma      1) myeloma  2) lymphoma  3) acute myeloid leukemia        2   1/2
lung cancer   1) respiratory disease  2) asthma  3) lung cancer         3   1/3

Mean Reciprocal Rank (MRR) = (1/1 + 1/2 + 1/3) / 3 = 0.61
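
The calculation above can be reproduced with a short sketch. The function names are illustrative. Scoring 0 when the expected value is missing from the suggestions is an assumption, since the slides only show cases where it appears.

```python
def reciprocal_rank(expected, suggestions):
    """Return 1/K, where K is the 1-based position of expected in suggestions."""
    for position, suggestion in enumerate(suggestions, start=1):
        if suggestion == expected:
            return 1.0 / position
    # Assumption: an expected value that is never suggested scores 0.
    return 0.0


def mean_reciprocal_rank(cases):
    """Average the reciprocal rank over (expected, suggestions) pairs."""
    if not cases:
        return 0.0
    return sum(reciprocal_rank(e, s) for e, s in cases) / len(cases)


# The three rows from the slide.
cases = [
    ("asthma", ["asthma", "lung cancer", "respiratory disease"]),
    ("lymphoma", ["myeloma", "lymphoma", "acute myeloid leukemia"]),
    ("lung cancer", ["respiratory disease", "asthma", "lung cancer"]),
]

print(round(mean_reciprocal_rank(cases), 2))  # 0.61
```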

SLIDE 22

Results

[Figure: bar chart of Mean Reciprocal Rank (MRR), on an axis up to 1, for the fields disease, tissue, and sex; series: Baseline and Metadata Recommender]

On average:

  • Metadata Recommender = 0.77
  • Baseline (majority vote) = 0.31

Better performance with respect to the baseline for:

  • Fields with many different values
  • Templates with many correlated fields
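
For comparison, a majority-vote baseline can be sketched as below. This assumes that "majority vote" means suggesting the most frequent values of a field in the training data, ignoring every other field. The slides do not spell out the baseline, so the function name and sample records are hypothetical.

```python
from collections import Counter


def majority_vote(training_instances, field, top_n=3):
    """Suggest the top_n most frequent values of field, ignoring context."""
    counts = Counter(
        record[field] for record in training_instances if field in record
    )
    return [value for value, _ in counts.most_common(top_n)]


# Hypothetical training records.
training = [
    {"sex": "female", "disease": "asthma"},
    {"sex": "female", "disease": "lymphoma"},
    {"sex": "male", "disease": "asthma"},
    {"sex": "female", "disease": "asthma"},
]

print(majority_vote(training, "disease"))
```

Because the ranking never changes with the other fields, such a baseline is expected to do worst on fields with many distinct values, which is consistent with the observations above.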

SLIDE 23

Summary

  • We developed a metadata recommendation system as part of an end-to-end system for metadata management called CEDAR
  • Generates context-sensitive suggestions in real time
  • Incorporates both ontology-based and free-text suggestions

SLIDE 24

Summary

Our approach makes it easier for scientists to generate high-quality metadata for experimental datasets

  • So that the datasets can be found, interpreted, and reused
  • Essential to ensure scientific reproducibility

SLIDE 25

http://cedar.metadatacenter.org
github.com/metadatacenter
facebook.com/metadatacenter
@metadatacenter
Channel: Metadata Center