[PPT] - Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations

SLIDE 1

S100
Martínez-Romero, M., O’Connor, M. J., Shankar, R., Panahiazar, M., Willrett, D., Egyedi, A. L., Gevaert, O., Graybeal, J., Musen, M. A.
Stanford University

Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations

SLIDE 2

What is metadata?

2 AMIA 2017 | amia.org

Data that describe data
Crucial for:
Finding experimental datasets online
Understanding how the experiments were performed
Reusing the data to perform new analyses

SLIDE 3

SLIDE 4

age; Age; AGE; `Age; age (after birth); age (in years); age (y); age (year); age (years); Age (years); Age (Years); age (yr); age (yr-old); age (yrs); Age (yrs); age [y]; age [year]; age [years]; age in years; age of patient; Age of patient; age of subjects; age(years); Age(years); Age(yrs.); Age, year; age, years; age, yrs; age.year; age_years

Poor metadata

SLIDE 5

An analysis of metadata from NCBI’s BioSample

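73% of “Boolean” values are not valid Booleans
e.g., nonsmoker, former-smoker
26% of “integer” values are not valid integers
e.g., JM52, UVPgt59.4, pig
68% of values meant to be ontology terms are not valid ontology terms
e.g., presumed normal, wild_type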

Gonçalves, R. S. et al. (2017). Metadata in the BioSample Online Repository are Impaired by Numerous Anomalies. SemSci 2017 Workshop, co-located with ISWC 2017. Vienna, Austria.
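
The kind of check behind these percentages can be sketched as follows. This is a minimal illustration, not the method used in the cited analysis: the function names and the set of accepted Boolean spellings are assumptions, and only the example values come from the slide.

```python
from typing import Callable, Iterable


def is_boolean(value: str) -> bool:
    """True if the string is one of a small set of Boolean spellings.

    The accepted spellings are an assumption made for this sketch.
    """
    return value.strip().lower() in {"true", "false", "yes", "no", "0", "1"}


def is_integer(value: str) -> bool:
    """True if the string can be parsed as an integer by Python's int()."""
    try:
        int(value.strip())
        return True
    except ValueError:
        return False


def invalid_fraction(values: Iterable[str], validator: Callable[[str], bool]) -> float:
    """Fraction of values rejected by the validator (0.0 for an empty input)."""
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if not validator(v)) / len(values)


# Example values taken from the slide.
boolean_examples = ["nonsmoker", "former-smoker"]
integer_examples = ["JM52", "UVPgt59.4", "pig"]

print(invalid_fraction(boolean_examples, is_boolean))
print(invalid_fraction(integer_examples, is_integer))
```

Every example value from the slide is rejected by these validators, which is why they count as anomalies for fields declared as Boolean or integer.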

Poor metadata

SLIDE 6

Metadata authoring is hard

SLIDE 7

A computational platform for metadata management

Goal: Overcome the impediments to creating high-quality metadata

Metadata template

SLIDE 8

Figure: CEDAR workflow (design template, fill in metadata, submit metadata). Template authors (e.g., standards committees) design templates in the Template Designer. Metadata authors (e.g., scientists) fill in metadata in the Metadata Editor. Templates and metadata are kept in the Metadata Repository. Submitted metadata go to destinations labeled LINCS and Public Databases. Screenshots show a template and a template instance in CEDAR (“A sample study”, “Acute stress disorder”, “Stanford University”, “John Doe”, “Longitudinal”).

SLIDE 9

We developed a metadata recommendation system

Figure: the CEDAR workflow (design template, fill in metadata, submit metadata), repeated from the previous slide.

SLIDE 10

Metadata recommendation system

Figure: the Metadata Recommender, shown with the Metadata Editor and the Metadata Repository. Three numbered steps (1 to 3) are labeled: analyze existing metadata, generate suggestions, store metadata. The Metadata Editor screenshot shows a template instance (“A sample study”, “Acute stress disorder”, “Stanford University”, “John Doe”, “Longitudinal”).
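
The idea of analyzing existing metadata to generate suggestions can be sketched as below. The slides do not describe the recommender's internals, so this is a hypothetical frequency-counting approach: `suggest_values`, its parameters, and the toy repository are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Instance = Dict[str, str]


def suggest_values(
    instances: List[Instance],
    target_field: str,
    context: Optional[Dict[str, str]] = None,
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Rank candidate values for target_field.

    Only instances that agree with every already-filled field in `context`
    are counted, which makes the suggestions context-sensitive. Each result
    is (value, share of matching instances that use that value).
    """
    context = context or {}
    counts: Counter = Counter()
    for instance in instances:
        value = instance.get(target_field)
        if not value:
            continue
        if all(instance.get(f) == v for f, v in context.items()):
            counts[value] += 1
    total = sum(counts.values())
    return [(value, n / total) for value, n in counts.most_common(top_n)]


# Toy repository contents, invented for this sketch.
repository = [
    {"tissue": "lung", "disease": "asthma"},
    {"tissue": "lung", "disease": "lung cancer"},
    {"tissue": "lung", "disease": "lung cancer"},
    {"tissue": "blood", "disease": "lymphoma"},
    {"tissue": "blood", "disease": "lymphoma"},
    {"tissue": "blood", "disease": "myeloma"},
]

# With "tissue" already filled in as "lung", only lung-related diseases are suggested.
print(suggest_values(repository, "disease", context={"tissue": "lung"}))
```

In this toy case the context narrows the candidates to the two diseases that co-occur with "lung", with "lung cancer" ranked first.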

SLIDE 11

Filling in a CEDAR template

SLIDE 12

SLIDE 13

SLIDE 14

SLIDE 15

SLIDE 16

Evaluation workflow

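Figure: evaluation workflow.
(1) Preprocessing and Ingestion: Gene Expression metadata are turned into BioSample template instances (≈35K) using the CEDAR BioSample template.
(2) Semantic annotation: yields annotated BioSample template instances (≈35K).
(3) Training: each set of instances is split 80% / 20% into a training dataset and a test dataset; the training datasets go to the CEDAR Metadata Repository and the Metadata Recommender.
(4) Testing & Analysis: the test datasets are run against the Metadata Recommender to produce the evaluation results.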
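
The 80% / 20% partition in the workflow can be sketched as a seeded random split. The slides do not say how instances were assigned to each side, so the shuffling, the `split_dataset` name, and the assumption that the 80% share is the training portion are all illustrative.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    instances: Sequence[T], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the instances and cut it into (training, test)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the ≈35K template instances; real instances would be records, not ints.
training, test = split_dataset(range(35_000))
print(len(training), len(test))  # 28000 7000
```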

SLIDE 17

Evaluation workflow

SLIDE 18

Evaluation workflow

SLIDE 19

Evaluation workflow

For “disease”, “sex”, and “tissue”

Top 3 suggestions

SLIDE 20

Testing & Analysis

Compared suggested vs. expected metadata

Measure: Reciprocal Rank (RR). Appropriate for judging systems that return a ranking of suggestions when there is only one relevant result

$$\text{Reciprocal Rank (RR)} = \frac{1}{K}$$

where K is the position of the expected result in the ranking of suggestions

SLIDE 21

How is the RR calculated?

Expected | Suggested | K | Reciprocal Rank (RR)
asthma | 1) asthma 2) lung cancer 3) respiratory disease | 1 | 1/1
lymphoma | 1) myeloma 2) lymphoma 3) acute myeloid leukemia | 2 | 1/2
lung cancer | 1) respiratory disease 2) asthma 3) lung cancer | 3 | 1/3

Mean Reciprocal Rank (MRR) = (1/1 + 1/2 + 1/3) / 3 = 0.61
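
The calculation above can be reproduced with a short sketch. The function names are illustrative, and scoring a missing expected value as 0 is a common convention assumed here rather than something the slides state.

```python
from typing import List, Sequence, Tuple


def reciprocal_rank(expected: str, suggestions: Sequence[str]) -> float:
    """Return 1/K, where K is the 1-based position of `expected` in `suggestions`.

    Returns 0.0 if the expected value is not suggested at all (assumed convention).
    """
    for position, suggestion in enumerate(suggestions, start=1):
        if suggestion == expected:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(cases: List[Tuple[str, Sequence[str]]]) -> float:
    """Average the reciprocal rank over (expected, suggestions) pairs."""
    ranks = [reciprocal_rank(expected, suggestions) for expected, suggestions in cases]
    return sum(ranks) / len(ranks)


# The three rows from the slide.
cases = [
    ("asthma", ["asthma", "lung cancer", "respiratory disease"]),
    ("lymphoma", ["myeloma", "lymphoma", "acute myeloid leukemia"]),
    ("lung cancer", ["respiratory disease", "asthma", "lung cancer"]),
]

print(round(mean_reciprocal_rank(cases), 2))  # 0.61
```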

SLIDE 22

Results

Figure: chart of Mean Reciprocal Rank (MRR), on a scale up to 1, for the fields disease, tissue, and sex, comparing the Baseline with the Metadata Recommender.

On average:
Metadata Recommender = 0.77
Baseline (majority vote) = 0.31

Better performance with respect to the baseline for:
Fields with many different values
Templates with many correlated fields
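
A majority-vote baseline can be sketched as below. The slides only name the baseline, so this assumes it ranks values by their overall frequency in the training data and ignores what has been entered in other fields; `majority_vote_baseline` and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def majority_vote_baseline(
    training_instances: List[Dict[str, str]], target_field: str, top_n: int = 3
) -> List[str]:
    """Suggest the most frequent values of target_field, ignoring all context."""
    counts = Counter(
        instance[target_field]
        for instance in training_instances
        if instance.get(target_field)
    )
    return [value for value, _ in counts.most_common(top_n)]


# Toy training data, invented for this sketch.
toy_training = [
    {"sex": "female", "tissue": "lung"},
    {"sex": "female", "tissue": "blood"},
    {"sex": "male", "tissue": "blood"},
]

print(majority_vote_baseline(toy_training, "sex"))  # ['female', 'male']
```

Because such a baseline returns the same ranking for every instance, it has less room to succeed when a field has many possible values, which is consistent with the first observation above.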

SLIDE 23

Summary

We developed a metadata recommendation system as part of an end-to-end system for metadata management called CEDAR

Generates context-sensitive suggestions in real time
Incorporates both ontology-based and free-text suggestions

SLIDE 24

Summary

Our approach makes it easier for scientists to generate high-quality metadata for experimental datasets

So that the datasets can be found, interpreted, and reused

Essential to ensure scientific reproducibility

SLIDE 25

facebook.com/metadatacenter
@metadatacenter
http://cedar.metadatacenter.org
Channel: Metadata Center
github.com/metadatacenter