Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations
S100
Martínez-Romero, M., O'Connor, M. J., Shankar, R., Panahiazar, M., Willrett, D., Egyedi, A. L., Gevaert, O., Graybeal, J., Musen, M. A.
Stanford University
What is metadata?
2 AMIA 2017 | amia.org
- Data that describe data
- Crucial for:
- Finding experimental datasets online
- Understanding how the experiments were performed
- Reusing the data to perform new analyses
age Age AGE `Age age (after birth) age (in years) age (y) age (year) age (years) Age (years) Age (Years) age (yr) age (yr-old) age (yrs) Age (yrs) age [y] age [year] age [years] age in years age of patient Age of patient age of subjects age(years) Age(years) Age(yrs.) Age, year age, years age, yrs age.year age_years
Poor metadata
An analysis of metadata from NCBI’s BioSample
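- 73% of “Boolean” values are not valid Booleans
  - e.g., nonsmoker, former-smoker
- 26% of “integer” values are not valid integers
  - e.g., JM52, UVPgt59.4, pig
- 68% of values meant to be ontology terms are not ontology terms
  - e.g., presumed normal, wild_type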
Gonçalves, R. S. et al. (2017). Metadata in the BioSample Online Repository are Impaired by Numerous Anomalies. SemSci 2017 Workshop, co-located with ISWC 2017. Vienna, Austria.
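The kind of check behind these percentages can be sketched in a few lines. This is a minimal illustration, not the analysis code from Gonçalves et al.; the accepted Boolean spellings and the helper names `is_boolean`, `is_integer`, and `invalid_fraction` are assumptions made for the example.

```python
# Hypothetical validity checks. The set of accepted Boolean spellings
# is an assumption, not taken from the cited analysis.
BOOLEAN_SPELLINGS = {"true", "false", "yes", "no"}


def is_boolean(value: str) -> bool:
    """Return True if the value is one of the accepted Boolean spellings."""
    return value.strip().lower() in BOOLEAN_SPELLINGS


def is_integer(value: str) -> bool:
    """Return True if the value parses as a base-10 integer."""
    try:
        int(value.strip())
        return True
    except ValueError:
        return False


def invalid_fraction(values, check) -> float:
    """Return the fraction of values that fail the given check."""
    values = list(values)
    failures = sum(1 for v in values if not check(v))
    return failures / len(values)


# The three invalid "integer" examples from the slide plus one valid value.
sample = ["JM52", "UVPgt59.4", "pig", "7"]
fraction = invalid_fraction(sample, is_integer)
```

With the sample above, three of the four values fail the integer check, so `fraction` is 0.75.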
Metadata authoring is hard
- A computational platform for metadata management
- Goal: Overcome the impediments to creating high-quality metadata
Metadata template
[Figure: CEDAR workflow in three steps: DESIGN TEMPLATE, FILL IN METADATA, SUBMIT METADATA]
- Template Designer: template authors (e.g., standards committees)
- Metadata Editor: metadata authors (e.g., scientists)
- Metadata Repository: templates and metadata
- Submission targets: LINCS, public databases
- Template screenshot: https://cedar.metadatacenter.org/templates/edit/https://repo.metadatacenter.org/templates/ab105771-564e-42a1-9be4-5a63891…
- Instance screenshot: https://cedar.metadatacenter.org/instances/edit/https://repo.metadatacenter.org/template-instances/d4f1059e-8e27-4166-902f-…
- Sample values shown: A sample study, Acute stress disorder, Stanford University, John Doe, Longitudinal
We developed a metadata recommendation system
[Figure: the same CEDAR workflow (DESIGN TEMPLATE, FILL IN METADATA, SUBMIT METADATA), now with a "Metadata recommendation system" component added]
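[Figure: Metadata Recommender architecture, with three numbered steps. The Metadata Editor stores metadata in the Metadata Repository; the Metadata Recommender analyzes existing metadata and generates suggestions for the Metadata Editor.]
- Instance screenshot: https://cedar.metadatacenter.org/instances/edit/https://repo.metadatacenter.org/template-instances/d4f1059e-8e27-4166-902f-…
- Sample values shown: A sample study, Acute stress disorder, Stanford University, John Doe, Longitudinal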
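The store, analyze, and suggest loop can be illustrated with a small sketch. It is not CEDAR's actual algorithm: it assumes that suggestions are ranked by how often a value appears in stored instances that agree with the fields already filled in. The function name `suggest` and the sample `repository` are hypothetical.

```python
from collections import Counter


def suggest(instances, target_field, filled, top_k=3):
    """Rank candidate values for target_field.

    Only stored instances that agree with every already filled-in field
    are counted, so the ranking depends on the context (an assumption
    about how context-sensitive suggestions could work).
    """
    matching = [
        inst for inst in instances
        if target_field in inst
        and all(inst.get(field) == value for field, value in filled.items())
    ]
    counts = Counter(inst[target_field] for inst in matching)
    return [value for value, _ in counts.most_common(top_k)]


# Hypothetical stored metadata instances.
repository = [
    {"tissue": "lung", "disease": "asthma"},
    {"tissue": "lung", "disease": "asthma"},
    {"tissue": "lung", "disease": "lung cancer"},
    {"tissue": "blood", "disease": "lymphoma"},
    {"tissue": "blood", "disease": "lymphoma"},
    {"tissue": "blood", "disease": "myeloma"},
]

# The author has already entered tissue = lung and now reaches "disease".
suggestions = suggest(repository, "disease", {"tissue": "lung"})
```

Here `suggestions` is `["asthma", "lung cancer"]`: only the lung instances are counted, so lymphoma and myeloma are never offered.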
Filling in a CEDAR template
Evaluation workflow
[Figure: evaluation workflow in four steps]
- (1) Preprocessing and Ingestion
- (2) Semantic annotation
- (3) Training
- (4) Testing & Analysis
- Datasets: BioSample template instances (≈35K) and annotated BioSample template instances (≈35K), each split into a training dataset (80%) and a test dataset (20%)
- Other labels: Gene Expression metadata, CEDAR BioSample template, CEDAR Metadata Repository, Metadata Recommender, Evaluation results
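The 80%/20% division into training and test datasets can be sketched as follows. The slides do not say how instances were assigned to each set, so the random shuffle, the fixed seed, and the function name `split_instances` are assumptions; 35,000 stands in for the approximate instance count.

```python
import random


def split_instances(instances, train_fraction=0.8, seed=0):
    """Shuffle the instances and split them into training and test sets."""
    shuffled = list(instances)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Integer IDs stand in for the roughly 35K template instances.
train, test = split_instances(range(35000))
```

With 35,000 instances this yields 28,000 for training and 7,000 for testing, with no instance in both sets.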
Testing & Analysis
Compared suggested vs. expected metadata
- For “disease”, “sex”, and “tissue”
- Top 3 suggestions
Measure: Reciprocal Rank (RR). Appropriate for judging systems that return a ranking of suggestions when there is only one relevant result
$\mathrm{RR} = \frac{1}{K}$
where K is the position of the expected result in the ranking of suggestions
How is the RR calculated?
Expected | Suggested | K | Reciprocal Rank (RR)
asthma | 1) asthma 2) lung cancer 3) respiratory disease | 1 | 1/1
lymphoma | 1) myeloma 2) lymphoma 3) acute myeloid leukemia | 2 | 1/2
lung cancer | 1) respiratory disease 2) asthma 3) lung cancer | 3 | 1/3
Mean Reciprocal Rank (MRR) = (1/1 + 1/2 + 1/3) / 3 = 0.61
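The worked example can be reproduced with a short sketch. The function names are hypothetical, and scoring a missing expected value as 0 is a common convention assumed here, not something the slides state.

```python
def reciprocal_rank(expected, suggestions):
    """Return 1/K, where K is the 1-based position of the expected value.

    Returns 0.0 when the expected value is not among the suggestions
    (an assumed convention).
    """
    for position, suggestion in enumerate(suggestions, start=1):
        if suggestion == expected:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(cases):
    """Average the reciprocal rank over (expected, suggestions) pairs."""
    return sum(reciprocal_rank(e, s) for e, s in cases) / len(cases)


# The three rows of the example table.
cases = [
    ("asthma", ["asthma", "lung cancer", "respiratory disease"]),
    ("lymphoma", ["myeloma", "lymphoma", "acute myeloid leukemia"]),
    ("lung cancer", ["respiratory disease", "asthma", "lung cancer"]),
]
mrr = mean_reciprocal_rank(cases)
print(round(mrr, 2))  # 0.61
```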
Results
[Figure: chart of Mean Reciprocal Rank (MRR), on a scale up to 1, for the fields disease, tissue, and sex, comparing the Baseline with the Metadata Recommender]
On average (MRR):
- Metadata Recommender = 0.77
- Baseline (majority vote) = 0.31
Better performance with respect to the baseline for:
- Fields with many different values
- Templates with many correlated fields
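A majority-vote baseline can be read as always suggesting the most frequent training values for a field, regardless of what else has been filled in. That reading is an assumption, since the slides do not define the baseline further; the function name `majority_vote` and the sample values are hypothetical.

```python
from collections import Counter


def majority_vote(training_values, top_k=3):
    """Return the top_k most frequent values seen for a field in training.

    The same list is suggested for every test instance, because the
    values of the other fields are ignored.
    """
    counts = Counter(training_values)
    return [value for value, _ in counts.most_common(top_k)]


# Hypothetical "disease" values from a training dataset.
training_diseases = [
    "asthma", "asthma", "asthma",
    "lymphoma", "lymphoma",
    "lung cancer",
    "myeloma", "myeloma", "myeloma", "myeloma",
]
baseline_suggestions = majority_vote(training_diseases)
```

Because the list is fixed, a field with many different values leaves most expected values outside the top 3, which is consistent with the baseline doing worst on such fields.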
Summary
- We developed a metadata recommendation system as part of an end-to-end system for metadata management called CEDAR
- Generates context-sensitive suggestions in real time
- Incorporates both ontology-based and free-text suggestions
Summary
Our approach makes it easier for scientists to generate high-quality metadata for experimental datasets
- So that the datasets can be found, interpreted, and reused
- Essential to ensure scientific reproducibility
http://cedar.metadatacenter.org
github.com/metadatacenter
facebook.com/metadatacenter
@metadatacenter
Channel: Metadata Center