Metadata Management for Data Integration in Medical Sciences
- Experiences from the LIFE Study -
Toralf Kirsten, Alexander Kiel, Mathias Rühle, Jonas Wagner
LIFE Research Center for Civilization Diseases, University of Leipzig
BTW, Stuttgart, 08.03.2017

Data in Medical Sciences
● Clinical care
  – Patients with specific health problems
  – Much unstructured data, e.g., anamneses, findings, discharge reports, images
  – Structured data captured or derived from unstructured data: diagnoses, procedures, etc. → goal: mostly billing
● Medical research projects
  – Recruited patients/probands
  – Pursue a specific scientific goal
  – Mostly structured data + complex types (genetic data, images, …)

LIFE Research Center
● Center at the Medical Faculty, Univ. of Leipzig
● Goal: prevalences, risk factors and development of common civilization diseases
● Different epidemiological studies
  – Two population-based cohorts (inhabitants of Leipzig)
  – Three disease-specific cohorts
● Complex data capturing processes across multiple hospitals and outpatient clinics
  – Mostly structured data capturing
  – Complex data, e.g., omics data

Multiple Input Forms (10/'16)

Assessment type        # Assessments   avg(|Input forms| / |Assessment|)   |Items|     avg(|Items| / |Assessment|)
Interview              317             3                                   18,980      59.9
Questionnaire          217             2                                   16,740      77.1
Physical examination   78              2.5                                 10,606      136
Laboratory             114             1.5                                 2,110       18.5
…                      …               …                                   …           …
Total                  > 850           2.4                                 > 51,000    66.7

Further totals shown on the slide without a recoverable column: > 1,700 (8 - 844)

● Evolution of input forms within a single input system
● Multiple input systems: online data capturing, paper-based data capturing, spreadsheets, desktop databases, ...

Evolution of Input Forms: Example
(Figure: two versions of an input form, annotated "Changes")

Evolution of Input Forms
● Problem: How can form modifications be managed, given their implications for data integration and later data analyses?
● Two alternatives
  – Single evolving input form (per input system)
  – Multiple input forms: a new form whenever a relevant modification needs to be implemented
● Example (Anthropometry)
  – Form F_V1 with schema S_V1 (lime_survey_76309), mapping M(F_V1, S_V1)
      11. Weight (in g):   76309X978X896 int
      12. Height (in m):   76309X235X972 char
      …
  – Form F_V2 with schema S_V2 (lime_survey_72354), mapping M(F_V2, S_V2)
      14. Weight in kg:    72534X673X245 int
      15. Height in cm:    72534X789X214 int
      …

Requirements
● Data capturing and analysis in parallel
● Large set of analysis projects (> 350, Jan. 2017)
● Consider data provenance
● Harmonization of schemas according to the evolution of input forms and multiple input systems
  – Study items (questions, parameters)
  – Code lists (coding of answers)
● Efficiency
  – Automatic data transfer & transformations
  – Dynamic extension of the target schema (research database)
● Further requirements: "data descriptions" used in analysis, metadata for query generation, reporting, curation ...

Problem: Integration of Input Forms
● Harmonization of study items
● Schema examples
  – Input system (Lime Survey)
      S_V1: lime_survey_76309 (Anthropometry, Form 1)
        76309X978X896 int
        76309X235X972 char
        …
      S_V2: lime_survey_72354 (Anthropometry, Form 2)
        72534X673X245 int
        72534X789X214 char
        …
  – Research database schema S_T
      T00876 (Anthropometry)
        F0001 int
        F0002 int
        …
  – Mappings in question: M(S_V1, S_T) and M(S_V2, S_T) (marked "?" on the slide)
● No application of matching techniques on schema level – the names of schema elements are mostly technical (system-generated)

Mapping-based Approach
● Two-step realization
  1) Extension of target schema T for each new assessment – first version (first input form)
  2) Mapping all further forms (v_i > 1) to the succeeding form and reusing existing schema mappings M
● Central idea: transforming the schema mapping problem into a form mapping problem
  – Diagram: forms F_vi and F_vi+1 on the form level, schemas S_vi, S_vi+1 and S_T on the schema level, with a duality between form and schema

Step 1: Mapping of the First Form Version
● Input system
  – Form F_V1 (Anthropometry): 11. Weight:, 12. Height:, …
  – Schema S_V1 (lime_survey_76309)
      76309X978X896 int
      76309X235X972 char
      …
● Research database
  – Form F_T (Anthropometry): Weight, Height, …
  – Schema S_T (T00876)
      F0001 int
      F0002 int (source type char, converted)
      …
● Transformation function: to_number()
● Derive schema mapping M(S_V1, S_T) by mapping composition
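The mapping composition named in Step 1 can be sketched as follows. This is only an illustration under stated assumptions: the three dictionaries, the function name compose and the way the transformation is recorded are hypothetical; only the column identifiers, the table T00876 and to_number() come from the slide.

```python
# Hypothetical metadata for form version 1 (identifiers taken from the slide).
# Source column -> form item (mapping between schema S_V1 and form F_V1).
column_to_item_v1 = {
    "76309X978X896": "Weight",
    "76309X235X972": "Height",
}
# Form item -> target form item (F_V1 to F_T; identical for the first version).
item_to_target_item = {"Weight": "Weight", "Height": "Height"}
# Target form item -> (target column in T00876, name of transformation or None).
target_item_to_column = {
    "Weight": ("F0001", None),
    "Height": ("F0002", "to_number"),  # char in the source, int in the target
}


def compose(column_to_item, item_to_target_item, target_item_to_column):
    """Compose the three mappings into a source-column -> target-column mapping."""
    schema_mapping = {}
    for column, item in column_to_item.items():
        target_item = item_to_target_item.get(item)
        if target_item is None or target_item not in target_item_to_column:
            continue  # unmapped items are left for a manual check
        schema_mapping[column] = target_item_to_column[target_item]
    return schema_mapping


schema_mapping_v1 = compose(
    column_to_item_v1, item_to_target_item, target_item_to_column
)
```

The point of the design is that a curator only confirms correspondences between readable form items; the column-level mapping falls out of the composition.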

Step 2: Mapping of Form Versions > 1
● Input system
  – Form F_V1 (Anthropometry): 11. Weight:, 12. Height:, …
  – Schema S_V1 (lime_survey_76309)
      76309X978X896 int
      76309X235X972 char
      …
  – Form F_V2 (Anthropometry): 14. Weight in kg:, 15. Height in cm:, …
  – Schema S_V2 (lime_survey_72354)
      72534X673X245 int
      72534X789X214 char
      …
● Research database
  – Form F_T (Anthropometry): Weight in kg:, Height in cm:, …
  – Schema S_T (T00876)
      F0001 int
      F0002 int
      …
● Transformation function to_number() applied for both form versions
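A minimal sketch of how an existing schema mapping might be reused for a later form version. It assumes that items of version 2 have already been matched to items of version 1 and that each form item corresponds to exactly one source column; the dictionary layout and the function derive_schema_mapping are hypothetical, and unit changes between versions are not modeled.

```python
# Existing schema mapping for version 1 (source column -> target column in T00876).
schema_mapping_v1 = {"76309X978X896": "F0001", "76309X235X972": "F0002"}

# Form item -> source column for both form versions (identifiers from the slide).
item_to_column_v1 = {
    "11. Weight": "76309X978X896",
    "12. Height": "76309X235X972",
}
item_to_column_v2 = {
    "14. Weight in kg": "72534X673X245",
    "15. Height in cm": "72534X789X214",
}

# Form mapping between the versions, e.g. a checked result of form matching.
form_mapping_v2_to_v1 = {
    "14. Weight in kg": "11. Weight",
    "15. Height in cm": "12. Height",
}


def derive_schema_mapping(form_mapping, new_columns, old_columns, old_schema_mapping):
    """Derive the schema mapping of a new form version from the previous one."""
    derived = {}
    for new_item, old_item in form_mapping.items():
        old_column = old_columns[old_item]
        if old_column in old_schema_mapping:
            derived[new_columns[new_item]] = old_schema_mapping[old_column]
    return derived


schema_mapping_v2 = derive_schema_mapping(
    form_mapping_v2_to_v1, item_to_column_v2, item_to_column_v1, schema_mapping_v1
)
```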

Form Matching
● The match process takes item descriptions into account: question, parameter name
● Different matchers calculate the similarity between two items, e.g.,
  – String-based similarity: n-gram, Levenshtein, …
  – Set-based similarity: Jaccard, ...
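The matchers listed above can be illustrated with standard definitions. The sketch below is not the LIFE implementation; the padding character, lower-casing and function names are assumptions chosen for the example.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased, padded string."""
    padded = "#" * (n - 1) + text.lower() + "#" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity of the n-gram sets: |X & Y| / |X | Y|."""
    x, y = ngrams(a, n), ngrams(b, n)
    return len(x & y) / len(x | y) if x | y else 1.0


def dice(a, b, n=3):
    """Dice similarity of the n-gram sets: 2 |X & Y| / (|X| + |Y|)."""
    x, y = ngrams(a, n), ngrams(b, n)
    return 2 * len(x & y) / (len(x) + len(y)) if x or y else 1.0


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Normalize the edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0
```

An item pair such as "Weight" and "Weight in kg" would be accepted as a match candidate when the chosen similarity exceeds a threshold; the threshold itself is a tuning parameter.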

Blocking
● Basic idea: reducing the number of item-to-item comparisons without losing quality
● Different blocking strategies
● In LIFE
  – Recurring item groups, e.g., the questions asked for each drug (medication)
  – Item groups are typically unmodified in succeeding forms
● Block → item group (block key → group name)
  – Compare only the items of two blocks that belong to succeeding input forms and have the same block key
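A small sketch of blocking by item group, assuming each item carries its group name under a hypothetical key "group". The example forms are made up for illustration.

```python
from collections import defaultdict
from itertools import product


def block_by_group(items):
    """Group items by their block key (here: the item group name)."""
    blocks = defaultdict(list)
    for item in items:
        blocks[item["group"]].append(item)
    return blocks


def candidate_pairs(old_items, new_items):
    """Yield only item pairs whose blocks share the same block key."""
    old_blocks = block_by_group(old_items)
    new_blocks = block_by_group(new_items)
    for key in old_blocks.keys() & new_blocks.keys():
        yield from product(old_blocks[key], new_blocks[key])


# Made-up example forms with two item groups each.
old_form = [
    {"group": "Anthropometry", "text": "Weight"},
    {"group": "Anthropometry", "text": "Height"},
    {"group": "Medication", "text": "Drug name"},
    {"group": "Medication", "text": "Dose"},
]
new_form = [
    {"group": "Anthropometry", "text": "Weight in kg"},
    {"group": "Anthropometry", "text": "Height in cm"},
    {"group": "Medication", "text": "Drug name"},
    {"group": "Medication", "text": "Daily dose"},
]

blocked = list(candidate_pairs(old_form, new_form))
brute_force = len(old_form) * len(new_form)
print(len(blocked), "of", brute_force, "comparisons")
```

In this toy case blocking leaves 8 of the 16 brute-force comparisons; the saving grows with the number of item groups per form.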

Data Type Mappings
● Mapping data types when extracting data from a source system and storing them in a target DB
  – Different DBMS-specific data types, e.g., TEXT (MySQL), VARCHAR2, LONG (ORACLE)
  – Implementation: type [length|precision[, scale]], e.g., VARCHAR2(20), INT(1), DECIMAL(5, 3)
● Building data type patterns
● Map data type patterns of sources to the target DB

Source data type pattern (MySQL)   Target data type pattern (ORACLE)
VARCHAR(<LENGTH>)                  VARCHAR2(<LENGTH>)
TEXT                               CLOB
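One possible way to apply such data type patterns is with regular expressions. The sketch below covers only the two pattern pairs shown on the slide; the list TYPE_PATTERNS and the function map_data_type are hypothetical names, and a real repository would hold many more patterns.

```python
import re

# (source pattern for MySQL, target template for ORACLE), taken from the slide.
TYPE_PATTERNS = [
    (re.compile(r"^VARCHAR\((?P<LENGTH>\d+)\)$", re.IGNORECASE), "VARCHAR2({LENGTH})"),
    (re.compile(r"^TEXT$", re.IGNORECASE), "CLOB"),
]


def map_data_type(source_type):
    """Translate a concrete source data type using the first matching pattern."""
    normalized = source_type.replace(" ", "")
    for pattern, template in TYPE_PATTERNS:
        match = pattern.match(normalized)
        if match:
            # Fill placeholders such as <LENGTH> with the captured values.
            return template.format(**match.groupdict())
    raise ValueError(f"no data type pattern for {source_type!r}")
```

Raising an error for unknown types, rather than guessing a default, keeps unmapped source types visible for curation.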

Data Provenance
● Multiple input forms per assessment
● Key question in LIFE: which data have been produced by which input system, and by which input form F_x?
● Idea:
  – Associate an identifier with each form in the metadata (MD)
  – Store the form identifier in the target table at instance level
● Example
  – S_V1 (lime_survey_76309): 76309X978X896 int, 76309X235X972 char, …; form identifier DQP-01-8767-01
  – S_V2 (lime_survey_72354): 72534X673X245 int, 72534X789X214 char, …; form identifier DQP-01-8767-02
  – S_T (T00876): F0001 int, F0002 int, …, form_identifier
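The idea of keeping the form identifier with every loaded instance can be sketched as below. The function load_rows, the row layout and the sample values are hypothetical; only the column names, the column form_identifier and the identifier format come from the slide.

```python
def load_rows(source_rows, schema_mapping, form_identifier):
    """Rename source columns to target columns and tag each row with its form."""
    target_rows = []
    for row in source_rows:
        target = {
            schema_mapping[column]: value
            for column, value in row.items()
            if column in schema_mapping
        }
        # Provenance: record which input form produced this instance.
        target["form_identifier"] = form_identifier
        target_rows.append(target)
    return target_rows


# Made-up sample row captured with form version 1.
rows_v1 = load_rows(
    [{"76309X978X896": 81500, "76309X235X972": "1.84"}],
    {"76309X978X896": "F0001", "76309X235X972": "F0002"},
    "DQP-01-8767-01",
)
```

With the identifier stored per row, an analysis can later filter or stratify by the input form that produced the data.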

Evaluation
● Setup
  – Use all checked mappings as gold standard
  – Map all input forms per assessment in chronological order
  – Evaluate match quality – no user adaptations of descriptions, aliasing, etc.
● Data set: 1,166 forms, 327 assessments

Evaluation Results: Quality
● Trigram-Jaccard (string) has the best precision but the worst recall
● Trigram-Dice has the best F-measure for nearly every threshold
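For reference, precision, recall and F-measure against a gold standard of checked mappings follow the usual definitions. The sketch below assumes correspondences are represented as (source item, target item) pairs; the function name evaluate and the example pairs are made up.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, F-measure) of predicted vs. gold correspondences."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up example: 4 gold correspondences, 3 predicted, 2 of them correct.
gold = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
precision, recall, f_measure = evaluate(predicted, gold)
```

Raising the similarity threshold of a matcher typically trades recall for precision, which is why the F-measure is reported per threshold.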

Evaluation Results: Blocking
● Different blocking strategies
● Brute force = vector of all items
● Largest reduction when blocking is based on item groups
● Reduction factor 1,838
● No significant loss of quality when blocking is used

Metadata Repository
● Sometimes called data dictionary
● Central collection of
  – Source metadata (MD)
  – Assessments and input forms, code lists, data types
  – Mappings on different levels
● Used for
  – Extraction, transformation & loading
  – Query generation
  – Reporting
  – Curation (in close connection with R)

Conclusions
● LIFE: epidemiological study with a large set of assessments
  – Evolving input forms (multiple forms per assessment)
  – Different input systems
● Need for harmonization
● Matching input forms → derive schema mappings
  – Automatic generation
  – Manual check & adaptation (if necessary)
● Scientific evaluation
● Running in production mode for 5 years

Thank You
