
Designing an institutional research data management infrastructure - PowerPoint PPT Presentation

Designing an institutional research data management infrastructure for the life sciences Paul van Schayck PhD student, data steward Maastricht University Medical Center + p.vanschayck@maastrichtuniversity.nl


  1. Designing an institutional research data management infrastructure for the life sciences
     Paul van Schayck, PhD student, data steward
     Maastricht University Medical Center+
     p.vanschayck@maastrichtuniversity.nl
     https://datahub.mumc.maastrichtuniversity.nl
     Peter Debyelaan 15, 6229 HX Maastricht
     P.O. Box 616, 6200 MD Maastricht
     The Netherlands

  2. Providing Research Data Management services for:
     Life Sciences Faculty
     • Independent research groups
     • Heterogeneous (meta)data
     • Right incentives
     Academic Hospital
     • Patient privacy
     • Electronic Health Records
     • Bridging organisations

  3. Life science background
     • Life science depends more and more on the collection and analysis of comprehensive datasets.
     • 'Small Science': life science is performed in small, temporary project groups.
     • Open Science: there is an urgent call for more open, transparent and reproducible science.

  4. DataHub characteristics
     • FAIR-inspired from the start.
     • Open source where possible.
     • (Meta)data structuring and ontology enrichment.
     • Project data structuring: hierarchical organisation in projects and datasets.
     • Faceted search: Lucene- and ontology-powered, authorization-controlled.
     • High volume: the infrastructure has been designed and tested with petabyte scale and high throughput in mind.

  5. [Architecture diagram] DataHub 2.0.0
     Labels in the figure: Healthcare; Scientist; Research data governance; EHR; Ontology Lookup Service; CrossRef Lookup; ePIC Persistent Identifiers; Master Person Index; ELK; HL7 CDA export; HL7v3; DataHub core services; Frontend; Extract, Transform, Load; XML; Faceted search UI; Data Warehouse; API; Life science metadata; Web portal; Transform; External Repository; REST API; Rule based object store; Data drop zone; Browser; Hitachi NAS storage; Replication storage; WebDAV.

  6. Authentication (federated)
     Providing federated authentication by two methods: proxy user and temporary password.
     Diagram labels: Web Portal; IdP login; SAML-based SSO; irods-php via proxy user; Generate iRODS temporary password; WebDAV; Browser; user@domain.tld#nlmumc; temporary password.
     Outstanding issue:
     • Automated handling of user provisioning/expiration
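The login name on this slide, user@domain.tld#nlmumc, appears to follow the iRODS user#zone convention, with the federated identity as the user part and nlmumc as the zone. A minimal sketch of splitting such a name, assuming that reading; the `IrodsLogin` type and `parse_login` function are hypothetical names used for illustration and are not part of DataHub or iRODS:

```python
from typing import NamedTuple


class IrodsLogin(NamedTuple):
    """Hypothetical container for the two halves of a user#zone login name."""
    user: str
    zone: str


def parse_login(name: str) -> IrodsLogin:
    """Split 'user@domain.tld#zone' at the last '#'.

    Assumption: the zone never contains '#', so the last '#' is the separator.
    """
    user, sep, zone = name.rpartition("#")
    if not sep or not user or not zone:
        raise ValueError(f"expected 'user#zone', got {name!r}")
    return IrodsLogin(user=user, zone=zone)


if __name__ == "__main__":
    print(parse_login("user@domain.tld#nlmumc"))
```

Splitting at the last '#' keeps the '@' of the federated identity inside the user part, which is what the slide's example suggests.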

  7. Ingesting high volume data
     An SMB/CIFS network share, connected as an iRODS mounted collection, is ingested into iRODS using msiCollRsync.
     Diagram labels: iRODS Collection; SMB/CIFS Network Share; iRODS Mounted Collection; msiPhyPathReg; msiCollRsync; cifs-utils; data; Web Portal; irods-php.
     Advantage:
     • No extra (client) software for users
     • SMB/CIFS performs very well
     Disadvantage:
     • Not compatible with federated authentication
     • msiCollRsync not performing (yet)

  8. Project collection structure
     Providing a generic project collection hierarchy with no assumptions.
     Diagram labels: /nlmumc/projects/; P000000001, P000000002, P000000003; C000000001, C000000002, C000000003; Dataset: any number of files and directories.
     • Unidentifiable collection names
     • Virtual collections?
     • Title AVUs on Project and Collections
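The naming scheme on this slide (a letter followed by nine zero-padded digits, under /nlmumc/projects/) can be sketched in a few lines. This is an illustration only: it assumes collections are nested directly under their project, as the "Project collection" label on the next slide suggests, and all function names are hypothetical.

```python
import re

PROJECTS_ROOT = "/nlmumc/projects"  # root path taken from the slide

# Assumed layout: /nlmumc/projects/P#########[/C#########]
_PATH_RE = re.compile(r"/nlmumc/projects/(P\d{9})(?:/(C\d{9}))?")


def _numbered(prefix: str, n: int) -> str:
    """Format an identifier such as P000000001 (prefix + 9 zero-padded digits)."""
    if not 1 <= n <= 999_999_999:
        raise ValueError(f"number out of range for a 9-digit identifier: {n}")
    return f"{prefix}{n:09d}"


def project_id(n: int) -> str:
    return _numbered("P", n)


def collection_id(n: int) -> str:
    return _numbered("C", n)


def collection_path(project: int, collection: int) -> str:
    """Build the full path of a project collection (nesting is an assumption)."""
    return f"{PROJECTS_ROOT}/{project_id(project)}/{collection_id(collection)}"


def parse_path(path: str):
    """Return (project_id, collection_id or None), or None if the path does not fit."""
    m = _PATH_RE.fullmatch(path)
    return (m.group(1), m.group(2)) if m else None


if __name__ == "__main__":
    print(collection_path(1, 1))
    print(parse_path("/nlmumc/projects/P000000002"))
```

Because the names carry no meaning, a human-readable title has to come from elsewhere, which is the role the slide gives to title AVUs.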

  9. Project authorization
     Keeping data authorization in iRODS, using the rule engine to enforce policies.
     Diagram labels: Project: P000000001; Project collection: C000000001; manager (own); contributor (write); reader (read); Can create; Can assign; inherited; Open phase; Closed phase (read only).
     Disadvantages:
     • Only on project level
     • Too simplistic?
     Note: iRODS groups are organizational units (departments).
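One possible reading of this slide is a fixed mapping from project role to iRODS access level, which a project collection inherits while it is open and which drops to read-only once it is closed. The sketch below encodes that reading in plain Python. It is an assumption about the policy, not the DataHub rule-engine code; the names `Phase`, `ROLE_ACCESS` and `collection_access` are hypothetical.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    OPEN = "open"
    CLOSED = "closed"


# Role-to-access mapping as read from the slide: manager/own, contributor/write, reader/read.
ROLE_ACCESS = {"manager": "own", "contributor": "write", "reader": "read"}


def collection_access(role: str, phase: Phase) -> Optional[str]:
    """Return the access level a role has on a project collection.

    Assumptions: in the open phase the collection inherits the project-level
    access; in the closed phase every known role is reduced to 'read'.
    Unknown roles get no access (None).
    """
    base = ROLE_ACCESS.get(role)
    if base is None:
        return None
    if phase is Phase.CLOSED:
        return "read"
    return base


if __name__ == "__main__":
    for role in ("manager", "contributor", "reader", "guest"):
        print(role, collection_access(role, Phase.OPEN), collection_access(role, Phase.CLOSED))
```

A table this small also illustrates the disadvantage the slide names: permissions are decided per project, with no finer-grained control per collection or file.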

  10. Metadata modeling: being FAIR
      Helping users early with annotating data in a FAIR way.
      Diagram labels: Ontology Lookup Service; ePIC Persistent Identifiers; CrossRef Lookup; C000000001; ETL; Islandora XML forms; metadata.xml; Validation.
      Project -> Investigation -> Sample -> Assay (PISA)
      • Inspired by ISA tools, compatible with HCLS
      • Implemented at Project and Investigation level
      • Descriptive metadata stored in a file (!), AVUs for system metadata

  11. Metadata indexing
      Providing a user-friendly faceted search interface for data findability.
      Diagram labels: metadata.xml; Transform; Data Warehouse; REST API; API; DWH frontend; Ontology Lookup Service; CrossRef Lookup.
      • Indexed in SOLR:
        – All metadata
        – Semantics (OLS)
        – References (CrossRef)
        – Authorization on data (iRODS)
      • Rebuild on demand
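To make the idea of an authorization-aware faceted search concrete, the following sketch builds the parameters for a Solr select request using the standard `q`, `fq`, `facet` and `facet.field` parameters. How DataHub actually queries its index is not stated on the slide; the field name `acl_read`, the facet field names, and the base URL are all hypothetical.

```python
from urllib.parse import urlencode


def build_facet_query(text, facet_fields, allowed_users):
    """Return Solr query parameters as a list of (name, value) pairs.

    Assumption: each indexed document carries a multi-valued field 'acl_read'
    (hypothetical name) listing who may read it, so a filter query can hide
    documents the caller is not authorized to see.
    """
    params = [("q", text.strip() or "*:*"), ("wt", "json"), ("facet", "true")]
    for field in facet_fields:
        params.append(("facet.field", field))  # one facet.field per requested facet
    if allowed_users:
        clause = " OR ".join(f'"{user}"' for user in allowed_users)
        params.append(("fq", f"acl_read:({clause})"))
    return params


def to_url(base, params):
    """Append URL-encoded parameters to a Solr select endpoint."""
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    query = build_facet_query("zebrafish", ["organism", "technique"], ["alice"])
    # Placeholder host and core name, for illustration only.
    print(to_url("http://localhost:8983/solr/metadata/select", query))
```

Putting the authorization check in a filter query rather than in the main query keeps relevance scoring unaffected by who is asking.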

  12. Metadata: making use of semantics
      • Autocomplete for ontology terms
      • Ontology-derived faceted search

  13. DTAP: deployment for development
      Diagram labels: External; Web portal; WebDAV; REST API; DWH frontend; CrossRef Lookup; Ontology Lookup Service; Browser.
      Highlights
      • 16 interacting containers for the full environment
      • Runnable from a laptop
      Challenge
      • Interactions with external services (AD, NAS storage)

  14. DTAP: deployment for acceptance/production
      Diagram labels: External; WebDAV; REST API; DWH frontend; CrossRef Lookup; Ontology Lookup Service; Web portal; Browser.
      Challenge
      • Differences in deployments and some environments

  15. Today's challenge in the data life cycle
      Active data
      • Phases:
        • Create
        • Process
        • Analyse
      • Highly specific solutions
      Preserved data
      • Phases:
        • Archive
        • Access
        • Re-use
      • Generic RDM repositories
      • Domain specific repositories
      BRIDGE THE GAP!

  16. Lessons learned
      1. Dual position of staff. Decentralize data stewards
      2. Microservice approach
      3. Remote Procedure Calls for rules
      4. Funding for long-term storage is hard…
      5. Open-source, reusable parts: https://github.com/MaastrichtUniversity

  17. Questions?
      Paul van Schayck, PhD student, data steward
      Maastricht University Medical Center+
      p.vanschayck@maastrichtuniversity.nl
      Peter Debyelaan 15, 6229 HX Maastricht
      P.O. Box 616, 6200 MD Maastricht
      The Netherlands

      Pascal Suppers, Managing Director
      DataHub Maastricht
      https://datahub.mumc.maastrichtuniversity.nl
      P. Debyelaan 15, L. van Kleeftoren, 2nd floor (route 11)
      6229 HX Maastricht
      The Netherlands
      T +31 6 27 07 16 54
      E p.suppers@maastrichtuniversity.nl
