Designing an institutional research data management infrastructure


SLIDE 1

Designing an institutional research data management infrastructure for the life sciences

Paul van Schayck
PhD student, data steward
Maastricht University Medical Center+
p.vanschayck@maastrichtuniversity.nl
https://datahub.mumc.maastrichtuniversity.nl
Peter Debyelaan 15, 6229 HX Maastricht
P.O. Box 616, 6200 MD Maastricht
The Netherlands

SLIDE 2

Providing Research Data Management services for

Life Sciences Faculty

  • Independent research groups
  • Heterogeneous (meta)data
  • Right incentives

Academic Hospital

  • Patient privacy
  • Electronic Health Records
  • Bridging organisations

SLIDE 3

Life science background

  • Open Science. There is an urgent call for more open, transparent and reproducible science.
  • ‘Small Science’. Life science is performed in small temporary project groups.
  • Life science depends more and more on the collection and analysis of comprehensive datasets.

SLIDE 4

DataHub characteristics

  • FAIR-inspired from the start.
  • Open-source where possible.
  • (Meta)data structuring and ontology enrichment.
  • Project data structuring: hierarchical organisation in projects and datasets.
  • Faceted search: Lucene- and ontology-powered, authorization-controlled.
  • High volume: the infrastructure has been designed and tested with petabyte scale and high throughput in mind.

SLIDE 5

DataHub core services (DataHub 2.0.0)

[Architecture diagram. Labelled components: Healthcare; Scientist; Life science; Research data governance; Data drop zone; Web portal; metadata XML; Faceted search UI; Browser; HL7 CDA export; EHR; Extract, Transform, Load; Rule based; Object store; Hitachi NAS storage; Replication storage; HL7v3 XML; Data Warehouse; Ontology Lookup Service API; REST API; Master Person Index; Transform; External Repository; WebDAV; ePIC Persistent Identifiers; ELK; CrossRef Lookup; Frontend]

SLIDE 6

Authentication (federated)

Providing federated authentication in two methods: proxy-user and temporary password

[Diagram labels: Browser; IdP login; Web Portal; SAML based SSO; irods-php via proxy user; Generate iRODS temporary password; temporary password; WebDAV; user@domain.tld#nlmumc]

Outstanding issue:

  • Automated handling of user provisioning/expiration
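
The user name on this slide, user@domain.tld#nlmumc, combines an e-mail-style identity with a zone name after the '#'. A minimal sketch of splitting such a name is given here. The function and type names are hypothetical, and the assumption that the text before the '#' is the identity released by the IdP is not confirmed by the slides.

```python
from typing import NamedTuple


class FederatedUser(NamedTuple):
    """Parts of a user name written as local@domain#zone."""
    local_part: str
    domain: str
    zone: str


def parse_federated_user(name: str) -> FederatedUser:
    """Split a name such as 'user@domain.tld#nlmumc' into its three parts.

    Assumption: the text after the last '#' is the iRODS zone and the
    text before it is an e-mail-style identity.
    """
    identity, sep, zone = name.rpartition("#")
    if not sep or not identity or not zone:
        raise ValueError(f"expected 'user@domain#zone', got {name!r}")
    local_part, at, domain = identity.partition("@")
    if not at or not local_part or not domain:
        raise ValueError(f"expected 'user@domain' before '#', got {identity!r}")
    return FederatedUser(local_part, domain, zone)
```

Calling parse_federated_user("user@domain.tld#nlmumc") yields the local part, the domain and the zone as separate fields, which a provisioning script could use to decide whether an account belongs to the local zone.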

SLIDE 7

Ingesting high volume data

An SMB/CIFS network share, connected as an iRODS mounted collection, is ingested into iRODS using msiCollRsync

[Diagram labels: data; SMB/CIFS Network Share; cifs-utils; Web Portal; irods-php; iRODS Mounted Collection; msiPhyPathReg; msiCollRsync; iRODS Collection]

Advantages:

  • No extra (client) software for users
  • SMB/CIFS performs very well

Disadvantages:

  • Not compatible with federated authentication
  • msiCollRsync not performing (yet)

SLIDE 8

Project collection structure

Providing a generic project collection hierarchy with no assumptions

/nlmumc/projects/
    P000000001
        C000000001
        C000000002
        C000000003
    P000000002
    P000000003

Dataset: any number of files and directories

  • Unidentifiable collection names
  • Virtual collections?
  • Title AVUs on Projects and Collections
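
The naming scheme on this slide (P000000001, C000000001 under /nlmumc/projects/) can be sketched as a small helper. This is only an illustration: the nine-digit width is read off the examples, while the "highest existing number plus one" allocation rule and all function names are assumptions, not the DataHub implementation.

```python
import re
from typing import Iterable

PROJECTS_ROOT = "/nlmumc/projects"  # path shown on the slide
ID_WIDTH = 9                        # digits after the prefix, as in P000000001


def format_id(prefix: str, number: int) -> str:
    """Return an opaque identifier such as P000000001 or C000000003."""
    if number < 1 or number >= 10 ** ID_WIDTH:
        raise ValueError(f"number out of range: {number}")
    return f"{prefix}{number:0{ID_WIDTH}d}"


def next_id(prefix: str, existing: Iterable[str]) -> str:
    """Return the identifier following the highest existing one.

    Assumption: identifiers are allocated sequentially; names that do
    not match the pattern are ignored.
    """
    pattern = re.compile(rf"{re.escape(prefix)}(\d{{{ID_WIDTH}}})")
    matches = (pattern.fullmatch(name) for name in existing)
    numbers = [int(m.group(1)) for m in matches if m]
    return format_id(prefix, max(numbers, default=0) + 1)


def collection_path(project_id: str, collection_id: str) -> str:
    """Build the full path of a project collection."""
    return f"{PROJECTS_ROOT}/{project_id}/{collection_id}"
```

Because the names carry no meaning, a human-readable title has to live elsewhere, which matches the slide's note about title AVUs on projects and collections.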

SLIDE 9

Project authorization

Keeping data authorization in iRODS using the rule engine to enforce policies

Project: P000000001
Project collection: C000000001

[Diagram labels: read; write; own; manager; contributor; reader; Closed phase; Open phase; inherited; read only; Can create; Can assign]

Note: iRODS groups are organizational units (departments)

Disadvantages:

  • Only on project level
  • Too simplistic?

SLIDE 10

Metadata modeling: being FAIR

Helping users annotate their data FAIR from an early stage

Project -> Investigation -> Sample -> Assay (PISA)

  • Inspired by ISA tools, compatible with HCLS
  • Implemented at Project and Investigation level
  • Descriptive metadata stored in a file (!), AVUs for system metadata

[Diagram labels: Islandora XML forms; Validation; Ontology Lookup Service; ePIC Persistent Identifiers; CrossRef Lookup; metadata.xml; C000000001; ETL]
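
The split between descriptive metadata in a file and system metadata in AVUs can be illustrated with a short sketch. The XML layout, element names and the "creator" attribute below are hypothetical; the slides do not show the actual metadata.xml schema. Only the idea of a metadata.xml file, title AVUs, and AVUs as attribute/value/unit triples comes from the deck and from iRODS itself.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# Hypothetical example file; the real metadata.xml schema is not shown in the slides.
SAMPLE_METADATA_XML = """\
<metadata>
  <title>Example imaging dataset</title>
  <description>Placeholder description for illustration.</description>
  <tissue>liver</tissue>
</metadata>
"""


def descriptive_metadata(xml_text: str) -> Dict[str, str]:
    """Read the top-level elements of a metadata.xml document into a dict."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def system_avus(title: str, creator: str) -> List[Tuple[str, str, str]]:
    """Return (attribute, value, unit) triples for system metadata.

    Assumption: a title and a creator are kept as AVUs; the unit is unused.
    """
    return [("title", title, ""), ("creator", creator, "")]
```

A usage example: descriptive_metadata(SAMPLE_METADATA_XML)["title"] gives the title string, which can then be passed to system_avus so the opaque collection name gets a readable label.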

SLIDE 11

Metadata indexing

Providing a user-friendly faceted search interface for data findability

  • Indexed in SOLR:
      – All metadata
      – Semantics (OLS)
      – References (CrossRef)
      – Authorization on data (iRODS)
  • Rebuild on demand

[Diagram labels: metadata.xml; Transform; Ontology Lookup Service; CrossRef Lookup; REST API; Data Warehouse; API; DWH frontend]
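
One way to picture an index that holds metadata, semantics and authorization together is a single search document per collection plus a per-user filter clause. The sketch below builds plain dictionaries and a Lucene-style field:"value" clause; it does not talk to a search server. All field names are hypothetical, and whether DataHub stores permissions in the index this way is an assumption.

```python
from typing import Dict, Iterable


def build_index_document(
    collection_id: str,
    metadata: Dict[str, str],
    ontology_labels: Iterable[str],
    readers: Iterable[str],
) -> dict:
    """Combine metadata, ontology labels and read permissions in one document."""
    doc: dict = {"id": collection_id}
    for field, value in metadata.items():
        doc[f"meta_{field}"] = value                       # descriptive metadata
    doc["ontology_label"] = sorted(set(ontology_labels))   # semantic enrichment
    doc["read_access"] = sorted(set(readers))              # who may see this hit
    return doc


def access_filter(user: str) -> str:
    """Return a filter clause restricting hits to documents the user may read."""
    escaped = user.replace("\\", "\\\\").replace('"', '\\"')
    return f'read_access:"{escaped}"'
```

Keeping the permission list inside each document means the index must be refreshed when permissions change, which fits the slide's "Rebuild on demand" bullet.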

SLIDE 12

Metadata: making use of semantics

  • Ontology-derived faceted search
  • Autocomplete for ontology terms

SLIDE 13

DTAP: deployment for development

Highlights

  • 16 interacting containers for full environment
  • Runnable from laptop

Challenge

  • Interactions with external services (AD, NAS storage)

[Diagram labels: Browser; Web portal; DWH frontend; REST API; WebDAV; Ontology Lookup Service; CrossRef Lookup; External]

SLIDE 14

DTAP: deployment for acceptance/production

Challenge

  • Differences in deployments and some environments

[Diagram labels: Browser; Web portal; DWH frontend; REST API; WebDAV; Ontology Lookup Service; CrossRef Lookup; External]

SLIDE 15

Today's challenge in the data life cycle

Active data

  • Phases:
      – Create
      – Process
      – Analyse
  • Highly specific RDM solutions

Preserved data

  • Phases:
      – Archive
      – Access
      – Re-use
  • Generic repositories
  • Domain specific repositories

BRIDGE THE GAP!

SLIDE 16

Lessons learned

  1. Dual position of staff. Decentralize data stewards
  2. Microservice approach
  3. Remote Procedure Calls for rules
  4. Funding for long term storage is hard…
  5. Open Source reusable parts

https://github.com/MaastrichtUniversity

SLIDE 17

Questions?

Pascal Suppers
Managing Director DataHub Maastricht
P. Debyelaan 15, L. van Kleeftoren, 2nd floor (route 11)
6229 HX Maastricht
The Netherlands
T +31 6 27 07 16 54
E p.suppers@maastrichtuniversity.nl

Paul van Schayck
PhD student, data steward
Maastricht University Medical Center+
p.vanschayck@maastrichtuniversity.nl
https://datahub.mumc.maastrichtuniversity.nl
Peter Debyelaan 15, 6229 HX Maastricht
P.O. Box 616, 6200 MD Maastricht
The Netherlands