Cloud-enabled Virtual Environment for Health Research & Education

Cloud-enabled Virtual Environment for Health Research & Education Subha Madhavan, PhD, FACMI Chief Data Scientist Georgetown University Medical Center Innovative Approaches to Cloud Computing - Overcoming the Challenges of Health Data


SLIDE 1

HEALTH INFORMATICS

Cloud-enabled Virtual Environment for Health Research & Education

Subha Madhavan, PhD, FACMI
Chief Data Scientist
Georgetown University Medical Center

Innovative Approaches to Cloud Computing - Overcoming the Challenges of Health Data Storage and Computing Infrastructures
Health Datapalooza 2020, Feb 11, 2020

SLIDE 2

Challenges

Practical challenges of modern biomedical research:

  • The ecosystem is complex and poses governance challenges.
  • Datasets are outgrowing local infrastructure, inhibiting researchers’ ability to maintain them.
  • Compute requirements to process these large datasets are exceeding local capacity, inhibiting analysis.
  • Downloading large-scale data to a local computer for analysis may be more difficult than bringing computational capability to where the data is located.
  • Collaborating on research projects across organizations can be challenging due to differences in local IT environments.
  • Data sharing is a critical need, but privacy and information security are paramount.
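
To make the data-movement point concrete, here is a back-of-the-envelope sketch of how long a bulk download takes. The dataset size and link speed are illustrative assumptions, not figures from the talk.

```python
def transfer_days(size_terabytes: float, link_megabits_per_s: float,
                  efficiency: float = 1.0) -> float:
    """Estimate the days needed to move a dataset over a network link.

    Uses decimal units (1 TB = 10**12 bytes, 1 Mbit = 10**6 bits).
    `efficiency` is the assumed fraction of nominal bandwidth actually achieved.
    """
    if size_terabytes < 0 or link_megabits_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, bandwidth > 0, efficiency in (0, 1]")
    bits = size_terabytes * 1e12 * 8
    seconds = bits / (link_megabits_per_s * 1e6 * efficiency)
    return seconds / 86_400  # seconds per day


# Hypothetical example: a 10 TB dataset over a fully utilized 100 Mbit/s link.
days = transfer_days(10, 100)
print(f"{days:.1f} days")  # prints "9.3 days"
```

Under these assumed numbers the copy alone takes more than a week, which is the kind of cost that favors running the analysis next to the data.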

SLIDE 3

Emerging solutions

SLIDE 4

Challenges at Academic Medical Centers

  • All GCP projects stand alone (no assigned Org)
    – Users create standalone non-GU identities
    – IT or the user grants “Editor” rights to GCP projects
    – Users log in with a Google username/password
  • Agreement needed for access to research credits (Institution)
    – GCP billing accounts not tied to GU cost centers
    – No path to SSO for collaboration and access management
  • IT control of GU Domain for G-Suite Org
    – Enabled Identity and GCP services in G-Suite Org
    – Established SSO for G-Suite
    – Unified billing (still under development)
    – Began migrating GCP projects and Google identities into G-Suite Org
    – On-boarding process for GCP not fully streamlined yet

SLIDE 5

Why do we need a VRE? Three primary uses

  • PHI
    – Restricted access
    – PI/study driven
    – Tools/compute
  • Collaboration
    – Research networks
    – Collaboration
    – Tools/compute
  • Education
    – Training data sets
    – Public data sets
    – Tools/compute

SLIDE 6

Virtual Research Environment Running on Google Cloud Platform (GCP)

SLIDE 7

Virtual Research Environment (VRE)

SLIDE 8

DATA ANALYSIS AND COLLABORATION ON GCP

SLIDE 9

Data Analysis with Jupyter Notebook

  • Provides a convenient user interface for performing analysis and data visualization
  • Supports Python and R
  • Can be launched preinstalled on whatever size machine you think you will need
    – From 1 vCPU to 160 vCPU and from 3.75 GB to 3,844 GB of RAM

SLIDE 10

Google BigQuery

  • Serverless Data Warehouse
    – Allows you to set up a SQL data warehouse easily without any complicated configuration
    – Supports standard ANSI SQL and provides ODBC and JDBC drivers
    – Scales easily to petabytes
  • Data governance and security
    – Supports fine-grained identity and access management
    – Encrypts all data at rest and in transit
  • BigQuery ML
    – Enables users to create and execute machine learning models in BigQuery using standard SQL queries

SLIDE 11

BigQuery ML

  • Models are trained and accessed in BigQuery using SQL
  • Supported models in BigQuery ML
    – Linear regression for forecasting
    – Binary logistic regression for classification
    – Multiclass logistic regression for classification. These models can be used to predict multiple possible values, such as whether an input is "low-value," "medium-value," or "high-value."
    – K-means clustering for data segmentation (beta)
    – TensorFlow model importing
        • This feature allows you to create BigQuery ML models from previously trained TensorFlow models, then perform prediction in BigQuery ML
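
As a rough illustration of "trained and accessed using SQL", the sketch below only builds the SQL text for a BigQuery ML `CREATE MODEL` statement and an `ML.PREDICT` query. The dataset, table, model, and column names are hypothetical, and nothing here contacts BigQuery; running the statements would need a BigQuery client and project, which this sketch assumes but does not include.

```python
# Model types named on the slide that take a label column. K-means ('kmeans')
# and TensorFlow import ('tensorflow') use different options and are left out.
SUPERVISED_MODEL_TYPES = {"linear_reg", "logistic_reg"}


def create_model_sql(model: str, model_type: str, label: str,
                     training_query: str) -> str:
    """Build a BigQuery ML CREATE MODEL statement as a string."""
    if model_type not in SUPERVISED_MODEL_TYPES:
        raise ValueError(f"unsupported model_type: {model_type!r}")
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS(model_type='{model_type}', input_label_cols=['{label}']) AS\n"
        f"{training_query}"
    )


def predict_sql(model: str, input_query: str) -> str:
    """Build a query that scores rows with a trained model via ML.PREDICT."""
    return f"SELECT * FROM ML.PREDICT(MODEL `{model}`, ({input_query}))"


# Hypothetical names, for illustration only.
train = create_model_sql(
    model="vre_demo.readmission_model",
    model_type="logistic_reg",
    label="readmitted",
    training_query="SELECT age, length_of_stay, readmitted FROM `vre_demo.training_rows`",
)
score = predict_sql(
    "vre_demo.readmission_model",
    "SELECT age, length_of_stay FROM `vre_demo.new_rows`",
)
print(train)
print(score)
```

Both binary and multiclass logistic regression use the same `logistic_reg` model type; the label column's distinct values determine which one is trained.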

SLIDE 12

Google Cloud AutoML

  • Cloud AutoML Natural Language
    – Classification
        • Allows you to train your own model to classify documents according to labels you define
    – Entity Extraction
        • Allows you to train your own model to identify a custom set of entities within English language text
    – Sentiment Analysis
        • Allows you to train your own model to analyze attitudes within English language text
  • Cloud AutoML Video Intelligence
    – Classification
        • Allows you to train your own model to classify shots and segments in your videos according to your own defined labels
    – Object Tracking
        • Allows you to train your own model to follow specific objects in your videos

SLIDE 13

Google Cloud AutoML

  • Cloud AutoML Vision
    – Classification
        • Allows you to train your own model to classify your images according to labels that you define
    – Object Detection
        • Allows you to train your own model to detect and extract multiple objects and provide information about those objects, including their positions in the image

SLIDE 14

Using Google Cloud Storage for Massive Data Files

  • Google Cloud Storage is a reasonably priced, highly available, and durable storage solution
  • Provides AES-256 encryption at rest for stored data
    – The key can be Google-managed or client-managed
  • Supports fine-grained Identity and Access Management (IAM) control over access to the data
    – Supports the use of groups for access as well
        • This is a best practice: users are easily added to or removed from the group
        • The group is granted a role (Owner, Writer, Reader) and all members of the group have access
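
The group-based pattern can be sketched against an IAM-style policy document, modeled here as a plain Python dict with a `bindings` list. The group address is hypothetical, and `roles/storage.objectViewer` is used as an assumed stand-in for the "Reader" role named above; applying such a policy to a real bucket would go through the Cloud Storage API, which this sketch does not call.

```python
import copy


def grant_role_to_group(policy: dict, role: str, group_email: str) -> dict:
    """Return a copy of `policy` with `group:<email>` bound to `role`.

    The input policy is left untouched, and granting the same role twice
    does not create duplicate members.
    """
    updated = copy.deepcopy(policy)
    member = f"group:{group_email}"
    bindings = updated.setdefault("bindings", [])
    for binding in bindings:
        if binding["role"] == role:
            if member not in binding["members"]:
                binding["members"].append(member)
            return updated
    bindings.append({"role": role, "members": [member]})
    return updated


# Hypothetical group; individual users are then managed in the group itself.
empty_policy = {"bindings": []}
bucket_policy = grant_role_to_group(
    empty_policy, "roles/storage.objectViewer", "study-team@example.edu"
)
print(bucket_policy)
```

Because access hangs off the group, adding or removing a researcher is a group-membership change and the bucket policy itself stays the same.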

SLIDE 15

Secure Research Network

  • Storage of PHI data requires additional security precautions
    – Encrypted volumes for all Compute instances
    – Encrypted Storage buckets
  • Separate VPC with defined subnets and firewall rules
    – Public and private subnets
    – All instances holding PHI data must be in a private subnet without a public IP
    – Connections to all instances must use encrypted protocols
    – Firewall rules should limit outside connections to specific CIDR ranges and protocols
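
The last two rules can be expressed as simple checks. The sketch below is a local model of an allow-list, not a call into any GCP firewall API; the CIDR range (a documentation-only address block), ports, and instance records are made-up examples.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class AllowRule:
    source_cidr: str  # permitted source range, e.g. a campus VPN block
    protocol: str     # "tcp" or "udp"
    port: int


def is_allowed(rules, source_ip: str, protocol: str, port: int) -> bool:
    """Default-deny: a connection passes only if some rule matches it exactly."""
    addr = ip_address(source_ip)
    return any(
        addr in ip_network(rule.source_cidr)
        and protocol == rule.protocol
        and port == rule.port
        for rule in rules
    )


def phi_instances_with_public_ip(instances) -> list:
    """Names of instances that hold PHI but still have a public IP attached."""
    return [i["name"] for i in instances if i["has_phi"] and i.get("public_ip")]


# Hypothetical rules: SSH and HTTPS only, from one documentation-range CIDR.
RULES = [
    AllowRule("203.0.113.0/24", "tcp", 22),
    AllowRule("203.0.113.0/24", "tcp", 443),
]
INSTANCES = [
    {"name": "phi-db", "has_phi": True, "public_ip": None},
    {"name": "phi-notebook", "has_phi": True, "public_ip": "198.51.100.20"},
    {"name": "public-web", "has_phi": False, "public_ip": "198.51.100.21"},
]

print(is_allowed(RULES, "203.0.113.7", "tcp", 22))
print(phi_instances_with_public_ip(INSTANCES))
```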

SLIDE 16

Secure Research Network

  • Logging and Auditing
    – All instances need to have logging enabled and logs exported to a storage bucket for retention
    – Logs need to be regularly reviewed for unauthorized actions
    – IAM accounts need to be regularly audited to ensure only authorized users have access to resources
  • Intrusion Detection and Prevention (IDS/IDP)
    – A system that performs intrusion detection and prevention should be in place
        • Looking at network traffic for unusual activity
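
A minimal sketch of the two review steps, assuming an approved-user list maintained by the study team. The policy dict follows the IAM `bindings` layout; the log-record shape (`principal`, `action`) is a simplified, hypothetical stand-in for real audit log entries, and all addresses are invented.

```python
def member_email(member: str) -> str:
    """Strip the IAM type prefix, e.g. 'user:a@example.edu' -> 'a@example.edu'."""
    return member.split(":", 1)[-1]


def unapproved_bindings(policy: dict, approved_emails: set) -> dict:
    """Map each role to the bound members that are not on the approved list."""
    findings = {}
    for binding in policy.get("bindings", []):
        extra = sorted(
            m for m in binding["members"]
            if member_email(m) not in approved_emails
        )
        if extra:
            findings[binding["role"]] = extra
    return findings


def unapproved_log_entries(entries, approved_emails: set) -> list:
    """Return log entries whose acting principal is not on the approved list."""
    return [e for e in entries if e["principal"] not in approved_emails]


# Hypothetical review inputs.
APPROVED = {"alice@example.edu", "study-team@example.edu"}
POLICY = {
    "bindings": [
        {"role": "roles/storage.objectViewer",
         "members": ["group:study-team@example.edu", "user:mallory@example.com"]},
        {"role": "roles/storage.objectAdmin",
         "members": ["user:alice@example.edu"]},
    ]
}
LOG = [
    {"principal": "alice@example.edu", "action": "storage.objects.get"},
    {"principal": "mallory@example.com", "action": "storage.objects.delete"},
]

print(unapproved_bindings(POLICY, APPROVED))
print(unapproved_log_entries(LOG, APPROVED))
```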

SLIDE 17

VRE Use Case #1

  • Using GCP resources for graduate course assignments
    – Extracting and transforming large clinical datasets (MIMIC-III) for downstream statistical analysis

SLIDE 18

Use Case #1: Dataset

  • MIMIC-III (Medical Information Mart for Intensive Care III)
  • Freely available hospital database containing de-identified data from approximately 40,000 patients
  • Data come from the critical care units of the Beth Israel Deaconess Medical Center
  • 53,423 distinct hospital admissions for adult patients (aged 16 years or above) admitted to critical care units between 2001 and 2012
  • Includes information such as demographics, vital signs, laboratory test results, procedures, medications, and clinical notes

SLIDE 19

Use Case #1: MIMIC-III Tables

Table name        Description
ADMISSIONS        Every unique hospitalization for each patient in the database (defines HADM_ID).
CAREGIVERS        Every caregiver who has recorded data in the database (defines CGID).
CHARTEVENTS       All charted observations for patients.
DIAGNOSES_ICD     Hospital assigned diagnoses, coded using the International Statistical Classification of Diseases and Related Health Problems (ICD) system.
ICUSTAYS          Every unique ICU stay in the database (defines ICUSTAY_ID).
LABEVENTS         Laboratory measurements for patients both within the hospital and in outpatient clinics.
NOTEEVENTS        Deidentified notes, including nursing and physician notes, ECG reports, radiology reports, and discharge summaries.
PATIENTS          Every unique patient in the database (defines SUBJECT_ID).
PRESCRIPTIONS     Medications ordered for a given patient.
PROCEDURES_ICD    Patient procedures, coded using the International Statistical Classification of Diseases and Related Health Problems (ICD) system.
SERVICES          The clinical service under which a patient is registered.
TRANSFERS         Patient movement from bed to bed within the hospital, including ICU admission and discharge.
….                ….

More information: https://mimic.physionet.org/about/mimic/

SLIDE 20

Use Case #1: GCP resources

  • Upload MIMIC-III CSV files to a Google Cloud Storage bucket
  • Import the CSV files into a PostgreSQL server running on GCP
  • Analyze data using RStudio and/or Jupyter Notebook running on GCP
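
The import-then-analyze flow can be sketched locally. This example substitutes an in-memory SQLite database for the PostgreSQL server so that it runs anywhere, and the CSV rows are made-up placeholders, not MIMIC-III data; only the column names SUBJECT_ID and HADM_ID are taken from the table descriptions, and the real ADMISSIONS file has more columns.

```python
import csv
import io
import sqlite3

# Placeholder rows for illustration only (NOT real MIMIC-III records).
ADMISSIONS_CSV = """SUBJECT_ID,HADM_ID
1,100
1,101
2,200
"""


def load_admissions(conn: sqlite3.Connection, csv_text: str) -> int:
    """Create an admissions table and bulk-insert rows parsed from CSV text."""
    conn.execute(
        "CREATE TABLE admissions (subject_id INTEGER, hadm_id INTEGER PRIMARY KEY)"
    )
    rows = [
        (int(r["SUBJECT_ID"]), int(r["HADM_ID"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO admissions VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def admissions_per_patient(conn: sqlite3.Connection) -> dict:
    """Count hospital admissions for each patient."""
    cur = conn.execute(
        "SELECT subject_id, COUNT(*) FROM admissions "
        "GROUP BY subject_id ORDER BY subject_id"
    )
    return dict(cur.fetchall())


conn = sqlite3.connect(":memory:")
loaded = load_admissions(conn, ADMISSIONS_CSV)
per_patient = admissions_per_patient(conn)
print(loaded, per_patient)
```

Against a real PostgreSQL instance the same SQL would apply with a PostgreSQL driver, and a bulk loader such as `COPY` would usually replace the row-by-row insert for files of MIMIC-III size.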

SLIDE 21

VRE Use Case #2

  • Google Cloud AutoML Vision for Medical Image Classification
    – Pneumonia detection using chest X-ray images
    – Develop an end-to-end medical image classification model using GCP resources

SLIDE 22

Use Case #2: Dataset

  • 5,232 chest X-ray images from children.
  • 3,883 of those images are samples of bacterial (2,538) and viral (1,345) pneumonia.
  • 1,349 samples are healthy lung X-ray images.
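
The counts above imply a noticeably imbalanced dataset, which the short calculation below makes explicit. The counts come from the slide; the inverse-frequency weighting is a common heuristic shown only for illustration and is not something the talk says was applied.

```python
COUNTS = {"bacterial_pneumonia": 2538, "viral_pneumonia": 1345, "normal": 1349}
TOTAL_IMAGES = 5232  # total stated on the slide


def class_shares(counts: dict) -> dict:
    """Fraction of the dataset that falls in each class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_weights(counts: dict) -> dict:
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


# The three class counts add up to the stated total.
assert sum(COUNTS.values()) == TOTAL_IMAGES

shares = class_shares(COUNTS)
pneumonia_share = shares["bacterial_pneumonia"] + shares["viral_pneumonia"]
weights = balanced_weights(COUNTS)
print(f"pneumonia: {pneumonia_share:.1%}, normal: {shares['normal']:.1%}")
```

Roughly three of every four images show pneumonia, so plain accuracy can look good even for a model that rarely predicts "normal"; per-class precision and recall are more informative here.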

SLIDE 23

Use Case #2: Dataset

SLIDE 24

Use Case #2: GCP resources

  • Compute Engine: Scalable, high-performance VMs
  • Cloud Storage: Object storage (buckets)
  • AI Platform Notebooks: An enterprise notebook service (Jupyter)
  • AutoML Vision: Train machine learning models to classify images
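
One glue step between Cloud Storage and AutoML Vision is an import file that lists each image's bucket path with its label. The sketch below writes such a CSV, assuming a two-column `gs://path,label` layout; the bucket name, file paths, and label strings are hypothetical, and the exact import format should be checked against the current AutoML Vision documentation.

```python
import csv
import io


def automl_import_csv(images, bucket: str) -> str:
    """Build CSV text with one 'gs://bucket/path,label' row per image.

    `images` is an iterable of (relative_path, label) pairs.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for path, label in images:
        writer.writerow([f"gs://{bucket}/{path}", label])
    return buf.getvalue()


# Hypothetical bucket and file names, for illustration only.
IMAGES = [
    ("chest_xray/img_0001.jpeg", "PNEUMONIA"),
    ("chest_xray/img_0002.jpeg", "NORMAL"),
]
import_csv = automl_import_csv(IMAGES, "example-vre-bucket")
print(import_csv)
```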

SLIDE 25

Use Case #2: Sample AutoML Vision Results (After 2 hours of training)

SLIDE 26

Use Case #2: Sample AutoML Vision

SLIDE 27

VRE - Summary

  • What we have accomplished in one year of the VRE
    – Access to a scalable, secure, and cost-effective IT platform
    – Reduced redundancy and system administration resources
    – Leverage the cloud and available tools/resources to efficiently develop advanced solutions and collaborate with researchers and research networks, providers, and health IT companies
    – Single platform for educational purposes
  • Issues still to be addressed
    – Systems to be migrated require resources: G-DOC, CTSA servers, REDCap, others
    – Major upgrades and updates are needed before migration
    – Some systems cannot be migrated because of agreements with government agencies or for compliance reasons
    – No established policies, security controls, or practices to meet HIPAA and other compliance obligations
    – Little understanding of the shared responsibility model
    – Loss of centralized network security controls
    – How do we identify and block critical threats?