 
Cloud-enabled Virtual Environment for Health Research & Education
Subha Madhavan, PhD, FACMI
Chief Data Scientist, Georgetown University Medical Center
Innovative Approaches to Cloud Computing - Overcoming the Challenges of Health Data Storage and Computing Infrastructures
Health Datapalooza’2020, Feb 11, 2020
HEALTH INFORMATICS

Challenges
Practical challenges of modern biomedical research:
• The ecosystem is complex, with governance challenges.
• Datasets are outgrowing local infrastructure, inhibiting researchers’ ability to maintain them.
• Compute requirements to process these large datasets are exceeding local capacity, inhibiting analysis.
• Downloading large-scale data to a local computer for analysis may be more difficult than bringing computational capability to where the data is located.
• Collaborating on research projects across organizations can be challenging due to differences in local IT environments.
• Data sharing needs are critical, but privacy and information security are paramount.

Emerging solutions
Challenges at Academic Medical Centers
• All GCP projects stand alone (no assigned Org)
  – Users create standalone non-GU identities
  – IT or the users themselves grant “Editor” rights to GCP projects
  – Users log in with Google username/password
• Agreement needed for access to research credits (Institution)
  – GCP billing accounts not tied to GU cost centers
  – No path to SSO for collaboration and access management
• IT control of GU Domain for G-Suite Org
  – Enabled Identity and GCP services in G-Suite Org
  – Established SSO for G-Suite
  – Unified billing (still under development)
  – Began migrating GCP Projects and Google Identities into G-Suite Org
  – On-boarding process for GCP not fully streamlined yet

Why we need VRE?
3 primary uses:
• PHI: Restricted access; PI/study driven; Tools/Compute
• Collaboration: Research networks; Collaboration; Tools/Compute
• Education: Training data sets; Public data sets; Tools/Compute

Virtual Research Environment Running on Google Cloud Platform (GCP)
Virtual Research Environment (VRE)
DATA ANALYSIS AND COLLABORATION ON GCP
Data Analysis with Jupyter Notebook
• Provides a user interface for performing analysis and data visualization
• Supports Python and R
• Can be launched preinstalled on a machine sized to the workload
  – From 1 vCPU to 160 vCPU and 3.75 GB RAM to 3844 GB RAM

Google BigQuery
• Serverless Data Warehouse
  – Allows you to set up a SQL data warehouse without complicated configuration
  – Supports standard ANSI SQL and provides ODBC and JDBC drivers
  – Scales to petabytes
• Data governance and security
  – Supports fine-grained identity and access management
  – Encrypts all data at rest and in transit
• BigQuery ML
  – Enables users to create and execute machine learning models in BigQuery using standard SQL queries

BigQuery ML
• Models are trained and accessed in BigQuery using SQL
• Supported models in BigQuery ML
  – Linear regression for forecasting
  – Binary logistic regression for classification
  – Multiclass logistic regression for classification. These models can be used to predict multiple possible values, such as whether an input is "low-value," "medium-value," or "high-value."
  – K-means clustering for data segmentation (beta)
  – TensorFlow model importing
    • This feature allows you to create BigQuery ML models from previously trained TensorFlow models, then perform prediction in BigQuery ML

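As a rough illustration of "trained and accessed in BigQuery using SQL", the sketch below only assembles the two SQL statements as strings; it does not connect to BigQuery. The dataset, table, model, and column names are hypothetical placeholders, and the statements would still need to be submitted through the BigQuery console or a client library.

```python
def create_model_sql(model, model_type, label_col, source_table, feature_cols):
    """Assemble a BigQuery ML CREATE MODEL statement (string only, nothing is executed)."""
    columns = ", ".join(list(feature_cols) + [label_col])
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS(model_type='{model_type}', input_label_cols=['{label_col}']) AS\n"
        f"SELECT {columns}\n"
        f"FROM `{source_table}`"
    )


def predict_sql(model, source_table, feature_cols):
    """Assemble an ML.PREDICT query that scores rows with a trained model."""
    columns = ", ".join(feature_cols)
    return (
        "SELECT *\n"
        f"FROM ML.PREDICT(MODEL `{model}`,\n"
        f"  (SELECT {columns} FROM `{source_table}`))"
    )


# Hypothetical names, for illustration only.
features = ["age", "heart_rate"]
train_sql = create_model_sql(
    "vre_demo.readmission_model", "logistic_reg", "readmitted",
    "vre_demo.training_rows", features,
)
score_sql = predict_sql("vre_demo.readmission_model", "vre_demo.new_rows", features)
print(train_sql)
print(score_sql)
```

The names are interpolated directly into the SQL text, so this pattern is only reasonable for trusted, hard-coded identifiers.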
Google Cloud AutoML
• Cloud AutoML Natural Language
  – Classification
    • Allows you to train your own model to classify documents according to labels you define
  – Entity Extraction
    • Allows you to train your own model to identify a custom set of entities within English language text
  – Sentiment Analysis
    • Allows you to train your own model to analyze attitudes within English language text
• Cloud AutoML Video Intelligence
  – Classification
    • Allows you to train your own model to classify shots and segments in your videos according to your own defined labels
  – Object Tracking
    • Allows you to train your own model to follow specific objects in your videos

Google Cloud AutoML
• Cloud AutoML Vision
  – Classification
    • Allows you to train your own model to classify your images according to labels that you define
  – Object Detection
    • Allows you to train your own model to detect and extract multiple objects and provide information about those objects, including their positions in the image

Using Google Cloud Storage for Massive Data Files
• Google Cloud Storage is a reasonably priced, highly available, and durable storage solution
• Provides AES 256-bit encryption at rest for stored data
  – Can be a Google-managed key or a client-managed key
• Supports fine-grained Identity and Access Management (IAM) control for access to the data
  – Supports the use of groups for access as well
    • This is a best practice and allows easy addition and removal of users
    • The group is granted a role (Owner, Writer, Reader) and all members of the group have access

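The group-based access pattern above can be sketched as a plain data structure. This is a minimal illustration that works on an IAM-policy-shaped dictionary and makes no API calls; the group address is hypothetical, and mapping the slide's "Reader" role to `roles/storage.objectViewer` is an assumption made for the example.

```python
def grant_group_role(policy, role, group_email):
    """Add a group to the binding for `role`, creating the binding if needed."""
    member = f"group:{group_email}"
    bindings = policy.setdefault("bindings", [])
    for binding in bindings:
        if binding["role"] == role:
            # Avoid duplicate members so repeated calls are harmless.
            if member not in binding["members"]:
                binding["members"].append(member)
            return policy
    bindings.append({"role": role, "members": [member]})
    return policy


# Hypothetical group; the role is assumed to correspond to "Reader".
policy = grant_group_role({}, "roles/storage.objectViewer", "vre-readers@example.edu")
policy = grant_group_role(policy, "roles/storage.objectViewer", "vre-readers@example.edu")
print(policy)
```

Because the role is attached to the group rather than to individuals, adding or removing a researcher becomes a group-membership change and the bucket policy itself stays untouched.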
Secure Research Network
• Storage of PHI data requires additional security precautions
  – Encrypted volumes for all Compute instances
  – Encrypted Storage buckets
• Separate VPC with defined subnets and firewall rules
  – Public and private subnets
  – All instances with PHI data on them must be in a private subnet without a public IP
  – Connections to all instances must use encrypted protocols
  – Firewall rules should limit outside connections to specific CIDR ranges and protocols

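The last point, limiting connections to specific CIDR ranges and protocols, can be illustrated with a small allow-list check. This is only a sketch of the matching logic using Python's standard `ipaddress` module, not a GCP firewall configuration; the CIDR ranges are documentation-only example addresses and the rule layout is hypothetical.

```python
import ipaddress

# Hypothetical allow-list: only these source ranges, protocols, and ports are admitted.
ALLOWED_RULES = [
    {"cidr": "203.0.113.0/24", "protocol": "tcp", "port": 22},    # SSH from a campus range
    {"cidr": "198.51.100.0/24", "protocol": "tcp", "port": 443},  # HTTPS from a partner range
]


def is_allowed(source_ip, protocol, port, rules=ALLOWED_RULES):
    """Return True only if some rule matches the source range, protocol, and port."""
    address = ipaddress.ip_address(source_ip)
    return any(
        address in ipaddress.ip_network(rule["cidr"])
        and protocol == rule["protocol"]
        and port == rule["port"]
        for rule in rules
    )


print(is_allowed("203.0.113.7", "tcp", 22))   # inside an allowed range and port
print(is_allowed("203.0.113.7", "tcp", 80))   # allowed range, but port not listed
print(is_allowed("192.0.2.1", "tcp", 22))     # source outside every allowed range
```

Anything that does not match a rule is rejected, which mirrors a default-deny firewall posture.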
Secure Research Network
• Logging and Auditing
  – All instances need to have logging enabled and logs exported to a storage bucket for retention
  – Logs need to be regularly reviewed for unauthorized actions
  – IAM accounts need to be regularly audited to ensure only authorized users have access to resources
• Intrusion Detection and Prevention (IDS/IDP)
  – A system that performs intrusion detection and prevention should be in place
    • Looking at network traffic for unusual activity

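One way to picture the IAM audit step is a comparison between the members found in a policy and an approved roster. The sketch below assumes the policy has already been exported as an IAM-style dictionary; the roster, roles, and account names are hypothetical.

```python
def unauthorized_members(policy, authorized):
    """List (role, member) pairs whose member is not on the approved roster."""
    findings = []
    for binding in policy.get("bindings", []):
        for member in binding.get("members", []):
            if member not in authorized:
                findings.append((binding["role"], member))
    return sorted(findings)


# Hypothetical exported policy and roster, for illustration only.
exported_policy = {
    "bindings": [
        {"role": "roles/editor", "members": ["user:pi@example.edu", "user:former-student@example.edu"]},
        {"role": "roles/viewer", "members": ["group:vre-readers@example.edu"]},
    ]
}
roster = {"user:pi@example.edu", "group:vre-readers@example.edu"}

findings = unauthorized_members(exported_policy, roster)
print(findings)
```

Running a check like this on a schedule turns "regularly audited" into a repeatable report that a reviewer can act on.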
VRE Use Case #1
• Using GCP resources for Graduate Course Assignments
  – Extracting and transforming large clinical datasets (MIMIC-III) for downstream statistical analysis

Use Case #1: Dataset
• MIMIC-III (Medical Information Mart for Intensive Care III)
• Free, publicly available hospital database containing de-identified data from approximately 40,000 patients
• Critical care units of the Beth Israel Deaconess Medical Center
• 53,423 distinct hospital admissions for adult patients (aged 16 years or above) admitted to critical care units between 2001 and 2012
• Includes information such as demographics, vital signs, laboratory test results, procedures, medications, and clinical notes

Use Case #1: MIMIC-III Tables
Table name       Description
ADMISSIONS       Every unique hospitalization for each patient in the database (defines HADM_ID).
CAREGIVERS       Every caregiver who has recorded data in the database (defines CGID).
CHARTEVENTS      All charted observations for patients.
DIAGNOSES_ICD    Hospital assigned diagnoses, coded using the International Statistical Classification of Diseases and Related Health Problems (ICD) system.
ICUSTAYS         Every unique ICU stay in the database (defines ICUSTAY_ID).
LABEVENTS        Laboratory measurements for patients both within the hospital and in outpatient clinics.
NOTEEVENTS       Deidentified notes, including nursing and physician notes, ECG reports, radiology reports, and discharge summaries.
PATIENTS         Every unique patient in the database (defines SUBJECT_ID).
PRESCRIPTIONS    Medications ordered for a given patient.
PROCEDURES_ICD   Patient procedures, coded using the International Statistical Classification of Diseases and Related Health Problems (ICD) system.
SERVICES         The clinical service under which a patient is registered.
TRANSFERS        Patient movement from bed to bed within the hospital, including ICU admission and discharge.
….               ….
More information: https://mimic.physionet.org/about/mimic/

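The identifiers named in the table (SUBJECT_ID, HADM_ID, ICUSTAY_ID) are what link the tables together. The sketch below shows that linkage on a few synthetic rows in an in-memory SQLite database; the rows are invented for illustration, only the key columns are modeled, and SQLite stands in for whichever database actually hosts the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients   (subject_id INTEGER PRIMARY KEY);
CREATE TABLE admissions (hadm_id INTEGER PRIMARY KEY, subject_id INTEGER);
CREATE TABLE icustays   (icustay_id INTEGER PRIMARY KEY, hadm_id INTEGER, subject_id INTEGER);

-- Synthetic rows: patient 1 has two admissions, one of them with two ICU stays.
INSERT INTO patients   VALUES (1), (2);
INSERT INTO admissions VALUES (100, 1), (101, 1), (200, 2);
INSERT INTO icustays   VALUES (1000, 100, 1), (1001, 100, 1), (2000, 200, 2);
""")

# Patients link to admissions by subject_id; admissions link to ICU stays by hadm_id.
rows = conn.execute("""
SELECT p.subject_id,
       COUNT(DISTINCT a.hadm_id)    AS n_admissions,
       COUNT(DISTINCT i.icustay_id) AS n_icu_stays
FROM patients p
JOIN admissions a ON a.subject_id = p.subject_id
LEFT JOIN icustays i ON i.hadm_id = a.hadm_id
GROUP BY p.subject_id
ORDER BY p.subject_id
""").fetchall()

for subject_id, n_admissions, n_icu_stays in rows:
    print(subject_id, n_admissions, n_icu_stays)
```

The LEFT JOIN keeps admissions that never reached an ICU, and COUNT(DISTINCT ...) avoids double counting when one admission has several ICU stays.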
Use Case #1: GCP resources
• Uploaded MIMIC-III CSV files to a Google Cloud Storage bucket
• Imported the CSV files into a PostgreSQL server running on GCP
• Analyzed the data using RStudio and/or Jupyter Notebook running on GCP

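The CSV-to-database step can be sketched with the standard library alone. The example below loads a tiny synthetic CSV into an in-memory SQLite table as a stand-in for the PostgreSQL server; the sample rows are invented, only two columns are shown, and `load_csv` is a hypothetical helper and not part of any MIMIC tooling. On PostgreSQL a bulk load would more typically use the `COPY` command.

```python
import csv
import io
import sqlite3

# Synthetic stand-in for a small slice of a CSV file; not real patient data.
SAMPLE_CSV = """SUBJECT_ID,GENDER
1,F
2,M
3,F
"""


def load_csv(conn, table, csv_file):
    """Create `table` from the CSV header, insert every row, and return the row count."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


conn = sqlite3.connect(":memory:")
row_count = load_csv(conn, "PATIENTS", io.StringIO(SAMPLE_CSV))
gender_counts = conn.execute(
    'SELECT GENDER, COUNT(*) FROM "PATIENTS" GROUP BY GENDER ORDER BY GENDER'
).fetchall()
print(row_count, gender_counts)
```

Once the tables are loaded, the same SQL can be issued from RStudio or a Jupyter notebook through the respective database drivers.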
VRE Use Case #2
• Google Cloud AutoML Vision for Medical Image Classification
  – Pneumonia Detection using Chest X-Ray Images
  – Develop an end-to-end medical image classification model using GCP resources

Use Case #2: Dataset
• 5,232 chest X-ray images from children.
• 3,883 of those images are samples of bacterial (2,538) and viral (1,345) pneumonia.
• 1,349 samples are healthy lung X-ray images.

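The counts above imply a noticeable class imbalance, which matters when reading any accuracy figure. A quick check of the arithmetic, using only the numbers on the slide (the label names are chosen here for illustration):

```python
# Image counts as stated on the slide.
counts = {"bacterial": 2538, "viral": 1345, "normal": 1349}

total = sum(counts.values())
pneumonia = counts["bacterial"] + counts["viral"]

# Share of each class, and the accuracy of always predicting "pneumonia".
fractions = {label: n / total for label, n in counts.items()}
majority_baseline = pneumonia / total

print(total, pneumonia)
print({label: round(share, 3) for label, share in fractions.items()})
print(round(majority_baseline, 3))
```

Roughly three quarters of the images show pneumonia, so a binary pneumonia-versus-normal classifier needs to beat about 74% accuracy to be doing better than a constant guess.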
Use Case #2: Dataset
Use Case #2: GCP resources
• Compute Engine: Scalable, high-performance VMs
• Cloud Storage: Object storage (buckets)
• AI Platform Notebooks: An enterprise notebook service (Jupyter)
• AutoML Vision: Train machine learning models to classify images

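To connect Cloud Storage and AutoML Vision, image classification datasets are commonly described by a CSV file that pairs each image's `gs://` path with its label. The sketch below builds such a file in memory; the bucket name, folder layout, file names, and labels are hypothetical, and the exact import format should be checked against the current AutoML Vision documentation.

```python
import csv
import io


def build_import_csv(images, bucket):
    """Return CSV text with one 'gs://path,label' row per image."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for filename, label in images:
        # Assumed layout: images are stored in one folder per label.
        writer.writerow([f"gs://{bucket}/{label}/{filename}", label])
    return buffer.getvalue()


# Hypothetical file names and bucket, for illustration only.
images = [
    ("img_001.jpeg", "NORMAL"),
    ("img_002.jpeg", "BACTERIAL"),
    ("img_003.jpeg", "VIRAL"),
]
import_csv = build_import_csv(images, "example-vre-bucket")
print(import_csv)
```

The resulting file would itself be uploaded to the bucket and referenced when creating the dataset.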
Use Case #2: Sample AutoML Vision Results (After 2 hours of training)