EHR-Based Phenotyping: Bulk Learning and Evaluation (with Infectious Diseases)
Po-Hsiang (Barnett) Chiu
Phenotypes and phenotyping
Physically observable traits of genotypes (and their interactions with environments)
Diseases (and disease subtypes)
Attributions of diseases (e.g. susceptibility)
Biochemical or physiological properties, behavior, and products of behavior
Data-Driven Phenotyping
- Data-driven phenotyping
– Two main methodologies
- Rule-based approach (e.g. eMerge, https://emerge.mc.vanderbilt.edu)
- Predictive analytics
– Data sources:
- EHRs/EMRs: Medicinal treatments, diagnoses, lab measurements, etc.
- Genomic data: SNP arrays, copy number variations (CNVs), etc.
– Phenotypes
- Diseases, subtypes, or variables attributed to disease predictions
Diagnostic Concept Units
- Various diseases sharing the same set of diagnostic concept units
- Infectious diseases
– Lab tests
- Microorganism, blood, urine, body tissues, stool
– Medications
- Antibiotic, antiviral, anthelmintic
- Build statistical models for each diagnostic component and combine them appropriately
– Ensemble learning
Bulk Learning in a Nutshell …
Bulk learning is a batch-phenotyping framework that uses multiple diseases collectively (i.e. the bulk learning set) as a substrate for model learning and evaluation. A given medical ontology is used to perform feature selection, and model stacking is used to construct an abstract feature representation of low sample complexity in order to reduce training requirements.
Key Concepts:
- 1. Build phenotyping models on top of multiple diseases
- 2. Automatic feature selection using an existing ontology
- 3. Models are combined via model stacking (a form of ensemble learning)
- 4. Abstract features
– Dimensionality reduction
- 5. Less labeled data required for model evaluations
Phenotyping via Bulk Learning
- Under model stacking, we then arrive at the notion of “concept-driven phenotyping”
– A subset or combination of lab tests is more attributable to some diseases, while others are better explained by medications
- In this study, infectious diseases associated with 100 ICD-9 codes serve as the domain of study for bulk learning
– For simplicity, consider different diagnostic codes as different diseases …
– Why 100 codes?
– Code selection strategy?
Bulk Learning Basics I
- Addresses two central issues in the predictive-analytics approach to computational phenotyping
– Feature engineering
- Medical ontology for feature decomposition
- Medical Entities Dictionary (http://med.dmi.columbia.edu)
– Data annotation
- Ensemble learning (e.g. stacked generalization [Wolpert 1992])
- Feature abstraction for dimensionality reduction
Medical Ontology for Grouping Features
- Snapshot of Medical Entities Dictionary (http://med.dmi.columbia.edu)
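A minimal sketch of ontology-driven feature grouping, assuming a hand-written lookup table stands in for the Medical Entities Dictionary. The feature names, the `CONCEPT_OF` map, and the `group_features` helper are hypothetical; they only illustrate how raw EHR features might be partitioned into concept groups before any model is trained.

```python
from collections import defaultdict

# Hypothetical feature-to-concept map. In practice this would be derived
# from an ontology such as the Medical Entities Dictionary.
CONCEPT_OF = {
    "culture_e_coli": "microbiology",
    "culture_staph_aureus": "microbiology",
    "ceftriaxone": "antibiotic",
    "vancomycin": "antibiotic",
    "wbc_count": "blood_test",
    "urine_nitrite": "urine_test",
}


def group_features(record, concept_of=CONCEPT_OF):
    """Split one patient record into per-concept feature dictionaries.

    Features that the map does not cover are dropped, which acts as a
    coarse, ontology-based feature selection step.
    """
    groups = defaultdict(dict)
    for name, value in record.items():
        concept = concept_of.get(name)
        if concept is not None:
            groups[concept][name] = value
    return dict(groups)


# Made-up record for illustration only.
record = {"culture_e_coli": 1, "ceftriaxone": 1, "wbc_count": 13.2, "age": 54}
groups = group_features(record)
```

Each resulting group can then be handed to its own base model, so that no single model has to cope with the full, heterogeneous feature set.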
Model Stacking
- Why inspect multiple (infectious) diseases?
– Use multiple diseases as a substrate and identify their common elements
– Example stacking architecture (under the stacked generalization method)
Level 0: Antibiotic Measure, Urinary Chemistry Measure, Intravenous Chemistry Measure, Microbiology Measure, Other Phenotypic Measures (e.g. Antiviral)
Level 1: Attributes: Level-0 Probabilities and Indicators. Target: Diagnostic Codes (Silver Standard)
Level 2: Attributes: Level-1 Probabilities and ICD-9. Target: True Labels (Gold Standard)
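The two lower tiers of such an architecture can be sketched with plain logistic units. This is an illustration under stated assumptions, not the study's implementation: the weights and biases in `LEVEL0` and `LEVEL1` are invented rather than learned, and the feature names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_unit(features, weights, bias):
    """One logistic unit: weighted sum of named features, then sigmoid."""
    score = bias + sum(w * features.get(name, 0.0) for name, w in weights.items())
    return sigmoid(score)


# Hypothetical level-0 parameters: one logistic unit per concept group.
LEVEL0 = {
    "microbiology": ({"culture_e_coli": 2.0}, -1.0),
    "antibiotic": ({"ceftriaxone": 1.5}, -1.0),
    "blood_test": ({"wbc_elevated": 1.0}, -0.5),
    "urine_test": ({"urine_nitrite": 1.8}, -1.0),
}
# Hypothetical level-1 parameters over the four level-0 probabilities.
LEVEL1 = (
    {"microbiology": 2.0, "antibiotic": 1.5, "blood_test": 1.0, "urine_test": 1.0},
    -3.0,
)


def stacked_probability(groups):
    """Return the level-0 probabilities and the level-1 probability."""
    level0_probs = {
        concept: logistic_unit(groups.get(concept, {}), weights, bias)
        for concept, (weights, bias) in LEVEL0.items()
    }
    weights, bias = LEVEL1
    return level0_probs, logistic_unit(level0_probs, weights, bias)


evidence = {"microbiology": {"culture_e_coli": 1}, "urine_test": {"urine_nitrite": 1}}
level0_probs, p_with = stacked_probability(evidence)
_, p_without = stacked_probability({})
```

The level-1 unit never sees raw features. It sees only four probabilities, which is the denser, lower-dimensional representation that stacking is meant to provide.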
Surrogate Labels vs True Labels
- Model stacking is used to:
– Improve upon base model performance
– Transform EHR data into a denser form
- Uses diagnostic codes (e.g. ICD-9) as surrogate labels to establish “approximate predictive models.”
- Why surrogate labels (e.g. ICD-9)?
– Features extracted from EHR can be large
– Used to derive a compact representation of the training data
– “Free” supervised signals that are sufficiently close but can be obtained without extra work
- Objective: Build statistical models in abstract feature space
– Create a sparse annotation set (i.e. gold standard) that serves as a proxy dataset for downstream model evaluations
– 83 annotated cases
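How close surrogate labels are to true labels can be estimated on the annotated cases. The sketch below treats the ICD-9 code as a prediction of the gold-standard label and reports precision, recall, and accuracy. The two label lists are invented for illustration and are not data from the study; `agreement` is a hypothetical helper.

```python
def agreement(surrogate, gold):
    """Compare binary surrogate labels (e.g. ICD-9 present) with gold labels."""
    pairs = list(zip(surrogate, gold))
    tp = sum(1 for s, g in pairs if s == 1 and g == 1)
    fp = sum(1 for s, g in pairs if s == 1 and g == 0)
    fn = sum(1 for s, g in pairs if s == 0 and g == 1)
    tn = sum(1 for s, g in pairs if s == 0 and g == 0)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }


# Made-up labels for eight annotated cases.
icd9_labels = [1, 1, 0, 0, 1, 0, 1, 0]
true_labels = [1, 0, 0, 1, 1, 0, 1, 0]
stats = agreement(icd9_labels, true_labels)
```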
[Figure: Four Example Base Models. Raw features (f11, f12, …, f1j; f21, …, f2j; f31, …, f3j; f41) for microbiology, antibiotic, blood test, and urine test feed logistic units (m1, a1, b1, u1), whose outputs feed a global unit.]
Performance Evaluations
- How well does the model predict ICD-9 codes (using separate test data)?
- How well does the model predict annotated data (associated with “true labels”)?
– (Binarized) ICD-9 becomes a candidate feature among abstract features (e.g. probability scores, indicators)
- The annotated sample consists of randomly selected cases in which ICD-9 coding errors are corrected
- Data annotation and coding procedures are two independent processes
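One way to read the "candidate feature" remark is that each annotated case becomes a short row of abstract scores with the binarized ICD-9 indicator appended. The sketch below assumes two hypothetical abstract scores, `local_score` and `global_score`. The score names and values are invented, and only the code 053.9 is taken from the slides.

```python
# Hypothetical names for the abstract features produced by lower tiers.
SCORE_NAMES = ("local_score", "global_score")


def binarize_icd9(patient_codes, target_code):
    """Return 1 if the patient record carries the target ICD-9 code, else 0."""
    return 1 if target_code in patient_codes else 0


def top_tier_row(abstract_scores, patient_codes, target_code):
    """Build one feature row: abstract scores followed by the ICD-9 indicator."""
    row = [abstract_scores[name] for name in SCORE_NAMES]
    row.append(binarize_icd9(patient_codes, target_code))
    return row


# Made-up values for one annotated case.
row = top_tier_row(
    {"local_score": 0.82, "global_score": 0.64},
    patient_codes={"053.9", "009.1"},
    target_code="053.9",
)
```

A model trained on such rows against the true labels can then weigh the coded diagnosis against the learned scores, rather than trusting either one alone.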
Base Level Performances
127.4 Enterobiasis
047.8 (Other) viral meningitis
009.1 Gastroenteritis
...
053.9 Herpes zoster
117.9 Mycoses
Other Components
- Semi-supervised learning and virtual annotation set
- The 3rd tier in the model stacking hierarchy
– Trade-off between learned abstract features and the ICD-9 codes as surrogate labels
– Performance evaluation on predicting annotated labels
- Ontology-based feature engineering
- Proper design of treatment and control (training) data
Modeling Perspective
- EHR data consist of observations and latent variables
– Observations can be directly answered via simple queries
- Did the patient have tests for E. coli?
- Did the patient take ceftriaxone?
- Latent variables represent quantities that cannot be directly observed in the EHR or computed via simple queries
– Does the patient have an infection?
– Diagnostic questions: specifically, which infections does the patient have?
- Learn classifiers to predict latent variables (with access only to observations)
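The "simple query" side of this distinction can be illustrated with a lookup over a patient's event list. The event schema (`type`, `name`) and the `has_event` helper are assumptions made for this sketch, not an actual EHR interface.

```python
def has_event(events, event_type, name):
    """Answer an observation-style question with a direct lookup."""
    target = name.lower()
    return any(
        e["type"] == event_type and e["name"].lower() == target for e in events
    )


# Made-up event list for one patient.
events = [
    {"type": "lab", "name": "E. coli culture"},
    {"type": "medication", "name": "Ceftriaxone"},
]

tested_e_coli = has_event(events, "lab", "E. coli culture")
took_ceftriaxone = has_event(events, "medication", "ceftriaxone")
took_vancomycin = has_event(events, "medication", "vancomycin")
```

No comparable lookup answers "does the patient have an infection?", which is why that quantity is treated as latent and predicted by a classifier.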
Medical Perspective
- Seemingly different infectious diseases may share similar sets of lab tests and medications
– Staph. aureus
- Skin infections, pneumonia, blood poisoning
– Ceftriaxone
- Meningitis
- Infections at different sites of the body (e.g. bloodstream, lungs, urinary tract)
- Multiple classifiers for the same disease
– 4 classifiers per ICD-9 code, each of which is a binary classifier
- 400 classifiers at base level
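The base-level count follows from pairing every code with every concept group: 100 codes times 4 classifiers gives 400. The sketch below enumerates those pairs. It assumes the four groups are the ones named in the base-model figure, and it uses placeholder code identifiers because the slides do not list the 100 ICD-9 codes.

```python
from itertools import product

# Assumed concept groups, taken from the "Four Example Base Models" figure.
CONCEPTS = ("microbiology", "antibiotic", "blood_test", "urine_test")

# Placeholder identifiers: the actual 100 ICD-9 codes are not listed here.
CODES = [f"code_{i:03d}" for i in range(100)]

# One binary base classifier per (code, concept) pair.
base_classifier_keys = list(product(CODES, CONCEPTS))
n_base_classifiers = len(base_classifier_keys)
```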
Data Distribution Perspective
“Can we build a joint model applicable to all diseases?”
Abstract Feature Representation: Design Choices
- Related work in constructing high-level features
– PCA, unsupervised feature learning, manifold learning, etc.
- Design choices
– Data characteristics
– Interpretability
- Deep Neural Network
– Linear combination
– Non-linear transformation (e.g. sigmoid, rectifier, etc.)
- Feature set: continuous, dense, and “homogeneous”
– Image pixels
– Time series of lab measurements
– word2vec
- EHR data, however, are very different
– Sparse and incomplete
– Consist of many different types (binary, categorical, continuous, etc.)
– Features associated with multiple concepts
Moving Forward …
- Summary
– Bulk learning is a framework with at least the following system choices
- The bulk learning set (of target conditions) => base models
- Classification algorithms (guideline: probabilistic classifiers + well-calibrated)
- Stacking architecture (multiple tiers => levels of abstraction)
- Strategy for combining individual (local) disease models into a global model
– Advantage: Can use a small annotated sample for model construction and evaluation within the abstract feature space (e.g. level-1 data)
- 83 clinical cases were labeled in this study
– Challenge: The model involving the interaction between abstract features and ICD-9 does not generalize well to regions of the data where the ICD-9 coding was incorrect
- Multiple types of surrogate labels
- Ongoing and future work
– Semi-supervised learning
– Active learning
– Complex decision boundary?
– Other surrogate labels
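The summary's guideline that base classifiers be probabilistic and well calibrated suggests checking the quality of predicted probabilities, not just hard labels. One common check, chosen here as an assumption and not named in the slides, is the Brier score: the mean squared difference between predicted probability and binary outcome. The numbers are invented.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    Lower is better; a constant prediction of 0.5 scores 0.25.
    """
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have equal length")
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)


outcomes = [1, 0, 1, 0]
informative = brier_score([0.9, 0.2, 0.8, 0.1], outcomes)
uninformative = brier_score([0.5, 0.5, 0.5, 0.5], outcomes)
```

The Brier score mixes calibration with sharpness, so a reliability curve over binned probabilities would be the natural follow-up when calibration alone is the question.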
Reference
[1] D.H. Wolpert, Stacked generalization, Neural Networks. 5 (1992) 241–259.
[2] K.M. Ting, I.H. Witten, Issues in stacked generalization, J. Artif. Intell. Res. 10 (1999) 271–289.
[3] J. Chen, C. Wang, R. Wang, Using Stacked Generalization to Combine SVMs in Magnitude and Shape Feature Spaces for Classification of Hyperspectral Data, IEEE Trans. Geosci. Remote Sens. 47 (2009) 2193–2205.
[4] David Baorto, James Cimino, et al. Available: http://med.dmi.columbia.edu. Access date: Oct 20, 2016.
[5] T.A. Lasko, J.C. Denny, M.A. Levy, Computational Phenotype Discovery Using Unsupervised Feature Learning over Noisy, Sparse, and Irregular Clinical Data, PLoS One. 8 (2013) e66341.
Thank You
[Figure: Example Features. Level 0: logistic units (m1, a1, b1, u1) over raw features (f11, f12, …, f1j; f21, …, f2j; f31, …, f3j; f41) for Microbiology, Antibiotic, Blood test, and Urine test. Level 1: a unit combining the Level 0 outputs.]
[Figure: Four Example Base Models (microbiology, antibiotic, blood test, urine test). Raw features feed logistic units; individual level-1 local units and a level-1 global unit produce level-1 abstract features and global level-1 features.]
- 1. Define Feature Groups Using Medical Ontology
– 1a. Gather EHR data according to medical concepts
– 1b. Use Medical Entities Dictionary to delineate feature scopes
– 1c. Apply feature selection within each concept group
- 2. Compute Base Models
- 3. Compute Meta Models (via Ensemble Learning)
– 3a. Per-disease ensembles: compute local level-1 models
– 3b. Cross-disease ensemble: compute a global level-1 model
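Step 3 can be sketched as fitting one logistic meta-model per disease on that disease's level-0 probabilities, then one more on the pooled rows. Pooling rows across diseases is an assumption about how the global model might be formed. The slides do not specify it, and the probability rows below are invented. Only the two ICD-9 codes come from the slides.

```python
import math


def predict(x, weights, bias):
    """Logistic prediction for one row of level-0 probabilities."""
    z = bias + sum(w * xj for w, xj in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, epochs=3000, lr=1.0):
    """Fit logistic regression by batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict(x, weights, bias) - y
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Made-up level-0 probabilities (microbiology, antibiotic, blood, urine)
# with surrogate labels, keyed by ICD-9 code.
LEVEL0_ROWS = {
    "053.9": ([[0.9, 0.8, 0.6, 0.3], [0.2, 0.1, 0.3, 0.2],
               [0.8, 0.9, 0.5, 0.4], [0.1, 0.2, 0.2, 0.3]], [1, 0, 1, 0]),
    "009.1": ([[0.7, 0.6, 0.8, 0.7], [0.2, 0.3, 0.1, 0.2],
               [0.6, 0.8, 0.7, 0.9], [0.3, 0.1, 0.2, 0.1]], [1, 0, 1, 0]),
}

# 3a. Per-disease ensembles: one local level-1 model per code.
local_models = {
    code: fit_logistic(rows, labels) for code, (rows, labels) in LEVEL0_ROWS.items()
}

# 3b. Cross-disease ensemble: one global level-1 model on pooled rows.
pooled_rows = [row for rows, _ in LEVEL0_ROWS.values() for row in rows]
pooled_labels = [y for _, labels in LEVEL0_ROWS.values() for y in labels]
global_model = fit_logistic(pooled_rows, pooled_labels)
```

The outputs of `predict` under the local and global models would then serve as the level-1 abstract features passed to the next tier.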