EHR-Based Phenotyping: Bulk Learning and Evaluation (with Infectious Diseases) - PowerPoint PPT Presentation



SLIDE 1

EHR-Based Phenotyping: Bulk Learning and Evaluation (with Infectious Diseases)

Po-Hsiang (Barnett) Chiu

SLIDE 2

Phenotypes and phenotyping

Physically observable traits of genotypes (and their interactions with environments)

Diseases (and disease subtypes)

Attributions of diseases (e.g. susceptibility)

Biochemical or physiological properties, behavior, and products of behavior

SLIDE 3

Data-Driven Phenotyping

  • Data-driven phenotyping

– Two main methodologies

  • Rule-based approach (e.g. eMERGE, https://emerge.mc.vanderbilt.edu)
  • Predictive analytics

– Data sources:

  • EHRs/EMRs: medicinal treatments, diagnoses, lab measurements, etc.
  • Genomic data: SNP arrays, copy number variations (CNVs), etc.

– Phenotypes

  • Diseases, subtypes, or variables attributed to disease predictions
SLIDE 4

Diagnostic Concept Units

  • Various diseases sharing the same set of diagnostic concept units

  • Infectious diseases

– Lab tests

  • Microorganism, blood, urine, body tissues, stool

– Medications

  • Antibiotic, antiviral, anthelmintic

  • Build statistical models for each diagnostic component and combine them appropriately

– Ensemble learning
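The idea of one statistical model per diagnostic concept unit, combined by an ensemble, can be sketched as follows. This is a minimal illustration with synthetic data and hand-rolled logistic units; the two group names ("labs", "meds") and all data are hypothetical, not from the study:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, lr=0.5, n_iter=500):
    """Plain gradient-descent logistic regression (one 'logistic unit')."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def predict_proba(X, w, b):
    return 1 / (1 + np.exp(-(X @ w + b)))

# Synthetic EHR-like data split into two concept groups (labs vs. meds);
# the label depends partly on each group, so neither model alone suffices.
n = 400
X_labs = rng.normal(size=(n, 3))
X_meds = rng.normal(size=(n, 2))
y = (X_labs[:, 0] + X_meds[:, 0] > 0).astype(float)

# One base model per diagnostic concept unit
models = {name: fit_logistic(X, y)
          for name, X in [("labs", X_labs), ("meds", X_meds)]}

# Combine: average the per-concept probabilities (a simple ensemble)
p_ens = np.mean([predict_proba(X, *models[name])
                 for name, X in [("labs", X_labs), ("meds", X_meds)]], axis=0)
acc = np.mean((p_ens > 0.5) == y)
```

Averaging is only the simplest combination rule; the deck's later slides replace it with stacked generalization, where a meta model learns how to weight the per-concept probabilities.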

SLIDE 5

Bulk Learning in a Nutshell …

Bulk learning is a batch-phenotyping framework that uses multiple diseases collectively (i.e. a bulk learning set) as a substrate for model learning and evaluation, wherein a given medical ontology is used to perform feature selection, and model stacking is used to construct an abstract feature representation of low sample complexity in order to reduce training requirements.

Key Concepts:

  • 1. Build phenotyping models on top of multiple diseases
  • 2. Automatic feature selection using an existing ontology
  • 3. Models are combined via model stacking (a form of ensemble learning)
  • 4. Abstract features for dimensionality reduction
  • 5. Less labeled data required for model evaluations
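Key concept 2 (ontology-driven feature selection) amounts to partitioning the raw EHR columns by medical concept before any model is fit. A rough sketch, where the mini-ontology and all column names are invented for illustration:

```python
# Hypothetical mini-ontology: concept -> feature columns, mimicking how a
# medical ontology (e.g. the Medical Entities Dictionary) delineates scopes.
ontology = {
    "microbiology": ["blood_culture", "urine_culture", "stool_culture"],
    "antibiotic":   ["ceftriaxone", "vancomycin"],
    "blood_test":   ["wbc_count", "crp"],
}

def feature_groups(columns, ontology):
    """Partition the available EHR columns into concept groups."""
    return {concept: [c for c in cols if c in columns]
            for concept, cols in ontology.items()}

ehr_columns = {"blood_culture", "ceftriaxone", "wbc_count", "crp", "age"}
groups = feature_groups(ehr_columns, ontology)
# Columns matching no concept (here "age") fall outside every group.
```

Each resulting group then feeds its own base model, per slide 4.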
SLIDE 6

Phenotyping via Bulk Learning

  • Under model stacking, we arrive at the notion of “concept-driven phenotyping”

– A subset or combination of lab tests is more attributable to some diseases, while others are better explained by medications

  • In this study, infectious diseases associated with 100 ICD-9 codes serve as the domain of study for bulk learning

– For simplicity, consider different diagnostic codes as different diseases …
– Why 100 codes?
– Code selection strategy?

SLIDE 7

Bulk Learning Basics I

  • Addresses two central issues in the predictive-analytics approach to computational phenotyping

– Feature engineering

  • Medical ontology for feature decomposition
  • Medical Entities Dictionary (http://med.dmi.columbia.edu)

– Data annotation

  • Ensemble learning (e.g. stacked generalization [Wolpert 1992])
  • Feature abstraction for dimensionality reduction
SLIDE 8

Medical Ontology for Grouping Features

  • Snapshot of the Medical Entities Dictionary (http://med.dmi.columbia.edu)

SLIDE 9

Model Stacking

  • Why inspect multiple (infectious) diseases?

– Use multiple diseases as a substrate and identify their common elements
– Example stacking architecture (under the stacked-generalization method):

[Diagram: example stacking architecture]
  • Level 0: concept-specific base models — antibiotic, urinary chemistry, intravenous chemistry, and microbiology measures, plus other phenotypic measures (e.g. antiviral)
  • Level 1: attributes are level-0 probabilities and indicators; target is diagnostic codes (silver standard)
  • Level 2: attributes are level-1 probabilities and ICD-9; target is true labels (gold standard)
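How the level-1 design matrix is assembled from level-0 outputs can be sketched minimally as follows, with synthetic probabilities standing in for real base-model scores (sizes and values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Stand-ins for level-0 outputs: one probability column per concept-specific
# base model (antibiotic, urinary chemistry, IV chemistry, microbiology)
level0_probs = rng.uniform(size=(n, 4))

# Binarized ICD-9 assignment per patient: the level-1 (silver-standard) target
icd9_label = rng.integers(0, 2, size=n)

# Level-1 attributes = level-0 probabilities plus their binary indicators,
# per the stacking architecture above
indicators = (level0_probs > 0.5).astype(float)
level1_X = np.hstack([level0_probs, indicators])
# A level-1 model would now be fit on (level1_X, icd9_label).
```

The level-2 stage repeats the same move one tier up: level-1 probabilities (with ICD-9 as a candidate feature) against the gold-standard true labels.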

SLIDE 10

Surrogate Labels vs. True Labels

  • Model stacking is used to:

– Improve upon base-model performance
– Transform EHR data into a denser form

  • Uses diagnostic codes (e.g. ICD-9) as surrogate labels to establish “approximate predictive models.”

  • Why surrogate labels (e.g. ICD-9)?

– Features extracted from EHRs can be large
– Used to derive a compact representation of the training data
– “Free” supervised signals that are sufficiently close but can be obtained without extra work

  • Objective: Build statistical models in the abstract feature space

– Create a sparse annotation set (i.e. a gold standard) that serves as a proxy dataset for downstream model evaluations
– 83 annotated cases
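The surrogate-vs-gold workflow can be illustrated with a toy setup. Everything below is an assumption made for illustration: synthetic data, a nearest-centroid stand-in for the real classifiers, and a simulated 10% ICD-9 coding error rate:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

X = rng.normal(size=(n, 5))
true_label = (X[:, 0] > 0).astype(int)        # gold standard (costly to obtain)
miscoded = rng.uniform(size=n) < 0.1          # simulate ~10% coding errors
surrogate = np.where(miscoded, 1 - true_label, true_label)  # "free" ICD-9 labels

# Train on plentiful surrogate labels (toy nearest-centroid classifier) ...
mu0 = X[surrogate == 0].mean(axis=0)
mu1 = X[surrogate == 1].mean(axis=0)
pred = (np.linalg.norm(X - mu1, axis=1)
        < np.linalg.norm(X - mu0, axis=1)).astype(int)

# ... but evaluate on a small annotated sample with corrected labels
annotated = rng.choice(n, size=83, replace=False)  # 83 cases, as in the study
gold_acc = np.mean(pred[annotated] == true_label[annotated])
```

The point of the sketch: the surrogate labels are noisy but abundant, so the model is trained on all of them, while the expensive corrected annotations are reserved for evaluation.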

SLIDE 11

[Diagram: four example base models — microbiology, antibiotic, blood test, and urine test — each a logistic unit (Σ) over its group of raw features (f11 … f3j), with per-disease level-1 units (m1, a1, b1, u1) and global level-1 units (m1g, a1g, b1g, u1g)]

SLIDE 12

Performance Evaluations

  • How well does the model predict ICD-9s (using separate test data)?

  • How well does the model predict annotated data (associated with “true labels”)?

– (Binarized) ICD-9 becomes a candidate feature among abstract features (e.g. probability scores, indicators)

  • The annotated sample consists of randomly selected cases in which errors of ICD-9 coding are corrected

  • Data annotation and coding procedures are two independent processes

SLIDE 13

Base-Level Performances

SLIDE 14

127.4 Enterobiasis
047.8 (Other) viral meningitis
009.1 Gastroenteritis
...
053.9 Herpes zoster
117.9 Mycoses

SLIDE 15

SLIDE 16

Other Components

  • Semi-supervised learning and a virtual annotation set
  • The 3rd tier in the model-stacking hierarchy

– Trade-off between learned abstract features and the ICD-9 codes as surrogate labels
– Performance evaluation on predicting annotated labels

  • Ontology-based feature engineering
  • Proper design of treatment and control (training) data
SLIDE 17

Modeling Perspective

  • EHR data consist of observations and latent variables

– Observations can be directly answered via simple queries

  • Did the patient have tests for E. coli?
  • Did the patient take ceftriaxone?

– Latent variables represent quantities that cannot be directly observed in the EHR or computed via simple queries

  • Does the patient have an infection?
  • Diagnostic questions: specifically, which infections does the patient have?

  • Learn classifiers to predict latent variables (with access only to observations)
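The observation/latent distinction can be made concrete. The record layout and field names below are hypothetical, chosen only to illustrate what "answerable via a simple query" means:

```python
# Hypothetical per-patient EHR record; field names are illustrative only.
patient = {
    "lab_tests":   {"E. coli culture", "urinalysis"},
    "medications": {"ceftriaxone", "ibuprofen"},
}

def had_test(record, test_name):
    """Observation: directly answerable with a simple membership query."""
    return test_name in record["lab_tests"]

def took_medication(record, drug):
    return drug in record["medications"]

# Latent variables (e.g. "does the patient have an infection?") admit no such
# query; a classifier must predict them from observations like these.
features = [had_test(patient, "E. coli culture"),
            took_medication(patient, "ceftriaxone")]
```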

SLIDE 18

Medical Perspective

  • Seemingly different infectious diseases may share similar sets of lab tests and medications

– Staph. aureus

  • Skin infections, pneumonia, blood poisoning

– Ceftriaxone

  • Meningitis
  • Infections at different sites of the body (e.g. bloodstream, lungs, urinary tract)

  • Multiple classifiers for the same disease

– 4 classifiers per ICD-9 code, each a binary classifier

  • 400 classifiers at the base level
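The base-level classifier bank implied here (100 ICD-9 codes × 4 concept groups = 400 binary classifiers) can be enumerated directly. The placeholder code strings below are illustrative, not the actual ICD-9 codes used in the study:

```python
concept_groups = ["microbiology", "antibiotic", "blood_test", "urine_test"]
icd9_codes = [f"{i:03d}.0" for i in range(100)]  # placeholder codes

# One binary classifier per (ICD-9 code, concept group) pair
classifier_bank = [(code, group) for code in icd9_codes
                                 for group in concept_groups]
n_classifiers = len(classifier_bank)  # 100 codes x 4 groups = 400
```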
SLIDE 19

Data Distribution Perspective

“Can we build a joint model applicable to all diseases?”

SLIDE 20

Abstract Feature Representation: Design Choices

  • Related work in constructing high-level features

– PCA, unsupervised feature learning, manifold learning, etc.

  • Design choices

– Data characteristics
– Interpretability

  • Deep neural networks

– Linear combination
– Non-linear transformation (e.g. sigmoid, rectifier, etc.)

  • Feature set: continuous, dense, and “homogeneous”

– Image pixels
– Time series of lab measurements
– word2vec

  • EHR data, however, are very different

– Sparse and incomplete
– Consist of many different types (binary, categorical, continuous, etc.)
– Features associated with multiple concepts

SLIDE 21

Moving Forward …

  • Summary

– Bulk learning is a framework with at least the following system choices:

  • The bulk learning set (of target conditions) => base models
  • Classification algorithms (guideline: probabilistic, well-calibrated classifiers)
  • Stacking architecture (multiple tiers => levels of abstraction)
  • Strategy for combining individual (local) disease models into a global model

– Advantage: a small annotated sample suffices for model construction and evaluation within the abstract feature space (e.g. level-1 data)

  • 83 clinical cases were labeled in this study

– Challenge: models involving the interaction between abstract features and ICD-9 do not generalize well into regions of the data where the ICD-9 coding was incorrect

  • Multiple types of surrogate labels

[Diagram: level-1 local and global stacking units (m1g, a1g, b1g, u1g; local2, global2) over the four base models]

  • Ongoing and future work

– Semi-supervised learning
– Active learning (complex decision boundary?)
– Other surrogate labels
SLIDE 22

References

[1] D.H. Wolpert, Stacked generalization, Neural Networks 5 (1992) 241–259.
[2] K.M. Ting, I.H. Witten, Issues in stacked generalization, J. Artif. Intell. Res. 10 (1999) 271–289.
[3] J. Chen, C. Wang, R. Wang, Using stacked generalization to combine SVMs in magnitude and shape feature spaces for classification of hyperspectral data, IEEE Trans. Geosci. Remote Sens. 47 (2009) 2193–2205.
[4] D. Baorto, J. Cimino, et al., Medical Entities Dictionary. Available: http://med.dmi.columbia.edu. Access date: Oct 20, 2016.
[5] T.A. Lasko, J.C. Denny, M.A. Levy, Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data, PLoS One 8 (2013) e66341.

SLIDE 23

THANK YOU

[Diagram: base models m1, a1, b1, u1 (Σ units) over raw features f11 … f3j]
SLIDE 24

[Diagram: level-0/level-1 stacking — logistic units (Σ) m1, a1, b1, u1 for microbiology, antibiotic, blood test, and urine test over raw features f11 … f3j, combined into a level-1 unit]

SLIDE 25

Example Features

SLIDE 26

SLIDE 27

[Diagram: full bulk-learning workflow — logistic units (Σ) over raw features (f11 … f3j) for microbiology, antibiotic, blood test, and urine test (four example base models); individual level-1 local units and a level-1 global unit (m1g, a1g, b1g, u1g; local2, global2) producing level-1 abstract features and global level-1 features]

  • 1. Define feature groups using a medical ontology

– 1a. Gather EHR data according to medical concepts
– 1b. Use the Medical Entities Dictionary to delineate feature scopes
– 1c. Apply feature selection within each concept group

  • 2. Compute base models

  • 3. Compute meta models (via ensemble learning)

– 3a. Per-disease ensembles: compute local level-1 models
– 3b. Cross-disease ensemble: compute a global level-1 model
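The three numbered steps can be wired together end to end. This sketch makes several simplifying assumptions: synthetic data in place of real EHR columns, gradient-descent logistic units for every model, and a meta model trained on the same data rather than on held-out folds:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300

# Step 1: feature groups as delineated by a (hypothetical) medical ontology
groups = {"microbiology": rng.normal(size=(n, 3)),
          "antibiotic":   rng.normal(size=(n, 2)),
          "blood_test":   rng.normal(size=(n, 2)),
          "urine_test":   rng.normal(size=(n, 2))}
y = (groups["microbiology"][:, 0] + groups["antibiotic"][:, 0] > 0).astype(float)

def logistic_fit_predict(X, y, lr=0.5, n_iter=1000):
    """One logistic unit: fit by gradient descent, return probabilities."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return 1 / (1 + np.exp(-(X @ w + b)))

# Step 2: one base model per concept group -> level-1 abstract features
level1 = np.column_stack([logistic_fit_predict(X, y) for X in groups.values()])

# Step 3: meta model (global level-1 unit) stacked on the base probabilities
p_global = logistic_fit_predict(level1, y)
acc = np.mean((p_global > 0.5) == y)
```

In proper stacked generalization [Wolpert 1992], the meta model is fit on out-of-fold level-0 predictions rather than in-sample ones; the in-sample shortcut here is only to keep the sketch short.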