Obtaining phenotype and outcome data from EHRs
Josh Denny, MD MS Vanderbilt University Medical Center 3/26/2018
EHR data are dense and efficient for discovery: Vanderbilt's experience (BioVU)
EHR Data from Vanderbilt Biobank
BioVU start
Vanderbilt biobank enrollment
eMERGE Goals:
Clinical notes, test reports, etc
chief_complaint: Shortness of Breath
history_present_illness: Congestive Heart Failure; Type 2 diabetes, negated
mother_medical_history: rheumatoid arthritis
Structured Output
certainty (positive, negated)
Who experienced it? (patient or family member?)
Structured Output
DrugName: atenolol
Strength: 50 mg
Frequency: daily
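The medication fields above (DrugName, Strength, Frequency) can be illustrated with a minimal sketch. It assumes a simple "name strength frequency" pattern; the regular expression, the unit list, and the frequency vocabulary are illustrative assumptions, not the extraction system used at Vanderbilt.

```python
import re

# Illustrative pattern only (an assumption): drug name, numeric strength
# with a unit, then a frequency word.
MED_PATTERN = re.compile(
    r"(?P<drug>[A-Za-z]+)\s+"
    r"(?P<strength>\d+(?:\.\d+)?\s*(?:mg|mcg|g))\s+"
    r"(?P<frequency>daily|bid|tid|qid|weekly)",
    re.IGNORECASE,
)


def extract_medications(text):
    """Return one dict per medication mention found in free text."""
    results = []
    for m in MED_PATTERN.finditer(text):
        results.append({
            "DrugName": m.group("drug").lower(),
            # Normalise internal whitespace in the strength, e.g. "50  mg" -> "50 mg".
            "Strength": " ".join(m.group("strength").split()),
            "Frequency": m.group("frequency").lower(),
        })
    return results


meds = extract_medications("Continue atenolol 50 mg daily.")
```

A real system would also need a drug dictionary, dose forms, and routes; this sketch only shows the shape of the structured output.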
Research EHR
CC: SOB HPI: Mr. **jones** is a 65yo w/ h/o CHF, … no dm2… Mother had RA.
CC: SOB HPI: Mr. Smith is a 65yo w/ h/o CHF, … no dm2… Mother had RA.
Medication extraction
Find biomedical concepts and qualifiers; create structured data
Customized classifiers (smoking status, etc)
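As a rough illustration of "find biomedical concepts and qualifiers", the sketch below tags each concept with certainty (positive or negated) and experiencer (patient or family member). The mini-lexicon, the cue lists, and the clause-splitting rule are all hypothetical simplifications, not the NLP pipeline described in the talk.

```python
import re

# Hypothetical mini-lexicon: abbreviation -> concept name.
LEXICON = {
    "sob": "Shortness of Breath",
    "chf": "Congestive Heart Failure",
    "dm2": "Type 2 diabetes",
    "ra": "rheumatoid arthritis",
}
NEGATION_CUES = {"no", "denies", "without"}                # assumed cue list
FAMILY_TERMS = {"mother", "father", "sister", "brother"}   # assumed cue list


def extract_concepts(note):
    """Return (concept, certainty, experiencer) tuples, clause by clause."""
    findings = []
    for clause in re.split(r"[.,;:]", note):
        tokens = re.findall(r"[a-z0-9]+", clause.lower())
        family = any(t in FAMILY_TERMS for t in tokens)
        for i, token in enumerate(tokens):
            if token in LEXICON:
                # A concept is negated if a cue precedes it in the same clause.
                negated = any(t in NEGATION_CUES for t in tokens[:i])
                findings.append((
                    LEXICON[token],
                    "negated" if negated else "positive",
                    "family member" if family else "patient",
                ))
    return findings


note = "CC: SOB HPI: Mr. Smith is a 65yo w/ h/o CHF, no dm2. Mother had RA."
findings = extract_concepts(note)
```

Clause-level scoping is a deliberate simplification: it keeps "no dm2" from negating CHF in the preceding clause, but it would miss negations that span punctuation.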
Billing codes
Deidentify: remove HIPAA identifiers + ….
Doesn’t have hypertension Has hypertension
Finding a “simple” disease in the EHR: Who has hypertension?
Definition: SBP > 140 or DBP > 90
Patient 1 Patient 2
Multiple components are better (and blood pressure is the worst)
Teixeira, JAMIA 2016
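To make "multiple components are better" concrete, here is a minimal sketch that combines three evidence types instead of relying on blood pressure alone. The two-of-three rule, the ICD-9 401.x prefix check, the short drug list, and both patient records are assumptions for illustration; this is not the algorithm evaluated by Teixeira et al.

```python
# Assumed, deliberately tiny list of antihypertensive drugs.
ANTIHYPERTENSIVES = {"atenolol", "lisinopril", "amlodipine"}


def hypertension_evidence(patient):
    """Report which evidence types support hypertension for one record."""
    bp = any(sbp > 140 or dbp > 90
             for sbp, dbp in patient.get("blood_pressures", []))
    # ICD-9 401.x is essential hypertension.
    codes = any(c.startswith("401") for c in patient.get("icd9_codes", []))
    meds = any(m.lower() in ANTIHYPERTENSIVES
               for m in patient.get("medications", []))
    return {"blood_pressure": bp, "billing_code": codes, "medication": meds}


def has_hypertension(patient, min_components=2):
    """Call a case when at least min_components evidence types agree (assumed rule)."""
    return sum(hypertension_evidence(patient).values()) >= min_components


# Hypothetical records: a treated patient with controlled readings, and an
# untreated patient with a single elevated reading.
treated = {"blood_pressures": [(128, 78)], "icd9_codes": ["401.9"],
           "medications": ["Atenolol"]}
one_high_reading = {"blood_pressures": [(150, 85)], "icd9_codes": [],
                    "medications": []}
```

In this toy data the blood pressure threshold alone would misclassify both patients, which is the failure mode the slide points at.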
Algorithm Development and Implementation
Clinical Notes (NLP - natural language processing)
Billing codes: ICD9 & CPT
Medications: ePrescribing & NLP
Labs & test results: NLP
True cases
Identify phenotype
Case & control algorithm development and refinement
Manual review; assess precision (≥95%: deploy; <95%: refine the algorithm)
Deploy in BioVU
Genetic association tests
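The manual-review gate in this workflow can be sketched as follows, assuming the 95% figure is a precision threshold applied to a sample of reviewed charts. The function names and the review counts in the usage lines are hypothetical.

```python
def precision(review_labels):
    """Fraction of algorithm-identified cases confirmed on manual chart review."""
    if not review_labels:
        raise ValueError("no reviewed charts")
    return sum(review_labels) / len(review_labels)


def next_step(review_labels, threshold=0.95):
    """Deploy when precision meets the threshold, otherwise refine the algorithm."""
    return "deploy" if precision(review_labels) >= threshold else "refine"


# Hypothetical review of 50 charts: True = reviewer confirmed a true case.
round_one = [True] * 45 + [False] * 5   # precision 0.90
round_two = [True] * 48 + [False] * 2   # precision 0.96
```

Here `next_step(round_one)` sends the algorithm back for refinement and `next_step(round_two)` clears it for deployment.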
Early discovery science in eMERGE – Hypothyroidism
Am J Hum Genet. 2011;89:529-42
SCN5A/SCN10A
n=5,272
Ritchie et al., Circulation 2013
What happens in the “heart healthy” population?
Examined the n=5,272 “heart healthy” population
Followed for development of atrial fibrillation based on genotype
[Figure: atrial fibrillation-free survival vs. years since normal ECG (and no heart disease), by genotype (GG, AG, AA); HR=1.49 per G allele, p=0.001]
Ritchie et al., Circulation 2013
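A per-allele hazard ratio implies a multiplicative (log-additive) effect for each copy of the allele. The sketch below works through what HR=1.49 per G allele would mean for each genotype relative to AA, assuming that model holds exactly; the function name is illustrative.

```python
HR_PER_G_ALLELE = 1.49  # from the slide: HR per G allele


def hazard_ratio(genotype, per_allele_hr=HR_PER_G_ALLELE):
    """Hazard ratio relative to AA under a log-additive per-allele model."""
    g_alleles = genotype.upper().count("G")
    return per_allele_hr ** g_alleles


hr_aa = hazard_ratio("AA")  # 1.0 (reference)
hr_ag = hazard_ratio("AG")  # 1.49
hr_gg = hazard_ratio("GG")  # 1.49 squared, about 2.22
```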
Mega et al., NEJM 2009
From clinical trials
Normal metabolizers Carriers
From the EHR
Delaney et al. Clin Pharm Ther. 2012
N=807, P=0.005
Clopidogrel adverse events associated with CYP2C19
Train a machine learning algorithm
Gulshan et al. JAMA 2016
Associated phenotypes
The curated EHR-based phenome
Associated genotypes
Dense genomic information
Binary traits Continuous traits
P-value for replication:
Nat Biotech 2013; 31:1102-1111
Karnes et al, Sci Trans Med 2017
Van Driest et al, JAMA 2016
Calculating a Phenotype Risk Score (PheRS)
OMIM feature 1 OMIM feature 2 OMIM feature k
For each record i, generate PheRS:
$$\mathrm{PheRS}_i = \sum_{j=1}^{k} w_j \, x_{ij}$$
PheRS_i: score for subject i
Sum: add up terms for k phenotypes
x_{ij}: 0 = phenotype j absent, 1 = phenotype j present
w_j: weight for phenotype j, derived from entire EHR
Human Phenotype Ontology EHR phenotypes
Repeat this for all Mendelian diseases
Bastarache et al, Science 2018
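The score can be sketched in a few lines: sum the weights of the disease's features that are present in a record. The slide says only that weights are derived from the entire EHR; the log inverse prevalence weighting used here is an assumption, and the four records are hypothetical.

```python
import math


def phenotype_weights(records, features):
    """Weight each feature by log(1 / prevalence) across all records (assumed weighting)."""
    n = len(records)
    weights = {}
    for feature in features:
        count = sum(1 for phenos in records.values() if feature in phenos)
        if count:  # a feature never observed cannot contribute to any score
            weights[feature] = math.log(n / count)
    return weights


def phers(phenotypes, weights):
    """Sum the weights of the features present in one record."""
    return sum(w for feature, w in weights.items() if feature in phenotypes)


# Hypothetical EHR: record id -> set of phenotypes observed.
records = {
    "r1": {"Bronchiectasis", "Pneumonia"},
    "r2": {"Pneumonia"},
    "r3": set(),
    "r4": {"Asthma"},
}
features = ["Bronchiectasis", "Pneumonia", "Asthma"]
weights = phenotype_weights(records, features)
scores = {rid: phers(phenos, weights) for rid, phenos in records.items()}
```

With this weighting, rare features such as bronchiectasis count for more than common ones such as pneumonia, so a record with several rare, disease-specific features scores highest.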
                      CF cases                 CF controls
Age/Sex               18F   26M   29F   29M    18F   26M   29F   29M
Phenotype Risk Score  9.8   4.4   6.3   7.8    2.5   0.7   0.0   0.7
Phenotypes scored: Chronic airway obstruction; Pneumonia; Diseases of pancreas; Hypovolemia; Acute upper respiratory infections; Asthma; Bronchiectasis; Intestinal malabsorption; Hepatomegaly; Acute pulmonary heart disease
Bastarache et al, Science 2018
N=21k on exome chip
6k SNVs
Bastarache et al, Science 2018
Direct Volunteers Health Care Provider Organizations
Health Surveys
Overview of the All of Us approach and protocol
EHR data
Baseline measurements
Biospecimens
Smartphones & Wearables
Multiple data types linked together by semantic standards
From Healthcare Provider Orgs
Meds Billing codes Labs
Version 1 (2018)
Clinical Notes & Reports Clinical Messaging
Version 2 Raw Data Repository Data added centrally by DRC
Death Index Claims & Rx Data
…
Visits
Local Registries Much longer term Images From Direct Volunteers Sync for Science
Participant provided data (Health surveys, activity monitors, etc) Participant exams and biospecimens
Curated Data Repository APIs, Analysis tools, etc
Geospatial data Health data aggregators (PicnicHealth)
Sync 4 Science (S4S) – a technology to share health data
S4S Pilot Sites S4S:
Common Clinical Data set
Traditional Approach: Bring data to researchers
Problems
AoU Approach: Bring researchers to the data
Data
Advantages
Public Cloud Download from public repository
The power of a data biosphere of common semantics and APIs
Cathie Sudlow
Professor of Neurology and Clinical Epidemiology
Director, Centre for Medical Informatics, Usher Institute, University of Edinburgh
Director of Health Data Research UK Scottish substantive site
Chief Scientist, UK Biobank
International Cohorts Summit, Durham, North Carolina, March 2018
future assays
– Data from portable wearable devices (100,000 accelerometry; 20,000+ continuous ECG)
– Sample assays in all or large subsets:
  Complete: genome-wide genotyping; biochemistry panel
  Underway/planned: exome and whole genome sequencing; proteomics; infectious disease assays; stool microbiome
– Multimodal imaging of 100,000 (>22,000 so far)
– Web questionnaires
Aim: identify a wide range of incident diseases and other health-related outcomes
Active methods requiring participant re-engagement
Passive methods via linkages to national health records
– for more detailed assessment of exposures – and to obtain information on outcomes that cannot be obtained through linking to health records
– Details of dietary intake
– Cognitive function
– Mental health (thoughts and feelings)
– Gastrointestinal symptoms
Useful for following change
selective attrition
Scotland: 36,000 participants
England: 446,000 participants
Wales: 21,000 participants
Regularly updated information on a wide range of diseases from NHS datasets in all three countries:
for all participants >14,000 by early 2016
for all participants >79,000 cancer cases by late 2015
for all participants 1000’s of cases of many incident diseases
referrals, prescriptions, labs etc for half of the participants 1000’s more cases of many incident diseases
status indicators
Cancer Neurodegenerative diseases Diabetes Chest diseases Cardiac diseases Musculoskeletal conditions Stroke Infections Mental health disorders Kidney diseases Eye diseases
                    Observed                              Predicted
                    By recruitment   Incident by 2015     Incident by 2022
Breast cancer       9,000            4,200                10,000
Colorectal cancer   2,300            2,500                7,000
Prostate cancer     3,000            4,300                9,000
→ Date, stage and grade of cancer
Beyond the structured registry data…exploring feasibility of retrieving additional information for subtyping of identified cancer cases through regional linkages to:
Observed: ascertained from baseline self report, hospital admissions and death registries
                        By recruitment   Incident by 2016   Estimated effects of including primary care data
Myocardial infarction   12,000           7,400              8,100
Stroke                  8,000            4,600              6,900
Diabetes                26,000           9,000              18,000
COPD                    10,000           7,600              16,900
Asthma                  60,000           5,700              19,000
Dementia                200              1,800              3,600
From published studies
[Figure: PPV (axis 0 to 1) by data source: Mortality; Hospital inpatient; Hospital in- & outpatient; Insurance; Outpatients; Primary care]
Wide variation but in most PPV >80%
[Figure: PPV (axis 0 to 1) for Alzheimer’s disease and vascular dementia]
PPV for AD generally higher than for vascular dementia
Dementia: 80%
Alzheimer’s disease: 72%
Vascular dementia: 44%
Obtaining these data at national scale is challenging To extract value from these data on 1000’s of
approaches: crowd sourcing, natural language processing, machine learning, artificial intelligence… Structured, coded data from linked national healthcare datasets:
Deeper phenotyping of disease will require multiple unstructured data sources, including:
Cardiovascular: Cholesterol, Direct LDL-c, HDL-c, Triglyceride, ApoA, ApoB, Lp(a), CRP
Cancer: SHBG, Testosterone, Oestradiol, IGF-I
Bone and joint: Vitamin D, Rheumatoid factor, Alkaline Phosphatase, Calcium
Liver: Albumin, Direct Bilirubin, Total Bilirubin, GGT, ALT, AST
Diabetes: HbA1c, Glucose
Renal: Creatinine, Cystatin C, Total protein, Urea, Phosphate, Urate
Urinary:
Note: Haematological assays were conducted during recruitment phase
25,000 aliquots produced per day
700 participants per day
4,900 sample tubes per day
15 million 0.85ml aliquots
Total > 15 million aliquots
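The throughput figures above can be cross-checked with simple arithmetic. The sketch assumes a steady daily rate, which the slides do not state, so the derived numbers are back-of-envelope only.

```python
# Figures taken from the slide.
participants_per_day = 700
tubes_per_day = 4_900
aliquots_per_day = 25_000
total_aliquots = 15_000_000

# Derived quantities (assuming a constant daily rate).
tubes_per_participant = tubes_per_day / participants_per_day        # 7.0
aliquots_per_participant = aliquots_per_day / participants_per_day  # about 35.7
processing_days = total_aliquots / aliquots_per_day                 # 600.0
```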
Condition                          2012     2017     2022
Diabetes                           10,000   25,000   40,000
Heart attack                       7,000    17,000   28,000
Stroke                             2,000    5,000    9,000
Chronic obstructive lung disease   3,000    8,000    14,000
Breast cancer                      2,500    6,000    10,000
Colorectal cancer                  1,500    3,500    7,000
Prostate cancer                    1,500    3,500    7,000
Hip fracture                       1,000    2,500    6,000
Alzheimer’s                        1,000    3,000    9,000
Accelerometry data: 100,000 participants Continuous ECG monitoring: 20,000 + participants
Prospective design and large size enable reasonably well-powered studies of (causal) associations between accelerometry and cardiac rhythm measures and later
Prospective design and large size enable well-powered studies of (causal) associations between structure and function of organs and later onset disease…but…need scalable methods of analysing complex data to derive measures for large scale analyses
CKB Principal Investigator
Professor of Epidemiology
Nuffield Dept. of Population Health
University of Oxford, UK
(zhengming.chen@ndph.ox.ac.uk)
International Cohorts Summit, Duke University, USA, 26-27 March 2018
1. Registration & consent
2. Physical exam.
3. Physical exam.
4. Physical exam.
5. Sample collection
7. Questionnaire (with recording)
8. Physical exam.
9. Clinic consultation
The clinic visit took 60-90 minutes, with daily statistical monitoring
Questionnaire: SES, smoking, alcohol, tea, diet, physical activity, indoor air pollution, sleep, reproductive patterns, medical history
Measurements: Blood pressure, height, weight, lung function, heart rate, bone density, exhaled CO, ECG, cIMT, ambient temperature, ambient air pollution, blood lipids, metabolites, proteomics, infectious markers, genetics
Electronic health records: >1,300 different diseases, 43K deaths, <5K lost to follow-up, ~0.9 million hospitalizations, >100 million chargeable items
www.ckbiobank.org
Data is growing rapidly
Active follow-up Disease registries National health insurance Death registries
(supplementing death and cancer registries)
Nearly all CKB participants now linked to the health insurance databases via unique national ID number
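Linkage on a unique identifier amounts to a keyed join between the cohort and the claims data. The sketch below shows the idea with in-memory records; the field names (`study_id`, `national_id`, `icd10`) and the sample rows are hypothetical, and real linkage would run inside a secure environment with the identifier protected.

```python
from collections import defaultdict


def link_claims(participants, claims):
    """Group insurance claims under the participant sharing the same national ID."""
    by_id = {p["national_id"]: p["study_id"] for p in participants}
    linked = defaultdict(list)
    unmatched = []
    for claim in claims:
        study_id = by_id.get(claim["national_id"])
        if study_id is None:
            unmatched.append(claim)  # claim belongs to someone outside the cohort
        else:
            linked[study_id].append(claim["icd10"])
    return dict(linked), unmatched


# Hypothetical sample data.
participants = [
    {"study_id": "P1", "national_id": "ID-001"},
    {"study_id": "P2", "national_id": "ID-002"},
]
claims = [
    {"national_id": "ID-001", "icd10": "I63"},
    {"national_id": "ID-001", "icd10": "E11"},
    {"national_id": "ID-999", "icd10": "J44"},
]
linked, unmatched = link_claims(participants, claims)
```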
Disease (ICD-10 codes)                                          Count
Ischaemic Stroke (I63)                                          46063
Diabetes (E10-E14)                                              32468
Cancer (C00-C97)                                                28274
COPD (J41-J44)                                                  19796
Fractures (S02, S12, S22, S32, S42, S52, S62, S72, S82, S92)    15539
Cataract (H25-H28)                                              14277
Angina (I20)                                                    13882
Haemorrhagic Stroke (I61)                                       11037
MI (I21-I23)                                                    8035
Arrhythmia (I47-I49)                                            6270
Chronic liver disease (K70-K77, B18-B19, I85, Z22.5)            5904
Pulmonary heart disease (I26-I27)                               5489
Heart failure (I50)                                             4632
Anxiety disorders (F40-F48)                                     4002
Tuberculosis (A15-A19, B90, J65)                                3188
Asthma (J45-J46)                                                3169
CKD (N02-N03, N07, N11, N18)                                    2748
Osteoporosis (M80-M81)                                          2664
Coronary revascularisation                                      2273
Rheumatoid arthritis (M05-M06, M45)                             1730
Schizophrenia (F20-F29)                                         1316
Depression (F30-F39)                                            1192
Retinopathy (E10-4.3,H36.0)                                     1104
SAH (I60, I69.0)                                                1098
Parkinson disease (G20-G21)                                     822
(Venomous) snake bite (T63.0, X20)                              425
Victim of earthquake (X34)                                      54
(43K deaths, 0.9M hospital admissions; 2017 HI data are being processed)
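Grouping coded events into the disease categories listed above comes down to matching each ICD-10 code against category ranges. The sketch below covers a subset of the ranges from the chart labels; the function names are illustrative, and matching on the three-character category only is a simplifying assumption (entries defined by four-character codes, such as Z22.5, would need extra handling).

```python
import re

# Subset of the category ranges shown in the chart labels above.
DISEASE_RANGES = {
    "Ischaemic stroke": [("I63", "I63")],
    "Diabetes": [("E10", "E14")],
    "Cancer": [("C00", "C97")],
    "COPD": [("J41", "J44")],
    "Haemorrhagic stroke": [("I61", "I61")],
    "MI": [("I21", "I23")],
}


def category(code):
    """Return the three-character ICD-10 category, e.g. 'I63.9' -> 'I63'."""
    match = re.match(r"^([A-Z])(\d{2})", code.strip().upper())
    if match is None:
        raise ValueError(f"not an ICD-10 code: {code!r}")
    return match.group(1) + match.group(2)


def classify(code):
    """Return the disease group for a code, or None if it is not in the subset."""
    cat = category(code)
    for disease, ranges in DISEASE_RANGES.items():
        for low, high in ranges:
            # Each range stays within one letter, so comparing the
            # fixed-width strings is equivalent to comparing the numbers.
            if low <= cat <= high:
                return disease
    return None


examples = {code: classify(code) for code in ["I63.9", "E11", "c50.1", "I10"]}
```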
[Figure: panels for Haemorrhagic stroke and NAFLD; x-axis: Waist circumference (cm)]
(>70K adjudicated: 30K stroke, 25K IHD, 15K cancer, >3K CKD)
(www.ckbiobank.org)