[PPT] - Obtaining phenotype and outcome data from EHRs Josh Denny, MD MS PowerPoint Presentation

SLIDE 1

Obtaining phenotype and outcome data from EHRs

Josh Denny, MD MS Vanderbilt University Medical Center 3/26/2018

SLIDE 2

EHR Data from Vanderbilt Biobank

BioVU start

Vanderbilt biobank enrollment

EHR data are dense and efficient for discovery: Vanderbilt’s experience (BioVU)

SLIDE 3

eMERGE Goals:

To perform genomic studies using the EHR
To implement genomic medicine

SLIDE 4

Making text documents useful for research

Clinical notes, test reports, etc

CC: SOB HPI: Mr. Smith is a 65yo w/ h/o CHF, … no dm2… on atenolol 50mg daily… Mother had RA.

Deidentify: remove HIPAA identifiers + ….

Research EHR

CC: SOB HPI: Mr. **jones** is a 65yo w/ h/o CHF, … no dm2… on atenolol 50mg daily… Mother had RA.

Find biomedical concepts and qualifiers; create structured data

Structured Output
chief_complaint: Shortness of Breath
history_present_illness: Congestive Heart Failure; Type 2 diabetes, negated
mother_medical_history: rheumatoid arthritis
certainty (positive, negated)
Who experienced it? (patient or family member?)

Medication extraction

Structured Output
DrugName: atenolol
Strength: 50 mg
Frequency: daily

Customized classifiers (smoking status, etc)

Billing codes
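The note-to-structured-output step on this slide can be illustrated with a minimal Python sketch. It is not the pipeline used at Vanderbilt: the abbreviation lexicon, the negation and family cue lists, and the medication pattern are all hypothetical and cover only the example note, whereas a production system would rely on a full clinical NLP toolkit and terminology.

```python
import re

# Hypothetical mini-lexicon mapping note abbreviations to concepts.
CONCEPTS = {
    "sob": "Shortness of Breath",
    "chf": "Congestive Heart Failure",
    "dm2": "Type 2 diabetes",
    "ra": "rheumatoid arthritis",
}
# Hypothetical cue lists for the two qualifiers named on the slide.
NEGATION_CUES = {"no", "denies", "without"}
FAMILY_CUES = {"mother", "father", "sister", "brother"}

# Hypothetical pattern for "on <drug> <dose><unit> <frequency>".
MED_PATTERN = re.compile(
    r"\bon\s+(?P<drug>[a-z]+)\s+(?P<dose>\d+)\s*(?P<unit>mg|mcg|g)\s+"
    r"(?P<freq>daily|bid|tid|qid)\b",
    re.IGNORECASE,
)


def extract_concepts(note):
    """Return one record per lexicon hit, with certainty and experiencer."""
    records = []
    # Treat full stops and ellipses as segment boundaries.
    for segment in re.split(r"[.\u2026]+", note):
        tokens = re.findall(r"[a-z0-9]+", segment.lower())
        in_family_segment = any(t in FAMILY_CUES for t in tokens)
        for i, token in enumerate(tokens):
            if token not in CONCEPTS:
                continue
            # A negation cue within the three preceding tokens negates the concept.
            preceding = tokens[max(0, i - 3):i]
            negated = any(t in NEGATION_CUES for t in preceding)
            records.append({
                "concept": CONCEPTS[token],
                "certainty": "negated" if negated else "positive",
                "experiencer": "family member" if in_family_segment else "patient",
            })
    return records


def extract_medications(note):
    """Return DrugName / Strength / Frequency records found in the note."""
    return [
        {
            "DrugName": m.group("drug").lower(),
            "Strength": f"{m.group('dose')} {m.group('unit').lower()}",
            "Frequency": m.group("freq").lower(),
        }
        for m in MED_PATTERN.finditer(note)
    ]


NOTE = ("CC: SOB HPI: Mr. Smith is a 65yo w/ h/o CHF, ... no dm2... "
        "on atenolol 50mg daily... Mother had RA.")
concepts = extract_concepts(NOTE)
medications = extract_medications(NOTE)
```

Keeping certainty and experiencer as separate fields mirrors the slide's two qualifiers, so a negated "dm2" or a mother's "RA" is retained in the output without being counted as a patient diagnosis.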

SLIDE 5

Finding a “simple” disease in the EHR: Who has hypertension?

Definition: SBP > 140 or DBP > 90

[Figure panels: Patient 1, Patient 2; labels: “Doesn’t have hypertension”, “Has hypertension”]
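A short Python sketch shows why the threshold definition alone is ambiguous once a patient has many readings. The readings and the two decision rules ("any elevated reading" versus "most readings elevated") are hypothetical illustrations, not rules taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class BPReading:
    systolic: int   # SBP, mmHg
    diastolic: int  # DBP, mmHg


def is_elevated(reading):
    """Apply the slide's definition to a single reading: SBP > 140 or DBP > 90."""
    return reading.systolic > 140 or reading.diastolic > 90


def any_elevated(readings):
    """Hypothetical rule 1: one elevated reading is enough."""
    return any(is_elevated(r) for r in readings)


def mostly_elevated(readings, min_fraction=0.5):
    """Hypothetical rule 2: more than min_fraction of readings must be elevated."""
    if not readings:
        return False
    elevated = sum(is_elevated(r) for r in readings)
    return elevated / len(readings) > min_fraction


# Hypothetical patient: one high reading (e.g. during an acute visit), two normal.
readings = [BPReading(150, 85), BPReading(128, 80), BPReading(132, 84)]
rule_1 = any_elevated(readings)     # True
rule_2 = mostly_elevated(readings)  # False
```

The same record is a case under one rule and a non-case under the other, and a treated patient with controlled pressure would be missed by both.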

SLIDE 6

Our “simple” example: Hypertension

Multiple components are better (and blood pressure is the worst)

Teixeira, JAMIA 2016

SLIDE 7

Algorithm Development and Implementation

Clinical Notes: NLP (natural language processing)
Billing codes: ICD9 & CPT
Medications: ePrescribing & NLP
Labs & test results: NLP

What we learned - Finding phenotypes in the EHR

Identify phenotype of interest → Case & control algorithm development and refinement → Manual review; assess precision (true cases)

Precision ≥95%: Deploy in BioVU → Genetic association tests
Precision <95%: return to algorithm development and refinement
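The review loop on this slide (manual review, assess precision, deploy once precision reaches 95%) can be sketched in a few lines of Python. The function names and the reviewed-sample counts are hypothetical; only the 95% threshold comes from the slide.

```python
def precision(review_labels):
    """Fraction of algorithm-flagged cases that manual review confirmed.

    review_labels: list of booleans, True when the reviewer confirmed the case.
    """
    if not review_labels:
        raise ValueError("no reviewed records")
    return sum(review_labels) / len(review_labels)


def next_step(review_labels, threshold=0.95):
    """Deploy when precision meets the threshold, otherwise refine the algorithm."""
    return "deploy" if precision(review_labels) >= threshold else "refine"


# Hypothetical review rounds of 50 flagged records each.
round_1 = [True] * 45 + [False] * 5   # precision 0.90
round_2 = [True] * 48 + [False] * 2   # precision 0.96
decisions = [next_step(round_1), next_step(round_2)]
```

Precision only measures how many flagged records are true cases; it says nothing about cases the algorithm missed, which is why control definitions are developed alongside case definitions.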

SLIDE 8

Early discovery science in eMERGE – Hypothyroidism

Am J Hum Genet. 2011;89:529-42

Algorithms can be deployed across multiple EHRs Analyses can be performed using extant data

SLIDE 9

GWAS of QRS Duration in eMERGE

SCN5A/SCN10A

n=5,272

Ritchie et al., Circulation 2013

SLIDE 10

What happens in the “heart healthy” population?

Examined the n=5272 “heart healthy” population
Followed for development of atrial fibrillation based on genotype

[Figure: atrial fibrillation-free survival against years since normal ECG (and no heart disease), by genotype (GG, AG, AA)]

HR=1.49 per G allele, p=0.001

Ritchie et al., Circulation 2013

SLIDE 11

Mega et al., NEJM 2009

From clinical trials

Normal metabolizers Carriers

From the EHR

Delaney et al. Clin Pharm Ther. 2012

N=807, P=0.005

EHRs for drug response:

Clopidogrel adverse events associated with CYP2C19

SLIDE 12

Deep learning for Diabetic Retinopathy

Train a machine learning algorithm over >128k images

Gulshan et al. JAMA 2016

SLIDE 13

Phenome scanning (PheWAS) in the EHR

A genetic variant

Associated phenotypes

The curated EHR-based phenome

A phenotype

Associated genotypes

Dense genomic information

SLIDE 14

Replications of GWAS associations via PheWAS

Binary traits Continuous traits

P-value for replication:

All - 210/751: 2x10^-98
Powered - 51/77: 3x10^-47

Nat Biotech 2013; 31:1102-1111

SLIDE 15

PheWAS across all HLA types

(n= 37,270)

Karnes et al, Sci Trans Med 2017

SLIDE 16

The potential for “call back” deeper phenotyping: Long QT genes (SCN5A and KCNH2) in 2,200 sequenced patients in eMERGE

83 rare (MAF < 1%) in SCN5A, 45 in KCNH2
121/128 MAF < 0.5%, 92 singletons
Three labs assessed known/likely pathogenicity

Lab 1: 16/121; Lab 2: 24/121; Lab 3: 17/121

4

Van Driest et al, JAMA 2016

SLIDE 17

Calculating a Phenotype Risk Score (PheRS)

OMIM feature 1 OMIM feature 2 OMIM feature k

...

For each record i, generate PheRS

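$$\mathrm{PheRS}_i = \sum_{j=1}^{k} x_{ij}\, w_j$$

where PheRS_i is the score for subject i, the sum adds up terms for the k phenotypes, x_ij = 1 if phenotype j is present in subject i and 0 if it is absent, and w_j is the weight for phenotype j, derived from the entire EHR.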

Human Phenotype Ontology EHR phenotypes

Repeat this for all Mendelian diseases

Bastarache et al, Science 2018
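The score on this slide adds a weight for each disease-relevant phenotype found in a record. The sketch below assumes the weight is the negative log of the phenotype's prevalence in the cohort, so rarer phenotypes count more. The slide says only that weights are derived from the entire EHR, so the formula, the function names and the toy cohort are assumptions.

```python
import math


def phenotype_weights(records, phenotypes):
    """Assumed weighting: w_j = -log(prevalence of phenotype j in the cohort).

    records: list of sets, each holding the phenotypes present in one record.
    phenotypes: the k phenotypes that define the disease profile.
    """
    n = len(records)
    weights = {}
    for phenotype in phenotypes:
        count = sum(1 for record in records if phenotype in record)
        if count:  # a phenotype never observed cannot contribute to any score
            weights[phenotype] = math.log(n / count)
    return weights


def phers(record, weights):
    """Sum the weights of the profile phenotypes present in this record."""
    return sum(w for phenotype, w in weights.items() if phenotype in record)


# Hypothetical four-record cohort and a two-phenotype profile.
cohort = [
    {"Asthma", "Bronchiectasis"},
    {"Asthma"},
    set(),
    {"Asthma"},
]
weights = phenotype_weights(cohort, ["Asthma", "Bronchiectasis"])
scores = [phers(record, weights) for record in cohort]
```

Under this weighting a common finding such as asthma adds little, while a rare one such as bronchiectasis adds much more, which is how a record with several rare, disease-specific features ends up with a high score.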

SLIDE 18

                       CF cases                  CF controls
Age/Sex                18F   26M   29F   29M     18F   26M   29F   29M
Phenotype Risk Score   9.8   4.4   6.3   7.8     2.5   0.7   0.0   0.7

Phenotypes (rows of the original figure): Chronic airway obstruction; Pneumonia; Diseases of pancreas; Hypovolemia; Acute upper respiratory infections; Asthma; Bronchiectasis; Intestinal malabsorption; Hepatomegaly; Acute pulmonary heart disease

Example: a phenotype risk score in Cystic Fibrosis

Bastarache et al, Science 2018

SLIDE 19

PheRS identified potentially pathogenic SNVs

N=21k on exome chip
6k SNVs

Bastarache et al, Science 2018

SLIDE 20

The All of Us Research Program – Breaking Down Data Silos

Precision Medicine Initiative, PMI, All of Us, the All of Us logo, and The Future of Health Begins With You are service marks of the U.S. Department of Health and Human Services.

SLIDE 21

Direct Volunteers Health Care Provider Organizations

Health Surveys

Overview of the All of Us approach and protocol

EHR data; Baseline measurements; Biospecimens; Smartphones & Wearables

Multiple data types linked together by semantic standards

SLIDE 22

From Healthcare Provider Orgs

All of Us will aggregate data from many sources

Meds Billing codes Labs

Version 1 (2018)

Clinical Notes & Reports Clinical Messaging

Version 2 Raw Data Repository Data added centrally by DRC

Death Index Claims & Rx Data

…

Visits

Local Registries Much longer term Images From Direct Volunteers Sync for Science

Participant provided data (Health surveys, activity monitors, etc) Participant exams and biospecimens

Curated Data Repository APIs, Analysis tools, etc

Geospatial data Health data aggregators (PicnicHealth)

SLIDE 23

Sync 4 Science (S4S) – a technology to share health data

S4S Pilot Sites S4S:

FHIR-based
Starting with MU Common Clinical Data set

SLIDE 24

Data Access is centralized in All of Us

Traditional Approach: Bring data to researchers

Problems

Data sharing = data copying
Security (data handoffs)
Huge infrastructure needed
Siloed compute

AoU Approach: Bring researchers to the data

Data

Advantages

Cost
Threat detection and auditing
Increased Accessibility
Shared compute

Public Cloud Download from public repository

SLIDE 25

The power of a data biosphere of common semantics and APIs

SLIDE 26

Obtaining phenotype and outcome data from e-health records and digital platforms: the experience of UK Biobank

Cathie Sudlow Professor of Neurology and Clinical Epidemiology Director, Centre for Medical Informatics, Usher Institute, University of Edinburgh Director of Health Data Research UK Scottish substantive site Chief Scientist, UK Biobank International Cohorts Summit, Durham, North Carolina March 2018

SLIDE 27

UK Biobank in a nutshell

500,000 UK men and women aged 40-69 years when recruited during 2006-2010
Consent for all types of health research by both academic and commercial researchers
Extensive baseline questions and physical measures, with biological samples stored for future assays

Subsequent enhancements in all or large subsets of participants:

– Data from portable wearable devices (100,000 accelerometry; 20,000+ continuous ECG)
– Sample assays in all or large subsets:
  Complete: genome-wide genotyping; biochemistry panel
  Underway/planned: exome and whole genome sequencing; proteomics; infectious disease assays; stool microbiome
– Multimodal imaging of 100,000 (>22,000 so far)
– Web questionnaires

Comprehensive, long term follow-up for a wide range of health-related outcomes
Open access for approved research: see www.ukbiobank.ac.uk

SLIDE 28

Aim: identify a wide range of incident diseases and other health related outcomes

Active methods requiring participant re-engagement

face to face reassessment
postal or web-based surveys
expensive
prone to incomplete coverage & selective loss to follow-up
miss cases emerging between assessments

Passive methods via linkages to national health records

can follow all participants without need for re-engagement
efficient and cost effective
need adequate consent at recruitment
rely on universal healthcare system & availability of relevant datasets
can only detect cases of disease diagnosed in a healthcare setting
data need to be accurate and sufficiently detailed for research studies

Follow-up of participants in very large prospective cohorts

SLIDE 29

Web questionnaires

Using email and web questionnaires

– for more detailed assessment of exposures
– and to obtain information on outcomes that cannot be obtained through linking to health records

Of 350,000 with email, >150,000 complete each questionnaire

– Details of dietary intake
– Cognitive function
– Mental health (thoughts and feelings)
– Gastrointestinal symptoms

Useful for following change over time…but beware selective attrition

SLIDE 30

Following the health of 0.5 million UK Biobank participants through linking to National Health Service (NHS) records

Scotland: 36,000 participants
England: 446,000 participants
Wales: 21,000 participants

Regularly updated information on a wide range of diseases from NHS datasets in all three countries:

Deaths – date and cause of death
for all participants; >14,000 by early 2016

Cancers – date, stage and grade of cancer
for all participants; >79,000 cancer cases by late 2015

Admissions to hospital – dates, diagnoses, procedures
for all participants; 1000’s of cases of many incident diseases

Primary care data – dates, diagnoses, symptoms, signs, referrals, prescriptions, labs etc
for half of the participants; 1000’s more cases of many incident diseases

SLIDE 31

Maximising the value of the linked healthcare data

Messy ‘real world data’ - not collected primarily for research
Not 100% accurate due to administrative and clinical error
Mainly structured, coded datasets (ICD, OPCS4, Read…)
Experts advising in a range of disease areas:
Combine different linked data sources to create algorithmically derived disease status indicators

Estimate the accuracy and completeness of these
Consider limitations and potential additional sources of unstructured data

Cancer; Neurodegenerative diseases; Diabetes; Chest diseases; Cardiac diseases; Musculoskeletal conditions; Stroke; Infections; Mental health disorders; Kidney diseases; Eye diseases
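Combining linked sources into an algorithmically derived disease status indicator can be sketched as below. The source names, the record layout and the prevalent/incident rule (earliest coded date compared with the recruitment date) are assumptions for illustration, not UK Biobank's published algorithms.

```python
from datetime import date


def derive_disease_status(events, recruitment_date):
    """Combine coded events from several linked sources for one participant.

    events: list of (source, event_date) tuples for a single disease,
            e.g. ("hospital", date(2012, 5, 1)).
    Returns the earliest date, the contributing sources and a status flag.
    """
    if not events:
        return {"status": "no record", "first_date": None, "sources": []}
    first_date = min(event_date for _, event_date in events)
    # Assumed rule: first evidence on or before recruitment means prevalent disease.
    status = "prevalent" if first_date <= recruitment_date else "incident"
    return {
        "status": status,
        "first_date": first_date,
        "sources": sorted({source for source, _ in events}),
    }


# Hypothetical participant recruited in 2008 with two later coded events.
indicator = derive_disease_status(
    [("hospital", date(2012, 5, 1)), ("death", date(2015, 1, 3))],
    recruitment_date=date(2008, 3, 1),
)
```

Recording which sources contributed makes it possible to estimate accuracy and completeness per source later, which the slide lists as the next step.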

SLIDE 32

Cancers in UK Biobank ascertained from the national cancer registries

                     Observed                              Predicted
                     By recruitment    Incident by 2015    Incident by 2022
Breast cancer        9,000             4,200               10,000
Colorectal cancer    2,300             2,500               7,000
Prostate cancer      3,000             4,300               9,000

→ Date, stage and grade of cancer

Beyond the structured registry data…exploring feasibility of retrieving additional information for subtyping of identified cancer cases through regional linkages to:

histopathology reports
digitised histopathology slides
tumour specimens

SLIDE 33

Exemplar non-cancer conditions in UK Biobank

ascertained from baseline self report, hospital admissions and death registries

                         Observed
                         By recruitment    Incident by 2016    Incident by 2016, estimated with primary care data included
Myocardial infarction    12,000            7,400               8,100
Stroke                   8,000             4,600               6,900
Diabetes                 26,000            9,000               18,000
COPD                     10,000            7,600               16,900
Asthma                   60,000            5,700               19,000
Dementia                 200               1,800               3,600

SLIDE 34

Exemplar non-cancer conditions in UK Biobank


Estimated effects of including primary care data

Accuracy? Limitations?

SLIDE 35

Dementia: positive predictive value of routine healthcare data

From published studies

[Figure: PPV (axis 0 to 1) by data source: Mortality; Hospital inpatient; Hospital in- & outpatient; Insurance; Outpatients; Primary care]

Wide variation but in most PPV >80%

[Figure: PPV (axis 0 to 1) for Alzheimer’s disease and for vascular dementia]

PPV for AD generally higher than for vascular dementia

SLIDE 36

Dementia: positive predictive value of routine healthcare data

From comparison with expert review of free text electronic medical record in UK Biobank

Dementia: 80% Alzheimer’s disease: 72% Vascular dementia: 44%
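Positive predictive value is the share of coded cases that expert review confirms, and with modest review samples it is worth reporting an interval around it. The sketch below computes PPV with a Wilson score interval. The counts are hypothetical, chosen only so that the point estimate matches the 80% on the slide, which gives percentages but no denominators.

```python
import math


def ppv_with_wilson_interval(true_positives, false_positives, z=1.96):
    """Return (ppv, lower, upper) using the Wilson score interval.

    true_positives: coded cases confirmed on expert review.
    false_positives: coded cases not confirmed on expert review.
    z: normal quantile; 1.96 gives an approximate 95% interval.
    """
    n = true_positives + false_positives
    if n == 0:
        raise ValueError("no reviewed cases")
    p = true_positives / n
    denominator = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return p, centre - half_width, centre + half_width


# Hypothetical review of 100 coded dementia cases, 80 confirmed.
ppv, lower, upper = ppv_with_wilson_interval(80, 20)
```

The Wilson interval is used here because it stays within 0 and 1 and behaves sensibly for the small subtype samples that reviews of this kind often produce.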

SLIDE 37

Beyond the linked coded healthcare data

Structured, coded data from linked national healthcare datasets:

Can ascertain cases of a wide range of diseases with acceptable accuracy
Capture only 10-20% of the information from electronic medical records
Are limited for detailed sub-phenotyping of disease

Deeper phenotyping of disease will require multiple unstructured data sources, including:

Free text of electronic records
Complex electrical signalling data (ECGs, EEGs etc)
Histopathology slide sets
Clinical imaging data

Obtaining these data at national scale is challenging

To extract value from these data on 1000’s of outcomes across multiple diseases, we need scalable approaches: crowd sourcing, natural language processing, machine learning, artificial intelligence…

SLIDE 38

Acknowledgements

SLIDE 39

SLIDE 40

Biochemistry analyses in all 500,000 participants

Cardiovascular: Cholesterol; Direct LDL-c; HDL-c; Triglyceride; ApoA; ApoB; Lp(a); CRP
Cancer: SHBG; Testosterone; Oestradiol; IGF-I
Bone and joint: Vitamin D; Rheumatoid factor; Alkaline Phosphatase; Calcium
Liver: Albumin; Direct Bilirubin; Total Bilirubin; GGT; ALT; AST
Diabetes: HbA1c; Glucose
Renal: Creatinine; Cystatin C; Total protein; Urea; Phosphate; Urate
Urinary: Creatinine; Sodium; Potassium; Albumin

Note: Haematological assays were conducted during recruitment phase

SLIDE 41

500,000 participants
22 recruitment centres
89% England; 7% Scotland; 4% Wales

SLIDE 42

Industrial scale processes: samples during recruitment

700 participants per day
4,900 sample tubes per day
25,000 aliquots produced per day
15 million 0.85ml aliquots

SLIDE 43

Blood

whole blood serum plasma red cells buffy coat

Urine
Saliva

Total > 15 million aliquots

SLIDE 44

Expected disease cases during follow-up

Condition                           2012      2017      2022
Diabetes                            10,000    25,000    40,000
Heart attack                        7,000     17,000    28,000
Stroke                              2,000     5,000     9,000
Chronic obstructive lung disease    3,000     8,000     14,000
Breast cancer                       2,500     6,000     10,000
Colorectal cancer                   1,500     3,500     7,000
Prostate cancer                     1,500     3,500     7,000
Hip fracture                        1,000     2,500     6,000
Alzheimer’s                         1,000     3,000     9,000

SLIDE 45

Data from portable wearable devices

Accelerometry data: 100,000 participants
Continuous ECG monitoring: 20,000+ participants

Prospective design and large size enable reasonably well-powered studies of (causal) associations between accelerometry and cardiac rhythm measures and later onset disease

SLIDE 46

Sample analyses

Genome-wide genotyping of all participants
Standard panel of assays (e.g. lipids; hormones; metabolic) on samples from all participants
Exome & whole genome sequencing, proteomics, metabolomics, infectious disease assays, stool microbiome…all underway/planned

SLIDE 47

Multimodal imaging of 100,000 participants

>22,000 imaged so far

Prospective design and large size enable well-powered studies of (causal) associations between structure and function of organs and later onset disease…but…need scalable methods of analysing complex data to derive measures for large scale analyses

SLIDE 48

Obtaining phenotype and outcome data from health records:

China Kadoorie Biobank experience

Zhengming CHEN

CKB Principal Investigator Professor of Epidemiology Nuffield Dept. of Population Health University of Oxford, UK (zhengming.chen@ndph.ox.ac.uk)

International Cohorts Summit, Duke University, USA, 26-27 March 2018

SLIDE 49

>512K recruited from 10 localities in 2004-08
Participants interviewed, measured, and gave plasma and DNA (urine) for long-term storage
All followed up indefinitely via electronic record linkage to deaths and ALL hospital episodes
Periodic resurvey of 5% surviving participants (for enhancements and sources of variation)

China Kadoorie Biobank (CKB)

Informed consent for linkages to health records and unspecified research use of stored samples

SLIDE 50

CKB: Clinical stations at local assessment centre

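Station signs (translated from Chinese): Registration, informed consent, hand out survey pack; Questionnaire (including general questionnaire, COPD questionnaire, CIDI questionnaire and completeness check I); Completion (completeness check II: check for missed items; hand out examination results report); Body fat; Hand grip strength; Physical examination; Height; Waist and hip circumference; Blood pressure; Pulse wave velocity; Cardiovascular examination; Blood collection and blood tests; Urine collection and urine tests; Blood and urine sample collection and testing; Lung function; Carbon monoxide concentration; Physiological examination; Carotid ultrasound; Heel bone density; ECG; Other examinations

1. Registration & consent
2. Physical exam.
3. Physical exam.
4. Physical exam.
5. Sample collection
7. Questionnaire (with recording)
8. Physical exam.
9. Clinic consultation

The clinic visit took 60-90 minutes, with daily statistical monitoring

Recruitment rate: 700-800 per day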

SLIDE 51

CKB: Supported by >90 bespoke IT systems

SLIDE 52

CKB: Fully established with 10-year follow-up

Questionnaire: SES, smoking, alcohol, tea, diet, physical activity, indoor air pollution, sleep, reproductive patterns, medical history

Measurements: Blood pressure, height, weight, lung function, heart rate, bone density, exhaled CO, ECG, cIMT, ambient temperature, ambient air pollution, blood lipids, metabolites, proteomics, infectious markers, genetics

Electronic health records: >1,300 different diseases, 43K deaths, <5K lost to follow-up, ~0.9 million hospitalizations, >100 million chargeable items

www.ckbiobank.org

Data is growing rapidly

SLIDE 53

Outcome Follow up in CKB

Active follow-up; Disease registries; National health insurance; Death registries

CKB: Follow-up through record linkages

SLIDE 54

National health insurance system in China

(supplementing death and cancer registries)

Introduced during 2004-6, with almost universal coverage in CKB areas by 2010
Multiple disease diagnoses, with ICD-10 codes plus disease descriptions and >2,500 procedure codes
Managed electronically at city level, with detailed chargeable items for reimbursement purposes
Lacks certain details (e.g. cancer pathology) required for disease sub-phenotyping

Nearly all CKB participants now linked to the health insurance databases via unique national ID number

SLIDE 55

[Bar chart, values paired with labels in the order extracted]

Disease (ICD-10 codes)                                      Participants
Ischaemic Stroke (I63)                                      46063
Diabetes (E10-E14)                                          32468
Cancer (C00-C97)                                            28274
COPD (J41-J44)                                              19796
Fractures (S02, S12, S22, S32, S42, S52, S62, S72, S82, S92)  15539
Cataract (H25-H28)                                          14277
Angina (I20)                                                13882
Haemorrhagic Stroke (I61)                                   11037
MI (I21-I23)                                                8035
Arrhythmia (I47-I49)                                        6270
Chronic liver disease (K70-K77, B18-B19, I85, Z22.5)        5904
Pulmonary heart disease (I26-I27)                           5489
Heart failure (I50)                                         4632
Anxiety disorders (F40-F48)                                 4002
Tuberculosis (A15-A19, B90, J65)                            3188
Asthma (J45-J46)                                            3169
CKD (N02-N03, N07, N11, N18)                                2748
Osteoporosis (M80-M81)                                      2664
Coronary revascularisation                                  2273
Rheumatoid arthritis (M05-M06, M45)                         1730
Schizophrenia (F20-F29)                                     1316
Depression (F30-F39)                                        1192
Retinopathy (E10-4.3, H36.0)                                1104
SAH (I60, I69.0)                                            1098
Parkinson disease (G20-G21)                                 822
(Venomous) snake bite (T63.0, X20)                          425
Victim of earthquake (X34)                                  54

CKB: Participants with selected diseases in 10 years

(43K deaths, 0.9M hospital admissions; 2017 HI data are being processed)

[Inset plots: Haemorrhagic stroke; NAFLD; axis: Waist circumference (cm)]
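Counting participants per disease from linked records requires mapping each recorded ICD-10 code to a disease group defined by code ranges, as in the labels on this slide. The sketch below does that for five of the groups. The function names are hypothetical, and it assumes three-character categories of the form letter plus two digits, for which plain string comparison orders codes correctly within a chapter letter.

```python
# Code ranges copied from five of the slide's labels; each range is inclusive.
DISEASE_GROUPS = {
    "Ischaemic Stroke": [("I63", "I63")],
    "Diabetes": [("E10", "E14")],
    "Cancer": [("C00", "C97")],
    "COPD": [("J41", "J44")],
    "MI": [("I21", "I23")],
}


def category(code):
    """Reduce a code such as 'e11.9' to its three-character category 'E11'."""
    return code.strip().upper()[:3]


def groups_for(code):
    """Return every disease group whose ranges contain the code's category."""
    cat = category(code)
    return [
        name
        for name, ranges in DISEASE_GROUPS.items()
        if any(low <= cat <= high for low, high in ranges)
    ]


def count_participants(records):
    """Count distinct participants per group from (participant_id, code) pairs."""
    members = {name: set() for name in DISEASE_GROUPS}
    for participant_id, code in records:
        for name in groups_for(code):
            members[name].add(participant_id)
    return {name: len(ids) for name, ids in members.items()}


# Hypothetical linked records: participant 1 has two diabetes episodes.
records = [(1, "E11.9"), (1, "E11.2"), (2, "I63.9"), (3, "I64"), (4, "C34")]
counts = count_participants(records)
```

Counting distinct participant identifiers, not episodes, matches the slide's unit of "participants with selected diseases"; groups defined by fourth-character codes such as Z22.5 would need matching on the full code.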

SLIDE 56

CKB: Procedures for improving disease phenotyping

Pilot study of ~1000 cases for specific disease before deciding whether to undertake systematic adjudication

SLIDE 57

CKB: Disease standardisation and coding tool

SLIDE 58

CKB: Verifying reported diagnosis

SLIDE 59

CKB: Adjudicating & phenotyping major diseases

(>70K adjudicated: 30K stroke, 25K IHD, 15K cancer, >3K CKD)

SLIDE 60

CKB: “traffic” light approach for outcome data

SLIDE 61

Future work for disease phenotyping

Standardising and ICD-10 coding new events collected
Processing and incorporating >100M chargeable items data to enhance disease phenotyping
Extending outcome adjudication to several other diseases (e.g. heart failure, chronic liver disease)
Developing automated algorithm to sub-phenotype stroke and other diseases according to clinical criteria
Piloting collection of discharge summary pages and tumour tissue samples

SLIDE 62

CKB: Open data access platform

(www.ckbiobank.org)