Obtaining phenotype and outcome data from EHRs, Josh Denny, MD MS



slide-1
SLIDE 1

Obtaining phenotype and outcome data from EHRs

Josh Denny, MD MS Vanderbilt University Medical Center 3/26/2018

slide-2
SLIDE 2

EHR Data from Vanderbilt Biobank

BioVU start

Vanderbilt biobank enrollment

EHR data are dense and efficient for discovery: Vanderbilt’s experience (BioVU)

slide-3
SLIDE 3

eMERGE Goals:

  • To perform genomic studies using the EHR
  • To implement genomic medicine
slide-4
SLIDE 4

Making text documents useful for research

Clinical notes, test reports, etc

Original note:

CC: SOB HPI: Mr. Smith is a 65yo w/ h/o CHF, … no dm2… on atenolol 50mg daily… Mother had RA.

Deidentified note in the Research EHR:

CC: SOB HPI: Mr. **jones** is a 65yo w/ h/o CHF, … no dm2… on atenolol 50mg daily… Mother had RA.

Processing steps and their structured output:

  • Medication extraction
    – DrugName: atenolol
    – Strength: 50 mg
    – Frequency: daily
  • Find biomedical concepts and qualifiers; create structured data
    – certainty (positive, negated)
    – Who experienced it? (patient or family member?)
    – chief_complaint: Shortness of Breath
    – history_present_illness: Congestive Heart Failure; Type 2 diabetes, negated
    – mother_medical_history: rheumatoid arthritis
  • Customized classifiers (smoking status, etc)

Billing codes

Deidentify: remove HIPAA identifiers + ….
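
A minimal sketch of the structured output this slide describes, assuming a few regular expressions can stand in for a real clinical NLP system. The abbreviation table, the patterns, and the function names are hypothetical and cover only the example note; the elided parts of the note are dropped for the demonstration.

```python
import re

# Hypothetical abbreviation table covering only this example note.
CONCEPTS = {
    "CHF": "Congestive Heart Failure",
    "dm2": "Type 2 diabetes",
    "RA": "rheumatoid arthritis",
}

# Illustrative patterns only; real notes need far richer rules or models.
MED_PATTERN = re.compile(
    r"\b(?P<drug>[a-z]+)\s+(?P<dose>\d+)\s?(?P<unit>mg)\s+(?P<freq>daily)\b",
    re.IGNORECASE,
)
NEGATION = re.compile(r"\b(no|denies|without)\b", re.IGNORECASE)
FAMILY = re.compile(r"\b(mother|father|sister|brother)\b", re.IGNORECASE)


def extract_medications(note):
    """Return DrugName / Strength / Frequency for each 'drug NNmg daily' mention."""
    return [
        {
            "DrugName": m.group("drug").lower(),
            "Strength": f"{m.group('dose')} {m.group('unit').lower()}",
            "Frequency": m.group("freq").lower(),
        }
        for m in MED_PATTERN.finditer(note)
    ]


def extract_concepts(note):
    """Tag each known concept with certainty and with who experienced it."""
    found = []
    for clause in re.split(r"[.,;]", note):
        for abbreviation, concept in CONCEPTS.items():
            if re.search(rf"\b{re.escape(abbreviation)}\b", clause, re.IGNORECASE):
                found.append({
                    "concept": concept,
                    "certainty": "negated" if NEGATION.search(clause) else "positive",
                    "experiencer": "family member" if FAMILY.search(clause) else "patient",
                })
    return found


NOTE = ("CC: SOB HPI: Mr. **jones** is a 65yo w/ h/o CHF, no dm2, "
        "on atenolol 50mg daily. Mother had RA.")
medications = extract_medications(NOTE)
concepts = extract_concepts(NOTE)
```

The clause-level negation and family checks are deliberately crude: they only look for a cue word in the same comma- or period-delimited clause as the concept.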

slide-5
SLIDE 5

Finding a “simple” disease in the EHR: Who has hypertension?

Definition: SBP > 140 or DBP > 90

Figure panels: Patient 1, Patient 2; labels: Doesn’t have hypertension, Has hypertension
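
A minimal sketch of why the textbook definition is less simple than it looks once it is applied to longitudinal EHR readings. The threshold is the one on the slide; the readings, the function names, and the two summary rules are invented for illustration.

```python
def meets_definition(sbp, dbp):
    """Slide definition applied to a single reading (mmHg)."""
    return sbp > 140 or dbp > 90


def summarize(readings):
    """Summarize a patient's (SBP, DBP) readings under two possible rules."""
    flags = [meets_definition(sbp, dbp) for sbp, dbp in readings]
    return {
        "any_reading_elevated": any(flags),
        "fraction_elevated": sum(flags) / len(flags),
    }


# Hypothetical readings: one isolated high value among normal ones,
# versus normal values throughout (for example, a well-treated patient).
patient_a = [(150, 85), (128, 80), (135, 82)]
patient_b = [(132, 84), (128, 78)]

summary_a = summarize(patient_a)
summary_b = summarize(patient_b)
```

Under an "any reading" rule patient_a counts as hypertensive on the strength of one measurement, while patient_b never does even if the normal readings reflect treatment; blood pressure values alone cannot separate these situations.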

slide-6
SLIDE 6

Our “simple” example: Hypertension

Multiple components are better (and blood pressure is the worst)

Teixeira, JAMIA 2016

slide-7
SLIDE 7

Algorithm Development and Implementation

What we learned - Finding phenotypes in the EHR

Data sources:

  • Clinical Notes (NLP - natural language processing)
  • Billing codes: ICD9 & CPT
  • Medications: ePrescribing & NLP
  • Labs & test results: NLP

True cases

Workflow:

  • Identify phenotype of interest
  • Case & control algorithm development and refinement
  • Manual review; assess precision
  • If precision ≥95%: deploy in BioVU, then genetic association tests
  • If precision <95%: return to algorithm development and refinement
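
A minimal sketch of the manual-review gate in this workflow: precision is the share of algorithm-identified cases that chart review confirms, compared against the 95% threshold on the slide. The function names and the review counts are hypothetical.

```python
def precision(review_results):
    """Fraction of algorithm-identified cases confirmed by manual review.

    review_results: list of booleans, True when the reviewer confirmed the case.
    """
    if not review_results:
        raise ValueError("no reviewed charts")
    return sum(review_results) / len(review_results)


def next_step(review_results, threshold=0.95):
    """Deploy when precision meets the threshold, otherwise keep refining."""
    return "deploy" if precision(review_results) >= threshold else "refine"


# Hypothetical review batches of 50 charts each.
first_pass = [True] * 45 + [False] * 5    # precision 0.90
second_pass = [True] * 48 + [False] * 2   # precision 0.96
```

Here `next_step(first_pass)` sends the algorithm back for refinement and `next_step(second_pass)` clears it for deployment.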

slide-8
SLIDE 8

Early discovery science in eMERGE – Hypothyroidism

Am J Hum Genet. 2011;89:529-42

  • Algorithms can be deployed across multiple EHRs
  • Analyses can be performed using extant data

slide-9
SLIDE 9

GWAS of QRS Duration in eMERGE

SCN5A/SCN10A

n=5,272

Ritchie et al., Circulation 2013

slide-10
SLIDE 10

What happens in the “heart healthy” population?

  • Examined the n=5272 “heart healthy” population
  • Followed for development of atrial fibrillation based on genotype

Figure: atrial fibrillation-free survival against years since normal ECG (and no heart disease), by genotype (GG, AG, AA); HR=1.49 per G allele, p=0.001

Ritchie et al., Circulation 2013

slide-11
SLIDE 11

EHRs for drug response: Clopidogrel adverse events associated with CYP2C19

From clinical trials

Mega et al., NEJM 2009

From the EHR

Delaney et al. Clin Pharm Ther. 2012

N=807, P=0.005

Figure curves: Normal metabolizers, Carriers

slide-12
SLIDE 12

Deep learning for Diabetic Retinopathy

Train a machine learning algorithm over >128k images

Gulshan et al. JAMA 2016

slide-13
SLIDE 13

Phenome scanning (PheWAS) in the EHR

A genetic variant

Associated phenotypes

The curated EHR-based phenome

A phenotype

Associated genotypes

Dense genomic information
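
A minimal sketch of the phenome scan idea: take one genetic variant and test it against every phenotype in the curated phenome. This is a simplification under stated assumptions, using an unadjusted Pearson chi-square test on a 2x2 table with a Bonferroni cutoff, whereas published PheWAS analyses typically use regression models adjusted for covariates. All names and data are hypothetical.

```python
import math


def chi_square_p(a, b, c, d):
    """P-value of a Pearson chi-square test (1 degree of freedom) on a 2x2 table.

    a: carriers with phenotype,      b: carriers without
    c: non-carriers with phenotype,  d: non-carriers without
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 1.0
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2))


def phewas(carriers, subjects, phenome, alpha=0.05):
    """Scan one variant across all phenotypes; return rows sorted by p-value."""
    cutoff = alpha / len(phenome)  # Bonferroni correction
    rows = []
    for phenotype, cases in phenome.items():
        a = len(carriers & cases)
        b = len(carriers - cases)
        c = len((subjects - carriers) & cases)
        d = len(subjects - carriers - cases)
        p = chi_square_p(a, b, c, d)
        rows.append((phenotype, p, p < cutoff))
    return sorted(rows, key=lambda row: row[1])


# Toy data: 200 subjects, the first 100 carry the variant.
subjects = set(range(200))
carriers = set(range(100))
phenome = {
    "phenotype_X": set(range(60)) | set(range(100, 110)),  # enriched in carriers
    "phenotype_Y": {s for s in subjects if s % 4 == 0},     # evenly spread
}
results = phewas(carriers, subjects, phenome)
```

Swapping the roles (one phenotype scanned across many genotypes) gives the GWAS direction shown on the same slide.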

slide-14
SLIDE 14

Replications of GWAS associations via PheWAS

Binary traits Continuous traits

P-value for replication:

  • All - 210/751: 2x10^-98
  • Powered - 51/77: 3x10^-47

Nat Biotech 2013; 31:1102-1111

slide-15
SLIDE 15

PheWAS across all HLA types

(n=37,270)

Karnes et al, Sci Trans Med 2017

slide-16
SLIDE 16

The potential for “call back” deeper phenotyping: Long QT genes (SCN5A and KCNH2) in 2,200 sequenced patients in eMERGE

  • 83 rare (MAF < 1%) in SCN5A, 45 in KCNH2
  • 121/128 MAF < 0.5%, 92 singletons
  • Three labs assessed known/likely pathogenicity

  • Lab 1: 16/121
  • Lab 2: 24/121
  • Lab 3: 17/121

4

Van Driest et al, JAMA 2016

slide-17
SLIDE 17

Calculating a Phenotype Risk Score (PheRS)

OMIM feature 1, OMIM feature 2, ..., OMIM feature k → Human Phenotype Ontology → EHR phenotypes

For each record i, generate PheRS:

$$\mathrm{PheRS}_i = \sum_{j=1}^{k} x_{ij}\, w_j$$

  • PheRS_i: score for subject i
  • The sum adds up terms for k phenotypes
  • x_ij: 0 = phenotype j absent, 1 = phenotype j present
  • w_j: weight for phenotype j, derived from entire EHR

Repeat this for all Mendelian diseases

Bastarache et al, Science 2018
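
A minimal sketch of the score calculation, summing a weight for each disease feature that is present in a subject's record. The slide only says the weight is derived from the entire EHR; the specific choice below, the log of inverse prevalence so that rarer phenotypes count more, is an assumption made for illustration, as are the function names and the toy records.

```python
import math


def phenotype_weights(records, features):
    """Weight each feature by log(1 / prevalence) across all records.

    ASSUMPTION: log inverse prevalence is used here for illustration only.
    records: dict mapping subject id -> set of phenotypes in that record.
    """
    n = len(records)
    weights = {}
    for feature in features:
        count = sum(1 for phenotypes in records.values() if feature in phenotypes)
        if count:  # features never observed contribute nothing
            weights[feature] = math.log(n / count)
    return weights


def phers(phenotypes, weights):
    """Sum the weights of the disease features present in one record."""
    return sum(w for feature, w in weights.items() if feature in phenotypes)


# Toy EHR of four records and two features of a hypothetical Mendelian disease.
records = {
    "s1": {"Bronchiectasis", "Pneumonia"},
    "s2": {"Pneumonia"},
    "s3": set(),
    "s4": {"Asthma"},
}
weights = phenotype_weights(records, ["Bronchiectasis", "Pneumonia"])
scores = {subject: phers(phenotypes, weights) for subject, phenotypes in records.items()}
```

Repeating the weight and score steps with a different feature list gives the score for another Mendelian disease.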

slide-18
SLIDE 18

Age/Sex and Phenotype Risk Score by subject:

  • CF cases: 18F 9.8; 26M 4.4; 29F 6.3; 29M 7.8
  • CF controls: 18F 2.5; 26M 0.7; 29F 0.0; 29M 0.7

Phenotypes shown for each subject:

  • Chronic airway obstruction
  • Pneumonia
  • Diseases of pancreas
  • Hypovolemia
  • Acute upper respiratory infections
  • Asthma
  • Bronchiectasis
  • Intestinal malabsorption
  • Hepatomegaly
  • Acute pulmonary heart disease

Example: a phenotype risk score in Cystic Fibrosis

Bastarache et al, Science 2018

slide-19
SLIDE 19

PheRS identified potentially pathogenic SNVs

  • N=21k on exome chip
  • 6k SNVs

Bastarache et al, Science 2018

slide-20
SLIDE 20

The All of Us Research Program – Breaking Down Data Silos

Precision Medicine Initiative, PMI, All of Us, the All of Us logo, and The Future of Health Begins With You are service marks of the U.S. Department of Health and Human Services.
slide-21
SLIDE 21

Overview of the All of Us approach and protocol

Participants: Direct Volunteers; Health Care Provider Organizations

Data types:

  • Health Surveys
  • EHR data
  • Baseline measurements
  • Biospecimens
  • Smartphones & Wearables

Multiple data types linked together by semantic standards

slide-22
SLIDE 22

From Healthcare Provider Orgs

All of Us will aggregate data from many sources

Meds Billing codes Labs

Version 1 (2018)

Clinical Notes & Reports Clinical Messaging

Version 2 Raw Data Repository Data added centrally by DRC

Death Index Claims & Rx Data

Visits

Local Registries Much longer term Images From Direct Volunteers Sync for Science

Participant provided data (Health surveys, activity monitors, etc) Participant exams and biospecimens

Curated Data Repository APIs, Analysis tools, etc

Geospatial data Health data aggregators (PicnicHealth)

slide-23
SLIDE 23

Sync 4 Science (S4S) – a technology to share health data

S4S:

  • FHIR-based
  • Starting with MU Common Clinical Data set

S4S Pilot Sites
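
A minimal sketch of what "FHIR-based" means in practice: data are requested as typed resources through REST search URLs and come back as JSON bundles. The base URL, patient id, and sample bundle below are hypothetical, and no network call is made; this is not the S4S implementation itself.

```python
import json
from urllib.parse import urlencode

# Hypothetical server address; a real deployment supplies its own base URL.
BASE_URL = "https://fhir.example.org/base"


def search_url(resource_type, **params):
    """Build a FHIR search URL such as [base]/Observation?patient=...&category=..."""
    return f"{BASE_URL}/{resource_type}?{urlencode(params)}"


def bundle_resources(bundle_json):
    """Pull the resources out of a FHIR search-result Bundle."""
    bundle = json.loads(bundle_json)
    return [entry["resource"] for entry in bundle.get("entry", [])]


url = search_url("Observation", patient="123", category="laboratory")

# Invented response body, trimmed to the fields used below.
sample_bundle = json.dumps({
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {"resource": {
            "resourceType": "Observation",
            "code": {"text": "Hemoglobin A1c"},
            "valueQuantity": {"value": 6.1, "unit": "%"},
        }},
    ],
})
labs = [
    (r["code"]["text"], r["valueQuantity"]["value"], r["valueQuantity"]["unit"])
    for r in bundle_resources(sample_bundle)
]
```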

slide-24
SLIDE 24

Data Access is centralized in All of Us

Traditional Approach: Bring data to researchers

Problems

  • Data sharing = data copying
  • Security (data handoffs)
  • Huge infrastructure needed
  • Siloed compute

AoU Approach: Bring researchers to the data

Data

Advantages

  • Cost
  • Threat detection and auditing
  • Increased Accessibility
  • Shared compute

Public Cloud Download from public repository

slide-25
SLIDE 25

The power of a data biosphere of common semantics and APIs

slide-26
SLIDE 26

Obtaining phenotype and outcome data from e-health records and digital platforms: the experience of UK Biobank

Cathie Sudlow Professor of Neurology and Clinical Epidemiology Director, Centre for Medical Informatics, Usher Institute, University of Edinburgh Director of Health Data Research UK Scottish substantive site Chief Scientist, UK Biobank International Cohorts Summit, Durham, North Carolina March 2018

slide-27
SLIDE 27

UK Biobank in a nutshell

  • 500,000 UK men and women aged 40-69 years when recruited during 2006-2010
  • Consent for all types of health research by both academic and commercial researchers
  • Extensive baseline questions and physical measures, with biological samples stored for future assays
  • Subsequent enhancements in all or large subsets of participants:
    – Data from portable wearable devices (100,000 accelerometry; 20,000+ continuous ECG)
    – Sample assays in all or large subsets:
        Complete: genome-wide genotyping; biochemistry panel
        Underway/planned: exome and whole genome sequencing; proteomics; infectious disease assays; stool microbiome
    – Multimodal imaging of 100,000 (>22,000 so far)
    – Web questionnaires

  • Comprehensive, long term follow-up for a wide range of health-related outcomes
  • Open access for approved research: see www.ukbiobank.ac.uk
slide-28
SLIDE 28

Aim: identify a wide range of incident diseases and other health related outcomes

Active methods requiring participant re-engagement

  • face to face reassessment
  • postal or web-based surveys
  • expensive
  • prone to incomplete coverage & selective loss to follow-up
  • miss cases emerging between assessments

Passive methods via linkages to national health records

  • can follow all participants without need for re-engagement
  • efficient and cost effective
  • need adequate consent at recruitment
  • rely on universal healthcare system & availability of relevant datasets
  • can only detect cases of disease diagnosed in a healthcare setting
  • data need to be accurate and sufficiently detailed for research studies

Follow-up of participants in very large prospective cohorts

slide-29
SLIDE 29

Web questionnaires

  • Using email and web questionnaires
    – for more detailed assessment of exposures
    – and to obtain information on outcomes that cannot be obtained through linking to health records
  • Of 350,000 with email, >150,000 complete each questionnaire
    – Details of dietary intake
    – Cognitive function
    – Mental health (thoughts and feelings)
    – Gastrointestinal symptoms

Useful for following change over time…but beware selective attrition

slide-30
SLIDE 30

Following the health of 0.5 million UK Biobank participants through linking to National Health Service (NHS) records

  • Scotland: 36,000 participants
  • England: 446,000 participants
  • Wales: 21,000 participants

Regularly updated information on a wide range of diseases from NHS datasets in all three countries:

  • Deaths – date and cause of death, for all participants: >14,000 by early 2016
  • Cancers – date, stage and grade of cancer, for all participants: >79,000 cancer cases by late 2015
  • Admissions to hospital – dates, diagnoses, procedures, for all participants: 1000’s of cases of many incident diseases
  • Primary care data – dates, diagnoses, symptoms, signs, referrals, prescriptions, labs etc, for half of the participants: 1000’s more cases of many incident diseases

slide-31
SLIDE 31

Maximising the value of the linked healthcare data

  • Messy ‘real world data’ - not collected primarily for research
  • Not 100% accurate due to administrative and clinical error
  • Mainly structured, coded datasets (ICD, OPCS4, Read…)
  • Experts advising in a range of disease areas:
    – Combine different linked data sources to create algorithmically derived disease status indicators
    – Estimate the accuracy and completeness of these
    – Consider limitations and potential additional sources of unstructured data

Disease areas: Cancer; Neurodegenerative diseases; Diabetes; Chest diseases; Cardiac diseases; Musculoskeletal conditions; Stroke; Infections; Mental health disorders; Kidney diseases; Eye diseases
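
A minimal sketch of an algorithmically derived disease status indicator that combines linked sources for one participant and one disease. The rule used here, that a record in any source makes a case and the earliest date is taken as onset, is an assumption for illustration rather than the UK Biobank algorithm; the source names and dates are invented.

```python
from datetime import date


def derive_status(events):
    """Combine (source, date) records for one participant and one disease.

    ASSUMPTION: any source counts as a case and the earliest date wins.
    """
    if not events:
        return {"case": False, "first_date": None, "sources": []}
    return {
        "case": True,
        "first_date": min(event_date for _, event_date in events),
        "sources": sorted({source for source, _ in events}),
    }


# Hypothetical stroke records for one participant across linked datasets.
stroke_events = [
    ("hospital_admission", date(2013, 5, 2)),
    ("primary_care", date(2013, 4, 18)),
    ("hospital_admission", date(2015, 1, 9)),
]
status = derive_status(stroke_events)
no_events = derive_status([])
```

Keeping the list of contributing sources alongside the indicator makes it possible to estimate accuracy separately for cases found in one source only.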

slide-32
SLIDE 32

Cancers in UK Biobank ascertained from the national cancer registries

Number of cases (observed by recruitment; observed incident by 2015; predicted incident by 2022):

  • Breast cancer: 9,000; 4,200; 10,000
  • Colorectal cancer: 2,300; 2,500; 7,000
  • Prostate cancer: 3,000; 4,300; 9,000

→ Date, stage and grade of cancer

Beyond the structured registry data…exploring feasibility of retrieving additional information for subtyping of identified cancer cases through regional linkages to:

  • histopathology reports
  • digitised histopathology slides
  • tumour specimens
slide-33
SLIDE 33


slide-34
SLIDE 34

Exemplar non-cancer conditions in UK Biobank

ascertained from baseline self report, hospital admissions and death registries

Number of cases (observed by recruitment; observed incident by 2016; estimated effects of including primary care data):

  • Myocardial infarction: 12,000; 7,400; 8,100
  • Stroke: 8,000; 4,600; 6,900
  • Diabetes: 26,000; 9,000; 18,000
  • COPD: 10,000; 7,600; 16,900
  • Asthma: 60,000; 5,700; 19,000
  • Dementia: 200; 1,800; 3,600

Accuracy? Limitations?

slide-35
SLIDE 35

Dementia: positive predictive value of routine healthcare data

From published studies

Figure, left panel: PPV (0 to 1) by data source: Mortality; Hospital inpatient; Hospital in- & outpatient; Insurance; Outpatients; Primary care

Wide variation but in most PPV >80%

Figure, right panel: PPV (0 to 1) for Alzheimer’s disease and Vascular dementia

PPV for AD generally higher than for vasc dementia

slide-36
SLIDE 36

Dementia: positive predictive value of routine healthcare data

From comparison with expert review of free text electronic medical record in UK Biobank

Dementia: 80% Alzheimer’s disease: 72% Vascular dementia: 44%
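
A minimal sketch of how a positive predictive value like those above is computed: of the cases flagged by coded data, the share that expert review confirms. The counts are illustrative only and are not the study's actual numbers.

```python
def positive_predictive_value(confirmed, not_confirmed):
    """PPV = true positives / (true positives + false positives)."""
    flagged = confirmed + not_confirmed
    if flagged == 0:
        raise ValueError("no flagged cases to evaluate")
    return confirmed / flagged


# Illustrative counts: 100 coded dementia cases reviewed, 80 confirmed.
ppv_example = positive_predictive_value(confirmed=80, not_confirmed=20)
```

PPV says nothing about cases the coded data miss, so completeness has to be estimated separately.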

slide-37
SLIDE 37

Beyond the linked coded healthcare data

Structured, coded data from linked national healthcare datasets:

  • Can ascertain cases of a wide range of diseases with acceptable accuracy
  • Capture only 10-20% of the information from electronic medical records
  • Are limited for detailed sub-phenotyping of disease

Deeper phenotyping of disease will require multiple unstructured data sources, including:

  • Free text of electronic records
  • Complex electrical signalling data (ECGs, EEGs etc)
  • Histopathology slide sets
  • Clinical imaging data

Obtaining these data at national scale is challenging

To extract value from these data on 1000’s of outcomes across multiple diseases, we need scalable approaches: crowd sourcing, natural language processing, machine learning, artificial intelligence…
slide-38
SLIDE 38

Acknowledgements

slide-39
SLIDE 39
slide-40
SLIDE 40

Biochemistry analyses in all 500,000 participants

  • Cardiovascular: Cholesterol; Direct LDL-c; HDL-c; Triglyceride; ApoA; ApoB; Lp(a); CRP
  • Cancer: SHBG; Testosterone; Oestradiol; IGF-I
  • Bone and joint: Vitamin D; Rheumatoid factor; Alkaline Phosphatase; Calcium
  • Liver: Albumin; Direct Bilirubin; Total Bilirubin; GGT; ALT; AST
  • Diabetes: HbA1c; Glucose
  • Renal: Creatinine; Cystatin C; Total protein; Urea; Phosphate; Urate
  • Urinary: Creatinine; Sodium; Potassium; Albumin

Note: Haematological assays were conducted during recruitment phase
slide-41
SLIDE 41

  • 500,000 participants
  • 22 recruitment centres
  • 89% England, 7% Scotland, 4% Wales

slide-42
SLIDE 42

Industrial scale processes: samples during recruitment

  • 700 participants per day
  • 4,900 sample tubes per day
  • 25,000 aliquots produced per day
  • 15 million 0.85ml aliquots

slide-43
SLIDE 43
  • Blood: whole blood, serum, plasma, red cells, buffy coat
  • Urine
  • Saliva

Total > 15 million aliquots

slide-44
SLIDE 44

Expected disease cases during follow-up

Expected number of cases by condition (by 2012; by 2017; by 2022):

  • Diabetes: 10,000; 25,000; 40,000
  • Heart attack: 7,000; 17,000; 28,000
  • Stroke: 2,000; 5,000; 9,000
  • Chronic obstructive lung disease: 3,000; 8,000; 14,000
  • Breast cancer: 2,500; 6,000; 10,000
  • Colorectal cancer: 1,500; 3,500; 7,000
  • Prostate cancer: 1,500; 3,500; 7,000
  • Hip fracture: 1,000; 2,500; 6,000
  • Alzheimer’s: 1,000; 3,000; 9,000

slide-45
SLIDE 45

Data from portable wearable devices

  • Accelerometry data: 100,000 participants
  • Continuous ECG monitoring: 20,000+ participants

Prospective design and large size enable reasonably well-powered studies of (causal) associations between accelerometry and cardiac rhythm measures and later onset disease
slide-46
SLIDE 46

Sample analyses

  • Genome-wide genotyping of all participants
  • Standard panel of assays (e.g. lipids; hormones; metabolic) on samples from all participants
  • Exome & whole genome sequencing, proteomics, metabolomics, infectious disease assays, stool microbiome…all underway/planned

slide-47
SLIDE 47

Multimodal imaging of 100,000 participants

>22,000 imaged so far

Prospective design and large size enable well-powered studies of (causal) associations between structure and function of organs and later onset disease…but…need scalable methods of analysing complex data to derive measures for large scale analyses

slide-48
SLIDE 48

Obtaining phenotype and outcome data from health records:

China Kadoorie Biobank experience

Zhengming CHEN

CKB Principal Investigator Professor of Epidemiology Nuffield Dept. of Population Health University of Oxford, UK (zhengming.chen@ndph.ox.ac.uk)

International Cohorts Summit, Duke University, USA, 26-27 March 2018

slide-49
SLIDE 49
  • >512K recruited from 10 localities in 2004-08
  • Participants interviewed, measured, and gave plasma and DNA (urine) for long-term storage
  • All followed up indefinitely via electronic record linkage to deaths and ALL hospital episodes
  • Periodic resurvey of 5% surviving participants (for enhancements and sources of variation)

China Kadoorie Biobank (CKB)

Informed consent for linkages to health records and unspecified research use of stored samples

slide-50
SLIDE 50

CKB: Clinical stations at local assessment centre

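Stations on the clinic flow chart (translated from Chinese):

  • Registration, informed consent, handing out the field survey pack
  • Questionnaire (including the general questionnaire, COPD questionnaire, CIDI questionnaire, and completeness check I)
  • Completion: completeness check II (checking for missed items); handing out the examination results report
  • Physical measurements: body fat, hand grip strength, height, waist and hip circumference
  • Cardiovascular examination: blood pressure, pulse wave velocity
  • Blood and urine sample collection and testing: blood draw and blood tests, urine collection and urine tests
  • Physiological examination: lung function, carbon monoxide concentration
  • Other examinations: carotid ultrasound, heel bone density, ECG

Photo labels: 1. Registration & consent; 2. Physical exam.; 3. Physical exam.; 4. Physical exam.; 5. Sample collection; 7. Questionnaire (with recording); 8. Physical exam.; 9. Clinic consultation

The clinic visit took 60-90 minutes, with daily statistical monitoring

Recruitment rate: 700-800 per day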

slide-51
SLIDE 51

CKB: Supported by >90 bespoke IT systems

slide-52
SLIDE 52

CKB: Fully established with 10-year follow-up

  • Questionnaire: SES, smoking, alcohol, tea, diet, physical activity, indoor air pollution, sleep, reproductive patterns, medical history
  • Measurements: blood pressure, height, weight, lung function, heart rate, bone density, exhaled CO, ECG, cIMT, ambient temperature, ambient air pollution, blood lipids, metabolites, proteomics, infectious markers, genetics
  • Electronic health records: >1,300 different diseases, 43K deaths, <5K lost to follow-up, ~0.9 million hospitalizations, >100 million chargeable items

www.ckbiobank.org

Data is growing rapidly

slide-53
SLIDE 53

Outcome Follow up in CKB

  • Active follow-up
  • Disease registries
  • National health insurance
  • Death registries

CKB: Follow-up through record linkages

slide-54
SLIDE 54

National health insurance system in China

(supplementing death and cancer registries)

  • Introduced during 2004-6, with almost universal coverage in CKB areas by 2010
  • Multiple disease diagnoses, with ICD-10 codes plus disease descriptions and >2,500 procedure codes
  • Managed electronically at city level, with detailed chargeable items for reimbursement purposes
  • Lacks certain details (e.g. cancer pathology) required for disease sub-phenotyping

Nearly all CKB participants now linked to the health insurance databases via unique national ID number
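
A minimal sketch of deterministic record linkage on a unique identifier, the approach this slide describes in outline. The field names, ID values, and records are hypothetical; a real linkage would also handle missing or mistyped identifiers and would run under strict data-protection controls.

```python
def link_records(participants, claims):
    """Attach insurance claims to study participants by exact ID match.

    participants: dict mapping national ID -> study ID.
    claims: list of dicts with 'national_id', 'icd10', and 'admitted' keys.
    Returns (linked, unmatched); linked output carries only the study ID.
    """
    linked, unmatched = {}, []
    for claim in claims:
        study_id = participants.get(claim["national_id"])
        if study_id is None:
            unmatched.append(claim)
            continue
        linked.setdefault(study_id, []).append(
            {"icd10": claim["icd10"], "admitted": claim["admitted"]}
        )
    return linked, unmatched


# Hypothetical identifiers and hospital episodes.
participants = {"ID-0001": "CKB-A", "ID-0002": "CKB-B"}
claims = [
    {"national_id": "ID-0001", "icd10": "I63", "admitted": "2012-03-14"},
    {"national_id": "ID-0001", "icd10": "E11", "admitted": "2014-07-01"},
    {"national_id": "ID-9999", "icd10": "J44", "admitted": "2013-11-20"},
]
linked, unmatched = link_records(participants, claims)
```

Dropping the national ID from the linked output keeps the research dataset keyed on the study identifier alone.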

slide-55
SLIDE 55

CKB: Participants with selected diseases in 10 years

(43K deaths, 0.9M hospital admissions; 2017 HI data are being processed)

  • Ischaemic Stroke (I63): 46063
  • Diabetes (E10-E14): 32468
  • Cancer (C00-C97): 28274
  • COPD (J41-J44): 19796
  • Fractures (S02, S12, S22, S32, S42, S52, S62, S72, S82, S92): 15539
  • Cataract (H25-H28): 14277
  • Angina (I20): 13882
  • Haemorrhagic Stroke (I61): 11037
  • MI (I21-I23): 8035
  • Arrhythmia (I47-I49): 6270
  • Chronic liver disease (K70-K77, B18-B19, I85, Z22.5): 5904
  • Pulmonary heart disease (I26-I27): 5489
  • Heart failure (I50): 4632
  • Anxiety disorders (F40-F48): 4002
  • Tuberculosis (A15-A19, B90, J65): 3188
  • Asthma (J45-J46): 3169
  • CKD (N02-N03, N07, N11, N18): 2748
  • Osteoporosis (M80-M81): 2664
  • Coronary revascularisation: 2273
  • Rheumatoid arthritis (M05-M06, M45): 1730
  • Schizophrenia (F20-F29): 1316
  • Depression (F30-F39): 1192
  • Retinopathy (E10-4.3,H36.0): 1104
  • SAH (I60, I69.0): 1098
  • Parkinson disease (G20-G21): 822
  • (Venomous) snake bite (T63.0, X20): 425
  • Victim of earthquake (X34): 54

Inset figure panels: Haemorrhagic stroke, NAFLD; x-axis: Waist circumference (cm)

slide-56
SLIDE 56

CKB: Procedures for improving disease phenotyping

Pilot study of ~1000 cases for specific disease before deciding whether to undertake systematic adjudication

slide-57
SLIDE 57

CKB: Disease standardisation and coding tool

slide-58
SLIDE 58

CKB: Verifying reported diagnosis

slide-59
SLIDE 59

CKB: Adjudicating & phenotyping major diseases

(>70K adjudicated: 30K stroke, 25K IHD, 15K cancer, >3K CKD)

slide-60
SLIDE 60

CKB: “traffic” light approach for outcome data

slide-61
SLIDE 61

Future work for disease phenotyping

  • Standardising and ICD-10 coding new events collected
  • Processing and incorporating >100M chargeable items data to enhance disease phenotyping
  • Extending outcome adjudication to several other diseases (e.g. heart failure, chronic liver disease)
  • Developing automated algorithm to sub-phenotype stroke and other diseases according to clinical criteria
  • Piloting collection of discharge summary pages and tumour tissue samples

slide-62
SLIDE 62

CKB: Open data access platform

(www.ckbiobank.org)