Data Mining for Genomic- Phenomic Correlations Joyce C. Niland, - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Data Mining for Genomic-Phenomic Correlations

Joyce C. Niland, Ph.D., Associate Director & Chair, Information Sciences
Rebecca Nelson, Ph.D., Lead, Data Mining Section
City of Hope National Medical Center, Duarte, California, USA

slide-2
SLIDE 2

City of Hope National Medical Center, Duarte, California

slide-3
SLIDE 3

City of Hope National Medical Center

  • Founded in 1913
  • State-of-the-art care to patients with cancer & other life-threatening diseases (e.g. diabetes)
  • Leading edge research into the causes, prevention, and cure of such diseases
  • Promising new therapeutic agents being taken from “bench” to “bedside” (translational research)
  • Over 400 ongoing clinical trials, 1/3 initiated at City of Hope
  • Human genome has expanded scope and objectives of translational research

slide-4
SLIDE 4

  • Improved diagnosis of disease
  • Earlier detection of genetic risk
  • Repairing defective genes with healthy ones
  • New drugs based on information about genes

slide-5
SLIDE 5

Recent Examples of Genomic-Phenomic Data Mining Supported by Biostatistics

  • Predictors of Genetic Susceptibility to Heart Disease Post-Bone Marrow Transplant
  • Pathogenesis of Radiation-induced Breast Cancer
  • Gene Expression in Prostate Cancer Tumors
  • Prognostic Biomarkers for Stage I-III Renal Cell Carcinoma
  • Validation of Biomarkers for Tumor Initiating Cells in Brain Cancer
  • Expression of DNA Repair in Normal versus Tumor Cell Genes

slide-6
SLIDE 6

The 3 (4?) I’s of Data Mining Process

  • Integrate Data Sources: Data Warehouse
  • Include Appropriate Samples: Cohort Assembly
  • Identify (Infer?) Subjects’ Wishes: Honest Broker

slide-7
SLIDE 7

Integrating Data Systems to Support Biomedical Research

  • Biomedical research is an increasingly complex collaborative undertaking
  • Requires integration of data, rules, processes, and vocabularies from many different source systems
  • Most information systems developed independently
  • Operational systems, created to meet different functional and departmental needs of an institution

slide-8
SLIDE 8

Operational Systems Versus Data Warehousing

On-Line Transaction Processing (OLTP):

  • Focuses on an organization’s day-to-day business needs (electronic medical record, financial systems, clinical trial management systems)

On-Line Analytic Processing (OLAP):

  • Retrieves, analyzes, reports, and shares data from disparate systems, vendors & departments (DW)

slide-9
SLIDE 9

Data Warehousing Concept

Large, centralized, and longitudinal store of data to facilitate organization-wide consolidated reporting and analysis

  • Multiple source databases
  • Central coordination and management via metadata repository
  • Multiple target “data marts” (aggregated datasets for efficient querying and analysis)
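
As a rough illustration of this concept, the sketch below loads two source tables into one central warehouse table and derives an aggregated data mart, using Python's built-in `sqlite3` module. The table names, columns, and sample rows are assumptions made for the example and do not reflect the actual City of Hope schema.

```python
import sqlite3

def build_warehouse(emr_rows, registry_rows):
    """Load two hypothetical sources into one warehouse table, then build a data mart."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE warehouse (patient_id TEXT, source TEXT, diagnosis TEXT, year INTEGER)"
    )
    # Extract + transform: tag each row with its source and normalize diagnosis text.
    for source, rows in (("EMR", emr_rows), ("REGISTRY", registry_rows)):
        for patient_id, diagnosis, year in rows:
            con.execute(
                "INSERT INTO warehouse VALUES (?, ?, ?, ?)",
                (patient_id, source, diagnosis.strip().lower(), year),
            )
    # Data mart: distinct patients per diagnosis and year, pre-aggregated for querying.
    con.execute(
        """CREATE TABLE mart_diagnosis_by_year AS
           SELECT diagnosis, year, COUNT(DISTINCT patient_id) AS n_patients
           FROM warehouse GROUP BY diagnosis, year"""
    )
    return con

con = build_warehouse(
    emr_rows=[("P1", " Breast ", 2007), ("P2", "prostate", 2007)],
    registry_rows=[("P1", "breast", 2007), ("P3", "Breast", 2007)],
)
mart = con.execute(
    "SELECT diagnosis, year, n_patients FROM mart_diagnosis_by_year ORDER BY diagnosis"
).fetchall()
print(mart)  # [('breast', 2007, 2), ('prostate', 2007, 1)]
```

Patient P1 appears in both sources but is counted once in the mart, which is the kind of consolidation a single longitudinal store is meant to provide.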

slide-10
SLIDE 10

City of Hope (COH) Data Warehouse Overview

[Diagram: source data systems (OLTP) feed the COH Data Warehouse (OLAP) through Extract, Transform & Load (ETL) processes, coordinated by a metadata layer. A marker in the diagram indicates components in progress.]

Diagram areas: Patient Care, Clinical Research, Basic Science, Financials

Source Data Systems:

  • Eclipsys Electronic Medical Record (EMR)
  • Medidata Electronic Data Capture (EDC): Management of Patients on Clinical Trials
  • MIDAS Observational Data System
  • Sunquest
  • Surgical Info System (SIS)
  • Radiology Info System (RIS)
  • SafeTrace Transfusion Medicine
  • CoPath
  • Cancer Registry
  • Labware Biospecimen Repository
  • Solexa Gene Analysis
  • Microarray High Throughput Analysis
  • Trendstar Financial Data

Data captured:

  • Caregiver Documentation
  • Discrete Core Data Elements from Patient Care
  • Abstraction of Core Phenomic Data on All COH Patients
  • Sample Data
  • Genomic / Proteomic Results

COH Data Warehouse: Phenomic and Genomic Information for Reporting & Data Mining

  • Metadata Layer
  • Data Validation & Quality Assurance (QA)
  • OLAP “Cubes”

Uses:

  • Hospital QA Reporting
  • Hypothesis Generation & Grant Proposals
  • Disease Cluster Reporting & Analysis
  • Genotype – Phenotype Correlative Analyses

slide-11
SLIDE 11

Utility of a Data Warehouse

While protecting personally identifiable information and proprietary research data:

  • Decision support to administrators
  • Screening of patients for eligibility
  • Measurement of quality of care & outcomes
  • Query capabilities to investigators
  • Data mining to generate new hypotheses, facilitate new discoveries

slide-12
SLIDE 12

Technical & Business Metadata Directories

Metadata Repository

Technical Metadata (Database Administrator Perspective):

  • Data Sources: technical name; data type & length; creation, expiration dates; source system; data ‘steward’
  • Mappings: rules for merging/filtering
  • Validation Rules: missing value fields; data integrity, consistency
  • Transformation Rules: derivation of values; data summaries

Business Metadata (Database User Perspective):

  • Data Definition: field names, aliases; description of data meaning
  • Data Directives: instructions for data collection; guidelines for data coding
  • Queries: synonyms; classification coding
  • Reports: list of reports that use term
  • Security Information: authorization to access
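
One way to picture a single entry in such a repository is as a record that carries both perspectives side by side. The sketch below is only an illustration: the class name, field names, allowed values, and roles are hypothetical and are not taken from the City of Hope repository.

```python
from dataclasses import dataclass

@dataclass
class MetadataEntry:
    # Technical metadata (database administrator perspective)
    technical_name: str
    data_type: str
    source_system: str
    steward: str
    allowed_values: frozenset = frozenset()    # simple validation rule
    # Business metadata (database user perspective)
    field_name: str = ""
    description: str = ""
    synonyms: tuple = ()
    authorized_roles: frozenset = frozenset()  # security information

    def is_valid(self, value):
        """Validation rule: missing values fail; otherwise check the allowed set."""
        if value is None or value == "":
            return False
        return not self.allowed_values or value in self.allowed_values

    def can_access(self, role):
        """Security rule: only listed roles may query this field."""
        return role in self.authorized_roles

# Hypothetical entry for a pathology staging field.
entry = MetadataEntry(
    technical_name="PATH_T_STG",
    data_type="VARCHAR(4)",
    source_system="CoPath",
    steward="Pathology",
    allowed_values=frozenset({"T1", "T2", "T3", "T4"}),
    field_name="Path T Stage",
    description="Pathologic tumor stage from the synoptic report",
    synonyms=("pT",),
    authorized_roles=frozenset({"investigator"}),
)
print(entry.is_valid("T2"), entry.is_valid("T9"), entry.can_access("guest"))
```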

slide-13
SLIDE 13

Cohort Assembly

  • Inclusion of subjects with appropriate phenomic characteristics AND available tissue
  • > 360,000 specimens logged in CoPath system, going back to 1955
  • Critical to integrate tissue sample data into the data warehouse
  • Broken down into “Class of Case” to describe type of specimen
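
The AND in the first requirement can be sketched as a join between phenomic records and specimen records. The record layout, identifiers, and selection criterion below are hypothetical and chosen only to make the example runnable.

```python
def assemble_cohort(patients, specimens, phenotype_filter):
    """Return sorted IDs of patients who match the phenotype AND have a logged specimen."""
    with_tissue = {s["patient_id"] for s in specimens}
    return sorted(
        p["patient_id"]
        for p in patients
        if phenotype_filter(p) and p["patient_id"] in with_tissue
    )

patients = [
    {"patient_id": "P1", "site": "kidney", "stage": "II"},
    {"patient_id": "P2", "site": "kidney", "stage": "IV"},
    {"patient_id": "P3", "site": "kidney", "stage": "I"},
    {"patient_id": "P4", "site": "lung", "stage": "I"},
]
specimens = [
    {"patient_id": "P1", "class_of_case": "Routine Surgical"},
    {"patient_id": "P2", "class_of_case": "Routine Surgical"},
    {"patient_id": "P4", "class_of_case": "Cytology"},
]

# Hypothetical criterion: stage I-III kidney cases.
cohort = assemble_cohort(
    patients,
    specimens,
    lambda p: p["site"] == "kidney" and p["stage"] in {"I", "II", "III"},
)
print(cohort)  # ['P1']
```

P3 meets the phenomic criterion but has no specimen on file, so it drops out. This is why tissue sample data has to sit in the same warehouse as the clinical data.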

slide-14
SLIDE 14

CoPath Specimens by Year and Type

[Chart: number of CoPath specimens per year, 1955 to 2008 (vertical axis 0 to 30,000), by specimen type. Legend: Routine Surgical, Other, Outside Consults, Hematopathology, Cytology, Card Class, Blood]

slide-15
SLIDE 15

Integrate “Best” Source of Tissue Annotation Data

  • Standardized formatted path reports now available from pathologist
  • ‘Synoptic Report’ includes:
      • Path T Stage
      • Nodes Examined
      • Nodes Positive
      • Path N Stage
      • Path M Stage
      • Margins
      • Histology
      • Grade

slide-16
SLIDE 16

Concordance Between CoPath Synoptic Reports & Cancer Registry

[Bar chart: concordance (scale 0.00 to 1.00) by cancer site. Sites: Prostate, Breast, Colon and Rectum, Kidney, Lung. Data labels: 0.99, 0.82, 0.91, 0.88, 0.84]
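
The slide does not state how concordance was computed. A minimal sketch, assuming simple agreement (the fraction of cases present in both sources whose recorded values match), might look like this; the case IDs and stage values are invented for illustration.

```python
def concordance(source_a, source_b):
    """Fraction of shared cases on which two sources record the same value."""
    shared = source_a.keys() & source_b.keys()
    if not shared:
        raise ValueError("no cases in common")
    agree = sum(1 for case in shared if source_a[case] == source_b[case])
    return agree / len(shared)

# Hypothetical T-stage values keyed by case ID.
synoptic = {"C1": "T1", "C2": "T2", "C3": "T3", "C4": "T2"}
registry = {"C1": "T1", "C2": "T2", "C3": "T2", "C4": "T2", "C5": "T1"}

print(concordance(synoptic, registry))  # 0.75
```

Case C5 exists only in the registry and is ignored, so the measure describes agreement where both sources have data, not coverage.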

slide-17
SLIDE 17

Comparing Path Data Sources: Synoptic Report vs. CNExT

Synoptic Report - Pathology

  • Surgery specific
  • Only 22% of cancer cases
  • Non-strict reporting rules
  • No confirmation by treating MD

CNExT – Cancer Registry

  • Patient specific
  • 100% of cancer cases
  • Strict reporting rules
  • Confirmed by treating MD

slide-18
SLIDE 18

Text Mining to Identify Cases

  • Text mining may be needed if neither data source is specific enough
  • Example: Diagnosis of Unknown Primary Sites for Metastatic Tumors via 92-Gene RT-PCR
  • Needed to find tissue samples labeled as purely metastatic, with no link to original cancer site
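
A very simple form of such text mining is keyword matching over free-text pathology reports. The patterns below are assumptions chosen for illustration, not the search terms used in the study, and a real screen would also need to handle negation and spelling variants.

```python
import re

# Hypothetical patterns: a mention of metastasis plus a phrase marking the primary as unknown.
METASTATIC = re.compile(r"\bmetasta(?:tic|sis|ses)\b", re.IGNORECASE)
UNKNOWN_PRIMARY = re.compile(
    r"\b(?:unknown|undetermined|occult)\s+primary\b"
    r"|\bprimary\s+(?:site\s+)?(?:unknown|undetermined)\b",
    re.IGNORECASE,
)

def is_unknown_primary_metastasis(report_text):
    """Flag reports that mention metastasis and state that the primary site is unknown."""
    return bool(METASTATIC.search(report_text) and UNKNOWN_PRIMARY.search(report_text))

reports = [
    "Metastatic adenocarcinoma, primary site unknown.",
    "Metastatic carcinoma consistent with breast primary.",
    "Invasive ductal carcinoma of the breast.",
]
flags = [is_unknown_primary_metastasis(r) for r in reports]
print(flags)  # [True, False, False]
```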
slide-19
SLIDE 19

Enabling Investigators to Conduct Cohort Searches via i2b2

i2b2: Integrating Biology and the Bedside

  • Facile user interface to allow investigators to search for cohorts on their own
  • Executes advanced queries against meta-database to identify available subjects, tissue samples, or biospecimens matching query criteria
  • If feasible number of cases returned, then submit IRB protocol for approval
  • Biostatistics Division first needs to provide ‘Honest Broker’ service to eliminate any dissenting patients

slide-20
SLIDE 20

Definition of “Honest Broker”

  • Impartial party and process to determine whether patient’s wishes would be violated by:
      • Analyzing their data arising from standard care, or
      • Studying their discard tissue samples
  • Moving to single “General Research Consent” for all patients going forward
  • However, the many different consents used for various studies in the past must be considered

slide-21
SLIDE 21

Protocols Requiring “Honest Broker” Process

Total Protocols: N=3,299

  • Protocols with Consent for Specific Interventions: N=2,314
  • Non-intervention Studies: N=985
      • Consent Required: N=365
      • Consent Not Required: N=620

Diagram labels: No Honest Broker; Honest Broker Required

slide-22
SLIDE 22

Patient Consent Status

Consent Status                     N         Percent
Consented                          70,133    92.5
Dissented                          5,637     7.4
Consent Withdrawn (Dissented)      25        < 1.0
Total                              75,795

slide-23
SLIDE 23

Honest Broker Algorithmic Approach

  • Use computerized algorithms to evaluate:
      • Consent Type
      • Participation Type
  • Any “no” response to a consent/participation type related to the objectives of the study: do not include in cohort
  • Note: Consents change over time
  • May require different algorithms depending on protocol version
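
The exclusion rule and the version dependence can be sketched together as follows. Everything named here is hypothetical: the question IDs, the canonical participation type names, and the per-version maps are invented to show the shape of the logic. Treating an unanswered relevant question as an exclusion is a conservative assumption of this sketch, not something the slide specifies.

```python
# Hypothetical: each consent version numbers its questions differently, so a
# version-specific map translates question IDs to canonical participation types.
QUESTION_MAP = {
    1: {"Q1": "data_use", "Q2": "routine_sample_use"},
    2: {"Q1": "routine_sample_use", "Q2": "data_use", "Q3": "extra_sample_collection"},
}

def canonical_responses(version, answers):
    """Translate raw answers from one consent version into canonical types."""
    mapping = QUESTION_MAP[version]
    return {mapping[q]: a for q, a in answers.items() if q in mapping}

def honest_broker_allows(version, answers, study_needs):
    """Keep a subject only if every participation type the study needs was answered 'Yes'.

    Assumption: a needed type with no recorded answer is treated like a 'No'.
    """
    responses = canonical_responses(version, answers)
    return all(responses.get(need) == "Yes" for need in study_needs)

candidates = [
    ("P1", 1, {"Q1": "Yes", "Q2": "Yes"}),
    ("P2", 1, {"Q1": "Yes", "Q2": "No"}),
    ("P3", 2, {"Q1": "Yes", "Q2": "Yes", "Q3": "No"}),
    ("P4", 2, {"Q1": "No", "Q2": "Yes", "Q3": "Yes"}),
]
study_needs = {"data_use", "routine_sample_use"}

cohort = [pid for pid, version, answers in candidates
          if honest_broker_allows(version, answers, study_needs)]
print(cohort)  # ['P1', 'P3']
```

P3 declined extra sample collection, but the hypothetical study does not need it, so that "No" does not exclude them. P4 gave the same raw answer to Q1 as P2 gave to Q2, and only the version map reveals that both declined routine sample use.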
slide-24
SLIDE 24

Participation Type Field Examples

  • I agree to allow the collection and storage of samples from tissue removed during surgical resection of breast tissue (performed as part of my routine clinical care), and for the specimens to be linked to the clinical data obtained from my medical records for the purposes of cancer research
  • I agree to allow the collection of clinical data to be stored in a clinical research database. I understand that this clinical database will be updated with data obtained from my ongoing medical care at City of Hope
  • I agree to have my medical information and my blood stored for future research
  • I agree to have my specimens stored for future research purposes
  • I agree to have my specimens stored for future research purposes.
  • I agree to have my/my child's tissue or specimens stored for future research purposes.
  • I agree to have my/my child’s name and contact information released to an agency under contract with COH, one of which is Examined Management Services, Inc. (EMSI)
  • I agree to have tissue stored for future research
  • I agree to have tissue stored for future research.
  • I agree to participate in the focus group
  • I agree to provide a blood sample to be used as part of this study
  • I agree to provide a urine sample to be used as part of this study
  • I agree to the collection and storage of samples from tissue removed during biopsy or surgical resection of prostate cancer (performed as part of my routine clinical care) for the purposes of cancer research
  • I agree to the collection of blood samples during routine clinical care that will be stored in blood bank for the purposes of cancer research
  • I agree to udnergo research blood sample collection at the first 3 proposed time points as part of this study
  • I agree to undergo research blood sample collection at all proposed time points as part of this study
  • I do not agree to have my child’s name and contact information released to an agency under contract with COH, one of which is Examined Management Services, Inc. (EMSI)
  • I will allow additional needle biopsy specimens to be obtained during a routine and required diagnostic procedure for the purposes of cancer research
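
Several of these field values differ only cosmetically, for example by a trailing period or by a curly versus straight apostrophe. A minimal normalization sketch, assuming that such differences carry no meaning, could collapse them before any coding algorithm runs. The function name is hypothetical, and misspellings inside a value would still need separate handling.

```python
import re
import unicodedata

def normalize_participation_type(text):
    """Collapse cosmetic differences so equivalent field values compare equal."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u2019", "'")          # curly apostrophe to straight
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    text = text.rstrip(".")                     # drop trailing period(s)
    return text.casefold()

raw_values = [
    "I agree to have tissue stored for future research",
    "I agree to have tissue stored for future research.",
    "I agree to have my specimens stored for future research purposes",
    "I agree to have my specimens stored for future research purposes.",
]
distinct = sorted({normalize_participation_type(v) for v in raw_values})
print(len(distinct))  # 2
```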

slide-25
SLIDE 25

Participation Type Responses

  • 218 different participation types across all non-interventional studies
  • Multiple participation types within a given consent form require complex coding algorithms
  • Example of participations within 1 consent form:
      • I allow my data to be collected for research = ‘Yes’
      • I allow a routine sample to be used for research = ‘Yes’
      • I allow an extra sample to be collected for research = ‘No’
      • I allow my clinical and sample data to be linked = ‘No’

slide-26
SLIDE 26

Conclusions

  • Advances in personalized medicine will require genome-phenome correlations
  • Data warehousing is an optimal approach
  • Business metadata are critical for valid use of data
  • Requires complex computational algorithms to:
      • Assemble appropriate cohort
      • Mine data, particularly non-standardized text
      • Ensure valid linkages and data extraction
      • Protect patient privacy wishes

slide-27
SLIDE 27