Data Mining for Genomic-Phenomic Correlations
Joyce C. Niland, Ph.D., Associate Director & Chair, Information Sciences
Rebecca Nelson, Ph.D., Lead, Data Mining Section
City of Hope National Medical Center, Duarte, California, USA
City of Hope National Medical Center
- Founded in 1913
- State-of-the-art care to patients with cancer & other life-threatening diseases (e.g. diabetes)
- Leading edge research into the causes, prevention, and cure of such diseases
- Promising new therapeutic agents being taken from “bench” to “bedside” (translational research)
- Over 400 ongoing clinical trials, 1/3 initiated at City of Hope
- Human genome has expanded scope and objectives of translational research:
  - Improved diagnosis of disease
  - Earlier detection of genetic risk
  - Repairing defective genes with healthy ones
  - New drugs based on information about genes
Recent Examples of Genomic-Phenomic Data Mining Supported by Biostatistics
- Predictors of Genetic Susceptibility to Heart Disease Post-Bone Marrow Transplant
- Pathogenesis of Radiation-induced Breast Cancer
- Gene Expression in Prostate Cancer Tumors
- Prognostic Biomarkers for Stage I-III Renal Cell Carcinoma
- Validation of Biomarkers for Tumor Initiating Cells in Brain Cancer
- Expression of DNA Repair in Normal versus Tumor Cell Genes
The 3 (4?) I’s of the Data Mining Process
- Integrate Data Sources: Data Warehouse
- Include Appropriate Samples: Cohort Assembly
- Identify (Infer?) Subjects' Wishes: Honest Broker
Integrating Data Systems to Support Biomedical Research
- Biomedical research is an increasingly complex collaborative undertaking
- Requires integration of data, rules, processes, and vocabularies from many different source systems
- Most information systems developed independently
- Operational systems, created to meet different functional and departmental needs of an institution
Operational Systems Versus Data Warehousing
On-Line Transaction Processing (OLTP):
- Focuses on an organization’s day-to-day business needs (electronic medical record, financial systems, clinical trial management systems)
On-Line Analytic Processing (OLAP):
- Retrieves, analyzes, reports, and shares data from disparate systems, vendors & departments (DW)
Data Warehousing Concept
Large, centralized, and longitudinal store of data to facilitate organization-wide consolidated reporting and analysis
- Multiple source databases
- Central coordination and management via metadata repository
- Multiple target “data marts” (aggregated datasets for efficient querying and analysis)
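A minimal sketch of the source-to-warehouse-to-data-mart flow described above. It is not the COH implementation: the two source row sets, the table names (`warehouse`, `mart_dx_counts`), and the function names are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical rows standing in for two independently built source systems.
emr_rows = [("P1", "breast"), ("P2", "prostate"), ("P3", "breast")]
registry_rows = [("P4", "kidney"), ("P5", "breast")]


def build_warehouse(sources):
    """Load rows from every source into one central table, tagged by origin."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE warehouse (patient_id TEXT, diagnosis TEXT, source TEXT)"
    )
    for source_name, rows in sources.items():
        conn.executemany(
            "INSERT INTO warehouse VALUES (?, ?, ?)",
            [(pid, dx, source_name) for pid, dx in rows],
        )
    return conn


def build_data_mart(conn):
    """Create an aggregated table (a 'data mart') of patient counts per diagnosis."""
    conn.execute(
        "CREATE TABLE mart_dx_counts AS "
        "SELECT diagnosis, COUNT(DISTINCT patient_id) AS n_patients "
        "FROM warehouse GROUP BY diagnosis"
    )
    # Each result row is a (diagnosis, n_patients) pair, so dict() accepts it.
    return dict(conn.execute("SELECT diagnosis, n_patients FROM mart_dx_counts"))


conn = build_warehouse({"emr": emr_rows, "registry": registry_rows})
mart = build_data_mart(conn)
print(mart)
```

The data mart is precomputed once so that reporting queries read a small aggregated table instead of scanning every source row.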
City of Hope (COH) Data Warehouse Overview
[Diagram; recoverable labels follow]
- Domains: Patient Care, Clinical Research, Basic Science, Financials
- Source data systems (OLTP): Eclypsis Electronic Medical Record (EMR), Sunquest, Surgical Info System (SIS), Radiology Info System (RIS), SafeTrace Transfusion Medicine, CoPath, Medidata Electronic Data Capture (EDC), MIDAS Observational Data System, Cancer Registry, Labware Biospecimen Repository, Solexa Gene Analysis, Microarray High Throughput Analysis, Trendstar Financial Data
- Data captured: caregiver documentation; discrete core data elements from patient care; abstraction of core phenomic data on all COH patients; management of patients on clinical trials; sample data; genomic / proteomic results
- Flow: source systems feed the warehouse through Extract, Transform & Load (ETL), with data validation & quality assurance (QA) and a metadata layer
- COH Data Warehouse (OLAP “cubes”): phenomic and genomic information for reporting & data mining
- Uses: Hospital QA Reporting; Hypothesis Generation & Grant Proposals; Disease Cluster Reporting & Analysis; Genotype – Phenotype Correlative Analyses
- Legend: a marker in the diagram indicates items in progress
Utility of a Data Warehouse
While protecting personally identifiable information and proprietary research data:
- Decision support to administrators
- Screening of patients for eligibility
- Measurement of quality of care & outcomes
- Query capabilities to investigators
- Data mining to generate new hypotheses, facilitate new discoveries
Technical & Business Metadata Directories (Metadata Repository)
Technical Metadata (database administrator perspective):
- Data Sources: technical name; data type & length; creation, expiration dates; source system; data ‘steward’
- Mappings: rules for merging/filtering
- Validation Rules: missing value fields; data integrity, consistency
- Transformation Rules: derivation of values; data summaries
Business Metadata (database user perspective):
- Data Definition: field names, aliases; description of data meaning
- Data Directives: instructions for data collection; guidelines for data coding
- Queries: synonyms; classification coding
- Reports: list of reports that use term
- Security Information: authorization to access
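As an illustration of how a metadata repository can drive validation, the sketch below stores a few technical and business attributes per field and checks records against them. The `FieldMetadata` class, the two repository entries, and the length limits are assumptions made for this example, not the actual COH repository schema.

```python
from dataclasses import dataclass


@dataclass
class FieldMetadata:
    technical_name: str   # column name in the source system (technical metadata)
    business_name: str    # alias shown to database users (business metadata)
    data_type: type       # expected Python type of the value
    max_length: int       # maximum length of the value when rendered as text
    required: bool        # whether a missing value is a validation error
    source_system: str    # system of record for this field


# Hypothetical repository entries, for illustration only.
REPOSITORY = {
    "path_t_stage": FieldMetadata(
        "path_t_stage", "Path T Stage", str, 4, True, "CoPath"
    ),
    "nodes_positive": FieldMetadata(
        "nodes_positive", "Nodes Positive", int, 3, False, "CoPath"
    ),
}


def validate(record):
    """Return a list of problems found by applying the repository's rules."""
    problems = []
    for name, meta in REPOSITORY.items():
        value = record.get(name)
        if value is None:
            if meta.required:
                problems.append(f"{meta.business_name}: missing required value")
            continue
        if not isinstance(value, meta.data_type):
            problems.append(
                f"{meta.business_name}: expected {meta.data_type.__name__}"
            )
        elif len(str(value)) > meta.max_length:
            problems.append(
                f"{meta.business_name}: longer than {meta.max_length} characters"
            )
    return problems


print(validate({"path_t_stage": "T2", "nodes_positive": 3}))
print(validate({"nodes_positive": "3"}))
```

Keeping the rules in data rather than in code means a data steward can change a rule without touching the ETL logic.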
Cohort Assembly
- Inclusion of subjects with appropriate phenomic characteristics AND available tissue
- > 360,000 specimens logged in CoPath system, going back to 1955
- Critical to integrate tissue sample data into the data warehouse
- Broken down into “Class of Case” to describe type of specimen
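The "phenomic characteristics AND available tissue" requirement amounts to intersecting a phenotype filter with the specimen inventory. A minimal sketch, assuming hypothetical patient and specimen records and a hypothetical `assemble_cohort` helper:

```python
# Hypothetical phenomic records and specimen log entries.
patients = [
    {"patient_id": "P1", "diagnosis": "renal cell carcinoma", "stage": "II"},
    {"patient_id": "P2", "diagnosis": "renal cell carcinoma", "stage": "IV"},
    {"patient_id": "P3", "diagnosis": "breast cancer", "stage": "I"},
    {"patient_id": "P4", "diagnosis": "renal cell carcinoma", "stage": "I"},
]
specimens = [
    {"specimen_id": "S1", "patient_id": "P1", "class_of_case": "Routine Surgical"},
    {"specimen_id": "S2", "patient_id": "P2", "class_of_case": "Routine Surgical"},
    {"specimen_id": "S3", "patient_id": "P3", "class_of_case": "Cytology"},
]


def assemble_cohort(patients, specimens, diagnosis, stages):
    """Map each eligible patient to their specimen IDs.

    A patient is eligible only if they match the phenomic criteria
    AND have at least one logged specimen.
    """
    specimens_by_patient = {}
    for s in specimens:
        specimens_by_patient.setdefault(s["patient_id"], []).append(s["specimen_id"])

    cohort = {}
    for p in patients:
        matches_phenotype = p["diagnosis"] == diagnosis and p["stage"] in stages
        if matches_phenotype and p["patient_id"] in specimens_by_patient:
            cohort[p["patient_id"]] = specimens_by_patient[p["patient_id"]]
    return cohort


cohort = assemble_cohort(
    patients, specimens, "renal cell carcinoma", {"I", "II", "III"}
)
print(cohort)
```

Here P2 drops out on stage and P4 drops out for lack of tissue, which is why both data sets need to sit in the same warehouse.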
CoPath Specimens by Year and Type
[Chart: number of CoPath specimens per year, 1955 to 2008, y-axis up to 30,000. Legend labels: Routine Surgical, Other, Outside Consults, Hematopathology, Cytology, Card Class, Blood]
Integrate “Best” Source of Tissue Annotation Data
- Standardized formatted path reports now available from pathologist
- ‘Synoptic Report’ includes: Path T Stage, Nodes Examined, Nodes Positive, Path N Stage, Path M Stage, Margins, Histology Grade
Concordance Between CoPath Synoptic Reports & Cancer Registry
[Bar chart: concordance on a 0.00 to 1.00 scale, by cancer site. Bar labels in order of appearance: 0.99, 0.82, 0.91, 0.88, 0.84. Sites in order of appearance: Prostate, Breast, Colon and Rectum, Kidney, Lung]
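The slides do not say how concordance was calculated. One common and simple choice, assumed here, is the proportion of cases present in both sources whose values agree on a given field. The case records and the `concordance` function below are hypothetical.

```python
def concordance(synoptic, registry, field):
    """Proportion of shared cases where both sources report the same value.

    Cases missing from either source, or with no value for the field in
    either source, are left out of the denominator (an assumption).
    """
    shared = synoptic.keys() & registry.keys()
    comparable = [
        case
        for case in shared
        if synoptic[case].get(field) is not None
        and registry[case].get(field) is not None
    ]
    if not comparable:
        return None
    agree = sum(
        1 for case in comparable if synoptic[case][field] == registry[case][field]
    )
    return agree / len(comparable)


# Hypothetical case records keyed by case ID.
synoptic = {
    "C1": {"path_t_stage": "T1"},
    "C2": {"path_t_stage": "T2"},
    "C3": {"path_t_stage": "T3"},
    "C4": {"path_t_stage": "T2"},
    "C5": {"path_t_stage": None},
}
registry = {
    "C1": {"path_t_stage": "T1"},
    "C2": {"path_t_stage": "T2"},
    "C3": {"path_t_stage": "T2"},
    "C4": {"path_t_stage": "T2"},
    "C5": {"path_t_stage": "T1"},
    "C6": {"path_t_stage": "T4"},
}

print(concordance(synoptic, registry, "path_t_stage"))
```

Chance-corrected measures such as Cohen's kappa are an alternative when simple agreement would be inflated by a dominant category.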
Comparing Path Data Sources: Synoptic Report vs. CNExT
Synoptic Report - Pathology
- Surgery specific
- Only 22% of cancer cases
- Non-strict reporting rules
- No confirmation by treating MD
CNExT - Cancer Registry
- Patient specific
- 100% of cancer cases
- Strict reporting rules
- Confirmed by treating MD
Text Mining to Identify Cases
- Text mining may be needed if neither data source is specific enough
- Example: Diagnosis of Unknown Primary Sites for Metastatic Tumors via 92-Gene RT-PCR
  - Needed to find tissue samples labeled as purely metastatic, with no link to original cancer site
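A rough sketch of the kind of rule-based text screen this example calls for. The three regular expressions and the report snippets are invented for illustration. Real pathology text would need far broader patterns and manual review of the hits.

```python
import re

# Hypothetical patterns; a real screen would be tuned against actual reports.
METASTATIC = re.compile(r"\bmetasta(?:tic|sis|ses)\b", re.IGNORECASE)
UNKNOWN_PRIMARY = re.compile(
    r"\bunknown\s+(?:primary|origin)\b|\bprimary\s+(?:site\s+)?unknown\b",
    re.IGNORECASE,
)
NAMED_PRIMARY = re.compile(
    r"\b(?:consistent with|from)\s+(?:a\s+)?(\w+)\s+primary\b", re.IGNORECASE
)


def is_purely_metastatic(report_text):
    """Flag reports that mention metastasis without naming a primary site."""
    if not METASTATIC.search(report_text):
        return False
    if UNKNOWN_PRIMARY.search(report_text):
        return True
    # No explicit "unknown" wording: keep the case only if no primary is named.
    return NAMED_PRIMARY.search(report_text) is None


reports = {
    "R1": "Metastatic adenocarcinoma, primary site unknown.",
    "R2": "Metastatic carcinoma consistent with breast primary.",
    "R3": "Invasive ductal carcinoma, grade 2.",
    "R4": "Lymph node with metastatic carcinoma.",
}
flagged = sorted(rid for rid, text in reports.items() if is_purely_metastatic(text))
print(flagged)
```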
Enabling Investigators to Conduct Cohort Searches via i2b2
- i2b2: Informatics for Integrating Biology and the Bedside
- Facile user interface to allow investigators to search for cohorts on their own
- Executes advanced queries against meta-database to identify available subjects, tissue samples, or biospecimens matching query criteria
- If a feasible number of cases is returned, then submit IRB protocol for approval
- Biostatistics Division first needs to provide ‘Honest Broker’ service to eliminate any dissenting patients
Definition of “Honest Broker”
- Impartial party and process to determine whether patient’s wishes would be violated by:
  - Analyzing their data arising from standard care, or
  - Studying their discard tissue samples
- Moving to single “General Research Consent” for all patients going forward
- However, the many different consents used for various studies in the past must be considered
Protocols Requiring “Honest Broker” Process
Total Protocols N=3,299
- Protocols with Consent for Specific Interventions N=2,314
- Non-intervention Studies N=985
  - Consent Required N=365
  - Consent Not Required N=620
[Diagram branch labels: No Honest Broker; Honest Broker Required]
Patient Consent Status
Consent Status                   N        Percent
Consented                        70,133   92.5
Dissented                        5,637    7.4
Consent Withdrawn (Dissented)    25       < 1.0
Total                            75,795
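The percentages in the table follow directly from the counts. A short check, using only the numbers on the slide (the variable names are illustrative):

```python
# Counts taken from the Patient Consent Status table.
counts = {
    "Consented": 70133,
    "Dissented": 5637,
    "Consent Withdrawn (Dissented)": 25,
}

total = sum(counts.values())
percents = {status: round(100 * n / total, 1) for status, n in counts.items()}

print(total)
for status, pct in percents.items():
    print(f"{status}: {pct}%")
```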
Honest Broker Algorithmic Approach
- Use computerized algorithms to evaluate:
  - Consent Type
  - Participation Type
- Any “no” response to a consent/participation type related to objectives of the study: do not include in cohort
- Note: Consents change over time
  - May require different algorithms depending on protocol version
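The exclusion rule above can be sketched as follows. This is not the City of Hope algorithm: the version labels (`v1`, `v2`), the use categories (`tissue`, `data`), the mapping table, and the response records are all assumptions made to show the shape of the logic. In particular, the sketch ignores unmapped participation types, whereas a production system would more likely route them to manual review.

```python
# Hypothetical mapping from (protocol version, participation-type text) to the
# kind of research use it governs. Wording differs between versions, so the
# version is part of the lookup key.
CATEGORY_BY_VERSION = {
    ("v1", "I agree to have tissue stored for future research"): "tissue",
    ("v2", "I agree to have my specimens stored for future research purposes"): "tissue",
    ("v2", "I agree to allow the collection of clinical data to be stored "
           "in a clinical research database"): "data",
}


def honest_broker(responses, study_needs):
    """Return sorted patient IDs with no 'no' response relevant to the study.

    study_needs is the set of use categories the study's objectives touch.
    """
    candidates = set()
    excluded = set()
    for r in responses:
        candidates.add(r["patient_id"])
        category = CATEGORY_BY_VERSION.get((r["version"], r["participation_type"]))
        if category in study_needs and r["response"].strip().lower() == "no":
            excluded.add(r["patient_id"])
    return sorted(candidates - excluded)


# Hypothetical consent responses.
responses = [
    {"patient_id": "P1", "version": "v1", "response": "Yes",
     "participation_type": "I agree to have tissue stored for future research"},
    {"patient_id": "P2", "version": "v2", "response": "No",
     "participation_type": "I agree to have my specimens stored for future research purposes"},
    {"patient_id": "P3", "version": "v2", "response": "Yes",
     "participation_type": "I agree to have my specimens stored for future research purposes"},
    {"patient_id": "P3", "version": "v2", "response": "No",
     "participation_type": "I agree to allow the collection of clinical data to be stored "
                           "in a clinical research database"},
]

print(honest_broker(responses, {"tissue"}))
print(honest_broker(responses, {"tissue", "data"}))
```

P3 stays in a tissue-only study but drops out once the study also needs clinical data, which is the "related to objectives of the study" condition in action.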
Participation Type Field Examples
- I agree to allow the collection and storage of samples from tissue removed during surgical resection of breast tissue (performed as part of my routine clinical care), and for the specimens to be linked to the clinical data obtained from my medical records for the purposes of cancer research
- I agree to allow the collection of clinical data to be stored in a clinical research database. I understand that this clinical database will be updated with data obtained from my ongoing medical care at City of Hope
- I agree to have my medical information and my blood stored for future research
- I agree to have my specimens stored for future research purposes
- I agree to have my specimens stored for future research purposes.
- I agree to have my/my child's tissue or specimens stored for future research purposes.
- I agree to have my/my child’s name and contact information released to an agency under contract with COH, one of which is Examined Management Services, Inc. (EMSI)
- I agree to have tissue stored for future research
- I agree to have tissue stored for future research.
- I agree to participate in the focus group
- I agree to provide a blood sample to be used as part of this study
- I agree to provide a urine sample to be used as part of this study
- I agree to the collection and storage of samples from tissue removed during biopsy or surgical resection of prostate cancer (performed as part of my routine clinical care) for the purposes of cancer research
- I agree to the collection of blood samples during routine clinical care that will be stored in blood bank for the purposes of cancer research
- I agree to udnergo research blood sample collection at the first 3 proposed time points as part of this study
- I agree to undergo research blood sample collection at all proposed time points as part of this study
- I do not agree to have my child’s name and contact information released to an agency under contract with COH, one of which is Examined Management Services, Inc. (EMSI)
- I will allow additional needle biopsy specimens to be obtained during a routine and required diagnostic procedure for the purposes of cancer research
Participation Type Responses
- 218 different participation types across all non-interventional studies
- Multiple participation types within a given consent form require complex coding algorithms
- Example of participations within 1 consent form:
  - I allow my data to be collected for research = ‘Yes’
  - I allow a routine sample to be used for research = ‘Yes’
  - I allow an extra sample to be collected for research = ‘No’
  - I allow my clinical and sample data to be linked = ‘No’
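To see why mixed answers within one form need careful coding, the sketch below encodes the four example responses and asks whether a given study may use the patient's material. The `permits` helper and the rule that an unanswered item counts as a refusal are assumptions made for illustration.

```python
# The four example responses from a single consent form.
consent_form = {
    "I allow my data to be collected for research": "Yes",
    "I allow a routine sample to be used for research": "Yes",
    "I allow an extra sample to be collected for research": "No",
    "I allow my clinical and sample data to be linked": "No",
}


def permits(form, required_items):
    """True only if every participation item the study relies on is 'Yes'.

    An item absent from the form is treated as not permitted (conservative
    assumption).
    """
    return all(form.get(item) == "Yes" for item in required_items)


# A data-only study is compatible with this patient's wishes.
data_only = permits(consent_form, ["I allow my data to be collected for research"])

# A genotype-phenotype study needs the routine sample AND linkage to clinical
# data, so the 'No' on linkage rules this patient out.
linked_sample = permits(
    consent_form,
    [
        "I allow a routine sample to be used for research",
        "I allow my clinical and sample data to be linked",
    ],
)

print(data_only, linked_sample)
```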
Conclusions
- Advances in personalized medicine will require genome-phenome correlations
- Data warehousing is an optimal approach
- Business metadata are critical for valid use of data
- Requires complex computational algorithms to:
  - Assemble appropriate cohort
  - Mine data, particularly non-standardized text
  - Ensure valid linkages and data extraction
  - Protect patient privacy wishes