Machine Learning Applications to Omics Data
Kelly Ruggles April 9, 2018
Diversity of Omics in Biomedicine
Genome: long-term information storage
Transcriptome: retrieval of information
Proteomics, Phosphoproteomics, Mutation calls
Mutation calls, Copy Number, Gene Expression, DNA methylation/Epigenetics, MicroRNA, RPPA, Clinical Data, Proteomics, Phosphoproteomics
ChIP-Seq
recovered DNA is sequenced
proteins
DNAse-Seq/FAIRE-Seq
(open chromatin = active genes)
Hi-C/5C
(promoter/enhancer regions)
Bisulfite Sequencing (WGBS, RRBS)
level
Sample
Single Nucleotide Polymorphisms (SNPs)
progression
Copy Number Variation (CNV)
deletion of large regions of DNA
Next Generation Sequencing: Genomic DNA Isolation → Library Preparation → Load on Flow Cell → Sequence → Alignment
SNP T C
Gene Expression
Alternative Splicing
cancer
driver
Next Generation Sequencing: Sample → RNA Isolation → Library Preparation → Load on Flow Cell → Sequence → Alignment
Tandem Mass Spectrometry: Sample → Lysis → Digestion → Peptides → Fractionation
m/z, intensity
Identity, Quantity
Discovery Proteomics:
expression (whole cell proteome)
to measure phosphorylation status
Reverse Phase Protein Array:
Institute and the National Human Genome Research Institute
maps of 33 tumor types
characterized at the proteome level
National Human Genome Research Institute
functional elements in the human genome
Libbrecht MW. Nat Rev Genet. 2015 Jun; 16(6): 321–332.
Sequence Element Annotation
Kapranov, 2009
the regulatory regions that flank it
RNA
and list of non-TSS sequences
which predicts TSS or non-TSS for each sequence
elements of a given type you can probably train a machine learning method to recognize those elements
Libbrecht and Nobel, 2015
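The TSS example is a standard supervised setup: labeled TSS and non-TSS sequences go in, and a model that predicts a label for each new sequence comes out. A minimal sketch of that idea follows, assuming 2-mer frequency features and a plain logistic regression. The toy sequences, the feature choice, and all function names are illustrative assumptions, not taken from Libbrecht and Nobel.

```python
import math
from itertools import product

K = 2  # k-mer length; an illustrative choice
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]

def kmer_features(seq):
    """Return the normalized k-mer frequencies of a DNA sequence."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        if kmer in counts:  # skip k-mers containing N or other symbols
            counts[kmer] += 1
    total = max(1, sum(counts.values()))
    return [counts[kmer] / total for kmer in KMERS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(x, w, b):
    """Linear score of a feature vector under weights w and bias b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_logistic(X, y, lr=1.0, epochs=500):
    """Fit logistic regression by batch gradient descent, starting from zeros."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score(xi, w, b)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(seq, w, b):
    """Label a sequence as 'TSS' or 'non-TSS'."""
    return "TSS" if score(kmer_features(seq), w, b) >= 0 else "non-TSS"

# Made-up toy training data (assumption): AT-rich "TSS" vs. GC-rich "non-TSS".
tss_seqs = ["TATAAATATATA", "TTATATAAATAT", "ATATATAAAATA"]
non_tss_seqs = ["GCGCGGCCGCGC", "CCGGCGCGCCGG", "GGCCGCGCGGCC"]
X = [kmer_features(s) for s in tss_seqs + non_tss_seqs]
y = [1] * len(tss_seqs) + [0] * len(non_tss_seqs)
w, b = train_logistic(X, y)

print(predict("ATATAAATATAT", w, b))
print(predict("CGCGCCGGCGCG", w, b))
```

Real TSS predictors use far richer features and much larger training sets; the point here is only the shape of the task: labeled examples in, a per-sequence classifier out.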
the regulation of gene expression
target genes makes them difficult to identify
Seq data and applied random forest model to predict enhancers
H3K4me3, H3K27ac) that were the most informative and robust across cell types
embryonic stem cells and predicted in 12 ENCODE cell types
Rajagopal, 2013
Cell types
Exons Introns
predict locations and intron/exon structure of all protein-coding genes on a chromosome
Libbrecht and Nobel, 2015
start/end of gene, splice sites
Libbrecht and Nobel, 2015
(ENCODE) and want to identify patterns
modification TF binding
providing an overview of the functional activities of the genome
number of labels
labels to each segment.
genomic elements
Libbrecht and Nobel, 2015
data (ChIP-Seq, DNAse-seq, FAIRE-seq)
Networks to segment and cluster the data
features
chromatin states, etc.
Nature Methods, 2012
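The segmentation step can be illustrated with a two-state hidden Markov model, used here as a simple stand-in and not necessarily the model in the cited work. The sketch below only decodes the most probable state path (Viterbi) for binarized signal bins. The state names and all probabilities are hand-set assumptions; in a real unsupervised analysis they would be learned from the data.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Return the most probable state path for a sequence of discrete observations."""
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans[p][s]))
            scores[s] = best[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def to_segments(path):
    """Collapse a per-bin path into (start, end, label) segments; end is exclusive."""
    segments = []
    start = 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            segments.append((start, i, path[start]))
            start = i
    return segments

# Hypothetical parameters: 1 = signal present in a genomic bin, 0 = absent.
STATES = ("inactive", "active")
START = {"inactive": 0.5, "active": 0.5}
TRANS = {
    "inactive": {"inactive": 0.9, "active": 0.1},
    "active": {"inactive": 0.1, "active": 0.9},
}
EMIT = {
    "inactive": {0: 0.9, 1: 0.1},
    "active": {0: 0.2, 1: 0.8},
}

signal = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
path = viterbi(signal, STATES, START, TRANS, EMIT)
print(to_segments(path))
```

The segment labels produced this way are arbitrary state names; attaching biological meaning to each state is the manual part of a semi-automated annotation.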
Libbrecht MW. Nat Rev Genet. 2015 Jun; 16(6): 321–332.
Expression-based input
Ruggles et al., (2017) MCP
expression matrices
distinguish between disease phenotypes and/or to identify potentially valuable disease biomarkers
positives due to chance
Alyass, 2015
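Testing thousands of features in an expression matrix produces some small p-values by chance alone. One common way to control this is the Benjamini-Hochberg false discovery rate procedure, sketched below. Choosing this procedure is an assumption (the slides name no method), and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p-values for four features (for example, four genes).
pvals = [0.01, 0.04, 0.03, 0.005]
qvals = benjamini_hochberg(pvals)
significant = [i for i, q in enumerate(qvals) if q <= 0.05]
print(qvals, significant)
```

Features would then be called significant by thresholding the adjusted values, not the raw p-values.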
efficacy and minimize risk based on genetic make-up
individual variability in drug response and toxicity.
drug metabolism differences
corresponding to a therapeutic effect
help between 1 in 25 and 1 in 4 people who take them
ethnic groups because of the bias toward Western participants in clinical trials
account genetic and environmental factors that affect how a person responds to treatment
Schork, 2015
healthcare:
disease
predictive and drug response markers
based on biomarkers
Bernstam et al., 2013
in part, to our ability to generate
cheaply generate data
subject-specific care based on their disease network
molecular mechanisms based on this data is limited
Alyass, 2015
disciplines is required:
best practices
Omics Input → Feature Selection → Model Training → Predictive Model → Diagnosis, Prognosis, Drug Response, Drug Toxicity
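The pipeline can be sketched end to end on a toy matrix: score each feature, keep the top few, train a model on those features, then predict a label for a new sample. The signal-to-noise score, the nearest-centroid classifier, and the numbers are illustrative assumptions, not the methods of any study cited in these slides.

```python
from statistics import mean, pstdev

def select_features(X, y, k):
    """Rank features by a two-class signal-to-noise score; return the top k column indices."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = pstdev(pos) + pstdev(neg) + 1e-9  # small constant avoids division by zero
        scores.append((abs(mean(pos) - mean(neg)) / spread, j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def train_centroids(X, y, features):
    """Compute one mean profile per class over the selected features."""
    centroids = {}
    for label in set(y):
        rows = [row for row, row_label in zip(X, y) if row_label == label]
        centroids[label] = [mean(row[j] for row in rows) for j in features]
    return centroids

def classify(sample, centroids, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def distance(centroid):
        return sum((sample[j] - c) ** 2 for j, c in zip(features, centroid))
    return min(centroids, key=lambda label: distance(centroids[label]))

# Toy expression matrix (made up): rows are samples, columns are four features.
X = [
    [5.1, 2.0, 8.2, 3.3],
    [4.9, 3.1, 7.9, 2.9],
    [5.3, 2.4, 8.4, 3.1],
    [1.0, 2.2, 3.1, 3.0],
    [1.2, 3.0, 2.8, 3.2],
    [0.8, 2.6, 3.3, 2.8],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = disease, 0 = control (hypothetical labels)

selected = select_features(X, y, k=2)
centroids = train_centroids(X, y, selected)
print(selected, classify([5.0, 2.5, 8.0, 3.0], centroids, selected))
```

In practice feature selection should use only the training samples of each cross-validation fold; selecting on the full data set before splitting leaks label information and inflates performance estimates.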
(TCGA)
tumor types
samples
Li et al., 2017
shotgun proteomics
highest segregating power
and TBC1D4) to accurately classify Large B-Cell lymphoma patients, which are usually morphologically indistinguishable
Deeb et al. (2015) Mol. Cell. Proteomics 14, 2947–2960
Clustering of top protein candidates determined by SVM
Modified from Ritchie et al., 2015
rearrangement
methylation
modification
accessibility
expression
modification
profiling
Phenome (e.g. Cancer)
proteome and phosphoproteome data to be more informative regarding clinical
data modalities are more proximal to the disease.
proteomics data to
cancer
Mutation, Copy Number, Gene Expression, DNA methylation, Clinical Data, Proteomics, Phosphoproteomics
Lung squamous, Ovarian, Colorectal, Kidney, Breast, Lung adeno, Endometrial, Glioblastoma, Pancreatic
possible)
Mertins et al., Nature 2016; Zhang, Nature 2014; Zhang, Cell 2016
breast tumors to predict 10-year survival in breast cancer
improve model performance
transcriptomics
Ma, S et al. (2016) AMIA Summits Transl. Sci.
Zhu, 2017 Scientific Reports
Cancer Genome Consortium (ICGC)
data is aggregated
Zhu, 2017 Scientific Reports
Prognostic performance of each ’omics alone
Endocervical Carcinoma, Head & Neck Squamous Cell Carcinoma, Lower Grade Glioma, Ovarian Serous Cystadenocarcinoma
(pink vs. dotted blue)
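Prognostic performance of a survival model is often summarized with a concordance index (C-index): the fraction of comparable patient pairs in which the patient with the higher predicted risk has the shorter observed survival. Whether the cited study used exactly this metric is an assumption here. The sketch below is a plain pairwise implementation on made-up data.

```python
def concordance_index(times, events, risks):
    """Pairwise C-index for right-censored survival data.

    times  : observed follow-up time per patient
    events : 1 if the event (for example, death) was observed, 0 if censored
    risks  : predicted risk score; higher means worse expected outcome
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Made-up follow-up data for four patients.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(concordance_index(times, events, [0.9, 0.7, 0.4, 0.2]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a common scale for comparing models built on a single data type against models built on aggregated data.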
respond to a drug
mutations from 33 cancer types in the TCGA
regression classifier to learn gene or pathway signatures
pathway activation
Way et al., (2018) Cell
cancer
mutation in KRAS, NRAS, HRAS or NF1 loss of function events
treatment resistance
difficult
important for determining therapy
certain circumstances
approach is fine
hard to generalize to other contexts
the ratio of positive to negative is very small
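When positives are rare, accuracy alone can be misleading. The sketch below compares a classifier that always predicts "negative" with a hypothetical detector on a made-up data set of 5 positives and 95 negatives, reporting precision and recall alongside accuracy. All numbers are illustrative assumptions.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Precision and recall are reported as 0.0 when their denominator is empty.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Baseline that never predicts the positive class.
always_negative = [0] * 100

# Hypothetical detector: finds 4 of the 5 positives and raises 6 false alarms.
detector = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

baseline_metrics = evaluate(y_true, always_negative)
detector_metrics = evaluate(y_true, detector)
print(baseline_metrics)
print(detector_metrics)
```

The baseline has the higher accuracy while finding no positives at all, which is why precision, recall, or precision-recall curves are usually more informative than accuracy in this setting.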
that reflect changes in metabolism
Han, X. (2016) Lipidomics for studying metabolism
processing
leave behind”
upstream input from the environment
transcriptome or proteome
these studies yielded far fewer disease genes than expected
environmental factors
development
drug discovery and development
being causal, then the drug target/pathway is known
pharma companies to create a system for these predictions
NMR is not sensitive enough, but with better technology use of these methods will likely increase
Who is present? Taxonomic Analysis
What are they doing? Gene/Pathway Prediction
Next Generation Sequencing, 16S rRNA Sequencing