Machine Learning Applications to Omics Data, Kelly Ruggles, April 9, 2018



slide-1
SLIDE 1

Machine Learning Applications to Omics Data

Kelly Ruggles April 9, 2018

slide-2
SLIDE 2

Diversity of Omics in Biomedicine

Data types: mutation calls, copy number, gene expression, DNA methylation/epigenetics, microRNA, RPPA, clinical data, proteomics, phosphoproteomics

  • Genome: long-term information storage
  • Transcriptome: retrieval of information
  • Proteome: short-term information storage
  • Interactome: execution
  • Metabolome, lipidome: state
slide-3
SLIDE 3

Understanding Gene Regulation and Epigenetics

ChIP-Seq

  • Chromatin is immunoprecipitated and the recovered DNA is sequenced
  • Identifies binding sites of DNA-associated proteins

DNase-Seq/FAIRE-Seq

  • Identifies DNase I hypersensitive sites (open chromatin = active genes)

Hi-C/5C

  • DNA is crosslinked and sequenced
  • Reveals the spatial organization of chromatin (promoter/enhancer regions)

Bisulfite Sequencing (WGBS, RRBS)

  • Reads methylation status at the genome level

slide-4
SLIDE 4

Assessing Copy Number and Mutation Status by Genome Sequencing

Single Nucleotide Polymorphisms (SNPs)

  • Single base-pair sites that vary in a population
  • Some have been found to act as “drivers” of tumor progression

Copy Number Variation (CNV)

  • Changes in the genome due to duplication or deletion of large regions of DNA

Workflow (figure): Sample → Genomic DNA Isolation → Library Preparation → Load on Flow Cell → Next Generation Sequencing → Sequence Alignment (example SNP: T vs. C)

slide-5
SLIDE 5

Assessing Gene Expression and Alternative Splicing by RNA Sequencing

Gene Expression

  • Normalized expression of genes in all samples
  • Can be used for differential expression analysis

Alternative Splicing

  • Splicing of exons, creating new protein isoforms
  • Alternative splicing changes are frequently found in cancer
  • Loss of functional domains may also be a disease driver

Workflow (figure): Sample → RNA Isolation → Library Preparation → Load on Flow Cell → Next Generation Sequencing → Sequence Alignment

slide-6
SLIDE 6

Protein Identification and Quantitation by Mass Spectrometry

Workflow (figure): Sample → Lysis → Digestion → Peptides → Fractionation → Tandem Mass Spectrometry (spectra of intensity vs. m/z) → Identity and Quantity

Discovery Proteomics:

  • Used to measure global protein expression (whole cell proteome)
  • Can enrich for phosphopeptides to measure phosphorylation status

Reverse Phase Protein Array:

slide-7
SLIDE 7

Publicly Available Omics Datasets

The Cancer Genome Atlas (TCGA)

  • Collaboration between the National Cancer Institute and the National Human Genome Research Institute
  • Generated comprehensive genomic maps of 33 tumor types
  • A subset of these tumors was characterized at the proteome level

ENCODE

  • International collaboration funded by the National Human Genome Research Institute
  • Goal is to build a comprehensive parts list of functional elements in the human genome

slide-8
SLIDE 8

ML Applications in Omics

Libbrecht MW. Nat Rev Genet. 2015 Jun; 16(6): 321–332.

Sequence Element Annotation

slide-9
SLIDE 9

“Learning” Transcription Start Sites (TSSs)

Kapranov, 2009

  • Knowing the exact position of the 5’ TSS of an RNA is crucial for finding the regulatory regions that flank it
  • Traditionally, one finds where the 5’ cap structure maps onto the RNA
  • Cap analysis of gene expression (CAGE)
  • Oligo-capping
  • Robust analysis of 5’ transcript ends (5’ RATE)
  • Complexity surrounding the TSSs
  • Non-coding RNA function
  • Regulatory regions around the TSS
  • Effect of repetitive elements
slide-10
SLIDE 10

“Learning” Transcription Start Sites (TSSs)

  • Identify an algorithm
  • Provide a large collection of TSS sequences and a list of non-TSS sequences
  • Give novel sequences to the model, which predicts TSS or non-TSS for each sequence
  • If you can compile a list of sequence elements of a given type, you can probably train a machine learning method to recognize those elements

Libbrecht and Noble, 2015
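
The train-then-predict loop on this slide can be sketched in a few lines. This is a minimal illustration, not the method from the cited review: it assumes a naive Bayes classifier over 3-mer counts as the algorithm, and the sequences, labels and function names (`kmers`, `train`, `predict`) are invented for the example.

```python
from collections import Counter
from math import log

def kmers(seq, k=3):
    """Return all overlapping k-mers in a DNA sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(labelled, k=3):
    """Fit a multinomial naive Bayes model on (sequence, label) pairs."""
    counts = {}          # label -> Counter of k-mer counts
    n_seqs = Counter()   # label -> number of training sequences
    vocab = set()
    for seq, label in labelled:
        n_seqs[label] += 1
        counts.setdefault(label, Counter()).update(kmers(seq, k))
        vocab.update(kmers(seq, k))
    return {"k": k, "counts": counts, "n_seqs": n_seqs, "vocab": vocab}

def predict(model, seq):
    """Return the label with the highest posterior log-probability."""
    total = sum(model["n_seqs"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, kmer_counts in model["counts"].items():
        score = log(model["n_seqs"][label] / total)
        denom = sum(kmer_counts.values()) + vocab_size
        for km in kmers(seq, model["k"]):
            # Laplace smoothing keeps unseen k-mers from zeroing the score.
            score += log((kmer_counts[km] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (invented): the "TSS" examples carry a TATA-like motif.
training = [
    ("GGCTATAAAAGGCC", "TSS"),
    ("CCGTATAAATGCGC", "TSS"),
    ("GCTATATAAGCGGC", "TSS"),
    ("GGCGCCGGCGCGCC", "non-TSS"),
    ("CCGGGCGCGGCCGG", "non-TSS"),
    ("GCGCGGCCGCGGGC", "non-TSS"),
]
model = train(training)

# Novel sequences are handed to the trained model for a TSS / non-TSS call.
calls = [predict(model, s) for s in ("CGGTATAAAGGCGC", "GGCCGCGGGCGCCG")]
print(calls)
```

Real TSS predictors use far larger training sets and richer features; the point here is only the shape of the workflow: labelled examples in, a model out, predictions on unseen sequences.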

slide-11
SLIDE 11
  • Enhancers: distal regulatory elements with roles in the regulation of gene expression
  • They lack common sequence features and lie far from their target genes, which makes them difficult to identify
  • Used ENCODE DNaseI hypersensitivity and ChIP-Seq data and applied a random forest model to predict enhancers
  • Identified 3 histone modifications (H3K4me1, H3K4me3, H3K27ac) that were the most informative and robust across cell types
  • Trained on p300 ENCODE data from human embryonic stem cells and predicted in 12 ENCODE cell types

Rajagopal, 2013

slide-12
SLIDE 12

Annotating Genomes

  • To be useful, genomes must be annotated
  • Genome annotation:
  • Identifying the location and function of protein coding genes
  • Understanding cis-regulatory sequences
  • Alternative splicing
  • Identifying promoters and enhancers

Figure: gene structure showing exons and introns

slide-13
SLIDE 13

Annotating Genomes

  • Can use gene-finding algorithms to predict the locations and intron/exon structure of all protein-coding genes on a chromosome

Libbrecht and Noble, 2015

slide-14
SLIDE 14

Annotating Genomes

Supervised Approach

  • Input: DNA sequences labelled with start/end of gene and splice sites
  • Model learns the properties of genes
  • DNA sequence patterns
  • Donor/acceptor splice sites
  • Length/distribution of UTRs

Libbrecht and Noble, 2015

slide-15
SLIDE 15

Annotating Genomes

Unsupervised Approach

  • Start from a collection of epigenomic data sets (ENCODE) and want to identify patterns of chromatin accessibility, histone modification and TF binding
  • We want to know which labels best provide an overview of the functional activities of the genome
  • Use unlabeled data and input the desired number of labels
  • Model partitions the genome and assigns a label to each segment
  • Allows for the identification of novel genomic elements

Libbrecht and Noble, 2015

slide-16
SLIDE 16
  • Unsupervised training on 1% of the human genome using ENCODE data (ChIP-Seq, DNase-seq, FAIRE-seq)
  • Fixed the number of labels at 25 to keep them interpretable
  • They used a method (“Segway”) based on Dynamic Bayesian Networks to segment and cluster the data
  • Assigned functional categories to groups of segment labels based on features
  • Identifies protein coding genes, transcription factor binding, chromatin states, etc.

Nature Methods, 2012
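
The idea of partitioning the genome into a fixed number of labels without any training labels can be illustrated with a much simpler stand-in. The sketch below does not implement Segway or a dynamic Bayesian network: it swaps in plain k-means on per-bin signal vectors and then merges adjacent bins that share a label. The signal values, the two-track layout and the function names are assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two signal vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=20):
    """Plain k-means; centroids start at evenly spaced points (deterministic)."""
    step = max(1, len(points) // k)
    centroids = [list(points[i * step]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each bin takes the label of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(pt, centroids[c]))
            for pt in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def segments(labels):
    """Merge runs of identical labels into (start_bin, end_bin, label) tuples."""
    out, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            out.append((start, i - 1, labels[start]))
            start = i
    return out

# Invented signal for 9 genomic bins; columns are two hypothetical tracks
# (for example, an accessibility track and a histone-mark track).
signal = [
    [0.1, 0.0], [0.2, 0.1], [0.1, 0.1],   # quiet region
    [5.0, 4.8], [5.2, 5.1], [4.9, 5.0],   # both tracks high
    [0.0, 0.2], [0.1, 0.1], [0.2, 0.0],   # quiet region
]
labels = kmeans(signal, k=2)   # the number of labels is chosen up front
print(segments(labels))
```

Unlike this stand-in, a dynamic Bayesian network also models how labels follow one another along the genome, which is part of what makes the resulting segments coherent; k-means treats every bin independently.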

slide-17
SLIDE 17

ML Applications in Genomics and Proteomics

Libbrecht MW. Nat Rev Genet. 2015 Jun; 16(6): 321–332.

Expression-based input

slide-18
SLIDE 18

Ruggles et al., (2017) MCP

Modeling and ‘Omics

  • Input can also be expression matrices
  • RNA-seq
  • DNase-seq
  • ChIP-seq
  • Microarray
  • Proteomics etc.
  • Can be used to distinguish between disease phenotypes and/or to identify potentially valuable disease biomarkers

slide-19
SLIDE 19

Curse of Dimensionality (‘Large p, small n’)

  • Often leads to results with poor biological interpretability
  • Reliability of models decreases with each added dimension
  • Analysis of single and integrative omics data is prone to high rates of false positives due to chance
  • Requires corrections for multiple hypothesis testing or dimensionality reduction
  • Dimensionality reduction can lose key mechanistic information

Alyass, 2015
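
One common correction for multiple hypothesis testing is the Benjamini-Hochberg procedure, which controls the false discovery rate. The slide does not say which correction is meant, so this choice is an assumption, and the p-values below are invented for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Invented p-values for five hypothetical genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
qvals = benjamini_hochberg(pvals)
significant = [q <= 0.05 for q in qvals]
print(significant)
```

Four of the five raw p-values fall below 0.05, but only the two smallest survive the correction. With tens of thousands of genes tested per sample, the gap between raw and corrected calls is usually much larger.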

slide-20
SLIDE 20

Personalized Medicine

  • Personalized medicine: algorithm that optimizes treatment to maximize efficacy and minimize risk based on genetic make-up
  • Patient populations show high inter-individual variability in drug response and toxicity
  • Genetic factors account for 15-30% of drug metabolism differences
  • Ability to identify gene biomarkers corresponding to a therapeutic effect

slide-21
SLIDE 21

Imprecise Medicine

  • The top 10 grossing drugs in the US help between 1 in 25 and 1 in 4 of the people who take them
  • Some drugs are harmful to specific ethnic groups because of the bias toward Western participants in clinical trials
  • Classical clinical trials do not take into account genetic and environmental factors that affect how a person responds to treatment

Schork, 2015

slide-22
SLIDE 22

Personalized Medicine Continuum

  • Spans the full spectrum of healthcare:
  • Identifying those at greatest risk of developing a disease
  • Identifying prognostic, predictive and drug response markers
  • Developing new therapies based on biomarkers

Bernstam et al., 2013

slide-23
SLIDE 23

Use of ‘Omics in Personalized Medicine

  • Lag in personalized medicine is due, in part, to the gap between our ability to generate vs. integrate/interpret omics data
  • NGS means we can quickly and cheaply generate data
  • ’Omics data can be translated into subject-specific care based on their disease network
  • However, our ability to determine molecular mechanisms based on this data is limited

Alyass, 2015

slide-24
SLIDE 24

Barriers of ‘Omics

  • To complete this complex data integration, expertise in many disciplines is required:
  • Biological mechanisms
  • Medicine
  • Informatics and statistics
  • Barriers between these disciplines still exist
  • 90% of scientists are self-taught in software development and lack best practices:
  • Task automation
  • Code review
  • Version control
slide-25
SLIDE 25

Omics Input → Feature Selection → Model Training → Predictive Model → Diagnosis, Prognosis, Drug Response, Drug Toxicity

slide-26
SLIDE 26
  • Used RNA-Seq data from The Cancer Genome Atlas (TCGA)
  • 31 tumor types
  • 9,096 samples
  • 75% training, 25% testing
  • Goal: Identify a set of genes that can distinguish tumor types
  • Identified 20 genes that could classify >90% of the samples
  • Used a GA/KNN method
  • Genetic algorithm (GA) for gene feature selection
  • K nearest neighbors (KNN) as classification tool

Li et al., 2017
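
The KNN half of the GA/KNN method can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline from Li et al.: the genetic-algorithm feature selection is left out entirely, the expression values for two hypothetical selected genes are invented, and `type_A` / `type_B` are placeholder tumor labels.

```python
from collections import Counter

def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    by_distance = sorted(
        (sum((a - b) ** 2 for a, b in zip(features, query)), label)
        for features, label in training
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Invented expression values for two hypothetical selected genes.
samples = [
    ([9.1, 1.2], "type_A"), ([8.7, 0.9], "type_A"),
    ([9.4, 1.5], "type_A"), ([8.9, 1.1], "type_A"),
    ([1.0, 7.9], "type_B"), ([1.3, 8.4], "type_B"),
    ([0.8, 8.1], "type_B"), ([1.1, 7.6], "type_B"),
]

# 75% training, 25% testing: hold out the last sample of each type.
held_out = [samples[3], samples[7]]
training = samples[:3] + samples[4:7]

predictions = [knn_predict(training, features) for features, _ in held_out]
accuracy = sum(p == label for p, (_, label) in zip(predictions, held_out)) / len(held_out)
print(predictions, accuracy)
```

In a GA/KNN setup, a classifier like this would be scored repeatedly on candidate gene subsets, with the genetic algorithm keeping the subsets that classify best.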

slide-27
SLIDE 27

Omics Input → Feature Selection → Model Training → Predictive Model → Diagnosis, Prognosis, Drug Response, Drug Toxicity

slide-28
SLIDE 28
  • Deeb et al. used global expression patterns from shotgun proteomics
  • ~9000 tumor proteins
  • 20 large B-cell lymphoma patients
  • Used SVMs to extract candidate proteins with the highest segregating power
  • Identified four proteins (PALD1, MME, TNFAIP8 and TBC1D4) that accurately classify large B-cell lymphoma patients into subtypes that are usually morphologically indistinguishable

Deeb, et al. (2015) Mol. Cell. Proteomics 14, 2947–2960

Figure: clustering of top protein candidates determined by SVM

slide-29
SLIDE 29

Integrating ‘Omics Data

Modified from Ritchie et al., 2015

  • SNP
  • CNV
  • Genomic rearrangement
  • DNA methylation
  • Histone modification
  • Chromatin accessibility
  • Gene expression
  • Alt. splicing
  • Protein expression
  • Post-translational modification
  • Metabolite profiling

Phenome: e.g. cancer

slide-30
SLIDE 30

Omics Input → Feature Selection → Model Training → Predictive Model → Diagnosis, Prognosis, Drug Response, Drug Toxicity

slide-31
SLIDE 31

Machine Learning in Proteomics

  • One would expect the predictive analysis of proteome and phosphoproteome data to be more informative regarding clinical outcomes compared to NGS data, as these data modalities are more proximal to the disease
  • These techniques have been applied to proteomics data to
  • 1. Classify clinically-relevant disease subtypes in cancer
  • 2. Define prognosis
  • 3. Identify biomarkers predicting drug sensitivity
slide-32
SLIDE 32

CPTAC Cancer Proteogenomics: Samples and Data

Data types: mutation, copy number, gene expression, DNA methylation, clinical data, proteomics, phosphoproteomics

Tumor types: lung squamous, ovarian, colorectal, kidney, breast, lung adeno, endometrial, glioblastoma, pancreatic

  • 9 tumor types total
  • 3 already processed
  • Breast
  • Ovarian
  • Colorectal
  • ~100 Samples per tumor type
  • Matched normals (when possible)

Mertins et al., Nature 2016; Zhang et al., Nature 2014; Zhang et al., Cell 2016

slide-33
SLIDE 33
  • Ma et al. used proteogenomics data from 77 breast tumors to predict 10 year survival in breast cancer
  • Found that fusion of 4 data types did not improve model performance
  • Proteomics outperformed genomics and transcriptomics

Ma, S. et al. (2016) AMIA Summits Transl. Sci. Proc. 2016, 52–59
slide-34
SLIDE 34

Feature Selection → Model Training → Predictive Model → Diagnosis, Prognosis, Drug Response, Drug Toxicity

slide-35
SLIDE 35
Zhu, 2017 Scientific Reports

  • Used multi-omics data from The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC)
  • Somatic mutations
  • DNA copy number
  • DNA methylation
  • mRNA expression
  • miRNA expression
  • RPPA
  • Goal: test whether some prognosis-relevant signals are found by models only when data are aggregated
  • Multi-omic kernel ML method including all molecular markers
slide-36
SLIDE 36

Zhu, 2017 Scientific Reports

Figure: prognostic performance of each ’omics type alone

slide-37
SLIDE 37

Figure panels: Endocervical Carcinoma; Head & Neck Squamous Cell Carcinoma; Lower Grade Glioma; Ovarian Serous Cystadenocarcinoma

  • In some cases, omics data can compensate for the absence of clinical variables (pink vs. dotted blue)
  • Clinical factors = age, gender, tumor grade, histological subtype
slide-38
SLIDE 38

Feature Selection → Model Training → Predictive Model → Diagnosis, Prognosis, Drug Response, Drug Toxicity

slide-39
SLIDE 39
  • Goal: identify patients who are likely to respond to a drug
  • Integrates RNA-Seq, copy number and mutation data from 33 cancer types in the TCGA
  • Used an elastic net penalized logistic regression classifier to learn gene or pathway signatures
  • Applied this method to detect Ras pathway activation

Way et al., (2018) Cell

  • Ras pathway is commonly altered in cancer
  • Typically due to gain-of-function mutations in KRAS, NRAS or HRAS, or to NF1 loss-of-function events
  • Associated with poor survival and treatment resistance
  • Rare/unknown mutations make this difficult
  • Accurate detection of Ras pathway activation is very important for determining therapy
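
The elastic net penalized logistic regression named on this slide can be written in a standard textbook form. This is the general objective, not necessarily the exact parameterization used by Way et al., which the slide does not give. The symbols are notation assumed here: $x_i$ is the feature vector for tumor $i$, $y_i \in \{0, 1\}$ its Ras pathway status, $\lambda \ge 0$ the overall penalty strength and $\alpha \in [0, 1]$ the mix between the two penalties.

```latex
\[
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta}
  \; -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big]
  + \lambda \Big[ \alpha \lVert \beta \rVert_1 + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \Big],
\qquad
p_i = \frac{1}{1 + \exp\!\big( -(\beta_0 + x_i^{\top} \beta) \big)}
\]
```

Setting $\alpha = 1$ gives the lasso penalty and $\alpha = 0$ gives ridge. The $\ell_1$ term pushes many coefficients to exactly zero, which is what yields a compact gene signature, while the $\ell_2$ term stabilizes the fit when genes are correlated.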

slide-40
SLIDE 40

Challenges and Limitations:

  • Complex models are difficult to interpret and should only be used in certain circumstances
  • If the goal is to list genomic elements as accurately as possible, then a ‘black box’ approach is fine
  • If the goal is to understand biological mechanisms, a simpler model may be a better fit
  • Models built from specific cell types or experimental conditions can be hard to generalize to other contexts
  • Lack of training sets
  • Unbalanced positive and negative sets
  • Most genomic element classes make up only a small portion of a genome, so the ratio of positive to negative examples is very small
  • This can lead to models that score high on accuracy but are uninformative
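
The last point can be made concrete with a small worked example. The numbers are invented: 10 positive elements among 1,000 genomic windows, and a degenerate "model" that labels every window negative.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary labels where 1 marks a positive."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    recall = true_pos / positives if positives else float("nan")
    return correct / len(y_true), recall

# 1% of windows are true elements (invented ratio).
y_true = [1] * 10 + [0] * 990
always_negative = [0] * len(y_true)

accuracy, recall = accuracy_and_recall(y_true, always_negative)
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

The classifier is right 99% of the time and still finds none of the elements of interest, which is why metrics such as recall or precision are more telling than accuracy on unbalanced sets.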
slide-41
SLIDE 41

Other Omics Data Types

slide-42
SLIDE 42

Lipidomics

  • Developed in 2003 to study metabolism of the lipidome
  • Allows us to quantify changes in individual lipid classes and subclasses that reflect changes in metabolism

Han, X. (2016) Lipidomics for studying metabolism. Nat. Rev. Endocrinol. doi:10.1038/nrendo.2016.98
slide-43
SLIDE 43

Lipidomics Applications

  • Identification of novel lipid classes and molecular species
  • Development of quantitative methods for large-scale lipid analysis
  • Pathway analysis in disease
  • Biomarker identification
  • Tissue mapping of altered lipid distribution in organs
  • Development of bioinformatics approaches for high-throughput processing

slide-44
SLIDE 44

Metabolomics

  • “Systematic study of unique chemical fingerprints that specific cellular processes leave behind”
  • Metabolites represent both the downstream output of the genome and the upstream input from the environment
  • Allows us to explore gene-environment interactions
  • Metabolome responds to nutrients, stress and disease long before the transcriptome or proteome

  • Examples:
  • Metabolic intermediates
  • Hormones
  • Signaling molecules
  • Secondary metabolites
  • Antibiotics
  • Drugs
slide-45
SLIDE 45

Metabolomics and Human Health

  • GWAS have been used extensively to search for disease genes, but these studies yielded far fewer disease genes than expected
  • Metabolomics helps to understand cellular metabolism, factoring in environmental factors
  • Studies have shown metabolites to have a central role in disease development

  • Trimethylamine N-oxide and atherosclerosis
  • Cancer and oncometabolites
  • Amino acids and diabetes
slide-46
SLIDE 46

Metabolomics and Drug Discovery

  • Metabolomics offers a more cost-effective and productive route for drug discovery and development
  • For example, if a metabolite or set of metabolites is identified as being causal, then the drug target/pathway is known
  • Drugs to inhibit TMAO production for atherosclerosis
  • Metabolomics has also been used to detect early drug toxicity in trials
  • COMET (the Consortium for Metabonomic Toxicity) brought together 5 pharma companies to create a system for these predictions
  • These methods are still not widely used in pharma, likely because NMR is not sensitive enough, but with better technology their use will likely increase

slide-47
SLIDE 47

Metagenomics

Figure: 16S rRNA Sequencing → Taxonomic Analysis (Who is present?); Next Generation Sequencing → Gene/Pathway Prediction (What are they doing?)

slide-48
SLIDE 48

Questions?