Data Analysis
Kelly R Ruggles, Ph.D.
Assistant Professor, Department of Medicine
NYU Langone Medical Center
www.ruggleslab.org
September 18, 2017
Methods in Quantitative Biology
Columns: Gene Name, Description, Sample 1 to Sample 10 (values in extracted order; empty cells not shown)
plectin isoform 1, NP_958782: 1.10, 2.61, 0.20, 2.77, 0.86, 1.41, 1.19, 1.10
plectin isoform 1g, NP_958785: 1.11, 2.65, 0.22, 2.78, 0.87, 1.41, 1.19, 1.10
plectin isoform 1a, NP_958786: 1.11, 2.65, 0.22, 2.78, 0.87, 1.41, 1.19, 1.10
plectin isoform 1c, NP_000436: 1.11, 2.65, 0.21, 2.80, 0.87, 1.41, 1.19, 1.10
plectin isoform 1e, NP_958781: 1.12, 2.65, 0.22, 2.79, 0.87, 1.41, 1.20, 1.09
plectin isoform 1f, NP_958780: 1.11, 2.65, 0.22, 2.78, 0.87, 1.41, 1.19, 1.10
plectin isoform 1d, NP_958783: 1.11, 2.65, 0.22, 2.78, 0.87, 1.41, 1.19, 1.10
plectin isoform 1b, NP_958784: 1.11, 2.65, 0.22, 2.78, 0.87, 1.41, 1.19, 1.10
epiplakin, NP_112598: 3.91, 2.21, 1.92, 3.20, 1.05
myosin-9, NP_002464: 2.04, 1.59, 1.03, 0.11, 1.25, 0.42, 0.12, 1.15, 1.96
myosin-10 isoform 3, NP_001243024: 2.10, 0.51, 0.23, 1.33, 0.44, 2.83, 1.91
myosin-10 isoform 1, NP_001242941: 2.10, 0.51, 0.23, 1.29, 0.43, 2.81, 1.91
myosin-11 isoform SM1A, NP_002465: 0.69, 2.29
myosin-10 isoform 2, NP_005955: 2.10, 0.51, 0.23, 1.35, 0.43, 2.83, 1.94
myosin-11 isoform SM2B, NP_001035202: 0.67, 2.29
myosin-14 isoform 1, NP_001070654: 0.26, 3.78, 1.56
myosin-14 isoform 2, NP_079005: 0.27, 3.80, 1.58
unconventional myosin-Va isoform 1, NP_000250: 0.92, 0.03, 0.45, 1.27, 1.08
unconventional myosin-Vb, NP_001073936: 1.87, 0.46, 1.25, 0.27
unconventional myosin-Vc, NP_061198: 0.02, 2.07, 1.44, 1.73, 0.07
unconventional myosin-Ic isoform a, NP_001074248: 0.32, 0.09, 0.78, 2.44, 1.04
unconventional myosin-Ic isoform b, NP_001074419: 0.32, 0.09, 0.79, 2.44, 1.05, 0.01
unconventional myosin-Id, NP_056009: 0.97, 1.64, 0.02, 0.85, 1.11, 1.63, 3.59, 0.60
unconventional myosin-Ib isoform 2, NP_036355: 1.53, 2.93, 0.56, 1.26, 0.14, 1.18
77 Human Breast Tumors
Mertins P*, Mani DR*, Ruggles KV*, Gillette M* et al., Nature 534, 55-62 (2016)
Mutation, Copy Number, Gene Expression, DNA methylation, MicroRNA, RPPA, Clinical Data, Proteomics, Phosphoproteomics
Ozenberger KE, et al., Nature Genetics 45, 1113-1120 (2013)
825 Human Breast Tumors
GENOMICS (WGS, WXS, RNA-Seq): Single Nucleotide Polymorphisms (SNPs), Copy Number Alterations (CNA), Novel Splice Junctions, Gene Expression
PROTEOMICS (LC-MS/MS): Global Protein Expression, Phosphoprotein Abundance, Targeted Proteomics
Novel Splice Junctions: splicing of exons, creating new protein isoforms
SNPs (e.g. T vs. C): single base-pair sites that vary in a population
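Phosphoprotein Abundance: signaling
Gene Expression: potential protein quantitation
Targeted Proteomics: absolute quantitation
Global Protein Expression: relative quantitation
Copy Number Alterations (CNA): amplifications or deletions in the genome DNA (>1kb)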
RNAs are converted into cDNA fragment library
Sequence adapters (blue) are added to cDNA fragments
Short sequence reads from each cDNA are obtained
Reads are aligned to reference sequence and classified as exonic reads, junction reads or poly(A) end-reads
Used to generate a base-resolution expression profile for each gene
Tumor Sample, Lysis, Digestion, Fractionation, Peptides
m/z vs. intensity
Identity, Quantity
Tandem Mass Spectrometry
Discovery Proteomics:
expression (whole cell proteome)
phosphopeptides to measure phosphorylation status
Targeted Proteomics:
representative peptides of these proteins to measure prior to run
Clean Transform Visualize Model Communicate
Modified from R for Data Science, Wickham & Grolemund
Clean
for downstream analysis
don’t match (ex: Ensembl vs. RefSeq)
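Example 1 (rows: gene symbol|Entrez ID; values in extracted order, empty cells not shown)
Samples: TCGA-A2-A0CM-01A-31R-A034-07, TCGA-A2-A0D0-01A-11R-A00Z-07, TCGA-A2-A0D1-01A-11R-A034-07
UBC|7316: 0.052, 0.360
GUCY2D|3000: 3.337
C11orf95|65998: 0.405, 0.446, 1.011
C17orf81|23587: 0.273
ANKMY2|57037:
TTC36|143941:
Example 2 (rows: RefSeq protein accession; values in extracted order, empty cells not shown)
Samples: AO-A12D.01TCGA, C8-A131.01TCGA, AO-A12B.01TCGA
NP_958782: 1.10, 2.61
NP_958785: 1.11, 2.65
NP_958786: 1.11, 2.65
NP_000436: 1.11, 2.65
NP_958781: 1.12, 2.65
NP_958780: 1.11, 2.65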
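The two example tables above label the same kind of thing in different ways: one uses full TCGA barcodes and `symbol|Entrez ID` row names, the other uses short sample labels and RefSeq accessions. A minimal sketch of harmonizing the labels before merging is shown below. The helper names are hypothetical, and the rule that the second and third barcode fields correspond to the short label is an assumption based only on the label shapes shown on the slide.

```python
def patient_key(sample_id):
    """Reduce a sample label to a short key such as 'A2-A0CM'.

    Assumption: labels follow one of the two styles on the slide,
    'TCGA-A2-A0CM-01A-...' or 'AO-A12D.01TCGA'.
    """
    label = sample_id.strip().upper()
    if label.startswith("TCGA-"):
        fields = label.split("-")
        if len(fields) < 3:
            raise ValueError(f"unexpected barcode: {sample_id!r}")
        return f"{fields[1]}-{fields[2]}"
    # Short style: keep the part before the first dot.
    return label.split(".", 1)[0]


def split_gene_label(label):
    """Split 'UBC|7316' into the gene symbol and the integer Entrez ID."""
    symbol, entrez = label.split("|", 1)
    return symbol, int(entrez)


def shared_samples(ids_a, ids_b):
    """Return the sorted keys present in both lists of sample labels."""
    keys_a = {patient_key(s) for s in ids_a}
    keys_b = {patient_key(s) for s in ids_b}
    return sorted(keys_a & keys_b)
```

Mapping gene-level identifiers (for example Ensembl vs. RefSeq) needs a lookup table from an annotation resource and is not covered by this sketch.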
Transform
the data that can be explained by experimental or technical reasons
Class related: e.g. Normal vs. disease Nyamundanda, 2017 Goh, 2017
different scales to a common scale
probability distribution of adjusted values into alignment
Raw Data
Normalized: mean=0, std=1
mean or median for each sample
should be similar
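The normalization notes above mention rescaling to mean=0, std=1 and using the mean or median of each sample. A minimal sketch of both operations for a single sample follows. Treating missing values as `None` and using the population standard deviation are assumptions made here, not choices stated on the slides.

```python
from statistics import mean, median, pstdev


def zscore(values):
    """Rescale one sample so its observed values have mean 0 and std 1."""
    observed = [v for v in values if v is not None]
    mu = mean(observed)
    sd = pstdev(observed)  # population std; an assumption of this sketch
    if sd == 0:
        raise ValueError("standard deviation is zero; cannot scale")
    return [None if v is None else (v - mu) / sd for v in values]


def median_center(values):
    """Shift one sample so the median of its observed values is 0."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    return [None if v is None else v - med for v in values]
```

Applied to every sample in a table, either function brings the samples onto a common scale so that their distributions can be compared.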
[Figure: density of log2 iTRAQ ratios (tumor / reference) in the raw proteome and raw phosphoproteome. x-axis: ratio; y-axis: density (number of proteins). Bimodal samples are marked in both panels.]
tumors have bimodal distribution of both proteins and phosphopeptides with lower o… abundance
a processing or technical artifact
specific to subtype, PAM50 status or histology
Normal: 54 (total 75)
Bimodal: 26 (total 30)
Legend: Bimodal, Normal
present that case is discarded
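The line above describes discarding a case when a missing value is present, which is usually called complete-case analysis. A minimal sketch follows, assuming a table stored as a dictionary of rows with `None` marking missing values. Both the representation and the function name are hypothetical.

```python
def complete_cases(table):
    """Keep only the rows that have no missing (None) values."""
    return {
        row: values
        for row, values in table.items()
        if all(v is not None for v in values)
    }


# Hypothetical values, for illustration only.
example = {
    "protein_1": [1.10, 2.61, 0.20],
    "protein_2": [3.91, None, 2.21],
}
kept = complete_cases(example)  # only 'protein_1' remains
```

This is simple but can discard a large share of a proteomics table, which is why the imputation options are listed next.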
1. Non-informative Imputation
a non-informative distribution.
2. Low rank matrix completion
processing; a regularized SVD decomposition. R-package: ‘softImpute’.
3. Prediction based imputation
‘xgboost’.
4. Machine-learning based imputation
package: ‘missForest’.
through an iterative procedure.
Perseus.c (center) / .t (tail) Prediction based imputation
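As an illustration of the first option in the list above, the sketch below fills missing values by drawing from a normal distribution that is shifted toward the low tail of the observed values. The `shift=1.8` and `width=0.3` settings are the values commonly quoted for this style of imputation in Perseus; they are an assumption here, not parameters taken from the slides, and the function name is hypothetical.

```python
import random
from statistics import mean, stdev


def impute_downshifted(values, shift=1.8, width=0.3, seed=0):
    """Replace None with draws from a narrow, down-shifted normal.

    The draws are centred at (mean - shift * sd) of the observed values
    and have standard deviation (width * sd).
    """
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed values")
    mu, sd = mean(observed), stdev(observed)
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        v if v is not None else rng.gauss(mu - shift * sd, width * sd)
        for v in values
    ]
```

The design assumes values are missing because they fall below the detection limit. When that is not the case, the matrix-completion or prediction-based options in the list are likely the better fit.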
Visualize and Model
detectable at protein level?
phosphorylation/protein/gene expression status to predict effective drug combinations for treatment?
biomarker development?
Ruggles et al., MCP 16(6), 959-981 (2017)
Exons Introns
https://genome.ucsc.edu/FAQ/FAQreleases.html#release1 Human Reference Mouse Reference
Swiss-Prot UniProt RefSeq NP
TrEMBL RefSeq XP, ZP
*Automated annotation through pattern matching of protein to DNA + known protein coding genes
Ensembl UCSC
Ruggles & Fenyo, 2015
nonlethal mutations have arisen in all human genes
evolution and inheritance across thousands of humans
nucleotide per 1000
they are called polymorphisms
Ruggles et al., 2017
[Figure: SNV peptide and reference peptide shown with gene annotation; IGV, visualization, modeling. Ruggles & Fenyo, 2015; Ruggles & Fenyo, 2016]
[Figure: novel splice peptide shown with gene annotation; IGV, visualization, modeling. Ruggles & Fenyo, 2015; Ruggles & Fenyo, 2016]
Ruggles et al., MCP 16(6), 959-981 (2017)
[Figure: correlations among RNA, protein and phosphoprotein levels; r2=0.4698, r2=0.2577, r2=0.3718]
Gene list: ICA1, Uncharacterized (RP11-595B24.2), POM121, NQO2, NT5DC4, PLEKHS1, DMRTB1, CRIP2, TTC38, KAT2A, SLC35A5, PIH1D2, GUCD1, RNASE12, PLA2G2A, CEP290, SNAPC4, SAFB, IGVK1-6, Uncharacterized (RP11-293M10.1)
48.4%, 22.0%, 10.4%, 18.7%
Gene list: EDA2R, GSTA4, FAM106A, MYBPC3, AC105036.3, C5orf44, LRP5, SETDB1, PLA2R1, LAMC1, PDCD1LG2, AC091435.1, TMEM56, RNF138, PSMD14, UBD, HSD17B14, C14orf166, COPE, HSF1
54.4%, 21.8%, 6.5%, 17.2%
both in ‘cis’ and ‘trans’ genes
correlation coefficient
Mertins et al., 2016
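The slides above report r2 values and refer to a correlation coefficient between data types. A minimal sketch of computing such a coefficient for one gene across samples follows. The slides do not say whether Pearson or Spearman correlation was used, so Pearson is an assumption here, and pairs with a missing value (`None`) are skipped.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length lists, skipping None pairs."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("one variable is constant; correlation undefined")
    return cov / sqrt(var_x * var_y)


def r_squared(x, y):
    """Square of the Pearson coefficient, the form quoted as r2 above."""
    return pearson_r(x, y) ** 2
```

Running this for a gene's own copy number against its own RNA or protein level gives a 'cis' correlation; running it against the levels of other genes gives 'trans' correlations.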
CNA RNA Phospho Protein
Outlier Status
Kinase Outliers
Black Sheep
Subtype enrichment
Druggable Drivers
1. Used log2 normalized data for 668 kinases from all 77 TCGA breast samples
2. Found distribution for each phosphosite across samples
3. Flagged samples with normalized phosphosite expression above 1.5 interquartile ranges (IQR) from the median
4. Repeated for CNA, RNA and protein expression
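Step 3 of the outlier procedure above can be sketched for a single phosphosite as follows. The quartile method (`statistics.quantiles` with its default settings) and the dictionary input are assumptions of this sketch; the published analysis may compute quartiles differently.

```python
from statistics import median, quantiles


def flag_outliers(values_by_sample, n_iqr=1.5):
    """Return the samples whose value exceeds median + n_iqr * IQR.

    values_by_sample maps a sample name to a normalized value;
    samples with a missing value (None) are ignored.
    """
    observed = {s: v for s, v in values_by_sample.items() if v is not None}
    values = list(observed.values())
    if len(values) < 2:
        raise ValueError("need at least two observed values")
    q1, _, q3 = quantiles(values, n=4)
    threshold = median(values) + n_iqr * (q3 - q1)
    return {s for s, v in observed.items() if v > threshold}


# Hypothetical values, for illustration only.
site = {"s1": 0.0, "s2": 0.1, "s3": -0.1, "s4": 0.2, "s5": -0.2, "s6": 5.0}
flagged = flag_outliers(site)  # {'s6'}
```

Only high values are flagged, matching the "above" wording in step 3. The same function can be reused on CNA, RNA and protein values for step 4.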
Which phosphosite outlier kinases are enriched in the 4 represented subtypes?
Mertins P*, Mani DR*, Ruggles KV*, Gillette M* et al., Nature 534, 55-62 (2016)
Niu*, Scott*, Sengupta* et al., Nature Genetics (2016)
Sequence variants and drug binding are mapped to protein structure
Pairwise correlations used to determine the impact of variants on drug response
Validate the impact of these variants in disease models
Things that are in close proximity in protein structure
Intra-molecular Clusters
Inter-molecular Clusters
Mutations clustering around Drug binding pockets
[Figure, panel a: data types mapped to the genome (PGx): Whole Genome Sequencing, Copy Number Variation (per 10kb); RNA-Seq (PolyA, Ribo0), exon expression; Global MS/MS (22); Phospho MS/MS]
[Figure, panel b: Chromosome 1, Basal/Luminal; tracks: CNV, RNA-Seq, Peptides, Phospho; genes: LDLRAP1, ARID1A, JAK1, NRAS, HMGCS2]
[Figure, panel c: LSP1 (Chromosome 11), SERPINB5 (Chromosome 18), NCAN (Chromosome 19); tracks: Exons, RNA, Peptide. Legend, Peptide and Exon Expression Ratio: Increased (> 2), Decreased (< -2), Unchanged (between -2 and 2), Not unique to gene]
[Figure, panel d: PARP10 (Chromosome 8), FBP1 (Chromosome 9), INPP4B (Chromosome 4); tracks: Exons, RNA, Peptide]
Ruggles et al., MCP 16(6), 959-981 (2017)
matrices
between disease phenotypes and/or to identify potentially valuable disease biomarkers
Ruggles et al. (2017) Methods, tools and perspectives in proteogenomics. MCP.
that optimizes treatment to maximize efficacy and minimize risk based on genetic make-up
individual variability in drug response and toxicity.
drug metabolism differences
corresponding to a therapeutic effect
proteome and phosphoproteome data to be more informative regarding clinical outcomes compared to NGS data, as these data modalities are more proximal to the disease.
proteomics data to
1. Classify clinically-relevant disease subtypes in cancer
2. Define prognosis
3. Identify biomarkers predicting drug sensitivity
shotgun proteomics
segregating power
TBC1D4) to accurately classify Large B-Cell lymphoma patients, which are usually morphologically indistinguishable
Deeb, et al. (2015) Mol. Cell. Proteomics MCP. 14, 2947–2960
Ma, S et al. (2016) AMIA Summits Transl. Sci. Proc. 2016, 52–59
approaches to predict clinical phenotypes using
Protein Array (RPPA)
modalities compared to individual platform analysis
predictive than RPPA-based proteomics
Ray, B., et al. (2014) Sci. Rep. 4, 4411
breast tumors to predict 10 year survival in breast cancer
improve model performance
transcriptomics
Ma, S et al. (2016) AMIA Summits Transl. Sci. Proc. 2016, 52–59
Forest approach to identify molecular features associated with drug response of 90 drugs in 70 breast cancer cell lines.
expression, promoter methylation and protein expression
prediction but other data types improved the prediction in a subset of cases
Daemen et al. (2013) Genome Biol. 14, R110
Causal Discovery
[Figure: Kinase, Protein A and Protein B with phosphorylation marks (P); labels: increase in phosphorylation, increase in expression, increase in expression]
phosphoproteomic data to better understand cancer signaling, discover novel drug targets and subtype based on pathway activity.
Classic Causal Discovery Example: Stained Fingers, Smoking, Lung Cancer
Markov Blanket
immediate surroundings
known aberrant proteins
without the complication of distant causes and effects
[Figure: graph with nodes A, B, C, D, E, F, G and target T]
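The Markov blanket of a target node in a directed acyclic graph is the set of its parents, its children, and the other parents of its children. A minimal sketch follows. The edges of the A to G graph are hypothetical because the figure's edges are not recoverable from the slide text, and the smoking graph uses the usual reading of the classic example (smoking causes both stained fingers and lung cancer).

```python
def markov_blanket(parents_of, target):
    """Return parents, children and co-parents of target in a DAG.

    parents_of maps each node to the list of its direct causes.
    """
    parents = set(parents_of.get(target, ()))
    children = {node for node, ps in parents_of.items() if target in ps}
    co_parents = {p for child in children for p in parents_of[child]}
    return (parents | children | co_parents) - {target}


# Hypothetical edges for the nodes named on the slide.
graph = {
    "A": [], "B": [], "C": [], "E": [],
    "T": ["A", "B"],
    "D": ["T", "E"],
    "F": ["D"],
    "G": ["C"],
}
blanket = markov_blanket(graph, "T")  # {'A', 'B', 'D', 'E'}

smoking = {
    "Smoking": [],
    "Stained Fingers": ["Smoking"],
    "Lung Cancer": ["Smoking"],
}
cancer_blanket = markov_blanket(smoking, "Lung Cancer")  # {'Smoking'}
```

In the hypothetical graph, F, G and C fall outside the blanket of T, which illustrates the point about leaving out distant causes and effects.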
Causal Discovery and Cancer Signaling
collection?
mutations to signalling in breast cancer. Nature 534 (2016) 55-62.
Molecular Interactions Using Stochastic Modelling and Super-Resolution Microscopy, bioRxiv (2017)
Graph from Electronic Medical Records. Sci Rep. 7 (2017) 5994