Statistical modeling in molecular medicine: proteomics
Anna Gambin Institute of Informatics, University of Warsaw
outline
mass spec basics
modeling isotopic distribution
modeling exopeptidase activity
incorporating MEROPS data
Proteins
Center for Proteomics, Antwerp, Belgium
Identifying proteins is complicated:
there are plenty of proteins in a sample
proteins are frequently fragmented
even a single protein has a complicated signal
Chemical compounds are made of different isotopes
i.e. some isotopic variants are more probable than others
Assume:
1) variants of isotopes of atoms are independent
2) elements vary in abundances of isotopes
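Under the two assumptions above, the probability of an isotopic variant factorizes into one multinomial term per element. A minimal sketch of that computation follows; the abundance table holds approximate natural abundances entered here for illustration, and the function names are hypothetical.

```python
from math import factorial, prod

# Approximate natural isotope abundances (illustrative values, an assumption).
ABUNDANCES = {
    "C": (0.9893, 0.0107),             # 12C, 13C
    "H": (0.999885, 0.000115),         # 1H, 2H
    "O": (0.99757, 0.00038, 0.00205),  # 16O, 17O, 18O
}

def multinomial_pmf(counts, probs):
    """Probability of observing the given isotope counts for one element."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    return coef * prod(p ** c for p, c in zip(probs, counts))

def variant_probability(variant):
    """Probability of an isotopic variant, assuming independent atoms.

    `variant` maps an element symbol to a tuple of isotope counts.
    """
    return prod(multinomial_pmf(counts, ABUNDANCES[el])
                for el, counts in variant.items())

# Water, H2O: the monoisotopic variant and the variant with one deuterium.
mono = variant_probability({"H": (2, 0), "O": (1, 0, 0)})
one_d = variant_probability({"H": (1, 1), "O": (1, 0, 0)})
```

Here `mono` is much larger than `one_d`, which illustrates why some variants dominate the signal.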
How much do we gain by considering the smallest set of variants with a fixed probability?
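Number of all isotopic variants, where n_e is the number of atoms of element e and i_e the number of its isotopes:
\[ \propto \prod_{e \in \mathrm{Elements}} n_e^{\,i_e - 1} \]
Size of the smallest set with a fixed probability:
\[ \approx C_{\mathrm{lattice}} \Big( \prod_{e \in \mathrm{Elements}} n_e^{\frac{i_e - 1}{2}} \sqrt{\det \Delta_e} \Big)\, q_{\chi^2(k)}^{k}\, \frac{\pi^{k/2}}{\Gamma(k/2 + 1)} \;\propto\; \prod_{e \in \mathrm{Elements}} n_e^{\frac{i_e - 1}{2}} \]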
To get the smallest set with probability P:
1) Find the most probable variant.
2) While Total Probability < P: get the next layer of variants v such that p > P(v) >= q·p, where p = P(v_min of the previous layer).
3) Trim the least probable variants from the last layer so that Total Probability >= P still holds.
Result: the smallest set with the current Total Probability.
Monotonic Expansion Property:
For each variant v, the set {w : P(w) >= P(v)} is adjacent to v.
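The property above suggests that variants can be visited in order of decreasing probability by expanding outward from the most probable one. The sketch below illustrates this for a single element with a priority queue; it is a simplified best-first variant, not the layered algorithm from the slides, and the function names are hypothetical. Neighbours are taken to be configurations that differ by swapping one atom for another isotope, which is an assumption about what "adjacent" means.

```python
import heapq
from math import lgamma, log, exp

def log_prob(counts, probs):
    """Log-probability of a multinomial configuration of isotope counts."""
    out = lgamma(sum(counts) + 1)
    for c, p in zip(counts, probs):
        out += c * log(p) - lgamma(c + 1)
    return out

def neighbours(counts):
    """Configurations obtained by swapping one atom for another isotope."""
    for i in range(len(counts)):
        if counts[i] == 0:
            continue
        for j in range(len(counts)):
            if i != j:
                new = list(counts)
                new[i] -= 1
                new[j] += 1
                yield tuple(new)

def smallest_set(n_atoms, probs, P):
    """Most probable configurations until their total probability reaches P."""
    # Hill-climb to the mode, starting with all atoms in the most abundant isotope.
    start = [0] * len(probs)
    start[max(range(len(probs)), key=probs.__getitem__)] = n_atoms
    mode = tuple(start)
    while True:
        best = max(neighbours(mode), key=lambda c: log_prob(c, probs), default=mode)
        if log_prob(best, probs) <= log_prob(mode, probs):
            break
        mode = best
    # Best-first expansion from the mode using a max-heap (negated keys).
    heap, seen = [(-log_prob(mode, probs), mode)], {mode}
    chosen, total = [], 0.0
    while heap and total < P:
        neg_lp, conf = heapq.heappop(heap)
        chosen.append((conf, exp(-neg_lp)))
        total += exp(-neg_lp)
        for nb in neighbours(conf):
            if nb not in seen:
                seen.add(nb)
                heapq.heappush(heap, (-log_prob(nb, probs), nb))
    return chosen, total

# 100 carbon atoms with approximate abundances of 12C and 13C.
chosen, total = smallest_set(100, (0.9893, 0.0107), 0.99)
```

The loop stops as soon as the accumulated probability reaches P, so the last configuration added is the least probable one in the returned set.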
multinomial distribution
queue for storing subsequent layers
complexity
We provide theoretical background and get better run times
LC-MS/MS
cancer patients and healthy donors
interpretation and retention time alignment
exoprotease activities contribute to cancer type–specific serum peptidome degradation
model estimated from LC-MS/MS data
Villanueva, J., Nazarian, A., Lawlor, K., et al. 2008. A sequence-specific exopeptidase activity test (SSEAT) for “functional” biomarker discovery. Mol. Cell. Proteomics 7, 509–518.
[Figure: cleavage graph with source node ⋆, sink node †, and peptide nodes FT, FTS, TS, FTSS, TSS, SS, FTSST, TSST, SST, ST, FTSSTS, TSSTS, SSTS, STS, SSTSY, STSY, TSY, SY.]
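\[
Q(x, x') =
\begin{cases}
a_{\star i} & \text{if } x'_i = x_i + 1,\ x'_{-i} = x_{-i} \text{ for some } i \quad \text{(create)},\\
a_{r(i,j)}\, x_i & \text{if } x'_j = x_j + 1,\ x'_i = x_i - 1 \text{ and } x'_{-ij} = x_{-ij} \text{ for some } i \to j \quad \text{(move)},\\
a_{i\dagger}\, x_i & \text{if } x'_i = x_i - 1,\ x'_{-i} = x_{-i} \text{ for some } i \quad \text{(annihilate/degrade)},
\end{cases}
\]
\[
Q(x, x) = -\sum_{i} a_{\star i} - \sum_{i \to j} a_{r(i,j)}\, x_i - \sum_{i} a_{i\dagger}\, x_i .
\]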
Transition intensities for the Markov process describing the flow of particles through the graph, i.e. the process of peptidome degradation.
Proposition 1 (Equilibrium distribution). The process (X(t)) has the equilibrium (stationary) distribution given by:
\[ \pi(x) = \prod_{i \in V} e^{-\lambda_i} \frac{\lambda_i^{x_i}}{x_i!}, \]
where the configuration of intensities (\lambda_i)_{i \in V} is the unique solution to the following system of “balance” equations:
\[ \sum_{k \to i} \lambda_k\, a_{r(k,i)} + a_{\star i} = \lambda_i \Big( \sum_{i \to j} a_{r(i,j)} + a_{i\dagger} \Big) \quad \text{for every } i \in V. \]
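Because peptides only get shorter, the cleavage graph is acyclic, so the balance equations can be solved one peptide at a time, parents first. The following sketch assumes that structure; all numeric intensities and the function names are hypothetical and serve only to show the mechanics on a four-peptide fragment of the graph.

```python
from math import exp, factorial

def equilibrium_intensities(order, create, move, degrade):
    """Solve the balance equations on an acyclic cleavage graph.

    `order` lists peptides so that every peptide comes after its parents.
    """
    lam = {}
    for i in order:
        inflow = create.get(i, 0.0) + sum(
            lam[k] * a for (k, j), a in move.items() if j == i)
        outflow = degrade.get(i, 0.0) + sum(
            a for (k, j), a in move.items() if k == i)
        lam[i] = inflow / outflow
    return lam

def stationary_probability(x, lam):
    """pi(x) as a product of independent Poisson probabilities.

    `x` maps every peptide in the graph to its particle count.
    """
    out = 1.0
    for i, count in x.items():
        out *= exp(-lam[i]) * lam[i] ** count / factorial(count)
    return out

# Hypothetical intensities: FTSS -> FTS, FTSS -> TSS, FTS -> FT.
create = {"FTSS": 2.0}
move = {("FTSS", "FTS"): 1.0, ("FTSS", "TSS"): 1.0, ("FTS", "FT"): 0.5}
degrade = {"FTS": 0.5, "TSS": 1.0, "FT": 1.0}
lam = equilibrium_intensities(["FTSS", "FTS", "TSS", "FT"], create, move, degrade)
empty = stationary_probability({p: 0 for p in lam}, lam)
```

With these made-up rates the solution is lambda = 1 for FTSS, FTS and TSS, and 0.5 for FT; `empty` is the stationary probability that no peptide is present.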
Hyperparameters (B_r)_{r∈R}: (b_r)_{r∈R} ∼ Dir((B_r)_{r∈R})
Hyperparameters (B_{⋆i})_{i∈V_in}: (b_{⋆i})_{i∈V_in} ∼ Dir((B_{⋆i})_{i∈V_in})
Hyperparameters S_shape, S_rate: s ∼ Gamma(S_shape, S_rate)
λ_i = λ_i(s, b_⋆, b) for i ∈ V
Hyperparameter q: ε_i ∼ Bern(q) for i ∈ V
x_i ∼ Poiss(λ_i) for i: ε_i = 1
Hyperparameter τ: y_i ∼ LogNormal(x_i, τ) for i: ε_i = 1
y_i ∼ Background for i: ε_i = 0
Hierarchical Bayesian model that accounts for missing readings and errors; Metropolis-Hastings is used to sample from the posterior.
NON TRIVIAL TASK: filling the cleavage graph with real data
For each peptide (e.g. FTSS): calculate mass, charge, retention time (random forests).
Quite often: missing reads and errors!
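The mass step for a peptide such as FTSS can be sketched as follows. The residue masses are standard monoisotopic values rounded to five decimals; the function names are hypothetical, and the m/z formula assumes plain protonation.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {"F": 147.06841, "T": 101.04768, "S": 87.03203, "Y": 163.06333}
WATER = 18.01056    # mass of H2O added for the free termini
PROTON = 1.007276   # mass of a proton

def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass of a peptide given as a one-letter string."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def mz(peptide, charge):
    """m/z of the peptide carrying `charge` protons (assumes protonation only)."""
    return (monoisotopic_mass(peptide) + charge * PROTON) / charge

mass_ftss = monoisotopic_mass("FTSS")
mz_ftss_1 = mz("FTSS", 1)
```

Only the four residues that occur in the example graph are included; a full table would cover all amino acids.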
patients and 20 healthy donors,
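[Figure: cleavage graph with enzymes and affinity coefficients ρ.]
u = MSFT†LTN†K
u -> v + w: thermolysin, ρ^ther_vw, with v = MSFT, w = LTNK
u -> x + y: pepsin, ρ^peps_xy, with x = MSFT†L†TN, y = K
x -> v + z: thermolysin, ρ^ther_vz, with z = LTN
x -> s + t: chymotrypsin, ρ^chem_st, with s = MSFTL, t = TN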
MUCH SMALLER cleavage graphs !
[Figure: heatmap of enzymes (rows) against data set no. (columns), with color key and histogram (value vs. count).]
Enzymes shown: eupitrilysin, cathepsin B, membrane-type matrix metallopeptidase 3, trypsin 1, cathepsin S, granzyme B (Homo sapiens-type), elastase 1, tripeptidyl-peptidase I, matrix metallopeptidase 20, tryptase alpha, chymase (Homo sapiens-type), myeloblastin, cathepsin G, cathepsin L, calpain 1, membrane-type matrix metallopeptidase 6, ADAMTS5 peptidase, chymotrypsin C, pepsin A, ADAMTS4 peptidase, caspase 1, ADAM17 peptidase, ADAM10 peptidase, cathepsin H, membrane-type matrix metallopeptidase 4, cathepsin K, legumain, aminopeptidase PILS, kallikrein-related peptidase 3, matrix metallopeptidase 3, calpain 2, neprilysin, plasmin.
identified enzymes make sense !
stochastic dynamics in time
A transition x -> x′ occurs if x′ = x − u + v + w and u = v†w.
Denote by ρ_vw the vector of all peptidase affinity coefficients for the cleavage v†w (from MEROPS).
The intensity to perform the cleavage u = v†w is c^T ρ_vw x_u, where c is the vector of peptidase cutting intensities to be estimated.
P(x, t) = P(X(t) = x).
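\[
\frac{\partial}{\partial t} P(x, t)
= \sum_{y \neq x} \big( Q_{yx} P(y, t) - Q_{xy} P(x, t) \big)
= \sum_{u = v \dagger w} c^T \rho_{vw} \big[ (x_u + 1)\, P(x + u - v - w, t) - x_u P(x, t) \big]
= \sum_{u = v \dagger w} c^T \rho_{vw} \big[ \tilde{x}_u P(\tilde{x}, t) - x_u P(x, t) \big],
\]
where \tilde{x} = x + u - v - w.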
Moments are calculated from the CME (chemical master equation).
This is no longer a monomolecular system: we have reactions A -> B and A -> B + C (endopeptidases).
Interesting moments follow from the equation above:
\[ E_q(t) = \sum_{x} x_q P(x, t), \]
\[ \Big( \frac{\partial}{\partial t} E_q(t) = \sum_{u \to q} \lambda_{uq} E_u(t) + \lambda_{qq} E_q(t) \Big)_{q \in V}, \]
\[ E(t) = E(0)^T \exp(\Lambda t). \]
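The closed form E(t) = E(0)^T exp(Λt) can be evaluated numerically for a small system. The sketch below uses a plain scaling-and-squaring Taylor approximation of the matrix exponential so that it needs no external library; the single reaction u -> v + w with rate 1 and the initial counts are hypothetical.

```python
import math

def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_exp(m, squarings=10, terms=12):
    """Matrix exponential by a truncated Taylor series with scaling and squaring."""
    n = len(m)
    scaled = [[v / 2 ** squarings for v in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def mean_counts(e0, rates, t):
    """Row vector E(0) multiplied by exp(rates * t)."""
    scaled = [[v * t for v in row] for row in rates]
    return mat_mul([e0], mat_exp(scaled))[0]

# Hypothetical system with peptides (u, v, w) and one cleavage u -> v + w at rate 1.
LAMBDA = [[-1.0, 1.0, 1.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
means = mean_counts([10.0, 0.0, 0.0], LAMBDA, 1.0)
expected_u = 10.0 * math.exp(-1.0)  # closed form for the decaying parent
```

For this toy system the parent decays exponentially while both products accumulate at the same rate, which the numerical result reproduces.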
[Figure: heatmap of the matrix Λ = (λ_vw)_{v,w∈V} for peptide VAHRFKDLGEEN.]
more fragments: more insight into structure, more confidence in correct identification
some bonds break easily, others do not
ETD
understand fragmentation inside the instrument under different experimental conditions
use purified chemical samples
study fragmentation pathways
locate fragments in data
reaction constants
using atomic compositions of the fragments we generate isotopic spectra with
we can aggregate masses to match data resolution
we take into account charges … and imprecisions in instrumental mass calibration
tolerance intervals around theoretical isotopic envelope
natural data centroiding
[Figure: tolerance intervals T0, T1, T2 on the m/z [Th] axis and the resulting groups G0 to G7.]
Intervals may overlap; using interval trees we build up the deconvolution graph.
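A simplified sketch of building such a graph is shown below. It swaps the interval tree for a sorted list with binary search, treats experimental peaks as single m/z points, and uses hypothetical fragment names and peak positions; fragments that end up sharing an experimental peak land in the same connected component.

```python
from bisect import bisect_left, bisect_right

def deconvolution_graph(theoretical, experimental, tol):
    """Edges (fragment, peak index, experimental index) within the tolerance."""
    order = sorted(range(len(experimental)), key=experimental.__getitem__)
    mzs = [experimental[i] for i in order]
    edges = []
    for frag, peaks in theoretical.items():
        for t, mz in enumerate(peaks):
            lo = bisect_left(mzs, mz - tol)
            hi = bisect_right(mzs, mz + tol)
            edges += [(frag, t, order[k]) for k in range(lo, hi)]
    return edges

def connected_fragments(edges):
    """Group fragments that share at least one experimental peak (union-find)."""
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for frag, _, e in edges:
        parent[find(("F", frag))] = find(("E", e))
    groups = {}
    for node in list(parent):
        if node[0] == "F":
            groups.setdefault(find(node), set()).add(node[1])
    return sorted(sorted(g) for g in groups.values())

# Hypothetical theoretical envelopes and centroided experimental peaks.
theoretical = {
    "FA": [500.00, 500.50, 501.00],
    "FB": [500.52, 501.02, 501.52],
    "FC": [600.00, 600.50],  # no empirical support
}
experimental = [500.01, 500.51, 501.01, 501.51, 700.00]
edges = deconvolution_graph(theoretical, experimental, tol=0.02)
components = connected_fragments(edges)
```

In this example FA and FB compete for the same experimental peaks and form one component, FC has no edges at all, and the peak at 700.00 matches no fragment.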
[Figure: deconvolution graph. Theory: fragments FA and FB with theoretical peaks TA0, TA1, TA2 and TB0, TB1, TB2. Empirical data: groups G0 to G7 along the m/z axis.]
[Figure: connected components of the deconvolution graph, with nodes labelled F, P, E and G.]
a-theoretical peaks (no fragments around)
fragment with no empirical support
fragment with its isotopic envelope
a fragment with empirical support: trivial case (no need for deconvolution)
more a-theoretical peaks
connected components of the deconvolution graph provide a wealth of insight into the spectrum
two fragments with empirical support: suitable for deconvolution
to perform deconvolution we present the problem as a linear programme similar to the max flow problem
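[Figure: flow network linking theory to empirical data. Fragment intensities αA, αB; theoretical peak probabilities pA0, pA1, pA2, pB0, pB1, pB2; flows xA00, xA11, xA12, xA24, xA25, xB02, xB03, xB15, xB16, xB27.]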
rapid neutralization of charge
modifications
ETD: [M + nH]^{n+} -> [M1 + n1H]^{n1+} + [M2 + n2H]^{n2+}
PTR: [M + nH]^{n+} -> [M + (n-1)H]^{(n-1)+}
ETnoD: [M + nH]^{n+} -> [M + nH]^{(n-1)+}
Ion Space Parametrization
(A, p, q)
A: amino acid sequence
p: number of protons quenched by ETnoD
q: charge
Jump Process
1) Draw reaction time
2) Draw reaction type
3) Compute reaction products
4) Put ions into sample
5) Distribute charges
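The steps of the jump process can be sketched as a small stochastic simulation of one precursor ion. Everything numeric here is an assumption made for illustration: the reaction intensities, the choice to make the total intensity proportional to the charge, and the uniform way charges and quenched protons are split between ETD fragments. Ions are triples (A, p, q) as in the parametrization above.

```python
import random

# Hypothetical per-charge reaction intensities (not taken from the slides).
RATES = {"ETD": 1.0, "PTR": 0.3, "ETnoD": 0.2}

def simulate_ion(sequence, charge, t_end, rng):
    """Follow one precursor ion and its fragments until time t_end.

    Returns the list of ions (A, p, q) present in the sample at t_end.
    """
    pending = [(sequence, 0, charge, 0.0)]  # (A, p, q, current time)
    sample = []
    while pending:
        seq, p, q, t = pending.pop()
        kinds = [k for k in RATES if k != "ETD" or len(seq) > 1]
        total = q * sum(RATES[k] for k in kinds)  # assumed proportional to charge
        # Draw reaction time; uncharged ions do not react.
        t_next = t + rng.expovariate(total) if total > 0 else float("inf")
        if t_next >= t_end:
            sample.append((seq, p, q))  # put the ion into the sample
            continue
        # Draw reaction type.
        kind = rng.choices(kinds, weights=[RATES[k] for k in kinds])[0]
        if kind == "PTR":      # loses a proton
            pending.append((seq, p, q - 1, t_next))
        elif kind == "ETnoD":  # captures an electron, stays intact
            pending.append((seq, p + 1, q - 1, t_next))
        else:                  # ETD: cleave a random bond
            cut = rng.randrange(1, len(seq))
            # Distribute the remaining charges and quenched protons.
            q_left = rng.randint(0, q - 1)
            p_left = rng.randint(0, p)
            pending.append((seq[:cut], p_left, q_left, t_next))
            pending.append((seq[cut:], p - p_left, q - 1 - q_left, t_next))
    return sample

sample = simulate_ion("RPKPQQ", 3, 5.0, random.Random(0))
```

Every reaction lowers the total charge by one, so the simulation always terminates; repeating it for many precursor ions would give an empirical distribution over product ions.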
sequence
Stochastic description of a single ion
ODE description of a big population of ions
Sequence  Charge  Electrons  Intensity
RPKPQQ    3                  0.25
RPKP      1       1          0.01
PQQ       1                  0.12
...       ...     ...        ...
Piotr Dittwald, Frederik Lermyte, Dirk Valkenborg, Michal Startek, Frank Sobott, Blazej Miasojedow, Mateusz Łącki, Michał Ciach