[PPT] - Statistical modeling in molecular medicine: proteomics Anna Gambin PowerPoint Presentation

SLIDE 1

Statistical modeling in molecular medicine: proteomics

Anna Gambin Institute of Informatics, University of Warsaw

SLIDE 2

Outline
mass spec basics
modeling isotopic distribution
modeling exopeptidase activity
incorporating MEROPS data
peptidase activity in time
modeling electron transfer dissociation
deconvolution of spectra
modeling fragmentation

SLIDE 3

Mass Spectrometry

Proteins

data source:

Center for Proteomics, Antwerp, Belgium

SLIDE 4

SLIDE 5

Identifying proteins is complicated:
there are plenty of proteins in a sample
proteins are frequently fragmented
even a single protein has a complicated signal

SLIDE 6

Chemical compounds are made of different isotopes

isotopic envelope

SLIDE 7

$\mathrm{C}_c \mathrm{H}_h \mathrm{N}_n \mathrm{O}_o \mathrm{S}_s$

$n_e$, $i_e$

huge number of isotopologues
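To get a feeling for "huge", here is a minimal sketch that counts isotopologues. It assumes that $n_e$ is the number of atoms of element $e$ and $i_e$ the number of its stable isotopes (the slide does not spell this out), in which case the count per element is the number of multisets, $\binom{n_e + i_e - 1}{i_e - 1}$. The function name and the dictionary layout are illustrative only.

```python
from math import comb

# Number of stable isotopes per element (C: 12,13; H: 1,2; N: 14,15;
# O: 16,17,18; S: 32,33,34,36).
ISOTOPE_COUNTS = {"C": 2, "H": 2, "N": 2, "O": 3, "S": 4}


def count_isotopologues(formula):
    """Count isotopologues of a molecule given as {element: number of atoms}.

    Assumption: n_e atoms of an element with i_e isotopes can be arranged in
    C(n_e + i_e - 1, i_e - 1) ways (multisets); elements are independent.
    """
    total = 1
    for element, n_atoms in formula.items():
        i_e = ISOTOPE_COUNTS[element]
        total *= comb(n_atoms + i_e - 1, i_e - 1)
    return total


water = {"H": 2, "O": 1}
print(count_isotopologues(water))  # 3 choices for H2 times 3 for O = 9

# A peptide-sized formula (hypothetical example) already gives a large number.
print(count_isotopologues({"C": 254, "H": 377, "N": 65, "O": 75, "S": 6}))
```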

SLIDE 8

important observation

some isotopic variants are more probable than others P( ) =

SLIDE 9

Assume
1) variants of isotopes of atoms are independent
2) elements vary in abundances of isotopes
P( ) =
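Under these two assumptions the probability of one isotopic variant factorizes into one multinomial term per element. The sketch below illustrates this; the abundance values are approximate natural abundances, and the function names and data layout are assumptions made for this example only.

```python
from math import factorial, prod

# Approximate natural isotope abundances (assumed values for illustration).
ABUNDANCES = {
    "C": (0.9893, 0.0107),            # 12C, 13C
    "O": (0.99757, 0.00038, 0.00205)  # 16O, 17O, 18O
}


def multinomial_pmf(counts, probs):
    """Probability of observing `counts` isotopes of one element."""
    coeff = factorial(sum(counts))
    for c in counts:
        coeff //= factorial(c)  # stays an integer at every step
    return coeff * prod(p ** c for p, c in zip(probs, counts))


def isotopologue_probability(config, abundances=ABUNDANCES):
    """config maps element -> tuple of isotope counts; elements are independent."""
    return prod(multinomial_pmf(config[e], abundances[e]) for e in config)


# Monoisotopic CO2: one 12C and two 16O.
print(isotopologue_probability({"C": (1, 0), "O": (2, 0, 0)}))
```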

SLIDE 10

$o_0 + o_1 + o_2 = 200$

SLIDE 11

How much do we gain by considering the smallest set with a fixed probability?

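size of the smallest set:

$$\approx C_{\text{lattice}} \Big( \prod_{e \in \text{Elements}} n_e^{\frac{i_e - 1}{2}} \sqrt{\det \Delta_e} \Big)\, q_{\chi^2(k)}^{k}\, \frac{\pi^{k/2}}{\Gamma(k/2 + 1)} \;\propto\; \prod_{e \in \text{Elements}} n_e^{\frac{i_e - 1}{2}}$$

compared with all configurations:

$$\prod_{e \in \text{Elements}} n_e^{\,i_e - 1}$$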

SLIDE 12

To get the smallest set with probability P:

Find the most probable variant
while Total Probability < P:
    Get the next layer of variants v so that p > P(v) >= qp, where p = P(v_min of the previous layer)
Trim the least probable variants from the last layer so that Total Probability >= P
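The layered procedure above can be sketched on a small, explicitly enumerated list of variant probabilities. This is only an illustration under simplifying assumptions: the real setting explores variants without enumerating them all, and here a plain sort of the last layer stands in for the trimming step. The function name `smallest_set` and the default `q` are hypothetical.

```python
def smallest_set(probs, target, q=0.5):
    """Indices of a smallest subset of `probs` whose total is >= target.

    Variants are collected in layers p > P(v) >= q*p, then the last layer
    is trimmed from its least probable end.
    """
    if not 0 < target <= sum(probs):
        raise ValueError("target must lie in (0, sum(probs)]")
    start = max(range(len(probs)), key=probs.__getitem__)  # most probable variant
    used, total = {start}, probs[start]
    p = probs[start]          # P(v_min) of the previous layer
    layer = [start]
    while total < target and len(used) < len(probs):
        candidates = [i for i in range(len(probs))
                      if i not in used and probs[i] >= q * p]
        if not candidates:    # gap in the probabilities: lower the threshold
            p *= q
            continue
        layer = candidates
        used.update(layer)
        total += sum(probs[i] for i in layer)
        p = min(probs[i] for i in layer)
    # Trim: drop least probable variants of the last layer while the total
    # stays >= target (sorting here is a stand-in for a selection algorithm).
    for i in sorted(layer, key=probs.__getitem__):
        if total - probs[i] < target:
            break
        total -= probs[i]
        used.discard(i)
    return sorted(used, key=probs.__getitem__, reverse=True)


probs = [0.05, 0.4, 0.1, 0.3, 0.15]
print(smallest_set(probs, 0.8))          # three most probable variants
print(smallest_set(probs, 0.8, q=0.1))   # same set, reached after trimming
```

A small `q` makes layers coarse, so more work lands in the trimming step; a `q` close to 1 makes many thin layers.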

SLIDE 13

Smallest set with current Total Probability

Monotonic Expansion Property:

For each v, the set {W : P(W) >= P(v)} is adjacent to v

multinomial distribution

SLIDE 14

SLIDE 15

Our OPTIMAL implementation uses

a queue for storing subsequent layers

a version of quick select for trimming

other tricks

complexity: O(n) in the total number of configurations

SLIDE 16

We provide theoretical background and get better run times

SLIDE 17

LC-MS/MS data for colorectal cancer patients and healthy donors

ca. 1000 peptides
preprocessing: spectra interpretation and retention time alignment

proteolytic fragmentation

SLIDE 18

Exopeptidase activity

motivation: differential exoprotease activities contribute to cancer type-specific serum peptidome degradation

our goal: first formal model estimated from LC-MS/MS data

Villanueva, J., Nazarian, A., Lawlor, K., et al. 2008. A sequence-specific exopeptidase activity test (SSEAT) for “functional” biomarker discovery. Mol. Cell. Proteomics 7, 509–518.

SLIDE 19

Cleavage graph

[Figure: cleavage graph with a creation node ⋆, a degradation node †, and peptide nodes FT, FTS, TS, FTSS, TSS, SS, FTSST, TSST, SST, ST, FTSSTS, TSSTS, SSTS, STS, SSTSY, STSY, TSY, SY]

$$Q(x, x') = \begin{cases}
a_{\star i} & \text{(create) if } x'_i = x_i + 1,\ x'_{-i} = x_{-i} \text{ for some } i,\\
a_{r(i,j)}\, x_i & \text{(move) if } x'_j = x_j + 1,\ x'_i = x_i - 1, \text{ and } x'_{-ij} = x_{-ij} \text{ for some } i \to j,\\
a_{i\dagger}\, x_i & \text{(annihilate/degrade) if } x'_i = x_i - 1,\ x'_{-i} = x_{-i} \text{ for some } i.
\end{cases}$$

transition intensities for the Markov process describing the flow of particles through the graph, i.e. the process of peptidome degradation

SLIDE 20

Proposition 1 (Equilibrium distribution). The process $(X(t))$ has the equilibrium (stationary) distribution given by:

$$\pi(x) = \prod_{i \in V} e^{-\lambda_i} \frac{\lambda_i^{x_i}}{x_i!},$$

where the configuration of intensities $(\lambda_i)_{i \in V}$ is the unique solution to the following system of “balance” equations:

$$\sum_{k \to i} \lambda_k a_{r(k,i)} + a_{\star i} = \lambda_i \Big( \sum_{i \to j} a_{r(i,j)} + a_{i\dagger} \Big) \quad \text{for every } i \in V.$$

in equilibrium

Old as the hills, but…
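Because peptides only get shorter, the cleavage graph has no cycles, so the balance equations of Proposition 1 can be solved node by node in topological order. The following is a minimal sketch under that assumption; the function name, the dictionary layout and all rate values are hypothetical.

```python
def equilibrium_intensities(nodes, inflow, move, degrade):
    """Solve the balance equations on an acyclic cleavage graph.

    nodes   : peptides in topological order (every parent before its children)
    inflow  : dict i -> creation rate a_{*i}
    move    : dict (i, j) -> cleavage rate a_{r(i,j)} for the edge i -> j
    degrade : dict i -> degradation rate a_{i+}
    Returns a dict i -> lambda_i, the Poisson intensity of node i.
    """
    lam = {}
    for i in nodes:
        incoming = sum(lam[k] * rate for (k, j), rate in move.items() if j == i)
        outgoing = sum(rate for (k, j), rate in move.items() if k == i)
        lam[i] = (incoming + inflow.get(i, 0.0)) / (outgoing + degrade.get(i, 0.0))
    return lam


# Hypothetical rates: FTS is created, loses a terminal residue to give
# TS or FT, and every peptide can also be degraded.
lam = equilibrium_intensities(
    nodes=["FTS", "TS", "FT"],
    inflow={"FTS": 2.0},
    move={("FTS", "TS"): 1.0, ("FTS", "FT"): 0.5},
    degrade={"FTS": 0.5, "TS": 4.0, "FT": 1.0},
)
print(lam)
```

With these intensities the stationary distribution is a product of independent Poisson laws, one per node.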

SLIDE 21

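$(B_r)_{r \in R}$
$(b_r)_{r \in R} \sim \mathrm{Dir}((B_r)_{r \in R})$
$(B_{\star i})_{i \in V_{in}}$
$(b_{\star i})_{i \in V_{in}} \sim \mathrm{Dir}((B_{\star i})_{i \in V_{in}})$
$S_{shape}, S_{rate}$
$s \sim \mathrm{Gamma}(S_{shape}, S_{rate})$
$\lambda_i = \lambda_i(s, b_\star, b)$ for $i \in V$
$q$
$\epsilon_i \sim \mathrm{Bern}(q)$ for $i \in V$
$x_i \sim \mathrm{Poiss}(\lambda_i)$ for $i$: $\epsilon_i = 1$
$\tau$
$y_i \sim \mathrm{LogNormal}(x_i, \tau)$ for $i$: $\epsilon_i = 1$
$y_i \sim \mathrm{Background}$ for $i$: $\epsilon_i = 0$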

hierarchical Bayesian model
missing readings
errors
Metropolis-Hastings to sample from posterior:

SLIDE 22

NON TRIVIAL TASK: filling the cleavage graph with real data

1000 peptides: mass, charge, retention time

243 precursor peptides
ca. 40 000 subsequences

FTSS

from aa sequence:
calculate mass
consider all charges
predict retention time (random forests)

quite often: missing reads and errors!
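The first two steps (mass from the amino acid sequence, then every charge state) can be sketched as follows. The residue masses are standard monoisotopic values rounded to five decimals; the function names, the charge range and the minimum subsequence length are assumptions for this example, and retention time prediction is left out.

```python
# Monoisotopic residue masses in Da (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O added for the free termini
PROTON = 1.007276   # mass of one charge carrier


def peptide_mass(sequence):
    """Monoisotopic mass of a linear, unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def mz(mass, charge):
    """Mass-to-charge ratio of the peptide carrying `charge` protons."""
    return (mass + charge * PROTON) / charge


def subsequences(sequence, min_len=2):
    """All distinct contiguous subsequences of at least `min_len` residues."""
    n = len(sequence)
    return {sequence[i:j] for i in range(n) for j in range(i + min_len, n + 1)}


for peptide in sorted(subsequences("FTSS"), key=len):
    mass = peptide_mass(peptide)
    print(peptide, round(mass, 4), [round(mz(mass, z), 4) for z in (1, 2, 3)])
```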

SLIDE 23

Cleavage graph for real proteolytic events

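20 colorectal cancer patients and 20 healthy donors,
ca. 1000 peptides,
preprocessing phase

[Figure: example cleavage graph. u = MSFT†LTN†K is cleaved by pepsin ($\rho^{peps}_{xy}$) into x = MSFT†L†TN and y = K, or by thermolysin ($\rho^{ther}_{vw}$) into v = MSFT and w = LTNK; x is cleaved by thermolysin ($\rho^{ther}_{vz}$) into v = MSFT and z = LTN, or by chymotrypsin ($\rho^{chem}_{st}$) into s = MSFTL and t = TN.]

MUCH SMALLER cleavage graphs!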

SLIDE 24

[Figure: heatmap of peptidases against data sets (axis label: data set no.), with color key and histogram. Peptidases shown: eupitrilysin, cathepsin B, membrane-type matrix metallopeptidase 3, trypsin 1, cathepsin S, granzyme B (Homo sapiens-type), elastase 1, tripeptidyl peptidase I, matrix metallopeptidase 20, tryptase alpha, chymase (Homo sapiens-type), myeloblastin, cathepsin G, cathepsin L, calpain 1, membrane-type matrix metallopeptidase 6, ADAMTS5 peptidase, chymotrypsin C, pepsin A, ADAMTS4 peptidase, caspase 1, ADAM17 peptidase, ADAM10 peptidase, cathepsin H, membrane-type matrix metallopeptidase 4, cathepsin K, legumain, aminopeptidase PILS, kallikrein-related peptidase 3, matrix metallopeptidase 3, calpain 2, neprilysin, plasmin.]

identified enzymes make sense!

SLIDE 25

stochastic dynamics in time

A. Gambin, B. Kluge / Modeling Proteolysis from MS data

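[Figure: example cleavage graph. u = MSFT†LTN†K is cleaved by pepsin ($\rho^{peps}_{xy}$) into x = MSFT†L†TN and y = K, or by thermolysin ($\rho^{ther}_{vw}$) into v = MSFT and w = LTNK; x is cleaved by thermolysin ($\rho^{ther}_{vz}$) into v = MSFT and z = LTN, or by chymotrypsin ($\rho^{chem}_{st}$) into s = MSFTL and t = TN.]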

$$Q_{xx'} = \begin{cases} c^T \rho_{vw}\, x_u & \text{if } x' = x - u + v + w \text{ and } u = v \dagger w,\\ 0 & \text{otherwise.} \end{cases}$$

to be estimated: $c$, the vector of peptidase cutting intensities

from MEROPS: $\rho_{vw}$, the vector of all peptidase affinity coefficients for the cleavage $v \dagger w$

calculated from CME: $P(x, t) = P(X(t) = x)$

$$\frac{\partial}{\partial t} P(x, t) = \sum_{y \neq x} \big( Q_{yx} P(y, t) - Q_{xy} P(x, t) \big) = \sum_{u = v \dagger w} c^T \rho_{vw} \big[ (x_u + 1) P(x + u - v - w, t) - x_u P(x, t) \big] = \sum_{u = v \dagger w} c^T \rho_{vw} \big[ \tilde{x}_u P(\tilde{x}, t) - x_u P(x, t) \big],$$

where $\tilde{x} = x + u - v - w$.

no more monomolecular system: we have reactions A -> B and A -> B + C (endopeptidases)

SLIDE 26

SLIDE 27

SLIDE 28

SLIDE 29

SLIDE 30

interesting moments...

Denote by $E_q(t)$ the expected number of instances of peptide $q$ at time $t$:

$$E_q(t) = \sum_x x_q P(x, t).$$

From the equation above:

$$\frac{\partial}{\partial t} E_q(t) = \sum_{u \to q} \lambda_{uq} E_u(t) + \lambda_{qq} E_q(t), \quad q \in V,$$

$$E(t)^T = E(0)^T \exp(\Lambda t).$$

[Figure: heatmap of the matrix $\Lambda = (\lambda_{vw})_{v,w \in V}$ for peptide VAHRFKDLGEEN.]
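The closed form E(t) = E(0)^T exp(Λt) is easy to evaluate for a small graph. The sketch below uses a plain Taylor series with scaling and squaring so that it needs no external library; this is a stand-in for a proper matrix exponential routine, and the two-peptide rate matrix is a hypothetical example.

```python
def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_exp(m, terms=30, squarings=10):
    """exp(m) via a Taylor series of m / 2**squarings, then repeated squaring."""
    n = len(m)
    scaled = [[v / 2 ** squarings for v in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def expected_counts(e0, lam, t):
    """Row vector E(0)^T exp(Lambda t) of expected peptide counts at time t."""
    return mat_mul([e0], mat_exp([[v * t for v in row] for row in lam]))[0]


# Hypothetical system: peptide 0 turns into peptide 1 at rate 1,
# so lambda_00 = -1 and lambda_01 = 1; peptide 1 is stable.
LAMBDA = [[-1.0, 1.0],
          [0.0, 0.0]]
print(expected_counts([1.0, 0.0], LAMBDA, t=1.0))
```

For an endopeptidase reaction A -> B + C, both off-diagonal entries for B and C in the row of A would be positive, which is why the expected total count can grow.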

SLIDE 31

ETD fragmentation

more fragments
more insight into structure
more confidence in correct identification

SLIDE 32

some bonds get easily broken .. others not

ETD

SLIDE 33

the goal of masstodon:
understand fragmentation inside the instrument under different experimental conditions
use purified chemical samples
study fragmentation pathways

solution:
locate fragments in data
1. deconvolute signals and
2. infer fragmentation reaction constants

SLIDE 34

[Figure: isotopic spectrum, Probability against Mass [Da]]

using atomic compositions of the fragments we generate isotopic spectra with

we can aggregate masses to match data resolution
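Aggregating masses to the data resolution amounts to binning the fine isotopic structure. A minimal sketch, assuming a fixed bin width in Da and simple rounding to the nearest bin; the function name and the example peaks are made up for illustration.

```python
from collections import defaultdict


def aggregate_spectrum(peaks, resolution):
    """Sum probabilities of (mass, probability) peaks falling in the same bin.

    Each mass is assigned to the nearest multiple of `resolution`; the result
    is a mass-sorted list of (bin mass, total probability).
    """
    bins = defaultdict(float)
    for mass, prob in peaks:
        bins[round(mass / resolution)] += prob
    return sorted((index * resolution, prob) for index, prob in bins.items())


# Hypothetical fine-structure peaks: the first two merge at 0.1 Da resolution.
fine = [(415.21, 0.5), (415.23, 0.2), (416.22, 0.3)]
for mass, prob in aggregate_spectrum(fine, resolution=0.1):
    print(round(mass, 1), round(prob, 3))
```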

SLIDE 35

complications

we take into account charges … and imprecisions in instrumental mass calibration

SLIDE 36

mass imprecisions

tolerance intervals around theoretical isotopic envelope
natural data centroiding

[Figure: theoretical peaks T0, T1, T2 with tolerance intervals on the m/z [Th] axis]

SLIDE 37

intervals may overlap
using interval trees we build up the deconvolution graph

[Figure: tolerance intervals T0, T1, T2 of two fragments on the m/z [Th] axis, with experimental labels G0 to G7]

SLIDE 38

[Figure: deconvolution graph along the m/z axis. Theory: fragments FA and FB with theoretical peaks TA0, TA1, TA2 and TB0, TB1, TB2. Empirical data: G0 to G7.]

SLIDE 39

[Figure: deconvolution graph with fragment nodes (F), theoretical peak nodes (P), experimental nodes (E) and nodes labelled G, annotated as follows]

a-theoretical peaks (no fragments around)
fragment with no empirical support
fragment with its isotopic envelope
a fragment with empirical support: trivial case (no need for deconvolution)
more a-theoretical peaks

connected components of the deconvolution graph provide a wealth of insight into the spectrum

two fragments with empirical support: suitable for deconvolution
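The connected components can be illustrated with a small sketch. It matches theoretical peaks to observed peaks through a fixed tolerance using binary search on the sorted observed m/z values (a simpler stand-in for interval trees) and groups fragments with a union-find structure. All names and numbers are hypothetical.

```python
from bisect import bisect_left, bisect_right


def fragment_components(theoretical, observed_mz, tol):
    """Connected components of a fragment / observed-peak graph.

    theoretical : dict fragment name -> list of theoretical peak m/z values
    observed_mz : list of observed peak m/z values
    tol         : half-width of the tolerance interval around each theoretical peak
    Nodes are ("F", name) for fragments and ("G", index) for observed peaks
    (index into the sorted observed list). Observed peaks matching no fragment
    (a-theoretical peaks) do not appear in any component.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    observed = sorted(observed_mz)
    for name, peaks in theoretical.items():
        find(("F", name))
        for mz in peaks:
            lo = bisect_left(observed, mz - tol)
            hi = bisect_right(observed, mz + tol)
            for g in range(lo, hi):
                parent[find(("F", name))] = find(("G", g))

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


theory = {"A": [100.0, 100.5, 101.0], "B": [100.52, 101.02, 101.52], "C": [200.0]}
observed = [100.01, 100.51, 101.01, 101.5, 150.0]
for component in fragment_components(theory, observed, tol=0.03):
    print(sorted(component))
```

Here fragments A and B share observed peaks and end up in one component (a case for deconvolution), while C has no empirical support and forms a component of its own.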

SLIDE 40

to perform deconvolution we present the problem as a linear programme similar to the max flow problem

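[Figure: flow network between theory and empirical data. Fragment variables $\alpha_A$, $\alpha_B$; theoretical peak variables $p_{A0}$, $p_{A1}$, $p_{A2}$, $p_{B0}$, $p_{B1}$, $p_{B2}$; flow variables $x_{A00}$, $x_{A11}$, $x_{A12}$, $x_{A24}$, $x_{A25}$, $x_{B02}$, $x_{B03}$, $x_{B15}$, $x_{B16}$, $x_{B27}$.]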

SLIDE 41

Electron Transfer Dissociation

Cleavage of protein backbone by a rapid neutralization of charge

To identify proteins
To sequence proteins de novo
To identify post-translational modifications

SLIDE 42

Petri Net model

[Figure: Petri net with reactions PTR, ETnoD and ETD]

Electron Transfer Dissociation (ETD):

$[M + nH]^{n+} \to [M_1 + n_1 H]^{n_1 +} + [M_2 + n_2 H]^{n_2 +}$

Proton Transfer Reaction (PTR):

$[M + nH]^{n+} \to [M + (n-1)H]^{(n-1)+}$

Electron Transfer Without Dissociation (ETnoD):

$[M + nH]^{n+} \to [M + nH]^{(n-1)+}$

SLIDE 43

Ion Space Parametrization

(A, p, q)

Amino acid sequence, number of protons quenched by ETnoD, charge

Evolution of an ion = Markov Jump Process

Jump = transition between states
Jump intensity = reaction intensity

Draw reaction time
Draw reaction type
Compute reaction products
Put ions into sample
Distribute charges on precursor sequence
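The steps "draw reaction time, draw reaction type, compute reaction products" follow the usual pattern for simulating a Markov jump process. The sketch below follows one ion only and makes several simplifying assumptions that are not on the slides: the three reaction intensities are constant, the ETD cleavage site is uniform, and charges and quenched protons are split at random between the two fragments. All names and rate values are hypothetical.

```python
import random

REACTIONS = ("PTR", "ETnoD", "ETD")


def simulate_ion(sequence, charge, rates, t_max, rng):
    """Follow one ion until it fragments, is neutralized, or time runs out.

    Returns a list of product ions as tuples
    (sequence, quenched protons, charge).
    """
    t, quenched = 0.0, 0
    total_rate = sum(rates[r] for r in REACTIONS)
    while charge > 0:
        t += rng.expovariate(total_rate)  # draw reaction time
        if t > t_max:
            break
        weights = [rates[r] for r in REACTIONS]
        reaction = rng.choices(REACTIONS, weights=weights)[0]  # draw reaction type
        charge -= 1  # each of the three reactions lowers the charge by one
        if reaction == "ETnoD":
            quenched += 1  # proton stays on the ion but no longer carries charge
        elif reaction == "ETD":
            # compute reaction products: split the backbone at a random site
            site = rng.randint(1, len(sequence) - 1)
            left_charge = rng.randint(0, charge)
            left_quenched = rng.randint(0, quenched)
            return [(sequence[:site], left_quenched, left_charge),
                    (sequence[site:], quenched - left_quenched, charge - left_charge)]
    return [(sequence, quenched, charge)]


rng = random.Random(0)
rates = {"PTR": 1.0, "ETnoD": 1.0, "ETD": 2.0}  # hypothetical intensities
sample = []
for _ in range(5):
    sample.extend(simulate_ion("RPKPQQ", 3, rates, t_max=0.5, rng=rng))  # put ions into sample
print(sample)
```

Repeating the simulation for many ions gives an empirical picture of the population; for a big population the deck instead moves to an ODE description.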

SLIDE 44

Population approach

Stochastic description of a single ion

ODE description of a big population of ions

SLIDE 45

Tree-like structure

SLIDE 46

Sequence  Charge  Electrons  Intensity
RPKPQQ    3                  0.25
RPKP      1       1          0.01
PQQ       1                  0.12
...       ...     ...        ...

SLIDE 47

Intensity estimation

SLIDE 48

Many thanks to collaborators

Piotr Dittwald
Frederik Lermyte
Dirk Valkenborg
Michal Startek
Frank Sobott
Blazej Miasojedow
Mateusz Łącki
Michał Ciach