SLIDE 1

Machine Learning Metabolic Pathway descriptions using a Probabilistic Relational Representation

Nicos Angelopoulos and Stephen Muggleton

{nicos,shm}@doc.ic.ac.uk.

Imperial College, London.

Wye – p.1

SLIDE 2

structure

Structure of the talk:

  • motivation (pathways, learning, relational, probabilistic)
  • Stochastic Logic Programs
  • parameter estimation with FAM
  • experiments with chain probabilistic pathway
  • experiments with branching probabilistic pathway

SLIDE 3

pathways

Metabolic pathways:

  • represent biochemical reactions in the cells of organisms
  • are publicly available in databases such as KEGG
  • are cross-referenced with other data, such as gene sequences
  • exhibit relationships across species due to evolution

SLIDE 4

aromatic amino acid

[Figure: the aromatic amino acid pathway, a graph of KEGG compound identifiers (C00002, C00631, C00082, ...) connected through yeast open reading frames (YDR127W, YGL148W, ...).]

SLIDE 5

machine learning

Public databases are, almost by definition, incomplete and contain incorrect information. Amongst other reasons, incompleteness is due to:

  • unknown enzymes
  • lack of interest/resources for documenting secondary pathways

Machine learning can use observational data to revise, augment, and verify metabolic pathway descriptions. Of particular interest is the use of cross-species information.

SLIDE 6

relational

Relational representations can express background knowledge at various levels of biological detail. The ability to incorporate existing knowledge enhances the ability to learn. For instance, in metabolic pathways, additional knowledge might include:

  • physical properties of substrates and products for individual reactions
  • the existence of required co-factors and the absence of blocking inhibitors
  • the availability of similar pathways in other cells

SLIDE 7

probabilistic

Various forms of uncertainty arise when modelling biological systems. Two main sources are:

  • competing biological processes
  • lack of detail in the model

We consider two scenarios for extending metabolic pathways in these directions.

SLIDE 8

rates as probabilities

[Figure: two alternative reactions, A) 1.1.1.25 and B) 2.7.1.71, both producing metabolite C00493, each annotated with a probability p.]

Pathways do not take into account the rates at which enzymes consume their substrates to produce metabolites. In the case of alternative production paths for a single metabolite, it is impossible to distinguish the contribution of each path. One way to model the difference in rates is by way of probabilities, which capture the rates as proportions.

SLIDE 9

rates for probabilistic ML

[Figure: as in the previous slide, alternative reactions A) 1.1.1.25 and B) 2.7.1.71 producing C00493, each annotated with a probability p.]

Rate constants can be used in conjunction with the Michaelis-Menten equation to derive these probabilities. However, databases such as BRENDA record very few rate constants. ML techniques can be used to extrapolate these from experimental data.
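As an illustration, a minimal Python sketch of this derivation, using hypothetical Vmax and Km values (not constants taken from BRENDA): competing path rates are computed with the Michaelis-Menten equation and normalised into proportions.

```python
def michaelis_menten_rate(vmax, km, s):
    """Michaelis-Menten rate v = Vmax * [S] / (Km + [S])."""
    return vmax * s / (km + s)

def path_probabilities(rates):
    """Capture the rates of competing paths as proportions."""
    total = sum(rates)
    return [r / total for r in rates]

# Hypothetical rate constants for the two alternative reactions A and B.
rate_a = michaelis_menten_rate(vmax=10.0, km=2.0, s=2.0)
rate_b = michaelis_menten_rate(vmax=20.0, km=1.0, s=3.0)
p_a, p_b = path_probabilities([rate_a, rate_b])
```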

SLIDE 10

lack of detail as probabilities

[Figure: reaction 1.1.1.25 with substrates C00005, C02652 and products C00006, C00493, annotated with probability p.]

Due to a number of factors, such as physical chemistry, temperature, intracellular distance, etc., reactions may not happen even when substrates are present. Lack of detail in the model can then be modelled as a probability on the event of the reaction happening.

SLIDE 11

rest of talk

  • modelling lack-of-detail in SLPs
  • parameter estimation with FAM
  • experiments with chain probabilistic pathway
  • experiments with branching probabilistic pathway
  • conclusions

SLIDE 12

SLPs

A stochastic logic program is a parameterised logic program. Each clause of a probabilistic predicate has a parameter (or label) attached to it.

Example program:

1/2 :: nat( 0 ).
1/2 :: nat( s(X) ) :- nat( X ).

An SLP is normalised if the parameters of the clauses of each probabilistic predicate sum to 1. An SLP is pure if all its predicates are parametrised.
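As an illustration, a small Python sketch of sampling from the nat/1 program above; each derivation repeatedly chooses between the two clauses with probability 1/2, so the derivation depth is geometrically distributed.

```python
import random

def sample_nat(p_zero=0.5, rng=random):
    """Sample from  1/2 :: nat(0).  1/2 :: nat(s(X)) :- nat(X).
    Returns the number of s/1 applications in the derivation."""
    depth = 0
    while rng.random() >= p_zero:   # chose the recursive clause
        depth += 1
    return depth
```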

SLIDE 13

FAM, Cussens (2001)

Parameter estimation: estimate the tuple of parameters λ = (λ1, λ2, . . . , λm), given the frequencies of n observations y = (y1 : N1, y2 : N2, . . . , yn : Nn), which are assumed to have been generated from S according to an unknown distribution p(λ, S, G). Failure Adjusted Maximisation is an EM algorithm in which the adjustment is expressed in terms of failure observations. The expected frequency of clause Ci is:

    ψλ[νi | y] = Σ_{k=1}^{n} Nk ψλ[νi | yk] + N (Zλ^{-1} − 1) ψλ[νi | fail]        (1)

where N = Σk Nk and Zλ is the probability that a derivation succeeds.

SLIDE 14

fam algorithm

1. Set h = 0 and λ(0) to some estimates such that Zλ(0) > 0.

2. For each parameterised clause Ci compute ψλ(h)[νi | y] using (Eq. 1).

3. Let Si(h) be the sum of ψλ(h)[νi′ | y] for all Ci′ of the same predicate as Ci.

4. If Si(h) = 0 then λi(h+1) = λi(h), otherwise λi(h+1) = ψλ(h)[νi | y] / Si(h).

5. Increment h and go to 2 unless λ(h+1) has converged.
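Steps 3-4 amount to per-predicate normalisation of the expected counts. A minimal Python sketch (assuming the failure-adjusted counts ψλ[νi | y] of Eq. 1 have already been computed; the names are illustrative):

```python
def fam_step(lam, expected_counts, predicates):
    """One FAM update: lam maps clause id -> parameter,
    expected_counts maps clause id -> psi_lambda[nu_i | y],
    predicates maps predicate name -> its list of clause ids."""
    new_lam = dict(lam)                 # keep old values for zero-sum predicates
    for clause_ids in predicates.values():
        s = sum(expected_counts[c] for c in clause_ids)   # step 3
        if s > 0:                                         # step 4
            for c in clause_ids:
                new_lam[c] = expected_counts[c] / s
    return new_lam
```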

SLIDE 15

implementation

SLP clauses are transformed so that:

  • an identifier is added to each clause
  • the probability of a derivation is returned
  • the path of a derivation, as a list of ids, is returned
  • would-be failures simply set a flag and succeed
  • infinite or very long computations are curtailed, by approximating their probability to zero

SLIDE 16

FAM on singular SLPs

Although FAM was introduced for pure SLPs, we applied it to a slightly more general class. Singular SLPs allow for impure/mixed SLPs, in so far as all derivations of a specific goal map to distinct stochastic paths. A stochastic path is the sequence of probabilistic clauses used.

SLIDE 17

probabilistic pathways

[Figure: (a) reaction 1.1.1.25 with substrates C00005, C02652 and products C00006, C00493; (b) the same reaction annotated with probability p; (c) its SLP encoding:]

enzyme( '1.1.1.25', rea_1_1_1_25, [c00005,c02652], [c00006,c00493] ).
0.80 :: rea_1_1_1_25( yes, yes, yes, yes ).
0.20 :: rea_1_1_1_25( yes, yes, no, no ).

The semantics of the attached probability are: "Given the inputs are present, the reaction will happen with probability p." The probability is attached to the reaction, not to the enzyme.
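A Python sketch of the sampling behaviour these two clauses encode (the actual representation is the SLP above; the boolean interface here is illustrative):

```python
import random

def rea_1_1_1_25(in1, in2, rng=random):
    """Given both inputs present, the reaction fires with probability 0.80
    and produces its outputs; otherwise no outputs are produced."""
    if not (in1 and in2):
        return (False, False)       # a precondition of both clauses fails
    if rng.random() < 0.80:         # 0.80 :: rea_1_1_1_25(yes, yes, yes, yes)
        return (True, True)
    return (False, False)           # 0.20 :: rea_1_1_1_25(yes, yes, no, no)
```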

SLIDE 18

assumptions

We have made two major simplifying assumptions:

  • reactions deplete their inputs
  • each reaction is considered at most once

SLIDE 19

simulation

We ran simulated experiments in order to:

  • obtain estimates of the required learning data-size
  • observe the behaviour of FAM

SLIDE 20

PE scenario

Our experiments follow this pattern:

  • an SLP with n true parameters λ = λ1, λ2, . . . , λn is used to draw T samples
  • the sampling goal is can_produce(+Substrates, −Metabolites)
  • the true parameters are replaced by uniformly distributed ones
  • FAM is used to obtain estimates λ̄ = λ̄1, λ̄2, . . . , λ̄n
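For the nat/1 example, this scenario can be sketched end-to-end in Python; when every sampled derivation succeeds (Zλ = 1), the FAM update reduces to normalised clause-usage counts, so one pass recovers the maximum-likelihood estimate. The helper names are illustrative.

```python
import random

def sample_depth(p_stop, rng):
    """Sample a derivation from the nat/1 SLP with stopping parameter p_stop."""
    depth = 0
    while rng.random() >= p_stop:
        depth += 1
    return depth

def estimate_p_stop(depths):
    """Normalised clause-usage counts: nat(0) is used once per sample,
    the recursive clause is used `depth` times per sample."""
    stop_uses = len(depths)
    recursive_uses = sum(depths)
    return stop_uses / (stop_uses + recursive_uses)

rng = random.Random(42)
true_p = 0.7
depths = [sample_depth(true_p, rng) for _ in range(5000)]  # T samples
p_hat = estimate_p_stop(depths)                            # close to true_p
```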

SLIDE 21

chain pathway

We have added direction and mock probabilities to the aromatic amino acid pathway and run the following two sets of experiments:

  x   t   It   Slt   Sut    Sit
  a   1   10   100   1000   100
  a   2   20   110   1010   100
  b   1   5    100   3300   400
  b   2   10   110   3310   400

SLIDE 22

measures

We run FAM to observe two values:

  • accuracy: the root mean square error of the parameters,

        Rti(x) = sqrt( Σ_{j=1}^{n} (pj − p̄(x,t,i,j))² / n )

    taking the mean and standard deviation over t

  • raw execution times for runtimes
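A small Python sketch of the accuracy measure (rms over the n parameters of one run; the argument names are illustrative):

```python
import math

def rms(true_params, estimated_params):
    """Root mean square error between true and FAM-estimated parameters."""
    n = len(true_params)
    return math.sqrt(sum((p - q) ** 2
                         for p, q in zip(true_params, estimated_params)) / n)
```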

SLIDE 23

chain plots

[Plots: rms(p) and runtimes in milliseconds against learning data size (200-1200), for chain_a_10 and chain_c_10.]

SLIDE 24

branching

To compare the effect of secondary paths, we artificially extended the pathway with an alternative path of length five, near the top of the graph. The secondary path only fires when there is a failure in the primary path.

SLIDE 25

artificial pathway

[Figure: the artificially extended pathway: the original aromatic amino acid graph of KEGG compounds and EC numbers, plus an alternative branch X of length five near the top.]

SLIDE 26

comparative plots

[Plots: rms(p) and runtimes in milliseconds (50,000-200,000) against learning data size (200-1000), comparing chain_a_10 and branch_a_10.]

SLIDE 27

FAM future work

Improve efficiency by:

  • storing the expressions (Σ_r ψλ(r) νi(r)) rather than (re-)doing the proofs at each iteration
  • simplifying such expressions (noting their equivalence to graph reduction)

Extend the algorithm to cover impure SLPs.

SLIDE 28

bottom line

Currently we have run FAM to get initial estimates of the data size required for learning the actual parameters. Machine learning tasks on probabilistic pathways:

  • pathway completion
  • pathway verification
  • reaction rate estimation (different representation)
