Parameters estimate in metabolic networks reconstruction G. Aprea 1 - - PDF document

parameters estimate in metabolic networks reconstruction
SMART_READER_LITE
LIVE PREVIEW

Parameters estimate in metabolic networks reconstruction G. Aprea 1 - - PDF document

FINAL WORKSHOP OF GRID PROJECTS, PON RICERCA 2000-2006, AVVISO 1575 1 Parameters estimate in metabolic networks reconstruction G. Aprea 1 , G. Licciardello 2 , and V. Rosato 3 1 ENEA, Via del Vecchio Macello, 80055 Portici (Naples), Italy,


slide-1
SLIDE 1

FINAL WORKSHOP OF GRID PROJECTS, “PON RICERCA 2000-2006, AVVISO 1575” 1

Parameters estimate in metabolic networks reconstruction

  • G. Aprea1, G. Licciardello2, and V. Rosato3

1ENEA, Via del Vecchio Macello, 80055 Portici (Naples), Italy,

giuseppe.aprea@gmail.com

2Science and Technology Park of Sicily, stradale V. Lancia 57, z.i. Blocco Palma I,

95121 Catania, Italy, gralicci@unict.it

3ENEA, Computinig and Modelling Unit, Via Anguillarese 301, 00123 S.Maria di

Galeria (Rome), Italy, rosato@casaccia.enea.it

Abstract—In this paper we describe the im- plementation of two different parallel codes for parameters estimate in metabolic networks based

  • n the genetic algorithm to fully take advantage of

modern computational facilities such as the Enea GRID. Index Terms—COMETA, journal, L

A

T EX, paper, template.

  • I. INTRODUCTION

Advances in omics sciences produce large amounts of data that need analysis and interpre-

  • tation. Reliable explanations of how processes

are regulated require an accurate modeling ap- proach at the systems level. In this paper we focus on metabolic networks models which re- produce the time evolution of all the metabo-

  • lites. Quite often these models rely on several

unknown parameters which have to be estimated from experimental data. This task consists in the solution of an inverse problem which requires the use of an efficient optimization algorithm. GA [1] is a widely known optimum search method which yields reliable values for model parameters with a large computational demand. Our aim is to develop a parallel implementation for parameter estimate based on GA to fully take advantage of the modern different compu- tational facilities. Our implementation relies on:

  • Ecell

software from Keio Univer- sity(Japan) [2], [4] for simulations of biochemical networks;

  • LSF - load sharing facility - a job scheduler

for (multi)cluster [3].

  • II. THE GENETIC ALGORITHM: BASICS

According to GA analogy, in a biochemical network, a set K = ki, i = 1, m of unknown parameters is defined genome. Each of these missing array of constants gives rise to a dif- ferent behavior of the network, that is different time evolutions for the metabolites’ concentra-

  • tions. The network with the genome end its

behavior together constitute an individual and a group of individuals is a population. As in the case of populations of organisms in nature, GA populations undergo a selection where good experimental data fitting represent the selection

  • criterion. After every selection stage, a new

population is created; the current generation is

  • ver and a new one is ready. After a large

number of generations, GA is expected to yield the individuals which best fit experimental data.

  • III. THE GENETIC ALGORITHM:

COMPUTATIONAL SCHEME GA is implemented following this steps: 1) a random number I of genomes is chosen, each of them differing from the others by the value of the unknown m constants. This is the starting population. 2) for each individual, the time evolution of the metabolites’ concentration is calcu- lated with the E-Cell tool.

slide-2
SLIDE 2

BARBERA et al.: COMETA DEMO PAPER

3) metabolites’ concentration allows to eval- uate, for each individual, the correspond- ing value of the cost function: Fk = [xi(tj+1) − xi(tj)]2, (k = 1, I); a good fit to data implies a small Fk and viceversa. 4) Mating is performed selecting the individ- uals of the current generation according to the cost function Fk, the smaller the more likely to be selected. Mutation is also added to yield the final offspring. After mating and mutation, a new gen- eration is ready and and the cycle restarts from point (2) unless we meet one of the following conditions (in these cases, the procedure ends):

  • the best value for the cost function

becomes smaller than a fixed thresh-

  • ld;
  • a pre-defined limit number of gener-

ations has been reached.

  • IV. GA AND SPECIFIC COMPUTATIONAL

ARCHITECTURES

The efficiency of the GA implementation stems from the choice of a number of param- eters: the number of individuals in a popula- tion, the number of generations and the spe- cific values of the parameters used for selec- tion, mating and mutation processes. Another important element to save overall time exe- cution is parallelizing the code; for example the computation of the fitness of individuals belonging to the same generation. In this work we suggest efficient implementations of GA, in relation to specific computing architectures: a single cluster architecture (SCA), characterized by many computational nodes tightly intercon- nected through a low-latency, high-bandwidth network and a GRID (or multicluster archi- tecture), characterized by a number of SCAs linked through a Wide Area Network. The com- putational problem has thus been mapped on a specific computing architecture by assuming different implementations of the method.

  • A. SCA implementation

In this case we propose a single-stage par- allelism approach. A master node generates an instance of a population of N individuals and then, if γ is the number of computing nodes, allots to each of them the same number N/γ of fitness evaluations. When the fitness of all the genomes has been evaluated, the master node gathers all data from the other computing nodes and performs selection, mating and mutation to provide new offspring which is subsequently re- allotted, as before, to the computing nodes. The procedure is repeated for nG generations.

  • B. GRID implementation

In the GRID case, the specific computing architecture suggests a different implementation

  • f the method. In this case, in fact, the single

larger generation of SCA is replaced by a num- ber γ of smaller “secondary generations” (SG), each composed of N/β genomes. The master node allots the SGs to other different secondary master nodes which accomplish the procedure described above for SCA. This procedure can be seen as the definition of many “islands” where subsets of a generation are allowed to evolve. Each “island” evolves new generations, inde- pendently, on the basis of their initial genomes. After a number of generations nG , the is- lands are allowed to exchange their resulting

  • genomes. The master node gathers all data from

the SGs, ranks their best genomes and send back those with the lowest cost function. Then each SG generates a new seed-population by using half of the best-fitting genomes received from the master and half by randomly generating new

  • individuals. This procedure is iterated nS times,

so that the total number of SG generations is NT = nGnS. In this way two main results are achieved: the first is to allow cooperation among SGs, which can exchange best individuals, and the second is keeping diversity to avoid being trapped by local minima effects.

  • V. DETAILS OF THE MODELS USED FOR

COMPUTATIONS

We have considered different networks whose topology and kinetic constants are known. We

2

slide-3
SLIDE 3

FINAL WORKSHOP OF GRID PROJECTS, “PON RICERCA 2000-2006, AVVISO 1575”

have generated the time-course of a number

  • f reaction products which are then used as

they were experimental results. We have thus “hidden” some of the kinetic constants and asked the method to recover their values. As for the networks chosen, the first describes the cyrcadian rhythm in Drosophila [5] while the second is related to the PHA (polyhydrox- yalkanoate) production in Pseudomonas. The two cases have been chosen as paradigmatic of different behaviors: in the first case, the method should be able to recover the frequency and amplitudes of experimental observations; in the second, the large initial production rate and the subsequent time-scale for reaching the asymp- totic flat behavior. We briefly report the main features of the biochemical models used for testing the method in Table I, together with the choices made in the different implementations. Before hiding the kinetic parameters, we used their “true” values to obtain the time course of some metabolites: three different metabolites in the case PHA metabolism model and two ones in the case of cyrcadian rhythm. In Tables II and III, we resume the values of the parameters which define the method’s implementations on the different computing architectures for the cyrcadian rhythm and the PHA metabolism.

  • VI. RESULTS AND DISCUSSION

In this section we report the results obtained with the two model’s implementations, for the two metabolic networks. In Table [?] we report the numerical results of our GA implementa-

  • tions. Results allow to highlight the following

points:

  • Both the SCA and the GRID implemen-

tations allow to estimate numerical val- ues for the kinetic constants very close to the “true” ones which can reproduce the “experimental” data with a very good

  • approximation. The kpha

3

parameter is the

  • nly one for which both approaches pro-

duce different results and different from the “true” one. A further investigation has shown the existence of a (almost) flat op- timum for this parameter i.e. there is a substantial independence of the resulting solutions from the absolute value of this parameter.

  • if compared at equal number of genomes

processed, the SCA implementation showed a better performance than the GRID one, either in minimizing the cost function and in the number of genomes needed to reach the minimum. This is not much surprising because it is known that GA works better with larger populations;

  • the amplitude of maximum fluctuation of

the cost function away from the minimum

  • f the best genome of each generation in

the SCA approach is about 1.5 − 2 times larger than that of the best genome from the set of parallel generations of the GRID

  • implementation. This also follows from the

ability of a larger population to efficiently span the parameters space.

  • VII. CONCLUSION

In this work, we have attempted to show that GA is a powerful tool providing good guesses

  • f unknown kinetic parameters in metabolic
  • networks. It has been used in combination with

the E-Cell tool which yields the evaluation

  • f the cost function. Although it is compu-

tationally expensive, its workflow is prone to be parallelized on multi-processor computing

  • architectures. We have
  • provided evidences on the effective abil-

ity to retrieve correct guesses when used to discover unknown kinetic constants in metabolic networks; evidences have been provided for different cases, where the time behavior of the products in the networks varies from exponential decay, as in the case of PHA metabolism in Pseudomonas to an oscillating behavior, as in the case of the cyrcadian rhythm in Drosophila;

  • designed two different parallel implemen-

tations of the method, to be used on dif- ferent classes of computing architectures: large homogeneous clusters, with fast in- terconnection (SCA), and GRID architec-

3

slide-4
SLIDE 4

BARBERA et al.: COMETA DEMO PAPER

tures, characterized by loosely coupled computing elements. The SCA implementation is based on the fast internode’s communication which allows a rapid and efficient flux of information between a master node and the computation nodes. The GRID implementation, in turn, introduces a double level of parallelism: each computational node provides the evolution of a subset of the whole population as the loose interconnection does not allow to make every node communi- cate with a single master at the end of each

  • generation. Only at given frequency, the dif-

ferent population subsets exchange data. We have tested both SCA and GRID implementa- tions to deal with two different cases: in the first, we wanted to reproduce a periodic time behavior of metabolites concentration while in the second we faced a typical slowly-decaying behavior, i.e. an initial increase followed by a slow saturation. In both cases the parallel procedures were able to provide fully satisfac- tory results both from a qualitative and, more importantly, from a quantitative points of view. The SCA implementation, using a single, large population, is faster than the GRID one in reaching its optimum, even in the case where we compare the performance of the latter with a SCA implementation with a smaller population. We are working on a further implementation

  • f the method which mixes up features of

both the SCA and the GRID approach. In this approach, we will force successive generations to reaching higher population densities in the parameter space, by allowing them to explore just a subset of the whole parameter space. At programmable intervals of generations, a number of subsets with the worst results is dis- carded and further generations are constrained to make a fine-grain exploration in the region(s)

  • f the parameters space where better solutions

have been obtained. We expect this approach to be better in approaching the correct solution and to overcome problems of instability due to the possible presence of more than one sta- ble minimum which plague the SCA approach. Moreover, we will apply this method on fur- ther cases of some technological relevance. The proposed method can be run in conjunction with the methods allowing the network to be designed, in order to ultimately provide reliable computational models of metabolic pathways to be used for metabolomic studies and to provide in-silico derived insights for bacterial metabolic engineering [6]. ACKNOWLEDGMENT This work has been performed in the frame of the project CRESCO (Computational Research Center for Complex Systems) co-founded by ENEA and the Italian Ministry of University and Research in the frame of “Programma Oper- ativo Nazionale 2000-2006 Ricerca Scientifica, Sviluppo Tecnologico, Alta Formazione, Misura II.2 : Societ della Informazione per il Sistema Scientifico Meridionale, Azione a : Sistemi di calcolo e simulazione ad alte prestazioni”. The authors wish to acknowledge discussions and useful suggestions by I. Arisi (EBRI, Rome) and V.Catara (University of Catania). They would also thank the ENEA colleagues of the ENEA Research Center in Portici for supporting the technical implementation of the work on the high-performance computing architectures

  • f the CRESCO Project.

REFERENCES

[1] D.E. Goldberg, Genetic Alghorithms in Search, Op- timization and Machine Learning, Boston (MA) USA: Addison-Wesley Longman Publishing Com- pany, 1989. [2] E-cell [http://www.e-cell.org/ecell]. [3] Platform Computing, [http://www.platform. com]. [4] M. Tomita, K. Hashimoto, K. Takahashi, T. Shimizu,

  • Y. Matsuzaki, F. Miyoshi, K. Saito, S. Tanida, K. Yugi,
  • J. Venter and C. Hutchison, Bioinformatics 15, 7284,

1999. [5] J. Tyson,C. Hong,C. Thron,B. Novak, Biophys J., , 77(5), 2411, 1999. [6] DREAM project [http://wiki.c2b2.columbia. edu/dream/index.php/The DREAM Project].

  • Dr. Giuseppe Aprea Graduated in Physics in 2002, finished

Ph.D. in Physics in 2006. In 2007 he won a research grant to work at ENEA (Italian National Agency for New Technologies, Energy and Environment) in System Biology project.

4

slide-5
SLIDE 5

FINAL WORKSHOP OF GRID PROJECTS, “PON RICERCA 2000-2006, AVVISO 1575”

  • Dr. Grazia Licciardello Graduated in Biology in 2002,

finished Ph.D. in Technologies for Plants Health in 2008. Since 2003 she has collaborated with the Science and Technologies for Plants Health Department at the University

  • f Catania (DISTEF) in the field of Biotechnologies for

Plants Health.

  • Dr. Vittorio Rosato Graduated in Physics in 1979, finished

Ph.D. in Physics in 1986. Since 1987 he has collaborated with ENEA (Italian National Agency for New Technologies, Energy and Environment) becoming confirmed researcher in

  • 1990. He has developed expertise in Statistical Mechanics,

Condensed Matter Physics, High Performance computing, Optimization Techniques, Complex Systems Modeling.

Model Metabolites Enzymes Processes M − m m

  • N. of “experimental” points

[Metabolite list] Cyrcadian rhythm 2 5 9 2 103[M,P] PHA metabolism 8 5 5 21 3 9940 [Fatty acid, 2-trans-enoyl-CoA,PHA]

TABLE I MAIN DATA OF THE TWO MODELS USED FOR COMPUTATIONS. LAST COLUMN REPORTS THE NUMBER OF “EXPERIMENTAL” POINTS USED TO EVALUATE THE COST

FUNCTION (THE VALUE np IN EQ.2).

5

slide-6
SLIDE 6

BARBERA et al.: COMETA DEMO PAPER

Model nG N γ cyrcadian rhythm 1000 100 ∼100 PHA metabolism 1000 100 ∼100

TABLE II IMPLEMENTATION PARAMETERS FOR THE SCA ARCHITECTURES. nG THE TOTAL NUMBER OF GENERATIONS THE MODEL HAS BEEN ALLOWED TO GENERATE, N THE

NUMBER OF genomeS IN EACH GENERATION, γ THE NUMBER OF COMPUTING PROCESSORS USED FOR COMPUTATIONS.

Model N/β β nG nS γ Cyrcadian rhythm 20 5 200 4 ∼10 40 5 200 4 ∼10 60 5 200 4 ∼10 PHA metabolism 20 5 200 4 ∼10

TABLE III IMPLEMENTATION PARAMETERS FOR THE GRID ARCHITECTURES. N/β THE NUMBER OF INDIVIDUALS IN EACH GROUP; β IS THE NUMBER OF GROUPS OF INDIVIDUALS

WHICH EVOLVE INDEPENDENTLY; nG THE NUMBER OF GENERATIONS CREATED IN EACH GROUP BEFORE THE DATA EXCHANGING BETWEEN GROUPS; nS THE NUMBER OF TIMES THE DIFFERENT GROUPS EXCHANGE DATA (SUCH AS THE TOTAL NUMBER OF GENERATIONS NT = nG · [nS + 1]); γ THE NUMBER OF COMPUTING PROCESSORS USED FOR COMPUTATIONS.

6

slide-7
SLIDE 7

FINAL WORKSHOP OF GRID PROJECTS, “PON RICERCA 2000-2006, AVVISO 1575”

Solutions from the different approaches Model Parameter [Solution] SCA GRID (N/β = 20) GRID (N/β = 40) GRID (N/β = 60) Pha metabolism kpha

1

[10.0sec−1] 9.9998 9.6165 10.0118 10.0000 kpha

2

[1.21e − 05M] 1.1986e − 05 1.2494e − 05 1.2100e − 05 1.2099e − 05 kpha

3

[7.2e − 04M] 27.1001e − 04 16.1585e − 04 7.1986e − 04 6.4814e − 04 Cyrcadian rhythm kperiod

1

[200M −1] 199.9998 199.9978 199.9888 200.0003 kperiod

2

[10sec−1] 10.0000 10.0000 10.0001 9.99999

TABLE IV NUMERICAL RESULTS FOR THE HIDDEN PARAMETERS. IN THE SECOND COLUMN, WITHIN BRACKETS, WE REPORT THE “TRUE” VALUE OF THE PARAMETER.

7