Algorithmica and molecular biology. The Pisan experience

Fabrizio Luccio

Glimpses into the world born from the encounter between the machines for sequencing DNA fragments and the computers that assemble those fragments.

The Department group
- Maria Federico (federico@cli.di.unipi.it)
- Claudio Felicioli (pangon@gmail.com) *
- Paolo Ferragina * (ferragin@di.unipi.it)
- Roberto Grossi (grossi@di.unipi.it)
- Fabrizio Luccio * (luccio@di.unipi.it)
- Roberto Marangoni * (marangon@di.unipi.it)
- Nadia Pisanti * (pisanti@di.unipi.it)

* The boss (at least, the one who knows everything)
* Reference person (she made most of the work)
* Trying to escape
* gmail: why? (probably paid by Google)
* ME! (parasite, but early group initiator)

Glimpses into the world etc. ...
Algorithms are the winning tool. Sorry... good algorithms are the winning tool, especially when dealing with very large data.

Inefficient algorithms... have the unpleasant property of resisting hardware improvement:
- A polynomial-time algorithm solves a problem on $n$ data in time $t_1 = c\,n^s$
- An exponential-time algorithm solves a problem on $n$ data in time $t_2 = c\,s^n$
with $c$, $s$ constants.

With a computer $k$ times faster, and the same running time, we process $N > n$ data, according to the laws:
- $t_1 = c\,n^s$, $k\,t_1 = c\,N^s$, hence $N = k^{1/s}\,n$
- $t_2 = c\,s^n$, $k\,t_2 = c\,s^N$, hence $k\,s^n = s^N$ and $N = n + \log_s k$
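The two laws above can be checked numerically. The following is a minimal sketch: the function names and the sample values n = 100, s = 2, k = 1024 are illustrative assumptions, not taken from the slides.

```python
import math

def poly_new_size(n, s, k):
    """Polynomial law t = c * n**s: a k-times faster machine handles N = k**(1/s) * n."""
    return k ** (1 / s) * n

def exp_new_size(n, s, k):
    """Exponential law t = c * s**n: a k-times faster machine handles N = n + log_s(k)."""
    return n + math.log(k, s)

# Illustrative values (assumptions): n = 100 data, s = 2, machine 1024 times faster.
n, s, k = 100, 2, 1024
print(f"polynomial:  N = {poly_new_size(n, s, k):.0f}")  # multiplicative gain
print(f"exponential: N = {exp_new_size(n, s, k):.0f}")   # additive gain only
```

With these values the polynomial algorithm goes from 100 to about 3200 data, while the exponential one goes from 100 to about 110: the speed-up multiplies the problem size in the first case and only adds a logarithmic term in the second.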

Publications on sequence algorithms
- Mercatanti A., Rainaldi G., Mariani L., Marangoni R., Citti L. A method for prediction of accessible sites on an mRNA sequence for target selection of hammerhead ribozymes. J. Computational Biology, 4(9) 641-653, 2002
- Menconi G., Marangoni R. A compression-based approach for coding sequences identification in prokaryotic genomes. J. Computational Biology (to appear)
- Corsi C., Ferragina P., Marangoni R. The bioPrompt-box: an ontology-based clustering tool for searching in biological databases. BMC bioinformatics (to appear)
- Cozza A., Morandin F., Galfrè S.G., Mariotti V., Marangoni R., Pellegrini S. TAMGeS: a Three-Array Method for Genotyping of SNPs by a dual-color approach. BMC genomics (to appear)
- Felicioli C., Marangoni R. BpMatch: an efficient algorithm for segmenting sequences, calculating genomic distance and counting repeats (submitted)
- Ferragina P. String search in external memory: algorithms and data structures. Handbook of Computational Molecular Biology, CRC Press, 2005

Publications on motifs
- N. Pisanti, M. Crochemore, R. Grossi, M.-F. Sagot. A Comparative Study of Bases for Motif Inference. NATO Series on String Algorithmics, 2004.
- N. Pisanti, M. Crochemore, R. Grossi, M.-F. Sagot. Bases of Motifs for Generating Repeated Patterns with Wild Cards. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2(1) 40-50, 2005.
- C.S. Iliopoulos, J. McHugh, P. Peterlongo, N. Pisanti, W. Rytter, M. Sagot. A first approach to finding common motifs with gaps. International Journal of Foundations of Computer Science 16(6) 1145-1155, 2005.
- N. Pisanti, H. Soldano, M. Carpentier, J. Pothier. Implicit and Explicit Representation of Approximated Motifs. In: Algorithms for Bioinformatics, C. Iliopoulos et al., editors, King's College London Press, 2006.
- P. Peterlongo, N. Pisanti, F. Boyer, A. Pereira do Lago, M.-F. Sagot. Lossless filter for multiple repetitions with Hamming Distance. Journal of Discrete Algorithms, 2007 (to appear).

Major collaborations on motifs
- Lyon (group of Marie-France Sagot)
- Grenoble (group of Alain Viari)
- Paris (group of Henri Soldano)
- Marne-la-Vallée (group of Maxime Crochemore)

Paralogy tree construction ... via transformation distance
Pisanti N., Marangoni R., Ferragina P., Frangioni A., Savona A., Pisanelli C., Luccio F. PaTre: A Method for Paralogy Trees Construction. J. Computational Biology, 5 (10) 791-802, 2003

How does the genomic information increase?
External imports:
- Transfections
- Horizontal transfer
Endogenous mechanisms, (genic or genomic) duplications:
- Large scale
- Tandem
- Dispersed
- Single gene

The fate of the copy
- Non-functional: pseudogene
- Functional: paralog
→ genome as a set of families of paralogs → PARALOGY TREE
- How does the genome choose the paralog to duplicate within a family?
- Is the duplication rate constant among the various families?
- Are sparse duplications correlated to sparse deletions?

Couple-comparison method: Transformation Distance (TD)
- Often the newest genes are the shortest ones
- Inserting sequences implies paying a metabolic cost; deleting sequences has no metabolic cost
- We need an asymmetric distance: TD(S,T) = the cost of the minimum-length script able to transform S into T
- Elementary operations: Insertion, Copy, Inverted copy

TD: an example
S = ATCGATCAGCTGCCCAATGAATCAGATAAAGTTTC (segments f, g, h; positions marked on the slide: 1, 11, 16, 25, 35)
T = ATCGATCAGCTTTCACTACGAATGAATCAGATTGGTAGCTTTGAAATAG (segments f, g, h; positions marked on the slide: 1, 11, 21, 38, 48)

Script transforming S into T, with description:
1) copy f: copy (1, 1, 11)
2) insertion of TTCACTACG: insert (TTCACTACG)
3) copy g: copy (16, 21, 12)
4) insertion of TGGTAGC: insert (TGGTAGC)
5) inverted copy of h: copy (25, 38, 11, 1)
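As a rough illustration of how such a script acts on S, here is a minimal sketch of a script interpreter. Several things are assumptions made for the sketch: the tuple encoding of the operations, the function name `apply_script`, dropping the target-position argument (the target is simply built left to right), and reading "inverted copy" as plain reversal, which is what the h segment of the example suggests. The fourth insertion is written as TGGTAG rather than TGGTAGC because that is what reproduces the T string as it is printed here; the printed T has 49 characters while the slide's ruler ends at 48, so one character was probably altered somewhere in the example.

```python
def apply_script(source, script):
    """Build a target string by applying script operations left to right.

    Source positions are 1-based, as in the slide's example.
    """
    out = []
    for op in script:
        kind = op[0]
        if kind == "insert":
            out.append(op[1])                      # new material, not taken from source
        elif kind == "copy":
            _, start, length = op
            out.append(source[start - 1:start - 1 + length])
        elif kind == "inverted_copy":
            _, start, length = op
            segment = source[start - 1:start - 1 + length]
            out.append(segment[::-1])              # assumption: inversion = reversal
        else:
            raise ValueError(f"unknown operation: {kind}")
    return "".join(out)

S = "ATCGATCAGCTGCCCAATGAATCAGATAAAGTTTC"
T = "ATCGATCAGCTTTCACTACGAATGAATCAGATTGGTAGCTTTGAAATAG"

script = [
    ("copy", 1, 11),             # segment f
    ("insert", "TTCACTACG"),
    ("copy", 16, 12),            # segment g
    ("insert", "TGGTAG"),        # assumption: slide lists TGGTAGC, see note above
    ("inverted_copy", 25, 11),   # segment h, reversed
]

print(apply_script(S, script) == T)
```

The sketch only replays a given script; finding the minimum-length script, which is what TD(S,T) measures, is a separate optimization problem that is not attempted here.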

PaTre
- Input: TD values for each possible couple made by the genes of the family
- Building-up of the directed graph of distances
- Edmonds' algorithm: extraction of the LSA (Lightest Spanning Arborescence) → optimal paralogy tree
- Generation of optimal and sub-optimals (space of quasi-optimal solutions)
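To make the notion of a lightest spanning arborescence concrete, the sketch below enumerates every spanning arborescence of a tiny directed distance graph and ranks them by cost. This is exhaustive search, not Edmonds' algorithm, and it is only feasible for very small families; it does, however, give both the optimum and the sub-optimal solutions in one pass. The gene names, the distance values, and the convention that `dist[(u, v)]` is the cost of transforming u into v (so u becomes the parent of v) are all assumptions made for the illustration.

```python
from itertools import product

def reaches_root(v, parent, root):
    """True if following parent links from v ends at root without cycling."""
    seen = set()
    while v != root:
        if v in seen:
            return False
        seen.add(v)
        v = parent[v]
    return True

def arborescences(nodes, dist):
    """Yield (cost, root, parent_map) for every spanning arborescence."""
    for root in nodes:
        others = [v for v in nodes if v != root]
        # Every non-root node picks exactly one parent among the other nodes.
        choices = [[u for u in nodes if u != v] for v in others]
        for parents in product(*choices):
            parent = dict(zip(others, parents))
            if all(reaches_root(v, parent, root) for v in others):
                cost = sum(dist[(parent[v], v)] for v in others)
                yield cost, root, parent

# Hypothetical asymmetric distances between three genes of one family.
nodes = ["a", "b", "c"]
dist = {
    ("a", "b"): 2, ("b", "a"): 5,
    ("a", "c"): 6, ("c", "a"): 7,
    ("b", "c"): 1, ("c", "b"): 4,
}

ranked = sorted(arborescences(nodes, dist), key=lambda t: t[0])
best_cost, best_root, best_parent = ranked[0]
print(best_cost, best_root, best_parent)
```

For three genes there are nine candidate trees; the lightest one here is rooted at `a` with `b` derived from `a` and `c` derived from `b`, and the remaining entries of `ranked` play the role of the quasi-optimal solutions.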

PaTre has been tested by simulation... because there are no experimental data on the history of families of genes.

[Figure: the simulated paralogy tree for the Ribosomal Proteins family of M. pneumoniae (nodes 0 to 11), shown with the output from PaTre for the same simulated family, MFINFRP (nodes str01 to str11). Cost: 7840 - Distance: 0%]

[Figure: the paralogy tree reconstructed by ClustalW for the Ribosomal Proteins genic family of M. pneumoniae (nodes str01 to str11), shown with the PaTre output for MFINFRP. Cost: 7840 - Distance: 0%]

After having tested PaTre on many examples, we could conclude that PaTre is able to correctly reconstruct the simulated history of genetic families, while ClustalW and other similarity-based methods fail.
