Diversity in vivo, Multicore in silico : How to link metagenomics and community ecology
Alain Franc & al. INRA and Université de Bordeaux, Bordeaux, France
Agreenium, Toulouse October 2014
Molecular imprint of evolution: discrete mathematics and statistical modelling
Ecological modelling : dynamical systems
A challenge : How to link and assemble those two modelling domains? Donoghue & al., 2009
Evolution
Few individuals, many traits: genome-wide coverage
Many individuals, few DNA regions of interest
Definition: the edit distance between two strings is the minimum number of edits needed to transform one string into the other, the allowable edit operations being insertion, deletion, or substitution of a single character.
Example: the edit distance between "kitten" and "sitting" is 3:
kitten → sitten (substitution of 'k' with 's')
sitten → sittin (substitution of 'e' with 'i')
sittin → sitting (insertion of 'g' at the end)
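The definition above can be turned into a short dynamic-programming routine. This is a minimal sketch assuming unit cost for every operation; the function name `edit_distance` is chosen here for illustration and is not taken from the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit (Levenshtein) distance with unit costs, computed row by row."""
    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Only two rows of the table are kept in memory, so the cost is proportional to the product of the two lengths in time and to the length of the second string in space.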
What we have: genetic distances between sequences
What we want: an evolutionary distance as branch length (time?)
Tool for linking both: a graph for visualizing the network of pairwise distances according to a threshold
A taxon is … a disc … a clique … a clade
From Wikipedia, graphs http://fr.wikipedia.org/wiki/Th%C3%A9orie_des_graphes
Clique: basis for phylogenetics (ultrametrics). Finding: hard (NP-complete)
Connected component: basis for BLAST. Finding: easy
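The contrast between the two structures can be illustrated on a toy example. The sketch below builds a graph by linking sequences whose distance is at most a threshold, extracts connected components with a linear-time traversal, and checks whether a given set of nodes is a clique. The distance matrix and all function names are made up for illustration. Note that checking a given set is cheap; it is finding a largest clique that is hard.

```python
from itertools import combinations


def threshold_graph(dist, threshold):
    """Adjacency sets: i and j are linked when dist[i][j] <= threshold."""
    adjacency = {i: set() for i in range(len(dist))}
    for i, j in combinations(range(len(dist)), 2):
        if dist[i][j] <= threshold:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def connected_components(adjacency):
    """Depth-first traversal, linear in the number of nodes plus edges."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


def is_clique(adjacency, nodes):
    """True when every pair of the given nodes is linked."""
    return all(b in adjacency[a] for a, b in combinations(nodes, 2))


# Toy symmetric distance matrix (made-up values).
dist = [
    [0, 1, 2, 9],
    [1, 0, 1, 9],
    [2, 1, 0, 9],
    [9, 9, 9, 0],
]
graph = threshold_graph(dist, threshold=1)
components = connected_components(graph)
print(components)                        # [{0, 1, 2}, {3}]
print(is_clique(graph, components[0]))   # False: 0 and 2 are not linked
```

With a threshold of 2 instead of 1, nodes 0, 1 and 2 become pairwise linked and the same component is also a clique.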
HPC and turbulences: astrophysics, climate change, Earth sciences, bioinformatics, etc.
From HPC to distributed computing
Computational science: tighter links with services
Science as a service for communities
Discrete mathematics on words and strings: from sequences to proteins
Genetics and random processes: population genetics, inferring phylogenies, …
Ecology, systems biology, and ([very] large) dynamical systems
Bioinformatics
Integrative biology
Medicine
Agriculture
Environment
Response to climate change
Scaling
Millions of reads
Exact calculation, no heuristics (local alignment)
Data flows of several tens of TB per job
Towards diagnostics for community ecology
Distinguish, mobilize, and unite three types of knowledge and skills
Modularity and networking in workflows
Galaxy servers make it possible to implement this, as soon as a module can be launched from a command line
One module A network of modules
Workflows
Avakas 1000 cores
France-Grille
Where is it possible to compute? Where from?
From any computer connected to the internet
Currently available from French Guiana (IP Cayenne works with it)
From a single portal: the Galaxy instance
Pairwise distances from local alignment
Distance matrix
Selection of a barcoding gap
Building a graph
Computing connected components and cliques
Visualisation
Statistics on taxa and characters
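The slides do not say how the barcoding gap is selected. As one possible illustration, the sketch below assumes a simple heuristic: sort all pairwise distances, find the widest empty interval between two consecutive values, and place the threshold at its midpoint. The function name and the distance values are hypothetical, and this heuristic is an assumption rather than the method used in the workflow.

```python
def barcoding_gap_threshold(distances):
    """Midpoint of the widest gap between consecutive sorted distances.

    Assumed heuristic for illustration only: small (within-taxon) and large
    (between-taxon) distances are expected to be separated by an empty interval.
    """
    values = sorted(distances)
    if len(values) < 2:
        raise ValueError("at least two distances are needed")
    # (gap width, index of the lower bound of the gap)
    width, k = max((values[k + 1] - values[k], k) for k in range(len(values) - 1))
    return (values[k] + values[k + 1]) / 2


# Made-up pairwise distances: three small ones, three large ones.
pairwise = [0.01, 0.02, 0.015, 0.20, 0.22, 0.25]
threshold = barcoding_gap_threshold(pairwise)
print(round(threshold, 3))  # 0.11
```

The resulting threshold is the value that would be used to decide which pairs of sequences are linked when building the graph.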
An example of taxonomic annotation from NGS
Cross validation
False-positives False-negatives Abundances
Quality Indices
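The comparison between a known inventory (for instance a mock community or a microscopy count) and a metabarcoding result can be sketched as below. The taxon names, read counts and function names are placeholders invented for illustration; they are not data from the study.

```python
def compare_inventories(expected, detected):
    """Split taxa into true positives, false negatives and false positives."""
    expected, detected = set(expected), set(detected)
    return {
        "true_positives": expected & detected,
        "false_negatives": expected - detected,  # present but not detected
        "false_positives": detected - expected,  # detected but not present
    }


def relative_abundances(read_counts):
    """Convert read counts per taxon into proportions of the total."""
    total = sum(read_counts.values())
    return {taxon: count / total for taxon, count in read_counts.items()}


# Placeholder data (hypothetical taxa and counts).
expected_taxa = {"taxon_A", "taxon_B", "taxon_C"}
reads_per_taxon = {"taxon_A": 700, "taxon_B": 298, "taxon_D": 2}

result = compare_inventories(expected_taxa, reads_per_taxon)
abundances = relative_abundances(reads_per_taxon)
print(result["false_negatives"])  # {'taxon_C'}
print(result["false_positives"])  # {'taxon_D'}
print(abundances["taxon_D"])      # 0.002
```

Reporting the relative abundance next to each false positive helps to tell apart detections supported by only a handful of reads from genuine disagreements in taxonomy.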
Microscopy Metabarcoding rbcL / PGM / RSYST DB 99% homology 54 000 reads
mock community
Metabarcoding rbcL / 454 / RSYST DB 100% homology 40 000 reads
Kermarrec et al 2013
False negatives: 3 species under 1%, 1 species at 1.9%
False negatives: 2 species under 0.6%
False positives: 2 due to a taxonomy problem (Gomphonema sp. complex), 3 supported by only 1 to 5 reads
Lake Geneva: seasonal dynamics of benthic diatoms
Monthly sampling over 1 year (April 2012 to March 2013)
10 environmental samples
Diatoms: scraped from 5 stones, 50 cm depth Microscopy Metabarcoding rbcL / PGM / RSYST DB 250 000 reads
Sequences known by pairwise distances
Distance geometry, pattern recognition, machine learning
Clustering
Multidimensional scaling, linear and nonlinear (e.g. Sammon, 1969)
Manifold learning: IsoMap, EigenMap, etc.
Graph-based methods: spectral clustering
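As an illustration of the linear case, the sketch below computes the first axis of classical (metric) multidimensional scaling from a distance matrix alone, in pure Python. It assumes a symmetric matrix of roughly Euclidean distances and uses power iteration for the leading eigenvector; the function name is invented here, and a real analysis would rely on a numerical library rather than this toy routine.

```python
import math


def classical_mds_first_axis(dist, iterations=200):
    """Coordinates of the points on the first classical-MDS axis."""
    n = len(dist)
    squared = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in squared]
    total_mean = sum(row_mean) / n
    # Double centring of squared distances gives the Gram matrix of the
    # centred points (row means equal column means for a symmetric matrix).
    gram = [[-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

    def multiply(vector):
        return [sum(gram[i][j] * vector[j] for j in range(n)) for i in range(n)]

    # Power iteration; assumes the start vector is not orthogonal
    # to the leading eigenvector.
    vector = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        product = multiply(vector)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
    eigenvalue = sum(v * p for v, p in zip(vector, multiply(vector)))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [scale * x for x in vector]


# Three points on a line at positions 0, 1 and 3, known only by distances.
dist = [
    [0, 1, 3],
    [1, 0, 2],
    [3, 2, 0],
]
coords = classical_mds_first_axis(dist)
print([round(abs(coords[i] - coords[j]), 6) for i, j in [(0, 1), (1, 2), (0, 2)]])
```

The recovered coordinates are defined up to translation and reflection, so only the distances between them are meaningful; in this example they reproduce the input distances 1, 2 and 3.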
Complete independence, modest connectivity, substantial connectivity, panmixia (subpopulations are completely congruent)
After Waples and Gaggiotti, 2006, Molecular Ecology
Item          Number
Atoms         92
Molecules     10^6 ?
Cell types    3 × 10^2
Organisms     10^7
Communities
One modelling goal: how to visualize/simulate large associations with modular structures
http://www.fractalforums.com/images-showcase-%28rate-my-fractal%29/the-lego-molecule/?PHPSESSID=00a24d7f4234586a8e5ba4dd9c82541b
Living systems: diversity … assembly of heterogeneous parts
Distributed systems
Distributed computing for distributed systems?
Team: Yec’han Laizet, Jean-Marc Frigerion, Philippe Chaumeil
HPC: Pierre Gay (MCIA Bordeaux), Sylvie Thérond (IDRIS), Michel Daydé (e-Biothon), Vincent Breton (idGC, GIS FG)
(Meta)barcoding:
LMGE, Clermont: Didier Debroas, Gisèle Bronner
Carrtel, Thonon: Agnès Bouchez, Frédéric Rimet, Isabelle Domaizon
AMAP: Jean-François Molino, Daniel Sabatier
IP Cayenne: Benoit de Thoisy, Anne Lavergne, Sourakhata Tirera
Thanks to