Diversity in vivo, Multicore in silico : How to link metagenomics and community ecology



slide-1
SLIDE 1

Diversity in vivo, Multicore in silico : How to link metagenomics and community ecology

Alain Franc & al. INRA and Université de Bordeaux, Bordeaux, France

Agreenium, Toulouse October 2014

slide-2
SLIDE 2
slide-3
SLIDE 3
Plan

  • Diversity and ecology
  • Molecular systematics
  • Discrete mathematics for molecular systematics
  • Tools for discrete mathematics for molecular systematics
  • Case study: Amazonian trees and dimensionality reduction
  • Case study: diatoms and inventories through NGS
  • Next future

slide-4
SLIDE 4

DIVERSITY AND ECOLOGY

slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7
slide-8
SLIDE 8
slide-9
SLIDE 9
slide-10
SLIDE 10
slide-11
SLIDE 11
slide-12
SLIDE 12

Biodiversity and Applied Mathematics: some examples

Molecular imprint of evolution : discrete mathematics and statistical modelling

  • Global alignment (very hard problem)
  • Inferring large phylogenies (very hard problem)
  • Coalescence models (technical, rich domain)
  • Genetic distances and evolutionary distances

Ecological modelling : dynamical systems

  • Community assembly
  • Diffuse coevolution (geographical mosaic …)

A challenge : how to link and assemble those two modelling domains?

Donoghue & al., 2009

slide-13
SLIDE 13

MOLECULAR SYSTEMATICS

slide-14
SLIDE 14
slide-15
SLIDE 15

Evolution

slide-16
SLIDE 16
slide-17
SLIDE 17
slide-18
SLIDE 18
slide-19
SLIDE 19

Few individuals, many traits : genome-wide coverage

Many individuals, few DNA regions of interest

slide-20
SLIDE 20
slide-21
SLIDE 21

DISCRETE MATHEMATICS FOR MOLECULAR SYSTEMATICS

slide-22
SLIDE 22

Taxonomy on edit distance

Definition: The edit distance between two strings is the minimum number of edits needed to transform one string into the other, the allowable edit operations being insertion, deletion, or substitution of a single character.

Example: the edit distance between "kitten" and "sitting" is 3.

  • kitten → sitten (substitution of 'k' with 's')
  • sitten → sittin (substitution of 'e' with 'i')
  • sittin → sitting (insertion of 'g' at the end)
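The definition above can be computed by dynamic programming. The following is a minimal sketch of the standard Levenshtein recurrence (unit cost for each operation); the function name `edit_distance` is chosen here for illustration and is not taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the current prefix of a and b[:j];
    # before any character of a is read, that is simply j insertions.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # a[:i] versus the empty prefix of b: i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Only two rows of the table are kept in memory, so the cost is proportional to the product of the two lengths in time and to the length of the second string in space.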

slide-23
SLIDE 23

A point of methods …

What we have: genetic distances between sequences

What we want: an evolutionary distance as branch length on a phylogenetic tree (time?)

Tool for linking both: a graph for visualizing the pairwise distance network according to a threshold
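As an illustration of the linking tool, here is a minimal sketch that turns a pairwise distance matrix into a graph for a given threshold: two sequences are joined by an edge when their distance does not exceed the threshold. The function name `threshold_graph` and the small matrix `D` are hypothetical, made up for the example.

```python
from itertools import combinations


def threshold_graph(distances, threshold):
    """Adjacency sets: i and j are linked when distances[i][j] <= threshold."""
    n = len(distances)
    graph = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if distances[i][j] <= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


# Hypothetical distances between four sequences (not data from the talk)
D = [
    [0, 1, 2, 9],
    [1, 0, 2, 9],
    [2, 2, 0, 9],
    [9, 9, 9, 0],
]
graph = threshold_graph(D, threshold=2)
print(graph)
```

With a threshold of 2, the first three sequences are mutually linked and the fourth stays isolated; raising or lowering the threshold changes which groups appear.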

slide-24
SLIDE 24

Ultrametric distances

A taxon is … a disc … a clique … a clade
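A distance is ultrametric when d(x, z) ≤ max(d(x, y), d(y, z)) for all x, y, z. In that case, any two members of a disc of radius r are themselves at distance at most r, so the disc is a clique of the graph built at threshold r. The sketch below checks both properties on a small hypothetical matrix; the helper names `is_ultrametric` and `disc` are assumptions made for the example.

```python
from itertools import combinations, permutations


def is_ultrametric(distances, tolerance=1e-12):
    """True when d(i, k) <= max(d(i, j), d(j, k)) for every triple."""
    n = len(distances)
    return all(
        distances[i][k] <= max(distances[i][j], distances[j][k]) + tolerance
        for i, j, k in permutations(range(n), 3)
    )


def disc(distances, center, radius):
    """Indices of all items within `radius` of `center`."""
    return {j for j in range(len(distances)) if distances[center][j] <= radius}


# Hypothetical matrices: the first reads like a tree with two close leaves,
# the second places three points on a line (additive, not ultrametric).
tree_like = [[0, 2, 6], [2, 0, 6], [6, 6, 0]]
on_a_line = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]

print(is_ultrametric(tree_like))   # True
print(is_ultrametric(on_a_line))   # False

members = disc(tree_like, center=0, radius=2)
# In an ultrametric, every pair inside a disc is within the radius: a clique.
print(all(tree_like[i][j] <= 2 for i, j in combinations(members, 2)))  # True
```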

slide-25
SLIDE 25

Graphs …

From Wikipedia, graphs http://fr.wikipedia.org/wiki/Th%C3%A9orie_des_graphes

slide-26
SLIDE 26

Two useful notions on graphs

Clique

  • Basis for phylogenetics, ultrametrics
  • Finding: hard (NP-complete)

Connected component

  • Basis for BLAST
  • Finding: easy
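The contrast between the two notions can be sketched as follows. Connected components are found by a simple traversal, in time linear in the number of vertices and edges. Checking that a given set of vertices is a clique is also easy; what is hard is finding the largest clique of a graph. The graph below is a hypothetical example, and the function names are chosen for illustration.

```python
from itertools import combinations


def connected_components(graph):
    """List of vertex sets, one per connected component (depth-first search)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


def is_clique(graph, nodes):
    """True when every pair of the given vertices is joined by an edge."""
    return all(v in graph[u] for u, v in combinations(nodes, 2))


# Hypothetical graph: a path 0 - 1 - 2 and an isolated vertex 3
graph = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}

print(connected_components(graph))   # [{0, 1, 2}, {3}]
print(is_clique(graph, {0, 1, 2}))   # False: 0 and 2 are not linked
print(is_clique(graph, {0, 1}))      # True
```

The path 0 - 1 - 2 shows why the two notions differ: the three vertices form one connected component although the two end vertices are not directly linked.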

slide-27
SLIDE 27

TOOLS FOR DISCRETE MATHEMATICS FOR MOLECULAR SYSTEMATICS

slide-28
SLIDE 28

A couple of success stories …

  • HPC and turbulence
  • Astrophysics, climate change, Earth sciences, bioinformatics, etc …
  • From HPC to distributed computing
  • Computational science
  • Tighter links with services
  • Science as service for communities

slide-29
SLIDE 29

What about Biology?

  • Discrete mathematics on words and strings: from sequences to proteins
  • Genetics and random processes: population genetics, inferring phylogenies, …
  • Ecology, system biology, and ([very] large) dynamical systems
  • Bioinformatics
  • Integrative biology
  • Medicine
  • Agriculture
  • Environment
  • Response to climate change

slide-30
SLIDE 30

An experiment with Turing (IDRIS)?

  • Scaling
  • Millions of reads
  • Exact calculation, no heuristics (local alignment)
  • Flows of several tens of TB per job
  • Towards diagnostics for community ecology
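Exact local alignment without heuristics is classically computed with the Smith-Waterman dynamic programming recurrence. The sketch below returns the best local alignment score of two sequences; the scoring values (match 2, mismatch -1, linear gap -2) are assumptions made for the example, not the parameters used in the experiment, and turning a score into a distance is left aside.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    previous = [0] * (len(b) + 1)
    best = 0
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            # A cell never drops below 0: an alignment may restart anywhere.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


print(local_alignment_score("GGGACGT", "ACGTCCC"))  # 8: the shared "ACGT"
print(local_alignment_score("AAAA", "TTTT"))        # 0: nothing in common
```

The cost grows with the product of the two sequence lengths for each pair, and the number of pairs grows with the square of the number of reads, which is what makes the exact calculation on millions of reads a scaling problem.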

slide-31
SLIDE 31

Towards a shared, virtual Biodiversity Lab

Distinguish, mobilize, and unite three types of knowledge and skills

  • Evolutionary biology and ecology
  • Applied mathematics and statistical modelling
  • Computer Sciences and High Performance Computing
slide-32
SLIDE 32

How does it work?

Modularity and networking in workflows

Galaxy servers make it possible to implement this, as soon as a command line launches a module

One module

A network of modules

slide-33
SLIDE 33

Galaxy

Workflows

slide-34
SLIDE 34
Where is it possible to compute?

  • Local Galaxy server
  • Mesocentre (Tier 2): Avakas, 1000 cores
  • Tier 1 (IDRIS, one pipeline, not via Galaxy)
  • EGI GRID: France-Grille
  • Cloud (ongoing, with UPV Valencia)

Where from?

  • From any computer connected to the internet
  • Currently available from French Guiana (IP Cayenne works with it)
  • From a unique portal: the Galaxy instance

slide-35
SLIDE 35

CASE STUDY: AMAZONIAN TREES AND DIMENSIONALITY REDUCTION

slide-36
SLIDE 36
slide-37
SLIDE 37
slide-38
SLIDE 38

CASE STUDY: DIATOMS AND INVENTORIES THROUGH NGS

slide-39
SLIDE 39

An example for taxonomic annotation from NGS

  • Pairwise distances from local alignment
  • Distance matrix
  • Selection of a barcoding gap
  • Building a graph
  • Computing connected components and cliques
  • Visualisation
  • Statistics on taxa and characters
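The slide does not say how the barcoding gap is selected. One simple heuristic, shown here purely as an assumption, is to sort all pairwise distances and place the threshold in the middle of the widest gap between two consecutive values: distances below it are treated as within-taxon, distances above it as between-taxon. The function name and the matrix are hypothetical.

```python
from itertools import combinations


def widest_gap_threshold(distances):
    """Midpoint of the widest gap between consecutive sorted pairwise distances.

    Assumes at least three items, so that at least two distances exist.
    """
    n = len(distances)
    values = sorted(distances[i][j] for i, j in combinations(range(n), 2))
    gaps = [(high - low, low, high) for low, high in zip(values, values[1:])]
    _, low, high = max(gaps)
    return (low + high) / 2


# Hypothetical distances between four reads (not data from the talk)
D = [
    [0, 1, 2, 9],
    [1, 0, 2, 9],
    [2, 2, 0, 9],
    [9, 9, 9, 0],
]
print(widest_gap_threshold(D))  # 5.5, halfway between 2 and 9
```

The threshold obtained this way is then the value used to build the graph whose connected components and cliques are computed in the next steps.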

slide-40
SLIDE 40
slide-41
SLIDE 41

Metabarcoding + NGS on diatom communities

Taxonomic inventory

Cross validation

  • False-positives
  • False-negatives
  • Abundances

Quality indices
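Cross validation of two inventories amounts to set comparisons: taxa found by both methods, taxa present in the reference but missed (false-negatives), and taxa reported but absent from the reference (false-positives). A minimal sketch follows; the taxon names are placeholders, not taxa from the study, and abundances are left aside.

```python
def compare_inventories(reference, detected):
    """Split taxa into true positives, false negatives and false positives."""
    reference, detected = set(reference), set(detected)
    return {
        "true_positives": reference & detected,
        "false_negatives": reference - detected,
        "false_positives": detected - reference,
    }


# Placeholder inventories: microscopy as reference, metabarcoding as test
microscopy = {"taxon_A", "taxon_B", "taxon_C", "taxon_D"}
metabarcoding = {"taxon_A", "taxon_B", "taxon_C", "taxon_X"}

result = compare_inventories(microscopy, metabarcoding)
recovered = len(result["true_positives"])
print(f"{recovered}/{len(microscopy)} reference taxa recovered")
print("false negatives:", sorted(result["false_negatives"]))
print("false positives:", sorted(result["false_positives"]))
```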

slide-42
SLIDE 42

Mock community

Microscopy

Metabarcoding rbcL / PGM / RSYST DB, 99% homology, 54 000 reads

Metabarcoding rbcL / 454 / RSYST DB, 100% homology, 40 000 reads

Kermarrec et al 2013

Taxonomic inventories

17/21

  • False-negatives: 3 sp under 1%, 1 sp at 1.9%

19/21

  • False-negatives: 2 sp under 0.6%
  • 2 false-positives: taxonomy problem (Gomphonema sp complex)
  • 3 false-positives: 1 to 5 reads

slide-43
SLIDE 43

  • Lake Geneva
  • Seasonal dynamics of benthic diatoms
  • Monthly samplings during 1 year
  • 10 environmental samples (April 2012 to March 2013)
  • Diatoms: scraped from 5 stones, 50 cm depth

Microscopy

Metabarcoding rbcL / PGM / RSYST DB, 250 000 reads

Taxonomic inventories

slide-44
SLIDE 44

NEXT FUTURE …

slide-45
SLIDE 45

Molecular based taxonomy and systematics: An open route for (new) methods

Sequences known by pairwise distances

  • Distance geometry, pattern recognition, machine learning
  • Clustering
  • Multidimensional Scaling, linear and nonlinear (e.g. Sammon, 1969)
  • Manifold learning: IsoMap, EigenMap, etc …
  • Graph based methods: spectral clustering
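As an illustration of the linear case, classical multidimensional scaling recovers coordinates from distances alone: the squared distances are double-centred into a Gram matrix, whose leading eigenvectors give the coordinates. The sketch below computes only the first coordinate, in pure Python with a power iteration. It assumes distances that are close to Euclidean, and in practice a linear algebra library would be used instead. The function name and the test matrix are hypothetical.

```python
import math


def classical_mds_1d(distances, iterations=500):
    """First classical-MDS coordinate of each item, from a distance matrix."""
    n = len(distances)
    squared = [[d * d for d in row] for row in distances]
    row_mean = [sum(row) / n for row in squared]
    total_mean = sum(row_mean) / n
    # Double centring of the squared distances gives the Gram matrix
    gram = [
        [-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + total_mean)
         for j in range(n)]
        for i in range(n)
    ]
    # Power iteration towards the leading eigenvector of the Gram matrix
    vector = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        product = [sum(gram[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0:
            return [0.0] * n
        vector = [x / norm for x in product]
    eigenvalue = sum(
        vector[i] * gram[i][j] * vector[j] for i in range(n) for j in range(n)
    )
    return [math.sqrt(max(eigenvalue, 0.0)) * x for x in vector]


# Hypothetical distances between three points lying on a line at 0, 1 and 3
D = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
coords = classical_mds_1d(D)
print(coords)
```

For this example the recovered coordinates reproduce the input distances, up to a shift and a possible change of sign, since the three points really lie on a line. Nonlinear variants such as Sammon mapping minimise a different criterion and are not covered by this sketch.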

slide-46
SLIDE 46

Continuum of population differentiation

  • Complete independence
  • Modest connectivity
  • Substantial connectivity
  • Panmixia (subpopulations are completely congruent)

After Waples and Gaggiotti, 2006, Molecular Ecology

Pattern recognition …

slide-47
SLIDE 47

Pattern and functions from populations to biomes

+

Biodiversity

slide-48
SLIDE 48

Speculation: Assemblage and Scaling

Item          Number
Atoms         92
Molecules     10^6 ?
Cell types    3 × 10^2
Organisms     10^7
Communities   …

One modelling goal: how to visualize/simulate large associations of small/large numbers of types with modular structures

http://www.fractalforums.com/images-showcase-%28rate-my-fractal%29/the-lego-molecule/?PHPSESSID=00a24d7f4234586a8e5ba4dd9c82541b

Living systems: diversity …, assembly of heterogeneous parts, distributed systems

Distributed computing for distributed systems?

slide-49
SLIDE 49

Thanks to

Team: Yec’han Laizet, Jean-Marc Frigerio, Philippe Chaumeil

HPC: Pierre Gay (MCIA Bordeaux), Sylvie Thérond (IDRIS), Michel Daydé (e-Biothon), Vincent Breton (idGC, GIS FG)

(Meta)barcoding

  • LMGE, Clermont: Didier Debroas, Gisèle Bronner
  • Carrtel, Thonon: Agnès Bouchez, Frédéric Rimet, Isabelle Domaizon
  • AMAP: Jean-François Molino, Daniel Sabatier
  • IP Cayenne: Benoit de Thoisy, Anne Lavergne, Sourakhata Tirera