Challenges in Organizing a Metagenomics Benchmarking Challenge


SLIDE 1

Challenges in Organizing a Metagenomics Benchmarking Challenge

Alice C. McHardy
Department for Computational Biology of Infection Research, Helmholtz Centre for Infection Research
and the CAMI Initiative

SLIDE 2

Critical Assessment of Metagenome Interpretation

Tool development for shotgun metagenome data sets is a very active area: assembly, (taxonomic) binning, taxonomic profiling

• Method papers present evaluations using many different metrics and simulated data sets (snapshots), and are therefore difficult to compare
• It remains unclear to everyone which tools are most suitable for a particular task and for particular data sets
• Comparative benchmarking requires extensive resources and has its pitfalls

Towards a comprehensive, independent and unbiased evaluation of computational metagenome analysis methods

SLIDE 3

SLIDE 4

SLIDE 5

First CAMI challenge

• Benchmark assembly, (taxonomic) binning, and taxonomic profiling software
• Extensive, high-quality benchmark data sets from unpublished data
• Publication with participants and data contributors

Aims

• Standards
• Overview of tools and use cases
• Indicate promising directions for tool development
• Suggestions for experimental design
• Facilitate future benchmarking

Contest opened in 2015

SLIDE 6

The most important challenge: Getting developers and the broader community involved

• Spreading the news
• Data sets must be "exciting" and reflect what people would like to see
• Benchmarking should be made easy for developers
• Tool results must be reproducible


SLIDE 7

Key principles

• All design decisions (data sets, evaluation measures, and principles) should involve the community
• Data sets should be as realistic as possible
• Evaluation measures should be informative to developers and also understandable by the applied community
• Reproducibility (data generation, tools, evaluation)
• Participants should not see any of the data beforehand
• Evaluation using anonymized tool names for as long as possible

SLIDE 8

"Spreading the word" & Community Involvement

• Google+ community
• ISME Roundtable, hackathons & workshops
• Announcements in blogs & tweets
• www.cami-challenge.org with newsletter


SLIDE 9

Challenge Data sets – Design principles

• Challenging
• Common experimental setups and community types
• Unpublished data
• Strain-level variation
• Different taxonomic distances to sequenced genomes (deep branchers included)
• State-of-the-art sequencing technologies
• Non-bacterial sequences
• Fully reproducible with the CAMI benchmark data set generation pipeline
• Freely accessible toy data sets provided upfront


SLIDE 10

CAMI Datasets

• CAMI_low: 1 sample, 15 Gbp, 2 × 150 bp reads, insert size 270 bp
• CAMI_medium (differential abundance): 2 samples, 40 Gbp, 2 × 150 bp reads, insert sizes 270 bp & 5 kbp
• CAMI_high (time series): 5 samples, 75 Gbp, 2 × 150 bp reads, insert size 270 bp

Data sets simulated from ~700 unpublished microbial genomes and additional sequence material
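As a rough sanity check on these figures, the number of simulated read pairs in each data set follows directly from the total base count and the 2 × 150 bp read layout. The snippet below is a minimal sketch under the assumption that every base comes from paired 150 bp reads; the data set names are from the slide, everything else is illustrative.

```python
# Back-of-the-envelope read-pair counts implied by the stated data set sizes.
# Assumes every base comes from 2 x 150 bp paired-end reads (as on the slide).

READ_LENGTH = 150                  # bp per read
BASES_PER_PAIR = 2 * READ_LENGTH   # bp contributed by one read pair

datasets = {
    # name: (number of samples, total size in Gbp)
    "CAMI_low": (1, 15),
    "CAMI_medium": (2, 40),
    "CAMI_high": (5, 75),
}

for name, (samples, gbp) in datasets.items():
    read_pairs = gbp * 1e9 / BASES_PER_PAIR
    print(f"{name}: ~{read_pairs / 1e6:.0f} million read pairs "
          f"(~{read_pairs / samples / 1e6:.0f} million per sample)")
```

For CAMI_low this works out to roughly 50 million read pairs, which is the scale assemblers and binners had to handle per sample.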

SLIDE 11

Timeline of the first CAMI challenge

SLIDE 12

CAMI participants

https://data.cami-challenge.org
Early 2015: already >40 registered participants (currently 128)

SLIDE 13

Reproducibility and Standardization

Barton et al., GigaScience 2015

• Standard formats for binning and profiling (see the sketch after this list)
• Standard interfaces for tool execution
• Bioboxes (Docker containers) for tools and metrics
• Currently 25 tools in bioboxes – semi-automatic benchmarking in future challenges
• Challenge organization and benchmarking framework
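To make "standard formats" concrete: the CAMI/bioboxes binning format is a tab-separated table with a small '@'-prefixed header that maps each sequence (read or contig) to a bin. The sketch below writes such a file; the header fields follow the bioboxes binning format as commonly described, but the sample, contig, and bin names are made-up examples, and the published format specification remains the authoritative reference.

```python
# Minimal sketch: write genome-binning results in a CAMI/bioboxes-style
# tab-separated format. Header lines start with '@', the column header
# with '@@'. All identifiers below are hypothetical example data.

def write_binning(path, sample_id, assignments):
    """assignments: dict mapping sequence ID (read/contig) -> bin ID."""
    with open(path, "w") as out:
        out.write("@Version:0.9.0\n")
        out.write(f"@SampleID:{sample_id}\n")
        out.write("@@SEQUENCEID\tBINID\n")
        for seq_id, bin_id in sorted(assignments.items()):
            out.write(f"{seq_id}\t{bin_id}\n")

# Hypothetical usage:
write_binning(
    "example_binning.tsv",
    sample_id="CAMI_low",
    assignments={
        "contig_0001": "bin.1",
        "contig_0002": "bin.1",
        "contig_0003": "bin.2",
    },
)
```

Taxonomic binning and profiling submissions use the same '@'-header convention with additional taxon columns; keeping every submission in one machine-readable layout is what makes the semi-automatic benchmarking mentioned above possible.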

SLIDE 14

Further Challenges

• Challenge design
  • Repeat with the current setup, or a new challenge setup (e.g. using real samples)?
  • Define challenge questions & performance metrics very specifically from the start, or leave flexibility?
  • Include new tool categories (predicting pathogens, antibiotic resistance genes)?
  • Measure runtimes
• Contest implementation (communication is an issue for a geographically distributed team)
• Challenge data set generation – get data generators on board; not many people routinely isolate hundreds of microbes for sequencing


SLIDE 15

Alexander Sczyrba, Peter Belmann, Liren Huang, Alice McHardy, Johannes Dröge, Ivan Gregor, Peter Hofmann, Jessika Fiedler, Stephan Majda, Eik Dahms, Stefan Janssen, Adrian Fritz, Andreas Bremges, Ruben Garrido-Oter, Tanja Woyke, Nikos Kyrpides, Eddy Rubin, Paul Schulze-Lefert, Yang Bai, Hans-Peter Klenk, Phil Blood, Mihai Pop, Chris Hill, Aaron Darling, Matthew DeMaere, Thomas Rattei, Dmitri Turaev, Julia Vorholt, Michael Barton, Tue Sparholt Jørgensen, Lars Hestbjerg Hansen, Søren J. Sørensen, David Koslicki, Isaac Newton Institute for Mathematical Sciences