

1. Why Aren't We Benchmarking Bioinformatics?
Joe Parker
Early Career Research Fellow (Phylogenomics)
Department of Biodiversity Informatics and Spatial Analysis
Royal Botanic Gardens, Kew, Richmond TW9 3AB
joe.parker@kew.org

2. Outline
• Introduction
• Brief history of Bioinformatics
• Benchmarking in Bioinformatics
• Case study 1: Typical benchmarking across environments
• Case study 2: Mean-variance relationship for repeated measures
• Conclusions: implications for statistical genomics

3. A (very) brief history of bioinformatics
Kluge & Farris (1969) Syst. Zool. 18: 1-32

4. A (very) brief history of bioinformatics
Stewart et al. (1987) Nature 330: 401-404

5. A (very) brief history of bioinformatics
ENCODE Consortium (2012) Nature 489: 57-74

6. A (very) brief history of bioinformatics
Kluge & Farris (1969) Syst. Zool. 18: 1-32

7. Benchmarking to biologists
• Benchmarking is understood as a comparative process, i.e. ‘which software is best?’ / ‘which platform?’
• Benchmarking of application logic / profiling is largely unknown
• Environments / runtimes are generally either assumed to be identical, or else loosely categorised into ‘laptops vs. clusters’

8. Case Study 1
aka ‘Which program’s the best?’

9. Bioinformatics environments are very heterogeneous
• Laptop:
  – Portable
  – Very costly form-factor
  – Maté? Beer?
• Raspberry Pi:
  – Low cost and energy (& power) draw
  – Highly portable
  – Hackable form-factor
• Clusters:
  – Not portable
  – Setup costs
• The cloud:
  – Power closely linked to budget (capacity is only as limited as your budget)
  – Almost infinitely scalable
  – Have to have a connection to get data up there (and down!)
  – Fiddly setup

10. Benchmarking to biologists

11. Comparison

System             | Arch  | CPU type @ clock (GHz) | Cores | RAM (GB) | Storage (GB / type)
Haemodorum         | i686  | Xeon E5620 @ 2.4       | 8     | 33       | 1000 / SATA
Raspberry Pi 2 B+  | ARMv7 | ARM @ 1.0              | 1     | 1        | 8 / flash card
Macbook Pro (2011) | x64   | Core i7 @ 2.2          | 4     | 8        | 250 / SSD
EC2 m4.10xlarge    | x64   | Xeon E5 @ 2.4          | 40    | 160      | 320 / SSD

12. Reviewing / comparing new methods
• Biological problems often scale horribly and unpredictably
• Algorithmic analyses alone don't tell us enough
• So we take empirical measurements on different problem sets to predict how problems will scale… (see the sketch below)
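A minimal sketch of what such an empirical scaling measurement could look like, assuming a hypothetical run_tool.sh wrapper and FASTA inputs of increasing size (none of these names come from the talk):

    import subprocess
    import time

    import numpy as np

    # Input sizes to test (e.g. number of query sequences).
    sizes = [1_000, 5_000, 10_000, 50_000]
    times = []
    for n in sizes:
        start = time.perf_counter()
        subprocess.run(["./run_tool.sh", f"input_{n}.fasta"], check=True)
        times.append(time.perf_counter() - start)

    # Fit log(t) = k*log(n) + c: the slope k is the empirical scaling exponent.
    k, c = np.polyfit(np.log(sizes), np.log(times), 1)
    print(f"empirical scaling is roughly O(n^{k:.2f})")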

13. Workflow
• Setup: set up workflow, binaries, and reference / alignment data; deploy to machines.
• BLAST 2.2.30: protein-protein BLAST of short reads (from the MG-RAST repository, Bass Strait oil field) against 458 core eukaryote genes from CEGMA; keep only top hits; use the max. num_threads available.
• Concatenate hits: append top-hit sequences to the CEGMA alignments.
• MUSCLE 3.8.31: for each alignment, align in MUSCLE using default parameters.
• RAxML 7.2.8+: infer a de novo phylogeny under Dayhoff, with a random starting tree and max. PTHREADS.
• Output and parse times. (A timing sketch follows below.)
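A minimal timing harness for this kind of workflow might look like the sketch below; the command-line flags shown are illustrative assumptions, not the exact invocations used in the study:

    import subprocess
    import time

    # Illustrative per-step commands; file names and flags are assumptions.
    steps = {
        "blastp": ["blastp", "-query", "reads.fasta", "-db", "cegma_core",
                   "-num_threads", "8", "-out", "hits.tsv"],
        "muscle": ["muscle", "-in", "hits.fasta", "-out", "aligned.fasta"],
        "raxml":  ["raxmlHPC-PTHREADS", "-T", "8", "-p", "12345",
                   "-m", "PROTGAMMADAYHOFF", "-s", "aligned.phy", "-n", "run1"],
    }

    for name, cmd in steps.items():
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # identical logic on every machine
        print(f"{name}\t{time.perf_counter() - start:.1f} s")

Running the same harness unchanged on each platform is what makes the per-step times comparable across environments.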

14. Results - BLASTP

15. Results - RAxML

16. Case Study 2
aka ‘What the hell’s a random seed?’

17. Mean-variance plot for sitewise lnL estimates in PAML (n=10)
[Scatter plot: per-site mean log-likelihood (x-axis, -50 to -10) against variance across replicates (y-axis, 0.000 to 0.014).]
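One way to produce such a plot: re-run the same sitewise likelihood calculation with different random seeds and compute per-site means and variances. A sketch, assuming a hypothetical parse_sitewise_lnl helper and replicate output layout (PAML's actual output files and formats vary):

    import numpy as np
    import matplotlib.pyplot as plt

    def parse_sitewise_lnl(path):
        """Hypothetical parser: one sitewise lnL value per line."""
        return np.loadtxt(path)

    # Rows = replicate runs (differing only in random seed), columns = sites.
    runs = np.array([parse_sitewise_lnl(f"replicate_{i}/sitewise_lnL.txt")
                     for i in range(10)])

    means = runs.mean(axis=0)             # per-site mean log-likelihood
    variances = runs.var(axis=0, ddof=1)  # per-site variance across seeds

    plt.scatter(means, variances, marker="+")
    plt.xlabel("Mean log-likelihood")
    plt.ylabel("Variance")
    plt.savefig("mean_variance.png")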

18. Properly benchmarking workflows
• Analyses often ignore limiting steps
  – in many workflows the limiting step might actually be data cleaning / parsing / transformation
• Or (the most common error) inefficient iteration
• Or even disk I/O!
(A profiling sketch follows below.)
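Profiling answers this directly. A minimal sketch using Python's built-in cProfile, with hypothetical stage functions standing in for a real pipeline:

    import cProfile
    import pstats

    # Hypothetical stand-ins for real pipeline stages.
    def clean_and_parse():
        pass  # often the actual bottleneck

    def run_analysis():
        pass

    def write_results():
        pass  # disk I/O can dominate too

    def pipeline():
        clean_and_parse()
        run_analysis()
        write_results()

    cProfile.run("pipeline()", "profile.out")
    # Rank calls by cumulative time to find the limiting step.
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)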

19. Workflow benchmarking is very rare
• Many bioinformatics workflows / pipelines are limited at odd steps, parsing, etc.
• Exceptions exist, e.g. http://beast.bio.ed.ac.uk/benchmarks and many bioinformatics papers
• More harm than good?

20. Conclusion
• Biologists and error
• Current practice
• Help!
• Interesting challenges too…
Thanks: RBG Kew, BI&SA, Mike Chester
