NGS Data Analysis: Methods and Protocols


  1. NGS Data Analysis: Methods and Protocols

  2. Shifting Paradigms
     - A thousand years ago: science was empirical, describing natural phenomena
     - Last few hundred years: a theoretical branch using models and generalizations
     - Last few decades: a computational branch simulating complex phenomena
     - Today: data exploration (eScience) unifies theory, experiment, and simulation
       - Data captured by instruments or generated by simulators
       - Processed by software
       - Information/knowledge stored in computers
       - Scientists analyze databases/files using data management and statistics

     Jim Gray on eScience, The Fourth Paradigm, Microsoft Research, 2009

     NGS Data Analysis Training Workshop - Part I, 11/11/2015

  3. Big Data Biology
     - The term “Big Data” is not only about size:
       - Speed
       - Volume
       - Computational and analytical capacity to manage data and derive insight
     - The “Fourth Paradigm” is at hand in the life sciences
       - The analysis of massive data sets

  4. “It’s the data, stupid”
     - It’s a new scientific methodology based on the power of data-intensive science
       - Capturing
       - Curation, and
       - Analysis of large data
     - The goal, Dr. Gray insisted, was not to have the biggest, fastest single computer, but rather “to have a world in which all of the science literature is online, all of the science data is online, and they interoperate with each other.”
     - At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order, but of dimensionally agnostic statistics.

  5. Big Data Biology
     - Moving from traditional small-scale, focused experiments to more hypothesis-neutral studies
     - Small biology labs can become
       - Big data generators
       - Big data users

  6. The story so far…
     “We can know more than we can tell.” Michael Polanyi (1891-1976)

     [Figure: publication counts for "Grid Computing" vs. "Cloud Computing", from PubMed (Title/Abstract) and SCOPUS (Life and Health sciences). Annotation: 2007-2008, sequencers begin giving flurries of data.]

  7. Words of the story...
     - 391 abstracts from PubMed
     - 4,770 unique terms
     - Common terms: comput, data, system, provid, technolog, applic, resour, analysi
     - Grid terms: grid, model, distribut, bioinformat, molecular
     - Cloud terms: cloud, servic, sequenc, health, genom

     [Figures: word cloud for “Grid” abstracts; word cloud for “Cloud” abstracts]

  8. Any field in particular?
     - Research areas from SCOPUS

     [Figure: publication counts by SCOPUS research area, one chart for “Grid” and one for “Cloud”. Areas shown: Medicine; Biochemistry, Genetics and Molecular Biology; Health Professions; Multidisciplinary; Agricultural and Biological Sciences; Pharmacology, Toxicology and Pharmaceutics; Nursing; Environmental Science; Immunology and Microbiology; Neuroscience.]

  9. Making the bridge…
     “Ba: Knowledge creation requires a time and place in which people share knowledge and work together as a community.” Kitaro Nishida

     [Figures: “Grid computing” in 2004; “Cloud computing” in 2014]

  10. [Figure]

  11. NGS pushes bioinformatics needs up
      - Need for a large amount of CPU power
        - Informatics groups must manage compute clusters
        - Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment
        - Another level of software complexity and challenges to interoperability
      - VERY large text files (~10 million lines long)
        - Can’t do “business as usual” with familiar tools such as Perl/Python
        - Impossible memory usage and execution time
        - Impossible to browse for problems
      - Need for sequence quality filtering
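
The memory problem above mostly bites scripts that load a whole read file at once. A minimal sketch of the alternative, streaming one record at a time while applying a quality filter, might look like this. It assumes four-line FASTQ records with Phred+33 quality encoding; the function names and the mean-quality threshold of 20 are illustrative choices, not part of the workshop material.

```python
import io
from typing import Iterator, TextIO, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield one 4-line FASTQ record at a time; the file is never held in memory."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the stream
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Average Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(handle: TextIO, min_mean_q: float = 20.0) -> Iterator[Record]:
    """Pass through only the reads whose mean quality reaches the threshold."""
    for header, seq, qual in read_fastq(handle):
        if qual and mean_phred(qual) >= min_mean_q:
            yield header, seq, qual


# Tiny in-memory stand-in for a real file opened with open(path).
demo = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"  # 'I' encodes Phred 40
    "@read2\nACGT\n+\n!!!!\n"  # '!' encodes Phred 0
)
kept = [header for header, _, _ in filter_reads(demo)]
print(kept)  # prints ['@read1']
```

Because every function is a generator, memory use stays roughly constant regardless of how many million lines the input has; execution time still grows with file size.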

  12. Data Management Issues
      - Raw data are large. How long should they be kept?
      - Processed data are manageable for most people
        - 20 million reads (50 bp) ~ 1 GB
      - More of an issue for a facility: HiSeq recommends 32 CPU cores, each with 4 GB RAM
      - Certain studies are much more data-intensive than others
        - Whole genome sequencing
          - A 30X coverage genome pair (tumor/normal) ~ 500 GB
          - 50 genome pairs ~ 25 TB
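
The size estimates on this slide can be checked with back-of-the-envelope arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes), one byte per base call, and a human genome of roughly 3.1 billion bases; the genome size and all names are assumptions added for illustration.

```python
GB = 10**9
TB = 10**12


def base_bytes(n_reads: int, read_length: int) -> int:
    """Bytes needed for the base calls alone, at one byte per base."""
    return n_reads * read_length


# 20 million reads of 50 bp: the slide's ~1 GB figure.
small_run = base_bytes(20_000_000, 50)
print(small_run / GB)  # prints 1.0

# Bases in a 30X tumor/normal pair, assuming a ~3.1 Gbp human genome.
HUMAN_GENOME_BP = 3_100_000_000  # assumption, not from the slide
pair_bases = 2 * 30 * HUMAN_GENOME_BP
print(pair_bases / GB)  # prints 186.0

# The slide's on-disk estimate per pair, scaled to 50 pairs.
PAIR_ON_DISK = 500 * GB
cohort = 50 * PAIR_ON_DISK
print(cohort / TB)  # prints 25.0
```

Under these assumptions the base calls account for about 186 GB per pair; the remainder of the slide's ~500 GB estimate presumably covers quality scores, read identifiers, and derived files.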

  13. So what?
      - In NGS we have to process very large amounts of data, which is not trivial in computing terms.
      - Big NGS projects require supercomputing infrastructure
      - Or put another way: not everyone can study everything.
        - Small facilities must carefully choose projects that match their computing capabilities.

  14. Intermediate Solution #1: Cloud Computing
      - Pros:
        - Flexibility
        - You pay for what you use
        - No need to maintain a data center
      - Cons:
        - Transferring big datasets over the internet is slow
        - You pay for consumed bandwidth, which is a problem with big datasets
        - Lower performance, especially in disk read/write
        - Privacy/security concerns
        - More expensive for big and long-term projects
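
To put the transfer-time concern in numbers, the sketch below estimates upload time for the dataset sizes quoted in the Data Management slide. The sustained link speed of 100 Mbit/s is a hypothetical value chosen for illustration, and decimal units are assumed throughout.

```python
def transfer_hours(size_gb: float, link_mbit_per_s: float) -> float:
    """Hours needed to move size_gb gigabytes over a link of the given speed."""
    bits = size_gb * 1e9 * 8                  # gigabytes -> bits
    seconds = bits / (link_mbit_per_s * 1e6)  # Mbit/s -> bit/s
    return seconds / 3600


# One ~500 GB tumor/normal genome pair over an assumed 100 Mbit/s link.
pair_hours = transfer_hours(500, 100)
print(f"{pair_hours:.1f} h")  # prints 11.1 h

# Fifty pairs (~25 TB) over the same link, in days.
cohort_days = transfer_hours(25_000, 100) / 24
print(f"{cohort_days:.1f} days")  # prints 23.1 days
```

The estimate ignores protocol overhead and link contention, so real transfers would likely take longer.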

  15. Intermediate Solution #2: Grid Computing
      - Pros:
        - Cheaper
        - More resources available
      - Cons:
        - Heterogeneous environment
        - Slow connectivity
        - Much time required to find good resources in the grid

  16. AppDB: Ready-to-use Apps in EGI
      - The EGI Applications Database (AppDB) is a central service that stores and provides to the public information about:
        - software solutions for scientists and developers to use,
        - the programmers and scientists who developed them, and
        - the publications derived from the registered solutions

  17. What about the data?
      - There is a VT on this!
        - Support for dataset retrieval and replication in AppDB
        - Support for multiple versions and locations per dataset in AppDB

  18. Crossbow
      - Identifies SNPs from high-coverage, short-read resequencing data
      - Combines the aligner Bowtie and the SNP caller SOAPsnp
      - Hadoop MapReduce approach
      - Amazon EC2 / Local Cluster

  19. Rainbow
      - Large-scale Whole Genome Sequencing (WGS) analysis
      - Supports FASTQ and BAM input
      - Load balancing
      - Active workflow monitoring
      - Amazon EC2

  20. CloudMap
      - Greatly simplifies the analysis of mutant whole genome sequences
      - Offers predefined workflows to pinpoint variations in animal genomes
      - Available on the Galaxy web platform
      - Amazon EC2 / Local Cluster

  21. CloudBurst
      - Parallel read-mapping algorithm optimized for mapping NGS data to the human and other reference genomes
      - Modeled after the short-read mapping program RMAP
      - Parallelization overcomes computational barriers and allows deeper analysis
      - Hadoop MapReduce approach
      - Near-linear increase in performance with the number of CPU cores available
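
The MapReduce idea behind read mappers of this kind can be illustrated with a toy seed-and-extend example. This is not CloudBurst's code: it is a single-process sketch in which a dictionary stands in for Hadoop's shuffle phase, and the seed length, function names, and sequences are all made up for illustration.

```python
from collections import defaultdict

K = 4  # seed length (illustrative)


def map_reference(ref):
    """Map step: emit every k-mer of the reference, keyed by its sequence."""
    for i in range(len(ref) - K + 1):
        yield ref[i:i + K], ("ref", i)


def map_read(name, read):
    """Map step: emit non-overlapping seeds of a read, keyed by their sequence."""
    for j in range(0, len(read) - K + 1, K):
        yield read[j:j + K], ("read", name, j)


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(values, ref, reads, max_mismatches):
    """Reduce step: extend each shared seed to a full-length alignment.

    A real reducer would receive flanking sequence inside the values;
    here the full strings are passed in to keep the sketch short.
    """
    ref_positions = [v[1] for v in values if v[0] == "ref"]
    read_seeds = [(v[1], v[2]) for v in values if v[0] == "read"]
    for name, offset in read_seeds:
        read = reads[name]
        for pos in ref_positions:
            start = pos - offset
            if start < 0 or start + len(read) > len(ref):
                continue
            window = ref[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                yield name, start, mismatches


def map_reads(ref, reads, max_mismatches=1):
    pairs = list(map_reference(ref))
    for name, read in reads.items():
        pairs.extend(map_read(name, read))
    hits = set()  # the same alignment can be found via several seeds
    for values in shuffle(pairs).values():
        hits.update(reduce_seed(values, ref, reads, max_mismatches))
    return sorted(hits)


ref = "ACGTACGGTTCAGCTA"
reads = {"r1": "GGTTCAGC", "r2": "GGTTCTGC"}  # r2 carries one mismatch
hits = map_reads(ref, reads)
print(hits)  # prints [('r1', 6, 0), ('r2', 6, 1)]
```

Each seed key can be reduced independently of the others, which is what lets a framework spread the work across many cores or machines.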

  22. RSD-Cloud
      - Large comparative genomics analysis tool
      - Redesigned the reciprocal smallest distance (RSD) algorithm to run in a cloud computing environment
      - Fast and cost-efficient solution
      - Amazon EC2

  23. Cloud BioLinux
      - Publicly accessible VM
      - Platform for developing bioinformatics infrastructures on the cloud
      - Quick provision of on-demand infrastructures for HPC in bioinformatics
      - Pre-configured tools and GUI
      - Tested on Amazon EC2, Eucalyptus, Okeanos and VirtualBox

  24. CloVR
      - Portable VM
      - Several automated analysis pipelines for microbial genomics provided, including 16S, whole genome and metagenome sequence analysis
      - Runs on a local PC but also supports use of remote resources on multiple cloud computing platforms

  25. Mercury
      - Integration of multiple sequence analysis tools in a single DNAnexus-based platform
      - Simplified workflow construction GUI
      - Applet-based workflows
      - Amazon EC2 / Local Cluster
