NGS Data Analysis: Methods and Protocols

Shifting Paradigms
• Thousand years ago: science was empirical, describing natural phenomena
• Last few hundred years: a theoretical branch using models and generalizations
• Last few decades: a computational branch simulating complex phenomena
• Today: data exploration (eScience) unifies theory, experiment, and simulation
  • Data captured by instruments or generated by simulators
  • Processed by software
  • Information/knowledge stored in computers
  • Scientists analyze databases/files using data management and statistics
Jim Gray on eScience, The Fourth Paradigm, Microsoft Research, 2009
NGS Data Analysis Training Workshop - Part I, 11/11/2015

Big Data Biology
• The term “Big Data” is not only about size:
  • Speed
  • Volume
  • Computational and analytical capacity to manage data and derive insight
• The “Fourth Paradigm” is at hand in the life sciences: the analysis of massive data sets

“It’s the data, stupid”
• It’s a new scientific methodology based on the power of data-intensive science:
  • Capture
  • Curation, and
  • Analysis of large data
• The goal, Dr. Gray insisted, was not to have the biggest, fastest single computer, but rather “to have a world in which all of the science literature is online, all of the science data is online, and they interoperate with each other.”
• At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order, but of dimensionally agnostic statistics.

Big Data Biology
• Moving from traditional small-scale, focused experiments to more hypothesis-neutral studies
• Small biology labs can become
  • Big data generators
  • Big data users

The story so far…
“We can know more than we can tell” Michael Polanyi (1891-1976)
[Figure: publication counts over time for "Grid Computing" versus "Cloud Computing", from PubMed ([Title/Abstract] search) and from SCOPUS (Life and Health sciences). Annotation: 2007-2008, sequencers begin giving flurries of data.]

Words of the story...
• 391 abstracts from PubMed
• 4,770 unique terms
Common terms: comput, data, system, provid, technolog, applic, resour, analysi
Grid terms: grid, model, distribut, bioinformat, molecular
Cloud terms: cloud, servic, sequenc, health, genom
[Figures: word cloud for “Grid” abstracts; word cloud for “Cloud” abstracts]

Any field in particular?
• Research areas from SCOPUS
[Two-column chart, labelled Grid and Cloud. The column assignment of the entries did not survive extraction; they are listed in extraction order:]
Biochemistry, Genetics and Medicine (203) Molecular Biology (1,029) Biochemistry, Genetics and Medicine (201) Molecular Biology (116) Health Professions (109) Health Professions (85) Multidisciplinary (65) Multidisciplinary (69) Agricultural and Biological Agricultural and Biological Sciences (44) Sciences (42) Pharmacology, Toxicology Pharmacology, Toxicology and Pharmaceutics (23) and Pharmaceutics (21) Nursing (22) Environmental Science (13) Immunology and Environmental Science (10) Microbiology (12) Neuroscience (9) Nursing (11) Immunology and Neuroscience (9) Microbiology (1)

Making the bridge…
“Ba: Knowledge creation requires a time and place in which people share knowledge and work together as a community.” Kitaro Nishida
[Figures: “Grid computing” in 2004; “Cloud computing” in 2014]


NGS pushes bioinformatics needs up
• Need for a large amount of CPU power
• Informatics groups must manage compute clusters
• Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment
• Another level of software complexity and challenges to interoperability
• VERY large text files (~10 million lines long)
• Can’t do “business as usual” with familiar tools such as Perl/Python
• Impossible memory usage and execution time
• Impossible to browse for problems
• Need sequence quality filtering
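The memory problem with very large read files usually comes from loading the whole file at once. As a minimal sketch, assuming four-line FASTQ records with Phred+33 quality encoding, a generator can stream one record at a time and apply a simple mean-quality filter. The function names and the threshold of 20 are illustrative choices, not taken from the slides.

```python
import io
from typing import Iterator, TextIO, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield one 4-line FASTQ record at a time; the file is never fully loaded."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # empty string means end of file
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def quality_filter(handle: TextIO, min_mean_q: float = 20.0) -> Iterator[Record]:
    """Keep only reads whose mean quality reaches the (hypothetical) threshold."""
    for header, seq, qual in read_fastq(handle):
        if qual and mean_phred(qual) >= min_mean_q:
            yield header, seq, qual


if __name__ == "__main__":
    # Two toy reads: 'I' encodes Q40, '!' encodes Q0.
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n")
    for header, seq, qual in quality_filter(demo):
        print(header, seq)
```

Because each stage is a generator, memory use stays roughly constant regardless of file size; execution time is still linear in the number of reads.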

Data Management Issues
• Raw data are large. How long should they be kept?
• Processed data are manageable for most people
  • 20 million reads (50 bp) ~ 1 Gbyte
• More of an issue for a facility: HiSeq recommends 32 CPU cores, each with 4 GB RAM
• Certain studies are much more data intensive than others
  • Whole genome sequencing
  • A 30X coverage genome pair (tumor/normal) ~ 500 Gbyte
  • 50 genome pairs ~ 25 TB
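The figures on this slide follow from simple multiplication. The sketch below only reproduces that arithmetic; the constant and function names are illustrative, and decimal units (1 TB = 1000 GB) are an assumption. Note that 20 million reads of 50 bp give 1 Gbase, which matches the slide's ~1 Gbyte only as a rule of thumb of about one byte per base; real file sizes depend on format and compression.

```python
GB_PER_GENOME_PAIR = 500  # slide figure: one 30X tumor/normal pair


def bases_sequenced(reads: int, read_length_bp: int) -> int:
    """Total number of sequenced bases."""
    return reads * read_length_bp


def project_storage_tb(n_pairs: int, gb_per_pair: float = GB_PER_GENOME_PAIR) -> float:
    """Storage for a whole-genome project, in decimal terabytes."""
    return n_pairs * gb_per_pair / 1000


if __name__ == "__main__":
    print(bases_sequenced(20_000_000, 50))  # 1000000000 bases = 1 Gbase
    print(project_storage_tb(50))           # 25.0 TB, as on the slide
```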

So what?
• In NGS we have to process very large amounts of data, which is not trivial in computing terms.
• Big NGS projects require supercomputing infrastructure
• Or put another way: not everyone can study everything.
• Small facilities must carefully choose projects that match their computing capabilities.

Intermediate Solution #1: Cloud Computing
• Pros:
  • Flexibility
  • You pay for what you use
  • No need to maintain a data center
• Cons:
  • Transferring big datasets over the internet is slow
  • You pay for consumed bandwidth, which is a problem with big datasets
  • Lower performance, especially in disk read/write
  • Privacy/security concerns
  • More expensive for big and long-term projects
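To put the transfer problem in numbers, the sketch below estimates upload time for a dataset of a given size. The 25 TB figure is the 50-genome-pair estimate from the Data Management Issues slide; the link speeds are hypothetical, and the `efficiency` parameter is an assumed fudge factor for protocol overhead.

```python
def transfer_days(size_tb: float, mbit_per_s: float, efficiency: float = 1.0) -> float:
    """Days needed to move size_tb (decimal TB) over a link of mbit_per_s.

    efficiency is the fraction of nominal bandwidth actually achieved
    (1.0 = ideal link, an optimistic assumption).
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (mbit_per_s * 1e6 * efficiency)
    return seconds / 86400


if __name__ == "__main__":
    for speed in (100, 1000):  # hypothetical link speeds in Mbit/s
        print(f"25 TB at {speed} Mbit/s: {transfer_days(25, speed):.1f} days")
```

Even on an ideal 1 Gbit/s link the transfer takes more than two days, and on 100 Mbit/s it takes over three weeks, before any bandwidth charges are counted.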

Intermediate Solution #2: Grid Computing
• Pros:
  • Cheaper
  • More resources available
• Cons:
  • Heterogeneous environment
  • Slow connectivity
  • Much time required to find good resources in the grid

AppDB: Ready-to-use Apps in EGI
• The EGI Applications Database (AppDB) is a central service that stores and provides to the public information about:
  • software solutions for scientists and developers to use,
  • the programmers and the scientists who developed them, and
  • the publications derived from the registered solutions

What about the data?
• There is a VT on this!
• Support for dataset retrieval and replication in AppDB
• Support for multiple versions and locations per dataset in AppDB

Crossbow
• Identifies SNPs from high-coverage, short-read resequencing data
• Combines the aligner Bowtie and the SNP caller SOAPsnp
• Hadoop MapReduce approach
• Amazon EC2 / Local Cluster

Rainbow
• Large-scale Whole Genome Sequencing (WGS) analysis
• Supports FASTQ and BAM input
• Load balancing
• Active workflow monitoring
• Amazon EC2

CloudMap
• Greatly simplifies the analysis of mutant whole genome sequences
• Offers predefined workflows to pinpoint variations in animal genomes
• Available on the Galaxy web platform
• Amazon EC2 / Local Cluster

CloudBurst
• Parallel read-mapping algorithm optimized for mapping NGS data to the human and other reference genomes
• Modeled after the short read-mapping program RMAP
• Parallelization overcomes computational barriers and allows deeper analysis
• Hadoop MapReduce approach
• Performance scales almost linearly with the number of CPU cores available
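The MapReduce pattern behind this kind of read mapper can be illustrated in a few lines. What follows is a single-process toy, not CloudBurst's code: the map step emits k-mer seeds from the reference and from each read, the shuffle step groups records by seed, and the reduce step extends each seed hit and counts mismatches. The seed length `K`, the use of only the first `K` bases of each read as its seed, and all function names are simplifying assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

K = 4  # toy seed length (assumption; real tools derive it from read length)


def map_reference(ref: str) -> Iterator[Tuple[str, Tuple[str, int]]]:
    """Map step: emit (k-mer, ("ref", position)) for every reference k-mer."""
    for i in range(len(ref) - K + 1):
        yield ref[i:i + K], ("ref", i)


def map_read(name: str, read: str) -> Iterator[Tuple[str, Tuple[str, str]]]:
    """Map step: emit the read's first K bases as its only seed (simplification)."""
    yield read[:K], ("read", name)


def shuffle(pairs) -> Dict[str, list]:
    """Shuffle step: group all emitted values by their seed key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(values, ref: str, reads: Dict[str, str], max_mismatches: int) -> List[tuple]:
    """Reduce step: extend each read/reference seed pair and count mismatches.

    For brevity the full sequences are looked up here; in a distributed job
    the emitted values would have to carry the flanking sequence themselves.
    """
    positions = [v for tag, v in values if tag == "ref"]
    names = [v for tag, v in values if tag == "read"]
    hits = []
    for name in names:
        read = reads[name]
        for pos in positions:
            window = ref[pos:pos + len(read)]
            if len(window) == len(read):
                mismatches = sum(a != b for a, b in zip(read, window))
                if mismatches <= max_mismatches:
                    hits.append((name, pos, mismatches))
    return hits


def map_reads_to_reference(ref: str, reads: Dict[str, str], max_mismatches: int = 1) -> List[tuple]:
    pairs = list(map_reference(ref))
    for name, read in reads.items():
        pairs.extend(map_read(name, read))
    hits = []
    for values in shuffle(pairs).values():  # each group is independent work
        hits.extend(reduce_seed(values, ref, reads, max_mismatches))
    return sorted(hits)


if __name__ == "__main__":
    reference = "ACGTACGGTCA"
    toy_reads = {"r1": "ACGGTC", "r2": "GTACGA"}
    print(map_reads_to_reference(reference, toy_reads))
```

Each seed group can be reduced independently of the others, which is why adding workers can speed this pattern up. A read with a mismatch inside its single seed would be missed by this toy version.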

RSD-Cloud
• Large comparative genomics analysis tool
• Redesigned the reciprocal smallest distance (RSD) algorithm to run in a cloud computing environment
• Fast and cost-efficient solution
• Amazon EC2
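The reciprocal step of RSD can be sketched on a toy distance table. This is only an illustration of the reciprocity test: in the published method the distances are evolutionary distances estimated from sequence alignments, whereas here they are hypothetical numbers supplied directly, and the function names are invented for the example.

```python
from typing import Dict, List, Optional, Tuple

Distances = Dict[Tuple[str, str], float]


def closest(gene: str, candidates: List[str], dist: Distances) -> Optional[str]:
    """Return the candidate with the smallest distance to gene, if any."""
    scored = [(dist[(gene, c)], c) for c in candidates if (gene, c) in dist]
    return min(scored)[1] if scored else None


def reciprocal_smallest_distance(
    genes_a: List[str], genes_b: List[str], dist: Distances
) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's closest gene in B and a is b's closest in A."""
    # Make the lookup symmetric so it works in both directions.
    sym = dict(dist)
    sym.update({(b, a): d for (a, b), d in dist.items()})
    pairs = []
    for a in genes_a:
        b = closest(a, genes_b, sym)
        if b is not None and closest(b, genes_a, sym) == a:
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    toy = {
        ("a1", "b1"): 0.1, ("a1", "b2"): 0.5,
        ("a2", "b1"): 0.2, ("a2", "b2"): 0.3,
    }
    print(reciprocal_smallest_distance(["a1", "a2"], ["b1", "b2"], toy))
```

In the toy table, `a2` is closest to `b1`, but `b1` is closer to `a1`, so only the pair (`a1`, `b1`) passes the reciprocity test. Each query gene can be processed independently, which is what makes the method a natural fit for a cloud environment.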

Cloud BioLinux
• Publicly accessible VM
• Platform for developing bioinformatics infrastructures on the cloud
• Quick provision of on-demand infrastructures for HPC in bioinformatics
• Pre-configured tools and GUI
• Tested on Amazon EC2, Eucalyptus, Okeanos and VirtualBox

CloVR
• Portable VM
• Several automated analysis pipelines for microbial genomics provided, including 16S, whole genome and metagenome sequence analysis
• Runs on a local PC but also supports use of remote cloud computing resources on multiple cloud computing platforms

Mercury
• Integration of multiple sequence analysis tools in a single DNAnexus-based platform
• Simplified workflow construction GUI
• Applet-based workflows
• Amazon EC2 / Local Cluster
