NGS Data Analysis
Methods and Protocols
Shifting Paradigms
A thousand years ago:
science was empirical, describing natural phenomena
Last few hundred years:
theoretical branch using models, generalizations
Last few decades:
a computational branch simulating complex phenomena
Today: data exploration (eScience)
unify theory, experiment, and simulation
- Data captured by instruments or generated by simulator
- Processed by software
- Information/knowledge stored in computer
- Scientist analyzes database/files using data management and statistics
Jim Gray on eScience, The Fourth Paradigm, Microsoft Research, 2009
11/ 11/ 2015
NGS Data Analysis Training Workshop - Part I
Big Data Biology
The term “Big Data” is not only about size:
- Speed
- Volume
- Computational and analytical capacity to manage data and derive insight
The “Fourth Paradigm” is at hand in Life Sciences:
the analysis of massive data sets
“It’s the data, stupid”
It’s a new scientific methodology based on the power of data-intensive science
Capture, curation, and analysis of large data
The goal, Dr. Gray insisted, was not to have the biggest, fastest single computer, but rather “to have a world in which all of the science literature is online, all of the science data is online, and they interoperate with each other.”
At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order, but of dimensionally agnostic statistics.
Big Data Biology
Moving from traditional small-scale, focused experiments to more hypothesis-neutral studies
Small biology labs can become
- Big data generators
- Big data users
The story so far…
[Figure: publication counts for "Grid Computing" vs. "Cloud Computing", from PubMed ([Title/Abstract] search) and from SCOPUS (Life and Health sciences)]
“We can know more than we can tell”
Michael Polanyi (1891-1976)
2007-2008: sequencers begin giving flurries of data
Words of the story...
391 abstracts from PubMed, 4,770 unique terms
Grid terms
- grid
- model
- distribut
- bioinformat
- molecular
Cloud terms
- cloud
- servic
- sequenc
- health
- genom
Common terms
- comput
- data
- system
- provid
- technolog
- applic
- resour
- analysi
[Figures: word cloud for “Grid” abstracts; word cloud for “Cloud” abstracts]
Any field in particular?
Research areas from SCOPUS
Grid:
- Biochemistry, Genetics and Molecular Biology (1,029)
- Medicine (201)
- Health Professions (109)
- Multidisciplinary (65)
- Agricultural and Biological Sciences (44)
- Pharmacology, Toxicology and Pharmaceutics (23)
- Nursing (22)
- Environmental Science (10)
- Neuroscience (9)
- Immunology and Microbiology (1)
Cloud:
- Medicine (203)
- Biochemistry, Genetics and Molecular Biology (116)
- Health Professions (85)
- Multidisciplinary (69)
- Agricultural and Biological Sciences (42)
- Pharmacology, Toxicology and Pharmaceutics (21)
- Environmental Science (13)
- Immunology and Microbiology (12)
- Nursing (11)
- Neuroscience (9)
Making the bridge…
[Figures: “Grid computing” in 2004; “Cloud computing” in 2014]
“Ba: Knowledge creation requires a time and place in which people share knowledge and work together as a community.”
Kitaro Nishida
NGS pushes bioinformatics needs up
Need for a large amount of CPU power
- Informatics groups must manage compute clusters
- Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment
- Another level of software complexity and challenges to interoperability
VERY large text files (~10 million lines long)
- Can’t do “business as usual” with familiar tools such as Perl/Python
- Impossible memory usage and execution time
- Impossible to browse for problems
Need sequence quality filtering
Data Management Issues
Raw data are large. How long should they be kept?
Processed data are manageable for most people
- 20 million reads (50 bp) ~ 1 GB
More of an issue for a facility: HiSeq recommends 32 CPU cores, each with 4 GB RAM
Certain studies are much more data intensive than others
- Whole genome sequencing
- A 30X coverage genome pair (tumor/normal) ~ 500 GB
- 50 genome pairs ~ 25 TB
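The storage figures above follow from simple multiplication. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB) and using the slide's approximate size per genome pair; the function name is hypothetical:

```python
# Approximate size of one 30X tumor/normal genome pair, taken from the slide.
GB_PER_GENOME_PAIR = 500


def storage_tb(n_pairs, gb_per_pair=GB_PER_GENOME_PAIR):
    """Return the approximate storage need in TB, assuming 1 TB = 1000 GB."""
    return n_pairs * gb_per_pair / 1000


# 50 genome pairs, as in the slide's example.
print(storage_tb(50))
```

The same function can be reused to budget disk space for a planned cohort before sequencing starts.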
So what?
In NGS we have to process very large amounts of data, which is not trivial in computing terms.
Big NGS projects require supercomputing infrastructure
Or put another way: not everyone can study everything.
- Small facilities must carefully choose projects that scale with their computing capabilities.
Intermediate Solution #1: Cloud Computing
Pros:
- Flexibility
- You pay for what you use
- No need to maintain a data center
Cons:
- Transferring big datasets over the internet is slow
- You pay for consumed bandwidth, which is a problem with big datasets
- Lower performance, especially in disk read/write
- Privacy/security concerns
- More expensive for big and long-term projects
Intermediate Solution #2: Grid Computing
Pros:
- Cheaper
- More resources available
Cons:
- Heterogeneous environment
- Slow connectivity
- Much time required to find good resources in the grid
AppDB: Ready-to-use Apps in EGI
The EGI Applications Database (AppDB) is a central service that stores and provides to the public information about:
- software solutions for scientists and developers to use,
- the programmers and the scientists who developed them, and
- the publications derived from the registered solutions
What about the data?
There is a VT on this!
- Support for dataset retrieval and replication in AppDB
- Support for multiple versions and locations per dataset in AppDB
Crossbow
- Identifies SNPs from high-coverage, short-read resequencing data
- Combines the aligner Bowtie and the SNP caller SOAPsnp
- Hadoop MapReduce approach
- Amazon EC2 / Local Cluster
Rainbow
- Large scale Whole Genome Sequencing (WGS) analysis
- Supports FASTQ and BAM input
- Load balancing
- Active workflow monitoring
- Amazon EC2
CloudMap
- Greatly simplifies the analysis of mutant whole genome sequences
- Offers predefined workflows to pinpoint variations in animal genomes
- Available on the Galaxy web platform
- Amazon EC2 / Local Cluster
CloudBurst
- Parallel read-mapping algorithm optimized for mapping NGS data to the human and other reference genomes
- Modeled after the short read-mapping program RMAP
- Parallelization overcomes computational barriers and allows deeper analysis
- Hadoop MapReduce approach
- Performance increases almost linearly with the number of CPU cores available
RSD-Cloud
- Large comparative genomics analysis tool
- Redesigned the reciprocal smallest distance algorithm (RSD) to run on a cloud computing environment
- Fast and cost efficient solution
- Amazon EC2
Cloud BioLinux
- Publicly accessible VM
- Platform for developing bioinformatics infrastructures on the cloud
- Quick provision of on-demand infrastructures for HPC in bioinformatics
- Pre-configured tools and GUI
- Tested on Amazon EC2, Eucalyptus, Okeanos and VirtualBox
CloVR
- Portable VM
- Several automated analysis pipelines for microbial genomics provided, including 16S, whole genome and metagenome sequence analysis
- Runs on a local PC but also supports use of remote cloud computing resources on multiple cloud computing platforms.
Mercury
- Integration of multiple sequence analysis tools in a single DNAnexus-based platform
- Simplified workflow construction GUI
- Applet-based workflows
- Amazon EC2 / Local Cluster
Galaxy
Offers genome analysis resources for cloud computing platforms
- Amazon EC2
- VirtualBox
- Eucalyptus
- Okeanos
Freely available and community maintained software images and data repositories
Widely adopted in the bioinformatics community
The take-home points…
- Life Sciences and Big Data are irrevocably linked
- A lot of Life Sciences infrastructure projects (ELIXIR, LifeWatch, etc.) are already looking towards Grid/Cloud solutions
- Although the techniques are here to stay, there is a narrow window of opportunity for researchers to stay ahead of the curve
- If interested, do ask for more…
Sequencing Technology
Changes and Timing past decade
Overview of costs (past, present and near future)
Steps in sequencing experiments
NGS analysis workflow
The three stages of NGS data analysis
We will focus mostly on the first two…
NGS Applications are sequencing applications
- Whole Genome Sequencing
- Gene Regulation
- Epigenetic Changes
- Metagenomics
- Paleogenomics
- Transcriptome Analysis
- Resequencing
- …
Why QC and preprocessing
Sequencer output
- Reads + quality
Natural questions
- Is the quality of my sequenced data OK?
- If something is wrong, can I fix it?
Problem: HUGE files
Sequencing Data Formats
Quality before content
What is quality?
Trace File (high quality)
Trace File (Medium Quality)
Trace File (Low Quality)
Phred Quality Scores
Phred is a program that assigns a quality score to each base in a sequence. These scores can then be used to trim bad data from the reads, and to determine how good an overlap actually is
Phred scores are logarithmically related to the probability of an error:
- a score of 10 means a 10% error probability, 20 means a 1% chance, 30 means a 0.1% chance, etc.
A score of 30 is usually considered the minimum acceptable score.
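The logarithmic relationship is the standard Phred definition, Q = -10 * log10(P), where P is the probability that the base call is wrong. A small sketch of the conversion in both directions (the function names are hypothetical, chosen for this illustration):

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P into a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Print the error probability for the scores mentioned on the slide.
for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

Each step of 10 in the score corresponds to a tenfold drop in error probability.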
FASTQ File Format
Each read is represented by four lines:
1. @ followed by read ID
2. Sequence
3. + optionally followed by repeated read ID
4. Quality line
- Same length as sequence
- Each character encodes the quality of the respective base
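The four-line layout can be read with a few lines of Python. This is a minimal sketch, not a production parser: it assumes strictly four lines per record and the Phred+33 character encoding (an assumption; the slides do not state which offset is used), and `read_fastq` is a hypothetical helper name.

```python
import io


def read_fastq(handle, offset=33):
    """Yield (read_id, sequence, qualities) from a four-line-per-read FASTQ stream.

    `offset` is the ASCII offset of the quality encoding (33 is assumed here).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input
            break
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality line lengths differ")
        # Each quality character encodes the score of the base at the same position.
        yield header[1:], seq, [ord(c) - offset for c in qual]


# A made-up single-read example.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)
```

For real files the same generator can be fed an open file handle, so reads are processed one at a time instead of loading a huge file into memory.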
FASTQC
As the name implies, FastQC is a way to quickly see some summary statistics to check the quality of your NGS run.
It runs both as a GUI (requires Java) and as a command line program.
Provides several statistics:
- Per base sequence quality
- Per sequence quality scores
- Per base sequence and GC content
- Per sequence GC content
- etc.
Trimming
Knowing quality → act accordingly
Adapter trimming
- May increase mapping rates
- Absolutely essential for small RNA
- Probably improves de novo assemblies
Quality trimming
- May increase mapping rates
- May also lead to loss of information
Lots of software:
- Cutadapt, Trim Galore!, PRINSEQ, etc.
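To make the two kinds of trimming concrete, here is a deliberately simplified sketch. It is not the algorithm used by Cutadapt, Trim Galore! or PRINSEQ: the quality trimmer only removes trailing bases below a threshold, and the adapter trimmer only handles an exact, full-length adapter match. Function names, the threshold and the adapter sequence are illustrative assumptions.

```python
def trim_3prime_quality(seq, quals, min_q=20):
    """Drop trailing bases whose quality is below min_q (simplified 3' trimming)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def trim_adapter_exact(seq, adapter):
    """Cut the read at the first exact occurrence of the adapter, if present."""
    idx = seq.find(adapter)
    return seq if idx == -1 else seq[:idx]


# Made-up read: the last two bases have low quality.
print(trim_3prime_quality("ACGTAC", [30, 32, 31, 25, 12, 8]))
# Made-up adapter sequence appended to a short insert.
print(trim_adapter_exact("ACGTTTGGAATTCTCGG", "TGGAATTCTCGG"))
```

Note how quality trimming shortens the read: this is the "loss of information" trade-off mentioned above.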
Mapped Reads
Mapping: “align” these raw reads to a reference genome
- Single-end or paired-end data?
How would you align a short read to the reference?
- Old-school: Smith-Waterman, BLAST, BLAT, …
- Now: mapping tools for short reads that use intelligent indexing and allow mismatches
Short read applications
- Genotyping
- RNA-Seq, ChIP-Seq, Methyl-Seq, …
… always a problem
Finding the alignment is a computational bottleneck
Defining the question
Given a reference and a set of reads, report at least one “good” local alignment for each read, if one exists
- Approximate answer to the question: where in the genome did the read originate?
What is “good”? For now we concentrate on:
- Fewer mismatches = better
- Failing to align a low-quality base is better than failing to align a high-quality base
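One way to encode both criteria in a single number (an assumption for illustration; the slides do not give a scoring formula) is to sum the Phred qualities of the mismatched bases, so that fewer mismatches and lower-quality mismatches both give a smaller penalty. The brute-force scan below is only meant to show the scoring idea on toy data and ignores gaps entirely.

```python
def mismatch_penalty(read, quals, ref_window):
    """Sum of base qualities at positions where read and reference differ."""
    if not (len(read) == len(quals) == len(ref_window)):
        raise ValueError("read, qualities and reference window must have equal length")
    return sum(q for r, g, q in zip(read, ref_window, quals) if r != g)


def best_ungapped_hit(read, quals, reference):
    """Return (position, penalty) of the lowest-penalty ungapped placement."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        penalty = mismatch_penalty(read, quals, reference[pos:pos + len(read)])
        if best is None or penalty < best[1]:
            best = (pos, penalty)
    return best


# Toy data: the read matches the reference at position 2 except for one
# low-quality base (quality 5), so that placement gets a small penalty.
reference = "GGACGTTCAA"
read = "ACGTAC"
quals = [40, 40, 40, 40, 5, 40]
print(best_ungapped_hit(read, quals, reference))
```

Scanning every position like this is far too slow for a real genome, which is why the indexing discussed next matters.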
A few technical aspects (geeky stuff)
Genomes and reads are too large for direct approaches like dynamic programming
- Indexing/hashing is required
- Choice of index is key to performance
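As an illustration of the hashing idea, the sketch below builds a k-mer hash index of a toy reference and uses exact k-mer hits as seeds to propose candidate read positions. This is a generic seed-lookup scheme under simplifying assumptions (tiny sequences, exact seeds, no verification step); it is not the index used by any particular aligner, and the function names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_positions(read, index, k):
    """Propose read start positions from exact k-mer seed hits.

    Candidates still need to be verified by a real alignment step.
    """
    candidates = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset
            if start >= 0:
                candidates.add(start)
    return sorted(candidates)


reference = "GGACGTTCAA"
index = build_kmer_index(reference, k=3)
print(candidate_positions("ACGTAC", index, k=3))
```

The index is built once and then reused for every read, trading memory for lookup speed; that trade-off is why index size becomes a concern for large genomes.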
Genome indices can be big.
[Figure: index sizes for human: >35 GB, >12 GB, >12 GB]
Large indices necessitate painful compromises:
1. Require big-memory machines
2. Use secondary storage
3. Build new index each run
4. Subindex and do multiple passes
Interlude
(not only) NGS File Formats
The Sequence Alignment/ Map Format
- Generic alignment format
- Supports short and long reads
- Supports different sequencing platforms
- Flexible in style, compact in size, computationally efficient to access
SAM File Format
BAM is the binary version of the SAM file; not human readable but indexed for fast access for other tools / visualization / …
SAM Fields
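In the SAM specification, each alignment line has eleven mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), optionally followed by tags. A minimal parsing sketch follows; the helper names and the example line are made up for illustration, and header lines starting with @ are not handled.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of its mandatory fields plus tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record = {name: (int(value) if name in INT_FIELDS else value)
              for name, value in zip(SAM_FIELDS, cols)}
    record["TAGS"] = cols[len(SAM_FIELDS):]  # optional TAG:TYPE:VALUE fields
    return record


def is_reverse_strand(record):
    """FLAG bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record["FLAG"] & 0x10)


# Made-up example alignment line.
line = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], rec["CIGAR"], is_reverse_strand(rec))
```

BAM files hold the same fields in compressed binary form, so they are read through dedicated tools or libraries rather than by splitting text lines.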
Other useful formats in NGS
BED: Browser Extensible Data (location / annotation / scores)
- used for mapping / annotation / peak locations
- extension: bigBed (binary)
bedGraph files (location, combined with score)
- used to represent peak scores
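A BED line is tab-separated, with three required columns (chromosome, start, end) and optional columns such as a name. BED coordinates are 0-based with an exclusive end, so the interval length is simply end minus start. A small sketch with a made-up peak record (the helper names are hypothetical):

```python
def parse_bed_line(line):
    """Parse the three required BED columns plus the optional name column."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    return {
        "chrom": cols[0],
        "start": int(cols[1]),  # 0-based, inclusive
        "end": int(cols[2]),    # exclusive
        "name": cols[3] if len(cols) > 3 else None,
    }


def interval_length(record):
    """Length of a 0-based, half-open interval."""
    return record["end"] - record["start"]


# Made-up example record.
peak = parse_bed_line("chr1\t1000\t1500\tpeak_1")
print(peak["chrom"], interval_length(peak))
```

Keeping the 0-based convention in mind matters when BED intervals are combined with 1-based formats such as SAM or GFF.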
WIG (wiggle) files (location / annotation / scores)
- used for visualization or to summarize data, in most cases count data or normalized count data (RPKM)
- extension: bigWig, the binary version, often used in GEO for ChIP-seq peaks
General Feature Format (GFF)
- used for annotation of genetic / genomic features, such as all coding genes in Ensembl
- often used in downstream analysis to assign annotation to regions / peaks / …
Variant Call Format
used for SNP representation
aaaand back to the story
Assembly simplified
Impossible to assemble manually
Metagenomics
(and other community based “omics”)
De-novo sequencing
BowTie
- BowTie is the most commonly used aligner
- Employs an indexing algorithm that can flexibly trade memory usage against running time
- For human data (NCBI 36.3) on a 2.4 GHz AMD Opteron:
TopHat
- TopHat is one of many applications for aligning short sequence reads to a reference genome.
- It uses the BowTie aligner internally.
- Genome alignments from TopHat were saved as BAM files, the binary version of SAM (samtools.sourceforge.net/).
- Other alternatives are BWA, MAQ, OLego, Stampy, Novoalign, etc.
We’ve aligned the data. Then what?
Depending on the target study.
Gene    Treatment 1                     Treatment 2
1       14        18        10          47        13        24
2       10        3         15          1         11        5
3       1         0         10          80        21        34
4       0         0         0           0         2         0
5       4         3         3           5         33        29
...
53256   47        29        11          71        278       339
Total   22910173  30701031  18897029    20546299  28491272  27082148
Differential Expression
To determine if gene 1 is differentially expressed (DE), we would like to know whether the proportion of reads aligning to gene 1 tends to be different for experimental units that received treatment 1 than for experimental units that received treatment 2
Treatment 1: 14 out of 22910173, 18 out of 30701031, 10 out of 18897029
vs.
Treatment 2: 47 out of 20546299, 13 out of 28491272, 24 out of 27082148
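The comparison of proportions can be made concrete with the gene 1 counts and library totals from the slide. The sketch below only performs the descriptive step (reads per million per sample and a pooled proportion per treatment); it is not a statistical test, and it assumes the first three samples belong to treatment 1 and the last three to treatment 2. Real differential expression tools additionally model the variability between replicates.

```python
# Gene 1 read counts and total aligned reads per sample, from the slide.
gene1_counts = {"treatment1": [14, 18, 10], "treatment2": [47, 13, 24]}
library_sizes = {
    "treatment1": [22910173, 30701031, 18897029],
    "treatment2": [20546299, 28491272, 27082148],
}


def reads_per_million(counts, totals):
    """Proportion of reads on the gene in each sample, scaled to one million reads."""
    return [c / t * 1e6 for c, t in zip(counts, totals)]


def pooled_per_million(counts, totals):
    """Proportion after pooling all samples of one treatment, per million reads."""
    return sum(counts) / sum(totals) * 1e6


for group in ("treatment1", "treatment2"):
    rpm = reads_per_million(gene1_counts[group], library_sizes[group])
    pooled = pooled_per_million(gene1_counts[group], library_sizes[group])
    print(group, [round(x, 3) for x in rpm], round(pooled, 3))
```

The per-sample values vary considerably within treatment 2, which is exactly why a formal test that accounts for replicate variability is needed before calling the gene DE.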
Cufflinks
CuffLinks is a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.
CuffDiff is a program within CuffLinks that compares transcript abundance between samples
Putting it all together
There are several protocols, tools and (now) platforms that are specific to NGS data analysis:
- Tuxedo protocol
- Galaxy platform
- Chipster platform
Tuxedo
In closing: Visualization is key
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets.
It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.
CummeRBund