NGS Data Analysis: Methods and Protocols



slide-1
SLIDE 1

METHODS AND PROTOCOLS

NGS Data Analysis

slide-2
SLIDE 2

Shifting Paradigms

  • A thousand years ago: science was empirical, describing natural phenomena
  • Last few hundred years: theoretical branch using models, generalizations
  • Last few decades: a computational branch simulating complex phenomena
  • Today: data exploration (eScience), unifying theory, experiment, and simulation
      • Data captured by instruments or generated by simulator
      • Processed by software
      • Information/knowledge stored in computer
      • Scientist analyzes database/files using data management and statistics

Jim Gray on eScience, The Fourth Paradigm, Microsoft Research, 2009

11/11/2015

NGS Data Analysis Training Workshop - Part I

slide-3
SLIDE 3

Big Data Biology

  • The term “Big Data” is not only about size:
      • Speed
      • Volume
      • Computational and analytical capacity to manage data and derive insight
  • The “Fourth Paradigm” is at hand in the Life Sciences:
      • the analysis of massive data sets

slide-4
SLIDE 4

“It’s the data, stupid”

  • It’s a new scientific methodology based on the power of data-intensive science:
      • Capturing
      • Curation, and
      • Analysis of large data
  • The goal, Dr. Gray insisted, was not to have the biggest, fastest single computer, but rather “to have a world in which all of the science literature is online, all of the science data is online, and they interoperate with each other.”
  • At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order, but of dimensionally agnostic statistics.

slide-5
SLIDE 5

Big Data Biology

  • Moving from traditional small-scale, focused experiments to more hypothesis-neutral studies
  • Small biology labs can become
      • Big data generators
      • Big data users

slide-6
SLIDE 6

The story so far…

[Figure: publication counts for "Grid Computing" vs. "Cloud Computing", from PubMed ([Title/Abstract] search) and from SCOPUS (Life and Health sciences)]

“We can know more than we can tell”

Michael Polanyi (1891-1976)

2007-2008: sequencers begin giving flurries of data

slide-7
SLIDE 7

Words of the story...

  • 391 abstracts from PubMed
  • 4,770 unique terms

Grid terms

  • grid
  • model
  • distribut
  • bioinformat
  • molecular

Cloud terms

  • cloud
  • servic
  • sequenc
  • health
  • genom

Common terms

  • comput
  • data
  • system
  • provid
  • technolog
  • applic
  • resour
  • analysi

[Figures: word cloud for “Grid” abstracts; word cloud for “Cloud” abstracts]

slide-8
SLIDE 8

Any field in particular?

  • Research areas from SCOPUS

Grid
  • Biochemistry, Genetics and Molecular Biology (1,029)
  • Medicine (201)
  • Health Professions (109)
  • Multidisciplinary (65)
  • Agricultural and Biological Sciences (44)
  • Pharmacology, Toxicology and Pharmaceutics (23)
  • Nursing (22)
  • Environmental Science (10)
  • Neuroscience (9)
  • Immunology and Microbiology (1)

Cloud
  • Medicine (203)
  • Biochemistry, Genetics and Molecular Biology (116)
  • Health Professions (85)
  • Multidisciplinary (69)
  • Agricultural and Biological Sciences (42)
  • Pharmacology, Toxicology and Pharmaceutics (21)
  • Environmental Science (13)
  • Immunology and Microbiology (12)
  • Nursing (11)
  • Neuroscience (9)

slide-9
SLIDE 9

Making the bridge…

[Figures: “Grid computing” in 2004; “Cloud computing” in 2014]

“Ba: Knowledge creation requires a time and place in which people share knowledge and work together as a community.”

Kitaro Nishida

slide-10
SLIDE 10


slide-11
SLIDE 11

NGS pushes bioinformatics needs up

  • Need for large amount of CPU power
      • Informatics groups must manage compute clusters
      • Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment
      • Another level of software complexity and challenges to interoperability
  • VERY large text files (~10 million lines long)
      • Can’t do “business as usual” with familiar tools such as Perl/Python
      • Impossible memory usage and execution time
      • Impossible to browse for problems
  • Need sequence quality filtering
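One common way around the memory problem is to stream a file line by line instead of loading it whole. The sketch below is only an illustration of that idea: it assumes a FASTQ-style layout with four lines per read, and the function name `count_reads_streaming` and the in-memory demo data are hypothetical.

```python
import io
from typing import TextIO


def count_reads_streaming(handle: TextIO) -> int:
    """Count reads while keeping only one line in memory at a time."""
    line_count = 0
    for _ in handle:  # a file object yields one line per iteration
        line_count += 1
    return line_count // 4  # assumption: four lines per read


# Demo on a tiny in-memory "file"; a real run would pass open("reads.fastq")
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")
print(count_reads_streaming(demo))
```

Memory use stays roughly constant regardless of file size, because no list of lines is ever built.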

slide-12
SLIDE 12

Data Management Issues

  • Raw data are large. How long should they be kept?
  • Processed data are manageable for most people
      • 20 million reads (50 bp) ~ 1 GB
  • More of an issue for a facility: HiSeq recommends 32 CPU cores, each with 4 GB RAM
  • Certain studies are much more data intensive than others
      • Whole genome sequencing
      • A 30X coverage genome pair (tumor/normal) ~ 500 GB
      • 50 genome pairs ~ 25 TB
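The storage figures above follow from simple multiplication. A minimal sketch, assuming the slide's estimate of about 500 GB per tumor/normal pair and decimal units (1 TB = 1000 GB); the function name is hypothetical:

```python
GB_PER_GENOME_PAIR = 500  # slide's estimate for a 30X tumor/normal pair


def project_storage_tb(n_pairs: int, gb_per_pair: float = GB_PER_GENOME_PAIR) -> float:
    """Estimate project storage in terabytes (decimal: 1 TB = 1000 GB)."""
    return n_pairs * gb_per_pair / 1000


print(project_storage_tb(50))
```

For 50 genome pairs this reproduces the 25 TB quoted on the slide.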

slide-13
SLIDE 13

So what?

  • In NGS we have to process really big amounts of data, which is not trivial in computing terms.
  • Big NGS projects require supercomputing infrastructure
  • Or put another way: it’s not the case that anyone can study everything.
      • Small facilities must carefully choose projects that scale with their computing capabilities.

slide-14
SLIDE 14

Intermediate Solution #1: Cloud Computing

  • Pros:
      • Flexibility
      • You pay for what you use
      • No need to maintain a data center
  • Cons:
      • Transferring big datasets over the internet is slow
      • You pay for consumed bandwidth, which is a problem with big datasets
      • Lower performance, especially in disk read/write
      • Privacy/security concerns
      • More expensive for big and long-term projects

slide-15
SLIDE 15

Intermediate Solution #2: Grid Computing

  • Pros:
      • Cheaper
      • More resources available
  • Cons:
      • Heterogeneous environment
      • Slow connectivity
      • Much time required to find good resources in the grid

slide-16
SLIDE 16

AppDB: Ready-to-use Apps in EGI

  • The EGI Applications Database (AppDB) is a central service that stores and provides to the public information about:
      • software solutions for scientists and developers to use,
      • the programmers and the scientists who developed them, and
      • the publications derived from the registered solutions

slide-17
SLIDE 17

What about the data?

  • There is a VT on this!
      • Support for dataset retrieval and replication in AppDB
      • Support for multiple versions and locations per dataset in AppDB

slide-18
SLIDE 18

Crossbow

  • Identifies SNPs from high-coverage, short-read resequencing data
  • Combines the aligner Bowtie and the SNP caller SOAPsnp
  • Hadoop MapReduce approach
  • Amazon EC2 / Local Cluster

slide-19
SLIDE 19

Rainbow

  • Large-scale Whole Genome Sequencing (WGS) analysis
  • Supports FASTQ and BAM input
  • Load balancing
  • Active workflow monitoring
  • Amazon EC2

slide-20
SLIDE 20

CloudMap

  • Greatly simplifies the analysis of mutant whole genome sequences
  • Offers predefined workflows to pinpoint variations in animal genomes
  • Available on the Galaxy web platform
  • Amazon EC2 / Local Cluster

slide-21
SLIDE 21

CloudBurst

  • Parallel read-mapping algorithm optimized for mapping NGS data to the human and other reference genomes
  • Modeled after the short read-mapping program RMAP
  • Parallelization overcomes computational barriers and allows deeper analysis
  • Hadoop MapReduce approach
  • Performance increases almost linearly with the number of CPU cores available

slide-22
SLIDE 22

RSD-Cloud

  • Large comparative genomics analysis tool
  • Redesigned the reciprocal smallest distance algorithm (RSD) to run in a cloud computing environment
  • Fast and cost-efficient solution
  • Amazon EC2

slide-23
SLIDE 23

Cloud BioLinux

  • Publicly accessible VM
  • Platform for developing bioinformatics infrastructures on the cloud
  • Quick provision of on-demand infrastructures for HPC in bioinformatics
  • Pre-configured tools and GUI
  • Tested on Amazon EC2, Eucalyptus, Okeanos and VirtualBox

slide-24
SLIDE 24

CloVR

  • Portable VM
  • Several automated analysis pipelines for microbial genomics provided, including 16S, whole genome and metagenome sequence analysis
  • Runs on a local PC but also supports use of remote cloud computing resources on multiple cloud computing platforms.

slide-25
SLIDE 25

Mercury

  • Integration of multiple sequence analysis tools in a single DNAnexus-based platform
  • Simplified workflow construction GUI
  • Applet-based workflows
  • Amazon EC2 / Local Cluster

slide-26
SLIDE 26

Galaxy

  • Offers genome analysis resources for cloud computing platforms
      • Amazon EC2
      • VirtualBox
      • Eucalyptus
      • Okeanos
  • Freely available and community maintained
      • software images and
      • data repositories
  • Widely adopted in the bioinformatics community

slide-27
SLIDE 27

The take-home points…

  • Life Sciences and Big Data are irrevocably linked
  • A lot of Life Sciences infrastructure projects (ELIXIR, LifeWatch, etc.) are already looking towards Grid/Cloud solutions
  • Although techniques are here to stay, there is a narrow window of opportunity for researchers to stay ahead of the curve
  • If interested, do ask for more…

slide-28
SLIDE 28

Sequencing Technology


slide-29
SLIDE 29

Changes and Timing past decade


slide-30
SLIDE 30

Overview of costs (past, present and near future)


slide-31
SLIDE 31

Steps in sequencing experiments


slide-32
SLIDE 32

NGS analysis workflow


slide-33
SLIDE 33

The three stages of NGS data analysis

  • We will focus mostly on the first two…

slide-34
SLIDE 34

NGS Applications are sequencing applications

  • Whole Genome Sequencing
  • Gene Regulation
  • Epigenetic Changes
  • Metagenomics
  • Paleogenomics
  • Transcriptome Analysis
  • Resequencing
  • …

slide-35
SLIDE 35

Why QC and preprocessing

  • Sequencer output
      • Reads + quality
  • Natural questions
      • Is the quality of my sequenced data OK?
      • If something is wrong, can I fix it?
  • Problem: HUGE files

slide-36
SLIDE 36

Sequencing Data Formats


slide-37
SLIDE 37

Quality before content


slide-38
SLIDE 38

What is quality?


slide-39
SLIDE 39

Trace File (high quality)


slide-40
SLIDE 40

Trace File (Medium Quality)


slide-41
SLIDE 41

Trace File (Low Quality)


slide-42
SLIDE 42

Phred Quality Scores

  • Phred is a program that assigns a quality score to each base in a sequence. These scores can then be used to trim bad data from the reads, and to determine how good an overlap actually is.
  • Phred scores are logarithmically related to the probability of an error:
      • a score of 10 means a 10% error probability,
      • 20 means a 1% chance,
      • 30 means a 0.1% chance, etc.
  • A score of 30 is usually considered the minimum acceptable score.
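The logarithmic relationship is the standard Phred definition Q = -10 · log10(P), where P is the error probability. A minimal sketch of the conversion in both directions (the function names are illustrative only):

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

Each additional 10 points of quality corresponds to a tenfold drop in error probability.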

slide-43
SLIDE 43

FASTQ File Format

  • Each read is represented by four lines:
      1. @ followed by read ID
      2. Sequence
      3. + optionally followed by repeated read ID
      4. Quality line
          • Same length as sequence
          • Each character encodes the quality of the respective base
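The four-line layout can be read with a few lines of code. The sketch below is a simplified illustration, not a production parser: it assumes strict four-line records with no wrapped sequences, and it assumes the Phred+33 ASCII encoding (the slide does not say which offset is used). `FastqRecord` and `read_fastq` are hypothetical names.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    qualities: List[int]


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per four lines; offset=33 is an assumed encoding."""
    while True:
        header = handle.readline()
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # line 3: '+', optionally followed by the read ID
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError("malformed FASTQ record: " + header.strip())
        # Each quality character encodes one base's score as ASCII code minus offset
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:].strip(), sequence, scores)


demo = io.StringIO("@r1\nACGT\n+\nII5!\n")
for record in read_fastq(demo):
    print(record.read_id, record.sequence, record.qualities)
```

Because the function is a generator, it processes one read at a time and works on files too large to fit in memory.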

slide-44
SLIDE 44

FASTQC

  • As the name implies, FastQC is a way to quickly see some summary statistics to check the quality of your NGS run.
  • It runs both as a GUI (requires Java) and as a command-line program.
  • Provides several statistics:
      • Per base sequence quality
      • Per sequence quality scores
      • Per base sequence and GC content
      • Per sequence GC content
      • etc.

slide-45
SLIDE 45

Trimming

  • Knowing quality → Act accordingly
  • Adapter trimming
      • May increase mapping rates
      • Absolutely essential for small RNA
      • Probably improves de novo assemblies
  • Quality trimming
      • May increase mapping rates
      • May also lead to loss of information
  • Lots of software:
      • Cutadapt, Trim Galore!, PRINSEQ, etc.
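To make the idea of quality trimming concrete, here is a deliberately simple sketch that removes low-quality bases from the 3' end of a read. It is not the algorithm used by Cutadapt, Trim Galore! or PRINSEQ; the function name and the default threshold of 20 are assumptions for illustration.

```python
from typing import List, Tuple


def trim_3prime(sequence: str, qualities: List[int],
                min_quality: int = 20) -> Tuple[str, List[int]]:
    """Drop trailing bases whose quality is below min_quality."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


print(trim_3prime("ACGTAC", [30, 30, 30, 25, 10, 5]))
```

The trade-off named on the slide is visible here: the trimmed read is cleaner but shorter, so some information is lost.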

slide-46
SLIDE 46

Mapped Reads

  • Mapping: “align” these raw reads to a reference genome
  • Single-end or paired-end data?
  • How would you align a short read to the reference?
      • Old-school: Smith-Waterman, BLAST, BLAT, …
      • Now: mapping tools for short reads that use intelligent indexing and allow mismatches

slide-47
SLIDE 47

Short read applications

  • Genotyping
  • RNA-Seq, ChIP-Seq, Methyl-Seq, …

slide-48
SLIDE 48

… always a problem

  • Finding the alignment is a computational bottleneck

slide-49
SLIDE 49

Defining the question

  • Given a reference and a set of reads, report at least one “good” local alignment for each read, if one exists
      • Approximate answer to the question: where in the genome did the read originate?
  • What is “good”? For now we concentrate on:
      • Fewer mismatches = better
      • Failing to align a low-quality base is better than failing to align a high-quality base
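One way to express these two criteria in code is to score a candidate placement by its number of mismatches and by the summed quality of the mismatched bases. This is a sketch of the idea only, for ungapped comparison of equal-length strings; the function name is hypothetical and no particular aligner is claimed to score exactly this way.

```python
from typing import List, Tuple


def mismatch_penalty(read: str, qualities: List[int],
                     ref_segment: str) -> Tuple[int, int]:
    """Return (mismatch count, summed quality at mismatches); lower is better."""
    if not (len(read) == len(qualities) == len(ref_segment)):
        raise ValueError("read, qualities and reference segment must have equal length")
    mismatches = 0
    quality_sum = 0
    for base, quality, ref_base in zip(read, qualities, ref_segment):
        if base != ref_base:
            mismatches += 1
            quality_sum += quality
    return mismatches, quality_sum


quals = [30, 30, 5, 30]
low_q_mismatch = mismatch_penalty("ACGT", quals, "ACTT")   # mismatch at the Q5 base
high_q_mismatch = mismatch_penalty("ACGT", quals, "AGGT")  # mismatch at a Q30 base
print(low_q_mismatch, high_q_mismatch)
```

Python compares tuples element by element, so sorting candidates by this score prefers fewer mismatches first and, among ties, mismatches at low-quality bases.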

slide-50
SLIDE 50

A few technical aspects (geeky stuff)

  • Genomes and reads are too large for direct approaches like dynamic programming
  • Indexing/hashing is required
  • Choice of index is key to performance
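As a toy illustration of hashing, the sketch below builds a k-mer hash index of a reference, looks up the first k bases of a read as a seed, and then checks each candidate position while allowing a few mismatches. Real short-read mappers use far more compact indexes; the function names, the seed strategy and the tiny reference string are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the reference to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return index


def map_read(read: str, reference: str, index: Dict[str, List[int]],
             k: int, max_mismatches: int = 1) -> List[Tuple[int, int]]:
    """Return (position, mismatches) for placements seeded by the read's first k-mer."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, k=4)
print(map_read("ACGTTA", reference, index, k=4))
```

The index lookup replaces a scan of the whole reference with a single dictionary access, which is why the choice of index dominates performance; the cost is the memory needed to hold the index.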

slide-51
SLIDE 51

A few technical aspects (geeky stuff)

  • Genome indices can be big. For human: >35 GB, >12 GB, >12 GB (figure labels for the index types pictured on the slide)
  • Large indices necessitate painful compromises:
      1. Require big-memory machines
      2. Use secondary storage
      3. Build new index each run
      4. Subindex and do multiple passes

slide-52
SLIDE 52

Interlude

(not only) NGS File Formats

slide-53
SLIDE 53

The Sequence Alignment/ Map Format

  • Generic alignment format
  • Supports short and long reads
  • Supports different sequencing platforms
  • Flexible in style, compact in size, computationally efficient to access
  • SAM File Format
      • BAM is the binary version of the SAM file; not human readable but indexed for fast access by other tools / visualization / …

slide-54
SLIDE 54

SAM Fields
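As a hedged sketch: the SAM specification defines eleven mandatory tab-separated columns (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), optionally followed by tagged fields. The parser below reads only those eleven columns of an alignment line; the class and function names are illustrative and the example record is made up.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities


def parse_sam_line(line: str) -> SamRecord:
    """Parse the eleven mandatory columns of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10])


def is_unmapped(record: SamRecord) -> bool:
    """FLAG bit 0x4 marks an unmapped read."""
    return bool(record.flag & 0x4)


def is_reverse_strand(record: SamRecord) -> bool:
    """FLAG bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


record = parse_sam_line("r1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII")
print(record.rname, record.pos, is_unmapped(record), is_reverse_strand(record))
```

Header lines, which start with `@`, are not handled here and would need to be skipped before calling the parser.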


slide-55
SLIDE 55

Other useful formats in NGS

  • Browser Extensible Data (BED): location / annotation / scores
      • used for mapping / annotation / peak locations
      • extension: bigBed (binary)
  • BEDGraph files (location, combined with score)
      • used to represent peak scores
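A BED line is tab-separated, with chromosome, start and end as its first three columns; starts are 0-based and ends are exclusive. The sketch below parses the first five columns and converts to 1-based inclusive coordinates; the function names and the example peak are hypothetical.

```python
from typing import Dict, Optional, Tuple, Union

BedValue = Union[str, int, float, None]


def parse_bed_line(line: str) -> Dict[str, BedValue]:
    """Parse chrom, start, end and the optional name and score columns."""
    fields = line.rstrip("\n").split("\t")
    name: Optional[str] = fields[3] if len(fields) > 3 else None
    score: Optional[float] = float(fields[4]) if len(fields) > 4 else None
    return {
        "chrom": fields[0],
        "start": int(fields[1]),  # 0-based, inclusive
        "end": int(fields[2]),    # exclusive
        "name": name,
        "score": score,
    }


def bed_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based half-open interval to 1-based inclusive coordinates."""
    return start + 1, end


peak = parse_bed_line("chr1\t99\t200\tpeak1\t850")
print(peak, bed_to_one_based(peak["start"], peak["end"]))
```

With half-open intervals the feature length is simply end minus start, which is one reason the convention is convenient for interval arithmetic.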

slide-56
SLIDE 56

Other useful formats in NGS

  • WIG files (location / annotation / scores): wiggle
      • used for visualization or to summarize data, in most cases count data or normalized count data (RPKM)
      • extension: bigWig, the binary version, often used in GEO for ChIP-seq peaks
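RPKM (reads per kilobase of feature per million mapped reads) normalizes a raw count by both feature length and library size. A minimal sketch of the usual formula; the function name and the example numbers are made up for illustration.

```python
def rpkm(read_count: int, feature_length_bp: int, total_mapped_reads: int) -> float:
    """RPKM = reads * 1e9 / (feature length in bp * total mapped reads)."""
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical example: 100 reads on a 2,000 bp feature in a library of 10 million reads
print(rpkm(100, 2000, 10_000_000))
```

The factor 1e9 combines the per-kilobase (1e3) and per-million (1e6) scalings.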

slide-57
SLIDE 57

Other useful formats in NGS

  • General Feature Format (GFF)
      • used for annotation of genetic / genomic features, such as all coding genes in Ensembl
      • often used in downstream analysis to assign annotation to regions / peaks / …

slide-58
SLIDE 58

Other useful formats in NGS

  • Variant Call Format (VCF)
      • used for SNP representation

slide-59
SLIDE 59

aaaand back to the story


slide-60
SLIDE 60

Assembly simplified


slide-61
SLIDE 61

Assembly simplified

  • Impossible to assemble manually

slide-62
SLIDE 62

Metagenomics

  • (and other community-based “omics”)

slide-63
SLIDE 63

De-novo sequencing


slide-64
SLIDE 64

Bowtie

  • Bowtie is the most commonly used aligner
  • Employs an indexing algorithm that can flexibly trade off memory usage against running time
  • For human data (NCBI 36.3) on a 2.4 GHz AMD Opteron:

slide-65
SLIDE 65

TopHat

  • TopHat is one of many applications for aligning short sequence reads to a reference genome.
  • It uses the Bowtie aligner internally.
  • Genome alignments from TopHat were saved as BAM files, the binary version of SAM (samtools.sourceforge.net/).
  • Other alternatives are BWA, MAQ, OLego, Stampy, Novoalign, etc.

slide-66
SLIDE 66

We’ve aligned the data. Then what?

  • Depending on the target study.

           Treatment 1                         Treatment 2
Gene
1                14         18         10            47         13         24
2                10          3         15             1         11          5
3                 1          0         10            80         21         34
4                 0          0          0             0          2          0
5                 4          3          3             5         33         29
...
53256            47         29         11            71        278        339
Total      22910173   30701031   18897029      20546299   28491272   27082148

slide-67
SLIDE 67

Differential Expression

  • To determine if gene 1 is DE, we would like to know whether the proportion of reads aligning to gene 1 tends to be different for experimental units that received treatment 1 than for experimental units that received treatment 2

Treatment 1: 14 out of 22910173, 18 out of 30701031, 10 out of 18897029
vs.
Treatment 2: 47 out of 20546299, 13 out of 28491272, 24 out of 27082148
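The comparison can be made concrete by scaling each gene 1 count to reads per million of its own library. The sketch below reads the slide's numbers as three samples per treatment, which is an interpretation of the slide layout; it only computes proportions and is not a statistical test for differential expression.

```python
from typing import Dict, List

# Gene 1 counts and library totals, grouped by treatment (grouping is an assumption)
gene1_counts: Dict[str, List[int]] = {
    "treatment_1": [14, 18, 10],
    "treatment_2": [47, 13, 24],
}
library_sizes: Dict[str, List[int]] = {
    "treatment_1": [22910173, 30701031, 18897029],
    "treatment_2": [20546299, 28491272, 27082148],
}


def reads_per_million(count: int, library_size: int) -> float:
    """Scale a raw count to reads per million reads in its library."""
    return count / library_size * 1e6


for treatment, counts in gene1_counts.items():
    scaled = [reads_per_million(c, n)
              for c, n in zip(counts, library_sizes[treatment])]
    print(treatment, [round(value, 3) for value in scaled])
```

Whether the difference between groups is larger than the variation within them is what a dedicated statistical model has to decide.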

slide-68
SLIDE 68

Cufflinks

  • Cufflinks is a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.
  • Cuffdiff is a program within Cufflinks that compares transcript abundance between samples

slide-69
SLIDE 69

Putting it all together

  • There are several protocols, tools and (now) platforms that are specific to NGS data analysis:
      • Tuxedo protocol
      • Galaxy platform
      • Chipster platform

slide-70
SLIDE 70

Tuxedo


slide-71
SLIDE 71

In closing: Visualization is key

  • The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets.
  • It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.

slide-72
SLIDE 72

CummeRbund

  • Downstream analysis