Transcriptome analysis Stefan Seemann seemann@rth.dk University of - - PowerPoint PPT Presentation

transcriptome analysis
SMART_READER_LITE
LIVE PREVIEW

Transcriptome analysis Stefan Seemann seemann@rth.dk University of - - PowerPoint PPT Presentation

Statistical Bioinformatics: Transcriptome analysis Stefan Seemann seemann@rth.dk University of Copenhagen April 11th 2018 Outline: a) How to assess the quality of sequencing reads? b) How to normalize your expression data? c) How to test


slide-1
SLIDE 1

Statistical Bioinformatics:

Transcriptome analysis

Stefan Seemann

seemann@rth.dk

University of Copenhagen April 11th 2018 Outline: a) How to assess the quality of sequencing reads? b) How to normalize your expression data? c) How to test for differential gene expression?

slide-2
SLIDE 2

The transcriptome

  • the set of all

RNA molecules in the cell (mRNA, rRNA, sncRNA, tRNA, lncRNA)

  • many RNAs are

byproducts of interesting cellular function (enhancer RNAs)

  • the amount of

RNA as proxy how much a gene is expressed

slide-3
SLIDE 3

The transcriptome

slide-4
SLIDE 4

A typical RNA-seq experiment

slide-5
SLIDE 5

Experiment design – different sequencing techniques

✞ ✝ ☎ ✆

Transcript types

  • polyadenylated – polyA+
  • non-polyadenylated – polyA-
  • total RNA

✞ ✝ ☎ ✆

Read properties

  • sequencing depth
  • read length
  • strand specific or unspecific
  • single-end
  • paired-end / mate-pairs

→ Library preparation decides what the data can be used for!!!

slide-6
SLIDE 6

RNA-seq bioinformatics analyses

Reference-based Transcriptome assembly Gene expression Normalization Differential expression analysis De novo Read mapping Quality control

slide-7
SLIDE 7

Read quality control

Essential data pre-processing step for downstream analysis! Decide sensibly on which data can be filtered out. Reads should be

  • discarded if low quality score (provided by sequencer)
  • discarded if low complexity
  • discarded if near-identical (PCR amplification?)
  • trimmed (removing adapters and low quality ends)
slide-8
SLIDE 8

Read quality control

Assessment is mostly done with help of FastQC:

  • quality scores
  • per base sequence quality
  • per tile sequence quality (specific for Illumina)
  • per sequence quality scores
  • nucleotide bias
  • per base sequence content
  • per sequence GC content
  • enrichment bias (e.g. PCR over amplification)
  • sequence duplication levels
  • overrepresented sequences
  • kmer content
  • adapter content

Help for interpreting FastQC output:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Moduls/

(Semi-)automated quality control by tools such as Trimmomatic and Trim Galore! (adaptive quality and adapter trimmer).

slide-9
SLIDE 9

Quality scores

trim galore User Guide v0.3.7

slide-10
SLIDE 10

Nucleotide bias

trim galore User Guide v0.3.7

slide-11
SLIDE 11

RNA-seq bioinformatics analyses

Reference-based Transcriptome assembly Gene expression Normalization Differential expression analysis De novo Read mapping Quality control

slide-12
SLIDE 12

Transcriptome assembly

Assemble your transcriptome:

1 Find all transcripts 2 Find all transcript variants (isoforms) 3 Quantify your transcripts (proportional to expression level)

isoforms / transcripts gene genome

slide-13
SLIDE 13

Mapping versus Assembly

There are many mappers, but they must be able to detect splice junctions to be used in transcriptome assembly. Splice Junction mappers can be devided in

1 Exon-first methods

→ TopHat (based on Bowtie2)

2 Seed-extend methods

→ Blat, STAR-aligner, Segemehl

slide-14
SLIDE 14

Strategies of gapped alignments

Graphic credit: Garber et al. (2011) Nature Methods 8:469477.

slide-15
SLIDE 15

Transcriptome assembly methods

  • Reference based
  • De novo
  • Combined methods
slide-16
SLIDE 16

Reference based – Overlap graph

  • each read alignment is a node
  • short overlaps (>5bp) are indicated by directed edges
  • transitive overlaps (longer overlaps) are shown as dotted edges
slide-17
SLIDE 17

Reference based – Overlap graph

slide-18
SLIDE 18

Reference based – Overlap graph

slide-19
SLIDE 19

Reference based methods

✞ ✝ ☎ ✆

Applications

Organisms with good reference genomes, except perhaps polyploid

  • rganisms

✞ ✝ ☎ ✆

Pros

Can be run in parallel; Contamination not a major issue; less coverage needed

✞ ✝ ☎ ✆

Cons

Reference dependant; known splice site dependant; long introns may be predicted

✞ ✝ ☎ ✆

Common software

Cufflinks

slide-20
SLIDE 20

De novo based – De Bruijn graph

slide-21
SLIDE 21

De novo based – De Bruijn graph

slide-22
SLIDE 22

De novo based methods

✞ ✝ ☎ ✆

Applications

Organisms with poor reference genomes or no reference genome; independent of correct splice sites or intron length

✞ ✝ ☎ ✆

Pros

Identification of novel transcripts; no reference needed

✞ ✝ ☎ ✆

Cons

Computationally intensive; requires high read depth; contamination a major issue

✞ ✝ ☎ ✆

Common software

Trinity

slide-23
SLIDE 23

RNA-seq bioinformatics analyses

Reference-based Transcriptome assembly Gene expression Normalization Differential expression analysis De novo Read mapping Quality control

slide-24
SLIDE 24

From read counts to gene expression

You are analyzing 2 genes (gene A and B) in two conditions (condition 1 and 2) on the bases of an RNA-seq experiment that resulted in the following number of reads: Condition 1 Condition 2 Gene A 1000 3000 Gene B 2000 4000 Discuss with you neigbor if the following statement is correct or not and why (2 min): Both gene A and B are more expressed in condition 2.

slide-25
SLIDE 25

From read counts to gene expression

We cannot state any such thing since we do not know

  • the sequencing depth (library size)
  • gene length

→ Tags per million (TPM), Reads/Fragments Per Kilobase transcript per Million mapped reads (RPKM/FPKM)

  • batch effects (methods are adapted from Microarray analyses)
slide-26
SLIDE 26

Length normalization

✬ ✫ ✩ ✪

Current RNA-seq protocols use an mRNA fragmentation approach prior to sequencing to gain sequence coverage of the whole tran- script. Thus, the total number of reads for a given transcript is proportional to the expression level M of the transcript multiplied by the length L of the transcript: E ∝ M · L RPKM: Reads per kilobase transcript per million reads RPKM(X) = 109 · C N · L = Reads per transcript million reads · transcript length(kb)

  • C . . . number of mappable reads that fell onto the genes exons
  • N . . . total number of mappable reads in the experiment
  • L . . . sum of the exons in base pairs
slide-27
SLIDE 27

Length normalization

Question: What are the RPKM-corrected expression values and why?

Graphic credit: Garber et al. (2011) Nature Methods 8:469477.

slide-28
SLIDE 28

Length normalization

Question: What are the RPKM-corrected expression values and why? Note normalization for fragment length (transcripts 3 and 4):

Graphic credit: Garber et al. (2011) Nature Methods 8:469477.

slide-29
SLIDE 29

Differential expression (DE)

★ ✧ ✥ ✦

A gene is declared differentially expressed if an observed difference

  • r change in read counts between two experimental conditions is

statistically significant, i.e. if the difference is greater than what would be expected just due to random variation. Principles are the same as for all other significance tests:

1 Use the independent replicates (samples of the same

conditions) to estimate the variance of the expression

2 Use the expression and variance to test whether the difference

is random or not The statistical power of the test increases with more replicates!

slide-30
SLIDE 30

Technical versus biological replicates

  • What is the difference between technical and biological

replicates?

  • Which different questions can I answer with the different

types of replicates?

slide-31
SLIDE 31

Parametric vs. non-parametric methods

It would be nice to not have to assume anything about the expression value distributions but only use rank-order statistics. However, it is (typically) harder to show statistical significance with non-parametric methods with few replicates (Less than 10?).

slide-32
SLIDE 32

Issues of DE testing for RNA-seq

Couldn’t we just use a Student’s t test for each gene?

1 Distribution is not normal. Which parametric distribution

should I use?

2 If we test each gene for DE we have to account for multiple

testing!

3 The number of replicates is often to small to estimate the

variance.

slide-33
SLIDE 33

Binomial distribution

Models for read counts originated from the idea that each read is sampled independently from a pool of reads and hence the number

  • f reads for a given gene follows a binomial distribution.

✓ ✒ ✏ ✑

The binomial distribution works when we have a fixed number of trials n, each with a constant probability of success p. The random variable X is the number of k successes: p(X = k) = n k

  • pk(1 − p)n−k

Event: An RNA-seq read ”lands“ in a given gene (success) or not (failure)

10 20 30 40 0.00 0.05 0.10 0.15 0.20 0.25 p=0.5 and n=20 p=0.7 and n=20 p=0.5 and n=40

slide-34
SLIDE 34

Poisson distribution

✛ ✚ ✘ ✙

RNA-seq experiments produce large number of reads (n is large) and probabilities of success are small (p is small) which can be modelled by the poisson distribution which is an approximation of the binomial. Instead we know the average number of successes per intervall: λ = np For X ∼ Poisson(λ), both the mean and the variance are equal to λ.

slide-35
SLIDE 35

Poisson versus negative binomial distribution

✎ ✍ ☞ ✌

Many studies have shown that the variance grows faster than the mean in RNAseq data. This is known as overdispersion.

slide-36
SLIDE 36

Negative binomial distribution

✗ ✖ ✔ ✕

The negative binomial distribution works for discrete data over an unbounded positive range whose sample variance exceeds the sample mean. The random variable X is the number of trials needed to make n successes (read counts) if the probability of a single success is p: NB(X = x) = x − 1 n − 1

  • pn(1 − p)x−n
slide-37
SLIDE 37

Issues of DE testing for RNA-seq

Distribution: edgeR and DESeq model the count data using a negative binomial distribution and use their own modified statistical tests based on that. Multiple testing issue: All current packages report false discovery rate (most often Benjamini-Hochberg corrected p values). Variance estimation issue: edgeE, DESeq2, limma (in slightly different ways) ”borrow“ information across genes to get a better variance estimate.

slide-38
SLIDE 38

RNA-seq bioinformatics analyses

Find significant changes in expression levels between two samples:

  • map reads for each

sample

  • do transcriptome

assembly for each sample

  • merge transcriptomes

across all samples

  • quantify the expression
  • f transcripts
  • do differential

expression analysis and get significant genes / transcripts

Reads TopHat Cufflinks Cuffmerge Reads Mapped reads Mapped reads Assembled transcripts Assembled transcripts Final transcriptome assembly Cuffdiff Mapped reads Mapped reads Differential expression results Post Analyses Assembly DE analysis Mapping Condition A Condition B

slide-39
SLIDE 39

Summary

  • The transcriptome is a subset of the expressed transcripts

from the genome and is highly variable.

  • Transcriptome assembly uses ”short“ reads from RNA-seq

data to find expressed transcripts

  • The major assembly methologies are de novo and

reference-based

  • Sequencing of full length RNA molecules by new technologies

(PacBio, Nanopore) eliminates the need to fragment the RNA during library preparation → no more need for assembler tools (however new tools are needed)

  • Normalization of expression levels is essential to make

conclusions

slide-40
SLIDE 40

Galaxy

  • software system that provides informatics support to perform

computational analyses

  • gives experimentalists simple interface
  • automatically manages the computational details
  • integration of biological databases from

UCSC genome browser, BioMart, . . .

  • transparency – every step of an analysis is preserved in the

”history“ which can be shared or published

  • reproducibility – construction of reusable multi-step analysis

”workflows“

  • learn functionality using screencast page at

http://galaxycast.org

  • more software tools are added continuously
slide-41
SLIDE 41

Galaxy

Galaxy’s Analyze Data interface – http://galaxyproject.org/:

Tools Workbench History Job done Job pending Job running