 
              Statistical Bioinformatics: Transcriptome analysis Stefan Seemann seemann@rth.dk University of Copenhagen April 11th 2018 Outline: a) How to assess the quality of sequencing reads? b) How to normalize your expression data? c) How to test for differential gene expression?
The transcriptome • the set of all RNA molecules in the cell (mRNA, rRNA, sncRNA, tRNA, lncRNA) • many RNAs are byproducts of interesting cellular function (enhancer RNAs) • the amount of RNA as proxy how much a gene is expressed
The transcriptome
A typical RNA-seq experiment
Experiment design – different sequencing techniques ✞ ☎ ✝ ✆ Transcript types • polyadenylated – polyA+ • non-polyadenylated – polyA- • total RNA ✞ ☎ ✝ ✆ Read properties • sequencing depth • read length • strand specific or unspecific • single-end • paired-end / mate-pairs → Library preparation decides what the data can be used for!!!
RNA-seq bioinformatics analyses Quality control Read mapping Transcriptome assembly Reference-based De novo Gene expression Normalization Di ff erential expression analysis
Read quality control Essential data pre-processing step for downstream analysis! Decide sensibly on which data can be filtered out. Reads should be • discarded if low quality score (provided by sequencer) • discarded if low complexity • discarded if near-identical (PCR amplification?) • trimmed (removing adapters and low quality ends)
Read quality control Assessment is mostly done with help of FastQC: • quality scores • per base sequence quality • per tile sequence quality (specific for Illumina) • per sequence quality scores • nucleotide bias • per base sequence content • per sequence GC content • enrichment bias (e.g. PCR over amplification) • sequence duplication levels • overrepresented sequences • kmer content • adapter content Help for interpreting FastQC output: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Moduls/ (Semi-)automated quality control by tools such as Trimmomatic and Trim Galore! (adaptive quality and adapter trimmer).
Quality scores trim galore User Guide v0.3.7
Nucleotide bias trim galore User Guide v0.3.7
RNA-seq bioinformatics analyses Quality control Read mapping Transcriptome assembly Reference-based De novo Gene expression Normalization Di ff erential expression analysis
Transcriptome assembly Assemble your transcriptome: 1 Find all transcripts 2 Find all transcript variants (isoforms) 3 Quantify your transcripts (proportional to expression level) isoforms / gene transcripts genome
Mapping versus Assembly There are many mappers, but they must be able to detect splice junctions to be used in transcriptome assembly. Splice Junction mappers can be devided in 1 Exon-first methods → TopHat (based on Bowtie2) 2 Seed-extend methods → Blat, STAR-aligner, Segemehl
Strategies of gapped alignments Graphic credit: Garber et al. (2011) Nature Methods 8:469477.
Transcriptome assembly methods • Reference based • De novo • Combined methods
Reference based – Overlap graph • each read alignment is a node • short overlaps ( > 5bp) are indicated by directed edges • transitive overlaps (longer overlaps) are shown as dotted edges
Reference based – Overlap graph
Reference based – Overlap graph
Reference based methods ✞ ☎ ✝ ✆ Applications Organisms with good reference genomes, except perhaps polyploid organisms ✞ ☎ ✝ ✆ Pros Can be run in parallel; Contamination not a major issue; less coverage needed ✞ ☎ ✝ ✆ Cons Reference dependant; known splice site dependant; long introns may be predicted ✞ ☎ ✝ ✆ Common software Cufflinks
De novo based – De Bruijn graph
De novo based – De Bruijn graph
De novo based methods ✞ ☎ ✝ ✆ Applications Organisms with poor reference genomes or no reference genome; independent of correct splice sites or intron length ✞ ☎ ✝ ✆ Pros Identification of novel transcripts; no reference needed ✞ ☎ ✝ ✆ Cons Computationally intensive; requires high read depth; contamination a major issue ✞ ☎ ✝ ✆ Common software Trinity
RNA-seq bioinformatics analyses Quality control Read mapping Transcriptome assembly Reference-based De novo Gene expression Normalization Di ff erential expression analysis
From read counts to gene expression You are analyzing 2 genes (gene A and B) in two conditions (condition 1 and 2) on the bases of an RNA-seq experiment that resulted in the following number of reads: Condition 1 Condition 2 Gene A 1000 3000 Gene B 2000 4000 Discuss with you neigbor if the following statement is correct or not and why (2 min): Both gene A and B are more expressed in condition 2.
From read counts to gene expression We cannot state any such thing since we do not know • the sequencing depth (library size) • gene length → Tags per million (TPM), Reads/Fragments Per Kilobase transcript per Million mapped reads (RPKM/FPKM) • batch effects (methods are adapted from Microarray analyses)
✬ ✩ Length normalization Current RNA-seq protocols use an mRNA fragmentation approach prior to sequencing to gain sequence coverage of the whole tran- script. Thus, the total number of reads for a given transcript is proportional to the expression level M of the transcript multiplied by the length L of the transcript: ✫ ✪ E ∝ M · L RPKM : Reads per kilobase transcript per million reads RPKM ( X ) = 10 9 · C Reads per transcript = million reads · transcript length(kb) N · L • C . . . number of mappable reads that fell onto the genes exons • N . . . total number of mappable reads in the experiment • L . . . sum of the exons in base pairs
Length normalization Question: What are the RPKM-corrected expression values and why? Graphic credit: Garber et al. (2011) Nature Methods 8:469477.
Length normalization Question: What are the RPKM-corrected expression values and why? Note normalization for fragment length (transcripts 3 and 4): Graphic credit: Garber et al. (2011) Nature Methods 8:469477.
Differential expression (DE) ★ ✥ A gene is declared differentially expressed if an observed difference or change in read counts between two experimental conditions is statistically significant, i.e. if the difference is greater than what ✧ would be expected just due to random variation. ✦ Principles are the same as for all other significance tests: 1 Use the independent replicates (samples of the same conditions) to estimate the variance of the expression 2 Use the expression and variance to test whether the difference is random or not The statistical power of the test increases with more replicates!
Technical versus biological replicates • What is the difference between technical and biological replicates? • Which different questions can I answer with the different types of replicates?
Parametric vs. non-parametric methods It would be nice to not have to assume anything about the expression value distributions but only use rank-order statistics. However, it is (typically) harder to show statistical significance with non-parametric methods with few replicates (Less than 10?).
Issues of DE testing for RNA-seq Couldn’t we just use a Student’s t test for each gene? 1 Distribution is not normal. Which parametric distribution should I use? 2 If we test each gene for DE we have to account for multiple testing! 3 The number of replicates is often to small to estimate the variance.
Binomial distribution Models for read counts originated from the idea that each read is sampled independently from a pool of reads and hence the number of reads for a given gene follows a binomial distribution. ✓ ✏ The binomial distribution works when we have a fixed number of ✒ trials n , each with a constant probability of success p . ✑ 0.25 The random variable X is the p=0.5 and n=20 p=0.7 and n=20 number of k successes: 0.20 p=0.5 and n=40 0.15 � n � p k (1 − p ) n − k p ( X = k ) = k 0.10 0.05 Event: An RNA-seq read ”lands“ in a given gene 0.00 0 10 20 30 40 (success) or not (failure)
Poisson distribution ✛ ✘ RNA-seq experiments produce large number of reads ( n is large) and probabilities of success are small ( p is small) which can be modelled by the poisson distribution which is an approximation of ✚ ✙ the binomial. Instead we know the average number of successes per intervall: λ = np For X ∼ Poisson( λ ), both the mean and the variance are equal to λ .
✎ Poisson versus negative binomial distribution ☞ Many studies have shown that the variance grows faster than the ✍ ✌ mean in RNAseq data. This is known as overdispersion .
✗ ✔ Negative binomial distribution The negative binomial distribution works for discrete data over an unbounded positive range whose sample variance exceeds the sample ✖ ✕ mean. The random variable X is the number of trials needed to make n successes (read counts) if the probability of a single success is p : � x − 1 � p n (1 − p ) x − n NB ( X = x ) = n − 1
Recommend
More recommend