Transcriptome analysis Stefan Seemann seemann@rth.dk University of - PowerPoint PPT Presentation

Statistical Bioinformatics: Transcriptome analysis Stefan Seemann seemann@rth.dk University of Copenhagen April 11th 2018 Outline: a) How to assess the quality of sequencing reads? b) How to normalize your expression data? c) How to test for differential gene expression?

The transcriptome • the set of all RNA molecules in the cell (mRNA, rRNA, sncRNA, tRNA, lncRNA) • many RNAs are byproducts of interesting cellular function (enhancer RNAs) • the amount of RNA as proxy how much a gene is expressed

The transcriptome

A typical RNA-seq experiment

Experiment design – different sequencing techniques ✞ ☎ ✝ ✆ Transcript types • polyadenylated – polyA+ • non-polyadenylated – polyA- • total RNA ✞ ☎ ✝ ✆ Read properties • sequencing depth • read length • strand specific or unspecific • single-end • paired-end / mate-pairs → Library preparation decides what the data can be used for!!!

RNA-seq bioinformatics analyses Quality control Read mapping Transcriptome assembly Reference-based De novo Gene expression Normalization Di ff erential expression analysis

Read quality control Essential data pre-processing step for downstream analysis! Decide sensibly on which data can be filtered out. Reads should be • discarded if low quality score (provided by sequencer) • discarded if low complexity • discarded if near-identical (PCR amplification?) • trimmed (removing adapters and low quality ends)

Read quality control Assessment is mostly done with help of FastQC: • quality scores • per base sequence quality • per tile sequence quality (specific for Illumina) • per sequence quality scores • nucleotide bias • per base sequence content • per sequence GC content • enrichment bias (e.g. PCR over amplification) • sequence duplication levels • overrepresented sequences • kmer content • adapter content Help for interpreting FastQC output: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Moduls/ (Semi-)automated quality control by tools such as Trimmomatic and Trim Galore! (adaptive quality and adapter trimmer).

Quality scores trim galore User Guide v0.3.7

Nucleotide bias trim galore User Guide v0.3.7

Transcriptome assembly Assemble your transcriptome: 1 Find all transcripts 2 Find all transcript variants (isoforms) 3 Quantify your transcripts (proportional to expression level) isoforms / gene transcripts genome

Mapping versus Assembly There are many mappers, but they must be able to detect splice junctions to be used in transcriptome assembly. Splice Junction mappers can be devided in 1 Exon-first methods → TopHat (based on Bowtie2) 2 Seed-extend methods → Blat, STAR-aligner, Segemehl

Strategies of gapped alignments Graphic credit: Garber et al. (2011) Nature Methods 8:469477.

Transcriptome assembly methods • Reference based • De novo • Combined methods

Reference based – Overlap graph • each read alignment is a node • short overlaps ( > 5bp) are indicated by directed edges • transitive overlaps (longer overlaps) are shown as dotted edges

Reference based – Overlap graph

Reference based methods ✞ ☎ ✝ ✆ Applications Organisms with good reference genomes, except perhaps polyploid organisms ✞ ☎ ✝ ✆ Pros Can be run in parallel; Contamination not a major issue; less coverage needed ✞ ☎ ✝ ✆ Cons Reference dependant; known splice site dependant; long introns may be predicted ✞ ☎ ✝ ✆ Common software Cufflinks

De novo based – De Bruijn graph

De novo based methods ✞ ☎ ✝ ✆ Applications Organisms with poor reference genomes or no reference genome; independent of correct splice sites or intron length ✞ ☎ ✝ ✆ Pros Identification of novel transcripts; no reference needed ✞ ☎ ✝ ✆ Cons Computationally intensive; requires high read depth; contamination a major issue ✞ ☎ ✝ ✆ Common software Trinity

From read counts to gene expression You are analyzing 2 genes (gene A and B) in two conditions (condition 1 and 2) on the bases of an RNA-seq experiment that resulted in the following number of reads: Condition 1 Condition 2 Gene A 1000 3000 Gene B 2000 4000 Discuss with you neigbor if the following statement is correct or not and why (2 min): Both gene A and B are more expressed in condition 2.

From read counts to gene expression We cannot state any such thing since we do not know • the sequencing depth (library size) • gene length → Tags per million (TPM), Reads/Fragments Per Kilobase transcript per Million mapped reads (RPKM/FPKM) • batch effects (methods are adapted from Microarray analyses)

✬ ✩ Length normalization Current RNA-seq protocols use an mRNA fragmentation approach prior to sequencing to gain sequence coverage of the whole transcript. Thus, the total number of reads for a given transcript is proportional to the expression level M of the transcript multiplied by the length L of the transcript: ✫ ✪ E ∝ M · L RPKM : Reads per kilobase transcript per million reads RPKM ( X ) = 10 9 · C Reads per transcript = million reads · transcript length(kb) N · L • C . . . number of mappable reads that fell onto the genes exons • N . . . total number of mappable reads in the experiment • L . . . sum of the exons in base pairs

Length normalization Question: What are the RPKM-corrected expression values and why? Graphic credit: Garber et al. (2011) Nature Methods 8:469477.

Length normalization Question: What are the RPKM-corrected expression values and why? Note normalization for fragment length (transcripts 3 and 4): Graphic credit: Garber et al. (2011) Nature Methods 8:469477.

Differential expression (DE) ★ ✥ A gene is declared differentially expressed if an observed difference or change in read counts between two experimental conditions is statistically significant, i.e. if the difference is greater than what ✧ would be expected just due to random variation. ✦ Principles are the same as for all other significance tests: 1 Use the independent replicates (samples of the same conditions) to estimate the variance of the expression 2 Use the expression and variance to test whether the difference is random or not The statistical power of the test increases with more replicates!

Technical versus biological replicates • What is the difference between technical and biological replicates? • Which different questions can I answer with the different types of replicates?

Parametric vs. non-parametric methods It would be nice to not have to assume anything about the expression value distributions but only use rank-order statistics. However, it is (typically) harder to show statistical significance with non-parametric methods with few replicates (Less than 10?).

Issues of DE testing for RNA-seq Couldn’t we just use a Student’s t test for each gene? 1 Distribution is not normal. Which parametric distribution should I use? 2 If we test each gene for DE we have to account for multiple testing! 3 The number of replicates is often to small to estimate the variance.

Binomial distribution Models for read counts originated from the idea that each read is sampled independently from a pool of reads and hence the number of reads for a given gene follows a binomial distribution. ✓ ✏ The binomial distribution works when we have a fixed number of ✒ trials n , each with a constant probability of success p . ✑ 0.25 The random variable X is the p=0.5 and n=20 p=0.7 and n=20 number of k successes: 0.20 p=0.5 and n=40 0.15 � n � p k (1 − p ) n − k p ( X = k ) = k 0.10 0.05 Event: An RNA-seq read ”lands“ in a given gene 0.00 0 10 20 30 40 (success) or not (failure)

Poisson distribution ✛ ✘ RNA-seq experiments produce large number of reads ( n is large) and probabilities of success are small ( p is small) which can be modelled by the poisson distribution which is an approximation of ✚ ✙ the binomial. Instead we know the average number of successes per intervall: λ = np For X ∼ Poisson( λ ), both the mean and the variance are equal to λ .

✎ Poisson versus negative binomial distribution ☞ Many studies have shown that the variance grows faster than the ✍ ✌ mean in RNAseq data. This is known as overdispersion .

✗ ✔ Negative binomial distribution The negative binomial distribution works for discrete data over an unbounded positive range whose sample variance exceeds the sample ✖ ✕ mean. The random variable X is the number of trials needed to make n successes (read counts) if the probability of a single success is p : � x − 1 � p n (1 − p ) x − n NB ( X = x ) = n − 1

Transcriptome analysis Stefan Seemann seemann@rth.dk University of - PowerPoint PPT Presentation

Statistical Bioinformatics: Transcriptome analysis Stefan Seemann seemann@rth.dk University of Copenhagen April 11th 2018 Outline: a) How to assess the quality of sequencing reads? b) How to normalize your expression data? c) How to test

Global Transcriptome Transcriptome Analysis of Analysis of Global Pseudomonas aeruginosa

Introduction to transcriptome analysis using high- throughput sequencing technologies D.

RNA-seq: Analysis options Genome? Biological samples/Library preparation Transcriptome

Applications and issues of large scale transcriptome profiling experiments Outline

DIVING INTO ALZHEIMERS DISEASE with transcriptome analysis of a zebrafish model Nhi Hin

Gl Global transcriptome analysis of non-se seed vasc scular plant Se Selaginella moe

Transcriptome Analysis Reveals Dynamic Changes in Coxsackievirus A16 Infected HEK 293T Cells

TimesVector: A vectorized clustering approach to the analysis of time series transcriptome data

Strategies for Bulk RNA-seq Analysis Genome Transcriptome Assembly Mapping Mapping Reads

CAMDA 2004 Dissecting the Malaria Transcriptome Malaria - the disease More than 400 million

Insights from the first RT-qPCR based human transcriptome profiling based on wet lab validated

Annotations des gnomes Analyse du transcriptome Analyse des voies mtaboliques Jacques van

Thyroid hormone transcriptome and cerebellum development . 1) Developmental biology: General

PNAS Science Menopon gallinae Ectopsocus briggsi Transcriptome sequences of three lice

Phylogenies Derived from Matched Transcriptome in Breast Cancer Brain Metastases Yifeng Tao 1,2 ,

Using Single-Cell Transcriptome Sequencing to Infer Olfactory Stem Cell Fate Trajectories

Challenge and novel aproaches for multiple sequence alignment and phylogenetic estimation Tandy

GPU accelerated partial order multiple sequence alignment for long reads self-correction

Sequence Analysis with TraMineR Gilbert Ritschard Institute for Demographic and Life Course

Outline What is EMBOSS? Major programs Running EMBOSS Programs from the Unix

Sequence Alignment: Linear Space Q. Can we avoid using quadratic space? Easy. Optimal value in

CSC263 Week 7 Thursday http://goo.gl/forms/S9yie3597B Announcement Pre-test office hour today

RNA Structure and RNA Structure Prediction Purines pentose Base glycosidic bond Adenine

Lecture 7: RNA folding Chapter 6 Problem 6.51 in