Differential Expression Analysis using limma (COMBINE RNA-seq Workshop)

Aug 30, 2023

SLIDE 1

Differential Expression Analysis using limma

COMBINE RNA-seq Workshop

SLIDE 2

limma package: Linear Models for Microarrays & RNA-seq

Data Import

Preprocessing & Quality Assessment

Linear Modeling & Differential Expression

Gene Set Testing

Professor Gordon Smyth

limma is celebrating its 15th birthday this year!

SLIDE 3

Many plotting options available…

[Figure: gallery of example limma plots. Recoverable labels: a "Gene" axis; sample groups DR, NC, D03, D10, U03, U10; sample labels HC15 to HY69; months Jan, Feb, Mar, Jun, Jul, Dec]

SLIDE 4

Linear models for differential expression

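$E(y_g) = X\beta_g$

where $y_g$ is the vector of expression values for gene $g$, $X$ is the design matrix and $\beta_g$ holds the gene-specific coefficients

Matrix of expression values (from RNA-seq / microarray)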

Gene-wise linear models

Estimated gene-specific parameters used for gene prioritization and gene set testing

Advanced statistical algorithms in limma that allow...

$\mathrm{var}(y_{gj}) = \sigma_g^2 / w_{gj}$

limma delivers powerful inference for differential expression analysis

Information Borrowing, Variance Modelling, Quantitative Weighting

$\hat{\beta}_g,\ s_g^{2*}$

Gene ID  LSK_1  LSK_2  CMP_1  CMP_2
11303      478    619   4830   7165
11305       27     20     48     55
11306      132    200    560    408
11307       42     60    131     99
…

… tens of thousands more …

Data Pre-processing

SLIDE 5

limma package: Linear Models for Microarrays & RNA-seq

arbitrarily complex experiments: linear models, contrasts
empirical Bayes methods for differential expression: t-tests, F-tests, posterior odds
analyse log-ratios, log-intensities, log-CPM values
accommodate quality weights in analysis
control of FDR across genes and contrasts
many plotting functions to help visualize raw data and final results from statistical analysis
gene set testing at various levels
fast, numerically efficient methods

Analysis of differential expression studies

SLIDE 6

RNA-seq of Mouse mammary gland

Cell type      Virgin  Pregnant  Lactating
Basal cells    n=2     n=2       n=2
Luminal cells  n=2     n=2       n=2

Fu et al. (2015) ‘EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival’ Nat Cell Biol

SLIDE 7

(some) questions we can ask

Which genes are differentially expressed between basal and luminal cells?
… between basal and luminal in virgin mice?
… between pregnant and lactating mice?
… between pregnant and lactating mice in basal cells?

SLIDE 8

What do we need to perform a statistical test?

Measure of average expression
Measure of variability

Measure of expression

Log2 fold-change: difference between the two means

SLIDE 9

One of the most useful statistics: t-test

We want to test the null hypothesis:

H0: mean(GroupA) = mean(GroupB)

against the alternative hypothesis:

H1: mean(GroupA) ≠ mean(GroupB)

An important assumption of the t-test is that the data is roughly normally distributed

A statistician’s best trick is to transform data that isn’t normally distributed into something that looks more normally distributed
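
To make the test above concrete, here is a minimal sketch in plain Python of a classical equal-variance two-sample t-statistic on log-expression values. The function name `two_sample_t` and the input numbers are hypothetical and chosen only for illustration. limma itself is an R package and goes further than this, moderating the variance with information from all genes.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(group_a, group_b):
    """Classical pooled-variance t-statistic for H0: mean(A) = mean(B).

    Inputs are log2 expression values, so the difference in means
    is the log2 fold-change (B relative to A).
    """
    n_a, n_b = len(group_a), len(group_b)
    log_fc = mean(group_b) - mean(group_a)  # measure of expression change
    df = n_a + n_b - 2                      # residual degrees of freedom
    # Pooled sample variance: the measure of variability.
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return log_fc, log_fc / std_err, df


# Hypothetical log2 expression values for one gene in two groups.
log_fc, t_stat, df = two_sample_t([4.0, 5.0, 6.0], [7.0, 8.0, 9.0])
print(f"log2FC={log_fc:.2f}, t={t_stat:.3f}, df={df}")
```

The p-value would come from comparing the statistic to a t-distribution with `df` degrees of freedom, which is left out here to keep the sketch within the standard library.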

SLIDE 10

Log-counts vs counts for one gene

[Figure: distributions of counts and log-counts for one gene, with the mean marked]

Count data is right-skewed

*A quick check to see how normal your data is: compare the mean and the median
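
The mean-versus-median check can be sketched in a few lines of Python. The counts below are hypothetical values for a single gene, and `mean_median_gap` is an illustrative helper, not a function from limma or edgeR.

```python
from math import log2
from statistics import mean, median


def mean_median_gap(values):
    """Relative gap (mean - median) / median.

    A gap near 0 is consistent with a roughly symmetric distribution;
    a large positive gap points to right skew.
    """
    return (mean(values) - median(values)) / median(values)


# Hypothetical counts for one gene across five samples (one large value).
counts = [10, 12, 15, 20, 200]
log_counts = [log2(c) for c in counts]

print(f"counts:     mean={mean(counts):.1f}, median={median(counts)}")
print(f"log-counts: mean={mean(log_counts):.2f}, median={median(log_counts):.2f}")
print(f"gap on counts={mean_median_gap(counts):.2f}, "
      f"gap on log-counts={mean_median_gap(log_counts):.2f}")
```

On the raw scale the single large count pulls the mean far above the median; after the log transform the two sit much closer together.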

SLIDE 11

We can perform t-tests on log-counts

Take into account different sequencing depths
Take into account normalisation factors
Take into account we can’t log a zero
The cpm(y, log=TRUE) function does this for you
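
The three adjustments listed above can be sketched in plain Python. This is an illustration of the idea rather than edgeR's implementation: it uses the fixed offsets that voom applies (0.5 added to each count, 1 added to the library size), whereas `cpm(y, log=TRUE)` uses its own prior-count scheme. The function name `log_cpm` and the example numbers are hypothetical.

```python
from math import log2


def log_cpm(counts, lib_size, norm_factor=1.0, offset=0.5):
    """Log2 counts-per-million for one sample.

    counts      -- raw read counts for each gene in this sample
    lib_size    -- total reads in the sample (sequencing depth)
    norm_factor -- normalisation factor scaling the library size
    offset      -- small constant added so that zero counts can be logged
    """
    effective_lib_size = lib_size * norm_factor
    return [
        log2((c + offset) / (effective_lib_size + 2 * offset) * 1e6)
        for c in counts
    ]


# Hypothetical sample: three genes, 999,999 reads in total.
print(log_cpm([0, 1, 50], lib_size=999_999))
```

A zero count maps to a finite value instead of minus infinity, and samples sequenced to different depths end up on a comparable scale.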

SLIDE 12

Now we have log-counts

Log-counts

SLIDE 13

RNA-seq data is more complicated

Mean-variance relationship. Use voom

[Figure: voom mean-variance trend. x-axis: mean (log2 cpm); y-axis: sqrt residual std dev (log2 cpm); curve: lowess fit]

SLIDE 14

Although we test one gene at a time, we can share information about all the genes to help with testing

[Figure panels: Before sharing | After sharing]
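
One way to picture the sharing is as a weighted average: each gene's own variance estimate is pulled towards a typical value learned from all genes. The sketch below assumes the standard empirical Bayes form of this average; in limma the prior variance and prior degrees of freedom are estimated from the data by eBayes, whereas here they are fixed hypothetical numbers, and `moderated_variance` is an illustrative name.

```python
def moderated_variance(sample_var, df_residual, prior_var, df_prior):
    """Shrink a gene-wise variance towards a prior (shared) variance.

    The result is a weighted average of the two variances, with the
    degrees of freedom acting as weights.
    """
    return (df_prior * prior_var + df_residual * sample_var) / (df_prior + df_residual)


# Hypothetical gene-wise variances: one very small, one typical, one very large.
before = [0.01, 0.25, 4.0]
after = [
    moderated_variance(s2, df_residual=4, prior_var=0.25, df_prior=4)
    for s2 in before
]
print("before sharing:", before)
print("after sharing: ", after)
```

Extreme variances move towards the shared value while typical ones barely change, which guards against genes looking significant only because their variance was underestimated by chance.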

SLIDE 15

Multiple testing burden

Problem: We are performing tens of thousands of tests, which increases our chances of getting false discoveries

Solution: Calculate false discovery rates (“adjusted p-values” in limma)

Interpretation: If 100 genes are significant at FDR < 5%, we expect about 5 of them to be false discoveries
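
As a rough illustration of how adjusted p-values are obtained, here is a minimal Python sketch of the Benjamini-Hochberg procedure, a widely used FDR method. The p-values are hypothetical and `bh_adjust` is an illustrative name; in practice the adjustment is done for you inside limma.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Visit the p-values from largest to smallest.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank when sorted in ascending order
        # Scale by m / rank, then enforce monotonicity with a running minimum.
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print(bh_adjust(raw))
```

Calling genes significant when the adjusted value is below 0.05 controls the expected proportion of false discoveries among the selected genes at 5%.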

SLIDE 16

Linear modelling analysis pipeline for RNA-seq data

model.matrix / makeContrasts
voom
lmFit
contrasts.fit
treat
eBayes
topTable / topTreat
decideTests