Winter School, 2 July 2012 Why do RNA-seq? Differential expression - - PDF document

winter school 2 july 2012
SMART_READER_LITE
LIVE PREVIEW

Winter School, 2 July 2012 Why do RNA-seq? Differential expression - - PDF document

Differential Expression Analysis of Gordon Smyth RNA-Seq Read Counts Winter School, 2 July 2012 Why do RNA-seq? Differential expression analysis of Discover new exons, transcripts etc RNA-seq read counts Quantify the expression levels of


slide-1
SLIDE 1

Differential Expression Analysis of RNA-Seq Read Counts Gordon Smyth Winter School, 2 July 2012 1

Differential expression analysis of RNA-seq read counts

Gordon Smyth

1

Winter School 2 July 2012

Why do RNA-seq?

Discover new exons, transcripts etc Quantify the expression levels of individual transcripts for the same gene An improved expression platform for pre- defined genomic features

2

Why do RNA-seq?

Discover new exons, transcripts etc Quantify the expression levels of individual transcripts for the same gene An improved expression platform for pre- defined genomic features

3

Why do RNA-seq?

Discover new exons, transcripts etc Quantify the expression levels of individual transcripts for the same gene An improved expression platform for pre- defined genomic features This talk is about gene-level differential expression analysis using existing gene models

4

Table of Read Counts

RNA sample

5 B J Haas & M C Zody Advancing RNA-Seq analysis Nature Biotechnology 28, 421–423 (2010)

Count data

Sample 1 Sample 2 Sample 3 Gene 1 2 1 Gene 2 6 4 5 Gene 3 1

Typical two group RNA-Seq experiment with gene knock-out mice

Wild-type mouse Mutant mouse Wild-type mouse Mutant mouse

RNA

6

samples Aim:

  • find genes DE in mutant mice
  • so as to learn which other genes are regulated

by the knocked-out gene

n=2 in each group

slide-2
SLIDE 2

Differential Expression Analysis of RNA-Seq Read Counts Gordon Smyth Winter School, 2 July 2012 2

Count table summarizes the results of the experiment

Wild-type Mutant Mouse1 Mouse2 Mouse1 Mouse2 Gene 1 45 60 30 39

7

Gene 2 4 3 7 Gene 3 1010 800 3099 3450 ... ... ... ... ... Total 20m 16m 15m 17m

I use all reads mapping to exons for each gene

Genomic DE analysis is crazy! from a classical statistical point of view

Estimate parameters and test for DE for 30,000 or more genes … from 4 samples Ultra high-dimensional data Ultra high dimensional data Tiny sample sizes Need innovative statistical strategies that take advantage of the data structure

8

Count variation arises from both biological and technical sources g

9

Read counts estimate RNA concentration

RNA sample

M t t l b f d (lib i )

10

M = total number of reads (library size) ≈ 20 million λg = proportion of RNA in sample from gene g yg = number of reads for gene g

E(yg) = µg = M λg

Genes g = 1, …, ≈ 30k

Back to the two-group RNA-seq experiment

Wild type mice Mutant mice λ λ λ λ

11

λg1 λg2 λg3 λg4 yg1 yg2 yg3 yg4 M1 M2 M3 M4

E(ygi) = µgi = Mi λgi for gene g in sample i

Sources of variation

Wild-type mice Mutant mice

Biological

12

λg1 λg2 λg3 λg4 M1 M2 M3 M4

Biological variation

yg1 yg2 yg3 yg4

Measurement error

Variation in y is mixture of biological (multiplicative) and measurement (Poisson)

slide-3
SLIDE 3

Differential Expression Analysis of RNA-Seq Read Counts Gordon Smyth Winter School, 2 July 2012 3

Mixture structure implies a quadratic mean-variance relationship

var(ygi) = μgi + φg μgi

2

13

CV2 of the “true” expression levels λgi across replicates Poisson variation CV = coefficient of variation = sd/mean

Biological coefficient of variation

Total CV2 = Technical CV2 + Biological CV2

From sequencing CV of “true” expression

14

q g technology p level zero for large counts ≈ constant

BCV = √φg

An experiment with technical replicates

wild-type mouse Mutant mouse

15

λg1 λg2 λg3 λg4 yg1 yg2 yg3 yg4 M1 M2 M3 M4

True technical reps show Poisson variation for each gene BCV ≈ 0

Data from Marioni et al., Genome Res, 2008

16

binned variance, sample variance

Real data with biological replicate show quadratic mean-variance relationships BCV = 0 38

binned variance, sample variance

Data from Parikh et al, Genome Biology, 2010

0.38

17

Estimating biological variability, and borrowing strength between genes genes

18

slide-4
SLIDE 4

Differential Expression Analysis of RNA-Seq Read Counts Gordon Smyth Winter School, 2 July 2012 4

edgeR negative binomial approach

If λgi are gamma distributed, then ygi ~ NegBin( μgi, φg)

19

We use a conditional likelihood approach to obtain approximately unbiased estimates for the dispersions φg even in small samples Complexity of dispersion: sharing information between genes Separate gene-wise estimation of φg is impractical Common dispersion (Robinson & Smyth 2008) 2008) Trended dispersion (Anders & Huber 2010) Gene-wise by empirical Bayes shrinkage (Robinson & Smyth, 2007)

20

Common dispersion likelihood

Assumes same dispersion for all genes φg = φ Genewise conditional log-likelihood

21

Genewise conditional log likelihood ℓg(φ;yg) Common-dispersion log-likelihood ℓc(φ) = (1/G) ∑gℓg(φ;yg) Maximized at φc Empirical Bayes squeezing

  • f the genewise dispersions

Posterior = ℓg(φg) + α ℓc(φc)

Estimate φg by empirical posterior mode:

22

Genewise likelihood Empirical prior distribution Precision

  • f prior

Empirical Bayes squeezing

  • f the genewise dispersions

Posterior = ℓg(φg) + α ℓc(φc)

Estimate φg by empirical posterior mode:

23

Genewise likelihood Empirical prior distribution Local weighting produces abundance dependent prior Precision

  • f prior

Usually BCV is somewhat larger for small counts

Mouse lymphomas (Stan Lee)

slide-5
SLIDE 5

Differential Expression Analysis of RNA-Seq Read Counts Gordon Smyth Winter School, 2 July 2012 5

BCV can show a strong trend (for technical reasons)

Mouse hemapoeitic stem cells, NuGEN amplified

25

p (Samir Taoudi) Asymptotic dispersion Common dispersion

.8 1.0 1.2

Pu1 Restoration in Leukemia

  • n

Gm16901 Ighv 1-84 Ighv 5-9 Gm16965 Ighv 1-42 Gm16814 Ighv 1-85 Gm16969 Gm16891

And often we can identify hyper- variable genes

  • 20
  • 18
  • 16
  • 14
  • 12
  • 10
  • 8
  • 6

0.0 0.2 0.4 0.6 log2 Common Concentration Tagwise Dispersi

Ighv 1-54 Igkv 17-127 Igkv 5-43 Igkv 12-44 Gm16746 Ighv 5-17

Mouse pre B cell leukaemias (Mark McKenzie)

Empirical Bayes improves accuracy

  • f genewise dispersion estimation

Separate estimation Emp Bayes compromise Common

27

Common dispersion

Biological variation is gene-specific

28

Genewise goodness of fit tests

Matched tumour and normal tissue samples from Tuch et al (2010), BCV = 40%

Genewise goodness of fit tests

Prostate adenocarcinoma cell line from Li et al 2008 Low biological variability, BCV=14%

slide-6
SLIDE 6

Differential Expression Analysis of RNA-Seq Read Counts Gordon Smyth Winter School, 2 July 2012 6

Need gene-specific dispersions to prioritize less variable genes

Top 4 genes using trended dispersions Top 4 genes using genewise dispersions

Plt8 vs Suz12 wt lymphomas

More complex experiments

32

Oral squamous cancer: paired normal and tumour samples

Normal tissue Tumour tissue Patient 8 N8 T8 Patient 33 N33 T33

33

Patient 33 N33 T33 Patient 51 N51 T51

Tuch et al, PloS ONE 2010

Aim: find genes DE in tumour tissue, adjusting for differences between patients

Ignoring patient differences will

  • ver-estimate variability

Normal Difference between cancer

34

Normal Cancer cancer and normal samples looks not significant

Consistent DE is found after accounting for baseline patient effects

Patient 1 P ti t 2 2-fold difference between

35

Patient 2 Patient 3

Cancer Normal

cancer and normal within each patient

Use log-linear models for general experimental design

log μgi = log λig + log Mgi = xi

Tβg + log Mgi

36

row of design matrix vector of log fold changes (normalized) library size (unknown)

slide-7
SLIDE 7

Differential Expression Analysis of RNA-Seq Read Counts Gordon Smyth Winter School, 2 July 2012 7

Oral squamous cancer

Normal Tumour Patient 8 N8 T8 Patient 33 N33 T33 Patient 51 N51 T51

Tuch et al, PloS ONE 2010 Detect 1271

37

BCV=40% generally DE genes 184 genes specific to individual tumours

+

FDR < 0.05

Simulation study

Abundances based on real data 10000 genes, 200 DE 3 vs 3 sample comparison library sizes alternate between 2 million and 20 million Gamma-Poisson mixture Trended biological variation with genewise variation around trend

38

The edgeR NB approach achieves good FDR performance

  • veries

39

False disco Number of DE genes

RNA-Seq vs. Microarray (5% FDR)

2 4

Log Fold Changes (8295 Overlaps)

  • s. untreated

1107 both up 1167 both down Ikzf1

Dox3d vs. Untreated

RNASeq Microarray

  • 4
  • 3
  • 2
  • 1

1 2 3

  • 6
  • 4
  • 2

Microarray : dox3d vs. untreated RNASeq : dox3d vs

40

9706 234 2056 1107 up 9723 182 2031 1167 down

Microarray DE genes in RNA-Seq

Dox3d vs. Untreated t stats: top DE Genes from Microarray

41

Dox3d Untreated Microarrray Top DE Genes 117 up 76 down

Genes DE in RNA-Seq but missing on microarray

42

slide-8
SLIDE 8

Differential Expression Analysis of RNA-Seq Read Counts Gordon Smyth Winter School, 2 July 2012 8

Conclusions

Model the mean-variance relationship Estimate gene-specific biological variation Borrow strength between genes Linear modelling for multi-factor experiments

43

Acknowledgements

edgeR team Davis McCarthy Yunshun Chen Mark Robinson

Collaborators Samir Taoudi Stanley Lee Warren Alexander

Yifang Hu Belinda Phipson Wei Shi Yang Liao

Mark McKenzie Ross Dickins

44