

slide-1
SLIDE 1

RNA-seq: filtering, quality control and visualisation

COMBINE RNA-seq Workshop

slide-2
SLIDE 2

QC and visualisation (part 1)

slide-3
SLIDE 3

RNA-seq of Mouse mammary gland

               Virgin   Pregnant   Lactating
Basal cells    n=2      n=2        n=2
Luminal cells  n=2      n=2        n=2

Fu et al. (2015) ‘EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival’ Nat Cell Biol

Slide taken from COMBINE RNAseq workshop on 23/09/2016

slide-4
SLIDE 4

(some) questions we can ask

  • Which genes are differentially expressed between basal and luminal cells?
  • … between basal and luminal in virgin mice?
  • … between pregnant and lactating mice?
  • … between pregnant and lactating mice in basal cells?


slide-5
SLIDE 5
  • Reading in the data
– counts data and sample information
  • Formatting the data
– clean it up so we can look at it easily

slide-6
SLIDE 6

Filtering out lowly expressed genes

  • Genes with very low counts in all samples provide little evidence for differential expression
  • Often samples have many genes with zero or very low counts

[Figure: density of log-cpm for each sample. Panel A: raw data. Panel B: filtered data.]

slide-7
SLIDE 7

Filtering out lowly expressed genes

  • Testing for differential expression for many genes simultaneously adds to the multiple testing burden, reducing the power to detect DE genes.
  • IT IS VERY IMPORTANT to filter out genes that have all zero counts or very low counts.
  • We filter using CPM values rather than counts because they account for differences in sequencing depth between samples.

slide-8
SLIDE 8

Filtering out lowly expressed genes

  • CPM = counts per million, or how many counts would I get for a gene if the sample had a library size of 1M. For a given gene:

Library size   Count   CPM
1M             1       1
10M            10      1
20M            10      0.5
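The CPM arithmetic in the table can be checked with a few lines of Python. This is only a sketch of the definition given on the slide; the function name `cpm` is chosen here for illustration and is not taken from the workshop material.

```python
def cpm(count, library_size):
    """Counts per million: the count rescaled to a library of one million reads."""
    return count * 1_000_000 / library_size

# (library size, count) pairs from the table on this slide
rows = [(1_000_000, 1), (10_000_000, 10), (20_000_000, 10)]
cpm_values = [cpm(count, library_size) for library_size, count in rows]

for (library_size, count), value in zip(rows, cpm_values):
    print(f"library size {library_size:>10,}  count {count:>2}  CPM {value}")
```

The same count of 10 gives a CPM of 1 in a 10M library but only 0.5 in a 20M library, which is why CPM rather than the raw count is compared across samples.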

slide-11
SLIDE 11

Filtering out lowly expressed genes

  • Use a CPM threshold to define “expressed” and “unexpressed”
  • As a general rule, a good threshold is the CPM value that corresponds to a count of 10.
  • In our dataset, the samples have library sizes between 20 million and twenty-something million.

Library size   Count   CPM
1M             1       1
10M            10      1
20M            10      0.5

With a library size of about 20M, a count of 10 corresponds to a CPM of 0.5, so we use a CPM threshold of 0.5! But if this is too hard to work out, a CPM threshold of 1 works well in most cases.
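The rule of thumb (pick the CPM that corresponds to a count of 10) can be turned into a small calculation. The sketch below assumes a single representative library size; the helper name `cpm_threshold` and the `min_count` parameter are illustrative, not part of the workshop code.

```python
def cpm_threshold(library_size, min_count=10):
    """CPM value that corresponds to `min_count` reads at this library size."""
    return min_count * 1_000_000 / library_size

# Library sizes from the table on this slide
threshold_20m = cpm_threshold(20_000_000)
threshold_10m = cpm_threshold(10_000_000)

print("threshold at 20M reads:", threshold_20m)
print("threshold at 10M reads:", threshold_10m)
```

For libraries of around 20M reads this gives the threshold of 0.5 used here, and for 10M reads it gives 1, the fallback value mentioned on the slide.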

slide-12
SLIDE 12

Filtering out lowly expressed genes

Basal cells:    Virgin, Pregnant, Lactating
Luminal cells:  Virgin, Pregnant, Lactating

  • We keep any gene that is (roughly) expressed in at least one group.
  • 12 samples, 6 groups, 2 replicates in each group.

Keep if CPM > 0.5 in at least 2 out of 12 samples

slide-13
SLIDE 13

Filtering out lowly expressed genes

               Virgin        Pregnant      Lactating
Basal cells    expressed     expressed     expressed
Luminal cells  unexpressed   unexpressed   unexpressed

  • We keep any gene that is (roughly) expressed in at least one group.
  • 12 samples, 6 groups, 2 replicates in each group.

Keep if CPM > 0.5 in at least 2 out of 12 samples

slide-14
SLIDE 14

Filtering out lowly expressed genes

               Virgin        Pregnant      Lactating
Basal cells    unexpressed   expressed     unexpressed
Luminal cells  unexpressed   unexpressed   unexpressed

  • We keep any gene that is (roughly) expressed in at least one group.
  • 12 samples, 6 groups, 2 replicates in each group.

Keep if CPM > 0.5 in at least 2 out of 12 samples

slide-15
SLIDE 15

Filtering out lowly expressed genes

  • We keep any gene that is (roughly) expressed in at least one group.
  • 12 samples, 6 groups, 2 replicates in each group.

Keep gene if CPM > 0.5 in at least 2 samples

               Virgin        Pregnant      Lactating
Basal cells    unexpressed   expressed     unexpressed
Luminal cells  unexpressed   unexpressed   unexpressed
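The filtering rule can be sketched as follows. The gene names and CPM values are made up for illustration; only the threshold (0.5) and the minimum number of samples (2) come from the slides. Sample order is assumed to be basal virgin, pregnant, lactating, then luminal virgin, pregnant, lactating, with two replicates each.

```python
def keep_gene(cpm_row, threshold=0.5, min_samples=2):
    """Keep a gene if its CPM exceeds `threshold` in at least `min_samples` samples."""
    return sum(value > threshold for value in cpm_row) >= min_samples

# Hypothetical CPM values for 12 samples (illustrative only)
cpm_by_gene = {
    # expressed only in the two basal pregnant replicates: kept
    "geneA": [0.1, 0.0, 3.2, 2.8, 0.2, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0, 0.2],
    # very low everywhere: filtered out
    "geneB": [0.0, 0.1, 0.0, 0.0, 0.2, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0],
    # above threshold in a single sample only: filtered out
    "geneC": [0.0, 0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0],
    # above threshold in two samples from different groups: still kept
    "geneD": [0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7],
}

kept = [gene for gene, row in cpm_by_gene.items() if keep_gene(row)]
print(kept)
```

The last gene shows why the rule is only "roughly" per group: the filter counts samples without looking at group labels, so two expressing samples from different groups also pass.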

slide-16
SLIDE 16

QC and visualisation (part 2)

slide-17
SLIDE 17

MDS Plots

  • A visualisation of a principal components analysis, which looks at where the greatest sources of variation in the data come from.
  • Distances represent the typical log2-FC observed between each pair of samples
– e.g. 6 units apart = 2^6 = 64-fold difference
  • Unsupervised – separation based on data, no prior knowledge of experimental design.
– Useful for an overview of the data. Do samples separate by experimental groups?
– Quality control
– Outliers?
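One way to read "typical log2-FC between a pair of samples" is as a root-mean-square of the largest log2 fold changes between the two samples. The sketch below makes that assumption explicitly; the slides do not spell out the calculation, and the function name, the `top` parameter and the toy log-cpm values are all illustrative.

```python
import math

def leading_logfc_distance(logcpm_a, logcpm_b, top=500):
    """Root-mean-square of the `top` largest absolute log2 fold changes.

    Assumption: this is one plausible definition of the 'typical log2-FC'
    between two samples; it is not taken from the slides.
    """
    diffs = sorted((abs(a - b) for a, b in zip(logcpm_a, logcpm_b)), reverse=True)
    leading = diffs[:top]
    return math.sqrt(sum(d * d for d in leading) / len(leading))

# Toy log-cpm values for four genes in two samples (illustrative only)
sample_1 = [8.0, 7.0, 5.0, 2.0]
sample_2 = [2.0, 1.0, 5.0, 2.0]

distance = leading_logfc_distance(sample_1, sample_2, top=2)
fold_change = 2 ** distance
print(distance, fold_change)
```

Here the two leading genes each differ by 6 log2 units, so the samples sit 6 units apart, matching the slide's example of a 2^6 = 64-fold difference.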

slide-18
SLIDE 18
slide-19
SLIDE 19
slide-20
SLIDE 20

QC and visualisation (part 3)

slide-21
SLIDE 21

Normalisation for composition bias

[Figure: log-cpm distributions for each sample. Panel A: example of unnormalised data. Panel B: example of normalised data.]

If we ran a DE analysis on Sample 1 and Sample 3, almost all genes would appear down-regulated in Sample 1!

slide-23
SLIDE 23

Normalisation for composition bias

  • TMM normalisation (Robinson and Oshlack, 2010)
  • How do we make the expression of all the genes go UP in the one sample?
– Scaling factors
  • E.g. scale a library size of 10M by 0.1 so the effective library size is 1M.
  • A scaling factor < 1 makes the CPM larger.

Library size   Count   CPM
10M            10      1
1M             10      10
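The effect of a scaling factor on CPM can be sketched as below. This only applies a given factor to the library size; it does not estimate TMM factors from data. The function names and the `norm_factor` parameter are illustrative assumptions.

```python
def effective_library_size(library_size, norm_factor):
    """Library size after applying a normalisation (scaling) factor."""
    return library_size * norm_factor

def normalised_cpm(count, library_size, norm_factor=1.0):
    """CPM computed against the effective library size."""
    return count * 1_000_000 / effective_library_size(library_size, norm_factor)

# Values from the table on this slide: a count of 10 in a 10M library
unscaled = normalised_cpm(10, 10_000_000)
scaled = normalised_cpm(10, 10_000_000, norm_factor=0.1)

print("CPM without scaling:", unscaled)
print("CPM with factor 0.1:", scaled)
```

A factor of 0.1 shrinks the effective library size from 10M to 1M, so the CPM rises from about 1 to about 10, which is the "scaling factor < 1 makes the CPM larger" point.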

slide-24
SLIDE 24

Voom

slide-25
SLIDE 25

Variance of log-cpm depends on mean

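[Figure: variability of log-cpm against average log2(count + 0.5) for five datasets: SEQC (technical variation only), Mouse (low biological variation), Simulations (moderate biological variation), Nigerian (high biological variation), D. melanogaster (systematically different).]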

slide-26
SLIDE 26

Normal dist. assumes that the data

  • is continuous
  • has constant variance

RNA-seq data

  • is discrete
  • has a non-constant mean-variance trend

Voom

  • Transform to log-counts per million
  • Remove mean-var dependence through the use of precision weights
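The log-counts-per-million transform can be sketched as follows. The offsets (0.5 added to the count, 1 added to the library size) follow the published voom method as far as I know; the slides do not state the formula, so treat it as an assumption. The function name and the 20M library size are illustrative.

```python
import math

def voom_logcpm(count, library_size):
    """log2 counts per million with small offsets so that zero counts stay finite.

    Assumption: offsets of 0.5 (count) and 1 (library size), as in the voom method.
    """
    return math.log2((count + 0.5) / (library_size + 1.0) * 1_000_000)

# A zero count and a count of 10 in a hypothetical 20M-read library
logcpm_zero = voom_logcpm(0, 20_000_000)
logcpm_ten = voom_logcpm(10, 20_000_000)

print("log-cpm for count 0:", round(logcpm_zero, 2))
print("log-cpm for count 10:", round(logcpm_ten, 2))
```

The offset is what keeps a count of zero from producing log2(0), so every observation gets a finite, continuous value.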

slide-27
SLIDE 27

[Figure: two panels. Before: trended variance (quarter-root variance against mean log-count). After: constant variance (log2 variance against mean log-cpm).]

Variance weights

  • Obtain variance estimates for each observation using the mean-variance trend.
  • Assign inverse variance weights to each observation.
  • Weights remove the mean-variance trend from the data.
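The weighting step can be sketched with a stand-in trend. The real trend is fitted to the data; `toy_trend` below is a hypothetical decreasing curve used only to show how a predicted quarter-root variance becomes an inverse-variance weight. All names are illustrative.

```python
def toy_trend(mean_log_count):
    """Hypothetical mean-variance trend: predicted quarter-root variance.

    Stands in for a curve fitted to real data; assumes mean_log_count >= -1.
    """
    return 0.5 + 1.0 / (2.0 + mean_log_count)

def precision_weight(mean_log_count, trend):
    """Inverse-variance weight: variance is the quarter-root variance to the 4th power."""
    quarter_root_variance = trend(mean_log_count)
    return 1.0 / quarter_root_variance ** 4

# Low-count observations are noisier, so they receive smaller weights
weight_low = precision_weight(0.0, toy_trend)
weight_high = precision_weight(8.0, toy_trend)

print("weight at low expression:", weight_low)
print("weight at high expression:", round(weight_high, 3))
```

Observations with higher predicted variance are down-weighted, which is how the weights remove the mean-variance trend when a linear model is fitted.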