Quality Control: Single-Cell RNA-Seq Workflows in R

SLIDE 1

Quality Control

SINGLE-CELL RNA-SEQ WORKFLOWS IN R

Fanny Perraudeau

Senior Data Scientist, Whole Biome

SLIDE 2

Tung dataset

6 RNA-sequencing datasets per individual: 3 bulk & 3 single-cell (on C1 Plates).

Source: Tung et al., "Batch effects and the effective design of single-cell gene expression studies", Figure 1a.

SLIDE 3

Tung dataset

sce

class: SingleCellExperiment
dim: 18726 864
metadata(0):
assays(1): counts
rownames(18726): ENSG00000237683 ENSG00000187634 ... ERCC-00170 ERCC-00171
rowData names(0):
colnames(864): NA19098.r1.A01 NA19098.r1.A02 ... NA19239.r3.H11 NA19239.r3.H12
colData names(5): individual replicate well batch sample_id
reducedDimNames(0):
spikeNames(1): ERCC

SLIDE 4

Calculate quality control measures

# load the scater library
library(scater)

# calculate quality control measures
sce <- calculateQCMetrics(
    sce,
    feature_controls = list(ERCC = isSpike(sce, "ERCC"))
)

ERCC spike-in genes are used to filter out low-quality cells.
A high ratio of synthetic spike-in RNAs to endogenous RNAs means the cell is likely dead or stressed.

Quality control with scater (Single Cell Analysis Toolkit for Gene Expression Data in R): https://bioconductor.org/packages/3.9/bioc/vignettes/scater/inst/doc/vignette-qc.html
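The spike-in ratio described on this slide can be illustrated without scater. The following is a minimal sketch in base R, assuming spike-in rows can be recognised by an "ERCC-" prefix in their names (as in the Tung dataset's rownames). The toy matrix, its gene names, and the variable names are made up for illustration; this is not how calculateQCMetrics() is implemented.

```r
# Toy count matrix: 4 endogenous genes and 2 ERCC spike-ins (rows) by 3 cells (columns).
# All values and gene names are hypothetical. matrix() fills column by column,
# so each line of numbers below is one cell.
counts_toy <- matrix(
    c(10, 20, 30, 40,  5,  5,    # cell_1
      50, 50, 50, 50,  0,  0,    # cell_2
       1,  1,  1,  1, 48, 48),   # cell_3
    nrow = 6,
    dimnames = list(
        c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D", "ERCC-00001", "ERCC-00002"),
        c("cell_1", "cell_2", "cell_3")
    )
)

# Identify spike-in rows by their name prefix
is_ercc <- grepl("^ERCC-", rownames(counts_toy))

# Total counts per cell, and the counts that come from spike-ins
total_counts <- colSums(counts_toy)
total_counts_ercc <- colSums(counts_toy[is_ercc, , drop = FALSE])

# Percentage of each cell's counts that are spike-in reads
pct_counts_ercc <- 100 * total_counts_ercc / total_counts
print(round(pct_counts_ercc, 1))
```

In this toy example cell_3 is dominated by spike-in reads, which is the pattern the slide associates with dead or stressed cells.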


SLIDE 5

Functions used in exercises

  • Calculate quality measures: calculateQCMetrics()
  • Get the count matrix: counts()
  • Find sum for each row of a matrix: rowSums()
  • Find elements that follow a pattern: grepl()
  • Identify spike-in genes: isSpike()
  • Plot the distribution of x: plot(density(x))
  • Add a line to a plot: abline()

SLIDE 6

Let's practice!

SLIDE 7

Quality Control (continued)

Fanny Perraudeau

Senior Data Scientist, Whole Biome

SLIDE 8

Calculate quality control measures

library(scater)

sce <- calculateQCMetrics(
    sce,
    feature_controls = list(ERCC = isSpike(sce, "ERCC"))
)

SLIDE 9

Cell filtering - Library size

Total number of reads for each cell
In scater: total_counts
Goal: remove cells with few reads

SLIDE 10

Cell filtering - Library size

# plot the density of library size
plot(density(sce$total_counts), main = "Density - total_counts")

# set the threshold for minimal library size
threshold <- 20000

# add a vertical line at the threshold
abline(v = threshold)

SLIDE 11

Cell filtering - Library size

# find cells whose total_counts is greater than the threshold
keep <- (sce$total_counts > threshold)

# tabulate the keep vector
table(keep)

keep
FALSE  TRUE
   27   837
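The thresholding step can be tried on its own with a plain numeric vector. This is a minimal sketch in base R; the library sizes and cell names are made up, and only the threshold of 20000 comes from the slides.

```r
# Hypothetical library sizes for five cells (made-up numbers)
total_counts <- c(cell_1 = 5000, cell_2 = 25000, cell_3 = 60000,
                  cell_4 = 19999, cell_5 = 20001)

# Minimal library size, as on the slide
threshold <- 20000

# Logical vector: TRUE for cells with more reads than the threshold
keep <- total_counts > threshold

# Count how many cells fail (FALSE) and pass (TRUE) the filter
print(table(keep))

# Names of the cells that would be retained
kept_cells <- names(total_counts)[keep]
print(kept_cells)
```

Note that the comparison is strict, so a cell with exactly 20000 reads would be dropped.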

SLIDE 12

Cell filtering - Batch

plotPhenoData(
    sce,
    aes_string(x = "total_counts",
               y = "total_counts_ERCC",
               colour = "batch")
)

SLIDE 14

Cell filtering - Batch

# find cells whose batch is NOT NA19098.r2
keep <- (sce$batch != "NA19098.r2")

# tabulate the keep vector
table(keep)

keep
FALSE  TRUE
   96   768
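The slides compute the library-size filter and the batch filter separately, and each assignment to keep overwrites the previous one. One way to apply both at once is to combine the two logical vectors with &. The following is a minimal sketch in base R, assuming both filters should hold for a cell to be retained; the data frame cell_info and its values are hypothetical stand-ins for the cell-level metadata.

```r
# Hypothetical per-cell metadata (made-up values; batch labels follow the
# naming pattern seen in the Tung dataset)
cell_info <- data.frame(
    batch = c("NA19098.r1", "NA19098.r2", "NA19098.r2", "NA19239.r3"),
    total_counts = c(30000, 45000, 10000, 15000),
    stringsAsFactors = FALSE
)

# Filter 1: library size above the threshold
keep_size <- cell_info$total_counts > 20000

# Filter 2: cell does not belong to the problematic batch
keep_batch <- cell_info$batch != "NA19098.r2"

# A cell is retained only if it passes both filters
keep <- keep_size & keep_batch
print(table(keep))
```

With a SingleCellExperiment, a logical vector like this can be used to subset columns, for example sce[, keep].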

SLIDE 15

Gene filtering

Remove genes that are mostly not expressed.

# keep genes with counts of at least 2 in at least 2 cells
filter_genes <- apply(counts(sce), 1, function(x) length(x[x >= 2]) >= 2)

# tabulate filter_genes
table(filter_genes)

filter_genes
FALSE  TRUE
 4512 14214

Performed after cell filtering.
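The rule "a count of at least 2 in at least 2 cells" can be checked on a small matrix. This is a minimal sketch in base R with a made-up count matrix; it also shows a vectorised rowSums() form, which is an alternative offered here for comparison and not something taken from the slides.

```r
# Toy count matrix: 5 genes (rows) by 4 cells (columns); values are made up
counts_toy <- matrix(
    c(0, 0, 0, 0,    # gene_1: never detected
      2, 0, 0, 0,    # gene_2: count >= 2 in one cell only
      2, 3, 0, 0,    # gene_3: count >= 2 in two cells
      1, 1, 1, 1,    # gene_4: detected everywhere, but always below 2
      9, 0, 5, 2),   # gene_5: count >= 2 in three cells
    nrow = 5, byrow = TRUE,
    dimnames = list(paste0("gene_", 1:5), paste0("cell_", 1:4))
)

# Row-wise form: for each gene, count the cells with a count of at least 2
filter_apply <- apply(counts_toy, 1, function(x) length(x[x >= 2]) >= 2)

# Vectorised form: the logical matrix is summed per row
filter_rowsums <- rowSums(counts_toy >= 2) >= 2

# Both forms flag the same genes
print(all(filter_apply == filter_rowsums))
print(table(filter_rowsums))

# Keep only the genes that pass the filter
counts_filtered <- counts_toy[filter_rowsums, , drop = FALSE]
```

Only gene_3 and gene_5 pass. The rowSums() form avoids calling an R function once per gene, which tends to matter on matrices with tens of thousands of rows.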

SLIDE 16

Let's practice!

SLIDE 17

Normalization

Fanny Perraudeau

Senior Data Scientist, Whole Biome

SLIDE 18

Biological and technical variation

SLIDE 19

Batch effect

Clustering by batch - undesired technical artifact

SLIDE 20

Goal of normalization

Remove technical variation (e.g. batch effect)
...while preserving biological variation

SLIDE 21

Normalization methods

Normalizing by dividing by a normalization factor
  • Library size
  • Counts per million (CPM)
Other common scaling factors
  • Weighted trimmed mean of M-values (TMM) in edgeR
  • DESeq scaling factors
  • Scaling factors accounting for zero inflation in scran

Source: "Normalizing single-cell RNA sequencing data: challenges and opportunities" (Vallejos et al., 2017)
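Library-size scaling, the simplest method listed on this slide, can be written out in a few lines. This is a minimal sketch in base R of counts per million, assuming CPM is defined as each count divided by its cell's total count and multiplied by one million. The toy matrix is made up, and the more elaborate factors from edgeR, DESeq, and scran are not reproduced here.

```r
# Toy count matrix: 2 genes by 2 cells (made-up values).
# cell_2 has the same composition as cell_1 but was sequenced 20 times deeper.
counts_toy <- matrix(
    c(10,   90,     # cell_1
      200, 1800),   # cell_2
    nrow = 2,
    dimnames = list(c("gene_1", "gene_2"), c("cell_1", "cell_2"))
)

# Library size: total counts per cell
lib_size <- colSums(counts_toy)

# Divide each column by its library size, then scale to one million
cpm_toy <- sweep(counts_toy, 2, lib_size, "/") * 1e6
print(cpm_toy)
```

After scaling, the two cells have identical profiles, so the difference in sequencing depth (a technical effect) no longer shows up in the values.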

SLIDE 22

Functions used in exercises

  • Plot principal components: plotPCA()
  • Get first two principal components: reducedDim(sce, "PCA")[, 1:2]
  • Calculate and get the size factors: computeSumFactors(), sizeFactors()
  • Names of the matrices stored in an SCE: assays()
  • Normalize counts: normalize()
  • Plot the relative log expression: plotRLE()
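To give a feel for what a relative log expression plot summarises, here is a minimal sketch in base R. It assumes relative log expression is each gene's log expression minus that gene's median across cells, and it uses a made-up count matrix with a log2(count + 1) transform chosen for illustration. It is not the implementation behind plotRLE().

```r
# Toy count matrix: 3 genes by 3 cells (made-up values).
# cell_3 is sequenced more deeply than the other two cells.
counts_toy <- matrix(
    c(3,  7, 15,    # cell_1
      3,  7, 15,    # cell_2
      7, 15, 31),   # cell_3
    nrow = 3,
    dimnames = list(paste0("gene_", 1:3), paste0("cell_", 1:3))
)

# Log-transform with a pseudocount of 1 (an assumption made for this sketch)
log_expr <- log2(counts_toy + 1)

# Median log expression of each gene across cells
gene_medians <- apply(log_expr, 1, median)

# Relative log expression: deviation of each value from its gene's median.
# Subtracting a vector of length nrow() recycles it down each column,
# so row i is reduced by gene_medians[i].
rle_toy <- log_expr - gene_medians

# Per-cell median of the deviations; values far from 0 suggest a cell-wide shift
cell_medians <- apply(rle_toy, 2, median)
print(cell_medians)
```

In this toy example cell_3 sits one log2 unit above the other cells, the kind of cell-wide shift that normalization is meant to remove.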

SLIDE 23

Let's practice!