Single-cell RNA-sequencing Ximena Ibarra-Soria CRUK Cambridge - - PowerPoint PPT Presentation

single cell rna sequencing
SMART_READER_LITE
LIVE PREVIEW

Single-cell RNA-sequencing Ximena Ibarra-Soria CRUK Cambridge - - PowerPoint PPT Presentation

Single-cell RNA-sequencing Ximena Ibarra-Soria CRUK Cambridge Institute RNA-Sequence Analysis Course - EMBL EBI 12 April 2019 Based on materials by Aaron Lun Why use single-cell RNA-seq It allows us to chracterise heterogeneity in the gene


slide-1
SLIDE 1

Single-cell RNA-sequencing

Ximena Ibarra-Soria

CRUK Cambridge Institute RNA-Sequence Analysis Course - EMBL EBI 12 April 2019 Based on materials by Aaron Lun

slide-2
SLIDE 2

Why use single-cell RNA-seq

It allows us to chracterise heterogeneity in the gene expression profile of a population. The cell is the basic unit of life. At the cell-level we can:

slide-3
SLIDE 3

Why use single-cell RNA-seq

It allows us to chracterise heterogeneity in the gene expression profile of a population. The cell is the basic unit of life. At the cell-level we can:

◮ Define cell identities (e.g. cell-types or subtypes).

doi.org/10.1038/nri2707 doi.org/10.15252/msb.20145549
slide-4
SLIDE 4

Why use single-cell RNA-seq

It allows us to chracterise heterogeneity in the gene expression profile of a population. The cell is the basic unit of life. At the cell-level we can:

◮ Define cell identities (e.g. cell-types or subtypes). ◮ Observe cell states and behaviour (e.g. cell cycle, metabolism,

stress).

doi.org/10.1038/nri2707 doi.org/10.15252/msb.20145549
slide-5
SLIDE 5

Why use single-cell RNA-seq

It allows us to chracterise heterogeneity in the gene expression profile of a population. The cell is the basic unit of life. At the cell-level we can:

◮ Define cell identities (e.g. cell-types or subtypes). ◮ Observe cell states and behaviour (e.g. cell cycle, metabolism,

stress).

◮ Study dynamic processes (e.g. differentiation, activation).

doi.org/10.1038/nri2707 doi.org/10.15252/msb.20145549
slide-6
SLIDE 6

Why use single-cell RNA-seq

It allows us to chracterise heterogeneity in the gene expression profile of a population. The cell is the basic unit of life. At the cell-level we can:

◮ Define cell identities (e.g. cell-types or subtypes). ◮ Observe cell states and behaviour (e.g. cell cycle, metabolism,

stress).

◮ Study dynamic processes (e.g. differentiation, activation). ◮ Study noise in transcriptional regulation.

doi.org/10.1038/nri2707 doi.org/10.15252/msb.20145549
slide-7
SLIDE 7

Why use single-cell RNA-seq

RNA-seq allows quantification of the whole transcriptome.

◮ FISH: small number of transcripts. ◮ seqFISH+: 10,000 genes. ◮ FACS: small number of proteins. ◮ Mass cytometry: ~40 proteins.

slide-8
SLIDE 8

A typical single-cell experiment

Tissue Dissociated single cells Physical separation Lysis

AAAAAAAA 3'-TTTTTTTTT-X-5'

Reverse transcription Cell barcoding Pooled sequencing

ACTG ACTG GCTA GCTA CCTG CCTG

ACTG GCTA CCTG

A C T G A C T G G C T A G C T A C C T G C C T G

Computational analysis

◮ Dissociation can be easy or hard (blood vs muscle). ◮ Many separation methods (plate-based, droplets). ◮ Different protocols for RT and cDNA generation (full-length,

5’/3’ biased)

slide-9
SLIDE 9

Throughput of scRNA-seq protocols

scRNA-seq protocols have increased hugely in throughput.

◮ Cell separation using FACS or microfluidic devices. ◮ Automation of RT and cDNA generation.

Svensson et al., Nature Protocols, 2018 (doi.org/10.1038/nprot.2017.149) 1,000,000 100,000 10,000 1,000 1 100 10 Multiplexing MARS-seq DroNC-seq sci-RNA-seq Manual inDrop

a b

Integrated fluidic circuits Liquid-handling robotics Nanodroplets Picowells In situ barcoding Tang et al. 200918 Islam et al. 201124 Brennecke et al. 201364 Jaitin et al. 201433 Klein et al. 201534 Macosko et al. 201540 Bose et al. 201543 Cao et al. 201751 Rosenberg et al. 201752 Single cells in study Study publication date 2009 2010 2011 2012 2013 2014 2015 2016 2017 High-throughput sequencing of RNA from single cells STRT-seq CEL-seq Fluidigm C1 SMART-seq SMART-seq2 CytoSeq Drop-seq 10x Genomics SPLiT-seq Seq-Well
slide-10
SLIDE 10

scRNA-seq protocols

Plate-based

◮ Hundreds to a few

thousand cells.

◮ High number of genes

detected.

◮ High capture efficiency. ◮ Full-length transcripts. ◮ UMIs optional. ◮ Compatible with spike-ins.

Droplet-based

◮ Tens to hundreds of

thousands cells.

◮ Lower number of genes

detected.

◮ Variable capture efficiency. ◮ 3’/5’-biased. ◮ UMIs. ◮ No spike-ins.

slide-11
SLIDE 11

Library prep

◮ Most protocols use polyA selection. ◮ cDNA is amplified by PCR.

◮ Introduction of strong biases. ◮ Alleviated by the use of Unique Molecular Identifiers (UMIs).

3'-ACATCGATCGC...TTTT-GGAT-AACGT-5' 3'-CGACGGTTACG...TTTT-GCTT-AACGT-5' 3'-TGAGCATCACTA...TTTT-AGTA-AACGT-5' 3'-TGAGCATCACTA...TTTT-AGTA-AACGT-5' 3'-TGAGCATCACTA...TTTT-AGTA-AACGT-5' 3'-ACATCGATCGC...TTTT-GGAT-AACGT-5' 3'-ACATCGATCGC...TTTT-GGAT-AACGT-5' 3'-ACATCGATCGC...TTTT-GGAT-AACGT-5' 3'-ACATCGATCGC...TTTT-GGAT-AACGT-5' 3'-AGCGTAGGCT...TTTT-GGAT-AACGT-5' 3'-CACGGAAAAT...TTTT-GCTT-AACGT-5' 3'-CAGCAGCTGA...TTTT-AGTA-AACGT-5' 3'-CCGGGGAGGA...TTTT-AGTA-AACGT-5' 3'-CTTTTATGAG...TTTT-AGTA-AACGT-5' 3'-CGAGGGCGGT...TTTT-GGAT-AACGT-5' 3'-CAGTCGTACG...TTTT-GGAT-AACGT-5' 3'-CAGGCTGACG...TTTT-GGAT-AACGT-5' 3'-GGATAGCTAG...TTTT-GGAT-AACGT-5' Different fragmentation site per amplicon 3'-ACATCGATCGC...TTTT-GGAT-AACGT-5' 3'-CGACGGTTACG...TTTT-GCTT-AACGT-5' 3'-TGAGCATCACTA...TTTT-AGTA-AACGT-5' UMI Constant Transcript (RC)

After RT After amplification After fragmentation After sequencing

TAGG TAGG TAGG TAGG TAGG TTCG ATGA ATGA ATGA GTCAGCATGC GCTCCCGCCA TCGCATCCGA GACCGACTGC CCTATCGATC GAGCCTTTTA GTCGTCGACT GGCCCCTCCT GAAAATACTC Read 1 Read 2
slide-12
SLIDE 12

Cell barcoding

Allows multiplexing to sequence many libraries in the same lane. Different strategies:

  • 1. Cell barcode in the PCR primer.

◮ Incorporated during library prep. ◮ Plate-based methods only (different barcode per well).

  • 2. Cell barcode in the oligo-dT primer.
CGACTA-NNNN-TTTTTTTT-3' GTCAAA-NNNN-TTTTTTTT-3' Cell barcode (constant within bead) UMI (variable within bead) One bead loaded per droplet, as well as ≤ 1 cell (hopefully)

Bead Bead

Different cell barcode
slide-13
SLIDE 13

scRNA-seq data

In its rawest form, FASTQ files after Illumina sequencing.

  • 1. Align reads to reference genome.

◮ Many good and fast aligners (e.g. subread, STAR).

  • 2. Count number of reads mapped to each gene (e.g. HTSeq,

featureCounts). This produces a count matrix with one count per gene per cell.

slide-14
SLIDE 14

scRNA-seq data

In its rawest form, FASTQ files after Illumina sequencing.

  • 1. Align reads to reference genome.

◮ Many good and fast aligners (e.g. subread, STAR).

  • 2. Count number of reads mapped to each gene (e.g. HTSeq,

featureCounts). This produces a count matrix with one count per gene per cell.

◮ If UMIs are used, reads with the same UMI are collapsed to a

single count.

slide-15
SLIDE 15

scRNA-seq data

In its rawest form, FASTQ files after Illumina sequencing.

  • 1. Align reads to reference genome.

◮ Many good and fast aligners (e.g. subread, STAR).

  • 2. Count number of reads mapped to each gene (e.g. HTSeq,

featureCounts). This produces a count matrix with one count per gene per cell.

◮ If UMIs are used, reads with the same UMI are collapsed to a

single count.

◮ Data generated with the 10X platform can be processed with

CellRanger.

slide-16
SLIDE 16

scRNA-seq data

A typical scRNA-seq data count matrix

◮ Lots of zeros (both dropouts and lack of expression).

~100 - 1000 cells ~10000-40000 genes

slide-17
SLIDE 17

scRNA-seq data

◮ Lots of zeros (both dropouts and lack of expression).

Brennecke et al., Nat Methods, 2013 107

a

105 103 10 107 105 103 10 Normalized read count (5,000 pg, technical replicate 1) Normalized read count (5,000 pg, technical replicate 2) 107

c

105 103 10 107 105 103 10 Normalized read count (50 pg, technical replicate 1) Normalized read count (50 pg, technical replicate 2) 107

d

105 103 10 107 105 103 10 Normalized read count (10 pg, technical replicate 1) Normalized read count (10 pg, technical replicate 2) 107

b

105 103 10 107 105 103 10 Normalized read count (500 pg, technical replicate 1) Normalized read count (500 pg, technical replicate 2)

5000 pg 500 pg 50 pg 10 pg

slide-18
SLIDE 18

scRNA-seq data analysis

Aim: to extract real biology from data with technical noise

  • 1. Quality control.
  • 2. Normalisation of cell-specific biases.
  • 3. Batch correction.
  • 4. Modelling technical noise.
  • 5. Dimensionality reduction and visualisation.
  • 6. Clustering.

. . . followed by higher-level analyses and interpretation.

slide-19
SLIDE 19

Quality control

Removal of low-quality cells arising by:

◮ Insufficient sequencing. ◮ Failed reverse transcription. ◮ Damaged cells during dissociation.

slide-20
SLIDE 20

Quality control

We use several metrics to identify low-quality samples:

◮ Total number of reads per cell (low). ◮ Total number of genes detected (low). ◮ Percentage of reads mapped to mitochondrial genes (high). ◮ Percentage of reads mapped to spike-in transcripts (high).

Non-mito Mito Spike-in Coverage Non-mito Mito Spike-in Non-mito Mito Spike-in Coverage Damage Coverage Extreme damage
slide-21
SLIDE 21

Quality control

QC metrics.

3 4 5 6 7 2383 2384 2677 2739 sample log10_total_counts 5000 10000 15000 2383 2384 2677 2739 sample total_features_by_counts

Data from Messmer et al., Cell Reports (2019).

slide-22
SLIDE 22

Quality control

QC metrics.

25 50 75 100 2383 2384 2677 2739 sample pct_counts_ERCC 5 10 15 20 2383 2384 2677 2739 sample pct_counts_Mt

Data from Messmer et al., Cell Reports (2019).

slide-23
SLIDE 23

Quality control

How to define low-quality?

  • 1. Define fixed thresholds, e.g., at least 100,000 counts per cell.

◮ simple, easy to interpret. ◮ hard to generalize across data sets.

slide-24
SLIDE 24

Quality control

How to define low-quality?

  • 1. Define fixed thresholds, e.g., at least 100,000 counts per cell.

◮ simple, easy to interpret. ◮ hard to generalize across data sets.

  • 2. Detect outliers in the QC metric distribution.

◮ adapts to mean/variance of QC metrics across population. ◮ assumes most cells are high-quality and removes outliers.

slide-25
SLIDE 25

Quality control

How to define low-quality?

  • 1. Define fixed thresholds, e.g., at least 100,000 counts per cell.

◮ simple, easy to interpret. ◮ hard to generalize across data sets.

  • 2. Detect outliers in the QC metric distribution.

◮ adapts to mean/variance of QC metrics across population. ◮ assumes most cells are high-quality and removes outliers.

  • 3. Define low-quality cells as outliers on gene expression

◮ risky, may remove cells in rare subpopulations.

slide-26
SLIDE 26

Normalisation

Cell-specific biases need to be normalised to be able to compare data from different cells.

◮ Sequencing depth. ◮ Capture efficiency. ◮ Composition biases.

Truth Raw data Normalized Bias
slide-27
SLIDE 27

Normalisation

Scaling normalisation to remove biases between cells.

◮ compute a “size factor” to divide the counts for each cell

To demonstrate: consider counts for a few genes in a few cells:

◮ assume X, Y, Z. . . are not DE between cells. ◮ systematic fold-differences are technical in origin.

slide-28
SLIDE 28

Normalisation

Scaling normalisation to remove biases between cells.

◮ compute a “size factor” to divide the counts for each cell

To demonstrate: consider counts for a few genes in a few cells:

◮ assume X, Y, Z. . . are not DE between cells. ◮ systematic fold-differences are technical in origin.

slide-29
SLIDE 29

Normalisation

Composition biases are introduced by differential expression.

Gene Coverage A B C D Gene Coverage A B C D

Upregulated Composition bias

Cell X Cell Y

(sequenced to same depth)

◮ Normalisation by library size is particularly susceptible. ◮ Requires methods robust to DE, e.g., TMM, DESeq. ◮ Methods developed for bulk RNA-seq are not robust to zeroes.

slide-30
SLIDE 30

Actually, it is possible to use bulk methods for scRNA-Seq

Zero Inflated Negative Binomial (ZINB)

  • Risso et al., 2018 - https:/

/ www.nature.com/articles/ s41467-017-02554-5 Application to scRNA-Seq - Van der Berge et al., 2018 - https:/ / genomebiology.biomedcentral.com/ articles/10.1186/s13059-018-1406-4

slide-31
SLIDE 31
slide-32
SLIDE 32
slide-33
SLIDE 33

Normalisation by sharing information across cells

1 2 3 4 All cells (averaged to make a reference pseudo-cell) Cell pool A: θ1 + θ2 + θ3 + θ4 = θA θ5 + θ6 + θ7 + θ8 = θB Cell pool B: 1 1 1 1 0 0 0 0 ... 0 0 0 0 1 1 1 1 ... 1 0 1 0 1 0 1 0 ... 0 1 1 0 1 1 0 0 ... ... θA θB θC θD ... θ1 θ2 θ3 θ4 ... = Single cell 7 6 5 8 System of linear equations: Lun et al, Genom Biol (2106).

◮ Cells are pooled to increase counts and avoid problems with

zeroes.

◮ Size factor per pool estimated robustly (median), to protect

against DE.

◮ Clustering the data before pooling further protects against DE.

◮ Solve linear system to obtain a size factor per cell.

slide-34
SLIDE 34

Normalisation - spike-ins

Normalisation on gene counts corrects for RNA content.

◮ Spike-in transcripts are not affected by RNA content. ◮ Using gene-based size factors will “over-normalize”.

Instead, define the sum of spike-in counts as spike-in size factor:

◮ normalise spike-ins by dividing with spike-in size factor. ◮ normalise genes by dividing with gene size factor.

slide-35
SLIDE 35

Spike-in versus gene-based normalisation

Alternatively: normalise genes with spike-in size factors:

◮ when you can’t assume most genes are not DE. ◮ when changes in total RNA content are interesting.

Data from Islam et al., Genome Research (2019).

slide-36
SLIDE 36

Batch correction

Data generated by different labs, or at different times, suffer from batch effects.

◮ Large datasets inevitably need to be processed in multiple

batches.

◮ These need to be removed to be able to compare them.

Data from Nestorawa et al., Blood (2016); Paul et al., Cell (2015).

slide-37
SLIDE 37

Batch correction

Methods developed for bulk RNA-seq data fail due to composition biases.

◮ Assumption: cell

composition is identical across batches.

◮ Systematic differences in

mean expression are technical. Instead, use mutual nearest neighbours to identify equivalent populations. Use these to compute the batch effect.

Batch 2 Batch 1 x x y y z w Batch effect x x y y z w MNN pairings Nearest in batch 1 Nearest in batch 2 t s e r a e n l a u t u M x x y y z w Correction vectors x/x y/y z w x/x y/y z w z Batch 3

a c b d e Haghverdi et al, Nat Biotech (2108).

slide-38
SLIDE 38

Batch correction

Data generated by different labs, or at different times, suffer from batch effects.

◮ Large datasets inevitably need to be processed in multiple

batches.

◮ These need to be removed to be able to compare them.

Data from Nestorawa et al., Blood (2016); Paul et al., Cell (2015).

slide-39
SLIDE 39

Modelling technical and biological variability

How much of cell-to-cell variability is technical vs biological? To quantify variability we can use different metrics:

  • 1. Squared coefficient of variation (CV2).

◮ divide the variance of (normalised) counts by the squared mean.

var(zig) ¯ z2

g

where zig = yig si

  • 2. Variance of log-expression.

◮ compute the variance of normalised log-expression values.

var[log2(yig si + 1)]

slide-40
SLIDE 40

Fitting a trend to the technical variance

100 10 1 0.1 0.01 CV2 0.1 1 10 100 103 104 105 Average normalized read count Mean log-expression Variance of log-expression 5 10 15 5 10 15 20

Brennecke et al., Nat Methods (2013). Data from Wilson et al., Cell Stem Cell (2015).

◮ Find a trend to the variance of the spike-in transcripts. ◮ Biological variance = residual from the trend for each gene. ◮ Use to identify interesting genes for downstream steps =

feature selection.

slide-41
SLIDE 41

Dimensionality reduction with PCA

Principal Component Analysis.

◮ identifies axes of maximal variance in high-dimensional data. ◮ each principal component (PC) explains less variance.

PC1 PC2 PC3 PC1 PC2

The first few (5-100) PCs can be used as a “summary” of the data.

◮ Speed up downstream analyses by reducing dimensionality. ◮ Focus on biology, remove random noise in later PCs.

https://www.nature.com/articles/nature11241 …

slide-42
SLIDE 42

Visualisation in low dimensional space - PCA

The first 2-3 PCs can be used for visualisation.

Data from Zeisel et al., Science (2015).

◮ Simple and efficient, but limited resolution of complex

structure.

slide-43
SLIDE 43

Visualisation in low dimensional space - t-SNE

◮ Preserve distances to neighbouring cells. ◮ Non-linear: not limited to straight axes.

Data from Zeisel et al., Science (2015).

Powerful, but need to fiddle with random seed and perplexity

slide-44
SLIDE 44

Visualisation in low dimensional space - diffusion maps

Uses a diffusion process to model a continuum of expression.

◮ Useful for trajectories (e.g., differentiation).

. . . and many others, e.g., UMAP, force directed graphs, self

  • rganising maps.
slide-45
SLIDE 45

Clustering

Group cells with similar expression profiles into “clusters”:

◮ Annotate as particular cell types or states. ◮ Interpretable summary of the data.

Many methods available:

Hierarchical Graph-based k-means

◮ Many distance metrics (e.g., Euclidean, cosine, Spearman’s

rho)

slide-46
SLIDE 46

Clustering

Most methods work well, provided you:

◮ filter to only use features of interest. ◮ assess cluster separatedness (silhouette width, gap statistic). ◮ experimentally validate novel clusters.

slide-47
SLIDE 47

Summary

Starting from a count matrix:

  • 1. Quality control on cells.
  • 2. Normalisation of cell-specific biases.
  • 3. Modelling technical noise.
  • 4. Dimensionality reduction and clustering.

. . . followed by higher-level analyses and interpretation. Try it yourself!

Practical workflow from Aaron Lun

slide-48
SLIDE 48

Other resources

◮ Bioconductor workflow Simple Single Cell

https://bioconductor.org/packages/3.8/workflows/html/simpleSingleCell.html

◮ Hemberg lab single-cell course

https://hemberg-lab.github.io/scRNA.seq.course/index.html

slide-49
SLIDE 49

Aknowledgements

Aaron Lun John Marioni