Single-cell RNA-sequencing
Ximena Ibarra-Soria
CRUK Cambridge Institute RNA-Sequence Analysis Course - EMBL EBI 12 April 2019 Based on materials by Aaron Lun
Single-cell RNA-sequencing Ximena Ibarra-Soria CRUK Cambridge - - PowerPoint PPT Presentation
Single-cell RNA-sequencing Ximena Ibarra-Soria CRUK Cambridge Institute RNA-Sequence Analysis Course - EMBL EBI 12 April 2019 Based on materials by Aaron Lun Why use single-cell RNA-seq It allows us to chracterise heterogeneity in the gene
Single-cell RNA-sequencing
Ximena Ibarra-Soria
CRUK Cambridge Institute RNA-Sequence Analysis Course - EMBL EBI 12 April 2019 Based on materials by Aaron Lun
Why use single-cell RNA-seq
It allows us to chracterise heterogeneity in the gene expression profile of a population. The cell is the basic unit of life. At the cell-level we can:
Why use single-cell RNA-seq
It allows us to chracterise heterogeneity in the gene expression profile of a population. The cell is the basic unit of life. At the cell-level we can:
◮ Define cell identities (e.g. cell-types or subtypes).
doi.org/10.1038/nri2707 doi.org/10.15252/msb.20145549Why use single-cell RNA-seq
It allows us to chracterise heterogeneity in the gene expression profile of a population. The cell is the basic unit of life. At the cell-level we can:
◮ Define cell identities (e.g. cell-types or subtypes). ◮ Observe cell states and behaviour (e.g. cell cycle, metabolism,
stress).
doi.org/10.1038/nri2707 doi.org/10.15252/msb.20145549Why use single-cell RNA-seq
It allows us to chracterise heterogeneity in the gene expression profile of a population. The cell is the basic unit of life. At the cell-level we can:
◮ Define cell identities (e.g. cell-types or subtypes). ◮ Observe cell states and behaviour (e.g. cell cycle, metabolism,
stress).
◮ Study dynamic processes (e.g. differentiation, activation).
doi.org/10.1038/nri2707 doi.org/10.15252/msb.20145549Why use single-cell RNA-seq
It allows us to chracterise heterogeneity in the gene expression profile of a population. The cell is the basic unit of life. At the cell-level we can:
◮ Define cell identities (e.g. cell-types or subtypes). ◮ Observe cell states and behaviour (e.g. cell cycle, metabolism,
stress).
◮ Study dynamic processes (e.g. differentiation, activation). ◮ Study noise in transcriptional regulation.
doi.org/10.1038/nri2707 doi.org/10.15252/msb.20145549Why use single-cell RNA-seq
RNA-seq allows quantification of the whole transcriptome.
◮ FISH: small number of transcripts. ◮ seqFISH+: 10,000 genes. ◮ FACS: small number of proteins. ◮ Mass cytometry: ~40 proteins.
A typical single-cell experiment
Tissue Dissociated single cells Physical separation Lysis
AAAAAAAA 3'-TTTTTTTTT-X-5'Reverse transcription Cell barcoding Pooled sequencing
ACTG ACTG GCTA GCTA CCTG CCTGACTG GCTA CCTG
A C T G A C T G G C T A G C T A C C T G C C T GComputational analysis
◮ Dissociation can be easy or hard (blood vs muscle). ◮ Many separation methods (plate-based, droplets). ◮ Different protocols for RT and cDNA generation (full-length,
5’/3’ biased)
Throughput of scRNA-seq protocols
scRNA-seq protocols have increased hugely in throughput.
◮ Cell separation using FACS or microfluidic devices. ◮ Automation of RT and cDNA generation.
Svensson et al., Nature Protocols, 2018 (doi.org/10.1038/nprot.2017.149) 1,000,000 100,000 10,000 1,000 1 100 10 Multiplexing MARS-seq DroNC-seq sci-RNA-seq Manual inDropa b
Integrated fluidic circuits Liquid-handling robotics Nanodroplets Picowells In situ barcoding Tang et al. 200918 Islam et al. 201124 Brennecke et al. 201364 Jaitin et al. 201433 Klein et al. 201534 Macosko et al. 201540 Bose et al. 201543 Cao et al. 201751 Rosenberg et al. 201752 Single cells in study Study publication date 2009 2010 2011 2012 2013 2014 2015 2016 2017 High-throughput sequencing of RNA from single cells STRT-seq CEL-seq Fluidigm C1 SMART-seq SMART-seq2 CytoSeq Drop-seq 10x Genomics SPLiT-seq Seq-WellscRNA-seq protocols
Plate-based
◮ Hundreds to a few
thousand cells.
◮ High number of genes
detected.
◮ High capture efficiency. ◮ Full-length transcripts. ◮ UMIs optional. ◮ Compatible with spike-ins.
Droplet-based
◮ Tens to hundreds of
thousands cells.
◮ Lower number of genes
detected.
◮ Variable capture efficiency. ◮ 3’/5’-biased. ◮ UMIs. ◮ No spike-ins.
Library prep
◮ Most protocols use polyA selection. ◮ cDNA is amplified by PCR.
◮ Introduction of strong biases. ◮ Alleviated by the use of Unique Molecular Identifiers (UMIs).
3'-ACATCGATCGC...TTTT-GGAT-AACGT-5' 3'-CGACGGTTACG...TTTT-GCTT-AACGT-5' 3'-TGAGCATCACTA...TTTT-AGTA-AACGT-5' 3'-TGAGCATCACTA...TTTT-AGTA-AACGT-5' 3'-TGAGCATCACTA...TTTT-AGTA-AACGT-5' 3'-ACATCGATCGC...TTTT-GGAT-AACGT-5' 3'-ACATCGATCGC...TTTT-GGAT-AACGT-5' 3'-ACATCGATCGC...TTTT-GGAT-AACGT-5' 3'-ACATCGATCGC...TTTT-GGAT-AACGT-5' 3'-AGCGTAGGCT...TTTT-GGAT-AACGT-5' 3'-CACGGAAAAT...TTTT-GCTT-AACGT-5' 3'-CAGCAGCTGA...TTTT-AGTA-AACGT-5' 3'-CCGGGGAGGA...TTTT-AGTA-AACGT-5' 3'-CTTTTATGAG...TTTT-AGTA-AACGT-5' 3'-CGAGGGCGGT...TTTT-GGAT-AACGT-5' 3'-CAGTCGTACG...TTTT-GGAT-AACGT-5' 3'-CAGGCTGACG...TTTT-GGAT-AACGT-5' 3'-GGATAGCTAG...TTTT-GGAT-AACGT-5' Different fragmentation site per amplicon 3'-ACATCGATCGC...TTTT-GGAT-AACGT-5' 3'-CGACGGTTACG...TTTT-GCTT-AACGT-5' 3'-TGAGCATCACTA...TTTT-AGTA-AACGT-5' UMI Constant Transcript (RC)After RT After amplification After fragmentation After sequencing
TAGG TAGG TAGG TAGG TAGG TTCG ATGA ATGA ATGA GTCAGCATGC GCTCCCGCCA TCGCATCCGA GACCGACTGC CCTATCGATC GAGCCTTTTA GTCGTCGACT GGCCCCTCCT GAAAATACTC Read 1 Read 2Cell barcoding
Allows multiplexing to sequence many libraries in the same lane. Different strategies:
◮ Incorporated during library prep. ◮ Plate-based methods only (different barcode per well).
Bead Bead
Different cell barcodescRNA-seq data
In its rawest form, FASTQ files after Illumina sequencing.
◮ Many good and fast aligners (e.g. subread, STAR).
featureCounts). This produces a count matrix with one count per gene per cell.
scRNA-seq data
In its rawest form, FASTQ files after Illumina sequencing.
◮ Many good and fast aligners (e.g. subread, STAR).
featureCounts). This produces a count matrix with one count per gene per cell.
◮ If UMIs are used, reads with the same UMI are collapsed to a
single count.
scRNA-seq data
In its rawest form, FASTQ files after Illumina sequencing.
◮ Many good and fast aligners (e.g. subread, STAR).
featureCounts). This produces a count matrix with one count per gene per cell.
◮ If UMIs are used, reads with the same UMI are collapsed to a
single count.
◮ Data generated with the 10X platform can be processed with
CellRanger.
scRNA-seq data
A typical scRNA-seq data count matrix
◮ Lots of zeros (both dropouts and lack of expression).
~100 - 1000 cells ~10000-40000 genes
scRNA-seq data
◮ Lots of zeros (both dropouts and lack of expression).
Brennecke et al., Nat Methods, 2013 107a
105 103 10 107 105 103 10 Normalized read count (5,000 pg, technical replicate 1) Normalized read count (5,000 pg, technical replicate 2) 107c
105 103 10 107 105 103 10 Normalized read count (50 pg, technical replicate 1) Normalized read count (50 pg, technical replicate 2) 107d
105 103 10 107 105 103 10 Normalized read count (10 pg, technical replicate 1) Normalized read count (10 pg, technical replicate 2) 107b
105 103 10 107 105 103 10 Normalized read count (500 pg, technical replicate 1) Normalized read count (500 pg, technical replicate 2)5000 pg 500 pg 50 pg 10 pg
scRNA-seq data analysis
Aim: to extract real biology from data with technical noise
. . . followed by higher-level analyses and interpretation.
Quality control
Removal of low-quality cells arising by:
◮ Insufficient sequencing. ◮ Failed reverse transcription. ◮ Damaged cells during dissociation.
Quality control
We use several metrics to identify low-quality samples:
◮ Total number of reads per cell (low). ◮ Total number of genes detected (low). ◮ Percentage of reads mapped to mitochondrial genes (high). ◮ Percentage of reads mapped to spike-in transcripts (high).
Non-mito Mito Spike-in Coverage Non-mito Mito Spike-in Non-mito Mito Spike-in Coverage Damage Coverage Extreme damageQuality control
QC metrics.
3 4 5 6 7 2383 2384 2677 2739 sample log10_total_counts 5000 10000 15000 2383 2384 2677 2739 sample total_features_by_countsData from Messmer et al., Cell Reports (2019).
Quality control
QC metrics.
25 50 75 100 2383 2384 2677 2739 sample pct_counts_ERCC 5 10 15 20 2383 2384 2677 2739 sample pct_counts_MtData from Messmer et al., Cell Reports (2019).
Quality control
How to define low-quality?
◮ simple, easy to interpret. ◮ hard to generalize across data sets.
Quality control
How to define low-quality?
◮ simple, easy to interpret. ◮ hard to generalize across data sets.
◮ adapts to mean/variance of QC metrics across population. ◮ assumes most cells are high-quality and removes outliers.
Quality control
How to define low-quality?
◮ simple, easy to interpret. ◮ hard to generalize across data sets.
◮ adapts to mean/variance of QC metrics across population. ◮ assumes most cells are high-quality and removes outliers.
◮ risky, may remove cells in rare subpopulations.
Normalisation
Cell-specific biases need to be normalised to be able to compare data from different cells.
◮ Sequencing depth. ◮ Capture efficiency. ◮ Composition biases.
Truth Raw data Normalized BiasNormalisation
Scaling normalisation to remove biases between cells.
◮ compute a “size factor” to divide the counts for each cell
To demonstrate: consider counts for a few genes in a few cells:
◮ assume X, Y, Z. . . are not DE between cells. ◮ systematic fold-differences are technical in origin.
Normalisation
Scaling normalisation to remove biases between cells.
◮ compute a “size factor” to divide the counts for each cell
To demonstrate: consider counts for a few genes in a few cells:
◮ assume X, Y, Z. . . are not DE between cells. ◮ systematic fold-differences are technical in origin.
Normalisation
Composition biases are introduced by differential expression.
Gene Coverage A B C D Gene Coverage A B C D
Upregulated Composition biasCell X Cell Y
(sequenced to same depth)◮ Normalisation by library size is particularly susceptible. ◮ Requires methods robust to DE, e.g., TMM, DESeq. ◮ Methods developed for bulk RNA-seq are not robust to zeroes.
Normalisation by sharing information across cells
1 2 3 4 All cells (averaged to make a reference pseudo-cell) Cell pool A: θ1 + θ2 + θ3 + θ4 = θA θ5 + θ6 + θ7 + θ8 = θB Cell pool B: 1 1 1 1 0 0 0 0 ... 0 0 0 0 1 1 1 1 ... 1 0 1 0 1 0 1 0 ... 0 1 1 0 1 1 0 0 ... ... θA θB θC θD ... θ1 θ2 θ3 θ4 ... = Single cell 7 6 5 8 System of linear equations: Lun et al, Genom Biol (2106).◮ Cells are pooled to increase counts and avoid problems with
zeroes.
◮ Size factor per pool estimated robustly (median), to protect
against DE.
◮ Clustering the data before pooling further protects against DE.
◮ Solve linear system to obtain a size factor per cell.
Normalisation - spike-ins
Normalisation on gene counts corrects for RNA content.
◮ Spike-in transcripts are not affected by RNA content. ◮ Using gene-based size factors will “over-normalize”.
Instead, define the sum of spike-in counts as spike-in size factor:
◮ normalise spike-ins by dividing with spike-in size factor. ◮ normalise genes by dividing with gene size factor.
Spike-in versus gene-based normalisation
Alternatively: normalise genes with spike-in size factors:
◮ when you can’t assume most genes are not DE. ◮ when changes in total RNA content are interesting.
Data from Islam et al., Genome Research (2019).
Batch correction
Data generated by different labs, or at different times, suffer from batch effects.
◮ Large datasets inevitably need to be processed in multiple
batches.
◮ These need to be removed to be able to compare them.
Data from Nestorawa et al., Blood (2016); Paul et al., Cell (2015).
Batch correction
Methods developed for bulk RNA-seq data fail due to composition biases.
◮ Assumption: cell
composition is identical across batches.
◮ Systematic differences in
mean expression are technical. Instead, use mutual nearest neighbours to identify equivalent populations. Use these to compute the batch effect.
Batch 2 Batch 1 x x y y z w Batch effect x x y y z w MNN pairings Nearest in batch 1 Nearest in batch 2 t s e r a e n l a u t u M x x y y z w Correction vectors x/x y/y z w x/x y/y z w z Batch 3a c b d e Haghverdi et al, Nat Biotech (2108).
Batch correction
Data generated by different labs, or at different times, suffer from batch effects.
◮ Large datasets inevitably need to be processed in multiple
batches.
◮ These need to be removed to be able to compare them.
Data from Nestorawa et al., Blood (2016); Paul et al., Cell (2015).
Modelling technical and biological variability
How much of cell-to-cell variability is technical vs biological? To quantify variability we can use different metrics:
◮ divide the variance of (normalised) counts by the squared mean.
var(zig) ¯ z2
g
where zig = yig si
◮ compute the variance of normalised log-expression values.
var[log2(yig si + 1)]
Fitting a trend to the technical variance
100 10 1 0.1 0.01 CV2 0.1 1 10 100 103 104 105 Average normalized read count Mean log-expression Variance of log-expression 5 10 15 5 10 15 20
Brennecke et al., Nat Methods (2013). Data from Wilson et al., Cell Stem Cell (2015).
◮ Find a trend to the variance of the spike-in transcripts. ◮ Biological variance = residual from the trend for each gene. ◮ Use to identify interesting genes for downstream steps =
feature selection.
Dimensionality reduction with PCA
Principal Component Analysis.
◮ identifies axes of maximal variance in high-dimensional data. ◮ each principal component (PC) explains less variance.
PC1 PC2 PC3 PC1 PC2
The first few (5-100) PCs can be used as a “summary” of the data.
◮ Speed up downstream analyses by reducing dimensionality. ◮ Focus on biology, remove random noise in later PCs.
https://www.nature.com/articles/nature11241 …
Visualisation in low dimensional space - PCA
The first 2-3 PCs can be used for visualisation.
Data from Zeisel et al., Science (2015).
◮ Simple and efficient, but limited resolution of complex
structure.
Visualisation in low dimensional space - t-SNE
◮ Preserve distances to neighbouring cells. ◮ Non-linear: not limited to straight axes.
Data from Zeisel et al., Science (2015).
Powerful, but need to fiddle with random seed and perplexity
Visualisation in low dimensional space - diffusion maps
Uses a diffusion process to model a continuum of expression.
◮ Useful for trajectories (e.g., differentiation).
. . . and many others, e.g., UMAP, force directed graphs, self
Clustering
Group cells with similar expression profiles into “clusters”:
◮ Annotate as particular cell types or states. ◮ Interpretable summary of the data.
Many methods available:
Hierarchical Graph-based k-means
◮ Many distance metrics (e.g., Euclidean, cosine, Spearman’s
rho)
Clustering
Most methods work well, provided you:
◮ filter to only use features of interest. ◮ assess cluster separatedness (silhouette width, gap statistic). ◮ experimentally validate novel clusters.
Summary
Starting from a count matrix:
. . . followed by higher-level analyses and interpretation. Try it yourself!
Practical workflow from Aaron Lun
Other resources
◮ Bioconductor workflow Simple Single Cell
https://bioconductor.org/packages/3.8/workflows/html/simpleSingleCell.html
◮ Hemberg lab single-cell course
https://hemberg-lab.github.io/scRNA.seq.course/index.html
Aknowledgements
Aaron Lun John Marioni