Using Single-Cell Transcriptome Sequencing to Infer Olfactory Stem - - PowerPoint PPT Presentation

using single cell transcriptome sequencing to infer
SMART_READER_LITE
LIVE PREVIEW

Using Single-Cell Transcriptome Sequencing to Infer Olfactory Stem - - PowerPoint PPT Presentation

Using Single-Cell Transcriptome Sequencing to Infer Olfactory Stem Cell Fate Trajectories Sandrine Dudoit Division of Biostatistics and Department of Statistics University of California, Berkeley www.stat.berkeley.edu/~sandrine Computational


slide-1
SLIDE 1

Using Single-Cell Transcriptome Sequencing to Infer Olfactory Stem Cell Fate Trajectories

Sandrine Dudoit Division of Biostatistics and Department of Statistics University of California, Berkeley www.stat.berkeley.edu/~sandrine

Computational and Systems Immunology Seminar Stanford University June 28, 2016

Version: 26/06/2016, 22:58

1 / 105

slide-2
SLIDE 2

Acknowledgments

  • Sandrine Dudoit, Division of Biostatistics and Department of

Statistics, UC Berkeley.

◮ Fanny Perraudeau, Graduate Group in Biostatistics, UC

Berkeley.

◮ Davide Risso, Division of Biostatistics, UC Berkeley. ◮ Kelly Street, Graduate Group in Biostatistics, UC Berkeley.

  • John Ngai, Department of Molecular and Cell Biology, UC

Berkeley – Principal investigator.

◮ Diya Das. ◮ Russell Fletcher. ◮ David Stafford.

  • Elizabeth Purdom, Department of Statistics, UC Berkeley.
  • Jean-Philippe Vert, Mines ParisTech and Institut Curie, Paris,

France.

◮ Svetlana Gribkova. 2 / 105

slide-3
SLIDE 3

Acknowledgments

  • Nir Yosef, Department of Electrical Engineering and Computer

Sciences, UC Berkeley.

◮ Michael Cole. ◮ Allon Wagner.

  • Funded by BRAIN Initiative and California Institute for

Regenerative Medicine (CIRM).

3 / 105

slide-4
SLIDE 4

Outline

1 Olfactory Stem Cell Fate Trajectories

Olfactory Stem Cells and Neural Regeneration Olfactory Epithelium p63 Dataset Analysis Pipeline

2 Exploratory Data Analysis and Quality Assessment/Control 3 Normalization and Expression Quantitation

Motivation Methods Software: scone Zero-Inflated Negative Binomial Model

4 Resampling-Based Sequential Ensemble Clustering

Motivation Methods Software: clusterExperiment

5 Cell Lineage and Pseudotime Inference

Motivation

4 / 105

slide-5
SLIDE 5

Outline

Methods Software: slingshot

5 / 105

slide-6
SLIDE 6

Olfactory Stem Cells and Neural Regeneration

  • R. Fletcher, J. Ngai

6 / 105

slide-7
SLIDE 7

Olfactory Stem Cells and Neural Regeneration

  • The nature of stem cells giving rise to the nervous system is
  • f particular interest in neurobiology, because neural stem

cells remain active in certain brain regions for the entire life of an individual.

  • We focus on the mouse olfactory epithelium (OE), a site of

active neurogenesis in the postnatal animal.

  • Adult olfactory stem cells support the replacement of olfactory

sensory neurons and non-neuronal support cells (e.g., sustentacular) over postnatal life and can reconstitute the entire OE following injury.

  • The OE is a convenient system to study, due to its

experimental accessibility (in situ analysis) and its limited number of cell types:

◮ olfactory sensory neurons (OSN), ◮ sustentacular cells (SUS), 7 / 105

slide-8
SLIDE 8

Olfactory Stem Cells and Neural Regeneration

◮ cells of the Bowman gland, ◮ microvillous cells (rare). 8 / 105

slide-9
SLIDE 9

Olfactory Stem Cells and Neural Regeneration

Figure 1: Mouse olfactory epithelium.

9 / 105

slide-10
SLIDE 10

Olfactory Stem Cells and Neural Regeneration

Sustentacular cells Mature olfactory neurons Immature olfactory neurons Globose basal cells Horizontal basal cells Olfactory ensheathing glia Bowman’s gland

  • HBCs: multipotent, quiescent – deep reserve adult tissue stem cell
  • GBCs: proliferative progenitor cells + transit-amplifying cells

The Horizontal Basal Cell Is an Adult Tissue Stem Cell Figure 2: Olfactory epithelium cell types.

10 / 105

slide-11
SLIDE 11

Olfactory Stem Cells and Neural Regeneration

Open questions.

  • Determine the stage at which the neuronal and non-neuronal

lineages bifurcate/diverge.

  • Characterize discrete intermediate stages of cell differentiation.
  • Identify the genetic networks and signaling pathways that

promote self-renewal and regulate the transition to differentiation.

11 / 105

slide-12
SLIDE 12

Olfactory Stem Cells and Neural Regeneration

p63 regulation of horizontal basal cells.

  • The horizontal basal cell (HBC) is an adult tissue stem cell.
  • The p63 protein (tumor protein p63, TP63) promotes

self-renewal of HBC by blocking differentiation.

  • When p63 is down-regulated, this “brake” is removed,

allowing differentiation to proceed at the expense of self-renewal. Thus, p63 can be viewed as a “molecular switch” that decides between the alternate stem cell fates of self-renewal vs. differentiation.

  • We use single-cell RNA-Seq to analyze cell fate trajectories

from olfactory stem cells (HBC) of p63 conditional knock-out mice.

12 / 105

slide-13
SLIDE 13

Olfactory Epithelium p63 Dataset

X Cre-ER Krt5 YFP loxP-STOP-loxP Rosa p63

+

Experimental design

Russell Fletcher, Levi Gadye, Mike Sanchez

~ ~

24h 48h 72h 96h 7d 14d

tamoxifen

YFP+;p63+/+:

98 cells Resting HBCs

FACS-purify lineage-traced cells -> analyze by single cell RNA-Seq

Differentiating cells

YFP+;p63-/-:

88 cells 100 cells 128 cells 110 cells 98 cells

Figure 3: Experimental design. Single-cell RNA-Seq for 636 HBC: 102 wild-type and 534 p63 knock-out cells.

13 / 105

slide-14
SLIDE 14

Olfactory Epithelium p63 Dataset

P01 P02 P03A P03B P04 P05 P06 P10 P11 P12 P13 P14 Y01 Y04

Number of cells per batch, colored by bio

number of cells 10 20 30 40 50 60 KO_14DPT KO_24HPT KO_48HPT KO_7DPT KO_96HPT WT_72HPT WT_96HPT

Figure 4: Experimental design. Number of cells per batch, colored by biological condition.

14 / 105

slide-15
SLIDE 15

Olfactory Epithelium p63 Dataset

  • Single-cell RNA-Seq for 636 HBC.

◮ 102 wild-type (WT)/resting cells

534 p63 knock-out (KO) cells, at five timepoints following tamoxifen treatment.

◮ Biological replicate: Cells from 1–3 mice.

At least two replicates per biological condition.

◮ One FACS run and one C1 run per biological replicate

= ⇒ 14 batches.

◮ 8 HiSeq runs (96 cells/lane, single-end 50-base-pair reads).

  • Some confounding between biological and technical effects.
  • Batch effects nested within biological effects.

15 / 105

slide-16
SLIDE 16

Olfactory Epithelium p63 Dataset

  • Marker genes. 94 marker genes, curated from literature and

from prior microarray, sequencing, and in situ experiments, e.g., neuronal, progenitor cell markers.

  • Housekeeping genes. 715 housekeeping (HK) genes, curated

from prior microarray experiments, expected to be constantly and highly-expressed across cells of the OE.

  • Gene-level read counts. TopHat2 alignment to RefSeq mm10

genome and featureCounts (bioinf.wehi.edu.au/featureCounts) counting, with genes defined as union of all isoforms.

16 / 105

slide-17
SLIDE 17

Analysis Pipeline

  • Input/Output (I/O).

◮ Input. FASTQ files, gene-level annotation metadata,

sample-level annotation metadata.

◮ Output. Cluster labels corresponding to cellular types, cell

lineages and pseudotimes, list of differentially expressed genes corresponding to markers/transcriptome signatures for these types/lineages.

  • Pre-processing. Read alignment, gene-level expression

quantitation, quality assessment/control.

  • Exploratory data analysis (EDA) and quality

assessment/control (QA/QC). Numerical and graphical summaries of gene-level read counts, sample filtering (e.g., based on QC measures), gene filtering (e.g., zero inflation).

17 / 105

slide-18
SLIDE 18

Analysis Pipeline

  • Normalization and expression quantitation. Adjust for

unwanted technical effects (e.g., C1 run effects), account for zero inflation and over-dispersion.

  • Cluster analysis. Dimensionality reduction (e.g., based on

principal component analysis), distance measure, clustering procedure and tuning parameters, confidence measures/validation of clustering results.

  • Cell lineage and pseudotime inference.
  • Differential expression (DE) analysis. Identification of marker

genes for cell types/lineages.

18 / 105

slide-19
SLIDE 19

Computationally Reproducible Research

  • Version control system for software development and data

analysis: Git, GitHub.

  • R data packages, with standard object structure (S4 classes

ExpressionSet, SummarizedExperiment) for gene-level expression measures and gene-level and sample-level annotation metadata.

  • Compendia, i.e., integrated and executable input documents,

with text, data, code, and software, that allow the generation

  • f dynamic and reproducible output documents, intermixing

text and code output (textual and graphical). Different views

  • f the output document (e.g., PDF, HTML) can be

automatically generated and updated whenever the text, data,

  • r code change.

E.g. For R and L

AT

EX/markdown, Sweave, knitr, RStudio.

19 / 105

slide-20
SLIDE 20

Zero Inflation

Proportion of genes with zero count, pre gene filtering

0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 1.0 1 2 3 4 5 6

Proportion of cells in which a gene is detected, pre gene filtering

N = 22054 Bandwidth = 0.02443 Density

(a) Proportion of genes with zero count (b) Proportion of cells in which a gene is detected

Figure 5: Zero inflation. Pre gene filtering.

20 / 105

slide-21
SLIDE 21

Zero Inflation

  • Single-cell RNA-Seq data have many more genes with zero

read counts than bulk RNA-Seq data.

  • This zero inflation could occur for biological reasons (i.e., the

gene is simply not expressed) or technical reasons (e.g., low capture efficiency).

  • Zero-count gene filtering is advisable for normalization and

downstream analyses.

  • Most RNA-Seq normalization methods involve scaling and

perform poorly when many genes have zero counts.

  • In particular, the global-scaling method of Anders and Huber

(2010), implemented in the Bioconductor R package DESeq, discards any gene having zero count in at least one sample. In practice, the scaling factors are therefore estimated based on

  • nly a handful of genes, e.g., 5/22,054 genes for OE dataset.

21 / 105

slide-22
SLIDE 22

Zero Inflation

  • Full-quantile (FQ) normalization also doesn’t behave properly

due to ties from the large number of zeros.

  • We apply the following zero-count gene filtering to the OE

dataset: Retain only the genes with at least nr = 20 reads, in at least ns = 40 samples. This yields 9,133/22,054 genes.

22 / 105

slide-23
SLIDE 23

Zero Inflation

Proportion of genes with zero count, post gene filtering

0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.5 1.0 1.5 2.0 2.5

  • portion of cells in which a gene is detected, post gene filtering

N = 9133 Bandwidth = 0.03333 Density

(a) Proportion of genes with zero count (b) Proportion of cells in which a gene is detected

Figure 6: Zero inflation. Post gene filtering.

23 / 105

slide-24
SLIDE 24

Sample-Level QC

P01 P02 P03A P03B P04 P05 P06 P10 P11 P12 P13 P14 Y01 Y04 1e+06 2e+06 3e+06

QC measures by batch

NREADS P01 P02 P03A P03B P04 P05 P06 P10 P11 P12 P13 P14 Y01 Y04 500000 1000000 1500000 2000000 2500000 3000000 3500000

QC measures by batch

NALIGNED P01 P02 P03A P03B P04 P05 P06 P10 P11 P12 P13 P14 Y01 Y04 90 92 94 96

QC measures by batch

RALIGN P01 P02 P03A P03B P04 P05 P06 P10 P11 P12 P13 P14 Y01 Y04 20 30 40 50 60 70

QC measures by batch

TOTAL_DUP

Figure 7: Sample-level QC. Boxplots of QC measures, by batch.

24 / 105

slide-25
SLIDE 25

Sample-Level QC

P01 P02 P03A P03B P04 P05 P06 P10 P11 P12 P13 P14 Y01 Y04 0.0 0.1 0.2 0.3 0.4 0.5 0.6

QC measures by batch

PCT_INTRONIC_BASES P01 P02 P03A P03B P04 P05 P06 P10 P11 P12 P13 P14 Y01 Y04 0.10 0.15 0.20 0.25 0.30

QC measures by batch

PCT_INTERGENIC_BASES P01 P02 P03A P03B P04 P05 P06 P10 P11 P12 P13 P14 Y01 Y04 0.2 0.3 0.4 0.5 0.6 0.7 0.8

QC measures by batch

PCT_MRNA_BASES P01 P02 P03A P03B P04 P05 P06 P10 P11 P12 P13 P14 Y01 Y04 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0

QC measures by batch

MEDIAN_CV_COVERAGE

Figure 8: Sample-level QC. Boxplots of QC measures, by batch.

25 / 105

slide-26
SLIDE 26

Sample-Level QC

  • −5

5 10 −4 −2 2 4 6

QC PCA by batch

PC1 PC2 P01 P02 P03A P03B P04 P05 P06 P10 P11 P12 P13 P14 Y01 Y04 −4 −2 2 4 6

QC PCA by batch

PC1

(a) QC PC2 vs. PC1, colored by batch (b) QC PC1, by batch

Figure 9: Sample-level QC. Principal component analysis (PCA) of sample-level QC measures.

26 / 105

slide-27
SLIDE 27

Sample-Level QC

  • ● ●
  • −40

−20 20 40 60 80 −5 5 10

QC PC1 vs. count PC1

Count PC1 QC PC1 Cor=−0.49 PC1 PC2 PC3 PC4 PC5

Absolute correlation of count PC and QC measures

0.0 0.2 0.4 0.6 0.8 1.0

NREADS NALIGNED RALIGN TOTAL_DUP PRIMER PCT_RIBOSOMAL_BASES PCT_CODING_BASES PCT_UTR_BASES PCT_INTRONIC_BASES PCT_INTERGENIC_BASES PCT_MRNA_BASES MEDIAN_CV_COVERAGE MEDIAN_5PRIME_BIAS MEDIAN_3PRIME_BIAS

(a) QC PC1 vs. count PC1 (b) Correlation of count PC and QC measures

Figure 10: Sample-level QC. Association of counts and sample-level QC measures.

27 / 105

slide-28
SLIDE 28

Sample-Level QC: Summary

  • The distribution of QC measures can vary substantially within

and between batches.

  • Some QC measures clearly point to low-quality samples, e.g.,

low percentage of mapped reads (RALIGN).

  • There can be a strong association between QC measures and

read counts (cf. PCA).

  • Filtering samples based on QC measures is advisable, as

normalization procedures may not be able to adjust for QC and some samples simply have low quality.

  • Normalization procedures based on QC measures (e.g.,

regression on first few PC of QC measures) should also be considered.

28 / 105

slide-29
SLIDE 29

Gene-Level Counts

2 4 6 8 10

Gene−level log−count

log(count+1) −4 −2 2 4 6 8

Gene−level RLE

RLE

(a) Log-count (b) RLE

Figure 11: Gene-level counts. Gene-level log-count and relative log expression (RLE = log-ratio of read count to median read count across samples).

29 / 105

slide-30
SLIDE 30

Gene-Level Counts: Summary

  • After gene and sample filtering and before normalization,

there are large differences in gene-level count distributions within and between batches (cf. RLE, housekeeping genes).

  • The counts are still zero-inflated.
  • There can be substantial association of counts and

sample-level QC measures.

  • Normalization is essential before any clustering or differential

expression analysis, to ensure that observed differences in expression measures between samples and/or genes are truly due to differential expression and not technical artifacts.

30 / 105

slide-31
SLIDE 31

Motivation

  • In order to derive expression measures and compare these

measures between samples and/or genomic regions of interest (ROI), one first needs to normalize read counts to adjust for differences in sequencing depths (i.e., total read counts) and ROI lengths, and possibly more complex technical effects, e.g., C1 run/library preparation/flow-cell/lane effects and nucleotide composition (e.g., GC-content) effects – sequenceability.

  • Normalization is essential to ensure that observed differences

in expression measures between samples and/or ROI are truly due to differential expression and not technical artifacts.

31 / 105

slide-32
SLIDE 32

Motivation

  • Normalization is still an important issue for RNA-Seq, despite

initial optimistic claims such as: “One particularly powerful advantage of RNA-Seq is that it can capture transcriptome dynamics across different tissues or conditions without sophisticated normalization of data sets” (Wang et al., 2009).

  • This subject has received relatively little attention and few

novel methods have been proposed in the RNA-Seq literature.

  • It is still common to only scale ROI-level counts by ROI length

and total sample read count to adjust, respectively, for differences in ROI lengths and sequencing depths (even for technical replicate lanes). E.g. Estimators based on Poisson model as in Marioni et al. (2008); RPKM, Reads Per Kilobase of exon model per Million mapped reads, as in Mortazavi et al. (2008).

32 / 105

slide-33
SLIDE 33

Motivation

  • In many cases, however, one needs to account for more

complex features of the genomic regions and experimental design.

  • Normalization seems even more important for single-cell

RNA-Seq than bulk RNA-Seq.

  • There are large effects related to C1 run (cells processed in

batches of 96 wells).

  • There is also greater variation in gene-level counts across

single cells than across bulk RNA-Seq samples.

  • The very small amount of RNA present in a single cell may

cause amplification bias (i.e., many sequenced copies of a transcript), hence high-magnitude outliers, and low capture efficiency (i.e., genes expressed in the cell, but not captured by the sequencing process), hence zero inflation.

33 / 105

slide-34
SLIDE 34

Motivation

  • Widely-used bulk RNA-Seq methods are not well-suited for

handling the zero inflation and high level of technical noise of scRNA-Seq data (Vallejos et al., 2016).

34 / 105

slide-35
SLIDE 35

Between-Sample Normalization

Most normalization methods proposed to date are adaptations of methods for bulk RNA-Seq and microarrays.

  • Global-scaling normalization. Gene-level counts are scaled by

a single factor per sample.

◮ Total-count (TC), as in RPKM (Mortazavi et al., 2008), still

widely used.

◮ Housekeeping gene count, as in qRT-PCR, e.g., per-sample

POLR2A count (Bullard et al., 2010).

◮ Single-quantile, e.g., per-sample upper-quartile (UQ) of counts

for genes with non-zero counts (Bullard et al., 2010).

  • Pairwise global-scaling normalization. Gene-level counts are

scaled by a single factor per sample computed with respect to a reference sample.

◮ Trimmed Mean of M values (TMM) (Robinson and Oshlack,

2010): Select a reference sample and obtain a robust estimate

  • f the overall fold-change between each sample and the
  • reference. Implemented in edgeR.

35 / 105

slide-36
SLIDE 36

Between-Sample Normalization

◮ Anders and Huber (2010): The global-scaling factor for a

given sample is defined as the median fold-change between that sample and a synthetic reference sample whose counts are defined as the geometric means of the counts across samples. Implemented in DESeq and edgeR (as “RLE”).

  • Full-quantile normalization (FQ). All quantiles of the gene

count distributions are matched between samples (Bullard et al., 2010). Specifically, for each sample, the distribution of sorted read counts is matched to a reference distribution defined in terms of a function of the sorted counts (e.g., median) across samples. Inspired from the microarray literature (Irizarry et al., 2003).

36 / 105

slide-37
SLIDE 37

Between-Sample Normalization

  • Loess normalization. Loess fits are performed for

mean-difference plots (MD-plots) of log-counts for pairs of samples (all possible pairs (cyclic) or each sample paired with a synthetic reference obtained by averaging counts across samples). This approach, also inspired from the microarray literature (Bolstad et al., 2003), was used in Lov´ en et al. (2012) and is implemented in the affy function loess.normalize and limma function normalizeCyclicLoess.

  • Regression-based normalization for known technical factors,

e.g., C1 run, QC measures. Linear model (limma), negative binomial model (edgeR), empirical Bayes model (ComBat of Johnson et al. (2007), implemented in R package sva), supervised version of RUV below.

  • Remove unwanted variation (RUV).

37 / 105

slide-38
SLIDE 38

Between-Sample Normalization

◮ Proposed initially for the normalization of microarray data

(Gagnon-Bartsch and Speed, 2012; Gagnon-Bartsch et al., 2013; Jacob et al., 2013) and then adapted to sequencing data (Risso et al., 2014a,b), the main idea is to remove “unwanted variation” by performing factor analysis for a suitably-chosen subset of control genes and/or samples.

◮ For n samples and J genes, consider the log-linear/generalized

linear model (GLM) log E[Y |X, W ] = W α + Xβ, (1) where Y is the n × J matrix of gene-level read counts, X is an n × M design matrix corresponding to the M covariates of interest/factors of “wanted variation” (e.g., treatment) and β its associated M × J matrix of parameters of interest, and W is an n × K matrix corresponding to unknown factors of “unwanted variation” and α its associated K × J matrix of nuisance parameters.

38 / 105

slide-39
SLIDE 39

Between-Sample Normalization

◮ The unwanted factors W can be estimated as follows.

  • RUVg: Based on control genes, with either constant

expression (β = 0) or known expression fold-changes between samples (e.g., housekeeping genes, ERCC spike-ins, “in-silico” empirical controls).

  • RUVr: Based on residuals, such as deviance residuals from a

first-pass GLM regression of the counts on the covariates of interest, without RUVg normalization, but possibly after some

  • ther normalization procedure such as upper-quartile.
  • RUVs: Based on replicate/negative control samples for which

the covariates of interest are constant.

  • Generalization of RUV. The RUV model can be readily

extended to include known sample-level covariates (e.g., C1 run, QC measures) and known gene-level covariates (e.g., GC-content) that should be adjusted for.

39 / 105

slide-40
SLIDE 40

Between-Sample Normalization: RUV

= + α logE[Y|W,X] W X β

Unwanted variation Variation of interest

J n K n n K J M M J

Read counts

n samples x J genes

Figure 12: RUV model.

40 / 105

slide-41
SLIDE 41

SCONE

  • D. Risso, M. Cole, N. Yosef

41 / 105

slide-42
SLIDE 42

SCONE

SCONE: Single-Cell Overview of Normalized Expression. A general framework for the normalization of scRNA-Seq data.

  • Range of normalization methods.

◮ Global-scaling, e.g., DESeq, TMM. ◮ Full-quantile (FQ). ◮ Unknown factors of unwanted variation: Remove unwanted

variation (RUV).

◮ Known factors of unwanted variation: Regression-based

normalization on, e.g., QC PC, C1 run.

  • Normalization performance metrics.
  • Numerical and graphical summaries of normalized read counts

and metrics.

  • R package scone, to be released through the Bioconductor

Project: github.com/yoseflab/scone.

42 / 105

slide-43
SLIDE 43

SCONE

Figure 13: scone. Regression model.

43 / 105

slide-44
SLIDE 44

SCONE

Performance metrics. (Green: Good when high; Red: Good when low.)

  • BIO SIL: Average silhouette width by biological condition.
  • BATCH SIL: Average silhouette width by batch.
  • PAM SIL: Maximum average silhouette width for PAM

clusterings, for a range of user-supplied numbers of clusters.

  • EXP QC COR: Maximum squared Spearman correlation

between count PCs and QC measures.

  • EXP UV COR: Maximum squared Spearman correlation

between count PCs and factors of unwanted variation (preferably derived from other set of negative control genes than used in RUV).

  • EXP WV COR: Maximum squared Spearman correlation

between count PCs and factors of wanted variation (derived from positive control genes).

44 / 105

slide-45
SLIDE 45

SCONE

  • RLE MED: Mean squared median relative log expression

(RLE).

  • RLE IQR: Mean inter-quartile range (IQR) of RLE.

45 / 105

slide-46
SLIDE 46

Software Package scone

Application to OE p63 dataset.

  • Apply and evaluate 172 normalization procedures using main

scone function.

◮ scaling method: None, DESeq, TMM, FQ. ◮ uv factors: None, RUVg k = 1, · · · , 5, QC PC k = 1, · · · , 5. ◮ adjust biology: Yes/no. ◮ adjust batch: Yes/no.

  • Select a normalization procedure based on (function of) the

performance scores. Unweighted mean score = ⇒ none,fq,qc k=4,bio,no batch Weighted mean score = ⇒ none,fq,qc k=2,no bio,no batch

46 / 105

slide-47
SLIDE 47

Software Package scone

  • −0.15

−0.10 −0.05 0.00 0.05 0.10 −0.15 −0.05 0.00 0.05 0.10

SCONE: Biplot of scores colored by mean score

PC1 PC2 BIO_SIL BATCH_SIL PAM_SIL EXP_QC_COR EXP_UV_COR EXP_WV_COR RLE_MED RLE_IQR

Figure 14: scone. Biplot of performance scores, colored by mean score (yellow high/good, blue low/bad).

47 / 105

slide-48
SLIDE 48

Software Package scone

  • −0.4

−0.2 0.0 0.2 −0.20 −0.10 0.00 0.10

SCONE: Score PCA colored by mean score

PC1 PC2 Top mean Top weighted mean fq,ruv_k=1 none

Figure 15: scone. PCA of performance scores, colored by mean score.

48 / 105

slide-49
SLIDE 49

Software Package scone

−0.4 −0.2 0.0 0.2 −0.20 −0.10 0.00 0.10

SCONE: Score PCA colored by method

PC1 PC2

  • none

fq deseq tmm

Figure 16: scone. PCA of performance scores, colored by method – scaling method.

49 / 105

slide-50
SLIDE 50

Software Package scone

−0.4 −0.2 0.0 0.2 −0.20 −0.10 0.00 0.10

SCONE: Score PCA colored by method

PC1 PC2

  • no_bio,no_batch

no_bio,batch bio,no_batch bio,batch

Figure 17: scone. PCA of performance scores, colored by method – adjust biology, adjust batch.

50 / 105

slide-51
SLIDE 51

Software Package scone

FQ + RUVg(HK, k=1): W by batch

W −0.15 −0.05 0.00 0.05 0.10

  • −0.20

−0.15 −0.10 −0.05 0.00 0.05 0.10 −5 5 10

FQ + RUVg(HK, k=1): QC PC1 vs. W

W QC PC1 Cor=−0.56

(a) Unwanted factor W (b) QC PC1 vs. W

Figure 18: scone. Association of RUVg unwanted factor W and QC measures for none,fq,ruv k=1,no bio,batch.

51 / 105

slide-52
SLIDE 52

Software Package scone

−3 −2 −1 1 2 3 4

SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batc

RLE −2 2 4 6

SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batc

RLE

(a) All genes (b) Housekeeping genes

Figure 19: scone. Gene-level relative log expression (RLE = log-ratio of read count to median read count across samples) for method with top weighted mean score none,fq,qc k=2,no bio,no batch.

52 / 105

slide-53
SLIDE 53

Software Package scone

  • −20

20 40 −5 5 10

SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: QC PC1 vs. count PC1

count PC1 QC PC1 Cor=0 PC1 PC2 PC3 PC4 PC5

SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: Absolute correlation of count PC and QC measure

0.0 0.2 0.4 0.6 0.8 1.0

NREADS NALIGNED RALIGN TOTAL_DUP PRIMER PCT_RIBOSOMAL_BASES PCT_CODING_BASES PCT_UTR_BASES PCT_INTRONIC_BASES PCT_INTERGENIC_BASES PCT_MRNA_BASES MEDIAN_CV_COVERAGE MEDIAN_5PRIME_BIAS MEDIAN_3PRIME_BIAS

(a) QC PC1 vs. count PC1 (b) Correlation of count PC and QC measures

Figure 20: scone. Association of counts and sample-level QC measures, none,fq,qc k=2,no bio,no batch.

53 / 105

slide-54
SLIDE 54

Software Package scone: Summary

  • Unnormalized gene-level counts exhibit large differences in

distributions within and between batches and association with sample-level QC measures.

  • Different normalization methods vary in performance

according to SCONE metrics and lead to different distributions

  • f gene-level counts, hence clustering and DE results.
  • Global-scaling normalization. Not aggressive enough to handle

potentially large batch effects and association of counts and QC measures. Biological effects are dominated by nuisance technical effects. Additionally, for DESeq, the scaling factors are computed based on only a handful of genes with non-zero counts in all cells (5/22,054).

54 / 105

slide-55
SLIDE 55

Software Package scone: Summary

  • Batch effect normalization. Adjusting for batch effects

without properly accounting for the nesting of batch within biological effects (no bio,batch) in the regression model is problematic, as this removes the biological effects of interest (e.g., empirical Bayes framework of ComBat).

  • FQ followed by QC-based or RUVg normalization. Seems

effective: Similar RLE distributions between samples, lower association of counts and QC measures. The first unwanted factor of RUVg is correlated with the first QC PC.

  • The remaining analyses are based on

none,fq,qc k=2,no bio,no batch, the best method according to weighted mean score.

55 / 105

slide-56
SLIDE 56

Software Package scone: Summary

  • Interpretation of performance metrics. Some metrics tend to

favor certain methods over others, e.g., EXP UV COR (correlation between count PCs and factors of unwanted variation) naturally favors RUVg, especially when the same set

  • f negative controls are used for normalization and evaluation.

Hence, a careful, global interpretation of the metrics is recommended.

  • Negative controls. The selection of proper, distinct sets of

negative controls is important, as these are used for both normalization (RUVg) and assessment of normalization results (EXP UV COR).

  • Ongoing efforts.

◮ Zero-inflated negative binomial (ZINB) model. ◮ User-supplied factors unwanted and wanted variation (UV and

WV, respectively).

◮ Other methods (e.g., ComBat/sva). 56 / 105

slide-57
SLIDE 57

Software Package scone: Summary

◮ Other performance metrics. ◮ Visualization. ◮ Shiny app for interactive web interface. 57 / 105

slide-58
SLIDE 58

Resampling-Based Sequential Ensemble Clustering

  • D. Risso, E. Purdom

58 / 105

slide-59
SLIDE 59

Motivation

  • Robustness to choice of samples. Both hierarchical and

partitioning methods tend to be sensitive to the choice of samples to be clustered. Outlying samples/clusters (e.g., glia) are common in scRNA-Seq and mask interesting substructure in the data, often requiring the successive pruning out of dominating clusters to get to the finer structure.

  • Robustness to clustering algorithm and tuning parameters.

Clustering results are sensitive to pre-processing steps such as normalization and dimensionality reduction, as well as to the choice of clustering algorithm and associated tuning parameters (e.g., distance function, number of clusters).

59 / 105

slide-60
SLIDE 60

Motivation

  • Not focusing on the number of clusters. A major tuning

parameter of partitioning methods such as partitioning around medoids (PAM) and k-means is the number of clusters k. Methods for selecting k (e.g., silhouette width) are sensitive to the choice of samples, normalization, and other tuning

  • parameters. They tend to be conservative (low k), i.e.,

capture only the coarse clustering structure and mask interesting substructure in the data. Additionally, the number

  • f clusters k is often not of primary interest.

E.g. Silhouette width with PAM selects only k = 2 clusters for the OE p63 dataset.

  • Not forcing samples into clusters. Some samples may be
  • utliers, that do not really belong to any clusters. Leaving

them out can improve the quality and interpretability of the clustering as well as downstream analyses (e.g., identification

  • f cluster marker genes).

60 / 105

slide-61
SLIDE 61

Motivation

  • Cluster gene expression signatures. Common differential

expression statistics are not well-suited for finding marker genes for the clusters, especially for finer structure in a hierarchy.

61 / 105

slide-62
SLIDE 62

Motivation

2 3 4 5 6 7 8 9 10 11 12 13 14 15

SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−, PAM PC: A

k 0.0 0.1 0.2 0.3 0.4

SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batc

silhouette 0.0 0.1 0.2 0.3 0.4 0.5 0.6 KO_14DPT KO_24HPT KO_48HPT KO_7DPT KO_96HPT WT_72HPT WT_96HPT

(a) Average silhouette width (b) Silhouette widths, k = 2

Figure 21: Partitioning around medoids. 20 PC, one-minus-correlation distance.

62 / 105

slide-63
SLIDE 63

Resampling-Based Sequential Ensemble Clustering

  • We have developed a resampling-based sequential ensemble

clustering approach, with the aim of obtaining stable and tight clusters.

  • Ensemble clustering, i.e., aggregating multiple clusterings
  • btained from different algorithms or applications of a given

algorithm to resampled versions of the learning set, is a general approach for improving stability. This can be viewed as the unsupervised analog of ensemble methods in supervised learning, e.g., bagging, boosting, random forests.

  • Our approach is related to bagged/consensus/tight clustering

(Dudoit and Fridlyand, 2003; Leisch, 1999; Tseng and Wong, 2005).

  • R package clusterExperiment, released through the

Bioconductor Project.

63 / 105

slide-64
SLIDE 64

Resampling-Based Sequential Ensemble Clustering

RSEC: Resampling-based Sequential Ensemble Clustering.

  • Given a base clustering algorithm (e.g., PAM, k-means) and

associated tuning parameters (e.g., number of principal components, number of clusters k, distance matrix), generate a single candidate clustering using

◮ resampling-based clustering to find robust and tight clusters; ◮ sequential clustering to find stable clusters over a range of

numbers of clusters (Tseng and Wong, 2005).

  • Generate a collection of candidate clusterings by repeating the

above procedure for different base clustering algorithms and tuning parameters.

  • Identify a consensus over the different candidate clusterings.
  • Merge non-differential clusters.
  • Find cluster signatures by testing for differential expression

between selected subsets of clusters.

64 / 105

slide-65
SLIDE 65

Resampling-Based Sequential Ensemble Clustering

  • Visualization. Comparison of multiple clusterings of the same

samples, heatmaps of co-clustering matrices, heatmaps with hierarchical clustering of genes and/or samples.

65 / 105

slide-66
SLIDE 66

Resampling-Based Clustering

  • Consider a particular base partitioning clustering algorithm

(e.g., PAM, k-means) and associated tuning parameters (e.g., number of principal components, number of clusters k, distance matrix).

  • Generate B subsamples of the n observations to be clustered,

e.g., bootstrap or random sample without replacement (of size 0.7n, say).

  • For each subsample, apply the base clustering algorithm (with

k clusters) and assign all n original observations to clusters based on their distances from cluster centroids/medoids.

  • Generate an n × n co-clustering matrix C by recording for

each pair of observations the proportion of times they are clustered together (out of B subsamples).

66 / 105

slide-67
SLIDE 67

Resampling-Based Clustering

  • Identify tight clusters based on the co-clustering matrix. For

example, given a constant α close to 0, cut branches top-down from a hierarchical clustering based on the co-clustering matrix C, such that Ci,j ≥ 1 − α ∀i, j in the selected branch (or some other less stringent function than min). Retain all clusters above a certain size or the largest certain number of clusters.

67 / 105

slide-68
SLIDE 68

Sequential Clustering

Based on Tseng and Wong (2005).

  • Starting with a user-supplied k = k0, generate a collection of

clusterings, {Ck

1 , Ck 2 , . . . , Ck mk}, by applying the above

resampling-based procedure to the base partitioning clustering algorithm with k clusters.

  • Given a constant β close to 1, increase k by one until

s(Ck

i , Ck+1 j

) ≥ β, for some i, j, where s is a measure of similarity between clusters.

  • Identify Ck+1

j

as a stable cluster.

  • Remove Ck+1

j

and repeat the above steps with k ← k − 1, until k is below a certain threshold or a certain number of clusters are identified.

68 / 105

slide-69
SLIDE 69

Sequential Clustering

  • Here, we evaluate a collection of clusters for stability based on

the pairwise similarity measure s(Ci, Cj) ≡ |Ci ∩ Cj|/|Ci ∪ Cj|. (2)

69 / 105

slide-70
SLIDE 70

Ensemble Clustering

  • Generate a collection of candidate clusterings as above, for a

range of base clustering algorithms and tuning parameters.

  • Find a single consensus clustering based on the co-clustering

matrix (as in resampling, only now across different algorithms

  • r tuning parameters).
  • Merge non-differential clusters by creating a hierarchy of

clusters, working up the tree, testing for differential expression between sister nodes, and collapsing nodes with insufficient DE.

70 / 105

slide-71
SLIDE 71

Differential Expression

  • Find cluster gene expression signatures, i.e., marker genes, by

testing for differential expression between selected subsets of clusters.

  • Standard F-statistic. Tests for any difference between
  • clusters. Sensitive to outlying samples/clusters. Non-specific,

i.e., not useful for interpreting differences between clusters.

  • Standard solution in (generalized) linear models/ANOVA is to

consider contrasts between groups of clusters. By using the machinery of the (generalized) linear model, we use all of the samples in testing these contrasts, rather than just those samples involved in the corresponding clusters.

◮ All pairwise. All pairwise comparisons between clusters. ◮ One against all. Compare each cluster to union of remaining

clusters.

71 / 105

slide-72
SLIDE 72

Differential Expression

◮ Dendrogram. Create a hierarchy of clusters, work up the tree,

test for DE between sister nodes (as in approach used for merging clusters).

  • For each contrast, test for DE using empirical Bayes linear

modeling approach of R package limma, with voom option to account for mean-variance relationship of log-counts (i.e.,

  • ver-dispersion).

72 / 105

slide-73
SLIDE 73

Software Package clusterExperiment

Workflow.

  • clusterMany. Generate a collection of candidate clusterings,

for different base clustering algorithms and tuning parameters, with option to use resampling and sequential approaches.

  • combineMany. Find consensus clustering across several

clusterings.

  • Identify non-differential clusters that should be merged into

larger clusters.

◮ makeDendrogram. Hierarchical clustering of the clusters found

by combineMany.

◮ mergeClusters. Merge clusters of this hierarchy based on DE

between nodes.

  • RSEC. Wrapper function around the clusterExperiment

workflow.

73 / 105

slide-74
SLIDE 74

Software Package clusterExperiment

  • getBestFeatures. Find cluster signatures by testing for

differential expression between selected subsets of clusters.

  • Visualization.

◮ plotClusters. Comparison of multiple clusterings of the

same samples. Based on ConsensusClusterPlus package.

◮ plotHeatmap. Heatmaps of co-clustering matrices, heatmaps

with hierarchical clustering of genes and/or samples (interface to aheatmap from NMF package).

74 / 105

slide-75
SLIDE 75

Software Package clusterExperiment

Application to OE p63 dataset.

  • clusterMany: Generate 22 candidate clusterings.

◮ Dimensionality reduction: 25, 50 PC. ◮ Euclidean distance. ◮ Base clustering method: PAM, k = 5, · · · , 15. ◮ Resampling-based clustering: B = 100, proportion = 0.7,

α = 0.3.

◮ Sequential clustering: k0 = 15, β = 0.9. ◮ clusterFunction=c("hierarchical01").

  • combineMany(ce, clusterFunction="hierarchical01",

whichClusters="workflow", proportion=0.7, propUnassigned=0.5, minSize=5).

  • mergeClusters(ce, mergeMethod="adjP",

cutoff=0.05).

75 / 105

slide-76
SLIDE 76

Software Package clusterExperiment

Clusterings from clusterMany

eatures=50,k0=15 eatures=25,k0=15 eatures=50,k0=14 eatures=25,k0=14 eatures=50,k0=13 eatures=25,k0=13 eatures=50,k0=12 eatures=25,k0=12 eatures=50,k0=11 eatures=25,k0=11 eatures=50,k0=10 eatures=25,k0=10 eatures=50,k0=9 eatures=25,k0=9 eatures=50,k0=8 eatures=25,k0=8 eatures=50,k0=7 eatures=25,k0=7 eatures=50,k0=6 eatures=25,k0=6 eatures=50,k0=5 eatures=25,k0=5

Figure 22: clusterExperiment. Comparison of 22 clusterMany clusterings using plotClusters.

76 / 105

slide-77
SLIDE 77

Software Package clusterExperiment

combineMany −1 c1 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c2 c20 c21 c22 c23 c24 c25 c26 c27 c28 c29 c3 0.2 0.4 0.6 0.8 1

Co−clustering proportion matrix

combineMany

Figure 23: clusterExperiment. Heatmap of co-clustering matrix for clusterMany clusterings, used to create combineMany clustering (plotHeatmap).

77 / 105

slide-78
SLIDE 78

Software Package clusterExperiment

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 c21 c22 c23 c24 c25 c26 c27 c28 c29 c30 c31 c32

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 c21 c22 c23 c24 c25 c26 c27 c28 c29 c30 c31 c32 0.38 0.058 0.15 0.028 0.066 0.088 0.035 0.014 0.014 0.039 0.039 0.026 0.023 0.017 0.016 0.0065 0.076 0.0079 0.011 0.0086 0.011 0.0042 0.0077 0.0092 0.006 0.0065 0.0086 0.0056 0.0051 0.0038 0.0056

Figure 24: clusterExperiment. Dendrograms from combineMany and mergeClusters.

78 / 105

slide-79
SLIDE 79

Software Package clusterExperiment

  • −50

50 −40 −20 20 40 60 80

Count PCA by cluster

PC1 PC2

  • −1

1 2 3 4 5 6 7 8

Figure 25: clusterExperiment. PCA of gene-level log-counts, colored by mergeClusters.

79 / 105

slide-80
SLIDE 80

Software Package clusterExperiment

mergeClusters −1 m1 m2 m3 m4 m5 m6 m7 m8 Bio −1 −2 Contaminants HBC HBC transition INP/GBC INP2 iOSN Microvillous mOSNs SUS SUS precursor −4 −2 2 4 6 8 10

Heatmap of DE genes, dendrogram contrasts

mergeClusters Bio

Figure 26: clusterExperiment. Heatmap of log-counts for getBestFeatures DE genes using dendrogram contrasts (269, top 50 in each of the 6 nodes).

80 / 105

slide-81
SLIDE 81

Software Package clusterExperiment

sirt5 hnrnpl rmi2 gm17821 nup98 cox6c
  • az1
rpl27 rplp1 rpl13a rps20 rps7 rpl23 rpl37a rpl23a rps29 rpl32 rplp2 rpl24 fau rpl36 rpl37 rps23 rpl35 prrc2c son kcnq1ot1 nip7 eif1 cfl1 rpl10 lars2 rps25 rpl13 naca rplp0 rpl7 rpl8 rps6 rps3a1 rpl21 rps11 rpl14 hmgb1 rpl3 trns1 rnr1 ppia rps27 myh9 rpl9 rps9 rps19 rpl41 rps14 map1b eif4g2 calm2 ywhae gsk3b calm1 rn18s−rs5 malat1 actb gnb1 rnr2 hsp90ab1 eef1a1 h3f3b canx btg1 srsf2 csde1 fth1 sod1 set ywhaz ubb tmsb4x prdx1 akr1a1 sptbn1 ubc mt1 dstn rps5 rpl4 rps3 rpl18a actr2 ddx5 hnrnpa2b1 skp1a hspa8 ptges3 tpt1 actg1 ftl1 ptma gm1821

mergeClusters −1 m1 m2 m3 m4 m5 m6 m7 m8 Bio −1 −2 Contaminants HBC HBC transition INP/GBC INP2 iOSN Microvillous mOSNs SUS SUS precursor 2 4 6 8 10

Heatmap of DE genes, F−test

mergeClusters Bio

Figure 27: clusterExperiment. Heatmap of log-counts for getBestFeatures DE genes using F-test (top 100).

81 / 105

slide-82
SLIDE 82

Software Package clusterExperiment

sox11 creer gstm2 cyp2f2 cbr2 trim66 gap43 ebf1 ebf2 ncam1
  • mp
cnga2 gng13 rgs11 lhx2 ebf3 nhlh1 gng8 arhgdig cyp2g1 lgr5 cyp1a2 scarb1 coch cd24a cftr ceacam1 prmt5 fabp5 neurod1 nhlh2 zfp423 ebf4 rrm1 ezh2 hes6 ccnd1 rbm24 kit dtl tead2 mcm7 ccna2 ect2 top2a cdca8 pbk il6 notch1 yap1 icam1 ppara fgfr2 dkk3 igfbp4 itgb5 egfr igfbp2 pax6 sall1 elf5 il33 sox2 wnt4 lifr nrcam trp63 krt14 krt5 sox9 kitl sfrp1 f3 perp ccnd2 hmgb2 tubb5 cdkn1b egfp igfbp5 krt8 hes1 notch2

mergeClusters −1 m1 m2 m3 m4 m5 m6 m7 m8 Bio −1 −2 Contaminants HBC HBC transition INP/GBC INP2 iOSN Microvillous mOSNs SUS SUS precursor 2 4 6 8 10

Heatmap of markers genes

mergeClusters Bio

Figure 28: clusterExperiment. Heatmap of log-counts for “a priori” markers genes (83).

82 / 105

slide-83
SLIDE 83

Software Package clusterExperiment: Summary

  • Tuning parameters.

◮ α controls tightness. ◮ β controls stability. ◮ k0, the initial number of clusters used in sequential clustering,

is the parameter with the greatest impact on the results. Larger k0 tend to lead to smaller and tighter clusters.

  • Caveat. The DE analysis is exploratory and nominal p-values
  • nly a rough summary of significance (reliance on models, tiny

p-values even after adjustment for multiple testing, same data used to define clusters and to perform DE analysis).

  • Ongoing efforts.

◮ Cluster confidence measures. ◮ Potentially assign unclustered observations to clusters. ◮ DE using ZINB model. ◮ Greater modularity (e.g., distance functions, DE test). ◮ Shiny app for interactive web interface. 83 / 105

slide-84
SLIDE 84

Cell Lineage and Pseudotime Inference

  • K. Street, D. Risso, E. Purdom

84 / 105

slide-85
SLIDE 85

Motivation

  • Mapping transcriptional progression from stem cells to

specialized cell types is essential for properly understanding the mechanisms regulating cell and tissue differentiation.

  • There may not always be a clear distinction between states,

but rather a smooth transition, with individual cells existing

  • n a continuum between states.
  • In such a case, cells may undergo gradual transcriptional

changes, where the relationship between states can be represented as a continuous lineage dependent upon an underlying spatial or temporal variable. This representation, referred to as pseudotemporal ordering, can help us understand how cells differentiate and how cell fate decisions are made (Bendall et al., 2014; Campbell et al., 2015; Ji and Ji, 2016; Petropoulos et al., 2016; Shin et al., 2015; Trapnell et al., 2014).

85 / 105

slide-86
SLIDE 86

Motivation

  • We have developed Slingshot as a flexible and robust

framework for inferring cell lineages and pseudotimes in the study of continuous differentiation processes.

86 / 105

slide-87
SLIDE 87

Cell Lineage and Pseudotime Inference

  • Input/Output.

◮ Input. Normalized gene expression measures and cell

clustering.

◮ Output. Cell lineages, i.e., subsets of ordered cell clusters.

Cell pseudotimes, i.e., for each lineage, ordered sequence of cells and associated pseudotimes.

  • Dimensionality reduction.

◮ Principal component analysis (PCA) seems effective and

simple, in conjunction with steps detailed next.

◮ Other approaches include related linear methods, e.g.,

independent component analysis (ICA) (Trapnell et al., 2014, Monocle), and non-linear methods, e.g., Laplacian eigenmaps/spectral embedding (Campbell et al., 2015, Embeddr), t-distributed stochastic neighbor embedding (t-SNE) (Bendall et al., 2014; Petropoulos et al., 2016, Wanderlust).

87 / 105

slide-88
SLIDE 88

Cell Lineage and Pseudotime Inference

  • Inferring cell lineages.

◮ Minimum spanning tree (MST; ape package) over cell clusters,

with between-cluster distance based on Euclidean distance between cluster means scaled by within-cluster covariance.

◮ Outlying clusters. Identified using granularity parameter ω that

limits maximum edge weight in the tree. Specifically, build MST using an artificial cluster Ω, with distance ω from other clusters (a fraction of maximum pairwise distance between clusters), and then remove Ω.

◮ Root and leaf nodes. May either be pre-specified or

automatically selected. Root node. If not pre-specified, selected based on parsimony (i.e., set of lineages with maximal number of clusters shared between them). Leaf nodes. If pre-specified, constrained MST.

◮ A lineage is then defined as any unique path coming out of the

root node and ending in a leaf node.

88 / 105

slide-89
SLIDE 89

Cell Lineage and Pseudotime Inference

◮ Constructing the MST on clusters (Ji and Ji, 2016; Shin et al.,

2015, TSCAN,Waterfall) vs. cells (Trapnell et al., 2014, Monocle) offers greater stability and computational efficiency, less complex lineages, and easier determination of directionality and branching.

  • Inferring cell pseudotimes.

◮ Iterative procedure inspired from the principal curve algorithm

  • f Hastie and Stuetzle (1989); principal.curve function in

princurve package.

◮ In the case of branching lineages, a shrinkage step is included

at each iteration, that forces a degree of similarity between the curves in the neighborhood of shared clusters.

◮ Pseudotime values are derived by orthogonal projection onto

the curves.

◮ Cells belonging to clusters that are included in multiple

lineages have multiple, similar pseudotime values.

89 / 105

slide-90
SLIDE 90

Cell Lineage and Pseudotime Inference

◮ Previous approaches also use smooth curves to represent

lineages (Campbell et al., 2015; Petropoulos et al., 2016, Embeddr), while others use piecewise linear paths through the MST and extract orderings either by orthogonal projection (Ji and Ji, 2016; Shin et al., 2015, TSCAN,Waterfall) or PQ tree (Trapnell et al., 2014, Monocle).

◮ We find that smooth curves provide discerning power not found

in piecewise linear trajectories, while also adding stability over a range of dimensionality reduction and clustering methods.

  • Differential expression. Regression of gene expression

measures on pseudotime, e.g., generalized additive models (GAM) (Ji and Ji, 2016, TSCAN).

  • Visualization. Two- and three-dimensional plots of cell

lineages and pseudotimes, gene-level trajectories, heatmaps for DE genes.

90 / 105

slide-91
SLIDE 91

Cell Lineage and Pseudotime Inference

  • R package slingshot, to be released through the Bioconductor

Project: github.com/kstreet13/slingshot.

91 / 105

slide-92
SLIDE 92

Software Package slingshot

  • Modularity.

◮ Integrates easily with a range of normalization, clustering, and

dimensionality reduction methods.

◮ get lineages: Given expression measures and cluster labels,

use MST to infer lineages. get lineages(X, clus.labels, start.clus = NULL, end.clus = NULL, dist.fun = NULL, omega = Inf, distout = FALSE).

◮ get curves: Given lineages, infer pseudotimes.

get curves(X, clus.labels, lineages, thresh = 1e-04, maxit = 100, stretch = 2, shrink = TRUE).

  • Flexibility. Can be used with varying levels of supervision.

◮ Cluster-based approach allows for easy supervision when

researchers have prior knowledge of cell classes, while still being able to detect novel branching events.

◮ User-supplied or data-driven selection of root and leaf nodes. 92 / 105

slide-93
SLIDE 93

Software Package slingshot

  • Visualization.

◮ plot tree: MST in 2 and 3D. ◮ plot curves: Lineage curves in 2 and 3D. 93 / 105

slide-94
SLIDE 94

Software Package slingshot

Application to OE p63 dataset.

  • Applied to the first three principal components, Slingshot

identifies two lineages: The first corresponds to the HBC-to-neurons transition, the second to the HBC-to-sustentacular cells transition.

  • A first-pass DE analysis, based on a regression of log-count on

pseudotime using GAM, suggests that many genes are involved in the differentiation process.

  • Among the top 100 DE genes for each lineage, only 13 are DE

in both, suggesting distinct processes in the neuronal vs. non-neuronal lineages.

94 / 105

slide-95
SLIDE 95

Software Package slingshot

  • ● ●
  • −50

50 −50 50 PC2 PC1

  • −40

20 40 60 −40 20 40 60 80 PC3 PC2

  • Figure 29: slingshot. PCA of gene-level log-counts, colored by clusters,

with MST edges used to infer lineages (get lineages, plot tree).

95 / 105

slide-96
SLIDE 96

Software Package slingshot

  • ● ●
  • −50

50 −50 50 PC2 PC1

  • −40

20 40 60 −40 20 40 60 80 PC3 PC2

Figure 30: slingshot. PCA of gene-level log-counts, colored by clusters, with smooth curves representing lineages and used to infer pseudotimes (get curves, plot curves).

96 / 105

slide-97
SLIDE 97

Software Package slingshot

  • 50

150 4 8

fstl5

pseudotime log(count+1)

  • 50

150 4 8

ugt2a1

pseudotime log(count+1)

  • 50

150 6

gstm1

pseudotime log(count+1)

  • 50

150 4 8

ncam1

pseudotime log(count+1)

  • 50

150 4 8

gstm2

pseudotime log(count+1)

  • 50

150 4 8

rtn1

pseudotime log(count+1)

  • 50

150 4 8

aqp3

pseudotime log(count+1)

  • 50

150 4 8

scn9a

pseudotime log(count+1)

  • 50

150 4

gm581

pseudotime log(count+1)

Slingshot: DE genes based on GAM, lineage 1

Figure 31: slingshot. Scatterplots of gene-level log-count vs. pseudotime for GAM DE genes in lineage 1 (HBC–Neurons).

97 / 105

slide-98
SLIDE 98

Software Package slingshot

jakmip1 tuba1a gm11223 stmn1 sult1d1 ak1 ebf3 elavl3 myt1l atf5 ttll7 homer2 uchl1 stmn3 rd3 calb2 snap25 cntn4 aplp1 clgn rtp1 gap43 trim66 sept3 ank2 spock1 gm581 scn9a mgst3 gng13 5730409k12rik prune2 flrt1 syt11 cystm1 dcx stmn2 fstl5 ncam1 ebf1 tubb3 ebf2 ell3 rtn1 map1b gnb1 rps27 prdx1 sp8 ifitm2 runx1 ezr zfp36l1 tcf7l2 atf3 krt18 sdc4 prdx6 gstm1 cyp2f2 cbr2 fmo6 ccnd2 gsta4 slc26a7 lypd2 notch2 aox2 hspb1 socs3 dapl1 s100a11 ahnak sik1 hes1 bhlhe40 irf1 cyr61 junb nr4a1 klf4 fosb wls rbm47 foxn3 ptn gja1 ifitm3 clu aqp4 anxa2 ces1d ephx1 zfp36 epas1 gpm6a scgb1c1 ugt2a1 aqp3 gstm2

mergeClusters m1 m2 m3 m5 m7 m8 Bio HBC HBC transition INP/GBC INP2 iOSN mOSNs SUS SUS precursor SUS progenitor 2 4 6 8 10

Slingshot: DE genes based on GAM, lineage 1

mergeClusters Bio

Figure 32: slingshot. Heatmap of log-counts for GAM DE genes in lineage 1 (HBC–Neurons), cells sorted by pseudotime.

98 / 105

slide-99
SLIDE 99

Software Package slingshot

Lineages.

$lineage1 [1] "8" "3" "7" "5" "2" "1" $lineage2 [1] "8" "3" "7" "4"

Biological annotation of clusters.

cl b 1 2 3 4 5 7 8 HBC 0 26 1 47 HBC transition 0 35 0 19 INP/GBC 1 0 41 INP2 0 34 iOSN 1 77 1 Microvillous 8 mOSNs 33 1 SUS 1 61 5 SUS precursor 7 2 4 62 SUS progenitor 6

99 / 105

slide-100
SLIDE 100

Software Package slingshot: Summary

Ongoing efforts.

  • Number of lineages: User-supplied, testing for distinct

lineages, merging non-differential lineages.

  • DE within and between (i.e., bifurcation) lineages.
  • Visualization.
  • Performance measures.
  • OOP with S4 classes and methods.
  • Shiny app for interactive web interface.

100 / 105

slide-101
SLIDE 101

References

  • S. Anders and W. Huber. Differential expression analysis for sequence count
  • data. Genome Biology, 11(10):R106, 2010.
  • S. C. Bendall, K. L. Davis, E. D. Amir, M. D. Tadmor, E. F. Simonds, T. J.

Chen, D. K. Shenfeld, G. P. Nolan, and D. Pe’er. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell

  • development. Cell, 157(3):714–725, 2014.
  • B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed. A comparison of

normalization methods for high density oligonucleotide array data based on bias and variance. Bioinformatics, 19(2):185–193, 2003.

  • J. H. Bullard, E. A. Purdom, K. D. Hansen, and S. Dudoit. Evaluation of

statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11:Article 94, 2010. URL http://www.biomedcentral.com/1471-2105/11/94/abstract. (Highly accessed).

101 / 105

slide-102
SLIDE 102

References

  • K. Campbell, C. P. Ponting, and C. Webber. Laplacian eigenmaps and principal

curves for high resolution pseudotemporal ordering of single-cell RNA-seq

  • profiles. Technical report, MRC Functional Genomics Unit, University of

Oxford, UK, 2015. URL biorxiv.org/content/early/2015/09/18/027219.

  • S. Dudoit and J. Fridlyand. Bagging to improve the accuracy of a clustering
  • procedure. Bioinformatics, 19(9):1090–1099, 2003. URL http:

//bioinformatics.oxfordjournals.org/content/19/9/1090.abstract.

  • J. Gagnon-Bartsch, L. Jacob, and T. P. Speed. Removing unwanted variation

from high dimensional data with negative controls. Technical Report 820, Department of Statistics, University of California, Berkeley, 2013.

  • J. A. Gagnon-Bartsch and T. P. Speed. Using control genes to correct for

unwanted variation in microarray data. Biostatistics, 13(3):539–552, 2012.

  • T. Hastie and W. Stuetzle. Principal curves. Journal of the American

Statistical Association, 84(406):502–516, 1989.

  • R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis,
  • U. Scherf, and T. P. Speed. Exploration, normalization, and summaries of

high density oligonucleotide array probe level data. Biostatistics, 4(2): 249–264, 2003.

102 / 105

slide-103
SLIDE 103

References

  • L. Jacob, J. Gagnon-Bartsch, and T. P. Speed. Correcting gene expression data

when neither the unwanted factors nor the factor of interest is observed. Annals of Applied Statistics, 2013. (Submitted).

  • Z. Ji and H. Ji. TSCAN: Pseudo-time reconstruction and evaluation in

single-cell RNA-seq analysis. Nucleic Acids Research, 2016.

  • W. E. Johnson, C. Li, and A. Rabinovic. Adjusting batch effects in microarray

expression data using empirical Bayes methods. Biostatistics, 8(1):118–127, 2007.

  • F. Leisch. Bagged clustering. Technical Report 51, SFB Adaptive Information

Systems and Modelling in Economics and Management Science, Vienna University of Economics and Business Administration, Vienna, Austria, August 1999. URL www.ci.tuwien.ac.at/~leisch/papers/fl-techrep.html.

  • J. Lov´

en, D. A. Orlando, A. A. Sigova, C. Y. Lin, P. B. Rahl, C. B. Burge,

  • D. L. Levens, T. I. Lee, and R. A. Young. Revisiting global gene expression
  • analysis. Cell, 151(3):476–482, 2012.

103 / 105

slide-104
SLIDE 104

References

  • J. C. Marioni, C. E. Mason, S. M. Mane, M. Stephens, and Y. Gilad. RNA-seq:

an assessment of technical reproducibility and comparison with gene expression arrays. Genome Research, 18(9):1509–1517, 2008.

  • A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold. Mapping

and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5(7):621–628, 2008.

  • S. Petropoulos, D. Edsg¨

ard, B. Reinius, Q. Deng, S. P. Panula, S. Codeluppi,

  • A. Plaza Reyes, S. Linnarsson, R. Sandberg, and F. Lanner. Single-cell

RNA-Seq reveals lineage and X chromosome dynamics in human preimplantation embryos. Cell, 165(In press):1–15, 2016.

  • D. Risso, J. Ngai, T. P. Speed, and S. Dudoit. Normalization of RNA-seq data

using factor analysis of control genes or samples. Nature Biotechnology, 32 (9):896–902, 2014a. URL http: //www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.2931.html.

  • D. Risso, J. Ngai, T. P. Speed, and S. Dudoit. The role of spike-in standards in

the normalization of RNA-seq. In S. Datta and D. Nettleton, editors, Statistical Analysis of Next Generation Sequencing Data, Frontiers in Probability and the Statistical Sciences, chapter 9, pages 169–190. Springer International Publishing, 2014b.

104 / 105

slide-105
SLIDE 105

References

  • M. D. Robinson and A. Oshlack. A scaling normalization method for differential

expression analysis of RNA-seq data. Genome Biology, 11(3):R25, 2010.

  • J. Shin, D. A. Berg, Y. Zhu, J. Y. Shin, J. Song, M. A. Bonaguidi,
  • G. Enikolopov, D. W. Nauen, K. M. Christian, G. Ming, and H. Song.

Single-cell RNA-Seq with Waterfall reveals molecular cascades underlying adult neurogenesis. Cell Stem Cell, 17(3):360–372, 2015.

  • C. Trapnell, D. Cacchiarelli, J. Grimsby, P. Pokharel, S. Li, M. Morse, N. J.

Lennon, K. J. Livak, T. S. Mikkelsen, and J. L. Rinn. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nature Biotechnology, 4(32):381–391, 2014.

  • G. C. Tseng and W. H. Wong. Tight clustering: a resampling-based approach

for identifying stable and tight patterns in data. Biometrics, 61(1):10–16, 2005.

  • C. A. Vallejos, D. Risso, A. Scialdone, S. Dudoit, and J. C. Marioni.

Normalizing single-cell RNA sequencing data: challenges and opportunities. Nature Methods, 2016. (Submitted).

  • Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool for
  • transcriptomics. Nature Reviews Genetics, 10(1):57–63, 2009.

105 / 105