using single cell transcriptome sequencing to infer
play

Using Single-Cell Transcriptome Sequencing to Infer Olfactory Stem - PowerPoint PPT Presentation

Using Single-Cell Transcriptome Sequencing to Infer Olfactory Stem Cell Fate Trajectories Sandrine Dudoit Division of Biostatistics and Department of Statistics University of California, Berkeley www.stat.berkeley.edu/~sandrine Computational


  1. Zero Inflation • Full-quantile (FQ) normalization also doesn’t behave properly due to ties from the large number of zeros. • We apply the following zero-count gene filtering to the OE dataset: Retain only the genes with at least n r = 20 reads, in at least n s = 40 samples. This yields 9,133/22,054 genes. 22 / 105

  2. Zero Inflation Proportion of genes with zero count, post gene filtering oportion of cells in which a gene is detected, post gene filtering 2.5 0.8 2.0 0.6 1.5 Density 1.0 0.4 0.5 0.2 0.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 N = 9133 Bandwidth = 0.03333 (a) Proportion of genes with zero count (b) Proportion of cells in which a gene is detected Figure 6: Zero inflation. Post gene filtering. 23 / 105

  3. Sample-Level QC QC measures by batch QC measures by batch 3500000 NALIGNED 3000000 3e+06 NREADS 2500000 2000000 2e+06 1500000 1e+06 1000000 500000 P01 P02 P03A P03B P04 P05 P06 P10 P11 P12 P13 P14 Y01 Y04 P01 P02 P03A P03B P04 P05 P06 P10 P11 P12 P13 P14 Y01 Y04 QC measures by batch QC measures by batch TOTAL_DUP 70 96 RALIGN 60 94 50 40 92 30 90 20 P01 P02 P03A P03B P04 P05 P06 P10 P11 P12 P13 P14 Y01 Y04 P01 P02 P03A P03B P04 P05 P06 P10 P11 P12 P13 P14 Y01 Y04 Figure 7: Sample-level QC. Boxplots of QC measures, by batch. 24 / 105

  4. Figure 8: Sample-level QC. Boxplots of QC measures, by batch. PCT_MRNA_BASES PCT_INTRONIC_BASES 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.6 QC measures by batch QC measures by batch P01 P01 P02 P02 P03A P03A P03B P03B P04 P04 P05 P05 P06 P06 P10 P10 P11 P11 P12 P12 P13 P13 P14 P14 Y01 Y01 Y04 Y04 MEDIAN_CV_COVERAGE PCT_INTERGENIC_BASES 0.10 0.15 0.20 0.25 0.30 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 QC measures by batch QC measures by batch P01 P01 P02 P02 P03A P03A P03B P03B P04 P04 P05 P05 P06 P06 P10 P10 P11 P11 P12 P12 Sample-Level QC P13 P13 P14 P14 Y01 Y01 Y04 Y04 25 / 105

  5. Sample-Level QC QC PCA by batch QC PCA by batch 6 ● ● 6 ● ● ● ● ● 4 ● ● ● ● ● 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● PC2 ● ● ● ● ● ● ● PC1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −4 ● ● ● −4 ● ● ● ● ● ● ● ● ● P01 P02 P03A P03B P04 P05 P06 P10 P11 P12 P13 P14 Y01 Y04 −5 0 5 10 PC1 (a) QC PC2 vs. PC1, colored by batch (b) QC PC1, by batch Figure 9: Sample-level QC. Principal component analysis (PCA) of sample-level QC measures. 26 / 105

  6. Sample-Level QC QC PC1 vs. count PC1 Absolute correlation of count PC and QC measures 1.0 NREADS ● Cor=−0.49 NALIGNED RALIGN ● TOTAL_DUP ● PRIMER 10 0.8 PCT_RIBOSOMAL_BASES PCT_CODING_BASES PCT_UTR_BASES PCT_INTRONIC_BASES PCT_INTERGENIC_BASES ● PCT_MRNA_BASES ● 0.6 MEDIAN_CV_COVERAGE ● ● QC PC1 ● MEDIAN_5PRIME_BIAS ● 5 MEDIAN_3PRIME_BIAS ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −5 ● ● ● ● ● 0.0 −40 −20 0 20 40 60 80 PC1 PC2 PC3 PC4 PC5 Count PC1 (a) QC PC1 vs. count PC1 (b) Correlation of count PC and QC measures Figure 10: Sample-level QC. Association of counts and sample-level QC measures. 27 / 105

  7. Sample-Level QC: Summary • The distribution of QC measures can vary substantially within and between batches. • Some QC measures clearly point to low-quality samples, e.g., low percentage of mapped reads (RALIGN). • There can be a strong association between QC measures and read counts (cf. PCA). • Filtering samples based on QC measures is advisable, as normalization procedures may not be able to adjust for QC and some samples simply have low quality. • Normalization procedures based on QC measures (e.g., regression on first few PC of QC measures) should also be considered. 28 / 105

  8. Gene-Level Counts Gene−level log−count Gene−level RLE 8 10 6 8 4 log(count+1) 6 RLE 2 4 0 −2 2 −4 0 (a) Log-count (b) RLE Figure 11: Gene-level counts. Gene-level log-count and relative log expression (RLE = log-ratio of read count to median read count across samples). 29 / 105

  9. Gene-Level Counts: Summary • After gene and sample filtering and before normalization, there are large differences in gene-level count distributions within and between batches (cf. RLE, housekeeping genes). • The counts are still zero-inflated. • There can be substantial association of counts and sample-level QC measures. • Normalization is essential before any clustering or differential expression analysis, to ensure that observed differences in expression measures between samples and/or genes are truly due to differential expression and not technical artifacts. 30 / 105

  10. Motivation • In order to derive expression measures and compare these measures between samples and/or genomic regions of interest (ROI), one first needs to normalize read counts to adjust for differences in sequencing depths (i.e., total read counts) and ROI lengths, and possibly more complex technical effects, e.g., C1 run/library preparation/flow-cell/lane effects and nucleotide composition (e.g., GC-content) effects – sequenceability. • Normalization is essential to ensure that observed differences in expression measures between samples and/or ROI are truly due to differential expression and not technical artifacts. 31 / 105

  11. Motivation • Normalization is still an important issue for RNA-Seq, despite initial optimistic claims such as: “One particularly powerful advantage of RNA-Seq is that it can capture transcriptome dynamics across different tissues or conditions without sophisticated normalization of data sets” (Wang et al., 2009). • This subject has received relatively little attention and few novel methods have been proposed in the RNA-Seq literature. • It is still common to only scale ROI-level counts by ROI length and total sample read count to adjust, respectively, for differences in ROI lengths and sequencing depths (even for technical replicate lanes). E.g. Estimators based on Poisson model as in Marioni et al. (2008); RPKM, Reads Per Kilobase of exon model per Million mapped reads, as in Mortazavi et al. (2008). 32 / 105

  12. Motivation • In many cases, however, one needs to account for more complex features of the genomic regions and experimental design. • Normalization seems even more important for single-cell RNA-Seq than bulk RNA-Seq. • There are large effects related to C1 run (cells processed in batches of 96 wells). • There is also greater variation in gene-level counts across single cells than across bulk RNA-Seq samples. • The very small amount of RNA present in a single cell may cause amplification bias (i.e., many sequenced copies of a transcript), hence high-magnitude outliers, and low capture efficiency (i.e., genes expressed in the cell, but not captured by the sequencing process), hence zero inflation. 33 / 105

  13. Motivation • Widely-used bulk RNA-Seq methods are not well-suited for handling the zero inflation and high level of technical noise of scRNA-Seq data (Vallejos et al., 2016). 34 / 105

  14. Between-Sample Normalization Most normalization methods proposed to date are adaptations of methods for bulk RNA-Seq and microarrays. • Global-scaling normalization. Gene-level counts are scaled by a single factor per sample. ◮ Total-count (TC), as in RPKM (Mortazavi et al., 2008), still widely used. ◮ Housekeeping gene count, as in qRT-PCR, e.g., per-sample POLR2A count (Bullard et al., 2010). ◮ Single-quantile, e.g., per-sample upper-quartile (UQ) of counts for genes with non-zero counts (Bullard et al., 2010). • Pairwise global-scaling normalization. Gene-level counts are scaled by a single factor per sample computed with respect to a reference sample. ◮ Trimmed Mean of M values (TMM) (Robinson and Oshlack, 2010): Select a reference sample and obtain a robust estimate of the overall fold-change between each sample and the reference. Implemented in edgeR. 35 / 105

  15. Between-Sample Normalization ◮ Anders and Huber (2010): The global-scaling factor for a given sample is defined as the median fold-change between that sample and a synthetic reference sample whose counts are defined as the geometric means of the counts across samples. Implemented in DESeq and edgeR (as “RLE”). • Full-quantile normalization (FQ). All quantiles of the gene count distributions are matched between samples (Bullard et al., 2010). Specifically, for each sample, the distribution of sorted read counts is matched to a reference distribution defined in terms of a function of the sorted counts (e.g., median) across samples. Inspired from the microarray literature (Irizarry et al., 2003). 36 / 105

  16. Between-Sample Normalization • Loess normalization. Loess fits are performed for mean-difference plots (MD-plots) of log-counts for pairs of samples (all possible pairs (cyclic) or each sample paired with a synthetic reference obtained by averaging counts across samples). This approach, also inspired from the microarray literature (Bolstad et al., 2003), was used in Lov´ en et al. (2012) and is implemented in the affy function loess.normalize and limma function normalizeCyclicLoess . • Regression-based normalization for known technical factors, e.g., C1 run, QC measures. Linear model (limma), negative binomial model (edgeR), empirical Bayes model (ComBat of Johnson et al. (2007), implemented in R package sva), supervised version of RUV below. • Remove unwanted variation (RUV). 37 / 105

  17. Between-Sample Normalization ◮ Proposed initially for the normalization of microarray data (Gagnon-Bartsch and Speed, 2012; Gagnon-Bartsch et al., 2013; Jacob et al., 2013) and then adapted to sequencing data (Risso et al., 2014a,b), the main idea is to remove “unwanted variation” by performing factor analysis for a suitably-chosen subset of control genes and/or samples. ◮ For n samples and J genes, consider the log-linear/generalized linear model (GLM) log E[ Y | X , W ] = W α + X β, (1) where Y is the n × J matrix of gene-level read counts, X is an n × M design matrix corresponding to the M covariates of interest/factors of “wanted variation” (e.g., treatment) and β its associated M × J matrix of parameters of interest, and W is an n × K matrix corresponding to unknown factors of “unwanted variation” and α its associated K × J matrix of nuisance parameters. 38 / 105

  18. Between-Sample Normalization ◮ The unwanted factors W can be estimated as follows. • RUVg: Based on control genes, with either constant expression ( β = 0) or known expression fold-changes between samples (e.g., housekeeping genes, ERCC spike-ins, “in-silico” empirical controls). • RUVr: Based on residuals, such as deviance residuals from a first-pass GLM regression of the counts on the covariates of interest, without RUVg normalization, but possibly after some other normalization procedure such as upper-quartile. • RUVs: Based on replicate/negative control samples for which the covariates of interest are constant. • Generalization of RUV. The RUV model can be readily extended to include known sample-level covariates (e.g., C1 run, QC measures) and known gene-level covariates (e.g., GC-content) that should be adjusted for. 39 / 105

  19. Between-Sample Normalization: RUV Read counts Unwanted variation Variation of interest J M J K J α β K logE[ Y | W , X ] = X W + M n n n n samples x J genes Figure 12: RUV model. 40 / 105

  20. SCONE D. Risso, M. Cole, N. Yosef 41 / 105

  21. SCONE SCONE: Single-Cell Overview of Normalized Expression. A general framework for the normalization of scRNA-Seq data. • Range of normalization methods. ◮ Global-scaling, e.g., DESeq, TMM. ◮ Full-quantile (FQ). ◮ Unknown factors of unwanted variation: Remove unwanted variation (RUV). ◮ Known factors of unwanted variation: Regression-based normalization on, e.g., QC PC, C1 run. • Normalization performance metrics. • Numerical and graphical summaries of normalized read counts and metrics. • R package scone, to be released through the Bioconductor Project: github.com/yoseflab/scone . 42 / 105

  22. SCONE Figure 13: scone. Regression model. 43 / 105

  23. SCONE Performance metrics. (Green: Good when high; Red: Good when low.) • BIO SIL: Average silhouette width by biological condition. • BATCH SIL: Average silhouette width by batch. • PAM SIL: Maximum average silhouette width for PAM clusterings, for a range of user-supplied numbers of clusters. • EXP QC COR: Maximum squared Spearman correlation between count PCs and QC measures. • EXP UV COR: Maximum squared Spearman correlation between count PCs and factors of unwanted variation (preferably derived from other set of negative control genes than used in RUV). • EXP WV COR: Maximum squared Spearman correlation between count PCs and factors of wanted variation (derived from positive control genes). 44 / 105

  24. SCONE • RLE MED: Mean squared median relative log expression (RLE). • RLE IQR: Mean inter-quartile range (IQR) of RLE. 45 / 105

  25. Software Package scone Application to OE p63 dataset. • Apply and evaluate 172 normalization procedures using main scone function. ◮ scaling method : None, DESeq, TMM, FQ. ◮ uv factors : None, RUVg k = 1 , · · · , 5, QC PC k = 1 , · · · , 5. ◮ adjust biology : Yes/no. ◮ adjust batch : Yes/no. • Select a normalization procedure based on (function of) the performance scores. Unweighted mean score = ⇒ none,fq,qc k=4,bio,no batch Weighted mean score = ⇒ none,fq,qc k=2,no bio,no batch 46 / 105

  26. Software Package scone SCONE: Biplot of scores colored by mean score 0.10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.05 ● ● ● ● ● ● ● ● ● ● ● ● ● EXP_WV_COR ● ● ● ● ● ● EXP_UV_COR BIO_SIL ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.00 RLE_MED EXP_QC_COR ● ● ● RLE_IQR ● ● ● ● ● ● ● ● ● ● PC2 ● BATCH_SIL PAM_SIL ● ● ● ● ● −0.05 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −0.15 ● ● ● ● ● ● ● −0.15 −0.10 −0.05 0.00 0.05 0.10 PC1 Figure 14: scone. Biplot of performance scores, colored by mean score (yellow high/good, blue low/bad). 47 / 105

  27. Software Package scone SCONE: Score PCA colored by mean score ● ● ● Top mean ● ● ● ● ● ● 0.10 ● ● ● ● ● ● ● ● ● ● Top weighted mean ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● fq,ruv_k=1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● none ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.00 ● ● ● ● ● ● ● ● ● ● ● ● ● PC2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −0.10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −0.20 ● −0.4 −0.2 0.0 0.2 PC1 Figure 15: scone. PCA of performance scores, colored by mean score. 48 / 105

  28. Software Package scone SCONE: Score PCA colored by method ● ● ● none ● ● ● ● ● ● ● ● 0.10 ● ● ● ● ● ● ● ● ● fq ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● deseq ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● tmm ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.00 ● ● ● ● ● ● ● ● ● ● ● PC2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −0.10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −0.20 ● −0.4 −0.2 0.0 0.2 PC1 Figure 16: scone. PCA of performance scores, colored by method – scaling method . 49 / 105

  29. Software Package scone SCONE: Score PCA colored by method ● ● ● no_bio,no_batch ● ● ● ● ● ● ● 0.10 ● ● ● ● ● ● ● ● ● ● no_bio,batch ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● bio,no_batch ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● bio,batch ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.00 ● ● ● ● ● ● ● ● ● ● ● PC2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −0.10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −0.20 ● −0.4 −0.2 0.0 0.2 PC1 Figure 17: scone. PCA of performance scores, colored by method – adjust biology , adjust batch . 50 / 105

  30. Software Package scone FQ + RUVg(HK, k=1): W by batch FQ + RUVg(HK, k=1): QC PC1 vs. W 0.10 ● Cor=−0.56 ● 0.05 ● 10 0.00 ● ● ● ● QC PC1 ● 5 ● −0.05 ● ● ● ● ● W ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −0.15 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −5 ● ● ● ● ● −0.20 −0.15 −0.10 −0.05 0.00 0.05 0.10 W (a) Unwanted factor W (b) QC PC1 vs. W Figure 18: scone. Association of RUVg unwanted factor W and QC measures for none,fq,ruv k=1,no bio,batch . 51 / 105

  31. Software Package scone SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batc SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batc 6 4 3 4 2 2 1 RLE RLE 0 0 −1 −2 −2 −3 (a) All genes (b) Housekeeping genes Figure 19: scone. Gene-level relative log expression (RLE = log-ratio of read count to median read count across samples) for method with top weighted mean score none,fq,qc k=2,no bio,no batch . 52 / 105

  32. Software Package scone SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: QC PC1 vs. count PC1 SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: Absolute correlation of count PC and QC measure 1.0 ● NREADS Cor=0 NALIGNED RALIGN ● TOTAL_DUP ● PRIMER 10 0.8 PCT_RIBOSOMAL_BASES PCT_CODING_BASES PCT_UTR_BASES PCT_INTRONIC_BASES PCT_INTERGENIC_BASES ● PCT_MRNA_BASES ● 0.6 MEDIAN_CV_COVERAGE ● ● QC PC1 ● MEDIAN_5PRIME_BIAS 5 ● MEDIAN_3PRIME_BIAS ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.4 ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −5 ● ● ● ● ● 0.0 −20 0 20 40 PC1 PC2 PC3 PC4 PC5 count PC1 (a) QC PC1 vs. count PC1 (b) Correlation of count PC and QC measures Figure 20: scone. Association of counts and sample-level QC measures, none,fq,qc k=2,no bio,no batch . 53 / 105

  33. Software Package scone: Summary • Unnormalized gene-level counts exhibit large differences in distributions within and between batches and association with sample-level QC measures. • Different normalization methods vary in performance according to SCONE metrics and lead to different distributions of gene-level counts, hence clustering and DE results. • Global-scaling normalization. Not aggressive enough to handle potentially large batch effects and association of counts and QC measures. Biological effects are dominated by nuisance technical effects. Additionally, for DESeq, the scaling factors are computed based on only a handful of genes with non-zero counts in all cells (5/22,054). 54 / 105

  34. Software Package scone: Summary • Batch effect normalization. Adjusting for batch effects without properly accounting for the nesting of batch within biological effects ( no bio,batch ) in the regression model is problematic, as this removes the biological effects of interest (e.g., empirical Bayes framework of ComBat). • FQ followed by QC-based or RUVg normalization. Seems effective: Similar RLE distributions between samples, lower association of counts and QC measures. The first unwanted factor of RUVg is correlated with the first QC PC. • The remaining analyses are based on none,fq,qc k=2,no bio,no batch , the best method according to weighted mean score. 55 / 105

  35. Software Package scone: Summary • Interpretation of performance metrics. Some metrics tend to favor certain methods over others, e.g., EXP UV COR (correlation between count PCs and factors of unwanted variation) naturally favors RUVg, especially when the same set of negative controls are used for normalization and evaluation. Hence, a careful, global interpretation of the metrics is recommended. • Negative controls. The selection of proper, distinct sets of negative controls is important, as these are used for both normalization (RUVg) and assessment of normalization results (EXP UV COR). • Ongoing efforts. ◮ Zero-inflated negative binomial (ZINB) model. ◮ User-supplied factors unwanted and wanted variation (UV and WV, respectively). ◮ Other methods (e.g., ComBat/sva). 56 / 105

  36. Software Package scone: Summary ◮ Other performance metrics. ◮ Visualization. ◮ Shiny app for interactive web interface. 57 / 105

  37. Resampling-Based Sequential Ensemble Clustering D. Risso, E. Purdom 58 / 105

  38. Motivation • Robustness to choice of samples. Both hierarchical and partitioning methods tend to be sensitive to the choice of samples to be clustered. Outlying samples/clusters (e.g., glia) are common in scRNA-Seq and mask interesting substructure in the data, often requiring the successive pruning out of dominating clusters to get to the finer structure. • Robustness to clustering algorithm and tuning parameters. Clustering results are sensitive to pre-processing steps such as normalization and dimensionality reduction, as well as to the choice of clustering algorithm and associated tuning parameters (e.g., distance function, number of clusters). 59 / 105

  39. Motivation • Not focusing on the number of clusters. A major tuning parameter of partitioning methods such as partitioning around medoids (PAM) and k -means is the number of clusters k . Methods for selecting k (e.g., silhouette width) are sensitive to the choice of samples, normalization, and other tuning parameters. They tend to be conservative (low k ), i.e., capture only the coarse clustering structure and mask interesting substructure in the data. Additionally, the number of clusters k is often not of primary interest. E.g. Silhouette width with PAM selects only k = 2 clusters for the OE p63 dataset. • Not forcing samples into clusters. Some samples may be outliers, that do not really belong to any clusters. Leaving them out can improve the quality and interpretability of the clustering as well as downstream analyses (e.g., identification of cluster marker genes). 60 / 105

  40. Motivation • Cluster gene expression signatures. Common differential expression statistics are not well-suited for finding marker genes for the clusters, especially for finer structure in a hierarchy. 61 / 105

  41. Motivation SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−, PAM PC: A SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batc 0.4 KO_14DPT 0.6 KO_24HPT KO_48HPT 0.5 KO_7DPT 0.3 KO_96HPT 0.4 WT_72HPT silhouette WT_96HPT 0.3 0.2 0.2 0.1 0.1 0.0 0.0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 k (a) Average silhouette width (b) Silhouette widths, k = 2 Figure 21: Partitioning around medoids. 20 PC, one-minus-correlation distance. 62 / 105

  42. Resampling-Based Sequential Ensemble Clustering • We have developed a resampling-based sequential ensemble clustering approach, with the aim of obtaining stable and tight clusters. • Ensemble clustering, i.e., aggregating multiple clusterings obtained from different algorithms or applications of a given algorithm to resampled versions of the learning set, is a general approach for improving stability. This can be viewed as the unsupervised analog of ensemble methods in supervised learning, e.g., bagging, boosting, random forests. • Our approach is related to bagged/consensus/tight clustering (Dudoit and Fridlyand, 2003; Leisch, 1999; Tseng and Wong, 2005). • R package clusterExperiment, released through the Bioconductor Project. 63 / 105

  43. Resampling-Based Sequential Ensemble Clustering RSEC: Resampling-based Sequential Ensemble Clustering. • Given a base clustering algorithm (e.g., PAM, k -means) and associated tuning parameters (e.g., number of principal components, number of clusters k , distance matrix), generate a single candidate clustering using ◮ resampling-based clustering to find robust and tight clusters; ◮ sequential clustering to find stable clusters over a range of numbers of clusters (Tseng and Wong, 2005). • Generate a collection of candidate clusterings by repeating the above procedure for different base clustering algorithms and tuning parameters. • Identify a consensus over the different candidate clusterings. • Merge non-differential clusters. • Find cluster signatures by testing for differential expression between selected subsets of clusters. 64 / 105

  44. Resampling-Based Sequential Ensemble Clustering • Visualization. Comparison of multiple clusterings of the same samples, heatmaps of co-clustering matrices, heatmaps with hierarchical clustering of genes and/or samples. 65 / 105

  45. Resampling-Based Clustering • Consider a particular base partitioning clustering algorithm (e.g., PAM, k -means) and associated tuning parameters (e.g., number of principal components, number of clusters k , distance matrix). • Generate B subsamples of the n observations to be clustered, e.g., bootstrap or random sample without replacement (of size 0.7 n , say). • For each subsample, apply the base clustering algorithm (with k clusters) and assign all n original observations to clusters based on their distances from cluster centroids/medoids. • Generate an n × n co-clustering matrix C by recording for each pair of observations the proportion of times they are clustered together (out of B subsamples). 66 / 105

  46. Resampling-Based Clustering • Identify tight clusters based on the co-clustering matrix. For example, given a constant α close to 0, cut branches top-down from a hierarchical clustering based on the co-clustering matrix C , such that C i , j ≥ 1 − α ∀ i , j in the selected branch (or some other less stringent function than min). Retain all clusters above a certain size or the largest certain number of clusters. 67 / 105

  47. Sequential Clustering Based on Tseng and Wong (2005). • Starting with a user-supplied k = k 0 , generate a collection of clusterings, {C k 1 , C k 2 , . . . , C k m k } , by applying the above resampling-based procedure to the base partitioning clustering algorithm with k clusters. • Given a constant β close to 1, increase k by one until i , C k +1 s ( C k ) ≥ β , for some i , j , where s is a measure of j similarity between clusters. • Identify C k +1 as a stable cluster. j • Remove C k +1 and repeat the above steps with k ← k − 1, j until k is below a certain threshold or a certain number of clusters are identified. 68 / 105

  48. Sequential Clustering • Here, we evaluate a collection of clusters for stability based on the pairwise similarity measure s ( C i , C j ) ≡ |C i ∩ C j | / |C i ∪ C j | . (2) 69 / 105

  49. Ensemble Clustering • Generate a collection of candidate clusterings as above, for a range of base clustering algorithms and tuning parameters. • Find a single consensus clustering based on the co-clustering matrix (as in resampling, only now across different algorithms or tuning parameters). • Merge non-differential clusters by creating a hierarchy of clusters, working up the tree, testing for differential expression between sister nodes, and collapsing nodes with insufficient DE. 70 / 105

  50. Differential Expression • Find cluster gene expression signatures, i.e., marker genes, by testing for differential expression between selected subsets of clusters. • Standard F -statistic. Tests for any difference between clusters. Sensitive to outlying samples/clusters. Non-specific, i.e., not useful for interpreting differences between clusters. • Standard solution in (generalized) linear models/ANOVA is to consider contrasts between groups of clusters. By using the machinery of the (generalized) linear model, we use all of the samples in testing these contrasts, rather than just those samples involved in the corresponding clusters. ◮ All pairwise. All pairwise comparisons between clusters. ◮ One against all. Compare each cluster to union of remaining clusters. 71 / 105

  51. Differential Expression ◮ Dendrogram. Create a hierarchy of clusters, work up the tree, test for DE between sister nodes (as in approach used for merging clusters). • For each contrast, test for DE using empirical Bayes linear modeling approach of R package limma, with voom option to account for mean-variance relationship of log-counts (i.e., over-dispersion). 72 / 105

  52. Software Package clusterExperiment Workflow. • clusterMany . Generate a collection of candidate clusterings, for different base clustering algorithms and tuning parameters, with option to use resampling and sequential approaches. • combineMany . Find consensus clustering across several clusterings. • Identify non-differential clusters that should be merged into larger clusters. ◮ makeDendrogram . Hierarchical clustering of the clusters found by combineMany . ◮ mergeClusters . Merge clusters of this hierarchy based on DE between nodes. • RSEC . Wrapper function around the clusterExperiment workflow. 73 / 105

  53. Software Package clusterExperiment • getBestFeatures . Find cluster signatures by testing for differential expression between selected subsets of clusters. • Visualization. ◮ plotClusters . Comparison of multiple clusterings of the same samples. Based on ConsensusClusterPlus package. ◮ plotHeatmap . Heatmaps of co-clustering matrices, heatmaps with hierarchical clustering of genes and/or samples (interface to aheatmap from NMF package). 74 / 105

  54. Software Package clusterExperiment Application to OE p63 dataset. • clusterMany : Generate 22 candidate clusterings. ◮ Dimensionality reduction: 25, 50 PC. ◮ Euclidean distance. ◮ Base clustering method: PAM, k = 5 , · · · , 15. ◮ Resampling-based clustering: B = 100, proportion = 0 . 7, α = 0 . 3. ◮ Sequential clustering: k 0 = 15, β = 0 . 9. ◮ clusterFunction=c("hierarchical01") . • combineMany(ce, clusterFunction="hierarchical01", whichClusters="workflow", proportion=0.7, propUnassigned=0.5, minSize=5) . • mergeClusters(ce, mergeMethod="adjP", cutoff=0.05) . 75 / 105

  55. Software Package clusterExperiment Clusterings from clusterMany eatures=25,k0=5 eatures=50,k0=5 eatures=25,k0=6 eatures=50,k0=6 eatures=25,k0=7 eatures=50,k0=7 eatures=25,k0=8 eatures=50,k0=8 eatures=25,k0=9 eatures=50,k0=9 eatures=25,k0=10 eatures=50,k0=10 eatures=25,k0=11 eatures=50,k0=11 eatures=25,k0=12 eatures=50,k0=12 eatures=25,k0=13 eatures=50,k0=13 eatures=25,k0=14 eatures=50,k0=14 eatures=25,k0=15 eatures=50,k0=15 Figure 22: clusterExperiment. Comparison of 22 clusterMany clusterings using plotClusters . 76 / 105

  56. Software Package clusterExperiment Co−clustering proportion matrix combineMany 1 combineMany −1 c1 0.8 c10 c11 c12 0.6 c13 c14 0.4 c15 c16 c17 0.2 c18 c19 c2 0 c20 c21 c22 c23 c24 c25 c26 c27 c28 c29 c3 Figure 23: clusterExperiment. Heatmap of co-clustering matrix for clusterMany clusterings, used to create combineMany clustering ( plotHeatmap ). 77 / 105

  57. Software Package clusterExperiment c6 c4 c14 c16 c18 c6 c20 0.006 c4 0.0086 c17 c14 0.014 c19 c16 0.035 c18 c23 c25 c20 0.0038 c21 0.0065 c17 0.0077 c19 c24 c23 c22 0.011 c25 0.066 0.0051 c30 c21 c31 0.0042 c24 0.076 0.0056 c29 c22 0.0056 c13 c30 0.026 c32 c31 c29 c28 0.0065 0.15 0.088 c13 c15 c32 c3 0.011 c28 c27 0.039 c15 c10 0.0079 c3 c26 c27 0.023 c12 0.38 0.028 c10 c5 c26 c11 c12 0.016 c2 c5 c11 c8 0.039 0.0086 c2 c7 0.0092 0.017 c8 c9 0.058 c7 c1 c9 0.014 c1 Figure 24: clusterExperiment. Dendrograms from combineMany and mergeClusters . 78 / 105

  58. Software Package clusterExperiment Count PCA by cluster 80 ● −1 ● ● ● ● ● ● ● ● 1 ● ● ● ● ● ● 2 60 ● ● ● ● ● ● ● ● ● ● ● ● ● 3 ● ● ● ● ● ● ● ● ● ● ● ● 4 ● ● ● ● 5 ● ● ● ● 40 6 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 7 ● ● ● ● ● ● ● ● ● ● 8 ● ● PC2 ● ● ● ● 20 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −20 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −40 ● ● ● ● ● ● −50 0 50 PC1 Figure 25: clusterExperiment. PCA of gene-level log-counts, colored by mergeClusters . 79 / 105

  59. Software Package clusterExperiment Heatmap of DE genes, dendrogram contrasts mergeClusters Bio mergeClusters 10 −1 m1 8 m2 6 m3 m4 4 m5 m6 2 m7 m8 0 Bio −2 −1 −4 −2 Contaminants HBC HBC transition INP/GBC INP2 iOSN Microvillous Figure 26: clusterExperiment. Heatmap of log-counts for mOSNs SUS getBestFeatures DE genes using dendrogram contrasts (269, top 50 in SUS precursor each of the 6 nodes). 80 / 105

  60. Software Package clusterExperiment Heatmap of DE genes, F−test mergeClusters Bio gm1821 mergeClusters ptma ftl1 10 actg1 tpt1 ptges3 −1 hspa8 skp1a hnrnpa2b1 ddx5 actr2 rpl18a rps3 m1 rpl4 rps5 8 dstn mt1 ubc m2 sptbn1 akr1a1 prdx1 tmsb4x ubb ywhaz set m3 sod1 fth1 csde1 srsf2 6 btg1 m4 canx h3f3b eef1a1 hsp90ab1 rnr2 gnb1 actb m5 malat1 rn18s−rs5 calm1 gsk3b ywhae m6 calm2 4 eif4g2 map1b rps14 rpl41 rps19 m7 rps9 rpl9 myh9 rps27 ppia rnr1 m8 trns1 rpl3 2 hmgb1 rpl14 rps11 rpl21 rps3a1 rps6 rpl8 rpl7 rplp0 naca Bio rpl13 rps25 lars2 rpl10 0 cfl1 −1 eif1 nip7 kcnq1ot1 son prrc2c rpl35 rps23 −2 rpl37 rpl36 fau rpl24 rplp2 Contaminants rpl32 rps29 rpl23a rpl37a rpl23 rps7 rps20 HBC rpl13a rplp1 rpl27 oaz1 cox6c HBC transition nup98 gm17821 rmi2 hnrnpl sirt5 INP/GBC INP2 iOSN Microvillous mOSNs Figure 27: clusterExperiment. Heatmap of log-counts for SUS getBestFeatures DE genes using F -test (top 100). SUS precursor 81 / 105

  61. Software Package clusterExperiment Heatmap of markers genes mergeClusters Bio notch2 mergeClusters hes1 krt8 igfbp5 egfp −1 cdkn1b tubb5 hmgb2 10 ccnd2 perp f3 m1 sfrp1 kitl sox9 krt5 m2 krt14 trp63 8 nrcam lifr wnt4 m3 sox2 il33 elf5 sall1 pax6 m4 igfbp2 egfr itgb5 6 igfbp4 dkk3 m5 fgfr2 ppara icam1 yap1 notch1 m6 il6 pbk cdca8 4 top2a m7 ect2 ccna2 mcm7 tead2 dtl kit m8 rbm24 ccnd1 hes6 2 ezh2 rrm1 ebf4 zfp423 nhlh2 neurod1 fabp5 Bio prmt5 ceacam1 cftr cd24a 0 −1 coch scarb1 cyp1a2 lgr5 cyp2g1 arhgdig −2 gng8 nhlh1 ebf3 lhx2 Contaminants rgs11 gng13 cnga2 omp ncam1 ebf2 HBC ebf1 gap43 trim66 cbr2 HBC transition cyp2f2 gstm2 creer sox11 INP/GBC INP2 iOSN Microvillous mOSNs Figure 28: clusterExperiment. Heatmap of log-counts for “a priori” SUS markers genes (83). SUS precursor 82 / 105

  62. Software Package clusterExperiment: Summary • Tuning parameters. ◮ α controls tightness. ◮ β controls stability. ◮ k 0 , the initial number of clusters used in sequential clustering, is the parameter with the greatest impact on the results. Larger k 0 tend to lead to smaller and tighter clusters. • Caveat. The DE analysis is exploratory and nominal p -values only a rough summary of significance (reliance on models, tiny p -values even after adjustment for multiple testing, same data used to define clusters and to perform DE analysis). • Ongoing efforts. ◮ Cluster confidence measures. ◮ Potentially assign unclustered observations to clusters. ◮ DE using ZINB model. ◮ Greater modularity (e.g., distance functions, DE test). ◮ Shiny app for interactive web interface. 83 / 105

  63. Cell Lineage and Pseudotime Inference K. Street, D. Risso, E. Purdom 84 / 105

  64. Motivation • Mapping transcriptional progression from stem cells to specialized cell types is essential for properly understanding the mechanisms regulating cell and tissue differentiation. • There may not always be a clear distinction between states, but rather a smooth transition, with individual cells existing on a continuum between states. • In such a case, cells may undergo gradual transcriptional changes, where the relationship between states can be represented as a continuous lineage dependent upon an underlying spatial or temporal variable. This representation, referred to as pseudotemporal ordering, can help us understand how cells differentiate and how cell fate decisions are made (Bendall et al., 2014; Campbell et al., 2015; Ji and Ji, 2016; Petropoulos et al., 2016; Shin et al., 2015; Trapnell et al., 2014). 85 / 105

  65. Motivation • We have developed Slingshot as a flexible and robust framework for inferring cell lineages and pseudotimes in the study of continuous differentiation processes. 86 / 105

  66. Cell Lineage and Pseudotime Inference • Input/Output. ◮ Input. Normalized gene expression measures and cell clustering. ◮ Output. Cell lineages, i.e., subsets of ordered cell clusters. Cell pseudotimes, i.e., for each lineage, ordered sequence of cells and associated pseudotimes. • Dimensionality reduction. ◮ Principal component analysis (PCA) seems effective and simple, in conjunction with steps detailed next. ◮ Other approaches include related linear methods, e.g., independent component analysis (ICA) (Trapnell et al., 2014, Monocle), and non-linear methods, e.g., Laplacian eigenmaps/spectral embedding (Campbell et al., 2015, Embeddr), t-distributed stochastic neighbor embedding (t-SNE) (Bendall et al., 2014; Petropoulos et al., 2016, Wanderlust). 87 / 105

  67. Cell Lineage and Pseudotime Inference • Inferring cell lineages. ◮ Minimum spanning tree (MST; ape package) over cell clusters, with between-cluster distance based on Euclidean distance between cluster means scaled by within-cluster covariance. ◮ Outlying clusters. Identified using granularity parameter ω that limits maximum edge weight in the tree. Specifically, build MST using an artificial cluster Ω, with distance ω from other clusters (a fraction of maximum pairwise distance between clusters), and then remove Ω. ◮ Root and leaf nodes. May either be pre-specified or automatically selected. Root node. If not pre-specified, selected based on parsimony (i.e., set of lineages with maximal number of clusters shared between them). Leaf nodes. If pre-specified, constrained MST. ◮ A lineage is then defined as any unique path coming out of the root node and ending in a leaf node. 88 / 105

  68. Cell Lineage and Pseudotime Inference ◮ Constructing the MST on clusters (Ji and Ji, 2016; Shin et al., 2015, TSCAN,Waterfall) vs. cells (Trapnell et al., 2014, Monocle) offers greater stability and computational efficiency, less complex lineages, and easier determination of directionality and branching. • Inferring cell pseudotimes. ◮ Iterative procedure inspired from the principal curve algorithm of Hastie and Stuetzle (1989); principal.curve function in princurve package. ◮ In the case of branching lineages, a shrinkage step is included at each iteration, that forces a degree of similarity between the curves in the neighborhood of shared clusters. ◮ Pseudotime values are derived by orthogonal projection onto the curves. ◮ Cells belonging to clusters that are included in multiple lineages have multiple, similar pseudotime values. 89 / 105

  69. Cell Lineage and Pseudotime Inference ◮ Previous approaches also use smooth curves to represent lineages (Campbell et al., 2015; Petropoulos et al., 2016, Embeddr), while others use piecewise linear paths through the MST and extract orderings either by orthogonal projection (Ji and Ji, 2016; Shin et al., 2015, TSCAN,Waterfall) or PQ tree (Trapnell et al., 2014, Monocle). ◮ We find that smooth curves provide discerning power not found in piecewise linear trajectories, while also adding stability over a range of dimensionality reduction and clustering methods. • Differential expression. Regression of gene expression measures on pseudotime, e.g., generalized additive models (GAM) (Ji and Ji, 2016, TSCAN). • Visualization. Two- and three-dimensional plots of cell lineages and pseudotimes, gene-level trajectories, heatmaps for DE genes. 90 / 105

  70. Cell Lineage and Pseudotime Inference • R package slingshot, to be released through the Bioconductor Project: github.com/kstreet13/slingshot . 91 / 105

  71. Software Package slingshot • Modularity. ◮ Integrates easily with a range of normalization, clustering, and dimensionality reduction methods. ◮ get lineages : Given expression measures and cluster labels, use MST to infer lineages. get lineages(X, clus.labels, start.clus = NULL, end.clus = NULL, dist.fun = NULL, omega = Inf, distout = FALSE) . ◮ get curves : Given lineages, infer pseudotimes. get curves(X, clus.labels, lineages, thresh = 1e-04, maxit = 100, stretch = 2, shrink = TRUE) . • Flexibility. Can be used with varying levels of supervision. ◮ Cluster-based approach allows for easy supervision when researchers have prior knowledge of cell classes, while still being able to detect novel branching events. ◮ User-supplied or data-driven selection of root and leaf nodes. 92 / 105

  72. Software Package slingshot • Visualization. ◮ plot tree : MST in 2 and 3D. ◮ plot curves : Lineage curves in 2 and 3D. 93 / 105

  73. Software Package slingshot Application to OE p63 dataset. • Applied to the first three principal components, Slingshot identifies two lineages: The first corresponds to the HBC-to-neurons transition, the second to the HBC-to-sustentacular cells transition. • A first-pass DE analysis, based on a regression of log-count on pseudotime using GAM, suggests that many genes are involved in the differentiation process. • Among the top 100 DE genes for each lineage, only 13 are DE in both, suggesting distinct processes in the neuronal vs. non-neuronal lineages. 94 / 105

  74. Software Package slingshot ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 50 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● PC1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −50 ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −50 0 50 PC2 80 ● ● ● ● ● ● ● ● ● ● ● ● ● 60 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 40 ● ● ● ● ● ● ● ● ● ● ● ● PC2 ● ● 20 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −40 ● ● ● ● ● ● ● −40 0 20 40 60 PC3 Figure 29: slingshot. PCA of gene-level log-counts, colored by clusters, with MST edges used to infer lineages ( get lineages , plot tree ). 95 / 105

  75. Software Package slingshot ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 50 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● PC1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −50 ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −50 0 50 PC2 80 ● ● ● ● ● ● ● ● ● ● ● ● ● 60 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 40 ● ● ● ● ● ● ● ● ● ● ● ● PC2 ● 20 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −40 ● ● ● ● ● ● ● −40 0 20 40 60 PC3 Figure 30: slingshot. PCA of gene-level log-counts, colored by clusters, with smooth curves representing lineages and used to infer pseudotimes ( get curves , plot curves ). 96 / 105

  76. Software Package slingshot Slingshot: DE genes based on GAM, lineage 1 fstl5 ugt2a1 gstm1 log(count+1) log(count+1) log(count+1) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 8 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 8 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 6 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 4 ● ● ● 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 50 150 0 50 150 0 50 150 pseudotime pseudotime pseudotime ncam1 gstm2 rtn1 log(count+1) log(count+1) log(count+1) ● ● ● ● ● ● 8 ● 8 ● ● ● ● ● ● ● ● ● 8 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 4 ● ● ● ● ● ● 4 ● ● ● ● ● ● ● ● ● 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 50 150 0 50 150 0 50 150 pseudotime pseudotime pseudotime aqp3 scn9a gm581 log(count+1) log(count+1) log(count+1) 8 ● ● ● 8 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 50 150 0 50 150 0 50 150 pseudotime pseudotime pseudotime Figure 31: slingshot. Scatterplots of gene-level log-count vs. pseudotime for GAM DE genes in lineage 1 (HBC–Neurons). 97 / 105

  77. Software Package slingshot Slingshot: DE genes based on GAM, lineage 1 mergeClusters Bio mergeClusters gstm2 aqp3 ugt2a1 10 scgb1c1 gpm6a m1 epas1 zfp36 ephx1 ces1d anxa2 m2 aqp4 clu ifitm3 8 gja1 m3 ptn foxn3 rbm47 wls fosb m5 klf4 nr4a1 junb cyr61 6 irf1 m7 bhlhe40 hes1 sik1 ahnak s100a11 m8 dapl1 socs3 hspb1 aox2 notch2 4 lypd2 slc26a7 gsta4 ccnd2 Bio fmo6 cbr2 cyp2f2 gstm1 HBC prdx6 sdc4 2 krt18 atf3 tcf7l2 HBC transition zfp36l1 ezr runx1 ifitm2 INP/GBC sp8 prdx1 rps27 gnb1 0 map1b INP2 rtn1 ell3 ebf2 tubb3 ebf1 iOSN ncam1 fstl5 stmn2 dcx mOSNs cystm1 syt11 flrt1 prune2 5730409k12rik SUS gng13 mgst3 scn9a gm581 spock1 SUS precursor ank2 sept3 trim66 gap43 rtp1 SUS progenitor clgn aplp1 cntn4 snap25 calb2 rd3 stmn3 uchl1 homer2 ttll7 atf5 myt1l elavl3 ebf3 ak1 sult1d1 stmn1 gm11223 tuba1a jakmip1 Figure 32: slingshot. Heatmap of log-counts for GAM DE genes in lineage 1 (HBC–Neurons), cells sorted by pseudotime. 98 / 105

  78. Software Package slingshot Lineages. $lineage1 [1] "8" "3" "7" "5" "2" "1" $lineage2 [1] "8" "3" "7" "4" Biological annotation of clusters. cl b 1 2 3 4 5 7 8 HBC 0 0 26 0 0 1 47 HBC transition 0 0 35 0 0 19 0 INP/GBC 0 0 1 0 41 0 0 INP2 0 34 0 0 0 0 0 iOSN 1 77 1 0 0 0 0 Microvillous 0 0 0 8 0 0 0 mOSNs 33 1 0 0 0 0 0 SUS 0 0 1 61 0 5 0 SUS precursor 0 0 7 2 4 62 0 SUS progenitor 0 0 0 0 6 0 0 99 / 105

  79. Software Package slingshot: Summary Ongoing efforts. • Number of lineages: User-supplied, testing for distinct lineages, merging non-differential lineages. • DE within and between (i.e., bifurcation) lineages. • Visualization. • Performance measures. • OOP with S4 classes and methods. • Shiny app for interactive web interface. 100 / 105

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend