Statistical analysis for scRNAseq data Cathy Maugis-Rabusseau

Statistical analysis for scRNAseq data Cathy Maugis-Rabusseau cathy.maugis@insa-toulouse.fr C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 1 / 52

Plan
  1. Introduction
  2. Feature selection / extraction
  3. Dimension reduction
  4. Single cell clustering
  5. Pseudotime analysis
  6. Differential analysis

scRNA-seq data
  - n cells, G genes: n ≤ G or n ≈ G ⇒ high dimensionality
  - Measures: $x_{ij} \in \mathbb{N}$ = expression of gene j in cell i
  - Technical and biological noise
  - High variability
  - Zero-inflated data ⇒ "sparsity" (≥ 80% of zeros per row, dropouts)
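
As a rough illustration of the sparsity statement above, the fraction of zeros can be measured directly on a count matrix. The sketch below assumes a cells x genes NumPy array; the helper name `zero_fraction` and the simulated Poisson counts are illustrative assumptions, not part of the original slides.

```python
import numpy as np


def zero_fraction(counts, axis=None):
    """Fraction of zero entries, overall (axis=None) or per row/column."""
    return (np.asarray(counts) == 0).mean(axis=axis)


# Hypothetical toy data: 50 cells (rows) x 200 genes (columns).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=0.2, size=(50, 200))

per_cell = zero_fraction(counts, axis=1)  # one value per row (cell)
overall = zero_fraction(counts)           # single value for the whole matrix
print(f"overall sparsity: {overall:.2f}")
```

Taking `axis=0` instead gives the dropout rate per gene, which is the quantity used by dropout-based feature selection.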

Biological questions
  - Are there distinct subpopulations of cells?
  - For each cell type, what are the marker genes?
  - How can the cells be visualized?
  - Are there continuums of differentiation / activation cell states?
  - ...
  (Rostom et al., FEBS 2017)

Statistical analysis
  - Clustering of cells
  - Variable (gene) selection in learning or differential analysis (hypothesis testing)
  - Dimension reduction
  - Network inference
  - ...
  (Rostom et al., FEBS 2017)

Some bio-info-stat. pipelines/workflows
  - [Juliá et al., 2015] Sincell (Bioconductor/R package): https://bioconductor.org/packages/release/bioc/html/sincell.html
  - [Poirion et al., 2016]
  - [Wolf et al., 2018] SCANPY: https://github.com/theislab/Scanpy
  - [Guo et al., 2015] SINCERA: https://github.com/xu-lab/SINCERA and https://research.cchmc.org/pbge/sincera.html
      Fig 1. Schematic Workflow. The analytic pipeline consists of three main components: pre-processing, cell type identification, and cell type specific gene signature and driving force identification.
  - [Lun et al., 2016] Workflow package simpleSingleCell: https://bioconductor.org/packages/release/workflows/html/simpleSingleCell.html
  - [Satija et al., 2015] SEURAT: https://satijalab.org/seurat/
  - ...

Feature (gene) extraction
  Simple filtering criteria: see e.g. [Lun et al., 2016], [Soneson and Robinson, 2018]
  - Filtering of lowly expressed genes:
      genes expressed in < τ % of cells
      genes with a mean expression < τ
  Dropout-based feature selection
  - M3Drop [Andrews and Hemberg, 2018]
  - Based on the Michaelis-Menten function
      $P_{\text{dropout}} = 1 - \frac{S}{K_M + S}$
    where S = mean expression and $P_{\text{dropout}}$ = dropout rate
  - MLE to obtain the global $K_M$ across all genes
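
The two filtering criteria and the Michaelis-Menten relationship can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names and default thresholds are hypothetical, and the global $K_M$ is obtained here by a least-squares grid search, which is a simple stand-in for the maximum likelihood estimation mentioned on the slide and not the M3Drop implementation.

```python
import numpy as np


def filter_lowly_expressed(counts, min_cell_fraction=0.05, min_mean=0.1):
    """Boolean mask of genes (columns) passing both filtering criteria.

    Thresholds are illustrative defaults, not values from the slides.
    """
    counts = np.asarray(counts)
    detected_fraction = (counts > 0).mean(axis=0)  # share of cells expressing the gene
    mean_expression = counts.mean(axis=0)
    return (detected_fraction >= min_cell_fraction) & (mean_expression >= min_mean)


def michaelis_menten_dropout(mean_expression, k_m):
    """P_dropout = 1 - S / (K_M + S)."""
    s = np.asarray(mean_expression, dtype=float)
    return 1.0 - s / (k_m + s)


def fit_k_m(mean_expression, dropout_rate, grid=None):
    """Global K_M by least-squares grid search (assumption: replaces the MLE)."""
    if grid is None:
        grid = np.logspace(-3, 3, 601)
    dropout_rate = np.asarray(dropout_rate, dtype=float)
    errors = [
        np.sum((dropout_rate - michaelis_menten_dropout(mean_expression, k)) ** 2)
        for k in grid
    ]
    return float(grid[int(np.argmin(errors))])
```

In the dropout-based approach, genes whose observed dropout rate lies well above the fitted curve for their mean expression are the candidates of interest.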

Highly Variable Genes (HVG)
  - [Brennecke et al., 2013]
      Fits a quadratic model (gamma generalized linear model) to the relationship between mean expression and the squared coefficient of variation ($CV^2$)
      A $\chi^2$ test is used to find genes significantly above the curve
      Implemented in the M3Drop package
  - [Kim et al., 2015]
      Uses spike-ins to estimate parameters related to technical variance and estimates gene-specific biological variability by subtracting the estimated technical variance from the total variance
  - [Vallejos et al., 2015] BASiCS = Bayesian Analysis of Single-Cell Sequencing Data
      Models spike-ins and endogenous genes simultaneously as two Poisson-Gamma hierarchical models
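
The mean versus $CV^2$ idea can be illustrated with a simplified sketch. Assumptions, all of them mine: the trend is taken as $CV^2 \approx a_1 / \mu + a_0$ and fitted by ordinary least squares rather than by a gamma GLM, genes are ranked by the ratio of observed to fitted $CV^2$ instead of a $\chi^2$ test, and no spike-ins are used. The function names are hypothetical.

```python
import numpy as np


def mean_and_cv2(counts):
    """Per-gene mean and squared coefficient of variation (cells x genes input)."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=0)
    if np.any(mean <= 0):
        raise ValueError("remove genes with zero mean before computing CV^2")
    cv2 = counts.var(axis=0, ddof=1) / mean**2
    return mean, cv2


def fit_cv2_trend(mean, cv2):
    """Least-squares fit of CV^2 = a1 / mean + a0 (stand-in for the gamma GLM)."""
    design = np.column_stack([1.0 / mean, np.ones_like(mean)])
    (a1, a0), *_ = np.linalg.lstsq(design, cv2, rcond=None)
    return a1, a0


def rank_hvg(counts, top_k):
    """Indices of the top_k genes with the largest observed / fitted CV^2 ratio."""
    mean, cv2 = mean_and_cv2(counts)
    a1, a0 = fit_cv2_trend(mean, cv2)
    ratio = cv2 / (a1 / mean + a0)
    return np.argsort(ratio)[::-1][:top_k]
```

A gene sitting far above the fitted curve varies more than expected for its mean expression, which is the property the slide's methods test formally.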

Highly correlated genes
  Gene-gene correlation:
  - Calculate the gene-gene correlation matrix $\rho = (\rho_{ij})_{i,j = 1,\dots,G}$
  - Evaluate the correlation magnitude for each gene: $\tilde{\rho}_i = \max_{j \neq i} |\rho_{ij}|$
  - Take the top few thousand genes having the highest correlation magnitude
  PCA loadings:
  - Select the genes with high PCA loadings
  ...
  Not adapted to batch effects
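
The gene-gene correlation criterion can be sketched as follows, assuming a cells x genes expression array. The function name `top_correlated_genes` is hypothetical, and the diagonal is zeroed so that each gene's trivial correlation with itself does not dominate the maximum.

```python
import numpy as np


def top_correlated_genes(expr, top_k):
    """Rank genes by their largest absolute correlation with any other gene.

    expr is a cells x genes array. Constant genes yield undefined (NaN)
    correlations, which are treated as 0 here; filtering them beforehand
    is cleaner.
    """
    rho = np.corrcoef(np.asarray(expr, dtype=float), rowvar=False)  # G x G matrix
    rho = np.nan_to_num(rho)
    np.fill_diagonal(rho, 0.0)               # ignore self-correlation
    magnitude = np.abs(rho).max(axis=1)      # correlation magnitude per gene
    order = np.argsort(magnitude)[::-1]      # highest magnitude first
    return order[:top_k], magnitude
```

For a few thousand genes the full G x G matrix fits in memory easily; for much larger G the matrix would have to be computed in blocks.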

Objectives
  - Minimize the curse of dimensionality
  - Allow visualization
  - Reduce computational time
  - ...
  But be careful with the interpretation afterwards!

Principal component analysis (PCA)
  - Diagonalization of the covariance (or correlation) matrix
  - Linear transformations: meta-variables = linear combinations of the genes
  - Captures the dimensions with the highest variance
  - Fast deterministic procedure
  - Sparse PCA: PCA + gene selection
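
The "diagonalization of the covariance matrix" view of PCA can be written out directly. This is a minimal sketch with a hypothetical function name `pca`, assuming a cells x genes array with more cells than would make the G x G covariance matrix impractical.

```python
import numpy as np


def pca(expr, n_components=2):
    """PCA by eigendecomposition of the gene-gene covariance matrix.

    Returns the cell scores, the gene loadings (one column per component)
    and the variance captured by each component.
    """
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=0)
    cov = np.cov(centered, rowvar=False)      # G x G covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    loadings = eigvecs[:, order]              # linear combinations of the genes
    scores = centered @ loadings              # meta-variables for each cell
    return scores, loadings, eigvals[order]
```

When G is large, a truncated SVD of the centered matrix gives the same components without forming the covariance matrix. The columns of `loadings` are also what the "PCA loadings" gene selection criterion would inspect.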

Extensions of PCA for scRNAseq data
  - [Pierson and Yau, 2015]: ZIFA (Zero Inflated Factor Analysis)
      Deals with the large number of zero values in scRNA-seq data
      Relationship between the dropout rate $p_0$ and the mean level of non-zero expression (log read count) $\mu$: $p_0 = \exp(-\lambda \mu^2)$
      ZIFA adopts a latent variable model and uses an EM algorithm for the parameter estimation
      Python software: https://github.com/epierson9/ZIFA
  - [Risso et al., 2017]: ZINB-WaVE = Zero-Inflated Negative Binomial Model for RNA-Seq Data
      A method similar to PCA, based on a zero-inflated negative binomial model instead of a Gaussian model
      https://bioconductor.org/packages/release/bioc/html/zinbwave.html
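
To make the ZIFA dropout relationship $p_0 = \exp(-\lambda \mu^2)$ concrete, the sketch below evaluates it and recovers $\lambda$ from observed (mean, dropout rate) pairs. The estimator is an assumption of this example: since $-\log p_0 = \lambda \mu^2$, it regresses $-\log p_0$ on $\mu^2$ through the origin, which is not the EM algorithm used by ZIFA itself. Function names are hypothetical.

```python
import numpy as np


def dropout_probability(mu, lam):
    """p0 = exp(-lam * mu^2): higher mean log-expression means fewer dropouts."""
    mu = np.asarray(mu, dtype=float)
    return np.exp(-lam * mu**2)


def estimate_lambda(mu, p0):
    """Least-squares slope of -log(p0) against mu^2, through the origin.

    Requires 0 < p0 <= 1 for every gene (a dropout rate of exactly 0
    would give an infinite logarithm).
    """
    x = np.asarray(mu, dtype=float) ** 2
    y = -np.log(np.asarray(p0, dtype=float))
    return float(np.sum(x * y) / np.sum(x * x))


mu = np.array([0.5, 1.0, 2.0, 3.0])      # hypothetical mean log read counts
p0 = dropout_probability(mu, lam=0.3)     # dropout rate decays quickly with mu
```

The single decay parameter $\lambda$ is what ties the zero-inflation to the expression level in this model.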

Extensions of PCA
  [Lin et al., 2017] CIDR (https://github.com/VCCRI/CIDR)
  1. Preliminary: $\log(x_{ij} + 1)$
  2. Identification of dropout candidates (CIDR finds a sample-dependent threshold that separates the zero peak from the rest of the expression distribution for each cell)
  3. Estimation of the relationship between dropout rate and gene expression levels (non-linear least-squares regression to fit a decreasing logistic function to the data)
  4. Calculation of the dissimilarity between the imputed gene expression profiles for every pair of single cells
  5. PCoA using the CIDR dissimilarity matrix
  6. Clustering (hierarchical agglomerative clustering) using the first few principal coordinates
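
The PCoA step can be sketched independently of CIDR. The example below implements classical principal coordinates analysis on an arbitrary dissimilarity matrix; as a loudly stated assumption, it uses a plain Euclidean dissimilarity on log-transformed counts in place of CIDR's imputation-based dissimilarity, and it skips dropout identification and the final clustering. Function names are hypothetical.

```python
import numpy as np


def euclidean_dissimilarity(profiles):
    """Pairwise Euclidean distances between rows (stand-in for the CIDR dissimilarity)."""
    profiles = np.asarray(profiles, dtype=float)
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))


def pcoa(dissimilarity, n_coords=2):
    """Classical PCoA: double-center the squared dissimilarities, then eigendecompose."""
    d2 = np.asarray(dissimilarity, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)            # ascending order
    order = np.argsort(eigvals)[::-1][:n_coords]
    top_vals = np.clip(eigvals[order], 0.0, None)      # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top_vals)       # principal coordinates


# Hypothetical usage on a toy count matrix (cells x genes).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(10, 30))
coords = pcoa(euclidean_dissimilarity(np.log1p(counts)), n_coords=2)
```

The rows of `coords` are what a hierarchical clustering would take as input in the last step.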

[Figure: example of t-SNE plot]

t-SNE
  - Reduces a dataset to 2 dimensions
  - Non-linear dimension reduction technique
  - Aims to preserve neighborhoods
  - "Don't interpret distances in t-SNE plots": https://constantamateur.github.io/2018-01-02-tSNE/
  INPUT: $X = (x_1, \dots, x_n)$ with $x_i \in \mathbb{R}^G$ (high-dimensional data)
  OUTPUT: $Y = (y_1, \dots, y_n)$ with $y_i \in \mathbb{R}^2$ (low-dimensional data)
