Introduction to Single Cell Transcriptomic Analysis



slide-1
SLIDE 1

Introduction to Single Cell Transcriptomic Analysis

Acknowledgments: Brian Haas, Karthik Shekhar, Timothy Tickle, Caroline Porter

Ayshwarya Subramanian | subraman@broadinstitute.org
In-depth-NGS-Data-Analysis-Course | 2018-09-27

slide-2
SLIDE 2


Goals for today

  • Overview of single-cell RNA-seq data analysis
  • What is in a count matrix?
  • Preprocessing data: quality control (QC) and filtering
  • Dimensionality reduction & Clustering
  • Biology: inferring cell types and further
slide-3
SLIDE 3


Why should we study gene expression at the resolution of single cells?

slide-4
SLIDE 4

Incredible diversity in cell types, states, & interactions across human tissues

Panels: Skin epithelium, Brain meninges, Blood vessels, Small intestine, Liver cirrhosis, Breast cancer

http://www.cell.com/pictureshow/skin | https://library.med.utah.edu/WebPath/webpath.html


slide-5
SLIDE 5


Bulk sequencing measures average expression profiles from tissue samples

  • Trapnell. Genome Research 2016
slide-6
SLIDE 6

Single-cell and bulk gene expression distributions are very different

Panels: Bulk, Single cell

Expression Matrix


Slide courtesy of Karthik Shekhar

slide-7
SLIDE 7


Single-cell RNA-seq analysis pipeline: Analyzing the expression data

  • 1. Expression Matrix (GENES x CELLS)
  • 2. Filter Cells / Quality Control
  • 3. Normalization
  • 1. Identify Variable Genes
  • 2. Dimensionality Reduction
  • 3. Exploring Known Marker Genes
  • 4. Clustering
  • 5. Differentially Expressed Genes
  • 6. Assigning Cell Type
  • 7. Functional Annotation

Stages: Pre-Processing, Clustering, Biology

slide-8
SLIDE 8


Slide courtesy of Karthik Shekhar

slide-9
SLIDE 9


Slide courtesy of Karthik Shekhar

slide-10
SLIDE 10

Slide courtesy of Karthik Shekhar


slide-11
SLIDE 11

Count Matrix

Single-cell RNA-seq analysis pipeline: Generating the count matrix


slide-12
SLIDE 12

Slide courtesy of Karthik Shekhar


slide-13
SLIDE 13

Slide courtesy of Karthik Shekhar


slide-14
SLIDE 14

Slide courtesy of Karthik Shekhar


slide-15
SLIDE 15

Counts Matrix

slide-16
SLIDE 16

Genes Have Different Distributions

slide-17
SLIDE 17

Genes Have Different Distributions

slide-18
SLIDE 18

Genes Have Different Distributions

slide-19
SLIDE 19

Genes Have Different Distributions

slide-20
SLIDE 20

Genes Have Different Distributions

slide-21
SLIDE 21

Genes have different distributions

Panels: Housekeeping, Technical, Rare populations, Different populations

slide-22
SLIDE 22

Underlying Biology

  • Zero inflation.
  • Drop-out event during reverse-transcription.
  • Genes with more expression have fewer zeros.
  • Complexity varies.
  • Transcription stochasticity.
  • Transcription bursting.
  • Coordinated transcription of multigene networks.
  • Over-dispersed counts.
  • Higher resolution.
  • More sources of signal.

slide-23
SLIDE 23

Cell Identity is a Mixture of Multiple Factors

slide-24
SLIDE 24

Expression has Many Sources

slide-25
SLIDE 25

Technical vs Intrinsic Noise

slide-26
SLIDE 26

Slide courtesy of Karthik Shekhar


slide-27
SLIDE 27

Technical Conceptual Challenges

slide-28
SLIDE 28

What is Study Confounding?

slide-29
SLIDE 29


Goals for today

  • Overview of single-cell RNA-seq data analysis
  • What is in a count matrix?
  • Preprocessing data: quality control (QC) and filtering
  • Dimensionality reduction & Clustering
  • Biology: inferring cell types and further
slide-30
SLIDE 30

Count Preparation is Different Depending on Assays

slide-31
SLIDE 31

Filtering Genes: Averages are Less Useful

slide-32
SLIDE 32

Filtering Genes: Using Prevalence

slide-33
SLIDE 33

Filtering Genes: Using Prevalence

slide-34
SLIDE 34

Representing Genes Throughout Cells

A gene (GAD2) across many groups of cells, shown as a box plot and as a violin plot. Habib et al. 2016

slide-35
SLIDE 35

What is Metadata?

Other information that describes your measurements.

  • Patient information.
  • Lifestyle (smoking), Patient Biology (age), Comorbidity
  • Study information.
  • Treatment, Cage, Sequencing Site, Sequencing Date
  • Sequence QC on cells.
  • Useful in filtering and stratifying.
slide-36
SLIDE 36

Filtering Cells: Removing Outlier Cells

  • Bulk RNA-Seq studies often do not remove outlier samples

  • scRNA-Seq often removes “failed libraries”.
  • Outlier cells are not just measured by complexity
  • Percent Reads Mapping
  • Percent Mitochondrial Reads
  • Presence of marker genes
  • Intergenic/ exonic rate
  • 5' or 3' bias
  • other metadata …
  • Useful Tools
  • Picard Tools and RNASeQC
slide-37
SLIDE 37

Filtering Cells: Complexity

Complexity: the simplest definition is the number of genes expressed at any level in a cell.

  • Filter at both ends.
  • Lower: Failed libraries?
  • Higher: Doublets?
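The definition above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy count matrix and the `low` / `high` cutoffs are made up for the example and are not recommended thresholds.

```python
# Toy count matrix: rows are genes, columns are cells (GENES x CELLS).
counts = [
    [5, 0, 2, 9],
    [0, 0, 1, 7],
    [3, 0, 0, 8],
    [0, 1, 4, 6],
]

def complexity_per_cell(matrix):
    """Number of genes with a nonzero count in each cell (column)."""
    n_cells = len(matrix[0])
    return [sum(1 for row in matrix if row[c] > 0) for c in range(n_cells)]

def keep_cells(complexity, low, high):
    """Indices of cells whose complexity lies within [low, high]."""
    return [i for i, c in enumerate(complexity) if low <= c <= high]

complexity = complexity_per_cell(counts)

# Hypothetical cutoffs: drop the lowest cell (failed library?) and the
# highest cell (doublet?). Real cutoffs depend on the dataset.
kept = keep_cells(complexity, low=2, high=3)
print(complexity, kept)
```

In practice the cutoffs are usually chosen after looking at the distribution of complexity across all cells rather than fixed in advance.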

slide-38
SLIDE 38

Checks and Balances in Analysis

slide-39
SLIDE 39


Some single-cell RNA-seq data challenges to remember

  • Drop out: data has an excessive number of zeros due to limiting mRNA

Zero expression doesn’t mean the gene isn’t on

slide-40
SLIDE 40


Some single-cell RNA-seq data challenges to remember

  • Confounding: quality control metrics have the potential to be confounded with biology

Batch effects can be removed from the data if the batch effect isn’t completely confounded with biology

slide-41
SLIDE 41


Filtering and quality control: Number of genes per cell

Complexity = number of genes detected in a cell

Plot: cell complexity, genes per cell (ordered)

slide-42
SLIDE 42


Filtering and quality control: Mitochondrial gene expression

The percent of reads in a cell coming from mitochondrial genes is a good measure of cell quality: high mitochondrial gene expression indicates stressed cells (e.g., from damage during tissue dissociation)
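A minimal sketch of this metric in plain Python follows. It assumes human gene symbols, where mitochondrial genes carry the "MT-" prefix. The toy counts and the 20% cutoff are illustrative assumptions, not values from the slides.

```python
# Toy count matrix (GENES x CELLS) with matching gene names.
gene_names = ["MT-CO1", "MT-ND1", "ACTB", "GAPDH"]
counts = [
    [10, 40],  # MT-CO1
    [5, 20],   # MT-ND1
    [50, 30],  # ACTB
    [35, 10],  # GAPDH
]

def percent_mito(matrix, names, prefix="MT-"):
    """Percent of each cell's counts that come from mitochondrial genes."""
    n_cells = len(matrix[0])
    result = []
    for c in range(n_cells):
        total = sum(row[c] for row in matrix)
        mito = sum(row[c] for row, name in zip(matrix, names)
                   if name.startswith(prefix))
        result.append(100.0 * mito / total if total else 0.0)
    return result

pct = percent_mito(counts, gene_names)

# Hypothetical cutoff: flag cells with more than 20% mitochondrial counts.
flagged = [i for i, p in enumerate(pct) if p > 20.0]
print(pct, flagged)
```

Because this metric can be confounded with biology (some cell types are naturally rich in mitochondria), a flagged cell is a candidate for removal, not an automatic one.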

slide-43
SLIDE 43


Slide courtesy of Karthik Shekhar

slide-44
SLIDE 44


Filtering and quality control: Doublets - what are they?

Slide courtesy of Karthik Shekhar

slide-45
SLIDE 45


Filtering and quality control: Doublets - resources for identifying them

The simplest way to filter for doublets is to choose an upper threshold on the number of genes or counts per cell in your data: a doublet (two cells captured and profiled as one) should in theory have many more genes and counts than other cells.

A more sophisticated way to remove doublets is to use a package for identifying them, such as:

  • https://github.com/JonathanShor/DoubletDetection
  • https://github.com/AllonKleinLab/scrublet
  • https://www.biorxiv.org/content/early/2018/06/20/352484

slide-46
SLIDE 46


Filtering and quality control: Cells vs. reads/cell

Slide courtesy of Karthik Shekhar

slide-47
SLIDE 47


Single-cell RNA-seq analysis pipeline: Analyzing the expression data

  • Expression Matrix (GENES x CELLS)
  • Filter Cells / Quality Control
  • Normalization
  • 1. Identify Variable Genes
  • 2. Dimensionality Reduction
  • 3. Exploring Known Marker Genes
  • 4. Clustering
  • 5. Differentially Expressed Genes
  • 6. Assigning Cell Type
  • 7. Functional Annotation

Stages: Pre-Processing, Clustering, Biology

slide-48
SLIDE 48


Data normalization and scaling

Typically, we:

  • Normalize gene expression for each cell by total expression and multiply by a scale factor. The objective is to obtain relative gene expression, eliminating technical factors that affect the number of molecules per cell. As a caution, biological factors can affect this variation too.
  • Log transform the resulting normalized expression. This dampens the influence of extreme values in the data.
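The two steps can be sketched as follows. This is an illustration, not the code of any particular package: the scale factor of 10,000 and the use of the natural log of (1 + x) are assumptions chosen for the example.

```python
import math

def normalize_log(matrix, scale_factor=10_000):
    """Divide each cell (column) by its total, scale, then apply log(1 + x)."""
    n_cells = len(matrix[0])
    totals = [sum(row[c] for row in matrix) for c in range(n_cells)]
    return [
        [math.log1p(row[c] / totals[c] * scale_factor) if totals[c] else 0.0
         for c in range(n_cells)]
        for row in matrix
    ]

# Two cells with the same relative expression but a 10x difference in depth.
counts = [
    [1, 10],
    [3, 30],
]
normalized = normalize_log(counts)
print(normalized)
```

After normalization both cells have identical values, which is the point of dividing by total expression: the difference in sequencing depth no longer shows up as a difference in expression.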

slide-49
SLIDE 49

Seurat: R scRNA-Seq Analysis Package

https://github.com/satijalab/seurat

slide-50
SLIDE 50

Prepping Counts for Seurat

3 prime

  • Expected by Seurat.
  • Counts collapsed with UMIs.
  • Log2 transform (in Seurat).
  • Account for sequencing depth (in Seurat).

Full Transcript Sequencing

  • Can be used in Seurat.
  • TPM +1 transformed counts.
  • Log2 transform (in Seurat).
  • Sequencing depth is already accounted for.
slide-51
SLIDE 51

What is a Sparse Matrix?

  • Sparse Matrix
  • A matrix where most of the elements are 0.
  • Dense Matrix
  • A matrix where most elements are not 0.
  • Many ways to efficiently represent a sparse matrix in memory.
  • Here, the underlying data structure is a coordinate list.
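As a small illustration of the idea, a coordinate list stores only the nonzero entries as (row, column, value) triples. The helper names below are hypothetical and written in plain Python; real analyses would use a sparse matrix library instead.

```python
dense = [
    [0, 0, 3],
    [0, 0, 0],
    [7, 0, 0],
]

def to_coo(matrix):
    """Keep only the nonzero entries as (row, column, value) triples."""
    return [(r, c, v)
            for r, row in enumerate(matrix)
            for c, v in enumerate(row)
            if v != 0]

def to_dense(triples, n_rows, n_cols):
    """Rebuild the full 2D array from the coordinate list."""
    out = [[0] * n_cols for _ in range(n_rows)]
    for r, c, v in triples:
        out[r][c] = v
    return out

coo = to_coo(dense)
print(coo)  # 2 stored entries instead of 9
```

The saving grows with sparsity: a count matrix that is mostly zeros needs storage proportional to its nonzero entries, not to genes times cells.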

slide-52
SLIDE 52

2D Array vs a Coordinate List

slide-53
SLIDE 53


Single-cell RNA-seq analysis pipeline: Analyzing the expression data

  • Expression Matrix (GENES x CELLS)
  • Filter Cells / Quality Control
  • Normalization
  • 1. Identify Variable Genes
  • 2. Dimensionality Reduction
  • 3. Exploring Known Marker Genes
  • 4. Clustering
  • 5. Differentially Expressed Genes
  • 6. Assigning Cell Type
  • 7. Functional Annotation

Stages: Pre-Processing, Clustering, Biology

slide-54
SLIDE 54
  • 1. Making Sense of Variation
slide-55
SLIDE 55

Slide courtesy of Karthik Shekhar


slide-56
SLIDE 56

Variable Genes in Seurat

  • Calculate mean expression.
  • Calculate dispersion (standard deviation).
  • Calculate a z-score for the dispersions within each bin of mean expression.
  • This stratifies by, and controls for, the relationship between variability and mean expression.

Default Standard Deviation
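The binning idea can be sketched in plain Python. This is a simplified illustration and not Seurat's implementation: it takes the standard deviation as the dispersion (as on the slide), uses equal-width bins of mean expression, and the gene names and values are made up.

```python
import statistics

def dispersion_zscores(expr, n_bins=2):
    """expr maps gene -> expression values across cells.

    Returns a z-score of each gene's dispersion relative to the other
    genes in the same bin of mean expression.
    """
    means = {g: statistics.fmean(v) for g, v in expr.items()}
    disp = {g: statistics.pstdev(v) for g, v in expr.items()}

    # Assign genes to equal-width bins of mean expression.
    lo, hi = min(means.values()), max(means.values())
    width = (hi - lo) / n_bins or 1.0
    bins = {}
    for g, m in means.items():
        b = min(int((m - lo) / width), n_bins - 1)
        bins.setdefault(b, []).append(g)

    # z-score the dispersion within each bin.
    z = {}
    for genes in bins.values():
        values = [disp[g] for g in genes]
        mu = statistics.fmean(values)
        sd = statistics.pstdev(values)
        for g in genes:
            z[g] = (disp[g] - mu) / sd if sd > 0 else 0.0
    return z

expr = {
    "low_flat": [1, 1, 1, 1],
    "low_var": [0, 2, 0, 2],
    "high_flat": [10, 10, 10, 10],
    "high_var": [5, 15, 5, 15],
}
z = dispersion_zscores(expr)
print(z)
```

Both variable genes receive the same z-score even though the highly expressed one has a much larger raw standard deviation, which is what stratifying by mean expression is meant to achieve.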

slide-57
SLIDE 57


Determining cell type, state, and/or function:

  • 2. Dimensionality reduction

Cells are in 20,000 dimensional space

  • many genes are lowly detected / noisy measurements
  • genes are not independent of one another! rather they operate in coregulatory modules

Principal component analysis (PCA) moves us from describing cells with 20,000 gene expression values to 10-100 principal component scores

** Note that the first principal component often captures technical variability

slide-58
SLIDE 58

Slide courtesy of Karthik Shekhar


slide-59
SLIDE 59


Determining cell type, state, and/or function:

  • 2. Dimensionality reduction
  • PCA is a dimensionality reduction method that transforms a set of observations into a set of linearly uncorrelated variables called principal components
  • The first principal component captures the most variance, and each subsequent component captures as much of the remaining variance as possible while being orthogonal to the other components
slide-60
SLIDE 60


Determining cell type, state, and/or function:

  • 2. Dimensionality reduction

From: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

  • PCA is a dimensionality reduction method that transforms a set of observations into a set of linearly uncorrelated variables called principal components
  • The first principal component captures the most variance, and each subsequent component captures as much of the remaining variance as possible while being orthogonal to the other components

Identifying maximal orthogonal sources of variation

slide-61
SLIDE 61


Determining cell type, state, and/or function:

  • 2. Dimensionality reduction

PCA of single cell data

  • PC1 separates the red cells from the pink, orange, and green cells
  • PC2 separates the green cells from the red, pink, and orange cells
slide-62
SLIDE 62


Determining cell type, state, and/or function:

  • 2. Dimensionality reduction

PCA of single cell data

  • PC3 further splits off the orange cells
slide-63
SLIDE 63


Determining cell type, state, and/or function:

  • 2. Dimensionality reduction

PCA of single cell data

  • tSNE is nonlinear dimensionality reduction
  • tSNE collapses the visualization to 2D

tSNE: t-distributed Stochastic Neighbor Embedding

slide-64
SLIDE 64

Dimensionality Reduction

  • Start with many measurements (high dimensional).
  • Want to reduce to few features (lower-dimensional space).
  • One way is to extract features based on capturing groups of variance.
  • Another could be to preferentially select some of the current features.
  • We have already done this.
  • We need this to plot the cells in 2D (or ordinate them)

  • In scRNA-Seq PC1 may be complexity or technical.
slide-65
SLIDE 65

PCA: Overview

  • Eigenvectors of covariance matrix.
  • Find orthogonal groups of variance.
  • Given from most to least variance.
  • Components of variation.
  • Linear combinations explaining the variance.
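To make "eigenvectors of the covariance matrix" concrete, the sketch below finds the first principal component of a tiny two-feature dataset by power iteration. Power iteration is a stand-in chosen here because it fits in plain Python. It is not how the analysis packages compute PCA, and the data are made up.

```python
import math

def covariance_matrix(rows):
    """Covariance between features (columns) over observations (rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, n_iter=200):
    """Leading eigenvector and eigenvalue of cov via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue

# Four cells measured on two strongly correlated genes.
cells = [[1.0, 1.2], [2.0, 1.9], [3.0, 3.1], [4.0, 3.8]]
cov = covariance_matrix(cells)
pc1, eigenvalue = first_component(cov)

# Fraction of total variance captured by the first component.
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print(pc1, explained)
```

Because the two genes move together, a single linear combination of them captures nearly all the variance, which is why a few components can summarize many correlated genes.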

slide-66
SLIDE 66

PCA: in Practice

Things to be aware of:

  • Features with larger magnitudes will dominate.
  • Zero center and divide by SD (standardize).
  • Can be affected by outliers.
  • Data is often first filtered to remove noise.
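Standardizing a gene can be sketched as below. The two toy genes differ 1000-fold in magnitude but follow the same pattern across cells. The values are made up for illustration.

```python
import statistics

def standardize(values):
    """Zero center and divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

gene_a = [1000.0, 2000.0, 3000.0]
gene_b = [1.0, 2.0, 3.0]

scaled_a = standardize(gene_a)
scaled_b = standardize(gene_b)
print(scaled_a, scaled_b)
```

After scaling, both genes contribute equally to the covariance matrix, so the gene with the larger raw magnitude no longer dominates the components.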
slide-67
SLIDE 67

PCs

Notice how lower PCs look more and more “spherical” - this loss of structure indicates that the variation captured by these PCs mostly reflects noise.

slide-68
SLIDE 68

How Many Components Should We Use?

Elbow Plot (Scree Plot)

slide-69
SLIDE 69
  • 3. Visualization

Slide adapted from Karthik Shekhar


slide-70
SLIDE 70

t-SNE: Collapsing the Visualization to 2D

slide-71
SLIDE 71

t-SNE: Nonlinear Dimensionality Reduction

slide-72
SLIDE 72

t-SNE: How it Works

slide-73
SLIDE 73

PCA and t-SNE Together

  • Often t-SNE is performed on PCA components
  • Liberal number of components.
  • Removes mild signal (assumption of noise).
  • Faster, on less data but, hopefully, the same signal.

slide-74
SLIDE 74

Plotting Metadata on Ordinations

Panels: Metadata, Gene Expression

slide-75
SLIDE 75

Caution When Interpreting t-SNE

  • Nonlinear.
  • Optimized for local distances.
  • Big clusters can just mean more cells.

slide-76
SLIDE 76

Learn More About t-SNE

  • Awesome Blog on t-SNE parameterization
  • http://distill.pub/2016/misread-tsne
  • Publication
  • https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf
  • Nice YouTube Video
  • https://www.youtube.com/watch?v=RJVL80Gg3lA
  • Code
  • https://lvdmaaten.github.io/tsne/
  • Interactive Tensorflow
  • http://projector.tensorflow.org/
slide-77
SLIDE 77
  • 4. Clustering cells to identify cell-types

Andrews TS and Hemberg M. Mol Aspects Med. 2018


slide-78
SLIDE 78

Defining Clusters Through Graphs

slide-79
SLIDE 79

Local Moving Heuristic

slide-80
SLIDE 80


Shekhar et al. Cell 2016; Tirosh and Izar et al. Science 2016

slide-81
SLIDE 81


Determining cell type, state, and/or function:

  • 3. Visualization

A great tSNE resource! https://distill.pub/2016/misread-tsne/

slide-82
SLIDE 82


Single-cell RNA-seq analysis pipeline: Analyzing the expression data

  • 1. Expression Matrix (GENES x CELLS)
  • 2. Filter Cells / Quality Control
  • 3. Normalization
  • 1. Identify Variable Genes
  • 2. Dimensionality Reduction
  • 3. Exploring Known Marker Genes
  • 4. Clustering
  • 5. Differentially Expressed Genes
  • 6. Assigning Cell Type
  • 7. Functional Annotation

Stages: Pre-Processing, Clustering, Biology

slide-83
SLIDE 83
  • 5. Assigning cell identity & comparing across conditions: Differential Expression Analysis

Soneson and Robinson. Nat Methods 2018; Haber, Moshe and Rogel et al. Nature 2017

slide-84
SLIDE 84


Determining cell type, state, and/or function:

  • 5. Identifying differentially expressed genes

Panels: Bulk, Single cell

slide-85
SLIDE 85

Differential Expression

slide-86
SLIDE 86

Single Cell Differential Expression (SCDE)

slide-87
SLIDE 87

MAST

  • Uses hurdle model
  • Two-part generalized linear model to address both the rate of expression (prevalence) and the expression level.
  • GLM means covariates can be used to control for unwanted signal.
  • CDR: Cellular detection rate
  • Cellular complexity
  • Values below a threshold are 0

Additionally introduces a GSEA method

https://github.com/RGLab/MAST

slide-88
SLIDE 88

MAST: Hurdle Models

slide-89
SLIDE 89

Seurat: Differential Expression

  • Default is one cluster against all other cells (many tests).
  • Can specify ident.2 to test between two clusters.
  • Adding speed by excluding tests.
  • Min.pct - controls for sparsity
  • Min percentage in a group
  • Thresh.test - must have this difference in averages.
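The idea of excluding tests up front can be sketched as a simple pre-filter. This is not Seurat's code: the function name `passes_prefilter` and the cutoff values are hypothetical, chosen only to show how a minimum detection rate and a minimum difference in averages skip genes before any statistical test is run.

```python
def passes_prefilter(group1, group2, min_pct=0.1, min_diff=0.25):
    """Decide whether a gene is worth testing between two groups of cells.

    group1, group2: expression values of one gene in each group.
    min_pct: gene must be detected in at least this fraction of cells
             in at least one group (guards against sparsity).
    min_diff: the group averages must differ by at least this much.
    """
    pct1 = sum(1 for v in group1 if v > 0) / len(group1)
    pct2 = sum(1 for v in group2 if v > 0) / len(group2)
    mean1 = sum(group1) / len(group1)
    mean2 = sum(group2) / len(group2)
    return max(pct1, pct2) >= min_pct and abs(mean1 - mean2) >= min_diff

# Detected in only 1 of 20 cells: too sparse to test.
too_sparse = passes_prefilter([0.0] * 19 + [5.0], [0.0] * 20)
# Widely detected, but the averages barely differ.
too_similar = passes_prefilter([1.0] * 10, [1.1] * 10)
# Widely detected in one group with a clear difference in averages.
candidate = passes_prefilter([2.0] * 10, [0.0] * 10)
print(too_sparse, too_similar, candidate)
```

Only genes that pass both checks go on to the chosen test, which cuts the number of tests and the run time.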

slide-90
SLIDE 90

Seurat: Many Choices of DE

Bimod

  • Tests differences in mean and proportions.

Roc

  • Uses AUC like definition of separation.

T

  • Student's T-test.

Tobit

  • Tobit regression on smoothed data.

MAST

  • Hurdle model for zero inflated data

….

slide-91
SLIDE 91

Shekhar et al. Cell 2016; Park and Shreshtha et al. Science 2018

  • 6. Assigning cell identity: Known marker genes
slide-92
SLIDE 92


Determining cell type, state, and/or function: Exploring expression of marker genes

slide-93
SLIDE 93


Determining cell type, state, and/or function:

  • 6. Assigning cell type
slide-94
SLIDE 94


Visualizing genes of interest Dot plots, violin plots, feature plots

Size of circle

  • Gene prevalence in cluster

Color of circle

  • More red, more expressed in cluster

Scales well with many cells

Legend labels: sparse genes, prevalent genes, lowly expressed, highly expressed, very specific
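The two quantities a dot plot encodes can be computed per cluster as sketched below: the fraction of cells expressing the gene (circle size) and the average expression (circle color). The function name, data, and cluster labels are illustrative assumptions.

```python
def dot_plot_stats(expression, clusters):
    """Per-cluster prevalence and mean expression for one gene.

    expression: per-cell expression values of the gene.
    clusters: per-cell cluster labels, in the same order.
    """
    stats = {}
    for label in sorted(set(clusters)):
        values = [e for e, c in zip(expression, clusters) if c == label]
        stats[label] = {
            "pct_expressing": sum(1 for v in values if v > 0) / len(values),
            "mean_expression": sum(values) / len(values),
        }
    return stats

expression = [0.0, 2.0, 4.0, 0.0, 0.0, 1.0]
clusters = ["A", "A", "A", "B", "B", "B"]

stats = dot_plot_stats(expression, clusters)
print(stats)
```

Summarizing each cluster with two numbers per gene is why this plot scales well with many cells: the figure size depends on the number of clusters and genes, not on the number of cells.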

slide-95
SLIDE 95


Determining cell type, state, and/or function: Identifying differentially expressed genes

Figure labels: Cell clusters, Genes

slide-96
SLIDE 96


Visualizing genes of interest Dot plots, violin plots, feature plots

slide-97
SLIDE 97


Gene signatures can be used to score each cell based on a set of genes

  • Can visualize a score for each cell and look at multiple genes at once
  • Done for a gene expression program of interest, e.g., cell cycle, inflammation, cell type, dissociation

  • Reduces the effects of dropouts

Gene signature for T cells
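A very simple signature score is the average expression of the signature genes in each cell, sketched below. This is an assumption made for illustration. Published scoring methods are more involved, for example by comparing against control gene sets. The expression values are made up, and the three T cell genes are used only as a small example signature.

```python
def signature_scores(expr, signature):
    """Average expression of the signature genes in each cell.

    expr maps gene -> per-cell expression values.
    Genes missing from expr are ignored.
    """
    genes = [g for g in signature if g in expr]
    if not genes:
        raise ValueError("none of the signature genes are in the data")
    n_cells = len(expr[genes[0]])
    return [sum(expr[g][c] for g in genes) / len(genes)
            for c in range(n_cells)]

expr = {
    "CD3D": [2.0, 0.0, 0.0],
    "CD3E": [0.0, 3.0, 0.0],
    "CD2": [1.0, 3.0, 0.0],
    "ALB": [0.0, 0.0, 5.0],
}
t_cell_signature = ["CD3D", "CD3E", "CD2"]

scores = signature_scores(expr, t_cell_signature)
print(scores)
```

The first two cells each have a zero for one signature gene, yet both still receive a clearly positive score, which illustrates how combining several genes reduces the effect of dropouts.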

slide-98
SLIDE 98


Visualizing genes of interest Dot plots, violin plots, feature plots

slide-99
SLIDE 99


  • 7. Functional annotation by pathway analysis and gene-set enrichment analysis

Shekhar et al. Cell 2016

slide-100
SLIDE 100


Trajectory inference

Bach et al. Nat Comm 2016; Haghverdi et al. Nat Methods 2016

Figure labels: Diffusion pseudotime, Diffusion Maps

slide-101
SLIDE 101

Recap: what did we just cover?


Wagner, Regev and Yosef. Nat Biotech 2016

We covered just this. So much more to learn!

slide-102
SLIDE 102

Recap: what did we just cover?


Wagner, Regev and Yosef. Nat Biotech 2016

  • Expression Matrix (GENES x CELLS)
  • Filter Cells / Quality Control
  • Normalization
  • 1. Identify Variable Genes
  • 2. Dimensionality Reduction
  • 3. Visualization
  • 4. Clustering
  • 5. Differentially Expressed Genes
  • 6. Assigning Cell Type
  • 7. Functional Annotation

Stages: Pre-Processing, Clustering, Biology

Time to execute this pipeline in a hands on example!

slide-103
SLIDE 103

Tools and resources

slide-104
SLIDE 104

RStudio: integrated development environment for R


slide-105
SLIDE 105


slide-106
SLIDE 106


slide-107
SLIDE 107


Single-cell portal: facilitates sharing and dissemination of data from single-cell studies

slide-108
SLIDE 108

Broad Institute Single Cell Portal


slide-109
SLIDE 109


Resources

Learn more about tSNE

  • Awesome Blog on t-SNE parameterization: http://distill.pub/2016/misread-tsne
  • Publication: https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf
  • Nice YouTube Video: https://www.youtube.com/watch?v=RJVL80Gg3lA
  • Code: https://lvdmaaten.github.io/tsne/
  • Interactive TensorFlow: http://projector.tensorflow.org/

Computational packages for single-cell analysis

  • http://bioconductor.org/packages/devel/workflows/html/simpleSingleCell.html
  • https://satijalab.org/seurat/
  • https://scanpy.readthedocs.io/

Online courses

  • https://hemberg-lab.github.io/scRNA.seq.course/
  • https://github.com/SingleCellTranscriptomics

slide-110
SLIDE 110


Resources, cont.

Comprehensive list of single-cell resources:

  • https://github.com/seandavi/awesome-single-cell
  • www.singlecellnetwork.org

slide-111
SLIDE 111


Resources, cont.

Data repositories:

  • JingleBells
  • Conquer