Statistical Methods for Bulk and Single-cell RNA Sequencing Data



slide-1
SLIDE 1

Statistical Methods for Bulk and Single-cell RNA Sequencing Data

Jingyi Jessica Li

Department of Statistics University of California, Los Angeles http://jsb.ucla.edu

slide-2
SLIDE 2

The central dogma of molecular biology

2018 marks the 60th anniversary of the central dogma: DNA makes RNA makes proteins.

Francis Crick speaking at the 1963 CSH Symposium [Cobb, PLoS Biology, 2017]

1

slide-3
SLIDE 3

The central dogma of molecular biology

The central dogma of molecular biology: DNA makes RNA makes proteins.

AACGTCGT GCTG CCG AATCAA

DNA RNA protein transcription

AACGUCGU GCUG CCG AAUCAA

translation
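The sequences on this slide differ only in that T in the DNA becomes U in the RNA. A minimal sketch of that step, assuming the DNA shown is the coding strand; the function name `transcribe` is hypothetical and not from the talk:

```python
def transcribe(dna: str) -> str:
    """Return the RNA version of a coding-strand DNA sequence (T -> U)."""
    return dna.upper().replace("T", "U")


# The four DNA segments shown on the slide.
dna_segments = ["AACGTCGT", "GCTG", "CCG", "AATCAA"]
rna_segments = [transcribe(segment) for segment in dna_segments]
print(" ".join(rna_segments))
```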

2

slide-4
SLIDE 4

The central dogma of molecular biology

In transcription, a particular segment of DNA (combinations of exons) is copied into RNA segments.

AACGTCGT GCTG CCG AATCAA

gene (DNA) RNA protein transcription

AACGUCGU GCUG CCG AAUCAA

translation

exon 1 exon 2 exon 3 exon 4 introns removed

3

slide-5
SLIDE 5

Understanding genome functions

?

[Kundaje et al., Nature, 2015]

4

slide-6
SLIDE 6

Understanding genome functions

?

4

slide-7
SLIDE 7

Alternative splicing

In alternative splicing, particular exons of a gene may be included in or excluded from a mature RNA isoform [Chow et al., Cell, 1977].

AACGTCGT GCTG CCG AATCAA

gene isoforms alternative splicing

AACGUCGU GCUG CCG AAUCAA AACGUCGU CCG AAUCAA

isoform A isoform B

(exon 2 included) (exon 2 excluded)
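The two isoforms above can be reproduced by joining a chosen subset of the four exons in order. This is only an illustrative sketch; the helper `splice` and its 1-based exon indexing are assumptions made for the example:

```python
# Exons 1 to 4 of the example gene, written as RNA.
exons = ["AACGUCGU", "GCUG", "CCG", "AAUCAA"]


def splice(exons, included):
    """Join the exons whose 1-based indices are in `included`, in genomic order."""
    return "".join(exons[i - 1] for i in sorted(included))


isoform_a = splice(exons, {1, 2, 3, 4})  # exon 2 included
isoform_b = splice(exons, {1, 3, 4})     # exon 2 excluded
print(isoform_a)
print(isoform_b)
```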

5

slide-8
SLIDE 8

Alternative splicing

In alternative splicing, particular exons of a gene may be included in or excluded from a mature RNA isoform [Chow et al., Cell, 1977].

AACGTCGT GCTG CCG AATCAA

gene isoforms alternative splicing

AACGUCGU GCUG CCG AAUCAA AACGUCGU CCG AAUCAA

isoform A isoform B

AACGUCGU GCUG CCG AAUCAA AACGUCGU GCUG CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU CCG AAUCAA

translation protein A protein B proteins

5

slide-9
SLIDE 9

Diversity in RNA isoform structures

Abnormal splicing can lead to genetic diseases.

AACGTCGT GCTG CCG AATCAA

gene RNA isoforms

normal splicing

proteins normal condition

AACGUCGU GCUG CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU GCUG CCG AAUCAA AACGUCGU GCUG CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU CCG AAUCAA

6

slide-10
SLIDE 10

Diversity in RNA isoform structures

Abnormal splicing can lead to genetic diseases.

AACGTCGT GCTG CCG AATCAA

gene RNA isoforms

normal splicing

proteins

abnormal splicing

AACGUCGUAAUCAA AACGUCGUAAUCAA AACGUCGUAAUCAA AACGUCGUAAUCAA AACGUCGUAAUCAA AACGUCGUAAUCAA AACGUCGUAAUCAA

normal condition disease condition

AACGUCGU GCUG CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU GCUG CCG AAUCAA AACGUCGU GCUG CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU CCG AAUCAA AACGUCGU CCG AAUCAA

6

slide-11
SLIDE 11

Understanding genome functions

Worm genome, the Human Genome Project, ENCODE Pilot, modENCODE, Mouse genome, 1000 Genomes Pilot, ENCODE, 1000 Genomes Project, Epigenome Roadmap, GTEx project

7

slide-12
SLIDE 12

RNA sequencing (RNA-seq) technology

RNA-seq data full length RNA isoforms

AACGUCGUUG GCUGGU CCGGAGG AAUCAAGAACUAUAC

RNA-seq experiments

AACGUCGUUG GCUGGU CCGGAGG AAUCAAGAACUAUAC

statistical inference

AACGUCGUUG GCUGGU CCGGAGG AACGUCGUUG GCUGGU CCGGAGG

(unknown) (observed)

AACGTCGTTG GCTGGT CCGGAGG AATCAAGAACTATAC

8

slide-13
SLIDE 13

RNA sequencing (RNA-seq) experiment

AACGUCGUUG GCUGGU CCGGAGG AAUCAAGAACUAUAC

full length RNA isoforms (1712 bp on average)

fragmentation

AACGUCGUUG GCUGGU CCGGAGG AAUCAAGAACUAUAC AGG AAUCAAGAACUAUAC AACGUCGUUG GCUGGU CCGGAGG AAUC AACGUCG UUG GCUGGU CCGG AAGAACUAUAC

RNA fragments (< 600 bp)

9

slide-14
SLIDE 14

RNA sequencing (RNA-seq) experiment

AACGUCGUUG GCUGGU CCGGAGG AAUCAAGAACUAUAC

full length RNA isoforms (1712 bp on average)

fragmentation

AACGUCGUUG GCUGGU CCGGAGG AAUCAAGAACUAUAC AGG AAUCAAGAACUAUAC AACGUCGUUG GCUGGU CCGGAGG AAUC AACGUCG UUG GCUGGU CCGG AAGAACUAUAC

RNA fragments (< 600 bp)

processing sequencing

TCC TTAGTTCTTGATATG TTGCAGCAAC CGACCA GGCCTCC TTAG TTGCAGC AAC CGACCA GGCC TTCTTGATATG AGG AATCAAGAACTATAC AACGTCG TTG GCTGGT CCGG AACGTCGTTG GCTGGT CCGGAGG AATC AAGAACUAUAC

9

slide-15
SLIDE 15

RNA sequencing (RNA-seq) experiment

AACGUCGUUG GCUGGU CCGGAGG AAUCAAGAACUAUAC

full length RNA isoforms (1712 bp on average)

fragmentation

AACGUCGUUG GCUGGU CCGGAGG AAUCAAGAACUAUAC AGG AAUCAAGAACUAUAC AACGUCGUUG GCUGGU CCGGAGG AAUC AACGUCG UUG GCUGGU CCGG AAGAACUAUAC

RNA fragments (< 600 bp)

processing sequencing

TCC TTAGTTCTTGATATG TTGCAGCAAC CGACCA GGCCTCC TTAG TTGCAGC AAC CGACCA GGCC TTCTTGATATG AGG AATCAAGAACTATAC AACGTCG TTG GCTGGT CCGG AACGTCGTTG GCTGGT CCGGAGG AATC AAGAACUAUAC AACG

RNA-seq reads (< 300 bp)

CAGC TTG GGCC G AGG TATG A AACG CAAC GCTG TTAG AAGA TATG

RNA-seq reads ∝ isoform abundance × isoform length
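The proportionality above implies that a longer isoform yields more reads than a shorter one at the same abundance. A small sketch of that arithmetic, using hypothetical abundances and lengths (none of these numbers come from the slides):

```python
def expected_read_fractions(abundance, length):
    """Expected share of reads per isoform, proportional to abundance * length."""
    weights = {iso: abundance[iso] * length[iso] for iso in abundance}
    total = sum(weights.values())
    return {iso: weight / total for iso, weight in weights.items()}


# Hypothetical example: two isoforms with equal abundance but different lengths.
abundance = {"A": 1.0, "B": 1.0}
length = {"A": 2000, "B": 1000}  # in bp
fractions = expected_read_fractions(abundance, length)
print(fractions)
```

With equal abundances, the isoform that is twice as long is expected to contribute twice as many reads, which is why read counts are typically normalized by length before abundances are compared.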

9

slide-16
SLIDE 16

Mapping RNA-seq reads to the reference genome

AACGTCGTTG GCTGGT CCGGAGG AATCAAGAACTATAC

full length RNA isoforms (1712 bp on average)

AACGUCGUUG GCUGGU CCGGAGG AAUCAAGAACUAUAC

processing sequencing

AACG

RNA-seq reads (< 300 bp)

CAGC TTG GGCC G AGG TATG A AACG CAAC GCTG TTAG AAGA TATG AACGUCGUUG GCUGGU CCGGAGG AAUCAAGAACUAUAC

mapping (alignment)

RNA-seq reads aligned to genome

10

slide-17
SLIDE 17

Mapping RNA-seq reads to the reference genome

AACGTCGTTG GCTGGT CCGGAGG AATCAAGAACTATAC

full length mRNA transcript (1712 bp on average)

AACGUCGUUG GCUGGU CCGGAGG AAUCAAGAACUAUAC

processing sequencing

AACG

RNA-seq reads (< 300 bp)

CAGC TTG GGCC G AGG TATG A AACG CAAC GCTG TTAG AAGA TATG AACGUCGUUG GCUGGU CCGGAGG AAUCAAGAACUAUAC

mapping (alignment)

RNA-seq reads aligned to genome

2 2 1 2

10

slide-18
SLIDE 18

Mapping RNA-seq reads to the reference genome

AACGTCGTTG GCTGGT CCGGAGG AATCAAGAACTATAC

histogram of RNA-seq read counts

full length RNA isoforms (1712 bp on average)

AACGUCGUUG GCUGGU CCGGAGG AAUCAAGAACUAUAC

processing sequencing

AACG

RNA-seq reads (< 300 bp)

CAGC TTG GGCC G AGG TATG A AACG CAAC GCTG TTAG AAGA TATG AACGUCGUUG GCUGGU CCGGAGG AAUCAAGAACUAUAC

mapping (alignment)

10

slide-19
SLIDE 19

Reference-based RNA-seq data analysis

  • 1. Align RNA-seq reads to a reference genome
  • 2. Analyze aligned reads at three levels

  • a. gene-level: $g_i = n$
  • b. exon-level: $\phi_i = \frac{n_1}{n_1 + n_2}$
  • c. transcript-level: $\alpha_1$, $\alpha_2$

[Figure labels: DNA, mRNA, RNA-seq reads, ambiguous]

11
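The gene-level and exon-level quantities can be computed directly from read counts. The sketch below assumes that n counts reads mapped to the gene, and that n1 and n2 count reads supporting inclusion and exclusion of an exon; the slide does not define these symbols, so this reading is an assumption, and the function names are hypothetical:

```python
def gene_level_expression(n):
    """Gene-level summary g_i = n: the number of reads mapped to the gene."""
    return n


def exon_inclusion_ratio(n1, n2):
    """Exon-level summary phi_i = n1 / (n1 + n2)."""
    if n1 + n2 == 0:
        raise ValueError("no informative reads for this exon")
    return n1 / (n1 + n2)


# Hypothetical counts for illustration.
print(gene_level_expression(120))
print(exon_inclusion_ratio(30, 10))
```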

slide-20
SLIDE 20

Single-cell (sc) vs. bulk RNA-seq at the gene level

Tissue scRNA-seq bulk RNA-seq genes cells tissue

12

slide-21
SLIDE 21

Bulk RNA-seq: transcript/isoform discovery & quantification

slide-22
SLIDE 22

isoform-level

AIDE: annotation-assisted isoform discovery

13

slide-23
SLIDE 23

Isoform discovery: which isoforms are expressed?

  • More than 90% of genes undergo alternative splicing in mammals [Hooper, Human Genomics, 2014].
  • At least 35% of genetic diseases involve abnormal splicing [Manning et al., Nature Reviews Mol. Cell Biol. 2017].

AACGTCGT GCTG CCG AATCAA

gene isoforms

alternative splicing

AACGUCGU GCUG CCG AAUCAA AACGUCGU CCG AAUCAA

isoform A isoform B (exon 2 included) (exon 2 excluded)

14

slide-24
SLIDE 24

Isoform discovery: which isoforms are expressed?

AACGTCGT GCTG CCG AATCAA

gene isoforms

AACGUCGU GCUG CCG AAUCAA AACGUCGU

genome RNA-seq data

GCUG AACGUCGU CCG AACGUCGU AAUCAA AACGUCGU GCUG CCG GCUG CCG AAUCAA

Which isoforms are expressed?

statistical modeling

15

slide-25
SLIDE 25

Challenge 1: large number of candidate isoforms

Variable size (# of candidate isoforms) = $2^{\#\text{ of exons}} - 1$

AACGTCGT GCTG CCG AATCAA

gene isoforms

AACGUCGU GCUG CCG AAUCAA AACGUCGU

genome RNA-seq data

GCUG AACGUCGU CCG AACGUCGU AAUCAA AACGUCGU GCUG CCG GCUG CCG AAUCAA

Which isoforms are expressed?

statistical modeling

For this 4-exon gene, $2^4 - 1 = 15$ candidate isoforms
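The count of candidate isoforms is the number of non-empty subsets of exons. A short sketch that enumerates them, assuming every non-empty ordered subset of exons is a candidate (the function name is hypothetical):

```python
from itertools import combinations


def candidate_isoforms(n_exons):
    """All non-empty subsets of exons 1..n_exons, each kept in genomic order."""
    exon_ids = range(1, n_exons + 1)
    return [
        combo
        for size in range(1, n_exons + 1)
        for combo in combinations(exon_ids, size)
    ]


isoforms = candidate_isoforms(4)
print(len(isoforms))
print(isoforms[:5])
```

The count doubles with every added exon, so a 10-exon gene already has 1023 candidates.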

16

slide-26
SLIDE 26

Challenge 2: great information loss

  • RNA-seq reads are very short compared with full-length isoforms.
  • Most RNA-seq reads do not uniquely map to a single isoform.

?

gene isoform 1 isoform 4 isoform 2 isoform 3

17

slide-27
SLIDE 27

Challenge 2: great information loss

  • RNA-seq reads are very short compared with full-length isoforms.
  • Most RNA-seq reads do not uniquely map to a single isoform.

?

gene isoform 1 isoform 4 isoform 2 isoform 3

  • Technical biases introduced into RNA-seq experiments.

17

slide-28
SLIDE 28

Existing isoform discovery methods

State-of-the-art methods for isoform discovery:

  • SIIER [Jiang et al., Bioinformatics, 2009]
  • Cufflinks [Trapnell et al., Nature Biotechnology, 2010]
  • SLIDE [Li et al., Proc. Natl. Acad. Sci. 2011]
  • StringTie [Pertea et al., Nature Biotechnology, 2015]
  • · · ·

Limitations:

  • 1. Low accuracy for genes with complex splicing structures.
  • 2. Difficult to improve isoform-level performance.

[Kanitz et al., Genome Biology, 2015]

  • 3. Usage of annotations results in false positives.

18

slide-29
SLIDE 29

Usage of annotations results in false positives

Annotated isoforms are experimentally validated:

1 1 2 3 4

gene annotated isoforms

  • Ensembl database: 203,903 isoforms [Zerbino et al., Nucleic Acids Research, 2017]

19

slide-30
SLIDE 30

Usage of annotations results in false positives

Annotated isoforms are experimentally validated:

1 1 2 3 4

gene annotated isoforms

  • Ensembl database: 203,903 isoforms [Zerbino et al., Nucleic Acids Research, 2017]

annotated isoforms

expressed isoforms in normal brain

19

slide-31
SLIDE 31

Usage of annotations results in false positives

Annotated isoforms are experimentally validated:

1 1 2 3 4

gene annotated isoforms

  • Ensembl database: 203,903 isoforms [Zerbino et al., Nucleic Acids Research, 2017]

annotated isoforms

expressed isoforms in normal brain expressed isoforms in Alzheimer's brain

19

slide-32
SLIDE 32

Usage of annotations results in false positives

Annotated isoforms are experimentally validated:

1 1 2 3 4

gene annotated isoforms

  • Ensembl database: 203,903 isoforms [Zerbino et al., Nucleic Acids Research, 2017]

annotated isoforms

expressed isoforms in normal brain expressed isoforms in Parkinson's brain expressed isoforms in Alzheimer's brain

19

slide-33
SLIDE 33

False positives → false discoveries

[Figure: number of drugs per billion US$ R&D spending (axis ticks 1, 10, 100), 1950 to 2010]

[Scannell et al., Nat. Rev. Drug Discov. 2012]

20

slide-34
SLIDE 34

Highlights of the AIDE method

  • 1. Selectively leverage annotation information to increase the precision and robustness of isoform discovery.

21

slide-35
SLIDE 35

Highlights of the AIDE method

  • 1. Selectively leverage annotation information to increase the precision and robustness of isoform discovery.
  • 2. Practical probabilistic model to account for technical biases.
  • 3. Conservatively identify isoforms that make statistically significant contributions to explaining the observed RNA-seq reads.

21

slide-36
SLIDE 36

Highlights of the AIDE method

  • 1. Selectively leverage annotation information to increase the precision and robustness of isoform discovery.
  • 2. Practical probabilistic model to account for technical biases.
  • 3. Conservatively identify isoforms that make statistically significant contributions to explaining the observed RNA-seq reads.
  • 4. First method to control false discoveries by employing a statistical testing procedure.

[Diagram: Expressed isoforms (unobserved, truth); RNA-seq reads (observed, with noise); Annotation (prior knowledge, inaccurate); AIDE model; Identified isoforms (precise)]

21

slide-37
SLIDE 37

The stepwise selection in AIDE: two stages

annotated isoforms: non-annotated isoforms: Stage 1: candidates are annotated isoforms only Initialization Forward step Backward step

22

slide-38
SLIDE 38

The stepwise selection in AIDE: two stages

annotated isoforms: non-annotated isoforms: Stage 1: candidates are annotated isoforms only Initialization Forward step Backward step

vs.

22

slide-39
SLIDE 39

The stepwise selection in AIDE: two stages

annotated isoforms: non-annotated isoforms: Stage 1: candidates are annotated isoforms only Initialization Forward step Backward step

selected based on MLE

22

slide-40
SLIDE 40

The stepwise selection in AIDE: two stages

annotated isoforms: non-annotated isoforms: Stage 1: candidates are annotated isoforms only Initialization Forward step Backward step

vs. LRT

22

slide-41
SLIDE 41

The stepwise selection in AIDE: two stages

annotated isoforms: non-annotated isoforms: Stage 1: candidates are annotated isoforms only Initialization Forward step Backward step

22

slide-42
SLIDE 42

The stepwise selection in AIDE: two stages

annotated isoforms: non-annotated isoforms: Stage 1: candidates are annotated isoforms only Initialization Forward step Backward step

vs.

22

slide-43
SLIDE 43

The stepwise selection in AIDE: two stages

annotated isoforms: non-annotated isoforms: Stage 1: candidates are annotated isoforms only Initialization Forward step Backward step

selected based on MLE

22

slide-44
SLIDE 44

The stepwise selection in AIDE: two stages

annotated isoforms: non-annotated isoforms: Stage 1: candidates are annotated isoforms only Initialization Forward step Backward step

vs. LRT

22

slide-45
SLIDE 45

The stepwise selection in AIDE: two stages

annotated isoforms: non-annotated isoforms: Stage 1: candidates are annotated isoforms only Initialization Forward step Backward step

Output

22

slide-46
SLIDE 46

The stepwise selection in AIDE: two stages

[Legend: annotated isoforms; non-annotated isoforms]

Stage 1: candidates are annotated isoforms only. Initialization, Forward step, Backward step, Output.

Stage 2: candidates are all possible isoforms. Initialization, Forward step, Backward step.
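The slides indicate that a forward step compares the current model with a larger one using a likelihood ratio test (LRT). The sketch below shows a generic LRT for adding one isoform. It assumes a chi-square reference distribution with one degree of freedom and a 0.05 threshold, neither of which is stated on the slides, and the function names are hypothetical rather than part of the AIDE software:

```python
import math


def lrt_p_value(loglik_null, loglik_alt):
    """P-value of a likelihood ratio test with one extra parameter.

    Assumes the statistic 2 * (loglik_alt - loglik_null) follows a chi-square
    distribution with 1 degree of freedom under the null (an assumption).
    """
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def forward_step_accepts(loglik_current, loglik_with_candidate, alpha=0.05):
    """Accept the candidate isoform only if it improves the fit significantly."""
    return lrt_p_value(loglik_current, loglik_with_candidate) < alpha


# Hypothetical log-likelihoods for illustration.
print(forward_step_accepts(-100.0, -90.0))  # large improvement
print(forward_step_accepts(-100.0, -99.5))  # small improvement
```

Requiring a significant improvement before adding an isoform is what makes a procedure of this kind conservative: an isoform that explains few additional reads is left out.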

23

slide-47
SLIDE 47

AIDE outperforms state-of-the-art methods

  • Human embryonic stem cells
  • Input: Illumina RNA-seq data
  • Evaluation: PacBio and Nanopore ONT RNA-seq data

[Figure: precision, recall, and F-score of Cufflinks, AIDE, and StringTie at the base, exon, and transcript levels, evaluated against PacBio and ONT data]

24

slide-48
SLIDE 48

AIDE effectively reduces false discoveries in real data

  • Data: breast cancer RNA-seq samples
  • Six genes:
  • isoforms identified only by Cufflinks but not by AIDE
  • experimental validation (PCR)

25

slide-49
SLIDE 49

AIDE effectively reduces false discoveries in real data

  • Data: breast cancer RNA-seq samples
  • Six genes:
  • isoforms identified only by Cufflinks but not by AIDE
  • experimental validation (PCR)
  • Four genes: the isoforms uniquely predicted by Cufflinks were false positives

[Figure, panels a to f: genes MTHFD2, NPC2, RBM7, CD164, and ZFAND5; PCR, AIDE, and Cufflinks results (+ marks) for the isoforms MTHFD2-201, MTHFD2-203, NPC2-207, NPC2-205, RBM7-203, RBM7-208, CD164-003, and CD164-210]

25

slide-50
SLIDE 50

AIDE discovers isoforms with biological significance

FGFR1

[Figure: gene and isoform structure of FGFR1; PCR, AIDE, and Cufflinks results (+ marks); MCF-7 sample and BT549 sample; control experiments (suppress expression of the isoform)]

26

slide-51
SLIDE 51

Summary of the AIDE method

  • The first isoform discovery method that directly controls false discoveries by implementing the statistical model selection principle.

[Diagram: Expressed isoforms (unobserved, truth); RNA-seq reads (observed, with noise); Annotation (prior knowledge, inaccurate); AIDE model; Identified isoforms (precise)]

  • Software: https://github.com/Vivianstats/AIDE
  • Manuscript: Under review at Genome Research.

27

slide-52
SLIDE 52

Isoform quantification: what are the isoform expression levels?

  • More than 90% of genes undergo alternative splicing in mammals [Hooper, Human Genomics, 2014].
  • At least 35% of genetic diseases involve abnormal splicing [Manning et al., Nature Reviews Mol. Cell Biol. 2017].

AACGTCGT GCTG CCG AATCAA

gene isoforms

alternative splicing

AACGUCGU GCUG CCG AAUCAA AACGUCGU CCG AAUCAA

isoform A isoform B (exon 2 included) (exon 2 excluded)

28

slide-53
SLIDE 53

Motivation: multiple human ESC RNA-seq samples

chr1; gene:TPR

29

slide-54
SLIDE 54

How to combine multiple RNA-seq samples?

Given D RNA-Seq (technical or biological) replicate samples and gene annotations, how to estimate the abundance of each annotated isoform for every gene?

30

slide-55
SLIDE 55

How to combine multiple RNA-seq samples?

Given D RNA-Seq (technical or biological) replicate samples and gene annotations, how to estimate the abundance of each annotated isoform for every gene?

  • Apply a single-sample method to each sample separately and then average the estimated isoform abundance across multiple samples?

30

slide-56
SLIDE 56

How to combine multiple RNA-seq samples?

Given D RNA-Seq (technical or biological) replicate samples and gene annotations, how to estimate the abundance of each annotated isoform for every gene?

  • Apply a single-sample method to each sample separately and then average the estimated isoform abundance across multiple samples?
  • This does not fully use the multi-sample information to reduce the variance in estimating isoform abundance

30

slide-57
SLIDE 57

How to combine multiple RNA-seq samples?

Given D RNA-Seq (technical or biological) replicate samples and gene annotations, how to estimate the abundance of each annotated isoform for every gene?

  • Apply a single-sample method to each sample separately and then average the estimated isoform abundance across multiple samples?
  • This does not fully use the multi-sample information to reduce the variance in estimating isoform abundance
  • Apply a single-sample method to a pooled sample from the D samples?

30

slide-58
SLIDE 58

How to combine multiple RNA-seq samples?

Given D RNA-Seq (technical or biological) replicate samples and gene annotations, how to estimate the abundance of each annotated isoform for every gene?

  • Apply a single-sample method to each sample separately and then average the estimated isoform abundance across multiple samples?
  • This does not fully use the multi-sample information to reduce the variance in estimating isoform abundance
  • Apply a single-sample method to a pooled sample from the D samples?
  • The estimated isoform abundance may be biased by outlier samples
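The difference between the two naive strategies can be seen in a toy calculation. The sketch below assumes, for simplicity, that each read can be assigned to an isoform unambiguously, which real RNA-seq data do not allow; the counts are hypothetical and the function names are not from MSIQ:

```python
def averaged_proportion(samples):
    """Mean of per-sample isoform proportions (each sample weighted equally)."""
    return sum(a / n for a, n in samples) / len(samples)


def pooled_proportion(samples):
    """Isoform proportion after pooling reads (each read weighted equally)."""
    return sum(a for a, _ in samples) / sum(n for _, n in samples)


# (reads assigned to isoform A, total reads for the gene); hypothetical counts.
# The last sample is a deeply sequenced outlier.
samples = [(50, 100), (52, 100), (48, 100), (900, 1000)]

print(averaged_proportion(samples))
print(pooled_proportion(samples))
```

In this toy example the three consistent samples suggest a proportion near 0.5, yet both summaries are pulled toward the outlier, and the pooled estimate more strongly because the outlier contributes most of the reads.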

30

slide-59
SLIDE 59

MSIQ

Joint Modeling of Multiple RNA-seq Samples for Accurate Isoform Quantification

31

slide-60
SLIDE 60

Summary

  • It is necessary to consider the heterogeneity of different samples to achieve robust isoform quantification

32

slide-61
SLIDE 61

Summary

  • It is necessary to consider the heterogeneity of different samples to achieve robust isoform quantification
  • MSIQ is able to identify a consistent group of samples that are most representative of the biological condition

32

slide-62
SLIDE 62

Summary

  • It is necessary to consider the heterogeneity of different samples to achieve robust isoform quantification
  • MSIQ is able to identify a consistent group of samples that are most representative of the biological condition
  • MSIQ increases the accuracy of isoform quantification by incorporating the information from multiple samples

32

slide-63
SLIDE 63

Summary

  • It is necessary to consider the heterogeneity of different samples to achieve robust isoform quantification
  • MSIQ is able to identify a consistent group of samples that are most representative of the biological condition
  • MSIQ increases the accuracy of isoform quantification by incorporating the information from multiple samples
  • Our proposed hierarchical model is an umbrella framework that can be generalized to incorporate more detailed modeling of read-generating mechanisms

32

slide-64
SLIDE 64

Paper and Software

MSIQ: joint modeling of multiple RNA-seq samples for accurate isoform quantification

by Wei Vivian Li, Anqi Zhao, Shihua Zhang, and Jingyi Jessica Li

Annals of Applied Statistics 12(1):510–539

R package MSIQ: http://github.com/Vivianstats/MSIQ

33

slide-65
SLIDE 65

Single-cell RNA-seq: dropout imputation

slide-66
SLIDE 66

scRNA-seq vs. bulk RNA-seq at the gene level

Tissue scRNA-seq bulk RNA-seq genes cells tissue

34

slide-67
SLIDE 67

Dropout events in scRNA-seq

from [Kharchenko et al., Nature methods, 2014]

35

slide-68
SLIDE 68

Dropout events in scRNA-seq

  • A dropout event occurs when a transcript is expressed in a cell but is entirely undetected in its mRNA profile
  • Dropout events occur due to low amounts of mRNA in individual cells
  • The frequency of dropout events depends on scRNA-seq protocols
  • Fluidigm C1 platform: ∼100 cells, ∼1 million reads per cell
  • Droplet microfluidics: ∼10,000 cells, ∼100K reads per cell [Zilionis et al., Nature Protocols, 2017]
  • Trade-off: given the same budget, sequencing more cells leads to more dropouts

36

slide-69
SLIDE 69

Statistical methods for scRNA-seq data analysis

  • Clustering / cell type identification
  • SNN-Cliq [Xu et al., Bioinformatics, 2015]: uses the ranking of genes to construct a graph and learn cell clusters
  • CIDR [Lin et al., Genome Biology, 2017]: incorporates implicit imputation of dropout values
  • Cell relationship reconstruction
  • Seurat [Satija et al., Nature biotechnology, 2015]: infers the spatial origins of cells from their scRNA-seq data and a spatial reference map of landmark genes, whose expressions are imputed based on highly variable genes
  • Dimension reduction
  • ZIFA [Pierson et al., Genome biology, 2015]: accounts for dropout events based on an empirical observation: the dropout rate of a gene depends on its mean expression level in the population

37

slide-70
SLIDE 70

Genome-wide explicit imputation for dropouts

Why do we need genome-wide explicit imputation methods?

Downstream analyses rely on the accuracy of gene expression measurements:

  • differential gene expression analysis
  • identification of cell-type-specific genes
  • reconstruction of differentiation trajectory

It is important to adjust/correct the false zero expression values due to dropouts

38

slide-71
SLIDE 71

Genome-wide imputation methods for scRNA-seq

MAGIC [Dijk et al., Cell, 2018]:

  • the first method for explicit and genome-wide imputation of scRNA-seq gene expression data
  • imputes missing expression values by sharing information across similar cells
  • creates a Markov transition matrix, which determines the weights of the cells

SAVER [Huang et al., Nature Methods, 2018]:

  • borrows information across genes using a Bayesian approach

DrImpute [Kwak et al., bioRxiv, 2017]:

  • borrows information across cells by averaging multiple imputation results

and several other recent methods available on bioRxiv

39

slide-72
SLIDE 72

Genome-wide imputation methods for scRNA-seq

Limitations of aforementioned methods:

  • It is not ideal to impute all gene expressions
  • imputing expressions unaffected by dropout would introduce new bias
  • could also eliminate meaningful biological variation
  • It is inappropriate to treat all zero expressions as missing values
  • some zero expressions may reflect true biological non-expression
  • zero expressions can result from gene expression stochasticity

40

slide-73
SLIDE 73

Genome-wide imputation methods for scRNA-seq

Limitations of aforementioned methods:

  • It is not ideal to impute all gene expressions
  • imputing expressions unaffected by dropout would introduce new bias
  • could also eliminate meaningful biological variation
  • It is inappropriate to treat all zero expressions as missing values
  • some zero expressions may reflect true biological non-expression
  • zero expressions can result from gene expression stochasticity

How to determine which values are affected by the dropout events?

40

slide-74
SLIDE 74

Our method: scImpute

  • 1. For each gene, determine which expression values are most likely affected by dropout events
  • 2. For each cell, impute the highly likely dropout values by borrowing information from the same genes’ expression in similar cells

[Figure labels: cell j; selected cells; other cells; gene set A; gene set B; imputation with selected cells; expression scale from zero to high expression]
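The second step can be illustrated with a deliberately simplified sketch. It assumes the likely-dropout genes and the similar cells have already been identified, and it swaps in a plain average over the selected cells for whatever weighting the actual method uses; the slides do not specify that part. All names and values are hypothetical:

```python
def impute_cell(expr, cell, dropout_genes, similar_cells):
    """Return a copy of expr[cell] in which each likely-dropout gene is replaced
    by the mean expression of the same gene in the selected similar cells.
    Genes not listed in `dropout_genes` are left unchanged."""
    imputed = dict(expr[cell])
    for gene in dropout_genes:
        values = [expr[other][gene] for other in similar_cells]
        imputed[gene] = sum(values) / len(values)
    return imputed


# Hypothetical expression values: g1 is a likely dropout in cell_j,
# g3 is zero in every cell and is treated as true non-expression.
expr = {
    "cell_j": {"g1": 0.0, "g2": 5.0, "g3": 0.0},
    "cell_a": {"g1": 4.0, "g2": 5.5, "g3": 0.0},
    "cell_b": {"g1": 6.0, "g2": 4.5, "g3": 0.0},
}
result = impute_cell(expr, "cell_j", ["g1"], ["cell_a", "cell_b"])
print(result)
```

Only the flagged value is changed; the observed value of g2 and the zero of g3 are retained, which mirrors the idea of imputing selectively rather than altering every entry.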

41

slide-75
SLIDE 75

Example 1: ERCC spike-ins

scImpute recovers the true expression of the ERCC spike-in transcripts, especially low abundance transcripts that are impacted by dropout events

  • 3,005 cells from the mouse somatosensory cortex region
  • 57 ERCC transcripts

[Figure: axes log10(ERCC concentration) and log10(read count + 1); panels for cell 1, cell 2, cell 3, and cell 4; raw and scImpute]

42

slide-76
SLIDE 76

Example 2: cell clustering

4,500 peripheral blood mononuclear cells (PBMCs) from the high-throughput droplet-based system of 10x Genomics [Zheng et al., Nature communications, 2017]

Proportion of zero expression is 92.6%

43

slide-77
SLIDE 77

Example 3: gene expression dynamics

Bulk and single-cell time-course RNA-seq data profiled at 0, 12, 24, 36, 72, and 96 h of the differentiation of embryonic stem cells into definitive endoderm cells [Chu et al., Genome biology, 2016]

time point                   00h   12h   24h   36h   72h   96h   total
scRNA-seq (cells)             92   102    66   172   138   188     758
bulk RNA-seq (replicates)            3     3     3     3     3      15

44

slide-78
SLIDE 78

Example 3: gene expression dynamics

Correlation between gene expression in single-cell and bulk data

[Figure: correlation (axis ticks 0.5 to 0.8) by time (12h, 24h, 36h, 72h, 96h) for two methods: raw and scImpute]
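A comparison of this kind can be sketched as a Pearson correlation between a per-gene summary of the single-cell data and the bulk measurements. The slides do not say how the single-cell data were summarized or transformed, so averaging over cells is an assumption here, and all values are hypothetical:

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_per_gene(cells):
    """Average each gene over cells; `cells` is a list of per-cell expression lists."""
    return [sum(column) / len(column) for column in zip(*cells)]


# Hypothetical data: 2 cells x 3 genes, and bulk expression for the same 3 genes.
single_cells = [[0.0, 2.0, 4.0], [2.0, 2.0, 8.0]]
bulk = [1.0, 2.0, 6.0]
r = pearson(mean_per_gene(single_cells), bulk)
print(r)
```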

45

slide-79
SLIDE 79

Example 3: gene expression dynamics

Imputed read counts reflect more accurate gene expression dynamics along the time course

46

slide-80
SLIDE 80

Conclusions

  • scImpute is a flexible and easily interpretable statistical method that addresses the dropout events prevalent in scRNA-seq data
  • scImpute focuses on imputing the missing expression values of dropout genes, while retaining the expression levels of genes that are largely unaffected by dropout events
  • scImpute is compatible with existing pipelines or downstream analysis of scRNA-seq data, such as normalization, differential expression analysis, clustering and classification
  • scImpute scales up well when the number of cells increases

47

slide-81
SLIDE 81

Paper and Software

An accurate and robust imputation method scImpute for single-cell RNA-seq data

by Wei Vivian Li and Jingyi Jessica Li

Nature Communications 9:997

R package scImpute: https://github.com/Vivianstats/scImpute

48

slide-82
SLIDE 82

Real vs. semi-synthetic data

49

slide-83
SLIDE 83

Real vs. semi-synthetic data

50

slide-84
SLIDE 84

Benchmark standard

1 2 3 4 5 6

CA1-Pyramidal: 442 20 289 1 4 42 40
S1-Pyramidal: 2 273 1 1 32 11
Oligodendrocytes: 282 62 2
Interneurons: 5 7 2 220 6 1
Endothelial: 1 14
Microglia: 6
Mural: 1
Ependymal: 7
Astrocytes: 1 2 1 20

labels used in Huang et al.; labels reported in Zeisel et al.

51

slide-85
SLIDE 85

Acknowledgements

Wei Vivian Li (PhD student, UCLA)

Collaborators:

  • Prof. Alexander Hoffmann (UCLA)
  • Prof. Hubing Shi (Sichuan University)
  • Prof. Xin Tong (USC)
  • Prof. Shihua Zhang (CAS)
  • Dr. Anqi Zhao (Harvard)

Website: http://jsb.ucla.edu

Email: jli@stat.ucla.edu

52