
Statistical Methods for Bulk and Single-cell RNA Sequencing Data


  1. Statistical Methods for Bulk and Single-cell RNA Sequencing Data. Jingyi Jessica Li, Department of Statistics, University of California, Los Angeles. http://jsb.ucla.edu

  2. The central dogma of molecular biology. 2018 marks the 60th anniversary of the central dogma: DNA makes RNA makes proteins. Figure: Francis Crick speaking at the 1963 CSH Symposium [Cobb, PLoS Biology, 2017].

  3. The central dogma of molecular biology. DNA makes RNA makes proteins. Diagram: DNA (AACGTCGT GCTG CCG AATCAA) → transcription → RNA (AACGUCGU GCUG CCG AAUCAA) → translation → protein.

  4. The central dogma of molecular biology. In transcription, a particular segment of DNA (combinations of exons) is copied into RNA segments. Diagram: gene (DNA) with exon 1 (AACGTCGT), exon 2 (GCTG), exon 3 (CCG), exon 4 (AATCAA) → transcription, introns removed → RNA (AACGUCGU GCUG CCG AAUCAA) → translation → protein.

  5. Understanding genome functions? [Kundaje et al., Nature, 2015]

  7. Alternative splicing. In alternative splicing, particular exons of a gene may be included in or excluded from a mature RNA isoform [Chow et al., Cell, 1977]. Diagram: gene (AACGTCGT GCTG CCG AATCAA) → alternative splicing → isoform A (AACGUCGU GCUG CCG AAUCAA, exon 2 included) and isoform B (AACGUCGU CCG AAUCAA, exon 2 excluded).

  8. Alternative splicing. In alternative splicing, particular exons of a gene may be included in or excluded from a mature RNA isoform [Chow et al., Cell, 1977]. Diagram: gene (AACGTCGT GCTG CCG AATCAA) → alternative splicing → multiple copies of isoform A (AACGUCGU GCUG CCG AAUCAA) and isoform B (AACGUCGU CCG AAUCAA) → translation → proteins A and B.
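
The exon inclusion and exclusion described on this slide can be sketched in a few lines of Python. This is only an illustration on the slide's toy gene: the helper names `transcribe` and `splice` are hypothetical, and transcription is reduced to replacing T with U.

```python
# Exon sequences (DNA alphabet) copied from the slide's toy four-exon gene.
EXONS = ["AACGTCGT", "GCTG", "CCG", "AATCAA"]


def transcribe(dna: str) -> str:
    """Write a DNA sequence in the RNA alphabet (T becomes U)."""
    return dna.replace("T", "U")


def splice(exons, included):
    """Join the exons whose 1-based indices are in `included`, in genomic order."""
    return "".join(
        transcribe(exon)
        for index, exon in enumerate(exons, start=1)
        if index in included
    )


isoform_a = splice(EXONS, {1, 2, 3, 4})  # exon 2 included
isoform_b = splice(EXONS, {1, 3, 4})     # exon 2 excluded
print(isoform_a)
print(isoform_b)
```

The two isoforms differ only in the four bases contributed by exon 2, which is why short reads that fall entirely inside exons 1, 3, or 4 cannot tell them apart.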

  9. Diversity in RNA isoform structures. Abnormal splicing can lead to genetic diseases. Diagram: gene (AACGTCGT GCTG CCG AATCAA) under the normal condition → normal splicing → RNA isoforms AACGUCGU GCUG CCG AAUCAA and AACGUCGU CCG AAUCAA (multiple copies of each) → proteins.

  10. Diversity in RNA isoform structures. Abnormal splicing can lead to genetic diseases. Diagram: gene (AACGTCGT GCTG CCG AATCAA). Normal condition: normal splicing → RNA isoforms AACGUCGU GCUG CCG AAUCAA and AACGUCGU CCG AAUCAA → proteins. Disease condition: abnormal splicing → RNA isoforms AACGUCGU CCG AAUCAA and the abnormal AACGUCGUAAUCAA (exons 2 and 3 both skipped) → proteins.

  11. Understanding genome functions. Figure: genome projects, including the Human Genome Project, the worm genome, the mouse genome, ENCODE Pilot, modENCODE, ENCODE, 1000 Genomes Pilot, the 1000 Genomes project, the GTEx project, and the Epigenome Roadmap.

  12. RNA sequencing (RNA-seq) technology. Diagram: full-length RNA isoforms (unknown) → RNA-seq experiments → RNA-seq data (observed); statistical inference goes from the observed data back to the unknown isoforms.

  13. RNA sequencing (RNA-seq) experiment. Diagram: full-length RNA isoforms (1712 bp on average) → fragmentation → RNA fragments (< 600 bp).

  14. RNA sequencing (RNA-seq) experiment. Diagram: full-length RNA isoforms (1712 bp on average) → fragmentation → RNA fragments (< 600 bp) → processing and sequencing.

  15. RNA sequencing (RNA-seq) experiment. Diagram: full-length RNA isoforms (1712 bp on average) → fragmentation → RNA fragments (< 600 bp) → processing and sequencing → RNA-seq reads (< 300 bp). RNA-seq reads ∝ isoform abundance × isoform length.
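
The proportionality "RNA-seq reads ∝ isoform abundance × isoform length" can be made concrete with a small numerical sketch. The isoform names, abundances, lengths, and read counts below are invented for illustration, and both function names are hypothetical; the second function simply inverts the proportionality and is not a method from the slides.

```python
def expected_read_fractions(abundance, length):
    """Expected share of reads per isoform, proportional to abundance x length."""
    weights = {name: abundance[name] * length[name] for name in abundance}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def relative_abundance(read_counts, length):
    """Invert the proportionality: abundance is proportional to reads / length."""
    rates = {name: read_counts[name] / length[name] for name in read_counts}
    total = sum(rates.values())
    return {name: rate / total for name, rate in rates.items()}


# Two equally abundant isoforms; isoform_A is twice as long as isoform_B.
abundance = {"isoform_A": 1.0, "isoform_B": 1.0}
length = {"isoform_A": 2000, "isoform_B": 1000}

fractions = expected_read_fractions(abundance, length)
recovered = relative_abundance({"isoform_A": 200, "isoform_B": 100}, length)
print(fractions)
print(recovered)
```

The longer isoform yields more reads even at equal abundance, so raw read counts have to be divided by length before abundances are compared.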

  16. Mapping RNA-seq reads to the reference genome. Diagram: full-length RNA isoforms (1712 bp on average) → processing and sequencing → RNA-seq reads (< 300 bp) → mapping (alignment) → RNA-seq reads aligned to the genome (AACGTCGTTG GCTGGT CCGGAGG AATCAAGAACTATAC).

  17. Mapping RNA-seq reads to the reference genome. Diagram: full-length mRNA transcript (1712 bp on average) → processing and sequencing → RNA-seq reads (< 300 bp) → mapping (alignment) → RNA-seq reads aligned to the genome (AACGTCGTTG GCTGGT CCGGAGG AATCAAGAACTATAC), with read counts 2, 2, 1, 2 over the four exons.

  18. Mapping RNA-seq reads to the reference genome. Diagram: full-length RNA isoforms (1712 bp on average) → processing and sequencing → RNA-seq reads (< 300 bp) → mapping (alignment) → histogram of RNA-seq read counts along the genome (AACGTCGTTG GCTGGT CCGGAGG AATCAAGAACTATAC).

  19. Reference-based RNA-seq data analysis. 1. Align RNA-seq reads to a reference genome. 2. Analyze aligned reads at three levels: (a) gene-level: $g_i = n$; (b) exon-level: $\phi_i = \frac{n_1}{n_1 + n_2}$; (c) transcript-level: ambiguous ($\alpha_1$, $\alpha_2$). Figure labels: DNA, mRNA, RNA-seq reads.
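
The exon-level quantity on this slide appears to be a ratio of the form n1 / (n1 + n2). A minimal sketch follows, assuming n1 counts reads that support including the exon and n2 counts reads that support skipping it. That reading is an assumption, since the slide text does not define the two counts, and the function name is hypothetical.

```python
def inclusion_ratio(n_inclusion, n_exclusion):
    """Exon-level ratio n1 / (n1 + n2).

    Assumes n1 = reads supporting inclusion and n2 = reads supporting exclusion.
    Returns None when no read covers the exon, because the ratio is undefined.
    """
    total = n_inclusion + n_exclusion
    if total == 0:
        return None
    return n_inclusion / total


print(inclusion_ratio(30, 10))
print(inclusion_ratio(0, 0))
```

Returning `None` for zero coverage keeps "no evidence" distinct from "exon never included", which a ratio of 0 would otherwise suggest.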

  20. Single-cell (sc) vs. bulk RNA-seq at the gene level. Diagram: from a tissue, scRNA-seq gives a genes × cells matrix, while bulk RNA-seq gives a single genes × tissue profile.

  21. Bulk RNA-seq: transcript/isoform discovery & quantification

  22. AIDE: annotation-assisted isoform discovery (isoform-level)

  23. Isoform discovery: which isoforms are expressed? • More than 90% of genes undergo alternative splicing in mammals [Hooper, Human Genomics, 2014]. • At least 35% of genetic diseases involve abnormal splicing [Manning et al., Nature Reviews Mol. Cell Biol., 2017]. Diagram: gene (AACGTCGT GCTG CCG AATCAA) → alternative splicing → isoform A (AACGUCGU GCUG CCG AAUCAA, exon 2 included) and isoform B (AACGUCGU CCG AAUCAA, exon 2 excluded).

  24. Isoform discovery: which isoforms are expressed? Diagram: RNA-seq data and the genome → statistical modeling → which isoforms of the gene (AACGTCGT GCTG CCG AATCAA) are expressed? Candidate isoforms are built from different subsets of the four exons, e.g. AACGUCGU GCUG CCG AAUCAA.

  25. Challenge 1: large number of candidate isoforms. Variable size (# of candidate isoforms) = $2^{\#\text{ of exons}} - 1$. Diagram: RNA-seq data and the genome → statistical modeling → which isoforms of the gene (AACGTCGT GCTG CCG AATCAA) are expressed? For this 4-exon gene, there are $2^4 - 1 = 15$ candidate isoforms.
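
The count of candidate isoforms for a 4-exon gene can be checked by brute-force enumeration: every non-empty subset of exons, kept in genomic order, is one candidate. The sketch below is illustrative only; the exon labels and the function name are hypothetical.

```python
from itertools import combinations


def candidate_isoforms(exons):
    """List every non-empty subset of exons, preserving genomic order."""
    return [
        subset
        for size in range(1, len(exons) + 1)
        for subset in combinations(exons, size)
    ]


exons = ["exon1", "exon2", "exon3", "exon4"]
candidates = candidate_isoforms(exons)
print(len(candidates))  # equals 2**4 - 1

# The count doubles (plus one) with every extra exon.
for n in (4, 10, 20):
    print(n, 2**n - 1)
```

The exponential growth is what makes exhaustive search infeasible for genes with many exons.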

  26. Challenge 2: great information loss. • RNA-seq reads are very short compared with full-length isoforms. • Most RNA-seq reads do not uniquely map to a single isoform. Diagram: a read aligned to the gene could come from isoform 1, 2, 3, or 4.

  27. Challenge 2: great information loss. • RNA-seq reads are very short compared with full-length isoforms. • Most RNA-seq reads do not uniquely map to a single isoform. • Technical biases are introduced into RNA-seq experiments. Diagram: a read aligned to the gene could come from isoform 1, 2, 3, or 4.
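
Why short reads rarely identify a single isoform can be seen with the toy gene from the alternative splicing slides. The sketch below treats a read as compatible with an isoform when it is an exact substring of the isoform sequence, which is a deliberate simplification (no mismatches, strands, or paired ends). The reads are invented and the function name is hypothetical.

```python
# Toy isoforms in the DNA alphabet, built from exons AACGTCGT, GCTG, CCG, AATCAA.
ISOFORMS = {
    "isoform_A": "AACGTCGTGCTGCCGAATCAA",  # exons 1, 2, 3, 4
    "isoform_B": "AACGTCGTCCGAATCAA",      # exons 1, 3, 4 (exon 2 skipped)
}


def compatible_isoforms(read, isoforms):
    """Return the sorted names of isoforms containing `read` as a substring."""
    return sorted(name for name, seq in isoforms.items() if read in seq)


reads = ["AACGTC", "AATCAA", "CGTGCT", "CGTCCG"]
for read in reads:
    print(read, compatible_isoforms(read, ISOFORMS))
```

Only the reads that span an exon junction (here `CGTGCT` and `CGTCCG`) point to a single isoform; reads lying inside a shared exon are compatible with both.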

  28. Existing isoform discovery methods. State-of-the-art methods for isoform discovery: • SIIER [Jiang et al., Bioinformatics, 2009] • Cufflinks [Trapnell et al., Nature Biotechnology, 2010] • SLIDE [Li et al., Proc. Natl. Acad. Sci., 2011] • StringTie [Pertea et al., Nature Biotechnology, 2015] • ... Limitations: 1. Low accuracy for genes with complex splicing structures. 2. Difficult to improve isoform-level performance [Kanitz et al., Genome Biology, 2015]. 3. Usage of annotations results in false positives.

  29. Usage of annotations results in false positives. Annotated isoforms are experimentally validated. Diagram: a gene with annotated isoforms 1 to 4. • Ensembl database: 203,903 isoforms [Zerbino et al., Nucleic Acids Research, 2017].

  30. Usage of annotations results in false positives. Annotated isoforms are experimentally validated. Diagram: a gene with annotated isoforms 1 to 4. • Ensembl database: 203,903 isoforms [Zerbino et al., Nucleic Acids Research, 2017]. Figure: the isoforms expressed in normal brain shown relative to the annotated isoforms.
