Introduction to RNA-Seq
David Wood Winter School in Mathematics and Computational Biology July 1, 2013
Introduction to RNA-Seq David Wood Winter School in Mathematics and - - PowerPoint PPT Presentation
Introduction to RNA-Seq David Wood Winter School in Mathematics and Computational Biology July 1, 2013 RNA is... Diverse Dynamic Central DNA Epigenetics rRNA RNA tRNA e c n a d n u b A Protein mRNA Time RNA is... Diverse
David Wood Winter School in Mathematics and Computational Biology July 1, 2013
Central
Epigenetics
Diverse
tRNA mRNA rRNA
Dynamic
Time A b u n d a n c e
Quantitative Qualitative Understand the molecular basis of gene function. Classify and transform cellular states Integrative
Central
Epigenetics
Diverse
tRNA mRNA rRNA
Dynamic
Time A b u n d a n c e
Biological System Technology Available Resources
Project DB
Biological System Technology Available Resources
Project DB This talk: Focusing on reference based mammalian RNA-seq analysis
pA pA pA pA
ATG ATG
TSS
transcription start site pA polyadenylation signal protein coding regions
ATG
translation start site
AAA
polyadenylation non-coding regions genomic DNA microRNAs spliced intron
TSS TSS TSS TSS
ATG
AAA
ATG
AAA
ATG
AAA
ATG ATG ATG
AAA AAA
ATG
pA pA pA pA
ATG ATG
TSS
transcription start site pA polyadenylation signal protein coding regions
ATG
translation start site
AAA
polyadenylation non-coding regions genomic DNA microRNAs spliced intron
TSS TSS TSS TSS
ATG
AAA
ATG
AAA
ATG
AAA
ATG ATG ATG
AAA AAA
ATG
PASR miRNA tiRNA
AAA AAA Alu
pA pA pA pA
ATG ATG
TSS
transcription start site pA polyadenylation signal protein coding regions
ATG
translation start site
AAA
polyadenylation non-coding regions genomic DNA microRNAs spliced intron
TSS TSS TSS TSS
ATG
AAA
ATG
AAA
ATG
AAA
ATG ATG ATG
AAA AAA
ATG
PASR miRNA tiRNA
AAA AAA Alu
pA pA pA pA
ATG ATG
TSS
transcription start site pA polyadenylation signal protein coding regions
ATG
translation start site
AAA
polyadenylation non-coding regions genomic DNA microRNAs spliced intron
TSS TSS TSS TSS
ATG
AAA
ATG
AAA
ATG
AAA
ATG ATG ATG
AAA AAA
ATG
PASR miRNA tiRNA
Mutations Allelic Expression RNA Editing
pA pA pA pA
ATG ATG
TSS TSS TSS TSS AAA PASR miRNA
ATG
AAA
ATG
AAA
ATG
AAA
ATG ATG ATG
AAA AAA
ATG
tiRNA
non-spliced reads junction reads strand specific
Cloonan et al. Nat Methods 2008; 5:613-619
AAA Alu
mutations
!" #!!!!" $!!!!!" $#!!!!" %!!!!!" %#!!!!" #&" #'" (!" ($" (%" ()" (*" (#" ((" (+" (&" ('" +!" +$" ,-./01-2340" 5/06789":6-02;/" <-;462/"=;>2/?" @6?-.>.A;/" /1BCD" <06E>;?6/6"
Discovery genes, exons, junctions, UTRs, fusions (Present and Future)
!" #!!!!" $!!!!!" $#!!!!" %!!!!!" %#!!!!" #&" #'" (!" ($" (%" ()" (*" (#" ((" (+" (&" ('" +!" +$" ,-./01-2340" 5/06789":6-02;/" <-;462/"=;>2/?" @6?-.>.A;/" /1BCD" <06E>;?6/6"
Discovery genes, exons, junctions, UTRs, fusions (Present and Future) Dynamic Range
Mortazavi et al. Nat. Methods 2008; 5:621–628
!" #!!!!" $!!!!!" $#!!!!" %!!!!!" %#!!!!" #&" #'" (!" ($" (%" ()" (*" (#" ((" (+" (&" ('" +!" +$" ,-./01-2340" 5/06789":6-02;/" <-;462/"=;>2/?" @6?-.>.A;/" /1BCD" <06E>;?6/6"
Discovery genes, exons, junctions, UTRs, fusions (Present and Future) Dynamic Range
Mortazavi et al. Nat. Methods 2008; 5:621–628
Nucleotide Specific
Design Experiment Sample Acquisition Field / Clinic / Lab Validation Verification Sample Acquisition Run Experiment Obtain RNA Make Library Sequencing Base Calling Mapping Library QC Publish Analysis Interpretation 1° 2° 3° 3° 2° Field / Clinic Wet Lab Dry Lab
Design Experiment Sample Acquisition Field / Clinic / Lab Validation Verification Sample Acquisition Run Experiment Obtain RNA Make Library Sequencing Base Calling Mapping Library QC Publish Analysis Interpretation 1° 2° 3° 3° 2° Field / Clinic Wet Lab Dry Lab
Design Experiment Sample Acquisition Field / Clinic / Lab Validation Verification Sample Acquisition Run Experiment Obtain RNA Make Library Sequencing Base Calling Mapping Library QC Publish Analysis Interpretation 1° 2° 3° 3° 2° Field / Clinic Wet Lab Dry Lab
Design Experiment Sample Acquisition Field / Clinic / Lab Validation Verification Sample Acquisition Run Experiment Obtain RNA Make Library Sequencing Base Calling Mapping Library QC Publish Analysis Interpretation 1° 2° 3° 3° 2° Field / Clinic Wet Lab Dry Lab
AAAAA AAAAA AAAAA AAAAA A AAA
Fragment ds-cDNA synthesis Ligate adaptors + Amplify Target RNA
rRNA (80%)
tRNA (15%) 5%
cellular RNA
Deplete rRNA Enrich polyA RNA Profile (ribosomes) Capture (tiling arrays)
Sequencing
Design Experiment Sample Acquisition Field / Clinic / Lab Validation Verification Sample Acquisition Run Experiment Obtain RNA Make Library Sequencing Base Calling Mapping Library QC Publish Analysis Interpretation 1° 2° 3° 3° 2° Field / Clinic Wet Lab Dry Lab
ATG
AAA
Challenge #1: Introns
ATG
AAA
Challenge #1: Introns Align to database
transcriptome
Wood et al. Bioinformatics 2011; 27:580–581
Split Read Alignments
Trapnell et al. Bioinformatics 2009; 25:1105-11
ATG
AAA
Challenge #1: Introns Challenge #2: Correctness Sufficient Overlap Sufficient Evidence Align to database
transcriptome
Wood et al. Bioinformatics 2011; 27:580–581
Split Read Alignments
Trapnell et al. Bioinformatics 2009; 25:1105-11
ATG
AAA
Challenge #1: Introns Challenge #2: Correctness Sufficient Overlap Sufficient Evidence Align to the transcriptome Challenge #3: Multi-mappers Sequence Similarity Align to database
transcriptome
Wood et al. Bioinformatics 2011; 27:580–581
Split Read Alignments
Trapnell et al. Bioinformatics 2009; 25:1105-11
Data QC (clipping) Align to Filter Set Align to ‘genome’ Align to ‘junctions’ Split read Alignment Choose Alignments, Disambiguate Exclude Flag and Exclude
Tophat: Trapnell et al. Bioinformatics 2009; 25:1105-11
Data QC (clipping) Align to Filter Set Align to ‘genome’ Align to ‘junctions’ Split read Alignment Choose Alignments, Disambiguate Exclude Flag and Exclude BAM BAM BAM Alignment Filtering Analysis Library QC
Tophat: Trapnell et al. Bioinformatics 2009; 25:1105-11
reference? diploid? gene model? ESTs? Algorithm? rRNA, tRNA ?
Data QC (clipping) Align to Filter Set Align to ‘genome’ Align to ‘junctions’ Split read Alignment Choose Alignments, Disambiguate Exclude Flag and Exclude BAM BAM BAM Alignment Filtering Analysis Library QC
Tophat: Trapnell et al. Bioinformatics 2009; 25:1105-11
Design Experiment Sample Acquisition Field / Clinic / Lab Validation Verification Sample Acquisition Run Experiment Obtain RNA Make Library Sequencing Base Calling Mapping Library QC Publish Analysis Interpretation 1° 2° 3° 3° 2° Field / Clinic Wet Lab Dry Lab
AAAAA AAAAA AAAAA AAAAA A AAA
Fragment ds-cDNA synthesis Ligate adaptors + Amplify Target RNA
rRNA (80%)
tRNA (15%) 5%
cellular RNA
Deplete rRNA Enrich polyA RNA Profile (ribosomes) Capture (tiling arrays)
Sequencing
AAAAA AAAAA AAAAA AAAAA A AAA
Fragment ds-cDNA synthesis Ligate adaptors + Amplify Target RNA
rRNA (80%)
tRNA (15%) 5%
cellular RNA
Deplete rRNA Enrich polyA RNA Profile (ribosomes) Capture (tiling arrays)
Sequencing
Affects RNA content (Expression quantification)
AAAAA AAAAA AAAAA AAAAA A AAA
Fragment ds-cDNA synthesis Ligate adaptors + Amplify Target RNA
rRNA (80%)
tRNA (15%) 5%
cellular RNA
Deplete rRNA Enrich polyA RNA Profile (ribosomes) Capture (tiling arrays)
Sequencing
Affects RNA content (Expression quantification) Affects Insert Size (transcript identification)
AAAAA AAAAA AAAAA AAAAA A AAA
Fragment ds-cDNA synthesis Ligate adaptors + Amplify Target RNA
rRNA (80%)
tRNA (15%) 5%
cellular RNA
Deplete rRNA Enrich polyA RNA Profile (ribosomes) Capture (tiling arrays)
Sequencing
Affects RNA content (Expression quantification) Affects Insert Size (transcript identification) Affects Strand Specificity
AAAAA AAAAA AAAAA AAAAA A AAA
Fragment ds-cDNA synthesis Ligate adaptors + Amplify Target RNA
rRNA (80%)
tRNA (15%) 5%
cellular RNA
Deplete rRNA Enrich polyA RNA Profile (ribosomes) Capture (tiling arrays)
Sequencing
Affects RNA content (Expression quantification) Affects Insert Size (transcript identification) Affects Strand Specificity Affects Library Complexity (Tag uniqueness)
AAAAA AAAAA AAAAA AAAAA A AAA
Fragment ds-cDNA synthesis Ligate adaptors + Amplify Target RNA
rRNA (80%)
tRNA (15%) 5%
cellular RNA
Deplete rRNA Enrich polyA RNA Profile (ribosomes) Capture (tiling arrays)
Sequencing
Affects RNA content (Expression quantification) Affects Insert Size (transcript identification) Affects Strand Specificity Affects Library Complexity (Tag uniqueness) Affects Mapping Rate Paired-end?
Design Experiment Sample Acquisition Field / Clinic / Lab Validation Verification Sample Acquisition Run Experiment Obtain RNA Make Library Sequencing Base Calling Mapping Library QC Publish Analysis Interpretation 1° 2° 3° 3° 2° Field / Clinic Wet Lab Dry Lab
ATG
AAA
ATG
Gene A 3500nt (700 reads) Gene B 400nt (160 reads)
AAA
Mortazavi et al. Nat. Methods 2008; 5:621–628
ATG
AAA
ATG
Gene A 3500nt (700 reads) Gene B 400nt (160 reads)
AAA
RPKM = 2.0 RPKM = 4.0
Reads ¡Per ¡Kilobase ¡ ¡per ¡Million
ATG
AAA
Repeat
Normalise to “mappable” gene length
Koehler et al. Bioinformatics 2010
ATG
AAA
Repeat
Normalise to “mappable” gene length
Koehler et al. Bioinformatics 2010 Robinson et al. Genome Biology 2010; 11:R25
Scale Expression Values by TMM
Cellular RNA
ATG
AAA
Repeat
Normalise to “mappable” gene length
Koehler et al. Bioinformatics 2010 Robinson et al. Genome Biology 2010; 11:R25
Scale Expression Values by TMM
Cellular RNA
RPKM
ATG
AAA
Repeat
Normalise to “mappable” gene length
Koehler et al. Bioinformatics 2010 Robinson et al. Genome Biology 2010; 11:R25
Scale Expression Values by TMM
Benjamini et al. NAR; 2012
Normalise to GC content of region
ATG
AAA
ATG
AAA
ATG
AAA
ATG
AAA
Exonic Region
ATG
AAA
ATG
AAA
Exonic Region Exon Junction
ATG
AAA
ATG
AAA
Exonic Region Exon Junction Intronic Region
ATG
AAA
ATG
AAA
Exonic Region Exon Junction Intronic Region Exon Boundary
ATG
AAA
ATG
AAA
Exonic Region Exon Junction Intronic Region Exon Boundary Intergenic Region
ATG
AAA
ATG
AAA
Exonic Region Exon Junction Intronic Region Exon Boundary Intergenic Region
Calculate RPKM for any feature
ATG
AAA
ATG
AAA
Exonic Region Exon Junction Intronic Region Exon Boundary Intergenic Region
Calculate RPKM for any feature Extended 3’ UTR
ATG
AAA
ATG
AAA
ATG
AAA
Exonic Region Exon Junction Intronic Region Exon Boundary Intergenic Region
Calculate RPKM for any feature Extended 3’ UTR
ATG
AAA
ATG
AAA
Retained Intron
ATG
AAA
ATG
AAA
ATG
AAA
ATG
ATG
AAA
ATG
AAA
ATG
AAA
ATG
diagnostic feature
ATG
AAA
ATG
AAA
ATG
AAA
ATG
diagnostic feature Approach #1: Expression calculated using diagnostic features Strong Evidence Excludes Transcripts Sampling Variability Lacks statistical robustness Easy to calculate Dependent on gene model
ALEXA-seq: Griffith et al. Nat. Methods 2010; 11:R25
ATG
AAA
ATG
AAA
ATG
AAA
ATG
ATG
AAA
ATG
AAA
ATG
AAA
ATG
Approach #2: Expression estimated Construct bipartite graph, then finds minimum path
Cufflinks: Trapnell et al. Nat. Biotech. 2010, 28:511-515
ATG
AAA
ATG
AAA
ATG
AAA
ATG
Estimates expression for all transcripts Model can fail in complex / highly expressed regions More statistically robust Error rate largely unknown Incorporates ambiguous reads Approach #2: Expression estimated Construct bipartite graph, then finds minimum path
Cufflinks: Trapnell et al. Nat. Biotech. 2010, 28:511-515
ATG
AAA
ATG
AAA
ATG
AAA
Frequency log2 (expression) not “expressed” “expressed” Need to determine ‘expression’ cut-off value
Expressed if > 1 RPKM 1 Lacks sensitivity Arbitrary Has literature support
Expressed if > 1 RPKM 1 Expressed if above intergenic background 2
log2 Expression Frequency 95th percentile
Lacks sensitivity Arbitrary Has literature support
Expressed if > 1 RPKM 1 Expressed if above intergenic background 2
log2 Expression Frequency 95th percentile
Cut-off based
evidence Still somewhat arbitrary Lacks sensitivity Arbitrary Has literature support
Expressed if > 1 RPKM 1 Expressed if above intergenic background 2
log2 Expression Frequency 95th percentile
Cut-off based
evidence Still somewhat arbitrary Incorporate replicate information 3 Based on
reproducibility Requires replicates Lacks sensitivity Arbitrary Has literature support
−log2 (expression) bins np−IDR Rep 1 vs Rep 2 Rep 2 vs Rep 1 Mean Cut−off 0.1 0.3 0.5 0.7 0.9 1 −11 −7 −3 1 5 9 13 17 21 25
Expressed if > 1 RPKM 1 Expressed if above intergenic background 2
log2 Expression Frequency 95th percentile
Cut-off based
evidence Still somewhat arbitrary Incorporate replicate information 3 Based on
reproducibility Requires replicates Lacks sensitivity Arbitrary Has literature support
−log2 (expression) bins np−IDR Rep 1 vs Rep 2 Rep 2 vs Rep 1 Mean Cut−off 0.1 0.3 0.5 0.7 0.9 1 −11 −7 −3 1 5 9 13 17 21 25Expressed if > 1 RPKM 1 Expressed if above intergenic background 2
log2 Expression Frequency 95th percentile
Cut-off based
evidence Still somewhat arbitrary Incorporate replicate information 3 Based on
reproducibility Requires replicates Choose what is reasonable for your experiment, be consistent! Lacks sensitivity Arbitrary Has literature support
−log2 (expression) bins np−IDR Rep 1 vs Rep 2 Rep 2 vs Rep 1 Mean Cut−off 0.1 0.3 0.5 0.7 0.9 1 −11 −7 −3 1 5 9 13 17 21 25ATG
AAA
ATG
AAA ICR
Imprinting
ATG
AAA
ATG
AAA
Imprinting
sQTL eQTL
ATG
AAA
ATG
AAA
Imprinting
sQTL eQTL
Complex Traits
ATG
AAA
ATG
AAA
Imprinting
eQTL
Complex Traits A B C SNPs Allelic Fraction
sQTL
ATG
AAA
ATG
AAA
Imprinting
eQTL
Complex Traits A B C SNPs Allelic Fraction
sQTL
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.5 1.0 1.5 2.0 Fraction of RNA−seq Reads Matching Reference Allele Density Expected Mean Observed Mean
Degner et al. Bioinformatics 2009
Reference bias
ATG
AAA
ATG
AAA
Imprinting
eQTL
Complex Traits A B C SNPs Allelic Fraction
sQTL
Map to a diploid genome
AlleleSeq: Rozowsky et al. Mol. Sys. Bio 2011
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.5 1.0 1.5 2.0 Fraction of RNA−seq Reads Matching Reference Allele Density Expected Mean Observed Mean
Degner et al. Bioinformatics 2009
Reference bias
Design Experiment Sample Acquisition Field / Clinic / Lab Validation Verification Sample Acquisition Run Experiment Obtain RNA Make Library Sequencing Base Calling Mapping Library QC Publish Analysis Interpretation 1° 2° 3° 3° 2° Field / Clinic Wet Lab Dry Lab
Single Cell
Shalek, et al. Nature 2013
Single Cell
Shalek, et al. Nature 2013
Huge Cohort
900 donors 30,000 RNA-seq data sets! Genotype-Tissue Expression project (GTEx)
Lonsdale, et al. Nature Genetics 2013
Choose an alignment approach suitable for your experiment, available resources and tools Assess library quality, specifically rRNA contamination, insert size, strand specificity and library complexity Gene and ‘Feature’ Expression can be calculated using count data, and normalised by length, library size and GC content Transcript expression calculation requires alternative approaches and algorithms, which although common, are largely unproven RNA-seq can interrogate nucleotide specific questions, but be careful of alignment biases (diploid mapping can help here) 1 2 3 4 5
Cloonan et al. Nat Methods 2008; Stem cell transcriptome profiling via massive-scale mRNA sequencing Mortazavi et al. Nat. Methods 2008; Mapping and quantifying mammalian transcriptomes by RNA-Seq Wood et al. Bioinformatics 2011; X-MATE: A flexible system for mapping short read data Trapnell et al. Bioinformatics 2009; TopHat: discovering splice junctions with RNA-Seq Koehler et al. Bioinformatics 2010. The Uniqueome: A mappability resource for short-tag sequencing Robinson et al. Genome Biology 2010; A scaling normalization method for differential expression analysis of RNA-seq data. Benjamini et al. NAR; 2012. Summarizing and correcting the GC content bias in high-throughput sequencing Griffith et al. Nat. Methods 2010; Alternative expression analysis by RNA sequencing. Trapnell et al. Nat. Biotech. 2010; Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform Degner et al. Bioinformatics 2009; Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing Rozowsky et al. Mol. Sys. Bio 2011; AlleleSeq: analysis of allele-specific expression and binding in a Shalek, et al. Nature 2013; Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells Lonsdale, et al. Nature Genetics 2013; The Genotype-Tissue Expression (GTEx) project.