Differential gene expression analysis using RNA-seq
Applied Bioinformatics Core
https://abc.med.cornell.edu/
Di ff erential gene expression analysis using RNA-seq Applied - - PowerPoint PPT Presentation
Di ff erential gene expression analysis using RNA-seq Applied Bioinformatics Core https://abc.med.cornell.edu/ Day 1: Introduction 1. RNA isolation & library preparation 2. Illuminas sequencing by synthesis 3. experimental design 4. raw
Applied Bioinformatics Core
https://abc.med.cornell.edu/
Reuter et al. (Mol Cell, 2015). High-Throughput Sequencing Technologies. doi:10.1016/j.molcel.2015.05.004
“RNA%seq)is)not$a$mature$technology.)It$is$ undergoing$rapid$evolution$of)biochemistry)
platforms;)of)computational)pipelines;)and)of$ subsequent$analysis$methods$that$include$ statistical$treatments)and)transcript)model) building.)“)
ENCODE&consortium&
Sample type & quality Sequencing
Experimental design
Library preparation
Bioinformatics
Biological question
NOT COVERED:
discovery
assembly
analysis
(see the course notes for references to useful reviews)
RNA cDNA with adapters cells fragments
28S:18S ratio avoid degraded RNA junk
3’ adapter ligation random priming and reverse transcription second strand synthesis PCR end repair, A- addition, adapter ligation PCR
U U U U U U U
reverse transcription PCR rRNA depletion/mRNA enrichment fragmentation 5’ adapter ligation end repair, A- addition, adapter ligation
U U
classical Illumina protocol (unstranded)
Van Dijk et al. (2014).. Experimental Cell Research, 322(1), 12–20. doi:10.1016/j.yexcr.2014.01.008
sequential ligation of two different adapters
RNA extraction poly(A) enrichment
ribo-depletion
dUTP stranded library preparation
QC!
RNA cDNA with adapters cells fragments
flowcell with primers
http://informatics.fas.harvard.edu/test-tutorial-page/
bridge amplification denaturation cluster generation
removal of complementary strands ! identical fragment copies remain
http://informatics.fas.harvard.edu/test-tutorial-page/
repeat for 50 – 100 bp generate base calls
Image from Illumina
labelled dNTP
“Overall,)our)results)indicate)that) there)is)considerable$RNA$ expression$diversity$between$ humans$and$mice,)well)beyond) what)was)described)previously,) likely)reBlecting)the)fundamental) physiological)differences)between) these)two)organisms.)“)
Comparison of the transcriptional landscapes between human and mouse tissues
Shin Lina,b,1, Yiing Linc,1, Joseph R. Neryd, Mark A. Urichd, Alessandra Breschie,f, Carrie A. Davisg, Alexander Dobing, Christopher Zaleskig, Michael A. Beerh, William C. Chapmanc, Thomas R. Gingerasg,i, Joseph R. Eckerd,j,2, and Michael P. Snydera,2
aDepartment of Genetics, Stanford University, Stanford, CA 94305; bDivision of Cardiovascular Medicine, Stanford University, Stanford, CA 94305; cDepartment of Surgery, Washington University School of Medicine, St. Louis, MO 63110; dGenomic Analysis Laboratory, The Salk Institute for BiologicalStudies, La Jolla, CA 92037; eCentre for Genomic Regulation and UPF, Catalonia, 08003 Barcelona, Spain; fDepartament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, 08003 Barcelona, Spain; gFunctional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11742; hMcKusick-Nathans Institute of Genetic Medicine and the Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205; iAffymetrix, Inc., Santa Clara, CA 95051; and jHoward Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA 92037 Contributed by Joseph R. Ecker, July 23, 2014 (sent for review May 23, 2014)
major projects. Thirteen of the mouse and human orthologous
Leek et al. (2010). Nature Reviews. Genetics, 11(10), 733–
Gilad Y and Mizrahi-Man O 2015. F1000Research 2015, 4:121 (doi: 10.12688/ f1000research.6536.1)
Once)we)accounted)for)the)batch)effect) (…),)the)comparative)gene)expression) data)no)longer)clustered)by)species,)and) instead,)we)observed)a)clear)tendency) for)clustering)by)tissue.)
GC content
inclusion of multi-mapped reads exclusion of multi-mapped reads
gene’s expression in an individual replicate with the trimmed mean across all repli- 𝑜𝑢 𝑦̅;𝑜𝑢 𝑡;𝑜𝑢 |𝑦𝑗 − 𝑦̅;𝑜𝑢| > 𝑜𝑡𝑡;𝑜𝑢 𝑜𝑡 𝑔
𝑗
𝑗 𝑜𝑢 = 3 𝑜𝑡 = 5 𝑜𝑢 𝑜𝑡 𝑜𝑡 = 3 – the read depth profiles between clean and “bad” replicates is shown in “bad” replicate “clean” replicates “ ”
from all “clean” “clean” “bad”
log2 Counts
10.16 10.18 10.20 10.22 10.24 10.26
condition 1 condition 2
Gene X
Technical replicates
library prep sequencing lane RNA extraction library prep sequencing lane sequencing lane sequencing lane RNA from an independent growth of cells/tissue RNA extraction RNA extraction
Biological replicates
sequencing lane library prep sequencing lane sequencing lane sequencing lane
Auer, P. L., & Doerge, R. W. (2010). Genetics, 185(2), 405–16. Don’t put all your eggs into the same basket! ! multiplex all samples!
RNA cDNA with adapters cells fragments
Images List of fold changes & statistical values
Raw reads
Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes
FASTQC Base calling & demultiplexing Mapping Counting
DE test & multiple testing correction
Normalizing Filtering
.tif .fastq
.sam/.bam .txt .Robj .Robj, .txt
Raw reads
Bustard/RTA/OLB, CASAVA STAR featureCounts DESeq2, edgeR DESeq2, edgeR, limma Customized scripts
Sequence Read Archive GenBank DDBJ ENA
http://www.ncbi.nlm.nih.gov/genbank/ http://www.ddbj.nig.ac.jp/intro-e.html
https://www.ebi.ac.uk/ena/
will retrieve it via ENA (https://www.ebi.ac.uk/ena/)
ls mkdir wget cut grep awk
⌥
@ERR459145 .1 DHKW5DQ1 :219: D0PT7ACXX :2:1101:1590:2149/1 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC + @7 <DBADDDBH?DHHI@DH > HHHEGHIIIGGIFFGIBFAAGAFHA ’5? B@D
1 2 3 4
⌥
@ERR459145 .1 DHKW5DQ1 :219: D0PT7ACXX :2:1101:1590:2149/1 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC + @7 <DBADDDBH?DHHI@DH > HHHEGHIIIGGIFFGIBFAAGAFHA ’5? B@D
base error probability p,
e.g. 10e-4!
Phred score,
e.g.: 40
“FASTQ score”, e.g.: ( turn score into ASCII symbol http://www.ascii-code.com/
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS..................................................... ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...................... ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...................... .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ...................... LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.................................................... !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ | | | | | | 33 59 64 73 104 126 0........................26...31.......40
0........9.............................40 3.....9.............................40 0.2......................26...31........41 S - Sanger Phred+33, raw reads typically (0, 40) X - Solexa Solexa+64, raw reads typically (-5, 40) I - Illumina 1.3+ Phred+64, raw reads typically (0, 40) J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) (Note: See discussion above). L - Illumina 1.8+ Phred+33, raw reads typically (0, 41)
also see Table 2 in the course notes
image from https://en.wikipedia.org/wiki/FASTQ_format
http://www.bioinformatics.babraham.ac.uk/projects/fastqc
FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. The main functions of FastQC are:
the interactive application
$ mat/software/FastQC/fastqc
Finding out where the reads came from
Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes
FASTQC Base calling & demultiplexing Mapping
.tif .fastq .sam/.bam
Aligned reads
Bustard/RTA/OLB, CASAVA STAR
RSeQC, QoRTs
AAAAAA
full-length mRNA
5’ 3’
cDNA fragments
exon exon sequencing
reads aligned reads
AAAAAA
full-length mRNA
5’ 3’
cDNA fragments
exon exon sequencing
reads aligned reads
Spliced alignment tools usually need: 1) reference genome 2) annotation
(sequence, coordinates, types of elements)
curation (“evidence-based”)
https://commons.wikimedia.org/wiki/File:Human_genome_to_genes.png
RefSeq ncbi.nlm.nih.gov/refseq UCSC Known Genes genome.ucsc.edu Ensembl/Gencode gencodegenes.org
Zhao & Zhang (2015) BMC Genomics. doi:10.1186/s12864-015-1308-8
1/3 protein-coding genes > 17,000 non-coding RNAs > 15,000 pseudogenes
reads (but lowest genome coverage)
tissue
Junction-spanning reads
Zhao & Zhang (2015) BMC Genomics. doi:10.1186/s12864-015-1308-8
Which data base to use depends on your question!
expressed genes are more likely to be part
show different mapping rates!
“More)sensitive)annotations,)such)as)Ensembl) (…))should$be$preferred)over)more)speciBic) annotations,)such)as)RefSeq)(…))if)the)aim)is) to)obtain)accurate)expression)estimates.“)
Janes&et&al.&(Brie7ings&in&Bioinformatics,&2015).&doi: 10.1093/bib/bbv007&
“We)observe)that)RefSeq$Genes$produces$the$ most$accurate$fold@change$measures$with) respect)to)a)ground)truth)of)RT%qPCR)gene) expression)estimates.)“)
Wu&et&al.&(BMC&Bioinfo,&2013).&doi: 10.1186/1471M2105M14MS11MS8&
“In)practice,)there)is)no$simple$answer$to$this$question,)and)it)depends)on)the)purpose)
mind)that)no$database$is$perfect)and)some$gene$annotations$might$be$inaccurate$
Zhao&&&Zhang&(BMC&Genomics,&2015).&doi:10.1186/s12864M015M1308M8&
vs. ENSEMBL 0 unambiguous reads for LUZP6; x number of reads for MTPN RefSeq 0 reads
the same number of reads for both genes vs. 1 2 2 1
, BED, SAF…
GTF (“GFF2.5”)
1.
reference coordinate
2.
source
3.
annotation type
4.
start position
5.
end position
6.
score
7.
strand
8.
frame/phase
9.
attributes: <TYPE VALUE>
zero-based, half open ATG location: 6 - 9 or [6,9) Cut site: 11-11 or [11,11) Interval length = stop - start
http://alternateallele.blogspot.de/2012/03/genome-coordinate-conventions.html
ATG location: 7 - 9 or [7,9] Cut site: 11^12 or (11,12) Interval length = stop - start + 1
http:// alternateallele.blogspot.com/ 2012/03/genome-coordinate-cheat- sheet.html
Dobin (2013). Bioinformatics, 29(1), 15–21. doi:10.1093/bioinformatics/bts635
TopHat2 ann TopHat2 TopHat1 ann TopHat1 STAR 2-pass ann STAR 2-pass STAR 1-pass ann STAR 1-pass SMALT ReadsMap PASS cons PASS PALMapper cons ann PALMapper cons PALMapper ann PALMapper MapSplice ann MapSplice GSTRUCT ann GSTRUCT GSNAP ann GSNAP GEM cons ann GEM cons GEM ann BAGET ann 20 40 60 80 100 0
Engström et al. (2013) Nature Methods, 10(12), 1185–1191. doi:10.1038/nmeth.2722
Spliced alignment programs
(use it on the server!)
(aka maximum matching portion)
alignment is not possible
is then aligned again
stitched together
Dobin (2013). Bioinformatics, 29(1), 15–21. doi:10.1093/bioinformatics/bts635
TopHat2 ann TopHat2 TopHat1 ann TopHat1 STAR 2-pass ann STAR 2-pass STAR 1-pass ann STAR 1-pass SMALT ReadsMap PASS cons PASS PALMapper cons ann PALMapper cons PALMapper ann PALMapper MapSplice ann MapSplice GSTRUCT ann GSTRUCT GSNAP ann GSNAP GEM cons ann GEM cons GEM ann BAGET ann 20 40 60 80 100 0
Engström et al. (2013) Nature Methods, 10(12), 1185–1191. doi:10.1038/nmeth.2722
Spliced alignment programs
STAR has numerous options! tune them to meet your needs Current Protocols in Bioinformatics (Sept 2015) DOI: 10.1002/0471250953.bi1114s51 and STARmanual.pdf
novel splice junctions
novel splice junctions
Let’s align the reads for WT_1!
Binary (Decimal) Hex Description 00000000001 (1) 0x1 Is the read paired? 00000000010 (2) 0x2 Are both reads in a pair mapped “properly” (i.e., in the correct
00000000100 (4) 0x4 Is the read itself unmapped? 00000001000 (8) 0x8 Is the mate read unmapped? 00000010000 (16) 0x10 Has the read been mapped to the reverse strand? 00000100000 (32) 0x20 Has the mate read been mapped to the reverse strand? 00001000000 (64) 0x40 Is the read the first read in a pair? 00010000000 (128) 0x80 Is the read the second read in a pair? 00100000000 (256) 0x100 Is the alignment not primary? (A read with split matches may have multiple primary alignment records.) 01000000000 (512) 0x200 Does the read fail platform/vendor quality checks? 10000000000 (1024) 0x400 Is the read a PCR or optical duplicate?
2nd field: binary FLAG most common FLAGS for SR: 0; 4; 16
https://broadinstitute.github.io/ picard/explain-flags.html
6th field: CIGAR string – what was necessary to align the read to the indicated locus? M: alignment (match or mismatch!!) I (N): insertion (large insertions) D: deletion S/H: clipping
AS:i Alignment score BC:Z Barcode sequence HI:i Query is i-th hit stored in the file NH:i Number of reported alignments for the query sequence NM:i Edit distance of the query to the reference MD:Z String that contains the exact positions of mismatches (should complement the CIGAR string) RG:Z Read group (should match the entry after ID if @RG is present in the header.
after 11th field: OPTIONAL information
Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes
FASTQC Base calling & demultiplexing Mapping
.tif .fastq
.sam/.bam
Aligned reads
Bustard/RTA/OLB, CASAVA STAR
RSeQC, QoRTs
http://software.broadinstitute.org/software/igv/download
http://chagall.med.cornell.edu/RNASEQcourse/
starting with IGV 2.3, Sashimi plots can easily be created http://software.broadinstitute.org/software/igv/Sashimi
many options!
Lahens et al. Genome Biology 2014 15:R86 doi:10.1186/gb-2014-15-6-r86
data repositories
ARTICLE
Received 7 May 2015 | Accepted 15 Jun 2015 | Published 3 Aug 2015
mRIN for direct assessment of genome-wide and gene-specific mRNA integrity from large-scale RNA-sequencing data
Huijuan Feng1,2, Xuegong Zhang1 & Chaolin Zhang2
DOI: 10.1038/ncomms8816
OPEN
https://github.com/friedue/course_RNA-seq2015 ! exercise #3
from alignments to count tables
Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes
Base calling & demultiplexing Mapping
.tif .fastq
.sam/.bam Bustard/RTA/OLB, CASAVA STAR
Counting
featureCounts
Which regions are expressed? How much are they expressed?
https://www.youtube.com/watch?v=ztyjiCCt_lM Disclaimer: There are 2 schools of thought. We follow the one that’s based on the raw reads and gene overlaps. See the course notes for references for the arguments from the other side.
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html#count
featureCounts will use read-gene
multi-overlap reads
http://www.nature.com/nmeth/journal/v8/n6/full/nmeth.1613.html
reconstruction suffers from bad precision and bad sensitivity => many FP transcripts (esp. for tricky transcriptomes)!
a considerable portion of the reads
Janes et al. (2015). Briefings in Bioinformatics, (January), 1–9. doi:10.1093/bib/bbv007
instead of transcript reconstruction, perhaps resort to:
Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes
FASTQC Base calling & demultiplexing Mapping
.tif .fastq
.sam/.bam Bustard/RTA/OLB, CASAVA STAR
RSeQC Counting
featureCounts
Normalizing
DESeq2, edgeR
fragments overlapping with the union of exons of a gene
strongly influenced by:
sample
TPMi = Xi li ∗ 1 P
j
Xj lk ∗ 106
RPKMi = Xi ( li 103)( N 106)
fragments overlapping with the union of exons of a gene
Kilobase of gene per Million reads mapped
normalized for small counts and library size
gene read counts per bp all gene counts per all gene bp
Avoid [RF]PKM and total read count normalization for DGE!
FP rates for varying % of DE genes (0-30%)
Dillies et al.(2012). Briefings in Bioinformatics. doi:10.1093/bib/bbs046
if you need normalized expression values, use TPM
DESeq’s size factor normalization
hetero- skedasticity
log transformation and variance stabilization (DESeq’s rlog() )
clustering tends to work better on normalized & transformed read counts
Hierarchical clustering
Experiment 1 Experiment 2
single-gene clusters are successively joined
+ “unbiased”
Can be done at virtually every step of the analysis. Choose the version that makes most sense to you.
https://www.ncbi.nlm.nih.gov/nuccore/BK000964 (mouse)
tRNAs), then align your reads and only use those that do not map for the next round of alignment
genes in your subsequent downstream analysis
Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes
FASTQC Base calling & demultiplexing Mapping
.tif .fastq
.sam/.bam Bustard/RTA/OLB, CASAVA STAR
RSeQC Counting
HTSeq, featureCounts
Normalizing
DESeq2, edgeR .txt .Robj .Robj, .txt
Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes
FASTQC Base calling & demultiplexing Mapping
.tif .fastq
.sam/.bam Bustard/RTA/OLB, CASAVA STAR
RSeQC Counting
HTSeq, featureCounts
DE test & multiple testing correction
DESeq2, edgeR, limma
Normalizing
DESeq2, edgeR .txt .Robj .Robj, .txt Descriptive plots
taking into account differences in sequencing depth, technical, and biological read count variability.
the difference accounting for performing thousands of tests. 1 test per gene! Garber et al. (2011) Nature Methods, 8(6), 469–477. doi:10.1038/nmeth.1613 H0: no difference in the read distribution between two conditions
logFC (adjusted) p-value
read counts for gene i and sample j fitted mean gene-specific dispersion parameter (fitted towards the average dispersion) moderated log-fold change for gene i model matrix column for sample j library size factor expression value estimate
average norm. count standard error estimate for the logFC
are seeing?
How to decide which tool(s) to use?
WALK-IN CLINICS
@ WCM: Thursdays, 1:30 – 3 pm, LC-504 (1300 York Ave) @ MSKCC: https://www.mskcc.org/ research-advantage/core- facilities/bioinformatics
biostars.org stackoverflow.com bioconductor.org/help/workflows mailing lists of the individual tools seqanswers.com
abc@med.cornell.edu
supplemental material of publications based on HTS data
Sample type & quality
Sequencing
Experimental design
Library preparation
ribo minus
Bioinformatics
Biological question