[PPT] - Di ff erential gene expression analysis using RNA-seq Applied PowerPoint Presentation

SLIDE 1

Differential gene expression analysis using RNA-seq

Applied Bioinformatics Core

https://abc.med.cornell.edu/

SLIDE 2

Day 1: Introduction

1. RNA isolation & library preparation
2. Illumina’s sequencing by synthesis
3. experimental design
4. raw sequencing reads
download
quality control

SLIDE 3

RNA-seq is popular, but still developing

Reuter et al. (Mol Cell, 2015). High-Throughput Sequencing Technologies. doi:10.1016/j.molcel.2015.05.004

“RNA%seq)is)not$a$mature$technology.)It$is$ undergoing$rapid$evolution$of)biochemistry)

f)sample)preparation;)of)sequencing)

platforms;)of)computational)pipelines;)and)of$ subsequent$analysis$methods$that$include$ statistical$treatments)and)transcript)model) building.)“)

ENCODE&consortium&

SLIDE 4

What to expect from the class

Sample type & quality Sequencing

Read length
PE vs. SR
Sequencing errors

Experimental design

Controls
No. of replicates
Randomization

Library preparation

Poly-A enrichment
vs. ribo minus
Strand information

Bioinformatics

Aligner
Normalization
DE analysis strategy
Expression quantification
Alternative splicing
De novo assembly needed
mRNAs, small RNAs
….

Biological question

NOT COVERED:

novel transcript

discovery

transcriptome

assembly

alternative splicing

analysis

(see the course notes for references to useful reviews)

SLIDE 5

RNA-seq workflow overview

Sequencing Sequencing

Fragmentation
mRNA enrichment
Library preparation

Bioinformatics Bioinformatics

Cluster generation
Sequencing by synthesis
Image acquisition

Total RNA e

tal RNA extr

xtraction action

RNA cDNA with adapters cells fragments

SLIDE 6

Quality control of RNA extraction

28S:18S ratio avoid degraded RNA junk

SLIDE 7

RNA-seq library preparation

3’ adapter ligation random priming and reverse transcription second strand synthesis PCR end repair, A- addition, adapter ligation PCR

U U U U U U U

reverse transcription PCR rRNA depletion/mRNA enrichment fragmentation 5’ adapter ligation end repair, A- addition, adapter ligation

U U

classical Illumina protocol (unstranded)

Van Dijk et al. (2014).. Experimental Cell Research, 322(1), 12–20. doi:10.1016/j.yexcr.2014.01.008

sequential ligation of two different adapters

RNA extraction poly(A) enrichment

r

ribo-depletion

dUTP stranded library preparation

QC!

SLIDE 8

RNA-seq workflow overview

Sequencing Sequencing Total RNA e

tal RNA extr

xtraction action

RNA cDNA with adapters cells fragments

flowcell with primers

http://informatics.fas.harvard.edu/test-tutorial-page/

SLIDE 9

Cluster generation

bridge amplification denaturation cluster generation

removal of complementary strands ! identical fragment copies remain

http://informatics.fas.harvard.edu/test-tutorial-page/

SLIDE 10

Sequencing by synthesis

1. extend 1st base
2. read
3. deblock

repeat for 50 – 100 bp generate base calls

Image from Illumina

labelled dNTP

SLIDE 11

How deep is deep enough?

Goals that require more, longer and possibly paired- end reads:

quantification of lowly expressed genes
identification of genes with small changes between conditions
investigation of alternative splicing/isoform quantification
identification of novel transcripts, chimeric transcripts
de novo transcriptome assembly

for DGE (logFC~ 2) in mammals: 20 – 50 20 – 50 mio mio SR, 75 SR, 75 bp bp

SLIDE 12

Many sources for biased signals

“Overall,)our)results)indicate)that) there)is)considerable$RNA$ expression$diversity$between$ humans$and$mice,)well)beyond) what)was)described)previously,) likely)reBlecting)the)fundamental) physiological)differences)between) these)two)organisms.)“)

Comparison of the transcriptional landscapes between human and mouse tissues

Shin Lina,b,1, Yiing Linc,1, Joseph R. Neryd, Mark A. Urichd, Alessandra Breschie,f, Carrie A. Davisg, Alexander Dobing, Christopher Zaleskig, Michael A. Beerh, William C. Chapmanc, Thomas R. Gingerasg,i, Joseph R. Eckerd,j,2, and Michael P. Snydera,2

aDepartment of Genetics, Stanford University, Stanford, CA 94305; bDivision of Cardiovascular Medicine, Stanford University, Stanford, CA 94305; cDepartment of Surgery, Washington University School of Medicine, St. Louis, MO 63110; dGenomic Analysis Laboratory, The Salk Institute for Biological

Studies, La Jolla, CA 92037; eCentre for Genomic Regulation and UPF, Catalonia, 08003 Barcelona, Spain; fDepartament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, 08003 Barcelona, Spain; gFunctional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11742; hMcKusick-Nathans Institute of Genetic Medicine and the Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205; iAffymetrix, Inc., Santa Clara, CA 95051; and jHoward Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA 92037 Contributed by Joseph R. Ecker, July 23, 2014 (sent for review May 23, 2014)

major projects. Thirteen of the mouse and human orthologous

SLIDE 13

Batch effect deluxe

Leek et al. (2010). Nature Reviews. Genetics, 11(10), 733–

739. doi:10.1038/nrg2825

Gilad Y and Mizrahi-Man O 2015. F1000Research 2015, 4:121 (doi: 10.12688/ f1000research.6536.1)

Once)we)accounted)for)the)batch)effect) (…),)the)comparative)gene)expression) data)no)longer)clustered)by)species,)and) instead,)we)observed)a)clear)tendency) for)clustering)by)tissue.)

SLIDE 14

Many more sources for biased signals

sequencing errors
miscalled bases
PCR artifacts
length bias
GC bias
duplicates

GC content

no. of reads

inclusion of multi-mapped reads exclusion of multi-mapped reads

issues with the reference
CNV
mappability
inappropriate data processing

SLIDE 15

Invest in replicates!

recommended: 6 biological replicates per condition for DGE
f strongly changing genes (logFC >= 2) (Gierlinski et al., 2015)

gene’s expression in an individual replicate with the trimmed mean across all repli- 𝑜𝑢 𝑦̅𝑕;𝑜𝑢 𝑡𝑕;𝑜𝑢 |𝑦𝑕𝑗 − 𝑦̅𝑕;𝑜𝑢| > 𝑜𝑡𝑡𝑕;𝑜𝑢 𝑜𝑡 𝑔

𝑗

𝑗 𝑜𝑢 = 3 𝑜𝑡 = 5 𝑜𝑢 𝑜𝑡 𝑜𝑡 = 3 – the read depth profiles between clean and “bad” replicates is shown in “bad” replicate “clean” replicates “ ”

from all “clean” “clean” “bad”

log2 Counts

10.16 10.18 10.20 10.22 10.24 10.26

condition 1 condition 2

Gene X

SLIDE 16

Replicates

Technical replicates

library prep sequencing lane RNA extraction library prep sequencing lane sequencing lane sequencing lane RNA from an independent growth of cells/tissue RNA extraction RNA extraction

Biological replicates

sequencing lane library prep sequencing lane sequencing lane sequencing lane

SLIDE 17

Auer, P. L., & Doerge, R. W. (2010). Genetics, 185(2), 405–16. Don’t put all your eggs into the same basket! ! multiplex all samples!

SLIDE 18

RNA-seq workflow overview

Sequencing Sequencing

Fragmentation
mRNA enrichment
Library preparation

Bioinformatics Bioinformatics

Cluster generation
Sequencing by synthesis
Image acquisition

Total RNA e

tal RNA extr

xtraction action

RNA cDNA with adapters cells fragments

SLIDE 19

Images List of fold changes & statistical values

Bioinformatics workflow of RNA-seq analysis

Raw reads

SLIDE 20

Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes

FASTQC Base calling & demultiplexing Mapping Counting

DE test & multiple testing correction

Normalizing Filtering

.tif .fastq

Bioinformatics workflow of RNA-seq analysis

.sam/.bam .txt .Robj .Robj, .txt

Raw reads

Bustard/RTA/OLB, CASAVA STAR featureCounts DESeq2, edgeR DESeq2, edgeR, limma Customized scripts

SLIDE 21

Where are all the reads?

Sequence Read Archive GenBank DDBJ ENA

http://www.ncbi.nlm.nih.gov/genbank/ http://www.ddbj.nig.ac.jp/intro-e.html

https://www.ebi.ac.uk/ena/

SLIDE 22

Let’s download!

Gierlinski et al. have submitted their data set to SRA – we

will retrieve it via ENA (https://www.ebi.ac.uk/ena/)

accession number: ERP004763

ls mkdir wget cut grep awk

SLIDE 23

FASTQ file format

= FASTA + quality scores

1 read " 4 lines!

⌥

@ERR459145 .1 DHKW5DQ1 :219: D0PT7ACXX :2:1101:1590:2149/1 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC + @7 <DBADDDBH?DHHI@DH > HHHEGHIIIGGIFFGIBFAAGAFHA ’5? B@D

1 2 3 4

1. @Read ID and sequencing run information
2. sequence
3. + (additional description possible)
4. quality scores

SLIDE 24

Base quality score

⌥

@ERR459145 .1 DHKW5DQ1 :219: D0PT7ACXX :2:1101:1590:2149/1 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC + @7 <DBADDDBH?DHHI@DH > HHHEGHIIIGGIFFGIBFAAGAFHA ’5? B@D

?

base error probability p,

e.g. 10e-4!

Phred score,

e.g.: 40

10 x log10(p)

“FASTQ score”, e.g.: ( turn score into ASCII symbol http://www.ascii-code.com/

SLIDE 25

Base quality scores

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS..................................................... ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...................... ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...................... .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ...................... LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.................................................... !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ | | | | | | 33 59 64 73 104 126 0........................26...31.......40

5....0........9.............................40

0........9.............................40 3.....9.............................40 0.2......................26...31........41 S - Sanger Phred+33, raw reads typically (0, 40) X - Solexa Solexa+64, raw reads typically (-5, 40) I - Illumina 1.3+ Phred+64, raw reads typically (0, 40) J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) (Note: See discussion above). L - Illumina 1.8+ Phred+33, raw reads typically (0, 41)

also see Table 2 in the course notes

image from https://en.wikipedia.org/wiki/FASTQ_format

each base has a certain error probability (p)
Phred score = -10 x log10(p)
Phred scores are ASCII-encoded, e.g., “!” COULD represent Phred score 33

SLIDE 26

Quality control of raw reads: FASTQC

http://www.bioinformatics.babraham.ac.uk/projects/fastqc

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. The main functions of FastQC are:

Import of data from BAM, SAM or FastQ files (any variant)
Providing a quick overview to tell you in which areas there may be problems
Summary graphs and tables to quickly assess your data
Export of results to an HTML based permanent report
Offline operation to allow automated generation of reports without running

the interactive application

$ mat/software/FastQC/fastqc

SLIDE 27

ALIGNMENT

Finding out where the reads came from

SLIDE 28

Day 2: Aligning reads

1. Theoretical background
2. Annotation files
3. Files with aligned reads: SAM/BAM
4. Looking at aligned reads

SLIDE 29

Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes

FASTQC Base calling & demultiplexing Mapping

.tif .fastq .sam/.bam

Aligned reads

Bustard/RTA/OLB, CASAVA STAR

RSeQC, QoRTs

1. Theoretical

background

2. Annotation

files

3. Files with

aligned reads: SAM/BAM

4. Looking at

aligned reads

SLIDE 30

AAAAAA

Aligning short RNA-seq reads

full-length mRNA

5’ 3’

cDNA fragments

exon exon sequencing

reads aligned reads

SLIDE 31

AAAAAA

Aligning short RNA-seq reads

full-length mRNA

5’ 3’

cDNA fragments

exon exon sequencing

reads aligned reads

Spliced alignment tools usually need: 1) reference genome 2) annotation

SLIDE 32

Annotation & their data bases

Annotation is dynamic!

(sequence, coordinates, types of elements)

Automated vs. manual

curation (“evidence-based”)

https://commons.wikimedia.org/wiki/File:Human_genome_to_genes.png

RefSeq ncbi.nlm.nih.gov/refseq UCSC Known Genes genome.ucsc.edu Ensembl/Gencode gencodegenes.org

Zhao & Zhang (2015) BMC Genomics. doi:10.1186/s12864-015-1308-8

1/3 protein-coding genes > 17,000 non-coding RNAs > 15,000 pseudogenes

SLIDE 33

Annotation & read alignment

RefSeq tends to yield the highest number of uniquely mapped

reads (but lowest genome coverage)

Minor influence on mapping success of unspliced reads

tissue

Junction-spanning reads

Zhao & Zhang (2015) BMC Genomics. doi:10.1186/s12864-015-1308-8

Which data base to use depends on your question!

SLIDE 34

No data base is perfect…

Ubiquitous, highly

expressed genes are more likely to be part

f the annotations
Different tissues will

show different mapping rates!

SLIDE 35

Which annotation should one use?

“More)sensitive)annotations,)such)as)Ensembl) (…))should$be$preferred)over)more)speciBic) annotations,)such)as)RefSeq)(…))if)the)aim)is) to)obtain)accurate)expression)estimates.“)

Janes&et&al.&(Brie7ings&in&Bioinformatics,&2015).&doi: 10.1093/bib/bbv007&

“We)observe)that)RefSeq$Genes$produces$the$ most$accurate$fold@change$measures$with) respect)to)a)ground)truth)of)RT%qPCR)gene) expression)estimates.)“)

Wu&et&al.&(BMC&Bioinfo,&2013).&doi: 10.1186/1471M2105M14MS11MS8&

“In)practice,)there)is)no$simple$answer$to$this$question,)and)it)depends)on)the)purpose)

f)the)analysis.)(…))When)choosing)an)annotation)database,)researchers)should)keep)in)

mind)that)no$database$is$perfect)and)some$gene$annotations$might$be$inaccurate$

r$entirely$wrong.”)

Zhao&&&Zhang&(BMC&Genomics,&2015).&doi:10.1186/s12864M015M1308M8&

SLIDE 36

Gene models can differ dramatically

vs. ENSEMBL 0 unambiguous reads for LUZP6; x number of reads for MTPN RefSeq 0 reads

r

the same number of reads for both genes vs. 1 2 2 1

SLIDE 37

Storing annotation information

various formats (all are plain text files): GFF2, GFF3, GTF

, BED, SAF…

GTF (“GFF2.5”)

1.

reference coordinate

2.

source

3.

annotation type

4.

start position

5.

end position

6.

score

7.

strand

8.

frame/phase

9.

attributes: <TYPE VALUE>

SLIDE 38

0 vs. 1 based conventions

zero-based, half open ATG location: 6 - 9 or [6,9) Cut site: 11-11 or [11,11) Interval length = stop - start

http://alternateallele.blogspot.de/2012/03/genome-coordinate-conventions.html

ne-based, fully-closed

ATG location: 7 - 9 or [7,9] Cut site: 11^12 or (11,12) Interval length = stop - start + 1

BED format

http:// alternateallele.blogspot.com/ 2012/03/genome-coordinate-cheat- sheet.html

GFF format

SLIDE 39

STAR spliced alignment

Dobin (2013). Bioinformatics, 29(1), 15–21. doi:10.1093/bioinformatics/bts635

TopHat2 ann TopHat2 TopHat1 ann TopHat1 STAR 2-pass ann STAR 2-pass STAR 1-pass ann STAR 1-pass SMALT ReadsMap PASS cons PASS PALMapper cons ann PALMapper cons PALMapper ann PALMapper MapSplice ann MapSplice GSTRUCT ann GSTRUCT GSNAP ann GSNAP GEM cons ann GEM cons GEM ann BAGET ann 20 40 60 80 100 0

Engström et al. (2013) Nature Methods, 10(12), 1185–1191. doi:10.1038/nmeth.2722

Spliced alignment programs

accurate & sensitive
very fast
memory intensive!

(use it on the server!)

MMP = maximal mappable prefix

(aka maximum matching portion)

reads are split when a continuous

alignment is not possible

the remaining unmappable portion

is then aligned again

finally, aligned portions of the
riginal full-length reads are

stitched together

SLIDE 40

STAR spliced alignment

Dobin (2013). Bioinformatics, 29(1), 15–21. doi:10.1093/bioinformatics/bts635

TopHat2 ann TopHat2 TopHat1 ann TopHat1 STAR 2-pass ann STAR 2-pass STAR 1-pass ann STAR 1-pass SMALT ReadsMap PASS cons PASS PALMapper cons ann PALMapper cons PALMapper ann PALMapper MapSplice ann MapSplice GSTRUCT ann GSTRUCT GSNAP ann GSNAP GEM cons ann GEM cons GEM ann BAGET ann 20 40 60 80 100 0

Engström et al. (2013) Nature Methods, 10(12), 1185–1191. doi:10.1038/nmeth.2722

Spliced alignment programs

accurate & sensitive
very fast
memory intensive!

STAR has numerous options! tune them to meet your needs Current Protocols in Bioinformatics (Sept 2015) DOI: 10.1002/0471250953.bi1114s51 and STARmanual.pdf

SLIDE 41

STAR spliced alignment

1. generate genome index
2. align
align to reference & identify

novel splice junctions

re-run alignment including the

novel splice junctions

-runMode genomeGenerate
-genomeFastaFiles sacCer3.fa
-sjdbGTFfile sacCer3.gtf
-twopassMode

Let’s align the reads for WT_1!

SLIDE 42

Storing aligned reads: SAM/BAM

SLIDE 43

Storing aligned reads: SAM/BAM

Binary (Decimal) Hex Description 00000000001 (1) 0x1 Is the read paired? 00000000010 (2) 0x2 Are both reads in a pair mapped “properly” (i.e., in the correct

rientation with respect to one another)?

00000000100 (4) 0x4 Is the read itself unmapped? 00000001000 (8) 0x8 Is the mate read unmapped? 00000010000 (16) 0x10 Has the read been mapped to the reverse strand? 00000100000 (32) 0x20 Has the mate read been mapped to the reverse strand? 00001000000 (64) 0x40 Is the read the first read in a pair? 00010000000 (128) 0x80 Is the read the second read in a pair? 00100000000 (256) 0x100 Is the alignment not primary? (A read with split matches may have multiple primary alignment records.) 01000000000 (512) 0x200 Does the read fail platform/vendor quality checks? 10000000000 (1024) 0x400 Is the read a PCR or optical duplicate?

2nd field: binary FLAG most common FLAGS for SR: 0; 4; 16

https://broadinstitute.github.io/ picard/explain-flags.html

SLIDE 44

Storing aligned reads: SAM/BAM

SLIDE 45

Storing aligned reads: SAM/BAM

6th field: CIGAR string – what was necessary to align the read to the indicated locus? M: alignment (match or mismatch!!) I (N): insertion (large insertions) D: deletion S/H: clipping

SLIDE 46

Storing aligned reads: SAM/BAM

AS:i Alignment score BC:Z Barcode sequence HI:i Query is i-th hit stored in the file NH:i Number of reported alignments for the query sequence NM:i Edit distance of the query to the reference MD:Z String that contains the exact positions of mismatches (should complement the CIGAR string) RG:Z Read group (should match the entry after ID if @RG is present in the header.

after 11th field: OPTIONAL information

SLIDE 47

Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes

FASTQC Base calling & demultiplexing Mapping

.tif .fastq

Bioinformatics workflow of RNA-seq analysis

.sam/.bam

Aligned reads

Bustard/RTA/OLB, CASAVA STAR

RSeQC, QoRTs

SLIDE 48

Basic QC of aligned reads

aligner output (e.g., Log.final.out)
samtools flagstat (statistics on the flag values)
RSeQC’s bam_stat
QoRTs
…

visual inspection!

SLIDE 49

Integrative Genomics Viewer

http://software.broadinstitute.org/software/igv/download

SLIDE 50

Integrative Genomics Viewer

load BAM file(s) from URL (“File” -> “Open URL…”):

http://chagall.med.cornell.edu/RNASEQcourse/

take a snapshot of the reads around gene YPL198W

starting with IGV 2.3, Sashimi plots can easily be created http://software.broadinstitute.org/software/igv/Sashimi

many options!

SLIDE 51

Identifying RNA-seq specific biases

RSeQC and/or QoRTs
read distribution (exons/introns/intergenic)
gene body coverage (3’/5’ bias)

Lahens et al. Genome Biology 2014 15:R86 doi:10.1186/gb-2014-15-6-r86

SLIDE 52

in silico RIN calculation

RIN is rarely documented in the

data repositories

mRIN is based on 3’ coverage bias
use RSeQC’s tin.py

ARTICLE

Received 7 May 2015 | Accepted 15 Jun 2015 | Published 3 Aug 2015

mRIN for direct assessment of genome-wide and gene-specific mRNA integrity from large-scale RNA-sequencing data

Huijuan Feng1,2, Xuegong Zhang1 & Chaolin Zhang2

DOI: 10.1038/ncomms8816

OPEN

https://github.com/friedue/course_RNA-seq2015 ! exercise #3

SLIDE 53

COUNTING READS

from alignments to count tables

SLIDE 54

Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes

Base calling & demultiplexing Mapping

.tif .fastq

Bioinformatics workflow of RNA-seq analysis

.sam/.bam Bustard/RTA/OLB, CASAVA STAR

Counting

featureCounts

Which regions are expressed? How much are they expressed?

SLIDE 55

Counting raw read overlaps

https://www.youtube.com/watch?v=ztyjiCCt_lM Disclaimer: There are 2 schools of thought. We follow the one that’s based on the raw reads and gene overlaps. See the course notes for references for the arguments from the other side.

SLIDE 56

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html#count

Counting read–gene overlaps

featureCounts will use read-gene

verlaps as small as 1 bp

multi-overlap reads

SLIDE 57

http://www.nature.com/nmeth/journal/v8/n6/full/nmeth.1613.html

SLIDE 58

Transcriptome

reconstruction suffers from bad precision and bad sensitivity => many FP transcripts (esp. for tricky transcriptomes)!

False transcripts capture

a considerable portion of the reads

Janes et al. (2015). Briefings in Bioinformatics, (January), 1–9. doi:10.1093/bib/bbv007

Why do we not count transcripts? Because we don’t know them…

instead of transcript reconstruction, perhaps resort to:

exon counts ! DEXSeq
focus on specific splice events (MISO)

SLIDE 59

Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes

FASTQC Base calling & demultiplexing Mapping

.tif .fastq

Bioinformatics workflow of RNA-seq analysis

.sam/.bam Bustard/RTA/OLB, CASAVA STAR

RSeQC Counting

featureCounts

Normalizing

DESeq2, edgeR

SLIDE 60

From couting reads to expression units

Xi

Raw counts: number of reads/

fragments overlapping with the union of exons of a gene

strongly influenced by:

gene length
sequencing depth
expression of all other genes in the same

sample

SLIDE 61

Expression units

TPMi = Xi li ∗ 1 P

j

Xj lk ∗ 106

RPKMi = Xi ( li 103)( N 106)

Xi

Raw counts: number of reads/

fragments overlapping with the union of exons of a gene

[RF]PKM: Reads/Fragments per

Kilobase of gene per Million reads mapped

TPM: Transcripts Per Million
rlog: log2-transformed count data

normalized for small counts and library size

gene read counts per bp all gene counts per all gene bp

SLIDE 62

Avoid [RF]PKM and total read count normalization for DGE!

FP rates for varying % of DE genes (0-30%)

Dillies et al.(2012). Briefings in Bioinformatics. doi:10.1093/bib/bbs046

Effects of normalization methods on FC calculation and DGE analysis

if you need normalized expression values, use TPM

r rlog

SLIDE 63

Let’s count some reads & explore in R! Please save the .RData and the commands! You will need them tomorrow!

SLIDE 64

Expression units

strongly influenced by
gene length
sequencing depth
expression of all other genes in the same sample

DESeq’s size factor normalization

hetero- skedasticity

large dynamic range
discrete values

log transformation and variance stabilization (DESeq’s rlog() )

clustering tends to work better on normalized & transformed read counts

SLIDE 65

Clustering gene expression values

Hierarchical clustering

Experiment 1 Experiment 2

single-gene clusters are successively joined

+ “unbiased”

not very robust
Similarity measures
Euclidean
Pearson correlation
Distance measures
Complete: largest distance
Average: average distance

SLIDE 66

removing rRNAs

Can be done at virtually every step of the analysis. Choose the version that makes most sense to you.

sortMeRNA: http://bioinfo.lifl.fr/RNA/sortmerna/
input: reads in fastq file + rRNA sequences
will extract those reads that do not match to the rRNA sequences
https://www.ncbi.nlm.nih.gov/nuccore/U13369 (human rRNA),

https://www.ncbi.nlm.nih.gov/nuccore/BK000964 (mouse)

make a “genome” index for rRNAs only (and perhaps

tRNAs), then align your reads and only use those that do not map for the next round of alignment

do your alignment and counting as is, simply ignore the rRNA

genes in your subsequent downstream analysis

SLIDE 67

DIFFERENTIAL GENE EXPRESSION

SLIDE 68

Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes

FASTQC Base calling & demultiplexing Mapping

.tif .fastq

Bioinformatics workflow of RNA-seq analysis

.sam/.bam Bustard/RTA/OLB, CASAVA STAR

RSeQC Counting

HTSeq, featureCounts

Normalizing

DESeq2, edgeR .txt .Robj .Robj, .txt

Descriptive plots

SLIDE 69

Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes

FASTQC Base calling & demultiplexing Mapping

.tif .fastq

Bioinformatics workflow of RNA-seq analysis

.sam/.bam Bustard/RTA/OLB, CASAVA STAR

RSeQC Counting

HTSeq, featureCounts

DE test & multiple testing correction

DESeq2, edgeR, limma

Normalizing

DESeq2, edgeR .txt .Robj .Robj, .txt Descriptive plots

SLIDE 70

Read count table

SLIDE 71

DE basics

1. Estimate magnitude of DE

taking into account differences in sequencing depth, technical, and biological read count variability.

2. Estimate the significance of

the difference accounting for performing thousands of tests. 1 test per gene! Garber et al. (2011) Nature Methods, 8(6), 469–477. doi:10.1038/nmeth.1613 H0: no difference in the read distribution between two conditions

logFC (adjusted) p-value

SLIDE 72

Modeling read counts (DESeq2)

Kij ∼ NB(µij, αi)

∼ µij = sjqij

log2(qij) = xj.βi

read counts for gene i and sample j fitted mean gene-specific dispersion parameter (fitted towards the average dispersion) moderated log-fold change for gene i model matrix column for sample j library size factor expression value estimate

SLIDE 73

From read counts to DE

average norm. count standard error estimate for the logFC

SLIDE 74

What next?

Do your results make sense?
Are the results robust?
do multiple tools agree on the majority of the genes?
are the fold changes strong enough to explain the phenotype you

are seeing?

have other experiments yielded similar results?
Downstream analyses: mostly exploratory

How to decide which tool(s) to use?

function/content of original publication
code maintained?
well documented?
used by others?
efficient?

SLIDE 75

WALK-IN CLINICS

@ WCM: Thursdays, 1:30 – 3 pm, LC-504 (1300 York Ave) @ MSKCC: https://www.mskcc.org/ research-advantage/core- facilities/bioinformatics

Where to get help and inspiration?

biostars.org stackoverflow.com bioconductor.org/help/workflows mailing lists of the individual tools seqanswers.com

abc@med.cornell.edu

supplemental material of publications based on HTS data

SLIDE 76

Everything’s connected…

Sample type & quality

Low input?
Degraded?

Sequencing

Read length
PE vs. SR
Sequencing errors

Experimental design

Controls
No. of replicates
Randomization

Library preparation

Poly-A enrichment vs.

ribo minus

Strand information

Bioinformatics

Aligner
Annotation
Normalization
DE analysis strategy
Expression quantification
Alternative splicing
De novo assembly needed
mRNAs, small RNAs
….

Biological question