Di ff erential gene expression analysis using RNA-seq Applied - - PowerPoint PPT Presentation

di ff erential gene expression analysis using rna seq
SMART_READER_LITE
LIVE PREVIEW

Di ff erential gene expression analysis using RNA-seq Applied - - PowerPoint PPT Presentation

Di ff erential gene expression analysis using RNA-seq Applied Bioinformatics Core https://abc.med.cornell.edu/ Day 1: Introduction 1. RNA isolation & library preparation 2. Illuminas sequencing by synthesis 3. experimental design 4. raw


slide-1
SLIDE 1

Differential gene expression analysis using RNA-seq

Applied Bioinformatics Core

https://abc.med.cornell.edu/

slide-2
SLIDE 2

Day 1: Introduction

  • 1. RNA isolation & library preparation
  • 2. Illumina’s sequencing by synthesis
  • 3. experimental design
  • 4. raw sequencing reads
  • download
  • quality control
slide-3
SLIDE 3

RNA-seq is popular, but still developing

Reuter et al. (Mol Cell, 2015). High-Throughput Sequencing Technologies. doi:10.1016/j.molcel.2015.05.004

“RNA%seq)is)not$a$mature$technology.)It$is$ undergoing$rapid$evolution$of)biochemistry)

  • f)sample)preparation;)of)sequencing)

platforms;)of)computational)pipelines;)and)of$ subsequent$analysis$methods$that$include$ statistical$treatments)and)transcript)model) building.)“)

ENCODE&consortium&

slide-4
SLIDE 4

What to expect from the class

Sample type & quality Sequencing

  • Read length
  • PE vs. SR
  • Sequencing errors

Experimental design

  • Controls
  • No. of replicates
  • Randomization

Library preparation

  • Poly-A enrichment
  • vs. ribo minus
  • Strand information

Bioinformatics

  • Aligner
  • Normalization
  • DE analysis strategy
  • Expression quantification
  • Alternative splicing
  • De novo assembly needed
  • mRNAs, small RNAs
  • ….

Biological question

NOT COVERED:

  • novel transcript

discovery

  • transcriptome

assembly

  • alternative splicing

analysis

(see the course notes for references to useful reviews)

slide-5
SLIDE 5

RNA-seq workflow overview

Sequencing Sequencing

  • Fragmentation
  • mRNA enrichment
  • Library preparation

Bioinformatics Bioinformatics

  • Cluster generation
  • Sequencing by synthesis
  • Image acquisition

Total RNA e

  • tal RNA extr

xtraction action

RNA cDNA with adapters cells fragments

slide-6
SLIDE 6

Quality control of RNA extraction

28S:18S ratio avoid degraded RNA junk

slide-7
SLIDE 7

RNA-seq library preparation

3’ adapter ligation random priming and reverse transcription second strand synthesis PCR end repair, A- addition, adapter ligation PCR

U U U U U U U

reverse transcription PCR rRNA depletion/mRNA enrichment fragmentation 5’ adapter ligation end repair, A- addition, adapter ligation

U U

classical Illumina protocol (unstranded)

Van Dijk et al. (2014).. Experimental Cell Research, 322(1), 12–20. doi:10.1016/j.yexcr.2014.01.008

sequential ligation of two different adapters

RNA extraction poly(A) enrichment

  • r

ribo-depletion

dUTP stranded library preparation

QC!

slide-8
SLIDE 8

RNA-seq workflow overview

Sequencing Sequencing Total RNA e

  • tal RNA extr

xtraction action

RNA cDNA with adapters cells fragments

flowcell with primers

http://informatics.fas.harvard.edu/test-tutorial-page/

slide-9
SLIDE 9

Cluster generation

bridge amplification denaturation cluster generation

removal of complementary strands ! identical fragment copies remain

http://informatics.fas.harvard.edu/test-tutorial-page/

slide-10
SLIDE 10

Sequencing by synthesis

  • 1. extend 1st base
  • 2. read
  • 3. deblock

repeat for 50 – 100 bp generate base calls

Image from Illumina

labelled dNTP

slide-11
SLIDE 11

How deep is deep enough?

Goals that require more, longer and possibly paired- end reads:

  • quantification of lowly expressed genes
  • identification of genes with small changes between conditions
  • investigation of alternative splicing/isoform quantification
  • identification of novel transcripts, chimeric transcripts
  • de novo transcriptome assembly

for DGE (logFC~ 2) in mammals: 20 – 50 20 – 50 mio mio SR, 75 SR, 75 bp bp

slide-12
SLIDE 12

Many sources for biased signals

“Overall,)our)results)indicate)that) there)is)considerable$RNA$ expression$diversity$between$ humans$and$mice,)well)beyond) what)was)described)previously,) likely)reBlecting)the)fundamental) physiological)differences)between) these)two)organisms.)“)

Comparison of the transcriptional landscapes between human and mouse tissues

Shin Lina,b,1, Yiing Linc,1, Joseph R. Neryd, Mark A. Urichd, Alessandra Breschie,f, Carrie A. Davisg, Alexander Dobing, Christopher Zaleskig, Michael A. Beerh, William C. Chapmanc, Thomas R. Gingerasg,i, Joseph R. Eckerd,j,2, and Michael P. Snydera,2

aDepartment of Genetics, Stanford University, Stanford, CA 94305; bDivision of Cardiovascular Medicine, Stanford University, Stanford, CA 94305; cDepartment of Surgery, Washington University School of Medicine, St. Louis, MO 63110; dGenomic Analysis Laboratory, The Salk Institute for Biological

Studies, La Jolla, CA 92037; eCentre for Genomic Regulation and UPF, Catalonia, 08003 Barcelona, Spain; fDepartament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, 08003 Barcelona, Spain; gFunctional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11742; hMcKusick-Nathans Institute of Genetic Medicine and the Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205; iAffymetrix, Inc., Santa Clara, CA 95051; and jHoward Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA 92037 Contributed by Joseph R. Ecker, July 23, 2014 (sent for review May 23, 2014)

major projects. Thirteen of the mouse and human orthologous

slide-13
SLIDE 13

Batch effect deluxe

Leek et al. (2010). Nature Reviews. Genetics, 11(10), 733–

  • 739. doi:10.1038/nrg2825

Gilad Y and Mizrahi-Man O 2015. F1000Research 2015, 4:121 (doi: 10.12688/ f1000research.6536.1)

Once)we)accounted)for)the)batch)effect) (…),)the)comparative)gene)expression) data)no)longer)clustered)by)species,)and) instead,)we)observed)a)clear)tendency) for)clustering)by)tissue.)

slide-14
SLIDE 14

Many more sources for biased signals

  • sequencing errors
  • miscalled bases
  • PCR artifacts
  • length bias
  • GC bias
  • duplicates

GC content

  • no. of reads

inclusion of multi-mapped reads exclusion of multi-mapped reads

  • issues with the reference
  • CNV
  • mappability
  • inappropriate data processing
slide-15
SLIDE 15

Invest in replicates!

  • recommended: 6 biological replicates per condition for DGE
  • f strongly changing genes (logFC >= 2) (Gierlinski et al., 2015)

gene’s expression in an individual replicate with the trimmed mean across all repli- 𝑜𝑢 𝑦̅𝑕;𝑜𝑢 𝑡𝑕;𝑜𝑢 |𝑦𝑕𝑗 − 𝑦̅𝑕;𝑜𝑢| > 𝑜𝑡𝑡𝑕;𝑜𝑢 𝑜𝑡 𝑔

𝑗

𝑗 𝑜𝑢 = 3 𝑜𝑡 = 5 𝑜𝑢 𝑜𝑡 𝑜𝑡 = 3 – the read depth profiles between clean and “bad” replicates is shown in “bad” replicate “clean” replicates “ ”

from all “clean” “clean” “bad”

log2 Counts

10.16 10.18 10.20 10.22 10.24 10.26

condition 1 condition 2

Gene X

slide-16
SLIDE 16

Replicates

Technical replicates

library prep sequencing lane RNA extraction library prep sequencing lane sequencing lane sequencing lane RNA from an independent growth of cells/tissue RNA extraction RNA extraction

Biological replicates

sequencing lane library prep sequencing lane sequencing lane sequencing lane

slide-17
SLIDE 17

Auer, P. L., & Doerge, R. W. (2010). Genetics, 185(2), 405–16. Don’t put all your eggs into the same basket! ! multiplex all samples!

slide-18
SLIDE 18

RNA-seq workflow overview

Sequencing Sequencing

  • Fragmentation
  • mRNA enrichment
  • Library preparation

Bioinformatics Bioinformatics

  • Cluster generation
  • Sequencing by synthesis
  • Image acquisition

Total RNA e

  • tal RNA extr

xtraction action

RNA cDNA with adapters cells fragments

slide-19
SLIDE 19

Images List of fold changes & statistical values

Bioinformatics workflow of RNA-seq analysis

Raw reads

slide-20
SLIDE 20

Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes

FASTQC Base calling & demultiplexing Mapping Counting

DE test & multiple testing correction

Normalizing Filtering

.tif .fastq

Bioinformatics workflow of RNA-seq analysis

.sam/.bam .txt .Robj .Robj, .txt

Raw reads

Bustard/RTA/OLB, CASAVA STAR featureCounts DESeq2, edgeR DESeq2, edgeR, limma Customized scripts

slide-21
SLIDE 21

Where are all the reads?

Sequence Read Archive GenBank DDBJ ENA

http://www.ncbi.nlm.nih.gov/genbank/ http://www.ddbj.nig.ac.jp/intro-e.html

https://www.ebi.ac.uk/ena/

slide-22
SLIDE 22

Let’s download!

  • Gierlinski et al. have submitted their data set to SRA – we

will retrieve it via ENA (https://www.ebi.ac.uk/ena/)

  • accession number: ERP004763

ls mkdir wget cut grep awk

slide-23
SLIDE 23

FASTQ file format

= FASTA + quality scores

1 read " 4 lines!

@ERR459145 .1 DHKW5DQ1 :219: D0PT7ACXX :2:1101:1590:2149/1 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC + @7 <DBADDDBH?DHHI@DH > HHHEGHIIIGGIFFGIBFAAGAFHA ’5? B@D

1 2 3 4

  • 1. @Read ID and sequencing run information
  • 2. sequence
  • 3. + (additional description possible)
  • 4. quality scores
slide-24
SLIDE 24

Base quality score

@ERR459145 .1 DHKW5DQ1 :219: D0PT7ACXX :2:1101:1590:2149/1 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC + @7 <DBADDDBH?DHHI@DH > HHHEGHIIIGGIFFGIBFAAGAFHA ’5? B@D

?

base error probability p,

e.g. 10e-4!

Phred score,

e.g.: 40

  • 10 x log10(p)

“FASTQ score”, e.g.: ( turn score into ASCII symbol http://www.ascii-code.com/

slide-25
SLIDE 25

Base quality scores

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS..................................................... ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...................... ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...................... .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ...................... LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.................................................... !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ | | | | | | 33 59 64 73 104 126 0........................26...31.......40

  • 5....0........9.............................40

0........9.............................40 3.....9.............................40 0.2......................26...31........41 S - Sanger Phred+33, raw reads typically (0, 40) X - Solexa Solexa+64, raw reads typically (-5, 40) I - Illumina 1.3+ Phred+64, raw reads typically (0, 40) J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) (Note: See discussion above). L - Illumina 1.8+ Phred+33, raw reads typically (0, 41)

also see Table 2 in the course notes

image from https://en.wikipedia.org/wiki/FASTQ_format

  • each base has a certain error probability (p)
  • Phred score = -10 x log10(p)
  • Phred scores are ASCII-encoded, e.g., “!” COULD represent Phred score 33
slide-26
SLIDE 26

Quality control of raw reads: FASTQC

http://www.bioinformatics.babraham.ac.uk/projects/fastqc

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. The main functions of FastQC are:

  • Import of data from BAM, SAM or FastQ files (any variant)
  • Providing a quick overview to tell you in which areas there may be problems
  • Summary graphs and tables to quickly assess your data
  • Export of results to an HTML based permanent report
  • Offline operation to allow automated generation of reports without running

the interactive application

$ mat/software/FastQC/fastqc

slide-27
SLIDE 27

ALIGNMENT

Finding out where the reads came from

slide-28
SLIDE 28

Day 2: Aligning reads

  • 1. Theoretical background
  • 2. Annotation files
  • 3. Files with aligned reads: SAM/BAM
  • 4. Looking at aligned reads
slide-29
SLIDE 29

Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes

FASTQC Base calling & demultiplexing Mapping

.tif .fastq .sam/.bam

Aligned reads

Bustard/RTA/OLB, CASAVA STAR

RSeQC, QoRTs

  • 1. Theoretical

background

  • 2. Annotation

files

  • 3. Files with

aligned reads: SAM/BAM

  • 4. Looking at

aligned reads

slide-30
SLIDE 30

AAAAAA

Aligning short RNA-seq reads

full-length mRNA

5’ 3’

cDNA fragments

exon exon sequencing

reads aligned reads

slide-31
SLIDE 31

AAAAAA

Aligning short RNA-seq reads

full-length mRNA

5’ 3’

cDNA fragments

exon exon sequencing

reads aligned reads

Spliced alignment tools usually need: 1) reference genome 2) annotation

slide-32
SLIDE 32

Annotation & their data bases

  • Annotation is dynamic!

(sequence, coordinates, types of elements)

  • Automated vs. manual

curation (“evidence-based”)

https://commons.wikimedia.org/wiki/File:Human_genome_to_genes.png

RefSeq ncbi.nlm.nih.gov/refseq UCSC Known Genes genome.ucsc.edu Ensembl/Gencode gencodegenes.org

Zhao & Zhang (2015) BMC Genomics. doi:10.1186/s12864-015-1308-8

1/3 protein-coding genes > 17,000 non-coding RNAs > 15,000 pseudogenes

slide-33
SLIDE 33

Annotation & read alignment

  • RefSeq tends to yield the highest number of uniquely mapped

reads (but lowest genome coverage)

  • Minor influence on mapping success of unspliced reads

tissue

Junction-spanning reads

Zhao & Zhang (2015) BMC Genomics. doi:10.1186/s12864-015-1308-8

Which data base to use depends on your question!

slide-34
SLIDE 34

No data base is perfect…

  • Ubiquitous, highly

expressed genes are more likely to be part

  • f the annotations
  • Different tissues will

show different mapping rates!

slide-35
SLIDE 35

Which annotation should one use?

“More)sensitive)annotations,)such)as)Ensembl) (…))should$be$preferred)over)more)speciBic) annotations,)such)as)RefSeq)(…))if)the)aim)is) to)obtain)accurate)expression)estimates.“)

Janes&et&al.&(Brie7ings&in&Bioinformatics,&2015).&doi: 10.1093/bib/bbv007&

“We)observe)that)RefSeq$Genes$produces$the$ most$accurate$fold@change$measures$with) respect)to)a)ground)truth)of)RT%qPCR)gene) expression)estimates.)“)

Wu&et&al.&(BMC&Bioinfo,&2013).&doi: 10.1186/1471M2105M14MS11MS8&

“In)practice,)there)is)no$simple$answer$to$this$question,)and)it)depends)on)the)purpose)

  • f)the)analysis.)(…))When)choosing)an)annotation)database,)researchers)should)keep)in)

mind)that)no$database$is$perfect)and)some$gene$annotations$might$be$inaccurate$

  • r$entirely$wrong.”)

Zhao&&&Zhang&(BMC&Genomics,&2015).&doi:10.1186/s12864M015M1308M8&

slide-36
SLIDE 36

Gene models can differ dramatically

vs. ENSEMBL 0 unambiguous reads for LUZP6; x number of reads for MTPN RefSeq 0 reads

  • r

the same number of reads for both genes vs. 1 2 2 1

slide-37
SLIDE 37

Storing annotation information

  • various formats (all are plain text files): GFF2, GFF3, GTF

, BED, SAF…

GTF (“GFF2.5”)

1.

reference coordinate

2.

source

3.

annotation type

4.

start position

5.

end position

6.

score

7.

strand

8.

frame/phase

9.

attributes: <TYPE VALUE>

slide-38
SLIDE 38

0 vs. 1 based conventions

zero-based, half open ATG location: 6 - 9 or [6,9) Cut site: 11-11 or [11,11) Interval length = stop - start

http://alternateallele.blogspot.de/2012/03/genome-coordinate-conventions.html

  • ne-based, fully-closed

ATG location: 7 - 9 or [7,9] Cut site: 11^12 or (11,12) Interval length = stop - start + 1

BED format

http:// alternateallele.blogspot.com/ 2012/03/genome-coordinate-cheat- sheet.html

GFF format

slide-39
SLIDE 39

STAR spliced alignment

Dobin (2013). Bioinformatics, 29(1), 15–21. doi:10.1093/bioinformatics/bts635

TopHat2 ann TopHat2 TopHat1 ann TopHat1 STAR 2-pass ann STAR 2-pass STAR 1-pass ann STAR 1-pass SMALT ReadsMap PASS cons PASS PALMapper cons ann PALMapper cons PALMapper ann PALMapper MapSplice ann MapSplice GSTRUCT ann GSTRUCT GSNAP ann GSNAP GEM cons ann GEM cons GEM ann BAGET ann 20 40 60 80 100 0

Engström et al. (2013) Nature Methods, 10(12), 1185–1191. doi:10.1038/nmeth.2722

Spliced alignment programs

  • accurate & sensitive
  • very fast
  • memory intensive!

(use it on the server!)

  • MMP = maximal mappable prefix

(aka maximum matching portion)

  • reads are split when a continuous

alignment is not possible

  • the remaining unmappable portion

is then aligned again

  • finally, aligned portions of the
  • riginal full-length reads are

stitched together

slide-40
SLIDE 40

STAR spliced alignment

Dobin (2013). Bioinformatics, 29(1), 15–21. doi:10.1093/bioinformatics/bts635

TopHat2 ann TopHat2 TopHat1 ann TopHat1 STAR 2-pass ann STAR 2-pass STAR 1-pass ann STAR 1-pass SMALT ReadsMap PASS cons PASS PALMapper cons ann PALMapper cons PALMapper ann PALMapper MapSplice ann MapSplice GSTRUCT ann GSTRUCT GSNAP ann GSNAP GEM cons ann GEM cons GEM ann BAGET ann 20 40 60 80 100 0

Engström et al. (2013) Nature Methods, 10(12), 1185–1191. doi:10.1038/nmeth.2722

Spliced alignment programs

  • accurate & sensitive
  • very fast
  • memory intensive!

STAR has numerous options! tune them to meet your needs Current Protocols in Bioinformatics (Sept 2015) DOI: 10.1002/0471250953.bi1114s51 and STARmanual.pdf

slide-41
SLIDE 41

STAR spliced alignment

  • 1. generate genome index
  • 2. align
  • align to reference & identify

novel splice junctions

  • re-run alignment including the

novel splice junctions

  • -runMode genomeGenerate
  • -genomeFastaFiles sacCer3.fa
  • -sjdbGTFfile sacCer3.gtf
  • -twopassMode

Let’s align the reads for WT_1!

slide-42
SLIDE 42

Storing aligned reads: SAM/BAM

slide-43
SLIDE 43

Storing aligned reads: SAM/BAM

Binary (Decimal) Hex Description 00000000001 (1) 0x1 Is the read paired? 00000000010 (2) 0x2 Are both reads in a pair mapped “properly” (i.e., in the correct

  • rientation with respect to one another)?

00000000100 (4) 0x4 Is the read itself unmapped? 00000001000 (8) 0x8 Is the mate read unmapped? 00000010000 (16) 0x10 Has the read been mapped to the reverse strand? 00000100000 (32) 0x20 Has the mate read been mapped to the reverse strand? 00001000000 (64) 0x40 Is the read the first read in a pair? 00010000000 (128) 0x80 Is the read the second read in a pair? 00100000000 (256) 0x100 Is the alignment not primary? (A read with split matches may have multiple primary alignment records.) 01000000000 (512) 0x200 Does the read fail platform/vendor quality checks? 10000000000 (1024) 0x400 Is the read a PCR or optical duplicate?

2nd field: binary FLAG most common FLAGS for SR: 0; 4; 16

https://broadinstitute.github.io/ picard/explain-flags.html

slide-44
SLIDE 44

Storing aligned reads: SAM/BAM

slide-45
SLIDE 45

Storing aligned reads: SAM/BAM

6th field: CIGAR string – what was necessary to align the read to the indicated locus? M: alignment (match or mismatch!!) I (N): insertion (large insertions) D: deletion S/H: clipping

slide-46
SLIDE 46

Storing aligned reads: SAM/BAM

AS:i Alignment score BC:Z Barcode sequence HI:i Query is i-th hit stored in the file NH:i Number of reported alignments for the query sequence NM:i Edit distance of the query to the reference MD:Z String that contains the exact positions of mismatches (should complement the CIGAR string) RG:Z Read group (should match the entry after ID if @RG is present in the header.

after 11th field: OPTIONAL information

slide-47
SLIDE 47

Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes

FASTQC Base calling & demultiplexing Mapping

.tif .fastq

Bioinformatics workflow of RNA-seq analysis

.sam/.bam

Aligned reads

Bustard/RTA/OLB, CASAVA STAR

RSeQC, QoRTs

slide-48
SLIDE 48

Basic QC of aligned reads

  • aligner output (e.g., Log.final.out)
  • samtools flagstat (statistics on the flag values)
  • RSeQC’s bam_stat
  • QoRTs

visual inspection!

slide-49
SLIDE 49

Integrative Genomics Viewer

http://software.broadinstitute.org/software/igv/download

slide-50
SLIDE 50

Integrative Genomics Viewer

  • load BAM file(s) from URL (“File” -> “Open URL…”):

http://chagall.med.cornell.edu/RNASEQcourse/

  • take a snapshot of the reads around gene YPL198W

starting with IGV 2.3, Sashimi plots can easily be created http://software.broadinstitute.org/software/igv/Sashimi

many options!

slide-51
SLIDE 51

Identifying RNA-seq specific biases

  • RSeQC and/or QoRTs
  • read distribution (exons/introns/intergenic)
  • gene body coverage (3’/5’ bias)

Lahens et al. Genome Biology 2014 15:R86 doi:10.1186/gb-2014-15-6-r86

slide-52
SLIDE 52

in silico RIN calculation

  • RIN is rarely documented in the

data repositories

  • mRIN is based on 3’ coverage bias
  • use RSeQC’s tin.py

ARTICLE

Received 7 May 2015 | Accepted 15 Jun 2015 | Published 3 Aug 2015

mRIN for direct assessment of genome-wide and gene-specific mRNA integrity from large-scale RNA-sequencing data

Huijuan Feng1,2, Xuegong Zhang1 & Chaolin Zhang2

DOI: 10.1038/ncomms8816

OPEN

https://github.com/friedue/course_RNA-seq2015 ! exercise #3

slide-53
SLIDE 53

COUNTING READS

from alignments to count tables

slide-54
SLIDE 54

Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes

Base calling & demultiplexing Mapping

.tif .fastq

Bioinformatics workflow of RNA-seq analysis

.sam/.bam Bustard/RTA/OLB, CASAVA STAR

Counting

featureCounts

Which regions are expressed? How much are they expressed?

slide-55
SLIDE 55

Counting raw read overlaps

https://www.youtube.com/watch?v=ztyjiCCt_lM Disclaimer: There are 2 schools of thought. We follow the one that’s based on the raw reads and gene overlaps. See the course notes for references for the arguments from the other side.

slide-56
SLIDE 56

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html#count

Counting read–gene overlaps

featureCounts will use read-gene

  • verlaps as small as 1 bp

multi-overlap reads

slide-57
SLIDE 57

http://www.nature.com/nmeth/journal/v8/n6/full/nmeth.1613.html

slide-58
SLIDE 58
  • Transcriptome

reconstruction suffers from bad precision and bad sensitivity => many FP transcripts (esp. for tricky transcriptomes)!

  • False transcripts capture

a considerable portion of the reads

Janes et al. (2015). Briefings in Bioinformatics, (January), 1–9. doi:10.1093/bib/bbv007

Why do we not count transcripts? Because we don’t know them…

instead of transcript reconstruction, perhaps resort to:

  • exon counts ! DEXSeq
  • focus on specific splice events (MISO)
slide-59
SLIDE 59

Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes

FASTQC Base calling & demultiplexing Mapping

.tif .fastq

Bioinformatics workflow of RNA-seq analysis

.sam/.bam Bustard/RTA/OLB, CASAVA STAR

RSeQC Counting

featureCounts

Normalizing

DESeq2, edgeR

slide-60
SLIDE 60

From couting reads to expression units

Xi

  • Raw counts: number of reads/

fragments overlapping with the union of exons of a gene

strongly influenced by:

  • gene length
  • sequencing depth
  • expression of all other genes in the same

sample

slide-61
SLIDE 61

Expression units

TPMi = Xi li ∗ 1 P

j

Xj lk ∗ 106

RPKMi = Xi ( li 103)( N 106)

Xi

  • Raw counts: number of reads/

fragments overlapping with the union of exons of a gene

  • [RF]PKM: Reads/Fragments per

Kilobase of gene per Million reads mapped

  • TPM: Transcripts Per Million
  • rlog: log2-transformed count data

normalized for small counts and library size

gene read counts per bp all gene counts per all gene bp

slide-62
SLIDE 62

Avoid [RF]PKM and total read count normalization for DGE!

FP rates for varying % of DE genes (0-30%)

Dillies et al.(2012). Briefings in Bioinformatics. doi:10.1093/bib/bbs046

Effects of normalization methods on FC calculation and DGE analysis

if you need normalized expression values, use TPM

  • r rlog
slide-63
SLIDE 63

Let’s count some reads & explore in R! Please save the .RData and the commands! You will need them tomorrow!

slide-64
SLIDE 64

Expression units

  • strongly influenced by
  • gene length
  • sequencing depth
  • expression of all other genes in the same sample

DESeq’s size factor normalization

hetero- skedasticity

  • large dynamic range
  • discrete values

log transformation and variance stabilization (DESeq’s rlog() )

clustering tends to work better on normalized & transformed read counts

slide-65
SLIDE 65

Clustering gene expression values

Hierarchical clustering

Experiment 1 Experiment 2

single-gene clusters are successively joined

+ “unbiased”

  • not very robust
  • Similarity measures
  • Euclidean
  • Pearson correlation
  • Distance measures
  • Complete: largest distance
  • Average: average distance
slide-66
SLIDE 66

removing rRNAs

Can be done at virtually every step of the analysis. Choose the version that makes most sense to you.

  • sortMeRNA: http://bioinfo.lifl.fr/RNA/sortmerna/
  • input: reads in fastq file + rRNA sequences
  • will extract those reads that do not match to the rRNA sequences
  • https://www.ncbi.nlm.nih.gov/nuccore/U13369 (human rRNA),

https://www.ncbi.nlm.nih.gov/nuccore/BK000964 (mouse)

  • make a “genome” index for rRNAs only (and perhaps

tRNAs), then align your reads and only use those that do not map for the next round of alignment

  • do your alignment and counting as is, simply ignore the rRNA

genes in your subsequent downstream analysis

slide-67
SLIDE 67

DIFFERENTIAL GENE EXPRESSION

slide-68
SLIDE 68

Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes

FASTQC Base calling & demultiplexing Mapping

.tif .fastq

Bioinformatics workflow of RNA-seq analysis

.sam/.bam Bustard/RTA/OLB, CASAVA STAR

RSeQC Counting

HTSeq, featureCounts

Normalizing

DESeq2, edgeR .txt .Robj .Robj, .txt

Descriptive plots

slide-69
SLIDE 69

Images Raw reads Aligned reads Read count table Normalized read count table List of fold changes & statistical values Downstream analyses on DE genes

FASTQC Base calling & demultiplexing Mapping

.tif .fastq

Bioinformatics workflow of RNA-seq analysis

.sam/.bam Bustard/RTA/OLB, CASAVA STAR

RSeQC Counting

HTSeq, featureCounts

DE test & multiple testing correction

DESeq2, edgeR, limma

Normalizing

DESeq2, edgeR .txt .Robj .Robj, .txt Descriptive plots

slide-70
SLIDE 70

Read count table

slide-71
SLIDE 71

DE basics

  • 1. Estimate magnitude of DE

taking into account differences in sequencing depth, technical, and biological read count variability.

  • 2. Estimate the significance of

the difference accounting for performing thousands of tests. 1 test per gene! Garber et al. (2011) Nature Methods, 8(6), 469–477. doi:10.1038/nmeth.1613 H0: no difference in the read distribution between two conditions

logFC (adjusted) p-value

slide-72
SLIDE 72

Modeling read counts (DESeq2)

Kij ∼ NB(µij, αi)

∼ µij = sjqij

log2(qij) = xj.βi

read counts for gene i and sample j fitted mean gene-specific dispersion parameter (fitted towards the average dispersion) moderated log-fold change for gene i model matrix column for sample j library size factor expression value estimate

slide-73
SLIDE 73

From read counts to DE

average norm. count standard error estimate for the logFC

slide-74
SLIDE 74

What next?

  • Do your results make sense?
  • Are the results robust?
  • do multiple tools agree on the majority of the genes?
  • are the fold changes strong enough to explain the phenotype you

are seeing?

  • have other experiments yielded similar results?
  • Downstream analyses: mostly exploratory

How to decide which tool(s) to use?

  • function/content of original publication
  • code maintained?
  • well documented?
  • used by others?
  • efficient?
slide-75
SLIDE 75

WALK-IN CLINICS

@ WCM: Thursdays, 1:30 – 3 pm, LC-504 (1300 York Ave) @ MSKCC: https://www.mskcc.org/ research-advantage/core- facilities/bioinformatics

Where to get help and inspiration?

biostars.org stackoverflow.com bioconductor.org/help/workflows mailing lists of the individual tools seqanswers.com

abc@med.cornell.edu

supplemental material of publications based on HTS data

slide-76
SLIDE 76

Everything’s connected…

Sample type & quality

  • Low input?
  • Degraded?

Sequencing

  • Read length
  • PE vs. SR
  • Sequencing errors

Experimental design

  • Controls
  • No. of replicates
  • Randomization

Library preparation

  • Poly-A enrichment vs.

ribo minus

  • Strand information

Bioinformatics

  • Aligner
  • Annotation
  • Normalization
  • DE analysis strategy
  • Expression quantification
  • Alternative splicing
  • De novo assembly needed
  • mRNAs, small RNAs
  • ….

Biological question