Statistical LeaRning
Lydia Müller &Katja Nowick
Bioinformatics group,
Markus Kreuz
IMISE
Statistical LeaRning Lydia Mller &Katja Nowick Bioinformatics - - PowerPoint PPT Presentation
Statistical LeaRning Lydia Mller &Katja Nowick Bioinformatics group, Markus Kreuz IMISE Differential Gene Expression Goals of today: Find out which genes are differentially expressed between human and chimpanzee brains Using
Bioinformatics group,
IMISE
Goals of today: Find out which genes are differentially expressed between human and chimpanzee brains
Each cell has a unique set of thousands of different RNAs A piece of tissue typically consists of different cell types
mRNA = messenger RNA protein ncRNAs = non-coding RNAs active as RNAs
All these different RNA “species” are measured by RNA-Seq
Form A Form B Form C
Gene X
consisting of several exons
mRNAs of Gene X The different mRNAs produced from gene X are called different transcripts or different isoforms
Expression profiles using Affymetrix arrays: HG-U133Plus2
Designed for human samples ~ 54,000 probesets ~21,000 genes
5 humans 5 chimpanzees
Multiple (11-20) 25-base
represent one transcript
PM MM
Published Gene Sequence
Multiple (11-20) 25-base
Perfect Match Mismatch 5´ 3´ PM is exactly complementary to published sequence MM is changed on 13th base
For Affy arrays, artifacts should be < 10% of image,
affected…
you compare two groups that
subsets of the replicates are handled separately at any stage
becomes in effect another
most of one factor level together if you can’t do all the samples at once.
Slope should be <5
– Chip is divided into 16 sections – 2% lowest probes = background – PM and MM probes are corrected against background in their section – If MM>PM, adjusted so that MM<PM
– Predicts signal based on assumption that it is the sum of the real signal and noise – Uses observed PM probes only – Integrates signal of all probes into model
– Most genes are not changing – Changes are not all in same direction
intensities tends to underestimate fold change
PM MM
“median polish”: linear model based on all arrays in the study Expression values on log2 scale
“1-step Tukey biweight”, only for individual arrays
Biology 6: R16)
3 3.5 4 4.5 5 5.5 6 6.5 7 3 3.5 4 4.5 5 5.5 6 6.5 7 3 3.5 4 4.5 5 5.5 6 6.5 7 3 3.5 4 4.5 5 5.5 6 6.5 7 3 3.5 4 4.5 5 5.5 6 6.5 7
FC=2.28 P=0.001 FC=1.14 P=0.07 FC=1.16 P=0.04 FC=1.43 P=0.25 FC=1.81 P=0.32
3 3.5 4 4.5 5 5.5 6 6.5 7
FC=2.28 P=0.17
Be aware of effect of outliers or biological variation on FC! T-test is a common way to identify differentially expressed genes
before large scale datasets
you would expect the same outcome only ones (1 in 100)
you would expect 100 to have p<0.01 by chance alone
– Multiply p-values by number of tests done – Very conservative; greatly increases false negative rate
– Less conservative, willing to accept some number of false positives in order to decrease false negative rate
– Controls false discovery rate; – Estiamtes height of uniform part of p-value distribution and corrects p-value based on this height
willing to accept
– Remove non-expressed genes (e.g. genes not expressed in all samples) – This typically reduces the number of tests by 50-75% genes
a priori specification of a set of genes that are likely to be differentially expressed
– E.g. from results from other experiments or other prior knowledge – Instead of correcting for all genes, this reduces the multiple testing correction to just that number of a priori defined genes – But don’t define your expectation after analyzing your current dataset or change your expectation after your previous one didn’t work
no de-novo gene discovery, only limited isoform information
(unless you make custom arrays – expensive!) You can use arrays of highly related species (e.g. human arrays for chimpanzee samples) But regions with sequence difference would not hybridize equally well: Solutions: Use/create custom CDF files to “mask” probes you don’t want to consider Determine statistically the probes with unequal behavior between samples (Dannemann et al. 2009) RNA-Seq
Probeset Probeset
Quantification can be done at the level of genes, transcripts, exons
Short reads (e.g. 100 nt long)
EXON EXON
EXON EXONLong gene Short gene
EXON EXON
EXON EXONEXON
EXON EXONEXON
EXON EXONHas fewer counts Need to normalize for gene / transcript length!
Microarrays Continues numbers Transcript abundance: Is linearly related to the fluorescence level recovered after hybridization RNA-Seq Counts of reads (discrete numbers) Transcript abundance: Counts are considered to be linearly related to transcript abundance
With the costs for NGS decreasing, RNA-Seq will replace microarrays Calculate expression values for genes, transcripts (i.e. different isoforms),
… for transcript length:
An early method was proposed by Mortazavi et al. in 2008: RPKM = Reads Per Kilobase of exon model per Million mapped reads Argument against normalization for transcript length in DE analysis: When comparing the same genes between samples, we expect (and hope) that the biases will affect the same gene in the same way in different samples. Thus, we do not worry about gene length bias, GC bias and so on, because the biases in effect “cancel out" when making the comparison between samples. BUT, when comparing different species: Gene/transcript lengths, and size of the transcriptomes are different. Hence we need to normalize for transcript length and mappability.
Mortazavi et al. 2008 # 𝑠𝑓𝑏𝑒𝑡 𝑝𝑤𝑓𝑠𝑚𝑏𝑞𝑞𝑗𝑜 𝑓𝑦𝑝𝑜𝑡 𝑢𝑝𝑢𝑏𝑚 # 𝑛𝑏𝑞𝑞𝑓𝑒 𝑠𝑓𝑏𝑒𝑡 106 ∗ 𝑢𝑝𝑢𝑏𝑚 𝑚𝑓𝑜𝑢ℎ 𝑝𝑔 𝑏𝑚𝑚 𝑓𝑦𝑝𝑜𝑡 (𝑢𝑠𝑏𝑜𝑡𝑑𝑠𝑗𝑞𝑢𝑝𝑛𝑓) 103
… for library size (total number of reads):
Imagine, a gene has the same number of counts in two samples. But the library size was twice as high in the first sample. Then we would conclude that the gene was higher expressed in the second sample. In other word: if a non-differentially expressed gene has twice as many counts in one sample than in another, the size factor for this sample should be twice as large as the
We will see normalization for library size using the R libraries DESeq and edgeR
… for transcriptome size:
If comparing two tissues with transcriptomes of different sizes, i.e. when there are noticeably more transcripts expressed in one tissues than the other.
Currently no consensus on the best statistics for normalization and analysis of DE Dillies et al. 2013 compared 7 commonly used normalization methods for RNA-Seq: Total Count, Upper Quartile, Median, Quantile, DESeq, TMM (edgeR), and RPKM
Dillies et al. 2013 The bad message: all methods come up with different results
Dillies et al. 2013 Most genes are assumed to be not DE After normalization RPKM still looks very much like raw counts
Dillies et al. 2013 False positive rate was calculated from simulated datasets, as an average over datasets with different proportions of DE genes DESeq and edgR (TMM) were overall the best
DESeq edgeR
Both assume that most genes are not DE i.e. when comparing across samples, most genes should have similar read counts ratio of ~1 If this is not the case, a correction factor needs to be introduced
all samples
sample are corrected so that it becomes 1
from raw read counts divided by the correction factor
sample is corrected so that it becomes 1
with the corrected library size
We will mainly focus on how to use these packages (not on the statistical details)
3 humans 3 chimpanzees Illumina sequencing of paired ends 76 bp Liu et al. 2012
id SRR111895.bam SRR111935.bam SRR112674.bam NM_001025603 579 1835 1124 NM_021107 120 224 367 NR_036579 934 825 953 id SRR364819.bam SRR365026.bam SRR365029.bam NM_001025603 1430 2010 2514 NM_021107 848 593 406 NM_173803
Raw read counts
Note: Each column must stem from an independent experiment or sample. If you spread sample material from one experiment over several lanes of the sequencer in order to get better coverage, you must sum up the counts from the lanes to get a single column.
– What are the genes? – Are any functional categories over-represented? – Can I infer/build a pathway from my data? – Which changes are specific to a cell line, species … – How do they overlap with other/similar experiments? …