Statistical LeaRning Lydia Mller &Katja Nowick Bioinformatics - - PowerPoint PPT Presentation

statistical learning
SMART_READER_LITE
LIVE PREVIEW

Statistical LeaRning Lydia Mller &Katja Nowick Bioinformatics - - PowerPoint PPT Presentation

Statistical LeaRning Lydia Mller &Katja Nowick Bioinformatics group, Markus Kreuz IMISE Differential Gene Expression Goals of today: Find out which genes are differentially expressed between human and chimpanzee brains Using


slide-1
SLIDE 1

Statistical LeaRning

Lydia Müller &Katja Nowick

Bioinformatics group,

Markus Kreuz

IMISE

slide-2
SLIDE 2

Goals of today: Find out which genes are differentially expressed between human and chimpanzee brains

  • Using microarrays (Affymetrix arrays)
  • Using RNA-Seq (Illumina sequencing)

Differential Gene Expression

slide-3
SLIDE 3

Gene Expression Analysis

  • Measure RNA levels for all or at least thousands
  • f genes at a time for each sample
  • “Snapshot” approach

Each cell has a unique set of thousands of different RNAs A piece of tissue typically consists of different cell types

slide-4
SLIDE 4

The biology of gene expression

Gene expression:

mRNA = messenger RNA  protein ncRNAs = non-coding RNAs active as RNAs

All these different RNA “species” are measured by RNA-Seq

slide-5
SLIDE 5

The biology of gene expression

Form A Form B Form C

Gene X

consisting of several exons

mRNAs of Gene X The different mRNAs produced from gene X are called different transcripts or different isoforms

slide-6
SLIDE 6

Dataset for the exercises

Expression profiles using Affymetrix arrays: HG-U133Plus2

Designed for human samples ~ 54,000 probesets ~21,000 genes

5 humans 5 chimpanzees

slide-7
SLIDE 7

Affymetrix

Multiple (11-20) 25-base

  • ligonucleotide probes

represent one transcript

PM MM

slide-8
SLIDE 8

Affymetrix

Published Gene Sequence

Multiple (11-20) 25-base

  • ligonucleotide probes

Perfect Match Mismatch 5´ 3´ PM is exactly complementary to published sequence MM is changed on 13th base

slide-9
SLIDE 9

QC: 1. Affy chip image

For Affy arrays, artifacts should be < 10% of image,

  • therwise summarized probeset values could be

affected…

slide-10
SLIDE 10

QC: 2. Batch effects

  • In good experimental design,

you compare two groups that

  • nly differ in one factor.
  • Batch effect can occur when

subsets of the replicates are handled separately at any stage

  • f the process; handling group

becomes in effect another

  • factor. Avoid processing all or

most of one factor level together if you can’t do all the samples at once.

slide-11
SLIDE 11

QC: 3. RNA degradation

  • Probes are ordered by location from 5’ to 3’ of

transcript

  • RNA degradation starts from 5’ end

 signal intensity should decrease from 3’ to 5’

Slope should be <5

slide-12
SLIDE 12

Data Preprocessing

  • 1. Background correction
  • 2. Normalization
  • a. Within array
  • b. Between arrays
slide-13
SLIDE 13

Background and probe-specific correction Methods

  • MAS 5: PM-MM, only if PM>MM

– Chip is divided into 16 sections – 2% lowest probes = background – PM and MM probes are corrected against background in their section – If MM>PM, adjusted so that MM<PM

  • RMA: PM = S + Noise;

– Predicts signal based on assumption that it is the sum of the real signal and noise – Uses observed PM probes only – Integrates signal of all probes into model

  • GCRMA: based on amount + location of GC content of the probes
slide-14
SLIDE 14

Normalization

  • Most methods of normalization assume:

– Most genes are not changing – Changes are not all in same direction

  • This applies to both within-array and between-

array normalization

slide-15
SLIDE 15

Between-Array Normalization

  • Necessary when comparing intensity values between arrays
  • Mas5 does no between array normalization
  • RMA uses quantile normalization
  • Process all (and only) arrays of an experiment together!
  • Each chip will get the same empirical distribution of signal

intensities  tends to underestimate fold change

slide-16
SLIDE 16

Summarization

PM MM

  • RMA: uses only PM

“median polish”: linear model based on all arrays in the study  Expression values on log2 scale

  • MAS5: uses PM-MM

“1-step Tukey biweight”, only for individual arrays

slide-17
SLIDE 17

MAS 5 Presence Calls

  • Affymetrix has algorithm to determine

presence/absence of transcripts

  • Despite problems with MM probes, they are useful

in determining presence

  • A data set where all the transcripts present were

known indicated that Affy’s P/M/A calls were 85% accurate with only 10% FDR (Choe et al. 2005, Genome

Biology 6: R16)

  • It makes only sense to analyze expressed

transcripts

slide-18
SLIDE 18

A word on technical replication…

Generally have limited amount of replicates due to cost of the experiments or availability of samples Technical replication is seen by many statisticians as a waste of time and resources because they do not substantially increase your power to detect differences... Biological replicates: include the technical variation and give information on natural/biological variation to determine what amplitude of difference might be biologically relevant

slide-19
SLIDE 19

What is different?

3 3.5 4 4.5 5 5.5 6 6.5 7 3 3.5 4 4.5 5 5.5 6 6.5 7 3 3.5 4 4.5 5 5.5 6 6.5 7 3 3.5 4 4.5 5 5.5 6 6.5 7 3 3.5 4 4.5 5 5.5 6 6.5 7

FC=2.28 P=0.001 FC=1.14 P=0.07 FC=1.16 P=0.04 FC=1.43 P=0.25 FC=1.81 P=0.32

3 3.5 4 4.5 5 5.5 6 6.5 7

FC=2.28 P=0.17

Be aware of effect of outliers or biological variation on FC! T-test is a common way to identify differentially expressed genes

slide-20
SLIDE 20

Multiple Testing Problem

  • Most common statistical tests were good in the times

before large scale datasets

  • P=0.01 means if you were repeating the same test 100x

you would expect the same outcome only ones (1 in 100)

  • P<0.01 means you would not expect the same
  • utcome even if you repeated the test 100 x
  • Problem: With microarrays, you are testing thousands
  • f genes at the same time
  •  If you had 10,000 genes and none were different,

you would expect 100 to have p<0.01 by chance alone

slide-21
SLIDE 21

Multiple Testing Solutions (1)

Adjust p-values to reflect multiple testing:

  • Bonferroni correction:

– Multiply p-values by number of tests done – Very conservative; greatly increases false negative rate

  • Benjamini & Hochberg correction:

– Less conservative, willing to accept some number of false positives in order to decrease false negative rate

  • Storey’s q-values

– Controls false discovery rate; – Estiamtes height of uniform part of p-value distribution and corrects p-value based on this height

Calculate False Discovery Rate:

  • Try multiple p-value cut-offs and use what you are

willing to accept

slide-22
SLIDE 22

Multiple Testing Solutions (2)

Non-specific filtering to remove genes before analysis

– Remove non-expressed genes (e.g. genes not expressed in all samples) – This typically reduces the number of tests by 50-75% genes

slide-23
SLIDE 23

Multiple Testing Solutions (3)

a priori specification of a set of genes that are likely to be differentially expressed

– E.g. from results from other experiments or other prior knowledge – Instead of correcting for all genes, this reduces the multiple testing correction to just that number of a priori defined genes – But don’t define your expectation after analyzing your current dataset or change your expectation after your previous one didn’t work

slide-24
SLIDE 24
  • Limited to genes for which probes are on the array

no de-novo gene discovery, only limited isoform information

  • Cross-hybridization
  • Only available for some species

(unless you make custom arrays – expensive!) You can use arrays of highly related species (e.g. human arrays for chimpanzee samples) But regions with sequence difference would not hybridize equally well: Solutions: Use/create custom CDF files to “mask” probes you don’t want to consider Determine statistically the probes with unequal behavior between samples (Dannemann et al. 2009) RNA-Seq

Limitations of microarrays

Probeset Probeset

X X

slide-25
SLIDE 25
  • 6. Count how many reads map to the gene to determine its expression level

Quantification can be done at the level of genes, transcripts, exons

  • 5. Mapping of reads to a reference genome / genes

RNA-Seq Quantification of gene expression

  • 1. Biochemical amplification of all RNAs
  • 2. Fragmentation of all RNAs
  • 4. Sequencing of all cDNAs

 Short reads (e.g. 100 nt long)

  • 3. Biochemical conversion of all RNAs into cDNA

EXON EXON

EXON EXON
slide-26
SLIDE 26

Quantification of gene expression

Long gene Short gene

EXON EXON

EXON EXON

EXON

EXON EXON

EXON

EXON EXON

Has fewer counts Need to normalize for gene / transcript length!

slide-27
SLIDE 27

Microarrays Continues numbers Transcript abundance: Is linearly related to the fluorescence level recovered after hybridization RNA-Seq Counts of reads (discrete numbers) Transcript abundance: Counts are considered to be linearly related to transcript abundance

RNA-Seq vs Microarray differential expression

With the costs for NGS decreasing, RNA-Seq will replace microarrays Calculate expression values for genes, transcripts (i.e. different isoforms),

  • r exons?  Differential gene, transcript or exon expression?
slide-28
SLIDE 28

… for transcript length:

An early method was proposed by Mortazavi et al. in 2008: RPKM = Reads Per Kilobase of exon model per Million mapped reads Argument against normalization for transcript length in DE analysis: When comparing the same genes between samples, we expect (and hope) that the biases will affect the same gene in the same way in different samples. Thus, we do not worry about gene length bias, GC bias and so on, because the biases in effect “cancel out" when making the comparison between samples. BUT, when comparing different species: Gene/transcript lengths, and size of the transcriptomes are different. Hence we need to normalize for transcript length and mappability.

To normalize or not to normalize …

Mortazavi et al. 2008 # 𝑠𝑓𝑏𝑒𝑡 𝑝𝑤𝑓𝑠𝑚𝑏𝑞𝑞𝑗𝑜𝑕 𝑓𝑦𝑝𝑜𝑡 𝑢𝑝𝑢𝑏𝑚 # 𝑛𝑏𝑞𝑞𝑓𝑒 𝑠𝑓𝑏𝑒𝑡 106 ∗ 𝑢𝑝𝑢𝑏𝑚 𝑚𝑓𝑜𝑕𝑢ℎ 𝑝𝑔 𝑏𝑚𝑚 𝑓𝑦𝑝𝑜𝑡 (𝑢𝑠𝑏𝑜𝑡𝑑𝑠𝑗𝑞𝑢𝑝𝑛𝑓) 103

slide-29
SLIDE 29

… for library size (total number of reads):

Imagine, a gene has the same number of counts in two samples. But the library size was twice as high in the first sample. Then we would conclude that the gene was higher expressed in the second sample. In other word: if a non-differentially expressed gene has twice as many counts in one sample than in another, the size factor for this sample should be twice as large as the

  • ne for the other sample.

We will see normalization for library size using the R libraries DESeq and edgeR

… for transcriptome size:

If comparing two tissues with transcriptomes of different sizes, i.e. when there are noticeably more transcripts expressed in one tissues than the other.

To normalize or not to normalize …

slide-30
SLIDE 30

Currently no consensus on the best statistics for normalization and analysis of DE Dillies et al. 2013 compared 7 commonly used normalization methods for RNA-Seq: Total Count, Upper Quartile, Median, Quantile, DESeq, TMM (edgeR), and RPKM

Statistics for differential gene expression

Dillies et al. 2013 The bad message: all methods come up with different results

slide-31
SLIDE 31

Statistics for differential gene expression

Dillies et al. 2013 Most genes are assumed to be not DE After normalization RPKM still looks very much like raw counts 

slide-32
SLIDE 32

Statistics for differential gene expression

Dillies et al. 2013 False positive rate was calculated from simulated datasets, as an average over datasets with different proportions of DE genes DESeq and edgR (TMM) were overall the best

slide-33
SLIDE 33

Statistics for differential gene expression

DESeq edgeR

Both assume that most genes are not DE i.e. when comparing across samples, most genes should have similar read counts  ratio of ~1 If this is not the case, a correction factor needs to be introduced

  • Analyzes the median of the ratios across

all samples

  • If median is not 1 the read counts for that

sample are corrected so that it becomes 1

  • Normalized read counts are calculated

from raw read counts divided by the correction factor

  • Done by estimateSizeFactors()
  • Picks one sample as the reference
  • Analyzes the weighted mean of the ratios
  • f each sample to that reference
  • If mean is not 1 the library size for that

sample is corrected so that it becomes 1

  • Normalized read counts are recalculated

with the corrected library size

  • Done by calcNormFactors()

We will mainly focus on how to use these packages (not on the statistical details)

slide-34
SLIDE 34

Dataset for the exercises

3 humans 3 chimpanzees Illumina sequencing of paired ends 76 bp Liu et al. 2012

id SRR111895.bam SRR111935.bam SRR112674.bam NM_001025603 579 1835 1124 NM_021107 120 224 367 NR_036579 934 825 953 id SRR364819.bam SRR365026.bam SRR365029.bam NM_001025603 1430 2010 2514 NM_021107 848 593 406 NM_173803

Raw read counts

Note: Each column must stem from an independent experiment or sample. If you spread sample material from one experiment over several lanes of the sequencer in order to get better coverage, you must sum up the counts from the lanes to get a single column.

slide-35
SLIDE 35

So, I have 400 significant probes/ probesets – what do I do now?

  • Production of significant gene lists really

represents the beginning of the analysis process

– What are the genes? – Are any functional categories over-represented? – Can I infer/build a pathway from my data? – Which changes are specific to a cell line, species … – How do they overlap with other/similar experiments? …