February 2008 Differential Expression, Power, Exploratory Analysis - PowerPoint presentation
Mauro Delorenzi, Bioinformatics Core Facility (BCF), NCCR Molecular Oncology, Swiss Institute of Bioinformatics (SIB) <Mauro.Delorenzi@isb-sib.ch>
February 10, 2008


SLIDE 1

February 2008 Differential Expression, Power, Exploratory Analysis Mauro Delorenzi Bioinformatics Core Facility (BCF)

NCCR molecular oncology Swiss Institute of Bioinformatics (SIB) <Mauro.Delorenzi@isb-sib.ch>

SLIDE 2

Differential gene expression

Microarray differential gene expression

Figure: Source: Roche

Gene expression (GE) with microarrays: fluorescence intensity signal proportional (in some range) to the amount of target RNA hybridized to the array

SLIDE 3

Differential gene expression

Microarray differential gene expression

GE is quantified on a log scale, usually base 2. Compare two conditions 1 and 2:

M = GE2 − GE1   (1)

1 unit of M corresponds to a factor of 2; M = 2: factor 4; M = −3: factor 1/8; M = u: factor 2^u.
How precisely does GE measure the real gene expression level? How precisely does M measure the real log-ratio?
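The log2-to-fold-change arithmetic above can be checked in a few lines (a Python sketch for illustration; the course itself uses R, and the helper name is ours):

```python
def m_to_fold_change(m):
    """Convert a log2 expression difference M = GE2 - GE1 into a linear fold change."""
    return 2.0 ** m

# 1 unit of M is a factor 2; M = 2 a factor 4; M = -3 a factor 1/8
print(m_to_fold_change(1))   # 2.0
print(m_to_fold_change(2))   # 4.0
print(m_to_fold_change(-3))  # 0.125
```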

SLIDE 4

Differential gene expression (figure slide)
SLIDE 5

Hypothesis Testing, Classification

Statistical testing

Example: comparing the mean value of a variable between two groups.
Quantification of the effect: difference of the means = µ.
Null hypothesis H0: µ = 0 vs. alternative hypothesis HA: µ ≠ 0.
Test statistic: the t-statistic for µ = 0 measures the effect observed in an experiment.
Define a significance level α, typical value 0.05.

SLIDE 6

Hypothesis Testing, Classification

Define the rejection region: assume the t-statistic follows a Student t distribution and reject when |t| > t0, where t0 is determined so that under H0, P[ t < −t0 ] = α/2 and P[ t > t0 ] = α/2. Take a sample, calculate t, decide.
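The rejection rule can be sketched with the Student t quantile function (Python with scipy for illustration; the function name is ours):

```python
# Reject H0 when |t| exceeds the critical value t0, chosen so that each tail
# has probability alpha/2 under the Student t distribution with df degrees of freedom.
from scipy import stats

def rejects(t, alpha, df):
    t0 = stats.t.ppf(1 - alpha / 2, df)  # upper critical value
    return abs(t) > t0

print(round(float(stats.t.ppf(0.975, 10)), 3))  # t0 for alpha = 0.05, df = 10
print(rejects(2.5, 0.05, 10))
print(rejects(1.8, 0.05, 10))
```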

SLIDE 7

Hypothesis Testing, Classification

p values

The p-value is the probability, under H0, that the test statistic is at least as large as the value t1 observed in the sample:
One-sided test: p = P[ t > t1 ]
Two-sided test: p = P[ |t| > t1 ]

Figure: p-value (one-sided test): area beyond the observed value

SLIDE 8

Hypothesis Testing, Classification

A p-value without a null hypothesis is meaningless. A p-value below α is called significant and corresponds to a rejection of H0.

A (highly) significant p-value means that the observed difference would occur (very) rarely if there were truly no difference, i.e. if H0 were correct. But it does not prove that the alternative hypothesis is true, it does not express the strength of confidence in HA, and, being based on the assumption of H0, it can only quantify evidence against the null hypothesis.

SLIDE 9

Hypothesis Testing, Classification

A (highly) nonsignificant (large) p-value does NOT demonstrate that the null hypothesis is true, only that the evidence against H0 is not (very) strong, not convincing enough for a rejection. Large p-values might be due to small sample sizes.

The p-value does not quantify the strength (effect size) of the observed difference. The same difference of two means can correspond to very different p-values: a small effect in a very large study can easily lead to a significant p-value, while a large effect in a small study might well not reach significance.

SLIDE 10

Hypothesis Testing, Classification

Decision errors

The findings of a study comparing two groups can be wrong in two ways:

1. The results lead to the conclusion that there is a difference when in reality there is none: FP error, type I error, controlled by α.
2. The results lead to the conclusion that there is no difference when in reality a difference exists: FN error, type II error, expressed by β.

α is the benchmark for the p-values: if the p-value is below the threshold α, the result is called statistically significant.

The power of a study is the probability of finding a statistically significant result if there is a true difference: Power = 1 − β. The choice of adequate power is critical, because investigators and funding agencies must be confident that an existing difference can be detected with the study sample. A study that has little chance of reaching this aim might make no sense.

SLIDE 11

Hypothesis Testing, Classification (figure slide)
SLIDE 12

Hypothesis Testing, Classification

Classification Example

Test for pregnancy (the condition). A test result that suggests a pregnancy is called positive; the opposite is a negative result.
If 200 cases with the condition are tested and the test has a sensitivity of 90%, the test result is positive in 180 of these cases.
If 1200 cases without the condition are tested and the test has a specificity of 95%, the test result is negative in 1140 of these cases.

Real status vs test outcome    Test positive   Test negative   Sum
Condition present              180             20              200
Condition absent               60              1140            1200
Sum                            240             1160            1400

SLIDE 13

Hypothesis Testing, Classification

Classification, Terms

Real status vs test outcome    Test positive         Test negative         Sum
Condition present              True Positive (TP)    False Negative (FN)   C
Condition absent               False Positive (FP)   True Negative (TN)    A
Sum                            P                     N                     Total M

Prevalence (PREV): frequency of the condition in the population = C / M
Sensitivity (SENS): ability (power) to diagnose presence of the condition = TP / C
Specificity (SPEC): ability to diagnose absence of the condition = TN / A
Accuracy (ACC): proportion of cases classified correctly by the test = (TP + TN) / M
Error rate (ER): = (FP + FN) / M = 1 − Accuracy

SLIDE 14

Hypothesis Testing, Classification

Positive predictive value (PPV): proportion among those classified positive that really have the condition = TP / P
False discovery proportion (FDP): proportion among those classified positive that in reality do not have the condition = FP / P = 1 − PPV
Negative predictive value (NPV): proportion among those classified negative that really do not have the condition = TN / N
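All of these quantities can be computed from the pregnancy-test table of the earlier example (a Python sketch for illustration; the variable names are ours):

```python
# Pregnancy-test table from the example: TP=180, FN=20, FP=60, TN=1140
TP, FN, FP, TN = 180, 20, 60, 1140
C, A = TP + FN, FP + TN   # condition present / condition absent
P, N = TP + FP, FN + TN   # classified positive / classified negative
M = C + A

sens = TP / C             # sensitivity
spec = TN / A             # specificity
acc  = (TP + TN) / M      # accuracy
ppv  = TP / P             # positive predictive value
fdp  = FP / P             # false discovery proportion = 1 - PPV
npv  = TN / N             # negative predictive value
print(sens, spec, ppv)
```

Note that although sensitivity and specificity are high (90% and 95%), the PPV is only 0.75: one in four positive calls is a false discovery.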

SLIDE 15

Hypothesis Testing, Classification

Multiple testing

Aim: avoid false decisions, reduce their frequency. In the same experiment we test N null hypotheses at level α. If the nulls are all correct and the tests independent, we will wrongly reject at least one with probability P = 1 − (1 − α)^N, and on average we will wrongly reject Nα of the null hypotheses.

With α = 0.05:

N    P[at least one FP]   Expected FP
10   40%                  0.5
50   92%                  2.5
90   99%                  4.5

This would create much confusion and contradictions between different studies.
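The numbers in the small table above follow directly from the formula (Python sketch for illustration):

```python
# Probability of at least one false positive among N independent tests at level alpha,
# and the expected number of false positives
alpha = 0.05
for N in (10, 50, 90):
    p_any = 1 - (1 - alpha) ** N
    expected = N * alpha
    print(N, round(p_any, 2), expected)
```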

SLIDE 16

Hypothesis Testing, Classification

Differential Expression

Multiple testing (MT): M = 20,000 genes, α = 0.05. When no gene is DE (H0 true for all genes), we expect FP = M · α = 1,000 false positive genes.
One approach to MT: make the test more stringent, adapt α.
Bonferroni method: use α/M = 2.5e-6 (or keep α and multiply the p-values by M).
Bonferroni gives strong control: P[ FP ≥ 1 ] ≤ α.
It is conservative: when the number of tests is > 10 there might be little power left.
It is difficult when M is large, because very small probabilities are not necessarily accurate.

SLIDE 17

Hypothesis Testing, Classification

Multiple testing

Many other approaches:
Control the false discovery rate FDR (the expected proportion of false discoveries among the rejected hypotheses). The FDR is a less stringent criterion than the family-wise error rate, so these methods are more powerful. A simple method based on p-values: Benjamini-Hochberg (BH) FDR (in R: p.adjust).
Multiple-testing corrections with permutation methods.
Control the false discovery proportion FDP: Pr( FP/P ≥ γ ) ≤ α (the observed proportion of false discoveries among the rejected hypotheses). The FDP relates directly to the information in the observed data.
Control false discovery counts FDC (the number of false discoveries among the rejected hypotheses): Korn's permutation method, Korn et al. (2001). Control Pr( FP ≥ k ) ≤ α (when k = 0: FWER).

SLIDE 18

Hypothesis Testing, Classification

Benjamini-Hochberg (BH) FDR

This correction is not as stringent as Bonferroni: it accepts more false positives, but there will also be fewer false negative genes.
1. The p-values of the genes are ranked from the smallest to the largest.
2. The largest p-value remains as it is.
3. The second largest p-value is multiplied by the total number of genes n in the gene list divided by its rank: corrected p-value = p · n/(n−1). If this is less than 0.05, it is significant.
4. The third largest p-value is corrected in the same way: corrected p-value = p · n/(n−2), and so on.
(In practice the adjusted values are also forced to be non-decreasing in the ranking, as done by R's p.adjust(method="BH").)
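The procedure above can be written out compactly (a Python sketch of what R's p.adjust(method="BH") computes; the function name is ours):

```python
import numpy as np

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values, as in R's p.adjust(method='BH')."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    # multiply each ranked p-value by n / rank ...
    scaled = p[order] * n / np.arange(1, n + 1)
    # ... then enforce monotonicity from the largest p-value down, capped at 1
    adj = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    out = np.empty(n)
    out[order] = adj
    return out

print(bh_adjust([0.005, 0.04, 0.03]))
```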

SLIDE 19

Hypothesis Testing, Classification. Figure: Source: Pounds (2006) [4]
SLIDE 20

Confidence Intervals and Power

Confidence Intervals

The observed value is affected by measurement error. What is the real value? Observed value ≠ real value, most of the time.
Confidence interval: a range of values (from a lower to an upper confidence limit) that will include the true parameter most of the time.
95% CI: if we took an infinite number of samples of the same kind, the true parameter would be included in the CI 95% of the time (coverage 95%). There remains a 5% chance that the true parameter is outside the CI.
Multiple measurements are needed to estimate the CI.

SLIDE 21

Confidence Intervals and Power (figure slide)
SLIDE 22

Confidence Intervals and Power

SD

Repeated measurements: X as a random variable (RV). Assume X ∼ N(µ, σ); real parameters: µ, σ; empirical estimates: µ̂, σ̂.
σ is the standard deviation (SD) of the distribution of a single measurement. The SD is a measure of variability (scatter), of the degree to which the individual values deviate from the mean. The variation can derive from different sources: measurement error, biological heterogeneity, ...
Sample Xi for i = 1, 2, ..., n. The sample SD σ̂ is an estimate of the population SD σ and is computed using the deviations of the single values from the sample mean:

σ̂ = √( Σi=1..n (Xi − X̄)² / (n − 1) )   (2)

SLIDE 23

Confidence Intervals and Power (figure slide)
SLIDE 24

Confidence Intervals and Power (figure slide)
SLIDE 25

Confidence Intervals and Power

The sample mean µ̂ is a RV:

µ̂ = mean(X) = ( Σi=1..n Xi ) / n   (3)

The mean depends on the set of measurements (the sample). How is it distributed? Consider the sum

S = Σi=1..n Xi   (4)

The mean of a sum of independent RVs is the sum of the means:

E(S) = n µ   (5)

Likewise the variance is additive (not the standard deviation):

Var(S) = n σ²   (6)

SLIDE 26

Confidence Intervals and Power (figure slide)
SLIDE 27

Confidence Intervals and Power

Notably, the sum of independent normals is normal: S ∼ N(nµ, √n σ).
Mean: µ̂ = S/n. Division of a RV by a scalar n scales the mean and the SD by the same n, the variance by n²:

E[µ̂] = µ
Var(µ̂) = Var(S)/n² = σ²/n   (7)

The standard deviation of the mean is called the standard error of the mean (SEM):

SEM = SD/√n = σ/√n   (8)
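Equations (2) and (8) in a few lines of numpy (Python for illustration; the data values are invented):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
n = len(x)
sd = x.std(ddof=1)        # sample SD, divisor n - 1 as in eq. (2)
sem = sd / np.sqrt(n)     # standard error of the mean, eq. (8)
print(round(float(sd), 4), round(float(sem), 4))
```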

SLIDE 28

Confidence Intervals and Power

Confidence Interval

We have the sample mean µ̂; where is the true value µ? 95% of the time it is somewhere in the 95% CI:

CI = µ̂ ± k · SEM = µ̂ ± k · σ/√n   (9)

k depends on the desired coverage probability. We do not know σ, so we use σ̂; then

CI = µ̂ ± k(n) · SEM = µ̂ ± k(n) · σ̂/√n   (10)

SLIDE 29

Confidence Intervals and Power

Imprecision in the estimate of σ adds variability: k(n) depends on n via a Student t distribution with df = n − 1, k(n) = tα,n−1. Asymptotically (n → ∞), k = 1.96 for 95% coverage (convergence of the Student t to a standard normal distribution). k is larger if n is smaller.

n    df   k
2    1    12.71
3    2    4.30
4    3    3.18
10   9    2.26
30   29   2.05
∞    ∞    1.96
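The k(n) column can be reproduced with the Student t quantile function (Python with scipy for illustration):

```python
# k(n) for 95% coverage is the 97.5% quantile of a Student t with df = n - 1
from scipy import stats

for n in (2, 3, 4, 10, 30):
    print(n, round(float(stats.t.ppf(0.975, n - 1)), 2))
print(round(float(stats.norm.ppf(0.975)), 2))  # the n -> infinity limit, 1.96
```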

SLIDE 30

Confidence Intervals and Power

Confidence Interval

CIs have the desired meaning only if the sample is representative of the true values, of the whole population; otherwise, if there is bias, the coverage can be much smaller.
The SD σ describes the scatter of the data; the SEM σ/√n refers to the scatter of an estimate of the sample mean. The SEM depends on the formula and sample size used to estimate the mean; the SD is a property of the sampled population or measurement process only.

SLIDE 31

Confidence Intervals and Power

Microarray differential gene expression

Multiple measurements → empirical SD(GEi) = σ̂i
SEM(mean(GEi)) = σ̂i / √ni
CI(mean(GEi)) = mean(GEi) ± k · σ̂i / √ni
Aim: reach a specified tolerance γ in the precision of the estimate: k(n) · σ̂ / √n < γ
Iterative methods are needed to determine n; in first approximation (k ≈ 2):

2 σ̂ / √n < γ  →  n ≥ 4 (σ̂/γ)²

SLIDE 32

Confidence Intervals and Power

Sample size calculation

n ≥ 4 (σ̂/γ)²

Example: SD σ̂ = 0.25, tolerance γ = 0.1:
n ≥ 4 · (2.5)² = 4 · 6.25 = 25 arrays

With only three microarrays (in each group):
95% CI(mean(GEi)) = mean(GEi) ± 4.3 · σ̂i/√3 = mean(GEi) ± 2.5 · σ̂i

Affymetrix 3' gene expression arrays, log2 GE level: technical σ ≃ 0.06 → the CI has a width of 0.3, corresponding to a variation of over 20% (2^0.3 ≈ 1.23). Similarly, for the difference of means (M), one can calculate the sample size needed for a target precision for a particular gene.
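The first-approximation sample-size rule is a one-liner (Python sketch; the function name is ours):

```python
import math

def n_for_tolerance(sd, gamma):
    """First-approximation sample size so that 2*sd/sqrt(n) < gamma (k ~ 2)."""
    return math.ceil(4 * (sd / gamma) ** 2)

print(n_for_tolerance(0.25, 0.1))  # the 25-array example above
```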

SLIDE 33

Confidence Intervals and Power

Power

True difference of means = δ. Assume mean(GE1) = 0 and mean(GE2) = δ.
Null hypothesis: mean(GE1) = mean(GE2), i.e. δ = 0, that is M = 0.
Alternative hypothesis: mean(GE1) ≠ mean(GE2).
Decide against the null hypothesis if the observed M is sufficiently large.
Distribution of M? mean(M) = 0 under H0; mean(M) = δ under HA.

Var(M) = Var(mean(GE1)) + Var(mean(GE2)) = σ1²/n1 + σ2²/n2

Significance level α: reject the null hypothesis when M > kα SD(M). For a normal distribution kα = Zα/2; with σ estimated from the data, kα = tα/2 with df = n1 + n2 − 2.
If the true difference of means is δ, will the result correctly lead to rejection of H0?

SLIDE 34

Confidence Intervals and Power

Figure: Distribution of M under H0 and HA

SLIDE 35

Confidence Intervals and Power

Wrongly failing to reject (a false negative decision) occurs with probability β, for example β = 20%. Power = 1 − β, for example 80%: the ability of the statistical test to detect the existing difference. Require

P[ M > kα SD(M) ] ≥ 1 − β  →  δ > kα SD(M) + kβ SD(M)

Student t: kα = tα/2; normal: kα = Zα/2.

SLIDE 36

Confidence Intervals and Power

δ > (kα + kβ) √( σ1²/n1 + σ2²/n2 )   (11)

If n1 = n2 = n:

δ √n > (kα + kβ) √( σ1² + σ2² )   (12)

If σ1 = σ2 = σ:

δ > σ (kα + kβ) √( 1/n1 + 1/n2 )   (13)

If both:

δ √n > √2 σ (kα + kβ)   (14)

With the SD σ estimated from data (σ̂ instead of the real σ): (kα + kβ) = tα/2 + tβ; normal approximation: kα + kβ = Zα/2 + Zβ.

SLIDE 37

Confidence Intervals and Power

n > 2 (σ/δ)² (Zα/2 + Zβ)²   (15)

∆ = signal/noise = δ/σ

n > 2 (1/∆²) (Zα/2 + Zβ)²   (16)

where n = n1 = n2; total number of arrays N = 2n > 4 (1/∆²)(Zα/2 + Zβ)²
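Equation (16) as a per-group sample-size function, using the normal approximation (Python with scipy for illustration; the function name is ours):

```python
# Per-group sample size from eq. (16), normal approximation, two-sided alpha
import math
from scipy.stats import norm

def n_per_group(delta_over_sigma, alpha=0.05, beta=0.2):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)
    return math.ceil(2 * z ** 2 / delta_over_sigma ** 2)

# signal/noise Delta = 2 with alpha = beta = 0.1, as in the example that follows
print(n_per_group(2, alpha=0.1, beta=0.1))
```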

SLIDE 38

Confidence Intervals and Power

Table: Quantiles Zα of a standard normal distribution for selected values of the probability α

α       Zα      α         Zα      α           Zα
0.50    0.00    0.0050    2.58    0.000050    3.89
0.25    0.67    0.0025    2.81    0.000025    4.06
0.10    1.28    0.0010    3.09    0.000010    4.26
0.050   1.64    0.00050   3.29    0.000005    4.42
0.025   1.96    0.00025   3.48    0.0000025   4.56
0.010   2.33    0.00010   3.72    0.000001    4.75
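The table can be spot-checked against the normal quantile function (Python with scipy for illustration):

```python
# Z_alpha is the upper-alpha quantile of the standard normal distribution
from scipy.stats import norm

for alpha in (0.25, 0.025, 0.0010, 0.000005):
    print(alpha, round(float(norm.ppf(1 - alpha)), 2))
```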

SLIDE 39

Confidence Intervals and Power

Example: α = 0.1, β = 0.1:

(Zα/2 + Zβ) = −1.64 − 1.28 = −2.92;  |Zα/2 + Zβ| ≈ 3

N = 2n > 4 · 3² · (σ/δ)² = 36 · (1/∆²)

∆ = signal/noise = δ/σ = 0.5/0.25 = 2  →  N > 9

SLIDE 40

Confidence Intervals and Power

Genes all have different parameters. Gene expression is correlated, not independent: work with groups of correlated genes? (gene modules)

SLIDE 41

Linear Models, Designed Studies

Simple Linear Regression

Linear regression refers to determining a particular line through a scatterplot that captures the trend in the relation between two variables. It is used for two broad purposes: explanation and prediction.

SLIDE 42

Linear Models, Designed Studies

Equation for a line to predict y knowing x (in slope-intercept form): y = a + b ∗ x (17) a is called the intercept ; b is the slope

SLIDE 43

Linear Models, Designed Studies

R linear modeling

Multivariable linear regression:

Y = β0 + β1X1 + β2X2 + β3X3 + ... + ε   (18)

is linear in the parameters; fitting determines the estimates β̂i.
In R: lm(y ~ x1 + x2 + x3) computes the regression coefficients (intercept and slopes). The ~ can be read as "described (or modeled) by". The result of lm is a model object:
lm.result <- lm(Y ~ X)
summary(lm.result)

SLIDE 44

Linear Models, Designed Studies

Model Fitting (parameter estimation)

Error = observed y − predicted ŷ = y − (â + b̂ x) = ε = residual
The least squares (LS) method: minimize Σi εi²
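The LS criterion can be sketched with numpy's least-squares solver (Python for illustration; the data are invented and lie exactly on a line):

```python
import numpy as np

# Data exactly on the line y = 2 + 3x: least squares recovers intercept and slope
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 + 3 * x
X = np.column_stack([np.ones_like(x), x])     # design matrix with an intercept column
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes the sum of squared residuals
print(coef)  # approximately [2., 3.]
```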

SLIDE 45

Linear Models, Designed Studies

summary lm

Call: lm(formula = y ~ x + z)

Residuals:
     Min      1Q  Median      3Q     Max
 -67.838 -12.608   1.428  14.867  42.896

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.5387     1.8917  10.857  < 2e-16 ***
x             0.9597     0.1996   4.808 4.59e-06 ***
z             0.6576     0.3729   1.764   0.0804 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.06 on 117 degrees of freedom
Multiple R-squared: 0.1863, Adjusted R-squared: 0.1724
F-statistic: 13.39 on 2 and 117 DF, p-value: 5.782e-06

SLIDE 46

Linear Models, Designed Studies

Linear Models, Designed Studies

We are interested in genes that are associated with a given phenotype. Example: breast cancer, where ER activity plays a big role, and histological grade can distinguish tumors with a higher risk of relapsing as metastasis (grade 3) from tumors with less risk (grade 2). We have two factors, one with two levels and another with three ordered levels. Linear model?

SLIDE 47

Linear Models, Designed Studies

Linear model: gene expression Y as a function of two independent variables (covariates), ER status (e) and grade (g):

E[Yeg] = αe + βg   (19)
Yeg = αe + βg + εeg   (20)

There are six groups. Different parametrisations are possible, for example:

E[Y01] = 0 + β1    E[Y11] = α + β1
E[Y02] = 0 + β2    E[Y12] = α + β2
E[Y03] = 0 + β3    E[Y13] = α + β3

SLIDE 48

Linear Models, Designed Studies

Many (n) samples, each with its own combination of the k (here 4) covariates:

1. sample: ER+, grade 2
2. sample: ER−, grade 3
3. sample: ER+, grade 1
4. sample: ER−, grade 3, etc.

Y1 = α + β2, Y2 = 0 + β3, Y3 = α + β1, Y4 = 0 + β3, for Yi, i = 1 ... n.

In more compact form: Y = Xβ + ε, with Y and the design matrix X known and β to be estimated. Example (columns of X: α, β1, β2, β3):

Y = (Y1, Y2, Y3, Y4)ᵀ    β = (α, β1, β2, β3)ᵀ    ε = (ε1, ε2, ε3, ε4)ᵀ

X =
  1 0 1 0
  0 0 0 1
  1 1 0 0
  0 0 0 1

SLIDE 49

Linear Models, Designed Studies

The design matrix X contains the information about the design of the study: it specifies which values the independent variables had in each case.

        α  β1 β2 β3
Y1 |    1  0  1  0
Y2 |    0  0  0  1
Y3 |    1  1  0  0
Y4 |    0  0  0  1

In an experimental study these values are usually set by the researcher (designed); in an observational study they might not be controlled. The estimation of the effects, of the parameters of the model, is based on the observations Y and the design matrix. The standard fitting criterion is minimization of the squared residuals (least squares, LS method):

Σi εi² = MINIMUM

SLIDE 50

Linear Models, Designed Studies

t-test

Two-sample t-statistic:

t = d / sd(d)   (21)

where d = observed difference = average of group 2 − average of group 1. The denominator is an estimate of the standard deviation of the numerator. Estimate sd(d) from the standard deviations si of the measurements in each group i = 1, 2:

sd(d) = √( sp² (1/n1 + 1/n2) )   (22)

where the pooled estimate sp is used (squared: the pooled variance):

sp² = [ (n1 − 1) s1² + (n2 − 1) s2² ] / (n1 + n2 − 2)   (23)

a weighted average of the single-group empirical variances s1² and s2².
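Equations (21)-(23) written out directly (a Python sketch for illustration; the data values are invented):

```python
import math

def pooled_t(x1, x2):
    """Two-sample t-statistic with pooled variance, eqs. (21)-(23)."""
    n1, n2 = len(x1), len(x2)
    m1 = sum(x1) / n1
    m2 = sum(x2) / n2
    s1sq = sum((v - m1) ** 2 for v in x1) / (n1 - 1)   # group 1 variance
    s2sq = sum((v - m2) ** 2 for v in x2) / (n2 - 1)   # group 2 variance
    sp2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)  # eq. (23)
    sd_d = math.sqrt(sp2 * (1 / n1 + 1 / n2))                  # eq. (22)
    return (m2 - m1) / sd_d                                    # eq. (21)

print(round(pooled_t([1, 2, 3, 4], [2, 4, 6, 8]), 4))
```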

SLIDE 51

Linear Models, Designed Studies

If the values of the two groups each follow a normal distribution, the t-statistic follows a student distribution with degrees of freedom df = n1 + n2 − 2. The student distribution is generally wider than the standard normal, but converges to the standard normal when df → ∞. The extra variability is due to variability in the estimation of the standard deviation, which is particularly inaccurate when only a small number of points is available

SLIDE 52

Linear Models, Designed Studies

Figure: t distribution with varying degrees of freedom

SLIDE 53

Linear Models, Designed Studies

In microarray experiments with a high number of samples in each group, the t-statistic is used with success. When the number of samples is smallish (about 6 or fewer per group), it is problematic. Due to the wide variance of the estimates for sp and the high number of genes, there is a number of genes whose empirical sp is unrealistically low (even below the technical measurement error of the technique). Their t-statistics are "too large". When we rank genes by their t-statistics, these genes can be among the top-ranking genes even when the numerator is small, and can lead to wrong conclusions.

SLIDE 54

Linear Models, Designed Studies

penalized t-tests

To reduce the problem of under-estimating the standard deviation, a key idea is to take as the standard deviation of gene k a modified value sp*k, obtained using also information from the other genes. Imagine that the typical standard deviation is s0; sp*k is determined so that it lies between spk and s0. Depending on additional parameters, it is closer to spk or to s0.
This is called a shrinkage method, and the modified statistic t*, the ratio of the (unmodified) numerator to the shrunken standard deviation, is called a moderated t (mod. t) or penalized t. The mod. t is intermediate between the t and the M (eq. 1): if sp*k = spk, the mod. t equals the traditional t; if sp*k = s0 for all genes, the moderated t is proportional to M and ranks the genes as M does.
Practical utility of the mod. t: it can improve the quality of the gene ranking, especially when the sample size is small.
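A minimal sketch of the shrinkage idea (Python for illustration; the prior values s0 and d0 are illustrative constants here, whereas limma estimates them from the data by empirical Bayes):

```python
import math

def moderated_t(d, sp, n1, n2, s0=1.0, d0=2.0):
    """Penalized t: shrink the gene's pooled SD sp towards a typical value s0.
    d0 controls the strength of the shrinkage (d0 = 0 gives the ordinary t)."""
    df = n1 + n2 - 2
    sp_star = math.sqrt((d0 * s0 ** 2 + df * sp ** 2) / (d0 + df))
    return d / (sp_star * math.sqrt(1 / n1 + 1 / n2))

# A gene with an unrealistically small sp no longer gets a huge statistic:
print(moderated_t(d=0.5, sp=0.01, n1=4, n2=4))           # shrunk, moderate value
print(moderated_t(d=0.5, sp=0.01, n1=4, n2=4, d0=0.0))   # ordinary t, much larger
```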

SLIDE 55

Linear Models, Designed Studies

limma

Limma is a package for the analysis of gene expression microarray data, especially the use of linear models for analysing designed experiments and the assessment of differential expression. For online help type help(package=limma). It makes the analyses stable even for experiments with a small number of arrays; this is achieved by borrowing information across genes.

Create your design matrix. Suppose there are three RNA sources to be compared, and that the first three arrays are hybridized with RNA1, the next two with RNA2 and the next three with RNA3:

design <- model.matrix(~ -1 + factor(c(1,1,1,2,2,3,3,3)))
colnames(design) <- c("group1", "group2", "group3")

SLIDE 56

Linear Models, Designed Studies

Estimate the fold changes and standard errors by fitting a linear model for each gene:
fit <- lmFit(MA, design=design)
Apply empirical Bayes smoothing to the standard errors:
fit <- eBayes(fit)
Show statistics for the top 10 genes:
topTable(fit, adjust="fdr")

The empirical Bayes approach is a hierarchical model in which the parameters of the gene expression distributions are generated by other distributions with their own meta-parameters, common to all genes; the model allows sharing estimation between genes. The moderated t-statistic (t) is the ratio of the M-value to its standard error. It has the same interpretation as an ordinary t-statistic except that the standard errors have been moderated across genes, i.e. shrunk towards a common value. This has the effect of borrowing information from the ensemble of genes to aid inference about each individual gene.

SLIDE 57

Linear Models, Designed Studies

The p-value is obtained from the distribution of the moderated t-statistic, usually after some form of adjustment for multiple testing. The B-statistic (lods or B) is the log-odds that the gene is differentially expressed (DE). A B-statistic of zero corresponds to a 50-50 chance that the gene is DE. The B-statistic is automatically adjusted for multiple testing by assuming that 1% of the genes (or some other percentage specified by the user) are expected to be DE.

Suppose for example that B = 1.5. The odds of DE are exp(1.5) = 4.48, i.e. about four and a half to one. The probability that the gene is DE is 4.48/(1 + 4.48) = 0.82, i.e. about 82%.

The eBayes function computes one more useful statistic: the moderated F-statistic (F) combines the t-statistics for all the contrasts into an overall test of significance for that gene.
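The log-odds-to-probability conversion in the example is just two lines (a Python sketch; the function name is ours):

```python
import math

def b_to_probability(b):
    """Convert a log-odds B-statistic into the probability of being DE."""
    odds = math.exp(b)
    return odds / (1 + odds)

print(round(math.exp(1.5), 2))          # odds, about 4.48
print(round(b_to_probability(1.5), 2))  # probability, about 0.82
print(b_to_probability(0.0))            # 0.5: a 50-50 chance
```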

SLIDE 58

Patterns in data, Clustering

Data exploration: Aims

Identify major trends and relationships in the data
Simplify the data to let the strongest theme come to the surface
Check for systematic errors, batch effects

SLIDE 59

Patterns in data, Clustering (figure slide)

SLIDE 60

Patterns in data, Clustering (figure slide)

SLIDE 61

Patterns in data, Clustering (figure slide)

SLIDE 62

Patterns in data, Clustering (figure slide)

SLIDE 63

Patterns in data, Clustering (figure slide)
SLIDE 64

Data exploration , Dimension reduction

PCA

In data sets with many variables, groups of variables often move in a coordinated way (are correlated). More than one variable might be reacting to the same set of causes, for example a group of genes of the same biological program, or genes regulated by a common transcription factor. We can simplify the data by replacing a group of correlated variables with a single new variable. Principal components analysis (PCA) is a method for achieving this simplification: PCA generates a new set of variables, called principal components (PCs), and each PC is a linear combination of the original variables.

SLIDE 65

Data exploration , Dimension reduction

PCA is mostly applied as a tool in exploratory data analysis to reduce multidimensional data sets to a lower number of uncorrelated variables. PCA transforms the data into a new coordinate system defined by those linear combinations of the original variables that have maximal variance. The PCs are constrained to be uncorrelated (orthogonal) to each other, so in principle there is no redundant information. One hopes to gain insight into the most "important" aspects of the data. By keeping the lower-order PCs and ignoring the higher-order ones, the data are simplified and can be visualized and explored more easily. Sometimes it is tremendously helpful, sometimes not at all.

SLIDE 66

Data exploration , Dimension reduction

The cloud of points is translated and rotated rigidly so that the longest "diagonal" of the cloud is aligned with the first axis of a new coordinate system The direction of the second longest "diagonal" aligns with the second axis of an orthogonal coordinate system, etc

Figure: Source: Lydia E. Kavraki, http://cnx.org/content/m11461/latest/

SLIDE 67

Data exploration , Dimension reduction

Procedure

Suppose the data comprise n observations of k variables; we want to reduce the data so that each observation is described by only p variables, p < k.
Calculate the empirical mean of each variable and subtract it from all values (mean centering).
Data matrix Yn,k: each row one observation, each column one variable (probe, gene). Microarray data typically have the observations in the columns, so first transpose to get the traditional data matrix Yn,k.
Determine the empirical covariance matrix:

cov(Y) = Σk,k = (1/n) YᵀY   (24)

Reminder: cov(A, B) = (1/n) Σi (Ai − Ā)(Bi − B̄)

SLIDE 68

Data exploration , Dimension reduction

Search for a matrix defining the rotation: an orthogonal matrix Pk,k that transforms each data vector ξ into the new coordinate system. The matrix P can be of reduced rank, for example Pk,p with p << k, projecting the n data vectors into the space R^p of dimension p. The new data matrix is Zn,p = Yn,k Pk,p. For example, with p = 2 we can represent the n observations in a 2-dimensional plot. How do we choose P so that the variance of Z is maximal, i.e. the loss of variance compared to the full data Y is minimized?

SLIDE 69

Data exploration , Dimension reduction

Let p1 be a vector of norm 1 such that for any vector a of norm 1: var(aᵀξ) ≤ var(p1ᵀξ), where the ξ are the observations. Note that var(aᵀξ) = aᵀΣa. The solution is p1 = the eigenvector of Σ with the largest eigenvalue λ1, and var(p1ᵀξ) = λ1.
Eigenproblem: find the eigenvectors a such that Σa = λa a, where λa ∈ R is the eigenvalue, the real number by which a is scaled under the linear transformation encoded by Σ.
Let p2 be a vector of norm 1, orthogonal to p1, such that for any vector a of norm 1 orthogonal to p1: var(aᵀξ) ≤ var(p2ᵀξ); and so on for p3, p4, ...
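The construction can be carried out numerically via the eigendecomposition of the covariance matrix (a Python sketch with simulated correlated data, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# two correlated variables: the second is mostly a rescaled copy of the first
x = rng.normal(size=200)
Y = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=200)])
Y = Y - Y.mean(axis=0)                 # mean centering

C = Y.T @ Y / len(Y)                   # empirical covariance matrix, eq. (24)
lam, P = np.linalg.eigh(C)             # eigenvalues ascending, eigenvectors in columns
lam, P = lam[::-1], P[:, ::-1]         # reorder: largest eigenvalue first

Z = Y @ P                              # projected data: the principal components
# the variance of each PC equals its eigenvalue, and the PCs are uncorrelated
print(np.allclose(Z.var(axis=0), lam))
print(abs(float(np.mean(Z[:, 0] * Z[:, 1]))) < 1e-8)
```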

SLIDE 70

Data exploration , Dimension reduction

The eigenvectors are taken as the columns of P = [p1, p2, ..., pr]. Then Zn,p = Yn,k Pk,p has cov(Z) = diag(λ1, λ2, λ3, ...): the new variables have zero correlation.
prcomp {stats} in R performs a PCA on a given data matrix; the calculation is done by a singular value decomposition (SVD) of the (centered and possibly scaled) data matrix.
Scaling: one uses the correlation matrix instead of the covariance matrix,

corr(A, B) = cov(A, B) / √( var(A) var(B) )

(this standardises the SD of each variable and adjusts for different scales of the variables).

SLIDE 71

Data exploration , Dimension reduction

p = 1: P = [p1]. Each observation y is transformed into p1ᵀy ∈ R:

p1ᵀy = p11 y1 + p12 y2 + p13 y3 + ...

The first principal component is a single axis in space. When we project each observation onto that axis, the resulting values form a new variable. The variance of this variable is the maximum among all possible choices of the first axis.
p = 2: P = [p1, p2]. The observation y is transformed into (p1ᵀy, p2ᵀy) ∈ R².
The second principal component is another axis in space, perpendicular to the first. Projecting the observations onto this axis generates a second new variable; its variance is the maximum among all possible choices of this second axis. Etc.

SLIDE 72

Data exploration , Dimension reduction

The first few components might explain a good percentage of the variability in the data. If one keeps the first p components and discards the rest, the residual variance is

RV = ( Σr=p+1..k λr ) / ( Σr=1..k λr )

Envisage for example RV < 0.1.

SLIDE 73

Data exploration , Dimension reduction

Use in R:
pca.Y = prcomp(x = t(Ymatrix), retx = TRUE)
prcomp returns an object of class "prcomp" including:
"rotation": the matrix whose columns contain the eigenvector coordinates, the linear combinations that generate the new variables; each variable has a coefficient in each eigenvector ("variable loadings")
"x": the values of the rotated / projected data Z ("component scores")

SLIDE 74

Data exploration , Dimension reduction

summary(pca.Y)
screeplot(pca.Y): plot of principal component number versus eigenvalue, to visualize the relative variances of the components
plot(pca.Y$x[,1:2])
biplot(pca.Y): helps visualize both the principal component coefficients for each variable and the principal component scores for each observation in a single plot. Each variable is represented by a vector; the length of the vector indicates how the variable contributes to the two principal components in the plot.

SLIDE 75

Data exploration, Dimension reduction (figure slide)
SLIDE 76

Data exploration , Dimension reduction

There is no reason that the directions of maximum variance will contain the features we are interested in. Limitation: we assumed that the principal components are orthogonal to each other; methods such as independent component analysis (ICA) were developed to address this limitation. PCA uses the eigenvectors of the covariance matrix and finds the independent axes of the data (asymptotically) only under the Gaussian assumption. For non-Gaussian or multi-modal Gaussian data, PCA simply de-correlates the axes: the axes with the largest variance need not correspond to the underlying basis of the model that generated the data. PCA might reveal important underlying biology or regulation mechanisms, or disturbing batch effects in the data, or it might not be interpretable ...

SLIDE 77

Data exploration, Dimension reduction

References:
Peter Dalgaard. Introductory Statistics with R. Springer, New York, 2002.
Sorin Draghici. Data Analysis Tools for DNA Microarrays. Chapman & Hall/CRC, 2003.
Ulrich Guller and Elizabeth DeLong. Interpreting statistics in medical literature: a vade mecum for surgeons. J Am Coll Surg, 198:441-458, 2004.
Stanley Pounds. Estimation and control of multiple testing error rates for microarray studies. Briefings in Bioinformatics, 7:25-36, 2006.
Richard M. Simon, Edward L. Korn, Lisa M. McShane, Michael D. Radmacher, George W. Wright, and Yingdong Zhao. Design and Analysis of DNA Microarray Investigations. Springer, 2003.