
EMBnet Course – Introduction to Statistics for Biologists, Jan 2009

Linear Models II

http://bcf.isb-sib.ch/teaching/introStat/

Design of Experiments, Analysis of Variance and Multiple Regression


The research process

  • Scientific question of interest
  • Decision on what data to collect (and how)
  • Collection and analysis of data
  • Conclusions, generalization
  • Communication and dissemination of results


Generic question: Does a ‘treatment’ have an ‘effect’?

Examples:
  • Does wine prevent cancer?
  • Does smoking cause lung cancer?
  • Does milk reduce osteoporosis?
  • Does physical exercise slow arteriosclerosis?
  • Does statin treatment lower blood lipids?


Experimental Design – why do we care?

  • Poor design costs time and money, and raises ethical concerns
  • Good design ensures that relevant data are collected and can be analyzed to test the scientific hypothesis or question of interest
      – Decide in advance how the data will be analyzed
      – ‘Designing the experiment’ = ‘Planning the analysis’
  • The design is about the science (biology)


Planning an Experiment

  • What measurements to make (response)
  • What conditions to study (treatments)
  • What experimental material to use (units)

A “good” experiment
  • tests what you want to test / estimates the effects you are interested in
  • controls for everything else (exclusion, blocking, adjustment) to avoid bias and confounding

Example: cancer diagnosis

Blood samples were taken from 25 cancer patients and a control group of 25 healthy people. The healthy people were a consecutive series that came to hospital as blood donors. The laboratory analyzed the “positive” samples in March and the “negative” samples in April.

What can go wrong in this study?


Example

Agricultural experiment

  • Response: crop yield
  • Treatments: two different sorts of potatoes are compared
  • Units: two pieces of land can be used (Field 1 and Field 2)


Example: two blocks

Is this a good design?

[Figure: two blocks (Block 1, Block 2) planted with potato types A and B]


Blocking and Replication

  • Replication is needed to estimate the size of the random effects (measurement errors)
  • The fields are subdivided into smaller areas; the sort of potato to be planted is randomized within each of the two blocks

[Figure: Block 1 and Block 2, with 5 replicates of each treatment in the first block and 8 in the second]


Addressing the question

A basic way to address this type of question is to compare two groups of study subjects:
      – Control group: provides a baseline for comparison
      – Treatment group: the group receiving the ‘treatment’


Types of variability

  • Planned systematic differences (between the conditions; wanted)
  • Chance variation (can be handled with statistical models)
  • Unplanned systematic differences (NOT wanted)
      – Can bias results
      – Can only be corrected for if they can be included in the model (adjusting)
      – e.g. time of measurement


Confounding factors

  • Ideally, the treatment and control groups are exactly alike in all respects (except for group membership)
  • A confounding factor (or confounder) is associated with both the group membership and the response
  • Example: the strong association between gender and lung cancer is confounded by smoking
  • Unbalanced factors that are not associated with the response are not confounding


Replication, Randomization, Blocking

  • Replication: reduces random variation of the test statistic and increases generalizability
  • Randomization: removes bias
  • Blocking: reduces unwanted variation
      – The idea is that units within a block are similar to each other, but differ between blocks
  • ‘Block what you can, randomize what you cannot’


Experimental vs. Observational studies

  • Controlled experiment: subjects are assigned to groups by the investigator
      – randomization: protects against bias in assignment to groups
      – blind, double-blind: protects against bias in outcome assessment/measurement
      – placebo: fake ‘treatment’
  • Observational study: subjects ‘assign’ themselves to groups
      – confounder: associated with both group membership and the outcome of interest


Observational studies

Advantages

      – often easier to carry out
      – don’t ‘interfere’ with the system; what you see is ‘natural’ rather than ‘artificial’
      – variation is biologically relevant, as it has been unaltered
      – sometimes manipulation is not possible

Drawbacks

– confounders


Hibernation example

General question: How do changes in an animal’s environment cause the animal to start hibernating?
  • What changes should be studied?
      – temperature
      – photoperiod (day length: long or short)
  • What measurement(s) to take?
      – nerve activity enzyme (Na+/K+-ATPase)
  • What animal to study?
      – golden hamster, 2 organs (brain, heart)


Specific question

General question: How do changes in an animal’s environment cause the animal to start hibernating?

=> Specific question: What is the effect of changing day length on the concentration of the sodium pump enzyme in two golden hamster organs?


Sources of variability

  • Variability due to conditions of interest (wanted)
      – Day length (long vs. short)
      – Organ (heart vs. brain)
  • Variability in the response (NOT wanted): measurement error
      – Preparation of enzyme suspension
      – Instrument calibration
  • Variability in experimental units (NOT wanted)
      – Biological differences among hamsters
      – Environmental differences


Basic designs: Completely randomized

  • Focus on 1 organ (heart, say)
  • Random assignment: use chance to assign hamsters to long and short days
  • ‘Random’ is not the same as ‘haphazard’
  • For balance, assign the same number to short and long
  • Example (8 hamsters):
      – Long: 4, 1, 7, 2
      – Short: 3, 8, 5, 6
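A chance-based assignment of this kind can be sketched in R. This is only a minimal illustration, assuming the hamsters are labelled 1 to 8; the seed and the variable names are assumptions made for this example and do not come from the course material.

```r
# Completely randomized design: 8 hamsters, 4 per day-length group.
set.seed(2009)                     # arbitrary seed, only for reproducibility
hamsters <- 1:8                    # assumed labels for the 8 animals
shuffled <- sample(hamsters)       # random permutation of the labels

assignment <- list(
  Long  = sort(shuffled[1:4]),     # first four drawn  -> long days
  Short = sort(shuffled[5:8])      # remaining four    -> short days
)
print(assignment)
```

Using `sample()` rather than picking animals "by eye" is what distinguishes a random assignment from a haphazard one, and splitting the permutation in half keeps the design balanced.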


Basic designs: Randomized block

  • Suppose that the hamsters came from 4 different litters, with 2 hamsters per litter
  • Expect hamsters from the same litter to be more similar than hamsters from different litters
  • Take each pair of hamsters and randomly assign short or long to one member of the pair; the other member gets the other condition
  • Example (coin flip, say): S, L // L, S // S, L // S, L
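The within-litter coin flip can be sketched as follows. This is an illustrative sketch only: the litter and hamster labels and the seed are assumptions, not part of the original example.

```r
# Randomized block design: 4 litters (blocks), 2 hamsters per litter.
set.seed(2009)                               # arbitrary seed
litters <- 4

# For each litter, put the two treatments S (short) and L (long) in random order.
design <- t(replicate(litters, sample(c("S", "L"))))
dimnames(design) <- list(paste0("litter", 1:litters),
                         c("hamster1", "hamster2"))
print(design)
```

Each row contains exactly one S and one L, so day length is compared within litters and litter-to-litter differences do not bias the comparison.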


Basic designs: Factorial crossing

  • Compare 2 (or more) sets of conditions in the same experiment: Long vs. Short and Heart vs. Brain
  • In this example, there are 4 combinations of conditions:
      – Long/Heart, Long/Brain, Short/Heart, Short/Brain
  • Example (2 coin flips, say):
      – L/H: 7, 2
      – L/B: 4, 1
      – S/H: 3, 5
      – S/B: 8, 6


Basic designs: Split plot/ repeated measures

  • First, randomly assign Long days to 4 hamsters and Short days to the other 4
  • Then use each hamster twice: once to get the Heart concentration, and once to get the Brain concentration
  • This design has units of different sizes for each factor
      – for day length, the unit is a hamster
      – for organ, the unit is a part of a hamster


Summary

  • Optimize the precision of the estimates for the main comparisons of interest
  • The design must satisfy the scientific and physical constraints of the experiment
  • You can save a lot of time, money and heartache by consulting an experienced analyst on design issues before any steps of the experiment have been carried out


X categorical, Y continuous

We can visually inspect how the distribution of Y depends on X with a series of boxplots or stripcharts

ANOVA

  • Stands for ANalysis Of VAriance
  • Despite the name, it is a test of differences in means; it generalizes the t-test to more than two groups, defined for example by one categorical variable


The observations $y_{ij}$

Treatment group:   i = 1          i = 2          …    i = k
Means:             $m_1$          $m_2$          …    $m_k$

                   $y_{11}$       $y_{21}$       …    $y_{k1}$
                   $y_{12}$       $y_{22}$       …    $y_{k2}$
                   …              …              …    …
                   $y_{1,n_1}$    $y_{2,n_2}$    …    $y_{k,n_k}$


Mathematical Principle

  • The total variation can be partitioned into between-group and within-group sums of squares (SS):
        total SS = SS between groups + SS within groups
        TSS = MSS + RSS
    (total variation = variation explained by the model + residual variation inside the groups, i.e. measurement error)
  • MS (mean square) = SS / degrees of freedom, computed for the error (MSE) and for each factor
  • F test (Fisher): variance ratio, treatment MS / error MS; expected to be about 1 if the treatment does not explain more variation than error
  • Coefficient of determination: $R^2 = MSS / TSS$
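The partition TSS = MSS + RSS can be checked numerically. The sketch below uses a small made-up data set (three groups of three values, invented purely for illustration); the variable names are assumptions.

```r
# Toy data: 3 groups of 3 observations (made-up values).
y     <- c(4, 5, 6,   7, 8, 9,   1, 2, 3)
group <- factor(rep(c("A", "B", "C"), each = 3))

m   <- mean(y)                 # overall mean
mi  <- ave(y, group)           # group mean, repeated for every observation
TSS <- sum((y  - m)^2)         # total sum of squares
MSS <- sum((mi - m)^2)         # between-group (model) sum of squares
RSS <- sum((y  - mi)^2)        # within-group (residual) sum of squares
R2  <- MSS / TSS               # coefficient of determination

c(TSS = TSS, MSS = MSS, RSS = RSS, R2 = R2)
```

For these values the group means are 5, 8 and 2 and the overall mean is 5, so TSS = 60, MSS = 54, RSS = 6 and R2 = 0.9.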


The ANOVA table

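The analysis is usually laid out in a table. For a one-way layout (where the response is assumed to vary according to grouping on one factor):

Source   df     SS                                      MS           F          p-val
Model    k-1    $MSS = \sum_{i,j} (m_i - m)^2$          MSS/(k-1)    MST/MSE    *
Error    n-k    $RSS = \sum_{i,j} (y_{ij} - m_i)^2$     RSS/(n-k)
Total    n-1    $TSS = \sum_{i,j} (y_{ij} - m)^2$

$m$ = overall mean, $m_i$ = mean within group i; the sums run over all observations j in all groups i; MST = MSS/(k-1) is the treatment (model) mean square and MSE = RSS/(n-k) is the error mean square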
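As a check on the layout of the table, the F statistic and p-value can be computed by hand and compared with the table R produces. This is a sketch on made-up data; `anova()` applied to an `lm()` fit is used here as the reference, and all object names are assumptions.

```r
# Made-up one-way layout: k = 3 groups, n = 9 observations.
y     <- c(4, 5, 6,   7, 8, 9,   1, 2, 3)
group <- factor(rep(c("A", "B", "C"), each = 3))
n <- length(y)
k <- nlevels(group)

mi  <- ave(y, group)                       # group means per observation
MSS <- sum((mi - mean(y))^2)               # model SS, df = k - 1
RSS <- sum((y - mi)^2)                     # error SS, df = n - k

F_by_hand <- (MSS / (k - 1)) / (RSS / (n - k))
p_by_hand <- pf(F_by_hand, k - 1, n - k, lower.tail = FALSE)

tab <- anova(lm(y ~ group))                # R's own ANOVA table
print(tab)
c(F = F_by_hand, p = p_by_hand)
```

The hand-computed values should agree with the `F value` and `Pr(>F)` columns of the first row of `tab`.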


Assumptions

  • We have random samples from each separate population
  • The error variance is the same in each treatment group
  • The samples are sufficiently large that the CLT holds for each sample mean (or the individual population distributions are normal)


ANOVA TEST: What does it mean when we reject H0?

  • The F test (Fisher) variance ratio assesses the null hypothesis H0 that all population means are equal (joint hypothesis)
  • Rejecting the null does NOT mean that the means are all different!
  • It means that at least one is different
  • To find out which one differs, do ‘post hoc’ testing (pairwise t-tests, for example)
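Post hoc comparisons after a significant F test might look like the sketch below. The data are made up, and the choice of the Holm correction and of Tukey's method as a second option are assumptions for this illustration, not recommendations taken from the slides.

```r
# Made-up data: 3 groups of 3 observations.
y     <- c(4, 5, 6,   7, 8, 9,   1, 2, 3)
group <- factor(rep(c("A", "B", "C"), each = 3))

fit <- aov(y ~ group)
print(summary(fit))                  # overall F test of equal means

# Pairwise t-tests (pooled SD), p-values adjusted for multiple testing.
pw <- pairwise.t.test(y, group, p.adjust.method = "holm")
print(pw)

# Alternative: Tukey's honest significant differences on the aov fit.
tk <- TukeyHSD(fit)
print(tk)
```

Because several comparisons are made at once, the p-values are adjusted; unadjusted pairwise tests would inflate the chance of a false positive.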


Interaction

  • Interaction is very common (and very important) in science
  • Interaction is a difference of differences
  • Interaction is present if the effect of one factor is different for different levels of the other factor
  • Main effects can be difficult to interpret in the presence of interaction, because the effect of one factor depends on the level of the other factor


Factorial crossing

  • Compare 2 (or more) sets of conditions in the same experiment
  • Designs with factorial treatment structure allow you to measure interaction between two (or more) sets of conditions that influence the response
      – you will look at this in more detail during the exercises today
  • Factorial designs may be either observational or experimental

Interaction in Models

A linear model with two main effects:

    $E(Y) = \beta_0 + \beta_1 x + \beta_2 z$

If x and z represent groups coded 0, 1, then E(Y) is:

           X=0                      X=1
    Z=0    $\beta_0$                $\beta_0 + \beta_1$
    Z=1    $\beta_0 + \beta_2$      $\beta_0 + \beta_1 + \beta_2$

$\beta_1$ estimates the difference in means for X=1 compared to X=0, independently of the Z status


Interaction in Models

A linear model with two main effects and interaction:

    $E(Y) = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 (x \cdot z)$

  • $\beta_1$ estimates the difference in means for X=1 compared to X=0 when Z=0
  • $\beta_1 + \beta_3$ estimates the difference in means for X=1 compared to X=0 when Z=1

           X=0                      X=1
    Z=0    $\beta_0$                $\beta_0 + \beta_1$
    Z=1    $\beta_0 + \beta_2$      $\beta_0 + \beta_1 + \beta_2 + \beta_3$
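The reading of the interaction coefficient as a "difference of differences" can be illustrated numerically. The coefficient values below are invented for the example, and the object names are assumptions.

```r
# Cell means of a 2x2 layout under E(Y) = b0 + b1*x + b2*z + b3*x*z.
b0 <- 10; b1 <- 2; b2 <- 3; b3 <- -4          # illustrative values only
cells <- expand.grid(x = 0:1, z = 0:1)
cells$mu <- with(cells, b0 + b1 * x + b2 * z + b3 * x * z)
print(cells)

# Effect of X at each level of Z.
effect_z0 <- with(cells, mu[x == 1 & z == 0] - mu[x == 0 & z == 0])  # b1
effect_z1 <- with(cells, mu[x == 1 & z == 1] - mu[x == 0 & z == 1])  # b1 + b3
interaction_effect <- effect_z1 - effect_z0                          # b3

# Fitting the model with interaction to the four cell means
# recovers the four coefficients.
fit <- lm(mu ~ x * z, data = cells)
print(coef(fit))
```

With these values the effect of X is +2 when Z=0 and -2 when Z=1, so reporting a single "main effect of X" would be misleading.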


Interaction plots

[Figure: interaction plots for $\beta_3 = 0$ (no interaction), $\beta_3 < 0$, $\beta_3 > 0$ and $\beta_3 \gg 0$]


More on model formulas

We can also include interaction terms in a model formula. Starting from the main-effects model

    yvar ~ xvar1 + xvar2 + xvar3

examples with interactions are:
      – yvar ~ xvar1 + xvar2 + xvar3 + xvar1:xvar2
      – yvar ~ (xvar1 + xvar2 + xvar3)^2
      – yvar ~ (xvar1 * xvar2 * xvar3)


More on model formulas

  • The generic form is response ~ predictors
  • The predictors can be numeric or factor
  • Other symbols create formulas with combinations of variables (e.g. interactions):
      +    to add more variables
      -    to leave out variables
      :    to introduce interactions between two terms
      *    to include both interactions and the terms (a*b is the same as a+b+a:b)
      ^n   adds all terms including interactions up to order n
      I()  treats what’s in () as a mathematical expression
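One way to see what these operators do is to ask R which terms a formula expands to. The helper name `expand_terms` and the variable names `a`, `b`, `c` are assumptions for this sketch; no data are needed because only the formula is inspected.

```r
# List the model terms that a formula expands to.
expand_terms <- function(f) attr(terms(f), "term.labels")

t_star  <- expand_terms(y ~ a * b)           # a, b and their interaction
t_long  <- expand_terms(y ~ a + b + a:b)     # the same model written out
t_pow   <- expand_terms(y ~ (a + b + c)^2)   # main effects + two-way interactions
t_minus <- expand_terms(y ~ a * b - a:b)     # '-' removes the interaction again
t_I     <- expand_terms(y ~ a + I(b^2))      # I() keeps b^2 as arithmetic

print(list(t_star, t_long, t_pow, t_minus, t_I))
```

The first two calls return the same terms, which is the statement "a*b is the same as a+b+a:b" in executable form.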


Interpreting R output

> chicks.aov <- aov(Weight ~ House + Protein*LP*LS)
> summary(chicks.aov)
              Df  Sum Sq Mean Sq F value    Pr(>F)
House          1  708297  708297 15.8153 0.0021705 **
Protein        1  373751  373751  8.3454 0.0147366 *
LP             2  636283  318141  7.1037 0.0104535 *
LS             1 1421553 1421553 31.7414 0.0001524 ***
Protein:LP     2  858158  429079  9.5808 0.0038964 **
Protein:LS     1    7176    7176  0.1602 0.6966078
LP:LS          2  308888  154444  3.4485 0.0687641 .
Protein:LP:LS  2   50128   25064  0.5596 0.5868633
Residuals     11  492640   44785
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1


Multiple linear regression

  • You can also use more than one ‘X’ variable to predict Y:

        predicted y = a + b1 x1 + b2 x2

  • Example: predict ventricular shortening velocity (Y) from blood glucose (X1) and age (X2)
  • The ‘slopes’ b1 and b2 are called coefficients
  • The prediction function for Y is still linear in the parameters (a, b1, b2)
  • As in simple regression, minimize the total squared deviation from the prediction surface (instead of a line, it is a plane or a higher-dimensional hyperplane)
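The least-squares idea can be made concrete with a small sketch. The data below are made up (they are not the ventricular shortening data), and solving the normal equations directly is shown only to illustrate what `lm()` estimates, not as a description of how `lm()` is implemented.

```r
# Made-up data with two predictors.
x1 <- c(1, 2, 3, 4, 5, 6)
x2 <- c(2, 1, 4, 3, 6, 5)
y  <- c(3.1, 3.9, 7.2, 7.8, 11.1, 11.9)

fit <- lm(y ~ x1 + x2)          # least-squares fit of y = a + b1*x1 + b2*x2
print(coef(fit))

# The same estimates from the normal equations (X'X) b = X'y.
X <- cbind(1, x1, x2)           # design matrix with an intercept column
b <- solve(crossprod(X), crossprod(X, y))
print(b)
```

The model is "linear" because it is linear in a, b1 and b2, which is why the estimates are the solution of a linear system.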

Example: cystic fibrosis

> library(ISwR)
> data(cystfibr)
> round(cor(cystfibr),2)
         age   sex height weight   bmp  fev1    rv   frc   tlc pemax
age     1.00 -0.17   0.93   0.91  0.38  0.29 -0.55 -0.64 -0.47  0.61
sex    -0.17  1.00  -0.17  -0.19 -0.14 -0.53  0.27  0.18  0.02 -0.29
height  0.93 -0.17   1.00   0.92  0.44  0.32 -0.57 -0.62 -0.46  0.60
weight  0.91 -0.19   0.92   1.00  0.67  0.45 -0.62 -0.62 -0.42  0.64
bmp     0.38 -0.14   0.44   0.67  1.00  0.55 -0.58 -0.43 -0.36  0.23
fev1    0.29 -0.53   0.32   0.45  0.55  1.00 -0.67 -0.67 -0.44  0.45
rv     -0.55  0.27  -0.57  -0.62 -0.58 -0.67  1.00  0.91  0.59 -0.32
frc    -0.64  0.18  -0.62  -0.62 -0.43 -0.67  0.91  1.00  0.70 -0.42
tlc    -0.47  0.02  -0.46  -0.42 -0.36 -0.44  0.59  0.70  1.00 -0.18
pemax   0.61 -0.29   0.60   0.64  0.23  0.45 -0.32 -0.42 -0.18  1.00

Pairwise plots of cystic fibrosis vars

> pairs(cystfibr)


Many variables

Pairwise correlations, similarity, clustering, heatmap


R: multiple regression using lm

> attach(cystfibr)
> summary(lm(pemax~age+sex+height+weight))

Call:
lm(formula = pemax ~ age + sex + height + weight)

Residuals:
    Min      1Q  Median      3Q     Max
-47.791 -18.683   2.747  13.413  43.190

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  70.66072   82.50906   0.856    0.402
age           1.57395    3.13953   0.501    0.622
sex         -11.54392   11.23902  -1.027    0.317
height       -0.06308    0.80183  -0.079    0.938
weight        0.79124    0.86147   0.918    0.369

Residual standard error: 27.38 on 20 degrees of freedom
Multiple R-Squared: 0.4413,     Adjusted R-squared: 0.3296
F-statistic: 3.949 on 4 and 20 DF,  p-value: 0.01604


Example: confounding

  • Y = weight, and the two groups are defined by gender
  • $\beta_1$ is the weight difference between males and females
  • The two groups might differ in other characteristics, for example age or race
  • These cause a difference in weight as well
  • The coefficient is affected (biased)
  • The same holds for continuous predictors


Confounding Study Case 1

  • Univariate model: y is associated with x1 (higher mean when x1 is higher)
  • After adjustment for x2: x1 has no additional effect, coefficient = 0
  • This is positive confounding

[Figure: y plotted against x2]


Confounding Study Case 2

  • Univariate model: no association of y with x1 (equal mean when x1 is higher)
  • x2 has an effect; after adjustment for x2, x1 improves the fit, coefficient < 0
  • This is negative confounding (masking)

[Figure: y plotted against x2]


Confounding

  • Positive: the effect is overestimated in the univariate model compared to the refined model
      – occurs when two predictors are positively correlated and have effects of the same sign, OR are inversely correlated and have effects of opposite sign
  • Negative: the effect is underestimated (attenuated, masked) in the univariate model
      – occurs when two predictors are positively correlated but have effects of opposite sign, OR are inversely correlated but have effects of the same sign
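Positive confounding can be reproduced in a small simulation. Everything below is an assumption made for illustration: the sample size, the seed, and a data-generating model in which y depends only on x2 while x1 is merely correlated with x2.

```r
set.seed(1)                        # arbitrary seed
n  <- 2000
x2 <- rnorm(n)                     # the confounder
x1 <- x2 + rnorm(n)                # x1 is positively correlated with x2
y  <- 2 * x2 + rnorm(n)            # y depends on x2 only, not on x1

uni <- coef(lm(y ~ x1))[["x1"]]        # univariate: x1 appears to matter
adj <- coef(lm(y ~ x1 + x2))[["x1"]]   # adjusted for x2: close to zero

c(univariate = uni, adjusted = adj)
```

Under this model the univariate slope is expected to be near 1 even though x1 has no effect of its own, and the adjusted coefficient near 0, which is the pattern of a positively confounded association that vanishes after adjustment.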


Adjustment

  • The aim is to estimate unbiased effects with a refined model
  • Ideally, all potential confounders y, z, … are known, measured and can be included in the model

        $E(Y) = \beta_0 + \beta_1 x + \beta_2 y + \beta_3 z$

    so that the effect of each predictor is estimated while taking the others into account (adjustment)


Confounding

  • In prospective randomized studies, the groups should be approximately balanced for all potential confounders
  • Stratification can be used in designed studies
  • In observational studies, confounding is to be expected, and we do our best to control for it in multipredictor models


What to do?


Modeling Overview

  • Want to capture important features of the relationship between a (set of) variable(s) and one or more response(s)
  • Many models are of the form g(Y) = f(x) + error
  • Models differ in the form of g and f and in the distributional assumptions about the error term
  • R: lm, glm


Non-linear relations in lm

  • X and Y can show a curvilinear relation
  • Transformation: $Y = a + b X^3$; set $Z = X^3$, then $Y = a + b Z$
  • Multivariate model, for example polynomial: $Y = a + b X + c X^2$; set $Z = X^2$, then $Y = a + b X + c Z$
  • Linear models (R: lm) handle these cases
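A quadratic relation fitted with `lm` might look like the sketch below. The data are an exact, made-up quadratic, so the fit simply recovers the coefficients used to generate it; the object names are assumptions.

```r
x <- 1:10
y <- 1 + 2 * x + 3 * x^2           # made-up coefficients a = 1, b = 2, c = 3

# Option 1: create the transformed variable explicitly.
z <- x^2
fit_z <- lm(y ~ x + z)

# Option 2: let the formula do the transformation with I().
fit_I <- lm(y ~ x + I(x^2))

print(coef(fit_z))
print(coef(fit_I))
```

Both fits are ordinary linear models: the curve is non-linear in x but still linear in the parameters a, b and c.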


Linearization examples

  • if $Y \approx b \, x^c$, then $\log(Y) \approx \log(b) + c \log(x)$, i.e. $Y' \approx b' + c \, x'$
  • if $Y \approx b + \exp(c x)$, then $\log(Y - b) \approx \log(\exp(c x)) = c x$, i.e. $Y' \approx c \, x$

Nonlinear regression:
  • Different iterative algorithm
  • Initial trial values
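The first linearization (a power law fitted on the log-log scale) can be sketched as follows. The data are an exact, made-up power law, and the names `b_hat` and `c_hat` are assumptions for this example.

```r
x <- c(1, 2, 4, 8, 16)
y <- 3 * x^0.5                     # made-up power law: b = 3, c = 0.5

# log(y) = log(b) + c * log(x) is a straight line in log(x).
fit <- lm(log(y) ~ log(x))

b_hat <- exp(coef(fit)[[1]])       # back-transform the intercept to get b
c_hat <- coef(fit)[[2]]            # the slope is the exponent c

c(b = b_hat, c = c_hat)
```

With real, noisy data the log transform also changes the error structure, which is one reason a direct nonlinear fit may be preferred.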

Example: Non-linear relation, nlm

X and Y can show a curvilinear relation

Michaelis-Menten saturation enzyme kinetics


Linearizing it (Lineweaver-Burk plot, double-reciprocal plot)

  • slope b = Km / Vm
  • y-intercept a = 1 / Vm
  • 1/v approaches infinity as [S] decreases:
      – undue weight is given to inaccurate measurements at low concentration
      – insufficient weight is given to accurate measurements at high concentration
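The double-reciprocal fit can be sketched in R, assuming the usual Michaelis-Menten form v = Vm [S] / (Km + [S]), which gives 1/v = 1/Vm + (Km/Vm) (1/[S]). The parameter values and concentrations below are made up, and the rates are noise-free, so the sketch only shows how Vm and Km are read off the intercept and slope.

```r
Vm <- 10; Km <- 2                          # assumed "true" parameters (made up)
S  <- c(0.5, 1, 2, 4, 8)                   # substrate concentrations [S]
v  <- Vm * S / (Km + S)                    # noise-free reaction rates

# Lineweaver-Burk: regress 1/v on 1/[S].
lb <- lm(I(1 / v) ~ I(1 / S))
a  <- coef(lb)[[1]]                        # intercept a = 1 / Vm
b  <- coef(lb)[[2]]                        # slope     b = Km / Vm

Vm_hat <- 1 / a
Km_hat <- b / a
c(Vm = Vm_hat, Km = Km_hat)
```

With noisy measurements the points at low [S] dominate this fit, as the bullets above point out; fitting the original curve directly by nonlinear least squares (for example with R's `nls`, starting from initial trial values) avoids that weighting problem.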


Acknowledgement

Contributions to slides, lab and ideas by Darlene Goldstein

Books:
  • Peter Dalgaard, “Introductory Statistics with R”, Springer
      – Ch 5 Regression and correlation
      – Ch 6 Analysis of variance
      – …
      – Ch 9 Multiple regression
      – Ch 10 Linear models
  • Eric Vittinghoff et al., “Regression Methods in Biostatistics”, Springer
      – Ch 3 Basic Statistical Methods
      – Ch 4 Linear Regression
  • Roger Mead, “The Design of Experiments”, Cambridge University Press