
Lecture 13. Multiple regression (2020)


(1) Introduction

Now there is one response variable Y and two predictor variables, X and Z. Data: (X1, Z1, Y1), ..., (Xn, Zn, Yn). We want to either a) predict the value of Y associated with particular values of X and Z, or b) describe the relationship between Y, X and Z, or c) estimate the effect of changes in X and Z on Y.


(2) Data example

  Race               Time (mins)  Distance (miles)  Climb (1000 ft)
  Greenmantle Dash       16.08          2.5             0.65
  Carnethy 5 Hill        48.35          6.0             2.50
  Craig Dunain           33.65          6.0             0.90
  Ben Rha                45.60          7.5             0.80
  Ben Lomond             62.27          8.0             3.07
  Goat Fell              73.22          8.0             2.87
  Bens of Jura          204.62         16.0             7.50
  Cairnpapple            36.37          6.0             0.80
  Scolty                 29.75          5.0             0.80
  Traprain Law           39.75          6.0             0.65
  ... and so on ...


(3) Prediction equation

As for simple linear regression, it may be that a) the predictors X, Z and the response Y are all random, or b) the values of the predictors X and Z are fixed, e.g. by experimental design. In either case, there is a prediction equation

  Y = b0 + b1X + b2Z + e

where the prediction error e is assumed to be N(0, σ²).
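A minimal R sketch of this setup with simulated data (all parameter values below are illustrative, not from the lecture):

  # Simulate from Y = b0 + b1*X + b2*Z + e, e ~ N(0, sigma^2), then fit
  set.seed(1)
  n <- 50
  X <- runif(n, 0, 10)
  Z <- runif(n, 0, 5)
  Y <- 3 + 1.5 * X + 0.8 * Z + rnorm(n, sd = 2)   # b0 = 3, b1 = 1.5, b2 = 0.8
  coef(lm(Y ~ X + Z))                             # estimates of b0, b1, b2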


(4) The multiple regression surface

[Figure: the fitted regression plane in (x, z, y) space]


(5) Sums of squares and products

The starting point for all calculations is this 3 × 3 matrix of sums of squares and products:

  ( Sxx  Sxz  Sxy )
  ( Szx  Szz  Szy )
  ( Syx  Syz  Syy )


(6) Estimation equations

The estimates b̂1 and b̂2 are the solutions of the two equations

  b1 Sxx + b2 Sxz = Sxy
  b1 Szx + b2 Szz = Szy

When appropriate, corrected sums of squares and products are replaced by variances and covariances.

'Partial' regression coefficient b1 is the effect on E(Y) of changing X while holding Z constant. 'Partial' regression coefficient b2 is the effect on E(Y) of changing Z while holding X constant.
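These two equations can be solved numerically as a 2 × 2 linear system. A minimal R sketch with simulated data (names and values illustrative), checking the solution against lm:

  # Solve b1*Sxx + b2*Sxz = Sxy and b1*Szx + b2*Szz = Szy directly
  set.seed(2)
  X <- rnorm(20); Z <- rnorm(20)
  Y <- 1 + 2 * X + 3 * Z + rnorm(20)
  Sxx <- sum((X - mean(X))^2)
  Szz <- sum((Z - mean(Z))^2)
  Sxz <- sum((X - mean(X)) * (Z - mean(Z)))
  Sxy <- sum((X - mean(X)) * (Y - mean(Y)))
  Szy <- sum((Z - mean(Z)) * (Y - mean(Y)))
  solve(matrix(c(Sxx, Sxz, Sxz, Szz), 2), c(Sxy, Szy))  # b1-hat, b2-hat
  coef(lm(Y ~ X + Z))[2:3]                              # same values from lm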

(7) Partial regression coefficients

When X is increased by one unit, the total effect on Y is the sum of two parts: one due to the change in X, the other due to the concomitant change in Z. If the model includes X and not Z, we see only the total effect. Including both X and Z in the model allows us to separate the two parts. The partial regression coefficient estimates the part specific to X.


(8) Estimate of regression coefficient

The estimate of b1 is

  (Sxy − Sxz Syz / Szz) / (Sxx − Sxz² / Szz)

Compare this with the estimate Sxy/Sxx obtained when Z is ignored. The denominator is the residual sum of squares obtained after regressing X on Z. The numerator is the sum of products of Y and the residual of X after fitting Z. There is a similar expression for the estimate of b2.
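This residual interpretation is easy to check numerically. A quick R sketch (simulated, illustrative data):

  # b1-hat equals the slope of Y on the residuals of X after fitting Z
  set.seed(3)
  X <- rnorm(30); Z <- rnorm(30)
  Y <- 1 + 2 * X + 3 * Z + rnorm(30)
  x.res <- resid(lm(X ~ Z))          # residual of X after regressing X on Z
  sum(x.res * Y) / sum(x.res^2)      # the formula above
  coef(lm(Y ~ X + Z))["X"]           # the same value from the full fit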


(9) Residuals and fitted values

The fitted value is now Ŷ = Ȳ + b̂1(X − X̄) + b̂2(Z − Z̄), and the anova equation still holds:

  Σ(Y − Ȳ)² = Σ(Ŷ − Ȳ)² + Σ(Y − Ŷ)²

The regression SSQ simplifies to b̂1 Sxy + b̂2 Szy, with 2 d.f. The residual sum of squares has n − 3 d.f.
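The decomposition can be verified directly in R (simulated, illustrative data):

  # Verify: total SSQ = regression SSQ + residual SSQ
  set.seed(4)
  X <- rnorm(25); Z <- rnorm(25)
  Y <- 2 + X + 0.5 * Z + rnorm(25)
  fit <- lm(Y ~ X + Z)
  sum((Y - mean(Y))^2)                               # total SSQ (n - 1 d.f.)
  sum((fitted(fit) - mean(Y))^2) + sum(resid(fit)^2) # regression + residual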


(10) The anova table

Sums of squares and mean squares are set out in an anova table, as for simple linear regression, but now the degrees of freedom for the regression, residual and total sums of squares are 2, n − 3, and n − 1. The ANOVA F statistic (with 2 and n − 3 d.f.) tests the null hypothesis that b1 = b2 = 0, i.e. that E(Y) = b0 (constant). The regression sum of squares may be split into two components, each with 1 d.f. See later.


(11) Tests for regression coefficients

There is a t test for the hypothesis b1 = 0. As usual, the test statistic is the estimate of b1 divided by its standard error. The null distribution is t with n − 3 d.f. For simple linear regression, Sxx determined the size of the s.e.; now this role is played by Sxx − Sxz²/Szz. Correlation between the two predictors reduces this quantity and 'inflates' the standard error. There is a similar result for b2 (switch x and z in the previous paragraph).


(12) Cow and her relatives

[Pedigree diagram: Mother and Father are the parents of Cow; Father is also the sire of Halfsib, a paternal half-sister of Cow]


(13) Estimated breeding value

Y is the breeding value of a cow; X and Z are the phenotypes of its mother and its paternal half-sister. We want to use X and Z to predict Y. The covariance matrix for X, Z and Y is

  (  VP     0     VA/2 )
  (  0      VP    VA/4 )
  (  VA/2   VA/4  VA   )

where VP = VA + VE.


(14) Estimated breeding value

From this covariance matrix, the two equations to be solved are

  b1 VP = VA/2
  b2 VP = VA/4

and the prediction is Ŷ = h²(X/2 + Z/4), where h² = VA/VP is the heritability of the trait.
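A tiny numerical illustration of this prediction (all variance components and phenotypes below are hypothetical values, not from the lecture):

  # Predicted breeding value from mother (X) and paternal half-sib (Z) phenotypes
  VA <- 40; VE <- 60            # hypothetical additive and environmental variances
  VP <- VA + VE                 # phenotypic variance
  h2 <- VA / VP                 # heritability = 0.4
  X <- 5; Z <- 2                # phenotypes, as deviations from the population mean
  h2 * (X / 2 + Z / 4)          # predicted breeding value = 0.4 * 3 = 1.2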


END OF LECTURE


Lecture 14. Hill race data (2020)


(15) A special case

If Z takes values 0 and 1, the model gives

  E(Y) = b0 + b2X        when Z = 0
  E(Y) = b0 + b1 + b2X   when Z = 1.

The common slope of the parallel lines is b2. The intercept for the first line is b0, and the intercept for the second line is b0 + b1; so b1 is the difference between the intercepts (the constant vertical distance between the two lines).
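In R, this model can be fitted by putting the 0/1 indicator directly in the formula. A minimal sketch with simulated data (names and values illustrative):

  # Parallel-lines model: the 0/1 indicator Z shifts the intercept only
  set.seed(5)
  X <- runif(40, 0, 10)
  Z <- rep(0:1, each = 20)
  Y <- 2 + 5 * Z + 1.5 * X + rnorm(40)
  coef(lm(Y ~ Z + X))   # (Intercept) ~ b0, Z coefficient ~ b1, X coefficient ~ b2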


(16) A special case

[Figure: the two parallel lines of Y = b0 + b1Z + b2X plotted against X, with intercept b0 for Z = 0 and intercept b0 + b1 for Z = 1; the vertical gap between the lines is b1]


(17) Hill-race data

The difficulty of a hill race is measured by a) X = total distance covered, and b) Z = total climb required. Given the distance, climb, and record time for 31 Scottish hill races, multiple regression can find a relationship between the record time Y and the two measures of difficulty X and Z.


(18) Hill-race data

[Scatter plot: record time (20–100 mins) against distance (2–14 miles) for the hill races, with a single fitted regression line]


(19) Hill-race data

For this analysis, values of climb are grouped as low (climb < 1000 feet, Z = 0) or high (climb > 1000 feet, Z = 1), corresponding to the light and dark gray dots on the graph.

              Estimate  Std Error      t
  (Distance)    6.8731     0.4564  15.06
  (Climb)      10.3651     2.3175   4.472

Both partial regression coefficients are highly significant (P < 0.001). The single regression line shown on the previous slide fails to capture the effect of different amounts of climb.

(20) An F test

Anovas for the regression on distance alone and for the regression on both distance and climb:

                       DF    SSQ
  Distance              1  12081
  Residual             29   1474

  Distance + Climb      2  12695
  Residual             28    860

The two anovas can be combined into one:

                                 DF    SSQ
  Distance (ignoring Climb)       1  12081
  Climb (adjusted for Distance)   1    614
  Residual                       28    860


(21) An F test

                     DF    SSQ    MSQ     F
  Distance            1  12081  12081
  Climb (adjusted)    1    614    614  20.0
  Residual           28    860   30.7

Test b(Climb) = 0: F = 20.0 on 1 and 28 d.f. (P < 0.001). Adding climb to the equation significantly improves the fit, so the hypothesis is firmly rejected. There is strong evidence for an effect of climb after allowing for the effect of distance. Exactly the same result was obtained with the t test based on the estimated partial regression coefficient (F = t²).
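In R this F test is a one-line model comparison. A sketch reusing the course data setup that appears in Lecture 15; the hilo01 indicator name is an assumption, and the climb threshold uses the table's units of 1000 ft:

  # F test for adding grouped climb after distance
  library(sda)                                      # course package with the hills data
  hills31 <- subset(hills, Time < 100)              # the 31 races analysed here
  hills31$hilo01 <- as.numeric(hills31$Climb > 1)   # Climb is in 1000s of feet
  fit1 <- lm(Time ~ Distance, data = hills31)
  fit2 <- lm(Time ~ Distance + hilo01, data = hills31)
  anova(fit1, fit2)   # F on 1 and 28 d.f.; equals t^2 from summary(fit2)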


(22) Using the original climb data

What happens if we use the original climb data rather than the grouped (0/1) version? The model is now

  E(Y) = (b0 + b1Z) + b2X

On the (X, Y) graph, this specifies a family of parallel lines. The vertical position of the line changes smoothly and continuously as Z changes, and the regression coefficient b1 measures the rate at which this happens (in units of minutes per 1000 feet). The grouped (0/1) version of Z gave just two lines, one for low-climb races, the other for high-climb races.

(23) Comparing the two analyses

The regression coefficient and s.e. for distance are similar in the two analyses. The table below shows the estimated effects of climb.

              Estimate  Std Error      t
  Grouped Z    10.3651     2.3175  4.472
  Original Z    6.8288     1.1134  6.133

The ungrouped analysis tells us that the (X, Y) line moves up (the predicted race time increases) by 6.8 mins for every additional 1000 feet of climb. The grouped analysis told us that the line for a 'high' climb race is 10.4 mins above the line for a 'low' climb race.


(24) Diagnostic plot for hill race data

[Figure: Residuals vs Fitted plot for lm(Time ~ Distance + Climb); fitted values run from about 20 to 80, residuals from about −10 to 10]


(25) Diagnostic plot for hill race data

[Figure: Normal Q−Q plot of standardized residuals against theoretical quantiles for lm(Time ~ Distance + Climb)]


END OF LECTURE


Lecture 15. Using R, . . . (2020)


(26) Using R

The lm function deals with multiple regression. Diagnostic plots and analysis of variance tables are produced as for simple linear regression.

  library(sda)                                # course package providing the hills data
  hills31 <- subset(hills, Time < 100)        # keep the 31 races with Time < 100 mins
  fit <- lm(Time ~ Distance + Climb, data = hills31)
  summary(fit)                                # coefficients, standard errors, t tests
  anova(fit)                                  # sequential (extra) sums of squares
  plot(fit, which = 1:2, add.smooth = FALSE)  # residuals vs fitted, normal Q-Q
  confint(fit, parm = 2:3)                    # confidence intervals for the two slopes


(27) summary and anova

summary(fit) produces estimates and standard errors for the partial regression coefficients. Each coefficient is adjusted for all other effects in the model, so the results do not depend on the order of terms. anova(fit) produces 'extra' sums of squares, which do depend on the order of terms.


(28) The anova function

With Time ~ Distance + Climb:

             DF      SSQ      MSQ      F
  Distance    1  12080.7  12080.7  537.0
  Climb       1    844.8    844.8   37.6
  Residual   28    628.7     22.5

With Time ~ Climb + Distance:

             DF      SSQ      MSQ      F
  Climb       1   8441.3   8441.3  375.9
  Distance    1   4484.2   4484.2  199.7
  Residual   28    628.7     22.5

The total regression sum of squares (2 d.f.) is the same in both cases.
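To reproduce these tables, fit the model with the terms in each order; drop1() gives F tests that do not depend on term order. A sketch reusing the data setup from the Using R slide:

  # Sequential anova depends on term order; drop1 does not
  library(sda)                                 # course package with the hills data
  hills31 <- subset(hills, Time < 100)
  anova(lm(Time ~ Distance + Climb, data = hills31))   # Distance first
  anova(lm(Time ~ Climb + Distance, data = hills31))   # Climb first
  drop1(lm(Time ~ Distance + Climb, data = hills31), test = "F")  # order-free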


(29) Example model formulas

x and z are numeric vectors, A is a factor. Some possibilities for the right-hand side of an lm formula:

  Formula   Interpretation
  1         one-sample t test
  x         simple linear regression
  x + z     multiple regression
  A         one-way analysis of variance
  A + x     parallel lines
  A * x     separate lines for each level of A
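A short sketch contrasting the last two formulas (simulated, illustrative data):

  # 'A + x' fits parallel lines; 'A * x' fits separate lines per level of A
  set.seed(6)
  x <- runif(40, 0, 10)
  A <- factor(rep(c("low", "high"), each = 20))
  y <- 2 + 3 * (A == "high") + 1.5 * x + rnorm(40)
  coef(lm(y ~ A + x))   # one common slope, two intercepts
  coef(lm(y ~ A * x))   # adds an A:x interaction: a slope for each level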


(30) Hill-race data with added factor

  Race               Time  Distance  Climb  hilo
  Greenmantle Dash  16.08      2.5    0.65  low
  Carnethy 5 Hill   48.35      6.0    2.50  high
  Craig Dunain      33.65      6.0    0.90  low
  Ben Rha           45.60      7.5    0.80  low
  Ben Lomond        62.27      8.0    3.07  high
  Goat Fell         73.22      8.0    2.87  high
  Cairnpapple       36.37      6.0    0.80  low
  Scolty            29.75      5.0    0.80  low
  Traprain Law      39.75      6.0    0.65  low
  Dollar            43.05      5.0    2.00  high
  Lomonds of Fife   65.00      9.5    2.20  high
  Cairn Table       44.13      6.0    0.50  low
  ... and so on ...


(31) The general case

With more than two predictors, the prediction equation is

  Y = b0 + b1X + b2Z + b3W + ···

With p predictors, the regression SSQ has p d.f., and the residual SSQ has n − p − 1 d.f. We must have n > p + 1 (otherwise it is impossible to estimate the error variance σ²). For example, when p = 1, we need at least two data points in order to fit a line, and at least one more data point to provide an estimate of σ².


(32) Milk yields

Milk yield and genotype at 6 biallelic SNPs were recorded on 91 cows. SNPs: btn, btn2, dgat1, lep2, ghr8, gh5. Genotype was recorded as an allele count (A1A1, A1A2, A2A2 recorded as 0, 1, and 2). There are 2⁶ − 1 = 63 possible regression equations.
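A sketch of how all 63 equations could be enumerated. The milk data frame built below is entirely simulated (the real 91-cow data set is not reproduced here); only the SNP names come from the slides:

  # Fit all 2^6 - 1 = 63 regressions on non-empty subsets of the 6 SNPs
  snps <- c("btn", "btn2", "dgat1", "lep2", "ghr8", "gh5")
  set.seed(7)
  milk <- as.data.frame(matrix(rbinom(91 * 6, 2, 0.5), ncol = 6,
                               dimnames = list(NULL, snps)))  # fake allele counts
  milk$kgmilk <- rnorm(91, 4000, 600)                         # fake milk yields
  fits <- list()
  for (k in 1:6) {
    for (s in as.data.frame(combn(snps, k))) {                # each subset of size k
      fits[[paste(s, collapse = "+")]] <- lm(reformulate(s, "kgmilk"), data = milk)
    }
  }
  length(fits)   # 63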


(33) Multiple regression (p = 6)

[Table: allele counts (0, 1 or 2) at the six SNPs (btn, btn2, dgat1, lep2, ghr8, gh5) and milk yield (kgmilk) for the first 12 of the 91 cows, "and another 79 cows"]


(34) Confounders

Sometimes the effect of one of the predictors (X, say) is of main interest, and the other (Z) is included as a potential 'confounding', or 'lurking', variable. Z is included in the analysis to ensure that the effect of X is adjusted for the effect of Z.

Example: Y = growth rate, X = diet (A or B), Z = food intake.


(35) Confounders

[Figure: growth plotted against food intake, with separate clusters of points for diet A and diet B]


(36) Collinearity

If there is a strong correlation between X and Z, the partial regression coefficients will have large standard errors. When the correlation is close to ±1 ('collinearity'), the fitting procedure breaks down and one or more variables must be dropped from the regression equation.
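A small simulation showing the inflated standard errors (illustrative values):

  # Nearly collinear predictors inflate the standard errors of partial slopes
  set.seed(8)
  x <- rnorm(50)
  z <- x + rnorm(50, sd = 0.05)           # z is almost a copy of x
  y <- 1 + 2 * x + 2 * z + rnorm(50)
  summary(lm(y ~ x + z))$coefficients     # large s.e. for both partial slopes
  summary(lm(y ~ x))$coefficients         # x alone: much smaller s.e.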


(37) Diagnostics

Diagnostic methods based on residuals and fitted values work in exactly the same way for multiple regression as they do for the simple case of one predictor. Comments already made about outliers, cause and effect, etc., for simple linear regression remain relevant for multiple regression.


END OF LECTURE