SLIDE 1
Lecture 10. Simple linear regression
2020
SLIDE 2
(1) Using one r.v. to predict another
X and Y are random variables. What is the best linear predictor b0 + b1X of Y? The prediction error is e = Y − b0 − b1X. For the 'best' predictor, there is zero covariance between e and X:
cov(X, Y − b0 − b1X) = cov(X, Y) − b1 var(X) = 0,
so b1 = cov(X, Y)/var(X).
SLIDE 3
(2) Using one r.v. to predict another
Imposing the condition E(b0 + b1X) = E(Y) gives b0 = E(Y) − b1E(X). The prediction can be written
Ŷ = E(Y) + b1[X − E(X)]
We can express the relationship between X and Y as Y = b0 + b1X + e, where b0 is the predicted value of Y when X = 0.
SLIDE 4
(3) Prediction error variance
Because there is zero covariance between e and X,
var(Y) = var(b0 + b1X) + var(e)
The first term on the right is b1² var(X) = cov(X, Y)²/var(X).
The prediction error variance is therefore
var(e) = var(Y) − cov(X, Y)²/var(X)
An alternative expression is (1 − ρ²) var(Y), where ρ is the correlation between X and Y.
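These identities (zero covariance between e and X, and var(e) = (1 − ρ²) var(Y)) can be checked numerically. A minimal Python sketch with made-up paired data, treating the sample as the whole distribution (divisor n throughout); the lecture's own code examples are in R:

```python
import math

# Hypothetical paired data (not from the lecture), just to check the identities.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

var_x = sum((x - mx) ** 2 for x in xs) / n
var_y = sum((y - my) ** 2 for y in ys) / n
cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

b1 = cov_xy / var_x          # slope of the best linear predictor
b0 = my - b1 * mx            # from E(b0 + b1 X) = E(Y)

errors = [y - b0 - b1 * x for x, y in zip(xs, ys)]
var_e = sum(e ** 2 for e in errors) / n   # errors have zero mean by construction

rho = cov_xy / math.sqrt(var_x * var_y)
cov_ex = sum((x - mx) * e for x, e in zip(xs, errors)) / n

print(abs(cov_ex) < 1e-9)                          # True: e uncorrelated with X
print(abs(var_e - (1 - rho ** 2) * var_y) < 1e-9)  # True: variance identity
```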
SLIDE 5 (4) Regression
regress v.i. to go back: to recede: to return to a former place or state: to revert.
Tall fathers tend to have tall sons, but the average height of sons of tall fathers is less than the average height of the fathers. The heights 'regress' towards the population mean.
The prediction equation Y = b0 + b1X is usually called the regression equation, and b1 the regression coefficient.
SLIDE 6
(5) Parent-offspring regression
The trait is measured on offspring (Y) and parents. The mid-parent value (X) is the average of the two parental values. According to genetic theory,
cov(X, Y) = ½VA, var(X) = ½(VA + VE)
The regression coefficient (offspring on mid-parent) is b1 = cov(X, Y)/var(X) = VA/(VA + VE), the heritability of the trait.
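As a numeric sketch with hypothetical variance components (the values of VA and VE below are illustrative, not from the lecture), the offspring-on-mid-parent slope works out to the heritability:

```python
# Hypothetical variance components: VA additive, VE environmental.
VA, VE = 0.6, 0.4

cov_xy = 0.5 * VA        # cov(mid-parent, offspring) = VA/2
var_x = 0.5 * (VA + VE)  # var(mid-parent) = (VA + VE)/2

b1 = cov_xy / var_x      # regression of offspring on mid-parent
print(b1)                # equals VA/(VA + VE), the heritability
```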
SLIDE 7
[Scatterplot: height of child against height of mid-parent (inches)]
SLIDE 8
(7) Sampling
Usually (co)variances are estimated from a sample (X1, Y1), (X2, Y2), …, (Xn, Yn) from a bivariate distribution.
Notation: Sxx is the corrected sum of squares for X1 … Xn. Syy is the same, for Y1 … Yn. Sxy is the corrected sum of products Σ(Xi − X̄)(Yi − Ȳ).
The sample variance Sxx/(n − 1) and sample covariance Sxy/(n − 1) provide unbiased estimates of var(X) and cov(X, Y). The regression coefficient is estimated by b̂1 = Sxy/Sxx.
SLIDE 9 (8) Simple example
Blood pressure was measured on a sample of women of different ages. Ages were grouped into 10-year classes, and mean b.p. calculated for each age class.

Age class (yrs)   35   45   55   65   75
b.p. (mm)        114  124  143  158  166

Model for the dependence of Y (b.p.) on X (age):
Yi = b0 + b1Xi + ei,  i = 1 … n
Errors (residuals) e1 … en are independently distributed with zero mean and constant variance σ². The residuals ei are prediction errors, and σ² is the prediction error variance (residual variance).
SLIDE 10
(9) Blood pressure data
[Plot: blood pressure (mm) against age class (years)]
SLIDE 12
(10) Calculating slope
X̄ = 55, Ȳ = 141. Deviations from the mean:
X: −20 −10   0  10  20
Y: −27 −17   2  17  25
Sxx = 1000, Syy = 1936, and Sxy = 1380.
Estimated regression coefficient (slope): b̂1 = 1380/1000 = 1.38 (mm/year)
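The sums of squares and products can be verified directly. A short Python check (the lecture's own code, later on, is in R):

```python
age = [35, 45, 55, 65, 75]
bp = [114, 124, 143, 158, 166]
n = len(age)
mx, my = sum(age) / n, sum(bp) / n   # 55.0 and 141.0

Sxx = sum((x - mx) ** 2 for x in age)                    # corrected SSQ for X
Syy = sum((y - my) ** 2 for y in bp)                     # corrected SSQ for Y
Sxy = sum((x - mx) * (y - my) for x, y in zip(age, bp))  # corrected sum of products

b1 = Sxy / Sxx
print(Sxx, Syy, Sxy, b1)   # 1000.0 1936.0 1380.0 1.38
```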
SLIDE 13 (11) The intercept estimate
The equation of the regression line is
Y − 141 = 1.38 (X − 55), or Y = 65.1 + 1.38 X
The slope of the regression line is b̂1 = 1.38 mm/year, or an average increase of 13.8 mm per decade.
The intercept (b̂0 = 65.1) is the predicted value of Y when X = 0. (In this case, an extrapolation far outside the range of the data.)
To plot the line (manually): calculate predicted values at two convenient values of X and draw the line joining these two points, e.g. (X = 35, Ŷ = 113.4) and (X = 75, Ŷ = 168.6).
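The intercept and the two plotting points follow from the slope and the means; a quick Python check:

```python
# Slope and means taken from the slides.
b1 = 1.38
b0 = 141 - b1 * 55   # intercept: predicted b.p. at age 0

def predict(x):
    return b0 + b1 * x

print(round(b0, 1))           # 65.1
print(round(predict(35), 1))  # 113.4
print(round(predict(75), 1))  # 168.6
```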
SLIDE 14
END OF LECTURE
SLIDE 15
Lecture 11. Residuals and fitted values
2020
SLIDE 16
(12) Fitted values and residuals
[Plot: blood pressure data with fitted regression line]
SLIDE 17
(13) Residuals, fitted values
Values of Y predicted by the regression equation at the data values X1 … Xn are called fitted values (Ŷ). Differences between observed and fitted values (Y − Ŷ) are called residuals.

 X    Y   Fitted  Residual
35  114   113.4     +0.6
45  124   127.2     -3.2
55  143   141.0     +2.0
65  158   154.8     +3.2
75  166   168.6     -2.6
SLIDE 18
(14) Analysis of variance
The deviation from the mean can be split into two components:
Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi)
The total sum of squares also splits into two components:
Σ(Yi − Ȳ)² = Σ(Ŷi − Ȳ)² + Σ(Yi − Ŷi)²
Total = Regression + Residual
The regression SSQ is the corrected sum of squares of the fitted values. It simplifies to Sxy²/Sxx.
The residual SSQ is the sum of squared residuals.
SLIDE 19
(15) ANOVA calculation
Total sum of squares: Syy.
Regression sum of squares: Sxy²/Sxx.
The residual sum of squares is obtained by subtraction:
Syy = Sxy²/Sxx + (Syy − Sxy²/Sxx)
Total = Regression + Residual
SLIDE 20 (16) ANOVA calculation
For the blood pressure data, Sxx = 1000, Sxy = 1380, Syy = 1936.
Regression SSQ = 1380²/1000 = 1904.4.
Residual SSQ = 1936 − 1904.4 = 31.6.
These calculations are usually set out in an analysis of variance (ANOVA) table.
SLIDE 21
(17) Analysis of variance table
Source       Df  Sum Sq  Mean Sq
Regression    1  1904.4  1904.40
Residual      3    31.6    10.53
Total         4  1936.0

The regression SSQ Sxy²/Sxx has one degree of freedom. With a sample of size n, the total SSQ has n − 1 d.f. and the residual SSQ has n − 2 d.f. The residual mean square S² = 10.53 estimates σ².
SLIDE 22 (18) A check on the arithmetic
Here are the fitted values and residuals calculated earlier:

 X    Y   Fitted  Residual
35  114   113.4     +0.6
45  124   127.2     -3.2
55  143   141.0     +2.0
65  158   154.8     +3.2
75  166   168.6     -2.6

Check that the residual SSQ is the sum of squared residuals. Check that the regression SSQ is the corrected SSQ of the fitted values (sum of squared deviations about the mean value of 141).
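The suggested check, carried out in Python:

```python
fitted = [113.4, 127.2, 141.0, 154.8, 168.6]
residuals = [0.6, -3.2, 2.0, 3.2, -2.6]

residual_ssq = sum(r ** 2 for r in residuals)          # sum of squared residuals
regression_ssq = sum((f - 141) ** 2 for f in fitted)   # corrected SSQ of fitted values

print(round(residual_ssq, 1))                   # 31.6
print(round(regression_ssq, 1))                 # 1904.4
print(round(residual_ssq + regression_ssq, 1))  # 1936.0, the total SSQ
```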
SLIDE 23
(19) Testing zero slope hypothesis
Null hypothesis H0: b1 = 0 ('no relationship between X and Y')
The sampling variance of b̂1 is σ²/Sxx.
E = √(S²/Sxx) is the estimated s.e. of b̂1.
Under H0, b̂1/E has a t distribution with n − 2 d.f.
SLIDE 24
(20) Testing zero slope hypothesis
For the blood pressure data,
E = √(10.53/1000) = 0.1026,
t = 1.38/0.1026 = 13.45 with 3 d.f.
Tables of the t distribution give P < 0.001 (two-sided test). The hypothesis is firmly rejected.
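The standard error and t statistic can be reproduced with the values from the ANOVA table; a Python sketch:

```python
import math

# S^2 and Sxx from the ANOVA table; b1 from the slope calculation.
s2, sxx, b1 = 10.53, 1000, 1.38

E = math.sqrt(s2 / sxx)   # estimated s.e. of b1
t = b1 / E

print(round(E, 4))   # 0.1026
print(round(t, 2))   # 13.45
```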
SLIDE 25
Interval estimate for slope parameter
The upper 2.5% point for t with 3 d.f. is k = 3.182.
The 95% interval estimate for b1 is b̂1 ± (k × E):
1.38 ± 3.182 × 0.1026 (between 1.05 and 1.71).
Alternative formula: (t ± k)E, where t is the calculated t statistic.
Two-sided test significant at the 5% level ⇔ end-points of the 95% interval have the same sign.
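A Python sketch of the interval calculation:

```python
# b1 and its s.e. E from the slides; k from t tables (3 d.f., upper 2.5% point).
b1, E, k = 1.38, 0.1026, 3.182

lower, upper = b1 - k * E, b1 + k * E
print(round(lower, 2), round(upper, 2))   # 1.05 1.71
```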
SLIDE 26
END OF LECTURE
SLIDE 27
Lecture 12. F test, diagnostics, cause and effect, and the lm function
2020
SLIDE 28
(22) An additional assumption
So far, residuals have been assumed uncorrelated, with zero mean and constant variance (σ²). The results of slides 19-21 (previous lecture) and slide 24 below require the stronger assumption that the residuals are normally distributed. (If the sample is reasonably large, the stronger assumption may not be required: the central limit theorem may come to the rescue.)
SLIDE 29
(23) The F distribution
S1² and S2² are independent estimates of the variance σ², with degrees of freedom n1 and n2.
The distribution of S1²/S2² is called the F distribution with n1 and n2 degrees of freedom.
Special case: when n1 = 1, the distribution is that of t², where t has a t distribution with n2 d.f.
SLIDE 30
(24) F test for zero slope
Source       Df  Sum Sq  Mean Sq  F ratio
Regression    1  1904.4  1904.40    180.8
Residual      3    31.6    10.53
Total         4  1936.0

The ANOVA F statistic is the square of the t statistic. H0 is rejected for large values of F (one-sided test, equivalent to the two-sided t test).
For the b.p. data, F = 180.8 with 1 and 3 d.f. Tables of F with 1 and 3 d.f. show this to be highly significant (P < 0.001).
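The F ratio and its relation to the t statistic can be checked from the mean squares; a Python sketch:

```python
# Mean squares from the ANOVA table.
regression_ms = 1904.4
residual_ms = 31.6 / 3    # 10.53 to two decimals

F = regression_ms / residual_ms
t = 13.45                 # t statistic from the previous lecture

print(round(F, 1))        # 180.8
print(round(t ** 2, 1))   # 180.9, equal to F up to rounding of t
```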
SLIDE 31
(25) Diagnostics
Inspect the residuals for evidence that the model assumptions do not hold. Plot residuals against the predictor variable or the fitted values. Plots may show evidence of a systematic discrepancy, due to inadequacies in the model, or an isolated discrepancy, due to an 'outlier'. An outlier has an 'unusually' large residual. If possible, a reason should be found. Outliers may sometimes be rejected, cautiously.
SLIDE 32
(26) Cause and effect
A correlation between X and Y does not necessarily imply that a change in X causes a change in Y . The link may be between X and Z, and between Z and Y , where Z is a third (unobserved) variable. For example, a correlation between birth rate and tractor sales may arise simply because both variables are increasing over time.
SLIDE 33
(27) Regression in R
age <- c(35, 45, 55, 65, 75)
bp <- c(114, 124, 143, 158, 166)
fit <- lm(bp ~ age)
summary(fit)
anova(fit)

Interval estimate for the slope parameter:

confint(fit, parm = 2)
SLIDE 34
(28) Summary output
> summary(fit)
Residuals:
   1    2    3    4    5
 0.6 -3.2  2.0  3.2 -2.6

Coefficients:
            Estimate Std. Error t value
(Intercept)    65.10     5.8284   11.17
age             1.38     0.1026   13.45

Multiple R-squared: 0.9837
F-statistic: 180.8 on 1 and 3 DF
SLIDE 35
(29) ANOVA output
> anova(fit)
Analysis of Variance Table

          Df Sum Sq Mean Sq F value
age        1 1904.4 1904.40   180.8
Residuals  3   31.6   10.53

> confint(fit, parm = 2)
    2.5 % 97.5 %
age  1.05   1.71
SLIDE 36
(30) Plotting
# plot the data
plot(bp ~ age)
# add regression line
abline(fit)
# diagnostic plots
plot(fit)
SLIDE 37 (31) The Forbes data
[Scatterplot of the Forbes data]
SLIDE 38 (32) A diagnostic plot
[Plot: residuals against fitted values]
SLIDE 39
(33) One-sample t test revisited
The lm function can be used to analyse the 'matched pairs' data of lecture 9.

Y <- c(10, 2, 22, 23, 6, 31, -3, -7, 15)
fit <- lm(Y ~ 1)

> summary(fit)
            Estimate Std. Error t value
(Intercept)   11.000      4.262   2.581

> confint(fit)
            2.5 % 97.5 %
(Intercept)   1.2   20.8
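The same estimate, standard error, and t value can be reproduced by hand; a Python sketch of what lm(Y ~ 1) computes:

```python
import math

Y = [10, 2, 22, 23, 6, 31, -3, -7, 15]
n = len(Y)

mean = sum(Y) / n
s2 = sum((y - mean) ** 2 for y in Y) / (n - 1)  # sample variance
se = math.sqrt(s2 / n)                          # s.e. of the mean
t = mean / se

print(mean)          # 11.0
print(round(se, 3))  # 4.262
print(round(t, 3))   # 2.581
```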
SLIDE 40
END OF LECTURE