Lecture 10. Simple linear regression (2020)


SLIDE 1

Lecture 10. Simple linear regression

2020

SLIDE 2

(1) Using one r.v. to predict another

X and Y are random variables. What is the best linear predictor b0 + b1X of Y? The prediction error is e = Y - b0 - b1X. For the 'best' predictor there is zero covariance between e and X:

cov(X, Y - b0 - b1X) = cov(X, Y) - b1 var(X) = 0,

so b1 = cov(X, Y) / var(X).
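A quick numerical sketch of this result (in Python rather than the deck's R, with a made-up bivariate sample): choosing b1 = cov(X, Y)/var(X) makes the prediction error uncorrelated with X.

```python
import random

random.seed(1)

# Made-up bivariate sample, for illustration only (not from the lecture)
X = [random.gauss(0, 1) for _ in range(5000)]
Y = [2.0 + 0.5 * x + random.gauss(0, 1) for x in X]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

b1 = cov(X, Y) / cov(X, X)               # best linear slope: cov(X, Y) / var(X)
e = [y - b1 * x for x, y in zip(X, Y)]   # error up to the constant b0, which
                                         # does not affect cov(X, e)

# By construction, cov(X, e) = cov(X, Y) - b1 var(X) = 0 (up to rounding)
print(cov(X, e))
```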

SLIDE 3

(2) Using one r.v. to predict another

Imposing the condition E(b0 + b1X) = E(Y) gives b0 = E(Y) - b1 E(X). The prediction can be written

Ŷ = E(Y) + b1[X - E(X)]

We can express the relationship between X and Y as Y = b0 + b1X + e, where b0 is the predicted value of Y when X = 0.

SLIDE 4

(3) Prediction error variance

Because there is zero covariance between e and X,

var(Y) = var(b0 + b1X) + var(e)

The first term on the right is b1² var(X) = cov(X, Y)² / var(X). The prediction error variance is therefore

var(e) = var(Y) - cov(X, Y)² / var(X)

An alternative expression is (1 - ρ²) var(Y), where ρ is the correlation between X and Y.
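The two expressions for the prediction error variance can be checked numerically (a Python sketch with invented numbers; divisor n is used, matching the population-style variances on this slide):

```python
import math

# Invented sample, illustration only
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 2.9, 4.2, 4.8, 6.0]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
var_x = sum((x - mx) ** 2 for x in X) / n
var_y = sum((y - my) ** 2 for y in Y) / n
cov_xy = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / n

rho = cov_xy / math.sqrt(var_x * var_y)
pev1 = var_y - cov_xy ** 2 / var_x   # var(Y) - cov(X,Y)^2 / var(X)
pev2 = (1 - rho ** 2) * var_y        # alternative expression

print(pev1, pev2)
```

The two values agree, since ρ² = cov(X, Y)² / (var(X) var(Y)).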

SLIDE 5

(4) Regression

regress v.i. to go back: to recede: to return to a former place or state: to revert.

Tall fathers tend to have tall sons, but the average height of sons of tall fathers is less than the average height of the fathers. The heights 'regress' towards the population mean. The prediction equation Y = b0 + b1X is usually called the regression equation, and b1 the regression coefficient.

SLIDE 6

(5) Parent-offspring regression

Trait is measured on offspring (Y) and parents. The mid-parent value (X) is the average of the two parental values. According to genetic theory,

cov(X, Y) = (1/2)VA,  var(X) = (1/2)(VA + VE)

The regression coefficient (offspring on mid-parent) is b1 = cov(X, Y) / var(X) = VA/(VA + VE), the heritability of the trait.

SLIDE 7

[Scatterplot: height of child against height of mid-parent (inches)]

SLIDE 8

(7) Sampling

Usually the (co)variances are estimated from a sample (X1, Y1), (X2, Y2), ..., (Xn, Yn) from a bivariate distribution. Notation: Sxx is the corrected sum of squares for X1 ... Xn; Syy is the same for Y1 ... Yn; Sxy is the corrected sum of products Σ(Xi - X̄)(Yi - Ȳ). The sample variance Sxx/(n - 1) and sample covariance Sxy/(n - 1) provide unbiased estimates of var(X) and cov(X, Y). The regression coefficient is estimated by b̂1 = Sxy/Sxx.

SLIDE 9

(8) Simple example

Blood pressure was measured on a sample of women of different ages. Ages were grouped into 10-year classes, and the mean b.p. calculated for each age class.

Age class (yrs)   35   45   55   65   75
b.p. (mm)        114  124  143  158  166

Model for the dependence of Y (b.p.) on X (age):

Yi = b0 + b1Xi + ei,  i = 1 ... n

The errors (residuals) e1 ... en are independently distributed with zero mean and constant variance σ². The residuals ei are prediction errors, and σ² is the prediction error variance (residual variance).

SLIDE 10

(9) Blood pressure data

[Scatterplot: b.p. (mm) against age class (yrs)]

SLIDE 11

(9) Blood pressure data

[Scatterplot: b.p. (mm) against age class (yrs)]

SLIDE 12

(10) Calculating slope

X̄ = 55, Ȳ = 141. Deviations from the mean:

X: -20 -10  0 10 20
Y: -27 -17  2 17 25

Sxx = 1000, Syy = 1936, and Sxy = 1380. Estimated regression coefficient (slope): b̂1 = 1380/1000 = 1.38 (mm/year)
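The arithmetic on this slide is easy to reproduce (a Python translation of the calculation, not part of the original deck):

```python
# Blood pressure data from the slides
age = [35, 45, 55, 65, 75]
bp = [114, 124, 143, 158, 166]

n = len(age)
xbar = sum(age) / n   # 55
ybar = sum(bp) / n    # 141

# Corrected sums of squares and products
Sxx = sum((x - xbar) ** 2 for x in age)
Syy = sum((y - ybar) ** 2 for y in bp)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(age, bp))

b1 = Sxy / Sxx   # estimated slope
print(Sxx, Syy, Sxy, b1)   # 1000.0 1936.0 1380.0 1.38
```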

SLIDE 13

(11) The intercept estimate

The equation of the regression line is Y - 141 = 1.38 (X - 55), or Y = 65.1 + 1.38 X.

The slope of the regression line is b̂1 = 1.38 mm/year, or an average increase of 13.8 mm per decade.

The intercept (b̂0 = 65.1) is the predicted value of Y when X = 0. (In this case, an extrapolation far outside the range of the data.)

To plot the line (manually): calculate the predicted values at two convenient values of X and draw the line joining these two points, e.g. (X = 35, Ŷ = 113.4) and (X = 75, Ŷ = 168.6).
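A short check of the intercept and the two plotting points (Python, using the values quoted on the slides):

```python
# Means and slope from the blood pressure calculation
xbar, ybar, b1 = 55, 141, 1.38

b0 = ybar - b1 * xbar      # intercept estimate: 141 - 1.38 * 55
fit35 = b0 + b1 * 35       # predicted b.p. at age 35
fit75 = b0 + b1 * 75       # predicted b.p. at age 75
print(b0, fit35, fit75)    # 65.1, 113.4, 168.6 (up to rounding)
```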

SLIDE 14

END OF LECTURE

SLIDE 15

Lecture 11. Residuals and fitted values

2020

SLIDE 16

(12) Fitted values and residuals

[Scatterplot: b.p. (mm) against age class (yrs)]

SLIDE 17

(13) Residuals, fitted values

Values of Y predicted by the regression equation at the data values X1 ... Xn are called fitted values (Ŷ). Differences between observed and fitted values (Y - Ŷ) are called residuals.

 X    Y   Fitted  Residual
35  114    113.4      +0.6
45  124    127.2      -3.2
55  143    141.0      +2.0
65  158    154.8      +3.2
75  166    168.6      -2.6
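The table of fitted values and residuals can be regenerated from b̂0 = 65.1 and b̂1 = 1.38 (a Python sketch, not part of the deck):

```python
# Data and estimates from the slides
age = [35, 45, 55, 65, 75]
bp = [114, 124, 143, 158, 166]
b0, b1 = 65.1, 1.38

fitted = [b0 + b1 * x for x in age]              # predictions at the data values
residuals = [y - f for y, f in zip(bp, fitted)]  # observed minus fitted

for x, y, f, r in zip(age, bp, fitted, residuals):
    print(x, y, round(f, 1), round(r, 1))
```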

SLIDE 18

(14) Analysis of variance

The deviation from the mean can be split into two components:

Yi - Ȳ = (Ŷi - Ȳ) + (Yi - Ŷi)

The total sum of squares also splits into two components:

Σ(Yi - Ȳ)² = Σ(Ŷi - Ȳ)² + Σ(Yi - Ŷi)²
Total = Regression + Residual

The regression SSQ is the corrected sum of squares of the fitted values. It simplifies to Sxy²/Sxx. The residual SSQ is the sum of squared residuals.

SLIDE 19

(15) ANOVA calculation

Total sum of squares: Syy. Regression sum of squares: Sxy²/Sxx. The residual sum of squares is obtained by subtraction:

Syy = Sxy²/Sxx + (Syy - Sxy²/Sxx)
Total = Regression + Residual

SLIDE 20

(16) ANOVA calculation

For the blood pressure data, Sxx = 1000, Sxy = 1380, Syy = 1936. Regression SSQ = 1380²/1000 = 1904.4. Residual SSQ = 1936 - 1904.4 = 31.6. These calculations are usually set out in an analysis of variance (ANOVA) table.
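In Python, the same decomposition (values from the slides; not part of the original deck):

```python
# Corrected sums of squares and products for the blood pressure data
Sxx, Sxy, Syy = 1000, 1380, 1936

reg_ssq = Sxy ** 2 / Sxx    # regression sum of squares
res_ssq = Syy - reg_ssq     # residual sum of squares, by subtraction
print(reg_ssq, res_ssq)     # 1904.4 and 31.6 (up to rounding)
```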
SLIDE 21

(17) Analysis of variance table

Source      Df  Sum Sq  Mean Sq
Regression   1  1904.4  1904.40
Residual     3    31.6    10.53
Total        4  1936.0

The regression SSQ Sxy²/Sxx has one degree of freedom. With a sample of size n, the total SSQ has n - 1 d.f. and the residual SSQ has n - 2 d.f. The residual mean square S² = 10.53 estimates σ².

SLIDE 22

(18) A check on the arithmetic

Here are the fitted values and residuals calculated earlier:

 X    Y   Fitted  Residual
35  114    113.4      +0.6
45  124    127.2      -3.2
55  143    141.0      +2.0
65  158    154.8      +3.2
75  166    168.6      -2.6

Check that the residual SSQ is the sum of squared residuals. Check that the regression SSQ is the corrected SSQ of the fitted values (sum of squared deviations about the mean value of 141).

SLIDE 23

(19) Testing zero slope hypothesis

Null hypothesis H0: b1 = 0 ('no relationship between X and Y'). The sampling variance of b̂1 is σ²/Sxx. E = √(S²/Sxx) is the estimated s.e. of b̂1. Under H0, b̂1/E has a t distribution with n - 2 d.f.
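For the blood pressure numbers, the standard error and t statistic work out as follows (Python check; S² = 10.53 is the residual mean square from the ANOVA table):

```python
import math

# Residual mean square, corrected SSQ for age, slope estimate, residual d.f.
S2, Sxx, b1_hat, df = 10.53, 1000, 1.38, 3

E = math.sqrt(S2 / Sxx)   # estimated s.e. of b1-hat
t = b1_hat / E            # t statistic under H0: b1 = 0
print(round(E, 4), round(t, 2))   # 0.1026 and 13.45
```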

SLIDE 24

(20) Testing zero slope hypothesis

For the blood pressure data, E = √(10.53/1000) = 0.1026, and t = 1.38/0.1026 = 13.45 with 3 d.f. Tables of the t distribution give P < 0.001 (two-sided test). The hypothesis is firmly rejected.

SLIDE 25

(21) Interval estimate for slope parameter

The upper 2.5% point for t with 3 d.f. is k = 3.182. The 95% interval estimate for b1 is

b̂1 ± (k × E) = 1.38 ± 3.182 × 0.1026

(between 1.05 and 1.71). An alternative formula: (t ± k)E, where t is the calculated t statistic. A two-sided test at the 5% level is significant if and only if the end-points of the 95% interval have the same sign.
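The interval end-points can be verified directly (Python, using the rounded values quoted on the slide):

```python
# Slope estimate, its s.e., and the upper 2.5% t point with 3 d.f.
b1_hat, E, k = 1.38, 0.1026, 3.182

lower = b1_hat - k * E
upper = b1_hat + k * E
print(round(lower, 2), round(upper, 2))   # 1.05 and 1.71
```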

SLIDE 26

END OF LECTURE

SLIDE 27

Lecture 12. F test, diagnostics, cause and effect, and the lm function

2020

SLIDE 28

(22) An additional assumption

So far, the residuals have been assumed uncorrelated, with zero mean and constant variance (σ²). The results of slides 19-21 (previous lecture) and slide 24 below require the stronger assumption that the residuals are normally distributed. (If the sample is reasonably large, the stronger assumption may not be required: the central limit theorem may come to the rescue.)

SLIDE 29

(23) The F distribution

S1² and S2² are independent estimates of a variance σ², with degrees of freedom n1 and n2. The distribution of S1²/S2² is called the F distribution with n1 and n2 degrees of freedom. Special case: when n1 = 1, the distribution is that of t², where t has a t distribution with n2 d.f.

SLIDE 30

(24) F test for zero slope

Source      Df  Sum Sq  Mean Sq  F ratio
Regression   1  1904.4  1904.40    180.8
Residual     3    31.6    10.53
Total        4  1936.0

The ANOVA F statistic is the square of the t statistic. H0 is rejected for large values of F (a one-sided test, equivalent to the two-sided t test). For the b.p. data, F = 180.8 with 1 and 3 d.f. Tables of F with 1 and 3 d.f. show this to be highly significant (P < 0.001).
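That F equals t² is quick to confirm (Python, using the ANOVA quantities from the table; the unrounded t is used, so the two agree exactly):

```python
import math

# ANOVA quantities for the blood pressure data
Sxx, reg_ssq, res_ssq, df = 1000, 1904.4, 31.6, 3

res_ms = res_ssq / df                   # residual mean square (estimate of sigma^2)
F = reg_ssq / res_ms                    # ANOVA F statistic
t = 1.38 / math.sqrt(res_ms / Sxx)      # unrounded t statistic for the slope

print(round(F, 1), round(t ** 2, 1))    # both 180.8
```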

SLIDE 31

(25) Diagnostics

Inspect the residuals for evidence that the model assumptions do not hold. Plot residuals against the predictor variable or the fitted values. Plots may show evidence of a systematic discrepancy, due to inadequacies in the model, or an isolated discrepancy, due to an 'outlier'. An outlier has an 'unusually' large residual. If possible, a reason should be found. Outliers may sometimes be rejected, cautiously.

SLIDE 32

(26) Cause and effect

A correlation between X and Y does not necessarily imply that a change in X causes a change in Y. The link may be between X and Z, and between Z and Y, where Z is a third (unobserved) variable. For example, a correlation between birth rate and tractor sales may arise simply because both variables are increasing over time.

SLIDE 33

(27) Regression in R

age <- c(35, 45, 55, 65, 75)
bp  <- c(114, 124, 143, 158, 166)
fit <- lm(bp ~ age)
summary(fit)
anova(fit)

Interval estimate for slope parameter:

confint(fit, parm = 2)

SLIDE 34

(28) Summary output

> summary(fit)
Residuals:
   1    2    3    4    5
 0.6 -3.2  2.0  3.2 -2.6
Coefficients:
            Estimate Std. Error t value
(Intercept)    65.10     5.8284   11.17
age             1.38     0.1026   13.45
Multiple R-squared: 0.9837
F-statistic: 180.8 on 1 and 3 DF

SLIDE 35

(29) ANOVA output

> anova(fit)
Analysis of Variance Table
          Df Sum Sq Mean Sq F value
age        1 1904.4 1904.40   180.8
Residuals  3   31.6   10.53
> confint(fit, parm = 2)
    2.5 % 97.5 %
age  1.05   1.71

SLIDE 36

(30) Plotting

# plot the data
plot(bp ~ age)
# add regression line
abline(fit)
# diagnostic plots
plot(fit)

SLIDE 37

(31) The Forbes data

[Scatterplot of the Forbes data]

SLIDE 38

(32) A diagnostic plot

[Plot: residuals against fitted values]

SLIDE 39

(33) One-sample t test revisited

The lm function can be used to analyse the 'matched pairs' data of lecture 9.

Y <- c(10, 2, 22, 23, 6, 31, -3, -7, 15)
fit <- lm(Y ~ 1)
> summary(fit)
            Estimate Std. Error t value
(Intercept)   11.000      4.262   2.581
> confint(fit)
            2.5 % 97.5 %
(Intercept)   1.2   20.8
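The lm output here can be reproduced by hand (a Python check of the estimate, standard error, and t value; not part of the original deck):

```python
import math

# Matched-pairs differences from lecture 9
Y = [10, 2, 22, 23, 6, 31, -3, -7, 15]

n = len(Y)
mean = sum(Y) / n                                # intercept estimate
s2 = sum((y - mean) ** 2 for y in Y) / (n - 1)   # sample variance
se = math.sqrt(s2 / n)                           # s.e. of the mean
t = mean / se                                    # one-sample t statistic

print(mean, round(se, 3), round(t, 3))   # 11.0, 4.262, 2.581
```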

SLIDE 40

END OF LECTURE