SLIDE 1

ST 430/514 Introduction to Regression Analysis/Statistics for Management and the Social Sciences II

Coefficient of Correlation

The regression equation Y = β0 + β1x + ε shows the linear relationship between x and Y. The correlation coefficient r shows the strength of that relationship.

1 / 20 Simple Linear Regression Coefficient of Correlation

SLIDE 2

Properties of r

- r always lies between −1 and +1;
- r = 1 when x and y have a perfect positive linear relationship;
- r = −1 when x and y have a perfect negative linear relationship;
- r = 0 when there is no linear relationship.

SLIDE 3

Calculate r directly as

r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \frac{\sum x_i y_i - n \bar{x} \bar{y}}{\sqrt{\left(\sum x_i^2 - n \bar{x}^2\right)\left(\sum y_i^2 - n \bar{y}^2\right)}}.

Calculate r from β̂1 as

r = \hat{\beta}_1 \sqrt{\frac{\sum (x_i - \bar{x})^2}{\sum (y_i - \bar{y})^2}} = \hat{\beta}_1 \frac{s_x}{s_y}.

Note that r always has the same sign as β̂1.
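As a quick sketch in R, both routes give the same r. The data below are hypothetical, chosen only to be consistent with the advertising/sales example used later in these slides (least squares line ŷ = −0.1 + 0.7x):

```r
# Hypothetical advertising/sales data (x in $100s, y in $1000s),
# chosen to reproduce the slides' fitted line y-hat = -0.1 + 0.7x
x <- c(1, 2, 3, 4, 5)
y <- c(1, 1, 2, 2, 4)

r_direct <- cor(x, y)                # direct formula, about 0.904
b1 <- coef(lm(y ~ x))[["x"]]         # slope estimate beta1-hat = 0.7
r_from_slope <- b1 * sd(x) / sd(y)   # r = beta1-hat * s_x / s_y

r_direct
r_from_slope  # identical to r_direct
```

Note that r_from_slope inherits its sign from β̂1, as the slide states.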

SLIDE 4

Correlation and Causation

Not the same thing! A 1999 article in the journal Nature found “a strong association between myopia and night-time ambient light exposure during sleep in children before they reach two years of age”. The article noted that no causal link was established, but continued “it seems prudent that infants and young children sleep at night without artificial lighting in the bedroom”. Much anguish for parents of myopic children!

SLIDE 5

Later studies found that myopic parents tend to leave the light on, and also tend to have myopic children. One study, in particular, found that “the proportion of myopic children in those subjected to a range of nursery-lighting conditions is remarkably uniform”. This suggests that the association observed in the first study resulted from parental behavior and inheritance, not from a causal effect of night-time lighting. The moral: “Correlation does not imply causation”.

SLIDE 6

Coefficient of Determination

The coefficient of determination R² also measures the strength of the relationship between x and y. With only one independent variable, R² = r². When we have more than one independent variable, R² measures the strength of the relationship of y to all of them; the correlation coefficient r, in contrast, is always between a pair of individual variables.

SLIDE 7

We interpret R² as the fraction of the variance of y that is "explained" by the regression. The definition is

R^2 = 1 - \frac{SSE}{SS_{yy}} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}.

If the regression is strong, we expect ŷᵢ to be a good predictor of yᵢ, so SSE ≪ SSyy, whence the ratio is small and R² is close to 1. Conversely, if the regression is weak, ŷᵢ is not much better than ȳ as a predictor of yᵢ, so the ratio is close to 1 and R² is close to 0.
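A short R check, on hypothetical data consistent with the advertising/sales example used later in these slides, that the defining formula agrees with r² in the one-predictor case:

```r
# Hypothetical data consistent with the advertising/sales example
x <- c(1, 2, 3, 4, 5)
y <- c(1, 1, 2, 2, 4)

fit   <- lm(y ~ x)
sse   <- sum(residuals(fit)^2)   # SSE   = sum of (y_i - yhat_i)^2
ss_yy <- sum((y - mean(y))^2)    # SSyy  = sum of (y_i - ybar)^2

r2 <- 1 - sse / ss_yy            # about 0.817
all.equal(r2, cor(x, y)^2)       # TRUE: with one predictor, R^2 = r^2
```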

SLIDE 8

Estimation and Prediction

An attractive feature of a regression equation like E(Y|x) = β0 + β1x is that it is valid for values of x other than those in the data set, x1, x2, . . . , xn. That is, we can use it to estimate what E(Y|x) would be for some x that was not part of the experiment. But using it for some x that is far from all of x1, x2, . . . , xn is extrapolation, and runs the risk that the model may not be a good approximation there.

SLIDE 9

Estimation

The estimate of E(Y | x = x_p) for some particular x_p is ŷ(x_p) = β̂0 + β̂1 x_p. This is a statistic, so it has a sampling distribution:

- it is unbiased: E[ŷ(x_p)] = E(β̂0 + β̂1 x_p) = β0 + β1 x_p = E(Y | x = x_p);
- its standard error is

\sigma_{\hat{y}(x_p)} = \sigma \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}.

SLIDE 10

Example: in the advertising/sales case, the least squares line is ŷ(x) = −0.1 + 0.7x. So if x = 4 (advertising expenditure = $400), we estimate the expected revenue to be ŷ(4) = 2.7, or $2,700.

The estimated standard error of this estimate is

0.61 \times \sqrt{\frac{1}{5} + \frac{(4 - 3)^2}{10}} = 0.332,

or $332.
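In R, predict() with se.fit = TRUE reproduces this computation. The data below are hypothetical values consistent with the slide's summary statistics (n = 5, x̄ = 3, SSxx = 10, s ≈ 0.61):

```r
# Hypothetical advertising/sales data matching the slide's statistics
adv <- c(1, 2, 3, 4, 5)
rev <- c(1, 1, 2, 2, 4)
fit <- lm(rev ~ adv)

# Point estimate and its standard error at x_p = 4
est <- predict(fit, newdata = data.frame(adv = 4), se.fit = TRUE)
est$fit     # 2.7, i.e. $2,700
est$se.fit  # about 0.332, i.e. $332
```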

SLIDE 11

Prediction

Note: ŷ(x_p) = β̂0 + β̂1 x_p is the estimate of E(Y | x = x_p), the expected value of Y when x = x_p. Sometimes we want to predict the actual value of Y for a new observation at x = x_p.

Example: if the store spends $400 on advertising next month, what can we predict about revenue?

SLIDE 12

Our best guess, the predicted value, is still ŷ(x_p). But the error is larger: the standard error of prediction is

\sigma_{[y - \hat{y}(x_p)]} = \sigma \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}.

Compare with

\sigma_{\hat{y}(x_p)} = \sigma \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}.

In the example, σ̂[y−ŷ(4)] = 0.690, or $690.
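predict() also produces intervals directly; interval = "prediction" uses this larger standard error. Again on hypothetical data consistent with the advertising/sales example:

```r
# Hypothetical advertising/sales data matching the slide's statistics
adv <- c(1, 2, 3, 4, 5)
rev <- c(1, 1, 2, 2, 4)
fit <- lm(rev ~ adv)
new <- data.frame(adv = 4)

# Confidence interval for E(Y | x = 4) vs prediction interval for a new Y
ci <- predict(fit, new, interval = "confidence")  # narrower
pi <- predict(fit, new, interval = "prediction")  # wider (SE about 0.690)
ci
pi
```

Both intervals are centered at the same predicted value, 2.7; only their widths differ.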

SLIDE 13

More generally, we might want to predict the average of m > 1 new observations at x = x_p. Our best guess, the predicted value, is again ŷ(x_p). The standard error is between σ_{ŷ(x_p)} and σ_{[y−ŷ(x_p)]}:

\sigma \sqrt{\frac{1}{m} + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}.

The predicted value and standard error are the same if the new observations are made at different xs whose mean is x_p.
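This intermediate standard error is easy to compute by hand from the fitted model. A sketch on the hypothetical advertising/sales data, with m = 2 new observations at x_p = 4:

```r
# Hypothetical advertising/sales data matching the slide's statistics
adv <- c(1, 2, 3, 4, 5)
rev <- c(1, 1, 2, 2, 4)
fit <- lm(rev ~ adv)

s     <- summary(fit)$sigma            # residual standard error, about 0.61
n     <- length(adv)
ss_xx <- sum((adv - mean(adv))^2)      # SSxx = 10
xp    <- 4
m     <- 2                             # average of m = 2 new observations

se_mean <- s * sqrt(1/m + 1/n + (xp - mean(adv))^2 / ss_xx)
se_mean  # about 0.542: between 0.332 (estimation) and 0.690 (prediction)
```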

SLIDE 14

Fire Damage

How does the cost of fire damage vary with distance from the nearest fire station? Here x is distance in miles, and y is the cost of damage. Steps in the analysis (not quite the same as in the text):

1. Plot the data:

firedam <- read.table("Text/Exercises&Examples/FIREDAM.txt", header = TRUE)
plot(firedam)

The plot shows approximately linear dependence.

SLIDE 15

2. Overlay the least squares line:

firedam.lm <- lm(DAMAGE ~ DISTANCE, data = firedam)
abline(reg = firedam.lm, col = "blue")

No obvious issues.

SLIDE 16

3. Summarize the fitted model:

summary(firedam.lm)

- Least squares line: ŷ = 10.2779 + 4.9193x.
- Residual standard error: 2.316 on 13 degrees of freedom.
- Test of H0: β1 = 0: t = 12.525 on the same 13 degrees of freedom; Pr(>|t|) = 1.25 × 10^−8: very strong evidence against H0.
- 95% confidence interval: 4.071 < β1 < 5.768.
- Coefficient of determination: 0.9235.
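The confidence interval comes from confint(). Since the FIREDAM data file itself is not reproduced in these slides, the call is sketched below on the hypothetical advertising/sales data instead; the usage is identical:

```r
# confint() computes a confidence interval for each coefficient of an lm fit;
# shown here on hypothetical advertising/sales data, not the fire-damage data
adv <- c(1, 2, 3, 4, 5)
rev <- c(1, 1, 2, 2, 4)
fit <- lm(rev ~ adv)

ci <- confint(fit, level = 0.95)
ci  # slope row: about 0.091 < beta1 < 1.309
```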

SLIDE 17

4. Using multiple regression (next topic), check whether the straight-line model is adequate by fitting the quadratic model E(Y) = β0 + β1x + β2x² and testing H0: β2 = 0.

5. Use graphical regression diagnostics (later topic) to check the residuals for issues.
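Step 4 can be sketched as follows, using the hypothetical advertising/sales data in place of the fire-damage file (the call would be the same with DAMAGE ~ DISTANCE + I(DISTANCE^2)):

```r
# Hypothetical advertising/sales data
adv <- c(1, 2, 3, 4, 5)
rev <- c(1, 1, 2, 2, 4)

# Fit the quadratic model and inspect the t test for the x^2 coefficient;
# a large p-value gives no evidence that the straight line is inadequate
quad <- lm(rev ~ adv + I(adv^2))
summary(quad)$coefficients["I(adv^2)", ]
```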

SLIDE 18

Regression Through the Origin

In the usual straight-line model E(Y) = β0 + β1x, the intercept β0 is estimated from the data. In some situations we may know that E(Y) = 0 when x = 0, or in other words that β0 = 0. We should then fit the simpler "regression through the origin" model E(Y) = β1x.

SLIDE 19

Advertising and Revenue Example

summary(lm(rev ~ 0 + adv))  # or summary(lm(rev ~ -1 + adv))

Caution

When the intercept is omitted, the coefficient of determination is calculated as

R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum y_i^2}.

Because the denominator is Σ yᵢ² in place of Σ (yᵢ − ȳ)², this R² cannot be compared with the coefficient of determination in a model that contains the intercept.
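A sketch of the caution on hypothetical advertising/sales data: because the no-intercept R² uses a different baseline, it comes out larger even though the fit is not better:

```r
# Hypothetical advertising/sales data
adv <- c(1, 2, 3, 4, 5)
rev <- c(1, 1, 2, 2, 4)

with_int <- lm(rev ~ adv)      # usual model, R^2 about 0.817
no_int   <- lm(rev ~ 0 + adv)  # through the origin, "R^2" about 0.957

summary(with_int)$r.squared
summary(no_int)$r.squared      # larger, but NOT comparable with the above
```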

SLIDE 20

What we "know" may not be true.

Cancer Treatment Example

x = dosage of a drug for cancer patients; Y = increase in pulse rate after 1 minute. If x = 0, the patient takes no drug, so the pulse rate should not change. But a patient with zero dosage may be given a placebo instead, and the pulse rate may change because of taking any medication. We would usually include the intercept in a situation like this.
