SLIDE 1

M7S3 - Regression Thoughts

Professor Jarad Niemi

STAT 226 - Iowa State University

November 27, 2018

Professor Jarad Niemi (STAT226@ISU) M7S3 - Regression Thoughts November 27, 2018 1 / 21

SLIDE 2

Outline

Regression thoughts

Properties

  • Coefficient of determination ($r^2$) is amount of variation explained
  • Not reversible
  • Always through $(\bar{x}, \bar{y})$
  • Residuals sum to zero
  • Residual plots
  • Leverage and influence

Cautions

  • Extrapolation
  • Correlation does not imply causation
  • Lurking variables (Simpson’s Paradox)
  • Correlations based on average data

SLIDE 3

Simple linear regression

For a collection of observations $(x_i, y_i)$ for $i = 1, \ldots, n$, we can fit a regression line

$$y_i = b_0 + b_1 x_i + e_i$$

where $b_0$ is the sample intercept, $b_1$ is the sample slope, and $e_i$ is the residual for individual $i$, by minimizing the sum of squared residuals

$$\sum_{i=1}^n e_i^2$$

where $e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)$ and $\hat{y}_i$ is the predicted value for individual $i$.
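As a minimal sketch of the fit described above, the following Python (the slides themselves use R) computes $b_0$ and $b_1$ with the standard closed-form least-squares solution. The function names and the data are made up for illustration and are not from the slides.

```python
def fit_line(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx            # sample slope
    b0 = ybar - b1 * xbar     # sample intercept
    return b0, b1


def sum_squared_residuals(x, y, b0, b1):
    """Sum of e_i^2 where e_i = y_i - (b0 + b1 * x_i)."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up illustrative data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [-3.1, -5.9, -9.2, -11.8, -15.0]

b0, b1 = fit_line(x, y)
sse = sum_squared_residuals(x, y, b0, b1)
```

Nudging either coefficient away from the fitted value should never lower `sse`, which is a quick way to convince yourself that the line really is the least-squares line.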

SLIDE 4

Simple linear regression graphically

[Figure: simple linear regression plot of y against x]

SLIDE 5

Coefficient of determination

The sample correlation $r$ measures the direction and strength of the linear relationship between x and y.

Definition: The coefficient of determination

$$r^2 = 1 - \frac{\sum_{i=1}^n e_i^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$$

measures the proportion of variability in y that can be explained by the linear relationship between x and y.
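A small Python sketch can check that the definition above agrees with squaring the sample correlation. The helper names and the data are illustrative assumptions, not taken from the slides.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y regressed on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def correlation(x, y):
    """Sample correlation r between x and y."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def r_squared(x, y):
    """1 - (sum of squared residuals) / (total variation in y)."""
    b0, b1 = fit_line(x, y)
    ybar = sum(y) / len(y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - sse / sst


# Made-up illustrative data (not from the slides).
x = [1, 2, 3, 4, 5, 6]
y = [-2, -7, -6, -12, -11, -18]

r = correlation(x, y)
r2 = r_squared(x, y)   # agrees with r ** 2 up to rounding
```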

SLIDE 6

Example

The correlation between weekly sales amount and weekly radio ads is 0.98. The coefficient of determination is $r^2 \approx 0.96$. Thus about 96% of the variability in weekly sales amount can be explained by the linear relationship with weekly radio ads. If you are only told $r^2$, you cannot determine the direction of the relationship.
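Written out, the arithmetic in this example also shows why the direction is lost: squaring removes the sign, so a correlation of $-0.98$ would give the same value.

$$r^2 = (0.98)^2 = 0.9604 \approx 0.96, \qquad (-0.98)^2 = 0.9604 \approx 0.96$$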

SLIDE 7

Symmetric

Correlation is symmetric: the correlation of x with y is the same as the correlation of y with x.

cor(x,y)
[1] -0.8866024
cor(y,x)
[1] -0.8866024

Thus the coefficient of determination is symmetric.
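The symmetry is visible in the formula itself, since swapping x and y only reorders the factors. A minimal Python sketch, with a hypothetical `correlation` helper and made-up data:

```python
import math


def correlation(a, b):
    """Sample correlation; the formula treats a and b identically."""
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    saa = sum((ai - abar) ** 2 for ai in a)
    sbb = sum((bi - bbar) ** 2 for bi in b)
    sab = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    return sab / math.sqrt(saa * sbb)


# Made-up illustrative data (not the data behind the R output above).
x = [1, 2, 3, 4, 5, 6]
y = [-2, -7, -6, -12, -11, -18]

r_xy = correlation(x, y)
r_yx = correlation(y, x)
```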

SLIDE 8

Equation not reversible

The regression line of y on x is $y = b_0 + b_1 x$, but the regression line of x on y is not obtained by solving that equation for x, i.e. it is not

$$x = -\frac{b_0}{b_1} + \frac{1}{b_1} y.$$

regress(y,x)
(Intercept)           x
  -1.904408   -2.589194
-b0/b1; 1/b1
[1] -0.7355215
[1] -0.3862206
regress(x,y)
(Intercept)           x
  0.4915144  -0.3035940
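The same comparison can be sketched in Python with made-up data (the `fit_line` helper is hypothetical and the numbers are not those in the R output). It also illustrates a known identity: the product of the two slopes equals $r^2$, so the algebraic inverse $1/b_1$ matches the reverse slope only when $r^2 = 1$.

```python
def fit_line(u, v):
    """Least-squares (intercept, slope) for v regressed on u."""
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    suu = sum((ui - ubar) ** 2 for ui in u)
    suv = sum((ui - ubar) * (vi - vbar) for ui, vi in zip(u, v))
    slope = suv / suu
    return vbar - slope * ubar, slope


# Made-up illustrative data.
x = [1, 2, 3, 4, 5, 6]
y = [-2, -7, -6, -12, -11, -18]

b0, b1 = fit_line(x, y)      # y regressed on x
c0, c1 = fit_line(y, x)      # x regressed on y

# Solving y = b0 + b1 * x for x gives these coefficients instead.
inverted_intercept = -b0 / b1
inverted_slope = 1 / b1

slope_product = b1 * c1      # equals r^2 for these data
```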

SLIDE 9

Always through $(\bar{x}, \bar{y})$

Recall that knowing any two points is enough to determine a straight line. It can be proved that the regression line always passes through the point $(\bar{x}, \bar{y})$.

Suppose you know that $\bar{x} = 5$, $\bar{y} = -15$, and the y-intercept is $-2$. What is the slope?

$$\bar{y} = b_0 + b_1 \bar{x} \implies b_1 = (\bar{y} - b_0)/\bar{x}$$

So the slope is

(ybar-b0)/xbar
[1] -2.6
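A short Python sketch of both points: it repeats the slide's calculation and then checks, on made-up data, that a fitted line passes through the point of means. The helper names are hypothetical.

```python
def slope_from_means(xbar, ybar, b0):
    """Slope of the line with intercept b0 that passes through (xbar, ybar)."""
    return (ybar - b0) / xbar


def fit_line(x, y):
    """Least-squares (intercept, slope) for y regressed on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


# Values from the slide: xbar = 5, ybar = -15, intercept = -2.
slide_slope = slope_from_means(5, -15, -2)

# Made-up data to check the property on an actual fit.
x = [1, 2, 3, 4, 5, 6]
y = [-2, -7, -6, -12, -11, -18]
b0, b1 = fit_line(x, y)
xbar, ybar = sum(x) / len(x), sum(y) / len(y)
prediction_at_xbar = b0 + b1 * xbar   # should equal ybar
```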

SLIDE 10

Always through $(\bar{x}, \bar{y})$

[Figure: plot of y against x]

SLIDE 11

Residuals sum to zero

When the regression includes an intercept ($b_0$), it can be proved that the residuals sum to zero, i.e.

$$\sum_{i=1}^n e_i = 0.$$

We will often look at residual plots:

  • Residuals vs explanatory variable
  • Residuals vs predicted value

These will be centered on 0 due to the result above.
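One way to see the result is a short derivation sketch that takes as given the earlier property that the fitted line passes through $(\bar{x}, \bar{y})$, i.e. $\bar{y} = b_0 + b_1 \bar{x}$:

$$\sum_{i=1}^n e_i = \sum_{i=1}^n \left( y_i - b_0 - b_1 x_i \right) = n\bar{y} - n b_0 - n b_1 \bar{x} = n\left( \bar{y} - b_0 - b_1 \bar{x} \right) = 0.$$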

SLIDE 12

Residual vs explanatory variable

[Figure: residual plotted against x]

SLIDE 13

Residual vs predicted

[Figure: residual plotted against predicted value]

SLIDE 14

Leverage and influence

Definition

An individual has high leverage if its explanatory variable value is far from the explanatory variable values of the other observations. An individual with high leverage will be an outlier in the x direction.

An individual has high influence if its inclusion dramatically affects the fitted regression line.

Only individuals with high leverage can have high influence.
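The two ideas can be illustrated with a small Python sketch on made-up data. Leverage is computed with the standard simple-regression formula $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, which is an addition here and not something the slides state; influence is judged informally by refitting without the point.

```python
def fit_line(x, y):
    """Least-squares (intercept, slope) for y regressed on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def leverages(x):
    """Leverage of each observation; depends only on the x values."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]


# Made-up data: the last point is far from the others in the x direction
# and does not follow the pattern y = -2x of the first five points.
x = [1, 2, 3, 4, 5, 15]
y = [-2, -4, -6, -8, -10, -10]

h = leverages(x)                               # largest value is the last point
_, slope_with = fit_line(x, y)                 # fit using all six points
_, slope_without = fit_line(x[:-1], y[:-1])    # fit without the last point
```

If the last y value were changed to fall on the same line as the other points, the point would keep its high leverage but removing it would leave the slope unchanged, i.e. high leverage with low influence.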

SLIDE 15

Leverage and influence

[Figure: grid of scatterplots of y against x, with panels labeled high and low]

SLIDE 16

Correlation does not imply causation

You have all likely heard the adage that correlation does not imply causation. If two variables have a correlation that is close to -1 or 1, the two variables are highly correlated. This does not mean that one variable causes the other.

Spurious correlations: http://www.tylervigen.com/spurious-correlations

SLIDE 17

Correlation does not imply causation (cont.)

From https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5402407/:

My attention was drawn to the recent article by Song et al. entitled “How jet lag impairs Major League Baseball performance” (1), not only by its slightly unusual subject but more importantly because I wondered how one could ever actually prove the effect of jet lag on baseball performance.

...Although I do not dispute the large amount of work involved and would be well-nigh incapable of judging the validity of the analyses performed, I must admit that I was taken aback by the way Song et al. (1) systematically present the correlations they identify as direct proof of causality between jet lag and the affected variables. It is actually quite remarkable to me that the word “correlation” does not appear even once in the paper, when this is actually what the authors have been looking at and, in my opinion, to be scientifically accurate, the title of the article should really read: “How jet lag correlates with impairments in Major League Baseball performance.”

...this tendency to amalgamate correlation with causality is apparently extremely frequent in this field of investigation. But given the broad readership of PNAS and the subject of this article, I feel that it is likely to be relayed by the press and to attract the attention of many people, both scientists and nonscientists. Considering the current tendency to misinterpret scientific data, via the misuse of statistics in particular, I feel that a journal such as PNAS should aim to educate by example, and thus ought to enforce more rigor in the presentation of scientific articles regarding the difference between correlations and proven causality.

SLIDE 18

Lurking variables

Definition: A lurking variable is a variable that has an important effect on the relationship of the variables under investigation, but that is not included in the variables being studied.

What is the relationship between a person’s height and their ideal partner’s height?

[Figure: scatterplot of ideal height against own height with a linear fit]

SLIDE 19

Ideal partner height

In this example, gender is a lurking variable:

[Figure: scatterplot of ideal height against own height with separate linear fits for females and males]

Linear Fit Females: predicted ideal height = 35.798818 + 0.5469203 × own height
Linear Fit Males: predicted ideal height = 34.971329 + 0.4484906 × own height

This phenomenon is called Simpson’s Paradox.
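To see how much the two groups differ, the fitted equations reported on this slide can be evaluated at the same own height. This is a minimal Python sketch; the function name and the chosen height of 68 are illustrative assumptions, while the coefficients are the ones reported above.

```python
# (intercept, slope) pairs as reported on the slide.
FITS = {
    "female": (35.798818, 0.5469203),
    "male": (34.971329, 0.4484906),
}


def predicted_ideal_height(own_height, gender):
    """Predicted ideal partner height from the group-specific fitted line."""
    b0, b1 = FITS[gender]
    return b0 + b1 * own_height


own = 68  # illustrative own height, in the same units as the slide's data
female_pred = predicted_ideal_height(own, "female")
male_pred = predicted_ideal_height(own, "male")
```

At the same own height the two groups get noticeably different predictions, which is why a single line fitted to the pooled data can misrepresent the relationship within each group.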

SLIDE 20

Correlations based on average data

Correlations based on average data are often much higher (closer to -1 or 1) than correlations based on individual data. This occurs because the averages smooth out the variability between individuals.
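A deliberately artificial Python sketch of this effect: five made-up groups whose individuals scatter widely around the group mean, constructed so that the scatter cancels exactly when averaged. All names and numbers are assumptions for illustration.

```python
import math


def correlation(a, b):
    """Sample correlation between two equal-length lists."""
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    saa = sum((ai - abar) ** 2 for ai in a)
    sbb = sum((bi - bbar) ** 2 for bi in b)
    sab = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    return sab / math.sqrt(saa * sbb)


# Individual deviations from the group mean; they average to zero and
# are uncorrelated within a group.
offsets = [(3, -3), (-3, 3), (3, 3), (-3, -3)]

x_ind, y_ind, x_avg, y_avg = [], [], [], []
for g in range(1, 6):                      # group means lie on the line y = x
    xs = [g + dx for dx, _ in offsets]
    ys = [g + dy for _, dy in offsets]
    x_ind += xs
    y_ind += ys
    x_avg.append(sum(xs) / len(xs))
    y_avg.append(sum(ys) / len(ys))

r_individual = correlation(x_ind, y_ind)   # weak: individual scatter dominates
r_average = correlation(x_avg, y_avg)      # strong: averaging removed the scatter
```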

SLIDE 21

Extrapolation

Definition: Extrapolation occurs when making predictions for explanatory variable values below the sample minimum or above the sample maximum of the explanatory variable.

Regression assumes a linear pattern between the response variable and the explanatory variable. Even if this linear assumption is correct for a range of explanatory variable values, there is no reason to expect that this will continue beyond that range.
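The definition translates directly into a range check. A minimal Python sketch, assuming a hypothetical `predict` helper that returns the prediction together with a flag; the sample values and coefficients are made up for illustration.

```python
def predict(x_new, x_sample, b0, b1):
    """Return (prediction, is_extrapolation) for a fitted line b0 + b1 * x.

    The flag is True when x_new lies below the sample minimum or above
    the sample maximum of the explanatory variable.
    """
    lo, hi = min(x_sample), max(x_sample)
    is_extrapolation = x_new < lo or x_new > hi
    return b0 + b1 * x_new, is_extrapolation


# Illustrative sample of explanatory values and illustrative coefficients.
x_sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b0, b1 = -2.0, -2.6

inside = predict(5, x_sample, b0, b1)     # within the observed range
outside = predict(12, x_sample, b0, b1)   # beyond the sample maximum
```

Returning a flag instead of refusing to predict leaves the decision to the caller, which fits the slide's point: the prediction can still be computed, but there is no reason to trust it.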
