SLIDE 1
Correlation and Regression

Marc H. Mehlman

marcmehlman@yahoo.com

University of New Haven

“All models are wrong. Some models are useful.” – George Box

“· · · the statistician knows · · · that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.” – George Box

SLIDE 2

Table of Contents

1. Bivariate Data
2. Correlation
3. Simple Regression
4. Variation
5. Multivariate Regression
6. Logistic Regression
7. Chapter #9 R Assignment

SLIDE 3

Bivariate Data and Scatterplots

SLIDE 4

Bivariate data comes from measuring two aspects of the same item/individual. For instance, (70, 178), (72, 192), (74, 184), (68, 181) is a random sample of size four obtained from four male college students. The bivariate data gives the height in inches and the weight in pounds of each of the four students. The third student sampled is 74 inches tall and weighs 184 pounds. Can one variable be used to predict the other? Do tall people tend to weigh more?

Definition. A response (or dependent) variable measures the outcome of a study. The explanatory (or independent) variable is the one that predicts the response variable.
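The data from this example can be entered and displayed in R; a minimal sketch (the variable names are ours):

> height = c(70, 72, 74, 68)     # explanatory variable, in inches
> weight = c(178, 192, 184, 181) # response variable, in pounds
> plot(weight ~ height, main = "weight vs height")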

SLIDE 5

Student ID   # of Beers   Blood Alcohol Content
 1           5            0.1
 2           2            0.03
 3           9            0.19
 4           8            0.12
 5           3            0.04
 6           7            0.095
 7           3            0.07
 8           5            0.06
 9           3            0.02
10           5            0.05
11           4            0.07
12           6            0.1
13           5            0.085
14           7            0.09
15           1            0.01
16           4            0.05

Here we have two quantitative variables recorded for each of 16 students:

  • 1. how many beers they drank
  • 2. their resulting blood alcohol content (BAC)

Bivariate data

- For each individual studied, we record data on two variables.
- We then examine whether there is a relationship between these two variables: do changes in one variable tend to be associated with specific changes in the other variable?
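As a sketch (variable names are ours), one could enter the 16 observations in R, ordered by student ID, and examine the relationship:

> beers = c(5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4)
> BAC = c(0.1, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
+         0.02, 0.05, 0.07, 0.1, 0.085, 0.09, 0.01, 0.05)
> plot(BAC ~ beers)   # scatterplot of BAC against beers drunk
> cor(beers, BAC)     # about 0.89: a strong positive linear association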

SLIDE 6

(The same beers/BAC data as on the previous slide.)

Scatterplots

A scatterplot is used to display quantitative bivariate data. Each variable makes up one axis. Each individual is a point on the graph.

SLIDE 7

> plot(trees$Girth~trees$Height,main="girth vs height")

[Scatterplot titled "girth vs height": trees$Girth (y-axis, 8–20) against trees$Height (x-axis, 65–85).]

SLIDE 8

How to scale a scatterplot

Same data in all four plots. Both variables should be given a similar amount of space:
- The plot is roughly square.
- The points should occupy all the plot space (no blank space).

SLIDE 9

Interpreting scatterplots

After plotting two variables on a scatterplot, we describe the overall pattern of the relationship. Specifically, we look for:
- Form: linear, curved, clusters, no pattern
- Direction: positive, negative, no direction
- Strength: how closely the points fit the "form"
... and clear deviations from that pattern:
- Outliers of the relationship

SLIDE 10

Form

[Three example scatterplots: linear, nonlinear, no relationship.]

SLIDE 11

Direction

Positive association: High values of one variable tend to occur together with high values of the other variable.
Negative association: High values of one variable tend to occur together with low values of the other variable.

SLIDE 12

Strength

The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.


SLIDE 13

Outliers

An outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected). In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship.

SLIDE 14

Adding categorical variables to scatterplots

Two or more relationships can be compared on a single scatterplot when we use different symbols for groups of points on the graph.

[Scatterplot comparing the association between thorax length and longevity of male fruit flies that are allowed to reproduce (green) or not (purple).]

The pattern is similar in both groups (linear, positive association), but male fruit flies not allowed to reproduce tend to live longer than reproducing male fruit flies of the same size.

SLIDE 15

Correlation

SLIDE 16

Definition. Given the bivariate data $(x_1, y_1), \dots, (x_n, y_n)$, the sample correlation coefficient (sample Pearson product-moment correlation coefficient) is
$$ r \;\stackrel{\text{def}}{=}\; \frac{1}{n-1}\sum_{j=1}^{n}\left(\frac{x_j-\bar{x}}{s_x}\right)\left(\frac{y_j-\bar{y}}{s_y}\right). $$
The population correlation coefficient is denoted
$$ \rho \;\stackrel{\text{def}}{=}\; \frac{1}{N}\sum_{j=1}^{N}\left(\frac{x_j-\mu_X}{\sigma_X}\right)\left(\frac{y_j-\mu_Y}{\sigma_Y}\right), $$
where the above sum is taken over the entire population of size N.

One thinks of r as an estimator of ρ.
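As a quick check (a sketch; the object names are ours), r can be computed directly from the definition and compared with R's built-in cor():

> x = trees$Height; y = trees$Girth
> r = sum(((x - mean(x))/sd(x)) * ((y - mean(y))/sd(y)))/(length(x) - 1)
> r   # agrees with cor(x, y) = 0.5192801, shown on the next slide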

SLIDE 17

One can also use the formula
$$ r = \frac{n\sum_{j=1}^{n}x_j y_j - \left(\sum_{j=1}^{n}x_j\right)\left(\sum_{j=1}^{n}y_j\right)}{\sqrt{n\sum_{j=1}^{n}x_j^{2} - \left(\sum_{j=1}^{n}x_j\right)^{2}}\;\sqrt{n\sum_{j=1}^{n}y_j^{2} - \left(\sum_{j=1}^{n}y_j\right)^{2}}}. $$
R command:

> cor(trees$Girth,trees$Height)
[1] 0.5192801


SLIDE 19

The correlation coefficient measures the strength of any linear relationship between X and Y.

Properties of Correlation:
- cor(X, Y) = cor(Y, X).
- −1 ≤ r ≤ 1, and r is scale invariant.
- If r is positive, there is a positive linear relationship between the two variables.
- If r is negative, there is a negative linear relationship between the two variables.
- The closer |r| is to one, the stronger the linear relationship between the two variables.
- If |r| = 1 (i.e., r = 1 or r = −1), all the data points lie on a straight line.

SLIDE 25

r has no unit

[Figure: the same data plotted in original and standardized units – standardized value of x (unitless) against standardized value of y (unitless); r = −0.75 in both plots.]

SLIDE 26

r is not resistant to outliers

Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers. Just moving one point away from the linear pattern here weakens the correlation from −0.91 to −0.75 (closer to zero).

SLIDE 27

[Figure: a gallery of scatterplots illustrating various correlation values.]

SLIDE 28

Caution: Correlation is not Causation

Definition. When calculating correlation, a lurking variable is a third factor that explains the relationship between the two correlated variables.

Example (Lurking Variables).
- There is a strong correlation between shoe size and reading skills among elementary school children. The lurking variable is · · ·
- There is a strong correlation between the number of firefighters at a fire site and the amount of damage. The lurking variable is · · ·

Caution: Beware correlations based on averaged data. While there is a strong correlation between average age and average height among children, the correlation between age and height for individual children is much, much lower.

SLIDE 29

Definition. Two variables are confounded when their effects on the response variable cannot be distinguished from each other. The confounded variables can be either explanatory or lurking variables (or only work in the presence of each other). The only way to distinguish between two confounded variables is to redesign the experiment.

Example. When I'm stressed, I get muscle cramps. However, when I'm stressed, I also drink lots of coffee and lose sleep. Are the cramps caused by stress, or coffee, or lack of sleep, or some combination of the above?

Example. A classic example of confounding: a study suggests that people who carry matches are more likely to develop lung cancer. Is it the matches, or is there confounding here with a lurking variable?

SLIDE 32

Establishing causation

Establishing causation from an observed association can be done if:
1) The association is strong.
2) The association is consistent.
3) Higher doses are associated with stronger responses.
4) The alleged cause precedes the effect.
5) The alleged cause is plausible.

Lung cancer is clearly associated with smoking. What if a genetic mutation (lurking variable) caused people to both get lung cancer and become addicted to smoking? It took years of research and accumulated indirect evidence to reach the conclusion that smoking causes lung cancer.

SLIDE 33

Theorem (Test for Correlation). Let H0 : ρ = 0 vs H1 : ρ ≠ 0. The test statistic is
$$ t = \frac{r}{\sqrt{\dfrac{1-r^{2}}{n-2}}} \sim t(n-2) \quad\text{under } H_0. $$
One can also use Table A-6 with test statistic |r|. R command: cor.test(X, Y) (one can also do one-sided tests with R).

SLIDE 34

Using R

Example

> cor.test(trees$Girth,trees$Height)

	Pearson's product-moment correlation

data:  trees$Girth and trees$Height
t = 3.2722, df = 29, p-value = 0.002758
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2021327 0.7378538
sample estimates:
      cor 
0.5192801 

Note that one is assuming that the (trees$Height, trees$Girth) pairs are sampled from a bivariate normal distribution.

SLIDE 35

Example. Each day, for the last 63 days, measurements of the time Joe spends sleeping and the time he spends watching TV are taken. Assume time spent sleeping and time spent watching TV form a bivariate normal random variable. A sample correlation of r = 0.12 is calculated. Find the p-value of H0 : ρ = 0 versus HA : ρ ≠ 0.

Solution:

> tstat=0.12*sqrt((63-2)/(1-0.12^2))
> tstat
[1] 0.9440518
> 2*(1-pt(tstat,61))
[1] 0.3488675

There is little evidence that the time Joe spends sleeping and the time Joe spends watching TV are correlated.

SLIDE 37

Simple Regression

SLIDE 38

Let X = the predictor (independent) variable and Y = the response (dependent) variable. Given a bivariate random variable (X, Y), is there a linear (straight-line) association between X and Y (plus some randomness)? And if so, what is it, and how much randomness?

Definition (Statistical Model of Simple Linear Regression). Given a predictor x, the response y is
$$ y = \beta_0 + \beta_1 x + \epsilon_x, $$
where β0 + β1x is the mean response for x. The noise terms, the εx's, are assumed to be independent of each other and to be randomly sampled from N(0, σ). The parameters of the model are β0, β1 and σ.
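To make the model concrete, here is a minimal simulation sketch in R (the parameter values β0 = 3, β1 = 0.5, σ = 2 and all object names are ours):

> set.seed(1)
> x = runif(50, 60, 90)   # predictor values
> eps = rnorm(50, 0, 2)   # noise terms sampled from N(0, sigma)
> y = 3 + 0.5*x + eps     # responses from the model
> plot(y ~ x)             # linear trend plus scatter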

SLIDE 39

Conditions for Regression Inference

[Figure: the regression model when the conditions are met.] The line in the figure is the population regression line μy = β0 + β1x. The Normal curves show how y will vary when x is held fixed at different values. All the curves have the same standard deviation σ, so the variability of y is the same for all values of x. The value of σ determines whether the points fall close to the population regression line (small σ) or are widely scattered (large σ). For each possible value of the explanatory variable x, the mean of the responses μ(y | x) moves along this line.

SLIDE 40

[Four scatterplots, each with the same fitted line ŷ = 3 + 0.5x:]
- Moderate linear association; regression OK.
- Obvious nonlinear relationship; regression inappropriate.
- One extreme outlier, requiring further examination.
- Only two values for x; a redesign is due here…

SLIDE 41

Given a bivariate random sample from the simple linear regression model, (x1, y1), (x2, y2), ..., (xn, yn), one wishes to estimate the parameters of the model, (β0, β1, σ).

Given an arbitrary line y = mx + b, define the sum of the squares of errors to be
$$ \sum_{i=1}^{n}\left[y_i - (mx_i + b)\right]^{2}. $$
Using calculus, one can find the least-squares regression line, y = b0 + b1x, that minimizes the sum of squares of errors.

SLIDE 42

Theorem (Estimating β0 and β1). Given the bivariate random sample (x1, y1), ..., (xn, yn), the least-squares regression line y = b0 + b1x is obtained by letting
$$ b_1 = r\,\frac{s_y}{s_x} \quad\text{and}\quad b_0 = \bar{y} - b_1\bar{x}, $$
where b0 is an unbiased estimator of β0 and b1 is an unbiased estimator of β1.

Note: The point (x̄, ȳ) will lie on the regression line, though there is no reason to believe that (x̄, ȳ) is one of the data points. One can also calculate b1 using
$$ b_1 = \frac{n\sum_{j=1}^{n}x_j y_j - \left(\sum_{j=1}^{n}x_j\right)\left(\sum_{j=1}^{n}y_j\right)}{n\sum_{j=1}^{n}x_j^{2} - \left(\sum_{j=1}^{n}x_j\right)^{2}}. $$
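As a sketch, one can verify these formulas on the trees data and compare with lm() (object names are ours):

> x = trees$Height; y = trees$Girth
> b1 = cor(x, y)*sd(y)/sd(x)
> b0 = mean(y) - b1*mean(x)
> c(b0, b1)   # matches coef(lm(Girth ~ Height, data = trees)): -6.1883945, 0.2557471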

SLIDE 43

Example

> plot(trees$Girth~trees$Height,main="girth vs height")
> abline(lm(trees$Girth ~ trees$Height), col="red")

[Scatterplot "girth vs height" with the fitted regression line drawn in red.]

Since both variables come from "trees", in order for the R command "lm" (linear model) to work, "trees" has to be in the R format "data.frame".

> class(trees) # "trees" is in data.frame format - lm will work.
[1] "data.frame"
> g.lm=lm(Girth~Height,data=trees)
> coef(g.lm)
(Intercept)      Height 
 -6.1883945   0.2557471 

SLIDE 45

Definition. The predicted value of y at xj is
$$ \hat{y}_j \;\stackrel{\text{def}}{=}\; b_0 + b_1 x_j. $$
The predicted value, ŷ, is an unbiased estimator of the mean response, μy.

Example. Using the R dataset "trees", one wants the predicted girth of three trees of heights 74, 83 and 91, respectively. One uses the regression model "girth ~ height" for our predictions. The work below is done in R.

> g.lm=lm(Girth~Height,data=trees)
> predict(g.lm,newdata=data.frame(Height=c(74,83,91)))
       1        2        3 
12.73689 15.03862 17.08459 

SLIDE 46

“Never make forecasts, especially about the future.” – Samuel Goldwyn

The regression line only has predictive value for y at x if:
1. ρ is not ≈ 0 (if there is no significant linear correlation, don't use the regression line for predictions; if ρ ≈ 0, then ȳ is the best predictor of y at x).
2. one predicts y only for x's within the range of the xj's – one does not predict the girth of a tree with a height of 1000 feet. Interpolate, don't extrapolate.

|r| (or r²) is a measure of how well the regression equation fits the data: bigger |r| ⇒ data fits the regression line better ⇒ better prediction.

SLIDE 47

Outliers and influential points

Outlier: an observation that lies outside the overall pattern.
“Influential individual”: an observation that markedly changes the regression if removed. This is often an isolated point.

[Scatterplot: Child 19 = outlier (large residual); Child 18 = potential influential individual.]

Child 19 is an outlier of the relationship (it is unusually far from the regression line, vertically). Child 18 is isolated from the rest of the points, and might be an influential point.

SLIDE 48
[Three fits compared: all data; without Child 18 (influential); without Child 19 (outlier).]

Child 18 changes the regression line substantially when it is removed, so Child 18 is indeed an influential point. Child 19 is an outlier of the relationship, but it is not influential (the regression line changes very little upon its removal).

SLIDE 49

Definition. Given a data point (xj, yj), the residual of that point is yj − ŷj.

Note:
1. Outliers are data points with large residuals.
2. The residuals should be approximately N(0, σ).
3. The regression equation gives the smallest possible sum of squared residuals, Σ(y − ŷ)².
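As a sketch using the trees fit from the earlier slides, one can check that the residuals are just the observed values minus the fitted values:

> g.lm = lm(Girth ~ Height, data = trees)
> max(abs(residuals(g.lm) - (trees$Girth - fitted(g.lm))))   # essentially zero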

SLIDE 50

R command for finding residuals:

Example

> g.lm=lm(Girth~Height,data=trees)
> residuals(g.lm)
         1          2          3          4          5          6          7 
-3.4139043 -1.8351687 -1.1236745 -1.7253986 -3.8271227 -4.2386170  0.3090842 
         8          9         10         11         12         13         14 
-1.9926400 -3.1713756 -1.7926400 -2.7156285 -1.8483871 -1.8483871  0.2418428 
        15         16         17         18         19         20         21 
-0.9926400  0.1631072 -2.6501112 -2.5058584  1.7303485  3.6205784  0.2401187 
        22         23         24         25         26         27         28 
-0.0713756  1.7631072  3.7746014  2.7958658  2.7728773  2.7171301  3.6286244 
        29         30         31 
 3.7286244  3.7286244  4.5383945 

SLIDE 51

Definition. Given bivariate data (x1, y1), ..., (xn, yn), the residual plot is a plot of the residuals against the xj's.

If (X, Y) is bivariate normal, the residuals satisfy the Homoscedasticity Assumption:

Definition (Homoscedasticity Assumption). The assumption that the variance around the regression line is the same for all values of the predictor variable X. In other words, the pattern of the spread of the residual points around the x-axis does not change as one travels left to right on the x-axis. There should not be discernible patterns in the residual plot.
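A residual plot as defined above can be drawn directly; a minimal sketch for the trees fit (object names are ours):

> g.lm = lm(Girth ~ Height, data = trees)
> plot(trees$Height, residuals(g.lm), main = "residual plot")
> abline(h = 0, lty = 2)   # the spread about this line should look even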

SLIDE 52

R command for testing whether the linear model applies (residuals approximately N(0, σ)):

Example

> g.lm=lm(Girth~Height,data=trees)
> par(mfrow=c(2,2)) # visualize four graphs at once
> plot(g.lm)
> par(mfrow=c(1,1)) # reset the graphics defaults

[Four diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours); observations 5, 6 and 31 are flagged.]

SLIDE 53

Variation

SLIDE 54

Variation:
$$ \underbrace{y_j - \bar{y}}_{\text{total deviation}} \;=\; \underbrace{\hat{y}_j - \bar{y}}_{\text{explained deviation}} \;+\; \underbrace{y_j - \hat{y}_j}_{\text{unexplained deviation}}. $$
From here, using some math, one gets the following sums of squares (SS):
$$ \underbrace{\sum_{j=1}^{n}(y_j-\bar{y})^2}_{\text{SSTOT = total variation}} \;=\; \underbrace{\sum_{j=1}^{n}(\hat{y}_j-\bar{y})^2}_{\text{SSA = explained variation}} \;+\; \underbrace{\sum_{j=1}^{n}(y_j-\hat{y}_j)^2}_{\text{SSE = unexplained variation}}. $$
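As a sketch (object names are ours), the three sums of squares can be computed for the trees regression and the decomposition checked numerically:

> g.lm = lm(Girth ~ Height, data = trees)
> y = trees$Girth; yhat = fitted(g.lm)
> SSTOT = sum((y - mean(y))^2)
> SSA = sum((yhat - mean(y))^2)
> SSE = sum((y - yhat)^2)
> all.equal(SSTOT, SSA + SSE)   # TRUE: the decomposition holds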

SLIDE 55

Definition. The coefficient of determination is the portion of the variation in y explained by the regression equation:
$$ r^2 \;\stackrel{\text{def}}{=}\; \frac{\text{SSA}}{\text{SSTOT}} = \frac{\sum_{j=1}^{n}(\hat{y}_j-\bar{y})^2}{\sum_{j=1}^{n}(y_j-\bar{y})^2}. $$
Properties of the Coefficient of Determination:
1. r² = (r)² = (correlation coefficient)².
2. r² = proportion of variation of Y that is explained by the linear relationship between X and Y.

Example. Using R, since

> (cor(trees$Girth,trees$Height))^2
[1] 0.2696518

one concludes that approximately 27% of the variation in tree Girth is explained by tree Height and 73% by other factors.

SLIDE 56

- r = −0.3, r² = 0.09, or 9%: the regression model explains not even 10% of the variation in y.
- r = −0.7, r² = 0.49, or 49%: the regression model explains nearly half of the variation in y.
- r = −0.99, r² = 0.9801, or ~98%: the regression model explains almost all of the variation in y.

SLIDE 57

Definition. The variance of the observed yj's about the predicted ŷj's is
$$ s^2 \;\stackrel{\text{def}}{=}\; \frac{\text{SSE}}{n-2} = \frac{\sum (y_j-\hat{y}_j)^2}{n-2} = \frac{\sum y_j^{2} - b_0\sum y_j - b_1\sum x_j y_j}{n-2}, $$
which is an unbiased estimator of σ². The standard error of estimate (also called the residual standard error) is s, an estimator of σ.

Note: (b0, b1, s) is an estimator of the parameters of the simple linear regression model, (β0, β1, σ). Furthermore, b0, b1 and s² are unbiased estimators of β0, β1 and σ².
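As a sketch (object names are ours), one can recover the residual standard error that summary() reports on a later slide:

> g.lm = lm(Girth ~ Height, data = trees)
> sqrt(sum(residuals(g.lm)^2)/(nrow(trees) - 2))   # about 2.728, matching summary(g.lm)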

SLIDE 58

Definition. Let y be a future observation corresponding to x⋆. A (1 − α)100% prediction interval for y is a confidence interval that contains y (1 − α)100% of the time. A prediction interval is a confidence interval that not only has to contend with the variability of the response variable, but also with the fact that β0 and β1 can only be approximated.

Theorem ((1 − α)100% Prediction Interval for y given x = x⋆). A (1 − α)100% prediction interval for y given x = x⋆ is ŷ ± m, where ŷ = b0 + b1x⋆ and the margin of error is
$$ m = t_{\alpha/2}(n-2)\;\underbrace{s\sqrt{1 + \frac{1}{n} + \frac{(x^{\star}-\bar{x})^{2}}{\sum_{j=1}^{n}(x_j-\bar{x})^{2}}}}_{SE_{\hat{y}}}. $$

SLIDE 59

A confidence interval for y:

[Figure: regression line with confidence/prediction bands.]

SLIDE 60

R commands:

Example

> g.lm=lm(Girth~Height,data=trees)
> predict(g.lm,newdata=data.frame(Height=c(74,83,91)),interval="prediction",level=.90)
       fit       lwr      upr
1 12.73689  8.020516 17.45327
2 15.03862 10.238843 19.83839
3 17.08459 11.971691 22.19750
> summary(g.lm)

Call:
lm(formula = Girth ~ Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2386 -1.9205 -0.0714  2.7450  4.5384 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -6.18839    5.96020  -1.038  0.30772   
Height       0.25575    0.07816   3.272  0.00276 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.728 on 29 degrees of freedom
Multiple R-squared:  0.2697, Adjusted R-squared:  0.2445 
F-statistic: 10.71 on 1 and 29 DF,  p-value: 0.002758

SLIDE 61

Multivariate Regression

SLIDE 62

Given multivariate data
$$ (x_1^{(1)}, x_2^{(1)}, \dots, x_k^{(1)}, y_1),\; (x_1^{(2)}, x_2^{(2)}, \dots, x_k^{(2)}, y_2),\; \dots,\; (x_1^{(n)}, x_2^{(n)}, \dots, x_k^{(n)}, y_n), $$
where $(x_1^{(i)}, x_2^{(i)}, \dots, x_k^{(i)})$ is a predictor of the response $y_i$, one explores the following possible model.

Definition (Statistical Model of Multivariate Linear Regression). Given a k-dimensional multivariate predictor $(x_1^{(i)}, x_2^{(i)}, \dots, x_k^{(i)})$, the response $y_i$ is
$$ y_i = \beta_0 + \beta_1 x_1^{(i)} + \cdots + \beta_k x_k^{(i)} + \epsilon_i, $$
where $\beta_0 + \beta_1 x_1^{(i)} + \cdots + \beta_k x_k^{(i)}$ is the mean response. The noise terms, the εi's, are assumed to be independent of each other and to be randomly sampled from N(0, σ). The parameters of the model are β0, β1, ..., βk and σ.

SLIDE 63

Definition. Given a multivariate normal sample
$$ (x_1^{(1)}, \dots, x_k^{(1)}, y_1),\; \dots,\; (x_1^{(n)}, \dots, x_k^{(n)}, y_n), $$
the least-squares multiple regression equation, $\hat{y} = b_0 + b_1 x_1 + \cdots + b_k x_k$, is the linear equation that minimizes
$$ \sum_{j=1}^{n}(\hat{y}_j - y_j)^2, \quad\text{where}\quad \hat{y}_j \;\stackrel{\text{def}}{=}\; b_0 + b_1 x_1^{(j)} + \cdots + b_k x_k^{(j)}. $$

SLIDE 64

There must be at least k + 2 data points to obtain the estimators b0, the bj's and
$$ s^2 \;\stackrel{\text{def}}{=}\; \frac{\sum_{j=1}^{n}(y_j-\hat{y}_j)^2}{n-k-1} $$
of β0, the βj's and σ², where
- b0, the y-intercept, is the unbiased least-squares estimator of β0;
- bj, the coefficient of xj, is the unbiased least-squares estimator of βj;
- s² is an unbiased estimator of σ², and s is an estimator of σ.

Due to computational intensity, computers are used to obtain b0, the bj's and s².

SLIDE 65

Example

> g.lm=lm(mpg~disp+hp+wt+qsec, data=mtcars)
> par(mfrow=c(2,2))
> plot(g.lm)
> par(mfrow=c(1,1))

Does the linear model fit?

[Four diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]

SLIDE 66

Example (cont.)

> summary(g.lm)

Call:
lm(formula = mpg ~ disp + hp + wt + qsec, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8664 -1.5819 -0.3788  1.1712  5.6468 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 27.329638   8.639032   3.164  0.00383 **
disp         0.002666   0.010738   0.248  0.80576   
hp          -0.018666   0.015613  -1.196  0.24227   
wt          -4.609123   1.265851  -3.641  0.00113 **
qsec         0.544160   0.466493   1.166  0.25362   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.622 on 27 degrees of freedom
Multiple R-squared:  0.8351, Adjusted R-squared:  0.8107 
F-statistic: 34.19 on 4 and 27 DF,  p-value: 3.311e-10

SLIDE 67

Inflation Problem: As k increases, R² increases, but the increase in predictability is illusory. Solution: it is best to use the following.

Definition. The adjusted coefficient of determination is
$$ R^2_{\text{adj}} = 1 - \frac{n-1}{n-k-1}\,(1 - R^2). $$
The p-value of a test of H0 : β1 = · · · = βk = 0 versus HA : not H0 is associated with an F-statistic. The p-value of a test of H0 : βj = 0 versus HA : βj ≠ 0 is associated with a t-statistic.
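As a sketch, the formula can be checked against the mtcars fit of the previous slides (n = 32 cars, k = 4 predictors):

> R2 = 0.8351; n = 32; k = 4
> 1 - (n - 1)/(n - k - 1)*(1 - R2)   # 0.8106704, matching the reported 0.8107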

SLIDE 68

Factor Analysis: One strives for the best fit (largest R² and smallest p-value associated with the F-statistic) with the fewest number of independent variables. Independent variables that are "mostly independent" of the dependent variable, or highly correlated with another independent variable, can be discarded. It is an art. Doing this mechanically (on a machine) is called stepwise regression.

SLIDE 69

Logistic Regression

SLIDE 70

A variable that takes on only the values 0 and 1 is a dummy variable, e.g., gender, infected or not, etc.
- If the dummy variable is an independent variable, use the methods of this chapter.
- If the dummy variable is the dependent variable, use logistic regression.

Let Y, the dependent variable, be the dummy variable and use
$$ \tilde{Y} = \ln\!\left(\frac{p}{1-p}\right) $$
in place of Y. Here p is the probability that Y = 1.
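In R, logistic regression is fit with glm() and family = binomial, which models ln(p/(1−p)) as a linear function of the predictors. A minimal sketch on simulated data (all names and parameter values are ours):

> set.seed(2)
> x = rnorm(100)                  # a predictor
> p = 1/(1 + exp(-(-1 + 2*x)))    # true model: logit(p) = -1 + 2x
> y = rbinom(100, 1, p)           # dummy (0/1) response
> fit = glm(y ~ x, family = binomial)
> coef(fit)                       # estimates should be near -1 and 2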

SLIDE 71

Chapter #9 R Assignment

SLIDE 72

(from the book Mathematical Statistics with Applications by Mendenhall, Wackerly and Scheaffer (Fourth Edition, Duxbury, 1990)) Fifteen alligators were captured and two measurements were made on each of the alligators. The weight (in pounds) was recorded with the snout vent length (in inches; this is the distance from the back of the head to the end of the nose). The purpose of using these data is to determine whether there is a relationship, described by a simple linear regression model, between the weight and snout vent length: lnLength ~ lnWeight. The authors analyzed the data on the log scale (natural logarithms) and we will follow their approach for consistency.

> lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
+ 3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78)
> lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
+ 3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)

SLIDE 73

Chapter #9 R Assignment

1. Create a scatterplot of "lnLength" ~ "lnWeight", complete with the regression line.
2. What is the slope and y-intercept of the regression line?
3. Predict "lnLength" when "lnWeight" is five.
4. Use graphs to decide if "lnLength" ~ "lnWeight" satisfies the requirements for being a linear model.
5. Find a 95% prediction interval for "lnLength" when "lnWeight" is five.
6. What is the p-value of a test of H0 : β1 = 0 versus HA : β1 ≠ 0?
7. What is the standard error of estimate?
8. What is the coefficient of determination, R²?
9. What is the explained variation, the unexplained variation and the total variation?
10. What is the F statistic of H0 : β1 = 0 versus HA : β1 ≠ 0, and what is its degrees of freedom?
11. Using the correlation test, what is the p-value of a test of H0 : ρ = 0 versus HA : ρ ≠ 0?

SLIDE 74

First enter into R:

> class(state.x77) # "lm" needs a data.frame not a matrix
[1] "matrix"
> st = as.data.frame(state.x77) # make state.x77 a data.frame
> class(st) # "st" is a data.frame
[1] "data.frame"
> colnames(st)[4] = "Life.Exp" # no spaces in variable names
> colnames(st)[6] = "HS.Grad" # no spaces in variable names

1. Do a multivariate regression with "Life.Exp" as the response variable and "Population", "Income", "Illiteracy", "Murder", "HS.Grad", "Frost" and "Area" as explanatory variables.
   (a) Show that the multivariate regression linear model fits these data.
   (b) What is R² and the adjusted R²?
   (c) Which explanatory variables are relevant at the 0.05 significance level?
2. Do another multivariate regression, but only with explanatory variables "Murder" and "HS.Grad". What is R² and the adjusted R²?
3. Comparing the adjusted R² in the above two problems, what do you conclude?