1

Review of Linear Regression

STAT E-150 Statistical Methods

2

Regression analysis is used to investigate whether there is a linear relationship between two quantitative variables. The variable we want to predict is the response variable; the variable we use for this prediction is the explanatory variable.

3

If a linear relationship exists, we can create a model for the relationship, and use this model to answer these questions:

  • What is the relationship between the variables?
  • What does the slope of this linear model tell us?
  • When is it appropriate to use this linear model to make predictions?

4

A First-Order Linear Model is of the form

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where
  • $y$ = the response variable
  • $x$ = the independent, or predictor, or explanatory variable
  • $\varepsilon$ = the random error
  • $\beta_0$ = where the regression line crosses the y-axis; the y-intercept of the regression line is the point $(0, \beta_0)$
  • $\beta_1$ = the slope of the regression line $= \dfrac{\text{change in } y}{\text{change in } x}$ = change in y for every unit increase in x

5 6

Steps in regression

  • 1. Hypothesize the form of the model for E(y), the mean or expected value of y
  • 2. Collect the sample data
  • 3. Use the sample data to estimate the unknown parameters in the model.
  • 4. Specify the probability distribution of $\varepsilon$ and estimate any unknown parameters in the distribution. Check the validity of the assumptions made about the probability distribution.
  • 5. Statistically check the usefulness of the model
  • 6. If the model is useful, use the model for appropriate prediction and estimation


7

Example: Medical researchers have noted that adolescent females are more likely to deliver low-birthweight babies than are adult females. Because LBW babies tend to have higher mortality rates, studies have been conducted to examine the relationship between birthweight and the mother’s age. One such study is discussed in the article “Body Size and Intelligence in 6-Year-Olds: Are Offspring of Teenage Mothers at Risk?” (Maternal and Child Health Journal [2009], pp. 847-856.)

Which is the response variable? The child’s birthweight (in grams)

Which is the independent (or predictor or explanatory) variable? The mother’s age

9

The following data is consistent with summary values given in the article, and with data published by the National Center for Health Statistics:

| Observation | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Maternal Age (in years) | 15 | 17 | 18 | 15 | 16 | 19 | 17 | 16 | 18 | 19 |
| Birthweight (in grams) | 2289 | 3393 | 3271 | 2648 | 2897 | 3327 | 2970 | 2535 | 3138 | 3573 |

Since there is only one independent variable, our model is of the form $E(y) = \beta_0 + \beta_1 x$

10

The first step in determining whether there is a linear relationship between the variables is to create a scatterplot of the data, with the explanatory variable on the x-axis and the response variable on the y-axis.

11


Does there appear to be a linear relationship? The scatter diagram shows a positive linear relationship

What does the scatterplot tell you about the strength and direction of the linear relationship? Write your answer in the context of the scenario.

The scatter diagram shows that there is a fairly strong positive linear relationship between the two variables: as the mother’s age increases, the child’s birthweight also increases. That is, higher birthweights are associated with older mothers.

15

The Method of Least Squares

If the data appears to show a linear relationship, the method of least squares finds the line that best fits the data. We can find the vertical distance between the observed value of y and the predicted value of y for each value of x. This difference is called the residual:

Residual = observed value - predicted value

$$\varepsilon = y - \hat{y}$$

16

We want the size of the residuals to be as small as possible; since some residuals are positive and some are negative, we square the residuals and minimize the squares. The Least Squares line is the one where $\sum (y - \hat{y}) = 0$ and $\sum (y - \hat{y})^2$ is minimized. The equation of the least squares line is $\hat{y} = \hat{\beta}_1 x + \hat{\beta}_0$
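As an illustration of the least squares idea, here is a minimal Python sketch that fits the line to the ten observations from the birthweight example. The closed-form expressions for the slope and intercept are the standard textbook ones and are not stated on the slides; the function name `least_squares` is made up for this example.

```python
# Data from the maternal age / birthweight example in this handout.
age = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]                          # x, years
weight = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]   # y, grams


def least_squares(x, y):
    """Return (b0, b1) for the line y-hat = b1*x + b0 with the smallest sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Standard closed-form solution (an assumption of this sketch, not shown on the slides).
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(age, weight)

# Residual = observed value - predicted value, one per observation.
residuals = [yi - (b1 * xi + b0) for xi, yi in zip(age, weight)]

print(f"slope b1 = {b1:.2f}, intercept b0 = {b0:.2f}")
print(f"sum of residuals = {sum(residuals):.6f}")
```

Up to floating-point rounding, the residuals of the fitted line should sum to zero, which is the first of the two properties stated above.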

17

The idealized regression line is $E(y) = \beta_1 x + \beta_0$; this model places the mean of the distribution of y for each value of x on the line:

18

But since not all values of y will be on the line, for each data point (x, y) there is an error, $\varepsilon$, where $\varepsilon = y - \hat{y}$. So we now have the equation $y = \beta_1 x + \beta_0 + \varepsilon$


19

We will make these assumptions about the probability distribution for the error, $\varepsilon$:

  • The probability distribution of $\varepsilon$ has a mean of 0
  • The probability distribution of $\varepsilon$ has a constant variance for all values of x
  • The probability distribution of $\varepsilon$ is approximately normal
  • The errors associated with any two different observations are independent.

20

The Standard Error for the Slope

$SE(b_1)$ indicates how much the slope varies from sample to sample.

$$SE(b_1) = \frac{s_e}{\sqrt{n-1}\; s_x}$$

$SE(b_1)$ will be smaller when

  • $s_e$, the standard deviation of the residuals, is smaller, indicating less scatter and a stronger relationship between x and y
  • $n$ is larger
  • $s_x$ is larger, indicating a more stable regression with a broader range of x-values
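The following is a small Python sketch of how the standard error of the slope could be computed for the birthweight data. It assumes the fitted coefficients reported in the SPSS output elsewhere in this handout (intercept -1163.45, slope 245.15) and takes $s_e$ to be the residual standard deviation with $n - 2$ in the denominator, as on the sampling-distribution slide.

```python
import math
import statistics

age = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]                          # x, years
weight = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]   # y, grams

# Fitted coefficients, taken from the SPSS output in this handout.
b0, b1 = -1163.45, 245.15

n = len(age)

# s_e: standard deviation of the residuals, with n - 2 degrees of freedom.
sse = sum((y - (b1 * x + b0)) ** 2 for x, y in zip(age, weight))
s_e = math.sqrt(sse / (n - 2))

# s_x: sample standard deviation of the x-values (statistics.stdev divides by n - 1).
s_x = statistics.stdev(age)

se_b1 = s_e / (math.sqrt(n - 1) * s_x)

print(f"s_e = {s_e:.3f}, s_x = {s_x:.4f}, SE(b1) = {se_b1:.3f}")
```

The values can be compared with the "Std. Error of the Estimate" and the slope's "Std. Error" in the SPSS output.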

21

The Sampling Distribution for Regression Slopes

When the assumptions about the error, $\varepsilon$, are met, the standardized estimated regression slope,

$$t = \frac{b_1 - \beta_1}{SE(b_1)},$$

follows a Student’s t-model with n-2 degrees of freedom. We estimate the standard error with

$$SE(b_1) = \frac{s_e}{\sqrt{n-1}\; s_x}, \qquad s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}},$$

where n is the number of data values, and $s_x$ is the standard deviation of the x-values.
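To make the standardized slope concrete, the sketch below computes it for the birthweight example under the hypothesized value $\beta_1 = 0$, using the slope and standard error reported in the SPSS output in this handout. The two-sided p-value is obtained by numerically integrating the Student's t density with Simpson's rule; that numerical approach and the helper names `t_pdf` and `two_sided_p` are choices made for this illustration, not part of the course materials.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=10_000):
    """Two-sided p-value, using Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3        # P(0 < T < |t|)
    return 1.0 - 2.0 * central_area     # area in both tails


# Slope, its standard error, and sample size from the SPSS output in this handout.
b1, se_b1, n = 245.150, 45.908, 10
beta1_null = 0.0                        # hypothesized slope

t_stat = (b1 - beta1_null) / se_b1
p_value = two_sided_p(t_stat, df=n - 2)

print(f"t = {t_stat:.3f} with {n - 2} degrees of freedom, p = {p_value:.4f}")
```

SPSS reports the p-value rounded to three decimals in the "Sig." column, so small differences in later digits are expected.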

22

Inferences about the slope $\beta_1$

To see if there is an association between x and y, we test the hypotheses

$H_0: \beta_1 = 0$
$H_a: \beta_1 \neq 0$

using the test statistic

$$t = \frac{b_1 - \beta_1}{SE(b_1)} = \frac{b_1 - 0}{SE(b_1)} = \frac{b_1}{SE(b_1)}$$

with n-2 degrees of freedom.

Assumptions for the model and the errors:

  • 1. Linearity Assumption
    Straight Enough Condition: does the scatterplot appear linear? Check the residuals to see if they appear to be randomly scattered.
    Quantitative Data Condition: Is the data quantitative?

23

  • 2. Independence Assumption: the errors must be mutually independent
    Randomization Condition: the individuals are a random sample. Check the residuals for patterns, trends, clumping.

  • 3. Equal Variance Assumption: the variability of y should be about the same for all values of x
    Does The Plot Thicken? Condition: Does the scatterplot show a constant spread about the line? Check the residuals for any patterns.

24

  • 4. Normal Population Assumption: the errors follow a Normal model at each value of x
    Nearly Normal Condition: Look at a histogram or NPP of the residuals

Use the SPSS output provided to answer the questions below:

Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .884a | .781 | .754 | 205.308 |

  a. Predictors: (Constant), Age

Coefficients(a)

| Model | Unstandardized Coefficients: B | Unstandardized Coefficients: Std. Error | Standardized Coefficients: Beta | t | Sig. |
|---|---|---|---|---|---|
| 1 (Constant) | -1163.450 | 783.138 | | -1.486 | .176 |
| Age | 245.150 | 45.908 | .884 | 5.340 | .001 |

  a. Dependent Variable: Birthweight

1.

  • a. What is the value of the correlation coefficient, r? r = .884
  • b. What does this tell you about the strength and direction of the linear relationship? The linear relationship is strong and positive.

Use the SPSS output provided to answer the questions below:

2. In the hypothesis test of $H_0: \beta_1 = 0$, $H_a: \beta_1 \neq 0$

  • a. What is the p-value? .001
  • b. What is your decision? Reject $H_0$ or not? Since p is small (.001), we will reject $H_0$.
  • c. What can you conclude? We can conclude that there is a linear relationship between the mother’s age and the baby’s birthweight.

Use the SPSS output provided to answer the questions below:

3.

  • a. What is the value of $\hat{\beta}_0$? $\hat{\beta}_0 = -1163.45$
  • b. Interpret this value in context. If the mother’s age were 0, the model would predict a birthweight of -1163.45 grams. Clearly, this prediction makes no sense in this application.
  • c. What is the value of $\hat{\beta}_1$? $\hat{\beta}_1 = 245.15$
  • d. Interpret this value in context. The predicted birthweight increases by 245.15 g for each additional year in the age of the mother.
  • e. What is the equation of the regression line? With $\hat{\beta}_0 = -1163.45$ and $\hat{\beta}_1 = 245.15$, the line is $\hat{y} = 245.15x - 1163.45$

3. f. What birthweight would you expect for the baby of a mother who is 16 years old?

The equation of the regression line is $\hat{y} = 245.15x - 1163.45$

If x = 16, $\hat{y} = 245.15(16) - 1163.45 = 3922.4 - 1163.45 = 2758.95$

We would expect the baby of a mother who is 16 years old to weigh 2758.95 grams.

3. g. What birthweight would you expect for the baby of a mother who is 22 years old?

This prediction cannot be made; the value of 22 is outside the range of the sample data.
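The two prediction questions can be sketched in Python as a single function that applies the fitted line only inside the range of sampled ages. The function name `predict_birthweight` and the choice to raise an error for out-of-range ages are assumptions of this example; the coefficients and the 15 to 19 year age range come from the handout.

```python
# Fitted coefficients as reported in the handout's SPSS output.
B0 = -1163.45   # intercept, grams
B1 = 245.15     # slope, grams per additional year of maternal age

# Smallest and largest maternal ages in the sample data.
MIN_AGE, MAX_AGE = 15, 19


def predict_birthweight(age_years):
    """Predicted birthweight in grams; refuses to extrapolate beyond the sample."""
    if not MIN_AGE <= age_years <= MAX_AGE:
        raise ValueError(
            f"age {age_years} is outside the sampled range {MIN_AGE}-{MAX_AGE}"
        )
    return B1 * age_years + B0


print(f"predicted birthweight at age 16: {predict_birthweight(16):.2f} g")

try:
    predict_birthweight(22)
except ValueError as err:
    print("no prediction:", err)
```

Raising an error is only one way to signal extrapolation; returning the value together with a warning would be another reasonable design.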

46 47

3. h. Here is the scatterplot with the regression line we found. Does it appear to be a good fit? Why?

The line appears to be a good fit: there appears to be a constant scatter about the line.

More about Residuals

1.

  • a. What birthweight would you expect for the baby of a mother who is 16 years old? We would expect the baby of a mother who is 16 years old to weigh 2758.95 grams.
  • b. What was the birthweight for the baby of a mother who was 16 years old? One of the values was 2897 grams.
  • c. What is the residual? 2897 - 2758.95 = 138.05 grams
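As a rough sketch, the residual for every observation can be computed the same way: observed birthweight minus the birthweight predicted by the fitted line. The code assumes the coefficients from the SPSS output in this handout. Note that the sample contains two mothers aged 16 (observations 5 and 8), so there are two residuals at that age.

```python
age = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]                          # x, years
weight = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]   # y, grams

# Fitted coefficients as reported in the handout's SPSS output.
B0, B1 = -1163.45, 245.15

residuals = []
for obs, (x, y) in enumerate(zip(age, weight), start=1):
    predicted = B1 * x + B0
    residual = y - predicted            # observed value - predicted value
    residuals.append(residual)
    print(f"obs {obs:2d}: age {x}, observed {y}, predicted {predicted:.2f}, residual {residual:+.2f}")
```

A positive residual means the observed birthweight lies above the fitted line; a negative residual means it lies below.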

53

Recall that we want the probability distribution of the errors, or residuals, to be approximately normal and have a constant variance for all values of x.

Here are examples of residual plots:

  • This residual plot shows no systematic pattern; it shows a uniform scatter of the points about the fitted line, and indicates that the regression line fits the data well.
  • A curved pattern shows that the data is not linear, so a straight line is not a good fit for the data.
  • This residual plot shows that there is more spread for larger values of the explanatory variable, indicating that predictions will be less accurate when x is large.

56

2. Here is a Normal Probability Plot (NPP) of the standardized residuals. What does it tell you about the probability distribution of the errors?

This NPP tells you that the probability distribution of the errors is approximately Normal.

58

The Coefficient of Determination

$R^2$, the coefficient of determination, tells you the proportion of the variation in y that can be attributed to the linear relationship between x and y. It is the fraction of the data’s variance accounted for by the model.

  • a. What is the value of r? We found that r = .884
  • b. What is the value of $R^2$? $R^2$ = .781
  • c. What does this value tell you? Write your answer in context. This tells us that 78.1% of the variance in the birthweights is accounted for by the relationship between the birthweight and the mother’s age.
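One way to see where these numbers come from is to compute $R^2$ directly from the data. The sketch below uses the standard identity $R^2 = 1 - SSE/SST$, which is not written out on the slides and is an assumption of this example, together with the fitted coefficients from the SPSS output. For simple linear regression, r has the same sign as the slope.

```python
import math

age = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]                          # x, years
weight = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]   # y, grams

# Fitted coefficients as reported in the handout's SPSS output.
B0, B1 = -1163.45, 245.15

y_bar = sum(weight) / len(weight)

# SSE: variation left over after fitting the line (sum of squared residuals).
sse = sum((y - (B1 * x + B0)) ** 2 for x, y in zip(age, weight))
# SST: total variation of the birthweights about their mean.
sst = sum((y - y_bar) ** 2 for y in weight)

r_squared = 1 - sse / sst
r = math.copysign(math.sqrt(r_squared), B1)   # r takes the sign of the slope

print(f"R^2 = {r_squared:.3f}, r = {r:.3f}")
```

The results can be checked against the "R" and "R Square" columns of the Model Summary table.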