SLIDE 1
ACMS 20340 Statistics for Life Sciences
Chapter 4: Regression
A Quick Recap of Chapter 3
Motivating question: What relationships might hold between two variables? We plot the data given by two variables on a scatterplot. Further, we can measure the direction and strength of a linear relationship with the correlation r.
SLIDE 2
SLIDE 3
The Basic Idea of Regression
In the case that one variable helps explain or predict the other variable, we can summarize the relationship between the variables by means of a regression line. We sometimes refer to the regression line as the least-squares regression line.
SLIDE 4
Why “regression”?
◮ Sir Francis Galton (1822-1911), who was the first to apply
regression to biological and psychological data, considered examples involving the relationships between the height of parents and the heights of their children.
◮ Galton found that taller-than-average parents had
taller-than-average children, but not as tall as their parents.
◮ Thus, the children “regressed towards the mean”.
SLIDE 5
Quick review of linear equations
[Figure: a straight line with y-intercept a = 1 and slope b = 2]
◮ Let x be an explanatory variable.
◮ Let y be a response variable.
◮ A straight line relating y to x has the form y = a + bx.
◮ The slope of this line is b.
◮ The y-intercept of this line is a.
SLIDE 6
An Example
SLIDE 7
An Example
How do we determine which line is the regression line? Do we just eyeball it? Or do we just pick two points and draw a line between them?
SLIDE 8
Which Line is the Regression Line?
The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
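This minimization criterion can be checked numerically. The sketch below (with made-up data and an arbitrary grid of nearby candidate lines, both purely illustrative) fits the least-squares line and confirms that perturbing its intercept or slope only increases the sum of squared vertical distances:

```python
def sse(a, b, x, y):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def least_squares(x, y):
    """Fit y-hat = a + b*x by the usual least-squares formulas; returns (a, b)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
         / sum((xi - xb) ** 2 for xi in x))
    return yb - b * xb, b

# Hypothetical data lying roughly along a line
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

a, b = least_squares(x, y)
best = sse(a, b, x, y)

# Every nearby candidate line has a strictly larger error sum
worse = min(
    sse(a + da, b + db, x, y)
    for da in (-0.5, 0.0, 0.5)
    for db in (-0.1, 0.0, 0.1)
    if (da, db) != (0.0, 0.0)
)
```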
SLIDE 9
Another Illustration
SLIDE 10
The Equation of the Regression Line 1
The Main Ingredients:
◮ x̄, the mean of x
◮ ȳ, the mean of y
◮ sx, the standard deviation of x
◮ sy, the standard deviation of y
◮ r, the correlation of x and y
SLIDE 11
The Equation of the Regression Line 2
The least-squares regression line is the line ŷ = a + bx, where
b = r(sy/sx) and a = ȳ − bx̄.
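In code, the five ingredients from the previous slide determine the line directly. A minimal sketch in Python (the data set is a made-up toy example):

```python
from statistics import mean, stdev

def regression_line(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b*x,
    built from the means, standard deviations, and correlation."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    n = len(x)
    # sample correlation: average product of standardized values
    r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y
            for xi, yi in zip(x, y)) / (n - 1)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

x = [1, 2, 3, 4, 5]              # hypothetical explanatory variable
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical response
a, b = regression_line(x, y)
```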
SLIDE 12
Regression as Prediction
The regression line is given in terms of the variable ŷ to emphasize the fact that the line gives the predicted response ŷ for any x. Thus the distinction between explanatory and response variables matters for regression (unlike for correlation).
SLIDE 13
When Should We Use Regression?
Only compute the regression line once you’ve confirmed that the relationship between x and y is linear.
Plot the data first! Each of the four data sets shown here yields the same regression line, ŷ = 3 + 0.5x. But we should first plot each data set.
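These four data sets appear to be Anscombe's famous quartet; if so, a quick computation with its standard published values confirms that all four give (essentially) the same fitted line:

```python
def least_squares(x, y):
    """Fit y-hat = a + b*x; returns (a, b)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
         / sum((xi - xb) ** 2 for xi in x))
    return yb - b * xb, b

# Anscombe's quartet (standard published values)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ys = [
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
]
fits = [least_squares(x123, ys[0]), least_squares(x123, ys[1]),
        least_squares(x123, ys[2]), least_squares(x4, ys[3])]
# every fit is approximately (3.0, 0.5), i.e. y-hat = 3 + 0.5x
```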
SLIDE 14
Plotting the Data 1
ŷ = 3 + 0.5x
Moderate linear association; regression OK.
ŷ = 3 + 0.5x
Obvious nonlinear relationship; regression inappropriate.
SLIDE 15
Plotting the Data 2
ŷ = 3 + 0.5x
One extreme outlier, requiring further examination.
ŷ = 3 + 0.5x
Only two distinct values of x; a redesign is due here...
SLIDE 16
Linear vs. Non-linear Plots 1
Below is a scatterplot of brain weight in grams against body weight in kilograms, with the regression line included. This regression line is not helpful for predictions.
SLIDE 17
Linear vs. Non-linear Plots 2
Replacing the values of brain weight with the logarithm of the brain weight, and the values of body weight with the logarithm of the body weight, allows us to use the regression line to make reasonable predictions.
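A sketch of the idea, using hypothetical data generated exactly from a power law (real brain/body data are noisy, so this is only an illustration): fitting a line to the logarithms recovers the power-law exponent as the slope.

```python
from math import log

def least_squares(x, y):
    """Fit y-hat = a + b*x; returns (a, b)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
         / sum((xi - xb) ** 2 for xi in x))
    return yb - b * xb, b

# Hypothetical allometric relationship: brain = 10 * body**0.75
body = [0.1, 1.0, 10.0, 100.0, 1000.0]   # body weight (kg)
brain = [10 * w ** 0.75 for w in body]   # brain weight (g)

# On the log-log scale the relationship is exactly linear:
# log(brain) = log(10) + 0.75 * log(body)
a, b = least_squares([log(w) for w in body], [log(w) for w in brain])
```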
SLIDE 18
The Slope of the Regression Line
The slope of the regression line, b = r(sy/sx), has the following properties:
◮ A change of one standard deviation in x corresponds to a change of r standard deviations in y.
◮ As r decreases, changes in x have less of an effect on ŷ.
◮ With perfect correlation (r = 1 or r = −1), the change in ŷ (in standard deviations) is the same as the change in x.
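The first property follows directly from b = r(sy/sx): moving one standard deviation along x changes the prediction by exactly r standard deviations of y. A small check with made-up data:

```python
from statistics import mean, stdev

x = [2, 4, 6, 8, 10]            # hypothetical data
y = [1.0, 2.4, 2.9, 4.2, 4.5]

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)
r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y
        for xi, yi in zip(x, y)) / (len(x) - 1)
b = r * s_y / s_x

# prediction at x_bar + s_x minus prediction at x_bar ...
change_in_yhat = b * (x_bar + s_x) - b * x_bar
# ... equals r standard deviations of y
```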
SLIDE 19
A Certain Point Always on the Regression Line
The least-squares regression line always passes through the point (x̄, ȳ). Can you see why? This is a useful fact for plotting the regression line.
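The reason is one line of algebra: since a = ȳ − bx̄, plugging x = x̄ into ŷ = a + bx gives ŷ = (ȳ − bx̄) + bx̄ = ȳ. A quick numerical check (the data are made up):

```python
from statistics import mean, stdev

x = [1, 3, 5, 7, 9]             # hypothetical data
y = [2.2, 2.9, 4.1, 4.8, 6.0]

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)
r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y
        for xi, yi in zip(x, y)) / (len(x) - 1)
b = r * s_y / s_x
a = y_bar - b * x_bar

# the prediction at x = x_bar is (up to rounding) exactly y_bar
prediction_at_mean = a + b * x_bar
```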
SLIDE 20
The Value r²
The square of the correlation, r², called the coefficient of determination, tells us how much of the variation in the values of y can be explained by the least-squares regression of y on x. That is, as x changes, y tends to vary with it: "x pulls y along the regression line". The remaining variation shows up as values of y lying above or below the regression line.
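Equivalently, r² is the fraction of the total variation in y captured by the fitted line: r² = 1 − (sum of squared residuals)/(total sum of squares). A sketch with made-up data, checking that the two computations agree:

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]   # hypothetical data

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)
r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y
        for xi, yi in zip(x, y)) / (len(x) - 1)
b = r * s_y / s_x
a = y_bar - b * x_bar

ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ss_total = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_residual / ss_total   # equals r**2 for least squares
```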
SLIDE 21
r² Example 1
◮ r = −0.3
◮ r² = 0.09
◮ The regression model doesn't explain even 10% of the variation in y.
SLIDE 22
r² Example 2
◮ r = −0.7
◮ r² = 0.49
◮ The regression model explains almost half of the variation in y.
SLIDE 23
r² Example 3
◮ r = −0.99
◮ r² = 0.9801
◮ The regression model explains almost all of the variation in y.
SLIDE 24
Residuals
Once we fit the regression line to the scatterplot, in most cases not all of the data points will lie on the line. The vertical distances from these points to the line are as small as possible, and they represent the "left-over" variation in the response variable. We therefore call these vertical distances residuals.
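Residuals are easy to compute, and for a least-squares fit they always sum to (essentially) zero: the positive and negative left-overs balance. A sketch with made-up data:

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]   # hypothetical data

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)
r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y
        for xi, yi in zip(x, y)) / (len(x) - 1)
b = r * s_y / s_x
a = y_bar - b * x_bar

# residual = observed y minus predicted y-hat
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
```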
SLIDE 25
Influential Observations 1
Large residuals can affect the regression line, but other unusual observations can also have an effect on the regression line. Consider, for example, Subject 16.
SLIDE 26
Influential Observations 2
The data above come from a study in which brain activity was measured as each subject watched his or her partner in pain, while the empathy score was determined by a test of empathy.
SLIDE 27
Influential Observations 3
Subject 16 is of interest because he/she scored about 40 points higher than everyone else, which has a strong effect on the correlation.
SLIDE 28
Influential Observations 4
If we remove Subject 16, the regression line doesn't change much. However, the correlation drops from r = 0.515 to r = 0.331. Thus we say that Subject 16 is influential for correlation.
SLIDE 29
Influential Observations 5
An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. If a statistical calculation depends on one or more influential observations, it may be of little practical value.
SLIDE 30
Influential Observations 6
Observe that if we change the value of the brain activity of Subject 16, the regression line changes drastically. Subject 16, in this new position, is now influential for regression.
SLIDE 31
In General...
Points in a scatterplot that are outliers in the x or y direction tend to be influential for correlation. Points that are outliers in the x direction are often influential for the regression line.
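This can be seen numerically: take points lying exactly on a line and add a single observation that is an outlier in the x direction but sits well off the line; the fitted slope changes drastically. (The data below are entirely made up for illustration.)

```python
def least_squares(x, y):
    """Fit y-hat = a + b*x; returns (a, b)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
         / sum((xi - xb) ** 2 for xi in x))
    return yb - b * xb, b

# Ten points lying exactly on y = 2 + 0.5x
x = list(range(1, 11))
y = [2 + 0.5 * xi for xi in x]

a0, b0 = least_squares(x, y)                  # slope 0.5

# One influential observation: far out in x, well below the line
a1, b1 = least_squares(x + [30], y + [0.0])
# the slope is dragged from 0.5 down to a negative value
```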
SLIDE 32
Some Reasons for Caution
Beware of the following potential pitfalls:
Extrapolation is the use of a regression line for prediction well outside the range of values of the explanatory variable x used to obtain the line.
A lurking variable is a variable, not among the explanatory or response variables in a study, that may influence the interpretation of the relationship between those variables.
SLIDE 33
Extrapolation
MPG of a vehicle versus the weight of the vehicle in hundreds of pounds. A U-Haul truck weighs 8,120 pounds. What is its MPG?
SLIDE 34
What’s the Lurking Variable?
◮ There is a strong positive association between shoe size and reading skills in young children.
◮ There is a strong positive association between the number of firefighters at a fire site and the amount of damage the fire does.
◮ There is a strong negative association between moderate amounts of wine-drinking and death rates from heart disease in developed countries.
SLIDE 35
Association Does Not Imply Causation 1
We are often interested in whether changes in the explanatory variable cause changes in the response variable. Sometimes an association really does reflect cause and effect. However, a strong association is not sufficient to draw conclusions about causation. For instance, there are nonsense correlations:
◮ With the decrease in the number of pirates, there has been an increase in global warming over the same period.
◮ Owning more cars will increase your lifespan.
Lurking variables are often to blame.
SLIDE 36
Association Does Not Imply Causation 2
Direct causation may not be the whole story.
◮ Obese parents tend to have obese children.
◮ Body type is partly determined genetically, so there is potential direct causation.
◮ However, bad habits may also be picked up: little exercise, poor eating habits, etc.
◮ Both could contribute to the correlation.
SLIDE 37
From Association to Causation 1
The best way to find evidence that x causes y is to perform an experiment.
◮ We choose the changes in x.
◮ We make every effort to keep lurking variables in check.
What if we can't perform a controlled experiment?
◮ Does using cell phones cause brain tumors?
◮ Are our levels of CO2 emissions causing global warming?
SLIDE 38
From Association to Causation 2
Criteria for establishing causation when an experiment cannot be performed:
1. The association is strong.
2. The association is consistent.
3. Larger changes in x are associated with stronger responses.
4. The alleged cause precedes the effect in time.
5. The alleged cause is plausible.