ACMS 20340 Statistics for Life Sciences Chapter 4: Regression A - - PowerPoint PPT Presentation



slide-1
SLIDE 1

ACMS 20340 Statistics for Life Sciences

Chapter 4: Regression

slide-2
SLIDE 2

A Quick Recap of Chapter 3

Motivating question: What relationships might hold between two variables? We plot the data given by two variables on a scatterplot. Further, we can measure direction and strength of the linear relationship between two variables via the correlation.

slide-3
SLIDE 3

The Basic Idea of Regression

In the case that one variable helps explain or predict the other variable, we can summarize the relationship between the variables by means of a regression line. We sometimes refer to the regression line as the least-squares regression line.

slide-4
SLIDE 4

Why “regression”?

◮ Sir Francis Galton (1822-1911), who was the first to apply regression to biological and psychological data, considered examples involving the relationships between the height of parents and the heights of their children.
◮ Galton found that taller-than-average parents had taller-than-average children, but not as tall as their parents.
◮ Thus, the children “regressed towards the mean”.

slide-5
SLIDE 5

Quick review of linear equations

[Figure: a straight line plotted on x and y axes, with a = 1 and b = 2.]

◮ Let x be an explanatory variable.
◮ Let y be a response variable.
◮ A straight line relating y to x has the form y = a + bx.
◮ The slope of this line is b.
◮ The y-intercept of this line is a.

slide-6
SLIDE 6

An Example

slide-7
SLIDE 7

An Example

How do we determine which line is the regression line? Do we just eyeball it? Or do we just pick two points and draw a line between them?

slide-8
SLIDE 8

Which Line is the Regression Line?

The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
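As a rough illustration of this criterion, the sketch below compares the sum of squared vertical distances for the least-squares line against a line drawn through two of the points. The data values and the helper name `sum_squared_vertical` are invented for illustration and do not come from the slides.

```python
# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def sum_squared_vertical(a, b, xs, ys):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Least-squares slope and intercept from the standard closed-form solution.
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b_ls = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs))
a_ls = y_bar - b_ls * x_bar
best = sum_squared_vertical(a_ls, b_ls, xs, ys)

# A competing line: "pick two points and draw a line between them".
b_two = (ys[-1] - ys[0]) / (xs[-1] - xs[0])
a_two = ys[0] - b_two * xs[0]
two_point = sum_squared_vertical(a_two, b_two, xs, ys)

print(f"least-squares line: sum of squares = {best:.4f}")
print(f"two-point line:     sum of squares = {two_point:.4f}")
```

Any other choice of intercept and slope gives a sum of squares at least as large as the least-squares line's, which is why eyeballing or joining two points is not needed.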

slide-9
SLIDE 9

Another Illustration

slide-10
SLIDE 10

The Equation of the Regression Line 1

The Main Ingredients:

◮ x̄, the mean of x
◮ ȳ, the mean of y
◮ s_x, the standard deviation of x
◮ s_y, the standard deviation of y
◮ r, the correlation of x and y

slide-11
SLIDE 11

The Equation of the Regression Line 2

The least-squares regression line is the line ŷ = a + bx, where b = r(s_y/s_x) and a = ȳ − bx̄.
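The formulas for the slope and intercept can be sketched in a few lines of Python using only the standard library. The data values and the function names `correlation` and `regression_line` are assumptions made for this example, not part of the course material.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r: average product of standardized values (n - 1 divisor)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


def regression_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    r = correlation(xs, ys)
    b = r * stdev(ys) / stdev(xs)   # slope: b = r * s_y / s_x
    a = mean(ys) - b * mean(xs)     # intercept: a = y_bar - b * x_bar
    return a, b


# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = regression_line(xs, ys)
print(f"y_hat = {a:.2f} + {b:.2f}x")
```

Only the five ingredients listed earlier (the two means, the two standard deviations, and r) are needed; the individual data points enter only through those summaries.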

slide-12
SLIDE 12

Regression as Prediction

The regression line is given in terms of the variable ŷ to emphasize the fact that the line gives the predicted response ŷ for any x. Thus the distinction between response and explanatory variables matters with regression (unlike correlation).
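A small sketch can make the asymmetry concrete: swapping the roles of the two variables leaves the correlation unchanged but produces a different regression line. The data and helper names below are hypothetical.

```python
import math
from statistics import mean


def correlation(us, vs):
    """Sample correlation between two equal-length lists."""
    u_bar, v_bar = mean(us), mean(vs)
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    s_vv = sum((v - v_bar) ** 2 for v in vs)
    return s_uv / math.sqrt(s_uu * s_vv)


def least_squares(explanatory, response):
    """Return (intercept, slope) for the regression of response on explanatory."""
    e_bar, r_bar = mean(explanatory), mean(response)
    slope = (sum((e - e_bar) * (r - r_bar) for e, r in zip(explanatory, response))
             / sum((e - e_bar) ** 2 for e in explanatory))
    return r_bar - slope * e_bar, slope


# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a_yx, b_yx = least_squares(xs, ys)   # regression of y on x
a_xy, b_xy = least_squares(ys, xs)   # regression of x on y

print("r(x, y) =", round(correlation(xs, ys), 4))
print("r(y, x) =", round(correlation(ys, xs), 4))
print(f"y on x: y_hat = {a_yx:.2f} + {b_yx:.2f}x")
print(f"x on y: x_hat = {a_xy:.2f} + {b_xy:.2f}y")
```

If the two lines were the same line, the slope of x on y would be 1/b_yx; here it is not, because each regression minimizes distances in the direction of its own response variable.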

slide-13
SLIDE 13

When Should We Use Regression?

Only compute the regression line once you’ve confirmed that the relationship between x and y is linear.

[Figure: scatterplots of four data sets, each giving the same regression line. Plot the data first!]

Each of the above four data sets yields the regression line ŷ = 3 + 0.5x. But we should first plot each data set.
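The slides do not name the four data sets, but their description matches the well-known Anscombe quartet. Assuming that is the source, the sketch below fits each set and shows that the fitted lines are nearly identical even though the scatterplots look very different. The helper name `least_squares` is hypothetical.

```python
from statistics import mean


def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b


# Anscombe's quartet (assumed to be the data behind the slide).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {name: least_squares(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (a, b) in fits.items():
    print(f"data set {name}: y_hat = {a:.2f} + {b:.2f}x")
```

The numerical summary cannot distinguish the four cases; only a plot reveals the curve, the outlier, and the degenerate design.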
slide-14
SLIDE 14

Plotting the Data 1

ŷ = 3 + 0.5x

Moderate linear association; regression OK.

ŷ = 3 + 0.5x

Obvious nonlinear relationship; regression inappropriate.

slide-15
SLIDE 15

Plotting the Data 2

ŷ = 3 + 0.5x

One extreme outlier, requiring further examination.

ŷ = 3 + 0.5x

Only two values for x; a redesign is due here.

slide-16
SLIDE 16

Linear vs. Non-linear Plots 1

Below is a scatterplot of brain weight in grams against body weight in kilograms, with the regression line included. This regression line is not helpful for predictions.

slide-17
SLIDE 17

Linear vs. Non-linear Plots 2

Replacing the values of brain weight with the logarithm of the brain weight and the values of the body weight with the logarithm of the body weight allows us to use the line of regression to make reasonable predictions.
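The effect of the log transformation can be sketched with synthetic data. The values below follow an exact power law chosen purely for illustration (they are not the brain and body weights from the slide), so the log-log relationship is perfectly linear while the raw relationship is not.

```python
import math
from statistics import mean


def correlation(us, vs):
    """Sample correlation between two equal-length lists."""
    u_bar, v_bar = mean(us), mean(vs)
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    s_vv = sum((v - v_bar) ** 2 for v in vs)
    return s_uv / math.sqrt(s_uu * s_vv)


# Synthetic, hypothetical values: brain = 10 * body^0.75 (an assumed power law).
body_kg = [0.1, 1.0, 10.0, 100.0, 1000.0, 5000.0]
brain_g = [10.0 * w ** 0.75 for w in body_kg]

log_body = [math.log10(w) for w in body_kg]
log_brain = [math.log10(w) for w in brain_g]

r_raw = correlation(body_kg, brain_g)
r_log = correlation(log_body, log_brain)

print(f"r on raw scale: {r_raw:.4f}")
print(f"r on log scale: {r_log:.4f}")
```

Real data would scatter around the line on the log scale rather than fall exactly on it, but the same idea applies: the transformation straightens the pattern so that a regression line becomes a reasonable summary.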

slide-18
SLIDE 18

The Slope of the Regression Line

The slope of the regression line, b = r(s_y/s_x), has the following properties:

◮ A change of one standard deviation in x corresponds to a change of r standard deviations in ŷ.
◮ As r moves closer to 0, changes in x have less of an effect on ŷ.
◮ With perfect correlation (r = 1 or r = −1), the change in ŷ, measured in standard deviations, is the same size as the change in x.
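The first property can be checked numerically. In this sketch, with made-up data, the prediction is evaluated at x̄ and at x̄ plus one standard deviation of x, and the change in ŷ is expressed in standard deviations of y.

```python
from statistics import mean, stdev

# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

b = r * s_y / s_x          # slope
a = y_bar - b * x_bar      # intercept


def predict(x):
    """Predicted response y_hat for a given x."""
    return a + b * x


# Move x up by one standard deviation and measure the change in y_hat.
change_in_y_hat = predict(x_bar + s_x) - predict(x_bar)
change_in_sd_units = change_in_y_hat / s_y

print(f"r = {r:.4f}")
print(f"change in y_hat, in standard deviations of y = {change_in_sd_units:.4f}")
```

The two printed numbers agree, since b times s_x equals r times s_y by the definition of the slope.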

slide-19
SLIDE 19

A Certain Point Always on the Regression Line

The least-squares regression line always passes through the point (x̄, ȳ). Can you see why? This is a useful fact for plotting the regression line.

slide-20
SLIDE 20

The Value r²

The square of the correlation, r², called the coefficient of determination, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x. That is, as x changes, the predicted response moves with it: “x pulls y along the regression line”. The remaining variation is due to values of y lying above or below the regression line.
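One way to see what "fraction of variation explained" means is to split the total variation in y into the part left over around the line and the rest. The sketch below, on hypothetical data, checks that one minus the left-over fraction equals the square of the correlation.

```python
import math
from statistics import mean

# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

r = s_xy / math.sqrt(s_xx * s_yy)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Total variation in y, and the variation left over around the regression line.
total_variation = s_yy
leftover_variation = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
explained_fraction = 1 - leftover_variation / total_variation

print(f"r squared          = {r ** 2:.4f}")
print(f"explained fraction = {explained_fraction:.4f}")
```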

slide-21
SLIDE 21

r² Example 1

◮ r = −0.3
◮ r² = 0.09
◮ The regression model doesn't explain even 10% of the variations in y.

slide-22
SLIDE 22

r² Example 2

◮ r = −0.7
◮ r² = 0.49
◮ The regression model explains almost half of the variations in y.

slide-23
SLIDE 23

r² Example 3

◮ r = −0.99
◮ r² = 0.9801
◮ The regression model explains almost all of the variations in y.

slide-24
SLIDE 24

Residuals

Once we fit the regression line to the scatterplot, in most cases not all of the data points will lie on this line. The line is chosen so that the sum of the squared vertical distances from the points to the line is as small as possible, and these distances represent the “left-over” variation in the response variable. Thus we call these vertical distances (observed y minus predicted ŷ) residuals.
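Residuals are straightforward to compute once the line is fitted. This sketch uses invented data and also illustrates a standard property of least-squares residuals: they sum to zero.

```python
from statistics import mean

# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

x_bar, y_bar = mean(xs), mean(ys)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

# residual = observed y - predicted y_hat
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

for x, y, e in zip(xs, ys, residuals):
    print(f"x = {x:.0f}, y = {y:.1f}, y_hat = {a + b * x:.1f}, residual = {e:+.1f}")
print(f"sum of residuals = {sum(residuals):.10f}")
```

Positive residuals belong to points above the line and negative residuals to points below it.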

slide-25
SLIDE 25

Influential Observations 1

Large residuals can affect the regression line, but other unusual observations can also have an effect on the regression line.

Consider, for example, Subject 16.

slide-26
SLIDE 26

Influential Observations 2

The above data is from a study in which brain activity is measured as the subject watches his or her partner in pain, while the empathy score is determined by a test of empathy.

slide-27
SLIDE 27

Influential Observations 3

Subject 16 is of interest because he/she scored about 40 points higher than everyone else, which has a strong effect on the correlation.

slide-28
SLIDE 28

Influential Observations 4

If we remove Subject 16, the regression line doesn’t change much.

However, the correlation drops from r = 0.515 to r = 0.331. Thus we say that Subject 16 is influential for correlation.

slide-29
SLIDE 29

Influential Observations 5

An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. If a statistical calculation depends on one or more influential observations, it may be of little practical value.
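This definition suggests a direct check: compute the statistic with and without the suspect observation and compare. The sketch below does this for the correlation and the slope. The data are invented (they are not the empathy study data), and the helper name `summarize` is hypothetical.

```python
import math
from statistics import mean


def summarize(xs, ys):
    """Return (r, slope) for the given data."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy), s_xy / s_xx


# Hypothetical data; the last point is an outlier in the x direction.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
ys = [2.0, 1.0, 3.0, 2.0, 3.0, 9.0]

r_all, b_all = summarize(xs, ys)
r_without, b_without = summarize(xs[:-1], ys[:-1])   # drop the last point

print(f"with the point:    r = {r_all:.3f}, slope = {b_all:.3f}")
print(f"without the point: r = {r_without:.3f}, slope = {b_without:.3f}")
```

In this made-up example a single point changes both the correlation and the slope markedly, so it would be called influential for both calculations.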
slide-30
SLIDE 30

Influential Observations 6

Observe that if we change the value of the brain activity of Subject 16, the regression line changes drastically:

Subject 16, in this new position, is now influential for regression.

slide-31
SLIDE 31

In General...

Points in a scatterplot that are outliers in the x or y direction tend to be influential for correlation. Points that are outliers in the x direction are often influential for the regression line.

slide-32
SLIDE 32

Some Reasons for Caution

Beware of the following potential pitfalls:

Extrapolation is the use of a regression line for prediction well outside the range of values of the explanatory variable x used to obtain the line.

A lurking variable is a variable not among the explanatory or response variables in a study that may influence the interpretation of the relationship between the variables.
slide-33
SLIDE 33

Extrapolation

[Figure: MPG of vehicle versus weight of vehicle in hundreds of pounds]

A U-Haul truck weighs 8120 pounds. What is its MPG?
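The danger can be sketched with a prediction function that flags requests outside the observed range of x. The weights and MPG values below are invented for illustration, since the slide's data are not reproduced in the text; only the truck weight of 8120 pounds (81.2 hundreds of pounds) comes from the slide. The function name `predict_with_range_check` is hypothetical.

```python
from statistics import mean

# Hypothetical data: weight in hundreds of pounds, fuel economy in MPG.
weights = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
mpg = [38.0, 33.0, 29.0, 25.0, 22.0, 18.0]

x_bar, y_bar = mean(weights), mean(mpg)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(weights, mpg))
     / sum((x - x_bar) ** 2 for x in weights))
a = y_bar - b * x_bar


def predict_with_range_check(x):
    """Return (prediction, is_extrapolation) for the fitted line."""
    is_extrapolation = x < min(weights) or x > max(weights)
    return a + b * x, is_extrapolation


truck_weight = 8120 / 100   # 8120 pounds expressed in hundreds of pounds
prediction, extrapolated = predict_with_range_check(truck_weight)

print(f"predicted MPG at weight {truck_weight}: {prediction:.1f}")
print(f"outside the observed range of x: {extrapolated}")
```

With these made-up numbers the line predicts a negative MPG for the truck, which is impossible; the line was never fitted to vehicles anywhere near that heavy.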

slide-34
SLIDE 34

What’s the Lurking Variable?

◮ There is a strong positive association between shoe size and reading skills in young children.
◮ There is a strong positive association between the number of firefighters at a fire site and the amount of damage the fire does.
◮ There is a strong negative association between moderate amounts of wine-drinking and death rates from heart disease in developed countries.
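For the first example, a plausible lurking variable is age. The simulation below is a sketch under that assumption: shoe size and reading score are each generated from age plus independent noise, with no direct link between them. All numbers and names are invented for illustration.

```python
import math
import random
from statistics import mean


def correlation(us, vs):
    """Sample correlation between two equal-length lists."""
    u_bar, v_bar = mean(us), mean(vs)
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    s_vv = sum((v - v_bar) ** 2 for v in vs)
    return s_uv / math.sqrt(s_uu * s_vv)


def residuals_after_fit(xs, ys):
    """Residuals of ys after a least-squares regression on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return [y - (a + b * x) for x, y in zip(xs, ys)]


rng = random.Random(0)   # fixed seed so the run is repeatable
ages = [rng.uniform(5, 12) for _ in range(1000)]
shoe_size = [age + rng.gauss(0, 1) for age in ages]          # driven by age only
reading = [10 * age + rng.gauss(0, 10) for age in ages]      # driven by age only

r_raw = correlation(shoe_size, reading)
# Remove the part of each variable that age accounts for, then correlate what is left.
r_adjusted = correlation(residuals_after_fit(ages, shoe_size),
                         residuals_after_fit(ages, reading))

print(f"shoe size vs reading, ignoring age:      r = {r_raw:.2f}")
print(f"shoe size vs reading, after removing age: r = {r_adjusted:.2f}")
```

The strong raw association largely disappears once age is accounted for, which is the signature of a lurking variable rather than a direct effect.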

slide-35
SLIDE 35

Association Does Not Imply Causation 1

We are often interested in whether changes in the explanatory variable cause changes in the response variable. Sometimes an association really does reflect cause and effect. However, strong association is not sufficient to draw conclusions about causation. For instance, there are nonsense correlations:

◮ With a decrease in the number of pirates, there has been an increase in global warming over the same period.

◮ Owning more cars will increase your lifespan.

Lurking variables are often to blame.

slide-36
SLIDE 36

Association Does Not Imply Causation 2

Direct causation may not be the whole story.

◮ Obese parents tend to have obese children.
◮ Body type is partly determined genetically, so there is a potential direct causation.
◮ However, bad habits may be picked up: little exercise, poor eating habits, etc.
◮ Both could likely contribute to the correlation.

slide-37
SLIDE 37

From Association to Causation 1

The best way to find evidence that x causes y is to perform an experiment.

◮ We choose changes in x.
◮ We make every effort to keep lurking variables in check.

What if we can’t perform a controlled experiment?

◮ Does using cell phones cause brain tumors?
◮ Are our levels of CO2 emissions causing global warming?

slide-38
SLIDE 38

From Association to Causation 2

Criteria for establishing causation when an experiment cannot be performed:

1. The association is strong.
2. The association is consistent.
3. Larger changes in x are associated with stronger responses.
4. The alleged cause precedes the effect in time.
5. The alleged cause is plausible.