SLIDE 1
ACMS 20340 Statistics for Life Sciences
Chapter 4: Regression
A Quick Recap of Chapter 3
Motivating question: What relationships might hold between two variables? We plot the data given by two variables on a scatterplot. Further, we can measure the direction and strength of a linear relationship with the correlation r.
SLIDE 2
SLIDE 3
The Basic Idea of Regression
In the case that one variable helps explain or predict the other variable, we can summarize the relationship between the variables by means of a regression line. We sometimes refer to the regression line as the least-squares regression line.
SLIDE 4
Why “regression”?
◮ Sir Francis Galton (1822-1911), who was the first to apply
regression to biological and psychological data, considered examples involving the relationships between the height of parents and the heights of their children.
◮ Galton found that taller-than-average parents had
taller-than-average children, but not as tall as their parents.
◮ Thus, the children “regressed towards the mean”.
SLIDE 5
Quick review of linear equations
[Figure: a straight line with y-intercept a = 1 and slope b = 2]
◮ Let x be an explanatory variable.
◮ Let y be a response variable.
◮ A straight line relating y to x has the form y = a + bx.
◮ The slope of this line is b.
◮ The y-intercept of this line is a.
SLIDE 6
An Example
SLIDE 7
An Example
How do we determine which line is the regression line? Do we just eyeball it? Or do we just pick two points and draw a line between them?
SLIDE 8
Which Line is the Regression Line?
The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
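This minimization criterion can be checked numerically. The sketch below (with made-up data and an arbitrary grid of nearby candidate lines, both purely illustrative) fits the least-squares line and confirms that perturbing its intercept or slope only increases the sum of squared vertical distances:

```python
def sse(a, b, x, y):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def least_squares(x, y):
    """Fit y-hat = a + b*x by the usual least-squares formulas; returns (a, b)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
         / sum((xi - xb) ** 2 for xi in x))
    return yb - b * xb, b

# Hypothetical data lying roughly along a line
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

a, b = least_squares(x, y)
best = sse(a, b, x, y)

# Every nearby candidate line has a strictly larger error sum
worse = min(
    sse(a + da, b + db, x, y)
    for da in (-0.5, 0.0, 0.5)
    for db in (-0.1, 0.0, 0.1)
    if (da, db) != (0.0, 0.0)
)
```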
SLIDE 9
Another Illustration
SLIDE 10
The Equation of the Regression Line 1
The Main Ingredients:
◮ x̄, the mean of x
◮ ȳ, the mean of y
◮ sx, the standard deviation of x
◮ sy, the standard deviation of y
◮ r, the correlation of x and y
SLIDE 11
The Equation of the Regression Line 2
The least-squares regression line is the line ŷ = a + bx, where
b = r(sy/sx) and a = ȳ − bx̄.
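In code, the five ingredients from the previous slide determine the line directly. A minimal sketch in Python (the data set is a made-up toy example):

```python
from statistics import mean, stdev

def regression_line(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b*x,
    built from the means, standard deviations, and correlation."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    n = len(x)
    # sample correlation: average product of standardized values
    r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y
            for xi, yi in zip(x, y)) / (n - 1)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

x = [1, 2, 3, 4, 5]              # hypothetical explanatory variable
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical response
a, b = regression_line(x, y)
```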
SLIDE 12
Regression as Prediction
The regression line is given in terms of the variable ŷ to emphasize the fact that the line gives the predicted response ŷ for any x. Thus the distinction between explanatory and response variables matters for regression (unlike for correlation).
SLIDE 13
When Should We Use Regression?
Only compute the regression line once you’ve confirmed that the relationship between x and y is linear.
Plot the data first! Each of the four data sets shown here yields the same regression line, ŷ = 3 + 0.5x. But we should first plot each data set.
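These four data sets appear to be Anscombe's famous quartet; if so, a quick computation with its standard published values confirms that all four give (essentially) the same fitted line:

```python
def least_squares(x, y):
    """Fit y-hat = a + b*x; returns (a, b)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
         / sum((xi - xb) ** 2 for xi in x))
    return yb - b * xb, b

# Anscombe's quartet (standard published values)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ys = [
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
]
fits = [least_squares(x123, ys[0]), least_squares(x123, ys[1]),
        least_squares(x123, ys[2]), least_squares(x4, ys[3])]
# every fit is approximately (3.0, 0.5), i.e. y-hat = 3 + 0.5x
```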
SLIDE 14
Plotting the Data 1
ŷ = 3 + 0.5x
Moderate linear association; regression OK.
ŷ = 3 + 0.5x
Obvious nonlinear relationship; regression inappropriate.
SLIDE 15
Plotting the Data 2
ŷ = 3 + 0.5x
One extreme outlier, requiring further examination.
ŷ = 3 + 0.5x
Only two distinct values of x; a redesign is due here...
SLIDE 16
Linear vs. Non-linear Plots 1
Below is a scatterplot of brain weight in grams against body weight in kilograms, with the regression line included. This regression line is not helpful for predictions.
SLIDE 17
Linear vs. Non-linear Plots 2
Replacing the values of brain weight with the logarithm of the brain weight, and the values of body weight with the logarithm of the body weight, allows us to use the regression line to make reasonable predictions.
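A sketch of the idea, using hypothetical data generated exactly from a power law (real brain/body data are noisy, so this is only an illustration): fitting a line to the logarithms recovers the power-law exponent as the slope.

```python
from math import log

def least_squares(x, y):
    """Fit y-hat = a + b*x; returns (a, b)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
         / sum((xi - xb) ** 2 for xi in x))
    return yb - b * xb, b

# Hypothetical allometric relationship: brain = 10 * body**0.75
body = [0.1, 1.0, 10.0, 100.0, 1000.0]   # body weight (kg)
brain = [10 * w ** 0.75 for w in body]   # brain weight (g)

# On the log-log scale the relationship is exactly linear:
# log(brain) = log(10) + 0.75 * log(body)
a, b = least_squares([log(w) for w in body], [log(w) for w in brain])
```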
SLIDE 18
The Slope of the Regression Line
The slope of the regression line, b = r(sy/sx), has the following properties:
◮ A change of one standard deviation in x corresponds to a change of r standard deviations in y.
◮ As r decreases, changes in x have less of an effect on ŷ.
◮ With perfect correlation (r = 1 or r = −1), the change in ŷ (in standard deviations) is the same as the change in x.
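The first property follows directly from b = r(sy/sx): moving one standard deviation along x changes the prediction by exactly r standard deviations of y. A small check with made-up data:

```python
from statistics import mean, stdev

x = [2, 4, 6, 8, 10]            # hypothetical data
y = [1.0, 2.4, 2.9, 4.2, 4.5]

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)
r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y
        for xi, yi in zip(x, y)) / (len(x) - 1)
b = r * s_y / s_x

# prediction at x_bar + s_x minus prediction at x_bar ...
change_in_yhat = b * (x_bar + s_x) - b * x_bar
# ... equals r standard deviations of y
```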
SLIDE 19
A Certain Point Always on the Regression Line
The least-squares regression line always passes through the point (x̄, ȳ). Can you see why? This is a useful fact for plotting the regression line.
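The reason is one line of algebra: since a = ȳ − bx̄, plugging x = x̄ into ŷ = a + bx gives ŷ = (ȳ − bx̄) + bx̄ = ȳ. A quick numerical check (the data are made up):

```python
from statistics import mean, stdev

x = [1, 3, 5, 7, 9]             # hypothetical data
y = [2.2, 2.9, 4.1, 4.8, 6.0]

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)
r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y
        for xi, yi in zip(x, y)) / (len(x) - 1)
b = r * s_y / s_x
a = y_bar - b * x_bar

# the prediction at x = x_bar is (up to rounding) exactly y_bar
prediction_at_mean = a + b * x_bar
```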
SLIDE 20
The Value r²
The square of the correlation, r², called the coefficient of determination, tells us how much of the variation in the values of y can be explained by the least-squares regression of y on x. That is, as x changes, y tends to vary with it: "x pulls y along the regression line". The remaining variation shows up as values of y lying above or below the regression line.
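Equivalently, r² is the fraction of the total variation in y captured by the fitted line: r² = 1 − (sum of squared residuals)/(total sum of squares). A sketch with made-up data, checking that the two computations agree:

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]   # hypothetical data

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)
r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y
        for xi, yi in zip(x, y)) / (len(x) - 1)
b = r * s_y / s_x
a = y_bar - b * x_bar

ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ss_total = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_residual / ss_total   # equals r**2 for least squares
```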
SLIDE 21
r² Example 1
◮ r = −0.3
◮ r² = 0.09
◮ The regression model doesn't explain even 10% of the variation in y.
SLIDE 22
r² Example 2
◮ r = −0.7
◮ r² = 0.49
◮ The regression model explains almost half of the variation in y.
SLIDE 23
r² Example 3
◮ r = −0.99
◮ r² = 0.9801
◮ The regression model explains almost all of the variation in y.
SLIDE 24
Residuals
Once we fit the regression line to the scatterplot, in most cases not all of the data points will lie on the line. The vertical distances from these points to the line are as small as possible, and they represent the "left-over" variation in the response variable. We therefore call these vertical distances residuals.
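Residuals are easy to compute, and for a least-squares fit they always sum to (essentially) zero: the positive and negative left-overs balance. A sketch with made-up data:

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]   # hypothetical data

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)
r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y
        for xi, yi in zip(x, y)) / (len(x) - 1)
b = r * s_y / s_x
a = y_bar - b * x_bar

# residual = observed y minus predicted y-hat
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
```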
SLIDE 25
Influential Observations 1
Large residuals can affect the regression line, but other unusual observations can also have an effect on the regression line. Consider, for example, Subject 16.
SLIDE 26
Influential Observations 2
The data above come from a study in which brain activity was measured as each subject watched his or her partner in pain, while the empathy score was determined by a test of empathy.
SLIDE 27
Influential Observations 3
Subject 16 is of interest because he/she scored about 40 points higher than everyone else, which has a strong effect on the correlation.
SLIDE 28
Influential Observations 4
If we remove Subject 16, the regression line doesn't change much. However, the correlation drops from r = 0.515 to r = 0.331. Thus we say that Subject 16 is influential for correlation.
SLIDE 29
Influential Observations 5
An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. If a statistical calculation depends on one or more influential observations, it may be of little practical value.
SLIDE 30
Influential Observations 6
Observe that if we change the value of the brain activity of Subject 16, the regression line changes drastically. Subject 16, in this new position, is now influential for regression.
SLIDE 31
In General...
Points in a scatterplot that are outliers in the x or y direction tend to be influential for correlation. Points that are outliers in the x direction are often influential for the regression line.
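This can be seen numerically: take points lying exactly on a line and add a single observation that is an outlier in the x direction but sits well off the line; the fitted slope changes drastically. (The data below are entirely made up for illustration.)

```python
def least_squares(x, y):
    """Fit y-hat = a + b*x; returns (a, b)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
         / sum((xi - xb) ** 2 for xi in x))
    return yb - b * xb, b

# Ten points lying exactly on y = 2 + 0.5x
x = list(range(1, 11))
y = [2 + 0.5 * xi for xi in x]

a0, b0 = least_squares(x, y)                  # slope 0.5

# One influential observation: far out in x, well below the line
a1, b1 = least_squares(x + [30], y + [0.0])
# the slope is dragged from 0.5 down to a negative value
```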
SLIDE 32
Some Reasons for Caution
Beware of the following potential pitfalls:
Extrapolation is the use of a regression line for prediction well outside the range of values of the explanatory variable x used to obtain the line.
A lurking variable is a variable, not among the explanatory or response variables in a study, that may influence the interpretation of the relationship between those variables.
SLIDE 33
Extrapolation
MPG of a vehicle versus the weight of the vehicle in hundreds of pounds. A U-Haul truck weighs 8,120 pounds. What is its MPG?
SLIDE 34
What’s the Lurking Variable?
◮ There is a strong positive association between shoe size and reading skills in young children.
◮ There is a strong positive association between the number of firefighters at a fire site and the amount of damage the fire does.
◮ There is a strong negative association between moderate amounts of wine-drinking and death rates from heart disease in developed countries.
SLIDE 35
Association Does Not Imply Causation 1
We are often interested in whether changes in the explanatory variable cause changes in the response variable. Sometimes an association really does reflect cause and effect. However, a strong association is not sufficient to draw conclusions about causation. For instance, there are nonsense correlations:
◮ With the decrease in the number of pirates, there has been an increase in global warming over the same period.
◮ Owning more cars will increase your lifespan.
Lurking variables are often to blame.
SLIDE 36
Association Does Not Imply Causation 2
Direct causation may not be the whole story.
◮ Obese parents tend to have obese children.
◮ Body type is partly determined genetically, so there is potential direct causation.
◮ However, bad habits may also be picked up: little exercise, poor eating habits, etc.
◮ Both could contribute to the correlation.
SLIDE 37
From Association to Causation 1
The best way to find evidence that x causes y is to perform an experiment.
◮ We choose the changes in x.
◮ We make every effort to keep lurking variables in check.
What if we can't perform a controlled experiment?
◮ Does using cell phones cause brain tumors?
◮ Are our levels of CO2 emissions causing global warming?
SLIDE 38
From Association to Causation 2
Criteria for establishing causation when an experiment cannot be performed:
1. The association is strong.
2. The association is consistent.
3. Larger changes in x are associated with stronger responses.
4. The alleged cause precedes the effect in time.
5. The alleged cause is plausible.