SLIDE 1
ACMS 20340 Statistics for Life Sciences
Chapter 3: Scatterplots and Correlation
SLIDE 2 Exploratory Data Analysis
Recall that exploratory data analysis has two guiding principles.
1. First examine each variable by itself. Then study the relationships between the variables.
2. Represent the data with graphs. Then add numerical summaries of aspects of the data.
Now we’ll start to look at the relationships between variables.
SLIDE 3
Relationships Between Variables
Examples:
◮ Lung capacity decreases with number of cigarettes smoked in
a day.
◮ The DMV warns that alcohol consumption reduces reflex time,
and the effect becomes larger as more alcohol is consumed.
SLIDE 4 Relationships Between Variables
Statistical relationships are overall tendencies. They are not ironclad rules. Two variables can have a statistical relationship even if some exceptions exist in the data. To compare two variables, always measure them on the same individuals.
In a statistical relationship, one variable explains or influences the other.
Examples:
◮ Smoking influences lung capacity.
◮ Blood alcohol content explains variations in reflex time.
SLIDE 5
Explanatory and Response Variables
A response variable measures an outcome of a study. An explanatory variable explains or influences changes in a response variable. Response and explanatory variables are sometimes referred to as dependent and independent variables, respectively.
◮ A response variable “depends on” an explanatory variable.
Studies often try to show that changes in one variable cause changes in another. Many statistical relationships do not involve direct causation.
SLIDE 6
Explanatory and Response Variables
How to identify each type?
Case 1: Values of one variable are set to see how it affects another. The variable being set is the explanatory variable; the other is the response.
Case 2: Two variables are observed. This situation may or may not have explanatory/response variables. It depends on how the data is used.
SLIDE 7
Analyzing Statistical Relationships
Analyzing two-variable data expands on what we know:
◮ Plot the data.
◮ Look for the overall pattern and any deviations from that pattern.
◮ Then obtain numerical summaries based on the data.
SLIDE 8
Scatterplots
A scatterplot is a common and useful graph for showing the relationship between two quantitative variables. Values of one variable (the explanatory variable, if there is one) go on the horizontal axis, and values of the other variable (the response) go on the vertical axis. Each individual in the data appears as the point in the plot corresponding to its values of the two variables.
SLIDE 9
Interpreting Scatterplots
When you make a graph, ask yourself “What do I see?”
◮ Deja Vu?
◮ Look for the overall pattern.
◮ Describe direction, form, and strength of the relationship.
◮ Check for any striking deviations, such as outliers.
SLIDE 10 Interpreting Scatterplots
“Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together.”
◮ What?
◮ Think “upward trend”.
Two variables are negatively associated when larger values of one variable tend to accompany smaller values of the other.
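One way to make these definitions concrete is to count how many individuals have both values on the same side of their averages, and how many have them on opposite sides. The Python sketch below does this; the function name and the data are made up for illustration and are not taken from the course.

```python
from statistics import mean


def count_sign_agreement(xs, ys):
    """Return (same_side, opposite_side) counts relative to the two means."""
    x_bar, y_bar = mean(xs), mean(ys)
    same_side = opposite_side = 0
    for x, y in zip(xs, ys):
        product = (x - x_bar) * (y - y_bar)
        if product > 0:
            same_side += 1       # both above average or both below average
        elif product < 0:
            opposite_side += 1   # one above average, the other below
    return same_side, opposite_side


# Hypothetical data, invented for this illustration only.
cigarettes_per_day = [0, 5, 10, 15, 20]
lung_capacity = [5.1, 4.8, 4.6, 4.1, 3.9]

same, opposite = count_sign_agreement(cigarettes_per_day, lung_capacity)
print(same, opposite)
```

When most points fall on opposite sides, as they do here, the association is negative ("downward trend"); when most fall on the same side, it is positive.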
SLIDE 11 Example
Let’s look at the influence of the number of powerboats registered on manatee deaths from collisions with powerboats.
SLIDE 12
Powerboats and Manatees
Does the number of powerboats help explain yearly manatee deaths? What are the explanatory and response variables (if any)? Let’s take a look at the data.
SLIDE 13
Scatterplots
◮ Scatterplots show the relationship between two quantitative
variables.
◮ They are such a fundamental tool that many variations have
been developed.
◮ One variation displays a third categorical variable by varying
the dot style.
SLIDE 14
Iris Data
SLIDE 15
Iris Data
The Iris Data from before. For three species of irises the petal and sepal lengths and widths were measured.
Species      P–Width  P–Length  S–Width  S–Length
Setosa       0.2      1.4       3.5      5.1
Setosa       0.2      1.4       3        4.9
Versicolor   1.3      4.1       2.8      5.7
Virginica    2.5      6         3.3      6.3
Virginica    1.9      5.1       2.7      5.8
. . .
SLIDE 16
Petal Width by Sepal Width
SLIDE 17
Petal Width by Sepal Width, with Species
SLIDE 18
Petal Width by Sepal Width, with Species
SLIDE 19
Running Speed vs. Energy expenditure
This plot is easier to understand when the different inclines are indicated.
SLIDE 20
Linear Relationships
Left: Vehicle horsepower vs. weight (100 lbs)
Right: Powerboat registrations (thousands) vs. manatee deaths
SLIDE 21
Linear Relationships
◮ While our eyes find it easy to see strong linear relationships,
weak relationships are more difficult to see.
◮ The correlation between a pair of variables is a number
measuring the strength of the linear relationship between them.
◮ It is denoted by the symbol r.
SLIDE 22 Calculating Correlation
The data is $x_1, x_2, \dots, x_n$ for one variable and $y_1, y_2, \dots, y_n$ for the other. The data is paired by individuals, so $x_1, y_1$ are observations from the same individual.
$\bar{x}, s_x$ are the mean and standard deviation of the x data.
$\bar{y}, s_y$ are the mean and standard deviation of the y data.
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$
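The formula can be sketched directly in Python using only the standard library. This is a minimal illustration, not course-provided code; the function name `correlation` and the test data are assumptions.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r: average product of standardized values.

    Uses the sample standard deviation (divisor n - 1), as in the formula.
    Raises ZeroDivisionError if either variable has no spread.
    """
    if len(xs) != len(ys):
        raise ValueError("x and y must be paired: same number of observations")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Hypothetical data lying exactly on an upward line, so r should be 1
# up to floating-point rounding.
r = correlation([1, 2, 3, 4], [2, 4, 6, 8])
print(round(r, 6))
```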
SLIDE 23 Deconstructing the Correlation Formula
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$
We calculate the distance of each value from the mean, and then divide by the standard deviation. This has the effect of rescaling the observations to be in terms of standard deviations from the mean. Standardizing turns r into a unitless measurement.
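Because each observation is standardized, changing the units of a variable should leave r unchanged. The sketch below checks this on made-up data by converting one variable from centimeters to inches; the helper `correlation` and all the numbers are assumptions for illustration.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r computed from standardized values."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Hypothetical measurements, invented for this illustration only.
height_cm = [150.0, 160.0, 165.0, 172.0, 181.0]
weight_kg = [52.0, 58.0, 66.0, 64.0, 80.0]

# Same heights expressed in inches (1 inch = 2.54 cm).
height_in = [h / 2.54 for h in height_cm]

r_cm = correlation(height_cm, weight_kg)
r_in = correlation(height_in, weight_kg)
print(abs(r_cm - r_in) < 1e-9)  # the two values agree up to rounding
```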
SLIDE 24
Correlation is symmetric
r treats both explanatory and response variables symmetrically.
Change in non-exercise activity (Calories) and Fat gain (kg)
Strong negative association. r = −0.78.
SLIDE 25
Correlation is symmetric
r treats both explanatory and response variables symmetrically.
Change in non-exercise activity (Calories) and Fat gain (kg)
Strong negative association. r = −0.78.
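The symmetry follows from the formula: swapping x and y only swaps the two factors in each product. A quick check on made-up data (the helper `correlation` and the numbers are assumptions, not the study's data):

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r computed from standardized values."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Hypothetical paired observations, invented for this illustration only.
activity_change = [-90.0, 10.0, 140.0, 350.0, 620.0]
fat_gain = [4.2, 3.8, 3.1, 2.0, 1.1]

r_xy = correlation(activity_change, fat_gain)
r_yx = correlation(fat_gain, activity_change)
print(abs(r_xy - r_yx) < 1e-12)  # same value whichever variable comes first
```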
SLIDE 26
A Small Example
x     y
2.0   4.6
1.7   4.4
2.3   4.5
$\bar{x} = 2$, $\bar{y} = 4.5$, $s_x = 0.3$, $s_y = 0.1$
First calculate the mean and s.d. of x and y.
SLIDE 27 A Small Example
x     y     $x - \bar{x}$   $y - \bar{y}$
2.0   4.6    0               0.1
1.7   4.4   −0.3            −0.1
2.3   4.5    0.3             0
$\bar{x} = 2$, $\bar{y} = 4.5$, $s_x = 0.3$, $s_y = 0.1$
Find distance from mean for both x and y.
SLIDE 28 A Small Example
x     y     $(x - \bar{x})/s_x$   $(y - \bar{y})/s_y$
2.0   4.6    0                     1
1.7   4.4   −1                    −1
2.3   4.5    1                     0
$\bar{x} = 2$, $\bar{y} = 4.5$, $s_x = 0.3$, $s_y = 0.1$
Normalize by dividing by the corresponding s.d.
SLIDE 29 A Small Example
x     y     $(x - \bar{x})/s_x$   $(y - \bar{y})/s_y$   product
2.0   4.6    0                     1                     0
1.7   4.4   −1                    −1                     1
2.3   4.5    1                     0                     0
$\bar{x} = 2$, $\bar{y} = 4.5$, $s_x = 0.3$, $s_y = 0.1$
Find product of normalized x and y. Sum of products is 1, so $r = \frac{1}{3-1} \cdot 1 = 0.5$.
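The same four steps can be traced in Python with the three data points from this example. This is only a sketch using the standard library; the variable names are chosen here for illustration.

```python
from statistics import mean, stdev

# Data from the small example.
x = [2.0, 1.7, 2.3]
y = [4.6, 4.4, 4.5]

# Step 1: mean and sample standard deviation of each variable.
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Steps 2 and 3: distance from the mean, divided by the standard deviation.
z_x = [(xi - x_bar) / s_x for xi in x]
z_y = [(yi - y_bar) / s_y for yi in y]

# Step 4: multiply the standardized values pairwise, sum, divide by n - 1.
products = [a * b for a, b in zip(z_x, z_y)]
r = sum(products) / (len(x) - 1)

print(round(r, 3))  # 0.5, matching the hand calculation
```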
SLIDE 30
Properties of Correlation
◮ r is always between −1 and 1.
◮ If r is close to 0 then there is little or no linear relationship between the variables.
◮ If r > 0 then it indicates a positive relationship, with the
relationship being stronger the closer r is to 1.
◮ If r < 0 then it indicates a negative relationship, with the
relationship being stronger the closer r is to −1.
◮ Correlation is not a resistant measure. Just as with the mean
and standard deviation, outliers will affect the value of r.
SLIDE 31
Correlation varies from −1 to +1
SLIDE 32
Manatee Deaths
r = 0.95
SLIDE 33
Horsepower vs. MPG
r = −0.79
SLIDE 34
Weight vs. MPG
r = −0.9
SLIDE 35
Cabin Volume vs. MPG
r = −0.37
SLIDE 36
Iris Species
r = −0.36
SLIDE 37
Linear Relationship?
r = 0.18
SLIDE 38
Linear Relationship?
r = −0.043
SLIDE 39
r is not resistant
Some variables with a very strong linear relationship.
r = −0.99
SLIDE 40
r is not resistant
Changing an extreme value keeps the same linear relationship among the other points, but now r = −0.78.
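The effect of a single extreme value can be reproduced on made-up data. The sketch below is not the data behind the plots; the helper `correlation` and all the numbers are assumptions chosen so that the points start out exactly on a downward line.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r computed from standardized values."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Hypothetical data on an exact downward line, invented for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [16, 14, 12, 10, 8, 6, 4, 2]
r_before = correlation(xs, ys)

# Change only the last (most extreme) observation.
ys_changed = ys[:-1] + [12]
r_after = correlation(xs, ys_changed)

# r is still negative, but noticeably closer to 0 than before.
print(round(r_before, 2), round(r_after, 2))
```

Seven of the eight points are untouched, yet the correlation weakens substantially, which is what "not resistant" means.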