ACMS 20340 Statistics for Life Sciences Chapter 3: Scatterplots - - PowerPoint PPT Presentation



slide-1
SLIDE 1

ACMS 20340 Statistics for Life Sciences

Chapter 3: Scatterplots and Correlation

slide-2
SLIDE 2

Exploratory Data Analysis

Recall that exploratory data analysis has two guiding principles.

  1. First examine each variable by itself. Then study the relationships between the variables.

  2. Represent the data with graphs. Then add numerical summaries of aspects of the data.

Now we’ll start to look at the relationships between variables.

slide-3
SLIDE 3

Relationships Between Variables

Examples:

◮ Lung capacity decreases with the number of cigarettes smoked in a day.

◮ The DMV warns that alcohol consumption reduces reflex time, and the effect becomes larger as more alcohol is consumed.

slide-4
SLIDE 4

Relationships Between Variables

Statistical relationships are overall tendencies. They are not ironclad rules. Two variables can have a statistical relationship even if some exceptions exist in the data.

To compare two variables, always measure them on the same individuals.

Examples:

◮ Smoking influences lung capacity.

◮ Blood alcohol content explains variations in reflex time.

In a statistical relationship, one variable explains or influences the other.

slide-5
SLIDE 5

Explanatory and Response Variables

A response variable measures an outcome of a study. An explanatory variable explains or influences changes in a response variable. These are sometimes called the dependent and independent variables, respectively.

◮ A response variable “depends on” an explanatory variable.

Studies often try to show that changes in one variable cause changes in another. However, many statistical relationships do not involve direct causation.

slide-6
SLIDE 6

Explanatory and Response Variables

How to identify each type?

Case 1: The values of one variable are set to see how they affect another. The variable whose values are set is the explanatory variable, and the other is the response.

Case 2: Two variables are observed. This situation may or may not have explanatory/response variables. It depends on how the data is used.

slide-7
SLIDE 7

Analyzing Statistical Relationships

Analyzing two-variable data expands on what we know:

◮ Plot the data.

◮ Look for overall patterns and any deviations from those patterns.

◮ Then obtain numerical summaries based on the data.

slide-8
SLIDE 8

Scatterplots

A scatterplot is a common and useful graph for showing the relationship between two quantitative variables. Values of one variable (the explanatory variable, if there is one) go on the horizontal axis, and values of the other variable (the response) go on the vertical axis. Each individual in the data appears as the point in the plot determined by its values of the two variables.
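
As a rough illustration of the axis convention, the sketch below draws a text-only scatterplot using only the Python standard library. The function name `ascii_scatter`, the grid size, and the dose/response numbers are all made up for illustration; in practice a plotting library would normally be used instead.

```python
def ascii_scatter(xs, ys, width=40, height=12):
    """Return a text scatterplot: xs on the horizontal axis, ys on the vertical.

    Assumes xs and ys have the same length and that each contains at
    least two distinct values (otherwise the scaling divides by zero).
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column / row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        # The first grid row is printed first, so flip to put large y on top.
        grid[height - 1 - row][col] = "*"
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width


# Hypothetical data: one (dose, response) pair per individual.
dose = [1, 2, 3, 4, 5, 6, 7, 8]
response = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.2, 7.8]
print(ascii_scatter(dose, response))
```

Each `*` is one individual; the explanatory values run left to right and the response values run bottom to top.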

slide-9
SLIDE 9

Interpreting Scatterplots

When you make a graph, ask yourself “What do I see?”

◮ Deja Vu?

◮ Look for the overall pattern.

◮ Describe the direction, form, and strength of the relationship.

◮ Check for any striking deviations, such as outliers.

slide-10
SLIDE 10

Interpreting Scatterplots

“Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together.”

◮ What?

◮ Think “upward trend”.

Two variables are negatively associated when larger values of one variable tend to accompany smaller values of the other.
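
One way to make the quoted definition concrete is to count how many individuals sit on the same side of both averages versus on opposite sides. The sketch below is only an illustration of that idea: the function name `association_direction`, the simple majority rule, and the data are assumptions, not part of the course material.

```python
from statistics import mean


def association_direction(xs, ys):
    """Classify the direction of association by a simple majority count.

    An individual counts as "same side" when both values are above their
    averages or both are below; "opposite side" when one is above and
    the other below. Values exactly at an average count for neither.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    same = sum(1 for x, y in zip(xs, ys) if (x - x_bar) * (y - y_bar) > 0)
    opposite = sum(1 for x, y in zip(xs, ys) if (x - x_bar) * (y - y_bar) < 0)
    if same > opposite:
        return "positive"
    if opposite > same:
        return "negative"
    return "unclear"


# Hypothetical measurements on five individuals.
dose = [1, 2, 3, 4, 5]
response = [52, 60, 61, 70, 75]
print(association_direction(dose, response))
```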

slide-11
SLIDE 11

Example

Let’s look at the influence of the number of powerboats registered on manatee deaths from collisions with powerboats.

slide-12
SLIDE 12

Powerboats and Manatees

Does the number of powerboats help explain yearly manatee deaths? What are the explanatory and response variables (if any)? Let’s take a look at the data.

slide-13
SLIDE 13

Scatterplots

◮ Scatterplots show the relationship between two quantitative variables.

◮ They are such a fundamental tool that many variations have been developed.

◮ One variation displays a third, categorical variable by varying the dot style.

slide-14
SLIDE 14

Iris Data

slide-15
SLIDE 15

Iris Data

The Iris Data from before. For three species of irises, the petal and sepal lengths and widths were measured.

Species      P–Width  P–Length  S–Width  S–Length
Setosa       0.2      1.4       3.5      5.1
Setosa       0.2      1.4       3        4.9
Versicolor   1.3      4.1       2.8      5.7
Virginica    2.5      6         3.3      6.3
Virginica    1.9      5.1       2.7      5.8
. . .

slide-16
SLIDE 16

Petal Width by Sepal Width

slide-17
SLIDE 17

Petal Width by Sepal Width, with Species

slide-18
SLIDE 18

Petal Width by Sepal Width, with Species

slide-19
SLIDE 19

Running Speed vs. Energy expenditure

This plot is easier to understand when the different inclines are indicated.

slide-20
SLIDE 20

Linear Relationships

Left: Vehicle horsepower vs. weight (100 lbs)

Right: Powerboat registrations (thousands) vs. manatee deaths

slide-21
SLIDE 21

Linear Relationships

◮ While our eyes find it easy to see strong linear relationships, weak relationships are more difficult to see.

◮ The correlation between a pair of variables is a number measuring the strength of the linear relationship between them.

◮ It is denoted by the symbol r.

slide-22
SLIDE 22

Calculating Correlation

The data is $x_1, x_2, \ldots, x_n$ for one variable and $y_1, y_2, \ldots, y_n$ for the other. The data is paired by individuals, so $x_1, y_1$ are observations from the same individual.

$\bar{x}, s_x$ are the mean and standard deviation of the x data.

$\bar{y}, s_y$ are the mean and standard deviation of the y data.

$$ r = \frac{1}{n-1} \sum_i \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
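
The formula translates almost term by term into code. The following is a minimal sketch using only Python's standard `statistics` module; the function name `correlation` and the five data pairs are hypothetical and chosen only to show the call.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r of paired observations, following the formula above."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 values")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample s.d. (divides by n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Hypothetical paired measurements on five individuals.
x = [1.2, 2.3, 3.1, 4.8, 5.0]
y = [2.0, 2.9, 4.1, 5.2, 6.3]
r = correlation(x, y)
print(f"r = {r:.3f}")
```

Note that `stdev` is the sample standard deviation, which matches the $n - 1$ divisor in the formula.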

slide-23
SLIDE 23

Deconstructing the Correlation Formula

$$ r = \frac{1}{n-1} \sum_i \underbrace{\left( \frac{x_i - \bar{x}}{s_x} \right)}_{\text{Normalize } x} \underbrace{\left( \frac{y_i - \bar{y}}{s_y} \right)}_{\text{Normalize } y} $$

We calculate the distance of each value from the mean, and then divide by the standard deviation. This has the effect of rescaling the observations to be in terms of standard deviations from the mean. Standardizing turns r into a unitless measurement.
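
Because each value is standardized before the products are taken, changing the units of measurement should leave r unchanged. The sketch below checks this on made-up numbers; the helper names `z_scores` and `correlation`, the data, and the choice of units are all assumptions for illustration.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize: distance from the mean in units of standard deviations."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """r = sum of products of standardized values, divided by n - 1."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


# Hypothetical data: lengths in metres and masses in kilograms.
length_m = [1.2, 1.5, 1.7, 2.0, 2.4]
mass_kg = [30.0, 41.0, 44.0, 60.0, 71.0]

# The same measurements expressed in centimetres and grams.
length_cm = [100 * v for v in length_m]
mass_g = [1000 * v for v in mass_kg]

r_original = correlation(length_m, mass_kg)
r_rescaled = correlation(length_cm, mass_g)
print(r_original, r_rescaled)  # equal up to floating-point rounding
```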

slide-24
SLIDE 24

Correlation is symmetric

r treats both explanatory and response variables symmetrically.

Change in non-exercise activity (Calories) and fat gain (kg): strong negative association, r = −0.78.

slide-25
SLIDE 25

Correlation is symmetric

r treats both explanatory and response variables symmetrically.

Change in non-exercise activity (Calories) and fat gain (kg): strong negative association, r = −0.78.
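
The symmetry can be checked numerically by swapping the two arguments. This is a small sketch with a hypothetical `correlation` helper; the numbers are invented for illustration and are not the study's data, so the resulting r will not be −0.78.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r of paired observations."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)


# Hypothetical values, NOT the study's data.
activity_change = [-90, 30, 150, 250, 390, 520, 690]  # Calories
fat_gain = [4.2, 3.7, 3.4, 2.4, 2.9, 1.6, 0.9]        # kg

r_xy = correlation(activity_change, fat_gain)
r_yx = correlation(fat_gain, activity_change)
print(r_xy, r_yx)  # the two values agree up to floating-point rounding
```

Swapping the roles only changes the order of the two factors in each product, so the sum is the same.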

slide-26
SLIDE 26

A Small Example

x     y
2.0   4.6
1.7   4.4
2.3   4.5

$\bar{x} = 2$, $\bar{y} = 4.5$, $s_x = 0.3$, $s_y = 0.1$

First calculate the mean and s.d. of x and y.

slide-27
SLIDE 27

A Small Example

x     y     $x - \bar{x}$   $y - \bar{y}$
2.0   4.6   0               0.1
1.7   4.4   −0.3            −0.1
2.3   4.5   0.3             0

$\bar{x} = 2$, $\bar{y} = 4.5$, $s_x = 0.3$, $s_y = 0.1$

Find the distance from the mean for both x and y.

slide-28
SLIDE 28

A Small Example

x     y     $(x - \bar{x})/s_x$   $(y - \bar{y})/s_y$
2.0   4.6   0                     1
1.7   4.4   −1                    −1
2.3   4.5   1                     0

$\bar{x} = 2$, $\bar{y} = 4.5$, $s_x = 0.3$, $s_y = 0.1$

Normalize by dividing by the corresponding s.d.

slide-29
SLIDE 29

A Small Example

x     y     $(x - \bar{x})/s_x$   $(y - \bar{y})/s_y$   product
2.0   4.6   0                     1                     0
1.7   4.4   −1                    −1                    1
2.3   4.5   1                     0                     0

$\bar{x} = 2$, $\bar{y} = 4.5$, $s_x = 0.3$, $s_y = 0.1$

Find the product of normalized x and y. The sum of the products is 1, so

$$ r = \frac{1}{3-1} \cdot 1 = 0.5 $$
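
The same steps can be reproduced in a few lines of Python as a check on the hand calculation. This is a sketch using the three data pairs from the example and the standard `statistics` module; the variable names are arbitrary.

```python
from statistics import mean, stdev

# The three (x, y) pairs from the small example.
x = [2.0, 1.7, 2.3]
y = [4.6, 4.4, 4.5]

# Step 1: mean and s.d. of each variable.
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Steps 2-4: distance from the mean, normalize, multiply.
rows = []
for xi, yi in zip(x, y):
    zx = (xi - x_bar) / s_x
    zy = (yi - y_bar) / s_y
    rows.append((xi, yi, zx, zy, zx * zy))

# Step 5: sum the products and divide by n - 1.
r = sum(row[4] for row in rows) / (len(x) - 1)

for xi, yi, zx, zy, product in rows:
    print(f"{xi:4.1f} {yi:4.1f} {zx:6.2f} {zy:6.2f} {product:8.2f}")
print(f"r = {r:.2f}")
```

Small floating-point rounding can make a value that is exactly 0 by hand show up as a tiny nonzero number, which is why the output is rounded to two decimals.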

slide-30
SLIDE 30

Properties of Correlation

◮ r is always between −1 and 1.

◮ If r is close to 0, then there is little or no linear relationship between the variables.

◮ If r > 0, then it indicates a positive relationship, with the relationship being stronger the closer r is to 1.

◮ If r < 0, then it indicates a negative relationship, with the relationship being stronger the closer r is to −1.

◮ Correlation is not a resistant measure. Just as with the mean and standard deviation, outliers will affect the value of r.

slide-31
SLIDE 31

Correlation varies from −1 to +1

slide-32
SLIDE 32

Manatee Deaths

r = 0.95

slide-33
SLIDE 33

Horsepower vs. MPG

r = −0.79

slide-34
SLIDE 34

Weight vs. MPG

r = −0.9

slide-35
SLIDE 35

Cabin Volume vs. MPG

r = −0.37

slide-36
SLIDE 36

Iris Species

r = −0.36

slide-37
SLIDE 37

Linear Relationship?

r = 0.18

slide-38
SLIDE 38

Linear Relationship?

r = −0.043

slide-39
SLIDE 39

r is not resistant

Some variables with a very strong linear relationship. r = −0.99

slide-40
SLIDE 40

r is not resistant

Changing an extreme value keeps the same linear relationship but now r = −0.78
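
The effect of a single extreme value is easy to reproduce. The sketch below uses made-up data, not the data plotted on these slides, so the r values it prints will differ from −0.99 and −0.78; the `correlation` helper is likewise an assumption for illustration.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r of paired observations."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)


# Hypothetical data lying close to a downward-sloping line.
x = [1, 2, 3, 4, 5, 6]
y = [12, 10, 8.1, 6, 4.2, 2]

# Same data, except the last y value is moved far off the line.
y_changed = y[:-1] + [9]

r_before = correlation(x, y)
r_after = correlation(x, y_changed)
print(f"before: r = {r_before:.2f}")
print(f"after:  r = {r_after:.2f}")
```

Five of the six points are identical in the two data sets, yet r moves noticeably toward 0 after the change.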