

Bivariate Data

Marc H. Mehlman

marcmehlman@yahoo.com

University of New Haven

Marc Mehlman (University of New Haven) Bivariate Data 1 / 36


Table of Contents

1. Bivariate Data
2. Scatterplots
3. Correlation
4. Two–Way Tables
5. Chapter #2 R Assignment


Bivariate Data


Bivariate data comes from measuring two aspects of the same item/individual. For instance, (70, 178), (72, 192), (74, 184), (68, 181) is a random sample of size four obtained from four male college students. The bivariate data gives the height in inches and the weight in pounds of each of the four students. The third student sampled is 74 inches tall and weighs 184 pounds. Can one variable be used to predict the other? Do tall people tend to weigh more?

Definition: A response (or dependent) variable measures the outcome of a study. The explanatory (or independent) variable is the one that predicts the response variable.


Scatterplots


Student   Beers   BAC
      1       5   0.100
      2       2   0.030
      3       9   0.190
      4       8   0.120
      5       3   0.040
      6       7   0.095
      7       3   0.070
      8       5   0.060
      9       3   0.020
     10       5   0.050
     11       4   0.070
     12       6   0.100
     13       5   0.085
     14       7   0.090
     15       1   0.010
     16       4   0.050

Here we have two quantitative variables recorded for each of 16 students:

  1. how many beers they drank
  2. their resulting blood alcohol content (BAC)

Bivariate data: For each individual studied, we record data on two variables. We then examine whether there is a relationship between these two variables: do changes in one variable tend to be associated with specific changes in the other variable?



A scatterplot is used to display quantitative bivariate data. Each variable makes up one axis. Each individual is a point on the graph.
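A minimal sketch of that idea in Python (the slides themselves use R): each individual from the beers/BAC table above becomes one (x, y) point on the graph.

```python
# Sketch: the beers/BAC table as scatterplot points (Python for illustration;
# the slides use R). Each student contributes one (beers, bac) point.
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

points = list(zip(beers, bac))   # one point per individual
print(len(points))               # 16 students, 16 points

# To draw the plot (if matplotlib is installed):
# import matplotlib.pyplot as plt
# plt.scatter(beers, bac); plt.xlabel("Beers"); plt.ylabel("BAC"); plt.show()
```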



> plot(trees$Girth~trees$Height,main="girth vs height")

[Scatterplot “girth vs height”: trees$Girth (vertical axis, 8–20) versus trees$Height (horizontal axis, 65–85).]



How to scale a scatterplot

Same data in all four plots. Both variables should be given a similar amount of space:

  • Plot is roughly square
  • Points should occupy all the plot space (no blank space)


Interpreting scatterplots

After plotting two variables on a scatterplot, we describe the overall pattern of the relationship. Specifically, we look for:

  • Form: linear, curved, clusters, no pattern
  • Direction: positive, negative, no direction
  • Strength: how closely the points fit the “form”
  • … and clear deviations from that pattern
  • Outliers of the relationship


Form

[Example scatterplots: linear, nonlinear, and no relationship.]


Direction

Positive association: High values of one variable tend to occur together with high values of the other variable.

Negative association: High values of one variable tend to occur together with low values of the other variable.


Strength

The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.



Outliers

An outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected). In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship.


Adding categorical variables to scatterplots

Two or more relationships can be compared on a single scatterplot when we use different symbols for groups of points on the graph.

The graph compares the association between thorax length and longevity of male fruit flies that are allowed to reproduce (green) or not (purple). The pattern is similar in both groups (linear, positive association), but male fruit flies not allowed to reproduce tend to live longer than reproducing male fruit flies of the same size.

Correlation


Definition: Given the bivariate data (x_1, y_1), …, (x_n, y_n), the sample correlation coefficient (sample Pearson product-moment correlation coefficient) is

r = \frac{1}{n-1} \sum_{j=1}^{n} \left( \frac{x_j - \bar{x}}{s_x} \right) \left( \frac{y_j - \bar{y}}{s_y} \right).

The population correlation coefficient is denoted

\rho = \frac{1}{N} \sum_{j=1}^{N} \left( \frac{x_j - \mu_X}{\sigma_X} \right) \left( \frac{y_j - \mu_Y}{\sigma_Y} \right),

where the above sum is summed over the entire population of size N.

One thinks of r as an estimator of ρ.


One can also use the formula

r = \frac{n \sum_{j=1}^{n} x_j y_j - \left( \sum_{j=1}^{n} x_j \right) \left( \sum_{j=1}^{n} y_j \right)}{\sqrt{n \sum_{j=1}^{n} x_j^2 - \left( \sum_{j=1}^{n} x_j \right)^2} \, \sqrt{n \sum_{j=1}^{n} y_j^2 - \left( \sum_{j=1}^{n} y_j \right)^2}}

R command:
> cor(trees$Girth,trees$Height)
[1] 0.5192801




The correlation coefficient measures the strength of any linear relationship between X and Y.

Properties of Correlation:
  • cor(X, Y) = cor(Y, X).
  • −1 ≤ r ≤ 1, and r is scale invariant.
  • If r is positive, there is a positive linear relationship between the two variables.
  • If r is negative, there is a negative linear relationship between the two variables.
  • The closer |r| is to one, the stronger the linear relationship between the two variables.
  • If |r| = 1 (i.e., r = 1 or −1), all the data points lie on a straight line.




r has no unit

[Scatterplot of standardized values: standardized value of x (unitless) versus standardized value of y (unitless), r = −0.75.]


r is not resistant to outliers

Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers.

Just moving one point away from the linear pattern here weakens the correlation from −0.91 to −0.75 (closer to zero).



Caution: Correlation is not Causation

Definition: When calculating correlation, a lurking variable is a third factor that explains the relationship between the two correlated variables.

Example (Lurking Variables):
  • There is a strong correlation between shoe size and reading skills among elementary school children. The lurking variable is · · ·
  • There is a strong correlation between the number of firefighters at a fire site and the amount of damage. The lurking variable is · · ·

Caution: Beware correlations based on averaged data. While there is a strong correlation between average age and average height among children, the correlation between age and height for individual children is much, much lower.



Definition: Two variables are confounded when their effects on the response variable cannot be distinguished from each other. The confounded variables can be either explanatory or lurking variables (or only work in the presence of each other). The only way to distinguish between two confounded variables is to redesign the experiment.

Example: When I’m stressed, I get muscle cramps. However, when I’m stressed, I also drink lots of coffee and lose sleep. Are the cramps caused by stress, or coffee, or lack of sleep, or some combination of the above?

Example: A classic example of confounding: a study suggests that people who carry matches are more likely to develop lung cancer. Is it the matches, or is there confounding with a lurking variable?




Establishing causation

Establishing causation from an observed association can be done if:
  1. The association is strong.
  2. The association is consistent.
  3. Higher doses are associated with stronger responses.
  4. The alleged cause precedes the effect.
  5. The alleged cause is plausible.

Lung cancer is clearly associated with smoking. What if a genetic mutation (lurking variable) caused people to both get lung cancer and become addicted to smoking? It took years of research and accumulated indirect evidence to reach the conclusion that smoking causes lung cancer.

Two–Way Tables


Given two random variables (the row variable and the column variable) that are categorical, data can be organized into an r × c two–way table. The number of categories for the row variable is r and the number of categories for the column variable is c. The grand total is the total number of bivariate data points considered. For instance, noting an individual’s gender (the column variable) and their perceived chance of getting rich (the row variable), one gets the following 5 × 2 two–way table:

                               Female   Male
Almost no chance                   96     98
Some chance, but probably not     426    286
A 50–50 chance                    696    720
A good chance                     663    758
Almost certain                    486    597

The (i, j)th cell corresponds to a tally of all the individuals who gave the ith answer to the row-variable question and the jth answer to the column-variable question.



Given a two–way table, marginal distributions are just the distributions of the row and column random variables. The adjective “marginal” comes from adding row and column totals to the two–way table to make it easy to calculate the row and column distributions. For instance:

                               Female   Male   Total
Almost no chance                   96     98     194
Some chance, but probably not     426    286     712
A 50–50 chance                    696    720   1,416
A good chance                     663    758   1,421
Almost certain                    486    597   1,083
Total                           2,367  2,459   4,826

allows us to see that the distribution of the row variable is 194/4,826, 712/4,826, 1,416/4,826, 1,421/4,826, 1,083/4,826. The distribution of the column variable is 2,367/4,826, 2,459/4,826.
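The same arithmetic can be sketched in Python (for illustration; the slides use R), dividing each row and column total by the grand total:

```python
# Sketch: marginal distributions from the two-way table above.
counts = {  # answer: (Female, Male)
    "Almost no chance":              (96, 98),
    "Some chance, but probably not": (426, 286),
    "A 50-50 chance":                (696, 720),
    "A good chance":                 (663, 758),
    "Almost certain":                (486, 597),
}

grand = sum(f + m for f, m in counts.values())       # 4826

# Row marginal: each row total over the grand total.
row_marginal = {k: (f + m) / grand for k, (f, m) in counts.items()}

# Column marginal: each column total over the grand total.
col_marginal = {
    "Female": sum(f for f, _ in counts.values()) / grand,   # 2,367/4,826
    "Male":   sum(m for _, m in counts.values()) / grand,   # 2,459/4,826
}
```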



The joint distribution of two categorical random variables is the proportion of the grand total that corresponds to each joint result. For instance, the joint distribution from the previous example is:

                               Female       Male
Almost no chance                96/4,826    98/4,826
Some chance, but probably not  426/4,826   286/4,826
A 50–50 chance                 696/4,826   720/4,826
A good chance                  663/4,826   758/4,826
Almost certain                 486/4,826   597/4,826
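A Python sketch of the joint distribution (for illustration; the slides use R): each cell count is divided by the grand total.

```python
# Sketch: the joint distribution - each cell count over the grand total.
counts = {  # (row answer, column answer): count
    ("Almost no chance", "Female"): 96,
    ("Almost no chance", "Male"): 98,
    ("Some chance, but probably not", "Female"): 426,
    ("Some chance, but probably not", "Male"): 286,
    ("A 50-50 chance", "Female"): 696,
    ("A 50-50 chance", "Male"): 720,
    ("A good chance", "Female"): 663,
    ("A good chance", "Male"): 758,
    ("Almost certain", "Female"): 486,
    ("Almost certain", "Male"): 597,
}

grand = sum(counts.values())                       # 4826
joint = {cell: n / grand for cell, n in counts.items()}
```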


The conditional distribution of one of the random variables, given that the other random variable takes on a particular value, is the proportion of outcomes the first random variable takes on among the cases where the other random variable has that particular value. For instance, given:

                               Female   Male   Total
Almost no chance                   96     98     194
Some chance, but probably not     426    286     712
A 50–50 chance                    696    720   1,416
A good chance                     663    758   1,421
Almost certain                    486    597   1,083
Total                           2,367  2,459   4,826

the conditional distribution of the row variable given that the column variable equals “Male” is 98/2,459, 286/2,459, 720/2,459, 758/2,459, 597/2,459. Similarly, the conditional distribution of the column variable given that the row variable equals “A good chance” is 663/1,421, 758/1,421.
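The first of these conditional distributions, sketched in Python (for illustration; the slides use R): each Male count is divided by the Male column total.

```python
# Sketch: conditional distribution of the row variable given "Male" -
# each Male count over the Male column total (2,459).
male = {
    "Almost no chance": 98,
    "Some chance, but probably not": 286,
    "A 50-50 chance": 720,
    "A good chance": 758,
    "Almost certain": 597,
}

male_total = sum(male.values())                    # 2459
cond_given_male = {k: n / male_total for k, n in male.items()}
```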



Simpson’s Paradox

When studying the relationship between two variables, there may exist a lurking variable that creates a reversal in the direction of the relationship: the direction when the lurking variable is ignored is opposite to the direction when the lurking variable is considered. The lurking variable creates subgroups, and failure to take these subgroups into consideration can lead to misleading conclusions regarding the association between the two variables. An association or comparison that holds for each of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson’s paradox.


Consider the acceptance rates for the following groups of men and women who applied to college. A higher percentage of men were accepted: is there evidence of discrimination?

[Table: overall acceptance rates for men and women.]


Consider the acceptance rates when broken down by type of school.

[Tables: acceptance rates within the Business School and within the Art School.]


Lurking variable: Applications were split between the Business School (240) and the Art School (320). Within each school a higher percentage of women were accepted than men. There is not any discrimination against women! This is an example of Simpson’s Paradox: when the lurking variable (type of school: Business or Art) is ignored, the data seem to suggest discrimination against women. However, when the type of school is considered, the association is reversed and suggests discrimination against men.
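The reversal can be reproduced with a small Python sketch. The counts below are hypothetical, chosen only to match the slide's application totals (240 Business School, 320 Art School); they are not the slide's actual table.

```python
# Sketch: Simpson's paradox in miniature. Hypothetical counts matching the
# slide's totals (Business School 240 applicants, Art School 320 applicants).
apps = {  # (school, gender): (applied, accepted)
    ("Business", "Men"):   (40, 6),     # 15% accepted
    ("Business", "Women"): (200, 40),   # 20% accepted
    ("Art", "Men"):        (200, 160),  # 80% accepted
    ("Art", "Women"):      (120, 102),  # 85% accepted
}

def rate(pairs):
    """Overall acceptance rate for a list of (applied, accepted) pairs."""
    applied = sum(a for a, _ in pairs)
    accepted = sum(c for _, c in pairs)
    return accepted / applied

men_overall = rate([v for (s, g), v in apps.items() if g == "Men"])
women_overall = rate([v for (s, g), v in apps.items() if g == "Women"])

# Women do better within each school, yet men do better overall.
print(men_overall > women_overall)  # True
```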

Chapter #2 R Assignment

1. Create a scatterplot of weight versus quarter mile times for the dataset “mtcars”. Assume the independent variable is the quarter mile times and the dependent variable is the weight.

2. Find the correlation of weight versus quarter mile times for the dataset “mtcars”.
