Marc Mehlman Marc Mehlman
Bivariate Data
Marc H. Mehlman
marcmehlman@yahoo.com
University of New Haven
Marc Mehlman (University of New Haven) Bivariate Data 1 / 36
Table of Contents
1 Bivariate Data
2 Scatterplots
3 Correlation
4 Two–Way Tables
5 Chapter #2 R Assignment
Bivariate Data
Scatterplots
For each individual studied, we record data on two variables. We then examine whether there is a relationship between these two variables: do changes in one variable tend to be associated with specific changes in the other variable?
Here we have two quantitative variables recorded for each of 16 students: the number of beers drunk and the blood alcohol content (BAC).
Student ID   Number of Beers   Blood Alcohol Content
1            5                 0.1
2            2                 0.03
3            9                 0.19
4            8                 0.12
5            3                 0.04
6            7                 0.095
7            3                 0.07
8            5                 0.06
9            3                 0.02
10           5                 0.05
11           4                 0.07
12           6                 0.1
13           5                 0.085
14           7                 0.09
15           1                 0.01
16           4                 0.05
A scatterplot is used to display quantitative bivariate data. Each variable makes up one axis. Each individual is a point on the graph.
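As a rough illustration, the Python sketch below (standard library only) treats each student as one (beers, BAC) point and draws a coarse text scatterplot. The helper `ascii_scatter` and its `width` and `height` parameters are hypothetical names introduced for this example, and it assumes neither variable is constant; in practice a plotting library would normally be used.

```python
# Beers and BAC for the 16 students, listed in student-ID order (1 to 16).
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]


def ascii_scatter(xs, ys, width=40, height=12):
    """Return a text scatterplot: x runs left to right, y bottom to top.

    Hypothetical helper; assumes xs and ys each contain at least two
    distinct values (otherwise the scaling would divide by zero).
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each coordinate onto the character grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width


# Each individual is one point: (value on the x-axis, value on the y-axis).
points = list(zip(beers, bac))
plot = ascii_scatter(beers, bac)
print(plot)
```

Students with identical coordinates, or coordinates that round to the same grid cell, share one `*`, so the picture can show fewer than 16 marks.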
[Figure: scatterplot titled "girth vs height"; axes: trees$Height and trees$Girth]
Same data in all four plots.
Both variables should be given a similar amount of space:
Plot is roughly square
Points should occupy all the plot space (no blank space)
After plotting two variables on a scatterplot, we describe the overall pattern of the relationship. Specifically, we look for:
Form: linear, curved, clusters, no pattern
Direction: positive, negative, no direction
Strength: how closely the points fit the "form"
We also look for clear deviations from that pattern:
Outliers of the relationship
[Figure: three example scatterplots showing form: linear, nonlinear, no relationship]
Positive association: High values of one variable tend to occur together with high values of the other variable.
Negative association: High values of one variable tend to occur together with low values of the other variable.
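One simple numeric check of direction, offered here as an assumption rather than as the slides' method, is the sign of the summed products of deviations from the means. The function name `direction` and both small data sets are hypothetical.

```python
def direction(xs, ys):
    """Classify the direction of association from the sign of the
    summed products of deviations from the two means (hypothetical helper)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if s > 0:
        return "positive"   # high x tends to go with high y
    if s < 0:
        return "negative"   # high x tends to go with low y
    return "no direction"


# Made-up illustrative data, not taken from the slides.
x = [1, 2, 3, 4, 5]
y_up = [52, 60, 61, 70, 78]     # rises as x rises
y_down = [90, 70, 65, 40, 30]   # falls as x rises

print(direction(x, y_up))
print(direction(x, y_down))
```

This only captures the direction of a roughly linear trend; a curved pattern can give a misleading sign, so the scatterplot should still be inspected.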
The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.
An outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected). In a scatterplot, outliers are points that fall outside the overall pattern of the relationship.
Two or more relationships can be compared on a single scatterplot when we use different symbols for groups of points on the graph.
The graph compares the association between thorax length and longevity of male fruit flies that are allowed to reproduce (green) or not (purple). The pattern is similar in both groups (linear, positive association), but male fruit flies not allowed to reproduce tend to live longer than reproducing male fruit flies of the same size.
Correlation
\[
r = \frac{n\sum_{j=1}^{n} x_j y_j - \left(\sum_{j=1}^{n} x_j\right)\left(\sum_{j=1}^{n} y_j\right)}
         {\sqrt{n\sum_{j=1}^{n} x_j^2 - \left(\sum_{j=1}^{n} x_j\right)^2}\;
          \sqrt{n\sum_{j=1}^{n} y_j^2 - \left(\sum_{j=1}^{n} y_j\right)^2}}
\]
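The sums in this formula can be computed directly. The following is a minimal Python sketch, assuming the formula is the usual computational form of the sample correlation r; the function name `correlation` is hypothetical, and the data are the beers and BAC values for the 16 students in student-ID order.

```python
from math import sqrt


def correlation(xs, ys):
    """Sample correlation from raw sums (hypothetical helper).

    Uses n, sum of x, sum of y, sum of x*y, sum of x^2 and sum of y^2.
    Assumes neither variable is constant, so the denominator is nonzero.
    """
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (sqrt(n * sum_x2 - sum_x ** 2)
                   * sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

r = correlation(beers, bac)
print(round(r, 3))  # roughly 0.89: a strong positive linear association
```

Points lying exactly on an increasing line, such as `correlation([1, 2, 3], [2, 4, 6])`, give a value of 1 up to floating-point rounding.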
[Figure: two scatterplots, each with r = -0.75; axes: standardized value of x (unitless), standardized value of y (unitless)]
Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers.
Just moving one point away from the linear pattern here weakens the correlation from −0.91 to −0.75 (closer to zero).
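The same effect can be reproduced on a small scale. The sketch below uses made-up data (not the data behind the figure) and a hypothetical helper `correlation` that follows the means-and-standard-deviations description: it averages the products of standardized values, dividing by n − 1.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation via standardized values (hypothetical helper)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    n = len(xs)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Illustrative data: ten points on a perfectly decreasing line.
xs = list(range(1, 11))
ys = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
r_before = correlation(xs, ys)

# Move only the last point far away from the linear pattern.
ys_moved = ys[:-1] + [10]
r_after = correlation(xs, ys_moved)

print(round(r_before, 2), round(r_after, 2))
```

A single moved point pulls the mean and inflates the standard deviation of y, which is why r moves noticeably toward zero.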
Establishing causation from an observed association can be done if:
1) The association is strong.
2) The association is consistent.
3) Higher doses are associated with stronger responses.
4) The alleged cause precedes the effect.
5) The alleged cause is plausible.
Lung cancer is clearly associated with smoking. What if a genetic mutation (lurking variable) caused people to both get lung cancer and become addicted to smoking? It took years of research and accumulated indirect evidence to reach the conclusion that smoking causes lung cancer.
Two–Way Tables
2,367/4,826, 2,459/4,826.
96/4,826    98/4,826
426/4,826   286/4,826
696/4,826   720/4,826
663/4,826   758/4,826
486/4,826   597/4,826
663/1,421, 758/1,421.
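The fractions on these slides can be recomputed from the cell counts. The sketch below is a minimal Python illustration, assuming the ten numerators form a 5-by-2 table of counts; the row and column labels did not survive in the text, so rows and columns are referred to only by position, and all variable names are hypothetical.

```python
from fractions import Fraction

# Cell counts: 5 rows by 2 columns (labels unknown, so indexed by position).
counts = [
    [96, 98],
    [426, 286],
    [696, 720],
    [663, 758],
    [486, 597],
]

total = sum(sum(row) for row in counts)

# Joint distribution: each cell divided by the grand total.
joint = [[Fraction(c, total) for c in row] for row in counts]

# Marginal distribution of the column variable: column totals over grand total.
col_totals = [sum(row[j] for row in counts) for j in range(2)]
col_marginal = [Fraction(t, total) for t in col_totals]

# Conditional distribution of the column variable given the fourth row:
# the fourth row's cells divided by the fourth row's total.
row_totals = [sum(row) for row in counts]
conditional = [Fraction(c, row_totals[3]) for c in counts[3]]

print(total, col_totals, row_totals[3])
print([float(p) for p in conditional])
```

`Fraction` keeps the arithmetic exact, so the joint and marginal distributions each sum to exactly 1.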
When studying the relationship between two variables, a lurking variable may reverse the direction of the relationship: the direction seen when the lurking variable is ignored is the opposite of the direction seen when it is taken into account. The lurking variable creates subgroups, and failure to take these subgroups into consideration can lead to misleading conclusions regarding the association between the two variables. An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson's paradox.
Consider the acceptance rates for the following groups of men and women who applied to college.
A higher percentage of men were accepted: Is there evidence of discrimination?
Consider the acceptance rates when broken down by type of school.
[Tables: BUSINESS SCHOOL, ART SCHOOL]
Lurking variable: Applications were split between the Business School (240) and the Art School (320). Within each school, a higher percentage of women were accepted than men, so there is no discrimination against women. This is an example of Simpson's paradox. When the lurking variable (type of school: Business or Art) is ignored, the data seem to suggest discrimination against women. However, when the type of school is considered, the association is reversed and suggests discrimination against men.
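The reversal can be checked with a few lines of arithmetic. The counts in the Python sketch below are illustrative assumptions, chosen only to be consistent with the application totals quoted above (240 and 320); they are not taken from the slides, and all names are hypothetical.

```python
# (accepted, applied) by school and gender.
# ASSUMPTION: illustrative counts, not the slides' actual tables.
applications = {
    "Business": {"men": (18, 120), "women": (24, 120)},
    "Art": {"men": (180, 240), "women": (64, 80)},
}


def rate(accepted, applied):
    """Acceptance rate as a proportion."""
    return accepted / applied


# Within each school: compare acceptance rates for men and women.
within = {
    school: {sex: rate(*cell) for sex, cell in by_sex.items()}
    for school, by_sex in applications.items()
}

# Ignoring the school (the lurking variable): pool both schools.
combined = {}
for sex in ("men", "women"):
    accepted = sum(applications[s][sex][0] for s in applications)
    applied = sum(applications[s][sex][1] for s in applications)
    combined[sex] = rate(accepted, applied)

for school, rates in within.items():
    print(school, rates)
print("Combined", combined)
```

With these counts, women have the higher rate inside each school, yet men have the higher rate once the schools are pooled, because most men applied to the school with the higher acceptance rate.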
Chapter #2 R Assignment
1 Create a scatterplot of weight versus quarter mile times for the
2 Find the correlation of weight versus quarter mile times for the