 
Correlation and Regression
Marc H. Mehlman
marcmehlman@yahoo.com
University of New Haven

"All models are wrong. Some models are useful." – George Box

"... the statistician knows ... that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world." – George Box

Marc Mehlman (University of New Haven) Correlation and Regression 1 / 64

Table of Contents
1 Bivariate Data
2 Correlation
3 Simple Regression
4 Variation
5 Logistic Regression
6 Logistic Regression
7 Chapter #9 R Assignment

Bivariate Data and Scatterplots

Bivariate Data
Bivariate data comes from measuring two aspects of the same item or individual. For instance,
(70, 178), (72, 192), (74, 184), (68, 181)
is a random sample of size four obtained from four male college students. The bivariate data gives the height in inches and the weight in pounds of each of the four students. The third student sampled is 74 inches tall and weighs 184 pounds.
Can one variable be used to predict the other? Do tall people tend to weigh more?
Definition
A response (or dependent) variable measures the outcome of a study. The explanatory (or independent) variable is the one that predicts the response variable.

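The height and weight sample above can be entered in R as a small data frame. This is only a sketch: the names students, height and weight are chosen here for illustration, and treating height as the explanatory variable is an assumption suggested by the question "Do tall people tend to weigh more?".

```r
# Heights (inches) and weights (pounds) of the four sampled students
students <- data.frame(height = c(70, 72, 74, 68),
                       weight = c(178, 192, 184, 181))

# The third student sampled (74 inches, 184 pounds)
students[3, ]

# Assumed roles: height is explanatory (x-axis), weight is the response (y-axis)
plot(weight ~ height, data = students,
     xlab = "Height (in)", ylab = "Weight (lb)",
     main = "weight vs height")
```

With only four points the plot shows how bivariate data is laid out; it is far too small a sample to judge any relationship.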
Bivariate Data
Bivariate data
- For each individual studied, we record data on two variables.
- We then examine whether there is a relationship between these two variables: do changes in one variable tend to be associated with specific changes in the other variable?

Here we have two quantitative variables recorded for each of 16 students:
1. how many beers they drank
2. their resulting blood alcohol content (BAC)

Student ID   Number of Beers   Blood Alcohol Content
     1              5                 0.1
     2              2                 0.03
     3              9                 0.19
     6              7                 0.095
     7              3                 0.07
     9              3                 0.02
    11              4                 0.07
    13              5                 0.085
     4              8                 0.12
     5              3                 0.04
     8              5                 0.06
    10              5                 0.05
    12              6                 0.1
    14              7                 0.09
    15              1                 0.01
    16              4                 0.05

Bivariate Data
Scatterplots
A scatterplot is used to display quantitative bivariate data. Each variable makes up one axis. Each individual is a point on the graph.
[Figure: scatterplot of the Beers and BAC data for the 16 students listed above.]

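A scatterplot of the beer data can be sketched in R as follows. The data frame name beer and its column names are made up for this example, and putting the number of beers on the horizontal axis is an assumption (it is the natural explanatory variable here).

```r
# Beer and BAC data for the 16 students, in the order listed on the slide
beer <- data.frame(
  student = c(1, 2, 3, 6, 7, 9, 11, 13, 4, 5, 8, 10, 12, 14, 15, 16),
  beers   = c(5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4),
  bac     = c(0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
              0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05)
)

# One axis per variable, one point per student
plot(bac ~ beers, data = beer,
     xlab = "Number of beers", ylab = "Blood alcohol content",
     main = "BAC vs beers")
```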
Bivariate Data
> plot(trees$Girth~trees$Height,main="girth vs height")
[Figure: scatterplot titled "girth vs height", with trees$Height on the horizontal axis (about 65 to 85) and trees$Girth on the vertical axis (about 8 to 20).]

Bivariate Data
How to scale a scatterplot
[Figure: four scatterplots with different scalings. Same data in all four plots.]
Both variables should be given a similar amount of space:
- Plot is roughly square
- Points should occupy all the plot space (no blank space)

Bivariate Data
Interpreting scatterplots
After plotting two variables on a scatterplot, we describe the overall pattern of the relationship. Specifically, we look for ...
- Form: linear, curved, clusters, no pattern
- Direction: positive, negative, no direction
- Strength: how closely the points fit the "form"
... and clear deviations from that pattern:
- Outliers of the relationship

Bivariate Data
Form
[Figure: three scatterplot panels titled "Linear", "No relationship" and "Nonlinear".]

Bivariate Data
Direction
Positive association: High values of one variable tend to occur together with high values of the other variable.
Negative association: High values of one variable tend to occur together with low values of the other variable.

Bivariate Data
Strength
The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.

Bivariate Data
Outliers
An outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected).
In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship.

Bivariate Data
Adding categorical variables to scatterplots
Two or more relationships can be compared on a single scatterplot when we use different symbols for groups of points on the graph.
The graph compares the association between thorax length and longevity of male fruit flies that are allowed to reproduce (green) or not (purple). The pattern is similar in both groups (linear, positive association), but male fruit flies not allowed to reproduce tend to live longer than reproducing male fruit flies of the same size.

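The fruit fly data is not included in these slides, so the sketch below uses R's built-in iris data purely as a stand-in to show the technique: one scatterplot, with a different colour and plotting symbol for each group. The colour choices and the variable group_cols are assumptions made for this example.

```r
# iris is a built-in R dataset, used here only as a stand-in for the fruit fly data
group_cols <- c("darkgreen", "purple", "orange")
group_id   <- as.integer(iris$Species)   # 1, 2 or 3 according to species

# Colour and symbol both encode the categorical variable Species
plot(Sepal.Length ~ Petal.Length, data = iris,
     col = group_cols[group_id], pch = group_id,
     main = "sepal length vs petal length by species")
legend("topleft", legend = levels(iris$Species),
       col = group_cols, pch = 1:3)
```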
Correlation

Correlation
Definition
Given the bivariate data $(x_1, y_1), \dots, (x_n, y_n)$, the sample correlation coefficient (sample Pearson product-moment correlation coefficient) is
\[ r \overset{\text{def}}{=} \frac{1}{n-1} \sum_{j=1}^{n} \left( \frac{x_j - \bar{x}}{s_x} \right) \left( \frac{y_j - \bar{y}}{s_y} \right). \]
The population correlation coefficient is denoted as
\[ \rho \overset{\text{def}}{=} \frac{1}{N} \sum_{j=1}^{N} \left( \frac{x_j - \mu_X}{\sigma_X} \right) \left( \frac{y_j - \mu_Y}{\sigma_Y} \right), \]
where the above sum is summed over the entire population of size $N$. One thinks of $r$ as an estimator of $\rho$.

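The definition of r can be checked numerically against R's built-in cor(). The sketch below uses the trees data that appears elsewhere in these slides; the variable names zx, zy and r_def are chosen here for illustration.

```r
x <- trees$Girth
y <- trees$Height
n <- length(x)

# Standardize each variable with its sample mean and sample standard deviation
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# Sample correlation from the definition: sum of products divided by n - 1
r_def <- sum(zx * zy) / (n - 1)

r_def
cor(x, y)   # should agree with r_def up to rounding error
```

Note that sd() divides by n - 1, which matches the s_x and s_y in the definition.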
Correlation
One can also use the formula
\[ r = \frac{ n \left( \sum_{j=1}^{n} x_j y_j \right) - \left( \sum_{j=1}^{n} x_j \right) \left( \sum_{j=1}^{n} y_j \right) }{ \sqrt{ n \left( \sum_{j=1}^{n} x_j^2 \right) - \left( \sum_{j=1}^{n} x_j \right)^2 } \; \sqrt{ n \left( \sum_{j=1}^{n} y_j^2 \right) - \left( \sum_{j=1}^{n} y_j \right)^2 } }. \]
R command:
> cor(trees$Girth,trees$Height)
[1] 0.5192801

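As a quick check, the computational formula can be evaluated directly in R and compared with cor(). This is a sketch on the same trees data; the name r_formula is chosen here for illustration.

```r
x <- trees$Girth
y <- trees$Height
n <- length(x)

# Numerator and the two square-root factors of the computational formula
numerator <- n * sum(x * y) - sum(x) * sum(y)
denom_x   <- sqrt(n * sum(x^2) - sum(x)^2)
denom_y   <- sqrt(n * sum(y^2) - sum(y)^2)

r_formula <- numerator / (denom_x * denom_y)

r_formula
cor(x, y)   # should agree with r_formula up to rounding error
```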
Correlation
The correlation coefficient measures the strength of any linear relationship between X and Y.
Properties of Correlation:
- cor(X, Y) = cor(Y, X).
- -1 ≤ r ≤ 1, and r is scale invariant.
- If r is positive, there is a positive linear relationship between the two variables.
- If r is negative, there is a negative linear relationship between the two variables.
- The closer |r| is to one, the stronger the linear relationship between the two variables.
- If |r| = 1 (i.e., r = 1 or r = -1), all the data points lie on a straight line.

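Several of these properties can be illustrated numerically. The sketch below again uses the trees data; the rescaling factors and the line 3 + 2x are arbitrary choices made for this example, and comparisons use all.equal() because the results are only equal up to floating-point rounding.

```r
x <- trees$Girth
y <- trees$Height

# Symmetry: cor(X, Y) = cor(Y, X)
symmetric <- isTRUE(all.equal(cor(x, y), cor(y, x)))

# Scale invariance: multiplying each variable by a positive constant
# (for example, a change of units) leaves r unchanged
scale_invariant <- isTRUE(all.equal(cor(2.54 * x, 0.3048 * y), cor(x, y)))

# Points exactly on a line give |r| = 1; the sign follows the slope
r_up   <- cor(x, 3 + 2 * x)   # increasing line
r_down <- cor(x, 3 - 2 * x)   # decreasing line

symmetric
scale_invariant
c(r_up, r_down)
```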