SLIDE 1
Correlation and Regression

Marc H. Mehlman

marcmehlman@yahoo.com

University of New Haven

“All models are wrong. Some models are useful.” – George Box

“· · · the statistician knows · · · that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.” – George Box

SLIDE 2

Table of Contents

1. Bivariate Data
2. Correlation
3. Simple Regression
4. Variation
5. Multivariate Regression
6. Logistic Regression
7. Chapter #9 R Assignment

SLIDE 3

Bivariate Data and Scatterplots

SLIDE 4

Bivariate data comes from measuring two aspects of the same item/individual. For instance, (70, 178), (72, 192), (74, 184), (68, 181) is a random sample of size four obtained from four male college students. The bivariate data gives the height in inches and the weight in pounds of each of the four students. The third student sampled is 74 inches tall and weighs 184 pounds. Can one variable be used to predict the other? Do tall people tend to weigh more?

Definition. A response (or dependent) variable measures the outcome of a study. The explanatory (or independent) variable is the one that predicts the response variable.
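The data from this example can be entered and displayed in R; a minimal sketch (the variable names are ours):

> height = c(70, 72, 74, 68)     # explanatory variable, in inches
> weight = c(178, 192, 184, 181) # response variable, in pounds
> plot(weight ~ height, main = "weight vs height")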

SLIDE 5

Student ID   # of Beers   Blood Alcohol Content
 1           5            0.1
 2           2            0.03
 3           9            0.19
 4           8            0.12
 5           3            0.04
 6           7            0.095
 7           3            0.07
 8           5            0.06
 9           3            0.02
10           5            0.05
11           4            0.07
12           6            0.1
13           5            0.085
14           7            0.09
15           1            0.01
16           4            0.05

Here we have two quantitative variables recorded for each of 16 students:

  • 1. how many beers they drank
  • 2. their resulting blood alcohol content (BAC)

Bivariate data

- For each individual studied, we record data on two variables.
- We then examine whether there is a relationship between these two variables: do changes in one variable tend to be associated with specific changes in the other variable?
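As a sketch (variable names are ours), one could enter the 16 observations in R, ordered by student ID, and examine the relationship:

> beers = c(5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4)
> BAC = c(0.1, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
+         0.02, 0.05, 0.07, 0.1, 0.085, 0.09, 0.01, 0.05)
> plot(BAC ~ beers)   # scatterplot of BAC against beers drunk
> cor(beers, BAC)     # about 0.89: a strong positive linear association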

SLIDE 6

(The same beers/BAC data as on the previous slide.)

Scatterplots

A scatterplot is used to display quantitative bivariate data. Each variable makes up one axis. Each individual is a point on the graph.

SLIDE 7

> plot(trees$Girth~trees$Height,main="girth vs height")

[Scatterplot titled "girth vs height": trees$Girth (y-axis, 8–20) against trees$Height (x-axis, 65–85).]

SLIDE 8

How to scale a scatterplot

Same data in all four plots. Both variables should be given a similar amount of space:
- The plot is roughly square.
- The points should occupy all the plot space (no blank space).

SLIDE 9

Interpreting scatterplots

After plotting two variables on a scatterplot, we describe the overall pattern of the relationship. Specifically, we look for:
- Form: linear, curved, clusters, no pattern
- Direction: positive, negative, no direction
- Strength: how closely the points fit the "form"
... and clear deviations from that pattern:
- Outliers of the relationship

SLIDE 10

Form

[Three example scatterplots: linear, nonlinear, no relationship.]

SLIDE 11

Direction

Positive association: High values of one variable tend to occur together with high values of the other variable.
Negative association: High values of one variable tend to occur together with low values of the other variable.

SLIDE 12

Strength

The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.


SLIDE 13

Outliers

An outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected). In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship.

SLIDE 14

Adding categorical variables to scatterplots

Two or more relationships can be compared on a single scatterplot when we use different symbols for groups of points on the graph.

[Scatterplot comparing the association between thorax length and longevity of male fruit flies that are allowed to reproduce (green) or not (purple).]

The pattern is similar in both groups (linear, positive association), but male fruit flies not allowed to reproduce tend to live longer than reproducing male fruit flies of the same size.

SLIDE 15

Correlation

SLIDE 16

Definition. Given the bivariate data $(x_1, y_1), \dots, (x_n, y_n)$, the sample correlation coefficient (sample Pearson product-moment correlation coefficient) is
$$ r \;\stackrel{\text{def}}{=}\; \frac{1}{n-1}\sum_{j=1}^{n}\left(\frac{x_j-\bar{x}}{s_x}\right)\left(\frac{y_j-\bar{y}}{s_y}\right). $$
The population correlation coefficient is denoted
$$ \rho \;\stackrel{\text{def}}{=}\; \frac{1}{N}\sum_{j=1}^{N}\left(\frac{x_j-\mu_X}{\sigma_X}\right)\left(\frac{y_j-\mu_Y}{\sigma_Y}\right), $$
where the above sum is taken over the entire population of size N.

One thinks of r as an estimator of ρ.
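As a quick check (a sketch; the object names are ours), r can be computed directly from the definition and compared with R's built-in cor():

> x = trees$Height; y = trees$Girth
> r = sum(((x - mean(x))/sd(x)) * ((y - mean(y))/sd(y)))/(length(x) - 1)
> r   # agrees with cor(x, y) = 0.5192801, shown on the next slide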

SLIDE 17

One can also use the formula
$$ r = \frac{n\sum_{j=1}^{n}x_j y_j - \left(\sum_{j=1}^{n}x_j\right)\left(\sum_{j=1}^{n}y_j\right)}{\sqrt{n\sum_{j=1}^{n}x_j^{2} - \left(\sum_{j=1}^{n}x_j\right)^{2}}\;\sqrt{n\sum_{j=1}^{n}y_j^{2} - \left(\sum_{j=1}^{n}y_j\right)^{2}}}. $$
R command:

> cor(trees$Girth,trees$Height)
[1] 0.5192801


SLIDE 19

The correlation coefficient measures the strength of any linear relationship between X and Y.

Properties of Correlation:
- cor(X, Y) = cor(Y, X).
- −1 ≤ r ≤ 1, and r is scale invariant.
- If r is positive, there is a positive linear relationship between the two variables.
- If r is negative, there is a negative linear relationship between the two variables.
- The closer |r| is to one, the stronger the linear relationship between the two variables.
- If |r| = 1 (i.e., r = 1 or r = −1), all the data points lie on a straight line.

SLIDE 25

r has no unit

[Figure: the same data plotted in original and standardized units – standardized value of x (unitless) against standardized value of y (unitless); r = −0.75 in both plots.]

SLIDE 26

r is not resistant to outliers

Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers. Just moving one point away from the linear pattern here weakens the correlation from −0.91 to −0.75 (closer to zero).

SLIDE 27

[Figure: a gallery of scatterplots illustrating various correlation values.]

SLIDE 28

Caution: Correlation is not Causation

Definition. When calculating correlation, a lurking variable is a third factor that explains the relationship between the two correlated variables.

Example (Lurking Variables).
- There is a strong correlation between shoe size and reading skills among elementary school children. The lurking variable is · · ·
- There is a strong correlation between the number of firefighters at a fire site and the amount of damage. The lurking variable is · · ·

Caution: Beware correlations based on averaged data. While there is a strong correlation between average age and average height among children, the correlation between age and height for individual children is much, much lower.

SLIDE 29

Definition. Two variables are confounded when their effects on the response variable cannot be distinguished from each other. The confounded variables can be either explanatory or lurking variables (or only work in the presence of each other). The only way to distinguish between two confounded variables is to redesign the experiment.

Example. When I'm stressed, I get muscle cramps. However, when I'm stressed, I also drink lots of coffee and lose sleep. Are the cramps caused by stress, or coffee, or lack of sleep, or some combination of the above?

Example. A classic example of confounding: a study suggests that people who carry matches are more likely to develop lung cancer. Is it the matches, or is there confounding here with a lurking variable?

SLIDE 32

Establishing causation

Establishing causation from an observed association can be done if:
1) The association is strong.
2) The association is consistent.
3) Higher doses are associated with stronger responses.
4) The alleged cause precedes the effect.
5) The alleged cause is plausible.

Lung cancer is clearly associated with smoking. What if a genetic mutation (lurking variable) caused people to both get lung cancer and become addicted to smoking? It took years of research and accumulated indirect evidence to reach the conclusion that smoking causes lung cancer.

SLIDE 33

Theorem (Test for Correlation). Let H0 : ρ = 0 vs H1 : ρ ≠ 0. The test statistic is
$$ t = \frac{r}{\sqrt{\dfrac{1-r^{2}}{n-2}}} \sim t(n-2) \quad\text{under } H_0. $$
One can also use Table A-6 with test statistic |r|. R command: cor.test(X, Y) (one can also do one-sided tests with R).

SLIDE 34

Using R

Example

> cor.test(trees$Girth,trees$Height)

	Pearson's product-moment correlation

data:  trees$Girth and trees$Height
t = 3.2722, df = 29, p-value = 0.002758
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2021327 0.7378538
sample estimates:
      cor 
0.5192801 

Note that one is assuming that the (trees$Height, trees$Girth) pairs are sampled from a bivariate normal distribution.

SLIDE 35

Example. Each day, for the last 63 days, measurements of the time Joe spends sleeping and the time he spends watching TV are taken. Assume time spent sleeping and time spent watching TV form a bivariate normal random variable. A sample correlation of r = 0.12 is calculated. Find the p-value of H0 : ρ = 0 versus HA : ρ ≠ 0.

Solution:

> tstat=0.12*sqrt((63-2)/(1-0.12^2))
> tstat
[1] 0.9440518
> 2*(1-pt(tstat,61))
[1] 0.3488675

There is little evidence that the time Joe spends sleeping and the time Joe spends watching TV are correlated.

SLIDE 37

Simple Regression

SLIDE 38

Let X = the predictor (independent) variable and Y = the response (dependent) variable. Given a bivariate random variable (X, Y), is there a linear (straight-line) association between X and Y (plus some randomness)? And if so, what is it, and how much randomness?

Definition (Statistical Model of Simple Linear Regression). Given a predictor x, the response y is
$$ y = \beta_0 + \beta_1 x + \epsilon_x, $$
where β0 + β1x is the mean response for x. The noise terms, the εx's, are assumed to be independent of each other and to be randomly sampled from N(0, σ). The parameters of the model are β0, β1 and σ.
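To make the model concrete, here is a minimal simulation sketch in R (the parameter values β0 = 3, β1 = 0.5, σ = 2 and all object names are ours):

> set.seed(1)
> x = runif(50, 60, 90)   # predictor values
> eps = rnorm(50, 0, 2)   # noise terms sampled from N(0, sigma)
> y = 3 + 0.5*x + eps     # responses from the model
> plot(y ~ x)             # linear trend plus scatter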

SLIDE 39

Conditions for Regression Inference

[Figure: the regression model when the conditions are met.] The line in the figure is the population regression line μy = β0 + β1x. The Normal curves show how y will vary when x is held fixed at different values. All the curves have the same standard deviation σ, so the variability of y is the same for all values of x. The value of σ determines whether the points fall close to the population regression line (small σ) or are widely scattered (large σ). For each possible value of the explanatory variable x, the mean of the responses μ(y | x) moves along this line.

SLIDE 40

[Four scatterplots, each with the same fitted line ŷ = 3 + 0.5x:]
- Moderate linear association; regression OK.
- Obvious nonlinear relationship; regression inappropriate.
- One extreme outlier, requiring further examination.
- Only two values for x; a redesign is due here…

SLIDE 41

Given a bivariate random sample from the simple linear regression model, (x1, y1), (x2, y2), ..., (xn, yn), one wishes to estimate the parameters of the model, (β0, β1, σ).

Given an arbitrary line y = mx + b, define the sum of the squares of errors to be
$$ \sum_{i=1}^{n}\left[y_i - (mx_i + b)\right]^{2}. $$
Using calculus, one can find the least-squares regression line, y = b0 + b1x, that minimizes the sum of squares of errors.

SLIDE 42

Theorem (Estimating β0 and β1). Given the bivariate random sample (x1, y1), ..., (xn, yn), the least-squares regression line y = b0 + b1x is obtained by letting
$$ b_1 = r\,\frac{s_y}{s_x} \quad\text{and}\quad b_0 = \bar{y} - b_1\bar{x}, $$
where b0 is an unbiased estimator of β0 and b1 is an unbiased estimator of β1.

Note: The point (x̄, ȳ) will lie on the regression line, though there is no reason to believe that (x̄, ȳ) is one of the data points. One can also calculate b1 using
$$ b_1 = \frac{n\sum_{j=1}^{n}x_j y_j - \left(\sum_{j=1}^{n}x_j\right)\left(\sum_{j=1}^{n}y_j\right)}{n\sum_{j=1}^{n}x_j^{2} - \left(\sum_{j=1}^{n}x_j\right)^{2}}. $$
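As a sketch, one can verify these formulas on the trees data and compare with lm() (object names are ours):

> x = trees$Height; y = trees$Girth
> b1 = cor(x, y)*sd(y)/sd(x)
> b0 = mean(y) - b1*mean(x)
> c(b0, b1)   # matches coef(lm(Girth ~ Height, data = trees)): -6.1883945, 0.2557471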

SLIDE 43

Example

> plot(trees$Girth~trees$Height,main="girth vs height")
> abline(lm(trees$Girth ~ trees$Height), col="red")

[Scatterplot "girth vs height" with the fitted regression line drawn in red.]

Since both variables come from "trees", in order for the R command "lm" (linear model) to work, "trees" has to be in the R format "data.frame".

> class(trees) # "trees" is in data.frame format - lm will work.
[1] "data.frame"
> g.lm=lm(Girth~Height,data=trees)
> coef(g.lm)
(Intercept)      Height 
 -6.1883945   0.2557471 

SLIDE 45

Definition. The predicted value of y at xj is
$$ \hat{y}_j \;\stackrel{\text{def}}{=}\; b_0 + b_1 x_j. $$
The predicted value, ŷ, is an unbiased estimator of the mean response, μy.

Example. Using the R dataset "trees", one wants the predicted girth of three trees of heights 74, 83 and 91, respectively. One uses the regression model "girth ~ height" for our predictions. The work below is done in R.

> g.lm=lm(Girth~Height,data=trees)
> predict(g.lm,newdata=data.frame(Height=c(74,83,91)))
       1        2        3 
12.73689 15.03862 17.08459 

SLIDE 46

“Never make forecasts, especially about the future.” – Samuel Goldwyn

The regression line only has predictive value for y at x if:
1. ρ is not ≈ 0 (if there is no significant linear correlation, don't use the regression line for predictions; if ρ ≈ 0, then ȳ is the best predictor of y at x).
2. one predicts y only for x's within the range of the xj's – one does not predict the girth of a tree with a height of 1000 feet. Interpolate, don't extrapolate.

|r| (or r²) is a measure of how well the regression equation fits the data: bigger |r| ⇒ data fits the regression line better ⇒ better prediction.

SLIDE 47

Outliers and influential points

Outlier: an observation that lies outside the overall pattern.
“Influential individual”: an observation that markedly changes the regression if removed. This is often an isolated point.

[Scatterplot: Child 19 = outlier (large residual); Child 18 = potential influential individual.]

Child 19 is an outlier of the relationship (it is unusually far from the regression line, vertically). Child 18 is isolated from the rest of the points, and might be an influential point.

SLIDE 48
[Three fits compared: all data; without Child 18 (influential); without Child 19 (outlier).]

Child 18 changes the regression line substantially when it is removed, so Child 18 is indeed an influential point. Child 19 is an outlier of the relationship, but it is not influential (the regression line changes very little upon its removal).

SLIDE 49

Definition. Given a data point (xj, yj), the residual of that point is yj − ŷj.

Note:
1. Outliers are data points with large residuals.
2. The residuals should be approximately N(0, σ).
3. The regression equation gives the smallest possible sum of squared residuals, Σ(y − ŷ)².
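As a sketch using the trees fit from the earlier slides, one can check that the residuals are just the observed values minus the fitted values:

> g.lm = lm(Girth ~ Height, data = trees)
> max(abs(residuals(g.lm) - (trees$Girth - fitted(g.lm))))   # essentially zero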

SLIDE 50

R command for finding residuals:

Example

> g.lm=lm(Girth~Height,data=trees)
> residuals(g.lm)
         1          2          3          4          5          6          7 
-3.4139043 -1.8351687 -1.1236745 -1.7253986 -3.8271227 -4.2386170  0.3090842 
         8          9         10         11         12         13         14 
-1.9926400 -3.1713756 -1.7926400 -2.7156285 -1.8483871 -1.8483871  0.2418428 
        15         16         17         18         19         20         21 
-0.9926400  0.1631072 -2.6501112 -2.5058584  1.7303485  3.6205784  0.2401187 
        22         23         24         25         26         27         28 
-0.0713756  1.7631072  3.7746014  2.7958658  2.7728773  2.7171301  3.6286244 
        29         30         31 
 3.7286244  3.7286244  4.5383945 

SLIDE 51

Definition. Given bivariate data (x1, y1), ..., (xn, yn), the residual plot is a plot of the residuals against the xj's.

If (X, Y) is bivariate normal, the residuals satisfy the Homoscedasticity Assumption:

Definition (Homoscedasticity Assumption). The assumption that the variance around the regression line is the same for all values of the predictor variable X. In other words, the pattern of the spread of the residual points around the x-axis does not change as one travels left to right on the x-axis. There should not be discernible patterns in the residual plot.
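A residual plot as defined above can be drawn directly; a minimal sketch for the trees fit (object names are ours):

> g.lm = lm(Girth ~ Height, data = trees)
> plot(trees$Height, residuals(g.lm), main = "residual plot")
> abline(h = 0, lty = 2)   # the spread about this line should look even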

SLIDE 52

R command for testing whether the linear model applies (residuals approximately N(0, σ)):

Example

> g.lm=lm(Girth~Height,data=trees)
> par(mfrow=c(2,2)) # visualize four graphs at once
> plot(g.lm)
> par(mfrow=c(1,1)) # reset the graphics defaults

[Four diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours); observations 5, 6 and 31 are flagged.]

SLIDE 53

Variation

SLIDE 54

Variation:
$$ \underbrace{y_j - \bar{y}}_{\text{total deviation}} \;=\; \underbrace{\hat{y}_j - \bar{y}}_{\text{explained deviation}} \;+\; \underbrace{y_j - \hat{y}_j}_{\text{unexplained deviation}}. $$
From here, using some math, one gets the following sums of squares (SS):
$$ \underbrace{\sum_{j=1}^{n}(y_j-\bar{y})^2}_{\text{SSTOT = total variation}} \;=\; \underbrace{\sum_{j=1}^{n}(\hat{y}_j-\bar{y})^2}_{\text{SSA = explained variation}} \;+\; \underbrace{\sum_{j=1}^{n}(y_j-\hat{y}_j)^2}_{\text{SSE = unexplained variation}}. $$
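As a sketch (object names are ours), the three sums of squares can be computed for the trees regression and the decomposition checked numerically:

> g.lm = lm(Girth ~ Height, data = trees)
> y = trees$Girth; yhat = fitted(g.lm)
> SSTOT = sum((y - mean(y))^2)
> SSA = sum((yhat - mean(y))^2)
> SSE = sum((y - yhat)^2)
> all.equal(SSTOT, SSA + SSE)   # TRUE: the decomposition holds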

SLIDE 55

Definition. The coefficient of determination is the portion of the variation in y explained by the regression equation:
$$ r^2 \;\stackrel{\text{def}}{=}\; \frac{\text{SSA}}{\text{SSTOT}} = \frac{\sum_{j=1}^{n}(\hat{y}_j-\bar{y})^2}{\sum_{j=1}^{n}(y_j-\bar{y})^2}. $$
Properties of the Coefficient of Determination:
1. r² = (r)² = (correlation coefficient)².
2. r² = proportion of variation of Y that is explained by the linear relationship between X and Y.

Example. Using R, since

> (cor(trees$Girth,trees$Height))^2
[1] 0.2696518

one concludes that approximately 27% of the variation in tree Girth is explained by tree Height and 73% by other factors.

SLIDE 56

- r = −0.3, r² = 0.09, or 9%: the regression model explains not even 10% of the variation in y.
- r = −0.7, r² = 0.49, or 49%: the regression model explains nearly half of the variation in y.
- r = −0.99, r² = 0.9801, or ~98%: the regression model explains almost all of the variation in y.

SLIDE 57

Definition. The variance of the observed yj's about the predicted ŷj's is
$$ s^2 \;\stackrel{\text{def}}{=}\; \frac{\text{SSE}}{n-2} = \frac{\sum (y_j-\hat{y}_j)^2}{n-2} = \frac{\sum y_j^{2} - b_0\sum y_j - b_1\sum x_j y_j}{n-2}, $$
which is an unbiased estimator of σ². The standard error of estimate (also called the residual standard error) is s, an estimator of σ.

Note: (b0, b1, s) is an estimator of the parameters of the simple linear regression model, (β0, β1, σ). Furthermore, b0, b1 and s² are unbiased estimators of β0, β1 and σ².
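As a sketch (object names are ours), one can recover the residual standard error that summary() reports on a later slide:

> g.lm = lm(Girth ~ Height, data = trees)
> sqrt(sum(residuals(g.lm)^2)/(nrow(trees) - 2))   # about 2.728, matching summary(g.lm)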

SLIDE 58

Definition. Let y be a future observation corresponding to x⋆. A (1 − α)100% prediction interval for y is a confidence interval that contains y (1 − α)100% of the time. A prediction interval is a confidence interval that not only has to contend with the variability of the response variable, but also with the fact that β0 and β1 can only be approximated.

Theorem ((1 − α)100% Prediction Interval for y given x = x⋆). A (1 − α)100% prediction interval for y given x = x⋆ is ŷ ± m, where ŷ = b0 + b1x⋆ and the margin of error is
$$ m = t_{\alpha/2}(n-2)\;\underbrace{s\sqrt{1 + \frac{1}{n} + \frac{(x^{\star}-\bar{x})^{2}}{\sum_{j=1}^{n}(x_j-\bar{x})^{2}}}}_{SE_{\hat{y}}}. $$

SLIDE 59

A confidence interval for y:

[Figure: regression line with confidence/prediction bands.]

SLIDE 60

R commands:

Example

> g.lm=lm(Girth~Height,data=trees)
> predict(g.lm,newdata=data.frame(Height=c(74,83,91)),interval="prediction",level=.90)
       fit       lwr      upr
1 12.73689  8.020516 17.45327
2 15.03862 10.238843 19.83839
3 17.08459 11.971691 22.19750
> summary(g.lm)

Call:
lm(formula = Girth ~ Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2386 -1.9205 -0.0714  2.7450  4.5384 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -6.18839    5.96020  -1.038  0.30772   
Height       0.25575    0.07816   3.272  0.00276 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.728 on 29 degrees of freedom
Multiple R-squared:  0.2697, Adjusted R-squared:  0.2445 
F-statistic: 10.71 on 1 and 29 DF,  p-value: 0.002758

SLIDE 61

Multivariate Regression

SLIDE 62

Given multivariate data
$$ (x_1^{(1)}, x_2^{(1)}, \dots, x_k^{(1)}, y_1),\; (x_1^{(2)}, x_2^{(2)}, \dots, x_k^{(2)}, y_2),\; \dots,\; (x_1^{(n)}, x_2^{(n)}, \dots, x_k^{(n)}, y_n), $$
where $(x_1^{(i)}, x_2^{(i)}, \dots, x_k^{(i)})$ is a predictor of the response $y_i$, one explores the following possible model.

Definition (Statistical Model of Multivariate Linear Regression). Given a k-dimensional multivariate predictor $(x_1^{(i)}, x_2^{(i)}, \dots, x_k^{(i)})$, the response $y_i$ is
$$ y_i = \beta_0 + \beta_1 x_1^{(i)} + \cdots + \beta_k x_k^{(i)} + \epsilon_i, $$
where $\beta_0 + \beta_1 x_1^{(i)} + \cdots + \beta_k x_k^{(i)}$ is the mean response. The noise terms, the εi's, are assumed to be independent of each other and to be randomly sampled from N(0, σ). The parameters of the model are β0, β1, ..., βk and σ.

SLIDE 63

Definition. Given a multivariate normal sample
$$ (x_1^{(1)}, \dots, x_k^{(1)}, y_1),\; \dots,\; (x_1^{(n)}, \dots, x_k^{(n)}, y_n), $$
the least-squares multiple regression equation, $\hat{y} = b_0 + b_1 x_1 + \cdots + b_k x_k$, is the linear equation that minimizes
$$ \sum_{j=1}^{n}(\hat{y}_j - y_j)^2, \quad\text{where}\quad \hat{y}_j \;\stackrel{\text{def}}{=}\; b_0 + b_1 x_1^{(j)} + \cdots + b_k x_k^{(j)}. $$

SLIDE 64

There must be at least k + 2 data points to obtain the estimators b0, the bj's and
$$ s^2 \;\stackrel{\text{def}}{=}\; \frac{\sum_{j=1}^{n}(y_j-\hat{y}_j)^2}{n-k-1} $$
of β0, the βj's and σ², where
- b0, the y-intercept, is the unbiased least-squares estimator of β0;
- bj, the coefficient of xj, is the unbiased least-squares estimator of βj;
- s² is an unbiased estimator of σ², and s is an estimator of σ.

Due to computational intensity, computers are used to obtain b0, the bj's and s².

SLIDE 65

Example

> g.lm=lm(mpg~disp+hp+wt+qsec, data=mtcars)
> par(mfrow=c(2,2))
> plot(g.lm)
> par(mfrow=c(1,1))

Does the linear model fit?

[Four diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]

SLIDE 66

Example (cont.)

> summary(g.lm)

Call:
lm(formula = mpg ~ disp + hp + wt + qsec, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8664 -1.5819 -0.3788  1.1712  5.6468 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 27.329638   8.639032   3.164  0.00383 **
disp         0.002666   0.010738   0.248  0.80576   
hp          -0.018666   0.015613  -1.196  0.24227   
wt          -4.609123   1.265851  -3.641  0.00113 **
qsec         0.544160   0.466493   1.166  0.25362   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.622 on 27 degrees of freedom
Multiple R-squared:  0.8351, Adjusted R-squared:  0.8107 
F-statistic: 34.19 on 4 and 27 DF,  p-value: 3.311e-10

SLIDE 67

Inflation Problem: As k increases, R² increases, but the increase in predictability is illusory. Solution: it is best to use the following.

Definition. The adjusted coefficient of determination is
$$ R^2_{\text{adj}} = 1 - \frac{n-1}{n-k-1}\,(1 - R^2). $$
The p-value of a test of H0 : β1 = · · · = βk = 0 versus HA : not H0 is associated with an F-statistic. The p-value of a test of H0 : βj = 0 versus HA : βj ≠ 0 is associated with a t-statistic.
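As a sketch, the formula can be checked against the mtcars fit of the previous slides (n = 32 cars, k = 4 predictors):

> R2 = 0.8351; n = 32; k = 4
> 1 - (n - 1)/(n - k - 1)*(1 - R2)   # 0.8106704, matching the reported 0.8107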

SLIDE 68

Factor Analysis: One strives for the best fit (largest R² and smallest p-value associated with the F-statistic) with the fewest number of independent variables. Independent variables that are "mostly independent" of the dependent variable, or highly correlated with another independent variable, can be discarded. It is an art. Doing this mechanically (on a machine) is called stepwise regression.

SLIDE 69

Logistic Regression

SLIDE 70

A variable that takes on only the values 0 and 1 is a dummy variable, e.g., gender, infected or not, etc.
- If the dummy variable is an independent variable, use the methods of this chapter.
- If the dummy variable is the dependent variable, use logistic regression.

Let Y, the dependent variable, be the dummy variable and use
$$ \tilde{Y} = \ln\!\left(\frac{p}{1-p}\right) $$
in place of Y. Here p is the probability that Y = 1.
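In R, logistic regression is fit with glm() and family = binomial, which models ln(p/(1−p)) as a linear function of the predictors. A minimal sketch on simulated data (all names and parameter values are ours):

> set.seed(2)
> x = rnorm(100)                  # a predictor
> p = 1/(1 + exp(-(-1 + 2*x)))    # true model: logit(p) = -1 + 2x
> y = rbinom(100, 1, p)           # dummy (0/1) response
> fit = glm(y ~ x, family = binomial)
> coef(fit)                       # estimates should be near -1 and 2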

SLIDE 71

Chapter #9 R Assignment

SLIDE 72

(from the book Mathematical Statistics with Applications by Mendenhall, Wackerly and Scheaffer (Fourth Edition, Duxbury, 1990)) Fifteen alligators were captured and two measurements were made on each of the alligators. The weight (in pounds) was recorded with the snout vent length (in inches; this is the distance from the back of the head to the end of the nose). The purpose of using these data is to determine whether there is a relationship, described by a simple linear regression model, between the weight and snout vent length: lnLength ~ lnWeight. The authors analyzed the data on the log scale (natural logarithms) and we will follow their approach for consistency.

> lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
+ 3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78)
> lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
+ 3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)

SLIDE 73

Chapter #9 R Assignment

1. Create a scatterplot of "lnLength" ~ "lnWeight", complete with the regression line.
2. What is the slope and y-intercept of the regression line?
3. Predict "lnLength" when "lnWeight" is five.
4. Use graphs to decide if "lnLength" ~ "lnWeight" satisfies the requirements for being a linear model.
5. Find a 95% prediction interval for "lnLength" when "lnWeight" is five.
6. What is the p-value of a test of H0 : β1 = 0 versus HA : β1 ≠ 0?
7. What is the standard error of estimate?
8. What is the coefficient of determination, R²?
9. What is the explained variation, the unexplained variation and the total variation?
10. What is the F statistic of H0 : β1 = 0 versus HA : β1 ≠ 0, and what is its degrees of freedom?
11. Using the correlation test, what is the p-value of a test of H0 : ρ = 0 versus HA : ρ ≠ 0?

SLIDE 74

First enter into R:

> class(state.x77) # "lm" needs a data.frame not a matrix
[1] "matrix"
> st = as.data.frame(state.x77) # make state.x77 a data.frame
> class(st) # "st" is a data.frame
[1] "data.frame"
> colnames(st)[4] = "Life.Exp" # no spaces in variable names
> colnames(st)[6] = "HS.Grad" # no spaces in variable names

1. Do a multivariate regression with "Life.Exp" as the response variable and "Population", "Income", "Illiteracy", "Murder", "HS.Grad", "Frost" and "Area" as explanatory variables.
   (a) Show that the multivariate regression linear model fits these data.
   (b) What is R² and the adjusted R²?
   (c) Which explanatory variables are relevant at the 0.05 significance level?
2. Do another multivariate regression, but only with explanatory variables "Murder" and "HS.Grad". What is R² and the adjusted R²?
3. Comparing the adjusted R² in the above two problems, what do you conclude?