

slide-1
SLIDE 1

Chapter 7: Introduction to linear regression

OpenIntro Statistics, 3rd Edition

Slides developed by Mine Çetinkaya-Rundel of OpenIntro. The slides may be copied, edited, and/or shared via the CC BY-SA license. Some images may be included under fair use guidelines (educational purposes).

slide-2
SLIDE 2

Line fitting, residuals, and correlation

slide-3
SLIDE 3

Modeling numerical variables

In this unit we will learn to quantify the relationship between two numerical variables, as well as to model numerical response variables using a numerical or categorical explanatory variable.


slide-9
SLIDE 9

Poverty vs. HS graduate rate

The scatterplot below shows the relationship between HS graduate rate in all 50 US states and DC and the % of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).

[Scatterplot: % HS grad (x) vs. % in poverty (y)]

Response variable? % in poverty
Explanatory variable? % HS grad
Relationship? linear, negative, moderately strong

slide-12
SLIDE 12

Quantifying the relationship

  • Correlation describes the strength of the linear association between two variables.
  • It takes values between −1 (perfect negative) and +1 (perfect positive).
  • A value of 0 indicates no linear association.
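The correlation coefficient described above is easy to compute directly. A minimal Python sketch with made-up illustrative numbers (not the state-level poverty data):

```python
import numpy as np

# Illustrative data (not the actual state-level dataset from these slides)
x = np.array([80.0, 83.0, 85.0, 88.0, 91.0])   # e.g. % HS grad
y = np.array([15.0, 14.0, 12.0, 10.0, 8.0])    # e.g. % in poverty

# Pearson correlation: covariance scaled by both standard deviations
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # a strong negative linear association, close to -1
```

Since the fake data fall almost exactly on a decreasing line, r lands near −1; a value near 0 would indicate no linear association.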

slide-13
SLIDE 13

Guessing the correlation

Which of the following is the best guess for the correlation between % in poverty and % HS grad? (a) 0.6 (b) -0.75 (c) -0.1 (d) 0.02 (e) -1.5

[Scatterplot: % HS grad vs. % in poverty]

slide-15
SLIDE 15

Guessing the correlation

Which of the following is the best guess for the correlation between % in poverty and % female householder (no husband present)? (a) 0.1 (b) -0.6 (c) -0.4 (d) 0.9 (e) 0.5

[Scatterplot: % female householder, no husband present vs. % in poverty]



slide-18
SLIDE 18

Assessing the correlation

Which of the following has the strongest correlation, i.e. correlation coefficient closest to +1 or −1?

[Four scatterplot panels: (a) (b) (c) (d)]

(b) → correlation means linear association

slide-19
SLIDE 19

Fitting a line by least squares re- gression


slide-21
SLIDE 21

Eyeballing the line

Which of the following appears to be the line that best fits the linear relationship between % in poverty and % HS grad? Choose one. (a)

[Scatterplot with four candidate lines: (a) (b) (c) (d)]

slide-22
SLIDE 22

Residuals

Residuals are the leftovers from the model fit: Data = Fit + Residual

[Scatterplot: % HS grad vs. % in poverty, with fitted line]

slide-25
SLIDE 25

Residuals (cont.)

Residual
The residual is the difference between the observed (yi) and predicted (ŷi) value:

ei = yi − ŷi

[Scatterplot: % HS grad vs. % in poverty, with residuals for DC (+5.44) and RI (−4.16) marked]

  • % living in poverty in DC is 5.44% more than predicted.
  • % living in poverty in RI is 4.16% less than predicted.
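The residual definition can be checked numerically. A Python sketch using the fitted line from these slides (ŷ = 64.68 − 0.62x) and the DC and RI values quoted in the deck:

```python
# Residual e_i = y_i - yhat_i, using the fitted line from these slides:
# predicted % in poverty = 64.68 - 0.62 * (% HS grad)
def predict(hs_grad):
    return 64.68 - 0.62 * hs_grad

# DC: % HS grad = 86, observed % in poverty = 16.8 (values from the slides)
e_dc = 16.8 - predict(86)
# RI: % HS grad = 81, observed % in poverty = 10.3
e_ri = 10.3 - predict(81)
print(round(e_dc, 2), round(e_ri, 2))  # → 5.44 -4.16
```

A positive residual (DC) means the point sits above the line; a negative residual (RI) means it sits below.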

slide-32
SLIDE 32

A measure for the best line

  • We want a line that has small residuals:
    1. Option 1: Minimize the sum of magnitudes (absolute values) of residuals: |e1| + |e2| + · · · + |en|
    2. Option 2: Minimize the sum of squared residuals (least squares): e1² + e2² + · · · + en²
  • Why least squares?
    1. Most commonly used.
    2. Easier to compute by hand and using software.
    3. In many applications, a residual twice as large as another is usually more than twice as bad.
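The squared-residual criterion can be evaluated in code. A Python sketch with illustrative data (not the state dataset), showing that the least squares fit from `np.polyfit` attains a smaller sum of squared residuals than nearby perturbed lines:

```python
import numpy as np

# Illustrative data (not the state dataset from these slides)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sse(b0, b1):
    """Sum of squared residuals e1^2 + ... + en^2 for the line yhat = b0 + b1*x."""
    e = y - (b0 + b1 * x)
    return float((e ** 2).sum())

# Least squares fit; np.polyfit returns the slope first for deg=1
b1_ls, b0_ls = np.polyfit(x, y, deg=1)

# Any perturbed line has a larger (or equal) sum of squared residuals
assert sse(b0_ls, b1_ls) <= sse(b0_ls, b1_ls + 0.1)
assert sse(b0_ls, b1_ls) <= sse(b0_ls + 0.5, b1_ls)
print("least squares SSE:", round(sse(b0_ls, b1_ls), 4))
```

The same comparison with absolute values instead of squares would implement Option 1; least squares is the criterion the rest of the chapter uses.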

slide-33
SLIDE 33

The least squares line

ŷ = β0 + β1x

where ŷ is the predicted y, β0 is the intercept, β1 is the slope, and x is the explanatory variable.

Notation:

  • Intercept: parameter β0, point estimate b0
  • Slope: parameter β1, point estimate b1

slide-34
SLIDE 34

Given...

[Scatterplot: % HS grad vs. % in poverty]

               % HS grad (x)    % in poverty (y)
mean           x̄ = 86.01        ȳ = 11.35
sd             sx = 3.73         sy = 3.1

correlation: R = −0.75

slide-37
SLIDE 37

Slope

The slope of the regression can be calculated as

b1 = (sy / sx) × R

In context:

b1 = (3.1 / 3.73) × (−0.75) = −0.62

Interpretation: For each additional % point in HS graduate rate, we would expect the % living in poverty to be lower on average by 0.62% points.
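The slope formula can be verified numerically with the summary statistics quoted on the earlier "Given..." slide:

```python
# Slope from summary statistics: b1 = (s_y / s_x) * R
s_x, s_y, R = 3.73, 3.1, -0.75   # values from the slides
b1 = (s_y / s_x) * R
print(round(b1, 2))  # → -0.62
```

Note the sign of the slope always matches the sign of the correlation, since standard deviations are positive.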

slide-38
SLIDE 38

Intercept

Intercept The intercept is where the regression line intersects the y-axis. The calculation of the intercept uses the fact the a regression line always passes through (¯

x, ¯ y). b0 = ¯ y − b1¯ x

16

slide-39
SLIDE 39

Intercept

Intercept The intercept is where the regression line intersects the y-axis. The calculation of the intercept uses the fact the a regression line always passes through (¯

x, ¯ y). b0 = ¯ y − b1¯ x

20 40 60 80 100 10 20 30 40 50 60 70 % HS grad % in poverty intercept

16

slide-40
SLIDE 40

Intercept

The intercept is where the regression line intersects the y-axis. The calculation of the intercept uses the fact that a regression line always passes through (x̄, ȳ):

b0 = ȳ − b1 x̄

[Scatterplot: % HS grad vs. % in poverty, extended to show the intercept]

b0 = 11.35 − (−0.62) × 86.01 = 64.68
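The intercept calculation can be checked the same way, using the means and slope from the slides:

```python
# Intercept from the fact that the line passes through (xbar, ybar):
# b0 = ybar - b1 * xbar
xbar, ybar, b1 = 86.01, 11.35, -0.62   # values from the slides
b0 = ybar - b1 * xbar
print(round(b0, 2))  # → 64.68
```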

slide-41
SLIDE 41

Which of the following is the correct interpretation of the intercept?

(a) For each % point increase in HS graduate rate, % living in poverty is expected to increase on average by 64.68%.
(b) For each % point decrease in HS graduate rate, % living in poverty is expected to increase on average by 64.68%.
(c) Having no HS graduates leads to 64.68% of residents living below the poverty line.
(d) States with no HS graduates are expected on average to have 64.68% of residents living below the poverty line.
(e) In states with no HS graduates % living in poverty is expected to increase on average by 64.68%.


slide-43
SLIDE 43

More on the intercept

Since there are no states in the dataset with no HS graduates, the intercept is of no interest, not very useful, and also not reliable, since the predicted value of the intercept is so far from the bulk of the data.

[Scatterplot: % HS grad vs. % in poverty, extended to show the intercept]

slide-44
SLIDE 44

Regression line

  • predicted % in poverty = 64.68 − 0.62 × % HS grad

[Scatterplot: % HS grad vs. % in poverty, with the fitted regression line]

slide-45
SLIDE 45

Interpretation of slope and intercept

  • Intercept: When x = 0, y is expected to equal the intercept.
  • Slope: For each unit increase in x, y is expected to increase / decrease on average by the slope.

Note: These statements are not causal, unless the study is a randomized controlled experiment.

slide-46
SLIDE 46

Prediction

  • Using the linear model to predict the value of the response variable for a given value of the explanatory variable is called prediction: simply plug the value of x into the linear model equation.
  • There will be some uncertainty associated with the predicted value.

[Scatterplot: % HS grad vs. % in poverty, with the regression line used for prediction]
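Prediction is just plugging x into the fitted equation. A Python sketch using the line estimated in these slides; the 90% graduation-rate input is an illustrative value, not a state from the dataset:

```python
# Predict % in poverty from % HS grad using the fitted line from the slides
def predict_poverty(hs_grad):
    return 64.68 - 0.62 * hs_grad

# e.g. a hypothetical state with a 90% HS graduation rate
print(round(predict_poverty(90), 2))  # → 8.88
```

The point estimate carries uncertainty; the slides return to quantifying that uncertainty in the inference section.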

slide-47
SLIDE 47

Extrapolation

  • Applying a model estimate to values outside of the realm of the original data is called extrapolation.
  • Sometimes the intercept might be an extrapolation.

[Scatterplot: % HS grad vs. % in poverty, extended to show the intercept]

slide-48
SLIDE 48

Examples of extrapolation


slide-53
SLIDE 53

Conditions for the least squares line

  1. Linearity
  2. Nearly normal residuals
  3. Constant variability

slide-54
SLIDE 54

Conditions: (1) Linearity

  • The relationship between the explanatory and the response

variable should be linear.

27

slide-55
SLIDE 55

Conditions: (1) Linearity

  • The relationship between the explanatory and the response

variable should be linear.

  • Methods for fitting a model to non-linear relationships exist,

but are beyond the scope of this class. If this topic is of interest, an Online Extra is available on openintro.org covering new techniques.

27

slide-56
SLIDE 56

Conditions: (1) Linearity

  • The relationship between the explanatory and the response variable should be linear.
  • Methods for fitting a model to non-linear relationships exist, but are beyond the scope of this class. If this topic is of interest, an Online Extra is available on openintro.org covering new techniques.
  • Check using a scatterplot of the data, or a residuals plot.


slide-58
SLIDE 58

Anatomy of a residuals plot

[Scatterplot: % HS grad vs. % in poverty with fitted line (top) and residuals plot (bottom)]

RI:
% HS grad = 81, % in poverty = 10.3
predicted % in poverty = 64.68 − 0.62 × 81 = 14.46
e = % in poverty − predicted % in poverty = 10.3 − 14.46 = −4.16

DC:
% HS grad = 86, % in poverty = 16.8
predicted % in poverty = 64.68 − 0.62 × 86 = 11.36
e = % in poverty − predicted % in poverty = 16.8 − 11.36 = 5.44

slide-61
SLIDE 61

Conditions: (2) Nearly normal residuals

  • The residuals should be nearly normal.
  • This condition may not be satisfied when there are unusual observations that don't follow the trend of the rest of the data.
  • Check using a histogram or normal probability plot of residuals.

[Histogram of residuals and normal Q-Q plot (theoretical vs. sample quantiles)]

slide-65
SLIDE 65

Conditions: (3) Constant variability

[Scatterplot: % HS grad vs. % in poverty, with fitted line and residuals plot]

  • The variability of points around the least squares line should be roughly constant.
  • This implies that the variability of residuals around the 0 line should be roughly constant as well.
  • Also called homoscedasticity.
  • Check using a residuals plot.
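These condition checks are usually visual, but the underlying quantities are simple to compute. A Python sketch with illustrative data (not the state dataset); the fact that least squares residuals sum to zero when the model includes an intercept is a standard property, not something stated on the slides:

```python
import numpy as np

# Illustrative data (not the state dataset from these slides)
x = np.array([78.0, 81.0, 84.0, 86.0, 88.0, 90.0, 92.0])
y = np.array([16.0, 14.5, 13.0, 11.5, 10.5, 9.0, 8.5])

# Fit by least squares and compute the residuals e_i = y_i - yhat_i
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

# With an intercept in the model, least squares residuals sum to
# (numerically) zero; a residuals plot scatters these values against x
print(abs(residuals.sum()) < 1e-9)
```

Plotting `residuals` against `x` and looking for a flat, evenly spread band around 0 is exactly the visual check the slides describe.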

slide-66
SLIDE 66

Checking conditions

What condition is this linear model obviously violating? (a) Constant variability (b) Linear relationship (c) Normal residuals (d) No extreme outliers



slide-68
SLIDE 68

Checking conditions

What condition is this linear model obviously violating? (a) Constant variability (b) Linear relationship (c) Normal residuals (d) No extreme outliers



slide-74
SLIDE 74

R²

  • The strength of the fit of a linear model is most commonly evaluated using R².
  • R² is calculated as the square of the correlation coefficient.
  • It tells us what percent of variability in the response variable is explained by the model.
  • The remainder of the variability is explained by variables not included in the model or by inherent randomness in the data.
  • For the model we've been working with, R² = (−0.62)² = 0.38.
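R² follows directly from the correlation coefficient; a one-line check using the value quoted on this slide:

```python
# R^2 is the square of the correlation coefficient
R = -0.62   # value used on this slide
R_squared = R ** 2
print(round(R_squared, 2))  # → 0.38
```

Squaring discards the sign, so R² alone cannot tell you whether the association is positive or negative.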

slide-75
SLIDE 75

Interpretation of R2

Which of the below is the correct interpretation of R = −0.62, R² = 0.38?

(a) 38% of the variability in the % of HS graduates among the 51 states is explained by the model.
(b) 38% of the variability in the % of residents living in poverty among the 51 states is explained by the model.
(c) 38% of the time % HS graduates predict % living in poverty correctly.
(d) 62% of the variability in the % of residents living in poverty among the 51 states is explained by the model.

[Scatterplot: % HS grad vs. % in poverty]


slide-81
SLIDE 81

Poverty vs. region (east, west)

  • predicted poverty = 11.17 + 0.38 × west
  • Explanatory variable: region, reference level: east
  • Intercept: The estimated average poverty percentage in eastern states is 11.17%.
    • This is the value we get if we plug in 0 for the explanatory variable.
  • Slope: The estimated average poverty percentage in western states is 0.38% higher than in eastern states.
    • Then, the estimated average poverty percentage in western states is 11.17 + 0.38 = 11.55%.
    • This is the value we get if we plug in 1 for the explanatory variable.

slide-82
SLIDE 82

Poverty vs. region (northeast, midwest, west, south)

Which region (northeast, midwest, west, or south) is the reference level?

                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)          9.50        0.87    10.94      0.00
region4midwest       0.03        1.15     0.02      0.98
region4west          1.79        1.13     1.59      0.12
region4south         4.16        1.07     3.87      0.00

(a) northeast (b) midwest (c) west (d) south (e) cannot tell


slide-84
SLIDE 84

Poverty vs. region (northeast, midwest, west, south)

Which region (northeast, midwest, west, or south) has the lowest poverty percentage?

                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)          9.50        0.87    10.94      0.00
region4midwest       0.03        1.15     0.02      0.98
region4west          1.79        1.13     1.59      0.12
region4south         4.16        1.07     3.87      0.00

(a) northeast (b) midwest (c) west (d) south (e) cannot tell


slide-86
SLIDE 86

Types of outliers in linear regression

slide-87
SLIDE 87

Types of outliers

How do outliers influence the least squares line in this plot? To answer this question think of where the regression line would be with and without the outlier(s). Without the outliers the regression line would be steeper, and lie closer to the larger group of observations. With the outliers the line is pulled up and away from some of the observations in the larger group.

[Scatterplot with outliers and residuals plot]


slide-89
SLIDE 89

Types of outliers

How do outliers influence the least squares line in this plot? Without the outlier there is no evident relationship between x and y.

[Scatterplot with a single outlier and residuals plot]

slide-90
SLIDE 90

Some terminology

  • Outliers are points that lie away from the cloud of points.

41

slide-91
SLIDE 91

Some terminology

  • Outliers are points that lie away from the cloud of points.
  • Outliers that lie horizontally away from the center of the cloud

are called high leverage points.

41

slide-92
SLIDE 92

Some terminology

  • Outliers are points that lie away from the cloud of points.
  • Outliers that lie horizontally away from the center of the cloud

are called high leverage points.

  • High leverage points that actually influence the slope of the

regression line are called influential points.

41

slide-93
SLIDE 93

Some terminology

  • Outliers are points that lie away from the cloud of points.
  • Outliers that lie horizontally away from the center of the cloud are called high leverage points.
  • High leverage points that actually influence the slope of the regression line are called influential points.
  • In order to determine if a point is influential, visualize the regression line with and without the point. Does the slope of the line change considerably? If so, then the point is influential. If not, then it is not an influential point.

slide-94
SLIDE 94

Influential points

Data are available on the log of the surface temperature and the log of the light intensity of 47 stars in the star cluster CYG OB1.

[Scatterplot: log(temp) vs. log(light intensity), with fitted lines w/ and w/o the outliers]

slide-95
SLIDE 95

Types of outliers

Which of the below best describes the outlier? (a) influential (b) high leverage (c) none of the above (d) there are no outliers

[Scatterplot with the outlier and residuals plot]



slide-98
SLIDE 98

Types of outliers

Does this outlier influence the slope of the regression line? Not much...

[Scatterplot with the outlier and residuals plot]
SLIDE 99

Recap

Which of the following is true?

(a) Influential points always change the intercept of the regression line.
(b) Influential points always reduce R².
(c) It is much more likely for a low leverage point to be influential than a high leverage point.
(d) When the data set includes an influential point, the relationship between the explanatory variable and the response variable is always nonlinear.
(e) None of the above.


slide-101
SLIDE 101

Recap (cont.)

R = 0.08, R² = 0.0064
[Scatterplot]

R = 0.79, R² = 0.6241
[Scatterplot]

slide-102
SLIDE 102

Inference for linear regression

slide-103
SLIDE 103

Nature or nurture?

In 1966 Cyril Burt published a paper called "The genetic determination of differences in intelligence: A study of monozygotic twins reared apart." The data consist of IQ scores for [an assumed random sample of] 27 identical twins, one raised by foster parents, the other by the biological parents.

[Scatterplot: biological IQ vs. foster IQ, R = 0.882]

slide-104
SLIDE 104

Which of the following is false?

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.20760    9.29990   0.990    0.332
bioIQ        0.90144    0.09633   9.358  1.2e-09

Residual standard error: 7.729 on 25 degrees of freedom
Multiple R-squared: 0.7779, Adjusted R-squared: 0.769
F-statistic: 87.56 on 1 and 25 DF, p-value: 1.204e-09

(a) Additional 10 points in the biological twin's IQ is associated with additional 9 points in the foster twin's IQ, on average.
(b) Roughly 78% of the foster twins' IQs can be accurately predicted by the model.
(c) The linear model is: predicted fosterIQ = 9.2 + 0.9 × bioIQ.
(d) Foster twins with IQs higher than average IQs tend to have biological twins with higher than average IQs as well.


slide-106
SLIDE 106

Testing for the slope

Assuming that these 27 twins comprise a representative sample of all twins separated at birth, we would like to test if these data provide convincing evidence that the IQ of the biological twin is a significant predictor of the IQ of the foster twin. What are the appropriate hypotheses?

(a) H0: b0 = 0; HA: b0 ≠ 0
(b) H0: β0 = 0; HA: β0 ≠ 0
(c) H0: b1 = 0; HA: b1 ≠ 0
(d) H0: β1 = 0; HA: β1 ≠ 0


slide-108
SLIDE 108

Testing for the slope (cont.)

Estimate

  • Std. Error

t value Pr(>|t|) (Intercept) 9.2076 9.2999 0.99 0.3316 bioIQ 0.9014 0.0963 9.36 0.0000

51

slide-109
SLIDE 109

Testing for the slope (cont.)

Estimate

  • Std. Error

t value Pr(>|t|) (Intercept) 9.2076 9.2999 0.99 0.3316 bioIQ 0.9014 0.0963 9.36 0.0000

  • We always use a t-test in inference for regression.

51

slide-110
SLIDE 110

Testing for the slope (cont.)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2076     9.2999    0.99   0.3316
bioIQ         0.9014     0.0963    9.36   0.0000

  • We always use a t-test in inference for regression.

Remember: Test statistic, T = (point estimate − null value) / SE

51

slide-111
SLIDE 111

Testing for the slope (cont.)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2076     9.2999    0.99   0.3316
bioIQ         0.9014     0.0963    9.36   0.0000

  • We always use a t-test in inference for regression.

Remember: Test statistic, T = (point estimate − null value) / SE

  • Point estimate = b1 is the observed slope.

51

slide-112
SLIDE 112

Testing for the slope (cont.)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2076     9.2999    0.99   0.3316
bioIQ         0.9014     0.0963    9.36   0.0000

  • We always use a t-test in inference for regression.

Remember: Test statistic, T = (point estimate − null value) / SE

  • Point estimate = b1 is the observed slope.
  • SEb1 is the standard error associated with the slope.

51

slide-113
SLIDE 113

Testing for the slope (cont.)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2076     9.2999    0.99   0.3316
bioIQ         0.9014     0.0963    9.36   0.0000

  • We always use a t-test in inference for regression.

Remember: Test statistic, T = (point estimate − null value) / SE

  • Point estimate = b1 is the observed slope.
  • SEb1 is the standard error associated with the slope.
  • Degrees of freedom associated with the slope is df = n − 2, where n is the sample size.

51

slide-114
SLIDE 114

Testing for the slope (cont.)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2076     9.2999    0.99   0.3316
bioIQ         0.9014     0.0963    9.36   0.0000

  • We always use a t-test in inference for regression.

Remember: Test statistic, T = (point estimate − null value) / SE

  • Point estimate = b1 is the observed slope.
  • SEb1 is the standard error associated with the slope.
  • Degrees of freedom associated with the slope is df = n − 2, where n is the sample size.

Remember: We lose 1 degree of freedom for each parameter we estimate, and in simple linear regression we estimate 2 parameters, β0 and β1.

51

slide-115
SLIDE 115

Testing for the slope (cont.)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2076     9.2999    0.99   0.3316
bioIQ         0.9014     0.0963    9.36   0.0000

52

slide-116
SLIDE 116

Testing for the slope (cont.)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2076     9.2999    0.99   0.3316
bioIQ         0.9014     0.0963    9.36   0.0000

T = (0.9014 − 0) / 0.0963 = 9.36

52

slide-117
SLIDE 117

Testing for the slope (cont.)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2076     9.2999    0.99   0.3316
bioIQ         0.9014     0.0963    9.36   0.0000

T = (0.9014 − 0) / 0.0963 = 9.36
df = 27 − 2 = 25

52

slide-118
SLIDE 118

Testing for the slope (cont.)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2076     9.2999    0.99   0.3316
bioIQ         0.9014     0.0963    9.36   0.0000

T = (0.9014 − 0) / 0.0963 = 9.36
df = 27 − 2 = 25
p-value = P(|T| > 9.36) < 0.01
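The slide's arithmetic can be checked directly; a quick sketch using the estimates from the regression output (variable names are mine):

```python
# Test statistic for the slope, using the values from the R output
b1 = 0.9014        # observed slope
se_b1 = 0.0963     # standard error of the slope
null_value = 0     # H0: beta1 = 0
n = 27             # number of twin pairs

T = (b1 - null_value) / se_b1  # test statistic
df = n - 2                     # degrees of freedom

print(round(T, 2), df)  # 9.36 25
```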

52

slide-119
SLIDE 119

% College graduate vs. % Hispanic in LA

What can you say about the relationship between % college graduate and % Hispanic in a sample of 100 zip code areas in LA?

[Two maps of LA zip code areas: one shaded by Education: College graduate, one by Race/Ethnicity: Hispanic; both on a 0.0–1.0 scale, with "No data" areas and freeways marked.]

53

slide-120
SLIDE 120

% College educated vs. % Hispanic in LA - another look

What can you say about the relationship between % college graduate and % Hispanic in a sample of 100 zip code areas in LA?

[Scatterplot: % Hispanic (0%–100%) vs. % college graduate (0%–100%) for the 100 zip code areas.]

54

slide-121
SLIDE 121

% College educated vs. % Hispanic in LA - linear model

Which of the following is the best interpretation of the slope?

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.7290     0.0308   23.68   0.0000
%Hispanic    -0.7527     0.0501  -15.01   0.0000

(a) A 1% increase in Hispanic residents in a zip code area in LA is associated with a 75% decrease in % of college grads.
(b) A 1% increase in Hispanic residents in a zip code area in LA is associated with a 0.75% decrease in % of college grads.
(c) An additional 1% of Hispanic residents decreases the % of college graduates in a zip code area in LA by 0.75%.
(d) In zip code areas with no Hispanic residents, % of college graduates is expected to be 75%.

55

slide-122
SLIDE 122

% College educated vs. % Hispanic in LA - linear model

Which of the following is the best interpretation of the slope?

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.7290     0.0308   23.68   0.0000
%Hispanic    -0.7527     0.0501  -15.01   0.0000

(a) A 1% increase in Hispanic residents in a zip code area in LA is associated with a 75% decrease in % of college grads.
(b) A 1% increase in Hispanic residents in a zip code area in LA is associated with a 0.75% decrease in % of college grads.
(c) An additional 1% of Hispanic residents decreases the % of college graduates in a zip code area in LA by 0.75%.
(d) In zip code areas with no Hispanic residents, % of college graduates is expected to be 75%.

55
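Plugging numbers into the fitted line makes the interpretations in (b) and (d) concrete. A sketch assuming the slope is −0.7527 (negative, as the "decrease" wording implies; both variables are proportions):

```python
# Fitted line: %college = b0 + b1 * %hispanic, both as proportions.
# Slope assumed negative (-0.7527), consistent with the "decrease" options.
b0, b1 = 0.7290, -0.7527

p_at_0 = b0 + b1 * 0.00   # predicted % college grads with 0% Hispanic residents
p_at_1 = b0 + b1 * 0.01   # predicted % college grads with 1% Hispanic residents

# Intercept: about 73% college grads at 0% Hispanic; each additional
# 1% Hispanic is associated with about a 0.75 percentage-point decrease.
print(round(p_at_0, 4), round((p_at_1 - p_at_0) * 100, 4))
```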

slide-123
SLIDE 123

% College educated vs. % Hispanic in LA - linear model

Do these data provide convincing evidence that there is a statistically significant relationship between % Hispanic and % college graduates in zip code areas in LA?

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.7290     0.0308   23.68   0.0000
hispanic     -0.7527     0.0501  -15.01   0.0000

How reliable is this p-value if these zip code areas are not randomly selected?

56

slide-124
SLIDE 124

% College educated vs. % Hispanic in LA - linear model

Do these data provide convincing evidence that there is a statistically significant relationship between % Hispanic and % college graduates in zip code areas in LA?

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.7290     0.0308   23.68   0.0000
hispanic     -0.7527     0.0501  -15.01   0.0000

Yes, the p-value for % Hispanic is low, indicating that the data provide convincing evidence that the slope parameter is different than 0. How reliable is this p-value if these zip code areas are not randomly selected?

56

slide-125
SLIDE 125

% College educated vs. % Hispanic in LA - linear model

Do these data provide convincing evidence that there is a statistically significant relationship between % Hispanic and % college graduates in zip code areas in LA?

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.7290     0.0308   23.68   0.0000
hispanic     -0.7527     0.0501  -15.01   0.0000

Yes, the p-value for % Hispanic is low, indicating that the data provide convincing evidence that the slope parameter is different than 0. How reliable is this p-value if these zip code areas are not randomly selected? Not very...

56

slide-126
SLIDE 126

Confidence interval for the slope

Remember that a confidence interval is calculated as point estimate ± ME, and the degrees of freedom associated with the slope in a simple linear regression is n − 2. Which of the following is the correct 95% confidence interval for the slope parameter? Note that the model is based on observations from 27 twins.

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2076     9.2999    0.99   0.3316
bioIQ         0.9014     0.0963    9.36   0.0000

(a) 9.2076 ± 1.65 × 9.2999
(b) 0.9014 ± 2.06 × 0.0963
(c) 0.9014 ± 1.96 × 0.0963
(d) 9.2076 ± 1.96 × 0.0963

57

slide-127
SLIDE 127

Confidence interval for the slope

Remember that a confidence interval is calculated as point estimate ± ME, and the degrees of freedom associated with the slope in a simple linear regression is n − 2. Which of the following is the correct 95% confidence interval for the slope parameter? Note that the model is based on observations from 27 twins.

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2076     9.2999    0.99   0.3316
bioIQ         0.9014     0.0963    9.36   0.0000

(a) 9.2076 ± 1.65 × 9.2999
(b) 0.9014 ± 2.06 × 0.0963
(c) 0.9014 ± 1.96 × 0.0963
(d) 9.2076 ± 1.96 × 0.0963

n = 27, df = 27 − 2 = 25

57

slide-128
SLIDE 128

Confidence interval for the slope

Remember that a confidence interval is calculated as point estimate ± ME, and the degrees of freedom associated with the slope in a simple linear regression is n − 2. Which of the following is the correct 95% confidence interval for the slope parameter? Note that the model is based on observations from 27 twins.

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2076     9.2999    0.99   0.3316
bioIQ         0.9014     0.0963    9.36   0.0000

(a) 9.2076 ± 1.65 × 9.2999
(b) 0.9014 ± 2.06 × 0.0963
(c) 0.9014 ± 1.96 × 0.0963
(d) 9.2076 ± 1.96 × 0.0963

n = 27, df = 27 − 2 = 25
95%: t⋆(df = 25) = 2.06

57

slide-129
SLIDE 129

Confidence interval for the slope

Remember that a confidence interval is calculated as point estimate ± ME, and the degrees of freedom associated with the slope in a simple linear regression is n − 2. Which of the following is the correct 95% confidence interval for the slope parameter? Note that the model is based on observations from 27 twins.

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2076     9.2999    0.99   0.3316
bioIQ         0.9014     0.0963    9.36   0.0000

(a) 9.2076 ± 1.65 × 9.2999
(b) 0.9014 ± 2.06 × 0.0963
(c) 0.9014 ± 1.96 × 0.0963
(d) 9.2076 ± 1.96 × 0.0963

n = 27, df = 27 − 2 = 25
95%: t⋆(df = 25) = 2.06
0.9014 ± 2.06 × 0.0963

57

slide-130
SLIDE 130

Confidence interval for the slope

Remember that a confidence interval is calculated as point estimate ± ME, and the degrees of freedom associated with the slope in a simple linear regression is n − 2. Which of the following is the correct 95% confidence interval for the slope parameter? Note that the model is based on observations from 27 twins.

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2076     9.2999    0.99   0.3316
bioIQ         0.9014     0.0963    9.36   0.0000

(a) 9.2076 ± 1.65 × 9.2999
(b) 0.9014 ± 2.06 × 0.0963
(c) 0.9014 ± 1.96 × 0.0963
(d) 9.2076 ± 1.96 × 0.0963

n = 27, df = 27 − 2 = 25
95%: t⋆(df = 25) = 2.06
0.9014 ± 2.06 × 0.0963 = (0.7, 1.1)
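The interval above takes only a few lines to reproduce; a sketch with the slide's numbers (t⋆ hard-coded from a t-table rather than computed):

```python
# 95% confidence interval for the slope: b1 +/- t* x SE_b1
b1, se_b1 = 0.9014, 0.0963
t_star = 2.06                 # t* for df = 25 at 95% confidence (t-table)
me = t_star * se_b1           # margin of error
lower, upper = b1 - me, b1 + me

print(round(lower, 1), round(upper, 1))  # 0.7 1.1
```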

57

slide-131
SLIDE 131

Recap

  • Inference for the slope for a single-predictor linear regression model:

58

slide-132
SLIDE 132

Recap

  • Inference for the slope for a single-predictor linear regression model:

  • Hypothesis test:

T = (b1 − null value) / SEb1,  df = n − 2

58

slide-133
SLIDE 133

Recap

  • Inference for the slope for a single-predictor linear regression model:

  • Hypothesis test:

T = (b1 − null value) / SEb1,  df = n − 2

  • Confidence interval:

b1 ± t⋆(df = n − 2) × SEb1

58

slide-134
SLIDE 134

Recap

  • Inference for the slope for a single-predictor linear regression model:

  • Hypothesis test:

T = (b1 − null value) / SEb1,  df = n − 2

  • Confidence interval:

b1 ± t⋆(df = n − 2) × SEb1

  • The null value is often 0 since we are usually checking for any relationship between the explanatory and the response variable.

58

slide-135
SLIDE 135

Recap

  • Inference for the slope for a single-predictor linear regression model:

  • Hypothesis test:

T = (b1 − null value) / SEb1,  df = n − 2

  • Confidence interval:

b1 ± t⋆(df = n − 2) × SEb1

  • The null value is often 0 since we are usually checking for any relationship between the explanatory and the response variable.

  • The regression output gives b1, SEb1, and the two-tailed p-value for the t-test for the slope where the null value is 0.

58

slide-136
SLIDE 136

Recap

  • Inference for the slope for a single-predictor linear regression model:

  • Hypothesis test:

T = (b1 − null value) / SEb1,  df = n − 2

  • Confidence interval:

b1 ± t⋆(df = n − 2) × SEb1

  • The null value is often 0 since we are usually checking for any relationship between the explanatory and the response variable.

  • The regression output gives b1, SEb1, and the two-tailed p-value for the t-test for the slope where the null value is 0.

  • We rarely do inference on the intercept, so we’ll be focusing on the estimates and inference for the slope.

58
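The whole recap can be exercised end to end. The sketch below fits a slope and its standard error from scratch with the usual least-squares formulas, on a small made-up dataset (not the twin data):

```python
import math

# Hypothetical data for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# Residual standard error, then SE of the slope
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))
se_b1 = s / math.sqrt(sxx)

# Test statistic against the null value 0, with df = n - 2
T = (b1 - 0) / se_b1
df = n - 2
print(round(b1, 3), round(se_b1, 4), round(T, 2), df)
```

The p-value would then come from a t-distribution with df = n − 2 (via a t-table or a stats library).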

slide-137
SLIDE 137

Caution

  • Always be aware of the type of data you’re working with: random sample, non-random sample, or population.

59

slide-138
SLIDE 138

Caution

  • Always be aware of the type of data you’re working with: random sample, non-random sample, or population.

  • Statistical inference, and the resulting p-values, are meaningless when you already have population data.

59

slide-139
SLIDE 139

Caution

  • Always be aware of the type of data you’re working with: random sample, non-random sample, or population.

  • Statistical inference, and the resulting p-values, are meaningless when you already have population data.

  • If you have a sample that is non-random (biased), inference on the results will be unreliable.

59

slide-140
SLIDE 140

Caution

  • Always be aware of the type of data you’re working with: random sample, non-random sample, or population.

  • Statistical inference, and the resulting p-values, are meaningless when you already have population data.

  • If you have a sample that is non-random (biased), inference on the results will be unreliable.

  • The ultimate goal is to have independent observations.

59