Regression Marc H. Mehlman marcmehlman@yahoo.com University of New - PowerPoint PPT Presentation

Regression
Marc H. Mehlman
marcmehlman@yahoo.com
University of New Haven

“... the statistician knows ... that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.” – George Box

Table of Contents
1 Simple Regression
2 Confidence Intervals and Significance Tests
3 Variation
4 Chapter #10 R Assignment

Simple Regression

Simple Regression

Let
X = the predictor or independent variable
Y = the response or dependent variable.

Given a bivariate random variable, $(X, Y)$, is there a linear (straight line) association between X and Y (plus some randomness)? And if so, what is it, and how much randomness is there?

Definition (Statistical Model of Simple Linear Regression)
Given a predictor, $x$, the response, $y$, is
$$y = \beta_0 + \beta_1 x + \epsilon_x$$
where $\beta_0 + \beta_1 x$ is the mean response for $x$. The noise terms, the $\epsilon_x$'s, are assumed to be independent of each other and to be randomly sampled from $N(0, \sigma)$, the normal distribution with mean 0 and standard deviation $\sigma$. The parameters of the model are $\beta_0$, $\beta_1$ and $\sigma$.
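To make the model concrete, here is a minimal R sketch that simulates data from it and then fits a line. The parameter values, the sample size and the names `beta0`, `beta1`, `sigma.true` and `sim.lm` are arbitrary choices made for illustration; they do not come from the slides.

```r
# Simulate from y = beta0 + beta1 * x + eps, with eps ~ N(0, sigma).
# All numeric values below are illustrative assumptions.
set.seed(1)
beta0 <- 3
beta1 <- 0.5
sigma.true <- 2
n <- 200

x <- runif(n, min = 0, max = 20)             # predictor values
eps <- rnorm(n, mean = 0, sd = sigma.true)   # independent normal noise
y <- beta0 + beta1 * x + eps                 # responses from the model

sim.lm <- lm(y ~ x)
coef(sim.lm)            # estimates of beta0 and beta1
summary(sim.lm)$sigma   # estimate of sigma
```

With a sample this large the estimates should land near the values used to generate the data, although any single simulated sample will differ somewhat.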

Simple Regression: Conditions for Regression Inference

The figure below shows the regression model when the conditions are met. The line in the figure is the population regression line $\mu_y = \beta_0 + \beta_1 x$.

[Figure: Normal curves centered on the population regression line at several values of x.]

For each possible value of the explanatory variable $x$, the mean of the responses $\mu(y \mid x)$ moves along this line.

The Normal curves show how $y$ will vary when $x$ is held fixed at different values. All the curves have the same standard deviation $\sigma$, so the variability of $y$ is the same for all values of $x$.

The value of $\sigma$ determines whether the points fall close to the population regression line (small $\sigma$) or are widely scattered (large $\sigma$).

Simple Regression

[Figure: four scatterplots, each with the same fitted line $\hat{y} = 3 + 0.5x$.]

Moderate linear association; regression OK.
Obvious nonlinear relationship; regression inappropriate.
One extreme outlier, requiring further examination.
Only two values for x; a redesign is due here…

Simple Regression

Given a bivariate random sample from the simple linear regression model,
$$(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n),$$
one wishes to estimate the parameters of the model, $(\beta_0, \beta_1, \sigma)$.

Given an arbitrary line, $y = mx + b$, define the sum of the squares of errors to be
$$\sum_{i=1}^{n} [y_i - (m x_i + b)]^2.$$
Using calculus, one can find the least-squares regression line, $y = b_0 + b_1 x$, that minimizes the sum of squares of errors.
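The calculus step can be sketched as follows. The symbol $S(m, b)$ for the sum of squares of errors is notation introduced here for illustration; $\bar{x}$ and $\bar{y}$ are the sample means.

```latex
\begin{align*}
S(m, b) &= \sum_{i=1}^{n} [y_i - (m x_i + b)]^2 \\
\frac{\partial S}{\partial b} &= -2 \sum_{i=1}^{n} [y_i - (m x_i + b)] = 0
  \;\Longrightarrow\; b = \bar{y} - m \bar{x} \\
\frac{\partial S}{\partial m} &= -2 \sum_{i=1}^{n} x_i [y_i - (m x_i + b)] = 0
  \;\Longrightarrow\; \sum_{i=1}^{n} x_i y_i = m \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i \\
\text{Substituting } b = \bar{y} - m \bar{x}: \quad
m &= \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}
\end{align*}
```

The minimizing values of $m$ and $b$ are the slope $b_1$ and intercept $b_0$ of the least-squares regression line.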

Simple Regression

Theorem (Estimating $\beta_0$ and $\beta_1$)
Given the bivariate random sample, $(x_1, y_1), \cdots, (x_n, y_n)$, the least-squares regression line, $y = b_0 + b_1 x$, is obtained by letting
$$b_1 = r \left( \frac{s_y}{s_x} \right) \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x},$$
where $b_0$ is an unbiased estimator of $\beta_0$ and $b_1$ is an unbiased estimator of $\beta_1$.

Note: The point $(\bar{x}, \bar{y})$ will lie on the regression line, though there is no reason to believe that $(\bar{x}, \bar{y})$ is one of the data points.

One can also calculate $b_1$ using
$$b_1 = \frac{n \left( \sum_{j=1}^{n} x_j y_j \right) - \left( \sum_{j=1}^{n} x_j \right) \left( \sum_{j=1}^{n} y_j \right)}{n \sum_{j=1}^{n} x_j^2 - \left( \sum_{j=1}^{n} x_j \right)^2}.$$
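As a quick check, both formulas for $b_1$ can be evaluated directly in R and compared with the coefficients reported by `lm`. This sketch uses the built-in `trees` data set with Height as the predictor and Girth as the response; the variable names `b0`, `b1` and `b1.sums` are chosen here for illustration.

```r
# Least-squares coefficients from the closed-form formulas.
x <- trees$Height
y <- trees$Girth
n <- length(x)

b1 <- cor(x, y) * sd(y) / sd(x)   # b1 = r * (s_y / s_x)
b0 <- mean(y) - b1 * mean(x)      # b0 = ybar - b1 * xbar

# Alternative formula for b1 using raw sums.
b1.sums <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)

c(b0 = b0, b1 = b1, b1.sums = b1.sums)
coef(lm(Girth ~ Height, data = trees))   # should agree with b0 and b1
```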

Simple Regression

Example

> plot(trees$Girth~trees$Height,main="girth vs height")
> abline(lm(trees$Girth ~ trees$Height), col="red")

[Figure: scatterplot titled "girth vs height" with the fitted regression line in red; x-axis trees$Height, y-axis trees$Girth.]

Since both variables come from “trees”, in order for the R command “lm” (linear model) to work, “trees” has to be in the R format, “data.frame”.

> class(trees) # "trees" is in data.frame format - lm will work.
[1] "data.frame"
> g.lm=lm(Girth~Height,data=trees)
> coef(g.lm)
(Intercept)      Height
 -6.1883945   0.2557471

Simple Regression

Definition
The predicted value of $y$ at $x_j$ is $\hat{y}_j \stackrel{\mathrm{def}}{=} b_0 + b_1 x_j$. The predicted value, $\hat{y}$, is an unbiased estimator of the mean response, $\mu_y$.

Example
Using the R dataset “trees”, one wants the predicted girth of three trees, of heights 74, 83 and 91 respectively. One uses the regression model “Girth ~ Height” for the predictions. The work below is done in R.

> g.lm=lm(Girth~Height,data=trees)
> predict(g.lm,newdata=data.frame(Height=c(74,83,91)))
       1        2        3
12.73689 15.03862 17.08459
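The predicted values can also be computed straight from the definition $\hat{y} = b_0 + b_1 x$. The sketch below does this by hand and compares the result with `predict`; the names `heights`, `by.hand` and `by.predict` are introduced here for illustration.

```r
# Predicted girths computed from the fitted coefficients.
g.lm <- lm(Girth ~ Height, data = trees)
b <- coef(g.lm)                      # b[1] is b0, b[2] is b1
heights <- c(74, 83, 91)

by.hand <- unname(b[1] + b[2] * heights)
by.predict <- unname(predict(g.lm, newdata = data.frame(Height = heights)))

rbind(by.hand, by.predict)           # the two rows should match
```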

Simple Regression

“Never make forecasts, especially about the future.” – Samuel Goldwyn

The regression line only has predictive value for $y$ at $x$ if
1 $\rho \not\approx 0$ (if there is no significant linear correlation, don't use the regression line for predictions). If $\rho \approx 0$, then $\bar{y}$ is the best predictor of $y$ at $x$.
2 one only predicts $y$ for $x$'s within the range of the $x_j$'s: one does not predict the girth of a tree with a height of 1000 feet. Interpolate, don't extrapolate.

$|r|$ (or $r^2$) is a measure of how well the regression equation fits the data:
bigger $|r|$ ⇒ data fits the regression line better ⇒ better prediction.
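Both conditions can be checked in R before predicting. The sketch below, again on the built-in `trees` data, computes $r$, tests whether the correlation is significantly different from zero, and flags which requested heights fall inside the observed range. The names `new.heights` and `inside` are hypothetical, and the heights are the ones used in the girth example.

```r
# Condition 1: is there a significant linear correlation?
r <- cor(trees$Height, trees$Girth)
r^2                                              # proportion of variation explained
cor.test(trees$Height, trees$Girth)$p.value      # test of rho = 0

# Condition 2: are the new x values within the range of the observed x's?
h.range <- range(trees$Height)
new.heights <- c(74, 83, 91)
inside <- new.heights >= h.range[1] & new.heights <= h.range[2]
data.frame(height = new.heights, interpolation = inside)
```

A height flagged `FALSE` lies outside the observed heights, so a prediction there is an extrapolation and should be treated with caution.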

Simple Regression

Definition
The variance of the observed $y_j$'s about the predicted $\hat{y}_j$'s is
$$s^2 \stackrel{\mathrm{def}}{=} \frac{\sum (y_j - \hat{y}_j)^2}{n - 2} = \frac{\sum y_j^2 - b_0 \sum y_j - b_1 \sum x_j y_j}{n - 2},$$
which is an unbiased estimator of $\sigma^2$. The standard error of estimate (also called the residual standard error) is $s$, an estimator of $\sigma$.

Note: $(b_0, b_1, s)$ is an estimator of the parameters of the simple linear regression model, $(\beta_0, \beta_1, \sigma)$. Furthermore, $b_0$, $b_1$ and $s^2$ are unbiased estimators of $\beta_0$, $\beta_1$ and $\sigma^2$.
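Both expressions for $s^2$ can be evaluated in R and compared with the residual standard error that `summary` reports for a fitted model. This is a sketch on the built-in `trees` data; the names `s2.resid`, `s2.short` and `s` are chosen here for illustration.

```r
# Residual variance s^2 computed two ways, with n - 2 degrees of freedom.
g.lm <- lm(Girth ~ Height, data = trees)
x <- trees$Height
y <- trees$Girth
n <- length(y)
b <- unname(coef(g.lm))   # b[1] is b0, b[2] is b1

s2.resid <- sum((y - fitted(g.lm))^2) / (n - 2)
s2.short <- (sum(y^2) - b[1] * sum(y) - b[2] * sum(x * y)) / (n - 2)

s <- sqrt(s2.resid)
c(s = s, reported = summary(g.lm)$sigma)   # the two values should match
```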

Simple Regression: Outliers and influential points

Outlier: An observation that lies outside the overall pattern.
“Influential individual”: An observation that markedly changes the regression if removed. This is often an isolated point.

[Figure: scatterplot with two labeled points. Child 19 = outlier (large residual); Child 18 = potential influential individual.]

Child 19 is an outlier of the relationship (it is unusually far from the regression line, vertically).
Child 18 is isolated from the rest of the points, and might be an influential point.

Simple Regression

[Figure: regression lines fitted to all data, without child 18 (influential), and without child 19 (outlier).]

Child 18 changes the regression line substantially when it is removed. So, Child 18 is indeed an influential point.
Child 19 is an outlier of the relationship, but it is not influential (the regression line changes very little when it is removed).
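The same comparison can be tried in R by fitting the line with and without a suspect point. The children's data from the figure is not available here, so the sketch below uses made-up data in which observation 19 is isolated in x and far from the trend; the data frame `d` and every value in it are assumptions. Cook's distance (`cooks.distance`) is a standard influence measure that the slides do not mention.

```r
# Hypothetical data: 18 points near a line plus one isolated point.
set.seed(2)
x <- c(runif(18, min = 5, max = 15), 30)
y <- c(2 + 0.5 * x[1:18] + rnorm(18, sd = 0.5), 5)
d <- data.frame(x = x, y = y)

all.lm <- lm(y ~ x, data = d)            # fit with every observation
drop.lm <- lm(y ~ x, data = d[-19, ])    # fit without observation 19

rbind(all = coef(all.lm), without.19 = coef(drop.lm))

# Cook's distance summarizes how much each point moves the fit.
which.max(cooks.distance(all.lm))
```

A large change in the coefficients after dropping a point is the behavior the slide describes for an influential individual.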

Simple Regression

Definition
Given a data point, $(x_j, y_j)$, the residual of that point is $y_j - \hat{y}_j$.

Note:
1 Outliers are data points with large residuals.
2 The residuals should be approximately $N(0, \sigma)$.

Simple Regression

R command for finding residuals:

Example

> g.lm=lm(Girth~Height,data=trees)
> residuals(g.lm)
         1          2          3          4          5          6          7
-3.4139043 -1.8351687 -1.1236745 -1.7253986 -3.8271227 -4.2386170  0.3090842
         8          9         10         11         12         13         14
-1.9926400 -3.1713756 -1.7926400 -2.7156285 -1.8483871 -1.8483871  0.2418428
        15         16         17         18         19         20         21
-0.9926400  0.1631072 -2.6501112 -2.5058584  1.7303485  3.6205784  0.2401187
        22         23         24         25         26         27         28
-0.0713756  1.7631072  3.7746014  2.7958658  2.7728773  2.7171301  3.6286244
        29         30         31
 3.7286244  3.7286244  4.5383945
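The residuals can be reproduced from the definition, observed value minus predicted value, and then screened for unusually large ones. In this sketch the cutoff of 2 for standardized residuals is a common rule of thumb rather than something taken from the slides, and the names `res`, `by.hand` and `std` are chosen for illustration.

```r
# Residuals from the definition: observed minus predicted.
g.lm <- lm(Girth ~ Height, data = trees)
res <- residuals(g.lm)
by.hand <- trees$Girth - fitted(g.lm)
all.equal(unname(res), unname(by.hand))

sum(res)                 # essentially zero when the model has an intercept

# Standardized residuals make "large" easier to judge.
std <- rstandard(g.lm)
which(abs(std) > 2)      # candidate outliers under the assumed cutoff
```

To judge whether the residuals look approximately normal, one could follow this with a normal quantile plot, for example `qqnorm(res)` and `qqline(res)`.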
