Introduction to Linear Regression Rebecca C. Steorts September 15, - PowerPoint PPT Presentation

Introduction to Linear Regression Rebecca C. Steorts September 15, 2015

Today ◮ (Re-)Introduction to linear models and the model space ◮ What is linear regression ◮ Basic properties of linear regression ◮ Using data frames for statistical purposes ◮ Manipulation of data into more convenient forms ◮ How do we do exploratory analysis to see if linear regression is appropriate?

Linear regression ◮ Let Y n × 1 be the response (poverty rate) ◮ Let X n × p represent a matrix of covariates (age, education level, state you live in, etc.) Thus, for each observation, there are p covariates. ◮ Let β be an unknown parameter that can help up estimate future poverty rates Y n × 1 = X n × p β p × 1 + ǫ n × 1 where ǫ ∼ N (0 , σ 2 I ) .

Estimation and Prediction ◮ We seek to estimate ˆ β which is the arg min || Y − X β || 2 . Let f ( Y ) = || Y − X β || 2 and solve for ∂ f ( Y ) = ∂ β [( Y − X β ) T ( Y − X β )] = . . . ∂ β We thus, find that ˆ β = ( X T X ) − 1 X T Y .

Predicting ˆ Y We can now predict new observations via ˆ Y = X ( X T X ) − 1 X T Y = H Y , where H is often called the hat matrix.

How can we evaluate an linear model? ◮ Residual plots ◮ Outliers ◮ Colliearity

Residual plots ◮ Graphical tool for identifying non-linearity. ◮ For each observation i , the residual is e i = y i − ˆ y i . ◮ Plotting the residuals is plotting e i versus one covariate x i for all i .

Residuals for multiple covariates ◮ Translating to multiple covariates, we plot the residuals versus the fitted values. ◮ That is, we plot e i versus ˆ y i . (Think about why on your own). ◮ A strong pattern in the residuals indicates non-linearity. ◮ Instead: non-linear transformation of the covariates.

Outliers ◮ An outlier is a point for which y i is far from the value predicted by the model. ◮ Can identify these by residual plots but often it’s hard to know how large a residual should be to consider a point an outlier. ◮ Instead, we can plot the studentized residuals, computed by dividing each residual e i by its estimated standard error. ◮ Observations whose studentized residuals are greater than 3 in absolute value are possible outliers. ◮ What to do if you think you have an outliner, remove it.

High-leverage points ◮ Recall that outliers are observations for which the response y i is unusual given the predictor x i ◮ Observations with high leverage have an unusual value for x i Figure 1:Observation 41 is a high leverage point, while 20 is not. The red line is the fit to all the data, and the dotted line is the fit with observation 41 removed. Center: The red observation is not unusual in terms of its X1 value or its X2 value, but still falls outside the bulk of the data, and hence has high leverage. Right: Observation 41 has a high leverage and a high residual.

Correlation ◮ It’s important that the error terms e are all uncorrelated. ◮ e are uncorrelated, means that if e i > 0 provides little or no information about the sign of e i +1 ◮ The fitted values (or standard errors) are computed assumed the e are uncorrleated. ◮ If there is correlated, then the estimated std errors will tend to understimate the true std errors. ◮ This implies confidence and prediction intervals will be too narrow. ◮ Similarly, p-value will be lower than they should (and this could cause a parameter to be thought to be statistically significant when it’s not). ◮ Where does this happen often – time series data (things are correlated).

Collinearity ◮ Collinearity refers to the situation in which two or more predictor variables are closely related to each other. ◮ Suppose we are predicting balance of credit card. ◮ Credit limit and age when plotted appear to have no obvious relation. ◮ However, credit limit and credit rating do (and this is also intuitive is you have credit and own a credit card) ◮ Credit limit and rating are high correlated.

Collinearity (continued) ◮ The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response. ◮ Collinearity reduces accuracy of estimates of regression ˆ coefficients and causes std error for beta j to grow. ◮ Elements of this matrix that are large in absolute value indicated pair of highly correlated variables. ◮ Another way to check is computing the variance inflation factor (VIF). VIF of 1 – says no colllinearity. ◮ VIF above 5 problematic amount. How to deal with collinearity? ◮ The first is to drop one of the problematic variables from the regression. ◮ The second solution is to combine the collinear variables together into a single predictor. (better solution since not throwing away data)

So You’ve Got A Data Frame What can we do with it? ◮ Plot it: examine multiple variables and distributions ◮ Test it: compare groups of individuals to each other ◮ Check it: does it conform to what we’d like for our needs?

Test Case: Birth weight data Included in R already: library (MASS) data (birthwt) summary (birthwt) ## low age lwt race ## Min. :0.0000 Min. :14.00 Min. : 80.0 Min. ## 1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0 1st Qu.:1.000 ## Median :0.0000 Median :23.00 Median :121.0 Median ## Mean :0.3122 Mean :23.24 Mean :129.8 Mean ## 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0 3rd Qu.:3.000 ## Max. :1.0000 Max. :45.00 Max. :250.0 Max. ## smoke ptl ht ## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. ## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st ## Median :0.0000 Median :0.0000 Median :0.00000 Median ## Mean :0.3915 Mean :0.1958 Mean :0.06349 Mean ## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd

From R help Go to R help for more info, because someone documented this (thanks, someone!) help(birthwt)

Make it readable! colnames (birthwt) ## [1] "low" "age" "lwt" "race" "smoke" "ptl" "ht" ## [9] "ftv" "bwt" colnames (birthwt) <- c ("birthwt.below.2500", "mother.age", "mother.weight", "race", "mother.smokes", "previous.prem.labor" "hypertension", "uterine.irr", "physician.visits", "birthwt.grams")

Make it readable, again! Let’s make all the factors more descriptive. birthwt$race <- factor ( c ("white", "black", "other")[birthwt$race]) birthwt$mother.smokes <- factor ( c ("No", "Yes")[birthwt$mother.smokes birthwt$uterine.irr <- factor ( c ("No", "Yes")[birthwt$uterine.irr birthwt$hypertension <- factor ( c ("No", "Yes")[birthwt$hypertension

Make it readable, again! summary (birthwt) ## birthwt.below.2500 mother.age mother.weight race ## Min. :0.0000 Min. :14.00 Min. : 80.0 black:26 ## 1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0 other:67 ## Median :0.0000 Median :23.00 Median :121.0 white:96 ## Mean :0.3122 Mean :23.24 Mean :129.8 ## 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0 ## Max. :1.0000 Max. :45.00 Max. :250.0 ## mother.smokes previous.prem.labor hypertension uterine.irr ## No :115 Min. :0.0000 No :177 No :161 ## Yes: 74 1st Qu.:0.0000 Yes: 12 Yes: 28 ## Median :0.0000 ## Mean :0.1958 ## 3rd Qu.:0.0000 ## Max. :3.0000 ## physician.visits birthwt.grams ## Min. :0.0000 Min. : 709

Explore it! R’s basic plotting functions go a long way. plot (birthwt$race) title (main = "Count of Mother's Race in Springfield MA, 1986") Count of Mother's Race in Springfield MA, 1986 80 60 40 20

Explore it! R’s basic plotting functions go a long way. plot (birthwt$mother.age) title (main = "Mother's Ages in Springfield MA, 1986") Mother's Ages in Springfield MA, 1986 45 40 35 birthwt$mother.age 30 25 20 15

Explore it! R’s basic plotting functions go a long way. plot ( sort (birthwt$mother.age)) title (main = "(Sorted) Mother's Ages in Springfield MA, 1986" (Sorted) Mother's Ages in Springfield MA, 1986 45 40 35 sort(birthwt$mother.age) 30 25 20 15

Explore it! R’s basic plotting functions go a long way. plot (birthwt$mother.age, birthwt$birthwt.grams) title (main = "Birth Weight by Mother's Age in Springfield MA, Birth Weight by Mother's Age in Springfield MA, 1986 5000 4000 birthwt$birthwt.grams 3000 2000 1000

Basic statistical testing Let’s fit some models to the data pertaining to our outcome(s) of interest. plot (birthwt$mother.smokes, birthwt$birthwt.grams, main="Birth Birth Weight by Mother's Smoking Habit 5000 4000 Birth Weight (g) 3000 2000 1000

Introduction to Linear Regression Rebecca C. Steorts September 15, - PowerPoint PPT Presentation

Introduction to Linear Regression Rebecca C. Steorts September 15, 2015 Today (Re-)Introduction to linear models and the model space What is linear regression Basic properties of linear regression Using data frames for statistical

Regression 1: Linear Regression Marco Baroni Practical Statistics in R Outline Classic linear

Linear regression Linear regression is a simple approach to supervised learning. It assumes

Linear regression Linear regression is a simple approach to supervised learning. It assumes

Regression Methods 1. Linear Regression and Logistic Regression: definitions, and a common

Linear regression How to measure the accuracy of linear regression models Linear Regression

Linear Models for Regression Greg Mori - CMPT 419/726 Bishop PRML Ch. 3 Regression Linear Basis

STAT 213 Simple Linear Regression I Colin Reimer Dawson Oberlin College 5 October 2016 Outline

Linear regression Linear regression is a simple approach to supervised learning. It assumes

Logistic regression CS 446 1. Linear classifiers Linear regression Last two lectures, we studied

LINEAR REGRESSION LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW 25 SIMPLE LINEAR

Notes on the Non-linear Regression The model Non-linear regression models, like ordinary linear

CS70: Lecture 35. Regression (contd.): Linear and Beyond CS70: Lecture 35. Regression (contd.):

Chapter 7 Linear Regression 04/05/2016 Huamei Dong 1. Review Least square regression line 2.

Technical conditions for linear regression Jo Hardin Professor, Pomona College DataCamp

CS 7616 Pattern Recognition Linear, Linear, Linear Aaron Bobick School of Interactive

Lecture 8: Regression Trees Instructor: Saravanan Thirumuruganathan CSE 5334 Saravanan

of Paternity Overview th and 31 st st 2017 August 29 th 2017 Public Health Division Center for

LOOKING AT THE NUMBERS: SINGLE MOTHER STUDENTS IN PENNSYLVANIA Susana Contreras-Mendez, Research

Children in Medicaid and CHIP November 8, 2018 2:00 p.m. ET 1 Agenda Introduction and

Designing Systems to Improve Care for Every Mother, Everywhere @neel_shah Too Much Capability

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 4 Slides adapted from

Situacional Awareness Improvement: From Classroom to Field (digitally) Topological and tactical

DiVE Virtual Reality Lab Duke ITAC Meeting 2/22/2018 David J. Zielinski Dr. Regis Kopper

EECS 4441 Human-Computer Interaction Topic #3a: The Interaction Steven Castellucci York

Sambuz

Useful Links

Newsletter

Mail Us