
Regression Analysis Scott Richter UNCG-Statistical Consulting Center Department of Mathematics and Statistics UNCG Quantitative Methodology Series

Regression Analysis Summer 2015

I. Simple linear regression
   i. Motivating example-runtime
   ii. Regression details
   iii. Regression vs. ANOVA
   iv. Regression “theory”
   v. Inferences
   vi. Usefulness of the model
   vii. Categorical predictors
II. Multiple Regression
   i. Purposes
   ii. Terminology
   iii. Quantitative and categorical predictors
   iv. Polynomial regression
   v. Several quantitative variables
III. Assumptions/Diagnostics
   i. Assumptions
IV. Transformations
   i. Example
   ii. Interpretation after log transformation
V. Model Building
   i. Objectives when there are many predictors
   ii. Multicollinearity
   iii. Strategy for dealing with many predictors
   iv. Sequential variable selection
   v. Cross validation

I. Simple Linear Regression
i. Simple Linear Regression--Motivating Example
• Source: Foster, Stine and Waterman (1997, pages 191–199)
• Variables
  o Y: time taken (in minutes) for a production run
  o X: number of items produced
  o 20 randomly selected runs (see Table 2.1 and Figure 2.1)
• Goal: develop an equation to model the relationship between Y, the run time, and X, the run size
Start with a plot of the data.

Scatterplot:
• What is the overall pattern?
• Any striking deviations from that pattern?

Linear model fit
Does this appear to be a valid model?

“it makes sense to base inferences or conclusions only on valid models.” (Simon Sheather, A Modern Approach to Regression with R)
But how can we tell if a model is “valid”?
  o Residual plots can be helpful
  o Choosing the right plots can be tricky

Residual plot: How do we get this plot?
• Take the regression fit plot
• Rotate it until the regression line is horizontal, then “explode” (stretch) the vertical scale so the deviations from the line are easy to see


Now…what are we looking for in the residual plot?
  o Random scatter around the 0-line suggests a valid model
  o May or may not be a useful model! (“essentially, all models are wrong, but some are useful.” --George E. P. Box)
If we believe the model to be valid, we may proceed to interpret:

Parameter estimates from software:

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  95% Confidence Limits
Intercept   1           149.74770         8.32815    17.98    <.0001  132.25091   167.24450
RunSize     1             0.25924         0.03714     6.98    <.0001    0.18121     0.33728

Interpretation:
• For each additional item produced, the average runtime is estimated to increase by 0.26 minutes (about 16 seconds).
• The estimate is statistically different from 0 (p < 0.0001; at least 0.18 with 95% confidence).
• The model can safely be applied to runs of about 50 to 350 items.
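The confidence limits in the table can be roughly reproduced by hand as estimate ± t × standard error. The sketch below is a check under stated assumptions, not the software's own computation: the error degrees of freedom are taken as 20 − 2 = 18 (20 runs, two estimated parameters), and the critical value 2.101 is the usual t-table entry for 18 degrees of freedom rather than a number from the slides.

```python
# Values copied from the RunSize row of the software output.
estimate = 0.25924   # slope: minutes of runtime per additional item
std_error = 0.03714

# Assumption: 20 runs and 2 estimated parameters give 18 error df;
# 2.101 is the 0.975 quantile of the t distribution with 18 df (t table).
t_crit = 2.101

lower = estimate - t_crit * std_error
upper = estimate + t_crit * std_error

print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
print(f"slope in seconds per item: {estimate * 60:.1f}")
```

The limits agree with the reported 0.18121 and 0.33728 up to rounding of the critical value.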

The p-value and confidence interval may require additional checking of residuals:
No severe skewness or extreme values -> inferences should be OK

ii. Simple Linear Regression--Some details
• Data consist of a set of bivariate pairs ($Y_i$, $X_i$)
• The data arise either as
  o a random sample of pairs from a population,
  o random samples of Y’s selected independently from several fixed values $X_i$, or
  o an intact population
• The X-variable
  o is usually thought of as a potential predictor of the Y-variable
  o has values that can sometimes be chosen by the researcher
• Simple linear regression is used to model the relationship between Y and X so that, given a specific value of X,
  o we can predict the value of Y or
  o estimate the mean of the distribution of Y.

iii. Simple Linear Regression--Regression vs. ANOVA
Another example: Concrete. (From Vardeman (1994), Statistics for Engineering Problem Solving)
A study was performed to investigate the relationship between the strength (psi) of concrete and water/cement ratio. Three settings of water to cement were chosen (0.45, 0.50, 0.55). For each setting 3 batches of concrete were made. Each batch was measured for strength 14 days later. All other variables were kept constant (mix time, quantity of batch, same mixer used (which was cleaned after every use), etc.).
The data:

Water/cement  0.45  0.45  0.45  0.50  0.50  0.50  0.55  0.55  0.55
Strength      2824  2753  2803  2743  2789  2709  2662  2737  2703

  o Essentially 3 “groups”: 45%, 50%, 55%
  o Can use one-way ANOVA to compare means

Boxplots:
• The boxplots suggest that
  o the means are different
  o the means decrease as the ratio increases

• ANOVA F-test:
  o F(2,6) = 4.44, p-value = 0.066
  o not convincing evidence that means are different
• Regression F-test:
  o F(1,7) = 10.36, p-value = 0.015
  o more convincing evidence that means are different

Why different results?
• More specific regression alternative: means follow a linear relation
• Only one parameter estimate needed (instead of 2)

Regression
Source           DF  SS       MS       F Value  Pr > F
Model             1  12881    12881    10.36    0.015
Error             7  8705.33  1243.62
Corrected Total   8  21586

ANOVA
Source           DF  SS       MS       F Value  Pr > F
Model             2  12881    6440.33  4.44     0.066
Error             6  8705.33  1450.89
Corrected Total   8  21586

Will regression always be more powerful if the predictor is numeric?
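As a rough check on the two F statistics, the sums of squares can be recomputed directly from the nine strength values in the concrete data. This is a minimal sketch in plain Python; the helper names `anova_f` and `regression_f` are illustrative and do not come from the slides. It computes only the F statistics, since the p-values would additionally need an F distribution function.

```python
# Concrete data as listed in the slides.
ratio = [0.45] * 3 + [0.50] * 3 + [0.55] * 3
strength = [2824, 2753, 2803, 2743, 2789, 2709, 2662, 2737, 2703]


def mean(values):
    return sum(values) / len(values)


def anova_f(x, y):
    """One-way ANOVA F statistic, treating each distinct x value as a group."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    grand = mean(y)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((yi - mean(g)) ** 2 for g in groups.values() for yi in g)
    df_between = len(groups) - 1        # 2 for three groups
    df_within = len(y) - len(groups)    # 6 for nine observations
    return (ss_between / df_between) / (ss_within / df_within)


def regression_f(x, y):
    """F statistic for the slope in a simple linear regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_model = sxy ** 2 / sxx           # 1 model df
    ss_error = ss_total - ss_model      # n - 2 error df
    return ss_model / (ss_error / (len(y) - 2))


print(f"ANOVA F(2,6)      = {anova_f(ratio, strength):.2f}")
print(f"Regression F(1,7) = {regression_f(ratio, strength):.2f}")
```

For these data the three group means are equally spaced, so the model sum of squares is the same in both analyses; the regression F is larger because that sum of squares is charged to one degree of freedom instead of two.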

Suppose the pattern was different:

Water/cement  0.45  0.45  0.45  0.50  0.50  0.50  0.55  0.55  0.55
Strength      2743  2789  2709  2824  2753  2803  2662  2737  2703

• ANOVA F-test:
  o F(2,6) = 4.44, p-value = 0.066 (no change because the sample means are the same)
• Regression F-test:
  o F(1,7) = 1.23, p-value = 0.305
  o now, less convincing evidence that means are different
  o linear model is not valid for these data

Residual plot shows a non-random pattern (possibly quadratic?):
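The non-random pattern can also be seen numerically. The sketch below fits a straight line to the rearranged concrete data by least squares, recomputes the regression F statistic, and averages the residuals at each water/cement ratio. It is an illustration in plain Python with made-up variable names, not the software output behind the slide.

```python
# Rearranged concrete data from the slides.
ratio = [0.45] * 3 + [0.50] * 3 + [0.55] * 3
strength = [2743, 2789, 2709, 2824, 2753, 2803, 2662, 2737, 2703]

n = len(ratio)
x_bar = sum(ratio) / n
y_bar = sum(strength) / n

# Least squares slope and intercept for a single predictor.
sxx = sum((x - x_bar) ** 2 for x in ratio)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ratio, strength))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual = observed strength minus fitted value.
residuals = [y - (intercept + slope * x) for x, y in zip(ratio, strength)]

ss_error = sum(r ** 2 for r in residuals)
ss_total = sum((y - y_bar) ** 2 for y in strength)
f_stat = (ss_total - ss_error) / (ss_error / (n - 2))

# Average residual at each ratio: a valid linear model would leave
# these near zero with no systematic pattern.
mean_resid = {}
for level in sorted(set(ratio)):
    group = [r for x, r in zip(ratio, residuals) if x == level]
    mean_resid[level] = sum(group) / len(group)

print(f"Regression F(1,7) = {f_stat:.2f}")
for level, value in mean_resid.items():
    print(f"ratio {level:.2f}: mean residual {value:+.1f}")
```

The average residuals come out negative, positive, negative across the three ratios, which is the curved pattern the residual plot shows.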

iv. Simple Linear Regression--A little bit of theory and notation
Simple linear regression model: $\mu_{Y|X} = \beta_0 + \beta_1 X$
• $\mu_{Y|X}$ represents the population mean of Y for a given setting of X
• $\beta_0$ is the intercept of the linear function
• $\beta_1$ is the slope of the linear function
(All of these are unknown parameters.)


Method of Least Squares
1. The fitted value for observation i is its estimated mean: $\text{fit}_i = \hat\beta_0 + \hat\beta_1 X_i$
2. The residual for observation i is: $\text{res}_i = Y_i - \text{fit}_i$
3. The method of least squares finds $\hat\beta_0$ and $\hat\beta_1$ that minimize the sum of squared residuals.
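For a single predictor, the minimization in step 3 has a well-known closed-form solution. It does not appear on the slide; it is added here as a standard result, with $\bar X$ and $\bar Y$ denoting the sample means of the $X_i$ and $Y_i$:

```latex
\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2},
\qquad
\hat\beta_0 = \bar Y - \hat\beta_1 \bar X
```

One consequence is that the fitted line always passes through the point $(\bar X, \bar Y)$.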

Estimates for Runsize/Runtime example:
  o $\hat\beta_0 = 149.75$
  o $\hat\beta_1 = 0.26$
  o $\text{fit}_i = 149.75 + 0.26 \times \text{RunSize}_i$
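To show how the fitted equation is used, here is a small sketch that turns a run size into an estimated mean runtime. The function name `predicted_runtime` is hypothetical, and the 50 to 350 item guard reflects the range the slides describe as safe for this model; the rounded coefficients are the ones shown above.

```python
# Rounded estimates from the slide.
INTERCEPT = 149.75   # minutes
SLOPE = 0.26         # minutes per item


def predicted_runtime(run_size):
    """Estimated mean runtime (minutes) for a run of `run_size` items."""
    # Guard against extrapolation: the slides describe the model as safe
    # only for runs of about 50 to 350 items.
    if not 50 <= run_size <= 350:
        raise ValueError("run size outside the range supported by the data")
    return INTERCEPT + SLOPE * run_size


print(f"{predicted_runtime(200):.2f} minutes for a 200-item run")
```

Raising an error outside the observed range is a design choice for this illustration; a warning would serve equally well.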

v. Simple Linear Regression--Inferences
Three types:
1) Inferences about the regression parameters (most common)

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  95% Confidence Limits
Intercept   1           149.74770         8.32815    17.98    <.0001  132.25091   167.24450
RunSize     1             0.25924         0.03714     6.98    <.0001    0.18121     0.33728

1. Each row gives a test of the hypothesis that the parameter equals 0:

a. 1st row: $H_0: \beta_0 = 0$ (Average Runtime = 0 when Runsize = 0)
   i. Test statistic: $t = \dfrac{149.75}{8.33} = 17.98$
   ii. p-value: <0.0001
   iii. strong evidence that $\beta_0 \neq 0$
   iv. often not practically meaningful
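The same arithmetic applies to every row of the table: each t value is the parameter estimate divided by its standard error. A short sketch using the unrounded values from the software output (the dictionary layout is just an illustrative choice):

```python
# (estimate, standard error) pairs copied from the software output.
rows = {
    "Intercept": (149.74770, 8.32815),
    "RunSize": (0.25924, 0.03714),
}

# t statistic for H0: parameter = 0 is estimate / standard error.
t_values = {name: est / se for name, (est, se) in rows.items()}

for name, t in t_values.items():
    print(f"{name}: t = {t:.2f}")
```

Both ratios round to the t values reported in the table (17.98 and 6.98).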
