SLIDE 1

Getting Correct Results from PROC REG

Nate Derby

Stakana Analytics Seattle, WA, USA

SUCCESS 3/12/15

Nate Derby Getting Correct Results from PROC REG 1 / 29

SLIDE 2

Outline

1. PROC REG
     Basics
     Checking Assumptions
     Understanding the Output
2. Conclusions

SLIDE 3

Basics

PROC REG = regression analysis done with SAS.
What is regression analysis?
  Fitting the best-fit straight line through the data.
  Some assumptions required ...
Start with a scatterplot:
  Data: James Forbes, 1857. Boiling point vs air pressure.
  Data set: work.boiling.
  Does it fit a straight line?
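A minimal sketch of the scatterplot step, assuming Forbes' data are stored in work.boiling with the variables temp and press that appear in the MODEL statements on later slides. PROC SGPLOT is one common way to draw it; the slides do not say which procedure produced the plot that follows.

```sas
/* Scatterplot of the raw data before fitting anything.            */
/* Assumes work.boiling has numeric variables temp and press,      */
/* the names used in the MODEL statements later in the deck.       */
proc sgplot data=work.boiling;
  scatter x=temp y=press;
  xaxis label="Boiling Point (F)";
  yaxis label="Pressure (Hg)";
run;
```

Pressure goes on the vertical axis here because it is the response in the model fitted later.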

SLIDE 4

Boiling Point vs Pressure

[Figure: scatterplot. Vertical axis: Pressure (Hg), 20 to 31. Horizontal axis: Boiling Point (°F), 194 to 214.]

SLIDE 5

Fitting a Line

We want the line

    Pressure = β0 + β1 × Temperature

SAS Code:

    proc reg data=boiling;
      model press = temp;   /* response = regressor */
      plot press*temp;      /* vertical*horizontal */
    run;
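For reference, "best fit" in PROC REG means ordinary least squares: the estimates of β0 and β1 are the values that minimize the sum of squared vertical distances between the points and the line. The index i and the sample size n are notation added here, not taken from the slides.

```latex
(\hat\beta_0, \hat\beta_1)
  = \arg\min_{\beta_0,\,\beta_1}
    \sum_{i=1}^{n} \left( \text{Pressure}_i - \beta_0 - \beta_1\,\text{Temperature}_i \right)^2
```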

SLIDE 6

Boiling Point vs Pressure

[Figure: plot produced by the PLOT statement above. Vertical axis: Pressure (Hg), 20 to 31. Horizontal axis: Boiling Point (°F), 194 to 214.]

SLIDE 7

Checking Assumptions

Model must be appropriate for the data.
Check the mathematical assumptions of the model.
Look at the residuals = difference between a point and its fitted value (i.e., the value on the line). (See the graph of the fitted line on slide 6.)
  Do they form a pattern? (Should be NO)
  Do they fit a normal distribution? (Should be YES)
  The first check is more important than the second.
If these assumptions are violated, the results could be false, possibly to the point of being completely misleading.

SLIDE 8

Checking for Residual Patterns

Goal: We want the residuals to have no pattern whatsoever.
Residual = what's left over after the modeled part. (See the graph of the fitted line on slide 6.)
We assume all patterns are accounted for by the model.
Examples of patterns:
  Grouped together into "clumps."
  All of one part of the range above/below the line.
  Farther away from the line in one part of the range than in others.
  Outliers (sometimes, sometimes not).

SLIDE 9

Checking for Residual Patterns: SAS Code

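In General:

    proc reg data=blah;
      model yyy = xxx;
      plot residual.*xxx;         /* residuals vs the regressor */
      plot residual.*yyy;         /* residuals vs the response */
      plot residual.*predicted.;  /* residuals vs the fitted values */
    run;

Forbes' Data:

    proc reg data=boiling;
      model press = temp;
      plot residual.*temp;
    run;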
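The PLOT statement above belongs to PROC REG's traditional graphics interface. In SAS releases that support ODS Graphics, a similar residual check can be requested through the PLOTS= option, or the residuals can be saved with an OUTPUT statement and plotted separately. The following is a sketch under those assumptions; the data set name resid1 and the variable names pred and resid are illustrative and do not come from the slides.

```sas
ods graphics on;

/* Request only the residual-vs-regressor and residual-vs-predicted plots. */
proc reg data=boiling plots(only)=(residuals residualbypredicted);
  model press = temp;
  /* Save fitted values and residuals for custom plots.            */
  /* resid1, pred and resid are illustrative names.                */
  output out=resid1 p=pred r=resid;
run;
quit;

/* Residuals against the regressor, with a reference line at zero. */
proc sgplot data=resid1;
  scatter x=temp y=resid;
  refline 0 / axis=y;
run;
```

Saving the residuals to a data set makes it easy to plot them against any other variable later, including ones that are not in the model.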

SLIDE 10

Boiling Point vs Model 1 Residual

[Figure: residual plot. Vertical axis: Residual, -0.4 to 0.8. Horizontal axis: Boiling Point (°F), 194 to 214.]

SLIDE 11

Trouble in Paradise

Pattern: Clusters of negative residuals. ⇒ Assumption violation!
Two options:
  Modify the data: Transform one of the variables in the model.
  Modify the model: Change the linear equation in the MODEL statement.
    Add/substitute some variables in the model.

SLIDE 12

Modifying the Data

Pressure ⇒ 100 × Log(Pressure):

    100 × Log(Pressure) = β0 + β1 × Temperature

SAS Code:

    proc reg data=boiling;
      model hlogpress = temp;      /* transformed response */
      plot hlogpress*temp;
      plot residual.*predicted.;   /* residuals vs the fitted values */
    run;
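The slides use hlogpress without showing how it is created. A minimal sketch of that step follows. It assumes a base-10 logarithm, which is an inference rather than something the slides state: the plotted range of roughly 130 to 150 and the dependent mean of 139.6 in the later output are consistent with 100 × log10 of pressures between about 20 and 31. It also assumes that overwriting work.boiling in place is acceptable.

```sas
/* Add the transformed response to the data set.                   */
/* LOG10 is an assumption inferred from the plotted range;         */
/* use LOG instead if a natural logarithm was intended.            */
data boiling;
  set boiling;
  hlogpress = 100 * log10(press);
  label hlogpress = "100 x Log Pressure (Hg)";
run;
```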

SLIDE 13

Boiling Point vs Log Pressure

[Figure: plot of the transformed data. Vertical axis: 100 x Log Pressure (Hg), 130 to 150. Horizontal axis: Boiling Point (°F), 194 to 214.]

SLIDE 14

Boiling Point vs Model 2 Residual

[Figure: residual plot. Vertical axis: Residual, -0.50 to 1.50. Horizontal axis: Boiling Point (°F), 194 to 214.]

SLIDE 15

Checking for Residuals Fitting Normal Distribution

If the residuals don't fit the normal distribution (bell curve), confidence intervals and hypothesis tests will be off.
  All other results (i.e., estimates) will be valid.
We check this via a Quantile-Quantile Plot (Q-Q Plot):
  Compares quantiles (percentiles) of the residual distribution to those of the standard normal distribution.
  We want the points to approximately fit a straight line.

SLIDE 16

Checking for Residuals Fitting Normal Distribution

SAS Code:

    /* Model 1: residuals vs normal quantiles */
    proc reg data=boiling noprint;
      model press = temp;
      plot residual.*nqq. / nostat nomodel noline;
    run;

    /* Model 2: residuals vs normal quantiles */
    proc reg data=boiling noprint;
      model hlogpress = temp;
      plot residual.*nqq. / nostat nomodel noline;
    run;
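Where ODS Graphics is available, the same check can be sketched with the PLOTS= option, optionally followed by PROC UNIVARIATE on saved residuals. This is an alternative to the PLOT statement above, not the author's code; the names resid2 and resid are illustrative.

```sas
ods graphics on;

/* Fit Model 2 and keep only the Q-Q plot of the residuals.        */
proc reg data=boiling plots(only)=qqplot;
  model hlogpress = temp;
  output out=resid2 r=resid;   /* resid2 and resid are illustrative names */
run;
quit;

/* Q-Q plot plus formal normality tests on the saved residuals.    */
proc univariate data=resid2 normal;
  var resid;
  qqplot resid / normal(mu=est sigma=est);
run;
```

With only 17 observations the formal tests have little power, so the plot is likely to be the more informative of the two.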

SLIDE 17

Model 1 Residuals vs Normal Quantiles

[Figure: Q-Q plot. Vertical axis: Residual, -0.4 to 0.8. Horizontal axis: Normal Quantile, -3 to 3.]

SLIDE 18

Model 2 Residuals vs Normal Quantiles

[Figure: Q-Q plot. Vertical axis: Residual, -0.50 to 1.50. Horizontal axis: Normal Quantile, -3 to 3.]

SLIDE 19

PROC REG Output: Forbes’ Model 2

The REG Procedure
Model: MODEL2
Dependent Variable: hlogpress 100 x Log Pressure (Hg)

Number of Observations Read    17
Number of Observations Used    17

                      Analysis of Variance
                             Sum of         Mean
Source             DF       Squares       Square   F Value   Pr > F
Model               1     425.63910    425.63910   2962.79   <.0001
Error              15       2.15493      0.14366
Corrected Total    16     427.79402

Root MSE            0.37903    R-Square    0.9950
Dependent Mean    139.60529    Adj R-Sq    0.9946
Coeff Var           0.27150

                      Parameter Estimates
                                      Parameter    Standard
Variable    Label                DF    Estimate       Error   t Value   Pr > |t|
Intercept   Intercept             1   -42.13778     3.34020    -12.62     <.0001
temp        Boiling Point (F)     1     0.89549     0.01645     54.43     <.0001
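The summary statistics in this listing can be recomputed from the Analysis of Variance rows, which is a quick way to see how they relate. The symbols SS, MS and DF are shorthand introduced here for the "Sum of Squares", "Mean Square" and "DF" columns.

```latex
\begin{aligned}
R^2 &= \frac{SS_{\text{Model}}}{SS_{\text{Total}}}
     = \frac{425.63910}{427.79402} \approx 0.9950 \\
MS_{\text{Error}} &= \frac{SS_{\text{Error}}}{DF_{\text{Error}}}
     = \frac{2.15493}{15} \approx 0.14366 \\
F &= \frac{MS_{\text{Model}}}{MS_{\text{Error}}}
     = \frac{425.63910}{0.14366} \approx 2962.8 \\
\text{Root MSE} &= \sqrt{MS_{\text{Error}}} = \sqrt{0.14366} \approx 0.3790
\end{aligned}
```

Each value agrees with the printed output up to rounding.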

SLIDE 20

log GDP vs Democracy Index

[Figure: scatterplot. Vertical axis: Gurr's Index (1995), -10 to 10. Horizontal axis: Log GDP (1985), 6.00 to 10.00.]

SLIDE 21

PROC REG Output: Democracy Index

The REG Procedure
Model: MODEL1
Dependent Variable: Gurr Index (1995)

Number of Observations Read                   112
Number of Observations Used                   111
Number of Observations with Missing Values      1

                      Analysis of Variance
                             Sum of         Mean
Source             DF       Squares       Square   F Value   Pr > F
Model               1     534.76792    534.76792     12.31   0.0007
Error             109    4734.97983     43.44018
Corrected Total   110    5269.74775

Root MSE            6.59092    R-Square    0.1015
Dependent Mean      3.50450    Adj R-Sq    0.0932
Coeff Var         188.06986

                      Parameter Estimates
                                      Parameter    Standard
Variable    Label                DF    Estimate       Error   t Value   Pr > |t|
Intercept   Intercept             1   -12.98347     4.74073     -2.74     0.0072
lgdp        Log GDP (1985)        1     2.06913     0.58973      3.51     0.0007

SLIDE 22

Valve Orders vs Shipments

[Figure: scatterplot. Vertical axis: Shipments, 39,000 to 44,000. Horizontal axis: Orders, 33,000 to 41,000.]

SLIDE 23

Valve Orders vs Model 3 Residual

[Figure: residual plot. Vertical axis: Residual, -2000 to 2000. Horizontal axis: Orders, 33,000 to 41,000.]

SLIDE 24

SAS Output

The REG Procedure
Model: MODEL1
Dependent Variable: shipments Shipments

Number of Observations Read                    54
Number of Observations Used                    53
Number of Observations with Missing Values      1

                      Analysis of Variance
                             Sum of         Mean
Source             DF       Squares       Square   F Value   Pr > F
Model               1      38818277     38818277     70.16   <.0001
Error              51      28218196       553298
Corrected Total    52      67036473

Root MSE          743.84001    R-Square    0.5791
Dependent Mean        41527    Adj R-Sq    0.5708
Coeff Var           1.79124

                      Parameter Estimates
                                      Parameter     Standard
Variable    Label                DF    Estimate        Error   t Value   Pr > |t|
Intercept   Intercept             1       20966   2456.79440      8.53     <.0001
orders      Orders                1     0.56613      0.06759      8.38     <.0001

SLIDE 25

Problems

Actually, the conclusions are all false.
⇒ There is actually no relationship between orders and shipments.
Look at the residuals another way:

SAS Code:

    proc reg data=valves;
      var date;                   /* makes date available to PLOT although it is not in the model */
      model shipments = orders;
      plot residual.*date;        /* residuals vs time */
    run;
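Plotting residuals against date is a visual check. As a complementary numeric check that is not part of the original slides, the DW option of the MODEL statement prints the Durbin-Watson statistic for first-order autocorrelation, and DWPROB adds its p-values. The sketch assumes work.valves holds one observation per time period, and it sorts by date first because the statistic depends on the order of the rows.

```sas
/* Durbin-Watson depends on observation order, so sort by time.    */
proc sort data=valves;
  by date;
run;

proc reg data=valves;
  /* DW prints the Durbin-Watson d statistic; DWPROB adds p-values. */
  model shipments = orders / dw dwprob;
run;
quit;
```

A statistic near 2 suggests little first-order autocorrelation in the residuals; values well below 2 point to positive autocorrelation, the kind of time pattern this slide warns about.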

SLIDE 26

Date vs Model 3 Residual

[Figure: residuals plotted over time. Vertical axis: Residual, -2000 to 2000. Horizontal axis: Date, 07/1983 to 06/1988.]

SLIDE 27

Valve Orders vs Shipments

[Figure: orders and shipments plotted over time. Vertical axis: Orders/Shipments, 32,000 to 44,000. Horizontal axis: Date, 01/84 to 05/88.]

SLIDE 28

Conclusions

When fitting a model with PROC REG,

Check the assumptions:
  Is there a pattern with residuals vs other variables? (NO)
  Do the residuals fit a bell curve? (YES)
  For time series: Is there a pattern with residuals vs time? (NO)

Look at the results:
  Is the R-squared value close to 1? (YES)
  Are the individual p-values less than 0.05? (YES)
  Is the p-value for the analysis of variance less than 0.05? (YES)

SLIDE 29

Appendix

Further Resources

Sanford Weisberg. Applied Linear Regression. John Wiley and Sons, 1985.

UCLA Help: www.ats.ucla.edu/stat/sas/output/reg.htm

Nate Derby: http://nderby.org, nderby@stakana.com
