An Introduction to the Analysis of Rare Events

SLIDE 1

Linear Regression Poisson Regression Beyond Poisson Regression

An Introduction to the Analysis of Rare Events

Nate Derby

Stakana Analytics Seattle, WA

SUCCESS 3/12/15

Nate Derby An Introduction to the Analysis of Rare Events 1 / 43

SLIDE 2

Outline

1. Linear Regression
     • Statistical Modeling with Linear Regression
     • Linear Regression with Rare Events
2. Poisson Regression
     • Fitting the Model
     • Interpreting the Results
     • Getting Predicted Counts
3. Beyond Poisson Regression

SLIDE 3

Statistical Modeling with Linear Regression

Suppose we have a data set of two variables, Xi and Yi.

  • Use Xi to estimate Yi. We’ll know Xi but not Yi.
  • Example: driver population percentage vs. annual fuel consumption.

Generate scatterplot:

    SYMBOL1 COLOR=blue ...;
    PROC GPLOT DATA=home.fuel;
      PLOT fuel*dlic=1 / ...;
    RUN;

SLIDE 4

[Figure: Fuel Consumption vs Driver Population Percentage (scatterplot). Vertical axis: annual fuel consumption per person (x 1000 gallons); horizontal axis: driver population percentage.]

SLIDE 5

Statistical Modeling with Linear Regression

Statistical model = we fit a trend line to the data: a line that best describes the general trend.

Linear regression model:

    $Y_i = \underbrace{\beta_0 + \beta_1 X_i}_{\text{linear trend}} + \underbrace{\varepsilon_i}_{\text{error term}}$

Fit a model:

    $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$

This estimates the (unknown) Yi from the (known) Xi.

SLIDE 6

Graphing a Linear Regression Line

Quickly fit and graph a linear regression line.

Generate linear regression line:

    SYMBOL1 COLOR=blue ...;
    SYMBOL2 LINE=1 COLOR=red INTERPOL=rl ...;
    PROC GPLOT DATA=home.fuel;
      ...
      /* both plot requests sit in one PLOT statement so OVERLAY can combine them */
      PLOT fuel*dlic=1
           fuel*dlic=2 / ... OVERLAY;
    RUN;

SAS log:

    NOTE: Regression equation : fuel = 9.617975 + 57.20502*dlic.

Fitted model: $\widehat{\text{FUEL}}_i = 9.617975 + 57.20502 \cdot \text{DLIC}_i$.
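As a quick worked example of using the fitted line, suppose the driver population percentage is 90%. Assuming dlic is stored as a proportion (0.90), which the axis ranges on the plots suggest but the slides do not state, the predicted fuel value would be:

```latex
\widehat{\text{FUEL}} = 9.617975 + 57.20502 \times 0.90
                      = 9.617975 + 51.484518
                      = 61.102493
```

The result is on the same scale as the fuel variable in home.fuel.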

SLIDE 7

[Figure: Fuel Consumption vs Driver Population Percentage (linear regression line). Vertical axis: annual fuel consumption per person (x 1000 gallons); horizontal axis: driver population percentage.]

SLIDE 8

Adding Prediction Intervals

Let’s add 95% prediction intervals.

Adding prediction intervals:

    SYMBOL1 COLOR=blue ...;
    SYMBOL3 LINE=1 COLOR=red INTERPOL=rlcli ...;
    PROC GPLOT DATA=home.fuel;
      ...
      /* both plot requests sit in one PLOT statement so OVERLAY can combine them */
      PLOT fuel*dlic=1
           fuel*dlic=3 / ... OVERLAY;
    RUN;

  • About 95% of the data points should fall within these intervals.
  • This should hold for future data points too!
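For reference, the textbook 95% prediction interval for a new observation at X = x_0 in simple linear regression has the form below. This is the standard formula under the usual normal-error assumptions, given as background rather than taken from the slides; the symbols n, \bar{X}, s and t are introduced only for this sketch.

```latex
\hat{Y}_0 \pm t_{0.975,\,n-2} \; s \,
  \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}},
\qquad
\hat{Y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0,
\qquad
s^2 = \frac{1}{n-2} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2
```

Note that the interval is symmetric around the fitted line by construction, which matters once the data themselves are skewed.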

SLIDE 9

[Figure: Fuel Consumption vs Driver Population Percentage (linear regression line + 95% prediction bounds). Vertical axis: annual fuel consumption per person (x 1000 gallons); horizontal axis: driver population percentage.]

SLIDE 10

Not Just for Straight Lines

  • Quadratic trend: $Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$
  • Cubic trend: $Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \beta_3 X_i^3 + \varepsilon_i$

Quadratic/cubic trends:

    SYMBOL1 COLOR=blue ...;
    SYMBOL4 LINE=1 COLOR=red INTERPOL=rqcli ...;
    SYMBOL5 LINE=1 COLOR=red INTERPOL=rccli ...;
    /* quadratic fit with 95% prediction limits */
    PROC GPLOT DATA=home.fuel;
      PLOT fuel*dlic=1
           fuel*dlic=4 / ... OVERLAY;
    RUN;
    /* cubic fit with 95% prediction limits */
    PROC GPLOT DATA=home.fuel;
      PLOT fuel*dlic=1
           fuel*dlic=5 / ... OVERLAY;
    RUN;

SLIDE 11

[Figure: Fuel Consumption vs Driver Population Percentage (quadratic regression line + 95% prediction bounds). Vertical axis: annual fuel consumption per person (x 1000 gallons); horizontal axis: driver population percentage.]

SLIDE 12

[Figure: Fuel Consumption vs Driver Population Percentage (cubic regression line + 95% prediction bounds). Vertical axis: annual fuel consumption per person (x 1000 gallons); horizontal axis: driver population percentage.]

SLIDE 13

Linear Regression with Rare Events

What counts as a rare event? There is no universal rule of thumb, but:

  • Any disease is considered a rare event.
  • Any event as frequent as a disease can be considered rare.
  • It depends on the time unit:
      - Earthquakes in the past ten years = rare.
      - Earthquakes in the past million years = not so rare.

Our rule of thumb: an event is rare if the number of events in a time period is in the single digits.

SLIDE 14

Exploratory Analysis

Find a relationship between a rare event Yi and some variable Xi:

  • Xi may or may not be rare.
  • Example: Xi / Yi = number of workers’ compensation claims per firm in the year before / after an inspection by Oregon OSHA.

Let’s look at a scatterplot.

Generate scatterplot:

    SYMBOL1 COLOR=blue ...;
    PROC GPLOT DATA=home.claims;
      PLOT post_claims*pre_claims=1 / ...;
    RUN;

SLIDE 15

[Figure: Pre- vs Post-Inspection Claims (scatterplot). Vertical axis: post-inspection claims; horizontal axis: pre-inspection claims.]

SLIDE 16

Scatterplot Not Useful

  • Data points are stacked on top of each other!
  • We have 1293 data points but can see only 49.

Let’s look at a bubble plot.

Generate bubble plot:

    /* count the observations at each (pre_claims, post_claims) pair */
    PROC FREQ DATA=home.claims NOPRINT;
      TABLES post_claims*pre_claims / OUT=stats1 ( KEEP=post_claims pre_claims count );
    RUN;
    /* draw one bubble per pair, sized by its count */
    PROC GPLOT DATA=stats1;
      BUBBLE post_claims*pre_claims=count / ... BSIZE=10;
    RUN;

BSIZE= determines bubble sizes.

SLIDE 17

[Figure: Pre- vs Post-Inspection Claims (bubble plot). Vertical axis: post-inspection claims; horizontal axis: pre-inspection claims.]

SLIDE 18

Bubble Plot Not That Useful

  • A bubble plot can be difficult to interpret! A box plot is better.
  • PROC BOXPLOT is OK, but not consistent with our axes.

Let’s look at a box plot with PROC GPLOT.

Generate box plot with PROC GPLOT:

    SYMBOL6 COLOR=blue INTERPOL=boxt00 ...;
    SYMBOL7 COLOR=red VALUE=diamondfilled ...;
    PROC GPLOT DATA=home.claims;
      PLOT post_claims*pre_claims=6
           m_post_claims*pre_claims=7 / HAXIS=axis3 VAXIS=axis4 OVERLAY ...;
    RUN;

INTERPOL=boxt00: box plot with tops/bottoms on the whiskers, which extend to the minima/maxima.

SLIDE 19

Add a Histogram

It is good to also show the distribution of X = pre-inspection claims. We want to use PROC GPLOT for consistency with our axes.

Generate histogram with PROC GPLOT:

    SYMBOL9 COLOR=blue INTERPOL=boxf00 CV=blue ...;
    PROC GPLOT DATA=stats2;
      /* =9 selects the SYMBOL9 definition above */
      PLOT count*pre_claims=9 / HAXIS=axis3 ...;
    RUN;

INTERPOL=boxf00: whiskers extend to the minima/maxima, as with boxt00, but the boxes are filled with the CV= color.

SLIDE 20

[Figure: Pre-Inspection Claims (histogram). Vertical axis: frequency; horizontal axis: pre-inspection claims.]

SLIDE 21

[Figure: Pre- vs Post-Inspection Claims (box plots). Vertical axis: post-inspection claims; horizontal axis: pre-inspection claims.]

SLIDE 22

Observations

Data are highly skewed (lopsided):

  • Right-skewed = right-tailed = mean > median.
  • Pre-inspection claims X: really right-skewed.
  • X = 3, 4, 5, 7: post-inspection claims Y really right-skewed.
  • X = 0: Y very right-skewed, min to 75th percentile all at Y = 0 (no box).
  • X = 1, 2: Y very right-skewed, min to median all at Y = 0 (half box).
  • X = other: too few data points to be important.
  • Data get less skewed for larger values of X.

How good a fit does linear regression give us?

SLIDE 23

[Figure: Pre- vs Post-Inspection Claims (linear regression line + 95% prediction bounds). Vertical axis: post-inspection claims; horizontal axis: pre-inspection claims.]

SLIDE 24

[Figure: Pre- vs Post-Inspection Claims (quadratic linear regression line + 95% prediction bounds). Vertical axis: post-inspection claims; horizontal axis: pre-inspection claims.]

SLIDE 25

[Figure: Pre- vs Post-Inspection Claims (cubic linear regression line + 95% prediction bounds). Vertical axis: post-inspection claims; horizontal axis: pre-inspection claims.]

SLIDE 26

How Well Does Our Model Fit?

  • The lines go through the boxes, right?
  • The median is usually below the regression line, so more than 50% of the data are below that line.
  • Prediction bounds are symmetric around the regression line, but the data are not symmetric around the median values. This is a fundamental mismatch.
  • Some data fall outside the 95% prediction bounds. Is that around 5%?
  • We have a wrong trend line and a false level of accuracy.

Linear regression doesn’t work, and we’d like a smooth trend line.
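The question of whether about 5% of the data fall outside the bounds can be checked numerically. The following is a minimal sketch, not the author's code: it assumes SAS/STAT is available and uses PROC REG for the straight-line fit. The data set work.bounds and the variables lower, upper and outside are hypothetical names chosen for illustration; home.claims, post_claims and pre_claims come from the slides.

```sas
/* Fit the straight-line model and save 95% prediction limits per observation. */
PROC REG DATA=home.claims;
   MODEL post_claims = pre_claims;
   /* LCL= / UCL= request prediction limits for an individual observation */
   OUTPUT OUT=work.bounds LCL=lower UCL=upper;
RUN;
QUIT;

DATA work.bounds;
   SET work.bounds;
   /* keep only rows where the response and both limits are available */
   IF NMISS( post_claims, lower, upper ) = 0;
   /* flag = 1 when the observed count lies outside the prediction bounds */
   outside = ( post_claims < lower OR post_claims > upper );
RUN;

/* The mean of a 0/1 flag is the proportion of flagged rows. */
PROC MEANS DATA=work.bounds N MEAN;
   VAR outside;
RUN;
```

If the reported mean is far from 0.05, or if nearly all flagged points sit above the upper bound, that supports the mismatch described above.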

SLIDE 27

[Figure: Pre- vs Post-Inspection Claims (connecting the means). Vertical axis: post-inspection claims; horizontal axis: pre-inspection claims.]

SLIDE 28

Poisson Regression

The solution is easy: use a process similar to linear regression, but ...

  • Instead of a symmetric, continuous distribution, use a skewed, discrete one!
  • We’re just applying a theoretical distribution that better fits the data.

SLIDE 29

Poisson Regression

  • Linear regression: Yi has a symmetric distribution with mean $E[Y_i] = \beta_0 + \beta_1 X_i$.
  • Poisson regression: Yi has a right-skewed distribution with mean $E[Y_i] = \exp(\beta_0 + \beta_1 X_i)$.
  • Poisson distribution = right-skewed.
      - Gets less skewed for larger values of $E[Y_i]$.
  • We use $\exp(\beta_0 + \beta_1 X_i) = e^{\beta_0 + \beta_1 X_i}$ rather than $\beta_0 + \beta_1 X_i$.
      - Starts out small, rapidly increases.
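For readers who want the distribution spelled out, the standard Poisson facts behind these statements are sketched below. The symbol \lambda_i for the mean is introduced here for convenience and does not appear on the slides.

```latex
P(Y_i = y) = \frac{e^{-\lambda_i} \, \lambda_i^{\,y}}{y!}, \quad y = 0, 1, 2, \ldots,
\qquad
\lambda_i = E[Y_i] = \exp(\beta_0 + \beta_1 X_i)

\operatorname{Var}(Y_i) = \lambda_i,
\qquad
\operatorname{skewness}(Y_i) = \frac{1}{\sqrt{\lambda_i}}
```

The skewness 1/\sqrt{\lambda_i} shrinks as the mean grows, which is why the distribution gets less skewed for larger values of E[Y_i]. The exponential also keeps the modeled mean positive for any values of the coefficients.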

SLIDE 30

Poisson Regression

[Figure: plot of y = exp(x), with x on the horizontal axis and y on the vertical axis.]

SLIDE 31

Fitting the Model

Use PROC GENMOD (SAS/STAT) or PROC COUNTREG (SAS/ETS).

Fitting the model:

    /* option 1: SAS/STAT */
    PROC GENMOD DATA=home.claims;
      MODEL post_claims = pre_claims / DIST=poisson;
    RUN;
    /* option 2: SAS/ETS */
    PROC COUNTREG DATA=home.claims;
      MODEL post_claims = pre_claims / DIST=poisson;
    RUN;

  • Goodness-of-fit statistics are only useful when comparing different models.
  • We hope for proper coefficient signs and p-values < 0.05.

SLIDE 32

PROC GENMOD Output

The GENMOD Procedure

Model Information

    Data Set              HOME.CLAIMS
    Distribution          Poisson
    Link Function         Log
    Dependent Variable    post_claims    Post-Inspection Claims

    Number of Observations Read    1310
    Number of Observations Used    1293
    Missing Values                   17

SLIDE 33

PROC GENMOD Output

Criteria For Assessing Goodness Of Fit

    Criterion                     DF         Value    Value/DF
    Deviance                    1291     1623.3440      1.2574
    Scaled Deviance             1291     1623.3440      1.2574
    Pearson Chi-Square          1291     2070.2270      1.6036
    Scaled Pearson X2           1291     2070.2270      1.6036
    Log Likelihood                       -972.7283
    Full Log Likelihood                 -1309.5635
    AIC (smaller is better)              2623.1270
    AICC (smaller is better)             2623.1363
    BIC (smaller is better)              2633.4564

Algorithm converged.
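The information criteria can be reproduced by hand from the full log likelihood. This is a sketch that assumes the standard definitions AIC = -2 log L + 2k and BIC = -2 log L + k ln(n), with k = 2 estimated parameters (the intercept and pre_claims) and n = 1293 observations used, and that reads the full log likelihood as -1309.5635:

```latex
\mathrm{AIC} = -2(-1309.5635) + 2 \cdot 2 = 2619.1270 + 4 = 2623.1270

\mathrm{BIC} = -2(-1309.5635) + 2 \ln(1293) \approx 2619.1270 + 14.3294 = 2633.4564
```

The BIC value agrees with the SBC of 2633 that PROC COUNTREG reports for the same model.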

SLIDE 34

PROC GENMOD Output

Analysis Of Maximum Likelihood Parameter Estimates

                                 Standard    Wald 95% Confidence        Wald
    Parameter    DF   Estimate      Error          Limits            Chi-Square   Pr > ChiSq
    Intercept     1    -0.8425     0.0415    -0.9238    -0.7611         412.14       <.0001
    pre_claims    1     0.2686     0.0098     0.2493     0.2878         749.10       <.0001
    Scale               1.0000     0.0000     1.0000     1.0000

NOTE: The scale parameter was held fixed.

SLIDE 35

PROC COUNTREG Output

The COUNTREG Procedure

Model Fit Summary

    Dependent Variable           post_claims
    Number of Observations       1293
    Data Set                     HOME.CLAIMS
    Model                        Poisson
    Log Likelihood               -1310
    Maximum Absolute Gradient    2.24243E-7
    Number of Iterations         5
    Optimization Method          Newton-Raphson
    AIC                          2623
    SBC                          2633

Algorithm converged.

Parameter Estimates

                                   Standard                Approx
    Parameter    DF    Estimate       Error    t Value    Pr > |t|
    Intercept     1   -0.842474    0.041499     -20.30      <.0001
    pre_claims    1    0.268575    0.009813      27.37      <.0001

SLIDE 36

Interpreting the Results

Fitted model: $E[Y_i] = \exp(-0.842474 + 0.268575\,X_i)$

  • A firm with no pre-inspection claims can expect $E[Y_i] = \exp(-0.842474) \approx 0.43$ post-inspection claims.
  • For each additional pre-inspection claim, a firm’s expected post-inspection claims rise by $\exp(0.268575) - 1 \approx 1.308099 - 1 = 30.81\%$:

    $E[Y_i \mid X_i + 1] = \exp(-0.842474 + 0.268575\,(X_i + 1))$
    $\qquad = \exp(-0.842474 + 0.268575\,X_i + 0.268575)$
    $\qquad = \exp(-0.842474 + 0.268575\,X_i) \cdot \exp(0.268575)$
    $\qquad = E[Y_i \mid X_i] \cdot 1.3081$
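To see what these numbers imply across a range of firms, the fitted mean can be tabulated directly. This is a minimal DATA step sketch that assumes the coefficient estimates shown in the PROC COUNTREG output; the data set name work.fitted_means and the variable expected_post are hypothetical names chosen for illustration.

```sas
/* Evaluate the fitted mean E[Y] = exp(b0 + b1*x) for x = 0, 1, ..., 5. */
DATA work.fitted_means;
   b0 = -0.842474;   /* intercept estimate from the output */
   b1 =  0.268575;   /* pre_claims estimate from the output */
   DO pre_claims = 0 TO 5;
      expected_post = EXP( b0 + b1*pre_claims );
      OUTPUT;        /* write one row per value of pre_claims */
   END;
   DROP b0 b1;
RUN;

PROC PRINT DATA=work.fitted_means NOOBS;
RUN;
```

The pre_claims = 0 row should show about 0.43, and each following row should be about 30.81% larger than the one before it, matching the hand calculation.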

SLIDE 37

Getting Predicted Counts

Use PROC GENMOD (SAS/STAT) or PROC COUNTREG (SAS/ETS).

Getting predicted counts:

    /* option 1: SAS/STAT */
    PROC GENMOD DATA=home.claims;
      MODEL post_claims = pre_claims / DIST=poisson;
      OUTPUT OUT=home.claims_pred PRED=predicted;
    RUN;
    /* option 2: SAS/ETS (writes to the same OUT= data set, so run one or the other) */
    PROC COUNTREG DATA=home.claims;
      MODEL post_claims = pre_claims / DIST=poisson;
      OUTPUT OUT=home.claims_pred PRED=predicted;
    RUN;

  • There are some variations in the output between the two procedures.
  • The OUTPUT statement doesn’t work for PROC COUNTREG before 9.22; use %PROBCOUNTS instead.

SLIDE 38

[Figure: Pre- vs Post-Inspection Claims (Poisson regression line). Vertical axis: post-inspection claims; horizontal axis: pre-inspection claims.]

SLIDE 39

[Figure: Pre- vs Post-Inspection Claims, showing Poisson (solid) and cubic linear regression (dashed) lines. Vertical axis: post-inspection claims; horizontal axis: pre-inspection claims.]

SLIDE 40

How Did We Do?

  • This is a smooth line!
  • The Poisson regression line is very close to most of the median values. (Better than hitting the mean values, since the median is robust against outliers and we have skewed distributions.)
  • The Poisson regression fit even comes close to the singular values at X = 8 and X = 9.
  • The Poisson regression fit is better than any of the three linear regression models!

SLIDE 41

Beyond Poisson Regression

Poisson regression has a couple of serious limitations:

  • It assumes mean = variance, which is often not the case.
  • There are often more zeroes than the model can handle.

These problems are addressed by negative binomial regression and zero-inflated Poisson/negative binomial regression.
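Both extensions can be requested with small changes to the MODEL statement. This is a hedged sketch using PROC GENMOD, assuming a SAS/STAT release recent enough to support DIST=zip and the ZEROMODEL statement. Using pre_claims as the covariate for the zero-inflation part is an assumption made for illustration, not a choice from the talk.

```sas
/* Negative binomial regression: relaxes the mean = variance assumption. */
PROC GENMOD DATA=home.claims;
   MODEL post_claims = pre_claims / DIST=negbin;
RUN;

/* Zero-inflated Poisson: adds a separate model for the excess zeroes. */
PROC GENMOD DATA=home.claims;
   MODEL post_claims = pre_claims / DIST=zip;
   /* covariates for the probability of an excess zero (illustrative choice) */
   ZEROMODEL pre_claims;
RUN;
```

Since goodness-of-fit statistics are mainly useful for comparing models, the AIC and BIC from these fits can be set against the Poisson values to judge whether the extra parameters are worthwhile.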

SLIDE 42

Beyond Poisson Regression

BTW ...

  • This analysis does not isolate the effect of inspections.
  • This analysis does not show that inspections have an effect.
  • We need a control group → propensity score matching.

SLIDE 43

Appendix

Further Resources

  • Sanford Weisberg. Applied Linear Regression. Wiley, 2005.
  • Russ Lavery. An Animated Guide: An Introduction to Poisson Regression. Proceedings of the Twenty-Third NESUG Conference, 2010.

Contact: Nate Derby, nderby@stakana.com
