Lecture 14: Introduction to Poisson Regression (Ani Manichaikul, PowerPoint presentation)


SLIDE 1

Lecture 14: Introduction to Poisson Regression

Ani Manichaikul amanicha@jhsph.edu 8 May 2007

1 / 52

SLIDE 2

Overview

Modelling counts
Contingency tables
Poisson regression models

2 / 52

SLIDE 3

Modelling counts I

Why count data?
Number of traffic accidents per day
Mortality counts in a given neighborhood, per week
Number of coughs in a concert hall, per minute
Number of customers arriving in a shop, daily

3 / 52

SLIDE 4

Modelling counts II

Counts show up in all kinds of scenarios in our everyday life.
In public health, count data are particularly important in measuring disease rates: we often count the number of people acquiring a particular disease in a given year.
Modelling counts gives us a framework in which to estimate disease incidence.
Poisson regression will allow us to measure the association between incidence and risk factors of interest, adjusting for other related covariates.

4 / 52

SLIDE 5

A whole new model to deal with counts? I

So far we have talked about:
Linear regression: for normally distributed errors
Logistic regression: for binomially distributed errors
Typically, our count data do not follow either of these distributions:
Counts are not binary (0 / 1)
Counts are discrete, not continuous
Counts typically have a right-skewed distribution

5 / 52

SLIDE 6

A whole new model to deal with counts? II

So far, the regression strategies we’ve discussed allow us to model:
Expected values and expected increase in linear regression
Log odds or log odds ratios in logistic regression
In modelling counts, we are typically more interested in:
Incidence rates
Incidence ratios (when comparing across levels of a risk factor)
Poisson regression will provide us with a framework to handle counts properly!

6 / 52

SLIDE 7

Poisson Assumptions

Before going on to talk about Poisson regression, let’s revisit some key concepts about the Poisson distribution:
The occurrences of a random event in an interval of time are independent
In theory, an infinite number of occurrences of the event are possible (though perhaps rare) within the interval
In any extremely small portion of the interval, the probability of more than one occurrence of the event is approximately zero

7 / 52

SLIDE 8

Poisson Probability I

The probability of x occurrences of an event in an interval is:
P(X = x) = (e^(−λ) · λ^x) / x!,  x = 0, 1, 2, . . .
where
λ = the expected number of occurrences in the interval
e = a constant (≈ 2.718)
For the Poisson distribution: mean = variance = λ
Mode = floor(λ), the largest integer less than or equal to λ
We can also think of λ as the rate parameter

8 / 52

SLIDE 9

Poisson Probability II

Poisson probability mass function for λ = 5

[Figure: bar plot of Pr(X = x) against x, peaked near x = 5]

9 / 52

SLIDE 10

Poisson Probability III

Poisson probability mass function for λ = 30

[Figure: bar plot of Pr(X = x) against x, peaked near x = 30 and roughly symmetric]

10 / 52

SLIDE 11

Poisson and Binomial

The Poisson distribution can be used to approximate a binomial distribution when:
n is large and p is very small, or
np = λ is held fixed while n becomes infinitely large
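As a quick numerical check of this approximation (a sketch with illustrative values n = 1000, p = 0.005, which are not from the slides):

```python
import math

n, p = 1000, 0.005        # large n, small p
lam = n * p               # lam = 5 held fixed

k = 3
binom = math.comb(n, k) * p**k * (1 - p)**(n - k)
poisson = math.exp(-lam) * lam**k / math.factorial(k)
print(abs(binom - poisson))  # the two pmfs are very close
```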

11 / 52

SLIDE 12

Example: Cancer in a large population

Yearly cases of esophageal cancer in a large city
30 cases observed in 1990
P(X = 30) = (e^(−λ) · λ^30) / 30!
λ = yearly average number of cases of esophageal cancer
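The slide leaves λ unspecified; plugging in a hypothetical yearly average of λ = 25 (an assumed value, purely for illustration) gives:

```python
import math

lam = 25.0  # hypothetical yearly average; not given in the lecture
p30 = math.exp(-lam) * lam**30 / math.factorial(30)
print(round(p30, 4))  # probability of observing exactly 30 cases
```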

12 / 52

SLIDE 13

Example: Belief in Afterlife I

Men and women were asked whether or not they believed in the afterlife (General Social Survey 1991)
Possible responses were: yes, no, or unsure

         Yes   No or Unsure   Total
Male     435   147              582
Female   375   134              509
Total    810   281             1091

13 / 52

SLIDE 14

Example: Belief in Afterlife II

Question: Is belief in the afterlife independent of gender?
We could address this question using a χ2 test. To perform a χ2 test:

Begin by stating our model: independence
Then, calculate the expected cell counts for each entry in the table

14 / 52

SLIDE 15

Example: Belief in Afterlife III

Under an independence model, the probability of belief in the afterlife for both genders would be estimated as
p̂ = P(belief in afterlife) = (number who believe) / (total number asked) = 810 / 1091 ≈ 0.742
Then, we calculate the expected counts as:
E(males answering yes) = (# men asked) · p̂ = 582 · 0.742 = 432
E(females answering yes) = (# women asked) · p̂ = 509 · 0.742 = 378
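The arithmetic on this slide can be reproduced directly (a Python sketch; the lecture itself uses R):

```python
p_hat = 810 / 1091          # estimated P("yes") under independence
exp_male_yes = 582 * p_hat
exp_female_yes = 509 * p_hat
print(round(exp_male_yes), round(exp_female_yes))  # 432 378
```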

15 / 52

SLIDE 16

Example: Belief in Afterlife IV

The observed and expected cell counts are as follows:

         Yes         No or Unsure   Total
Male     435 (432)   147 (150)        582
Female   375 (378)   134 (131)        509
Total    810         281             1091

Then, the χ2 statistic is calculated as:
Σ_{i,j} (O_ij − E_ij)² / E_ij
= (435 − 432)²/432 + (147 − 150)²/150 + (375 − 378)²/378 + (134 − 131)²/131
= 0.173
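The statistic is just a sum over the four cells, so it is easy to reproduce by hand (Python sketch):

```python
observed = [435, 147, 375, 134]
expected = [432, 150, 378, 131]  # rounded expected counts from the table
chi_sq = sum((o - e)**2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 3))  # 0.173
```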

16 / 52

SLIDE 17

Example: Belief in Afterlife V

Remember, we are testing the null hypothesis of independence versus the alternative that the proportion believing in the afterlife differs across genders:
H0 : p_male = p_female = p_overall
Ha : p_male ≠ p_female
We decided to perform the hypothesis test using a χ2 test. Here, the appropriate degrees of freedom are:
(number of rows − 1)(number of columns − 1) = (2 − 1)(2 − 1) = 1

17 / 52

SLIDE 18

Example: Belief in Afterlife VI

Let’s test the hypothesis at level α = 0.05. Look up the appropriate critical value in R:
> qchisq(1-0.05, df=1)
[1] 3.841459
Our observed χ2 statistic of 0.173 is much smaller than the critical value.
We can also look up the corresponding p-value for our test:
> pchisq(0.173, df=1, lower.tail=F)
[1] 0.6774593
Conclusion: fail to reject the null hypothesis
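The same p-value can be recovered without R: for 1 degree of freedom, the chi-square survival function reduces to a complementary error function, P(X > x) = erfc(sqrt(x/2)). A stdlib Python sketch:

```python
import math

chi_sq = 0.173
# chi-square(1 df) survival function: P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_sq / 2))
print(round(p_value, 4))  # 0.6775, matching pchisq(0.173, df=1, lower.tail=F)
```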

18 / 52

SLIDE 19

Example: Belief in Afterlife – Poisson Model I

Just now, we had to calculate the expected counts to perform the χ2 test.
Actually, we could have written down a linear model to express the expected counts systematically, making use of the Poisson approximation to the binomial.
Model: Y_ij ∼ Poisson(λ_ij), with λ_ij = λ · α_male · γ_response
Here we interpret λ_ij as the Poisson rate for the cell in the ith row and jth column.
λ is the baseline rate, α is the male effect, and γ is the response effect

19 / 52

SLIDE 20

Example: Belief in Afterlife – Poisson Model II

Recall that the log of a product is equal to the sum of the logs: log(a · b) = log(a) + log(b)
So, let’s log-transform our model... originally we had:
λ_ij = λ · α_male · γ_response
Taking the log of both sides, we have:
log(λ_ij) = log(λ) + log(α_male) + log(γ_response)
This is looking like a linear model...

20 / 52

SLIDE 21

Example: Belief in Afterlife – Poisson Model III

We stated the systematic portion of this model as
log(λ_ij) = log(λ) + log(α_male) + log(γ_response)
which we can also write using β’s to look like other linear models:
log(λ_ij) = β0 + β1 · I(male) + β2 · I(response)
The probabilistic portion of this model enters as: Y_ij ∼ Poisson(λ_ij)

21 / 52

SLIDE 22

Example: Belief in Afterlife – Poisson Model IV

In this Poisson regression model: The outcome is the log of the expected cell count The baseline β0 is the log expected cell count for females responding ”no” β1 is the increase in log expected cell count for males compared to females β2 is the increase in log expected cell count for the response ”yes” compared to ”no”

22 / 52

SLIDE 23

Fitting the afterlife model in R I

Data format
We began with the view of our data in a table:

         Yes         No or Unsure   Total
Male     435 (432)   147 (150)        582
Female   375 (378)   134 (131)        509
Total    810         281             1091

However, R expects to see unique observations listed, together with relevant covariates (indicator variables here), as follows:

  count male yes
1   435    1   1
2   147    1   0
3   375    0   1
4   134    0   0

23 / 52

SLIDE 24

Fitting the afterlife model in R II

Once the data is entered in R, we can analyze it using the glm command, specifying the family as poisson:

> summary(out <- glm(count ~ male + yes, family=poisson))
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.87595    0.06787  71.839   <2e-16 ***
male         0.13402    0.06069   2.208   0.0272 *
yes          1.05868    0.06923  15.291   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null deviance: 272.68544 on 3 degrees of freedom
Residual deviance: 0.16200 on 1 degrees of freedom
AIC: 35.407

24 / 52

SLIDE 25

Fitting the afterlife model in R III

So we fit the model: log(λij) = β0 + β1 · I(male) + β2 · I(response) and our fitted model is: log(λij) = 4.88 + 0.134 · I(male) + 1.06 · I(response)
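For an independence model on a 2×2 table, the Poisson-regression MLEs actually have a closed form: fitted cell counts are (row total × column total) / grand total, so the coefficients can be checked by hand. A Python sketch using the counts from slide 13:

```python
import math

yes_m, no_m, yes_f, no_f = 435, 147, 375, 134
total = yes_m + no_m + yes_f + no_f            # 1091
male, female = yes_m + no_m, yes_f + no_f      # 582, 509
yes, no = yes_m + yes_f, no_m + no_f           # 810, 281

b0 = math.log(female * no / total)  # log expected count: female, "no"
b1 = math.log(male / female)        # male vs female log ratio
b2 = math.log(yes / no)             # "yes" vs "no" log ratio
print(round(b0, 5), round(b1, 5), round(b2, 5))
# close to R's estimates: 4.87595, 0.13402, 1.05868
```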

25 / 52

SLIDE 26

Predicting log expected cell counts I

Using the fitted model: log(λij) = β0 + β1 · I(male) + β2 · I(response) we can get predicted values for log counts in each of the four cells: For females responding ”no”: log(E(count|female, no)) = 4.88 + 0.134 · 0 + 1.06 · 0 = 4.88 For males responding ”no”: log(E(count|male, no)) = 4.88 + 0.134 · 1 + 1.06 · 0 = 5.01

26 / 52

SLIDE 27

Predicting log expected cell counts II

Using the fitted model: log(λij) = β0 + β1 · I(male) + β2 · I(response) we can get predicted values for log counts in each of the four cells: For females responding ”yes”: log(E(count|female, yes)) = 4.88 + 0.134 · 0 + 1.06 · 1 = 5.94 For males responding ”yes”: log(E(count|male, yes)) = 4.88 + 0.134 · 1 + 1.06 · 1 = 6.07

27 / 52

SLIDE 28

Predicting expected cell counts I

Using the fitted model: log(λij) = β0 + β1 · I(male) + β2 · I(response) we can get predicted values for counts in each of the four cells: For females responding ”no”: E(count|female, no) = exp(4.88) = 131 For males responding ”no”: E(count|male, no) = exp(4.88 + 0.134 · 1) = exp(5.01) = 150

28 / 52

SLIDE 29

Predicting expected cell counts II

Using the fitted model: log(λij) = β0 + β1 · I(male) + β2 · I(response) we can get predicted values for counts in each of the four cells: For females responding ”yes”: E(count|female, yes) = exp(4.88 + 1.06) = exp(5.94) = 378 For males responding ”yes”: E(count|male, yes) = exp(4.88+0.134+1.06) = exp(6.07) = 432
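Exponentiating all four linear predictors recovers the expected cell counts (a Python sketch using the fitted coefficients reported by R):

```python
import math

b0, b1, b2 = 4.87595, 0.13402, 1.05868  # fitted coefficients from the R output

pred = {
    ("female", "no"):  math.exp(b0),
    ("male",   "no"):  math.exp(b0 + b1),
    ("female", "yes"): math.exp(b0 + b2),
    ("male",   "yes"): math.exp(b0 + b1 + b2),
}
print({cell: round(v) for cell, v in pred.items()})
# cell counts round to 131, 150, 378, 432
```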

29 / 52

SLIDE 30

Comparing model fit in R with our fit by hand

When we did our χ2 test, we calculated the expected cell counts by hand:
Estimate p̂ = 810/1091, the probability of responding ”yes” under the independence model
Multiply p̂ (or 1 − p̂) by the number of males (or females) to get the expected number in each cell
Expected cell counts represented the numbers we should have gotten under a perfect independence model
We obtained the expected cell counts:

         Yes   No or Unsure   Total
Male     432   150              582
Female   378   131              509
Total    810   281             1091

which are exactly what we got by Poisson regression!

30 / 52

SLIDE 31

Afterlife coefficients I

By fitting the independence model, we force the relative rate of responding ”yes” versus ”no” to the question of belief in the afterlife to be fixed across males and females.
Deviation from the independence model suggests the proportion of those believing in the afterlife differs by gender.

31 / 52

SLIDE 32

Afterlife coefficients II

Under the independence model:
β0 = 4.88 is the log expected count of females responding ”no”, the baseline group
β1 = 0.134 is the difference in log expected counts comparing males to females
β2 = 1.06 is the difference in log expected counts for ”yes” responses compared to ”no” responses

32 / 52

SLIDE 33

Afterlife coefficients III

Under the independence model:
exp(β0) = 131.1 is the expected count for females responding ”no”, the baseline group
exp(β1) = 1.14 is the ratio comparing the counts of males to females
exp(β2) = 2.88 is the ratio of the number of ”yes” responses compared to ”no” responses

33 / 52

SLIDE 34

A regression model for count data

Contingency table regression model
Suppose we have r categories on one axis and c categories on the other: an r × c contingency table with r rows and c columns.
Independence model: log(E(Y_ij)) = log(λ_ij) = log(λ) + log(α_i) + log(γ_j)
To test independence, we write the χ2 test statistic as:
Σ_{i,j} (O_ij − E_ij)² / E_ij
which is distributed as χ2 with (r − 1)(c − 1) degrees of freedom under the null hypothesis of independence

34 / 52

SLIDE 35

Relation to other types of regression

Both logistic and Poisson regression are types of generalized linear models, extending ideas from linear regression with continuous outcomes to:

binary outcomes (logistic regression)
count data (Poisson regression)

We say logistic regression uses the ”logit” link; Poisson regression uses a ”log” link. Poisson models are also called log-linear models.

Regression   Outcome   Link name   Link function
Linear       Normal    identity    f(µ) = µ
Logistic     Binary    logit       logit(p) = log(p / (1 − p))
Log-linear   Poisson   log         log(λ)

35 / 52

SLIDE 36

Interpretation I

Under the model log(E(Y_ij)) = log(λ_ij) = log(λ) + log(α_i) + log(γ_j) we can interpret:
λ_ij is the predicted rate for the ith row and jth column of our contingency table
λ is the baseline rate

36 / 52

SLIDE 37

Interpretation II

Under the model log(E(Y_ij)) = log(λ_ij) = log(λ) + log(α_i) + log(γ_j) we can interpret:
α_i is the rate ratio, the effect of being in row i compared to baseline (holding column fixed)
γ_j is the rate ratio, the effect of being in column j compared to baseline (holding row fixed)
The rate ratio is the factor by which the baseline is multiplied to get the rate, or expected cell count, for our particular non-baseline category

37 / 52

SLIDE 38

Example: Customers at a lumber company I

Outcome: Y = number of customers visiting the store from a region
Predictors:
X1: number of housing units in region
X2: average household income
X3: average housing unit age in region
X4: distance to nearest competitor
X5: average distance to store in miles
Counts are obtained for 110 regions, so n = 110

38 / 52

SLIDE 39

Example: Customers at a lumber company II

Poisson (log-linear) regression model
Given observations and covariates: Y_i, X_ij, 1 ≤ i ≤ n, 1 ≤ j ≤ p
Systematic component of model: log(λ_i) = β0 + β1·X_i1 + β2·X_i2 + β3·X_i3 + β4·X_i4 + β5·X_i5
Probabilistic component of model: Y_i | X_i ∼ Poisson(λ_i)

39 / 52

SLIDE 40

Example: Customers at a lumber company III

Here’s a look at some of the data:

   Customers Housing Income  Age Competitor Store
1          9     606  41393    3       3.04  6.32
2          6     641  23635   18       1.95  8.89
3         28     505  55475   27       6.54  2.05
4         11     866  64646   31       1.67  5.81
5          4     599  31972    7       0.72  8.11
6          4     520  41755   23       2.24  6.81
7                354  46014   26       0.77  9.27
8         14     483  34626    1       3.51  7.92
9         16    1034  85207   13       4.23  4.40
.....

40 / 52

SLIDE 41

The Distribution of Customer counts

The customer counts distribution is clearly not normally distributed:
Linear regression would not work well here
Log-linear regression will work just fine

[Figure: Histogram of Customers, right-skewed]

41 / 52

SLIDE 42

The fitted model I

> summary(lumber.glm <- glm(Customers ~ Housing + Income + Age + Competitor + Store, family=poisson()))
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.942e+00  2.072e-01  14.198  < 2e-16 ***
Housing      6.058e-04  1.421e-04   4.262 2.02e-05 ***
Income      -1.169e-05  2.112e-06  -5.534 3.13e-08 ***
Age         -3.726e-03  1.782e-03  -2.091   0.0365 *
Competitor   1.684e-01  2.577e-02   6.534 6.39e-11 ***
Store       -1.288e-01  1.620e-02  -7.948 1.89e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null deviance: 422.22 on 109 degrees of freedom
Residual deviance: 114.99 on 104 degrees of freedom

42 / 52

SLIDE 43

The fitted model II

We wrote down the model:
log(λ_i) = β0 + β1·X1 + β2·X2 + β3·X3 + β4·X4 + β5·X5
and obtained the parameter estimates in R to be:
log(λ_i) = 2.94 + 0.00061·Housing_i − 0.000017·Income_i − 0.0037·Age_i + 0.168·Competitor_i − 0.129·Store_i
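As a sanity check of the fitted equation, we can plug in the covariates for the first region from slide 40 (housing 606, income 41393, age 3, competitor 3.04, store 6.32). This is a sketch using the rounded coefficients R reported, so small differences from R's own predictions are expected:

```python
import math

coef = [2.942, 6.058e-4, -1.169e-5, -3.726e-3, 1.684e-1, -1.288e-1]
region1 = [606, 41393, 3, 3.04, 6.32]  # housing, income, age, competitor, store

log_lam = coef[0] + sum(b * x for b, x in zip(coef[1:], region1))
lam = math.exp(log_lam)
print(round(lam, 1))  # predicted count; the observed count for this region was 9
```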

43 / 52

SLIDE 44

Interpretation I

We interpret β̂0 = 2.94 as the baseline log expected count, or log rate, in the group with all covariates (housing, income, age, distance to nearest competitor, and average distance to store) set to zero.
exp(β̂0) = exp(2.94) = 18.9 is the expected count of customers in the baseline group.
This baseline value does not quite make sense, since no region has zero housing units and zero income.
It may be helpful to center our covariates, but this is not a big deal if we don’t care about the baseline, because our primary inference is about the increase with respect to covariates.

44 / 52

SLIDE 45

Interpretation II

We interpret β̂1 = 6.058 × 10−4 as the increase in log expected count, or the log rate ratio, comparing districts whose number of housing units differ by one, adjusting for other covariates.
exp(β̂1) = exp(6.058 × 10−4) = 1.000606 is the rate ratio comparing districts whose mean housing units differ by one, adjusting for other covariates.
exp(100 · β̂1) = exp(100 · 6.058 × 10−4) = 1.062 is the rate ratio comparing districts whose mean housing units differ by one hundred.
⇒ Keeping other factors constant, a 100-unit increase in housing units would yield an expected 6.2% increase in customer count

45 / 52

SLIDE 46

Interpretation III

Although our reported estimate for β1 is very small, so is the standard error (1.421 × 10−4).
The z-statistic reported in R is 4.26, with a p-value of 2.02 × 10−5.
We can conclude that there is a statistically significant association between the customer count and mean housing units in the region.
We could have calculated this z-statistic ourselves as z = β̂1 / SE(β̂1).
This z-statistic is also called a Wald statistic
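The Wald statistic is just the estimate divided by its standard error (a sketch with the values from the R output):

```python
b1_hat, se_b1 = 6.058e-4, 1.421e-4  # estimate and SE from the R output
z = b1_hat / se_b1
print(round(z, 2))  # 4.26, matching the z value R reports for Housing
```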

46 / 52

SLIDE 47

Interpretation IV

We interpret β̂2 = −1.169 × 10−5 to be the increase in log expected count, or the log rate ratio, comparing districts whose mean income differs by $1, adjusting for other covariates.
exp(β̂2) = exp(−1.169 × 10−5) ≈ 0.99999 is the relative rate comparing regions whose mean income differs by $1, adjusting for other covariates

47 / 52

SLIDE 48

Interpretation V

exp(10000 · β̂2) = exp(10000 · (−1.169 × 10−5)) = 0.889 is the relative rate comparing regions whose mean income differs by $10,000, adjusting for other covariates.
⇒ Keeping other factors constant, the relative rate comparing regions whose average income differs by $10,000 is 0.889.
For a $10,000 increase in average income, we would predict an 11% decrease in the rate of customers visiting the store

48 / 52

SLIDE 49

Interpretation VI

A 95% confidence interval for β2 is:
β̂2 ± 1.96 · SE(β̂2) = −1.169 × 10−5 ± 1.96 · (2.112 × 10−6) = (−1.58 × 10−5, −7.55 × 10−6)
A 95% confidence interval for exp(β2) is:
(e^(−1.58 × 10−5), e^(−7.55 × 10−6)) = (0.9999842, 0.9999924)
A 95% confidence interval for exp(10000 · β2) is:
(e^(10000 · (−1.58 × 10−5)), e^(10000 · (−7.55 × 10−6))) = (0.854, 0.927)
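These intervals can be reproduced directly (a Python sketch using the estimate and standard error from the R output):

```python
import math

b2_hat, se_b2 = -1.169e-5, 2.112e-6
lo, hi = b2_hat - 1.96 * se_b2, b2_hat + 1.96 * se_b2      # 95% CI for beta2
rr_lo, rr_hi = math.exp(10000 * lo), math.exp(10000 * hi)  # CI for exp(10000*beta2)
print(round(rr_lo, 3), round(rr_hi, 3))  # 0.854 0.927
```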

49 / 52

SLIDE 50

Interpretation VII

The 95% confidence interval for the relative rate of customers visiting the store is (0.854, 0.927) for a $10,000 increase in average income for the region.
Question: Based on this model, if we are going to choose a location to build a new store, should we choose areas with higher or lower income? Does it matter?

50 / 52

SLIDE 51

Summary I

Poisson regression gives us a framework in which to build models for count data.
It is a special case of generalized linear models, so it is closely related to linear and logistic regression modelling.
All of the same modelling techniques will carry over from linear regression:

Adjustment for confounding
Allowing for effect modification by fitting interactions
Splines and polynomial terms

51 / 52

SLIDE 52

Summary II

We can test the significance of regression coefficients using z-statistics (also called Wald statistics).
Since Poisson regression specifies a full probability model, we can calculate the likelihood ⇒ likelihood ratio tests will apply for testing nested models.
We will need a new interpretation for regression coefficients:

The intercept term is interpreted as the log rate for the baseline group
Coefficients for covariates are interpreted as log rate ratios, or log relative rates
Exponentiate to get rid of the logs

52 / 52