SLIDE 1

Logistic Regression for Survey Data Professor Ron Fricker Naval Postgraduate School Monterey, California

SLIDE 2

Goals for this Lecture

  • Introduction to logistic regression

– Discuss when and why it is useful
– Interpret output

  • Odds and odds ratios

– Illustrate use with examples

  • Show how to run in JMP
  • Discuss other software for fitting linear and logistic regression models to complex survey data

SLIDE 3

Logistic Regression

  • Logistic regression

– Response (Y) is binary, representing an event or not
– Model, where pi = Pr(Yi = 1):

      ln( pi / (1 − pi) ) = β0 + β1·X1i + β2·X2i + ⋯ + βk·Xki

  • In surveys, useful for modeling:

– Probability respondent says “yes” (or “no”)

  • Can also dichotomize other questions

– Probability respondent is in a (binary) class

SLIDE 4

Why Logistic Regression?

  • Some reasons:

– Resulting “S” curve fits many observed phenomena
– Model follows the same general principles as linear regression

  • Can estimate probability p of binary outcome

– Estimates of p bounded between 0 and 1

      p̂ = exp(β̂0 + β̂1·x1 + β̂2·x2 + ⋯ + β̂k·xk) / [ 1 + exp(β̂0 + β̂1·x1 + β̂2·x2 + ⋯ + β̂k·xk) ]

SLIDE 5

Linear Regression with Binary Ys

  • Example: modeling presence or absence of coronary heart disease (CHD) as a function of age

  • Data looks like this:

– 100 obs
– min age = 20
– max age = 69
– 43 w/ CHD

ID   Age   CHD
1    20    0
2    23    0
3    24    0
4    25    0
5    25    1
6    26    0
7    26    0
8    28    0
…    …     …

SLIDE 6

Modeling CHD Existence

  • Imagine each subject flips a coin:

Heads = CHD
Tails = no CHD

  • Each coin has a different probability of heads, related to the subject’s age

  • Only observe existence of CHD

– y=1, has CHD; y=0, does not

  • We want to model the chance of getting CHD as a function of age

SLIDE 7

Proportion with CHD by Age

Age Group    n    CHD Absent   CHD Present   Proportion
20–29       10        9             1           0.10
30–34       15       13             2           0.13
35–39       12        9             3           0.25
40–44       15       10             5           0.33
45–49       13        7             6           0.46
50–54        8        3             5           0.63
55–59       17        4            13           0.76
60–69       10        2             8           0.80
Total      100       57            43           0.43

SLIDE 8

Plotting the Proportions

[Figure: proportion with CHD plotted against mean group age]

SLIDE 9

Interpreting Model Results

[Figure: fitted logistic curve, p(CHD) vs. age]

If age is 50 years then the probability of CHD is about 0.56

SLIDE 10

Logistic Regression: The Picture

[Figure: fitted logistic curve p(age) overlaid on the binary CHD data; probability of CHD vs. age]

SLIDE 11

Where Logistic Regression Fits

Response \ Predictor    Continuous             Categorical
Continuous              Linear regression      Linear reg. w/ dummy variables
Categorical             Logistic regression    Logistic reg. w/ dummy variables

SLIDE 12

Logistic Regression in JMP

  • Fit much like multiple regression:

Analyze > Fit Model

– Fill in Y with the nominal binary dependent variable
– Put Xs in the model by highlighting and then clicking “Add”

  • Use “Remove” to take out Xs

– Click “Run Model” when done

  • Takes care of missing values and non-numeric data automatically

SLIDE 13

Estimating the Parameters

  • JMP estimates βs via maximum likelihood
  • Given the estimated βs, probabilities are estimated as

  • Calculating probabilities in JMP is easy

– After Fit Model, red triangle > Save Probability Formula

      p̂ = exp(β̂0 + β̂1·x1 + β̂2·x2 + ⋯ + β̂k·xk) / [ 1 + exp(β̂0 + β̂1·x1 + β̂2·x2 + ⋯ + β̂k·xk) ]

SLIDE 14

Probability, Odds, and Log Odds

  • Probability (p)

– Number between 0 and 1
– Example: Pr(Red Sox win next World Series) = 5/8 = 0.625

  • Odds: p/(1−p)

– Any number > 0
– Example: Odds Red Sox win World Series are 5/3 ≈ 1.667

  • Log odds: ln(p/(1−p))

– Any number from −∞ to +∞
– Log odds is sometimes called the “logit”
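The three scales can be checked with a few lines of Python, using the slide’s Red Sox numbers:

```python
import math

p = 5 / 8                  # Pr(Red Sox win next World Series), from the slide
odds = p / (1 - p)         # = 5/3
log_odds = math.log(odds)  # the "logit"

print(round(odds, 3), round(log_odds, 3))
```

Note that odds of 1.667 and a log odds of about 0.51 both say the same thing as p = 0.625, just on different scales.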

SLIDE 15

Interpreting the βs

[JMP output: “slope” estimate on Age, its p-value, and the fitted log odds of having CHD]

  • Slope is positive and significant

– Increasing age means a higher probability of coronary heart disease
– Increase Age by 1 year and the log odds of CHD increases by 0.11
– No t-test; a χ²-test instead

  • p-value still means the same thing
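A quick way to see what the 0.11 slope means on the odds scale: exponentiating the slope gives the per-year odds multiplier.

```python
import math

slope = 0.11                        # increase in log odds of CHD per year of age (from the slide)
odds_multiplier = math.exp(slope)   # each extra year multiplies the odds of CHD by this factor
print(round(odds_multiplier, 2))
```

So each additional year of age multiplies the odds of CHD by roughly 1.12.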

SLIDE 16

Final Model and Results

Age can be any (positive) number and the answer still makes sense

[Figure: fitted curve p̂(CHD) vs. age]

      p̂(CHD) = exp(−5.31 + 0.111·x_age) / [ 1 + exp(−5.31 + 0.111·x_age) ]
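A short check of the fitted model, using the −5.31 and 0.111 coefficients from the slide (the function name is mine):

```python
import math

def p_chd(age):
    """Fitted model from the slide: log odds = -5.31 + 0.111 * age."""
    eta = -5.31 + 0.111 * age
    return math.exp(eta) / (1.0 + math.exp(eta))

print(round(p_chd(50), 2))   # agrees with the ~0.56 read off the earlier plot
```

The probability increases smoothly with age and stays inside (0, 1) for any age.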

SLIDE 17

Odds Ratios – An Example

  • An odds ratio is, literally, the ratio of two odds

– Example from some recent (non-survey) work:

  • Odds IAer retained = 2.01
  • Odds non-IAer retained = 1.55
  • Odds ratio = 1.30
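The arithmetic behind the example:

```python
odds_ia = 2.01         # odds an IAer is retained (from the slide)
odds_non_ia = 1.55     # odds a non-IAer is retained
odds_ratio = odds_ia / odds_non_ia
print(round(odds_ratio, 2))
```

So IAers had about 1.30 times the odds of being retained compared with non-IAers.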

SLIDE 18

Interpreting the Slope of an Indicator Variable

  • Let x1 be an indicator variable

– Say, x1 = 1 means male and x1 = 0 means female

  • Consider the ratio of two logistic regression models, one for males and one for females:

      ln( pi|male / (1 − pi|male) )   = β0 + β1 + β2·X2i + ⋯ + βk·Xki
      ln( pi|female / (1 − pi|female) ) = β0 + β2·X2i + ⋯ + βk·Xki

  • Exponentiate numerator and denominator:

      O.R. = [ exp(β0)·exp(β1)·exp(β2·X2i) ⋯ exp(βk·Xki) ] / [ exp(β0)·exp(β2·X2i) ⋯ exp(βk·Xki) ] = exp(β1)
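A numeric sanity check of the cancellation. The coefficient values here (b0, b1, rest) are hypothetical, chosen only to illustrate that the shared factors drop out:

```python
import math

b0, b1 = -1.0, 0.4     # hypothetical intercept and indicator coefficient
rest = 0.25            # hypothetical contribution of the remaining terms b2*x2 + ... + bk*xk

odds_male = math.exp(b0 + b1 + rest)    # x1 = 1
odds_female = math.exp(b0 + rest)       # x1 = 0
odds_ratio = odds_male / odds_female    # every shared factor cancels

print(round(odds_ratio, 4), round(math.exp(b1), 4))
```

Whatever values the other terms take, the ratio of the two odds equals exp(β1).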

SLIDE 19

Example: Using Logistic Regression in NPS New Student Survey

  • Dichotomize Q1 into “satisfied” (4 or 5) and “not satisfied” (1, 2, or 3)

  • Model satisfied on Gender and Type Student
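A sketch of the dichotomization step; the response values here are made up for illustration, not NPS survey data:

```python
# Hypothetical 1-5 responses to Q1
responses = [5, 3, 4, 1, 2, 4]

# 4 or 5 -> "satisfied" (1); 1, 2, or 3 -> "not satisfied" (0)
satisfied = [1 if r >= 4 else 0 for r in responses]
print(satisfied)   # [1, 0, 1, 0, 0, 1]
```

The resulting 0/1 variable is then a valid binary response for logistic regression.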

SLIDE 20

Compare the Output to Raw Data

SLIDE 21

Regression in Complex Surveys

  • Parameters are fit to minimize the sum of squared errors over the population:

      SSE = Σ_{i=1..N} [ yi − (B0 + B1·xi) ]²

  • Resulting estimators:

      B̂1 = [ Σ_{i∈S} wi·xi·yi − (Σ_{i∈S} wi·yi)(Σ_{i∈S} wi·xi) / Σ_{i∈S} wi ] / [ Σ_{i∈S} wi·xi² − (Σ_{i∈S} wi·xi)² / Σ_{i∈S} wi ]

      and

      B̂0 = Σ_{i∈S} wi·yi / Σ_{i∈S} wi − B̂1 · Σ_{i∈S} wi·xi / Σ_{i∈S} wi

  • Still need to estimate standard errors…
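The survey-weighted least-squares estimators can be sketched as follows; `weighted_slr` is an illustrative helper I wrote, not code from the lecture:

```python
def weighted_slr(xs, ys, ws):
    """Survey-weighted simple linear regression estimators B1-hat and B0-hat."""
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    b1 = (swxy - swy * swx / sw) / (swxx - swx ** 2 / sw)
    b0 = swy / sw - b1 * swx / sw
    return b0, b1

# With equal weights this reduces to ordinary least squares; here y = 2x exactly
b0, b1 = weighted_slr([1, 2, 3, 4], [2, 4, 6, 8], [1, 1, 1, 1])
print(round(b0, 6), round(b1, 6))
```

With unequal weights the same formulas give the design-weighted fit; the standard errors, as the slide notes, still need a separate (design-based) estimate.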

SLIDE 22

Using SAS for Regression

  • SAS procedures for regression assuming SRS:

– PROC REG
– PROC LOGISTIC

  • In SAS v9.1, for complex surveys:

– PROC SURVEYREG
– PROC SURVEYLOGISTIC

  • See http://support.sas.com/onlinedoc/913/docMainpage.jsp

SLIDE 23

Using Stata for Regression

  • Stata 9: SVY procedures for regression include

– svy:regress
– svy:logistic
– svy:logit

  • See www.stata.com/stata9/svy.html for more detail

SLIDE 24

Using R / S+ for Regression

  • ‘survey’ package by Thomas Lumley

– Must install as a library for S+ or R
– A copy is up on Blackboard

  • Has svyglm for generalized linear models
  • If it works like the usual glm in S+, it can do linear and logistic modeling

– But I need to look more closely at it…

  • See http://faculty.washington.edu/tlumley/survey/

SLIDE 25

What We Have Just Learned

  • Introduced logistic regression

– Discussed when and why it is useful
– Interpreted output

  • Odds and odds ratios

– Illustrated use with examples

  • Showed how to run in JMP
  • Discussed other software for fitting linear and logistic regression models to complex survey data
