Advanced Section #5: Generalized Linear Models: Logistic Regression and Beyond


SLIDE 1: Advanced Section #5: Generalized Linear Models: Logistic Regression and Beyond

Nick Stern

CS109A Introduction to Data Science
Pavlos Protopapas, Kevin Rader, and Chris Tanner

SLIDE 2: Outline

1. Motivation
  • Limitations of linear regression
2. Anatomy
  • Exponential Dispersion Family (EDF)
  • Link function
3. Maximum Likelihood Estimation for GLMs
  • Fisher Scoring

SLIDE 3: Motivation

SLIDE 4: Motivation

Linear regression framework: $y_i = x_i^T \beta + \epsilon_i$

Assumptions:

1. Linearity: Linear relationship between the expected value and the predictors
2. Normality: Residuals are normally distributed about the expected value
3. Homoskedasticity: Residuals have constant variance $\sigma^2$
4. Independence: Observations are independent of one another

SLIDE 5: Motivation

Expressed mathematically…

  • Linearity: $\mathbb{E}[y_i] = x_i^T \beta$
  • Normality: $y_i \sim \mathcal{N}(x_i^T \beta, \sigma^2)$
  • Homoskedasticity: $\sigma^2$ (instead of $\sigma_i^2$)
  • Independence: $p(y_j \mid y_k) = p(y_j)$ for $j \neq k$

SLIDE 6: Motivation

What happens when our assumptions break down?

SLIDE 7: Motivation

We have options within the framework of linear regression:

  • Nonlinearity → transform X or Y (polynomial regression)
  • Heteroskedasticity → weight observations (WLS regression)

SLIDE 8: Motivation

But assuming Normality can be pretty limiting… Consider modeling the following random variables:

  • Whether a coin flip is heads or tails (Bernoulli)
  • Counts of species in a given area (Poisson)
  • Time between stochastic events that occur with constant rate (gamma)
  • Vote counts for multiple candidates in a poll (multinomial)

SLIDE 9: Motivation

We can extend the framework of linear regression. Enter: Generalized Linear Models. A GLM relaxes:

  • the Normality assumption
  • the Homoskedasticity assumption

SLIDE 10: Motivation

SLIDE 11: Anatomy

SLIDE 12: Anatomy

Two adjustments must be made to turn an LM into a GLM:

1. Assume the response variable comes from a family of distributions called the exponential dispersion family (EDF).
2. The relationship between the expected value and the predictors is expressed through a link function.

SLIDE 13: Anatomy – EDF Family

The EDF family contains: Normal, Poisson, gamma, and more! The probability density function looks like this:

$$f(y_i \mid \theta_i) = \exp\left(\frac{y_i \theta_i - b(\theta_i)}{\phi_i} + c(y_i, \phi_i)\right)$$

where

  • $\theta$ – “canonical parameter”
  • $\phi$ – “dispersion parameter”
  • $b(\theta)$ – “cumulant function”
  • $c(y, \phi)$ – “normalization factor”

SLIDE 14: Anatomy – EDF Family

Example: representing the Bernoulli distribution in EDF form.

PDF of a Bernoulli random variable:

$$f(y_i \mid p_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$$

Taking the log and then exponentiating (to cancel each other out) gives:

$$f(y_i \mid p_i) = \exp\big(y_i \log p_i + (1 - y_i) \log(1 - p_i)\big)$$

Rearranging terms…

$$f(y_i \mid p_i) = \exp\left(y_i \log\frac{p_i}{1 - p_i} + \log(1 - p_i)\right)$$

SLIDE 15: Anatomy – EDF Family

Comparing:

$$f(y_i \mid p_i) = \exp\left(y_i \log\frac{p_i}{1 - p_i} + \log(1 - p_i)\right)$$

vs.

$$f(y_i \mid \theta_i) = \exp\left(\frac{y_i \theta_i - b(\theta_i)}{\phi_i} + c(y_i, \phi_i)\right)$$

Choosing:

$$\theta_i = \log\frac{p_i}{1 - p_i}, \qquad \phi_i = 1, \qquad b(\theta_i) = \log(1 + e^{\theta_i}), \qquad c(y_i, \phi_i) = 0$$

And we recover the EDF form of the Bernoulli distribution.
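
A quick numeric sanity check of this identity (a sketch of my own, not from the slides):

```python
import numpy as np

def bernoulli_pmf(y, p):
    """Standard Bernoulli PMF: p^y * (1 - p)^(1 - y)."""
    return p**y * (1 - p)**(1 - y)

def bernoulli_edf(y, p):
    """EDF form with theta = logit(p), b(theta) = log(1 + e^theta),
    phi = 1, and c(y, phi) = 0."""
    theta = np.log(p / (1 - p))
    return np.exp(y * theta - np.log1p(np.exp(theta)))

for p in [0.1, 0.5, 0.9]:
    for y in [0, 1]:
        assert np.isclose(bernoulli_pmf(y, p), bernoulli_edf(y, p))
print("EDF form matches the Bernoulli PMF")
```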

SLIDE 16: Anatomy – EDF Family

The EDF family has some useful properties. Namely:

1. $\mathbb{E}[y_i] \equiv \mu_i = b'(\theta_i)$
2. $\mathrm{Var}(y_i) = \phi_i \, b''(\theta_i)$

(the proofs for these identities are in the notes)

Plugging in the values we obtained for the Bernoulli, we get back: $\mathbb{E}[y_i] = p_i$, $\mathrm{Var}(y_i) = p_i(1 - p_i)$.
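
These identities can also be checked symbolically; a minimal sketch, assuming SymPy is available:

```python
import sympy as sp

theta = sp.symbols("theta", real=True)

# Cumulant function for the Bernoulli in EDF form
b = sp.log(1 + sp.exp(theta))

mu = sp.diff(b, theta)       # E[y] = b'(theta)
var = sp.diff(b, theta, 2)   # Var(y) = phi * b''(theta), with phi = 1

# b'(theta) is the sigmoid (i.e. p), and b''(theta) = p * (1 - p)
print(sp.simplify(mu - sp.exp(theta) / (1 + sp.exp(theta))))  # 0
print(sp.simplify(var - mu * (1 - mu)))                       # 0
```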

SLIDE 17: Anatomy – Link Function

Time to talk about the link function.

SLIDE 18: Anatomy – Link Function

Recall from linear regression that:

$$\mu_i = x_i^T \beta$$

Does this work for the Bernoulli distribution?

$$\mu_i = p_i = x_i^T \beta \,?$$

No: $p_i$ must lie in $(0, 1)$, while $x_i^T \beta$ can be any real number.

Solution: wrap the expectation in a function called the link function:

$$g(\mu_i) = x_i^T \beta \equiv \eta_i$$

*For the Bernoulli distribution, the link function is the “logit” function (hence “logistic” regression).
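
As a concrete illustration (a sketch of my own, not from the slides), the logit maps $(0, 1)$ onto the whole real line, and its inverse, the sigmoid, maps it back:

```python
import numpy as np

def logit(mu):
    """Link function g: maps a mean in (0, 1) to the real line."""
    return np.log(mu / (1 - mu))

def sigmoid(eta):
    """Inverse link g^{-1}: maps a linear predictor back to (0, 1)."""
    return 1 / (1 + np.exp(-eta))

mu = np.array([0.1, 0.5, 0.9])
print(logit(mu))           # [-2.197...  0.  2.197...]
print(sigmoid(logit(mu)))  # [0.1 0.5 0.9], the round trip recovers mu
```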

SLIDE 19: Anatomy – Link Function

Link functions are a choice, not a property. A good choice is:

1. Differentiable (implies “smoothness”)
2. Monotonic (guarantees invertibility)
  • Typically increasing, so that $\mu$ increases with $\eta$
3. Expands the range of $\mu$ to the entire real line

Example: the logit function for the Bernoulli:

$$g(\mu_i) = g(p_i) = \log\frac{p_i}{1 - p_i}$$

SLIDE 20: Anatomy – Link Function

The logit function for the Bernoulli looks familiar…

$$g(p_i) = \log\frac{p_i}{1 - p_i} = \theta_i$$

Choosing the link function by setting $\theta_i = \eta_i$ gives us what is known as the “canonical link function.” Note:

$$\mu_i = b'(\theta_i) \;\rightarrow\; \theta_i = b'^{-1}(\mu_i)$$

(the derivative of the cumulant function must be invertible)

This choice of link, while not always effective, has some nice properties. Take STAT 149 to find out more!
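
To make the inversion concrete, here is the worked step for the Bernoulli, using the definitions above:

$$\mu = b'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} \;\Longrightarrow\; e^{\theta} = \frac{\mu}{1 - \mu} \;\Longrightarrow\; \theta = b'^{-1}(\mu) = \log\frac{\mu}{1 - \mu}$$

which is exactly the logit, confirming that the logit is the canonical link for the Bernoulli.
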
SLIDE 21: Anatomy – Link Function

Here are some more examples (fun exercises at home):

| Distribution $f(y_i \mid \theta_i)$ | Mean Function $\mu_i = b'(\theta_i)$ | Canonical Link $\theta_i = g(\mu_i)$ |
|-------------------------------------|--------------------------------------|--------------------------------------|
| Normal                              | $\theta_i$                           | $\mu_i$                              |
| Bernoulli/Binomial                  | $e^{\theta_i}/(1 + e^{\theta_i})$    | $\log\big(\mu_i/(1 - \mu_i)\big)$    |
| Poisson                             | $e^{\theta_i}$                       | $\log(\mu_i)$                        |
| Gamma                               | $-1/\theta_i$                        | $-1/\mu_i$                           |
| Inverse Gaussian                    | $(-2\theta_i)^{-1/2}$                | $-1/(2\mu_i^2)$                      |

SLIDE 22: Maximum Likelihood Estimation

SLIDE 23: Maximum Likelihood Estimation

Recall from linear regression: we can estimate our parameters, $\theta$, by choosing those that maximize the likelihood, $L(y \mid \theta)$, of the data, where:

$$L(y \mid \theta) = \prod_{i=1}^{N} p(y_i \mid \theta_i)$$

In words: the likelihood is the probability of observing a set of $N$ independent data points, given our assumptions about the generative process.

SLIDE 24: Maximum Likelihood Estimation

For GLMs we can plug in the PDF of the EDF family:

$$L(y \mid \theta) = \prod_{i=1}^{N} \exp\left(\frac{y_i \theta_i - b(\theta_i)}{\phi_i} + c(y_i, \phi_i)\right)$$

How do we maximize this? Differentiate w.r.t. $\theta$ and set equal to 0. Taking the log first simplifies our life:

$$\ell(y \mid \theta) = \sum_{i=1}^{N} \frac{y_i \theta_i - b(\theta_i)}{\phi_i} + \sum_{i=1}^{N} c(y_i, \phi_i)$$
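
For the Bernoulli case ($\phi_i = 1$, $b(\theta) = \log(1 + e^{\theta})$, $c = 0$, and $\theta_i = x_i^T \beta$ under the canonical link), the log-likelihood fits in a few lines. This is a sketch of my own, not the course code:

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Bernoulli GLM log-likelihood with the canonical (logit) link:
    theta_i = x_i^T beta, b(theta) = log(1 + e^theta), phi_i = 1,
    and the c(y_i, phi_i) term is identically zero."""
    theta = X @ beta
    return np.sum(y * theta - np.log1p(np.exp(theta)))
```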

SLIDE 25: Maximum Likelihood Estimation

Through lots of calculus & algebra (see notes), we can obtain the following form for the derivative of the log-likelihood:

$$\ell'(y \mid \theta) = \sum_{i=1}^{N} \frac{1}{\mathrm{Var}(y_i)} \frac{\partial \mu_i}{\partial \beta} (y_i - \mu_i)$$

Setting this sum equal to 0 gives us the generalized estimating equations:

$$\sum_{i=1}^{N} \frac{1}{\mathrm{Var}(y_i)} \frac{\partial \mu_i}{\partial \beta} (y_i - \mu_i) = 0$$

SLIDE 26: Maximum Likelihood Estimation

When we use the canonical link, this simplifies to the normal equations:

$$\sum_{i=1}^{N} \frac{(y_i - \mu_i) \, x_i^T}{\phi_i} = 0$$

Let’s attempt to solve the normal equations for the Bernoulli distribution. Plugging in $\mu_i$ and $\phi_i$ we get:

$$\sum_{i=1}^{N} \left(y_i - \frac{e^{x_i^T \beta}}{1 + e^{x_i^T \beta}}\right) x_i^T = 0$$
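
In code, the left-hand side is just the score (gradient) of the log-likelihood sketched earlier; again my own illustration under the same assumptions:

```python
import numpy as np

def score(beta, X, y):
    """Score of the Bernoulli log-likelihood under the canonical link:
    sum_i (y_i - sigmoid(x_i^T beta)) * x_i, stacked as a vector."""
    mu = 1 / (1 + np.exp(-(X @ beta)))
    return X.T @ (y - mu)
```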

SLIDE 27: Maximum Likelihood Estimation

Sad news: we can’t isolate $\beta$ analytically.

SLIDE 28: Maximum Likelihood Estimation

Good news: we can approximate it numerically. One choice of algorithm is the Fisher Scoring algorithm.

In order to find the $\theta$ that maximizes the log-likelihood $\ell(y \mid \theta)$:

1. Pick a starting value for our parameter, $\theta_0$.
2. Iteratively update this value as follows:

$$\theta_{t+1} = \theta_t - \frac{\ell'(\theta_t)}{\mathbb{E}[\ell''(\theta_t)]}$$

In words: perform gradient ascent with a learning rate inversely proportional to the expected curvature of the function at that point.

SLIDE 29: Maximum Likelihood Estimation

Here are the results of implementing the Fisher Scoring algorithm for simple logistic regression in Python:

DEMO
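
The original demo is not reproduced in this transcript; a stand-in on synthetic data might look like this, reusing `fisher_scoring` from the sketch above (names and numbers are illustrative, not the course's):

```python
import numpy as np

# Illustrative synthetic data, not the course demo
rng = np.random.default_rng(0)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # intercept + 1 feature
beta_true = np.array([-0.5, 2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta_true))))

beta_hat = fisher_scoring(X, y)
print(beta_hat)  # should land close to [-0.5, 2.0]
```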

SLIDE 30: Questions?