
SLIDE 1

Lecture 4: Logistic Regression

Instructor: Prof. Shuai Huang
Industrial and Systems Engineering, University of Washington

SLIDE 2

Extend linear model for classification

  • We need a mathematical transfer function to connect the linear score $\beta_0 + \sum_{j=1}^{p} \beta_j x_j$ with a binary outcome $y$.
  • How?
  • Logistic regression chooses to use

    $$\log \frac{p(\boldsymbol{x})}{1 - p(\boldsymbol{x})} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j,$$

    where $p(\boldsymbol{x}) = \Pr(y = 1 \mid \boldsymbol{x})$; a numerical illustration of this transfer function follows below.

  • Why?
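As a first look at the transfer function, here is a minimal R sketch (all coefficient and predictor values are illustrative) showing how the logistic function maps an arbitrary linear score into a probability in $(0, 1)$, and how inverting it recovers the log-odds used in the model:

```r
# The logistic (sigmoid) function maps any real-valued score into (0, 1)
logistic <- function(eta) 1 / (1 + exp(-eta))

# Hypothetical coefficients and predictor values, for illustration only
beta <- c(0.5, -1.2, 2.0)          # beta_0, beta_1, beta_2
x    <- c(1.0, 0.3)                # two predictor values

eta <- beta[1] + sum(beta[-1] * x) # linear score beta_0 + sum_j beta_j x_j
p   <- logistic(eta)               # interpreted as Pr(y = 1 | x)
p
log(p / (1 - p))                   # the log-odds; recovers eta exactly
```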
SLIDE 3

Justification for the logistic regression model

  • It works well in many applications.
  • It is analytically tractable (in some respects), which encourages in-depth theoretical investigation.
  • It has a strong tie with the linear regression model. Methodologically, there is much we can translate from linear regression to logistic regression; conceptually, it inherits the aura of the linear regression model, so users can place a similar degree of confidence in the logistic regression model.

SLIDE 4

Parameter estimation

  • The likelihood function is:

    $$L(\boldsymbol{\beta}) = \prod_{n=1}^{N} p(\boldsymbol{x}_n)^{y_n} \left( 1 - p(\boldsymbol{x}_n) \right)^{1 - y_n}.$$

  • We use the log-likelihood to turn products into sums:

    $$l(\boldsymbol{\beta}) = \sum_{n=1}^{N} \left[ y_n \log p(\boldsymbol{x}_n) + (1 - y_n) \log\left( 1 - p(\boldsymbol{x}_n) \right) \right].$$

    This can be further transformed. Writing $\eta_n = \beta_0 + \sum_{j=1}^{p} \beta_j x_{nj}$ and splitting the terms gives

    $$l(\boldsymbol{\beta}) = \sum_{n=1}^{N} \log\left( 1 - p(\boldsymbol{x}_n) \right) + \sum_{n=1}^{N} y_n \log \frac{p(\boldsymbol{x}_n)}{1 - p(\boldsymbol{x}_n)},$$

    and since the model implies $1 - p(\boldsymbol{x}_n) = 1/(1 + e^{\eta_n})$ and $\log\frac{p(\boldsymbol{x}_n)}{1 - p(\boldsymbol{x}_n)} = \eta_n$, we obtain

    $$l(\boldsymbol{\beta}) = \sum_{n=1}^{N} y_n \left( \beta_0 + \sum_{j=1}^{p} \beta_j x_{nj} \right) - \sum_{n=1}^{N} \log\left( 1 + e^{\beta_0 + \sum_{j=1}^{p} \beta_j x_{nj}} \right).$$

    A numerical check of this identity appears below.

SLIDE 5

Application of the Newton-Raphson algorithm

  • The Newton-Raphson algorithm is an iterative algorithm that updates the current solution using the following formula:

    $$\boldsymbol{\beta}^{\text{new}} = \boldsymbol{\beta}^{\text{old}} - \left( \frac{\partial^2 l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}^T} \right)^{-1} \frac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}.$$

  • We can show that

    $$\frac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \sum_{n=1}^{N} \boldsymbol{x}_n \left( y_n - p(\boldsymbol{x}_n) \right), \qquad
    \frac{\partial^2 l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}^T} = -\sum_{n=1}^{N} \boldsymbol{x}_n \boldsymbol{x}_n^T \, p(\boldsymbol{x}_n) \left( 1 - p(\boldsymbol{x}_n) \right).$$

  • A certain structure is revealed if we rewrite these in matrix form:

    $$\frac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \mathbf{X}^T (\boldsymbol{y} - \boldsymbol{p}), \qquad
    \frac{\partial^2 l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}^T} = -\mathbf{X}^T \mathbf{W} \mathbf{X},$$

    where $\mathbf{X}$ is the $N \times (p+1)$ input matrix, $\boldsymbol{y}$ is the $N \times 1$ column vector of the $y_n$, $\boldsymbol{p}$ is the $N \times 1$ column vector of the $p(\boldsymbol{x}_n)$, and $\mathbf{W}$ is an $N \times N$ diagonal matrix of weights whose $n$th diagonal element is $p(\boldsymbol{x}_n) \left( 1 - p(\boldsymbol{x}_n) \right)$.
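These matrix forms can be checked in a few lines of R. The sketch below (self-contained, with illustrative simulated data) computes the gradient and Hessian as $\mathbf{X}^T(\boldsymbol{y} - \boldsymbol{p})$ and $-\mathbf{X}^T \mathbf{W} \mathbf{X}$, and compares the gradient against a finite-difference approximation:

```r
set.seed(1)
N <- 100
X <- cbind(1, rnorm(N), rnorm(N))             # N x (p+1) design matrix
beta <- c(0.5, 1.0, -1.5)                     # hypothetical current solution
prob <- as.vector(1 / (1 + exp(-X %*% beta))) # p(x_n) at this beta
y <- rbinom(N, 1, prob)                       # simulated outcomes

grad <- t(X) %*% (y - prob)                   # gradient: X^T (y - p)
W    <- diag(prob * (1 - prob))               # N x N diagonal weight matrix
hess <- -t(X) %*% W %*% X                     # Hessian: -X^T W X

# Finite-difference check of the first gradient coordinate
loglik <- function(b) sum(y * (X %*% b) - log(1 + exp(X %*% b)))
eps <- 1e-6
(loglik(beta + c(eps, 0, 0)) - loglik(beta - c(eps, 0, 0))) / (2 * eps)
grad[1]                                       # should agree closely
```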

SLIDE 6

The updating rule

Plugging these into the updating formula of the Newton-Raphson algorithm,

$$\boldsymbol{\beta}^{\text{new}} = \boldsymbol{\beta}^{\text{old}} - \left( \frac{\partial^2 l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}^T} \right)^{-1} \frac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}},$$

we can derive that

$$\boldsymbol{\beta}^{\text{new}} = \boldsymbol{\beta}^{\text{old}} + \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T (\boldsymbol{y} - \boldsymbol{p})
= \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{W} \left( \mathbf{X} \boldsymbol{\beta}^{\text{old}} + \mathbf{W}^{-1} (\boldsymbol{y} - \boldsymbol{p}) \right)
= \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{W} \boldsymbol{z},$$

where $\boldsymbol{z} = \mathbf{X} \boldsymbol{\beta}^{\text{old}} + \mathbf{W}^{-1} (\boldsymbol{y} - \boldsymbol{p})$.
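The chain of equalities above can also be confirmed numerically. This R sketch (illustrative simulated data) takes a single Newton-Raphson step in both forms and checks that they produce the same $\boldsymbol{\beta}^{\text{new}}$:

```r
set.seed(2)
N <- 100
X <- cbind(1, rnorm(N), rnorm(N))                 # design matrix
y <- rbinom(N, 1, 1 / (1 + exp(-(0.5 + X[, 2] - X[, 3]))))

beta_old <- rep(0, 3)                             # current solution
prob <- as.vector(1 / (1 + exp(-X %*% beta_old))) # p(x_n) at beta_old
W <- diag(prob * (1 - prob))

# Form 1: beta_old + (X^T W X)^{-1} X^T (y - p)
beta_new1 <- beta_old + solve(t(X) %*% W %*% X, t(X) %*% (y - prob))

# Form 2: weighted least squares on the adjusted response z
z <- X %*% beta_old + solve(W, y - prob)          # z = X beta_old + W^{-1}(y - p)
beta_new2 <- solve(t(X) %*% W %*% X, t(X) %*% W %*% z)

all.equal(as.vector(beta_new1), as.vector(beta_new2))  # TRUE
```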

SLIDE 7

Another look at the updating rule

  • This resembles the generalized least squares (GLS) estimator of a regression model, where each data point $(\boldsymbol{x}_n, y_n)$ is associated with a weight $w_n$ to reduce the influence of potential outliers in fitting the regression model:

    $$\boldsymbol{\beta}^{\text{new}} \leftarrow \arg\min_{\boldsymbol{\beta}} \; (\boldsymbol{z} - \mathbf{X} \boldsymbol{\beta})^T \mathbf{W} (\boldsymbol{z} - \mathbf{X} \boldsymbol{\beta}).$$

  • For this reason, this algorithm is also called the Iteratively Reweighted Least Squares, or IRLS, algorithm. Here $\boldsymbol{z}$ is referred to as the adjusted response; see the sketch below for a concrete one-step illustration.
  • Why does the weighting make sense? Or, what are the implications of this?
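In R, this weighted least-squares view can be made literal: one IRLS step is simply lm() with the current weights and the adjusted response. A sketch on simulated data (all names are illustrative):

```r
set.seed(3)
N <- 100
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- rbinom(N, 1, 1 / (1 + exp(-(0.5 + x1 - x2))))
X  <- cbind(1, x1, x2)

beta <- rep(0, 3)                                  # current solution
prob <- as.vector(1 / (1 + exp(-X %*% beta)))
w <- prob * (1 - prob)                             # weights p(x_n)(1 - p(x_n))
z <- as.vector(X %*% beta + (y - prob) / w)        # adjusted response

# One IRLS step as a weighted least-squares fit of z on the predictors;
# coef(fit) equals (X^T W X)^{-1} X^T W z
fit <- lm(z ~ x1 + x2, weights = w)
coef(fit)
```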

SLIDE 8

A summary of the IRLS algorithm

Putting all these together, the complete flow of IRLS is shown below (a complete implementation sketch follows the list):

  1. Initialize $\boldsymbol{\beta}$.
  2. Compute $\boldsymbol{p}$ by its definition: $p(\boldsymbol{x}_n) = \frac{1}{1 + e^{-\left( \beta_0 + \sum_{j=1}^{p} \beta_j x_{nj} \right)}}$ for $n = 1, 2, \ldots, N$.
  3. Compute the diagonal matrix $\mathbf{W}$, whose $n$th diagonal element is $p(\boldsymbol{x}_n) \left( 1 - p(\boldsymbol{x}_n) \right)$ for $n = 1, 2, \ldots, N$.
  4. Set $\boldsymbol{z} = \mathbf{X} \boldsymbol{\beta} + \mathbf{W}^{-1} (\boldsymbol{y} - \boldsymbol{p})$.
  5. Set $\boldsymbol{\beta} = \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{W} \boldsymbol{z}$.
  6. If the stopping criterion is met, stop; otherwise go back to step 2.
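The whole loop fits in a short R function. Below is a sketch of the IRLS flow exactly as listed above (the tolerance, iteration cap, and simulated data are illustrative choices), with a check against R's built-in glm(), which fits logistic regression by the same kind of iteratively reweighted least squares:

```r
irls_logistic <- function(X, y, tol = 1e-8, max_iter = 100) {
  beta <- rep(0, ncol(X))                          # step 1: initialize beta
  for (iter in seq_len(max_iter)) {
    prob <- as.vector(1 / (1 + exp(-X %*% beta)))  # step 2: compute p(x_n)
    w <- prob * (1 - prob)                         # step 3: diagonal of W
    z <- as.vector(X %*% beta + (y - prob) / w)    # step 4: adjusted response
    beta_new <- as.vector(solve(t(X) %*% (w * X),  # step 5: weighted LS solve,
                                t(X) %*% (w * z))) #   (X^T W X)^{-1} X^T W z
    if (max(abs(beta_new - beta)) < tol) return(beta_new)  # step 6: stop?
    beta <- beta_new                               # otherwise back to step 2
  }
  beta
}

# Illustrative check against glm() on simulated data
set.seed(4)
N <- 500
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- rbinom(N, 1, 1 / (1 + exp(-(0.5 + x1 - x2))))

irls_logistic(cbind(1, x1, x2), y)
coef(glm(y ~ x1 + x2, family = binomial))          # should agree closely
```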
SLIDE 9

R lab

  • Download the R Markdown code from the course website
  • Conduct the experiments
  • Interpret the results
  • Repeat the analysis on other datasets (a self-contained warm-up sketch follows below)
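If the course markdown file is not at hand, the sketch below is a self-contained warm-up in the same spirit (the dataset is simulated, so all names and numbers are illustrative):

```r
# Simulate a small binary-outcome dataset
set.seed(5)
N <- 200
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- rbinom(N, 1, 1 / (1 + exp(-(0.5 + x1 - x2))))
dat <- data.frame(y, x1, x2)

# Fit a logistic regression; glm() with family = binomial uses IRLS internally
fit <- glm(y ~ x1 + x2, data = dat, family = binomial)
summary(fit)                           # estimates, standard errors, z-values

# Predicted probabilities p(x) on the training data
head(predict(fit, type = "response"))
```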