Slide 1

Linear and Logistic Regression

Yingyu Liang Computer Sciences 760 Fall 2017

http://pages.cs.wisc.edu/~yliang/cs760/

Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

Slide 2

Goals for the lecture

  • understand the following concepts:
      • linear regression
      • closed-form solution for linear regression
      • lasso
      • RMSE, MAE, and R-square
      • logistic regression for linear classification
      • gradient descent for logistic regression
      • multiclass logistic regression
Slide 3

Linear regression

  • Given training data $\{(x^{(i)}, y^{(i)}) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Find $f_w(x) = w^T x$ (hypothesis class $\mathcal{H}$: linear functions) that minimizes

$$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x^{(i)} - y^{(i)} \right)^2$$

  • This is the $\ell_2$ loss, also called the mean squared error
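To make the objective concrete, here is a minimal NumPy sketch (not part of the original slides) of evaluating this mean squared error for a given weight vector; the function name and shapes are illustrative.

```python
import numpy as np

def mse_loss(w, X, y):
    """(1/n) * sum_i (w^T x^(i) - y^(i))^2 for the linear hypothesis f_w.

    X: (n, d) matrix with one example per row, y: (n,) label vector.
    """
    residuals = X @ w - y
    return np.mean(residuals ** 2)
```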

Slide 4

Linear regression: optimization

  • Given training data $\{(x^{(i)}, y^{(i)}) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Find $f_w(x) = w^T x$ that minimizes

$$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x^{(i)} - y^{(i)} \right)^2$$

  • Let $X$ be the matrix whose $i$-th row is $(x^{(i)})^T$, and let $y$ be the vector $(y^{(1)}, \dots, y^{(n)})^T$. Then

$$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x^{(i)} - y^{(i)} \right)^2 = \frac{1}{n} \left\| Xw - y \right\|_2^2$$

Slide 5

Linear regression: optimization

  • Set the gradient to 0 to get the minimizer:

$$\nabla_w \hat{L}(f_w) = \nabla_w \frac{1}{n} \left\| Xw - y \right\|_2^2 = 0$$

$$\nabla_w \left[ (Xw - y)^T (Xw - y) \right] = 0$$

$$\nabla_w \left[ w^T X^T X w - 2 w^T X^T y + y^T y \right] = 0$$

$$2 X^T X w - 2 X^T y = 0 \quad\Longrightarrow\quad w = (X^T X)^{-1} X^T y$$

Slide 6

Linear regression: optimization

  • Algebraic view of the minimizer
  • If $X$ is invertible, just solve $Xw = y$ and get $w = X^{-1} y$
  • But typically $X$ is a tall matrix (more examples than features), so $Xw = y$ has no exact solution; multiply both sides by $X^T$ instead:

$$Xw = y \;\Longrightarrow\; X^T X w = X^T y \qquad \text{Normal equation: } w = (X^T X)^{-1} X^T y$$

Slide 7

Linear regression with bias

  • Given training data $\{(x^{(i)}, y^{(i)}) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Find $f_{w,b}(x) = w^T x + b$, where $b$ is the bias term, to minimize the loss
  • Reduce to the case without bias:
  • Let $w' = (w; b)$ and $x' = (x; 1)$
  • Then $f_{w,b}(x) = w^T x + b = (w')^T x'$
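A small sketch (not from the slides) of this reduction: append a constant-1 feature so that the bias is learned as the last coordinate of the augmented weight vector; the data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0      # true bias of 4.0

X_aug = np.hstack([X, np.ones((X.shape[0], 1))])        # x' = (x; 1)
w_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)   # normal equation on x'
w_hat, b_hat = w_aug[:-1], w_aug[-1]                    # recover w and b
```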

Slide 8

Linear regression with lasso penalty

  • Given training data $\{(x^{(i)}, y^{(i)}) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Find $f_w(x) = w^T x$ that minimizes

$$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x^{(i)} - y^{(i)} \right)^2 + \lambda \left\| w \right\|_1$$

  • The lasso penalty is the $\ell_1$ norm of the parameter vector; it encourages sparsity
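One common way to fit this in practice is scikit-learn's Lasso (a sketch, not part of the slides); note that scikit-learn scales the objective slightly differently, minimizing $\frac{1}{2n}\|Xw - y\|_2^2 + \alpha \|w\|_1$, so alpha plays the role of the $\lambda$ above up to a constant factor.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[:3] = [3.0, -2.0, 1.5]                 # only 3 of the 10 features matter
y = X @ w_true + 0.1 * rng.normal(size=200)

model = Lasso(alpha=0.1)                      # alpha ~ regularization weight lambda
model.fit(X, y)
print(model.coef_)                            # most coefficients are driven exactly to zero
```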

Slide 9

Evaluation Metrics

  • Root mean squared error (RMSE)
  • Mean absolute error (MAE) – the average $\ell_1$ error
  • R-square ($R^2$)
  • Historically all were computed on the training data, and possibly adjusted afterwards, but really one should cross-validate

Slide 10

R-square

  • Formulation 1: one minus the fraction of variance left unexplained,

$$R^2 = 1 - \frac{\sum_i \left( y^{(i)} - \hat{y}^{(i)} \right)^2}{\sum_i \left( y^{(i)} - \bar{y} \right)^2}$$

  • Formulation 2: the square of the Pearson correlation coefficient $r$ between the label and the prediction. Recall that for $x$, $y$:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \; \sum_i (y_i - \bar{y})^2}}$$
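A minimal sketch (not from the slides) computing the three metrics in NumPy; the $R^2$ here uses formulation 1.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 (formulation 1) for a set of predictions."""
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2
```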

Slide 11

Linear classification

[Figure: a linear decision boundary $w^T x = 0$ with normal vector $w$; points with $w^T x > 0$ are labeled Class 1 and points with $w^T x < 0$ are labeled Class 0]

Slide 12

Linear classification: natural attempt

  • Given training data $\{(x^{(i)}, y^{(i)}) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Hypothesis (linear model class $\mathcal{H}$): $f_w(x) = w^T x$
  • $y = 1$ if $w^T x > 0$
  • $y = 0$ if $w^T x < 0$
  • Prediction: $y = \mathrm{step}(f_w(x)) = \mathrm{step}(w^T x)$

Slide 13

Linear classification: natural attempt

  • Given training data $\{(x^{(i)}, y^{(i)}) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Find $f_w(x) = w^T x$ to minimize the 0-1 loss

$$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\left[ \mathrm{step}(w^T x^{(i)}) \ne y^{(i)} \right]$$

  • Drawback: difficult to optimize
  • NP-hard in the worst case

Slide 14

Linear classification: simple approach

  • Given training data $\{(x^{(i)}, y^{(i)}) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Find $f_w(x) = w^T x$ that minimizes

$$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x^{(i)} - y^{(i)} \right)^2$$

  • That is, reduce to linear regression and ignore the fact that $y \in \{0, 1\}$

Slide 15

Linear classification: simple approach

Figure borrowed from Pattern Recognition and Machine Learning, Bishop

Drawback: not robust to β€œoutliers”

Slide 16

Compare the two

[Figure: the two predictors plotted against $w^T x$: the linear output $y = w^T x$ versus the thresholded output $y = \mathrm{step}(w^T x)$]

Slide 17

Between the two

  • Prediction bounded in $[0, 1]$
  • Smooth
  • Sigmoid: $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$

Figure borrowed from Pattern Recognition and Machine Learning, Bishop

Slide 18

Linear classification: sigmoid prediction

  • Squash the output of the linear function:

$$\mathrm{sigmoid}(w^T x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$$

  • Find $w$ that minimizes

$$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \left( \sigma(w^T x^{(i)}) - y^{(i)} \right)^2$$

Slide 19

Linear classification: logistic regression

  • Squash the output of the linear function:

$$\mathrm{sigmoid}(w^T x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$$

  • A better approach: interpret it as a probability

$$P_w(y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$$

$$P_w(y = 0 \mid x) = 1 - P_w(y = 1 \mid x) = 1 - \sigma(w^T x)$$

Slide 20

Linear classification: logistic regression

  • Instead of finding $f_w(x) = w^T x$ that minimizes the squared loss

$$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x^{(i)} - y^{(i)} \right)^2$$

  • Find $w$ that minimizes the negative log-likelihood

$$\hat{L}(w) = -\frac{1}{n} \sum_{i=1}^{n} \log P_w\!\left( y^{(i)} \mid x^{(i)} \right)$$

$$\hat{L}(w) = -\frac{1}{n} \sum_{i:\, y^{(i)} = 1} \log \sigma(w^T x^{(i)}) \;-\; \frac{1}{n} \sum_{i:\, y^{(i)} = 0} \log\left[ 1 - \sigma(w^T x^{(i)}) \right]$$

Logistic regression: maximum likelihood estimation (MLE) with the sigmoid model

Slide 21

Linear classification: logistic regression

  • Given training data $\{(x^{(i)}, y^{(i)}) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Find $w$ that minimizes

$$\hat{L}(w) = -\frac{1}{n} \sum_{i:\, y^{(i)} = 1} \log \sigma(w^T x^{(i)}) \;-\; \frac{1}{n} \sum_{i:\, y^{(i)} = 0} \log\left[ 1 - \sigma(w^T x^{(i)}) \right]$$

  • No closed-form solution; need to use gradient descent
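A minimal NumPy sketch (not from the slides) of batch gradient descent on this negative log-likelihood; the gradient $\frac{1}{n} X^T (\sigma(Xw) - y)$ follows from the sigmoid derivative on the next slide. The learning rate and iteration count are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_regression_gd(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.

    X: (n, d) design matrix, y: (n,) labels in {0, 1}.
    Gradient of the loss above: (1/n) * X^T (sigmoid(Xw) - y).
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(X @ w)          # predicted P_w(y = 1 | x) for each example
        w -= lr * (X.T @ (p - y) / n)
    return w
```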

Slide 22

Properties of sigmoid function

  • Bounded:

$$\sigma(a) = \frac{1}{1 + \exp(-a)} \in (0, 1)$$

  • Symmetric:

$$1 - \sigma(a) = \frac{\exp(-a)}{1 + \exp(-a)} = \frac{1}{\exp(a) + 1} = \sigma(-a)$$

  • Gradient:

$$\sigma'(a) = \frac{\exp(-a)}{\left( 1 + \exp(-a) \right)^2} = \sigma(a)\left( 1 - \sigma(a) \right)$$
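A quick numeric check (not from the slides) of the symmetry and gradient identities, comparing the closed-form derivative against a finite difference.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5.0, 5.0, 11)

# Symmetry: 1 - sigma(a) == sigma(-a)
assert np.allclose(1 - sigmoid(a), sigmoid(-a))

# Gradient: sigma'(a) = sigma(a) * (1 - sigma(a)), checked by central differences.
eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
assert np.allclose(numeric, sigmoid(a) * (1 - sigmoid(a)), atol=1e-6)
```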

Slide 23

Review: binary logistic regression

  • Sigmoid:

$$\sigma(w^T x + b) = \frac{1}{1 + \exp(-(w^T x + b))}$$

  • Interpret as a conditional probability:

$$p_w(y = 1 \mid x) = \sigma(w^T x + b), \qquad p_w(y = 0 \mid x) = 1 - p_w(y = 1 \mid x) = 1 - \sigma(w^T x + b)$$

  • How to extend to multiclass?
Slide 24

Review: binary logistic regression

  • Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
  • Conditional probability by Bayes' rule:

$$p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 2)\, p(y = 2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$$

where we define

$$a := \ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = \ln \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)}$$

Slide 25

Review: binary logistic regression

  • Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
  • Setting $p(y = 1 \mid x) = \sigma(a) = \sigma(w^T x + b)$ is equivalent to setting the log odds to be linear:

$$a = \ln \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)} = w^T x + b$$

  • Why linear log odds?
Slide 26

Review: binary logistic regression

  • Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance:

$$p(x \mid y = k) = \mathcal{N}(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{ -\frac{1}{2} \left\| x - \mu_k \right\|^2 \right\}$$

  • Then the log odds is linear:

$$a = \ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = w^T x + b, \quad \text{where } w = \mu_1 - \mu_2, \;\; b = -\frac{1}{2} \mu_1^T \mu_1 + \frac{1}{2} \mu_2^T \mu_2 + \ln \frac{p(y = 1)}{p(y = 2)}$$

Slide 27

Multiclass logistic regression

  • Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
  • Conditional probability by Bayes' rule:

$$p(y = k \mid x) = \frac{p(x \mid y = k)\, p(y = k)}{\sum_j p(x \mid y = j)\, p(y = j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \quad \text{where we define } a_k := \ln\left[ p(x \mid y = k)\, p(y = k) \right]$$

Slide 28

Multiclass logistic regression

  • Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance:

$$p(x \mid y = k) = \mathcal{N}(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{ -\frac{1}{2} \left\| x - \mu_k \right\|^2 \right\}$$

  • Then

$$a_k := \ln\left[ p(x \mid y = k)\, p(y = k) \right] = -\frac{1}{2} x^T x + w_k^T x + b_k, \quad \text{where } w_k = \mu_k, \;\; b_k = -\frac{1}{2} \mu_k^T \mu_k + \ln p(y = k) + \ln \frac{1}{(2\pi)^{d/2}}$$

Slide 29

Multiclass logistic regression

  • Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance:

$$p(x \mid y = k) = \mathcal{N}(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{ -\frac{1}{2} \left\| x - \mu_k \right\|^2 \right\}$$

  • The term $-\frac{1}{2} x^T x$ is the same for every class, so it cancels out and we have

$$p(y = k \mid x) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \quad a_k := w_k^T x + b_k, \quad \text{where } w_k = \mu_k, \;\; b_k = -\frac{1}{2} \mu_k^T \mu_k + \ln p(y = k) + \ln \frac{1}{(2\pi)^{d/2}}$$

Slide 30

Multiclass logistic regression: conclusion

  • Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance:

$$p(x \mid y = k) = \mathcal{N}(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{ -\frac{1}{2} \left\| x - \mu_k \right\|^2 \right\}$$

  • Then

$$p(y = k \mid x) = \frac{\exp(w_k^T x + b_k)}{\sum_j \exp(w_j^T x + b_j)}$$

which is the hypothesis class for multiclass logistic regression

  • It is a softmax applied to a linear transformation; it can be used to derive the negative log-likelihood loss (cross entropy)

Slide 31

Softmax

  • A way to squash a score vector $a = (a_1, a_2, \dots, a_k, \dots)$ into a probability vector $p$:

$$\mathrm{softmax}(a) = \left( \frac{\exp(a_1)}{\sum_j \exp(a_j)},\; \frac{\exp(a_2)}{\sum_j \exp(a_j)},\; \dots,\; \frac{\exp(a_k)}{\sum_j \exp(a_j)},\; \dots \right)$$

  • It behaves like max: when $a_k \gg a_j$ for all $j \ne k$, we get $p_k \approx 1$ and $p_j \approx 0$
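A minimal sketch (not from the slides) of a numerically stable softmax: subtracting $\max(a)$ cancels in the ratio but prevents overflow in the exponentials.

```python
import numpy as np

def softmax(a):
    """Squash a vector of scores into a probability vector."""
    shifted = a - np.max(a)       # shift-invariance of softmax; avoids overflow
    e = np.exp(shifted)
    return e / e.sum()

p = softmax(np.array([10.0, 0.0, -5.0, 9.5]))
print(p, p.sum())                 # sums to 1; the largest score dominates
```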
Slide 32

Cross entropy for conditional distribution

  • Let $\hat{p}_{\text{data}}(y \mid x)$ denote the empirical distribution of the data
  • The negative log-likelihood

$$-\frac{1}{n} \sum_{i=1}^{n} \log p\!\left( y = y^{(i)} \mid x^{(i)} \right) = -\mathbb{E}_{\hat{p}_{\text{data}}(y \mid x)} \log p(y \mid x)$$

is the cross entropy between $\hat{p}_{\text{data}}$ and the model output $p$

  • Information theory viewpoint: KL divergence

$$\mathrm{KL}\left( \hat{p}_{\text{data}} \,\|\, p \right) = \mathbb{E}_{\hat{p}_{\text{data}}}\left[ \log \frac{\hat{p}_{\text{data}}}{p} \right] = \mathbb{E}_{\hat{p}_{\text{data}}}\left[ \log \hat{p}_{\text{data}} \right] - \mathbb{E}_{\hat{p}_{\text{data}}}\left[ \log p \right]$$

The first term is the (negative) entropy of the data distribution, a constant; the second term is the (negative) cross entropy, so minimizing the cross entropy minimizes the KL divergence
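A minimal sketch (not from the slides) of this negative log-likelihood / cross entropy for a multiclass logistic regression model with weights W and biases b; the names and shapes are illustrative.

```python
import numpy as np

def softmax_rows(A):
    # Row-wise, numerically stable softmax over a matrix of scores.
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy_loss(W, b, X, y, eps=1e-12):
    """Average negative log-likelihood -(1/n) * sum_i log p(y^(i) | x^(i)).

    W: (d, K) weights, b: (K,) biases, X: (n, d) inputs,
    y: (n,) integer class labels in {0, ..., K-1}.
    """
    P = softmax_rows(X @ W + b)                    # (n, K) class probabilities
    n = X.shape[0]
    return -np.mean(np.log(P[np.arange(n), y] + eps))
```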

Slide 33

Cross entropy for full distribution

  • Let $\hat{p}_{\text{data}}(x, y)$ denote the empirical distribution of the data
  • The negative log-likelihood

$$-\frac{1}{n} \sum_{i=1}^{n} \log p\!\left( x^{(i)}, y^{(i)} \right) = -\mathbb{E}_{\hat{p}_{\text{data}}(x, y)} \log p(x, y)$$

is the cross entropy between $\hat{p}_{\text{data}}$ and the model output $p$