  1. Linear and Logistic Regression
     Yingyu Liang
     Computer Sciences 760, Fall 2017
     http://pages.cs.wisc.edu/~yliang/cs760/
     Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

  2. Goals for the lecture
     • understand the concepts:
       • linear regression
       • closed-form solution for linear regression
       • lasso
       • RMSE, MAE, and R-square
       • logistic regression for linear classification
       • gradient descent for logistic regression
       • multiclass logistic regression

  3. Linear regression
     • Given training data {(x^(j), y^(j)) : 1 ≤ j ≤ n} i.i.d. from distribution D
     • Find f_w(x) = w^T x that minimizes
       L̂(f_w) = (1/n) Σ_{j=1}^n (w^T x^(j) − y^(j))^2
       the ℓ2 loss, also called mean squared error
     • Hypothesis class: linear functions

  4. Linear regression: optimization
     • Given training data {(x^(j), y^(j)) : 1 ≤ j ≤ n} i.i.d. from distribution D
     • Find f_w(x) = w^T x that minimizes
       L̂(f_w) = (1/n) Σ_{j=1}^n (w^T x^(j) − y^(j))^2
     • Let X be the matrix whose j-th row is (x^(j))^T, and y be the vector (y^(1), …, y^(n))^T; then
       L̂(f_w) = (1/n) Σ_{j=1}^n (w^T x^(j) − y^(j))^2 = (1/n) ||Xw − y||_2^2

  5. Linear regression: optimization
     • Set the gradient to 0 to get the minimizer:
       ∇_w L̂(f_w) = ∇_w (1/n) ||Xw − y||_2^2 = 0
       ∇_w [(Xw − y)^T (Xw − y)] = 0
       ∇_w [w^T X^T X w − 2 w^T X^T y + y^T y] = 0
       2 X^T X w − 2 X^T y = 0
       w = (X^T X)^{-1} X^T y
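
A minimal numpy sketch of this closed-form solution on synthetic data (the toy data, names, and sizes are illustrative assumptions, not from the slides); solving the normal equations with np.linalg.solve avoids forming the explicit inverse.

```python
import numpy as np

# toy data (assumed for illustration): n = 100 examples, d = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

def mse_loss(w, X, y):
    """Mean squared error L(w) = (1/n) ||Xw - y||^2."""
    r = X @ w - y
    return (r @ r) / len(y)

# closed-form minimizer: solve the normal equations X^T X w = X^T y
# (numerically preferable to computing (X^T X)^{-1} explicitly)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat, mse_loss(w_hat, X, y))
```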

  6. Linear regression: optimization
     • Algebraic view of the minimizer
     • If X is invertible, just solve Xw = y and get w = X^{-1} y
     • But typically X is a tall matrix (more examples than features), so solve X^T X w = X^T y instead
     • Normal equation: w = (X^T X)^{-1} X^T y

  7. Linear regression with bias
     • Given training data {(x^(j), y^(j)) : 1 ≤ j ≤ n} i.i.d. from distribution D
     • Find f_{w,b}(x) = w^T x + b (b is the bias term) to minimize the loss
     • Reduce to the case without bias:
       • let w' = [w; b] and x' = [x; 1]
       • then f_{w,b}(x) = w^T x + b = (w')^T x'
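
In code the reduction amounts to appending a constant-1 feature to every example; a small sketch (the function name and shapes are assumptions):

```python
import numpy as np

def add_bias_column(X):
    """Append a constant 1 feature so that w' = [w; b] absorbs the bias b."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

# after augmenting, the same normal-equation solution fits w and b jointly:
# w_aug = argmin (1/n) ||X' w' - y||^2, with w' = [w; b]
```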

  8. Linear regression with lasso penalty
     • Given training data {(x^(j), y^(j)) : 1 ≤ j ≤ n} i.i.d. from distribution D
     • Find f_w(x) = w^T x that minimizes
       L̂(f_w) = (1/n) Σ_{j=1}^n (w^T x^(j) − y^(j))^2 + λ ||w||_1
     • Lasso penalty: the ℓ1 norm of the parameters; encourages sparsity
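
The lasso objective has no closed form and is usually minimized by coordinate descent or proximal methods; as one hedged illustration, scikit-learn's Lasso can be used (its alpha plays the role of λ, up to the library's 1/(2n) scaling of the squared-error term; the toy data below is assumed):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]        # only the first 3 features are informative
y = X @ w_true + 0.1 * rng.normal(size=100)

model = Lasso(alpha=0.1).fit(X, y)   # alpha ~ the penalty weight lambda
print(model.coef_)                   # many coefficients are driven exactly to 0
```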

  9. Evaluation Metrics
     • Root mean squared error (RMSE)
     • Mean absolute error (MAE): the average ℓ1 error
     • R-square (R-squared)
     • Historically these were computed on the training data (and possibly adjusted afterwards), but they really should be estimated by cross-validation
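
A short numpy sketch of the first two metrics (the function names are assumptions); R-square follows the next slide.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the average squared residual."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: the average absolute (l1) residual."""
    return np.mean(np.abs(y_true - y_pred))
```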

  10. R-square
     • Formulation 1: R^2 = 1 − SS_res / SS_tot, where SS_res = Σ_i (y_i − ŷ_i)^2 is the residual sum of squares and SS_tot = Σ_i (y_i − ȳ)^2 is the total sum of squares
     • Formulation 2: the square of the Pearson correlation coefficient r between the label and the prediction. Recall that for x, y:
       r = Σ_i (x_i − x̄)(y_i − ȳ) / sqrt(Σ_i (x_i − x̄)^2 · Σ_i (y_i − ȳ)^2)
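
A sketch of both formulations (formulation 1 written in the standard 1 − SS_res/SS_tot form); for ordinary least squares with an intercept the two coincide on the training data, but in general they can differ.

```python
import numpy as np

def r2_formulation1(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def r2_formulation2(y_true, y_pred):
    """Square of the Pearson correlation between label and prediction."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return r ** 2
```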

  11. Linear classification
     • Decision boundary: the hyperplane w^T x = 0, with normal vector w
     • w^T x > 0: Class 1
     • w^T x < 0: Class 0

  12. Linear classification: natural attempt
     • Given training data {(x^(j), y^(j)) : 1 ≤ j ≤ n} i.i.d. from distribution D
     • Hypothesis: the linear model f_w(x) = w^T x
       • y = 1 if w^T x > 0
       • y = 0 if w^T x < 0
     • Prediction: y = step(f_w(x)) = step(w^T x)

  13. Linear classification: natural attempt
     • Given training data {(x^(j), y^(j)) : 1 ≤ j ≤ n} i.i.d. from distribution D
     • Find f_w(x) = w^T x to minimize the 0-1 loss
       L̂(f_w) = (1/n) Σ_{j=1}^n 𝕀[step(w^T x^(j)) ≠ y^(j)]
     • Drawback: the 0-1 loss is difficult to optimize
       • NP-hard in the worst case

  14. Linear classification: simple approach
     • Given training data {(x^(j), y^(j)) : 1 ≤ j ≤ n} i.i.d. from distribution D
     • Find f_w(x) = w^T x that minimizes
       L̂(f_w) = (1/n) Σ_{j=1}^n (w^T x^(j) − y^(j))^2
     • Reduce to linear regression; ignore the fact that y ∈ {0, 1}

  15. Linear classification: simple approach
     • Drawback: not robust to "outliers"
     • Figure borrowed from Pattern Recognition and Machine Learning, Bishop

  16. Compare the two
     • Plot of the predictions against w^T x: the linear output y = w^T x versus the thresholded output y = step(w^T x)

  17. Between the two
     • Prediction bounded in [0, 1]
     • Smooth
     • Sigmoid: σ(a) = 1 / (1 + exp(−a))
     • Figure borrowed from Pattern Recognition and Machine Learning, Bishop

  18. Linear classification: sigmoid prediction
     • Squash the output of the linear function:
       Sigmoid: σ(w^T x) = 1 / (1 + exp(−w^T x))
     • Find w that minimizes
       L̂(w) = (1/n) Σ_{j=1}^n (σ(w^T x^(j)) − y^(j))^2

  19. Linear classification: logistic regression
     • Squash the output of the linear function:
       Sigmoid: σ(w^T x) = 1 / (1 + exp(−w^T x))
     • A better approach: interpret it as a probability
       P_w(y = 1 | x) = σ(w^T x) = 1 / (1 + exp(−w^T x))
       P_w(y = 0 | x) = 1 − P_w(y = 1 | x) = 1 − σ(w^T x)

  20. Linear classification: logistic regression
     • Earlier attempt: find f_w(x) = w^T x that minimizes the squared loss L̂(f_w) = (1/n) Σ_{j=1}^n (w^T x^(j) − y^(j))^2
     • Better: find w that minimizes the negative log-likelihood
       L̂(w) = −(1/n) Σ_{j=1}^n log P_w(y^(j) | x^(j))
             = −(1/n) Σ_{j: y^(j)=1} log σ(w^T x^(j)) − (1/n) Σ_{j: y^(j)=0} log[1 − σ(w^T x^(j))]
     • Logistic regression: MLE with the sigmoid model

  21. Linear classification: logistic regression
     • Given training data {(x^(j), y^(j)) : 1 ≤ j ≤ n} i.i.d. from distribution D
     • Find w that minimizes
       L̂(w) = −(1/n) Σ_{j: y^(j)=1} log σ(w^T x^(j)) − (1/n) Σ_{j: y^(j)=0} log[1 − σ(w^T x^(j))]
     • No closed-form solution; need to use gradient descent
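
A minimal gradient-descent sketch for this objective (step size, iteration count, and names are illustrative assumptions). Using σ'(a) = σ(a)(1 − σ(a)) from the next slide, the gradient of the averaged negative log-likelihood works out to (1/n) X^T (σ(Xw) − y).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(w, X, y):
    """Negative log-likelihood averaged over the n examples (labels y in {0, 1})."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def logistic_regression_gd(X, y, lr=0.1, n_iters=1000):
    """Plain gradient descent; the gradient of nll is (1/n) X^T (sigmoid(Xw) - y)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w
```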

  22. Properties of sigmoid function
     • Bounded: σ(a) = 1 / (1 + exp(−a)) ∈ (0, 1)
     • Symmetric: 1 − σ(a) = exp(−a) / (1 + exp(−a)) = 1 / (exp(a) + 1) = σ(−a)
     • Gradient: σ'(a) = exp(−a) / (1 + exp(−a))^2 = σ(a)(1 − σ(a))
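
The gradient identity can be sanity-checked numerically; a small sketch using a central finite difference (the test points and tolerance are arbitrary choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)   # finite difference
analytic = sigmoid(a) * (1 - sigmoid(a))                      # sigma'(a) = sigma(a)(1 - sigma(a))
assert np.allclose(numeric, analytic, atol=1e-8)
```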

  23. Review: binary logistic regression
     • Sigmoid: σ(w^T x + b) = 1 / (1 + exp(−(w^T x + b)))
     • Interpret as a conditional probability:
       p_w(y = 1 | x) = σ(w^T x + b)
       p_w(y = 0 | x) = 1 − p_w(y = 1 | x) = 1 − σ(w^T x + b)
     • How to extend to multiclass?

  24. Review: binary logistic regression
     • Suppose we model the class-conditional densities p(x | y = k) and the class probabilities p(y = k)
     • Conditional probability by Bayes' rule:
       p(y = 1 | x) = p(x | y = 1) p(y = 1) / [p(x | y = 1) p(y = 1) + p(x | y = 2) p(y = 2)] = 1 / (1 + exp(−a)) = σ(a)
       where we define
       a := ln [p(x | y = 1) p(y = 1) / (p(x | y = 2) p(y = 2))] = ln [p(y = 1 | x) / p(y = 2 | x)]

  25. Review: binary logistic regression
     • Suppose we model the class-conditional densities p(x | y = k) and the class probabilities p(y = k)
     • p(y = 1 | x) = σ(a) = σ(w^T x + b) is equivalent to setting the log odds to be linear:
       a = ln [p(y = 1 | x) / p(y = 2 | x)] = w^T x + b
     • Why linear log odds?

  26. Review: binary logistic regression
     • Suppose the class-conditional densities are normal with identity covariance:
       p(x | y = k) = N(x | μ_k, I) = (1 / (2π)^{d/2}) exp{−(1/2) ||x − μ_k||^2}
     • The log odds are then linear:
       a = ln [p(x | y = 1) p(y = 1) / (p(x | y = 2) p(y = 2))] = w^T x + b
       where w = μ_1 − μ_2 and b = −(1/2) μ_1^T μ_1 + (1/2) μ_2^T μ_2 + ln [p(y = 1) / p(y = 2)]
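
Filling in the algebra behind the slide: the quadratic terms in x cancel between the two Gaussians, which is exactly why the log odds come out linear.

```latex
a = \ln\frac{p(x\mid y=1)\,p(y=1)}{p(x\mid y=2)\,p(y=2)}
  = -\tfrac{1}{2}\|x-\mu_1\|^2 + \tfrac{1}{2}\|x-\mu_2\|^2 + \ln\frac{p(y=1)}{p(y=2)}
  = (\mu_1-\mu_2)^\top x - \tfrac{1}{2}\mu_1^\top\mu_1 + \tfrac{1}{2}\mu_2^\top\mu_2 + \ln\frac{p(y=1)}{p(y=2)}
```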

  27. Multiclass logistic regression
     • Suppose we model the class-conditional densities p(x | y = k) and the class probabilities p(y = k)
     • Conditional probability by Bayes' rule:
       p(y = k | x) = p(x | y = k) p(y = k) / Σ_{k'} p(x | y = k') p(y = k') = exp(a_k) / Σ_{k'} exp(a_{k'})
       where we define a_k := ln [p(x | y = k) p(y = k)]
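
This exp(a_k) / Σ_{k'} exp(a_{k'}) form is the softmax function; a brief sketch (subtracting the max is a standard numerical-stability trick and does not change the result, since the ratio is invariant to adding a constant to every a_k; the names W, b are illustrative):

```python
import numpy as np

def softmax(a):
    """p(y = k | x) = exp(a_k) / sum over k' of exp(a_k')."""
    a = a - np.max(a)          # shift for numerical stability; the output is unchanged
    e = np.exp(a)
    return e / e.sum()

def predict_proba(W, b, x):
    """Class probabilities from per-class scores a_k = w_k^T x + b_k (W has one row per class)."""
    return softmax(W @ x + b)
```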

  28. Multiclass logistic regression
     • Suppose the class-conditional densities are normal with identity covariance:
       p(x | y = k) = N(x | μ_k, I) = (1 / (2π)^{d/2}) exp{−(1/2) ||x − μ_k||^2}
     • Then
       a_k := ln [p(x | y = k) p(y = k)] = −(1/2) x^T x + w_k^T x + b_k
       where w_k = μ_k and b_k = −(1/2) μ_k^T μ_k + ln p(y = k) + ln [1 / (2π)^{d/2}]
       (the −(1/2) x^T x term is shared by all classes and cancels in the softmax ratio)
