Slide 1

Linear and Logistic Regression

Yingyu Liang Computer Sciences 760 Fall 2017

http://pages.cs.wisc.edu/~yliang/cs760/

Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

Slide 2

Goals for the lecture

  • understand the following concepts:
      • linear regression
      • closed-form solution for linear regression
      • lasso
      • RMSE, MAE, and R-square
      • logistic regression for linear classification
      • gradient descent for logistic regression
      • multiclass logistic regression
Slide 3

Linear regression

  • Given training data $\{(x^{(i)}, y^{(i)}) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Find $f_w(x) = w^T x$ (hypothesis class $\mathcal{H}$: linear functions) that minimizes

$$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x^{(i)} - y^{(i)} \right)^2$$

  • This is the $\ell_2$ loss, also called the mean squared error
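To make the objective concrete, here is a minimal NumPy sketch (not part of the original slides) of evaluating this mean squared error for a given weight vector; the function name and shapes are illustrative.

```python
import numpy as np

def mse_loss(w, X, y):
    """(1/n) * sum_i (w^T x^(i) - y^(i))^2 for the linear hypothesis f_w.

    X: (n, d) matrix with one example per row, y: (n,) label vector.
    """
    residuals = X @ w - y
    return np.mean(residuals ** 2)
```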

Slide 4

Linear regression: optimization

  • Given training data $\{(x^{(i)}, y^{(i)}) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Find $f_w(x) = w^T x$ that minimizes

$$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x^{(i)} - y^{(i)} \right)^2$$

  • Let $X$ be the matrix whose $i$-th row is $(x^{(i)})^T$, and let $y$ be the vector $(y^{(1)}, \dots, y^{(n)})^T$. Then

$$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x^{(i)} - y^{(i)} \right)^2 = \frac{1}{n} \left\| Xw - y \right\|_2^2$$

Slide 5

Linear regression: optimization

  • Set the gradient to 0 to get the minimizer:

$$\nabla_w \hat{L}(f_w) = \nabla_w \frac{1}{n} \left\| Xw - y \right\|_2^2 = 0$$

$$\nabla_w \left[ (Xw - y)^T (Xw - y) \right] = 0$$

$$\nabla_w \left[ w^T X^T X w - 2 w^T X^T y + y^T y \right] = 0$$

$$2 X^T X w - 2 X^T y = 0 \quad\Longrightarrow\quad w = (X^T X)^{-1} X^T y$$

Slide 6

Linear regression: optimization

  • Algebraic view of the minimizer
  • If $X$ is invertible, just solve $Xw = y$ and get $w = X^{-1} y$
  • But typically $X$ is a tall matrix (more examples than features), so $Xw = y$ has no exact solution; multiply both sides by $X^T$ instead:

$$Xw = y \;\Longrightarrow\; X^T X w = X^T y \qquad \text{Normal equation: } w = (X^T X)^{-1} X^T y$$

Slide 7

Linear regression with bias

  • Given training data $\{(x^{(i)}, y^{(i)}) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Find $f_{w,b}(x) = w^T x + b$, where $b$ is the bias term, to minimize the loss
  • Reduce to the case without bias:
  • Let $w' = (w; b)$ and $x' = (x; 1)$
  • Then $f_{w,b}(x) = w^T x + b = (w')^T x'$
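A small sketch (not from the slides) of this reduction: append a constant-1 feature so that the bias is learned as the last coordinate of the augmented weight vector; the data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0      # true bias of 4.0

X_aug = np.hstack([X, np.ones((X.shape[0], 1))])        # x' = (x; 1)
w_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)   # normal equation on x'
w_hat, b_hat = w_aug[:-1], w_aug[-1]                    # recover w and b
```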

Slide 8

Linear regression with lasso penalty

  • Given training data $\{(x^{(i)}, y^{(i)}) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Find $f_w(x) = w^T x$ that minimizes

$$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x^{(i)} - y^{(i)} \right)^2 + \lambda \left\| w \right\|_1$$

  • The lasso penalty is the $\ell_1$ norm of the parameter vector; it encourages sparsity
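One common way to fit this in practice is scikit-learn's Lasso (a sketch, not part of the slides); note that scikit-learn scales the objective slightly differently, minimizing $\frac{1}{2n}\|Xw - y\|_2^2 + \alpha \|w\|_1$, so alpha plays the role of the $\lambda$ above up to a constant factor.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[:3] = [3.0, -2.0, 1.5]                 # only 3 of the 10 features matter
y = X @ w_true + 0.1 * rng.normal(size=200)

model = Lasso(alpha=0.1)                      # alpha ~ regularization weight lambda
model.fit(X, y)
print(model.coef_)                            # most coefficients are driven exactly to zero
```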

Slide 9

Evaluation Metrics

  • Root mean squared error (RMSE)
  • Mean absolute error (MAE) – the average $\ell_1$ error
  • R-square ($R^2$)
  • Historically all were computed on the training data, and possibly adjusted afterwards, but really one should cross-validate

Slide 10

R-square

  • Formulation 1: one minus the fraction of variance left unexplained,

$$R^2 = 1 - \frac{\sum_i \left( y^{(i)} - \hat{y}^{(i)} \right)^2}{\sum_i \left( y^{(i)} - \bar{y} \right)^2}$$

  • Formulation 2: the square of the Pearson correlation coefficient $r$ between the label and the prediction. Recall that for $x$, $y$:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \; \sum_i (y_i - \bar{y})^2}}$$
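A minimal sketch (not from the slides) computing the three metrics in NumPy; the $R^2$ here uses formulation 1.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 (formulation 1) for a set of predictions."""
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2
```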

Slide 11

Linear classification

[Figure: a linear decision boundary $w^T x = 0$ with normal vector $w$; points with $w^T x > 0$ are labeled Class 1 and points with $w^T x < 0$ are labeled Class 0]

Slide 12

Linear classification: natural attempt

  • Given training data $\{(x^{(i)}, y^{(i)}) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Hypothesis (linear model class $\mathcal{H}$): $f_w(x) = w^T x$
  • $y = 1$ if $w^T x > 0$
  • $y = 0$ if $w^T x < 0$
  • Prediction: $y = \mathrm{step}(f_w(x)) = \mathrm{step}(w^T x)$

Slide 13

Linear classification: natural attempt

  • Given training data $\{(x^{(i)}, y^{(i)}) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Find $f_w(x) = w^T x$ to minimize the 0-1 loss

$$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\left[ \mathrm{step}(w^T x^{(i)}) \ne y^{(i)} \right]$$

  • Drawback: difficult to optimize
  • NP-hard in the worst case

Slide 14

Linear classification: simple approach

  • Given training data $\{(x^{(i)}, y^{(i)}) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Find $f_w(x) = w^T x$ that minimizes

$$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x^{(i)} - y^{(i)} \right)^2$$

  • That is, reduce to linear regression and ignore the fact that $y \in \{0, 1\}$

Slide 15

Linear classification: simple approach

Figure borrowed from Pattern Recognition and Machine Learning, Bishop

Drawback: not robust to β€œoutliers”

Slide 16

Compare the two

[Figure: the two predictors plotted against $w^T x$: the linear output $y = w^T x$ versus the thresholded output $y = \mathrm{step}(w^T x)$]

Slide 17

Between the two

  • Prediction bounded in $[0, 1]$
  • Smooth
  • Sigmoid: $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$

Figure borrowed from Pattern Recognition and Machine Learning, Bishop

Slide 18

Linear classification: sigmoid prediction

  • Squash the output of the linear function:

$$\mathrm{sigmoid}(w^T x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$$

  • Find $w$ that minimizes

$$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \left( \sigma(w^T x^{(i)}) - y^{(i)} \right)^2$$

Slide 19

Linear classification: logistic regression

  • Squash the output of the linear function:

$$\mathrm{sigmoid}(w^T x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$$

  • A better approach: interpret it as a probability

$$P_w(y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$$

$$P_w(y = 0 \mid x) = 1 - P_w(y = 1 \mid x) = 1 - \sigma(w^T x)$$

Slide 20

Linear classification: logistic regression

  • Instead of finding $f_w(x) = w^T x$ that minimizes the squared loss

$$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x^{(i)} - y^{(i)} \right)^2$$

  • Find $w$ that minimizes the negative log-likelihood

$$\hat{L}(w) = -\frac{1}{n} \sum_{i=1}^{n} \log P_w\!\left( y^{(i)} \mid x^{(i)} \right)$$

$$\hat{L}(w) = -\frac{1}{n} \sum_{i:\, y^{(i)} = 1} \log \sigma(w^T x^{(i)}) \;-\; \frac{1}{n} \sum_{i:\, y^{(i)} = 0} \log\left[ 1 - \sigma(w^T x^{(i)}) \right]$$

Logistic regression: maximum likelihood estimation (MLE) with the sigmoid model

Slide 21

Linear classification: logistic regression

  • Given training data $\{(x^{(i)}, y^{(i)}) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Find $w$ that minimizes

$$\hat{L}(w) = -\frac{1}{n} \sum_{i:\, y^{(i)} = 1} \log \sigma(w^T x^{(i)}) \;-\; \frac{1}{n} \sum_{i:\, y^{(i)} = 0} \log\left[ 1 - \sigma(w^T x^{(i)}) \right]$$

  • No closed-form solution; need to use gradient descent
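A minimal NumPy sketch (not from the slides) of batch gradient descent on this negative log-likelihood; the gradient $\frac{1}{n} X^T (\sigma(Xw) - y)$ follows from the sigmoid derivative on the next slide. The learning rate and iteration count are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_regression_gd(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.

    X: (n, d) design matrix, y: (n,) labels in {0, 1}.
    Gradient of the loss above: (1/n) * X^T (sigmoid(Xw) - y).
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(X @ w)          # predicted P_w(y = 1 | x) for each example
        w -= lr * (X.T @ (p - y) / n)
    return w
```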

Slide 22

Properties of sigmoid function

  • Bounded:

$$\sigma(a) = \frac{1}{1 + \exp(-a)} \in (0, 1)$$

  • Symmetric:

$$1 - \sigma(a) = \frac{\exp(-a)}{1 + \exp(-a)} = \frac{1}{\exp(a) + 1} = \sigma(-a)$$

  • Gradient:

$$\sigma'(a) = \frac{\exp(-a)}{\left( 1 + \exp(-a) \right)^2} = \sigma(a)\left( 1 - \sigma(a) \right)$$
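A quick numeric check (not from the slides) of the symmetry and gradient identities, comparing the closed-form derivative against a finite difference.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5.0, 5.0, 11)

# Symmetry: 1 - sigma(a) == sigma(-a)
assert np.allclose(1 - sigmoid(a), sigmoid(-a))

# Gradient: sigma'(a) = sigma(a) * (1 - sigma(a)), checked by central differences.
eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
assert np.allclose(numeric, sigmoid(a) * (1 - sigmoid(a)), atol=1e-6)
```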

Slide 23

Review: binary logistic regression

  • Sigmoid:

$$\sigma(w^T x + b) = \frac{1}{1 + \exp(-(w^T x + b))}$$

  • Interpret as a conditional probability:

$$p_w(y = 1 \mid x) = \sigma(w^T x + b), \qquad p_w(y = 0 \mid x) = 1 - p_w(y = 1 \mid x) = 1 - \sigma(w^T x + b)$$

  • How to extend to multiclass?
Slide 24

Review: binary logistic regression

  • Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
  • Conditional probability by Bayes' rule:

$$p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 2)\, p(y = 2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$$

where we define

$$a := \ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = \ln \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)}$$

Slide 25

Review: binary logistic regression

  • Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
  • Setting $p(y = 1 \mid x) = \sigma(a) = \sigma(w^T x + b)$ is equivalent to setting the log odds to be linear:

$$a = \ln \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)} = w^T x + b$$

  • Why linear log odds?
Slide 26

Review: binary logistic regression

  • Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance:

$$p(x \mid y = k) = \mathcal{N}(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{ -\frac{1}{2} \left\| x - \mu_k \right\|^2 \right\}$$

  • Then the log odds is linear:

$$a = \ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = w^T x + b, \quad \text{where } w = \mu_1 - \mu_2, \;\; b = -\frac{1}{2} \mu_1^T \mu_1 + \frac{1}{2} \mu_2^T \mu_2 + \ln \frac{p(y = 1)}{p(y = 2)}$$

Slide 27

Multiclass logistic regression

  • Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
  • Conditional probability by Bayes' rule:

$$p(y = k \mid x) = \frac{p(x \mid y = k)\, p(y = k)}{\sum_j p(x \mid y = j)\, p(y = j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \quad \text{where we define } a_k := \ln\left[ p(x \mid y = k)\, p(y = k) \right]$$

Slide 28

Multiclass logistic regression

  • Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance:

$$p(x \mid y = k) = \mathcal{N}(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{ -\frac{1}{2} \left\| x - \mu_k \right\|^2 \right\}$$

  • Then

$$a_k := \ln\left[ p(x \mid y = k)\, p(y = k) \right] = -\frac{1}{2} x^T x + w_k^T x + b_k, \quad \text{where } w_k = \mu_k, \;\; b_k = -\frac{1}{2} \mu_k^T \mu_k + \ln p(y = k) + \ln \frac{1}{(2\pi)^{d/2}}$$

Slide 29

Multiclass logistic regression

  • Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance:

$$p(x \mid y = k) = \mathcal{N}(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{ -\frac{1}{2} \left\| x - \mu_k \right\|^2 \right\}$$

  • The term $-\frac{1}{2} x^T x$ is the same for every class, so it cancels out and we have

$$p(y = k \mid x) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \quad a_k := w_k^T x + b_k, \quad \text{where } w_k = \mu_k, \;\; b_k = -\frac{1}{2} \mu_k^T \mu_k + \ln p(y = k) + \ln \frac{1}{(2\pi)^{d/2}}$$

Slide 30

Multiclass logistic regression: conclusion

  • Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance:

$$p(x \mid y = k) = \mathcal{N}(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{ -\frac{1}{2} \left\| x - \mu_k \right\|^2 \right\}$$

  • Then

$$p(y = k \mid x) = \frac{\exp(w_k^T x + b_k)}{\sum_j \exp(w_j^T x + b_j)}$$

which is the hypothesis class for multiclass logistic regression

  • It is a softmax applied to a linear transformation; it can be used to derive the negative log-likelihood loss (cross entropy)

Slide 31

Softmax

  • A way to squash a score vector $a = (a_1, a_2, \dots, a_k, \dots)$ into a probability vector $p$:

$$\mathrm{softmax}(a) = \left( \frac{\exp(a_1)}{\sum_j \exp(a_j)},\; \frac{\exp(a_2)}{\sum_j \exp(a_j)},\; \dots,\; \frac{\exp(a_k)}{\sum_j \exp(a_j)},\; \dots \right)$$

  • It behaves like max: when $a_k \gg a_j$ for all $j \ne k$, we get $p_k \approx 1$ and $p_j \approx 0$
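A minimal sketch (not from the slides) of a numerically stable softmax: subtracting $\max(a)$ cancels in the ratio but prevents overflow in the exponentials.

```python
import numpy as np

def softmax(a):
    """Squash a vector of scores into a probability vector."""
    shifted = a - np.max(a)       # shift-invariance of softmax; avoids overflow
    e = np.exp(shifted)
    return e / e.sum()

p = softmax(np.array([10.0, 0.0, -5.0, 9.5]))
print(p, p.sum())                 # sums to 1; the largest score dominates
```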
Slide 32

Cross entropy for conditional distribution

  • Let $\hat{p}_{\text{data}}(y \mid x)$ denote the empirical distribution of the data
  • The negative log-likelihood

$$-\frac{1}{n} \sum_{i=1}^{n} \log p\!\left( y = y^{(i)} \mid x^{(i)} \right) = -\mathbb{E}_{\hat{p}_{\text{data}}(y \mid x)} \log p(y \mid x)$$

is the cross entropy between $\hat{p}_{\text{data}}$ and the model output $p$

  • Information theory viewpoint: KL divergence

$$\mathrm{KL}\left( \hat{p}_{\text{data}} \,\|\, p \right) = \mathbb{E}_{\hat{p}_{\text{data}}}\left[ \log \frac{\hat{p}_{\text{data}}}{p} \right] = \mathbb{E}_{\hat{p}_{\text{data}}}\left[ \log \hat{p}_{\text{data}} \right] - \mathbb{E}_{\hat{p}_{\text{data}}}\left[ \log p \right]$$

The first term is the (negative) entropy of the data distribution, a constant; the second term is the (negative) cross entropy, so minimizing the cross entropy minimizes the KL divergence
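A minimal sketch (not from the slides) of this negative log-likelihood / cross entropy for a multiclass logistic regression model with weights W and biases b; the names and shapes are illustrative.

```python
import numpy as np

def softmax_rows(A):
    # Row-wise, numerically stable softmax over a matrix of scores.
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy_loss(W, b, X, y, eps=1e-12):
    """Average negative log-likelihood -(1/n) * sum_i log p(y^(i) | x^(i)).

    W: (d, K) weights, b: (K,) biases, X: (n, d) inputs,
    y: (n,) integer class labels in {0, ..., K-1}.
    """
    P = softmax_rows(X @ W + b)                    # (n, K) class probabilities
    n = X.shape[0]
    return -np.mean(np.log(P[np.arange(n), y] + eps))
```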

Slide 33

Cross entropy for full distribution

  • Let $\hat{p}_{\text{data}}(x, y)$ denote the empirical distribution of the data
  • The negative log-likelihood

$$-\frac{1}{n} \sum_{i=1}^{n} \log p\!\left( x^{(i)}, y^{(i)} \right) = -\mathbb{E}_{\hat{p}_{\text{data}}(x, y)} \log p(x, y)$$

is the cross entropy between $\hat{p}_{\text{data}}$ and the model output $p$