Lecture 3: Logistic Regression
Feng Li
Shandong University fli@sdu.edu.cn
September 21, 2020
1. Logistic Regression
2. Newton’s Method
3. Multiclass Classification
Classification problem

Similar to the regression problem, but we would like to predict only a small number of discrete values (instead of continuous values).
Binary classification problem: y ∈ {0, 1}, where 0 represents the negative class and 1 denotes the positive class.
y^(i) ∈ {0, 1} is also called the label for the i-th training example.
Logistic regression

Use a logistic function (or sigmoid function) g(z) = 1/(1 + e^{−z}) to continuously approximate discrete classification.
Properties of the sigmoid function

Bound: g(z) ∈ (0, 1)
Symmetry: 1 − g(z) = g(−z)
Gradient: g′(z) = g(z)(1 − g(z))
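As a quick numerical check, the following sketch (Python with NumPy; the grid of test points is arbitrary) verifies all three properties, comparing the gradient identity against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)                          # arbitrary test points
assert np.all((sigmoid(z) > 0) & (sigmoid(z) < 1))  # bound: g(z) in (0, 1)
assert np.allclose(1 - sigmoid(z), sigmoid(-z))     # symmetry: 1 - g(z) = g(-z)

eps = 1e-6                                          # central finite difference
fd = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert np.allclose(sigmoid(z) * (1 - sigmoid(z)), fd)  # gradient: g' = g(1 - g)
```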
Logistic regression defines hθ(x) using the sigmoid function:

hθ(x) = g(θᵀx) = 1/(1 + e^{−θᵀx})

First compute a real-valued “score” θᵀx for input x, and then “squash” it into (0, 1) to turn this score into a probability (of x’s label being 1).
Data samples are drawn randomly

X: random variable representing the feature vector
Y: random variable representing the label

Given an input feature vector x, we have
The conditional probability of Y = 1 given X = x: Pr(Y = 1 | X = x; θ) = hθ(x) = 1/(1 + exp(−θᵀx))
The conditional probability of Y = 0 given X = x: Pr(Y = 0 | X = x; θ) = 1 − hθ(x) = 1/(1 + exp(θᵀx))
What’s the underlying decision rule in logistic regression? At the decision boundary, both classes are equiprobable; thus we have

Pr(Y = 1 | X = x; θ) = Pr(Y = 0 | X = x; θ)
⇒ 1/(1 + exp(−θᵀx)) = 1/(1 + exp(θᵀx))
⇒ exp(θᵀx) = 1
⇒ θᵀx = 0

Therefore, the decision boundary of logistic regression is nothing but a linear hyperplane.
Recall that Pr(Y = 1 | X = x; θ) = 1/(1 + exp(−θᵀx)). The “score” θᵀx is also a measure of the distance of x from the hyperplane (the score is positive for positive examples and negative for negative examples).

High positive score: high probability of label 1
High negative score: low probability of label 1 (high probability of label 0)
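To make this concrete, here is a minimal sketch (the parameter vector and inputs are hypothetical) mapping a score to a probability and a hard decision:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    """Return the score theta^T x, Pr(Y = 1 | X = x; theta), and the hard label."""
    score = theta @ x                 # signed score; |score| grows with distance from the boundary
    prob = sigmoid(score)             # Pr(Y = 1 | X = x; theta)
    label = 1 if score >= 0 else 0    # decision boundary: theta^T x = 0
    return score, prob, label

theta = np.array([2.0, -1.0])                  # hypothetical parameters
print(predict(theta, np.array([3.0, 1.0])))    # high positive score -> prob near 1
print(predict(theta, np.array([-3.0, 1.0])))   # high negative score -> prob near 0
```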
Logistic regression model:

hθ(x) = g(θᵀx) = 1/(1 + e^{−θᵀx})

Assume Pr(Y = 1 | X = x; θ) = hθ(x) and Pr(Y = 0 | X = x; θ) = 1 − hθ(x); then we have the following probability mass function

p(y | x; θ) = Pr(Y = y | X = x; θ) = (hθ(x))^y (1 − hθ(x))^{1−y}

where y ∈ {0, 1}.
Y | X = x ∼ Bernoulli(hθ(x))

If we assume y ∈ {−1, 1} instead of y ∈ {0, 1}, then

p(y | x; θ) = 1/(1 + exp(−y θᵀx))

Assuming the training examples were generated independently, we define the likelihood of the parameters as

L(θ) = ∏_{i=1}^{m} p(y^(i) | x^(i); θ) = ∏_{i=1}^{m} (hθ(x^(i)))^{y^(i)} (1 − hθ(x^(i)))^{1−y^(i)}
Maximize the log-likelihood

ℓ(θ) = log L(θ) = Σ_{i=1}^{m} [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]

Gradient ascent: θ_j ← θ_j + α ∇_{θ_j} ℓ(θ) for ∀j, where

∂ℓ(θ)/∂θ_j = Σ_{i=1}^{m} (y^(i) − hθ(x^(i))) / (hθ(x^(i))(1 − hθ(x^(i)))) · ∂hθ(x^(i))/∂θ_j = Σ_{i=1}^{m} (y^(i) − hθ(x^(i))) x_j^(i)
In detail,

∂ℓ(θ)/∂θ_j = Σ_{i=1}^{m} [ y^(i)/hθ(x^(i)) · ∂hθ(x^(i))/∂θ_j − (1 − y^(i))/(1 − hθ(x^(i))) · ∂hθ(x^(i))/∂θ_j ]
           = Σ_{i=1}^{m} (y^(i) − hθ(x^(i))) / (hθ(x^(i))(1 − hθ(x^(i)))) · ∂hθ(x^(i))/∂θ_j

Since ∂hθ(x^(i))/∂θ_j = exp(−θᵀx^(i)) x_j^(i) / (1 + exp(−θᵀx^(i)))² = hθ(x^(i))(1 − hθ(x^(i))) x_j^(i), we obtain

∂ℓ(θ)/∂θ_j = Σ_{i=1}^{m} (y^(i) − hθ(x^(i))) x_j^(i)
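Putting the update to work, a minimal batch gradient-ascent sketch in Python/NumPy (the toy data and step size are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient ascent on the log-likelihood l(theta).

    X: (m, n) design matrix, y: (m,) labels in {0, 1}.
    Update: theta_j <- theta_j + alpha * sum_i (y^(i) - h_theta(x^(i))) * x_j^(i)
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)            # h_theta(x^(i)) for all i at once
        theta += alpha * X.T @ (y - h)    # gradient of l(theta)
    return theta

# Tiny synthetic example (hypothetical data; first column is a bias feature)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_ascent(X, y, alpha=0.1, n_iters=2000)
print(theta, sigmoid(X @ theta))  # fitted probabilities should track the labels
```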
Given a differentiable real-valued function f : R → R, how can we find x such that f(x) = 0?
A tangent line L to the curve y = f(x) at the point (x1, f(x1))
The x-intercept of L: x2 = x1 − f(x1)/f′(x1)
Repeat the process and get a sequence of approximations x1, x2, x3, · · ·
In general, if the convergence criterion is not satisfied,

x ← x − f(x)/f′(x)
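The iteration fits in a few lines of Python; this sketch uses the step size as the stopping criterion (one common choice):

```python
def newton_root(f, fprime, x0, tol=1e-10, max_iters=100):
    """Find x with f(x) = 0 via the update x <- x - f(x)/f'(x), starting from x0."""
    x = x0
    for _ in range(max_iters):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:   # stop once the update is negligible
            break
    return x

# Example: sqrt(2) as the positive root of f(x) = x^2 - 2
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))  # ~1.41421356
```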
Some properties

Highly dependent on the initial guess
Quadratic convergence once the iterate is sufficiently close to x∗
If f′(x∗) = 0, it has only linear convergence
Not guaranteed to converge at all, depending on the function and the initial guess
To maximize f(x), we have to find a stationary point of f(x), i.e., a point x such that f′(x) = 0. According to Newton’s method, we have the following update:

x ← x − f′(x)/f′′(x)

Newton-Raphson method: For ℓ : Rⁿ → R, we generalize Newton’s method to the multidimensional setting:

θ ← θ − H⁻¹ ∇θ ℓ(θ)

where H is the Hessian matrix with entries H_{i,j} = ∂²ℓ(θ)/∂θ_i ∂θ_j
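For the logistic log-likelihood derived earlier, the gradient is ∇θ ℓ(θ) = Σ_i (y^(i) − hθ(x^(i))) x^(i) and the Hessian is H = −Σ_i hθ(x^(i))(1 − hθ(x^(i))) x^(i) x^(i)ᵀ (both standard results), which gives this Newton-Raphson sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, n_iters=10):
    """Newton-Raphson on the logistic log-likelihood l(theta) (a sketch).

    X: (m, n) design matrix, y: (m,) labels in {0, 1}.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)                    # h_theta(x^(i)) for all i
        grad = X.T @ (y - h)                      # gradient of l(theta)
        H = -X.T @ (X * (h * (1 - h))[:, None])   # Hessian: -X^T S X, S = diag(h(1-h))
        theta = theta - np.linalg.solve(H, grad)  # theta <- theta - H^{-1} grad
    return theta
```

Solving the linear system with `np.linalg.solve` avoids explicitly forming H⁻¹, but each iteration still costs a factorization of an n × n matrix.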
Higher convergence speed than (batch) gradient descent: fewer iterations to approach the minimum
However, each iteration is more expensive than that of gradient descent: it requires finding and inverting an n × n Hessian

More details about Newton’s method can be found at https://en.wikipedia.org/wiki/Newton%27s_method
Multiclass (or multinomial) classification is the problem of classifying instances into one of more than two classes. The existing multiclass classification techniques can be categorized into

Transformation to binary
Extension from binary
Hierarchical classification
The one-vs.-rest (one-vs.-all, OvA or OvR, one-against-all, OAA) strategy trains a single classifier per class, with the samples of that class as positive samples and all other samples as negative ones.

Inputs: a learning algorithm L, and training data {(x^(i), y^(i))}_{i=1,...,m} where y^(i) ∈ {1, ..., K} is the label for sample x^(i)
Output: a list of classifiers f_k for k ∈ {1, ..., K}
Procedure: For each k ∈ {1, ..., K}, construct a new label z^(i) for x^(i) such that z^(i) = 1 if y^(i) = k and z^(i) = 0 otherwise, and then apply L to {(x^(i), z^(i))}_{i=1,...,m} to obtain f_k. A higher f_k(x) implies a higher probability that x is in class k (see the sketch below).
Making a decision: y∗ = arg max_k f_k(x)
Example: using SVM to train each binary classifier
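A sketch of the procedure in Python; here `learn` stands for the generic binary learning algorithm L (for instance, a logistic regression trainer returning a scoring function):

```python
import numpy as np

def one_vs_rest_train(learn, X, y, K):
    """Train one binary scorer f_k per class k = 1..K."""
    classifiers = []
    for k in range(1, K + 1):
        z = (y == k).astype(float)        # z^(i) = 1 if y^(i) = k, else 0
        classifiers.append(learn(X, z))   # apply the binary learner L
    return classifiers

def one_vs_rest_predict(classifiers, x):
    """y* = argmax_k f_k(x); classes are numbered 1..K."""
    scores = [f(x) for f in classifiers]
    return int(np.argmax(scores)) + 1
```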
The one-vs.-one (OvO) reduction trains K(K − 1)/2 binary classifiers.

For the (s, t)-th classifier:
Positive samples: all the points in class s
Negative samples: all the points in class t
f_{s,t}(x) is the decision value for this classifier, such that a larger f_{s,t}(x) implies that label s has a higher probability than label t

Prediction: f(x) = arg max_s Σ_t f_{s,t}(x)
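A corresponding OvO sketch; as above, `learn` is a placeholder for the binary trainer, and we aggregate decision values taking f_{t,s}(x) = −f_{s,t}(x) (one common convention):

```python
import numpy as np
from itertools import combinations

def one_vs_one_train(learn, X, y, K):
    """Train K(K-1)/2 binary classifiers, one per pair (s, t) with s < t."""
    classifiers = {}
    for s, t in combinations(range(1, K + 1), 2):
        mask = (y == s) | (y == t)           # keep only classes s and t
        z = (y[mask] == s).astype(float)     # class s positive, class t negative
        classifiers[(s, t)] = learn(X[mask], z)
    return classifiers

def one_vs_one_predict(classifiers, x, K):
    """f(x) = argmax_s sum_t f_{s,t}(x), using f_{t,s}(x) = -f_{s,t}(x)."""
    totals = np.zeros(K + 1)                 # index 0 unused; classes are 1..K
    for (s, t), f in classifiers.items():
        v = f(x)
        totals[s] += v                       # evidence for s over t
        totals[t] -= v                       # the mirrored decision value
    return int(np.argmax(totals[1:])) + 1
```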
Training data: {(x^(i), y^(i))}_{i=1,2,...,m}
K different labels {1, 2, ..., K}, with y^(i) ∈ {1, 2, ..., K} for ∀i

Hypothesis function:

hθ(x) = [ p(y = 1 | x; θ), p(y = 2 | x; θ), ..., p(y = K | x; θ) ]ᵀ
      = (1 / Σ_{k=1}^{K} exp(θ^(k)ᵀx)) [ exp(θ^(1)ᵀx), exp(θ^(2)ᵀx), ..., exp(θ^(K)ᵀx) ]ᵀ

where θ^(1), θ^(2), ..., θ^(K) ∈ Rⁿ are the parameters of the softmax regression model
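A direct transcription of the hypothesis in Python/NumPy (the max-subtraction is a standard numerical-stability trick that leaves the probabilities unchanged; the example parameters are hypothetical):

```python
import numpy as np

def softmax_hypothesis(Theta, x):
    """h_theta(x): the vector of p(y = k | x; theta) for k = 1..K.

    Theta: (K, n) matrix whose k-th row is theta^(k); x: (n,) feature vector.
    """
    scores = Theta @ x                 # theta^(k)T x for every k
    scores -= scores.max()             # stability shift; cancels in the ratio
    e = np.exp(scores)
    return e / e.sum()                 # exp(theta^(k)T x) / sum_k' exp(theta^(k')T x)

Theta = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # hypothetical K = 3, n = 2
print(softmax_hypothesis(Theta, np.array([2.0, 1.0])))     # entries sum to 1
```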
Log-likelihood function:

ℓ(θ) = Σ_{i=1}^{m} log p(y^(i) | x^(i); θ) = Σ_{i=1}^{m} log ∏_{k=1}^{K} ( exp(θ^(k)ᵀx^(i)) / Σ_{k′=1}^{K} exp(θ^(k′)ᵀx^(i)) )^{I(y^(i)=k)}

where I : {True, False} → {0, 1} is an indicator function. Maximize ℓ(θ) through gradient ascent or Newton’s method.
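For gradient ascent, the gradient of ℓ(θ) with respect to θ^(k) is Σ_{i=1}^{m} (I(y^(i)=k) − p(y^(i)=k | x^(i); θ)) x^(i) (a standard softmax-regression result), which the following sketch implements:

```python
import numpy as np

def softmax_probs(Theta, X):
    """Row i holds p(y = k | x^(i); theta) for k = 1..K."""
    S = X @ Theta.T                      # (m, K) scores theta^(k)T x^(i)
    S -= S.max(axis=1, keepdims=True)    # numerical stability; probabilities unchanged
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def softmax_gradient_ascent(X, y, K, alpha=0.1, n_iters=1000):
    """Maximize l(theta); grad wrt theta^(k) is sum_i (I(y^(i)=k) - p_k(x^(i))) x^(i)."""
    m, n = X.shape
    Theta = np.zeros((K, n))
    Y = np.eye(K)[y - 1]                 # one-hot encoding of labels in {1, ..., K}
    for _ in range(n_iters):
        P = softmax_probs(Theta, X)      # (m, K) predicted probabilities
        Theta += alpha * (Y - P).T @ X   # ascent step for all theta^(k) at once
    return Theta
```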
Q & A