Logistic Regression: From Binary to Multi-Class
Shuiwang Ji, Department of Computer Science & Engineering, Texas A&M University
Binary Logistic Regression
1. The binary LR predicts the label y_i ∈ {−1, +1} for a given sample x_i by estimating the probability P(y | x_i) and comparing it with a pre-defined threshold.
2. Recall that the sigmoid function is defined as
\theta(s) = \frac{e^{s}}{1 + e^{s}} = \frac{1}{1 + e^{-s}},   (1)
where s ∈ R and θ denotes the sigmoid function.
3. The probability is thus represented by
P(y \mid x) = \begin{cases} \theta(w^{T} x) & \text{if } y = +1 \\ 1 - \theta(w^{T} x) & \text{if } y = -1. \end{cases}
This can also be expressed compactly as
P(y \mid x) = \theta(y\, w^{T} x),   (2)
due to the fact that θ(−s) = 1 − θ(s). Note that in the binary case, we only need to estimate one probability, as the probabilities for +1 and −1 sum to one.
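To make Equations (1) and (2) concrete, the following is a minimal NumPy sketch of the binary LR probability; the weight vector w, the sample x, and the 0.5 threshold are hypothetical values chosen only for illustration.

    import numpy as np

    def sigmoid(s):
        # theta(s) = 1 / (1 + exp(-s)), Equation (1)
        return 1.0 / (1.0 + np.exp(-s))

    def binary_lr_probability(w, x, y):
        # P(y | x) = theta(y * w^T x), Equation (2), with y in {-1, +1}
        return sigmoid(y * np.dot(w, x))

    w = np.array([0.5, -1.0])            # hypothetical weight vector
    x = np.array([2.0, 1.0])             # hypothetical sample
    p_pos = binary_lr_probability(w, x, +1)
    print(p_pos, 1.0 - p_pos)            # the two probabilities sum to one
    print(+1 if p_pos > 0.5 else -1)     # compare with a 0.5 threshold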
1. In the multi-class case, there are more than two classes, i.e., y_i ∈ {1, 2, ..., K} (i = 1, ..., N), where K is the number of classes and N is the number of samples.
2. In this case, we need to estimate the probability for each of the K classes. We thus define the hypothesis of LR in the multi-class case as
h_{w}(x) = \begin{bmatrix} P(y = 1 \mid x; w) \\ P(y = 2 \mid x; w) \\ \vdots \\ P(y = K \mid x; w) \end{bmatrix}.   (3)
3. A critical assumption here is that there is no ordinal relationship between the classes. So we will need one linear signal for each of the K classes, and these signals should be independent conditioned on x.
1. As a result, in the multi-class LR, we compute K linear signals by taking the dot product between the input x and K independent weight vectors w_k, k = 1, ..., K, as
\begin{bmatrix} w_{1}^{T} x \\ w_{2}^{T} x \\ \vdots \\ w_{K}^{T} x \end{bmatrix}.   (4)
2. We then need to map the K linear outputs (as a vector in R^K) to the K probabilities (as a probability distribution among the K classes).
3. In order to accomplish such a mapping, we introduce the softmax function, which is generalized from the sigmoid function and defined as below. Given a K-dimensional vector v = [v_1, v_2, ..., v_K]^T ∈ R^K,
\mathrm{softmax}(v) = \frac{1}{\sum_{k=1}^{K} e^{v_k}} \begin{bmatrix} e^{v_1} \\ e^{v_2} \\ \vdots \\ e^{v_K} \end{bmatrix}.   (5)
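As a rough illustration of Equation (5), here is a small NumPy sketch of the softmax. Subtracting the maximum entry before exponentiating is a common numerical-stability choice, not part of the definition; it leaves the output unchanged by the same cancellation argument used in the shift-invariance proof later.

    import numpy as np

    def softmax(v):
        # softmax(v) of Equation (5); subtracting max(v) avoids overflow
        # without changing the result
        e = np.exp(v - np.max(v))
        return e / np.sum(e)

    v = np.array([2.0, 1.0, 0.1])        # hypothetical input with K = 3
    p = softmax(v)
    print(p, p.sum())                    # entries in (0, 1) that sum to 1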
1. It is easy to verify that the softmax maps a vector in R^K to (0, 1)^K. All elements in the output vector of softmax sum to 1, so the output can be interpreted as a probability distribution over the K classes. Applying the softmax to the K linear signals in Equation (4), the hypothesis in the multi-class case becomes
h_{w}(x) = \begin{bmatrix} P(y = 1 \mid x; w) \\ P(y = 2 \mid x; w) \\ \vdots \\ P(y = K \mid x; w) \end{bmatrix} = \frac{1}{\sum_{k=1}^{K} e^{w_{k}^{T} x}} \begin{bmatrix} e^{w_{1}^{T} x} \\ e^{w_{2}^{T} x} \\ \vdots \\ e^{w_{K}^{T} x} \end{bmatrix}.   (6)
A numerical sketch of this hypothesis is given after this list.
2. We will further discuss the connection between the softmax function and the sigmoid function by showing that the sigmoid in binary LR is equivalent to the softmax in multi-class LR when K = 2.
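The sketch below evaluates the hypothesis of Equation (6) by applying the softmax to the K linear signals of Equation (4); it assumes a hypothetical K x d matrix W whose rows are the weight vectors w_1, ..., w_K, with hypothetical values used only for illustration.

    import numpy as np

    def softmax(v):
        e = np.exp(v - np.max(v))
        return e / np.sum(e)

    def multiclass_lr_hypothesis(W, x):
        # h_w(x) of Equation (6): softmax over the K linear signals w_k^T x
        return softmax(W @ x)

    W = np.array([[0.2, -0.5],
                  [1.0,  0.3],
                  [-0.4, 0.8]])            # hypothetical weights for K = 3 classes
    x = np.array([1.5, -1.0])              # hypothetical sample with d = 2 features
    print(multiclass_lr_hypothesis(W, x))  # a probability distribution over 3 classes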
1. We optimize the multi-class LR by minimizing a loss (cost) function measuring the error between predictions and the true labels, as we did in the binary LR. To this end, we introduce the cross-entropy in Equation (7) to measure the distance between two probability distributions.
2. The cross-entropy is defined by
H(P, Q) = -\sum_{i=1}^{K} p_i \log(q_i),   (7)
where P = (p_1, ..., p_K) and Q = (q_1, ..., q_K) are two probability distributions, corresponding to the true distribution and the predicted vector in Equation (3), respectively.
3. Here the true distribution refers to the one-hot encoding of the label. For label k (where k is the correct class), the one-hot encoding is defined as a vector whose element at index k is 1 and whose elements are 0 everywhere else.
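As a quick sketch of Equation (7), the snippet below computes the cross-entropy between a one-hot true distribution and a hypothetical predicted distribution; with a one-hot P, the sum reduces to minus the log-probability assigned to the correct class.

    import numpy as np

    def cross_entropy(p, q):
        # H(P, Q) = -sum_i p_i * log(q_i), Equation (7)
        return -np.sum(p * np.log(q))

    p_true = np.array([0.0, 1.0, 0.0])     # one-hot encoding of label k = 2 (K = 3, 1-based)
    q_pred = np.array([0.2, 0.7, 0.1])     # hypothetical predicted distribution
    print(cross_entropy(p_true, q_pred))   # equals -log(0.7)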
1. Now the loss for a training sample x in class c is given by
\mathrm{loss}(x, y; w) = H(y, \hat{y}) = -\sum_{k} y_k \log \hat{y}_k = -\log \hat{y}_c = -\log \frac{e^{w_{c}^{T} x}}{\sum_{k=1}^{K} e^{w_{k}^{T} x}},
where y denotes the one-hot vector and ŷ is the predicted distribution h_w(x). The loss on all samples {(x_i, y_i)}_{i=1}^{N} is
\mathrm{loss}(X, Y; w) = -\sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{I}[y_i = k] \log \frac{e^{w_{k}^{T} x_i}}{\sum_{k'=1}^{K} e^{w_{k'}^{T} x_i}}.   (8)
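A minimal NumPy sketch of Equation (8), using hypothetical random data and storing the labels 0-based (0, ..., K-1) rather than the 1-based labels used in the text:

    import numpy as np

    def softmax_rows(S):
        # row-wise softmax of an N x K matrix of linear signals
        e = np.exp(S - S.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def multiclass_lr_loss(W, X, Y):
        # Equation (8): W is K x d, X is N x d, Y holds labels in {0, ..., K-1}
        P = softmax_rows(X @ W.T)                        # N x K predicted distributions
        return -np.sum(np.log(P[np.arange(len(Y)), Y]))  # -log prob of the true class

    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 2))          # hypothetical weights, K = 3, d = 2
    X = rng.normal(size=(4, 2))          # hypothetical samples, N = 4
    Y = np.array([0, 2, 1, 0])           # hypothetical labels
    print(multiclass_lr_loss(W, X, Y))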
The softmax function in multi-class LR has an invariance property when shifting the parameters. Given the weights w = (w_1, ..., w_K), suppose we subtract the same vector u from each of the K weight vectors; the predicted probabilities then remain unchanged, i.e., P(y = k | x; w_1 − u, ..., w_K − u) = P(y = k | x; w) for all k.
To prove this, let us denote w' = {w'_i}_{i=1}^{K}, where w'_i = w_i − u. We have
P(y = k \mid x; w') = \frac{e^{(w_k - u)^{T} x}}{\sum_{i=1}^{K} e^{(w_i - u)^{T} x}}   (9)
= \frac{e^{w_{k}^{T} x} e^{-u^{T} x}}{\sum_{i=1}^{K} e^{w_{i}^{T} x} e^{-u^{T} x}}   (10)
= \frac{e^{w_{k}^{T} x} e^{-u^{T} x}}{\left(\sum_{i=1}^{K} e^{w_{i}^{T} x}\right) e^{-u^{T} x}}   (11)
= \frac{e^{w_{k}^{T} x}}{\sum_{i=1}^{K} e^{w_{i}^{T} x}}   (12)
= P(y = k \mid x; w),   (13)
which completes the proof.
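The shift invariance of Equations (9)-(13) can also be checked numerically; the sketch below draws hypothetical weights, a sample, and a shift vector u at random.

    import numpy as np

    def softmax(v):
        e = np.exp(v - np.max(v))
        return e / np.sum(e)

    rng = np.random.default_rng(1)
    W = rng.normal(size=(4, 3))          # hypothetical weights, K = 4, d = 3
    x = rng.normal(size=3)               # hypothetical sample
    u = rng.normal(size=3)               # arbitrary shift vector

    p_original = softmax(W @ x)
    p_shifted = softmax((W - u) @ x)     # subtract u from every weight vector
    print(np.allclose(p_original, p_shifted))   # True: probabilities unchanged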
Once we have proved the shift invariance, we are able to show that when K = 2, the softmax-based multi-class LR is equivalent to the sigmoid-based binary LR. In particular, the hypotheses of the two LRs are equivalent, as shown below.
h_{w}(x) = \frac{1}{e^{w_{1}^{T} x} + e^{w_{2}^{T} x}} \begin{bmatrix} e^{w_{1}^{T} x} \\ e^{w_{2}^{T} x} \end{bmatrix}   (14)
= \frac{1}{e^{(w_1 - w_1)^{T} x} + e^{(w_2 - w_1)^{T} x}} \begin{bmatrix} e^{(w_1 - w_1)^{T} x} \\ e^{(w_2 - w_1)^{T} x} \end{bmatrix}   (15)
= \begin{bmatrix} \dfrac{1}{1 + e^{(w_2 - w_1)^{T} x}} \\ \dfrac{e^{(w_2 - w_1)^{T} x}}{1 + e^{(w_2 - w_1)^{T} x}} \end{bmatrix}   (16)
= \begin{bmatrix} \dfrac{1}{1 + e^{-\hat{w}^{T} x}} \\ \dfrac{e^{-\hat{w}^{T} x}}{1 + e^{-\hat{w}^{T} x}} \end{bmatrix}   (17)
= \begin{bmatrix} \dfrac{1}{1 + e^{-\hat{w}^{T} x}} \\ 1 - \dfrac{1}{1 + e^{-\hat{w}^{T} x}} \end{bmatrix} = \begin{bmatrix} h_{\hat{w}}(x) \\ 1 - h_{\hat{w}}(x) \end{bmatrix},   (18)
where ŵ = w_1 − w_2, and Equation (15) follows from the shift invariance (subtracting w_1 from both weight vectors). This completes the proof.
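A numerical check of Equations (14)-(18), using hypothetical weights w_1, w_2 and a sample x: the two-class softmax hypothesis matches the sigmoid hypothesis with ŵ = w_1 − w_2.

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def softmax(v):
        e = np.exp(v - np.max(v))
        return e / np.sum(e)

    rng = np.random.default_rng(2)
    w1, w2 = rng.normal(size=3), rng.normal(size=3)   # hypothetical weights, K = 2
    x = rng.normal(size=3)                            # hypothetical sample

    p_softmax = softmax(np.array([w1 @ x, w2 @ x]))   # multi-class LR with K = 2
    w_hat = w1 - w2                                   # binary LR weight vector
    p_sigmoid = np.array([sigmoid(w_hat @ x), 1.0 - sigmoid(w_hat @ x)])
    print(np.allclose(p_softmax, p_sigmoid))          # True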
1. Now we show that minimizing the logistic regression loss is equivalent to minimizing the cross-entropy loss with binary outcomes.
2. The equivalence between the logistic regression loss and the cross-entropy loss, as shown below, proves that we always obtain identical weights w by minimizing either of the two losses. This equivalence between the losses, together with the equivalence between sigmoid and softmax, leads to the conclusion that binary logistic regression is a particular case of multi-class logistic regression with K = 2.
\arg\min_{w} E_{\mathrm{in}}(w) = \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^{T} x_n}\right)
= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{\theta(y_n w^{T} x_n)}
= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{P(y_n \mid x_n)}
= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \left( \mathbb{I}[y_n = +1] \ln \frac{1}{P(y_n \mid x_n)} + \mathbb{I}[y_n = -1] \ln \frac{1}{P(y_n \mid x_n)} \right)
= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \left( \mathbb{I}[y_n = +1] \ln \frac{1}{h(x_n)} + \mathbb{I}[y_n = -1] \ln \frac{1}{1 - h(x_n)} \right)
= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \left( p_n \log \frac{1}{q_n} + (1 - p_n) \log \frac{1}{1 - q_n} \right)
= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} H(\{p_n, 1 - p_n\}, \{q_n, 1 - q_n\}),
where p_n = \mathbb{I}[y_n = +1], q_n = h(x_n) = \theta(w^{T} x_n), and h denotes the sigmoid hypothesis of binary LR. This completes the proof.
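The chain of equalities above can also be checked numerically; the sketch below uses hypothetical data and compares the logistic regression loss with the averaged binary cross-entropy, where p_n = I[y_n = +1] and q_n = theta(w^T x_n).

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    rng = np.random.default_rng(3)
    w = rng.normal(size=3)                    # hypothetical weights
    X = rng.normal(size=(5, 3))               # hypothetical samples, N = 5
    y = rng.choice([-1, 1], size=5)           # labels in {-1, +1}

    # Logistic regression loss: (1/N) sum_n ln(1 + exp(-y_n w^T x_n))
    lr_loss = np.mean(np.log1p(np.exp(-y * (X @ w))))

    # Cross-entropy loss with p_n = I[y_n = +1] and q_n = theta(w^T x_n)
    p = (y == 1).astype(float)
    q = sigmoid(X @ w)
    ce_loss = np.mean(-(p * np.log(q) + (1 - p) * np.log(1 - q)))

    print(np.allclose(lr_loss, ce_loss))      # True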