  1. Logistic Regression: From Binary to Multi-Class. Shuiwang Ji, Department of Computer Science & Engineering, Texas A&M University.

  2. Binary Logistic Regression
     1. The binary LR predicts the label y_i ∈ {−1, +1} for a given sample x_i by estimating a probability P(y | x_i) and comparing it with a pre-defined threshold.
     2. Recall that the sigmoid function is defined as
        \[ \theta(s) = \frac{e^{s}}{1 + e^{s}} = \frac{1}{1 + e^{-s}}, \quad (1) \]
        where s ∈ R and θ denotes the sigmoid function.
     3. The probability is thus represented by
        \[ P(y \mid x) = \begin{cases} \theta(w^T x) & \text{if } y = +1 \\ 1 - \theta(w^T x) & \text{if } y = -1. \end{cases} \]
        This can also be expressed compactly as
        \[ P(y \mid x) = \theta(y\, w^T x), \quad (2) \]
        due to the fact that θ(−s) = 1 − θ(s). Note that in the binary case we only need to estimate one probability, as the probabilities for +1 and −1 sum to one.
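
The sigmoid in (1) and the compact probability in (2) translate directly into code. Below is a minimal NumPy sketch; the function names `sigmoid` and `binary_lr_prob` are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(s):
    """Sigmoid function theta(s) = 1 / (1 + exp(-s)), Equation (1)."""
    return 1.0 / (1.0 + np.exp(-s))

def binary_lr_prob(w, x, y):
    """P(y | x) = theta(y * w^T x) for y in {-1, +1}, Equation (2)."""
    return sigmoid(y * np.dot(w, x))

# Example: predict +1 if P(y = +1 | x) exceeds a 0.5 threshold.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.8])
p_plus = binary_lr_prob(w, x, +1)
print(p_plus, "->", +1 if p_plus > 0.5 else -1)
```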

  3. Multi-Class Logistic Regression
     1. In the multi-class case there are more than two classes, i.e., y_i ∈ {1, 2, ..., K} (i = 1, ..., N), where K is the number of classes and N is the number of samples.
     2. In this case, we need to estimate the probability for each of the K classes. The hypothesis in binary LR is hence generalized to the multi-class case as
        \[ h_{w}(x) = \begin{bmatrix} P(y = 1 \mid x; w) \\ P(y = 2 \mid x; w) \\ \vdots \\ P(y = K \mid x; w) \end{bmatrix}. \quad (3) \]
     3. A critical assumption here is that there is no ordinal relationship between the classes. So we need one linear signal for each of the K classes, and these signals should be independent conditioned on x.
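
At the code level, the K per-class quantities in (3) are usually produced from a single weight matrix holding one weight vector per class. A minimal sketch of the shapes involved (the names `W`, `num_classes`, and `num_features` are illustrative):

```python
import numpy as np

num_classes, num_features = 4, 3                  # K classes, d features (toy sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(num_classes, num_features))  # one weight vector w_k per row
x = rng.normal(size=num_features)

# h_w(x) must be a length-K vector of class probabilities; how these
# probabilities are obtained from the K linear signals is the subject
# of the next slides.
print(W @ x)   # K linear signals, one per class -- not yet probabilities
```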

  4. Softmax
     1. As a result, in the multi-class LR we compute K linear signals by taking the dot product between the input x and K independent weight vectors w_k, k = 1, ..., K:
        \[ \begin{bmatrix} w_1^T x \\ w_2^T x \\ \vdots \\ w_K^T x \end{bmatrix}. \quad (4) \]
     2. We then need to map the K linear outputs (a vector in R^K) to K probabilities (a probability distribution over the K classes).
     3. To accomplish such a mapping, we introduce the softmax function, which is generalized from the sigmoid function and defined as below. Given a K-dimensional vector v = [v_1, v_2, ..., v_K]^T ∈ R^K,
        \[ \mathrm{softmax}(v) = \frac{1}{\sum_{k=1}^{K} e^{v_k}} \begin{bmatrix} e^{v_1} \\ e^{v_2} \\ \vdots \\ e^{v_K} \end{bmatrix}. \quad (5) \]
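
Equation (5) can be implemented directly. Subtracting the maximum entry before exponentiating is a standard numerical trick: the common factor cancels (the same cancellation used in the shift-invariance proof later in the deck) but overflow in the exponentials is avoided. A minimal sketch, with an illustrative function name:

```python
import numpy as np

def softmax(v):
    """Map a vector in R^K to a probability distribution, Equation (5).

    Subtracting max(v) leaves the result unchanged (the common factor
    cancels) but prevents overflow in exp for large entries.
    """
    e = np.exp(v - np.max(v))
    return e / e.sum()

v = np.array([2.0, 1.0, 0.1])
p = softmax(v)
print(p, p.sum())   # entries in (0, 1), summing to 1, order preserved
```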

  5. Softmax
     1. It is easy to verify that the softmax maps a vector in R^K to (0, 1)^K. All elements of the output vector sum to 1 and their order is preserved. Thus the hypothesis in (3) can be written as
        \[ h_{w}(x) = \begin{bmatrix} P(y = 1 \mid x; w) \\ P(y = 2 \mid x; w) \\ \vdots \\ P(y = K \mid x; w) \end{bmatrix} = \frac{1}{\sum_{k=1}^{K} e^{w_k^T x}} \begin{bmatrix} e^{w_1^T x} \\ e^{w_2^T x} \\ \vdots \\ e^{w_K^T x} \end{bmatrix}. \quad (6) \]
     2. We will further discuss the connection between the softmax function and the sigmoid function by showing that the sigmoid in binary LR is equivalent to the softmax in multi-class LR when K = 2.
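
Putting (4) and (6) together, the multi-class hypothesis is simply a softmax applied to the K linear signals. A minimal sketch, reusing the `softmax` above; `hypothesis` is an illustrative name, not from the slides:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def hypothesis(W, x):
    """h_w(x) in Equation (6): softmax of the K linear signals w_k^T x."""
    return softmax(W @ x)   # W has shape (K, d), x has shape (d,)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
x = rng.normal(size=3)
probs = hypothesis(W, x)
print(probs, probs.sum())        # a distribution over the 4 classes
print(int(np.argmax(probs)))     # predicted class index
```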

  6. Cross Entropy
     1. We optimize the multi-class LR by minimizing a loss (cost) function that measures the error between predictions and the true labels, as we did in the binary LR. To this end, we introduce the cross entropy in Equation (7) to measure the distance between two probability distributions.
     2. The cross entropy is defined by
        \[ H(P, Q) = -\sum_{i=1}^{K} p_i \log(q_i), \quad (7) \]
        where P = (p_1, ..., p_K) and Q = (q_1, ..., q_K) are two probability distributions. In multi-class LR, the two probability distributions are the true distribution and the predicted vector in Equation (3), respectively.
     3. Here the true distribution refers to the one-hot encoding of the label. For label k (k is the correct class), the one-hot encoding is defined as a vector whose element is 1 at index k and 0 everywhere else.
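
With a one-hot true distribution, the cross entropy in (7) reduces to the negative log probability of the correct class. A minimal sketch (function and variable names are illustrative; classes are 0-indexed here, whereas the slides use 1, ..., K):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(P, Q) = -sum_i p_i log q_i, Equation (7); eps guards against log(0)."""
    return -np.sum(p * np.log(q + eps))

def one_hot(k, K):
    """One-hot encoding of label k among K classes: 1 at index k, 0 elsewhere."""
    v = np.zeros(K)
    v[k] = 1.0
    return v

q = np.array([0.7, 0.2, 0.1])   # a predicted distribution over K = 3 classes
p = one_hot(0, 3)               # the true class is 0
print(cross_entropy(p, q))      # equals -log(0.7)
print(-np.log(0.7))
```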

  7. Loss Function
     1. Now the loss for a training sample x in class c is given by
        \[ \mathrm{loss}(x, y; w) = H(y, \hat{y}) = -\sum_{k} y_k \log \hat{y}_k = -\log \hat{y}_c = -\log \frac{e^{w_c^T x}}{\sum_{k=1}^{K} e^{w_k^T x}}, \]
        where y denotes the one-hot vector and ŷ is the predicted distribution h_w(x). The loss on all samples (x_i, y_i), i = 1, ..., N, is
        \[ \mathrm{loss}(X, Y; w) = -\sum_{i=1}^{N} \sum_{k=1}^{K} I[y_i = k] \log \frac{e^{w_k^T x_i}}{\sum_{j=1}^{K} e^{w_j^T x_i}}. \quad (8) \]
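
Equation (8) sums the per-sample cross entropies over the whole dataset. A minimal sketch, again with illustrative names and 0-based class labels:

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax of a matrix of linear signals."""
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def multiclass_lr_loss(W, X, y):
    """Equation (8): -sum_i log P(y_i | x_i; W).

    W: (K, d) weights, X: (N, d) samples, y: (N,) integer labels in 0..K-1.
    """
    probs = softmax_rows(X @ W.T)                  # (N, K) predicted distributions
    return -np.sum(np.log(probs[np.arange(len(y)), y]))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.integers(0, 4, size=5)
W = rng.normal(size=(4, 3))
print(multiclass_lr_loss(W, X, y))
```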

  8. Shift-invariance in Parameters
     The softmax function in multi-class LR has an invariance property when shifting the parameters: given the weights w = (w_1, ..., w_K), if we subtract the same vector u from each of the K weight vectors, the outputs of the softmax function remain the same.

  9. Proof
     To prove this, let us denote w' = {w'_i}_{i=1}^{K}, where w'_i = w_i − u. We have
     \[
     \begin{aligned}
     P(y = k \mid x; w') &= \frac{e^{(w_k - u)^T x}}{\sum_{i=1}^{K} e^{(w_i - u)^T x}} \quad (9) \\
     &= \frac{e^{w_k^T x} e^{-u^T x}}{\sum_{i=1}^{K} e^{w_i^T x} e^{-u^T x}} \quad (10) \\
     &= \frac{e^{w_k^T x} e^{-u^T x}}{\left(\sum_{i=1}^{K} e^{w_i^T x}\right) e^{-u^T x}} \quad (11) \\
     &= \frac{e^{w_k^T x}}{\sum_{i=1}^{K} e^{w_i^T x}} \quad (12) \\
     &= P(y = k \mid x; w), \quad (13)
     \end{aligned}
     \]
     which completes the proof.
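
The shift-invariance in (9)-(13) is easy to check numerically: subtracting the same vector u from every w_k leaves the predicted distribution unchanged. A small sketch of such a check (all names illustrative):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

rng = np.random.default_rng(0)
K, d = 4, 3
W = rng.normal(size=(K, d))
x = rng.normal(size=d)
u = rng.normal(size=d)

p_original = softmax(W @ x)
p_shifted = softmax((W - u) @ x)   # subtract the same u from every weight vector

# The two distributions agree up to floating-point error.
print(np.allclose(p_original, p_shifted))   # True
```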

  10. Equivalence to Sigmoid
      Having proved the shift-invariance, we can now show that when K = 2, the softmax-based multi-class LR is equivalent to the sigmoid-based binary LR. In particular, the hypotheses of the two models are equivalent.

  11. Proof
      \[
      \begin{aligned}
      h_{w}(x) &= \begin{bmatrix} \dfrac{e^{w_1^T x}}{e^{w_1^T x} + e^{w_2^T x}} \\[2ex] \dfrac{e^{w_2^T x}}{e^{w_1^T x} + e^{w_2^T x}} \end{bmatrix} \quad (14) \\
      &= \begin{bmatrix} \dfrac{e^{(w_1 - w_1)^T x}}{e^{(w_1 - w_1)^T x} + e^{(w_2 - w_1)^T x}} \\[2ex] \dfrac{e^{(w_2 - w_1)^T x}}{e^{(w_1 - w_1)^T x} + e^{(w_2 - w_1)^T x}} \end{bmatrix} \quad (15) \\
      &= \begin{bmatrix} \dfrac{1}{1 + e^{(w_2 - w_1)^T x}} \\[2ex] \dfrac{e^{(w_2 - w_1)^T x}}{1 + e^{(w_2 - w_1)^T x}} \end{bmatrix} \quad (16) \\
      &= \begin{bmatrix} \dfrac{1}{1 + e^{-\hat{w}^T x}} \\[2ex] \dfrac{e^{-\hat{w}^T x}}{1 + e^{-\hat{w}^T x}} \end{bmatrix} \quad (17) \\
      &= \begin{bmatrix} \dfrac{1}{1 + e^{-\hat{w}^T x}} \\[2ex] 1 - \dfrac{1}{1 + e^{-\hat{w}^T x}} \end{bmatrix}
       = \begin{bmatrix} h_{\hat{w}}(x) \\ 1 - h_{\hat{w}}(x) \end{bmatrix}, \quad (18)
      \end{aligned}
      \]
      where ŵ = w_1 − w_2. This completes the proof.
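
The K = 2 equivalence in (14)-(18) can also be verified numerically: the first softmax output equals the sigmoid of ŵᵀx with ŵ = w_1 − w_2. A minimal sketch (names illustrative):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

rng = np.random.default_rng(0)
d = 3
w1, w2 = rng.normal(size=d), rng.normal(size=d)
x = rng.normal(size=d)

p_softmax = softmax(np.array([w1 @ x, w2 @ x]))   # two-class softmax hypothesis
w_hat = w1 - w2
p_sigmoid = sigmoid(w_hat @ x)                    # binary LR hypothesis

print(np.allclose(p_softmax[0], p_sigmoid))        # True
print(np.allclose(p_softmax[1], 1.0 - p_sigmoid))  # True
```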

  12. Cross entropy with binary outcomes
      1. Now we show that minimizing the logistic regression loss is equivalent to minimizing the cross-entropy loss with binary outcomes.
      2. The equivalence between the logistic regression loss and the cross-entropy loss, shown below, implies that minimizing either loss yields identical weights w. This equivalence between the losses, together with the equivalence between sigmoid and softmax, leads to the conclusion that binary logistic regression is a particular case of multi-class logistic regression with K = 2.

  13. Proof
      \[
      \begin{aligned}
      \arg\min_{w} E_{\mathrm{in}}(w) &= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ln\!\left(1 + e^{-y_n w^T x_n}\right) \\
      &= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{\theta(y_n w^T x_n)} \\
      &= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{P(y_n \mid x_n)} \\
      &= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \left[ I[y_n = +1] \ln \frac{1}{P(y_n \mid x_n)} + I[y_n = -1] \ln \frac{1}{P(y_n \mid x_n)} \right] \\
      &= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \left[ I[y_n = +1] \ln \frac{1}{h(x_n)} + I[y_n = -1] \ln \frac{1}{1 - h(x_n)} \right] \\
      &= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \left[ p \ln \frac{1}{q} + (1 - p) \ln \frac{1}{1 - q} \right] \\
      &= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} H(\{p, 1 - p\}, \{q, 1 - q\}),
      \end{aligned}
      \]
      where p = I[y_n = +1] and q = h(x_n). This completes the proof.
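
The loss equivalence on this slide can be checked numerically as well: for either label, the binary LR loss term ln(1 + e^{−y wᵀx}) coincides with the cross entropy between {p, 1 − p} and {q, 1 − q} with p = I[y = +1] and q = h(x). A small sketch (names illustrative):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
w = rng.normal(size=3)
x = rng.normal(size=3)

for y in (+1, -1):
    lr_loss = np.log(1.0 + np.exp(-y * (w @ x)))            # binary LR loss term
    p = 1.0 if y == +1 else 0.0                             # p = I[y = +1]
    q = sigmoid(w @ x)                                      # q = h(x) = P(y = +1 | x)
    cross_entropy = -(p * np.log(q) + (1 - p) * np.log(1 - q))
    print(np.allclose(lr_loss, cross_entropy))              # True for both labels
```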

  14. THANKS!
