

SLIDE 1

Logistic Regression: From Binary to Multi-Class

Shuiwang Ji
Department of Computer Science & Engineering
Texas A&M University

1 / 14

SLIDE 2

Binary Logistic Regression

1. The binary LR predicts the label $y_i \in \{-1, +1\}$ for a given sample $\mathbf{x}_i$ by estimating a probability $P(y \mid \mathbf{x}_i)$ and comparing it with a pre-defined threshold.

2. Recall the sigmoid function is defined as
$$\theta(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}, \qquad (1)$$
where $s \in \mathbb{R}$ and $\theta$ denotes the sigmoid function.

3. The probability is thus represented by
$$P(y \mid \mathbf{x}) = \begin{cases} \theta(\mathbf{w}^T\mathbf{x}) & \text{if } y = +1, \\ 1 - \theta(\mathbf{w}^T\mathbf{x}) & \text{if } y = -1. \end{cases}$$
This can also be expressed compactly as
$$P(y \mid \mathbf{x}) = \theta(y\,\mathbf{w}^T\mathbf{x}), \qquad (2)$$
due to the fact that $\theta(-s) = 1 - \theta(s)$. Note that in the binary case, we only need to estimate one probability, as the probabilities for $+1$ and $-1$ sum to one.
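To make Equations (1) and (2) concrete, here is a minimal NumPy sketch; the function names and example values are illustrative rather than from the slides:

```python
import numpy as np

def sigmoid(s):
    """Sigmoid of Equation (1): theta(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def binary_lr_prob(w, x, y):
    """P(y | x) = theta(y * w^T x), Equation (2), with y in {-1, +1}."""
    return sigmoid(y * np.dot(w, x))

# Illustrative values; predict +1 when P(y = +1 | x) exceeds the threshold 0.5.
w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
p_pos = binary_lr_prob(w, x, +1)
p_neg = binary_lr_prob(w, x, -1)
print(p_pos, p_neg, p_pos + p_neg)  # the two probabilities sum to 1
```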

2 / 14

SLIDE 3

Multi-Class Logistic Regression

1. In the multi-class case there are more than two classes, i.e., $y_i \in \{1, 2, \cdots, K\}$ $(i = 1, \cdots, N)$, where $K$ is the number of classes and $N$ is the number of samples.

2. In this case, we need to estimate the probability for each of the $K$ classes. The hypothesis in binary LR is hence generalized to the multi-class case as
$$h_{\mathbf{w}}(\mathbf{x}) = \begin{bmatrix} P(y = 1 \mid \mathbf{x}; \mathbf{w}) \\ P(y = 2 \mid \mathbf{x}; \mathbf{w}) \\ \vdots \\ P(y = K \mid \mathbf{x}; \mathbf{w}) \end{bmatrix}. \qquad (3)$$

3. A critical assumption here is that there is no ordinal relationship between the classes. So we will need one linear signal for each of the $K$ classes, which should be independent conditioned on $\mathbf{x}$.

3 / 14

SLIDE 4

Softmax

1. As a result, in the multi-class LR, we compute $K$ linear signals by the dot product between the input $\mathbf{x}$ and $K$ independent weight vectors $\mathbf{w}_k$, $k = 1, \cdots, K$, as
$$\begin{bmatrix} \mathbf{w}_1^T\mathbf{x} \\ \mathbf{w}_2^T\mathbf{x} \\ \vdots \\ \mathbf{w}_K^T\mathbf{x} \end{bmatrix}. \qquad (4)$$

2. We then need to map the $K$ linear outputs (as a vector in $\mathbb{R}^K$) to the $K$ probabilities (as a probability distribution among the $K$ classes).

3. In order to accomplish such a mapping, we introduce the softmax function, which is generalized from the sigmoid function and defined as below. Given a $K$-dimensional vector $\mathbf{v} = [v_1, v_2, \cdots, v_K]^T \in \mathbb{R}^K$,
$$\mathrm{softmax}(\mathbf{v}) = \frac{1}{\sum_{k=1}^K e^{v_k}} \begin{bmatrix} e^{v_1} \\ e^{v_2} \\ \vdots \\ e^{v_K} \end{bmatrix}. \qquad (5)$$
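Below is a direct NumPy sketch of Equation (5). Subtracting max(v) before exponentiating is a standard numerical-stability trick that is not on the slide; by the shift-invariance property shown later, it does not change the output:

```python
import numpy as np

def softmax(v):
    """Softmax of Equation (5). Shifting by max(v) avoids overflow in exp
    and, by shift-invariance, leaves the result unchanged."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

v = np.array([2.0, 1.0, 0.1])
p = softmax(v)
print(p, p.sum())  # entries in (0, 1), summing to 1, order preserved
```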

4 / 14

SLIDE 5

Softmax

1. It is easy to verify that the softmax maps a vector in $\mathbb{R}^K$ to $(0, 1)^K$. All elements in the output vector of softmax sum to 1 and their order is preserved. Thus the hypothesis in (3) can be written as (see the code sketch after this list)
$$h_{\mathbf{w}}(\mathbf{x}) = \begin{bmatrix} P(y = 1 \mid \mathbf{x}; \mathbf{w}) \\ P(y = 2 \mid \mathbf{x}; \mathbf{w}) \\ \vdots \\ P(y = K \mid \mathbf{x}; \mathbf{w}) \end{bmatrix} = \frac{1}{\sum_{k=1}^K e^{\mathbf{w}_k^T\mathbf{x}}} \begin{bmatrix} e^{\mathbf{w}_1^T\mathbf{x}} \\ e^{\mathbf{w}_2^T\mathbf{x}} \\ \vdots \\ e^{\mathbf{w}_K^T\mathbf{x}} \end{bmatrix}. \qquad (6)$$

2. We will further discuss the connection between the softmax function and the sigmoid function by showing that the sigmoid in binary LR is equivalent to the softmax in multi-class LR when $K = 2$.
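As a sketch of Equation (6), the $K$ weight vectors can be stacked as rows of a matrix W so that W @ x yields the linear signals of Equation (4); the names and values here are made up for illustration:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def h(W, x):
    """Multi-class LR hypothesis of Equation (6):
    softmax over the K linear signals w_k^T x of Equation (4)."""
    return softmax(W @ x)

W = np.array([[0.5, -1.0],   # w_1
              [0.2,  0.3],   # w_2
              [-0.4, 0.8]])  # w_3
x = np.array([1.0, 2.0])
print(h(W, x))  # a probability distribution over K = 3 classes
```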

5 / 14

SLIDE 6

Cross Entropy

1. We optimize the multi-class LR by minimizing a loss (cost) function that measures the error between predictions and the true labels, as we did in the binary LR. To this end, we introduce the cross entropy in Equation (7), which measures the distance between two probability distributions.

2. The cross entropy is defined by
$$H(P, Q) = -\sum_{i=1}^K p_i \log(q_i), \qquad (7)$$
where $P = (p_1, \cdots, p_K)$ and $Q = (q_1, \cdots, q_K)$ are two probability distributions. In multi-class LR, the two probability distributions are the true distribution and the predicted vector in Equation (3), respectively.

3. Here the true distribution refers to the one-hot encoding of the label. For label $k$ (where $k$ is the correct class), the one-hot encoding is a vector whose element at index $k$ is 1, with 0 everywhere else.
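A minimal sketch of Equation (7) with a one-hot true distribution; note the code uses 0-indexed labels, and the example values are my own:

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum_i p_i log q_i, Equation (7)."""
    return -np.sum(p * np.log(q))

def one_hot(k, K):
    """One-hot encoding of label k (0-indexed here) over K classes."""
    p = np.zeros(K)
    p[k] = 1.0
    return p

q = np.array([0.7, 0.2, 0.1])  # a predicted distribution, e.g. from h(W, x)
p = one_hot(0, 3)              # the true class is the first one
print(cross_entropy(p, q))     # = -log(0.7)
```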

6 / 14

SLIDE 7

Loss Function

1. Now the loss for a training sample $\mathbf{x}$ in class $c$ is given by
$$\mathrm{loss}(\mathbf{x}, \mathbf{y}; \mathbf{w}) = H(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_k y_k \log \hat{y}_k = -\log \hat{y}_c = -\log \frac{e^{\mathbf{w}_c^T\mathbf{x}}}{\sum_{k=1}^K e^{\mathbf{w}_k^T\mathbf{x}}},$$
where $\mathbf{y}$ denotes the one-hot vector and $\hat{\mathbf{y}}$ is the predicted distribution $h_{\mathbf{w}}(\mathbf{x})$. And the loss on all samples $(\mathbf{x}_i, y_i)_{i=1}^N$ is
$$\mathrm{loss}(\mathbf{X}, \mathbf{Y}; \mathbf{w}) = -\sum_{i=1}^N \sum_{k=1}^K \mathbb{I}[y_i = k] \log \frac{e^{\mathbf{w}_k^T\mathbf{x}_i}}{\sum_{k'=1}^K e^{\mathbf{w}_{k'}^T\mathbf{x}_i}}. \qquad (8)$$
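A vectorized sketch of Equation (8) under the same 0-indexed-label convention; it uses the equivalent log-sum-exp form for numerical stability, and all names, shapes, and values are illustrative:

```python
import numpy as np

def total_loss(W, X, y):
    """Cross-entropy loss of Equation (8) over all N samples.
    X: (N, d) inputs; y: (N,) integer labels in {0, ..., K-1}; W: (K, d)."""
    S = X @ W.T                           # (N, K) linear signals w_k^T x_i
    S -= S.max(axis=1, keepdims=True)     # stability shift (shift-invariant)
    log_probs = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].sum()

X = np.array([[1.0, 2.0], [0.5, -1.0]])
y = np.array([0, 2])
W = np.array([[0.5, -1.0], [0.2, 0.3], [-0.4, 0.8]])
print(total_loss(W, X, y))
```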

7 / 14

SLIDE 8

Shift-invariance in Parameters

The softmax function in multi-class LR has an invariance property under shifts of the parameters: given the weights $\mathbf{w} = (\mathbf{w}_1, \cdots, \mathbf{w}_K)$, if we subtract the same vector $\mathbf{u}$ from each of the $K$ weight vectors, the outputs of the softmax function remain the same.

8 / 14

SLIDE 9

Proof

To prove this, let us denote $\mathbf{w}' = \{\mathbf{w}_i'\}_{i=1}^K$, where $\mathbf{w}_i' = \mathbf{w}_i - \mathbf{u}$. We have
$$\begin{aligned}
P(y = k \mid \mathbf{x}; \mathbf{w}') &= \frac{e^{(\mathbf{w}_k - \mathbf{u})^T\mathbf{x}}}{\sum_{i=1}^K e^{(\mathbf{w}_i - \mathbf{u})^T\mathbf{x}}} && (9) \\
&= \frac{e^{\mathbf{w}_k^T\mathbf{x}}\, e^{-\mathbf{u}^T\mathbf{x}}}{\sum_{i=1}^K e^{\mathbf{w}_i^T\mathbf{x}}\, e^{-\mathbf{u}^T\mathbf{x}}} && (10) \\
&= \frac{e^{\mathbf{w}_k^T\mathbf{x}}\, e^{-\mathbf{u}^T\mathbf{x}}}{\left(\sum_{i=1}^K e^{\mathbf{w}_i^T\mathbf{x}}\right) e^{-\mathbf{u}^T\mathbf{x}}} && (11) \\
&= \frac{e^{\mathbf{w}_k^T\mathbf{x}}}{\sum_{i=1}^K e^{\mathbf{w}_i^T\mathbf{x}}} && (12) \\
&= P(y = k \mid \mathbf{x}; \mathbf{w}), && (13)
\end{aligned}$$
which completes the proof.
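A quick numerical check of this shift-invariance (random illustrative values):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # K = 4 weight vectors w_k as rows
x = rng.normal(size=3)
u = rng.normal(size=3)        # the common shift vector

p1 = softmax(W @ x)
p2 = softmax((W - u) @ x)     # subtract u from every w_k
print(np.allclose(p1, p2))    # True: Equations (9)-(13) in action
```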

9 / 14

SLIDE 10

Equivalence to Sigmoid

Once we have proved the shift-invariance, we are able to show that when $K = 2$, the softmax-based multi-class LR is equivalent to the sigmoid-based binary LR. In particular, the hypotheses of the two models are equivalent.

10 / 14

SLIDE 11

Proof

$$\begin{aligned}
h_{\mathbf{w}}(\mathbf{x}) &= \frac{1}{e^{\mathbf{w}_1^T\mathbf{x}} + e^{\mathbf{w}_2^T\mathbf{x}}} \begin{bmatrix} e^{\mathbf{w}_1^T\mathbf{x}} \\ e^{\mathbf{w}_2^T\mathbf{x}} \end{bmatrix} && (14) \\
&= \frac{1}{e^{(\mathbf{w}_1 - \mathbf{w}_1)^T\mathbf{x}} + e^{(\mathbf{w}_2 - \mathbf{w}_1)^T\mathbf{x}}} \begin{bmatrix} e^{(\mathbf{w}_1 - \mathbf{w}_1)^T\mathbf{x}} \\ e^{(\mathbf{w}_2 - \mathbf{w}_1)^T\mathbf{x}} \end{bmatrix} && (15) \\
&= \begin{bmatrix} \dfrac{1}{1 + e^{(\mathbf{w}_2 - \mathbf{w}_1)^T\mathbf{x}}} \\[2ex] \dfrac{e^{(\mathbf{w}_2 - \mathbf{w}_1)^T\mathbf{x}}}{1 + e^{(\mathbf{w}_2 - \mathbf{w}_1)^T\mathbf{x}}} \end{bmatrix} && (16) \\
&= \begin{bmatrix} \dfrac{1}{1 + e^{-\hat{\mathbf{w}}^T\mathbf{x}}} \\[2ex] \dfrac{e^{-\hat{\mathbf{w}}^T\mathbf{x}}}{1 + e^{-\hat{\mathbf{w}}^T\mathbf{x}}} \end{bmatrix} && (17) \\
&= \begin{bmatrix} \dfrac{1}{1 + e^{-\hat{\mathbf{w}}^T\mathbf{x}}} \\[2ex] 1 - \dfrac{1}{1 + e^{-\hat{\mathbf{w}}^T\mathbf{x}}} \end{bmatrix} = \begin{bmatrix} h_{\hat{\mathbf{w}}}(\mathbf{x}) \\ 1 - h_{\hat{\mathbf{w}}}(\mathbf{x}) \end{bmatrix}, && (18)
\end{aligned}$$
where $\hat{\mathbf{w}} = \mathbf{w}_1 - \mathbf{w}_2$. This completes the proof.
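A quick numerical check of this equivalence for K = 2 (random illustrative values):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

rng = np.random.default_rng(1)
w1, w2 = rng.normal(size=3), rng.normal(size=3)
x = rng.normal(size=3)

p_softmax = softmax(np.array([w1 @ x, w2 @ x]))  # K = 2 multi-class LR
w_hat = w1 - w2
p_sigmoid = sigmoid(w_hat @ x)                   # binary LR with w_hat
print(np.allclose(p_softmax, [p_sigmoid, 1 - p_sigmoid]))  # True
```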

11 / 14

SLIDE 12

Cross entropy with binary outcomes

1. Now we show that minimizing the logistic regression loss is equivalent to minimizing the cross-entropy loss with binary outcomes.

2. The equivalence between the logistic regression loss and the cross-entropy loss, as shown below, means that we always obtain identical weights $\mathbf{w}$ by minimizing either loss. This equivalence, together with the equivalence between sigmoid and softmax, leads to the conclusion that binary logistic regression is a particular case of multi-class logistic regression when $K = 2$.

12 / 14

SLIDE 13

Proof

$$\begin{aligned}
\arg\min_{\mathbf{w}} E_{\mathrm{in}}(\mathbf{w}) &= \arg\min_{\mathbf{w}} \frac{1}{N}\sum_{n=1}^N \ln\!\left(1 + e^{-y_n \mathbf{w}^T\mathbf{x}_n}\right) \\
&= \arg\min_{\mathbf{w}} \frac{1}{N}\sum_{n=1}^N \ln\frac{1}{\theta(y_n \mathbf{w}^T\mathbf{x}_n)} \\
&= \arg\min_{\mathbf{w}} \frac{1}{N}\sum_{n=1}^N \ln\frac{1}{P(y_n \mid \mathbf{x}_n)} \\
&= \arg\min_{\mathbf{w}} \frac{1}{N}\sum_{n=1}^N \left( \mathbb{I}[y_n = +1] \ln\frac{1}{P(y_n \mid \mathbf{x}_n)} + \mathbb{I}[y_n = -1] \ln\frac{1}{P(y_n \mid \mathbf{x}_n)} \right) \\
&= \arg\min_{\mathbf{w}} \frac{1}{N}\sum_{n=1}^N \left( \mathbb{I}[y_n = +1] \ln\frac{1}{h(\mathbf{x}_n)} + \mathbb{I}[y_n = -1] \ln\frac{1}{1 - h(\mathbf{x}_n)} \right) \\
&= \arg\min_{\mathbf{w}} \frac{1}{N}\sum_{n=1}^N \left( p_n \log\frac{1}{q_n} + (1 - p_n) \log\frac{1}{1 - q_n} \right) \\
&= \arg\min_{\mathbf{w}} \frac{1}{N}\sum_{n=1}^N H(\{p_n, 1 - p_n\}, \{q_n, 1 - q_n\}),
\end{aligned}$$
where $p_n = \mathbb{I}[y_n = +1]$ and $q_n = h(\mathbf{x}_n)$. This completes the proof.
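A numerical check of this equivalence, comparing the logistic regression loss with the binary cross-entropy loss on random illustrative data:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(2)
w = rng.normal(size=3)
X = rng.normal(size=(5, 3))
y = rng.choice([-1, 1], size=5)

# Logistic regression loss: mean of ln(1 + exp(-y_n w^T x_n)).
logistic = np.mean(np.log1p(np.exp(-y * (X @ w))))

# Binary cross-entropy with p_n = I[y_n = +1], q_n = h(x_n).
p = (y == 1).astype(float)
q = sigmoid(X @ w)
cross_entropy = np.mean(-p * np.log(q) - (1 - p) * np.log(1 - q))

print(np.allclose(logistic, cross_entropy))  # True
```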

13 / 14

SLIDE 14

THANKS!

14 / 14