Logistic Regression: From Binary to Multi-Class
Shuiwang Ji, Department of Computer Science & Engineering, Texas A&M University
Binary Logistic Regression
1. The binary LR predicts the label y_i ∈ {−1, +1} for a given sample x_i by estimating the probability P(y | x_i) and comparing it with a pre-defined threshold.
2. Recall that the sigmoid function is defined as
\theta(s) = \frac{e^{s}}{1 + e^{s}} = \frac{1}{1 + e^{-s}},   (1)
where s ∈ R and θ denotes the sigmoid function.
3. The probability is thus represented by
P(y \mid x) = \begin{cases} \theta(w^{T} x) & \text{if } y = +1 \\ 1 - \theta(w^{T} x) & \text{if } y = -1. \end{cases}
This can also be expressed compactly as
P(y \mid x) = \theta(y\, w^{T} x),   (2)
due to the fact that θ(−s) = 1 − θ(s). Note that in the binary case, we only need to estimate one probability, as the probabilities for +1 and −1 sum to one.
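To make Equations (1) and (2) concrete, the following is a minimal NumPy sketch of the binary LR probability; the weight vector w, the sample x, and the 0.5 threshold are hypothetical values chosen only for illustration.

    import numpy as np

    def sigmoid(s):
        # theta(s) = 1 / (1 + exp(-s)), Equation (1)
        return 1.0 / (1.0 + np.exp(-s))

    def binary_lr_probability(w, x, y):
        # P(y | x) = theta(y * w^T x), Equation (2), with y in {-1, +1}
        return sigmoid(y * np.dot(w, x))

    w = np.array([0.5, -1.0])            # hypothetical weight vector
    x = np.array([2.0, 1.0])             # hypothetical sample
    p_pos = binary_lr_probability(w, x, +1)
    print(p_pos, 1.0 - p_pos)            # the two probabilities sum to one
    print(+1 if p_pos > 0.5 else -1)     # compare with a 0.5 threshold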
1. In the multi-class case, there are more than two classes, i.e., y_i ∈ {1, 2, ..., K} (i = 1, ..., N), where K is the number of classes and N is the number of samples.
2. In this case, we need to estimate the probability for each of the K classes. We thus define the hypothesis of LR in the multi-class case as
h_{w}(x) = \begin{bmatrix} P(y = 1 \mid x; w) \\ P(y = 2 \mid x; w) \\ \vdots \\ P(y = K \mid x; w) \end{bmatrix}.   (3)
3. A critical assumption here is that there is no ordinal relationship between the classes. So we will need one linear signal for each of the K classes, and these signals should be independent conditioned on x.
1. As a result, in the multi-class LR, we compute K linear signals by taking the dot product between the input x and K independent weight vectors w_k, k = 1, ..., K, as
\begin{bmatrix} w_{1}^{T} x \\ w_{2}^{T} x \\ \vdots \\ w_{K}^{T} x \end{bmatrix}.   (4)
2. We then need to map the K linear outputs (as a vector in R^K) to the K probabilities (as a probability distribution among the K classes).
3. In order to accomplish such a mapping, we introduce the softmax function, which is generalized from the sigmoid function and defined as below. Given a K-dimensional vector v = [v_1, v_2, ..., v_K]^T ∈ R^K,
\mathrm{softmax}(v) = \frac{1}{\sum_{k=1}^{K} e^{v_k}} \begin{bmatrix} e^{v_1} \\ e^{v_2} \\ \vdots \\ e^{v_K} \end{bmatrix}.   (5)
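As a rough illustration of Equation (5), here is a small NumPy sketch of the softmax. Subtracting the maximum entry before exponentiating is a common numerical-stability choice, not part of the definition; it leaves the output unchanged by the same cancellation argument used in the shift-invariance proof later.

    import numpy as np

    def softmax(v):
        # softmax(v) of Equation (5); subtracting max(v) avoids overflow
        # without changing the result
        e = np.exp(v - np.max(v))
        return e / np.sum(e)

    v = np.array([2.0, 1.0, 0.1])        # hypothetical input with K = 3
    p = softmax(v)
    print(p, p.sum())                    # entries in (0, 1) that sum to 1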
1. It is easy to verify that the softmax maps a vector in R^K to (0, 1)^K. All elements in the output vector of softmax sum to 1, so the output can be interpreted as a probability distribution over the K classes. Applying the softmax to the K linear signals in Equation (4), the hypothesis in the multi-class case becomes
h_{w}(x) = \begin{bmatrix} P(y = 1 \mid x; w) \\ P(y = 2 \mid x; w) \\ \vdots \\ P(y = K \mid x; w) \end{bmatrix} = \frac{1}{\sum_{k=1}^{K} e^{w_{k}^{T} x}} \begin{bmatrix} e^{w_{1}^{T} x} \\ e^{w_{2}^{T} x} \\ \vdots \\ e^{w_{K}^{T} x} \end{bmatrix}.   (6)
A numerical sketch of this hypothesis is given after this list.
2. We will further discuss the connection between the softmax function and the sigmoid function by showing that the sigmoid in binary LR is equivalent to the softmax in multi-class LR when K = 2.
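The sketch below evaluates the hypothesis of Equation (6) by applying the softmax to the K linear signals of Equation (4); it assumes a hypothetical K x d matrix W whose rows are the weight vectors w_1, ..., w_K, with hypothetical values used only for illustration.

    import numpy as np

    def softmax(v):
        e = np.exp(v - np.max(v))
        return e / np.sum(e)

    def multiclass_lr_hypothesis(W, x):
        # h_w(x) of Equation (6): softmax over the K linear signals w_k^T x
        return softmax(W @ x)

    W = np.array([[0.2, -0.5],
                  [1.0,  0.3],
                  [-0.4, 0.8]])            # hypothetical weights for K = 3 classes
    x = np.array([1.5, -1.0])              # hypothetical sample with d = 2 features
    print(multiclass_lr_hypothesis(W, x))  # a probability distribution over 3 classes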
1. We optimize the multi-class LR by minimizing a loss (cost) function measuring the error between predictions and the true labels, as we did in the binary LR. To this end, we introduce the cross-entropy in Equation (7) to measure the distance between two probability distributions.
2. The cross-entropy is defined by
H(P, Q) = -\sum_{i=1}^{K} p_i \log(q_i),   (7)
where P = (p_1, ..., p_K) and Q = (q_1, ..., q_K) are two probability distributions, corresponding to the true distribution and the predicted vector in Equation (3), respectively.
3. Here the true distribution refers to the one-hot encoding of the label. For label k (where k is the correct class), the one-hot encoding is defined as a vector whose element at index k is 1 and whose elements are 0 everywhere else.
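As a quick sketch of Equation (7), the snippet below computes the cross-entropy between a one-hot true distribution and a hypothetical predicted distribution; with a one-hot P, the sum reduces to minus the log-probability assigned to the correct class.

    import numpy as np

    def cross_entropy(p, q):
        # H(P, Q) = -sum_i p_i * log(q_i), Equation (7)
        return -np.sum(p * np.log(q))

    p_true = np.array([0.0, 1.0, 0.0])     # one-hot encoding of label k = 2 (K = 3, 1-based)
    q_pred = np.array([0.2, 0.7, 0.1])     # hypothetical predicted distribution
    print(cross_entropy(p_true, q_pred))   # equals -log(0.7)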
1. Now the loss for a training sample x in class c is given by
\mathrm{loss}(x, y; w) = H(y, \hat{y}) = -\sum_{k} y_k \log \hat{y}_k = -\log \hat{y}_c = -\log \frac{e^{w_{c}^{T} x}}{\sum_{k=1}^{K} e^{w_{k}^{T} x}},
where y denotes the one-hot vector and ŷ is the predicted distribution h_w(x). The loss on all samples {(x_i, y_i)}_{i=1}^{N} is
\mathrm{loss}(X, Y; w) = -\sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{I}[y_i = k] \log \frac{e^{w_{k}^{T} x_i}}{\sum_{k'=1}^{K} e^{w_{k'}^{T} x_i}}.   (8)
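A minimal NumPy sketch of Equation (8), using hypothetical random data and storing the labels 0-based (0, ..., K-1) rather than the 1-based labels used in the text:

    import numpy as np

    def softmax_rows(S):
        # row-wise softmax of an N x K matrix of linear signals
        e = np.exp(S - S.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def multiclass_lr_loss(W, X, Y):
        # Equation (8): W is K x d, X is N x d, Y holds labels in {0, ..., K-1}
        P = softmax_rows(X @ W.T)                        # N x K predicted distributions
        return -np.sum(np.log(P[np.arange(len(Y)), Y]))  # -log prob of the true class

    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 2))          # hypothetical weights, K = 3, d = 2
    X = rng.normal(size=(4, 2))          # hypothetical samples, N = 4
    Y = np.array([0, 2, 1, 0])           # hypothetical labels
    print(multiclass_lr_loss(W, X, Y))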
The softmax function in multi-class LR has an invariance property when shifting the parameters. Given the weights w = (w_1, ..., w_K), suppose we subtract the same vector u from each of the K weight vectors; the predicted probabilities then remain unchanged, i.e., P(y = k | x; w_1 − u, ..., w_K − u) = P(y = k | x; w) for all k.
To prove this, let us denote w' = {w'_i}_{i=1}^{K}, where w'_i = w_i − u. We have
P(y = k \mid x; w') = \frac{e^{(w_k - u)^{T} x}}{\sum_{i=1}^{K} e^{(w_i - u)^{T} x}}   (9)
= \frac{e^{w_{k}^{T} x} e^{-u^{T} x}}{\sum_{i=1}^{K} e^{w_{i}^{T} x} e^{-u^{T} x}}   (10)
= \frac{e^{w_{k}^{T} x} e^{-u^{T} x}}{\left(\sum_{i=1}^{K} e^{w_{i}^{T} x}\right) e^{-u^{T} x}}   (11)
= \frac{e^{w_{k}^{T} x}}{\sum_{i=1}^{K} e^{w_{i}^{T} x}}   (12)
= P(y = k \mid x; w),   (13)
which completes the proof.
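The shift invariance of Equations (9)-(13) can also be checked numerically; the sketch below draws hypothetical weights, a sample, and a shift vector u at random.

    import numpy as np

    def softmax(v):
        e = np.exp(v - np.max(v))
        return e / np.sum(e)

    rng = np.random.default_rng(1)
    W = rng.normal(size=(4, 3))          # hypothetical weights, K = 4, d = 3
    x = rng.normal(size=3)               # hypothetical sample
    u = rng.normal(size=3)               # arbitrary shift vector

    p_original = softmax(W @ x)
    p_shifted = softmax((W - u) @ x)     # subtract u from every weight vector
    print(np.allclose(p_original, p_shifted))   # True: probabilities unchanged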
Once we have proved the shift invariance, we are able to show that when K = 2, the softmax-based multi-class LR is equivalent to the sigmoid-based binary LR. In particular, the hypotheses of the two LRs are equivalent, as shown below.
h_{w}(x) = \frac{1}{e^{w_{1}^{T} x} + e^{w_{2}^{T} x}} \begin{bmatrix} e^{w_{1}^{T} x} \\ e^{w_{2}^{T} x} \end{bmatrix}   (14)
= \frac{1}{e^{(w_1 - w_1)^{T} x} + e^{(w_2 - w_1)^{T} x}} \begin{bmatrix} e^{(w_1 - w_1)^{T} x} \\ e^{(w_2 - w_1)^{T} x} \end{bmatrix}   (15)
= \begin{bmatrix} \dfrac{1}{1 + e^{(w_2 - w_1)^{T} x}} \\ \dfrac{e^{(w_2 - w_1)^{T} x}}{1 + e^{(w_2 - w_1)^{T} x}} \end{bmatrix}   (16)
= \begin{bmatrix} \dfrac{1}{1 + e^{-\hat{w}^{T} x}} \\ \dfrac{e^{-\hat{w}^{T} x}}{1 + e^{-\hat{w}^{T} x}} \end{bmatrix}   (17)
= \begin{bmatrix} \dfrac{1}{1 + e^{-\hat{w}^{T} x}} \\ 1 - \dfrac{1}{1 + e^{-\hat{w}^{T} x}} \end{bmatrix} = \begin{bmatrix} h_{\hat{w}}(x) \\ 1 - h_{\hat{w}}(x) \end{bmatrix},   (18)
where ŵ = w_1 − w_2, and Equation (15) follows from the shift invariance (subtracting w_1 from both weight vectors). This completes the proof.
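A numerical check of Equations (14)-(18), using hypothetical weights w_1, w_2 and a sample x: the two-class softmax hypothesis matches the sigmoid hypothesis with ŵ = w_1 − w_2.

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def softmax(v):
        e = np.exp(v - np.max(v))
        return e / np.sum(e)

    rng = np.random.default_rng(2)
    w1, w2 = rng.normal(size=3), rng.normal(size=3)   # hypothetical weights, K = 2
    x = rng.normal(size=3)                            # hypothetical sample

    p_softmax = softmax(np.array([w1 @ x, w2 @ x]))   # multi-class LR with K = 2
    w_hat = w1 - w2                                   # binary LR weight vector
    p_sigmoid = np.array([sigmoid(w_hat @ x), 1.0 - sigmoid(w_hat @ x)])
    print(np.allclose(p_softmax, p_sigmoid))          # True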
1. Now we show that minimizing the logistic regression loss is equivalent to minimizing the cross-entropy loss with binary outcomes.
2. The equivalence between the logistic regression loss and the cross-entropy loss, as shown below, proves that we always obtain identical weights w by minimizing either of the two losses. This equivalence between the losses, together with the equivalence between sigmoid and softmax, leads to the conclusion that binary logistic regression is a particular case of multi-class logistic regression with K = 2.
\arg\min_{w} E_{\mathrm{in}}(w) = \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^{T} x_n}\right)
= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{\theta(y_n w^{T} x_n)}
= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{P(y_n \mid x_n)}
= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \left( \mathbb{I}[y_n = +1] \ln \frac{1}{P(y_n \mid x_n)} + \mathbb{I}[y_n = -1] \ln \frac{1}{P(y_n \mid x_n)} \right)
= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \left( \mathbb{I}[y_n = +1] \ln \frac{1}{h(x_n)} + \mathbb{I}[y_n = -1] \ln \frac{1}{1 - h(x_n)} \right)
= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \left( p_n \log \frac{1}{q_n} + (1 - p_n) \log \frac{1}{1 - q_n} \right)
= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} H(\{p_n, 1 - p_n\}, \{q_n, 1 - q_n\}),
where p_n = \mathbb{I}[y_n = +1], q_n = h(x_n) = \theta(w^{T} x_n), and h denotes the sigmoid hypothesis of binary LR. This completes the proof.
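The chain of equalities above can also be checked numerically; the sketch below uses hypothetical data and compares the logistic regression loss with the averaged binary cross-entropy, where p_n = I[y_n = +1] and q_n = theta(w^T x_n).

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    rng = np.random.default_rng(3)
    w = rng.normal(size=3)                    # hypothetical weights
    X = rng.normal(size=(5, 3))               # hypothetical samples, N = 5
    y = rng.choice([-1, 1], size=5)           # labels in {-1, +1}

    # Logistic regression loss: (1/N) sum_n ln(1 + exp(-y_n w^T x_n))
    lr_loss = np.mean(np.log1p(np.exp(-y * (X @ w))))

    # Cross-entropy loss with p_n = I[y_n = +1] and q_n = theta(w^T x_n)
    p = (y == 1).astype(float)
    q = sigmoid(X @ w)
    ce_loss = np.mean(-(p * np.log(q) + (1 - p) * np.log(1 - q)))

    print(np.allclose(lr_loss, ce_loss))      # True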