SLIDE 1

Data Mining and Machine Learning: Fundamental Concepts and Algorithms

dataminingbook.info

Mohammed J. Zaki¹  Wagner Meira Jr.²

¹Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA

²Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 24: Logistic Regression

SLIDE 2

Binary Logistic Regression

In logistic regression, we are given a set of d predictor or independent variables X1, X2, …, Xd, and a binary or Bernoulli response variable Y that takes on only two values, namely 0 and 1. Since there are only two outcomes for the response variable Y, its probability mass function for X̃ = x̃ is given as

P(Y = 1 | X̃ = x̃) = π(x̃)
P(Y = 0 | X̃ = x̃) = 1 − π(x̃)

where π(x̃) is the unknown true parameter value, denoting the probability of Y = 1 given X̃ = x̃.

SLIDE 3

Binary Logistic Regression

Instead of directly predicting the response value, the goal is to learn the probability P(Y = 1 | X̃ = x̃), which is also the expected value of Y given X̃ = x̃. Since P(Y = 1 | X̃ = x̃) is a probability, it is not appropriate to use the linear regression model directly: a linear function f(x̃) can be arbitrarily large or arbitrarily small, whereas the output must be a valid probability value. The name "logistic regression" comes from the logistic function (also called the sigmoid function), which "squashes" any scalar input z to a value between 0 and 1:

θ(z) = 1 / (1 + exp{−z}) = exp{z} / (1 + exp{z})    (1)
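As a quick illustration (not from the slides), here is a minimal Python sketch of the logistic function in (1); the function name and the numerically stable two-branch form are our own choices:

import math

def sigmoid(z: float) -> float:
    """Logistic function θ(z) = 1 / (1 + exp(-z)), computed stably for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For very negative z, exp(-z) would overflow; use the equivalent
    # form exp(z) / (1 + exp(z)) instead.
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(sigmoid(0.0))    # 0.5, the threshold value
print(sigmoid(100.0))  # ~1.0
print(sigmoid(-100.0)) # ~0.0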

SLIDE 4

Logistic Function

[Figure: the logistic function θ(z) plotted for z ranging from −∞ to +∞; θ(z) increases from 0 to 1, crossing 0.5 at z = 0.]

SLIDE 5

Logistic Function

Example

The figure shows the plot of the logistic function for z ranging from −∞ to +∞. In particular, consider what happens when z is −∞, +∞, and 0:

θ(−∞) = 1 / (1 + exp{∞}) = 1/∞ = 0
θ(+∞) = 1 / (1 + exp{−∞}) = 1/1 = 1
θ(0) = 1 / (1 + exp{0}) = 1/2 = 0.5

As desired, θ(z) lies in the range [0, 1], and z = 0 is the "threshold" value in the sense that for z > 0 we have θ(z) > 0.5, and for z < 0 we have θ(z) < 0.5. Thus, interpreting θ(z) as a probability, the larger the z value, the higher the probability. Another interesting property of the logistic function is that

1 − θ(z) = 1 − exp{z} / (1 + exp{z}) = (1 + exp{z} − exp{z}) / (1 + exp{z}) = 1 / (1 + exp{z}) = θ(−z)    (2)
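A quick numeric check of the threshold value and of the symmetry property (2), reusing the sigmoid() sketch above:

# Verify θ(0) = 0.5 and 1 - θ(z) = θ(-z) at a few sample points.
for z in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs((1.0 - sigmoid(z)) - sigmoid(-z)) < 1e-12
print(sigmoid(0.0))  # 0.5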

SLIDE 6

Binary Logistic Regression

Using the logistic function, we define the logistic regression model as follows:

P(Y = 1 | X̃ = x̃) = π(x̃) = θ(f(x̃)) = θ(ω̃ᵀx̃) = exp{ω̃ᵀx̃} / (1 + exp{ω̃ᵀx̃})    (3)

Thus, the probability that the response is Y = 1 is the output of the logistic function for the input ω̃ᵀx̃. On the other hand, the probability for Y = 0 is given as

P(Y = 0 | X̃ = x̃) = 1 − P(Y = 1 | X̃ = x̃) = θ(−ω̃ᵀx̃) = 1 / (1 + exp{ω̃ᵀx̃})

that is, 1 − θ(z) = θ(−z) for z = ω̃ᵀx̃. Combining these two cases, the full logistic regression model is given as

P(Y | X̃ = x̃) = θ(ω̃ᵀx̃)^Y · θ(−ω̃ᵀx̃)^(1−Y)    (4)

since Y is a Bernoulli random variable that takes on either the value 1 or 0. We can observe that P(Y | X̃ = x̃) = θ(ω̃ᵀx̃) when Y = 1 and P(Y | X̃ = x̃) = θ(−ω̃ᵀx̃) when Y = 0, as desired.
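A minimal sketch of the model in (3), assuming an augmented input vector whose first component is the constant 1 (the bias input); the function name predict_proba and the sample numbers are our own, and it reuses sigmoid() from above:

def predict_proba(w: list[float], x: list[float]) -> float:
    """P(Y = 1 | x) = θ(wᵀx) for an augmented point x (x[0] == 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)

# Example with a hypothetical weight vector:
w = [-1.0, 2.0, 0.5]
x = [1.0, 0.3, -0.2]  # augmented: leading 1 is the bias input
print(predict_proba(w, x))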

SLIDE 7

Log-Odds Ratio

Define the odds ratio for the occurrence of Y = 1 as follows:

odds(Y = 1 | X̃ = x̃) = P(Y = 1 | X̃ = x̃) / P(Y = 0 | X̃ = x̃)
                      = θ(ω̃ᵀx̃) / θ(−ω̃ᵀx̃)
                      = ( exp{ω̃ᵀx̃} / (1 + exp{ω̃ᵀx̃}) ) · (1 + exp{ω̃ᵀx̃})
                      = exp{ω̃ᵀx̃}    (5)

The logarithm of the odds ratio, called the log-odds ratio, is therefore given as:

ln( odds(Y = 1 | X̃ = x̃) ) = ln( P(Y = 1 | X̃ = x̃) / (1 − P(Y = 1 | X̃ = x̃)) )
                           = ln( exp{ω̃ᵀx̃} )
                           = ω̃ᵀx̃ = ω0·x0 + ω1·x1 + ⋯ + ωd·xd    (6)

The log-odds ratio function is also called the logit function, defined as

logit(z) = ln( z / (1 − z) )

It is the inverse of the logistic function.
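A short sketch of the logit function, with a check that it inverts the logistic function; this reuses the sigmoid() sketch above:

import math

def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1): ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# logit is the inverse of the logistic function: logit(θ(z)) = z.
for z in (-2.0, 0.0, 1.5):
    assert abs(logit(sigmoid(z)) - z) < 1e-9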

SLIDE 8

Log-Odds Ratio

We can see that

ln( odds(Y = 1 | X̃ = x̃) ) = logit( P(Y = 1 | X̃ = x̃) )

The logistic regression model is therefore based on the assumption that the log-odds ratio for Y = 1 given X̃ = x̃ is a linear function (or a weighted sum) of the independent attributes. In particular, consider the effect of attribute Xi by fixing the values of all other attributes; we get

ln( odds(Y = 1 | X̃ = x̃) ) = ωi·xi + C
⟹ odds(Y = 1 | X̃ = x̃) = exp{ωi·xi + C} = exp{ωi·xi} · exp{C} ∝ exp{ωi·xi}

where C is a constant comprising the fixed attributes. The regression coefficient ωi can therefore be interpreted as the change in the log-odds ratio for Y = 1 per unit change in Xi, or equivalently, the odds ratio for Y = 1 increases exponentially per unit change in Xi.
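As an illustration with our own numbers (not from the slides): a coefficient of ωi = 0.7 means each unit increase in Xi multiplies the odds of Y = 1 by exp{0.7} ≈ 2.01, i.e., roughly doubles them:

import math

# Hypothetical coefficient: each unit increase in X_i scales the odds
# of Y = 1 by exp(w_i).
w_i = 0.7
print(math.exp(w_i))  # ≈ 2.01: the odds roughly double per unit change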

SLIDE 9

Maximum Likelihood Estimation

We will use the maximum likelihood approach to learn the weight vector w̃. The likelihood is defined as the probability of the observed data given the estimated parameters w̃:

L(w̃) = P(Y | w̃) = ∏_{i=1}^n P(yi | x̃i) = ∏_{i=1}^n θ(w̃ᵀx̃i)^{yi} · θ(−w̃ᵀx̃i)^{1−yi}

Instead of trying to maximize the likelihood, we can maximize the logarithm of the likelihood, called the log-likelihood, to convert the product into a summation as follows:

ln(L(w̃)) = ∑_{i=1}^n [ yi · ln(θ(w̃ᵀx̃i)) + (1 − yi) · ln(θ(−w̃ᵀx̃i)) ]    (7)

SLIDE 10

Maximum Likelihood Estimation

The negative of the log-likelihood can also be considered as an error function, the cross-entropy error function, given as follows:

E(w̃) = −ln(L(w̃)) = ∑_{i=1}^n [ yi · ln( 1 / θ(w̃ᵀx̃i) ) + (1 − yi) · ln( 1 / (1 − θ(w̃ᵀx̃i)) ) ]    (8)

The task of maximizing the log-likelihood is therefore equivalent to minimizing the cross-entropy error.
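A minimal sketch of the cross-entropy error (8) over a dataset, reusing predict_proba() from above; the small clamp away from 0 and 1 is our own addition to avoid log(0):

import math

def cross_entropy(w, X, y, eps=1e-12):
    """E(w) = -sum_i [ y_i ln θ(wᵀx_i) + (1 - y_i) ln (1 - θ(wᵀx_i)) ]."""
    err = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, xi)         # θ(wᵀx_i)
        p = min(max(p, eps), 1.0 - eps)  # clamp away from 0 and 1
        err -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return err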

SLIDE 11

Maximum Likelihood Estimation

Typically, to obtain the optimal weight vector w̃, we would differentiate the log-likelihood function with respect to w̃, set the result to 0, and then solve for w̃. However, for the log-likelihood formulation there is no closed-form solution for the weight vector w̃. Instead, we use an iterative gradient ascent method to compute the optimal value. The gradient ascent method relies on the gradient of the log-likelihood function, which can be obtained by taking its partial derivative with respect to w̃, as follows:

∇(w̃) = ∂/∂w̃ [ ln(L(w̃)) ] = ∂/∂w̃ ∑_{i=1}^n [ yi · ln(θ(zi)) + (1 − yi) · ln(θ(−zi)) ]    (9)

where zi = w̃ᵀx̃i.

SLIDE 12

Maximum Likelihood Estimation

The gradient ascent method starts at some initial estimate for w̃, denoted w̃⁰. At each step t, the method moves in the direction of steepest ascent, which is given by the gradient vector. Thus, given the current estimate w̃ᵗ, we can obtain the next estimate as follows:

w̃ᵗ⁺¹ = w̃ᵗ + η · ∇(w̃ᵗ)    (10)

Here, η > 0 is a user-specified parameter called the learning rate. It should not be too large, otherwise the estimates will vary wildly from one iteration to the next; and it should not be too small, otherwise it will take a long time to converge. At the optimal value of w̃, the gradient will be zero, i.e., ∇(w̃) = 0, as desired.
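A minimal batch gradient ascent sketch under these definitions, reusing predict_proba() from above. It sums the per-point gradient (yi − θ(w̃ᵀx̃i))·x̃i over all points, the closed form stated in (11) on the next slide; the function name and default parameters are our own:

def batch_gradient_ascent(X, y, eta=0.1, eps=1e-6, max_iter=10_000):
    """Batch gradient ascent on the log-likelihood.

    X: list of augmented points (x[0] == 1); y: list of 0/1 labels.
    The gradient of ln L(w) is sum_i (y_i - θ(wᵀx_i)) · x_i.
    """
    d = len(X[0])
    w = [0.0] * d
    for _ in range(max_iter):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            r = yi - predict_proba(w, xi)  # y_i - θ(wᵀx_i)
            for j in range(d):
                grad[j] += r * xi[j]
        w_new = [wj + eta * gj for wj, gj in zip(w, grad)]
        # Stop when the weights barely move between iterations.
        if sum((a - b) ** 2 for a, b in zip(w_new, w)) ** 0.5 <= eps:
            return w_new
        w = w_new
    return w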

SLIDE 13

Stochastic Gradient Ascent

The gradient ascent method computes the gradient by considering all the data points, and it is therefore called batch gradient ascent. For large datasets, it is typically much faster to compute the gradient by considering only one (randomly chosen) point at a time. The weight vector is updated after each such partial gradient step, giving rise to stochastic gradient ascent (SGA) for computing the optimal weight vector w̃. Given a randomly chosen point x̃i, the point-specific gradient is given as

∇(w̃, x̃i) = ( yi − θ(w̃ᵀx̃i) ) · x̃i    (11)

Unlike batch gradient ascent, which updates w̃ by considering all the points, stochastic gradient ascent updates the weight vector after observing each point, and the updated values are used immediately in the next update. Computing the full gradient in the batch approach can be very expensive. In contrast, computing the partial gradient at each point is very fast, and due to the stochastic updates to w̃, SGA is typically much faster than the batch approach for very large datasets.

SLIDE 14

Binary Logistic Regression

Once the model has been trained, we can predict the response for any new augmented test point z̃ as follows:

ŷ = 1 if θ(w̃ᵀz̃) ≥ 0.5;  ŷ = 0 if θ(w̃ᵀz̃) < 0.5    (12)

SLIDE 15

Logistic Regression: Stochastic Gradient Ascent

LogisticRegression-SGA (D, η, ε):

foreach xi ∈ D do
    x̃i ← (1, xiᵀ)ᵀ    // map to R^{d+1}
t ← 0    // step/iteration counter
w̃⁰ ← (0, …, 0)ᵀ ∈ R^{d+1}    // initial weight vector
repeat
    w̃ ← w̃ᵗ    // make a copy of w̃ᵗ
    foreach x̃i ∈ D̃ in random order do
        ∇(w̃, x̃i) ← (yi − θ(w̃ᵀx̃i)) · x̃i    // compute gradient at x̃i
        w̃ ← w̃ + η · ∇(w̃, x̃i)    // update estimate for w̃
    w̃ᵗ⁺¹ ← w̃    // update w̃ᵗ⁺¹
    t ← t + 1
until ‖w̃ᵗ − w̃ᵗ⁻¹‖ ≤ ε
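A self-contained, runnable Python sketch of this algorithm; the function name, the max_epochs safeguard, and the default parameters are our own choices:

import math
import random

def logistic_regression_sga(D, labels, eta=0.01, eps=1e-6, max_epochs=1000):
    """Stochastic gradient ascent for binary logistic regression.

    D: list of raw points (each a list of d features); labels: 0/1 values.
    Each point is augmented with a leading 1 for the bias term.
    """
    X = [[1.0] + list(x) for x in D]  # map to R^{d+1}
    d = len(X[0])
    w = [0.0] * d
    for _ in range(max_epochs):
        w_prev = list(w)
        order = list(range(len(X)))
        random.shuffle(order)  # visit the points in random order
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, X[i]))
            # Stable logistic function θ(z):
            p = 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))
            r = labels[i] - p  # y_i - θ(wᵀx_i)
            w = [wj + eta * r * xj for wj, xj in zip(w, X[i])]
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_prev)))
        if delta <= eps:
            break
    return w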

SLIDE 16

Logistic Regression

[Figure: Iris principal components data with the fitted logistic regression surface over axes X1, X2, and Y. Misclassified points are shown in dark gray. Circles denote Iris-virginica; triangles denote the other two Iris types.]

SLIDE 17

Linear Regression

[Figure: Iris principal components data with the plane of best fit from linear regression over axes X1, X2, and Y. Misclassified points are shown in dark gray. Circles denote Iris-virginica; triangles denote the other two Iris types.]

SLIDE 18

Logistic Regression

Example

The figure shows the output of logistic regression modeling on the Iris principal components data, where the independent attributes X1 and X2 represent the first two principal components, and the binary response variable Y represents the type of Iris flower: Y = 1 corresponds to Iris-virginica, whereas Y = 0 corresponds to the two other Iris types, namely Iris-setosa and Iris-versicolor. The fitted logistic model is given as

w̃ = (w0, w1, w2)ᵀ = (−6.79, −5.07, −3.29)ᵀ
P(Y = 1 | x̃) = θ(w̃ᵀx̃) = 1 / (1 + exp{6.79 + 5.07·x1 + 3.29·x2})

The figure plots P(Y = 1 | x̃) for various values of x̃. Given x̃, if P(Y = 1 | x̃) ≥ 0.5, then we predict ŷ = 1; otherwise we predict ŷ = 0.

SLIDE 19

Logistic Regression

Example

The figure shows that five points (in dark gray) are misclassified. For example, for x̃ = (1, −0.52, −1.19)ᵀ we have:

P(Y = 1 | x̃) = θ(w̃ᵀx̃) = θ(−0.24) = 0.44
P(Y = 0 | x̃) = 1 − P(Y = 1 | x̃) = 0.56

Thus, the predicted response for x̃ is ŷ = 0, whereas the true class is y = 1. The plane of best fit in linear regression is shown in the figure, with the weight vector

w̃ = (0.333, −0.167, 0.074)ᵀ
ŷ = f(x̃) = 0.333 − 0.167·x1 + 0.074·x2

Since the response variable Y is binary, we predict the class as ŷ = 1 if f(x̃) ≥ 0.5, and ŷ = 0 otherwise. The linear regression model misclassifies 17 points (dark gray points). Since there are n = 150 points in total, this results in a training-set or in-sample accuracy of 88.7% for linear regression, while logistic regression misclassifies only 5 points, an accuracy of 96.7%, which is a much better fit.
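A quick check of this example point against the fitted weights from the previous slide, reusing the sigmoid() sketch above:

w = [-6.79, -5.07, -3.29]  # fitted (w0, w1, w2) from the slide
x = [1.0, -0.52, -1.19]    # augmented example point
z = sum(wi * xi for wi, xi in zip(w, x))
print(round(z, 2))           # ≈ -0.24
print(round(sigmoid(z), 2))  # ≈ 0.44, so the prediction is ŷ = 0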

SLIDE 20

Multiclass Logistic Regression

We now generalize logistic regression to the case where the response variable Y can take on K distinct nominal categorical values, called classes, i.e., Y ∈ {c1, c2, …, cK}. We model Y as a K-dimensional multivariate Bernoulli random variable. Since Y can assume only one of the K values, we use the one-hot encoding approach to map each categorical value ci to the K-dimensional binary vector

ei = (0, …, 0, 1, 0, …, 0)ᵀ

whose ith element eii = 1, and all other elements eij = 0, so that ∑_{j=1}^K eij = 1. The probability mass function for Y given X̃ = x̃ is

P(Y = ei | X̃ = x̃) = πi(x̃), for i = 1, 2, …, K

where πi(x̃) is the (unknown) probability of observing class ci given X̃ = x̃. Thus, there are K unknown parameters, which must satisfy the following constraint:

∑_{i=1}^K πi(x̃) = ∑_{i=1}^K P(Y = ei | X̃ = x̃) = 1
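A minimal sketch of this one-hot encoding; the function name is our own:

def one_hot(c, classes):
    """Map a categorical value c to the binary vector e_i with a single 1."""
    e = [0] * len(classes)
    e[classes.index(c)] = 1
    return e

print(one_hot("c2", ["c1", "c2", "c3"]))  # [0, 1, 0]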

SLIDE 21

Multiclass Logistic Regression

Given that only one element of Y is 1, the probability mass function of Y can be written compactly as

P(Y | X̃ = x̃) = ∏_{j=1}^K ( πj(x̃) )^{Yj}    (13)

Note that if Y = ei, only Yi = 1 and the rest of the elements Yj = 0 for j ≠ i.

In multiclass logistic regression, we select one of the values, say cK, as a reference or base class, and consider the log-odds ratio of the other classes with respect to cK; we assume that each of these log-odds ratios is linear in X̃, but with a different augmented weight vector ω̃i for class ci. That is, the log-odds ratio of class ci with respect to class cK is assumed to satisfy

ln( odds(Y = ei | X̃ = x̃) ) = ln( P(Y = ei | X̃ = x̃) / P(Y = eK | X̃ = x̃) ) = ln( πi(x̃) / πK(x̃) ) = ω̃iᵀx̃ = ωi0·x0 + ωi1·x1 + ⋯ + ωid·xd

where ωi0 = βi is the true bias value for class ci.

SLIDE 22

Multiclass Logistic Regression

Setting ω̃K = 0, we have exp{ω̃Kᵀx̃} = 1, and thus we can write the full model for multiclass logistic regression as follows:

πi(x̃) = exp{ω̃iᵀx̃} / ∑_{j=1}^K exp{ω̃jᵀx̃}, for all i = 1, 2, …, K    (14)

This function is also called the softmax function. When K = 2, this formulation yields exactly the same model as in binary logistic regression.
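A minimal sketch of the softmax in (14); subtracting the maximum score before exponentiating is a standard numerical-stability trick and is our own addition:

import math

def softmax(scores):
    """π_i = exp(z_i) / sum_j exp(z_j), computed stably by shifting by max(z)."""
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    s = sum(exps)
    return [e / s for e in exps]

print(softmax([2.0, 1.0, 0.0]))  # probabilities summing to 1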

SLIDE 23

Maximum Likelihood Estimation

To find the K sets of regression weight vectors w̃i, for i = 1, 2, …, K, we use the gradient ascent approach to maximize the log-likelihood function. The likelihood of the data is given as

L(W̃) = P(Y | W̃) = ∏_{i=1}^n P(yi | X̃ = x̃i) = ∏_{i=1}^n ∏_{j=1}^K ( πj(x̃i) )^{yij}

where W̃ = {w̃1, w̃2, …, w̃K} is the set of K weight vectors. The log-likelihood is then given as:

ln(L(W̃)) = ∑_{i=1}^n ∑_{j=1}^K yij · ln(πj(x̃i)) = ∑_{i=1}^n ∑_{j=1}^K yij · ln( exp{w̃jᵀx̃i} / ∑_{a=1}^K exp{w̃aᵀx̃i} )    (15)

Note that the negative of the log-likelihood function can be regarded as an error function, commonly known as cross-entropy error. For stochastic gradient ascent, we update the weight vectors by considering only one point at a time.

SLIDE 24

Maximum Likelihood Estimation

The gradient of the log-likelihood function with respect to w̃j at a given point x̃i is given as

∇(w̃j, x̃i) = ( yij − πj(x̃i) ) · x̃i    (16)

which results in the following update rule for the jth weight vector:

w̃jᵗ⁺¹ = w̃jᵗ + η · ∇(w̃jᵗ, x̃i)    (17)

where w̃jᵗ denotes the estimate of w̃j at step t, and η is the learning rate. Once the model has been trained, we can predict the class for any new augmented test point z̃ as follows:

ŷ = arg max_{ci} { πi(z̃) } = arg max_{ci} { exp{w̃iᵀz̃} / ∑_{j=1}^K exp{w̃jᵀz̃} }    (18)

That is, we evaluate the softmax function, and then predict the class of z̃ as the one with the highest probability.
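A sketch of the prediction rule (18), reusing the softmax() sketch above; the function name is our own:

def predict_class(W, z, classes):
    """Pick the class with the highest softmax probability.

    W: list of K augmented weight vectors; z: augmented test point.
    """
    scores = [sum(wj * zj for wj, zj in zip(w, z)) for w in W]
    probs = softmax(scores)
    return classes[max(range(len(probs)), key=probs.__getitem__)]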

SLIDE 25

Multiclass Logistic Regression Algorithm

LogisticRegression-MultiClass (D, η, ε):

foreach (xiᵀ, yi) ∈ D do
    x̃i ← (1, xiᵀ)ᵀ    // map to R^{d+1}
    yi ← ej if yi = cj    // map yi to a K-dimensional Bernoulli vector
t ← 0    // step/iteration counter
foreach j = 1, 2, …, K do w̃jᵗ ← (0, …, 0)ᵀ ∈ R^{d+1}
repeat
    foreach j = 1, 2, …, K − 1 do w̃j ← w̃jᵗ    // make a copy of w̃jᵗ
    foreach x̃i ∈ D̃ in random order do
        foreach j = 1, 2, …, K − 1 do
            πj(x̃i) ← exp{w̃jᵀx̃i} / ∑_{a=1}^K exp{w̃aᵀx̃i}
            ∇(w̃j, x̃i) ← (yij − πj(x̃i)) · x̃i    // gradient at w̃j
            w̃j ← w̃j + η · ∇(w̃j, x̃i)    // update estimate for w̃j
    foreach j = 1, 2, …, K − 1 do w̃jᵗ⁺¹ ← w̃j    // update w̃jᵗ⁺¹
    t ← t + 1
until ∑_{j=1}^{K−1} ‖w̃jᵗ − w̃jᵗ⁻¹‖ ≤ ε
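A runnable Python sketch of this multiclass algorithm, reusing one_hot() and softmax() from above; all other names and default parameters are our own:

import random

def logistic_regression_multiclass(D, labels, classes, eta=0.01, eps=1e-6, max_epochs=1000):
    """Multiclass logistic regression via stochastic gradient ascent.

    D: raw points; labels: class values drawn from `classes`; the last
    class is the base class, whose weight vector stays fixed at zero.
    """
    X = [[1.0] + list(x) for x in D]           # augment with the bias input
    Y = [one_hot(c, classes) for c in labels]  # K-dim Bernoulli vectors
    K, d = len(classes), len(X[0])
    W = [[0.0] * d for _ in range(K)]          # w_K stays all-zero
    for _ in range(max_epochs):
        W_prev = [list(w) for w in W]
        order = list(range(len(X)))
        random.shuffle(order)                  # visit the points in random order
        for i in order:
            scores = [sum(wj * xj for wj, xj in zip(w, X[i])) for w in W]
            probs = softmax(scores)            # π_j(x_i) for all j
            for j in range(K - 1):             # base class is not updated
                r = Y[i][j] - probs[j]         # y_ij - π_j(x_i)
                W[j] = [wj + eta * r * xj for wj, xj in zip(W[j], X[i])]
        delta = sum(
            sum((a - b) ** 2 for a, b in zip(W[j], W_prev[j])) ** 0.5
            for j in range(K - 1)
        )
        if delta <= eps:
            break
    return W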

SLIDE 26

Multiclass logistic regression

[Figure: Iris principal components data with the multiclass decision surfaces π1(x̃), π2(x̃), π3(x̃) over axes X1, X2, and Y. Misclassified points are shown in dark gray. All the points actually lie in the (X1, X2) plane, but c1 and c2 are shown displaced along Y with respect to the base class c3 purely for illustration purposes.]

SLIDE 27

Multiclass logistic regression

Example

Consider the Iris dataset, with n = 150 points in a 2D space spanned by the first two principal components, as shown in the figure. Here, the response variable takes on three values: Y = c1 corresponds to Iris-setosa (shown as squares), Y = c2 corresponds to Iris-versicolor (circles), and Y = c3 corresponds to Iris-virginica (triangles). Thus, we map Y = c1 to e1 = (1, 0, 0)ᵀ, Y = c2 to e2 = (0, 1, 0)ᵀ, and Y = c3 to e3 = (0, 0, 1)ᵀ. The multiclass logistic model uses Y = c3 (Iris-virginica; triangles) as the reference or base class. The fitted model is given as:

w̃1 = (−3.52, 3.62, 2.61)ᵀ
w̃2 = (−6.95, −5.18, −3.40)ᵀ
w̃3 = (0, 0, 0)ᵀ

SLIDE 28

Multiclass logistic regression

Example

The figure plots the decision surfaces corresponding to the softmax functions:

π1(x̃) = exp{w̃1ᵀx̃} / (1 + exp{w̃1ᵀx̃} + exp{w̃2ᵀx̃})
π2(x̃) = exp{w̃2ᵀx̃} / (1 + exp{w̃1ᵀx̃} + exp{w̃2ᵀx̃})
π3(x̃) = 1 / (1 + exp{w̃1ᵀx̃} + exp{w̃2ᵀx̃})

The surfaces indicate regions where one class dominates the others. It is important to note that the points for c1 and c2 are shown displaced along Y to emphasize the contrast with c3, which is the reference class. Overall, the training-set accuracy for the multiclass logistic classifier is 96.7%, since it misclassifies only five points (shown in dark gray). For example, for the point x̃ = (1, −0.52, −1.19)ᵀ, we have:

π1(x̃) ≈ 0    π2(x̃) = 0.448    π3(x̃) = 0.552

Thus, the predicted class is ŷ = arg max_{ci} {πi(x̃)} = c3, whereas the true class is y = c2.
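Checking this example point against the fitted multiclass weights, reusing the softmax() sketch above:

W = [[-3.52, 3.62, 2.61],    # w̃1 (Iris-setosa vs. base)
     [-6.95, -5.18, -3.40],  # w̃2 (Iris-versicolor vs. base)
     [0.0, 0.0, 0.0]]        # w̃3, the base class
x = [1.0, -0.52, -1.19]      # augmented example point
scores = [sum(wj * xj for wj, xj in zip(w, x)) for w in W]
print([round(p, 3) for p in softmax(scores)])  # ≈ [0.0, 0.448, 0.552]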

SLIDE 29

Data Mining and Machine Learning: Fundamental Concepts and Algorithms

dataminingbook.info

Mohammed J. Zaki¹  Wagner Meira Jr.²

¹Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA

²Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 24: Logistic Regression
