SLIDE 1

Machine Learning

Lecture 03: Logistic Regression and Gradient Descent

Nevin L. Zhang (lzhang@cse.ust.hk)
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology

This set of notes is based on internet resources and:

- K. P. Murphy (2012). Machine Learning: A Probabilistic Perspective. MIT Press. (Chapter 8)
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. www.deeplearningbook.org. (Chapter 4)
- Andrew Ng. Lecture Notes on Machine Learning. Stanford.
- Hal Daume III. A Course in Machine Learning. http://ciml.info/

SLIDE 2

Logistic Regression

Outline

1. Logistic Regression
2. Gradient Descent
3. Gradient Descent for Logistic Regression
4. Newton's Method
5. Softmax Regression
6. Optimization Approach to Classification

SLIDE 3

Logistic Regression

Recap of Probabilistic Linear Regression

Training set: D = {(xi, yi)}_{i=1}^N, where yi ∈ ℝ.

Probabilistic model: p(y|x, θ) = N(y|µ(x), σ²) = N(y|w⊤x, σ²)

Learning: determine w by minimizing the cross entropy loss:

J(w) = −(1/N) Σ_{i=1}^N log p(yi|xi, w)

Point estimation of y: ŷ = µ(x) = w⊤x

SLIDE 4

Logistic Regression

Logistic Regression (for Classification)

Training set: D = {(xi, yi)}_{i=1}^N, where yi ∈ {0, 1}.

Probabilistic model: p(y|x, w) = Ber(y|σ(w⊤x))

σ(z) is the sigmoid (logistic) function; its inverse is the logit function:

σ(z) = 1/(1 + exp(−z)) = e^z/(e^z + 1)

It maps the real line ℝ into (0, 1). It is not to be confused with the variance σ² in the Gaussian distribution.
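As a quick illustration, here is a minimal NumPy sketch of σ(z); the clipping bound of 500 is an arbitrary guard against overflow in exp:

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) into (0, 1)."""
    z = np.clip(z, -500, 500)          # guard against overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```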

SLIDE 5

Logistic Regression

Logistic Regression

The model:

p(y|x, w) = Ber(y|σ(w⊤x))
p(y = 1|x, w) = σ(w⊤x) = 1/(1 + exp(−w⊤x))
p(y = 0|x, w) = 1 − σ(w⊤x) = exp(−w⊤x)/(1 + exp(−w⊤x))

Consider the logit of p(y = 1|x, w):

log [p(y = 1|x, w)/(1 − p(y = 1|x, w))] = log [p(y = 1|x, w)/p(y = 0|x, w)] = log exp(w⊤x) = w⊤x.

So, a linear model for the logit; hence the name logistic regression.

SLIDE 6

Logistic Regression

Logistic Regression: Decision Rule

To classify instances, we obtain a point estimate of y:

ŷ = arg max_y p(y|x, w)

In other words, the decision/classification rule is: ŷ = 1 iff p(y = 1|x, w) > 0.5

This is called the optimal Bayes classifier: suppose the same situation occurs many times. You will make some mistakes no matter which decision rule you use, but the probability of a mistake is minimized by the rule above.

SLIDE 7

Logistic Regression

Logistic Regression is a Linear Classifier

In fact, the decision/classification rule of logistic regression is equivalent to:

ŷ = 1 iff w⊤x > 0

So it is a linear classifier with a linear decision boundary. Example: whether a student is admitted based on the results of two exams:
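Continuing the admission example, here is a minimal sketch of the rule ŷ = 1 iff w⊤x > 0; the weights and exam scores are made-up values for illustration, and x0 = 1 is the bias feature:

```python
import numpy as np

def predict(w, X):
    """Return 1 where w^T x > 0, else 0 (rows of X are instances)."""
    return (X @ w > 0).astype(int)

# Hypothetical weights and two applicants' exam scores (x0 = 1 is the bias feature)
w = np.array([-10.0, 0.1, 0.1])
X = np.array([[1.0, 30.0, 40.0],   # w^T x = -3 -> predict 0 (reject)
              [1.0, 60.0, 70.0]])  # w^T x =  3 -> predict 1 (admit)
print(predict(w, X))  # [0 1]
```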

SLIDE 8

Logistic Regression

Logistic Regression: Example

Solid black dots are the data: those at the bottom are the SAT scores of applicants rejected by a university, and those at the top are the SAT scores of applicants it accepted. The red circles are the predicted probabilities that the applicants would be accepted.

SLIDE 9

Logistic Regression

Logistic Regression: 2D Example

The decision boundary p(y = 1|x, w) = 0.5 is a straight line in the feature space.

SLIDE 10

Logistic Regression

Parameter Learning

We would like to find the MLE of w, i.e., the values of w that minimize the cross entropy loss:

J(w) = −(1/N) Σ_{i=1}^N log P(yi|xi, w)

Consider a general training example (x, y). Because y is binary, we have

log P(y|x, w) = y log ŷ + (1 − y) log(1 − ŷ)   (where ŷ = P(y = 1|x, w))
             = y log σ(w⊤x) + (1 − y) log(1 − σ(w⊤x)).

Hence,

J(w) = −(1/N) Σ_{i=1}^N [yi log σ(w⊤xi) + (1 − yi) log(1 − σ(w⊤xi))]

Unlike linear regression, we can no longer write down the MLE in closed form. Instead, we need to use optimization algorithms to compute it:

- Gradient descent
- Newton's method
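For concreteness, here is a minimal NumPy sketch of this loss; eps is an arbitrary guard against taking log(0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def cross_entropy_loss(w, X, y, eps=1e-12):
    """J(w) = -(1/N) sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    p = sigmoid(X @ w)               # p_i = sigma(w^T x_i)
    p = np.clip(p, eps, 1 - eps)     # keep the logarithms finite
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```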

SLIDE 11

Gradient Descent

Outline

1. Logistic Regression
2. Gradient Descent
3. Gradient Descent for Logistic Regression
4. Newton's Method
5. Softmax Regression
6. Optimization Approach to Classification

SLIDE 12

Gradient Descent

Gradient Descent

Consider a function y = J(w) of a scalar variable w. The derivative of J(w) is defined as:

J′(w) = dJ(w)/dw = lim_{ε→0} [J(w + ε) − J(w)]/ε

When ε is small, we have

J(w + ε) ≈ J(w) + εJ′(w)

This equation tells us how to reduce J(w) by changing w in small steps:

- If J′(w) > 0, make ε negative, i.e., decrease w.
- If J′(w) < 0, make ε positive, i.e., increase w.

In other words, move in the opposite direction of the derivative (gradient).

SLIDE 13

Gradient Descent

Gradient Descent

To implement the idea, we update w as follows:

w ← w − αJ′(w)

The term −J′(w) means that we move in the opposite direction of the derivative, and α determines how much we move in that direction. α is called the step size in optimization and the learning rate in machine learning.
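A minimal sketch of this update on a toy objective; the function J(w) = (w − 3)², the initial value, and the learning rate are all arbitrary choices for illustration:

```python
# Minimize J(w) = (w - 3)^2, whose derivative is J'(w) = 2(w - 3).
def J_prime(w):
    return 2.0 * (w - 3.0)

w, alpha = 0.0, 0.1                 # arbitrary initial value and learning rate
for _ in range(100):
    w = w - alpha * J_prime(w)      # move against the derivative
print(w)                            # close to the minimizer w* = 3
```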

SLIDE 14

Gradient Descent

Gradient Descent

Consider a function y = J(w), where w = (w0, w1, …, wD)⊤. The gradient of J at w is defined as

∇J = (∂J/∂w0, ∂J/∂w1, …, ∂J/∂wD)⊤

The gradient is the direction along which J increases the fastest. If we want to reduce J as fast as possible, we move in the opposite direction of the gradient.

SLIDE 15

Gradient Descent

Gradient Descent

The method of steepest descent (gradient descent) for minimizing J(w):

1. Initialize w.
2. Repeat until convergence:

   w ← w − α∇J(w)

The learning rate α usually changes from iteration to iteration.

SLIDE 16

Gradient Descent

Choice of Learning Rate

A constant learning rate is difficult to set:

- Too small, and convergence will be very slow.
- Too large, and the method can fail to converge at all.

It is better to vary the learning rate. We will discuss this more later.

SLIDE 17

Gradient Descent

Gradient Descent

Gradient descent can get stuck at local minima or saddle points. Nonetheless, it usually works well in practice.

SLIDE 18

Gradient Descent for Logistic Regression

Outline

1. Logistic Regression
2. Gradient Descent
3. Gradient Descent for Logistic Regression
4. Newton's Method
5. Softmax Regression
6. Optimization Approach to Classification

SLIDE 19

Gradient Descent for Logistic Regression

Derivative of σ(z)

To apply gradient descent to logistic regression, we need to compute the partial derivative of J(w) w.r.t. each weight wj. Before doing that, first consider the derivative of the sigmoid function:

σ′(z) = dσ(z)/dz = (d/dz)[1/(1 + e^(−z))]
      = −[1/(1 + e^(−z))²] · d(1 + e^(−z))/dz
      = [1/(1 + e^(−z))²] · e^(−z)
      = [1/(1 + e^(−z))] · [1 − 1/(1 + e^(−z))]
      = σ(z)(1 − σ(z))

SLIDE 20

Gradient Descent for Logistic Regression

Derivative of log P(y|x, w)

Let z = w⊤x, with x = [x0, x1, …, xD]⊤ and w = [w0, w1, …, wD]⊤. By the chain rule,

∂σ(z)/∂wj = (dσ(z)/dz)(∂z/∂wj) = σ(z)(1 − σ(z))xj

Hence,

∂/∂wj log P(y|x, w) = ∂/∂wj [y log σ(z) + (1 − y) log(1 − σ(z))]
 = [y/σ(z)] ∂σ(z)/∂wj − [(1 − y)/(1 − σ(z))] ∂σ(z)/∂wj
 = [y(1 − σ(z)) − (1 − y)σ(z)] xj
 = [y − σ(z)] xj.

SLIDE 21

Gradient Descent for Logistic Regression

Derivative of the Cross Entropy Loss

Write the i-th training example as xi = [xi,0, xi,1, …, xi,D]⊤ and let zi = w⊤xi. Then

∂J(w)/∂wj = −(1/N) Σ_{i=1}^N ∂/∂wj log P(yi|xi, w) = −(1/N) Σ_{i=1}^N [yi − σ(zi)] xi,j

SLIDE 22

Gradient Descent for Logistic Regression

Batch Gradient Descent

The batch gradient descent algorithm for logistic regression:

Repeat until convergence:

 wj ← wj + α (1/N) Σ_{i=1}^N [yi − σ(w⊤xi)] xi,j

Interpretation (assuming the xi are positive vectors):

- If the predicted value σ(w⊤xi) is smaller than the actual value yi, there is reason to increase wj. The increment is proportional to xi,j.
- If the predicted value σ(w⊤xi) is larger than the actual value yi, there is reason to decrease wj. The decrement is proportional to xi,j.
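A minimal NumPy sketch of this algorithm; the learning rate, iteration count, and zero initialization are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def batch_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.

    X: (N, D+1) design matrix (first column all ones for the bias),
    y: (N,) labels in {0, 1}.
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        error = y - sigmoid(X @ w)        # y_i - sigma(w^T x_i), shape (N,)
        w += alpha * (X.T @ error) / N    # update every w_j at once
    return w
```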

SLIDE 23

Gradient Descent for Logistic Regression

Stochastic Gradient Descent

Batch gradient descent is costly when N is large. In that case, we usually use stochastic gradient descent:

Repeat until convergence:

 Randomly choose B ⊂ {1, 2, …, N}
 wj ← wj + α (1/|B|) Σ_{i∈B} [yi − σ(w⊤xi)] xi,j

The randomly picked subset B is called a minibatch. Its size is called the batch size.
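A sketch of one such update; the batch size of 32 is an arbitrary choice, and sampling without replacement assumes batch_size ≤ N:

```python
import numpy as np

def sgd_step(w, X, y, alpha=0.1, batch_size=32, rng=np.random.default_rng(0)):
    """One stochastic gradient descent step on a random minibatch B."""
    idx = rng.choice(len(y), size=batch_size, replace=False)   # pick B
    Xb, yb = X[idx], y[idx]
    error = yb - 1.0 / (1.0 + np.exp(-(Xb @ w)))               # y_i - sigma(w^T x_i)
    return w + alpha * (Xb.T @ error) / batch_size
```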

SLIDE 24

Gradient Descent for Logistic Regression

L2 Regularization

Just as we prefer ridge regression to linear regression, we prefer to add a regularization term to the objective function of logistic regression:

J(w) ← J(w) + (λ/2) ||w||₂²

In this case, we have

∂J(w)/∂wj = −(1/N) Σ_{i=1}^N [yi − σ(zi)] xi,j + λwj

The weight update formula is:

wj ← wj + α[−λwj + (1/|B|) Σ_{i∈B} (yi − σ(w⊤xi)) xi,j]

SLIDE 25

Gradient Descent for Logistic Regression

L2 Regularization

Update rule without regularization:

 wj ← wj + α (1/|B|) Σ_{i∈B} [yi − σ(w⊤xi)] xi,j

Update rule with regularization:

 wj ← (1 − αλ)wj + α (1/|B|) Σ_{i∈B} [yi − σ(w⊤xi)] xi,j

Regularization forces the weights to be smaller: |(1 − αλ)wj| < |wj| since 0 < αλ < 1.
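The only change from the unregularized step is the (1 − αλ) shrink factor, as this minimal sketch shows; λ = 0.01 is a made-up value, and for simplicity the sketch also decays the bias weight, which in practice is often excluded:

```python
import numpy as np

def sgd_step_l2(w, Xb, yb, alpha=0.1, lam=0.01):
    """Minibatch update with L2 regularization (weight decay)."""
    error = yb - 1.0 / (1.0 + np.exp(-(Xb @ w)))
    return (1 - alpha * lam) * w + alpha * (Xb.T @ error) / len(yb)
```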

SLIDE 26

Newton’s Method

Outline

1. Logistic Regression
2. Gradient Descent
3. Gradient Descent for Logistic Regression
4. Newton's Method
5. Softmax Regression
6. Optimization Approach to Classification

SLIDE 27

Newton’s Method

Newton’s Method

Consider again a function J(w) of a scalar variable w. We want to minimize J(w). At a minimum point, the derivative J′(w) = 0. Let f(w) = J′(w). We want to find points where f(w) = 0. Newton's method (aka the Newton-Raphson method) is a commonly used method for this problem:

Start with some initial value w.
Repeat the following update until convergence:

 w ← w − f(w)/f′(w)
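A minimal sketch of this root-finding loop; the tolerance, iteration cap, and example function are arbitrary choices for illustration:

```python
def newton_root(f, f_prime, w0, tol=1e-10, max_iters=100):
    """Find a root of f starting from w0 via w <- w - f(w)/f'(w)."""
    w = w0
    for _ in range(max_iters):
        step = f(w) / f_prime(w)
        w -= step
        if abs(step) < tol:          # stop once the updates become tiny
            break
    return w

# Example: the root of f(w) = w^2 - 2 is sqrt(2)
print(newton_root(lambda w: w**2 - 2, lambda w: 2 * w, w0=1.0))  # ~1.41421356
```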

SLIDE 28

Newton’s Method

Newton’s Method

Interpretation of Newton's method: approximate the function f by the linear function that is tangent to f at the current guess wi:

y = f′(wi)w + f(wi) − f′(wi)wi

Solve for where this linear function equals zero to get the next iterate:

f′(wi)w + f(wi) − f′(wi)wi = 0, which gives wi+1 = wi − f(wi)/f′(wi)

SLIDE 29

Newton’s Method

Newton’s Method

To minimize J(w) using Newton's method:

Start with some initial value w.
Repeat the following update until convergence:

 w ← w − J′(w)/J″(w)

where J″ is the second derivative of J.

SLIDE 30

Newton’s Method

Newton’s Method

Now consider minimizing a function J(w) of a vector w = (w0, w1, …, wD)⊤. The "first derivative" of J(w) is the gradient vector ∇J(w). The "second derivative" of J(w) is the Hessian matrix H = [Hij], where

Hij = ∂²J(w)/∂wi∂wj

To minimize J(w) using Newton's method:

Start with some initial value w.
Repeat the following update until convergence:

 w ← w − H⁻¹∇J(w)
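A minimal sketch of one such update; solving the linear system Hs = ∇J(w) is preferred to forming H⁻¹ explicitly, and grad and hess are assumed to be user-supplied functions:

```python
import numpy as np

def newton_step(w, grad, hess):
    """One Newton update: w <- w - H^{-1} grad J(w).

    grad(w) returns the gradient vector; hess(w) returns the Hessian matrix.
    """
    return w - np.linalg.solve(hess(w), grad(w))
```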

SLIDE 31

Newton’s Method

Newton’s Method: Another Perspective

By Taylor expansion, we have

J(w + ε) ≈ J(w) + J′(w)ε + (1/2)J″(w)ε²

We want dJ(w + ε)/dε = 0, so J′(w) + J″(w)ε = 0. Solving this equation, we get

ε = −J′(w)/J″(w)

SLIDE 32

Newton’s Method

Newton’s Method vs Gradient Descent

In gradient descent, we have only a first-order term in the approximation (gradient):

J(w + ε) ≈ J(w) + εJ′(w)

In Newton's method, we also have a second-order term in the approximation (curvature):

J(w + ε) ≈ J(w) + J′(w)ε + (1/2)J″(w)ε²

SLIDE 33

Newton’s Method

Newton’s Method vs Gradient Descent

Use of curvature allows us to better predict how J(w) changes when we change w:

1. With negative curvature, J actually decreases faster than the gradient predicts.
2. With no curvature, the gradient predicts the decrease correctly.
3. With positive curvature, the function decreases more slowly than expected.

The use of curvature mitigates the problems in cases 1 and 3.

SLIDE 34

Newton’s Method

Newton’s Method vs Gradient Descent

Newton's method (red) typically enjoys faster convergence than (batch) gradient descent (green), and requires many fewer iterations to get very close to the minimum. One iteration of Newton's method can, however, be more expensive than one iteration of gradient descent, since it requires computing and inverting the Hessian matrix.

SLIDE 35

Softmax Regression

Outline

1. Logistic Regression
2. Gradient Descent
3. Gradient Descent for Logistic Regression
4. Newton's Method
5. Softmax Regression
6. Optimization Approach to Classification

SLIDE 36

Softmax Regression

Multi-Class Logistic Regression

Training set: D = {(xi, yi)}_{i=1}^N, where yi ∈ {1, 2, …, C} and C ≥ 2.

Multi-class logistic regression (aka softmax regression) uses the following probability model:

There is a weight vector wc = (wc,0, wc,1, …, wc,D)⊤ for each class c, and W = (w1, …, wC) is the weight matrix. The probability of a particular class c given x is:

p(y = c|x, W) = (1/Z(x, W)) exp(wc⊤x)

where Z(x, W) = Σ_{c′=1}^C exp(wc′⊤x) is the normalization constant that ensures the probabilities of all classes sum to 1. It is called the partition function.

The classification rule is:

ŷ = arg max_y p(y|x, W)
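A minimal sketch of this model; here W stores one weight vector per row, and subtracting the max score is a standard numerical-stability trick that does not change the probabilities:

```python
import numpy as np

def softmax_probs(W, x):
    """p(y = c | x, W) for all classes c; W has one row per class."""
    scores = W @ x                          # w_c^T x for every class c
    scores -= scores.max()                  # stability trick; cancels in the ratio
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()    # divide by the partition function Z

def classify(W, x):
    return np.argmax(softmax_probs(W, x))   # arg max_c p(y = c | x, W)
```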

SLIDE 37

Softmax Regression

Relationship with Logistic Regression

When C = 2, name the two classes {0, 1} instead of {1, 2}:

p(y = 0|x, W) = exp(w0⊤x) / [exp(w0⊤x) + exp(w1⊤x)],  p(y = 1|x, W) = exp(w1⊤x) / [exp(w0⊤x) + exp(w1⊤x)]

Dividing both numerator and denominator by exp(w1⊤x), we get

p(y = 0|x, W) = exp((w0 − w1)⊤x) / [exp((w0 − w1)⊤x) + 1],  p(y = 1|x, W) = 1 / [exp((w0 − w1)⊤x) + 1]

Letting w = w1 − w0, we recover the logistic regression model:

p(y = 0|x, w) = exp(−w⊤x)/(1 + exp(−w⊤x)),  p(y = 1|x, w) = 1/(1 + exp(−w⊤x))

SLIDE 38

Softmax Regression

Cross Entropy Loss

The cross entropy loss of the softmax regression model is:

J(W) = −(1/N) Σ_{i=1}^N log P(yi|xi, W)
     = −(1/N) Σ_{i=1}^N Σ_{c=1}^C 1(yi = c) log P(yi = c|xi, W)
     = −(1/N) Σ_{i=1}^N Σ_{c=1}^C 1(yi = c) log [exp(wc⊤xi) / Σ_{c′=1}^C exp(wc′⊤xi)]
     = −(1/N) Σ_{i=1}^N Σ_{c=1}^C 1(yi = c) [wc⊤xi − log Σ_{c′=1}^C exp(wc′⊤xi)]
     = −(1/N) Σ_{i=1}^N [Σ_{c=1}^C 1(yi = c) wc⊤xi − log Σ_{c′=1}^C exp(wc′⊤xi)]

SLIDE 39

Softmax Regression

Partial Derivative of Loss

We would like to find the MLE of W. To do so, we need to minimize the cross entropy loss. There is no closed-form solution, so we resort to gradient descent. The partial derivative of the cross entropy loss w.r.t. each wc,j is:

∂J(W)/∂wc,j = −(1/N) Σ_{i=1}^N [∂/∂wc,j Σ_{c″=1}^C 1(yi = c″) wc″⊤xi − ∂/∂wc,j log Σ_{c′=1}^C exp(wc′⊤xi)]
 = −(1/N) Σ_{i=1}^N [1(yi = c) xi,j − (exp(wc⊤xi) / Σ_{c′=1}^C exp(wc′⊤xi)) xi,j]
 = −(1/N) Σ_{i=1}^N [1(yi = c) − p(yi = c|xi, W)] xi,j

Compare this to the gradient of logistic regression:

∂J(w)/∂wj = −(1/N) Σ_{i=1}^N [yi − σ(w⊤xi)] xi,j

SLIDE 40

Softmax Regression

Gradient of Loss

The gradient of the cross entropy loss w.r.t. each wc is:

∇wc J(W) = (∂J(W)/∂wc,0, ∂J(W)/∂wc,1, …, ∂J(W)/∂wc,D)⊤ = −(1/N) Σ_{i=1}^N [1(yi = c) − p(yi = c|xi, W)] xi

Update rule in gradient descent:

wc ← wc + α (1/N) Σ_{i=1}^N [1(yi = c) − p(yi = c|xi, W)] xi
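A minimal sketch of this update for all classes at once; the one-hot matrix Y encodes 1(yi = c), and the layout convention (rows of W are classes) is the same assumption as in the earlier sketches:

```python
import numpy as np

def softmax_gd_step(W, X, y, alpha=0.1):
    """One gradient descent step for softmax regression.

    W: (C, D) weight matrix, X: (N, D) inputs, y: (N,) class indices 0..C-1.
    """
    N, C = len(y), W.shape[0]
    scores = X @ W.T
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)             # P[i, c] = p(y_i = c | x_i, W)
    Y = np.zeros((N, C))
    Y[np.arange(N), y] = 1.0                      # one-hot encoding of 1(y_i = c)
    return W + alpha * ((Y - P).T @ X) / N        # w_c += alpha * mean_i [1(y_i = c) - p] x_i
```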

SLIDE 41

Optimization Approach to Classification

Outline

1. Logistic Regression
2. Gradient Descent
3. Gradient Descent for Logistic Regression
4. Newton's Method
5. Softmax Regression
6. Optimization Approach to Classification

SLIDE 42

Optimization Approach to Classification

The Probabilistic Approach to Classification

So far, we have been talking about the probabilistic approach to classification:

Start with a training set D = {(xi, yi)}_{i=1}^N, where yi ∈ {0, 1}.

Learn a probabilistic model p(y|x, w) = Ber(y|σ(w⊤x)) by minimizing the cross entropy loss:

J(w) = −(1/N) Σ_{i=1}^N log P(yi|xi, w)

Classify new instances using: ŷ = 1 iff p(y = 1|x, w) > 0.5

SLIDE 43

Optimization Approach to Classification

The Optimization Approach

Next, we will briefly talk about the optimization approach to classification:

Start with a training set D = {(xi, yi)}_{i=1}^N, where yi ∈ {−1, 1}.

Learn a linear classifier

ŷ = sign(w⊤x + b)

where x = (x1, …, xD)⊤ and w = (w1, …, wD)⊤. We drop the convention x0 = 1 and use b in place of w0.

SLIDE 44

Optimization Approach to Classification

The Sign Function

sign(x) = 1 if x > 0; sign(x) = 0 if x = 0; sign(x) = −1 if x < 0.

SLIDE 45

Optimization Approach to Classification

The Optimization Approach

We determine w and b by minimizing the training/empirical error

L(w, b) = (1/N) Σ_{i=1}^N L0/1(yi, ŷi)

where L0/1(yi, ŷi) = 1(yi ≠ ŷi) is called the zero/one loss function.

Let z = w⊤x + b. It is easy to see that

L0/1(y, ŷ) = 1(y(w⊤x + b) ≤ 0) = 1(yz ≤ 0)

Overloading the notation, we also write this as L0/1(y, z). So the loss function of linear classifiers is often written as

L(w, b) = (1/N) Σ_{i=1}^N 1(yi(w⊤xi + b) ≤ 0)

SLIDE 46

Optimization Approach to Classification

A Problem with the Zero/One Loss Function

The zero/one loss function, as a function of w and b, has zero gradient almost everywhere and is not convex. Hence it is difficult to optimize. (In the figure, the x-axis is yz.)

SLIDE 47

Optimization Approach to Classification

Convex Function

A real-valued function f is convex if the line segment between any two points on the graph of the function lies above or on the graph.

For any two points x1 < x2 and any t ∈ [0, 1],

f(tx1 + (1 − t)x2) ≤ tf(x1) + (1 − t)f(x2)

f is strictly convex if, for t ∈ (0, 1), equality holds only when x1 = x2.

SLIDE 48

Optimization Approach to Classification

Surrogate Loss Functions

Convex functions are easy to minimize. Intuitively, if you drop a ball anywhere on the graph of a convex function, it will eventually roll to the minimum. This is not true for non-convex functions. Since the zero/one loss is hard to optimize, we want to approximate it by a convex function and optimize that function instead. The approximating function is called a surrogate loss. A surrogate loss needs to be an upper bound on the true loss function: this guarantees that if you minimize the surrogate loss, you are also pushing down the real loss.

SLIDE 49

Optimization Approach to Classification

Logistic Loss

One common surrogate loss function is the logistic loss:

Llog(y, z) = (1/log 2) log(1 + exp(−yz))
           = (1/log 2) log(1 + exp(−(w⊤x + b)))  if y = 1
             (1/log 2) log(1 + exp(w⊤x + b))     if y = −1

On the other hand, each term in the NLL of logistic regression has the following form:

−log P(y|x, w) = −[y log σ(w⊤x) + (1 − y) log(1 − σ(w⊤x))]
               = −log[1/(1 + exp(−w⊤x))]        if y = 1
                 −log(1 − 1/(1 + exp(−w⊤x)))    if y = 0
               = log(1 + exp(−w⊤x))             if y = 1
                 log(1 + exp(w⊤x))              if y = 0

So, up to the constant factor 1/log 2, logistic regression is the same as a linear classifier with the logistic surrogate loss.

SLIDE 50

Optimization Approach to Classification

Other Surrogate Loss Functions

- Zero/one loss: L0/1(y, z) = 1(yz ≤ 0)
- Logistic loss: Llog(y, z) = (1/log 2) log(1 + exp(−yz))
- Hinge loss: Lhin(y, z) = max{0, 1 − yz}
- Exponential loss: Lexp(y, z) = exp(−yz)
- Squared loss: Lsqr(y, z) = (y − z)²
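For reference, here is a minimal NumPy sketch of these losses as functions of the label y ∈ {−1, +1} and the score z = w⊤x + b; the sample points are arbitrary:

```python
import numpy as np

# Each loss takes the label y in {-1, +1} and the score z = w^T x + b.
zero_one    = lambda y, z: np.where(y * z <= 0, 1.0, 0.0)
logistic    = lambda y, z: np.log1p(np.exp(-y * z)) / np.log(2)
hinge       = lambda y, z: np.maximum(0.0, 1.0 - y * z)
exponential = lambda y, z: np.exp(-y * z)
squared     = lambda y, z: (y - z) ** 2

z = np.linspace(-2, 2, 5)
print(logistic(1, z))   # upper-bounds zero_one(1, z) everywhere
```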

SLIDE 51

Optimization Approach to Classification

Loss and Error

When a surrogate loss function is used, there are two ways to measure how a classifier performs on the training set: the training loss and the training error. Likewise, there are two ways to measure how it performs on the test set: the test loss and the test error. As in the case of regression, the test loss/error can be improved using regularization.
