SLIDE 1

Logistic regression

CS 446

SLIDE 2

1. Linear classifiers

SLIDE 3

Linear regression

Last two lectures, we studied linear regression; the output/label space Y was R.

[Figure: scatter plot of delay vs. duration from the earlier regression example.]

SLIDE 4

Linear classification

Today, the goal is a linear classifier; the output/label space Y is discrete.


SLIDE 5

Notation

For now, let’s consider binary classification: Y = {−1, +1}. A linear predictor w ∈ R^d classifies according to sign(w^T x) ∈ {−1, +1}. Given examples ((x_i, y_i))_{i=1}^n and a predictor w ∈ R^d, we want sign(w^T x_i) and y_i to agree.
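
As a quick illustration (a minimal sketch with made-up numbers, not from the slides), the predictions sign(w^T x_i) and their agreement with y can be computed for a whole dataset at once:

import torch

# Hypothetical data: four examples in R^2 with labels in {-1, +1}.
X = torch.tensor([[ 1.0,  2.0],
                  [-1.0,  0.5],
                  [ 2.0, -1.0],
                  [-2.0, -3.0]])
y = torch.tensor([1.0, -1.0, 1.0, -1.0])
w = torch.tensor([1.0, 0.5])               # some predictor w in R^d

preds = torch.sign(X @ w)                  # sign(w^T x_i) for every example
print(preds, (preds == y).float().mean())  # predictions and the fraction of agreement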

SLIDE 6

Geometry of linear classifiers

[Figure: a hyperplane H with normal vector w, drawn in the (x1, x2) plane.]

A hyperplane in R^d is a linear subspace of dimension d−1.

◮ A hyperplane in R^2 is a line.
◮ A hyperplane in R^3 is a plane.
◮ As a linear subspace, a hyperplane always contains the origin.

A hyperplane H can be specified by a (non-zero) normal vector w ∈ R^d. The hyperplane with normal vector w is the set of points orthogonal to w:

H = {x ∈ R^d : x^T w = 0}.

Given w and its corresponding H: H splits R^d into the points labeled positive, {x : w^T x > 0}, and the points labeled negative, {x : w^T x < 0}.
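
As a concrete (made-up) instance: for w = (1, 2) in R^2, H = {x : x_1 + 2x_2 = 0} is the line x_1 = −2x_2; the point (−2, 1) lies on H, while (1, 1) has w^T x = 3 > 0 and so falls on the positive side.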

SLIDES 7-10

Classification with a hyperplane

[Figure: a point x, the hyperplane H, its normal vector w, the line span{w}, and the angle θ between x and w.]

The projection of x onto span{w} (a line) has coordinate ‖x‖₂ · cos(θ), where cos θ = x^T w / (‖w‖₂ ‖x‖₂). (The distance from x to the hyperplane is ‖x‖₂ · |cos(θ)|.)

The decision boundary is the hyperplane (oriented by w):

x^T w > 0  ⟺  ‖x‖₂ · cos(θ) > 0  ⟺  x is on the same side of H as w.

What should we do if we want a hyperplane decision boundary that doesn’t (necessarily) go through the origin?
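
A minimal sketch of these quantities (w and x below are made up):

import torch

w = torch.tensor([3.0, 4.0])            # normal vector, with ‖w‖₂ = 5
x = torch.tensor([1.0, 2.0])

cos_theta = (x @ w) / (w.norm() * x.norm())
coord = x.norm() * cos_theta            # signed coordinate of x along span{w}, i.e. x^T w / ‖w‖₂
dist = x.norm() * cos_theta.abs()       # distance from x to the hyperplane H
print(coord.item(), dist.item(), torch.sign(x @ w).item())   # 2.2, 2.2, +1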

SLIDE 11

Linear separability

Is it always possible to find w with sign(w^T x_i) = y_i for all i? Is it always possible to find a hyperplane separating the data? (Appending a constant 1 feature means the hyperplane need not go through the origin.)

[Figure: two datasets. Left: linearly separable. Right: not linearly separable.]
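
A rough sketch of that trick (the tensors here are illustrative): appending a constant 1 makes the last coordinate of w act as a bias, so the decision boundary need not pass through the origin.

import torch

X = torch.tensor([[0.3, 0.8],
                  [1.2, 0.1]])                              # hypothetical 2-feature data
X_aug = torch.cat([X, torch.ones(X.shape[0], 1)], dim=1)    # each row becomes (x_1, x_2, 1)

w = torch.tensor([2.0, -1.0, 0.5])                          # last entry plays the role of a bias b
print(torch.sign(X_aug @ w))                                # sign(w_1 x_1 + w_2 x_2 + b)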

SLIDES 12-13

Decision boundary with quadratic feature expansion

[Figure: two panels, one showing an elliptical decision boundary and one showing a hyperbolic decision boundary.]

The same feature expansions we saw for linear regression models can also be used here to “upgrade” linear classifiers.
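
A minimal sketch of a quadratic feature map in R^2 (the helper name quadratic_features is mine, not from the course):

import torch

def quadratic_features(X):
    # map each row (x1, x2) to (x1, x2, x1^2, x2^2, x1*x2, 1)
    x1, x2 = X[:, 0], X[:, 1]
    return torch.stack([x1, x2, x1**2, x2**2, x1 * x2, torch.ones_like(x1)], dim=1)

# A linear classifier sign(w^T phi(x)) in this feature space can carve out
# elliptical or hyperbolic boundaries in the original plane.
X = torch.tensor([[0.5, -1.0], [2.0, 1.0]])
w = torch.tensor([0.0, 0.0, 1.0, 1.0, 0.0, -1.5])   # x1^2 + x2^2 - 1.5: a circular boundary
print(torch.sign(quadratic_features(X) @ w))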

SLIDES 14-15

Finding linear classifiers with ERM

Why not feed our goal into an optimization package, in the form

arg min_{w ∈ R^d} (1/n) ∑_{i=1}^n 1[sign(w^T x_i) ≠ y_i] ?

◮ Discrete/combinatorial search; often NP-hard.

SLIDES 16-17

Relaxing the ERM problem

Let’s remove one source of discreteness:

(1/n) ∑_{i=1}^n 1[sign(w^T x_i) ≠ y_i]   →   (1/n) ∑_{i=1}^n 1[y_i (w^T x_i) ≤ 0].

Did we lose something in this process? Should it be “>” or “≥”?

y_i (w^T x_i) is the (unnormalized) margin of w on example i; we have written this problem with a margin loss:

R_zo(w) = (1/n) ∑_{i=1}^n ℓ_zo(y_i w^T x_i),

where ℓ_zo(z) = 1[z ≤ 0]. (The remainder of the lecture will use single-parameter margin losses.)
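
A minimal sketch of R_zo in this margin form (the helper name and the toy tensors are mine):

import torch

def zero_one_risk(X, y, w):
    # R_zo(w) = (1/n) sum_i 1[y_i (w^T x_i) <= 0]
    margins = y * (X @ w)                # unnormalized margins y_i w^T x_i
    return (margins <= 0).float().mean()

X = torch.tensor([[1.0, 2.0], [-1.0, 0.5], [2.0, -1.0]])
y = torch.tensor([1.0, -1.0, -1.0])
w = torch.tensor([1.0, 0.5])
print(zero_one_risk(X, y, w))            # 1/3: only the third example has margin <= 0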

SLIDE 18

2. Logistic loss and risk

SLIDES 19-20

Logistic loss

Let’s state our classification goal with a generic margin loss ℓ:

R_ℓ(w) = (1/n) ∑_{i=1}^n ℓ(y_i w^T x_i);

the key properties we want:

◮ ℓ is continuous;
◮ ℓ(z) ≥ c·1[z ≤ 0] = c·ℓ_zo(z) for some c > 0 and any z ∈ R, which implies R_ℓ(w) ≥ c·R_zo(w);
◮ ℓ′(0) < 0 (pushes stuff from wrong to right).

Examples.

◮ Squared loss, written in margin form: ℓ_ls(z) := (1 − z)^2; note ℓ_ls(yŷ) = (1 − yŷ)^2 = y^2(1 − yŷ)^2 = (y − ŷ)^2, since y ∈ {−1, +1} gives y^2 = 1.
◮ Logistic loss: ℓ_log(z) = ln(1 + exp(−z)).
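
A minimal sketch of ℓ_log applied to every example (the function name logistic_loss and the use of softplus are my choices; softplus(−z) = ln(1 + exp(−z)), computed in a numerically stable way):

import torch

def logistic_loss(X, y, w):
    # per-example logistic loss ln(1 + exp(-y_i w^T x_i))
    margins = y * (X @ w)
    return torch.nn.functional.softplus(-margins)

Taking the mean of this vector gives the empirical logistic risk R_log(w); the same function can be handed to the gradient descent routine later in the lecture.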

SLIDE 21

Squared and logistic losses on linearly separable data I

[Figure: two panels over a linearly separable dataset, one for the logistic loss and one for the squared loss.]

SLIDE 22

Squared and logistic losses on linearly separable data II

[Figure: two panels over a second linearly separable dataset, one for the logistic loss and one for the squared loss.]

SLIDES 23-25

Logistic risk and separation

If there exists a perfect linear separator, empirical logistic risk minimization should find it.

Theorem. If there exists w̄ with y_i w̄^T x_i > 0 for all i, then every w with R_log(w) < ln(2)/(2n) + inf_v R_log(v) also satisfies y_i w^T x_i > 0 for all i.

Proof.

Step 1: low risk implies few mistakes. For any w with y_j w^T x_j ≤ 0 for some j,

R_log(w) ≥ (1/n) ln(1 + exp(−y_j w^T x_j)) ≥ ln(2)/n.

By the contrapositive, any w with R_log(w) < ln(2)/n makes no mistakes.

Step 2: inf_v R_log(v) = 0. Note:

0 ≤ inf_v R_log(v) ≤ inf_{r>0} (1/n) ∑_{i=1}^n ln(1 + exp(−r y_i w̄^T x_i)) = 0,

since every margin y_i w̄^T x_i is positive, so each term tends to 0 as r → ∞.

Combining the two steps: R_log(w) < ln(2)/(2n) + inf_v R_log(v) = ln(2)/(2n) < ln(2)/n, so w makes no mistakes. This completes the proof.

SLIDE 26

3. Minimizing the empirical logistic risk

SLIDES 27-29

Least squares and logistic ERM

Least squares:

◮ Take the gradient of ‖Aw − b‖², set it to 0; obtain the normal equations A^T A w = A^T b.
◮ One choice is the minimum norm solution A⁺b.

Logistic loss:

◮ Take the gradient of R_log(w) = (1/n) ∑_{i=1}^n ln(1 + exp(−y_i w^T x_i)) and set it to 0 ???

Remark. Is A⁺b a “closed form expression”?
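
A minimal sketch of the least-squares closed form (A and b below are made up):

import torch

A = torch.tensor([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0]])
b = torch.tensor([0.5, 1.0, 2.5])

w = torch.linalg.pinv(A) @ b             # minimum norm solution A⁺ b
print(w)                                 # roughly (1/3, 1)
print(A.t() @ A @ w - A.t() @ b)         # normal-equation residual A^T A w - A^T b ≈ 0

For the logistic risk, setting the gradient to zero has no analogous closed form, which is why we turn to an iterative method next.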

SLIDE 30

Decreasing R

We need to move down the contours of R_log:

[Figure: contour plot of R_log over the plane.]

SLIDES 31-33

Gradient descent

Given a function F : R^d → R, gradient descent is the iteration

w_{i+1} := w_i − η_i ∇_w F(w_i),

where w_0 is given, and η_i is a learning rate / step size.

[Figure: contour plot of the objective.]

Does this work for least squares? Later we’ll show it works for least squares and logistic regression due to convexity.
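
As a tiny worked instance (my own example, not from the slides): for F(w) = ½‖w‖₂², the gradient is ∇_w F(w) = w, so the iteration becomes w_{i+1} = w_i − η_i w_i = (1 − η_i) w_i; with a constant step size η ∈ (0, 2), the iterates shrink geometrically toward the minimizer w = 0.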

SLIDE 34

Gradient descent for logistic regression

Gradient descent is the iteration w_{i+1} := w_i − η_i ∇_w R_log(w_i).

◮ Note ℓ′_log(z) = −1/(1 + exp(z)), and use the chain rule (hw1!).
◮ Or use pytorch:

import torch

def GD(X, y, loss, step=0.1, n_iters=10000):
    # w requires grad so autograd can differentiate the empirical risk.
    w = torch.zeros(X.shape[1], requires_grad=True)
    for i in range(n_iters):
        l = loss(X, y, w).mean()      # empirical risk at the current w
        l.backward()                  # populate w.grad
        with torch.no_grad():
            w -= step * w.grad        # gradient descent step
            w.grad.zero_()            # reset the gradient for the next iteration
    return w
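
A hedged usage sketch (toy separable data of my own, with logistic_loss as sketched earlier); by the separation result from earlier, the returned w should classify every point correctly:

X = torch.tensor([[ 2.0,  1.0],
                  [ 1.5, -0.5],
                  [-1.0, -2.0],
                  [-2.0,  0.5]])
y = torch.tensor([1.0, 1.0, -1.0, -1.0])    # separable, e.g. by the sign of the first feature

w = GD(X, y, logistic_loss, step=0.1, n_iters=2000)
print(y * (X @ w) > 0)                       # expect all True: every margin y_i w^T x_i is positive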

SLIDE 35

“Logistic”?

The (negative) derivative −ℓ′_log(z) = 1/(1 + e^z) is the logistic function.

[Figure: plot of this function, an S-shaped curve taking values in (0, 1).]

We’ll explain its significance in subsequent lectures.

SLIDE 36

4. Summary

SLIDE 37

A quick note on popularity (early 2018)


SLIDE 38

Summary

◮ Linearly separable classification problems.
◮ Logistic loss ℓ_log and (empirical) risk R_log.
◮ Gradient descent.