SLIDE 1

Linear Predictors

COMPSCI 371D — Machine Learning

SLIDE 2

Outline

1 Definitions and Properties
2 The Least-Squares Linear Regressor
3 The Logistic-Regression Classifier
4 Probabilities and the Geometry of Logistic Regression
5 The Logistic Function
6 The Cross-Entropy Loss
7 Multi-Class Linear Predictors

SLIDE 3

Definitions and Properties

Definitions

  • A linear regressor fits an affine function to the data:
    $y \approx h(x) = b + w^T x$ for $x \in \mathbb{R}^d$
  • A linear, binary classifier separates the two classes with a hyperplane in $\mathbb{R}^d$
  • The actual data can be separated only if it is linearly separable (!)
  • Multi-class classifiers separate any two classes with a hyperplane
  • The resulting decision regions are convex and simply connected

SLIDE 4

Definitions and Properties

Properties of Linear Predictors

  • Linear Predictors...
  • ...have a very small H with d + 1 parameters (resist overfitting)
  • ...are trained by a convex optimization problem (global optimum)
  • ...are fast at inference time (and training is not too slow)
  • ...work well if the data is close to linearly separable

SLIDE 5

The Least-Squares Linear Regressor

The Least-Squares Linear Regressor

Déjà vu: polynomial regression with k = 1:
$y \approx h_v(x) = b + w^T x$ for $x \in \mathbb{R}^d$

  • Parameter vector $v = \begin{bmatrix} b \\ w \end{bmatrix} \in \mathbb{R}^{d+1}$
  • $H = \mathbb{R}^m$ with $m = d + 1$
  • “Least Squares:” $\ell(y, \hat{y}) = (y - \hat{y})^2$
  • $\hat{v} = \arg\min_{v \in \mathbb{R}^m} L_T(v)$
  • Risk: $L_T(v) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, h_v(x_n))$
  • We know how to solve this
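As a concrete illustration (my own sketch, not code from the course), the fit can be computed with numpy's least-squares solver applied to the data matrix augmented with a column of ones:

import numpy as np

def fit_linear_regressor(X, y):
    """Least-squares fit of y ≈ b + w^T x.  X is N-by-d, y has length N."""
    A = np.hstack([np.ones((X.shape[0], 1)), X])   # augment each sample with a leading 1
    v, *_ = np.linalg.lstsq(A, y, rcond=None)      # v = (b, w) minimizes the average squared loss
    return v[0], v[1:]                             # b, w

# Usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + 0.01 * rng.normal(size=100)
b, w = fit_linear_regressor(X, y)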

SLIDE 6

The Least-Squares Linear Regressor

Linear Regression Example

[Figure: two scatter plots with fitted regression lines; left: all of Ames, right: one neighborhood]

  • Left: All of Ames. Residual √Risk: $55,800
  • Right: One Neighborhood. Residual √Risk: $23,600
  • Left, yellow: Ignore two largest homes

SLIDE 7

The Least-Squares Linear Regressor

Binary Classification by Logistic Regression

$Y = \{c_0, c_1\}$

  • Multi-class case later
  • The logistic-regression classifier is a classifier!
  • A linear classifier implemented through regression
  • The logistic is a particular function

SLIDE 8

The Logistic-Regression Classifier

Score-Based Classifiers

$Y = \{c_0, c_1\}$

  • Think of $c_0, c_1$ as numbers: $Y = \{0, 1\}$
  • We saw the idea of level sets: regress a score function s such that s is large where y = 1, small where y = 0
  • Threshold s to obtain a classifier:
    $h(x) = \begin{cases} c_0 & \text{if } s(x) \le \text{threshold} \\ c_1 & \text{otherwise} \end{cases}$
  • A linear classifier implemented through regression

SLIDE 9

The Logistic-Regression Classifier

Idea 1

  • $s(x) = b + w^T x$

  • Not so good!
  • A line does not approximate a step well
  • Why not fit a step function?
  • NP-hard unless the data is separable

SLIDE 10

The Logistic-Regression Classifier

Idea 2

  • How about a “soft step”?
  • The logistic function: $f(x) \overset{\text{def}}{=} \dfrac{1}{1+e^{-x}}$
  • If a true step moves, the loss does not change until a data point flips label

  • If the logistic function moves, the loss changes gradually
  • We have a gradient!
  • The optimization problem is no longer combinatorial

SLIDE 11

The Logistic-Regression Classifier

What is a Logistic Function in d Dimensions?

  • We want a linear classifier
  • The level crossing must be a hyperplane
  • Level crossing: Solution to s(x) = 1/2
  • Shape of the crossing depends on s
  • Compose an affine function $a(x) = c + u^T x$ ($a : \mathbb{R}^d \to \mathbb{R}$) with a monotonic $f(a)$ that crosses 1/2 ($f : \mathbb{R} \to \mathbb{R}$): $s(x) = f(a(x)) = f(c + u^T x)$
  • Then, if $f(\alpha) = 1/2$, the equation $s(x) = 1/2$ is the same as $c + u^T x = \alpha$

  • A hyperplane!
  • Let f be the logistic function

SLIDE 12

The Logistic-Regression Classifier

Example

[Figure, panels (a) and (b): the neighborhood data with a gold regression line and a black classification boundary]

  • Gold line: Regression problem $\mathbb{R} \to \mathbb{R}$
  • Black line: Classification problem $\mathbb{R}^2 \to \mathbb{R}$ (result of running a logistic-regression classifier)
  • Labels: Good (red squares, y = 1) or poor quality (blue circles, y = 0) homes

  • All that matters is how far a point is from the black line

SLIDE 13

Probabilities and the Geometry of Logistic Regression

A Probabilistic Interpretation


  • All that matters is how far a point is from the black line
  • s(x) = f(∆(x)) where ∆ is a signed distance
  • We could interpret the score s(x) as “the probability that y = 1”: $f(\Delta(x)) = P[y = 1]$
  • (...or as “1 − the probability that y = 0”)

$\lim_{\Delta \to -\infty} P[y = 1] = 0$, $\quad \lim_{\Delta \to \infty} P[y = 1] = 1$, $\quad \Delta = 0 \Rightarrow P[y = 1] = 1/2$ (just like the logistic function)

SLIDE 14

Probabilities and the Geometry of Logistic Regression

Ingredients for the Regression Part

  • Determine the distance ∆ of a point x ∈ X from a hyperplane χ, and which side of χ the point is on (Geometry: affine functions as unscaled, signed distances)
  • Specify a monotonically increasing function that turns ∆ into a probability (Choice based on convenience: the logistic function)
  • Define a loss function $\ell(y, \hat{y})$ such that the minimum risk yields the optimal classifier (Ditto; matched to the function in the previous bullet to obtain a convex risk: the cross-entropy loss)

SLIDE 15

Probabilities and the Geometry of Logistic Regression

Normal to a Hyperplane

[Diagram: hyperplane χ with normal n, the point x₀ at distance β from the origin, and the positive (∆(x) > 0) and negative (∆(x) < 0) half-spaces]

  • Hyperplane χ: $b + w^T x = 0$ (w.l.o.g. $b \le 0$); $a_1, a_2 \in \chi \Rightarrow c = a_1 - a_2$ is parallel to χ
  • Subtract $b + w^T a_1 = 0$ from $b + w^T a_2 = 0$
  • Obtain $w^T c = 0$ for any $a_1, a_2 \in \chi$
  • w is perpendicular to χ

SLIDE 16

Probabilities and the Geometry of Logistic Regression

Distance of a Hyperplane from the Origin

[Same diagram as on the previous slide]

  • Unit-norm version of w: $n = \frac{w}{\|w\|}$
  • Rewrite χ: $b + w^T x = 0$ (w.l.o.g. $b \le 0$) as $n^T x = \beta$ where $\beta = -\frac{b}{\|w\|} \ge 0$
  • Line along n: $x = \alpha n$ for $\alpha \in \mathbb{R}$ (parametric form); α is the distance from the origin
  • Replace into the equation for χ: $\alpha\, n^T n = \beta$, that is, $\alpha = \beta \ge 0$
  • In particular, $x_0 = \beta n$
  • β is the distance of χ from the origin

SLIDE 17

Probabilities and the Geometry of Logistic Regression

Signed Distance of a Point from a Hyperplane

[Same diagram as on the previous slides]

$n^T x = \beta$ where $\beta = -\frac{b}{\|w\|} \ge 0$ and $n = \frac{w}{\|w\|}$; $\quad x_0 = \beta n$

  • In one half-space, $n^T x \ge \beta$
  • There, the distance of x from χ is $n^T x - \beta \ge 0$
  • In the other half-space, $n^T x \le \beta$
  • There, the distance of x from χ is $\beta - n^T x \ge 0$
  • On the decision boundary, $n^T x = \beta$
  • $n^T x - \beta$ is the signed distance of x from the hyperplane

SLIDE 18

Probabilities and the Geometry of Logistic Regression

Summary

If w is nonzero (which it has to be), the distance of χ from the origin is

$\beta \overset{\text{def}}{=} \dfrac{|b|}{\|w\|}$  (a nonnegative number)

and the quantity

$\Delta(x) \overset{\text{def}}{=} \dfrac{b + w^T x}{\|w\|}$

is the signed distance of point x ∈ X from hyperplane χ. Specifically, the distance of x from χ is |∆(x)|, and ∆(x) is nonnegative if and only if x is on the side of χ pointed to by w. Let us call that side the positive half-space of χ.
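In code, the summary boils down to one line (an illustration of my own, with assumed names b and w for the hyperplane parameters):

import numpy as np

def signed_distance(x, b, w):
    """Signed distance ∆(x) = (b + w^T x) / ||w|| of x from the hyperplane b + w^T x = 0."""
    return (b + w @ x) / np.linalg.norm(w)

# Points with positive ∆ lie in the half-space that w points into;
# |∆(x)| is the ordinary Euclidean distance from the hyperplane.
w = np.array([3.0, 4.0])     # ||w|| = 5
b = -5.0
print(signed_distance(np.array([0.0, 0.0]), b, w))   # -1.0: the origin is on the negative side
print(signed_distance(np.array([3.0, 4.0]), b, w))   #  4.0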

SLIDE 19

The Logistic Function

Ingredient 2: The Logistic Function

  • Want to make the score of x be only a function of ∆(x)
  • Given $\Delta_0$, all points such that $\Delta(x) = \Delta_0$ have the same score
  • Score $s(x) = f(\Delta(x))$
  • How to pick f?
  • $\lim_{\Delta \to -\infty} f(\Delta) = 0$, $\quad f(0) = 1/2$, $\quad \lim_{\Delta \to \infty} f(\Delta) = 1$
  • Logistic function: $f(\Delta) \overset{\text{def}}{=} \dfrac{1}{1+e^{-\Delta}}$

SLIDE 20

The Logistic Function

The Logistic Function

  • Logistic function: $f(\Delta) \overset{\text{def}}{=} \dfrac{1}{1+e^{-\Delta}}$
  • Scale-free: why not $\dfrac{1}{1+e^{-\Delta/c}}$?
  • Can use both c and $\Delta(x) \overset{\text{def}}{=} \dfrac{b + w^T x}{\|w\|}$ ... or, more simply, use no c and use $a(x) \overset{\text{def}}{=} b + w^T x$
  • The affine function takes care of scale implicitly
  • Score: $s(x) \overset{\text{def}}{=} f(a(x)) = \dfrac{1}{1+e^{-b-w^T x}}$
  • Write $s(x\,;\,b, w)$ to remind us of the dependence
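A minimal sketch of the resulting score and classifier (my own illustration, not course code):

import numpy as np

def score(x, b, w):
    """Logistic-regression score s(x; b, w) = 1 / (1 + exp(-(b + w^T x)))."""
    a = b + w @ x                      # activation: signed, unscaled distance from the hyperplane
    return 1.0 / (1.0 + np.exp(-a))    # squashed into (0, 1), crosses 1/2 on the hyperplane

# The induced classifier thresholds the score at 1/2:
def classify(x, b, w):
    return 1 if score(x, b, w) > 0.5 else 0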

SLIDE 21

The Cross-Entropy Loss

Optimize the Regressor, not the Classifier

  • We would like something similar to the 0-1 loss:
    $\ell_{0\text{-}1}(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{otherwise} \end{cases}$
  • However, $\ell_{0\text{-}1}$ is not differentiable
  • Use the score $p = s(x\,;\,b, w)$ instead of $\hat{y}$: $\hat{y} \in \{0, 1\}$ while $p \in [0, 1]$
  • Instead of measuring the loss on $\hat{y} = h(x)$, we measure it on $p = s(x\,;\,b, w) \approx \hat{y}$

  • We still need a different ℓ(y, p) for differentiability

SLIDE 22

The Cross-Entropy Loss

Differentiability, Again

  • We want ℓ(y, p) to be differentiable in p
  • Since p is differentiable in v = (b, w), so will ℓ be
  • Why do we insist on differentiability, again?
  • Risk: $L_T(b, w) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, s(x_n\,;\,b, w))$
  • Use a gradient method (steepest descent, Newton, ...)
  • We have not yet chosen the specific form of ℓ
  • We can make $L_T(b, w)$ a differentiable and convex function of v = (b, w) by a suitable choice of ℓ

SLIDE 23

The Cross-Entropy Loss

The Cross-Entropy Loss

$\ell(y, p) \overset{\text{def}}{=} \begin{cases} -\log p & \text{if } y = 1 \\ -\log(1 - p) & \text{if } y = 0 \end{cases}$

  • Base of the log is unimportant: the unit of loss is conventional

[Plot: the two branches of the loss over $p \in [0, 1]$, $-\log p$ for y = 1 and $-\log(1-p)$ for y = 0]

  • Same as $\ell(y, p) = -y \log p - (1 - y)\log(1 - p)$ (the second form is more convenient for differentiation)
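A direct transcription into code (my own sketch; clamping p away from 0 and 1 is an added numerical safeguard, not part of the slide):

import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Cross-entropy loss ℓ(y, p) = -y log p - (1 - y) log(1 - p) for y in {0, 1}."""
    p = np.clip(p, eps, 1.0 - eps)    # avoid log(0) when the score saturates
    return -y * np.log(p) - (1.0 - y) * np.log(1.0 - p)

print(cross_entropy(1, 0.9))   # small loss: confident and correct
print(cross_entropy(0, 0.9))   # large loss: confident and wrong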

SLIDE 24

The Cross-Entropy Loss

The Cross-Entropy Loss

  • Domain: $\{0, 1\} \times [0, 1]$
  • $\ell(1, p) = \ell(0, 1 - p)$
  • $\ell(1, 1/2) = \ell(0, 1/2) = -\log(1/2)$

[Plot: ℓ(y, p) over its domain; the y = 1 branch $-\log p$ and the y = 0 branch $-\log(1 - p)$]

SLIDE 25

The Cross-Entropy Loss

Why Cross-Entropy?

  • Literature (and the Appendix in the class notes) gives an interpretation in terms of information theory
  • A more cogent explanation: with cross-entropy and the logistic function,
  • the risk becomes a convex function of the parameters v = (b, w)
  • the gradient and Hessian of the risk are easy to compute
  • a crucial cancellation occurs when computing derivatives of the risk with respect to the parameters
  • You will be asked to use the gradient and Hessian, and be able to compute them
  • You will not be asked to remember their formulas, or know how to derive them

SLIDE 26

The Cross-Entropy Loss

The Magic

  • Logistic function and loss were chosen to simplify the math
  • Here is the magic:

$L_T(v) = L_T(\ell(s(a(v))))$, so $\nabla L_T = \frac{dL_T}{d\ell}\,\frac{d\ell}{ds}\,\frac{ds}{da}\,\nabla a$

$\ell = -y \log s - (1 - y)\log(1 - s)$, so that $\frac{d\ell}{ds} = \frac{s - y}{s\,(1 - s)}$

$s(a) = \frac{1}{1+e^{-a}}$, so that $\frac{ds}{da} = s\,(1 - s)$

  • Therefore, $\frac{d\ell}{ds}\,\frac{ds}{da} = s - y$

  • This is the cancellation that simplifies everything
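The cancellation can also be checked symbolically (a throwaway verification of my own, assuming sympy is available):

import sympy as sp

y, a = sp.symbols('y a', real=True)
s = 1 / (1 + sp.exp(-a))                          # logistic score as a function of the activation
ell = -y * sp.log(s) - (1 - y) * sp.log(1 - s)    # cross-entropy loss written in terms of s(a)

grad = sp.simplify(sp.diff(ell, a))
print(grad)                          # expect something equivalent to 1/(1 + exp(-a)) - y, i.e. s - y
print(sp.simplify(grad - (s - y)))   # expect 0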

SLIDE 27

The Cross-Entropy Loss

Turning the Crank

  • Gradient of the risk:
    $\nabla L_T(v) = \frac{1}{N} \sum_{n=1}^{N} \left[ s(x_n\,;\,v) - y_n \right] \begin{bmatrix} 1 \\ x_n \end{bmatrix}$
  • Hessian of the risk:
    $H_{L_T}(v) = \frac{1}{N} \sum_{n=1}^{N} s(x_n\,;\,v)\left[ 1 - s(x_n\,;\,v) \right] \begin{bmatrix} 1 \\ x_n \end{bmatrix} \begin{bmatrix} 1 \\ x_n \end{bmatrix}^T$

  • Each term in the summation for HLT is an outer product
  • This implies (easily) that HLT is positive semidefinite
  • LT(v) is a convex function
  • No need to check eigenvalues (See Appendix if you are curious)
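The two formulas translate almost term by term into numpy (a sketch of my own, with assumed argument names; v stacks b on top of w):

import numpy as np

def risk_grad_hess(v, X, y):
    """Gradient and Hessian of the cross-entropy risk for v = (b, w)."""
    N = X.shape[0]
    A = np.hstack([np.ones((N, 1)), X])                # augmented samples [1, x_n], one per row
    s = 1.0 / (1.0 + np.exp(-A @ v))                   # scores s(x_n; v)
    grad = A.T @ (s - y) / N                           # (1/N) Σ [s_n - y_n] [1; x_n]
    hess = (A * (s * (1.0 - s))[:, None]).T @ A / N    # (1/N) Σ s_n(1 - s_n) [1; x_n][1; x_n]^T
    return grad, hess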

SLIDE 28

The Cross-Entropy Loss

Training

  • LT(v) is convex in v ∈ Rm with m = d + 1
  • Use any gradient-based method to minimize
  • When d is not too large, use Newton’s method (homework!)
  • More efficient, problem-specific algorithms exist
  • They capitalize on LT(v) being a sum of squares
  • Typically, train with cross-entropy loss, test with 0-1 loss
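A bare-bones Newton training loop might look as follows (an illustration of my own, not the homework solution; the small ridge term added to the Hessian is my own safeguard against a singular matrix):

import numpy as np

def train_logistic_newton(X, y, steps=20, ridge=1e-8):
    """Minimize the cross-entropy risk with Newton's method; returns v = (b, w)."""
    N = X.shape[0]
    A = np.hstack([np.ones((N, 1)), X])                    # augmented samples [1, x_n]
    v = np.zeros(A.shape[1])
    for _ in range(steps):
        s = 1.0 / (1.0 + np.exp(-A @ v))                   # scores
        grad = A.T @ (s - y) / N                           # gradient of the risk
        hess = (A * (s * (1.0 - s))[:, None]).T @ A / N    # Hessian of the risk
        v -= np.linalg.solve(hess + ridge * np.eye(len(v)), grad)   # Newton step
    return v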

SLIDE 29

Multi-Class Linear Predictors

Multi-Class Linear Predictors

  • Obvious approach 1: One-versus-rest
  • Build K − 1 classifiers ck versus not ck
  • Works for K = 2 but not for K = 3

[Diagram: one-versus-rest boundaries for three classes, with a region marked “?” that no classifier claims]

SLIDE 30

Multi-Class Linear Predictors

Multi-Class Linear Predictors

  • Obvious approach 2: One-versus-one
  • Build $\binom{K}{2}$ classifiers $c_i$ versus $c_j$
  • Works for K = 2 but not for K = 3

[Diagram: pairwise boundaries c1 vs. c2, c1 vs. c3, c2 vs. c3, with a central region marked “?”]

SLIDE 31

Multi-Class Linear Predictors

A Symmetric View of the Binary Score

  • Rename classes 1, 2 rather than 0, 1
  • Activation: $a = b + w^T x$
  • Score for class 1: $s_1(a) = \dfrac{1}{1+e^{-a}}$
  • Score for class 2: $s_2(a) = 1 - s_1(a) = s_1(-a)$
  • More symmetrically, two activations: $a_1 = b + w^T x$, $\;a_2 = -b - w^T x$
  • Note: $\dfrac{1}{1+e^{-a}} = \dfrac{e^{a/2}}{e^{a/2} + e^{-a/2}}$
  • Score for class 1: $s_1 = s(a_1) = \dfrac{e^{a_1/2}}{e^{a_1/2} + e^{-a_1/2}} = \dfrac{e^{a_1/2}}{e^{a_1/2} + e^{a_2/2}}$
  • Score for class 2 (switch $a_1$ with $a_2$): $s_2 = s(a_2) = \dfrac{e^{a_2/2}}{e^{a_1/2} + e^{a_2/2}}$

  • Class with highest score wins

SLIDE 32

Multi-Class Linear Predictors

Exploiting Scalable Activations

  • Score for class $k \in \{1, 2\}$: $s_k = \dfrac{e^{a_k/2}}{e^{a_1/2} + e^{a_2/2}}$
  • Activations are freely scalable, so write $s_k = \dfrac{e^{a_k}}{e^{a_1} + e^{a_2}}$ instead
  • Different function, same separating hyperplane
  • This generalizes; replace 2 classes with K:
    $s_k(x) = \dfrac{e^{a_k(x)}}{\sum_{j=1}^{K} e^{a_j(x)}}$ where $a_k(x) = b_k + w_k^T x$
  • Satisfies $\sum_{k=1}^{K} s_k(x) = 1$
  • Class with highest score wins: $\hat{y} = h(x) \in \arg\max_k s_k(x)$
  • This is the Linear-Regression Multi-Class Classifier
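A compact sketch of this classifier (my own illustration; B collects the K biases and W the K weight vectors, and subtracting the maximum activation is an added numerical safeguard):

import numpy as np

def multiclass_scores(x, B, W):
    """Soft-max scores s_k(x) for activations a_k(x) = b_k + w_k^T x.
    B has shape (K,), W has shape (K, d)."""
    a = B + W @ x                       # K activations
    e = np.exp(a - a.max())             # shifting by the max does not change the scores
    return e / e.sum()                  # scores are positive and sum to 1

def multiclass_predict(x, B, W):
    return int(np.argmax(multiclass_scores(x, B, W)))   # class with the highest score wins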

SLIDE 33

Multi-Class Linear Predictors

The Soft-Max Function

$s_k(x) = \dfrac{e^{a_k(x)}}{\sum_{j=1}^{K} e^{a_j(x)}}$

  • $s_k(x) > 0$ and $\sum_{k=1}^{K} s_k(x) = 1$ for all x
  • If $a_i \gg a_j$ for $j \ne i$, then $\sum_{j=1}^{K} e^{a_j(x)} \approx e^{a_i(x)}$
  • Therefore, $s_i \approx 1$ and $s_j \approx 0$ for $j \ne i$
  • “Brings out the biggest:” soft-max
  • Collect into vectors: $a = (a_1, \ldots, a_K)$, $s = (s_1, \ldots, s_K)$

$x \in \mathbb{R}^d \;\to\; a \in \mathbb{R}^K \;\to\; s \in \mathbb{R}^K$, $\qquad s(a(x)) = \dfrac{e^{a(x)}}{\mathbf{1}^T e^{a(x)}}$, $\qquad \lim_{\alpha \to \infty} a^T s(\alpha a) = \max(a)$
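The “brings out the biggest” behavior is easy to check numerically (a throwaway demonstration of my own):

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())     # shift by the max for numerical stability
    return e / e.sum()

a = np.array([1.0, 3.0, 2.0, -1.0])
for alpha in (1, 10, 100):
    print(alpha, a @ softmax(alpha * a))   # approaches max(a) = 3.0 as alpha grows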

SLIDE 34

Multi-Class Linear Predictors

Geometry of Multi-Class Decision Regions

  • Separating hyperplane for classes $i, j \in \{1, \ldots, K\}$:
    $b_i + w_i^T x = b_j + w_j^T x$ (equal activations ⇒ equal scores)
  • Total of $M = \binom{K}{2}$ hyperplanes, just as in one-versus-one
  • Example: d = 2, K = 4 ⇒ 6 lines on the plane
  • There are degeneracies (the $M \times (d + 1)$ matrix has rank $K - 1$)
  • Crossing a line switches two scores. Example:
    $s_3 > s_2 > s_4 > s_1 \;\to\; s_3 > s_4 > s_2 > s_1$

SLIDE 35

Multi-Class Linear Predictors

Geometry of Decision Regions

  • Crossing a line switches two scores. Example:
    $s_3 > s_2 > s_4 > s_1 \;\to\; s_3 > s_4 > s_2 > s_1$
  • When the top two scores switch, we cross a decision boundary. Example:
    $s_3 > s_2 > s_4 > s_1 \;\to\; s_2 > s_3 > s_4 > s_1$

  • Decision regions are intersections of half-spaces ⇒ convex

SLIDE 36

Multi-Class Linear Predictors

Multi-Class Cross-Entropy Loss

  • Cross-entropy loss for K = 2 (remember that we renamed Y = {0, 1} to Y = {1, 2}):
    $\ell(y, p) \overset{\text{def}}{=} \begin{cases} -\log p & \text{if } y = 1 \\ -\log(1 - p) & \text{if } y = 2 \end{cases} = \begin{cases} -\log p_1 & \text{if } y = 1 \\ -\log p_2 & \text{if } y = 2 \end{cases}$
  • Same as $\ell(y, \mathbf{p}) = -\log p_y$
  • But this is general!
  • Can also write as follows: $\ell(y, \mathbf{p}) = -\sum_{k=1}^{K} q_k(y) \log p_k$
  • $\mathbf{q}$ is the one-hot encoding of y
  • Example: K = 5; then y = 4 is represented by $\mathbf{q} = [0, 0, 0, 1, 0]$
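In code, the “pick out entry y” form and the one-hot form compute the same number (a sketch of my own, with classes numbered 1 through K as on the slide):

import numpy as np

def multiclass_cross_entropy(y, p):
    """ℓ(y, p) = -log p_y for a score vector p over classes 1..K."""
    return -np.log(p[y - 1])            # classes are 1-based on the slide, arrays are 0-based

def multiclass_cross_entropy_onehot(y, p, K):
    q = np.zeros(K)
    q[y - 1] = 1.0                      # one-hot encoding of y
    return -(q * np.log(p)).sum()       # identical to -log p_y

p = np.array([0.1, 0.2, 0.6, 0.05, 0.05])
print(multiclass_cross_entropy(3, p), multiclass_cross_entropy_onehot(3, p, 5))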

SLIDE 37

Multi-Class Linear Predictors

Convex Risk, Again

  • Even with K > 2, the risk is a convex function of $v = (b_1, w_1, \ldots, b_K, w_K) \in \mathbb{R}^m$ with $m = (d + 1)K$
  • Proof analogous to the K = 2 case, just technically more involved
  • Can still use gradient-descent methods, including Newton
