SLIDE 1

Linear Predictors

COMPSCI 371D — Machine Learning

SLIDE 2

Outline

1 Definitions and Properties
2 The Least-Squares Linear Regressor
3 The Logistic-Regression Classifier
4 Probabilities and the Geometry of Logistic Regression
5 The Logistic Function
6 The Cross-Entropy Loss
7 Multi-Class Linear Predictors

SLIDE 3

Definitions and Properties

Definitions

  • A linear regressor fits an affine function to the data:
    $y \approx h(x) = b + w^T x$ for $x \in \mathbb{R}^d$
  • A linear, binary classifier separates the two classes with a hyperplane in $\mathbb{R}^d$
  • The actual data can be separated only if it is linearly separable (!)
  • Multi-class classifiers separate any two classes with a hyperplane
  • The resulting decision regions are convex and simply connected

SLIDE 4

Definitions and Properties

Properties of Linear Predictors

  • Linear Predictors...
  • ...have a very small H with d + 1 parameters (resist overfitting)
  • ...are trained by a convex optimization problem (global optimum)
  • ...are fast at inference time (and training is not too slow)
  • ...work well if the data is close to linearly separable

SLIDE 5

The Least-Squares Linear Regressor

The Least-Squares Linear Regressor

Déjà vu: polynomial regression with k = 1:
$y \approx h_v(x) = b + w^T x$ for $x \in \mathbb{R}^d$

  • Parameter vector $v = \begin{bmatrix} b \\ w \end{bmatrix} \in \mathbb{R}^{d+1}$
  • $H = \mathbb{R}^m$ with $m = d + 1$
  • “Least Squares:” $\ell(y, \hat{y}) = (y - \hat{y})^2$
  • $\hat{v} = \arg\min_{v \in \mathbb{R}^m} L_T(v)$
  • Risk: $L_T(v) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, h_v(x_n))$
  • We know how to solve this
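As a concrete illustration (my own sketch, not code from the course), the fit can be computed with numpy's least-squares solver applied to the data matrix augmented with a column of ones:

import numpy as np

def fit_linear_regressor(X, y):
    """Least-squares fit of y ≈ b + w^T x.  X is N-by-d, y has length N."""
    A = np.hstack([np.ones((X.shape[0], 1)), X])   # augment each sample with a leading 1
    v, *_ = np.linalg.lstsq(A, y, rcond=None)      # v = (b, w) minimizes the average squared loss
    return v[0], v[1:]                             # b, w

# Usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + 0.01 * rng.normal(size=100)
b, w = fit_linear_regressor(X, y)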

SLIDE 6

The Least-Squares Linear Regressor

Linear Regression Example

[Figure: two scatter plots with fitted regression lines; left: all of Ames, right: one neighborhood]

  • Left: All of Ames. Residual √Risk: $55,800
  • Right: One Neighborhood. Residual √Risk: $23,600
  • Left, yellow: Ignore two largest homes

SLIDE 7

The Least-Squares Linear Regressor

Binary Classification by Logistic Regression

$Y = \{c_0, c_1\}$

  • Multi-class case later
  • The logistic-regression classifier is a classifier!
  • A linear classifier implemented through regression
  • The logistic is a particular function

SLIDE 8

The Logistic-Regression Classifier

Score-Based Classifiers

$Y = \{c_0, c_1\}$

  • Think of $c_0, c_1$ as numbers: $Y = \{0, 1\}$
  • We saw the idea of level sets: regress a score function s such that s is large where y = 1, small where y = 0
  • Threshold s to obtain a classifier:
    $h(x) = \begin{cases} c_0 & \text{if } s(x) \le \text{threshold} \\ c_1 & \text{otherwise} \end{cases}$
  • A linear classifier implemented through regression

SLIDE 9

The Logistic-Regression Classifier

Idea 1

  • $s(x) = b + w^T x$

  • Not so good!
  • A line does not approximate a step well
  • Why not fit a step function?
  • NP-hard unless the data is separable

SLIDE 10

The Logistic-Regression Classifier

Idea 2

  • How about a “soft step”?
  • The logistic function: $f(x) \overset{\text{def}}{=} \dfrac{1}{1+e^{-x}}$
  • If a true step moves, the loss does not change until a data point flips label

  • If the logistic function moves, the loss changes gradually
  • We have a gradient!
  • The optimization problem is no longer combinatorial

SLIDE 11

The Logistic-Regression Classifier

What is a Logistic Function in d Dimensions?

  • We want a linear classifier
  • The level crossing must be a hyperplane
  • Level crossing: Solution to s(x) = 1/2
  • Shape of the crossing depends on s
  • Compose an affine function $a(x) = c + u^T x$ ($a : \mathbb{R}^d \to \mathbb{R}$) with a monotonic $f(a)$ that crosses 1/2 ($f : \mathbb{R} \to \mathbb{R}$): $s(x) = f(a(x)) = f(c + u^T x)$
  • Then, if $f(\alpha) = 1/2$, the equation $s(x) = 1/2$ is the same as $c + u^T x = \alpha$

  • A hyperplane!
  • Let f be the logistic function

SLIDE 12

The Logistic-Regression Classifier

Example

[Figure, panels (a) and (b): the neighborhood data with a gold regression line and a black classification boundary]

  • Gold line: Regression problem $\mathbb{R} \to \mathbb{R}$
  • Black line: Classification problem $\mathbb{R}^2 \to \mathbb{R}$ (result of running a logistic-regression classifier)
  • Labels: Good (red squares, y = 1) or poor quality (blue circles, y = 0) homes

  • All that matters is how far a point is from the black line

SLIDE 13

Probabilities and the Geometry of Logistic Regression

A Probabilistic Interpretation


  • All that matters is how far a point is from the black line
  • s(x) = f(∆(x)) where ∆ is a signed distance
  • We could interpret the score s(x) as “the probability that y = 1”: $f(\Delta(x)) = P[y = 1]$
  • (...or as “1 − the probability that y = 0”)

$\lim_{\Delta \to -\infty} P[y = 1] = 0$, $\quad \lim_{\Delta \to \infty} P[y = 1] = 1$, $\quad \Delta = 0 \Rightarrow P[y = 1] = 1/2$ (just like the logistic function)

SLIDE 14

Probabilities and the Geometry of Logistic Regression

Ingredients for the Regression Part

  • Determine the distance ∆ of a point x ∈ X from a hyperplane χ, and which side of χ the point is on (Geometry: affine functions as unscaled, signed distances)
  • Specify a monotonically increasing function that turns ∆ into a probability (Choice based on convenience: the logistic function)
  • Define a loss function $\ell(y, \hat{y})$ such that the minimum risk yields the optimal classifier (Ditto; matched to the function in the previous bullet to obtain a convex risk: the cross-entropy loss)

SLIDE 15

Probabilities and the Geometry of Logistic Regression

Normal to a Hyperplane

[Diagram: hyperplane χ with normal n, the point x₀ at distance β from the origin, and the positive (∆(x) > 0) and negative (∆(x) < 0) half-spaces]

  • Hyperplane χ: $b + w^T x = 0$ (w.l.o.g. $b \le 0$); $a_1, a_2 \in \chi \Rightarrow c = a_1 - a_2$ is parallel to χ
  • Subtract $b + w^T a_1 = 0$ from $b + w^T a_2 = 0$
  • Obtain $w^T c = 0$ for any $a_1, a_2 \in \chi$
  • w is perpendicular to χ

SLIDE 16

Probabilities and the Geometry of Logistic Regression

Distance of a Hyperplane from the Origin

[Same diagram as on the previous slide]

  • Unit-norm version of w: $n = \frac{w}{\|w\|}$
  • Rewrite χ: $b + w^T x = 0$ (w.l.o.g. $b \le 0$) as $n^T x = \beta$ where $\beta = -\frac{b}{\|w\|} \ge 0$
  • Line along n: $x = \alpha n$ for $\alpha \in \mathbb{R}$ (parametric form); α is the distance from the origin
  • Replace into the equation for χ: $\alpha\, n^T n = \beta$, that is, $\alpha = \beta \ge 0$
  • In particular, $x_0 = \beta n$
  • β is the distance of χ from the origin

SLIDE 17

Probabilities and the Geometry of Logistic Regression

Signed Distance of a Point from a Hyperplane

[Same diagram as on the previous slides]

$n^T x = \beta$ where $\beta = -\frac{b}{\|w\|} \ge 0$ and $n = \frac{w}{\|w\|}$; $\quad x_0 = \beta n$

  • In one half-space, $n^T x \ge \beta$
  • There, the distance of x from χ is $n^T x - \beta \ge 0$
  • In the other half-space, $n^T x \le \beta$
  • There, the distance of x from χ is $\beta - n^T x \ge 0$
  • On the decision boundary, $n^T x = \beta$
  • $n^T x - \beta$ is the signed distance of x from the hyperplane

SLIDE 18

Probabilities and the Geometry of Logistic Regression

Summary

If w is nonzero (which it has to be), the distance of χ from the origin is

$\beta \overset{\text{def}}{=} \dfrac{|b|}{\|w\|}$  (a nonnegative number)

and the quantity

$\Delta(x) \overset{\text{def}}{=} \dfrac{b + w^T x}{\|w\|}$

is the signed distance of point x ∈ X from hyperplane χ. Specifically, the distance of x from χ is |∆(x)|, and ∆(x) is nonnegative if and only if x is on the side of χ pointed to by w. Let us call that side the positive half-space of χ.
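In code, the summary boils down to one line (an illustration of my own, with assumed names b and w for the hyperplane parameters):

import numpy as np

def signed_distance(x, b, w):
    """Signed distance ∆(x) = (b + w^T x) / ||w|| of x from the hyperplane b + w^T x = 0."""
    return (b + w @ x) / np.linalg.norm(w)

# Points with positive ∆ lie in the half-space that w points into;
# |∆(x)| is the ordinary Euclidean distance from the hyperplane.
w = np.array([3.0, 4.0])     # ||w|| = 5
b = -5.0
print(signed_distance(np.array([0.0, 0.0]), b, w))   # -1.0: the origin is on the negative side
print(signed_distance(np.array([3.0, 4.0]), b, w))   #  4.0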

SLIDE 19

The Logistic Function

Ingredient 2: The Logistic Function

  • Want to make the score of x be only a function of ∆(x)
  • Given $\Delta_0$, all points such that $\Delta(x) = \Delta_0$ have the same score
  • Score $s(x) = f(\Delta(x))$
  • How to pick f?
  • $\lim_{\Delta \to -\infty} f(\Delta) = 0$, $\quad f(0) = 1/2$, $\quad \lim_{\Delta \to \infty} f(\Delta) = 1$
  • Logistic function: $f(\Delta) \overset{\text{def}}{=} \dfrac{1}{1+e^{-\Delta}}$

SLIDE 20

The Logistic Function

The Logistic Function

  • Logistic function: $f(\Delta) \overset{\text{def}}{=} \dfrac{1}{1+e^{-\Delta}}$
  • Scale-free: why not $\dfrac{1}{1+e^{-\Delta/c}}$?
  • Can use both c and $\Delta(x) \overset{\text{def}}{=} \dfrac{b + w^T x}{\|w\|}$ ... or, more simply, use no c and use $a(x) \overset{\text{def}}{=} b + w^T x$
  • The affine function takes care of scale implicitly
  • Score: $s(x) \overset{\text{def}}{=} f(a(x)) = \dfrac{1}{1+e^{-b-w^T x}}$
  • Write $s(x\,;\,b, w)$ to remind us of the dependence
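A minimal sketch of the resulting score and classifier (my own illustration, not course code):

import numpy as np

def score(x, b, w):
    """Logistic-regression score s(x; b, w) = 1 / (1 + exp(-(b + w^T x)))."""
    a = b + w @ x                      # activation: signed, unscaled distance from the hyperplane
    return 1.0 / (1.0 + np.exp(-a))    # squashed into (0, 1), crosses 1/2 on the hyperplane

# The induced classifier thresholds the score at 1/2:
def classify(x, b, w):
    return 1 if score(x, b, w) > 0.5 else 0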

SLIDE 21

The Cross-Entropy Loss

Optimize the Regressor, not the Classifier

  • We would like something similar to the 0-1 loss:
    $\ell_{0\text{-}1}(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{otherwise} \end{cases}$
  • However, $\ell_{0\text{-}1}$ is not differentiable
  • Use the score $p = s(x\,;\,b, w)$ instead of $\hat{y}$: $\hat{y} \in \{0, 1\}$ while $p \in [0, 1]$
  • Instead of measuring the loss on $\hat{y} = h(x)$, we measure it on $p = s(x\,;\,b, w) \approx \hat{y}$

  • We still need a different ℓ(y, p) for differentiability

SLIDE 22

The Cross-Entropy Loss

Differentiability, Again

  • We want ℓ(y, p) to be differentiable in p
  • Since p is differentiable in v = (b, w), so will ℓ be
  • Why do we insist on differentiability, again?
  • Risk: $L_T(b, w) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, s(x_n\,;\,b, w))$
  • Use a gradient method (steepest descent, Newton, ...)
  • We have not yet chosen the specific form of ℓ
  • We can make $L_T(b, w)$ a differentiable and convex function of v = (b, w) by a suitable choice of ℓ

SLIDE 23

The Cross-Entropy Loss

The Cross-Entropy Loss

$\ell(y, p) \overset{\text{def}}{=} \begin{cases} -\log p & \text{if } y = 1 \\ -\log(1 - p) & \text{if } y = 0 \end{cases}$

  • Base of the log is unimportant: the unit of loss is conventional

[Plot: the two branches of the loss over $p \in [0, 1]$, $-\log p$ for y = 1 and $-\log(1-p)$ for y = 0]

  • Same as $\ell(y, p) = -y \log p - (1 - y)\log(1 - p)$ (the second form is more convenient for differentiation)
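A direct transcription into code (my own sketch; clamping p away from 0 and 1 is an added numerical safeguard, not part of the slide):

import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Cross-entropy loss ℓ(y, p) = -y log p - (1 - y) log(1 - p) for y in {0, 1}."""
    p = np.clip(p, eps, 1.0 - eps)    # avoid log(0) when the score saturates
    return -y * np.log(p) - (1.0 - y) * np.log(1.0 - p)

print(cross_entropy(1, 0.9))   # small loss: confident and correct
print(cross_entropy(0, 0.9))   # large loss: confident and wrong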

SLIDE 24

The Cross-Entropy Loss

The Cross-Entropy Loss

  • Domain: $\{0, 1\} \times [0, 1]$
  • $\ell(1, p) = \ell(0, 1 - p)$
  • $\ell(1, 1/2) = \ell(0, 1/2) = -\log(1/2)$

[Plot: ℓ(y, p) over its domain; the y = 1 branch $-\log p$ and the y = 0 branch $-\log(1 - p)$]

SLIDE 25

The Cross-Entropy Loss

Why Cross-Entropy?

  • Literature (and the Appendix in the class notes) gives an interpretation in terms of information theory
  • A more cogent explanation: with cross-entropy and the logistic function,
  • the risk becomes a convex function of the parameters v = (b, w)
  • the gradient and Hessian of the risk are easy to compute
  • a crucial cancellation occurs when computing derivatives of the risk with respect to the parameters
  • You will be asked to use the gradient and Hessian, and be able to compute them
  • You will not be asked to remember their formulas, or know how to derive them

SLIDE 26

The Cross-Entropy Loss

The Magic

  • Logistic function and loss were chosen to simplify the math
  • Here is the magic:

$L_T(v) = L_T(\ell(s(a(v))))$, so $\nabla L_T = \frac{dL_T}{d\ell}\,\frac{d\ell}{ds}\,\frac{ds}{da}\,\nabla a$

$\ell = -y \log s - (1 - y)\log(1 - s)$, so that $\frac{d\ell}{ds} = \frac{s - y}{s\,(1 - s)}$

$s(a) = \frac{1}{1+e^{-a}}$, so that $\frac{ds}{da} = s\,(1 - s)$

  • Therefore, $\frac{d\ell}{ds}\,\frac{ds}{da} = s - y$

  • This is the cancellation that simplifies everything
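The cancellation can also be checked symbolically (a throwaway verification of my own, assuming sympy is available):

import sympy as sp

y, a = sp.symbols('y a', real=True)
s = 1 / (1 + sp.exp(-a))                          # logistic score as a function of the activation
ell = -y * sp.log(s) - (1 - y) * sp.log(1 - s)    # cross-entropy loss written in terms of s(a)

grad = sp.simplify(sp.diff(ell, a))
print(grad)                          # expect something equivalent to 1/(1 + exp(-a)) - y, i.e. s - y
print(sp.simplify(grad - (s - y)))   # expect 0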

SLIDE 27

The Cross-Entropy Loss

Turning the Crank

  • Gradient of the risk:
    $\nabla L_T(v) = \frac{1}{N} \sum_{n=1}^{N} \left[ s(x_n\,;\,v) - y_n \right] \begin{bmatrix} 1 \\ x_n \end{bmatrix}$
  • Hessian of the risk:
    $H_{L_T}(v) = \frac{1}{N} \sum_{n=1}^{N} s(x_n\,;\,v)\left[ 1 - s(x_n\,;\,v) \right] \begin{bmatrix} 1 \\ x_n \end{bmatrix} \begin{bmatrix} 1 \\ x_n \end{bmatrix}^T$

  • Each term in the summation for HLT is an outer product
  • This implies (easily) that HLT is positive semidefinite
  • LT(v) is a convex function
  • No need to check eigenvalues (See Appendix if you are curious)
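The two formulas translate almost term by term into numpy (a sketch of my own, with assumed argument names; v stacks b on top of w):

import numpy as np

def risk_grad_hess(v, X, y):
    """Gradient and Hessian of the cross-entropy risk for v = (b, w)."""
    N = X.shape[0]
    A = np.hstack([np.ones((N, 1)), X])                # augmented samples [1, x_n], one per row
    s = 1.0 / (1.0 + np.exp(-A @ v))                   # scores s(x_n; v)
    grad = A.T @ (s - y) / N                           # (1/N) Σ [s_n - y_n] [1; x_n]
    hess = (A * (s * (1.0 - s))[:, None]).T @ A / N    # (1/N) Σ s_n(1 - s_n) [1; x_n][1; x_n]^T
    return grad, hess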

SLIDE 28

The Cross-Entropy Loss

Training

  • LT(v) is convex in v ∈ Rm with m = d + 1
  • Use any gradient-based method to minimize
  • When d is not too large, use Newton’s method (homework!)
  • More efficient, problem-specific algorithms exist
  • They capitalize on LT(v) being a sum of squares
  • Typically, train with cross-entropy loss, test with 0-1 loss
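A bare-bones Newton training loop might look as follows (an illustration of my own, not the homework solution; the small ridge term added to the Hessian is my own safeguard against a singular matrix):

import numpy as np

def train_logistic_newton(X, y, steps=20, ridge=1e-8):
    """Minimize the cross-entropy risk with Newton's method; returns v = (b, w)."""
    N = X.shape[0]
    A = np.hstack([np.ones((N, 1)), X])                    # augmented samples [1, x_n]
    v = np.zeros(A.shape[1])
    for _ in range(steps):
        s = 1.0 / (1.0 + np.exp(-A @ v))                   # scores
        grad = A.T @ (s - y) / N                           # gradient of the risk
        hess = (A * (s * (1.0 - s))[:, None]).T @ A / N    # Hessian of the risk
        v -= np.linalg.solve(hess + ridge * np.eye(len(v)), grad)   # Newton step
    return v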

SLIDE 29

Multi-Class Linear Predictors

Multi-Class Linear Predictors

  • Obvious approach 1: One-versus-rest
  • Build K − 1 classifiers ck versus not ck
  • Works for K = 2 but not for K = 3

[Diagram: one-versus-rest boundaries for three classes, with a region marked “?” that no classifier claims]

SLIDE 30

Multi-Class Linear Predictors

Multi-Class Linear Predictors

  • Obvious approach 2: One-versus-one
  • Build $\binom{K}{2}$ classifiers $c_i$ versus $c_j$
  • Works for K = 2 but not for K = 3

[Diagram: pairwise boundaries c1 vs. c2, c1 vs. c3, c2 vs. c3, with a central region marked “?”]

SLIDE 31

Multi-Class Linear Predictors

A Symmetric View of the Binary Score

  • Rename classes 1, 2 rather than 0, 1
  • Activation: $a = b + w^T x$
  • Score for class 1: $s_1(a) = \dfrac{1}{1+e^{-a}}$
  • Score for class 2: $s_2(a) = 1 - s_1(a) = s_1(-a)$
  • More symmetrically, two activations: $a_1 = b + w^T x$, $\;a_2 = -b - w^T x$
  • Note: $\dfrac{1}{1+e^{-a}} = \dfrac{e^{a/2}}{e^{a/2} + e^{-a/2}}$
  • Score for class 1: $s_1 = s(a_1) = \dfrac{e^{a_1/2}}{e^{a_1/2} + e^{-a_1/2}} = \dfrac{e^{a_1/2}}{e^{a_1/2} + e^{a_2/2}}$
  • Score for class 2 (switch $a_1$ with $a_2$): $s_2 = s(a_2) = \dfrac{e^{a_2/2}}{e^{a_1/2} + e^{a_2/2}}$

  • Class with highest score wins

SLIDE 32

Multi-Class Linear Predictors

Exploiting Scalable Activations

  • Score for class $k \in \{1, 2\}$: $s_k = \dfrac{e^{a_k/2}}{e^{a_1/2} + e^{a_2/2}}$
  • Activations are freely scalable, so write $s_k = \dfrac{e^{a_k}}{e^{a_1} + e^{a_2}}$ instead
  • Different function, same separating hyperplane
  • This generalizes; replace 2 classes with K:
    $s_k(x) = \dfrac{e^{a_k(x)}}{\sum_{j=1}^{K} e^{a_j(x)}}$ where $a_k(x) = b_k + w_k^T x$
  • Satisfies $\sum_{k=1}^{K} s_k(x) = 1$
  • Class with highest score wins: $\hat{y} = h(x) \in \arg\max_k s_k(x)$
  • This is the Linear-Regression Multi-Class Classifier
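A compact sketch of this classifier (my own illustration; B collects the K biases and W the K weight vectors, and subtracting the maximum activation is an added numerical safeguard):

import numpy as np

def multiclass_scores(x, B, W):
    """Soft-max scores s_k(x) for activations a_k(x) = b_k + w_k^T x.
    B has shape (K,), W has shape (K, d)."""
    a = B + W @ x                       # K activations
    e = np.exp(a - a.max())             # shifting by the max does not change the scores
    return e / e.sum()                  # scores are positive and sum to 1

def multiclass_predict(x, B, W):
    return int(np.argmax(multiclass_scores(x, B, W)))   # class with the highest score wins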

SLIDE 33

Multi-Class Linear Predictors

The Soft-Max Function

$s_k(x) = \dfrac{e^{a_k(x)}}{\sum_{j=1}^{K} e^{a_j(x)}}$

  • $s_k(x) > 0$ and $\sum_{k=1}^{K} s_k(x) = 1$ for all x
  • If $a_i \gg a_j$ for $j \ne i$, then $\sum_{j=1}^{K} e^{a_j(x)} \approx e^{a_i(x)}$
  • Therefore, $s_i \approx 1$ and $s_j \approx 0$ for $j \ne i$
  • “Brings out the biggest:” soft-max
  • Collect into vectors: $a = (a_1, \ldots, a_K)$, $s = (s_1, \ldots, s_K)$

$x \in \mathbb{R}^d \;\to\; a \in \mathbb{R}^K \;\to\; s \in \mathbb{R}^K$, $\qquad s(a(x)) = \dfrac{e^{a(x)}}{\mathbf{1}^T e^{a(x)}}$, $\qquad \lim_{\alpha \to \infty} a^T s(\alpha a) = \max(a)$
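The “brings out the biggest” behavior is easy to check numerically (a throwaway demonstration of my own):

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())     # shift by the max for numerical stability
    return e / e.sum()

a = np.array([1.0, 3.0, 2.0, -1.0])
for alpha in (1, 10, 100):
    print(alpha, a @ softmax(alpha * a))   # approaches max(a) = 3.0 as alpha grows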

SLIDE 34

Multi-Class Linear Predictors

Geometry of Multi-Class Decision Regions

  • Separating hyperplane for classes $i, j \in \{1, \ldots, K\}$:
    $b_i + w_i^T x = b_j + w_j^T x$ (equal activations ⇒ equal scores)
  • Total of $M = \binom{K}{2}$ hyperplanes, just as in one-versus-one
  • Example: d = 2, K = 4 ⇒ 6 lines on the plane
  • There are degeneracies (the $M \times (d + 1)$ matrix has rank $K - 1$)
  • Crossing a line switches two scores. Example:
    $s_3 > s_2 > s_4 > s_1 \;\to\; s_3 > s_4 > s_2 > s_1$

SLIDE 35

Multi-Class Linear Predictors

Geometry of Decision Regions

  • Crossing a line switches two scores. Example:
    $s_3 > s_2 > s_4 > s_1 \;\to\; s_3 > s_4 > s_2 > s_1$
  • When the top two scores switch, we cross a decision boundary. Example:
    $s_3 > s_2 > s_4 > s_1 \;\to\; s_2 > s_3 > s_4 > s_1$

  • Decision regions are intersections of half-spaces ⇒ convex

SLIDE 36

Multi-Class Linear Predictors

Multi-Class Cross-Entropy Loss

  • Cross-entropy loss for K = 2 (remember that we renamed Y = {0, 1} to Y = {1, 2}):
    $\ell(y, p) \overset{\text{def}}{=} \begin{cases} -\log p & \text{if } y = 1 \\ -\log(1 - p) & \text{if } y = 2 \end{cases} = \begin{cases} -\log p_1 & \text{if } y = 1 \\ -\log p_2 & \text{if } y = 2 \end{cases}$
  • Same as $\ell(y, \mathbf{p}) = -\log p_y$
  • But this is general!
  • Can also write as follows: $\ell(y, \mathbf{p}) = -\sum_{k=1}^{K} q_k(y) \log p_k$
  • $\mathbf{q}$ is the one-hot encoding of y
  • Example: K = 5; then y = 4 is represented by $\mathbf{q} = [0, 0, 0, 1, 0]$
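In code, the “pick out entry y” form and the one-hot form compute the same number (a sketch of my own, with classes numbered 1 through K as on the slide):

import numpy as np

def multiclass_cross_entropy(y, p):
    """ℓ(y, p) = -log p_y for a score vector p over classes 1..K."""
    return -np.log(p[y - 1])            # classes are 1-based on the slide, arrays are 0-based

def multiclass_cross_entropy_onehot(y, p, K):
    q = np.zeros(K)
    q[y - 1] = 1.0                      # one-hot encoding of y
    return -(q * np.log(p)).sum()       # identical to -log p_y

p = np.array([0.1, 0.2, 0.6, 0.05, 0.05])
print(multiclass_cross_entropy(3, p), multiclass_cross_entropy_onehot(3, p, 5))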

SLIDE 37

Multi-Class Linear Predictors

Convex Risk, Again

  • Even with K > 2, the risk is a convex function of $v = (b_1, w_1, \ldots, b_K, w_K) \in \mathbb{R}^m$ with $m = (d + 1)K$
  • Proof analogous to the K = 2 case, just technically more involved
  • Can still use gradient-descent methods, including Newton
