

SLIDE 1

Linear Classifiers and Regressors

“Borrowed” with permission from Andrew Moore (CMU)

SLIDE 2

Single-Parameter Linear Regression

SLIDE 3

Regression vs Classification

  • Regressor:  Input Attributes → Prediction of real-valued output
  • Density Estimator:  Input Attributes → Probability
  • Classifier:  Input Attributes → Prediction of categorical output

SLIDE 4

Linear Regression

Linear regression assumes that the expected value of the output y given an input x, E[y|x], is linear in x. Simplest case: Out(x) = w×x for some unknown w. Challenge: given a dataset, how do we estimate w?

DATASET:

  inputs     outputs
  x1 = 1     y1 = 1
  x2 = 3     y2 = 2.2
  x3 = 2     y3 = 2
  x4 = 1.5   y4 = 1.9
  x5 = 4     y5 = 3.1

SLIDE 5

1-parameter linear regression

Assume the data is generated by yi = w×xi + noisei where…

  • the noise signals are independent
  • the noise has a normal distribution with mean 0 and unknown variance σ²

Then P(y|w,x) has a normal distribution with

  • mean w×x
  • variance σ²
SLIDE 6

Bayesian Linear Regression

P(y|w,x) = Normal(mean w×x ; var σ²)

The datapoints (x1, y1), (x2, y2), …, (xn, yn) are EVIDENCE about w. We want to infer w from the data:

  P(w | x1, x2, …, xn, y1, y2, …, yn)

  • ?? Use BAYES rule to work out a posterior distribution for w given the data ??
  • Or Maximum Likelihood Estimation?
SLIDE 7

Maximum likelihood estimation of w

Question: “For what value of w is this data most likely to have happened?”

What value of w maximizes

  P(y1, y2, …, yn | x1, x2, …, xn, w) = ∏i=1…n P(yi | w, xi) ?

SLIDE 8

  w* = argmax_w ∏i=1…n P(yi | w, xi)

     = argmax_w ∏i=1…n exp( −½ ((yi − w×xi)/σ)² )

     = argmax_w ∑i=1…n −½ ((yi − w×xi)/σ)²

     = argmin_w ∑i=1…n (yi − w×xi)²

SLIDE 9

Linear Regression

Maximum likelihood w minimizes

E(w) = ∑i (yi − w×xi)² = ∑i yi² − 2w ∑i xiyi + w² ∑i xi²

…the sum-of-squares of the residuals ⇒ we need to minimize a quadratic function of w.

[Plot: E(w) versus w, a parabola with a single minimum]

SLIDE 10

Linear Regression

Sum-of-squares minimized when

  w = ∑i xiyi / ∑i xi²

The maximum likelihood model is Out(x) = w×x, which we can use for prediction.

Note: Bayesian stats would provide a probability distribution of w, and predictions would then give a probability distribution of the expected output. It is often useful to know your confidence; max likelihood also provides a kind of confidence!

[Plot: p(w) versus w]
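As a quick numeric sketch (plain Python with NumPy; the variable names are ours), here is the maximum likelihood w = ∑xiyi/∑xi² computed for the five-point dataset from Slide 4:

    import numpy as np

    # Dataset from Slide 4
    x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
    y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

    # Maximum likelihood w for the model Out(x) = w*x:
    # w = sum(x_i * y_i) / sum(x_i^2)
    w = np.sum(x * y) / np.sum(x ** 2)
    print(w)               # approx 0.83

    # Use the maximum likelihood model for prediction
    print(w * 2.5)         # Out(2.5)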

SLIDE 11

Multi-variate Linear Regression

SLIDE 12

Multivariate Regression

What if the inputs are vectors? The dataset has the form:

  x1  y1
  x2  y2
  x3  y3
   :   :
  xR  yR

[Plot: the input is 2-d (x1, x2); the output value is the "height"]

SLIDE 13

Multivariate Regression

R datapoints; each input has m components. As matrices:

      ⎡ x11 x12 … x1m ⎤        ⎡ y1 ⎤
  X = ⎢ x21 x22 … x2m ⎥    Y = ⎢ y2 ⎥
      ⎢  :   :  …  :  ⎥        ⎢  : ⎥
      ⎣ xR1 xR2 … xRm ⎦        ⎣ yR ⎦

The linear regression model assumes ∃ a vector w s.t.

  Out(x) = wᵀx = w1 x[1] + w2 x[2] + … + wm x[m]

  • Max. likelihood w = (XᵀX)⁻¹(XᵀY)

IMPORTANT EXERCISE: PROVE IT!!!!!
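A minimal sketch of that closed form in NumPy (the four-point dataset is made up for illustration; solving the linear system is preferred over forming the inverse explicitly):

    import numpy as np

    # R = 4 datapoints, m = 2 input components (illustrative values)
    X = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.0],
                  [4.0, 2.5]])
    Y = np.array([5.0, 3.5, 6.0, 9.5])

    # Max. likelihood w = (X^T X)^(-1) (X^T Y)
    w = np.linalg.solve(X.T @ X, X.T @ Y)

    # Predict: Out(x) = w^T x
    x_new = np.array([2.0, 1.0])
    print(w, w @ x_new)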

SLIDE 14

Multivariate Regression (con’t)

The max. likelihood w is w = (XᵀX)⁻¹(XᵀY)

  • XᵀX is an m×m matrix: its i,j'th element is ∑k=1…R xki xkj
  • XᵀY is an m-element vector: its i'th element is ∑k=1…R xki yk

SLIDE 15

Constant Term in Linear Regression

SLIDE 16

What about a constant term?

What if the linear data does not go through the origin (0,0,…,0)?

Statisticians and Neural Net folks all agree on a simple, obvious hack.

Can you guess??

SLIDE 17

The constant term

  • Trick: create a fake input "X0" that always takes the value 1

Before (X1 X2 Y):        After (X0 X1 X2 Y):
  2  4  16                 1  2  4  16
  3  4  17                 1  3  4  17
  5  5  20                 1  5  5  20

Before: Y = w1X1 + w2X2 …is a poor model
After:  Y = w0X0 + w1X1 + w2X2 = w0 + w1X1 + w2X2 …is a good model!

Here, you should be able to see the MLE w0, w1, w2 by inspection.
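A minimal sketch of the trick in NumPy: prepend a column of 1's, then solve as before. On this table the fit is exact, so the solver recovers the by-inspection answer (w0 = 10, w1 = 1, w2 = 1):

    import numpy as np

    X = np.array([[2.0, 4.0],
                  [3.0, 4.0],
                  [5.0, 5.0]])
    Y = np.array([16.0, 17.0, 20.0])

    # The trick: a fake input X0 that always takes value 1
    X_aug = np.column_stack([np.ones(len(X)), X])

    # Least squares fit (exact here)
    w, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
    print(w)   # [10. 1. 1.]  ->  Y = 10 + X1 + X2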

SLIDE 18

Linear Regression with varying noise

Heteroscedasticity...

SLIDE 19

Regression with varying noise

  • Suppose you know the variance of the noise that was added to each datapoint:

   xi    yi    σi²
   ½     ½     4
   1     1     1
   2     1     ¼
   2     3     4
   3     2     ¼

[Plot: the datapoints with their differing noise levels σi]

Assume  yi ~ N(w×xi, σi²)

What's the MLE estimate of w?

SLIDE 20

MLE estimation with varying noise

  argmax_w log p(y1, …, yR | x1, …, xR, σ1², …, σR², w)

= argmin_w ∑i=1…R (yi − w×xi)² / σi²

  (assuming i.i.d. data, then plugging in the equation for a Gaussian and simplifying)

Setting dLL/dw equal to zero:

  ∑i=1…R (xi/σi²)(yi − w×xi) = 0

Trivial algebra then gives:

  w = ( ∑i=1…R xiyi/σi² ) / ( ∑i=1…R xi²/σi² )
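A quick numeric sketch of that closed form, using the table from Slide 19:

    import numpy as np

    x  = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
    y  = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
    s2 = np.array([4.0, 1.0, 0.25, 4.0, 0.25])   # sigma_i^2 for each point

    # MLE with varying noise: w = sum(x*y/s2) / sum(x^2/s2)
    w = np.sum(x * y / s2) / np.sum(x ** 2 / s2)
    print(w)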

SLIDE 21

This is Weighted Regression

  • How do we minimize the weighted sum of squares?

  argmin_w ∑i=1…R (yi − w×xi)² / σi²

…where the weight for the i'th datapoint is 1/σi².

[Plot: the datapoints with their differing noise levels σi]

SLIDE 22

Weighted Multivariate Regression

The max. likelihood w is w = (WXᵀWX)⁻¹(WXᵀWY)

  • (WXᵀWX) is an m×m matrix: its i,j'th element is ∑k=1…R xki xkj / σk²
  • (WXᵀWY) is an m-element vector: its i'th element is ∑k=1…R xki yk / σk²
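A minimal NumPy sketch (illustrative data): scaling each row of X and Y by 1/σk reproduces exactly the element sums above.

    import numpy as np

    X  = np.array([[1.0, 2.0],
                   [2.0, 1.0],
                   [3.0, 0.5]])
    Y  = np.array([4.0, 5.0, 6.5])
    s2 = np.array([1.0, 0.25, 4.0])      # per-datapoint noise variances

    # Row k of WX is x_k / sigma_k, so (WX)^T(WX) has i,j'th element
    # sum_k x_ki x_kj / sigma_k^2, matching the slide.
    WX = X / np.sqrt(s2)[:, None]
    WY = Y / np.sqrt(s2)
    w = np.linalg.solve(WX.T @ WX, WX.T @ WY)
    print(w)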

SLIDE 23

Non-linear Regression

(Digression…)

SLIDE 24

Non-linear Regression

Suppose y is related to a function of x such that the predicted values depend non-linearly on w:

   xi    yi
   ½     ½
   1     2.5
   2     3
   3     2
   3     3

[Plot: the five datapoints]

Assume  yi ~ N( √(w + xi), σ² )

What's the MLE estimate of w?

SLIDE 25

Non-linear MLE estimation

  argmax_w log p(y1, …, yR | x1, …, xR, σ, w)

= argmin_w ∑i=1…R ( yi − √(w + xi) )²

  (assuming i.i.d. data, then plugging in the equation for a Gaussian and simplifying)

Setting dLL/dw equal to zero gives w such that:

  ∑i=1…R ( yi − √(w + xi) ) / √(w + xi) = 0

We're down the algebraic toilet. So guess what we do?

Common (but not the only) approach: numerical solutions.

  • Line Search
  • Simulated Annealing
  • Gradient Descent
  • Conjugate Gradient
  • Levenberg-Marquardt
  • Newton's Method

Also, special-purpose statistical-optimization-specific tricks such as E.M. (see the Gaussian Mixtures lecture for an introduction).

SLIDE 26

GRADIENT DESCENT

Goal: find a local minimum of f: ℜ → ℜ

Approach:

1. Start with some value for w
2. GRADIENT DESCENT:  w ← w − η ∂f(w)/∂w
3. Iterate … until bored …

η = LEARNING RATE = a small positive number, e.g. η = 0.05 ("Good default value for anything!")

QUESTION: Justify the Gradient Descent Rule
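A minimal sketch of the rule on a toy function (the function, iteration count, and stopping choice are ours):

    # Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3
    def f_prime(w):
        return 2.0 * (w - 3.0)        # df/dw

    w = 0.0                           # 1. start with some value for w
    eta = 0.05                        # the learning rate
    for _ in range(500):              # 3. iterate ... until bored ...
        w = w - eta * f_prime(w)      # 2. the gradient descent rule
    print(w)                          # close to 3.0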

SLIDE 27

Gradient Descent in “m” Dimensions

Given f(w): ℜᵐ → ℜ, the gradient

  ∇w f(w) = ( ∂f/∂w1, …, ∂f/∂wm )ᵀ

points in the direction of steepest ascent.

GRADIENT DESCENT RULE:  w ← w − η ∇w f(w)

Equivalently:

  wj ← wj − η ∂f(w)/∂wj   …where wj is the j'th weight

"just like a linear feedback system"

SLIDE 28

Linear Perceptron

SLIDE 29

Linear Perceptrons

Multivariate linear models:  Out(x) = wᵀx

"Training" ≡ minimizing the sum-of-squared residuals…

  E = ∑k ( Out(xk) − yk )² = ∑k ( wᵀxk − yk )²

…by gradient descent → the perceptron training rule

SLIDE 30

Linear Perceptron Training Rule

  E = ∑k=1…R (wᵀxk − yk)²

Gradient descent: to minimize E, update w …

  wj ← wj − η ∂E/∂wj

  • So what's ∂E/∂wj ?

  ∂E/∂wj = ∂/∂wj ∑k=1…R (wᵀxk − yk)²

         = ∑k 2(wᵀxk − yk) ∂/∂wj (wᵀxk − yk)

         = −2 ∑k δk ∂/∂wj wᵀxk      …where δk = yk − wᵀxk

         = −2 ∑k δk ∂/∂wj ∑i=1…m wi xki

         = −2 ∑k=1…R δk xkj

SLIDE 31

Linear Perceptron Training Rule

  E = ∑k=1…R (wᵀxk − yk)²

Gradient descent: to minimize E, update w …

  wj ← wj − η ∂E/∂wj

…where…

  ∂E/∂wj = −2 ∑k=1…R δk xkj

…so the update is:

  wj ← wj + 2η ∑k=1…R δk xkj

We frequently neglect the 2 (meaning we halve the learning rate).

SLIDE 32

The “Batch” perceptron algorithm

1) Randomly initialize weights w1, w2, …, wm
2) Get your dataset (append 1's to the inputs to avoid going through the origin)
3) for i = 1 to R:  δi := yi − wᵀxi
4) for j = 1 to m:  wj ← wj + η ∑i=1…R δi xij
5) if ∑ δi² stops improving then stop; else loop back to 3.
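A minimal NumPy sketch of the batch algorithm (dataset, learning rate, and stopping tolerance are our own choices):

    import numpy as np

    # Toy dataset; step 2: append 1's to the inputs
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 1.5]])
    Y = np.array([5.2, 4.1, 9.0, 7.6])
    X = np.column_stack([np.ones(len(X)), X])

    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])    # step 1: random initial weights
    eta = 0.01
    prev_err = np.inf
    while True:
        delta = Y - X @ w              # step 3: delta_i = y_i - w^T x_i
        w = w + eta * X.T @ delta      # step 4: w_j += eta * sum_i delta_i x_ij
        err = np.sum(delta ** 2)
        if err >= prev_err - 1e-12:    # step 5: stop when no longer improving
            break
        prev_err = err
    print(w)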

SLIDE 33

  δ ← y − wᵀx
  wj ← wj + η δ xj

A RULE KNOWN BY MANY NAMES:

  • The LMS rule
  • The delta rule
  • The Widrow-Hoff rule
  • Classical conditioning
  • The adaline rule

SLIDE 34

If data is voluminous and arrives fast

If input-output pairs (x, y) come in very quickly, don't bother remembering old ones; just keep using new ones.

  • Observe (x, y)
  • δ ← y − wᵀx
  • ∀j: wj ← wj + η δ xj
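A minimal sketch of the streaming version, assuming an iterable of (x, y) pairs (the generator below is fabricated for illustration):

    import numpy as np

    def online_lms(stream, m, eta=0.05):
        """Never stores old pairs; just keeps updating on new ones."""
        w = np.zeros(m)
        for x, y in stream:              # observe (x, y)
            delta = y - w @ x            # delta <- y - w^T x
            w = w + eta * delta * x      # for all j: w_j += eta * delta * x_j
        return w

    # Example: a noisy stream generated from true weights (2, -1)
    rng = np.random.default_rng(0)
    stream = [(x, x @ np.array([2.0, -1.0]) + 0.01 * rng.normal())
              for x in rng.normal(size=(1000, 2))]
    print(online_lms(stream, m=2))       # close to [2, -1]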

SLIDE 35

Gradient Descent vs Matrix Inversion for Linear Perceptrons

GD Advantages (MI disadvantages):

  • Biologically plausible
  • With very very many attributes, each iteration costs only O(mR). If fewer than m iterations are needed we've beaten Matrix Inversion
  • More easily parallelizable (or implementable in wetware)?

GD Disadvantages (MI advantages):

  • It's moronic
  • It's essentially a slow implementation of a way to build the XᵀX matrix and then solve a set of linear equations
  • If m is small it's especially outrageous. If m is large then the direct matrix inversion method gets fiddly but not impossible if you want to be efficient.
  • Hard to choose a good learning rate
  • Matrix inversion takes predictable time. You can't be sure when gradient descent will stop.

SLIDE 36

Gradient Descent vs Matrix Inversion for Linear Perceptrons

GD Advantages (MI disadvantages):

  • Biologically plausible
  • With very very many attributes, each iteration costs only O(mR). If fewer than m iterations are needed, faster than Matrix Inversion
  • More easily parallelizable (or implementable in wetware)?

GD Disadvantages (MI advantages):

  • It's moronic
  • It's essentially a slow implementation of a way to build the XᵀX matrix, then solve a set of linear equations
  • If m is small it's especially outrageous. If m is large then the direct matrix inversion method gets fiddly but not impossible if you want to be efficient.
  • Hard to choose a good learning rate
  • Matrix inversion takes predictable time. You can't be sure when gradient descent will stop.

But we'll soon see that GD has an important extra trick up its sleeve…

SLIDE 37

Linear Perceptron …for Classification

SLIDE 38

Regression vs Classification

  • Regressor:  Input Attributes → Prediction of real-valued output
  • Density Estimator:  Input Attributes → Probability
  • Classifier:  Input Attributes → Prediction of categorical output

SLIDE 39

Perceptrons for Classification

What if all outputs are 0's or 1's?

We can do a linear fit. Our prediction is

  0 if Out(x) ≤ ½
  1 if Out(x) > ½

[Plot: Blue = Out(x); Green = Classification]

WHAT'S THE BIG PROBLEM WITH THIS???

SLIDE 40

Classification with Perceptrons I

Don't minimize ∑i (wᵀxi − yi)² !

Instead, minimize the number of misclassifications:

  ∑i ( Round(wᵀxi) − yi )

…where Round(x) = −1 if x < 0, and 1 if x ≥ 0.

[Assume outputs are +1 & −1, not +1 & 0]

New gradient descent rule:

  • if (xi, yi) is correctly classified: don't change w
  • if wrongly predicted as 1:  w ← w − xi
  • if wrongly predicted as −1: w ← w + xi

NOTE: CUTE & NON-OBVIOUS WHY THIS WORKS!!
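A minimal sketch of that rule (NumPy; the toy dataset, bias input, and pass count are our own choices; labels are +1/−1 as the slide assumes):

    import numpy as np

    # Linearly separable toy data, with a bias input of 1 appended
    X = np.array([[ 1.0,  2.0, 1.0],
                  [ 2.0,  3.0, 1.0],
                  [-1.0, -1.5, 1.0],
                  [-2.0, -1.0, 1.0]])
    y = np.array([1, 1, -1, -1])

    w = np.zeros(3)
    for _ in range(100):                      # a fixed number of passes
        for xi, yi in zip(X, y):
            pred = 1 if w @ xi >= 0 else -1   # Round(w^T x)
            if pred == yi:
                continue                      # correctly classified: no change
            elif pred == 1:                   # wrongly predicted as +1
                w = w - xi
            else:                             # wrongly predicted as -1
                w = w + xi
    print(w)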

SLIDE 41

Classification with Perceptrons II: Sigmoid Functions

[Plots: the least squares fit is useless here; another fit classifies better, but it's not the least squares fit!]

SOLUTION: Instead of Out(x) = wᵀx we'll use Out(x) = g(wᵀx), where g: ℜ → (0,1) is a squashing function.

SLIDE 42

The Sigmoid

  g(h) = 1 / (1 + exp(−h))

Rotating the curve 180° centered on (0, ½) produces the same curve, i.e. g(h) = 1 − g(−h). Can you prove this?
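Not a proof, but a quick numeric spot check of the identity:

    import numpy as np

    def g(h):
        return 1.0 / (1.0 + np.exp(-h))      # the sigmoid

    h = np.linspace(-5, 5, 11)
    print(np.allclose(g(h), 1.0 - g(-h)))    # True: g(h) = 1 - g(-h)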

SLIDE 43

The Sigmoid

Choose w to minimize

  ∑i=1…R ( yi − Out(xi) )² = ∑i=1…R ( yi − g(wᵀxi) )²

…where g(h) = 1 / (1 + exp(−h)).

SLIDE 44

Linear Perceptron Classification Regions

[Diagram: points labeled 0 and 1 in the (X1, X2) plane]

Use the model:  Out(x) = g( wᵀ(1, x) ) = g( w0 + w1x1 + w2x2 )

In the diagram… which region is classified +1, and which 0??

SLIDE 45

Gradient descent with sigmoid on a perceptron

First note that g′(x) = g(x)(1 − g(x)).

Proof: g(x) = 1/(1 + e⁻ˣ), so

  g′(x) = e⁻ˣ / (1 + e⁻ˣ)²
        = (1/(1 + e⁻ˣ)) × (e⁻ˣ/(1 + e⁻ˣ))
        = g(x) (1 − 1/(1 + e⁻ˣ))
        = g(x)(1 − g(x))

Since Out(x) = g(∑k wk xk):

  E = ∑i ( yi − g(∑k wk xik) )²

  ∂E/∂wj = ∑i 2( yi − g(∑k wk xik) ) × ( −g′(∑k wk xik) ) xij
         = −2 ∑i δi g′(neti) xij
         = −2 ∑i δi gi(1 − gi) xij

…where neti = ∑k wk xik, gi = g(neti), and δi = yi − Out(xi) = yi − gi.

The sigmoid perceptron update rule:

  wj ← wj + η ∑i=1…R δi gi (1 − gi) xij

…where gi = g(∑j=1…m wj xij) and δi = yi − gi
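A minimal NumPy sketch of the update rule as a training loop (the data, learning rate, and iteration count are our own choices):

    import numpy as np

    def g(h):
        return 1.0 / (1.0 + np.exp(-h))

    # Toy 0/1 classification data with a bias input of 1 appended
    X = np.array([[ 1.0,  2.0, 1.0],
                  [ 2.0,  3.0, 1.0],
                  [-1.0, -1.5, 1.0],
                  [-2.0, -1.0, 1.0]])
    y = np.array([1.0, 1.0, 0.0, 0.0])

    w = np.zeros(3)
    eta = 0.5
    for _ in range(2000):
        g_i = g(X @ w)                  # g_i = g(sum_j w_j x_ij)
        delta = y - g_i                 # delta_i = y_i - g_i
        # w_j <- w_j + eta * sum_i delta_i * g_i * (1 - g_i) * x_ij
        w = w + eta * X.T @ (delta * g_i * (1.0 - g_i))
    print(np.round(g(X @ w)))           # predicted classes, expected [1. 1. 0. 0.]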

SLIDE 46

Other Things about Perceptrons

  • Invented and popularized by Rosenblatt (1962)
  • Even with the sigmoid nonlinearity, correct convergence is guaranteed!
  • Stable behavior for overconstrained and underconstrained problems

SLIDE 47

Perceptrons and Boolean Functions

If inputs are all 0’s and 1’s and outputs are all 0’s and 1’s…

  • Can learn the function x1 ∧ x2
  • Can learn the function x1 ∨ x2
  • Can learn any conjunction of literals, e.g. x1 ∧ ~x2 ∧ ~x3 ∧ x4 ∧ x5

QUESTION: WHY?

[Diagrams: the X1 ∧ X2 and X1 ∨ X2 functions as labeled points in the (X1, X2) plane]

SLIDE 48

Perceptrons and Boolean Functions

  • Can learn any disjunction of literals, e.g. x1 ∨ ~x2 ∨ ~x3 ∨ x4 ∨ x5
  • Can learn the majority function: f(x1, x2, …, xn) = 1 if n/2 or more of the xi's are 1, and 0 if fewer than n/2 of the xi's are 1
  • What about the exclusive-or function? f(x1, x2) = x1 ⊕ x2 = (x1 ∧ ~x2) ∨ (~x1 ∧ x2)