Linear Regression, Regularization, Bias-Variance Tradeoff


SLIDE 1
  • Linear Regression, Regularization, Bias-Variance Tradeoff

HTF: Ch 3, 7; B: Ch 3

Thanks to C Guestrin, T Dietterich, R Parr, N Ray

SLIDE 2
  • Outline

Linear Regression
  MLE = Least Squares!
  Basis functions

Evaluating Predictors
  Training set error vs Test set error
  Cross Validation

Model Selection
  Bias-Variance analysis
  Regularization, Bayesian Model

SLIDE 3
  • What is the best choice of polynomial?

Noisy Source Data

SLIDE 4
  • Fit using degrees 0, 1, 3, 9
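A minimal numpy sketch of this experiment (assumptions: the noisy sinusoid that appears later in the deck, Slide 9, as the source, and plain least-squares polynomial fitting). Note that training error alone keeps shrinking with degree, which is exactly the trap the next slides discuss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy source data: 20 samples of the sinusoid used later in the deck
x = rng.uniform(0, 10, size=20)
t = x + 2 * np.sin(1.5 * x) + rng.normal(0, 0.2, size=20)

for degree in (0, 1, 3, 9):
    coeffs = np.polyfit(x, t, deg=degree)   # least-squares polynomial fit
    train_mse = np.mean((t - np.polyval(coeffs, x)) ** 2)
    print(f"degree {degree}: training MSE = {train_mse:.4f}")
```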
SLIDE 5
  • Comparison

Degree 9 is the best match to the samples (over-fitting)

Degree 3 is the best match to the source

Performance on test data:

SLIDE 6
  • What went wrong?

A bad choice of polynomial? Not enough data?

Yes

SLIDE 7
  • Terms

x – input variable

x* – new input variable

h(x) – “truth” – underlying response function

t = h(x) + ε – actual observed response

y(x; D) – predicted response, based on model learned from dataset D

ȳ(x) = ED[ y(x; D) ] – expected response, averaged over (models based on) all datasets

SLIDE 8
  • Bias-Variance Analysis in Regression

Observed value is t(x) = h(x) + ε

ε ~ N(0, σ²)

normally distributed: mean 0, variance σ²

Note: h(x) = E[ t(x) | x ]

Given training examples, D = {(xi, ti)},

let y(·) = y(·; D) be the predicted function, based on a model learned using D

Eg, linear model yw(x) = w ⋅ x + w0, using w = MLE(D)

SLIDE 9
  • Example: 20 points

t = x + 2 sin(1.5x) + N(0, 0.2)

SLIDE 10
  • Bias-Variance Analysis

Given a new data point x*,

predicted response: y(x*)

observed response: t* = h(x*) + ε

The expected prediction error is …

SLIDE 11
  • Expected Loss

[y(x) – t]² = [y(x) – h(x) + h(x) – t]²
= [y(x) – h(x)]² + 2 [y(x) – h(x)] [h(x) – t] + [h(x) – t]²

First term: mismatch between OUR hypothesis y(·) & target h(·) … we can influence this
Cross term: expected value is 0, as h(x) = E[t|x]
Last term: noise in distribution of target … nothing we can do

Eerr = ∫∫ [y(x) – t]² p(x,t) dx dt
= ∫ {y(x) − h(x)}² p(x) dx + ∫∫ {h(x) − t}² p(x,t) dx dt

SLIDE 12
  • Relevant Part of Loss

Really, y(x) = y(x; D) is fit to data D … so consider the expectation over datasets D

Let ȳ(x) = ED[ y(x; D) ]

ED[ {h(x) – y(x; D)}² ]
= ED[ {h(x) – ȳ(x) + ȳ(x) – y(x; D)}² ]
= ED[ {h(x) – ȳ(x)}² ] + 2 ED[ {h(x) – ȳ(x)} {ȳ(x) – y(x; D)} ] + ED[ {y(x; D) – ȳ(x)}² ]
= {h(x) – ȳ(x)}² + ED[ {y(x; D) – ȳ(x)}² ]
(the cross term vanishes, since ED[ y(x; D) ] = ȳ(x))

Bias²   Variance

Recall Eerr = ∫ {y(x) − h(x)}² p(x) dx + ∫∫ {h(x) − t}² p(x,t) dx dt; this decomposes the first term.
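The decomposition is easy to check numerically. A sketch under the deck's running example (the source function and noise level come from Slide 9; the polynomial degrees are my choice for illustration): fit many datasets, then estimate bias² and variance of the predictions at a fixed query point x*:

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):                        # "truth": underlying response function
    return x + 2 * np.sin(1.5 * x)

x_star, sigma = 5.0, 0.2         # query point and noise level
n_datasets, n_points = 1000, 20

for degree in (1, 3, 9):
    preds = []
    for _ in range(n_datasets):              # many datasets D
        x = rng.uniform(0, 10, n_points)
        t = h(x) + rng.normal(0, sigma, n_points)
        w = np.polyfit(x, t, degree)         # y(.; D)
        preds.append(np.polyval(w, x_star))  # y(x*; D)
    preds = np.array(preds)
    y_bar = preds.mean()                     # ȳ(x*) = ED[ y(x*; D) ]
    bias2 = (y_bar - h(x_star)) ** 2         # {h(x*) – ȳ(x*)}²
    variance = preds.var()                   # ED[ {y(x*; D) – ȳ(x*)}² ]
    print(f"degree {degree}: bias^2 = {bias2:.4f}  variance = {variance:.4f}")
```

Low degrees show large bias² and small variance; high degrees the reverse, matching the slides that follow.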

SLIDE 13
  • 50 fits (20 examples each)
SLIDE 14
  • Bias, Variance, Noise
SLIDE 15
  • Understanding Bias

Measures how well our approximation architecture can fit the data

Weak approximators (e.g. low degree polynomials) will have high bias

Strong approximators (e.g. high degree polynomials) will have lower bias

Bias² at x: { ȳ(x) – h(x) }²

SLIDE 16
  • Understanding Variance

No direct dependence on target values. For a fixed size of D:

Strong approximators tend to have more variance
… different datasets will lead to DIFFERENT predictors

Weak approximators tend to have less variance
… slightly different datasets may lead to SIMILAR predictors

Variance will typically disappear as |D| → ∞

Variance at x: ED[ { y(x; D) – ȳ(x) }² ]

SLIDE 17
  • Summary of Bias, Variance, Noise

Eerr = E[ (t* – y(x*))² ]
= E[ (y(x*) – ȳ(x*))² ] + (ȳ(x*) – h(x*))² + E[ (t* – h(x*))² ]
= Var( y(x*) ) + Bias( y(x*) )² + Noise

Expected prediction error = Variance + Bias² + Noise

SLIDE 18
  • Bias, Variance, and Noise

Bias: ȳ(x*) – h(x*)
the error of the average model ȳ(x*) [averaged over datasets]

Variance: ED[ ( yD(x*) – ȳ(x*) )² ]
How much yD(x*) varies from one training set D to another

Noise: E[ (t* – h(x*))² ] = E[ε²] = σ²
How much t* varies from h(x*), as t* = h(x*) + ε
Error even given a PERFECT model and ∞ data

SLIDE 19
  • 50 fits (20 examples each)
SLIDE 20
  • Predictions at x=2.0
SLIDE 21
  • 50 fits (20 examples each)
SLIDE 22
  • Predictions at x=5.0

[Figure: spread of predictions, with Bias and Variance marked relative to the true value]

SLIDE 23
  • Observed Responses at x=5.0

[Figure: spread of observed responses, illustrating Noise]

SLIDE 24
  • Model Selection: Bias-Variance

C1 “more expressive than” C2
iff everything representable in C2 is representable in C1: “C2 ⊂ C1”

Eg, LinearFns ⊂ QuadraticFns
0-HiddenLayerNNs ⊂ 1-HiddenLayerNNs

Can ALWAYS get a better fit using C1 over C2

But … sometimes better to look for y ∊ C2

SLIDE 25
  • Standard Plots…
SLIDE 26
  • Why?

C2 ⊂ C1
∀ y ∊ C2, ∃ y′ ∊ C1 that is at-least-as-good-as y

But given a limited sample, we might not find this best y′

Approach: consider Bias² + Variance!!

SLIDE 27
  • Bias-Variance Tradeoff – Intuition

Model too “simple”
  does not fit the data well
  … a biased solution

Model too complex
  small changes to the data change the predictor a lot
  … a high-variance solution

SLIDE 28
  • Bias-Variance Tradeoff

Choice of hypothesis class introduces learning bias

More complex class ⇒ less bias
More complex class ⇒ more variance

SLIDE 29
  • [Figure: error vs model complexity, with curves labeled ~Bias² and ~Variance]

SLIDE 30
  • Behavior of test-sample and training-sample error as a function of model complexity

Light blue curves show the training error err; light red curves show the conditional test error ErrT, for 100 training sets of size 50 each.

Solid curves = expected test error Err and expected training error E[err].

SLIDE 31
  • Empirical Study…

Based on different regularizers

SLIDE 32
  • Effect of Algorithm Parameters on Bias and Variance

k-nearest neighbor:

increasing k typically

increases bias and reduces variance

decision trees of depth D:

increasing D typically

increases variance and reduces bias

RBF SVM with parameter σ:

increasing σ typically

increases bias and reduces variance
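The k-nearest-neighbor row is easy to see directly. A sketch (scikit-learn is an assumption; the deck names no library, and the data is the running sinusoid): small k tracks the noise, large k smooths it away.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, (200, 1))
t = x.ravel() + 2 * np.sin(1.5 * x.ravel()) + rng.normal(0, 0.2, 200)

x_grid = np.linspace(0, 10, 5).reshape(-1, 1)
for k in (1, 5, 50):
    knn = KNeighborsRegressor(n_neighbors=k).fit(x, t)
    print(f"k={k:2d}:", np.round(knn.predict(x_grid), 2))
# Small k: jagged, high-variance predictor; large k: smooth, higher-bias predictor.
```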

SLIDE 33
  • Least Squares Estimator

Truth: f(x) = xᵀβ
Observed: y = f(x) + ε, E[ε] = 0

Least squares estimator: f̂(x0) = x0ᵀβ̂, where β̂ = (XᵀX)⁻¹Xᵀy
(X is the design matrix whose rows x1, …, xk are the datapoints)

Unbiased: f(x0) = E[ f̂(x0) ]:
f(x0) – E[ f̂(x0) ]
= x0ᵀβ − E[ x0ᵀ(XᵀX)⁻¹Xᵀy ]
= x0ᵀβ − E[ x0ᵀ(XᵀX)⁻¹Xᵀ(Xβ + ε) ]
= x0ᵀβ − E[ x0ᵀβ + x0ᵀ(XᵀX)⁻¹Xᵀε ]
= x0ᵀβ − x0ᵀβ − x0ᵀ(XᵀX)⁻¹Xᵀ E[ε]
= 0
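A direct numerical check of the estimator (a minimal sketch; the data-generating β and dimensions are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
beta = np.array([2.0, -1.0, 0.5])             # true coefficients (hypothetical)

X = rng.normal(size=(N, p))                   # rows x_i are the datapoints
y = X @ beta + rng.normal(0, 0.2, N)          # y = f(x) + eps, E[eps] = 0

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (X^T X)^{-1} X^T y, via a solve
x0 = rng.normal(size=p)
print("f_hat(x0) =", x0 @ beta_hat, "  f(x0) =", x0 @ beta)
```

Solving the normal equations with `np.linalg.solve` is the idiomatic route; explicitly inverting XᵀX is numerically worse.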

SLIDE 34
  • Gauss-Markov Theorem

Least squares estimator f̂(x0) = x0ᵀ(XᵀX)⁻¹Xᵀy
… is unbiased: f(x0) = E[ f̂(x0) ]
… is linear in y: f̂(x0) = c0ᵀy, where c0ᵀ = x0ᵀ(XᵀX)⁻¹Xᵀ

Gauss-Markov Theorem:
The least squares estimate has the minimum variance among all linear unbiased estimators.

BLUE: Best Linear Unbiased Estimator

Interpretation: Let g(x0) be any other estimator of f(x0) that is
unbiased … ie, E[ g(x0) ] = f(x0)
and linear in y … ie, g(x0) = cᵀy;
then Var[ f̂(x0) ] ≤ Var[ g(x0) ]

SLIDE 35
  • Variance of Least Squares Estimator

Least squares estimator: f̂(x0) = x0ᵀβ̂, β̂ = (XᵀX)⁻¹Xᵀy
Model: y = f(x) + ε, E[ε] = 0, var(ε) = σε²

Variance:
E[ ( f̂(x0) – E[ f̂(x0) ] )² ] = E[ ( f̂(x0) – f(x0) )² ]
= E[ ( x0ᵀ(XᵀX)⁻¹Xᵀy − x0ᵀβ )² ]
= E[ ( x0ᵀ(XᵀX)⁻¹Xᵀ(Xβ + ε) − x0ᵀβ )² ]
= E[ ( x0ᵀβ + x0ᵀ(XᵀX)⁻¹Xᵀε − x0ᵀβ )² ]
= E[ ( x0ᵀ(XᵀX)⁻¹Xᵀε )² ]
= σε² p/N, on average over the training inputs … in the “in-sample error” model …
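The σε² p/N rate can be verified by simulation (a sketch of the in-sample model: the design X is held fixed, only the noise is redrawn, and the prediction variance is averaged over the training inputs; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, sigma = 50, 5, 0.5
beta = rng.normal(size=p)
X = rng.normal(size=(N, p))                      # fixed design ("in-sample" model)

preds = []
for _ in range(2000):                            # many noise draws
    y = X @ beta + rng.normal(0, sigma, N)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    preds.append(X @ beta_hat)                   # predictions at the training inputs

var_avg = np.array(preds).var(axis=0).mean()     # average variance over the x_i
print(f"simulated: {var_avg:.4f}   theory sigma^2 p/N: {sigma**2 * p / N:.4f}")
```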

SLIDE 36
  • Trading off Bias for Variance

What is the best estimator for the given linear additive model?

Least squares estimator f̂(x0) = x0ᵀβ̂, β̂ = (XᵀX)⁻¹Xᵀy
is BLUE: Best Linear Unbiased Estimator

Optimal variance among linear unbiased estimators. But variance is O( p / N ) …

So with FEWER features, smaller variance …
… albeit with some bias??

SLIDE 37
  • Feature Selection

LS solution can have large variance
variance ∝ p (#features)

Decrease p ⇒ decrease variance …
but increase bias

If it decreases test error, do it!
Feature selection

A small #features also means: easy to interpret

SLIDE 38
  • Statistical Significance Test

Y = β0 + Σj βj Xj.  Q: Which Xj are relevant?
A: Use statistical hypothesis testing!

Use simple model: Y = β0 + Σj βj Xj + ε, ε ~ N(0, σε²)

Here: β̂ ~ N( β, (XᵀX)⁻¹ σε² )

z-score: zj = β̂j / ( σ̂ √vj )
where vj is the jth diagonal element of (XᵀX)⁻¹
and σ̂² = 1/(N − p − 1) Σi=1..N ( yi − ŷi )²

  • Keep variable Xj if zj is large …
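A sketch computing the z-scores (the design, coefficients, and noise level are invented for illustration; intercept handling is omitted for brevity, but the slide's N − p − 1 degrees of freedom are kept):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 4
beta = np.array([1.5, 0.0, -2.0, 0.0])        # two irrelevant features (hypothetical)
X = rng.normal(size=(N, p))
y = X @ beta + rng.normal(0, 1.0, N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - p - 1))   # sigma-hat from the slide
v = np.diag(XtX_inv)                               # v_j = [(X^T X)^{-1}]_{jj}
z = beta_hat / (sigma_hat * np.sqrt(v))            # z_j = beta_hat_j / (sigma-hat sqrt(v_j))
print(np.round(z, 2))                              # large |z_j|  =>  keep X_j
```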

SLIDE 39
  • Measuring Bias and Variance

In practice (unlike in theory), we have only ONE training set D

Simulate multiple training sets by bootstrap replicates:

D’ = { x | x is drawn at random with replacement from D }, with |D’| = |D|

SLIDE 40
  • Estimating Bias / Variance

SLIDE 41
  • Estimating Bias / Variance

  • Each Si is a bootstrap replicate
  • Ti = S \ Si
  • hi = hypothesis, based on Si

SLIDE 42
  • Average Response for each xi

For each point xr, average the predictions of the hypotheses hi for which xr is out-of-bag (xr ∈ Ti):

h̄(xr) = 1/kr Σ{i: xr ∈ Ti} hi(xr), where kr = #{ i : xr ∈ Ti }

SLIDE 43
  • Procedure for Measuring

Bias and Variance

Construct B bootstrap replicates of S

S1, …, SB

Apply learning alg to each replicate Sb

to obtain hypothesis hb

Let Tb = S \ Sb = data points not in Sb

(out of bag points)

Compute predicted value

hb(x) for each x ∈ Tb

SLIDE 44
  • Estimating Bias and Variance

For each x ∈ S,
observed response y
predictions y1, …, yk (from the hypotheses whose out-of-bag set contains x)

Compute average prediction: h̄(x) = avei { yi }
Estimate bias: h̄(x) – y
Estimate variance: Σ{i: x ∈ Ti} ( hi(x) – h̄(x) )² / (k − 1)

Assume noise is 0
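A sketch of the whole procedure from Slides 43-44 (the decision-tree learner from scikit-learn and the dataset are assumptions; noise is taken to be 0, as the slide states):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
S_x = rng.uniform(0, 10, (50, 1))
S_y = S_x.ravel() + 2 * np.sin(1.5 * S_x.ravel()) + rng.normal(0, 0.2, 50)

B, n = 200, len(S_x)
preds = [[] for _ in range(n)]               # out-of-bag predictions per point

for _ in range(B):
    idx = rng.integers(0, n, n)              # bootstrap replicate S_b (with replacement)
    oob = np.setdiff1d(np.arange(n), idx)    # T_b = S \ S_b (out-of-bag points)
    h_b = DecisionTreeRegressor(max_depth=3).fit(S_x[idx], S_y[idx])
    for i, y_hat in zip(oob, h_b.predict(S_x[oob])):
        preds[i].append(y_hat)

for i in range(3):                           # first few points, for illustration
    y_i = np.array(preds[i])
    h_bar = y_i.mean()                       # average prediction h̄(x)
    bias = h_bar - S_y[i]                    # bias estimate (noise assumed 0)
    var = y_i.var(ddof=1)                    # sum (h_i - h̄)^2 / (k-1)
    print(f"x_{i}: bias = {bias:+.3f}  variance = {var:.3f}")
```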

SLIDE 45
  • Outline

Linear Regression
  MLE = Least Squares!
  Basis functions

Evaluating Predictors
  Training set error vs Test set error
  Cross Validation

Model Selection
  Bias-Variance analysis
  Regularization, Bayesian Model

SLIDE 46
  • Regularization

Idea: Penalize overly-complicated answers

Regular regression minimizes:
Σi ( y(x(i); w) − t(i) )²

Regularized regression minimizes:
Σi ( y(x(i); w) − t(i) )² + λ wᵀw

Note: May exclude constants from the norm

SLIDE 47
  • Regularization: Why?

For polynomials, extreme curves typically require extreme values

In general, encourages use of few features
… only features that lead to a substantial increase in performance

Problem: How to choose λ

SLIDE 48
  • Solving Regularized Form

Solving w* = argminw Σi [ ti − Σj wj xij ]²
gives w* = (XᵀX)⁻¹Xᵀt

Solving w* = argminw Σi [ ti − Σj wj xij ]² + λ Σj wj²
gives w* = (XᵀX + λI)⁻¹Xᵀt
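Both closed forms in one sketch (the data and λ value are arbitrary; note the regularized system XᵀX + λI is always invertible for λ > 0, which is a practical side benefit):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
t = X @ rng.normal(size=5) + rng.normal(0, 0.3, 30)
lam = 1.0                                              # the "magic constant" lambda

w_ls = np.linalg.solve(X.T @ X, X.T @ t)               # w* = (X^T X)^{-1} X^T t
I = np.eye(X.shape[1])
w_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ t)  # w* = (X^T X + lam I)^{-1} X^T t

print("||w_ls|| =", np.linalg.norm(w_ls), "  ||w_ridge|| =", np.linalg.norm(w_ridge))
```

The regularized solution has a smaller norm, as the penalty intends.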

SLIDE 49
  • Regularization: Empirical Approach

Problem: magic constant λ, trading off complexity vs. fit

Solution 1:
Generate multiple models. Use lots of test data to discover and discard bad models

Solution 2: k-fold cross validation:
Divide data S into k subsets { S1, …, Sk }
Create training set S−i = S − Si
Produces k training groups, each of size |S| · (k − 1)/k
For i = 1..k: Train on S−i, Test on Si
Combine results … mean? median? …
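Solution 2 in code (a sketch of k-fold cross-validation over a grid of λ values for the ridge solution above; the grid and the mean-combination rule are illustrative choices, not prescriptions from the slide):

```python
import numpy as np

def ridge_fit(X, t, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)

def cv_error(X, t, lam, k=5):
    folds = np.array_split(np.random.default_rng(0).permutation(len(t)), k)
    errs = []
    for i in range(k):
        test = folds[i]                                  # S_i
        train = np.concatenate(folds[:i] + folds[i+1:])  # S_{-i} = S - S_i
        w = ridge_fit(X[train], t[train], lam)           # train on S_{-i}
        errs.append(np.mean((X[test] @ w - t[test]) ** 2))  # test on S_i
    return np.mean(errs)                                 # combine by the mean

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
t = X @ rng.normal(size=8) + rng.normal(0, 0.5, 60)

lambdas = [0.01, 0.1, 1.0, 10.0]
best = min(lambdas, key=lambda lam: cv_error(X, t, lam))
print("chosen lambda:", best)
```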

SLIDE 50
  • A Bayesian Perspective

Given a space of possible hypotheses H = {hj},
which hypothesis has the highest posterior?

P(h | D) = P(D | h) P(h) / P(D)

As P(D) does not depend on h:
argmax P(h|D) = argmax P(D|h) P(h)

“Uniform P(h)” ⇒ Maximum Likelihood Estimate
(model for which data has highest prob.)

… can use P(h) for regularization …

SLIDE 51
  • Bayesian Regression

Assume that, given x, noise is iid Gaussian

Homoscedastic noise model (same σ for each position)

SLIDE 52
  • Maximum Likelihood Solution

P(D | h) = P( t1, …, tm | y(x; w), σ² )
= ∏i (2πσ²)^(−1/2) exp( −( t(i) − y(x(i); w) )² / (2σ²) )

MLE fit for the mean is just the linear regression fit;
it does not depend upon σ²

SLIDE 53
  • Bayesian Learning of Gaussian Parameters

Conjugate priors:
Mean: Gaussian prior
Variance: Wishart Distribution

Prior for mean: P(µ | η, λ)  [Figure: Gaussian centered at η, width 2λ]

Remember this??

SLIDE 54
  • Bayesian Solution

Introduce a prior distribution over weights:
p(h) = p(w | λ) = N( 0, λ² I )

Posterior now becomes:
P(D | h) P(h) = P( t1, …, tm | y(x; w), σ² ) P(w)
= ∏i (2πσ²)^(−1/2) exp( −( t(i) − y(x(i); w) )² / (2σ²) ) · (2πλ²)^(−k/2) exp( −wᵀw / (2λ²) )

SLIDE 55
  • Regularized Regression vs Bayesian Regression

Regularized Regression minimizes:
Σi ( y(x(i); w) − t(i) )² + κ wᵀw

Bayesian Regression maximizes:
∏i (2πσ²)^(−1/2) exp( −( t(i) − y(x(i); w) )² / (2σ²) ) · (2πλ²)^(−k/2) exp( −wᵀw / (2λ²) )

These are identical (up to constants)
… take the log of the Bayesian regression criterion:
−Σi ( t(i) − y(x(i); w) )² / (2σ²) − wᵀw / (2λ²) + const
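To make “identical up to constants” concrete: maximizing the log-criterion is unchanged by multiplying through by −2σ², which turns it into the regularized objective (a short worked step; κ here is the regularization constant of the first criterion):

```latex
\arg\max_{w}\;\Bigl[-\tfrac{1}{2\sigma^2}\sum_i \bigl(t^{(i)}-y(x^{(i)};w)\bigr)^2
                    -\tfrac{1}{2\lambda^2}\,w^\top w\Bigr]
 \;=\; \arg\min_{w}\;\sum_i \bigl(t^{(i)}-y(x^{(i)};w)\bigr)^2 + \kappa\,w^\top w,
 \qquad \kappa = \sigma^2/\lambda^2
```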

SLIDE 56
  • Viewing L2 Regularization

Using a Lagrange multiplier …

w* = argminw Σi [ ti − Σj wj xij ]² + λ Σj wj²

is equivalent to

w* = argminw Σi [ ti − Σj wj xij ]²   s.t.   Σj wj² ≤ ω

SLIDE 57
  • Use L2 vs L1 Regularization

w* = argminw Σi [ ti − Σj wj xij ]² + λ Σj |wj|^q

is equivalent to

w* = argminw Σi [ ti − Σj wj xij ]²   s.t.   Σj |wj|^q ≤ ω

For q = 1, the intersections of the error contours with the constraint region are often on an axis! … so some wi = 0 !!

LASSO!
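A sketch of the sparsity effect (scikit-learn's Lasso and Ridge are assumptions, and their penalty scaling differs from the slide's λ by a constant factor; the mostly-zero true weight vector is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 1.0]                  # only 3 relevant features (hypothetical)
t = X @ w_true + rng.normal(0, 0.3, 100)

w_l2 = Ridge(alpha=1.0).fit(X, t).coef_       # L2: q = 2
w_l1 = Lasso(alpha=0.1).fit(X, t).coef_       # L1: q = 1
print("L2 exact zeros:", np.sum(w_l2 == 0), "  L1 exact zeros:", np.sum(w_l1 == 0))
# L1 drives many weights exactly to 0; L2 only shrinks them toward 0.
```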

SLIDE 58
  • What you need to know

Regression
  Optimizing sum squared error == MLE!
  Basis functions = features
  Relationship between regression and Gaussians

Evaluating Predictors
  Test-set error estimates prediction error
  Cross Validation

Bias-Variance trade-off
  Model complexity …

Regularization ≈ Bayesian modeling
  L1 regularization – prefers 0 weights!

Play with the Applet