SLIDE 1

Regression and the Bias-Variance Decomposition

William Cohen 10-601 April 2008

Readings: Bishop 3.1,3.2

SLIDE 2

Regression

  • Technically: learning a function f(x)=y where y is real-valued, rather than discrete.
    – Replace livesInSquirrelHill(x1,x2,…,xn) with averageCommuteDistanceInMiles(x1,x2,…,xn)
    – Replace userLikesMovie(u,m) with usersRatingForMovie(u,m)
    – …

SLIDE 3

Example: univariate linear regression

  • Example: predict age from number of publications

[Scatterplot: Age in Years (y-axis) vs. Number of Publications (x-axis)]

SLIDE 4

Linear regression

  • Model: yi = axi + b + εi where εi ~ N(0,σ)
  • Training Data: (x1,y1),…,(xn,yn)
  • Goal: estimate a,b with w=(a,b)

$$
\hat{\mathbf{w}}
= \arg\max_{\mathbf{w}} \Pr(\mathbf{w}\mid D)
= \arg\max_{\mathbf{w}} \Pr(D\mid \mathbf{w})\,\Pr(\mathbf{w})
= \arg\max_{\mathbf{w}} \log \Pr(D\mid \mathbf{w})
\quad\text{(assume MLE, i.e. a uniform prior)}
$$
$$
= \arg\max_{\mathbf{w}} \sum_i \log \Pr(y_i\mid x_i,\mathbf{w})
= \arg\min_{\mathbf{w}} \sum_i \big[\hat{\epsilon}_i(\mathbf{w})\big]^2,
\qquad
\hat{\epsilon}_i(\mathbf{w}) = y_i - (\hat{a}\,x_i + \hat{b})
$$

SLIDE 5

Linear regression

  • Model: yi = axi + b + εi where εi ~ N(0,σ)
  • Training Data: (x1,y1),…,(xn,yn)
  • Goal: estimate a,b with w=(a,b)
  • Ways to estimate parameters

– Find derivative wrt parameters a,b
– Set to zero and solve

  • Or use gradient ascent to solve
  • Or ….

$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \hat{\epsilon}_i^{\,2}
$$
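
A minimal sketch of the gradient-based alternative mentioned above: plain gradient descent on the squared error. The synthetic data, learning rate, and iteration count are illustrative assumptions, not from the slides.

```python
import numpy as np

# Synthetic data for illustration (assumed): y = 2x + 1 + Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=100)

# Gradient descent on sum_i (y_i - (a*x_i + b))^2
a, b = 0.0, 0.0
lr = 0.01                               # learning rate (arbitrary choice)
for _ in range(10000):
    resid = y - (a * x + b)             # epsilon_hat_i
    grad_a = -2.0 * np.sum(resid * x)   # d/da of the squared error
    grad_b = -2.0 * np.sum(resid)       # d/db of the squared error
    a -= lr * grad_a / len(x)           # use the mean gradient for stability
    b -= lr * grad_b / len(x)

print(a, b)  # should approach the least-squares fit (roughly a≈2, b≈1)
```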

SLIDE 6

Linear regression

[Figure: scatterplot of points (x1,y1), (x2,y2), … with a fitted line; d1, d2, d3 are the vertical residuals]

How to estimate the slope?

$$
\hat{a}
= \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}
= \frac{n\cdot\mathrm{cov}(X,Y)}{n\cdot\mathrm{var}(X)},
\qquad
\bar{x}=\frac{1}{n}\sum_i x_i,\quad \bar{y}=\frac{1}{n}\sum_i y_i
$$

SLIDE 7

Linear regression

How to estimate the intercept?

The fitted line passes through the mean point (x̄, ȳ), i.e. ȳ = â·x̄ + b̂, so

$$
\hat{b} = \bar{y} - \hat{a}\,\bar{x}
$$

[Figure: same scatterplot, fitted line, and residuals d1, d2, d3 as on the previous slide]
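
A small sketch of the closed-form estimates above (slope from cov/var, intercept from the means). The synthetic publications/age data is an assumption for illustration.

```python
import numpy as np

# Illustrative data (assumed): publications vs. age
rng = np.random.default_rng(1)
x = rng.uniform(5, 50, size=200)             # number of publications
y = 0.14 * x + 26 + rng.normal(0, 3, 200)    # age, with Gaussian noise

x_bar, y_bar = x.mean(), y.mean()
a_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # cov(X,Y)/var(X)
b_hat = y_bar - a_hat * x_bar                                         # line goes through (x_bar, y_bar)

print(f"slope={a_hat:.3f}, intercept={b_hat:.3f}")
```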

SLIDE 8

Bias/Variance Decomposition of Error

SLIDE 9

  • Return to the simple regression problem f: X → Y

      y = f(x) + ε        (f is deterministic; ε is noise, ε ~ N(0,σ))

What is the expected error for a learned h?

Bias – Variance decomposition of error

SLIDE 10

Bias – Variance decomposition of error

$$
E_{D}\!\left[\int\!\!\int \big(f(x) + \epsilon - h_D(x)\big)^2 \,\Pr(\epsilon)\,\Pr(x)\; d\epsilon\, dx\right]
$$

(f is the true function, D the training dataset, h_D the hypothesis learned from D)

Experiment (the error of which I'd like to predict):

  • 1. Draw size n sample D=(x1,y1),…,(xn,yn)
  • 2. Train linear function hD using D
  • 3. Draw a test example (x,f(x)+ε)
  • 4. Measure squared error of hD on that example
SLIDE 11

Bias – Variance decomposition of error (2)

$$
E_{D,\epsilon}\!\left[\big(f(x) + \epsilon - h_D(x)\big)^2\right]
$$

(f is the true function, D the training dataset, h_D the hypothesis learned from D)

Fix x, then do this experiment:

  • 1. Draw size n sample D=(x1,y1),…,(xn,yn)
  • 2. Train linear function hD using D
  • 3. Draw the test example (x,f(x)+ε)
  • 4. Measure squared error of hD on that example
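
A minimal simulation of this experiment. The true function f, noise level σ, sample size n, and test point x are assumed choices, since the slide leaves them abstract.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):                        # assumed "true" function
    return 1.5 * x + 3.0

sigma, n, x_test = 2.0, 30, 5.0  # assumed noise level, sample size, fixed test point

def one_run():
    # 1. Draw a size-n sample D
    xs = rng.uniform(0, 10, n)
    ys = f(xs) + rng.normal(0, sigma, n)
    # 2. Train a linear function h_D on D (closed-form least squares)
    a = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)
    b = ys.mean() - a * xs.mean()
    # 3. Draw the test example (x, f(x) + eps)
    t = f(x_test) + rng.normal(0, sigma)
    # 4. Measure squared error of h_D on that example
    return (t - (a * x_test + b)) ** 2

errors = [one_run() for _ in range(10000)]
print("estimated expected squared error at x:", np.mean(errors))
```
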
SLIDE 12

Bias – Variance decomposition of error

Notation: let t = f(x) + ε be the observed target value and ŷ = h_D(x) the learned prediction, so the quantity above is E[(t − ŷ)²].

Expand by adding and subtracting f:

$$
E\big[(t-\hat{y})^2\big]
= E\Big[\big((t-f)+(f-\hat{y})\big)^2\Big]
= E\big[(t-f)^2\big] + 2\,E\big[(t-f)(f-\hat{y})\big] + E\big[(f-\hat{y})^2\big]
$$

(Careful with the expectations: ŷ is really ŷ_D, i.e. it depends on the random training set D, so the expectation is over both D and ε.)

SLIDE 13

Bias – Variance decomposition of error

The middle (cross) term vanishes: t − f = ε has mean zero and is independent of ŷ, so E[(t−f)(f−ŷ)] = 0. That leaves

$$
E_{D,\epsilon}\big[(t-\hat{y})^2\big]
= \underbrace{E\big[(t-f)^2\big]}_{\text{intrinsic noise}}
\;+\;
\underbrace{E\big[(f-\hat{y})^2\big]}_{\text{depends on how well the learner approximates } f}
$$

SLIDE 14

Bias – Variance decomposition of error

Now decompose the second term, using the "long-term" average hypothesis h̄(x) ≡ E_D[h_D(x)] and ŷ = ŷ_D = h_D(x):

$$
E_D\big[(f-\hat{y})^2\big]
= E_D\Big[\big((f-\bar{h})+(\bar{h}-\hat{y})\big)^2\Big]
= \underbrace{(f-\bar{h})^2}_{\text{BIAS}^2}
\;+\;
\underbrace{E_D\big[(\bar{h}-\hat{y})^2\big]}_{\text{VARIANCE}}
$$

(the cross term again vanishes, since E_D[h̄ − ŷ] = 0)

  • BIAS²: squared difference between the best possible prediction for x, f(x), and our "long-term" expectation for what the learner will do if we averaged over many datasets D, E_D[h_D(x)].
  • VARIANCE: squared difference between our long-term expectation for the learner's performance, E_D[h_D(x)], and what we expect in a representative run on a dataset D (that is, ŷ).
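
A sketch that estimates the three terms by simulation for a univariate linear learner at a fixed x. The true function (deliberately nonlinear, so the linear learner has bias), noise level, and sample size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):                                # assumed true function
    return np.sin(x)                     # nonlinear, so a linear h_D is biased

sigma, n, x_test, trials = 0.3, 20, 2.0, 20000

preds = np.empty(trials)
for i in range(trials):
    xs = rng.uniform(0, 3, n)                        # draw a dataset D
    ys = f(xs) + rng.normal(0, sigma, n)
    a = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)
    b = ys.mean() - a * xs.mean()
    preds[i] = a * x_test + b                        # h_D(x)

h_bar = preds.mean()                                 # E_D[h_D(x)]
noise = sigma ** 2                                   # E[(t - f)^2]
bias2 = (f(x_test) - h_bar) ** 2                     # (f - h_bar)^2
variance = preds.var()                               # E_D[(h_bar - h_D(x))^2]
print(f"noise={noise:.4f}  bias^2={bias2:.4f}  variance={variance:.4f}")
print("sum =", noise + bias2 + variance)             # ≈ expected squared error at x_test
```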

SLIDE 15

Bias-variance decomposition

How can you reduce the bias of a learner?
  – Make the long-term average prediction a better approximation to the true function f(x).

How can you reduce the variance of a learner?
  – Make the learner less sensitive to variations in the data.

SLIDE 16

A generalization of bias-variance decomposition to other loss functions

  • "Arbitrary" real-valued loss L(t,y), with L(y,y′)=L(y′,y), L(y,y)=0, and L(y,y′)≠0 if y≠y′
  • Define the "optimal prediction":
      y* = argmin_y′ E_t[ L(t,y′) ]
  • Define the "main prediction of the learner" (y below denotes the learner's prediction, which depends on the training set D):
      y_m = y_m,D = argmin_y′ E_D[ L(y,y′) ]
  • Define the "bias of the learner":
      B(x) = L(y*, y_m)
  • Define the "variance of the learner":
      V(x) = E_D[ L(y_m, y) ]
  • Define the "noise for x":
      N(x) = E_t[ L(t, y*) ]

Claim: E_D,t[ L(t,y) ] = c1·N(x) + B(x) + c2·V(x), where c1 = 2·Pr_D[y=y*] − 1 and c2 = 1 if y_m = y*, −1 otherwise.

m=n=|D|
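
For intuition, the squared-loss special case (a standard fact from the Domingos paper in the readings, not spelled out on this slide) recovers the decomposition derived earlier, with c1 = c2 = 1:

$$
L(t,y)=(t-y)^2:\qquad y^* = E_t[t],\qquad y_m = E_D[y],
$$
$$
E_{D,t}\big[(t-y)^2\big]
= \underbrace{E_t\big[(t-y^*)^2\big]}_{N(x)}
+ \underbrace{(y^*-y_m)^2}_{B(x)}
+ \underbrace{E_D\big[(y_m-y)^2\big]}_{V(x)}
$$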

SLIDE 17

Other regression methods

SLIDE 18

Example: univariate linear regression

  • Example: predict age from number of publications

[Scatterplot: Age in Years (y-axis) vs. Number of Publications (x-axis), with fitted regression line]

Paul Erdős Hungarian mathematician, 1913-1996

$$
\hat{y} = \tfrac{1}{7}\,x + 26
$$

For Erdős, x ≈ 1500 publications, so the predicted age is about 1500/7 + 26 ≈ 240.

SLIDE 19

Linear regression

[Figure: scatterplot with fitted line and residuals d1, d2, d3, as before]

Summary:

$$
\hat{a} = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2},
\qquad
\hat{b} = \bar{y} - \hat{a}\,\bar{x}
$$

To simplify:

  • assume zero-centered data, as we did for PCA
  • let x = (x1,…,xn) and y = (y1,…,yn)
  • then…

$$
\hat{a} = (\mathbf{x}^T\mathbf{x})^{-1}\,\mathbf{x}^T\mathbf{y},
\qquad
\hat{b} = 0
$$

SLIDE 20

Onward: multivariate linear regression

Univariate (zero-centered, from the previous slide):

$$
\hat{w} = (\mathbf{x}^T\mathbf{x})^{-1}\,\mathbf{x}^T\mathbf{y},
\qquad
\mathbf{x}=(x_1,\dots,x_n),\quad \mathbf{y}=(y_1,\dots,y_n)
$$

Multivariate:

$$
\hat{y} = \hat{w}_1 x_1 + \dots + \hat{w}_k x_k = \mathbf{x}\,\hat{\mathbf{w}},
\qquad
\hat{\mathbf{w}} = (X^T X)^{-1} X^T \mathbf{y}
$$

where (row = example, column = feature)

$$
X = \begin{pmatrix} x_{1,1} & \dots & x_{1,k} \\ \vdots & & \vdots \\ x_{n,1} & \dots & x_{n,k} \end{pmatrix},
\qquad
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
$$

As before, this is the least-squares solution:

$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[\hat{\epsilon}_i(\mathbf{w})\big]^2,
\qquad
\hat{\epsilon}_i(\mathbf{w}) = y_i - \mathbf{w}^T \mathbf{x}_i
$$
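
A minimal NumPy sketch of the closed form above. The random data and "true" weights are assumptions; np.linalg.solve / np.linalg.lstsq are used instead of forming the inverse explicitly, which gives the same estimate but is numerically safer.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 3
X = rng.normal(size=(n, k))                  # rows = examples, columns = features
w_true = np.array([2.0, -1.0, 0.5])          # assumed "true" weights for the demo
y = X @ w_true + rng.normal(0, 0.1, n)       # targets with Gaussian noise

# Closed form: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, more numerically stable:
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_hat, w_lstsq)   # both ≈ w_true
```

The same call also covers the multiple-output case on a later slide: if y is an n×k′ matrix of outputs, the solve returns a weight matrix with one column per output.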

SLIDE 21

Onward: multivariate linear regression

(X and y as on the previous slide.)

$$
\hat{y} = \hat{w}_1 x_1 + \dots + \hat{w}_k x_k = \mathbf{x}\,\hat{\mathbf{w}},
\qquad
\hat{\mathbf{w}} = (X^T X)^{-1} X^T \mathbf{y}
$$

Regularized:

$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[\hat{\epsilon}_i(\mathbf{w})\big]^2 + \lambda\,\mathbf{w}^T\mathbf{w},
\qquad
\hat{\epsilon}_i(\mathbf{w}) = y_i - \mathbf{w}^T\mathbf{x}_i
$$
$$
\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}
$$

SLIDE 22

Onward: multivariate linear regression

Multivariate, multiple outputs (X as before; Y stacks the outputs, one row per example, one column per output):

$$
\hat{W} = (X^T X)^{-1} X^T Y,
\qquad
Y = \begin{pmatrix} y_{1,1} & \dots & y_{1,k'} \\ \vdots & & \vdots \\ y_{n,1} & \dots & y_{n,k'} \end{pmatrix}
$$

Each column of $\hat{W}$ holds the weights for one output; the prediction for an example $\mathbf{x}$ is $\hat{\mathbf{y}} = \mathbf{x}\,\hat{W}$.

SLIDE 23

Onward: multivariate linear regression

(Same setup as on the previous slides.)

$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[\hat{\epsilon}_i(\mathbf{w})\big]^2 + \lambda\,\mathbf{w}^T\mathbf{w},
\qquad
\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}
$$

What does increasing λ do?
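
A sketch of the regularized estimate from the previous slides; the data and the λ values are arbitrary illustrative choices. Increasing λ shrinks the weights toward zero, which in bias-variance terms lowers variance at the cost of some added bias.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """w_hat = (X^T X + lam*I)^{-1} X^T y  (regularized least squares)."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Assumed demo data: watch the weights shrink as lam grows
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, 50)
for lam in (0.0, 1.0, 100.0):
    print(lam, ridge_fit(X, y, lam))
```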

SLIDE 24

Onward: multivariate linear regression

Add a constant feature to carry the intercept:

$$
X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix},
\qquad
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix},
\qquad
\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}
$$

w=(w1,w2) What does fixing w2=0 do (if λ=0)?
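
A small sketch of the constant-feature trick, under the column ordering shown above (constant feature first, so w1 plays the role of the intercept and w2 is the slope); the data is assumed. With λ=0, fixing w2=0 leaves only the constant feature, so the least-squares fit reduces to the mean of y.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 5.0 + rng.normal(0, 1, 100)    # assumed data with a nonzero intercept

# Constant feature first, as on the slide: each row of X is (1, x_i)
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.lstsq(X, y, rcond=None)[0]
print("w1 (intercept), w2 (slope):", w)       # ≈ (5, 2)

# Fixing w2 = 0 (and lambda = 0) keeps only the constant feature,
# so the least-squares fit is just the mean of y.
w1_only = np.linalg.lstsq(X[:, :1], y, rcond=None)[0]
print("w1 with w2 fixed to 0:", w1_only, "vs. mean(y):", y.mean())
```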

SLIDE 25

Regression trees - summary

  • Growing tree:

– Split to optimize information gain

  • At each leaf node

– Predict the majority class

  • Pruning tree:

– Prune to reduce error on holdout

  • Prediction:

– Trace path to a leaf and predict associated majority class

Modifications for regression [Quinlan's M5]:
  – At each leaf, build a linear model, then greedily remove features
  – Prune using estimated error on the training data, with estimates adjusted by (n+k)/(n−k), where n = #cases and k = #features
  – For prediction, use a linear interpolation of every prediction made by every node on the path
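
A much-simplified regression-tree sketch, not Quinlan's M5: splits greedily minimize squared error, leaves predict the node mean, and there is no pruning or path smoothing. It is only meant to make the growing and prediction steps concrete; the data and hyperparameters are assumed.

```python
import numpy as np

class Node:
    def __init__(self, value, feature=None, threshold=None, left=None, right=None):
        self.value = value          # mean of y in this node (used if leaf)
        self.feature = feature      # split feature index (None for a leaf)
        self.threshold = threshold
        self.left, self.right = left, right

def build(X, y, min_leaf=5, depth=0, max_depth=4):
    """Greedy regression tree: pick the split that most reduces squared error."""
    node = Node(value=y.mean())
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return node
    best_err = ((y - y.mean()) ** 2).sum()
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            mask = X[:, j] <= t
            if mask.sum() < min_leaf or (~mask).sum() < min_leaf:
                continue
            err = ((y[mask] - y[mask].mean()) ** 2).sum() + \
                  ((y[~mask] - y[~mask].mean()) ** 2).sum()
            if err < best_err:
                best_err, node.feature, node.threshold = err, j, t
    if node.feature is None:        # no useful split found: stay a leaf
        return node
    mask = X[:, node.feature] <= node.threshold
    node.left = build(X[mask], y[mask], min_leaf, depth + 1, max_depth)
    node.right = build(X[~mask], y[~mask], min_leaf, depth + 1, max_depth)
    return node

def predict(node, x):
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

# Tiny demo on assumed data
rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)
tree = build(X, y)
print(predict(tree, np.array([2.0])), np.sin(2.0))
```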

SLIDE 26

Regression trees – example 1

SLIDE 27

Regression trees – example 2

What does pruning do to bias and variance?

SLIDE 28

Kernel regression

  • aka locally weighted regression, locally linear regression, LOESS, …

SLIDE 29

Kernel regression

  • aka locally weighted regression, locally linear regression, …
  • Close approximation to kernel regression:
    – Pick a few values z1,…,zk up front
    – Preprocess: for each example (x,y), replace x with x′ = ⟨K(x,z1),…,K(x,zk)⟩, where K(x,z) = exp( −(x−z)² / 2σ² )
    – Use multivariate regression on the (x′,y) pairs
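
A sketch of this approximation (the centers z, kernel width σ, and data are assumed): map each x to RBF features against a few fixed centers, then reuse ordinary least squares on the transformed pairs.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)       # assumed nonlinear target

centers = np.linspace(0, 10, 8)                # the z_1,...,z_k picked up front
sigma = 1.0                                    # kernel width (assumed)

def rbf_features(x):
    # x' = <K(x,z_1),...,K(x,z_k)>, with K(x,z) = exp(-(x-z)^2 / (2 sigma^2))
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma ** 2))

Phi = rbf_features(x)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # multivariate regression on (x', y)

x_new = np.array([2.5])
print(rbf_features(x_new) @ w, np.sin(2.5))    # prediction vs. true value
```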

SLIDE 30

Kernel regression

  • aka locally weighted regression, locally linear regression, LOESS, …

What does making the kernel wider do to bias and variance?

SLIDE 31

Additional readings

  • P. Domingos, A Unified Bias-Variance Decomposition and its Applications. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 231–238), 2000. Stanford, CA: Morgan Kaufmann.
  • J. R. Quinlan, Learning with Continuous Classes, 5th Australian Joint Conference on Artificial Intelligence, 1992.
  • Y. Wang & I. Witten, Inducing Model Trees for Continuous Classes, 9th European Conference on Machine Learning, 1997.
  • D. A. Cohn, Z. Ghahramani, & M. Jordan, Active Learning with Statistical Models, JAIR, 1996.