1
Regression and the Bias-Variance Decomposition
William Cohen
10-601, April 2008
Readings: Bishop 3.1, 3.2
2
Regression
- Technically: learning a function f(x) = y where y is real-valued, rather than discrete.
– Replace livesInSquirrelHill(x1,x2,…,xn) with averageCommuteDistanceInMiles(x1,x2,…,xn)
– Replace userLikesMovie(u,m) with usersRatingForMovie(u,m)
– …
3
Example: univariate linear regression
- Example: predict age from number of publications
[Scatter plot: Age in Years vs. Number of Publications]
4
Linear regression
- Model: $y_i = a x_i + b + \varepsilon_i$ where $\varepsilon_i \sim N(0, \sigma)$
- Training data: $(x_1, y_1), \ldots, (x_n, y_n)$
- Goal: estimate a, b with $\hat{w} = (\hat{a}, \hat{b})$

$$\hat{w} = \arg\max_w \Pr(w \mid D) = \arg\max_w \Pr(D \mid w)\,\Pr(w)$$
$$= \arg\max_w \log \Pr(D \mid w) \quad \text{(assume MLE, i.e., a uniform prior } \Pr(w))$$
$$= \arg\max_w \sum_i \log \Pr(y_i \mid x_i, w) = \arg\min_w \sum_i [y_i - \hat{y}(x_i)]^2$$

where $\hat{y}(x_i) = \hat{a} x_i + \hat{b}$.
5
Linear regression
- Model: $y_i = a x_i + b + \varepsilon_i$ where $\varepsilon_i \sim N(0, \sigma)$
- Training data: $(x_1, y_1), \ldots, (x_n, y_n)$
- Goal: estimate a, b with $\hat{w} = (\hat{a}, \hat{b})$
- Ways to estimate the parameters:
– Find the derivative with respect to the parameters a, b
– Set it to zero and solve
- Or use gradient ascent to solve
- Or …
$$\hat{w} = \arg\min_w \sum_i [y_i - \hat{y}(x_i)]^2$$
6
Linear regression
[Figure: fitted line through points (x1, y1), (x2, y2), …, with vertical residuals d1, d2, d3]
How to estimate the slope?
$$\text{slope} = \hat{a} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{n \cdot \mathrm{cov}(X, Y)}{n \cdot \mathrm{var}(X)}$$

where $\bar{x} = \frac{1}{n}\sum_i x_i$ and $\bar{y} = \frac{1}{n}\sum_i y_i$.
7
Linear regression
How to estimate the intercept?
$$\bar{y} = \hat{a}\bar{x} + \hat{b} \quad\Rightarrow\quad \hat{b} = \bar{y} - \hat{a}\bar{x}$$

(the least-squares line passes through the point of means $(\bar{x}, \bar{y})$)
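As a minimal sketch (mine, not from the slides), the two closed-form estimates above map directly onto numpy:

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least-squares fit of y ~ a*x + b."""
    xbar, ybar = x.mean(), y.mean()
    # slope: n*cov(X,Y) / n*var(X)
    a_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    # intercept: the fitted line passes through (xbar, ybar)
    b_hat = ybar - a_hat * xbar
    return a_hat, b_hat

a, b = fit_line(np.array([1.0, 2.0, 3.0, 4.0]), np.array([3.1, 4.9, 7.2, 8.8]))
```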
8
Bias/Variance Decomposition of Error
9
Bias – Variance decomposition of error

- Return to the simple regression problem f: X → Y
$$y = f(x) + \varepsilon$$
where f is deterministic and the noise is $\varepsilon \sim N(0, \sigma)$.
- What is the expected error for a learned h?
10
Bias – Variance decomposition of error
$$E_D\left[\iint \big(f(x) + \varepsilon - h_D(x)\big)^2 \,\Pr(\varepsilon)\,\Pr(x)\; d\varepsilon\, dx\right]$$

(f: the true function; $h_D$: learned from the dataset D)

Experiment (the error of which I'd like to predict):
- 1. Draw a size-n sample $D = (x_1, y_1), \ldots, (x_n, y_n)$
- 2. Train a linear function $h_D$ using D
- 3. Draw a test example $(x, f(x) + \varepsilon)$
- 4. Measure the squared error of $h_D$ on that example
11
Bias – Variance decomposition of error (2)
$$E_{D,\varepsilon}\left[\big(f(x) + \varepsilon - h_D(x)\big)^2\right]$$

(f: the true function; $h_D$: learned from the dataset D)

Fix x, then do this experiment:
- 1. Draw a size-n sample $D = (x_1, y_1), \ldots, (x_n, y_n)$
- 2. Train a linear function $h_D$ using D
- 3. Draw the test example $(x, f(x) + \varepsilon)$
- 4. Measure the squared error of $h_D$ on that example
12
Bias – Variance decomposition of error
Write $t = f + \varepsilon$ for the target and $\hat{y}$ for the prediction (really $\hat{y}_D = h_D(x)$, learned from D):

$$E[(t - \hat{y})^2] = E[((t - f) + (f - \hat{y}))^2]$$
$$= E[(t - f)^2] + 2E[(t - f)(f - \hat{y})] + E[(f - \hat{y})^2]$$
13
Bias – Variance decomposition of error
$$E_{D,\varepsilon}[(t - \hat{y})^2] = E[(t - f)^2] + 2E[(t - f)(f - \hat{y})] + E[(f - \hat{y})^2] = E[(t - f)^2] + E[(f - \hat{y})^2]$$

(the cross term vanishes because $E[t - f] = E[\varepsilon] = 0$)

- $E[(t - f)^2]$: intrinsic noise
- $E[(f - \hat{y})^2]$: depends on how well the learner approximates f
14
Bias – Variance decomposition of error
Let $\bar{h} = E_D\{h_D(x)\}$ and $\hat{y} = \hat{y}_D = h_D(x)$. Decompose the second term:

$$E[(f - \hat{y})^2] = E[((f - \bar{h}) + (\bar{h} - \hat{y}))^2] = (f - \bar{h})^2 + E[(\bar{h} - \hat{y})^2]$$

(the cross term again vanishes, since $E_D[\bar{h} - \hat{y}] = \bar{h} - E_D[\hat{y}] = 0$)

- $(f - \bar{h})^2$ = BIAS²: the squared difference between the best possible prediction for x, f(x), and our "long-term" expectation for what the learner will do if we averaged over many datasets D, $E_D[h_D(x)]$
- $E[(\bar{h} - \hat{y})^2]$ = VARIANCE: the squared difference between our long-term expectation for the learner's performance, $E_D[h_D(x)]$, and what we expect in a representative run on a dataset D ($\hat{y}$)
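A small simulation (my sketch; the true function f(x) = sin(x), the ranges, and the sample sizes are all made-up choices) estimates each piece of the decomposition at a fixed x by redrawing D many times:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                       # assumed "true" deterministic function
sigma, n, trials = 0.3, 20, 2000
x_test = 1.0                     # fix x, as on the slide

preds = []
for _ in range(trials):
    xs = rng.uniform(0.0, 3.0, n)               # 1. draw a size-n sample D
    ys = f(xs) + rng.normal(0.0, sigma, n)
    a, b = np.polyfit(xs, ys, 1)                # 2. train a linear h_D on D
    preds.append(a * x_test + b)                # 3-4. record h_D's prediction at x

preds = np.array(preds)
h_bar = preds.mean()                            # E_D[h_D(x)]
bias2 = (f(x_test) - h_bar) ** 2                # BIAS^2
variance = preds.var()                          # VARIANCE
noise = sigma ** 2                              # intrinsic noise E[eps^2]
print(bias2, variance, noise)                   # the three sum to the expected squared error
```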
15
Bias-variance decomposition
- How can you reduce the bias of a learner? Make the long-term average $E_D[h_D(x)]$ better approximate the true function f(x).
- How can you reduce the variance of a learner? Make the learner less sensitive to variations in the data.
16
A generalization of bias-variance decomposition to other loss functions
- "Arbitrary" real-valued loss L(t, y), but with L(y, y') = L(y', y), L(y, y) = 0, and L(y, y') ≠ 0 if y ≠ y'
- Define the "optimal prediction": $y^* = \arg\min_{y'} E_t[L(t, y')]$
- Define the "main prediction of the learner": $y_m = y_{m,D} = \arg\min_{y'} E_D[L(y, y')]$, where $y = h_D(x)$ is the learner's prediction
- Define the "bias of the learner": $B(x) = L(y^*, y_m)$
- Define the "variance of the learner": $V(x) = E_D[L(y_m, y)]$
- Define the "noise for x": $N(x) = E_t[L(t, y^*)]$
- Claim: $E_{D,t}[L(t, y)] = c_1 N(x) + B(x) + c_2 V(x)$, where $c_1 = 2\Pr_D[y = y^*] - 1$ and $c_2 = 1$ if $y_m = y^*$, $-1$ otherwise
m=n=|D|
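For squared loss the claim is easy to sanity-check: then c1 = c2 = 1, and the optimal and main predictions are just means. A tiny numeric check (my sketch; the distributions of t and of the learner's prediction y are arbitrary made-up Gaussians):

```python
import numpy as np

rng = np.random.default_rng(1)
t = 2.0 + 0.5 * rng.normal(size=100_000)   # targets t at a fixed x
y = 1.5 + 0.8 * rng.normal(size=100_000)   # learner's predictions y over datasets D

y_star = t.mean()                  # optimal prediction: argmin_y' E_t[(t - y')^2]
y_m = y.mean()                     # main prediction:    argmin_y' E_D[(y - y')^2]
N = np.mean((t - y_star) ** 2)     # noise N(x)
B = (y_star - y_m) ** 2            # bias B(x)
V = np.mean((y_m - y) ** 2)        # variance V(x)
total = np.mean((t - y) ** 2)      # E_{D,t}[L(t, y)]
print(N + B + V, total)            # approximately equal: c1 = c2 = 1 for squared loss
```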
17
Other regression methods
18
Example: univariate linear regression
- Example: predict age from number of publications
[Scatter plot: Age in Years vs. Number of Publications, with fitted regression line]

Paul Erdős, Hungarian mathematician, 1913-1996

$$\hat{y} = \frac{1}{7}x + 26$$

With x ≈ 1500 publications, the predicted age is about 240.
19
Linear regression
[Figure: fitted line with residuals d1, d2, d3]

Summary:
$$\hat{a} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{b} = \bar{y} - \hat{a}\bar{x}$$

To simplify:
- assume zero-centered data, as we did for PCA
- let $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)$
- then
$$\hat{a} = (\mathbf{x}^T\mathbf{y})(\mathbf{x}^T\mathbf{x})^{-1}, \qquad \hat{b} = 0$$
20
Onward: multivariate linear regression
Univariate:
$$\hat{w} = (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{y}, \qquad \mathbf{x} = (x_1, \ldots, x_n), \quad \mathbf{y} = (y_1, \ldots, y_n)$$

Multivariate:
$$\hat{y} = \hat{w}_1 x_1 + \ldots + \hat{w}_k x_k = \mathbf{w}^T\mathbf{x}$$
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i [\hat{y}(\mathbf{x}^i) - y_i]^2, \quad \text{where } \hat{y}(\mathbf{x}^i) = \mathbf{w}^T\mathbf{x}^i$$
$$\hat{\mathbf{w}} = (X^TX)^{-1}X^T\mathbf{y}$$

where X is the n × k matrix of inputs (row = example, column = feature) and $\mathbf{y} = (y_1, \ldots, y_n)$.
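A direct numpy transcription of the multivariate solution (a sketch; it solves the normal equations rather than forming the inverse of XᵀX explicitly, which is the numerically safer equivalent):

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares: w_hat = (X^T X)^{-1} X^T y.
    X is n x k (row = example, column = feature); y has length n."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```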
21
Onward: multivariate linear regression
Multivariate:
$$\hat{y} = \hat{w}_1 x_1 + \ldots + \hat{w}_k x_k = \mathbf{w}^T\mathbf{x}$$
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i [\hat{y}(\mathbf{x}^i) - y_i]^2 = (X^TX)^{-1}X^T\mathbf{y}$$

where X is the n × k input matrix and $\mathbf{y} = (y_1, \ldots, y_n)$.

Regularized:
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i [\hat{y}(\mathbf{x}^i) - y_i]^2 + \lambda\,\mathbf{w}^T\mathbf{w} = (X^TX + \lambda I)^{-1}X^T\mathbf{y}$$
22
Onward: multivariate linear regression
Multivariate:
$$\hat{y} = \mathbf{w}^T\mathbf{x}, \qquad \hat{\mathbf{w}} = (X^TX)^{-1}X^T\mathbf{y}$$

where X is the n × m input matrix and $\mathbf{y} = (y_1, \ldots, y_n)$.

Multivariate, multiple outputs:
Let Y be the n × k matrix of outputs (one row per example, one column per output). Then
$$\hat{W} = (X^TX)^{-1}X^TY, \qquad \hat{\mathbf{y}} = \hat{W}^T\mathbf{x}$$
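The multiple-output case needs no new code: with Y an n × k matrix, the same normal equations return one weight column per output. A sketch on synthetic data (all shapes and values are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))                 # n=50 examples, m=3 features
Y = X @ rng.normal(size=(3, 2))              # k=2 outputs per example
W_hat = np.linalg.solve(X.T @ X, X.T @ Y)    # (X^T X)^{-1} X^T Y, shape (3, 2)
```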
23
Onward: multivariate linear regression
Multivariate:
$$\hat{y} = \hat{w}_1 x_1 + \ldots + \hat{w}_k x_k = \mathbf{w}^T\mathbf{x}$$
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i [\hat{y}(\mathbf{x}^i) - y_i]^2 = (X^TX)^{-1}X^T\mathbf{y}$$

where X is the n × k input matrix and $\mathbf{y} = (y_1, \ldots, y_n)$.

Regularized:
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i [\hat{y}(\mathbf{x}^i) - y_i]^2 + \lambda\,\mathbf{w}^T\mathbf{w} = (X^TX + \lambda I)^{-1}X^T\mathbf{y}$$
What does increasing λ do?
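One way to see the answer empirically: the sketch below (mine; synthetic data) fits ŵ = (XᵀX + λI)⁻¹Xᵀy for growing λ and prints ‖ŵ‖, which shrinks toward zero, trading variance for bias.

```python
import numpy as np

def ridge(X, y, lam):
    """Regularized least squares: w_hat = (X^T X + lam*I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=30)
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, np.linalg.norm(ridge(X, y, lam)))   # the weight norm shrinks as lam grows
```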
24
Onward: multivariate linear regression
Multivariate:
$$\hat{y} = \mathbf{w}^T\mathbf{x}, \qquad \hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i [\hat{y}(\mathbf{x}^i) - y_i]^2$$

Now add a constant feature of 1 to each example:
$$X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad \mathbf{y} = (y_1, \ldots, y_n)$$

Regularized:
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i [\hat{y}(\mathbf{x}^i) - y_i]^2 + \lambda\,\mathbf{w}^T\mathbf{w} = (X^TX + \lambda I)^{-1}X^T\mathbf{y}$$

With w = (w1, w2): what does fixing w2 = 0 do (if λ = 0)?
25
Regression trees - summary
- Growing the tree:
– Split to optimize information gain
- At each leaf node:
– Predict the majority class (in M5: build a linear model, then greedily remove features)
- Pruning the tree:
– Prune to reduce error on a holdout set (in M5: prune to reduce estimated error on the training data, where estimates are adjusted by (n+k)/(n−k), n = #cases, k = #features)
- Prediction:
– Trace the path to a leaf and predict the associated majority class (in M5: smooth using a linear interpolation of the predictions made by every node on the path)

[Quinlan's M5]
26
Regression trees – example 1
27
Regression trees – example 2
What does pruning do to bias and variance?
28
Kernel regression
- aka locally weighted regression, locally linear regression, LOESS, …
29
Kernel regression
- aka locally weighted regression, locally linear regression, …
- A close approximation to kernel regression:
– Pick a few values z1, …, zk up front
– Preprocess: for each example (x, y), replace x with x′ = ⟨K(x, z1), …, K(x, zk)⟩, where K(x, z) = exp(−(x − z)² / (2σ²))
– Use multivariate regression on the (x′, y) pairs (see the sketch below)
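A sketch of that recipe (mine; the centers z1…zk, the bandwidth σ, and the sin target are arbitrary choices):

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """Replace scalar x with <K(x,z1),...,K(x,zk)>, K(x,z) = exp(-(x-z)^2 / (2 sigma^2))."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 10.0, 200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

centers = np.linspace(0.0, 10.0, 8)            # pick a few values z1..zk up front
Phi = rbf_features(x, centers, sigma=1.0)      # preprocess every example
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)    # multivariate regression on (x', y)
y_hat = rbf_features(np.array([5.0]), centers, 1.0) @ w   # predict at x = 5
```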
30
Kernel regression
- aka locally weighted regression, locally linear regression, LOESS, …
What does making the kernel wider do to bias and variance?
31
Additional readings
- P. Domingos, "A Unified Bias-Variance Decomposition and its Applications," Proceedings of the Seventeenth International Conference on Machine Learning (pp. 231-238), Stanford, CA: Morgan Kaufmann, 2000.
- J. R. Quinlan, "Learning with Continuous Classes," 5th Australian Joint Conference on Artificial Intelligence, 1992.
- Y. Wang & I. Witten, "Inducing Model Trees for Continuous Classes," 9th European Conference on Machine Learning, 1997.
- D. A. Cohn, Z. Ghahramani, & M. Jordan, "Active Learning with Statistical Models," Journal of Artificial Intelligence Research, 1996.