SLIDE 1

Regression

10-601 Machine Learning

SLIDE 2

Outline

  • Regression vs classification
  • Linear regression: another discriminative learning method
    – As optimization → gradient descent
    – As matrix inversion (Ordinary Least Squares)
  • Overfitting and bias-variance
  • Bias-variance decomposition for classification
SLIDE 3

What is regression?

SLIDE 4

Where we are

Inputs → Classifier → predicted category
Inputs → Density Estimator → probability
Inputs → Regressor → predicted real number

Today: the regressor.
SLIDE 5

Regression examples

SLIDE 6

Prediction of menu prices

Chahuneau, Gimpel, … and Smith, EMNLP 2012
SLIDE 7

A decision tree: classification

(Tree figure: each leaf predicts Play or Don't Play.)
SLIDE 8

A regression tree

Each leaf stores the training values that reach it and predicts their mean:
    – Play = {30m, 45m} → Play ≈ 37
    – Play = {0m, 0m, 15m} → Play ≈ 5
    – Play = {0m, 0m} → Play ≈ 0
    – Play = {20m, 30m, 45m, …} → Play ≈ 32
SLIDE 9

Theme for the week: learning as optimization

SLIDE 10

Types of learners

  • Two types of learners:
    1. Generative: make assumptions about how the data is generated (given the class); e.g., naïve Bayes
    2. Discriminative: directly estimate a decision rule/boundary; e.g., logistic regression

Today: another discriminative learner, but for regression tasks
SLIDE 11

Regression via optimization: Least Mean Squares (LMS)

Toy problem #2
SLIDE 12

Linear regression

  • Given an input x we would like to compute an output y
  • For example:
    – Predict height from age
    – Predict Google's price from Yahoo's price
    – Predict distance from wall from sensors
SLIDE 13

Linear regression

  • Given an input x we would like to compute an output y
  • In linear regression we assume that y and x are related by the equation y = wx + ε, where w is a parameter and ε represents measurement or other noise

  (Figure: X vs. Y; the line is what we are trying to predict, the dots are the observed values.)
SLIDE 14
  • Our goal is to estimate w from training data of (xi, yi) pairs
  • Optimization goal: minimize the squared error (least squares):

$$w^* = \arg\min_w \sum_i (y_i - w x_i)^2$$

  • Why least squares?
    – minimizes the squared distance between the measurements and the predicted line y = wx + ε
    – has a nice probabilistic interpretation (see HW)
    – the math is pretty
SLIDE 15

Solving linear regression

  • To optimize: we just take the derivative w.r.t. w:

$$\frac{\partial}{\partial w}\sum_i (y_i - w x_i)^2 = 2\sum_i -x_i(y_i - w x_i)$$

  (each $w x_i$ is the prediction for example i; compare to logistic regression…)
SLIDE 16

Solving linear regression

  • To optimize in closed form: take the derivative w.r.t. w and set it to 0:

$$\frac{\partial}{\partial w}\sum_i (y_i - w x_i)^2 = 2\sum_i -x_i(y_i - w x_i) = 0$$

$$\Rightarrow\; 2\sum_i x_i y_i - 2w\sum_i x_i x_i = 0 \;\Rightarrow\; \sum_i x_i y_i = w\sum_i x_i^2 \;\Rightarrow\; w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$

  (which is covar(X,Y)/var(X) if mean(X) = mean(Y) = 0)
SLIDE 17

Regression example

  • Generated: w=2
  • Recovered: w=2.03
  • Noise: std=1
SLIDE 18

Regression example

  • Generated: w=2
  • Recovered: w=2.05
  • Noise: std=2
SLIDE 19

Regression example

  • Generated: w=2
  • Recovered: w=2.08
  • Noise: std=4
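These three fits use the closed-form estimator $w = \sum_i x_i y_i / \sum_i x_i^2$ from the slides above. A minimal sketch of the experiment (numpy assumed; the sampling range and seed are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_w(x, y):
    # closed-form least squares for a line through the origin:
    # w = sum_i x_i*y_i / sum_i x_i^2
    return np.sum(x * y) / np.sum(x * x)

x = rng.uniform(-5, 5, size=200)
for std in (1, 2, 4):
    y = 2 * x + rng.normal(0, std, size=x.shape)   # generated with w = 2
    print(f"noise std={std}: recovered w = {fit_w(x, y):.2f}")
```

As on the slides, the estimate stays near the true w = 2, but its run-to-run variability grows with the noise.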
SLIDE 20

Bias term

  • So far we assumed that the line passes through the origin
  • What if the line does not? No problem, simply change the model to y = w0 + w1x + ε
  • Can use least squares to determine w0, w1:

$$w_1 = \frac{\sum_i x_i (y_i - w_0)}{\sum_i x_i^2}, \qquad w_0 = \frac{\sum_i y_i - w_1 \sum_i x_i}{n}$$
SLIDE 21

Bias term

(Same content as the previous slide.)

Simpler solution is coming soon…

SLIDE 22

Multivariate regression

  • What if we have several inputs? E.g., stock prices for Yahoo, Microsoft and Ebay for the Google prediction task
  • This becomes a multivariate regression problem
  • Again, it's easy to model:

$$y = w_0 + w_1 x_1 + \dots + w_k x_k + \varepsilon$$

  (here y is Google's stock price and the xj are Yahoo's, Microsoft's, … stock prices)
SLIDE 23

Multivariate regression

(Repeat of the previous slide.)
SLIDE 24

$$y = 10 + 3x_1^2 - 2x_2^2 + \varepsilon$$

In some cases we would like to use polynomial or other terms based on the input data. Are these still linear regression problems?

  • Yes. As long as the equation is linear in the coefficients, it is still a linear regression problem! Not all functions can be approximated by a line/hyperplane…
SLIDE 25

Non-Linear basis function

  • So far we only used the observed values x1, x2, …
  • However, linear regression can be applied in the same way to functions of these values
    – E.g., to add a term w·x1x2, add a new variable z = x1x2, so each example becomes: x1, x2, …, z
  • As long as these functions can be directly computed from the observed values, the parameters are still linear in the data and the problem remains a multivariate linear regression problem (a sketch follows below):

$$y = w_1 x_1 + w_2 x_2 + \dots + w_k x_k + \varepsilon$$
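A small sketch of this trick (numpy assumed; the data and coefficients are hypothetical): append z = x1·x2 as a new column and solve the same least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
# hypothetical target that includes an interaction term
y = 1.5 * x1 - 0.5 * x2 + 2.0 * x1 * x2 + rng.normal(0, 0.1, size=n)

z = x1 * x2                          # new "variable" computed from the inputs
X = np.column_stack([x1, x2, z])     # each example is now (x1, x2, z)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)                             # ≈ [1.5, -0.5, 2.0]: still ordinary linear regression
```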

SLIDE 26

Non-Linear basis function

  • How can we use this to add an intercept term? Add a new "variable" z = 1 and weight w0.
SLIDE 27

Non-linear basis functions

  • What type of functions can we use? A few common examples:
    – Polynomial: $\phi_j(x) = x^j$ for j = 0 … n
    – Gaussian: $\phi_j(x) = \exp\!\big(-\tfrac{(x-\mu_j)^2}{2\sigma_j^2}\big)$
    – Sigmoid: $\phi_j(x) = \frac{1}{1+\exp(-s_j x)}$
    – Logs: $\phi_j(x) = \log(x+1)$

  Any function of the input values can be used; the solution for the parameters of the regression remains the same. (A sketch follows below.)
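A compact sketch of these basis expansions (numpy assumed; the centers µj, width σ, and slopes sj are illustrative parameters, not values from the slides):

```python
import numpy as np

def poly_basis(x, n):
    # phi_j(x) = x**j for j = 0..n (j = 0 gives the intercept column)
    return np.column_stack([x**j for j in range(n + 1)])

def gaussian_basis(x, centers, sigma=1.0):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 sigma^2)), one column per center mu_j
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma**2))

def sigmoid_basis(x, slopes):
    # phi_j(x) = 1 / (1 + exp(-s_j * x)), one column per slope s_j
    return 1.0 / (1.0 + np.exp(-x[:, None] * slopes[None, :]))

x = np.linspace(0.0, 1.0, 5)
print(poly_basis(x, 3).shape)                              # (5, 4) design matrix
print(gaussian_basis(x, np.array([0.0, 0.5, 1.0])).shape)  # (5, 3)
```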

SLIDE 28

General linear regression problem

  • Using our new notation for the basis functions, linear regression can be written as

$$y = \sum_{j=0}^{n} w_j \phi_j(x)$$

  • where φj(x) can be either xj for multivariate regression or one of the non-linear basis functions we defined
  • … and φ0(x) = 1 for the intercept term
SLIDE 29

Learning/Optimizing Multivariate Least Squares

Approach 1: Gradient Descent

SLIDE 30

Gradient descent

SLIDE 31

Gradient Descent for Linear Regression

Predict with:

$$\hat y_i = \sum_{j=0}^{k} w_j \phi_j(x_i)$$

Goal: minimize the following loss function:

$$J_{X,y}(w) = \sum_i (y_i - \hat y_i)^2 = \sum_i \Big(y_i - \sum_j w_j \phi_j(x_i)\Big)^2$$

(the outer sum runs over the n examples, the inner sum over the k+1 basis vectors)
SLIDE 32

Gradient Descent for Linear Regression

Predict with $\hat y_i = \sum_{j=0}^{k} w_j \phi_j(x_i)$; goal: minimize the loss $J_{X,y}(w) = \sum_i (y_i - \hat y_i)^2$. Taking the derivative:

$$\frac{\partial}{\partial w_j} J(w) = \frac{\partial}{\partial w_j}\sum_i (y_i - \hat y_i)^2 = -2\sum_i (y_i - \hat y_i)\,\frac{\partial \hat y_i}{\partial w_j} = -2\sum_i (y_i - \hat y_i)\,\frac{\partial}{\partial w_j}\sum_{j'} w_{j'}\phi_{j'}(x_i) = -2\sum_i (y_i - \hat y_i)\,\phi_j(x_i)$$
SLIDE 33

Gradient Descent for Linear Regression

$$\hat y_i = \sum_{j=0}^{k} w_j \phi_j(x_i), \qquad \frac{\partial}{\partial w_j} J(w) = -2\sum_i (y_i - \hat y_i)\,\phi_j(x_i)$$

Learning algorithm (a sketch follows below):
  • Initialize weights w = 0
  • For t = 1, … until convergence:
    – Predict ŷi for each example xi using w
    – Compute the gradient of the loss: this is a vector g whose components gj are given above
    – Update: w = w − λg, where λ is the learning rate
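A runnable sketch of this loop (numpy assumed; the learning rate, iteration count, and synthetic data are illustrative choices, not from the slides):

```python
import numpy as np

def gd_linreg(Phi, y, lr=1e-3, iters=5000):
    """Minimize J(w) = sum_i (y_i - phi(x_i).w)^2 by gradient descent."""
    w = np.zeros(Phi.shape[1])          # initialize weights w = 0
    for _ in range(iters):
        y_hat = Phi @ w                 # predict for each example
        g = -2 * Phi.T @ (y - y_hat)    # g_j = -2 sum_i (y_i - yhat_i) phi_j(x_i)
        w = w - lr * g                  # update with learning rate lr
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Phi = np.column_stack([np.ones(100), X])   # phi_0(x) = 1 intercept column
y = Phi @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=100)
print(gd_linreg(Phi, y))                   # ≈ [1, 2, -1, 0.5]
```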
SLIDE 34

Gradient Descent for Linear Regression

  • We can use any of the tricks we used for logistic regression:
    – stochastic gradient descent (if the data is too big to put in memory)
    – regularization
    – …
SLIDE 35

Linear regression is a convex optimization problem

Proof: differentiate again; the second derivative is $\frac{\partial^2}{\partial w^2}\sum_i (y_i - w x_i)^2 = 2\sum_i x_i^2 \ge 0$, so the loss is convex, and again gradient descent will reach a global optimum.
SLIDE 36

Multivariate Least Squares

Approach 2: Matrix Inversion

SLIDE 37

OLS (Ordinary Least Squares Solution)

Predict with $\hat y_i = \sum_{j=0}^{k} w_j \phi_j(x_i)$. Goal: minimize the following loss function:

$$J_{X,y}(w) = \sum_i \Big(y_i - \sum_j w_j \phi_j(x_i)\Big)^2, \qquad \frac{\partial}{\partial w_j} J(w) = -2\sum_i (y_i - \hat y_i)\,\phi_j(x_i)$$
SLIDE 38

Same loss and gradient, in new notation: for n examples and k+1 basis vectors, collect the basis-function values into a matrix

$$\Phi = \begin{bmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_k(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_k(x_2) \\ \vdots & & & \vdots \\ \phi_0(x_n) & \phi_1(x_n) & \cdots & \phi_k(x_n) \end{bmatrix}$$
SLIDE 39

With Φ as above (n examples, k+1 basis vectors), also collect the targets and the weights into vectors:

$$\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \qquad \mathbf{w} = \begin{bmatrix} w_0 \\ \vdots \\ w_k \end{bmatrix}$$
SLIDE 40

Write the gradient coordinate by coordinate, using the shorthand $\phi_j^i \equiv \phi_j(x_i)$ and writing $\phi^i$ for the i-th row of Φ (so $\Phi = [\phi^1; \dots; \phi^n]$):

$$\frac{\partial}{\partial w_0} J(w) = -2\sum_i (y_i - \hat y_i)\,\phi_0(x_i), \quad \dots, \quad \frac{\partial}{\partial w_k} J(w) = -2\sum_i (y_i - \hat y_i)\,\phi_k(x_i)$$
SLIDE 41

Recall $\hat y_i = \sum_{j=0}^{k} w_j \phi_j^i = \phi^i \mathbf{w}$, so each coordinate of the gradient becomes

$$\frac{\partial}{\partial w_j} J(w) = -2\sum_i \big(y_i\,\phi_j^i - \hat y_i\,\phi_j^i\big)$$
SLIDE 42

Substituting $\hat y_i = \phi^i \mathbf{w}$ and stacking the k+1 coordinates into a vector:

$$\frac{\partial}{\partial w_j} J(w) = -2\sum_i \big(y_i\,\phi_j^i - \phi^i\mathbf{w}\,\phi_j^i\big) \quad\Longrightarrow\quad \nabla J(w) = -2\Phi^T\mathbf{y} + 2\Phi^T\Phi\mathbf{w}$$
SLIDE 43

The first term: coordinate j is $\sum_i \phi_j^i y_i$, i.e. row j of $\Phi^T$ (a (k+1)×n matrix) times the n-vector $\mathbf{y}$, so stacking gives

$$\frac{\partial}{\partial w_j} J(w) = -2\sum_i \big(\phi_j^i y_i - \phi_j^i\,\phi^i\mathbf{w}\big) \quad\Longrightarrow\quad \nabla J(w) = -2\Phi^T\mathbf{y} + \dots$$
SLIDE 44

The second term: coordinate j is $\sum_i \phi_j^i\,\phi^i\mathbf{w}$, which stacks into the (k+1)×(k+1) matrix $\Phi^T\Phi$ times the (k+1)-vector $\mathbf{w}$:

$$\nabla J(w) = \dots + 2\Phi^T\Phi\mathbf{w}$$
SLIDE 45

Putting the two terms together and setting the gradient to zero:

$$\nabla J(w) = -2\Phi^T\mathbf{y} + 2\Phi^T\Phi\mathbf{w} = 0 \quad\Longrightarrow\quad \mathbf{w} = \big(\Phi^T\Phi\big)^{-1}\Phi^T\mathbf{y}$$
SLIDE 46

recap: Solving linear regression

  • To optimize in closed form: take the derivative w.r.t. w and set it to 0:

$$\frac{\partial}{\partial w}\sum_i (y_i - w x_i)^2 = 2\sum_i -x_i(y_i - w x_i) = 0 \;\Rightarrow\; \sum_i x_i y_i = w\sum_i x_i^2 \;\Rightarrow\; w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$

  (which is covar(X,Y)/var(X) if mean(X) = mean(Y) = 0)
SLIDE 47

The matrix solution is the multivariate analogue of the scalar one:

$$-2\Phi^T\mathbf{y} + 2\Phi^T\Phi\mathbf{w} = 0 \;\Rightarrow\; \mathbf{w} = \big(\Phi^T\Phi\big)^{-1}\Phi^T\mathbf{y} \qquad\text{vs.}\qquad 2\sum_i x_i y_i - 2w\sum_i x_i x_i = 0 \;\Rightarrow\; w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$
SLIDE 48

LMS for general linear regression problem

$$J(w) = \sum_i \big(y_i - \mathbf{w}^T\phi(x_i)\big)^2$$

Solving for w we get:

$$\mathbf{w} = \big(\Phi^T\Phi\big)^{-1}\Phi^T\mathbf{y}$$

where Φ is an n×(k+1) matrix, y is a vector with n entries, and w a vector with k+1 entries. This solution is also known as the "pseudo-inverse".

Another reason to start with an objective function: you can see when two learning methods are the same!
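A short sketch of this solution (numpy assumed); solving the normal equations with a linear solver is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

def ols(Phi, y):
    # w = (Phi^T Phi)^{-1} Phi^T y, via a linear solve rather than an explicit inverse
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
Phi = np.column_stack([np.ones_like(x), x, x**2])   # basis: 1, x, x^2
y = Phi @ np.array([10.0, 0.0, 3.0]) + rng.normal(0, 0.1, size=50)
print(ols(Phi, y))                                  # ≈ [10, 0, 3]
```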

SLIDE 49

LMS versus gradient descent

$$J(w) = \sum_i \big(y_i - \mathbf{w}^T\phi(x_i)\big)^2, \qquad \mathbf{w} = \big(\Phi^T\Phi\big)^{-1}\Phi^T\mathbf{y}$$

LMS (closed-form) solution:
  + Very simple in Matlab or something similar
  − Requires a matrix inverse, which is expensive for a large matrix

Gradient descent:
  + Fast for large matrices
  + Stochastic GD is very memory efficient
  + Easily extended to other cases
  − Parameters to tweak (how to decide convergence? what is the learning rate? …)
SLIDE 50

Regression and Overfitting

SLIDE 51

An example: polynomial basis vectors on a small dataset

– From Bishop Ch 1

SLIDE 52

0th Order Polynomial

n=10

SLIDE 53

1st Order Polynomial

SLIDE 54

3rd Order Polynomial

SLIDE 55

9th Order Polynomial

SLIDE 56

Over-fitting

Root-Mean-Square (RMS) Error:

SLIDE 57

Polynomial Coefficients

SLIDE 58

Data Set Size:

9th Order Polynomial

SLIDE 59

Regularization

Penalize large coefficient values:

$$J_{X,y}(w) = \frac{1}{2}\sum_i \Big(y_i - \sum_j w_j\phi_j(x_i)\Big)^2 + \frac{\lambda}{2}\|w\|^2$$
SLIDE 60

Regularization: (figure of the fit with regularization applied)
SLIDE 61

Polynomial Coefficients

(Table of coefficient values as the regularization strength varies: none, exp(−18), huge.)
SLIDE 62

Over Regularization:

SLIDE 63

Regularized Gradient Descent for LR

Goal: minimize the regularized loss (a sketch follows below):

$$\hat y_i = \sum_{j=0}^{k} w_j \phi_j(x_i), \qquad J_{X,y}(w) = \frac{1}{2}\sum_i \Big(y_i - \sum_j w_j\phi_j(x_i)\Big)^2 + \frac{\lambda}{2}\sum_j w_j^2$$

$$\frac{\partial}{\partial w_j} J(w) = -\sum_i (y_i - \hat y_i)\,\phi_j(x_i) + \lambda w_j$$
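A sketch of gradient descent with this regularized gradient (numpy assumed; λ, the learning rate, and the data are illustrative choices):

```python
import numpy as np

def ridge_gd(Phi, y, lam=0.1, lr=1e-3, iters=5000):
    """Minimize 1/2 sum_i (y_i - phi(x_i).w)^2 + lam/2 * sum_j w_j^2."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        g = -Phi.T @ (y - Phi @ w) + lam * w   # gradient from the slide
        w -= lr * g
    return w

rng = np.random.default_rng(0)
Phi = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = Phi @ np.array([0.5, 2.0, -1.0]) + rng.normal(0, 0.1, size=100)
print(ridge_gd(Phi, y))   # near [0.5, 2, -1], shrunk slightly toward 0
```

(The same objective also has a closed form, w = (ΦᵀΦ + λI)⁻¹Φᵀy, known as ridge regression.)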
SLIDE 64

Probabilistic Interpretation of Least Squares

SLIDE 65

A probabilistic interpretation

Our least squares minimization solution can also be motivated by a probabilistic interpretation of the regression problem:

$$y = \mathbf{w}^T \phi(x) + \varepsilon$$

where ε is Gaussian noise. The MLE for w in this model is the same as the solution we derived for the least squares criterion:

$$\mathbf{w} = \big(\Phi^T\Phi\big)^{-1}\Phi^T\mathbf{y}$$
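To see why (a short derivation sketch, assuming ε ~ N(0, σ²) and i.i.d. examples): the log-likelihood differs from the negated squared error only by a positive scale and an additive constant.

$$\log p(\mathbf{y}\mid X,\mathbf{w}) = \sum_i \log \mathcal{N}\big(y_i \mid \mathbf{w}^T\phi(x_i),\,\sigma^2\big) = -\frac{1}{2\sigma^2}\sum_i \big(y_i - \mathbf{w}^T\phi(x_i)\big)^2 + \text{const}$$

so $\arg\max_{\mathbf{w}} \log p(\mathbf{y}\mid X,\mathbf{w}) = \arg\min_{\mathbf{w}} \sum_i (y_i - \mathbf{w}^T\phi(x_i))^2$.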

SLIDE 66

Understanding Overfitting: Bias-Variance

SLIDE 67

Example

Tom Dietterich, Oregon St

SLIDE 68

Example

Tom Dietterich, Oregon St

Same experiment, repeated: with 50 samples of 20 points each

SLIDE 69

The true function f can't be fit perfectly with hypotheses from our class H (lines) → Error 1. Fix: a more expressive set of hypotheses H.

We don't get the best hypothesis from H because of noise / small sample size → Error 2. Fix: a less expressive set of hypotheses H.

(Noise is similar to Error 1.)
SLIDE 70

Bias-Variance Decomposition: Regression

SLIDE 71

Bias and variance for regression

  • For regression, we can easily decompose the error of the learned model into two parts: bias (Error 1) and variance (Error 2)
    – Bias: the class of models can't fit the data. Fix: a more expressive model class.
    – Variance: the class of models could fit the data, but doesn't, because it's hard to fit. Fix: a less expressive model class.
SLIDE 72

Bias – Variance decomposition of error

Fix a test case x, then do this experiment:
  • 1. Draw a size-n sample D = (x1, y1), …, (xn, yn)
  • 2. Train a linear regressor hD using D
  • 3. Draw one test example (x, f(x) + ε)
  • 4. Measure the squared error of hD on that one example x

What's the expected error?

$$E_{D,\varepsilon}\Big\{\big(f(x) + \varepsilon - h_D(x)\big)^2\Big\}$$

where f is the true function, hD is learned from the dataset D, and ε is the noise.
SLIDE 73

Bias – Variance decomposition of error

Notation, to simplify this:

$$h \equiv E_D\{h_D(x)\}, \qquad \hat y \equiv \hat y_D \equiv h_D(x), \qquad f \equiv f(x) + \varepsilon$$

h is the long-term expectation of the learner's prediction on this x, averaged over many datasets D; ŷ is the prediction learned from one particular dataset D; f bundles the true function and the noise.
SLIDE 74

Bias – Variance decomposition of error

$$\begin{aligned}
E_{D,\varepsilon}\{(f-\hat y)^2\}
&= E\big\{([f-h]+[h-\hat y])^2\big\} \\
&= E\big\{[f-h]^2 + [h-\hat y]^2 + 2[f-h][h-\hat y]\big\} \\
&= E\big\{[f-h]^2 + [h-\hat y]^2 + 2[fh - f\hat y - h^2 + h\hat y]\big\} \\
&= E[(f-h)^2] + E[(h-\hat y)^2] + 2\big(E[fh] - E[f\hat y] - E[h^2] + E[h\hat y]\big)
\end{aligned}$$

with $h \equiv E_D\{h_D(x)\}$, $\hat y \equiv h_D(x)$, $f \equiv f(x)+\varepsilon$. The cross terms cancel, because

$$E_{D,\varepsilon}\big\{(f(x)+\varepsilon)\cdot E_D\{h_D(x)\}\big\} = E_{D,\varepsilon}\big\{(f(x)+\varepsilon)\cdot h_D(x)\big\}
\quad\text{and}\quad
E_{D,\varepsilon}\big\{E_D\{h_D(x)\}\cdot E_D\{h_D(x)\}\big\} = E_{D,\varepsilon}\big\{E_D\{h_D(x)\}\cdot h_D(x)\big\},$$

i.e. $E[fh] = E[f\hat y]$ and $E[h^2] = E[h\hat y]$.
SLIDE 75

Bias – Variance decomposition of error

$$E_{D,\varepsilon}\{(f-\hat y)^2\}
= E\big\{([f-h]+[h-\hat y])^2\big\}
= \underbrace{E[(f-h)^2]}_{\text{BIAS}^2} + \underbrace{E[(h-\hat y)^2]}_{\text{VARIANCE}}$$

BIAS²: the squared difference between the best possible prediction for x, f(x), and our "long-term" expectation of what the learner will do if we averaged over many datasets D, E_D[h_D(x)].

VARIANCE: the squared difference between our long-term expectation of the learner's performance, E_D[h_D(x)], and what we expect in a representative run on a dataset D (that is, ŷ).

(with $h \equiv E_D\{h_D(x)\}$, $\hat y \equiv h_D(x)$, $f \equiv f(x)+\varepsilon$)
SLIDE 76

(Figure: bias and variance of the learner's predictions at the test point x = 5.)
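A small simulation sketch of the experiment on the preceding slides (numpy assumed; the true function f(x) = 2x, noise level, and test point x = 5 are illustrative): draw many datasets, train a regressor on each, and split the error at x into bias² and variance.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 2.0 * x                            # illustrative "true" function
x_test, noise_std, n, trials = 5.0, 1.0, 20, 1000

preds = []
for _ in range(trials):
    x = rng.uniform(0, 10, size=n)               # draw a size-n dataset D
    y = f(x) + rng.normal(0, noise_std, size=n)
    w = np.sum(x * y) / np.sum(x * x)            # train h_D by least squares
    preds.append(w * x_test)                     # h_D(x) at the test point
preds = np.array(preds)

h = preds.mean()                                 # approximates E_D[h_D(x)]
bias2 = (f(x_test) - h) ** 2                     # squared bias
variance = preds.var()                           # variance of h_D(x) across D
print(f"bias^2 = {bias2:.5f}, variance = {variance:.5f}")
```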

SLIDE 77

Bias-variance decomposition

  • This is something real that you can (approximately) measure experimentally – if you have synthetic data
  • Different learners and model classes have different tradeoffs:
    – large bias / small variance: few features, highly regularized, highly pruned decision trees, large-k k-NN, …
    – small bias / high variance: many features, less regularization, unpruned trees, small-k k-NN, …
SLIDE 78

Bias and variance

  • For classification, we can also decompose the error of a learned classifier into two terms: bias and variance
    – Bias: the class of models can't fit the data. Fix: a more expressive model class.
    – Variance: the class of models could fit the data, but doesn't, because it's hard to fit. Fix: a less expressive model class.
SLIDE 79

Another view of a decision tree

(Figure: data split by the tests Sepal_length < 5.7 and Sepal_width > 2.8, drawn as axis-parallel decision boundaries.)
SLIDE 80

Another view of a decision tree

(Figure: a deeper tree with splits Sepal_length > 5.7, Sepal_width > 2.8, length > 5.1, width > 3.1, length > 4.6, drawn as nested Y/N branches.)
SLIDE 81

Another view of a decision tree

(Figure: the same tree with the length > 4.6 split removed.)
SLIDE 82

Another view of a decision tree

SLIDE 83

Another view of a decision tree

SLIDE 84

Bias-Variance Decomposition: Measuring

SLIDE 85

Bias-variance decomposition

  • This is something real that you can (approximately) measure experimentally – if you have synthetic data…
    – …or if you're clever
    – You need to somehow approximate ED{hD(x)}, i.e., construct many variants of the dataset D
SLIDE 86

Background: “Bootstrap” sampling

  • Input: dataset D
  • Output: many variants of D: D1, …, DT
  • For t = 1, …, T:
    – Dt = { }
    – For i = 1 … |D|: pick (x, y) uniformly at random from D (i.e., with replacement) and add it to Dt
  • Some examples never get picked (~37%); some are picked 2x, 3x, … (a sketch follows below)
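A direct sketch of this procedure (numpy assumed; the dataset is represented as parallel arrays X, y):

```python
import numpy as np

def bootstrap_variants(X, y, T, seed=0):
    """Return T bootstrap variants of dataset D = (X, y), with out-of-bag indices."""
    rng = np.random.default_rng(seed)
    n = len(y)
    variants = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)        # |D| picks, with replacement
        oob = np.setdiff1d(np.arange(n), idx)   # examples never picked (~37%)
        variants.append((X[idx], y[idx], oob))
    return variants
```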
SLIDE 87

Measuring Bias-Variance with “Bootstrap” sampling

  • Create B bootstrap variants of D (approximating many draws of D)
  • For each bootstrap dataset:
    – Tb is the dataset; Ub are the "out of bag" examples
    – Train a hypothesis hb on Tb
    – Test hb on each x in Ub
  • Now for each (x, y) example we have many predictions h1(x), h2(x), …, so we can estimate (ignoring noise):
    – variance: the ordinary variance of h1(x), …, hn(x)
    – bias: average(h1(x), …, hn(x)) − y
  (a sketch follows below)
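A sketch of this estimate (numpy assumed; fit_fn is a stand-in for any training routine that returns a prediction function, an assumption rather than an API from the slides):

```python
import numpy as np

def bias_variance_bootstrap(X, y, fit_fn, B=50, seed=0):
    """Estimate per-example bias and variance from out-of-bag predictions."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = [[] for _ in range(n)]              # h_1(x), h_2(x), ... per example
    for _ in range(B):
        idx = rng.integers(0, n, size=n)        # bootstrap training set T_b
        oob = np.setdiff1d(np.arange(n), idx)   # out-of-bag examples U_b
        h = fit_fn(X[idx], y[idx])              # train hypothesis h_b on T_b
        for i in oob:
            preds[i].append(h(X[i]))            # test h_b on each x in U_b
    bias = np.array([np.mean(p) - y[i] for i, p in enumerate(preds) if p])
    var = np.array([np.var(p) for p in preds if p])
    return bias, var
```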

SLIDE 88

Applying Bias-Variance Analysis

  • By measuring the bias and variance on a problem, we can determine how to improve our model:
    – If bias is high, we need to allow our model to be more complex
    – If variance is high, we need to reduce the complexity of the model
  • Bias-variance analysis also suggests a way to reduce variance: bagging
SLIDE 89

Bagging

SLIDE 90

Bootstrap Aggregation (Bagging)

  • Use the bootstrap to create B variants of D
  • Learn a classifier from each variant
  • Vote the learned classifiers to predict on a test example
SLIDE 91

Bagging (bootstrap aggregation)

  • Breaking it down:
    – input: dataset D and YFCL (your favorite classifier learner)
    – output: a classifier hD-BAG
    – use the bootstrap to construct variants D1, …, DT
    – for t = 1, …, T: train YFCL on Dt to get ht
    – to classify x with hD-BAG: classify x with h1, …, hT and predict the most frequently predicted class for x (majority vote)

Note that you can use any learner you like! You can also test ht on the "out of bag" examples. (A sketch follows below.)
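A hedged sketch of bagging (numpy assumed; train_fn stands in for YFCL, whatever classifier learner you plug in):

```python
import numpy as np

def bag(X, y, train_fn, T=25, seed=0):
    """Train T classifiers on bootstrap variants; predict by majority vote."""
    rng = np.random.default_rng(seed)
    n = len(y)
    hs = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)        # bootstrap variant D_t
        hs.append(train_fn(X[idx], y[idx]))     # h_t = YFCL trained on D_t

    def h_bag(x):
        votes = [h(x) for h in hs]              # classify x with h_1, ..., h_T
        vals, counts = np.unique(votes, return_counts=True)
        return vals[np.argmax(counts)]          # majority vote
    return h_bag
```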

SLIDE 92

Experiments

Freund and Schapire

SLIDE 93

(Figure legend: solid = naïve Bayes, dashed = logistic regression.)
SLIDE 94

Bagged, minimally pruned decision trees

SLIDE 95
SLIDE 96

Generally, bagged decision trees outperform the linear classifier eventually, if the data is large enough and clean enough.
SLIDE 97

Bagging (bootstrap aggregation)

  • Experimentally:
    – especially with minimal pruning: decision trees have low bias but high variance
    – bagging usually improves performance for decision trees and similar methods
    – it reduces variance without increasing the bias (much)