

SLIDE 1

10701 Ensemble of trees: Bagging and Random Forest

SLIDE 2

Bagging

  • Bagging, or bootstrap aggregation, is a technique for reducing the variance of an estimated prediction function.

  • For classification, a committee of trees each casts a vote for the predicted class (a code sketch follows below).
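
As a concrete illustration (not part of the original slides), a committee like this can be built with scikit-learn's BaggingClassifier, whose default base estimator is a decision tree; the synthetic dataset and parameter values below are arbitrary choices.

```python
# Illustrative sketch, not from the slides: bagging a committee of decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data; the sizes and seed are arbitrary choices for the example.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 trees (BaggingClassifier's default base estimator is a decision tree),
# each trained on a bootstrap sample; prediction is a vote over the committee.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print("test accuracy:", bag.score(X_test, y_test))
```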

SLIDE 3

Bootstrap

The basic idea: randomly draw datasets with replacement from the training data, each sample the same size as the original training set.
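
A minimal sketch of this idea with NumPy (my own toy data, not from the slides): each bootstrap sample keeps the original size N but is drawn with replacement.

```python
# Minimal bootstrap sketch (toy data, not from the slides).
import numpy as np

rng = np.random.default_rng(0)
N, B = 100, 10                        # N training examples, B bootstrap samples
X = rng.normal(size=(N, 3))           # toy inputs with 3 features
y = rng.integers(0, 2, size=N)        # toy binary labels

bootstrap_samples = []
for b in range(B):
    idx = rng.integers(0, N, size=N)  # N indices drawn with replacement
    bootstrap_samples.append((X[idx], y[idx]))

# Each sample keeps size N but repeats some examples and omits others
# (on average roughly a third of the originals are left out).
```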

SLIDE 4

Bagging

(Figure: create bootstrap samples from the training data; N examples, M features.)

SLIDE 5

Random Forest Classifier

(Figure: construct a decision tree from each bootstrap sample; N examples, M features.)

SLIDE 6

Bagging tree classifier

(Figure: take the majority vote over the trees; N examples, M features.)

SLIDE 7

Bagging

Training data: Z = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}.

Draw bootstrap samples Z*b, b = 1, …, B, and let f*b(x) denote the prediction at input x when bootstrap sample b is used for training. The bagged estimate averages the B predictions (for classification, the committee takes a majority vote):

    f_bag(x) = (1/B) Σ_{b=1}^{B} f*b(x)

SLIDE 8

Bagging

(Figure from Hastie et al.) Treat the voting proportions as probabilities.

SLIDE 9

Random forest classifier

The random forest classifier is an extension of bagging: in addition to drawing bootstrap samples of the training examples, it considers only a random subset of the features at each split.
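
For reference (not from the slides), this is how the same idea looks in scikit-learn; max_features controls the size m of the random feature subset considered at each split, and the data here is synthetic.

```python
# Illustrative sketch, not from the slides: a random forest in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; sizes and seed are arbitrary choices for the example.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the committee
    max_features="sqrt",   # m ~ sqrt(M) features considered at each split
    bootstrap=True,        # each tree is grown on a bootstrap sample
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:5]))   # majority vote over the 100 trees
```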

SLIDE 10

Random Forest Classifier

(Figure: the training data, with N examples and M features.)

SLIDE 11

Random Forest Classifier

(Figure: create bootstrap samples from the training data; N examples, M features.)

SLIDE 12

Random Forest Classifier

(Figure: construct a decision tree from each bootstrap sample; N examples, M features.)

SLIDE 13

Random Forest Classifier

(Figure: N examples, M features.) At each node, when choosing the split feature, choose only among a random subset of m < M features (see the sketch below).
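
A rough sketch of this step (my own illustration; the Gini-based split search and helper names are not from the slides): only m randomly chosen features are examined when splitting a node.

```python
# Sketch of the key random-forest step: at each node, only a random subset
# of m < M features is considered for the split.
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_on_subset(X, y, m, rng):
    """Pick the best (feature, threshold) among m randomly chosen features."""
    n, M = X.shape
    candidate_features = rng.choice(M, size=m, replace=False)   # m < M features
    best = (None, None, np.inf)
    for j in candidate_features:
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))                 # N=30 examples, M=10 features
y = (X[:, 0] + X[:, 3] > 0).astype(int)       # toy labels
print(best_split_on_subset(X, y, m=3, rng=rng))
```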

SLIDE 14

Random Forest Classifier

(Figure: create a decision tree from each bootstrap sample; N examples, M features.)

SLIDE 15

Random Forest Classifier

(Figure: N examples, M features.) Take the majority vote over the trees.

SLIDE 16

SLIDE 17

Random forest for biology

(Figure: a decision tree over biological features such as GeneExpress, TAP, Y2H, GOProcess, HMS-PCI, GeneOccur, GOLocalization, ProteinExpress, Domain, and SynExpress, with Y/N branch labels.)

SLIDE 18

Regression

10-701 Machine Learning

SLIDE 19

Where we are

Inputs → Classifier → Predict category
Inputs → Density Estimator → Probability
Inputs → Regressor → Predict real number   (Today)

SLIDE 20

Choosing a restaurant

Reviews (out of 5 stars)   $    Distance   Cuisine (out of 10)   Score
4                          30   21         7                     8.5
2                          15   12         8                     7.8
5                          27   53         9                     6.7
3                          20   5          6                     5.4

  • In everyday life we need to make decisions by taking into account lots of factors.

  • The question is what weight we put on each of these factors (how important they are with respect to the others).

  • Assume we would like to build a recommender system for ranking potential restaurants based on an individual's preferences.

  • If we have many observations we may be able to recover the weights (see the sketch below).
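
As a toy illustration (my own code, using only the four example rows above), the weights can be recovered with a least squares solve; with four restaurants and four factors this is an exact fit, so real use would need many more observations.

```python
# Sketch: recovering feature weights for the restaurant example with least squares.
# Columns: reviews, price ($), distance, cuisine score.
import numpy as np

X = np.array([[4, 30, 21, 7],
              [2, 15, 12, 8],
              [5, 27, 53, 9],
              [3, 20,  5, 6]], dtype=float)
scores = np.array([8.5, 7.8, 6.7, 5.4])

# With as many observations as weights this is an exact solve; with many
# observations lstsq returns the least-squares fit instead.
w, *_ = np.linalg.lstsq(X, scores, rcond=None)
print("recovered weights:", w)
print("predicted scores:", X @ w)
```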

SLIDE 21

Linear regression

  • Given an input x we would like to compute an output y

  • For example:
  • Predict height from age
  • Predict Google's price from Yahoo's price
  • Predict distance from wall using sensor readings

(Figure: scatter plot of X vs. Y.) Note that now Y can be continuous.

SLIDE 22

Linear regression

  • Given an input x we would like to compute an output y

  • In linear regression we assume that y and x are related by the equation y = wx + ε, where w is a parameter and ε represents measurement or other noise

(Figure: X, the observed values, plotted against Y, what we are trying to predict.)

SLIDE 23

Linear regression

  • Our goal is to estimate w from training data of <xi, yi> pairs

  • One way to find such a relationship is to minimize the least squares error:

    arg min_w Σ_i (y_i − w x_i)²

  • Several other approaches can be used as well

  • So why least squares?
  • minimizes squared distance between measurements and predicted line
  • has a nice probabilistic interpretation
  • easy to compute

(Figure: data points around the line y = wx + ε.)

If the noise is Gaussian with mean 0 then least squares is also the maximum likelihood estimate of w

SLIDE 24

Solving linear regression using least squares minimization

  • You should be familiar with this by now …
  • We just take the derivative w.r.t. w and set it to 0:

    ∂/∂w Σ_i (y_i − w x_i)² = −2 Σ_i x_i (y_i − w x_i) = 0
    ⇒ Σ_i x_i y_i = w Σ_i x_i²
    ⇒ w = Σ_i x_i y_i / Σ_i x_i²
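
A quick numerical check of this closed-form solution on synthetic data (my own example, not from the slides):

```python
# Numerical check of w = sum_i x_i*y_i / sum_i x_i^2 for regression through the origin.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + rng.normal(scale=1.0, size=200)   # true w = 2, Gaussian noise

w_hat = np.sum(x * y) / np.sum(x ** 2)
print(w_hat)   # close to 2
```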

SLIDE 25

Regression example

  • Generated: w=2
  • Recovered: w=2.03
  • Noise: std=1
SLIDE 26

Regression example

  • Generated: w=2
  • Recovered: w=2.05
  • Noise: std=2
SLIDE 27

Regression example

  • Generated: w=2
  • Recovered: w=2.08
  • Noise: std=4
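
These three examples can be reproduced with a short simulation; the sample size and random seed below are my own choices, so the recovered values only roughly match the slides.

```python
# Reproducing the flavour of the three examples (true w = 2, noise std 1, 2, 4).
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
for std in (1, 2, 4):
    y = 2.0 * x + rng.normal(scale=std, size=x.size)
    w_hat = np.sum(x * y) / np.sum(x ** 2)
    print(f"noise std={std}: recovered w = {w_hat:.3f}")
```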
SLIDE 28

Bias term

  • So far we assumed that the line passes through the origin

  • What if the line does not?

  • No problem, simply change the model to y = w0 + w1x + ε

  • Can use least squares to determine w0, w1:

    w0 = (Σ_i y_i − w1 Σ_i x_i) / n
    w1 = Σ_i x_i (y_i − w0) / Σ_i x_i²

(Figure: a fitted line with intercept w0.)

SLIDE 29

Bias term

(Same model and least squares solution for w0, w1 as on the previous slide.)

Just a second, we will soon give a simpler solution.

SLIDE 30

Multivariate regression

  • What if we have several inputs?
  • Stock prices for Yahoo, Microsoft and eBay for the Google prediction task

  • This becomes a multivariate linear regression problem

  • Again, it's easy to model:

    y = w0 + w1x1 + … + wkxk + ε

(Figure: Google's stock price predicted from Yahoo's and Microsoft's stock prices.)
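
A hedged sketch with synthetic data (no real stock prices are used here): scikit-learn's LinearRegression fits the intercept w0 and the coefficients w1 … wk.

```python
# Toy multivariate example: y = w0 + w1*x1 + w2*x2 + w3*x3 + noise.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # three input series
y = 1.5 + X @ np.array([0.8, -0.3, 2.0]) + rng.normal(scale=0.1, size=200)

model = LinearRegression()                          # fits w0 (intercept) and w1..wk
model.fit(X, y)
print(model.intercept_, model.coef_)
```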

SLIDE 31

Multivariate regression

  • What if we have several inputs?
  • Stock prices for Yahoo, Microsoft and eBay for the Google prediction task

  • This becomes a multivariate regression problem

  • Again, it's easy to model:

    y = w0 + w1x1 + … + wkxk + ε

Not all functions can be approximated using the input values directly.

SLIDE 32

    y = 10 + 3x1² − 2x2² + ε

In some cases we would like to use polynomial or other terms based on the input data. Are these still linear regression problems?

  • Yes. As long as the equation is linear in the coefficients, it is still a linear regression problem!

SLIDE 33

Non-linear basis functions

  • So far we only used the observed values

  • However, linear regression can be applied in the same way to functions of these values

  • As long as these functions can be directly computed from the observed values, the model is still linear in the parameters and the problem remains a linear regression problem

    y = w1x1 + w2x2 + … + wkxk + ε

SLIDE 34

Non-linear basis functions

  • What type of functions can we use?
  • A few common examples:
  • Polynomial: φj(x) = x^j for j = 0 … n
  • Gaussian: φj(x) = (x − μj)² / (2σj²)
  • Sigmoid: φj(x) = 1 / (1 + exp(−sj x))

Any function of the input values can be used. The solution for the parameters of the regression remains the same.

SLIDE 35

General linear regression problem

  • Using our new notation for the basis functions, linear regression can be written as

    y = Σ_{j=0}^{n} wj φj(x)

  • Where φj(x) can be either xj for multivariate regression or one of the non-linear bases we defined

  • Once again we can use 'least squares' to find the optimal solution.

SLIDE 36

LMS for the general linear regression problem

The model: y = Σ_{j=0}^{k} wj φj(x). Our goal is to minimize the following loss function:

    J(w) = Σ_i (y_i − Σ_j wj φj(x_i))²

Moving to vector notation we get:

    J(w) = Σ_i (y_i − w^T φ(x_i))²

where w is a vector of dimension k+1, φ(x_i) is a vector of dimension k+1, and y_i is a scalar.

We take the derivative w.r.t. w:

    ∂/∂w Σ_i (y_i − w^T φ(x_i))² = −2 Σ_i (y_i − w^T φ(x_i)) φ(x_i)^T

Equating to 0 we get:

    Σ_i y_i φ(x_i)^T = w^T Σ_i φ(x_i) φ(x_i)^T

SLIDE 37

LMS for general linear regression problem

We take the derivative w.r.t. w of J(w) = Σ_i (y_i − w^T φ(x_i))²:

    ∂/∂w Σ_i (y_i − w^T φ(x_i))² = −2 Σ_i (y_i − w^T φ(x_i)) φ(x_i)^T

Equating to 0 we get:

    Σ_i y_i φ(x_i)^T = w^T Σ_i φ(x_i) φ(x_i)^T

Define the design matrix Φ, whose i-th row holds the basis functions evaluated at x_i:

    Φ = [ φ0(x_1)  φ1(x_1)  …  φk(x_1)
          φ0(x_2)  φ1(x_2)  …  φk(x_2)
          ⋮
          φ0(x_n)  φ1(x_n)  …  φk(x_n) ]

Then deriving w we get:

    w = (Φ^T Φ)^-1 Φ^T y

SLIDE 38

LMS for general linear regression problem

    J(w) = Σ_i (y_i − w^T φ(x_i))²

Deriving w we get:

    w = (Φ^T Φ)^-1 Φ^T y

where Φ is an n by k+1 matrix, y is a vector with n entries, and w is a vector with k+1 entries. This solution is also known as the 'pseudo-inverse' solution.
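
A small NumPy sketch of this solution (the quadratic basis and toy data are my own choices): build the design matrix Φ and solve for w both with the explicit pseudo-inverse formula and with a numerically safer least-squares routine.

```python
# Sketch of the general least-squares solution w = (Phi^T Phi)^{-1} Phi^T y.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)
y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(scale=0.3, size=x.size)

# n by (k+1) design matrix with basis functions 1, x, x^2 (arbitrary choice).
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Direct pseudo-inverse formula (fine for small problems) ...
w_direct = np.linalg.inv(Phi.T @ Phi) @ Phi.T @ y
# ... and the numerically preferable least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_direct, w_lstsq)
```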

SLIDE 39

Example: Polynomial regression

SLIDE 40

A probabilistic interpretation

Our least squares minimization solution can also be motivated by a probabilistic interpretation of the regression problem:

    y = w^T φ(x) + ε

where ε is Gaussian noise with mean 0. The MLE for w in this model is the same as the solution we derived for the least squares criterion:

    w = (Φ^T Φ)^-1 Φ^T y

SLIDE 41

Other types of linear regression

  • Linear regression is a useful model for many problems

  • However, the parameters we learn for this model are global; they are the same regardless of the value of the input x

  • Extensions to linear regression adjust their parameters based on the region of the input we are dealing with

SLIDE 42

Splines

  • Instead of fitting one function for the entire region, fit a set of piecewise (usually cubic) polynomials satisfying continuity and smoothness constraints.

  • Results in smooth and flexible functions without too many parameters

  • Need to define the regions in advance (usually uniform)

    y = a1x³ + b1x² + c1x + d1
    y = a2x³ + b2x² + c2x + d2
    y = a3x³ + b3x² + c3x + d3
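
As a loose illustration (my own choice of tool, not necessarily what the slides used), SciPy's cubic smoothing spline fits piecewise cubic polynomials with continuity and smoothness constraints in the same spirit:

```python
# Loose illustration: a cubic smoothing spline with SciPy on toy data.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=80))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

spline = UnivariateSpline(x, y, k=3, s=2.0)   # cubic pieces, smoothing factor s
x_grid = np.linspace(0, 10, 5)
print(spline(x_grid))                         # smooth predictions on a grid
```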

SLIDE 43

LOCAL, KERNEL REGRESSION

SLIDE 44

Local Kernel Regression

  • What is the temperature in the room?

(Figure: temperature readings across a room; compare the global average with a "local" average at location x.)

SLIDE 45

Local Average Regression

Recall: the NN classifier takes a majority vote. Here we use an average instead:

    f(x) = (sum of the Y's in the h-ball around x) / (number of points in the h-ball around x)

SLIDE 46

Nadaraya-Watson Kernel Regression

(Figure: the Nadaraya-Watson estimate with bandwidth h.)

SLIDE 47

Local Kernel Regression

  • Nonparametric estimator akin to kNN

  • Nadaraya-Watson kernel estimator: weight each training point based on its distance to the test point, using a kernel with bandwidth h

  • A boxcar kernel yields the local average
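
For concreteness (my own toy data), the standard Nadaraya-Watson estimator is f(x) = Σ_i K((x − X_i)/h) Y_i / Σ_i K((x − X_i)/h); with a boxcar kernel it reduces to the local average above.

```python
# Sketch of the Nadaraya-Watson estimator with boxcar and Gaussian kernels.
import numpy as np

def boxcar(u):                      # 1 inside the unit interval, 0 outside
    return (np.abs(u) <= 1).astype(float)

def gaussian(u):
    return np.exp(-0.5 * u ** 2)

def nadaraya_watson(x0, X, Y, h, kernel):
    """Weighted average of the Y's, weights given by kernel((x0 - X)/h)."""
    w = kernel((x0 - X) / h)
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
Y = np.sin(X) + rng.normal(scale=0.3, size=X.size)

x0, h = 5.0, 0.5
print(nadaraya_watson(x0, X, Y, h, boxcar))    # boxcar kernel = local average
print(nadaraya_watson(x0, X, Y, h, gaussian))
```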

SLIDE 48

Kernels

SLIDE 49

Spatially adaptive regression

If the smoothness of the function varies spatially, we want to allow the bandwidth h to depend on X: local polynomials, splines, wavelets, regression trees, …

SLIDE 50

Local Average Regression

Recall: the NN classifier takes a majority vote. Here we use an average instead:

    f(x) = (sum of the Y's in the h-ball around x) / (number of points in the h-ball around x)

SLIDE 51

Nadaraya-Watson Kernel Regression

(Figure: the Nadaraya-Watson estimate with bandwidth h.)

SLIDE 52

Local Kernel Regression

  • Nonparametric estimator akin to kNN

  • Nadaraya-Watson kernel estimator: weight each training point based on its distance to the test point, using a kernel with bandwidth h

  • A boxcar kernel yields the local average

SLIDE 53

Kernels

SLIDE 54

Choice of kernel bandwidth h

(Figure: fits with h = 1, 10, 50, and 200, labelled from "too small" through "just right" to "too large". Image source: Larry Wasserman, All of Nonparametric Statistics.)

SLIDE 55

Choice of Bandwidth

  • Large bandwidth: averages more data points, reduces noise (lower variance)

  • Small bandwidth: less smoothing, more accurate fit (lower bias)

Bias-variance tradeoff: the bandwidth should depend on n, the number of training points (which determines the variance), and on the smoothness of the function (which determines the bias). A toy comparison follows below.
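
A toy comparison (my own setup, not from the slides) showing how too small or too large a bandwidth hurts a simple local-average (boxcar) fit:

```python
# Toy illustration of the bias-variance tradeoff in the bandwidth h.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=300)
Y = np.sin(X) + rng.normal(scale=0.3, size=X.size)

def local_average(x0, h):
    mask = np.abs(X - x0) <= h          # points in the h-ball around x0
    return Y[mask].mean() if mask.any() else np.nan

x_grid = np.linspace(0.5, 9.5, 200)
for h in (0.05, 0.5, 5.0):              # too small, reasonable, too large
    fit = np.array([local_average(x0, h) for x0 in x_grid])
    err = np.nanmean((fit - np.sin(x_grid)) ** 2)
    print(f"h={h}: mean squared error vs. true function = {err:.3f}")
```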

SLIDE 56

Spatially adaptive regression

If the smoothness of the function varies spatially, we want to allow the bandwidth h to depend on X: local polynomials, splines, wavelets, regression trees, …

SLIDE 57

Important points

  • Linear regression
  • basic model
  • as a function of the input
  • Solving linear regression
  • Error in linear regression
  • Advanced regression models