
Random Forest: Applied Multivariate Statistics, Spring 2012



  1. Random Forest Applied Multivariate Statistics – Spring 2012

  2. Overview
     - Intuition of Random Forest
     - The Random Forest Algorithm
     - De-correlation gives better accuracy
     - Out-of-bag error (OOB error)
     - Variable importance

  3. Intuition of Random Forest
     [Figure: three decision trees (Tree 1, Tree 2, Tree 3) that split on age (young / old), sex (male / female), height (tall / short) and employment (retired / working); each leaf is labelled healthy or diseased]
     - New sample: old, retired, male, short
     - Tree predictions: diseased, healthy, diseased
     - Majority rule: diseased
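The majority rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `majority_vote` is made up here, and the three predictions are the ones listed on the slide for the new sample.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent class label among the tree predictions.

    On a tie, Counter.most_common returns the label seen first.
    """
    return Counter(predictions).most_common(1)[0][0]

# Predictions of Tree 1, Tree 2 and Tree 3 for the sample (old, retired, male, short)
tree_predictions = ["diseased", "healthy", "diseased"]
forest_prediction = majority_vote(tree_predictions)
print(forest_prediction)  # prints "diseased"
```

For regression, the same aggregation step would average the tree outputs instead of counting votes.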

  4. The Random Forest Algorithm

  5. Differences to standard tree
     - Train each tree on a bootstrap resample of the data (bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement; i.e., some samples will probably occur multiple times in the new data set)
     - For each split, consider only m randomly selected variables
     - Don't prune
     - Fit B trees in this way and aggregate their results by averaging or majority voting
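The two sources of randomness listed above (bootstrap resampling and random variable selection at each split) can be sketched as follows. This is a minimal illustration, not a full tree learner: the helper names `bootstrap_resample` and `candidate_variables`, the toy rows and the variable names are all assumptions made for this example.

```python
import random

def bootstrap_resample(data, rng):
    """Draw len(data) samples from data with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def candidate_variables(variables, m, rng):
    """Pick m distinct variables to be considered at a single split."""
    return rng.sample(variables, m)

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical toy data: (age, height, label)
data = [
    ("old", "tall", "healthy"),
    ("old", "short", "diseased"),
    ("young", "tall", "healthy"),
    ("young", "short", "diseased"),
    ("young", "short", "healthy"),
]
variables = ["age", "height", "sex", "employment"]  # hypothetical variable names

resample = bootstrap_resample(data, rng)             # training set for one tree
split_candidates = candidate_variables(variables, 2, rng)  # m = 2 for one split
```

A forest would repeat the resampling once per tree and the variable selection once per split, then grow each tree without pruning.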

  6. Why Random Forest works 1/2
     - $\text{Mean Squared Error} = \text{Variance} + \text{Bias}^2$
     - If trees are sufficiently deep, they have very small bias
     - How could we improve the variance over that of a single tree?

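  7. Why Random Forest works 2/2
     For B identically distributed trees $T_1, \dots, T_B$, each with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is
     $$\mathrm{Var}\Big(\frac{1}{B}\sum_{b=1}^{B} T_b\Big) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$
     - First term: decreases if $\rho$ decreases, i.e., if m decreases
     - Second term: decreases if the number of trees B increases (irrespective of $\rho$)
     - De-correlation gives better accuracy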
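To see how the two effects on this slide play out numerically, here is a small sketch. It assumes the standard expression for the variance of an average of B identically distributed trees with single-tree variance sigma^2 and pairwise correlation rho, namely rho * sigma^2 + (1 - rho) / B * sigma^2; the function name `ensemble_variance` and the example values are chosen for illustration only.

```python
def ensemble_variance(rho, sigma2, n_trees):
    """Variance of the average of n_trees identically distributed predictors.

    rho     -- pairwise correlation between trees (assumed equal for all pairs)
    sigma2  -- variance of a single tree
    n_trees -- number of trees B
    """
    return rho * sigma2 + (1.0 - rho) / n_trees * sigma2

# More trees shrink only the second term; the floor is rho * sigma2.
for n_trees in (1, 10, 100, 1000):
    print(n_trees, ensemble_variance(rho=0.5, sigma2=1.0, n_trees=n_trees))

# Lower correlation (smaller m) lowers the floor itself.
for rho in (0.8, 0.5, 0.2):
    print(rho, ensemble_variance(rho=rho, sigma2=1.0, n_trees=1000))
```

With B = 1 the expression reduces to the variance of a single tree, and as B grows it approaches rho * sigma^2, which is why de-correlating the trees matters in addition to adding more of them.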

  8. Estimating generalization error: Out-of-bag (OOB) error
     - Similar to leave-one-out cross-validation, but with almost no additional computational burden
     - The OOB error is a random quantity, since it is based on random resamples of the data
     [Figure: an example data set of (age, height, health status) records, its bootstrap resample, the out-of-bag samples (records that do not appear in the resample), and the tree fitted on the resample (splits on young / old and tall / short). Out-of-bag (OOB) error rate: 1/4 = 0.25]
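The OOB idea can be sketched in plain Python. This is an illustration under stated assumptions, not the procedure of any particular package: the toy records are hypothetical, and `fit_lookup_tree` is a made-up stand-in for an unpruned tree (it memorizes the majority label for each feature combination and falls back to the overall majority label).

```python
import random
from collections import Counter, defaultdict

def fit_lookup_tree(rows):
    """Stand-in for an unpruned tree: majority label per feature combination."""
    by_key = defaultdict(Counter)
    overall = Counter()
    for features, label in rows:
        by_key[features][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]
    table = {key: counts.most_common(1)[0][0] for key, counts in by_key.items()}
    return lambda features: table.get(features, default)

def oob_error(data, n_trees, seed=0):
    """Each sample is predicted only by trees whose resample did not contain it."""
    rng = random.Random(seed)
    n = len(data)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap resample
        in_bag = set(idx)
        tree = fit_lookup_tree([data[i] for i in idx])
        for i in range(n):
            if i not in in_bag:                      # sample i is out of bag
                votes[i][tree(data[i][0])] += 1
    scored = [i for i in range(n) if votes[i]]       # samples with >= 1 OOB vote
    if not scored:
        return None
    wrong = sum(votes[i].most_common(1)[0][0] != data[i][1] for i in scored)
    return wrong / len(scored)

# Hypothetical toy data: ((age, height), label)
data = [
    (("old", "tall"), "healthy"), (("old", "short"), "diseased"),
    (("old", "short"), "diseased"), (("young", "tall"), "healthy"),
    (("young", "tall"), "healthy"), (("young", "short"), "diseased"),
    (("young", "short"), "healthy"), (("old", "tall"), "healthy"),
]
err = oob_error(data, n_trees=50)
print(err)
```

Changing `seed` changes the resamples and therefore the estimate, which is the sense in which the OOB error is random.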

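  9. Variable Importance for variable i using Permutations
     - Draw m bootstrap resamples (Resampled Dataset 1, ..., Resampled Dataset m) from the data; each resample has its own OOB data (OOB Data 1, ..., OOB Data m)
     - Fit Tree j on Resampled Dataset j
     - For each tree j, compute two OOB errors, $e_j$ and $p_j$: one on the OOB data as is, and one after permuting the values of variable i in the OOB data set
     - Per-tree difference: $d_j = e_j - p_j$ for $j = 1, \dots, m$
     - Mean of the differences: $\bar{d} = \frac{1}{m} \sum_{j=1}^{m} d_j$
     - Variance of the differences: $s_d^2 = \frac{1}{m-1} \sum_{j=1}^{m} (d_j - \bar{d})^2$
     - Importance of variable i: $v_i = \bar{d} / s_d$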
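The arithmetic behind the importance score can be sketched as below. The per-tree error values are invented for illustration, and the function name `permutation_importance` is hypothetical. The extracted slide text does not make clear which of e and p is the error after permutation, so the sketch simply takes both lists as given and follows the slide's d = e - p.

```python
from statistics import mean, stdev

def permutation_importance(e, p):
    """Standardized mean difference of per-tree OOB errors.

    e, p -- per-tree OOB errors (same length, one entry per tree),
            combined as d_j = e_j - p_j as on the slide.
    """
    d = [e_j - p_j for e_j, p_j in zip(e, p)]
    d_bar = mean(d)
    s_d = stdev(d)  # sample standard deviation, divides by m - 1
    return d_bar / s_d

# Hypothetical OOB errors for m = 3 trees
e = [0.20, 0.25, 0.30]
p = [0.40, 0.50, 0.45]
v_i = permutation_importance(e, p)
print(v_i)
```

A score far from zero indicates that permuting the variable changes the OOB error consistently across trees; a score near zero indicates that the variable has little effect on predictions.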

  10. Trees vs. Random Forest
      Trees:
      + Trees yield insight into decision rules
      + Rather fast
      + Easy to tune parameters
      - Predictions of trees tend to have a high variance
      Random Forest:
      + RF has smaller prediction variance and therefore usually a better general performance
      + Easy to tune parameters
      - Rather slow
      - "Black box": rather hard to get insights into decision rules

  11. Comparing runtime (just for illustration)
      - Up to "thousands" of variables
      - Problematic if there are categorical predictors with many levels (max: 32 levels)
      [Figure: runtime of RF compared with a single tree; for RF, the first predictor was cut into 15 levels]

  12. RF vs. LDA
      RF:
      + Can model nonlinear class boundaries
      + OOB error "for free" (no CV needed)
      + Works on continuous and categorical responses (regression / classification)
      + Gives variable importance
      + Very good performance
      - "Black box"
      - Slow
      LDA:
      + Very fast
      + Discriminants for visualizing group separation
      + Can read off decision rule
      - Can model only linear class boundaries
      - Mediocre performance
      - No variable selection
      - Only on categorical response
      - Needs CV for estimating prediction error
      [Figure: scatter plot of class boundaries]

  13. Concepts to know
      - Idea of Random Forest and how it reduces the prediction variance of trees
      - OOB error
      - Variable importance based on permutation

  14. R functions to know
      - Functions "randomForest" and "varImpPlot" from package "randomForest"
