Learning Imbalanced Data with Random Forests

Chao Chen (Stat., UC Berkeley)

chenchao@stat.berkeley.edu

Andy Liaw (Merck Research Labs)

andy_liaw@merck.com

Leo Breiman (Stat., UC Berkeley)

leo@stat.berkeley.edu

Interface 2004, Baltimore


Outline

  • Imbalanced data
  • Common approaches and recent works
  • “Balanced” random forests
  • “Weighted” random forests
  • Some comparisons
  • Conclusion

Imbalanced Data

  • Data for many classification problems are inherently imbalanced
    – One large, “normal” class (negative) and one small/rare, “interesting” class (positive)
    – E.g.: rare diseases, fraud detection, compound screening in drug discovery, etc.

  • Why is this a problem?

    – Most machine learning algorithms focus on overall accuracy and “break down” with moderate imbalance in the data
    – Even some cost-sensitive algorithms don’t work well when the imbalance is extreme


Common Approaches

  • Up-sampling the minority class
    – random sampling with replacement
    – strategically add cases that reduce error
  • Down-sampling the majority class
    – random sampling
    – strategically omit cases that do not help
  • Cost-sensitive learning
    – build misclassification cost into the algorithm
  • Down-sampling tends to work better empirically, but loses some information, since not all training data are used
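The two random-sampling variants above can be sketched in a few lines of Python (a minimal illustration; the function names and toy data are mine, not from the talk):

```python
import random

def up_sample(minority, target_size, seed=0):
    """Up-sampling: randomly sample the minority class WITH replacement
    until it reaches target_size."""
    rng = random.Random(seed)
    return [rng.choice(minority) for _ in range(target_size)]

def down_sample(majority, target_size, seed=0):
    """Down-sampling: randomly sample the majority class WITHOUT
    replacement down to target_size."""
    rng = random.Random(seed)
    return rng.sample(majority, target_size)

majority = list(range(1000))   # 1000 "negative" cases
minority = list(range(20))     # 20 "positive" cases

# Two ways to obtain a balanced training set:
balanced_up = up_sample(minority, len(majority)) + majority
balanced_down = down_sample(majority, len(minority)) + minority
```

Down-sampling discards most of the majority cases (the information loss noted above); up-sampling keeps them all but duplicates minority cases.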


Recent Work

  • One-sided sampling
  • SMOTE: Synthetic Minority Over-sampling TEchnique (Chawla et al., 2002)
  • SMOTEBoost
  • SHRINK

Random Forest

  • A supervised learning algorithm, constructed by combining multiple decision trees (Breiman, 2001)
  • Draw a bootstrap sample of the data
  • Grow an un-pruned tree
    – At each node, only a small, random subset of the predictor variables is tried to split that node
  • Repeat as many times as you’d like
  • Make predictions using all trees
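The loop above can be sketched in plain Python. For brevity, a one-split stump stands in for a full un-pruned tree; the names and the toy learner are mine, not the authors' Fortran/R code:

```python
import random
from collections import Counter

def stump_fit(X, y):
    """Fit a one-split 'tree' (stand-in for a full un-pruned tree):
    try only a small random subset of the predictors (mtry) and pick
    the single threshold split with the fewest misclassifications."""
    n_features = len(X[0])
    mtry = max(1, int(n_features ** 0.5))          # random subset of predictors
    candidates = random.sample(range(n_features), mtry)
    best = None
    for j in candidates:
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            err = sum(yi != l_lab for yi in left) + sum(yi != r_lab for yi in right)
            if best is None or err < best[0]:
                best = (err, j, t, l_lab, r_lab)
    if best is None:                               # degenerate one-class sample
        lab = Counter(y).most_common(1)[0][0]
        return lambda row: lab
    _, j, t, l_lab, r_lab = best
    return lambda row: l_lab if row[j] <= t else r_lab

def forest_fit(X, y, n_trees=25, seed=0):
    """Grow each tree on a bootstrap sample of the data."""
    random.seed(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = [random.randrange(n) for _ in range(n)]   # bootstrap sample
        trees.append(stump_fit([X[i] for i in idx], [y[i] for i in idx]))
    return trees

def forest_predict(trees, row):
    """Predict by majority vote over all trees."""
    return Counter(t(row) for t in trees).most_common(1)[0][0]
```

The two RF ingredients the slide lists are both here: the bootstrap sample per tree in `forest_fit`, and the random predictor subset per split in `stump_fit`.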

“Balanced” Random Forest

  • Natural integration of down-sampling the majority class and ensemble learning
  • For each tree in the RF, down-sample the majority class to the same size as the minority class
  • Given enough trees, all training data are used, so no loss of information
  • Computationally efficient, since each tree only sees a small sample
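The per-tree sampling step can be written as follows (a hedged sketch; the function name is mine, and the R package draws these samples internally):

```python
import random

def balanced_bootstrap(y, seed=None):
    """Indices for one BRF tree: a bootstrap sample of the minority
    class, plus an equally sized sample (with replacement) from the
    majority class. Different trees get different majority cases, so
    over many trees all the training data get used."""
    rng = random.Random(seed)
    pos = [i for i, yi in enumerate(y) if yi == 1]   # minority class
    neg = [i for i, yi in enumerate(y) if yi == 0]   # majority class
    n = len(pos)
    sample = ([rng.choice(pos) for _ in range(n)] +
              [rng.choice(neg) for _ in range(n)])
    rng.shuffle(sample)
    return sample
```

Each tree then trains on a perfectly balanced sample of size 2 × (minority count), which is why BRF stays cheap even for very large majority classes.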

“Weighted” Random Forest

  • Incorporate class weights in several places of the RF algorithm:
    – Weighted Gini for split selection
    – Class-weighted votes at terminal nodes for the node class
    – Weighted votes over all trees, using average weights at terminal nodes
  • Using weighted Gini alone isn’t sufficient
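The first of the three ingredients, the weighted Gini criterion, can be illustrated as follows (a minimal sketch under the usual definition, where class k contributes in proportion to weight × count; not the authors' implementation):

```python
def weighted_gini(counts, weights):
    """Gini impurity with class weights: each class's share of the node
    is weights[k] * counts[k] normalized over all classes, so a heavily
    weighted minority class dominates the impurity even when rare."""
    total = sum(w * c for w, c in zip(weights, counts))
    if total == 0:
        return 0.0
    return 1.0 - sum((w * c / total) ** 2 for w, c in zip(weights, counts))
```

With equal weights this reduces to the ordinary Gini index; giving the minority class a large weight makes a node with 99 majority and 1 minority case look just as impure as a 50/50 node, which is what pushes splits toward isolating the rare class.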


Performance Assessment

  • True Positive Rate (TPR): TP / (TP + FN)
  • True Negative Rate (TNR): TN / (TN + FP)
  • Precision: TP / (TP + FP)
  • Recall: same as TPR
  • g-mean: (TPR × TNR)^(1/2)
  • F-measure: (2 × Precision × Recall) / (Precision + Recall)

Confusion Matrix

             Predicted Positive   Predicted Negative
  Positive   True Positive        False Negative
  Negative   False Positive       True Negative
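The measures above follow directly from the four confusion-matrix counts; a small helper makes the definitions concrete (the function name is mine):

```python
import math

def metrics(tp, fn, fp, tn):
    """Compute the performance measures from confusion-matrix counts."""
    tpr = tp / (tp + fn)                      # recall / sensitivity
    tnr = tn / (tn + fp)                      # specificity
    precision = tp / (tp + fp)
    g_mean = math.sqrt(tpr * tnr)             # balances the two class rates
    f_measure = 2 * precision * tpr / (precision + tpr)
    return {"TPR": tpr, "TNR": tnr, "Precision": precision,
            "g-mean": g_mean, "F": f_measure}
```

Note why these are used instead of accuracy: with a 1% positive class, predicting everything negative scores 99% accuracy but has TPR = 0, so both g-mean and F collapse to 0.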


Benchmark Data

  Dataset        No. of Var.   No. of Obs.   % Minority
  Oil Spill           50             937         4.4
  Mammography          6           11183         2.3
  SatImage            36            6435         9.7


Oil Spill Data

  Method             TPR    TNR    Precision   g-mean   F-meas
  1-sided sampling   76.0   86.6     20.5       81.13    32.3
  SHRINK             82.5   60.9      8.85      70.9     16.0
  SMOTE              89.5   78.9     16.4       84.0     27.7
  BRF                73.2   91.6     28.6       81.9     41.1
  WRF                92.7   82.4     19.4       87.4     32.1

Performance for 1-sided sampling, SHRINK, and SMOTE taken from Chawla et al. (2002).


Mammography Data

  Method       TPR    TNR    Precision   g-mean   F-meas
  RIPPER       48.1   99.6     74.7       69.2     58.1
  SMOTE        62.2   99.0     60.5       78.5     60.4
  SMOTEBoost   62.6   99.5     74.5       78.9     68.1
  BRF          76.5   98.2     50.5       86.7     60.8
  WRF          72.7   99.2     69.7       84.9     71.1

Performance for RIPPER, SMOTE, and SMOTEBoost taken from Chawla et al. (2003).


Satimage Data

  Method       TPR    TNR    Precision   g-mean   F-meas
  RIPPER       47.4   97.6     67.9       68.0     55.5
  SMOTE        74.9   91.3     48.1       82.7     58.3
  SMOTEBoost   67.9   97.2     72.7       81.2     70.2
  BRF          77.0   93.6     56.3       84.9     65.0
  WRF          77.5   94.6     60.5       85.6     68.0

Performance for RIPPER, SMOTE, and SMOTEBoost taken from Chawla et al. (2003).


A Simple Experiment: 2Norm

  • Fix the size of one class at 100; vary the size of the other class among 5e3, 1e4, 5e4, and 1e5
  • Train both WRF and BRF, predict on a test set of the same size
    – WRF: use the reciprocal of the class ratio as weights
    – BRF: draw 100 from each class with replacement to grow each tree
  • With the usual prediction, BRF has the better false negative rate; WRF has the better true positive rate
  • Compare cumulative gain to see the difference
  • Compare cumulative gain to see difference

Comparing Cumulative Gain

[Figure: cumulative-gain curves (% positives found vs. number of cases examined) for WRF and BRF, one panel per class ratio: 1:50, 1:100, 1:500, 1:1000]
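The quantity plotted in those panels can be computed directly from predicted scores (a hedged sketch; the function name is mine):

```python
def cumulative_gain(scores, labels):
    """Cumulative gain: rank cases by predicted score (descending) and,
    for each depth k, report what fraction of ALL positives appears
    among the top-k ranked cases."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    found, gain = 0, []
    for i in order:
        found += labels[i]
        gain.append(found / total_pos)
    return gain
```

A method with the better gain curve finds more of the rare positives near the top of its ranking, which is often what matters in practice (e.g., which compounds to screen first) even when the two methods' hard classifications look similar.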


To Wrap Up…

  • We propose two methods of learning imbalanced data with random forests
    – BRF: down-sampling the majority class in each tree
    – WRF: incorporating class weights in several places
  • Both show improvements over existing methods
  • The two are about equally effective on real data; it is hard to pick a winner
  • Further study is needed to see if/when/why one works better than the other


Free Software

  • Random Forest (Breiman & Cutler): Fortran code, implements WRF, available at
    http://stat-www.berkeley.edu/users/breiman/RandomForests/
  • randomForest (Liaw & Wiener): add-on package for R (based on the Fortran code above), implements BRF, available on CRAN
    (e.g.: http://cran.us.r-project.org/src/contrib/PACKAGES.html)


Acknowledgment

  • Adele Cutler (Utah State)
  • Vladimir Svetnik, Chris Tong, Ting Wang (BR)
  • Matt Wiener (ACSM)