Learning Imbalanced Data with Random Forests

Chao Chen (Stat., UC Berkeley)

chenchao@stat.berkeley.edu

Andy Liaw (Merck Research Labs)

andy_liaw@merck.com

Leo Breiman (Stat., UC Berkeley)

leo@stat.berkeley.edu

Interface 2004, Baltimore


Outline

  • Imbalanced data
  • Common approaches and recent works
  • “Balanced” random forests
  • “Weighted” random forests
  • Some comparisons
  • Conclusion

Imbalanced Data

  • Data for many classification problems are inherently imbalanced
    – One large, “normal” class (negative) and one small/rare, “interesting” class (positive)
    – E.g.: rare diseases, fraud detection, compound screening in drug discovery, etc.

  • Why is this a problem?

    – Most machine learning algorithms focus on overall accuracy and “break down” with moderate imbalance in the data
    – Even some cost-sensitive algorithms don’t work well when the imbalance is extreme


Common Approaches

  • Up-sampling the minority class
    – random sampling with replacement
    – strategically add cases that reduce error
  • Down-sampling the majority class
    – random sampling
    – strategically omit cases that do not help
  • Cost-sensitive learning
    – build misclassification cost into the algorithm
  • Down-sampling tends to work better empirically, but loses some information, since not all training data are used
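The two random-sampling variants above can be sketched in a few lines of Python (a minimal illustration; the function names and toy data are mine, not from the talk):

```python
import random

def up_sample(minority, target_size, seed=0):
    """Up-sampling: randomly sample the minority class WITH replacement
    until it reaches target_size."""
    rng = random.Random(seed)
    return [rng.choice(minority) for _ in range(target_size)]

def down_sample(majority, target_size, seed=0):
    """Down-sampling: randomly sample the majority class WITHOUT
    replacement down to target_size."""
    rng = random.Random(seed)
    return rng.sample(majority, target_size)

majority = list(range(1000))   # 1000 "negative" cases
minority = list(range(20))     # 20 "positive" cases

# Two ways to obtain a balanced training set:
balanced_up = up_sample(minority, len(majority)) + majority
balanced_down = down_sample(majority, len(minority)) + minority
```

Down-sampling discards most of the majority cases (the information loss noted above); up-sampling keeps them all but duplicates minority cases.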


Recent Work

  • One-sided sampling
  • SMOTE: Synthetic Minority Over-sampling TEchnique (Chawla et al., 2002)
  • SMOTEBoost
  • SHRINK

Random Forest

  • A supervised learning algorithm, constructed by combining multiple decision trees (Breiman, 2001)
  • Draw a bootstrap sample of the data
  • Grow an un-pruned tree
    – At each node, only a small, random subset of the predictor variables is tried to split that node
  • Repeat as many times as you’d like
  • Make predictions using all trees
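The loop above can be sketched in plain Python. For brevity, a one-split stump stands in for a full un-pruned tree; the names and the toy learner are mine, not the authors' Fortran/R code:

```python
import random
from collections import Counter

def stump_fit(X, y):
    """Fit a one-split 'tree' (stand-in for a full un-pruned tree):
    try only a small random subset of the predictors (mtry) and pick
    the single threshold split with the fewest misclassifications."""
    n_features = len(X[0])
    mtry = max(1, int(n_features ** 0.5))          # random subset of predictors
    candidates = random.sample(range(n_features), mtry)
    best = None
    for j in candidates:
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            err = sum(yi != l_lab for yi in left) + sum(yi != r_lab for yi in right)
            if best is None or err < best[0]:
                best = (err, j, t, l_lab, r_lab)
    if best is None:                               # degenerate one-class sample
        lab = Counter(y).most_common(1)[0][0]
        return lambda row: lab
    _, j, t, l_lab, r_lab = best
    return lambda row: l_lab if row[j] <= t else r_lab

def forest_fit(X, y, n_trees=25, seed=0):
    """Grow each tree on a bootstrap sample of the data."""
    random.seed(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = [random.randrange(n) for _ in range(n)]   # bootstrap sample
        trees.append(stump_fit([X[i] for i in idx], [y[i] for i in idx]))
    return trees

def forest_predict(trees, row):
    """Predict by majority vote over all trees."""
    return Counter(t(row) for t in trees).most_common(1)[0][0]
```

The two RF ingredients the slide lists are both here: the bootstrap sample per tree in `forest_fit`, and the random predictor subset per split in `stump_fit`.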

“Balanced” Random Forest

  • Natural integration of down-sampling the majority class and ensemble learning
  • For each tree in the RF, down-sample the majority class to the same size as the minority class
  • Given enough trees, all training data are used, so no loss of information
  • Computationally efficient, since each tree only sees a small sample
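The per-tree sampling step can be written as follows (a hedged sketch; the function name is mine, and the R package draws these samples internally):

```python
import random

def balanced_bootstrap(y, seed=None):
    """Indices for one BRF tree: a bootstrap sample of the minority
    class, plus an equally sized sample (with replacement) from the
    majority class. Different trees get different majority cases, so
    over many trees all the training data get used."""
    rng = random.Random(seed)
    pos = [i for i, yi in enumerate(y) if yi == 1]   # minority class
    neg = [i for i, yi in enumerate(y) if yi == 0]   # majority class
    n = len(pos)
    sample = ([rng.choice(pos) for _ in range(n)] +
              [rng.choice(neg) for _ in range(n)])
    rng.shuffle(sample)
    return sample
```

Each tree then trains on a perfectly balanced sample of size 2 × (minority count), which is why BRF stays cheap even for very large majority classes.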

“Weighted” Random Forest

  • Incorporate class weights in several places of the RF algorithm:
    – Weighted Gini for split selection
    – Class-weighted votes at terminal nodes for the node class
    – Weighted votes over all trees, using average weights at terminal nodes
  • Using weighted Gini alone isn’t sufficient
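The first of the three ingredients, the weighted Gini criterion, can be illustrated as follows (a minimal sketch under the usual definition, where class k contributes in proportion to weight × count; not the authors' implementation):

```python
def weighted_gini(counts, weights):
    """Gini impurity with class weights: each class's share of the node
    is weights[k] * counts[k] normalized over all classes, so a heavily
    weighted minority class dominates the impurity even when rare."""
    total = sum(w * c for w, c in zip(weights, counts))
    if total == 0:
        return 0.0
    return 1.0 - sum((w * c / total) ** 2 for w, c in zip(weights, counts))
```

With equal weights this reduces to the ordinary Gini index; giving the minority class a large weight makes a node with 99 majority and 1 minority case look just as impure as a 50/50 node, which is what pushes splits toward isolating the rare class.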


Performance Assessment

  • True Positive Rate (TPR): TP / (TP + FN)
  • True Negative Rate (TNR): TN / (TN + FP)
  • Precision: TP / (TP + FP)
  • Recall: same as TPR
  • g-mean: (TPR × TNR)^(1/2)
  • F-measure: (2 × Precision × Recall) / (Precision + Recall)

Confusion Matrix

             Predicted Positive   Predicted Negative
  Positive   True Positive        False Negative
  Negative   False Positive       True Negative
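The measures above follow directly from the four confusion-matrix counts; a small helper makes the definitions concrete (the function name is mine):

```python
import math

def metrics(tp, fn, fp, tn):
    """Compute the performance measures from confusion-matrix counts."""
    tpr = tp / (tp + fn)                      # recall / sensitivity
    tnr = tn / (tn + fp)                      # specificity
    precision = tp / (tp + fp)
    g_mean = math.sqrt(tpr * tnr)             # balances the two class rates
    f_measure = 2 * precision * tpr / (precision + tpr)
    return {"TPR": tpr, "TNR": tnr, "Precision": precision,
            "g-mean": g_mean, "F": f_measure}
```

Note why these are used instead of accuracy: with a 1% positive class, predicting everything negative scores 99% accuracy but has TPR = 0, so both g-mean and F collapse to 0.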


Benchmark Data

  Dataset        No. of Var.   No. of Obs.   % Minority
  Oil Spill           50             937         4.4
  Mammography          6           11183         2.3
  SatImage            36            6435         9.7


Oil Spill Data

  Method             TPR    TNR    Precision   g-mean   F-meas
  1-sided sampling   76.0   86.6     20.5       81.13    32.3
  SHRINK             82.5   60.9      8.85      70.9     16.0
  SMOTE              89.5   78.9     16.4       84.0     27.7
  BRF                73.2   91.6     28.6       81.9     41.1
  WRF                92.7   82.4     19.4       87.4     32.1

Performance for 1-sided sampling, SHRINK, and SMOTE taken from Chawla et al. (2002).


Mammography Data

  Method       TPR    TNR    Precision   g-mean   F-meas
  RIPPER       48.1   99.6     74.7       69.2     58.1
  SMOTE        62.2   99.0     60.5       78.5     60.4
  SMOTEBoost   62.6   99.5     74.5       78.9     68.1
  BRF          76.5   98.2     50.5       86.7     60.8
  WRF          72.7   99.2     69.7       84.9     71.1

Performance for RIPPER, SMOTE, and SMOTEBoost taken from Chawla et al. (2003).


Satimage Data

  Method       TPR    TNR    Precision   g-mean   F-meas
  RIPPER       47.4   97.6     67.9       68.0     55.5
  SMOTE        74.9   91.3     48.1       82.7     58.3
  SMOTEBoost   67.9   97.2     72.7       81.2     70.2
  BRF          77.0   93.6     56.3       84.9     65.0
  WRF          77.5   94.6     60.5       85.6     68.0

Performance for RIPPER, SMOTE, and SMOTEBoost taken from Chawla et al. (2003).


A Simple Experiment: 2Norm

  • Fix the size of one class at 100; vary the size of the other class among 5e3, 1e4, 5e4, and 1e5
  • Train both WRF and BRF, predict on a test set of the same size
    – WRF: use the reciprocal of the class ratio as weights
    – BRF: draw 100 from each class with replacement to grow each tree
  • With the usual prediction, BRF has the better false negative rate; WRF has the better true positive rate
  • Compare cumulative gain to see the difference
  • Compare cumulative gain to see difference

Comparing Cumulative Gain

[Figure: cumulative-gain curves (% positives found vs. number of cases examined) for WRF and BRF, one panel per class ratio: 1:50, 1:100, 1:500, 1:1000]
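The quantity plotted in those panels can be computed directly from predicted scores (a hedged sketch; the function name is mine):

```python
def cumulative_gain(scores, labels):
    """Cumulative gain: rank cases by predicted score (descending) and,
    for each depth k, report what fraction of ALL positives appears
    among the top-k ranked cases."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    found, gain = 0, []
    for i in order:
        found += labels[i]
        gain.append(found / total_pos)
    return gain
```

A method with the better gain curve finds more of the rare positives near the top of its ranking, which is often what matters in practice (e.g., which compounds to screen first) even when the two methods' hard classifications look similar.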


To Wrap Up…

  • We propose two methods of learning imbalanced data with random forests
    – BRF: down-sampling the majority class in each tree
    – WRF: incorporating class weights in several places
  • Both show improvements over existing methods
  • The two are about equally effective on real data; it is hard to pick a winner
  • Further study is needed to see if/when/why one works better than the other


Free Software

  • Random Forest (Breiman & Cutler): Fortran code, implements WRF, available at
    http://stat-www.berkeley.edu/users/breiman/RandomForests/
  • randomForest (Liaw & Wiener): add-on package for R (based on the Fortran code above), implements BRF, available on CRAN
    (e.g.: http://cran.us.r-project.org/src/contrib/PACKAGES.html)


Acknowledgment

  • Adele Cutler (Utah State)
  • Vladimir Svetnik, Chris Tong, Ting Wang (BR)
  • Matt Wiener (ACSM)