

SLIDE 1

Comparative Review of Classification Trees

by Leonardo Auslender, leoldv12 ‘at’ gmail ‘dot’ com Independent Statistical Research Consultant 2013

SLIDE 2

Contents

1) Trees/CART: varieties, algorithm
2) Model Deployment: scoring
3) Examples
4) Concluding Remarks: Brains, the future

Review of Trees: Auslender, L. (1998): Alacart, Poor Man's Classification Trees, North Eastern SAS Users Group Conference.

SLIDE 3

1) Varieties of Tree Methods

A Field Guide to Trees:
CART, Tree (S+, R), AID, THAID, CHAID, ID3, C4.5, C5.0

SLIDE 4

CART

Classification and Regression Trees. Source: Breiman L., Friedman J., Olshen R., Stone C.: Classification and Regression Trees, Wadsworth International Group, Belmont, CA, 1984.

SLIDE 5

Aim: separate two classes by splitting on X1 and X2, producing increasingly homogeneous rectangular regions.

SLIDE 6

CART: underlying classification algorithm, using misclassification.

Y  X1  X2  X3  X4
1  10  21   1   1
1  30   8   1   2
2   8   3  10   8

Misscl(Y | X1 <= 1) = .5; Misscl(Y | X1 > 1) = 0. Repeat for every value of X1 and for every other X variable, then select the optimal variable and split (CART actually uses Gini).
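As a concrete illustration of this exhaustive search, here is a minimal Python sketch (mine, not the Alacart code) that scans every candidate threshold of every predictor and keeps the split with the lowest weighted misclassification:

import numpy as np

def misclassification(y):
    # misclassification impurity of a node: 1 - proportion of majority class
    if len(y) == 0:
        return 0.0
    return 1.0 - np.bincount(y).max() / len(y)

def best_split(X, y):
    # exhaustive CART-style search over all variables and all cut points
    n, p = X.shape
    best = (None, None, np.inf)            # (variable index, threshold, impurity)
    for j in range(p):
        for t in np.unique(X[:, j])[:-1]:  # candidate cut points for variable j
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            # size-weighted impurity of the two child nodes
            imp = (len(left) * misclassification(left)
                   + len(right) * misclassification(right)) / n
            if imp < best[2]:
                best = (j, t, imp)
    return best

# toy data loosely mirroring the slide's table (classes recoded to 0/1)
X = np.array([[10, 21, 1, 1], [30, 8, 1, 2], [8, 3, 10, 8]])
y = np.array([1, 1, 0])
print(best_split(X, y))   # a pure split (impurity 0) is found immediately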

SLIDE 7

Basic CART Algorithm: binary dependent variable or target (0,1)

[Figure: the range of continuous variable Xi, with Y classes "0" and "1" plotted along it. A splitting point divides the range so that the class mix shifts from the original 50% of "0"s and "1"s to roughly 70% on one side and 20% on the other.]

SLIDE 8

Divide and Conquer: recursive partitioning

[Tree diagram: root node n = 5,000 with 10% HELOC. Split on Debits < 19: "yes" node n = 3,350 with 5% HELOC; "no" node n = 1,650 with 21% HELOC.]

SLIDE 9

Ideal SAS code to find splits

proc summary data = .... nway;
   class (all independent vars);
   var depvar;            /* this is 'target', 0/1 */
   output out = .... sum = ;
run;

For large data sets (large N, large NVAR), hardware and software constraints prevent completion.
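For comparison, the same cell-count idea in Python/pandas (a sketch; the data frame and column names here are hypothetical) hits the same wall, since the number of class-variable combinations explodes with NVAR:

import pandas as pd

# hypothetical data: categorical predictors plus a 0/1 target
df = pd.DataFrame({
    "region":  ["N", "S", "N", "E", "S", "E"],
    "segment": ["a", "b", "a", "a", "b", "b"],
    "target":  [1, 0, 1, 0, 1, 0],
})

# analogue of PROC SUMMARY ... CLASS ... / NWAY: one row per cell,
# with the event count (sum of the 0/1 target) and cell size
cells = (df.groupby(["region", "segment"])["target"]
           .agg(events="sum", n="size")
           .reset_index())
print(cells)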

SLIDE 10

Fitted Decision Tree: interpretation and structure

[Tree diagram: splits on VAR A, VAR B (< 19 vs. >= 19), and VAR C (0-52 vs. > 52), with leaf event rates of 5%, 21%, 25%, and 45%.]

SLIDE 11

The Cultivation of Trees

  • Split Search

– Which splits are to be considered?

  • Splitting Criterion

– Which split is best?

  • Stopping Rule

– When should the splitting stop?

  • Pruning Rule

– Should some branches be lopped off?
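These four questions map directly onto the knobs of a typical CART-style implementation. A minimal sketch in scikit-learn (my choice of library, not one used in the talk):

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(
    criterion="gini",       # splitting criterion: which split is best?
    max_depth=4,            # stopping rule: when should splitting stop?
    min_samples_leaf=20,    # stopping rule: smallest leaf allowed
    ccp_alpha=0.0,          # pruning rule: > 0 lops off weak branches
)                           # the split search itself is exhaustive over
tree.fit(X, y)              # all variables and thresholds, as in CART
print(tree.get_n_leaves())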

SLIDE 12

Possible Splits to Consider: most common is binary, because...

[Figure: number of candidate splits vs. number of input levels (2-20), for nominal and ordinal inputs; the count grows explosively, into the hundreds of thousands.]

If an input has 1,000 levels, there are 999 possible binary splits, 999 * 998 / 2 ternary splits, etc.
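The arithmetic behind those counts, for an ordered input with L levels (for a nominal input the binary count is the far larger 2^(L-1) - 1, which is why binary splits on ordered inputs are the common case):

from math import comb

L = 1000
binary  = L - 1               # choose 1 of the 999 interior cut points
ternary = comb(L - 1, 2)      # choose 2 cut points: 999 * 998 / 2 = 498,501
nominal_binary = 2 ** (L - 1) - 1   # unordered levels: subsets, astronomically many
print(binary, ternary)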

SLIDE 13

Splitting Criterion: Gini, twoing, misclassification, entropy...

A) Minimize the Gini impurity criterion (favors node homogeneity).
B) Maximize the twoing criterion (favors class separation).

Empirical results: for binary dependent variables, Gini and twoing are equivalent. For trinomial targets, Gini provides more accurate trees; beyond three categories, twoing performs better.
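For reference, a small sketch of the two criteria as usually defined (the formulas follow Breiman et al.; the function names are mine, and both children are assumed non-empty):

import numpy as np

def gini(y, classes):
    # Gini impurity of a node: 1 - sum_j p_j^2  (0 = perfectly homogeneous)
    p = np.array([np.mean(y == c) for c in classes])
    return 1.0 - np.sum(p ** 2)

def twoing(y_left, y_right, classes):
    # twoing criterion for a split: (pL * pR / 4) * (sum_j |p(j|L) - p(j|R)|)^2
    # larger is better; it rewards sending different classes to different sides
    n = len(y_left) + len(y_right)
    pL, pR = len(y_left) / n, len(y_right) / n
    diff = sum(abs(np.mean(y_left == c) - np.mean(y_right == c)) for c in classes)
    return pL * pR / 4.0 * diff ** 2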

SLIDE 14

The Right-Sized Tree

Stunting (stop growing early) vs. pruning (grow a large tree, then cut it back).
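A hedged sketch of the pruning route, in scikit-learn terms (cost-complexity pruning, the CART-style approach; the slides themselves show no code): grow the full tree, then pick the complexity penalty alpha on a validation set:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# candidate alphas from the full tree's cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_va, y_va),   # right-sized = best validation accuracy
)
print(best.get_n_leaves(), best.score(X_va, y_va))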

SLIDE 15

SLIDE 16

SLIDE 17

SLIDE 18

Benefits of Trees

  • Interpretability

– Tree structured presentation

  • Mixed Measurement Scales

– Nominal, ordinal, interval
– Regression trees

  • Robustness
  • Missing Values
SLIDE 19

…Benefits

  • Automatically

– Detects interactions (as in AID) via hierarchical conditioning search, not à la regression analysis.
– Selects input variables.

[Figure: a multivariate step function, probability plotted over two inputs.]

SLIDE 20

Drawbacks of Trees

  • Unstable: small perturbations in the data can lead to big changes in the tree.
  • Linear structures are approximated only in very rough form.
  • Applications may require that rule descriptions for different categories not share the same attributes.
  • A tree is a conditional structure, and interpretation often misunderstands the conditioning effect.

SLIDE 21

Drawbacks of Trees (cont.)

  • Tend to over-fit => overly optimistic accuracy.
  • Large trees are very difficult to interpret.
  • Tree size is conditioned by data set size.
  • No valid inferential procedures at present (if that matters).
  • Greedy search algorithm.

SLIDE 22

Note on Missing Values

1) Missingness NOT in Y (see Wang and Sheng, 2007, JMLR, for a semi-supervised method for missing Y).
2) Different methods of imputation:
   1) C4.5 probabilistic split: observations with missing values are passed to child nodes with weights equal to the proportions of non-missing values.
   2) Complete case: eliminate all observations with missing values, then train.
   3) Grand mode/mean: impute the mode if categorical, the mean if continuous.
   4) Separate class: appropriate for categorical variables. For continuous ones, create an extreme large value and thus separate missings from non-missings.
   5) Complete variable case: delete all variables with missing values.
   6) Surrogate (CART default): use surrogate variable(s) whenever the splitting variable is missing. At testing or scoring time, if the variable is missing, use the surrogate(s).
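A few of these strategies in pandas form (a sketch with made-up data; strategy 6, surrogate splits, is specific to CART implementations such as Alacart or rpart and has no one-line pandas equivalent):

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50.0, np.nan, 80.0],
                   "state":  ["NY", None, "CA"]})

# 2) complete case: drop rows with any missing value
complete = df.dropna()

# 3) grand mean / grand mode
mean_imp = df.assign(income=df["income"].fillna(df["income"].mean()),
                     state=df["state"].fillna(df["state"].mode()[0]))

# 4) separate class (categorical) / extreme value (continuous)
sep = df.assign(state=df["state"].fillna("MISSING"),
                income=df["income"].fillna(1e9))

# 5) complete variable case: drop columns with any missing value
complete_vars = df.dropna(axis=1)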

SLIDE 23

Tree Derivative: Random Forests (Breiman, 1999)

Random Forests proceed in the following steps; note that there is no need to create separate training, validation, and test data sets:

1. Take a random sample of N observations with replacement ("bagging") from the data set. On average this selects about 2/3 of the rows; the remaining 1/3 are called "out of bag" (OOB) observations. A new random selection is performed for each tree constructed.

2. Using the observations selected in step 1, construct a decision tree to its maximum size, without pruning. As the tree is built, allow only a random subset of the total set of predictor variables to be considered as possible splitters for each node. For example, if there are ten predictors, choose five of them randomly as candidate splitters, with a new random selection for each split. Some predictors (possibly the best one) will not be considered for a given split, but a predictor excluded from one split may be used for another split in the same tree.
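The two steps written out as a minimal Python sketch (illustrative only, not Breiman's code; scikit-learn's RandomForestClassifier does the same internally):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    forest, n = [], len(X)
    for _ in range(n_trees):
        # step 1: bootstrap sample; ~1/3 of rows stay out of bag
        rows = rng.integers(0, n, size=n)
        # step 2: unpruned tree, random subset of predictors at each split
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 30)))
        tree.fit(X[rows], y[rows])
        forest.append(tree)
    return forest

def forest_predict(forest, X):
    # majority vote across the trees (binary 0/1 target assumed)
    votes = np.stack([t.predict(X) for t in forest])
    return (votes.mean(axis=0) >= 0.5).astype(int)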

SLIDE 24

No Overfitting or Pruning. The "over-fitting" problem appears in large, single-tree models, where the model fits noise in the data and therefore generalizes poorly; that is the rationale for pruning those models. In nearly all cases, decision tree forests do not have a problem with over-fitting, and there is no need to prune the trees in the forest. Generally, the more trees in the forest, the better the fit.

Internal Measure of Test Set (Generalization) Error. About 1/3 of the observations are excluded from each tree in the forest; these are its "out of bag" (OOB) observations. Each tree has a different OOB set, so each OOB set constitutes an independent test sample. To measure the generalization error of a decision tree forest, the OOB set for each tree is run through that tree and the prediction error rate is computed; the error rates for the trees in the forest are then averaged to obtain the overall generalization error rate of the model. There are several advantages to this method of computing the generalization error: (1) all observations are used to construct the model, and none have to be held back as a separate test set; (2) the testing is fast, because only one forest has to be constructed (as compared with V-fold cross-validation, where additional trees have to be constructed).
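In scikit-learn the OOB estimate is one flag away (a usage sketch; the dataset is a stand-in):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)                 # no held-out test set needed
print(1 - rf.oob_score_)     # OOB estimate of the generalization error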

SLIDE 25

2) Scoring: the workhorse of database marketing. Model deployment.

SLIDE 26

Scoring Recipe

  • Model

– Formula

  • Data Modifications

– Derived inputs
– Variable transformations
– Missing value imputation

  • Scoring Code

– Scored data
– Original computation algorithm

SLIDE 27

Scoring Recipe: example of scoring output generated by Alacart.

/* PROGRAM ALGOR8.PGM WITH 8 FINAL NODES */
/* METHOD MISSCL ALACART TEST */
RETAIN ROOT 1;
IF ROOT & CURRDUE <= 105.38 & PASTDUE <= 90.36 & CURRDUE <= 12 THEN DO;
   NODE = '4_1';
   PRED = 0;
   /* % NODE IMPURITY = 0.0399 ; */
   /* BRANCH #        = 1 ;      */
   /* NODE FREQ       = 81 ;     */
END;
ELSE IF ROOT & CURRDUE <= 105.38 & PASTDUE <= 90.36 & CURRDUE > 12 THEN DO;
   NODE = '4_2';
   PRED = 1;
   /* % NODE IMPURITY = 0.4478 ; */
   /* BRANCH #        = 2 ;      */
   /* NODE FREQ       = 212 ;    */
END;
ELSE IF ROOT & CURRDUE <= 105.38 & PASTDUE > 90.36 THEN DO;
   NODE = '3_2';
   PRED = 0;
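The same kind of deployable rule code can be generated from any fitted tree; for instance, scikit-learn can dump a fitted tree's rules as text (a sketch with a stand-in dataset, not the credit data above):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# IF/ELSE-style rules, analogous to the Alacart SAS scoring code
print(export_text(tree, feature_names=["sl", "sw", "pl", "pw"]))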

SLIDE 28

Scorability

[Figure: training data scattered over X1 and X2 (both on 0-1 axes); the fitted tree is turned into scoring code, which classifies a new case.]

Scoring-code classifier: if x1 < .47 & x2 < .18, or x1 > .47 & x2 > .29, then red.

SLIDE 29

SLIDE 30

1st Data Set: Titanic

Titanic survival data, available on the web: 1,313 observations, but due to missing "age" values only 756 complete observations out of the 1,313 total. Below, the variables available for analysis (the "*" variables are transformations to "help" the logistic).

Variable Name   Variable Label    Length
AGE             Yrs of Age        8
AGESEX          Age * Sex         8
AGESQ           Age * Age         8
PASSCLASS1      First Class       8
PASSCLASS2      Second Class      8
PASSCLASS3      Third Class       8
SEX             Sex Female = 1    8
SURVIVED        Survived = 1      8

SLIDE 31

Original data: SURVIVED by age availability, gender, and passenger class

                          Did Not Survive      Survived         Total
Age       Gender  Class   Count   % Obs        Count   % Obs    Count
Present   female  1st         5    0.38           96    7.31      101
                  2nd        10    0.76           75    5.71       85
                  3rd        56    4.27           46    3.50      102
          male    1st        82    6.25           43    3.27      125
                  2nd       106    8.07           21    1.60      127
                  3rd       184   14.01           32    2.44      216
          All               443   33.74          313   23.84      756
Missing   female  1st         4    0.30           38    2.89       42
                  2nd         3    0.23           19    1.45       22
                  3rd        76    5.79           34    2.59      110
          male    1st        38    2.89           16    1.22       54
                  2nd        42    3.20            4    0.30       46
                  3rd       257   19.57           26    1.98      283
          All               420   31.99          137   10.43      557
All                         863   65.73          450   34.27     1313

SLIDE 32

SLIDE 33

Complete Data w/o Missing Age. Data description: SURVIVED

                 ----- Did Not Survive -----   -------- Survived --------   --- Total ---
Gender  Class    Count  %Tot   %Gndr  MeanAge   Count  %Tot   %Gndr  MeanAge  Count  %Tot
female  1st          5   0.66   1.28    35.20      96  12.70  39.60    37.91    101  13.36
        2nd         10   1.32   3.28    31.40      75   9.92  21.92    26.85     85  11.24
        3rd         56   7.41   9.27    23.82      46   6.08  11.37    23.72    102  13.49
        Total       71   9.39  13.82    24.90     217  28.70  73.89    30.87    288  38.10
male    1st         82  10.85  26.66    44.84      43   5.69  16.03    34.25    125  16.53
        2nd        106  14.02  24.36    31.70      21   2.78   3.39    14.84    127  16.80
        3rd        184  24.34  36.16    27.10      32   4.23   7.69    23.09    216  28.57
        Total      372  49.21  87.18    33.32      96  12.70  27.11    25.95    468  61.90
Total              443  58.60 100.00    31.13     313  41.40 100.00    29.36    756 100.00

The "%Gndr" column illustrates Pr(Fem | Surv); the "%Tot" column illustrates Pr(Surv & Fem).

SLIDE 34

Logistic vs. Trees: Titanic, no missing values.

Consistency information ('informs.titanic_no_missing'): 756 obs.; 7 variables, 3 continuous and 4 class.

Trees used 3 variables, Forest 5, while Stepwise used 5 plus the intercept.

Model      Variable     Importance   Estimate        Pr > Chi-Sq   # Rules
Forest     AGE            0.546                                        493
           PASSCLASS1     0.400                                         66
           PASSCLASS2     0.248                                         58
           PASSCLASS3     0.465                                         59
           SEX            1.000                                         97
Stepwise   AGE                        -0.039177938       0.000
           PASSCLASS1                  1.2919799232      0.000
           PASSCLASS3                 -1.229467857       0.000
           SEX                         2.631357225       0.000
           Intercept                  -0.163634963       0.550
Trees      AGE            0.408                                          1
           PASSCLASS3     0.556                                          2
           SEX            1.000                                          1

SLIDE 35

Training: classification and precision rates by predicted class (misclassification and misprecision rates are the complements to 100%)

                    -- Survived = 0 --   -- Survived = 1 --   Overall
Model               Class %   Prec %     Class %   Prec %     Rate %
FOREST                96.39    78.35       62.30    92.42      82.28
LOGISTIC_STEPWISE     83.97    80.35       70.93    75.77      78.57
TREES                 96.61    77.12       59.42    92.54      81.22

Trees have the highest classification rate for non-survivors (96.61%) and an excellent precision rate for survivors (92.54%); Forest comes a close second.

SLIDE 36

Gains Table

Pctl  Model               Events    Cum Events   % Events   Cum % Events   Lift   Cum    Brier
                          Rate      Rate         Captured   Captured              Lift   Score*100
5     FOREST              100.000   100.000      12.141     12.141         2.415  2.415  0.511
      LOGISTIC_STEPWISE    94.737    94.737      11.502     11.502         2.288  2.288  5.111
      TREES                95.119    95.119      11.548     11.548         2.297  2.297  0.000
10    FOREST               97.368    98.684      11.821     23.962         2.352  2.384  0.936
      LOGISTIC_STEPWISE    97.368    96.053      11.821     23.323         2.352  2.320  1.023
      TREES                91.935    93.527      11.161     22.709         2.221  2.259  0.000
15    FOREST               90.789    96.053      11.022     34.984         2.193  2.320  5.166
      LOGISTIC_STEPWISE    89.474    93.860      10.863     34.185         2.161  2.267  2.837
      TREES                91.935    92.997      11.161     33.871         2.221  2.246  0.000
20    FOREST               93.421    95.395      11.342     46.326         2.256  2.304  3.859
      LOGISTIC_STEPWISE    86.842    92.105      10.543     44.728         2.098  2.225  3.951
      TREES                91.935    92.731      11.161     45.032         2.221  2.240  0.000

SLIDE 37

Comparing the results

1) Trees required fewer variables than logistic, and are thus easier to interpret. Forests by definition use all the variables.
2) Trees obtained slightly larger lift measures; but who can beat forests?
3) Trees and Forest determine the most important variable, female sex, immediately, at the top of the tree; with logistic it is not as clear.

SLIDE 38

2nd Data Set: Surendra Financial Data

No information is available about meaning or measurement. All variables are called R1 - R84, plus one binary dependent variable, "Newgroup". There are no missing values; the missing values have been somehow imputed, but how was not reported.

SLIDE 39

Data Mining Example: just fit a model.

Consistency information ('surendra.newsurendra'): 45,175 obs.; 84 variables, all continuous.

Financial information with target = "newgroup" and variable names R1 - R85, without any information as to what anything means. Forest was omitted from this exercise.

SLIDE 40

The larger p (the number of predictors), the more difficult the tree is to interpret.

SLIDE 41

Gains Table

Pctl  Model               Events    Cum Events   % Events   Cum % Events   Lift   Cum    Brier
                          Rate      Rate         Captured   Captured              Lift   Score*100
5     LOGISTIC_STEPWISE    92.873    92.873      37.391     37.391         7.477  7.477   5.043
      TREES                98.639    98.639      39.712     39.712         7.942  7.942   0.000
10    LOGISTIC_STEPWISE    49.270    71.071      19.836     57.227         3.967  5.722  25.486
      TREES                50.476    74.558      20.322     60.034         4.064  6.003   0.000
15    LOGISTIC_STEPWISE    30.235    57.459      12.173     69.399         2.434  4.626  20.962
      TREES                12.772    53.962       5.142     65.176         1.028  4.345   0.000
20    LOGISTIC_STEPWISE    19.088    47.869       7.681     77.081         1.537  3.854  15.217

Logistic selected 49 variables, Trees 12; 10 of the 12 were also used by Logistic.

SLIDE 42

Comparing the results

1) Trees selected 12 variables, logistic 49.
2) The initial split on R73 produced almost perfectly pure nodes. R73 was also selected by Stepwise, but Stepwise does not stop soon enough.
3) Model performance, as evaluated by lift, favors Trees.

SLIDE 43

Very quick: Trees vs. Gradient Boosting.

Task: classify into 5 segments. Tools: Trees and Gradient Boosting (different versions). Compare by classification, precision, and F1 rates (see the sketch after the model list).

Model descriptions:

Obs  Model
1    tree_equal_prob_CV_10
2    tree_origl_prob_CV_10
3    tree_origl_CV_10_5_split
4    tree_origl_CV_10_2_split_dec
5    Boost simple
6    Boost equal Probs
7    Custs_Boost dec matrix
8    Boost_equal_2nd_stage
9    Boost_orig_2nd_stage

Models 1 through 4 are TREES; 5 through 9 are BOOSTING.
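The comparison metrics are standard; a sketch of how classification, precision, and F1 rates can be computed for a multi-class problem (scikit-learn names; the talk does not specify its software, and the labels below are stand-ins):

from sklearn.metrics import accuracy_score, precision_score, f1_score

# stand-in labels: true and predicted segment (5 segments -> classes 0..4)
y_true = [0, 1, 2, 3, 4, 0, 1, 2]
y_pred = [0, 1, 2, 3, 3, 0, 2, 2]

print(accuracy_score(y_true, y_pred))                                # classification rate
print(precision_score(y_true, y_pred, average="macro", zero_division=0))
print(f1_score(y_true, y_pred, average="macro"))                     # harmonic mean of precision & recall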

SLIDE 44

SLIDE 45

SLIDE 46

SLIDE 47

Quick summary conclusions for multi-classification:
1) All boosting methods are good; just one of the tree models competes with them, in one case.
2) The performance of the boosting methods was similar across versions, so not much model-specification search is required.
3) Boosting methods are very difficult to interpret.

SLIDE 48

4) Concluding Remarks

SLIDE 49

Different algorithms

1) Non-greedy algorithms, with two- or three-step-ahead search.
2) Hybrid models, which combine regression and tree methods (not very popular after the 2000s).
3) Boosting and other majority-voting methods, which generate a sequence of trees and classifications whose outcome is decided democratically (see the sketch below).
4) Binned trees, in which split searches are conducted after discretizing all variables, thus allowing searches for possibly non-linear effects.
...
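A hedged sketch of item 3 in scikit-learn terms (one library choice among many; the talk's own boosting implementation is not shown): gradient boosting fits a sequence of small trees, each correcting the previous ones, and combines their votes:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)

boost = GradientBoostingClassifier(
    n_estimators=200,    # length of the tree sequence
    max_depth=2,         # each tree is a weak learner
    learning_rate=0.1,   # shrink each tree's contribution
)
boost.fit(X, y)
print(boost.score(X, y))   # training accuracy of the combined sequence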

SLIDE 50

Avoid over-fitting / over-torturing the data... because...

Instead, in Sherlock Holmes' words: "I never guess. It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts." (A Scandal in Bohemia)

We should not act as Mark Twain said: "Get your facts first, and then you can distort them as much as you please."

SLIDE 51

Let us not be in haste … Method comparison by way of two examples does not imply general method superiority. There are many examples in the literature in which logistic regression performed better.

SLIDE 52

The End