CHAID, CART, C4.5 and the others

Ricco RAKOTOMALALA
Tutoriels Tanagra - http://data-mining-tutorials.blogspot.fr/
Main issues of decision tree learning

Choosing the splitting criterion
- Impurity-based criteria
- Information gain
- Statistical measures of association…

Binary or multiway splits
- Multiway split: 1 value of the splitting attribute = 1 leaf
- Binary split: finding the best binary grouping
- Grouping only the leaves that are similar with respect to the class distribution

Finding the right-sized tree
- Pre-pruning
- Post-pruning

Other challenges: decision graphs, oblique trees, etc.
Splitting criterion
Main properties

S1: Maximum value. The leaves are homogeneous.
S2: Minimum value. The conditional distributions are identical; X provides no information about Y.
S3: Intermediate situation. The leaves are more homogeneous with respect to Y; X provides some information about Y.
Splitting criterion
Chi-square test statistic for independence and its variants
Contingency table: cross tabulation between Y (rows y_1, ..., y_K) and X (columns x_1, ..., x_L), with cell counts n_{kl}, row totals n_{k.}, column totals n_{.l} and grand total n.

Measure of association: comparison of the observed and the theoretical frequencies (under the null hypothesis that Y and X are independent):

\chi^2 = \sum_{k=1}^{K} \sum_{l=1}^{L} \frac{\left( n_{kl} - \frac{n_{k.} \, n_{.l}}{n} \right)^2}{\frac{n_{k.} \, n_{.l}}{n}}

Tschuprow's t normalizes the chi-square statistic, which allows comparing splits with different numbers of leaves:

t^2 = \frac{\chi^2}{n \sqrt{(K-1)(L-1)}}
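As a minimal R sketch (not part of the slides; function names are arbitrary), both measures can be computed from a contingency table whose rows are the classes of Y and whose columns are the values of X:

# Pearson chi-square statistic of a contingency table
chi2_stat <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)   # n_k. * n_.l / n
  sum((tab - expected)^2 / expected)
}
# Squared Tschuprow's t: chi-square normalized by n * sqrt((K-1)(L-1))
tschuprow2 <- function(tab) {
  chi2_stat(tab) / (sum(tab) * sqrt((nrow(tab) - 1) * (ncol(tab) - 1)))
}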
Splitting criterion
Values of the "improved" CHAID criterion (SIPINA software): S1 = 1.0, S2 = 0.0, S3 = 0.7746.
Splitting criterion
Information gain – Gain ratio (C4.5)

Shannon entropy (measure of uncertainty):
E(Y) = -\sum_{k=1}^{K} \frac{n_{k.}}{n} \log_2 \frac{n_{k.}}{n}

Conditional entropy (expected entropy of Y knowing the values of X):
E(Y/X) = -\sum_{l=1}^{L} \frac{n_{.l}}{n} \sum_{k=1}^{K} \frac{n_{kl}}{n_{.l}} \log_2 \frac{n_{kl}}{n_{.l}}

Information gain (reduction of uncertainty):
G(Y/X) = E(Y) - E(Y/X)

(Information) gain ratio, which favors splits with a low number of leaves:
GR(Y/X) = \frac{G(Y/X)}{E(X)} = \frac{E(Y) - E(Y/X)}{E(X)}
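A minimal R sketch (not part of the slides; function names are arbitrary) of the information gain and gain ratio for a class vector y and a splitting attribute x:

# Shannon entropy of a categorical vector
entropy <- function(v) {
  p <- as.numeric(table(v)) / length(v)
  p <- p[p > 0]
  -sum(p * log2(p))
}
# Information gain and gain ratio of the split of y by x
gain_ratio <- function(y, x) {
  e_yx <- sum(sapply(split(y, x), function(yl) length(yl) / length(y) * entropy(yl)))
  gain <- entropy(y) - e_yx
  c(gain = gain, gain.ratio = gain / entropy(x))
}
# e.g. gain_ratio(iris$Species, cut(iris$Petal.Length, breaks = 3))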
Splitting criterion
Values for C4.5 (SIPINA software): S1 = 1.0, S2 = 0.0, S3 = 0.5750.
Splitting criterion
Gini impurity (CART)

Gini index (measure of impurity):
I(Y) = 1 - \sum_{k=1}^{K} \left( \frac{n_{k.}}{n} \right)^2

Conditional impurity (average impurity of Y conditionally on X):
I(Y/X) = \sum_{l=1}^{L} \frac{n_{.l}}{n} \left[ 1 - \sum_{k=1}^{K} \left( \frac{n_{kl}}{n_{.l}} \right)^2 \right]

Gain (reduction of impurity):
D(Y/X) = I(Y) - I(Y/X)

Remarks:
- The Gini index can be viewed as an entropy (cf. Daroczy); D is then a kind of information gain.
- The Gini index can also be viewed as a variance for a categorical variable (CATANOVA, analysis of variance for categorical data); D is then the between-groups variance.
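A minimal R sketch (not part of the slides; function names are arbitrary) of the Gini index and the reduction in impurity:

# Gini impurity of a categorical vector
gini <- function(v) {
  p <- as.numeric(table(v)) / length(v)
  1 - sum(p^2)
}
# Reduction in impurity obtained by splitting y according to x
gini_gain <- function(y, x) {
  i_yx <- sum(sapply(split(y, x), function(yl) length(yl) / length(y) * gini(yl)))
  gini(y) - i_yx
}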
Splitting criterion
Values for C&RT (Tanagra software): S1 = 0.5, S2 = 0.0, S3 = 0.3.
Splitting into 4 subsets using the X1 attribute:

Y / X1     A1   B1   C1   D1   Total
positive    2    3    6    3      14
negative    4    4    8    0      16
Total       6    7   14    3      30
Chi-square = 3.9796, Tschuprow's t = 0.0766

Splitting into 3 subsets using the X2 attribute:

Y / X2     A2   B2   D2   Total
positive    2    9    3      14
negative    4   12    0      16
Total       6   21    3      30
Chi-square = 3.9796, Tschuprow's t = 0.0938

The chi-square statistic is identical, but Tschuprow's t indicates that X2 is better than X1.

- Tschuprow's t corrects the bias of the chi-square measure.
- The gain ratio corrects the bias of the information gain.
- The Gini reduction in impurity is biased in favor of variables with more levels (but the CART algorithm necessarily builds a binary decision tree).

Using an unbiased measure helps to alleviate the data fragmentation problem.
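Reusing the chi2_stat and tschuprow2 helpers sketched earlier, this example can be checked in R:

# Contingency tables of the example above (columns = values of X1 / X2)
tab_x1 <- rbind(positive = c(2, 3, 6, 3), negative = c(4, 4, 8, 0))
tab_x2 <- rbind(positive = c(2, 9, 3),    negative = c(4, 12, 0))
c(chi2_stat(tab_x1), chi2_stat(tab_x2))     # both ~ 3.9796
c(tschuprow2(tab_x1), tschuprow2(tab_x2))   # 0.0766 vs. 0.0938: X2 is preferred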
Multiway splitting (C4.5)
1 value of the splitting attribute = 1 leaf

Example: split on TYPELAIT with one leaf per value (class distribution, node size):
Root:        50 (21%) / 38 (16%) / 153 (63%)   n = 241 (100%)
2%MILK:      28 (17%) / 24 (15%) / 109 (68%)   n = 161 (67%)
NOMILK:       4 (31%) /  1 ( 8%) /   8 (62%)   n = 13 (5%)
POWDER:       1 (100%) / 0 ( 0%) /   0 ( 0%)   n = 1 (0%)
SKIM:         1 ( 9%) /  5 (45%) /   5 (45%)   n = 11 (5%)
WHOLEMILK:   16 (29%) /  8 (15%) /  31 (56%)   n = 55 (23%)

+ The prediction rules are easy to read
+ "Low depth" decision tree
- Data fragmentation problem, especially for small datasets
- "Large" decision tree with a high number of leaves
Binary splitting (CART)
Finding the best combination of the values into two subsets

Example: binary split on TYPELAIT into {2%MILK, SKIM} vs. {NOMILK, WHOLEMILK, POWDER}:
Root:                          49 (21%) / 34 (15%) / 145 (64%)   n = 228 (100%)
{2%MILK, SKIM}:                29 (18%) / 26 (16%) / 109 (66%)   n = 164 (72%)
{NOMILK, WHOLEMILK, POWDER}:   20 (31%) /  8 (13%) /  36 (56%)   n = 64 (28%)

+ The grouping overcomes the bias of the splitting measure
+ The data fragmentation problem is alleviated
- "High depth" decision tree (CART relies on post-pruning to remedy this)
- Merging into two groups is not always relevant!
Merging approach (CHAID)
Merging the leaves that are similar according to the class distribution

Principle: iterative merging as long as the class distributions in the leaves are not significantly different (bottom-up strategy).

Example: split on TYPELAIT after merging, with leaves {2%MILK}, {NOMILK, WHOLEMILK, POWDER} and {SKIM}:
Root:                          50 (21%) / 38 (16%) / 153 (63%)   n = 241 (100%)
{2%MILK}:                      28 (17%) / 24 (15%) / 109 (68%)   n = 161 (67%)
{NOMILK, WHOLEMILK, POWDER}:   21 (30%) /  9 (13%) /  39 (57%)   n = 69 (29%)
{SKIM}:                         1 ( 9%) /  5 (45%) /   5 (45%)   n = 11 (5%)

Merging test between the {NoMilk, Powder} and {WholeMilk} leaves:

         NoMilk, Powder   WholeMilk   Total
High                  5          16      21
Low                   1           8       9
Normal                8          31      39
Total                14          55      69

\chi^2 = 14 \times 55 \left[ \frac{(5/14 - 16/55)^2}{21} + \frac{(1/14 - 8/55)^2}{9} + \frac{(8/14 - 31/55)^2}{39} \right] = 0.6309

With (3 - 1)(2 - 1) = 2 degrees of freedom, the p-value is 0.73. The two leaves are merged if the p-value is greater than the alpha level for merging.

+ Alleviates the data fragmentation problem
- Choosing the alpha level for merging is not obvious
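A minimal R sketch (not part of the slides) of this merging test, using the built-in Pearson chi-square test on the two leaf distributions:

# Class distributions of the two leaves to compare
leafA <- c(High = 5, Low = 1, Normal = 8)    # {NoMilk, Powder}
leafB <- c(High = 16, Low = 8, Normal = 31)  # {WholeMilk}
test  <- chisq.test(cbind(leafA, leafB))     # may warn: small expected counts
test$statistic                               # ~ 0.63
test$p.value                                 # ~ 0.73
alpha_merge <- 0.05                          # hypothetical alpha level for merging
test$p.value > alpha_merge                   # TRUE -> merge the two leaves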
Bias-variance tradeoff according to the tree complexity

Bias ~ how powerful the model is; variance ~ how sensitive the model is to the training set.

- Underfitting: the tree is too small.
- Overfitting: the tree is too large.
- The "optimal" tree size lies in between.

[Figure: error rate vs. number of leaves, on the learning sample and on the test sample]
Pre-pruning
Stopping the growing process

Confidence and support criteria:
- Group purity: confidence threshold
- Support criterion: minimum node size to split, minimum number of instances in the leaves

Statistical approach (CHAID):
- Chi-squared test for independence at each split

+ Easy to understand and easy to use
- The right thresholds for a given problem are not obvious
- The right alpha level for the splitting test is very hard to determine

But in practice this approach is often used because:
- the resulting tree reaches a good performance, and the region of the "optimal" error rate is wide
- it is fast (the growing is stopped earlier, with no additional computation for post-pruning)
- it is preferred, at least in the exploratory phase
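A minimal sketch (not part of the slides) of pre-pruning style constraints, using the 'rpart' package in R; the parameter values are arbitrary:

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit  = 20,    # min. node size to split
                                      minbucket = 7,     # min. instances in a leaf
                                      maxdepth  = 4,     # maximum tree depth
                                      cp        = 0.01)) # min. required improvement per split
print(fit)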
Post-pruning
An additional step to avoid over-dependence on the growing sample

Decision tree learning in two steps:
(1) Growing phase: maximize the purity of the leaves.
(2) (Post-)pruning phase: minimize the "true" error rate.

The key question: how to obtain a good estimate of the "true" error rate?

[Figure: error rate vs. number of leaves, on the learning sample and for the "true" error]
Post-pruning
CART approach with a "pruning set" ("validation sample" in some tools)

Partition the learning sample into two parts:
(1) Growing set (about 67%)
(2) Pruning set (about 33%), used to obtain an honest estimate of the error rate

Cost-complexity pruning process:
- Build the sequence of optimally pruned subtrees on the growing set.
- Select the smallest optimally pruned subtree (1-SE rule): the smallest subtree T' such that E(T') \le E(T^*) + SE[E(T^*)], where T^* minimizes the pruning-set error. This avoids over-dependence on the pruning sample.

[Figure: error rate vs. number of leaves, on the growing set and on the pruning set]
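A minimal sketch (not part of the slides) with the 'rpart' package in R; note that rpart estimates the pruned-tree error by cross-validation rather than with a separate pruning set:

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0.0, xval = 10))
cp_tab <- fit$cptable
best   <- which.min(cp_tab[, "xerror"])                       # subtree with minimal CV error
thresh <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]       # 1-SE threshold
cp_1se <- cp_tab[min(which(cp_tab[, "xerror"] <= thresh)), "CP"]
pruned <- prune(fit, cp = cp_1se)                             # smallest subtree within 1 SE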
Post-pruning
C4.5 approach: error-based pruning (a variant of pessimistic pruning)

Predicted (pessimistic) error rate = upper bound of the confidence interval of the resubstitution error rate. The smaller the node, the stronger the penalty.

Procedure: bottom-up traversal over all nodes, i.e. start from the bottom of the tree and examine each non-leaf subtree.

Example: a node with 16 instances (1 misclassified) split into three leaves of sizes 6, 9 and 1:
- Node:                  e.Resub = 0.0625, e.Pred = 0.157
- Leaf 1 (6 instances):  e.Resub = 0.0,    e.Pred = 0.206
- Leaf 2 (9 instances):  e.Resub = 0.0,    e.Pred = 0.143
- Leaf 3 (1 instance):   e.Resub = 0.0,    e.Pred = 0.750

The subtree is pruned (replaced by a leaf) because 0.157 < (6 x 0.206 + 9 x 0.143 + 1 x 0.750) / 16 = 0.2046.
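A minimal sketch (not part of the slides) of the pessimistic estimate used above, assuming the exact binomial upper confidence limit with CF = 25%; the function name is arbitrary:

# Upper confidence limit of the error rate for e errors among n instances
pessimistic_error <- function(e, n, cf = 0.25) {
  if (e == 0) return(1 - cf^(1 / n))    # closed form when no error is observed
  qbeta(1 - cf, e + 1, n - e)           # exact binomial (Clopper-Pearson) upper limit
}
pessimistic_error(0, 6)    # ~ 0.206
pessimistic_error(0, 9)    # ~ 0.143
pessimistic_error(0, 1)    # 0.750
pessimistic_error(1, 16)   # ~ 0.16 (C4.5 reports 0.157)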
Summary

Splitting criterion:
- CHAID: chi-square statistic (Tschuprow's t)
- CART: Gini index
- C4.5: information gain (gain ratio)

Merging process:
- CHAID: "optimal" grouping, test for similarity
- CART: binary grouping
- C4.5: no merging (1 value = 1 leaf)

Determining the right-sized tree (common settings): min. size of a node to split, min. instances in the leaves, confidence threshold, tree depth.

Determining the right-sized tree (specific):
- CHAID: pre-pruning, chi-square test for independence
- CART: post-pruning, cost-complexity pruning
- C4.5: post-pruning, error-based pruning

Recommended when…
- CHAID: exploratory phase, handling very large datasets
- CART: classification performance and reliability matter, no complicated settings
- C4.5: small dataset, not sensitive to the settings

Not recommended when…
- CHAID: difficult to set the right settings, tree size is very sensitive to the settings, classification performance is the priority
- CART: small dataset, the binary tree is not always suitable
- C4.5: bad behavior of the post-pruning process on very large datasets, tree size increases with the dataset size
Misclassification costs
Considering the misclassification costs in the CART post-pruning

In real problem solving, the misclassification costs are often not symmetric (e.g. cancer prediction). How to handle the costs during tree learning?

Cost matrix (rows = observed, columns = predicted):

                Cancer   Non-Cancer
Cancer               0            5
Non-Cancer           1            0

Leaf with class distribution: Cancer = 10 (33%), Non-Cancer = 20 (67%).

Not considering the costs (error rates):
E(cancer) = 20/30 = 0.67 ; E(non-cancer) = 10/30 = 0.33
Decision = non-cancer, E(leaf) = 0.33

Considering the costs (expected costs):
C(cancer) = 10/30 x 0 + 20/30 x 1 = 0.67 ; C(non-cancer) = 10/30 x 5 + 20/30 x 0 = 1.67
Decision = cancer, C(leaf) = 0.67

CART approach: (1) define the sequence of pruned subtrees; (2) select the subtree T* which minimizes the misclassification cost C(T) instead of the error rate.
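A minimal R sketch (not part of the slides) reproducing the leaf decision above, with and without costs:

counts <- c(cancer = 10, non.cancer = 20)     # class distribution in the leaf
p      <- counts / sum(counts)
# cost[i, j] = cost of predicting class j when the observed class is i
cost <- matrix(c(0, 5,
                 1, 0), nrow = 2, byrow = TRUE,
               dimnames = list(names(counts), names(counts)))
expected_cost <- p %*% cost                   # 0.67 (cancer), 1.67 (non-cancer)
names(which.max(p))                           # "non.cancer": decision without costs
colnames(cost)[which.min(expected_cost)]      # "cancer": decision with costs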
Misclassification costs
CART on the "PIMA Indians Diabetes" dataset

Standard CART vs. cost-sensitive CART (FP cost = 1, FN cost = 4).

[Figure: confusion matrices (predicted vs. actual) for the two trees]

The number of false negatives is strongly decreased, because the false-negative cost has been increased.
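A minimal sketch (not part of the slides, which use Tanagra) of cost-sensitive CART with the 'rpart' package in R, assuming the Pima data from the 'mlbench' package and a loss matrix indexed by true class (rows) and predicted class (columns), in the order of the factor levels c("neg", "pos"):

library(rpart)
library(mlbench)
data(PimaIndiansDiabetes)
loss <- matrix(c(0, 1,    # true "neg": FP (predict "pos") costs 1
                 4, 0),   # true "pos": FN (predict "neg") costs 4
               nrow = 2, byrow = TRUE)
fit_std  <- rpart(diabetes ~ ., data = PimaIndiansDiabetes, method = "class")
fit_cost <- rpart(diabetes ~ ., data = PimaIndiansDiabetes, method = "class",
                  parms = list(loss = loss))
table(predicted = predict(fit_cost, type = "class"),
      actual    = PimaIndiansDiabetes$diabetes)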
Induction graphs (decision graphs)
Generalizing the merging process to any nodes: joining two or more paths

+ More powerful representation
+ Rules with disjunctions (OR)
+ Better use of small datasets
+ Fewer leaves than a tree

- Rule interpretation is not obvious (several paths lead to the same leaf)
- Generalization performance is not better than that of a tree
- Graphs with unnecessary joins on noisy datasets
Oblique trees
Using a linear combination of variables to split a node

[Figure: oblique split of the form 3.0 X1 + 2.5 X2 - 0.8 X3 <= 12.3, splitting a node with class distribution (20, 20) into children (2, 15) and (18, 5)]

+ More powerful representation
+ The decision tree is shorter

- Interpretation of the rules is harder when the number of nodes increases
- More expensive computations
- Not really better than standard decision tree methods
Oblique trees with R (the 'oblique.tree' package)
IRIS dataset

# Fit an oblique tree on two predictors and display it
library(oblique.tree)
data(iris)
arbre <- oblique.tree(Species ~ Petal.Length + Petal.Width, data = iris)
plot(arbre)
text(arbre)

# Scatter plot of the two predictors, colored by species,
# with the two oblique splitting boundaries drawn as lines
plot(iris$Petal.Width, iris$Petal.Length, col = c("red", "blue", "green")[iris$Species])
abline(a = 69.45 / 17.6, b = 33.89 / (-17.6), col = "red")
abline(a = 45.27 / 5.75, b = 10.45 / (-5.75), col = "blue")
Other refinements
- Fuzzy decision trees
- Option trees
- Constructive induction
- Lookahead search
- etc. (cf. Rakotomalala, 2005)

(1) The classification performance improvements on real problems are not significant.
(2) Above all, these refinements yield shorter trees.
References
- L. Breiman, J. Friedman, R. Olshen and C. Stone, "Classification and Regression Trees", Wadsworth Int. Group, 1984.
- G. Kass, "An Exploratory Technique for Investigating Large Quantities of Categorical Data", Applied Statistics, Vol. 29, No. 2, 1980, pp. 119-127.
- R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1993.