

1. Data Mining Classification Trees (2). Ad Feelders, Universiteit Utrecht, September 16, 2020.

2. Basic Tree Construction Algorithm

   Construct tree:
       nodelist ← {{training data}}
       repeat
           current node ← select node from nodelist
           nodelist ← nodelist − current node
           if impurity(current node) > 0 then
               S ← set of candidate splits in current node
               s* ← arg max_{s ∈ S} impurity reduction(s, current node)
               child nodes ← apply(s*, current node)
               nodelist ← nodelist ∪ child nodes
           fi
       until nodelist = ∅
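A minimal Python sketch of this loop, assuming Gini impurity for binary labels; the helper functions best_split and apply_split are hypothetical stand-ins for the slide's split search and split application, not part of the slides.

```python
def impurity(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return p * (1.0 - p)

def grow_tree(data, best_split, apply_split):
    """Grow a tree from data, a list of (features, label) pairs.

    best_split(node) should return the split with maximal impurity
    reduction (or None if no useful split exists), and
    apply_split(split, node) the list of resulting child datasets.
    Returns the datasets that end up in the leaf nodes.
    """
    nodelist = [data]              # nodelist <- {{training data}}
    leaves = []
    while nodelist:                # until nodelist is empty
        node = nodelist.pop()      # select current node, remove from list
        labels = [y for _, y in node]
        if impurity(labels) > 0:
            split = best_split(node)   # s* = arg max impurity reduction
            if split is not None:
                nodelist.extend(apply_split(split, node))
                continue
        leaves.append(node)        # pure (or unsplittable) node: a leaf
    return leaves
```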

3. Overfitting and Pruning
   The tree growing algorithm continues splitting until all leaf nodes of T contain examples of a single class (i.e. resubstitution error R(T) = 0).
   Is this a good tree for predicting the class of new examples? Not unless the problem is truly “deterministic”! Problem of overfitting.

4. Proposed Solutions
   How can we prevent overfitting?
   Stopping rules: e.g. don’t expand a node if the impurity reduction of the best split is below some threshold.
   Pruning: grow a very large tree Tmax and merge back nodes.
   Note: in the practical assignment we do use a stopping rule, based on the nmin and minleaf parameters.

5. Stopping Rules
   Disadvantage: sometimes you first have to make a weak split to be able to follow up with a good split. Since we only look one step ahead, we may miss the good follow-up split.
   [Figure: two-dimensional scatter plot with axes x1 and x2 in which no single split on x1 or x2 reduces impurity, but a split on one variable followed by splits on the other separates the classes.]
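To make this concrete, here is a small sketch (my own example, not from the slides) with XOR-like data: the first split on either variable gives zero Gini impurity reduction, so a one-step-ahead stopping rule would stop immediately, even though two successive splits separate the classes perfectly.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    p = sum(labels) / len(labels)
    return p * (1 - p)

# XOR-like pattern: class is 1 exactly when x1 and x2 differ.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)] * 25

def gini_reduction(data, feature, threshold=0.5):
    """Impurity reduction of splitting on x[feature] <= threshold."""
    labels = [y for _, y in data]
    left = [y for x, y in data if x[feature] <= threshold]
    right = [y for x, y in data if x[feature] > threshold]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted

print(gini_reduction(data, 0))  # 0.0: splitting on x1 alone looks useless
print(gini_reduction(data, 1))  # 0.0: same for x2, so a stopping rule gives up
```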

6. Pruning
   To avoid the problem of stopping rules, we first grow a very large tree on the training sample, and then prune this large tree.
   Objective: select the pruned subtree that has the lowest true error rate.
   Problem: how to find this pruned subtree?

7. Pruning Methods
   Cost-complexity pruning (Breiman et al.; CART), also called weakest link pruning.
   Reduced-error pruning (Quinlan).
   Pessimistic pruning (Quinlan; C4.5).
   ...

8. Terminology: Tree T
   [Figure: tree T with root t1; t1 splits into t2 and t3, t2 into t4 and t5, t3 into t6 and t7, and t4 into t8 and t9.]
   T̃ denotes the collection of leaf nodes of tree T: T̃ = {t5, t6, t7, t8, t9}, |T̃| = 5.

9. Terminology: Pruning T in node t2
   [Figure: the same tree T (nodes t1 through t9), with the branch rooted at t2 marked for pruning.]

10. Terminology: T after pruning in t2: T − T_t2
   [Figure: the pruned tree T − T_t2 with nodes t1, t2, t3, t6, t7; t2 is now a leaf.]

11. Terminology: Branch T_t2
   [Figure: branch T_t2 with root t2; t2 splits into t4 and t5, and t4 into t8 and t9.]
   T̃_t2 = {t5, t8, t9}, |T̃_t2| = 3.

12. Cost-complexity pruning
   The total number of pruned subtrees of a balanced binary tree with ℓ leaves is ⌊1.5028369^ℓ⌋. With just 40 leaf nodes we have approximately 12 million pruned subtrees. Exhaustive search not recommended.
   Basic idea of cost-complexity pruning: reduce the number of pruned subtrees we have to consider by selecting the ones that are the “best of their kind” (in a sense to be defined shortly...).
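A quick check of the 12 million figure (just arithmetic on the constant quoted above):

```python
import math

# Number of pruned subtrees of a balanced binary tree with l = 40 leaves.
print(math.floor(1.5028369 ** 40))  # about 1.2e7, i.e. roughly 12 million
```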

13. Total cost of a tree
   Strike a balance between fit and complexity. Total cost C_α(T) of tree T:
       C_α(T) = R(T) + α|T̃|
   Total cost consists of two components: the resubstitution error R(T), and a penalty for the complexity of the tree, α|T̃| (α ≥ 0).
   Note: R(T) = (number of wrong classifications made by T) / (number of examples in the training sample).

14. Tree with lowest total cost
   Depending on the value of α, different pruned subtrees will have the lowest total cost.
   For α = 0 (no complexity penalty) the tree with the smallest resubstitution error wins. For higher values of α, a less complex tree that makes a few more errors might win.
   As it turns out, we can find a nested sequence of pruned subtrees of Tmax, such that the trees in the sequence minimize total cost for consecutive intervals of α values.
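A small numeric illustration (the numbers are invented for this example): compare a 5-leaf tree that misclassifies 10 of 200 training examples with a 2-leaf tree that misclassifies 25 of 200. At α = 0 the large tree has the lower total cost; at α = 0.05 the small tree wins.

```python
def total_cost(n_errors, n_train, n_leaves, alpha):
    """C_alpha(T) = R(T) + alpha * |leaves of T|."""
    return n_errors / n_train + alpha * n_leaves

for alpha in (0.0, 0.05):
    big = total_cost(10, 200, 5, alpha)    # large tree: 5 leaves, 10 errors
    small = total_cost(25, 200, 2, alpha)  # small tree: 2 leaves, 25 errors
    print(f"alpha={alpha}: C(big)={big:.3f}, C(small)={small:.3f}")
# alpha=0.0:  C(big)=0.050, C(small)=0.125  -> the large tree wins
# alpha=0.05: C(big)=0.300, C(small)=0.225  -> the small tree wins
```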

15. Smallest minimizing subtree
   For any value of α, there exists a smallest minimizing subtree T(α) of Tmax that satisfies the following conditions:
   1. C_α(T(α)) = min_{T ≤ Tmax} C_α(T) (that is, T(α) minimizes total cost for that value of α).
   2. If C_α(T) = C_α(T(α)) then T(α) ≤ T (that is, T(α) is a pruned subtree of all trees that minimize total cost).
   Note: T′ ≤ T means T′ is a pruned subtree of T, i.e. it can be obtained by pruning T in 0 or more nodes.

16. Sequence of subtrees
   Construct a decreasing sequence of pruned subtrees of Tmax:
       Tmax > T1 > T2 > T3 > ... > {t1}
   (where t1 is the root node of the tree) such that Tk is the smallest minimizing subtree for α ∈ [αk, αk+1).
   Note: from a computational viewpoint, the important property is that Tk+1 is a pruned subtree of Tk, i.e. it can be obtained by pruning Tk. No backtracking is required.

17. Decomposition of total cost
   Total cost has an additive decomposition over the leaf nodes of a tree:
       C_α(T) = Σ_{t ∈ T̃} (R(t) + α)
   This holds because R(T) = Σ_{t ∈ T̃} R(t), and the penalty α|T̃| contributes α once for each leaf. Here R(t) is the number of errors we make in node t if we predict the majority class, divided by the total number of observations in the training sample.

18. Effect on cost of pruning in node t
   Before pruning in t, the branch T_t contributes C_α(T_t) = Σ_{t′ ∈ T̃_t} (R(t′) + α) to the total cost.
   After pruning in t, node t has become a leaf and contributes C_α({t}) = R(t) + α.

19. Finding the Tk and corresponding αk
   T_t: branch of T with root node t. After pruning in t, its contribution to total cost is:
       C_α({t}) = R(t) + α
   The contribution of T_t to the total cost is:
       C_α(T_t) = Σ_{t′ ∈ T̃_t} (R(t′) + α) = R(T_t) + α|T̃_t|
   T − T_t becomes better than T when C_α({t}) = C_α(T_t).

20. Computing contributions to total cost of T
   [Figure: example tree with class counts (n = 200): t1 = (100, 100); t1 splits into t2 = (90, 60) and t3 = (10, 40); t2 into t4 = (80, 0) and t5 = (10, 60); t5 into t8 = (10, 0) and t9 = (0, 60); t3 into t6 = (0, 40) and t7 = (10, 0).]
       C_α({t2}) = R(t2) + α = 3/10 + α
       C_α(T_t2) = R(T_t2) + α|T̃_t2| = Σ_{t′ ∈ T̃_t2} R(t′) + α|T̃_t2| = 0 + 3α

21. Solving for α
   The total costs of T and T − T_t become equal when C_α({t}) = C_α(T_t). At what value of α does this happen?
       R(t) + α = R(T_t) + α|T̃_t|
   Solving for α we get
       α = (R(t) − R(T_t)) / (|T̃_t| − 1)
   Note: for this value of α the total cost of T and T − T_t is the same, but T − T_t is preferred because we want the smallest minimizing subtree.
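A quick symbolic check of this derivation (using sympy; the symbol names R_t, R_Tt and L for |T̃_t| are mine):

```python
import sympy as sp

alpha, R_t, R_Tt, L = sp.symbols('alpha R_t R_Tt L')

# Set the leaf contribution R(t) + alpha equal to the branch
# contribution R(T_t) + alpha*|leaves(T_t)| and solve for alpha.
solution = sp.solve(sp.Eq(R_t + alpha, R_Tt + alpha * L), alpha)[0]
print(sp.simplify(solution - (R_t - R_Tt) / (L - 1)))  # 0: matches the slide
```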

22. Computing g(t): the “critical” α value for node t
   For each non-terminal node t we compute its “critical” alpha value:
       g(t) = (R(t) − R(T_t)) / (|T̃_t| − 1)
   In words:
       g(t) = (increase in error due to pruning in t) / (decrease in number of leaf nodes due to pruning in t)
   Subsequently, we prune in the nodes for which g(t) is the smallest (the “weakest links”). This process is repeated until we reach the root node.
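Here is a sketch of one weakest-link pruning step on a simple nested-dict tree representation (the data structure and function names are my own, not from the slides): each node stores 'R', its resubstitution error when it predicts the majority class, and internal nodes additionally store 'children'.

```python
def leaves(t):
    """Leaf nodes of the branch rooted at t."""
    if 'children' not in t:
        return [t]
    return [leaf for child in t['children'] for leaf in leaves(child)]

def internal_nodes(t):
    """All non-terminal nodes of the branch rooted at t."""
    if 'children' not in t:
        return []
    return [t] + [n for child in t['children'] for n in internal_nodes(child)]

def g(t):
    """Critical alpha of node t: (R(t) - R(T_t)) / (|leaves(T_t)| - 1)."""
    lvs = leaves(t)
    return (t['R'] - sum(leaf['R'] for leaf in lvs)) / (len(lvs) - 1)

def prune_weakest_links(root):
    """Turn all nodes with minimal g into leaves; return that minimal g."""
    nodes = internal_nodes(root)
    g_values = [g(t) for t in nodes]   # compute every g before pruning
    alpha = min(g_values)
    for t, g_t in zip(nodes, g_values):
        if g_t == alpha and 'children' in t:
            del t['children']          # prune in t: t becomes a leaf
    return alpha
```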

23. Computing g(t): the “critical” α value for node t
   [Figure: the example tree with class counts as on slide 20, annotated with the critical values.]
   g(t1) = 1/8, g(t2) = 3/20, g(t3) = 1/20, g(t5) = 1/20.

24. Computing g(t): the “critical” α value for node t
   Calculation examples:
       g(t1) = (R(t1) − R(T_t1)) / (|T̃_t1| − 1) = (1/2 − 0) / (5 − 1) = 1/8
       g(t2) = (R(t2) − R(T_t2)) / (|T̃_t2| − 1) = (3/10 − 0) / (3 − 1) = 3/20
       g(t3) = (R(t3) − R(T_t3)) / (|T̃_t3| − 1) = (1/20 − 0) / (2 − 1) = 1/20
       g(t5) = (R(t5) − R(T_t5)) / (|T̃_t5| − 1) = (1/20 − 0) / (2 − 1) = 1/20
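Using the sketch from slide 22 (with the class counts from slide 20, n = 200), these values can be reproduced:

```python
# Example tree from the slides; 'R' is errors/200 under majority-class
# prediction. Assumes leaves() and g() from the earlier sketch.
t8, t9 = {'R': 0.0}, {'R': 0.0}
t4, t6, t7 = {'R': 0.0}, {'R': 0.0}, {'R': 0.0}  # pure leaves
t5 = {'R': 10 / 200, 'children': [t8, t9]}   # counts (10, 60): 10 errors
t2 = {'R': 60 / 200, 'children': [t4, t5]}   # counts (90, 60): 60 errors
t3 = {'R': 10 / 200, 'children': [t6, t7]}   # counts (10, 40): 10 errors
t1 = {'R': 100 / 200, 'children': [t2, t3]}  # counts (100, 100)

for name, node in [('t1', t1), ('t2', t2), ('t3', t3), ('t5', t5)]:
    print(name, g(node))  # 0.125, 0.15, 0.05, 0.05: i.e. 1/8, 3/20, 1/20, 1/20
```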

25. Finding the weakest links
   [Figure: the example tree with class counts as on slide 20; t3 and t5 attain the smallest critical value and are the weakest links.]
   g(t1) = 1/8, g(t2) = 3/20, g(t3) = 1/20, g(t5) = 1/20.

26. Pruning in the weakest links
   [Figure: the tree after pruning in t3 and t5: t1 = (100, 100) with children t2 = (90, 60) and t3 = (10, 40); t2 with children t4 = (80, 0) and t5 = (10, 60); t3 and t5 are now leaves.]
   By pruning the weakest links we obtain the next tree in the sequence.
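Iterating the pruning step until only the root remains produces the nested sequence of slide 16. Continuing with the example tree and the hypothetical sketch from slide 22:

```python
# Repeatedly prune the weakest links in the example tree t1 defined above,
# collecting the critical alpha values along the way.
alphas = []
while 'children' in t1:
    alphas.append(prune_weakest_links(t1))
print(alphas)  # [0.05, 0.2]: first t3 and t5 are pruned, then the root;
               # after the first step g(t1) = 0.2 < g(t2) = 0.25, so t2
               # is never the weakest link on its own.
```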
