

1. Data Mining Classification Trees (2). Ad Feelders, Universiteit Utrecht, September 16, 2020.

2. Basic Tree Construction Algorithm

   Construct tree:
       nodelist ← {{training data}}
       repeat
           current node ← select node from nodelist
           nodelist ← nodelist − current node
           if impurity(current node) > 0 then
               S ← set of candidate splits in current node
               s* ← arg max_{s ∈ S} impurity reduction(s, current node)
               child nodes ← apply(s*, current node)
               nodelist ← nodelist ∪ child nodes
           fi
       until nodelist = ∅
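A minimal Python sketch of this loop, assuming Gini impurity for binary labels; the helper functions best_split and apply_split are hypothetical stand-ins for the slide's split search and split application, not part of the slides.

```python
def impurity(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return p * (1.0 - p)

def grow_tree(data, best_split, apply_split):
    """Grow a tree from data, a list of (features, label) pairs.

    best_split(node) should return the split with maximal impurity
    reduction (or None if no useful split exists), and
    apply_split(split, node) the list of resulting child datasets.
    Returns the datasets that end up in the leaf nodes.
    """
    nodelist = [data]              # nodelist <- {{training data}}
    leaves = []
    while nodelist:                # until nodelist is empty
        node = nodelist.pop()      # select current node, remove from list
        labels = [y for _, y in node]
        if impurity(labels) > 0:
            split = best_split(node)   # s* = arg max impurity reduction
            if split is not None:
                nodelist.extend(apply_split(split, node))
                continue
        leaves.append(node)        # pure (or unsplittable) node: a leaf
    return leaves
```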

3. Overfitting and Pruning
   The tree growing algorithm continues splitting until all leaf nodes of T contain examples of a single class (i.e. resubstitution error R(T) = 0).
   Is this a good tree for predicting the class of new examples? Not unless the problem is truly “deterministic”! Problem of overfitting.

4. Proposed Solutions
   How can we prevent overfitting?
   Stopping rules: e.g. don’t expand a node if the impurity reduction of the best split is below some threshold.
   Pruning: grow a very large tree Tmax and merge back nodes.
   Note: in the practical assignment we do use a stopping rule, based on the nmin and minleaf parameters.

5. Stopping Rules
   Disadvantage: sometimes you first have to make a weak split to be able to follow up with a good split. Since we only look one step ahead, we may miss the good follow-up split.
   [Figure: two-dimensional scatter plot with axes x1 and x2 in which no single split on x1 or x2 reduces impurity, but a split on one variable followed by splits on the other separates the classes.]
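To make this concrete, here is a small sketch (my own example, not from the slides) with XOR-like data: the first split on either variable gives zero Gini impurity reduction, so a one-step-ahead stopping rule would stop immediately, even though two successive splits separate the classes perfectly.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    p = sum(labels) / len(labels)
    return p * (1 - p)

# XOR-like pattern: class is 1 exactly when x1 and x2 differ.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)] * 25

def gini_reduction(data, feature, threshold=0.5):
    """Impurity reduction of splitting on x[feature] <= threshold."""
    labels = [y for _, y in data]
    left = [y for x, y in data if x[feature] <= threshold]
    right = [y for x, y in data if x[feature] > threshold]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted

print(gini_reduction(data, 0))  # 0.0: splitting on x1 alone looks useless
print(gini_reduction(data, 1))  # 0.0: same for x2, so a stopping rule gives up
```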

6. Pruning
   To avoid the problem of stopping rules, we first grow a very large tree on the training sample, and then prune this large tree.
   Objective: select the pruned subtree that has the lowest true error rate.
   Problem: how to find this pruned subtree?

7. Pruning Methods
   Cost-complexity pruning (Breiman et al.; CART), also called weakest link pruning.
   Reduced-error pruning (Quinlan).
   Pessimistic pruning (Quinlan; C4.5).
   ...

8. Terminology: Tree T
   [Figure: tree T with root t1; t1 splits into t2 and t3, t2 into t4 and t5, t3 into t6 and t7, and t4 into t8 and t9.]
   T̃ denotes the collection of leaf nodes of tree T: T̃ = {t5, t6, t7, t8, t9}, |T̃| = 5.

9. Terminology: Pruning T in node t2
   [Figure: the same tree T (nodes t1 through t9), with the branch rooted at t2 marked for pruning.]

10. Terminology: T after pruning in t2: T − T_t2
   [Figure: the pruned tree T − T_t2 with nodes t1, t2, t3, t6, t7; t2 is now a leaf.]

11. Terminology: Branch T_t2
   [Figure: branch T_t2 with root t2; t2 splits into t4 and t5, and t4 into t8 and t9.]
   T̃_t2 = {t5, t8, t9}, |T̃_t2| = 3.

12. Cost-complexity pruning
   The total number of pruned subtrees of a balanced binary tree with ℓ leaves is ⌊1.5028369^ℓ⌋. With just 40 leaf nodes we have approximately 12 million pruned subtrees. Exhaustive search not recommended.
   Basic idea of cost-complexity pruning: reduce the number of pruned subtrees we have to consider by selecting the ones that are the “best of their kind” (in a sense to be defined shortly...).
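A quick check of the 12 million figure (just arithmetic on the constant quoted above):

```python
import math

# Number of pruned subtrees of a balanced binary tree with l = 40 leaves.
print(math.floor(1.5028369 ** 40))  # about 1.2e7, i.e. roughly 12 million
```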

13. Total cost of a tree
   Strike a balance between fit and complexity. Total cost C_α(T) of tree T:
       C_α(T) = R(T) + α|T̃|
   Total cost consists of two components: the resubstitution error R(T), and a penalty for the complexity of the tree, α|T̃| (α ≥ 0).
   Note: R(T) = (number of wrong classifications made by T) / (number of examples in the training sample).

14. Tree with lowest total cost
   Depending on the value of α, different pruned subtrees will have the lowest total cost.
   For α = 0 (no complexity penalty) the tree with the smallest resubstitution error wins. For higher values of α, a less complex tree that makes a few more errors might win.
   As it turns out, we can find a nested sequence of pruned subtrees of Tmax, such that the trees in the sequence minimize total cost for consecutive intervals of α values.
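A small numeric illustration (the numbers are invented for this example): compare a 5-leaf tree that misclassifies 10 of 200 training examples with a 2-leaf tree that misclassifies 25 of 200. At α = 0 the large tree has the lower total cost; at α = 0.05 the small tree wins.

```python
def total_cost(n_errors, n_train, n_leaves, alpha):
    """C_alpha(T) = R(T) + alpha * |leaves of T|."""
    return n_errors / n_train + alpha * n_leaves

for alpha in (0.0, 0.05):
    big = total_cost(10, 200, 5, alpha)    # large tree: 5 leaves, 10 errors
    small = total_cost(25, 200, 2, alpha)  # small tree: 2 leaves, 25 errors
    print(f"alpha={alpha}: C(big)={big:.3f}, C(small)={small:.3f}")
# alpha=0.0:  C(big)=0.050, C(small)=0.125  -> the large tree wins
# alpha=0.05: C(big)=0.300, C(small)=0.225  -> the small tree wins
```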

15. Smallest minimizing subtree
   For any value of α, there exists a smallest minimizing subtree T(α) of Tmax that satisfies the following conditions:
   1. C_α(T(α)) = min_{T ≤ Tmax} C_α(T) (that is, T(α) minimizes total cost for that value of α).
   2. If C_α(T) = C_α(T(α)) then T(α) ≤ T (that is, T(α) is a pruned subtree of all trees that minimize total cost).
   Note: T′ ≤ T means T′ is a pruned subtree of T, i.e. it can be obtained by pruning T in 0 or more nodes.

16. Sequence of subtrees
   Construct a decreasing sequence of pruned subtrees of Tmax:
       Tmax > T1 > T2 > T3 > ... > {t1}
   (where t1 is the root node of the tree) such that Tk is the smallest minimizing subtree for α ∈ [αk, αk+1).
   Note: from a computational viewpoint, the important property is that Tk+1 is a pruned subtree of Tk, i.e. it can be obtained by pruning Tk. No backtracking is required.

17. Decomposition of total cost
   Total cost has an additive decomposition over the leaf nodes of a tree:
       C_α(T) = Σ_{t ∈ T̃} (R(t) + α)
   This holds because R(T) = Σ_{t ∈ T̃} R(t), and the penalty α|T̃| contributes α once for each leaf. Here R(t) is the number of errors we make in node t if we predict the majority class, divided by the total number of observations in the training sample.

18. Effect on cost of pruning in node t
   Before pruning in t, the branch T_t contributes C_α(T_t) = Σ_{t′ ∈ T̃_t} (R(t′) + α) to the total cost.
   After pruning in t, node t has become a leaf and contributes C_α({t}) = R(t) + α.

19. Finding the Tk and corresponding αk
   T_t: branch of T with root node t. After pruning in t, its contribution to total cost is:
       C_α({t}) = R(t) + α
   The contribution of T_t to the total cost is:
       C_α(T_t) = Σ_{t′ ∈ T̃_t} (R(t′) + α) = R(T_t) + α|T̃_t|
   T − T_t becomes better than T when C_α({t}) = C_α(T_t).

20. Computing contributions to total cost of T
   [Figure: example tree with class counts (n = 200): t1 = (100, 100); t1 splits into t2 = (90, 60) and t3 = (10, 40); t2 into t4 = (80, 0) and t5 = (10, 60); t5 into t8 = (10, 0) and t9 = (0, 60); t3 into t6 = (0, 40) and t7 = (10, 0).]
       C_α({t2}) = R(t2) + α = 3/10 + α
       C_α(T_t2) = R(T_t2) + α|T̃_t2| = Σ_{t′ ∈ T̃_t2} R(t′) + α|T̃_t2| = 0 + 3α

21. Solving for α
   The total costs of T and T − T_t become equal when C_α({t}) = C_α(T_t). At what value of α does this happen?
       R(t) + α = R(T_t) + α|T̃_t|
   Solving for α we get
       α = (R(t) − R(T_t)) / (|T̃_t| − 1)
   Note: for this value of α the total cost of T and T − T_t is the same, but T − T_t is preferred because we want the smallest minimizing subtree.
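A quick symbolic check of this derivation (using sympy; the symbol names R_t, R_Tt and L for |T̃_t| are mine):

```python
import sympy as sp

alpha, R_t, R_Tt, L = sp.symbols('alpha R_t R_Tt L')

# Set the leaf contribution R(t) + alpha equal to the branch
# contribution R(T_t) + alpha*|leaves(T_t)| and solve for alpha.
solution = sp.solve(sp.Eq(R_t + alpha, R_Tt + alpha * L), alpha)[0]
print(sp.simplify(solution - (R_t - R_Tt) / (L - 1)))  # 0: matches the slide
```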

22. Computing g(t): the “critical” α value for node t
   For each non-terminal node t we compute its “critical” alpha value:
       g(t) = (R(t) − R(T_t)) / (|T̃_t| − 1)
   In words:
       g(t) = (increase in error due to pruning in t) / (decrease in number of leaf nodes due to pruning in t)
   Subsequently, we prune in the nodes for which g(t) is the smallest (the “weakest links”). This process is repeated until we reach the root node.
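Here is a sketch of one weakest-link pruning step on a simple nested-dict tree representation (the data structure and function names are my own, not from the slides): each node stores 'R', its resubstitution error when it predicts the majority class, and internal nodes additionally store 'children'.

```python
def leaves(t):
    """Leaf nodes of the branch rooted at t."""
    if 'children' not in t:
        return [t]
    return [leaf for child in t['children'] for leaf in leaves(child)]

def internal_nodes(t):
    """All non-terminal nodes of the branch rooted at t."""
    if 'children' not in t:
        return []
    return [t] + [n for child in t['children'] for n in internal_nodes(child)]

def g(t):
    """Critical alpha of node t: (R(t) - R(T_t)) / (|leaves(T_t)| - 1)."""
    lvs = leaves(t)
    return (t['R'] - sum(leaf['R'] for leaf in lvs)) / (len(lvs) - 1)

def prune_weakest_links(root):
    """Turn all nodes with minimal g into leaves; return that minimal g."""
    nodes = internal_nodes(root)
    g_values = [g(t) for t in nodes]   # compute every g before pruning
    alpha = min(g_values)
    for t, g_t in zip(nodes, g_values):
        if g_t == alpha and 'children' in t:
            del t['children']          # prune in t: t becomes a leaf
    return alpha
```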

23. Computing g(t): the “critical” α value for node t
   [Figure: the example tree with class counts as on slide 20, annotated with the critical values.]
   g(t1) = 1/8, g(t2) = 3/20, g(t3) = 1/20, g(t5) = 1/20.

24. Computing g(t): the “critical” α value for node t
   Calculation examples:
       g(t1) = (R(t1) − R(T_t1)) / (|T̃_t1| − 1) = (1/2 − 0) / (5 − 1) = 1/8
       g(t2) = (R(t2) − R(T_t2)) / (|T̃_t2| − 1) = (3/10 − 0) / (3 − 1) = 3/20
       g(t3) = (R(t3) − R(T_t3)) / (|T̃_t3| − 1) = (1/20 − 0) / (2 − 1) = 1/20
       g(t5) = (R(t5) − R(T_t5)) / (|T̃_t5| − 1) = (1/20 − 0) / (2 − 1) = 1/20
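Using the sketch from slide 22 (with the class counts from slide 20, n = 200), these values can be reproduced:

```python
# Example tree from the slides; 'R' is errors/200 under majority-class
# prediction. Assumes leaves() and g() from the earlier sketch.
t8, t9 = {'R': 0.0}, {'R': 0.0}
t4, t6, t7 = {'R': 0.0}, {'R': 0.0}, {'R': 0.0}  # pure leaves
t5 = {'R': 10 / 200, 'children': [t8, t9]}   # counts (10, 60): 10 errors
t2 = {'R': 60 / 200, 'children': [t4, t5]}   # counts (90, 60): 60 errors
t3 = {'R': 10 / 200, 'children': [t6, t7]}   # counts (10, 40): 10 errors
t1 = {'R': 100 / 200, 'children': [t2, t3]}  # counts (100, 100)

for name, node in [('t1', t1), ('t2', t2), ('t3', t3), ('t5', t5)]:
    print(name, g(node))  # 0.125, 0.15, 0.05, 0.05: i.e. 1/8, 3/20, 1/20, 1/20
```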

25. Finding the weakest links
   [Figure: the example tree with class counts as on slide 20; t3 and t5 attain the smallest critical value and are the weakest links.]
   g(t1) = 1/8, g(t2) = 3/20, g(t3) = 1/20, g(t5) = 1/20.

26. Pruning in the weakest links
   [Figure: the tree after pruning in t3 and t5: t1 = (100, 100) with children t2 = (90, 60) and t3 = (10, 40); t2 with children t4 = (80, 0) and t5 = (10, 60); t3 and t5 are now leaves.]
   By pruning the weakest links we obtain the next tree in the sequence.
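Iterating the pruning step until only the root remains produces the nested sequence of slide 16. Continuing with the example tree and the hypothetical sketch from slide 22:

```python
# Repeatedly prune the weakest links in the example tree t1 defined above,
# collecting the critical alpha values along the way.
alphas = []
while 'children' in t1:
    alphas.append(prune_weakest_links(t1))
print(alphas)  # [0.05, 0.2]: first t3 and t5 are pruned, then the root;
               # after the first step g(t1) = 0.2 < g(t2) = 0.25, so t2
               # is never the weakest link on its own.
```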
