
Decision Tree - CE-717: Machine Learning, Sharif University of Technology (PowerPoint presentation)



  1. Decision Tree, CE-717: Machine Learning, Sharif University of Technology, M. Soleymani, Fall 2019

  2. Decision tree
     - One of the most intuitive classifiers: easy to understand and construct.
     - However, it also works very (very) well in practice.
     - Categorical features are preferred; if feature values are continuous, they are discretized first.
     - Application: database mining.

  3. Example
     - Attributes:
       - A: age > 40
       - C: chest pain
       - S: smoking
       - P: physical test
     - Label: heart disease (+), no heart disease (-)
     [Figure: an example decision tree over these attributes, with internal nodes testing C, P, S, and A via Yes/No branches and leaves labeled + or -]

  4. Decision tree: structure
     - Leaves (terminal nodes) represent the target variable: each leaf represents a class label.
     - Each internal node denotes a test on an attribute.
     - Edges to children correspond to the possible values of that attribute.

  5. [figure-only slide: no text content]

  6. Decision tree: learning
     - Decision tree learning: construction of a decision tree from training samples.
     - Decision trees used in data mining are usually classification trees.
     - There are many specific decision-tree learning algorithms, such as:
       - ID3
       - C4.5
     - Approximates functions over (usually) discrete domains.
     - The learned function is represented by a decision tree.

  7. Decision tree learning
     - Learning an optimal decision tree is NP-complete.
     - Instead, we use a greedy search based on a heuristic; we cannot guarantee returning the globally optimal decision tree.
     - The most common strategy for DT learning is a greedy top-down approach that chooses, at each step, the variable that best splits the set of items.
     - The tree is constructed by recursively splitting the samples into subsets based on attribute-value tests.

  8. How to construct a basic decision tree?
     - We prefer decisions leading to a simple, compact tree with few nodes.
     - Which attribute should be at the root?
       - Measure: how well an attribute splits the set into homogeneous subsets (having the same value of the target), i.e., the homogeneity of the target variable within the subsets.
     - How are descendants formed?
       - A descendant is created for each possible value of the chosen attribute A.
       - Training examples are sorted to the descendant nodes.

  9. Constructing a decision tree
     Top-down, greedy, no backtracking.

     Function FindTree(S, A)        // S: samples, A: attributes
       If empty(A) or all labels of the samples in S are the same
         status = leaf
         class = most common class in the labels of S
       else
         status = internal
         a <- bestAttribute(S, A)
         LeftNode  = FindTree(S(a=1), A \ {a})   // recursive calls create the subtrees
         RightNode = FindTree(S(a=0), A \ {a})   // S(a=1): the samples in S for which a=1
       end
     end

  10. Constructing a decision tree
     Top-down, greedy, no backtracking.

     Function FindTree(S, A)        // S: samples, A: attributes
       If empty(A) or all labels of the samples in S are the same
         status = leaf
         class = most common class in the labels of S
       else
         status = internal
         a <- bestAttribute(S, A)
         LeftNode  = FindTree(S(a=1), A \ {a})
         RightNode = FindTree(S(a=0), A \ {a})
       end
     end

     - The tree is constructed by recursively splitting the samples into subsets based on attribute-value tests.
     - The recursion is completed when all members of the subset at a node have the same label, or when splitting no longer adds value to the predictions.
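
     A minimal Python sketch of this recursive procedure, assuming binary (0/1) attribute values; the sample representation and the externally supplied best_attribute scoring function are illustrative choices, not part of the slides:

       from collections import Counter

       def find_tree(samples, labels, attributes, best_attribute):
           """samples: list of dicts {attribute: 0 or 1}; labels: parallel list of classes."""
           # Leaf: no attributes left, or all labels identical.
           if not attributes or len(set(labels)) == 1:
               return {"status": "leaf", "class": Counter(labels).most_common(1)[0][0]}

           a = best_attribute(samples, labels, attributes)   # greedy choice, no backtracking
           remaining = [attr for attr in attributes if attr != a]

           # Split S into S(a=1) and S(a=0) and recurse on each subset.
           node = {"status": "internal", "test": a, "children": {}}
           for value in (0, 1):
               subset = [(s, y) for s, y in zip(samples, labels) if s[a] == value]
               if not subset:                                # no samples reach this branch
                   node["children"][value] = {"status": "leaf",
                                              "class": Counter(labels).most_common(1)[0][0]}
               else:
                   sub_s, sub_y = zip(*subset)
                   node["children"][value] = find_tree(list(sub_s), list(sub_y),
                                                       remaining, best_attribute)
           return node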

  11. ID3
     ID3(Examples, Target_Attribute, Attributes)
       Create a root node for the tree.
       If all examples are positive, return the single-node tree Root with label = +.
       If all examples are negative, return the single-node tree Root with label = -.
       If the set of predicting attributes is empty, return Root with label = most common value of the target attribute in the examples.
       else
         A = the attribute that best classifies the examples.
         The testing attribute for Root = A.
         For each possible value v_i of A:
           Add a new tree branch below Root, corresponding to the test A = v_i.
           Let Examples(v_i) be the subset of examples that have value v_i for A.
           If Examples(v_i) is empty, then below this new branch add a leaf node with label = most common target value in the examples.
           else below this new branch add the subtree ID3(Examples(v_i), Target_Attribute, Attributes - {A}).
       Return Root.
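
     For completeness, a rough Python rendering of this multiway version; the `values` table and the `best_attribute` scoring function are assumed helpers, not part of the original pseudocode:

       from collections import Counter

       def id3(examples, target, attributes, values, best_attribute):
           """examples: list of dicts; values: dict mapping each attribute to its possible values."""
           labels = [e[target] for e in examples]
           if len(set(labels)) == 1:                      # all examples share one label
               return labels[0]
           if not attributes:                             # no predicting attributes left
               return Counter(labels).most_common(1)[0][0]

           a = best_attribute(examples, target, attributes)
           root = {"test": a, "branches": {}}
           for v in values[a]:                            # one branch per possible value of A
               subset = [e for e in examples if e[a] == v]
               if not subset:                             # empty branch: most common label
                   root["branches"][v] = Counter(labels).most_common(1)[0][0]
               else:
                   root["branches"][v] = id3(subset, target,
                                             [x for x in attributes if x != a],
                                             values, best_attribute)
           return root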

  12. Which attribute is the best?

  13. Which attribute is the best?
     - There is a variety of heuristics for picking a good test:
       - Information gain: originated with ID3 (Quinlan, 1979).
       - Gini impurity
       - ...
     - These metrics are applied to each candidate subset, and the resulting values are combined (e.g., averaged) to provide a measure of the quality of the split.

  14. Entropy

     H(X) = -\sum_{x_i \in X} P(x_i) \log P(x_i)

     - Entropy measures the uncertainty in a specific distribution.
     - Information theory view:
       - H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code).
       - The most efficient code assigns -\log P(X = i) bits to encode X = i.
       - Therefore the expected number of bits to code one random X is H(X).
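
     A small sketch of this computation in Python (the distributions below are arbitrary illustrations):

       import math

       def entropy(probs):
           # H(X) = -sum_x P(x) * log2 P(x); terms with P(x) = 0 contribute 0.
           return -sum(p * math.log2(p) for p in probs if p > 0)

       print(entropy([0.5, 0.5]))         # 1.0 bit  (maximum uncertainty for a Boolean variable)
       print(entropy([1.0, 0.0]))         # 0.0 bits (no uncertainty)
       print(entropy([0.25, 0.25, 0.5]))  # 1.5 bits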

  15. Entropy for a Boolean variable
     Entropy as a measure of impurity.
     [Figure: H(X) plotted against P(X = 1); the entropy peaks at 1 when P(X = 1) = 0.5]

     H(X) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1
     H(X) = -1 \log_2 1 - 0 \log_2 0 = 0

  16. Information Gain (IG)

     Gain(S, A) \equiv H_S(Y) - \sum_{v \in Values(A)} (|S_v| / |S|) H_{S_v}(Y)

     - A: the attribute (variable) used to split the samples
     - Y: the target variable
     - S: the samples
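
     A sketch of computing Gain(S, A) from a list of attribute values and a parallel list of labels; the toy data at the end is made up to show the two extremes (a perfectly informative attribute and an uninformative one):

       import math
       from collections import Counter

       def entropy_of_labels(labels):
           n = len(labels)
           return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

       def information_gain(attr_values, labels):
           # Gain(S, A) = H_S(Y) - sum_v (|S_v| / |S|) * H_{S_v}(Y)
           n = len(labels)
           gain = entropy_of_labels(labels)
           for v in set(attr_values):
               subset = [y for a, y in zip(attr_values, labels) if a == v]
               gain -= (len(subset) / n) * entropy_of_labels(subset)
           return gain

       labels = ['+', '+', '-', '-']
       print(information_gain(['a', 'a', 'b', 'b'], labels))  # 1.0
       print(information_gain(['a', 'b', 'a', 'b'], labels))  # 0.0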

  17. Information Gain: Example

  18. Mutual Information
     - The expected reduction in the entropy of Y caused by knowing X:

       I(X, Y) = H(Y) - H(Y | X) = -\sum_i \sum_j P(X = i, Y = j) \log [ P(X = i) P(Y = j) / P(X = i, Y = j) ]

     - Mutual information in a decision tree:
       - H(Y): entropy of Y (i.e., of the labels) before splitting the samples.
       - H(Y | X): entropy of Y after splitting the samples based on attribute X.
         - It is the expectation of the label entropy over the different splits (where the splits are formed based on the value of attribute X).
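
     The same quantity can be computed directly from a joint distribution; a sketch with two made-up 2x2 joint tables P(X, Y), one independent and one perfectly dependent:

       import math

       def mutual_information(joint):
           # I(X;Y) = sum_{i,j} P(x_i, y_j) * log2( P(x_i, y_j) / (P(x_i) P(y_j)) )
           px = [sum(row) for row in joint]
           py = [sum(col) for col in zip(*joint)]
           return sum(p * math.log2(p / (px[i] * py[j]))
                      for i, row in enumerate(joint)
                      for j, p in enumerate(row) if p > 0)

       print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0 (independent)
       print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0 = H(Y) (fully dependent)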

  19. Conditional entropy

     H(Y | X) = -\sum_i \sum_j P(X = i, Y = j) \log P(Y = j | X = i)

     H(Y | X) = \sum_i P(X = i) [ -\sum_j P(Y = j | X = i) \log P(Y = j | X = i) ]

     - P(X = i): probability of following the i-th value of X
     - The bracketed term: entropy of Y for the samples with X = i

  20. Conditional entropy: example

     H(Y | Humidity) = (7/14) H(Y | Humidity = High) + (7/14) H(Y | Humidity = Normal)

     H(Y | Wind) = (8/14) H(Y | Wind = Weak) + (6/14) H(Y | Wind = Strong)
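
     The slide appears to use the classic 14-sample PlayTennis data; assuming the standard per-branch label counts from that dataset (an assumption, since they are not in the text), the two conditional entropies can be checked as follows:

       import math

       def H(pos, neg):
           # Entropy of a Boolean label with `pos` positive and `neg` negative samples.
           total = pos + neg
           return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

       # Assumed PlayTennis counts: Humidity=High -> 3+/4-, Humidity=Normal -> 6+/1-
       h_humidity = (7/14) * H(3, 4) + (7/14) * H(6, 1)
       # Assumed PlayTennis counts: Wind=Weak -> 6+/2-, Wind=Strong -> 3+/3-
       h_wind = (8/14) * H(6, 2) + (6/14) * H(3, 3)

       print(round(h_humidity, 3))  # ~0.789
       print(round(h_wind, 3))      # ~0.892 -> Humidity yields the larger information gain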

  21. How to find the best attribute?
     - Use information gain as the criterion for a good split: the attribute that maximizes information gain is selected.
     - When a set S of samples has been sorted to a node, choose the i-th attribute for the test at this node, where:

       i = argmax_{i \in remaining attrs.} Gain(S, X_i)
         = argmax_{i \in remaining attrs.} [ H_S(Y) - H_S(Y | X_i) ]
         = argmin_{i \in remaining attrs.} H_S(Y | X_i)
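
     A sketch of this selection rule; since H_S(Y) does not depend on the attribute, maximizing the gain is the same as minimizing the conditional entropy (function and variable names are illustrative):

       import math
       from collections import Counter

       def label_entropy(labels):
           n = len(labels)
           return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

       def best_attribute(samples, labels, remaining_attrs):
           # argmax of Gain(S, X_i) == argmin of H_S(Y | X_i), because H_S(Y) is fixed.
           def cond_entropy(a):
               n = len(labels)
               return sum((len(sub) / n) * label_entropy(sub)
                          for v in {s[a] for s in samples}
                          for sub in [[y for s, y in zip(samples, labels) if s[a] == v]])
           return min(remaining_attrs, key=cond_entropy)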

  22. Information Gain: Example

  23. ID3 algorithm: properties
     - The algorithm either reaches homogeneous nodes or runs out of attributes.
     - It is guaranteed to find a tree consistent with any conflict-free training set
       (conflict-free: identical feature vectors are always assigned the same class).
       - ID3's hypothesis space of all DTs contains all discrete-valued functions.
     - But it does not necessarily find the simplest tree (the one containing the minimum number of nodes):
       it is a greedy algorithm making locally optimal decisions at each node (no backtracking).

  24. Decision tree learning: function approximation problem
     - Problem setting:
       - Set of possible instances X
       - Unknown target function f: X → Y (Y is discrete-valued)
       - Set of function hypotheses H = { h | h: X → Y }
         - h is a DT, where the tree sorts each x to a leaf, which assigns a label y
     - Input: training examples {(x^(i), y^(i))} of the unknown target function f
     - Output: hypothesis h in H that best approximates the target function f

  25. Decision tree hypothesis space
     - Suppose the attributes are Boolean.
     - A decision tree expresses a disjunction of conjunctions.
     - Which trees represent the following functions?
       - y = x_1 AND x_3
       - y = x_1 OR x_4
       - y = (x_1 AND x_3) OR (x_2 AND NOT x_4)
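
     As an illustration for the first function, a conjunction needs a tree that tests the variables in sequence; a small sketch (the nested-dict tree representation is an arbitrary choice) checks such a tree against y = x1 AND x3:

       # A tree is nested tests: test x1 first; only if x1 = 1 does x3 matter.
       tree_for_and = {"test": "x1",
                       0: "-",                                # x1 = 0 -> y is false
                       1: {"test": "x3", 0: "-", 1: "+"}}     # x1 = 1 -> outcome decided by x3

       def classify(tree, x):
           while isinstance(tree, dict):
               tree = tree[x[tree["test"]]]
           return tree

       # Check against the Boolean formula on all four input combinations.
       for x1 in (0, 1):
           for x3 in (0, 1):
               x = {"x1": x1, "x3": x3}
               assert classify(tree_for_and, x) == ("+" if (x1 and x3) else "-")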

  26. Decision tree as a rule base
     - A decision tree is equivalent to a set of rules: disjunctions of conjunctions over the attribute values.
     - Each path from the root to a leaf corresponds to a conjunction of attribute tests.
     - All of the leaves with y = i are combined to form the rule for y = i.
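
     A sketch of reading the rules off such a tree: each root-to-leaf path becomes a conjunction of tests, and paths ending in the same label are OR-ed together (tree format as in the previous sketch; names illustrative):

       def extract_rules(tree, path=()):
           # Yield (conjunction-of-tests, label) pairs, one per root-to-leaf path.
           if not isinstance(tree, dict):          # leaf: emit the accumulated conjunction
               yield path, tree
               return
           for value, subtree in tree.items():
               if value == "test":
                   continue
               yield from extract_rules(subtree, path + ((tree["test"], value),))

       tree = {"test": "x1", 0: "-", 1: {"test": "x3", 0: "-", 1: "+"}}
       for conj, label in extract_rules(tree):
           print(" AND ".join(f"{a}={v}" for a, v in conj) or "(always)", "->", label)
       # x1=0 -> -
       # x1=1 AND x3=0 -> -
       # x1=1 AND x3=1 -> +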

  27. How is the instance space partitioned?
     - A decision tree partitions the instance space into axis-parallel regions, each labeled with a class value. [Duda & Hart's book]

  28. ID3 as a search in the space of trees
     - ID3 performs a heuristic search through the space of decision trees.
     - It carries out a simple-to-complex hill-climbing search, beginning with the empty tree.
     - It prefers simpler hypotheses because it uses IG as the measure for selecting attribute tests:
       IG gives a bias toward trees of minimal size.
     - ID3 implements a search (preference) bias rather than a restriction bias.

