Decision Tree, CE-717: Machine Learning, Sharif University of Technology (PowerPoint presentation)


SLIDE 1

Decision Tree

CE-717 : Machine Learning

Sharif University of Technology

M. Soleymani

Fall 2019

SLIDE 2

Decision tree

• One of the most intuitive classifiers that is easy to understand and construct
• However, it also works very (very) well
• Categorical features are preferred; if feature values are continuous, they are discretized first
• Application: database mining

SLIDE 3

Example

• Attributes:
  • A: age > 40
  • C: chest pain
  • S: smoking
  • P: physical test
• Label:
  • Heart disease (+), No heart disease (−)

[Figure: example decision tree over the attributes A, C, S, P, with Yes/No branches and +/− leaves]

SLIDE 4

Decision tree: structure

• Leaves (terminal nodes) represent the target variable
  • Each leaf represents a class label
• Each internal node denotes a test on an attribute
  • Edges to children for each of the possible values of that attribute

SLIDE 5

SLIDE 6

Decision tree: learning

• Decision tree learning: construction of a decision tree from training samples
• Decision trees used in data mining are usually classification trees
• There are many specific decision-tree learning algorithms, such as:
  • ID3
  • C4.5
• Approximates functions over a (usually discrete) domain
• The learned function is represented by a decision tree

SLIDE 7

Decision tree learning

• Learning an optimal decision tree is NP-complete
  • Instead, we use a greedy search based on a heuristic
  • We cannot guarantee returning the globally optimal decision tree
• The most common strategy for DT learning is a greedy top-down approach
  • It chooses the variable at each step that best splits the set of items
• The tree is constructed by recursively splitting samples into subsets based on an attribute value test

SLIDE 8

How to construct a basic decision tree?

• We prefer decisions leading to a simple, compact tree with few nodes
• Which attribute at the root?
  • Measure: how well the attribute splits the set into homogeneous subsets (having the same value of the target)
  • Homogeneity of the target variable within the subsets
• How to form descendants?
  • A descendant is created for each possible value of A
  • Training examples are sorted to the descendant nodes

SLIDE 9

Constructing a decision tree

Function FindTree(S, A)                  (S: samples, A: attributes)
  If empty(A) or all labels of the samples in S are the same
    status = leaf
    class = most common class in the labels of S
  else
    status = internal
    a ← bestAttribute(S, A)
    LeftNode = FindTree(S(a=1), A \ {a})
    RightNode = FindTree(S(a=0), A \ {a})
  end
end

Notes: the recursive calls create the left and right subtrees; S(a=1) is the set of samples in S for which a = 1. Top-down, greedy, no backtracking.

SLIDE 10

Constructing a decision tree

Same FindTree pseudocode as Slide 9 (S: samples, A: attributes; top-down, greedy, no backtracking).

• The tree is constructed by recursively splitting samples into subsets based on an attribute value test
• The recursion completes when all members of the subset at a node have the same label,
• or when splitting no longer adds value to the predictions
SLIDE 11

ID3

ID3(Examples, Target_Attribute, Attributes)
  Create a root node Root for the tree
  If all examples are positive, return the single-node tree Root with label = +
  If all examples are negative, return the single-node tree Root with label = −
  If the set of predicting attributes is empty, return Root with label = most common value of the target attribute in the examples
  else
    A ← the attribute that best classifies the examples
    Testing attribute for Root = A
    For each possible value v of A:
      Add a new tree branch below Root, corresponding to the test A = v
      Let Examples(v) be the subset of examples that have value v for A
      If Examples(v) is empty, then below this new branch add a leaf node with label = most common target value in the examples
      else below this new branch add the subtree ID3(Examples(v), Target_Attribute, Attributes − {A})
  Return Root
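The FindTree/ID3 recursion above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' exact algorithm: it assumes samples are (feature-dict, label) pairs, branches on the attribute values actually present in the data, and takes the splitting heuristic as a pluggable `best_attribute` function (all names are illustrative).

```python
from collections import Counter

def find_tree(samples, attributes, best_attribute):
    """Recursively build a decision tree, top-down and greedy (no backtracking).

    samples: list of (features, label) pairs, where features is a dict.
    attributes: set of attribute names still available for testing.
    best_attribute: heuristic choosing which attribute to split on.
    """
    labels = [y for _, y in samples]
    majority = Counter(labels).most_common(1)[0][0]
    # Leaf: no attributes left, or all labels in S are the same
    if not attributes or len(set(labels)) == 1:
        return {"status": "leaf", "class": majority}
    a = best_attribute(samples, attributes)
    children = {}
    for value in {x[a] for x, _ in samples}:   # S(a=v): samples with a = v
        subset = [(x, y) for x, y in samples if x[a] == value]
        children[value] = find_tree(subset, attributes - {a}, best_attribute)
    return {"status": "internal", "test": a, "class": majority, "children": children}
```

Plugging an information-gain heuristic in as `best_attribute` gives an ID3-style learner; the slides' binary version corresponds to attribute values {0, 1}.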
SLIDE 12

Which attribute is the best?

SLIDE 13

Which attribute is the best?

• There is a variety of heuristics for picking a good test:
  • Information gain: originated with ID3 (Quinlan, 1979)
  • Gini impurity
  • …
• These metrics are applied to each candidate subset, and the resulting values are combined (e.g., averaged) to provide a measure of the quality of the split

SLIDE 14

Entropy

H(Y) = − Σ_{y∈Y} P(y) log P(y)

• Entropy measures the uncertainty in a specific distribution
• Information-theoretic view:
  • H(Y): expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code)
  • The most efficient code assigns −log P(Y = y) bits to encode the value Y = y
  • ⇒ the expected number of bits needed to code one random Y is H(Y)
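The definition above translates directly into code. A minimal sketch that estimates H(Y) in bits from a list of observed labels (the frequency-based estimate is an assumption of this example, not stated on the slide):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum_y P(y) * log2 P(y), estimated from label frequencies."""
    n = len(labels)
    # Unobserved categories contribute 0 (lim p->0 of -p log p = 0),
    # so only labels that actually occur appear in the sum.
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())
```

For a Boolean variable this reproduces the endpoints of the curve on the next slide: a pure sample has entropy 0, and a 50/50 sample has entropy 1 bit.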

SLIDE 15

Entropy for a Boolean variable

[Figure: H(Y) as a function of P(Y = 1)]

H(Y) = −1 log2 1 − 0 log2 0 = 0
H(Y) = −0.5 log2 (1/2) − 0.5 log2 (1/2) = 1

Entropy as a measure of impurity
SLIDE 16

Information Gain (IG)

• A: the variable used to split samples
• Y: the target variable
• S: the samples

Gain(S, A) ≡ H_S(Y) − Σ_{v∈Values(A)} (|S_v| / |S|) · H_{S_v}(Y)
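With S the samples, A the splitting attribute, and Y the labels, the gain formula can be sketched as follows (same illustrative (features, label) sample format as the earlier FindTree sketch; the `entropy` helper is repeated so the block is self-contained):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(samples, attribute):
    """Gain(S, A) = H_S(Y) - sum_{v in Values(A)} |S_v|/|S| * H_{S_v}(Y)."""
    n = len(samples)
    gain = entropy([y for _, y in samples])          # H_S(Y)
    for v in {x[attribute] for x, _ in samples}:     # Values(A) seen in S
        s_v = [y for x, y in samples if x[attribute] == v]
        gain -= (len(s_v) / n) * entropy(s_v)        # weighted child entropy
    return gain
```

An attribute that separates the classes perfectly gains the full label entropy; an attribute whose splits leave the label distribution unchanged gains nothing.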

SLIDE 17

Information Gain: Example

SLIDE 18

Mutual Information

• The expected reduction in entropy of Y caused by knowing X:

I(X, Y) = H(Y) − H(Y | X) = − Σ_x Σ_y P(X = x, Y = y) log [ P(X = x) P(Y = y) / P(X = x, Y = y) ]

• Mutual information in decision trees:
  • H(Y): entropy of Y (i.e., the labels) before splitting the samples
  • H(Y | X): entropy of Y after splitting the samples based on attribute X
    • It is the expected label entropy over the different splits (where the splits are formed based on the value of attribute X)

SLIDE 19

Conditional entropy

H(Y | X) = − Σ_x Σ_y P(X = x, Y = y) log P(Y = y | X = x)

H(Y | X) = Σ_x P(X = x) · [ − Σ_y P(Y = y | X = x) log P(Y = y | X = x) ]

Here P(X = x) is the probability of following the branch for value x of X, and the bracketed term is the entropy of Y for the samples with X = x.

SLIDE 20

Conditional entropy: example

• H(Y | Humidity) = (7/14) · H(Y | Humidity = High) + (7/14) · H(Y | Humidity = Normal)
• H(Y | Wind) = (8/14) · H(Y | Wind = Weak) + (6/14) · H(Y | Wind = Strong)
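These fractions match Mitchell's PlayTennis data. Assuming the standard per-value counts from that example (an assumption, since the counts are not in the extracted text: Humidity = High 3+/4−, Normal 6+/1−; Wind = Weak 6+/2−, Strong 3+/3−), the two conditional entropies can be checked numerically:

```python
from math import log2

def h(pos, neg):
    """Entropy (bits) of a node with pos positive and neg negative samples."""
    n = pos + neg
    return -sum((c / n) * log2(c / n) for c in (pos, neg) if c)

# H(Y | Humidity) = 7/14 * H(High: 3+,4-) + 7/14 * H(Normal: 6+,1-)
h_humidity = (7 / 14) * h(3, 4) + (7 / 14) * h(6, 1)   # ~0.789 bits
# H(Y | Wind) = 8/14 * H(Weak: 6+,2-) + 6/14 * H(Strong: 3+,3-)
h_wind = (8 / 14) * h(6, 2) + (6 / 14) * h(3, 3)       # ~0.892 bits
```

Humidity leaves less label entropy than Wind, so it yields the larger information gain (0.940 − 0.789 ≈ 0.151 versus 0.940 − 0.892 ≈ 0.048).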

SLIDE 21

How to find the best attribute?

• Information gain is our criterion for a good split:
  • The attribute that maximizes the information gain is selected
• When a set S of samples has been sorted to a node, choose the k-th attribute for the test at this node, where:

k = argmax_{i ∈ remaining attributes} Gain(S, X_i)
  = argmax_{i ∈ remaining attributes} [ H_S(Y) − H_S(Y | X_i) ]
  = argmin_{i ∈ remaining attributes} H_S(Y | X_i)

SLIDE 22

Information Gain: Example

SLIDE 23

ID3 algorithm: properties

• The algorithm
  • either reaches homogeneous nodes
  • or runs out of attributes
• It is guaranteed to find a tree consistent with any conflict-free training set
  • The ID3 hypothesis space of all DTs contains all discrete-valued functions
  • Conflict-free training set: identical feature vectors are always assigned the same class
• But it does not necessarily find the simplest tree (containing the minimum number of nodes)
  • It is a greedy algorithm making locally optimal decisions at each node (no backtracking)

SLIDE 24

Decision tree learning: function approximation problem

• Problem setting:
  • Set of possible instances X
  • Unknown target function f: X → Y (Y is discrete-valued)
  • Set of function hypotheses H = { h | h : X → Y }
    • h is a DT, where the tree sorts each x to a leaf, which assigns a label y
• Input:
  • Training examples {(x(i), y(i))} of the unknown target function f
• Output:
  • Hypothesis h ∈ H that best approximates the target function f

SLIDE 25

Decision tree hypothesis space

• Suppose the attributes are Boolean
• Disjunctions of conjunctions
• Which trees represent the following functions?
  • y = x1 and x3
  • y = x1 or x4
  • y = (x1 and x3) or (x2 and ¬x4)

SLIDE 26

Decision tree as a rule base

• A decision tree = a set of rules
  • Disjunctions of conjunctions on the attribute values
• Each path from the root to a leaf = a conjunction of attribute tests
• All of the leaves with y = i are considered to find the rule for y = i

SLIDE 27

How is the instance space partitioned?

• Decision tree:
  • Partitions the instance space into axis-parallel regions, each labeled with a class value

[Duda & Hart's book]

SLIDE 28

ID3 as a search in the space of trees

• ID3: a heuristic search through the space of DTs
• Performs a simple-to-complex hill-climbing search (beginning with the empty tree)
• Prefers simpler hypotheses, due to using IG as the measure for selecting attribute tests
  • IG gives a bias toward trees of minimal size
• ID3 implements a search (preference) bias instead of a restriction bias

SLIDE 29

Why prefer short hypotheses?

• Why is the optimal solution the smallest tree?
• There are fewer short hypotheses than long ones
  • A short hypothesis that fits the data is less likely to be a statistical coincidence
• Smaller trees have lower variance

Ockham (1285-1349), Principle of Parsimony: "One should not increase, beyond what is necessary, the number of entities required to explain anything."

SLIDE 30

Over-fitting problem

• ID3 perfectly classifies the training data (for consistent data)
  • It tries to memorize every training sample
  • This gives poor decisions when there is very little data (it may not reflect reliable trends)
• Noise in the training data: the tree fits it erroneously
  • A node that "should" be pure but has a single (or a few) exception(s)?
• Given many (non-relevant) attributes, the algorithm will continue to split nodes
• This leads to over-fitting!

SLIDE 31

Over-fitting problem: an example

• Consider adding a (noisy) training example:

  Outlook = Sunny, Temp = Hot, Humidity = Normal, Wind = Strong, PlayTennis = No

[Figure: the resulting tree adds a further test on Temp under the affected node]

SLIDE 32

Over-fitting in decision tree learning

• Hypothesis space H: decision trees
• Training (empirical) error of h ∈ H: error_train(h)
• Expected error of h ∈ H: error_true(h)
• h overfits the training data if there is an h′ ∈ H such that:
  • error_train(h) < error_train(h′)
  • error_true(h) > error_true(h′)

SLIDE 33

A question?

• How can the tree be made smaller and simpler?
• Early stopping:
  • When should a node be declared a leaf?
  • If a leaf node is impure, how should the category label be assigned?
• Pruning:
  • Build a full tree and then post-process it

SLIDE 34

Avoiding overfitting

1) Stop growing when the data split is not statistically significant.
2) Grow the full tree and then prune it.
   • More successful than stopping early, in practice.
3) How to select the "best" tree:
   • Measure performance over a separate validation set
   • MDL: minimize size(tree) + size(misclassifications(tree))

SLIDE 35

Reduced-error pruning

• Split the data into training and validation sets
• Build the tree using the training set
• Do until further pruning is harmful:
  • Evaluate the impact on the validation set of pruning the sub-tree rooted at each node:
    • Temporarily remove the sub-tree rooted at the node
    • Replace it with a leaf labeled with the current majority class at that node
    • Measure and record the error on the validation set
  • Greedily remove the node whose pruning most improves validation-set accuracy (if any)

This produces the smallest version of the most accurate sub-tree.
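The loop above can be sketched compactly, assuming the nested-dict tree shape from the earlier FindTree sketch (each internal node also stores its majority `class`). This is an illustrative sketch, not the canonical implementation; here a prune that leaves validation accuracy unchanged is also accepted, since it is "not harmful" and yields a smaller tree.

```python
def classify(tree, x):
    """Follow attribute tests from the root down to a leaf."""
    while tree["status"] == "internal":
        tree = tree["children"][x[tree["test"]]]
    return tree["class"]

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def internal_nodes(tree):
    if tree["status"] == "internal":
        yield tree
        for child in tree["children"].values():
            yield from internal_nodes(child)

def reduced_error_prune(tree, val_data):
    """Greedily replace sub-trees by majority-class leaves while validation
    accuracy does not get worse (mutates the tree in place)."""
    while True:
        best, best_acc = None, accuracy(tree, val_data)
        for node in internal_nodes(tree):
            saved = dict(node)                       # temporarily prune ...
            node.clear()
            node.update({"status": "leaf", "class": saved["class"]})
            acc = accuracy(tree, val_data)
            node.clear()
            node.update(saved)                       # ... then restore
            if acc >= best_acc:                      # no worse: prefer smaller tree
                best, best_acc = node, acc
        if best is None:
            return tree
        cls = best["class"]                          # commit the best prune
        best.clear()
        best.update({"status": "leaf", "class": cls})
```

Each pass prunes at most one node, so the loop terminates once no replacement keeps validation accuracy at least as high.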

SLIDE 36

C4.5

• C4.5 is an extension of ID3:
  • Learn the decision tree from the samples (allowing overfitting)
  • Convert the tree into the equivalent set of rules
  • Prune (generalize) each rule by removing any precondition whose removal improves the estimated accuracy
  • Sort the pruned rules by their estimated accuracy
    • Consider them in sequence when classifying new instances
• Why convert the decision tree to rules before pruning?
  • It distinguishes among the different contexts in which a decision node is used
  • It removes the distinction between attribute tests that occur near the root and those that occur near the leaves

SLIDE 37

Continuous attributes

• Tests on continuous variables as Boolean?
  • Either use a threshold to turn them into binary tests, or discretize them first
• It is possible to compute the information gain for all possible thresholds (there is a finite number of training samples)
• Harder if we wish to assign more than two values (but it can be done recursively)
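One common realization of the threshold search, hedged as a sketch: sort the samples by the continuous value, evaluate the information gain of the Boolean test "value ≤ t" at every midpoint between consecutive distinct values, and keep the best (function and variable names are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) maximizing the IG of the test value <= threshold."""
    pairs = sorted(zip(values, labels))
    n, base = len(pairs), entropy(labels)
    best_t, best_gain = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                          # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]      # samples with value <= t
        right = [y for _, y in pairs[i:]]     # samples with value > t
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

Because only finitely many training samples exist, only these midpoints need to be tried, which is exactly the point made on the slide.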

SLIDE 38

Other splitting criteria

• Information gain is biased in favor of attributes with more levels
  • Example: an attribute like Date
• More complex measures can be used to select the attribute
• Gain Ratio:

GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) ≡ − Σ_{v∈Values(A)} (|S_v| / |S|) log (|S_v| / |S|)
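A sketch of the ratio, using the same illustrative (features, label) sample format as the earlier examples. Note how a Date-like attribute with one value per sample gets a large SplitInformation and is therefore penalized relative to a binary attribute with the same gain:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(samples, attribute):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    n = len(samples)
    gain = entropy([y for _, y in samples])
    split_info = 0.0
    for v in {x[attribute] for x, _ in samples}:
        s_v = [y for x, y in samples if x[attribute] == v]
        p = len(s_v) / n
        gain -= p * entropy(s_v)          # weighted child entropy
        split_info -= p * log2(p)         # entropy of the split itself
    return gain / split_info if split_info else 0.0
```

A single-valued attribute has SplitInformation 0 and also Gain 0; returning 0.0 in that degenerate case is a convention of this sketch.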
SLIDE 39

Ranking classifiers

[Rich Caruana & Alexandru Niculescu-Mizil, An Empirical Comparison of Supervised Learning Algorithms, ICML 2006]

The top 8 are all based on various extensions of decision trees.

SLIDE 40

Decision tree advantages

• Simple to understand and interpret
• Requires little data preparation, and can handle both numerical and categorical data
• Time-efficient learning of the classifier
  • Can be used on large datasets
• Robust: performs well even if its assumptions are somewhat violated

SLIDE 41

Reference

• T. Mitchell, "Machine Learning", McGraw-Hill, 1997. [Chapter 3]