Decision Trees II
COSC 425: Introduction to Machine Learning Fall 2020 (CRN: 44874)
- Dr. Alex Williams
August 26, 2020

Today's Agenda
We will address:
1. How do you train and test decision trees?
2. …
[Figure: example decision tree over the features isCompilers, isOnline, isMorning?, and isEasy, with yes/no branches leading to Like / Dislike leaves.]
Problem: Asking the right questions.
Instance, Question, Answer, Label
Informative / Uninformative Questions
Trees ↔ If-Then Rules
Plotting trees in 2D space
Suppose we get a new instance: radius = 16, texture = 12. How do we classify it? Procedure:
DecisionTreeTest(tree, testPoint)
  if tree IS leaf(guess) then:
    return guess
  else if tree IS node(f, left, right) then:
    if f IS no in testPoint then:
      return DecisionTreeTest(left, testPoint)
    else:
      return DecisionTreeTest(right, testPoint)
    end if
  end if

Note: Decision tree algorithms are generally variations of this core top-down algorithm.
(See Quinlan's C4.5: Programs for Machine Learning, 1993.)
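As a rough sketch, the same test procedure in Python, assuming the tree is stored as nested dicts (the dict layout and the function name are my assumptions, not from the slides):

    def decision_tree_test(tree, test_point):
        """Classify one example by walking from the root down to a leaf.

        Assumed encoding: a leaf is {"guess": label}; an internal node is
        {"feature": f, "left": subtree, "right": subtree}, where "left" is
        followed when the test point answers "no" (False) for feature f.
        """
        if "guess" in tree:                          # leaf: return its stored guess
            return tree["guess"]
        if not test_point[tree["feature"]]:          # answer is "no"  -> left subtree
            return decision_tree_test(tree["left"], test_point)
        return decision_tree_test(tree["right"], test_point)   # "yes" -> right subtree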
Given a set of training instances (i.e., <xi, yi> pairs), we build a tree. Let's say that this is our current node. 1. We iterate over all the features available at our current node (blue arrow in the slide figure). 2. For each feature, we test how "useful" it would be to split on that feature from the current node.
Given a set of training instances (i.e., <xi, yi> pairs), we build a tree. Let's say that this is our current node.
1. If all instances at the current node share the same class label (yi), create a leaf with that class label and exit.
2. Otherwise, select the most useful feature to test on (as scored in the previous step).
3. Create a child node for each value of the outcome of the selected test from #2.
4. Recurse on each child node, passing it the corresponding subset of the training data.
DecisionTreeTrain(data, remainingFeatures)
  guess ← most frequent answer in data
  if the labels in data are unambiguous then:
    return LEAF(guess)
  else if remainingFeatures is empty then:
    return LEAF(guess)
  else:
    for all f IN remainingFeatures do:
      NO  ← the subset of data on which f = no
      YES ← the subset of data on which f = yes
      score[f] ← # of majority-vote answers in NO + # of majority-vote answers in YES
    end for
    [ … Continued on Next Slide … ]
  end if

Leaf Creation: the two cases that return LEAF(guess). Splitting Criterion: the feature-scoring loop.
DecisionTreeTrain(data, remainingFeatures)
  [ … STEP 1 on Prior Slide … ]
  else:
    for all f IN remainingFeatures do:
      NO  ← the subset of data on which f = no
      YES ← the subset of data on which f = yes
      score[f] ← # of majority-vote answers in NO + # of majority-vote answers in YES
    end for
    f ← the feature with the maximal score[f]
    NO  ← the subset of the data on which f = no
    YES ← the subset of the data on which f = yes
    left  ← DecisionTreeTrain(NO, remainingFeatures \ {f})
    right ← DecisionTreeTrain(YES, remainingFeatures \ {f})
    return NODE(f, left, right)
  end if

Splitting Criterion: the feature-scoring loop. Split Selection: choosing the maximal-score feature and recursing on its two subsets.
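Putting the two slides together, here is a minimal Python sketch of this trainer. The (features, label) data layout, the nested-dict tree, and the names are my own assumptions, not from the slides; it pairs with the earlier test sketch.

    from collections import Counter

    def decision_tree_train(data, remaining_features, parent_guess=None):
        """Top-down training in the spirit of the pseudocode above.

        `data` is assumed to be a list of (features, label) pairs, where
        `features` maps feature names to True ("yes") / False ("no").
        Returns the nested-dict tree used by the earlier test sketch.
        """
        if not data:                                   # degenerate split: fall back
            return {"guess": parent_guess}             # to the parent's majority label

        labels = [y for _, y in data]
        guess = Counter(labels).most_common(1)[0][0]   # most frequent answer in data

        # Leaf creation: labels are unambiguous, or no features remain to split on.
        if len(set(labels)) == 1 or not remaining_features:
            return {"guess": guess}

        # Splitting criterion: score each feature by total majority-vote correctness.
        def score(f):
            no = [y for x, y in data if not x[f]]
            yes = [y for x, y in data if x[f]]
            return (max(Counter(no).values(), default=0)
                    + max(Counter(yes).values(), default=0))

        # Split selection: take the best feature and recurse on both subsets.
        f = max(remaining_features, key=score)
        rest = [g for g in remaining_features if g != f]
        no_data = [(x, y) for x, y in data if not x[f]]
        yes_data = [(x, y) for x, y in data if x[f]]
        return {"feature": f,
                "left": decision_tree_train(no_data, rest, guess),
                "right": decision_tree_train(yes_data, rest, guess)}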
What makes a good test? A "good" test provides information about the class label. Example: say that you're given 40 examples (30 positive & 10 negative), and consider two tests, T1 and T2, that would split the examples as follows. T1: the positives split evenly across the branches, and the negatives split roughly evenly too. T2: all negatives are bucketed into one branch, with some division of the positives.
We prefer attributes that separate the classes. Problem: How can we quantify this?
1. Information Gain → Measures the entropy of a node's information.
Consider three cases: a die, a fair two-sided coin, and a biased coin. Each case yields a different amount of uncertainty in its observed outcome.
Let E be an event that occurs with probability P(E). If we are told that E has occurred, then we have received I(E) = log2(1 / P(E)) bits of information.
Alternative Perspective: Think of information as "surprise" in the outcome. For example, if P(E) = 1, then I(E) = 0.
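A quick Python check of this definition (the helper name is mine, purely illustrative):

    import math

    def information(p):
        """I(E) = log2(1 / P(E)): bits of surprise in observing an event of probability p."""
        return math.log2(1.0 / p)

    print(information(1.0))     # 0.0 bits: a certain event carries no surprise
    print(information(0.5))     # 1.0 bit:  a fair coin flip
    print(information(1 / 26))  # ~4.7 bits: one letter drawn uniformly from a 26-letter alphabet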
An example is the English alphabet. Consider all the letters within it. The lower their probability, the higher their information content / surprise.
Calculating Entropy
Given an information source S which yields k symbols from an alphabet {s1, …, sk} with probabilities {p1, …, pk}, where each yield is independent of the others, H(S) is the entropy of the information source:
H(S) = Σi pi · log2(1 / pi) = − Σi pi · log2(pi), summing over i = 1, …, k.
In other words, entropy is the expected (average) information content of a single symbol yielded by S.
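A small Python sketch of this calculation, echoing the die / fair-coin / biased-coin comparison from earlier (the 0.9 / 0.1 bias is an assumed example value):

    import math

    def entropy(probs):
        """H(S) in bits for a source with symbol probabilities `probs`."""
        return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

    print(entropy([1 / 6] * 6))   # fair six-sided die: ~2.585 bits
    print(entropy([0.5, 0.5]))    # fair coin:           1.0 bit
    print(entropy([0.9, 0.1]))    # biased coin:        ~0.469 bits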
Several ways to think about Entropy:
Let’s now try to classify a sample of the data S using a decision tree. Suppose we have p positive samples and n negative samples. What’s the entropy of the dataset?
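For a two-class sample S with p positive and n negative examples, this is the binary entropy: H(S) = −(p / (p + n)) · log2(p / (p + n)) − (n / (p + n)) · log2(n / (p + n)).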
Example: Say that you’re given 40 examples. (30 Positive & 10 Negative)
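Plugging in p = 30 and n = 10: H(S) = −0.75 · log2(0.75) − 0.25 · log2(0.25) ≈ 0.311 + 0.5 = 0.811 bits.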
Interpreting Entropy:
Entropy = 0 when all members of S belong to the same class. Entropy = 1 when the classes in S are represented equally (i.e., the number of positive examples equals the number of negative examples). Problem: Raw entropy only describes the current node, because child nodes only see smaller subsets of the data.
The conditional entropy H(y | x) is the average specific conditional entropy of y given the values of x: H(y | x) = Σv P(x = v) · H(y | x = v), summing over the values v that x can take. Plain English: when we evaluate a prospective child node, we need to evaluate how the node's information changes probabilistically.
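A hedged Python sketch of this computation for binary (yes/no) features, using the same (features, label) data layout assumed in the training sketch:

    import math
    from collections import Counter

    def label_entropy(labels):
        """Entropy (in bits) of a list of class labels."""
        n = len(labels)
        return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

    def conditional_entropy(data, f):
        """H(y | x_f): weighted average label entropy after splitting on feature f.

        `data` is assumed to be a list of (features, label) pairs, with
        `features` a dict of boolean feature values.
        """
        h = 0.0
        for value in (False, True):
            subset = [y for x, y in data if x[f] == value]
            if subset:
                h += (len(subset) / len(data)) * label_entropy(subset)   # P(x_f = value) * H(y | x_f = value)
        return h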
Returning to the earlier example (40 examples: 30 positive & 10 negative, candidate tests T1 and T2): you split on the feature that gives you the highest information gain, H(S) − H(S | x).
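The slide's exact branch counts for T1 and T2 aren't recoverable here, so the numbers below are illustrative assumptions consistent with the descriptions (30 positive / 10 negative overall); the sketch simply shows how the comparison would be computed:

    import math

    def H(pos, neg):
        """Binary entropy (bits) of a bucket with `pos` positive and `neg` negative examples."""
        total = pos + neg
        return sum((c / total) * math.log2(total / c) for c in (pos, neg) if c > 0)

    def info_gain(pos, neg, buckets):
        """H(S) - H(S | test); `buckets` lists the (pos, neg) counts in each branch."""
        n = pos + neg
        return H(pos, neg) - sum(((p + q) / n) * H(p, q) for p, q in buckets)

    # T1 (assumed counts): positives and negatives both split roughly evenly.
    print(info_gain(30, 10, [(15, 5), (15, 5)]))   # ~0.00 bits gained
    # T2 (assumed counts): all negatives land in one branch.
    print(info_gain(30, 10, [(20, 0), (10, 10)]))  # ~0.31 bits gained -> prefer T2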
1. Information Gain → Measures the entropy of a node's information. 2. Gini Impurity → Measures the "impurity" of a node's information. Note: A further measure is variance (continuous targets only).
Given an information source S which yields k symbols from an alphabet {s1, …, sk} with probabilities {p1, …, pk}, where each yield is independent of the others, the Gini impurity is:
Gini(S) = 1 − Σi pi², summing over i = 1, …, k.
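A one-line Python version of this measure, with a few example values (the 0.75 / 0.25 case is the running 30-positive / 10-negative example):

    def gini(probs):
        """Gini impurity of a node with class probabilities `probs`."""
        return 1.0 - sum(p * p for p in probs)

    print(gini([0.5, 0.5]))    # 0.5  : maximally mixed two-class node
    print(gini([1.0, 0.0]))    # 0.0  : pure node
    print(gini([0.75, 0.25]))  # 0.375: the 30-positive / 10-negative example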
[Figure: a sequence of candidate trees in the hypothesis space — Tree H1 (we start here), Tree H2, …, Tree H9, Tree H10.]
Definition: Given a hypothesis space H, a hypothesis h in H is said to overfit the training data if there exists some alternative hypothesis h' in H such that h has a smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.
In Plain English: the learning algorithm has fit its training data too well. Consider the following: a new racecar driver spends a year learning to race with the same race car each session. On their first race day, it rains. Further, they discover they've entered a motorcycle race, not a racecar race.
General Idea: Remove nodes from your tree to generalize better.
Pre-Pruning: Stop growing the tree once further splits no longer improve information gain on the dataset you're using for validation.
Post-Pruning: Grow the full tree, then remove nodes that do not improve information gain on the dataset you're using for validation.
Preferred Solution: Post-pruning is generally recognized as more beneficial, as it allows you to account for settings where combinations of features are useful (instead of evaluating each feature in isolation).
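As a sketch only: one common realization of post-pruning is reduced-error pruning, shown below using validation accuracy (a stand-in for the slide's information-gain criterion) and the nested-dict tree from the earlier sketches; the names and traversal are my own assumptions.

    from collections import Counter

    def predict(tree, x):
        """Walk the nested-dict tree (from the earlier sketches) down to a leaf label."""
        while "guess" not in tree:
            tree = tree["right"] if x[tree["feature"]] else tree["left"]
        return tree["guess"]

    def accuracy(tree, data):
        return sum(predict(tree, x) == y for x, y in data) / len(data)

    def leaf_labels(tree):
        """All leaf guesses under `tree`, used to pick a majority-label replacement."""
        if "guess" in tree:
            return [tree["guess"]]
        return leaf_labels(tree["left"]) + leaf_labels(tree["right"])

    def post_prune(root, node, val_data):
        """Bottom-up reduced-error pruning: collapse an internal node into a leaf
        whenever doing so does not reduce accuracy on the validation data."""
        if "guess" in node:
            return
        post_prune(root, node["left"], val_data)
        post_prune(root, node["right"], val_data)
        before = accuracy(root, val_data)
        saved = dict(node)                               # remember the subtree
        node.clear()
        node["guess"] = Counter(leaf_labels(saved)).most_common(1)[0][0]
        if accuracy(root, val_data) < before:            # pruning hurt: put it back
            node.clear()
            node.update(saved)

    # Usage (hypothetical): post_prune(tree, tree, validation_data)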