 
Wentworth Institute of Technology COMP3770 – Artificial Intelligence | Spring 2016 | Derbinsky

Supervised Learning via Decision Trees
Lecture 9
March 22, 2016

Outline
1. Learning via feature splits
2. ID3
   – Information gain
3. Extensions
   – Continuous features
   – Gain ratio
   – Ensemble learning

Decision Trees
• Sequence of decisions at choice nodes from root to a leaf node
  – Each choice node splits on a single feature
• Can be used for classification or regression
• Explicit, easy for humans to understand
• Typically very fast at testing/prediction time
https://en.wikipedia.org/wiki/Decision_tree_learning

Input Data (Weather)
[Table: weather training data]

Output Tree (Weather)
[Figure: decision tree learned from the weather data]

Training Issues
• Approximation
  – Optimal tree-building is NP-complete
  – Typically greedy, top-down
• Under/over-fitting
  – Occam's Razor vs. CC/SSN
    • Pruning, ensemble methods
• Splitting metric
  – Information gain, gain ratio, Gini impurity

Iterative Dichotomiser 3 (ID3)
• Invented by Ross Quinlan in 1986
  – Precursor to C4.5/C5.0
• Categorical data only (can't split on numbers)
• Greedily consumes features
  – Subtrees cannot reconsider previous feature(s) for further splits
  – Typically produces shallow trees

ID3: Algorithm Sketch
• If all examples "same", return f(examples)
• If no more features, return f(examples)
• A = "best" feature
  – For each distinct value of A
    • branch = ID3(attributes - {A})

Classification (now!)
• "same" = same class
• f(examples) = majority
• "best" = information gain

Regression
• "same" = std. dev. < ε
• f(examples) = average
• "best" = std. dev. reduction
http://www.saedsayad.com/decision_tree_reg.htm

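The algorithm sketch above can be written out for the classification case roughly as follows. This is a minimal Python sketch, not Quinlan's original code: the function names (`entropy`, `information_gain`, `id3`, `predict`), the dict-per-row data layout, and the tuple representation of choice nodes are all assumptions made for illustration. The example data is the four-row X/Y/Z table from the exercise at the end of this lecture.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Parent entropy minus the size-weighted entropy of the children."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    children = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - children

def id3(rows, labels, features):
    """Return a class label (leaf) or a (feature, {value: subtree}) choice node."""
    majority = Counter(labels).most_common(1)[0][0]      # f(examples) = majority
    if len(set(labels)) == 1 or not features:            # all "same", or no more features
        return majority
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]       # attributes - {A}
    branches = {}
    for value in sorted({row[best] for row in rows}):    # each distinct value of A
        sub_rows = [r for r in rows if r[best] == value]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == value]
        branches[value] = id3(sub_rows, sub_labels, remaining)
    return (best, branches)

def predict(tree, row):
    """Follow choice nodes from the root until a leaf (class label) is reached."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[row[feature]]
    return tree

rows = [
    {"X": 1, "Y": 1, "Z": 1},
    {"X": 1, "Y": 1, "Z": 0},
    {"X": 0, "Y": 0, "Z": 1},
    {"X": 1, "Y": 0, "Z": 0},
]
labels = ["A", "A", "B", "B"]
tree = id3(rows, labels, ["X", "Y", "Z"])
print(tree)  # ('Y', {0: 'B', 1: 'A'})
```

The sketch makes no attempt to handle feature values that were never seen during training (`predict` would raise a `KeyError`), and it assumes class labels are not themselves tuples.
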
Shannon Entropy
• Measure of "impurity" or uncertainty
• Intuition: the less likely the event, the more information is transmitted

Entropy Range
[Figure: examples ranging from small entropy to large entropy]

Quantifying Entropy
$$H(X) = E[I(X)]$$
Expected value of information:
• Discrete: $\sum_i P(x_i)\, I(x_i)$
• Continuous: $\int P(x)\, I(x)\, dx$

Intuition for Information
$I(X) = \ldots$
• Shouldn't be negative: $I(X) \ge 0$
• Events that always occur communicate no information: $I(1) = 0$
• Information from independent events is additive: $I(X_1, X_2) = I(X_1) + I(X_2)$

Quantifying Information
$$I(X) = \log_b \frac{1}{P(X)} = -\log_b P(X)$$
Log base = units: 2 = bit (binary digit), 3 = trit, e = nat
$$H(X) = -\sum_i P(x_i) \log_b P(x_i)$$
Log base = units: 2 = shannon/bit

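The two formulas above translate almost directly into code. The following is a small sketch: the function names `information` and `entropy` and the `base` parameter are illustrative choices, and skipping zero-probability outcomes reflects the usual convention that they contribute nothing to the sum.

```python
import math

def information(p, base=2):
    """Self-information I = log_b(1/p) of an event with probability p, 0 < p <= 1."""
    return math.log(1 / p, base)

def entropy(probs, base=2):
    """H = sum_i p_i * log_b(1/p_i); zero-probability outcomes are skipped."""
    return sum(p * information(p, base) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
print(entropy(fair_coin, base=2))       # entropy in shannons (bits)
print(entropy(fair_coin, base=3))       # same distribution, measured in trits
print(entropy(fair_coin, base=math.e))  # same distribution, measured in nats
```

Changing the base only rescales the result by a constant factor, which is why the choice of base amounts to a choice of units.
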
Example: Fair Coin Toss
$$I(\text{heads}) = \log_2 \frac{1}{0.5} = \log_2 2 = 1 \text{ bit}$$
$$I(\text{tails}) = \log_2 \frac{1}{0.5} = \log_2 2 = 1 \text{ bit}$$
$$H(\text{fair toss}) = (0.5)(1) + (0.5)(1) = 1 \text{ shannon}$$

Example: Double Headed Coin
$$H(\text{double head}) = (1) \cdot I(\text{head}) = (1) \cdot \log_2 \frac{1}{1} = (1) \cdot (0) = 0 \text{ shannons}$$

Exercise: Weighted Coin
Compute the entropy of a coin that will land on heads about 25% of the time, and tails the remaining 75%.

Answer
$$H(\text{weighted toss}) = (0.25) \cdot I(\text{heads}) + (0.75) \cdot I(\text{tails})$$
$$= (0.25) \cdot \log_2 \frac{1}{0.25} + (0.75) \cdot \log_2 \frac{1}{0.75} = 0.81 \text{ shannons}$$

Entropy vs. P
[Figure: plot of entropy against probability P]

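For a two-outcome variable such as a coin, the relationship between entropy and P can be tabulated with a few lines of Python. This is a sketch under the assumption that the plot shows the two-outcome case; the function name `binary_entropy` and the sample probabilities are chosen here for illustration.

```python
import math

def binary_entropy(p):
    """Entropy in shannons of a two-outcome variable with P(first outcome) = p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Print a small table of entropy against p
for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    print(f"p = {p:4.2f}  H = {binary_entropy(p):.3f}")
```

The values are symmetric around p = 0.5, where entropy peaks at 1 shannon (the fair coin), and fall to 0 at p = 0 and p = 1 (the double-headed coin). The weighted coin from the exercise sits at p = 0.25.
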
Exercise
Calculate the entropy of the following data
[Figure: 30 data points, 16 green circles and 14 purple crosses]

Answer
$$H(\text{data}) = \frac{16}{30} \cdot I(\text{green circle}) + \frac{14}{30} \cdot I(\text{purple cross})$$
$$= \frac{16}{30} \cdot \log_2 \frac{30}{16} + \frac{14}{30} \cdot \log_2 \frac{30}{14} = 0.99679 \text{ shannons}$$

Bounds on Entropy
$$H(X) \ge 0$$
$$H(X) = 0 \iff \exists x \in \mathcal{X}\ (P(x) = 1)$$
$$H_b(X) \le \log_b(|\mathcal{X}|)$$
$|\mathcal{X}|$ denotes the number of elements in the range of $X$
$$H_b(X) = \log_b(|\mathcal{X}|) \iff X \text{ has a uniform distribution over } \mathcal{X}$$

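One standard way to see the upper bound (a sketch that is not on the slide) is Jensen's inequality applied to the concave function $\log_b$, assuming $b > 1$ and summing only over outcomes with nonzero probability:

```latex
\begin{align*}
H_b(X) &= \sum_{x:\,P(x)>0} P(x)\,\log_b \frac{1}{P(x)} \\
       &\le \log_b\!\Bigl(\sum_{x:\,P(x)>0} P(x)\cdot\frac{1}{P(x)}\Bigr)
          && \text{(Jensen, $\log_b$ concave)} \\
       &= \log_b \bigl|\{x : P(x) > 0\}\bigr| \;\le\; \log_b |\mathcal{X}|
\end{align*}
```

Both inequalities are tight exactly when $1/P(x)$ is the same for every outcome and every outcome has nonzero probability, which is the uniform distribution.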
Information Gain
To use entropy for a splitting metric, we consider the information gain of an action as the resulting change in entropy
$$IG(T, a) = H(T) - H(T \mid a) = H(T) - \sum_i \frac{|T_i|}{|T|} H(T_i)$$
The sum is the average entropy of the children, weighted by child size.

Example Split
[Figure: a parent node with class proportions {16/30, 14/30} is split into two children with class proportions {4/17, 13/17} and {12/13, 1/13}]

Example Information Gain
$$H_1 = \frac{4}{17} \log_2 \frac{17}{4} + \frac{13}{17} \log_2 \frac{17}{13} \approx 0.79$$
$$H_2 = \frac{12}{13} \log_2 \frac{13}{12} + \frac{1}{13} \log_2 \frac{13}{1} \approx 0.39$$
$$IG = H(T) - \left(\frac{17}{30} H_1 + \frac{13}{30} H_2\right) = 0.99679 - 0.62 = 0.38 \text{ shannons}$$

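The arithmetic in this example can be checked with a short script. It is a sketch that works on class counts alone (16/14 in the parent, 4/13 and 12/1 in the children); the helper names `entropy_from_counts` and `information_gain` are made up for this illustration.

```python
import math

def entropy_from_counts(counts):
    """Entropy in shannons of a node, given the number of examples per class."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """H(parent) minus the size-weighted average entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(
        (sum(child) / total) * entropy_from_counts(child)
        for child in children_counts
    )
    return entropy_from_counts(parent_counts) - weighted

parent = [16, 14]               # 16 of one class, 14 of the other
children = [[4, 13], [12, 1]]   # class counts in the two child nodes

h1 = entropy_from_counts(children[0])
h2 = entropy_from_counts(children[1])
print(round(h1, 2), round(h2, 2))                    # 0.79 0.39
print(round(information_gain(parent, children), 2))  # 0.38
```
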
Exercise
Consider the following dataset. Compute the information gain for each of the non-target attributes. Decide which attribute is the best to split on.

X  Y  Z  Class
1  1  1  A
1  1  0  A
0  0  1  B
1  0  0  B

H(C)
$$H(C) = -(0.5) \log_2 0.5 - (0.5) \log_2 0.5 = 1 \text{ shannon}$$

IG(C,X)
$$H(C \mid X) = \frac{3}{4}\left[\frac{2}{3} \log_2 \frac{3}{2} + \frac{1}{3} \log_2 \frac{3}{1}\right] + \frac{1}{4}[0] = 0.689 \text{ shannons}$$
$$IG(C, X) = 1 - 0.689 = 0.311 \text{ shannons}$$

IG(C,Y)
$$H(C \mid Y) = \frac{1}{2}[0] + \frac{1}{2}[0] = 0 \text{ shannons}$$
$$IG(C, Y) = 1 - 0 = 1 \text{ shannon}$$

IG(C,Z)
$$H(C \mid Z) = \frac{1}{2}[1] + \frac{1}{2}[1] = 1 \text{ shannon}$$
$$IG(C, Z) = 1 - 1 = 0 \text{ shannons}$$
Y yields the largest information gain (1 shannon), so it is the best attribute to split on.

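The three information gains for this exercise can be reproduced directly from the table. The script below is a sketch: the column-oriented `data` dict and the helper names `entropy` and `conditional_entropy` are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy in shannons of a list of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())

def conditional_entropy(values, labels):
    """H(C | attribute): size-weighted entropy of the groups induced by the attribute."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

# The exercise table, stored one column per attribute
data = {
    "X": [1, 1, 0, 1],
    "Y": [1, 1, 0, 0],
    "Z": [1, 0, 1, 0],
}
classes = ["A", "A", "B", "B"]

gains = {
    name: entropy(classes) - conditional_entropy(column, classes)
    for name, column in data.items()
}
for name, gain in gains.items():
    print(f"IG(C, {name}) = {gain:.3f}")

best = max(gains, key=gains.get)
print("best attribute:", best)  # best attribute: Y
```
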