Supervised Learning via Decision Trees
Lecture 9

Wentworth Institute of Technology
COMP3770 – Artificial Intelligence | Spring 2016 | Derbinsky
March 22, 2016


  1. [Title slide] Supervised Learning via Decision Trees – Lecture 9 – March 22, 2016

  2. Outline
     1. Learning via feature splits
     2. ID3
        – Information gain
     3. Extensions
        – Continuous features
        – Gain ratio
        – Ensemble learning

  3. Decision Trees
     • A sequence of decisions at choice nodes, from the root to a leaf node
       – Each choice node splits on a single feature
     • Can be used for classification or regression
     • Explicit and easy for humans to understand
     • Typically very fast at testing/prediction time
     https://en.wikipedia.org/wiki/Decision_tree_learning
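
To make the classification case concrete, here is a minimal sketch using scikit-learn (my own illustration, not from the slides; the toy data and integer encoding are invented for the example):

    # Minimal decision-tree sketch with scikit-learn (illustrative only;
    # the toy data and encoding below are assumptions, not the slides' data).
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Toy weather-style data: [outlook, humidity] encoded as integers.
    X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
    y = [0, 0, 1, 1, 1, 0]  # 0 = don't play, 1 = play

    clf = DecisionTreeClassifier(criterion="entropy")  # entropy-based splits
    clf.fit(X, y)

    print(export_text(clf, feature_names=["outlook", "humidity"]))
    print(clf.predict([[1, 1]]))  # prediction = one fast root-to-leaf walk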

  4. Input Data (Weather)
     [Figure: table of the weather training examples]

  5. Output Tree (Weather)
     [Figure: decision tree learned from the weather data]

  6. Training Issues
     • Approximation
       – Optimal tree-building is NP-complete
       – Typically greedy, top-down
     • Under-/over-fitting
       – Occam’s Razor vs. CC/SSN (an identifier-like feature, e.g. a credit-card or Social Security number, splits the data perfectly but does not generalize)
       – Pruning, ensemble methods
     • Splitting metric
       – Information gain, gain ratio, Gini impurity (a quick sketch of Gini follows below)
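
As a concrete reference for the last bullet, here is a small sketch of Gini impurity, one of the splitting metrics named above (my own illustration, not from the slides):

    from collections import Counter

    def gini_impurity(labels):
        """Gini impurity: 1 - sum_i p_i^2, where p_i is the fraction of
        class i. 0 for a pure node; maximal when classes are evenly mixed."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    print(gini_impurity(["A", "A", "B", "B"]))  # 0.5 (maximally mixed, 2 classes)
    print(gini_impurity(["A", "A", "A", "A"]))  # 0.0 (pure node)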

  7. ID3 (Iterative Dichotomiser 3)
     • Invented by Ross Quinlan in 1986
       – Precursor to C4.5/C5.0
     • Categorical data only (can’t split on numbers)
     • Greedily consumes features
       – Subtrees cannot reconsider previous feature(s) for further splits
       – Typically produces shallow trees

  8. ID3: Algorithm Sketch
     • If all examples “same”, return f(examples)
     • If no more features, return f(examples)
     • A = “best” feature
       – For each distinct value of A: branch = ID3(attributes − {A})

     Classification:                    Regression:
     • “same” = same class              • “same” = std. dev. < ε
     • f(examples) = majority class     • f(examples) = average
     • “best” = information gain        • “best” = std. dev. reduction

     http://www.saedsayad.com/decision_tree_reg.htm
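
The sketch above translates fairly directly into code. Below is my own self-contained Python rendering of the classification variant (names like `id3` and the nested-dict tree encoding are my choices, not the slides'):

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy of a list of class labels, in shannons (base 2)."""
        n = len(labels)
        return sum((c / n) * log2(n / c) for c in Counter(labels).values())

    def id3(examples, features, target):
        """examples: list of dicts; features: feature names still available;
        target: name of the class attribute. Returns a nested-dict tree."""
        labels = [ex[target] for ex in examples]
        # Base cases: all examples "same" (one class), or no features left;
        # f(examples) = majority class.
        if len(set(labels)) == 1 or not features:
            return Counter(labels).most_common(1)[0][0]
        # "best" feature = minimum weighted child entropy
        # (equivalent to maximum information gain).
        def child_entropy(f):
            total = 0.0
            for v in set(ex[f] for ex in examples):
                sub = [ex[target] for ex in examples if ex[f] == v]
                total += len(sub) / len(examples) * entropy(sub)
            return total
        best = min(features, key=child_entropy)
        # Branch on each distinct value of the best feature, consuming it.
        return {best: {
            v: id3([ex for ex in examples if ex[best] == v],
                   [f for f in features if f != best], target)
            for v in set(ex[best] for ex in examples)
        }}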

  9. Shannon Entropy
     • A measure of “impurity” or uncertainty
     • Intuition: the less likely the event, the more information is transmitted

  10. Entropy Range
      [Figure: example distributions ranging from small entropy to large entropy]

  11. Quantifying Entropy
      Entropy is the expected value of information:
      H(X) = E[I(X)]
      Discrete: H(X) = Σ_i P(x_i) · I(x_i)
      Continuous: H(X) = ∫ P(x) · I(x) dx

  12. Intuition for Information
      I(X) = ?
      • Shouldn’t be negative: I(X) ≥ 0
      • Events that always occur communicate no information: I(1) = 0
      • Information from independent events is additive: I(X₁, X₂) = I(X₁) + I(X₂)

  13. Quantifying Information
      I(X) = log_b(1 / P(X)) = −log_b P(X)
      Log base = units: 2 = bit (binary digit), 3 = trit, e = nat

      H(X) = −Σ_i P(x_i) · log_b P(x_i)
      Log base = units: 2 = shannon/bit
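
Both formulas are one-liners in code. A quick sketch (mine, not the slides') that the coin examples on the next slides can be checked against:

    from math import log2

    def information(p):
        """Self-information I(x) = -log2 P(x), in shannons/bits."""
        return -log2(p)

    def entropy(probs):
        """H(X) = sum_i P(x_i) * I(x_i); terms with P = 0 contribute nothing."""
        return sum(p * information(p) for p in probs if p > 0)

    print(information(0.5))     # 1.0 bit (one fair coin flip)
    print(entropy([0.5, 0.5]))  # 1.0 shannon (fair coin, slide 14)
    print(entropy([1.0]))       # 0.0 shannons (double-headed coin, slide 15)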

  14. Example: Fair Coin Toss
      I(heads) = log₂(1/0.5) = log₂ 2 = 1 bit
      I(tails) = log₂(1/0.5) = log₂ 2 = 1 bit
      H(fair toss) = (0.5)(1) + (0.5)(1) = 1 shannon

  15. Example: Double-Headed Coin
      H(double head) = (1) · I(head) = (1) · log₂(1/1) = (1) · (0) = 0 shannons

  16. Exercise: Weighted Coin
      Compute the entropy of a coin that will land on heads about 25% of the time, and tails the remaining 75%.

  17. Answer
      H(weighted toss) = (0.25) · I(heads) + (0.75) · I(tails)
                       = (0.25) · log₂(1/0.25) + (0.75) · log₂(1/0.75)
                       ≈ 0.81 shannons
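
A one-line numeric check of this answer:

    from math import log2
    print(0.25 * log2(1 / 0.25) + 0.75 * log2(1 / 0.75))  # ≈ 0.8113 shannons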

  18. Entropy vs. P
      [Figure: entropy of a binary variable as a function of P, peaking at 1 shannon when P = 0.5]

  19. Exercise
      Calculate the entropy of the following data.
      [Figure: a scatter of 30 points — 16 green circles and 14 purple crosses]

  20. Answer
      H(data) = (16/30) · I(green circle) + (14/30) · I(purple cross)
              = (16/30) · log₂(30/16) + (14/30) · log₂(30/14)
              ≈ 0.99679 shannons
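
Checking this numerically:

    from math import log2
    print(16/30 * log2(30/16) + 14/30 * log2(30/14))  # ≈ 0.99679 shannons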

  21. Bounds on Entropy
      H(X) ≥ 0
      H(X) = 0 ⟺ ∃x ∈ X: P(x) = 1
      H_b(X) ≤ log_b(|X|), where |X| denotes the number of elements in the range of X
      H_b(X) = log_b(|X|) ⟺ X is uniformly distributed over its |X| values
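
These bounds are easy to check numerically; for example, a uniform distribution attains the upper bound (my sketch, not the slides'):

    from math import log2
    probs = [0.25] * 4                       # uniform over |X| = 4 outcomes
    H = sum(p * log2(1 / p) for p in probs)
    print(H, log2(4))                        # 2.0 2.0 — upper bound attained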

  22. Information Gain
      To use entropy as a splitting metric, we consider the information gain of an action: the resulting change in entropy.

      IG(T, a) = H(T) − H(T | a)
               = H(T) − Σ_i (|T_i| / |T|) · H(T_i)

      The second term is the weighted average entropy of the children.
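
In code, the formula is the parent's entropy minus a weighted average of child entropies. A small sketch (my own; it takes lists of labels rather than probabilities):

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy of a list of class labels, in shannons."""
        n = len(labels)
        return sum((c / n) * log2(n / c) for c in Counter(labels).values())

    def information_gain(parent, children):
        """IG(T, a) = H(T) - sum_i |T_i|/|T| * H(T_i).
        parent: all labels; children: the labels partitioned by the split."""
        n = len(parent)
        return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)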

  23. Example Split
      [Figure: the 30-point dataset split into two children]
      Parent: {16/30, 14/30}
      Left child: {4/17, 13/17}
      Right child: {12/13, 1/13}

  24. Example Information Gain
      H₁ = (4/17) · log₂(17/4) + (13/17) · log₂(17/13) ≈ 0.79
      H₂ = (12/13) · log₂(13/12) + (1/13) · log₂(13/1) ≈ 0.39
      IG = H(T) − ((17/30) · H₁ + (13/30) · H₂)
         = 0.99679 − 0.62 ≈ 0.38 shannons
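
Reproducing this with the `information_gain` sketch from slide 22 (16 green / 14 purple in the parent, split 4+13 and 12+1):

    parent = ["green"] * 16 + ["purple"] * 14
    left   = ["green"] * 4  + ["purple"] * 13
    right  = ["green"] * 12 + ["purple"] * 1
    print(information_gain(parent, [left, right]))  # ≈ 0.38 shannons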

  25. Exercise
      Consider the following dataset. Compute the information gain for each of the non-target attributes, and decide which attribute is best to split on.

      X  Y  Z  Class
      1  1  1  A
      1  1  0  A
      0  0  1  B
      1  0  0  B

  26. H(C)
      H(C) = −(0.5) · log₂ 0.5 − (0.5) · log₂ 0.5 = 1 shannon
      (Classes A and B each cover half of the four examples in the table on slide 25.)

  27. IG(C, X)
      H(C | X) = (3/4) · [(2/3) · log₂(3/2) + (1/3) · log₂ 3] + (1/4) · [0] ≈ 0.689 shannons
      IG(C, X) = 1 − 0.689 = 0.311 shannons

  28. IG(C, Y)
      H(C | Y) = (1/2) · [0] + (1/2) · [0] = 0 shannons
      IG(C, Y) = 1 − 0 = 1 shannon

  29. IG(C, Z)
      H(C | Z) = (1/2) · [1] + (1/2) · [1] = 1 shannon
      IG(C, Z) = 1 − 1 = 0 shannons
      Y yields the highest information gain (1 shannon), so it is the best attribute to split on.
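
The whole exercise can be reproduced in a few lines (my own sketch; it redefines the entropy helper so it runs standalone):

    from collections import Counter
    from math import log2

    def H(labels):
        n = len(labels)
        return sum(c / n * log2(n / c) for c in Counter(labels).values())

    rows = [(1, 1, 1, "A"), (1, 1, 0, "A"), (0, 0, 1, "B"), (1, 0, 0, "B")]
    classes = [r[3] for r in rows]
    for i, name in enumerate(["X", "Y", "Z"]):
        # Partition the class labels by the value of attribute i, then apply
        # IG(C, a) = H(C) - sum_v (|C_v| / |C|) * H(C_v).
        groups = {}
        for r in rows:
            groups.setdefault(r[i], []).append(r[3])
        ig = H(classes) - sum(len(g) / len(rows) * H(g) for g in groups.values())
        print(name, round(ig, 3))  # X 0.311, Y 1.0, Z 0.0  → split on Y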
